| rfc9766v1.txt | rfc9766.txt | |||
|---|---|---|---|---|
| Internet Engineering Task Force (IETF) T. Haynes | Internet Engineering Task Force (IETF) T. Haynes | |||
| Request for Comments: 9766 T. Myklebust | Request for Comments: 9766 T. Myklebust | |||
| Category: Standards Track Hammerspace | Category: Standards Track Hammerspace | |||
| ISSN: 2070-1721 February 2025 | ISSN: 2070-1721 April 2025 | |||
| Addition of LAYOUT_WCC to NFSv4.2's Flexible File Layout Type | Extensions for Weak Cache Consistency in NFSv4.2's Flexible File Layout | |||
| Abstract | Abstract | |||
| This document specifies extensions to the Parallel Network File | This document specifies extensions to NFSv4.2 for improving Weak | |||
| System (NFS) version 4 (pNFS) for improving write cache consistency. | Cache Consistency (WCC). These extensions introduce mechanisms that | |||
| These extensions introduce mechanisms that ensure partial writes | ensure partial writes performed under a Parallel NFS (pNFS) layout | |||
| performed under a pNFS layout remain coherent and correctly tracked. | remain coherent and correctly tracked. The solution addresses | |||
| The solution addresses concurrency and data integrity concerns that | concurrency and data integrity concerns that may arise when multiple | |||
| may arise when multiple clients write to the same file through | clients write to the same file through separate data servers. By | |||
| separate data servers. By defining additional interactions among | defining additional interactions among clients, metadata servers, and | |||
| clients, metadata servers, and data servers, this specification | data servers, this specification enhances the reliability of NFSv4 in | |||
| enhances the reliability of NFSv4 in parallel-access environments and | parallel-access environments and ensures consistency across diverse | |||
| ensures consistency across diverse deployment scenarios. | deployment scenarios. | |||
| Status of This Memo | Status of This Memo | |||
| This is an Internet Standards Track document. | This is an Internet Standards Track document. | |||
| This document is a product of the Internet Engineering Task Force | This document is a product of the Internet Engineering Task Force | |||
| (IETF). It represents the consensus of the IETF community. It has | (IETF). It represents the consensus of the IETF community. It has | |||
| received public review and has been approved for publication by the | received public review and has been approved for publication by the | |||
| Internet Engineering Steering Group (IESG). Further information on | Internet Engineering Steering Group (IESG). Further information on | |||
| Internet Standards is available in Section 2 of RFC 7841. | Internet Standards is available in Section 2 of RFC 7841. | |||
| skipping to change at line 78 | skipping to change at line 78 | |||
| 5. Security Considerations | 5. Security Considerations | |||
| 6. IANA Considerations | 6. IANA Considerations | |||
| 7. References | 7. References | |||
| 7.1. Normative References | 7.1. Normative References | |||
| 7.2. Informative References | 7.2. Informative References | |||
| Acknowledgments | Acknowledgments | |||
| Authors' Addresses | Authors' Addresses | |||
| 1. Introduction | 1. Introduction | |||
| In the Network File System version 4 (NFSv4) with a Parallel NFS | In the Parallel NFS (pNFS) flexible file layout (see [RFC8435]), | |||
| (pNFS) flexible file layout (see Section 12 of [RFC8435]) server, | ||||
| there is no mechanism for the data servers to update the metadata | there is no mechanism for the data servers to update the metadata | |||
| servers when the data portion of the file is modified. The metadata | servers when the data portion of the file is modified. The metadata | |||
| server needs this knowledge to correspondingly update the metadata | server needs this knowledge to correspondingly update the metadata | |||
| portion of the file. If the client is using NFSv3 as the protocol | portion of the file. If the client is using NFSv3 as the protocol | |||
| with the data server, it can leverage Weak Cache Consistency (WCC) to | with the data server, it can leverage Weak Cache Consistency (WCC) to | |||
| update the metadata server of the attribute changes. In this | update the metadata server of the attribute changes. In this | |||
| document, we introduce a new operation called LAYOUT_WCC to NFSv4.2, | document, we introduce a new operation called LAYOUT_WCC to NFSv4.2, | |||
| which allows the client to periodically report the attributes of the | which allows the client to periodically report the attributes of the | |||
| data files to the metadata server. | data files to the metadata server. | |||
| skipping to change at line 121 | skipping to change at line 120 | |||
| metadata server (MDS): the pNFS server that provides metadata | metadata server (MDS): the pNFS server that provides metadata | |||
| information for a file system object. | information for a file system object. | |||
| storage device: the target to which clients may direct I/O requests | storage device: the target to which clients may direct I/O requests | |||
| when they hold an appropriate layout. Note that each data server | when they hold an appropriate layout. Note that each data server | |||
| is a storage device but that some storage devices are not data | is a storage device but that some storage devices are not data | |||
| servers. (See Section 2.1 of [RFC8434] for a discussion on the | servers. (See Section 2.1 of [RFC8434] for a discussion on the | |||
| difference between a data server and a storage device.) | difference between a data server and a storage device.) | |||
| weak cache consistency (WCC): In NFSv3, WCC allows the client to | weak cache consistency (WCC): the mechanism in NFSv3 that allows the | |||
| check for file attribute changes before and after an operation | client to check for file attribute changes before and after an | |||
| (see Section 2.6 of [RFC1813]). | operation (see Section 2.6 of [RFC1813]). | |||
| 1.2. Requirements Language | 1.2. Requirements Language | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
| "OPTIONAL" in this document are to be interpreted as described in | "OPTIONAL" in this document are to be interpreted as described in | |||
| BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
| capitals, as shown here. | capitals, as shown here. | |||
| 2. Weak Cache Consistency (WCC) | 2. Weak Cache Consistency (WCC) | |||
| A pNFS layout type enables the metadata server to inform the client | A pNFS layout type enables the metadata server to inform the client | |||
| of both the storage protocol and the locations of the data that the | of both the storage protocol and the locations of the data that the | |||
| client should use when communicating with the storage devices. The | client should use when communicating with the storage devices. The | |||
| flexible file layout type, as specified in [RFC8435], describes how | flexible file layout type, as specified in [RFC8435], describes how | |||
| data servers using NFSv3 can be accessed. The client is restricted | data servers using NFSv3 can be accessed. The client is restricted | |||
| to performing the following NFSv3 operations on the filehandles | to performing the following NFSv3 operations on the filehandles | |||
| provided in the layout: READ (Section 3.3.6 of [RFC1813]), WRITE | provided in the layout: READ, WRITE, and COMMIT (see Sections 3.3.6, | |||
| (Section 3.3.7 of [RFC1813]), and COMMIT (Section 3.3.21 of | 3.3.7, and 3.3.21 of [RFC1813], respectively). In other words, the | |||
| [RFC1813]). In other words, the client may only use NFSv3 operations | client may only use NFSv3 operations that act directly on the data | |||
| that act directly on the data portion of the file. | portion of the file. | |||
| Because there is no control protocol (see [RFC8434]) possible with | Because there is no control protocol (see [RFC8434]) possible with | |||
| all data servers, NFSv3 is used as the control protocol. As such, | all data servers, NFSv3 is used as the control protocol. As such, | |||
| the following NFSv3 operations are commonly used by the metadata | the following NFSv3 operations are commonly used by the metadata | |||
| server: CREATE (see Section 3.3.8 of [RFC1813]), GETATTR (see | server: CREATE, GETATTR, and SETATTR (see Sections 3.3.8, 3.3.1, and | |||
| Section 3.3.1 of [RFC1813]), and SETATTR (see Section 3.3.2 of | 3.3.2 of [RFC1813], respectively). That is, the metadata server is | |||
| [RFC1813]). That is, the metadata server is only allowed to use | only allowed to use NFSv3 operations that directly act on the | |||
| NFSv3 operations that directly act on the metadata portion of the | metadata portion of the data file. GETATTR allows the metadata | |||
| data file. GETATTR allows the metadata server to mainly retrieve the | server to mainly retrieve the mtime (modify time), ctime (change | |||
| mtime (modify time), ctime (change time), and atime (access time). | time), and atime (access time). The metadata server can use this | |||
| The metadata server can use this information to determine if the | information to determine if the client modified the file whilst it | |||
| client modified the file whilst it held an iomode of LAYOUTIOMODE4_RW | held an iomode of LAYOUTIOMODE4_RW (see Section 3.3.20 of [RFC8881]). | |||
| (see Section 3.3.20 of [RFC8881]). Then it can determine the | Then it can determine the following for the metadata file: | |||
| following for the metadata file: time_modify (see Section 5.8.2.43 of | time_modify, time_metadata, and time_access (see Sections 5.8.2.43, | |||
| [RFC8881]), time_metadata (see Section 5.8.2.42 of [RFC8881]), and | 5.8.2.42, and 5.8.2.37 of [RFC8881], respectively). That is, it can | |||
| time_access (see Section 5.8.2.37 of [RFC8881]). That is, it can | ||||
| determine the information to return to clients in an NFSv4.2 GETATTR | determine the information to return to clients in an NFSv4.2 GETATTR | |||
| response. | response. | |||
| For example, the metadata server might issue an NFSv3 GETATTR | For example, the metadata server might issue an NFSv3 GETATTR | |||
| operation to the data server, which is typically triggered by a | operation to the data server, which is typically triggered by a | |||
| client's NFSv4 GETATTR request to the metadata server. In addition | client's NFSv4 GETATTR request to the metadata server. In addition | |||
| to the cost of each individual GETATTR operation, the data server can | to the cost of each individual GETATTR operation, the data server can | |||
| be overwhelmed by a large volume of such requests. NFSv3 addressed a | be overwhelmed by a large volume of such requests. NFSv3 addressed a | |||
| similar challenge by including a post-operation attribute in the READ | similar challenge by including a post-operation attribute in the READ | |||
| and WRITE operations to report WCC data (see Section 2.6 of | and WRITE operations to report WCC data (see Section 2.6 of | |||
| [RFC1813]). | [RFC1813]). | |||
| Each NFSv3 operation entails a single round trip between the client | Each NFSv3 operation entails a single round trip between the client | |||
| and server. Consequently, issuing a WRITE followed by a GETATTR | and server. Consequently, issuing a WRITE followed by a GETATTR | |||
| would require two round trips. In that situation, the retrieved | would require two round trips. In that situation, the retrieved | |||
| attribute information is regarded as strict server-client | attribute information is regarded as having strict server-client | |||
| consistency. By contrast, NFSv4 enables a WRITE and GETATTR to be | consistency. By contrast, NFSv4 enables a WRITE and GETATTR to be | |||
| combined within a compound operation, which requires only one round | combined within a compound operation, which requires only one round | |||
| trip. This combined approach is likewise considered strict server- | trip. This combined approach is likewise considered to have strict | |||
| client consistency. Essentially, NFSv4 READ and WRITE operations | server-client consistency. Essentially, NFSv4 READ and WRITE | |||
| omit post-operation attributes, allowing the client to determine | operations omit post-operation attributes, allowing the client to | |||
| whether it requires that information. | determine whether it requires that information. | |||
| Whilst NFSv4 got rid of the requirement for WCC information to be | Whilst NFSv4 got rid of the requirement for WCC information to be | |||
| supplied by the WRITE or READ operations, the introduction of pNFS | supplied by the WRITE or READ operations, the introduction of pNFS | |||
| reintroduces the same problem. The metadata server has to | reintroduces the same problem. The metadata server has to | |||
| communicate with the data server in order to get the data that could | communicate with the data server in order to get the data that could | |||
| be provided by a WCC model. | be provided by a WCC model. | |||
| With the flexible file layout type, the client can leverage the NFSv3 | With the flexible file layout type, the client can leverage the NFSv3 | |||
| WCC to service the proxying of times (see Section 5 of [RFC9754]), | WCC to service the proxying of times (see Section 5 of [RFC9754]), | |||
| but the granularity of this data is limited. With client-side | but the granularity of this data is limited. With client-side | |||
| skipping to change at line 290 | skipping to change at line 288 | |||
| - time_modify (see Section 5.8.2.43 of [RFC8881]) | - time_modify (see Section 5.8.2.43 of [RFC8881]) | |||
| * Whenever it sends an NFS4ERR_ACCESS error via LAYOUTRETURN or | * Whenever it sends an NFS4ERR_ACCESS error via LAYOUTRETURN or | |||
| LAYOUTERROR. It could have already gotten the NFSv3 uid and gid | LAYOUTERROR. It could have already gotten the NFSv3 uid and gid | |||
| values back in the WCC of the WRITE, READ, or COMMIT operation | values back in the WCC of the WRITE, READ, or COMMIT operation | |||
| that got the error. Thus, it could report that information back | that got the error. Thus, it could report that information back | |||
| to the metadata server, saving it from querying that information | to the metadata server, saving it from querying that information | |||
| via an NFSv3 GETATTR. | via an NFSv3 GETATTR. | |||
| * Whenever it sends a SETATTR to refresh the proxied times (see | * Whenever it sends a SETATTR to refresh the proxied times (see | |||
| Section 5 of [RFC9754]). The metadata server is going to want to | Section 5 of [RFC9754]). The metadata server will correlate these | |||
| correlate these times in order to detect later modification to the | times in order to detect later modification to the data file. | |||
| data file. | ||||
| 3.4.2. Examples of What to Send in LAYOUT_WCC | 3.4.2. Examples of What to Send in LAYOUT_WCC | |||
| The NFSv3 attributes returned in the WCC of WRITE, READ, and COMMIT | The NFSv3 attributes returned in the WCC of WRITE, READ, and COMMIT | |||
| operations are a smaller subset of what can be transmitted as an | operations are a smaller subset of what can be transmitted as an | |||
| NFSv4 attribute. The mapping of NFSv3 to NFSv4 attributes is shown | NFSv4 attribute. The mapping of NFSv3 to NFSv4 attributes is shown | |||
| in Table 1. The LAYOUT_WCC MUST provide all of these attributes to | in Table 1. The LAYOUT_WCC MUST provide all of these attributes to | |||
| the metadata server. Both the uid and gid are stringified into their | the metadata server. Both the uid and gid are stringified into their | |||
| respective attributes of owner and owner_group. In the case of | respective attributes of owner and owner_group. In the case of | |||
| NFS4ERR_ACCESS, the reason to provide these two attributes is that | NFS4ERR_ACCESS, the reason to provide these two attributes is that | |||
| skipping to change at line 416 | skipping to change at line 413 | |||
| attributes present. Or it could decide to present only the two | attributes present. Or it could decide to present only the two | |||
| mirrors that had been changed. | mirrors that had been changed. | |||
| In either case, the combination of ffdsw_deviceid, ffdsw_stateid, and | In either case, the combination of ffdsw_deviceid, ffdsw_stateid, and | |||
| ffdsw_fh_vers will uniquely identify the attributes to be updated. | ffdsw_fh_vers will uniquely identify the attributes to be updated. | |||
| All three arguments are required. A layout might have multiple data | All three arguments are required. A layout might have multiple data | |||
| files on the same storage device, in which case the ffdsw_deviceid | files on the same storage device, in which case the ffdsw_deviceid | |||
| and ffdsw_stateid would match, but the ffdsw_fh_vers would not. | and ffdsw_stateid would match, but the ffdsw_fh_vers would not. | |||
| The ffdsw_attributes are processed similarly to the obj_attributes in | The ffdsw_attributes are processed similarly to the obj_attributes in | |||
| the SETATTR arguments (see Section 18.34 of [RFC8881]). | the SETATTR arguments (see Section 18.30 of [RFC8881]). | |||
| 4. Extraction of XDR | 4. Extraction of XDR | |||
| This document contains the XDR [RFC4506] description of the new open | This document contains the XDR [RFC4506] description of the new | |||
| flags for delegating the file to the client. The XDR description is | NFSv4.2 operation LAYOUT_WCC. The XDR description is embedded in | |||
| embedded in this document in a way that makes it simple for the | this document in a way that makes it simple for the reader to extract | |||
| reader to extract into a ready-to-compile form. The reader can feed | into a ready-to-compile form. The reader can feed this document into | |||
| this document into the following shell script to produce the machine- | the following shell script to produce the machine-readable XDR | |||
| readable XDR description of the new flags: | description of the new NFSv4.2 operation LAYOUT_WCC. | |||
| <CODE BEGINS> | <CODE BEGINS> | |||
| #!/bin/sh | #!/bin/sh | |||
| grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' | grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' | |||
| <CODE ENDS> | <CODE ENDS> | |||
| That is, if the above script is stored in a file called 'extract.sh', | That is, if the above script is stored in a file called 'extract.sh', | |||
| and this document is in a file called 'spec.txt', then the reader can | and this document is in a file called 'spec.txt', then the reader can | |||
| do: | do: | |||
| <CODE BEGINS> | <CODE BEGINS> | |||
| sh extract.sh < spec.txt > layout_wcc.x | sh extract.sh < spec.txt > layout_wcc.x | |||
| <CODE ENDS> | <CODE ENDS> | |||
| The effect of the script is to remove leading white space from each | The effect of the script is to remove leading blank space from each | |||
| line, plus a sentinel sequence of '///'. XDR descriptions with the | line, plus a sentinel sequence of '///'. XDR descriptions with the | |||
| sentinel sequence are embedded throughout the document. | sentinel sequence are embedded throughout the document. | |||
| Note that the XDR code contained in this document depends on types | Note that the XDR code contained in this document depends on types | |||
| from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]). This | from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]). This | |||
| includes both nfs types that end with a 4 (such as offset4 and | includes both nfs types that end with a 4 (such as offset4 and | |||
| length4) as well as more generic types (such as uint32_t and | length4) as well as more generic types (such as uint32_t and | |||
| uint64_t). | uint64_t). | |||
| While the XDR can be appended to that from [RFC7863], the various | While the XDR can be appended to that from [RFC7863], the various | |||
| End of changes. 13 change blocks. | ||||
| 49 lines changed or deleted | 46 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. | ||||
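
The extraction step described in Section 4 (select lines carrying the '///' sentinel, then strip the leading blank space and the sentinel) can be illustrated with a minimal Python sketch. This is not part of the RFC; it is a rough rendition of the grep/sed pipeline shown in both versions above, and the function name `extract_xdr` and the sample structure `example4` are hypothetical names introduced only for illustration.

```python
import re

# Matches lines that begin with optional spaces followed by the '///' sentinel,
# mirroring: grep '^ *///'
SENTINEL_LINE = re.compile(r"^ *///")


def extract_xdr(text: str) -> str:
    """Return the sentinel-marked lines of `text` with the sentinel removed."""
    out = []
    for line in text.splitlines():
        if not SENTINEL_LINE.search(line):
            continue  # not an XDR line; dropped, as grep would drop it
        # Mirrors: sed 's?^ */// ??'  (sentinel followed by one space)
        line = re.sub(r"^ */// ", "", line, count=1)
        # Mirrors: sed 's?^ *///$??'  (sentinel alone becomes an empty line)
        line = re.sub(r"^ *///$", "", line, count=1)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")


if __name__ == "__main__":
    # Hypothetical input: prose mixed with sentinel-marked XDR lines.
    sample = (
        "   Some prose that is not XDR.\n"
        "   /// struct example4 {\n"
        "   ///     uint32_t field;\n"
        "   /// };\n"
        "   ///\n"
    )
    print(extract_xdr(sample), end="")
```

As in the shell script, only one space after the sentinel is consumed, so indentation inside the XDR description is preserved, and a line consisting solely of the sentinel is kept as an empty line.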