| rfc9839v1.txt | rfc9839.txt | |||
|---|---|---|---|---|
| skipping to change at line 102 ¶ | skipping to change at line 102 ¶ | |||
| subset. The intended use is to serve as a convenient target for | subset. The intended use is to serve as a convenient target for | |||
| cross-reference from other specifications whose authors wish to | cross-reference from other specifications whose authors wish to | |||
| exclude problematic code points from the data format or protocol | exclude problematic code points from the data format or protocol | |||
| being specified. | being specified. | |||
| Note that this document only provides guidance on avoiding the use of | Note that this document only provides guidance on avoiding the use of | |||
| code points that cannot be used for interoperable interchange of | code points that cannot be used for interoperable interchange of | |||
| Unicode textual data. Dealing with strings, particularly in the | Unicode textual data. Dealing with strings, particularly in the | |||
| context of user interfaces, requires addressing language, text | context of user interfaces, requires addressing language, text | |||
| rendering direction, alternate representations of the same abstract | rendering direction, alternate representations of the same abstract | |||
| character, and so on. These issues, among many others, led to many | character, and so on. These issues, among many others, led to | |||
| efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | |||
| and [PRECIS], and internationalization efforts by W3C such as | and [PRECIS], and internationalization efforts by W3C such as | |||
| [W3C-CHAR]. The results of these efforts should be consulted by | [W3C-CHAR]. The results of these efforts should be consulted by | |||
| anyone engaging in such work. | anyone engaging in such work. | |||
| 1.1. Notation | 1.1. Notation | |||
| In this document, the numeric values assigned to Unicode characters | In this document, the numeric values assigned to Unicode characters | |||
| are provided in hexadecimal. This document uses Unicode's standard | are provided in hexadecimal. This document uses Unicode's standard | |||
| notation of "U+" followed by four or more hexadecimal digits. For | notation of "U+" followed by four or more hexadecimal digits. For | |||
| skipping to change at line 143 ¶ | skipping to change at line 143 ¶ | |||
| storage systems and to specify allowed subsets in specifications. | storage systems and to specify allowed subsets in specifications. | |||
| There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | |||
| (2024), about 155,000 have been assigned to characters. Since | (2024), about 155,000 have been assigned to characters. Since | |||
| unassigned code points regularly become assigned when new characters | unassigned code points regularly become assigned when new characters | |||
| are added to Unicode, it is usually not a good practice to specify | are added to Unicode, it is usually not a good practice to specify | |||
| that unassigned code points should be avoided. | that unassigned code points should be avoided. | |||
| 2.1. Encoding Forms | 2.1. Encoding Forms | |||
| Unicode describes a variety of encoding forms, ways to marshal code | Unicode describes a variety of encoding forms that can be used to | |||
| points into byte sequences. A survey of these is beyond the scope of | marshal code points into byte sequences. A survey of these is beyond | |||
| this document. However, it is useful to note that "UTF-16" | the scope of this document. However, it is useful to note that "UTF- | |||
| represents each code point with one or two 16-bit chunks, while "UTF- | 16" represents each code point with one or two 16-bit chunks, while | |||
| 8" uses variable-length byte sequences [RFC3629]. | "UTF-8" uses variable-length byte sequences [RFC3629]. | |||
| The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | |||
| says "Protocols MUST be able to use the UTF-8 charset", which becomes | says "Protocols MUST be able to use the UTF-8 charset", which becomes | |||
| a mandate to use UTF-8 for any protocol or data format that specifies | a mandate to use UTF-8 for any protocol or data format that specifies | |||
| a single encoding form. UTF-8 is widely used for interoperable data | a single encoding form. UTF-8 is widely used for interoperable data | |||
| formats such as JSON, YAML, CBOR, and XML. | formats such as JSON, YAML, CBOR, and XML. | |||
| 2.2. Problematic Code Points | 2.2. Problematic Code Points | |||
| This section classifies all the code points that can never represent | This section classifies all the code points that can never represent | |||
| skipping to change at line 235 ¶ | skipping to change at line 235 ¶ | |||
| Code points are organized into 17 "planes", each containing 2^16 code | Code points are organized into 17 "planes", each containing 2^16 code | |||
| points. The last two code points in each plane are noncharacters: | points. The last two code points in each plane are noncharacters: | |||
| U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | |||
| U+10FFFE, U+10FFFF. | U+10FFFE, U+10FFFF. | |||
| The code points in the range U+FDD0-U+FDEF are noncharacters. | The code points in the range U+FDD0-U+FDEF are noncharacters. | |||
| 3. Dealing with Problematic Code Points | 3. Dealing with Problematic Code Points | |||
| [RFC9413], "Maintaining Robust Protocols", provides a thorough | "Maintaining Robust Protocols" [RFC9413] provides a thorough | |||
| discussion of strategies for dealing with issues in input data. | discussion of strategies for dealing with issues in input data. | |||
| Different types of problematic code points cause different issues. | Different types of problematic code points cause different issues. | |||
| Noncharacters and legacy controls are unlikely to cause software | Noncharacters and legacy controls are unlikely to cause software | |||
| failures, but they cannot usefully be displayed to humans, and they | failures, but they cannot usefully be displayed to humans, and they | |||
| can be used in attacks based on attempting to display text that | can be used in attacks based on attempting to display text that | |||
| includes them. | includes them. | |||
| The behavior of software that encounters surrogates is unpredictable | The behavior of software that encounters surrogates is unpredictable | |||
| and differs among programming-language implementations, even between | and differs among programming-language implementations, even between | |||
| skipping to change at line 413 ¶ | skipping to change at line 413 ¶ | |||
| 8.1. Normative References | 8.1. Normative References | |||
| [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", STD 68, RFC 5234, | Specifications: ABNF", STD 68, RFC 5234, | |||
| DOI 10.17487/RFC5234, January 2008, | DOI 10.17487/RFC5234, January 2008, | |||
| <https://www.rfc-editor.org/info/rfc5234>. | <https://www.rfc-editor.org/info/rfc5234>. | |||
| [TR36] Davis, M., Ed. and M. Suignard, Ed., "Unicode Security | [TR36] Davis, M., Ed. and M. Suignard, Ed., "Unicode Security | |||
| Considerations", <https://www.unicode.org/reports/tr36/>. | Considerations", <https://www.unicode.org/reports/tr36/>. | |||
| Note that this reference is to the latest version of this | ||||
| document, rather than to a specific release. It is not | ||||
| expected that future updates will affect the referenced | ||||
| discussions. | ||||
| [TR55] Leroy, R., Ed. and M. Davis, Ed., "Unicode Source Code | [TR55] Leroy, R., Ed. and M. Davis, Ed., "Unicode Source Code | |||
| Handling", <https://www.unicode.org/reports/tr55/>. Note | Handling", <https://www.unicode.org/reports/tr55/>. | |||
| that this reference is to the latest version of this | ||||
| document, rather than to a specific release. It is not | ||||
| expected that future updates will affect the referenced | ||||
| discussions. | ||||
| [UNICODE] The Unicode Consortium, "The Unicode Standard", | [UNICODE] The Unicode Consortium, "The Unicode Standard", | |||
| <http://www.unicode.org/versions/latest/>. Note that this | <http://www.unicode.org/versions/latest/>. Note that this | |||
| reference is to the latest version of Unicode, rather than | reference is to the latest version of Unicode, rather than | |||
| to a specific release. It is not expected that future | to a specific release. It is not expected that future | |||
| changes in the Unicode Standard will affect the referenced | changes in the Unicode Standard will affect the referenced | |||
| definitions. | definitions. | |||
| 8.2. Informative References | 8.2. Informative References | |||
| End of changes. 5 change blocks. | ||||
| 16 lines changed or deleted | 8 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. | ||||
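
The noncharacter rules quoted in the diff (the last two code points of each of the 17 planes, plus the range U+FDD0-U+FDEF) and the note on encoding forms can be checked with a short sketch. This is only an illustration: the names `is_noncharacter`, `noncharacter_count`, and `encoded_lengths` are hypothetical and do not come from the RFC text, and the encoding helper relies on Python's built-in codecs.

```python
# Illustrative sketch; all names here are assumptions, not taken from the RFC.
PLANE_COUNT = 17
PLANE_SIZE = 2 ** 16
CODE_POINT_COUNT = PLANE_COUNT * PLANE_SIZE  # 17 * 2^16 = 1,114,112


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a noncharacter as described in the quoted text."""
    if not 0 <= cp < CODE_POINT_COUNT:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # U+FDD0-U+FDEF is a contiguous range of noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points in each plane end in FFFE or FFFF.
    return (cp & 0xFFFF) >= 0xFFFE


def noncharacter_count() -> int:
    """Count noncharacters by scanning every code point."""
    return sum(1 for cp in range(CODE_POINT_COUNT) if is_noncharacter(cp))


def encoded_lengths(cp: int) -> tuple[int, int]:
    """Return (UTF-16 16-bit units, UTF-8 bytes) for one code point.

    Surrogate code points raise UnicodeEncodeError, because Python's
    strict codecs refuse to encode them.
    """
    ch = chr(cp)
    return len(ch.encode("utf-16-be")) // 2, len(ch.encode("utf-8"))
```

Under these assumptions, `noncharacter_count()` gives 66 (2 per plane across 17 planes, plus the 32 code points in U+FDD0-U+FDEF), and `encoded_lengths(0x1F600)` gives `(2, 4)`, matching the "one or two 16-bit chunks" description of UTF-16.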