| rfc9839.original | rfc9839.txt | |||
|---|---|---|---|---|
| Network Working Group T. Bray | Internet Engineering Task Force (IETF) T. Bray | |||
| Internet-Draft Textuality Services | Request for Comments: 9839 Textuality Services | |||
| Intended status: Standards Track P. Hoffman | Category: Standards Track P. Hoffman | |||
| Expires: 28 November 2025 ICANN | ISSN: 2070-1721 ICANN | |||
| 27 May 2025 | August 2025 | |||
| Unicode Character Repertoire Subsets | Unicode Character Repertoire Subsets | |||
| draft-bray-unichars-15 | ||||
| Abstract | Abstract | |||
| This document discusses subsets of the Unicode character repertoire | This document discusses subsets of the Unicode character repertoire | |||
| for use in protocols and data formats, and specifies three subsets | for use in protocols and data formats and specifies three subsets | |||
| recommended for use in IETF specifications. | recommended for use in IETF specifications. | |||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This is an Internet Standards Track document. | |||
| provisions of BCP 78 and BCP 79. | ||||
| Internet-Drafts are working documents of the Internet Engineering | ||||
| Task Force (IETF). Note that other groups may also distribute | ||||
| working documents as Internet-Drafts. The list of current Internet- | ||||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
| Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
| and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
| time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
| material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Further information on | |||
| Internet Standards is available in Section 2 of RFC 7841. | ||||
| This Internet-Draft will expire on 28 November 2025. | Information about the current status of this document, any errata, | |||
| and how to provide feedback on it may be obtained at | ||||
| https://www.rfc-editor.org/info/rfc9839. | ||||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2025 IETF Trust and the persons identified as the | Copyright (c) 2025 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
| license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
| Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
| and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
| extracted from this document must include Revised BSD License text as | to this document. Code Components extracted from this document must | |||
| described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
| provided without warranty as described in the Revised BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
| in the Revised BSD License. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction | |||
| 1.1. Notation . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.1. Notation | |||
| 2. Characters and Code Points . . . . . . . . . . . . . . . . . 3 | 2. Characters and Code Points | |||
| 2.1. Encoding forms . . . . . . . . . . . . . . . . . . . . . 4 | 2.1. Encoding Forms | |||
| 2.2. Problematic Code Points . . . . . . . . . . . . . . . . . 4 | 2.2. Problematic Code Points | |||
| 2.2.1. Surrogates . . . . . . . . . . . . . . . . . . . . . 5 | 2.2.1. Surrogates | |||
| 2.2.2. Control Codes . . . . . . . . . . . . . . . . . . . . 5 | 2.2.2. Control Codes | |||
| 2.2.3. Noncharacters . . . . . . . . . . . . . . . . . . . . 5 | 2.2.3. Noncharacters | |||
| 3. Dealing With Problematic Code Points . . . . . . . . . . . . 6 | 3. Dealing with Problematic Code Points | |||
| 4. Subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 | 4. Subsets | |||
| 4.1. Unicode Scalars . . . . . . . . . . . . . . . . . . . . . 7 | 4.1. Unicode Scalars | |||
| 4.2. XML Characters . . . . . . . . . . . . . . . . . . . . . 7 | 4.2. XML Characters | |||
| 4.3. Unicode Assignables . . . . . . . . . . . . . . . . . . . 8 | 4.3. Unicode Assignables | |||
| 5. Using Subsets . . . . . . . . . . . . . . . . . . . . . . . . 8 | 5. Using Subsets | |||
| 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 | 6. IANA Considerations | |||
| 7. Security Considerations . . . . . . . . . . . . . . . . . . . 9 | 7. Security Considerations | |||
| 8. Normative References . . . . . . . . . . . . . . . . . . . . 9 | 8. References | |||
| 9. Informative References . . . . . . . . . . . . . . . . . . . 10 | 8.1. Normative References | |||
| Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 11 | 8.2. Informative References | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 | Acknowledgements | |||
| Authors' Addresses | ||||
| 1. Introduction | 1. Introduction | |||
| Protocols and data formats frequently contain or are made up of | Protocols and data formats frequently contain or are made up of | |||
| textual data. Such text is normally composed of Unicode [UNICODE] | textual data. Such text is normally composed of Unicode [UNICODE] | |||
| characters, to support use by speakers of many languages. Unicode | characters, to support use by speakers of many languages. Unicode | |||
| characters are represented by numeric code points, and the "set of | characters are represented by numeric code points, and the "set of | |||
| all Unicode code points" is generally not a good choice for use in | all Unicode code points" is generally not a good choice for use in | |||
| text fields. Unicode recognizes different types of code points, not | text fields. Unicode recognizes different types of code points, not | |||
| all of which are appropriate in protocols, or even associated with | all of which are appropriate in protocols or even associated with | |||
| characters. Therefore, even if the desire is to support "all Unicode | characters. Therefore, even if the desire is to support "all Unicode | |||
| characters" a subset of the Unicode code point repertoire should be | characters", a subset of the Unicode code point repertoire should be | |||
| specified. Subsets such as those discussed in this document are | specified. Subsets such as those discussed in this document are | |||
| appropriate choices when more-specific limitations do not apply. | appropriate choices when more-specific limitations do not apply. | |||
| In this document, "subset" means a subset of the Unicode character | In this document, "subset" means a subset of the Unicode character | |||
| repertoire. This document specifies subsets that exclude some or all | repertoire. This document specifies subsets that exclude some or all | |||
| of the code points that are "problematic" as defined in Section 2.2. | of the code points that are "problematic" as defined in Section 2.2. | |||
| Authors should have a way to concisely and exactly reference a stable | Authors should have a way to concisely and exactly reference a stable | |||
| specification that identifies which subset a protocol or data format | specification that identifies which subset a protocol or data format | |||
| accepts. | accepts. | |||
| This document discusses issues that apply in choosing subsets, names | This document discusses issues that apply in choosing subsets, names | |||
| two subsets that have been popular in practice, and suggests one new | two subsets that have been popular in practice, and suggests one new | |||
| subset. The intended use is to serve as a convenient target for | subset. The intended use is to serve as a convenient target for | |||
| cross-reference from other specifications whose authors wish to | cross-reference from other specifications whose authors wish to | |||
| exclude problematic code points from the data format or protocol | exclude problematic code points from the data format or protocol | |||
| being specified. | being specified. | |||
| Note that this document only provides guidance on avoiding the use of | Note that this document only provides guidance on avoiding the use of | |||
| code points which cannot be used for interoperable interchange of | code points that cannot be used for interoperable interchange of | |||
| Unicode textual data. Dealing with strings, particularly in the | Unicode textual data. Dealing with strings, particularly in the | |||
| context of user interfaces, requires addressing language, text | context of user interfaces, requires addressing language, text | |||
| rendering direction, alternate representations of the same abstract | rendering direction, alternate representations of the same abstract | |||
| character, and so on. These issues, among many others, led to many | character, and so on. These issues, among many others, led to | |||
| efforts by the Unicode Consortium, IETF efforts like [IDN] and | efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | |||
| [PRECIS], and W3C internationalization efforts such as [W3C-CHAR]. | and [PRECIS], and internationalization efforts by W3C such as | |||
| The results of these efforts should be consulted by anyone engaging | [W3C-CHAR]. The results of these efforts should be consulted by | |||
| in such work. | anyone engaging in such work. | |||
| 1.1. Notation | 1.1. Notation | |||
| In this document, the numeric values assigned to Unicode characters | In this document, the numeric values assigned to Unicode characters | |||
| are provided in hexadecimal. This document uses Unicode's standard | are provided in hexadecimal. This document uses Unicode's standard | |||
| notation of "U+" followed by four or more hexadecimal digits. For | notation of "U+" followed by four or more hexadecimal digits. For | |||
| example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black | example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black | |||
| Heart), decimal 128,420, is U+1F5A4. | Heart), decimal 128,420, is U+1F5A4. | |||
| Groups of numeric values described in Section 4 are given in ABNF | Groups of numeric values described in Section 4 are given in ABNF | |||
| [RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather | [RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather | |||
| than "U+". | than "U+". | |||
| All the numeric ranges in this document are inclusive. | All the numeric ranges in this document are inclusive. | |||
| The subsets are described in ABNF. | The subsets are described in ABNF. | |||
| 2. Characters and Code Points | 2. Characters and Code Points | |||
| Definition D9 in section 3.4 of [UNICODE] defines "Unicode codespace" | Definition D9 in Section 3.4 of [UNICODE] defines "Unicode codespace" | |||
| as "a range of integers from 0 to 10FFFF_16". Definition D10 defines | as "a range of integers from 0 to 10FFFF_16". Definition D10 defines | |||
| "code point" as "Any value in the Unicode codespace". | "code point" as "Any value in the Unicode codespace". | |||
| The Unicode Standard's definition of "Unicode character" is | The Unicode Standard's definition of "Unicode character" is | |||
| conceptual. However, each Unicode character is assigned a code | conceptual. However, each Unicode character is assigned a code | |||
| point, used to represent the characters in computer memory and | point, used to represent the characters in computer memory and | |||
| storage systems and, in specifications, to specify allowed subsets. | storage systems and to specify allowed subsets in specifications. | |||
| There are 1,114,112 (17 ⨉ 2^16) code points; as of Unicode 16.0 | There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | |||
| (2024), about 155,000 have been assigned to characters. Since | (2024), about 155,000 have been assigned to characters. Since | |||
| unassigned code points regularly become assigned when new characters | unassigned code points regularly become assigned when new characters | |||
| are added to Unicode, it is usually not a good practice to specify | are added to Unicode, it is usually not a good practice to specify | |||
| that unassigned code points should be avoided. | that unassigned code points should be avoided. | |||
| 2.1. Encoding forms | 2.1. Encoding Forms | |||
| Unicode describes a variety of encoding forms, ways to marshal code | Unicode describes a variety of encoding forms that can be used to | |||
| points into byte sequences. A survey of these is beyond the scope of | marshal code points into byte sequences. A survey of these is beyond | |||
| this document. However, it is useful to note that "UTF-16" | the scope of this document. However, it is useful to note that "UTF- | |||
| represents each code point with one or two 16-bit chunks, while "UTF- | 16" represents each code point with one or two 16-bit chunks, while | |||
| 8" uses variable-length byte sequences [RFC3629]. | "UTF-8" uses variable-length byte sequences [RFC3629]. | |||
| The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | |||
| says "Protocols MUST be able to use the UTF-8 charset", which becomes | says "Protocols MUST be able to use the UTF-8 charset", which becomes | |||
| a mandate to use UTF-8 for any protocol or data format that specifies | a mandate to use UTF-8 for any protocol or data format that specifies | |||
| a single encoding form. UTF-8 is widely used for interoperable data | a single encoding form. UTF-8 is widely used for interoperable data | |||
| formats such as JSON, YAML, CBOR, and XML. | formats such as JSON, YAML, CBOR, and XML. | |||
| 2.2. Problematic Code Points | 2.2. Problematic Code Points | |||
| This section classifies as "problematic" all the code points which | This section classifies all the code points that can never represent | |||
| can never represent useful text and in some cases can lead to | useful text and, in some cases, can lead to software misbehavior as | |||
| software misbehavior. This is a low bar; the PRECIS [RFC8264] | "problematic". This is a low bar; the PRECIS [RFC8264] framework's | |||
| framework's "IdentifierClass" and "FreeformClass" exclude many more | "IdentifierClass" and "FreeformClass" exclude many more code points | |||
| code points which can cause problems when displayed to humans, in | that can cause problems when displayed to humans, in some cases | |||
| some cases presenting security risks. Specifications of fields in | presenting security risks. Specifications of fields in protocols and | |||
| protocols and data formats whose contents are designed for display to | data formats whose contents are designed for display to and | |||
| and interactions with humans would benefit from careful consideration | interactions with humans would benefit from careful consideration of | |||
| of the issues described by PRECIS; its more-restrictive subsets might | the issues described by PRECIS; its more-restrictive subsets might be | |||
| be better choices than those specified in this document. | better choices than those specified in this document. | |||
| Definition D10a in section 3.4 of [UNICODE] defines seven code point | Definition D10a in Section 3.4 of [UNICODE] defines seven code point | |||
| types. Three types of code points are assigned to entities which are | types. Three types of code points are assigned to entities that are | |||
| not actually characters or whose value as Unicode characters in text | not actually characters or whose value as Unicode characters in text | |||
| fields is questionable: "Surrogate", "Control", and "Noncharacter". | fields is questionable: "Surrogate", "Control", and "Noncharacter". | |||
| In this document, "problematic" refers to code points whose type is | In this document, "problematic" refers to code points whose type is | |||
| "Surrogate" or "Noncharacter", and to "legacy controls" as defined in | "Surrogate" or "Noncharacter" and to "legacy controls" as defined in | |||
| Section 2.2.2.2 below. | Section 2.2.2.2 below. | |||
| Unicode's definition D49 concerns the "private-use" type and section | Definition D49 in [UNICODE] concerns the "private-use" type, and | |||
| 3.5.10 states that they "are considered to be assigned characters". | Section 3.5.10 states that they "are considered to be assigned | |||
| Section 23.5 further states that these characters' "use may be | characters". Section 23.5 further states that these characters' "use | |||
| determined by private agreement among cooperating users". Because | may be determined by private agreement among cooperating users". | |||
| private-use code points may have uses based on private agreements, | Because private-use code points may have uses based on private | |||
| this document does not classify them as "problematic". | agreements, this document does not classify them as "problematic". | |||
| 2.2.1. Surrogates | 2.2.1. Surrogates | |||
| A total of 2,048 code points, the range U+D800-U+DFFF, is divided | A total of 2,048 code points, in the range U+D800-U+DFFF, are divided | |||
| into two blocks called "high surrogates" and "low surrogates"; | into two blocks called "high surrogates" and "low surrogates"; | |||
| collectively the 2,048 code points are referred to as "surrogates". | collectively, the 2,048 code points are referred to as "surrogates". | |||
| [UNICODE] section 23.6 specifies how surrogates may be used in | Section 23.6 of [UNICODE] specifies how surrogates may be used in | |||
| Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate | Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate | |||
| pair represents a code point greater than U+FFFF. | pair represents a code point greater than U+FFFF. | |||
| A surrogate which occurs in text encoded in any encoding form other | A surrogate that occurs in text encoded in any encoding form other | |||
| than UTF-16 has no meaning. In particular, [UNICODE] section 3.9.3 | than UTF-16 has no meaning. In particular, Section 3.9.3 of | |||
| forbids representing a surrogate in UTF-8. | [UNICODE] forbids representing a surrogate in UTF-8. | |||
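
The surrogate-pair arithmetic described above can be sketched as follows. This is a minimal illustration, not text from either version of the document; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, and the constants follow the UTF-16 definition in [UNICODE].

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point greater than U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high-surrogate/low-surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high-surrogate/low-surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F5A4 (Black Heart) is carried in UTF-16 as the pair D83D DDA4.
print([f"{u:04X}" for u in to_surrogate_pair(0x1F5A4)])
```

Outside UTF-16 a lone surrogate has nothing to pair with, which is why a strict UTF-8 encoder (Python's default `str.encode("utf-8")`, for instance) raises an error when handed one.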
| 2.2.2. Control Codes | 2.2.2. Control Codes | |||
| Section 23.1 of [UNICODE] introduces the control codes for | Section 23.1 of [UNICODE] introduces the control codes for | |||
| compatibility with legacy pre-Unicode standards. They comprise 65 | compatibility with legacy pre-Unicode standards. They comprise 65 | |||
| code points in the ranges U+0000-U+001F ("C0 controls") and | code points in the ranges U+0000-U+001F ("C0 controls") and | |||
| U+0080-U+009F ("C1 controls"), plus U+007F, "DEL". | U+0080-U+009F ("C1 controls"), plus U+007F, "DEL". | |||
| 2.2.2.1. Useful Controls | 2.2.2.1. Useful Controls | |||
| skipping to change at page 6, line 7 | skipping to change at line 233 | |||
| asserts repeatedly that they are not designed or used for open | asserts repeatedly that they are not designed or used for open | |||
| interchange. | interchange. | |||
| Code points are organized into 17 "planes", each containing 2^16 code | Code points are organized into 17 "planes", each containing 2^16 code | |||
| points. The last two code points in each plane are noncharacters: | points. The last two code points in each plane are noncharacters: | |||
| U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | |||
| U+10FFFE, U+10FFFF. | U+10FFFE, U+10FFFF. | |||
| The code points in the range U+FDD0-U+FDEF are noncharacters. | The code points in the range U+FDD0-U+FDEF are noncharacters. | |||
| 3. Dealing With Problematic Code Points | 3. Dealing with Problematic Code Points | |||
| [RFC9413], "Maintaining Robust Protocols", provides a thorough | "Maintaining Robust Protocols" [RFC9413] provides a thorough | |||
| discussion of strategies for dealing with issues in input data. | discussion of strategies for dealing with issues in input data. | |||
| Different types of problematic code points cause different issues. | Different types of problematic code points cause different issues. | |||
| Noncharacters and legacy controls are unlikely to cause software | Noncharacters and legacy controls are unlikely to cause software | |||
| failures, but they cannot usefully be displayed to humans, and can be | failures, but they cannot usefully be displayed to humans, and they | |||
| used in attacks based on attempting to display text that includes | can be used in attacks based on attempting to display text that | |||
| them. | includes them. | |||
| The behavior of software which encounters surrogates is unpredictable | The behavior of software that encounters surrogates is unpredictable | |||
| and differs among programming-language implementations, even between | and differs among programming-language implementations, even between | |||
| different API calls in the same language. | different API calls in the same language. | |||
| Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence | Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence | |||
| which would map to a surrogate is ill-formed. If a specification | that would map to a surrogate is ill-formed. If a specification | |||
| requires that input data be encoded with UTF-8, and if all input were | requires that input data be encoded with UTF-8, and if all input were | |||
| well-formed, implementors would never have to concern themselves with | well-formed, implementors would never have to concern themselves with | |||
| surrogates. | surrogates. | |||
| Unfortunately, industry experience teaches that problematic code | Unfortunately, industry experience teaches that problematic code | |||
| points, including surrogates, can and do occur in program input where | points, including surrogates, can and do occur in program input where | |||
| the source of input data is not controlled by the implementor. In | the source of input data is not controlled by the implementor. In | |||
| particular, the specification of JSON allows any code point to appear | particular, the specification of JSON allows any code point to appear | |||
| in object member names and string values [RFC8259]. | in object member names and string values [RFC8259]. | |||
| For example, the following is a conforming JSON text: | For example, the following is a conforming JSON text: | |||
| {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"} | {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"} | |||
| The value of the "example" field contains the C0 control NUL, the C1 | The value of the "example" field contains the C0 control NUL, the C1 | |||
| control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired | control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired | |||
| surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two | surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two | |||
| escaped UTF-16 surrogate code points as described in [RFC8259] | escaped UTF-16 surrogate code points as described in Section 7 of | |||
| section 7. It is unlikely to be useful as the value of a text field. | [RFC8259]. It is unlikely to be useful as the value of a text field. | |||
| That value cannot be serialized into well-formed UTF-8, but the | That value cannot be serialized into well-formed UTF-8, but the | |||
| behavior of libraries asked to parse the sample is unpredictable; | behavior of libraries asked to parse the sample is unpredictable; | |||
| some will silently parse this and generate an ill-formed UTF-8 | some will silently parse this and generate an ill-formed UTF-8 | |||
| string. | string. | |||
| Two reasonable options for dealing with problematic input are either | Two reasonable options for dealing with problematic input are either | |||
| rejecting text containing problematic code points, or replacing the | rejecting text containing problematic code points or replacing the | |||
| problematic code points with placeholders. | problematic code points with placeholders. | |||
| Silently deleting an ill-formed part of a string is a known security | Silently deleting an ill-formed part of a string is a known security | |||
| risk. Responding to that risk, [UNICODE] section 3.2 recommends | risk. Responding to that risk, Section 3.2 of [UNICODE] recommends | |||
| dealing with ill-formed byte sequences by signaling an error, or | dealing with ill-formed byte sequences by signaling an error or | |||
| replacing problematic code points, ideally with "�" (U+FFFD, | replacing problematic code points, ideally with "�" (U+FFFD, | |||
| REPLACEMENT CHARACTER). | REPLACEMENT CHARACTER). | |||
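
The two options named above (reject, or replace with U+FFFD) might look like the following sketch when applied to the JSON example from this section. It is only an illustration under stated assumptions: the helper names `is_problematic`, `replace_problematic`, and `reject_problematic` are hypothetical, the useful controls are assumed to be U+0009, U+000A, and U+000D (as in the ABNF comments of Section 4), and the parsing behavior shown is that of Python's standard `json` module, which is one library among many.

```python
import json

REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER


def is_problematic(cp: int) -> bool:
    """True for surrogates, legacy controls, and noncharacters."""
    if 0xD800 <= cp <= 0xDFFF:
        return True                      # surrogates
    if cp in (0x09, 0x0A, 0x0D):
        return False                     # useful controls (assumed set)
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return True                      # legacy C0 controls, DEL, C1 controls
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return True                      # noncharacters
    return False


def replace_problematic(text: str) -> str:
    """Option 1: substitute U+FFFD for each problematic code point."""
    return "".join(REPLACEMENT if is_problematic(ord(ch)) else ch for ch in text)


def reject_problematic(text: str) -> str:
    """Option 2: signal an error on the first problematic code point."""
    for index, ch in enumerate(text):
        if is_problematic(ord(ch)):
            raise ValueError(f"problematic code point U+{ord(ch):04X} at index {index}")
    return text


# The conforming JSON text from the example above.
doc = json.loads(r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}')
value = doc["example"]
print([f"U+{ord(ch):04X}" for ch in value])
print(replace_problematic(value) == REPLACEMENT * 4)
```

Neither helper deletes anything silently: replacement keeps the string length and position of each offending code point visible, while rejection leaves the decision to the caller.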
| 4. Subsets | 4. Subsets | |||
| This section describes three increasingly restrictive subsets that | This section describes three increasingly restrictive subsets that | |||
| can be used in specifying acceptable content for text fields in | can be used in specifying acceptable content for text fields in | |||
| protocols and data types. Specifications can refer to these subsets | protocols and data types. Specifications can refer to these subsets | |||
| by the names "Unicode Scalars", "XML Characters", and "Unicode | by the names "Unicode Scalars", "XML Characters", and "Unicode | |||
| Assignables". | Assignables". | |||
| 4.1. Unicode Scalars | 4.1. Unicode Scalars | |||
| Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode | Definition D76 in Section 3.9 of [UNICODE] defines the term "Unicode | |||
| scalar value" as "Any Unicode code point except high-surrogate and | scalar value" as "Any Unicode code point except high-surrogate and | |||
| low-surrogate code points." | low-surrogate code points". | |||
| The "Unicode Scalars" subset can be expressed as an ABNF production: | The "Unicode Scalars" subset can be expressed as an ABNF production: | |||
| unicode-scalar = | unicode-scalar = | |||
| %x0-D7FF / ; exclude surrogates | %x0-D7FF / ; exclude surrogates | |||
| %xE000-10FFFF | %xE000-10FFFF | |||
| This subset is the default for CBOR [RFC8949], and has the advantage | This subset is the default for Concise Binary Object Representation | |||
| of excluding surrogates. However, it includes legacy controls and | (CBOR) [RFC8949] and has the advantage of excluding surrogates. | |||
| noncharacters. | However, it includes legacy controls and noncharacters. | |||
| 4.2. XML Characters | 4.2. XML Characters | |||
| The XML 1.0 Specification [XML], in its grammar production labeled | The XML 1.0 Specification [XML], in its grammar production labeled | |||
| "Char", specifies a subset of Unicode code points that excludes | "Char", specifies a subset of Unicode code points that excludes | |||
| surrogates, legacy C0 controls, and the noncharacters U+FFFE and | surrogates, legacy C0 controls, and the noncharacters U+FFFE and | |||
| U+FFFF. | U+FFFF. | |||
| The "XML Characters" subset can be expressed as an ABNF production: | The "XML Characters" subset can be expressed as an ABNF production: | |||
| xml-character = | xml-character = | |||
| %x9 / %xA / %xD / ; useful controls | %x9 / %xA / %xD / ; useful controls | |||
| %x20-D7FF / ; exclude surrogates | %x20-D7FF / ; exclude surrogates | |||
| %xE000-FFFD / ; exclude FFFE and FFFF nonchars | %xE000-FFFD / ; exclude FFFE and FFFF nonchars | |||
| %x100000-10FFFF | %x10000-10FFFF | |||
| While this subset does not exclude all the problematic code points, | While this subset does not exclude all the problematic code points, | |||
| the C1 controls are less likely than the C0 controls to appear | the C1 controls are less likely than the C0 controls to appear | |||
| erroneously in data, and have not been observed to be a frequent | erroneously in data and have not been observed to be a frequent | |||
| source of problems. Also, the noncharacters greater in value than | source of problems. Also, the noncharacters greater in value than | |||
| U+FFFF are rarely encountered. | U+FFFF are rarely encountered. | |||
| 4.3. Unicode Assignables | 4.3. Unicode Assignables | |||
| This document defines the "Unicode Assignables" subset as all the | This document defines the "Unicode Assignables" subset as all the | |||
| Unicode code points that are not problematic. This, a proper subset | Unicode code points that are not problematic. This, a proper subset | |||
| of each of the others, comprises all code points that are currently | of each of the others, comprises all code points that are currently | |||
| assigned, excluding legacy control codes, or that might in future be | assigned, excluding legacy control codes, or that might be assigned | |||
| assigned. | in the future. | |||
| Unicode Assignables can be expressed as an ABNF production: | Unicode Assignables can be expressed as an ABNF production: | |||
| unicode-assignable = | unicode-assignable = | |||
| %x9 / %xA / %xD / ; useful controls | %x9 / %xA / %xD / ; useful controls | |||
| %x20-7E / ; exclude C1 controls and DEL | %x20-7E / ; exclude C1 controls and DEL | |||
| %xA0-D7FF / ; exclude surrogates | %xA0-D7FF / ; exclude surrogates | |||
| %xE000-FDCF / ; exclude FDD0 nonchars | %xE000-FDCF / ; exclude FDD0 nonchars | |||
| %xFDF0-FFFD / ; exclude FFFE and FFFF nonchars | %xFDF0-FFFD / ; exclude FFFE and FFFF nonchars | |||
| %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane) | %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane) | |||
| %x30000-3FFFD / %x40000-4FFFD / | %x30000-3FFFD / %x40000-4FFFD / | |||
| %x50000-5FFFD / %x60000-6FFFD / | %x50000-5FFFD / %x60000-6FFFD / | |||
| %x70000-7FFFD / %x80000-8FFFD / | %x70000-7FFFD / %x80000-8FFFD / | |||
| %x90000-9FFFD / %xA0000-AFFFD / | %x90000-9FFFD / %xA0000-AFFFD / | |||
| %xB0000-BFFFD / %xC0000-CFFFD / | %xB0000-BFFFD / %xC0000-CFFFD / | |||
| %xD0000-DFFFD / %xE0000-EFFFD / | %xD0000-DFFFD / %xE0000-EFFFD / | |||
| %xF0000-FFFFD / %x100000-10FFFD | %xF0000-FFFFD / %x100000-10FFFD | |||
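
As a rough sketch, the three ABNF productions in this section can be transcribed into range tables and used to test strings. The names `UNICODE_SCALAR`, `XML_CHARACTER`, `UNICODE_ASSIGNABLE`, and `in_subset` are hypothetical; the `xml-character` table uses the `%x10000-10FFFF` range from the published RFC column, and the per-plane ranges of `unicode-assignable` are generated rather than written out.

```python
# Each table is a list of inclusive (low, high) code point ranges.
UNICODE_SCALAR = [(0x0, 0xD7FF), (0xE000, 0x10FFFF)]

XML_CHARACTER = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # useful controls
    (0x20, 0xD7FF),                        # exclude surrogates
    (0xE000, 0xFFFD),                      # exclude FFFE and FFFF nonchars
    (0x10000, 0x10FFFF),
]

UNICODE_ASSIGNABLE = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # useful controls
    (0x20, 0x7E),                          # exclude C1 controls and DEL
    (0xA0, 0xD7FF),                        # exclude surrogates
    (0xE000, 0xFDCF),                      # exclude FDD0 nonchars
    (0xFDF0, 0xFFFD),                      # exclude FFFE and FFFF nonchars
] + [
    # planes 1 through 16, each minus its last two code points
    (plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 17)
]


def in_subset(text: str, ranges) -> bool:
    """True if every code point of text falls inside one of the ranges."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in text)


samples = {
    "NUL": "\x00",
    "C1 control": "\x89",
    "unpaired surrogate": "\udead",
    "noncharacter U+7FFFF": "\U0007FFFF",
    "Black Heart": "\U0001F5A4",
}
for label, text in samples.items():
    print(label,
          in_subset(text, UNICODE_SCALAR),
          in_subset(text, XML_CHARACTER),
          in_subset(text, UNICODE_ASSIGNABLE))
```

A linear scan over at most 23 ranges is adequate for illustration; an implementation validating large inputs would more likely use a compiled regular expression or a lookup keyed on the high bits of the code point.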
| 5. Using Subsets | 5. Using Subsets | |||
| Many IETF specifications rely on well-known data formats such as | Many IETF specifications rely on well-known data formats such as | |||
| JSON, I-JSON, CBOR, YAML, and XML. These formats specify default | JSON, Internet JSON (I-JSON), CBOR, YAML, and XML. These formats | |||
| subsets. For example, JSON allows object member names and string | specify default subsets. For example, JSON allows object member | |||
| values to include any Unicode code point, including all the | names and string values to include any Unicode code point, including | |||
| problematic types. | all the problematic types. | |||
| A protocol based on JSON can be made more robust and implementor- | A protocol based on JSON can be made more robust and implementor- | |||
| friendly by restricting the contents of object member names and | friendly by restricting the contents of object member names and | |||
| string values to one of the subsets described in Section 4. | string values to one of the subsets described in Section 4. | |||
| Equivalent restrictions are possible for other packaging formats such | Equivalent restrictions are possible for other packaging formats such | |||
| as I-JSON, XML, YAML, and CBOR. | as I-JSON, XML, YAML, and CBOR. | |||
| Note that escaping techniques such as those in the JSON example in | Note that escaping techniques such as those in the JSON example in | |||
| Section 3 cannot be used to circumvent this sort of restriction, | Section 3 cannot be used to circumvent this sort of restriction, | |||
| which applies to data content, not textual representation in | which applies to data content, not textual representation in | |||
| packaging formats. If a specification restricted a JSON field value | packaging formats. If a specification restricted a JSON field value | |||
| to the Unicode Assignables, the example would remain a conforming | to the Unicode Assignables, the example would remain a conforming | |||
| JSON Text but the data it represents would not constitute Unicode | JSON text but the data it represents would not constitute Unicode | |||
| Assignable code points. | Assignable code points. | |||
| 6. IANA Considerations | 6. IANA Considerations | |||
| This document has no actions for IANA. | This document has no IANA actions. | |||
| 7. Security Considerations | 7. Security Considerations | |||
| Section 3 of this document discusses security issues. | Section 3 of this document discusses security issues. | |||
| Unicode Security Considerations [TR36] is a wide-ranging survey of | Unicode Security Considerations [TR36] is a wide-ranging survey of | |||
| the issues implementors should consider while writing software to | the issues implementors should consider while writing software to | |||
| process Unicode text. Unicode Source Code Handling [TR55] discusses | process Unicode text. Unicode Source Code Handling [TR55] discusses | |||
| use of Unicode in programming languages, with a focus on security | use of Unicode in programming languages, with a focus on security | |||
| issues. Many of the attacks they discuss are aimed at deceiving | issues. Many of the attacks they discuss are aimed at deceiving | |||
| human readers, but vulnerabilities involving issues such as | human readers, but vulnerabilities involving issues such as | |||
| surrogates and noncharacters are also covered, and in fact can | surrogates and noncharacters are also covered and, in fact, can | |||
| contribute to human-deceiving exploits. | contribute to human-deceiving exploits. | |||
| The Security Considerations in Section 12 of [RFC8264] generally | The security considerations in Section 12 of [RFC8264] generally | |||
| applies to this document as well. | apply to this document as well. | |||
| Note that the Unicode-character subsets specified in this document | Note that the Unicode-character subsets specified in this document | |||
| are increasingly restrictive, omitting more and more problematic code | are increasingly restrictive, omitting more and more problematic code | |||
| points, and thus should be less and less susceptible to many of these | points, and thus should be less and less susceptible to many of these | |||
| exploits. The Section 4.3 subset, "Unicode Assignables", excludes | exploits. The subset in Section 4.3, "Unicode Assignables", excludes | |||
| all of these code points. | all of these code points. | |||
| 8. Normative References | 8. References | |||
| 8.1. Normative References | ||||
| [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | |||
| Specifications: ABNF", STD 68, RFC 5234, | Specifications: ABNF", STD 68, RFC 5234, | |||
| DOI 10.17487/RFC5234, January 2008, | DOI 10.17487/RFC5234, January 2008, | |||
| <https://www.rfc-editor.org/info/rfc5234>. | <https://www.rfc-editor.org/info/rfc5234>. | |||
| [TR36] The Unicode Consortium, "Unicode Security Considerations", | [TR36] Davis, M., Ed. and M. Suignard, Ed., "Unicode Security | |||
| <https://www.unicode.org/reports/tr36/>. Note that this | Considerations", <https://www.unicode.org/reports/tr36/>. | |||
| reference is to the latest version of this document, | ||||
| rather than to a specific release. It is not expected | ||||
| that future updates will affect the referenced | ||||
| discussions. | ||||
| [TR55] The Unicode Consortium, "Unicode Source Code Handling", | [TR55] Leroy, R., Ed. and M. Davis, Ed., "Unicode Source Code | |||
| <https://www.unicode.org/reports/tr55/>. Note that this | Handling", <https://www.unicode.org/reports/tr55/>. | |||
| reference is to the latest version of this document, | ||||
| rather than to a specific release. It is not expected | ||||
| that future updates will affect the referenced | ||||
| discussions. | ||||
| [UNICODE] The Unicode Consortium, "The Unicode Standard", | [UNICODE] The Unicode Consortium, "The Unicode Standard", | |||
| <http://www.unicode.org/versions/latest/>. Note that this | <http://www.unicode.org/versions/latest/>. Note that this | |||
| reference is to the latest version of Unicode, rather than | reference is to the latest version of Unicode, rather than | |||
| to a specific release. It is not expected that future | to a specific release. It is not expected that future | |||
| changes in the Unicode Standard will affect the referenced | changes in the Unicode Standard will affect the referenced | |||
| definitions. | definitions. | |||
| 9. Informative References | 8.2. Informative References | |||
| [IDN] "Internationalized Domain Name Working Group", | [IDN] "Internationalized Domain Name Working Group", | |||
| <https://datatracker.ietf.org/group/idn/>. | <https://datatracker.ietf.org/group/idn/>. | |||
| [PRECIS] "PRECIS Working Group", | [PRECIS] "PRECIS Working Group", | |||
| <https://datatracker.ietf.org/group/precis/>. | <https://datatracker.ietf.org/group/precis/>. | |||
| [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and | |||
| Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, | Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, | |||
| January 1998, <https://www.rfc-editor.org/info/rfc2277>. | January 1998, <https://www.rfc-editor.org/info/rfc2277>. | |||
| skipping to change at page 11, line 9 ¶ | skipping to change at line 464 ¶ | |||
| <https://www.rfc-editor.org/info/rfc8949>. | <https://www.rfc-editor.org/info/rfc8949>. | |||
| [RFC9413] Thomson, M. and D. Schinazi, "Maintaining Robust | [RFC9413] Thomson, M. and D. Schinazi, "Maintaining Robust | |||
| Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023, | Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023, | |||
| <https://www.rfc-editor.org/info/rfc9413>. | <https://www.rfc-editor.org/info/rfc9413>. | |||
| [W3C-CHAR] W3C, "Character encodings: Essential concepts", | [W3C-CHAR] W3C, "Character encodings: Essential concepts", | |||
| <https://www.w3.org/International/articles/definitions- | <https://www.w3.org/International/articles/definitions- | |||
| characters/>. | characters/>. | |||
| [XML] Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F. | [XML] Bray, T., Ed., Paoli, J., Ed., McQueen, C.M., Ed., Maler, | |||
| Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth | E., Ed., and F. Yergeau, Ed., "Extensible Markup Language | |||
| Edition)", 26 November 2008, | (XML) 1.0 (Fifth Edition)", W3C Recommendation, 26 | |||
| November 2008, | ||||
| <http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that | <http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that | |||
| this reference is to a specific release, based on a | this reference is to a specific release, based on a | |||
| history of previous "Edition" releases having changed this | history of previous "Edition" releases having changed this | |||
| production. | production. | |||
| Acknowledgements | Acknowledgements | |||
| Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata | Thanks are due to Guillaume Fortin-Debigaré, who filed an errata | |||
| Report against RFC 8259, The JavaScript Object Notation, noting | report against RFC 8259, "The JavaScript Object Notation (JSON) Data | |||
| frequent references to "Unicode characters", when in fact the RFC | Interchange Format", noting frequent references to "Unicode | |||
| formally specifies the use of Unicode Code Points. | characters", when in fact the RFC formally specifies the use of | |||
| Unicode code points. | ||||
| Thanks also to Asmus Freytag for careful review and many constructive | Thanks also to Asmus Freytag for careful review and many constructive | |||
| suggestions aimed at making the language more consistent with the | suggestions aimed at making the language more consistent with the | |||
| structure of the Unicode Standard. | structure of the Unicode Standard. | |||
| Thanks also to James Manger for the correctness of the ABNF and JSON | Thanks also to James Manger for the correctness of the ABNF and JSON | |||
| samples. | samples. | |||
| Thanks also to Addison Phillips and the W3C Internationalization | Thanks also to Addison Phillips and the W3C Internationalization | |||
| Working Group for helpful suggestions on language and references. | Working Group for helpful suggestions on language and references. | |||
| Thoughtful comments during the many iterations of this draft, which | Thoughtful comments during the many draft versions of this document, | |||
| helped tighten up wording and make difficult points clearer, were | which helped tighten up wording and make difficult points clearer, | |||
| contributed by Harald Alvestrand, Martin J Dürst, Donald E. | were contributed by Harald Alvestrand, Martin J. Dürst, Donald | |||
| Eastlake, John Klensin, Barry Leiba, Glyn Normington, Peter Saint- | E. Eastlake, John Klensin, Barry Leiba, Glyn Normington, Peter Saint- | |||
| Andre, and Rob Sayre. | Andre, and Rob Sayre. | |||
| Authors' Addresses | Authors' Addresses | |||
| Tim Bray | Tim Bray | |||
| Textuality Services | Textuality Services | |||
| Email: tbray@textuality.com | Email: tbray@textuality.com | |||
| Paul Hoffman | Paul Hoffman | |||
| ICANN | ICANN | |||
| End of changes. 51 change blocks. | ||||
| 144 lines changed or deleted | 138 lines changed or added | |||
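
As a rough illustration of the restriction described in Section 5, the following Python sketch checks the decoded member names and string values of a JSON text against the Unicode Assignables ranges from the ABNF shown earlier in this diff. It is not part of either version of the document. The function names are hypothetical, and the three permitted control characters (tab, line feed, carriage return) are an assumption based on the part of the ABNF production that precedes this excerpt. Because the check runs on the parsed data rather than on the JSON text, an escaped surrogate or control character is rejected just like a literal one.

```python
import json

# Assumption: the "useful controls" come from the part of the ABNF
# production that is not visible in this excerpt.
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}

# Basic Multilingual Plane ranges copied from the ABNF excerpt.
BMP_RANGES = [
    (0x20, 0x7E),      # excludes C0 controls (other than the above) and DEL
    (0xA0, 0xD7FF),    # excludes C1 controls and stops before surrogates
    (0xE000, 0xFDCF),  # stops before the FDD0-FDEF noncharacters
    (0xFDF0, 0xFFFD),  # excludes FFFE and FFFF
]


def is_unicode_assignable(cp: int) -> bool:
    """Return True if the code point falls in the Unicode Assignables ranges."""
    if cp in USEFUL_CONTROLS:
        return True
    if cp < 0x10000:
        return any(lo <= cp <= hi for lo, hi in BMP_RANGES)
    # Planes 1 through 16: every code point except the last two of each plane.
    return cp <= 0x10FFFD and (cp & 0xFFFF) <= 0xFFFD


def all_assignable(text: str) -> bool:
    """Check every code point of an already-decoded string."""
    return all(is_unicode_assignable(ord(ch)) for ch in text)


def json_strings_are_assignable(document: str) -> bool:
    """Parse a JSON text, then check all member names and string values."""
    def walk(value) -> bool:
        if isinstance(value, str):
            return all_assignable(value)
        if isinstance(value, dict):
            return all(all_assignable(k) and walk(v) for k, v in value.items())
        if isinstance(value, list):
            return all(walk(v) for v in value)
        return True  # numbers, booleans, and null carry no text

    return walk(json.loads(document))
```

For instance, `json_strings_are_assignable('{"name": "caf\\u00e9"}')` accepts the text, while `json_strings_are_assignable('{"name": "\\ud800"}')` rejects it: the second text is conforming JSON, but the data it represents is a lone surrogate. A real deployment would also need to decide how to report which member failed, which this sketch leaves out.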