Specifying Unicode Character Repertoires in RFCs

Internet-Draft	Specifying Unicode	September 2023
Bray & Hoffman	Expires 11 March 2024

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶

This Internet-Draft will expire on 11 March 2024.¶

1. Introduction

When a protocol or data format has text fields, that text is normally composed of Unicode [UNICODE] characters, to support use by speakers of many languages. IETF policy mandates this [RFC2277]. Because of the way the Unicode Standard defines the term "Unicode character", the "set of all Unicode characters" is not always useful for technical specifications. Instead, subsets such as those defined in this document are typically used.¶

Protocols and data formats usually need to describe exactly which of the available Unicode characters may be used. The term "character repertoire" is a well-understood concept when applied to an encoding standard; in this document it describes selected subsets of the Unicode characters. Authors should have a way to concisely and exactly reference a stable specification that identifies a protocol or data format's character repertoire.¶

This document describes and names several subsets that have been popular choices in specification character repertoires, and suggests one new subset. The goal is to provide a convenient target for cross-reference from other specifications which discuss character repertoires.¶

1.1. Notation

In this document, the numeric values assigned to Unicode characters are provided in hexadecimal. In the text, Unicode's standard "U+" notation is used, with the value zero-padded to at least four hexadecimal digits [RFC5137]. For example, "A", decimal 65, is expressed as U+0041, and "😉" (Winking Face), decimal 128,521, is expressed as U+1F609.¶
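As an illustration of this notation, the following minimal Python sketch renders an integer as a "U+" string. The function name format_code_point is hypothetical and is not defined by this document or by [RFC5137].

```python
def format_code_point(cp: int) -> str:
    """Render a code point as "U+" followed by at least four uppercase hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp}")
    # The 04X format pads to four digits but does not truncate longer values.
    return f"U+{cp:04X}"


print(format_code_point(65))      # U+0041
print(format_code_point(128521))  # U+1F609
```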

Groups of numeric values described in Section 3 and Section 4 are given in ABNF [RFC5234]. In ABNF, the hexadecimal values for characters are preceded by "%x" rather than "U+".¶

All the numeric ranges in this document are inclusive.¶

2. Character Concepts

The Unicode Standard's definition of "Unicode character" is conceptual. However, each Unicode character is assigned an integer identifier in the range U+0000-U+10FFFF. These numbers are used to represent the characters in computer memory and storage systems and, in specifications, to specify the allowed repertoires of Unicode characters.¶

The numbers assigned to Unicode characters are called "code points"; there are potentially 1,114,112 of them. As of 2023, fewer than 150,000 characters have had code points assigned. While the inclusion of unassigned code points in text data is undesirable, it is difficult to specify that it should be avoided, because unassigned code points regularly become assigned as new characters are added to Unicode. Fortunately, the occurrence of unassigned code points in texts is generally unlikely to cause software to malfunction.¶
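The difficulty of excluding unassigned code points can be seen in a small Python sketch. It relies on the standard unicodedata module, whose answers depend on the Unicode version bundled with the interpreter, so the same code point may be reported differently by different Python releases. The helper name is_unassigned is hypothetical; note that the general category "Cn" also covers noncharacters.

```python
import unicodedata

# The codespace runs from 0 to 0x10FFFF inclusive.
CODESPACE_SIZE = 0x10FFFF + 1


def is_unassigned(cp: int) -> bool:
    """True if the bundled Unicode data gives cp the general category "Cn".

    "Cn" covers code points with no assignment in that Unicode version,
    and also the noncharacters.
    """
    return unicodedata.category(chr(cp)) == "Cn"


print(CODESPACE_SIZE)               # 1114112
print(unicodedata.unidata_version)  # varies with the Python release
print(is_unassigned(0x41))          # False: LATIN CAPITAL LETTER A
```

A validator built this way would start rejecting or accepting different inputs whenever its Unicode data is upgraded, which is why the subsets in this document are defined by fixed ranges instead.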

2.1. Transformation Formats

Unicode describes a variety of "transformation formats", which are ways to encode code points as bytes in computer memory. A survey of transformation formats is beyond the scope of this document. However, it is useful to note that the "UTF-16" transformation format represents each code point with one or two 16-bit chunks, and the "UTF-8" transformation format uses variable-length sequences of one to four bytes.¶
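The difference between the two formats can be observed with Python's built-in codecs. This is only a sketch; the helper names utf8_bytes and utf16_code_units are hypothetical.

```python
def utf8_bytes(text: str) -> int:
    """Number of bytes needed to encode text in UTF-8."""
    return len(text.encode("utf-8"))


def utf16_code_units(text: str) -> int:
    """Number of 16-bit chunks needed to encode text in UTF-16.

    The big-endian codec is used so that no byte order mark is added.
    """
    return len(text.encode("utf-16-be")) // 2


for ch in ("A", "\u00e9", "\u20ac", "\U0001F609"):
    print(f"U+{ord(ch):04X}", utf8_bytes(ch), utf16_code_units(ch))
```

U+0041 takes one byte in UTF-8 and U+1F609 takes four, while in UTF-16 only U+1F609, which is greater than U+FFFF, needs two 16-bit chunks.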

Use of the UTF-8 transformation format is mandated by the IETF [RFC2277], and UTF-8 is widely used for interoperable data formats such as JSON, YAML, and XML.¶

2.2. Problematic Code Point Types

Definition D10a in section 3.4 of [UNICODE] defines seven code point types. Three types of code points are assigned to constructs which are not actually characters or whose value as Unicode characters in text fields is questionable: "Control", "Surrogate", and "Noncharacter".¶

2.2.1. Surrogates

A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into two blocks called "high surrogates" and "low surrogates"; collectively the 2,048 code points are referred to as "surrogates". Surrogates may only be used in Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.¶

A surrogate which occurs in text encoded in any transformation format other than UTF-16 has no meaning and may cause malfunction in software that encounters it. In particular, it is impossible to represent a surrogate in well-formed UTF-8.¶
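The following Python sketch illustrates both points: the arithmetic by which UTF-16 maps a code point greater than U+FFFF to a high-surrogate/low-surrogate pair, and the refusal of a strict UTF-8 encoder to serialize a lone surrogate. The function names are hypothetical; Python's str type is used here only because it can hold an unpaired surrogate.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def encodes_as_utf8(text: str) -> bool:
    """True if text can be serialized as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


high, low = to_surrogate_pair(0x1F609)
print(f"{high:04X} {low:04X}")            # D83D DE09
print(encodes_as_utf8("\ud800"))         # False: unpaired surrogate
print(encodes_as_utf8("\U0001F609"))     # True
```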

2.2.2. Control Codes

Section 23.1 of [UNICODE] introduces the "Control Codes" for compatibility with legacy pre-Unicode standards. They comprise 65 code points in the ranges U+0000-U+001F ("C0 Controls") and U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".¶

2.2.2.1. Useful Controls

The C0 Controls include the newline (U+000A), carriage return (U+000D), and Tab (U+0009); this document refers to these three characters as the "useful controls".¶

2.2.2.2. Legacy Controls

Aside from the useful controls, the control codes are mostly obsolete and generally lack interoperable semantics. This document uses the phrase "legacy controls" to describe control codes that are not useful controls.¶

Since the code points for C0 Controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.¶
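The split between useful and legacy controls can be sketched in Python as follows. The names USEFUL_CONTROLS, is_control, and is_legacy_control are hypothetical and simply mirror the terms used in this section.

```python
# Tab, newline, and carriage return.
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    """True for the C0 Controls, DEL, and the C1 Controls."""
    return 0x00 <= cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F


def is_legacy_control(cp: int) -> bool:
    """True for control codes other than the three useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


controls = [cp for cp in range(0x110000) if is_control(cp)]
legacy = [cp for cp in controls if is_legacy_control(cp)]
print(len(controls), len(legacy))  # 65 62
```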

2.2.3. Noncharacters

Certain code points are classified as "noncharacters", and [UNICODE] asserts repeatedly that they are not designed or used for open interchange.¶

Code points are organized into 17 "planes", each containing 2¹⁶ code points. The last two code points in each plane are noncharacters: U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so on, up to U+10FFFE, U+10FFFF.¶

The code points in the range U+FDD0-U+FDEF are noncharacters.¶
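Both groups of noncharacters can be tested with one short predicate. The sketch below is in Python, and the name is_noncharacter is hypothetical. Because every plane holds 2¹⁶ code points, the last two code points of a plane are exactly those whose low 16 bits are FFFE or FFFF.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and for the last two code points of each plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # Masking with 0xFFFE ignores the lowest bit, so FFFE and FFFF both match.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


total = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
print(total)  # 66: 32 in U+FDD0-U+FDEF plus 2 in each of the 17 planes
```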

3. Subsets Defined in the Unicode Standard

This section describes popular subsets of the code points that are defined in [UNICODE]. Specifications can refer to these repertoires by the names "Unicode Code Points" and "Unicode Scalar Values".¶

3.1. Unicode Code Points

Definition D9 in section 3.4 of [UNICODE] defines the term "Unicode codespace" as "a range of integers from 0 to 10FFFF₁₆". Definition D10 defines the term "Code point" as "Any value in the Unicode codespace".¶

The "Unicode Code Points" subset can be expressed as an ABNF production:¶

unicode-code-points =
   %x0-10FFFF

This subset is notable for including all possible code points, including those of the problematic types discussed above. It is the default repertoire of JSON [RFC8259] and CBOR [RFC8949].¶

3.2. Unicode Scalar Values

Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points."¶

The "Unicode Scalar Values" subset can be expressed as an ABNF production:¶

unicode-scalar-values =
   %x0-D7FF / %xE000-10FFFF  ; exclude surrogates

This subset is the default character repertoire for I-JSON [RFC7493], and has the advantage of excluding surrogates. However, it includes legacy controls and noncharacters.¶
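A membership test for this subset is a direct transcription of the ABNF. The Python sketch below uses the hypothetical names is_scalar_value and all_scalar_values. It is useful in Python in particular because a str object may contain unpaired surrogates.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point that is not a surrogate."""
    return 0x0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def all_scalar_values(text: str) -> bool:
    """True if every code point in text is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)


print(all_scalar_values("caf\u00e9"))   # True
print(all_scalar_values("\udead"))      # False: unpaired surrogate
print(all_scalar_values("\u0000"))      # True: legacy controls are included
```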

4. Other Definitions

This section lists other ways to specify subsets of the code points beyond those provided by the Unicode Standard itself. Specifications can refer to these repertoires by the names "XML Characters" and "Useful Assignables".¶

4.1. XML Characters

The XML 1.0 Specification [XML], in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 Controls, and the noncharacters U+FFFE and U+FFFF.¶

The "XML Characters" subset can be expressed as an ABNF production:¶

xml-chars =
   %x9 / %xA / %xD /   ; useful controls
   %x20-D7FF /         ; exclude surrogates
   %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
   %x10000-10FFFF

While this subset does not exclude all the problematic code points, the C1 Controls are less likely than the C0 Controls to appear erroneously in data, and have not been observed to be a frequent source of problems. Also, the noncharacters greater in value than U+FFFF are rarely encountered.¶
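As a sketch of how the "Char" production of [XML] could be checked in code, the following Python function transcribes its ranges. The name is_xml_char is hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """True if cp is matched by the XML 1.0 "Char" production."""
    return (
        cp in (0x9, 0xA, 0xD)           # useful controls
        or 0x20 <= cp <= 0xD7FF         # stops before the surrogates
        or 0xE000 <= cp <= 0xFFFD       # stops before U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # all of planes 1 through 16
    )


print(is_xml_char(0x00))     # False: legacy C0 Control
print(is_xml_char(0x85))     # True: C1 Controls are not excluded
print(is_xml_char(0xFFFE))   # False
print(is_xml_char(0x1FFFF))  # True: noncharacters above U+FFFF are not excluded
```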

4.2. Useful Assignables

For convenience, this document defines the "Useful Assignables" subset as the Unicode code points, excluding the legacy controls, surrogates, and noncharacters. This comprises all code points that are currently assigned, or might in future be assigned, to characters that are not legacy control codes, plus the useful controls.¶

Useful Assignables can be expressed as an ABNF production:¶

useful-assignables =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 Controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD

This subset excludes all code points whose types are identified as problematic above.¶
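The subset can also be expressed by exclusion instead of by enumerating ranges. The following Python sketch, with the hypothetical name is_useful_assignable, starts from all code points and removes the three problematic types. Counting its members gives the same total as summing the ranges in the ABNF.

```python
def is_useful_assignable(cp: int) -> bool:
    """True if cp is in the Useful Assignables subset."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if cp in (0x9, 0xA, 0xD):                  # useful controls
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:       # legacy controls
        return False
    if 0xD800 <= cp <= 0xDFFF:                 # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return False
    return True


count = sum(1 for cp in range(0x110000) if is_useful_assignable(cp))
print(count)  # 1111936 = 1114112 - 62 legacy controls - 2048 - 66
```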

5. Refining Character Repertoires

Messages interchanged in Internet protocols of the type that IETF specifies are typically packaged into well-known data formats such as JSON, I-JSON, CBOR, YAML, and XML. These packaging formats typically have a default character repertoire. For example, JSON allows member names and string values to include any Unicode code points, including all the problematic types; the following is a legal JSON document:¶

{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

The value of the "example" field contains the C0 Control NUL, the C1 Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF. Such a value is unlikely to be useful in a text field. It cannot be serialized as well-formed UTF-8, but many libraries will silently parse it and generate an ill-formed UTF-8 string. Implementors must be prepared to deal with these sorts of problematic code points.¶

It is unlikely that anyone specifying a new data format would choose to allow this character repertoire.¶

A protocol based on JSON could be made more robust and implementor-friendly by requiring that the contents of member names and string values contain only Useful Assignables (see Section 4.2). An equivalent requirement is possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.¶
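One way a JSON-based protocol implementation might enforce such a requirement is sketched below in Python, using the standard json module. The helper names is_useful_assignable and problematic_code_points are hypothetical. The JSON text carries a C0 Control, a C1 Control, an unpaired surrogate, and the noncharacter U+7FFFF written as an escaped surrogate pair; this is an illustration under those assumptions, not a complete validator.

```python
import json


def is_useful_assignable(cp: int) -> bool:
    """True if cp is in the Useful Assignables subset (see Section 4.2)."""
    if cp in (0x9, 0xA, 0xD):
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:       # legacy controls
        return False
    if 0xD800 <= cp <= 0xDFFF:                 # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:  # noncharacters
        return False
    return cp <= 0x10FFFF


def problematic_code_points(text: str) -> list[str]:
    """List, in U+ notation, the code points in text outside the subset."""
    return [f"U+{ord(ch):04X}" for ch in text if not is_useful_assignable(ord(ch))]


# Python's json module accepts this text without complaint.
doc = json.loads(r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}')
found = problematic_code_points(doc["example"])
print(found)  # ['U+0000', 'U+0089', 'U+DEAD', 'U+7FFFF']
```

A receiver could reject the message, or replace the offending code points, whenever the returned list is not empty; which of these is appropriate is a decision for the protocol specification.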

9. Informative References

[RFC2277]: Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277, January 1998, <https://www.rfc-editor.org/info/rfc2277>.
[RFC5137]: Klensin, J., "ASCII Escaping of Unicode Characters", BCP 137, RFC 5137, DOI 10.17487/RFC5137, February 2008, <https://www.rfc-editor.org/info/rfc5137>.
[RFC7493]: Bray, T., Ed., "The I-JSON Message Format", RFC 7493, DOI 10.17487/RFC7493, March 2015, <https://www.rfc-editor.org/info/rfc7493>.
[RFC8259]: Bray, T., Ed., "The JavaScript Object Notation (JSON) Data Interchange Format", STD 90, RFC 8259, DOI 10.17487/RFC8259, December 2017, <https://www.rfc-editor.org/info/rfc8259>.
[RFC8949]: Bormann, C. and P. Hoffman, "Concise Binary Object Representation (CBOR)", STD 94, RFC 8949, DOI 10.17487/RFC8949, December 2020, <https://www.rfc-editor.org/info/rfc8949>.
[XML]: Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth Edition)", 26 November 2008, <http://www.w3.org/TR/2008/REC-xml-20081126/>. Note that this reference is to a specific release, based on a history of previous "Edition" releases having changed this production.