Network Working Group                                         S. Leonard
Internet-Draft                                             Penango, Inc.
Updates: 5234 (if approved)                                    C. Newman
Intended Status: Experimental                                     Oracle
Expires: September 14, 2017                               March 13, 2017

                            Unicode in ABNF
                    draft-seantek-unicode-in-abnf-03

Abstract

This experimental document adds support for Unicode strings in ABNF (Augmented Backus-Naur Form), and provides certain symbols related to Unicode code point ranges.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft is a fork of draft-seantek-abnf-more-core-rules-05.

Copyright Notice

Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Leonard & Newman              Experimental                      [Page 1]
Internet-Draft              Unicode in ABNF               March 13, 2017

1. Introduction

Augmented Backus-Naur Form (ABNF) [RFC5234] is a formal syntax that is popular among many Internet specifications.
Many Internet documents employ this syntax along with the Core Rules defined in Appendix B.1 of [RFC5234]. ABNF is defined in terms of ASCII [ASCII86, RFC0020]; however, Unicode [UNICODE] has become increasingly popular--even required--as the Internet has evolved over the last two decades. Unicode (as UTF-8) will be permitted in the RFC series [IABNA], while [RFC5198] established Net-Unicode as the standard form for the use of Unicode as "network text".

Protocols that originally were ASCII-based have been, or are being, extended to support Unicode. However, protocols that use Unicode in some way (e.g., permit UTF-8 content in a production) use different ABNF expressions, some of which do not conform to Version 9.0.0 of the Unicode Standard and therefore could introduce interoperability or security problems.

Many parties have expressed interest in incorporating [UNICODE] into ABNF, yet the questions remain: "How?" and "To what extent?" This document proposes standardized techniques for expressing Unicode code points using ABNF.

This document intends to be very conservative in its approach: a conforming implementation only needs to know how to map between the Unicode scalar values and any Unicode encoding form. The Unicode Character Database (UCD, Section 4.1 of [UNICODE]) is intentionally not required. ABNF text that uses the syntax in this document needs to be in a Unicode encoding form (Conformance Clause D89 of [UNICODE]), but ABNF text that just uses the rules or terminal values can be expressed in ASCII [RFC0020].

2. Unicode Code Points in ABNF

(Consult Section 2.3 of [RFC5234] in relation to this paragraph.) Unicode has been expressed in several different ways in RFCs to date. This document establishes that in contexts where Unicode is specified as the coded character set [RFC2130], the terminal values %x00-10FFFF are to be used to represent the Unicode code points.
Only the Unicode scalar values are to be used in specifications that follow this document; surrogate code points (%xD800-DFFF) are not to be used [[NB: directly]]. This technique aligns ABNF with W3C EBNF [XMLEBNF] and Unicode EBNF [UNICODE].

(Consult Section 2.4 and Appendix B.2 of [RFC5234] in relation to this paragraph.) In contexts where Unicode is specified as the character set, the ABNF-based grammar may have multiple external encodings. This document does not fix the encoding scheme. The obvious external encoding is UTF-8 (see Net-Unicode [RFC5198]), but other encodings are possible.

This document neither restricts productions to NFC, nor provides a syntax for normalization to NFC.

3. Unicode Core Rule Update

Appendix A furnishes Unicode Core Rules that include comprehensive support for certain Unicode ranges and characters. These Unicode Core Rules supplement the Core Rules of [RFC5234] and [ABNFMORE]; they are intended to be available whenever this document is invoked. The rules reflect broad categories of allowable and disallowable characters in protocols for interchange between systems, as the Internet community has evolved, and as of Unicode 9.0.0 in August 2016 [UNICODE].

It is a design goal that a general-purpose ABNF grammar should not need to delve into the minutiae of Unicode character properties, which can be tailorable (i.e., language-specific), overridable, and unstable (between Unicode versions). It is a further design goal that a general-purpose ABNF grammar should not need to rely on sizeable external sources, namely the Unicode Character Database (Section 4.1 of [UNICODE]). To constrain this document's scope, character properties are not addressed further.
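The design goals above imply that broad categories of code points can be decided from fixed ranges alone, without consulting the Unicode Character Database. The following is a minimal sketch in Python of that idea. It assumes the surrogate, private-use, and noncharacter ranges as given in the Unicode Standard; the function names are illustrative only and are not defined by this document, and the Core Rules of Appendix A remain the reference for what each rule actually covers.

```python
# Illustrative range checks; none of these names are defined by this document.
# The ranges are the fixed ones from the Unicode Standard (an assumption of
# this sketch), so no Unicode Character Database lookup is needed.

def is_code_point(cp: int) -> bool:
    """True for any value in the Unicode codespace U+0000..U+10FFFF."""
    return 0x0 <= cp <= 0x10FFFF

def is_surrogate(cp: int) -> bool:
    """True for surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp: int) -> bool:
    """True for Unicode scalar values: code points that are not surrogates."""
    return is_code_point(cp) and not is_surrogate(cp)

def is_private_use(cp: int) -> bool:
    """True for the BMP private-use area and the two supplementary areas."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    if not is_code_point(cp):
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

Each check is a handful of integer comparisons, which is the kind of implementation burden the design goals aim for.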
According to a survey of all RFCs published through August 2016, many widely used Internet protocols rely on horizontal whitespace (HT and SP, or occasionally SP alone) and line breaks (usually CRLF, sometimes LF) as delimiters. Therefore, the rules specifically address horizontal whitespace and line breaks.

Rules are provided both with and without the private-use characters (Section 23.5 of [UNICODE]). Private-use characters "are intended for open interchange, subject to interpretation by private agreement" (Section 23.7 of [UNICODE]). Therefore, there is no way within [UNICODE] itself to provide for a common interpretation of these code points. See also Section 4 of [RFC5198]. A protocol designer needs to establish that common interpretation in prose, provide for protocol elements that establish the common interpretation, or (explicitly) accept that a common interpretation is done outside of the designer's protocol.

4. Case-Sensitive Unicode String Syntax

This document extends ABNF with a new case-sensitive Unicode string literal. The type is denoted using a type prefix similar to the type prefixes used with numeric values and case-sensitive ASCII string literals.

No syntax is provided for a case-insensitive Unicode string literal because doing so would require implementing Unicode caseless matching [UNICODE], which is language-dependent, Unicode version-dependent, and very complicated overall. Caseless matching also requires the UCD.

Add the contents of Section 4.1 to [RFC5234].

4.1. Terminal Values - Literal Text Strings

Literal case-sensitive text strings in ABNF may be in the Unicode character set [UNICODE]. The following prefix is used:

   %su = case-sensitive, Unicode

To be consistent with prior implementations of ABNF, having no prefix means that the string is case-insensitive and in ASCII.
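Because a "%su" literal is case-sensitive, it matches exactly the sequence of code points written between the quotation marks, and so corresponds to a concatenation of numeric terminal values. The following is a minimal sketch in Python of that correspondence. The function name is hypothetical, and the sketch assumes the input is the already-unquoted content of the literal and does not check which code points the literal is permitted to contain.

```python
# Illustrative only: su_literal_to_numeric is a hypothetical helper,
# not something defined by this document or by RFC 5234.

def su_literal_to_numeric(text: str) -> str:
    """Rewrite the content of a %su"..." literal as a %x concatenation."""
    if not text:
        # Assumption of this sketch: an empty literal is left to the caller,
        # since "%x" with no digits is not a valid numeric terminal value.
        raise ValueError("empty literal has no numeric form in this sketch")
    # Python strings iterate by code point, so each character is one value.
    return "%x" + ".".join(f"{ord(ch):X}" for ch in text)

# U+00A5 YEN SIGN, then "100Q", then U+1F39F ADMISSION TICKETS
print(su_literal_to_numeric("\u00A5100Q\U0001F39F"))
```

For that input the sketch prints %xA5.31.30.30.51.1F39F.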
[[ALT/DISCUSS: [RFC7405] %s"text" could be extended to support characters beyond ASCII. It is a strict superset of [RFC7405] and thus simpler. This document would leave [%i]"text" undefined for the time being, or, a collation from [RFC4790] could be identified.]]

The case-sensitive Unicode string can be composed of any Graphic, Format, or Reserved code point. Control, Private-Use, Surrogate, and Noncharacter code points are excluded. Newline (line breaking) characters are also omitted. (See Table 2-3 of [UNICODE].)

An example:

   rulename = %su"!100Q$"

where the character ! is actually the Unicode code point U+00A5 YEN SIGN, and the character $ is actually the Unicode code point U+1F39F ADMISSION TICKETS, is equivalent to the rule:

   rulename = %xA5.31.30.30.51.1F39F

4.2. ABNF Definition of ABNF - char-val

   char-val =/ case-sensitive-Unicode-string
   ; ALT/DISCUSS: "%s", modify 7405

   case-sensitive-Unicode-string = "%su" quoted-Unicode-string

   quoted-Unicode-string = DQUOTE *(%x20-21 / %x23-7E / UVCHARBEYONDASCII) DQUOTE
                         ; quoted string of SP and VCHAR
                         ; without DQUOTE, and UVCHAR
                         ; beyond the ASCII range

5. Terminal Value Transformation Syntax for UTF-8 and UTF-16

While Section 2 establishes terminal values %x00-10FFFF for Unicode, many Internet protocols incorporate Unicode using UTF-8 and define protocol elements using UTF-8 terminal values (i.e., values in the 8-bit range of %x00-FF, or more specifically, %x00-BF and %xC2-F4); see [RFC3629]. A smaller yet notable set of protocols use UTF-16.

Writing out Unicode code points or ranges in UTF-8 or UTF-16 can be cumbersome and error-prone. This document therefore provides a "terminal value transformation syntax", so that the code points %x00-10FFFF can be written out natively, but the resulting ABNF represents 8-bit or 16-bit units at the level of ABNF syntax.
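What such a transformation does to a single terminal value can be sketched in Python. This is only an illustration under stated assumptions: each input is one Unicode scalar value, ranges and other ABNF operators are not handled, and the function names are hypothetical rather than part of this document.

```python
# Illustrative only: these helper names are hypothetical.

def utf8_units(cp: int) -> list:
    """Return the UTF-8 octets (8-bit terminal values) for one scalar value."""
    # chr() rejects values above 0x10FFFF and encode() rejects surrogates;
    # both raise ValueError (UnicodeEncodeError is a subclass of ValueError).
    return list(chr(cp).encode("utf-8"))

def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (16-bit terminal values) for one scalar value."""
    octets = chr(cp).encode("utf-16-be")
    return [int.from_bytes(octets[i:i + 2], "big")
            for i in range(0, len(octets), 2)]

def utf16_octets(cp: int, little_endian: bool = False) -> list:
    """Return the UTF-16LE or UTF-16BE octets for one scalar value."""
    codec = "utf-16-le" if little_endian else "utf-16-be"
    return list(chr(cp).encode(codec))

def as_terminals(values) -> str:
    """Format a sequence of values as an ABNF %x concatenation."""
    return "%x" + ".".join(f"{v:02X}" for v in values)

rabbit = 0x1F430  # U+1F430 RABBIT FACE
print(as_terminals(utf8_units(rabbit)))
print(as_terminals(utf16_units(rabbit)))
print(as_terminals(utf16_octets(rabbit, little_endian=True)))
print(as_terminals(utf16_octets(rabbit)))
```

For U+1F430 the four lines printed are %xF0.9F.90.B0, %xD83D.DC30, %x3D.D8.30.DC, and %xD8.3D.DC.30. A surrogate code point or a value above 10FFFF raises an error in this sketch instead of producing any terminal values.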
From there, a protocol can supply a specific mapping (encoding) of those values into a character set or other representation, consistent with Section 2.3 of [RFC5234].

The syntax is:

   %t8(...)     for 8-bit UTF-8 (transform to %x00-BF and %xC2-F4)
   %t16(...)    for 16-bit UTF-16 (transform to %x00-D7FF, %xD800-DBFF %xDC00-DFFF, and %xE000-FFFF)
   %t16le(...)  for 8-bit UTF-16LE (transform to %x00.00-%xFF.FF, little-endian)
   %t16be(...)  for 8-bit UTF-16BE (transform to %x00.00-%xFF.FF, big-endian)

[[NB: Other possibilities: !t8 ~t8 $t8 #t8 -t8]]

A transform is applied by recursively driving it into the elements, transforming terminal values from the original code point to the corresponding Unicode Transformation Format over an 8-bit (or 16-bit) field. The transforms in this document distribute over ABNF operators.

"%t16" outputs 16-bit terminal values from %x00-FFFF, meaning that the endianness is not specified: a protocol needs to specify this or furnish a protocol slot for 16-bit code units. In contrast, "%t16be" and "%t16le" output 8-bit terminal values: each terminal value in the input will correspond to two or four terminal values in the output.

If a transform is used on a terminal value outside the Unicode scalar value range (see the proposed Core Rule ), the resulting terminal value can be neither satisfied nor produced.

A "reverse transformation syntax" to go from 8-bit or 16-bit terminal values to reassembled Unicode code points is not proposed at this time.

5.1. Examples

Example 1: The following rules are equivalent; see [RFC3629]:

   UTF8-MB = UTF8-2 / UTF8-3 / UTF8-4 ; from RFC 3629

   ; %x80-D7FF / %xE000-10FFFF
   UTF8-MB = %t8( BEYONDASCII )

Example 2: The code point U+1F430 RABBIT FACE can be represented as %x1F430. It can also be represented as %xD83D.DC30 or %t16( %x1F430 ) when UTF-16 is intended.

5.2. Advantages and Features

Using transformation syntax offers several advantages:

The generic ABNF syntax of a textual protocol can take full advantage of the Unicode character set; the syntax is not dependent on a particular encoding form.

Specifying ranges of characters becomes unwieldy when explicitly defined in terms of code units in a Unicode encoding form, e.g., as UTF-8 code units (octets) for characters beyond ASCII, or as UTF-16 code units (16-bit words) for supplementary characters. Trying to specify Punycode in ABNF would be, for all intents and purposes, impossible! (Note: it's not actually impossible, but very difficult and not particularly useful.)

Protocols that have arbitrary binary slots (e.g., BINARYMIME) are inherently incompatible with Section 2 syntax, but compatibility can be achieved by using transformation syntax.

Protocol designers can effectively exploit the "holes" in UTF-8, because octets C0, C1, and F5-FF are never seen in UTF-8. These octets provide natural delimiters for arbitrary runs of UTF-8. An advantage of using such octets as delimiters is that checking for these octets has to be done anyway for security reasons, so a designer can save cycles by incorporating this part of a check for well-formed Unicode into a protocol. Such delimiters can only be expressed outside of "%t8", since a "%t8" transform will never produce those terminal values.

(UTF-16 also has such "holes", namely, in unpaired surrogates. But using unpaired surrogates as delimiters may suffer from other security pitfalls; in any event, UTF-16 is far less common in IETF usage.)

6. Comment Syntax

This document extends ABNF to have Unicode comments. Comments are treated as specification prose, so they may be normative depending on the context. Comment text allows for the same repertoire of characters as RFC text.
The RFC Editors can regulate comments to the same extent as specification prose, including disallowing certain characters or code points.

6.1. Comment: ; Comment

(No changes to the text of Section 3.9 of [RFC5234] are needed.)

6.2. ABNF Definition of ABNF - comment

   ; given:
   comment = ";" *(WSP / VCHAR) CRLF

   ; increment (unambiguous grammar):
   comment =/ ";" *(UWSP / UVCHAR / PUACHAR)
              (UWSPBEYONDASCII / UVCHARBEYONDASCII / PUACHAR)
              *(UWSP / UVCHAR / PUACHAR) CRLF

   ; or redefine:
   comment = ";" *(UWSP / UVCHAR / PUACHAR) CRLF

7. Notational Conventions

For readability it is advisable to express a Unicode code point as the character itself, the numeric terminal value, and the name or a name alias. Only one expression is used for the formal ABNF notation: either the character itself (Section 4) or the numeric terminal value (Section 2). The other expressions can be incorporated into an adjacent comment.

The suggested notational convention for the adjacent comment follows Appendix A of [UNICODE]. The comment text consists of one or more WSP characters, optionally either the character itself or "U+" syntax followed by exactly one SP, and the name or a name alias in ALL-CAPS ASCII. Multiple characters can be notated in sequence on multiple comment lines or on a single comment line. It is neither advisable nor necessary to notate characters in the ASCII range.

Examples of the notation include:

   ; U+2206 INCREMENT
   ; U+2030 PER MILLE SIGN
   change-in-temp = %su"$" 3DIGIT %su"%"

   ; # EURO SIGN ZWJ / VULGAR FRACTION ONE HALF
   euros = %x20AC 3DIGIT [%x200D.BD]

where the characters $, %, #, and / are actually the respective Unicode characters mentioned in the comments.

8. Effects on RFC 5234

Formally, this document updates [RFC5234] but does not modify it in situ.
Authors need to reference this document if they want to include these enhancements; bare references to [RFC5234] do not include this specification (or, for that matter, [RFC7405]). This directive follows a model whereby document authors can choose whether to invoke particular enhancements to ABNF. As time goes on, the IETF can determine how often these enhancements are invoked, and can decide whether to include them as part of a revision to the base [RFC5234].

A bare reference to this document invokes the case-sensitive Unicode literal string syntax enhancement, the Unicode comment syntax enhancement, and the Unicode Core Rules of Appendix A (i.e., the Core Rules do not have to be further referenced). Nevertheless, document authors are free to qualify a reference to this document to invoke each feature selectively.

Appendix A of this document is meant to supplement Appendix B.1 of [RFC5234] and Appendix A of [ABNFMORE]; therefore, concurrently referencing those documents is a good idea.

Document authors who reference this document should use the rules of Appendix A, and should not attempt to redefine or provide incremental alternatives to them (except for backwards compatibility with prior documents).

9. IANA Considerations

This document implies no IANA considerations.

10. Security Considerations

While the Unicode Core Rules themselves may not be security-relevant, the use of C1 control characters could very well be security-relevant, because they may trigger special functions on various devices, while being invisible in other contexts. Similarly, case-sensitive Unicode string syntax allows for a broad range of code points, many of which represent characters that are confusable with other characters, or can only be inferred by visible yet subtle changes in the surrounding graphemes (or worse, semantic changes that do not have visual representations).
Protocols using Unicode should evaluate the applicability of Unicode security considerations [UTR#36].

11. References

11.1. Normative References

[ASCII86]  American National Standards Institute, "Coded Character Set -- 7-bit American Standard Code for Information Interchange", ANSI X3.4, 1986.

[RFC0020]  Cerf, V., "ASCII format for network interchange", RFC 20, October 1969.

[RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network Interchange", RFC 5198, March 2008.

[RFC5234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008.

[UNICODE]  The Unicode Consortium, "The Unicode Standard, Version 9.0.0", The Unicode Consortium, August 2016.

11.2. Informative References

[IABNA]    Flanagan, H., "The Use of Non-ASCII Characters in RFCs", draft-iab-rfc-nonascii-02 (work in progress), April 2016.

[RFC1345]  Simonsen, K., "Character Mnemonics and Character Sets", RFC 1345, June 1992.

[RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M., and P. Svanberg, "The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996", RFC 2130, April 1997.

[RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2003.

[RFC4790]  Newman, C., Duerst, M., and A. Gulbrandsen, "Internet Application Protocol Collation Registry", RFC 4790, March 2007.

[RFC7405]  Kyzivat, P., "Case-Sensitive String Support in ABNF", RFC 7405, December 2014.

[UTR#36]   Davis, M. and M. Suignard, "Unicode Security Considerations", Unicode Technical Report #36, September 2014, .

[XMLEBNF]  Bray, T., Paoli, J., Sperberg-McQueen, M., Maler, E., and F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth Edition)", Section 6, W3C Recommendation REC-xml-20081126, November 2008, .

Appendix A. Comprehensive Unicode Core Rules

Certain basic rules are in uppercase, such as SP, HTAB, CRLF, DIGIT, ALPHA, etc.

   ; D76 Unicode scalar value
   UNICODE =
   BEYONDASCII =
   BEYONDG0 =
   C1 =
   BEYONDC1 =
   G1 = ; 96-set
   BEYONDG1 =
   LATIN1 =
   BEYONDLATIN1 =

   ; C2 D14 noncharacter (sentinel)
   ; Section 23.7 Noncharacters, see also NUL
   NONUCHAR =

   ; UCHAR rules are analogous to CHAR
   UCHARBEYONDBMP =
   UCHARBEYONDLATIN1 = / UCHARBEYONDBMP
   UCHARBEYONDC1 = / UCHARBEYONDBMP
   UCHARBEYONDASCII = C1 / UCHARBEYONDC1
   UCHAR = / UCHARBEYONDBMP

   ; D49 private-use
   ; Section 23.5 Private-Use Characters
   ; Primary Private Use Area (in BMP)
   PPUACHAR =
   ; Supplementary Private Use Area-A
   SPUAACHAR =
   ; Supplementary Private Use Area-B
   SPUABCHAR =
   ; TODO: possible alternates: PUCHAR, PUA
   PUACHAR = PPUACHAR / SPUAACHAR / SPUABCHAR

   ; Unicode-y VCHAR: like VCHAR, attempts to capture
   ; "all standardized graphic and formatting
   ; characters/code points for open interchange,
   ; excluding white space and controls"
   ; EXCLUDES: Noncharacters (some Cn), Cs, Co, Cc, Z (Zs, Zl, Zp)
   UVCHARBEYONDBMP =
   UVCHARBEYONDLATIN1 = / UVCHARBEYONDBMP
   UVCHARBEYONDASCII = / UVCHARBEYONDBMP
   UVCHARBEYONDC1 = UVCHARBEYONDASCII
   UVCHAR = VCHAR / UVCHARBEYONDASCII

   ; horizontal white space only (Zs beyond ASCII),
   ; NO line breaks (Cc, Zl, Zp)
   ; cf Section 5.8 Newline Guidelines with RFC 5198
   ; see also SP
   UWSPBEYONDASCII =
   ; includes HT
   UWSP = WSP / UWSPBEYONDASCII

   ; C1 Controls
   PAD = ; gov't health warning: figment
   HOP = ; gov't health warning: figment
   BPH =
   NBH =
   IND =
   NEL = ; NLF CRLF, CR, LF, NEL (not LS or PS)
         ; --probably unnecessary for Internet usage:
         ; CRLF is already the standard
   SSA =
   ESA =
   HTS =
   HTJ =
   VTS =
   PLD =
   PLU =
   RI =
   SS2 =
   SS3 =
   DCS =
   PU1 =
   PU2 =
   STS =
   CCH =
   MW =
   SPA =
   EPA =
   SOS =
   SGCI = ; or SGC, gov't health warning: figment
   SCI =
   CSI =
   ST =
   OSC =
   PM =
   APC =

   ; Latin1
   NBSP =
   SHY =

   ; Zl, Zp
   ; NB: These are excluded from both UVCHAR and UWSP
   LS =
   PS =

Authors' Addresses

Sean Leonard
Penango, Inc.
5900 Wilshire Boulevard
21st Floor
Los Angeles, CA 90036
USA

EMail: dev+ietf@seantek.com
URI: http://www.penango.com/

Chris Newman
Oracle
440 E. Huntington Dr., Suite 400
Arcadia, CA 91006
USA

EMail: chris.newman@oracle.com