I-Regexp: An Interoperable Regexp Format

Internet-Draft	I-Regexp	March 2022
Bormann & Bray	Expires 8 September 2022

Abstract

This document specifies I-Regexp, a flavor of regular expressions that is limited in scope with the goal of interoperation across many different regular-expression libraries.

3. I-Regexp Syntax

An I-Regexp MUST conform to the ABNF specification in Figure 1.

i-regexp = branch *( "|" branch )
branch = *piece
piece = atom [ quantifier ]
quantifier = ( %x2A-2B ; '*'-'+'
 / "?" ) / ( "{" quantity "}" )
quantity = QuantExact [ "," [ QuantExact ] ]
QuantExact = 1*%x30-39 ; '0'-'9'

atom = NormalChar / charClass / ( "(" i-regexp ")" )
NormalChar = ( %x00-27 / %x2C-2D ; ','-'-'
 / %x2F-3E ; '/'-'>'
 / %x40-5A ; '@'-'Z'
 / %x5E-7A ; '^'-'z'
 / %x7E-10FFFF )
charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
SingleCharEsc = "\" ( %x28-2B ; '('-'+'
 / %x2D-2E ; '-'-'.'
 / "?" / %x5B-5E ; '['-'^'
 / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
 )
charClassEsc = catEsc / complEsc
charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
 / %x5E-10FFFF ) / SingleCharEsc
catEsc = %s"\p{" charProp "}"
complEsc = %s"\P{" charProp "}"
charProp = IsCategory / IsBlock
IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
    Symbols / Others
Letters = %s"L" [ ( %x6C-6D ; 'l'-'m'
 / %s"o" / %x74-75 ; 't'-'u'
 ) ]
Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
 / %s"i" / %s"o" / %s"s" ) ]
Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
Others = %s"C" [ ( %s"c" / %s"f" / %x6E-6F ; 'n'-'o'
 ) ]
IsBlock = %s"Is" 1*( "-" / %x30-39 ; '0'-'9'
 / %x41-5A ; 'A'-'Z'
 / %x61-7A ; 'a'-'z'
 )

Figure 1: I-Regexp Syntax in ABNF

As an additional restriction, charClassExpr is not allowed to match [^], which according to this grammar would parse as a positive character class containing the single character ^.
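
As an informal illustration (not part of the specification), the following Python sketch transcribes two of the simpler productions of Figure 1: the code point ranges of NormalChar and the quantifier production. The names NORMAL_CHAR_RANGES, QUANTIFIER, is_normal_char, and is_quantifier are made up for this example.

```python
import re

# Inclusive code point ranges copied from the NormalChar production in Figure 1.
NORMAL_CHAR_RANGES = [
    (0x00, 0x27), (0x2C, 0x2D), (0x2F, 0x3E),
    (0x40, 0x5A), (0x5E, 0x7A), (0x7E, 0x10FFFF),
]

# quantifier = "*" / "+" / "?" / "{" quantity "}"
# quantity   = QuantExact [ "," [ QuantExact ] ], with QuantExact = 1*DIGIT
QUANTIFIER = re.compile(r"[*+?]|\{[0-9]+(?:,[0-9]*)?\}")


def is_normal_char(ch: str) -> bool:
    """Return True if the single character ch is a NormalChar."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NORMAL_CHAR_RANGES)


def is_quantifier(text: str) -> bool:
    """Return True if text as a whole matches the quantifier production."""
    return QUANTIFIER.fullmatch(text) is not None
```

Characters that carry syntactic meaning, such as ".", "?", "|", and the various brackets, fall outside these ranges and therefore have to be written with a SingleCharEsc to be matched literally.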

This is essentially XSD regexp without character class subtraction and multi-character escapes.

An I-Regexp implementation MUST be a complete implementation of this limited subset. In particular, full Unicode support is REQUIRED; the implementation MUST NOT limit itself to 7- or 8-bit character sets such as ASCII and MUST support the Unicode character property set in character classes.

Issues: The ABNF has been automatically generated and could use some further polishing. The ABNF has been verified against Appendix A, but a wider corpus of regular expressions will need to be examined. Note that about a third of the complexity of this ABNF grammar comes from going into details on the Unicode IsCategory classes. Additional complexity stems from the way hyphens can be used inside character classes to denote ranges; the grammar deliberately excludes questionable usage such as /[a-z-A-Z]/.

5. Mapping I-Regexp to Regexp Dialects

(TBD; these mappings need to be further verified in implementation work.)

5.1. XSD Regexps

Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is an identity function.

Note that a few errata for [XSD-2] have been fixed in [XSD11-2], which is therefore also included as a normative reference. XSD 1.1 is less widely implemented than XSD 1.0, and implementations of XSD 1.0 are likely to include these bugfixes, so for the purposes of this specification an implementation of XSD 1.0 regexps is equivalent to an implementation of XSD 1.1 regexps.

5.2. ECMAScript Regexps

Perform the following steps on an I-Regexp to obtain an ECMAScript regexp [ECMA-262]:

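For any unescaped dots (.) outside character classes (first alternative of the charClass production): replace the dot by [^\n\r].
Envelope the result in ^(?: and )$, so that the anchors apply to the whole expression and not only to the first and last branch of a top-level alternation.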

Note that where a regexp literal is required, the actual regexp needs to be enclosed in / characters.

The performance of an ECMAScript matcher can be increased by turning parenthesized regexps (last alternative of the atom production) into (?:...) constructions.
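
The mapping steps of this section can be sketched in Python as follows. This is a minimal, non-normative sketch that assumes its input has already been checked against the ABNF in Figure 1; the function name iregexp_to_ecmascript is hypothetical. It relies on the grammar not allowing an unescaped "[" or "]" inside a character class, so a single flag suffices to track whether the scan is inside one. The sketch anchors the result as ^(?:...)$; the non-capturing group keeps the anchors applying to the whole expression when the I-Regexp contains a top-level alternation such as a|b.

```python
def iregexp_to_ecmascript(iregexp: str) -> str:
    """Map a (pre-validated) I-Regexp to an ECMAScript regexp source string."""
    out = []
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\" and i + 1 < len(iregexp):
            # Copy an escape such as \. or \p unchanged, so that an
            # escaped dot is never rewritten.
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch == "." and not in_class:
            # Step 1: a dot outside a character class matches anything
            # except line feed and carriage return.
            out.append(r"[^\n\r]")
        else:
            out.append(ch)
        i += 1
    # Step 2: anchor the whole expression.
    return "^(?:" + "".join(out) + ")$"
```

For instance, the I-Regexp a.b|[.]\. maps to ^(?:a[^\n\r]b|[.]\.)$; only the first dot is rewritten, because the second is inside a character class and the third is escaped.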

5.3. PCRE, RE2, Ruby Regexps

Perform the same steps as in Section 5.2 to obtain a valid regexp in PCRE [PCRE2], the Go programming language [RE2], and the Ruby programming language, except that the last step is:

Envelope the result in \A(?: and )\z.

Again, the performance can be increased by turning parenthesized regexps (last alternative of the atom production) into (?:...) constructions.
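
The parenthesis rewrite mentioned above can be sketched in Python as follows, together with the \A and \z envelope. This is a non-normative sketch under the assumption that the input is a valid I-Regexp, in which an unescaped "(" outside a character class always opens a group; the function names to_non_capturing and anchor_pcre_style are hypothetical. The dot replacement step is deliberately left out here.

```python
def to_non_capturing(iregexp: str) -> str:
    """Turn every group "(" ... ")" of a valid I-Regexp into "(?:" ... ")"."""
    out = []
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\" and i + 1 < len(iregexp):
            # Escapes such as \( are copied unchanged.
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        # A "(" inside a character class is a literal and stays as it is.
        out.append("(?:" if ch == "(" and not in_class else ch)
        i += 1
    return "".join(out)


def anchor_pcre_style(regexp: str) -> str:
    """Anchor at the start and end of the subject string."""
    return r"\A(?:" + regexp + r")\z"
```

The enclosing non-capturing group in anchor_pcre_style keeps the anchors applying to the whole expression even when it contains a top-level alternation.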

5.4. << Your kind of Regexp here >>

(Please submit the mapping needed for your favorite kind of regexp.)

6. Motivation and Background

Data modeling formats (YANG, CDDL) as well as query languages (JSONPath) often need a regular expression (regexp) sublanguage. There are many dialects of regular expressions in use in platforms, programming languages, and data modeling formats.

While regular expressions originally were intended to describe a formal language, i.e., to provide a Boolean matching function, they have turned into parsing functions for many applications, with capture groups, greedy/lazy/possessive variants, etc. Language features such as backreferences allow specifying languages that are no longer regular (Chomsky type 3), the class that regular expressions are named for; for example, (.*)\1 matches exactly the strings that consist of some text repeated twice, a language that is not even context-free (Chomsky type 2).

YANG (Section 9.4.5 of [RFC7950]) and CDDL (Section 3.8.3 of [RFC8610]) have adopted the regexp language from W3C Schema [XSD-2]. XSD regexp is a pure matching language, i.e., XSD regexps can be used to match a string against them and yield a simple true or false result. XSD regexps are not as widely implemented as programming language regexp dialects such as those of Perl, Python, Ruby, Go [RE2], or JavaScript (ECMAScript) [ECMA-262]. The latter are often in a state of continuous development; in the best case (ECMAScript) there is a complete specification, which, however, is highly complex (Section 21.2 of [ECMA-262] comprises 62 pages) and evolves on a yearly timeline, with significant additions. Regexp dialects such as PCRE [PCRE2] have evolved to cover a common set of functions available in parsing regexp dialects, offered in a widely available library.

With continuing accretion of complex features, parsing regexp libraries have become susceptible to bugs and performance degradation, in particular those that can be exploited in Denial of Service (DoS) attacks. The RE2 library, which is compatible with Go language regexps, strives to be immune to DoS attacks, making it attractive to applications such as query languages where an attacker could control the input. The problem remains that other bugs in such libraries can lead to exploitable vulnerabilities; at the time of writing, the Common Vulnerabilities and Exposures (CVE) system has 131 entries that mention the word "regex" [REGEX-CVE] (not all, but many of which are such bugs, with 23 matches for arbitrary code execution).

Implementations of YANG and CDDL often struggle with providing true XSD regexps; some instead cheat by providing one of the parsing regexp varieties, sometimes without even advertising this fact.

A matching regexp that does not use the more complex XSD features (Section 6.1) can usually be converted into a parsing regexp of many dialects by simply surrounding it with anchors of that dialect (e.g., ^ or \A and $ or \z). If the original matching regexps exceed the envelope of compatibility between dialects, this can lead to interoperability problems, or, worse, security vulnerabilities. Also, features of the target dialect such as capture groups may be triggered inadvertently, reducing performance.

6.1. Subsetting XSD Regexps

XSD regexps are relatively easy to implement or map to widely implemented parsing regexp dialects, with a small number of notable exceptions:

Character class subtraction. This is a very useful feature in many specifications, but it is unfortunately mostly absent from parsing regexp dialects.

Discussion: This absence can often be addressed by translating character class subtraction into positive character classes (possibly requiring significant expansion) and/or inserting negative lookahead assertions (which are not universally supported by regexp libraries, most notably not by RE2 [RE2]). This specification therefore opts for leaving out character class subtraction.
Multi-character escapes. \d, \w, \s and their uppercase equivalents (complement classes) exhibit a large amount of variation between regexp flavors. (E.g., predefined character classes such as \w may be meant to be ASCII only, or they may encompass all letters and digits defined in Unicode. The latter is usually of interest in the application of query languages to text in human languages, while the former is of interest to a subset of applications in data model specifications.)
Unicode. While there is no doubt that a regexp flavor meant to last needs to be Unicode enabled, there are a number of aspects of this that need discussion. Not all regexp implementations that one might want to map I-Regexps into will support access to the Unicode tables needed to evaluate constructs such as \p{IsCoptic}; for mapping into such implementations, a translation needs to be provided. Fortunately, the \p/\P feature in general is now quite widely available.

Discussion: The ASCII focus can partially be addressed by adding a constraint outside the regexp that the matched text has to be ASCII in the first place. This often is all that is needed where regexps are used to define lexical elements of a computer language. This reduces the size of the Unicode tables required in such a constrained implementation considerably. (In Appendix A, RFC 6643 contains a lone instance of \p{IsBasicLatin}{0,255}, which is needed to describe a transition from a legacy character set to Unicode. RFC 2622 contains [[:digit:]], [[:alpha:]], [[:alnum:]], albeit in a specification for the flex tool; this is intended to be close to \d, \p{L}, \w in an ASCII subset.)
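
The translation of character class subtraction into a positive character class, mentioned in the first discussion above, can be illustrated with a small Python sketch. It is non-normative and works on lists of inclusive code point ranges rather than on regexp text; parsing the XSD subtraction syntax (e.g., [a-z-[aeiou]]) is assumed to have happened elsewhere. The function names subtract_ranges and format_class are hypothetical.

```python
def subtract_ranges(base, removed):
    """Return the ranges in base with everything in removed taken out.

    Both arguments are lists of inclusive (lo, hi) code point pairs.
    """
    result = []
    for lo, hi in sorted(base):
        cur = lo  # first code point of this base range not yet handled
        for rlo, rhi in sorted(removed):
            if rhi < cur or rlo > hi:
                continue  # no overlap with the remaining part of the range
            if rlo > cur:
                result.append((cur, rlo - 1))
            cur = max(cur, rhi + 1)
            if cur > hi:
                break
        if cur <= hi:
            result.append((cur, hi))
    return result


def format_class(ranges):
    """Write ranges as a positive character class expression."""
    if not ranges:
        raise ValueError("an empty character class cannot be expressed")

    def esc(cp):
        ch = chr(cp)
        # Escape the characters that are special inside a class.
        return "\\" + ch if ch in "-[]\\^" else ch

    parts = [esc(lo) if lo == hi else esc(lo) + "-" + esc(hi)
             for lo, hi in ranges]
    return "[" + "".join(parts) + "]"
```

Subtracting the five vowels from the range a-z this way yields [b-df-hj-np-tv-z], which shows the kind of expansion the discussion refers to; for large Unicode categories the resulting class can grow considerably.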

9. References

9.1. Normative References

[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[XSD-2]: Biron, P. and A. Malhotra, "XML Schema Part 2: Datatypes Second Edition", World Wide Web Consortium Recommendation REC-xmlschema-2-20041028, 28 October 2004, <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028>.
[XSD11-2]: Peterson, D., Gao, S., Malhotra, A., Sperberg-McQueen, M., Thompson, H., and P. Biron, "W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes", World Wide Web Consortium Recommendation REC-xmlschema11-2-20120405, 5 April 2012, <https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405>.

9.2. Informative References

[ECMA-262]: Ecma International, "ECMAScript 2020 Language Specification", ECMA Standard ECMA-262, 11th Edition, June 2020, <https://www.ecma-international.org/wp-content/uploads/ECMA-262.pdf>.
[PCRE2]: "Perl-compatible Regular Expressions (revised API: PCRE2)", n.d., <http://pcre.org/current/doc/html/>.
[RE2]: "RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.", n.d., <https://github.com/google/re2>.
[REGEX-CVE]: "CVE - Search Results", n.d., <https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=regex>.
[RFC7493]: Bray, T., Ed., "The I-JSON Message Format", RFC 7493, DOI 10.17487/RFC7493, March 2015, <https://www.rfc-editor.org/info/rfc7493>.
[RFC7950]: Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", RFC 7950, DOI 10.17487/RFC7950, August 2016, <https://www.rfc-editor.org/info/rfc7950>.
[RFC8610]: Birkholz, H., Vigano, C., and C. Bormann, "Concise Data Definition Language (CDDL): A Notational Convention to Express Concise Binary Object Representation (CBOR) and JSON Data Structures", RFC 8610, DOI 10.17487/RFC8610, June 2019, <https://www.rfc-editor.org/info/rfc8610>.

Appendix A. Regexps and Similar Constructs in Recent Published RFCs

This appendix contains a number of regular expressions that have been extracted from some recently published RFCs based on some ad-hoc matching. Multi-line constructions were not included. With the exception of some (often surprisingly dubious) usage of multi-character escapes, all regular expressions validate against the ABNF in Figure 1.

rfc6021.txt  459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
rfc6021.txt  513 \d*(\.\d*){1,127}
rfc6021.txt  529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
rfc6021.txt  631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
rfc6021.txt  647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
rfc6021.txt  933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6021.txt  938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc6020.txt 6647 [0-9a-fA-F]*
rfc6095.txt 2544 \S(.*\S)?
rfc6110.txt 1583 [aeiouy]*
rfc6110.txt 3222 [A-Z][a-z]*
rfc6536.txt 1583 \*
rfc6536.txt 1632 [^\*].*
rfc6643.txt  524 \p{IsBasicLatin}{0,255}
rfc6728.txt 3480 \S+
rfc6728.txt 3500 \S(.*\S)?
rfc6991.txt  477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
rfc6991.txt  525 \d*(\.\d*){1,127}
rfc6991.txt  541 [a-zA-Z_][a-zA-Z0-9\-_.]*
rfc6991.txt  542 .|..|[^xX].*|.[^mM].*|..[^lL].*
rfc6991.txt  571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
rfc6991.txt  665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
rfc6991.txt  693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
rfc6991.txt  725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
rfc6991.txt  743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-
rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc6991.txt 1099 [0-9\.]*
rfc6991.txt 1109 [0-9a-fA-F:\.]*
rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
rfc7407.txt  933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254}
rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31}
rfc7758.txt  703 \d{2}:\d{2}:\d{2}(\.\d+)?
rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)?
rfc7895.txt  349 \d{4}-\d{2}-\d{2}
rfc7950.txt 8323 [0-9a-fA-F]*
rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]*
rfc7950.txt 8356 [xX][mM][lL].*
rfc8040.txt 4713 \d{4}-\d{2}-\d{2}
rfc8049.txt 6704 [A-Z]{2}
rfc8194.txt  629 \*
rfc8194.txt  637 [0-9]{8}\.[0-9]{6}
rfc8194.txt  905 Z|[\+\-]\d{2}:\d{2}
rfc8194.txt  963 (2((2[4-9])|(3[0-9]))\.).*
rfc8194.txt  974 (([fF]{2}[0-9a-fA-F]{2}):).*
rfc8299.txt 7986 [A-Z]{2}
rfc8341.txt 1878 \*
rfc8341.txt 1927 [^\*].*
rfc8407.txt 1723 [0-9\.]*
rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]*
rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].*
rfc8525.txt  550 \d{4}-\d{2}-\d{2}
rfc8776.txt  838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)*
rfc8776.txt  874 ([a-zA-Z0-9\-_.]+:)*
rfc8819.txt  311 [\S ]+
rfc8944.txt  596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}

Figure 2: Example regular expressions extracted from RFCs

The multi-character escapes (MCE) or the character classes built around them used here can be substituted as shown in Table 1.

Table 1: Substitutes for multi-character escapes in examples
MCE/class	Substitute class
`\S`	`[^ \t\n\r]`
`[\S ]`	`[^\t\n\r]`
`\d`	`[0-9]`

Note that the semantics of \d in XSD regular expressions is that of \p{Nd}; however, this would include all Unicode characters that are digits in various writing systems and is certainly not what is meant in the RFCs listed.
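
The substitutions of Table 1 can be applied mechanically, as in the following non-normative Python sketch. It covers only the three forms that occur in Figure 2: \S and \d outside character classes and the complete class [\S ]. A \d or \S inside any other character class is not handled and would need a different rewrite. The names ESCAPE_SUBSTITUTES and substitute_mce are hypothetical.

```python
# Substitutes from Table 1 for escapes that occur outside character classes.
ESCAPE_SUBSTITUTES = {
    r"\S": r"[^ \t\n\r]",
    r"\d": "[0-9]",
}


def substitute_mce(regexp: str) -> str:
    """Replace the multi-character escapes of Table 1 in regexp."""
    out = []
    i = 0
    while i < len(regexp):
        if regexp.startswith(r"[\S ]", i):
            # The whole class [\S ] has its own substitute in Table 1.
            out.append(r"[^\t\n\r]")
            i += 5
        elif regexp[i] == "\\" and i + 1 < len(regexp):
            # Look at backslash escapes as pairs, so that an escaped
            # backslash followed by "d" is not mistaken for \d.
            pair = regexp[i:i + 2]
            out.append(ESCAPE_SUBSTITUTES.get(pair, pair))
            i += 2
        else:
            out.append(regexp[i])
            i += 1
    return "".join(out)
```

Applied to the rfc7895.txt entry \d{4}-\d{2}-\d{2}, this gives [0-9]{4}-[0-9]{2}-[0-9]{2}, which validates against the ABNF in Figure 1.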
