Internet-Draft | I-Regexp | March 2022 |
Bormann & Bray | Expires 8 September 2022 | [Page] |
This document specifies I-Regexp, a flavor of regular expressions that is limited in scope with the goal of interoperation across many different regular-expression libraries.¶
This note is to be removed before publishing as an RFC.¶
Status information for this document may be found at https://datatracker.ietf.org/doc/draft-bormann-jsonpath-iregexp/.¶
Discussion of this document takes place on the JSONpath Working Group mailing list (mailto:JSONpath@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/JSONpath/.¶
Source for this draft and an issue tracker can be found at https://github.com/cabo/iregexp.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 8 September 2022.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
The present specification defines an interoperable regular expression flavor, I-Regexp.¶
This document uses the abbreviation "regexp" for what are usually called regular expressions in programming. "I-Regexp" is used as a noun meaning a character string which conforms to the requirements in this specification; the plural is "I-Regexps".¶
I-Regexp does not provide advanced regexp features such as capture groups, lookahead, or backreferences. It supports only a Boolean matching capability, i.e., testing whether a given regexp matches a given piece of text.¶
I-Regexp is a subset of XSD regexps [XSD-2].¶
This document includes rules for converting I-Regexps for use with several well-known regexp libraries.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
I-Regexps should handle the vast majority of practical cases where a matching regexp is needed in a data model specification or a query language expression.¶
A brief survey of published RFCs yielded the regexp patterns in Appendix A (with no attempt at completeness). With certain exceptions as discussed there, these should be covered by I-Regexps, both syntactically and with their intended semantics.¶
An I-Regexp MUST conform to the ABNF specification in Figure 1.¶
As an additional restriction, charClassExpr is not allowed to match [^], which according to this grammar would parse as a positive character class containing the single character ^.¶
This is essentially XSD regexp without character class subtraction and multi-character escapes.¶
An I-Regexp implementation MUST be a complete implementation of this limited subset. In particular, full Unicode support is REQUIRED; the implementation MUST NOT limit itself to 7- or 8-bit character sets such as ASCII and MUST support the Unicode character property set in character classes.¶
/[a-z-A-Z]/.¶
This syntax is a subset of that of [XSD-2]. Implementations which interpret I-Regexps MUST yield Boolean results as specified in [XSD-2]. (See also Section 5.1.)¶
(TBD; these mappings need to be further verified in implementation work.)¶
Any I-Regexp also is an XSD Regexp [XSD-2], so the mapping is an identity function.¶
Note that a few errata for [XSD-2] have been fixed in [XSD11-2], which is therefore also included as a normative reference. XSD 1.1 is less widely implemented than XSD 1.0, and implementations of XSD 1.0 are likely to include these bugfixes, so for the intents and purposes of this specification an implementation of XSD 1.0 regexps is equivalent to an implementation of XSD 1.1 regexps.¶
Perform the following steps on an I-Regexp to obtain an ECMAScript regexp [ECMA-262]:¶
For any dot (.) outside character classes (first alternative of charClass production): replace the dot by [^\n\r].¶
Enclose the result in ^ and $.¶
Note that where a regexp literal is required, the actual regexp needs to be enclosed in / characters.¶
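The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a verified mapping: the function name iregexp_to_ecmascript is hypothetical, the input is assumed to be a valid I-Regexp already, and the sketch additionally wraps the pattern in a non-capturing group (an assumption beyond the steps listed) so that a top-level alternation such as a|b stays inside the anchors.¶

```python
def iregexp_to_ecmascript(iregexp: str) -> str:
    """Sketch: map a valid I-Regexp to an anchored ECMAScript-style pattern."""
    out = []
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\" and i + 1 < len(iregexp):
            # Copy the backslash and the next character verbatim,
            # so an escaped dot (\.) is never treated as a wildcard.
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
            out.append(ch)  # a dot inside a class is a literal dot
        elif ch == "[":
            in_class = True
            out.append(ch)
        elif ch == ".":
            out.append("[^\\n\\r]")  # dot outside a character class
        else:
            out.append(ch)
        i += 1
    # Assumption: group before anchoring so alternation stays anchored.
    return "^(?:" + "".join(out) + ")$"


print(iregexp_to_ecmascript("a.b|[.]x"))  # ^(?:a[^\n\r]b|[.]x)$
```

Where a regexp literal is needed, the returned string would still have to be placed between / characters by the caller.¶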
The performance of an ECMAScript matcher can be increased by turning parenthesized regexps (last choice in production atom) into (?:...) constructions.¶
Perform the same steps as in Section 5.2 to obtain a valid regexp in PCRE [PCRE2], the Go programming language [RE2], and the Ruby programming language, except that the last step is:¶
Enclose the regexp in \A and \z.¶
Again, the performance can be increased by turning parenthesized regexps (production atom) into (?:...) constructions.¶
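The conversion of parenthesized regexps into (?:...) constructions might look as follows. This is a sketch under the assumption that the input is a valid I-Regexp, so that every unescaped ( outside a character class opens a plain group; the function name to_noncapturing is hypothetical.¶

```python
import re


def to_noncapturing(regexp: str) -> str:
    """Sketch: rewrite plain groups (...) as non-capturing groups (?:...)."""
    out = []
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(regexp):
        ch = regexp[i]
        if ch == "\\" and i + 1 < len(regexp):
            out.append(regexp[i:i + 2])  # keep escapes such as \( untouched
            i += 2
            continue
        if in_class:
            in_class = ch != "]"
            out.append(ch)  # a parenthesis inside a class is a literal
        elif ch == "[":
            in_class = True
            out.append(ch)
        elif ch == "(":
            out.append("(?:")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(to_noncapturing(r"(ab|c)+[(]\("))  # (?:ab|c)+[(]\(
# The rewritten pattern no longer allocates capture groups:
print(re.compile(to_noncapturing("(a|b)c")).groups)  # 0
```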
(Please submit the mapping needed for your favorite kind of regexp.)¶
Data modeling formats (YANG, CDDL) as well as query languages (jsonpath) often need a regular expression (regexp) sublanguage. There are many dialects of regular expressions in use in platforms, programming languages, and data modeling formats.¶
While regular expressions originally were intended to describe a formal language, i.e., to provide a Boolean matching function, they have turned into parsing functions for many applications, with capture groups, greedy/lazy/possessive variants, etc. Language features such as backreferences allow specifying languages that actually are context-free (Chomsky type 2) instead of the regular languages (Chomsky type 3) that regular expressions are named for.¶
YANG (Section 9.4.5 of [RFC7950]) and CDDL (Section 3.8.3 of [RFC8610]) have adopted the regexp language from W3C Schema [XSD-2]. XSD regexp is a pure matching language, i.e., XSD regexps can be used to match a string against them and yield a simple true or false result. XSD regexps are not as widely implemented as programming language regexp dialects such as those of Perl, Python, Ruby, Go [RE2], or JavaScript (ECMAScript) [ECMA-262]. The latter are often in a state of continuous development; in the best case (ECMAScript) there is a complete specification which however is highly complex (Section 21.2 of [ECMA-262] comprises 62 pages) and evolves on a yearly timeline, with significant additions. Regexp dialects such as PCRE [PCRE2] have evolved to cover a common set of functions available in parsing regexp dialects, offered in a widely available library.¶
With continuing accretion of complex features, parsing regexp libraries have become susceptible to bugs and performance degradation, in particular those that can be exploited in Denial of Service (DoS) attacks. The library RE2 that is compatible with Go language regexps strives to be immune to DoS attacks, making it attractive to applications such as query languages where an attacker could control the input. The problem remains that other bugs in such libraries can lead to exploitable vulnerabilities; at the time of writing, the Common Vulnerabilities and Exposures (CVE) system has 131 entries that mention the word "regex" [REGEX-CVE] (not all, but many of which are such bugs, with 23 matches for arbitrary code execution).¶
Implementations of YANG and CDDL often struggle with providing true XSD regexps; some instead cheat by providing one of the parsing regexp varieties, sometimes without even advertising this fact.¶
A matching regexp that does not use the more complex XSD features (Section 6.1) can usually be converted into a parsing regexp of many dialects by simply surrounding it with anchors of that dialect (e.g., ^ or \A and $ or \z). If the original matching regexps exceed the envelope of compatibility between dialects, this can lead to interoperability problems, or, worse, security vulnerabilities. Also, features of the target dialect such as capture groups may be triggered inadvertently, reducing performance.¶
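As an illustration of why the choice of anchors depends on the dialect, the following Python sketch compares an unanchored search with two ways of anchoring. It relies on two properties of Python's re module: the end-of-text anchor is spelled \Z there (where PCRE uses \z), and $ also matches just before a trailing newline. The helper name matches is hypothetical, and the non-capturing group around the pattern is an assumption added to keep alternations inside the anchors.¶

```python
import re


def matches(matching_regexp: str, text: str) -> bool:
    """Sketch: Boolean whole-text match built on a parsing regexp library."""
    anchored = r"\A(?:" + matching_regexp + r")\Z"
    return re.search(anchored, text) is not None


# Without anchors, a parsing regexp library finds a match anywhere in the text.
print(re.search("ab", "xabx") is not None)    # True
print(matches("ab", "xabx"))                  # False
print(matches("ab", "ab"))                    # True

# In Python, $ accepts a trailing newline, while \Z does not.
print(re.search("^ab$", "ab\n") is not None)  # True
print(matches("ab", "ab\n"))                  # False
```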
XSD regexps are relatively easy to implement or map to widely implemented parsing regexp dialects, with a small number of notable exceptions:¶
Character class subtraction. This is a very useful feature in many specifications, but it is unfortunately mostly absent from parsing regexp dialects.¶
Discussion: This absence can often be addressed by translating character class subtraction into positive character classes (possibly requiring significant expansion) and/or inserting negative lookahead assertions (which are not universally supported by regexp libraries, most notably not by RE2 [RE2]). This specification therefore opts for leaving out character class subtraction.¶
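The two translation strategies mentioned in this discussion can be illustrated with the XSD regexp [a-z-[aeiou]]+ (lowercase ASCII letters except vowels), an example chosen here for illustration only. The sketch below uses Python's re module, which supports negative lookahead; the constant names are hypothetical.¶

```python
import re

# XSD original (not valid in most parsing dialects): [a-z-[aeiou]]+

# Strategy 1: expand the subtraction into a positive character class.
EXPANDED = r"[b-df-hj-np-tv-z]+"

# Strategy 2: guard each character with a negative lookahead assertion
# (not available in libraries such as RE2).
WITH_LOOKAHEAD = r"(?:(?![aeiou])[a-z])+"

for word in ["rhythm", "rhyme", "xyz", "a", ""]:
    a = re.fullmatch(EXPANDED, word) is not None
    b = re.fullmatch(WITH_LOOKAHEAD, word) is not None
    print(repr(word), a, b)  # both strategies agree on every word
```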
Multi-character escapes. \d, \w, \s and their uppercase equivalents (complement classes) exhibit a large amount of variation between regexp flavors. (E.g., predefined character classes such as \w may be meant to be ASCII only, or they may encompass all letters and digits defined in Unicode. The latter is usually of interest in the application of query languages to text in human languages, while the former is of interest to a subset of applications in data model specifications.)¶
Unicode. While there is no doubt that a regexp flavor meant to last needs to be Unicode enabled, there are a number of aspects of this that need discussion. Not all regexp implementations that one might want to map I-Regexps into will support access to Unicode tables that enable executing on constructs such as \p{IsCoptic}; for mapping into such implementations, translation needs to be provided. Fortunately, the \p/\P feature in general is now quite widely available.¶
Discussion: The ASCII focus can partially be addressed by adding a constraint outside the regexp that the matched text has to be ASCII in the first place. This often is all that is needed where regexps are used to define lexical elements of a computer language. This reduces the size of the Unicode tables required in such a constrained implementation considerably. (In Appendix A, RFC 6643 contains a lone instance of \p{IsBasicLatin}{0,255}, which is needed to describe a transition from a legacy character set to Unicode. RFC 2622 contains [[:digit:]], [[:alpha:]], [[:alnum:]], albeit in a specification for the flex tool; this is intended to be close to \d, \p{L}, \w in an ASCII subset.)¶
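A constraint outside the regexp, as described in this discussion, could be sketched as follows in Python. The helper name ascii_match is hypothetical; the example uses Python's Unicode-aware \w only to show the effect of the outer check and is not itself an I-Regexp.¶

```python
import re


def ascii_match(regexp: str, text: str) -> bool:
    """Sketch: require ASCII-only text before applying the regexp."""
    return text.isascii() and re.fullmatch(regexp, text) is not None


# Python's \w covers Unicode letters, so the bare regexp accepts "café".
print(re.fullmatch(r"\w+", "caf\u00e9") is not None)  # True
# With the outer constraint, non-ASCII text is rejected up front.
print(ascii_match(r"\w+", "caf\u00e9"))               # False
print(ascii_match(r"\w+", "cafe"))                    # True
```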
This document makes no requests of IANA.¶
As discussed in Section 6, more complex regexp libraries are likely to contain exploitable bugs leading to crashes and remote code execution. There is also the problem that such libraries often have hard-to-predict performance characteristics, leading to attack vectors that overload an implementation by matching against an expensive attacker-controlled regexp.¶
I-Regexps have been designed to allow implementation in a way that is resilient to both threats; this objective needs to be addressed throughout the implementation effort.¶
This appendix contains a number of regular expressions that have been extracted from some recently published RFCs based on some ad-hoc matching. Multi-line constructions were not included. With the exception of some (often surprisingly dubious) usage of multi-character escapes, all regular expressions validate against the ABNF in Figure 1.¶
The multi-character escapes (MCE) or the character classes built around them used here can be substituted as shown in Table 1.¶
MCE/class | Substitute class |
---|---|
\S | [^ \t\n\r] |
[\S ] | [^\t\n\r] |
\d | [0-9] |
Note that the semantics of \d in XSD regular expressions is that of \p{Nd}; however, this would include all Unicode characters that are digits in various writing systems and certainly is not actually meant in the RFCs listed.¶
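The difference between the two readings of \d can be observed in Python, whose re module also treats \d in text patterns as any Unicode decimal digit. The following sketch is an illustration only; the sample string of Arabic-Indic digits is chosen here and does not come from the RFCs surveyed.¶

```python
import re

arabic_indic = "\u0663\u0664"  # ARABIC-INDIC DIGIT THREE and FOUR

# \d with Unicode semantics accepts digits from any writing system.
print(re.fullmatch(r"\d+", arabic_indic) is not None)     # True
# The substitute class [0-9] accepts ASCII digits only.
print(re.fullmatch(r"[0-9]+", arabic_indic) is not None)  # False
print(re.fullmatch(r"[0-9]+", "34") is not None)          # True
```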
This draft has been motivated by the discussion in the IETF JSONPATH WG about whether to include a regexp mechanism into the JSONPath query expression specification, as well as by previous discussions about the YANG pattern and CDDL .regexp features.¶
The basic approach for this draft was inspired by The I-JSON Message Format [RFC7493].¶