Network Working Group | C. Weber |
Internet-Draft | Casaba Security |
Intended status: Standards Track | July 12, 2011 |
Expires: January 13, 2012 |
Guidelines for Implementers of Internationalized Resource Identifiers (IRIs)
draft-weber-iri-guidelines-01
Some members of the implementation community have expressed confusion about the rules and algorithms for processing Internationalized Resource Identifiers (IRIs). This document aims to clarify these matters and improve interoperability around IRI processing by summarizing the steps required to prepare and parse arbitrary Unicode strings as Internationalized Resource Identifiers. Further goals of this document are to define limited scheme-specific rules around IRI processing and to define the steps required for producing the canonical form of an IRI.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 13, 2012.
Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Internationalized Resource Identifiers (IRIs) extend the Uniform Resource Identifier (URI) specification [RFC3986] by opening up the authority, path, query, and fragment components to the character space available in Unicode/ISO 10646. Arbitrary Unicode strings may be prepared and parsed into IRI sub-components which map directly to the same URI sub-components.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
This section defines terminology used in this document. Unicode characters are referred to using the notation described in section 3 of [RFC5137]. A Unicode code point is represented as U+NNNN where the NNNN string consists of the code point's hexadecimal numbers.
reference-string
The original and unprocessed input string being considered as an IRI reference.
pre-processed-reference-string
The reference-string that has been through the pre-processing steps.
relative reference
The term "relative reference" used in this document can be interpreted according to the rules in section 4.2 of [RFC3986] using the additionally allowed characters that [RFC3987] permits in each component.
percent-encode
Convert each octet of a sequence to %HH, where HH is the hexadecimal notation of the octet value.
This document makes reference to the following sources of information about the parsing of IRIs:
This section describes the pre-processing steps required to prepare an arbitrary Unicode reference-string for later parsing into IRI sub-components.
This is the pre-processed-reference-string ready for parsing.
With an arbitrary IRI string that has been through pre-processing, referred to as the "pre-processed-reference-string", this section describes the subsequent process of parsing the string into its five major IRI sub-components using rules defined by [RFC3896] (using an algorithm equivalent to Appendix B of [RFC3986]) but with updated ABNF of [RFC3987]. These rules are summarized here.
Reference: Section 3.2 of [RFC3987].
If the current string does not contain a ":" U+003A then the string does not contain a scheme and the pre-processed-reference-string may be handled as a relative reference according to the rules in section 4.2 of [RFC3986] using the additionally allowed characters that [RFC3987] permits.
If the first character of the string is not an <ALPHA> then this is not a valid scheme and the pre-processed-reference-string may be handled as a relative reference.
Consume all characters up to but not including the first occurrence of ":" U+003A. If the consumed substring contains any characters other than < ALPHA / DIGIT / "+" / "-" / "." > then it is not a valid scheme and the pre-processed-reference-string may be handled as a relative reference.
The consumed substring at this point is the scheme. Skip over the ":" U+003A character. Continue parsing the remaining string as an authority part.
The URI authority component may contain userinfo, a host, and a port. If the current string does not begin with the two characters "//" U+002F U+002F then the string is not an authority and may be handled as a path sub-component.
Consume up to the first occurrence of any one of the authority's terminating characters "/" U+002F, "?" U+003F, "#" U+0023, or the end of the string. This is the authority, also known as the iauthority under IRI RFC3987. Continue further parsing of the authority to identify the userinfo, host, and port parts.
The userinfo may come in the form of a username and password rendered as "user:password". The userinfo part may be parsed according to the rules of RFC3986 Section 3.2.1 with the updated ABNF for iuserinfo in RFC3987.
If the authority does not contain a "@" U+0040 then the string does not contain a userinfo part and the authority may be parsed for a host and port part.
From the beginning of the authority, consume each character up to but not including the first occurrence of "@" U+0040. This is the userinfo.
Further parsing of the userinfo is scheme-specific.
Skip over the first occurrence of "@" U+0040 following the userinfo, and continue parsing the remaining authority to identify the host.
The host part of authority may contain an IP-literal, IPv4address, or a reg-name according to the ABNF rules of RFC3986 updated to support Unicode characters in the ireg-name as described in RFC3987. Consume all characters up to but not including the last ":" U+003A character or the end of authority.
If this substring is determined to be an IP-literal or IPv4address, then the consumed characters are the host.
Else the substring is determined to be an ireg-name according to the ABNF naming convention from RFC3987. This is the host.
The host SHOULD be processed according to the rules of IDNA2008, but MAY be processed according to UTS46 or IDNA2003. If the host is in DNS Internet dot-notaion then it's labels SHOULD be converted to punycode. This is the host.
TODO: Error handling. Mention leaving the host name in pure Unicode form for intranet/local name scenarios that don't use DNS, e.g. WINS?
Further processing SHOULD skip the first occurrence of ":" U+003A and consume the remaining characters. If these characters are not *DIGIT then the port is invalid.
Else this is the port.
Consume the remaining pre-processed-reference-string up to but not including the first occurrence of a terminating character "?" U+003F, "#" U+0023, or the end of the string.
Percent-encode all characters present from the ucschar list.
If the path contains any characters not allowed by the ABNF of RFC3987 Section 2.2 or Section 7.2 then replace those characters with their percent-encoding.
This is the path.
If the terminating character was "?" then process the remaining string including the leading "?" according to "Identify the query".
If the terminating character was "#" then process the remaining string including the leading "#" according to "Identify the fragment".
TODO: Handling of special characters "/" and "\".
Consume the remaining string starting with the leading "?" and up to but not including the first occurrence of "#" or the end of the string.
Percent-encode all characters present from the ucschar list.
If the path contains any characters not allowed by the ABNF of [RFC3987] Section 2.2 or the lists in Section 7.2 then replace those characters with their percent-encoding.
This is the query component.
If the terminating character was "#" then process the remaining string including the leading "#" according to "Identify the fragment".
TODO: Handling of special characters "&", "?", "=", and "/"
Consume the remaining string starting with the leading "#" and to the end of the string.
Percent-encode all characters present from the ucschar list.
This is the fragment.
TODO: Handling of special characters "?" and "/"
TODO Apply limited scheme-specific rules. Reference [RFC4395]
(NOTE: Taken directly from [RFC3987]) For compatibility with existing deployed HTTP infrastructure, the following special case applies for schemes "http" and "https" and IRIs whose origin has a document charset other than one which is UCS-based (e.g., UTF-8 or UTF-16). In such a case, the "query" component of an IRI is mapped into a URI by using the document charset rather than UTF-8 as the binary representation before pct-encoding. This mapping is not applied for any other scheme or component.
Reference: Section 3.5 and 7.2 of [RFC3987].
TODO reference? http://tools.ietf.org/html/draft-hoehrmann-javascript-scheme-03
TODO reference [RFC6068]
TODO reference [RFC2397]
To follow.
[NOTE: Call out Special Case here or earlier? The U+005C should be either percent-encoded or converted to U+002F. Of course, folks who want to name such files will want to use the escaped form of \ in order for their site to work in other browsers.
For each character which is not allowed anywhere in a valid URI, apply the following steps.
Reference: Section 3.5 of [RFC3987].
To follow.
This document has no actions for the IANA.
Many thanks to Mykyta Yevstifeyev, Addison Phillips, Mark Davis, Anne van Kesteren, Adam Barth, Martin Duerst, and Julian Reschke for their feedback.
[2] | Barth, A, "How Browsers Process URLs", Internet-Draft draft-abarth-url-01, April 2011. |
[3] | Hickson, I., "HTML Living Standard: URLs", WHATWG ?, 2011. |
[4] | Fielding, R., "Change proposal for ISSUE-56", July 2010. |
[5] | W3C, "URIs, URLs, and URNs: Clarifications and Recommendations 1.0", W3C Note 21 September 2001, September 2001. |
[6] | Masinter, L., "HTML CHANGE PROPOSAL; change definition of URL to normative reference to IRIBIS", February 2010. |
[7] | Davis, M., "Revision of UBA for improved display of URL/IRIs", May 2011. |
[8] | Davis, M. and M. Duerst, "URLs", April 2003. |
[RFC3987] | Duerst, M, Suignard, M and L Masinter, "Internationalized Resource Identifiers (IRIs)", Internet-Draft draft-ietf-iri-3987bis-05, March 2011. |
[RFC3986] | Berners-Lee, T, Fielding, R and L Masinter, "Uniform Resource Identifier (URI)", Internet-Standard rfc3986, January 2005. |
[RFC3629] | Yergeau, F, "UTF-8, a transformation format of ISO 10646", Internet-Standard rfc3629, November 2003. |
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |