Network Working Group                                          M. Duerst
Internet-Draft                                  Aoyama Gakuin University
Obsoletes: RFC 3987                                          M. Suignard
(if approved)                                         Unicode Consortium
Intended status: Standards Track                             L. Masinter
Expires: April 29, 2010                                            Adobe
                                                        October 26, 2009
Internationalized Resource Identifiers (IRIs)
draft-duerst-iri-bis-07
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79. This document may contain material
from IETF Documents or IETF Contributions published or made publicly
available before November 10, 2008. The person(s) controlling the
copyright in some of this material may not have granted the IETF
Trust the right to allow modifications of such material outside the
IETF Standards Process. Without obtaining an adequate license from
the person(s) controlling the copyright in such materials, this
document may not be modified outside the IETF Standards Process, and
derivative works of it may not be created outside the IETF Standards
Process, except to format it for publication as an RFC or to
translate it into languages other than English.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 29, 2010.
Copyright Notice
Copyright (c) 2009 IETF Trust and the persons identified as the
Duerst, et al. Expires April 29, 2010 [Page 1]
Internet-Draft IRIs October 2009
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents in effect on the date of
publication of this document (http://trustee.ietf.org/license-info).
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.
Abstract
This document defines the Internationalized Resource Identifier (IRI)
protocol element, as an extension of the Uniform Resource Identifier
(URI). An IRI is a sequence of characters from the Universal
Character Set (Unicode/ISO 10646). Grammar and processing rules are
given for IRIs and related syntactic forms.
In addition, this document provides named additional rule sets for
processing otherwise invalid IRIs, in a way that supports other
specifications that wish to mandate common behavior for 'error'
handling. In particular, rules used in some XML languages (LEIRI)
and web applications are given.
Defining IRI as a new protocol element (rather than updating or
extending the definition of URI) allows independent orderly
transitions: other protocols and languages that use URIs must
explicitly choose to allow IRIs.
Guidelines are provided for the use and deployment of IRIs and
related protocol elements when revising protocols, formats, and
software components that currently deal only with URIs.
[RFC Editor: Please remove this paragraph before publication.] This
document is intended to update RFC 3987 and move towards IETF Draft
Standard. This is an interim version in preparation for the IRI BOF
at IETF 76 in Hiroshima. For discussion and comments on this draft,
please use the [email protected] mailing list.
Table of Contents
1. Introduction
1.1. Overview and Motivation
1.2. Applicability
1.3. Definitions
1.4. Notation
2. IRI Syntax
2.1. Summary of IRI Syntax
2.2. ABNF for IRI References and IRIs
3. Processing IRIs and related protocol elements
3.1. Converting to UCS
3.2. Parse the IRI into IRI components
3.3. General percent-encoding of IRI components
3.4. Mapping ireg-name
3.5. Mapping query components
3.6. Mapping IRIs to URIs
3.7. Converting URIs to IRIs
3.7.1. Examples
4. Bidirectional IRIs for Right-to-Left Languages
4.1. Logical Storage and Visual Presentation
4.2. Bidi IRI Structure
4.3. Input of Bidi IRIs
4.4. Examples
5. Normalization and Comparison
5.1. Equivalence
5.2. Preparation for Comparison
5.3. Comparison Ladder
5.3.1. Simple String Comparison
5.3.2. Syntax-Based Normalization
5.3.3. Scheme-Based Normalization
5.3.4. Protocol-Based Normalization
6. Use of IRIs
6.1. Limitations on UCS Characters Allowed in IRIs
6.2. Software Interfaces and Protocols
6.3. Format of URIs and IRIs in Documents and Protocols
6.4. Use of UTF-8 for Encoding Original Characters
6.5. Relative IRI References
7. Liberal handling of otherwise invalid IRIs
7.1. LEIRI processing
7.2. Web Address processing
7.3. Characters not allowed in IRIs
8. URI/IRI Processing Guidelines (Informative)
8.1. URI/IRI Software Interfaces
8.2. URI/IRI Entry
8.3. URI/IRI Transfer between Applications
8.4. URI/IRI Generation
8.5. URI/IRI Selection
8.6. Display of URIs/IRIs
8.7. Interpretation of URIs and IRIs
8.8. Upgrading Strategy
9. IANA Considerations
10. Security Considerations
11. Acknowledgements
12. Open Issues
13. Change Log
13.1. Changes from -06 to this document
13.1.1. OLD WAY
13.1.2. NEW WAY
13.2. Changes from -05 to -06
13.3. Changes from -04 to -05
13.4. Changes from -03 to -04
13.5. Changes from -02 to -03
13.6. Changes from -01 to -02
13.7. Changes from -00 to -01
13.8. Changes from RFC 3987 to -00
14. References
14.1. Normative References
14.2. Informative References
Appendix A. Design Alternatives
A.1. New Scheme(s)
A.2. Character Encodings Other Than UTF-8
A.3. New Encoding Convention
A.4. Indicating Character Encodings in the URI/IRI
Authors' Addresses
1. Introduction
1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of
natural languages. This usage has many advantages: Such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than
English, however, the natural script uses characters other than A -
Z. For many people, handling Latin characters is as difficult as
handling the characters of other scripts is for those who use only
the Latin alphabet. Many languages with non-Latin scripts are
transcribed with Latin letters. These transcriptions are now often
used in URIs, but they introduce additional difficulties.
The infrastructure for the appropriate handling of characters from
additional scripts is now widely deployed in operating system and
application software. Software that can handle a wide variety of
scripts and languages at the same time is increasingly common. Also,
an increasing number of protocols and formats can carry a wide range
of characters.
URIs are used both as a protocol element (for transmission and
processing by software) and also a presentation element (for display
and handling by people who read, interpret, coin, or guess them).
The transition between these roles is more difficult and complex when
dealing with the larger set of characters than allowed for URIs in
[RFC3986].
This document defines the protocol element called Internationalized
Resource Identifier (IRI), which allows applications of URIs to be
extended to use resource identifiers that have a much wider
repertoire of characters. It also provides corresponding
"internationalized" versions of other constructs from [RFC3986], such
as URI references. The syntax of IRIs is defined in Section 2.
Using characters outside of A - Z in IRIs adds a number of
difficulties. Section 4 discusses the special case of bidirectional
IRIs using characters from scripts written right-to-left. Section 5
discusses various forms of equivalence between IRIs. Section 6
discusses the use of IRIs in different situations. Section 8 gives
additional informative guidelines. Section 10 discusses IRI-specific
security considerations.
1.2. Applicability
IRIs are designed to allow protocols and software that deal with URIs
to be updated to handle IRIs. A "URI scheme" (as defined by
[RFC3986] and registered through the IANA process defined in
[RFC4395]) also serves as an "IRI scheme". Processing of IRIs is
accomplished by extending the URI syntax while retaining (and not
expanding) the set of "reserved" characters, such that the syntax for
any URI scheme may be uniformly extended to allow non-ASCII
characters. In addition, following parsing of an IRI, it is possible
to construct a corresponding URI by first encoding characters outside
of the allowed URI range and then reassembling the components.
Practical use of IRI forms in place of URI forms depends on the
following conditions being met:
a. A protocol or format element MUST be explicitly designated to be
able to carry IRIs. The intent is to avoid introducing IRIs into
contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other
hand, in the [RFC2616] definition of HTTP/1.1, the Request URI is
defined as a URI, which means that direct use of IRIs is not
allowed in HTTP requests.
b. The protocol or format carrying the IRIs MUST have a mechanism to
represent the wide range of characters used in IRIs, either
natively or by some protocol- or format-specific escaping
mechanism (for example, numeric character references in [XML1]).
c. The URI scheme definition, if it explicitly allows a percent sign
("%") in any syntactic component, SHOULD define the interpretation
of sequences of percent-encoded octets (using "%XX" hex octets) as
octets from sequences of UTF-8 encoded strings; this is recommended
in the guidelines for registering new schemes, [RFC4395]. For
example, this is the practice for IMAP URLs [RFC2192], POP URLs
[RFC2384] and the URN syntax [RFC2141]. Note that use of
percent-encoding may also be restricted in some situations, for
example, URI schemes that disallow percent-encoding might still be
used with a fragment identifier which is percent-encoded (e.g.,
[XPointer]). See Section 6.4 for further discussion.
1.3. Definitions
The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646].
character: A member of a set of elements used for the organization,
control, or representation of data. For example, "LATIN CAPITAL
LETTER A" names a character.
octet: An ordered sequence of eight bits considered as a unit.
character repertoire: A set of characters (set in the mathematical
sense).
sequence of characters: A sequence of characters (one after
another).
sequence of octets: A sequence of octets (one after another).
character encoding: A method of representing a sequence of
characters as a sequence of octets (maybe with variants). Also, a
method of (unambiguously) converting a sequence of octets into a
sequence of characters.
charset: The name of a parameter or attribute used to identify a
character encoding.
UCS: Universal Character Set. The coded character set defined by
ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
IRI reference: Denotes the common usage of an Internationalized
Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs.
URL: The term "URL" was originally used [RFC1738] for roughly what
is now called a "URI". Books, software and documentation often
refer to URIs and IRIs using the "URL" term. Some usages
restrict "URL" to those URIs which are not URNs. Because of the
ambiguity of the term, using the term "URL" is NOT RECOMMENDED in
formal documents.
LEIRI (Legacy Extended IRI) processing: This term was used in
various XML specifications to refer to strings that, although not
valid IRIs, were acceptable input to the processing rules in
Section 7.1.
(Web Address, Hypertext Reference, HREF): These terms have been
added in this document for convenience, to allow other
specifications to refer to those strings that, although not valid
IRIs, are acceptable input to the processing rules in Section 7.2.
This usage corresponds to the parsing rules of some popular web
browsing applications. ISSUE: Need to find a good name/
abbreviation for these.
running text: Human text (paragraphs, sentences, phrases) with
syntax according to orthographic conventions of a natural
language, as opposed to syntax defined for ease of processing by
machines (e.g., markup, programming languages).
protocol element: Any portion of a message that affects processing
of that message by the protocol in question.
presentation element: A presentation form corresponding to a
protocol element; for example, using a wider range of characters.
create (a URI or IRI): With respect to URIs and IRIs, the term is
used for the initial creation. This may be the initial creation
of a resource with a certain identifier, or the initial exposition
of a resource under a particular identifier.
generate (a URI or IRI): With respect to URIs and IRIs, the term is
used when the identifier is generated by derivation from other
information.
parsed URI component: When a URI processor parses a URI (following
the generic syntax or a scheme-specific syntax), the result is a
set of parsed URI components, each of which has a type
(corresponding to the syntactic definition) and a sequence of URI
characters.
parsed IRI component: When an IRI processor parses an IRI directly,
following the general syntax or a scheme-specific syntax, the
result is a set of parsed IRI components, each of which has a type
(corresponding to the syntactic definition) and a sequence of IRI
characters. (This definition is analogous to "parsed URI
component".)
IRI scheme: A URI scheme may also be known as an "IRI scheme" if the
scheme's syntax has been extended to allow non-US-ASCII characters
according to the rules in this document.
1.4. Notation
RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters in examples.
In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document
uses two notations: 'XML Notation' and 'Bidi Notation'.
XML Notation uses a leading '&#x', a trailing ';', and the
hexadecimal number of the character in the UCS in between. For
example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this
notation, an actual '&' is denoted by '&amp;'.
Bidi Notation is used for bidirectional examples: Lower case letters
stand for Latin letters or other letters that are written left to
right, whereas upper case letters represent Arabic or Hebrew letters
that are written right to left.
To denote actual octets in examples (as opposed to percent-encoded
octets), the two hex digits denoting the octet are enclosed in "<"
and ">". For example, the octet often denoted as 0xc9 is denoted
here as <c9>.
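As a non-normative illustration, the XML Notation described above can
be expanded mechanically. The following Python sketch is not part of
this specification; the function name expand_xml_notation is
hypothetical, and the sketch assumes that only hexadecimal references
of the form '&#x...;' and the sequence '&amp;' need to be handled.

```python
import re


def expand_xml_notation(text: str) -> str:
    """Expand the XML Notation used in the examples of this document.

    '&#xHHHH;' becomes the UCS character with that hexadecimal number,
    and '&amp;' becomes an actual '&'.
    """
    def to_char(match: "re.Match") -> str:
        return chr(int(match.group(1), 16))

    # Numeric references are expanded first, so that an escaped
    # '&amp;#x44F;' ends up as the literal text '&#x44F;'.
    expanded = re.sub(r"&#x([0-9A-Fa-f]+);", to_char, text)
    return expanded.replace("&amp;", "&")


if __name__ == "__main__":
    # Prints the host name with the actual non-ASCII character in it.
    print(expand_xml_notation("http://r&#xE9;sum&#xE9;.example.org"))
```

Expanding the numeric references before '&amp;' is a design choice of
this sketch; the reverse order would wrongly turn an escaped
ampersand followed by '#x44F;' into a character.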
In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" are to be interpreted as described in [RFC2119].
2. IRI Syntax
This section defines the syntax of Internationalized Resource
Identifiers (IRIs).
As with URIs, an IRI is defined as a sequence of characters, not as a
sequence of octets. This definition accommodates the fact that IRIs
may be written on paper or read over the radio as well as stored or
transmitted digitally. The same IRI might be represented as
different sequences of octets in different protocols or documents if
these protocols or documents use different character encodings
(and/or transfer encodings). Using the same character encoding as
the containing protocol or document ensures that the characters in
the IRI can be handled (e.g., searched, converted, displayed) in the
same way as the rest of the protocol or document.
2.1. Summary of IRI Syntax
IRIs are defined by extending the URI syntax in [RFC3986]: the class
of unreserved characters is extended by adding the characters of the
UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to
the limitations given in the syntax rules below and in Section 6.1.
The syntax and use of components and reserved characters is the same
as that in [RFC3986]. Each "URI scheme" thus also functions as an
"IRI scheme", in that scheme-specific parsing rules for URIs of a
scheme are extended to allow parsing of IRIs using the same
parsing rules.
All the operations defined in [RFC3986], such as the resolution of
relative references, can be applied to IRIs by IRI-processing
software in exactly the same way as they are for URIs by URI-
processing software.
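As a non-normative illustration of the paragraph above, a reference
resolution routine written for URIs can often be used unchanged on
IRIs, because it only inspects ASCII delimiters. The sketch below
assumes Python's standard urllib.parse.urljoin, which operates on
character strings; the function name resolve and the example.org
identifiers are made up for this example.

```python
from urllib.parse import urljoin


def resolve(base_iri: str, reference: str) -> str:
    """Resolve an IRI reference against a base IRI, as is done for URIs."""
    # urljoin splits on "/", "?", and "#" only, so characters beyond
    # U+007F in the path, query, or fragment pass through untouched.
    return urljoin(base_iri, reference)


if __name__ == "__main__":
    print(resolve("http://example.org/a/b", "../r\u00e9sum\u00e9"))
    print(resolve("http://example.org/a/b?x", "#\u8a9e"))
```

No percent-encoding happens in this step; the resolved result is
still an IRI, not a URI.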
Characters outside the US-ASCII repertoire MUST NOT be reserved and
therefore MUST NOT be used for syntactical purposes, such as to
delimit components in newly defined schemes. For example, U+00A2,
CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
the 'iunreserved' category. This is similar to the fact that it is
not possible to use '-' as a delimiter in URIs, because it is in the
'unreserved' category.
2.2. ABNF for IRI References and IRIs
An ABNF definition for IRI references (which are the most general
concept and the start of the grammar) and IRIs is given here. The
syntax of this ABNF is described in [STD68]. Character numbers are
taken from the UCS, without implying any actual binary encoding.
Terminals in the ABNF are characters, not octets.
The following grammar closely follows the URI grammar in [RFC3986],
except that the range of unreserved characters is expanded to include
UCS characters, with the restriction that private UCS characters can
occur only in query parts. The grammar is split into two parts:
Rules that differ from [RFC3986] because of the above-mentioned
expansion, and rules that are the same as those in [RFC3986]. For
rules that are different from those in [RFC3986], the names of the
non-terminals have been changed as follows. If the non-terminal
contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i'
has been prefixed.
The following rules are different from those in [RFC3986]:
IRI = scheme ":" ihier-part [ "?" iquery ]
[ "#" ifragment ]
ihier-part = "//" iauthority ipath-abempty
/ ipath-absolute
/ ipath-rootless
/ ipath-empty
IRI-reference = IRI / irelative-ref
absolute-IRI = scheme ":" ihier-part [ "?" iquery ]
irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ]
irelative-part = "//" iauthority ipath-abempty
/ ipath-absolute
/ ipath-noscheme
/ ipath-empty
iauthority = [ iuserinfo "@" ] ihost [ ":" port ]
iuserinfo = *( iunreserved / pct-form / sub-delims / ":" )
ihost = IP-literal / IPv4address / ireg-name
pct-form = pct-encoded
ireg-name = *( iunreserved / sub-delims )
ipath = ipath-abempty ; begins with "/" or is empty
/ ipath-absolute ; begins with "/" but not "//"
/ ipath-noscheme ; begins with a non-colon segment
/ ipath-rootless ; begins with a segment
/ ipath-empty ; zero characters
ipath-abempty = *( path-sep isegment )
ipath-absolute = path-sep [ isegment-nz *( path-sep isegment ) ]
ipath-noscheme = isegment-nz-nc *( path-sep isegment )
ipath-rootless = isegment-nz *( path-sep isegment )
ipath-empty = 0<ipchar>
path-sep = "/"
isegment = *ipchar
isegment-nz = 1*ipchar
isegment-nz-nc = 1*( iunreserved / pct-form / sub-delims
/ "@" )
; non-zero-length segment without any colon ":"
ipchar = iunreserved / pct-form / sub-delims / ":"
/ "@"
iquery = *( ipchar / iprivate / "/" / "?" )
ifragment = *( ipchar / "/" / "?" / "#" )
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
iprivate = %xE000-F8FF / %xE0000-E0FFF / %xF0000-FFFFD
/ %x100000-10FFFD
Some productions are ambiguous. The "first-match-wins" (a.k.a.
"greedy") algorithm applies. For details, see [RFC3986].
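The 'ucschar', 'iprivate', and 'iunreserved' rules above are plain
code point ranges, so membership can be tested directly. The
following Python sketch is a non-normative illustration; the names
UCSCHAR_RANGES, IPRIVATE_RANGES, is_ucschar, is_iprivate, and
is_iunreserved are hypothetical, and the ranges are copied from the
grammar in this section.

```python
# ucschar: the three BMP ranges, planes 1 to 13, and most of plane 14.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))

# iprivate: allowed only in the query part of an IRI.
IPRIVATE_RANGES = [
    (0xE000, 0xF8FF), (0xE0000, 0xE0FFF),
    (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),
]


def _in_ranges(ch: str, ranges) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)


def is_ucschar(ch: str) -> bool:
    return _in_ranges(ch, UCSCHAR_RANGES)


def is_iprivate(ch: str) -> bool:
    return _in_ranges(ch, IPRIVATE_RANGES)


def is_iunreserved(ch: str) -> bool:
    # ALPHA / DIGIT are ASCII only; str.isalnum() alone would also
    # accept non-ASCII letters, hence the explicit isascii() check.
    if ch.isascii():
        return ch.isalnum() or ch in "-._~"
    return is_ucschar(ch)
```

For instance, U+00A2 (CENT SIGN) is accepted by is_iunreserved, which
is why Section 2.1 says it cannot serve as a delimiter.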
The following rules are the same as those in [RFC3986]:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
port = *DIGIT
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
h16 = 1*4HEXDIG
ls32 = ( h16 ":" h16 ) / IPv4address
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
This syntax does not support IPv6 scoped addressing zone identifiers.
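Several of the rules above translate directly into regular
expressions. The following Python sketch is a non-normative
illustration covering only 'scheme', 'dec-octet', and 'IPv4address';
the names DEC_OCTET, IPV4_RE, SCHEME_RE, is_ipv4address, and
is_scheme are hypothetical.

```python
import re

# dec-octet: 0-9, 10-99, 100-199, 200-249, 250-255 (no leading zeros).
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_ipv4address(text: str) -> bool:
    return IPV4_RE.fullmatch(text) is not None


def is_scheme(text: str) -> bool:
    return SCHEME_RE.fullmatch(text) is not None
```

Using fullmatch rather than match ensures that trailing characters,
such as the extra digit in "256.0.0.1", cause the test to fail.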
3. Processing IRIs and related protocol elements
IRIs are meant to replace URIs in identifying resources within new
versions of protocols, formats, and software components that use a
UCS-based character repertoire. Protocols and components may use and
process IRIs directly. However, there are still numerous systems and
protocols which only accept URIs or components of parsed URIs; that
is, they only accept sequences of characters within the subset of US-
ASCII characters allowed in URIs.
This section defines specific processing steps for IRI consumers
which establish the relationship between the string given and the
interpreted derivatives. These processing steps apply to both IRIs
and IRI references (i.e., absolute or relative forms); for IRIs, some
steps are scheme specific.
3.1. Converting to UCS
Input that is already in a Unicode form (i.e., a sequence of Unicode
characters or an octet-stream representing a Unicode-based character
encoding such as UTF-8 or UTF-16) should be left as is and not
normalized (see Section 5.3.2.2).
If the IRI or IRI reference is an octet stream in some known non-
Unicode character encoding, convert the IRI to a sequence of
characters from the UCS; this sequence SHOULD also be normalized
according to Unicode Normalization Form C (NFC, [UTR15]). In this
case, retain the original character encoding as the "document
character encoding". (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
DO, CHANGE? )
In other cases (written on paper, read aloud, or otherwise
represented independent of any character encoding) represent the IRI
as a sequence of characters from the UCS normalized according to
Unicode Normalization Form C (NFC, [UTR15]).
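The three cases of this section can be sketched as follows. This
Python fragment is a non-normative illustration; the function names
from_octets and from_transcription are hypothetical, and treating
every codec whose canonical Python name starts with "utf-" as a
"Unicode form" is an assumption of the sketch, not a rule of this
document.

```python
import codecs
import unicodedata


def from_octets(data: bytes, encoding: str):
    """Convert an octet stream in a known character encoding to UCS.

    Returns the character sequence and the encoding name, which the
    caller can retain as the "document character encoding".
    """
    name = codecs.lookup(encoding).name
    text = data.decode(name)
    if name.startswith("utf-"):
        # Already in a Unicode form: leave as is, do not normalize.
        return text, name
    # Known non-Unicode encoding: normalize to NFC after conversion.
    return unicodedata.normalize("NFC", text), name


def from_transcription(text: str) -> str:
    """IRI taken from paper, speech, etc.: represent it in NFC."""
    return unicodedata.normalize("NFC", text)
```

Note that the decomposed sequence "e" followed by U+0301 is left
untouched when it arrives as UTF-8, but becomes U+00E9 when it is
transcribed from a non-digital source.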
3.2. Parse the IRI into IRI components
Parse the IRI, either as a relative reference (no scheme) or using
scheme specific processing (according to the scheme given); the
result is a set of parsed IRI components. (NOTE: FIX
BEFORE RELEASE: INTENT IS THAT ALL IRI SCHEMES THAT USE GENERIC
SYNTAX AND ALLOW NON-ASCII AUTHORITY CAN ONLY USE AUTHORITY FOR NAMES
THAT FOLLOW PUNICODE.)
NOTE: Parsing into components establishes a correspondence between
substrings of the IRI and the grammar productions they match. For
example, in [HTML5], the protocol components of
interest are SCHEME (scheme), HOST (ireg-name), PORT (port), the PATH
(ipath after the initial "/"), QUERY (iquery), FRAGMENT (ifragment),
and AUTHORITY (iauthority).
Subsequent processing rules are sometimes used to define other
syntactic components. For example, [HTML5] defines APIs for IRI
processing; in these APIs:
HOSTSPECIFIC the substring that follows the substring matched by the
iauthority production, or the whole string if the iauthority
production wasn't matched.
HOSTPORT if there is a scheme component and a port component and the
port given by the port component is different than the default
port defined for the protocol given by the scheme component, then
HOSTPORT is the substring that starts with the substring matched
by the host production and ends with the substring matched by the
port production, and includes the colon in between the two.
Otherwise, it is the same as the host component.
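A minimal sketch of these components for the generic-syntax case is
given below. It is not the [HTML5] algorithm: it reuses the regular
expression from Appendix B of [RFC3986], and the names
parse_components, HOST_PORT and DEFAULT_PORTS (including the ports
listed in that table) are assumptions made for illustration only.

```python
import re

# Generic component split from RFC 3986, Appendix B, applied to an IRI.
GENERIC = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)
# Host (or bracketed IP literal), optionally followed by ":" and a port.
HOST_PORT = re.compile(r"^(\[[^\]]*\]|[^:]*)(?::(\d*))?$")
# Hypothetical default-port table; a real one would cover more schemes.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def parse_components(iri):
    """Split an IRI reference into the components named in the text."""
    m = GENERIC.match(iri)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    host = port = hostport = None
    if authority is not None:
        rest = authority.rpartition("@")[2]  # drop any userinfo
        hp = HOST_PORT.match(rest)
        host, port = hp.group(1, 2) if hp else (rest, None)
        default = DEFAULT_PORTS.get((scheme or "").lower())
        if scheme and port and int(port) != default:
            # Non-default port: host, the colon, and the port.
            hostport = host + ":" + port
        else:
            hostport = host
    return {
        "SCHEME": scheme,
        "AUTHORITY": authority,
        "HOST": host,
        "PORT": port,
        "PATH": path[1:] if path.startswith("/") else path,
        "QUERY": query,
        "FRAGMENT": fragment,
        # Everything after the authority, or the whole string without one.
        "HOSTSPECIFIC": iri[m.end(3):] if authority is not None else iri,
        "HOSTPORT": hostport,
    }
```

Components that are absent are reported as None so that they can be
told apart from components that are present but empty.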
3.3. General percent-encoding of IRI components
For most IRI components, it is possible to map the IRI component to
an equivalent URI component by percent-encoding those characters not
allowed in URIs. Previous processing steps will have removed some
characters, and the interpretation of reserved characters will have
already been done (with the syntactic reserved characters outside of
the IRI component). This mapping is defined for all sequences of
Unicode characters, whether or not they are valid for the component
in question.
For each character which is not allowed in a valid URI (NOTE: WHAT IS
THE RIGHT REFERENCE HERE), apply the following steps.
Convert to UTF-8 Convert the character to a sequence of one or more
octets using UTF-8 [RFC3629].
Percent encode Convert each octet of this sequence to %HH, where HH
is the hexadecimal notation of the octet value. The hexadecimal
notation SHOULD use uppercase letters. (This is the general URI
percent-encoding mechanism in Section 2.1 of [RFC3986].)
Note that the mapping is an identity transformation for parsed URI
components of valid URIs, and is idempotent: applying the mapping a
second time will not change anything.
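The two steps can be sketched as follows. Because the text leaves
open which reference defines "allowed in a valid URI", this sketch
assumes the unreserved and reserved sets of [RFC3986] plus "%"; the
names URI_CHARS and percent_encode_component are hypothetical.

```python
import string

# Assumption: characters allowed in URIs are RFC 3986 unreserved and
# reserved characters, plus "%" (taken to start an existing pct-encoding).
URI_CHARS = set(
    string.ascii_letters + string.digits
    + "-._~"            # rest of unreserved
    + ":/?#[]@"         # gen-delims
    + "!$&'()*+,;="     # sub-delims
    + "%"
)


def percent_encode_component(component):
    """Percent-encode every character that is not allowed in a URI."""
    out = []
    for ch in component:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Convert to UTF-8, then write each octet as %HH in uppercase.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)
```

Since "%" and all other URI characters pass through unchanged,
applying the function to its own output returns the same string, which
is the idempotence property noted above.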
3.4. Mapping ireg-name
Schemes that allow non-ASCII characters in the reg-name (ireg-
name) position MUST convert the ireg-name component of an IRI as
follows:
Replace the ireg-name part of the IRI by the part converted using the
ToASCII operation specified in Section 4.1 of [RFC3490] on each dot-
separated label, and by using U+002E (FULL STOP) as a label
separator, with the flag UseSTD3ASCIIRules set to FALSE, and with the
flag AllowUnassigned set to FALSE. The ToASCII operation may fail,
which means that the IRI cannot be resolved. If the domain name
conversion fails, then the entire IRI conversion fails. Processors
that have no mechanism for signalling a failure
MAY instead substitute an otherwise invalid host name, although such
processing SHOULD be avoided.
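As a rough illustration, the per-label conversion can be sketched with
Python's built-in "idna" codec, which implements the ToASCII operation
of [RFC3490]. The names map_ireg_name and IRIConversionError are
hypothetical. The codec does not expose the UseSTD3ASCIIRules and
AllowUnassigned flags, so this sketch only approximates the flag
settings required above.

```python
class IRIConversionError(ValueError):
    """Raised when the ireg-name cannot be converted (hypothetical name)."""


def map_ireg_name(ireg_name):
    """Apply ToASCII to each dot-separated label of an ireg-name."""
    labels = []
    for label in ireg_name.split("."):
        if label == "":
            # Keep empty labels (for example, a trailing dot) untouched.
            labels.append(label)
            continue
        try:
            labels.append(label.encode("idna").decode("ascii"))
        except UnicodeError as exc:
            # A failed label means the whole IRI conversion fails.
            raise IRIConversionError(
                "cannot convert label {!r}".format(label)
            ) from exc
    # U+002E (FULL STOP) is used as the label separator in the result.
    return ".".join(labels)
```

With this sketch, the host "résumé.example.org" maps to
"xn--rsum-bpad.example.org", and a failure is signalled by an
exception rather than by substituting an invalid host name.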
For example, the IRI
"http://résumé.example.org"
MAY be converted to
"http://xn--rsum-bpad.example.org"
; conversion to percent-encoded form, e.g.,
"http://r%C3%A9sum%C3%A9.example.org", MUST NOT be performed.
Note: Domain Names may appear in parts of an IRI other than the
ireg-name part. It is the responsibility of scheme-specific
implementations (if the Internationalized Domain Name is part of
the scheme syntax) or of server-side implementations (if the
Internationalized Domain Name is part of 'iquery') to apply the
necessary conversions at the appropriate point. Example: Trying
to validate the Web page at
http://résumé.example.org would lead to an IRI of
http://validator.w3.org/check?uri=http%3A%2F%2Frésumé.
example.org, which would convert to a URI of
http://validator.w3.org/check?uri=http%3A%2F%2Fr%C3%A9sum%C3%A9.
example.org. The server-side implementation is responsible for
making the necessary conversions to be able to retrieve the Web
page.
Note: In this process, characters allowed in URI references and
existing percent-encoded sequences are not encoded further. (This
mapping is similar to, but different from, the encoding applied
when arbitrary content is included in some part of a URI.) For
example, an IRI of
"http://www.example.org/red%09rosé#red" (in XML notation) is
converted to
"http://www.example.org/red%09ros%C3%A9#red", not to something
like
"http%3A%2F%2Fwww.example.org%2Fred%2509ros%C3%A9%23red".
((DESIGN QUESTION: What about e.g.
http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get
converted to punycode, or not?))
3.5. Mapping query components
((NOTE: SEE ISSUES LIST)) For compatibility with existing deployed
HTTP infrastructure, the following special case applies for schemes
"http" and "https" and IRIs whose origin has a document charset other
than one which is UCS-based (e.g., UTF-8 or UTF-16). In such a case,
the "query" component of an IRI is mapped into a URI by using the
document charset rather than UTF-8 as the binary representation
before pct-encoding. This mapping is not applied for any other
scheme or component.
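The special case can be sketched as below. The function name
map_query and the test for "UCS-based" charsets are assumptions of
this sketch, as is the set of characters left unencoded; the text does
not say what to do with characters that the document charset cannot
represent, so here they simply raise an error.

```python
import codecs
import string

# Assumption: RFC 3986 unreserved and reserved characters, plus "%".
URI_CHARS = set(
    string.ascii_letters + string.digits + "-._~" + ":/?#[]@" + "!$&'()*+,;=" + "%"
)


def map_query(query, scheme, document_charset):
    """Map an iquery component to a URI query component."""
    name = codecs.lookup(document_charset).name
    ucs_based = name.startswith(("utf-8", "utf-16", "utf-32"))
    # Only http/https with a non-UCS document charset use that charset.
    special = scheme.lower() in ("http", "https") and not ucs_based
    encoding = document_charset if special else "utf-8"
    out = []
    for ch in query:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError if the charset cannot represent ch.
            out.extend("%{:02X}".format(octet) for octet in ch.encode(encoding))
    return "".join(out)
```

For example, under these assumptions the query "q=résumé" in an
iso-8859-1 document becomes "q=r%E9sum%E9" for "http", but
"q=r%C3%A9sum%C3%A9" for any other scheme or for a UTF-8 document.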
3.6. Mapping IRIs to URIs
The canonical mapping from an IRI to a URI is defined by applying the
mapping above (from IRI to URI components) and then reassembling a
URI from the parsed URI components using the original punctuation
that delimited the IRI components.
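Putting the component mappings together, the reassembly can be
sketched as follows for a generic-syntax IRI whose document charset is
UCS-based (so the query special case does not apply). The split uses
the regular expression from Appendix B of [RFC3986]; the names
iri_to_uri, map_authority and pct are hypothetical, and Python's
"idna" codec stands in for the ToASCII operation of [RFC3490].

```python
import re
import string

# Generic component split from RFC 3986, Appendix B, applied to an IRI.
GENERIC = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)
# Assumption: RFC 3986 unreserved and reserved characters, plus "%".
URI_CHARS = set(
    string.ascii_letters + string.digits + "-._~" + ":/?#[]@" + "!$&'()*+,;=" + "%"
)


def pct(component):
    """UTF-8 percent-encode characters that are not allowed in URIs."""
    return "".join(
        ch if ch in URI_CHARS
        else "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        for ch in component
    )


def map_authority(authority):
    """Map the host with ToASCII; keep userinfo and port delimiters."""
    userinfo, at, hostport = authority.rpartition("@")
    if hostport.startswith("["):
        host, port = hostport, ""  # IP literal: left untouched
    else:
        host, colon, port = hostport.partition(":")
        port = colon + port
        # Raises UnicodeError on failure, so the whole conversion fails.
        host = host.encode("idna").decode("ascii")
    return pct(userinfo) + at + host + port


def iri_to_uri(iri):
    """Map each parsed component, then reassemble with the original
    punctuation that delimited the components."""
    m = GENERIC.match(iri)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    out = []
    if scheme is not None:
        out.append(scheme + ":")
    if authority is not None:
        out.append("//" + map_authority(authority))
    out.append(pct(path))
    if query is not None:
        out.append("?" + pct(query))
    if fragment is not None:
        out.append("#" + pct(fragment))
    return "".join(out)
```

Only delimiters that were present in the input are written back, so a
string that is already a valid URI is returned unchanged.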
3.7. Converting URIs to IRIs
In some situations, for presentation and further processing, it is
desirable to convert a URI into an equivalent IRI in which natural
characters are represented directly rather than percent encoded. Of
course, every URI is already an IRI in its own right without any
conversion. This section gives one procedure for such a
conversion.
The conversion described in this section, if given a valid URI, will
result in an IRI that maps back to the URI used as an input for the
conversion (except for potential case differences in percent-encoding
and for potential percent-encoded unreserved characters). However,
the IRI resulting from this conversion may differ from the original
IRI (if there ever was one).
URI-to-IRI conversion removes percent-encodings, but not all percent-
encodings can be eliminated. There are several reasons for this:
1. Some percent-encodings are necessary to distinguish percent-
encoded and unencoded uses of reserved characters.
2. Some percent-encodings cannot be interpreted as sequences of UTF-8
octets.
(Note: The octet patterns of UTF-8 are highly regular. Therefore,
there is a very high probability, but no guarantee, that percent-
encodings that can be interpreted as sequences of UTF-8 octets
actually originated from UTF-8. For a detailed discussion, see
[Duerst97].)
3. The conversion may result in a character that is not appropriate
in an IRI. See Section 2.2, Section 4.1, and Section 6.1 for
further details.
4. IRI to URI conversion has different rules for dealing with domain
names and query parameters.
Conversion from a URI to an IRI MAY be done by using the following
steps:
1. Represent the URI as a sequence of octets in US-ASCII.
2. Convert all percent-encodings ("%" followed by two hexadecimal
digits) to the corresponding octets, except those corresponding to
"%", characters in "reserved", and characters in US-ASCII not
allowed in URIs.
3. Re-percent-encode any octet produced in step 2 that is not part of
a strictly legal UTF-8 octet sequence.
4. Re-percent-encode all octets produced in step 3 that in UTF-8
represent characters that are not appropriate according to
Section 2.2, Section 4.1, and Section 6.1.
5. Interpret the resulting octet sequence as a sequence of characters
encoded in UTF-8.
6. URIs known to contain domain names in the reg-name component
SHOULD convert punycode-encoded domain name labels to the
corresponding characters using the ToUnicode procedure.
This procedure will convert as many percent-encoded characters as
possible to characters in an IRI. Because there are some choices
when step 4 is applied (see Section 6.1), results may vary.
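Steps 1 to 5 can be sketched as follows; step 6 (ToUnicode on domain
name labels) is left out. The names uri_to_iri and is_inappropriate
are hypothetical. In particular, is_inappropriate is only a
placeholder for the checks of Section 2.2, Section 4.1, and
Section 6.1: rejecting control, format, private-use, unassigned and
separator characters is an assumption of this sketch, not the rule
set of those sections.

```python
import re
import string
import unicodedata

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def is_inappropriate(ch):
    """Placeholder for step 4 (assumed policy, see lead-in)."""
    return unicodedata.category(ch) in {"Cc", "Cf", "Co", "Cn", "Zs", "Zl", "Zp"}


def _utf8_length(lead):
    """Length of the UTF-8 sequence started by this octet, or 0."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def _convert_run(match):
    """Convert one run of consecutive percent-encodings."""
    text = match.group(0)
    triplets = [text[i:i + 3] for i in range(0, len(text), 3)]
    octets = bytes(int(t[1:], 16) for t in triplets)
    out, i = [], 0
    while i < len(octets):
        octet = octets[i]
        if octet < 0x80:
            # Step 2: "%", reserved and disallowed ASCII stay encoded.
            ch = chr(octet)
            out.append(ch if ch in UNRESERVED else triplets[i])
            i += 1
            continue
        n = _utf8_length(octet)
        try:
            if n == 0 or i + n > len(octets):
                raise ValueError("not a complete UTF-8 sequence")
            ch = octets[i:i + n].decode("utf-8")  # strict decoding
        except ValueError:
            # Step 3: not strictly legal UTF-8, keep the original %HH.
            out.append(triplets[i])
            i += 1
            continue
        # Step 4: keep inappropriate characters percent-encoded.
        out.extend(triplets[i:i + n] if is_inappropriate(ch) else [ch])
        i += n
    return "".join(out)


def uri_to_iri(uri):
    """Convert a URI to an IRI using UTF-8 only (steps 1 to 5)."""
    uri.encode("ascii")  # step 1: the URI must be pure US-ASCII
    return PCT_RUN.sub(_convert_run, uri)
```

Percent-encodings that are kept retain their original spelling, so
the only differences when mapping the result back to a URI are the
ones already noted: case of hexadecimal digits produced on re-encoding
and percent-encoded unreserved characters.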
Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI. For example, the URI
"http://www.example.org/r%E9sum%E9.html" might with some guessing be
interpreted to contain two e-acute characters encoded as iso-8859-1.
It must not be converted to an IRI containing these e-acute
characters. Otherwise, in the future the IRI will be mapped to
"http://www.example.org/r%C3%A9sum%C3%A9.html", which is a different
URI from "http://www.example.org/r%E9sum%E9.html".