To avoid the necessity of chasing around multiple specifications and
interpreting things based on what isn't said, a simple statement should
be added to the data-types-url section of the spec:
    IRIs MUST be converted to URIs before being included in an RSS 2.0
    document.
Perhaps with a hypertext link to
http://www.apps.ietf.org/rfc/rfc3987.html#sec-3.2
Background: domain names in URIs have always been restricted to the
US-centric ASCII character set. Understandably, there has been a
growing demand for domain names that include characters found in
non-English languages. From RFC 3987:
    The characters in URIs are frequently used for representing words of
    natural languages. This usage has many advantages: Such URIs are
    easier to memorize, easier to interpret, easier to transcribe, easier
    to create, and easier to guess. For most languages other than
    English, however, the natural script uses characters other than A -
    Z. For many people, handling Latin characters is as difficult as
    handling the characters of other scripts is for those who use only
    the Latin alphabet. Many languages with non-Latin scripts are
    transcribed with Latin letters. These transcriptions are now often
    used in URIs, but they introduce additional ambiguities.
As an example, see:
http://www.atemschutzunfälle.de/asu.rdf
Despite the rdf extension, this actually is a valid RSS 0.93 feed.
Because of concerns about breaking existing software, this was
approached in two ways. RFC 3490 specifies a backwards-compatible
mechanism for Internationalizing Domain Names in Applications (IDNA). It
encodes the non-ASCII characters of each domain label in an
ASCII-compatible form (Punycode, marked by the "xn--" prefix). The domain
name above, which contains an umlaut, gets encoded thus:
www.xn--atemschutzunflle-7nb.de
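
As a rough illustration of that encoding, the conversion can be
reproduced with the "idna" codec that ships with Python. This is only a
sketch: the helper name to_ascii_host is made up for this example, and
the codec follows the original IDNA rules, which is assumed to be what
is wanted here.

```python
def to_ascii_host(host: str) -> str:
    """Encode an internationalized host name into its ASCII form.

    Hypothetical helper for illustration; it relies on Python's
    built-in "idna" codec, which encodes each label separately.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_host("www.atemschutzunfälle.de"))
    # prints: www.xn--atemschutzunflle-7nb.de
```

Labels that are already plain ASCII pass through unchanged.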
The other way forward was captured in RFC 3987, and it allows such
characters to be included directly into IRIs. Quoting from that RFC:
    a. A protocol or format element should be explicitly designated to
    be able to carry IRIs. The intent is not to introduce IRIs into
    contexts that are not defined to accept them. For example, XML
    schema [XMLSchema] has an explicit type "anyURI" that includes
    IRIs and IRI references. Therefore, IRIs and IRI references can
    be in attributes and elements of type "anyURI". On the other
    hand, in the HTTP protocol [RFC2616], the Request URI is defined
    as a URI, which means that direct use of IRIs is not allowed in
    HTTP requests.
Including an IRI directly as the url attribute of an enclosure element
would quite likely break existing software. As that was not the intent
of IRIs, any IRI needs to be mapped to a URI first.
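
For anyone producing feeds, the mapping might look roughly like the
following. It is a simplified sketch rather than a complete
implementation of the RFC 3987 rules: the function name iri_to_uri and
the SAFE constant are made up for this example, the host is converted
with IDNA (as in the domain name shown earlier) while everything else
is percent-encoded as UTF-8, and userinfo and IPv6 literals are assumed
not to occur.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# ASCII characters left unescaped. "%" is included so that existing
# percent-escapes are not encoded a second time. (Illustrative choice.)
SAFE = "/:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified sketch)."""
    parts = urlsplit(iri)
    netloc = parts.netloc
    if parts.hostname:
        # Host: convert each label with Python's built-in "idna" codec.
        # Assumption: no userinfo and no IPv6 literal in the authority.
        netloc = parts.hostname.encode("idna").decode("ascii")
        if parts.port is not None:
            netloc += f":{parts.port}"
    # Path, query, fragment: percent-escape non-ASCII characters (as
    # UTF-8 bytes) and any other character not listed in SAFE.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://www.atemschutzunfälle.de/asu.rdf"))
    # prints: http://www.xn--atemschutzunflle-7nb.de/asu.rdf
```

A URI that is already ASCII-only comes back unchanged, so the function
can be applied to every url value before it is written to the feed.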
Again, I don't think all this background needs to be included in the
spec, but a simple statement like the one suggested above would be
appropriate.
Test cases:
http://feedvalidator.org/testcases/rss20/data-types-url/
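
A check along these lines could be sketched as follows. This is not how
the Feed Validator itself is implemented; the function name and the
sample feed (including the a.mp3 file names) are invented for
illustration, and the check only looks for non-ASCII characters in
enclosure url attributes.

```python
import xml.etree.ElementTree as ET


def non_ascii_enclosure_urls(rss_text: str) -> list:
    """Return enclosure url values that contain non-ASCII characters."""
    root = ET.fromstring(rss_text)
    flagged = []
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url")
        if url is not None and not url.isascii():
            flagged.append(url)
    return flagged


# Hypothetical two-item feed: the first enclosure carries an IRI,
# the second the corresponding ASCII-only host name.
SAMPLE = """<rss version="2.0"><channel><title>t</title>
<item><enclosure url="http://www.atemschutzunfälle.de/a.mp3"
  length="1" type="audio/mpeg"/></item>
<item><enclosure url="http://www.xn--atemschutzunflle-7nb.de/a.mp3"
  length="1" type="audio/mpeg"/></item>
</channel></rss>"""

if __name__ == "__main__":
    for url in non_ascii_enclosure_urls(SAMPLE):
        print("IRI found where a URI is expected:", url)
```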
- Sam Ruby