2009/2/17 David-Sarah Hopwood <david.hopwood@...>:
> Mike Samuel wrote:
>> 2009/2/16 David-Sarah Hopwood <david.hopwood@...>
>>> Suppose that S is a Unicode string in which each character matches
>>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
>>> not containing ("&" followed by a character not matching AmpFollower).
>>> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
>>> an attacker.
>>>
>>> ValidChar :: one of
>>> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
>>> [\u0020-\u007E]
>>> [\u00A0-\u00AC]
>>> [\u00AE-\u05FF]
>>> [\u0604-\u06DC]
>>> [\u06DE-\u070E]
>>> [\u0710-\u17B3]
>>> [\u17B6-\u200A]
>>> [\u2010-\u2027]
>>> [\u202F-\u205F]
>>> [\u2070-\uD7FF]
>>
>> So no surrogates?
>
> Correct. They're not characters (or even "noncharacters").
>
>>> [\uE000-\uFDCF]
>>> [\uFDF0-\uFEFE]
>>> [\uFF00-\uFFEF]
>>
>> Why include FFEF?
>
> It's unassigned, and there's no particular reason to exclude it.
> (\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
> for "special" characters.)
Isn't it the reflection of FEFF, the byte-order mark?
This is probably a very minor issue, but if one part of a parser
naively delegates to another parser that mistakenly treats its content
as a byte string instead of code units, the presence of a BOM might
cause the delegatee to misinterpret content when something that looks
like a BOM appears at the beginning of a chunk of embedded language.
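To make "looks like a BOM" concrete, here is a rough sketch of the kind of
check a byte-oriented delegatee might run on the start of a chunk. The
function name leadingBomEncoding is made up for illustration, and it only
knows the UTF-8 and UTF-16 signatures (UTF-32 is ignored).

```javascript
// Hypothetical helper: report which encoding a chunk's leading bytes would
// suggest to a consumer that sniffs for a byte-order mark.
// `bytes` is an array or Uint8Array of byte values.
function leadingBomEncoding(bytes) {
  if (bytes.length >= 3 &&
      bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return "UTF-8";      // U+FEFF encoded as UTF-8
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return "UTF-16BE";   // U+FEFF, big-endian
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return "UTF-16LE";   // U+FEFF, little-endian
  }
  return null;           // no BOM-like prefix
}
```

A chunk for which this returns non-null is one that a sniffing consumer
could decode differently from what the enclosing parser intended.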
>>> AmpFollower :: one of
>>> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
>>> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
>>> // single quote, backslash, space, TAB, LF, CR
>>>
>>> (ValidChar excludes format control characters, and some other
>>> characters known to be mishandled by browsers. AmpFollower is
>>> intended to exclude characters that can start an entity reference.)
>>>
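For concreteness, the character-level restrictions above could be checked
with something like the sketch below. The names (isSafeInlineScript and the
three constants) are made up for illustration. It does not check that S is
syntactically correct ES3, and it accepts a trailing "&" (one followed by no
character at all), which is only one reading of the rule.

```javascript
// ValidChar, transcribed from the ranges quoted in this thread.
// No "u" flag: the pattern works on UTF-16 code units, so surrogates
// (and therefore every non-BMP character) are rejected.
const ONLY_VALID_CHARS = new RegExp(
  "^[\\u0009\\u000A\\u000D" +
  "\\u0020-\\u007E\\u00A0-\\u00AC\\u00AE-\\u05FF" +
  "\\u0604-\\u06DC\\u06DE-\\u070E\\u0710-\\u17B3" +
  "\\u17B6-\\u200A\\u2010-\\u2027\\u202F-\\u205F" +
  "\\u2070-\\uD7FF\\uE000-\\uFDCF\\uFDF0-\\uFEFE" +
  "\\uFF00-\\uFFEF]*$"
);

// The subsequences "<!", "</" and "]]>".
const FORBIDDEN_SUBSEQUENCE = /<!|<\/|\]\]>/;

// "&" followed by a character that is not an AmpFollower:
// = ( + - ! ~ " / 0-9 ' \ space TAB LF CR
const BAD_AMPERSAND = /&[^=(+\-!~"\/0-9'\\ \t\n\r]/;

// Assumption: `s` is the candidate script text as a JavaScript string.
function isSafeInlineScript(s) {
  return ONLY_VALID_CHARS.test(s) &&
         !FORBIDDEN_SUBSEQUENCE.test(s) &&
         !BAD_AMPERSAND.test(s);
}
```

Note that under these rules "a && b" is rejected (the second "&" is not an
AmpFollower), while "a & b" and "a &= b" pass.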
>>> S is inserted between "<script>" and "</script>" in a place where a
>>> <script> tag is allowed in an otherwise valid HTML document, or
>>> between "<script><![CDATA[" and "]]></script>" in a place where a
>>> <script> tag is allowed in an otherwise valid XHTML document.
>>> The HTML or XHTML document starts with a correct <!DOCTYPE or
>>> <?xml declaration respectively, and is encoded as well-formed
>>> UTF-8.
>>>
>>> Are these restrictions sufficient to ensure that the embedded
>>> script is interpreted as it would have been if referenced from
>>> an external file, foiling any attempts of browsers to collude
>>> with the attacker in misparsing it?
>>
>> You may still be subject to encoding attacks. I'm sure there are
>> valid scripts that look like UTF-7, so if the script appears in the
>> first 1024B, you might need to make sure it's preceded by a <meta>
>> element specifying an encoding, and/or use the XML prologue form that
>> specifies an encoding.
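As a rough illustration of that check: the sketch below flags a document
whose first <script> starts within the first 1024 bytes without an encoding
declaration in front of it. The function and constant names are made up, the
matching is crude text search rather than real HTML parsing, and it assumes
the document will be served as UTF-8 when it measures the byte offset.

```javascript
const SNIFF_WINDOW_BYTES = 1024;

// Hypothetical helper: true when the first <script> begins inside the
// sniffing window and nothing before it declares an encoding.
function scriptNeedsCharsetDeclaration(html) {
  const scriptIndex = html.search(/<script\b/i);
  if (scriptIndex === -1) return false;        // no script at all

  const before = html.slice(0, scriptIndex);
  // Byte offset of the script, assuming UTF-8 output.
  const byteOffset = new TextEncoder().encode(before).length;
  if (byteOffset >= SNIFF_WINDOW_BYTES) return false;

  // Either a <meta ... charset=...> or an <?xml ... encoding=...?> prologue.
  const hasMeta = /<meta\b[^>]*\bcharset\s*=/i.test(before);
  const hasXmlEncoding = /^\uFEFF?<\?xml\b[^?]*\bencoding\s*=/.test(before);
  return !(hasMeta || hasXmlEncoding);
}
```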
>
> Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
> start sufficient for all browsers (that is, are there any browsers
> in which a <meta> tag or other content sniffing can override an
> explicit initial UTF-8 BOM, in either HTML or XHTML)?
Ah cool. I don't know the answer to that question.
> HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
> specifies an encoding" (i.e. presumably the charset specified in
> a Content-Type header), then that should override a BOM. That's
> irritating, because it means that you have to assume that the server
> gets the Content-Type right, *as well as* including a BOM for the
> browsers in which Content-Type doesn't override sniffing
> (Internet Explorer, at least), and for the case where the document
> is read from a local file.
Yeah. I think the best thing to do is to use a fairly standard
encoding like UTF-8, and make sure the XML prologue, <meta
http-equiv="content-type">, and headers all agree.
I don't think that you can do much about file hosting services that go
out of their way to specify a wacky encoding. Putting a BOM at the
front will help with hosting services that make a genuine effort.
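A sketch of what "all agree" could mean in practice: pull the encoding label
out of each of the three places and compare them. The function names are
made up for illustration, the regexes are a crude approximation rather than
a real header or HTML parser, and the "utf-8" default is just the encoding
suggested above.

```javascript
// Hypothetical helper: collect the encoding labels declared in the
// Content-Type header value, the XML prologue, and a <meta> element.
function declaredEncodings(contentTypeHeader, documentText) {
  const found = [];

  const header = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentTypeHeader || "");
  if (header) found.push(header[1]);

  const prologue =
    /^\uFEFF?<\?xml\b[^?]*\bencoding\s*=\s*["']([^"']+)["']/.exec(documentText);
  if (prologue) found.push(prologue[1]);

  const meta =
    /<meta\b[^>]*\bcharset\s*=\s*["']?([^\s"'>;\/]+)/i.exec(documentText);
  if (meta) found.push(meta[1]);

  return found.map((label) => label.toLowerCase());
}

// True when at least one declaration exists and every one names `expected`.
function encodingsAgree(contentTypeHeader, documentText, expected = "utf-8") {
  const labels = declaredEncodings(contentTypeHeader, documentText);
  return labels.length > 0 && labels.every((label) => label === expected);
}
```

For example, encodingsAgree('text/html; charset=ISO-8859-1', doc) is false
for a doc whose <meta> says UTF-8, which is the mismatch to look out for.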
> --
> David-Sarah Hopwood ⚥