caplet · The Caplet Group
Message #285 of 310
Re: Am I paranoid enough?

No, I'm not paranoid enough yet. It's not sufficient only to say
that the HTML is encoded as UTF-8 (see below).

David-Sarah Hopwood wrote:
[...]
> The HTML or XHTML document starts with a correct <!DOCTYPE or
> <?xml declaration respectively,

I meant, the document starts with <!DOCTYPE HTML> in the case
of HTML, or <?xml version="1.0"?><!DOCTYPE HTML> in the case of
XHTML.

(This will also put the parser into sane^H^H^H^Hstandards mode.)
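A minimal sketch of the prolog check described above, in Python. The function name `has_expected_prolog` and the two constants are hypothetical, and the exact, case-sensitive match is an assumption about how strictly the rule is meant to be applied.

```python
# Prologs exactly as quoted in the message above (hypothetical constant names).
HTML_PROLOG = "<!DOCTYPE HTML>"
XHTML_PROLOG = '<?xml version="1.0"?><!DOCTYPE HTML>'


def has_expected_prolog(text: str, xhtml: bool) -> bool:
    """Return True if the decoded document begins with the quoted prolog.

    Assumption: the match is exact and case-sensitive.
    """
    # Skip a leading U+FEFF (BOM) so the check can run on decoded text.
    if text.startswith("\ufeff"):
        text = text[1:]
    expected = XHTML_PROLOG if xhtml else HTML_PROLOG
    return text.startswith(expected)
```

A caller would decode the document first and then pass `xhtml=True` or `xhtml=False` depending on how the document is going to be served.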

> and is encoded as well-formed UTF-8.

The document must also start with a UTF-8 BOM, *and* must not
contain a META directive that changes the charset, *and* in the
case of HTML, must either be retrieved from a local file or over
HTTP with the header "Content-Type: text/html; charset=UTF-8".
This is because the method of determining the encoding is chosen
based on the phase of the moon.
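The conditions above can be sketched as a single check, here in Python. This is an illustration under stated assumptions, not a complete implementation: the function name `html_encoding_is_pinned` is hypothetical, passing `content_type=None` is an assumed convention for "retrieved from a local file", and the META test is deliberately stricter than the rule as stated, since it rejects any META tag that mentions a charset rather than only one that changes it.

```python
import codecs
import re
from typing import Optional

# Crude and conservative (an assumption): flag any <meta ... charset ...> tag,
# even one that merely restates UTF-8.
META_CHARSET = re.compile(rb"<meta\b[^>]*charset", re.IGNORECASE)


def html_encoding_is_pinned(body: bytes, content_type: Optional[str]) -> bool:
    """Check the HTML encoding conditions listed in the message above."""
    # 1. The document must start with a UTF-8 BOM (bytes EF BB BF).
    if not body.startswith(codecs.BOM_UTF8):
        return False
    # 2. The document must be well-formed UTF-8; strict decoding rejects
    #    malformed sequences.
    try:
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # 3. No META directive that could change the charset.
    if META_CHARSET.search(body):
        return False
    # 4. Either a local file (content_type is None, by this sketch's
    #    convention) or an HTTP Content-Type of text/html; charset=UTF-8.
    if content_type is None:
        return True
    normalized = "".join(content_type.split()).lower()
    return normalized == "text/html;charset=utf-8"
```

The header comparison ignores whitespace and case, which is a convenience of this sketch; a stricter caller could compare against the literal header value instead.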

Any other problems?

--
David-Sarah Hopwood ⚥




Mon Feb 16, 2009 4:29 pm

david.hopwood@...

Suppose that S is a Unicode string in which each character matches ValidChar below, not containing the subsequences "<!", "</" or "]]>", and not containing...
David-Sarah Hopwood
david.hopwood@...
Feb 16, 2009
3:16 pm

2009/2/16 David-Sarah Hopwood <david.hopwood@...> ... So no surrogates? ... Why include FFEF? ... You may still be subject to encoding...
Mike Samuel
mikesamuel
Feb 16, 2009
11:38 pm

... Correct. They're not characters (or even "noncharacters"). ... It's unassigned, and there's no particular reason to exclude it. (\uFFF0-\uFFF8 are also...
David-Sarah Hopwood
david.hopwood@...
Feb 17, 2009
11:13 am

... Isn't it the reflection of fffe, the byte-order-marker. This is probably a very minor issue, but if one part of a parser naively delegates to another...
Mike Samuel
mikesamuel
Feb 17, 2009
6:50 pm

... [...] ... No, \uFEFF is the BOM, and its byte-reflection \uFFFE is a noncharacter, so already excluded from ValidChar. (Thought you'd spotted something I'd...
David-Sarah Hopwood
david.hopwood@...
Feb 18, 2009
5:26 pm

... Ah, quite right....
Mike Samuel
mikesamuel
Feb 18, 2009
9:54 pm