I've uploaded a test suite for TagSoup to the files area. (It's MIT-licensed, which should be compatible with everything else.) JUnit tests are included, as...
... Thank you! I'll move it to the TagSoup page on Monday, and probably incorporate it into the next release. ... Hmm, yes. I suppose it should generate...
Hi, I have tested the TagSoup parser on http://www.yahoo.fr and I was really surprised by the events on the HTML content: in the first script tag content you can...
... I'd like to see these changes and possibly incorporate them into the next release. Can you send them to me, please? -- "While staying with the Asonu, I...
Well, it's time for another public release of TagSoup, the SAX-compliant Java parser for nasty, ugly HTML. TagSoup 0.9.3 fixes most known bugs and provides ...
Arrgh. I forgot the URLs: TagSoup: http://mercury.ccil.org/~cowan/XML/tagsoup TSaxon: http://mercury.ccil.org/~cowan/XML/tagsoup/tsaxon -- Using RELAX NG...
... Basically my failure to reflect on why they shouldn't be; I introduced them for my own reasons, but obviously they'd be useful to TagSoup clients as well. ...
What would tagsoup do with this... <meta http-equiv="refresh"content="0;http://...."> i.e. No space between two attributes. I suspect tagsoup doesn't handle ...
What would ignore bogons do with a tag like this... <o:p> I don't know why html would contain stuff like this, but I've got it and I don't think tagsoup ignore...
... Tags beginning "o:" are generated by Word in its HTML mode, which notoriously generates absolute garbage, neither HTML nor XML. Ignore-bogons should ...
... It works already: the quotation mark terminates the attribute and resets us to the inside-a-start-tag state. A quick test shows no implementation problem....
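To make the point about the closing quotation mark concrete, here is a minimal sketch of an attribute scanner in which the closing quote ends the value and drops straight back to the inside-a-start-tag state, so no whitespace is needed before the next attribute. It is an illustration only, not TagSoup's actual scanner: the class name, the state names, and the example URL are all made up for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; this is NOT TagSoup's real scanner. */
class StartTagAttributeSketch {

    private enum State { IN_TAG, IN_NAME, AFTER_EQUALS, IN_QUOTED_VALUE, IN_BARE_VALUE }

    /** Parses the text between the element name and '>' into name/value pairs. */
    static Map<String, String> parseAttributes(String s) {
        Map<String, String> attrs = new LinkedHashMap<>();
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        State state = State.IN_TAG;
        char quote = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (state) {
                case IN_TAG:
                    // Any non-space character starts a new attribute name.
                    if (!Character.isWhitespace(c)) {
                        name.setLength(0);
                        name.append(c);
                        state = State.IN_NAME;
                    }
                    break;
                case IN_NAME:
                    if (c == '=') {
                        state = State.AFTER_EQUALS;
                    } else if (Character.isWhitespace(c)) {
                        attrs.put(name.toString(), ""); // attribute without a value
                        state = State.IN_TAG;
                    } else {
                        name.append(c);
                    }
                    break;
                case AFTER_EQUALS:
                    value.setLength(0);
                    if (c == '"' || c == '\'') {
                        quote = c;
                        state = State.IN_QUOTED_VALUE;
                    } else if (!Character.isWhitespace(c)) {
                        value.append(c);
                        state = State.IN_BARE_VALUE;
                    }
                    break;
                case IN_QUOTED_VALUE:
                    if (c == quote) {
                        // The closing quote ends the attribute and resets the state.
                        attrs.put(name.toString(), value.toString());
                        state = State.IN_TAG;
                    } else {
                        value.append(c);
                    }
                    break;
                case IN_BARE_VALUE:
                    if (Character.isWhitespace(c)) {
                        attrs.put(name.toString(), value.toString());
                        state = State.IN_TAG;
                    } else {
                        value.append(c);
                    }
                    break;
            }
        }
        // Flush whatever was still pending at the end of the input.
        if (state == State.IN_NAME || state == State.AFTER_EQUALS) {
            attrs.put(name.toString(), "");
        } else if (state == State.IN_BARE_VALUE) {
            attrs.put(name.toString(), value.toString());
        }
        return attrs;
    }

    public static void main(String[] args) {
        // The URL is a placeholder; the original message elides the real one.
        System.out.println(parseAttributes(
                "http-equiv=\"refresh\"content=\"0;http://example.org/\""));
    }
}
```

With the input from the question (and a placeholder URL), the sketch yields two attributes, http-equiv and content, even though nothing separates them.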
I just did two big runs of TSaxon 0.9.3, processing about 1500 HTML documents captured mostly at random from the Web, one at a time. In all cases the same...
hi, is it possible to ignore (or not modify) an html comment inside a 'script' or 'style' element ? for example, <style><!-- comment --></style> should become ...
... This is a tricky point of SGML. In script and style elements, comments do not exist, because those two elements have CDATA type (which does not exist in...
Hi all, hmm - CDATA does not exist in XML? I don't think this is true. It just serializes differently. Cheers, Oliver...
Oliver Koell
listen@...
Apr 1, 2004 4:17 pm
61
... Elements of type CDATA do not exist in XML. They are not to be confused with CDATA sections, which do exist in XML, nor with CDATA attributes, which do ...
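The distinction can be checked with the JDK's own XML parser. The sketch below uses only standard JAXP/DOM calls and no TagSoup; the class and method names are invented for illustration. In XML, a comment inside a style element is a real comment and contributes no text, whereas a CDATA section or an escaped version both carry the comment-like characters as ordinary text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative sketch; class and method names are hypothetical. */
class CdataSectionSketch {

    /** Returns the text content of the root element of a small XML document. */
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates text and CDATA children; comments are skipped.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // A real XML comment: the element has no text at all.
        System.out.println("[" + textOf("<style><!-- comment --></style>") + "]");
        // A CDATA section: the same characters arrive as text.
        System.out.println("[" + textOf("<style><![CDATA[<!-- comment -->]]></style>") + "]");
        // Escaped text: identical content, serialized differently.
        System.out.println("[" + textOf("<style>&lt;!-- comment --&gt;</style>") + "]");
    }
}
```

The second and third documents produce the same text, which is the sense in which a CDATA section is only a different serialization of character data.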
... I thought April fools day finished at noon :-) regards DaveP...
Dave Pawson
dpawson@...
Apr 1, 2004 5:18 pm
63
Hi John, thanks for responding to thoughtless remarks :-) Just out of curiosity: is there a particular reason the XMLWriter prefers escaping over CDATA...
Oliver Koell
listen@...
Apr 2, 2004 9:28 am
64
... It's a general-purpose writer and doesn't know about particular elements except in HTML mode. Deciding when to use CDATA sections cleverly requires either...
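As a rough illustration of the two options a writer has for the same text, here is a sketch of both. It is not XMLWriter's code; the class and method names are assumptions made for this example. Escaping works for any text without knowing the element, while a CDATA section needs extra care when the text itself contains "]]>".

```java
/** Hypothetical sketch of two ways to serialize character data; not XMLWriter's code. */
class TextSerializationSketch {

    /** Escapes the characters that are significant in XML character data. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Wraps text in a CDATA section, splitting any "]]>" so the section stays well-formed. */
    static String asCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String script = "if (a < b && c > d) run();";
        System.out.println(escape(script));
        System.out.println(asCdata(script));
    }
}
```

Both outputs parse back to the same characters; the choice only affects how readable the serialized form is for a human or for a non-XML consumer such as a browser's script engine.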
Hi John, in XHTML it's recommended to wrap the content of your script elements in CDATA markers, like this: <script type="text/javascript"> <![CDATA[ ...
Oliver Koell
listen@...
Apr 2, 2004 1:58 pm
66
Hi, It appears that Tagsoup is auto-inserting empty HTML attributes with a value of "BOOLEAN". Example: <td> results in <td nowrap="BOOLEAN"> ...
Oliver Koell
listen@...
Apr 2, 2004 2:24 pm
67
... XHTML is not supported, on the assumption (perhaps false) that people who are actually going to the trouble of producing XHTML are producing at least...
... Oooooops. That is what is known as a paper-bag bug, meaning that after releasing it I should go around wearing a paper bag on my head for a while. Today's...
And here it is today, as promised: TagSoup 0.9.4. This fixes the paper-bag bug, allows CDATA sections in the input (but they must be perfectly well-formed or...
hello, the following html : <p><table>...</table></p> becomes <p/><table>...</table> with Tagsoup...how can I configure Tagsoup to avoid the closing of the <p>...
... You need to perform surgery on src/definitions/html/elements. Look for the line beginning "table" and change the string "%block" to "%block+%inline". That...
Hi all, I'm looking for comments on a possible use case for TagSoup. My current employer's hosted message board product allows users to include HTML in message...
... I would think so, indeed. Use it to parse what the users send you, which will be very likely to make it well-formed (not 100% guaranteed, only about 99%)....