Hi John, Anyone, Do you happen to have any URI lists or http-able sized collections of soupy HTML? I've been playing with a really crude tagsoup-style parser, ...
... It could be done, but are you sure that's what you want? It would entail, for instance, that a sequence of paragraphs like <p>foobar <p>bazzam <p>quxquux ...
... In the next release I'll make "script" allowed to appear anywhere, since browsers seem to allow it anywhere. ... In general, yes. -- But you, Wormtongue,...
... Well, it's a SAX parser: you can learn how to use SAX parsers at http://www.saxproject.org . You can also look at the static main, tidy, and chooseContent...
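For readers new to the SAX model mentioned above, here is a minimal sketch of the callback style. It uses Python's standard xml.sax module on well-formed input, not TagSoup itself (TagSoup is a Java library that reports soupy HTML through the same kind of callbacks). The class name EventCollector and the helper parse_events are made up for this illustration.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Hypothetical handler: records the events the parser reports."""

    def __init__(self):
        super().__init__()
        self.elements = []  # element names, in document order
        self.text = []      # non-blank character data chunks

    def startElement(self, name, attrs):
        # Called once for every start tag the parser encounters.
        self.elements.append(name)

    def characters(self, content):
        # May be called several times for one run of text.
        if content.strip():
            self.text.append(content.strip())


def parse_events(xml_bytes):
    """Parse a byte string and return (element names, text chunks)."""
    handler = EventCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.elements, handler.text


elements, text = parse_events(
    b"<html><body><p>foobar</p><p>bazzam</p></body></html>"
)
print(elements)
print("".join(text))
```

The application never sees a tree; it only receives startElement, characters, and similar calls as the parser moves through the input.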
... That's a for-sure bug. Can you send me the document exactly as is, so I can reproduce the problem? Thanks. -- Si hoc legere scis, nimium eruditionis...
Sorry for the poor wording of my query; let me rephrase it: I'd like to transform bad HTML into XML, keeping the initial tag structure and without...
I don't know if this is a tagsoup issue or not, but perhaps someone can steer me the right way.... I've got an application where I am feeding soup into...
... If JTidy treats "&nbsp;" and U+00A0 differently, then I have to say it's buggy. These are supposed to be exactly equivalent in HTML files. ... It's hard...
When I do an octal dump I get 0302, 0240 (i.e. C2, A0). Is that what you write as U+00A0? Is that the UTF-8 encoding thereof? It would hardly surprise me if...
... Yes, exactly. ... The output of the TagSoup main program is always UTF-8; you may need to tell JTidy that. ... I'd write a replacement main() method. -- ...
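The byte values in the exchange above can be checked with a short sketch, assuming only Python's standard library. It encodes U+00A0 as UTF-8, prints the bytes in hex and octal, and confirms that the HTML named entity for a no-break space decodes to the same character. The variable and function names are chosen for this illustration only.

```python
import html

NBSP = "\u00a0"  # NO-BREAK SPACE


def utf8_bytes(ch):
    """Return the UTF-8 encoding of a string as a list of integer byte values."""
    return list(ch.encode("utf-8"))


encoded = utf8_bytes(NBSP)
print([hex(b) for b in encoded])  # hex form, as in "C2, A0"
print([oct(b) for b in encoded])  # octal form, as an octal dump would show

# The named entity and the literal character are the same code point.
print(html.unescape("&nbsp;") == NBSP)
```

A tool that reads these two bytes as Latin-1 instead of UTF-8 would see two characters rather than one, which is one reason to tell the downstream consumer the encoding explicitly.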
Hello, Here is HTML snippet I tried to tagsoup: <td width="435"><nobr><a href="/"><img src="http://g.delfi.lv/d/h/news_on.gif" border=0 alt="Ziòas" width=78...
I've put together a simple, fairly forgiving SAX2-style HTML/XML parser in Python, may be of interest here. As a demo there's a simple RSS aggregator. ...
Hi, The Water language currently uses Tidy for converting HTML to XHTML, and I'd like to move to TagSoup because: TagSoup should never fail to return TagSoup...
TagSoup 0.9.5 is now available at the usual place, http://www.ccil.org/~cowan/XML/tagsoup . This is a bug-fix release, but the bug goes right back to the...
I am using TagSoup with org.apache.xalan.xsltc.trax.SAX2DOM to read Web Pages and create DOM Documents in order to parse data. I have come across web pages on...
... This is a known problem which will be fixed in the next release, which I expect to have out shortly. Attribute names beginning with digits will be changed...
TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish,...
Thank you for giving me something amusing to google. I've had a few of those in my day. :-) Howard...
Howard Katz
howardk@...
Aug 13, 2004 10:05 pm
99
Help! I'm trying to compile the example shown on Hackdiary, http://www.hackdiary.com/archives/000041.html I loaded perl. I loaded Xalan.jar in \lib under my...
... Discard all the classes in the tagsoup-0.9.7/src/java/.../test directory; they were released by accident. I have now yanked them from both the source and...
Hi, The XML header tag is added a second time when it already exists. Run tagsoup on the attached testfile to see... Thanks for your help, Sytse Hengeveld ... ...
sytse@...
Aug 16, 2004 12:46 pm
102
... Thanks. This is one of the last remaining well-formedness bugs, and it'll be fixed in the next release (I would have fixed it in this one, except for some...
I think I see what's happening. According to the HTML DTD, NOSCRIPT is not allowed in the HEAD. Therefore, TagSoup closes the HEAD as soon as it sees...