Hi John, everyone,
Version 1.2 of TagSoup occasionally throws an exception when trying to push back data to the internal PushbackReader. Examples of failing...
... Thank you very much, especially for the failing input. There was an earlier bug report to this effect, but no examples were forthcoming. ... It should...
... My pleasure. This issue can be avoided by passing a custom PushbackReader on the input. See my other e-mail about nested tags; I think that one can be...
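The workaround above relies on how java.io.PushbackReader behaves: pushing back more characters than the buffer holds throws an IOException. The sketch below shows only that standard-library behaviour. The class name PushbackDemo, the sample input, and the buffer sizes are illustrative assumptions; how TagSoup 1.2 sizes its internal reader is not shown in this thread.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical demo class; not part of TagSoup.
final class PushbackDemo {
    // Reads `count` characters, then tries to push all of them back.
    // Returns true if the pushback buffer was large enough, false on overflow.
    static boolean canPushBack(int bufferSize, int count) throws IOException {
        // Sample input: an ampersand directly before a CR-LF line terminator.
        PushbackReader reader =
                new PushbackReader(new StringReader("a&\r\nb"), bufferSize);
        char[] chars = new char[count];
        int read = reader.read(chars, 0, count);
        try {
            reader.unread(chars, 0, read);
            return true;
        } catch (IOException overflow) {
            // PushbackReader signals a full buffer with an IOException.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("buffer 1, unread 2: " + canPushBack(1, 2));
        System.out.println("buffer 8, unread 2: " + canPushBack(8, 2));
    }
}
```

A reader built with a larger buffer could then be handed to a SAX parser through the standard InputSource.setCharacterStream call; whether TagSoup uses the supplied reader unchanged is an assumption based on the message above.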
... Problem solved! The issue arises when an & appears at the end of a line, and the line terminator is either CR-LF (Windows) or CR alone (Mac Classic), as in...
Hi John and tagsoup-friends, would it be possible to briefly describe (or provide reliable pointers to) a way to create an instance of...
Godmar Back
godmar@...
Feb 6, 2008 4:02 am
1003
... I don't know of any HTML DOMs that have pluggable parsers, since there is no standard interface for streaming HTML parsers. Most people use XML DOMs or...
... I'll investigate that. I was intrigued by your suggestion in your 2002 talk that "SAX-to-DOM converters" were abundant; apparently, this doesn't include...
Godmar Back
godmar@...
Feb 6, 2008 5:53 am
1005
... SAX is purely an XML standard, unless you are using an HTML-to-SAX parser like Cyberneko, TagSoup, or JTidy. ... The HTML DOM doesn't really buy you much...
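One generic "SAX-to-DOM converter" is already in the JDK: a JAXP identity transform copies the events from any SAX XMLReader into a DOM tree. The sketch below is a minimal illustration, not a method recommended in this thread. The class name SaxToDom and the helper standInReader are hypothetical, and the JDK's own XML parser stands in for an HTML-to-SAX parser so that the example runs without extra jars.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class; not part of TagSoup.
final class SaxToDom {
    // Builds a DOM Document from whatever SAX events `reader` produces.
    static Document toDom(XMLReader reader, String markup) throws Exception {
        // A Transformer created without a stylesheet is an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        InputSource input = new InputSource(new StringReader(markup));
        identity.transform(new SAXSource(reader, input), result);
        return (Document) result.getNode();
    }

    // Stand-in reader: the JDK XML parser, used here only for the demo.
    static XMLReader standInReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom(standInReader(),
                "<html><body><p>one</p><p>two</p></body></html>");
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```

With the TagSoup jar on the classpath, an instance of its Parser class (which implements XMLReader) could presumably be passed in place of standInReader(); that substitution is an assumption and is not exercised here.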
Confirmed, works as advertised -- thanks John. For some reason my other bug report didn't get through to the list. I will try to re-send it to start ...
Another bug, this time more serious and with no apparent workaround (sorry, John). Try to run: java -jar tagsoup-1.2.jar error-67.txt > out on the ZIPped HTML...
... I've been using Castor quite a bit for processing structured XML and in my mind had hoped that HTML DOM would provide a binding that would be similarly...
Godmar Back
godmar@...
Feb 6, 2008 1:03 pm
1009
... I admit that this case is extreme (the 375K input balloons to a 15M output file), but not actually erroneous. There are 485 <small> tags in the input and...
... It is my understanding that there's also NUX (http://dsd.lbl.gov/nux/index.html ), which embeds XOM. Do you recommend using the NUX wrappers/packaging or...
Godmar Back
godmar@...
Feb 6, 2008 3:26 pm
1011
... Right... I have more of these -- when you make real crawls, you pull some real #*&^ out of the Web. ... Ok, I admit I never thought of the semantics of...
I've implemented three approaches to this. 1) TagSoup parses, XOM represents XHTML in XML, output via TagSoup serializer is XHTML 1.0. I had to add a number of...
... Thanks for these answers! I actually don't need to output the tree, I'm just interested in analyzing it conveniently - say feed it to an expert system such...
Godmar Back
godmar@...
Feb 6, 2008 4:07 pm
1014
... I have done both with good success, and I think it comes down to whether you need just basic tree-access with xpath (which XOM can do well), or more...
... Fixing the lower-case doctype bug turns out to be trivial: change the "equals" to "equalsIgnoreCase" in line 837 of Parser.java. ... Still working on this...
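The effect of that one-word change can be shown in isolation. This is only an illustration of the two String methods: the actual code at line 837 of Parser.java is not quoted in this thread, so the class name DoctypeCheck and the compared literal "DOCTYPE" are assumptions.

```java
// Hypothetical illustration; not the code from Parser.java.
final class DoctypeCheck {
    // Case-sensitive comparison: misses a lower-case "doctype".
    static boolean strict(String name) {
        return name.equals("DOCTYPE");
    }

    // Case-insensitive comparison: accepts any mix of upper and lower case.
    static boolean lenient(String name) {
        return name.equalsIgnoreCase("DOCTYPE");
    }

    public static void main(String[] args) {
        System.out.println(strict("doctype"));   // case-sensitive check
        System.out.println(lenient("doctype"));  // case-insensitive check
    }
}
```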
... No, not a problem: I could introduce a new element property in TSSL which says "terminate all restartable elements". The question is whether this produces...
... "scripsit"? What language is this :) ... I am not really an expert in the HTML spec (don't even know if this is anywhere in the spec), but intuitively an...
... Latin: "has written", or more accurately "has completed writing". ... Correct, I think. ... Very painful, which is why I've avoided it. -- A: "Spiro...
Hi, I want to parse HTML files and make customizable XML files corresponding to those HTML files. How can this be done? Any suggestions would be of great help....
... You will have to specify the correct encoding, such as Shift-JIS or ISO-2022-JP, and it needs to be one that your Java VM understands. -- "Well, I'm back."...
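Whether a given Java VM understands an encoding name can be checked with java.nio.charset.Charset before parsing. The sketch below is a minimal example using only the standard library. The class name EncodingCheck and its method names are hypothetical, and which Japanese charsets are installed depends on the JVM, so the example does not assume any of them is present.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class; not part of TagSoup.
final class EncodingCheck {
    // True if this JVM can decode the named charset.
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException badSyntax) {
            // The name contains characters that no charset name may use.
            return false;
        }
    }

    // Wraps a byte stream in a Reader that decodes with the named charset.
    static Reader open(InputStream in, String name) throws IOException {
        return new InputStreamReader(in, name);
    }

    public static void main(String[] args) throws IOException {
        for (String name : new String[] {"Shift_JIS", "ISO-2022-JP", "UTF-8"}) {
            System.out.println(name + " usable: " + isUsable(name));
        }
        Reader r = open(new ByteArrayInputStream("ok".getBytes("UTF-8")), "UTF-8");
        System.out.println((char) r.read());
    }
}
```

The resulting Reader can be given to a SAX parser with InputSource.setCharacterStream; alternatively, the charset name can be declared on a byte stream with InputSource.setEncoding. Both are standard SAX calls.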
... Thanks John. Can you help me with some sample code? I have not used TagSoup or Saxon before or if you can direct me to some documentation on the same. ...
Hi, I've downloaded the 1.2 source code and have seen some folders: tssl, stml. What are these for? I haven't found anything regarding this in the documentation. ...
Diego Campo
diego.campo@...
Mar 5, 2008 11:36 am
1028
... They are required when building TagSoup from source. You cannot just compile the provided source code yourself -- you must use Ant. -- John Cowan...
Is this to produce the jar? I'd like to integrate the code so I can make my own changes if necessary, with no jar creation. Should I then integrate the tagsoup...