Awesome tool, John! This library fixes 99% of my HTML parsing woes. I was trying to build a page analyzer tool that grabs the visible content of a webpage. Then I...
... That's a genuine bug: anything beginning "</script" is currently recognized as the end-of-script tag, because I don't (as I should) check for the final...
... Yup, you are right. I did more testing and realized that once it recognizes "</script", the next "</" will...
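A minimal sketch of the check described in this thread, in which "</script" only counts as the end of the script when the closing ">" follows. This is not TagSoup's code: the class and method names are hypothetical, and the pattern is a simplification that allows only optional whitespace before the ">".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of TagSoup.
final class ScriptEndFinder {
    // "</script", optional whitespace, then the final ">"; case-insensitive.
    private static final Pattern END_TAG =
        Pattern.compile("</script\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns the index where the end tag starts, or -1 if there is none. */
    static int findEndOfScript(String text, int from) {
        Matcher m = END_TAG.matcher(text);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The string literal inside the script merely begins with "</script".
        String body = "var s = \"</scripture\"; </script >";
        System.out.println(body.indexOf("</script"));   // prefix-only check: 9
        System.out.println(findEndOfScript(body, 0));    // with the ">" check: 23
    }
}
```

A prefix-only test stops at the string literal (index 9), while the version that insists on the final ">" finds the real end tag (index 23).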
This may or may not be a TagSoup problem, but perhaps someone here will know. I have a simple program that takes in HTML, transforms it with TagSoup into well...
Hey! Has anyone seen this, and can anyone give me a heads-up? ... -- François Beausoleil Solutions Technologiques Internationales Téléphone: (819) 566-5997...
... Sorry for not responding earlier. ... It's a bug. The routine for detecting the end of a script or style element isn't general enough yet. I'll try to...
... I think that your XSLT implementation is adding a META (in upper case) element in HTML output mode without noticing if one is already in the result tree or...
... Thanks. ... I think you are doing the Right Thing, since that guarantees that the meta element specifies the correct encoding. Maybe someday TagSoup will...
John, I followed your advice and subscribed to the TagSoup list. As I was saying, many popular sites have lots of nested JavaScript in their HTML and TagSoup...
... This definitely seems to be a result of the known problem with detecting end-tags in script and style elements. -- John Cowan www.ccil.org/~cowan...
... It sounds like you'd be better off with jchardet, a Java port of the Mozilla encoding guesser. Its result can be set into the InputSource object you pass...
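As a rough illustration of setting a detected encoding into the InputSource, the sketch below uses only the standard org.xml.sax.InputSource API. The detection step itself is left out: charsetName is assumed to be whatever name a guesser such as jchardet reported, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class EncodingHint {
    /**
     * Wraps a byte stream in an InputSource and records the encoding name.
     * charsetName is assumed to come from an external detector.
     */
    static InputSource withEncoding(InputStream in, String charsetName) {
        InputSource src = new InputSource(in);
        src.setEncoding(charsetName);
        return src;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[0]);
        InputSource src = withEncoding(in, "windows-1252");
        System.out.println(src.getEncoding());
    }
}
```

The resulting InputSource is then what gets passed to the parser's parse method.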
Hello, TagSoup + XOM here. I get an error somewhere deep in my XML manipulations that emerges as a ParsingException and the message "-1" :( Unfortunately, I...
There seems to be an amazing number of pages out there with multiple body tags! I guess this comes from people doing includes of whole pages. It would be nice...
... That's one source. An old bug in early versions of Netscape meant that background-color attributes in multiple body tags would be interpreted dynamically,...
... I understand. I am parsing real (i.e. ugly) HTML using XOM's NodeFactory. What's the best strategy to remove those extra body tags? I tried using booleans...
... I am using XOM's NodeFactory to parse raw HTML. My problem is that I am using the body closing tag as the cue point to start collecting statistics about a...
... XSLT is your friend; so is the full use of the XOM model. You are trying to strain the limits of a streaming API beyond what's reasonable. -- John Cowan...
Finally, a new release of TagSoup and TSaxon. Summary of changes: convert CR and CRLF to LF in comments and PIs; force empty elements to close immediately; match...
Hello, I know I won't be able to compile under JDK 1.1.8 directly, but is the code (including the code generated from the XSLT transformations)...
... Well, you'd have to go through and change references to HashMap into Hashtable, but that's all I can think of offhand. ... Thank you. -- At the end of the...
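To give a feel for the HashMap-to-Hashtable change, here is a small sketch in the style JDK 1.1 accepts: a raw Hashtable and an Enumeration, since HashMap, keySet() and Iterator only arrived with the collections framework in 1.2. The class, method names and table contents are made up for illustration and are not taken from the TagSoup sources.

```java
import java.util.Enumeration;
import java.util.Hashtable;

// Hypothetical example, for illustration only.
final class LegacyMapDemo {
    // Raw Hashtable: generics and HashMap both postdate JDK 1.1.
    static Hashtable buildTable() {
        Hashtable t = new Hashtable();
        t.put("html", "root");
        t.put("body", "block");
        return t;
    }

    // keys() returns an Enumeration, the 1.1-era way to walk the keys.
    static int countKeys(Hashtable t) {
        int n = 0;
        for (Enumeration e = t.keys(); e.hasMoreElements(); e.nextElement()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Hashtable t = buildTable();
        System.out.println(t.get("body"));
        System.out.println(countKeys(t));
    }
}
```

One difference to watch for when converting: unlike HashMap, Hashtable throws a NullPointerException on null keys or values.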