This release, available as usual at http://www.tagsoup.info, fixes a couple of nasty paper-bag bugs in 1.0rc4, and adds a new feature, the --nocolons switch,...
... I was just about to release it, so adding a fix for your problem (which was obvious once I thought about it, and the result of inadequate testing) was...
Hi, I am trying to use TagSoup with StAX but cannot seem to get all the pieces to fit together. I would rather parse my document with the StAX API than with SAX....
... I found that when I changed implementations to the Codehaus version it just worked! Looking at their code I can see they treat SAXSource as a special case...
... With just these tiny fragments I can't tell. TagSoup is a SAX parser, so it should be possible to pass an org.ccil.cowan.tagsoup.Parser object to anything...
I think that what I was doing was fundamentally wrong. To bridge between SAX and StAX in a streaming way would require one thread to parse the document and...
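The two-thread arrangement described here can be sketched roughly as follows. This is only an illustration, not how any StAX implementation actually does it: the class name `SaxPullBridge`, the string-encoded events, and the queue size are all assumptions made for the example. One thread runs the push-style SAX parse and blocks on a bounded queue; the caller pulls events from that queue at its own pace. The demo helper uses the JDK's built-in SAX parser so the sketch runs without extra jars.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: turns a push-style SAX parse into a pull-style event stream.
final class SaxPullBridge {
    private static final String END = "END_DOCUMENT";
    // Bounded queue: the parser thread blocks when the consumer falls behind.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

    SaxPullBridge(XMLReader reader, InputSource input) {
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                push("START " + qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                push("END " + qName);
            }
        });
        // Producer thread: runs the SAX parse and feeds the queue.
        Thread producer = new Thread(() -> {
            try {
                reader.parse(input);
            } catch (Exception e) {
                push("ERROR " + e);
            } finally {
                push(END); // always tell the consumer that no more events will come
            }
        }, "sax-producer");
        producer.setDaemon(true);
        producer.start();
    }

    private void push(String event) {
        try {
            queue.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Consumer side: returns the next event, or null once the document has ended.
    // Do not call again after null has been returned; it would block forever.
    String next() throws InterruptedException {
        String event = queue.take();
        return END.equals(event) ? null : event;
    }

    // Demo: pull every event from a small document using the JDK's SAX parser.
    static List<String> collect(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SaxPullBridge bridge = new SaxPullBridge(reader, new InputSource(new StringReader(xml)));
        List<String> events = new ArrayList<>();
        for (String e = bridge.next(); e != null; e = bridge.next()) {
            events.add(e);
        }
        return events;
    }
}
```

Since TagSoup's org.ccil.cowan.tagsoup.Parser is an XMLReader, an instance of it could presumably be passed to the constructor in place of the JDK reader.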
... Yes. ... If you use XOM (http://www.xom.nu), you can get either a complete tree or (if you subclass XOM's Builder class) you can control exactly which...
TagSoup trekking....across the universe... Am I reading correctly that the TagSoup cognoscenti call bogus HTML tags like <o:blah> "bogons", or is there more to...
... A bogon is any element that's not in the schema (src/definitions/html.tssl). It may or may not have a colon in its name. By default bogons are assumed to...
Hello! I'm using TagSoup for parsing HTML and web crawling. After parsing about 9000 URLs successfully, TagSoup fails with a NullPointerException on the 38th line of...
... about ... unstable) ... I can't reproduce this effect; when I try to reparse the failed pages, it's OK. The class begins throwing exceptions at different pages...
... Double ouch. The failure is not data-dependent; it happens when the parser tries to initialize its stack with the dummy element named "<<root>>". I guess...
Hello! I'm creating a new instance of the Parser object for every page. I have 91 MB of links to parse in my list; how can I send it to you? BTW, I use TagSoup in...
This release fixed a bunch of bugs around namespaces. The SAX spec was a little hard to follow, so I am now doing a subset of what Xerces does, in hopes that...
I've found several ways to use TagSoup in code; which one is better in terms of memory usage and performance? 1. New parser for each page with...
... Obviously safe, costs some memory, shouldn't affect performance except for the cost of creating the schema object. ... Safe provided the schema is not...
Hello! By the third way I mean that _each_ thread will have its own Parser and Schema object, created once in the thread's constructor. In the run method only parse() ...
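A minimal sketch of that third arrangement might look like the following. The names `PageWorker`, `newJdkReader`, and `countElements` are made up for the illustration, and the JDK's own SAX parser stands in for TagSoup so the example runs without extra jars. Each worker thread obtains its reader once in the constructor and only calls parse() inside run(), so no parser is ever shared between threads.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical worker: one parser per thread, created once and then reused.
final class PageWorker extends Thread {
    private final XMLReader reader;     // owned by this worker only, never shared
    private final Queue<String> pages;  // shared, thread-safe work queue

    PageWorker(Supplier<XMLReader> readerFactory, Queue<String> pages, AtomicInteger elements) {
        this.reader = readerFactory.get();
        this.pages = pages;
        this.reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                elements.incrementAndGet();
            }
        });
    }

    @Override
    public void run() {
        // Only parse() is called here; the reader is reused for every page.
        for (String page = pages.poll(); page != null; page = pages.poll()) {
            try {
                reader.parse(new InputSource(new StringReader(page)));
            } catch (Exception e) {
                System.err.println("skipping page: " + e);
            }
        }
    }

    // Stand-in factory that uses the JDK parser so the sketch needs no extra jars.
    static XMLReader newJdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Demo: parse all pages on the given number of threads, return the total element count.
    static int countElements(List<String> pageList, int threads) throws InterruptedException {
        Queue<String> pages = new ConcurrentLinkedQueue<>(pageList);
        AtomicInteger elements = new AtomicInteger();
        List<PageWorker> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            PageWorker worker = new PageWorker(PageWorker::newJdkReader, pages, elements);
            workers.add(worker);
            worker.start();
        }
        for (PageWorker worker : workers) {
            worker.join();
        }
        return elements.get();
    }
}
```

With TagSoup on the classpath, the factory argument would presumably become `org.ccil.cowan.tagsoup.Parser::new`, with each worker's own schema object created alongside its parser.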
... Yes, that is perfectly safe. As usual there are many wrong ways, but several right ways with different trade-offs. ... Huh. I wonder if you are holding...
I'm using TagSoup to read HTML e-mails into a Lucene application, which indexes the text content from them. In essence, I have org.ccil.cowan.tagsoup.Parser...
... TagSoup already recognizes comments as such; that is, it passes them back only to an optional LexicalHandler, not as character data. The exception is...
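The split between character data and comments can be seen with a small handler like the one below. It is a sketch under assumptions: the class name `TextOnlyExtractor` is invented for the example, and the demo runs against the JDK's SAX parser on well-formed input rather than against TagSoup. The property name is the standard SAX2 lexical-handler property; comments arrive through `comment()` and never through `characters()`, so an indexer that only collects `characters()` does not see them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: gathers text for indexing and records comments separately.
final class TextOnlyExtractor extends DefaultHandler2 {
    final StringBuilder text = new StringBuilder();
    final List<String> comments = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // ordinary character data only
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length)); // delivered via LexicalHandler
    }

    // Runs the given reader over the markup with this class as both handlers.
    static TextOnlyExtractor extract(XMLReader reader, String markup) throws Exception {
        TextOnlyExtractor handler = new TextOnlyExtractor();
        reader.setContentHandler(handler);
        // Standard SAX2 property; without it, comments are simply dropped.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler;
    }

    // Demo with the JDK parser so the sketch runs without extra jars.
    static TextOnlyExtractor demo(String markup) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        return extract(reader, markup);
    }
}
```

Passing a `new org.ccil.cowan.tagsoup.Parser()` to `extract` should behave the same way for ordinary comments; the exception mentioned in the message above is not covered by this sketch.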
... apparent ... Hmmm. OK, I'm in the source tree for 1.0rc6 for the first time, and I'm looking at src/definitions/html.stml, but there's nothing very ...
I see you setFlags(0) the ElementTypes for "script" and "style" in the static HTMLSchema instance in CommandLine when you set --nocdata, but I can't see where...
... HTMLSchema is generated from a Java template in src/templates/org/ccil/cowan/tagsoup/HTMLSchema.java and the XML file src/definitions/html.stml. There is...
When you talk about removing "type='cdata' attributes" from src/definitions/html.stml, do you mean commenting out <action id='A_CDATA'/> ? Sorry to be such a...