That would be helpful to Maven users. I can help with any questions you have on the process. Here's the basic FAQ with links to the information you'll need: ...
... On the whole I'd rather have someone else do this (you, maybe?). I don't expect a whole lot of new releases, only bug fixes from here on out. -- John Cowan...
I want &nbsp; to be translated to 0x20 rather than 0xa0. Do I simply comment out... <entity name='nbsp' codepoint='00A0'/> ... in html.tssl for this and...
... The problem is that U+0020 is very different to U+00A0. The SAX API does allow you to intercept entities using the lexical handler. You should be able to...
David Pashley
david@...
Jul 13, 2006 11:43 pm
555
... If you do that, it will return " " (that is, the same as &nbsp; would). ... That will work, but it's IMHO better to use Java operations on the...
There are probably some other ugly non-entity characters that I ought to clean too, like Microsoft smart quotes. I'll put some substitution into my SAX...
... In my experience of writing SAX parsers, the entities would have been expanded by the time the characters() method is called. Unknown entities would throw a...
David Pashley
david@...
Jul 14, 2006 9:55 am
558
... The first point is quite true: however, I assume that Rob wants to clean all NBSPs, not merely those specified using an entity. TagSoup never throws a...
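The substitution discussed in this thread (cleaning every NBSP in character data, whether or not it came from an entity) could be sketched as a SAX filter. This is a minimal sketch, not code from the thread: the class name `NbspFilter` and the helper `clean` are hypothetical, the smart-quote mappings are an assumed extension of the idea, and it assumes the filter sits between the parser and the downstream ContentHandler.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: rewrites character data before passing it downstream.
public class NbspFilter extends XMLFilterImpl {

    // Replace U+00A0 with U+0020 and curly quotes with their ASCII forms.
    // The quote mappings are an assumption, not something the thread specifies.
    static String clean(String text) {
        return text
            .replace('\u00A0', ' ')
            .replace('\u2018', '\'')
            .replace('\u2019', '\'')
            .replace('\u201C', '"')
            .replace('\u201D', '"');
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Work on a copy so the parser's own buffer is never modified.
        char[] cleaned = clean(new String(ch, start, length)).toCharArray();
        super.characters(cleaned, 0, cleaned.length);
    }
}
```

Because the replacement happens in characters(), it applies equally to NBSPs written as entities and to literal U+00A0 characters in the input.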
I've loaded a page and used SAX2DOM to create a DOM tree. I then used XPathAPI.selectSingleNode to get a starting point and traversed the subtree. Curiously,...
... Not without something to work from. I need the input page and some information on what XPaths returned what, or a dump of the DOM generated by SAX2DOM as...
Hi, I would like to make TagSoup blind to user tags. For example, in the following HTML, I would like it to simply ignore (AKA be blind to) the <tag> tag in the...
... on out. ... And these many months later I finally remember this conversation and do something about it: http://jira.codehaus.org/browse/MAVENUPLOAD-1127 ...
Hello, I'm trying to make the handling of < characters more forgiving. By default a < surrounded by space seems to get converted to a &lt; which is good. But...
... Just to explain this output, I'm pretty much just outputting XML as it comes through, so basically TagSoup is interpreting <- as the start of a tag called...
... Fair enough. It's really, really hard for the code to decide which uses of < are plausible tags or other things and which are not, since it proceeds like...
... No, not in xml (it is legal after first char though) ... I guess so, since underscore is legal as the first name char. On the other hand, all HTML tags...
Hi - we're using TagSoup happily with the Xalan XSLT replacement, and we're wondering about the bug that makes the default version not work correctly... Is it...
... The bug is about building, not about using; TagSoup doesn't do any XSLT at run time. As for why the XSLT building transform doesn't work with the default...
Hello, I just started using TagSoup, so I don't know if this is normal behavior, a bug, or wrong arguments. I'm using the version tagsoup-1.0.1.jar and here is...
... I admit that's not very good, but it's not clear what general method would be better. Currently TagSoup assumes that "0 CELLPADDING=" is the value of the...
Hi there! We are using TagSoup for our Web crawler, and we found that for the page at http://www.borngayprocon.org/ TagSoup considers <!-[if IE]> as a comment, and ...
Eugeny N Dzhurinsky
bofh@...
Dec 7, 2006 9:10 am
590
Hi! I have recently come across TagSoup and want to see whether I can use it instead of JTidy. I need to be able to clean up HTML documents in a wide range of ...
... That is because TagSoup does not know which characters can be safely written to which encodings, so it plays safe and uses character references for all...
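For readers who want fewer character references in their output, one possible approach is to ask the JDK whether the target encoding can represent each character and escape only those it cannot. This is a sketch under stated assumptions, not TagSoup's behavior or API: the class `EncodingAwareEscaper` and its `escape` method are hypothetical and would be applied by your own serializer.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper: escapes only what the target charset cannot encode.
public class EncodingAwareEscaper {

    static String escape(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (encoder.canEncode(text.subSequence(i, i + len))) {
                // Representable in the target encoding: write it literally.
                out.appendCodePoint(cp);
            } else {
                // Not representable: fall back to a decimal character reference.
                out.append("&#").append(cp).append(';');
            }
            i += len;
        }
        return out.toString();
    }
}
```

With US-ASCII as the target, any non-ASCII character becomes a numeric reference, while with ISO-8859-1 or UTF-8 more characters pass through unchanged.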
I brought up conditional IE comments a while back. I showed, using some pathological examples of IE conditionals, that it's impossible to generate proper SAX events if...
... Quite so. But there is a bug involving comments that lack the second minus sign: <!-foo--> causes TagSoup to malfunction. -- John Cowan cowan@......