All versions of TagSoup are now, by the wave of my magic wand, licensed under the Apache 2.0 license as well as the Academic Free License 3.0 and the GNU GPL...
I was using TagSoup to parse some HTML and kept getting funny errors, and I finally realized that it was attempting to parse the tags within comments as well....
... Short answer: It's because TagSoup by default transforms the page to XHTML. If you want HTML output, use the --html switch. Long answer: Comments inside...
... That is indeed the issue. ... Definitely not going there! ... This might or might not help; I'm not sure. Essentially it would be a change in parsing...
Forgot to add: I do see why comments have to get escaped within JavaScript blocks for XHTML. But it seems like comment nodes don't even make it into the ...
... Depending on the DOM implementation (TagSoup doesn't have a DOM implementation itself) you may need to tell it to tell TagSoup to report lexical features,...
... Back in the day, the day being 1997 when we had to support Netscape Navigator 3 and Internet Explorer 3, this was a JavaScript FAQ: it was essential to use...
Nick Fitzsimons, nick@..., Jun 6, 2007 11:51 pm, #855
... Exactly! ... And so is wrapping JavaScript in comment delimiters for the sake of unbelievably ancient browsers. -- John Cowan cowan@......
Note: forwarded message attached. ... Hi All, I am new to TagSoup, I need your help in using TagSoup....
... You need to understand how to use SAX parsers in general. Start at sax.sourceforge.net, or google for "SAX tutorial". -- Principles. You can't say A is...
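As a starting point for the SAX advice above, here is a minimal sketch of the event-driven style: a handler receives a callback for each start-tag and records the element name. It uses the JDK's built-in parser as a stand-in so that it runs without extra jars; the class and method names (`SaxElementLister`, `elementNames`) are made up for illustration. With TagSoup on the classpath, the reader would presumably be created with `new org.ccil.cowan.tagsoup.Parser()` instead, since that class is also a SAX `XMLReader`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of TagSoup or SAX.
final class SaxElementLister {

    /** Returns the element names of a well-formed document, in start-tag order. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            // JDK parser used as a stand-in for any SAX XMLReader.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // Called once per start-tag as the parser streams the input.
                    names.add(qName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<html><body><p>Hello</p></body></html>"));
    }
}
```

The same handler code should work unchanged whichever `XMLReader` produces the events, which is the main reason to learn SAX first.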
Hi All, For conversion from HTML to XHTML, I am using TagSoup, but it does not work well with MathML tags. For example, if my tags are like: <html><mathml><mstyle...
... Right. TagSoup does not currently handle foreign tagsets very well. Someone could write a MathML schema in TagSoup Schema Language, but it would also be...
Thanks John, Are you aware of any parser which supports foreign or MathML tags? Does JTidy support this? --Savitha John Cowan <cowan@...> wrote: ... ...
... I had formed the impression (not sure from where) that the TagSoup 'rectifier' did no lookahead, but I can't square that with the following...
... Correct. ... The document root start-tag is magic, and corresponding end-tags are ignored. So the first <html> is the document root, the first </html> is...
Hi John and fellow members, I'm working on a program to analyze web page structural similarity and am currently using TagSoup as the HTML parser and JDOM to form...
... That's a JDOM issue; JDOM wants the comments and asks TagSoup for them. -- John Cowan...
Hi. Is there any way to tell TagSoup to remove any HTML comments it finds in the input document? I have been unable to find this in the documentation. Thanks...
... If you mean the command-line program, they are removed by default. If you mean the library, it's all about whether you register a LexicalHandler with the...
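To illustrate the library case, the sketch below shows how comment reporting depends on whether a `LexicalHandler` is registered: with one, each comment arrives as a callback; without one, comments are simply never delivered. It uses the JDK's SAX parser and the standard SAX2 property name so that it is self-contained; the class name `CommentToggle` and its `comments` method are hypothetical. The assumption is that a TagSoup `Parser` would be configured through the same `setProperty` call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper, not part of TagSoup or SAX.
final class CommentToggle {

    // Standard SAX2 property under which a LexicalHandler is registered.
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    /** Returns the comments reported by the parser; empty if no handler is registered. */
    static List<String> comments(String xml, boolean registerLexicalHandler) {
        List<String> seen = new ArrayList<>();
        try {
            // JDK parser used as a stand-in for any SAX XMLReader.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            if (registerLexicalHandler) {
                // DefaultHandler2 implements LexicalHandler with no-op methods.
                reader.setProperty(LEXICAL_HANDLER, new DefaultHandler2() {
                    @Override
                    public void comment(char[] ch, int start, int length) {
                        seen.add(new String(ch, start, length));
                    }
                });
            }
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String doc = "<p><!-- note -->text</p>";
        System.out.println("with handler:    " + comments(doc, true));
        System.out.println("without handler: " + comments(doc, false));
    }
}
```

So to drop comments when using the library, the simplest route is to not register a `LexicalHandler` at all, or to check that the DOM builder in use is not registering one on your behalf.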
Hi, It's not clear to me from either the license or the source code comments whether TagSoup is copyrighted, and whether it requires attribution if used in a...
... It is copyrighted by me (there are no copyright notices at present, but they are not actually required; I'll be adding new ones in the fairly near future)....
... Span is for inline content, div for block content, so yes, that happens. If you don't like it, change the schema. -- John Cowan http://ccil.org/~cowan...
Hi, I am using TagSoup, Dom4J and Jaxen to parse various web-pages and pull out some key pieces of data. Mostly, and mostly thanks to TagSoup itself of course,...
Hi. I am having a problem with conversion of HTML entities. The specific entity that is causing me problems at the moment is the entity �. When I try to...
Jaran Nilsen, jaran.nilsen@..., Sep 3, 2007 9:16 am, #936
... Just set the output encoding to something other than UTF-8. It has to be something your Java VM understands; US-ASCII will always work. -- John Cowan...
My input documents are Russian, Chinese and whatnot, so I fear US-ASCII will not do me much good? Or am I wrong? First thing I do when I download the documents...
Jaran Nilsen, jaran.nilsen@..., Sep 3, 2007 7:27 pm, #938
... No, it's the *output* encoding that controls whether character references are generated. TagSoup doesn't know which encodings can support which characters...
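The point about the output encoding can be illustrated with a small sketch. This is not TagSoup's code: it is a general illustration in which a writer emits a numeric character reference for any character the chosen output encoding cannot represent. The use of `CharsetEncoder.canEncode` is an assumption made for the example, as are the names `CharRefSketch` and `encodeFor`. Note that Russian or Chinese input is not lost with US-ASCII output; every non-ASCII character just becomes a reference such as `&#233;`.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper; illustrates the idea only, not TagSoup's implementation.
final class CharRefSketch {

    /**
     * Copies text, replacing each code point that the output encoding cannot
     * represent with a decimal numeric character reference.
     * Markup characters such as '&' and '<' are deliberately not handled here.
     */
    static String encodeFor(String text, String outputEncoding) {
        CharsetEncoder encoder = Charset.forName(outputEncoding).newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);                           // representable: write as-is
            } else {
                out.append("&#").append(cp).append(';');  // otherwise: character reference
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9";
        System.out.println(encodeFor(sample, "US-ASCII"));
        System.out.println(encodeFor(sample, "UTF-8"));
    }
}
```

With a UTF-8 output encoding every character is representable, so no references are produced, which matches the behaviour described in the reply above.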