... I had formed the impression (not sure from where) that the tagsoup 'rectifier' did no lookahead, but I can't square that with the following...
... Correct. ... The document root start-tag is magic, and corresponding end-tags are ignored. So the first <html> is the document root, the first </html> is...
Hi John and fellow members, I'm working on a program to analyze web page structural similarity and am currently using TagSoup as the HTML parser and JDOM to form...
... That's a JDOM issue; JDOM wants the comments and asks TagSoup for them. -- De plichten van een docent zijn divers, John Cowan die van het gehoor...
Hi. Is there any way to tell tagsoup to remove any HTML comments it finds in the input document? I have been unable to find this in the documentation. Thanks...
... If you mean the command-line program, they are removed by default. If you mean the library, it's all about whether you register a LexicalHandler with the...
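The point about registering a LexicalHandler can be sketched as follows. This is a minimal illustration, not code from the thread: to stay self-contained it uses the JDK's built-in SAX parser on well-formed input, and the class and method names (`CommentCapture`, `collectComments`, `newReader`) are made up for the example. TagSoup's parser is also a SAX `XMLReader`, so under that assumption the same standard property would be set on a `new org.ccil.cowan.tagsoup.Parser()` instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentCapture {

    /** Hypothetical handler: records the text of every comment it is told about. */
    static class CommentCollector extends DefaultHandler2 {
        final List<String> comments = new ArrayList<>();

        @Override
        public void comment(char[] ch, int start, int length) {
            comments.add(new String(ch, start, length));
        }
    }

    /** Creates a fresh JDK SAX reader (stand-in for a TagSoup Parser instance). */
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /**
     * Parses the markup and returns the comments seen. Comments are only
     * reported when the handler is registered under the standard SAX
     * lexical-handler property; a ContentHandler alone never receives them.
     */
    static List<String> collectComments(XMLReader reader, String markup,
                                        boolean registerLexicalHandler) throws Exception {
        CommentCollector collector = new CommentCollector();
        reader.setContentHandler(collector);
        if (registerLexicalHandler) {
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
        }
        reader.parse(new InputSource(new StringReader(markup)));
        return collector.comments;
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><!-- keep me --><body>text</body></html>";
        System.out.println("with handler:    " + collectComments(newReader(), markup, true));
        System.out.println("without handler: " + collectComments(newReader(), markup, false));
    }
}
```

Leaving the property unset is therefore the way to drop comments when using the library through plain SAX; a DOM builder such as JDOM may register its own lexical handler on your behalf.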
Hi, It's not clear to me from either the license or the source code comments whether tagsoup is copyrighted, and whether it requires attribution if used in a...
... It is copyrighted by me (there are no copyright notices at present, but they are not actually required; I'll be adding new ones in the fairly near future)....
... Span is for inline content, div for block content, so yes, that happens. If you don't like it, change the schema. -- John Cowan http://ccil.org/~cowan...
Hi, I am using TagSoup, Dom4J and Jaxen to parse various web-pages and pull out some key pieces of data. Mostly, and mostly thanks to TagSoup itself of course,...
Hi. I am having a problem with conversion of HTML entities. The specific entity that is causing me problems at the moment is �. When I try to...
Jaran Nilsen
jaran.nilsen@...
Sep 3, 2007 9:16 am
936
... Just set the output encoding to something other than UTF-8. It has to be something your Java VM understands; US-ASCII will always work. -- John Cowan...
My input documents are Russian, Chinese and whatnot, so I fear US-ASCII will not do me much good? Or am I wrong? First thing I do when I download the documents...
Jaran Nilsen
jaran.nilsen@...
Sep 3, 2007 7:27 pm
938
... No, it's the *output* encoding that controls whether character references are generated. TagSoup doesn't know which encodings can support which characters...
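The general idea that the output encoding, not the input language, decides when character references appear can be illustrated with a small sketch. It is not TagSoup's implementation (the reply above suggests TagSoup does not track which encodings support which characters); it uses Java's `CharsetEncoder.canEncode` purely as an assumption-laden illustration, and `CharRefEscaper` and `escapeFor` are hypothetical names. Markup characters such as `&` and `<` are deliberately not handled here.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharRefEscaper {

    /**
     * Returns the text with every character that the named output encoding
     * cannot represent replaced by a decimal numeric character reference.
     * Characters the encoding can represent are copied through unchanged.
     */
    static String escapeFor(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Work in code points so surrogate pairs become one reference.
            int codePoint = text.codePointAt(i);
            int width = Character.charCount(codePoint);
            String piece = text.substring(i, i + width);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 \u4e2d\u6587";
        // Same input, different output encodings.
        System.out.println(escapeFor(sample, "US-ASCII"));
        System.out.println(escapeFor(sample, "UTF-8"));
    }
}
```

Under this scheme Russian or Chinese input survives a US-ASCII output encoding without loss, because every non-ASCII character is written as a reference that an XML or HTML consumer resolves back to the original character.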
Ok, I will see if I can solve it somehow. Thanks a lot for your input :) Jaran...
Jaran Nilsen
jaran.nilsen@...
Sep 4, 2007 6:16 am
940
... This bug breaks tagsoup for my use. I am willing to help fix it. Is the bug in definitions/html.stml? Or I could fall back to version 1.0. Where is there a...
... Very probably. I'll try to look for it this weekend, if I can. ... http://www.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0.jar . But I make no guarantees that...
Hi, I have an application that receives HTML from a Mozilla application; consequently the structure of the HTML is valid (in that all tags have...
... I assume you mean that it has unmarked empty-tags, unquoted attribute values, short-form attributes like "checked" for "checked='checked'" and the like....
Hi, ... Yes, all those sorts of things; our application is a crawler, so it gets exposed to all sorts of pages. For example, the first error I get in my current...
Hi there. There were signals on the Nutch mailing list that TagSoup forces entity substitution in URIs. This indeed seems to be the case -- not good for the ...
When using TagSoup from the command line, one can use the --lexical option to have it report comments. How does one do this programmatically? I tried just...
Elliotte Harold - jav...
elharo@...
Nov 2, 2007 8:16 pm
954
Disregard previous message. The comments are going through. The bug (as yet unidentified) is not where we thought it is, but does not seem to be in TagSoup. --...
Elliotte Harold - jav...
elharo@...
Nov 2, 2007 8:32 pm
964
Hello, I've downloaded TagSoup version 1.1.3 from the TagSoup homepage and am able to use it with Java 6. Now I've tried to use it with Java 5 (1.5.0_12),...
Ole Laurisch
ole.laurisch@...
Dec 12, 2007 4:29 pm
965
Hm... works for me on Java 1.5.0_14. Just to get the ball rolling, I'll ask the usual starter question (with no insult meant): are you sure you have...
For interest's sake, in our app we tell TagSoup to retain comments from the HTML input like this: XMLWriter xmlWriter; try { xmlWriter = new XMLWriter(new ...
Hi Mark, at first I wanted to answer "Hey, c'mon! Sure I have the tagsoup jar in my classpath", but then I double checked it and found out the following. All...
Ole Laurisch
ole.laurisch@...
Dec 13, 2007 7:45 am
969
I have found a web site [http://canada.com/] which uses '<?xml:namespace prefix = cwi />' in many of its pages, including its main page. The pages start with...
... Just a minor nitpick: ... It is actually not even well-formed XML: processing instructions cannot have a target that starts with 'xml' (case insensitive),...