Hi, I am using TagSoup, Dom4J and Jaxen to parse various web-pages and pull out some key pieces of data. Mostly, and mostly thanks to TagSoup itself of course,...
Hi. I am having a problem with the conversion of HTML entities. The specific entity that is causing me problems at the moment is the entity �. When I try to...
Jaran Nilsen
jaran.nilsen@...
Sep 3, 2007 9:16 am
936
... Just set the output encoding to something other than UTF-8. It has to be something your Java VM understands; US-ASCII will always work. -- John Cowan...
My input documents are Russian, Chinese and whatnot, so I fear US-ASCII will not do me much good? Or am I wrong? First thing I do when I download the documents...
Jaran Nilsen
jaran.nilsen@...
Sep 3, 2007 7:27 pm
938
... No, it's the *output* encoding that controls whether character references are generated. TagSoup doesn't know which encodings can support which characters...
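To illustrate the point that the output encoding decides when character references appear, here is a minimal sketch using only the JDK. It is not TagSoup's own logic (the message above suggests TagSoup does not consult per-character encoding support at all); the class name CharRefSketch and the method escapeFor are hypothetical. It shows that Russian or other non-ASCII text survives a US-ASCII output encoding, because each character the encoding cannot hold is written as a numeric character reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharRefSketch {
    // Hypothetical helper: replaces every code point the target charset
    // cannot encode with a decimal numeric character reference.
    // It does not escape markup characters such as '&' or '<'.
    public static String escapeFor(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 \u043c\u0438\u0440";
        // With US-ASCII, non-ASCII characters become references.
        System.out.println(escapeFor(sample, "US-ASCII"));
        // With UTF-8, every character is encodable, so nothing is escaped.
        System.out.println(escapeFor(sample, "UTF-8").equals(sample));
    }
}
```

An XML parser reading the US-ASCII output resolves the references back to the original characters, so no text is lost.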
Ok, I will see if I can solve it somehow. Thanks a lot for your input :) Jaran...
Jaran Nilsen
jaran.nilsen@...
Sep 4, 2007 6:16 am
940
... This bug breaks tagsoup for my use. I am willing to help fix it. Is the bug in definitions/html.stml? Or I could fall back to version 1.0. Where is there a...
... Very probably. I'll try to look for it this weekend, if I can. ... http://www.ccil.org/~cowan/XML/tagsoup/tagsoup-1.0.jar . But I make no guarantees that...
Hi, I have an application that receives HTML produced by a Mozilla application, so the structure of the HTML is valid (in that all tags have...
... I assume you mean that it has unmarked empty-tags, unquoted attribute values, short-form attributes like "checked" for "checked='checked'" and the like....
Hi, ... Yes, all those sorts of things. Our application is a crawler, so it gets exposed to all sorts of pages. For example, the first error I get in my current...
Hi there. There were signals on the Nutch mailing list that TagSoup forces entity substitution in URIs. This indeed seems to be the case -- not good for the ...
When using TagSoup from the command line, one can use the --lexical option to have it report comments. How does one do this programmatically? I tried just...
Elliotte Harold - jav...
elharo@...
Nov 2, 2007 8:16 pm
954
Disregard previous message. The comments are going through. The bug (as yet unidentified) is not where we thought it is, but does not seem to be in TagSoup. --...
Elliotte Harold - jav...
elharo@...
Nov 2, 2007 8:32 pm
964
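On the question of receiving comments programmatically: in SAX2 generally, comments are delivered to a LexicalHandler registered through the standard "http://xml.org/sax/properties/lexical-handler" property. The sketch below uses the JDK's built-in parser on well-formed input as a stand-in, so it runs without any extra jars. That TagSoup's parser accepts the same property is an assumption here, not something confirmed in this thread. The class name CommentSketch and the method collectComments are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class CommentSketch {
    // Hypothetical helper: parses the given XML text and returns the
    // contents of every comment reported by the parser.
    public static List<String> collectComments(String xml) throws Exception {
        final List<String> comments = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length));
            }
        };
        // Stand-in reader from the JDK; a TagSoup parser would presumably
        // be substituted here (assumption).
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Comments are only reported once a lexical handler is registered.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectComments("<p>text<!-- note --></p>"));
    }
}
```

Setting only the content handler is not enough: without the property, a SAX parser silently drops comments.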
Hello, I've downloaded TagSoup version 1.1.3 from the TagSoup homepage and am able to use it with Java 6. Now I've tried to use it with Java 5 (1.5.0_12),...
Ole Laurisch
ole.laurisch@...
Dec 12, 2007 4:29 pm
965
Hm... works for me on Java 1.5.0_14. Just to get the ball rolling, I'll ask the usual starter question (with no insult meant): are you sure you have...
For interest's sake, in our app we tell TagSoup to retain comments from the HTML input like this: XMLWriter xmlWriter; try { xmlWriter = new XMLWriter(new ...
Hi Mark, at first I wanted to answer "Hey, c'mon! Sure I have the tagsoup jar in my classpath", but then I double checked it and found out the following. All...
Ole Laurisch
ole.laurisch@...
Dec 13, 2007 7:45 am
969
I have found a web site [http://canada.com/] which uses '<?xml:namespace prefix = cwi />' in many of its pages, including its main page. The pages start with...
... Just a minor nitpick: ... It is actually not even well-formed XML: processing instructions cannot have a target that starts with 'xml' (case insensitive),...
... Right. Considered without regard to case: <?xml ...> is not well formed as a PI (the XML declaration is not a PI); <?xml:foo ...?> is XML well-formed, but...
... You are absolutely correct. :-) My mistake -- I did mix up rules for PI names and restrictions on reserved namespace prefixes (where anything starting with...
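The distinction settled in this exchange can be sketched as a small check. It relies on two rules from the specifications: XML 1.0 excludes only the exact target "xml" (in any letter case) from processing instructions, and Namespaces in XML additionally requires that processing instruction targets contain no colon. The class name PiTargetSketch and its methods are hypothetical, and the sketch does not verify that the target is otherwise a valid XML Name.

```java
public class PiTargetSketch {
    // XML 1.0: the target "xml", in any letter case, is not allowed.
    public static boolean isReservedXmlTarget(String target) {
        return target.equalsIgnoreCase("xml");
    }

    // Namespaces in XML: a PI target must not contain a colon.
    public static boolean isNamespaceClean(String target) {
        return target.indexOf(':') < 0;
    }

    // Hypothetical summary of both rules for a single PI target.
    public static String classify(String target) {
        if (isReservedXmlTarget(target)) {
            return "not well-formed";
        }
        if (!isNamespaceClean(target)) {
            return "well-formed, not namespace-well-formed";
        }
        return "well-formed";
    }

    public static void main(String[] args) {
        System.out.println(classify("XML"));
        System.out.println(classify("xml:namespace"));
        System.out.println(classify("xml-stylesheet"));
    }
}
```

Under these rules the canada.com target "xml:namespace" passes the plain XML check but fails the namespace one.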
As a New Year's present to the TagSoup community (and to fulfill a pre-New-Year resolution of mine), I've completed development work on TagSoup 1.2. This is...
There are a great many changes, most of them fixes for long-standing bugs, in this release. Only the most important are listed here; for the rest, see the...
... Thanks. -- John Cowan cowan@... http://ccil.org/~cowan Female celebrity stalker, on a hot morning in Cairo: "Imagine, Colonel Lawrence, ninety-two...