When I do an octal dump I get 0302, 0240 (i.e. C2, A0). Is that what you write as U+00A0? Is that the UTF-8 encoding thereof? It would hardly surprise me if...
... Yes, exactly. ... The output of the TagSoup main program is always UTF-8; you may need to tell JTidy that. ... I'd write a replacement main() method. -- ...
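As a quick check of the byte values in this exchange, here is a minimal Java sketch that encodes U+00A0 (no-break space) as UTF-8 and prints the bytes in hex and in octal. The class and method names (NbspBytes, utf8Hex, utf8Octal) are made up for illustration; they are not part of TagSoup or JTidy.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class NbspBytes {
    // Returns the UTF-8 bytes of the text as space-separated two-digit hex values.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    // Returns the same bytes as four-digit octal values, the way an octal dump shows them.
    static String utf8Octal(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04o", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";              // the no-break space character
        System.out.println(utf8Hex(nbsp));   // prints: C2 A0
        System.out.println(utf8Octal(nbsp)); // prints: 0302 0240
    }
}
```

A code point in the range U+0080 to U+07FF always takes two bytes in UTF-8, which is why a single character shows up as two values in the dump.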
Hello, Here is the HTML snippet I tried to tagsoup: <td width="435"><nobr><a href="/"><img src="http://g.delfi.lv/d/h/news_on.gif" border=0 alt="Ziņas" width=78...
I've put together a simple, fairly forgiving SAX2-style HTML/XML parser in Python, may be of interest here. As a demo there's a simple RSS aggregator. ...
Hi, The Water language currently uses Tidy for converting HTML to XHTML, and I'd like to move to TagSoup because: TagSoup should never fail to return TagSoup...
TagSoup 0.9.5 is now available at the usual place, http://www.ccil.org/~cowan/XML/tagsoup . This is a bug-fix release, but the bug goes right back to the...
I am using TagSoup with org.apache.xalan.xsltc.trax.SAX2DOM to read Web Pages and create DOM Documents in order to parse data. I have come across web pages on...
... This is a known problem which will be fixed in the next release, which I expect to have out shortly. Attribute names beginning with digits will be changed...
TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish,...
Thank you for giving me something amusing to google. I've had a few of those in my day. :-) Howard...
Howard Katz
howardk@...
Aug 13, 2004 10:05 pm
99
Help! I'm trying to compile the example shown on Hackdiary, http://www.hackdiary.com/archives/000041.html I loaded perl. I loaded Xalan.jar in \lib under my...
... Discard all the classes in the tagsoup-0.9.7/src/java/.../test directory; they were released by accident. I have now yanked them from both the source and...
Hi, The XML header tag is added a second time when it already exists. Run tagsoup on the attached testfile to see... Thanks for your help, Sytse Hengeveld ... ...
sytse@...
Aug 16, 2004 12:46 pm
102
... Thanks. This is one of the last remaining well-formedness bugs, and it'll be fixed in the next release (I would have fixed it in this one, except for some...
I think I see what's happening. According to the HTML DTD, NOSCRIPT is not allowed in the HEAD. Therefore, TagSoup closes the HEAD as soon as it sees...
Elliotte Rusty Harold
elharo@...
Aug 17, 2004 9:00 pm
104
Hi, There is a problem with the self closing tag. For example, if there is a self closing script tag, the slash in the script tag is being removed and a close ...
sytse@...
Aug 17, 2004 10:02 pm
105
... I think this is a general issue with TagSoup that it doesn't really recognize XMLish syntax such as empty-element tags. As more and more XML gets mixed...
Elliotte Rusty Harold
elharo@...
Aug 17, 2004 10:32 pm
106
I've noticed TagSoup generates XML declarations in the files it outputs. That's fine. However, when these files get resouped (fed back into TagSoup a second...
Elliotte Rusty Harold
elharo@...
Aug 17, 2004 10:32 pm
107
... TagSoup processes broken HTML, not XML, and the self-closing (or empty) tag of XML is unknown in HTML -- it's treated as a malformed open tag. I agree that...
I was trying a simple hello world tagsoup example similar to that found at: http://www.hackdiary.com/archives/000041.html My code looks like (imports left...
... This smells like a classic XPath problem that has nothing to do with tag soup. In brief, if the elements are in the default namespace, as they are in...
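The default-namespace pitfall can be reproduced without TagSoup at all. The sketch below is an illustration under stated assumptions: it parses a small hand-written document whose elements are in the XHTML namespace (assumed here to be the namespace in question) with the JDK's own DOM parser, then evaluates an unprefixed and a prefixed XPath. The class name XPathNamespaceDemo, its helper methods and the sample document are hypothetical; the prefix "html" is an arbitrary choice bound through a NamespaceContext.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class XPathNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\"><body><p>Hello</p></body></html>";

    // Parses XML text into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression with the prefix "html" bound to the XHTML namespace.
    static String eval(String expression, Document doc) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singleton(prefix).iterator();
            }
        });
        try {
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        // Unprefixed names match only elements in no namespace, so nothing is selected.
        System.out.println("[" + eval("/html/body/p", doc) + "]");                // prints: []
        // Prefixed names are resolved through the NamespaceContext and match.
        System.out.println("[" + eval("/html:html/html:body/html:p", doc) + "]"); // prints: [Hello]
    }
}
```

In XPath 1.0 an unprefixed name test never picks up the document's default namespace, so the prefix in the expression is required even though the document itself uses none.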
Elliotte Rusty Harold
elharo@...
Aug 22, 2004 8:02 pm
112
... Yes, the XPath probably needs a prefix (the original example uses html), but unfortunately I don't get that far. It throws the exception in p.parse(...) which...
... The current version of TagSoup doesn't treat ":" as special in names. Consequently, it's returning the namespace name as empty and the local name as...
Hi all, This may be related to the XML namespace issue. When TagSoup parses a tag which contains a bare colon: <meta :> it outputs this, which isn't valid HTML...
... It definitely is well-formed XML 1.0 (though not namespace-correct). It isn't tag-valid SGML only because ":" is not a name character in the default SGML...
I've had a private inquiry whether anyone is using TagSoup with Xalan. "I don't know", said I; "I'll ask." So I ask. -- Si hoc legere scis, nimium eruditionis...
Hi John, ... Xalan. ... We are, but not directly using the SAX interface (yet). We are converting the output of TagSoup back to a string, and then parsing it...