Here's a simplified example of the HTML I'm trying to parse: <p> <span id="data"> <p>important information</p> </span> </p> And here's what I get out of...
Tagsoup is right, you're wrong. You can't have block elements such as <p> inside inline elements such as <span>; tagsoup fixes this problem for you. - Godmar...
Godmar Back
godmar@...
Jan 8, 2009 3:57 am
1242
I understand, and that's what I suspected. In this case I'm not interested in correcting the HTML, I simply want to access the contents of the SPAN with id of...
... In order to do that, you have to change the HTML grammar in src/definitions/html.tssl to specify a different language. The simplest way to do that is...
Hey, Thanks for some great software! I'm having some trouble with manipulating HTML by parsing it with tagsoup into a DOM and then writing it again. The main...
... I may at some future date give the table and form elements a content model of M_ANY, since people are quite good about providing the end tags for them. ......
Hi, I did just that - allowed M_ANY within table and tr, and that fixed my problem. Maybe tagsoup should be distributed with such a "relaxed" schema in the...
... No. TagSoup interprets entity references on input, but does not regenerate them on output. But if you set the output encoding to something other than...
I feel like I've seen this discussed at some point in the past 5 years, but I can't remember or find the answer. If an HTML page has an ampersand in the text,...
... Yes, it should be handled (and returned as a raw &, to be escaped on output as &amp;). ... @#$*, I thought I got rid of that class of bug. Apparently the...
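To illustrate the distinction drawn in that reply (a raw & in the parsed text, escaped only when written back out), here is a minimal sketch of output-side escaping. The class and method names (`TextEscaper`, `escapeText`) are hypothetical and are not part of TagSoup; the sketch only assumes that the serializer is the place where escaping happens.

```java
/** Hypothetical helper, not part of TagSoup: escapes character data on output. */
final class TextEscaper {
    private TextEscaper() {}

    /** Replaces the characters that are markup-significant in XML text content. */
    static String escapeText(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break; // raw ampersand from the parser
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

The parser side hands the application plain characters, so a handler sees `AT&T`; only the writer turns it into `AT&amp;T`.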
Hi all, What is the best way to unit test the parser methods like startElement(), endElement(), ... one at a time, and by starting from reading an XML file...
... You got me there. Parsing is inherently a tightly coupled group of behaviors, since everything depends on building up a rather complex and varying state. ...
... Almost by definition unit testing doesn't read files. Passing your own arguments is the right way to *unit* test. That said, it is important to test with...
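As a rough sketch of what "passing your own arguments" could look like for a SAX handler, the callbacks can be invoked directly with hand-built arguments, without any parser or file involved. `SpanTextHandler` and `SpanTextHandlerDemo` are hypothetical names invented for this illustration; only the standard `org.xml.sax` classes shipped with the JDK are used.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler under test: collects the text inside <span id="data">. */
final class SpanTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside the target span

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside the target span
        } else if ("span".equals(qName) && "data".equals(atts.getValue("id"))) {
            depth = 1; // entering the target span
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) depth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) text.append(ch, start, length);
    }

    String collectedText() { return text.toString(); }
}

/** Drives the handler by hand, one callback at a time. */
final class SpanTextHandlerDemo {
    static String run() {
        SpanTextHandler handler = new SpanTextHandler();
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "id", "id", "CDATA", "data");

        handler.startElement("", "span", "span", atts);
        char[] inside = "important information".toCharArray();
        handler.characters(inside, 0, inside.length);
        handler.endElement("", "span", "span");

        // Text delivered after the span has closed should not be collected.
        char[] outside = "ignored".toCharArray();
        handler.characters(outside, 0, outside.length);
        return handler.collectedText();
    }
}
```

Each callback can be asserted on in isolation this way; tests that run real documents through a parser remain useful as a separate layer.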
Elliotte Harold
elharo@...
Feb 16, 2009 2:42 pm
1254
Thank you for your answer. Your proposal suggests that we need an intrusive solution in which we modify the real code to throw exceptions...
Yes, I agree, and that is what I am doing for the time being. I don't read files; I take my test input from strings in the unit tests. BR, CP. ...
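Feeding test input from a string rather than a file might look like the sketch below. It uses the JDK's built-in SAX parser as a stand-in; the class and method names (`StringInputParseDemo`, `elementNames`) are hypothetical, and the same `InputSource` over a `StringReader` should work with any `XMLReader` implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical test helper: parses an in-memory string and records element names. */
final class StringInputParseDemo {
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    names.add(qName); // record elements in document order
                }
            });
            // The input comes from a string held in the test, not from a file.
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }
}
```

A test can then compare the returned list against the element order it expects for a given input string.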
... Without more context, I simply can't say. -- John Cowan cowan@... http://ccil.org/~cowan The penguin geeks is happy / As under the waves they lark ...
With TSaxon, the -H switch allows one to process (ill-formed) HTML files when they are the source. What about when the source file is XML and you're trying to...
... I don't know any way to do that. The -H switch is just shorthand for the Saxon switch '-x org.ccil.cowan.tagsoup.Parser', and that affects both the main...
I want to use TagSoup to process an HTML page (a malformed one), and I got it to work using the command-line -H flag. However, when I tried it in code, following...
As a followup: I ended up having to pass the output from TagSoup v1.2 into a build of htmlTidy in order to get it to parse in TinyXML for certain HTML samples...
... Looks like TinyXML is not a conforming XML parser, if it doesn't understand character references. To get UTF-8 output without entities, though, just...
... Erm, I hate to be slightly rude, but haven't we had the conversation about the command line problems re: output encodings and win32? I started this whole...
The documentation for XMLWriter says * <p>According to the XML Recommendation, <em>all</em> whitespace * in an XML document is potentially significant to an...
... If you look at the Infoset, you'll see that whitespace outside the root element is generally considered nonsignificant, despite the letter of the XML Rec....
... John, Thank you for your quick...
... Sorry, quite right. Since I don't use Windows, I have no idea why the output encoding is broken (if that's really what's happening). Can someone using...