Well, time for a new release of TagSoup. Here are the changes: Version attribute default value removed from html element; leading and trailing hyphens now...
As bbeck pointed out, the 1.0.2 jarfile was built with Java 6, which meant that it would not run on earlier JVMs. 1.0.3 fixes that problem, fixes the...
Does anyone know of a working example of using TagSoup to execute an XPath query against an HTML document (preferably using XOM or dom4j)? Thanks in advance, ...
Hey all, I'd been having some problems getting it to work for some reason. I finally realized I wasn't doing namespace stuff correctly. Here is what I got...
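The namespace pitfall mentioned in the message above can be sketched as follows. This is a minimal illustration, not the poster's actual code: it assumes TagSoup's default behavior of reporting elements in the XHTML namespace (`http://www.w3.org/1999/xhtml`), and to stay self-contained it does not call TagSoup, XOM, or dom4j at all. Instead it parses a hand-written, well-formed stand-in for TagSoup's output with the JDK's own DOM and XPath classes. The class name `XhtmlXPathSketch`, the helper methods, and the prefix `h` are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlXPathSketch {
    // Namespace assumed for the parsed elements (TagSoup's default, per the lead-in).
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Parse a well-formed XML string into a namespace-aware DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Evaluate an XPath expression as a string, with the prefix "h" bound to XHTML_NS.
    static String evaluate(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hand-written stand-in for the kind of tree a namespace-aware HTML parser produces.
        String xml = "<html xmlns=\"" + XHTML_NS + "\">"
                + "<head><title>sample</title></head><body/></html>";
        Document doc = parse(xml);

        // Unprefixed names in XPath 1.0 match only elements in no namespace: no match here.
        System.out.println("[" + evaluate(doc, "/html/head/title") + "]");

        // Prefixed names bound to the XHTML namespace match as expected.
        System.out.println("[" + evaluate(doc, "/h:html/h:head/h:title") + "]");
    }
}
```

The first query yields an empty string because XPath 1.0 treats an unprefixed name as belonging to no namespace. The second succeeds once a prefix is bound to the XHTML namespace. XOM and dom4j each have their own way of supplying that prefix binding, which is not shown here.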
Here's the head of a document I'm playing with: <head> <title>endtag</title> <script type="text/javascript" language="javascript"> /* Only sunsites are allowed...
Elliotte Harold
elharo@...
Feb 18, 2007 6:39 pm
677
... I can't reproduce this problem using either 1.0.1 or 1.0.3 (current) under any of 1.4, 1.5, or 1.6 VMs. I always get "<" no matter what I do. Nor can I...
... OK. Let's look a little closer. First I'm using 1.0.1. Second I just did it again and it didn't trigger. However scrolling back in my terminal buffer it...
Elliotte Harold
elharo@...
Feb 18, 2007 8:20 pm
679
... [snip] ... CDATA, but yes; markup inside script elements isn't interpreted, except for the "</script>" tag. The same is true of style elements, though...
Consider this common JavaScript pattern: <script> <!-- script goes here // --></script> TagSoup is going to decomment that text. This will turn it into ...
Elliotte Harold
elharo@...
Feb 18, 2007 10:06 pm
681
... I think it will just not consider any content within script tag to be markup, and thus there is no comment encountered. And Javascript parser browsers use ...
... Neither Firefox nor IE likes the "<" at all, though the ">" is inside a JavaScript comment and invisible. ... Turning on --html or --method=html is...
... Slightly off topic for TagSoup, but as a JavaScript programmer, can I point out that wrapping the contents of inline scripts in HTML comments hasn't been...
Nick Fitzsimons
nick@...
Feb 19, 2007 4:27 pm
684
This is a bug-fix release. The --method=html switch (and therefore the --html switch as well) now properly suppresses character escaping within script and...
I have a problem: characters that are present in the source HTML (and rendered properly in browsers) are not getting to my program as I expect. I am using...
I figured it was something like that but how do I control it? In fact the messed up characters are in the DOM tree that was created from the SAX2DOM parse...
I notice that in 1.0.3 There is only one new feature in this release, the --output-encoding switch, which allows you to specify the character encoding for...
Elliotte Harold
elharo@...
Feb 24, 2007 4:55 pm
693
... The default encoding is often very useful, and the way to specify it is to leave off the switch. This is true for both --encoding and --output-encoding. ...
Reading the message from Elliotte Harold titled "Re: [tagsoup-friends] output-encoding" led me to a solution. I read the TagSoup command line code to see how...
The W3C Technical Architecture Group has an open issue (http://www.w3.org/2001/tag/issues.html?type=1#TagSoupIntegration-54) on the relationship of HTML, XHTML...
... John Cowan has such a set in his personal possession. However since it's taken from real world web pages, distributing it would involve massive copyright...
Elliotte Harold
elharo@...
Mar 2, 2007 4:17 pm
702
... AFAIK TagSoup and the HTML5 spec are the only contenders. TagSoup has the constraint "quod scripsit, scripsit": it cannot recall SAX events and issue new...
... The HTML 5 spec. is not what I would call declarative -- discursive, more like it. ... Understood. I guess what I am thinking about is not shipping some ...
... I stated the constraint badly: I can and do postpone SAX events, but not character events, since they are unbounded in size. They at least must be...
It's been a long time for me, but doesn't the main verb need to be pluperfect, and the clause in the subjunctive? Quod scripserit, scripserat. ... From:...
... Well, in the Vulgate Pilate says "Quod scripsi, scripsi" = "What I have written, I have written", when the Jews ask him to take down the sign saying "Jesus...
I guess attempting to correct the Bible is pretty much a definition of hubris. Sorry for the distraction. ... From: tagsoup-friends@yahoogroups.com ...
Hi, I am currently using TagSoup 1.0.4. I am having a problem with the XML result tree after parsing the HTML source document. The XML result document will have...
... If you can apply XPath to your input document, then it is already well-formed XML, and TagSoup is not appropriate. The purpose of TagSoup is to process...
I'm using HTMLScanner as the first step in my shift-step experiment, and basically it's working OK (after hacking a workaround to cope with XML-style empty...