My Soupy Friends, First, I wanted to say that, with all the whining in the RSS world about how expecting all RSS to be well-formed is just Too Much To Ask of...
... Excellent idea! ... TagSoup is a library, not an application, but there is a stub main method which you can invoke thus: java -classpath tagsoup-0.9.jar...
Sorry for the stupid question. But what is RSS? And a question for John.. How is tagsoup different from HTML Tidy? --On Friday, January 23, 2004 9:27 AM -0500...
... It's not quite what you're after, but I did write up a (hopefully) minimal code example at http://www.hackdiary.com/archives/000041.html showing how to use...
... Here's the introduction that I usually point co-workers to: http://www.eevl.ac.uk/rss_primer/ There's been a lot of debate in the RSS/weblogging community...
... Thanks for this pointer. The 0.9 version now allows namespace and namespace-prefixes support to be turned on and off (this does nothing, which is...
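For readers unfamiliar with the two switches mentioned here: in SAX2 they are standard feature URIs set through XMLReader.setFeature. The sketch below is only an illustration. It uses the JDK's bundled parser as a stand-in, since the TagSoup jar is not assumed to be on the classpath, and the class and method names (FeatureToggle, configure, isEnabled) are made up for the example. A TagSoup parser, being a SAX parser, would presumably accept the same calls.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper class; the feature URIs are the standard SAX2 names.
final class FeatureToggle {
    static final String NAMESPACES =
            "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
            "http://xml.org/sax/features/namespace-prefixes";

    /** Returns a JDK XMLReader (stand-in for TagSoup) with both features set. */
    static XMLReader configure(boolean namespaces, boolean prefixes) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, namespaces);
            reader.setFeature(NAMESPACE_PREFIXES, prefixes);
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException("could not configure parser", e);
        }
    }

    /** Reads a feature back, wrapping the checked SAX exceptions. */
    static boolean isEnabled(XMLReader reader, String feature) {
        try {
            return reader.getFeature(feature);
        } catch (SAXException e) {
            throw new IllegalStateException("feature not available: " + feature, e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = configure(true, false);
        System.out.println("namespaces: " + isEnabled(reader, NAMESPACES));
        System.out.println("namespace-prefixes: " + isEnabled(reader, NAMESPACE_PREFIXES));
    }
}
```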
... Tidy (and its Java version JTidy) are full-featured HTML cleanup software; they can do things like transforming old-style markup into CSS, and have some...
... And, as I understand it, Tidy is all about HTML, but TagSoup is configurable to deal with other kinds of ill-formed XML, which is why I suggested that the...
The file tagsoup-0.9.jar is about 20% of the size of the latest Tidy.jar. Of course, it does a whole lot less, too. I hacked Tester.java to write everything to...
The *.java files in the root don't belong there. I have deleted them from tagsoup-0.9.src.zip. Their presence is harmless, so no need to re-fetch the ...
Here's a first cut at explaining how TagSoup rectifies the stream of start-tags and end-tags into well-structured XML. For this purpose, character content is...
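The preview above is cut off before the explanation itself, so the following is not TagSoup's algorithm. It is a deliberately simplified sketch of the general idea of repairing a tag stream with a stack of open elements. It assumes a stray end-tag is dropped, an end-tag closes any elements still open inside it, and everything left open is closed at the end. The class name TagRectifier and the token format are invented for the example, and character content is ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified illustration only; a real parser would also consult a schema
// to decide which elements may contain which.
final class TagRectifier {
    /** Tokens are "<name>" for start-tags and "</name>" for end-tags. */
    static List<String> rectify(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // innermost element first
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue;                      // stray end-tag: drop it
                }
                // Close everything opened inside the element being ended.
                while (!open.peek().equals(name)) {
                    out.add("</" + open.pop() + ">");
                }
                out.add("</" + open.pop() + ">");
            } else {
                String name = token.substring(1, token.length() - 1);
                open.push(name);
                out.add(token);
            }
        }
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = java.util.Arrays.asList("<p>", "<b>", "</p>", "</i>", "<em>");
        System.out.println(rectify(in));
    }
}
```

For the input in main, the unclosed b is closed before p, the unmatched end-tag for i is discarded, and em is closed at end of input.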
Saxon's help screen includes this line: -x classname Use specified SAX parser for source file With tagsoup-0.9.jar added to my CLASSPATH, I tried this, java...
... Try using an explicit -cp tagsoup-0.9.jar switch on the command line, and see if it still fails. Also, can you substitute other parsers with the -x switch...
... Yes John. I use it to select xerces. regards DaveP...
Dave Pawson
dpawson@...
Jan 29, 2004 4:26 pm
17
I noticed that Bob DuCharme's first thought was to run TagSoup through 'java -jar'; this patch adds the necessary manifest line for this to work (and tidies up...
Joseph Walton's patches inspired me to issue a new release, incorporating them and some other changes, as follows: o Changed existing XMLWriter to HTMLWriter o...
Currently TagSoup's behavior about entity references is as follows. If an entity is recognized by the schema, such as , it is turned into a single...
... I wouldn't do that, as you do see people using, for instance, é in alt attributes. You could restrict that behaviour to href attributes...
Robin Berjon
robin.berjon@...
Feb 12, 2004 10:52 am
21
JC: Clearly this can be fixed by being smart about not inserting ; when the entity reference is unknown. But I'm wondering if it wouldn't just be better to...
... The SGML behavior could be something to consider. This is off the top of my head, and probably not exactly correct, but I believe an SGML parser that finds...
... I do that too. But unlike an SGML parser, I can't just cough and die in either of the two bad cases: unknown entity and missing semicolon. Too many HTML...
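To make the two "bad cases" in this thread concrete (unknown entity, missing semicolon), here is one possible lenient policy sketched in Java. It is not taken from TagSoup. It assumes that a known entity is expanded whether or not the semicolon is present, and that an unknown reference is passed through exactly as written, with no semicolon added. The entity table is a tiny hypothetical stand-in for whatever a schema would define, and LenientEntities is an invented name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One possible policy, for illustration; names and table are hypothetical.
final class LenientEntities {
    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("amp", "&");
        KNOWN.put("lt", "<");
        KNOWN.put("eacute", "\u00e9");
    }

    // An ampersand, a name, and an optional trailing semicolon.
    private static final Pattern REF =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);?");

    static String expand(String text) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = KNOWN.get(m.group(1));
            // Unknown reference: keep the matched text untouched.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("caf&eacute; &amp b &foo=1"));
    }
}
```

Note that a policy like this would still misfire on a URL parameter whose name happens to match a known entity, which is the concern raised about href attributes earlier in the thread.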
I just got an off-list request to add support for HTML comments through the LexicalHandler interface. I wonder if anyone else thinks this feature is useful. ...
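For context on what the request involves on the application side: in SAX2, comments are delivered through org.xml.sax.ext.LexicalHandler, which is registered as a parser property, not as a regular handler. The sketch below shows that registration using the JDK's bundled parser on well-formed input, as a stand-in. Whether and how TagSoup reports comments is exactly what this message leaves open, and CommentCollector is an invented name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// DefaultHandler2 implements LexicalHandler with no-op methods,
// so only comment() needs overriding.
final class CommentCollector extends DefaultHandler2 {
    // Standard SAX2 property name for registering a LexicalHandler.
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Parses the given XML with the JDK parser and returns its comments. */
    static List<String> collect(String xml) {
        try {
            XMLReader reader =
                    SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CommentCollector handler = new CommentCollector();
            reader.setContentHandler(handler);
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.comments;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<p><!-- note -->text</p>"));
    }
}
```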
Hi, We have a project for a national archive to translate data into standard formats for long term archiving. One of these formats is HTML. Whilst we will keep...
I'll defer to John on the other questions... What does it mean "it does not convert presentation HTML to CSS"? I believe that means in cases like: ...
... Forgive my ignorance, but is the latter valid xhtml? If so, why would anybody want to change it? <center> was deprecated in HTML 4.01, from which XHTML is...
I'm not sure about JTidy, but the exe version of Tidy has an option - I just tried this : <center>text</center> Checking the "Output as XHTML" and "Replace...