This is hopefully the last message you'll be seeing from Yahoo; the next
should be a welcome message from Google. I've migrated everyone with
their normal vs. digest vs. no-mail status intact.
Because this is a large move, Google has to manually check the request
for reasonableness. This will probably take 3-4 days, at least that's
what they tell me. If it goes longer than that, I'll switch to inviting
you, which means you'll all have to opt in again -- I hope to avoid
that annoyance.
On the new list, replies will by default go to the *author only*, not
the whole list. Use Reply All to send to the new list.
--
John Cowan cowan@...
"Mr. Lane, if you ever wish anything that I can do, all you will have
to do will be to send me a telegram asking and it will be done."
"Mr. Hearst, if you ever get a telegram from me asking you to do
anything, you can put the telegram down as a forgery."
Ed Davies scripsit:
> Jaran Nilsen wrote:
> > I say move sooner than later.
>
> On behalf of Sonia, Sarah, Shazia, Deepkia and even Maryam (wow, a
> name not ending in an "a" sound) I second that. :-)
It's just the Arabic equivalent of "Maria", so I think it counts.
--
John Cowan http://ccil.org/~cowan cowan@...
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows. When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged. --Neal Stephenson
Jaran Nilsen wrote:
> I say move sooner than later.
On behalf of Sonia, Sarah, Shazia, Deepkia and even Maryam (wow, a
name not ending in an "a" sound) I second that. :-)
I say move sooner than later. Those using the list will catch the change, and as long as the home page of tagsoup is updated, so should future users as well :)
J
On Wed, Sep 23, 2009 at 7:20 AM, John Cowan <cowan@...> wrote:
Ed Davies scripsit:
> Just checking but the only thing we'd need to do would be to
> post messages starting a new thread to a different address?
No. All further postings would have to be to the new posting address,
so using Reply or Reply All wouldn't work any more. However, it's not
like we have gobs of live threads at the moment.
--
Is a chair finely made tragic or comic? Is the John Cowan
portrait of Mona Lisa good if I desire to see cowan@...
it? Is the bust of Sir Philip Crampton lyrical, http://ccil.org/~cowan
epical or dramatic? If a man hacking in fury
at a block of wood make there an image of a cow,
is that image a work of art? If not, why not? --Stephen Dedalus
John Cowan wrote:
> Another huge burst of spam. I haven't had nearly this much trouble on
> the Google Groups mailing lists I manage. Would there be any objections
> to my moving this list there? It doesn't mean you'll all need Google
> accounts or GMail addresses.
Just checking but the only thing we'd need to do would be to
post messages starting a new thread to a different address?
Ed.
knobs1723 scripsit:
> I have marked-up text, mixing HTML tags with a custom tag <placeName>. The
> latter is usually embedded in a <p>...</p> pair, i.e:
> <p>...<placeName>...</placeName>...</p>. When I parse the text with TagSoup, the
> output is made into:
> <p>...</p><placeName>...</placeName>
Sorry for the delayed response. It is because TagSoup does not know that
placename is a tag that can go inside p elements.
> Any ideas, suggestions what I am doing wrong? I'm using TagSoup 1.2.
Changing the schema may be too difficult, but if you have *lots* of text, do
that.
Otherwise, look for <placeName> and change it to <span class="placeName"> or
something of the sort, and change </placeName> to </span>, both before parsing.
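The pre-parse rewrite suggested above can be sketched with plain string substitution. This is a minimal sketch, not part of TagSoup: the class name `PlaceNameRewrite` and the method `toSpans` are made up for illustration, and it assumes the custom tag is always spelled exactly `placeName`.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a TagSoup API): rewrites the custom tag to a span
// before the text is handed to the parser.
final class PlaceNameRewrite {
    // Start-tag, with or without attributes; \b keeps <placeNameList> from matching.
    private static final Pattern OPEN = Pattern.compile("<placeName\\b([^>]*)>");
    // End-tag, tolerating whitespace before the closing '>'.
    private static final Pattern CLOSE = Pattern.compile("</placeName\\s*>");

    static String toSpans(String markup) {
        // $1 carries over any attributes that were on the original start-tag.
        String opened = OPEN.matcher(markup).replaceAll("<span class=\"placeName\"$1>");
        return CLOSE.matcher(opened).replaceAll("</span>");
    }

    public static void main(String[] args) {
        System.out.println(toSpans("<p>Born in <placeName>Oslo</placeName>.</p>"));
    }
}
```

After parsing, the original elements can be recovered by looking for span elements whose class attribute is "placeName".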
--
You escaped them by the will-death John Cowan
and the Way of the Black Wheel. cowan@...
I could not. --Great-Souled Sam http://www.ccil.org/~cowan
graemekidd@... scripsit:
> Ah OK, if that's the case then I think I was slightly confused by this:
> "TagSoup also includes a command-line processor that reads HTML files and can
> generate either clean HTML or well-formed XML that is a close approximation to
> XHTML."
>
> Does that "approximation" mean that empty tags are not possible?
Empty tags are a strictly optional feature. It's not possible to
control what kind of quotes are used around attribute values, or
to use newlines instead of spaces between attributes, or many other
purely lexical things.
--
Overhead, without any fuss, the stars were going out.
--Arthur C. Clarke, "The Nine Billion Names of God"
John Cowan <cowan@...>
> > Hi,
> >
> > I noticed that after my XHTML is parsed the empty link tags <link /> are
> > converted to start and closed tags e.g. <link></link>
>
> In XML there is no distinction between the two forms, so TagSoup always
> generates the longer form and never the shorter.
>
> --
> On the Semantic Web, it's too hard to prove John Cowan cowan@...
> you're not a dog. --Bill de hOra http://www.ccil.org/~cowan
>
Ah OK, if that's the case then I think I was slightly confused by this:
"TagSoup also includes a command-line processor that reads HTML files and can
generate either clean HTML or well-formed XML that is a close approximation to
XHTML."
Does that "approximation" mean that empty tags are not possible?
Graeme Kidd scripsit:
> Hi,
>
> I noticed that after my XHTML is parsed the empty link tags <link /> are
> converted to start and closed tags e.g. <link></link>
In XML there is no distinction between the two forms, so TagSoup always
generates the longer form and never the shorter.
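The equivalence of the two forms can be checked with the JDK's own DOM parser, with no TagSoup involved. This is a minimal sketch; the class name `EmptyElementForms` and the method `sameTree` are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: compares the trees produced by two XML strings.
final class EmptyElementForms {
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // True when both strings parse to structurally equal element trees.
    static boolean sameTree(String a, String b) throws Exception {
        return parse(a).getDocumentElement().isEqualNode(parse(b).getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String shortForm = "<head><link rel=\"x\"/></head>";
        String longForm = "<head><link rel=\"x\"></link></head>";
        System.out.println(sameTree(shortForm, longForm));
    }
}
```

Any XML consumer downstream of TagSoup should therefore treat `<link></link>` the same as `<link />`; the difference only matters if the output is fed to a non-XML HTML parser.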
--
On the Semantic Web, it's too hard to prove John Cowan cowan@...
you're not a dog. --Bill de hOra http://www.ccil.org/~cowan
Hi,
this might have been asked a couple of times already, but searching the forum
did not really help.
I have marked-up text, mixing HTML tags with a custom tag <placeName>. The
latter is usually embedded in a <p>...</p> pair, i.e:
<p>...<placeName>...</placeName>...</p>. When I parse the text with TagSoup, the
output is made into:
<p>...</p><placeName>...</placeName>
Any ideas, suggestions what I am doing wrong? I'm using TagSoup 1.2.
Cheers,
Alex
Jaran Nilsen scripsit:
> Why not host the project on Sourceforge, Google Code or similar
> services? Then you get all the other project management goodies as
> well.
I probably will move to Google Code at some point. For now, I'll just
move the mailing list unless I hear objections. It's a very straightforward
process.
--
John Cowan <cowan@...> http://ccil.org/~cowan
Micropayment advocates mistakenly believe that efficient allocation of
resources is the purpose of markets. Efficiency is a byproduct of market
systems, not their goal. The reasons markets work are not because users
have embraced efficiency but because markets are the best place to allow
users to maximize their preferences, and very often their preferences are
not for conservation of cheap resources. --Clay Shirkey
Why not host the project on Sourceforge, Google Code or similar
services? Then you get all the other project management goodies as
well.
Just a suggestion :)
Jaran
On Thu, Sep 3, 2009 at 12:14 AM, John Cowan<cowan@...> wrote:
>
>
> Another huge burst of spam. I haven't had nearly this much trouble on
> the Google Groups mailing lists I manage. Would there be any objections
> to my moving this list there? It doesn't mean you'll all need Google
> accounts or GMail addresses.
>
> Please respond with any objections within a week. Silence gives consent.
>
> --
> What has four pairs of pants, lives John Cowan
> in Philadelphia, and it never rains http://www.ccil.org/~cowan
> but it pours? cowan@...
> --Rufus T. Firefly
>
>
--
Jaran Nilsen
Web: http://www.jaranweb.com
MSN/GTalk: passport@... / jaran.nilsen@...
Tel.: +47 97 19 33 69
http://twitter.com/jarannilsen
http://www.linkedin.com/in/jarannilsen
http://www.facebook.com/jaran.nilsen
Another huge burst of spam. I haven't had nearly this much trouble on
the Google Groups mailing lists I manage. Would there be any objections
to my moving this list there? It doesn't mean you'll all need Google
accounts or GMail addresses.
Please respond with any objections within a week. Silence gives consent.
--
What has four pairs of pants, lives John Cowan
in Philadelphia, and it never rains http://www.ccil.org/~cowan
but it pours? cowan@...
--Rufus T. Firefly
--- In tagsoup-friends@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> jeremiebousquet scripsit:
>
> > String xmlData = """<html>
> > <body>
> > <b>
> > <p><a href="url">my first text</a></p>
> > </b>
> > </body>
> > </html>"""
>
> What's happening here is that TagSoup's model of HTML doesn't believe that
> a B element can have a P element inside it. B elements are part
> of the inline group, P elements are part of the block group.
>
> Consequently, when the P start-tag is seen, the B element is closed;
> however, B is known to be a restartable element, so it will be reopened
> inside the P element and closed again when the P element is closed; once
> again, B will be restarted outside the P element and then closed for good.
>
> TagSoup isn't guaranteed to produce the best possible result, just one
> that is well-formed and follows the general model of HTML 4. It's up
> to you to fix up the output in any useful way, using XSLT for example.
>
> --
> The first thing you learn in a lawin' family John Cowan
> is that there ain't no definite answers cowan@...
> to anything. --Calpurnia in To Kill A Mockingbird
>
It's quite clear now, thanks for your answer !
I managed to work around it by changing all "<b>" to "<bold>" (quite ugly, but
it works), but maybe just removing "<b></b>" and changing the order of <b> and
<p> would be enough.
Thanks again for help,
Jeremie
jeremiebousquet scripsit:
> String xmlData = """<html>
> <body>
> <b>
> <p><a href="url">my first text</a></p>
> </b>
> </body>
> </html>"""
What's happening here is that TagSoup's model of HTML doesn't believe that
a B element can have a P element inside it. B elements are part
of the inline group, P elements are part of the block group.
Consequently, when the P start-tag is seen, the B element is closed;
however, B is known to be a restartable element, so it will be reopened
inside the P element and closed again when the P element is closed; once
again, B will be restarted outside the P element and then closed for good.
TagSoup isn't guaranteed to produce the best possible result, just one
that is well-formed and follows the general model of HTML 4. It's up
to you to fix up the output in any useful way, using XSLT for example.
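As one possible XSLT fix-up of the kind mentioned above, the following sketch drops the empty B elements that the restart logic leaves behind. It uses only the JDK's built-in XSLT 1.0 processor. The class name `DropEmptyBold` is hypothetical, and the input string is hand-written to resemble the output reported in this thread rather than produced by running TagSoup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical post-processing step: identity transform minus empty XHTML <b> elements.
final class DropEmptyBold {
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:h='http://www.w3.org/1999/xhtml'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Identity rule: copy every node and attribute unchanged.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Override: a <b> with no children produces no output at all.
      + "<xsl:template match='h:b[not(node())]'/>"
      + "</xsl:stylesheet>";

    static String clean(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed shape of the TagSoup output, based on the dump in the question.
        String tagSoupLike = "<html xmlns='http://www.w3.org/1999/xhtml'><body><b></b>"
            + "<p><b><a shape='rect' href='url'>my first text</a></b></p><b></b></body></html>";
        System.out.println(clean(tagSoupLike));
    }
}
```

The stylesheet has to match `h:b` rather than `b` because TagSoup puts its elements in the XHTML namespace; that namespace is also the source of the "strange urls" in the Groovy dump.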
--
The first thing you learn in a lawin' family John Cowan
is that there ain't no definite answers cowan@...
to anything. --Calpurnia in To Kill A Mockingbird
I'll let others with better HTML knowledge confirm but I have
vague recollections that in theory, b tags are supposed to always
be inside other tags, e.g. p tags. It depends what you are trying
to do but perhaps you can just walk through all of the nodes:
def tsparser = new XmlParser(new org.ccil.cowan.tagsoup.Parser())
tshtml = tsparser.parseText(xmlData)
println "HTML with TagSoup = " + tshtml
tshtml.body.'**'.each { tsbtags ->
    println " " + tsbtags
}
Cheers, Paul.
jeremiebousquet wrote:
>
>
>
> Hello,
>
> I'm new to TagSoup and Groovy, and trying to parse some html, not well
> formed I'm afraid (why I use TagSoup).
>
> But I have some strange behaviour that I can't explain, maybe due to my
> lack of knowledge. Hope you will be able to help me.
>
> This is my program, with heavily cleared html, I retained only the
> structure that seem to cause the problem :
>
> /* -------------------------------- */
> String xmlData = """<html>
> <body>
> <b>
> <p><a href="url">my first text</a></p>
> </b>
> </body>
> </html>"""
>
> def parser = new XmlParser()
> html = parser.parseText(xmlData)
> println("HTML = " + html)
> html.body.b.each() { btags ->
> println("TAG B = " + btags)
> }
>
> println()
> def tsparser = new XmlParser(new org.ccil.cowan.tagsoup.Parser())
> tshtml = tsparser.parseText(xmlData)
> println("HTML with TagSoup = " + tshtml)
> tshtml.body.b.each() { tsbtags ->
> println("TAG B with TagSoup = " + tsbtags)
> }
> /* -------------------------------- */
>
> Here is the result that is displayed :
>
> -----------------------------
> HTML = html[attributes={}; value=[body[attributes={};
> value=[b[attributes={}; value=[p[attributes={};
> value=[a[attributes={href=url}; value=[my first text]]]]]]]]]]
> TAG B = b[attributes={}; value=[p[attributes={};
> value=[a[attributes={href=url}; value=[my first text]]]]]]
>
> HTML with TagSoup = {http://www.w3.org/1999/xhtml}html[attributes={};
> value=[{http://www.w3.org/1999/xhtml}body[attributes={};
> value=[{http://www.w3.org/1999/xhtml}b[attributes={};
> value=[]], {http://www.w3.org/1999/xhtml}p[attributes={};
> value=[{http://www.w3.org/1999/xhtml}b[attributes={};
> value=[{http://www.w3.org/1999/xhtml}a[attributes={shape=rect, href=url};
> value=[my first text]]]]]],
> {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]]]]]
> TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
> TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
> -----------------------------
>
> It seems that the XMLParser alone could easily retrieve the <b> and its
> content, while TagSoup recognized 2 <b> (the opening and closing ?), but
> empty. Also there are strange urls everywhere.
> If I replace in my html data the <b> by, say, <be>, then neither of
> XMLParser alone or with Groovy have difficulty to find this <be> tag and
> its content. It seems to occur only with <b>.
> I can't use XMLParser alone, though, because my original html is not
> parseable with it because too badly formed.
>
> Am I missing something ?
>
> Thanks for help,
> Jeremie
>
>
Hello,
I'm new to TagSoup and Groovy, and I'm trying to parse some HTML that is not
well formed, I'm afraid (which is why I use TagSoup).
But I see some strange behaviour that I can't explain, maybe due to my lack of
knowledge. I hope you will be able to help me.
This is my program, with heavily trimmed HTML; I retained only the structure
that seems to cause the problem:
/* -------------------------------- */
String xmlData = """<html>
<body>
<b>
<p><a href="url">my first text</a></p>
</b>
</body>
</html>"""
def parser = new XmlParser()
html = parser.parseText(xmlData)
println("HTML = " + html)
html.body.b.each() { btags ->
    println("TAG B = " + btags)
}
println()
def tsparser = new XmlParser(new org.ccil.cowan.tagsoup.Parser())
tshtml = tsparser.parseText(xmlData)
println("HTML with TagSoup = " + tshtml)
tshtml.body.b.each() { tsbtags ->
    println("TAG B with TagSoup = " + tsbtags)
}
/* -------------------------------- */
Here is the result that is displayed :
-----------------------------
HTML = html[attributes={}; value=[body[attributes={}; value=[b[attributes={};
value=[p[attributes={}; value=[a[attributes={href=url}; value=[my first
text]]]]]]]]]]
TAG B = b[attributes={}; value=[p[attributes={}; value=[a[attributes={href=url};
value=[my first text]]]]]]
HTML with TagSoup = {http://www.w3.org/1999/xhtml}html[attributes={};
value=[{http://www.w3.org/1999/xhtml}body[attributes={};
value=[{http://www.w3.org/1999/xhtml}b[attributes={};
value=[]], {http://www.w3.org/1999/xhtml}p[attributes={};
value=[{http://www.w3.org/1999/xhtml}b[attributes={};
value=[{http://www.w3.org/1999/xhtml}a[attributes={shape=rect, href=url};
value=[my first text]]]]]],
{http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]]]]]
TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
-----------------------------
It seems that XmlParser alone could easily retrieve the <b> and its content,
while TagSoup recognized two <b> elements (the opening and closing tags?), both
empty. Also, there are strange URLs everywhere.
If I replace the <b> in my HTML data with, say, <be>, then neither XmlParser
alone nor XmlParser with TagSoup has any difficulty finding this <be> tag and
its content. It seems to occur only with <b>.
I can't use XmlParser alone, though, because my original HTML is too badly
formed for it to parse.
Am I missing something ?
Thanks for help,
Jeremie
Only a few of them got through my personal filters, and when I noticed the
pattern, I removed and blocked the sending email and removed the messages from
the archive.
I figured it out. It was a class path problem. I'm using Eclipse plug-ins, so
my plug-in was trying to use Xalan's SAX2DOM and the class loader was seeing two
ContentHandler definitions because of the way I set up my Xalan plug-in. Once I
figured out what was going on, I referred to my copy of the book
"Eclipse Rich Client Platform: Designing, Coding, and Packaging Java Applications"
about troubleshooting class path problems. I now have it working just fine.
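For anyone chasing a similar problem, a small diagnostic like the following can show which class loader and which JAR or directory a class actually came from. This is a minimal sketch using only standard reflection calls; the class name `ClassOrigin` and the method `describe` are made up for the example.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
final class ClassOrigin {
    static String describe(Class<?> type) {
        // A null loader means the class came from the JVM's bootstrap loader.
        ClassLoader loader = type.getClassLoader();
        String who = (loader == null) ? "bootstrap class loader" : loader.toString();

        // The code source, when present, points at the JAR or directory.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String where = (source == null || source.getLocation() == null)
            ? "unknown location"
            : source.getLocation().toString();

        return type.getName() + " <- " + who + " @ " + where;
    }

    public static void main(String[] args) {
        System.out.println(describe(org.xml.sax.ContentHandler.class));
        System.out.println(describe(ClassOrigin.class));
    }
}
```

Calling `describe` on `org.xml.sax.ContentHandler.class` from inside each plug-in, and comparing the results, should reveal whether two different copies of the interface are in play.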
Ian
----- Original Message ----
From: Leslie Software <lesliesoftware@...>
<snip>
java.lang.IncompatibleClassChangeError: Class org.apache.xalan.xsltc.trax.SAX2DOM does not implement the requested interface org.xml.sax.ContentHandler
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:405)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.createSlotsFromHTMLTransfer(ReadClipboard.java:236)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:125)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:103)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction$1.run(PasteAction.java:64)
    at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction.run(PasteAction.java:58)
...
<snip>
I have been using TagSoup for some time for various tasks and it does a great
job. One application uses TagSoup to parse HTML from the clipboard. Recently,
when I tried to fix up my access to the SAX2DOM class, I ran into trouble. I
had been using the internal implementation from the Sun JRE found in
com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM without really paying
attention that I was doing this. When I noticed I was using an internal class
that was not public, I downloaded Apache Xalan 2.7.1 and tried to use it instead.
This generated the error:
java.lang.IncompatibleClassChangeError: Class org.apache.xalan.xsltc.trax.SAX2DOM does not implement the requested interface org.xml.sax.ContentHandler
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:405)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.createSlotsFromHTMLTransfer(ReadClipboard.java:236)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:125)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:103)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction$1.run(PasteAction.java:64)
    at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
    at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction.run(PasteAction.java:58)
...
Am I using incompatible versions? Do I need to modify my code to make it work?
Any suggestions or advice would be appreciated.
Ian
Here is what the parsing code looks like (the exception comes at the
parser.parse line):
private void createSlotsFromHTMLTransfer (String contents) {
    StringReader inputReader = null;
    try {
        // Create and configure the parser
        Parser parser = new Parser ();
        parser.setFeature ("http://xml.org/sax/features/namespace-prefixes",
                true); //$NON-NLS-1$
        // Parse the HTML
        SAX2DOM sax2dom = new SAX2DOM ();
        parser.setContentHandler (sax2dom);
        inputReader = new StringReader (contents);
        InputSource inputSource = new InputSource (inputReader);
        inputSource.setEncoding ("UTF-8"); //$NON-NLS-1$
        parser.parse (inputSource);
        Node doc = sax2dom.getDOM ();
        if (TraceUtil.isOptionEnabled (DebugOptions.CLIPBOARD_SAVEXML)) {
            // debug code to save the XML to disk omitted
        }
        resetCurrentLine ();
        String topLevelParents = "/html:html/html:body"; //$NON-NLS-1$
        NodeList topLevelNodes = XPathHelper.selectNodeList (doc,
                topLevelParents);
        for (int index = 0; index < topLevelNodes.getLength (); index++) {
            Node curNode = topLevelNodes.item (index);
            processContainingNode (curNode);
        }
    } catch (Exception exception) {
        // Any exception indicates invalid data
        EditorPlugin.getDefault ().getLog ().log
                (DeckEditorError.clipboardFormatError (exception));
        DND.error (DND.ERROR_INVALID_DATA);
        mySlots = null;
    } finally {
        if (inputReader != null)
            inputReader.close ();
    }
}
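One way to avoid depending on any SAX2DOM class, internal or Xalan's, is to let the standard JAXP identity TransformerHandler build the DOM. The sketch below is not the code from this application: the class `SaxToDom` and its methods are hypothetical, and the JDK's own SAX parser stands in for TagSoup so that the example runs without extra JARs. The assumption is that `new org.ccil.cowan.tagsoup.Parser()` could be passed as the `XMLReader` instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: turns SAX events from any XMLReader into a DOM tree.
final class SaxToDom {
    static Node toDom(XMLReader reader, String markup) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        // With no stylesheet, the handler is an identity transform and is
        // itself a ContentHandler that can receive the reader's events.
        TransformerHandler handler = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return result.getNode();
    }

    // Stand-in reader for the demo; assumed to be replaceable by TagSoup's Parser.
    static XMLReader jdkReader() throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        return spf.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        Document doc = (Document) toDom(jdkReader(), "<html><body><p>hi</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the handler comes from whatever TransformerFactory is on the class path, this approach does not name a specific Xalan class and so is less exposed to the kind of duplicate-definition problem reported above.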
--
Ian Leslie - Shareware Author (mailto:lesliesoftware@...)
--- In tagsoup-friends@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> James Abley scripsit:
>
> > I've encountered an issue using TagSoup and I wanted to clarify whether
> > it is expected behaviour due to how I'm using it, or something else.
> >
> > The issue that I'm seeing is that I'm parsing an RSS feed and it
> > eventually goes through TagSoup to ensure that I store well-formed XML.
> >
> > http://www.guardian.co.uk/football/2009/feb/26/real-madrid-rafa-benitez-liverpool/rss
> >
> > The <br/> element between the first two bullet points in that story
> > is getting removed when I parse the <item/> description and I'm not
> > sure why that is the case.
>
> I can't duplicate this problem with TagSoup 1.2. It turns into a
> <br clear="none"></br>, because there's a default attribute value
> in the HTML 4.0 DTD, and TagSoup doesn't generate empty elements.
>
> > Is there a source repository that I can check out anonymously and write
> > some tests against? I've not been able to find one through Google -
> > too much interference from the Haskell version, etc.
>
> You can always download the source of released versions from
> http://tagsoup.info. There is no public repository.
>
> --
> John Cowan <cowan@...> http://www.ccil.org/~cowan
> But no living man am I! You look upon a woman. Eowyn I am, Eomund's daughter.
> You stand between me and my lord and kin. Begone, if you be not deathless.
> For living or dark undead, I will smite you if you touch him.
>
Sorry, that's absolutely right. A later step in my XML pipeline is removing that
element. Apologies for the noise.
Cheers,
James
James Abley scripsit:
> I've encountered an issue using TagSoup and I wanted to clarify whether
> it is expected behaviour due to how I'm using it, or something else.
>
> The issue that I'm seeing is that I'm parsing an RSS feed and it
> eventually goes through TagSoup to ensure that I store well-formed XML.
>
> http://www.guardian.co.uk/football/2009/feb/26/real-madrid-rafa-benitez-liverpool/rss
>
> The <br/> element between the first two bullet points in that story
> is getting removed when I parse the <item/> description and I'm not
> sure why that is the case.
I can't duplicate this problem with TagSoup 1.2. It turns into a
<br clear="none"></br>, because there's a default attribute value
in the HTML 4.0 DTD, and TagSoup doesn't generate empty elements.
> Is there a source repository that I can check out anonymously and write
> some tests against? I've not been able to find one through Google -
> too much interference from the Haskell version, etc.
You can always download the source of released versions from
http://tagsoup.info. There is no public repository.
--
John Cowan <cowan@...> http://www.ccil.org/~cowan
But no living man am I! You look upon a woman. Eowyn I am, Eomund's daughter.
You stand between me and my lord and kin. Begone, if you be not deathless.
For living or dark undead, I will smite you if you touch him.
Hi,
I've encountered an issue using TagSoup and I wanted to clarify whether it is
expected behaviour due to how I'm using it, or something else.
The issue that I'm seeing is that I'm parsing an RSS feed and it eventually goes
through TagSoup to ensure that I store well-formed XML.
http://www.guardian.co.uk/football/2009/feb/26/real-madrid-rafa-benitez-liverpool/rss
The <br/> element between the first two bullet points in that story is getting
removed when I parse the <item/> description and I'm not sure why that is the
case.
"<p>• Liverpool manager says he will be staying at Anfield<br />•
Spaniard praises team for win away to Real Madrid</p>"
The markup is being correctly unescaped prior to being passed to TagSoup.
Is there a source repository that I can check out anonymously and write some
tests against? I've not been able to find one through Google - too much
interference from the Haskell version, etc.
Cheers,
James
I seem to have found another place where TagSoup gets in a bit of a
huff. Perhaps there is a flag I can specify to make things better.
The issue is this piece of HTML (the less than sign is, erroneously,
straight up (i.e. not using an entity)):
<em><90 min</em>
and it is occurring on this page - http://tinyurl.com/dgmjjt
On first inspection it seems like TS makes some sort of sense out of it:
<em><_90 min="min" em="em">. </_90></em>
But then it starts inserting <em></em> all over the document (this was
the only "em" in the original doc). And then one of those inserted ones
doesn't have a matching end tag (which is how I stumbled upon this when
it hit the XML parser). Any simple resolution?
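Pending a proper answer, one possible workaround is to escape stray less-than signs before the text reaches TagSoup. This is only a sketch under the assumption that a `<` not followed by a letter, `/`, `!` or `?` was meant as text. The class `StrayLessThan` is hypothetical and does not correspond to any TagSoup flag.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter: turns a '<' that cannot start markup into &lt;.
final class StrayLessThan {
    // A '<' not followed by a letter, '/', '!' or '?' cannot begin a tag,
    // end-tag, comment, declaration or processing instruction.
    private static final Pattern STRAY = Pattern.compile("<(?![A-Za-z/!?])");

    static String escape(String html) {
        return STRAY.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        System.out.println(escape("<em><90 min</em>"));
    }
}
```

Note that this blunt filter also rewrites `<` inside script or style blocks (for example `a <1`), so it is only suitable when such content is absent or irrelevant.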
-Mike
Miguel Garcia scripsit:
> Hi,
>
> In a project where we use TagSoup to tidy some malformed XHTML code, we have
> found that if there is an odd number of quotes in the DOCTYPE
> declaration, TagSoup throws a String-related exception and fails. For
> example, with the following input,
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN" "> <html
> xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <head><title>Test
> with bogus doctype</title></head> <body> <p>This page has an extra quote
> in the doctype, which the tagsoup library doesn't like.</p> </body>
> </html>
The real problem is that TagSoup thinks the system-id begins with a quote
and ends with a quote, but doesn't realize that it's zero-length. The
obvious fix to Parser#trimquotes doesn't work, though. I think this will
be straightforward to find a patch for, but I'll need to do a bit of debugging.
--
John Cowan http://www.ccil.org/~cowan cowan@...
Uneasy lies the head that wears the Editor's hat! --Eddie Foirbeis Climo
Hi,
In a project where we use TagSoup to tidy some malformed XHTML code, we have
found that if there is an odd number of quotes in the DOCTYPE
declaration, TagSoup throws a String-related exception and fails. For
example, with the following input,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN" "> <html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <head><title>Test
with bogus doctype</title></head> <body> <p>This page has an extra quote
in the doctype, which the tagsoup library doesn't like.</p> </body>
</html>
TagSoup throws the following exception:
[Fatal Error] :2:14: The document type declaration for root element type
"html" must end with '>'.
Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
String index out of range: -1
I'm not sure whether patching the library would be easy (I haven't
reviewed the source code yet), or whether it would be better just to add
some workarounds that help us recover from any unexpected error from
TagSoup.
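As an illustration of the workaround route, the DOCTYPE could be stripped before the text is handed to the parser. This is a rough sketch that assumes the DOCTYPE is not needed downstream and contains no internal subset. The class `DoctypeStripper` is made up for the example and is not a fix to TagSoup itself.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter: removes the first DOCTYPE declaration, malformed or not.
final class DoctypeStripper {
    // Everything from "<!DOCTYPE" up to the first '>' (newlines included).
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String markup) {
        return DOCTYPE.matcher(markup).replaceFirst("");
    }

    public static void main(String[] args) {
        String bogus = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML Basic 1.1//EN\" \"> "
            + "<html><body><p>text</p></body></html>";
        System.out.println(strip(bogus));
    }
}
```

A `try`/`catch` around the parse call would still be needed to recover from other unexpected errors; this filter only sidesteps the one input shown here.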
Miguel
Thanks! And to correct another typo for the record, I'm sending
fragments, not fragmenents, which I guess is a back-formation from
documenents.
Leigh.
-----Original Message-----
From: tagsoup-friends@yahoogroups.com
[mailto:tagsoup-friends@yahoogroups.com] On Behalf Of John Cowan
Sent: Thursday, March 12, 2009 3:34 PM
To: tagsoup-friends@yahoogroups.com
Subject: Re: [tagsoup-friends] Patch for XMLWriter and newlines in
TagSoup 1.2
Klotz, Leigh scripsit:
> In tests, I'm seeing not just a newline, but a blank line after the
> document. Is that what you see? Your quoted simple output looks like
> it might have a blank line after it.
Yes, you're right. I've never paid attention to this before.
> * Problem 2:
> I've violated an assumption of XMLWriter; I'm sending it fragmenents,
> not a document or even a single element. The result is a newline
> after each "toplevel" element.
Aha.
> Given that endDocument() outputs the final newline in the document, I
> really don't see what benefit line 632 has at all. I believe that
> simply removing line 632 will let XMLWriter handle fragments without
> introducing extra whitespace, and will still leave the resulting
> serialization newline (though not blank line) terminated.
I agree: line 632 should just be flushed.
--
Clear? Huh! Why a four-year-old child John Cowan
could understand this report. Run out cowan@...
and find me a four-year-old child. I http://www.ccil.org/~cowan
can't make head or tail out of it.
--Rufus T. Firefly on government reports
Klotz, Leigh scripsit:
> In tests, I'm seeing not just a newline, but a blank line after the
> document. Is that what you see? Your quoted simple output looks like
> it might have a blank line after it.
Yes, you're right. I've never paid attention to this before.
> * Problem 2:
> I've violated an assumption of XMLWriter; I'm sending it fragmenents,
> not a document or even a single element. The result is a newline
> after each "toplevel" element.
Aha.
> Given that endDocument() outputs the final newline in the document,
> I really don't see what benefit line 632 has at all. I believe that
> simply removing line 632 will let XMLWriter handle fragments without
> introducing extra whitespace, and will still leave the resulting
> serialization newline (though not blank line) terminated.
I agree: line 632 should just be flushed.
--
Clear? Huh! Why a four-year-old child John Cowan
could understand this report. Run out cowan@...
and find me a four-year-old child. I http://www.ccil.org/~cowan
can't make head or tail out of it.
--Rufus T. Firefly on government reports