tagsoup-friends · Friends of TagSoup
#1386 From: cowan@...
Date: Sat Sep 26, 2009 6:40 am
Subject: ADMIN: list migration begun
johnwcowan
 
This is hopefully the last message you'll be seeing from Yahoo; the next
should be a welcome message from Google.  I've migrated everyone with
their normal vs. digest vs. no-mail status intact.

Because this is a large move, Google has to manually check the request
for reasonableness.  This will probably take 3-4 days, at least that's
what they tell me.  If it goes longer than that, I'll switch to inviting
you, which means you'll all have to opt in again -- I hope to avoid
that annoyance.

On the new list, replies will by default go to the *author only*, not
the whole list.  Use Reply All to send to the new list.

--
John Cowan   cowan@...
     "Mr. Lane, if you ever wish anything that I can do, all you will have
         to do will be to send me a telegram asking and it will be done."
     "Mr. Hearst, if you ever get a telegram from me asking you to do
         anything, you can put the telegram down as a forgery."

#1377 From: John Cowan <cowan@...>
Date: Wed Sep 23, 2009 8:35 pm
Subject: Re: ADMIN: too much spam, may move list
johnwcowan
 
Ed Davies scripsit:
> Jaran Nilsen wrote:
> > I say move sooner than later.
>
> On behalf of Sonia, Sarah, Shazia, Deepkia and even Maryam (wow, a
> name not ending in an "a" sound) I second that. :-)

It's just the Arabic equivalent of "Maria", so I think it counts.

--
John Cowan   http://ccil.org/~cowan  cowan@...
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson

#1376 From: Ed Davies <edavies@...>
Date: Wed Sep 23, 2009 8:34 pm
Subject: Re: ADMIN: too much spam, may move list
edavies971
 
Jaran Nilsen wrote:
> I say move sooner than later.

On behalf of Sonia, Sarah, Shazia, Deepkia and even Maryam (wow, a
name not ending in an "a" sound) I second that. :-)

#1372 From: Jaran Nilsen <jaran.nilsen@...>
Date: Wed Sep 23, 2009 6:13 am
Subject: Re: ADMIN: too much spam, may move list
jaran.nilsen@...
 
I say move sooner than later. Those using the list will catch the change, and as long as the home page of tagsoup is updated, so should future users as well :)

J

On Wed, Sep 23, 2009 at 7:20 AM, John Cowan <cowan@...> wrote:
 

Ed Davies scripsit:



> Just checking but the only thing we'd need to do would be to
> post messages starting a new thread to a different address?

No. All further postings would have to be to the new posting address,
so using Reply or Reply All wouldn't work any more. However, it's not
like we have gobs of live threads at the moment.

--
Is a chair finely made tragic or comic? Is the          John Cowan
portrait of Mona Lisa good if I desire to see           cowan@...
it? Is the bust of Sir Philip Crampton lyrical,         http://ccil.org/~cowan
epical or dramatic?  If a man hacking in fury
at a block of wood make there an image of a cow,
is that image a work of art? If not, why not?               --Stephen Dedalus




--
Jaran Nilsen
Web: http://www.jaranweb.com
MSN/GTalk: passport@... / jaran.nilsen@...
Tel.: +47 97 19 33 69
http://twitter.com/jarannilsen
http://www.linkedin.com/in/jarannilsen
http://www.facebook.com/jaran.nilsen

#1371 From: John Cowan <cowan@...>
Date: Wed Sep 23, 2009 5:20 am
Subject: Re: ADMIN: too much spam, may move list
johnwcowan
 
Ed Davies scripsit:

> Just checking but the only thing we'd need to do would be to
> post messages starting a new thread to a different address?

No.  All further postings would have to be to the new posting address,
so using Reply or Reply All wouldn't work any more.  However, it's not
like we have gobs of live threads at the moment.

--
Is a chair finely made tragic or comic? Is the          John Cowan
portrait of Mona Lisa good if I desire to see           cowan@...
it? Is the bust of Sir Philip Crampton lyrical,         http://ccil.org/~cowan
epical or dramatic?  If a man hacking in fury
at a block of wood make there an image of a cow,
is that image a work of art? If not, why not?               --Stephen Dedalus

#1368 From: Ed Davies <edavies@...>
Date: Tue Sep 22, 2009 10:11 pm
Subject: Re: ADMIN: too much spam, may move list
edavies971
 
John Cowan wrote:
> Another huge burst of spam.  I haven't had nearly this much trouble on
> the Google Groups mailing lists I manage.  Would there be any objections
> to my moving this list there?  It doesn't mean you'll all need Google
> accounts or GMail addresses.

Just checking but the only thing we'd need to do would be to
post messages starting a new thread to a different address?

Ed.

#1364 From: John Cowan <cowan@...>
Date: Fri Sep 18, 2009 10:57 pm
Subject: Re: Problems with nested tags
johnwcowan
 
knobs1723 scripsit:

> I have marked-up text, mixing HTML tags with a custom tag <placeName>. The
> latter is usually embedded in a <p>...</p> pair, i.e.:
> <p>...<placeName>...</placeName>...</p>. When I parse the text with TagSoup, the
> output is made into:
> <p>...</p><placeName>...</placeName>

Sorry for the delayed response.  It is because TagSoup does not know that
placeName is a tag that can go inside p elements.

> Any ideas, suggestions what I am doing wrong? I'm using TagSoup 1.2.

Changing the schema may be too difficult, but if you have *lots* of text,
do that.  Otherwise, look for <placeName> and change it to
<span class="placeName"> or something of the sort, and change </placeName>
to </span>, both before parsing.

--
You escaped them by the will-death              John Cowan
and the Way of the Black Wheel.                 cowan@...
I could not.  --Great-Souled Sam                http://www.ccil.org/~cowan
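
The pre-parse rewrite suggested above can be sketched in plain Java.  This is
only an illustration, not part of TagSoup: the class name PlaceNameRewriter and
its rewrite method are hypothetical, and the sketch assumes the custom tag is
always spelled exactly <placeName> and is never self-closing.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a TagSoup class): rewrites the custom tag into a
// span that TagSoup's HTML model already allows inside <p>.
final class PlaceNameRewriter {
    // Opening tag; group 1 holds any attributes, including the leading space.
    private static final Pattern OPEN =
        Pattern.compile("<placeName((?:\\s[^>]*)?)>");
    // Closing tag, tolerating whitespace before '>'.
    private static final Pattern CLOSE =
        Pattern.compile("</placeName\\s*>");

    static String rewrite(String markup) {
        String opened = OPEN.matcher(markup)
            .replaceAll("<span class=\"placeName\"$1>");
        return CLOSE.matcher(opened).replaceAll("</span>");
    }
}
```

Run the text through rewrite() first and hand the result to the parser; the
class attribute keeps the information that the span was a placeName.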

#1363 From: John Cowan <cowan@...>
Date: Sun Sep 13, 2009 11:58 pm
Subject: Re: Re: Empty link tags converted to open and closed tags
johnwcowan
 
graemekidd@... scripsit:

> Ah OK, if that's the case then I think I was slightly confused by this:
> "TagSoup also includes a command-line processor that reads HTML files and can
> generate either clean HTML or well-formed XML that is a close approximation to
> XHTML."
>
> Does that "approximation" mean that empty tags are not possible?

Empty tags are a strictly optional feature.  It's not possible to
control what kind of quotes are used around attribute values, or
to use newlines instead of spaces between attributes, or many other
purely lexical things.

--
Overhead, without any fuss, the stars were going out.
         --Arthur C. Clarke, "The Nine Billion Names of God"
                 John Cowan <cowan@...>
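
If the short form is needed downstream, one option is to collapse the long form
in the serialized output after TagSoup has run.  The sketch below is a textual
post-processing step, not a TagSoup option; the class name EmptyTagCollapser is
hypothetical, and the list of element names is an assumption covering common
HTML 4 empty elements.  It is deliberately limited to those elements so that
something like <script></script> is left alone.

```java
import java.util.regex.Pattern;

// Hypothetical post-processor (not a TagSoup class): turns <link ...></link>
// into <link ... /> for a fixed set of empty HTML elements.
final class EmptyTagCollapser {
    // Group 1: element name; group 2: attributes (possibly empty).
    // The \1 back-reference requires the end tag to match the start tag.
    private static final Pattern EMPTY_ELEMENT = Pattern.compile(
        "<(area|base|br|col|hr|img|input|link|meta|param)"
        + "((?:\\s[^<>]*)?)></\\1>");

    static String collapse(String xml) {
        return EMPTY_ELEMENT.matcher(xml).replaceAll("<$1$2 />");
    }
}
```

An attribute value that itself contains '>' will not match the pattern, so
such an element is simply left in the long form rather than being mangled.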

#1362 From: "graemekidd@..." <coolkidd3@...>
Date: Sun Sep 13, 2009 11:38 pm
Subject: Re: Empty link tags converted to open and closed tags
graemekidd...
 
> > Hi,
> >
> > I noticed that after my XHTML is parsed the empty link tags <link /> are
> > converted to start and closed tags e.g. <link></link>
>
> In XML there is no distinction between the two forms, so TagSoup always
> generates the longer form and never the shorter.
>
> --
> On the Semantic Web, it's too hard to prove     John Cowan    cowan@...
> you're not a dog.  --Bill de hOra               http://www.ccil.org/~cowan
>
Ah OK, if that's the case then I think I was slightly confused by this:
"TagSoup also includes a command-line processor that reads HTML files and can
generate either clean HTML or well-formed XML that is a close approximation to
XHTML."

Does that "approximation" mean that empty tags are not possible?

#1361 From: John Cowan <cowan@...>
Date: Sun Sep 13, 2009 11:04 pm
Subject: Re: Empty link tags converted to open and closed tags
johnwcowan
 
Graeme Kidd scripsit:
> Hi,
>
> I noticed that after my XHTML is parsed the empty link tags <link /> are
> converted to start and closed tags e.g. <link></link>

In XML there is no distinction between the two forms, so TagSoup always
generates the longer form and never the shorter.

--
On the Semantic Web, it's too hard to prove     John Cowan    cowan@...
you're not a dog.  --Bill de hOra               http://www.ccil.org/~cowan

#1360 From: "Graeme Kidd" <coolkidd3@...>
Date: Sun Sep 13, 2009 10:54 pm
Subject: Empty link tags converted to open and closed tags
graemekidd...
 

Hi,

I noticed that after my XHTML is parsed the empty link tags <link /> are converted to start and closed tags e.g. <link></link>

 

Does anyone know how I can prevent this from happening?

 

Thanks


#1359 From: "knobs1723" <avl1@...>
Date: Fri Sep 11, 2009 10:49 am
Subject: Problems with nested tags
knobs1723
 
Hi,

this might have been asked a couple of times already, but searching the forum
did not really help.

I have marked-up text, mixing HTML tags with a custom tag <placeName>. The
latter is usually embedded in a <p>...</p> pair, i.e:
<p>...<placeName>...</placeName>...</p>. When I parse the text with TagSoup, the
output is made into:
<p>...</p><placeName>...</placeName>

Any ideas, suggestions what I am doing wrong? I'm using TagSoup 1.2.

Cheers,
Alex

#1358 From: John Cowan <cowan@...>
Date: Sun Sep 6, 2009 3:39 am
Subject: Re: ADMIN: too much spam, may move list
johnwcowan
 
Jaran Nilsen scripsit:

> Why not host the project on Sourceforge, Google Code or similar
> services? Then you get all the other project management goodies as
> well.

I probably will move to Google Code at some point.  For now, I'll just
move the mailing list unless I hear objections.  It's a very straightforward
process.

--
John Cowan  <cowan@...>  http://ccil.org/~cowan
Micropayment advocates mistakenly believe that efficient allocation of
resources is the purpose of markets.  Efficiency is a byproduct of market
systems, not their goal.  The reasons markets work are not because users
have embraced efficiency but because markets are the best place to allow
users to maximize their preferences, and very often their preferences are
not for conservation of cheap resources.  --Clay Shirky

#1356 From: Jaran Nilsen <jaran.nilsen@...>
Date: Thu Sep 3, 2009 5:40 am
Subject: Re: ADMIN: too much spam, may move list
jaran.nilsen@...
 
Why not host the project on Sourceforge, Google Code or similar
services? Then you get all the other project management goodies as
well.

Just a suggestion :)

Jaran

On Thu, Sep 3, 2009 at 12:14 AM, John Cowan<cowan@...> wrote:
>
>
> Another huge burst of spam. I haven't had nearly this much trouble on
> the Google Groups mailing lists I manage. Would there be any objections
> to my moving this list there? It doesn't mean you'll all need Google
> accounts or GMail addresses.
>
> Please respond with any objections within a week. Silence gives consent.
>
> --
> What has four pairs of pants, lives             John Cowan
> in Philadelphia, and it never rains             http://www.ccil.org/~cowan
> but it pours?                                   cowan@...
>          --Rufus T. Firefly
>
>



--
Jaran Nilsen
Web: http://www.jaranweb.com
MSN/GTalk: passport@... / jaran.nilsen@...
Tel.: +47 97 19 33 69
http://twitter.com/jarannilsen
http://www.linkedin.com/in/jarannilsen
http://www.facebook.com/jaran.nilsen

#1355 From: John Cowan <cowan@...>
Date: Wed Sep 2, 2009 10:14 pm
Subject: ADMIN: too much spam, may move list
johnwcowan
 
Another huge burst of spam.  I haven't had nearly this much trouble on
the Google Groups mailing lists I manage.  Would there be any objections
to my moving this list there?  It doesn't mean you'll all need Google
accounts or GMail addresses.

Please respond with any objections within a week.  Silence gives consent.

--
What has four pairs of pants, lives             John Cowan
in Philadelphia, and it never rains             http://www.ccil.org/~cowan
but it pours?                                   cowan@...
         --Rufus T. Firefly

#1324 From: "jeremiebousquet" <jeremie.bousquet@...>
Date: Tue Jun 23, 2009 6:09 pm
Subject: Re: Strange behaviour of TagSoup on <b> tags ?
jeremiebousquet
 
--- In tagsoup-friends@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> jeremiebousquet scripsit:
>
> > String xmlData = """<html>
> >  <body>
> > 	 <b>
> > 		 <p><a href="url">my first text</a></p>
> > 	 </b>
> >  </body>
> > </html>"""
>
> What's happening here is that TagSoup's model of HTML doesn't believe that
> a B element can have a P element inside it.  B elements are part
> of the inline group, P elements are part of the block group.
>
> Consequently, when the P start-tag is seen, the B element is closed;
> however, B is known to be a restartable element, so it will be reopened
> inside the P element and closed again when the P element is closed; once
> again, B will be restarted outside the P element and then closed for good.
>
> TagSoup isn't guaranteed to produce the best possible result, just one
> that is well-formed and follows the general model of HTML 4.  It's up
> to you to fix up the output in any useful way, using XSLT for example.
>
> --
> The first thing you learn in a lawin' family    John Cowan
> is that there ain't no definite answers         cowan@...
> to anything.  --Calpurnia in To Kill A Mockingbird
>

It's quite clear now, thanks for your answer !
I managed to work around it by changing all "<b>" to "<bold>" (quite ugly, but
it works), but maybe just removing "<b></b>" and swapping the order of <b> and
<p> would be enough.

Thanks again for help,
Jeremie

#1323 From: John Cowan <cowan@...>
Date: Sun Jun 21, 2009 5:57 pm
Subject: Re: Strange behaviour of TagSoup on <b> tags ?
johnwcowan
 
jeremiebousquet scripsit:

> String xmlData = """<html>
>  <body>
> 	 <b>
> 		 <p><a href="url">my first text</a></p>
> 	 </b>
>  </body>
> </html>"""

What's happening here is that TagSoup's model of HTML doesn't believe that
a B element can have a P element inside it.  B elements are part
of the inline group, P elements are part of the block group.

Consequently, when the P start-tag is seen, the B element is closed;
however, B is known to be a restartable element, so it will be reopened
inside the P element and closed again when the P element is closed; once
again, B will be restarted outside the P element and then closed for good.

TagSoup isn't guaranteed to produce the best possible result, just one
that is well-formed and follows the general model of HTML 4.  It's up
to you to fix up the output in any useful way, using XSLT for example.

--
The first thing you learn in a lawin' family    John Cowan
is that there ain't no definite answers         cowan@...
to anything.  --Calpurnia in To Kill A Mockingbird
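
As one possible XSLT fix-up of the kind mentioned above, the sketch below drops
the empty B elements that are left outside the P after B is closed and
restarted.  It uses only the JDK's javax.xml.transform API; the class name
EmptyBoldRemover is hypothetical, and the input is assumed to be TagSoup's XML
output, with elements in the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical clean-up step: an identity transform plus one template that
// deletes b elements with no child elements and no non-whitespace text.
final class EmptyBoldRemover {
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:h='http://www.w3.org/1999/xhtml'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='h:b[not(*) and not(normalize-space())]'/>"
        + "</xsl:stylesheet>";

    static String clean(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("clean-up transform failed", e);
        }
    }
}
```

The restarted B inside the P is kept because it has an A child; only the two
empty ones on either side of the P are removed.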

#1322 From: Paul King <king@...>
Date: Sun Jun 21, 2009 9:09 am
Subject: Re: Strange behaviour of TagSoup on <b> tags ?
pking_asert
 
I'll let others with better HTML knowledge confirm but I have
vague recollections that in theory, b tags are supposed to always
be inside other tags, e.g. p tags. It depends what you are trying
to do but perhaps you can just walk through all of the nodes:

def tsparser = new XmlParser(new org.ccil.cowan.tagsoup.Parser())
tshtml = tsparser.parseText(xmlData)
println "HTML with TagSoup = " + tshtml
tshtml.body.'**'.each{ tsbtags ->
    println "  " + tsbtags
}

Cheers, Paul.


jeremiebousquet wrote:
>
>
>
> Hello,
>
> I'm new to TagSoup and Groovy, and trying to parse some html, not well
> formed I'm afraid (why I use TagSoup).
>
> But I have some strange behaviour that I can't explain, maybe due to my
> lack of knowledge. Hope you will be able to help me.
>
> This is my program, with heavily cleared html, I retained only the
> structure that seem to cause the problem :
>
> /* -------------------------------- */
> String xmlData = """<html>
> <body>
> <b>
> <p><a href="url">my first text</a></p>
> </b>
> </body>
> </html>"""
>
> def parser = new XmlParser()
> html = parser.parseText(xmlData)
> println("HTML = " + html)
> html.body.b.each() { btags ->
> println("TAG B = " + btags)
> }
>
> println()
> def tsparser = new XmlParser(new org.ccil.cowan.tagsoup.Parser())
> tshtml = tsparser.parseText(xmlData)
> println("HTML with TagSoup = " + tshtml)
> tshtml.body.b.each() { tsbtags ->
> println("TAG B with TagSoup = " + tsbtags)
> }
> /* -------------------------------- */
>
> Here is the result that is displayed :
>
> -----------------------------
> HTML = html[attributes={}; value=[body[attributes={};
> value=[b[attributes={}; value=[p[attributes={};
> value=[a[attributes={href=url}; value=[my first text]]]]]]]]]]
> TAG B = b[attributes={}; value=[p[attributes={};
> value=[a[attributes={href=url}; value=[my first text]]]]]]
>
> HTML with TagSoup = {http://www.w3.org/1999/xhtml}html[attributes={};
> value=[{http://www.w3.org/1999/xhtml}body[attributes={};
> value=[{http://www.w3.org/1999/xhtml}b[attributes={};
> value=[]], {http://www.w3.org/1999/xhtml}p[attributes={};
> value=[{http://www.w3.org/1999/xhtml}b[attributes={};
> value=[{http://www.w3.org/1999/xhtml}a[attributes={shape=rect, href=url};
> value=[my first text]]]]]],
> {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]]]]]
> TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
> TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
> -----------------------------
>
> It seems that the XMLParser alone could easily retrieve the <b> and its
> content, while TagSoup recognized 2 <b> (the opening and closing ?), but
> empty. Also there are strange urls everywhere.
> If I replace in my html data the <b> by, say, <be>, then neither of
> XMLParser alone or with Groovy have difficulty to find this <be> tag and
> its content. It seems to occur only with <b>.
> I can't use XMLParser alone, though, because my original html is not
> parseable with it because too badly formed.
>
> Am I missing something ?
>
> Thanks for help,
> Jeremie
>
>

#1321 From: "jeremiebousquet" <jeremie.bousquet@...>
Date: Sun Jun 21, 2009 7:47 am
Subject: Strange behaviour of TagSoup on <b> tags ?
jeremiebousquet
 
Hello,

I'm new to TagSoup and Groovy, and trying to parse some html, not well formed
I'm afraid (why I use TagSoup).

But I have some strange behaviour that I can't explain, maybe due to my lack of
knowledge. Hope you will be able to help me.

This is my program, with heavily cleared html, I retained only the structure
that seem to cause the problem :

/* -------------------------------- */
String xmlData = """<html>
	 <body>
		 <b>
			 <p><a href="url">my first text</a></p>
		 </b>
	 </body>
</html>"""

def parser = new XmlParser()
html = parser.parseText(xmlData)
println("HTML = " + html)
html.body.b.each() { btags ->
   println("TAG B = " + btags)
}

println()
def tsparser = new XmlParser(new org.ccil.cowan.tagsoup.Parser())
tshtml = tsparser.parseText(xmlData)
println("HTML with TagSoup = " + tshtml)
tshtml.body.b.each() { tsbtags ->
   println("TAG B with TagSoup = " + tsbtags)
}
/* -------------------------------- */

Here is the result that is displayed :

-----------------------------
HTML = html[attributes={}; value=[body[attributes={}; value=[b[attributes={};
value=[p[attributes={}; value=[a[attributes={href=url}; value=[my first
text]]]]]]]]]]
TAG B = b[attributes={}; value=[p[attributes={}; value=[a[attributes={href=url};
value=[my first text]]]]]]

HTML with TagSoup = {http://www.w3.org/1999/xhtml}html[attributes={};
  value=[{http://www.w3.org/1999/xhtml}body[attributes={};
value=[{http://www.w3.org/1999/xhtml}b[attributes={};
value=[]], {http://www.w3.org/1999/xhtml}p[attributes={};
value=[{http://www.w3.org/1999/xhtml}b[attributes={};
value=[{http://www.w3.org/1999/xhtml}a[attributes={shape=rect, href=url};
value=[my first text]]]]]],
{http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]]]]]
TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
TAG B with TagSoup = {http://www.w3.org/1999/xhtml}b[attributes={}; value=[]]
-----------------------------

It seems that the XMLParser alone could easily retrieve the <b> and its content,
while TagSoup recognized 2 <b> (the opening and closing ?), but empty. Also
there are strange urls everywhere.
If I replace in my html data the <b> by, say, <be>, then neither of XMLParser
alone or with Groovy have difficulty to find this <be> tag and its content. It
seems to occur only with <b>.
I can't use XMLParser alone, though, because my original html is not parseable
with it because too badly formed.

Am I missing something ?

Thanks for help,
Jeremie
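
On the "strange urls": the {http://www.w3.org/1999/xhtml} prefix in the output
is the XHTML namespace that TagSoup assigns to the elements it reports, printed
as part of each qualified name.  The sketch below shows what that means for
lookups, using only the JDK's DOM API on a hand-written sample string rather
than real TagSoup output; the class name NamespaceLookup is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: elements in the XHTML namespace are found only when the
// lookup names that namespace as well as the local name.
final class NamespaceLookup {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static int countBold(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Matches b elements whose namespace URI is the XHTML namespace.
            return doc.getElementsByTagNameNS(XHTML_NS, "b").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }
}
```

A document with the same tags but no xmlns declaration yields zero matches,
because its b elements are in no namespace at all.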

#1320 From: "John Cowan" <cowan@...>
Date: Thu Jun 11, 2009 6:28 pm
Subject: My apologies for the outburst of spam
johnwcowan
 
Only a few of them got through my personal filters, and when I noticed the
pattern, I removed and blocked the sending email and removed the messages from
the archive.

#1287 From: Leslie Software <lesliesoftware@...>
Date: Fri May 29, 2009 8:18 pm
Subject: SOLVED Re: Switching from internal SAX2DOM to SAX2DOM from Apache causes IncompatibleClassChangeError exception
lesliesoftware
 
I figured it out.  It was a class path problem.  I'm using Eclipse plug-ins, so
my plug-in was trying to use Xalan's SAX2DOM and the class loader was seeing two
ContentHandler definitions because of the way I set up my Xalan plug-in.  Once I
figured out that was what was going on, I referred to my copy of the book
"Eclipse Rich Client Platform: Designing, Coding, and Packaging Java
Applications" about troubleshooting class path problems.  I now have it working
just fine.

Ian
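
A different way to sidestep the SAX2DOM question is to avoid that class
altogether and let the JDK's identity Transformer build the DOM from any
XMLReader.  The sketch below is an assumption-labeled alternative, not what was
done above: the class name SaxToDom is hypothetical, and jdkReader() exists only
so the example runs without TagSoup on the class path.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: builds a DOM from SAX events with public JAXP classes
// only, so no internal or Xalan-specific SAX2DOM class is referenced.
final class SaxToDom {
    static Node toDom(XMLReader reader, String text) {
        try {
            DOMResult result = new DOMResult();
            // newTransformer() with no stylesheet is the identity transform.
            TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(reader, new InputSource(new StringReader(text))),
                result);
            return result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("SAX to DOM conversion failed", e);
        }
    }

    // Stand-in reader for the demo; in the application above the TagSoup
    // parser would be passed instead.
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }
}
```

Since org.ccil.cowan.tagsoup.Parser is an XMLReader, passing a new instance of
it as the first argument should give a DOM of the cleaned-up HTML.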


----- Original Message ----
From: Leslie Software <lesliesoftware@...>
<snip>
java.lang.IncompatibleClassChangeError: Class org.apache.xalan.xsltc.trax.SAX2DOM does not implement the requested interface org.xml.sax.ContentHandler
     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:405)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.createSlotsFromHTMLTransfer(ReadClipboard.java:236)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:125)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:103)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction$1.run(PasteAction.java:64)
     at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction.run(PasteAction.java:58)
     ...
<snip>




#1286 From: Leslie Software <lesliesoftware@...>
Date: Fri May 29, 2009 10:39 am
Subject: Switching from internal SAX2DOM to SAX2DOM from Apache causes IncompatibleClassChangeError exception
lesliesoftware
 
I have been using TagSoup for some time for various tasks and it does a great
job.  One application uses TagSoup to parse HTML from the clipboard.  Recently,
when I tried to fix up my access to the SAX2DOM class, I ran into trouble.  I
had been using the internal implementation from the Sun JRE found in
com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM without really paying
attention that I was doing this.  When I noticed I was using an internal class
that was not public, I downloaded Apache Xalan 2.7.1 and tried to use it
instead.  This generated the error:

java.lang.IncompatibleClassChangeError: Class org.apache.xalan.xsltc.trax.SAX2DOM does not implement the requested interface org.xml.sax.ContentHandler
     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:405)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.createSlotsFromHTMLTransfer(ReadClipboard.java:236)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:125)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.clipboard.ReadClipboard.readContents(ReadClipboard.java:103)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction$1.run(PasteAction.java:64)
     at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
     at com.lesliesoftware.wizardsfamiliar.deck.editor.actions.PasteAction.run(PasteAction.java:58)
     ...


Am I using incompatible versions?  Do I need to modify my code to make it work?

Any suggestions or advice would be appreciated.

Ian

Here is what the parsing code looks like (the exception comes at the
parser.parse line):

     private void createSlotsFromHTMLTransfer (String contents) {

         StringReader inputReader = null;
         try {
             // Create and configure the parser
             Parser parser = new Parser ();
             parser.setFeature ("http://xml.org/sax/features/namespace-prefixes", true); //$NON-NLS-1$

             // Parse the HTML
             SAX2DOM sax2dom = new SAX2DOM ();
             parser.setContentHandler (sax2dom);
             inputReader = new StringReader (contents);
             InputSource inputSource = new InputSource (inputReader);
             inputSource.setEncoding ("UTF-8"); //$NON-NLS-1$
             parser.parse (inputSource);
             Node doc = sax2dom.getDOM ();

             if (TraceUtil.isOptionEnabled (DebugOptions.CLIPBOARD_SAVEXML))  {
                 // debug code to save the xml to disk omitted
             }

             resetCurrentLine ();

             String topLevelParents = "/html:html/html:body"; //$NON-NLS-1$
             NodeList topLevelNodes = XPathHelper.selectNodeList (doc, topLevelParents);

             for (int index = 0; index < topLevelNodes.getLength (); index++)  {
                 Node curNode = topLevelNodes.item (index);

                 processContainingNode (curNode);
             }

         }  catch (Exception exception)  {
             //  Any exception indicates invalid data
             EditorPlugin.getDefault ().getLog ().log (DeckEditorError.clipboardFormatError (exception));
             DND.error (DND.ERROR_INVALID_DATA);
             mySlots = null;
         }  finally  {
             if (inputReader != null)
                 inputReader.close ();
         }
     }



--
Ian Leslie - Shareware Author (mailto:lesliesoftware@...)



#1281 From: "James Abley" <james.abley@...>
Date: Wed Apr 29, 2009 9:04 pm
Subject: Re: Elements getting stripped out unexpectedly
taboozizi
 
--- In tagsoup-friends@yahoogroups.com, John Cowan <cowan@...> wrote:
>
> James Abley scripsit:
>
> > I've encountered an issue using TagSoup and I wanted to clarify whether
> > it is expected behaviour due to how I'm using it, or something else.
> >
> > The issue that I'm seeing is that I'm parsing an RSS feed and it
> > eventually goes through TagSoup to ensure that I store well-formed XML.
> >
> >
> > http://www.guardian.co.uk/football/2009/feb/26/real-madrid-rafa-benitez-liverpool/rss
> >
> > The <br/> element between the first two bullet points in that story
> > is getting removed when I parse the <item/> description and I'm not
> > sure why that is the case.
>
> I can't duplicate this problem with TagSoup 1.2.  It turns into a
> <br clear="none"></br>, because there's a default attribute value
> in the HTML 4.0 DTD, and TagSoup doesn't generate empty elements.
>
> > Is there a source repository that I can check out anonymously and write
> > some tests against? I've not been able to find one through Google -
> > too much interference from the Haskell version, etc.
>
> You can always download the source of released versions from
> http://tagsoup.info.  There is no public repository.
>
> --
> John Cowan    <cowan@...>     http://www.ccil.org/~cowan
> But no living man am I!  You look upon a woman.  Eowyn I am, Eomund's daughter.
> You stand between me and my lord and kin.  Begone, if you be not deathless.
> For living or dark undead, I will smite you if you touch him.
>

Sorry, that's absolutely right. A later step in my XML pipeline is removing that
element. Apologies for the noise.

Cheers,

James

#1280 From: John Cowan <cowan@...>
Date: Tue Apr 28, 2009 9:45 pm
Subject: Re: Elements getting stripped out unexpectedly
johnwcowan
 
James Abley scripsit:

> I've encountered an issue using TagSoup and I wanted to clarify whether
> it is expected behaviour due to how I'm using it, or something else.
>
> The issue that I'm seeing is that I'm parsing an RSS feed and it
> eventually goes through TagSoup to ensure that I store well-formed XML.
>
>
> http://www.guardian.co.uk/football/2009/feb/26/real-madrid-rafa-benitez-liverpool/rss
>
> The <br/> element between the first two bullet points in that story
> is getting removed when I parse the <item/> description and I'm not
> sure why that is the case.

I can't duplicate this problem with TagSoup 1.2.  It turns into a
<br clear="none"></br>, because there's a default attribute value
in the HTML 4.0 DTD, and TagSoup doesn't generate empty elements.

> Is there a source repository that I can check out anonymously and write
> some tests against? I've not been able to find one through Google -
> too much interference from the Haskell version, etc.

You can always download the source of released versions from
http://tagsoup.info.  There is no public repository.

--
John Cowan    <cowan@...>     http://www.ccil.org/~cowan
But no living man am I!  You look upon a woman.  Eowyn I am, Eomund's daughter.
You stand between me and my lord and kin.  Begone, if you be not deathless.
For living or dark undead, I will smite you if you touch him.

#1279 From: "James Abley" <james.abley@...>
Date: Tue Apr 28, 2009 10:07 am
Subject: Elements getting stripped out unexpectedly
taboozizi
 
Hi,

I've encountered an issue using TagSoup and I wanted to clarify whether it is
expected behaviour due to how I'm using it, or something else.

The issue that I'm seeing is that I'm parsing an RSS feed and it eventually goes
through TagSoup to ensure that I store well-formed XML.

http://www.guardian.co.uk/football/2009/feb/26/real-madrid-rafa-benitez-liverpool/rss

The <br/> element between the first two bullet points in that story is getting
removed when I parse the <item/> description and I'm not sure why that is the
case.

"&lt;p&gt;• Liverpool manager says he will be staying at Anfield&lt;br /&gt;•
Spaniard praises team for win away to Real Madrid&lt;/p&gt;"

The markup is being correctly unescaped prior to being passed to TagSoup.

Is there a source repository that I can check out anonymously and write some
tests against? I've not been able to find one through Google - too much
interference from the Haskell version, etc.

Cheers,

James

#1277 From: Michael Giles <mgiles@...>
Date: Thu Mar 26, 2009 10:10 pm
Subject: Unescaped less than...
michael_a_giles
 
I seem to have found another place where TagSoup gets in a bit of a
huff.  Perhaps there is a flag I can specify to make things better.

The issue is this piece of HTML, in which the less-than sign is,
erroneously, a literal character rather than an entity:

<em><90 min</em>

and it is occurring on this page - http://tinyurl.com/dgmjjt

On first inspection it seems like TS makes some sort of sense out of it:

<em><_90 min="min" em="em">. </_90></em>

But then it starts inserting <em></em> all over the document (this was
the only "em" in the original doc). And then one of those inserted ones
doesn't have a matching end tag (which is how I stumbled upon this when
it hit the XML parser).  Any simple resolution?
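
One possible workaround, sketched here as an assumption rather than anything
TagSoup provides, is to pre-process the markup before parsing: a "<" that is
not followed by a letter, "/", "!" or "?" cannot begin a tag, so it can be
rewritten as &lt;. The class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

public class StrayLessThan {
    // Matches a '<' that is NOT followed by a letter, '/', '!' or '?'.
    // Such a '<' cannot start a start tag, end tag, comment/doctype
    // or processing instruction.
    private static final Pattern STRAY_LT =
            Pattern.compile("<(?![A-Za-z/!?])");

    // Replace every stray '<' with the entity reference &lt;.
    static String escapeStrayLessThan(String html) {
        return STRAY_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayLessThan("<em><90 min</em>"));
    }
}
```

Note the limits: this also rewrites a "<" inside <script> or <style> content
(for example "a < 1"), so it is only safe where such content is absent or
handled separately.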

-Mike

#1276 From: John Cowan <cowan@...>
Date: Wed Mar 18, 2009 8:45 pm
Subject: Re: Tagsoup library breaks on malformed doctype
johnwcowan
 
Miguel Garcia scripsit:
> Hi,
>
> In a project where we use TagSoup to tidy some malformed XHTML code, we
> have found that if there is an odd number of quotes in the doctype
> declaration, TagSoup throws a String-related exception and fails. For
> example, with the following input:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN" "> <html
> xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <head><title>Test
> with bogus doctype</title></head> <body> <p>This page has an extra quote
> in the doctype, which the tagsoup library doesn't like.</p> </body>
> </html>

The real problem is that TagSoup thinks the system-id begins with a quote
and ends with a quote, but doesn't realize that it's zero-length.  The
obvious fix to Parser#trimquotes doesn't work, though.  I think this will
be straightforward to find a patch for, but I'll need to do a bit of debugging.
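
To make the failure mode concrete, here is a self-contained sketch. It is
NOT TagSoup's source: naiveTrim and guardedTrim are hypothetical helpers, and
the assumption is that the system-id reaches the trimming step as a single
quote character. As noted above, a guard like this is evidently not enough
on its own inside TagSoup; the sketch only shows why a one-character
"quoted" string is dangerous.

```java
public class QuoteTrimSketch {
    static boolean isQuote(char c) {
        return c == '"' || c == '\'';
    }

    // Naive version: if the first and last characters are the same
    // quote, strip them. For a one-character string the first and
    // last character are the same character, so the test passes and
    // substring(1, 0) throws StringIndexOutOfBoundsException.
    static String naiveTrim(String s) {
        if (s == null || s.isEmpty()) return s;
        char first = s.charAt(0);
        char last = s.charAt(s.length() - 1);
        if (first == last && isQuote(first)) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    // Guarded version: only strip when there really are two quote
    // characters, i.e. the string has at least two characters.
    static String guardedTrim(String s) {
        if (s == null || s.length() < 2) return s;
        char first = s.charAt(0);
        char last = s.charAt(s.length() - 1);
        if (first == last && isQuote(first)) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(guardedTrim("\"abc\"")); // both quotes stripped
        System.out.println(guardedTrim("\""));      // lone quote left as-is
    }
}
```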

--
John Cowan            http://www.ccil.org/~cowan     cowan@...
Uneasy lies the head that wears the Editor's hat! --Eddie Foirbeis Climo

#1275 From: "Miguel Garcia" <miguel.garcia@...>
Date: Wed Mar 18, 2009 10:58 am
Subject: Tagsoup library breaks on malformed doctype
miguel.garcia@...
 
Hi,

In a project where we use TagSoup to tidy some malformed XHTML code, we
have found that if there is an odd number of quotes in the doctype
declaration, TagSoup throws a String-related exception and fails. For
example, with the following input:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN" "> <html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <head><title>Test
with bogus doctype</title></head> <body> <p>This page has an extra quote
in the doctype, which the tagsoup library doesn't like.</p> </body>
</html>

TagSoup throws the following exception:

[Fatal Error] :2:14: The document type declaration for root element type
"html" must end with '>'.
Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
String index out of range: -1

I'm not sure whether patching the library would be easy (I haven't
reviewed the source code yet), or whether it would be better just to add
some workarounds that help us recover from any unexpected error thrown by
TagSoup.

Miguel

#1274 From: "Klotz, Leigh" <Leigh.Klotz@...>
Date: Thu Mar 12, 2009 10:37 pm
Subject: RE: Patch for XMLWriter and newlines in TagSoup 1.2
leighklotz
 
Thanks!  And to correct another typo for the record, I'm sending
fragments, not fragmenents, which I guess is a back-formation from
documenents.
Leigh.

-----Original Message-----
From: tagsoup-friends@yahoogroups.com
[mailto:tagsoup-friends@yahoogroups.com] On Behalf Of John Cowan
Sent: Thursday, March 12, 2009 3:34 PM
To: tagsoup-friends@yahoogroups.com
Subject: Re: [tagsoup-friends] Patch for XMLWriter and newlines in
TagSoup 1.2

Klotz, Leigh scripsit:

> In tests, I'm seeing not just a newline, but a blank line after the
> document.  Is that what you see?  Your quoted simple output looks like
> it might have a blank line after it.

Yes, you're right.  I've never paid attention to this before.

> *   Problem 2:
> I've violated an assumption of XMLWriter; I'm sending it fragmenents,
> not a document or even a single element.  The result is a newline
> after each "toplevel" element.

Aha.

> Given that endDocument() outputs the final newline in the document, I
> really don't see what benefit line 632 has at all.  I believe that
> simply removing line 632 will let XMLWriter handle fragments without
> introducing extra whitespace, and will still leave the resulting
> serialization newline (though not blank line) terminated.

I agree: line 632 should just be flushed.

--
Clear?  Huh!  Why a four-year-old child         John Cowan
could understand this report.  Run out          cowan@...
and find me a four-year-old child.  I           http://www.ccil.org/~cowan
can't make head or tail out of it.
         --Rufus T. Firefly on government reports

#1273 From: John Cowan <cowan@...>
Date: Thu Mar 12, 2009 10:33 pm
Subject: Re: Patch for XMLWriter and newlines in TagSoup 1.2
johnwcowan
 
Klotz, Leigh scripsit:

> In tests, I'm seeing not just a newline, but a blank line after the
> document.  Is that what you see?  Your quoted simple output looks like
> it might have a blank line after it.

Yes, you're right.  I've never paid attention to this before.

> *   Problem 2:
> I've violated an assumption of XMLWriter; I'm sending it fragmenents,
> not a document or even a single element.  The result is a newline
> after each "toplevel" element.

Aha.

> Given that endDocument() outputs the final newline in the document,
> I really don't see what benefit line 632 has at all.  I believe that
> simply removing line 632 will let XMLWriter handle fragments without
> introducing extra whitespace, and will still leave the resulting
> serialization newline (though not blank line) terminated.

I agree: line 632 should just be flushed.

--
Clear?  Huh!  Why a four-year-old child         John Cowan
could understand this report.  Run out          cowan@...
and find me a four-year-old child.  I           http://www.ccil.org/~cowan
can't make head or tail out of it.
         --Rufus T. Firefly on government reports
