For the record, there are no GPL-only components in any version of TagSoup. In addition, please note that if someone must have a GPL-licensed version of...
G'day. When I use a URL with a question mark, the server is not happy. XMLReader parser = new org.ccil.cowan.tagsoup.Parser(); // tagsoup parser XPathContext...
Does TagSoup intend to be turning all the various quote characters into the appropriate unicode entities? I ask because I noticed that a page I parsed...
... Fortunately, the answer is "none of the above". :-) Unlike browsers, TagSoup does not do automatic encoding detection. (There is a Java library to do so...
Thanks for the quick response, John. Having gone through all of this pain five years ago for Furl, I should have known to be more explicit about my encodings....
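Since TagSoup does not detect encodings itself, being explicit about the encoding, as the exchange above suggests, might look like the following. This is only a minimal sketch using the standard SAX `InputSource`: the class name `ExplicitEncoding` and the helper `sourceWithCharset` are hypothetical, and ISO-8859-1 is an assumed example charset, not something taken from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

public class ExplicitEncoding {
    // Hypothetical helper: decode the bytes with a charset the caller already
    // knows, so the parser receives characters and never has to guess.
    static InputSource sourceWithCharset(InputStream bytes, Charset charset) {
        Reader reader = new InputStreamReader(bytes, charset);
        InputSource source = new InputSource(reader);
        // Record the charset name as well, for code that inspects the source.
        source.setEncoding(charset.name());
        return source;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a Latin-1 encoded fragment.
        byte[] page = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        InputSource source =
            sourceWithCharset(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);
        // The source could then be handed to any XMLReader, for example
        // new org.ccil.cowan.tagsoup.Parser().parse(source) with TagSoup on the classpath.
        System.out.println(source.getEncoding());
    }
}
```

Passing a character stream rather than a byte stream means the decoding decision is made once, in the caller's code, which is where the knowledge about the encoding usually lives.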
I was wondering if TagSoup will meet my needs. I have a server that is in need of a Parser/Substituter to replace a given URL (either img or a) with another...
... No, it isn't. TagSoup doesn't preserve invalid HTML. I'm not sure why you'd want to preserve it, but if that's your requirement, then you should probably...
This one has me stumped, and I can't quite track it down. I have some code that does some basic scraping using TagSoup. It works fine with some input, but...
... Are you setting any SAX properties or features? (Xalan might be setting some and you wouldn't know it, unfortunately). Is there any chance that you are...
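One way to find out whether some other component is setting SAX features or properties behind your back is to wrap the reader in a filter that records every such call. The sketch below is an assumption-laden illustration, not something from the thread: the class name `FeatureLogger` is hypothetical, and it wraps the JDK's default XMLReader only so that it runs without TagSoup on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class FeatureLogger extends XMLFilterImpl {
    // Every feature or property that a caller tries to set is recorded here.
    final List<String> log = new ArrayList<>();

    FeatureLogger(XMLReader parent) {
        super(parent);
    }

    @Override
    public void setFeature(String name, boolean value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        log.add("feature " + name + "=" + value);
        super.setFeature(name, value); // forward to the wrapped reader
    }

    @Override
    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        log.add("property " + name);
        super.setProperty(name, value); // forward to the wrapped reader
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK's built-in reader stands in for the real parser.
        XMLReader real = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        FeatureLogger logger = new FeatureLogger(real);
        logger.setFeature("http://xml.org/sax/features/namespaces", true);
        System.out.println(logger.log);
    }
}
```

Handing the wrapped reader to the transformer instead of the bare parser would then show which settings, if any, were applied before parsing began.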
hello i have 2 issues with tagsoup 1.2: 1. i have source of the page that detects javascript in this way: <html><head><noscript><meta http-equiv="refresh" ...
Martin Zdila
m.zdila@...
May 20, 2008 1:12 pm
1104
... That's easily patched. Get the source and edit src/definitions/html.tssl. Then add "<memberOf group='M_HEAD'/>" after the line "<element name='noscript'...
Is there a way to keep the body of <script> intact? I have HTML that looks like this: ... <script ...> //<![CDATA[ ... if (myvalue && yourvalue){ //]]> ...
... It suppresses close-tags on empty elements -- so <hr>, not <hr></hr> -- and it uses minimized attributes in certain cases, so <input checked>, not <input...
hi john thanks for your reply, it helped me a lot! in addition i had to also add <contains group='M_HEAD'/> to <element name='noscript' type='mixed'>. without...
Martin Zdila
m.zdila@...
May 23, 2008 9:26 am
1117
Hello, if you run java -jar tagsoup-1.2.jar http://ppe.sk/news.htm you will see many nested <strong> tags which are not on the original page. Is it possible to fix that? ...
Martin Zdila
m.zdila@...
May 23, 2008 11:15 am
1119
... Thanks. Quite right. I've added this to the next release. -- In my last lifetime, John Cowan I believed in reincarnation;...
Hello, after parsing an (X)HTML document I always get null from Document.getDoctype(). Is that actually implemented? If not, could you please do that? It...
Martin Zdila
m.zdila@...
May 28, 2008 12:20 pm
1121
sorry, but my DOMBuilder didn't handle that. bad martin, bad martin :-) ... -- Martin Zdila CTO M-Way Solutions Slovakia s.r.o. Letna 27, 040 01 Kosice ...
Martin Zdila
m.zdila@...
May 28, 2008 1:12 pm
1125
... That's a known problem that has to do with tags opened in each of various cells of a table and never closed again. I will fix it in the next release. -- ...
Hello Group, I've been using TagSoup with some data for which I do not know the encoding ahead of time and playing around with auto detection of character...
Nitay Joffe
nitay@...
Jun 3, 2008 11:52 pm
1129
Hello, I hope that the intent of TagSoup is to parse ugly HTML into a DOM (XML) so that both render the same in a modern web browser. It means that...
Martin Zdila
m.zdila@...
Jun 9, 2008 7:54 am
1131
Hello, I found a page with the following structure: <html><head>...</head><noscript><body>...</body></noscript><frameset>...</frameset></html> The body was thrown out...
Martin Zdila
m.zdila@...
Jun 9, 2008 8:52 am
1132
... Yes and no. TagSoup does attempt to produce output similar to that of Web browsers, but only within the limits of its design model. It does not contain...
... Thanks. I'll add this to the next release. ... When I get time and energy to work on it enough to release it. ... Not at present. ... It's just me, except...
Hello ... What I need is a simple thing ;-) - let SAX generate these events: open table, open tr, open td, text "cell1", close td, open span, text "err1", close...
Martin Zdila
m.zdila@...
Jun 9, 2008 3:12 pm
1135
... You want to modify html.tssl, not html.stml (which is about the lexer). The simplest change *for this specific problem* is probably to add <contains...
... Like John said, TagSoup operates at a lower level, "below" a DOM. So what you can do is to use a tree model such as XOM, and do additional fixing _you_...
Hello Tatu, thanks for the reply ... I am actually using Xerces to build a DOM from TagSoup, and Xalan for XPath processing, transformation and serialization....