... I don't know. I built it as a library, and originally added the stand-alone application support for my own testing purposes, but I suspect that many...
I use TagSoup as one step in DeXSS. http://freshmeat.net/projects/dexss/ ...
I'm a library user. I used the standalone app for testing, but the projects where I've integrated TagSoup have always been as a library. Is there discussion...
I use TagSoup as a library. I use it to transform lists on the web into XML so it can be loaded into a database. I also use it in my main application to...
... Various "screen-scraping" jobs. E.g., this one which is just for fun: http://www.edavies.nildram.co.uk/#bumps More details at the bottom of this page: ...
Greetings. So, I'm using the tagSoup-1.2.jar file as a standalone program which I shell out to. What I'm trying to do here is convert in-the-wild HTML into...
... These are symptoms of specifying the wrong input encoding. You can't specify the input as UTF-8 unless the .html file *really is* encoded in UTF-8, or you...
... Recommendation. ... So, I've tried a variety of combinations of --encoding and --output-encoding parameters. The input HTML does indeed seem to be UTF-8...
... So it is. ... $ tagsoup --encoding=utf-8 --output-encoding=utf-8 <index.html >index.xhtml ... TagSoup can't provide that. It interprets all entity and...
... http://www.ccil.org/~cowan ... Okay, I'll have to accept that as the TagSoup behavior. However, small update. On Linux, your command line example works...
... Just to make sure: did you verify actual output file contents (and similarly for input), or view using an app? I ask this because the most common problem...
... two systems? On the Windows machine, it's Java(TM) SE Runtime Environment (build 1.6.0_07-b06) (official JRE from Sun). On Ubuntu, it's OpenJDK Runtime...
... this platform? ... similarly for input), or view using an app? I ask this because the most common problem reported is usually caused by a viewing app ...
... just ... with ... Let's try to tackle this from a slightly different angle here. For a moment, let's pretend that I'm a random user who has just discovered...
... About the only difference I can think of is that the platform-specific default encoding may well differ between a stock Windows system vs....
Hrm, not sure. There is a variance in result output depending on which encoding switches I provide on the command line. I'd say that at least *some* of them...
... Oh, you mean the literal command line, as in cmd.exe? I didn't realize that -- I thought you were spawning TagSoup from a program. -- You let them out...
... http://ccil.org/~cowan ... I have tested with both; there's no behavior difference in TagSoup. Either I'm in a cmd.exe window calling java -jar to TagSoup,...
... realize ... TagSoup from java is as an embedded library; that is, it's called via its API, not from command-line interface. And that may be the key ...
... Indeed, which says that the Java code is not at fault. The only thing I can think of is that the names of the available encodings might be different on...
... Yes, but not the same environment, with respect to the platform default encoding. The most likely scenario is that the default encoding on Linux happens to...
... Quite so, but in this case both --encoding and --output-encoding were specified as UTF-8. ... It seems that if the encoding name specified is not known,...
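For anyone comparing two machines as in the thread above, here is a minimal sketch of how one might check what a given JVM reports as its platform default encoding, whether it recognizes an encoding name, and what decoding UTF-8 bytes with the wrong charset looks like. It uses only the standard java.nio.charset classes and is not TagSoup code; the class name EncodingCheck and the helper decode are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Hypothetical helper: decode bytes with an explicitly named charset
    // rather than relying on the platform default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The platform default is what differs between a stock Windows
        // install and a typical Linux install.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Ask the JVM whether it knows an encoding name before using it.
        System.out.println("utf-8 known: " + Charset.isSupported("utf-8"));

        // "cafe" with an e-acute, written as an escape so the source file's
        // own encoding does not matter.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Right charset: the original text comes back.
        System.out.println(decode(utf8, "UTF-8"));
        // Wrong charset: the two UTF-8 bytes of the e-acute become two
        // separate characters, the usual sign of a mismatched input encoding.
        System.out.println(decode(utf8, "ISO-8859-1"));
    }
}
```

Running this on both machines and comparing the first two lines of output is one way to confirm or rule out the default-encoding theory. How the last two lines display depends on the console's own encoding, which is the viewing-app pitfall mentioned earlier in the thread.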
Here's a simplified example of the HTML I'm trying to parse: <p> <span id="data"> <p>important information</p> </span> </p> And here's what I get out of...