bixo-dev · Bixo
Message #155 of 297
Re: [bixo-dev] Neko HTML Parser

Hi Fuad,

1. I'll need to look into how CyberNeko uses filters, but from what I've seen with other parsers, these filters get applied during parsing, not before. Otherwise it would be very hard to properly handle broken HTML where (for example) end tags are missing.

2. Also note that Tika is switching to TagSoup (Jira issue TIKA-310).

3. Re which tags are of interest: that depends on what Bixo is being used for. If it's just generating an index, then often only the content and links matter, so tags that carry neither can be stripped.

But I think the more common use case is data mining, where you'd want all of the tags so you can do appropriate pattern matching on layout. That's key for extracting semi-structured data.
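
To make the "index only" case concrete, here is a minimal sketch that keeps just the link targets and the text content and ignores every other tag. It is not Bixo's or Tika's code: it uses the JDK's built-in Swing HTML parser purely as a stand-in for NekoHTML or TagSoup, and the class name LinkAndTextExtractor and its fields are made up for illustration. A data mining use case would instead need to keep the structural tags this sketch throws away.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values and text, drops all other markup.
class LinkAndTextExtractor extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Only anchors are of interest; every other start tag is ignored.
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Separate text runs with a single space.
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    // Parses the given HTML string and returns the filled-in extractor.
    static LinkAndTextExtractor extract(String html) {
        LinkAndTextExtractor callback = new LinkAndTextExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }
}
```

Calling LinkAndTextExtractor.extract(html) and then reading its links and text fields gives roughly what an indexer needs; nothing about [p], [table] or [div] survives, which is exactly why this approach does not fit layout-based extraction.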

-- Ken


On Nov 1, 2009, at 11:51am, Freddy wrote:

Guys,

I am currently using ElementRemover with a list of TAGS to be ignored (removed) from the stream _before_ parsing content; it also lets me deal with invalid XML right before parsing, and more. Are we interested in [p], [table], [div] tags in the DOM, or just anchors [a] with href? Plus images, of course. I believe it runs faster:

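// Imports the snippet needs (NekoHTML and Xerces XNI)
import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.cyberneko.html.filters.ElementRemover;
import org.cyberneko.html.parsers.DOMParser;

// HtmlTag (not shown) lists each tag with its accept/remove flags
ElementRemover remover = new ElementRemover();
for (HtmlTag t : HtmlTag.TAGS) {
    if (t.accept)
        remover.acceptElement(t.tag, t.attributes); // keep the tag and the listed attributes
    if (t.remove)
        remover.removeElement(t.tag); // drop the tag together with its content
}

// Register the remover in the parser's filter pipeline
XMLDocumentFilter[] filters = { remover };
parser = new DOMParser();
...
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);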
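
What a remover filter of this kind does to the parser's event stream can be modeled without NekoHTML. The sketch below is a toy model, not the NekoHTML or Xerces XNI API: Handler, Collector, RemovingFilter and FilterDemo are all hypothetical names. It assumes the filter receives start/text/end events that are already balanced.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical event interface; a toy stand-in, not the Xerces XNI API.
interface Handler {
    void start(String tag);
    void text(String s);
    void end(String tag);
}

// Terminal handler that serializes whatever events reach it.
class Collector implements Handler {
    final StringBuilder out = new StringBuilder();
    public void start(String tag) { out.append('<').append(tag).append('>'); }
    public void text(String s) { out.append(s); }
    public void end(String tag) { out.append("</").append(tag).append('>'); }
}

// Drops the listed elements and everything nested inside them.
class RemovingFilter implements Handler {
    private final Set<String> removed;
    private final Handler next;
    private int depth = 0; // > 0 while inside a removed element

    RemovingFilter(Handler next, String... removedTags) {
        this.next = next;
        this.removed = new HashSet<>(Arrays.asList(removedTags));
    }

    public void start(String tag) {
        if (depth > 0 || removed.contains(tag)) { depth++; return; }
        next.start(tag);
    }

    public void text(String s) {
        if (depth == 0) next.text(s);
    }

    public void end(String tag) {
        if (depth > 0) { depth--; return; }
        next.end(tag);
    }
}

class FilterDemo {
    // Feeds the balanced event stream for
    // <div><script>x()</script><a>link</a></div> through the filter.
    static String run() {
        Collector sink = new Collector();
        Handler chain = new RemovingFilter(sink, "script");
        chain.start("div");
        chain.start("script");
        chain.text("x()");
        chain.end("script");
        chain.start("a");
        chain.text("link");
        chain.end("a");
        chain.end("div");
        return sink.out.toString();
    }
}
```

The depth counter only works because every start event has a matching end event. If the end tag for a removed element never arrived, depth would stay above zero and the rest of the document would be dropped, which is one reason to run such a filter on the parser's repaired event stream rather than on the raw text.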

-Fuad

P.S.
I am interested in structural tags also, such as [table], [div]... for some kind of "mining"... but I haven't implemented it yet.


--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g






Tue Nov 3, 2009 10:39 pm

kkrugler