Hi Fuad,
1. I'll need to look into how CyberNeko uses filters, but from what I've seen with other parsers, these filters get applied during parsing, not before. Otherwise it seems very hard to properly handle broken HTML where (for example) end tags are missing.
2. Also note that Tika is switching to TagSoup (Jira issue TIKA-310).
3. Re which tags are of interest: that depends on what Bixo is being used for. If it's just generating an index, then often only the content and links matter, so tags that carry neither can be stripped.
But I think the more common use case is data mining, where you'd want all of the tags so you can do appropriate pattern matching on layout. That's key for extracting semi-structured data.
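
To illustrate point 1, here is a minimal sketch of what goes wrong if elements are removed from the raw text before any parser has balanced the tags. NaiveRemover and removeElement are hypothetical names made up for this example; this is not how CyberNeko or any other parser implements its filters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-parse remover: works on raw text, with no knowledge of
// HTML's implied end tags. For illustration only.
class NaiveRemover {

    // Drops everything from each <tag ...> up to the next </tag>.
    static String removeElement(String html, String tag) {
        Pattern open = Pattern.compile("<" + tag + "\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        Pattern close = Pattern.compile("</" + tag + "\\s*>", Pattern.CASE_INSENSITIVE);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        Matcher m = open.matcher(html);
        while (m.find(pos)) {
            out.append(html, pos, m.start());
            Matcher c = close.matcher(html);
            if (c.find(m.end())) {
                pos = c.end();
            } else {
                // No end tag: nothing tells us where the element stops,
                // so the rest of the document is swallowed.
                pos = html.length();
                break;
            }
        }
        out.append(html, pos, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String wellFormed = "<div>intro</div><p>ad text</p><div>real content</div>";
        String broken = "<div>intro</div><p>ad text<div>real content</div>";
        System.out.println(removeElement(wellFormed, "p")); // <div>intro</div><div>real content</div>
        System.out.println(removeElement(broken, "p"));     // <div>intro</div>
    }
}
```

With the missing end tag, the "real content" div is lost. A parser that balances tags first would know where the [p] element ends, and a filter running after that step would only remove what it should.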
-- Ken
On Nov 1, 2009, at 11:51am, Freddy wrote:
Guys,
I am currently using ElementRemover with a list of tags to be ignored (removed) from the stream _before_ parsing the content. It also lets me deal with invalid XML right before parsing, and more. Are we interested in [p], [table], [div] tags in the DOM, or just anchors [a] with href? Plus images, of course. I believe it runs faster:
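    import org.apache.xerces.xni.parser.XMLDocumentFilter;
    import org.cyberneko.html.filters.ElementRemover;
    import org.cyberneko.html.parsers.DOMParser;

    // Register each tag from the HtmlTag table as accepted and/or removed.
    ElementRemover remover = new ElementRemover();
    for (HtmlTag t : HtmlTag.TAGS) {
        if (t.accept)
            remover.acceptElement(t.tag, t.attributes);
        if (t.remove)
            remover.removeElement(t.tag);
    }
    XMLDocumentFilter[] filters = { remover };
    parser = new DOMParser();
    ...
    // Install the remover in the parser's filter pipeline.
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);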
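
The HtmlTag table used in the loop above is not shown in this message. A minimal sketch of one possible shape follows; the class layout and the particular tags listed are assumptions for illustration, not the actual table, and nothing here comes from CyberNeko itself.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical tag table matching the fields the loop reads:
// t.tag, t.attributes, t.accept, t.remove.
final class HtmlTag {
    final String tag;
    final String[] attributes; // handed to acceptElement unchanged
    final boolean accept;      // register via acceptElement
    final boolean remove;      // register via removeElement

    private HtmlTag(String tag, String[] attributes, boolean accept, boolean remove) {
        this.tag = tag;
        this.attributes = attributes;
        this.accept = accept;
        this.remove = remove;
    }

    static HtmlTag accepted(String tag, String... attributes) {
        return new HtmlTag(tag, attributes, true, false);
    }

    static HtmlTag removed(String tag) {
        return new HtmlTag(tag, null, false, true);
    }

    // Example entries only: links and images kept, script and style dropped.
    static final List<HtmlTag> TAGS = Collections.unmodifiableList(Arrays.asList(
            accepted("a", "href"),
            accepted("img", "src", "alt"),
            removed("script"),
            removed("style")));
}
```

Structural tags such as [table] or [div] could be added as further accepted entries if layout matters for mining.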
-Fuad
P.S.
I am interested in structural tags also, such as [table], [div]... for some kind of "mining"... but I haven't implemented it yet.
--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c w e b m i n i n g