And if the question was lost in all that: "Do I need to reparse the fetched data because BoilerpipeContentHandler cannot act as the default one for the Tika...
Hi Pat, ... My guess is that you're not calling super(new Fields("url", "content")) in the constructor for your WriteMahoutSequenceFileFunction class. Without...
Hi Pat, ... Correct. The content handler gets called on all of the tags that the Tika parser returns. ... If you look inside of the SimpleParser class's...
Hi Ken, Well, you are correct but since the operation is being passed in a ParsedDatum I did this in the constructor: public WriteMahoutSequenceFileFunction()...
Hi Ken, Well I'm still an idiot, but an idiot with running code! Many thanks. I see now that the fields are for the planner and are tied to the data created,...
... The field names for what you're generating can be anything, since SequenceFiles don't store field names, just the type of the key/values, followed by an...
I now have bixo writing parsed data to mahout using boilerpipe (thanks, Ken). It is indeed better than sliced bread, but now I'd like to consolidate output...
Hi Pat, Creating a Map<pipe name, Tap> for sources is when you have multiple input files that go to different head pipes ("open" pipes at the top of the...
Hi Michele, ... There is a cascading.kryo project that makes it easy to serialize custom objects (e.g. they don't have to implement Hadoop Writable). Though I...
Hi Michele, Thanks for the continued input, it's useful. Yes, we need to include some information that clarifies how to build a separate tool (as a job jar)...
Hi Ken, Yes, that is exactly what I needed. Everything is running well now and exporting to a single directory. I've run mahout vector generation, clustering,...
I've moved bixo from my pseudo-distributed one node cluster on localhost to a small two machine cluster running hadoop 0.20.203 on Ubuntu 11.10 (is this OK,...
Hi Pat, ... This should be fine, though we typically run on 0.20.205 ... What class of server are you using for the slaves? And what values of maxthreads have...
The cluster is made of some old machines, a 4-core and a 2-core 64-bit with 8G of ram and 1T of disk each. I tried 2 and 10 for max threads, maybe will try one...
I'm using bixo with boilerpipe through TikaCallable. I also need to filter by language but it looks like the language in ParsedDatum is not being set?...
... There shouldn't be an issue with using Boilerpipe. If you check out the source in TikaCallable#call, you'll see: if (_extractLanguage) { profilingHandler =...
OK, it looks like Tika is failing to determine the language for pages that look pretty obvious and are recognized by the Google compact language detector in...
... You may or may not get better results from using the output of Boilerpipe to detect the language, versus the full page contents (which is how it works...
That site is mostly pure Spanish. Reducing the amount of text fed to the language identifier seems counterintuitive, but I might try it if the easy thing...
Ted Dunning (ted.dunning@...), May 14, 2012 11:56 pm:
The visible text may be pure Spanish but HTML pages are often constructed by copying and pasting vats of HTML or CSS. These can have comments in the wrong...
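One way to act on this point is to drop comments, scripts, and stylesheets before handing text to a language identifier. The following is only a rough sketch using JDK regular expressions; it is not what Tika or Boilerpipe do internally, and the class and method names (`VisibleTextSketch`, `visibleText`) are hypothetical. Regex-based stripping is crude and will mishandle malformed HTML, so a real HTML parser would be the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper: crude removal of non-visible markup before language detection.
final class VisibleTextSketch {
    // HTML comments, which may contain copied-and-pasted text in any language.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    // Whole <script> and <style> elements, including their contents.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tags; their text content is kept.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    static String visibleText(String html) {
        String s = COMMENTS.matcher(html).replaceAll(" ");
        s = SCRIPT_STYLE.matcher(s).replaceAll(" ");
        s = TAGS.matcher(s).replaceAll(" ");
        // Collapse the whitespace left behind by the removals.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>/* English comment */ body{}</style></head>"
                + "<body><!-- copied note --><p>Hola mundo</p>"
                + "<script>var x = 1;</script></body></html>";
        System.out.println(visibleText(html)); // prints "Hola mundo"
    }
}
```

The result of `visibleText` could then be passed to whichever language detector is in use, in place of the full page contents.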
... I've done some work incorporating language-detector (yet another language detection library, written in java) into another project. It handles short text...
FYI: I just did a very shallow crawl of these sites where Tika failed to detect any language. It seems that it is not so good for web pages. 20minutos.es ...
Hi Pat, ... I took a look at the first site, and yes - Tika should be able to detect it properly. I tried the same text using a version of language-detector...
I instrumented my exporter function to print out any detected language from the ParsedDatum and got none for the crawl. I also spot checked on several pages...
If, for some reason, a crawl fails to finish properly, what is the recommended way to restart it where it left off, or somewhere close? I tried deleting what...
Hi Pat, See below. But in general the SimpleCrawlTool is a demo of Bixo, not a complete crawler, thus much of the functionality you're asking about is missing....
OK, so if I understand correctly every time I restart a crawl on an existing one, it will extend the original crawl with newly fetched data. It never recrawls...
... The most recent loop dir has an up-to-date snapshot of the crawlDB, which is regenerated after each loop. ... Yes, exactly. You've hit upon a fundamental...
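Since the latest crawlDB snapshot lives in the most recent loop dir, a restart needs to find that directory first. Here is a minimal sketch of that lookup. It assumes loop directory names begin with the loop number followed by a dash (for example `3-something`), which is an assumption made for illustration and not confirmed in this thread; the class and method names are hypothetical too.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper: pick the directory name with the highest numeric loop prefix.
final class LatestLoopDirSketch {
    static Optional<String> latestLoopDir(List<String> dirNames) {
        String best = null;
        int bestLoop = -1;
        for (String name : dirNames) {
            int dash = name.indexOf('-');
            if (dash <= 0) {
                continue; // no "<number>-" prefix, so not a loop dir under our assumption
            }
            try {
                int loop = Integer.parseInt(name.substring(0, dash));
                if (loop > bestLoop) {
                    bestLoop = loop;
                    best = name;
                }
            } catch (NumberFormatException e) {
                // Prefix is not a number; ignore this entry.
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String> dirs = List.of("0-seed", "1-loop", "10-loop", "2-loop", "logs");
        System.out.println(latestLoopDir(dirs).orElse("none")); // prints "10-loop"
    }
}
```

Comparing the prefixes as integers rather than as strings matters once a crawl passes loop 9, since "10-loop" sorts before "2-loop" lexicographically.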
When the crawler loops it fetches every url found in the previous loop, right? So the crawl time will likely increase exponentially with each loop, right? So...
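To put rough numbers on that growth: if every fetched page contributed a fixed number of never-before-seen URLs, the count fetched per loop would grow geometrically. The sketch below only does that arithmetic. The branching factor is a made-up parameter, real crawls see many duplicate and already-known links, and nothing here reflects how Bixo actually schedules fetches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical back-of-the-envelope model of per-loop fetch counts.
final class CrawlGrowthSketch {
    /**
     * Returns the number of URLs fetched in each loop, assuming every fetched
     * page yields exactly newLinksPerPage URLs that were never seen before.
     */
    static List<Long> urlsPerLoop(long seedUrls, long newLinksPerPage, int loops) {
        List<Long> counts = new ArrayList<>();
        long frontier = seedUrls;
        for (int i = 0; i < loops; i++) {
            counts.add(frontier);
            // Next loop fetches everything discovered in this one.
            frontier = frontier * newLinksPerPage;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 seeds, 5 new links per page, 4 loops.
        System.out.println(urlsPerLoop(10, 5, 4)); // prints [10, 50, 250, 1250]
    }
}
```

Under this simplified model, the work in loop n is the seed count times the branching factor raised to the power n, which is why limiting what gets fetched per loop becomes important quickly.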