bixo-dev · Bixo Web Mining Toolkit

Group Information

  • Members: 83
  • Category: Open Source
  • Founded: May 17, 2009
  • Language: English

Messages

Messages 980 - 1009 of 1009
980 reallyreallylongalias
reallyreally...
Apr 26, 2012
2:00 am
And if the question was lost in all that: "Do I need to reparse the fetched data because BoilerpipeContentHandler cannot act as the default one for the Tika...
981 Ken Krugler
kkrugler
Apr 26, 2012
2:49 am
Hi Pat, ... My guess is that you're not calling super(new Fields("url", "content")) in the constructor for your WriteMahoutSequenceFileFunction class. Without...
982 Ken Krugler
kkrugler
Apr 26, 2012
2:55 am
Hi Pat, ... Correct. The content handler gets called on all of the tags that the Tika parser returns. ... If you look inside of the SimpleParser class's...
983 reallyreallylongalias
reallyreally...
Apr 26, 2012
1:13 pm
Hi Ken, Well, you are correct but since the operation is being passed in a ParsedDatum I did this in the constructor: public WriteMahoutSequenceFileFunction()...
984 reallyreallylongalias
reallyreally...
Apr 26, 2012
2:54 pm
Hi Ken, Well I'm still an idiot, but an idiot with running code! Many thanks. I see now that the fields are for the planner and are tied to the data created,...
985 Ken Krugler
kkrugler
Apr 26, 2012
5:46 pm
... The field names for what you're generating can be anything, since SequenceFiles don't store field names, just the type of the key/values, followed by an...
986 reallyreallylongalias
reallyreally...
Apr 26, 2012
10:26 pm
Now that I have Bixo writing parsed data to Mahout using Boilerpipe (thanks Ken), it is indeed better than sliced bread, but now I'd like to consolidate output...
987 Ken Krugler
kkrugler
Apr 26, 2012
11:43 pm
Hi Pat, Creating a Map<pipe name, Tap> for sources is when you have multiple input files that go to different head pipes ("open" pipes at the top of the...
988 Ken Krugler
kkrugler
Apr 27, 2012
1:50 pm
Hi Michele, ... There is a cascading.kryo project that makes it easy to serialize custom objects (e.g. they don't have to implement Hadoop Writable). Though I...
989 Ken Krugler
kkrugler
Apr 27, 2012
1:53 pm
Hi Michele, Thanks for the continued input, it's useful. Yes, we need to include some information that clarifies how to build a separate tool (as a job jar)...
990 reallyreallylongalias
reallyreally...
Apr 27, 2012
3:08 pm
Hi Ken, Yes, that is exactly what I needed. Everything is running well now and exporting to a single directory. I've run mahout vector generation, clustering,...
991 Pat Ferrel
reallyreally...
May 3, 2012
5:32 pm
I've moved bixo from my pseudo-distributed one node cluster on localhost to a small two machine cluster running hadoop 0.20.203 on Ubuntu 11.10 (is this OK,...
992 Ken Krugler
kkrugler
May 3, 2012
6:33 pm
Hi Pat, ... This should be fine, though we typically run on 0.20.205 ... What class of server are you using for the slaves? And what values of maxthreads have...
993 Pat Ferrel
reallyreally...
May 3, 2012
9:08 pm
The cluster is made of some old machines, a 4-core and a 2-core 64-bit with 8G of ram and 1T of disk each. I tried 2 and 10 for max threads, maybe will try one...
994 Pat Ferrel
reallyreally...
May 4, 2012
10:08 pm
Yikes, an ubuntu update changed a php5 cron job that spawned zombies. Sounds like the plot of a cable TV show... Bixo is still a zombie free zone....
995 Pat Ferrel
reallyreally...
May 14, 2012
4:53 pm
I'm using bixo with boilerpipe through TikaCallable. I also need to filter by language but it looks like the language in ParsedDatum is not being set?...
996 Ken Krugler
kkrugler
May 14, 2012
5:04 pm
... There shouldn't be an issue with using Boilerpipe. If you check out the source in TikaCallable#call, you'll see: if (_extractLanguage) { profilingHandler =...
997 Pat Ferrel
reallyreally...
May 14, 2012
6:34 pm
OK, it looks like Tika is failing to determine the language for pages that look pretty obvious and are recognized by the Google compact language detector in...
998 Ken Krugler
kkrugler
May 14, 2012
7:06 pm
... You may or may not get better results from using the output of Boilerpipe to detect the language, versus the full page contents (which is how it works...
999 Pat Ferrel
reallyreally...
May 14, 2012
8:51 pm
That site is a lot of pure Spanish. Reducing the amount of text fed to the language identifier seems counterintuitive but I might try it if the easy thing...
1000 Ted Dunning
ted.dunning@...
May 14, 2012
11:56 pm
The visible text may be pure Spanish but HTML pages are often constructed by copying and pasting vats of HTML or CSS. These can have comments in the wrong...
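A rough illustration of the point above, assuming the language detector is fed raw page source: comments, scripts, and style blocks can be stripped first so that only visible text reaches the detector. The class and method names are hypothetical (not part of Bixo or Tika), and the regular expressions are a crude stand-in for a real HTML parser such as the Tika or Boilerpipe handlers discussed in this thread.

```java
import java.util.regex.Pattern;

public class VisibleTextSketch {

    // Hypothetical helper patterns; a real pipeline would use an HTML parser.
    private static final Pattern COMMENTS =
        Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern SCRIPT_STYLE =
        Pattern.compile("<(script|style)\\b[^>]*>.*?</\\1\\s*>",
                        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns an approximation of the text a reader would actually see.
    public static String visibleText(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = SCRIPT_STYLE.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><!-- copied English boilerplate comment -->"
            + "<style>p { color: red; }</style>"
            + "<body><p>Hola, esto es una prueba.</p></body></html>";
        // Only the Spanish sentence should survive the stripping.
        System.out.println(visibleText(html));
    }
}
```

Whether this helps a given detector is an open question in the thread; the sketch only shows how much non-visible text can be removed before detection.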
1001 Ken Krugler
kkrugler
May 15, 2012
4:11 pm
... I've done some work incorporating language-detector (yet another language detection library, written in java) into another project. It handles short text...
1002 Pat Ferrel
reallyreally...
May 15, 2012
6:51 pm
FYI: I just did a very shallow crawl of these sites where Tika failed to detect any language. Seems that it is not so good for web pages. 20minutos.es ...
1003 Ken Krugler
kkrugler
May 17, 2012
6:11 pm
Hi Pat, ... I took a look at the first site, and yes - Tika should be able to detect it properly. I tried the same text using a version of language-detector...
1004 Pat Ferrel
reallyreally...
May 17, 2012
8:15 pm
I instrumented my exporter function to print out any detected language from the ParsedDatum and got none for the crawl. I also spot checked on several pages...
1005 Pat Ferrel
reallyreally...
May 25, 2012
5:15 pm
If, for some reason, a crawl fails to finish properly, what is the recommended way to restart it where it left off, or somewhere close? I tried deleting what...
1006 Ken Krugler
kkrugler
May 25, 2012
5:24 pm
Hi Pat, See below. But in general the SimpleCrawlTool is a demo of Bixo, not a complete crawler, thus much of the functionality you're asking about is missing....
1007 Pat Ferrel
reallyreally...
May 25, 2012
7:07 pm
OK, so if I understand correctly every time I restart a crawl on an existing one, it will extend the original crawl with newly fetched data. It never recrawls...
1008 Ken Krugler
kkrugler
May 25, 2012
7:24 pm
... The most recent loop dir has an up-to-date snapshot of the crawlDB, which is regenerated after each loop. ... Yes, exactly. You've hit upon a fundamental...
1009 Pat Ferrel
reallyreally...
4:54 pm
When the crawler loops it fetches every url found in the previous loop, right? So the crawl time will likely increase exponentially with each loop, right? So...
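To put numbers on the premise in this question, here is a minimal sketch that assumes every fetched page contributes a fixed number of never-before-seen URLs and that each loop fetches everything found in the previous one. Both are simplifying assumptions for illustration, and the class and parameter names are hypothetical; real crawls see many duplicate links, so actual growth is usually slower.

```java
public class CrawlGrowthSketch {

    // URLs fetched in one loop, where loop 0 fetches only the seeds.
    // Assumption: every fetched page adds `newLinksPerPage` unseen URLs.
    public static long urlsFetchedInLoop(long seeds, long newLinksPerPage, int loop) {
        long urls = seeds;
        for (int i = 0; i < loop; i++) {
            urls *= newLinksPerPage;
        }
        return urls;
    }

    // Cumulative URLs fetched over the first `loops` loops.
    public static long totalFetched(long seeds, long newLinksPerPage, int loops) {
        long total = 0;
        for (int loop = 0; loop < loops; loop++) {
            total += urlsFetchedInLoop(seeds, newLinksPerPage, loop);
        }
        return total;
    }

    public static void main(String[] args) {
        // 10 seeds, 5 new links per page, 4 loops.
        for (int loop = 0; loop < 4; loop++) {
            System.out.println("loop " + loop + ": "
                + urlsFetchedInLoop(10, 5, loop) + " URLs");
        }
        System.out.println("total: " + totalFetched(10, 5, 4));
    }
}
```

With 10 seeds and 5 new links per page, the loops fetch 10, 50, 250, and 1250 URLs, 1560 in total, which is the geometric growth the question is worried about.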
