Thanks a lot, Ken! Is Hadoop/Mahout required to view the contents of all the folders, like the content, html, parse and status folders? Thanks a lot, Ken!...
1287
Rehan Malek
rehan_malek75
Jan 9, 2013 8:57 am
Thanks a lot, Ken! Could you please tell me whether Hadoop/Mahout is required to view the contents of all the folders, like the content, html, parse and status folder...
1288
Nilay Upadhyay
nilayupadhyay17
Jan 9, 2013 12:19 pm
Hello Chris! Thank you for your great response. Here is what I have understood so far; please correct me if I am wrong. Here are some of my questions. ...
1289
Ken Krugler
kkrugler
Jan 9, 2013 3:06 pm
... No - those are Hadoop SequenceFiles. They are binary, so there's no easy way to view them directly. You could switch to using a text format - see previous...
1290
Chris Schneider
schmed2000
Jan 9, 2013 6:50 pm
Hi Nilay, I have attempted to answer your latest questions below. However, you must ultimately be responsible for reading and understanding the Bixo...
1291
Rehan Malek
rehan_malek75
Jan 10, 2013 9:21 am
Hi all, I sincerely request a detailed answer to this, because it's one of the most important parts for the whole Bixo developer community when...
1292
Vivek Magotra
vmagotra
Jan 10, 2013 1:26 pm
Hi Rehan, For a beginner, the first thing I would suggest is to get familiar with Cascading (http://www.cascading.org). Currently Bixo uses the 1.2.x release....
1293
Vivek Magotra
vmagotra
Jan 10, 2013 1:26 pm
Hi Rehan, For a beginner, the first thing I would suggest is to get familiar with Cascading (http://www.cascading.org). Currently Bixo uses the 1.2.x release. ...
1294
Nilay Upadhyay
nilayupadhyay17
Jan 10, 2013 1:56 pm
Hello Chris! Thank you so much, from the bottom of my heart, for giving your valuable time. There are just a few questions. I have a question about the -url argument ...
1295
Chris Schneider
schmed2000
Jan 10, 2013 3:02 pm
Hi Nilay, ... I have tried to answer your latest set of questions below. I have now exhausted the time I have available to help you, at least until you are...
1296
Rehan Malek
rehan_malek75
Jan 11, 2013 8:49 am
Hi Vivek :) Thank you for your quick response. I have just gone through the Cascading documents, but what should be done to get only the URLs for all fetched pages? And I...
1297
Vivek Magotra
vmagotra
Jan 13, 2013 7:44 pm
Hi Rehan, On Jan 11, 2013, at 5:49 PM, Rehan Malek <rehan_malek75@...> wrote: [snip] ... The status pipe (FetchPipe.getStatusTailPipe()) has the status...
1298
Rehan Malek
rehan_malek75
Jan 17, 2013 1:11 pm
Thank you, Vivek! I am still unable to get all the URLs associated with the fetched pages. Could you please provide a Cascading workflow for getting all the URLs? ...
1299
rehan_malek75
Jan 18, 2013 8:05 am
Hi all, how can I modify DemoCrawlWorkflow to get the URLs of all fetched pages? Please explain it in detail....
1300
Vivek Magotra
vmagotra
Jan 19, 2013 2:11 am
Hi Rehan, ... To get all the URLs of the fetched pages for the current loop, here's what I would do: In the createFlow() method, after you get the statusPipe,...
1301
rehan_malek75
Jan 21, 2013 8:33 am
Thanks for the response. I am working on this...
1302
rehan_malek75
Jan 21, 2013 8:33 am
Hi all, I am currently facing a problem with the status sub-folder inside the output directory. I am unable to view the status sub-folder. As such, by default its...
1303
Pat Ferrel
sillyaliases...
Jan 24, 2013 4:01 am
I think Vivek added all of the NotSoSimpleCrawlTool to Bixo's DemoCrawlTool. It produces the same Hadoop sequence file in each loop dir. I wrote another...
1304
Ken Krugler
kkrugler
Feb 3, 2013 3:05 am
Hi all, Just a heads-up that Lewis McGibbney has just released 0.2 of the crawler-commons library. The next release of Bixo will use this jar, since it...
1305
markatasu
Feb 21, 2013 5:53 am
Hi Everyone, I'm working with an early-stage well funded stealth mode start-up in the big data analytics space – creating a unified platform that collects,...
1306
jeffjeffrsn
Apr 2, 2013 12:58 pm
Hi, In the DemoCrawlTool I added a new Pipe to the tail of the parsePipe. In it I use the parsed content and the URL. Now I also need the original...
1307
Chris Schneider
schmed2000
Apr 2, 2013 2:12 pm
Hi Jeff, I am not sure what you meant when you wrote "added a new Pipe to the tail of the parsePipe". If you did add a tail pipe containing only...
1308
jeffjeffrsn
Apr 2, 2013 5:02 pm
Hi Chris, Thanks for the answer. Now I'm subclassing the baseparser. Thanks, - Jeff...
1309
jeffjeffrsn
Apr 2, 2013 5:22 pm
Hi Everyone, I noticed that the demo crawler stays within one domain. ... I've got the domain example.com. At this domain there are outlinks to test.example.com,...
1310
Ken Krugler
kkrugler
Apr 4, 2013 11:36 pm
Hi Vivek, I was looking at the DemoCrawlWorkflow source, and noticed this snippet: Pipe urlFromOutlinksPipe = new Pipe("url from outlinks",...
1311
Ken Krugler
kkrugler
Apr 4, 2013 11:41 pm
Hi Jeff, ... By default if you provide a -domain parameter, then URL filtering is set up such that only URLs for that domain are accepted (all other URLs are...
1312
Pat Ferrel
sillyaliases...
May 9, 2013 5:31 pm
It's been a while since I did a new build of Bixo. For some reason, though I haven't changed the code, I'm getting all sorts of test errors. I was getting an...
1313
Pat Ferrel
reallyreally...
May 9, 2013 6:00 pm
Hmm, comment out the test and it completes without errors. Maybe OpenDNS is the problem? On May 9, 2013, at 10:31 AM, Pat Ferrel <pat.ferrel@...> wrote: ...
1314
Pat Ferrel
reallyreally...
May 17, 2013 2:00 pm
Hi guys, I'm back to crawling Pinterest to update my experimental recommender. I created a merged miner/crawler, which was working fine if slowly. I added an...
1315
Ken Krugler
kkrugler
May 17, 2013 10:00 pm
Hi Pat, ... It will try to fetch every URL, but it will only make one HttpClient request for each URL. HttpClient will retry multiple times, and if the server...