Hi Nilay, ... I have tried to answer your latest set of questions below. I have now exhausted the time I have available to help you, at least until you are...
1296
Rehan Malek
rehan_malek75
Jan 11, 2013 8:49 am
Hi Vivek :) Thank you for your quick response. I have just gone through the Cascading documents, but what should be done to get only the URLs for all fetched pages? And I...
1297
Vivek Magotra
vmagotra
Jan 13, 2013 7:44 pm
Hi Rehan, On Jan 11, 2013, at 5:49 PM, Rehan Malek <rehan_malek75@...> wrote: [snip] ... The status pipe (FetchPipe.getStatusTailPipe()) has the status...
1298
Rehan Malek
rehan_malek75
Jan 17, 2013 1:11 pm
Thank you, Vivek! I am still unable to get all the URLs associated with fetched pages. Could you please provide a Cascading workflow for getting all URLs? ...
1299
rehan_malek75
Jan 18, 2013 8:05 am
Hi all, How can I modify DemoCrawlWorkflow to get the URLs of all fetched pages? Please explain it in detail...
1300
Vivek Magotra
vmagotra
Jan 19, 2013 2:11 am
Hi Rehan, ... To get all the URLs of the fetched pages for the current loop, here's what I would do: In the createFlow() method, after you get the statusPipe,...
1301
rehan_malek75
Jan 21, 2013 8:33 am
Thanks for the response. I am working on this...
1302
rehan_malek75
Jan 21, 2013 8:33 am
Hi all, I am currently facing a problem with the status sub-folder inside the output directory. I am unable to view the status sub-folder. As such, by default its...
1303
Pat Ferrel
sillyaliases...
Jan 24, 2013 4:01 am
I think Vivek added all of the NotSoSimpleCrawlTool to Bixo's DemoCrawlTool. It produces the same Hadoop sequence file in each loop dir. I wrote another...
1304
Ken Krugler
kkrugler
Feb 3, 2013 3:05 am
Hi all, Just a heads-up that Lewis McGibbney has just released 0.2 of the crawler-commons library. The next release of Bixo will use this jar, since it...
1305
markatasu
Feb 21, 2013 5:53 am
Hi Everyone, I'm working with an early-stage well funded stealth mode start-up in the big data analytics space – creating a unified platform that collects,...
1306
jeffjeffrsn
Apr 2, 2013 12:58 pm
Hi, In the DemoCrawlTool I added a new Pipe to the tail of the parsePipe. In it I use the parsed content and the URL. Now I also need the original...
1307
Chris Schneider
schmed2000
Apr 2, 2013 2:12 pm
Hi Jeff, I am not sure what you meant when you wrote "added a new Pipe to the tail of the parsePipe". If you did add a tail pipe containing only...
1308
jeffjeffrsn
Apr 2, 2013 5:02 pm
Hi Chris, Thanks for the answer. Now I'm subclassing the baseparser. Thanks, - Jeff...
1309
jeffjeffrsn
Apr 2, 2013 5:22 pm
Hi Everyone, I noticed that the demo crawler stays at one domain. ... I've got the domain example.com. At this domain there are outlinks to test.example.com,...
1310
Ken Krugler
kkrugler
Apr 4, 2013 11:36 pm
Hi Vivek, I was looking at the DemoCrawlWorkflow source, and noticed this snippet: Pipe urlFromOutlinksPipe = new Pipe("url from outlinks",...
1311
Ken Krugler
kkrugler
Apr 4, 2013 11:41 pm
Hi Jeff, ... By default if you provide a -domain parameter, then URL filtering is set up such that only URLs for that domain are accepted (all other URLs are...
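The domain restriction described in this reply can be illustrated with a minimal sketch. This is not Bixo's actual filter: the class and method names below are hypothetical, and treating subdomains such as test.example.com as part of example.com is an assumption made for illustration only, since the snippet does not say how Bixo handles them.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration of domain-based URL filtering; not part of Bixo.
public class DomainFilterSketch {

    // Returns true if the URL's host is the given domain or one of its subdomains.
    // Accepting subdomains is an assumption; a stricter filter would use equals() only.
    static boolean isInDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false; // opaque or relative URIs have no host
            }
            host = host.toLowerCase(Locale.ROOT);
            String d = domain.toLowerCase(Locale.ROOT);
            return host.equals(d) || host.endsWith("." + d);
        } catch (URISyntaxException e) {
            return false; // malformed URLs are rejected
        }
    }

    public static void main(String[] args) {
        System.out.println(isInDomain("http://example.com/a", "example.com"));         // true
        System.out.println(isInDomain("http://test.example.com/page", "example.com")); // true
        System.out.println(isInDomain("http://notexample.com/", "example.com"));       // false
    }
}
```

Matching on "." plus the domain, rather than a plain suffix check, keeps hosts like notexample.com from being accepted by accident.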
1312
Pat Ferrel
sillyaliases...
May 9, 2013 5:31 pm
It's been a while since I did a new build of Bixo. For some reason, though I haven't changed the code, I'm getting all sorts of test errors. I was getting an...
1313
Pat Ferrel
reallyreally...
May 9, 2013 6:00 pm
Hmm, comment out the test and it completes without errors. Maybe OpenDNS is the problem? On May 9, 2013, at 10:31 AM, Pat Ferrel <pat.ferrel@...> wrote: ...
1314
Pat Ferrel
reallyreally...
May 17, 2013 2:00 pm
Hi guys, I'm back to crawling Pinterest to update my experimental recommender. I created a merged miner/crawler, which was working fine if slowly. I added an...
1315
Ken Krugler
kkrugler
May 17, 2013 10:00 pm
Hi Pat, ... It will try to fetch every URL, but it will only make one HttpClient request for each URL. HttpClient will retry multiple times, and if the server...
1316
Pat Ferrel
reallyreally...
May 21, 2013 11:58 pm
Thanks. With the below settings I also changed the retry to 0 and the log level to trace. It looks like perfectly good urls are not getting fetched. Thinking I...
1317
Ken Krugler
kkrugler
May 22, 2013 1:53 pm
Hi Pat, ... In bixo there's a FetchAndParseTool that you can use to fetch individual URLs. I'd try that, as another way to test. Some random ideas… ...
1318
Pat Ferrel
sillyaliases...
May 22, 2013 3:06 pm
Again, thanks. You are the only one I know with crawler-fu skills. I'll take a look at the headers in the simple fetcher. I suspect that Pinterest is not...
1319
Ken Krugler
kkrugler
May 22, 2013 4:35 pm
... One other thought - we do "batch" fetching of URLs, using keep-alive to optimize the connection that we create with the server. Pinterest might not like...
1320
Pat Ferrel
sillyaliases...
May 22, 2013 4:55 pm
The miner is getting urls to 404s. So Pinterest is allowing removal of pages but leaving the links to those pages around. If leaving them in the crawldb marked...
1321
Ken Krugler
kkrugler
May 22, 2013 5:10 pm
... A 404 should result in UrlStatus.HTTP_NOT_FOUND. What you do with those entries in the crawlDB is up to your processing code. In the DemoCrawlTool, the...
1323
dennis.buroh
Jun 14, 2013 2:14 pm
Hello, A short question: How does the FixedScoreGenerator class work, or where can I find the code for the class? Thank you! Best regards, Dennis...
1324
Ken Krugler
kkrugler
Jun 14, 2013 9:15 pm
Hi Dennis, ... I'm not sure what exactly you're asking, but the code is in bixo.operations.FixedScoreGenerator, and looks like: public class...
1325
dennis.buroh
Jun 15, 2013 4:16 pm
Thank you very much. That's exactly what I wanted....