I am highly impressed with this product. I downloaded and installed it on my system. I am able to run the Simple Crawler as documented in the Getting...
698
Vivek Magotra
vmagotra
Feb 26, 2011 11:36 pm
Hi Natarajan, Thanks :-) Bixo is a toolkit and not a product so there are no predefined inputs and outputs. The general usage pattern involves writing...
699
Ken Krugler
kkrugler
Feb 27, 2011 5:39 pm
Hi Natarajan, ... Something else you should look at is the "helpful" example in Bixo. This shows how to use Bixo to extract links to mailing list archives from...
700
Natarajan
natarajansr_mdu
Feb 28, 2011 7:51 am
Thanks Ken and Vivek for your replies. Now I am able to run Bixo from Eclipse and crawl cnn.com. I have generated API documentation using javadoc and...
701
Ken Krugler
kkrugler
Feb 28, 2011 5:30 pm
... Yes ... It is used to run the crawling & parsing workflows. The Hadoop file system (HDFS) can be used to save intermediate and final results. ... Cascading...
702
Natarajan
natarajansr_mdu
Mar 1, 2011 8:33 am
Thanks Ken for correcting the mistakes in my comments. All this information is very useful in developing an application using the Bixo Toolkit for Web Data...
703
victorprietoalvarez
victorprieto...
Mar 4, 2011 12:47 pm
Hello again, In the last few weeks, I have been thoroughly reading the documentation of Hadoop, Cascading and Bixo. I've also set up a small cluster in Nutch...
704
Chris Schneider
schmed2000
Mar 4, 2011 5:24 pm
Hi Victor, ... Yes, Bixo in its current form is dependent on both Cascading and Hadoop. ... Bixo has only a handful of direct dependencies on the Hadoop...
705
Ken Krugler
kkrugler
Mar 4, 2011 7:31 pm
I saw this post on the Nutch list, and thought that we should verify our handling of URLs (or rather, what Tika does) is correct in regards to resolving...
706
Ken Krugler
kkrugler
Mar 4, 2011 7:32 pm
I just posted this to the Nutch list, but it seems likely to be of interest to Bixo users as well... -- Ken ...
707
Ken Krugler
kkrugler
Mar 4, 2011 11:00 pm
Hi Otis, ... Well, it's a toolkit - so hooking it up for a wide/big crawl means writing some code, but there's nothing in the Bixo architecture (after a few...
708
Ken Krugler
kkrugler
Mar 4, 2011 11:13 pm
Hi Otis, More input, though mostly from recent experience w/Bixo... ... We fire up nscd on every server in the cluster - check out the Bixo remote-init.sh...
709
Fuad Efendi
fouad_efendi
Mar 5, 2011 3:53 am
... Frankly, I believe Keep-Alive won't improve anything on the robot side. And the robot should use specific request headers when trying to "keep-alive": Connection:...
710
Ted Dunning
ted.dunning@...
Mar 5, 2011 5:51 am
Bixo does this as much as anything to be polite to the web-site. The theory is that a bunch of requests on the same connection consumes less server resources...
711
Fuad Efendi
fouad_efendi
Mar 5, 2011 6:09 am
I agree only partially… “less server resources”… I think it is more appropriate to say that the resources are the same but resource allocation/deallocation may...
712
Ken Krugler
kkrugler
Mar 5, 2011 11:03 pm
Just to clarify... ... Bixo doesn't use keep-alive when fetching robots.txt - not sure where that came from. ... There's a cost to create a connection to a...
713
Fuad Efendi
fouad_efendi
Mar 5, 2011 11:50 pm
To clarify: "Robot-side" and "Server-side", like Client-Server. There is BIXO (Robot, or "Robot-side"), and (for instance) Apache HTTPD server somewhere...
714
Fuad Efendi
fouad_efendi
Mar 5, 2011 11:56 pm
Sorry for a typo in my previous email; I modified it: From: bixo-dev@yahoogroups.com [mailto:bixo-dev@yahoogroups.com] On Behalf Of Fuad Efendi Sent: March-05-11...
715
Fuad Efendi
fouad_efendi
Mar 6, 2011 12:12 am
Just to clarify my view: 1. Keep-Alive vs. no Keep-Alive: it is a 0.1 - 0.01 millisecond difference in CPU time, plus a 50 - 100 millisecond difference...
716
Ted Dunning
ted.dunning@...
Mar 6, 2011 12:16 am
Actually, as a point of information, most high end web-sites sprite their images from a single large image. This decreases the number of fetches for a...
717
Fuad Efendi
fouad_efendi
Mar 6, 2011 1:47 am
... - Yes I know that; there are even some tools available which make this possible; it’s called “CSS Sprites”...
718
Fuad Efendi
fouad_efendi
Mar 6, 2011 2:06 am
I was wrong, I checked RFC 2616: Keep-Alive relates to persistent connections, pipelining requests (several requests sent via a single TCP connection without waiting for...
719
Fuad Efendi
fouad_efendi
Mar 6, 2011 3:34 am
And subsequently we have a new scenario: Can BIXO send several HTTP requests via a single connection without waiting for a response? (this would be truly...
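The keep-alive discussion above turns on reusing one TCP connection for several requests. The following is a minimal sketch of that idea only, in Python rather than Bixo's Java, and it is not how Bixo itself fetches pages. It starts a throwaway local server and sends three sequential requests over one `http.client.HTTPConnection`; each request waits for its response, so this is connection reuse, not pipelining. The names `RecordingHandler`, `fetch_paths`, and `demo` are hypothetical and exist only for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler that records the client port of every request."""

    # HTTP/1.1 makes the standard-library server keep connections open by default.
    protocol_version = "HTTP/1.1"
    client_ports = []

    def do_GET(self):
        RecordingHandler.client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        # A Content-Length header lets the client know where the body ends,
        # which is what allows the same connection to carry the next request.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def fetch_paths(host, port, paths):
    """Fetch every path sequentially over a single persistent connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body before sending the next request
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses


def demo():
    """Return the response statuses and the number of distinct client ports seen."""
    RecordingHandler.client_ports.clear()
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        statuses = fetch_paths("127.0.0.1", server.server_address[1], ["/a", "/b", "/c"])
    finally:
        server.shutdown()
        server.server_close()
    return statuses, len(set(RecordingHandler.client_ports))
```

If the connection is reused, the server sees the same client port for all three requests, so `demo()` reports one distinct port. Sending requests without waiting for responses (true pipelining) is not supported by `http.client` and is not shown here.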
720
victorprietoalvarez
victorprieto...
Mar 7, 2011 5:39 pm
Thanks for the answer. I had already checked out most of the webpages you have pointed out in your email, but I will go through them again. Anyway, after...
721
Chris Schneider
schmed2000
Mar 9, 2011 12:34 am
Hi Victor, ... I'm afraid that your questions suggest a need for understanding the Hadoop architecture in much more detail than you currently do. It's also...
722
ivan.panachev
Mar 15, 2011 1:46 am
Hi Ken, ... [...] ... May I ask how you did the following things: 1) classification of site size (how do you determine if a site is big or small?) 2) having...
723
Ken Krugler
kkrugler
Mar 15, 2011 4:08 am
... We base it on their US traffic rank. ... No, we change the policy used during fetching to incorporate this traffic rank data. -- Ken ... Ken Krugler +1...
724
ivan.panachev
Mar 15, 2011 6:11 pm
Hi Ken, ... [...] ... Looks like a great idea! ... Could you give a bit more detail? It looks like you modified prefetchPipe in the FetchPipe constructor, but...
725
Chris Schneider
schmed2000
Mar 16, 2011 12:41 am
Hi Ivan, ... Yes, we're currently using an "average" of the Alexa and Quantcast traffic rank for each domain on our white list. ... This area of Bixo could...
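The replies above describe classifying sites by an "average" of two traffic ranks. The thread does not say how that average is computed or where the size boundaries lie, so the following Python sketch is only one possible reading: a plain arithmetic mean that falls back to whichever rank is available, and made-up thresholds. The function names `combined_rank` and `site_size` and the threshold values are all assumptions for illustration, not Bixo code.

```python
from typing import Optional


def combined_rank(alexa_rank: Optional[int], quantcast_rank: Optional[int]) -> Optional[float]:
    """Arithmetic mean of the available ranks; None if neither rank is known.

    Assumption: a plain mean. The thread puts "average" in quotes, so the
    real combination may well be something else.
    """
    ranks = [r for r in (alexa_rank, quantcast_rank) if r is not None]
    if not ranks:
        return None
    return sum(ranks) / len(ranks)


def site_size(rank: Optional[float],
              large_cutoff: float = 1_000,
              medium_cutoff: float = 100_000) -> str:
    """Map a traffic rank to a size label (lower rank means more traffic).

    The cutoffs are hypothetical placeholders, not values from the thread.
    """
    if rank is None:
        return "unknown"
    if rank <= large_cutoff:
        return "large"
    if rank <= medium_cutoff:
        return "medium"
    return "small"
```

A fetch policy could then look up `site_size(combined_rank(alexa, quantcast))` for each domain and pick a per-domain fetch budget from the label; how Bixo's own policy uses the rank is not described in the snippets above.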
726
ivan.panachev
Mar 16, 2011 10:14 am
Hi Chris, ... [...] ... Thank you, that's precisely what I asked!...