bixo-dev · Bixo Web Mining Toolkit

Group Information

  • Members: 113
  • Category: Open Source
  • Founded: May 17, 2009
  • Language: English

Messages

Messages 697 - 726 of 1321
697 Natarajan
natarajansr_mdu Send Email
Feb 26, 2011
3:05 am
I am highly impressed with this product. I downloaded and installed it on my system. I am able to run the Simple Crawler as documented in the Getting...
698 Vivek Magotra
vmagotra Send Email
Feb 26, 2011
11:36 pm
Hi Natarajan, Thanks :-) Bixo is a toolkit and not a product so there are no predefined inputs and outputs. The general usage pattern involves writing...
699 Ken Krugler
kkrugler Send Email
Feb 27, 2011
5:39 pm
Hi Natarajan, ... Something else you should look at is the "helpful" example in Bixo. This shows how to use Bixo to extract links to mailing list archives from...
700 Natarajan
natarajansr_mdu Send Email
Feb 28, 2011
7:51 am
Thanks Ken and Vivek for your replies. Now I am able to run Bixo from Eclipse and crawl cnn.com. I have generated the API documentation using javadoc and...
701 Ken Krugler
kkrugler Send Email
Feb 28, 2011
5:30 pm
... Yes ... It is used to run the crawling & parsing workflows. The Hadoop file system (HDFS) can be used to save intermediate and final results. ... Cascading...
702 Natarajan
natarajansr_mdu Send Email
Mar 1, 2011
8:33 am
Thanks Ken for correcting the mistakes in my comments. All this information is very useful in developing an application using the Bixo Toolkit for Web Data...
703 victorprietoalvarez
victorprieto... Send Email
Mar 4, 2011
12:47 pm
Hello again, In the last few weeks, I have been thoroughly reading the documentation of Hadoop, Cascading and Bixo. I've also set up a small cluster in Nutch...
704 Chris Schneider
schmed2000 Send Email
Mar 4, 2011
5:24 pm
Hi Victor, ... Yes, Bixo in its current form is dependent on both Cascading and Hadoop. ... Bixo has only a handful of direct dependencies on the Hadoop...
705 Ken Krugler
kkrugler Send Email
Mar 4, 2011
7:31 pm
I saw this post on the Nutch list, and thought that we should verify our handling of URLs (or rather, what Tika does) is correct in regards to resolving...
706 Ken Krugler
kkrugler Send Email
Mar 4, 2011
7:32 pm
I just posted this to the Nutch list, but it seems likely to be of interest to Bixo users as well... -- Ken ...
707 Ken Krugler
kkrugler Send Email
Mar 4, 2011
11:00 pm
Hi Otis, ... Well, it's a toolkit - so hooking it up for a wide/big crawl means writing some code, but there's nothing in the Bixo architecture (after a few...
708 Ken Krugler
kkrugler Send Email
Mar 4, 2011
11:13 pm
Hi Otis, More input, though mostly from recent experience w/Bixo... ... We fire up nscd on every server in the cluster - check out the Bixo remote-init.sh...
709 Fuad Efendi
fouad_efendi Send Email
Mar 5, 2011
3:53 am
... Frankly, I believe Keep-Alive won't improve anything for the Robot-side, and the Robot should use specific request headers when trying to "keep-alive": Connection:...
710 Ted Dunning
ted.dunning@... Send Email
Mar 5, 2011
5:51 am
Bixo does this as much as anything to be polite to the web-site. The theory is that a bunch of requests on the same connection consumes less server resources...
711 Fuad Efendi
fouad_efendi Send Email
Mar 5, 2011
6:09 am
I agree only partially… “less server resources”… I think it is more appropriate to say that resources are the same but resource allocation/deallocation may...
712 Ken Krugler
kkrugler Send Email
Mar 5, 2011
11:03 pm
Just to clarify... ... Bixo doesn't use keep-alive when fetching robots.txt - not sure where that came from. ... There's a cost to create a connection to a...
713 Fuad Efendi
fouad_efendi Send Email
Mar 5, 2011
11:50 pm
To clarify: "Robot-side" and "Server-side", like Client-Server. There is BIXO (Robot, or "Robot-side"), and (for instance) Apache HTTPD server somewhere...
714 Fuad Efendi
fouad_efendi Send Email
Mar 5, 2011
11:56 pm
Sorry for some typos in my previous email; I modified: From: bixo-dev@yahoogroups.com [mailto:bixo-dev@yahoogroups.com] On Behalf Of Fuad Efendi Sent: March-05-11...
715 Fuad Efendi
fouad_efendi Send Email
Mar 6, 2011
12:12 am
Just to clarify my vision: 1. Keep-Alive vs. Not-Keep-Alive: it is 0.1 - 0.01 milliseconds difference in CPU time, plus 50 - 100 milliseconds difference...
716 Ted Dunning
ted.dunning@... Send Email
Mar 6, 2011
12:16 am
Actually, as a point of information, most high end web-sites sprite their images from a single large image. This decreases the number of fetches for a...
717 Fuad Efendi
fouad_efendi Send Email
Mar 6, 2011
1:47 am
... - Yes I know that; there are even some tools available which make this possible; it’s called “CSS Sprites”...
718 Fuad Efendi
fouad_efendi Send Email
Mar 6, 2011
2:06 am
I was wrong, I checked RFC 2616: Keep-Alive relates to persistent connections, pipelining requests (several requests sent via a single TCP connection without waiting for...
719 Fuad Efendi
fouad_efendi Send Email
Mar 6, 2011
3:34 am
And subsequently we have a new scenario: Can BIXO send several HTTP requests via a single connection without waiting for a response? (this would be truly...
720 victorprietoalvarez
victorprieto... Send Email
Mar 7, 2011
5:39 pm
Thanks for the answer. I had already checked out most of the webpages you have pointed out in your email, but I will go through them again. Anyway, after...
721 Chris Schneider
schmed2000 Send Email
Mar 9, 2011
12:34 am
Hi Victor, ... I'm afraid that your questions suggest a need for understanding the Hadoop architecture in much more detail than you currently do. It's also...
722 ivan.panachev Send Email
Mar 15, 2011
1:46 am
Hi Ken, ... [...] ... May I ask how did you do the following things: 1) classification of site size (how do you determine if site is big or small?) 2) having...
723 Ken Krugler
kkrugler Send Email
Mar 15, 2011
4:08 am
... We base it on their US traffic rank. ... No, we change the policy used during fetching to incorporate this traffic rank data. -- Ken ... Ken Krugler +1...
724 ivan.panachev Send Email
Mar 15, 2011
6:11 pm
Hi Ken, ... [...] ... Looks like a great idea! ... Could you give a bit more detail? It looks like you modified prefetchPipe in the FetchPipe constructor, but...
725 Chris Schneider
schmed2000 Send Email
Mar 16, 2011
12:41 am
Hi Ivan, ... Yes, we're currently using an "average" of the Alexa and Quantcast traffic rank for each domain on our white list. ... This area of Bixo could...
726 ivan.panachev Send Email
Mar 16, 2011
10:14 am
Hi Chris, ... [...] ... Thank you, that's precisely what I asked!...
