Group Information
- Members: 32
- Category: Open Source
- Founded: May 17, 2009
- Language: English
Description
Bixo is an open source vertical web crawler toolkit. It consists of an easily customizable set of components that can be piped together via Cascading to create complex workflows.
The primary use case for Bixo is data mining the web, where the scope of pages to be fetched and processed is 100K to 100M.
More sources of information about Bixo include:
|
Re: New domain name
Hi Ken, personally I would prefer bixominer.org because it sounds good and tells what it does. Bruno.
Posted - Wed Dec 23, 2009 8:37 am
|
bruno_abitbol
|
New domain name
We're trying to pick a domain name to use for the Bixo project. Current options are:
* bixo-project.org
* openbixo.org
* bixominer.org
Any input and/or
Posted - Wed Dec 23, 2009 12:07 am
|
Ken Krugler
kkrugler
|
Re: Beginner Question
Hi Bruno, ... You could use the LoadUrlsFunction with an Each() operator to import the URLs from a text file. For example, here's some code from a white-list
Posted - Tue Dec 22, 2009 3:39 pm
|
Ken Krugler
kkrugler
|
Re: Beginner Question
Hi Ken, thank you for your quick response. ... How can I inject the 50 URLs in the crawler? ... 10000 to 50000 URLs per domain so let's say a total of 1 500
Posted - Tue Dec 22, 2009 3:17 pm
|
bruno_abitbol
|
Re: Beginner Question
Hi Bruno, ... If you have URLs from 50 domains, then if you have 50 threads (typically one reducer would be enough) you'll be crawling 50 simultaneously.
Posted - Tue Dec 22, 2009 2:44 pm
|
Ken Krugler
kkrugler
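To illustrate the point in Ken's reply above (URLs from 50 domains and 50 threads means 50 domains crawled at once), here is a minimal Python sketch of per-domain concurrency. It is not Bixo's or Cascading's API. The names `group_by_domain`, `crawl_domain`, `crawl`, and `fake_fetch` are hypothetical, and the one-thread-per-domain scheduling is an assumption made for illustration only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def group_by_domain(urls):
    """Bucket URLs by host name so each host gets its own fetch queue."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc.lower()].append(url)
    return dict(groups)


def crawl_domain(domain, urls, fetch):
    """Fetch one domain's URLs sequentially; return (domain, results)."""
    return domain, [fetch(url) for url in urls]


def crawl(urls, fetch, max_threads=50):
    """Crawl up to max_threads domains at once, one thread per domain."""
    groups = group_by_domain(urls)
    if not groups:
        return {}
    # Never start more threads than there are domains to crawl.
    workers = min(max_threads, len(groups))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(crawl_domain, domain, domain_urls, fetch)
            for domain, domain_urls in groups.items()
        ]
        return dict(future.result() for future in futures)


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP fetch; returns the URL length."""
    return len(url)
```

Calling `crawl(urls, fake_fetch)` returns a dict keyed by domain. With 50 domains and `max_threads=50`, every domain is worked on concurrently, while URLs within a single domain are still fetched one after another.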
|
Group Email Addresses
- Related Link: http://bixo.101tec.com
- Post message: bixo-dev@yahoogroups.com
- Subscribe: bixo-dev-subscribe@yahoogroups.com
- Unsubscribe: bixo-dev-unsubscribe@yahoogroups.com
- List owner: bixo-dev-owner@yahoogroups.com