bixo-dev · Bixo Web Mining Toolkit

Group Information

  • Members: 90
  • Category: Open Source
  • Founded: May 17, 2009
  • Language: English

Description

Bixo is an open source vertical web crawler toolkit. It consists of an easily customizable set of components that can be piped together via Cascading to create complex workflows.

The primary use case for Bixo is data mining the web, where the scope of pages to be fetched and processed is 100K to 100M.

More sources of information about Bixo include:

Most Recent Messages

Re: restarting a crawl and adding to a previous crawl
When the crawler loops, it fetches every URL found in the previous loop, right? So the crawl time will likely increase exponentially with each loop, right? So
Posted - Sat May 26, 2012 4:54 pm
Pat Ferrel
reallyreally...
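
The question above, whether each loop fetches everything discovered in the previous one, can be illustrated with a toy model. This is only a sketch and not Bixo code: `crawl_loops`, `binary_tree_links`, and the in-memory `crawl_db` set are hypothetical names, and the link graph is a plain dictionary rather than real fetched pages. It assumes each loop fetches only URLs that are not already in the crawl DB.

```python
def crawl_loops(links, seeds, max_loops):
    """Return how many URLs are fetched in each loop of a toy crawl.

    links: dict mapping a URL to the list of URLs it links to (hypothetical data).
    seeds: URLs fetched in the first loop.
    """
    crawl_db = set(seeds)   # every URL seen so far
    frontier = set(seeds)   # URLs to fetch in the current loop
    fetched_per_loop = []
    for _ in range(max_loops):
        if not frontier:
            break
        fetched_per_loop.append(len(frontier))
        discovered = set()
        for url in frontier:
            discovered.update(links.get(url, []))
        # Only URLs not already in the crawl DB are fetched next loop.
        frontier = discovered - crawl_db
        crawl_db |= frontier
    return fetched_per_loop


def binary_tree_links(depth):
    """Toy graph in which every page links to two pages not seen before."""
    return {str(i): [str(2 * i), str(2 * i + 1)] for i in range(1, 2 ** depth)}


print(crawl_loops(binary_tree_links(4), ["1"], max_loops=10))
```

In this model the per-loop fetch count doubles while every page yields two unseen links, and it drops to zero once no new URLs are discovered. A real crawl falls somewhere in between, depending on how many outlinks are already known.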
Re: restarting a crawl and adding to a previous crawl
... The most recent loop dir has an up-to-date snapshot of the crawlDB, which is regenerated after each loop. ... Yes, exactly. You've hit upon a fundamental
Posted - Fri May 25, 2012 7:24 pm
Ken Krugler
kkrugler
Re: restarting a crawl and adding to a previous crawl
OK, so if I understand correctly every time I restart a crawl on an existing one, it will extend the original crawl with newly fetched data. It never recrawls
Posted - Fri May 25, 2012 7:07 pm
Pat Ferrel
reallyreally...
Re: restarting a crawl and adding to a previous crawl
Hi Pat, See below. But in general the SimpleCrawlTool is a demo of Bixo, not a complete crawler, thus much of the functionality you're asking about is missing.
Posted - Fri May 25, 2012 5:24 pm
Ken Krugler
kkrugler
restarting a crawl and adding to a previous crawl
If, for some reason, a crawl fails to finish properly, what is the recommended way to restart it where it left off, or somewhere close? I tried deleting what
Posted - Fri May 25, 2012 5:15 pm
Pat Ferrel
reallyreally...

Message History

      Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2012   29   30   15   50   19
2011   21   18   38   17    6    6    4    1   24    4    7   58
2010   47    7    2    5   16   42   54   94    7   10   45   29
2009                       39   12   18   19    6   57   28  125