Message #1027 of 6148
RE: [archive-crawler] Large experimental crawl

Hi Sebastian,

 

The main problem with running broad crawls (over a larger number of hosts) lies in the memory overhead associated with each active host.

 

That is, each host that is being crawled requires memory, both for host-specific information and for the ‘top’ of the per-host queue of waiting URIs.
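
To make that concrete, here is a back-of-the-envelope sketch in Python (not Heritrix code). The per-URI and per-host byte counts are pure assumptions chosen for illustration, not measured Heritrix figures; the point is only that the total scales with the number of active hosts times the number of URIs each one keeps in memory.

```python
def estimate_frontier_memory(active_hosts, uris_in_memory_per_host,
                             bytes_per_uri=500, bytes_per_host=2000):
    """Rough frontier memory estimate in bytes.

    bytes_per_uri and bytes_per_host are illustrative guesses,
    not measured Heritrix numbers.
    """
    # Each active host pays a fixed cost plus the in-memory 'top' of its queue.
    per_host = bytes_per_host + uris_in_memory_per_host * bytes_per_uri
    return active_hosts * per_host
```

Calling, say, `estimate_frontier_memory(10_000, 200)` and then the same with 20 URIs per host shows how strongly the in-memory queue size dominates under these assumed sizes.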

 

One of the best ways to combat this is to enable site-first, which limits the number of active hosts to the bare minimum needed to keep the crawler fully occupied. Judging from your post, I’m unsure whether that is acceptable to you, since it favors certain hosts until they are exhausted before moving on to others.
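
The idea can be sketched as a toy frontier in Python. This is not how Heritrix implements it; the class name `SiteFirstFrontier` and its methods are hypothetical, and politeness delays and threading are left out entirely. It only shows the behavior described above: a bounded set of active hosts, with new hosts waiting until an active one is exhausted.

```python
from collections import OrderedDict, deque


class SiteFirstFrontier:
    """Toy frontier that only draws URIs from a bounded set of active hosts."""

    def __init__(self, max_active_hosts):
        self.max_active_hosts = max_active_hosts
        self.active = OrderedDict()    # host -> deque of URIs being crawled
        self.waiting = OrderedDict()   # host -> deque of URIs, not yet active

    def schedule(self, host, uri):
        if host in self.active:
            self.active[host].append(uri)
        elif host in self.waiting:
            self.waiting[host].append(uri)
        elif len(self.active) < self.max_active_hosts:
            self.active[host] = deque([uri])
        else:
            self.waiting[host] = deque([uri])

    def next_uri(self):
        """Return (host, uri) from an active host, or None when nothing is left."""
        while self.active or self.waiting:
            if not self.active:
                self._promote()
            host, queue = next(iter(self.active.items()))
            if queue:
                uri = queue.popleft()
                self.active.move_to_end(host)  # round-robin among active hosts
                return host, uri
            # Host exhausted: retire it and let a waiting host take its slot.
            del self.active[host]
            self._promote()
        return None

    def _promote(self):
        if self.waiting and len(self.active) < self.max_active_hosts:
            host, queue = self.waiting.popitem(last=False)
            self.active[host] = queue
```

With `max_active_hosts=2` and three hosts scheduled, the third host is not touched until one of the first two runs dry, which is exactly the bias toward certain hosts mentioned above.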

 

Another way to reduce memory use is to limit the number of URIs kept in memory for each host. This is done by setting “host-queues-memory-capacity” to a smaller value. The default is 200, which is too much for a true broad crawl. I’d suggest something in the range of 20-50. A low value will incur more disk access but save memory.
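
The trade-off can be illustrated with a small Python sketch of a per-host queue that keeps only a bounded head in memory and spills the rest to a file. This is an illustration of the general technique, not Heritrix's actual queue code; `SpillingHostQueue` and `memory_capacity` are hypothetical names, and the spill file is never compacted here.

```python
import os
import tempfile
from collections import deque


class SpillingHostQueue:
    """FIFO queue keeping at most `memory_capacity` URIs in memory."""

    def __init__(self, memory_capacity, spill_dir=None):
        self.memory_capacity = memory_capacity
        self.head = deque()        # in-memory 'top' of the queue
        fd, self.spill_path = tempfile.mkstemp(suffix=".uris", dir=spill_dir)
        os.close(fd)
        self.read_offset = 0       # next unread byte in the spill file
        self.spilled = 0           # URIs still waiting on disk

    def put(self, uri):
        # Once anything is on disk, new URIs go behind it to keep FIFO order.
        if self.spilled == 0 and len(self.head) < self.memory_capacity:
            self.head.append(uri)
        else:
            with open(self.spill_path, "ab") as f:
                f.write(uri.encode("utf-8") + b"\n")
            self.spilled += 1

    def get(self):
        """Return the next URI, or None if the queue is empty."""
        if not self.head:
            self._refill()
        return self.head.popleft() if self.head else None

    def _refill(self):
        # Every refill is a disk read; a smaller capacity means more of them.
        with open(self.spill_path, "rb") as f:
            f.seek(self.read_offset)
            while self.spilled and len(self.head) < self.memory_capacity:
                line = f.readline()
                self.head.append(line.decode("utf-8").rstrip("\n"))
                self.spilled -= 1
            self.read_offset = f.tell()

    def close(self):
        os.remove(self.spill_path)
```

A capacity of 20 instead of 200 keeps a tenth as many URIs in memory per host, but `_refill` then runs roughly ten times as often for the same number of URIs.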

 

Other than site-first and host-queues-memory-capacity, there isn’t all that much you can do to limit memory use by configuring Heritrix. I know the guys at the Archive are looking at some ideas for limiting memory use further, but that is still a long way off.

 

As for hardware and heap size, the latter is easy: assign as large a heap as the hardware allows. I’ve been running on a machine with 1.5GB RAM and I usually assign 1.25GB to the Java heap.
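
As a small worked example of that arithmetic, the Python helper below builds a JVM `-Xmx` maximum-heap flag from the machine's RAM. The helper name `heap_flag` and the 256 MB reserved for the OS and JVM overhead are assumptions that simply mirror the 1.5GB / 1.25GB split above; how the flag reaches the JVM depends on how you launch Heritrix.

```python
def heap_flag(total_ram_mb, reserved_mb=256):
    """Return a JVM -Xmx flag that leaves `reserved_mb` outside the heap.

    reserved_mb=256 is an assumed default, matching 1536 MB RAM -> 1280 MB heap.
    """
    heap_mb = total_ram_mb - reserved_mb
    if heap_mb <= 0:
        raise ValueError("not enough RAM left for a heap")
    return f"-Xmx{heap_mb}m"
```

For a 1.5GB machine, `heap_flag(1536)` gives `-Xmx1280m`, i.e. the 1.25GB heap mentioned above.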

As for hardware, you’ll want plenty of memory (duh), and a fast processor also helps if you are concerned about crawl speed, since the crawl tends to be limited by the available processing power. If you choose to limit memory use by lowering host-queues-memory-capacity significantly, a fast hard disk will be very useful. I’d suggest using separate disks for the state/scratch/log files and the ARC files.

 

- Kris

 

-----Original Message-----
From: Sebastian de Castelberg [mailto:sdecaste@...]
Sent: 23. september 2004 09:32
To: archive-crawler@yahoogroups.com
Subject: [archive-crawler] Large experimental crawl

 

Hi,

for a research project we have to implement two different random-walk
algorithms for uniform page sampling. This requires us to gather about
2-4 million URLs.
So we have to choose an adapted Broad Crawl, which randomly chooses the
URLs that are fed back into the frontier, so that the fetch queue
doesn't grow exponentially. We also do not need to write the whole
content to disk.
Heritrix seemed to work quite well as a crawler for another project
(thanks for the good development work, by the way). But we often ran
into problems caused by memory limitations.

The known limitations page says that it is possible to crawl about
6 million URLs and about 10000 hosts with default settings.
My question is: What's the best setup to reach this number of URLs
(HW/Java heap size/Heritrix config)?
Do we need special hardware, or can it be done with a common PC (P4
2.6GHz, 512-1024 MB RAM)?

We plan to use Debian GNU/Linux as the OS. Maybe there's someone who
already has experience with large-scale crawls and can give me some hints.

thanks
sebastian de castelberg




Thu Sep 23, 2004 9:50 am

kristsi25