Hi Sebastian,
The main problem with running broad crawls (over a larger number of hosts) lies in the memory overhead associated with each active host.
That is, each host that is being crawled requires memory, both for host-specific information and for the ‘top’ of the per-host queue of waiting URIs.
One of the best ways to combat this is to enable site-first, limiting the number of active hosts to the bare minimum needed to fully occupy the crawler. Judging from your post, I’m unsure if that is acceptable to you, since this will favor certain hosts until they are exhausted before moving on.
Another thing that can be done to mitigate the memory use is to limit the number of URIs kept in memory for each host. This is done by setting the “host-queues-memory-capacity” setting to a smaller value. The default is 200, which is too much for a true broad crawl. I’d suggest something in the range of 20-50. A low value will incur more disk access but save memory.
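To get a feel for the effect of this setting, here is a rough back-of-the-envelope sketch in Python. It only multiplies the number of active hosts by the per-host capacity to get an upper bound on the URIs held in memory. The 10000 hosts figure comes from the known limitations page quoted below; the bytes-per-URI value is purely an assumption for illustration, not a measured Heritrix number, and the function names are made up.

```python
# Rough upper bound on queued URIs held in memory during a broad crawl.
# ASSUMED_BYTES_PER_URI is a hypothetical figure for illustration only;
# it is not a measured Heritrix value.

ASSUMED_BYTES_PER_URI = 500


def queued_uris_in_memory(active_hosts: int, per_host_capacity: int) -> int:
    """Upper bound: every active host has its in-memory queue filled."""
    return active_hosts * per_host_capacity


def estimated_bytes(active_hosts: int, per_host_capacity: int,
                    bytes_per_uri: int = ASSUMED_BYTES_PER_URI) -> int:
    """Memory estimate for the in-memory queue tops, under the assumed URI size."""
    return queued_uris_in_memory(active_hosts, per_host_capacity) * bytes_per_uri


if __name__ == "__main__":
    hosts = 10000  # host count mentioned on the known limitations page
    for capacity in (200, 50, 20):  # default, then the suggested range
        uris = queued_uris_in_memory(hosts, capacity)
        mb = estimated_bytes(hosts, capacity) / (1024 * 1024)
        print(f"capacity={capacity:>3}: up to {uris:,} URIs in memory, roughly {mb:.0f} MB")
```

Whatever the real per-URI cost turns out to be, the bound scales linearly with the capacity, so going from 200 to 20 cuts this part of the memory use by a factor of ten.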
Other than site-first and host-queues-memory-capacity, there isn’t all that much you can do to limit memory use by configuring Heritrix. I know the guys at the Archive are looking at some ideas for limiting memory use further, but that is still a long way away.
As for hardware and heap size: the latter is easy. Assign as large a heap as the hardware allows. I’ve been running on a machine with 1.5GB RAM and I usually assign 1.25GB to the Java heap.
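As a small illustration of that rule of thumb, the Python sketch below turns a machine's RAM into a JVM maximum-heap flag (-Xmx is the standard Java option for this). The 256MB reserve for the operating system is an assumption inferred from the 1.5GB / 1.25GB figures above, and heap_flag is a hypothetical helper name. How the flag is handed to Heritrix depends on how you launch it, so that part is left out.

```python
# Derive a JVM max-heap flag from total RAM, leaving a reserve for the OS.
# The default reserve of 256 MB is an assumption taken from the
# 1.5GB RAM / 1.25GB heap example; adjust it for your own machine.


def heap_flag(total_ram_mb: int, reserve_mb: int = 256) -> str:
    """Return a -Xmx flag that assigns all RAM except the reserve to the heap."""
    if reserve_mb < 0 or total_ram_mb <= reserve_mb:
        raise ValueError("reserve must be non-negative and smaller than total RAM")
    return f"-Xmx{total_ram_mb - reserve_mb}m"


if __name__ == "__main__":
    # 1.5GB machine, as in the example above
    print(heap_flag(1536))
```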
As for hardware, you’ll want plenty of memory (duh), and a fast processor also helps if you are concerned about the speed the crawl runs at. The crawl tends to be limited by the processing power available. If you choose to limit memory use by lowering host-queues-memory-capacity significantly, a fast HD will be very useful. I’d suggest having separate HDs for state/scratch/log files and for ARC files.
- Kris
-----Original Message-----
From: Sebastian de Castelberg [mailto:sdecaste@...]
Sent: 23. september 2004 09:32
To: archive-crawler@yahoogroups.com
Subject: [archive-crawler] Large experimental crawl
Hi,
for a research project we have to implement two different random-walk algorithms for uniform page sampling. This requires us to gather about 2-4 million URLs.
So we have to choose an adapted Broad Crawl, which randomly chooses the URLs that are fed back into the frontier, so that the fetch queue doesn't grow exponentially. We also do not need to write the whole content to disk.
Heritrix seemed to work quite well as a crawler for another project (thanks for the good development work at this place). But we often ran into problems caused by memory limitations.
The known limitations page says that it is possible to crawl about 6 million URLs and about 10000 hosts with default settings.
My question is: What's the best setup to reach this number of URLs (HW/Java heap size/Heritrix config)?
Do we need special hardware, or can it be done with a common PC (P4 2.6GHz, 512-1024 MB RAM)?
We planned to use Debian GNU/Linux as the OS. Maybe there's someone who already has experience with large-scale crawls and can give me some hints.
thanks
sebastian de castelberg