On the subject of using Heritrix: the National and University Library of Iceland is currently using it to crawl the entire .is TLD (about 11,000 domains). We've already completed one full snapshot using 1.0.4 (about 35 million URIs) and plan to perform 3 more this year (in addition to iterative crawls using new modules currently under development).
On the subject of performance metrics: Michael cites my figures correctly, but I should note that we are using a customized scope, since we discovered that the HostScope becomes a performance bottleneck with large numbers of seeds.
Our last complete snapshot (with v. 1.0.4) was conducted in 10 segments over a 2-month period. With all the improvements made to Heritrix, recent tests have convinced us that it is POSSIBLE to do this in a single crawl, but it is probably more efficient to split it up into 3-4 segments. (Currently I'd advise people to use a development build rather than any released 'stable' build; the advances in the BdbFrontier are quite remarkable.)
We estimate that the next .is snapshot will take about a month. This may vary, though, since we are currently in the process of upgrading our hardware to a dual-processor machine. I will report on its performance once we have some numbers.
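A back-of-the-envelope check of that estimate, as a minimal sketch only: it assumes the next snapshot is about the size of the last one (~35 million URIs) and that the crawler sustains the ~15 docs a second quoted in the message below. The function name is hypothetical, and a real crawl will not hold a constant rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_crawl_days(total_uris: int, docs_per_second: float) -> float:
    """Days needed to fetch total_uris at a constant sustained rate.

    Assumption: the rate stays flat for the whole crawl, which ignores
    ramp-up, politeness delays and slowdowns late in the crawl.
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return total_uris / docs_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # ~35 million URIs (size of the last snapshot) at ~15 docs a second.
    days = estimated_crawl_days(35_000_000, 15)
    print(f"{days:.1f} days")
```

Under those assumptions the result comes out at roughly 27 days, which is in line with "about a month".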
Regards,
Kristinn Sigurðsson
Software Engineer
National and University Library of Iceland
P.S. Michael, I like the idea of having a page for 'known Heritrix projects and uses'. Feel free to add us to the list.
-----Original Message-----
From: stack [mailto:stack@...]
Sent: 20 January 2005 03:06
To: archive-crawler@yahoogroups.com
Subject: Re: [archive-crawler] Benchmarks
warrby101 wrote:
>
> We're currently evaluating Heritrix, and I'm looking for two pieces of
> information:
> 1. Performance metrics -- I've looked through the Heritrix site and
> the performance information is not clear to me. Does anyone have
> metrics that they can share?
What metrics are you looking for? Heritrix performance varies wildly
with the settings used, where Heritrix is in the crawl lifecycle, the JVM,
the quality of your connection to the net, what you are crawling, and
the profile of the hardware it's deployed on.
This morning a partner from Iceland reported Heritrix, after running for
4 days doing a HostScoped crawl against 11k hosts, steadily pulling at
a rate of ~15 docs or 857k a second on a fairly decent machine (I don't
have its exact stats other than that the JVM was using 1.2gigs of RAM). We
ourselves are running comparable crawls on sets of slow machines (999MHz
and 512k RAM with JVM using 300k), with each machine settling after a few
days of running at a rate of ~5 docs a second. Other crawls we've run on
'better' hardware -- fast dual processors with loads of RAM -- have
sustained periods pulling 30-40 docs a second, with intervals of 50-70
docs a second near the start of a crawl.
It all depends.
> 2. A list of organizations (besides the Archive) using Heritrix with
> descriptions of the scale of their usage.
You might go back through the archives. In a few instances people have
volunteered what they are using Heritrix for.
(If wanted, I can start a page that lists uses of Heritrix. Send me a
one sentence summary and a link to the Heritrix project if a public link
exists and I'll add it to a public list).
Here's a sampling of what we've used Heritrix for here at the Archive
(Numbers and descriptions of machines are coarse):
+ Crawl of the .fr domain. About a month of crawling divided between 4-8
machines using a mix of the 999MHz machines described above and another,
faster 1.7GHz machine. This crawl pulled down ~50-60 million documents.
+ Crawl of all government sites for NARA to coincide with the 'change'
in administration. This crawl lasted 4 to 5 weeks and pulled down
~75million documents (6.5Tb) with the crawl split across 4-5 1.7ghz
machines.
- Crawl of sites that pertain to the Iraq war on a weekly basis. One
1.7ghz machine pulling down about 300gigs a week.
- Crawl of a small set of UK government sites on a weekly and 6-monthly
basis.
- Monthly congressional crawl.
- Crawl of sites related to the Indian Ocean Tsunami.
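To put the coarse numbers above on the same footing as the docs-a-second rates quoted earlier, here is a rough sketch that averages the NARA crawl figures. It is illustrative only: the class and field names are hypothetical, "6.5Tb" is assumed to mean 6.5 decimal terabytes, and the midpoints (4.5 weeks, 4.5 machines) are assumptions standing in for the ranges given.

```python
from dataclasses import dataclass

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


@dataclass
class CrawlSummary:
    """Coarse totals for one crawl (hypothetical helper, not part of Heritrix)."""
    documents: int
    total_bytes: float
    weeks: float
    machines: float

    def docs_per_second(self) -> float:
        # Aggregate rate across all machines, averaged over the whole crawl.
        return self.documents / (self.weeks * SECONDS_PER_WEEK)

    def docs_per_second_per_machine(self) -> float:
        # Assumes the load was spread evenly across machines.
        return self.docs_per_second() / self.machines

    def mean_document_bytes(self) -> float:
        return self.total_bytes / self.documents


# NARA crawl: ~75 million documents, 6.5Tb, 4 to 5 weeks, 4-5 machines.
# Midpoints are used for the ranges; 1 Tb is assumed to be 1e12 bytes.
nara = CrawlSummary(documents=75_000_000, total_bytes=6.5e12,
                    weeks=4.5, machines=4.5)

if __name__ == "__main__":
    print(f"aggregate:   {nara.docs_per_second():.1f} docs/s")
    print(f"per machine: {nara.docs_per_second_per_machine():.1f} docs/s")
    print(f"mean size:   {nara.mean_document_bytes() / 1000:.0f} kB")
```

With those midpoints the averages work out to roughly 28 docs a second in aggregate, about 6 per machine, and around 87 kB per document. These are whole-crawl averages, so they say nothing about peak rates.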
Yours,
St.Ack
>
> Thanks