A PreRelease preview has been available for a while, and I have mentioned it
here before.
The module has been undergoing testing at The National and University
Library of Iceland (where it was developed) for the past several months. It
has also been extensively reviewed by Netarkivet.dk (my thanks) during the
same period. It is believed to be entirely stable; however, support for ETags
is limited.
The DeDuplicator is an add-on module for Heritrix that allows sequential
snapshot crawls to leverage information about previous iterations to avoid
storing (or even downloading) duplicate data. You will need a working
Heritrix installation to use it. It has been tested under Heritrix 1.8.0 and
1.10.0 but is believed to also work under older versions.
Duplicates are detected by building an index of the previous crawl between
snapshots. This index is then queried at crawl time to determine whether
crawled documents remain unchanged.
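To illustrate the idea only, here is a minimal Python sketch of an index that is built between snapshots and queried at crawl time. It is not the DeDuplicator's actual code: the in-memory dictionary, the SHA-1 content digest, and the function names build_index and is_duplicate are all assumptions made for this example.

```python
import hashlib


def content_digest(payload: bytes) -> str:
    """Return a hex digest identifying the document content (SHA-1 is an assumption)."""
    return hashlib.sha1(payload).hexdigest()


def build_index(previous_crawl):
    """Build a URL -> digest index from (url, payload) pairs of the previous snapshot."""
    return {url: content_digest(payload) for url, payload in previous_crawl}


def is_duplicate(index, url, payload: bytes) -> bool:
    """At crawl time: True if the URL was seen before with identical content."""
    return index.get(url) == content_digest(payload)


# Hypothetical data standing in for a previous snapshot.
previous = [
    ("http://example.org/logo.png", b"PNG-bytes"),
    ("http://example.org/index.html", b"<html>old</html>"),
]
index = build_index(previous)
```

A document whose URL and digest both match an index entry would not need to be stored again; a new URL or changed content falls through as a fresh document.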
Testing has revealed that for moderately sized crawls (on good hardware) of
roughly 1-2 million documents the performance penalty is negligible. For
larger crawls it is recommended that only non-text documents be indexed (as
they are far less likely to change and represent a disproportionate
amount of the data collected).
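As a rough sketch of what "only non-text documents" could mean in practice, the Python snippet below filters by MIME type. The function name should_index and the rule of excluding everything under text/ are assumptions for illustration, not the module's actual configuration.

```python
def should_index(mime_type: str) -> bool:
    """Index a document only if its MIME type is not under text/ (assumed rule)."""
    return not mime_type.strip().lower().startswith("text/")


# Hypothetical records of (url, mime_type) from a finished crawl.
records = [
    ("http://example.org/index.html", "text/html; charset=utf-8"),
    ("http://example.org/photo.jpg", "image/jpeg"),
    ("http://example.org/report.pdf", "application/pdf"),
]
to_index = [url for url, mime in records if should_index(mime)]
```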
To avoid downloading unchanged documents, a substitute FetchHTTP processor is
included that bases its decision on 'last-modified' information.
Note that this is not foolproof and may cause you to miss valid content.
Note also that this should never be applied to documents that may contain
links.
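The decision can be pictured with the following Python sketch. It is a hedged illustration, not the substitute FetchHTTP processor itself: the function should_download, the list of link-bearing MIME types, and the idea of comparing against a stored timestamp from the previous crawl are assumptions. It errs on the side of downloading whenever the information is missing or the document may contain links.

```python
from email.utils import mktime_tz, parsedate_tz

# Assumed set of types that may contain links and must always be fetched.
LINK_BEARING_PREFIXES = ("text/html", "application/xhtml+xml", "text/css")


def parse_http_date(value):
    """Parse an HTTP date header into a UTC timestamp, or None if absent or invalid."""
    if not value:
        return None
    parsed = parsedate_tz(value)
    return mktime_tz(parsed) if parsed else None


def should_download(mime_type, last_modified, previous_last_modified):
    """Skip the download only when the document is provably not newer than before."""
    if mime_type.strip().lower().startswith(LINK_BEARING_PREFIXES):
        return True  # never skip documents that may contain links
    current = parse_http_date(last_modified)
    previous = parse_http_date(previous_last_modified)
    if current is None or previous is None:
        return True  # no reliable information, so fetch to be safe
    return current > previous
```

Because servers can report an unchanged 'last-modified' value for content that did change, a rule like this can still miss valid content, as noted above.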
A guide for getting started is available on the website.
DeDuplicator site: http://vefsofnun.bok.hi.is/deduplicator/index.html
With best regards,
Kristinn Sigurðsson
IT-Group Project Manager
The National and University Library of Iceland
Arngrimsgata 3
107 Reykjavik
E-mail: kristsi@...