archive-crawler
Message #3282 of 6139
[ANN] Deduplicator (add-on for Heritrix) 0.2.0 release

A pre-release preview has been available for a while, and I have mentioned
it here before.
The module has been undergoing testing at The National and University
Library of Iceland (where it was developed) for the past several months. It
has also been extensively reviewed by Netarkivet.dk (my thanks) during the
same period. It is believed to be entirely stable, although support for
ETags is limited.

The DeDuplicator is an add-on module for Heritrix that allows sequential
snapshot crawls to leverage information about previous iterations to avoid
storing (or even downloading) duplicate data. You will need a working
Heritrix installation to use it. It has been tested under Heritrix 1.8.0 and
1.10.0 but is believed to also work under older versions.

This is accomplished by building an index between snapshots. This index is
then queried at crawl time to determine if crawled documents remain
unchanged.
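
As a rough illustration of the idea only (this is not the DeDuplicator's
actual code, and it is written in Python for brevity rather than as a
Heritrix module), the index can be pictured as a mapping from URL to the
content digest recorded in the previous snapshot. The function names and the
choice of SHA-1 below are assumptions made for this sketch.

```python
import hashlib


def content_digest(payload: bytes) -> str:
    """Return a hex digest of the document body (SHA-1 is an assumption)."""
    return hashlib.sha1(payload).hexdigest()


def build_index(previous_crawl):
    """Build a URL -> digest index from (url, payload) pairs of an earlier snapshot."""
    return {url: content_digest(payload) for url, payload in previous_crawl}


def is_duplicate(index, url, payload: bytes) -> bool:
    """True if the URL was seen before with exactly the same content."""
    return index.get(url) == content_digest(payload)


# Hypothetical data standing in for a previous snapshot.
previous = [
    ("http://example.org/logo.png", b"PNG-bytes-v1"),
    ("http://example.org/", b"<html>old</html>"),
]
index = build_index(previous)
```

At crawl time, a document for which `is_duplicate` returns true would not
need to be stored again; a new URL or changed content falls through to normal
processing.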

Testing has revealed that for moderately sized crawls (on good hardware) of
roughly 1-2 million documents the performance penalty is negligible. For
larger crawls it is recommended that only non-text documents be indexed (as
they are far less likely to change and represent a disproportionate
amount of the data collected).
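
The "non-text only" recommendation amounts to a filter on the document's
content type. A minimal sketch, assuming the content type is available as a
MIME string and that "text" means any `text/*` type (both are assumptions for
this example; the real module is configured through Heritrix, not through
code like this):

```python
def should_index(mime_type: str) -> bool:
    """Index only documents whose MIME type is known and is not text/*."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return bool(base_type) and not base_type.startswith("text/")
```

With this filter, images, PDFs and other binary formats would be indexed,
while HTML and other text documents would be skipped.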

To avoid downloading unchanged documents a substitute FetchHTTP processor is
included that uses 'last-modified' information as the basis for its decision.
Note that this is not foolproof and may cause you to miss valid content.
Note also that this should never be applied to documents that may contain
links.
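
The decision described above can be sketched as follows. This is only an
illustration of the logic, not the processor's actual implementation: the
function names, the way the two 'last-modified' values are obtained, and the
list of content types treated as possibly containing links are all
assumptions for this example.

```python
from email.utils import parsedate_to_datetime

# Assumption: these types may contain links and must always be fetched.
LINK_BEARING_PREFIXES = ("text/", "application/xhtml+xml")


def may_contain_links(mime_type: str) -> bool:
    return mime_type.strip().lower().startswith(LINK_BEARING_PREFIXES)


def should_download(mime_type, previous_last_modified, current_last_modified):
    """Decide whether to fetch a document, given HTTP Last-Modified values.

    previous_last_modified: value recorded in the earlier snapshot (or None).
    current_last_modified: value the server reports now (or None).
    """
    if may_contain_links(mime_type):
        return True  # never skip documents that may contain links
    if not previous_last_modified or not current_last_modified:
        return True  # nothing to compare against, so fetch
    try:
        previous = parsedate_to_datetime(previous_last_modified)
        current = parsedate_to_datetime(current_last_modified)
        return current > previous
    except (TypeError, ValueError):
        return True  # unparseable or incomparable dates: fetch to be safe
```

Every uncertain case falls back to downloading, which limits (but does not
remove) the risk of missing valid content when a server reports an inaccurate
'last-modified' value.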

A guide for getting started is available on the website.

DeDuplicator site: http://vefsofnun.bok.hi.is/deduplicator/index.html

With best regards,
Kristinn Sigurðsson
IT-Group Project Manager
The National and University Library of Iceland
Arngrimsgata 3
107 Reykjavik
E-mail: kristsi@...





Wed Sep 13, 2006 9:16 am

kristsi25