Thanks for the feedback. Our main goals are: 1. achieving maximum use of commodity hardware. If we want to distribute 200k domains across say, 4 machines, we...
Joel Halbert
joel@...
Jun 1, 2009 10:39 am
5873
... Not really. At checkpoints, logs are rotated, so you could compress or move away early parts of the log. We have occasionally considered auto-compressing...
Thanks for the feedback. Our main goals are: 1. achieving maximum use of commodity hardware. If we want to distribute 200k domains across say, 4 machines, we...
Joel Halbert
joel@...
Jun 2, 2009 3:01 pm
5875
... This goal (considered alone) is best served by a large crawl with all seeds entered at the very start. Then the only things limiting the crawler will be...
Thanks Gordon. This really helps. ... From: Gordon Mohr <gojomo@...> Reply-To: archive-crawler@yahoogroups.com To: archive-crawler@yahoogroups.com ...
Joel Halbert
joel@...
Jun 3, 2009 8:39 am
5877
Where can I set the path depth? I'm new and the manual is hard to decipher on this issue as there are multiple depth settings. I just want it to get the...
Hi Aaron, If the seeds for your crawl are the home pages of the several thousand websites, then you can try using the TooManyHopsDecideRule and set the...
Hi all. I'm still working with heritrix-1.14.1. I'm already ALMOST up to beginner status! At this time, I'm having problems when a crawl job hits its...
At Sun, 07 Jun 2009 01:21:45 -0000, ... Heritrix should close all open arc files on exit but often doesn’t, for some reason. In any case it is generally...
What Erik says, but also: if you can reliably reproduce a situation where the job finishes but ARCs are left with the ".open" suffix, please let us know, and...
The Internet Archive is planning to host a 'Heritrix Expert Summit' this fall in San Francisco, for advanced Heritrix crawl operators and developers to share...
A preview/alpha testing version of Heritrix 3.0 is now available. We encourage expert Heritrix users curious about upcoming changes to review this alpha and...
Hi guys, I'm new to Heritrix and I've been wanting to configure 1.14.3 to download text only. I've gone to the user manual and also searched on the topic and...
Hi, I am using a C# application to launch a crawl job using the command line interface, but I need to include seeds for that job. Is there a way of providing...
See Gordon's notes on adding URIs mid-crawl here: http://webarchive.jira.com/wiki/display/Heritrix/Adding+URIs+mid-crawl And an example of connecting to...
Joel Halbert
joel@...
Jun 15, 2009 9:07 am
5890
Hello. We are using Heritrix on a dual Opteron server with a fiber optic connection, but crawling speed is unfortunately very low. It is mostly around 130 KB/s,...
... seeds.txt contains about 22,000 domains, some of which could be unregistered. We are crawling on one server for now; we would like to set up a cluster of...
I am trying to crawl Twitter to capture a search query http://search.twitter.com/search?q=sotomayor as part of a new collection that the Library of Congress is...
... Aha -- that some of the domains could be unresponsive (registered but not running an HTTP server) could be the real culprit. A normal fetchable URI can be...
... I believe the search API limit is 1500 statuses, up to 1.5 weeks back, and the max rpp is 100. So you can get 15 pages of 100 statuses, or 100 pages of 15...
... Are you just interested in a sampling, or are you hoping to capture every relevant tweet during the collection period? (If your chief aim was complete...
Relevant tweets as best as possible. I am certainly hoping that the division isn't expecting 100% but I want to get more than what we would get with the weekly...
... We have filtered seeds.txt, so it now contains only sites that have registered DNS and respond on port 80 (around 10% of the list was removed). We have...
Hi people! I'm pretty new to Heritrix, so please help me out. I've been using Heritrix 2.0.3, and I have set it up, everything works fine, however, after a...
... As this is logged as a 'nonfatal' error, if it is the only symptom, it shouldn't be a cause for concern. Is the real problem that there is no progress in...
... Hi Gordon, Yes, the progress stops, and as all the seeds are from the same domain, no other threads are run during the error. It tries different URIs every...