Hi, The digest strings in "crawl.log" and in the "output of arcreader -d" seem to be different. Am I doing something wrong, or is this a feature? <crawl.log> ...
Hi, I'm new to Heritrix, and keep having this same problem. I've tried the 1.12 and 1.13 versions, but get the same results. All the domains I enter into the...
I'm getting this java.io.exception as well. I can't find any replies to this post. Was this problem resolved for you, and if so, how? ... ServerCache ... ...
Have you customized the crawl configuration at all -- especially with regard to the PreconditionEnforcer, FetchDNS, or Scope components? What happens if you...
Thanks for the quick reply. I'm running Windows XP. I reviewed the install docs, and found that I didn't create the profile folder under the conf directory....
Hi, Sorry for asking some simple questions, but I can't find the answers anywhere. If there are answers posted somewhere (i.e. a Wiki) please direct me there,...
A new beta test release of Heritrix-2.0.0, "alpha-2", is now available. Specifically, the beta release is considered to be the autobuild with the identifier...
... There is much documentation on 1.x in the user manual: http://crawler.archive.org/articles/user_manual/index.html Settings in the web UI have at least a...
... Look in the settings for anything with 'delay' in it. Our default profiles include a minimum 2000ms delay before releasing another URI from the same...
Can be configured with a canonicalization rule like this: <newObject name="testing_canonicalization" class="org.archive.crawler.url.canonicalize.RegexRule"> ...
Hi, I'm new to Heritrix and currently using version 1.12.1. I successfully crawled my first few webpages and want to use the ArchiveReader to further process...
Hello, I've a problem crawling a site written in .NET with a form that uses VIEWSTATE hidden field for postback action. So the same url is used for displaying...
I am new to Heritrix and would like to set up a Heritrix cluster (one instance per machine). I took a look at the hcc javadocs. Is there any more documentation...
Hello Antonino, there currently is no other place than the JIRA issue to get this feature from and apply the given patch to the codebase yourself. I think the...
Hello, I am new to Heritrix and stuck with a problem. I am running Heritrix on the latest Ubuntu distribution. When I start a crawl, the job is immediately...
Hi all, I'd like to be able to start a crawl which would basically be restricted to only the seed page. So for example, the seed is http://aaa.com/bbb.html. I'd like...
Hi Robert, You can simply set the rejectIfTooManyHops (TooManyHopsDecideRule) to 0. That will ensure that only seeds are fetched. Keep in mind that Heritrix...
Hi Christoph, Maybe the arc file is corrupted. Does the gzip test pass? (gzip -t CRAWL-20071130161712-00001-graz.arc.gz) Is this always happening after the first...
Hi Igor, thanks for the hint, but the files are okay (gzip -t and the command-line arcreader work). I'm trying to iterate over the records from a Java program...
That does sound like a bug to me. A workaround would be to create a wrapper iterator class, and pass the ArchiveReader iterator to that class's constructor....
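
The wrapper-iterator workaround suggested above could look roughly like the following. This is only a sketch under stated assumptions: the message is cut off before it says what the wrapper should do, so the class name SafeIterator and its behaviour (treating a RuntimeException from the wrapped iterator as the end of iteration) are hypothetical choices, not the fix from the thread. It is written against plain java.util.Iterator so that it compiles without Heritrix on the classpath; in practice the delegate would be the iterator obtained from the ArchiveReader.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical wrapper around another iterator (for example, the one
 * returned by an ArchiveReader). It looks one element ahead so that a
 * failure inside the delegate ends the loop instead of propagating.
 */
class SafeIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T buffered;           // element fetched ahead of time
    private boolean hasBuffered;  // true while 'buffered' holds an unread element
    private boolean finished;     // true once the delegate is exhausted or has failed

    SafeIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        if (hasBuffered) {
            return true;
        }
        if (finished) {
            return false;
        }
        try {
            if (delegate.hasNext()) {
                buffered = delegate.next();
                hasBuffered = true;
                return true;
            }
        } catch (RuntimeException e) {
            // ASSUMPTION: a runtime failure in the delegate is treated as
            // "no more records". Log or rethrow here if that is too lenient.
        }
        finished = true;
        return false;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = buffered;
        buffered = null;
        hasBuffered = false;
        return item;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("read-only wrapper");
    }
}
```

Usage would be along the lines of `Iterator<R> it = new SafeIterator<R>(reader.iterator());` followed by the usual `while (it.hasNext())` loop, where `reader` and the record type `R` stand in for whatever the calling program already uses. Swallowing the exception hides the underlying bug, so this is a stopgap for getting at the records, not a replacement for a proper fix.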