Anything in heritrix_out.log? Can you run your embedded instance inside a debugger to try and figure out what is awry? Does the order.xml+seeds.txt in an...
Well, it's not possible to disable the test goal, but it's possible to tell the maven-junit plugin to skip all the tests... maven test -Dmaven.test.skip=true This...
No, there is nothing in heritrix_out.log. In debug mode I get the following lines: 03.07.2007 10:43:38 org.archive.crawler.Heritrix postRegister INFO: ...
The WARNING below is a complaint about your order file. It states that the current Heritrix version does not know what to do with the configuration named...
Dear all The State and University Library of Denmark and The Royal Library of Denmark are pleased to announce the release of the NetarchiveSuite as Open...
Thanks, I've replaced the 'sha1-content' config with two options in accordance with the current Heritrix version. There are no WARNINGS at all. About 'problem solved'....
Artem, Stack: I am going to jump in if you don't mind :) We changed processor Filters to DecideRules in 1.12.0. From the release notes: 2.4.2. DecideRules have...
Hello, I am now trying to run Heritrix on Windows, but cannot do anything. What should I do? It gives me these errors. java.lang.NullPointerException ...
Hi, There is an option in the settings called bind-address. My question is: can I put multiple comma-separated IP addresses here... If I bind 10 IPs to my...
Hi, To hide the real crawler IP and use multiple IP addresses, so as not to look like one crawling robot but as geographically distributed...
Hi, I am feeding 2 lakh (200,000) seeds to my Heritrix instance and I am using QuotaEnforcer to limit it to 10 docs per host. I am using a SURT rule in the deciding scope. ...
Hi, I cannot set max-retries to 2. If I do, it only fetches the robots.txt page of each domain present in the seed file. Actually I do not want to retry any...
I've had that problem as well - it seems the crawler won't crawl with a setting for max-retries lower than 3. Could be a bug? best -- Bjarne Andersen Daily...
Notice in crawl.log that every seed has at least 3 retries (3t in annotations). Every time a URI is deferred, the count of retries goes up. Seed will be...
Hello Jigar, It seems that all that is left to be crawled are sites that are problematic. -2 means that we failed to connect to a server. Every time you have...
You can, and you don't need to restart the crawl. SurtPrefixedDecideRule has rebuild-on-reconfig option that you can set to true. However, you have to 'touch'...
As Igor notes, this behavior is 'by design'. When a URI comes up for crawling, but its host has not been (recently) fetched via a prerequisite DNS URI, or its...
I want to have a post processor update my own database with the location of the ARC file that was written for the current crawl (i.e. for each URI). Is there...
Hi there, My name is Dawid Weiss, I have been using Heritrix for a few weeks now -- great software, really. I noticed a very annoying bug, described below. ...
... Are you asking if you can insert a processor into the crawler at the 'postprocessor' stage to insert the ARC file name a particular download was written to...