... Hello Kaisa. ... Is this the only test that fails? Off-list, Mike Schwartz reports that on retrying the build, the test passes the second time around. Is...
... Probably because we include the JMX reference implementation in Heritrix and it's clashing with the jboss JMX implementation. What happens if you leave in...
... We mustn't be making use of TypeHandler otherwise I'd imagine there'd be compile-time complaints. What tool are you using to do the dependency checking? ...
I'm trying to only save documents that have a certain pattern in the body of the document. I can't figure out a way to do this. It's possible to filter on...
Thanks for the response. I'm trying to crawl as many English-language sites as possible. I have a cluster of a few machines I can dedicate to this task. I'm ...
... There is no such filter in Heritrix currently. You'll have to write one. Do it as a standalone filter or as a DecideRule to include in a DecidingFilter....
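The content check that such a filter would perform can be sketched on its own, assuming the document body is already available as text. This is only a minimal illustration: `BodyPatternCheck` and `shouldSave` are hypothetical names, and the sketch deliberately does not use the Heritrix filter or DecideRule classes, whose signatures are not shown in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: decides whether a fetched
// document body contains a configured pattern.
final class BodyPatternCheck {
    private final Pattern pattern;

    BodyPatternCheck(String regex) {
        // Case-insensitive matching is an assumption made for this sketch.
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Returns true when the pattern occurs anywhere in the body.
    boolean shouldSave(CharSequence body) {
        return body != null && pattern.matcher(body).find();
    }
}
```

A real filter or DecideRule would wrap a check like this and read the body from whatever object the crawler hands it.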
... You might also take a look at the recent Rainbow interface contribution (See http://groups.yahoo.com/group/archive-crawler/message/1905). You might recast...
I'd like to highlight two recent contributions. 1. Mark Williamson of the British Library organized the contribution of Hedaern, an ARC access tool. The...
... Looks like this failure is easy to reproduce if the build is done on Fedora. For now I've made an issue and commented out this test of unused functionality. ......
Hi, I actually use only the heritrix-1.4.0.jar in our system without the jmxri*.jar and the jmxtools*.jars from your distribution. As I have pointed out,...
Hi again, I simply checked out from Sourceforge CVS a new version of Heritrix this morning and the build went through although it had some 33 warnings in...
... That's odd that it would start working like that (The javadoc warnings we need to fix but they're harmless). ... Retry Kaisa. The below is an issue w/...
I cannot build the latest version from HEAD: java:compile: [echo] Compiling to /tmp/heritrix-1.5.0-200507050934/target/classes [javac] Compiling 416 source...
I've made a simple new DomainnameQueueAssignmentPolicy that bases queues on domain-names instead of host-names (domain defined as 2 last names in the hostname)...
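The key derivation described here (the last two labels of the hostname) might look roughly like the following. It is a sketch only, not the contributed policy: `DomainQueueKey` is a hypothetical standalone helper, and keeping dotted IPv4 addresses whole is an assumption added for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not the contributed policy: derives a queue key
// from the last two labels of a hostname.
final class DomainQueueKey {
    private DomainQueueKey() {}

    static String of(String host) {
        if (host == null || host.isEmpty()) {
            return "";
        }
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1); // drop a trailing root dot
        }
        // Assumption: a dotted IPv4 address is used as-is, not cut down.
        if (h.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            return h;
        }
        int last = h.lastIndexOf('.');
        if (last < 0) {
            return h; // single label, e.g. "localhost"
        }
        int prev = h.lastIndexOf('.', last - 1);
        return prev < 0 ? h : h.substring(prev + 1);
    }
}
```

Note that a plain "last two labels" rule groups every host under a suffix such as `co.uk` into a single queue; handling that would need a list of public suffixes.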
Is it deliberate that hosts-report.txt has changed format to (in HEAD): [#urls] [#bytes] [host] 538 32710 dns: 10 38453 130.226.47.102 10 41802 www.etracker.de...
Yes. The primary aim was to place the most important and compact numeric info to the left, where it would not scroll/line-wrap off the right margin. (The...
I got Heritrix compiled with Maven and now the crawler is running too. Thanks for your help. I also imported Heritrix into Eclipse. It’s a great visual way to...
Kaisa, you'll need to configure Eclipse for Java 1.4 compliance to get rid of the assert errors (prior to Java 1.4 'assert' was not a keyword but currently...
Dear all, I was happily crawling the web when I got an out-of-memory error and Heritrix hung. I tried to restart the crawl through the recovery...
Hi,
I have experimented a bit and added log output to the classes CrawlScope for 1.2 and ClassicScope for 1.4 which I have attached to this mail.
I can see...
... The tail on the recover log is an incomplete compression block; the crash interrupted the compressed recover log writing. Because of this, gzip is...
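One way to salvage the readable part of a gzip file whose tail was cut off is sketched below. It assumes a line-oriented log and uses only `java.util.zip`. `TruncatedGzipReader` is a hypothetical helper and makes no claim about how Heritrix itself reads its recover log.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: returns every complete line that can be
// decompressed before a truncated gzip stream runs out.
final class TruncatedGzipReader {
    private TruncatedGzipReader() {}

    static List<String> readCompleteLines(InputStream raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(raw)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (EOFException truncated) {
            // The compressed data ended early; keep what was inflated so far.
        }
        String text = new String(out.toByteArray(), StandardCharsets.UTF_8);
        List<String> lines = new ArrayList<>();
        int lastNewline = text.lastIndexOf('\n');
        if (lastNewline < 0) {
            return lines; // no complete line was recovered
        }
        // Anything after the last newline is a partial line and is dropped.
        for (String line : text.substring(0, lastNewline).split("\n", -1)) {
            lines.add(line);
        }
        return lines;
    }
}
```

Only truncation is tolerated here; a stream that is corrupt in some other way still raises an `IOException`, which seems the safer default for a log that drives crawl recovery.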