Dear St.Ack. I have now uploaded two files: HeritrixLauncher.java: the class responsible for executing Heritrix. Netarkivet-log.txt (logs from a run, where...
... Sure. Looking at your log, it looks like we're stuck here: http://crawler.archive.org/xref/org/archive/crawler/framework/CrawlController.html#1031. I see...
... I'd doubt it (I ran a few tests anyway and it seems to behave itself). On your end, you might try removing it from your config to see if it changes the...
I took another look at your log. It says zero active threads, but I only counted 48 log entries of "FINE: ToeThread #50: finished for order 'default_orderxml"....
It happens only when we're calling: crawlController.requestCrawlStop(); We're calling that when the crawler hangs for too long (our own kind of timeout...
After looking at your HeritrixLauncher class, I suspect the problem is a thread deadlock. requestCrawlStop() is only called inside doCrawlLoop, which holds the...
Hello, This email message is a notification to let you know that a file has been uploaded to the Files area of the archive-crawler group. File :...
archive-crawler@yahoo... | Dec 2, 2005 12:34 pm | 2388
Seeing the latest log dump you've uploaded, I think the real problem is with this exception, indicative of a bug fixed just yesterday: Exception in thread...
Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for bloom filter already-included...
Dear Gordon, It works like it should now. Thanks for your assistance. All the best ... Søren Vejrup Carlsen, DUP, Det Kongelige Bibliotek tlf: (+45) 33 47 48 41...
Hi, Just wondering if anybody has used Heritrix to do large crawling at the scale of around 500M links. I know I probably need to use multiple instances and...
Hi, I am looking at the recover job feature. Correct me if I am wrong, but is it only able to reload the URLs that it visited and the ones that are on the ...
Hi, I am new to Heritrix, and at times when I refresh my administrator console the crawl appears as paused. I never manually pause the crawl. I haven't...
Dear all. We have just started using the group-max-success-kb setting in the QuotaEnforcer available with Heritrix 1.6.0. If we set the limit to X bytes, and get less...
Thanks for the insight. I was using 1.4.0, which didn't include checkpointing. Is it possible to configure the job to automatically save checkpoints for every...
Hi St.Ack, Well, we need to re-establish state if the crawl fails, and the only way to keep the state is using the checkpoint feature. We would definitely want...
... Crawler must be quiescent before you checkpoint. Otherwise the checkpoint will be inconsistent. So, at least for now until we do more work, the pause is...
... You might try crawling a small site whose size you know, letting Heritrix crawl it to completion. Do it first without QuotaEnforcer, then with,...
... Not that I know of. I've witnessed/heard of 200-300 million with 2 to 3 machines. Would be interested in hearing about your experiences. ... Dual Opterons...
... Seems ... becomes ... the ... other ... we've ... How confident do you guys feel that if I use broad-scope I can go above 50M links (or even 100M links)...
We found the error in our own logic (and code). It was a simple rounding error (500000 bytes/1024 is 488 KB, but 488 KB*1024 is not 500000 bytes). Best, Bjarne...
... I'd suggest you start up a proofing test crawl with BroadScope and see how it does. On machines with specs like those listed below we've pulled down ... was not...
We have encountered a new problem with the group-max-success-kb feature. It seems that the robots.txt URI is not considered by QuotaEnforcer. These robots.txt are,...
... To shed some light (maybe) on the problem, the update for the host-report happens through StatisticsTracker.crawledURISuccessful, called from...
I think Søren formulated the wrong question, or formulated the opposite of what he really meant. The problem is that the size of the robots.txt is summed together...
Hi all, The last couple of days I've been writing a Processor to write the results to a MySQL database. Everything turned out to be OK so far, except the fact...
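The rounding error described in the message above can be reproduced with plain integer arithmetic. The following is only a minimal sketch: the class QuotaRounding and its two methods are hypothetical names invented for this illustration and are not part of Heritrix or of the poster's code. It assumes the byte limit was converted to whole kilobytes with truncating integer division.

```java
// Hypothetical illustration only; not Heritrix code and not the poster's code.
final class QuotaRounding {
    static final long BYTES_PER_KB = 1024L;

    // Integer division truncates, so any remainder below 1024 bytes is dropped.
    static long bytesToWholeKb(long bytes) {
        return bytes / BYTES_PER_KB;
    }

    // Converting back multiplies the truncated value, so the dropped bytes stay lost.
    static long kbToBytes(long kb) {
        return kb * BYTES_PER_KB;
    }

    public static void main(String[] args) {
        long limitBytes = 500000L;                    // limit from the message
        long limitKb = bytesToWholeKb(limitBytes);    // 500000 / 1024 = 488 (truncated)
        long effectiveBytes = kbToBytes(limitKb);     // 488 * 1024 = 499712
        System.out.println(limitKb + " KB corresponds to " + effectiveBytes
                + " bytes, " + (limitBytes - effectiveBytes) + " bytes short of the limit");
    }
}
```

With these numbers the round trip comes out 288 bytes short of the original limit, which is the kind of mismatch one would see when comparing a byte count against a limit stored in whole kilobytes.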