Hello Joe/Community, I believe you can make use of the JMX call exposed by Heritrix, "interrupt(String threadName)", to kill a runaway thread. The call...
In such a situation, it's good to run a 'Threads Report' to see what the laggard thread is doing; it might be legitimately in the middle of a long download. ...
... Yes. The second crawl will re-fetch all pages even though they may have just been fetched by a previous crawl. By default, a Heritrix crawl has no knowledge of...
I finally logged in to the web console and looked at the thread report. It looks like the thread is stuck at one regular expression call (java.util.regex) and...
What is the difference? I have a proof crawler set up with a 1500MB heap. It can never run the Bloom filter for more than 1 day (maybe half a day). I've tried several ...
... If it was busy in regex extraction, rather than a network fetch, there's probably a reproducible problem with our regular expressions, where they perform...
Hello, I want to restrict the number of fetched documents per webserver. I am using the QuotaEnforcer prefetch processor. But I am unsure which directive should...
Thimo Eichstädt
abc@...
Jan 4, 2006 2:24 am
2493
Hi all, Does anyone know how I can retrieve the original seed URI from the current CrawlURI (which is not always the base URI, as they got somehow redirected...
... pass ... from). ... directory ... I did that and I was able to create a job and it does look like a recovered job. However I'm not seeing any directory...
... Was the 'previous checkpoint' not a 'fast checkpoint'? How did you instantiate the recovery? Via the UI? Does the listing of bdbje logs under the...
... The recovered job resumes the crawl? Did you pass an order file to JMX addJob, or a jar of order + seeds + overrides, etc.? If the latter, it will make a new...
... One makes counts on a 'CrawlHost' [http://crawler.archive.org/xref/org/archive/crawler/datamodel/CrawlHost.html] basis. The other on a 'CrawlServer' ...
... the ... dir). ... directory ... I was doing the former. Seems like a bad choice... Assuming I don't have any overrides, can I just pass it a jar file ...
... is ... you ... I have <boolean name="checkpoint-copy-bdbje-logs">true</boolean> so I guess the fast checkpointing is off. I did the instantiation from the...
... Yes. ... Others here that make use of it pass the empty string (Need to add an override that is absent the seeds parameter). Keep asking questions, St.Ack...
... Is it possible that the job successfully started from this checkpoint wrote back atop the checkpoint (Sounds like it's possible going by your mails of...
Here is more background information on the problem: There are 2 crawl jobs involved, each with 3 checkpoint directories. The first crawl job I started about 5 days...
Hello people. :) I hope you can help me make Heritrix work. I was able to start the Heritrix UI and run crawling jobs. However, all my jobs hang at 33%...
... I would add to Stack's examples of things combined in the 'host' count: other HTTP servers on nonstandard ports, such as "http://example.com:8080". All...
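To make the host-versus-server distinction in the replies above concrete, here is a minimal sketch of how the two counting keys could differ for the same URLs. It is an illustration only, not Heritrix's actual code: the function names `host_key` and `server_key`, and the exact key format, are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Default ports per scheme; a URL on its default port needs no port suffix.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_key(url):
    """Hypothetical 'host' key: the hostname only, so every port shares it."""
    return urlsplit(url).hostname


def server_key(url):
    """Hypothetical 'server' key: hostname plus any non-default port."""
    parts = urlsplit(url)
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return parts.hostname
    return f"{parts.hostname}:{port}"


if __name__ == "__main__":
    for url in ("http://example.com/", "http://example.com:8080/"):
        print(url, host_key(url), server_key(url))
```

Under these assumptions, both URLs fall under one host key, while the server on port 8080 gets its own server key, so a per-host quota would be shared between them and a per-server quota would not.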
... problems, ... the ... Here is the hung thread along with the problem URL: [ToeThread #24: http://www.just-for-golf.com/golf-vacation.html CrawlURI...
... Did the problem occur only after upgrading Mozilla? What URL are you using to access the admin web UI? What error does Mozilla give? Is the Heritrix...
I have a job that is getting very large; the partition that Heritrix lives on will be full soon. I have paused the job. I would like to shut down Heritrix,...
... They do not. You need to checkpoint and then in the new location, do a recovery from the checkpoint made at end-of-crawl. See section 9.4, "Checkpointing"...
If you've never had success running Heritrix on this machine, the browser is probably a non-issue. The error message suggests nothing is listening on the...