Dear IA-Team, during my last crawl I've been experiencing the "death" of one of my heritrix instances. I'll share my experiences, maybe someone can help me...
Sorry for the late reply. Here is the stack trace for the failed temporary file delete. Hope it's helpful. Thanks! 12/23/2006 01:33:35 -0800 WARNING ...
... Were ... I was calling the toString() method on the ReplayCharSequence and then putting the resulting String into the AList. ... I can try this, but will...
Hi all, I'm evaluating Heritrix (1.10.1) and in one of my test runs (http://www.telegraaf.nl) Heritrix downloads 38231 variations of the same index.html file...
... During a single crawl, Heritrix should only ever download a single instance of any particular URL. In your crawl logs, there is a single URL logged 38231...
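As a rough illustration of the "one download per URL" expectation, here is a minimal sketch of an already-seen check. It is not Heritrix's frontier code: the class `SeenUriDemo`, its `schedule` method, and the example.com URLs are hypothetical names for illustration. It also shows that URLs differing only in their query string count as distinct under plain string comparison.

```java
import java.util.HashSet;
import java.util.Set;

public class SeenUriDemo {
    // URIs that have already been scheduled during this (hypothetical) crawl.
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the URI has not been seen before and should be fetched. */
    public boolean schedule(String uri) {
        // Set.add returns false when the element is already present.
        return seen.add(uri);
    }

    public static void main(String[] args) {
        SeenUriDemo demo = new SeenUriDemo();
        System.out.println(demo.schedule("http://example.com/index.html"));     // true
        System.out.println(demo.schedule("http://example.com/index.html"));     // false
        System.out.println(demo.schedule("http://example.com/index.html?x=1")); // true
    }
}
```

Under this assumption, a page that is reachable through many different query strings would be fetched once per distinct string, even when the returned content is the same.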
You are running two ARCWriters in the same crawler? You may have uncovered an issue in the pooling code. The two instances may be interfering w/ each other....
When using the excellent jmxclient utility, I'm having trouble supplying arguments to operations. I *am* able to use no-arg operations successfully. I'm...
The format is argValues separated by commas (Not argType then argValue). You seem to have an extraneous 'org.archive.jmx.Client' in your TestBean sample...
Hi Michael, ... Sorry, should have passed this information immediately: it seems they all differ by query parameter, what's causing this? -rw-r--r-- 1 user...
Ah, sorry - that was a cut-paste error from an experiment I did where I exploded the jar. The command I'm running is: java -jar cmdline-jmxclient.jar...
Sorry - the extraneous 'org.archive.jmx.Client' was a cut/paste error from an experiment I ran where I exploded the jar file. The correct command I'm running...
(Hmm - my replies seem to be vanishing...) That extraneous 'org.archive.jmx.Client' was a cut/paste error from an experiment I ran where I exploded the jar. ...
It's hard to help when I can't reproduce. Is the cmdline-jmxclient that you are using from the Heritrix 1.10.1 $HERITRIX_HOME/bin? I do not see a 1.5.1 JDK...
A better source would be the crawl.log. Does it have multiple instances of the said 'index.html' page, with or without parameters? If not, then it's likely...
Hello all, Since last week we've been testing Heritrix on an outsourced server. There are no problems when we're writing the job-data to the local disk of that...
Hi, I found this blog very helpful: http://www.dreamersrealm.net/~tree/blog/?s=text%2Fhtml&submit=GO, when trying to implement a job which would ignore sending...
... I've not seen that before but then we don't run on NFS mounts. Anything in the server system logs that you can correlate with the process going defunct?...
Hi Michael, Thanks for the prompt reply. Since we have limited control over the harvest server, I'm going to contact the people maintaining that server for us ...
Hi, Was reading through the Heritrix code base and found the following: workerqueuefrontier.schedule method calls add or addforce on the alreadyincluded...
Hi, A little help would be greatly appreciated. I am curious about how Heritrix is handling redirects. Per my understanding, the HTTP method in the source is by...
Hi, Thanks a lot for help. I've added ContentTypeRegExFilter to the 'midfetch-filter' and 'write-processors' sections, and now only html/text contents are ...
Dear all. Our harvesting depends on the QuotaEnforcer. However, the QuotaEnforcer in 1.10.1 (v. 1.7) generates a NullPointerException whenever a bad URI is given...
Hi, I'm not quite sure about your question, but I would check the regular expression, .*(?i)\.(doc|ppt)$; I'm not sure why you have two slashes. Let me know if this...
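To make the "two slashes" point concrete, here is a small sketch using java.util.regex. It assumes the pattern is handed to Java's regex engine; the class name `ExtensionRegexDemo` and the example.com URIs are made up for illustration. With a single backslash, `\.` matches a literal dot; with a doubled backslash, the engine looks for a literal backslash followed by any character, which an ordinary URI will not contain.

```java
import java.util.regex.Pattern;

public class ExtensionRegexDemo {
    // The pattern as the regex engine should see it: .*(?i)\.(doc|ppt)$
    // (the Java string literal needs "\\." to produce a single backslash).
    static final Pattern SINGLE = Pattern.compile(".*(?i)\\.(doc|ppt)$");

    // The same pattern with a doubled backslash reaching the engine: .*(?i)\\.(doc|ppt)$
    // Here "\\" is a literal backslash and "." matches any character.
    static final Pattern DOUBLED = Pattern.compile(".*(?i)\\\\.(doc|ppt)$");

    public static boolean matchesSingle(String uri) {
        return SINGLE.matcher(uri).matches();
    }

    public static boolean matchesDoubled(String uri) {
        return DOUBLED.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String uri = "http://example.com/report.DOC"; // hypothetical URI
        System.out.println(matchesSingle(uri));  // true: (?i) makes "DOC" match "doc"
        System.out.println(matchesDoubled(uri)); // false: the URI contains no backslash
    }
}
```

Whether a doubled backslash is needed depends on where the pattern is written: a Java string literal requires the escaping, while a pattern typed into a settings file is usually passed through as-is.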
Hi, My understanding is that once a link passes the scope test and later gets redirected, it doesn't go through the SURT test again. This was a problem for me and...
... Are you looking in the right place? The crawl.log has an entry per downloaded URL. There is nothing related to telegraaf.nl, assuming that's the site you...