Hi there, I have some problems with relative URIs (Heritrix 1.8). The following is happening: A page like "http://a.site.com/contents.php" is downloaded. It ...
Hello all, I've been crawling for some time and have now over 7 million files. No problems, although lately Heritrix used to stop itself and I clicked it from...
Hi there, Yeah, that happens a lot. Several possibilities as to what to do: 1. You remembered to checkpoint, and it's a good checkpoint. Stop the job, stop...
Hi, I'm running a crawl where initially some URIs were rejected by a DecideRule that was added by mistake. So the URIs show up in the crawl.log as having a...
... That's odd. With the 'force-fetch' flag enabled, the URLs should be crawled whether they've been seen or not. Did the URLs show again in crawl.log after...
... As Stack notes, this should work. The 'force-fetch' flag means to ignore the already-included status. However, it doesn't ignore scoping -- are you sure...
I am pretty certain there is something I am not doing right; I am just trying to pinpoint the issue. I am seeing pages http://xyz.com/target and ...
Hi Anmol, In my experience, these two URLs are usually distinct documents where the first one is a redirect to the second one. So, in most cases you want to get...
Thanks Igor, but have a look at this document. http://dblab.ssu.ac.kr/publication/LeKi05a.pdf They seem to have done a fair bit of experimentation with this issue....
Hi Anmol, ... Thanks for the paper. I just briefly scanned the section on trailing slash normalization and the results of 50% duplication are expected without...
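For anyone wanting to check how many slash/no-slash pairs a crawl actually contains before deciding on normalization, here is a rough Python sketch. It is not Heritrix code and does not reflect how Heritrix canonicalizes URIs; the function names and the example.org URLs are made up for illustration. It only groups URLs that differ by a trailing slash so the pairs can be inspected by hand.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_slash(url):
    """Return url with one trailing slash removed from the path.

    The root path "/" is left alone, since "http://host" and
    "http://host/" are a separate question.
    """
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def slash_variant_groups(urls):
    """Map each normalized URL to the variants seen, keeping only
    keys that were seen in more than one form."""
    groups = defaultdict(set)
    for url in urls:
        groups[strip_trailing_slash(url)].add(url)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


# Hypothetical URLs, for illustration only.
sample_urls = [
    "http://example.org/target",
    "http://example.org/target/",
    "http://example.org/other",
    "http://example.org/",
]
print(slash_variant_groups(sample_urls))
```

Whether the grouped URLs really are duplicates still has to be confirmed, for example by comparing content digests, since the slash-less form is often just a redirect to the other one.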
Hi everyone, I'm currently having a problem that I'm not able to solve. I am currently spidering a rather big set of seeds and thus my crawl.log gets really...
Hi again, actually upon further analysis, you are right, the URI I had force fed into the crawler via JMX did indeed eventually get crawled. When I took a...
... If you pause a crawl, towards the base of the index.jsp console page appears 'View or Edit Frontier URIs'. Click here. Allows adding/deleting URIs...
... Hey Olaf: Postprocessing crawl.log w/ perl/python/awk/etc. seems much easier than modifying Heritrix but if you insist, one suggestion for how to change ...
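As an illustration of the postprocessing route, here is a minimal Python sketch that streams a crawl.log and tallies lines per fetch status. It assumes the usual whitespace-separated layout with the status code in the second field and the URI in the fourth; check that against your own log before relying on it. The function name and the sample lines are invented for the example.

```python
from collections import Counter


def count_status_codes(lines):
    """Tally crawl.log lines by fetch status code.

    Assumes whitespace-separated fields with the status code in the
    second column (an assumption about the log layout). Lines with
    fewer than four fields are skipped as malformed.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[fields[1]] += 1
    return counts


# Invented sample lines, shaped like the assumed layout.
sample = [
    "2006-12-15T14:31:02.123Z 200 5120 http://a.site.com/contents.php L http://a.site.com/ text/html",
    "2006-12-15T14:31:03.456Z 404 0 http://a.site.com/missing.html L http://a.site.com/ text/html",
    "2006-12-15T14:31:04.789Z 200 2048 http://a.site.com/index.html - - text/html",
]
print(count_status_codes(sample))
```

Because the function takes any iterable of lines, an open file object can be passed in directly, so even a very large crawl.log is read one line at a time rather than loaded into memory.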
... That is an interesting paper, thanks for the pointer. However, after skimming it over, I don't think it offers much guidance for typical Heritrix uses, for...
We are migrating from Heritrix 1.8 to Heritrix 1.10.1. Now, the attribute 'scope-embedded-links', which is set to true in our templates, has now been...
Hi *, URLs marked as duplicates are being added to the DeDuplicator-index when running the DigestIndexer in its current implementation. Wouldn't it make sense...
Even at the risk of holding a monolog here :-) ... let me share my findings with a patched version of the DeDuplicator: The processed logfile is the second of...
Maximilian Schoefmann
schoefma@...
Dec 15, 2006 2:31 pm
3597
Hey Max, The reason for adding also those marked duplicates is that the typical usage scenario has been to rebuild the index each time. If you are adding to...
Hey Kris, ... I'm doing very frequent crawls of the same sites and have automated updating the deduplicator index after every crawl. Your DeDuplicator is ...
Maximilian Schoefmann
schoefma@...
Dec 15, 2006 4:16 pm
3599
I've added the patch to HEAD and made a new interim build (20061218) that includes it. - Kris ... parseLinde is ... want to...
Hi everyone, I was able to track down the source of the NPEs. They were caused by the AddRedirectFromRootServerToScope decideRule. My decideRule chain starts...
That's an ugly one, Olaf. Thanks for persevering. I added a check for null host basename to the DR AddRedirectFromRootServerToScope and I just made it so URLs...
I am trying to install rainbow but it does not compile on gcc. Maybe it is because of a newer version of gcc, as rainbow had its last version in 2002. I tried...
I've written a little web app running on tomcat, which uses heritrix to crawl specific sites. Lately, I've been running into the following error messages on...