... Exactly this happened to me a year or two back (different site and different crawler, of course). I can understand it too. Chances are that the admin...
Gordon Paynter
Gordon.Paynter@...
Apr 5, 2006 6:33 am
2772
... Looks like I'd already set up auto-building of the hcc jar. Get latest version from here: ...
Hi, I'm using Heritrix 1.2 on Windows. I also created a job which takes a long time to crawl, like 2 to 3 days, but it's still not completed and after all I have to...
... Upgrade. Latest is 1.6 (1.8 release is imminent). It can be obtained here: http://crawler.archive.org/downloads.html. ... Recovery mechanisms are...
Hi guys, when using NutchWax to search the ARC file, if I want to save the search result of a query string, which part of the code should I modify? I mean which...
Hi my friends, There are many websites now using forms to generate dynamic webpages. How can I use Heritrix to crawl this kind of dynamic webpage? Thanks for...
Any suggestions on the easiest way of rerunning a form login? There's a key input value being generated on the first hit and I couldn't figure a way to...
Can anyone give me a clue? Or does Heritrix support this kind of function? Where can I find this kind of information or examples? Sorry, I am a newbie of...
Yes, Michael. I saw the mail. But I still don't know how to configure Heritrix to do that, because there are some name/value pairs I need to assign for...
Hello, Samuel, I am trying to use HTML form credentials too, but I don't know how to do it. Can you tell me where you set your username/password pairs for...
Regarding the below, Andy wrote to me off-list clarifying what it is that he wants to do. He wants to be able to POST arbitrary data to arbitrary HTML forms. The...
Andy, have you read the link cited below? If it is an insufficient description of the login functionality, I would like to know so I can redress. Thanks, St.Ack...
... Is there a reproducible sequence of events that leads to your getting the exception below? Is it with HEAD of Heritrix or a released version? Sounds like a bug...
St.Ack, Time to start testing HCC. Any tips? I have seen that now the sources and jar can be downloaded directly, but the "Getting Started" section in the...
Hi, occasionally we want to be able to restart the crawler and start a new job and carry across the already-seen list, but nothing else (such as the frontier,...
... Another option would be to use the 'recover' log; all of the 'F+' lines are URLs that the crawl considered 'seen'. A typical recover-from-log scans the log...
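A minimal sketch of the idea in the message above, pulling the 'seen' URLs out of a recover log by keeping only the 'F+' lines. It assumes each line is a short tag followed by whitespace and then the URL; the helper name `seen_urls` and the sample lines are hypothetical, and a real log would normally be read from a (possibly gzipped) file rather than an in-memory string.

```python
import io


def seen_urls(lines):
    """Yield the URL from every 'F+' line; lines with any other tag are skipped.

    Assumption: the URL is the second whitespace-separated field on the line.
    """
    for line in lines:
        if line.startswith("F+ "):
            parts = line.split()
            if len(parts) >= 2:
                yield parts[1]


# Made-up sample data standing in for a real recover log.
sample = io.StringIO(
    "F+ http://example.com/\n"
    "Fs http://example.com/\n"
    "F+ http://example.com/about.html\n"
)
urls = list(seen_urls(sample))
```

The resulting list could then be fed to a new job as its already-seen set; how that import is done depends on the Heritrix version and is not shown here.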
Ok, I see. Yes, this is a somewhat complex operation. Will this become a Heritrix feature in the future? I think this could be useful for...
Thanks for the report -- my investigation reveals this is actually a bug in OnHostsDecideRule, in how it updates itself when a new seed is added. It fails to...
Thanks a lot Gordon, your proposed fix works like a charm. The only thing I'm wondering is why this won't make it into CVS before the 1.8.0 release as this is a...
... I'm running released version 1.6.0 of Dec 2, 2005. I still haven't quite figured out under what circumstances the Exception is thrown. Most of the time I only got...
Hi you gurus, I am new to Heritrix and have run into some errors. My box is Windows XP, JDK1.4.2_11, Heritrix 1.6.0. I can get Heritrix up and running but cannot get it to crawl...
Hi, I am using Heritrix 1.6.0, and I have experienced strange behaviour: When setting send-range to true, Heritrix doesn't honor the entries in robots.txt...