Hi! We are using the budget facilities in Heritrix with great success. However, it seems that DNS requests are counted on the same queues as the URIs...
Yes ... It could be reasonable for DNS URIs to be 'free' with regard to the queue-budgeting. The reason they aren't now is that the budgeting process was ...
Using: heritrix-1.5.1-200509291859. Heritrix tried to crawl this URL: http://www.anns-personalized-books.com/jiff/my_zoo_rhymes/my-zoo-rhymes-page.htm And then...
This is by design; these values (and options) are often fetchable URIs that browsers visit, either because of JavaScript code that triggers on form actions or...
Hi, We are using a version of Heritrix which was taken from the CVS HEAD back in May. It works reasonably well except for one major problem - which is that it...
... Many improvements have been made since May (All main in-memory structures have been made bdbje disk-based). Can you update your instance? A while back we...
Hi, Does anyone know why I get the following: bash: ./bin/arcreader: Permission denied It happens when I am trying to read the ARC files using the arc reader. ...
Hi, I am kind of new to the Heritrix crawler and I am currently using it for a project. What I noticed when crawling my site was that the crawler seems to ...
The crawler is set up to be very polite, so compared to something like httrack it will seem slow. You need to make sure your scope is set up properly so that the...
Hi: I had a problem like this with a more recent version of Heritrix, and it turned out that the problem was that at the end of a single site crawl, there was...
Gordon Paynter
Gordon.Paynter@...
Oct 9, 2005 9:04 pm
2230
Hi, Can I exclude a certain file type from being crawled? For example, video files like .wmv files....
I have been getting this message in the seeds report while the job is running. For example, I am crawling only 3 seeds, and one of the seeds has this message and never...
... org.archive.net.UURI. Study its superclasses LaxURI and commons-httpclient URI. Also see the UURIFactory#fixup code. ... Sounds like something we should...
Hi, The problematic seed keeps changing. It doesn't have a redirect or anything. For example, in one job I have 3 seeds and the problematic seed can be anything,...
Hi, Jay. Some ideas: (1) Update your code. CVS HEAD often has problems, even occasionally fatal problems, but they are also regularly fixed. Your version from ...
Hello Gordon, Thanks for your suggestion. I have just tried everything except updating the code, which I will try momentarily. My comments following...
Hi Folks, While checking out CVS HEAD and rebuilding Heritrix, I ran into this question. Are you guys planning to leave compatibility with Java 1.4.2 and...
Thanks for the pointers - the URL I gave was not complete - this one is: http://www.bs.dk/content.aspx?itemguid={31637766-92B4-4ACA-9A0D-5CFF042B151E} URLs...
Dear all, The current Heritrix 1.5.1 now breaks one of our unit tests, which tests the validity of an ARC file: ARCReader ar = ARCReaderFactory.get(anArc); ...
Hey Søren: Apologies for breaking your test. Below is an excerpt from the commit message that removed isValid. revision 1.48 date: 2005/07/16 01:47:59; author:...
We intend to keep compatible with 1.4.2 for now -- that commit slipped in inadvertently, and will be fixed. In advance of some future official release, we'll...
Hi Gordon, I got the HEAD from CVS and am doing a test crawl, and so far I haven't seen the problem. But the Delete function from "View or Edit Frontier URIs" is...
I am trying to drop Heritrix into Tomcat 5.0.28 to test it out and so I can run a remote debug on it to see exactly what is going on as the JSP interacts with...