Takeru & Matt: Thanks for the report of this issue. I confirmed there were some problems with both the admin-UI editing of seeds, and the reading of seeds from...
Hi all, This is not a "real" Heritrix issue but maybe somebody has an answer to this: I programmed 2 processors for Heritrix 2 and put my classes in some...
I see two potential issues in your order: - There's no FetchHistoryProcessor, which is still necessary to collect deduplication-relevant information and insert...
Sorry about the broken link sent before. http://cs.odu.edu/~pramo_p/crawler-beans.cxml ... From: Pranay Pandey <sspranay@...> Subject: [archive-crawler]...
Hello Matt and Gordon, Following Gordon's advice and assuming HER-1706 to be fixed, I am using only two of the persistProcessors: load and store. As suggested,...
Thank you Matt, I know about IDN in Heritrix. And I know about the problem with seeds. I am watching it. takeru 2009/11/22 Matthew Warhaftig <mwarhaftig@...>...
takeru sasaki
sasaki.takeru@...
Nov 24, 2009 4:19 am
6167
You should most likely use the QuotaEnforcer module, which allows you to set the number of documents (successful and total) and the number of bytes per queue. If at the...
Hi, I recently used Heritrix 1.14.3 and I do not understand how to limit the number of documents per host. The parameter max-document-download limits the total...
Hello. I am using Heritrix 2.0.2 with a configuration based on "Broad but shallow crawl". It seems that I've managed to set everything up OK. The one problem I...
Віталій...
tivv00@...
Nov 23, 2009 11:33 am
6164
The first Release Candidate test release of Heritrix 3.0 is now available, version identifier 3.0.0-RC1. We encourage expert Heritrix users curious about the...
Hi all, Although I searched thoroughly in the group messages, my searches didn't turn up a solution. I am looking for a way to exclude a list of URLs in...
Hi Takeru, Heritrix automatically converts Internationalized Domain Name seeds and discovered links to punycode. However, I recreated your issue of certain...
Hey crawlers: I was at ApacheCon in Oakland in early November and was present during a meeting of a few of the open source crawler projects (Ken Krugler for Bixo, ...
Our main challenge is that we need the queues separated on TLD (foo.com and bar.com) to use the quota-enforcer to limit the number of bytes on each TLD but at the...
John pretty much sums up the problem. The way I've dealt with this has been on a case by case basis. Each time we detect this situation, we override the...
Hi, It is not actually possible to guarantee this. There is no real way to distinguish for sure where the actual physical hardware that hosts a name is. There...
Hi all. Is it possible in Heritrix 1.14.3 to avoid overloading webservers hosting many virtual servers? We currently have the problem that those webservers...
Hi Kris, Thank you for your reply. I will use JIRA as you suggested and attach patches against the latest 1.14.x revision from the Heritrix repository. Regards, --...
Tomas Ukkonen
tomas.ukkonen@...
Nov 18, 2009 3:57 pm
6154
Hi, IA is hiring a tech lead for our new web crawling initiative. You can read the job description here: http://www.archive.org/about/webjobs.php#wwcengineer ...
Hi, I want to know about IDN support in Heritrix. ("Internationalized domain name" http://en.wikipedia.org/wiki/Internationalized_Domain_Name) I tried to...
takeru sasaki
sasaki.takeru@...
Nov 16, 2009 11:27 am
6152
I'd suggest opening up an issue (probably multiple issues in your case) in the crawler issue base (https://webarchive.jira.com/browse/HER) and to then attach...
Good advice, thank you Gordon. Adding the recrawl processors to the chain bean and pointing PersistLoadProcessor directly to my existing history (no preload)...
Hi, At the National Library of Finland we have made some improvements to Heritrix: - document classification (content-based) deciderules support (e.g....
Tomas Ukkonen
tomas.ukkonen@...
Nov 13, 2009 11:38 am
6149
... not in Java, but Ruby: http://anemone.rubyforge.org/ just use from shell: $ anemone url-list http://crawler.archive.org > seeds.txt -- ...
raffaele messuti
raffaele@...
Nov 13, 2009 10:21 am
6148
I am looking for a simple way to spider web pages from within an app I am working on. I know Heritrix is not intended to be used as a library, but would using...
Heritrix cannot execute JavaScript, so its link extraction with respect to JavaScript uses a crude heuristic of trying strings that might be relative URIs...