... not in java, but ruby: http://anemone.rubyforge.org/ just use from shell: $ anemone url-list http://crawler.archive.org > seeds.txt -- ...
raffaele messuti
raffaele@...
Nov 13, 2009 10:21 am
6150
Hi, at the National Library of Finland we have made some improvements to Heritrix: - document classification (content-based) deciderules support (e.g....
Tomas Ukkonen
tomas.ukkonen@...
Nov 13, 2009 11:38 am
6151
Good advice, thank you Gordon. Adding the recrawl processors to the chain bean and pointing PersistLoadProcessor directly to my existing history (no preload)...
I'd suggest opening up an issue (probably multiple issues in your case) in the crawler issue base (https://webarchive.jira.com/browse/HER) and to then attach...
Hi, I want to know about IDN support in Heritrix. ("Internationalized domain name" http://en.wikipedia.org/wiki/Internationalized_Domain_Name) I tried to...
takeru sasaki
sasaki.takeru@...
Nov 16, 2009 11:27 am
6154
Hi, IA is hiring a tech lead for our new web crawling initiative. You can read the job description here: http://www.archive.org/about/webjobs.php#wwcengineer ...
Hi Kris, Thank you for your reply. I will use JIRA as you suggested and attach patches against the latest 1.14.x revision from the Heritrix repository. Regards, --...
Tomas Ukkonen
tomas.ukkonen@...
Nov 18, 2009 3:57 pm
6156
I reinstalled Heritrix and got the same message again after configuring my first job: Wrong document type 'crawl-order' in...
Hi all. Is it possible in Heritrix 1.14.3 to avoid overloading webservers hosting many virtual servers? We currently have the problem that those webservers...
Hi, It is not actually possible to guarantee this. There is no real way to distinguish for sure where the actual physical hardware that hosts a name is. There...
John pretty much sums up the problem. The way I've dealt with this has been on a case by case basis. Each time we detect this situation, we override the...
Our main challenge is that we need the queues separated on TLD (foo.com and bar.com) to use the quota-enforcer to limit number of bytes on each TLD but at the...
Hey crawlers: I was at ApacheCon in Oakland in early November and was present during a meeting of a few of the open source crawler projects (Ken Krugler for Bixo, ...
Hi Takeru, Heritrix automatically converts Internationalized Domain Name seeds and discovered links to punycode. However, I recreated your issue of certain...
Hi all, Although I searched thoroughly in the group messages, my searches didn't end up with a solution. I am looking for a way to exclude a list of urls in...
The first Release Candidate test release of Heritrix 3.0 is now available, version identifier 3.0.0-RC1. We encourage expert Heritrix users curious about the...
Hello. I am using Heritrix 2.0.2 with configuration based on "Broad but shallow crawl". It seems that I've managed to set up everything OK. The one problem I...
Віталій...
tivv00@...
Nov 23, 2009 11:33 am
6166
Hi, I recently started using Heritrix 1.14.3 and I do not understand how to limit the number of documents per host. The parameter max-document-download limits the total...
You should most likely use the QuotaEnforcer module that allows you to set number of documents (successful and total) and number of bytes per queue. If at the...
Thank you Matt, I know about IDN in Heritrix. And I know the problem about seed. I am watching it. takeru 2009/11/22 Matthew Warhaftig <mwarhaftig@...>...
takeru sasaki
sasaki.takeru@...
Nov 24, 2009 4:19 am
6169
Hello Matt and Gordon, Following Gordon's advice and assuming HER-1706 to be fixed, I am using only two of the persistProcessors: load and store. As suggested,...
sorry about the broken link sent before. http://cs.odu.edu/~pramo_p/crawler-beans.cxml ... From: Pranay Pandey <sspranay@...> Subject: [archive-crawler]...
I see two potential issues in your order: - There's no FetchHistoryProcessor, which is still necessary to collect deduplication-relevant information and insert...
Hi all, This is not a "real" Heritrix issue but maybe somebody has an answer to this: I programmed 2 processors for Heritrix 2 and put my classes in some...
Takeru & Matt: Thanks for the report of this issue. I confirmed there were some problems with both the admin-UI editing of seeds, and the reading of seeds from...
... If <http://soamoa.org:9292/artistRegistry?WSDL> is the URL you want, you should add it as a seed -- then you don't have to wait for it to be discovered. -...
... Actually I've found the problem and it looks to me like a bug. The root:queue-assignment-policy setting is not used for queue assignment, the one...