The discovery path is explained here: http://crawler.archive.org/articles/user_manual/glossary.html#discoverypath Best, Bjarne Andersen, netarchive.dk ...
the TransclusionDecideRule is explained here: http://crawler.archive.org/apidocs/org/archive/crawler/deciderules/TransclusionDecideRule.html It is used for...
err... I read the User/Developer Manual and nearly all the docs on the Heritrix website. I'm confused that you directly said "NO" but asked for my explanation of my ...
Hello Lilei, Can you explain what you mean by isPageKnown? And also how you want to use it, i.e. what you would do with the result of the operation? Noah...
Hi Noah, let's put it this way: I want to know what Heritrix will do if it fetches a page that it has already seen, that is, the same page content but with different URLs....
Hi, Lilei. It's not completely clear what you're referring to without additional context, so it's hard to give a definitive answer. Are you referring to the...
Hi, I am trying to run H3-beta using the 'r' command-line option. I am prompted with this message: "You must specify a password for the web interface using...
Heritrix 3 always launches with the web interface for monitoring and remote-control, and so your choice of administrator credentials must always be supplied at...
We're about to choose the final dates for a 2-3 day Heritrix Expert Summit in San Francisco, from among candidate dates in January-April of 2010. The idea of...
Hi, I noticed Heritrix has additional writers for HBase and Hadoop (to write crawled content only), but can I run a distributed crawler in a cluster? Thanks...
Thanks Gordon, Using curl I am able to access a broader range of actions (build,launch,terminate etc). My goal is to set up cron jobs to launch the same job...
I would like to add some crawl profiles that would be available immediately after installing Heritrix without any additional steps by the person doing the...
Hello I'm trying to get QueueOverbudgetDecideRule to work but I don't seem to be able to do this. Is this module still functional or maybe I have added it to a...
Hello guys, I've been using heritrix to do some crawls with about 10 seeds. I find that I am getting excessively large amounts of trash in the data I collect....
Hello all, I am new to this group. I am looking for a web crawler that can get the list of links on a webpage and convert a downloaded website into an MHT file. It...
hi Tram, can you be more specific about what you consider "junk"? the default profile includes a TransclusionDecideRule which tells the crawler to transitively...
Hi, Is there any documentation on the Heritrix implementation of WARC beyond just the source code? i.e. elements from the specification in-/excluded, which...
Coram, Roger
Roger.Coram@...
Oct 27, 2009 4:05 pm
6131
hi Roger, the latest versions of Heritrix deliver warc output in format: "WARC File Format 1.0" which conforms to the ISO 28500 specification, an ISO standard...
Hi, I had set up a crawl job to run for 6 hours using H3-beta. I had configured it to be least polite, and the number of parallel queues was set to 5. After...
The 'queued' URIs are almost certainly on some hosts that are not responding. Heritrix is trying them every 15 minutes, but then putting them back on the queue...
When and where does this error appear? (For example: at the time Heritrix is launched, at the time you try to start a crawl, at the time you edit settings,...
Hi Gordon. I can't find the tool to migrate 1.X configurations to 3.X style configurations. I have downloaded the heritrix-3.0.0-beta-dist.tar.gz from...
I rebooted, and now the http://127.0.0.1:8080 address shows "Failed to connect". http://localhost:8080 doesn't work either. When I start the terminal its...
After some testing I determined that conf/profiles is created lazily if either a new profile is created or if the default profile is edited in the web UI. To...
Hi, In H3 I am trying to set up crawl jobs that use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to discard duplicate content. I can get...
Matthew Warhaftig
mwarhaftig@...
Nov 8, 2009 10:05 pm
6141
I have written a script that controls the build, launch and termination of jobs using curl commands. I pass along a parameter to the script telling it how long...
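A minimal sketch of the kind of timed build/launch/terminate script described above, written in Python rather than taken from the original message. It only assembles the curl commands and the order of steps; it does not run them. The engine URL, the `admin:admin` credentials, the job name `myjob`, and the helper names `curl_command` and `crawl_plan` are all assumptions for illustration and would need to match the local Heritrix 3 installation.

```python
import shlex

# Assumed engine address; adjust to wherever the Heritrix 3 web interface listens.
ENGINE_URL = "https://localhost:8443/engine"
# Only the three actions named in the message are handled here.
ALLOWED_ACTIONS = ("build", "launch", "terminate")


def curl_command(job, action, user="admin", password="admin"):
    """Return the argv list for a curl call that posts one action to a job.

    The credentials default to placeholder values (an assumption).
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return [
        "curl", "-k", "--anyauth", "--location",
        "-u", f"{user}:{password}",
        "-d", f"action={action}",
        f"{ENGINE_URL}/job/{job}",
    ]


def crawl_plan(job, run_seconds):
    """List the steps of a timed crawl: build, launch, wait, terminate."""
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return [
        ("run", curl_command(job, "build")),
        ("run", curl_command(job, "launch")),
        ("sleep", run_seconds),
        ("run", curl_command(job, "terminate")),
    ]


if __name__ == "__main__":
    # Print the plan for a hypothetical six-hour crawl instead of executing it.
    for kind, payload in crawl_plan("myjob", 6 * 3600):
        if kind == "sleep":
            print(f"sleep {payload}")
        else:
            print(shlex.join(payload))
```

Keeping the plan as data makes it easy to inspect before handing each `run` step to `subprocess.run` from a cron-driven wrapper; whether further actions are needed between launch and terminate depends on the job configuration.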
We got an email from a website owner who encountered many attempts from us. Heritrix ran with the default configuration. We searched the crawl.log file for the details...