... quickly is ... great ... assuming right ... policy ... I did that and the OOME did not reappear. I cannot say which of the two options was the cause (since...
Hi: I downloaded and installed Heritrix 2 on my machine. I followed the guide for Version 2 and only changed the contact URL and email. My goal is to crawl URLs...
It could be useful to have a bin/warcreader utility in Heritrix, like the existing arcreader, to make .cdx indexes of WARC files. Right now I'm using the warc-indexer...
raffaele messuti
raffaele@...
Sep 2, 2008 10:39 am
5440
I ran into the problem again today and I think I have a more specific question to ask now. I found my processes in a situation where all ten toe threads were ...
Matt Kent
matthew.e.kent@...
Sep 2, 2008 5:46 pm
5441
... Glad to hear the problem cleared up. I would like to know if only one of the changes is enough to trigger the OOME -- there must be a bug here somewhere....
I haven't seen this before. FWIW, 1.14.1 moved to BDB-JE version 3.3.62 (whereas I believe 1.14.0 used BDB-JE 3.2.76). I don't have any specific reason to...
The WARC-reader bundled with our Wayback project will be the best one the Internet Archive has and what we use in our own projects. We'd like to hear of any...
Hi Jean-Noel, OK, I think I found the problem. When you create a new sheet Add Single Sheet ... [my-new-sheet] and click Submit, you drop right into Settings...
It's not an issue with your scoping/prefixes. I tried crawling <http://www.genealogy.ams.org/id.php?id=123>, and the result line in the crawl.log was: ...
... options. It ... see if ... one of ... For what it's worth I've run into this bug using 'most-favored' robots policy and standard ARC sizes. Never thought...
Great Thanks! ... the ... code ... understand." ... browser, and ... doesn't ... the ... HTTP 'Accept' ... any ... smallest, ... add to ... the ... because it ...
Hello, everyone: I have a problem crawling some websites. The problem is not consistently reproducible. It occurs on some computers, but on other computers it does not occur this...
Hi, I am using the Heritrix crawler to crawl domains. The problem I am facing now is that when I try to crawl some domains, even though the...
... Hi, I'll have a look into my magic crystal ball to find the answer... ;-) ... Well, first of all check which URLs aren't handled by Heritrix. Then recheck...
Hello! I tried to find this in the docs but it seems a rather uncommon use case to explicitly exclude some domain (including all subdomains) from a crawl. Any...
Here you go. The XML object to not crawl these guys. Pretty straightforward. Use it, early and often. Where nocrawl-all.surt is the list of do not crawl. ...
Hi all, we are evaluating the option of running Heritrix on Amazon's EC2 servers. Is anyone else running the crawler on EC2, and how is your experience with it?...
Hi, I have used a method in the past with version 1.12 of Heritrix where if you add a Beanshell processor after the pre-processor and reset any CrawlURI...
We have been running Heritrix 1.14 on 4 EC2 machines for the last 4 months. It's working well, but when I compare the performance on our Paris and Amsterdam machines...
It seems that Heritrix 1.14.1 is the most recent release. The Javadoc claims it represents version 1.15.2, but I can find no evidence of said version in SVN or...
Matt Kent
matthew.e.kent@...
Sep 10, 2008 8:30 pm
5460
For anyone interested, I found the build hiding in Cruise Control: http://builds.archive.org:8080/cruisecontrol/artifacts/HEAD-heritrix/20080808001045/...
Matt Kent
matthew.e.kent@...
Sep 10, 2008 8:34 pm
5461
Hi, does anyone know if there's a way to retrieve the seed URL from a document in the ARC file? In other words, how to find out which original seed is a...
At Wed, 10 Sep 2008 13:34:25 -0700, ... Hi Matt. It sounds like you know this, but this is presumably not a released version but a build from trunk (or HEAD)...
Versions of Heritrix from SVN TRUNK will have odd numbers in the second position, eg. the '15' in 1.15.2. (That's the current label of what's in SVN.) No...
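The numbering rule described above can be checked mechanically. The following is a minimal sketch, not anything shipped with Heritrix: the function name `is_trunk_version` is hypothetical, and it assumes a plain dotted version string such as `1.15.2`. It tests only whether the second component is odd, which is the one rule stated in the message.

```python
def is_trunk_version(version: str) -> bool:
    """Return True if the second dotted component of `version` is odd.

    Hypothetical helper: per the message above, builds from SVN TRUNK
    carry an odd number in the second position (the '15' in 1.15.2).
    """
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least two dotted components: {version!r}")
    # int() raises ValueError if the component is not a plain integer.
    return int(parts[1]) % 2 == 1


if __name__ == "__main__":
    for v in ("1.15.2", "1.14.1"):
        print(v, is_trunk_version(v))
```

With the two versions mentioned in this thread, `1.15.2` is reported as a trunk label and `1.14.1` is not.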
Excellent, thanks for the explanation. It seems that 1.14.1 will be sufficient for me. Matt...
Matt Kent
matthew.e.kent@...
Sep 10, 2008 10:24 pm
5465
... down. ... Hi all! I have the same problem with the size of some of the logs created by WCT (Web Curator Tool). I can't upgrade the WCT version nor ...
Hi all! I have the same problem with the size of some of the logs created by WCT (Web Curator Tool). I can't upgrade the WCT version nor the embedded Heritrix...
Hi all. I’m experimenting with squeezing more crawlers out of a single JVM in Heritrix. (Background: it is possible to run multiple crawlers in a single JVM,...