Sorry, my mistake! Those URLs were the referring ones; the invalid ones were all 'invalid:' in the crawl.log. Best, Bjarne Andersen ... -- Bjarne Andersen ...
Using Head: 1.5.0-200507251046 ToeThreadReport (slightly edited to just show times): Check Thread #5 and #6, active for + 6m and + 3m. These times don't seem...
Followup: This seems to be progressive. I launched a new crawl (same order.xml, different seeds), and while I'm seeing LinkScoper processors running for 7-10...
Dear All, I've been looking into the User Manual, but I was not able to find out how to do the following: I would like to do a crawl in which I do not download...
Marco, You can use the DomainSensitiveFrontier and set the "max-docs" parameter in the Frontier settings to the maximum number from each website. Rob. ... -- ...
Two things that first come to mind: (1) Do the URIs they're working on have a gigantic number of outlinks? [*] (2) What Scope are you using? The legacy scopes...
... [*] Not a potential cause of the slowdown, but related: A new feature in CVS HEAD sets a cap on the number of links that can be found in any page, as a...
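As a rough illustration of the per-page link cap mentioned in the footnote above, here is a minimal sketch in plain Java. It is not Heritrix's implementation: the class name, the `extract` method, and the naive `href` regular expression are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: stops collecting links once a cap is hit.
public class CappedLinkExtractor {
    // Naive double-quoted href matcher, for illustration only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns at most maxLinks href values found in html, in document order. */
    public static List<String> extract(String html, int maxLinks) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        // Check the cap first so no further matching is done once it is reached.
        while (links.size() < maxLinks && m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">a</a><a href=\"b.html\">b</a><a href=\"c.html\">c</a>";
        System.out.println(extract(html, 2)); // prints [a.html, b.html]
    }
}
```

Checking the cap before each match keeps the cost bounded on pages with an extreme number of outlinks, which is the situation the cap is meant to guard against.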
... Hey Bjarne: The 'invalid:' scheme is added to URLs when the deserialization of CrawlURIs fails. It shouldn't ever happen but we see it from time to time...
I crawled with heritrix-1.5.0-200506151248 - so that's without the fix - this is because that's the version we currently use in our production system - to...
Our experience with the DomainSensitiveFrontier is that it's not suited for large seed lists (e.g. 10,000 seeds, which is our default crawl size) - things slow...
Hi members, I came across a problem running Heritrix on Windows XP: it was throwing an exception in the terminal, "java.io.IOException: Cannot find subdir: conf". I just...
Would it be possible to put in new QueueAssignmentPolicies without having to alter: private final static String [] AVAILABLE_QUEUE_ASSIGNMENT_POLICIES in...
... Yes. How about adding a line in heritrix.properties that looked like this: org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy =...
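The suggestion above (take the list of policies from a property rather than a hard-coded array) could look roughly like the following sketch. The property key is the one quoted in the message; everything else is an assumption, since the message is truncated before the value: the whitespace-separated format, the `availablePolicies` helper, and the `com.example.*` class names are hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch, not Heritrix code: resolve the policy list from a property.
public class PolicyListFromProperties {
    // Key as quoted in the message above.
    static final String KEY =
            "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    /**
     * Returns the class names listed under KEY, or defaults when the property
     * is missing or blank. A whitespace-separated value is an assumption.
     */
    public static String[] availablePolicies(Properties props, String[] defaults) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().length() == 0) {
            return defaults;
        }
        return raw.trim().split("\\s+");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder class names, invented for this example.
        props.setProperty(KEY, "com.example.PolicyA com.example.PolicyB");
        for (String name : availablePolicies(props, new String[0])) {
            System.out.println(name);
        }
    }
}
```

Falling back to a built-in default keeps existing installations working when the property is absent.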
Hi, I am new to this and I want to write code to, let's say, crawl a couple of web sites. Can anyone suggest where I can get some sample code? I went through the...
Thanks for sharing this. I gave up a while back because I was having so many problems, but soon I want to try again. If you have any other stuff to share in ...
... Let me have a go at this in the next day or so. ... There is one built into the jar but if we find one on filesystem, then we use this one instead. St.Ack...
Yeah, that is OK, but I want to do this through Java code. Any idea how I can do this? Regards, Madhur ... at all. ... if ... set ... your seeds ... couple of ......
... Not really, I examined a dozen or so URLs in the list. A few pages had a lot of links (maybe 1200-1500 at most), but most were normal pages. ... Since I...
... See http://crawler.archive.org/faq.html#embedding for a start. Thereafter, study Heritrix.java class for how it adds jobs. Or there is a JMX interface to...
Hi, One more addition to the "Windows XP" problem. "heritrix-1.4.0.zip" doesn't contain the default profile. I tried in Linux (with the same zip file), and it works...
I took OS/X and Java 1.4.2 out of the picture, by starting up a crawl on a RedHat Linux machine with Java 1.5. I noticed Java 1.5 provides toe thread stack...
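The message above is cut off, but it presumably refers to the thread stack trace API that arrived with Java 1.5 (`Thread.getAllStackTraces()`). A minimal sketch of dumping every live thread's stack with that API follows; the `ThreadDump` class and its output format are assumptions for illustration, not how Heritrix renders its toe thread report.

```java
import java.util.Map;

// Hypothetical helper: formats the stack of every live thread in this JVM.
public class ThreadDump {
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        // Thread.getAllStackTraces() is available from Java 1.5 onward.
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append(t.getName()).append(" [").append(t.getState()).append("]\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Taking two dumps a few seconds apart and comparing the frames of a long-running thread is a simple way to see whether it is stuck in one place or still making progress.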
Hello fellow crawlers, At Luxembourg's national library we're taking our first steps in crawling. I'm looking at this myself at the moment, I've managed to...
... This sounds like this issue: http://crawler.archive.org/faq.html#windowsstart. Do you have your own wrapper script starting up Heritrix or are you using...
... Yes. It's kinda sweet. Only available on 1.5.x JVM though (Feature courtesy of Christian Kohlschütter). ... expungeStaleEntries is called every time we...