I've set max-hops to 2 in my TooManyHopsDecideRule in an effort to start from small sample crawls and build up from there, in large part based on research...
I believe you want '<boolean name="if-match-return">false</boolean>' in your case, no? Tree's filter is for *including* only html in the crawl. If you want...
... Be a bit careful here. The job name is actually a little more involved. Notice how in CrawlJob#preRegister (line 2041) we build up the name adding the...
... Yes. The parse is unable to pull a 'host' from URIs the likes of "http:/" or "http:/~galvan/index.html". ... Agreed. I tried them in our test harness and...
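To illustrate the kind of host-less URI being discussed, here is a minimal sketch using the JDK's java.net.URI as a stand-in. This is an assumption for illustration only: Heritrix has its own URI handling, which may behave differently. The class name HostCheck, the method hostOf, and the example.org URI are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Heritrix: the JDK parser is only a stand-in.
final class HostCheck {
    // Returns the host component, or null if the URI has none or does not parse.
    static String hostOf(String uri) {
        try {
            return new URI(uri).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http:/",
            "http:/~galvan/index.html",
            "http://example.org/index.html"
        };
        for (String s : samples) {
            System.out.println(s + " -> host=" + hostOf(s));
        }
    }
}
```

With a single slash after the scheme there is no authority component, so there is nothing from which to take a host; only the last sample yields one.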
... I'm surprised it's crawling anything of interest at all: 2 hops deep is not very much, and even a broad crawl started from a massive directory page that...
Hi Gordon- Looks like that's exactly what's happening. The crawl log is only still reporting 3 sites out of 20. One of them, washingtonpost.com, looks like...
Ah -- I was using CrawlJob from what I think was the 1.6 release, and preRegister was simpler then. More from me after some more experimentation. I've gotten...
... That's true. ... Where are they coming from? (I'd guess they are because you're invoking operations against a null bean -- and you're getting the null bean...
Hello - I don't know if this program is overkill or just right, so, I figured I would ask the group. I need to create an application that will check links and...
... It's straightforward enough to plug in a little module that, per page, asks Heritrix for the list of links found to try against an external database. But...
All due respect to Heritrix, this would be much easier to implement in Perl with the various CPAN libraries than trying to wedge it into Heritrix. -- Tom...
Hi, We got dinged again by using Heritrix in that a crawlee complained that we were ignoring their robots.txt file. On the face of it, they look like they are...
Hi Karl, Heritrix rechecks robots.txt files every 24 hours by default. Did you change that value? It seems that this robots file has been recently modified and...
... It seems clear from the logs that 24 hours elapsed since whatever change occurred. The date on the robots.txt is 3/6 and the date of the crawl is 3/8. I...
Has anyone seen this error before? I used to have a very stable configuration that I could run for at least a couple of weeks, but now I'm testing out the new...
... If you are saving crawl results to ARCs, the robots.txt that was consulted will be in the crawl ARCs -- as will be each daily refetch. ... Such bugs are...
... we're not; we're discarding them. So we are out of luck. ... I was initially suspicious because the readLine() javadoc said it only recognized line...
Hi all, Just wanted to let you know we've started this page on embedding Heritrix. http://crawler.archive.org/cgi-bin/wiki.pl?EmbeddingHeritrix It's just a...
... crawl ... haven't ... We just started the production crawl --- 8 crawlers, each equipped with 3TB of storage (yeah I took your suggestion seriously, plenty...
From the developer manual: "For each processor only one instance is created per crawl. As there are multiple threads running these processors must be carefully...
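As a rough illustration of why a single shared instance needs care, here is a minimal sketch of a processor-like class whose only state is updated atomically. It does not extend any Heritrix class; CountingProcessor, process, and processedCount are hypothetical names chosen for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a processor; it does not use the Heritrix API.
final class CountingProcessor {
    // Shared by every worker thread, so it is updated atomically.
    private final AtomicLong processed = new AtomicLong();

    // May be called concurrently by many threads on the one shared instance.
    void process(String uri) {
        processed.incrementAndGet();
    }

    long processedCount() {
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CountingProcessor p = new CountingProcessor();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    p.process("http://example.org/" + j);
                }
            });
            workers[i].start();
        }
        // Wait for all workers before reading the total.
        for (Thread t : workers) {
            t.join();
        }
        System.out.println(p.processedCount());
    }
}
```

A plain long field incremented with ++ could lose updates under the same load, which is the kind of problem the quoted passage warns about.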
Sorry if this is an abuse of this mailing list but it seemed the best way to get this out to developers who have experience with Heritrix and might be...
Andrea Goethals
andrea_goethals@...
Mar 17, 2006 6:41 pm
2748
Hi there, Is there any way to reuse a job after it has finished? I mean, after a job finishes its crawl, could it be enqueued again automatically? Thanks very...
I checked the manual and didn't find the meaning of 'P' in the discovery path (crawl.log). In terms of "R", will it affect the link depth of a crawling job....
... No. Automated rescheduling of jobs is not part of the crawler. Will Yan Zhang's suggestion work for you? You could create and queue up lots of the same...
... See 'Discovery path' in the glossary section of the user manual. 'P' is for prerequisites (dns or robots). ... It looks like all on the discovery path gets...
The 'max-link-hops' value for the 'classic' scopes (BroadScope, DomainScope, HostScope, PathScope) only counts plain navigation-link 'L' hops. So in each of...
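The counting rule described above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's actual code: HopCounter and its methods are hypothetical names, the discovery path is taken to be the string of hop letters from crawl.log, and treating a path as in scope while its 'L' count does not exceed the limit is an assumption about the boundary.

```java
// Hypothetical helper, not part of Heritrix.
final class HopCounter {
    // Counts only 'L' (navigation link) hops; other letters such as
    // 'P' (prerequisite) or 'R' are ignored.
    static int countLinkHops(String discoveryPath) {
        int hops = 0;
        for (int i = 0; i < discoveryPath.length(); i++) {
            if (discoveryPath.charAt(i) == 'L') {
                hops++;
            }
        }
        return hops;
    }

    // Assumed boundary: in scope while the 'L' count is at most maxLinkHops.
    static boolean withinMaxLinkHops(String discoveryPath, int maxLinkHops) {
        return countLinkHops(discoveryPath) <= maxLinkHops;
    }

    public static void main(String[] args) {
        System.out.println(countLinkHops("LLRP"));
        System.out.println(withinMaxLinkHops("LLRP", 2));
        System.out.println(withinMaxLinkHops("LLL", 2));
    }
}
```

Under this reading, a path such as "LLRP" counts as two link hops even though it is four steps long.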