I'm not using a mid-fetch filter, but rather Tom's URI regex exclude-filter in the scope. Now that I think about it, this problem started happening around...
... https://sourceforge.net/tracker/index.php?func=detail&aid=1116204&group_id=73833&atid=539099 ... Thx for the suggestion. We tried it on my friend's computer...
... We're looking into adding accounting that will allow the running of multiple crawls inside of a single running instance -- More to follow on this after it...
I have enough machine resources to start up another crawler instance (once it gets to this low thread parallelism state the CPU consumption drops way down) the...
... Sounds good. Let us know if you can think of something we should add to the crawler to help you implement this strategy (You might have suggestions for...
... I tried it. When the URIs were not in scope -- which was the case for the first few I tried -- I got a -5000 in the crawl.log: e.g. '20050301212535071...
I might have been misunderstanding what the Add to Frontier was supposed to do, and in light of your comment it makes sense. The frontier will be determined...
Had some trouble with SurtPrefixScope (in 1.2.0 and HEAD), and I now believe the trouble is PEBKAC; please to verify? Summary: created a list of SURTs,...
... Yes. Adding to the seed list changes scope (And adds the URI to the queue). Adding via the Frontier screen does not alter scope. It just adds item to...
Hi everybody! I've got another question (I know... again ;)): If I understood it correctly, every time an exception is thrown during the crawl, the crawler waits...
... Hmm, I just noticed that I was wrong, because I had a queue snoozing but it still continued to crawl (I guess it just stopped crawling last time because there...
Queues can be snoozed to enforce politeness (usually just a few seconds); this is controlled by the delay-factor, max-delay-ms and min-delay-ms settings on the...
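A rough sketch of how the three politeness settings named above might combine. It assumes the snooze time is the duration of the last fetch multiplied by delay-factor, then clamped between min-delay-ms and max-delay-ms. That is a reading of the setting names, not code taken from the crawler, and `snooze_ms` is a made-up helper name.

```python
def snooze_ms(last_fetch_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Return how long a queue would be snoozed, in milliseconds.

    Assumption: delay = delay_factor * last fetch duration,
    clamped to the range [min_delay_ms, max_delay_ms].
    """
    delay = delay_factor * last_fetch_ms
    # Clamp from above first, then from below.
    return int(max(min_delay_ms, min(delay, max_delay_ms)))


if __name__ == "__main__":
    # Example values are illustrative only, not crawler defaults.
    for fetch_ms in (100, 600, 3000):
        print(fetch_ms, "->", snooze_ms(fetch_ms, 5.0, 2000, 5000))
```

Under this assumption a fast host is still left alone for at least min-delay-ms, and a very slow host never stalls its queue for longer than max-delay-ms.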
The most common causes of retryable errors are network connection failures or drops -- it's presumed they may be transient conditions that will improve in a...
The below should be fixed as part of the recent commit of an updated DomainSensitiveFrontier (DSF adds filters midcrawl that stop further downloads from a host...
Yes: it's never been intended to use SURT form to specify seeds, only scopes. To be more precise, what is used to specify scope are SURT *prefixes*, which may...
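To illustrate why SURT prefixes suit scopes rather than seeds, here is a simplified sketch of the idea: the host name is reversed so that a plain string-prefix test accepts a whole domain and its subdomains. The functions `to_surt` and `in_scope` are hypothetical names, and the conversion ignores ports, user info and query strings, so it is not the crawler's actual SURT implementation.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT form: scheme://(reversed,host,)/path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # www.archive.org -> org,archive,www,
    reversed_host = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    return f"{parts.scheme}://({reversed_host}){path}"


def in_scope(url, surt_prefixes):
    """A URI is in scope if its SURT form starts with any prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


if __name__ == "__main__":
    prefixes = ["http://(org,archive,"]
    print(to_surt("http://www.archive.org/index.html"))
    print(in_scope("http://crawler.archive.org/docs", prefixes))
    print(in_scope("http://example.com/", prefixes))
```

A prefix such as `http://(org,archive,` is not itself a fetchable URI, which is one way to see why it belongs in the scope definition and not in the seed list.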
Hi, If you use WUI, you may add org.archive.crawler.filter.URIRegExpFilter on the Filter page (Scope->exclude-filter), go to Setting page, use a regular ...
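Before putting a regular expression into an exclude-filter, it can help to try it against a few sample URIs. The sketch below assumes the filter has to match the whole URI (hence `re.fullmatch`); check that against your crawler version. The pattern and the `is_excluded` helper are invented for illustration only.

```python
import re

# Hypothetical pattern: exclude common image types, with or without a query string.
EXCLUDE_PATTERN = re.compile(r".*\.(jpe?g|gif|png)(\?.*)?", re.IGNORECASE)


def is_excluded(uri, pattern=EXCLUDE_PATTERN):
    """True if the whole URI matches the exclude pattern."""
    return pattern.fullmatch(uri) is not None


if __name__ == "__main__":
    for uri in ("http://example.com/logo.GIF",
                "http://example.com/photo.jpg?size=large",
                "http://example.com/index.html"):
        print(uri, "->", "excluded" if is_excluded(uri) else "kept")
```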
I can only find Heritrix 1.2.0 from http://sourceforge.net/projects/archive-crawler/. But I noticed that some guys already began experiments with Heritrix...
... There is no 1.3.0 'release'. 1.3.0 is the 'version' number of builds made using the unreleased HEAD of the source tree (Our system for version numbering...
Hello! Now I'm trying to read the crawled html files using the API provided by heritrix. While reading gzipped ARC files, I always get an error message "Failed...
Rev Tamas
bridgeman@...
Mar 13, 2005 12:54 am
1653
You may as well check your path. I also use 1.2.0, I can use arcreader to read arc file. The command I used is like: /yourpath/heritrix-1.2.0/bin/arcreader...
Hi all, I have edited the simple profile by adding overrides for a particular domain. In the override I just modified certain scope parameters such as I...
if you know something more about filters do let me know :) The whole filter thing seems to be too confusing and even after spending 4-5 hours today I am not...
Thanks! So I need to use not the API but the command-line interface. Tamas...
Rev Tamas
bridgeman@...
Mar 14, 2005 11:36 am
1657
... I'd suggest following Yan's suggestion and get the command line interface working first. Once that's working against your ARCs and it's not throwing 'Failed...
... You might check the files under YOUR_NEW_PROFILE/settings -- the location under which snippets of xml that represent the override are kept -- with...
... Okay, now the command line interface is working. If I set a proper offset, there will be no GZIP MAGIC complaint :) Now I can go on using arcreader. thx: ...
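A small sketch of why the offset matters, assuming the compressed ARC is a series of gzip members written one after another: a read only succeeds when it starts exactly on a member boundary, where the two gzip magic bytes 0x1f 0x8b sit. The helper names are hypothetical, and this works on an in-memory byte string, not through the arcreader tool.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def starts_gzip_member(data, offset):
    """True if the bytes at `offset` begin with the gzip magic number."""
    return data[offset:offset + 2] == GZIP_MAGIC


def read_member(data, offset):
    """Decompress the single gzip member that starts at `offset`."""
    if not starts_gzip_member(data, offset):
        raise ValueError(f"no gzip magic at offset {offset}")
    # wbits=31 tells zlib to expect a gzip header; decoding stops at
    # the end of this member, leaving later members untouched.
    decoder = zlib.decompressobj(wbits=31)
    return decoder.decompress(data[offset:])


if __name__ == "__main__":
    first = gzip.compress(b"record one")
    second = gzip.compress(b"record two")
    blob = first + second
    print(read_member(blob, 0))
    print(read_member(blob, len(first)))
    print(starts_gzip_member(blob, 1))
```

Any offset that does not land on the first byte of a member fails the magic check, which matches the behaviour described above once a proper offset was supplied.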