... The link posted by libsoft in an earlier response is a good place to start. Change the filters so they look for text/xml and application/xml rather than...
Hi, I have a question about QuotaEnforcer. Is there any possibility to log sites that have more pages than the limit set in host-max-fetch-successes? (when the site...
... You'd have to add it. Would be easy enough to do though. See http://crawler.archive.org/xref/org/archive/crawler/prefetch/QuotaEnforcer.html#237. Add in...
... Any URIs blocked by quota will appear in the crawl.log with the -5003 (S_BLOCKED_BY_QUOTA) code, so that might be enough. The ultimate vision for this...
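As a rough illustration of the crawl.log approach mentioned above, the sketch below counts quota-blocked URIs per host. It assumes whitespace-separated crawl.log lines with the fetch status code in the second field and the URI in the fourth; the function name and the sample lines are made up for illustration and are not part of Heritrix.

```python
import sys
from collections import Counter
from urllib.parse import urlsplit

# Status code named in the reply above (S_BLOCKED_BY_QUOTA).
S_BLOCKED_BY_QUOTA = "-5003"


def hosts_blocked_by_quota(lines):
    """Count quota-blocked URIs per host.

    Assumption: each crawl.log line is whitespace-separated, with the
    status code in field 2 and the URI in field 4.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[1] != S_BLOCKED_BY_QUOTA:
            continue  # malformed line or a different status code
        host = urlsplit(fields[3]).hostname
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python blocked_hosts.py /path/to/crawl.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for host, n in hosts_blocked_by_quota(log).most_common():
            print(f"{host}\t{n}")
```

Each host printed by this script had at least one URI refused by a quota, which may be enough to identify the sites that hit the limit without changing QuotaEnforcer itself.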
Hi there, I'm using Heritrix to crawl about 130 domains. It turns out that some domains need special configurations, and I'm doing this with Refinements (one for...
Hmm, OK, thanks. There may be more information in heritrix_out.log. Can you look through that file for any SEVERE lines that correspond to the alerts, and...
The Internet Archive, home of the Heritrix web crawler, Wayback archive browser, and NutchWAX archive search engine projects, has current opportunities for...
Hi all: Can anyone confirm that (in theory) the Heritrix crawl.log file can be entirely reproduced from the ARC files generated by a successful crawl? And to...
Gordon Paynter
Gordon.Paynter@...
May 4, 2006 12:58 am
2836
... You could come close, but not reproduce all info in these logs from ARCs. For example, taking a representative successful crawl.log line: ...
Hello, I know the OutOfMemory problem has been mentioned here many times, but I could not find any topic that solves my problem. I am working on crawling whole...
... Some suggestions regarding the OOME: - If you are using the ExtractorSWF, use the latest release (1.8.0, officially out just today) or disable the...
Hi everyone, I've dug some more through the Heritrix sources to track down the problem but haven't found a solution yet. This problem starts to get really...
Hi, thanks a lot for your advice. Today I started Heritrix 1.8.0 with these options: JAVA_OPTS= -Xmx1800m on Java(TM) 2 Runtime Environment, Standard...
And when I decrease the size of the seeds list to 50000, the crawl starts normally, but then I receive this alert many times: Serious error occured trying to...
Olaf, I apologize for not responding to your initial report sooner. Of course, this should never happen: the 'queued' tally should match the total number of...
These "java.lang.OutOfMemoryError: unable to create new native thread" errors are exactly the sort of OOMEs I mentioned that are *not* due to a shortage of...
Hi, In the HeritrixProtocolSocketFactory class (Heritrix 1.8), the following method: public Socket createSocket(String host, int port) throws IOException,...
Hi everyone, I have yet another question: Is it possible to specify the location of the profiles folder (either via a heritrix.properties setting or via...
... No. We look for profiles in the conf directory. The conf directory is expected to be at HERITRIX_HOME/conf. What if we added a system property (and environment...
without stopping the crawler? Currently we have a filter nocrawl-sites.surt that looks like this +http://(com,companycomplains, +http://(com,badcompany, ... ...
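For readers unfamiliar with the filter lines quoted above, here is a minimal sketch of how such a SURT-style prefix can be derived from a plain host name. It only reproduces the host-reversal pattern visible in those lines; it is not Heritrix's own SURT code, and both function names are hypothetical.

```python
def host_to_surt_prefix(host, scheme="http"):
    """Turn 'badcompany.com' into 'http://(com,badcompany,'.

    Simplified: reverses the dot-separated host labels only, and ignores
    ports, paths, and anything else a full SURT conversion would handle.
    """
    labels = host.lower().strip(".").split(".")
    return f"{scheme}://(" + ",".join(reversed(labels)) + ","


def surt_filter_line(host):
    """Prefix the SURT with '+', matching the lines in the quoted file."""
    return "+" + host_to_surt_prefix(host)


if __name__ == "__main__":
    for name in ("companycomplains.com", "badcompany.com"):
        print(surt_filter_line(name))
```

Because the prefix ends with a comma after the domain label, it also covers every subdomain of the listed host.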
... Hi St. Ack, that would be really helpful. That way I could just unpack a new heritrix version (for as long as configurations don't change significantly)...
... What happens if you pause the crawler, remove the current filter, add a new filter of a different name but with same list plus the new element? It's still a...
... I've added an RFE: http://sourceforge.net/tracker/index.php?func=detail&aid=1485819&group_id=73833&atid=539102. Shouldn't be hard to do. Most other disk...
St.Ack, We have the HCC tests on standby, because we have to show some people that Heritrix can run multiple parallel jobs on the same machine. As a colleague...