Hi, I have a problem starting Heritrix 2.0 under Windows from a Java main method, using: Heritrix.main(new String[]{"-a testPassword"}); When I run this, I get the...
... When you say 'nothing more happens' -- does that mean the browser hangs, spinning, waiting for a response? Is there any output to the JVM's standard out? ...
Hi Nathalie. The following code works for me on my Ubuntu Linux machine. Try launching this code in your favorite IDE, setting the classpath to the libraries...
Hi Christian, The problem was indeed the String[] that I submitted as parameter. Using new String[] {"-a", "password"} everything works fine. Thanks a lot for...
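The fix in this exchange comes down to how a Java argument array is tokenized: a shell would split "-a testPassword" on whitespace, but a hand-built String[] is passed through unchanged. A minimal sketch of the difference follows; the class ArgsDemo and the helper hasSeparateFlag are hypothetical names for illustration and are not part of Heritrix.

```java
public class ArgsDemo {
    // Hypothetical helper: true if the flag is its own array element
    // and is followed by at least one more element (its value).
    static boolean hasSeparateFlag(String[] args, String flag) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals(flag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] unused) {
        // One element: the flag and its value are fused into a single token.
        String[] fused = {"-a testPassword"};
        // Two elements: what a shell would hand to main() for "-a testPassword".
        String[] split = {"-a", "testPassword"};

        System.out.println(fused.length + " " + hasSeparateFlag(fused, "-a"));
        System.out.println(split.length + " " + hasSeparateFlag(split, "-a"));
    }
}
```

With the fused form, an option parser sees a single unknown token instead of the flag -a followed by its value, which is consistent with the behaviour reported above.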
I'd like to be able to crawl with the following 4 filters: 1 - Crawl by path (i.e. http://sub.example.com/foo/ ) 2 - Crawl by host (i.e. http://sub.example.com )...
Micah Wedemeyer
mwedeme@...
Mar 4, 2008 9:44 pm
5037
Hi, We are currently deploying the Portuguese web archive and the next step is to start using the deduplicator. I read the paper "Managing duplicates across...
Daniel Gomes
daniel.gomes@...
Mar 5, 2008 3:12 pm
5038
Still having trouble here... I've tried adding the SURT prefixes to the seeds file, and it doesn't seem to limit the crawl scope. In addition, I see the...
Micah Wedemeyer
mwedeme@...
Mar 5, 2008 10:37 pm
5039
(Sorry for spamming, but I wanted to head off the inevitable reply...) I looked deeper and found the notes about starting each SURT prefix line with a "+"...
Micah Wedemeyer
mwedeme@...
Mar 5, 2008 10:58 pm
5040
Typically (and in the rule progression you've shown), SURT prefixes are used to rule things in, but not out. The general operation of your rules in plain...
To also answer some of your earlier questions: ... The 'implied conversion' which is done automatically if you choose to use your seeds as SURT prefixes is...
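To make the "rule things in" behaviour of SURT prefixes concrete, here is a simplified sketch of SURT-style prefix matching. It assumes the usual SURT layout, with host labels reversed, comma-separated, and wrapped in parentheses. The class SurtSketch and its methods are hypothetical and far less thorough than Heritrix's own SURT handling (ports, userinfo, and canonicalization are ignored).

```java
import java.net.URI;

public class SurtSketch {
    // Simplified conversion: http://sub.example.com/foo/ -> http://(com,example,sub,)/foo/
    static String toSurt(String url) {
        URI u = URI.create(url);
        String[] labels = u.getHost().toLowerCase().split("\\.");
        StringBuilder sb = new StringBuilder(u.getScheme()).append("://(");
        // Reverse the host labels so the most general part comes first.
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        sb.append(')');
        String path = u.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        return sb.toString();
    }

    // A URL is ruled in when its SURT form starts with the given prefix.
    static boolean inScope(String url, String surtPrefix) {
        return toSurt(url).startsWith(surtPrefix);
    }

    public static void main(String[] args) {
        String pathPrefix = "http://(com,example,sub,)/foo/";
        System.out.println(toSurt("http://sub.example.com/foo/"));
        System.out.println(inScope("http://sub.example.com/foo/bar.html", pathPrefix));
        System.out.println(inScope("http://sub.example.com/other/", pathPrefix));
    }
}
```

In this sketch a prefix can only accept URLs. Anything that fails the match is simply left undecided, so exclusions need a separate reject rule.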
Gordon, This really clears things up, especially the transclusion rules. For our purposes, we're only doing text analysis, not archiving, so losing images and...
Micah Wedemeyer
mwedeme@...
Mar 6, 2008 3:04 pm
5043
My crawl jobs are slowing down drastically at the end of the crawl despite several threads being active according to web UI. I would understand if the active...
Hi! I'm new to Heritrix and was wondering what resource is mainly responsible for the amount of memory Heritrix uses. Is it the number of queued links? Gerwin...
It's a combination of active threads, queue size, and number of seeds. If you're trying to crawl a large domain, then I'd recommend breaking the domain down into...
At Thu, 6 Mar 2008 18:09:22 -0500, ... This is not unusual (others may correct me if I’m wrong). Oftentimes you have basically run out of URLs except for a...
I don't know either, but I was thinking that maybe you have come across some URLs that are triggering an error (such as a timeout). The default for Heritrix is...
Example: schedule with high priority if the URI matches /.*?\.(pdf|ppt)/. It appears like BdbMultipleWorkQueues could be used to achieve this. What's the...
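The rule in this question can be sketched independently of any frontier implementation. The example below uses a plain java.util.PriorityQueue and is only an illustration of regex-based prioritisation. It does not use BdbMultipleWorkQueues or any other Heritrix class, and PriorityDemo, priorityOf, and newQueue are hypothetical names.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.regex.Pattern;

public class PriorityDemo {
    // Same pattern as in the question; matches() requires the whole URI to match,
    // so the URI must end in .pdf or .ppt.
    static final Pattern HIGH = Pattern.compile(".*?\\.(pdf|ppt)");

    // Lower number means the URI is scheduled earlier.
    static int priorityOf(String uri) {
        return HIGH.matcher(uri).matches() ? 0 : 1;
    }

    static PriorityQueue<String> newQueue() {
        return new PriorityQueue<>(Comparator.comparingInt(PriorityDemo::priorityOf));
    }

    public static void main(String[] args) {
        PriorityQueue<String> queue = newQueue();
        queue.add("http://example.com/index.html");
        queue.add("http://example.com/report.pdf");
        queue.add("http://example.com/slides.ppt");

        // Both documents are polled before the HTML page; the order between
        // the two equal-priority documents is not guaranteed.
        while (!queue.isEmpty()) {
            System.out.println(queue.poll());
        }
    }
}
```

Note that a URI with a query string after the extension (for example report.pdf?x=1) would not match this pattern as written.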
It’s definitely not just a large number of URLs on one site nor the retry settings. There are at least 7 threads and 7 sites and my retry settings are set...
Hello, I would like to download a single page (and all of its images, JavaScript,...) with Heritrix, but I do not know the right settings for this. Either only...
This might be no help, but based on my (very limited) experience, here's what I would look at: Set up your decision rules as follows: 1) REJECT all...
Micah Wedemeyer
mwedeme@...
Mar 11, 2008 9:28 pm
5052
Hi All, As part of my research work I have been developing a Focused Crawler that uses Heritrix as its foundation. I just wanted to ask a couple of questions ...
Hi all, I'm interested in writing a new extractor, but I wanted to get people's input first before I get started. Maybe this has already been done by someone,...
Here is the decision chain that we use to collect a single page (Heritrix 1.12.0): 1) REJECT by default (RejectDecideRule) 2) ACCEPT if SURT prefixed...
I have just deployed Heritrix and am in the process of trying to optimize it. We are running it on an 8-core/16GB RAM server and in the past we have ...
If you have access to a DNS server on your network, you could configure multiple domains to point to that external domain, so that Heritrix would see it as...
What you're describing could look like a DoS attack from the server side. Do you have permission from the admins of the site to hit them that hard? I'd...
Micah Wedemeyer
mwedeme@...
Mar 13, 2008 8:00 pm
5058
Hi, I'm using Heritrix 1.12.1. In a standard frontier report, what does the "active balance" mean? In the following queue, what's the meaning of the different...
Hi, I'm using Heritrix 1.12.1. Under which conditions does the code "-4001 Too many link hops away from seed" appear in logs? I intended to log the links not...
Hello, Thank you for your hints, but when I use your settings, I always get only the HTML page. (In the crawl report only the MIME type text/html is...