Hi all, I have a question. I want to use Heritrix to do incremental crawling, but I don't know how to do it. Does anyone know how to solve...
This is a somewhat loaded question as 'incremental crawling' can be a vague term. It may help to explain what you hope to achieve (reduce data volume etc.) For...
I am crawling a large (prominent) website and I am encountering an issue with their links. During a crawl I discovered what I believe is a bug in HttpClient,...
Hi all, I am trying to run a crawl with scope=DecidingScope and rule=TooManyHopsDecideRule with max-hop=3. I get the alert mentioned in the subject of the...
OK - I agree - this is not a bug. The manual should, though, mention that the minimum setting for max-retries should be 3. Best, Bjarne ... -- Bjarne Andersen ...
I have downloaded the latest Heritrix source from Sourceforge - 1.12.1. It mentions in the manual somewhere that the .project and .classpath files are included...
The .classpath and .project files are not included in the source archives. If you have Subclipse installed on your box I recommend just using it to checkout...
Tom Emerson
TEmerson@...
Aug 6, 2007 4:22 pm
4494
Hello Michael and others, I gathered the .project and .classpath files from the link below and dropped these in the top level directory of the heritrix-1.12.1...
When I try setting up credentials for an HTML login on some page, I have a problem with the first seed not getting crawled. I have 2 seeds: http://www.foo.dk/ ...
Looking at your log below, it seems that www.foo.dk is http://www.arto.dk/ and the login page is http://www.arto.dk/login.asp. If that...
I want to add (or find) functionality that allows for notifications to be received when a crawl is completed. Does anyone know where the best place to...
After running for a few seconds, Heritrix crashes. I found the following URL error message in the logs/uri-error file: 2007-08-10T15:01:27.978Z...
Ahmed - The "Contains non-LDH characters" URL error is minor and advisory; it won't stop a crawl. The "Stream closed" error is more serious; the recovery log...
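For readers unfamiliar with the term in the reply above: "LDH" stands for letters, digits, hyphen, the character set conventionally allowed in hostname labels. The sketch below only illustrates that rule; it is not Heritrix's own check, and the names `LdhCheck` and `isLdh` are hypothetical.

```java
import java.util.regex.Pattern;

public class LdhCheck {
    // Dot-separated labels made only of ASCII letters, digits and hyphens,
    // with an optional trailing dot. Illustrative only, not Heritrix code.
    private static final Pattern LDH =
            Pattern.compile("^[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.?$");

    // Returns true when every character of the host is a letter, digit,
    // hyphen, or a label-separating dot.
    static boolean isLdh(String host) {
        return host != null && LDH.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLdh("www.example.com"));      // true
        System.out.println(isLdh("my_host.example.com"));  // false (underscore)
    }
}
```

A host with an underscore fails this check, which is the kind of URL that would typically be logged as an advisory error while the crawl continues.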
I checked my configuration and all the things you mentioned, but finally I found that these errors did not appear when I removed a preprocessor that I...
I am using 1.10.2 and have paused the crawl. I then clicked the 'Logs' tab and scrolled down to the bottom where I see 'Rotate crawler logs'. I clicked on this...
Hi all, I am using Heritrix 1.12.1 for crawling. Could anyone tell me how to exclude some URLs from a crawl? I.e., I don't want to crawl the contactus,...
Hi there, I am running Heritrix 1.12.1 and I have a very strange problem. When I submit a job via JMX (details to come) the seed URL causes an internal error,...
I have run into this. Did you create your own XML file for submitting jobs? Did you add anything (new nodes) to the document? Does the document have any xml...
First of all, thanks for the response. As for your questions: 1. I am using my profile's order.xml file, and I modify only the name, description and date (I set...
I've got Heritrix running in Amazon EC2, and I'm still mucking around with the configuration. I let a crawl run for a few days, and the job's state/ directory...
Ted Dziuba
ted@...
Aug 22, 2007 7:38 pm
4509
Nope. No steady state. 8G is chump change for a long-running crawler. It's a state DB of where you have been. So, of course, it grows as you run longer. I...
So is the size of the state directory a function of number of URLs visited, and not amount of data downloaded? Tangentially, I think that EC2 will do for our...
Ted Dziuba
ted@...
Aug 22, 2007 11:03 pm
4511
While I haven't taken the code apart, I think so. EC2 can work for a number of small crawls, no problem. I just hate running systems that fall apart cause they...
The state directory is the home of the BerkeleyDB-JE environment used by the crawler. The three main things stored there are: (1) A series of disk-backed maps...
Hello all, I have the latest Eclipse IDE running on my Windows PC. I would like to get to the point where I can run Heritrix within the Eclipse environment so I can...
... There's the developer's guide: http://crawler.archive.org/articles/developer_manual/building.html#eclipse ... You'll need to copy the /lib directory jar...