Hi Folks, I am using Heritrix 1.12.1 to crawl 500,000 URLs. The setup has 2 machines with 1 GB RAM, each running a crawler instance with 10 Toe...
If you have permission/assurance that it's OK to crawl the target sites with no politeness delays, I'd say try it: set the delays to 0 and see how fast things...
Is there a way to figure out the "parent" URI that the current URI you are crawling was discovered from? I want to create a "breadcrumb" type trail of the...
Hello Andrew, Take a look at org.archive.crawler.datamodel.CandidateURI.getVia() method. Note that the 'via' URI itself will not have information about its own...
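The breadcrumb idea can be sketched outside Heritrix. This is a minimal illustration in Python, not Heritrix code: it assumes you have already recorded each URI's 'via' yourself (for example from getVia() as each URI is processed) in a hypothetical dictionary named via_of, since the 'via' URI does not carry its own history. The function name and the sample URLs are made up for the example.

```python
from typing import Dict, List, Optional


def breadcrumb(uri: str, via_of: Dict[str, Optional[str]]) -> List[str]:
    """Walk recorded 'via' links from uri back to its seed; return seed-first trail."""
    trail = [uri]
    seen = {uri}
    while True:
        parent = via_of.get(trail[-1])
        # Stop at a seed (no recorded via) or if the recorded links loop back.
        if parent is None or parent in seen:
            break
        trail.append(parent)
        seen.add(parent)
    trail.reverse()
    return trail


# Hypothetical recorded data: each key was discovered on the page given as its value.
via_of = {
    "http://example.com/": None,
    "http://example.com/docs/": "http://example.com/",
    "http://example.com/docs/page.html": "http://example.com/docs/",
}
print(" > ".join(breadcrumb("http://example.com/docs/page.html", via_of)))
```

Keeping the map yourself is the design choice here: the trail can only be rebuilt if every hop was stored at the time it was crawled.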
So, I have a simple class to invoke crawl jobs on Heritrix via JMX. It seems to work fine for about 600 total jobs submitted, but eventually I cannot connect...
Can you get thread dumps from the Heritrix instance and try to figure out why it's stopped accepting connections and what it's doing w/ that 100% of CPU (Would be...
I'll submit my jobs to reproduce it. It will take quite a while but when it happens again I'll dump the threads. The UI is dead when this problem occurs so I...
Well, I may have jumped the gun again. I was thinking that this symptom might be due to another post I wrote. The issue might be that my single domain crawl...
... You are your own worst enemy. Smile. ... Could be because the code that prints out that message ain't too smart. The UI is looking for an instance named...
The issue is in the web application. The /admin/include/handler.jsp contains the code: snip ... if (handler == null) { if(Heritrix.isSingleInstance()) { ...
I began a crawl with 10 or so seed items. I am using a surt_source_file for Decision: ACCEPT and another surt_source_file for Decision: REJECT. For both...
Hi Mike, Once a URI is finished, crawled successfully or not, it will not be crawled again. In order to crawl a finished URI you will have to re-fetch it....
Hi Andrew, I just realized that you can import URIs from the GUI as well. Once the crawl is paused, you can follow the "View or Edit Frontier URIs" link in the...
Hi, Can anyone suggest a sample configuration and typical crawling speed observed using Heritrix 1.12.1 on a single box setup having 1 GB of RAM trying to crawl...
hi, modified my job management script in a way that will not allow more than 4 instances at the same time and changed the bdb-cache-percent value to 35 in...
I recently paused my crawl and added several seeds to the seed list. Viewing the seed report under the Reports tab I see the following displayed under 'Status...
600Kbps isn't very much bandwidth for crawling: it would allow 75KB to be collected per second. We often find content to average near 20KB per URL (larger for...
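As a rough check of the bandwidth arithmetic above, here is a small Python sketch. It divides the link speed by 8 to get kilobytes per second, then by the roughly 20KB average size quoted above. The function name and the per-day projection are additions for illustration, and the result ignores protocol overhead and politeness delays.

```python
def urls_per_second(link_kbps: float, avg_kb_per_url: float) -> float:
    """Estimate fetch rate from link speed (kilobits/s) and average size (kilobytes)."""
    kilobytes_per_second = link_kbps / 8  # 8 bits per byte
    return kilobytes_per_second / avg_kb_per_url


rate = urls_per_second(600, 20)
print(f"{600 / 8:.0f} KB/s, about {rate:.2f} URLs/s, about {rate * 86400:,.0f} URLs/day")
```

On these assumptions a 600Kbps link tops out at a few URLs per second, so bandwidth rather than crawler settings would be the limit.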
... As 4*35% = 140% of heap, if at any time the 4 contemporaneous crawls try to use their maximum bdb-cache allocation, you will get an OOME. ... It depends on...
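The heap arithmetic can be checked with a short Python sketch. It assumes, as the 140% figure implies, that all concurrent crawls share one JVM heap and that each may grow to its full bdb-cache-percent. The function names and the headroom parameter are hypothetical, not Heritrix settings.

```python
def total_cache_percent(instances: int, bdb_cache_percent: int) -> int:
    """Worst-case share of the heap claimed if every crawl fills its BDB cache."""
    return instances * bdb_cache_percent


def max_safe_cache_percent(instances: int, headroom_percent: int = 0) -> int:
    """Largest per-crawl percentage that keeps the combined caches within the heap."""
    return (100 - headroom_percent) // instances


print(total_cache_percent(4, 35))      # 140: more than the whole heap
print(max_safe_cache_percent(4))       # 25: caches alone would fill the heap
print(max_safe_cache_percent(4, 40))   # 15: leaves 40% for everything else
```

The headroom matters because the crawler needs heap for more than the BDB cache, so the 25% ceiling is an upper bound rather than a recommendation.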
... try ... i have a single machine/multiple crawl jobs system, that value was the best i have without killing the system so far :o( ... my changes on...
Hi, I was wondering if there is anywhere a single resource for the various important settings when doing a larger crawl, and/or experience with ...
Holger Lausen
holger.lausen@...
Nov 8, 2007 11:35 am
4670
I am performing a very large domain crawl that is going extremely slow as many of the pages are dynamically created pages from database driven sites. As a...
I'm using Heritrix to crawl sites, download software packages and get signatures for the packages. One of the sites I need to crawl is Red Hat's "network" site....
Well, doesn't it just figure... After two days working on this problem, I decide to post... And then I find something on Red Hat's site. They had an...
... If your politeness (or the target server's performance) is the gating factor, and you're going to adjust dual crawlers to be just as polite, what's the...
... Are the affected seeds DNS URLs or IP dotted-quad (eg. 208.70.27.35) hosts? Is the separate crawl on the same machine? Can the success/failure machines...
The affected entries are DNS URLs. The separate crawl is on the same machine utilizing another heritrix instance under Setup. I don't have access to ping on...
Dear IA-Team, looking at JIRA and SVN I'm well aware that you guys are really busy with work on Heritrix 2.x, but I'm wondering what chances are to see issue...