Is there a way with Heritrix to define a job with many starting points, one per site, then have the job run in a "balanced" manner so that each site gets its...
Heritrix does a pretty good job of running in a "balanced" manner by default. The list of urls yet to be crawled (the frontier) is divided into queues by host,...
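To illustrate the idea of a frontier split into per-host queues, here is a minimal Python sketch. It is not Heritrix code: the function names `build_host_queues` and `round_robin` are hypothetical, and the real frontier also applies politeness delays and other scheduling rules that are ignored here.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def build_host_queues(urls):
    """Group URLs into one FIFO queue per hostname (hypothetical helper)."""
    queues = OrderedDict()
    for url in urls:
        host = urlsplit(url).hostname or ""
        queues.setdefault(host, deque()).append(url)
    return queues


def round_robin(queues):
    """Yield one URL from each non-empty queue in turn.

    Note: this drains the queues passed in.
    """
    active = deque(queues.values())
    while active:
        queue = active.popleft()
        yield queue.popleft()
        if queue:
            # The host still has pending URLs, so it rejoins the rotation.
            active.append(queue)
```

Because the grouping key is the full hostname, `www.a.example` and `a.example` end up in separate queues in this sketch, so a site with many sub-domains takes more turns in the rotation.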
Hello, as far as I know, the ordinal would not be set when extracting the hyperlink, but when the referenced resource would be downloaded or fetched by...
... Sending a PM is OK. But I think it would be even better if the emails were sent to the list, so other researchers may comment on it or use it for their...
My problem is that when I use 20 seeds (each for a different site), the ones that have many sub-domains seem to get more crawler "attention". If I understand...
I understand that Windows is not supported, but Heritrix 2.0 actually seems to be doing very well on it so far (though I'm currently very new to Heritrix). One...
Copy to the list for the archive ... Well, the code I've sent to the list is for Heritrix 2.0.X, as I mentioned in the email. ... // Get the DatabaseFacade...
Yes, each sub-domain gets its own queue. If you are only crawling 20 sites, a quick fix (at least if you're using Heritrix 1.x) is to create an override for each...
... The character-map setting can be used to map ':' to some other character. The help for it has a recommendation for Windows. Note also the case-sensitive,...
Hi, I'm trying to fetch only the HTML pages defined in the seeds using Heritrix 2.0, and I don't want other pages linked from those HTML pages to be downloaded. These are my settings: ...
... Hi, I think you have to change the order of the decide rules. They are executed in order, so the last decision would be used to make the whole decision for...
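The "last decision wins" behaviour described above can be sketched as follows. This is an illustrative Python model only, not the Heritrix API: the names `decide`, `reject_all`, and `accept_if_seed` are hypothetical, and it assumes each rule either returns a decision or passes.

```python
ACCEPT, REJECT, PASS = "ACCEPT", "REJECT", "PASS"


def decide(rules, uri):
    """Apply rules in order; the last rule that does not PASS decides."""
    decision = PASS
    for rule in rules:
        outcome = rule(uri)
        if outcome != PASS:
            decision = outcome
    return decision


def reject_all(uri):
    """Hypothetical rule: reject every URI."""
    return REJECT


def accept_if_seed(seeds):
    """Hypothetical rule factory: accept seeds, pass on everything else."""
    def rule(uri):
        return ACCEPT if uri in seeds else PASS
    return rule
```

With this model, putting `reject_all` first and the seed rule second accepts only seeds, while the reverse order rejects everything, which is why the order of the rules matters.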
... If I understand correctly, you only want your seed URIs to be fetched, and nothing else -- no in-page resources (like images, scripts, or CSS), and no...
Hi, thank you. I set up a job "basic_seed_sites-20080712225206" as a copy of basic_seed_sites. As in my previous message, it does not work as I imagined. Please help. ...
Hi all, what should be done to make Heritrix revisit all the visited web pages once the crawl is finished? I need to evaluate how the pages have changed during...
At IA, we will soon be performing final triage of outstanding issues to be fixed for a 1.14.1 bugfix release and a 2.0.1 bugfix release. As a reminder, our...
I just voted for the fixing of the encoding issue I reported. This one is, I think, really critical. I did not see any bugs for better documentation for the 2.0...
Jean-Noël Rivasseau
elvanor@...
Jul 17, 2008 9:38 pm
5362
Hi, I'm running heritrix from the command line, like tgeorgio. I can start a crawl and watch the progress-statistics.log file to see its progress. What is the...
At Tue, 22 Jul 2008 13:18:50 -0700, ... Hi Colin. When the job is finished, the 3rd line of the state.job file should start with “Finished”. It sounds...
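A script watching a command-line crawl could apply the check described above roughly like this. It is a minimal sketch that assumes only what the message states (the third line of state.job starts with "Finished" once the job is done); the function name `job_finished` and the file location are hypothetical.

```python
from pathlib import Path


def job_finished(state_file):
    """Return True if the 3rd line of the given file starts with 'Finished'."""
    path = Path(state_file)
    if not path.is_file():
        # No state file: this sketch cannot tell whether the job is done.
        return False
    lines = path.read_text(encoding="utf-8").splitlines()
    return len(lines) >= 3 and lines[2].startswith("Finished")
```

A missing file is reported as "not finished" here, which is an assumption of the sketch rather than documented Heritrix behaviour.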
... Thanks for the info, Erik. So, if I don't have a state.job file, the crawl job is still running? All of my completed jobs have that file, but the recently...
At Tue, 22 Jul 2008 13:50:40 -0700, ... I’ve never seen a job without a state.job file; somebody else will have to help you out with this. Every Heritrix...
Hi, I am running an archive crawler, Heritrix, with Capture-HPC. It takes a very long time to finish a crawling job. Even when I tested a site of small...
오성근
freeosg@...
Jul 29, 2008 4:37 am
5368
Hello, I just set up the 2.0.0 version on Windows XP SP3 and was able to start Jetty on port 8080 with tweaks to the provided .cmd scripts. Here is a summary of...
I'm taking the 1.14 release for a spin but having issues. On the first job, I get the following message from seeds-report.txt [code] [status] [seed] [redirect] -50...
Hello, I searched JIRA and found no open bugs for this. When I launch the Web UI (either with or without an associated local engine), and add a remote engine...
Jean-Noël Rivasseau
elvanor@...
Jul 30, 2008 9:38 am
5371
Hello, My task is the following one: I need to search for a specific URL in the ARC file. From reading the Javadocs it seems there is no facility for doing...