Hi, I've added the files to the ticket HER-1543. The code was battle-tested using the Heritrix 2.0.0 release and Jericho 2.6 on a quad core machine running for...
Hi, I need a little help from you. My problem is that when I am trying to crawl through www.company.com, the URL is getting links from a depth of 20 pages, i.e...
Hi, Thanks for your suggestion. I tried the same, i.e. changing the default from 20 to 2. Now the problem is that if you give any link which has more...
Hi to all, In order to try to save bandwidth when harvesting I have investigated Heritrix 1.14's ability to process compressed HTTP traffic supported by HTTP...
Tomas Ukkonen
tomas.ukkonen@...
Nov 7, 2008 5:22 pm
5569
Heritrix releases 1.14.2 and 2.0.2 are now available at Sourceforge: http://sourceforge.net/project/showfiles.php?group_id=73833 These are both 'micro'...
Hi all,
we are trying to figure out what the best Heritrix and server setup is for running large-scale crawls.
At our lab we used two different server setups and...
Juergen Umbrich
juergen@...
Nov 12, 2008 5:38 pm
5571
Hi, thanks a lot Gordon for all your help! ... Is it because Heritrix 2.0.1 (or 2.0.2) is too unstable, or what is your reason? ... Yes, that's right, we changed...
Juergen Umbrich
juergen@...
Nov 12, 2008 5:56 pm
5572
Hi all,
we had a TMOF-Exception while we tried to run a crawl with 300 ToeThreads, 1M seed URIs, and a
#ulimit -l = 32768. (global sheet attached)
The...
Juergen Umbrich
juergen@...
Nov 12, 2008 6:16 pm
5573
Sorry, I attached the wrong global.sheet. Please find attached the correct global.sheet and also the logs. Sorry for the repost. Best, Juergen ... root=map,...
Juergen Umbrich
juergen@...
Nov 12, 2008 6:42 pm
5574
Hi Jürgen, you are probably right, the machines hosting our current crawl have only about 1300 open files, but we had a TMOF problem before because of a ...
I want to start the crawl job in the background, not via the web UI by clicking the start button. I have already started the crawl job from the Heritrix main...
I do suspect RAM is the major reason for the difference. In a default configuration, two large, crucial data structures are implemented using disk-backed...
... 2.0.x has a lot of new things that could be destabilizing, but we don't know of any particular fatal problems, and we have done some sizable test crawls...
Your previous message reported running several successful crawls that collected ~10M pages a day, starting from 1M seed URLs. What is different about the...
Hi Gordon, thanks again for the fast reply. ... Yes, that would be an interesting comparison. We will think about a test and if so post the results. ... We read...
Juergen Umbrich
juergen@...
Nov 18, 2008 1:27 pm
5580
Hi all, ... Yes, that is correct. All tests were performed on the same server with the same version of Heritrix. ... The only difference in the Heritrix setup...
Juergen Umbrich
juergen@...
Nov 18, 2008 1:37 pm
5581
Hi, everybody. What should I do to avoid the error? Please give me some suggestions. Thanks in advance! I start my job from the webUI; when the job finished...
Hello, I'm launching crawls with the max-document-download setting set to 50, but the crawls keep running even after 50 docs are downloaded. I am using...
Hi, Can anyone help me in this case? I will describe the scenario that is happening now in my case and what I need. Existing scenario: 1. Create seed file 2....
Hi, Your question is not very clear to me. Anyway, I will try to help you as much as I can. Actually, what you need is an automated crawl that you want to...
Hello avinashnash, I'm not entirely clear on what you're asking. What is the output of your Html Parser? Or equivalently, what input does your "requirement"...
Hello, I am working on a large crawl (the whole Czech domain -- 480k domains). The crawl is based on this order.xml http://raptor.webarchiv.cz/heritrix/order.xml ...
... What version of Heritrix? ... This looks like in the course of composing a web page response for the admin UI, on which a count of unviewed alerts appears,...
Hi, Thanks for your response. I will describe the requirement in detail. What is happening now is that when I do a crawl, the output will be an ARC file...
Hi, I need a little help from you. I need one small clarification on the following. Suppose I have given a seed list which contains www.India.com; when the...
Hi- I recently started using Heritrix and decided to upgrade an existing 1.x installation to 2.x. The 1.x jobs were being launched via cron by specifying a job...
Hi, the Czech crawl again :) . I started with the default profile, set some specific rules (100MB limit etc.), and ran the crawl again. You can find the...