archive-crawler
Messages

Messages 5564 - 5594 of 6147
5564
Hi, I've added the files to the ticket HER-1543. The code was battle-tested using the Heritrix 2.0.0 release and Jericho 2.6 on a quad core machine running for...
Christian Krumm
chuk_ol
Nov 2, 2008
8:59 pm
5565
Hi, I need a little help from you. My problem is that when I try to crawl www.company.com, the crawler is getting links from a depth of 20 pages, i.e....
avinashnash
Nov 3, 2008
11:09 am
5566
Hi, Have you tried changing the default max-hops setting from 20 to 2?...
laurendko
Nov 3, 2008
2:25 pm
5567
Hi, thanks for your suggestion. I tried that, i.e. changing the default from 20 to 2. Now the problem is that if you give any link which has more...
avinashnash
Nov 4, 2008
11:47 am
5568
Hi to all, In order to try to save bandwidth when harvesting I have investigated Heritrix 1.14's ability to process compressed HTTP traffic supported by HTTP...
Tomas Ukkonen
tomas.ukkonen@...
Nov 7, 2008
5:22 pm
5569
Heritrix releases 1.14.2 and 2.0.2 are now available at Sourceforge: http://sourceforge.net/project/showfiles.php?group_id=73833 These are both 'micro'...
Gordon Mohr
gojomo
Nov 11, 2008
6:58 pm
5570
Hi all, we are trying to figure out what is the best heritrix and server setup to run large scale crawls. At our lab we used two different server setups and...
Juergen Umbrich
juergen@...
Nov 12, 2008
5:38 pm
5571
Hi, thanks a lot Gordon for all your help! ... Is it because Heritrix 2.0.1 (or 2.0.2) is too unstable, or what is your reason? ... yes, that's right, we changed...
Juergen Umbrich
juergen@...
Nov 12, 2008
5:56 pm
5572
Hi all, we had a TMOF-Exception while we tried to run a crawl with 300 ToeThreads, 1M seed URIs, and a #ulimit -l = 32768. (global sheet attached) The...
Juergen Umbrich
juergen@...
Nov 12, 2008
6:16 pm
5573
Sorry, I attached the wrong global.sheet, Please find attached the correct global.sheet and also the logs. Sorry for the repost. best juergen ... root=map,...
Juergen Umbrich
juergen@...
Nov 12, 2008
6:42 pm
5574
Hi Jürgen, you are probably right, the machines hosting our current crawl have only about 1300 open files, but we had a TMOF problem before because of a ...
Holger Lausen
hlausen
Nov 13, 2008
5:03 am
5575
I want to start the crawl job in the background, not via the webUI by clicking the start button. I already start the crawl job from the Heritrix main...
clccnt
Nov 17, 2008
5:06 pm
5576
I do suspect RAM is the major reason for the difference. In a default configuration, two large, crucial data structures are implemented using disk-backed...
Gordon Mohr
gojomo
Nov 17, 2008
11:25 pm
5577
... 2.0.x has a lot of new things that could be destabilizing, but we don't know of any particular fatal problems, and we have done some sizable test crawls...
Gordon Mohr
gojomo
Nov 17, 2008
11:34 pm
5578
Your previous message reported running several successful crawls that collected ~10M pages a day, starting from 1M seed URLs. What is different about the...
Gordon Mohr
gojomo
Nov 17, 2008
11:40 pm
5579
Hi Gordon, thanks again for the fast reply. ... Yes, that would be an interesting comparison. We will think about a test and if so post the results. ... We read...
Juergen Umbrich
juergen@...
Nov 18, 2008
1:27 pm
5580
Hi all, ... Yes, that is correct. All tests were performed on the same server with the same version of heritrix. ... The only difference in the heritrix setup...
Juergen Umbrich
juergen@...
Nov 18, 2008
1:37 pm
5581
Hi, everybody. What should I do to avoid the error? Please give me some suggestions. Thanks in advance! I start my job from the webUI; when the job finished...
clccnt
Nov 19, 2008
4:03 am
5582
Hello, I'm launching crawls with the max-document-download setting set to 50, but the crawls keep running even after 50 docs are downloaded. I am using...
adam.taylor78
Nov 20, 2008
6:45 pm
5584
Update on this... I took a look into the code. I managed to fix this issue by adding this line at the start of the main loop in...
adam.taylor78
Nov 20, 2008
8:01 pm
5585
Hi, can anyone help me with this case? I will describe the scenario that is happening now and what I need. Existing scenario: 1. Create seed file 2....
avinashnash
Nov 21, 2008
5:44 am
5586
Hi, your question is not very clear to me. Anyway, I will try to help you as much as I can. Actually, what you need is an automated crawl that you want to...
avinashnash
Nov 21, 2008
6:03 am
5587
Hello avinashnash, I'm not entirely clear on what you're asking. What is the output of your Html Parser? Or equivalently, what input does your "requirement"...
Noah Levitt
nlevitt0
Nov 21, 2008
6:58 pm
5588
Hello, I am working on a large crawl (the whole Czech domain -- 480k domains). The crawl is based on this order.xml http://raptor.webarchiv.cz/heritrix/order.xml ...
goblin_cz
Nov 22, 2008
5:46 pm
5589
... What version of Heritrix? ... This looks like in the course of composing a web page response for the admin UI, on which a count of unviewed alerts appears,...
Gordon Mohr
gojomo
Nov 22, 2008
7:15 pm
5590
Hi, thanks for your response. I will describe the requirement in detail. What's happening now is that when I do a crawl, the output will be an ARC file...
avinashnash
Nov 25, 2008
10:49 am
5591
Hi, I need a little help from you. I need one small clarification on the following: suppose I have given a seed list which contains www.India.com; when the...
avinashnash
Nov 25, 2008
11:14 am
5592
Hi- I recently started using Heritrix and decided to upgrade an existing 1.x installation to 2.x. The 1.x jobs were being launched via cron by specifying a job...
christsmith
Nov 27, 2008
6:04 pm
5593
Hi, the Czech crawl again :). I started with the default profile, set some specific rules (100MB limit etc.), and ran the crawl again. You can find the...
goblin_cz
Dec 3, 2008
10:49 pm
5594
Hello Adam, thanks for the report. I can't claim to know anything about the more serious second issue. But the first issue appears to be ...
Noah Levitt
nlevitt0
Dec 3, 2008
11:12 pm
