archive-crawler
Message #5913 of 6140
Re: poor performance

Hi

We have found what was causing the speed problem.

We had the MirrorWriter processor enabled, which caused a lot of disk operations; these slowed the disk down and used a lot of threads for Archiver.

The WARC processor uses only one thread for writing. Speed is now 6 MB/s; it could be a little higher with more threads, but not by much.

Last time you gave me tips about distributed crawling. I read the procedures, but I don't think they are applicable in our scenario.

Most people doing such big crawls segmented domains by SURT and divided them into several groups [for example, .uk domains in one, .com domains in a second].

All our domains sit directly under the same top-level domain (http://xxx.at), so no such differentiation can be done.
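To illustrate the point, here is a minimal sketch of why grouping by SURT prefix gives nothing to split on in this case. It assumes the usual SURT convention of reversing the host labels; `surt_prefix` and `group_by_tld` are hypothetical helper names, not Heritrix APIs.

```python
def surt_prefix(host):
    """Reverse the dot-separated host labels, SURT-style.

    'xxx.at' becomes 'http://(at,xxx,' (a prefix covering the host
    and its subdomains).
    """
    labels = host.lower().strip(".").split(".")
    return "http://(" + ",".join(reversed(labels)) + ","


def group_by_tld(hosts):
    """Group hosts by their last label (the top-level domain)."""
    groups = {}
    for host in hosts:
        tld = host.lower().strip(".").rsplit(".", 1)[-1]
        groups.setdefault(tld, []).append(host)
    return groups
```

With a seed list where every host ends in .at, `group_by_tld` returns a single group, so a split on the leading SURT label puts the whole crawl on one machine.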

Secondly, the method of hashing them and giving some domains to one server and others to another seems to me the same as splitting the list of domains into several smaller lists with the Linux split command, so that each server crawls its own list of different domains.
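For comparison, a minimal sketch of the hashing approach as I understand it. The function names and the choice of MD5 are assumptions for illustration only, not what Heritrix itself uses.

```python
import hashlib


def server_for(domain, n_servers):
    """Map a domain to a server index in [0, n_servers) via a stable hash."""
    digest = hashlib.md5(domain.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % n_servers


def partition(domains, n_servers):
    """Split a list of domains into one bucket per server."""
    buckets = [[] for _ in range(n_servers)]
    for domain in domains:
        buckets[server_for(domain, n_servers)].append(domain)
    return buckets
```

For a fixed seed list this does end up much like static lists made with split. The one difference I can see is that a hash can also assign a host discovered during the crawl to a server without consulting any list.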


Our crawl is going to be limited: we are going to crawl x TB of data, so it would be nice to have an option to stop the crawl once the disk quota is reached. It could be done with a bash script, but maybe H2 has this option.
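A rough sketch of the external check I have in mind, written in Python rather than bash. The function names are hypothetical, the directory and quota are whatever the crawl uses, and how the crawl is actually stopped is left out because I do not know of a built-in option.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable (e.g. rotated mid-scan).
                pass
    return total


def quota_reached(root, quota_bytes):
    """Return True once the data under root is at or over the quota."""
    return dir_size_bytes(root) >= quota_bytes
```

A cron job or a loop could call `quota_reached` on the directory where the WARC files are written and pause or terminate the crawl job when it returns True.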



Ivan Rusnacko




Sat Jul 4, 2009 7:23 am

nukleonrus