Hi
We have found the cause of the speed problem.
We had configured the MirrorWriter processor, which caused a lot of disk
operations; these slowed the disk down and used a lot of threads for the
Archiver. The WARC processor uses only one thread for writing.
Speed is now 6 MB/s. It could be a little higher with more threads, but not
by much.
Last time you gave me tips about distributed crawling.
I read the procedures, but I don't think they are applicable in our
scenario.
Most people doing such big crawls segmented the domains by SURT and divided
them into several groups (for example, .uk domains in one group, .com domains
in a second).
All our domains are of the form http://xxx.at, so they share the same
top-level domain and no such differentiation can be done.
Secondly, the method of hashing the domains and giving some to one server and
others to another server seems to me the same as dividing the list of domains
into several smaller lists with the Linux split command, so that each server
crawls its own list of different domains.
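
To make the comparison concrete, here is a rough sketch of the hashing
approach as I understand it. The function names, the choice of MD5 and the
server count are my own assumptions for illustration, not anything taken from
the crawler.

```python
import hashlib


def assign_server(domain: str, num_servers: int) -> int:
    """Map a domain to a server index using a stable hash of its name.

    MD5 is used only because it is stable across runs and machines
    (an assumption for this sketch, not a crawler requirement).
    """
    digest = hashlib.md5(domain.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def partition(domains, num_servers):
    """Split a list of domains into one bucket per server."""
    buckets = [[] for _ in range(num_servers)]
    for domain in domains:
        buckets[assign_server(domain, num_servers)].append(domain)
    return buckets


if __name__ == "__main__":
    # Hypothetical seed domains; each server would crawl one bucket.
    for index, bucket in enumerate(partition(["aaa.at", "bbb.at", "ccc.at"], 2)):
        print(index, bucket)
```

For a fixed list this gives the same kind of result as split: every domain
ends up in exactly one list. The only difference I can see is that the
assignment can be recomputed from the domain name alone, without keeping the
lists around.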
Our crawl is going to be limited: we are going to crawl x TB of data,
so it would be nice to have an option to stop the crawl once a disk quota is
reached.
It could be done with a bash script, but maybe H2 has this option.
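
The external check I have in mind would look roughly like the sketch below
(in Python rather than bash). It only measures the size of the output
directory. The directory path, the quota value and the on_quota callback are
placeholders I made up, and how the crawl is actually stopped is left open.

```python
import os
import time


def dir_size_bytes(path):
    """Return the total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may be renamed or removed while we walk; skip it.
                pass
    return total


def quota_reached(path, quota_bytes):
    """True once the files under path occupy at least quota_bytes."""
    return dir_size_bytes(path) >= quota_bytes


def watch(path, quota_bytes, on_quota, interval_s=60, sleep=time.sleep):
    """Poll path until the quota is reached, then call on_quota once.

    on_quota is a hypothetical hook: whatever stops the crawl goes there.
    """
    while not quota_reached(path, quota_bytes):
        sleep(interval_s)
    on_quota()
```

It would be called as watch("/path/to/warcs", quota, stop_crawl), where both
the path and stop_crawl are whatever fits the installation.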
Ivan Rusnacko