Message #2450 of 6148
Re: [archive-crawler] Re: Large crawl experience (like, 500M links)

joehung302 wrote:
> I did a proof crawling using BroadScope and 22K seeds. I got OOME
> within a day. I then checkpoint it, restart the crawler, start
> another crawl from the checkpoint, OOME within a day.
>
> I then changed to use 5K seeds and BroadScope, OOME within a day.
> Restart with the checkpoint and still OOME within a day.
>
> I then run 5K seeds with DomainScope (kind of given up on
> broadscope). OOME within a day.
>
> I have my JVM set to -Xmx1500m. BTW, I'm using 64 bit JDK1.5.
>
> One thing that I observed is, broad scope runs much faster than
> domain scope under roughly the same condition. In both broadscope
> runs I was able to top 1000KB/s bandwidth limit with around 50% cpu
> usage. In the domain scope run I can only get to 500KB/s throughput
> with 100% cpu busy.
>
> I used to be able to run 1.0.4 for a week with <1K seeds and get
> around 1M links per day. I thought the bdb improvement should be
> able to take more seeds and run longer. I really want the crawler to
> run with a big seed list because we're going to seed my big crawl
> with links from ODP.
>
> Any suggestions that I can try?

We've run into problems under 64bit JVMs, and they seem mostly
attributable to the fact that the JVM's object pointers are larger
and thus the same object structures will take up more RAM.

This post from a Sun engineer suggests a rule of thumb of a 40%
larger heap to be comparable to a 32bit JVM heap:

http://forum.java.sun.com/thread.jspa?threadID=671184
(see reply #8)

So your 1500m heap in a 64bit JVM may be roughly comparable to a
1071m heap in a 32bit JVM.

Further, as noted in the 1.6 release notes, BerkeleyDB-JE 2.0.90's
internal mechanisms for staying within the budgeted cache size
are inaccurate under 64bit JVMs, so rather than the default 60%
cache size, 40% or even 30% would be safer.
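
As a rough worked example of the arithmetic above (a sketch only: the 40% figure is the rule of thumb from the linked post, and the function names are made up for illustration), the following compares a 1500m heap under both assumptions and shows how much of it each BDB cache percentage would claim:

```python
def equivalent_32bit_heap_mb(heap_64bit_mb, overhead=0.40):
    """Rule-of-thumb 32bit-equivalent size of a 64bit heap.

    `overhead` is the assumed extra space a 64bit JVM needs (40% here).
    """
    return heap_64bit_mb / (1 + overhead)


def bdb_cache_mb(heap_mb, cache_percent):
    """Heap share (in MB) handed to the BDB cache at a given percentage."""
    return heap_mb * cache_percent / 100


heap_mb = 1500  # the -Xmx1500m setting from the quoted message
print("32bit-equivalent heap: %d MB" % round(equivalent_32bit_heap_mb(heap_mb)))
for percent in (60, 40, 30):
    print("cache at %d%%: %d MB" % (percent, bdb_cache_mb(heap_mb, percent)))
```

Dropping from 60% to 30% leaves the rest of the crawler roughly 450 MB more heap in this example, at the price of a smaller BDB cache.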

Even with these adjustments, there are still a few structures in
the frontier that slowly grow without bound in a broad crawl. We
aim to constrain the last of these by the 1.8 release, leading to
crawls that wobble (slow down) rather than ever falling down (OOME),
as long as there's still disk space.

BdbUriUniqFilter helps defer an OOME until those other structures
become a problem, by not letting the URL already-seen structures
grow without bound. However, it's pretty inefficient for this kind
of set-membership testing, especially once the crawl is big and
dispersed enough that the cache isn't helping much. (It gets very slow.)

BloomUriUniqFilter offers another option: its speed doesn't degrade
with the number of URIs crawled. However, this comes at the cost of
a higher false-positive rate (misrecognizing a URI as already-seen
when it hasn't been) -- and once the crawl gets larger than the size
the Bloom filter was designed for, the false-positive rate grows to
approach 100%. The default parameters use ~500MB to achieve a
1-in-4-million false-positive rate through 125 million URLs; these can
be tuned via System properties. (See the BloomUriUniqFilter source
and http://crawler.archive.org/cgi-bin/wiki.pl?BloomUriUniqFilter
for details.)
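
To see where those numbers could come from, here is a small sketch using the textbook Bloom filter formulas. It assumes an optimally tuned filter, which for a 1-in-4-million (about 2^-22) target means 22 hash functions; that count is an assumption for illustration, not a value taken from the BloomUriUniqFilter source.

```python
import math


def bloom_bits(expected_items, hash_count):
    """Bits needed so that `hash_count` hashes are optimal for `expected_items`."""
    return math.ceil(expected_items * hash_count / math.log(2))


def false_positive_rate(bits, hash_count, inserted_items):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-hash_count * inserted_items / bits)) ** hash_count


DESIGN_ITEMS = 125_000_000
HASH_COUNT = 22  # assumption: optimal k for a ~2^-22 false-positive target

bits = bloom_bits(DESIGN_ITEMS, HASH_COUNT)
print("filter size: %.0f MB" % (bits / 8 / 1e6))
for inserted in (125_000_000, 250_000_000, 500_000_000, 1_000_000_000):
    rate = false_positive_rate(bits, HASH_COUNT, inserted)
    print("%13d URLs -> false-positive rate %.3g" % (inserted, rate))
```

Under these assumptions the filter comes out just under 500 MB, and the false-positive rate climbs steeply once the URL count passes the design size.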

We've started work on another UriUniqFilter that uses a batch
merging technique described in the 2001 "High-Performance Web
Crawling" paper by Marc Najork and Allan Heydon, in section 3.2,
"Efficient Duplicate URL Eliminators". A rough version is in CVS
now but it will need more tuning to match or surpass the existing
options. The hope is that it will offer adequate performance into
the hundreds of millions of URIs without hitting the walls of the
current options.
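
The general idea of batch merging can be sketched as follows. This is only an illustration of the technique, not the code in CVS: the class name is hypothetical, a Python list stands in for the sorted on-disk file, and plain URL strings are used where a real implementation would more likely store fingerprints.

```python
class BatchMergeSeenSet:
    """Toy already-seen set that defers membership tests to batch merges."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = set()     # in-memory buffer of undecided URLs
        self.seen_sorted = []    # stand-in for the sorted on-disk file

    def add(self, url):
        """Buffer a URL; return the newly seen URLs if a merge ran, else []."""
        self.pending.add(url)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        """Merge the sorted batch with the sorted seen list in one pass."""
        batch = sorted(self.pending)
        self.pending.clear()
        old, merged, novel = self.seen_sorted, [], []
        i = j = 0
        while i < len(old) and j < len(batch):
            if old[i] < batch[j]:
                merged.append(old[i])
                i += 1
            elif old[i] > batch[j]:
                merged.append(batch[j])
                novel.append(batch[j])   # not seen before: schedule it
                j += 1
            else:
                merged.append(old[i])    # duplicate: keep one copy, skip it
                i += 1
                j += 1
        merged.extend(old[i:])
        merged.extend(batch[j:])
        novel.extend(batch[j:])
        self.seen_sorted = merged
        return novel
```

The appeal is that each merge reads the seen list sequentially instead of doing one random lookup per URL; the trade-off is that a URL is not known to be new until the batch it sits in has been merged.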

Regarding the difference between DomainScope and BroadScope
performance:

All the 'classic' limited scopes -- DomainScope, HostScope,
PathScope -- use an inefficient linear probe against all
acceptable patterns (usually, all seeds) to test if a URI is
in scope. So, with a large number of seeds, they're slow
CPU hogs.

SurtPrefixScope can do anything they can, and much more
efficiently, so it's worth it to recast anything you were
using DomainScope for to use SurtPrefixScope instead.
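
To illustrate why a prefix test can be cheaper than a linear probe, here is a simplified sketch. It is not Heritrix's SurtPrefixScope code: `to_surt` only approximates the SURT form (it ignores ports, user info and query strings), and all function names are hypothetical. The point is that once hosts are written in reversed, comma-separated order, "is this URI under any accepted domain" becomes a single lookup in a sorted list.

```python
from bisect import bisect_right
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT form: http://www.example.com/a -> http://(com,example,www,)/a"""
    parts = urlsplit(url)
    labels = ",".join(reversed((parts.hostname or "").split(".")))
    return "%s://(%s,)%s" % (parts.scheme, labels, parts.path)


def domain_prefix(seed_url):
    """Prefix covering the seed's host and every subdomain of it."""
    surt = to_surt(seed_url)
    return surt[: surt.index(")")]


def build_prefix_list(prefixes):
    """Sort the prefixes and drop any that an earlier, shorter prefix covers."""
    kept = []
    for prefix in sorted(set(prefixes)):
        if kept and prefix.startswith(kept[-1]):
            continue
        kept.append(prefix)
    return kept


def in_scope_linear(surt, prefixes):
    """Probe every prefix in turn: cost grows with the number of seeds."""
    return any(surt.startswith(prefix) for prefix in prefixes)


def in_scope_sorted(surt, sorted_prefixes):
    """One binary search: only the nearest smaller-or-equal prefix can match."""
    i = bisect_right(sorted_prefixes, surt)
    return i > 0 and surt.startswith(sorted_prefixes[i - 1])
```

With a reduced, sorted list built from the seeds, `in_scope_sorted` costs a binary search per URI instead of one comparison per seed, which is the kind of difference that matters with thousands of seeds.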

--

One other thing which should help a little in the BdbUriUniqFilter
performance bottleneck is to use the 'queue budgeting' features
so that the crawler concentrates on a specific queue (host) for a
while, then rotates it out of activity to give other queues a chance.
In the BdbFrontier expert settings, this means making sure the
'cost-policy' is something other than ZeroCostAssignmentPolicy,
and tending to make the 'balance-replenish-amount' larger rather
than smaller. The current defaults for these are OK, but if you've
changed them you may have decreased the potential for the BDB cache
to benefit from site-locality patterns in discovered links.
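
A toy model of that rotation (an illustration only, not BdbFrontier's actual logic; it assumes every URI costs one unit and ignores politeness delays) shows how a larger replenish amount keeps the crawler on one host for longer stretches:

```python
from collections import deque


def crawl_order(queues, replenish_amount):
    """Return the (host, uri) order when each queue gets a fixed budget per turn.

    `queues` maps host -> list of URIs. A queue is worked until its budget
    is spent, then rotated to the back with a fresh budget on its next turn.
    """
    active = deque((host, deque(uris)) for host, uris in queues.items())
    order = []
    while active:
        host, uris = active.popleft()
        budget = replenish_amount
        while uris and budget >= 1:
            order.append((host, uris.popleft()))
            budget -= 1          # assumed unit cost per URI
        if uris:
            active.append((host, uris))
    return order


def host_switches(order):
    """Count how often consecutive fetches move to a different host."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])


queues = {"a.example": list(range(6)), "b.example": list(range(6))}
for amount in (1, 3, 6):
    print(amount, host_switches(crawl_order(queues, amount)))
```

Fewer host switches means consecutive already-seen lookups are more likely to touch the same region of the BDB data, which is the site-locality benefit described above.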

Hope this helps,

- Gordon @ IA

> --- In archive-crawler@yahoogroups.com, stack <stack@a...> wrote:
>
>>joehung302 wrote:
>>
>>>>Use the bloom filter option for the already-seen in BdbFrontier.
>>>>Seems to work better when a machine goes above 30-50million. Bloom
>>>>becomes saturated at 125million so thats about the upperbound per
>>>>machine at the moment unless you up the bloom filter size (but its
>>>>already big and you'll start eating into heap the crawler is using
>>>>going about its other business). Thereafter the rate of false
>>>>positives -- reports that we've seen an URL when in fact we
>>>>haven't -- starts to increase (Read the BloomFilter javadoc for
>>>>more on its workings).
>>>
>>>How confident do you guys feel that if I use broad-scope I can go
>>>above 50M links (or even 100M links) without OOME on a single
>>>machine?
>>
>>I'd suggest you startup a proofing test crawl with BroadScope and
>>see it does.
>>
>>On machines with specs like those listed below we've pulled down
>> >50Million documents per instance with >125million discovered.
>>Scope was not BroadScope. Once or twice we OOME'd but thought is
>>that probable cause has been addressed in 1.6 release (If there is
>>an OOME, you can checkpoint, restart and recover the crawl. Often
>>it will continue the crawl as it avoids an exact replay of the
>>circumstances that brought on the OOME).
>>
>>One thing I forgot to add to yesterday's list is regular
>>checkpointing -- every 4 hours or so.
>>
>>St.Ack
>>
>>
>>-bash-3.00$ uname -a
>>Linux crawling015.archive.org 2.6.11-1.27_FC3smp #1 SMP Tue May 17
>>20:43:11 EDT 2005 i686 athlon i386 GNU/Linux
>>
>>-bash-3.00$ more /etc/issue
>>Fedora Core release 3 (Heidelberg)
>>Kernel \r on an \m
>>
>>Dual AMD Opteron(tm) Processor 246 w/ cpu MHz : 2009.374 and
>>cache size : 1024 KB
>>
>>[crawling013 5] ~ > /lib/libc.so.6
>>GNU C Library stable release version 2.3.4 (20050218), by Roland
>>McGrath et al.
>>Copyright (C) 2005 Free Software Foundation, Inc.
>>This is free software; see the source for copying conditions.
>>There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
>>PARTICULAR PURPOSE.
>>Configured for i586-suse-linux.
>>Compiled by GNU CC version 3.3.5 20050117 (prerelease) (SUSE
>>Linux).
>
>>Compiled on a Linux 2.6.9 system on 2005-06-10.
>>Available extensions:
>> GNU libio by Per Bothner
>> crypt add-on version 2.1 by Michael Glad and others
>> linuxthreads-0.10 by Xavier Leroy
>> GNU Libidn by Simon Josefsson
>> NoVersion patch for broken glibc 2.0 binaries
>> BIND-8.2.3-T5B
>> libthread_db work sponsored by Alpha Processor Inc
>> NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
>>Thread-local storage support included.
>>For bug reporting instructions, please see:
>><http://www.gnu.org/software/libc/bugs.html>.
>>
>>
>>
>>
>>We used sun 1.5.0:
>>
>>-bash-3.00$ /usr/local/jdk1.5.0_03/bin/java -version
>>java version "1.5.0_03"
>>Java(TM) 2 Runtime Environment, Standard Edition (build
>>1.5.0_03-b07)
>
>>Java HotSpot(TM) Server VM (build 1.5.0_03-b07, mixed mode)
>>
>>
>>
>>
>>>That to me that seems to be the deciding factor on whether we
>>>should start with 5 beefy machines and hope each one can go up to
>>>100M links, or with 10 less beefy machines and each one can go up
>>>to 50M links without OOME.
>>>
>>>I know I'm shooting darts in the dark now...I have to start the
>>>project planning soon so I'd like to take my best guess with all
>>>the advices I can get.
>>>
>>>cheers,
>>>-joe

Tue Dec 20, 2005 2:02 am

gojomo
Messages in this thread:

Dec 6, 2005 2:17 am, joehung302:
Hi, Just wondering if anybody have used heritrix to do large crawling at the scale at around 500M links. I know I probably need to use mutliple instances and...

Dec 8, 2005 12:34 am, stack (stackarchiveorg):
... Not that I know of. I've witnessed/heard-of 200-300Million with 2 to 3 machines. Would be interested in hearing about your experiences. ... Dual opterons...

Dec 8, 2005 8:25 pm, joehung302:
... Seems ... becomes ... the ... other ... we've ... How confident do you guys feel that if I use broad-scope I can go above 50M links (or even 100M links)...

Dec 8, 2005 9:00 pm, stack (stackarchiveorg):
... I'd suggest you startup a proofing test crawl with BroadScope and see it does. On machines with specs like those listed below we've pulled down ... was not...

Dec 19, 2005 11:27 pm, joehung302:
I did a proof crawling using BroadScope and 22K seeds. I got OOME within a day. I then checkpoint it, restart the crawler, start another crawl from the...

Dec 20, 2005 2:02 am, Gordon Mohr (gojomo):
... We've run into problems under 64bit JVMs, and they seem mostly attributable to the fact that the JVM's object pointers are larger and thus the same object...

Dec 20, 2005 6:33 pm, joehung302:
Thanks a lot for the insight. I'll change to use 32bit JVM immediately. I'm using the BloomUriUniqFilter already. I'll do some reading on SurtPrefixScope and...

Dec 22, 2005 8:52 pm, joehung302:
Follow up questions on SurtPrefixScope: I'm confused about the relationship between seeds and SurtPrefixScope. Let's say I have a SurtScope for crawling .edu...

Dec 23, 2005 1:02 am, Gordon Mohr (gojomo):
... Yes. ... Yes, though some notes to keep in mind: - when adding URIs with the JMX importUris, the scope rules are not applied immediately: even URIs that...

Dec 23, 2005 1:44 am, joehung302:
... queued up ... How about new URIs discovered through the JMX importUris as non-seed? Let's say I JMX imported this link (http://members.aol.com/joe) as ...

Dec 23, 2005 2:15 am, Gordon Mohr (gojomo):
... Depends on the rest of the scope settings. Would these two URIs have been accepted by the scope before the importUris, if they had been discovered on a...

Dec 23, 2005 8:21 am, joehung302:
... crawlers Would the following configuration breaks .com into two? Machine A's SurtPrefixScope http://(com.a http://(com.b http://(com.c ... http://(com.n ...