archive-crawler
Messages

Messages 2380 - 2409 of 6142
2380
Dear St.Ack. I have now uploaded two files: HeritrixLauncher.java: the class responsible for executing Heritrix. Netarkivet-log.txt (logs from a run, where...
Søren Vejrup Carlsen
svc400
Dec 1, 2005
12:06 pm
2381
... Sure. Looking at your log, it looks like we're stuck here: http://crawler.archive.org/xref/org/archive/crawler/framework/CrawlController.html#1031. I see...
stack
stackarchiveorg
Dec 1, 2005
6:55 pm
2382
We're using the newly added QuotaEnforcer - could that perhaps be the problem? ... Sure. Looking at your log, it looks like we're stuck here: ...
Bjarne Andersen
bjarne_dk2000
Dec 1, 2005
9:09 pm
2383
... I'd doubt it (I ran a few tests anyway and it seems to behave itself). On your end, you might try removing it from your config. to see if it changes the...
stack
stackarchiveorg
Dec 1, 2005
10:10 pm
2384
I took another look at your log. It says zero active threads, but I only counted 48 log entries of "FINE: ToeThread #50: finished for order 'default_orderxml"....
stack
stackarchiveorg
Dec 1, 2005
10:20 pm
2385
It happens only when we're calling: crawlController.requestCrawlStop(); We're calling that when the crawler hangs for too long (our own kind of timeout...
Bjarne Andersen
bjarne_dk2000
Dec 1, 2005
10:40 pm
2386
After looking at your HeritrixLauncher class, I suspect the problem is a thread deadlock. requestCrawlStop() is only called inside doCrawlLoop, which holds the...
Gordon Mohr
gojomo
Dec 1, 2005
10:50 pm
2387
Hello, This email message is a notification to let you know that a file has been uploaded to the Files area of the archive-crawler group. File :...
archive-crawler@yahoo...
Dec 2, 2005
12:34 pm
2388
Seeing the latest log dump you've uploaded, I think the real problem is with this exception, indicative of a bug fixed just yesterday: Exception in thread...
Gordon Mohr (@Interne...
gojomo
Dec 2, 2005
5:46 pm
2389
Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for bloom filter already-included...
stack
stackarchiveorg
Dec 2, 2005
7:06 pm
2390
Dear Gordon It works like it should now. Thanks for your assistance All the best ... Søren Vejrup Carlsen, DUP, Det Kongelige Bibliotek tlf: (+45) 33 47 48 41...
svc@...
svc400
Dec 2, 2005
7:21 pm
2391
Hi, Just wondering if anybody has used Heritrix to do large crawls at a scale of around 500M links. I know I probably need to use multiple instances and...
joehung302
Dec 6, 2005
2:17 am
2392
Hi, I am looking at the recover job feature. Correct me if I am wrong, but is it only able to reload the urls that it visited and the ones that are on the ...
Alex
SUNYAl
Dec 6, 2005
10:50 pm
2393
... That's about right. It reconstitutes CrawlURI instances using a line from the recover log (check out the content of the recover log). Perhaps the 1.6...
stack
stackarchiveorg
Dec 6, 2005
11:21 pm
2394
Hi, I am new to Heritrix, and at times when I refresh my administrator console the crawl appears as paused. I never manually pause the crawl. I haven't...
four_oh_six
Dec 7, 2005
1:33 am
2395
... Heritrix will only auto-pause on a catastrophic error, usually an OutOfMemoryError. Does the UI show alerts (see top-right-hand corner)? St.Ack...
stack
stackarchiveorg
Dec 7, 2005
2:11 am
2396
Dear all. We have just started using the group-max-success-kb in the QuotaEnforcer available with heritrix 1.6.0. If we set the limit to X bytes, and get less...
Søren Vejrup Carl...
svc400
Dec 7, 2005
2:23 pm
2397
Thanks for the insight, I was using 1.4.0 which didn't include checkpointing. Is it possible to configure the job to automatically save checkpoints for every...
Alex
SUNYAl
Dec 7, 2005
5:14 pm
2398
... No. Not yet. You think this should be part of the crawler? Or is it ok if it's done by an external process? It can be done from cron via the ...
stack
stackarchiveorg
Dec 7, 2005
6:40 pm
2399
Hi St.Ack, Well, we need to re-establish state if the crawl fails, and the only way to keep the state is using the checkpoint feature. We would definitely want...
Alex
SUNYAl
Dec 7, 2005
6:54 pm
2400
... Crawler must be quiescent before you checkpoint. Otherwise the checkpoint will be inconsistent. So, at least for now until we do more work, the pause is...
stack
stackarchiveorg
Dec 7, 2005
7:28 pm
2401
... You might try crawling a small site whose size you know, letting Heritrix crawl it to completion. Do it first without QuotaEnforcer, then with,...
stack
stackarchiveorg
Dec 7, 2005
11:52 pm
2402
... Not that I know of. I've witnessed/heard of 200-300 million with 2 to 3 machines. Would be interested in hearing about your experiences. ... Dual Opterons...
stack
stackarchiveorg
Dec 8, 2005
12:34 am
2403
... Seems ... becomes ... the ... other ... we've ... How confident do you guys feel that if I use broad-scope I can go above 50M links (or even 100M links)...
joehung302
Dec 8, 2005
8:25 pm
2404
We found the error in our own logic (and code). It was a simple rounding error (500000 bytes / 1024 is 488 KB, but 488 KB * 1024 is not 500000 bytes). Best, Bjarne...
Bjarne Andersen
bjarne_dk2000
Dec 8, 2005
8:37 pm
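The rounding error described in message 2404 can be reproduced with a few lines of Java. This is only a sketch of the arithmetic, assuming the byte limit was converted to whole KB by integer division; the class and method names are hypothetical and do not come from Heritrix or from the poster's code.

```java
public class QuotaRounding {
    /** Converts bytes to whole KB; integer division discards the remainder. */
    public static long bytesToWholeKb(long bytes) {
        return bytes / 1024;
    }

    /** Converts whole KB back to bytes. */
    public static long kbToBytes(long kb) {
        return kb * 1024;
    }

    public static void main(String[] args) {
        long limitBytes = 500000L;
        long limitKb = bytesToWholeKb(limitBytes); // 488
        long roundTrip = kbToBytes(limitKb);       // 499712, not 500000
        // prints "488 KB -> 499712 bytes, lost 288"
        System.out.println(limitKb + " KB -> " + roundTrip
                + " bytes, lost " + (limitBytes - roundTrip));
    }
}
```

A check that compares downloaded bytes against the original 500000-byte figure will therefore disagree by 288 bytes with a limit that was stored as 488 KB.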
2405
... I'd suggest you start up a proofing test crawl with BroadScope and see how it does. On machines with specs like those listed below we've pulled down ... was not...
stack
stackarchiveorg
Dec 8, 2005
9:00 pm
2406
We have encountered a new problem with the group-max-success-kb feature. It seems that robots.txt URI is not considered by QuotaEnforcer. These robots.txt are,...
Søren Vejrup Carl...
svc400
Dec 12, 2005
10:46 am
2407
... To shed some light (maybe) on the problem, the update for the host-report happens through StatisticsTracker.crawledURISuccessful, called from...
Lars Clausen
lrclause
Dec 12, 2005
2:34 pm
2408
I think Søren formulated the wrong question - or formulated the opposite of what he really meant. The problem is that the size of the robots.txt is summed together...
Bjarne Andersen
bjarne_dk2000
Dec 12, 2005
5:07 pm
2409
Hi all, The last couple of days I've been writing a Processor to write the results to a MySQL database. Everything turned out to be ok so far, except the fact...
Samuel
samendonca
Dec 12, 2005
5:24 pm