archive-crawler
Messages

  Messages Help
Advanced
Messages 730 - 759 of 6140
730
Kristinn Sigurdsson, hello! There are about 30 sites in the seeds.txt file, and this morning I found there 1917665 of 2311102 (82%) have been...
zhousp
zhousp@...
Aug 2, 2004
12:43 am
731
I see there are only 6 host queues in existence. There will never be more active threads than host queues, and usually much fewer threads, as some queues will...
Gordon Mohr (@Interne...
gojomo
Aug 2, 2004
5:38 am
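A minimal sketch of the bound described in message 731: the number of simultaneously active threads cannot exceed the number of host queues, whatever the configured thread limit is. The class and method names are hypothetical and are not part of Heritrix; the thread limit of 100 is an illustrative assumption, while the 6 host queues come from the message.

```java
// Hypothetical helper for illustration only; not a Heritrix API.
final class ActiveThreadBound {

    // Upper bound on busy threads: one thread per host queue at most,
    // and never more than the configured thread limit.
    static int maxActiveThreads(int maxThreads, int hostQueues) {
        if (maxThreads < 0 || hostQueues < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return Math.min(maxThreads, hostQueues);
    }

    public static void main(String[] args) {
        // 6 host queues (from the message), 100 threads (assumed value).
        System.out.println(maxActiveThreads(100, 6));
    }
}
```

This is only a ceiling. As the message notes, the real number is usually much lower, because some queues are idle at any given moment.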
732
Hi, I have a question about the arcwriter in heritrix/bin. When I pass in an arc.gz file, here is what I get. ...
Yousef Ourabi
yousef_ourabi
Aug 2, 2004
8:07 am
733
... Thank you, that is a lot clearer in the manual at least. -Lars...
Lars Clausen
lrclause
Aug 2, 2004
8:34 am
734
... The av_* programs, they are not freely available. You could use the bin/arcreader to extract any of the listed files by passing an offset, field 3 from the...
stack
stack@...
Aug 2, 2004
2:53 pm
735
Hi, thanks for your reply! Hadn't noticed about the socket stuff and the CDX functionality. ... of ... and ... jar. I agree. I managed to separate out a...
nhckbdk
Aug 2, 2004
3:08 pm
736
Just a heads up that we're preparing to release 1.0.0 by the end of this week. We're currently moving internal crawls to CVS HEAD to make sure all's well...
stack
stack@...
Aug 2, 2004
4:34 pm
737
... Sorry about that. Post 1.0.0, let's break it out. We've been talking about a 'tools' sourceforge project for a while now. We could put the arc...
stack
stack@...
Aug 2, 2004
5:25 pm
738
... No need. It wasn't designed for that. :-) ... Super. Thanks! ... Looks good. ... What I'm using it for is an application where I'm running several, ...
nhckbdk
Aug 3, 2004
3:44 pm
739
... Ok. Maybe then this class would just read in all of the ARCRecord on construction? Then you could do without the readAll. Instead it'd become a 'byte []...
stack
stack@...
Aug 3, 2004
4:07 pm
740
Hi Kris, Read through your proposal and it looks good to me. Sounds like it would provide the functionality that we want. I know you're all busy with the...
robeger
Aug 4, 2004
10:42 pm
741
Current status of the AR module is ‘just getting started’. Don’t expect to see anything ready for serious use for at least two to three months. Depending...
Kristinn Sigurdsson
kristsi25
Aug 5, 2004
9:28 am
742
... I think reading the whole record on construction is a bad idea, it prevents you from just reading the metadata quickly. A getBytes() method could...
Lars Clausen
lrclause
Aug 5, 2004
9:39 am
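The design point in message 742 (keep metadata cheap to read, and fetch the record body only when asked) can be sketched as a lazily loaded record. This is a hedged illustration under stated assumptions: `LazyRecord`, its fields, and the `Supplier`-based loader are hypothetical names and do not reflect the actual ARCRecord class. Only the idea of a `getBytes()` method comes from the message.

```java
import java.util.function.Supplier;

// Hypothetical sketch; not the Heritrix ARCRecord API.
final class LazyRecord {
    private final String url;                  // metadata, available immediately
    private final long length;                 // metadata, available immediately
    private final Supplier<byte[]> bodyLoader; // deferred read of the body
    private byte[] body;                       // null until first getBytes()

    LazyRecord(String url, long length, Supplier<byte[]> bodyLoader) {
        this.url = url;
        this.length = length;
        this.bodyLoader = bodyLoader;
    }

    String getUrl() { return url; }

    long getLength() { return length; }

    boolean isBodyLoaded() { return body != null; }

    // Reads the body on first call and caches it for later calls.
    byte[] getBytes() {
        if (body == null) {
            body = bodyLoader.get();
        }
        return body;
    }

    public static void main(String[] args) {
        LazyRecord r = new LazyRecord("http://example.com/", 3,
                () -> new byte[] {1, 2, 3});
        System.out.println(r.getUrl() + " loaded=" + r.isBodyLoaded());
        System.out.println(r.getBytes().length + " loaded=" + r.isBodyLoaded());
    }
}
```

A caller that only scans metadata never triggers the loader, which is the trade-off argued for in the message. Reading everything on construction would pay the body cost for every record.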
743
... Ok. Just a suggestion. I was thinking the proposed child classes' main selling point was that it could be detached from the serial reading of the ARC and...
stack
stack@...
Aug 5, 2004
3:07 pm
744
... To avoid destabilizing or delaying 1.0.0, checkpointing implementation has been deferred. It's the first post-1.0.0 priority. In the meantime, we've...
Gordon Mohr (@Interne...
gojomo
Aug 5, 2004
5:02 pm
745
I've got a crawl running with 21 seed URIs, and the max toe threads is set to 50. I've never seen a higher number than "4 of 50" in the active threads value...
robeger
Aug 5, 2004
9:29 pm
746
Here's a snapshot of the threads report (there seem to be a lot of threads with long wait times): Toe threads report - 200408052130 Job being crawled: 21siteCrawl ...
robeger
Aug 5, 2004
9:33 pm
747
Hi Rob, For how long have you been crawling these seeds? Is it possible that the crawl is almost done and only one host is left to be crawled. In...
Igor Ranitovic
iranitovic
Aug 5, 2004
9:51 pm
748
Hi Igor, It was running for about 30 mins when I stopped it. The progress statistics log never showed busy threads higher than 3. According to the progress...
robeger
Aug 5, 2004
10:00 pm
749
Hi Rob, It would be great if you could just copy and paste the order file, seeds, and per-host settings (if any). If I don't see anything unusual in the...
Igor Ranitovic
iranitovic
Aug 5, 2004
10:10 pm
750
Because of the crawler's adjustable politeness settings, with respect to any one given remote server, it will typically spend more time waiting between fetches...
Gordon Mohr
gojomo
Aug 5, 2004
10:13 pm
751
Igor, Here you go: seeds.txt (the ones before the blank line are as I entered them): # Seed URIs # enter one per line http://www.garvins.com ...
robeger
Aug 5, 2004
10:30 pm
752
So seeing a docs processed per second rate of less than 1 is normal? (that's what I was seeing before I stopped it) Maybe I have my politeness settings too...
robeger
Aug 5, 2004
10:33 pm
753
You are waiting at least 5 seconds between fetches from a single host. <integer name="max-delay-ms">10000</integer> <integer name="min-delay-ms">5000</integer>...
Igor Ranitovic
iranitovic
Aug 5, 2004
11:49 pm
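To make the arithmetic behind message 753 concrete, here is a rough sketch of how a per-host politeness delay limits throughput. The `min-delay-ms` and `max-delay-ms` values are the ones quoted in the message. Everything else is an assumption for illustration: the class and method names are hypothetical, the clamping model (a delay factor times the last fetch duration, bounded by the min and max) is a simplified guess at how such settings interact, and the count of 4 busy hosts is an example value.

```java
// Hypothetical sketch; a simplified model, not Heritrix code.
final class PolitenessSketch {

    // Assumed model: wait delayFactor * lastFetchMs, clamped to [minDelayMs, maxDelayMs].
    static long politenessDelayMs(double delayFactor, long lastFetchMs,
                                  long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(delayFactor * lastFetchMs);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    // Upper bound on fetch rate: each busy host yields at most one document
    // per minDelayMs, ignoring the time the fetch itself takes.
    static double maxDocsPerSecond(int busyHosts, long minDelayMs) {
        if (minDelayMs <= 0) {
            throw new IllegalArgumentException("minDelayMs must be positive");
        }
        return busyHosts * 1000.0 / minDelayMs;
    }

    public static void main(String[] args) {
        long minDelayMs = 5000;   // from the message
        long maxDelayMs = 10000;  // from the message
        // Assumed values: delay factor 5.0, last fetch took 300 ms.
        System.out.println(politenessDelayMs(5.0, 300, minDelayMs, maxDelayMs) + " ms");
        // Assumed value: 4 hosts busy at once.
        System.out.println(maxDocsPerSecond(4, minDelayMs) + " docs/s at most");
    }
}
```

Under these assumptions, a handful of busy hosts with a 5 second minimum delay cannot reach one document per second, which is consistent with the rate asked about in message 752. Lowering the minimum delay or crawling more hosts in parallel raises the ceiling.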
754
... It depends. Even though you've started with 20-something sites, if many of them are small, they'll finish early and then you'll only be left with a handful...
Gordon Mohr (@Interne...
gojomo
Aug 6, 2004
4:16 am
755
Hi all, I use Heritrix to download pages and Lucene to index them. I have an idea to add a PageRank plugin for Heritrix, just like Google. For example, if...
zhousp
zhousp@...
Aug 6, 2004
6:32 am
756
You probably want to have a look at Nutch (http://www.nutch.org/). From the website: "Nutch is a nascent effort to implement an open-source web search ...
xiaoming liu
xmliu_23508
Aug 6, 2004
5:08 pm
757
We've released 1.0.0. See http://crawler.archive.org/changes-report.html#1.0.0 for a list of changes. See...
stack
stack@...
Aug 6, 2004
5:40 pm
758
... Thanks for the pointer. I was going to look at it next week. I'll take a look and see what it'd take to hook up Heritrix and Nutch. Yours, St.Ack...
stack
stack@...
Aug 6, 2004
6:57 pm
759
Congratulations, Heritrixians, on 1.0 -- you've done a tremendous amount in a remarkably short time. As the St.Ack says: "good stuff!" So with the code freeze...
Andy Boyko
andyboyko
Aug 6, 2004
9:37 pm