archive-crawler

Group Information

  • Members: 795
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English

Messages

Messages 730 - 759 of 8128
730 zhousp
zhousp@...
Aug 2, 2004
12:43 am
Kristinn Sigurdsson, hello! There are about 30 sites in the seeds.txt file, and this morning I found that 1917665 of 2311102 (82%) have been...
731 Gordon Mohr (@Interne...
gojomo
Aug 2, 2004
5:38 am
I see there are only 6 host queues in existence. There will never be more active threads than host queues, and usually much fewer threads, as some queues will...
732 Yousef Ourabi
yousef_ourabi
Aug 2, 2004
8:07 am
Hi, I have a question about the arcwriter in heritrix/bin. When I pass in an arc.gz file, here is what I get. ...
733 Lars Clausen
lrclause
Aug 2, 2004
8:34 am
... Thank you, that is a lot clearer in the manual at least. -Lars...
734 stack
stack@...
Aug 2, 2004
2:53 pm
... The av_* programs, they are not freely available. You could use the bin/arcreader to extract any of the listed files by passing an offset, field 3 from the...
735 nhckbdk
Aug 2, 2004
3:08 pm
Hi, thanks for your reply! Hadn't noticed about the socket stuff and the CDX functionality. ... of ... and ... jar. I agree. I managed to separate out a...
736 stack
stack@...
Aug 2, 2004
4:34 pm
Just a heads up that we're preparing to release 1.0.0 by the end of this week. We're currently moving internal crawls to CVS HEAD to make sure all's well...
737 stack
stack@...
Aug 2, 2004
5:25 pm
... Sorry about that. Post 1.0.0, let's break it out. We've been talking about a 'tools' sourceforge project for a while now. We could put the arc...
738 nhckbdk
Aug 3, 2004
3:44 pm
... No need. It wasn't designed for that. :-) ... Super. Thanks! ... Looks good. ... What I'm using it for is an application where I'm running several, ...
739 stack
stack@...
Aug 3, 2004
4:07 pm
... Ok. Maybe then this class would just read in all of the ARCRecord on construction? Then you could do without the readAll. Instead it'd become a 'byte []...
740 robeger
Aug 4, 2004
10:42 pm
Hi Kris, Read through your proposal and it looks good to me. Sounds like it would provide the functionality that we want. I know you're all busy with the...
741 Kristinn Sigurdsson
kristsi25
Aug 5, 2004
9:28 am
Current status of the AR module is ‘just getting started’. Don’t expect to see anything ready for serious use for at least two to three months. Depending...
742 Lars Clausen
lrclause
Aug 5, 2004
9:39 am
... I think reading the whole record on construction is a bad idea; it prevents you from just reading the metadata quickly. A getBytes() method could...
743 stack
stack@...
Aug 5, 2004
3:07 pm
... Ok. Just a suggestion. I was thinking the proposed child classes' main selling point was that it could be detached from the serial reading of the ARC and...
744 Gordon Mohr (@Interne...
gojomo
Aug 5, 2004
5:02 pm
... To avoid destabilizing or delaying 1.0.0, checkpointing implementation has been deferred. It's the first post-1.0.0 priority. In the meantime, we've...
745 robeger
Aug 5, 2004
9:29 pm
I've got a crawl running with 21 seed URIs, and the max toe threads is set to 50. I've never seen a higher number than "4 of 50" in the active threads value...
746 robeger
Aug 5, 2004
9:33 pm
Here's a snapshot of the threads report (there seem to be a lot of threads with long wait times): Toe threads report - 200408052130 Job being crawled: 21siteCrawl ...
747 Igor Ranitovic
iranitovic
Aug 5, 2004
9:51 pm
Hi Rob, For how long have you been crawling these seeds? Is it possible that the crawl is almost done and only one host is left to be crawled? In...
748 robeger
Aug 5, 2004
10:00 pm
Hi Igor, It was running for about 30 mins when I stopped it. The progress statistics log never showed busy threads higher than 3. According to the progress...
749 Igor Ranitovic
iranitovic
Aug 5, 2004
10:10 pm
Hi Rob, It would be great if you could just copy and paste the order file, seeds, and per-host settings (if any). If I don't see anything unusual in the...
750 Gordon Mohr
gojomo
Aug 5, 2004
10:13 pm
Because of the crawler's adjustable politeness settings, with respect to any one given remote server, it will typically spend more time waiting between fetches...
751 robeger
Aug 5, 2004
10:30 pm
Igor, Here you go: seeds.txt (the ones before the blank line are as I entered them): # Seed URIs # enter one per line http://www.garvins.com ...
752 robeger
Aug 5, 2004
10:33 pm
So seeing a docs processed per second rate of less than 1 is normal? (that's what I was seeing before I stopped it) Maybe I have my politeness settings too...
753 Igor Ranitovic
iranitovic
Aug 5, 2004
11:49 pm
You are waiting at least 5 seconds between fetches from a single host. <integer name="max-delay-ms">10000</integer> <integer name="min-delay-ms">5000</integer>...
754 Gordon Mohr (@Interne...
gojomo
Aug 6, 2004
4:16 am
... It depends. Even though you've started with 20-something sites, if many of them are small, they'll finish early and then you'll only be left with a handful...
755 zhousp
zhousp@...
Aug 6, 2004
6:32 am
Hi all, I use Heritrix to download pages and Lucene to index them. I have an idea to add a PageRank plugin for Heritrix, just like Google. For example, if...
756 xiaoming liu
xmliu_23508
Aug 6, 2004
5:08 pm
You probably want to have a look at Nutch (http://www.nutch.org/). From the website: "Nutch is a nascent effort to implement an open-source web search ...
757 stack
stack@...
Aug 6, 2004
5:40 pm
We've released 1.0.0. See http://crawler.archive.org/changes-report.html#1.0.0 for a list of changes. See...
758 stack
stack@...
Aug 6, 2004
6:57 pm
... Thanks for the pointer. I was going to look at it next week. I'll take a look and see what it'd take to hook up heritrix and nutch. Yours, St.Ack...
759 Andy Boyko
andyboyko
Aug 6, 2004
9:37 pm
Congratulations, Heritrixians, on 1.0 -- you've done a tremendous amount in a remarkably short time. As the St.Ack says: "good stuff!" So with the code freeze...
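
The thread in messages 745 to 754 turns on two limits: there are never more active threads than host queues (message 731), and the crawler waits at least min-delay-ms between fetches from a single host (message 753). A rough sketch of that arithmetic follows. The function names are hypothetical and not part of Heritrix, and the rate is only an upper bound, since it ignores fetch time and any delay above the minimum.

```python
def max_active_threads(max_toe_threads: int, host_queues: int) -> int:
    """Cap on busy threads: never more active threads than host queues."""
    if max_toe_threads < 0 or host_queues < 0:
        raise ValueError("counts must be non-negative")
    return min(max_toe_threads, host_queues)


def max_docs_per_second(active_hosts: int, min_delay_ms: int) -> float:
    """Upper bound on fetch rate if each host is hit at most once per min_delay_ms.

    Assumption: fetch time itself is treated as zero, so real rates are lower.
    """
    if active_hosts < 0:
        raise ValueError("active_hosts must be non-negative")
    if min_delay_ms <= 0:
        raise ValueError("min_delay_ms must be positive")
    return active_hosts * 1000.0 / min_delay_ms


if __name__ == "__main__":
    MIN_DELAY_MS = 5000  # value quoted in message 753
    # 50 max toe threads (message 745) against 6 host queues (message 731)
    print("thread cap:", max_active_threads(50, 6))
    for hosts in (3, 4, 21):
        rate = max_docs_per_second(hosts, MIN_DELAY_MS)
        print(f"{hosts} active hosts -> at most {rate:.1f} docs/sec")
```

With a 5000 ms minimum delay, four active hosts allow at most 0.8 documents per second, which is consistent with the rate of less than 1 reported in message 752.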