Kristinn Sigurdsson, hello! There are about 30 sites in the seeds.txt file, and this morning I found there 1917665 of 2311102 (82%) have been...
731
Gordon Mohr (@Interne...
gojomo
Aug 2, 2004 5:38 am
I see there are only 6 host queues in existence. There will never be more active threads than host queues, and usually much fewer threads, as some queues will...
732
Yousef Ourabi
yousef_ourabi
Aug 2, 2004 8:07 am
Hi, I have a question about the arcwriter in heritrix/bin. When I pass in an arc.gz file, here is what I get. ...
733
Lars Clausen
lrclause
Aug 2, 2004 8:34 am
... Thank you, that is a lot clearer in the manual at least. -Lars...
734
stack
stack@...
Aug 2, 2004 2:53 pm
... The av_* programs, they are not freely available. You could use the bin/arcreader to extract any of the listed files by passing an offset, field 3 from the...
735
nhckbdk
Aug 2, 2004 3:08 pm
Hi, thanks for your reply! Hadn't noticed about the socket stuff and the CDX functionality. ... of ... and ... jar. I agree. I managed to separate out a...
736
stack
stack@...
Aug 2, 2004 4:34 pm
Just a heads up that we're preparing to release 1.0.0 by the end of this week. We're currently moving internal crawls to CVS HEAD to make sure all's well...
737
stack
stack@...
Aug 2, 2004 5:25 pm
... Sorry about that. Post 1.0.0, let's break it out. We've been talking about a 'tools' sourceforge project for a while now. We could put the arc...
738
nhckbdk
Aug 3, 2004 3:44 pm
... No need. It wasn't designed for that. :-) ... Super. Thanks! ... Looks good. ... What I'm using it for is an application where I'm running several, ...
739
stack
stack@...
Aug 3, 2004 4:07 pm
... Ok. Maybe then this class would just read in all of the ARCRecord on construction? Then you could do without the readAll. Instead it'd become a 'byte []...
740
robeger
Aug 4, 2004 10:42 pm
Hi Kris, Read through your proposal and it looks good to me. Sounds like it would provide the functionality that we want. I know you're all busy with the...
741
Kristinn Sigurdsson
kristsi25
Aug 5, 2004 9:28 am
Current status of the AR module is ‘just getting started’. Don’t expect to see anything ready for serious use for at least two to three months. Depending...
742
Lars Clausen
lrclause
Aug 5, 2004 9:39 am
... I think reading the whole record on construction is a bad idea, it prevents you from just reading the metadata quickly. A getBytes() method could...
743
stack
stack@...
Aug 5, 2004 3:07 pm
... Ok. Just a suggestion. I was thinking the proposed child classes' main selling point was that it could be detached from the serial reading of the ARC and...
744
Gordon Mohr (@Interne...
gojomo
Aug 5, 2004 5:02 pm
... To avoid destabilizing or delaying 1.0.0, checkpointing implementation has been deferred. It's the first post-1.0.0 priority. In the meantime, we've...
745
robeger
Aug 5, 2004 9:29 pm
I've got a crawl running with 21 seed URIs, and the max toe threads is set to 50. I've never seen a higher number than "4 of 50" in the active threads value...
746
robeger
Aug 5, 2004 9:33 pm
Here's a snapshot of the threads report (there seem to be a lot of threads with long wait times): Toe threads report - 200408052130 Job being crawled: 21siteCrawl ...
747
Igor Ranitovic
iranitovic
Aug 5, 2004 9:51 pm
Hi Rob, For how long have you been crawling these seeds? Is it possible that the crawl is almost done and only one host is left to be crawled? In...
748
robeger
Aug 5, 2004 10:00 pm
Hi Igor, It was running for about 30 mins when I stopped it. The progress statistics log never showed busy threads higher than 3. According to the progress...
749
Igor Ranitovic
iranitovic
Aug 5, 2004 10:10 pm
Hi Rob, It would be great if you just copy and paste the order file, seeds and per host settings (if any). If I don't see anything unusual in the...
750
Gordon Mohr
gojomo
Aug 5, 2004 10:13 pm
Because of the crawler's adjustable politeness settings, with respect to any one given remote server, it will typically spend more time waiting between fetches...
751
robeger
Aug 5, 2004 10:30 pm
Igor, Here you go: seeds.txt (the ones before the blank line are as I entered them): # Seed URIs # enter one per line http://www.garvins.com ...
752
robeger
Aug 5, 2004 10:33 pm
So seeing a docs processed per second rate of less than 1 is normal? (that's what I was seeing before I stopped it) Maybe I have my politeness settings too...
753
Igor Ranitovic
iranitovic
Aug 5, 2004 11:49 pm
You are waiting at least 5 seconds between fetches from a single host. <integer name="max-delay-ms">10000</integer> <integer name="min-delay-ms">5000</integer>...
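A rough sketch of why these settings cap the crawl rate, using the min-delay-ms and max-delay-ms values quoted above. The delay factor (5.0), the 200 ms fetch time, and the helper names are assumptions for illustration only, not taken from the order file; the four active hosts mirror the "4 of 50" active threads reported earlier in the thread.

```python
def politeness_delay_ms(fetch_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Wait before the next fetch from the same host.

    Assumed model: a multiple of the last fetch duration,
    clamped to the [min_delay_ms, max_delay_ms] range.
    """
    return max(min_delay_ms, min(max_delay_ms, delay_factor * fetch_ms))


def max_docs_per_second(active_hosts, fetch_ms, delay_ms):
    """Upper bound on throughput when each host is fetched one URI at a time."""
    return active_hosts * 1000.0 / (fetch_ms + delay_ms)


# Values from the quoted settings: min 5000 ms, max 10000 ms.
# fetch_ms and delay_factor are hypothetical.
delay = politeness_delay_ms(fetch_ms=200, delay_factor=5.0,
                            min_delay_ms=5000, max_delay_ms=10000)
rate = max_docs_per_second(active_hosts=4, fetch_ms=200, delay_ms=delay)
print(delay, round(rate, 3))
```

Under these assumptions the minimum delay dominates, so four active hosts cannot yield more than roughly one document per second in total, whatever the max toe threads setting is.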
754
Gordon Mohr (@Interne...
gojomo
Aug 6, 2004 4:16 am
... It depends. Even though you've started with 20-something sites, if many of them are small, they'll finish early and then you'll only be left with a handful...
755
zhousp
zhousp@...
Aug 6, 2004 6:32 am
Hi all, I use Heritrix to download pages and Lucene to index them. I have an idea to add a PageRank plugin for Heritrix, just like Google. For example, if...
756
xiaoming liu
xmliu_23508
Aug 6, 2004 5:08 pm
You probably want to have a look at Nutch (http://www.nutch.org/). From the website: "Nutch is a nascent effort to implement an open-source web search ...
757
stack
stack@...
Aug 6, 2004 5:40 pm
We've released 1.0.0. See http://crawler.archive.org/changes-report.html#1.0.0 for a list of changes. See...
758
stack
stack@...
Aug 6, 2004 6:57 pm
... Thanks for the pointer. I was going to look at it next week. I'll take a look and see what it'd take to hook up Heritrix and Nutch. Yours, St.Ack...
759
Andy Boyko
andyboyko
Aug 6, 2004 9:37 pm
Congratulations, Heritrixians, on 1.0 -- you've done a tremendous amount in a remarkably short time. As the St.Ack says: "good stuff!" So with the code freeze...