Hello Kristinn Sigurdsson! There are about 30 sites in the seeds.txt file, and this morning I found that 1917665 of 2311102 (82%) have been...
zhousp
zhousp@...
Aug 2, 2004 12:43 am
731
I see there are only 6 host queues in existence. There will never be more active threads than host queues, and usually far fewer threads, as some queues will...
... The av_* programs, they are not freely available. You could use the bin/arcreader to extract any of the listed files by passing an offset, field 3 from the...
stack
stack@...
Aug 2, 2004 2:53 pm
735
Hi, thanks for your reply! Hadn't noticed about the socket stuff and the CDX functionality. ... of ... and ... jar. I agree. I managed to separate out a...
Just a heads up that we're preparing to release 1.0.0 by the end of this week. We're currently moving internal crawls to CVS HEAD to make sure all's well...
stack
stack@...
Aug 2, 2004 4:34 pm
737
... Sorry about that. Post 1.0.0, let's break it out. We've been talking about a 'tools' sourceforge project for a while now. We could put the arc...
stack
stack@...
Aug 2, 2004 5:25 pm
738
... No need. It wasn't designed for that. :-) ... Super. Thanks! ... Looks good. ... What I'm using it for is an application where I'm running several, ...
... Ok. Maybe then this class would just read in all of the ARCRecord on construction? Then you could do without the readAll. Instead it'd become a 'byte []...
stack
stack@...
Aug 3, 2004 4:07 pm
740
Hi Kris, Read through your proposal and it looks good to me. Sounds like it would provide the functionality that we want. I know you're all busy with the...
Current status of the AR module is ‘just getting started’. Don’t expect to see anything ready for serious use for at least two to three months. Depending...
... I think reading the whole record on construction is a bad idea; it prevents you from just reading the metadata quickly. A getBytes() method could...
... Ok. Just a suggestion. I was thinking the proposed child classes' main selling point was that it could be detached from the serial reading of the ARC and...
stack
stack@...
Aug 5, 2004 3:07 pm
744
... To avoid destabilizing or delaying 1.0.0, checkpointing implementation has been deferred. It's the first post-1.0.0 priority. In the meantime, we've...
I've got a crawl running with 21 seed URIs, and the max toe threads is set to 50. I've never seen a higher number than "4 of 50" in the active threads value...
Here's a snapshot of the threads report (there seem to be a lot of threads with long wait times): Toe threads report - 200408052130 Job being crawled: 21siteCrawl ...
Hi Igor, It was running for about 30 mins when I stopped it. The progress statistics log never showed busy threads higher than 3. According to the progress...
Hi Rob, It would be great if you just copy and paste the order file, seeds, and per-host settings (if any). If I don't see anything unusual in the...
Because of the crawler's adjustable politeness settings, with respect to any one given remote server, it will typically spend more time waiting between fetches...
So seeing a docs processed per second rate of less than 1 is normal? (that's what I was seeing before I stopped it) Maybe I have my politeness settings too...
You are waiting at least 5 seconds between fetches from a single host. <integer name="max-delay-ms">10000</integer> <integer name="min-delay-ms">5000</integer>...
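The arithmetic behind that observation can be sketched as follows. This is only a rough upper bound, assuming each active host is contacted at most once per min-delay-ms and ignoring the time the fetch itself takes; the function name and the host counts tried are illustrative, not part of Heritrix.

```python
def max_docs_per_second(active_hosts: int, min_delay_ms: int) -> float:
    """Upper bound on the crawl rate when every active host is fetched
    at most once per min_delay_ms (fetch time itself is ignored)."""
    if active_hosts < 0:
        raise ValueError("active_hosts must not be negative")
    if min_delay_ms <= 0:
        raise ValueError("min_delay_ms must be positive")
    return active_hosts * 1000.0 / min_delay_ms


if __name__ == "__main__":
    MIN_DELAY_MS = 5000  # value of min-delay-ms quoted in the thread
    # Hypothetical counts of hosts that still have URIs queued.
    for hosts in (3, 4, 6, 21):
        rate = max_docs_per_second(hosts, MIN_DELAY_MS)
        print(f"{hosts:2d} active hosts -> at most {rate:.1f} docs/sec")
```

With only three or four hosts still active, the bound stays below one document per second, which matches the rate reported in the thread.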
... It depends. Even though you've started with 20-something sites, if many of them are small, they'll finish early and then you'll only be left with a handful...
Hi all, I use Heritrix to download pages and Lucene to index them. I have an idea to add a PageRank plugin for Heritrix, just like Google. For example, if...
zhousp
zhousp@...
Aug 6, 2004 6:32 am
756
You probably want to have a look at Nutch (http://www.nutch.org/). From the website: "Nutch is a nascent effort to implement an open-source web search ...
We've released 1.0.0. See http://crawler.archive.org/changes-report.html#1.0.0 for a list of changes. See...
stack
stack@...
Aug 6, 2004 5:40 pm
758
... Thanks for the pointer. I was going to look at it next week. I'll take a look and see what it'd take to hook up Heritrix and Nutch. Yours, St.Ack...
stack
stack@...
Aug 6, 2004 6:57 pm
759
Congratulations, Heritrixians, on 1.0 -- you've done a tremendous amount in a remarkably short time. As the St.Ack says: "good stuff!" So with the code freeze...