archive-crawler
Messages 54 - 84 of 6140
54
Gordon, I am presently working on doing buffered i/o over RandomAccessFile on the spilled files. On some of the other issues listed below, please send in your...
G.B.Reddy
gbreddysoft
May 5, 2003
2:19 pm
55
Sorry for not getting back to you sooner while I was travelling. Re: VirtualBuffers I think that initially, it is OK to assume that the virtualbuffers are only...
Gordon Mohr
gojomo
May 12, 2003
6:06 pm
56
Sorry for not getting back to you sooner while I was travelling. Re: VirtualBuffers I think that initially, it is OK to assume that the virtualbuffers are only...
Gordon Mohr
gojomo
May 12, 2003
6:06 pm
57
Sorry for not getting back to you sooner while I was travelling. Re: VirtualBuffers I think that initially, it is OK to assume that the virtualbuffers are only...
Gordon Mohr
gojomo
May 12, 2003
6:07 pm
58
Sorry for not getting back to you sooner while I was travelling. Re: VirtualBuffers I think that initially, it is OK to assume that the virtualbuffers are only...
Gordon Mohr
gojomo
May 12, 2003
6:25 pm
59
Gordon, Thanks for the clarifications. I will work on it to get it done. We shall have the weekly conf call tomorrow at 8:30pm PST. The updated project...
G.B.Reddy
gbreddysoft
May 13, 2003
3:15 pm
60
At our Friday April 25th meeting at the Archive, we decided that in the interest of having a demoable and focused-usable crawler as soon as possible, we would...
Gordon Mohr
gojomo
May 14, 2003
1:15 am
61
... Reviewing this document ("CVSInstructions.txt"), I don't fully agree with putting everything in a single CVS module. In particular, I still want to use a...
Gordon Mohr
gojomo
May 14, 2003
1:15 am
62
Raymie pointed out an interesting possibility in design comments a while back: that DNS lookups that occur during the crawl could be handled as just another...
Gordon Mohr
gojomo
May 14, 2003
1:35 am
63
Gordon and Raymie, Attached Synch.zip contains the following changes on the Sync model. -- A new SampleLinkExtractor.java added which does some preliminary...
G.B.Reddy
gbreddysoft
May 15, 2003
2:49 am
64
A draft overview of our planned crawler architecture, and the steps towards fully implementing it, is available for review. ...
Gordon Mohr
gojomo
May 20, 2003
9:13 pm
65
Sometimes Raymie and I have had private exchanges about design issues that should really be copied to the archive-crawler discussion list. We'll try to direct...
Gordon Mohr
gojomo
May 29, 2003
9:17 pm
67
I've been hammering out the details of a basic scheduler/store/selector (SSS) implementation: one which does not yet use persistent disk for large crawls or...
Gordon Mohr
gojomo
May 29, 2003
11:02 pm
68
This regexp... ("|')([^\.\n\r\s'"]*(\.[^\.\n\r\s'"]+)+)(\1) ...does a fair job of selecting just those strings from javascript code that are highly likely to...
Gordon Mohr
gojomo
May 30, 2003
12:17 am
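As a rough illustration of how the regexp quoted in message 68 could be applied, here is a minimal Java sketch. The class name JsUriCandidates and the method extract are hypothetical and are not taken from the crawler code discussed on the list; the sample JavaScript input is also invented. The sketch only collects group 2 of each match (the text between matching quotes); what a crawler would do with those candidates afterwards is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the project's code.
class JsUriCandidates {
    // The regexp from the message, escaped for a Java string literal.
    // Group 1: opening quote. Group 2: quoted text with no whitespace or
    // quotes that contains at least one dot followed by non-dot characters.
    // Group 4: the closing quote, which must equal the opening quote.
    static final Pattern CANDIDATE = Pattern.compile(
        "(\"|')([^\\.\\n\\r\\s'\"]*(\\.[^\\.\\n\\r\\s'\"]+)+)(\\1)");

    // Returns the quoted strings in the given JavaScript text that match.
    static List<String> extract(CharSequence javascript) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(javascript);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Invented sample input: two dotted strings and one plain word.
        String js = "var a = \"images/logo.gif\"; var b = 'hello'; "
                  + "window.open(\"http://example.com/page.html\");";
        // 'hello' has no dot, so it is skipped.
        System.out.println(extract(js));
    }
}
```

Because the pattern requires a dot inside the quotes and excludes whitespace, ordinary words and sentences in string literals are skipped, while file names and host names are kept.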
69
It just occurred to me that when we crawl "https:" sites, we may want to archive the server-side SSL certificates presented at crawl-time. - Gordon...
Gordon Mohr
gojomo
Jun 6, 2003
11:21 pm
70
Looking to see if previous projects had created testing websites for crawlers, I came across Funnelback, a Java crawler I hadn't heard of before. Their main...
Gordon Mohr
gojomo
Jun 9, 2003
11:58 pm
71
Attached is a document that summarizes what we now know about DNS, URIs, and arc files, and how these could be made to play nice with one another. Gordon and...
Parker Thompson
michaelparke...
Jun 10, 2003
4:53 pm
72
At my request, Judy has set up a wiki and weblog for our project at our Sourceforge website -- which is now conveniently aliased to: http://crawler.archive.org...
Gordon Mohr
gojomo
Jun 10, 2003
5:29 pm
73
... No, RFC 2540 prescribes a binary format and a text equivalent, both of which contain the same fields. ... To the best of my understanding no. I did not...
Parker Thompson
michaelparke...
Jun 10, 2003
5:31 pm
74
... To clarify a bit, the binary and text formats prescribed by RFC 2540 are equivalent, though these do not represent all information contained in a raw DNS...
Parker Thompson
michaelparke...
Jun 10, 2003
6:05 pm
75
In general, we should try to follow the Java code formatting and naming conventions set out by Sun at: ...
Gordon Mohr
gojomo
Jun 12, 2003
7:18 pm
76
I should confess that in the code I've written so far, many of these conventions have been violated -- in particular the practice of always declaring variables...
Gordon Mohr
gojomo
Jun 12, 2003
7:41 pm
77
I put some brainstorming on what the test "web garden" should cover on the project Wiki at: http://crawler.archive.org/cgi-bin/wiki.pl?WebGarden Feel free to...
Gordon Mohr
gojomo
Jun 16, 2003
7:42 pm
78
[moving discussion to archive-crawler@yahoogroups.com] Looks good! I think we'll want to split up the tests into at least 3 non-overlapping (not cross-linked)...
Gordon Mohr
gojomo
Jun 19, 2003
12:01 am
79
FYI: the searchtools guys are up for us using anything we like, as long as credit is given and we say hey to Brewster on behalf of Avi ;). pt. -- Parker...
Parker Thompson
michaelparke...
Jun 19, 2003
12:41 am
80
According to rfc1808.txt, a relative URL "../testDotDot.html" with a base "http://test.com" should be resolved to the absolute URL ...
Igor Ranitovic
igor@...
Jun 20, 2003
8:38 pm
81
We're getting a good collection of tests at... http://crawl08.archive.org/index-2.html and http://crawl08.archive.org/newtest/ But, could we split them up as...
Gordon Mohr
gojomo
Jun 24, 2003
6:46 pm
82
Another related open-source project, Nutch, which includes a crawler as part of its functionality: http://www.nutch.org "Nutch is a nascent effort to implement...
Gordon Mohr
gojomo
Jun 24, 2003
7:20 pm
83
We were briefly putting JUnit test code into subpackages named 'test', in the main source tree, at the same level as the code they tested. However, Reddy had...
Gordon Mohr
gojomo
Jun 24, 2003
7:36 pm
84
A 1.1MB arc.gz of what the dev version of Heritrix gets, when crawling from... http://crawl08.archive.org ...is available in my archive home directory, ...
Gordon Mohr
gojomo
Jun 26, 2003
11:21 pm