Gordon, I am currently implementing buffered I/O over RandomAccessFile for the spilled files. On some of the other issues listed below, please send in your...
Sorry for not getting back to you sooner while I was travelling. Re: VirtualBuffers I think that initially, it is OK to assume that the virtualbuffers are only...
Gordon, Thanks for the clarifications. I will work on it to get it done. We shall have the weekly conf call tomorrow at 8:30pm PST. The updated project...
At our Friday April 25th meeting at the Archive, we decided that in the interest of having a demoable and focused-usable crawler as soon as possible, we would...
... Reviewing this document ("CVSInstructions.txt"), I don't fully agree with putting everything in a single CVS module. In particular, I still want to use a...
Raymie pointed out an interesting possibility in design comments a while back: that DNS lookups that occur during the crawl could be handled as just another...
Gordon and Raymie, The attached Synch.zip contains the following changes to the Sync model. -- A new SampleLinkExtractor.java has been added, which does some preliminary...
Sometimes Raymie and I have had private exchanges about design issues that should really be copied to the archive-crawler discussion list. We'll try to direct...
I've been hammering out the details of a basic scheduler/store/selector (SSS) implementation: one which does not yet use persistent disk for large crawls or...
This regexp... ("|')([^\.\n\r\s'"]*(\.[^\.\n\r\s'"]+)+)(\1) ...does a fair job of selecting just those strings from JavaScript code that are highly likely to...
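As a rough illustration of what the regexp above picks out, here is a minimal Java sketch. The class name JsStringCandidates, the method extract, and the sample JavaScript are all made up for this example and are not taken from the crawler code. The pattern itself is the one quoted in the message, only escaped for a Java string literal. It matches a quoted string that contains no whitespace or quote characters and has at least one dot followed by a non-empty segment; the preview is cut off before saying what such strings are likely to be, so the sketch makes no claim about that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsStringCandidates {
    // The regexp from the message, escaped for a Java string literal.
    // Group 1 is the opening quote, group 2 the quoted text, and \1
    // requires the closing quote to be the same character as the opening one.
    static final Pattern DOTTED_STRING = Pattern.compile(
        "(\"|')([^\\.\\n\\r\\s'\"]*(\\.[^\\.\\n\\r\\s'\"]+)+)(\\1)");

    // Returns the text of every quoted string in js that the pattern accepts,
    // without the surrounding quotes (hypothetical helper for this sketch).
    static List<String> extract(CharSequence js) {
        List<String> found = new ArrayList<>();
        Matcher m = DOTTED_STRING.matcher(js);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up JavaScript fragment: two dotted strings and one plain word.
        String js = "var img = \"images/logo.gif\"; var s = 'hello';"
                  + " window.open('http://example.com/page.html');";
        System.out.println(extract(js));
    }
}
```

In this sketch the plain word 'hello' is skipped because it contains no dot, and any quoted string containing a space is skipped because whitespace is excluded from the character classes.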
Looking to see if previous projects had created testing websites for crawlers, I came across Funnelback, a Java crawler I hadn't heard of before. Their main...
Attached is a document that summarizes what we now know about DNS, URIs, and arc files, and how these could be made to play nice with one another. Gordon and...
At my request, Judy has set up a wiki and weblog for our project at our Sourceforge website -- which is now conveniently aliased to: http://crawler.archive.org...
... No, RFC 2540 prescribes a binary format and a text equivalent, both of which contain the same fields. ... To the best of my understanding no. I did not...
... To clarify a bit, the binary and text formats prescribed by RFC 2540 are equivalent, though these do not represent all information contained in a raw DNS...
I should confess that in the code I've written so far, many of these conventions have been violated -- in particular the practice of always declaring variables...
I put some brainstorming on what the test "web garden" should cover on the project Wiki at: http://crawler.archive.org/cgi-bin/wiki.pl?WebGarden Feel free to...
[moving discussion to archive-crawler@yahoogroups.com] Looks good! I think we'll want to split up the tests into at least 3 non-overlapping (not cross-linked)...
FYI: the searchtools guys are up for us using anything we like, as long as credit is given and we say hey to Brewster on behalf of Avi ;). pt. -- Parker...
According to rfc1808.txt, a relative URL "../testDotDot.html" with a base "http://test.com" should resolve to the absolute URL ...
Igor Ranitovic, igor@..., Jun 20, 2003 8:38 pm, 81
We're getting a good collection of tests at... http://crawl08.archive.org/index-2.html and http://crawl08.archive.org/newtest/ But, could we split them up as...
Another related open-source project, Nutch, which includes a crawler as part of its functionality: http://www.nutch.org "Nutch is a nascent effort to implement...
We were briefly putting JUnit test code into subpackages named 'test', in the main source tree, at the same level as the code they tested. However, Reddy had...
A 1.1MB arc.gz of what the dev version of Heritrix gets, when crawling from... http://crawl08.archive.org ...is available in my archive home directory, ...