Gordon, Attached are - A proposal for the JAnt based nightly builds and JUnit based unit tests. Please review it. - Schedule for the work items with...
On a recent test crawl, we stepped on an interesting, unintentional crawler trap related to soft 404s, relative URLs, and the implicit typing of certain...
... I think a number of recent commercial distros have backported the O(1) scheduler themselves: Red Hat 8 and Suse 8.1, at least, maybe recent Mandrakes. -...
Gordon, I am presently working on the SEDA socket overhead determination work. I will get the results in the next two days. Some initial insights into it are -...
Thanks for the update! The OceanStore libhttp package is known to be very rough; really just a placeholder or starting point for what we'd need. (Note that...
Gordon, As you say, the aSocketInputStream is needed only if the client stage is multithreaded. But the oceanstore "HttpClient" stage is internally using it to...
Just to capture the idea I mentioned yesterday in the archives: A potential way to extract JavaScript-synthesized URIs from web pages without integrating a...
Gordon, Please find attached the performance report on the SEDA aSocket NIO layer. Look for the last section in the document "SEDA NIO Socket Framework" and...
Thanks for the updated analysis! However, I am concerned that the results may be more a result of the test design or the specific HTTP implementation we're...
... Tuesday isn't good for me; how about Wednesday 8:30p PT (3:30UTC) instead? ... This looks good; I've been getting the notifications. ... I just wanted to...
I've moved some things out of the Anecdote CVS module, as that was never intended to be the all-inclusive home of our work. The socket tests have moved to a...
Gordon, The various tests and the results that we got today are as follows. (In the lines below, Java downloader means the HTTP downloader which we used...
Gordon, I had a look into the SEDA code to understand the synchronization issues which we agreed on the day of discussion could be the reason behind low...
Gordon, The updated performance doc is attached. Please review the "test results section" and the "other misc results section" in the SEDA NIO pages. The JDK...
Gordon, The attached package contains the necessary sources and scripts to run the ( SEDA and non-SEDA ) downloaders. A readme is also present in it. The seda...
Thanks! Some thoughts: I'd like to approach this part of the system -- buffers/streams for multi-Kb entities across one processing cycle -- at three separate ...
Raymie, On the first day we discussed the memory pool manager, we decided that the big 8MB chunk of memory would be broken into pieces of 4K each. And the...
Gordon, The MemPoolManager updates are checked into the CVS in the ArchiveOpenCrawler module ( in the same org.archive.crawler.io package ). Some of the...
Gordon, I am presently working on doing buffered i/o over RandomAccessFile on the spilled files. On some of the other issues listed below, please send in your...
Sorry for not getting back to you sooner while I was travelling. Re: VirtualBuffers I think that initially, it is OK to assume that the virtualbuffers are only...
Gordon, Thanks for the clarifications. I will work on it to get it done. We shall have the weekly conf call tomorrow at 8:30pm PST. The updated project...
At our Friday April 25th meeting at the Archive, we decided that in the interest of having a demoable and focused-usable crawler as soon as possible, we would...
... Reviewing this document ("CVSInstructions.txt"), I don't fully agree with putting everything in a single CVS module. In particular, I still want to use a...
Raymie pointed out an interesting possibility in design comments a while back: that DNS lookups that occur during the crawl could be handled as just another...
Gordon and Raymie, Attached Synch.zip contains the following changes on the Sync model. -- A new SampleLinkExtractor.java added which does some preliminary...