Driven by our meeting with Raymie last Thursday, and refined by further analysis, here are some notes on our design directions. = STAGED CRAWLER DESIGN NOTES =...
Gordon and Raymie, Below are the various stages and their design with the issues involved in the DNS Resolver and HTTP Client implementation. DNS History/Cache...
Patrick Eaton forwarded me a pair of staged HTTP client implementations which are part of the OceanStore project at Berkeley, and are essentially what are also...
I've just checked into Sourceforge CVS the module 'Anecdote', a first stab at a staged crawler. Right now it just sets up dummy printing stages, grabs a list...
More insight on the DNS stages. As stated in the design earlier, "DNS Querying Stage", "DNS Response Processing Stage" and "Timeout and Retry Handling Stage"...
Gordon, Igor, Raymie present. (1) Access to work in progress: start using SourceForge CVS (Post meeting note: 2 modules now exist there: 'Anecdote', a staged...
I added very dumb HTTP fetching to the Anecdote 'Fetching' stage via the Apache Commons HTTPClient library soon after my message yesterday. ... This spinning...
These are good decompositions of the steps involved, and the LGPL dnsjava library looks very useful for our needs. My tendency would be to think fewer stages...
Gordon, I am done with the asynchronous DNS code. I shall test it more tomorrow and check it in. I may start using the caching mechanism present in the dnsjava ...
Gordon, I have checked in the first version of the asynchronous DNS lookup stage (DNSLookingUp.java). I have also updated the README and the anecdote.cfg file...
I'll take a look. Don't feel obligated to go with Eclipse -- even though it is a very nice environment. Eventually we'll include versioned ant scripts with...
Gordon, Yes, as you said, dnsjava creates a new UDP socket for every message. I am planning to separate out the processing logic from the socket-related code and...
I'm trying out the 'libhttp' staged HTTP code we were passed by the Berkeley OceanStore project, and it requires all aspects of the outbound request to be...
Gordon, I have proposed a detailed design for supporting asynchronous DNS lookups in the dnsjava libraries to its author (Brian Wellington). He has yet to get ...
Major additions and changes: - Moved "pumping" activity into URIChoosing stage, so it can better react to depletion of URIs to consider - Converted most...
Gordon, Attached are - A proposal for the JAnt based nightly builds and JUnit based unit tests. Please review it. - Schedule for the work items with...
On a recent test crawl, we stepped on an interesting, unintentional crawler trap related to soft 404s, relative URLs, and the implicit typing of certain...
... I think a number of recent commercial distros have backported the O(1) scheduler themselves: Red Hat 8 and Suse 8.1, at least, maybe recent Mandrakes. -...
Gordon, I am presently working on the SEDA socket overhead determination work. I will get the results in the next two days. Some initial insights into it are -...
Thanks for the update! The OceanStore libhttp package is known to be very rough; really just a placeholder or starting point for what we'd need. (Note that...
Gordon, As you say, the aSocketInputStream is needed only if the client stage is multithreaded. But the OceanStore "HttpClient" stage is internally using it to...
Just to capture the idea I mentioned yesterday in the archives: A potential way to extract JavaScript-synthesized URIs from web pages without integrating a...
Gordon, Please find attached the performance report on the SEDA aSocket NIO layer. Look for the last section in the document "SEDA NIO Socket Framework" and...
Thanks for the updated analysis! However, I am concerned that the results may be more an artifact of the test design or the specific HTTP implementation we're...
... Tuesday isn't good for me; how about Wednesday 8:30pm PT (3:30 UTC) instead? ... This looks good; I've been getting the notifications. ... I just wanted to...