I've been looking at what crawlers have typically done, and considering what we'd like the new crawler to do. The following general outline -- in roughly valid...
Gordon, I assume that the worker thread is doing synchronous I/O. We are not yet sure which mode we have finalized: synchronous I/O or asynchronous I/O? The...
Hi, Reddy! ... Yes, this outline most easily maps to blocking I/O. Per a discussion last Thursday, we'd initially like to get up and running with the familiar...
We now have a project at SourceForge for hosting our source code; see the details below. I definitely want to use their CVS, and perhaps their bug/ ...
From a number of sources, I've been hearing about tricky crawler situations -- misbehaving or malicious servers, endless domains, difficult-to-extract link...
At our last design meeting, Raymie and I sketched an outline of crawler operation as a series of discrete stages connected by queues -- a style compatible with...
[cc'd to the archive-crawler@yahoogroups.com discussion list] These are all important matters to address -- and for most of these issues, I think there will be...
... I said "_not_ RAM" Gordon said "swappable strategies will be enabled, starting with a simple RAM-based approach to get the crawler testable for small...
I don't think we can build the best mega-scale crawler until after we've built a really good, modular, efficient small-scale crawler. That's how the existing...
[CC'ing to archive-crawler@yahoogroups.com] ... This looks like a good first cut. I'm still working to improve my understanding of the best way to use the...
Gordon and Raymie, Here goes the proposal for the asynchronous DNS lookup API implementation. We shall implement a minimal resolver which is capable of sending...
At our kickoff engineering review meeting last Friday, most discussion centered around understanding and clarifying the requirements document. Key areas...
Sounds like a reasonable plan. By "local name server" do you mean something *very* local -- for example, a standard nameserver we run on the same machine? That...
Driven by our meeting with Raymie last Thursday, and refined by further analysis, here are some notes on our design directions. = STAGED CRAWLER DESIGN NOTES =...
Gordon and Raymie, Below are the various stages and their designs, along with the issues involved in the DNS Resolver and HTTP Client implementation. DNS History/Cache...
Patrick Eaton forwarded me a pair of staged HTTP client implementations which are part of the OceanStore project at Berkeley, and are essentially what are also...
I've just checked into SourceForge CVS the module 'Anecdote', a first stab at a staged crawler. Right now it just sets up dummy printing stages, grabs a list...
More insight on the DNS stages. As stated in the design earlier, "DNS Querying Stage", "DNS Response Processing Stage" and "Timeout and Retry Handling Stage"...
Gordon, Igor, Raymie present. (1) Access to work in progress: start using SourceForge CVS (Post meeting note: 2 modules now exist there: 'Anecdote', a staged...
I added very dumb HTTP fetching to the Anecdote 'Fetching' stage via the Apache Commons HTTPClient library soon after my message yesterday. ... This spinning...
These are good decompositions of the steps involved, and the LGPL dnsjava library looks very useful for our needs. My tendency would be to think fewer stages...
Gordon, I am done with the asynchronous DNS code. I shall test it more tomorrow and check it in. I may start using the caching mechanism present in the dnsjava ...
Gordon, I have checked in the first version of the asynchronous DNS lookup stage (DNSLookingUp.java). I have also updated the README and the anecdote.cfg file...
I'll take a look. Don't feel obligated to go with Eclipse -- even though it is a very nice environment. Eventually we'll include versioned ant scripts with...
Gordon, Yes, as you said, dnsjava creates a new UDP socket for every message. I am planning to separate out the processing logic from the socket-related code and...
I'm trying out the 'libhttp' staged HTTP code we were passed by the Berkeley OceanStore project, and it requires all aspects of the outbound request to be...