archive-crawler

Group Information

  • Members: 795
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English

Messages

Messages 1 - 30 of 8130
1 Raymie Stata
rstata
Dec 1, 2002
4:38 pm
I've put together a bibliography on crawling, see: http://groups.yahoo.com/group/archive-crawler/files/crawling-links.html Please add papers as you find them. ...
2 Gordon Mohr
gojomo
Feb 10, 2003
7:19 pm
I've been looking at what crawlers have typically done, and considering what we'd like the new crawler to do. The following general outline -- in roughly valid...
3 G.B.Reddy
gbreddysoft
Feb 12, 2003
5:45 pm
Gordon, I assume that the worker thread is doing synchronous I/O. We are not sure yet on what mode we have finalized, synchronous I/O or asynchronous I/O? The...
4 Gordon Mohr
gojomo
Feb 12, 2003
7:08 pm
Hi, Reddy! ... Yes, this outline most easily maps to blocking I/O. Per a discussion last Thursday, we'd initially like to get up and running with the familiar...
5 Gordon Mohr
gojomo
Feb 13, 2003
6:58 pm
We now have a project at SourceForge for hosting our source code; see the details below. I definitely want to use their CVS, and perhaps their bug/ ...
6 Gordon Mohr
gojomo
Feb 18, 2003
8:05 pm
From a number of sources, I've been hearing about tricky crawler situations -- misbehaving or malicious servers, endless domains, difficult-to-extract link...
7 Gordon Mohr
gojomo
Feb 19, 2003
10:21 am
At our last design meeting, Raymie and I sketched an outline of crawler operation as a series of discrete stages connected by queues -- a style compatible with...
8 Gordon Mohr
gojomo
Feb 21, 2003
7:16 am
[cc'd to the archive-crawler@yahoogroups.com discussion list] These are all important matters to address -- and for most of these issues, I think there will be...
9 Raymie Stata
rstata
Feb 21, 2003
7:29 am
... I said "_not_ RAM" Gordon said "swappable strategies will be enabled, starting with a simple RAM-based approach to get the crawler testable for small...
10 Gordon Mohr
gojomo
Feb 21, 2003
4:17 pm
I don't think we can build the best mega-scale crawler until after we've built a really good, modular, efficient small-scale crawler. That's how the existing...
11 Brewster Kahle
brewsterkahle
Feb 21, 2003
11:04 pm
got it. all cleared up today at the meeting, I think. good start! -brewster...
12 Gordon Mohr
gojomo
Feb 22, 2003
12:17 am
[CC'ing to archive-crawler@yahoogroups.com] ... This looks like a good first cut. I'm still working to improve my understanding of the best way to use the...
13 G.B.Reddy
gbreddysoft
Feb 27, 2003
4:55 pm
Gordon and Raymie, Here goes the proposal for the asynchronous DNS lookup API implementation. We shall implement a minimal resolver which is capable of sending...
14 Gordon Mohr
gojomo
Feb 28, 2003
6:11 pm
At our kickoff engineering review meeting last Friday, most discussion centered around understanding and clarifying the requirements document. Key areas...
15 Gordon Mohr
gojomo
Feb 28, 2003
9:13 pm
Sounds like a reasonable plan. By "local name server" do you mean something *very* local -- for example, a standard nameserver we run on the same machine? That...
16 G.B.Reddy
gbreddysoft
Mar 3, 2003
1:19 pm
Yes, it is a local name server. It could also be remote. -Reddy ... From: Gordon Mohr To: archive-crawler@yahoogroups.com Cc: Raymie Stata ;...
17 Gordon Mohr
gojomo
Mar 5, 2003
10:29 pm
Driven by our meeting with Raymie last Thursday, and refined by further analysis, here are some notes on our design directions. = STAGED CRAWLER DESIGN NOTES =...
18 G.B.Reddy
gbreddysoft
Mar 6, 2003
5:52 pm
Gordon and Raymie, Below are the various stages and their design with the issues involved in the DNS Resolver and HTTP Client implementation. DNS History/Cache...
19 Gordon Mohr
gojomo
Mar 7, 2003
1:40 am
Patrick Eaton forwarded me a pair of staged HTTP client implementations which are part of the OceanStore project at Berkeley, and are essentially what are also...
20 Gordon Mohr
gojomo
Mar 7, 2003
2:11 am
I've just checked into SourceForge CVS the module 'Anecdote', a first stab at a staged crawler. Right now it just sets up dummy printing stages, grabs a list...
21 G.B.Reddy
gbreddysoft
Mar 7, 2003
4:31 pm
More insight on the DNS stages. As stated in the design earlier, "DNS Querying Stage", "DNS Response Processing Stage" and "Timeout and Retry Handling Stage"...
22 Gordon Mohr
gojomo
Mar 7, 2003
9:44 pm
Gordon, Igor, Raymie present. (1) Access to work in progress: start using SourceForge CVS (Post meeting note: 2 modules now exist there: 'Anecdote', a staged...
23 Gordon Mohr
gojomo
Mar 7, 2003
9:51 pm
I added very dumb HTTP fetching to the Anecdote 'Fetching' stage via the Apache Commons HTTPClient library soon after my message yesterday. ... This spinning...
24 Gordon Mohr
gojomo
Mar 7, 2003
11:30 pm
These are good decompositions of the steps involved, and the LGPL dnsjava library looks very useful for our needs. My tendency would be to think fewer stages...
25 G.B.Reddy
gbreddysoft
Mar 12, 2003
4:15 pm
Gordon, I am done with the asynchronous DNS code. I shall test it more tomorrow and check it in. I may start using the caching mechanism present in the dnsjava ...
26 G.B.Reddy
gbreddysoft
Mar 17, 2003
8:08 pm
Gordon, I have checked in the first version of the asynchronous DNS lookup stage (DNSLookingUp.java). I have also updated the README and the anecdote.cfg file...
27 Gordon Mohr
gojomo
Mar 17, 2003
11:42 pm
I'll take a look. Don't feel obligated to go with Eclipse -- even though it is a very nice environment. Eventually we'll include versioned ant scripts with...
28 G.B.Reddy
gbreddysoft
Mar 18, 2003
2:20 am
Gordon, Yes, as you said dnsjava creates a new udpsocket for every message. I am planning to separate out the processing logic from the socket related code and...
29 Gordon Mohr
gojomo
Mar 19, 2003
7:38 pm
I'm trying out the 'libhttp' staged HTTP code we were passed by the Berkeley OceanStore project, and it requires all aspects of the outbound request to be...
30 Gordon Mohr
gojomo
Mar 19, 2003
8:59 pm