FYI, regarding Alexa's crawler request headers, Andy at Alexa writes:
> Our current implementation sends only
>
> GET /foo.html HTTP/1.0
> <Cookies>
> Connection: close
> Host: foo.com
> User-Agent: ia_archiver
> From: crawler@...
>
> I don't know why we use 1.0 instead of 1.1.
>
> adj
I'm trying out the 'libhttp' staged HTTP code we were passed
by the Berkeley OceanStore project, and it requires all aspects
of the outbound request to be explicitly set. (For example,
it won't crack up a full HTTP URI so that it can put just the
path_info on the request line and the host in a "Host:" header,
etc.)
So, I'd like to get an idea of the minimal HTTP requests we
should compose to get representative responses from web servers.
For example, this is the request IE6/WinXP generates on a fetch
of "http://www.yahoo.com":
--
GET / HTTP/1.1
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR
1.0.3705)
Host: www.yahoo.com
Connection: Keep-Alive
--
For comparison, here's what Moz1.3/WinXP generates:
--
GET / HTTP/1.1
Host: www.yahoo.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3)
Gecko/20030312
Accept:
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,\
video/x-mng,image/png,image/jpeg,image/gif;q=0.2,*/*
;q=0.1
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate,compress;q=0.9
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
--
Raymie (or anyone else who knows):
Does Mercator by default include
Accept/Accept-Language/Accept-Encoding/Accept-Charset
headers like the above? Any other headers which have proven useful to include in
crawling?
- Gordon
Gordon,
Yes, as you said dnsjava creates a new udpsocket for every message. I am
planning to separate out the processing logic from the socket related code
and give interfaces such that people can use their own transport mechanism
(sync or async). In our case we would use the seda socket library. Moreover
the Lookup.java class ( a core dnsjava class) has a lot of member variables
which are used to carry state between the methods. This forces one to create
instances of Lookup class. I would want to make the methods independent of
each other as far as possible. These limitations have made me extract pieces
of code from the Lookup class and use it in my DNSLookingUp class.
Thanks,
Reddy
----- Original Message -----
From: "Gordon Mohr" <gojomo@...>
To: <archive-crawler@yahoogroups.com>
Cc: <wcr-team@...>
Sent: Tuesday, March 18, 2003 5:11 AM
Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web crawler
work ??
> I'll take a look.
>
> Don't feel obligated to go with Eclipse -- even though it is a
> very nice environment. Eventually we'll include versioned ant
> scripts with each module so as to support the widest possible
> choice of dev styles.
>
> Upgrading dnsjava's async capabilities looks like a good idea.
> I see it has a rudimentary async simulation via a background
> thread-per-call, and also that it uses a fresh UDP socket for every
> lookup to avoid having to do any other response-to-request mapping.
> Both of those practices seem too wasteful for us (though maybe
> the disposable UDP socket cost isn't that high; I'd need to run
> tests to be sure).
>
> - Gordon
>
> ----- Original Message -----
> From: "G.B.Reddy" <reddy@...>
> To: <archive-crawler@yahoogroups.com>
> Cc: <wcr-team@...>
> Sent: Monday, March 17, 2003 12:07 PM
> Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
crawler work ??
>
>
> > Gordon,
> >
> > I have checked in the first version of the asynchronous DNS lookup stage
> > (DNSLookingUp.java). I have also updated the README and the anecdote.cfg
> > file accordingly.
> >
> > I have made my first version without any changes in the dnsjava
libraries.
> > However, I would like to reorganize the dnsjava code to support proper
> > interfaces for asynchronous queries. I can do it in the next 3 or 4
days.
> > This would help us in the near future since it contains lot of useful
> > features of a full fledged resolver. At present, my implementation
requires
> > that the name server be configured for recursion.
> >
> > At present, the code just prints the received DNS replies and errors.
The
> > results/errors are not set into the CrawlURI object yet.
> >
> > Please note that I have not yet updated the .project and .classpath
files. I
> > am yet to switch to eclipse.
> >
> > Thanks,
> > Reddy
> >
> >
> > ----- Original Message -----
> > From: "Gordon Mohr" <gojomo@...>
> > To: "G.B.Reddy" <reddy@...>
> > Cc: <wcr-team@...>
> > Sent: Tuesday, March 18, 2003 1:03 AM
> > Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
crawler
> > work ??
> >
> >
> > > Great -- I look forward to seeing your design in more detail.
> > >
> > > Feel free to check it into the 'Anecdote' module -- for now,
> > > that's a scratch/prototyping area we can put the first working
> > > versions of things. Perhaps the package could be org.archive.util.dns
> > > or org.archive.seda.dns?
> > >
> > > I expect that when the system is a bit more stable and evolved, we'll
> > > move it to a new CVS module (and perhaps package structure).
> > >
> > > - Gordon
> > >
> > > ----- Original Message -----
> > > From: "G.B.Reddy" <reddy@...>
> > > To: "Gordon Mohr" <gojomo@...>
> > > Cc: <wcr-team@...>
> > > Sent: Friday, March 14, 2003 8:20 AM
> > > Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
> > crawler work ??
> > >
> > >
> > > > Gordon,
> > > >
> > > > I have made certain modifications to the dnsjava code to remove the
> > tight
> > > > coupling between the caching and the dns reply message processing. I
> > would
> > > > complete this and checkin on Monday.
> > > >
> > > > -Reddy
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "G.B.Reddy" <reddy@...>
> > > > To: "Gordon Mohr" <gojomo@...>;
> > <archive-crawler@yahoogroups.com>
> > > > Cc: <wcr-team@...>
> > > > Sent: Wednesday, March 12, 2003 9:44 PM
> > > > Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
> > crawler
> > > > work ??
> > > >
> > > >
> > > > > Gordon,
> > > > >
> > > > > I am done with the asynchronous DNS code. I shall test it more
> > tomorrow
> > > > and
> > > > > checkin. I may start using the caching mechanism present in the
> > dnsjava
> > > > > libraries. They seem to be a bit tightly bound with the response
> > > > processing
> > > > > code which I am using.
> > > > >
> > > > > The CrawlURI object as of now contains only the uri string. I
would
> > need
> > > > > variables for setting the ipaddress and other state to indicate
lookup
> > > > > failures. Please let me know if you have any versions over it.
> > > > >
> > > > > Thanks,
> > > > > Reddy
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: "Gordon Mohr" <gojomo@...>
> > > > > To: <archive-crawler@yahoogroups.com>
> > > > > Cc: <wcr-team@...>
> > > > > Sent: Saturday, March 08, 2003 5:00 AM
> > > > > Subject: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
> > crawler
> > > > > work ??
> > > > >
> > > > >
> > > > > > These are good decompositions of the steps involved, and the
LGPL
> > > > dnsjava
> > > > > > library looks very useful for our needs.
> > > > > >
> > > > > > My tendency would be to think fewer stages are better -- and
when
> > > > > > communicating between our stages, rather than using a shared
> > RequestMap,
> > > > > > include on the requesting event whatever context will be needed
to
> > deal
> > > > > > with the response when it arrives.
> > > > > >
> > > > > > You can see this technique used in the ostore.network.ADns class
I
> > > > > referenced
> > > > > > yesterday -- the user_data object lets any client of the ADns
stage
> > > > > recover
> > > > > > the state it needs to deal with ADns's eventual response. (We
may
> > still
> > > > > need
> > > > > > the equivalent of a RequestMap to deal with mapping UDP response
> > packets
> > > > > to
> > > > > > the lookups that triggered them -- but that'd then be a map that
can
> > be
> > > > > private
> > > > > > to a single stage.)
> > > > > >
> > > > > > Nothing in the DNS procedure needs to block -- once you assume
> > > > > asynchronous
> > > > > > UDP replies and an in-RAM response cache -- and thus one thread
> > should
> > > > be
> > > > > > as good as (in fact better than) N threads.
> > > > > >
> > > > > > Similarly, I don't think more than 1 UDP socket will be
necessary
> > for
> > > > > > all DNS lookups, because as a practical matter, 1 open UDP
socket
> > will
> > > > > > never be "busy" in a way that additional UDP sockets would help.
> > > > > >
> > > > > > --
> > > > > > Regarding 404 and other explicit HTTP app errors: these should
be
> > > > recorded
> > > > > > into the CrawlURI, and forwarded to the next processing stage,
just
> > as
> > > > > > with successes.
> > > > > >
> > > > > > Redirects are a special case we haven't discussed much yet; they
> > would
> > > > > seem
> > > > > > candidates for some sort of expedited fetching. I think, though,
> > such
> > > > > > expedited activity must pass through the full cycle of stages
for
> > > > > > proper logging and policy application. (In contrast, retries in
the
> > face
> > > > > > of transient errors, which may, at least in some cases, be fed
> > > > immediately
> > > > > > back to the Fetching or Preprocessing stages.)
> > > > > >
> > > > > > --
> > > > > > The issue of timeouts (esp. in HTTP) is thorny. From what I hear
> > about
> > > > > > crawler traps, we need to not only to be able to deal with
hangs,
> > but
> > > > > > also various infinite-length trickles.
> > > > > >
> > > > > > If any of the HTTP-transactions-in-progress is not making
sufficient
> > > > > > progress, according to some set of progress-per-time-unit
> > thresholds,
> > > > > > we have to be ready to give up on it. We should mark it as an
error,
> > > > > > but probably retain any partial info we did get for later
analysis.
> > > > > >
> > > > > > Separating this out into its own stage increases synchronization
> > issues.
> > > > > > If timeout analysis can instead be part of the same
single-threaded
> > > > stage
> > > > > > where HTTP some-progress-made events (from sockets) are handled,
I
> > think
> > > > > > it can be handled very efficiently:
> > > > > >
> > > > > > - keep all outstanding transactions on a linked-list
> > > > > > - whenever progress is reported (from the socket),
> > > > > > and that progress keeps the transaction above
> > > > > > acceptable thresholds, move that transaction to
> > > > > > the head of the list
> > > > > > - check the tail of the list occasionally to see
> > > > > > if the worst transactions need to be cut
> > > > > >
> > > > > > That "occasional" check could mostly occur as part of normal
> > > > > > progress-packet processing; an extremely infrequent timer could
> > > > > > trigger reevaluation in the rare case where all progress on all
> > > > > > sockets has stopped...
> > > > > >
> > > > > > - Gordon
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: G.B.Reddy
> > > > > > To: archive-crawler@yahoogroups.com
> > > > > > Cc: wcr-team@...
> > > > > > Sent: Friday, March 07, 2003 8:31 AM
> > > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > > >
> > > > > >
> > > > > >
> > > > > > More insight on the DNS stages.
> > > > > >
> > > > > > As stated in the design earlier, "DNS Querying Stage", "DNS
Response
> > > > > Processing Stage" and "Timeout and Retry Handling Stage"
> > > > > > access/update the shared RequestMap. So, they need to be
> > synchronized.
> > > > We
> > > > > were just thinking whether these need to be three separate
> > > > > > stages or could it be one stage multiplexing each of those
incoming
> > > > event
> > > > > types. The pros and cons of them are as below.
> > > > > >
> > > > > > Single stage benefit / Multi Stage Issue :
> > > > > > - If it is single stage, then synchronization is not done across
> > stages.
> > > > > Even though, synchronization anyway would be needed
> > > > > > internally in case of single stage, conceptually it looks neater
not
> > to
> > > > > synchronize across stages.
> > > > > >
> > > > > > Single stage issue / Multi Stage benefit :
> > > > > > - In case of single stage, the event queue would be one and this
> > fact
> > > > > pulls our legs when we want to handle overload conditions.
> > > > > > Since there is no clear count of the distinct elements in the
queue,
> > it
> > > > > becomes difficult to analyse and condition the system
> > > > > > accordingly. Whereas, in multi stage, each one having its own
queue,
> > > > > conditioning/prioritizating becomes easy.
> > > > > >
> > > > > > I think we would end up going in for multi stage, unless SEDA
could
> > take
> > > > > care of it by itself.
> > > > > >
> > > > > > And in case of parallel stages, like the ones in post
processing, I
> > > > think
> > > > > most often, synchronization across them may be
> > > > > > unavoidable.
> > > > > >
> > > > > > Thanks,
> > > > > > Reddy
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: G.B.Reddy
> > > > > > To: archive-crawler@yahoogroups.com
> > > > > > Cc: wcr-team@...
> > > > > > Sent: Thursday, March 06, 2003 11:21 PM
> > > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > > >
> > > > > >
> > > > > > Gordon and Raymie,
> > > > > >
> > > > > > Below are the various stages and their design with the issues
> > involved
> > > > in
> > > > > the DNS Resolver and HTTP Client implementation.
> > > > > >
> > > > > > DNS History/Cache Handling Stage :
> > > > > >
> > > > > > Overview:
> > > > > > - Maintains successful lookups in cache.
> > > > > > - Does negative caching.
> > > > > > - Times itself to clean the expired entries based on TTL. (Would
use
> > the
> > > > > ssTimer SEDA APIs to schedule itself periodically)
> > > > > > - This stage would be dummy or could be skipped as of now since
we
> > want
> > > > to
> > > > > do caching later.
> > > > > >
> > > > > > Events:
> > > > > > - Two types of events : DNSCacheLookupEvent,
DNSCacheUpdateEvent.
> > > > > > - DNSCacheLookupEvent : If entry is found in cache, the
ipaddress is
> > set
> > > > > in the CrawlURI object and is enqueued into the "Page
> > > > > > Requesting Stage". Else, it is enqueued into the "DNS Querying
> > Stage".
> > > > > > - DNSCacheUpdateEvent : This event is published by the "DNS
Response
> > > > > Processing Stage" every after successful/failed lookup inorder
> > > > > > to update the cache.
> > > > > >
> > > > > > Other notes:
> > > > > > - This stage could be single threaded else lot of
synchronization
> > might
> > > > be
> > > > > needed.
> > > > > > - Resubmitting events on queue full exceptions while enqueuing
into
> > this
> > > > > stage's queue should be handled by the caller by scheduling
> > > > > > it in future.
> > > > > >
> > > > > >
> > > > > > DNS Querying Stage :
> > > > > >
> > > > > > Overview:
> > > > > > - Sends the actual DNS ARecord query packets to the DNS Server.
(The
> > > > > response packets are processed in a later stage)
> > > > > > - Maintains a pool of DatagramSocket objects.
> > > > > >
> > > > > > Events:
> > > > > > - SendDNSQueryEvent : This is published by the "DNS
History/Cache
> > > > Handling
> > > > > Stage" when cache miss happens. The DNS query is formed
> > > > > > and sent out. The response handling sink is set as the "DNS
Response
> > > > > Processing Stage".
> > > > > >
> > > > > > Implementation:
> > > > > > - A pool of DatagramSockets of a configurable maximum size is
> > > > maintained.
> > > > > It will be filled incrementally. All these datagram
> > > > > > sockets will be registered to the selector maintained by the
SEDA
> > > > > internals. It would be ideal if this pool gets shrunk or expanded
> > > > > > based on the requirement. If it is not shrunk back, then it is
an
> > > > > unnecessary overhead on the selector. The reason behind having a
> > > > > > pool is to restrict the number of ports the selector has to
listen
> > upon
> > > > > and also not to create individual DatagramSocket objects for
> > > > > > every query. Can this logic of bounded pool, be implemented as a
> > > > > Controller in the SEDA framework (just like the
> > > > > > ThreadPoolController) is an open question.
> > > > > > - When an event comes in, a free datagram socket in the pool
will be
> > > > > utilized for sending the message. If all sockets are engaged,
> > > > > > the incoming event should be postponed to reenter again after a
> > period
> > > > of
> > > > > time.
> > > > > > - This stage additionally has also to maintain the list of
messages
> > sent
> > > > > out, their IDs and the request timestamps. Let us call this
> > > > > > "RequestMap" for future reference. The ID is the integer,
described
> > in
> > > > the
> > > > > RFC DNS message format, used to map the
> > > > > > request-responses. The request timestamp will be made use of in
> > query
> > > > > timeout handling (discussed later).
> > > > > >
> > > > > > Parameters to this stage :
> > > > > > - The DNS server hostname/ipaddress. If this is not given, then
the
> > > > > /etc/resolve.conf will be parsed to get the name server (only
> > > > > > the primary would be taken as of now.). As a next step we will
have
> > to
> > > > > build a round-robin way of querying the various name servers
> > > > > > in resolve.conf, inorder to be polite with them.
> > > > > > - If resolve.conf is not present, the local host will be assumed
as
> > the
> > > >
> > > > > name server.
> > > > > >
> > > > > >
> > > > > > DNS Response Processing Stage :
> > > > > >
> > > > > > Overview:
> > > > > > - Processes DNS responses.
> > > > > >
> > > > > > Implementation:
> > > > > > - When the DNS datagram packets are received, the ID field in
the
> > header
> > > > > should be used to match the corresponding request packet.
> > > > > > - Check for timeouts, and discard it if it had timed out; else,
set
> > the
> > > > > ipaddress/canonical name in the CrawlURI object and enqueue
> > > > > > it to the "Page Requesting Stage". In addition, enqueue an event
> > into
> > > > the
> > > > > "DNS Cache Handling Stage" for it to update its cache. Do
> > > > > > the same, even on DNS Errors like "Name not found authoritative
> > error".
> > > > > > - The request entry in the RequestMap (maintained for timeout
> > handling)
> > > > > should be removed. This map, being shared across stages,
> > > > > > should be synchronized.
> > > > > >
> > > > > >
> > > > > > HTTP Page Requesting Stage :
> > > > > >
> > > > > > Overview:
> > > > > > - Connects to host and sends GET requests for pages.
> > > > > >
> > > > > > Events:
> > > > > > - Handles two types of events - StartFetchEvent and
> > > > > ConnectionCompleteEvent.
> > > > > > - The StartFetchEvent will make a TCP connect request to the
host.
> > While
> > > > > doing so, we will register the current stage itself to
> > > > > > receive back the ConnectionComplete events. Once we receive this
> > > > > ConnectionCompleteEvent, we should send a HTTP GET request to the
> > > > > > page. The response handling sink is set as the "HTTP Response
> > Processing
> > > > > Stage". Write failures should be handled.
> > > > > >
> > > > > >
> > > > > > HTTP Response Processing Stage :
> > > > > >
> > > > > > Overview:
> > > > > > - Processes downloaded pages.
> > > > > >
> > > > > > Implementation:
> > > > > > - Check for timeouts, and discard it if it had timed out; else,
read
> > the
> > > > > packets.
> > > > > > - Once the response is completely read, the request entry in the
> > > > > RequestMap (maintained for timeout handling) should be removed.
> > > > > > - One issue here is when we are reading lengthy HTML pages, we
might
> > > > > receive half of the page and it might stop after that. So,
> > > > > > essentially the timeout should be applied between chunks of
> > reception.
> > > > > > - Where should the errors like "404 Not Found", etc be
propogated
> > ???
> > > > > >
> > > > > >
> > > > > > Timeout and Retry Handling Stage :
> > > > > >
> > > > > > Overview:
> > > > > > This is a single threaded stage which enumerates through the
> > RequestMap
> > > > > and checks for timeouts. The timed out CrawlURIs will be
> > > > > > retried until retry count exhausts. This stage will be
self-timed
> > > > > periodically using the SEDA ssTimer APIs.
> > > > > >
> > > > > > Other Notes:
> > > > > > This timeout handling is a common stuff between the DNS requests
and
> > the
> > > > > HTTP requests.
> > > > > >
> > > > > > Parameters to this stage:
> > > > > > - DNS timeout value.
> > > > > > - HTTP timeout value.
> > > > > > - DNS retry count.
> > > > > > - HTTP retry count. ( This would be 1 ).
> > > > > >
> > > > > >
> > > > > > One other thing that could be done is that, the events by
themselves
> > > > will
> > > > > contain information as to which next stage the output has
> > > > > > to traverse. This will be flexible and no hardcoding is needed.
> > > > > Especially, in making this non-blocking DNS library an
open-source,
> > > > > > it would come handy. Moreover, many users might not want it to
be
> > over
> > > > > SEDA. So, we will have to give other interfaces as well.
> > > > > >
> > > > > > I am presently using the library classes given by dnsjava-1.3.2.
(
> > > > > http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java
> > > > > > based synchronous implementation of DNS Resolver. I only make
use of
> > the
> > > > > classes which encapsulate the formation of request packets,
> > > > > > parses response packets and the various ResourceRecord classes.
This
> > > > > library is being used by Java Apache Mail Enterprise Server (
> > > > > > http://james.apache.org/ ). So, it should be pretty reliable and
> > tested.
> > > > > Moreover it has support for IPv6, compression and security
> > > > > > which we can make use of later.
> > > > > >
> > > > > > Thanks,
> > > > > > Reddy
> > > > > >
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Gordon Mohr
> > > > > > To: archive-crawler@yahoogroups.com
> > > > > > Cc: Raymie Stata ; wcr-team@...
> > > > > > Sent: Saturday, March 01, 2003 2:43 AM
> > > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > > >
> > > > > >
> > > > > > Sounds like a reasonable plan.
> > > > > >
> > > > > > By "local name server" do you mean something *very* local -- for
> > > > example,
> > > > > > a standard nameserver we run on the same machine?
> > > > > >
> > > > > > That would seem to offer other benefits -- such as minimizing
the
> > modes
> > > > > > of DNS lookup we have to do and offloading caching to another
piece
> > of
> > > > > > software (at least at first).
> > > > > >
> > > > > > - Gordon
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: G.B.Reddy
> > > > > > To: archive-crawler@yahoogroups.com
> > > > > > Cc: Raymie Stata ; wcr-team@...
> > > > > > Sent: Thursday, February 27, 2003 8:55 AM
> > > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > > >
> > > > > >
> > > > > > Gordon and Raymie,
> > > > > >
> > > > > > Here goes the proposal for the asynchronous DNS lookup API
> > > > implementation.
> > > > > >
> > > > > > We shall implement a minimal resolver which is capable of
sending
> > DNS
> > > > > request packets and processing response packets in an
> > > > > > asynchrounous nio fashion. This resolver class will contact a
local
> > name
> > > > > server and rely on it to do the actual lookup. The local
> > > > > > name server will be configured to support recursion and better
would
> > be
> > > > to
> > > > > use a name server which does lookup asynchronously. (
> > > > > > SQUID has asynchronous DNS lookup facilities ). Even if the
local
> > name
> > > > > server is not asynchrounous, our java resolver being
> > > > > > asynchronous will be good enough since our primary goal is that
we
> > do
> > > > not
> > > > > want any blocking code in our crawler implementation. This
> > > > > > idea even sounds good considering the fact we would only
reinvent
> > the
> > > > same
> > > > > wheel if we opt to implement a complete full-fledged
> > > > > > resolver implementation which complies with the RFC 1035 and
1034.
> > We
> > > > can
> > > > > definitely implement this full-fledged resolver but the
> > > > > > real concern is that this would require a lot of testing and the
> > efforts
> > > > > to make it rock solid in terms of robustness would be huge.
> > > > > >
> > > > > > So, the various jobs that we would have to do to build our
Simple
> > > > > Asynchronous DNS lookup API would be
> > > > > > -- Request packet formation and reply packet parsing in the
> > exact
> > > > RFC
> > > > > format.
> > > > > > -- Use non-blocking IO APIs and do UDP. (We might not need
TCP
> > since
> > > > > the name server is only in the local network.)
> > > > > > -- Do canonical name queries and A record queries.
> > > > > > -- Implement timeouts.
> > > > > > -- Implement caching based on TTL. ( This may have to be
> > deferred as
> > > > > pointed by Raymie earlier. )
> > > > > > -- Integrate with SEDA.
> > > > > >
> > > > > > Thanks,
> > > > > > Reddy
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Gordon Mohr
> > > > > > To: G.B.Reddy
> > > > > > Cc: Raymie Stata ; wcr-team@... ;
> > > > > archive-crawler@yahoogroups.com
> > > > > > Sent: Saturday, February 22, 2003 5:37 AM
> > > > > > Subject: [archive-crawler] Re: Web crawler work ??
> > > > > >
> > > > > >
> > > > > > [CC'ing to archive-crawler@yahoogroups.com]
> > > > > >
> > > > > > Reddy writes:
> > > > > >
> > > > > > > On the first cut do we need to look at implementing an
> > asynchronous
> > > > DNS
> > > > > > > lookup mechanism. If we are not, then it is going to be two
> > stages,
> > > > viz.
> > > > > > > DNSCacheHandlingStage and ResolvingStage, that can be employed
> > using
> > > > the
> > > > > > > blocking DNS lookup calls in Java. The first stage,
> > > > > DNSCacheHandlingStage,
> > > > > > > would check if the entry is available in the cache. If
available,
> > he
> > > > > would
> > > > > > > set the resolved address in the CrawlURI object and enqueue it
to
> > the
> > > > > > > appropriate next stage. If the cache doesn't contain the
entry,
> > then
> > > > he
> > > > > > > would pass the request to the Resolving stage which would call
the
> > > > > > > InetAddress.getByName blocking method to resolve it. The
getByName
> > > > > result
> > > > > > > would be set in the CrawlURI object as before and enqueued
into
> > the
> > > > > > > appropriate next stage. In addition to this, the Resolving
stage
> > will
> > > > > > > enqueue another event into the DNSCacheHandlingStage to enable
him
> > > > > update
> > > > > > > his cache. So, the DNSCacheHandlingStage would be handling two
> > types
> > > > of
> > > > > > > events, one is the lookup events and the other is the update
cache
> > > > > events.
> > > > > > >
> > > > > > > One problem here is that the InetAddress class does not expose
its
> > > > cache
> > > > > > > variables to its users. Even we cannot check if the cache has
an
> > entry
> > > > > > > before calling the getByName method. So, we should be
disabling
> > the
> > > > java
> > > > > > > cache ( using the policy file ) and implementing our own
caching
> > > > > mechanism.
> > > > > > > ( The DNSCacheHandlingStage would have to additionally do the
job
> > of
> > > > > > > throwing away the expired entries in the cache also.)
> > > > > > >
> > > > > > > Let me know your comments on this.
> > > > > >
> > > > > > This looks like a good first cut. I'm still working to improve
my
> > > > > > understanding of the best way to use the staged style, mostly by
> > > > > > looking at their HTTP and HTTP Server (Haboob) code.
> > > > > >
> > > > > > It seems that they've tended to use a single Stage object to do
> > > > > > many different steps/aspects of one process, by switching on the
> > > > > > type of QueueElement received.
> > > > > >
> > > > > > So for example their
seda.sandStorm.seda.apps.Haboob.http.HttpRecv
> > > > > > accepts events of types....
> > > > > >
> > > > > > - httpConnection
> > > > > > - httpRequest
> > > > > > - SinkClosedEvent
> > > > > > - timerEvent
> > > > > >
> > > > > > And their seda.sandStorm.lib.http.httpServer accepts events of
> > > > > > types...
> > > > > >
> > > > > > - ATcpInPacket
> > > > > > - ATcpConnection
> > > > > > - aSocketErrorEvent
> > > > > > - SinkDrainedEvent
> > > > > > - SinkCloggedEvent
> > > > > > - SinkClosedEvent
> > > > > > - ATcpListenSuccessEvent
> > > > > >
> > > > > > They also use Sinks that are not associated with stages; rather,
> > > > > > they interface to unstaged components which nonetheless result
in
> > > > > > an eventual event to some supplied answer Sink. See for example
> > > > > > seda.sandStorm.lib.http.httpConnection.
> > > > > >
> > > > > > So perhaps as a matter of grouping related tasks, the same Stage
> > object
> > > > > > should be re-entered over the course of a lookup, with different
> > > > > triggering
> > > > > > events. For example, you might want to reenter a single
> > > > DNSResolvingStage
> > > > > > over the course of cache lookup, lookup-initiation,
result-receiving
> > (or
> > > > > > timeout), etc. I'm not sure; use your judgement as to how many
> > stages
> > > > are
> > > > > > really needed.
> > > > > >
> > > > > > > P.S : We found some openly available async dns client APIs in
C
> > > > > language.
> > > > > >
> > > > > > That could be useful as a model. (I doubt we'd want to call out
to C
> > > > > > for this simple step, though -- and if we nailed down a truly
async
> > Java
> > > > > > DNS facility, a lot of open source projects would probably be
quite
> > > > > happy.)
> > > > > >
> > > > > > Also: I heard back from Patrick Eaton about SEDA-style async
HTTP
> > client
> > > > > > code... he has a rough implementation for simple usage, and he
knows
> > of
> > > > > > another one at Berkeley which goes deeper into HTTP/1.1
conformance
> > and
> > > > > > optimal performance. I've asked him to forward whatever
additional
> > code
> > > > > > or details he can.
> > > > > >
> > > > > > - Gordon
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > To unsubscribe from this group, send an email to:
> > > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > > >
> > > > > >
> > > > > >
> > > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of
Service.
> > > > > >
> > > > > >
> > > > > > Yahoo! Groups Sponsor
> > > > > > ADVERTISEMENT
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > To unsubscribe from this group, send an email to:
> > > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > > >
> > > > > >
> > > > > >
> > > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of
Service.
> > > > > >
> > > > > >
> > > > > >
> > > > > > To unsubscribe from this group, send an email to:
> > > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > > >
> > > > > >
> > > > > >
> > > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of
Service.
> > > > > >
> > > > > >
> > > > > > Yahoo! Groups Sponsor
> > > > > > ADVERTISEMENT
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > To unsubscribe from this group, send an email to:
> > > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > > >
> > > > > >
> > > > > >
> > > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of
Service.
> > > >
> >
> >
> >
> > To unsubscribe from this group, send an email to:
> > archive-crawler-unsubscribe@yahoogroups.com
> >
> >
> >
> > Your use of Yahoo! Groups is subject to
http://docs.yahoo.com/info/terms/
> >
> >
I'll take a look.
Don't feel obligated to go with Eclipse -- even though it is a
very nice environment. Eventually we'll include versioned ant
scripts with each module so as to support the widest possible
choice of dev styles.
Upgrading dnsjava's async capabilities looks like a good idea.
I see it has a rudimentary async simulation via a background
thread-per-call, and also that it uses a fresh UDP socket for every
lookup to avoid having to do any other response-to-request mapping.
Both of those practices seem too wasteful for us (though maybe
the disposable UDP socket cost isn't that high; I'd need to run
tests to be sure).
- Gordon
----- Original Message -----
From: "G.B.Reddy" <reddy@...>
To: <archive-crawler@yahoogroups.com>
Cc: <wcr-team@...>
Sent: Monday, March 17, 2003 12:07 PM
Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web crawler
work ??
> Gordon,
>
> I have checked in the first version of the asynchronous DNS lookup stage
> (DNSLookingUp.java). I have also updated the README and the anecdote.cfg
> file accordingly.
>
> I have made my first version without any changes in the dnsjava libraries.
> However, I would like to reorganize the dnsjava code to support proper
> interfaces for asynchronous queries. I can do it in the next 3 or 4 days.
> This would help us in the near future since it contains lot of useful
> features of a full fledged resolver. At present, my implementation requires
> that the name server be configured for recursion.
>
> At present, the code just prints the received DNS replies and errors. The
> results/errors are not set into the CrawlURI object yet.
>
> Please note that I have not yet updated the .project and .classpath files. I
> am yet to switch to eclipse.
>
> Thanks,
> Reddy
>
>
> ----- Original Message -----
> From: "Gordon Mohr" <gojomo@...>
> To: "G.B.Reddy" <reddy@...>
> Cc: <wcr-team@...>
> Sent: Tuesday, March 18, 2003 1:03 AM
> Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web crawler
> work ??
>
>
> > Great -- I look forward to seeing your design in more detail.
> >
> > Feel free to check it into the 'Anecdote' module -- for now,
> > that's a scratch/prototyping area we can put the first working
> > versions of things. Perhaps the package could be org.archive.util.dns
> > or org.archive.seda.dns?
> >
> > I expect that when the system is a bit more stable and evolved, we'll
> > move it to a new CVS module (and perhaps package structure).
> >
> > - Gordon
> >
> > ----- Original Message -----
> > From: "G.B.Reddy" <reddy@...>
> > To: "Gordon Mohr" <gojomo@...>
> > Cc: <wcr-team@...>
> > Sent: Friday, March 14, 2003 8:20 AM
> > Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
> crawler work ??
> >
> >
> > > Gordon,
> > >
> > > I have made certain modifications to the dnsjava code to remove the
> tight
> > > coupling between the caching and the dns reply message processing. I
> would
> > > complete this and checkin on Monday.
> > >
> > > -Reddy
> > >
> > >
> > > ----- Original Message -----
> > > From: "G.B.Reddy" <reddy@...>
> > > To: "Gordon Mohr" <gojomo@...>;
> <archive-crawler@yahoogroups.com>
> > > Cc: <wcr-team@...>
> > > Sent: Wednesday, March 12, 2003 9:44 PM
> > > Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
> crawler
> > > work ??
> > >
> > >
> > > > Gordon,
> > > >
> > > > I am done with the asynchronous DNS code. I shall test it more
> tomorrow
> > > and
> > > > checkin. I may start using the caching mechanism present in the
> dnsjava
> > > > libraries. They seem to be a bit tightly bound with the response
> > > processing
> > > > code which I am using.
> > > >
> > > > The CrawlURI object as of now contains only the uri string. I would
> need
> > > > variables for setting the ipaddress and other state to indicate lookup
> > > > failures. Please let me know if you have any versions over it.
> > > >
> > > > Thanks,
> > > > Reddy
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: "Gordon Mohr" <gojomo@...>
> > > > To: <archive-crawler@yahoogroups.com>
> > > > Cc: <wcr-team@...>
> > > > Sent: Saturday, March 08, 2003 5:00 AM
> > > > Subject: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
> crawler
> > > > work ??
> > > >
> > > >
> > > > > These are good decompositions of the steps involved, and the LGPL
> > > dnsjava
> > > > > library looks very useful for our needs.
> > > > >
> > > > > My tendency would be to think fewer stages are better -- and when
> > > > > communicating between our stages, rather than using a shared
> RequestMap,
> > > > > include on the requesting event whatever context will be needed to
> deal
> > > > > with the response when it arrives.
> > > > >
> > > > > You can see this technique used in the ostore.network.ADns class I
> > > > referenced
> > > > > yesterday -- the user_data object lets any client of the ADns stage
> > > > recover
> > > > > the state it needs to deal with ADns's eventual response. (We may
> still
> > > > need
> > > > > the equivalent of a RequestMap to deal with mapping UDP response
> packets
> > > > to
> > > > > the lookups that triggered them -- but that'd then be a map that can
> be
> > > > private
> > > > > to a single stage.)
> > > > >
> > > > > Nothing in the DNS procedure needs to block -- once you assume
> > > > asynchronous
> > > > > UDP replies and an in-RAM response cache -- and thus one thread
> should
> > > be
> > > > > as good as (in fact better than) N threads.
> > > > >
> > > > > Similarly, I don't think more than 1 UDP socket will be necessary
> for
> > > > > all DNS lookups, because as a practical matter, 1 open UDP socket
> will
> > > > > never be "busy" in a way that additional UDP sockets would help.
> > > > >
> > > > > --
> > > > > Regarding 404 and other explicit HTTP app errors: these should be
> > > recorded
> > > > > into the CrawlURI, and forwarded to the next processing stage, just
> as
> > > > > with successes.
> > > > >
> > > > > Redirects are a special case we haven't discussed much yet; they
> would
> > > > seem
> > > > > candidates for some sort of expedited fetching. I think, though,
> such
> > > > > expedited activity must pass through the full cycle of stages for
> > > > > proper logging and policy application. (In contrast, retries in the
> face
> > > > > of transient errors, which may, at least in some cases, be fed
> > > immediately
> > > > > back to the Fetching or Preprocessing stages.)
> > > > >
> > > > > --
> > > > > The issue of timeouts (esp. in HTTP) is thorny. From what I hear
> about
> > > > > crawler traps, we need to not only to be able to deal with hangs,
> but
> > > > > also various infinite-length trickles.
> > > > >
> > > > > If any of the HTTP-transactions-in-progress is not making sufficient
> > > > > progress, according to some set of progress-per-time-unit
> thresholds,
> > > > > we have to be ready to give up on it. We should mark it as an error,
> > > > > but probably retain any partial info we did get for later analysis.
> > > > >
> > > > > Separating this out into its own stage increases synchronization
> issues.
> > > > > If timeout analysis can instead be part of the same single-threaded
> > > stage
> > > > > where HTTP some-progress-made events (from sockets) are handled, I
> think
> > > > > it can be handled very efficiently:
> > > > >
> > > > > - keep all outstanding transactions on a linked-list
> > > > > - whenever progress is reported (from the socket),
> > > > > and that progress keeps the transaction above
> > > > > acceptable thresholds, move that transaction to
> > > > > the head of the list
> > > > > - check the tail of the list occasionally to see
> > > > > if the worst transactions need to be cut
> > > > >
> > > > > That "occasional" check could mostly occur as part of normal
> > > > > progress-packet processing; an extremely infrequent timer could
> > > > > trigger reevaluation in the rare case where all progress on all
> > > > > sockets has stopped...
> > > > >
> > > > > - Gordon
> > > > >
> > > > > ----- Original Message -----
> > > > > From: G.B.Reddy
> > > > > To: archive-crawler@yahoogroups.com
> > > > > Cc: wcr-team@...
> > > > > Sent: Friday, March 07, 2003 8:31 AM
> > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > >
> > > > >
> > > > >
> > > > > More insight on the DNS stages.
> > > > >
> > > > > As stated in the design earlier, "DNS Querying Stage", "DNS Response
> > > > Processing Stage" and "Timeout and Retry Handling Stage"
> > > > > access/update the shared RequestMap. So, they need to be
> synchronized.
> > > We
> > > > were just thinking whether these need to be three separate
> > > > > stages or could it be one stage multiplexing each of those incoming
> > > event
> > > > types. The pros and cons of them are as below.
> > > > >
> > > > > Single stage benefit / Multi Stage Issue :
> > > > > - If it is single stage, then synchronization is not done across
> stages.
> > > > Even though, synchronization anyway would be needed
> > > > > internally in case of single stage, conceptually it looks neater not
> to
> > > > synchronize across stages.
> > > > >
> > > > > Single stage issue / Multi Stage benefit :
> > > > > - In case of single stage, the event queue would be one and this
> fact
> > > > pulls our legs when we want to handle overload conditions.
> > > > > Since there is no clear count of the distinct elements in the queue,
> it
> > > > becomes difficult to analyse and condition the system
> > > > > accordingly. Whereas, in multi stage, each one having its own queue,
> > > > conditioning/prioritizating becomes easy.
> > > > >
> > > > > I think we would end up going in for multi stage, unless SEDA could
> take
> > > > care of it by itself.
> > > > >
> > > > > And in case of parallel stages, like the ones in post processing, I
> > > think
> > > > most often, synchronization across them may be
> > > > > unavoidable.
> > > > >
> > > > > Thanks,
> > > > > Reddy
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: G.B.Reddy
> > > > > To: archive-crawler@yahoogroups.com
> > > > > Cc: wcr-team@...
> > > > > Sent: Thursday, March 06, 2003 11:21 PM
> > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > >
> > > > >
> > > > > Gordon and Raymie,
> > > > >
> > > > > Below are the various stages and their design with the issues
> involved
> > > in
> > > > the DNS Resolver and HTTP Client implementation.
> > > > >
> > > > > DNS History/Cache Handling Stage :
> > > > >
> > > > > Overview:
> > > > > - Maintains successful lookups in cache.
> > > > > - Does negative caching.
> > > > > - Times itself to clean the expired entries based on TTL. (Would use
> the
> > > > ssTimer SEDA APIs to schedule itself periodically)
> > > > > - This stage would be dummy or could be skipped as of now since we
> want
> > > to
> > > > do caching later.
> > > > >
> > > > > Events:
> > > > > - Two types of events : DNSCacheLookupEvent, DNSCacheUpdateEvent.
> > > > > - DNSCacheLookupEvent : If entry is found in cache, the ipaddress is
> set
> > > > in the CrawlURI object and is enqueued into the "Page
> > > > > Requesting Stage". Else, it is enqueued into the "DNS Querying
> Stage".
> > > > > - DNSCacheUpdateEvent : This event is published by the "DNS Response
> > > > Processing Stage" every after successful/failed lookup inorder
> > > > > to update the cache.
> > > > >
> > > > > Other notes:
> > > > > - This stage could be single threaded else lot of synchronization
> might
> > > be
> > > > needed.
> > > > > - Resubmitting events on queue full exceptions while enqueuing into
> this
> > > > stage's queue should be handled by the caller by scheduling
> > > > > it in future.
> > > > >
> > > > >
> > > > > DNS Querying Stage :
> > > > >
> > > > > Overview:
> > > > > - Sends the actual DNS ARecord query packets to the DNS Server. (The
> > > > response packets are processed in a later stage)
> > > > > - Maintains a pool of DatagramSocket objects.
> > > > >
> > > > > Events:
> > > > > - SendDNSQueryEvent : This is published by the "DNS History/Cache
> > > Handling
> > > > Stage" when cache miss happens. The DNS query is formed
> > > > > and sent out. The response handling sink is set as the "DNS Response
> > > > Processing Stage".
> > > > >
> > > > > Implementation:
> > > > > - A pool of DatagramSockets of a configurable maximum size is
> > > maintained.
> > > > It will be filled incrementally. All these datagram
> > > > > sockets will be registered to the selector maintained by the SEDA
> > > > internals. It would be ideal if this pool gets shrunk or expanded
> > > > > based on the requirement. If it is not shrunk back, then it is an
> > > > unnecessary overhead on the selector. The reason behind having a
> > > > > pool is to restrict the number of ports the selector has to listen
> upon
> > > > and also not to create individual DatagramSocket objects for
> > > > > every query. Can this logic of bounded pool, be implemented as a
> > > > Controller in the SEDA framework (just like the
> > > > > ThreadPoolController) is an open question.
> > > > > - When an event comes in, a free datagram socket in the pool will be
> > > > utilized for sending the message. If all sockets are engaged,
> > > > > the incoming event should be postponed to reenter again after a
> period
> > > of
> > > > time.
> > > > > - This stage additionally has also to maintain the list of messages
> sent
> > > > out, their IDs and the request timestamps. Let us call this
> > > > > "RequestMap" for future reference. The ID is the integer, described
> in
> > > the
> > > > RFC DNS message format, used to map the
> > > > > request-responses. The request timestamp will be made use of in
> query
> > > > timeout handling (discussed later).
> > > > >
> > > > > Parameters to this stage :
> > > > > - The DNS server hostname/ipaddress. If this is not given, then the
> > > > /etc/resolve.conf will be parsed to get the name server (only
> > > > > the primary would be taken as of now.). As a next step we will have
> to
> > > > build a round-robin way of querying the various name servers
> > > > > in resolve.conf, inorder to be polite with them.
> > > > > - If resolve.conf is not present, the local host will be assumed as
> the
> > >
> > > > name server.
> > > > >
> > > > >
> > > > > DNS Response Processing Stage :
> > > > >
> > > > > Overview:
> > > > > - Processes DNS responses.
> > > > >
> > > > > Implementation:
> > > > > - When the DNS datagram packets are received, the ID field in the
> header
> > > > should be used to match the corresponding request packet.
> > > > > - Check for timeouts, and discard it if it had timed out; else, set
> the
> > > > ipaddress/canonical name in the CrawlURI object and enqueue
> > > > > it to the "Page Requesting Stage". In addition, enqueue an event
> into
> > > the
> > > > "DNS Cache Handling Stage" for it to update its cache. Do
> > > > > the same, even on DNS Errors like "Name not found authoritative
> error".
> > > > > - The request entry in the RequestMap (maintained for timeout
> handling)
> > > > should be removed. This map, being shared across stages,
> > > > > should be synchronized.
> > > > >
> > > > >
> > > > > HTTP Page Requesting Stage :
> > > > >
> > > > > Overview:
> > > > > - Connects to host and sends GET requests for pages.
> > > > >
> > > > > Events:
> > > > > - Handles two types of events - StartFetchEvent and
> > > > ConnectionCompleteEvent.
> > > > > - The StartFetchEvent will make a TCP connect request to the host.
> While
> > > > doing so, we will register the current stage itself to
> > > > > receive back the ConnectionComplete events. Once we receive this
> > > > ConnectionCompleteEvent, we should send a HTTP GET request to the
> > > > > page. The response handling sink is set as the "HTTP Response
> Processing
> > > > Stage". Write failures should be handled.
> > > > >
> > > > >
> > > > > HTTP Response Processing Stage :
> > > > >
> > > > > Overview:
> > > > > - Processes downloaded pages.
> > > > >
> > > > > Implementation:
> > > > > - Check for timeouts, and discard it if it had timed out; else, read
> the
> > > > packets.
> > > > > - Once the response is completely read, the request entry in the
> > > > RequestMap (maintained for timeout handling) should be removed.
> > > > > - One issue here is when we are reading lengthy HTML pages, we might
> > > > receive half of the page and it might stop after that. So,
> > > > > essentially the timeout should be applied between chunks of
> reception.
> > > > > - Where should the errors like "404 Not Found", etc be propogated
> ???
> > > > >
> > > > >
> > > > > Timeout and Retry Handling Stage :
> > > > >
> > > > > Overview:
> > > > > This is a single threaded stage which enumerates through the
> RequestMap
> > > > and checks for timeouts. The timed out CrawlURIs will be
> > > > > retried until retry count exhausts. This stage will be self-timed
> > > > periodically using the SEDA ssTimer APIs.
> > > > >
> > > > > Other Notes:
> > > > > This timeout handling is a common stuff between the DNS requests and
> the
> > > > HTTP requests.
> > > > >
> > > > > Parameters to this stage:
> > > > > - DNS timeout value.
> > > > > - HTTP timeout value.
> > > > > - DNS retry count.
> > > > > - HTTP retry count. ( This would be 1 ).
> > > > >
> > > > >
> > > > > One other thing that could be done is that, the events by themselves
> > > will
> > > > contain information as to which next stage the output has
> > > > > to traverse. This will be flexible and no hardcoding is needed.
> > > > Especially, in making this non-blocking DNS library an open-source,
> > > > > it would come handy. Moreover, many users might not want it to be
> over
> > > > SEDA. So, we will have to give other interfaces as well.
> > > > >
> > > > > I am presently using the library classes given by dnsjava-1.3.2. (
> > > > http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java
> > > > > based synchronous implementation of DNS Resolver. I only make use of
> the
> > > > classes which encapsulate the formation of request packets,
> > > > > parses response packets and the various ResourceRecord classes. This
> > > > library is being used by Java Apache Mail Enterprise Server (
> > > > > http://james.apache.org/ ). So, it should be pretty reliable and
> tested.
> > > > Moreover it has support for IPv6, compression and security
> > > > > which we can make use of later.
> > > > >
> > > > > Thanks,
> > > > > Reddy
> > > > >
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Gordon Mohr
> > > > > To: archive-crawler@yahoogroups.com
> > > > > Cc: Raymie Stata ; wcr-team@...
> > > > > Sent: Saturday, March 01, 2003 2:43 AM
> > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > >
> > > > >
> > > > > Sounds like a reasonable plan.
> > > > >
> > > > > By "local name server" do you mean something *very* local -- for
> > > example,
> > > > > a standard nameserver we run on the same machine?
> > > > >
> > > > > That would seem to offer other benefits -- such as minimizing the
> modes
> > > > > of DNS lookup we have to do and offloading caching to another piece
> of
> > > > > software (at least at first).
> > > > >
> > > > > - Gordon
> > > > >
> > > > > ----- Original Message -----
> > > > > From: G.B.Reddy
> > > > > To: archive-crawler@yahoogroups.com
> > > > > Cc: Raymie Stata ; wcr-team@...
> > > > > Sent: Thursday, February 27, 2003 8:55 AM
> > > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > > >
> > > > >
> > > > > Gordon and Raymie,
> > > > >
> > > > > Here goes the proposal for the asynchronous DNS lookup API
> > > implementation.
> > > > >
> > > > > We shall implement a minimal resolver which is capable of sending
> DNS
> > > > request packets and processing response packets in an
> > > > > asynchrounous nio fashion. This resolver class will contact a local
> name
> > > > server and rely on it to do the actual lookup. The local
> > > > > name server will be configured to support recursion and better would
> be
> > > to
> > > > use a name server which does lookup asynchronously. (
> > > > > SQUID has asynchronous DNS lookup facilities ). Even if the local
> name
> > > > server is not asynchrounous, our java resolver being
> > > > > asynchronous will be good enough since our primary goal is that we
> do
> > > not
> > > > want any blocking code in our crawler implementation. This
> > > > > idea even sounds good considering the fact we would only reinvent
> the
> > > same
> > > > wheel if we opt to implement a complete full-fledged
> > > > > resolver implementation which complies with the RFC 1035 and 1034.
> We
> > > can
> > > > definitely implement this full-fledged resolver but the
> > > > > real concern is that this would require a lot of testing and the
> efforts
> > > > to make it rock solid in terms of robustness would be huge.
> > > > >
> > > > > So, the various jobs that we would have to do to build our Simple
> > > > Asynchronous DNS lookup API would be
> > > > > -- Request packet formation and reply packet parsing in the
> exact
> > > RFC
> > > > format.
> > > > > -- Use non-blocking IO APIs and do UDP. (We might not need TCP
> since
> > > > the name server is only in the local network.)
> > > > > -- Do canonical name queries and A record queries.
> > > > > -- Implement timeouts.
> > > > > -- Implement caching based on TTL. ( This may have to be
> deferred as
> > > > pointed by Raymie earlier. )
> > > > > -- Integrate with SEDA.
> > > > >
> > > > > Thanks,
> > > > > Reddy
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Gordon Mohr
> > > > > To: G.B.Reddy
> > > > > Cc: Raymie Stata ; wcr-team@... ;
> > > > archive-crawler@yahoogroups.com
> > > > > Sent: Saturday, February 22, 2003 5:37 AM
> > > > > Subject: [archive-crawler] Re: Web crawler work ??
> > > > >
> > > > >
> > > > > [CC'ing to archive-crawler@yahoogroups.com]
> > > > >
> > > > > Reddy writes:
> > > > >
> > > > > > On the first cut do we need to look at implementing an
> asynchronous
> > > DNS
> > > > > > lookup mechanism. If we are not, then it is going to be two
> stages,
> > > viz.
> > > > > > DNSCacheHandlingStage and ResolvingStage, that can be employed
> using
> > > the
> > > > > > blocking DNS lookup calls in Java. The first stage,
> > > > DNSCacheHandlingStage,
> > > > > > would check if the entry is available in the cache. If available,
> he
> > > > would
> > > > > > set the resolved address in the CrawlURI object and enqueue it to
> the
> > > > > > appropriate next stage. If the cache doesn't contain the entry,
> then
> > > he
> > > > > > would pass the request to the Resolving stage which would call the
> > > > > > InetAddress.getByName blocking method to resolve it. The getByName
> > > > result
> > > > > > would be set in the CrawlURI object as before and enqueued into
> the
> > > > > > appropriate next stage. In addition to this, the Resolving stage
> will
> > > > > > enqueue another event into the DNSCacheHandlingStage to enable him
> > > > update
> > > > > > his cache. So, the DNSCacheHandlingStage would be handling two
> types
> > > of
> > > > > > events, one is the lookup events and the other is the update cache
> > > > events.
> > > > > >
> > > > > > One problem here is that the InetAddress class does not expose its
> > > cache
> > > > > > variables to its users. Even we cannot check if the cache has an
> entry
> > > > > > before calling the getByName method. So, we should be disabling
> the
> > > java
> > > > > > cache ( using the policy file ) and implementing our own caching
> > > > mechanism.
> > > > > > ( The DNSCacheHandlingStage would have to additionally do the job
> of
> > > > > > throwing away the expired entries in the cache also.)
> > > > > >
> > > > > > Let me know your comments on this.
> > > > >
> > > > > This looks like a good first cut. I'm still working to improve my
> > > > > understanding of the best way to use the staged style, mostly by
> > > > > looking at their HTTP and HTTP Server (Haboob) code.
> > > > >
> > > > > It seems that they've tended to use a single Stage object to do
> > > > > many different steps/aspects of one process, by switching on the
> > > > > type of QueueElement received.
> > > > >
> > > > > So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv
> > > > > accepts events of types....
> > > > >
> > > > > - httpConnection
> > > > > - httpRequest
> > > > > - SinkClosedEvent
> > > > > - timerEvent
> > > > >
> > > > > And their seda.sandStorm.lib.http.httpServer accepts events of
> > > > > types...
> > > > >
> > > > > - ATcpInPacket
> > > > > - ATcpConnection
> > > > > - aSocketErrorEvent
> > > > > - SinkDrainedEvent
> > > > > - SinkCloggedEvent
> > > > > - SinkClosedEvent
> > > > > - ATcpListenSuccessEvent
> > > > >
> > > > > They also use Sinks that are not associated with stages; rather,
> > > > > they interface to unstaged components which nonetheless result in
> > > > > an eventual event to some supplied answer Sink. See for example
> > > > > seda.sandStorm.lib.http.httpConnection.
> > > > >
> > > > > So perhaps as a matter of grouping related tasks, the same Stage
> object
> > > > > should be re-entered over the course of a lookup, with different
> > > > triggering
> > > > > events. For example, you might want to reenter a single
> > > DNSResolvingStage
> > > > > over the course of cache lookup, lookup-initiation, result-receiving
> (or
> > > > > timeout), etc. I'm not sure; use your judgement as to how many
> stages
> > > are
> > > > > really needed.
> > > > >
> > > > > > P.S : We found some openly available async dns client APIs in C
> > > > language.
> > > > >
> > > > > That could be useful as a model. (I doubt we'd want to call out to C
> > > > > for this simple step, though -- and if we nailed down a truly async
> Java
> > > > > DNS facility, a lot of open source projects would probably be quite
> > > > happy.)
> > > > >
> > > > > Also: I heard back from Patrick Eaton about SEDA-style async HTTP
> client
> > > > > code... he has a rough implementation for simple usage, and he knows
> of
> > > > > another one at Berkeley which goes deeper into HTTP/1.1 conformance
> and
> > > > > optimal performance. I've asked him to forward whatever additional
> code
> > > > > or details he can.
> > > > >
> > > > > - Gordon
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > To unsubscribe from this group, send an email to:
> > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > >
> > > > >
> > > > >
> > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > > > >
> > > > >
> > > > > Yahoo! Groups Sponsor
> > > > > ADVERTISEMENT
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > To unsubscribe from this group, send an email to:
> > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > >
> > > > >
> > > > >
> > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > > > >
> > > > >
> > > > >
> > > > > To unsubscribe from this group, send an email to:
> > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > >
> > > > >
> > > > >
> > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > > > >
> > > > >
> > > > > Yahoo! Groups Sponsor
> > > > > ADVERTISEMENT
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > To unsubscribe from this group, send an email to:
> > > > > archive-crawler-unsubscribe@yahoogroups.com
> > > > >
> > > > >
> > > > >
> > > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > >
>
>
>
> To unsubscribe from this group, send an email to:
> archive-crawler-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
>
>
Gordon,
I have checked in the first version of the asynchronous DNS lookup stage
(DNSLookingUp.java). I have also updated the README and the anecdote.cfg
file accordingly.
I have made my first version without any changes in the dnsjava libraries.
However, I would like to reorganize the dnsjava code to support proper
interfaces for asynchronous queries. I can do it in the next 3 or 4 days.
This would help us in the near future since it contains lot of useful
features of a full fledged resolver. At present, my implementation requires
that the name server be configured for recursion.
At present, the code just prints the received DNS replies and errors. The
results/errors are not set into the CrawlURI object yet.
Please note that I have not yet updated the .project and .classpath files. I
am yet to switch to eclipse.
Thanks,
Reddy
----- Original Message -----
From: "Gordon Mohr" <gojomo@...>
To: "G.B.Reddy" <reddy@...>
Cc: <wcr-team@...>
Sent: Tuesday, March 18, 2003 1:03 AM
Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web crawler
work ??
> Great -- I look forward to seeing your design in more detail.
>
> Feel free to check it into the 'Anecdote' module -- for now,
> that's a scratch/prototyping area we can put the first working
> versions of things. Perhaps the package could be org.archive.util.dns
> or org.archive.seda.dns?
>
> I expect that when the system is a bit more stable and evolved, we'll
> move it to a new CVS module (and perhaps package structure).
>
> - Gordon
>
> ----- Original Message -----
> From: "G.B.Reddy" <reddy@...>
> To: "Gordon Mohr" <gojomo@...>
> Cc: <wcr-team@...>
> Sent: Friday, March 14, 2003 8:20 AM
> Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
crawler work ??
>
>
> > Gordon,
> >
> > I have made certain modifications to the dnsjava code to remove the
tight
> > coupling between the caching and the dns reply message processing. I
would
> > complete this and checkin on Monday.
> >
> > -Reddy
> >
> >
> > ----- Original Message -----
> > From: "G.B.Reddy" <reddy@...>
> > To: "Gordon Mohr" <gojomo@...>;
<archive-crawler@yahoogroups.com>
> > Cc: <wcr-team@...>
> > Sent: Wednesday, March 12, 2003 9:44 PM
> > Subject: Re: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
crawler
> > work ??
> >
> >
> > > Gordon,
> > >
> > > I am done with the asynchronous DNS code. I shall test it more
tomorrow
> > and
> > > checkin. I may start using the caching mechanism present in the
dnsjava
> > > libraries. They seem to be a bit tightly bound with the response
> > processing
> > > code which I am using.
> > >
> > > The CrawlURI object as of now contains only the uri string. I would
need
> > > variables for setting the ipaddress and other state to indicate lookup
> > > failures. Please let me know if you have any versions over it.
> > >
> > > Thanks,
> > > Reddy
> > >
> > >
> > > ----- Original Message -----
> > > From: "Gordon Mohr" <gojomo@...>
> > > To: <archive-crawler@yahoogroups.com>
> > > Cc: <wcr-team@...>
> > > Sent: Saturday, March 08, 2003 5:00 AM
> > > Subject: DNS and HTTP staging (was Re: [archive-crawler] Re: Web
crawler
> > > work ??
> > >
> > >
> > > > These are good decompositions of the steps involved, and the LGPL
> > dnsjava
> > > > library looks very useful for our needs.
> > > >
> > > > My tendency would be to think fewer stages are better -- and when
> > > > communicating between our stages, rather than using a shared
RequestMap,
> > > > include on the requesting event whatever context will be needed to
deal
> > > > with the response when it arrives.
> > > >
> > > > You can see this technique used in the ostore.network.ADns class I
> > > referenced
> > > > yesterday -- the user_data object lets any client of the ADns stage
> > > recover
> > > > the state it needs to deal with ADns's eventual response. (We may
still
> > > need
> > > > the equivalent of a RequestMap to deal with mapping UDP response
packets
> > > to
> > > > the lookups that triggered them -- but that'd then be a map that can
be
> > > private
> > > > to a single stage.)
> > > >
> > > > Nothing in the DNS procedure needs to block -- once you assume
> > > asynchronous
> > > > UDP replies and an in-RAM response cache -- and thus one thread
should
> > be
> > > > as good as (in fact better than) N threads.
> > > >
> > > > Similarly, I don't think more than 1 UDP socket will be necessary
for
> > > > all DNS lookups, because as a practical matter, 1 open UDP socket
will
> > > > never be "busy" in a way that additional UDP sockets would help.
> > > >
> > > > --
> > > > Regarding 404 and other explicit HTTP app errors: these should be
> > recorded
> > > > into the CrawlURI, and forwarded to the next processing stage, just
as
> > > > with successes.
> > > >
> > > > Redirects are a special case we haven't discussed much yet; they
would
> > > seem
> > > > candidates for some sort of expedited fetching. I think, though,
such
> > > > expedited activity must pass through the full cycle of stages for
> > > > proper logging and policy application. (In contrast, retries in the
face
> > > > of transient errors, which may, at least in some cases, be fed
> > immediately
> > > > back to the Fetching or Preprocessing stages.)
> > > >
> > > > --
> > > > The issue of timeouts (esp. in HTTP) is thorny. From what I hear
about
> > > > crawler traps, we need to not only to be able to deal with hangs,
but
> > > > also various infinite-length trickles.
> > > >
> > > > If any of the HTTP-transactions-in-progress is not making sufficient
> > > > progress, according to some set of progress-per-time-unit
thresholds,
> > > > we have to be ready to give up on it. We should mark it as an error,
> > > > but probably retain any partial info we did get for later analysis.
> > > >
> > > > Separating this out into its own stage increases synchronization
issues.
> > > > If timeout analysis can instead be part of the same single-threaded
> > stage
> > > > where HTTP some-progress-made events (from sockets) are handled, I
think
> > > > it can be handled very efficiently:
> > > >
> > > > - keep all outstanding transactions on a linked-list
> > > > - whenever progress is reported (from the socket),
> > > > and that progress keeps the transaction above
> > > > acceptable thresholds, move that transaction to
> > > > the head of the list
> > > > - check the tail of the list occasionally to see
> > > > if the worst transactions need to be cut
> > > >
> > > > That "occasional" check could mostly occur as part of normal
> > > > progress-packet processing; an extremely infrequent timer could
> > > > trigger reevaluation in the rare case where all progress on all
> > > > sockets has stopped...
> > > >
> > > > - Gordon
> > > >
> > > > ----- Original Message -----
> > > > From: G.B.Reddy
> > > > To: archive-crawler@yahoogroups.com
> > > > Cc: wcr-team@...
> > > > Sent: Friday, March 07, 2003 8:31 AM
> > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > >
> > > >
> > > >
> > > > More insight on the DNS stages.
> > > >
> > > > As stated in the design earlier, "DNS Querying Stage", "DNS Response
> > > Processing Stage" and "Timeout and Retry Handling Stage"
> > > > access/update the shared RequestMap. So, they need to be
synchronized.
> > We
> > > were just thinking whether these need to be three separate
> > > > stages or could it be one stage multiplexing each of those incoming
> > event
> > > types. The pros and cons of them are as below.
> > > >
> > > > Single stage benefit / Multi Stage Issue :
> > > > - If it is single stage, then synchronization is not done across
stages.
> > > Even though, synchronization anyway would be needed
> > > > internally in case of single stage, conceptually it looks neater not
to
> > > synchronize across stages.
> > > >
> > > > Single stage issue / Multi Stage benefit :
> > > > - In case of single stage, the event queue would be one and this
fact
> > > pulls our legs when we want to handle overload conditions.
> > > > Since there is no clear count of the distinct elements in the queue,
it
> > > becomes difficult to analyse and condition the system
> > > > accordingly. Whereas, in multi stage, each one having its own queue,
> > > conditioning/prioritizating becomes easy.
> > > >
> > > > I think we would end up going in for multi stage, unless SEDA could
take
> > > care of it by itself.
> > > >
> > > > And in case of parallel stages, like the ones in post processing, I
> > think
> > > most often, synchronization across them may be
> > > > unavoidable.
> > > >
> > > > Thanks,
> > > > Reddy
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: G.B.Reddy
> > > > To: archive-crawler@yahoogroups.com
> > > > Cc: wcr-team@...
> > > > Sent: Thursday, March 06, 2003 11:21 PM
> > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > >
> > > >
> > > > Gordon and Raymie,
> > > >
> > > > Below are the various stages and their design with the issues
involved
> > in
> > > the DNS Resolver and HTTP Client implementation.
> > > >
> > > > DNS History/Cache Handling Stage :
> > > >
> > > > Overview:
> > > > - Maintains successful lookups in cache.
> > > > - Does negative caching.
> > > > - Times itself to clean the expired entries based on TTL. (Would use
the
> > > ssTimer SEDA APIs to schedule itself periodically)
> > > > - This stage would be dummy or could be skipped as of now since we
want
> > to
> > > do caching later.
> > > >
> > > > Events:
> > > > - Two types of events : DNSCacheLookupEvent, DNSCacheUpdateEvent.
> > > > - DNSCacheLookupEvent : If entry is found in cache, the ipaddress is
set
> > > in the CrawlURI object and is enqueued into the "Page
> > > > Requesting Stage". Else, it is enqueued into the "DNS Querying
Stage".
> > > > - DNSCacheUpdateEvent : This event is published by the "DNS Response
> > > Processing Stage" every after successful/failed lookup inorder
> > > > to update the cache.
> > > >
> > > > Other notes:
> > > > - This stage could be single threaded else lot of synchronization
might
> > be
> > > needed.
> > > > - Resubmitting events on queue full exceptions while enqueuing into
this
> > > stage's queue should be handled by the caller by scheduling
> > > > it in future.
> > > >
> > > >
> > > > DNS Querying Stage :
> > > >
> > > > Overview:
> > > > - Sends the actual DNS ARecord query packets to the DNS Server. (The
> > > response packets are processed in a later stage)
> > > > - Maintains a pool of DatagramSocket objects.
> > > >
> > > > Events:
> > > > - SendDNSQueryEvent : This is published by the "DNS History/Cache
> > Handling
> > > Stage" when cache miss happens. The DNS query is formed
> > > > and sent out. The response handling sink is set as the "DNS Response
> > > Processing Stage".
> > > >
> > > > Implementation:
> > > > - A pool of DatagramSockets of a configurable maximum size is
> > maintained.
> > > It will be filled incrementally. All these datagram
> > > > sockets will be registered to the selector maintained by the SEDA
> > > internals. It would be ideal if this pool gets shrunk or expanded
> > > > based on the requirement. If it is not shrunk back, then it is an
> > > unnecessary overhead on the selector. The reason behind having a
> > > > pool is to restrict the number of ports the selector has to listen
upon
> > > and also not to create individual DatagramSocket objects for
> > > > every query. Can this logic of bounded pool, be implemented as a
> > > Controller in the SEDA framework (just like the
> > > > ThreadPoolController) is an open question.
> > > > - When an event comes in, a free datagram socket in the pool will be
> > > utilized for sending the message. If all sockets are engaged,
> > > > the incoming event should be postponed to reenter again after a
period
> > of
> > > time.
> > > > - This stage additionally has also to maintain the list of messages
sent
> > > out, their IDs and the request timestamps. Let us call this
> > > > "RequestMap" for future reference. The ID is the integer, described
in
> > the
> > > RFC DNS message format, used to map the
> > > > request-responses. The request timestamp will be made use of in
query
> > > timeout handling (discussed later).
> > > >
> > > > Parameters to this stage :
> > > > - The DNS server hostname/ipaddress. If this is not given, then the
> > > /etc/resolve.conf will be parsed to get the name server (only
> > > > the primary would be taken as of now.). As a next step we will have
to
> > > build a round-robin way of querying the various name servers
> > > > in resolve.conf, inorder to be polite with them.
> > > > - If resolve.conf is not present, the local host will be assumed as
the
> >
> > > name server.
> > > >
> > > >
> > > > DNS Response Processing Stage :
> > > >
> > > > Overview:
> > > > - Processes DNS responses.
> > > >
> > > > Implementation:
> > > > - When the DNS datagram packets are received, the ID field in the
header
> > > should be used to match the corresponding request packet.
> > > > - Check for timeouts, and discard it if it had timed out; else, set
the
> > > ipaddress/canonical name in the CrawlURI object and enqueue
> > > > it to the "Page Requesting Stage". In addition, enqueue an event
into
> > the
> > > "DNS Cache Handling Stage" for it to update its cache. Do
> > > > the same, even on DNS Errors like "Name not found authoritative
error".
> > > > - The request entry in the RequestMap (maintained for timeout
handling)
> > > should be removed. This map, being shared across stages,
> > > > should be synchronized.
> > > >
> > > >
> > > > HTTP Page Requesting Stage :
> > > >
> > > > Overview:
> > > > - Connects to host and sends GET requests for pages.
> > > >
> > > > Events:
> > > > - Handles two types of events - StartFetchEvent and
> > > ConnectionCompleteEvent.
> > > > - The StartFetchEvent will make a TCP connect request to the host.
While
> > > doing so, we will register the current stage itself to
> > > > receive back the ConnectionComplete events. Once we receive this
> > > ConnectionCompleteEvent, we should send a HTTP GET request to the
> > > > page. The response handling sink is set as the "HTTP Response
Processing
> > > Stage". Write failures should be handled.
> > > >
> > > >
> > > > HTTP Response Processing Stage :
> > > >
> > > > Overview:
> > > > - Processes downloaded pages.
> > > >
> > > > Implementation:
> > > > - Check for timeouts, and discard it if it had timed out; else, read
the
> > > packets.
> > > > - Once the response is completely read, the request entry in the
> > > RequestMap (maintained for timeout handling) should be removed.
> > > > - One issue here is when we are reading lengthy HTML pages, we might
> > > receive half of the page and it might stop after that. So,
> > > > essentially the timeout should be applied between chunks of
reception.
> > > > - Where should the errors like "404 Not Found", etc be propogated
???
> > > >
> > > >
> > > > Timeout and Retry Handling Stage :
> > > >
> > > > Overview:
> > > > This is a single threaded stage which enumerates through the
RequestMap
> > > and checks for timeouts. The timed out CrawlURIs will be
> > > > retried until retry count exhausts. This stage will be self-timed
> > > periodically using the SEDA ssTimer APIs.
> > > >
> > > > Other Notes:
> > > > This timeout handling is a common stuff between the DNS requests and
the
> > > HTTP requests.
> > > >
> > > > Parameters to this stage:
> > > > - DNS timeout value.
> > > > - HTTP timeout value.
> > > > - DNS retry count.
> > > > - HTTP retry count. ( This would be 1 ).
> > > >
> > > >
> > > > One other thing that could be done is that, the events by themselves
> > will
> > > contain information as to which next stage the output has
> > > > to traverse. This will be flexible and no hardcoding is needed.
> > > Especially, in making this non-blocking DNS library an open-source,
> > > > it would come handy. Moreover, many users might not want it to be
over
> > > SEDA. So, we will have to give other interfaces as well.
> > > >
> > > > I am presently using the library classes given by dnsjava-1.3.2. (
> > > http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java
> > > > based synchronous implementation of DNS Resolver. I only make use of
the
> > > classes which encapsulate the formation of request packets,
> > > > parses response packets and the various ResourceRecord classes. This
> > > library is being used by Java Apache Mail Enterprise Server (
> > > > http://james.apache.org/ ). So, it should be pretty reliable and
tested.
> > > Moreover it has support for IPv6, compression and security
> > > > which we can make use of later.
> > > >
> > > > Thanks,
> > > > Reddy
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Gordon Mohr
> > > > To: archive-crawler@yahoogroups.com
> > > > Cc: Raymie Stata ; wcr-team@...
> > > > Sent: Saturday, March 01, 2003 2:43 AM
> > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > >
> > > >
> > > > Sounds like a reasonable plan.
> > > >
> > > > By "local name server" do you mean something *very* local -- for
> > example,
> > > > a standard nameserver we run on the same machine?
> > > >
> > > > That would seem to offer other benefits -- such as minimizing the
modes
> > > > of DNS lookup we have to do and offloading caching to another piece
of
> > > > software (at least at first).
> > > >
> > > > - Gordon
> > > >
> > > > ----- Original Message -----
> > > > From: G.B.Reddy
> > > > To: archive-crawler@yahoogroups.com
> > > > Cc: Raymie Stata ; wcr-team@...
> > > > Sent: Thursday, February 27, 2003 8:55 AM
> > > > Subject: Re: [archive-crawler] Re: Web crawler work ??
> > > >
> > > >
> > > > Gordon and Raymie,
> > > >
> > > > Here goes the proposal for the asynchronous DNS lookup API
> > implementation.
> > > >
> > > > We shall implement a minimal resolver which is capable of sending
DNS
> > > request packets and processing response packets in an
> > > > asynchrounous nio fashion. This resolver class will contact a local
name
> > > server and rely on it to do the actual lookup. The local
> > > > name server will be configured to support recursion and better would
be
> > to
> > > use a name server which does lookup asynchronously. (
> > > > SQUID has asynchronous DNS lookup facilities ). Even if the local
name
> > > server is not asynchrounous, our java resolver being
> > > > asynchronous will be good enough since our primary goal is that we
do
> > not
> > > want any blocking code in our crawler implementation. This
> > > > idea even sounds good considering the fact we would only reinvent
the
> > same
> > > wheel if we opt to implement a complete full-fledged
> > > > resolver implementation which complies with the RFC 1035 and 1034.
We
> > can
> > > definitely implement this full-fledged resolver but the
> > > > real concern is that this would require a lot of testing and the
efforts
> > > to make it rock solid in terms of robustness would be huge.
> > > >
> > > > So, the various jobs that we would have to do to build our Simple
> > > Asynchronous DNS lookup API would be
> > > > -- Request packet formation and reply packet parsing in the
exact
> > RFC
> > > format.
> > > > -- Use non-blocking IO APIs and do UDP. (We might not need TCP
since
> > > the name server is only in the local network.)
> > > > -- Do canonical name queries and A record queries.
> > > > -- Implement timeouts.
> > > > -- Implement caching based on TTL. ( This may have to be
deferred as
> > > pointed by Raymie earlier. )
> > > > -- Integrate with SEDA.
> > > >
> > > > Thanks,
> > > > Reddy
> > > >
> > > > ----- Original Message -----
> > > > From: Gordon Mohr
> > > > To: G.B.Reddy
> > > > Cc: Raymie Stata ; wcr-team@... ;
> > > archive-crawler@yahoogroups.com
> > > > Sent: Saturday, February 22, 2003 5:37 AM
> > > > Subject: [archive-crawler] Re: Web crawler work ??
> > > >
> > > >
> > > > [CC'ing to archive-crawler@yahoogroups.com]
> > > >
> > > > Reddy writes:
> > > >
> > > > > On the first cut do we need to look at implementing an
asynchronous
> > DNS
> > > > > lookup mechanism. If we are not, then it is going to be two
stages,
> > viz.
> > > > > DNSCacheHandlingStage and ResolvingStage, that can be employed
using
> > the
> > > > > blocking DNS lookup calls in Java. The first stage,
> > > DNSCacheHandlingStage,
> > > > > would check if the entry is available in the cache. If available,
he
> > > would
> > > > > set the resolved address in the CrawlURI object and enqueue it to
the
> > > > > appropriate next stage. If the cache doesn't contain the entry,
then
> > he
> > > > > would pass the request to the Resolving stage which would call the
> > > > > InetAddress.getByName blocking method to resolve it. The getByName
> > > result
> > > > > would be set in the CrawlURI object as before and enqueued into
the
> > > > > appropriate next stage. In addition to this, the Resolving stage
will
> > > > > enqueue another event into the DNSCacheHandlingStage to enable him
> > > update
> > > > > his cache. So, the DNSCacheHandlingStage would be handling two
types
> > of
> > > > > events, one is the lookup events and the other is the update cache
> > > events.
> > > > >
> > > > > One problem here is that the InetAddress class does not expose its
> > cache
> > > > > variables to its users. Even we cannot check if the cache has an
entry
> > > > > before calling the getByName method. So, we should be disabling
the
> > java
> > > > > cache ( using the policy file ) and implementing our own caching
> > > mechanism.
> > > > > ( The DNSCacheHandlingStage would have to additionally do the job
of
> > > > > throwing away the expired entries in the cache also.)
> > > > >
> > > > > Let me know your comments on this.
> > > >
> > > > This looks like a good first cut. I'm still working to improve my
> > > > understanding of the best way to use the staged style, mostly by
> > > > looking at their HTTP and HTTP Server (Haboob) code.
> > > >
> > > > It seems that they've tended to use a single Stage object to do
> > > > many different steps/aspects of one process, by switching on the
> > > > type of QueueElement received.
> > > >
> > > > So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv
> > > > accepts events of types....
> > > >
> > > > - httpConnection
> > > > - httpRequest
> > > > - SinkClosedEvent
> > > > - timerEvent
> > > >
> > > > And their seda.sandStorm.lib.http.httpServer accepts events of
> > > > types...
> > > >
> > > > - ATcpInPacket
> > > > - ATcpConnection
> > > > - aSocketErrorEvent
> > > > - SinkDrainedEvent
> > > > - SinkCloggedEvent
> > > > - SinkClosedEvent
> > > > - ATcpListenSuccessEvent
> > > >
> > > > They also use Sinks that are not associated with stages; rather,
> > > > they interface to unstaged components which nonetheless result in
> > > > an eventual event to some supplied answer Sink. See for example
> > > > seda.sandStorm.lib.http.httpConnection.
> > > >
> > > > So perhaps as a matter of grouping related tasks, the same Stage
object
> > > > should be re-entered over the course of a lookup, with different
> > > triggering
> > > > events. For example, you might want to reenter a single
> > DNSResolvingStage
> > > > over the course of cache lookup, lookup-initiation, result-receiving
(or
> > > > timeout), etc. I'm not sure; use your judgement as to how many
stages
> > are
> > > > really needed.
> > > >
> > > > > P.S : We found some openly available async dns client APIs in C
> > > language.
> > > >
> > > > That could be useful as a model. (I doubt we'd want to call out to C
> > > > for this simple step, though -- and if we nailed down a truly async
Java
> > > > DNS facility, a lot of open source projects would probably be quite
> > > happy.)
> > > >
> > > > Also: I heard back from Patrick Eaton about SEDA-style async HTTP
client
> > > > code... he has a rough implementation for simple usage, and he knows
of
> > > > another one at Berkeley which goes deeper into HTTP/1.1 conformance
and
> > > > optimal performance. I've asked him to forward whatever additional
code
> > > > or details he can.
> > > >
> > > > - Gordon
> > > >
> > > >
> > > >
> > > >
> > > > To unsubscribe from this group, send an email to:
> > > > archive-crawler-unsubscribe@yahoogroups.com
> > > >
> > > >
> > > >
> > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > > >
> > > >
> > > > Yahoo! Groups Sponsor
> > > > ADVERTISEMENT
> > > >
> > > >
> > > >
> > > >
> > > > To unsubscribe from this group, send an email to:
> > > > archive-crawler-unsubscribe@yahoogroups.com
> > > >
> > > >
> > > >
> > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > > >
> > > >
> > > >
> > > > To unsubscribe from this group, send an email to:
> > > > archive-crawler-unsubscribe@yahoogroups.com
> > > >
> > > >
> > > >
> > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> > > >
> > > >
> > > > Yahoo! Groups Sponsor
> > > > ADVERTISEMENT
> > > >
> > > >
> > > >
> > > >
> > > > To unsubscribe from this group, send an email to:
> > > > archive-crawler-unsubscribe@yahoogroups.com
> > > >
> > > >
> > > >
> > > > Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
> >
Gordon,
I am done with the asynchronous DNS code. I shall test it more tomorrow and
checkin. I may start using the caching mechanism present in the dnsjava
libraries. They seem to be a bit tightly bound with the response processing
code which I am using.
The CrawlURI object as of now contains only the uri string. I would need
variables for setting the ipaddress and other state to indicate lookup
failures. Please let me know if you have any versions over it.
Thanks,
Reddy
----- Original Message -----
From: "Gordon Mohr" <gojomo@...>
To: <archive-crawler@yahoogroups.com>
Cc: <wcr-team@...>
Sent: Saturday, March 08, 2003 5:00 AM
Subject: DNS and HTTP staging (was Re: [archive-crawler] Re: Web crawler
work ??
> These are good decompositions of the steps involved, and the LGPL dnsjava
> library looks very useful for our needs.
>
> My tendency would be to think fewer stages are better -- and when
> communicating between our stages, rather than using a shared RequestMap,
> include on the requesting event whatever context will be needed to deal
> with the response when it arrives.
>
> You can see this technique used in the ostore.network.ADns class I
referenced
> yesterday -- the user_data object lets any client of the ADns stage
recover
> the state it needs to deal with ADns's eventual response. (We may still
need
> the equivalent of a RequestMap to deal with mapping UDP response packets
to
> the lookups that triggered them -- but that'd then be a map that can be
private
> to a single stage.)
>
> Nothing in the DNS procedure needs to block -- once you assume
asynchronous
> UDP replies and an in-RAM response cache -- and thus one thread should be
> as good as (in fact better than) N threads.
>
> Similarly, I don't think more than 1 UDP socket will be necessary for
> all DNS lookups, because as a practical matter, 1 open UDP socket will
> never be "busy" in a way that additional UDP sockets would help.
>
> --
> Regarding 404 and other explicit HTTP app errors: these should be recorded
> into the CrawlURI, and forwarded to the next processing stage, just as
> with successes.
>
> Redirects are a special case we haven't discussed much yet; they would
seem
> candidates for some sort of expedited fetching. I think, though, such
> expedited activity must pass through the full cycle of stages for
> proper logging and policy application. (In contrast, retries in the face
> of transient errors, which may, at least in some cases, be fed immediately
> back to the Fetching or Preprocessing stages.)
>
> --
> The issue of timeouts (esp. in HTTP) is thorny. From what I hear about
> crawler traps, we need to not only to be able to deal with hangs, but
> also various infinite-length trickles.
>
> If any of the HTTP-transactions-in-progress is not making sufficient
> progress, according to some set of progress-per-time-unit thresholds,
> we have to be ready to give up on it. We should mark it as an error,
> but probably retain any partial info we did get for later analysis.
>
> Separating this out into its own stage increases synchronization issues.
> If timeout analysis can instead be part of the same single-threaded stage
> where HTTP some-progress-made events (from sockets) are handled, I think
> it can be handled very efficiently:
>
> - keep all outstanding transactions on a linked-list
> - whenever progress is reported (from the socket),
> and that progress keeps the transaction above
> acceptable thresholds, move that transaction to
> the head of the list
> - check the tail of the list occasionally to see
> if the worst transactions need to be cut
>
> That "occasional" check could mostly occur as part of normal
> progress-packet processing; an extremely infrequent timer could
> trigger reevaluation in the rare case where all progress on all
> sockets has stopped...
>
> - Gordon
>
> ----- Original Message -----
> From: G.B.Reddy
> To: archive-crawler@yahoogroups.com
> Cc: wcr-team@...
> Sent: Friday, March 07, 2003 8:31 AM
> Subject: Re: [archive-crawler] Re: Web crawler work ??
>
>
>
> More insight on the DNS stages.
>
> As stated in the design earlier, "DNS Querying Stage", "DNS Response
Processing Stage" and "Timeout and Retry Handling Stage"
> access/update the shared RequestMap. So, they need to be synchronized. We
were just thinking whether these need to be three separate
> stages or could it be one stage multiplexing each of those incoming event
types. The pros and cons of them are as below.
>
> Single stage benefit / Multi Stage Issue :
> - If it is single stage, then synchronization is not done across stages.
Even though, synchronization anyway would be needed
> internally in case of single stage, conceptually it looks neater not to
synchronize across stages.
>
> Single stage issue / Multi Stage benefit :
> - In case of single stage, the event queue would be one and this fact
pulls our legs when we want to handle overload conditions.
> Since there is no clear count of the distinct elements in the queue, it
becomes difficult to analyse and condition the system
> accordingly. Whereas, in multi stage, each one having its own queue,
conditioning/prioritizating becomes easy.
>
> I think we would end up going in for multi stage, unless SEDA could take
care of it by itself.
>
> And in case of parallel stages, like the ones in post processing, I think
most often, synchronization across them may be
> unavoidable.
>
> Thanks,
> Reddy
>
>
> ----- Original Message -----
> From: G.B.Reddy
> To: archive-crawler@yahoogroups.com
> Cc: wcr-team@...
> Sent: Thursday, March 06, 2003 11:21 PM
> Subject: Re: [archive-crawler] Re: Web crawler work ??
>
>
> Gordon and Raymie,
>
> Below are the various stages and their design with the issues involved in
the DNS Resolver and HTTP Client implementation.
>
> DNS History/Cache Handling Stage :
>
> Overview:
> - Maintains successful lookups in cache.
> - Does negative caching.
> - Times itself to clean the expired entries based on TTL. (Would use the
ssTimer SEDA APIs to schedule itself periodically)
> - This stage would be dummy or could be skipped as of now since we want to
do caching later.
>
> Events:
> - Two types of events : DNSCacheLookupEvent, DNSCacheUpdateEvent.
> - DNSCacheLookupEvent : If entry is found in cache, the ipaddress is set
in the CrawlURI object and is enqueued into the "Page
> Requesting Stage". Else, it is enqueued into the "DNS Querying Stage".
> - DNSCacheUpdateEvent : This event is published by the "DNS Response
Processing Stage" every after successful/failed lookup inorder
> to update the cache.
>
> Other notes:
> - This stage could be single threaded else lot of synchronization might be
needed.
> - Resubmitting events on queue full exceptions while enqueuing into this
stage's queue should be handled by the caller by scheduling
> it in future.
>
>
> DNS Querying Stage :
>
> Overview:
> - Sends the actual DNS ARecord query packets to the DNS Server. (The
response packets are processed in a later stage)
> - Maintains a pool of DatagramSocket objects.
>
> Events:
> - SendDNSQueryEvent : This is published by the "DNS History/Cache Handling
Stage" when cache miss happens. The DNS query is formed
> and sent out. The response handling sink is set as the "DNS Response
Processing Stage".
>
> Implementation:
> - A pool of DatagramSockets of a configurable maximum size is maintained.
It will be filled incrementally. All these datagram
> sockets will be registered to the selector maintained by the SEDA
internals. It would be ideal if this pool gets shrunk or expanded
> based on the requirement. If it is not shrunk back, then it is an
unnecessary overhead on the selector. The reason behind having a
> pool is to restrict the number of ports the selector has to listen upon
and also not to create individual DatagramSocket objects for
> every query. Can this logic of bounded pool, be implemented as a
Controller in the SEDA framework (just like the
> ThreadPoolController) is an open question.
> - When an event comes in, a free datagram socket in the pool will be
utilized for sending the message. If all sockets are engaged,
> the incoming event should be postponed to reenter again after a period of
time.
> - This stage additionally has also to maintain the list of messages sent
out, their IDs and the request timestamps. Let us call this
> "RequestMap" for future reference. The ID is the integer, described in the
RFC DNS message format, used to map the
> request-responses. The request timestamp will be made use of in query
timeout handling (discussed later).
>
> Parameters to this stage :
> - The DNS server hostname/ipaddress. If this is not given, then the
/etc/resolve.conf will be parsed to get the name server (only
> the primary would be taken as of now.). As a next step we will have to
build a round-robin way of querying the various name servers
> in resolve.conf, inorder to be polite with them.
> - If resolve.conf is not present, the local host will be assumed as the
name server.
>
>
> DNS Response Processing Stage :
>
> Overview:
> - Processes DNS responses.
>
> Implementation:
> - When the DNS datagram packets are received, the ID field in the header
should be used to match the corresponding request packet.
> - Check for timeouts, and discard it if it had timed out; else, set the
ipaddress/canonical name in the CrawlURI object and enqueue
> it to the "Page Requesting Stage". In addition, enqueue an event into the
"DNS Cache Handling Stage" for it to update its cache. Do
> the same, even on DNS Errors like "Name not found authoritative error".
> - The request entry in the RequestMap (maintained for timeout handling)
should be removed. This map, being shared across stages,
> should be synchronized.
>
>
> HTTP Page Requesting Stage :
>
> Overview:
> - Connects to host and sends GET requests for pages.
>
> Events:
> - Handles two types of events - StartFetchEvent and
ConnectionCompleteEvent.
> - The StartFetchEvent will make a TCP connect request to the host. While
doing so, we will register the current stage itself to
> receive back the ConnectionComplete events. Once we receive this
ConnectionCompleteEvent, we should send a HTTP GET request to the
> page. The response handling sink is set as the "HTTP Response Processing
Stage". Write failures should be handled.
>
>
> HTTP Response Processing Stage :
>
> Overview:
> - Processes downloaded pages.
>
> Implementation:
> - Check for timeouts, and discard it if it had timed out; else, read the
packets.
> - Once the response is completely read, the request entry in the
RequestMap (maintained for timeout handling) should be removed.
> - One issue here is when we are reading lengthy HTML pages, we might
receive half of the page and it might stop after that. So,
> essentially the timeout should be applied between chunks of reception.
> - Where should the errors like "404 Not Found", etc be propogated ???
>
>
> Timeout and Retry Handling Stage :
>
> Overview:
> This is a single threaded stage which enumerates through the RequestMap
and checks for timeouts. The timed out CrawlURIs will be
> retried until retry count exhausts. This stage will be self-timed
periodically using the SEDA ssTimer APIs.
>
> Other Notes:
> This timeout handling is a common stuff between the DNS requests and the
HTTP requests.
>
> Parameters to this stage:
> - DNS timeout value.
> - HTTP timeout value.
> - DNS retry count.
> - HTTP retry count. ( This would be 1 ).
>
>
> One other thing that could be done is that, the events by themselves will
contain information as to which next stage the output has
> to traverse. This will be flexible and no hardcoding is needed.
Especially, in making this non-blocking DNS library an open-source,
> it would come handy. Moreover, many users might not want it to be over
SEDA. So, we will have to give other interfaces as well.
>
> I am presently using the library classes given by dnsjava-1.3.2. (
http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java
> based synchronous implementation of DNS Resolver. I only make use of the
classes which encapsulate the formation of request packets,
> parses response packets and the various ResourceRecord classes. This
library is being used by Java Apache Mail Enterprise Server (
> http://james.apache.org/ ). So, it should be pretty reliable and tested.
Moreover it has support for IPv6, compression and security
> which we can make use of later.
>
> Thanks,
> Reddy
>
>
>
> ----- Original Message -----
> From: Gordon Mohr
> To: archive-crawler@yahoogroups.com
> Cc: Raymie Stata ; wcr-team@...
> Sent: Saturday, March 01, 2003 2:43 AM
> Subject: Re: [archive-crawler] Re: Web crawler work ??
>
>
> Sounds like a reasonable plan.
>
> By "local name server" do you mean something *very* local -- for example,
> a standard nameserver we run on the same machine?
>
> That would seem to offer other benefits -- such as minimizing the modes
> of DNS lookup we have to do and offloading caching to another piece of
> software (at least at first).
>
> - Gordon
>
> ----- Original Message -----
> From: G.B.Reddy
> To: archive-crawler@yahoogroups.com
> Cc: Raymie Stata ; wcr-team@...
> Sent: Thursday, February 27, 2003 8:55 AM
> Subject: Re: [archive-crawler] Re: Web crawler work ??
>
>
> Gordon and Raymie,
>
> Here goes the proposal for the asynchronous DNS lookup API implementation.
>
> We shall implement a minimal resolver which is capable of sending DNS
request packets and processing response packets in an
> asynchrounous nio fashion. This resolver class will contact a local name
server and rely on it to do the actual lookup. The local
> name server will be configured to support recursion and better would be to
use a name server which does lookup asynchronously. (
> SQUID has asynchronous DNS lookup facilities ). Even if the local name
server is not asynchrounous, our java resolver being
> asynchronous will be good enough since our primary goal is that we do not
want any blocking code in our crawler implementation. This
> idea even sounds good considering the fact we would only reinvent the same
wheel if we opt to implement a complete full-fledged
> resolver implementation which complies with the RFC 1035 and 1034. We can
definitely implement this full-fledged resolver but the
> real concern is that this would require a lot of testing and the efforts
to make it rock solid in terms of robustness would be huge.
>
> So, the various jobs that we would have to do to build our Simple
Asynchronous DNS lookup API would be
> -- Request packet formation and reply packet parsing in the exact RFC
format.
> -- Use non-blocking IO APIs and do UDP. (We might not need TCP since
the name server is only in the local network.)
> -- Do canonical name queries and A record queries.
> -- Implement timeouts.
> -- Implement caching based on TTL. ( This may have to be deferred as
pointed by Raymie earlier. )
> -- Integrate with SEDA.
>
> Thanks,
> Reddy
>
> ----- Original Message -----
> From: Gordon Mohr
> To: G.B.Reddy
> Cc: Raymie Stata ; wcr-team@... ;
archive-crawler@yahoogroups.com
> Sent: Saturday, February 22, 2003 5:37 AM
> Subject: [archive-crawler] Re: Web crawler work ??
>
>
> [CC'ing to archive-crawler@yahoogroups.com]
>
> Reddy writes:
>
> > On the first cut do we need to look at implementing an asynchronous DNS
> > lookup mechanism. If we are not, then it is going to be two stages, viz.
> > DNSCacheHandlingStage and ResolvingStage, that can be employed using the
> > blocking DNS lookup calls in Java. The first stage,
DNSCacheHandlingStage,
> > would check if the entry is available in the cache. If available, he
would
> > set the resolved address in the CrawlURI object and enqueue it to the
> > appropriate next stage. If the cache doesn't contain the entry, then he
> > would pass the request to the Resolving stage which would call the
> > InetAddress.getByName blocking method to resolve it. The getByName
result
> > would be set in the CrawlURI object as before and enqueued into the
> > appropriate next stage. In addition to this, the Resolving stage will
> > enqueue another event into the DNSCacheHandlingStage to enable him
update
> > his cache. So, the DNSCacheHandlingStage would be handling two types of
> > events, one is the lookup events and the other is the update cache
events.
> >
> > One problem here is that the InetAddress class does not expose its cache
> > variables to its users. Even we cannot check if the cache has an entry
> > before calling the getByName method. So, we should be disabling the java
> > cache ( using the policy file ) and implementing our own caching
mechanism.
> > ( The DNSCacheHandlingStage would have to additionally do the job of
> > throwing away the expired entries in the cache also.)
> >
> > Let me know your comments on this.
>
> This looks like a good first cut. I'm still working to improve my
> understanding of the best way to use the staged style, mostly by
> looking at their HTTP and HTTP Server (Haboob) code.
>
> It seems that they've tended to use a single Stage object to do
> many different steps/aspects of one process, by switching on the
> type of QueueElement received.
>
> So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv
> accepts events of types....
>
> - httpConnection
> - httpRequest
> - SinkClosedEvent
> - timerEvent
>
> And their seda.sandStorm.lib.http.httpServer accepts events of
> types...
>
> - ATcpInPacket
> - ATcpConnection
> - aSocketErrorEvent
> - SinkDrainedEvent
> - SinkCloggedEvent
> - SinkClosedEvent
> - ATcpListenSuccessEvent
>
> They also use Sinks that are not associated with stages; rather,
> they interface to unstaged components which nonetheless result in
> an eventual event to some supplied answer Sink. See for example
> seda.sandStorm.lib.http.httpConnection.
>
> So perhaps as a matter of grouping related tasks, the same Stage object
> should be re-entered over the course of a lookup, with different
triggering
> events. For example, you might want to reenter a single DNSResolvingStage
> over the course of cache lookup, lookup-initiation, result-receiving (or
> timeout), etc. I'm not sure; use your judgement as to how many stages are
> really needed.
>
> > P.S : We found some openly available async dns client APIs in C
language.
>
> That could be useful as a model. (I doubt we'd want to call out to C
> for this simple step, though -- and if we nailed down a truly async Java
> DNS facility, a lot of open source projects would probably be quite
happy.)
>
> Also: I heard back from Patrick Eaton about SEDA-style async HTTP client
> code... he has a rough implementation for simple usage, and he knows of
> another one at Berkeley which goes deeper into HTTP/1.1 conformance and
> optimal performance. I've asked him to forward whatever additional code
> or details he can.
>
> - Gordon
>
>
>
>
> To unsubscribe from this group, send an email to:
> archive-crawler-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
>
>
> Yahoo! Groups Sponsor
> ADVERTISEMENT
>
>
>
>
> To unsubscribe from this group, send an email to:
> archive-crawler-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
>
>
>
> To unsubscribe from this group, send an email to:
> archive-crawler-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
>
>
> Yahoo! Groups Sponsor
> ADVERTISEMENT
>
>
>
>
> To unsubscribe from this group, send an email to:
> archive-crawler-unsubscribe@yahoogroups.com
>
>
>
> Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
These are good decompositions of the steps involved, and the LGPL dnsjava
library looks very useful for our needs.
My tendency would be to think fewer stages are better -- and when
communicating between our stages, rather than using a shared RequestMap,
include on the requesting event whatever context will be needed to deal
with the response when it arrives.
You can see this technique used in the ostore.network.ADns class I referenced
yesterday -- the user_data object lets any client of the ADns stage recover
the state it needs to deal with ADns's eventual response. (We may still need
the equivalent of a RequestMap to deal with mapping UDP response packets to
the lookups that triggered them -- but that'd then be a map that can be private
to a single stage.)
Nothing in the DNS procedure needs to block -- once you assume asynchronous
UDP replies and an in-RAM response cache -- and thus one thread should be
as good as (in fact better than) N threads.
Similarly, I don't think more than 1 UDP socket will be necessary for
all DNS lookups, because as a practical matter, 1 open UDP socket will
never be "busy" in a way that additional UDP sockets would help.
--
Regarding 404 and other explicit HTTP app errors: these should be recorded
into the CrawlURI, and forwarded to the next processing stage, just as
with successes.
Redirects are a special case we haven't discussed much yet; they would seem
candidates for some sort of expedited fetching. I think, though, such
expedited activity must pass through the full cycle of stages for
proper logging and policy application. (In contrast, retries in the face
of transient errors, which may, at least in some cases, be fed immediately
back to the Fetching or Preprocessing stages.)
--
The issue of timeouts (esp. in HTTP) is thorny. From what I hear about
crawler traps, we need to not only to be able to deal with hangs, but
also various infinite-length trickles.
If any of the HTTP-transactions-in-progress is not making sufficient
progress, according to some set of progress-per-time-unit thresholds,
we have to be ready to give up on it. We should mark it as an error,
but probably retain any partial info we did get for later analysis.
Separating this out into its own stage increases synchronization issues.
If timeout analysis can instead be part of the same single-threaded stage
where HTTP some-progress-made events (from sockets) are handled, I think
it can be handled very efficiently:
- keep all outstanding transactions on a linked-list
- whenever progress is reported (from the socket),
and that progress keeps the transaction above
acceptable thresholds, move that transaction to
the head of the list
- check the tail of the list occasionally to see
if the worst transactions need to be cut
That "occasional" check could mostly occur as part of normal
progress-packet processing; an extremely infrequent timer could
trigger reevaluation in the rare case where all progress on all
sockets has stopped...
- Gordon
----- Original Message -----
From: G.B.Reddy
To: archive-crawler@yahoogroups.com
Cc: wcr-team@...
Sent: Friday, March 07, 2003 8:31 AM
Subject: Re: [archive-crawler] Re: Web crawler work ??
More insight on the DNS stages.
As stated in the design earlier, "DNS Querying Stage", "DNS Response Processing
Stage" and "Timeout and Retry Handling Stage"
access/update the shared RequestMap. So, they need to be synchronized. We were
just thinking whether these need to be three separate
stages or could it be one stage multiplexing each of those incoming event types.
The pros and cons of them are as below.
Single stage benefit / Multi Stage Issue :
- If it is single stage, then synchronization is not done across stages. Even
though, synchronization anyway would be needed
internally in case of single stage, conceptually it looks neater not to
synchronize across stages.
Single stage issue / Multi Stage benefit :
- In case of single stage, the event queue would be one and this fact pulls our
legs when we want to handle overload conditions.
Since there is no clear count of the distinct elements in the queue, it becomes
difficult to analyse and condition the system
accordingly. Whereas, in multi stage, each one having its own queue,
conditioning/prioritizating becomes easy.
I think we would end up going in for multi stage, unless SEDA could take care of
it by itself.
And in case of parallel stages, like the ones in post processing, I think most
often, synchronization across them may be
unavoidable.
Thanks,
Reddy
----- Original Message -----
From: G.B.Reddy
To: archive-crawler@yahoogroups.com
Cc: wcr-team@...
Sent: Thursday, March 06, 2003 11:21 PM
Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Below are the various stages and their design with the issues involved in the
DNS Resolver and HTTP Client implementation.
DNS History/Cache Handling Stage :
Overview:
- Maintains successful lookups in cache.
- Does negative caching.
- Times itself to clean the expired entries based on TTL. (Would use the ssTimer
SEDA APIs to schedule itself periodically)
- This stage would be dummy or could be skipped as of now since we want to do
caching later.
Events:
- Two types of events : DNSCacheLookupEvent, DNSCacheUpdateEvent.
- DNSCacheLookupEvent : If entry is found in cache, the ipaddress is set in the
CrawlURI object and is enqueued into the "Page
Requesting Stage". Else, it is enqueued into the "DNS Querying Stage".
- DNSCacheUpdateEvent : This event is published by the "DNS Response Processing
Stage" every after successful/failed lookup inorder
to update the cache.
Other notes:
- This stage could be single threaded else lot of synchronization might be
needed.
- Resubmitting events on queue full exceptions while enqueuing into this stage's
queue should be handled by the caller by scheduling
it in future.
DNS Querying Stage :
Overview:
- Sends the actual DNS ARecord query packets to the DNS Server. (The response
packets are processed in a later stage)
- Maintains a pool of DatagramSocket objects.
Events:
- SendDNSQueryEvent : This is published by the "DNS History/Cache Handling
Stage" when cache miss happens. The DNS query is formed
and sent out. The response handling sink is set as the "DNS Response Processing
Stage".
Implementation:
- A pool of DatagramSockets of a configurable maximum size is maintained. It
will be filled incrementally. All these datagram
sockets will be registered to the selector maintained by the SEDA internals. It
would be ideal if this pool gets shrunk or expanded
based on the requirement. If it is not shrunk back, then it is an unnecessary
overhead on the selector. The reason behind having a
pool is to restrict the number of ports the selector has to listen upon and also
not to create individual DatagramSocket objects for
every query. Can this logic of bounded pool, be implemented as a Controller in
the SEDA framework (just like the
ThreadPoolController) is an open question.
- When an event comes in, a free datagram socket in the pool will be utilized
for sending the message. If all sockets are engaged,
the incoming event should be postponed to reenter again after a period of time.
- This stage additionally has also to maintain the list of messages sent out,
their IDs and the request timestamps. Let us call this
"RequestMap" for future reference. The ID is the integer, described in the RFC
DNS message format, used to map the
request-responses. The request timestamp will be made use of in query timeout
handling (discussed later).
Parameters to this stage :
- The DNS server hostname/ipaddress. If this is not given, then the
/etc/resolve.conf will be parsed to get the name server (only
the primary would be taken as of now.). As a next step we will have to build a
round-robin way of querying the various name servers
in resolve.conf, inorder to be polite with them.
- If resolve.conf is not present, the local host will be assumed as the name
server.
DNS Response Processing Stage :
Overview:
- Processes DNS responses.
Implementation:
- When the DNS datagram packets are received, the ID field in the header should
be used to match the corresponding request packet.
- Check for timeouts, and discard it if it had timed out; else, set the
ipaddress/canonical name in the CrawlURI object and enqueue
it to the "Page Requesting Stage". In addition, enqueue an event into the "DNS
Cache Handling Stage" for it to update its cache. Do
the same, even on DNS Errors like "Name not found authoritative error".
- The request entry in the RequestMap (maintained for timeout handling) should
be removed. This map, being shared across stages,
should be synchronized.
HTTP Page Requesting Stage :
Overview:
- Connects to host and sends GET requests for pages.
Events:
- Handles two types of events - StartFetchEvent and ConnectionCompleteEvent.
- The StartFetchEvent will make a TCP connect request to the host. While doing
so, we will register the current stage itself to
receive back the ConnectionComplete events. Once we receive this
ConnectionCompleteEvent, we should send a HTTP GET request to the
page. The response handling sink is set as the "HTTP Response Processing Stage".
Write failures should be handled.
HTTP Response Processing Stage :
Overview:
- Processes downloaded pages.
Implementation:
- Check for timeouts, and discard it if it had timed out; else, read the
packets.
- Once the response is completely read, the request entry in the RequestMap
(maintained for timeout handling) should be removed.
- One issue here is when we are reading lengthy HTML pages, we might receive
half of the page and it might stop after that. So,
essentially the timeout should be applied between chunks of reception.
- Where should the errors like "404 Not Found", etc be propogated ???
Timeout and Retry Handling Stage :
Overview:
This is a single threaded stage which enumerates through the RequestMap and
checks for timeouts. The timed out CrawlURIs will be
retried until retry count exhausts. This stage will be self-timed periodically
using the SEDA ssTimer APIs.
Other Notes:
This timeout handling is a common stuff between the DNS requests and the HTTP
requests.
Parameters to this stage:
- DNS timeout value.
- HTTP timeout value.
- DNS retry count.
- HTTP retry count. ( This would be 1 ).
One other thing that could be done is that, the events by themselves will
contain information as to which next stage the output has
to traverse. This will be flexible and no hardcoding is needed. Especially, in
making this non-blocking DNS library an open-source,
it would come handy. Moreover, many users might not want it to be over SEDA. So,
we will have to give other interfaces as well.
I am presently using the library classes given by dnsjava-1.3.2. (
http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java
based synchronous implementation of DNS Resolver. I only make use of the classes
which encapsulate the formation of request packets,
parses response packets and the various ResourceRecord classes. This library is
being used by Java Apache Mail Enterprise Server (
http://james.apache.org/ ). So, it should be pretty reliable and tested.
Moreover it has support for IPv6, compression and security
which we can make use of later.
Thanks,
Reddy
----- Original Message -----
From: Gordon Mohr
To: archive-crawler@yahoogroups.com
Cc: Raymie Stata ; wcr-team@...
Sent: Saturday, March 01, 2003 2:43 AM
Subject: Re: [archive-crawler] Re: Web crawler work ??
Sounds like a reasonable plan.
By "local name server" do you mean something *very* local -- for example,
a standard nameserver we run on the same machine?
That would seem to offer other benefits -- such as minimizing the modes
of DNS lookup we have to do and offloading caching to another piece of
software (at least at first).
- Gordon
----- Original Message -----
From: G.B.Reddy
To: archive-crawler@yahoogroups.com
Cc: Raymie Stata ; wcr-team@...
Sent: Thursday, February 27, 2003 8:55 AM
Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Here goes the proposal for the asynchronous DNS lookup API implementation.
We shall implement a minimal resolver which is capable of sending DNS request
packets and processing response packets in an
asynchrounous nio fashion. This resolver class will contact a local name server
and rely on it to do the actual lookup. The local
name server will be configured to support recursion and better would be to use a
name server which does lookup asynchronously. (
SQUID has asynchronous DNS lookup facilities ). Even if the local name server
is not asynchrounous, our java resolver being
asynchronous will be good enough since our primary goal is that we do not want
any blocking code in our crawler implementation. This
idea even sounds good considering the fact we would only reinvent the same wheel
if we opt to implement a complete full-fledged
resolver implementation which complies with the RFC 1035 and 1034. We can
definitely implement this full-fledged resolver but the
real concern is that this would require a lot of testing and the efforts to make
it rock solid in terms of robustness would be huge.
So, the various jobs that we would have to do to build our Simple Asynchronous
DNS lookup API would be
-- Request packet formation and reply packet parsing in the exact RFC
format.
-- Use non-blocking IO APIs and do UDP. (We might not need TCP since the
name server is only in the local network.)
-- Do canonical name queries and A record queries.
-- Implement timeouts.
-- Implement caching based on TTL. ( This may have to be deferred as pointed
by Raymie earlier. )
-- Integrate with SEDA.
Thanks,
Reddy
----- Original Message -----
From: Gordon Mohr
To: G.B.Reddy
Cc: Raymie Stata ; wcr-team@... ; archive-crawler@yahoogroups.com
Sent: Saturday, February 22, 2003 5:37 AM
Subject: [archive-crawler] Re: Web crawler work ??
[CC'ing to archive-crawler@yahoogroups.com]
Reddy writes:
> On the first cut do we need to look at implementing an asynchronous DNS
> lookup mechanism. If we are not, then it is going to be two stages, viz.
> DNSCacheHandlingStage and ResolvingStage, that can be employed using the
> blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage,
> would check if the entry is available in the cache. If available, he would
> set the resolved address in the CrawlURI object and enqueue it to the
> appropriate next stage. If the cache doesn't contain the entry, then he
> would pass the request to the Resolving stage which would call the
> InetAddress.getByName blocking method to resolve it. The getByName result
> would be set in the CrawlURI object as before and enqueued into the
> appropriate next stage. In addition to this, the Resolving stage will
> enqueue another event into the DNSCacheHandlingStage to enable him update
> his cache. So, the DNSCacheHandlingStage would be handling two types of
> events, one is the lookup events and the other is the update cache events.
>
> One problem here is that the InetAddress class does not expose its cache
> variables to its users. Even we cannot check if the cache has an entry
> before calling the getByName method. So, we should be disabling the java
> cache ( using the policy file ) and implementing our own caching mechanism.
> ( The DNSCacheHandlingStage would have to additionally do the job of
> throwing away the expired entries in the cache also.)
>
> Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my
understanding of the best way to use the staged style, mostly by
looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do
many different steps/aspects of one process, by switching on the
type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv
accepts events of types....
- httpConnection
- httpRequest
- SinkClosedEvent
- timerEvent
And their seda.sandStorm.lib.http.httpServer accepts events of
types...
- ATcpInPacket
- ATcpConnection
- aSocketErrorEvent
- SinkDrainedEvent
- SinkCloggedEvent
- SinkClosedEvent
- ATcpListenSuccessEvent
They also use Sinks that are not associated with stages; rather,
they interface to unstaged components which nonetheless result in
an eventual event to some supplied answer Sink. See for example
seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object
should be re-entered over the course of a lookup, with different triggering
events. For example, you might want to reenter a single DNSResolvingStage
over the course of cache lookup, lookup-initiation, result-receiving (or
timeout), etc. I'm not sure; use your judgement as to how many stages are
really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C
for this simple step, though -- and if we nailed down a truly async Java
DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client
code... he has a rough implementation for simple usage, and he knows of
another one at Berkeley which goes deeper into HTTP/1.1 conformance and
optimal performance. I've asked him to forward whatever additional code
or details he can.
- Gordon
To unsubscribe from this group, send an email to:
archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Yahoo! Groups Sponsor
ADVERTISEMENT
To unsubscribe from this group, send an email to:
archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
To unsubscribe from this group, send an email to:
archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Yahoo! Groups Sponsor
ADVERTISEMENT
To unsubscribe from this group, send an email to:
archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
I added very dumb HTTP fetching toe the Anecdote 'Fetching' stage via the
Apache Commons HTTPClient library soon after my message yesterday.
I also wrote about Anecdote:
> When it runs out of URIs, it just spins, so remember to kill the
> process after a run.
>
> Next steps for this prototype:
> - Fixing the spin condition
This spinning is making me reconsider the use of the external Pumping
stage; maybe that functionality needs to be in the URIChoosing stage,
as only it knows how to pace itself.
So part of the contract of all URIChoosings would be to accept 'start'
and 'stop' directive events, and keep feeding (batches of) new
CrawlURIs to the next stage as fast as the next stage will take them.
- Gordon
Gordon, Igor, Raymie present.
(1) Access to work in progress: start using SourceForge CVS
(Post meeting note: 2 modules now exist there: 'Anecdote',
a staged crawler prototype, and 'Design', for upcoming design
docs.)
(2) Milestones
Next few:
- "Downloader" (grabs static list of URIs): Friday 3/7
- "Crawler" (link extraction): Friday 3/14
- Others: TBD
(3) Configurability & Terminology
We should be clear when we're talking about what the "framework"
enables (and similarly, what minimally is demanded of each
framework-using crawler), and what the "default/best-practices"
crawler will do.
(4) Testing
Unit testing via Ant builds & Ant-driven JUnit tests.
Integration/system testing via each of:
- Synthetic crawl environment: ~4 in house machines running
web servers with variety of material (and ability to create
link structures of infinite depth)
- Open internet background crawling
- Chosen intranet crawls (will IBM let us loose inside?)
(5) Further specification of key areas:
CrawlURI OBJECT:
- accepts flexible key-attribute properties
- some properties persist when URI is re-interred
between visits, some don't: and this is configurable
in a way that individual stages don't have to
know (but may) if what they're storing in the CrawlURI
will persist
- persisted properties may also live on in a nested
"history" property (eg: each persisting causes
full set of current properties to be pushed onto
the top of a "last X tries" stack)
- has 'next' field for chaining CrawlURIs (see below)
FETCHING STAGE:
- expects a CrawlURI object
- CrawlURI should already include
- string URI,
- resolved IP address,
- (optional) cancel flag (prevents fetching; means
expedited pass-through, usu. in error cases)
- any extra headers (cookies, authentication) that must
be included on the request
- to help support HTTP/1.1 pipelining, CrawlURI
will have a 'next' instance variable which can
serve to chain CrawlURIs together in a linked
list. The fetching stage MAY put these together
in a single HTTP/1.1 connection. However, the fetching
stage should only emit unchained CrawlURIs
- the fetching stage may have separate "nextSink"s
for (1) passing along definitively tried CrawlURIs (eg:
got a response, or a serious error), or (2)
transient failures, which MAY be fed back (to
the fetching stage, or a preprocessing stage) for
more immediate retries.
- Post-meeting refinement: I believe the Fetching
stage should be able to add a (configured) user-agent,
if one does not already exist.
Will further work out and document contracts of minimal
(framework) stages/roles, as Java/Javadoc skeletons where
appropriate.
OTHER KEY ISSUES TO VISIT SOON:
- checkpointing & crawler pause/restart/stop issues
- persistence and multi-machine distribution of URI frontier
--
- Gordon
As stated in the design earlier, "DNS Querying Stage", "DNS Response Processing Stage" and "Timeout and Retry Handling Stage" access/update the shared RequestMap. So, they need to be synchronized. We were just thinking whether these need to be three separate stages or could it be one stage multiplexing each of those incoming event types. The pros and cons of them are as below.
Single stage benefit / Multi Stage Issue :
- If it is single stage, then synchronization is not done across stages. Even though, synchronization anyway would be needed internally in case of single stage, conceptually it looks neater not to synchronize across stages.
Single stage issue / Multi Stage benefit :
- In case of single stage, the event queue would be one and this fact pulls our legs when we want to handle overload conditions. Since there is no clear count of the distinct elements in the queue, it becomes difficult to analyse and condition the system accordingly. Whereas, in multi stage, each one having its own queue, conditioning/prioritizating becomes easy.
I think we would end up going in for multi stage, unless SEDA could take care of it by itself.
And in case of parallel stages, like the ones in post processing, I think most often, synchronization across them may be unavoidable.
Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Below are the various stages and their design with the issues involved in the DNS Resolver and HTTP Client implementation.
DNS History/Cache Handling Stage :
Overview: - Maintains successful lookups in cache.
- Does negative caching. - Times itself to clean the expired entries based on TTL. (Would use the ssTimer SEDA APIs to schedule itself periodically)
- This stage would be dummy or could be skipped as of now since we want to do caching later.
Events:
- Two types of events : DNSCacheLookupEvent, DNSCacheUpdateEvent.
- DNSCacheLookupEvent : If entry is found in cache, the ipaddress is set in the CrawlURI object and is enqueued into the "Page Requesting Stage". Else, it is enqueued into the "DNS Querying Stage".
- DNSCacheUpdateEvent : This event is published by the "DNS Response Processing Stage" every after successful/failed lookup inorder to update the cache.
Other notes:
- This stage could be single threaded else lot of synchronization might be needed.
- Resubmitting events on queue full exceptions while enqueuing into this stage’s queue should be handled by the caller by scheduling it in future.
DNS Querying Stage :
Overview: - Sends the actual DNS ARecord query packets to the DNS Server. (The response packets are processed in a later stage)
- Maintains a pool of DatagramSocket objects.
Events:
- SendDNSQueryEvent : This is published by the "DNS History/Cache Handling Stage" when cache miss happens. The DNS query is formed and sent out. The response handling sink is set as the "DNS Response Processing Stage".
Implementation:
- A pool of DatagramSockets of a configurable maximum size is maintained. It will be filled incrementally. All these datagram sockets will be registered to the selector maintained by the SEDA internals. It would be ideal if this pool gets shrunk or expanded based on the requirement. If it is not shrunk back, then it is an unnecessary overhead on the selector. The reason behind having a pool is to restrict the number of ports the selector has to listen upon and also not to create individual DatagramSocket objects for every query. Can this logic of bounded pool, be implemented as a Controller in the SEDA framework (just like the ThreadPoolController) is an open question. - When an event comes in, a free datagram socket in the pool will be utilized for sending the message. If all sockets are engaged, the incoming event should be postponed to reenter again after a period of time.
- This stage additionally has also to maintain the list of messages sent out, their IDs and the request timestamps. Let us call this "RequestMap" for future reference. The ID is the integer, described in the RFC DNS message format, used to map the request-responses. The request timestamp will be made use of in query timeout handling (discussed later).
Parameters to this stage :
- The DNS server hostname/ipaddress. If this is not given, then the /etc/resolve.conf will be parsed to get the name server (only the primary would be taken as of now.). As a next step we will have to build a round-robin way of querying the various name servers in resolve.conf, inorder to be polite with them.
- If resolve.conf is not present, the local host will be assumed as the name server.
DNS Response Processing Stage :
Overview:
- Processes DNS responses.
Implementation:
- When the DNS datagram packets are received, the ID field in the header should be used to match the corresponding request packet.
- Check for timeouts, and discard it if it had timed out; else, set the ipaddress/canonical name in the CrawlURI object and enqueue it to the "Page Requesting Stage". In addition, enqueue an event into the "DNS Cache Handling Stage" for it to update its cache. Do the same, even on DNS Errors like “Name not found authoritative error”.
- The request entry in the RequestMap (maintained for timeout handling) should be removed. This map, being shared across stages, should be synchronized.
HTTP Page Requesting Stage :
Overview:
- Connects to host and sends GET requests for pages.
Events:
- Handles two types of events - StartFetchEvent and ConnectionCompleteEvent.
- The StartFetchEvent will make a TCP connect request to the host. While doing so, we will register the current stage itself to receive back the ConnectionComplete events. Once we receive this ConnectionCompleteEvent, we should send a HTTP GET request to the page. The response handling sink is set as the "HTTP Response Processing Stage". Write failures should be handled.
HTTP Response Processing Stage :
Overview:
- Processes downloaded pages.
Implementation:
- Check for timeouts, and discard it if it had timed out; else, read the packets.
- Once the response is completely read, the request entry in the RequestMap (maintained for timeout handling) should be removed.
- One issue here is when we are reading lengthy HTML pages, we might receive half of the page and it might stop after that. So, essentially the timeout should be applied between chunks of reception.
- Where should the errors like "404 Not Found", etc be propogated ???
Timeout and Retry Handling Stage :
Overview:
This is a single threaded stage which enumerates through the RequestMap and checks for timeouts. The timed out CrawlURIs will be retried until retry count exhausts. This stage will be self-timed periodically using the SEDA ssTimer APIs.
Other Notes:
This timeout handling is a common stuff between the DNS requests and the HTTP requests.
Parameters to this stage:
- DNS timeout value.
- HTTP timeout value.
- DNS retry count.
- HTTP retry count. ( This would be 1 ).
One other thing that could be done is that, the events by themselves will contain information as to which next stage the output has to traverse. This will be flexible and no hardcoding is needed. Especially, in making this non-blocking DNS library an open-source, it would come handy. Moreover, many users might not want it to be over SEDA. So, we will have to give other interfaces as well.
I am presently using the library classes given by dnsjava-1.3.2. ( http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java based synchronous implementation of DNS Resolver. I only make use of the classes which encapsulate the formation of request packets, parses response packets and the various ResourceRecord classes. This library is being used by Java Apache Mail Enterprise Server ( http://james.apache.org/ ). So, it should be pretty reliable and tested. Moreover it has support for IPv6, compression and security which we can make use of later.
Subject: Re: [archive-crawler] Re: Web crawler work ??
Sounds like a reasonable plan.
By "local name server" do you mean something *very* local -- for example, a standard nameserver we run on the same machine?
That would seem to offer other benefits -- such as minimizing the modes of DNS lookup we have to do and offloading caching to another piece of software (at least at first).
- Gordon
----- Original Message ----- From: G.B.Reddy To: archive-crawler@yahoogroups.com Cc: Raymie Stata ; wcr-team@... Sent: Thursday, February 27, 2003 8:55 AM Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Here goes the proposal for the asynchronous DNS lookup API implementation.
We shall implement a minimal resolver which is capable of sending DNS request packets and processing response packets in an asynchrounous nio fashion. This resolver class will contact a local name server and rely on it to do the actual lookup. The local name server will be configured to support recursion and better would be to use a name server which does lookup asynchronously. ( SQUID has asynchronous DNS lookup facilities ). Even if the local name server is not asynchrounous, our java resolver being asynchronous will be good enough since our primary goal is that we do not want any blocking code in our crawler implementation. This idea even sounds good considering the fact we would only reinvent the same wheel if we opt to implement a complete full-fledged resolver implementation which complies with the RFC 1035 and 1034. We can definitely implement this full-fledged resolver but the real concern is that this would require a lot of testing and the efforts to make it rock solid in terms of robustness would be huge.
So, the various jobs that we would have to do to build our Simple Asynchronous DNS lookup API would be -- Request packet formation and reply packet parsing in the exact RFC format. -- Use non-blocking IO APIs and do UDP. (We might not need TCP since the name server is only in the local network.) -- Do canonical name queries and A record queries. -- Implement timeouts. -- Implement caching based on TTL. ( This may have to be deferred as pointed by Raymie earlier. ) -- Integrate with SEDA.
Thanks, Reddy
----- Original Message ----- From: Gordon Mohr To: G.B.Reddy Cc: Raymie Stata ; wcr-team@... ; archive-crawler@yahoogroups.com Sent: Saturday, February 22, 2003 5:37 AM Subject: [archive-crawler] Re: Web crawler work ??
[CC'ing to archive-crawler@yahoogroups.com]
Reddy writes:
> On the first cut do we need to look at implementing an asynchronous DNS > lookup mechanism. If we are not, then it is going to be two stages, viz. > DNSCacheHandlingStage and ResolvingStage, that can be employed using the > blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage, > would check if the entry is available in the cache. If available, he would > set the resolved address in the CrawlURI object and enqueue it to the > appropriate next stage. If the cache doesn't contain the entry, then he > would pass the request to the Resolving stage which would call the > InetAddress.getByName blocking method to resolve it. The getByName result > would be set in the CrawlURI object as before and enqueued into the > appropriate next stage. In addition to this, the Resolving stage will > enqueue another event into the DNSCacheHandlingStage to enable him update > his cache. So, the DNSCacheHandlingStage would be handling two types of > events, one is the lookup events and the other is the update cache events. > > One problem here is that the InetAddress class does not expose its cache > variables to its users. Even we cannot check if the cache has an entry > before calling the getByName method. So, we should be disabling the java > cache ( using the policy file ) and implementing our own caching mechanism. > ( The DNSCacheHandlingStage would have to additionally do the job of > throwing away the expired entries in the cache also.) > > Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my understanding of the best way to use the staged style, mostly by looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do many different steps/aspects of one process, by switching on the type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv accepts events of types....
They also use Sinks that are not associated with stages; rather, they interface to unstaged components which nonetheless result in an eventual event to some supplied answer Sink. See for example seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object should be re-entered over the course of a lookup, with different triggering events. For example, you might want to reenter a single DNSResolvingStage over the course of cache lookup, lookup-initiation, result-receiving (or timeout), etc. I'm not sure; use your judgement as to how many stages are really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C for this simple step, though -- and if we nailed down a truly async Java DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client code... he has a rough implementation for simple usage, and he knows of another one at Berkeley which goes deeper into HTTP/1.1 conformance and optimal performance. I've asked him to forward whatever additional code or details he can.
- Gordon
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Yahoo! Groups Sponsor ADVERTISEMENT
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
I've just checked into Sourceforge CVS the module 'Anecdote', a first
stab at a staged crawler.
Right now it just sets up dummy printing stages, grabs a list of URIs
from a static file, and pumps them through the stages.
Its config file is suggestive of how our staged crawler might be
configured, and the (somewhat nondeterministic) batching of URIs
through stages is illustrative of the event flows in a application
that's decomposed like this.
When it runs out of URIs, it just spins, so remember to kill the
process after a run.
Next steps for this prototype:
- Fixing the spin condition
- HTTP fetching (even if using an off-the-shelf blocking lib)
- Simple delay-based politeness
- A pair of URIStoring/URIChoosing classes which implement
breadth- or depth-crawling of discovered URIs
- Gordon
Patrick Eaton forwarded me a pair of staged HTTP client implementations
which are part of the OceanStore project at Berkeley, and are essentially
what are also online at:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/oceanstore/pond/ostore/apps
(See the libhttp and libhttp2 directories.) I'm still trying to evaluate
its usefulness.
They've also written a simple DNS stage which just wraps standard blocking
DNS. Thus other stages can pretend DNS is asynchronous, while the SEDA
manager throws extra threads at the ADns stage. See:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/oceanstore/pond/ostore/network/AD\
ns.java?rev=1.1&content-type=text/vnd.viewcvs-markup
Another open-source Java HTTP library that could prove useful to us is the
Apache Jakarta Commons HTTPClient project:
http://jakarta.apache.org/commons/httpclient/
It's blocking, heavily synchronized, and doesn't look very memory/allocation
optimized, so it may not meet our ultimate needs, but it is full-featured,
including stuff like cookies, multiple authentication methods, and connection
pipelining...
- Gordon
Below are the various stages and their design with the issues involved in the DNS Resolver and HTTP Client implementation.
DNS History/Cache Handling Stage :
Overview: - Maintains successful lookups in cache.
- Does negative caching. - Times itself to clean the expired entries based on TTL. (Would use the ssTimer SEDA APIs to schedule itself periodically)
- This stage would be dummy or could be skipped as of now since we want to do caching later.
Events:
- Two types of events : DNSCacheLookupEvent, DNSCacheUpdateEvent.
- DNSCacheLookupEvent : If entry is found in cache, the ipaddress is set in the CrawlURI object and is enqueued into the "Page Requesting Stage". Else, it is enqueued into the "DNS Querying Stage".
- DNSCacheUpdateEvent : This event is published by the "DNS Response Processing Stage" every after successful/failed lookup inorder to update the cache.
Other notes:
- This stage could be single threaded else lot of synchronization might be needed.
- Resubmitting events on queue full exceptions while enqueuing into this stage’s queue should be handled by the caller by scheduling it in future.
DNS Querying Stage :
Overview: - Sends the actual DNS ARecord query packets to the DNS Server. (The response packets are processed in a later stage)
- Maintains a pool of DatagramSocket objects.
Events:
- SendDNSQueryEvent : This is published by the "DNS History/Cache Handling Stage" when cache miss happens. The DNS query is formed and sent out. The response handling sink is set as the "DNS Response Processing Stage".
Implementation:
- A pool of DatagramSockets of a configurable maximum size is maintained. It will be filled incrementally. All these datagram sockets will be registered to the selector maintained by the SEDA internals. It would be ideal if this pool gets shrunk or expanded based on the requirement. If it is not shrunk back, then it is an unnecessary overhead on the selector. The reason behind having a pool is to restrict the number of ports the selector has to listen upon and also not to create individual DatagramSocket objects for every query. Can this logic of bounded pool, be implemented as a Controller in the SEDA framework (just like the ThreadPoolController) is an open question. - When an event comes in, a free datagram socket in the pool will be utilized for sending the message. If all sockets are engaged, the incoming event should be postponed to reenter again after a period of time.
- This stage additionally has also to maintain the list of messages sent out, their IDs and the request timestamps. Let us call this "RequestMap" for future reference. The ID is the integer, described in the RFC DNS message format, used to map the request-responses. The request timestamp will be made use of in query timeout handling (discussed later).
Parameters to this stage :
- The DNS server hostname/ipaddress. If this is not given, then the /etc/resolve.conf will be parsed to get the name server (only the primary would be taken as of now.). As a next step we will have to build a round-robin way of querying the various name servers in resolve.conf, inorder to be polite with them.
- If resolve.conf is not present, the local host will be assumed as the name server.
DNS Response Processing Stage :
Overview:
- Processes DNS responses.
Implementation:
- When the DNS datagram packets are received, the ID field in the header should be used to match the corresponding request packet.
- Check for timeouts, and discard it if it had timed out; else, set the ipaddress/canonical name in the CrawlURI object and enqueue it to the "Page Requesting Stage". In addition, enqueue an event into the "DNS Cache Handling Stage" for it to update its cache. Do the same, even on DNS Errors like “Name not found authoritative error”.
- The request entry in the RequestMap (maintained for timeout handling) should be removed. This map, being shared across stages, should be synchronized.
HTTP Page Requesting Stage :
Overview:
- Connects to host and sends GET requests for pages.
Events:
- Handles two types of events - StartFetchEvent and ConnectionCompleteEvent.
- The StartFetchEvent will make a TCP connect request to the host. While doing so, we will register the current stage itself to receive back the ConnectionComplete events. Once we receive this ConnectionCompleteEvent, we should send a HTTP GET request to the page. The response handling sink is set as the "HTTP Response Processing Stage". Write failures should be handled.
HTTP Response Processing Stage :
Overview:
- Processes downloaded pages.
Implementation:
- Check for timeouts, and discard it if it had timed out; else, read the packets.
- Once the response is completely read, the request entry in the RequestMap (maintained for timeout handling) should be removed.
- One issue here is when we are reading lengthy HTML pages, we might receive half of the page and it might stop after that. So, essentially the timeout should be applied between chunks of reception.
- Where should the errors like "404 Not Found", etc be propogated ???
Timeout and Retry Handling Stage :
Overview:
This is a single threaded stage which enumerates through the RequestMap and checks for timeouts. The timed out CrawlURIs will be retried until retry count exhausts. This stage will be self-timed periodically using the SEDA ssTimer APIs.
Other Notes:
This timeout handling is a common stuff between the DNS requests and the HTTP requests.
Parameters to this stage:
- DNS timeout value.
- HTTP timeout value.
- DNS retry count.
- HTTP retry count. ( This would be 1 ).
One other thing that could be done is that, the events by themselves will contain information as to which next stage the output has to traverse. This will be flexible and no hardcoding is needed. Especially, in making this non-blocking DNS library an open-source, it would come handy. Moreover, many users might not want it to be over SEDA. So, we will have to give other interfaces as well.
I am presently using the library classes given by dnsjava-1.3.2. ( http://sourceforge.net/projects/dnsjava/ ). This is an LGPL java based synchronous implementation of DNS Resolver. I only make use of the classes which encapsulate the formation of request packets, parses response packets and the various ResourceRecord classes. This library is being used by Java Apache Mail Enterprise Server ( http://james.apache.org/ ). So, it should be pretty reliable and tested. Moreover it has support for IPv6, compression and security which we can make use of later.
Subject: Re: [archive-crawler] Re: Web crawler work ??
Sounds like a reasonable plan.
By "local name server" do you mean something *very* local -- for example, a standard nameserver we run on the same machine?
That would seem to offer other benefits -- such as minimizing the modes of DNS lookup we have to do and offloading caching to another piece of software (at least at first).
- Gordon
----- Original Message ----- From: G.B.Reddy To: archive-crawler@yahoogroups.com Cc: Raymie Stata ; wcr-team@... Sent: Thursday, February 27, 2003 8:55 AM Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Here goes the proposal for the asynchronous DNS lookup API implementation.
We shall implement a minimal resolver which is capable of sending DNS request packets and processing response packets in an asynchrounous nio fashion. This resolver class will contact a local name server and rely on it to do the actual lookup. The local name server will be configured to support recursion and better would be to use a name server which does lookup asynchronously. ( SQUID has asynchronous DNS lookup facilities ). Even if the local name server is not asynchrounous, our java resolver being asynchronous will be good enough since our primary goal is that we do not want any blocking code in our crawler implementation. This idea even sounds good considering the fact we would only reinvent the same wheel if we opt to implement a complete full-fledged resolver implementation which complies with the RFC 1035 and 1034. We can definitely implement this full-fledged resolver but the real concern is that this would require a lot of testing and the efforts to make it rock solid in terms of robustness would be huge.
So, the various jobs that we would have to do to build our Simple Asynchronous DNS lookup API would be -- Request packet formation and reply packet parsing in the exact RFC format. -- Use non-blocking IO APIs and do UDP. (We might not need TCP since the name server is only in the local network.) -- Do canonical name queries and A record queries. -- Implement timeouts. -- Implement caching based on TTL. ( This may have to be deferred as pointed by Raymie earlier. ) -- Integrate with SEDA.
Thanks, Reddy
----- Original Message ----- From: Gordon Mohr To: G.B.Reddy Cc: Raymie Stata ; wcr-team@... ; archive-crawler@yahoogroups.com Sent: Saturday, February 22, 2003 5:37 AM Subject: [archive-crawler] Re: Web crawler work ??
[CC'ing to archive-crawler@yahoogroups.com]
Reddy writes:
> On the first cut do we need to look at implementing an asynchronous DNS > lookup mechanism. If we are not, then it is going to be two stages, viz. > DNSCacheHandlingStage and ResolvingStage, that can be employed using the > blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage, > would check if the entry is available in the cache. If available, he would > set the resolved address in the CrawlURI object and enqueue it to the > appropriate next stage. If the cache doesn't contain the entry, then he > would pass the request to the Resolving stage which would call the > InetAddress.getByName blocking method to resolve it. The getByName result > would be set in the CrawlURI object as before and enqueued into the > appropriate next stage. In addition to this, the Resolving stage will > enqueue another event into the DNSCacheHandlingStage to enable him update > his cache. So, the DNSCacheHandlingStage would be handling two types of > events, one is the lookup events and the other is the update cache events. > > One problem here is that the InetAddress class does not expose its cache > variables to its users. Even we cannot check if the cache has an entry > before calling the getByName method. So, we should be disabling the java > cache ( using the policy file ) and implementing our own caching mechanism. > ( The DNSCacheHandlingStage would have to additionally do the job of > throwing away the expired entries in the cache also.) > > Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my understanding of the best way to use the staged style, mostly by looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do many different steps/aspects of one process, by switching on the type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv accepts events of types....
They also use Sinks that are not associated with stages; rather, they interface to unstaged components which nonetheless result in an eventual event to some supplied answer Sink. See for example seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object should be re-entered over the course of a lookup, with different triggering events. For example, you might want to reenter a single DNSResolvingStage over the course of cache lookup, lookup-initiation, result-receiving (or timeout), etc. I'm not sure; use your judgement as to how many stages are really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C for this simple step, though -- and if we nailed down a truly async Java DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client code... he has a rough implementation for simple usage, and he knows of another one at Berkeley which goes deeper into HTTP/1.1 conformance and optimal performance. I've asked him to forward whatever additional code or details he can.
- Gordon
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Yahoo! Groups Sponsor ADVERTISEMENT
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Driven by our meeting with Raymie last Thursday, and refined by further
analysis, here are some notes on our design directions.
= STAGED CRAWLER DESIGN NOTES =
STAGE 0: "Pumping"
Overview:
- Exists to drive the system into action
- Modelled as stage so that external SEDA flow-maximizing
managers can monitor its output, allocate it more or
less cycles as necessary to optimize system.
Event flow:
- Accepts events which START, CONTINUE, or STOP action
- Each START event flips stage to "on" state, and
further enqueues one DO event to next stage, and
one CONTINUE event to self.
- Each CONTINUE event while "on" enqueues one DO event
to next stage. While "off", CONTINUEs are ignored.
- Each STOP event flips stage to "off" state.
- When next stage gives SinkFullException, a CONTINUE
is scheduled for some configurable delay in the future.
- When next stage gives SinkClosedException, stage
flips to "off" status.
Implementation:
- Initially a SimplePump class. Further refinements TBD.
STAGE 1: "URIChoosing"
Overview:
- Exists to emit candidate URIs to further processing.
- Output should reflect Crawler's ideal sense of URI
prioritization -- with politeness typically delegated
to later stage(s).
- May be one facet of larger URIManager entity, which
also includes "URIStoring" stage (see below).
Event flow:
- Accepts DO events, which are a signal to identify
some finite number of candidate URIs and pass to
next stage.
- Emits CrawlURI objects, which implement QueueElementIF,
to next stage.
Implementation(s):
- Initially: StaticFrontier class, which supplies
URIs one at a time from a static disk file. CrawlURI
only includes URI string.
- Then: Breadth- and Depth-first crawls from finite
number of seeds.
- Then: More sophisticated policies using persistence
and distribution across machines, TBD.
STAGE 2: "Preprocessing"
Overview:
- Exists to apply configurable politeness and filtering
to candidate URIs before passing to next stage.
- Will eventually perform DNS, to allow IP-based (rather
than hostname-based) politeness rules.
- Could potentially be chained.
Event flow:
- Accepts CrawlURI objects as events.
- Examine, process, markup, and where appropriate
hold CrawlURI objects in internal waiting queues
to enforce politeness rules.
- May mark CrawlURIs as cancelled to disable further
processing by subsequent stages ("preconditioned
execution"), or mark them and directly forward them
to an explicit "cancelled stage" (such as the final
"URIStoring" stage).
- Feeds changed CrawlURI object to next stage.
Implementation(s):
- Initially, a noop stage which prints the URI and
passes it along.
- Then, a stage which does basic hostname-based politeness.
- Then, a stage which does DNS and ip-based politeness.
- Then, a stage which accomodates priority fetching
(such as robots.txt or closely-associated resources.)
- Then, a stage which respects robots.txt
- Then, as TBD.
STAGE 3: "Fetching"
Overview:
- Exists to conduct all network-fetching operations.
Event flow:
- Accepts CrawlURI objects, possibly as marked up
with resolved IP addresses, past fetch/error histories,
cookies/authentication info, etc.
- Makes a finite number of attempts to fetch resource,
probably just one, and then marks up CrawlURI with
results.
- Forwards CrawlURI to next stage.
Implementation(s):
- Initially, blocking HTTP fetch using off-the-shelf
(JRE or 3rd-party) utility library. Retrieved resource
fully in memory, discard oversized fetches.
- Then, non-blocking HTTP fetch.
- Then, disk-assist for large items.
- Then, greater capabilities TBD.
STAGE 4: "Postprocessing"
Overview:
- Exists to analyze results, primarily in the ways the
crawl needs to proceed.
- Could potentially be chained.
Event flow:
- Accepts CrawlURI objects, as marked up by fetch
results.
- Scans and processes in arbitrary ways, marking
up passed-in CrawlURI and potentially creating
new CrawlURIs.
- Forwards passed-in CrawlURI and others to next
stage.
Implementation(s):
- Initially, a noop stage which prints the URI and
passes it along.
- Then, a simple ARC-writer.
- Then, a simple link-extractor.
- Then, more sophisticated policies adapting to
new file/link types and multi-machine distribution
architectures, TBD.
STAGE 5: "URIStoring"
Overview:
- Exists to persist results of one CrawlURI processing
cycle, and feed new candidates back to URIChoosing
stage.
Event flow:
- Accepts CrawlURI objects, which may have finished
a processing cycle or be brand new.
- Performs necessary persistence and possibly
prioritization work; affects shared data structures
with URIChoosing stage in a thread-safe way, but
does not enqueue events to URIChoosing.
- Output events, if any, TBD.
Implementation(s):
- Initially, a logging stage which prints the URI with
any processing results.
- Then, a stage which arranges for URIs as necessary
to be available to URIChoosing.
- Then, a stage with proper persistence and synchronization
with URIChoosing to support checkpointing.
- Then, support for more sophisticated scheduling policies
and multi-machine distribution.
--
NOTES ON THE SYSTEM AS A WHOLE
- If you stop feeding stages 1-5 "DO" events, crawler
will eventually come to rest, within some bounded time.
- Although the "Pumping" stage will push as many DOs in
as it can, a properly-configured external queue-monitoring
SEDA manager will deprive it of the cycles it needs
to get too far ahead of the rest of the system.
- URIChoosing and URIStoring can maintain an "in-process" list
of CrawlURIs (those released to the other stages that
have not been returned), remembering their state before they
were passed to the Preprocessing stage. Thus a checkpoint at
any time should be synthesizable by capturing the state of
URIChoosing+URIStoring, plus the "rollback" version of
pending CrawlURIs.
- There is likely an in-memory Host/IP database, available
in the Preprocessing stage, that can rapidly decorate CrawlURIs
with the latest knowledge about their host/IP.
--
THE CrawlURI CLASS
Over time will come to contain all of the following:
- URI string
- a KnownHost
- IP (as resolved at certain times)
- performance history for that IP
- robots.txt
- server-software
- cookies
- authentication info
- fetching history (including result codes/summary values)
- processing history
- labelling based on originating context (eg role in page, encountered URIs)
Needs flexible, efficient keyed storage of arbitrary attributes; clear
conventions for use by stages; allocation-and-GC-efficient string facilities;
clear conventions for persistence/reconstruction by URIManager.
--
- Gordon
Subject: Re: [archive-crawler] Re: Web crawler work ??
Sounds like a reasonable plan.
By "local name server" do you mean something *very* local -- for example, a standard nameserver we run on the same machine?
That would seem to offer other benefits -- such as minimizing the modes of DNS lookup we have to do and offloading caching to another piece of software (at least at first).
- Gordon
----- Original Message ----- From: G.B.Reddy To: archive-crawler@yahoogroups.com Cc: Raymie Stata ; wcr-team@... Sent: Thursday, February 27, 2003 8:55 AM Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Here goes the proposal for the asynchronous DNS lookup API implementation.
We shall implement a minimal resolver which is capable of sending DNS request packets and processing response packets in an asynchrounous nio fashion. This resolver class will contact a local name server and rely on it to do the actual lookup. The local name server will be configured to support recursion and better would be to use a name server which does lookup asynchronously. ( SQUID has asynchronous DNS lookup facilities ). Even if the local name server is not asynchrounous, our java resolver being asynchronous will be good enough since our primary goal is that we do not want any blocking code in our crawler implementation. This idea even sounds good considering the fact we would only reinvent the same wheel if we opt to implement a complete full-fledged resolver implementation which complies with the RFC 1035 and 1034. We can definitely implement this full-fledged resolver but the real concern is that this would require a lot of testing and the efforts to make it rock solid in terms of robustness would be huge.
So, the various jobs that we would have to do to build our Simple Asynchronous DNS lookup API would be -- Request packet formation and reply packet parsing in the exact RFC format. -- Use non-blocking IO APIs and do UDP. (We might not need TCP since the name server is only in the local network.) -- Do canonical name queries and A record queries. -- Implement timeouts. -- Implement caching based on TTL. ( This may have to be deferred as pointed by Raymie earlier. ) -- Integrate with SEDA.
Thanks, Reddy
----- Original Message ----- From: Gordon Mohr To: G.B.Reddy Cc: Raymie Stata ; wcr-team@... ; archive-crawler@yahoogroups.com Sent: Saturday, February 22, 2003 5:37 AM Subject: [archive-crawler] Re: Web crawler work ??
[CC'ing to archive-crawler@yahoogroups.com]
Reddy writes:
> On the first cut do we need to look at implementing an asynchronous DNS > lookup mechanism. If we are not, then it is going to be two stages, viz. > DNSCacheHandlingStage and ResolvingStage, that can be employed using the > blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage, > would check if the entry is available in the cache. If available, he would > set the resolved address in the CrawlURI object and enqueue it to the > appropriate next stage. If the cache doesn't contain the entry, then he > would pass the request to the Resolving stage which would call the > InetAddress.getByName blocking method to resolve it. The getByName result > would be set in the CrawlURI object as before and enqueued into the > appropriate next stage. In addition to this, the Resolving stage will > enqueue another event into the DNSCacheHandlingStage to enable him update > his cache. So, the DNSCacheHandlingStage would be handling two types of > events, one is the lookup events and the other is the update cache events. > > One problem here is that the InetAddress class does not expose its cache > variables to its users. Even we cannot check if the cache has an entry > before calling the getByName method. So, we should be disabling the java > cache ( using the policy file ) and implementing our own caching mechanism. > ( The DNSCacheHandlingStage would have to additionally do the job of > throwing away the expired entries in the cache also.) > > Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my understanding of the best way to use the staged style, mostly by looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do many different steps/aspects of one process, by switching on the type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv accepts events of types....
They also use Sinks that are not associated with stages; rather, they interface to unstaged components which nonetheless result in an eventual event to some supplied answer Sink. See for example seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object should be re-entered over the course of a lookup, with different triggering events. For example, you might want to reenter a single DNSResolvingStage over the course of cache lookup, lookup-initiation, result-receiving (or timeout), etc. I'm not sure; use your judgement as to how many stages are really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C for this simple step, though -- and if we nailed down a truly async Java DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client code... he has a rough implementation for simple usage, and he knows of another one at Berkeley which goes deeper into HTTP/1.1 conformance and optimal performance. I've asked him to forward whatever additional code or details he can.
- Gordon
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Yahoo! Groups Sponsor ADVERTISEMENT
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
Sounds like a reasonable plan.
By "local name server" do you mean something *very* local -- for example,
a standard nameserver we run on the same machine?
That would seem to offer other benefits -- such as minimizing the modes
of DNS lookup we have to do and offloading caching to another piece of
software (at least at first).
- Gordon
----- Original Message -----
From: G.B.Reddy
To: archive-crawler@yahoogroups.com
Cc: Raymie Stata ; wcr-team@...
Sent: Thursday, February 27, 2003 8:55 AM
Subject: Re: [archive-crawler] Re: Web crawler work ??
Gordon and Raymie,
Here goes the proposal for the asynchronous DNS lookup API implementation.
We shall implement a minimal resolver which is capable of sending DNS request
packets and processing response packets in an
asynchrounous nio fashion. This resolver class will contact a local name server
and rely on it to do the actual lookup. The local
name server will be configured to support recursion and better would be to use a
name server which does lookup asynchronously. (
SQUID has asynchronous DNS lookup facilities ). Even if the local name server
is not asynchrounous, our java resolver being
asynchronous will be good enough since our primary goal is that we do not want
any blocking code in our crawler implementation. This
idea even sounds good considering the fact we would only reinvent the same wheel
if we opt to implement a complete full-fledged
resolver implementation which complies with the RFC 1035 and 1034. We can
definitely implement this full-fledged resolver but the
real concern is that this would require a lot of testing and the efforts to make
it rock solid in terms of robustness would be huge.
So, the various jobs that we would have to do to build our Simple Asynchronous
DNS lookup API would be
-- Request packet formation and reply packet parsing in the exact RFC
format.
-- Use non-blocking IO APIs and do UDP. (We might not need TCP since the
name server is only in the local network.)
-- Do canonical name queries and A record queries.
-- Implement timeouts.
-- Implement caching based on TTL. ( This may have to be deferred as pointed
by Raymie earlier. )
-- Integrate with SEDA.
Thanks,
Reddy
----- Original Message -----
From: Gordon Mohr
To: G.B.Reddy
Cc: Raymie Stata ; wcr-team@... ; archive-crawler@yahoogroups.com
Sent: Saturday, February 22, 2003 5:37 AM
Subject: [archive-crawler] Re: Web crawler work ??
[CC'ing to archive-crawler@yahoogroups.com]
Reddy writes:
> On the first cut do we need to look at implementing an asynchronous DNS
> lookup mechanism. If we are not, then it is going to be two stages, viz.
> DNSCacheHandlingStage and ResolvingStage, that can be employed using the
> blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage,
> would check if the entry is available in the cache. If available, he would
> set the resolved address in the CrawlURI object and enqueue it to the
> appropriate next stage. If the cache doesn't contain the entry, then he
> would pass the request to the Resolving stage which would call the
> InetAddress.getByName blocking method to resolve it. The getByName result
> would be set in the CrawlURI object as before and enqueued into the
> appropriate next stage. In addition to this, the Resolving stage will
> enqueue another event into the DNSCacheHandlingStage to enable him update
> his cache. So, the DNSCacheHandlingStage would be handling two types of
> events, one is the lookup events and the other is the update cache events.
>
> One problem here is that the InetAddress class does not expose its cache
> variables to its users. Even we cannot check if the cache has an entry
> before calling the getByName method. So, we should be disabling the java
> cache ( using the policy file ) and implementing our own caching mechanism.
> ( The DNSCacheHandlingStage would have to additionally do the job of
> throwing away the expired entries in the cache also.)
>
> Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my
understanding of the best way to use the staged style, mostly by
looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do
many different steps/aspects of one process, by switching on the
type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv
accepts events of types....
- httpConnection
- httpRequest
- SinkClosedEvent
- timerEvent
And their seda.sandStorm.lib.http.httpServer accepts events of
types...
- ATcpInPacket
- ATcpConnection
- aSocketErrorEvent
- SinkDrainedEvent
- SinkCloggedEvent
- SinkClosedEvent
- ATcpListenSuccessEvent
They also use Sinks that are not associated with stages; rather,
they interface to unstaged components which nonetheless result in
an eventual event to some supplied answer Sink. See for example
seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object
should be re-entered over the course of a lookup, with different triggering
events. For example, you might want to reenter a single DNSResolvingStage
over the course of cache lookup, lookup-initiation, result-receiving (or
timeout), etc. I'm not sure; use your judgement as to how many stages are
really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C
for this simple step, though -- and if we nailed down a truly async Java
DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client
code... he has a rough implementation for simple usage, and he knows of
another one at Berkeley which goes deeper into HTTP/1.1 conformance and
optimal performance. I've asked him to forward whatever additional code
or details he can.
- Gordon
To unsubscribe from this group, send an email to:
archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
Yahoo! Groups Sponsor
ADVERTISEMENT
To unsubscribe from this group, send an email to:
archive-crawler-unsubscribe@yahoogroups.com
Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
At our kickoff engineering review meeting last friday, most discussion
centered around understanding and clarifying the requirements document.
Key areas highlighted for expansion or contraction were:
--
RE: 2.6 "Zero" Administration / ease of operation
A desire for a web-based admin interface was expressed. Brewster
suggested this would be an ideal way for remote clients of Archive
crawling services to specify, start, and check the progress of
crawls. Such an interface can even be helpful for local control
of an otherwise "headless" (UI-free) process.
--
RE: 3.1.1.5 Document processing
The handling of duplicates should be done in a way that is amenable
to use by standard post-crawl tools.
Outputting crawl info in the ARC format is of paramount importance;
we don't need to implement other output formats ourselves, and we
shouldn't encourage format proliferation (even if there are clean
hooks for the determined to insert other formats). We should work
to move the ARC format towards being a widely-recognized standard,
improving its specification (and possibly extending it) as necessary
for this goal.
--
RE: 3.2 high bandwidth
It was suggested that a goal of 10Mbps per machine is enough,
rather than 15-30Mbps, so that 10 machines could saturate a
100Mbps link. Further, a crawler "farm" size of 10-20 machines,
rather than 5-10, should be our target.
UPDATE: Reddy's Java IO throughput tests actually showed 2-6
mega*bytes* per second, not mega*bits* per second, so we have
in fact seen (in simple tests) throughput of 16-48Mbps. So
a target of 15-30Mbps in a system where other bottlenecks
have been removed may still be reasonable.
--
RE: 3.5 Portability
All of the initial users of the Crawler are content with using
Linux machines -- so high performance on Windows is not a
requirement.
(With good Windows Java environments, we may get suitable Windows
performance anyway, and this could be revisited if Microsoft
funding were to materialize, but for now, only high performance
on Linux is necessary.)
--
RE: 3.6 Ease of Operation
It was emphasized that real-world crawls require tweaking,
babysitting, and adjustment -- and thus that we need to set
expectations reasonably, not promising a "fire-and-forget"
capability. Also, for multi-machine crawls, a relatively
sophisticated operator will always be necessary.
Logging should support long-term confidence in the output
of crawling activity -- providing an audit trail that begins
to take into account the "provenance" concerns of careful
archivists. However, such separate logs should never be necessary
to usefully analyze crawl results: the ARCs should stand alone
for that purpose, as self-contained crawl records.
--
RE: 3.7 Open Licensing
The "Lesser (or Library) GPL" (LGPL) meets our needs for the
forseeable future, encouraging wide usage and a large contribution
community.
--
- Gordon
Here goes the proposal for the asynchronous DNS lookup API implementation.
We shall implement a minimal resolver which is capable of sending DNS request packets and processing response packets in an asynchrounous nio fashion. This resolver class will contact a local name server and rely on it to do the actual lookup. The local name server will be configured to support recursion and better would be to use a name server which does lookup asynchronously. ( SQUID has asynchronous DNS lookup facilities ). Even if the local name server is not asynchrounous, our java resolver being asynchronous will be good enough since our primary goal is that we do not want any blocking code in our crawler implementation. This idea even sounds good considering the fact we would only reinvent the same wheel if we opt to implement a complete full-fledged resolver implementation which complies with the RFC 1035 and 1034. We can definitely implement this full-fledged resolver but the real concern is that this would require a lot of testing and the efforts to make it rock solid in terms of robustness would be huge.
So, the various jobs that we would have to do to build our Simple Asynchronous DNS lookup API would be
-- Request packet formation and reply packet parsing in the exact RFC format.
-- Use non-blocking IO APIs and do UDP. (We might not need TCP since the name server is only in the local network.)
-- Do canonical name queries and A record queries.
-- Implement timeouts.
-- Implement caching based on TTL. ( This may have to be deferred as pointed by Raymie earlier. )
> On the first cut do we need to look at implementing an asynchronous DNS > lookup mechanism. If we are not, then it is going to be two stages, viz. > DNSCacheHandlingStage and ResolvingStage, that can be employed using the > blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage, > would check if the entry is available in the cache. If available, he would > set the resolved address in the CrawlURI object and enqueue it to the > appropriate next stage. If the cache doesn't contain the entry, then he > would pass the request to the Resolving stage which would call the > InetAddress.getByName blocking method to resolve it. The getByName result > would be set in the CrawlURI object as before and enqueued into the > appropriate next stage. In addition to this, the Resolving stage will > enqueue another event into the DNSCacheHandlingStage to enable him update > his cache. So, the DNSCacheHandlingStage would be handling two types of > events, one is the lookup events and the other is the update cache events. > > One problem here is that the InetAddress class does not expose its cache > variables to its users. Even we cannot check if the cache has an entry > before calling the getByName method. So, we should be disabling the java > cache ( using the policy file ) and implementing our own caching mechanism. > ( The DNSCacheHandlingStage would have to additionally do the job of > throwing away the expired entries in the cache also.) > > Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my understanding of the best way to use the staged style, mostly by looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do many different steps/aspects of one process, by switching on the type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv accepts events of types....
They also use Sinks that are not associated with stages; rather, they interface to unstaged components which nonetheless result in an eventual event to some supplied answer Sink. See for example seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object should be re-entered over the course of a lookup, with different triggering events. For example, you might want to reenter a single DNSResolvingStage over the course of cache lookup, lookup-initiation, result-receiving (or timeout), etc. I'm not sure; use your judgement as to how many stages are really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C for this simple step, though -- and if we nailed down a truly async Java DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client code... he has a rough implementation for simple usage, and he knows of another one at Berkeley which goes deeper into HTTP/1.1 conformance and optimal performance. I've asked him to forward whatever additional code or details he can.
- Gordon
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
[CC'ing to archive-crawler@yahoogroups.com]
Reddy writes:
> On the first cut do we need to look at implementing an asynchronous DNS
> lookup mechanism. If we are not, then it is going to be two stages, viz.
> DNSCacheHandlingStage and ResolvingStage, that can be employed using the
> blocking DNS lookup calls in Java. The first stage, DNSCacheHandlingStage,
> would check if the entry is available in the cache. If available, he would
> set the resolved address in the CrawlURI object and enqueue it to the
> appropriate next stage. If the cache doesn't contain the entry, then he
> would pass the request to the Resolving stage which would call the
> InetAddress.getByName blocking method to resolve it. The getByName result
> would be set in the CrawlURI object as before and enqueued into the
> appropriate next stage. In addition to this, the Resolving stage will
> enqueue another event into the DNSCacheHandlingStage to enable him update
> his cache. So, the DNSCacheHandlingStage would be handling two types of
> events, one is the lookup events and the other is the update cache events.
>
> One problem here is that the InetAddress class does not expose its cache
> variables to its users. Even we cannot check if the cache has an entry
> before calling the getByName method. So, we should be disabling the java
> cache ( using the policy file ) and implementing our own caching mechanism.
> ( The DNSCacheHandlingStage would have to additionally do the job of
> throwing away the expired entries in the cache also.)
>
> Let me know your comments on this.
This looks like a good first cut. I'm still working to improve my
understanding of the best way to use the staged style, mostly by
looking at their HTTP and HTTP Server (Haboob) code.
It seems that they've tended to use a single Stage object to do
many different steps/aspects of one process, by switching on the
type of QueueElement received.
So for example their seda.sandStorm.seda.apps.Haboob.http.HttpRecv
accepts events of types....
- httpConnection
- httpRequest
- SinkClosedEvent
- timerEvent
And their seda.sandStorm.lib.http.httpServer accepts events of
types...
- ATcpInPacket
- ATcpConnection
- aSocketErrorEvent
- SinkDrainedEvent
- SinkCloggedEvent
- SinkClosedEvent
- ATcpListenSuccessEvent
They also use Sinks that are not associated with stages; rather,
they interface to unstaged components which nonetheless result in
an eventual event to some supplied answer Sink. See for example
seda.sandStorm.lib.http.httpConnection.
So perhaps as a matter of grouping related tasks, the same Stage object
should be re-entered over the course of a lookup, with different triggering
events. For example, you might want to reenter a single DNSResolvingStage
over the course of cache lookup, lookup-initiation, result-receiving (or
timeout), etc. I'm not sure; use your judgement as to how many stages are
really needed.
> P.S : We found some openly available async dns client APIs in C language.
That could be useful as a model. (I doubt we'd want to call out to C
for this simple step, though -- and if we nailed down a truly async Java
DNS facility, a lot of open source projects would probably be quite happy.)
Also: I heard back from Patrick Eaton about SEDA-style async HTTP client
code... he has a rough implementation for simple usage, and he knows of
another one at Berkeley which goes deeper into HTTP/1.1 conformance and
optimal performance. I've asked him to forward whatever additional code
or details he can.
- Gordon
got it. all cleared up today at the meeting, I think. good start!
-brewster
At 10:11 AM 2/21/2003 -0800, Raymie Stata wrote:
>>> Maybe I am missing something. What is the point of writing
>>> something that wont help us do something big?
>
>I think you are missing something, we WILL be writing something that supports
really big crawling.
>
>Gordon correctly points out that the plan is to get there in increments, but I
imagine the increments to be measured in weeks, not quarters.
>
>Bottom line: we'll be doing large scale crawling soon enough, but even then
we'll have plenty to do in front of us.
I don't think we can build the best mega-scale crawler until after
we've built a really good, modular, efficient small-scale crawler.
That's how the existing big crawlers reached their current state,
and while we can leverage their wisdom a lot, without large swaths
of their source code we've got to take an incremental path.
If you look over the requirements, they're very ambitious -- with
the eventual software surpassing anything currently available to
us (or the public), and probably exceeding in many respects even
the capabilities of the state-of-the-art private, proprietary
crawlers.
- Gordon
----- Original Message -----
From: "Brewster Kahle" <brewster@...>
To: "Gordon Mohr" <gojomo@...>
Cc: <archive-crawler@yahoogroups.com>
Sent: Friday, February 21, 2003 7:53 AM
Subject: Re: Crawler engineering review kickoff - 10am Friday
>
>
> Maybe I am missing something. what is the point of writing something that
wont help us do something big? we have the
mercator crawler for doing little things.
>
> is it a way to practice coding java?
>
> -brewster
>
>
>
> At 11:15 PM 2/20/2003 -0800, Gordon Mohr wrote:
> >[cc'd to the archive-crawler@yahoogroups.com discussion list]
> >
> >These are all important matters to address -- and for most of these issues,
> >I think there will be three overlapping answers:
> >
> > (1) what range of choices are enabled by the design;
> > (2) what choice we make in a first, simple, proof-of-design
> > crawler;
> >...and ultimately...
> > (3) what choice we make in the eventual high-performance,
> > multi-machine, whole-web, max-any-downlink crawler
> >
> >Considering a few of the specifics:
> >> * how to prioritize the urls to crawl. maybe we could list the
different ways current crawlers work as a guideline for
some
> >of the options.
> >> we know xyleme and alexa and mercator.
> >> gordon, maybe you could write up a short summary as a way to
understand the options.
> >
> >Common choices have included:
> >
> > - a breadth-first traversal of the seed URI set and discovered URIs
> > - a depth-first traversal (within politeness and site-saturation
constraints)
> > - descending estimated value of unvisited URIs, where estimated value is...
> > - number of inlinks
> > - classic Google PageRank (where inlinks from high-inlink pages are worth
more)
> > - apparent popularity by user visits (eg Alexa traffic data)
> > - weighted by URI contents or content-analysis of the inlink sources
> >
> >Into these can be mixed other measures of a page's importance or
> >implied volativility -- to influence how often something is recrawled.
> >
> >Because of the wide range of anticipated users, the crawler design must
> >be capable of having any of these policies -- or weighted combinations of
> >these policies -- plugged in, either by configuration file or code
> >modules. ("Answer #1")
> >
> >For our first instantiation, simpleminded swappable breadth-first and
> >depth-first policies will suffice for verifying the design and initial
> >testing. ("Answer #2")
> >
> >For the eventual high-performance, mega-scale crawler, a more
> >sophisticated ranking for ensuring "more important" pages are
> >fetched before (or more often than) other pages will eventually
> >be devised. Some research indicates breadth-first is pretty good
> >at approximating the (much harder to compute) inlink and PageRank
> >approaches. We'll have to tinker over time. ("Answer #3")
> >
> >> * how tasks will be broken up on different machines. for instance:
> >> 1. different machines for different stages of processing, or
> >> 2. a host does all stages, but urls are broken up onto different
machines some way
> >
> >The second approach -- distribution to relatively full-fledged hosts
> >by URI ranges -- seems to dominate at first (it's straightforward), but
> >then at really large scales there may be opportunities for specialized
> >machines to take over tough subproblems.
> >
> >A nice part of the stages-and-queues model is that every stage
> >transition, through a queue, can potentially travel to another machine
> >in a clean fashion. So breaking up the process "horizontally" (a URI
> >lifecycle per machine) or "vertically" (a URI lifecycle spans many
> >machines) or a mixture of the two are all possible.
> >
> >> * will we use an RAM based system or disk based (for the url list)
> >> (xyleme and mercator used RAM, (xyleme said this was a big
limitation on their system)
> >> alexa does not)
> >> I dont know if you can buy cost-effective large ram machines.
> >> talking with Jad at alexa, he said 2GB is cheap otherwise very
expensive (this would be
> >> based on the HP line).
> >
> >Again, swappable strategies will be enabled, and I think our choice will
> >change over time, starting with a simple RAM-based approach to get
> >the crawler testable for small crawls, transitioning to a disk-based
> >system when we get to web-wide crawls.
> >
> >> * how can remote coordination be done between crawlers that are
geographically far apart
> >
> >This is an interesting question, but right now seems pretty far down
> >the priority scale. I recall Paula at Alexa saying they were capable of
> >doing this -- but didn't really use that capability much. This also
> >may not be much different from the multiple-machines-in-same-location
> >situation, if the method by which proximate machines coordinate their
> >actions isn't especially bandwidth and low-latency intensive.
> >
> >> * what the thread/process model should be for processing the urls.
separate processes or threads
> >> in a single process.
> >
> >Unsure exactly what you mean; but with the assumption that we're using Java,
> >it will make sense to keep all related activities in the same running
> >JVM, and split tasks among Java threads.
> >
> >Even with a relatively simple/traditional structure -- one thread works
> >its way through the whole process of handling one URI, blocking socket
> >I/O -- using a lot of Java threads can achieve pretty good throughput.
> >We're likely to make use of the new nonblocking socket support, and
> >decompose the process into bite-sized nonblocking stages, so that a
> >smaller number of threads can scream through the workload with minimal
> >scheduling overhead.
> >
> >> * duplicate site detection techniques
> >
> >I want to enable hash-based content-body duplication detection to the
> >greatest extent possible. Detecting when whole sites are exactly or
> >nearly alike is a challenge for further down the road; at least with
> >content-body duplicate detection, the storage cost of revisiting entire
> >exact-mirror sites remains very small.
> >
> >> * session-id detection and handling
> >
> >This is an area where there seems to be a lot of folk wisdom and
> >hand-tuning (eg the Alexa URL purification regexp rules) but not
> >much documentation.
> >
> >I've started a tracking database at our Sourceforge project for
> >capturing exactly these kind of real-web tricky/nonobvious challenges
> >for crawlers and retrieval tools -- so that issues and potential
> >fixes don't just get hidden in the code (and mind) of the first
> >person to encounter them.
> >
> >> * what user-agent we use.
> >
> >I think we need to pick a name for the software package better than
> >"archive-crawler" -- and that name should be featured in the User-Agent
> >string during official Archive crawls.
> >
> >Some crawl applications/customers will need to change User-Agent
> >arbitrarily, so that will also be supported.
> >
> >- Gordon
> >
> >
> >----- Original Message -----
> >From: "Brewster Kahle" <brewster@...>
> >To: "Gordon Mohr" <gojomo@...>; "Brad Tofel" <brad@...>;
"Igor Ranitovic" <igor@...>;
<michele@...>
> >Cc: "Raymie Stata" <raymie@...>; <gojomo@...>
> >Sent: Thursday, February 20, 2003 9:08 PM
> >Subject: Re: Crawler engineering review kickoff - 10am Friday
> >
> >
> >>
> >> thank you for pulling this together.
> >>
> >> A few other design decisions that seem to be important at this stage:
> >>
> >> * how to prioritize the urls to crawl. maybe we could list the
different ways current crawlers work as a guideline for
some
> >of the options.
> >> we know xyleme and alexa and mercator.
> >> gordon, maybe you could write up a short summary as a way to
understand the options.
> >>
> >> * how tasks will be broken up on different machines. for instance:
> >> 1. different machines for different stages of processing, or
> >> 2. a host does all stages, but urls are broken up onto different
machines some way
> >>
> >> * will we use an RAM based system or disk based (for the url list)
> >> (xyleme and mercator used RAM, (xyleme said this was a big
limitation on their system)
> >> alexa does not)
> >> I dont know if you can buy cost-effective large ram machines.
> >> talking with Jad at alexa, he said 2GB is cheap otherwise very
expensive (this would be
> >> based on the HP line).
> >>
> >> * how can remote coordination be done between crawlers that are
geographically far apart
> >>
> >> * what the thread/process model should be for processing the urls.
separate processes or threads
> >> in a single process.
> >>
> >> * duplicate site detection techniques
> >>
> >> * session-id detection and handling
> >>
> >> * what user-agent we use.
> >>
> >>
> >> -brewster
> >>
> >>
> >> At 06:13 PM 2/20/2003 -0800, Gordon Mohr wrote:
> >> >Brad, Brewster, Igor:
> >> >
> >> >Key points to cover in our crawler engineering review meeting
> >> >(tomorrow morning at 10am) are listed below.
> >> >
> >> >Most important will be going over the general requirements,
> >> >so if you have a chance to look over the first document referenced
> >> >below before the meeting, that'd be ideal.
> >> >
> >> >We lose Brewster to another meeting at 11am -- but it's early in
> >> >the project and we should try to keep discussion to broad issues
> >> >rather than details, so I'm hopeful we'll get through what we
> >> >need to in an hour.
> >> >
> >> >Agenda:
> >> >
> >> >(1) General crawler requirements
> >> >
> >> > This document is a starting point:
> >> >
> >> > http://homeserver.archive.org/webprojects/crawler-requirements.htm
> >> >
> >> > Does this breakdown cover everything it should? What are
> >> > the relative priorities of all the requirements -- in
> >> > both urgency and long-term importance?
> >> >
> >> >(2) Key early design decisions
> >> >
> >> > Using Java
> >> >
> >> > Crawler modeled as loosely-linked program stages connected
> >> > by queues.
> >> >
> >> > Top-level stages:
> >> > -> URIChoosing (what URI next?)
> >> > \- Preprocessing (robots/politeness/etc.)
> >> > \_ Fetching (HTTP)
> >> > \_ Postprocessing (analysis/archival)
> >> > \_ URIStoring (return URI(s) to pool for revisiting later)
> >> >
> >> > First priority: dumb, bad single-machine crawler -- then
> >> > incremental extension to better, multi-machine crawler.
> >> >
> >> >(3) Chosen tools and processes
> >> >
> >> > "archive-crawler" Yahoo Group for team discussion and file sharing:
> >> >
> >> > http://groups.yahoo.com/group/archive-crawler/
> >> >
> >> > "archive-crawler" Sourceforge project for source code/CVS hosting
> >> > and engineering-issue tracking:
> >> >
> >> > http://sourceforge.net/projects/archive-crawler
> >> >
> >> > Overview design documents, plus Javadoc extracted from skeleton/
> >> > actual code files.
> >> >
> >> > Engineering review meetings (like this one) every few weeks to
> >> > get feedback and report progress.
> >> >
> >> > Others needed or recommended?
> >> >
> >> >(4) Test plans
> >> >
> >> > Unit/stage conformance tests alongside coding
> >> >
> >> > Acceptable output tests using existing ARC processing tools
> >> > and compared on "test crawls" against historic data or data
> >> > obtained by other crawlers.
> >> >
> >> > Coordination with acceptance rules/processes established
> >> > by the national libraries and project agreement.
> >> >
> >> >(5) Everything else (?)
> >> >
> >> >- Gordon
> >>
> >>
>
>
I think the response of Gordon and I in agreement, with a little difference on
the following point:
>> * will we use an RAM based system or disk based
>> (for the url list)
I said "_not_ RAM" Gordon said "swappable strategies will be enabled, starting
with a simple RAM-based approach to get the crawler testable for small crawls,
transitioning to a disk-based system when we get to web-wide crawls."
Gordon's is the better answer.
[cc'd to the archive-crawler@yahoogroups.com discussion list]
These are all important matters to address -- and for most of these issues,
I think there will be three overlapping answers:
(1) what range of choices are enabled by the design;
(2) what choice we make in a first, simple, proof-of-design
crawler;
...and ultimately...
(3) what choice we make in the eventual high-performance,
multi-machine, whole-web, max-any-downlink crawler
Considering a few of the specifics:
> * how to prioritize the urls to crawl. maybe we could list the different
ways current crawlers work as a guideline for some
of the options.
> we know xyleme and alexa and mercator.
> gordon, maybe you could write up a short summary as a way to understand
the options.
Common choices have included:
- a breadth-first traversal of the seed URI set and discovered URIs
- a depth-first traversal (within politeness and site-saturation constraints)
- descending estimated value of unvisited URIs, where estimated value is...
- number of inlinks
- classic Google PageRank (where inlinks from high-inlink pages are worth
more)
- apparent popularity by user visits (eg Alexa traffic data)
- weighted by URI contents or content-analysis of the inlink sources
Into these can be mixed other measures of a page's importance or
implied volativility -- to influence how often something is recrawled.
Because of the wide range of anticipated users, the crawler design must
be capable of having any of these policies -- or weighted combinations of
these policies -- plugged in, either by configuration file or code
modules. ("Answer #1")
For our first instantiation, simpleminded swappable breadth-first and
depth-first policies will suffice for verifying the design and initial
testing. ("Answer #2")
For the eventual high-performance, mega-scale crawler, a more
sophisticated ranking for ensuring "more important" pages are
fetched before (or more often than) other pages will eventually
be devised. Some research indicates breadth-first is pretty good
at approximating the (much harder to compute) inlink and PageRank
approaches. We'll have to tinker over time. ("Answer #3")
> * how tasks will be broken up on different machines. for instance:
> 1. different machines for different stages of processing, or
> 2. a host does all stages, but urls are broken up onto different
machines some way
The second approach -- distribution to relatively full-fledged hosts
by URI ranges -- seems to dominate at first (it's straightforward), but
then at really large scales there may be opportunities for specialized
machines to take over tough subproblems.
A nice part of the stages-and-queues model is that every stage
transition, through a queue, can potentially travel to another machine
in a clean fashion. So breaking up the process "horizontally" (a URI
lifecycle per machine) or "vertically" (a URI lifecycle spans many
machines) or a mixture of the two are all possible.
> * will we use an RAM based system or disk based (for the url list)
> (xyleme and mercator used RAM, (xyleme said this was a big
limitation on their system)
> alexa does not)
> I dont know if you can buy cost-effective large ram machines.
> talking with Jad at alexa, he said 2GB is cheap otherwise very
expensive (this would be
> based on the HP line).
Again, swappable strategies will be enabled, and I think our choice will
change over time, starting with a simple RAM-based approach to get
the crawler testable for small crawls, transitioning to a disk-based
system when we get to web-wide crawls.
> * how can remote coordination be done between crawlers that are
geographically far apart
This is an interesting question, but right now seems pretty far down
the priority scale. I recall Paula at Alexa saying they were capable of
doing this -- but didn't really use that capability much. This also
may not be much different from the multiple-machines-in-same-location
situation, if the method by which proximate machines coordinate their
actions isn't especially bandwidth and low-latency intensive.
> * what the thread/process model should be for processing the urls.
separate processes or threads
> in a single process.
Unsure exactly what you mean; but with the assumption that we're using Java,
it will make sense to keep all related activities in the same running
JVM, and split tasks among Java threads.
Even with a relatively simple/traditional structure -- one thread works
its way through the whole process of handling one URI, blocking socket
I/O -- using a lot of Java threads can achieve pretty good throughput.
We're likely to make use of the new nonblocking socket support, and
decompose the process into bite-sized nonblocking stages, so that a
smaller number of threads can scream through the workload with minimal
scheduling overhead.
> * duplicate site detection techniques
I want to enable hash-based content-body duplication detection to the
greatest extent possible. Detecting when whole sites are exactly or
nearly alike is a challenge for further down the road; at least with
content-body duplicate detection, the storage cost of revisiting entire
exact-mirror sites remains very small.
> * session-id detection and handling
This is an area where there seems to be a lot of folk wisdom and
hand-tuning (eg the Alexa URL purification regexp rules) but not
much documentation.
I've started a tracking database at our Sourceforge project for
capturing exactly these kind of real-web tricky/nonobvious challenges
for crawlers and retrieval tools -- so that issues and potential
fixes don't just get hidden in the code (and mind) of the first
person to encounter them.
> * what user-agent we use.
I think we need to pick a name for the software package better than
"archive-crawler" -- and that name should be featured in the User-Agent
string during official Archive crawls.
Some crawl applications/customers will need to change User-Agent
arbitrarily, so that will also be supported.
- Gordon
----- Original Message -----
From: "Brewster Kahle" <brewster@...>
To: "Gordon Mohr" <gojomo@...>; "Brad Tofel" <brad@...>; "Igor
Ranitovic" <igor@...>; <michele@...>
Cc: "Raymie Stata" <raymie@...>; <gojomo@...>
Sent: Thursday, February 20, 2003 9:08 PM
Subject: Re: Crawler engineering review kickoff - 10am Friday
>
> thank you for pulling this together.
>
> A few other design decisions that seem to be important at this stage:
>
> * how to prioritize the urls to crawl. maybe we could list the different
ways current crawlers work as a guideline for some
of the options.
> we know xyleme and alexa and mercator.
> gordon, maybe you could write up a short summary as a way to understand
the options.
>
> * how tasks will be broken up on different machines. for instance:
> 1. different machines for different stages of processing, or
> 2. a host does all stages, but urls are broken up onto different
machines some way
>
> * will we use an RAM based system or disk based (for the url list)
> (xyleme and mercator used RAM, (xyleme said this was a big
limitation on their system)
> alexa does not)
> I dont know if you can buy cost-effective large ram machines.
> talking with Jad at alexa, he said 2GB is cheap otherwise very
expensive (this would be
> based on the HP line).
>
> * how can remote coordination be done between crawlers that are
geographically far apart
>
> * what the thread/process model should be for processing the urls.
separate processes or threads
> in a single process.
>
> * duplicate site detection techniques
>
> * session-id detection and handling
>
> * what user-agent we use.
>
>
> -brewster
>
>
> At 06:13 PM 2/20/2003 -0800, Gordon Mohr wrote:
> >Brad, Brewster, Igor:
> >
> >Key points to cover in our crawler engineering review meeting
> >(tomorrow morning at 10am) are listed below.
> >
> >Most important will be going over the general requirements,
> >so if you have a chance to look over the first document referenced
> >below before the meeting, that'd be ideal.
> >
> >We lose Brewster to another meeting at 11am -- but it's early in
> >the project and we should try to keep discussion to broad issues
> >rather than details, so I'm hopeful we'll get through what we
> >need to in an hour.
> >
> >Agenda:
> >
> >(1) General crawler requirements
> >
> > This document is a starting point:
> >
> > http://homeserver.archive.org/webprojects/crawler-requirements.htm
> >
> > Does this breakdown cover everything it should? What are
> > the relative priorities of all the requirements -- in
> > both urgency and long-term importance?
> >
> >(2) Key early design decisions
> >
> > Using Java
> >
> > Crawler modeled as loosely-linked program stages connected
> > by queues.
> >
> > Top-level stages:
> > -> URIChoosing (what URI next?)
> > \- Preprocessing (robots/politeness/etc.)
> > \_ Fetching (HTTP)
> > \_ Postprocessing (analysis/archival)
> > \_ URIStoring (return URI(s) to pool for revisiting later)
> >
> > First priority: dumb, bad single-machine crawler -- then
> > incremental extension to better, multi-machine crawler.
> >
> >(3) Chosen tools and processes
> >
> > "archive-crawler" Yahoo Group for team discussion and file sharing:
> >
> > http://groups.yahoo.com/group/archive-crawler/
> >
> > "archive-crawler" Sourceforge project for source code/CVS hosting
> > and engineering-issue tracking:
> >
> > http://sourceforge.net/projects/archive-crawler
> >
> > Overview design documents, plus Javadoc extracted from skeleton/
> > actual code files.
> >
> > Engineering review meetings (like this one) every few weeks to
> > get feedback and report progress.
> >
> > Others needed or recommended?
> >
> >(4) Test plans
> >
> > Unit/stage conformance tests alongside coding
> >
> > Acceptable output tests using existing ARC processing tools
> > and compared on "test crawls" against historic data or data
> > obtained by other crawlers.
> >
> > Coordination with acceptance rules/processes established
> > by the national libraries and project agreement.
> >
> >(5) Everything else (?)
> >
> >- Gordon
>
>
At our last design meeting, Raymie and I sketched an outline
of crawler operation as a series of discrete stages connected
by queues -- a style compatible with the SEDA ("Staged Event
Driven Architecture") tools available at:
http://www.cs.berkeley.edu/~mdw/proj/sandstorm/
In many ways, this is an "inside-out" approach compared to a
more typical consciously-threaded design. The SEDA framework
handles threading issues; the application is just a set of simple,
ideally non-blocking but thread-safe event handlers dropping
event objects into each others' in-sinks.
In this approach, the entity called 'CandidateURI' assumes a
central importance: it is usually the object being pushed to
each new stage, carrying with it any work done so far. I now
think 'CrawlURI' is a better name for this object -- after all,
it's only a 'candidate' for part of its lifecycle.
A CrawlURI object has a lot of state and functionality:
- a few key required fields
- an open-ended, hierarchical attribute-value structure
for holding whatever data stages need to pass between
each other -- or retain persistently across multiple,
spaced-out URI revisits
- tests for whether the CrawlURI meets certain ready/error
conditions
- possibly, convenience protocols for accessing external
resources associated with the URI (eg contents from previous
visits)
A rough outline of basic crawler operations in the SEDA style
is below. The same general tasks are being handled as in my
previous, classical single-thread pseudocode outline -- except
in the handleEvent() methods of Stage objects.
I'm assuming that a driver process outside the SEDA/SandStorm
system enqueues 'DoURI' events to the URIChoosing stage. (In
the SEDA example server apps, connections from external clients,
as handled by system service threads, serve to kick the system
into motion.) These DoURIs can just be empty indicators used to
push along the process -- the actual URI is chosen after each
DoURI kick is received -- but in a future distributed setup DoURI
events could also indicate local restrictions on the next URI
to choose.
I've named the stages with the "-ing" verb tense, to indicate
them as free-floating actions. The "front" of the URI Frontier
is prompted to give our URIs for processing via a URIChoosing
stage; the "back" of the URI Frontier which takes either new
URIs or old URIs at the end of a processing cycle is represented
by a URIStoring stage.
--
The basic flow would be:
EVENT STAGE
---------- ------------------
{DoURI} -> URIChoosing (frontier/URI database "front")
{CrawlURI} -> Preprocessing (robots, politeness, dns)
{CrawlURI} -> DNSLookingUp (when necessary)
{CrawlURI} -> Fetching (which may have substages)
{CrawlURI} -> Postprocessing (links, archival)
{CrawlURI} -> URIStoring (frontier/URI database "back")
--
A bit more about the various stage handleEvent() methods:
URIChoosing.handleEvent( (DoURI) element ) ==
{
CrawlURI curi = nextUri(); // decide on URI to address
nextStage().enqueue(curi); // pass to Preprocessing
}
Preprocessing.handleEvent( (CrawlURI) element ) ==
{
CrawlURI curi = (CrawlURI) element;
if ( curi.needsGoverningRobotsTxt() ) {
curi.markAsDeferredForRobots();
failStage().enqueue( curi ); // pass back to URIStoring
return;
}
curi.robotsTxt().apply(); // apply rules & markup results
if ( curi.fetchDisallowed() ) {
failStage().enqueue( curi ); // pass back to URIStoring
}
if ( curi.needsDNSLookup() ) {
dnsStage().enqueue( curi )); // pass to DNSLookingUp
return;
}
politenessPolicy.applyTo( curi ); // apply rules & markup results
if ( curi.fetchDisallowed() ) {
failStage().enqueue( curi ); // pass back to URIStoring
return;
}
// other rules for tricky situations
// could be implemented here, or as
// chained next stage
nextStage().enqueue( curi ); // pass to Fetching
}
DNSLookingUp.handleEvent((CrawlURI) element) ==
{
//
// do dns lookup in staged fashion,
// marking up CrawlURI (and local host
// database), eventually...
//
if ( curi.domainUnavailable() ) {
failStage.enqueue( curi );
return;
}
nextStage().enqueue( curi ); // otherwise, return to Preprocessing
}
Fetching.handleEvent( element ) ==
{
//
// do fetching in staged fashion,
// marking up CrawlURI, eventually
// reaching...
//
nextStage().enqueue( curi ); // pass to Postprocessor
}
Postprocessing.handleEvent((CrawlURI) element) ==
{
//
// do archiving, link extraction, etc.
// perhaps enqueuing many new CrawlURIs to
// URIStoring stage, and eventually...
//
nextStage().enqueue( curi ); // return to URIStoring
}
URIStoring.handleEvent((CrawlURI) element) ==
{
//
// examine CrawlURI for evidence of
// correctable/retryable errors
//
// persist CrawlURI in the right place
// for revisiting if/when appropriate
//
// do *not* push new CrawlURIs into the
// full processing loop -- but perhaps put
// them at the top of the structures consulted
// by URIChoosing
}
--
The most basic way to tune the system for best performance
is to adaptively vary the rate of DoURI enqueues until
maximum throughput is observed.
Ideally, though, when the various stage thresholds are set and
auto-adjusted properly, simply enqueuing DoURIs as fast as a
fixed-size entry queue will take them should reliably drive the
system to maximum throughput.
--
Thoughts? Questions? Concerns?
- Gordon
From a number of sources, I've been hearing about tricky crawler
situations -- misbehaving or malicious servers, endless domains,
difficult-to-extract link references, etc. It also seems that
new challenges arise constantly -- for example, in the Israel
Election crawl.
So I'd like to capture writeups of these situations in a central
reference location, as a way to address them in the new crawler
design before they get reported as bugs/enhancement-requests
later in the cycle.
I've set up a "Tracker" at the SourceForge project called
"Crawl Challenges". (Sourceforge Trackers are a sort of
generic issue-tracking tool, with categories/comments/priorities
etc.) You can browse entered issues at:
http://sourceforge.net/tracker/?atid=540792&group_id=73833&func=browse
Please, when you encounter (or think of) a tricky challenge our
crawler will need to address, add it to this project Tracker.
- Gordon
We now have a project at SourceForge for hosting our source code;
see the details below.
I definitely want to use their CVS, and perhaps their bug/
enhacement-request tracking. I think we'll want to continue
using this YahooGroup for most technical discussion.
Please let me know your SourceForge login names, so that I can
add you to the project.
- Gordon
From SourceForge:
> Your project registration for SourceForge.net has been approved.
>
> Project Descriptive Name: Internet Archive Web Crawl Tools
> Project Unix Name: archive-crawler
> CVS Server: cvs.archive-crawler.sourceforge.net
> Shell/Web Server: archive-crawler.sourceforge.net
>
> Your DNS will take up to a day to become active on our site.
> While waiting for your DNS to resolve, you may try shelling into
> shell.sourceforge.net and pointing CVS to cvs.sourceforge.net.
>
> If after 6 hours your shell/CVS accounts still do not work, please
> open a support ticket so that we may take a look at the problem.
> Please note that all shell/CVS accounts are closed to telnet and only
> work with SSH1.
>
> Your web site is accessible through your shell account. Please read
> site documentation (see link below) about intended usage, available
> services, and directory layout of the account.
>
> Please take some time to read the site documentation about project
> administration (http://sourceforge.net/docs/site/). If you visit your
> own project page in SourceForge.net while logged in, you will find
> additional menu functions to your left labeled 'Project Admin'.
>
> We highly suggest that you now visit SourceForge.net and create a public
> description for your project. This can be done by visiting your project
> page while logged in, and selecting 'Project Admin' from the menus
> on the left (or by visiting
https://sourceforge.net/project/admin/?group_id=73833
> after login).
>
> Your project will also not appear in the Trove Software Map (primary
> list of projects hosted on SourceForge.net which offers great flexibility in
> browsing and search) until you categorize it in the project administration
> screens. So that people can find your project, you should do this now.
> Visit your project while logged in, and select 'Project Admin' from the
> menus on the left.
>
> Enjoy the system, and please tell others about SourceForge.net. Let us know
> if there is anything we can do to help you.
>
> -- the SourceForge.net crew
>
Hi, Reddy!
You write:
> I assume that the worker thread is doing synchronous I/O. We are
> not sure yet on what mode we have finalized, synchronous I/O or
> asynchronous I/O ?
Yes, this outline most easily maps to blocking I/O. Per a discussion
last Thursday, we'd initially like to get up and running with the
familiar blocking I/O approach. Where possible we'll detach the
raw fetching/communication code from the other policy and analysis,
so that if/when the Windows NBIO problems are better understood, or
we need the extra performance of NBIO, we can revisit the issue then.
> The following views are based on the belief that we are following
> the Mercator design at the higher level.
>
> [Gordon] - Things like robots.txt or politeness could be handled in
> a very-smart 'frontier' object, or as one or more 'preprocessors'
> which occasionally veto/snooze offered URIs, pushing them back to
> the frontier for later processing
> [Reddy] - We feel that politeness cannot be handled in the preprocessing
> modules since preprocessing works on a single candidateURI at a time.
> It is the frontier which has to do this work much like what is done in
> Mercator. Mercator's frontier maintains one queue per worker thread
> based on the hostnames and the politeness factor.
I think there's a continuum of possibilities for how the decisionmaking
is split between the frontier and individual worker-threads.
(This next part is just enumerating the full range of possibilities;
I'm not yet advocating a specific approach in our main
implementation...)
At one extreme, you could have a very dumb frontier queue which essentially
just returns URIs first-in, first-out -- but with some mechanism for
a worker-thread to push back URIs with extra conditions about when/if they
reappear.
So the URIs might be grabbed by...
candidateUri = frontier.nextUriFor(this.workerIdentity());
...but then sometimes pushed back via...
frontier.takebackUriUntil(candidateUri,time);
or
frontier.takebackAndNeverReturnTo(candidateUri,this.workerIdentity());
That's not necessarily an efficient way to hand off URIs --
if a worker-thread is only handling 1/100th of the URI-space,
it'll be pushing back 99/100 URIs thrown at it -- but it does
let the frontier be simple, while all policy decisions are
made in a swappable/transparent fashion at the worker-threads.
At the other extreme, each worker-thread trusts the frontier
completely, assuming that it would never be handed a URI
that wasn't proper to retrieve.
When considering our design as a framework for many crawling
apps, I think we should make it possible for people's plug-in
modules to pursue either extreme.
When thinking about what's the optimal division for the Archive's
needs and deployment scenario, I tend to think that the
worker-threads will have the best local information (about fresh
robots.txt directives, per-host visit histories, and per-host
apparent responsiveness), while the frontier will have better
global knowledge of how duties are split and prioritized among
worker-threads. I believe this puts somewhat more into each
worker-thread than the classic Mercator approach.
> [Gordon] - In this outline, the frontier object does a lot: it's
> the URI-seen database, the URI-fetched database, the URI-prioritization
> policy, etc. etc. Could/should it be further decomposed ?
> [Reddy] - We think it should definitely be decomposed. URI-seen and
> URI-fetched databases will be simple but URI-prioritization will be a
> complex thing to handle.
Yes, but I also think that URI-seen and URI-fetched will become
challenging at the level of billions of URIs, and tend to merge
with the prioritization problem when considering a continuous
crawl.
> [Gordon] - Where do DNS lookups take place, and how much of an audit
> trail is generated? One possibility is that there is a "DNS frontier"
> just like the URI frontier. A Preprocessor module only crawls a URI if
> its domain name has already been (recently) resolved; if not, it starts
> a DNS lookup, and "snoozes" the URI for enough time that the name will
> probably have resolved by the time the URI again is considered.
> [Reddy] - DNS resolver would be a separate synchronous thread with its
> own expiration logic for the entries in the cache. So, it is agreed
> that the Preprocessor would snooze the crawl if the hostname is not yet
> resolved. But when it is reconsidered after resolution is done, where
> would we place the URI in the queue ? It would have to be based on
> prioritization again.
There seem to be a number of good reasons to have a sort of "As
Soon As Possible" subqueue for each worker-thread -- for robots.txt
and inline-image fetches, among others. It could also make sense
for fetches that were only delayed because of DNS latency to be
pushed into the ASAP pile.
> Other questions :
>
> - We are not sure of the role of postProcessing. What would be the
> kind of jobs done here in sequence ?
Postprocessing could include on-the-fly analysis, archiving permanent
copies, etc. Essentially anything that requires either the whole
document to be fetched (or the error status to be clear/stable), and
you want to do contemporaneously with the crawl. On-the-fly incremental
page-priority recalculation (eg Xyleme approach or others) could be
handled here.
> - When would we revisit the rejected ( based on politeness policy )
> URIs ? I assume that these URIs would be retrieved from the database
> at a later point of time and re-scheduled based on the prioritization
> policy.
I think there are many potential answers... the easiest being that
the frontier can manage a "don't revisit until" timestamp on any
given URI.
> Can we have a brief chat on this tomorrow morning 9 am IST. If not
> please let us know a suitable timing. We would reach better
> understanding if we discuss.
Possibly -- pending our discussion here this afternoon.
- Gordon
I assume that the worker thread is doing synchronous I/O. We are not sure yet on what mode we have finalized, synchronous I/O or asynchronous I/O ?
The following views are based on the belief that we are following the Mercator design at the higher level.
[Gordon] - Things like robots.txt or politeness could be handled in a very-smart 'frontier' object, or as one or more 'preprocessors' which occasionally veto/snooze offered URIs, pushing them back to the frontier for later processing
[Reddy] - We feel that politeness cannot be handled in the preprocessing modules since preprocessing works on a single candidateURI at a time. It is the frontier which has to do this work much like what is done in Mercator. Mercator's frontier maintains one queue per worker thread based on the hostnames and the politeness factor.
[Gordon] - In this outline, the frontier object does a lot: it's the URI-seen database, the URI-fetched database, the URI-prioritization policy, etc. etc. Could/should it be further decomposed ? [Reddy] - We think it should definitely be decomposed. URI-seen and URI-fetched databases will be simple but URI-prioritization will be a complex thing to handle.
[Gordon] - Where do DNS lookups take place, and how much of an audit trail is generated? One possibility is that there is a "DNS frontier" just like the URI frontier. A Preprocessor module only crawls a URI if its domain name has already been (recently) resolved; if not, it starts a DNS lookup, and "snoozes" the URI for enough time that the name will probably have resolved by the time the URI again is considered. [Reddy] - DNS resolver would be a separate synchronous thread with its own expiration logic for the entries in the cache. So, it is agreed that the Preprocessor would snooze the crawl if the hostname is not yet resolved. But when it is reconsidered after resolution is done, where would we place the URI in the queue ? It would have to be based on prioritization again.
Other questions :
- We are not sure of the role of postProcessing. What would be the kind of jobs done here in sequence ?
- When would we revisit the rejected ( based on politeness policy ) URIs ? I assume that these URIs would be retrieved from the database at a later point of time and re-scheduled based on the prioritization policy.
Can we have a brief chat on this tomorrow morning 9 am IST. If not please let us know a suitable timing. We would reach better understanding if we discuss.
Subject: [archive-crawler] Design thoughts: a single worker thread
I've been looking at what crawlers have typically done, and considering what we'd like the new crawler to do.
The following general outline -- in roughly valid Java -- represents my current thoughts on the general oepration of a single crawler worker thread: { setup(); // read parameters, populate submodules, etc.
while ( shouldProceed() ) {
candidateURI = frontier.nextUri(this);
preprocess(candidateURI); // hands to each "Preprocessor" // module in sequence; each // may cancel (or defer) further // processing, or decorate the // CandidateURI object with // attributes which affect // further processing. By convention, // "preprocessing" only uses 'local' // resources -- eg no internet // latencies
if ( candidateURI.shouldProcess() ) {
process(candidateURI); // an attempt is made to resolve the resource in the URI-appropriate fashion. Zero or more "Coprocessor" objects may be fed the data as it arriving, in parallel. As end result, candidateURI is decorated with sufficient info for later modules to get details/content of fetch
postprocess(candidateURI); // hands to each "Postprocessor" // module (AKA "Analyzer") in // sequence }
frontier.inform(candidateURI); // let frontier know what happened
- CandidateURI is a URI *and* an arbitrary amount of other annotation, which may come from the frontier or be added/ changed by any module to affect the processing by subsequent modules
- Things like robots.txt or politeness could be handled in a very-smart 'frontier' object, or as one or more 'preprocessors' which occasionally veto/snooze offered URIs, pushing them back to the frontier for later processing
Open questions:
- In this outline, the frontier object does a lot: it's the URI-seen database, the URI-fetched database, the URI-prioritization policy, etc. etc. Could/should it be further decomposed?
- Where do DNS lookups take place, and how much of an audit trail is generated? One possibility is that there is a "DNS frontier" just like the URI frontier. A Preprocessor module only crawls a URI if its domain name has already been (recently) resolved; if not, it starts a DNS lookup, and "snoozes" the URI for enough time that the name will probably have resolved by the time the URI again is considered.
Reactions?
- Gordon
To unsubscribe from this group, send an email to: archive-crawler-unsubscribe@yahoogroups.com
I've been looking at what crawlers have typically done, and
considering what we'd like the new crawler to do.
The following general outline -- in roughly valid Java --
represents my current thoughts on the general oepration
of a single crawler worker thread:
{
setup(); // read parameters, populate submodules, etc.
while ( shouldProceed() ) {
candidateURI = frontier.nextUri(this);
preprocess(candidateURI); // hands to each "Preprocessor"
// module in sequence; each
// may cancel (or defer) further
// processing, or decorate the
// CandidateURI object with
// attributes which affect
// further processing. By convention,
// "preprocessing" only uses 'local'
// resources -- eg no internet
// latencies
if ( candidateURI.shouldProcess() ) {
process(candidateURI); // an attempt is made to resolve
the resource in the URI-appropriate
fashion. Zero or more "Coprocessor"
objects may be fed the data as it
arriving, in parallel. As end result,
candidateURI is decorated with
sufficient info for later modules
to get details/content of fetch
postprocess(candidateURI); // hands to each "Postprocessor"
// module (AKA "Analyzer") in
// sequence
}
frontier.inform(candidateURI); // let frontier know what happened
update(); // update internal state, perhaps recognizing
outside stop/checkpoint request
}
shutdown(); // clean up
}
Some notes:
- CandidateURI is a URI *and* an arbitrary amount of other
annotation, which may come from the frontier or be added/
changed by any module to affect the processing by subsequent
modules
- Things like robots.txt or politeness could be handled in a
very-smart 'frontier' object, or as one or more 'preprocessors'
which occasionally veto/snooze offered URIs, pushing them
back to the frontier for later processing
Open questions:
- In this outline, the frontier object does a lot: it's the
URI-seen database, the URI-fetched database, the
URI-prioritization policy, etc. etc. Could/should it be
further decomposed?
- Where do DNS lookups take place, and how much of an audit
trail is generated? One possibility is that there is a
"DNS frontier" just like the URI frontier. A Preprocessor
module only crawls a URI if its domain name has already
been (recently) resolved; if not, it starts a DNS lookup,
and "snoozes" the URI for enough time that the name will
probably have resolved by the time the URI again is considered.
Reactions?
- Gordon