Looking to see if previous projects had created testing websites for crawlers, I came across Funnelback, a Java crawler I hadn't heard of before. Their main...
Attached is a document that summarizes what we now know about DNS, URIs,
and arc files, and how these could be made to play nice with one another. Gordon and...
At my request, Judy has set up a wiki and weblog for our project at our Sourceforge website -- which is now conveniently aliased to: http://crawler.archive.org...
... No, RFC 2540 prescribes a binary format and a text equivalent, both of which contain the same fields. ... To the best of my understanding no. I did not...
... To clarify a bit, the binary and text formats prescribed by RFC 2540 are equivalent, though these do not represent all information contained in a raw DNS...
I should confess that in the code I've written so far, many of these conventions have been violated -- in particular the practice of always declaring variables...
I put some brainstorming on what the test "web garden" should cover on the project Wiki at: http://crawler.archive.org/cgi-bin/wiki.pl?WebGarden Feel free to...
[moving discussion to arcive-crawler@yahoogroups.com] Looks good! I think we'll want to split up the tests into at least 3 non-overlapping (not cross-linked)...
FYI: the searchtools guys are up for us using anything we like, as long as credit is given and we say hey to Brewster on behalf of Avi ;). pt. -- Parker...
According to rfc1808.txt, a relative URL "../testDotDot.html" with a base "http://test.com" should resolve to the absolute URL ...
Igor Ranitovic
igor@...
Jun 20, 2003 8:38 pm
81
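A minimal sketch of the resolution case raised above, using Python's standard `urljoin` rather than the crawler's own code. One assumption to flag: `urljoin` follows the later RFC 3986 rules, which discard ".." segments that would climb above the root, whereas the abnormal examples in RFC 1808 keep such segments in the result. The output here may therefore differ from what rfc1808.txt prescribes.

```python
from urllib.parse import urljoin

# Values taken from the message above.
base = "http://test.com"
relative = "../testDotDot.html"

# urljoin applies RFC 3986 style resolution, not RFC 1808,
# so treat this as a comparison point, not as the RFC 1808 answer.
resolved = urljoin(base, relative)
print(resolved)
```

Comparing this output against the crawler's own resolver is one quick way to see which RFC a given implementation follows for excess ".." segments.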
We're getting a good collection of tests at... http://crawl08.archive.org/index-2.html and http://crawl08.archive.org/newtest/ But, could we split them up as...
Another related open-source project, Nutch, which includes a crawler as part of its functionality: http://www.nutch.org "Nutch is a nascent effort to implement...
We were briefly putting JUnit test code into subpackages named 'test', in the main source tree, at the same level as the code they tested. However, Reddy had...
A 1.1MB arc.gz of what the dev version of Heritrix gets, when crawling from... http://crawl08.archive.org ...is available in my archive home directory, ...
Just as a quick progress check, I ran the current dev crawler on Crawl09 (the 2GB RAM machine) with eight broad seeds and 200 worker threads. I still only...
... I've noticed a 1-byte discrepancy on sets that should be identical as well. It's most likely an issue with flushing/properly closing the output streams. ...
... Looks like this varies by a few bytes because crawls are run at different times, which produces different Date lines. This doesn't affect the uncompressed...
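One way to check that two crawl outputs differ only in their timestamps is to compare them with the date lines filtered out. The sketch below assumes that "Date lines" means header lines beginning with `Date:`; the function names and the sample records are hypothetical and not taken from the crawler.

```python
def strip_date_lines(record):
    """Return the record's lines, dropping any that start with 'Date:'."""
    return [
        line for line in record.splitlines()
        if not line.lower().startswith("date:")
    ]


def same_ignoring_dates(a, b):
    """True when two records are identical apart from their Date lines."""
    return strip_date_lines(a) == strip_date_lines(b)


# Hypothetical records from two runs of the same crawl.
run_one = "HTTP/1.1 200 OK\nDate: Mon, 23 Jun 2003 18:00:00 GMT\n\nhello"
run_two = "HTTP/1.1 200 OK\nDate: Tue, 24 Jun 2003 09:30:00 GMT\n\nhello"

print(same_ignoring_dates(run_one, run_two))
```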
Today, I tried the latest IBM Java VM for Linux, and gave the VM about 1.5GB of heap space. In the first 10 minutes, from the same seeds, it collected: -...
... Yes, Mercator is stripping URLs after it finds chars like &, " ", #, \n, etc. We have the code that skips this link stripping, but it works only for "&" ...
Igor Ranitovic
igor@...
Jun 30, 2003 7:17 pm
92
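To illustrate the kind of truncation described above, here is a small sketch that cuts a URL at the first occurrence of one of the listed characters. It is not Mercator's code; the character set is limited to the four characters named in the message, and the function name and example URL are hypothetical.

```python
# Characters named in the message: "&", space, "#", and newline.
STOP_CHARS = ("&", " ", "#", "\n")


def truncate_at_stop_char(url):
    """Cut the URL just before the first stop character, if any."""
    positions = [url.find(ch) for ch in STOP_CHARS if ch in url]
    if not positions:
        return url
    return url[:min(positions)]


# Hypothetical URL: everything from the "&" onward is lost.
print(truncate_at_stop_char("http://example.com/page?x=1&y=2"))
```

Cutting at "&" drops every query parameter after the first, which is presumably why a way to skip the truncation for that character matters.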
Useful list of common robots.txt errors: http://www.searchengineworld.com/misc/robots_txt_crawl.htm There's also an automatic syntax checker. Upon feeding it...
[moved to archive-crawler discussion list] I have some ideas in the Wiki at... http://crawler.archive.org/cgi-bin/wiki.pl/wiki.pl?TestingCoverageAgainstGoogle ...
Here is an initial assessment of possible binary format parsers. Word Documents: There are several choices here, though the seemingly obvious choice (and ...
A running crawler may create many logs of its ongoing activity. Some of the logs may capture individual transactions or errors; others may capture summaries of...