archive-crawler
Messages

Messages 69 - 98 of 6140
69
It just occurred to me that when we crawl "https:" sites, we may want to archive the server-side SSL certificates presented at crawl-time. - Gordon...
Gordon Mohr
gojomo
Jun 6, 2003
11:21 pm
70
Looking to see if previous projects had created testing websites for crawlers, I came across Funnelback, a Java crawler I hadn't heard of before. Their main...
Gordon Mohr
gojomo
Jun 9, 2003
11:58 pm
71
Attached is a document that summarizes what we now know about DNS, URIs, and arc files, and how these could be made to play nice with one another. Gordon and...
Parker Thompson
michaelparke...
Jun 10, 2003
4:53 pm
72
At my request, Judy has set up a wiki and weblog for our project at our Sourceforge website -- which is now conveniently aliased to: http://crawler.archive.org...
Gordon Mohr
gojomo
Jun 10, 2003
5:29 pm
73
... No, RFC 2540 prescribes a binary format and a text equivalent, both of which contain the same fields. ... To the best of my understanding, no. I did not...
Parker Thompson
michaelparke...
Jun 10, 2003
5:31 pm
74
... To clarify a bit, the binary and text formats prescribed by RFC 2540 are equivalent, though these do not represent all information contained in a raw DNS...
Parker Thompson
michaelparke...
Jun 10, 2003
6:05 pm
75
In general, we should try to follow the Java code formatting and naming conventions set out by Sun at: ...
Gordon Mohr
gojomo
Jun 12, 2003
7:18 pm
76
I should confess that in the code I've written so far, many of these conventions have been violated -- in particular the practice of always declaring variables...
Gordon Mohr
gojomo
Jun 12, 2003
7:41 pm
77
I put some brainstorming on what the test "web garden" should cover on the project Wiki at: http://crawler.archive.org/cgi-bin/wiki.pl?WebGarden Feel free to...
Gordon Mohr
gojomo
Jun 16, 2003
7:42 pm
78
[moving discussion to archive-crawler@yahoogroups.com] Looks good! I think we'll want to split up the tests into at least 3 non-overlapping (not cross-linked)...
Gordon Mohr
gojomo
Jun 19, 2003
12:01 am
79
FYI: the searchtools guys are up for us using anything we like, as long as credit is given and we say hey to Brewster on behalf of Avi ;). pt. -- Parker...
Parker Thompson
michaelparke...
Jun 19, 2003
12:41 am
80
According to rfc1808.txt, a relative URL "../testDotDot.html" with a base "http://test.com" should resolve to the absolute URL ...
Igor Ranitovic
igor@...
Jun 20, 2003
8:38 pm
81
We're getting a good collection of tests at... http://crawl08.archive.org/index-2.html and http://crawl08.archive.org/newtest/ But, could we split them up as...
Gordon Mohr
gojomo
Jun 24, 2003
6:46 pm
82
Another related open-source project, Nutch, which includes a crawler as part of its functionality: http://www.nutch.org "Nutch is a nascent effort to implement...
Gordon Mohr
gojomo
Jun 24, 2003
7:20 pm
83
We were briefly putting JUnit test code into subpackages named 'test', in the main source tree, at the same level as the code they tested. However, Reddy had...
Gordon Mohr
gojomo
Jun 24, 2003
7:36 pm
84
A 1.1MB arc.gz of what the dev version of Heritrix gets, when crawling from... http://crawl08.archive.org ...is available in my archive home directory, ...
Gordon Mohr
gojomo
Jun 26, 2003
11:21 pm
85
I've created two arc files generated by crawling the same material. These omit dns records: ~parkert/heritrix-crawl08-easy-desktop.arc.gz ...
Parker Thompson
michaelparke...
Jun 27, 2003
12:07 am
86
Just as a quick progress check, I ran the current dev crawler on Crawl09 (the 2GB RAM machine) with eight broad seeds and 200 worker threads. I still only...
Gordon Mohr
gojomo
Jun 27, 2003
1:27 am
87
... I've noticed a 1-byte discrepancy on sets that should be identical as well. It's most likely an issue with flushing/properly closing the output streams. ...
Gordon Mohr
gojomo
Jun 27, 2003
8:43 pm
88
... Looks like this varies by a few bytes because crawls are run at different times, which produces different Date lines. This doesn't affect the uncompressed...
Parker Thompson
michaelparke...
Jun 27, 2003
9:05 pm
89
Today, I tried the latest IBM Java VM for Linux, and gave the VM about 1.5GB of heap space. In the first 10 minutes, from the same seeds, it collected: -...
Gordon Mohr
gojomo
Jun 28, 2003
12:09 am
90
Analysis of Mercator v. Heritrix v. HTTrack URLs Mercator Heritrix %_of_files col2-col3 crawl08.archive.org 24 31...
Judy Ma
jma0112
Jun 30, 2003
6:11 pm
91
... Yes, Mercator is stripping URLs after it finds chars like &, " ", #, \n, etc. We have the code that skips this link stripping, but it works only for "&" ...
Igor Ranitovic
igor@...
Jun 30, 2003
7:17 pm
92
Useful list of common robots.txt errors: http://www.searchengineworld.com/misc/robots_txt_crawl.htm There's also an automatic syntax checker. Upon feeding it...
Gordon Mohr
gojomo
Jun 30, 2003
9:54 pm
93
[moved to archive-crawler discussion list] I have some ideas in the Wiki at... http://crawler.archive.org/cgi-bin/wiki.pl/wiki.pl?TestingCoverageAgainstGoogle ...
Gordon Mohr
gojomo
Jul 1, 2003
1:37 am
94
111 slides, broad survey of software/projects/techniques, also details of "PolyBot" crawler design (towards end): http://cis.poly.edu/suel/talks/shoco2002.pdf ...
Gordon Mohr
gojomo
Jul 2, 2003
11:59 pm
95
Here is an initial assessment of possible binary format parsers. Word Documents: There are several choices here, though the seemingly obvious choice (and ...
Parker Thompson
michaelparke...
Jul 3, 2003
12:52 am
96
A running crawler may create many logs of its ongoing activity. Some of the logs may capture individual transactions or errors; others may capture summaries of...
Gordon Mohr
gojomo
Jul 7, 2003
7:53 pm
97
Test Crawl Seeds: http://www.arts.gov http://www.copyright.gov http://www.exploratorium.edu http://www.comics.com http://www.city.fi http://www.sony.com ...
Igor Ranitovic
igor@...
Jul 7, 2003
9:31 pm
98
An important issue that will come up with our open code base being run by outsiders is raised, in the context of Nutch, at: ...
Gordon Mohr
gojomo
Jul 7, 2003
9:42 pm