Thanks! This is a good plan for beginning implementation work. ... As with navigational headers on a website, some standard grouping of status info and...
151
Søren Vejrup Carlsen
svc400
Oct 2, 2003 3:19 pm
Dear All. I have been monitoring this mailing list and the activity on the CVS repository for the Heritrix crawler for a while. And today, I have tried...
152
Gordon Mohr
gojomo
Oct 2, 2003 4:42 pm
Hi, Søren. ... These errors are related to two recent changes: (1) To resolve an HTTP hanging problem, we began developing against the latest Apache Commons...
153
Gordon Mohr
gojomo
Oct 8, 2003 7:29 pm
Appended below is an example crawler configuration ("crawl-order") file, as currently drives my working-dev-branch crawler in desktop test crawls. I would...
154
Søren Vejrup Carlsen
svc400
Oct 9, 2003 11:49 am
Dear Gordon.
Here are some questions:
In the Scope element, what do max-trans-hops and max-toe-threads mean?
I believe that max-link-hops=10 should...
155
Gordon Mohr
gojomo
Oct 9, 2003 10:52 pm
... max-trans-hops: We consider certain discovered URIs to imply "transitive" inclusion: - embeds (such as IMG SRC and FRAME SRC) - referrals (redirections &...
156
Søren Vejrup Carlsen
svc400
Oct 23, 2003 12:28 pm
Dear All. In the Crawling Links document, one of the links on the must-reads list, "Kimpton, Stata. Crawler requirements for library consortium", is now dead...
157
Gordon Mohr
gojomo
Oct 27, 2003 7:37 pm
In our meeting Friday, we resolved to each create 3 new test cases for our crawler "test garden". This is an expandable collection of web-server content...
158
Parker Thompson
michaelparke...
Oct 27, 2003 11:14 pm
Along the same lines, it also might be useful to think of "test classes". By this I mean that at some point we'll want to have a coherent hierarchy of...
159
Igor Ranitovic
iranitovic
Oct 28, 2003 12:44 am
I will add the following tests: 1. Parsing links between escaped quotes found in javascript. Example: document.write("<a href=\"http://a.com/aPage.html\"> test...
160
Gordon Mohr
gojomo
Nov 3, 2003 10:18 am
I've just merged a large body of work that had been in a separate CVS branch back into the main branch. Most notably, many classes have been moved or renamed,...
161
Lars Clausen
lrclause
Nov 4, 2003 1:08 pm
Hi! I'm part of the Web archival group at the Danish State Library, where we're looking at using Heritrix for our crawling. I'm hacking around on it to see...
162
Gordon Mohr
gojomo
Nov 4, 2003 4:37 pm
... Thanks! ... It's now the "max-trans-hops" attribute on the <scope> element, because it serves as a limit on other forms of "transitive" inclusion of URIs,...
163
Gordon Mohr
gojomo
Nov 5, 2003 8:26 pm
Here's a guide to the major areas of change that were checked into CVS the other day. A number of things will initially be slower and somewhat flaky after...
164
Lars Clausen
lrclause
Nov 6, 2003 1:45 pm
Using today's CVS, we're having some problems with the seed regexp, (?i:(http(s)?://\\w+)|(\\w+\\.\\w+)(\\.\\w+)*(:\\d+)?(/\\S*)?), in...
165
Søren Vejrup Carlsen
svc400
Nov 6, 2003 5:40 pm
Dear all. The current Heritrix.java does not adequately test its arguments. If #arg=1, it assumes that the argument is the crawlorder-file, when it could be...
166
Igor Ranitovic
iranitovic
Nov 6, 2003 6:56 pm
Hi Søren, Thanks for the patch. It seems that the right solution is to accept any string value (if #args is 1) and then check if the file exists. If not, report...
167
Gordon Mohr
gojomo
Nov 7, 2003 8:28 am
Yes, a problem with the seed extraction pattern and one or more null-pointer exception problems were fixed today. The main CVS code will continue to be...
168
Søren Vejrup Carlsen
svc400
Nov 7, 2003 12:51 pm
Dear Gordon. Hostnames with "-" are very common in the Nordic countries, and probably in European domains generally. I have searched for the "-" pattern in the...
169
Lars Clausen
lrclause
Nov 8, 2003 12:14 pm
... Occasionally? Fully 22% of registered Danish domains have a '-' (as of 30/04/2002, the newest list we have). Also, according to RFC 1035, '-' (dash) is...
170
Gordon Mohr
gojomo
Nov 11, 2003 9:05 am
... Oops; of course you're right, I was thinking of '_'. - Gordon...
171
Søren Vejrup Carlsen
svc400
Nov 11, 2003 1:58 pm
Dear All. Recently, it has become necessary to deselect some of the Java classes during javadoc generation. Otherwise javadoc dies with a null-pointer...
172
Gordon Mohr
gojomo
Nov 11, 2003 4:54 pm
We had experienced a problem with the 'ð' character, when committed to CVS by Eclipse on Windows, causing problems for Eclipse's editors and build process on...
173
Kristinn Sigurðsson
kristsi25
Nov 11, 2003 7:52 pm
Hi all, Currently we are looking very carefully at the statistics being generated by Heritrix. The current implementation collects statistics in a bit...
174
John Erik Halse
johnerikhalse
Nov 11, 2003 8:30 pm
One correction about how things work at the moment. ... Actually it is the doc/sec since the last snapshot (the statistics interval in the order.xml file)....
175
Gordon Mohr
gojomo
Nov 12, 2003 8:49 am
... I've used the export -> Javadoc option built into Eclipse; it gives warnings for some of our incomplete comments but no NPEs or other fatal errors. Can you...
176
Gordon Mohr
gojomo
Nov 12, 2003 2:56 pm
For consistency and clarity, I've renamed the order.xml setting at /loggers/crawl-statistics/@interval to 'interval-seconds'. I've adjusted the code and...
177
archive-crawler@yahoo...
Nov 18, 2003 7:57 am
Hello, This email message is a notification to let you know that a file has been uploaded to the Files area of the archive-crawler group. File :...
178
Gordon Mohr
gojomo
Nov 19, 2003 10:59 pm
The top frustration during our recent evaluation crawl was that we don't yet have a working system for resuming a crawl in progress from disk-based state, aka...
179
Søren Vejrup Carlsen
svc400
Nov 21, 2003 5:24 pm
Dear All. Could you send me a copy of "Archiving Crawler Functional Requirements"? There should be a version 1 and a version 2 of this document. Please send me...