archive-crawler

Group Information

  • Members: 795
  • Category: Cyberculture
  • Founded: Dec 1, 2002
  • Language: English

Messages

Messages 150 - 179 of 8130   Oldest  |  < Older  |  Newer >  |  Newest
150 Gordon Mohr
gojomo
Sep 2, 2003
10:38 pm
Thanks! This is a good plan for beginning implementation work. ... As with navigational headers on a website, some standard grouping of status info and...
151 Søren Vejrup Carlsen
svc400
Oct 2, 2003
3:19 pm
Dear All. I have been monitoring this mailing-list and the activity on the cvs repository for the heritrix crawler for a while. And today, I have tried...
152 Gordon Mohr
gojomo
Oct 2, 2003
4:42 pm
Hi, Søren. ... These errors are related to two recent changes: (1) To resolve an HTTP hanging problem, we began developing against the latest Apache Commons...
153 Gordon Mohr
gojomo
Oct 8, 2003
7:29 pm
Appended below is an example crawler configuration ("crawl-order") file, as currently drives my working-dev-branch crawler in desktop test crawls. I would...
154 Søren Vejrup Carlsen
svc400
Oct 9, 2003
11:49 am
Dear Gordon. Here are some questions: In the Scope element, what do max-trans-hops and max-toe-threads mean? I believe that max-link-hops=10 should...
155 Gordon Mohr
gojomo
Oct 9, 2003
10:52 pm
... max-trans-hops: We consider certain discovered URIs to imply "transitive" inclusion: - embeds (such as IMG SRC and FRAME SRC) - referrals (redirections &...
156 Søren Vejrup Carlsen
svc400
Oct 23, 2003
12:28 pm
Dear All. In the Crawling Links document, one of the links on the must-reads list: "Kimpton, Stata. Crawler requirements for library consortium" is now dead...
157 Gordon Mohr
gojomo
Oct 27, 2003
7:37 pm
In our meeting Friday, we resolved to each create 3 new test cases for our crawler "test garden". This is an expandable collection of web-server content...
158 Parker Thompson
michaelparke...
Oct 27, 2003
11:14 pm
Along the same lines, it also might be useful to think of "test classes". By this I mean that at some point we'll want to have a coherent hierarchy of...
159 Igor Ranitovic
iranitovic
Oct 28, 2003
12:44 am
I will add the following tests: 1. Parsing links between escaped quotes found in javascript. Example: document.write("<a href=\"http://a.com/aPage.html\"> test...
160 Gordon Mohr
gojomo
Nov 3, 2003
10:18 am
I've just merged a large body of work that had been in a separate CVS branch back into the main branch. Most notably, many classes have been moved or renamed,...
161 Lars Clausen
lrclause
Nov 4, 2003
1:08 pm
Hi! I'm part of the Web archival group at the Danish State Library, where we're looking at using Heritrix for our crawling. I'm hacking around on it to see...
162 Gordon Mohr
gojomo
Nov 4, 2003
4:37 pm
... Thanks! ... It's now the "max-trans-hops" attribute on the <scope> element, because it serves as a limit on other forms of "transitive" inclusion of URIs,...
163 Gordon Mohr
gojomo
Nov 5, 2003
8:26 pm
Here's a guide to the major areas of change that were checked into CVS the other day. A number of things will initially be slower and somewhat flaky after...
164 Lars Clausen
lrclause
Nov 6, 2003
1:45 pm
Using today's CVS, we're having some problems with the seed regexp, (?i:(http(s)?://\\w+)|(\\w+\\.\\w+)(\\.\\w+)*(:\\d+)?(/\\S*)?), in...
165 Søren Vejrup Carlsen
svc400
Nov 6, 2003
5:40 pm
Dear all. The current Heritrix.java does not adequately test its arguments. If #arg=1, it assumes that the argument is the crawlorder-file, when it could be...
166 Igor Ranitovic
iranitovic
Nov 6, 2003
6:56 pm
Hi Søren, Thanks for the patch. It seems that the right solution is to accept any string value (if #args is 1) and then check if the file exists. If not report...
167 Gordon Mohr
gojomo
Nov 7, 2003
8:28 am
Yes, a problem with the seed extraction pattern and one or more null-pointer exception problems were fixed today. The main CVS code will continue to be...
168 Søren Vejrup Carlsen
svc400
Nov 7, 2003
12:51 pm
Dear Gordon. Hostnames with "-" are very common in the Nordic countries, and probably in European domains generally. I have searched for the "-" pattern in the...
169 Lars Clausen
lrclause
Nov 8, 2003
12:14 pm
... Occasionally? Fully 22% of registered Danish domains have a '-' (as of 30/04/2002 (newest list we have)). Also, according to RFC 1035, '-' (dash) is...
170 Gordon Mohr
gojomo
Nov 11, 2003
9:05 am
... Oops; of course you're right, I was thinking of '_'. - Gordon...
171 Søren Vejrup Carlsen
svc400
Nov 11, 2003
1:58 pm
Dear All. Recently, it has become necessary to unselect some of the java classes during javadoc generation. Otherwise javadoc dies with a null-pointer...
172 Gordon Mohr
gojomo
Nov 11, 2003
4:54 pm
We had experienced a problem with the 'ð' character, when committed to CVS by Eclipse on Windows, causing problems for Eclipse's editors and build process on...
173 Kristinn Sigurðsson
kristsi25
Nov 11, 2003
7:52 pm
Hi all, Currently we are looking very carefully at the statistics being generated by Heritrix. The current implementation collects statistics in a bit...
174 John Erik Halse
johnerikhalse
Nov 11, 2003
8:30 pm
One correction about how things work at the moment. ... Actually it is the doc/sec since the last snapshot (the statistics interval in the order.xml file)....
175 Gordon Mohr
gojomo
Nov 12, 2003
8:49 am
... I've used the export -> Javadoc option built into Eclipse; it gives warnings for some of our incomplete comments but no NPEs or other fatal errors. Can you...
176 Gordon Mohr
gojomo
Nov 12, 2003
2:56 pm
For consistency and clarity, I've renamed the order.xml setting at /loggers/crawl-statistics/@interval to 'interval-seconds'. I've adjusted the code and...
177 archive-crawler@yahoo...
Nov 18, 2003
7:57 am
Hello, This email message is a notification to let you know that a file has been uploaded to the Files area of the archive-crawler group. File :...
178 Gordon Mohr
gojomo
Nov 19, 2003
10:59 pm
The top frustration during our recent evaluation crawl was that we don't yet have a working system for resuming a crawl in progress from disk-based state, aka...
179 Søren Vejrup Carlsen
svc400
Nov 21, 2003
5:24 pm
Dear All. Could you send me a copy of "Archiving Crawler Functional Requirements"? There should be a version 1 and a version 2 of this document. Please send me...