I have a problem similar to the one reported by "Over" on the web archiving blog at "http://wa.archive.org/blog/2007/12/07/heritrix-200-rc1-is-out/". I am using...
At Thu, 03 Jan 2008 15:23:40 -0000, ... This problem occurs because the latest Ubuntu has switched to using dash, a lightweight POSIX shell, as its primary shell (i.e....
Thanks, this does solve my problem. Not being a shell guru myself, I took the option of replacing the script headers on my local install of heritrix. Besides,...
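The header replacement mentioned above can be sketched in a few lines of Python. This is only an illustration: the helper name `use_bash_shebang` is hypothetical, and it assumes the affected scripts begin with a `#!/bin/sh` line and actually depend on bash features; the thread does not say which scripts were changed or what their first lines looked like.

```python
def use_bash_shebang(script_text: str) -> str:
    """Return script_text with a leading '#!/bin/sh' interpreter swapped for '/bin/bash'.

    Hypothetical helper: assumes the script's first line names /bin/sh directly.
    Any interpreter arguments (e.g. '-e') are kept; other scripts are returned unchanged.
    """
    first, sep, rest = script_text.partition("\n")
    if not first.startswith("#!"):
        return script_text  # no shebang line at all
    parts = first[2:].split()
    if not parts or parts[0] != "/bin/sh":
        return script_text  # some other interpreter; leave it alone
    parts[0] = "/bin/bash"
    return "#!" + " ".join(parts) + sep + rest
```

Applied to the text of a script, this changes only the first line, so a local install can be patched without touching the script bodies.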
What are all the dependencies for libarc? Is it completely independent of Heritrix? I am using Ubuntu Gutsy (7.10) and am trying to install libarc, but I...
Thanks for all the responses, but in the meantime I've got carried away playing with another crawler which worked right out of the box. Since I'm not a Java...
Kevin Porter
kev@...
Jan 9, 2008 2:58 pm
4884
In 2.0 RC1, I applied a JDBCWriterProcessor. I added the JDBC info to the KeyManager like the following and was able to see the data show up in the sheets...
... The scope in the configuration below should work for this. ... You had the right idea in the configuration below, but processors consider a DecideRule that...
Hi, Am I correct in assuming that the JDBC_DRIVER setting is going to be unchanged during the lifetime of a crawl? If that assumption is correct, then: 1. The...
I am crawling a domain with a very large number of hosts and content. It appears that due to time constraints we may not be able to gather all content. Is there...
... There's no general rule, and I doubt one could be arrived at for all crawls -- the web is so diverse, and it would depend on the seeds you're crawling and...
Correct. The driver will only change before a crawl is launched, not during. Thanks so much! I'll give it a try. From: archive-crawler@yahoogroups.com ...
Hi Daniel, I think that you will have to write code to do this. If you want to use 1.12.1 out-of-box then you can do this with a beanshell processor which will...
We recently found a workaround for this Windows-specific issue and applied it to the 'heritrix2' trunk in November, before the 'beta' release -- see issue: ...
Is arcreader supposed to return the offset of the URL record (ex. http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202) or the first...
At Fri, 11 Jan 2008 17:40:59 -0000, ... Yes. The offset is the position in the file where the header starts. The header is of indeterminate length: you will...
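Because the header has no fixed length, a reader has to seek to the offset, read up to the end of the header line, and only then knows how many bytes of content follow. A minimal sketch of that idea, assuming an uncompressed ARC version 1 record whose header is a single newline-terminated line of five space-separated fields (URL, IP address, archive date, content type, length), as in the example header quoted in the question. The names `ArcHeader` and `read_record` are hypothetical and are not part of arcreader or Heritrix.

```python
import io
from typing import BinaryIO, NamedTuple, Tuple


class ArcHeader(NamedTuple):
    url: str
    ip: str
    date: str
    content_type: str
    length: int


def read_record(stream: BinaryIO, offset: int) -> Tuple[ArcHeader, bytes]:
    """Read one record whose header line starts at `offset`.

    Assumes an uncompressed stream and a five-field, space-separated header line.
    """
    stream.seek(offset)
    line = stream.readline()  # header length is unknown, so read up to the newline
    if not line.endswith(b"\n"):
        raise ValueError("no complete header line at offset %d" % offset)
    fields = line.decode("latin-1").rstrip("\r\n").split(" ")
    if len(fields) != 5:
        raise ValueError("expected 5 header fields, got %d" % len(fields))
    url, ip, date, content_type, length = fields
    header = ArcHeader(url, ip, date, content_type, int(length))
    body = stream.read(header.length)  # the content starts right after the header line
    return header, body
```

For a gzip-compressed ARC file the offset would refer to a position in the compressed file, so the record would have to be decompressed from that position first; that step is left out of this sketch.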
I would like to implement the HTML filtering noted in the following link. I didn't see the noted URIRegExpFilter in the sheet for version 2.0. How can I...
Dear all. In the NetarchiveSuite project, we're in the process of migrating our Heritrix 1.10 templates from using the deprecated HostScope/DomainScope, and...
Dear all. I'm in the middle of migrating our Heritrix 1.12.1 templates from the deprecated scopes to DecidingScope, and just found out that the filters have...
Hi Søren, In org.archive.crawler.deciderules, see: - MatchesListRegExpDecideRule - NotMatchesListRegExpDecideRule - TooManyPathSegmentsDecideRule Take care, ...
Hi Søren, I cannot remember the exact behavior of the PathScope; I will have to look it up. But I think that DecidingScope's seeds-as-surt-prefixes option will do...
I used the profile basic_seed_sites as a template and changed the max-retries from 30 to 1 and retry-delay-seconds from 900 to 30. I placed a link in the...
Hello, Does anyone know if there is a way to do it? Using Heritrix 1.x, this is pretty straightforward, but with 2.0 I could not figure out how. Any pointers...
Hi, I've got heritrix embedded in another app, and I'm currently using the CrawlJobHandler and CrawlJob classes to minimize the amount of setup I have to do. ...
Micah Wedemeyer
mwedeme@...
Jan 17, 2008 5:46 pm
4905
Hi Bert, ... There are several things that I don't like about it. It is limited to URI matching rules, can be confusing and forces you to teach people ...
Hi, I'm looking for some feedback on the best approach to a crawling problem I need to solve. I have a list of URLs, some of which I only want to crawl that...
Hi, Thanks for your reply, Igor. Would this still be a preferred way if I have 1500 of these URLs? I would worry about the performance hit of every crawled...