Dear All. Here are some comments on the README.txt (version 1.10) in the distribution. (1) In section 1,2, I suggest we mention how to do this assignment by...
I'm working on character encodings in Heritrix. I made a proposal for addressing our current blindness to anything other than single-byte character sets in ...
Michael Stack
stack@...
Mar 2, 2004 6:19 pm
284
We have begun discussing how the crawler could revisit already crawled URIs. Some initial thoughts have been written on the Wiki <url: ...
Hi all, By default all JSP pages are recompiled each time you launch Heritrix. This is of course very annoying. But if you create a directory called 'work' in...
Update on this issue. A recent change (post 0.6.0) includes a directive to Jetty to store the compiled JSP pages in a fixed location. That means that the 'work'...
Here's a proposal for the Heritrix negotiation of authentication schemes feature: http://crawler.archive.org/proposals/auth/ Would love feedback if any. Will...
Michael Stack
stack@...
Mar 26, 2004 11:10 pm
289
I have been playing around with heritrix for a few weeks now and I am in the process of turning it loose on a controlled environment for one of my research...
Currently you can build Heritrix with maven and ant. The maven build is more complete in that it generates all of the documentation and the site at ...
Michael Stack
stack@...
Mar 29, 2004 9:40 pm
291
Thanks for trying Heritrix, Seb. See below. ... Do you mean 0.4.0? You say 0.9.0 above. We just released 0.6.0 on Friday. Try it if you haven't already. Lots...
Michael Stack
stack@...
Mar 29, 2004 10:11 pm
292
Michael, Thanks for the prompt reply. Indeed I upgraded to 0.6.0 over the weekend: amazing how close a 9 looks to a 6 given enough lack of sleep. In English (to...
Hei Seb. See below ... If you are only crawling one site at a time (using DomainScope), then max-bytes-download is just the thing for you. It limits the total...
Dear St.Ack. I have a comment on assumption 3 (section 1.2.3): "No means of recording credentials used authenticating in an ARC". But shouldn't there be a means...
Hiya Kris, Thanks for the hint. That is pretty much where I ended up last night. To clarify, my original intent was to manage multiple sites from a single crawl...
... Yes Søren. It needs to be addressed but it probably won't be before first delivery of this new Heritrix feature. Thanks for the feedback. St.Ack...
Michael Stack
stack@...
Mar 30, 2004 4:30 pm
297
See below ... We are also aware of the fact that you can't overload websites and that is why the crawler is very polite. If you look at the settings under...
... The versions of jasper*.jar and servlet*.jar checked into heritrix came from the Jetty 4.2.17 bundle. Rather than our going via the middleman, Jetty, I...
Michael Stack
stack@...
Mar 30, 2004 11:06 pm
299
It seems that Seb's concern is not just politeness but the number of bytes downloaded from sites. Some ISPs will charge you an arm and a leg if you exceed given...
Also, RFE 891986 added a bandwidth-throttle facility, which I believe can be set per host. John Erik, can you say more about this capability? - Gordon...
The bandwidth-throttle facility consists of two different settings. One which sets the maximum average bandwidth the crawler is allowed to use. The other...
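To make the first of those settings concrete, here is a minimal sketch of how a cap on average bandwidth can be turned into a wait time. This is not Heritrix code; the function name `throttle_delay`, its parameters, the KB/sec unit, and the "zero means no limit" convention are all assumptions made for illustration. The second setting is not covered, since the snippet above is cut off before describing it.

```python
def throttle_delay(bytes_downloaded: int,
                   elapsed_seconds: float,
                   max_kb_per_sec: float) -> float:
    """Return how many seconds to pause so that the running average
    (bytes_downloaded / total time) stays at or below max_kb_per_sec.

    Hypothetical helper for illustration only; not a Heritrix API.
    """
    if max_kb_per_sec <= 0:
        # Assumption: a non-positive cap means "no limit".
        return 0.0
    # Minimum wall-clock time the download may take under the cap.
    min_elapsed = bytes_downloaded / (max_kb_per_sec * 1024.0)
    # Pause only if we are ahead of that schedule.
    return max(0.0, min_elapsed - elapsed_seconds)
```

For example, after fetching 102400 bytes in 5 seconds under a 10 KB/sec cap, the download should have taken at least 10 seconds, so the function asks for a further 5 second pause.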
Hi! I updated my HERITRIX installation from CVS - and now I can't crawl at all - I get alerts on every try. Could someone tell me whether the CVS version is...
Hello Bjarne. I just tried a build from HEAD and all seems to work fine. Perhaps your order file is from a previous version and the newer code has trouble ...
Michael Stack
stack@...
Apr 6, 2004 5:08 pm
304
The alerts all came up in the UI - when configuring HERITRIX from inside the UI (using the Simple Profile). I returned to the official release 0.6.0 - it works...
Hello! Does HERITRIX handle cookies? In the UI there are two text fields for saving and loading a cookie file. When the crawler runs - does it save cookies...
Hi, I collected a nice test archive of about 100000 docs with Heritrix 0.6.0. I think it went well (I haven't yet tried very baaad web sites ;) Now I am trying to...
... Yes it does. Handling of cookies is done by default. The load cookies option allows an operator to pre-load an existing cookies file (in the Netscape's ...
Hello everyone, I'm trying to use Heritrix on a Windows(!) platform. Whenever I submit a job via the web interface I get an error - here is the log (alert)...
Hi! We are testing HERITRIX in connection with harvesting specially selected websites - when harvesting only one website (on only one host / domain) the...