... Sure. Here it is. I hope this is what you have been looking for. To prevent non-HTML documents from being downloaded, I've added three DecideRules to the...
Thanks so much for the config information. I have merged my configuration as I would like to go to a single site. Here is my complete config with the merge but...
... I forgot to tell you: I'm using my own Heritrix release, called heritrix-2.0.0-OFFIS, based on the 2.0.0 release of the IA. Within the 2.0.0 there...
... Hi Ravi, did you get it working using the 2.0.1-SNAPSHOT or are there still validation problems? I may send you the patched jar-files I'm currently using ...
... Thanks for the offer! The best approach to make a contribution is to... (1) create an issue in the Heritrix JIRA tracker (2) attach a patch with your...
Christian, I haven't pulled the 2.0.1 snapshot yet; please send the jar file and I will try to test it. Thanks for checking with me. Also, I want to get...
Hi Gordon, thanks for your advice. I've created an issue in JIRA (HER-1543). I'll attach the code in September or October. Currently I haven't much time...
Great -- we can track the issue, collect comments or votes from others who are interested, and whenever the code is battle-tested and in good shape it can be...
The Heritrix installation is fine. Profiles and settings were done according to the user manual. After I create a job, it finishes immediately. The crawl report shows that...
... Your configuration has serious problems. In particular, it appears all of the usual and necessary Processors that perform the steps of handling a single...
I did what you recommended: I retained the original module settings and created the job again. It shows 2 warnings and 1 severe alert, which I list ... Time: ??. 7, 2008...
Ok, this looks similar to this old issue: http://webteam.archive.org/jira/browse/HER-510 Are you using the Heritrix WAR version inside another servlet...
I use it in standalone mode, on Windows XP SP2. I haven't tried it in Eclipse or another container, and I think it's the last way to resolve the question if I am...
OK, I can reproduce your problem here, and there's actually another old issue describing it: http://webteam.archive.org/jira/browse/HER-540 At the Internet...
Heritrix releases 1.14.1 and 2.0.1 are now available at Sourceforge: http://sourceforge.net/project/showfiles.php?group_id=73833 These are both primarily...
Hi, I have a question concerning the function getContentSize() in the class ReplayInputStream. The Javadoc indicates that the function will return the total...
... Yes, it should... and getSize() will get the full recorded data size, including headers. ... You're getting these using the JerichoExtractorHTML, right? ...
It seems to me Heritrix neither considers the hash nor saves it in the ARC files. It could be useful to add this support. What do you think? Jean-Noel...
Jean-Noël Rivasseau
elvanor@...
Aug 8, 2008 6:23 pm
5400
Hi Christian, are you using Jericho HTML 2.5 or already Jericho HTML 2.6, which hasn't made it into 1.x yet? I'm asking because there seems to be a serious bug...
As of yesterday's 1.14.1 release, Heritrix 1 is using the jericho JAR version 2.6. I was only able to bring Heritrix 2 up to Jericho 2.5 because of an issue...
Hi Gordon and Olaf. Thanks for your help! I'll give it a try. Olaf, I'm currently using version 2.6 of Jericho, which seems to work fine. I've downloaded...
Hello, I am attempting to modify the scope rules in a sheet in one of my profiles, and am receiving this exception when clicking on "add": Problem:...
Jean-Noël Rivasseau
elvanor@...
Aug 11, 2008 3:02 pm
5404
No replies to this; can anyone at least confirm that this is the case?...
Jean-Noël Rivasseau
elvanor@...
Aug 11, 2008 3:05 pm
5405
I have the following reproducible behavior, both in 2.0.0 and in 2.0.1: I launch an engine and then access it remotely via JMX. In the web UI, when I go to a...
Jean-Noël Rivasseau
elvanor@...
Aug 11, 2008 7:24 pm
5406
IIRC, yes, anything past the # is ignored. Two URLs that differ only in the component that follows the # are considered the same (I do not recall whether...
At Fri, 08 Aug 2008 20:23:32 +0200, ... The URI fragment (aka hash) is interpreted by the client and is media type specific. The client-server interaction to...
(sorry, forgot to reply to list) I agree that it's a hack of course, but some (mainly Ajax-based) sites store information in the hash. In such a site, I could...
Jean-Noël Rivasseau
elvanor@...
Aug 11, 2008 8:46 pm
5409
The portion after the '#' (the 'fragment') is not sent on HTTP requests, and so does not affect what is returned from servers. So from the perspective of...
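To illustrate the point about the fragment not being sent to the server, here is a minimal Java sketch. It is not Heritrix's own canonicalization code; the class name `FragmentDemo`, the helpers `stripFragment` and `sameRequest`, and the example.com URLs are hypothetical and exist only for this illustration. It treats two URLs as equivalent on the wire when they match after the '#' part is dropped.

```java
import java.net.URI;

public class FragmentDemo {
    // Hypothetical helper: drop the '#' and everything after it,
    // since that part is never sent in the HTTP request.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // In this sketch, two URLs produce the same request if they are
    // identical once their fragments are removed.
    static boolean sameRequest(String a, String b) {
        return stripFragment(a).equals(stripFragment(b));
    }

    public static void main(String[] args) {
        // Assumed example URLs of an Ajax-style site that keeps state in the hash.
        String inbox = "http://example.com/app#inbox";
        String sent = "http://example.com/app#sent";

        // java.net.URI exposes the fragment separately from the rest of the URI.
        System.out.println("fragment:  " + URI.create(inbox).getFragment());
        System.out.println("requested: " + stripFragment(inbox));
        System.out.println("same request? " + sameRequest(inbox, sent));
    }
}
```

A crawler that wanted to treat hash-only differences as distinct pages would have to keep the fragment in its own bookkeeping and evaluate it client-side, because the server response alone cannot tell the two URLs apart.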
... Good explanation of the network-equivalence of two URIs differing only after the '#'. But regarding... ... The situation I've seen is where a JS/AJAXy...