Hi, can anybody tell me how to make custom modules like processors, DecideRules etc. show up in the create drop-down menu of the heritrix2 webui? In heritrix...
Thanks a lot, Gordon, it does work. It solved a lot of problems for me... Sorry for letting you know so late, however. Thanks, Pratyush ... From: Gordon Mohr...
Pratyush Banerjee
Pratyushbanerjee@...
Jun 2, 2008 7:10 am
5257
Probably the easiest way to do this is to have the mid-fetch filter 'tag' the CrawlURI (using CrawlURI.put methods) and then have a processor run directly...
Hello, At http://crawler.archive.org/apidocs/index.html, the javadocs correspond to version 1.15.1. I am using version 2.0 and would like to find Javadocs...
Jean-Noël Rivasseau
elvanor@...
Jun 2, 2008 2:17 pm
5259
*bump* Still not understanding why this happens....
Jean-Noël Rivasseau
elvanor@...
Jun 2, 2008 3:02 pm
5260
Hi Jean-Noël Rivasseau, have you tried implementing the interface org.archive.state.Initializable and put your initial code in the initialTasks-method? ...
Hello, I have a processor that needs to access some HTML documents fetched by Heritrix. If my seed URLs contain a URL corresponding to such a document, I...
Jean-Noël Rivasseau
elvanor@...
Jun 2, 2008 3:13 pm
5262
Hi Christian, thanks a lot, your suggestion works perfectly. Now, for my own curiosity, is it easy to understand why a constructor did not work? Internally,...
Jean-Noël Rivasseau
elvanor@...
Jun 2, 2008 3:23 pm
5263
... Hi Jean-Noel, sorry I don't really know it either. I think this has something to do with the internals of the settings framework, but I don't really know....
I don't know either - I currently always instantiate "manually". If you one day find out the answer, I would be interested in knowing too. Jean-Noel...
Jean-Noël Rivasseau
elvanor@...
Jun 2, 2008 3:57 pm
5265
Hi! Thanks for the reply. I did have a ContentTypeMatchesRegExpDecideRule under the writer processor section with the following regex (?i)application/xml.* But...
Hello, I had a problem with encoding today and took a look at Heritrix code. Unfortunately it seems to me (from my understanding of the code) that Heritrix...
Jean-Noël Rivasseau
elvanor@...
Jun 2, 2008 6:27 pm
5267
... Here's an excerpt from the configuration that worked for me on the mangosproject.org website: root:credential-store=primary, ...
The code which builds the lists shown in the web UI lives at org.archive.crawler.webui.Settings, in the method getSubclasses(). It looks for premade text files...
I don't know what's happening, but parts of your description don't add up. In particular, all the concrete Processor classes standard with Heritrix have...
... It depends on where your Processor is in the chain. The same URI can enter processing several times, especially if, when it first comes up, the DNS/robots...
Do you have a REJECT rule first that applies to everything, then the ContentTypeMatchesRegExpDecideRule to ACCEPT the right kind of content? Otherwise, the...
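The rule ordering asked about here can be modelled in plain Java. This is only a sketch and does not use the Heritrix API: the class, the enum and the two rule functions are hypothetical stand-ins, and it assumes that rules run in order and that the last non-PASS decision wins. The regex is the one quoted in the earlier message; applying it with a whole-string match is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

public class DecideSequenceSketch {
    public enum Decision { ACCEPT, REJECT, PASS }

    // Regex quoted in the thread; (?i) makes it case-insensitive.
    static final Pattern XML = Pattern.compile("(?i)application/xml.*");

    // Hypothetical rule 1: reject everything by default.
    static final Function<String, Decision> REJECT_ALL = ct -> Decision.REJECT;

    // Hypothetical rule 2: accept matching content types, otherwise PASS
    // (no opinion), so the earlier REJECT stays in force.
    static final Function<String, Decision> ACCEPT_XML =
            ct -> XML.matcher(ct).matches() ? Decision.ACCEPT : Decision.PASS;

    // Assumed semantics: apply rules in order; the last non-PASS result wins.
    public static Decision decide(List<Function<String, Decision>> rules, String contentType) {
        Decision result = Decision.PASS;
        for (Function<String, Decision> rule : rules) {
            Decision d = rule.apply(contentType);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    // Convenience wrapper: REJECT-everything first, then ACCEPT for XML.
    public static Decision rejectThenAcceptXml(String contentType) {
        List<Function<String, Decision>> rules = new ArrayList<>();
        rules.add(REJECT_ALL);
        rules.add(ACCEPT_XML);
        return decide(rules, contentType);
    }

    public static void main(String[] args) {
        System.out.println(rejectThenAcceptXml("application/xml; charset=UTF-8"));
        System.out.println(rejectThenAcceptXml("text/html"));
    }
}
```

Under these assumptions, leaving out the initial REJECT rule means non-matching content only ever gets PASS, so whatever default applies outside the sequence decides its fate.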
... Heritrix should already support many other encodings -- limited mainly by what support is in your Java VM. As per HTTP/1.1, when no other charset is...
... We don't yet have the autogenerated Javadocs or Maven2 project site automatically uploaded to the main Heritrix website. Until we have that set up, one...
I also have experience with a couple of newspaper sites that do not allow Heritrix to log in, although I'm quite sure I give all the necessary credentials...
I had a default REJECT in the scope to start with but not in the mid-fetch or writer processor phase. I also tried adding those in as the first rule for them...
At Wed, 28 May 2008 15:04:25 -0000, ... Just to confirm that I have seen identical traces in our runtime-errors.log files. We are running up to 10 simultaneous...
Hello. My problem was that I had a page that was actually encoded in the windows-1252 code page, but advertised itself as an ISO-8859-1 page (although it did this only in...
Jean-Noël Rivasseau
elvanor@...
Jun 4, 2008 11:35 am
5282
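The windows-1252 versus ISO-8859-1 mismatch described in this message can be reproduced with the standard Java charset classes alone. The sketch below is not Heritrix code; the class name and sample bytes are made up for illustration, and it assumes the JVM ships the windows-1252 charset (common, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Decode raw bytes with the given charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252,
        // but C1 control characters in ISO-8859-1.
        byte[] raw = { (byte) 0x93, 'h', 'i', (byte) 0x94 };

        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String asCp1252 = decode(raw, Charset.forName("windows-1252"));

        // Print the code point of the first character under each decoding.
        System.out.println(Integer.toHexString(asLatin1.charAt(0))); // 93
        System.out.println(Integer.toHexString(asCp1252.charAt(0))); // 201c
    }
}
```

The same bytes yield different characters depending on the declared charset, which is why a page that mislabels itself decodes with stray control characters in place of punctuation.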
Yes, the setup is like you described. I'm using the standard arc writer and I checked the arc files to see the type. Besides application/xml content, the other...