Well, it's simple. The regex rule I added [while the job was crawling] did not apply to already-queued URIs. The canonicalization rules applied only before...
Hi, couldn't find this in the docs anywhere but I may well have missed it. Is there a MIME type for ARC files themselves? (i.e. if I was serving them from a...
... actually I got it the wrong way round by the looks of it - should be application/x-arc ... I guess it would make sense to get this officially recognised...
Hi Internet Archive Team, thanks a lot for adding my first two DecideRules into Heritrix. I have now also changed my SuccessfulFetchFilter into a pair of DecideRules ...
Dear Heritrix friends, I am a researcher using Heritrix to build a focused crawler in my experimental work. I have a new crawling strategy which determines...
My understanding is that you want to analyse the fetched pages and extract links based on that. I think you should write a new Extractor which calls your...
You might also take a look at how Heritrix is integrated into the metacombine project: http://www.metacombine.org/. Check under the software tab. Here is the...
Hello Olaf: Should be no problem adding multiple writers, each with its own rule set. We implemented a code freeze on Friday in readiness for the 1.10.0 release (Code ...
Stack, It may be a little late to request this, but I thought it would be really useful to list several use cases in the user manual for the most typical types...
Sounds great, Frank. Any chance you'd like to take a first cut at it, even if it was only an outline for the rest of us to fill in? Good stuff, St.Ack...
Doc. can go in up to the second before release. I'd say end-of-this week, start-of-next should see release of 1.10.0 unless we trip over the unexpected. Good...
I took a look and these are the response headers we're getting back: [Date: Tue, 05 Sep 2006 21:26:58 GMT , Server: Apache/2.0.59 (Unix) mod_ssl/2.0.59...
Did you try stripping all but the id parameter from the query string? Would the following java-string regex work for you in the regex canonicalization rule? ...
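The regex from the message above is cut off in this listing, so the following is only a hypothetical sketch of the idea it describes: keeping the id parameter and dropping the rest of the query string. It uses plain java.util.regex; the class name, method name, pattern and example URLs are all assumptions for illustration, not the pattern that was actually proposed and not Heritrix's own rule configuration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix: illustrates the
// "strip all but the id parameter" idea with a single regex.
class KeepIdParam {
    // Group 1: everything up to and including the '?'.
    // Optional non-capturing group: any earlier parameters ending in '&'.
    // Group 2: the id parameter itself (id=value, value may be empty).
    // Trailing .* swallows any later parameters.
    private static final Pattern ID_ONLY =
            Pattern.compile("^([^?]*\\?)(?:.*&)?(id=[^&]*).*$");

    // Returns the URI with only the id parameter kept, or the URI
    // unchanged when it has no query string or no id parameter.
    static String keepOnlyId(String uri) {
        return ID_ONLY.matcher(uri).replaceFirst("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(keepOnlyId("http://example.com/page?foo=1&id=42&bar=2"));
        System.out.println(keepOnlyId("http://example.com/page?sid=9"));
    }
}
```

Note that a parameter such as sid=9 is not mistaken for id, because the id= part must follow either the '?' or an '&'. Fragments and repeated id parameters are not handled specially in this sketch.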
I found some syntax errors while reading the Heritrix manuals, so if it is worthwhile, where should I go to correct them? ...
You can send them to me off list or put them into a bug up on sourceforge: http://sourceforge.net/tracker/?group_id=73833&atid=539099. Thanks Ahmed, St.Ack...
Stack, I created 3 use cases here: http://www.cs.odu.edu/~fmccown/heritrix/use_cases.html The parts in red are where someone more experienced than me should ...
... Fantastic. Thanks Frank. Stack's a little busy at the moment so I'm going to see if I can flesh out the sections in red. We'll then add an appendix to...
... Hello again, So I have filled out the red bits in case #1 and case #2, but I'm not sure what you're asking in case #3 -- "How could the rule be applied ...
Hi Michael, ... set. I think there is a problem with doing it - the WebUI simply doesn't allow me to do it. It just allows me to add one writer of each type, ...
It is a shortcoming of the WebUI (that was intentional at the time). There is a dirty fix to it by editing the Processors.options file and having several...
... Hello Paul- I appreciate you taking the time to edit my use cases. I've only been using Heritrix a few months, so I hope what I have written so far makes ...
Hello, I need to somehow feed Heritrix with a URI list stored in a SQL database. Is there any SQL-based frontier? Or, more simply, using an existing frontier, can I...
... Yes to the latter. See the JMX overview in the manual getting started: http://crawler.archive.org/articles/user_manual/outside.html#mon_com. The API is...
... The importUris operation is available in the CrawlJob MBean, not in the Heritrix instance CrawlService MBean (Your listing below is from the CrawlService...
The arcreader tool has the ability to output a resource either with its HTTP header or without it. There doesn't appear to be an option to just print the HTTP...