A DecideRule that shuttles pages to another running process asking whether to continue processing or not would make for a nice Heritrix contribution. What were...
Hi, does anybody have the heritrix.war for Heritrix 1.10? Please mail it to me, as the site has been down for 2 days. When I try to deploy admin.war with all necessary jars in...
Hmm... If I explicitly remove the Extension lists from the MANIFEST.MF in heritrix.jar, it works fine. Is there a nice way of handling this without modifying...
Heritrix-as-a-WAR uses the container's authentication mechanism. If you haven't already, set up an 'admin' role with login/password for your container -- for...
Michael Stack
stack@...
Mar 2, 2007 4:51 pm
3861
I was wondering if anyone was getting OutOfMemoryErrors thrown when checkpointing. On my system (8GB memory, 1GB allocated to the heap) I have checkpointing...
... It would be useful to see the lines of heritrix_out.log and progress_statistics.log before the point where the error occurred. How large is this crawl?...
Michael Magin
magin@...
Mar 6, 2007 6:20 pm
3863
What version of Heritrix are you running? If anything earlier than 1.10.2, I recommend upgrading: there's a fix to an issue with an included third-party...
I'm back looking at Heritrix and how I would implement an RPC decide rule, but I need to rethink how to go about this. I need Heritrix to make an RPC call,...
It looks like I've answered my own question. I'll be writing a decide rule which makes an RPC call. I'm not going to serialize the CrawlURI as I plan on...
I need to add the Apache XMLRPC jars to the Heritrix build, but I have no idea how to add these to Maven. Could someone point me in the right direction or at...
See the dependencies section in the project.xml. If the xmlrpc lib is available up in the maven1 ibiblio repository, maven should just fetch it for you (You...
First, thank you so much for the response. That was a ton of help. Just one more question. If I already have the jar files and want to use my local copies,...
... You need to add in both places. Make sure that the ID in the project.xml dependency section matches the ID suffix in your 'maven.jar.ID' entry in...
Michael Stack
stack@...
Mar 8, 2007 6:04 pm
3870
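For anyone following the Maven 1 advice above, here is a rough sketch of what "add in both places" might look like. The version number and the local jar path are placeholders made up for illustration, and the properties file is assumed to be project.properties (Maven 1 also reads build.properties). Only the shape of the two entries matters: the dependency ID and the suffix of the 'maven.jar.ID' key have to match.

```xml
<!-- project.xml, inside the <dependencies> section -->
<dependency>
  <id>xmlrpc</id>
  <!-- placeholder version; use the one matching your jar -->
  <version>2.0</version>
</dependency>
```

```properties
# project.properties (assumed file name)
# Turn on jar overriding, then point the matching ID at a local copy.
maven.jar.override = on
# hypothetical local path
maven.jar.xmlrpc = ${basedir}/lib/xmlrpc-2.0.jar
```

With the override switched off or the entry removed, Maven 1 falls back to fetching the dependency from the configured remote repository.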
Thanks for the continued help. Building Perl projects is a lot different than this. I've managed to build Heritrix with the new jars in the project files and...
Never mind. Two seconds after I posted this, I realized that the runtime wasn't finding the libraries because Maven wasn't magically copying them into the...
Could someone point me at some documentation or give me a hint on what I'm doing wrong with this DecideRule I've created? It has the constructor which takes...
... Which DR are you subclassing? If you override #decisionFor, is it called? Your rule is one of a set of rules in a DecidingScope (You say CrawlScope...
Michael Stack
stack@...
Mar 9, 2007 4:33 pm
3875
Woohooo, it works! Thank you so much for the help, St.Ack. I was subclassing DecideRule which doesn't have an #evaluate to override. I mis-read the code....
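The mix-up between #evaluate and #decisionFor described above can be illustrated with a small, self-contained sketch. Every class and enum name below is a stand-in invented for illustration; none of this is the Heritrix source, and the way the predicated rule maps a match to a decision is an assumption. The point is only that a base rule answers #decisionFor directly, while a predicated subclass exposes an #evaluate hook, so overriding #evaluate on the base class has no effect.

```java
import java.util.function.Predicate;

public class DecideRuleSketch {
    /** Stand-in decision values for this sketch. */
    enum Decision { ACCEPT, REJECT, PASS }

    /** Stand-in base rule: a subclass must answer decisionFor itself. */
    static abstract class BaseRule {
        abstract Decision decisionFor(Object candidate);
    }

    /** Stand-in predicated rule: a subclass only overrides evaluate. */
    static abstract class PredicatedRule extends BaseRule {
        private final Decision onMatch;

        PredicatedRule(Decision onMatch) {
            this.onMatch = onMatch;
        }

        abstract boolean evaluate(Object candidate);

        @Override
        Decision decisionFor(Object candidate) {
            // A match yields the configured decision; otherwise defer to later rules.
            return evaluate(candidate) ? onMatch : Decision.PASS;
        }
    }

    /** Hypothetical rule that asks an external check (for example an RPC client). */
    static class RemoteCheckRule extends PredicatedRule {
        private final Predicate<String> remoteCheck;

        RemoteCheckRule(Decision onMatch, Predicate<String> remoteCheck) {
            super(onMatch);
            this.remoteCheck = remoteCheck;
        }

        @Override
        boolean evaluate(Object candidate) {
            // The candidate is reduced to its string form before the external check.
            return remoteCheck.test(String.valueOf(candidate));
        }
    }

    public static void main(String[] args) {
        // The lambda stands in for a real RPC call.
        RemoteCheckRule rule =
            new RemoteCheckRule(Decision.REJECT, uri -> uri.contains("/private/"));
        System.out.println(rule.decisionFor("http://example.com/private/a"));
        System.out.println(rule.decisionFor("http://example.com/public/b"));
    }
}
```

Passing the remote check in as a Predicate keeps the rule testable without a running RPC server; a real rule would wrap the XML-RPC client call in that predicate.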
I am trying to build Heritrix from source (1.10.2) using Maven 1.0.2. I am using JDK 1.6, and I exported the following variables on my machine: export...
Ahmed Ghozia, St.Ack touches on this issue here (The source is missing files): http://tech.groups.yahoo.com/group/archive-crawler/message/3850 Follow the...
Hi Gordon, thanks for the great work and congrats on the release candidate. But... I have two issues: 1) How do I submit bug reports to the new system? Do I have...
Hi Gordon, consider the last part of my post void - seems like for some reason I've been testing with a broken WARC file. But still I'm unable to use the v10...
You need to be registered to submit a bug, Olaf. Go here: http://webteam.archive.org/jira/secure/Dashboard.jspa. Let us know if it doesn't work for you. ...
Could someone familiar with the Heritrix Arch help me out? From the CrawlURI, I'm trying to find the "depth" of the current URI from the seed URI. If the...
... I've been practicing the migration over on the archive-access sister project and it's been taking up a bunch of time. The move from m1 to m2, in effect,...