I am using heritrix-1.10.1 on Windows XP to crawl a local site. It works well, but the WUI shows the crawl as 99% completed and it never finishes. Then I...
Read about the ARC file format here, http://crawler.archive.org/articles/developer_manual/arcs.html, in the Developer's Manual. It's not a zip file. Use gzip...
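To illustrate the zip-versus-gzip point, here is a minimal Python sketch that checks the leading bytes of a file and reads the first line of a gzip-compressed one. The function names and the sample file name are hypothetical, and it assumes the ARC file in question is gzip-compressed; it makes no claim about how Heritrix itself reads ARCs.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of any gzip member
ZIP_MAGIC = b"PK\x03\x04"     # local file header at the start of a zip archive

def container_type(path):
    """Guess the container from the leading bytes: 'gzip', 'zip', or 'unknown'."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"

def first_line(path):
    """Decompress just enough of a gzip file to return its first line as text."""
    with gzip.open(path, "rb") as f:
        return f.readline().decode("utf-8", errors="replace").rstrip("\r\n")
```

Calling container_type("sample.arc.gz") on a gzip-compressed file should return 'gzip', which is why zip tools refuse to open it while gzip-aware tools can.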
Hello all, During the course of crawling, ExtractorHTML will construct outlinks using any form action URLs that it encounters. Looking through the source...
On a related note, if, while discovering form action URLs with long parameter strings, the discovered outlink has length > 2083 (i.e., the max....
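The length condition mentioned above can be sketched as a simple filter. This is only an illustration in Python with hypothetical names; the 2083 figure is the one quoted in the message, and how Heritrix actually treats such outlinks is not stated in the snippet.

```python
MAX_URL_LENGTH = 2083  # limit quoted in the message above

def split_by_length(urls, limit=MAX_URL_LENGTH):
    """Partition urls into (kept, too_long) by character length."""
    kept, too_long = [], []
    for url in urls:
        # Only URLs strictly longer than the limit are set aside.
        (too_long if len(url) > limit else kept).append(url)
    return kept, too_long
```

Keeping the over-length URLs in a separate list, rather than silently dropping them, makes it easy to log what was skipped.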
I just had a crawl using the latest heritrix run out of disk space and it is now in an unusable state. This surprises me, as I've had crawls with 1.8 run...
Eric
ej@...
Feb 3, 2007 6:22 pm
3775
Once a crawl hits an out-of-disk condition, it may be in an unresumable and uncheckpointable state -- cleanly recovering from everywhere this might happen...
Ah, I just had to wait for all 10 timeouts to happen. Then I could pause and checkpoint just fine and now it's back to running as normal. However, I do have...
Eric
ej@...
Feb 3, 2007 8:11 pm
3777
That's a good idea. Don't run out of disk space. (Said the Seagate stockholder. Ok, not anymore - I sold it during the Veritas stock flip.) Point still holds....
... Never seen that. If you see anything interesting from your alerts, heritrix_out.log, or other diagnostics (like SIGQUIT or 'jstack' thread dump), please...
0 alerts for this job. Job never started it seems. Failed on loading order.xml put the crawler into a "Could" state. Will rerun this order.xml and see what...
OK, found the problem. Somehow, when the order.xml was sent, all I got for the name field was " <name/> ". It was lacking the first part, " <name> ". I'm going to guess...
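A quick pre-flight check for this kind of problem could look like the Python sketch below. It assumes the order file is plain XML with no default namespace and simply counts name elements that are empty, such as a self-closed <name/>. The function name is hypothetical and this is not part of Heritrix.

```python
import xml.etree.ElementTree as ET

def count_empty_name_elements(xml_text):
    """Count <name> elements with no text; raises ET.ParseError if the XML is malformed."""
    root = ET.fromstring(xml_text)
    return sum(1 for el in root.iter("name") if not (el.text or "").strip())
```

A non-zero result, or a ParseError, would suggest inspecting the order.xml before submitting the job.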
Hello everyone, The Crawl-by-Example plugin for Heritrix, developed as part of a Google Summer of Code project under the guidance of Gordon Mohr, is now released for ...
... Heritrix only does HTTP POST when configured to supply login credentials. See the 6.2.3. Credentials section on this page ...
Michael Stack
stack@...
Feb 6, 2007 7:30 pm
3785
Thank you. Nice diagram. What kind of changes did you make to libarc to make it run on windows? Were they just porting changes or were there any patches that...
Thanks. Yes, mostly porting changes were made to the original lib. I'll have to make sure that everything is correctly isolated for the next release, as so far...
I tried running Heritrix 1.10.2 on a server with a firewall and encountered errors. Both the GUIPORT and the JMXPORT are already...
Hi all, I'd be more than happy to integrate Regis's changes into libarc. When I get a spare minute, I'll take a look at them and update things appropriately....
Tom Emerson
TEmerson@...
Feb 7, 2007 2:02 pm
3789
Hey Alexis: You cannot reach the GUI through the firewall? I'd think that this at least should work. In standalone mode, Heritrix registers itself with the...
Michael Stack
stack@...
Feb 7, 2007 5:21 pm
3790
Hello all, I am trying to set up a cluster. There are several Heritrix-1.11.0 instances successfully running and registered at ... # java -cp hcc-0.2.0.jar...
Do you have archive-commons.jar in your CLASSPATH? (You can get one here, http://builds.archive.org:8080/cruisecontrol/buildresults/HEAD-heritrix, under the...
Michael Stack
stack@...
Feb 8, 2007 3:53 pm
3792
... Feb 8, 2007 8:35:21 PM org.archive.hcc.ClusterControllerBean init INFO: maxPerContainer setting: 5 javax.naming.NoInitialContextException: Need to specify...
You need a jndi.properties on your CLASSPATH. See the jndi.properties in Heritrix. At the head of the file is a comment describing, for example, a setup using JBoss. ...
Michael Stack
stack@...
Feb 8, 2007 4:58 pm
3794
You can find it here: http://www.zvents.com/labs/hdfs_writer_processor I've iterated on this a bit and have used it for a 5 million document crawl with no...
... Thank you very much! Connection is created. But the next exception is ... Feb 9, 2007 1:04:49 AM org.archive.hcc.ClusterControllerBean init INFO:...
Looking at the code, http://crawler.archive.org/hcc/xref/org/archive/hcc/ClusterControllerBean.html#1600, it looks like the setup of the proxy failed (line...
OK, I see the Credential-related classes. Now in the CredentialStore class there is a create() method. Can I use this method to programmatically create a...
Does Heritrix visit a page multiple times during a crawl? If so, under what conditions? If a page X is linked to from 3 other pages, for example, does Heritrix ...