We hadn't considered this need to crawl domains only resolvable by the proxy before. So, it would take some trickery (supply fake DNS to the crawler) or new...
Hi, I have tried to run Heritrix several times using the config mentioned in section 4 of the user manual to run the first crawl job. I have tried different...
Rahil Baig
rahil.baig@...
Apr 3, 2009 3:37 pm
5751
What is your... - Heritrix version - Operating System - Java version ...and how are you launching Heritrix? There have been a number of reports of similar...
Hi, I'm using Heritrix 1.14.3, Win XP Pro SP2, Java v6. I'm using the following script to initiate the crawler. There hasn't been any issue launching it in the...
Rahil Baig
rahil.baig@...
Apr 4, 2009 9:06 am
5753
Hi, I like to filter specific URLs and everything works really fine with RegExp in "fetch-processor -> midfetch-decide-rules". But I noticed in the logs that...
Hi, everything works fine: I deployed the Heritrix WAR in Tomcat 6 by importing the file, I can crawl, and the crawl terminates successfully. But where can I...
Does the Java VM on Windows consult the 'classpath' variable you're setting? I thought it had to be passed on the JVM invocation command-line. Did you try the...
The 'midfetch-decide-rules' are used only to abort a fetch midway (after the headers have been retrieved). The proper way to control which URIs are even...
By default, ARCs land in an 'arc' subdirectory of your job directory, which is itself usually in the 'jobs' directory of your installation. If you are not very...
I seem to be stuck. No matter what I try for my first web crawl, I ALWAYS wind up with no sites downloaded. No reports are available, but the job moves...
Hi Richard ... Since you do not provide any useful details at all, you cannot expect to receive a helpful answer. At least, please be so kind and provide the...
juergen@...
Apr 6, 2009 10:17 am
5760
I tried the bundled launch script as well (on win), and it still gave me the same alert WARNING *Message:* Value of illegal type:...
Rahil Baig
rahil.baig@...
Apr 6, 2009 10:23 am
5761
The "fetch-processor -> midfetch-decide-rules" are good for rejecting or accepting by MIME type before fully fetching a file because that is the only point...
Back when I started using Heritrix, one of the best resources I could find about limiting the scope of my crawl to certain document types was the following wiki...
Please include your version of Heritrix, Java, and Operating System, in case they're relevant. We'll also need more details about how you set up the crawl and...
Thanks for responding, Gordon. I am running Ubuntu 8.10 (Intrepid) and Java 1.6.0_0 IcedTea6 1.3.1 (6b12-0ubuntu6.4) Runtime Environment (build 1.6.0_0-b12)...
Thanks so much everyone, for responding. I'm delighted to get such helpful feedback, especially concerning the preprocessor PreconditionEnforcer. I must...
Did you launch your 'all defaults' crawl by starting from the bundled, never-edited 'default' profile? Have you tried Sun's JDK? Is there anything in...
Hi Gordon Yes, I launched my 'all defaults' crawl by starting from the bundled, never-edited 'default' profile. This has always been my result in trying to...
Richard - Please 'reply' to previous messages so all traffic on this topic lands in the same thread. I don't see any problems with your order.xml. The only way...
... Hi Gordon Well - that WAS the key! After I moved over to java-6-sun, I began seeing results. This was a breakthrough! Eventually I figured out I had...
I'm having the same issue: the MirrorWriter is dropping the "?" for some reason. Do I need to use the ARCWriter instead? I mean, isn't the MirrorWriter...
Hi there, I'm trying to write a program which will use Heritrix to crawl for semantic web documents, but I'm having difficulty understanding what needs to be...
Hi Jamie, to run Heritrix1 in Eclipse: - create a new project from SVN File > New > Other > Checkout projects from SVN Create a new repository location: ...
The 3rd revision of the concurrency improvement patch has been uploaded. I have tested this version for 9 different wide crawls (500,000+ seeds) and it has...
A couple of corrections: 1) the "-a" argument for H1 should be "username:password" selected by the user, and for H2, the "password" selected by the user. 2)...
... Thank you all for the replies. Ah ok, the first String you're passing to the main will be your username:password, but what exactly are the other 3?...
Dear all, I couldn't open "An Introduction to Heritrix.pdf" in Acrobat 6.0 or 9.0 because of many errors, such as cannot extract the embedded font, cannot display...