archive-crawler
Messages

  Messages Help
Advanced
Messages 5749 - 5778 of 6140
5749
We hadn't considered this need to crawl domains only resolvable by the proxy before. So, it would take some trickery (supply fake DNS to the crawler) or new...
Gordon Mohr
gojomo
Apr 1, 2009
9:24 pm
5750
Hi, I have tried to run Heritrix several times using the config from section 4 of the user manual to run the first crawl job. I have tried different...
Rahil Baig
rahil.baig@...
Apr 3, 2009
3:37 pm
5751
What is your... - Heritrix version - Operating System - Java version ...and how are you launching Heritrix? There have been a number of reports of similar...
Gordon Mohr
gojomo
Apr 3, 2009
9:41 pm
5752
Hi, I'm using Heritrix 1.14.3, Win XP Pro SP2, Java v6. I'm using the following script to initiate the crawler. There hasn't been any issue launching it in the...
Rahil Baig
rahil.baig@...
Apr 4, 2009
9:06 am
5753
Hi, I'd like to filter specific URLs, and everything works really well with RegExp in "fetch-processor -> midfetch-decide-rules". But I noticed in the logs that...
felizimm
Apr 4, 2009
5:05 pm
5754
Hi, everything works fine, I deployed the heritrix war in Tomcat 6 by importing the file, I can crawl and the crawl terminates successfully. But where can I...
felizimm
Apr 4, 2009
5:06 pm
5755
Does the Java VM on Windows consult the 'classpath' variable you're setting? I thought it had to be passed on the JVM invocation command-line. Did you try the...
Gordon Mohr
gojomo
Apr 5, 2009
10:20 pm
5756
The 'midfetch-decide-rules' are used only to abort a fetch midway (after the headers have been retrieved). The proper way to control which URIs are even...
Gordon Mohr
gojomo
Apr 5, 2009
10:24 pm
5757
By default, ARCs land in an 'arc' subdirectory of your job directory, which is itself usually in the 'jobs' directory of your installation. If you are not very...
Gordon Mohr
gojomo
Apr 5, 2009
10:26 pm
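As a rough illustration of the default layout described in message 5757 (ARCs in an 'arc' subdirectory of the job directory), the sketch below lists the ARC files under a given job directory. The class name `ArcLocator`, the `jobs/example-job` fallback path, and the `.arc` / `.arc.gz` file suffixes are assumptions made for illustration; they are not taken from the thread, and an installation configured differently may write its ARCs elsewhere.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ArcLocator {

    /**
     * Returns the sorted names of ARC files found in <jobDir>/arc.
     * Returns an empty list if that subdirectory does not exist.
     * The suffixes matched here (.arc, .arc.gz) are an assumption.
     */
    static List<String> listArcs(Path jobDir) throws IOException {
        Path arcDir = jobDir.resolve("arc");
        List<String> names = new ArrayList<>();
        if (!Files.isDirectory(arcDir)) {
            return names;
        }
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(arcDir, "*.{arc,arc.gz}")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real job directory as the first argument.
        Path jobDir = Paths.get(args.length > 0 ? args[0] : "jobs/example-job");
        for (String name : listArcs(jobDir)) {
            System.out.println(name);
        }
    }
}
```

Run it with the job directory as the only argument. If nothing is printed, either the crawl wrote no ARCs or the job writes them somewhere other than the assumed default location.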
5758
I seem to be stuck. No matter what I try for my first web crawl, I ALWAYS wind up with no sites downloaded. No reports are available, but the job moves...
bowser.richard
Apr 6, 2009
5:04 am
5759
Hi Richard ... Since you do not provide any useful details at all, you cannot expect to receive a helpful answer. At least, please be so kind as to provide the...
juergen@...
Apr 6, 2009
10:17 am
5760
I tried the bundled launch script as well (on win), and it still gave me the same alert WARNING *Message:* Value of illegal type:...
Rahil Baig
rahil.baig@...
Apr 6, 2009
10:23 am
5761
The "fetch-processor -> midfetch-decide-rules" are good for rejecting or accepting by MIME type before fully fetching a file because that is the only point...
pbaclace
Apr 6, 2009
8:48 pm
5762
Back when I started using Heritrix, one of the best resources I could find about limiting the scope of my crawl to certain document types was the following wiki...
pandae667
Apr 6, 2009
9:38 pm
5763
Please include your version of Heritrix, Java, and Operating System, in case they're relevant. We'll also need more details about how you set up the crawl and...
Gordon Mohr
gojomo
Apr 7, 2009
12:28 am
5764
Thanks for responding, Gordon. I am running Ubuntu 8.10 (Intrepid) and Java 1.6.0_0 IcedTea6 1.3.1 (6b12-0ubuntu6.4) Runtime Environment (build 1.6.0_0-b12)...
bowser.richard
Apr 7, 2009
3:34 am
5765
You should do a test crawl with all defaults first to confirm a basic crawl works, then make changes one-by-one. We don't recommend use of the old...
Gordon Mohr
gojomo
Apr 7, 2009
5:10 am
5766
Thanks so much everyone, for responding. I'm delighted to get such helpful feedback, especially concerning the preprocessor PreconditionEnforcer. I must...
bowser.richard
Apr 7, 2009
3:50 pm
5767
Did you launch your 'all defaults' crawl by starting from the bundled, never-edited 'default' profile? Have you tried Sun's JDK? Is there anything in...
Gordon Mohr
gojomo
Apr 7, 2009
5:18 pm
5768
Hi Gordon Yes, I launched my 'all defaults' crawl by starting from the bundled, never-edited 'default' profile. This has always been my result in trying to...
bowser.richard
Apr 8, 2009
11:34 am
5769
Richard - Please 'reply' to previous messages so all traffic on this topic lands in the same thread. I don't see any problems with your order.xml. The only way...
Gordon Mohr
gojomo
Apr 8, 2009
9:51 pm
5770
... Hi Gordon Well - that WAS the key! After I moved over to java-6-sun, I began seeing results. This was a breakthrough! Eventually I figured out I had...
bowser.richard
Apr 9, 2009
4:09 am
5771
I'm having the same issue, the MirrorWriter is dropping the "?" for some reason. Do I need to use the ARCWriter instead? I mean, isn't the MirrorWriter...
dan.gold00
Apr 15, 2009
11:53 pm
5772
Hi there, I'm trying to write a program which will use Heritrix to crawl for semantic web documents, but I'm having difficulty understanding what needs to be...
jamie_condon@...
jamie_condon...
Apr 16, 2009
4:21 pm
5773
... import org.archive.crawler.Heritrix; ... public static void main(String[] args) throws Exception { Heritrix.main(new String[]{"-a","PWD", ...
Juergen Umbrich
juergen@...
Apr 16, 2009
4:25 pm
5774
hi Jamie, to run Heritrix1 in Eclipse: - create a new project from SVN File > New > Other > Checkout projects from SVN Create a new repository location: ...
steve@...
stearcorg
Apr 16, 2009
5:43 pm
5775
The 3rd revision of the concurrency improvement patch has been uploaded. I have tested this version for 9 different wide crawls (500,000+ seeds) and it has...
pbaclace
Apr 17, 2009
2:07 am
5776
a couple of corrections: 1) the "-a" argument for H1 should be "username:password" selected by the user, and for H2, the "password" selected by the user. 2)...
steve@...
stearcorg
Apr 17, 2009
6:42 pm
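To make the first correction in message 5776 concrete, here is a minimal sketch of assembling the "-a" argument pair for Heritrix 1 in the "username:password" form. Only the "-a" flag and that form come from the thread. The helper class `AdminArg`, its validation rule (non-empty username without a colon), and the sample credentials are hypothetical. The sketch only builds the array and does not call Heritrix, since the remaining launch arguments are elided in the messages above.

```java
public class AdminArg {

    /**
     * Builds the two-element argument pair {"-a", "username:password"}.
     * The validation below is an assumption for illustration, not a
     * documented Heritrix rule.
     */
    static String[] adminArgs(String username, String password) {
        if (username == null || username.isEmpty() || username.contains(":")) {
            throw new IllegalArgumentException(
                "username must be non-empty and must not contain ':'");
        }
        if (password == null || password.isEmpty()) {
            throw new IllegalArgumentException("password must be non-empty");
        }
        return new String[] {"-a", username + ":" + password};
    }

    public static void main(String[] args) {
        // Sample credentials are placeholders chosen for this sketch.
        String[] pair = adminArgs("admin", "secret");
        System.out.println(pair[0] + " " + pair[1]);
    }
}
```

The resulting pair would be placed at the front of whatever argument array is handed to the launcher; any further arguments depend on details not shown in these previews.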
5777
... Thank you all for the replies. Ah ok, the first String you're passing to the main will be your username:password, but what exactly are the other 3?...
jamie_condon@...
jamie_condon...
Apr 17, 2009
7:09 pm
5778
Dear all, I couldn't open "An Introduction to Heritrix.pdf" in Acrobat 6.0 or 9.0 because of many errors, such as cannot extract the embedded font, cannot display...
re_writing
Apr 20, 2009
5:38 pm