archive-crawler
Messages

Messages 2895 - 2924 of 6147
2895
Hi St.Ack, Thanks for your very very quick reply. ... with regards, Shadab ... ...
callforshadab
Jun 1, 2006
7:06 am
2896
Hi all, (I am not sure, but I think that) the current HTMLExtractor or other extractors extract only links from web pages, probably using regular expressions. i...
callforshadab
Jun 1, 2006
7:33 am
2897
Thanks for the reply, St.Ack. I think you're right. Although I got these alerts, I don't think the crawl terminated because of that. Because on re-running the...
jonathansiddharth
jonathansidd...
Jun 1, 2006
12:50 pm
2898
For sure it's hung? You seem to be crawling Wikipedia this time and in your last mail. Is this all you are crawling? If so, retries because of the below...
Michael Stack
stackarchiveorg
Jun 1, 2006
3:32 pm
2899
... That's right. ... Current extractors don't do this. They just find links. You'll need to amend them or do your own processor; perhaps one that first...
Michael Stack
stackarchiveorg
Jun 1, 2006
3:38 pm
2900
Some other wrinkles to consider: ... This would be the ideal solution if you want *all* extracted URLs, whether they are eligible to be crawled (in-scope) or...
Gordon Mohr
gojomo
Jun 1, 2006
7:13 pm
2901
Yes, that is all I'm crawling (Wikipedia pages), and archiving just the history pages of featured articles. I'm starting the crawl from the featured articles home...
jonathansiddharth
jonathansidd...
Jun 1, 2006
11:05 pm
2902
... One thing to keep in mind is that the % completion shown in the web UI is a very rough/flawed estimate -- almost certain to be a massive underestimate in...
Gordon Mohr
gojomo
Jun 1, 2006
11:47 pm
2903
Thanks Gordon for your reply. I was using a bunch of decide rules and one of them was discarding hops greater than 2. That might explain a whole lot of URLs...
jonathansiddharth
jonathansidd...
Jun 2, 2006
3:25 am
2904
... Maybe, but: the decide-rules are applied even before URLs are scheduled, so I wouldn't expect to see a low percent-complete number (high 'queued' URL...
Gordon Mohr
gojomo
Jun 2, 2006
8:10 am
2905
Folks, I need to set up Heritrix to run under Apache, such that Apache is forwarding requests to Heritrix, and the Heritrix UI cannot be accessed directly from...
Karl Wright
daddywri
Jun 2, 2006
2:13 pm
2906
... You could use WAR version of Heritrix and host it in tomcat. Or it looks like you could make Heritrix UI go via AJP (mod_jk or mod_jk2) by changing the...
Michael Stack
stackarchiveorg
Jun 2, 2006
4:54 pm
2907
Has anyone tried Libarc library on Suse Linux? I am getting errors when I am trying to make the library. I am interested in arcdump utility and wondering if...
anand_akela
Jun 2, 2006
9:52 pm
2908
I found the problem, hacked misc.h and was able to compile and run the arcdump utility. Somehow strerror_r was returning a char*, but configure script...
anand_akela
Jun 3, 2006
6:25 am
2909
We just finished a large scale crawl using Heritrix. We've crawled 1 billion URLs within 3 months. We'd like to share with the list on the crawling experience....
joehung302
Jun 3, 2006
7:18 am
2910
Congratulations! It really is a big achievement....
Siddharth Shah
iamsidd
Jun 3, 2006
10:36 am
2911
I've installed Sun's Java on my CentOS system. Untarred the Heritrix tarball, but when I run the installation stuff according to the User Manual I get all...
traef06
Jun 4, 2006
1:13 pm
2912
Hi, 1. You need Maven to build Heritrix (in case you don't have it, do build it from links given on Heritrix's homepage) 2. In case you have Maven, try doing ...
Siddharth Shah
iamsidd
Jun 4, 2006
5:28 pm
2913
Did you download source or binary? St.Ack...
Michael Stack
stackarchiveorg
Jun 4, 2006
5:32 pm
2914
Hi Group! Meta-newbie here. Understand that Heritrix is not currently supported on OS X, but have also seen some posts referencing that as a pure Java app it...
robertmothershead
robertmother...
Jun 4, 2006
5:54 pm
2915
Source. I have never built from Java source before. I thought I had downloaded the binary. Thank you for helping me remove my head from my ... ...
Thomas Raef
traef06
Jun 4, 2006
6:02 pm
2916
... Works for me. ... I've been setting JAVA_HOME=/usr St.Ack...
Michael Stack
stackarchiveorg
Jun 4, 2006
6:22 pm
2917
Hello Michael Thank you for your post. This will be the point where the newbie illustrates the essence of newbie-ness. I'm a mere information researcher, not a...
Robert Mothershead
robertmother...
Jun 4, 2006
7:48 pm
2918
... The variable name is 'JAVA_HOME', not 'java_home'. Try: set JAVA_HOME=/usr See if that works, St.Ack...
Michael Stack
stackarchiveorg
Jun 4, 2006
7:59 pm
2919
Michael, Many thanks for your tips. I have a feeling I am doing something fundamental and waffleheaded, but have little or no knowledge of just how and where...
Robert Mothershead
robertmother...
Jun 5, 2006
1:41 pm
2920
... No worries. ... The plist file looks like info the os needs to support double-clicking the application. You'll be running Heritrix from the command-line....
Michael Stack
stackarchiveorg
Jun 5, 2006
4:04 pm
2921
Hi Michael, Heritrix is up and running. Thank you very much for your assistance. I really appreciate you hanging in with me on this... This is what I had to do...
Robert Mothershead
robertmother...
Jun 5, 2006
4:56 pm
2922
I'm glad you got it running. Thanks for posting the steps below. St.Ack...
Michael Stack
stackarchiveorg
Jun 5, 2006
5:04 pm
2923
When crawling a single host, my download rate reaches only 2 URLs per second. I'd like to increase the rate until either the client or the server's resources...
jcr2102
Jun 6, 2006
1:40 am
2924
... Delay is calculated as the time the last fetch took, times the 'delay-factor', but no less than 'min-delay-ms' nor more than 'max-delay-ms'. So you should...
Gordon Mohr
gojomo
Jun 6, 2006
4:58 am
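
The delay rule described in message 2924 can be sketched roughly as below. This is only an illustration of the arithmetic as stated in the snippet (last fetch time multiplied by 'delay-factor', clamped between 'min-delay-ms' and 'max-delay-ms'), not Heritrix's actual code. The function name and the sample values are hypothetical and are not taken from the thread.

```python
def politeness_delay_ms(last_fetch_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Wait before the next fetch from the same host, in milliseconds.

    Hypothetical helper: mirrors the rule quoted in the message, nothing more.
    """
    # Scale the duration of the last fetch by the delay factor.
    raw_delay = last_fetch_ms * delay_factor
    # Clamp: never below min_delay_ms, never above max_delay_ms.
    return max(min_delay_ms, min(raw_delay, max_delay_ms))


if __name__ == "__main__":
    # Illustrative settings only (assumed values, not confirmed defaults).
    for fetch_ms in (100, 800, 2000):
        delay = politeness_delay_ms(fetch_ms, 5.0, 2000, 5000)
        print(fetch_ms, "ms fetch ->", delay, "ms delay")
```

Under this reading, a fast server is throttled mainly by the minimum delay, so lowering the factor alone has no effect while the scaled value stays below the minimum.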