Hi all, I am not sure, but I think that the current HTMLExtractor or other extractors extract only links from web pages, probably using regular expressions. i...
Thanks for the reply St.Ack. I think you're right. Although I got these alerts, I don't think the crawl terminated because of that. Because on re-running the...
For sure it's hung? You seem to be crawling Wikipedia this time and in your last mail. Is this all you are crawling? If so, retries because of the below...
... That's right. ... Current extractors don't do this. They just find links. You'll need to amend them or do your own processor; perhaps one that first...
Some other wrinkles to consider: ... This would be the ideal solution if you want *all* extracted URLs, whether they are eligible to be crawled (in-scope) or...
Yes, that is all I'm crawling (Wikipedia pages... and archiving just the history pages of featured articles. I'm starting the crawl from the featured articles home...
... One thing to keep in mind is that the % completion shown in the web UI is a very rough/flawed estimate -- almost certain to be a massive underestimate in...
Thanks Gordon for your reply. I was using a bunch of decide rules and one of them was discarding hops greater than 2. That might explain a whole lot of URLs...
... Maybe, but: the decide-rules are applied even before URLs are scheduled, so I wouldn't expect to see a low percent-complete number (high 'queued' URL...
Folks, I need to set up Heritrix to run under Apache, such that Apache is forwarding requests to Heritrix, and the Heritrix UI cannot be accessed directly from...
... You could use the WAR version of Heritrix and host it in Tomcat. Or it looks like you could make the Heritrix UI go via AJP (mod_jk or mod_jk2) by changing the...
Has anyone tried the Libarc library on SUSE Linux? I am getting errors when I try to make the library. I am interested in the arcdump utility and wondering if...
I found the problem, hacked misc.h, and was able to compile and run the arcdump utility. Somehow strerror_r was returning a char*, but the configure script...
We just finished a large-scale crawl using Heritrix. We crawled 1 billion URLs within 3 months. We'd like to share our crawling experience with the list....
I've installed Sun's Java on my CentOS system. Untarred the Heritrix tarball, but when I run the installation stuff according to the User Manual I get all...
Hi, 1. You need Maven to build Heritrix (in case you don't have it, build it from the links given on Heritrix's homepage) 2. In case you have Maven, try doing ...
Hi Group! Meta-newbie here. Understand that Heritrix is not currently supported on OS X, but have also seen some posts referencing that as a pure Java app it...
Hello Michael, Thank you for your post. This will be the point where the newbie illustrates the essence of newbie-ness. I'm a mere information researcher, not a...
Michael, Many thanks for your tips. I have a feeling I am doing something fundamental and waffleheaded, but have little or no knowledge of just how and where...
... No worries. ... The plist file looks like info the OS needs to support double-clicking the application. You'll be running Heritrix from the command-line....
Hi Michael, Heritrix is up and running. Thank you very much for your assistance. I really appreciate you hanging in with me on this... This is what I had to do...
When crawling a single host, my download rate reaches only 2 URLs per second. I'd like to increase the rate until either the client or the server's resources...
... Delay is calculated as the time the last fetch took, times the 'delay-factor', but no less than 'min-delay-ms' nor more than 'max-delay-ms'. So you should...
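To make the delay rule above concrete, here is a minimal Python sketch of it. This is not Heritrix code: the function names and the sample numbers are made up for illustration, and the rate estimate additionally assumes that only one fetch at a time is in flight against the host.

```python
def politeness_delay_ms(last_fetch_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Delay before the next fetch from the same host, per the rule above:
    last fetch duration times 'delay-factor', but no less than
    'min-delay-ms' nor more than 'max-delay-ms'.
    (Hypothetical helper, not part of Heritrix.)"""
    raw = last_fetch_ms * delay_factor
    # Clamp into [min_delay_ms, max_delay_ms].
    return max(min_delay_ms, min(raw, max_delay_ms))


def estimated_urls_per_second(fetch_ms, delay_ms):
    """Rough single-host rate, ASSUMING fetches are strictly sequential:
    each URL costs its fetch time plus the delay that follows it."""
    return 1000.0 / (fetch_ms + delay_ms)


if __name__ == "__main__":
    # Illustrative settings only; check your own crawl order for real values.
    factor, min_ms, max_ms = 5.0, 500, 5000
    for fetch_ms in (100, 300, 2000):
        delay = politeness_delay_ms(fetch_ms, factor, min_ms, max_ms)
        rate = estimated_urls_per_second(fetch_ms, delay)
        print(f"fetch={fetch_ms}ms delay={delay}ms rate~{rate:.2f} URLs/s")
```

Under these assumptions, lowering the delay factor and the minimum delay shortens the wait between fetches, which is what raises the per-host rate.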