I would like to be able to connect to an RDBMS while Heritrix is running. I was able to do this with 1.13 by deploying it into JBoss and allowing JBoss to manage...
I'm bringing up this issue again. First of all, we're now doing repeatable 5B+ crawling using Heritrix. Thanks to you guys. These days we've got quite a few...
... We put our crawler "signature" in Heritrix when we crawl the Internet. The signature is specified in the order.xml file. Heritrix, as a user agent, will...
Hi Gordon and nfoscarini: Thanks for your help. I am indeed running on Ubuntu Linux. Ubuntu (and Debian) now have full support for Sun's Java, and I have it...
Gordon Paynter
Gordon.Paynter@...
Apr 2, 2008 2:53 am
5097
This is an interesting topic, because I have not come across any Heritrix users (before now) who have been contacted. I have been crawling for months now with...
... As this only requires shuttling an externally-provided number into the existing min-delay-ms setting, it should be pretty simple, but was overlooked as we...
Hello folks, We are in the starting phase of a project, and we are currently wondering whether Heritrix or Nutch is the best choice of crawler for us. Our...
Svein Yngvar Willassen
svein@...
Apr 5, 2008 9:36 pm
5100
Hey all, I just wanted to let everyone know about something that has happened at my work that I've dubbed "the Heritrix effect." We've used Heritrix on several...
Micah Wedemeyer
mwedeme@...
Apr 7, 2008 2:32 pm
5101
... wondering ... I've never used Nutch, but I think Heritrix is more flexible given its plug-in design. ... You intend to control your crawlers using Hadoop?...
... No, we just want to use Hadoop to store and process the data from the crawlers. ... Heritrix Cluster Controller (hcc)? I've had a look at it, but must ...
Svein Yngvar Willassen
svein@...
Apr 7, 2008 4:28 pm
5104
... No, I don't think it was hcc. I don't think hcc is under development anymore, and I never really understood what problem hcc was supposed to solve. Arg.....
Thank you for your help. NetarchiveSuite certainly looks worth taking a closer look at. Best Regards, Svein ... -- Best Regards, Svein Y. Willassen ...
Svein Yngvar Willassen
svein@...
Apr 7, 2008 5:44 pm
5106
... No. HCC is just a simple tool for addressing a herd of heritrice as one; start/stop/monitor, etc. If you look back over the heritrix archives, you'll see...
... Yeah, you can check them out at: http://vitrue.com/ They concentrate on video-centric social media and bringing it to companies that want to generate buzz...
Micah Wedemeyer
mwedeme@...
Apr 7, 2008 9:27 pm
5108
Hi all: I am trying to crawl a website that requires a POST with some credentials as well as a challenge string. My first question is this: How can I (using...
Hi Folks, Do we have a wiki page describing the best practices and useful tips when setting up multiple Heritrix crawlers for doing large crawls? Such a page...
Hi, Do we have an architecture document illustrating architectural changes when moving from 1.x to 2.x? To start with a 1 or 2 page document with labelled...
Where do I configure proxy servers in Heritrix? Or do I need to develop a plug-in, or modify the source code? Thanks!...
何翔
calvin.he.84@...
Apr 9, 2008 9:10 pm
5114
I want to crawl the website of a given domain. When I start Heritrix with the default settings, it only starts one thread and crawls very slowly. Later I learned that it...
何翔
calvin.he.84@...
Apr 9, 2008 9:10 pm
5115
Hi, I was trying to run Heritrix-2.0.0 on Windows yesterday and ran into a slightly different problem. I followed the instructions for the jmxremote.password...
Hi, I'm trying to get the server started (using Ubuntu Server) and access its web interface. Launching worked apparently (messages end with "Web UI listening...
At Wed, 09 Apr 2008 18:59:10 -0000, ... The web UI is listening on the localhost (127.0.0.1), not the external interface (in your case, 67.207.136.21). You...
... Hi Bernard. Try connecting directly to http://67.207.136.21:8080/. For me this works. I can connect to your Web administrative Console from here. The path...