I tried to launch Heritrix from the Windows 2000 command line with the command: java -jar heritrix-1.4.0.jar but received the following message: Exception in...
... I've been thinking harder about this problem and have an idea for a possible solution (with some pointing in the right direction). If Heritrix were...
Matt Ittigson
cydatamatt@...
Sep 2, 2005 7:58 pm
2151
If you are interested in only getting those 7M seeds, wouldn't it be easier to just write a script in Perl to spin over them and fetch the content? -- Tom...
... Good question. I really like the Heritrix GUI. The JMX integration is a great way to track the progress of the spider from other processes. The coming ...
Matt Ittigson
cydatamatt@...
Sep 2, 2005 8:22 pm
2153
... That's a good enough reason to continue using it! Your idea sounded fine to me: a replacement scope and frontier seem to be in order. -- Tom Emerson...
Hi Matt, If you are planning to fetch only 7M URLs and not continue with traversing newly discovered URLs, you can use broad scope and import URIs as non-seeds...
... I'll try that first and get back to everyone. Thanks for the suggestion. -matt...
Matt Ittigson
cydatamatt@...
Sep 2, 2005 10:04 pm
2156
Hi Matt, Stack just brought up that you need to turn off the extractors. If you want embeds (images, frames, etc.) then leave extractors on, set up max-hops to...
... Part of our application is to grab the links found in each URL and dump them out into a separate file. Without the extractors, would curi.getOutLinks be...
Matt Ittigson
cydatamatt@...
Sep 2, 2005 11:16 pm
2158
... It'd be empty if there are no extractors. You need them. ... I doubt that running the extractors is your memory problem. The extractors have upper-bounds on the...
... Done exactly as suggested. ... 1,146,463 URLs loaded via the importUris JMX command with the heap climbing to 1,869,072 KB. I just started the job and...
Matt Ittigson
cydatamatt@...
Sep 3, 2005 2:18 am
2160
Hi, Are there any plans to have a distributed architecture, so that we can possibly run a large number of toeThreads on different machines to achieve concurrency...
Hi all: I'm a relative newcomer to the Heritrix software. I have a couple of questions about Heritrix's handling of HTML forms which I have been unable to answer...
Gordon Paynter
Gordon.Paynter@...
Sep 4, 2005 10:38 pm
2162
Attempted to deploy admin.jar on WLS but received the following log: [Deployer:149033]preparing application admin on myserver [Deployer:149033]prepared...
... This is correct - if the URI can be found in the action-field it will be encountered as a new seed (without parameters) ... Yes - even action="POST" will...
admin.war does not include heritrix jars. It's just a bundling of the jsps used by the UI (hence the ClassNotFoundExceptions). Try the heritrix.war that's up on...
... Yes. It's currently our highest priority. I'd point you at the wiki page that summarizes current outline thoughts on how we'd go about this but the content...
Thanks for your reply. Also, I have come across some web postings/links mentioning possible use of some of the nutch components (mapreduce/NDFS etc.), but...
It was deployed successfully and a login screen is available, but when I use admin/letmein to log in, the screen always stays on the login screen and takes me...
... Heritrix as a WAR relies on the container's auth. Check out its web.xml. Set up a login in your container for user 'admin' with role 'admin' however its...
... Where would I go about watching this? It seems pertinent given the information below. ... Before running out of memory, the final statistical count was...
Matt Ittigson
cydatamatt@...
Sep 6, 2005 6:27 pm
2170
... heritrix_out.log (The file that captures all errant stdout/stderr emissions). ... That is a small number of downloads. I'm guessing what's killing you is ...
... Perhaps you were reading of the nutchwax project (http://archive-access.sourceforge.net/projects/nutch/)? We're not thinking of putting Heritrix atop a...
Hi all: I've just built Heritrix version 1.5.1 from anonymous CVS. Is it possible to copy my Jobs and Profiles over from 1.4.0? I'm using Blackdown (i.e. Sun)...
Gordon Paynter
Gordon.Paynter@...
Sep 7, 2005 10:58 pm
2173
... You'll have to make a manual edit to your jobs and profiles. Here is an excerpt from the release notes for the as-yet-unreleased 1.6.0 version (For some reason...
Hi! I'm doing my diploma work at the European Archive. My task is to implement a module for the Heritrix crawler in order to crawl and store AV content (stream...
nicolas.baly@...
Sep 8, 2005 2:20 pm
2175
Heritrix currently only retrieves audiovisual content which is (1) served via HTTP; and (2) of finite length. We've created a page on the project wiki to...
It's encouraging to read this dialog. We have been researching the streaming-media capture problem at the Library of Congress and are eager to make progress on...
I'm just wondering whether anyone has written a filter or module to do incremental crawling. What I mean is something that will do a HEAD request on pages and...
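For anyone following the incremental-crawling question: the message above is cut off, so the following is only a rough sketch under the assumption that the idea is to issue a HEAD request and skip the full fetch when the page's validators (ETag or Last-Modified) have not changed since the last crawl. It is standalone Python, not a Heritrix filter or module, and the names head_validators and needs_refetch are made up for illustration.

```python
import urllib.request
from typing import Mapping, Optional


def head_validators(url: str, timeout: float = 10.0) -> dict:
    """Issue a HEAD request and return the ETag / Last-Modified headers (None if absent)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return {
            "etag": resp.headers.get("ETag"),
            "last_modified": resp.headers.get("Last-Modified"),
        }


def needs_refetch(previous: Optional[Mapping], current: Mapping) -> bool:
    """Decide whether a page should be downloaded again.

    previous: validators stored from the last crawl (None if never crawled).
    current:  validators returned by a fresh HEAD request.
    """
    if previous is None:
        return True  # never seen before, so fetch it
    # Prefer ETag; fall back to Last-Modified if ETag is missing on either side.
    for key in ("etag", "last_modified"):
        old, new = previous.get(key), current.get(key)
        if old is not None and new is not None:
            return old != new
    # No comparable validator: be conservative and fetch again.
    return True
```

A crawl loop would call head_validators(url) for each known URL, compare the result with whatever was stored last time via needs_refetch, and only do the full GET when it returns True. Servers that send neither header always trigger a refetch under this sketch.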