archive-crawler
Messages 2148 - 2177 of 6140
2148
I tried to launch Heritrix from the Windows 2000 command line with the command: java -jar heritrix-1.4.0.jar but received the following message: Exception in...
samhwg
Sep 2, 2005
1:27 am
2149
... You need other jars on the classpath. See the FAQ, http://crawler.archive.org/faq.html#windows, for help on how to run on Windows: St.Ack...
stack
stackarchiveorg
Sep 2, 2005
1:33 am
2150
... I've been thinking harder about this problem and have an idea for a possible solution (with some pointing in the right direction). If Heritrix were...
Matt Ittigson
cydatamatt@...
Sep 2, 2005
7:58 pm
2151
If you are interested in only getting those 7M seeds wouldn't it be easier to just write a script in Perl to spin over them and fetch the content? -- Tom...
Tom Emerson
tree02139
Sep 2, 2005
8:05 pm
2152
... Good question. I really like the Heritrix GUI. The JMX integration is a great way to track the progress of the spider from other processes. The coming ...
Matt Ittigson
cydatamatt@...
Sep 2, 2005
8:22 pm
2153
... That's a good enough reason to continue using it! Your idea sounded fine to me: a replacement scope and frontier seem to be in order. -- Tom Emerson...
Tom Emerson
tree02139
Sep 2, 2005
8:34 pm
2154
Hi Matt, If you are planning to fetch only 7M URLs and not continue with traversing newly discovered URLs, you can use broad scope and import URIs as non-seeds...
Igor Ranitovic
iranitovic
Sep 2, 2005
9:17 pm
2155
... I'll try that first and get back to everyone. Thanks for the suggestion. -matt...
Matt Ittigson
cydatamatt@...
Sep 2, 2005
10:04 pm
2156
Hi Matt, Stack just brought up that you need to turn off the extractors. If you want embeds (images, frames, etc.) then leave extractors on, set up max-hops to...
Igor Ranitovic
iranitovic
Sep 2, 2005
10:44 pm
2157
... Part of our application is to grab the links found in each URL and dump them out into a separate file. Without the extractors, would curi.getOutLinks be...
Matt Ittigson
cydatamatt@...
Sep 2, 2005
11:16 pm
2158
... It'd be empty if no extractors. You need them. ... I'd doubt the running of the extractors is your memory problem. The extractors have upper-bounds on the...
stack
stackarchiveorg
Sep 2, 2005
11:53 pm
2159
... Done exactly as suggested. ... 1,146,463 URLs loaded via the importUris JMX command with the heap climbing to 1,869,072 KB. I just started the job and...
Matt Ittigson
cydatamatt@...
Sep 3, 2005
2:18 am
2160
Hi, Are there any plans to have a distributed architecture, so that we can possibly run a large number of toeThreads on different machines to achieve concurrency...
Prasenjit
prasen_aol
Sep 3, 2005
10:36 am
2161
Hi all: I'm a relative newcomer to the Heritrix software. I have a couple of questions about Heritrix's handling of HTML forms which I have been unable to answer...
Gordon Paynter
Gordon.Paynter@...
Sep 4, 2005
10:38 pm
2162
Attempted to deploy admin.jar on WLS but received the following log: [Deployer:149033]preparing application admin on myserver [Deployer:149033]prepared...
samhwg
Sep 5, 2005
7:37 am
2163
... This is correct - if the URI can be found in the action-field it will be encountered as a new seed (without parameters) ... Yes - even action="POST" will...
Bjarne Andersen
bjarne_dk2000
Sep 5, 2005
9:50 am
2164
admin.war does not include heritrix jars. It's just a bundling of the jsps used by the UI (hence the ClassNotFoundExceptions). Try the heritrix.war that's up on...
stack
stackarchiveorg
Sep 5, 2005
7:52 pm
2165
... Yes. It's currently our highest priority. I'd point you at the wiki page that summarizes current outline thoughts on how we'd go about this but the content...
stack
stackarchiveorg
Sep 5, 2005
8:00 pm
2166
Thanks for your reply. Also, I have come across some web postings/links mentioning possible use of some of the Nutch components (MapReduce/NDFS etc.), but...
Prasenjit Mukherjee
prasen_aol
Sep 6, 2005
5:09 am
2167
It was deployed successfully and a login screen is available, but when I use admin/letmein to log in, the screen always stays on the login screen and takes me...
samhwg
Sep 6, 2005
6:26 am
2168
... Heritrix as a WAR relies on the container's auth. Check out its web.xml. Set up a login in your container for user 'admin' with role 'admin' however its...
stack
stackarchiveorg
Sep 6, 2005
4:53 pm
2169
... Where would I go about watching this? It seems pertinent given the information below. ... Before running out of memory, the final statistical count was...
Matt Ittigson
cydatamatt@...
Sep 6, 2005
6:27 pm
2170
... heritrix_out.log (the file that captures all errant stdout/stderr emissions). ... That is a small number of downloads. I'm guessing what's killing you is ...
stack
stackarchiveorg
Sep 6, 2005
7:01 pm
2171
... Perhaps you were reading of the nutchwax project (http://archive-access.sourceforge.net/projects/nutch/)? We're not thinking of putting Heritrix atop a...
stack
stackarchiveorg
Sep 6, 2005
11:51 pm
2172
Hi all: I've just built Heritrix version 1.5.1 from anonymous CVS. Is it possible to copy my Jobs and Profiles over from 1.4.0? I'm using Blackdown (i.e. Sun)...
Gordon Paynter
Gordon.Paynter@...
Sep 7, 2005
10:58 pm
2173
... You'll have to make a manual edit to your jobs and profiles. Here is an excerpt from the release notes for the as-yet-unreleased 1.6.0 version (For some reason...
stack
stackarchiveorg
Sep 7, 2005
11:33 pm
2174
Hi! I'm doing my diploma work at European Archive. My task is to implement a module for Heritrix crawler in order to crawl and store AV contents (stream...
nicolas.baly@...
Sep 8, 2005
2:20 pm
2175
Heritrix currently only retrieves audiovisual content which is (1) served via HTTP; and (2) of finite length. We've created a page on the project wiki to...
Gordon Mohr
gojomo
Sep 8, 2005
6:23 pm
2176
It's encouraging to read this dialog. We have been researching the streaming-media capture problem at the Library of Congress and are eager to make progress on...
Michael Ashenfelder
ashenfelder
Sep 9, 2005
4:09 pm
2177
I'm just wondering whether anyone has written a filter or module to do incremental crawling. What I mean is something that will do a HEAD request on pages and...
Dennis Hotson
dwh@...
Sep 12, 2005
3:11 am