archive-crawler
Messages
Messages 4879 - 4908 of 6142
4879
I have a problem similar to the one reported by "Over" on the web archiving blog at "http://wa.archive.org/blog/2007/12/07/heritrix-200-rc1-is-out/" I am using...
xavier_sautejeau
Jan 3, 2008
4:50 pm
4880
At Thu, 03 Jan 2008 15:23:40 -0000, ... This problem occurs because the latest Ubuntu has switched to using dash, a lightweight POSIX shell, as its primary shell (i.e....
Erik Hetzner
e_hetzner
Jan 3, 2008
5:16 pm
4881
Thanks, this does solve my problem. Not being a shell guru myself, I took the option of replacing the script headers on my local install of heritrix. Besides,...
xavier_sautejeau
Jan 4, 2008
9:15 am
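The header replacement described in message 4881 can be sketched as below. This is a minimal illustration, not the poster's actual change: the function name `rewrite_shebang` is hypothetical, and the assumption that the affected scripts start with `#!/bin/sh` and should start with `#!/bin/bash` is ours, since the messages are truncated.

```python
def rewrite_shebang(script_text: str,
                    old: str = "#!/bin/sh",
                    new: str = "#!/bin/bash") -> str:
    """Return script_text with its first line swapped from `old` to `new`.

    Assumption: the script's first line is exactly `old` (ignoring
    surrounding whitespace). Any other first line is left untouched.
    """
    first, sep, rest = script_text.partition("\n")
    if first.strip() == old:
        return new + sep + rest
    return script_text


if __name__ == "__main__":
    sample = "#!/bin/sh\necho starting\n"
    print(rewrite_shebang(sample), end="")
```

Only the first line is inspected, so a `#!/bin/sh` string appearing later in the script body is never rewritten.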
4882
What are all the dependencies for libarc? Is it completely independent of Heritrix? I am using an Ubuntu Gutsy (7.10), and I am trying to install libarc, but I...
dyamblor
Jan 8, 2008
12:24 am
4883
Thanks for all the responses, but in the meantime I've got carried away playing with another crawler which worked right out of the box. Since I'm not a Java...
Kevin Porter
kev@...
Jan 9, 2008
2:58 pm
4884
In 2.0 RC1, I applied a JDBCWriterProcessor. I added the JDBC info to the KeyManager like the following and was able to see the data show up in the sheets...
Daniel Clark
daniel_a_clark
Jan 9, 2008
7:21 pm
4885
Hi, Use ProcessorURI.getHttpMethod().getResponseHeader(String). -Paul...
Paul Jack
poetbeware
Jan 9, 2008
8:47 pm
4886
... The scope in the configuration below should work for this. ... You had the right idea in the configuration below, but processors consider a DecideRule that...
Paul Jack
poetbeware
Jan 9, 2008
9:08 pm
4887
Hi, Am I correct in assuming that the JDBC_DRIVER setting is going to be unchanged during the lifetime of a crawl? If that assumption is correct, then: 1. The...
Paul Jack
poetbeware
Jan 9, 2008
9:18 pm
4888
I am crawling a domain with a very large number of hosts and content. It appears that due to time constraints we may not be able to gather all content. Is there...
mjjjhjemj
Jan 9, 2008
9:24 pm
4889
... There's no general rule, and I doubt one could be arrived at for all crawls -- the web is so diverse, and it would depend on the seeds you're crawling and...
Gordon Mohr
gojomo
Jan 9, 2008
10:27 pm
4890
Correct. The driver will only change before a crawl is launched, not during. Thanks so much! I'll give it a try. From: archive-crawler@yahoogroups.com ...
Daniel Clark
daniel_a_clark
Jan 9, 2008
10:55 pm
4891
Hi Daniel, I think that you will have to write code to do this. If you want to use 1.12.1 out-of-box then you can do this with a beanshell processor which will...
Igor Ranitovic
iranitovic
Jan 9, 2008
11:04 pm
4892
Hi, I am using Heritrix-1.12.0 with Java 6 on a Windows XP system. When I crawled a site, I got an exception. 2008-01-10T09:26:54.718Z 200 79022 ...
jls_nayak1983
Jan 10, 2008
6:58 pm
4893
We recently found a workaround for this Windows-specific issue and applied it to the 'heritrix2' trunk in November, before the 'beta' release -- see issue: ...
Gordon Mohr
gojomo
Jan 10, 2008
8:36 pm
4894
Is arcreader supposed to return the offset of the URL record (ex. http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202) or the first...
dyamblor
Jan 11, 2008
5:41 pm
4895
At Fri, 11 Jan 2008 17:40:59 -0000, ... Yes. The offset is the position in the file where the header starts. The header is of indeterminate length: you will...
Erik Hetzner
e_hetzner
Jan 11, 2008
6:27 pm
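To make the offset discussion in messages 4894 and 4895 concrete, here is a rough sketch of reading an ARC version 1 URL-record header that starts at a given offset. It assumes uncompressed ARC bytes and a header that ends at the first newline; the names `ArcRecordHeader`, `parse_arc_header`, and `read_header_at` are hypothetical and are not part of arcreader or libarc.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip: str
    archive_date: datetime
    content_type: str
    length: int  # byte length of the record body that follows the header


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Split a v1 URL-record header into its five space-separated fields."""
    url, ip, date, content_type, length = line.strip().split(" ")
    return ArcRecordHeader(
        url=url,
        ip=ip,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


def read_header_at(data: bytes, offset: int) -> ArcRecordHeader:
    """Read the header starting at `offset`.

    Assumption: the header has no fixed length, so we scan forward
    to the first newline to find where it ends.
    """
    end = data.index(b"\n", offset)
    return parse_arc_header(data[offset:end].decode("ascii"))


if __name__ == "__main__":
    line = ("http://www.dryswamp.edu:80/index.html 127.10.100.2 "
            "19961104142103 text/html 202")
    print(parse_arc_header(line))
```

The record body begins immediately after the newline that terminates the header, so a reader would skip `length` bytes from there to reach the next record.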
4896
I would like to implement the HTML filtering noted in the following link. I didn't see the noted URIRegExpFilter in the sheet for version 2.0. How can I...
Daniel Clark
daniel_a_clark
Jan 15, 2008
9:37 pm
4897
Dear all. In the NetarchiveSuite project, we're in the process of migrating our Heritrix 1.10 templates from using the deprecated HostScope/DomainScope, and...
Søren Vejrup Carl...
svc400
Jan 16, 2008
4:36 pm
4898
Dear all. I'm in the middle of migrating our Heritrix 1.12.1 templates from the deprecated scopes to DecidingScope, and just found out that the filters have...
Søren Vejrup Carlsen
svc400
Jan 16, 2008
5:17 pm
4899
Hi Søren, In org.archive.crawler.deciderules, see: - MatchesListRegExpDecideRule - NotMatchesListRegExpDecideRule - TooManyPathSegmentsDecideRule Take care, ...
Igor Ranitovic
iranitovic
Jan 16, 2008
6:11 pm
4900
Hi Søren, I cannot remember the exact behavior of the PathScope; I will have to look it up. But I think that DecidingScope's seeds-as-surt-prefixes option will do...
Igor Ranitovic
iranitovic
Jan 16, 2008
7:01 pm
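As a rough illustration of what a SURT prefix derived from a seed looks like, the sketch below converts a URL to a simplified SURT form and trims it to a prefix at the last slash. It is a simplification written for this listing, not Heritrix's own code: it ignores ports, user info, and query strings, and the helper names `to_surt` and `surt_prefix` are hypothetical. The sample URL is the one quoted in message 4907.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: scheme://(reversed,host,labels,)path.

    Assumption: ports, user info, and query strings are dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    return f"{parts.scheme}://({labels},){path}"


def surt_prefix(url: str) -> str:
    """Trim the SURT form after its last slash, keeping the slash."""
    surt = to_surt(url)
    return surt[: surt.rfind("/") + 1]


def in_scope(candidate_url: str, prefixes: list[str]) -> bool:
    """A URL is accepted if its SURT form starts with any seed prefix."""
    surt = to_surt(candidate_url)
    return any(surt.startswith(p) for p in prefixes)


if __name__ == "__main__":
    prefixes = [surt_prefix("http://www.foo.com/baz/bar2.html")]
    print(prefixes)
    print(in_scope("http://www.foo.com/baz/other.html", prefixes))
    print(in_scope("http://www.foo.com/elsewhere/", prefixes))
```

Because matching is a plain string prefix test on the reversed-host form, a seed pointing at a page inside a directory admits everything under that directory and nothing above it.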
4901
I used the profile basic_seed_sites as a template and changed the max-retries from 30 to 1 and retry-delay-seconds from 900 to 30. I placed a link in the...
Daniel Clark
daniel_a_clark
Jan 16, 2008
9:53 pm
4902
Hi Igor, ... Could you, please, briefly explain why not? ... Thanks, Bert...
Bert Wendland
bwendland42
Jan 17, 2008
10:34 am
4903
Hello, Does anyone know if there is a way to do it? Using Heritrix 1.x, this is pretty straightforward, but with 2.0 I could not figure out how. Any pointers...
xavier_sautejeau
Jan 17, 2008
5:00 pm
4904
Hi, I've got heritrix embedded in another app, and I'm currently using the CrawlJobHandler and CrawlJob classes to minimize the amount of setup I have to do. ...
Micah Wedemeyer
mwedeme@...
Jan 17, 2008
5:46 pm
4905
Hi Bert, ... There are several things that I don't like about it. It is limited to URI matching rules, can be confusing and forces you to teach people ...
Igor Ranitovic
iranitovic
Jan 17, 2008
7:00 pm
4906
Hi, I'm looking for some feedback on the best approach to a crawling problem I need to solve. I have a list of URLs, some of which I only want to crawl that...
Travis Jensen
tajensen72
Jan 18, 2008
6:50 pm
4907
I don't think that you need to run separate jobs for each seed. For example, if you have two seeds such as: http://www.foo.com/baz/bar2.html ...
Igor Ranitovic
iranitovic
Jan 18, 2008
7:25 pm
4908
Hi, Thanks for your reply, Igor. Would this still be a preferred way if I have 1500 of these URLs? I would worry about the performance hit of every crawled...
Travis Jensen
tajensen72
Jan 18, 2008
8:09 pm