archive-crawler
Messages 5437 - 5467 of 6142
5437
... quickly is ... great ... assuming right ... policy ... I did that and the OOME did not reappear. I cannot say which of the two options was the cause (since...
rhispm
Sep 1, 2008
7:23 am
5438
Hi: I downloaded and installed Heritrix 2 on my machine. I followed the guide for Version 2 and only changed contact url and email. My goal is to crawl urls...
liangjie.hong
Sep 1, 2008
9:59 pm
5439
It could be useful to have a bin/warcreader utility in Heritrix, like the existing arcreader, to make .cdx indexes of WARC files. For now I'm using the warc-indexer...
raffaele messuti
raffaele@...
Sep 2, 2008
10:39 am
5440
I ran into the problem again today and I think I have a more specific question to ask now. I found my processes in a situation where all ten toe threads were ...
Matt Kent
matthew.e.kent@...
Sep 2, 2008
5:46 pm
5441
... Glad to hear the problem cleared up. I would like to know if only one of the changes is enough to trigger the OOME -- there must be a bug here somewhere....
Gordon Mohr
gojomo
Sep 2, 2008
9:29 pm
5442
I haven't seen this before. FWIW, 1.14.1 moved to BDB-JE version 3.3.62 (whereas I believe 1.14.0 used BDB-JE 3.2.76). I don't have any specific reason to...
Gordon Mohr
gojomo
Sep 3, 2008
11:39 pm
5443
The WARC-reader bundled with our Wayback project will be the best one the Internet Archive has and what we use in our own projects. We'd like to hear of any...
Gordon Mohr
gojomo
Sep 3, 2008
11:43 pm
5444
Hi Jean-Noel, OK, I think I found the problem. When you create a new sheet Add Single Sheet ... [my-new-sheet] and click Submit, you drop right into Settings...
steve@...
stearcorg
Sep 4, 2008
1:48 am
5445
It's not an issue with your scoping/prefixes. I tried crawling <http://www.genealogy.ams.org/id.php?id=123>, and the result line in the crawl.log was: ...
Gordon Mohr
gojomo
Sep 4, 2008
2:56 am
5446
... options. It ... see if ... one of ... For what it's worth I've run into this bug using 'most-favored' robots policy and standard ARC sizes. Never thought...
Kristinn Sigurdsson
kristsi25
Sep 4, 2008
11:04 am
5447
Great Thanks! ... the ... code ... understand." ... browser, and ... doesn't ... the ... HTTP 'Accept' ... any ... smallest, ... add to ... the ... because it ...
liangjie.hong
Sep 5, 2008
4:18 pm
5448
Hello, everyone: I now have a problem crawling some websites. The problem is not consistent: it occurs on some computers but not on others...
yuanetking
Sep 6, 2008
4:04 pm
5449
Hi, I am using the Heritrix crawler to crawl through domains. The problem I am facing now is that when I try to crawl through some domains, even though the...
avinashnash
Sep 8, 2008
9:50 am
5450
... Hi, I'll have a look into my magic crystal ball to find the answer... ;-) ... Well, first of all check which urls aren't handled by heritrix. Then recheck...
Christian Krumm
chuk_ol
Sep 8, 2008
12:03 pm
5451
Hello! I tried to find this in the docs but it seems a rather uncommon use case to explicitly exclude some domain (including all subdomains) from a crawl. Any...
rhispm
Sep 8, 2008
3:09 pm
5452
Here you go. The XML object to not crawl these guys. Pretty straightforward. Use it, early and often. Where nocrawl-all.surt is the list of do not crawl. ...
lekash
Sep 8, 2008
7:05 pm
5453
Hi all, we are evaluating the option of running Heritrix on Amazon's EC2 servers. Is anyone else running the crawler on EC2, and how is your experience with it?...
alihoaliho
Sep 8, 2008
8:46 pm
5454
You should be fine on EC2. For storage, you can use their new Elastic Block Store - ...
Michael Giles
michael_a_giles
Sep 8, 2008
10:26 pm
5456
Hi, I have used a method in the past with version 1.12 of Heritrix where if you add a Beanshell processor after the pre-processor and reset any CrawlURI...
astar_t
Sep 9, 2008
2:14 pm
5457
We have been running Heritrix 1.14 on 4 EC2 machines for the last 4 months. It's working well, but when I compare the performance on our Paris and Amsterdam machines...
ravinder_vashist002
ravinder_vas...
Sep 9, 2008
4:10 pm
5458
Can you also give me some suggestions about how to distribute the seeds?...
alihoaliho
Sep 9, 2008
5:12 pm
5459
It seems that Heritrix 1.14.1 is the most recent release. The Javadoc claims it represents version 1.15.2, but I can find no evidence of said version in SVN or...
Matt Kent
matthew.e.kent@...
Sep 10, 2008
8:30 pm
5460
For anyone interested, I found the build hiding in Cruise Control: http://builds.archive.org:8080/cruisecontrol/artifacts/HEAD-heritrix/20080808001045/...
Matt Kent
matthew.e.kent@...
Sep 10, 2008
8:34 pm
5461
Hi, does anyone know if there's a way to retrieve the seed URL from a document in the ARC file? In other words, how to find out which original seed is a...
alihoaliho
Sep 10, 2008
8:52 pm
5462
At Wed, 10 Sep 2008 13:34:25 -0700, ... Hi Matt. It sounds like you know this, but this is presumably not a released version but a build from trunk (or HEAD)...
Erik Hetzner
e_hetzner
Sep 10, 2008
9:38 pm
5463
Versions of Heritrix from SVN TRUNK will have odd numbers in the second position, eg. the '15' in 1.15.2. (That's the current label of what's in SVN.) No...
Gordon Mohr
gojomo
Sep 10, 2008
10:15 pm
5464
Excellent, thanks for the explanation. It seems that 1.14.1 will be sufficient for me. Matt...
Matt Kent
matthew.e.kent@...
Sep 10, 2008
10:24 pm
5465
... down. ... Hi all! I have the same problem with the size of some of the logs created by WCT (Web Curator Tool). I can't upgrade the WCT version or ...
javier_azaola
Sep 11, 2008
4:27 pm
5466
Hi all! I have the same problem with the size of some of the logs created by WCT (Web Curator Tool). I can't upgrade the WCT version or the embedded Heritrix...
javier_azaola
Sep 11, 2008
4:27 pm
5467
Hi all. I’m experimenting with squeezing more crawlers out of a single JVM in Heritrix. (Background: it is possible to run multiple crawlers in a single JVM,...
Erik Hetzner
e_hetzner
Sep 11, 2008
7:15 pm