archive-crawler
Messages 5872 - 5901 of 6142
5872
Thanks for the feedback. Our main goals are: 1. achieving maximum use of commodity hardware. If we want to distribute 200k domains across say, 4 machines, we...
Joel Halbert
joel@...
Jun 1, 2009
10:39 am
5873
... Not really. At checkpoints, logs are rotated, so you could compress or move away early parts of the log. We have occasionally considered auto-compressing...
Gordon Mohr
gojomo
Jun 1, 2009
8:22 pm
5874
Thanks for the feedback. Our main goals are: 1. achieving maximum use of commodity hardware. If we want to distribute 200k domains across say, 4 machines, we...
Joel Halbert
joel@...
Jun 2, 2009
3:01 pm
5875
... This goal (considered alone) is best served by a large crawl with all seeds entered at the very start. Then the only things limiting the crawler will be...
Gordon Mohr
gojomo
Jun 3, 2009
1:09 am
5876
Thanks Gordon. This really helps. ... From: Gordon Mohr <gojomo@...> Reply-To: archive-crawler@yahoogroups.com To: archive-crawler@yahoogroups.com ...
Joel Halbert
joel@...
Jun 3, 2009
8:39 am
5877
Where can I set the path depth? I'm new and the manual is hard to decipher on this issue as there are multiple depth settings. I just want it to get the...
Aaron Kreider
aaronlk
Jun 4, 2009
3:10 am
5878
Hi Aaron, If the seeds for your crawl are the home pages of the several thousand websites, then you can try using the TooManyHopsDecideRule and set the...
Ko, Lauren
laurendko
Jun 4, 2009
4:59 pm
5879
Hi all. I'm still working with heritrix-1.14.1. I'm already ALMOST up to beginner status! At this time, I'm having problems when a crawl job hits its...
bowser.richard
Jun 7, 2009
1:22 am
5880
At Sun, 07 Jun 2009 01:21:45 -0000, ... Heritrix should close all open arc files on exit but often doesn’t, for some reason. In any case it is generally...
Erik Hetzner
e_hetzner
Jun 8, 2009
4:36 pm
5881
What Erik says, but also: if you can reliably reproduce a situation where the job finishes but ARCs are left with the ".open" suffix, please let us know, and...
Gordon Mohr
gojomo
Jun 8, 2009
6:52 pm
5882
The Internet Archive is planning to host a 'Heritrix Expert Summit' this fall in San Francisco, for advanced Heritrix crawl operators and developers to share...
Gordon Mohr
gojomo
Jun 9, 2009
8:53 pm
5883
A preview/alpha testing version of Heritrix 3.0 is now available. We encourage expert Heritrix users curious about upcoming changes to review this alpha and...
Gordon Mohr
gojomo
Jun 11, 2009
9:08 am
5884
I've seen the Tom Emerson's ... Does anyone have a copy of this backed up at all? I've been trying to access it, but it's returning 404...
Mark
froozle
Jun 12, 2009
10:05 pm
5885
Hi guys, I'm new to Heritrix and I've been wanting to configure 1.14.3 to download text only. I've gone to the user manual and also searched on the topic and...
Mark
froozle
Jun 12, 2009
10:06 pm
5886
... Fortunately, we have a thing called the Wayback Machine: ...
Gordon Mohr
gojomo
Jun 12, 2009
10:08 pm
5887
... Heh, funny, I completely forgot about the wayback machine :) I even tried accessing the site via google's cache to no avail...
Mark
froozle
Jun 12, 2009
11:02 pm
5888
Hi, I am using a C# application to launch a crawl job using the command line interface, but I need to include seeds for that job. Is there a way of providing...
nfoscarini
Jun 14, 2009
7:50 pm
5889
See Gordon's notes on adding URIs mid-crawl here: http://webarchive.jira.com/wiki/display/Heritrix/Adding+URIs+mid-crawl And an example of connecting to...
Joel Halbert
joel@...
Jun 15, 2009
9:07 am
5890
Hello, we are using Heritrix on a dual Opteron server with a fiber optic connection, but crawling speed is unfortunately very low. It is mostly around 130 KB/s,...
nukleonrus
Jun 17, 2009
9:12 pm
5891
... How many seeds are used to start your crawl, and what does the summary "QUEUES" section of the "frontier report" show (with totals of ...
Gordon Mohr
gojomo
Jun 18, 2009
7:29 am
5892
... seeds.txt contains about 22 000 domains, some of which could be unregistered. We are crawling on one server for now; we would like to set up a cluster of...
nukleonrus
Jun 18, 2009
11:01 am
5893
I am trying to crawl twitter to get a search query http://search.twitter.com/search?q=sotomayor as part of a new collection that the Library of Congress is...
Gina Jones
gmj2053
Jun 18, 2009
4:59 pm
5894
... Aha -- that some of the domains could be unresponsive (registered but not running an HTTP server) could be the real culprit. A normal fetchable URI can be...
Gordon Mohr
gojomo
Jun 18, 2009
6:22 pm
5895
... I believe the search API limit is 1500 statuses up to 1.5 weeks back, and the max rpp is 100, so you can get 15 pages of 100 statuses, or 100 pages of 15...
steve@...
stearcorg
Jun 18, 2009
6:28 pm
5896
... Are you just interested in a sampling, or are you hoping to capture every relevant tweet during the collection period? (If your chief aim was complete...
Gordon Mohr
gojomo
Jun 18, 2009
9:29 pm
5897
Relevant tweets as best as possible. I am certainly hoping that the division isn't expecting 100% but I want to get more than what we would get with the weekly...
Gina Jones
gmj2053
Jun 19, 2009
8:01 pm
5898
... We have filtered seeds.txt, so it now contains only sites that have registered DNS and respond on port 80 [around 10% of the list was removed]. We have...
nukleonrus
Jun 21, 2009
1:21 pm
5899
Hi people! I'm pretty new to heritrix, so please help me out. I've been using heritrix 2.0.3, and I have set it up, everything works fine, however, after a...
progre55
Jun 22, 2009
3:58 am
5900
... As this is logged as a 'nonfatal' error, if it is the only symptom, it shouldn't be a cause for concern. Is the real problem that there is no progress in...
Gordon Mohr
gojomo
Jun 22, 2009
9:11 pm
5901
... Hi Gordon, Yes, the progress stops, and as all the seeds are from the same domain, no other threads are run during the error. It tries different URIs every...
progre55
Jun 23, 2009
12:35 am