archive-crawler
Messages
Messages 1630 - 1659 of 6142
1630
I'm not using a mid-fetch filter, but rather Tom's URI regex exclude-filter in the scope. Now that I think about it, this problem started happening around...
gsellek
Mar 1, 2005
4:21 pm
1631
... https://sourceforge.net/tracker/index.php?func=detail&aid=1116204&group_id=73833&atid=539099 ... Thanks for the suggestion. We tried it on my friend's computer...
Christoph Spielmann
spielc
Mar 1, 2005
4:53 pm
1632
... We're looking into adding accounting that will allow the running of multiple crawls inside of a single running instance -- More to follow on this after it...
stack
stackarchiveorg
Mar 1, 2005
7:53 pm
1633
I have enough machine resources to start up another crawler instance (once it gets to this low thread parallelism state the CPU consumption drops way down) the...
Mike Schwartz
mfschwartz
Mar 1, 2005
8:13 pm
1634
... Sounds good. Let us know if you can think of something we should add to the crawler to help you implement this strategy (You might have suggestions for...
stack
stackarchiveorg
Mar 1, 2005
9:05 pm
1635
... I tried it. When the URIs were not in scope -- which was the case for the first few I tried -- I got a -5000 in the crawl.log: e.g. '20050301212535071...
stack
stackarchiveorg
Mar 1, 2005
9:46 pm
1636
I might have been misunderstanding what the Add to Frontier was supposed to do, and in light of your comment it makes sense. The frontier will be determined...
Rob Eger
robeger
Mar 1, 2005
10:31 pm
1637
Had some trouble with SurtPrefixScope (in 1.2.0 and HEAD), and I now believe the trouble is PEBKAC; please to verify? Summary: created a list of SURTs,...
Andy Boyko
andyboyko
Mar 1, 2005
10:52 pm
1638
... Yes. Adding to the seed list changes scope (and adds the URI to the queue). Adding via the Frontier screen does not alter scope. It just adds the item to...
stack
stackarchiveorg
Mar 1, 2005
11:01 pm
1639
We're looking to fill two positions here at the Internet Archive. The openings are in our webgroup, the team that works on Heritrix. ...
stackarchiveorg
Mar 1, 2005
11:56 pm
1640
Hi everybody! I've got another question (I know... again ;)): If I understood it correctly, every time an exception is thrown during the crawl, the crawler waits...
Christoph Spielmann
spielc
Mar 2, 2005
10:42 am
1641
... Hmm, I just noticed that I was wrong, because I had a queue snoozing but it still continued to crawl (I guess it just stopped crawling last time because there...
Christoph Spielmann
spielc
Mar 2, 2005
12:12 pm
1642
Queues can be snoozed to enforce politeness (usually just a few seconds); this is controlled by the delay-factor, max-delay-ms, and min-delay-ms settings on the...
Kristinn Sigurdsson
kristsi25
Mar 2, 2005
12:42 pm
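
The preview of message 1642 is cut off, so the following is only a rough sketch of how the three politeness settings it names are usually described as interacting: the wait scales with the duration of the last fetch and is then clamped between the minimum and maximum. The function name and the example values are hypothetical and are not taken from the Heritrix source.

```python
def politeness_delay_ms(fetch_duration_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Sketch: how long to snooze a queue after a fetch, in milliseconds.

    Assumption: delay = delay_factor * last fetch duration,
    clamped to the range [min_delay_ms, max_delay_ms].
    """
    # Scale the wait by how long the last fetch took.
    delay = delay_factor * fetch_duration_ms
    # Never wait less than the minimum or more than the maximum.
    return int(max(min_delay_ms, min(delay, max_delay_ms)))


# Illustrative values only (not claimed to be Heritrix defaults).
print(politeness_delay_ms(200, 5.0, 500, 5000))
```

Under this assumption a slow server is automatically contacted less often, while the minimum and maximum keep the snooze within a few seconds either way.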
1643
The most common causes of retryable errors are network connection failures or drops -- it's presumed they may be transient conditions that will improve in a...
Gordon Mohr (@Interne...
gojomo
Mar 2, 2005
6:19 pm
1644
The below should be fixed as part of the recent commit of an updated DomainSensitiveFrontier (DSF adds filters midcrawl that stop further downloads from a host...
stack
stackarchiveorg
Mar 2, 2005
6:39 pm
1645
Yes: it's never been intended to use SURT form to specify seeds, only scopes. To be more precise, what is used to specify scope are SURT *prefixes*, which may...
Gordon Mohr (Internet...
gojomo
Mar 3, 2005
1:09 am
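
To illustrate the distinction in message 1645 between ordinary URIs (seeds) and SURT prefixes (scope), here is a simplified sketch. It assumes the common SURT layout with the host labels reversed and comma-separated inside parentheses; it ignores ports, user info, and whatever normalization Heritrix itself applies. The helper names `to_surt` and `in_scope` are hypothetical.

```python
from urllib.parse import urlsplit


def to_surt(uri):
    """Sketch: rewrite a URI with its host labels reversed, SURT-style."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    # www.archive.org -> org,archive,www,
    reversed_host = ",".join(reversed(host.split("."))) + ","
    rest = parts.path or "/"
    if parts.query:
        rest += "?" + parts.query
    return f"{parts.scheme}://({reversed_host}){rest}"


def in_scope(uri, surt_prefixes):
    """A URI is in scope if its SURT form starts with any configured prefix."""
    surt = to_surt(uri)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


prefixes = ["http://(org,archive,"]  # example prefix covering archive.org hosts
print(to_surt("http://www.archive.org/about/"))
print(in_scope("http://crawler.archive.org/x", prefixes))
```

Because the host is reversed, a single prefix can cover a domain and all of its subdomains, which is why the prefix form suits scopes rather than seeds.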
1646
Hi, how did you deal with the "https" URIs in ProxyViewer? Thanks! Yan ... 20041116233820 ... 20041116233821 ... lookup ... lookup ... original...
yzhang_il
Mar 4, 2005
9:17 pm
1647
I cannot browse "https" URLs using ProxyViewer. Has anyone had a similar problem? How did you deal with it? Thanks a lot! Best, Yan...
yzhang_il
Mar 4, 2005
9:18 pm
1648
Hi, If you use the WUI, you may add org.archive.crawler.filter.URIRegExpFilter on the Filter page (Scope->exclude-filter), then go to the Setting page and use a regular ...
yzhang_il
Mar 4, 2005
9:22 pm
1649
I can only find Heritrix 1.2.0 at http://sourceforge.net/projects/archive-crawler/. But I noticed that some people have already begun experimenting with Heritrix...
yzhang_il
Mar 7, 2005
4:28 pm
1650
... There is no 1.3.0 'release'. 1.3.0 is the 'version' number of builds made using the unreleased HEAD of the source tree (Our system for version numbering...
stack
stackarchiveorg
Mar 7, 2005
6:31 pm
1651
Sorry, "if-match-return" should be set to "True". You can also use this method to exclude some directories you don't need....
yzhang_il
Mar 8, 2005
6:14 am
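
The regex exclusion described in messages 1648 and 1651 can be sketched roughly as below. This is not the URIRegExpFilter code: it assumes the pattern must match the whole URI and that the filter returns the "if-match-return" value on a match and the opposite otherwise. The function names and the example pattern are hypothetical.

```python
import re


def uri_regexp_filter(uri, pattern, if_match_return=True):
    """Sketch of a regex filter: return if_match_return when the whole URI matches."""
    matched = re.fullmatch(pattern, uri) is not None
    return if_match_return if matched else not if_match_return


def is_excluded(uri, pattern):
    """Used as an exclude-filter: a True result means the URI is dropped."""
    return uri_regexp_filter(uri, pattern, if_match_return=True)


# Hypothetical pattern that excludes everything under a /private/ directory.
pattern = r".*/private/.*"
print(is_excluded("http://example.com/private/a.html", pattern))
print(is_excluded("http://example.com/public/a.html", pattern))
```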
1652
Hello! Now I'm trying to read the crawled HTML files using the API provided by Heritrix. While reading gzipped ARC files, I always get an error message "Failed...
Rev Tamas
bridgeman@...
Mar 13, 2005
12:54 am
1653
You may also want to check your path. I also use 1.2.0, and I can use arcreader to read ARC files. The command I used is like: /yourpath/heritrix-1.2.0/bin/arcreader...
yzhang_il
Mar 13, 2005
4:11 pm
1654
Hi all, I have edited the simple profile by adding overrides for a particular domain. In the override I just modified certain scope parameters such as I...
ranjeetbhatia1976
ranjeetbhati...
Mar 13, 2005
6:25 pm
1655
If you know something more about filters, do let me know :) The whole filter thing seems to be too confusing, and even after spending 4-5 hours today I am not...
ranjeetbhatia1976
ranjeetbhati...
Mar 14, 2005
4:40 am
1656
Thanks! So I need to use not the API but the command-line interface. Tamas...
Rev Tamas
bridgeman@...
Mar 14, 2005
11:36 am
1657
... I'd suggest following Yan's suggestion and getting the command line interface working first. Once that's working against your ARCs and it's not throwing 'Failed...
stack
stackarchiveorg
Mar 14, 2005
6:53 pm
1658
... You might check the files under YOUR_NEW_PROFILE/settings -- the location under which snippets of XML that represent the override are kept -- with...
stack
stackarchiveorg
Mar 14, 2005
7:17 pm
1659
... Okay, now the command line interface is working. If I set a proper offset, there will be no GZIP MAGIC complaint :) Now I can go on using arcreader. thx: ...
Rev Tamas
bridgeman@...
Mar 15, 2005
12:10 am
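
The "GZIP MAGIC" complaint in message 1659 goes away once reading starts at a proper offset. A minimal sketch of why, assuming a gzipped ARC file is a sequence of independent gzip members: decompression only works from an offset where a member begins, and every member begins with the two gzip magic bytes. This is not the arcreader implementation, and the helper names are hypothetical.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def starts_gzip_member(data, offset):
    """True if the bytes at offset look like the start of a gzip member."""
    return data[offset:offset + 2] == GZIP_MAGIC


def read_member(data, offset):
    """Decompress one gzip member at offset; return (payload, compressed length)."""
    if not starts_gzip_member(data, offset):
        raise ValueError(f"no gzip magic at offset {offset}")
    decompressor = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
    payload = decompressor.decompress(data[offset:])
    # Bytes after the end of this member are left in unused_data.
    consumed = len(data) - offset - len(decompressor.unused_data)
    return payload, consumed


# Two concatenated members stand in for two records of a gzipped file.
data = gzip.compress(b"first record") + gzip.compress(b"second record")
payload, length = read_member(data, 0)
print(payload, read_member(data, length)[0])
```

The compressed length of one member gives the offset of the next, so valid offsets can be found by walking the file member by member.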