archive-crawler
Messages

Messages 6149 - 6178 of 6178
6149
... not in java, but ruby: http://anemone.rubyforge.org/ just use from shell: $ anemone url-list http://crawler.archive.org > seeds.txt -- ...
raffaele messuti
raffaele@...
Nov 13, 2009
10:21 am
6150
Hi, at the National Library of Finland we have made some improvements to Heritrix: - document classification (content-based) deciderules support (e.g....
Tomas Ukkonen
tomas.ukkonen@...
Nov 13, 2009
11:38 am
6151
Good advice, thank you Gordon. Adding the recrawl processors to the chain bean and pointing PersistLoadProcessor directly to my existing history (no preload)...
Matthew Warhaftig
matthewwarha...
Nov 14, 2009
2:07 am
6152
I'd suggest opening up an issue (probably multiple issues in your case) in the crawler issue base (https://webarchive.jira.com/browse/HER) and to then attach...
kristsi25
Nov 16, 2009
10:38 am
6153
Hi, I want to know about IDN support in Heritrix. ("Internationalized domain name" http://en.wikipedia.org/wiki/Internationalized_Domain_Name) I tried to...
takeru sasaki
sasaki.takeru@...
Nov 16, 2009
11:27 am
6154
Hi, IA is hiring a tech lead for our new web crawling initiative. You can read the job description here: http://www.archive.org/about/webjobs.php#wwcengineer ...
Alexis
alexisrossi
Nov 17, 2009
11:15 pm
6155
Hi Kris, Thank you for your reply. I will use JIRA as you suggested and attach patches against the latest 1.14.x revision from the Heritrix repository. Regards, --...
Tomas Ukkonen
tomas.ukkonen@...
Nov 18, 2009
3:57 pm
6156
I reinstalled Heritrix and got the same message again after configuring my first job: Wrong document type 'crawl-order' in...
parseram34
Nov 19, 2009
8:01 am
6157
Hi all. Is it possible in Heritrix 1.14.3 to avoid overloading webservers hosting many virtual servers? We currently have the problem that those webservers...
Søren Vejrup Carlsen
svc400
Nov 19, 2009
3:03 pm
6158
Hi, It is not actually possible to guarantee this. There is no real way to distinguish for sure where the actual physical hardware that hosts a name is. There...
John Lekashman
lekash
Nov 19, 2009
3:41 pm
6159
John pretty much sums up the problem. The way I've dealt with this has been on a case by case basis. Each time we detect this situation, we override the...
kristsi25
Nov 19, 2009
4:08 pm
6160
Our main challenge is that we need the queues separated on TLD (foo.com and bar.com) to use the quota-enforcer to limit the number of bytes on each TLD but at the...
Bjarne Andersen
bjarne_dk2000
Nov 19, 2009
4:44 pm
6161
Hey crawlers: I was at ApacheCon in Oakland in early November and was present during a meeting of a few of the open source crawler projects (Ken Krugler for Bixo, ...
stack
stackarchiveorg
Nov 20, 2009
4:48 pm
6162
Hi Takeru, Heritrix automatically converts Internationalized Domain Name seeds and discovered links to punycode. However, I recreated your issue of certain...
Matthew Warhaftig
matthewwarha...
Nov 21, 2009
5:17 pm
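
For readers unfamiliar with the punycode conversion mentioned in message 6162, the following is a minimal sketch of what converting an Internationalized Domain Name to its ASCII form looks like. It uses Python's standard `idna` codec (IDNA 2003) purely as an illustration; it is not Heritrix's code, and the function names and the `bücher.example` hostname are assumptions made up for this example.

```python
def to_punycode(hostname: str) -> str:
    """Return the ASCII (punycode) form of an internationalized hostname.

    Hypothetical helper for illustration; relies on Python's built-in
    "idna" codec, not on anything from Heritrix.
    """
    return hostname.encode("idna").decode("ascii")


def from_punycode(hostname: str) -> str:
    """Return the Unicode form of an ASCII (punycode) hostname."""
    return hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    # Labels containing non-ASCII characters gain an "xn--" prefix;
    # plain ASCII labels are left unchanged.
    print(to_punycode("bücher.example"))
    print(from_punycode("xn--bcher-kva.example"))
```

A seed list containing IDN hostnames could be normalized this way before being handed to a crawler, which may help when tracking down seed-handling issues like the one discussed in this thread.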
6163
Hi all, Although I searched thoroughly in the group messages, my searches didn't end up with a solution. I am looking for a way to exclude a list of urls in...
cagtat
Nov 23, 2009
12:43 am
6164
The first Release Candidate test release of Heritrix 3.0 is now available, version identifier 3.0.0-RC1. We encourage expert Heritrix users curious about the...
Gordon Mohr
gojomo
Nov 23, 2009
11:32 am
6165
Hello. I am using Heritrix 2.0.2 with a configuration based on "Broad but shallow crawl". It seems that I've managed to set everything up OK. The one problem I...
Віталій...
tivv00@...
Nov 23, 2009
11:33 am
6166
Hi, I recently used Heritrix 1.14.3 and I do not understand how to limit the number of documents per host. The parameter max-document-download limits the total...
bourely
Nov 23, 2009
7:30 pm
6167
You should most likely use the QuotaEnforcer module, which allows you to set the number of documents (successful and total) and the number of bytes per queue. If at the...
Bjarne Andersen
bjarne_dk2000
Nov 23, 2009
8:33 pm
6168
Thank you Matt, I know about IDN in Heritrix. And I know about the problem with seeds. I am watching it. takeru 2009/11/22 Matthew Warhaftig <mwarhaftig@...>...
takeru sasaki
sasaki.takeru@...
Nov 24, 2009
4:19 am
6169
Hello Matt and Gordon, Following Gordon's advice and assuming HER-1706 to be fixed, I am using only two of the persistProcessors: load and store. As suggested,...
Pranay Pandey
sspranay
Nov 24, 2009
3:36 pm
6170
sorry about the broken link sent before. http://cs.odu.edu/~pramo_p/crawler-beans.cxml ... From: Pranay Pandey <sspranay@...> Subject: [archive-crawler]...
Pranay Pandey
sspranay
Nov 24, 2009
3:44 pm
6171
I see two potential issues in your order: - There's no FetchHistoryProcessor, which is still necessary to collect deduplication-relevant information and insert...
Gordon Mohr
gojomo
Nov 24, 2009
7:48 pm
6172
Hi all, This is not a "real" Heritrix issue but maybe somebody has an answer to this: I programmed 2 processors for Heritrix 2 and put my classes in some...
sendaman69
Nov 24, 2009
11:16 pm
6173
I am using Heritrix 1.14.3. I just want to use Heritrix to crawl some WSDL documents, but after several attempts, I found it too hard. My problem: ...
zhongkem@...
zhongkem...
Nov 25, 2009
10:07 am
6174
... FYI, we've started to use the term 'assignment-level-domain' (or 'ALD') for what you're calling 'TLD' here. Of course, a true 'TLD' is only the...
Gordon Mohr
gojomo
Nov 25, 2009
10:45 pm
6175
... FWIW, the paper on IRLBot used the term PLD (paid-level domain) for this concept, and we've adopted that for Bixo. ...
Ken Krugler
kkrugler
Nov 25, 2009
11:09 pm
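
The 'TLD' / 'assignment-level-domain' / 'paid-level domain' discussion in messages 6160, 6174 and 6175 is about grouping hostnames such as www.foo.com and img.foo.com under one key (foo.com) so that a single queue and quota apply to them. A rough sketch of that grouping, under a deliberately naive assumption, might look like the following. The function names are hypothetical and this is not how Heritrix, IRLBot or Bixo implement it: keeping the last two labels gives the wrong answer for suffixes like co.uk, where a public-suffix list would be needed.

```python
from collections import defaultdict


def naive_assignment_domain(host: str, labels: int = 2) -> str:
    """Return the last `labels` dot-separated labels of a hostname.

    Naive assumption for illustration only: treats the last two labels
    as the registered domain, which is wrong for suffixes like "co.uk".
    """
    parts = host.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])


def group_by_domain(hosts):
    """Group hostnames into per-domain buckets (one bucket per queue)."""
    queues = defaultdict(list)
    for host in hosts:
        queues[naive_assignment_domain(host)].append(host)
    return dict(queues)


if __name__ == "__main__":
    print(group_by_domain(["www.foo.com", "img.foo.com", "bar.com"]))
```

Grouping this way lets a per-queue byte or document limit apply to the whole domain, which is the trade-off against per-host politeness raised in the thread.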
6176
Takeru & Matt: Thanks for the report of this issue. I confirmed there were some problems with both the admin-UI editing of seeds, and the reading of seeds from...
Gordon Mohr
gojomo
Nov 25, 2009
11:26 pm
6177
... If <http://soamoa.org:9292/artistRegistry?WSDL> is the URL you want, you should add it as a seed -- then you don't have to wait for it to be discovered. -...
Gordon Mohr
gojomo
Nov 25, 2009
11:48 pm
6178
... Actually I've found the problem and it looks to me like a bug. The root:queue-assignment-policy setting is not used for queue assignment, the one...
Vitalii Tymchyshyn
tivv00@...
11:15 am
Copyright © 2009 Yahoo! Inc. All rights reserved.