archive-crawler
Messages

Messages 6113 - 6142 of 6142
6113
The discovery path is explained here: http://crawler.archive.org/articles/user_manual/glossary.html#discoverypath best Bjarne Andersen netarchive.dk ...
Bjarne Andersen
bjarne_dk2000
Oct 15, 2009
8:18 pm
6114
the TransclusionDecideRule is explained here: http://crawler.archive.org/apidocs/org/archive/crawler/deciderules/TransclusionDecideRule.html It is used for...
Bjarne Andersen
bjarne_dk2000
Oct 15, 2009
8:22 pm
6115
err... I read the User/Developer Manual, nearly all the docs in the heritrix web. I'm confused that you said directly "NO" but asked my explanation of my ...
Lilei
bjtu.lilei...
Oct 16, 2009
3:06 am
6116
Hello Lilei, Can you explain what you mean by isPageKnown? And also how you want to use it, i.e. what you would do with the result of the operation? Noah...
Noah Levitt
nlevitt0
Oct 16, 2009
3:09 am
6117
Hi Noah, Let me put it this way: I want to know what Heritrix will do if it fetches a page that it has already seen, that is, the same page content but with different URLs....
Lilei
bjtu.lilei...
Oct 16, 2009
3:49 am
6118
Hi, Lilei. It's not completely clear what you're referring to without additional context, so it's hard to give a definitive answer. Are you referring to the...
Gordon Mohr
gojomo
Oct 16, 2009
4:04 am
6119
Hi, Mohr. Yes, a content-hash structure, that's exactly what I sought. Thank you, Mohr. 2009/10/16 Gordon Mohr <gojomo@...> ... -- ...
Lilei
bjtu.lilei...
Oct 16, 2009
4:24 am
6120
Hi, I am trying to run H3-beta using the 'r' command-line option. I am prompted with this message - "You must specify a password for the web interface using...
Pranay Pandey
sspranay
Oct 20, 2009
2:06 pm
6121
Heritrix 3 always launches with the web interface for monitoring and remote-control, and so your choice of administrator credentials must always be supplied at...
Gordon Mohr
gojomo
Oct 20, 2009
7:02 pm
6122
We're about to choose the final dates for a 2-3 day Heritrix Expert Summit in San Francisco, from among candidate dates in January-April of 2010. The idea of...
Gordon Mohr
gojomo
Oct 20, 2009
10:57 pm
6123
Hi, I noticed Heritrix has additional writers for HBase and Hadoop (to write crawled content only), but can I run a distributed crawler in a cluster? Thanks...
Freddy
fouad_efendi
Oct 21, 2009
7:29 pm
6124
Thanks Gordon, Using curl I am able to access a broader range of actions (build, launch, terminate, etc.). My goal is to set up cron jobs to launch the same job...
Pranay Pandey
sspranay
Oct 21, 2009
7:43 pm
6125
I would like to add some crawl profiles that would be available immediately after installing Heritrix without any additional steps by the person doing the...
pbaclace
Oct 22, 2009
8:13 pm
6126
Hello, I'm trying to get QueueOverbudgetDecideRule to work but I can't seem to manage it. Is this module still functional or maybe I have added it to a...
olintocattaneo
Oct 22, 2009
8:31 pm
6127
Hello guys, I've been using Heritrix to do some crawls with about 10 seeds. I find that I am getting excessively large amounts of trash in the data I collect....
tristram.bethea
Oct 23, 2009
2:04 am
6128
Hello all, I am new to this group. I am looking for a web crawler that can get the list of links on a web page and convert a downloaded website to an MHT file. It...
alphonse.smith16
alphonse.smi...
Oct 27, 2009
3:44 pm
6129
hi Tram, can you be more specific about what you consider "junk"? the default profile includes a TransclusionDecideRule which tells the crawler to transitively...
steve@...
stearcorg
Oct 27, 2009
4:05 pm
6130
Hi, Is there any documentation on the Heritrix implementation of WARC beyond just the source code? i.e. elements from the specification in-/excluded, which...
Coram, Roger
Roger.Coram@...
Oct 27, 2009
4:05 pm
6131
hi Roger, the latest versions of Heritrix deliver WARC output in the format "WARC File Format 1.0", which conforms to the ISO 28500 specification, an ISO standard...
steve@...
stearcorg
Oct 27, 2009
4:16 pm
6132
Hi, I had set up a crawl job to run for 6 hours using H3-beta. I had it configured to be least polite and the number of parallel queues was set to 5. After...
Pranay Pandey
sspranay
Oct 28, 2009
2:46 pm
6133
The 'queued' URIs are almost certainly on some hosts that are not responding. Heritrix is trying them every 15 minutes, but then putting them back on the queue...
Gordon Mohr
gojomo
Oct 29, 2009
12:55 am
6134
Could you please help me? I get the following error message in Heritrix: Wrong document type 'Crawl-order' in...
parseram34
Oct 29, 2009
10:02 am
6135
When and where does this error appear? (For example: at the time Heritrix is launched, at the time you try to start a crawl, at the time you edit settings,...
Gordon Mohr
gojomo
Oct 29, 2009
7:36 pm
6136
Hi Gordon. I can't find the tool to migrate 1.X configurations to 3.X style configurations. I have downloaded the heritrix-3.0.0-beta-dist.tar.gz from...
Søren Vejrup Carlsen
svc400
Oct 30, 2009
4:06 pm
6137
I rebooted and now the http://127.0.0.1:8080 address shows "Failed to connect". http://localhost:8080 doesn't work either. When I start the terminal its...
parseram34
Oct 31, 2009
6:12 pm
6138
Replying to myself just in case anyone competent missed this. Olinto...
olintocattaneo
Nov 2, 2009
12:40 pm
6139
After some testing I determined that conf/profiles is created lazily if either a new profile is created or if the default profile is edited in the web UI. To...
pbaclace
Nov 3, 2009
12:06 am
6140
Hi, In H3 I am trying to set up crawl jobs that use FetchHistoryProcessor/PersistStoreProcessor/PersistLoadProcessor to discard duplicate content. I can get...
Matthew Warhaftig
mwarhaftig@...
Nov 8, 2009
10:05 pm
6141
I have written a script that controls the build, launch and termination of jobs using curl commands. I pass along a parameter to the script telling it how long...
Pranay Pandey
sspranay
Nov 9, 2009
6:53 pm
6142
We got an email from a website owner who encountered many attempts from us. Heritrix ran with the default configuration. We searched the crawl.log file for the details...
nukleonrus
Nov 9, 2009
9:20 pm