archive-crawler
Messages

Messages 5342 - 5371 of 6142   Oldest  |  < Older  |  Newer >  |  Newest
5342
Is there a way with Heritrix to define a job with many starting points, one per site, then have the job run in a "balanced" manner so that each site gets its...
nickbirren
Jul 2, 2008
11:11 pm
5343
Heritrix does a pretty good job of running in a "balanced" manner by default. The list of urls yet to be crawled (the frontier) is divided into queues by host,...
Noah Levitt
nlevitt0
Jul 3, 2008
12:28 am
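
The per-host queue idea described in message 5343 can be illustrated with a small sketch. This is not Heritrix code: the function name `round_robin_by_host` is hypothetical, and a real frontier also applies politeness delays and other policies that are left out here. It only shows why splitting pending URLs into one queue per host, then taking one URL from each queue in turn, spreads attention across hosts.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def round_robin_by_host(urls):
    """Yield URLs by taking one from each per-host queue in turn.

    Illustrative only: a simplified stand-in for a crawler frontier.
    """
    queues = OrderedDict()
    for url in urls:
        # Assumption: the queue key is simply the hostname of the URL.
        host = urlsplit(url).hostname or ""
        queues.setdefault(host, deque()).append(url)

    while queues:
        # Iterate over a snapshot of the keys so empty queues can be dropped.
        for host in list(queues):
            queue = queues[host]
            yield queue.popleft()
            if not queue:
                del queues[host]


pending = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
]
for url in round_robin_by_host(pending):
    print(url)
```

Because the key is the full hostname, every sub-domain ends up with its own queue, which is consistent with the behaviour discussed later in messages 5347 and 5350.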
5344
Hello, is there any unique number/ordinal for a discovered URI/candidate URI? I got the outlinks of the CrawlUri by Curi.getoutlinks(); How can I get the...
hijbul alam
hijbul_bd
Jul 3, 2008
6:12 pm
5345
Hello, as far as I know, the ordinal is not set when the hyperlink is extracted, but when the referenced resource is downloaded or fetched by...
Christian Krumm
chuk_ol
Jul 3, 2008
8:11 pm
5346
... Sending a PM is OK. But I think it would be even better if the emails were sent to the list, so other researchers may comment on it or use it for their...
Christian Krumm
chuk_ol
Jul 4, 2008
7:50 am
5347
My problem is that when I use 20 seeds (each for a different site), the ones that have many sub-domains seem to get more crawler "attention". If I understand...
nickbirren
Jul 5, 2008
8:28 pm
5348
I understand that Windows is not supported, but Heritrix 2.0 actually seems to be doing very well on it so far (though I'm currently very new to Heritrix). One...
nickbirren
Jul 5, 2008
8:33 pm
5349
Copy to the list for the archive ... Well, the code I sent to the list is for Heritrix 2.0.X, as I mentioned in the email. ... // Get the DatabaseFacade...
Christian Krumm
chuk_ol
Jul 6, 2008
3:20 pm
5350
Yes, each sub-domain gets its own queue. If you are only crawling 20 sites, a quick fix (at least if you're using Heritrix 1.x) is to create an override for each...
Kristinn Sigurdsson
kristsi25
Jul 7, 2008
1:55 pm
5351
... The character-map setting can be used to map ':' to some other character. The help for it has a recommendation for Windows. Note also the case-sensitive,...
tztwh
Jul 7, 2008
6:21 pm
5352
The 'force-queue-assignment' option is available in both Heritrix 1.X and 2.0. Another option that may be appropriate is the "TopmostAssignedSurt" ...
Gordon Mohr
gojomo
Jul 8, 2008
12:34 am
5353
Hi, I'm trying to fetch only the HTML pages defined in the seeds using Heritrix 2.0, and I don't want other pages linked from those HTML pages to be downloaded. These are my settings: ...
haidong.pan
Jul 11, 2008
4:31 pm
5354
... Hi, I think you have to change the order of the decide rules. They are executed in order, so the last decision would be used to make the whole decision for...
Christian Krumm
chuk_ol
Jul 11, 2008
7:19 pm
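
Why rule order matters, as message 5354 points out, can be shown with a minimal sketch. It assumes rules are evaluated in sequence and that the last rule to express an opinion decides the outcome. The names `decide`, `reject_everything`, and `accept_if_seed` are hypothetical, as is the `PASS` value used here for "no opinion". None of this is the Heritrix API.

```python
ACCEPT, REJECT, PASS = "ACCEPT", "REJECT", "PASS"


def decide(rules, uri):
    """Apply rules in order; the last rule that does not PASS wins."""
    decision = PASS
    for rule in rules:
        outcome = rule(uri)
        if outcome != PASS:
            decision = outcome
    return decision


def reject_everything(uri):
    # Hypothetical rule: rejects every URI it sees.
    return REJECT


def accept_if_seed(seeds):
    # Hypothetical rule factory: accepts only URIs in the seed set.
    def rule(uri):
        return ACCEPT if uri in seeds else PASS
    return rule


seeds = {"http://example.com/"}

# Reject first, then accept seeds: only seeds end up accepted.
seeds_only = [reject_everything, accept_if_seed(seeds)]
print(decide(seeds_only, "http://example.com/"))          # ACCEPT
print(decide(seeds_only, "http://example.com/logo.png"))  # REJECT

# Same rules in the opposite order: the final reject overrides everything.
reversed_rules = [accept_if_seed(seeds), reject_everything]
print(decide(reversed_rules, "http://example.com/"))      # REJECT
```

Under these assumptions, swapping two rules is enough to turn a seeds-only scope into one that rejects everything, which matches the advice to check the rule order first.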
5355
... If I understand correctly, you only want your seed URIs to be fetched, and nothing else -- no in-page resources (like images, scripts, or CSS), and no...
Gordon Mohr
gojomo
Jul 11, 2008
8:34 pm
5356
Hi, thank you very much for your help. Yes, I only want to fetch seed URIs, not any other resources. And I followed the second piece of advice, with only 1 URI in the sheet:...
haidong.pan
Jul 13, 2008
7:16 am
5357
Hi, thank you. I set up a job "basic_seed_sites-20080712225206" as a copy of basic_seed_sites. Just as in the previous message, it does not work as I imagined. Please help. ...
haidong.pan
Jul 13, 2008
7:16 am
5358
Thank you. It works now. There are many Scopes, which caused me a lot of trouble. :) ... org.archive.modules.deciderules.DecideRule ... ...
haidong.pan
Jul 14, 2008
12:01 pm
5359
Hi all, what should be done to make Heritrix revisit all the visited web pages once the crawl is finished? I need to evaluate how the pages have changed during...
mitko_denev
Jul 15, 2008
10:11 am
5360
At IA, we will soon be performing final triage of outstanding issues to be fixed for a 1.14.1 bugfix release and a 2.0.1 bugfix release. As a reminder, our...
Gordon Mohr
gojomo
Jul 17, 2008
7:46 pm
5361
I just voted for fixing the encoding issue I reported. I think this one is really critical. I did not see any bugs for better documentation for the 2.0...
Jean-Noël Rivasseau
elvanor@...
Jul 17, 2008
9:38 pm
5362
Hi, I'm running Heritrix from the command line, like tgeorgio. I can start a crawl and watch the progress-statistics.log file to see its progress. What is the...
Colin Meyer
helvella_lac...
Jul 22, 2008
8:18 pm
5363
At Tue, 22 Jul 2008 13:18:50 -0700, ... Hi Colin. When the job is finished, the 3rd line of the state.job file should start with “Finished”. It sounds...
Erik Hetzner
e_hetzner
Jul 22, 2008
8:24 pm
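
The check described in message 5363 could be scripted roughly as below. The only thing taken from the message is that the third line of state.job starts with "Finished" once a job is done. The function name `job_is_finished`, the UTF-8 encoding, and the example path are assumptions made for illustration.

```python
from pathlib import Path


def job_is_finished(state_file):
    """Return True if the third line of the file starts with 'Finished'.

    Assumes the layout described in the message; nothing else about the
    file format is relied on. A file with fewer than three lines counts
    as not finished.
    """
    lines = Path(state_file).read_text(encoding="utf-8").splitlines()
    return len(lines) >= 3 and lines[2].startswith("Finished")


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the job directory lives.
    candidate = Path("jobs/example-job/state.job")
    if candidate.exists():
        print(job_is_finished(candidate))
    else:
        print("no state.job found at", candidate)
```

A missing state.job is reported separately rather than treated as "still running", since the follow-up messages leave that case unresolved.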
5364
... Thanks for the info, Erik. So, if I don't have a state.job file, the crawl job is still running? All of my completed jobs have that file, but the recently...
Colin Meyer
helvella_lac...
Jul 22, 2008
8:50 pm
5365
At Tue, 22 Jul 2008 13:50:40 -0700, ... I’ve never seen a job without a state.job file; somebody else will have to help you out with this. Every Heritrix...
Erik Hetzner
e_hetzner
Jul 22, 2008
11:27 pm
5366
but I want to use 100-200 or more seed URLs. Please help me. Thank you very much....
panupong.cs46
Jul 23, 2008
2:51 am
5367
Hi, I am running an archive crawler, Heritrix, with Capture-HPC. It takes a very long time to finish a crawling job. Even when I tested a site of small...
오성근
freeosg@...
Jul 29, 2008
4:37 am
5368
Hello, I just set up version 2.0.0 on Windows XP SP3 and was able to start Jetty on port 8080 with tweaks to the provided .cmd scripts. Here is a summary of...
ivar_sr
Jul 29, 2008
4:37 am
5369
I'm taking the 1.14 release for a spin but having issues. On the first job, I get the following message from seeds-report.txt: [code] [status] [seed] [redirect] -50...
ivar_sr
Jul 29, 2008
1:38 pm
5370
Hello, I searched JIRA and found no open bugs for this. When I launch the Web UI (either with or without an associated local engine), and add a remote engine...
Jean-Noël Rivasseau
elvanor@...
Jul 30, 2008
9:38 am
5371
Hello, my task is the following: I need to search for a specific URL in the ARC file. From reading the Javadocs it seems there is no facility for doing...
Jean-Noël Rivasseau
elvanor@...
Jul 30, 2008
2:49 pm