archive-crawler
Messages

Messages 4399 - 4428 of 6142
4399
Is there a way to disable the test goal when building Heritrix? That is, to ignore all tests while building it?...
waisovsky
Jul 1, 2007
4:58 pm
4400
Hello, I use the following code to embed Heritrix in my web application: try { Heritrix heritrix = new Heritrix(true); heritrix.addCrawlJob( ...
Artem Antonov
antonov.artem
Jul 2, 2007
1:24 pm
4401
No. Not in the Maven 1 used to build Heritrix 1.12.x. Heritrix 2.x will use Maven 2, which is more flexible about which targets/goals to run. St.Ack...
Michael Stack
stackarchiveorg
Jul 2, 2007
4:06 pm
4402
Anything in heritrix_out.log? Can you run your embedded instance inside a debugger to try to figure out what is awry? Does the order.xml+seeds.txt in an...
Michael Stack
stackarchiveorg
Jul 2, 2007
4:06 pm
4403
Well, it's not possible to disable the test goal, but it's possible to tell the maven-junit plugin to skip all the tests... maven test -Dmaven.test.skip=true This...
pandae667
Jul 2, 2007
5:45 pm
4404
No, there is nothing in heritrix_out.log. In debug mode I get the following lines: 03.07.2007 10:43:38 org.archive.crawler.Heritrix postRegister INFO: ...
Artem Antonov
antonov.artem
Jul 3, 2007
8:14 am
4405
I set the arguments in the CATALINA_PATH variable: -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false ...
Artem Antonov
antonov.artem
Jul 3, 2007
10:19 am
4406
The problem seems to be solved, according to: http://tech.groups.yahoo.com/group/archive-crawler/message/3296. Thanks. Regards, Artem....
Artem Antonov
antonov.artem
Jul 3, 2007
3:17 pm
4407
The WARNING below is a complaint about your order file. It states that the current Heritrix version does not know what to do with the configuration named...
Michael Stack
stackarchiveorg
Jul 3, 2007
4:12 pm
4408
Dear all, The State and University Library of Denmark and The Royal Library of Denmark are pleased to announce the release of the NetarchiveSuite as Open...
Bjarne Andersen
bjarne_dk2000
Jul 4, 2007
10:55 am
4409
Thanks, I've replaced the 'sha1-content' config with two options, in accordance with the current Heritrix version. There are no WARNINGs at all. About 'problem solved'....
Artem Antonov
antonov.artem
Jul 4, 2007
11:25 am
4410
Artem, Stack: I am going to jump in if you don't mind :) We changed processor Filters to DecideRules in 1.12.0. From the release notes: 2.4.2. DecideRules have...
Igor Ranitovic
iranitovic
Jul 4, 2007
5:57 pm
4411
Could somebody help me with how to use regular expressions in the Heritrix web crawler and how to restrict crawling of images from sites...
sridhar.vorla
Jul 5, 2007
12:15 am
4412
Igor, Thanks for your help. The ARCWriterProcessor now works fine. Regards, Artem. ...
Artem Antonov
antonov.artem
Jul 5, 2007
8:13 am
4413
Hello, I am now trying to run Heritrix on Windows, but cannot get anything to work. What should I do? It gives me these errors: java.lang.NullPointerException ...
pann1981
Jul 5, 2007
6:03 pm
4414
Hi, There is an option in settings called bind-address. My question is: can I put multiple comma-separated IP addresses here... If I bind 10 IPs to my...
Jigar Patel
jigar_bca
Jul 6, 2007
11:44 am
4415
Hi, For the purpose of hiding the real crawler IP, and using multiple IP addresses so as not to look like one crawling robot but like geographically distributed...
Laurian Gridinoc
lauriangridinoc
Jul 7, 2007
7:19 am
4416
I encountered this error. What should I do? java.io.IOException: Failed to get host java.sun.com address from ServerCache at...
pann yu mon
pann1981
Jul 9, 2007
4:45 am
4417
Hi, Could anyone please tell me where I need to put QuotaEnforcer, that is, its exact location.... Thanks and Regards, Jigar Patel...
Jigar Patel
jigar_bca
Jul 9, 2007
7:07 am
4418
Hi, I am feeding 2 lakh (200,000) seeds to my Heritrix instance and I am using QuotaEnforcer to limit it to 10 docs per host. I am using a SURT rule in the Deciding Scope. ...
Jigar Patel
jigar_bca
Jul 9, 2007
7:15 am
4419
Hi, I cannot set max-retries: 2. If I do this, then it only fetches the robots.txt page of each domain present in the seed file. Actually, I do not want to retry any...
Jigar Patel
jigar_bca
Jul 9, 2007
10:45 am
4420
I've had that problem as well - it seems the crawler won't crawl with a setting for max-retries lower than 3. Could be a bug? best -- Bjarne Andersen Daily...
Bjarne Andersen
bjarne_dk2000
Jul 9, 2007
12:04 pm
4421
Notice in crawl.log that every seed has at least 3 retries (3t in annotations). Every time a URI is deferred, the count of retries goes up. Seed will be...
Igor Ranitovic
iranitovic
Jul 9, 2007
12:58 pm
4422
Hello Jigar, It seems that all that is left to be crawled are sites that are problematic. -2 means that we failed to connect to a server. Every time you have...
Igor Ranitovic
iranitovic
Jul 9, 2007
1:11 pm
4423
Do I need to restart the crawl, or are changes to these files picked up dynamically? Thanks, Mike...
mjjjhjemj
Jul 9, 2007
3:14 pm
4424
You can, and you don't need to restart the crawl. SurtPrefixedDecideRule has a rebuild-on-reconfig option that you can set to true. However, you have to 'touch'...
Igor Ranitovic
iranitovic
Jul 9, 2007
3:33 pm
4425
As Igor notes, this behavior is 'by design'. When a URI comes up for crawling, but its host has not been (recently) fetched via a prerequisite DNS URI, or its...
Gordon Mohr
gojomo
Jul 9, 2007
7:51 pm
4426
I want to have a post processor update my own database with the location of the ARC file that was written for the current crawl (i.e. for each URI). Is there...
Andrew Serff
andrewserff
Jul 10, 2007
10:40 pm
4427
Hi there, My name is Dawid Weiss, I have been using Heritrix for a few weeks now -- great software, really. I noticed a very annoying bug, described below. ...
Dawid Weiss
dawid_weiss
Jul 12, 2007
12:06 am
4428
... Are you asking if you can insert a processor into the crawler at the 'postprocessor' stage to insert the ARC file name a particular download was written to...
Michael Stack
stackarchiveorg
Jul 12, 2007
2:10 am
