archive-crawler
Messages

Messages 4578 - 4607 of 6147
4578
A third, simpler, option would be to edit the Processors.options file and create multiple entries for the extractor, each with a different 'name' after the |. ...
Kristinn Sigurðsson
kristsi25
Oct 1, 2007
12:54 pm
4579
Hi, I have started JBoss and a Heritrix instance. Could anyone please give me a sample hcc.properties file for setting up HCC? Thanks in advance, Jigar Patel ... Take the...
Jigar Patel
jigar_bca
Oct 2, 2007
4:34 am
4580
Hi Jigar, My hcc.properties only has a single line that's not a comment: org.archive.hcc.ClusterControllerBean.maxPerContainer=3 In my experience the real work...
Andrea Goethals
andrea_goethals@...
Oct 2, 2007
3:10 pm
4581
Hi, I am using Heritrix 1.12.1 and I'm taking the default approach of using seeds as SURTs. Here is an example seed list: # start of seed list http://site-a.ca...
astar_t
Oct 2, 2007
6:09 pm
4582
Hi Adam, It would be fairly easy to transform the seeds into the surt directives. Maybe something like (linux cmdline): cat seeds.txt | perl -pe...
Igor Ranitovic
iranitovic
Oct 2, 2007
6:20 pm
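Editor's sketch for messages 4581 and 4582: the Perl one-liner in the snippet is cut off, so the following is not Igor's command. It is a minimal Python illustration of the basic seed-to-SURT transformation (reversed, comma-terminated host labels in parentheses). The function name `seed_to_surt` and the example URLs are hypothetical, and ports, user info, and Heritrix's own prefix-trimming rules are deliberately ignored.

```python
from urllib.parse import urlsplit


def seed_to_surt(seed):
    """Convert a seed URL into its basic SURT form.

    Simplified assumption: only scheme, host, and path are handled;
    ports, user info, and query strings are ignored.
    """
    parts = urlsplit(seed.strip())
    # Reverse the host labels and terminate each with a comma,
    # e.g. www.example.org -> org,example,www,
    labels = parts.hostname.split(".")
    surt_host = ",".join(reversed(labels)) + ","
    path = parts.path or "/"
    return f"{parts.scheme}://({surt_host}){path}"


def seeds_to_surts(lines):
    """Convert every non-blank, non-comment line of a seed list."""
    return [
        seed_to_surt(line)
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]


if __name__ == "__main__":
    # Hypothetical seed list, not the one from the original message.
    example = ["# start of seed list", "http://www.example.org/dir/"]
    for surt in seeds_to_surts(example):
        print(surt)
```

Check the result against a real crawl scope before relying on it, since Heritrix applies additional rules when it derives prefixes from seeds.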
4583
Hi Andrea, Thanks a lot. You solved my problem. I got a connection, but there is still an issue. ============================= ...
Jigar Patel
jigar_bca
Oct 3, 2007
4:46 am
4584
Hi Jigar, I had that same exception when I was trying to set up the CrawlController. Here's the entry from the mailing list about it - see if this solution...
Andrea Goethals
andrea_goethals@...
Oct 3, 2007
3:00 pm
4585
Hi, I have already read this reply, but it does not work for me and gives me the same reply... Will you please show me where exactly you changed the port and how you...
Jigar Patel
jigar_bca
Oct 4, 2007
1:46 pm
4586
Hi Gordon, Hi all! Sorry for the late reply. Gordon, you are right, that is exactly what I wanted! The proposal from joeyfreund indeed causes all seed URLs...
Martin Kammerlander
mkammerlander
Oct 4, 2007
5:11 pm
4587
Jigar, I'm attaching a zip file of our main Crawl Controller-related files. This probably would have been better distributed by setting up a Crawl Controller ...
Andrea Goethals
andrea_goethals@...
Oct 5, 2007
3:56 pm
4588
I have a problem using cookies and retrying a URL. 1. If I use http://www.myjones.com/code/signup.php as the seed and set the crawler to only the seed. This URL...
ruyanbo
Oct 8, 2007
7:00 pm
4589
A quick and dirty solution would be to add a fake seed at the top of the seed list. Maybe something like: http://www.myjones.com/code/signup.php?fakeparam ...
Igor Ranitovic
iranitovic
Oct 8, 2007
8:18 pm
4590
Hi all, I recently downloaded heritrix.war and put it into tomcat/webapps, but I don't know the username and password....
vretr
Oct 9, 2007
2:19 am
4591
You need to acquire the 'admin' role. Add something like: <user name="admin" password="SOME_PASSWORD" roles="admin" /> ... to $TOMCAT_HOME/conf/tomcat-users.xml....
Michael Stack
stackarchiveorg
Oct 9, 2007
3:22 am
4592
Thanks. Michael Stack <stack@...> wrote: You need to acquire the 'admin' role. Add something like: <user name="admin"...
 
vretr
Oct 9, 2007
2:52 pm
4593
The method you suggested works. But it requires the user to look at the crawl log or crawled content and identify URLs that need to be re-crawled. Is it possible...
ruyanbo
Oct 9, 2007
5:52 pm
4594
Hi, I'm using 1.12.1. I added and enabled the regex module in the canonicalization rules and added a regex rule relevant to my crawling jobs in the profile settings...
hinoglu
Oct 10, 2007
3:23 pm
4595
Hello, I have Heritrix embedded in my application, and I am having problems with extremely large ARC files being created. I set each run to stop after 5...
Lostokies
kalebmurphy
Oct 12, 2007
3:44 pm
4596
We have a few Heritrix instances set up in Amazon EC2 crawling a fairly small amount of data. We are seeing the crawl pause itself as it goes, with no...
Ted Dziuba
ted@...
Oct 15, 2007
9:04 pm
4597
Anything on the consoles about OOMEs? Are the disks full? St.Ack...
Michael Stack
stackarchiveorg
Oct 15, 2007
9:12 pm
4598
Ah, yes, I am seeing OOMEs in ToeThread's run method. If all ToeThreads throw exceptions, does the crawl pause? Ted...
Ted Dziuba
ted@...
Oct 15, 2007
10:00 pm
4599
If there is a 'serious error' such as an OOME, it'll trigger a pause (See http://crawler.archive.org/xref/org/archive/crawler/framework/ToeThread.html#211 and line...
Michael Stack
stackarchiveorg
Oct 15, 2007
10:21 pm
4600
Interesting. It seems that we were using the JVM's default maximum heap size, which is the lesser of 1/4 the system memory and 1GB, so in EC2, about 450MB....
Ted Dziuba
ted@...
Oct 15, 2007
10:35 pm
4601
Hi Ted, I would just like to add that by default the bdb cache can take up to 60% of the assigned heap size. So, changing bdb-cache-percent from default 0 to ...
Igor Ranitovic
iranitovic
Oct 16, 2007
8:00 am
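Editor's sketch for messages 4600 and 4601: a rough illustration of the arithmetic the two posters describe, not a statement of how any particular JVM or Heritrix release behaves. It takes the rule as quoted (default maximum heap is the lesser of 1/4 of system memory and 1 GB) and the quoted 60% ceiling for the bdb cache. The 1800 MiB memory figure is an assumption chosen so that a quarter of it matches the "about 450MB" mentioned in the thread, and the function names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def default_max_heap_bytes(system_memory_bytes):
    """Default max heap per the rule quoted in message 4600:
    the lesser of 1/4 of system memory and 1 GB."""
    return min(system_memory_bytes // 4, GIB)


def bdb_cache_ceiling_bytes(heap_bytes, cache_percent=60):
    """Upper bound on the bdb cache, using the 60% figure quoted
    in message 4601 as the default share of the heap."""
    return heap_bytes * cache_percent // 100


if __name__ == "__main__":
    # Assumed instance memory (hypothetical): 1800 MiB.
    heap = default_max_heap_bytes(1800 * MIB)
    cache = bdb_cache_ceiling_bytes(heap)
    print(f"default max heap: {heap // MIB} MiB")
    print(f"bdb cache ceiling: {cache // MIB} MiB")
    print(f"left for everything else: {(heap - cache) // MIB} MiB")
```

Under these assumptions well over half of a small default heap can go to the bdb cache, which is consistent with the OOMEs reported earlier in the thread.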
4602
There are different DecideRules in org.archive.crawler.deciderules, but just a few in the Heritrix Web UI. I wonder in what way I can see MatchesRegExpDecideRule...
nickzwk
Oct 16, 2007
8:01 am
4603
Paul Pedersen who has been crawling up on EC2 with some success says he uses defaults and a heap of 256M. St.Ack...
Michael Stack
stackarchiveorg
Oct 16, 2007
4:08 pm
4604
I believe that he crawls one or maybe a few hosts per Heritrix instance. In other words, the number of queues is low. So, in that scenario that makes sense. i....
Igor Ranitovic
iranitovic
Oct 16, 2007
4:18 pm
4605
Hi All, I am trying to set up the Heritrix crawler for my project. I am not able to run my first job in Heritrix. I get an error saying "failed to crawl job". ...
Rajeev Sharma
sharma.rajeev
Oct 17, 2007
3:43 pm
4606
... You should start with one of the bundled default configurations, then change it minimally to create your job. For the heart of the crawler, the set of...
Gordon Mohr
gojomo
Oct 17, 2007
5:17 pm
4607
Hi, I'm using version 1.12.1. A change to any other default setting is written to the order.xml file of the given profile, but after adding a URL canonicalization rule,...
hinoglu
Oct 18, 2007
2:29 am