A third, simpler, option would be to edit the Processors.options file and create multiple entries for the extractor, each with a different 'name' after the |. ...
Hi, I have started JBoss and a Heritrix instance. Could anyone please give me a sample hcc.properties file for setting up HCC? Thanks in advance, Jigar Patel ... Take the...
Hi Jigar, My hcc.properties only has a single line that's not a comment: org.archive.hcc.ClusterControllerBean.maxPerContainer=3 In my experience the real work...
Andrea Goethals
andrea_goethals@...
Oct 2, 2007 3:10 pm
4581
Hi, I am using Heritrix 1.12.1 and I'm taking the default approach of using seeds as SURTs. Here is an example seed list: # start of seed list http://site-a.ca...
Hi Jigar, I had that same exception when I was trying to set up the CrawlController. Here's the entry from the mailing list about it - see if this solution...
Andrea Goethals
andrea_goethals@...
Oct 3, 2007 3:00 pm
4585
Hi, I already read this reply, but it does not work for me and gives me the same reply... Will you please show me exactly where you changed the port and how you...
Hi Gordon, Hi all! Sorry for replying late. Gordon, you are right, that is exactly what I wanted! The proposal from joeyfreund indeed causes all seed URLs...
Jigar, I'm attaching a zip file of our main Crawl Controller-related files. This probably would have been better distributed by setting up a Crawl Controller ...
Andrea Goethals
andrea_goethals@...
Oct 5, 2007 3:56 pm
4588
I have a problem using cookies and retrying a URL. 1. If I use http://www.myjones.com/code/signup.php as the seed and set the crawler to only the seed. This URL...
A quick and dirty solution would be to add a fake seed at the top of the seed list. Maybe something like: http://www.myjones.com/code/signup.php?fakeparam ...
You need to acquire the 'admin' role. Add something like: <user name="admin" password="SOME_PASSWORD" roles="admin" /> ... to $TOMCAT_HOME/conf/tomcat-users.xml....
The method you suggested works. But it needs the user to look at the crawl log or crawled content and identify URLs that need to be re-crawled. Is it possible...
Hi, I'm using 1.12.1. I added and enabled the regex module in the canonicalization rules and added a regex rule relevant to my crawling jobs in the profile settings...
Hello, I have Heritrix embedded in my application, and I am having problems with extremely large ARC files being created. I set each run to stop after 5...
We have a few Heritrix instances set up in Amazon EC2 crawling a fairly small amount of data. We are seeing the crawl pause itself as it goes, with no...
Ted Dziuba
ted@...
Oct 15, 2007 9:04 pm
4597
Anything on the consoles about OOMEs? Are the disks full? St.Ack...
Ah, yes, I am seeing OOMEs in ToeThread's run method. If all ToeThreads throw exceptions, does the crawl pause? Ted...
Ted Dziuba
ted@...
Oct 15, 2007 10:00 pm
4599
If there is a 'serious error' such as an OOME, it'll trigger a pause (See http://crawler.archive.org/xref/org/archive/crawler/framework/ToeThread.html#211 and line...
Interesting. It seems that we were using the JVM's default maximum heap size, which is the lesser of 1/4 the system memory and 1GB, so in EC2, about 450MB....
Ted Dziuba
ted@...
Oct 15, 2007 10:35 pm
4601
Hi Ted, I would just like to add that by default the bdb cache can take up to 60% of the assigned heap size. So, changing bdb-cache-percent from default 0 to ...
There are different DecideRules in org.archive.crawler.deciderules, but just a few in the Heritrix Web UI. I wonder in what way I can see MatchesRegExpDecideRule...
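For anyone following the heap discussion in this thread, here is a rough sketch for checking what maximum heap the JVM actually ended up with, and what 60% of it (the default bdb cache share mentioned above) comes to. The class name HeapBudget and the method cacheBytes are made up for illustration and are not part of Heritrix; only Runtime.maxMemory() is a standard Java call.

```java
// Sketch only: HeapBudget and cacheBytes are hypothetical names, not Heritrix APIs.
class HeapBudget {

    // Bytes that a cache limited to `percent` of the heap could occupy.
    // Dividing before multiplying avoids overflow if the JVM reports
    // Long.MAX_VALUE (meaning "no limit") as its maximum heap.
    static long cacheBytes(long maxHeapBytes, int percent) {
        return maxHeapBytes / 100 * percent;
    }

    public static void main(String[] args) {
        // Maximum heap this JVM will try to use (default or set via -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();

        // 60 is the default share quoted in the message above.
        long cache = cacheBytes(maxHeap, 60);

        System.out.println("Max heap (MB): " + maxHeap / (1024 * 1024));
        System.out.println("60% of heap (MB): " + cache / (1024 * 1024));
        System.out.println("Remainder (MB): " + (maxHeap - cache) / (1024 * 1024));
    }
}
```

Running it once with no options and once with something like `java -Xmx512m HeapBudget` on the same machine should show whether the default heap is as small as suspected.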
I believe that he crawls one or maybe a few hosts per Heritrix instance. In other words, the number of queues is low. So, in that scenario that makes sense. i....
Hi All, I am trying to set up the Heritrix crawler for my project. I am not able to run my first job in Heritrix. I get an error saying "failed to crawl job". ...
... You should start with one of the bundled default configurations, then change it minimally to create your job. For the heart of the crawler, the set of...
Hi, using version 1.12.1. Changing any other default setting is written to the order.xml file of the given profile, but after adding a URL canonicalization rule,...