Hi all, I used Heritrix to crawl some pages, but I cannot get the true URL of .asp pages. For example, if the true URL is www.ceee.com/shownews.aps? ...
Hi All, I wanted to know which property I should set to limit the crawler to crawl only the domains given by the seed URLs. --Thanks and Regards Vaijanath N. Rao...
Hi All, I am using Heritrix version 2.0.2. Can someone point me to the right scope rules? The current one I have is as follows: root:scope:rules:0=object,...
... Here are some thoughts about heritrix 1.x (may or may not apply to 2.x): My understanding is that any particular decide rule sequence can be either...
Heritrix release 1.14.3 is now available at Sourceforge: http://sourceforge.net/project/showfiles.php?group_id=73833 This is a 'micro' release with small bug...
Hi, could someone on the list give me some guidance on integrating Heritrix with Solr? I would like to use Heritrix as the crawler and Solr as the indexer. Thanks! Tony ...
I'm not aware of any existing utilities that do this; you would write a custom loader that reads the arc files and sends HTTP requests to the Solr update handler;...
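As a rough illustration of the "custom loader" idea in the reply above, the sketch below builds the XML body of a Solr `<add>` update message for one crawled record. Reading the ARC file and posting the body to the update handler are deliberately left out. The class name and the field names `id` and `content` are hypothetical; the fields a real index accepts depend on its Solr schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: builds the XML body of a Solr <add> update message.
// Field names below are hypothetical and must match the target Solr schema.
final class SolrAddXmlSketch {

    // Escape the characters that are not allowed raw in XML text or attributes.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Turn one record (field name -> value) into <add><doc>...</doc></add>.
    static String buildAdd(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.com/?a=1&b=2");   // hypothetical field
        doc.put("content", "Hello <world>");            // hypothetical field
        System.out.println(buildAdd(doc));
    }
}
```

A loader along these lines would call `buildAdd` once per record read from the archive and send the result to the Solr update handler, followed by a commit.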
Hi, Just wanted to hire someone to work on my Heritrix/Solr integration project. Basically, I would like to have a Heritrix writer that can write the crawl ...
Hi Tony, We have written a Solr Index Writer Processor and will make it available to the Heritrix community soon. We expect to have it ready by the end of this month. If you...
It's most likely normal; I think Heritrix will increase its crawl delay if it sees that the crawled site is serving pages more slowly. Check the logs for...
Once the crawler is down to just one or two sites to crawl, the limiting factor is its politeness: it only requests one URL at a time, and it pauses several...
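The two replies above describe a per-site pause that grows when the site responds slowly. One common way to model such a policy is to wait a multiple of the last fetch duration, clamped between a minimum and a maximum. The sketch below shows only that arithmetic; the class name, parameter names, and numbers are illustrative assumptions, not values read from any Heritrix configuration, so check the Heritrix documentation for the actual settings.

```java
// Sketch only: an adaptive politeness delay, assuming the pause is a
// multiple of the last fetch time, clamped to a [min, max] range.
final class PolitenessDelaySketch {

    // All parameters are illustrative assumptions, not Heritrix defaults.
    static long delayMs(long lastFetchMs, double delayFactor,
                        long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(lastFetchMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    public static void main(String[] args) {
        // A fast response is held up to the minimum pause.
        System.out.println(delayMs(200, 5.0, 2000, 30000));    // prints 2000
        // A slower response lengthens the pause proportionally.
        System.out.println(delayMs(1000, 5.0, 2000, 30000));   // prints 5000
        // A very slow response is capped at the maximum pause.
        System.out.println(delayMs(10000, 5.0, 2000, 30000));  // prints 30000
    }
}
```

Under a policy like this, a crawl that is down to one or two hosts cannot go faster than one request per pause per host, no matter how many threads are available.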
Hello, I am trying version 1.14.3. Thank you to Gordon and the development team! I have two questions. 1. I want to choose whether or not to fetch on a per-host basis. I create...
takeru sasaki
sasaki.takeru@...
Mar 11, 2009 6:23 am
5720
Hi All, I am trying to use the cmdline-jmxclient-0.10.5.jar to start/monitor jobs. How could I do this? There isn't much documentation on this (that I could...
Hi, I tried the new 'preload-source' property, but the crawl did not start and an OutOfMemoryError occurred. The source state dir includes 00000000.jdb ... 000000a9.jdb,...
takeru sasaki
sasaki.takeru@...
Mar 11, 2009 1:46 pm
5722
... Usually, you should add rules to the 'scope' (DecidingScope's rule chain). For rules that are a URI prefix -- like your example <http://xxxx.com/ad/.*> --...
Hi Gordon and members, ... Thank you, I will try SurtPrefixDecideRule. I want to use a pattern like: ^http://xxx.com/(news|blog)\?id=A\d+$ ...
takeru sasaki
sasaki.takeru@...
Mar 12, 2009 2:00 am
5724
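The regular expression quoted in the message above can be checked in isolation with `java.util.regex` before it is placed in any crawler rule. The sketch below compiles the pattern as posted and, as a suggested variation that is not part of the original message, a stricter form with the dots in the host name escaped. The class name and sample URLs are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch only: tests the pattern from the message against sample URLs.
final class UrlPatternSketch {

    // The pattern as posted; note that an unescaped '.' matches any character.
    static final Pattern AS_POSTED =
            Pattern.compile("^http://xxx.com/(news|blog)\\?id=A\\d+$");

    // Suggested variation (an assumption, not from the message): escape the dot.
    static final Pattern STRICT =
            Pattern.compile("^http://xxx\\.com/(news|blog)\\?id=A\\d+$");

    static boolean accepts(Pattern p, String url) {
        return p.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(AS_POSTED, "http://xxx.com/news?id=A123")); // prints true
        System.out.println(accepts(AS_POSTED, "http://xxx.com/blog?id=B123")); // prints false
        System.out.println(accepts(AS_POSTED, "http://xxx.com/ad/banner"));    // prints false
        // The unescaped dot also lets this unintended host through:
        System.out.println(accepts(AS_POSTED, "http://xxxacom/news?id=A1"));   // prints true
        System.out.println(accepts(STRICT, "http://xxxacom/news?id=A1"));      // prints false
    }
}
```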
Hi, ... I understand now: - "Preselector#decide-rules" are the rules for "run this Preselector" - "Preselector#decide-rules" are not rules for "block URLs from going to fetch...
takeru sasaki
sasaki.takeru@...
Mar 12, 2009 2:46 am
5725
Hi all, I am new to Heritrix. I wanted to run the crawler from the command line. What are the options for running cmdline-jmxclient-0.10.5.jar? I have had a...
Hello enigmacodes, FYI the current policy is to recommend starting with Heritrix 1.14.3, until 2.2 comes out. ... There's some rudimentary info on using...
Hi, I am looking for a crawler tool for copyright infringement purposes. I believe Heritrix could solve that problem? I don't have much knowledge of web...
Rahil Baig
rahil.baig@...
Mar 18, 2009 5:03 pm
5731
Hi Rahil, ... answer: yes, it could be possible, but you would also need to integrate your application with the Heritrix modules for this purpose. ... answer: it depend...
Thanks a lot, Ravinder, for the response. I have another question now regarding the BDB database: does it mean that BDB will be available in the Heritrix installation...
Rahil Baig
rahil.baig@...
Mar 18, 2009 9:05 pm
5733
Hi, is there any website where I could look at screenshots of the UI to get a feel for the interface, or could anyone please send some to the group? Thanks...
Rahil Baig
rahil.baig@...
Mar 19, 2009 6:32 am
5734
There are some screen shots of Heritrix 2.0 at http://webteam.archive.org/confluence/display/Heritrix/Using+the+Web+User+Interface. This is the version that I...
Hi Noah, Thanks for the prompt reply the other day. Apologies for the tardiness on my part. I am using 1.14.3 now. I still can't get cmdline-jmxclient-0.10.5.jar...