archive-crawler
Messages 5707 - 5736 of 6140
5707
Hi all, I used Heritrix to crawl some pages, but I cannot get the true URL of .asp pages. For example, if the true URL is www.ceee.com/shownews.aps? ...
silver_chuan
Mar 2, 2009
6:06 am
5708
Hi All, I wanted to know which property I should set to limit the crawler to crawl only the domains of the seed URLs. --Thanks and Regards Vaijanath N. Rao...
Vaijanath N. Rao
vaiju1981
Mar 2, 2009
12:36 pm
5709
Hi All, I am using Heritrix version 2.0.2. Can someone point out the rule scope? The current one I have is as follows: root:scope:rules:0=object,...
Vaijanath N. Rao
vaiju1981
Mar 2, 2009
12:54 pm
5710
... Here are some thoughts about heritrix 1.x (may or may not apply to 2.x): My understanding is that any particular decide rule sequence can be either...
pbaclace
Mar 2, 2009
7:55 pm
5711
Heritrix release 1.14.3 is now available at Sourceforge: http://sourceforge.net/project/showfiles.php?group_id=73833 This is a 'micro' release with small bug...
Gordon Mohr
gojomo
Mar 3, 2009
11:34 pm
5712
Hi, Could someone on the list give me some guidance on integrating Heritrix with Solr? I would like to use Heritrix as the crawler and Solr as the indexer. Thanks! Tony ...
Tony Wang
gwangcs
Mar 6, 2009
5:11 pm
5713
I'm not aware of any existing utilities that do this; you write a custom loader that reads the arc files and sends http requests to the solr update handler;...
Roger Caplan
rogercaplan
Mar 6, 2009
5:55 pm
5714
Hi, Just wanted to hire someone to work on my Heritrix/Solr integration project. Basically, I would like to have a Heritrix writer that can write the crawl ...
Tony Wang
gwangcs
Mar 8, 2009
5:15 am
5715
Hi Tony, We have written a Solr Index Writer Processor and will make it available to the Heritrix community soon. We will get it by the end of this month. If you...
Vaijanath N. Rao
vaiju1981
Mar 9, 2009
4:19 am
5716
Hi, I'm testing Heritrix 1.4 on my dedicated box and I have set up a job that crawls the following websites: http://www.eBiomethods.com ...
Tony Wang
gwangcs
Mar 9, 2009
1:52 pm
5717
It's most likely normal; I think heritrix will increase its crawl delay if it sees that the crawled site is serving pages more slowly. Check the logs for...
Roger Caplan
rogercaplan
Mar 9, 2009
3:14 pm
5718
Once the crawler is down to just one or two sites to crawl, the limiting factor is its politeness: it only requests one URL at a time, and it pauses several...
Gordon Mohr
gojomo
Mar 9, 2009
10:23 pm
5719
Hello, I am trying version 1.14.3. Thank you to Gordon and the development team! I have two questions. 1. I want to select whether to fetch or not, per host. I create...
takeru sasaki
sasaki.takeru@...
Mar 11, 2009
6:23 am
5720
Hi All, I am trying to use the cmdline-jmxclient-0.10.5.jar to start/monitor jobs. How could I do this? There isn't much documentation on this (that I could...
enigmacodes
Mar 11, 2009
10:59 am
5721
Hi, I tried the new 'preload-source' property, but the crawl did not start and an OutOfMemoryError occurred. The source state dir includes 00000000.jdb ... 000000a9.jdb,...
takeru sasaki
sasaki.takeru@...
Mar 11, 2009
1:46 pm
5722
... Usually, you should add rules to the 'scope' (DecidingScope's rule chain). For rules that are a URI prefix -- like your example <http://xxxx.com/ad/.*> --...
Gordon Mohr
gojomo
Mar 12, 2009
12:03 am
5723
Hi, Gordon and members. ... Thank you, I will try SurtPrefixDecideRule. I want to use a pattern like: ^http://xxx.com/(news|blog)\?id=A\d+$ ...
takeru sasaki
sasaki.takeru@...
Mar 12, 2009
2:00 am
5724
Hi, ... I understand now: - "Preselector#decide-rules" are rules for "work this Preselector" - "Preselector#decide-rules" are not rules for "block URLs go to fetch...
takeru sasaki
sasaki.takeru@...
Mar 12, 2009
2:46 am
5725
Hi all, I am new to Heritrix. I wanted to run the crawler from the command line. What are the options for running cmdline-jmxclient-0.10.5.jar? I have had a...
enigmacodes
Mar 12, 2009
5:39 am
5726
hi, I'm a newbie too and could anyone point me/us to any doc or mail archive about filtering by domain? thx. regards, mingfai...
mingfai.ma
Mar 13, 2009
8:19 pm
5727
http://crawler.archive.org/articles/user_manual/config.html is a good place to start if you haven't looked at it yet. Lauren ...
Ko, Lauren
laurendko
Mar 13, 2009
9:04 pm
5728
Hello enigmacodes, FYI the current policy is to recommend starting with heritrix 1.14.3, until 2.2 comes out. ... There's some rudimentary info on using...
Noah Levitt
nlevitt0
Mar 14, 2009
12:38 am
5729
I want to limit the number of downloaded documents per host or domain. How do I configure Heritrix for this? (I use Heritrix version 1.14.3)...
Punnawat T.
punnawatt2
Mar 18, 2009
4:14 pm
5730
Hi I am looking for a crawler tool for copyright infringement purposes. I believe Heritrix could solve that problem? I don't have much knowledge on web...
Rahil Baig
rahil.baig@...
Mar 18, 2009
5:03 pm
5731
Hi Rahul, ... answer: yes, it could be possible, but you would also need to integrate your application with the Heritrix modules for this purpose. ... answer: it depend...
ravinder vashist
ravinder_vas...
Mar 18, 2009
5:24 pm
5732
Thanks a lot, Ravinder, for the response. There is another question now re the BDB database: does it mean that bdb will be available in the Heritrix installation...
Rahil Baig
rahil.baig@...
Mar 18, 2009
9:05 pm
5733
Hi, Is there any website where I could have a look at screenshots of the UI to get a feel for the interface, or could anyone please send us some on the group? Thanks...
Rahil Baig
rahil.baig@...
Mar 19, 2009
6:32 am
5734
There are some screen shots of Heritrix 2.0 at http://webteam.archive.org/confluence/display/Heritrix/Using+the+Web+User+Interface. This is the version that I...
laurendko
Mar 19, 2009
2:14 pm
5735
Looks like the link may have changed. Please try: Using the Web User Interface ...
siznax
stearcorg
Mar 19, 2009
5:08 pm
5736
Hi Noah, Thanks for the prompt reply the other day. Apologies for the tardiness on my part. I am using 1.14.3 now. I still can't get cmdline-jmxclient-0.10.5.jar...
enigmacodes
Mar 25, 2009
8:46 am
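
A note on the pattern quoted in message 5723: before putting a regular expression like that into a crawl scope, it can help to check it against a few sample URLs. The sketch below is only an illustration, not Heritrix configuration. It uses Python's `re` module rather than the Java regex engine Heritrix runs on (the two agree for this particular pattern), it escapes the dots in the placeholder hostname, which the quoted pattern leaves unescaped, and the function name and sample URLs are made up for the example.

```python
import re

# Pattern from message 5723, with the hostname dots escaped (an adjustment;
# the quoted pattern has "xxx.com", where "." would match any character).
NEWS_OR_BLOG = re.compile(r"^http://xxx\.com/(news|blog)\?id=A\d+$")


def matches_pattern(url: str) -> bool:
    """Return True if the whole URL matches the news/blog id pattern.

    Hypothetical helper for offline testing only; it does not call Heritrix.
    """
    return NEWS_OR_BLOG.fullmatch(url) is not None


if __name__ == "__main__":
    # Made-up sample URLs on the placeholder host used in the thread.
    for sample in (
        "http://xxx.com/news?id=A123",
        "http://xxx.com/blog?id=A7",
        "http://xxx.com/ad/banner.gif",
        "http://xxx.com/news?id=A123&page=2",
    ):
        print(sample, matches_pattern(sample))
```

Because the pattern is anchored at both ends, a URL with extra query parameters after the id does not match; whether that is the intended behaviour depends on the site being crawled.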