Has anyone here had any experience crawling Lotus Notes sites and could provide general crawling recommendations? I can get more specific about the issues...
Hi, I configured a crawl job using default settings. Now, I'm tweaking the delay-factor and min-delay-ms values hoping I can speed up the time to fetch the...
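A minimal sketch of how the two settings mentioned above interact, assuming the per-host wait is the last fetch duration multiplied by delay-factor and then clamped between min-delay-ms and max-delay-ms. The class and method names are hypothetical, and the values in `main` are illustrative rather than Heritrix defaults.

```java
// Sketch only: approximates a politeness delay derived from
// delay-factor, min-delay-ms, and max-delay-ms. Not Heritrix source code.
class PolitenessDelay {
    /** Returns the wait in milliseconds before contacting the same host again. */
    static long delayMs(long lastFetchMs, double delayFactor, long minDelayMs, long maxDelayMs) {
        // Scale the previous fetch duration by the delay factor.
        long scaled = Math.round(lastFetchMs * delayFactor);
        // Apply the floor first, then the ceiling.
        return Math.min(maxDelayMs, Math.max(minDelayMs, scaled));
    }

    public static void main(String[] args) {
        // Illustrative values (assumptions), chosen to show the clamping.
        System.out.println(delayMs(200, 5.0, 3000, 30000));   // floor applies
        System.out.println(delayMs(200, 1.0, 500, 30000));    // floor applies
        System.out.println(delayMs(10000, 5.0, 3000, 30000)); // ceiling applies
    }
}
```

Under this assumption, lowering delay-factor alone has no effect on fast-responding hosts while min-delay-ms still acts as the floor, so both values would need to come down together.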
Hi, I am trying to add an HTML Form Credential. However, I couldn't figure out how to supply the form-items, because on the Web UI Settings I couldn't see any...
Well, we are doing 2 things with Heritrix... The first thing is to read a priority-listed database of URLs and seed Heritrix with them. The thought of...
I am using the following uri-canonicalization-rules in my crawl: --> RegexRule enabled: true matching-regex: ^(.+)(?:\/\$)(.*)$ format: ${1}${2} Whereas I...
Heritrix URI canonicalization only affects the form of the URI used to determine if the URI has been already-scheduled. It does not change the form of the URI...
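The point made in this reply can be sketched as follows, using the matching-regex and format quoted in the question. This is a simplified illustration, not Heritrix code: the class name, the `schedule` method, the in-memory set, and the example URIs are all assumptions.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the canonical form is used as the already-seen key,
// while the URI handed back for fetching keeps its original form.
class CanonicalizationSketch {
    // matching-regex from the question; the format ${1}${2} joins both groups.
    static final Pattern RULE = Pattern.compile("^(.+)(?:\\/\\$)(.*)$");

    static String canonicalize(String uri) {
        Matcher m = RULE.matcher(uri);
        return m.matches() ? m.group(1) + m.group(2) : uri;
    }

    private final Set<String> alreadySeen = new HashSet<>();

    /** Returns the unchanged URI to fetch, or null if its canonical form was already scheduled. */
    String schedule(String uri) {
        return alreadySeen.add(canonicalize(uri)) ? uri : null;
    }

    public static void main(String[] args) {
        CanonicalizationSketch frontier = new CanonicalizationSketch();
        // Hypothetical URIs for illustration.
        System.out.println(frontier.schedule("http://example.com/a/$b")); // original form is kept
        System.out.println(frontier.schedule("http://example.com/ab"));   // same canonical key
    }
}
```

In this sketch the second call returns null because both URIs share one canonical key, yet the first URI is still fetched exactly as it was discovered.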
Could you just attach a working order.xml in your reply? Although your explanations are really helpful, I seem to manage to run into some exceptions like the...
Hi Gordon, I ran into a challenge here as well. I did have a couple of questions on it. Is there anything in the on disk state directory from the old job e.g....
Hi, I am doing a quick comparison between wget and Heritrix. I configured both to use the same seed: http://news.bbc.co.uk/2/hi/middle_east/default.stm and...
Hello list, So I've been able to successfully get Heritrix crawling my seed URLs by configuring jobs via the web UI. Now I'd like to get jobs started in an...
Jesse Peterson
jesse.peterson@...
May 7, 2007 9:49 pm
4222
Dear IA, the bug HER-1097 is unfortunately not fixed in the latest release, 1.12.1: http://webteam.archive.org/jira/browse/HER-1097 I think the fix is pretty...
... Thanks for the report... I can see the problem path, but am wondering how you've triggered that path when my crawl tests have not. Can you describe your...
We're running a 10-machine crawl with the HashCrawlMapper. What is the best way to know, given a host name, which crawler 'owns' the host? Cheers, -Joe...
... There's a static method on HashCrawlMapper, mapString, that can help: public static String mapString(String key, String reducePattern, long bucketCount) ...
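To illustrate the general idea of hashing a key into a fixed number of buckets, here is a self-contained stand-in. It is not the Heritrix implementation: it uses CRC32 as an assumed hash, omits the reducePattern argument, and will not reproduce the bucket assignments of a real crawl. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch only: hash-and-modulo bucket assignment with a stand-in hash.
class BucketSketch {
    /** Maps a key (for example a host name) to a bucket in [0, bucketCount). */
    static long bucketFor(String key, long bucketCount) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is non-negative, so a plain remainder stays in range.
        return crc.getValue() % bucketCount;
    }

    public static void main(String[] args) {
        // Hypothetical host names and a 10-bucket layout for illustration.
        for (String host : new String[] {"example.com", "example.org", "archive.org"}) {
            System.out.println(host + " -> bucket " + bucketFor(host, 10));
        }
    }
}
```

To find the actual owner of a host, the real static method named in the reply would have to be called from the Heritrix jar with the same reduce pattern and bucket count as the running crawl.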
As some who are monitoring the Sourceforge project may have noticed (over 200 downloads already!), Heritrix 1.12.1 was released May 6 and is available for...
Thanks for the test case -- it helped clarify what was happening, an error triggered by a call to a public ARCWriter method in custom code rather than typical...
... Yes, but unless you've done a true checkpoint, its contents may be inconsistent. At checkpoints and after a crawl finishes cleanly, the info may be more...
Hi, everyone. I have two problems: 1. IIPC has provided BAT; how does BAT cooperate with Heritrix, NutchWAX, and WERA? Are there any detailed materials on this...
Hello, I was browsing the API for CrawlURIDispositionListener, and I came upon this little blurb: * Also note that the object implementing this interface *...
Hi, I was looking for a way to only crawl top-level domains i.e. using subdomains and subfolders only to search for more links, but purging them after a...
You can set Archiver decide-rules to reject storing of all unwanted URIs. For example: If you want to save only slash pages of second level domains of the com...
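The kind of test such a rule would apply can be sketched as a plain predicate. The message above is cut off, so the details here are assumptions: "second-level domain" is taken to mean exactly one label before `.com`, and "slash page" is taken to mean a path of just `/`. The class name, method name, and regex are hypothetical and are not Heritrix decide-rule configuration.

```java
import java.util.regex.Pattern;

// Sketch only: decides whether a URI would be stored under the assumed rule.
class SlashPageFilter {
    // One host label, then ".com", an optional port, and a path of exactly "/".
    static final Pattern ACCEPT =
        Pattern.compile("^https?://[^/.:]+\\.com(:\\d+)?/$", Pattern.CASE_INSENSITIVE);

    static boolean shouldStore(String uri) {
        return ACCEPT.matcher(uri).matches();
    }

    public static void main(String[] args) {
        // Hypothetical URIs for illustration.
        System.out.println(shouldStore("http://example.com/"));          // stored
        System.out.println(shouldStore("http://www.example.com/"));      // rejected: extra label
        System.out.println(shouldStore("http://example.com/page.html")); // rejected: not a slash page
    }
}
```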