archive-crawler
Messages

Messages 4210 - 4240 of 6147
4210
Has anyone here had any experience crawling Lotus Notes sites and could provide general crawling recommendations? I can get more specific about the issues...
mjjjhjemj
May 1, 2007
3:29 pm
4211
Hi, I configured a crawl job using default settings. Now, I'm tweaking the delay-factor and min-delay-ms values hoping I can speed up the time to fetch the...
pchoy
May 2, 2007
6:20 am
4212
Hi, I am trying to add an HTML Form Credential. However, I couldn't figure out how to supply the form-items, because on the Web UI Settings I couldn't see any...
nildem10
May 2, 2007
1:13 pm
4213
Hi, ... It works, odd :) Thanks, -- Laurian Gridinoc, purl.org/net/laur...
Laurian Gridinoc
lauriangridinoc
May 2, 2007
3:44 pm
4214
Well, we are doing 2 things with Heritrix... The first thing is to read a priority-listed database of URLs and seed Heritrix with them. The thought of...
badswalu
May 2, 2007
5:49 pm
4215
I am using the following uri-canonicalization-rules in my crawl: --> RegexRule enabled: true matching-regex: ^(.+)(?:\/\$)(.*)$ format: ${1}${2} Where as I...
mjjjhjemj
May 2, 2007
11:03 pm
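A minimal sketch of what the rule quoted in message 4215 does to a URI, using plain `java.util.regex` rather than Heritrix's own rule class. The class name, method name, and sample URLs are hypothetical; only the regex and the `${1}${2}` format come from the message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CanonicalizationRuleSketch {
    // The matching-regex from the message, written as a Java string literal.
    private static final Pattern RULE = Pattern.compile("^(.+)(?:\\/\\$)(.*)$");

    // Mimics the "${1}${2}" format: when the rule matches, keep group 1 and
    // group 2 and drop the "/$" between them; otherwise return the input as-is.
    static String apply(String uri) {
        Matcher m = RULE.matcher(uri);
        return m.matches() ? m.group(1) + m.group(2) : uri;
    }

    public static void main(String[] args) {
        // Hypothetical sample URIs, not taken from the thread.
        System.out.println(apply("http://example.com/db.nsf/$about"));
        System.out.println(apply("http://example.com/index.html"));
    }
}
```

Because `(.+)` is greedy, only the last `/$` in a URI is removed when several are present.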
4217
Heritrix URI canonicalization only affects the form of the URI used to determine if the URI has been already-scheduled. It does not change the form of the URI...
Gordon Mohr
gojomo
May 3, 2007
1:10 am
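The point made in message 4217 can be illustrated with a small sketch: the canonical form is only a lookup key for the already-scheduled check. The sketch assumes the truncated sentence goes on to say that the URI actually queued is left in its original form. Everything here (class, methods, the "drop leading www." rule) is hypothetical and is not Heritrix code.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class AlreadySeenSketch {
    private final Set<String> seenKeys = new HashSet<>();   // canonical forms only
    private final Queue<String> toFetch = new ArrayDeque<>(); // original forms only

    // Hypothetical canonicalization rule: drop a leading "www." from the host.
    static String canonicalize(String uri) {
        return uri.replaceFirst("^(https?://)www\\.", "$1");
    }

    // Returns true if the URI was newly scheduled, false if an equivalent
    // URI (same canonical key) had already been scheduled.
    boolean schedule(String uri) {
        if (!seenKeys.add(canonicalize(uri))) {
            return false;
        }
        toFetch.add(uri); // the queued URI keeps its original form
        return true;
    }

    String nextToFetch() {
        return toFetch.poll();
    }

    public static void main(String[] args) {
        AlreadySeenSketch frontier = new AlreadySeenSketch();
        System.out.println(frontier.schedule("http://www.example.com/a"));
        System.out.println(frontier.schedule("http://example.com/a"));
        System.out.println(frontier.nextToFetch());
    }
}
```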
4218
Could you just attach a working order.xml in your reply? Although your explanations are really helpful, I seem to manage to run into some exceptions like the...
Cetin Sert
cetinsert
May 4, 2007
11:26 am
4219
Hi Gordon, I ran into a challenge here as well. I did have a couple of questions on it. Is there anything in the on disk state directory from the old job e.g....
John Lekashman
lekash
May 4, 2007
5:01 pm
4220
Hi, I am doing a quick comparison between wget and heritrix. I configured both to use the same seed: http://news.bbc.co.uk/2/hi/middle_east/default.stm and...
pchoy
May 6, 2007
1:24 am
4221
Hello list, So I've been able to successfully get Heritrix crawling my seed URLs by configuring jobs via the web UI. Now I'd like to get jobs started in an...
Jesse Peterson
jesse.peterson@...
May 7, 2007
9:49 pm
4222
Dear IA, the bug HER-1097 is unfortunately not fixed in the latest release, 1.12.1: http://webteam.archive.org/jira/browse/HER-1097 I think the fix is pretty...
Søren Vejrup Carlsen
svc400
May 8, 2007
3:19 pm
4223
... Thanks for the report... I can see the problem path, but am wondering how you've triggered that path when my crawl tests have not. Can you describe your...
Gordon Mohr
gojomo
May 8, 2007
6:48 pm
4224
Actually, I have given you as much of the call stack as would be meaningful for you. But I have now uploaded my unit test to...
Søren Vejrup Carlsen
svc400
May 8, 2007
9:30 pm
4225
Free service to shorten long URLs, short URL always looks better ! Visitors counter. * Redirection to any page. * Perfect for long Amazon Affiliate URLs. * ...
Hello
mondey_5000
May 9, 2007
3:26 am
4226
Free service to shorten long URLs, short URL always looks better ! Visitors counter. * Redirection to any page. * Perfect for long Amazon Affiliate URLs. * ...
Hello
mondey_5000
May 9, 2007
4:38 am
4227
Free service to shorten long URLs, short URL always looks better ! Visitors counter. * Redirection to any page. * Perfect for long Amazon Affiliate URLs. * ...
Hello
mondey_5000
May 9, 2007
5:28 am
4228
We're running a 10-machine crawl with the HashCrawlMapper. What is the best way to know, given a host name, which crawler 'owns' the host? Cheers, -Joe...
joehung302
May 9, 2007
4:50 pm
4229
... There's a static method on HashCrawlMapper, mapString, that can help: public static String mapString(String key, String reducePattern, long bucketCount) ...
Gordon Mohr
gojomo
May 9, 2007
7:39 pm
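As a rough illustration of the idea behind the question in 4228 and the answer in 4229, the sketch below maps a host name to one of N buckets with a stable hash. It is not Heritrix's `mapString` and does not use its hash function or `reducePattern` argument, so its results will not match a real HashCrawlMapper assignment; the class and method names are hypothetical.

```java
class HostBucketSketch {
    // Maps a host name to a bucket index in [0, bucketCount) using
    // String.hashCode(), which is stable across JVM runs. An assumed
    // stand-in for whatever hash the real mapper uses.
    static long bucketFor(String host, long bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod((long) host.hashCode(), bucketCount);
    }

    public static void main(String[] args) {
        // Hypothetical host and a 10-crawler setup as in the question.
        System.out.println(bucketFor("example.com", 10));
    }
}
```

For the real answer on a running crawl, the static `mapString` method named in the message is the thing to call, with the same reduce pattern and bucket count the crawl is configured with.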
4230
As some who are monitoring the Sourceforge project may have noticed (over 200 downloads already!), Heritrix 1.12.1 was released May 6 and is available for...
Gordon Mohr
gojomo
May 9, 2007
7:47 pm
4231
Thanks for the test case -- it helped clarify what was happening, an error triggered by a call to a public ARCWriter method in custom code rather than typical...
Gordon Mohr
gojomo
May 9, 2007
11:30 pm
4232
Get your own money making Are you unemployed? Are you disabled? Tired of your current job? Are you a college student? Need to make some extra cash? Frustrated...
top10
moneymakerz_5
May 10, 2007
1:39 am
4233
Get your own money making Are you unemployed? Are you disabled? Tired of your current job? Are you a college student? Need to make some extra cash? Frustrated...
top10
moneymakerz_5
May 10, 2007
2:14 am
4234
... Yes, but unless you've done a true checkpoint, its contents may be inconsistent. At checkpoints and after a crawl finishes cleanly, the info may be more...
Gordon Mohr
gojomo
May 10, 2007
11:21 pm
4235
Hi, everyone. I have two problems: 1. IIPC has provided BAT; how does BAT cooperate with Heritrix, NutchWAX, and WERA? Are there any detailed materials on this...
vretr
May 11, 2007
1:50 am
4236
Hello, I was browsing the API for CrawlURIDispositionListener, and I came upon this little blurb: * Also note that the object implementing this interface *...
blah_1977
May 11, 2007
7:56 pm
4237
I am looking for someone that would like to get to know each other. If you would like to know me more, please check myspace here: ...
rally6542
May 12, 2007
2:54 am
4238
Hi, I was looking for a way to crawl only top-level domains, i.e. using subdomains and subfolders only to search for more links, but purging them after a...
vpdn81
May 12, 2007
12:00 pm
4239
You can set Archiver decide-rules to reject storing of all unwanted URIs. For example: If you want to save only slash pages of second level domains of the com...
Igor Ranitovic
iranitovic
May 12, 2007
12:45 pm
4240
Hello, when I try to index a crawl.log created by Heritrix 1.10.0 with dedupdigest (Deduplicator 0.2.0), it throws this exception: Indexing: ...
goblin_cz
May 13, 2007
10:56 am