archive-crawler
Messages

Messages 4750 - 4779 of 6140
4750
Hi, the digest strings in "crawl.log" and in the output of "arcreader -d" seem to be different. Am I doing something wrong, or is this a feature? <crawl.log> ...
sekiguchi_koji
Dec 3, 2007
7:43 am
4751
Hi, I'm new to Heritrix, and keep having this same problem. I've tried the 1.12 and 1.13 versions, but get the same results. All the domains I enter into the...
nfoscarini
Dec 3, 2007
6:21 pm
4752
I'm getting this java.io.exception as well. I can't find any replies to this post. Was this problem resolved for you, and if so, how? ... ServerCache ... ...
nfoscarini
Dec 3, 2007
6:21 pm
4753
Have you customized the crawl configuration at all -- especially with regard to the PreconditionEnforcer, FetchDNS, or Scope components? What happens if you...
Gordon Mohr
gojomo
Dec 3, 2007
6:27 pm
4754
Thanks for the quick reply. I'm running Windows XP. I reviewed the install docs, and found that I didn't create the profile folder under the conf directory....
nfoscarini
Dec 3, 2007
6:59 pm
4755
I want to crawl a web board that has lots of links, e.g. /index.php/topic,7413.0/?board=34.0 /index.php/topic,7413.0/?board=104.0 ...
sudarat.jeampokakul
sudarat.jeam...
Dec 4, 2007
3:43 am
4756
Hi, Sorry for asking some simple questions, but I can't find the answers anywhere. If there are answers posted somewhere (i.e. a Wiki) please direct me there,...
nfoscarini
Dec 4, 2007
9:44 pm
4757
I think there is a URI rule that will strip away everything after the "?" character, but I don't remember the name of it. ... cut off...
nfoscarini
Dec 4, 2007
9:45 pm
4758
A new beta test release of Heritrix-2.0.0, "alpha-2", is now available. Specifically, the beta release is considered to be the autobuild with the identifier...
Gordon Mohr
gojomo
Dec 4, 2007
9:56 pm
4759
Thank you! I can't wait to give this a try. ...
Mathew Nik Foscarini
nfoscarini
Dec 4, 2007
10:14 pm
4760
... There is much documentation on 1.x in the user manual: http://crawler.archive.org/articles/user_manual/index.html Settings in the web UI have at least a...
Gordon Mohr
gojomo
Dec 4, 2007
10:43 pm
4761
... Look in the settings for anything with 'delay' in it. Our default profiles include a minimum 2000ms delay before releasing another URI from the same...
Gordon Mohr
gojomo
Dec 4, 2007
10:57 pm
4762
Can be configured with a canonicalization rule like this: <newObject name="testing_canonicalization" class="org.archive.crawler.url.canonicalize.RegexRule"> ...
Bjarne Andersen
bjarne_dk2000
Dec 5, 2007
1:38 pm
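The canonicalization rule in message 4762 is cut off in this listing, so its actual pattern is not shown here. As a rough illustration of the idea raised in 4757 (dropping everything after the "?"), here is a minimal Python sketch. The pattern and the `strip_query` helper are assumptions made for illustration only; they are not the rule from the post and not part of Heritrix.

```python
import re

# Hypothetical pattern: keep everything before the first "?" and drop the rest.
STRIP_QUERY = re.compile(r"^([^?]*)\?.*$")


def strip_query(uri: str) -> str:
    """Return the URI without its query string; URIs without "?" are unchanged."""
    return STRIP_QUERY.sub(r"\1", uri)


# The two web board links quoted in message 4755 collapse to the same string.
first = strip_query("/index.php/topic,7413.0/?board=34.0")
second = strip_query("/index.php/topic,7413.0/?board=104.0")
```

With a rule of this shape, both links from 4755 would be treated as one URI, which is the effect the poster was after.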
4763
Hi, I'm new to Heritrix and currently using version 1.12.1. I successfully crawled my first few webpages and want to use the ArchiveReader to further process...
Christoph Buescher
christoph_bu...
Dec 5, 2007
6:04 pm
4764
Hello, I have a problem crawling a site written in .NET with a form that uses the VIEWSTATE hidden field for postback actions. So the same url is used for displaying...
susannanino
Dec 5, 2007
6:05 pm
4765
I am new to Heritrix and would like to set up a Heritrix cluster (one instance per machine). I took a look at the hcc javadocs. Is there any more documentation...
Daniel Clark
daniel_a_clark
Dec 5, 2007
6:05 pm
4766
Hello Antonino, there currently is no other place than the JIRA issue to get this feature from and apply the given patch to the codebase yourself. I think the...
Aaron
pandae667
Dec 5, 2007
6:20 pm
4767
I set the config with a canonicalization rule like this: <newObject name="RegRule" class="org.archive.crawler.url.canonicalize.RegexRule"> <boolean...
sudarat.jeampokakul
sudarat.jeam...
Dec 6, 2007
5:17 am
4768
Hello, I am new to Heritrix and stuck with a problem. I am running Heritrix on the latest Ubuntu distribution. When I start a crawl, the job is immediately...
sonjaschaule
Dec 6, 2007
6:03 pm
4769
How can I fetch pages at regular intervals? Thanks...
myepoch2008
Dec 7, 2007
1:15 am
4770
A new release candidate of Heritrix-2.0.0, "RC1", is now available. You can retrieve the release from our maven2 repository: ...
Paul Jack
poetbeware
Dec 7, 2007
3:19 am
4771
Hi all, I'd like to be able to start a crawl which would basically be restricted to only the seed page. So for example, the seed is http://aaa.com/bbb.html. I'd like...
Robert Svoboda
r080tic
Dec 7, 2007
8:43 am
4772
Hi Robert, You can simply set rejectIfTooManyHops (TooManyHopsDecideRule) to 0. That will ensure that only seeds are fetched. Keep in mind that Heritrix...
Igor Ranitovic
iranitovic
Dec 7, 2007
2:43 pm
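To make the hop limit in message 4772 concrete, here is a small Python sketch of a hop-limited traversal over a toy link graph. It only illustrates why a limit of 0 leaves just the seeds; it is not Heritrix code, and `crawl_within_hops`, the link graph, and the `ccc.html` and `ddd.html` pages are hypothetical names invented for the example.

```python
from collections import deque


def crawl_within_hops(seeds, links, max_hops):
    """Return URIs reachable from the seeds in at most max_hops link hops."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    accepted = []
    while queue:
        uri, hops = queue.popleft()
        accepted.append(uri)
        if hops >= max_hops:
            continue  # a URI at the limit is kept, but its links are not followed
        for target in links.get(uri, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return accepted


# Hypothetical link graph: the seed links to ccc.html, which links to ddd.html.
toy_links = {
    "http://aaa.com/bbb.html": ["http://aaa.com/ccc.html"],
    "http://aaa.com/ccc.html": ["http://aaa.com/ddd.html"],
}
seed_only = crawl_within_hops(["http://aaa.com/bbb.html"], toy_links, 0)
```

With `max_hops` set to 0, `seed_only` contains the seed and nothing else; raising it to 1 would also admit `ccc.html`.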
4773
I am not sure what you mean. Could you please elaborate? i....
Igor Ranitovic
iranitovic
Dec 7, 2007
2:46 pm
4774
Maybe the order.xml file is corrupted? Check the file with xmllint (for example): $: xmllint order.xml > /dev/null; echo $? Take care, i....
Igor Ranitovic
iranitovic
Dec 7, 2007
2:52 pm
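If xmllint is not installed, the same kind of well-formedness check suggested in message 4774 can be approximated with Python's standard library. This is a minimal sketch; the `is_well_formed` helper is a name made up for this example, and like plain `xmllint` it checks only that the XML parses, not that the settings inside are valid for Heritrix.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the given bytes parse as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


# Small in-memory examples; a real check would read order.xml in binary mode.
good = is_well_formed(b"<crawl-order><controller/></crawl-order>")
bad = is_well_formed(b"<crawl-order><controller></crawl-order>")
```

To check a real file, pass the result of `open("order.xml", "rb").read()` to the helper.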
4775
Maybe he means something like a cron scheduled job that fetches pages at a given time....
nfoscarini
Dec 7, 2007
3:01 pm
4776
Hi Christoph, Maybe the arc file is corrupted. Does the gzip test pass? (gzip -t CRAWL-20071130161712-00001-graz.arc.gz) Is this always happening after the first...
Igor Ranitovic
iranitovic
Dec 7, 2007
3:23 pm
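The `gzip -t` check from message 4776 can also be run from a script. The sketch below uses Python's standard `gzip` module to decompress a stream to the end and report whether that succeeded; `gzip_ok` is a hypothetical helper name, and the check says nothing about whether the ARC records inside are themselves valid.

```python
import gzip
import io
import zlib


def gzip_ok(fileobj) -> bool:
    """Return True if the whole gzip stream decompresses without errors."""
    try:
        with gzip.GzipFile(fileobj=fileobj) as gz:
            # Read in chunks so large archives are not held in memory.
            while gz.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True


# In-memory demonstration: an intact stream and a truncated copy of it.
intact = gzip.compress(b"x" * 1000)
truncated = intact[: len(intact) // 2]
results = (gzip_ok(io.BytesIO(intact)), gzip_ok(io.BytesIO(truncated)))
```

For a file on disk, open it in binary mode and pass the file object, for example `gzip_ok(open(path, "rb"))`.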
4777
Hi Igor, thanks for the hint, but the files are okay (gzip -t and the command-line arcreader work). I'm trying to iterate over the records from a Java program...
Christoph Buescher
christoph_bu...
Dec 7, 2007
4:00 pm
4778
That does sound like a bug to me. A workaround would be to create a wrapper iterator class, and pass the ArchiveReader iterator to that class's constructor....
Mathew Nik Foscarini
nfoscarini
Dec 7, 2007
4:25 pm
4779
... can anyone help me? Thanks...
myepoch2008
Dec 7, 2007
5:10 pm