archive-crawler
Messages 3574 - 3603 of 6142
3574
If I want to automate this retrieval process using Java, how should I do it? ... os/San_Juan/Ciudades/San_Juan/ ... by...
tomlowct
Dec 5, 2006
2:23 pm
3575
How can I use arcReader.java in my Java program to retrieve files from ARC files automatically? What should I do?...
tomlowct
Dec 5, 2006
2:33 pm
3576
Never tried it, but looking at the API docs (http://crawler.archive.org/apidocs/index.html), I'd say something like this: ARCReader reader =...
Bart Kiers
prometheuzz
Dec 5, 2006
3:12 pm
3577
Hi there, I have some problems with relative URIs (Heritrix 1.8). The following is happening: A page like "http://a.site.com/contents.php" is downloaded. It ...
tizo_trico
Dec 8, 2006
12:56 am
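The message above is truncated, so the exact failure is not known. As a reference point only, here is a minimal Python sketch of standard RFC 3986 reference resolution against the base URI quoted in the message. The relative links are invented for illustration, and this shows what the standard library does, not what Heritrix 1.8 does.

```python
from urllib.parse import urljoin

# Base URI taken from the message; the relative links below are hypothetical.
BASE = "http://a.site.com/contents.php"

def resolve(base_uri: str, link: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base_uri, link)

# Hypothetical links of the kinds a crawler commonly extracts from a page.
LINKS = ["page2.php", "/img/logo.png", "?id=3", "//b.site.com/x"]

RESOLVED = {link: resolve(BASE, link) for link in LINKS}

if __name__ == "__main__":
    for link, absolute in RESOLVED.items():
        print(f"{link!r:20} -> {absolute}")
```

Comparing output like this with the URIs recorded in crawl.log is one way to tell whether a surprising URI comes from the link extractor or from the page itself.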
3578
Hello all, I've been crawling for some time and have now over 7 million files. No problems, although lately Heritrix used to stop itself and I clicked it from...
Kaisa Kaunonen
kaisa_kaunonen
Dec 8, 2006
12:39 pm
3579
Hi there, Yeah, that happens a lot. Several possibilities as to what to do: 1. You remembered to checkpoint, and it's a good checkpoint. Stop the job, stop...
John Lekashman
lekash
Dec 8, 2006
3:26 pm
3580
Hi, I'm running a crawl where initially some URIs were rejected by a DecideRule that was added by mistake. So the URIs show up in the crawl.log as having a...
astar_t
Dec 8, 2006
6:50 pm
3581
... That's odd. With the 'force-fetch' flag enabled, the URLs should be crawled whether they've been seen or not. Did the URLs show again in crawl.log after...
Michael Stack
stackarchiveorg
Dec 8, 2006
7:05 pm
3582
... As Stack notes, this should work. The 'force-fetch' flag means to ignore the already-included status. However, it doesn't ignore scoping -- are you sure...
Gordon Mohr
gojomo
Dec 8, 2006
8:51 pm
3583
I am pretty certain there is something I am not doing right; I am just trying to pinpoint the issue. I am seeing pages http://xyz.com/target and ...
Anmol Bhasin
molzbh
Dec 8, 2006
11:39 pm
3584
Hi Anmol, In my experience, these two URLs are usually distinct documents where the first one is a redirect to the second one. So, in most cases you want to get...
Igor Ranitovic
iranitovic
Dec 9, 2006
12:50 am
3585
Thanks Igor, but have a look at this document. http://dblab.ssu.ac.kr/publication/LeKi05a.pdf They seem to have done a fair bit of experimentation with this issue....
Anmol Bhasin
molzbh
Dec 9, 2006
2:45 am
3586
Hi Anmol, ... Thanks for the paper. I just briefly scanned the section on trailing slash normalization and the results of 50% duplication are expected without...
Igor Ranitovic
iranitovic
Dec 9, 2006
7:14 pm
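To make the trailing-slash discussion concrete, here is a minimal Python sketch that groups URLs differing only by a trailing slash, so each pair can be inspected instead of merged blindly. It assumes the second URL in the thread is the slash-terminated form of http://xyz.com/target, which the truncated message does not confirm. The helper names are hypothetical and this is not a Heritrix canonicalization rule.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

def slash_key(url: str) -> str:
    """Return the URL with trailing slashes removed from the path.

    The fragment is dropped; the query string is kept as-is.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

def slash_variants(urls):
    """Map each slash-insensitive key to the distinct URLs that share it.

    Only keys with more than one variant are returned.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[slash_key(url)].add(url)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}

if __name__ == "__main__":
    crawled = [
        "http://xyz.com/target",
        "http://xyz.com/target/",  # assumed variant, not shown in the thread
        "http://xyz.com/other",
    ]
    for key, variants in slash_variants(crawled).items():
        print(key, "->", variants)
```

Reporting the pairs rather than collapsing them fits the point made in the thread that the two forms are usually distinct documents, with one redirecting to the other.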
3587
Hi everyone, I'm currently having a problem that I'm not able to solve. I am currently spidering a rather big set of seeds and thus my crawl.log gets really...
pandae667
Dec 10, 2006
8:37 pm
3588
Hi again, actually upon further analysis, you are right, the URI I had force fed into the crawler via JMX did indeed eventually get crawled. When I took a...
astar_t
Dec 11, 2006
5:27 pm
3589
... If you pause a crawl, towards the base of the index.jsp console page appears 'View or Edit Frontier URIs'. Click here. Allows adding/deleting URIs...
Michael Stack
stackarchiveorg
Dec 11, 2006
7:26 pm
3590
... Hey Olaf: Postprocessing crawl.log w/ perl/python/awk/etc. seems much easier than modifying Heritrix but if you insist, one suggestion for how to change ...
Michael Stack
stackarchiveorg
Dec 11, 2006
8:20 pm
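A rough idea of what post-processing crawl.log in Python could look like is sketched below. It assumes the log is whitespace-separated with the timestamp, fetch status, size, and URI as the first four fields; check that against your own log before relying on it. The sample lines are invented, and the names `Entry`, `parse`, and `status_counts` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple

class Entry(NamedTuple):
    timestamp: str
    status: int
    size: str   # kept as text because the size may be logged as "-"
    uri: str

def parse(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed line, skipping anything else."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            status = int(fields[1])
        except ValueError:
            continue
        yield Entry(fields[0], status, fields[2], fields[3])

def status_counts(lines: Iterable[str]) -> Counter:
    """Count how many log lines were recorded for each fetch status."""
    return Counter(entry.status for entry in parse(lines))

# Invented sample lines, for illustration only.
SAMPLE = [
    "20061210203701123 200 5120 http://example.com/ - - text/html #001",
    "20061210203702456 404 310 http://example.com/missing L http://example.com/ text/html #002",
    "20061210203703789 -9998 - http://example.com/blocked L http://example.com/ unknown #003",
]

if __name__ == "__main__":
    print(status_counts(SAMPLE))
    print([e.uri for e in parse(SAMPLE) if e.status == 404])
```

For a real log, pass an open file object instead of `SAMPLE`; `parse` reads line by line, so a very large crawl.log does not have to fit in memory.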
3591
... That is an interesting paper, thanks for the pointer. However, after skimming it over, I don't think it offers much guidance for typical Heritrix uses, for...
Gordon Mohr
gojomo
Dec 11, 2006
9:50 pm
3592
Hello, will Heritrix support Google Sitemaps in the future? Thank you, Adam Brokes...
goblin_cz
Dec 13, 2006
9:45 am
3593
We are migrating from Heritrix 1.8 to Heritrix 1.10.1. The attribute 'scope-embedded-links', which is set to true in our templates, has now been...
Søren Vejrup Carlsen
svc400
Dec 14, 2006
6:37 pm
3594
Hi *, URLs marked as duplicates are being added to the DeDuplicator-index when running the DigestIndexer in its current implementation. Wouldn't it make sense...
Maximilian Schoefmann
schoefma@...
Dec 14, 2006
8:38 pm
3595
Hey ... revision 1.7 date: 2006/07/14 23:43:56; author: gojomo; state: Exp; lines: +3 -24 Fix for [ 1522108 ] LinksScoper scope-embedded-links...
Michael Stack
stackarchiveorg
Dec 14, 2006
8:47 pm
3596
Even at the risk of holding a monologue here :-) ... let me share my findings with a patched version of the DeDuplicator: The processed logfile is the second of...
Maximilian Schoefmann
schoefma@...
Dec 15, 2006
2:31 pm
3597
Hey Max, The reason for adding also those marked duplicates is that the typical usage scenario has been to rebuild the index each time. If you are adding to...
Kristinn Sigurðsson
kristsi25
Dec 15, 2006
2:54 pm
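The trade-off between rebuilding the index each time and adding to it can be illustrated with a toy model. This is only a sketch under assumptions: the index is modelled as a plain URL-to-digest dictionary, the entries and digests are invented, and none of this reflects the actual DeDuplicator or DigestIndexer code.

```python
from typing import Dict, Iterable, Tuple

# (url, content digest, was the entry marked as a duplicate in the log?)
LogEntry = Tuple[str, str, bool]

def add_to_index(index: Dict[str, str],
                 entries: Iterable[LogEntry],
                 include_duplicates: bool = True) -> Dict[str, str]:
    """Add log entries to a URL -> digest index and return the index."""
    for url, digest, is_dup in entries:
        if is_dup and not include_duplicates:
            continue
        index[url] = digest
    return index

# Invented second-crawl log: /a was unchanged (duplicate), /b has new content.
CRAWL2 = [
    ("http://example.com/a", "sha1:AAA", True),
    ("http://example.com/b", "sha1:BBB2", False),
]

# Rebuilding from scratch without duplicates loses the unchanged URL.
rebuilt_without = add_to_index({}, CRAWL2, include_duplicates=False)
rebuilt_with = add_to_index({}, CRAWL2, include_duplicates=True)

# Adding to an index that already holds the first crawl keeps /a either way.
first_crawl = {"http://example.com/a": "sha1:AAA", "http://example.com/b": "sha1:BBB1"}
incremental = add_to_index(dict(first_crawl), CRAWL2, include_duplicates=False)

if __name__ == "__main__":
    print("rebuilt without duplicates:", rebuilt_without)
    print("rebuilt with duplicates:   ", rebuilt_with)
    print("incremental, no duplicates:", incremental)
```

In this model, skipping duplicates is harmless only when the index is updated incrementally, which is one way to read the rebuild-each-time rationale given in the reply above.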
3598
Hey Kris, ... I'm doing very frequent crawls of the same sites and have automated updating the deduplicator index after every crawl. Your DeDuplicator is ...
Maximilian Schoefmann
schoefma@...
Dec 15, 2006
4:16 pm
3599
I've added the patch to HEAD and made a new interim build (20061218) that includes it. - Kris ... parseLinde is ... want to...
Kristinn Sigurðsson
kristsi25
Dec 18, 2006
12:52 pm
3600
Hi everyone, I was able to track down the source of the NPEs. They were caused by the AddRedirectFromRootServerToScope decideRule. My decideRule chain starts...
pandae667
Dec 19, 2006
7:26 am
3601
That's an ugly one, Olaf. Thanks for persevering. I added a check for null host basename to the DR AddRedirectFromRootServerToScope and I just made it so URLs...
Michael Stack
stackarchiveorg
Dec 19, 2006
6:13 pm
3602
I am trying to install rain bow but it does not compile on gcc. Maybe it is because of a newer version of gcc, as rain bow had its last version in 2002. I tried...
abhigyansharma
Dec 19, 2006
7:00 pm
3603
I've written a little web app running on tomcat, which uses heritrix to crawl specific sites. Lately, I've been running into the following error messages on...
blah_1977
Dec 22, 2006
3:29 am