archive-crawler
Messages 2725 - 2754 of 6147
2725
I've set max-hops to 2 in my TooManyHopsDecideRule in an effort to start from small sample crawls and build up from there, in large part based on research...
Adam Fisk
afisk3
Mar 1, 2006
5:08 pm
2726
I believe you want '<boolean name="if-match-return">false</boolean>' in your case, no? Tree's filter is for *including* only html in the crawl. If you want...
Adam Fisk
afisk3
Mar 1, 2006
5:12 pm
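The flag discussed in message 2726 can be pictured with a small sketch. This is not Heritrix code: `regex_filter_accepts` and `HTML_PATTERN` are hypothetical names. The assumption, which the truncated snippet does not confirm, is that `if-match-return` is the value the filter returns when its regular expression matches the whole URI, with the opposite value returned otherwise.

```python
import re


def regex_filter_accepts(uri: str, pattern: str, if_match_return: bool) -> bool:
    """Hypothetical stand-in for a regex filter with an if-match-return flag.

    Assumption: a full match yields if_match_return, and anything else
    yields the opposite value.
    """
    matched = re.fullmatch(pattern, uri) is not None
    return if_match_return if matched else not if_match_return


# Illustrative pattern only; it is not taken from the thread.
HTML_PATTERN = r".*\.html?"

for flag in (True, False):
    for uri in ("http://example.org/index.html", "http://example.org/logo.png"):
        print(flag, uri, regex_filter_accepts(uri, HTML_PATTERN, flag))
```

Under that assumption, flipping the flag inverts the filter's answer for every URI, which is why a single boolean can turn an "include only HTML" filter into its opposite.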
2727
... Be a bit careful here. The job name is actually a little more involved. Notice how in CrawlJob#preRegister (line 2041) we build up the name adding the...
stack
stackarchiveorg
Mar 1, 2006
5:18 pm
2728
... Yes. The parse is unable to pull a 'host' from URIs the likes of "http:/" or "http:/~galvan/index.html". ... Agreed. I tried them in our test harness and...
stack
stackarchiveorg
Mar 1, 2006
5:57 pm
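The two malformed URIs quoted in message 2728 lack the "//" that introduces the authority part, so a generic URI parser has nothing to treat as a host. The sketch below shows this with Python's standard `urllib.parse`, which is a stand-in here and not the parser Heritrix uses; `has_host` is a hypothetical helper and `example.org` is a placeholder.

```python
from urllib.parse import urlsplit


def has_host(uri: str) -> bool:
    """Return True when Python's standard parser finds a host in the URI."""
    return urlsplit(uri).hostname is not None


for uri in (
    "http:/",
    "http:/~galvan/index.html",
    "http://example.org/~galvan/index.html",  # well-formed, for comparison
):
    print(uri, "->", has_host(uri))
```

Only the third URI yields a host; the first two parse as a scheme plus a path.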
2729
... I'm surprised it's crawling anything of interest at all: 2 hops deep is not very much, and even a broad crawl started from a massive directory page that...
Gordon Mohr (archive....
gojomo
Mar 1, 2006
6:32 pm
2730
Hi Gordon- Looks like that's exactly what's happening. The crawl log is only still reporting 3 sites out of 20. One of them, washingtonpost.com, looks like...
Adam Fisk
afisk3
Mar 1, 2006
7:29 pm
2731
Ah -- I was using CrawlJob from what I think was the 1.6 release, and preRegister was simpler then. More from me after some more experimentation. I've gotten...
Shifra Raffel
shifrasr
Mar 2, 2006
1:36 am
2732
... That's true. ... Where are they coming from? (I'd guess they are because you're invoking operations against a null bean -- and you're getting the null bean...
stack
stackarchiveorg
Mar 2, 2006
4:35 pm
2733
Hello - I don't know if this program is overkill or just right, so, I figured I would ask the group. I need to create an application that will check links and...
jaemsjohn
Mar 6, 2006
3:38 pm
2734
... It's straightforward enough to plug in a little module that, per page, asks Heritrix for the list of links found to try against an external database. But...
stack@...
stackarchiveorg
Mar 6, 2006
4:10 pm
2735
All due respect to Heritrix, this would be much easier implemented in Perl with the various CPAN libraries than trying to wedge it into Heritrix. -- Tom...
Tom Emerson
tree02139
Mar 6, 2006
4:40 pm
2736
Hi, We got dinged again by using Heritrix in that a crawlee complained that we were ignoring their robots.txt file. On the face of it, they look like they are...
Karl Wright
daddywri
Mar 8, 2006
5:58 pm
2737
Hi Karl, Heritrix rechecks robots.txt files every 24 hours by default. Did you change that value? It seems that this robots file has been recently modified and...
Igor Ranitovic
iranitovic
Mar 8, 2006
6:19 pm
2738
... It seems clear from the logs that 24 hours elapsed since whatever change occurred. The date on the robots.txt is 3/6 and the date of the crawl is 3/8. I...
Karl Wright
daddywri
Mar 8, 2006
6:29 pm
2739
The Last-Modified date from the server shows 07 Mar 2006 23:20:42 GMT and the log entry is 08 Mar 2006 16:11:41 GMT GET...
Igor Ranitovic
iranitovic
Mar 8, 2006
6:59 pm
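The two timestamps in message 2739 can be compared against the 24-hour default recheck interval mentioned in message 2737. The sketch below only does the arithmetic; `parse_gmt` and `RECHECK_INTERVAL` are hypothetical names, and the format string assumes English month abbreviations. What the gap implies about the crawler's behavior is not settled by the truncated snippets.

```python
from datetime import datetime, timedelta, timezone

# Default recheck interval as quoted in message 2737.
RECHECK_INTERVAL = timedelta(hours=24)
STAMP_FORMAT = "%d %b %Y %H:%M:%S GMT"


def parse_gmt(stamp: str) -> datetime:
    """Parse a timestamp such as '07 Mar 2006 23:20:42 GMT' as UTC."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)


last_modified = parse_gmt("07 Mar 2006 23:20:42 GMT")
log_entry = parse_gmt("08 Mar 2006 16:11:41 GMT")
gap = log_entry - last_modified

print(gap)                     # 16:50:59
print(gap < RECHECK_INTERVAL)  # True
```

So the logged request came a little under 17 hours after the file's reported modification time, which is inside one 24-hour window.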
2740
Has anyone seen this error before? I used to have a very stable configuration that I could run for at least a couple of weeks, but now I'm testing out the new...
joehung302
Mar 8, 2006
7:28 pm
2741
... Tell us more about the circumstance Joe? By any chance, is this crawl based on a checkpoint recovery and perhaps you switched Heritrix versions...
Michael Stack
stackarchiveorg
Mar 8, 2006
8:55 pm
2742
... If you are saving crawl results to ARCs, the robots.txt that was consulted will be in the crawl ARCs -- as will be each daily refetch. ... Such bugs are...
Gordon Mohr (archive....
gojomo
Mar 9, 2006
10:02 am
2743
... we're not; we're discarding them. So we are out of luck. ... I was initially suspicious because the readLine() javadoc said it only recognized line...
Karl Wright
daddywri
Mar 9, 2006
12:25 pm
2744
Hi all, Just wanted to let you know we've started this page on embedding Heritrix. http://crawler.archive.org/cgi-bin/wiki.pl?EmbeddingHeritrix It's just a...
Shifra Raffel
shifrasr
Mar 10, 2006
1:17 am
2745
... crawl ... haven't ... We just started the production crawl --- 8 crawlers, each equipped with 3TB of storage (yeah I took your suggestion seriously, plenty...
joehung302
Mar 13, 2006
11:52 pm
2746
From the developer manual: "For each processor only one instance is created per crawl. As there are multiple threads running these processors must be carefully...
Samuel
samendonca
Mar 14, 2006
1:58 pm
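The manual passage quoted in message 2746 is cut off, but the situation it describes (one processor instance shared by many worker threads) can be sketched generically. The code below is not Heritrix code: `CountingProcessor` and `run_workers` are hypothetical, and guarding shared state with a lock is just one common way to make such an instance safe to call concurrently.

```python
import threading


class CountingProcessor:
    """Hypothetical processor: one instance, called from many worker threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.processed = 0

    def process(self, uri: str) -> None:
        # The read-modify-write on shared state is guarded by the lock.
        with self._lock:
            self.processed += 1


def run_workers(processor: CountingProcessor, workers: int = 8,
                uris_per_worker: int = 1000) -> int:
    """Call the single shared processor from several threads at once."""
    def work(worker_id: int) -> None:
        for i in range(uris_per_worker):
            processor.process(f"http://example.org/{worker_id}/{i}")

    threads = [threading.Thread(target=work, args=(n,)) for n in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processor.processed


print(run_workers(CountingProcessor()))  # 8000
```

State that lives only in local variables inside `process` needs no such protection; only fields on the shared instance do.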
2747
Sorry if this is an abuse of this mailing list but it seemed the best way to get this out to developers who have experience with heritrix and might be...
Andrea Goethals
andrea_goethals@...
Mar 17, 2006
6:41 pm
2748
Hi there, Is there any way to reuse a job after it has finished? I mean, after a job finishes its crawl, could it be enqueued again automatically? Thanks very...
tizo_trico
Mar 20, 2006
11:09 pm
2749
I usually create a new job by changing only the previous job's name, adding a digit; the new job then runs automatically. ... ...
Yan Zhang
yzhang_il
Mar 20, 2006
11:43 pm
2750
I checked the manual and didn't find the meaning of 'P' in the discovery path (crawl.log). As for 'R', will it affect the link depth of a crawl job....
Yan Zhang
yzhang_il
Mar 20, 2006
11:49 pm
2751
... No. Automated rescheduling of jobs is not part of the crawler. Will Yan Zhang's suggestion work for you? You could create and queue up lots of the same...
Michael Stack
stackarchiveorg
Mar 21, 2006
3:13 pm
2752
... See 'Discovery path' in the glossary section of the user manual. 'P' is for prerequisites (dns or robots). ... It looks like everything on the discovery path gets...
stackarchiveorg
Mar 21, 2006
3:27 pm
2753
Hi St.Ack, I used the PathScope for that job. Basically I used the default settings. The fields I changed are: scope: max-link-hops: 8 ...
Yan Zhang
yzhang_il
Mar 21, 2006
6:47 pm
2754
The 'max-link-hops' value for the 'classic' scopes (BroadScope, DomainScope, HostScope, PathScope) only counts plain navigation-link 'L' hops. So in each of...
Gordon Mohr (archive....
gojomo
Mar 21, 2006
7:51 pm
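The counting rule stated in message 2754 can be sketched against discovery-path strings as they appear in crawl.log. The helpers `count_link_hops` and `within_max_link_hops` are hypothetical, not Heritrix API. The letters used are 'L' (navigation link), 'R' (redirect), 'E' (embed) and 'P' (prerequisite), and the sketch assumes a URI stays in scope while its 'L' count does not exceed the limit; the limit of 8 is the value from message 2753.

```python
def count_link_hops(discovery_path: str) -> int:
    """Count only plain navigation-link 'L' hops in a discovery path."""
    return discovery_path.count("L")


def within_max_link_hops(discovery_path: str, max_link_hops: int) -> bool:
    # Assumption: a URI is ruled out only when its 'L' count exceeds the limit.
    return count_link_hops(discovery_path) <= max_link_hops


MAX_LINK_HOPS = 8  # value from message 2753

for path in ("LLE", "LRLEP", "LLLLLLLLEP", "LLLLLLLLL"):
    print(path, len(path), count_link_hops(path),
          within_max_link_hops(path, MAX_LINK_HOPS))
```

Note that "LLLLLLLLEP" is ten hops long in total yet counts as only eight link hops, so under this rule a path can be longer than the configured limit and still pass.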