archive-crawler
Messages

Messages 3496 - 3525 of 6142
3496
I have a crawl that ran for 26 days, after which I paused it. Now I am trying to resume the crawl; it has been more than 24 hours, and the crawler is still...
nt_bdr
Nov 1, 2006
2:04 pm
3497
I have the scope of the crawl set to SURT Scope. If I have the following entry in the seed list, www.xdrive.com, will the crawler crawl xdrive.com as well?...
nt_bdr
Nov 2, 2006
3:44 pm
3498
... Both the SurtPrefixScope and the SurtPrefixedDecideRule (used in a DecidingScope) will convert that seed into an implied SURT prefix that includes the...
Gordon Mohr
gojomo
Nov 2, 2006
8:58 pm
3499
Greetings, I have been working with Heritrix since last month on the Windows platform, using the command line rather than its web UI. It is working fine. While using this one...
fandufunkyman
Nov 5, 2006
6:55 am
3500
I have Google'd and Yahoo'd until my head is about to fall off and I don't have a good answer. Hoping you folks can lend a thought. I have been testing...
Raj Bala
raj_bala
Nov 6, 2006
12:54 am
3501
The Free memory reported by top is the Resident Memory size of the Heap, Code, Stack, etc. What is reported on the console is purely the HEAP. ... -- It's fun...
Anmol Bhasin
molzbh
Nov 6, 2006
1:30 am
3502
Hi friends, I'm using Heritrix to crawl web pages. For this I use seed.txt and order.xml and run the Heritrix batch, and it crawled the website specified in...
fandufunkyman
Nov 6, 2006
12:35 pm
3503
Committed. Thanks for the contribution, Olaf. St.Ack...
Michael Stack
stackarchiveorg
Nov 6, 2006
8:17 pm
3504
I'm having the same problem. I've just built and run Heritrix through Eclipse. The webapp is running fine, but under the "Modules" tab when editing a...
blah_1977
Nov 7, 2006
4:51 am
3505
Hi, I'm using Heritrix 1.10.1 and am not able to use SURTs in order.xml and seeds.txt. I don't want to crawl a few pages of my site with Heritrix. How can I...
fandufunkyman
Nov 7, 2006
12:05 pm
3506
Hi everyone, Just a quick update for those interested in the DeDuplicator. As of today, the website for it is http://deduplicator.sourceforge.net. The software...
Kristinn Sigurðsson
kristsi25
Nov 7, 2006
3:38 pm
3507
I found the solution to my own problem. In the Eclipse project, you need to add /src/conf/ to the classpath. Hope this helps anybody else who is having the...
blah_1977
Nov 7, 2006
8:45 pm
3508
Did you see the '2.4 Eclipse' section in the Developer's Manual (http://crawler.archive.org/articles/developer_manual/building.html#eclipse)? Had you set...
Michael Stack
stackarchiveorg
Nov 7, 2006
9:19 pm
3509
Yup, I had set '-Dheritrix.development' in the "VM arguments" section in the "Arguments" tab of the Debug configuration setup dialog in Eclipse. Even with...
blah_1977
Nov 8, 2006
12:50 am
3510
Hi Max, I am using Heritrix 1.10.1 and JDK 1.5 on the Windows platform. I am able to successfully crawl the site specified in the seeds.txt file. I am getting the...
jls_nayak1983
Nov 8, 2006
1:01 pm
3511
... Thanks for the response. So are you saying, if I add +xdrive.com to the seed instead of www.xdrive.com, then it will crawl both www.xdrive.com and...
nt_bdr
Nov 8, 2006
3:34 pm
3512
... I believe that's how it works. You might give it a try. Yours, St.Ack...
Michael Stack
stackarchiveorg
Nov 8, 2006
5:32 pm
3513
Check out '6.1.1.2. DecidingScope' in the user manual: http://crawler.archive.org/articles/user_manual/config.html. Try adding decide rule(s) to REJECT your...
Michael Stack
stackarchiveorg
Nov 8, 2006
5:44 pm
3514
... Sounds like you got things working, but I'm mildly interested in why you had to tinker with the Eclipse settings. There is already an entry in the...
Michael Stack
stackarchiveorg
Nov 8, 2006
6:24 pm
3515
... Yes. And also subdomain1.xdrive.com, subdomain2.xdrive.com, etc. if such subdomains exist and are discovered. - Gordon @ IA...
Gordon Mohr
gojomo
Nov 8, 2006
8:05 pm
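
The SURT prefix behaviour discussed in messages 3498, 3511 and 3515 can be illustrated with a rough sketch. This is not Heritrix code: the function names (surt_prefix, to_surt, in_scope) are hypothetical, and the handling of a leading "+", of a missing scheme, and of trailing slashes is a simplifying assumption. Only the host-reversed form, as in http://(com,xyz,www, from message 3523, comes from the thread.

```python
from urllib.parse import urlsplit


def _reversed_host(host: str) -> str:
    # "www.xyz.com" -> "com,xyz,www," (labels reversed, each followed by a comma)
    return "".join(label + "," for label in reversed(host.split(".")))


def surt_prefix(entry: str) -> str:
    """Turn a host or URL entry into a SURT-style prefix (simplified model)."""
    entry = entry.lstrip("+")          # assumption: "+" only marks the entry as a prefix
    if "://" not in entry:
        entry = "http://" + entry      # assumption: bare hosts default to http
    parts = urlsplit(entry)
    prefix = f"{parts.scheme}://({_reversed_host(parts.hostname or '')}"
    if parts.path:
        # A path closes the host part, so only that exact host can match.
        prefix += ")" + parts.path
    return prefix


def to_surt(url: str) -> str:
    """Convert a full URL into SURT form so it can be compared with prefixes."""
    parts = urlsplit(url)
    return f"{parts.scheme}://({_reversed_host(parts.hostname or '')}){parts.path or '/'}"


def in_scope(url: str, prefixes: list[str]) -> bool:
    """A URL is in scope when its SURT form starts with any accepted prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    prefixes = [surt_prefix("+xdrive.com")]
    for url in ("http://www.xdrive.com/", "http://subdomain1.xdrive.com/a"):
        print(url, in_scope(url, prefixes))
```

Under these assumptions, an open prefix such as http://(com,xdrive, covers www.xdrive.com and any other subdomain, because every such host begins with the same reversed labels. A prefix built from www.xdrive.com does not cover the bare xdrive.com host.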
3516
Hello, is there any way to crawl sites such as: http://www.export.cz/index.asp?p=info or http://katalogy.nm.cz/opac/ns/index_ph.php This is JavaScript...
goblin_cz
Nov 8, 2006
10:15 pm
3517
Hello all, I'm trying to analyze the hierarchy of some crawled pages. From a CrawlURI object, is there any way to get the hierarchy of referral URIs all the...
blah_1977
Nov 9, 2006
3:24 am
3518
Hi, thanks for replying, but I do not understand the point you are making. Please tell me specifically what entry I have to make and where. Please tell me,...
jls_nayak1983
Nov 9, 2006
12:07 pm
3519
Hi, you could also use the NotMatchesRegExpDecideRule in your rule chain. Thus, you can avoid having to deal with special surts-files. The RegEx can look very...
Maximilian Schoefmann
schoefma@...
Nov 9, 2006
1:47 pm
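
As a rough illustration of the suggestion in message 3519, the sketch below models a "not matches" regular-expression rule in Python. It is not the Heritrix NotMatchesRegExpDecideRule class itself: the decision constants, the function name, the /private/ pattern and the whole-URI matching are all assumptions made for the example.

```python
import re

# Hypothetical stand-ins for the decisions a rule in the chain can return.
REJECT, PASS = "REJECT", "PASS"

# Hypothetical pattern: matches every URI that does NOT contain "/private/".
KEEP_PATTERN = r"(?!.*/private/).*"


def not_matches_regexp_rule(uri: str, pattern: str = KEEP_PATTERN,
                            decision: str = REJECT) -> str:
    """Apply `decision` when the URI does not match `pattern`; otherwise pass.

    Assumption: the pattern must match the whole URI, hence re.fullmatch.
    """
    if re.fullmatch(pattern, uri) is None:
        return decision
    return PASS


if __name__ == "__main__":
    for uri in ("http://www.example.com/index.html",
                "http://www.example.com/private/report.html"):
        print(uri, not_matches_regexp_rule(uri))
```

In this model, the pages to exclude are the ones the pattern fails to match, so a single expression can stand in for a separate list of SURT entries.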
3520
Crawl Scope question again. I want to crawl a particular site and only the "pages" linked to it and not the entire linked sites. I tried with Max-hop-filter 1...
Anmol Bhasin
molzbh
Nov 9, 2006
6:57 pm
3521
... If I understand correctly, you want to crawl a site, plus any pages linked-to from that site -- "one hop off the target site". The various max-hops...
Gordon Mohr
gojomo
Nov 9, 2006
8:41 pm
3522
Thanks. What would a seed constitute ... just a page? As in, if I put the seed as www.xyz.com, then the index is the seed. Am I right, which basically means...
Anmol Bhasin
molzbh
Nov 9, 2006
9:04 pm
3523
I got down to the SURT prefix decide filters you mentioned. I placed the URL http://www.xyz.com in the SURTFILE. In the Surtfile_dump I obtain http://(com,xyz,www, ...
Anmol Bhasin
molzbh
Nov 9, 2006
10:29 pm
3524
Hi, I am also having the same problem while crawling sites like this... Regards....
callforshadab
Nov 10, 2006
4:26 am
3525
Hi, thanks for your kind and prompt reply, but I'm very sorry to say that I'm still confused. I'm configuring everything in order.xml and giving the URL in...
jls_nayak1983
Nov 10, 2006
5:05 am