I have a crawl that ran for 26 days, after which I paused it. Now I am trying to resume the crawl; it has been more than 24 hours, and the crawler is still...
I have the scope of the crawl set to SURT Scope. If I have the following entry in the seed list: www.xdrive.com, will the crawler crawl xdrive.com as well?...
... Both the SurtPrefixScope and the SurtPrefixedDecideRule (used in a DecidingScope) will convert that seed into an implied SURT prefix that includes the...
Greetings, I have been working with Heritrix on the Windows platform for the last month, using the command line rather than its web UI. It is working fine. While using this one...
I have Google'd and Yahoo'd until my head is about to fall off and I don't have a good answer. Hoping you folks can lend a thought. I have been testing...
The Free memory reported by Top is the Resident Memory size of the Heap, Code, Stack etc. What is reported on the console is purely the HEAP. ... -- Its fun...
Hi friends, I'm using Heritrix to crawl web pages. For this I use seed.txt and order.xml and run Heritrix batch, and it crawled the website specified in...
I'm having the same problem. I've just built and run Heritrix through Eclipse. The webapp is running fine, but under the "Modules" tab when editing a...
Hi, I'm using Heritrix 1.10.1 and am not able to use SURT in order.xml and seeds.txt. I don't want to crawl a few pages of my site using Heritrix. How can I...
Hi everyone, Just a quick update for those interested in the DeDuplicator. As of today, the website for it is http://deduplicator.sourceforge.net. The software...
I found the solution to my own problem. In the eclipse project, you need to add /src/conf/ to the classpath. Hope this helps anybody else who is having the...
Did you see the '2.4 Eclipse' section in the Developer's Manual (http://crawler.archive.org/articles/developer_manual/building.html#eclipse)? Had you set...
Yup, I had set '-Dheritrix.development' in the "VM arguments" section in the "Arguments" tab of the Debug configuration setup dialog in eclipse. Even with...
Hi Max, I am using Heritrix 1.10.1 and JDK 1.5 on the Windows platform. I am able to successfully crawl the site specified in the seeds.txt file. I am getting the...
... Thanks for the response. So are you saying, if I add +xdrive.com to the seed instead of www.xdrive.com, then it will crawl both www.xdrive.com and...
Check out '6.1.1.2. DecidingScope' in the user manual: http://crawler.archive.org/articles/user_manual/config.html. Try adding decide rule(s) to REJECT your...
Hello, is it possible to crawl a site such as: http://www.export.cz/index.asp?p=info or http://katalogy.nm.cz/opac/ns/index_ph.php This is javascript...
Hello all, I'm trying to analyze the hierarchy of some crawled pages. From a CrawlURI object, is there any way to get the hierarchy of referral URIs all the...
Hi, thanks for replying, but I do not get the point you are making. Please tell me specifically what entry I have to make and where. Please tell me ,...
Hi, you could also use the NotMatchesRegExpDecideRule in your rule chain. Thus, you can avoid having to deal with special surts-files. The RegEx can look very...
Maximilian Schoefmann, schoefma@..., Nov 9, 2006 1:47 pm, 3520
Crawl Scope question again. I want to crawl a particular site and only the "pages" linked to it and not the entire linked sites. I tried with Max-hop-filter 1...
... If I understand correctly, you want to crawl a site, plus any pages linked-to from that site -- "one hop off the target site". The various max-hops...
Thanks. What would a seed constitute ... just a page? As in, if I put the seed as www.xyz.com then the index is the seed. Am I right, which basically means...
I got down to the SURT Prefix Decide Filters you mentioned. I placed the URL http://www.xyz.com in the SURTFILE. In the Surtfile_dump I obtain http://(com,xyz,www, ...
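For readers following the SURT question above, here is a rough sketch of how a plain URL could map to a SURT prefix like the one in the dump. This is not Heritrix code; `surt_prefix` is a hypothetical helper written for illustration. The no-path case matches the output quoted in the message; the handling of URLs with a path (closing the host with `)` and trimming to the last slash) is an assumption about how the prefix is formed.

```python
from urllib.parse import urlsplit


def surt_prefix(url: str) -> str:
    """Illustrative only: turn a simple http(s) URL into a SURT-style prefix.

    Hypothetical helper, not part of Heritrix.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Reverse the host labels: www.xyz.com -> com,xyz,www,
    reversed_host = ",".join(reversed(host.split("."))) + ","
    if parts.path == "":
        # No path given: leave the host part open, as in the dump quoted above.
        return f"{parts.scheme}://({reversed_host}"
    # Assumption: with a path, close the host and keep the path up to the last slash.
    trimmed_path = parts.path[: parts.path.rfind("/") + 1]
    return f"{parts.scheme}://({reversed_host}){trimmed_path}"


print(surt_prefix("http://www.xyz.com"))
```

Because the open form `http://(com,xyz,www,` is a plain string prefix, it would also match hosts under www.xyz.com, but not a different host such as xyz.com itself.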
Hi, thanks for your kind and active reply, but I'm very sorry to say that I'm still confused. I'm configuring everything in order.xml and giving the URL in...