Hi, the Czech crawl again :). I started with the default profile, set some specific rules (100MB limit etc.), and ran the crawl again. You can find the...
Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE As Noah notes, this is a known issue which should be fixed in a future 2.x release. In...
Hi to all, Is there a way to fetch YouTube videos with Heritrix (1.14.x)? We have been trying to download selected YouTube videos in our thematic crawls but...
Tomas Ukkonen
tomas.ukkonen@...
Dec 8, 2008 6:50 pm
5597
Hi, we have a custom profile to crawl a small subset of pages from each site in a seed file, and we wanted to limit our crawler such that it only download...
As also mentioned in the referenced message, your "htmlContentTypeFilter" makes no sense in a scope -- there's no content-type yet to compare. In the case of...
I found a thread here that describes how you can use the JMX Client to have a URL retried (How to add a URL into the retry list?) excerpt describing JMX...
Hi all, referring to the post with subject "Broad-scope 10M seeds Xmx6G 64-Bit JVM: OOME: GC overhead limit exceeded" and the statement that the OOME exception...
Juergen Umbrich
juergen@...
Dec 10, 2008 12:39 am
5601
If my guesswork on the previous post was correct, it was the requests to display seed reports (via the web UI) that created the problem -- not the mere...
Hi, I am using version 2.0.2. I am filtering crawl.log by regexp. I entered this: http://.*¥.html but the text field value was changed to: http://.*?.html Can I use...
takeru sasaki
sasaki.takeru@...
Dec 11, 2008 3:56 pm
5603
This is probably a web UI encoding issue we could fix, BUT... I don't think any '¥' characters, exactly as such, will be found in the crawl.log. Instead, it...
Thank you for your help. I want to escape "." (dot), not "¥" (backslash), and other regex metacharacters, such as ".()[]?". I will debug and build heritrix...
takeru sasaki
sasaki.takeru@...
Dec 12, 2008 2:05 am
5605
I'm sorry I didn't understand. Heritrix uses the Java regex syntax, as described at: http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html So, a...
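As a minimal sketch of the dot-escaping point in this thread, the following standalone Java program (not Heritrix code; the class name, helper, and example URLs are hypothetical) shows how a backslash-escaped dot behaves under java.util.regex, and how Pattern.quote can escape a run of metacharacters:

```java
import java.util.regex.Pattern;

public class CrawlLogRegexSketch {
    // Hypothetical helper: true if the whole line matches the regex.
    static boolean matches(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    public static void main(String[] args) {
        // In Java source the backslash is doubled; the regex itself is http://.*\.html
        String escaped = "http://.*\\.html";
        // Unescaped dot: matches any single character before "html".
        String unescaped = "http://.*.html";

        System.out.println(matches(escaped, "http://example.com/index.html"));   // true
        System.out.println(matches(escaped, "http://example.com/indexxhtml"));   // false
        System.out.println(matches(unescaped, "http://example.com/indexxhtml")); // true

        // Pattern.quote wraps the text in \Q...\E so every character is literal.
        System.out.println(Pattern.quote(".()[]?"));
    }
}
```

When typing the pattern into a text field rather than Java source, a single backslash is used, as in http://.*\.html.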
Thank you. You are right!! I am using Mac OS X 10.4 and Firefox 3 in Japanese. I entered a backslash "\" into an editor (CarbonEmacs), copied it, and pasted it into Firefox....
takeru sasaki
sasaki.takeru@...
Dec 12, 2008 3:20 am
5607
Hi Tomas, Here is a script that can be used to download videos with Heritrix. Maybe this will help you? It is now on the Heritrix wiki: ...
Hi! I have a problem with the NotMatchesListRegExpDecideRule. My aim is to crawl the following sites: http://tennis.fr/outils http://tennis.fr/breves ...
Thanks for your answer. Here are my DecideRules: global root:scope:rules list org.archive.modules.deciderules.DecideRule global root:scope:rules:0 object ...
Thanks. Your rule #8, NotOnDomainsDecideRule, may be unnecessary. Based on the earlier rules, only on-domain and inline-linked URIs will have been ACCEPTed...
It works fine. Thank you very much. I have seen that if I add a DecideRule, I must run a first crawl in order for the settings to appear. Is that normal?...
... I'm sorry, I don't understand the question. I would say that when building a custom scope with new DecideRules, it is good to work incrementally, only...
Hello! I am trying Wayback. http://archive-access.sourceforge.net/projects/wayback/ I have a question. My Wayback instance has many HTML pages. If image-file...
takeru sasaki
sasaki.takeru@...
Dec 19, 2008 5:04 am
5615
Hi All, I am a newbie to Heritrix. I checked out the latest stable version of Heritrix 2 and tried to debug the crawl process step by step... My Attempts &...
... For a beginner, the 1.14.x code could be a better place to start -- the documentation is better, and there are still significant UI/configuration changes...
Thanks a lot!!! That helps. I am looking for continuous crawling functionality, which I believe is being developed on 2.x (correct me if I am wrong) ...I...
Hi, I am new to Heritrix and am trying to run a sample job, but I am facing a problem. I have configured "max-toe-thread" to 100 but it still never starts all threads....
All threads will only be used if there are many separate sites to crawl. Heritrix will only fetch a single URI from a site at a time, and will pause between...
Hi, Thanks for the reply... I tried with a large set of websites and it's working as expected. Thanks. Bhavin ... From: Gordon Mohr <gojomo@...> ...
Hi, I have started using Heritrix on a single machine. I was just wondering what the different ways are to achieve distributed crawling using Heritrix. I can...