Dear Heritrix Experts, I am a Heritrix newbie. I downloaded Heritrix and Maven, but when I try building, I receive the following error: $ maven dist [ ... ] ...
You don't say what version you're trying to build, but it sounds similar to the issue recently discussed in this thread, which has some tips for working around...
Hey guys! I'm new at the Heritrix game (and to this forum), so I apologize in advance if I've missed something in the documentation or in prior discussions,...
Hey guys! I'm new to Heritrix, so I apologize in advance if I've missed something in the documentation, but I have a question about DecideRule in Heritrix. I'm...
wp1740
wp1740@...
Feb 11, 2009 5:55 pm
5660
The best approach is to pick a similar DecideRule and use it as a model. The method for making it appear in the web UI varies between 1.14.x and 2.0.x. In...
Thank you, Gordon! That's just what I needed. BTW: What are the possible types for the parameter Object in: public Object decisionFor(Object object) { } ...
In Heritrix 1.x, it's essentially always a CandidateURI (and sometimes also a CrawlURI). In Heritrix 2.0, it's defined as a ProcessorURI (instead of Object)...
A test run of Heritrix 1.14.2 on an AWS EC2 small instance, with 100 worker threads, 1.3M seeds, and a 900MB heap, has the following resource utilization stats: *...
I also just saw a StackOverflowError in the web UI of Heritrix 1.14.2 right after I clicked Submit Job after composing a job from a previous one. ... An error...
Does it happen every time you follow the same steps? Does it happen if you turn FINE logging off? It almost looks like there's an Attribute with a reference...
Hi, I just started using Heritrix and have a few queries: How do I analyse the ARC files? How do I filter out ads and other media... please any amount of help...
... Yes. The Default profile did not have the problem, but my customized profile did. ... I commented out: org.archive.crawler.level = FINE and restarted...
Hello, I think the NotMatches* deciderules in the 2.0.2 branch are missing a static KeyManager.addKeys() call, which is preventing them from working properly....
A good way to contribute patches is to attach them to issues in the project JIRA issue tracker: http://webteam.archive.org/jira There's already an issue about...
Some "special" no-operation decide rules that I think would be handy: * Comment. Sometimes I would like to put a comment in a rule sequence and this would be a...
The "un-knotting" performance change worked. I see a 2X speedup in Heritrix v1.14.2: * 460KB/sec (from 230KB/sec) network usage * 100% CPU with load between...
Hi, I created a DecideRule that checks for every CrawlURI what methods (POST, PUT, DELETE, etc.) are allowed. Therefore, I am using the HTTP OPTIONS method...
These sound useful and we'd be happy to integrate them into the core if contributed. Also, other simple custom rules can be implemented with Beanshell scripts,...
I recommend beginners use Heritrix 1.14.x versions; there is better documentation and fewer gotchas when customizing the configuration. I can't tell from the...
Because the exception is triggered in toString() code called from... ... ...that code, rather than your checkOptionsMethod() code, would offer the best hints...
ServerCache has been noted as a bottleneck before, so this is a very welcome result. Can you post a patch either here or to a JIRA issue for others to review...
Hi, is it possible to connect two decide rules with a logical AND? Let's say I want on the one hand the URI to match a RegExp and on the other hand the URI...
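To make the question above concrete, here is a minimal sketch of what "both conditions must hold" means for a URI check. It uses only plain Java and is not the Heritrix DecideRule API. The class name AndRuleSketch, the helpers matchesRegex, shorterThan and both, and the example.org pattern are all hypothetical. The second condition (a length limit) is a placeholder, since the original message is cut off before it names the real one.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Stand-alone illustration only; none of these names come from Heritrix.
public class AndRuleSketch {

    // Hypothetical first condition: the whole URI matches a regular expression.
    public static Predicate<String> matchesRegex(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return uri -> pattern.matcher(uri).matches();
    }

    // Hypothetical second condition (placeholder): the URI is shorter than maxLength.
    public static Predicate<String> shorterThan(int maxLength) {
        return uri -> uri.length() < maxLength;
    }

    // Logical AND: the combined check accepts only when both checks accept.
    public static Predicate<String> both(Predicate<String> first, Predicate<String> second) {
        return first.and(second);
    }

    public static void main(String[] args) {
        Predicate<String> rule =
                both(matchesRegex("https?://example\\.org/.*"), shorterThan(40));

        // Matches the pattern and is short enough.
        System.out.println(rule.test("http://example.org/page"));
        // Short enough, but does not match the pattern.
        System.out.println(rule.test("http://other.test/page"));
    }
}
```

In a real crawl the same idea would have to be expressed with whatever rule classes the installed Heritrix version provides; the sketch only shows the intended truth table.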
Hi, I hope I'm not missing something obvious, but I seem to remember being able to modify "bdb-cache-percent", which is particularly useful for avoiding OOM when...
Hi Holger ... in the profile folder is a file called config.txt: # The percentage of JVM RAM to use for the BDB database during a live crawl. # Defaults to...
Juergen Umbrich
juergen@...
Feb 17, 2009 5:20 pm
5684
In 2.0.x, the BDB environment is already open during job configuration, to store SURT-to-override-sheet associations. Thus, its cache-size is determined...
I will use the following bug report for the proposed patch: http://webteam.archive.org/jira/browse/HER-1609 The patch is not yet ready; I will post a message...