Hello all, I hope I am not missing something obvious. I am using Heritrix 2.0.2 and have successfully installed/run simple crawls. Now, I only want to save...
I recommend using 1.14.x unless you specifically need 2.0 features. The documentation (both in the official user manual and various notes/threads) is better;...
Hi - We are beginning to look at implementing duplicate reduction in our crawling. I am trying to get my head around the various features available for...
Can someone tell me how I can append jobs (orders) and then execute those orders via the command line (using Heritrix params)? I can successfully run 1 (one) order...
Folks, I'm having trouble building heritrix from 1.14.1 and 2 sources. Part of the problem is my unfamiliarity with maven, jelly, qdox, etc. The problems are...
I don't know of a formal comparison, or of any other systems in use with Heritrix-centric workflows. The WARC writing of recent Heritrix releases already...
Hello Steve, When you say "heritrix-1.14.1 and 2", I'm not sure if you mean heritrix-1.14.2 or heritrix 2.0, but anyway the latest version on the 1.x line is...
Noah, Thanks -- that worked. I think I had the wrong qdox-current.jar before. Since your message, I tried from scratch, and apparently the old maven doesn't...
Steve, that's an interesting idea. I filed http://webteam.archive.org/jira/browse/HER-1591. Feel free to add yourself as a watcher on that issue if you like. A...
I have a couple of questions about Heritrix. Does it support POST? For example, I want to crawl sites that have dropdown boxes and submit buttons. I'd like to...
I would like to configure Heritrix to not even look for a robots.txt. I have permission for the sites I am crawling to ignore the robots.txt which I have done,...
Hello Greg, Heritrix doesn't support POST, nor does it support the kind of link extraction you describe. The core reason not to support POST is that...
Hi list, I'm running heritrix-2.0.2. I'm having trouble restarting heritrix. I stopped it with the kill command (with the default TERM signal). Now when I want...
The problem isn't with your shutdown; rather there's a bug preventing settings of NotMatchesListRegExpDecideRule from being recognized. (I presume you added...
I am a newbie to Heritrix. Someone set it up for me a while ago; I found this problem a couple of days ago and couldn't find a solution. Heritrix...
... There is no version 2.1. From the screenshot you'd posted on a JIRA issue, it looks like one of the 1.X versions... perhaps an outdated version, because...
Hi, Thanks for your answer. ... In fact the option is present in the web UI after a first crawl ... out -- OK. Are the ARC file formats identical between 1.14...
... Hmm, did you initially add the rule via the web UI? ... Yes, both the ARC and WARC formats should be identical between 1.14.2 and 2.0.2. - Gordon @ IA...
At Tue, 13 Jan 2009 15:48:09 -0800, ... Hi Gordon - Many thanks for your response. I have been making use of the PersistLogProcessors and will be trying out...
... I have to launch a first crawl with a sheet without this notmatchregexrule. Then when I go back in the sheet editor in the WebUI, the option is available. ...
Because Heritrix WebUI is over http instead of https our institution is requiring us to use X server apps (Xming) on our PCs for access, which in turn has a...
... It's probably easiest to wrap it with an https proxy, then hopefully the code for the web UI won't need to change. I once did this for another app using...
Brendan O'Connor
brenocon@...
Jan 23, 2009 7:29 pm
5649
This was very helpful, thanks. For future record, I ended up adding MatchesFilePatternDecideRules using the use-preset-pattern for audio, video and images and...
There's currently no way to tell Heritrix not to request a robots.txt, and changing that would probably require custom coding. We are unlikely to make the...
Hello, I have a question: how much memory does one crawl job need? My crawl stopped with an OutOfMemoryError, and the Web UI is not working well. I created 4-5 jobs based on...
takeru sasaki
sasaki.takeru@...
Jan 27, 2009 10:40 am
5653
... The memory requirements depend on your crawl parameters, especially the number of ToeThreads configured. You should be safe with the default configuration...
Thank you for your help. The default setting is -Xmx256m, which I understand is for a single crawl. I will try with 256 MB multiplied by the number of simultaneous crawls. Thank you very much! And...
takeru sasaki
sasaki.takeru@...
Jan 28, 2009 6:12 am
5655
... That may not help, unless you also change the BDB cache-percent setting. Each crawl's database environment will grow, as long as the crawl is still...