Hi, I've tried something similar (integrating Heritrix with a Tomcat Webapp) and succeeded to a point. Unfortunately I wanted to use the MirrorWriter instead...
I forgot to mention that I used a combination of the code in the selftest package and in the Heritrix webapp to figure out the relevant sequence of calls. X. ...
Thanks Xavier. I've been poking around in the WebUI class in order to get an idea of how this stuff is set up. This is the first time I've ever heard of ...
Micah Wedemeyer
mwedeme@...
Feb 1, 2008 3:15 pm
4956
One way to do this is to use the alist at the CrawlURI level. Your extractor can add key-value pairs to the CrawlURI's alist for each image discovered from that...
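A minimal sketch of the idea in the message above, assuming nothing about the real Heritrix classes: `SketchUri`, `ImageNotingExtractor`, and the `discovered-images` key are hypothetical stand-ins that model only a per-URI key-value store and an extractor step that appends to it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a CrawlURI: only the per-URI key-value store (the "alist") is modeled.
class SketchUri {
    private final String uri;
    private final Map<String, Object> alist = new HashMap<>();

    SketchUri(String uri) { this.uri = uri; }

    String getUri() { return uri; }
    void putObject(String key, Object value) { alist.put(key, value); }
    Object getObject(String key) { return alist.get(key); }
}

// Hypothetical extractor step: records each discovered image on the URI of the page it came from.
class ImageNotingExtractor {
    static final String IMAGES_KEY = "discovered-images"; // assumed key name, not a Heritrix constant

    @SuppressWarnings("unchecked")
    static void noteImage(SketchUri curi, String imageUrl) {
        List<String> images = (List<String>) curi.getObject(IMAGES_KEY);
        if (images == null) {
            // First image for this page: create the list and attach it to the URI.
            images = new ArrayList<>();
            curi.putObject(IMAGES_KEY, images);
        }
        images.add(imageUrl);
    }

    @SuppressWarnings("unchecked")
    static List<String> imagesOf(SketchUri curi) {
        List<String> images = (List<String>) curi.getObject(IMAGES_KEY);
        return images == null ? new ArrayList<>() : images;
    }
}
```

A later processing step that sees the same URI object can then read the list back with `imagesOf`, which is the point of attaching the data to the URI rather than keeping it in the extractor.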
It is, but I'm a little unclear as to how it works. I basically run a job order for about 4 hours, and then checkpoint and terminate the job. The next day I...
If you recover from a checkpoint you will not lose your additional information. If you recover from the recover log, then yes you will need to save the ...
... Yes, an embedded Heritrix should be controllable via the objects' APIs and not only via JMX. I just committed some changes to enable that. These changes...
If you implement a module that supports the CrawlStatusListener and register it with the controller, you get all the start/stop notifications. This is really...
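The register-and-notify pattern described above can be sketched as follows. This is an illustration only: `SketchStatusListener`, `SketchController`, and `RecordingListener` are hypothetical names, and the two callback methods are assumptions, not the actual Heritrix listener interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener contract; stands in for the crawler's own status-listener interface.
interface SketchStatusListener {
    void crawlStarted(String message);
    void crawlEnded(String exitMessage);
}

// Hypothetical controller: keeps registered listeners and notifies each one on start and stop.
class SketchController {
    private final List<SketchStatusListener> listeners = new ArrayList<>();

    void addListener(SketchStatusListener listener) {
        listeners.add(listener);
    }

    void start() {
        for (SketchStatusListener l : listeners) {
            l.crawlStarted("started");
        }
    }

    void stop(String reason) {
        for (SketchStatusListener l : listeners) {
            l.crawlEnded(reason);
        }
    }
}

// Example module that simply records the notifications it receives, in order.
class RecordingListener implements SketchStatusListener {
    final List<String> events = new ArrayList<>();

    public void crawlStarted(String message) {
        events.add("started: " + message);
    }

    public void crawlEnded(String exitMessage) {
        events.add("ended: " + exitMessage);
    }
}
```

An embedding application would register its listener before starting the crawl, so that it does not miss the first notification.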
... Aaarghh!!! I somehow missed this e-mail and spent the last couple of days poring over the WebUI code trying to puzzle out how the Remote, JMX, BeanProxy,...
Micah Wedemeyer
mwedeme@...
Feb 5, 2008 3:17 pm
4965
Version: trunk/heritrix2@5733 As far as I can tell, addCrawlStatusListener has been removed. How can I register to receive the notifications? Note: I'm...
Micah Wedemeyer
mwedeme@...
Feb 5, 2008 9:25 pm
4966
Hi, I've been struggling with 2.0.1 snapshot for the past few days, and I just do not like all the JMX integration. From the perspective of someone just...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008 4:59 pm
4967
... Heritrix is not a developer's library. It's a service designed to run as a daemon with an optional WebUI for the administrator. Someone can correct me if...
... The current event dispatch stuff in CrawlController is quite messy, sorry. We were going to standardize on JMX notifications but ended up just porting...
version: trunk/heritrix2@5733 ... EngineConfig config = new EngineConfig(); config.setJobsDirectory("/tmp/hjobs"); Engine engine = new EngineImpl(config); ...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008 7:54 pm
4970
Paul, Thanks. This looks a little less confusing now that you've explained it. In full disclosure, I only used 1.x very little, but I often compare the ...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008 7:56 pm
4971
I figured out that launching the job does make the difference. I guess I was doing something wrong. I think I was trying to get the SheetManager for...
Micah Wedemeyer
mwedeme@...
Feb 6, 2008 10:13 pm
4972
... They don't exist until the job is active. Prior to that, modules are stored as org.archive.settings.Stub placeholders. So instead of root:controller...
At some point in 2008, the Internet Archive's Heritrix team would like to transition the Heritrix project to the Apache License (version 2), rather than the...
It means the necessary robots.txt prerequisite fetch (even getting a 404) couldn't complete before giving up on the URI in question. Do you see preceding...
Hi, I am new to Heritrix and I am trying the deduplicator. I am on version 1.12.1. I downloaded the deduplicator version 0.3. According to the instruction, I...