Dear all,
We have made available a Heritrix processor that interfaces with
rainbow, the best-known and perhaps most widely used text
classification system of the last decade.
If you include this processor in your Heritrix crawls, you can either
focus crawls on a particular topic by training rainbow to recognize
that topic, or weed out unwanted pages by training rainbow to
spot those pages.
For more information, download the software (113K) from
http://www.metacombine.org/software. Grab the file named
"metacombine_focusedCrawl_module1.0.tar.gz". There is a
README with brief install instructions, and a .pdf with more complete
documentation. Feedback is welcome.
Saurabh Pathak, Emory University <spatha2@...>
Donna Bergmark, Cornell University <bergmark@...>