Hi Ken,
This is what I am thinking. I want to get Lucene indexing done before I attempt Solr.
1) I am using Lucene 2.9.0. I have nothing against 2.4.1, which seems to be the version in Maven, but I couldn't download its sources: the tar and zip archives on the Lucene site seem to be broken.
2) I have added a few lines to SiteCrawler.java:
Tap luceneSink = new Hfs(new SequenceFile(FetchedDatum.FIELDS.append(MetaData.FIELDS)), curCrawlDirName + "/lucene");
LucenePipe lucenePipe = new LucenePipe(fetchPipe.getContentTailPipe(), MetaData.FIELDS);
sinkMap.put(LucenePipe.LUCENE_PIPE_NAME, luceneSink);
Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe, urlPipe, lucenePipe);
3) I have added a class named bixo.pipes.LucenePipe, modeled on ParsePipe.
4) I have added a class named bixo.operations.LuceneFunction, modeled on ParseFunction. The Lucene index will be written in its operate() method.
I have these done. I will borrow the code to write the index from existing test code.
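
One possible shape for LuceneFunction is sketched below. It is only a rough sketch, not the actual Bixo or Lucene code: IndexSink and InMemoryIndexSink are hypothetical stand-ins for Lucene's IndexWriter so that the snippet runs on its own, the "url" and "content" field names are made up, and the operate()/cleanup() signatures are simplified rather than the real Cascading ones. The point it illustrates is that operate() is called once per fetched page, so the writer is opened once and closed only after the last tuple.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LuceneFunctionSketch {

    /** Hypothetical stand-in for the small part of an index writer used here. */
    interface IndexSink {
        void addDocument(Map<String, String> fields);
        void close();
    }

    /** In-memory sink so the sketch runs without Lucene on the classpath. */
    static class InMemoryIndexSink implements IndexSink {
        final List<Map<String, String>> docs = new ArrayList<>();
        boolean closed = false;

        @Override
        public void addDocument(Map<String, String> fields) {
            if (closed) {
                throw new IllegalStateException("sink already closed");
            }
            docs.add(fields);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    private final IndexSink sink;

    LuceneFunctionSketch(IndexSink sink) {
        this.sink = sink;
    }

    /** Called once per fetched page: turn the tuple into one document. */
    void operate(String url, String content) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);         // assumed field name
        doc.put("content", content); // assumed field name
        sink.addDocument(doc);
    }

    /** Called once after the last tuple: close the index. */
    void cleanup() {
        sink.close();
    }

    public static void main(String[] args) {
        InMemoryIndexSink sink = new InMemoryIndexSink();
        LuceneFunctionSketch fn = new LuceneFunctionSketch(sink);
        fn.operate("http://example.com/a", "first page");
        fn.operate("http://example.com/b", "second page");
        fn.cleanup();
        // prints "2 documents, closed=true"
        System.out.println(sink.docs.size() + " documents, closed=" + sink.closed);
    }
}
```

In the real class the sink would presumably wrap a Lucene IndexWriter, and the document fields would come from FetchedDatum and MetaData rather than two plain strings.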
Let me know what you think,
Sanjoy