Yup, that's right.
It then does

    Fields indexFields = new Fields( "ParsedDatum-parsedText");
    Store[] storeSettings = new Store[] { Store.YES };
    Index[] indexSettings = new Index[] { Index.ANALYZED };
    indexScheme = new IndexScheme( indexFields, storeSettings, indexSettings, false,
        StandardAnalyzer.class, MaxFieldLength.UNLIMITED.getLimit());

in LucenePipe.java. It basically pipes the field into IndexScheme.
This IndexScheme is used to define a sink in SiteCrawler.java:

    LucenePipe lucenePipe = new LucenePipe( parsePipe, MetaData.FIELDS);
    Tap luceneSink = new Lfs( lucenePipe.getIndexScheme(), curCrawlDirName +
        "/lucene", SinkMode.REPLACE);

I then add luceneSink to the sinkMap in SiteCrawler.java. That's the connection.
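
To make that connection concrete, here is a rough stand-alone sketch that uses plain Java collections rather than Cascading or Bixo classes. SinkWiringSketch, RecordingSink, and the pipe-name value are made up for illustration; only the field name comes from the code above. It shows the shape of the flow (a function that emits just the parsed text, and a sink looked up by pipe name), not how Cascading actually schedules it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy stand-in for the wiring described above; not Cascading or Bixo code. */
final class SinkWiringSketch {
    static final String TEXT_FIELD = "ParsedDatum-parsedText";
    // Hypothetical value; the real LucenePipe.LUCENE_PIPE_NAME is not shown in this thread.
    static final String LUCENE_PIPE_NAME = "lucene pipe";

    /** Mirrors what LuceneFunction.operate() does: keep only the parsed-text field. */
    static Map<String, String> projectParsedText(Map<String, String> tuple) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put(TEXT_FIELD, tuple.get(TEXT_FIELD));
        return out;
    }

    /** Hypothetical sink that records the text an index writer would receive. */
    static final class RecordingSink implements Consumer<Map<String, String>> {
        final List<String> indexedText = new ArrayList<>();

        @Override
        public void accept(Map<String, String> tuple) {
            indexedText.add(tuple.get(TEXT_FIELD));
        }
    }

    /** Sends every projected tuple to the sink registered under pipeName. */
    static void run(List<Map<String, String>> input, String pipeName,
                    Map<String, Consumer<Map<String, String>>> sinkMap) {
        Consumer<Map<String, String>> sink = sinkMap.get(pipeName);
        if (sink == null) {
            throw new IllegalArgumentException("No sink registered for pipe: " + pipeName);
        }
        for (Map<String, String> tuple : input) {
            sink.accept(projectParsedText(tuple));
        }
    }
}
```

The point of the sketch is that the sink only ever sees the one field the function emits, so anything else in the incoming tuple (URL, metadata) never reaches the index unless the function passes it along.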
Please let me know if this is not the right way,
Sanjoy
--- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@...> wrote:
>
> Hi Sanjoy,
>
> In the code below, it looks like you're creating a new tuple that has
> just the parsed text.
>
> If so, then how does this tie into the process of generating a Lucene
> index?
>
> Thanks,
>
> -- Ken
>
>
> On Nov 3, 2009, at 3:05pm, sanjoy wrote:
>
> > Hi Ken,
> >
> > Could you pass me the utility code you have?
> >
> > For indexing the output of the crawler, my only concern was this
> > will double the storage requirement since we are storing all the
> > content and also indexing it. That's why I was indexing on the fly.
> > Plus it will be faster since the stored content is not reread.
> >
> > If you still think we should index the stored content I will code
> > for that.
> >
> > Here's what I have for LuceneFunction:
> >
> > public class LuceneFunction extends BaseOperation<NullContext>
> >         implements Function<NullContext> {
> >
> >     private Fields _metaDataFields;
> >
> >     public LuceneFunction (Fields metaDataFields) {
> >         // Declare the single output field this function emits.
> >         super( new Fields( "ParsedDatum-parsedText"));
> >         _metaDataFields = metaDataFields;
> >     }
> >
> >     @Override
> >     public void operate(FlowProcess process, FunctionCall<NullContext> funcCall) {
> >         TupleEntry entry = funcCall.getArguments();
> >         TupleEntryCollector collector = funcCall.getOutputCollector();
> >
> >         // Emit a new tuple that holds only the parsed text.
> >         String value = entry.getString( "ParsedDatum-parsedText");
> >         TupleEntry boost = new TupleEntry( new Fields( "ParsedDatum-parsedText"),
> >             new Tuple( value));
> >         collector.add(boost);
> >     }
> > }
> >
> > Thanks,
> > Sanjoy
> >
> > --- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@>
> > wrote:
> > >
> > > Hi Sanjoy,
> > >
> > > Sorry for the delay in responding, I've been busy with the ACM data
> > > mining unconference (prep, and writeups - see
> > > http://bixolabs.com/blog/xxx).
> > >
> > > Re how to integrate indexing into SimpleCrawlTool - I would create a
> > > separate SimpleIndexTool, which can be run on the output directories
> > > of SimpleCrawlTool.
> > >
> > > You could clone the SimpleStatusTool, since that's very similar, but
> > > with the change (as per a previous email) of processing the data
> > > using Cascading versus directly opening taps & iterating.
> > >
> > > I just wrote some utility code to help create a Cascading source tap
> > > that's the set of crawl output subdirs you'd want to process for
> > > something like building the Lucene index.
> > >
> > > Re how to output the Lucene index - since this is an output format,
> > > conceptually I think the right place is in a Cascading Scheme, which
> > > already exists as IndexScheme in Bixo.
> > >
> > > But I think I'd need to see your new code to better understand what
> > > looks different when handling this as a LuceneFunction.
> > >
> > > Thanks,
> > >
> > > -- Ken
> > >
> > > On Nov 1, 2009, at 11:32am, sanjoy wrote:
> > >
> > > > Hi Ken,
> > > >
> > > > This is what I am thinking. I want to get Lucene indexing done
> > > > before I attempt Solr.
> > > >
> > > > 1) I am using Lucene 2.9.0. Nothing against 2.4.1, which seems to be
> > > > the version in Maven, but I couldn't download the sources for Lucene
> > > > 2.4.1. The tar and zip seem to have a problem on the Lucene site.
> > > >
> > > > 2) I have added a couple of lines to SiteCrawler.java.
> > > > Tap luceneSink = new Hfs(
> > > >     new SequenceFile( FetchedDatum.FIELDS.append(MetaData.FIELDS)),
> > > >     curCrawlDirName + "/lucene");
> > > >
> > > > LucenePipe lucenePipe = new LucenePipe(
> > > >     fetchPipe.getContentTailPipe(), MetaData.FIELDS);
> > > >
> > > > sinkMap.put( LucenePipe.LUCENE_PIPE_NAME, luceneSink);
> > > >
> > > > Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe,
> > > >     urlPipe, lucenePipe);
> > > >
> > > > 3) Adding a class named bixo.pipes.LucenePipe modeled on ParsePipe.
> > > >
> > > > 4) Adding a class named bixo.operations.LuceneFunction modeled on
> > > > ParseFunction. The writing of the Lucene index will happen in the
> > > > operate() function.
> > > >
> > > > I have these done. I will borrow the code to write the index from
> > > > existing test code.
> > > >
> > > > Let me know what you think,
> > > > Sanjoy
> > > >
> > > >
> > > >
> > >
> > > --------------------------------------------
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://bixolabs.com
> > > e l a s t i c w e b m i n i n g
> > >
> >
> >
> >
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
>