bixo-dev · Bixo
Re: Integrate Lucene

Yup, that's right.

It then does

Fields indexFields = new Fields( "ParsedDatum-parsedText");
Store[] storeSettings = new Store[] { Store.YES };
Index[] indexSettings = new Index[] { Index.ANALYZED };
indexScheme = new IndexScheme( indexFields, storeSettings, indexSettings, false,
StandardAnalyzer.class, MaxFieldLength.UNLIMITED.getLimit());

in LucenePipe.java. It basically pipes the field into IndexScheme.
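
As a rough illustration of how those three arguments appear to line up, here is a minimal plain-Java sketch. It assumes the Store and Index arrays are parallel to the field list (one setting per field), which is an inference from the one-element arrays above and not something confirmed against the IndexScheme source. The `Store` and `Index` enums and the `describe` helper are local stand-ins invented for this example; they are not the Lucene or Bixo classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldSettingsSketch {
    // Local stand-ins for Lucene's Field.Store and Field.Index (assumption: not the real classes).
    enum Store { YES, NO }
    enum Index { ANALYZED, NOT_ANALYZED }

    // Pairs the i-th field name with the i-th store and index setting.
    // Rejects arrays whose lengths differ from the number of fields.
    static Map<String, String> describe(String[] fields, Store[] store, Index[] index) {
        if (fields.length != store.length || fields.length != index.length) {
            throw new IllegalArgumentException(
                "expected one Store and one Index setting per field");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        for (int i = 0; i < fields.length; i++) {
            settings.put(fields[i], store[i] + "/" + index[i]);
        }
        return settings;
    }

    public static void main(String[] args) {
        Map<String, String> settings = describe(
            new String[] { "ParsedDatum-parsedText" },
            new Store[] { Store.YES },
            new Index[] { Index.ANALYZED });
        System.out.println(settings);
    }
}
```

If that assumption holds, indexing a second field would mean adding one entry to each of the three arrays.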

This IndexScheme is defined as a Sink in SiteCrawler.java.

LucenePipe lucenePipe = new LucenePipe( parsePipe, MetaData.FIELDS);
Tap luceneSink = new Lfs( lucenePipe.getIndexScheme(), curCrawlDirName +
"/lucene", SinkMode.REPLACE);

and I add this into the sinkMap in SiteCrawler.java. That's the connection.
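
The name-based wiring can be sketched in plain Java, with no Cascading dependency. This is only a model of the idea that each tail pipe is paired with the sink registered under the pipe's name; strings stand in for the Pipe and Tap objects, and the `bind` helper, the pipe name, and the directory name are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SinkMapSketch {
    // Pairs each tail pipe name with the sink path registered under that name.
    // Plain strings stand in for Cascading's Pipe and Tap objects.
    static Map<String, String> bind(Map<String, String> sinkMap, List<String> tailPipeNames) {
        Map<String, String> bound = new LinkedHashMap<>();
        for (String name : tailPipeNames) {
            String sinkPath = sinkMap.get(name);
            if (sinkPath == null) {
                // A tail with no matching entry has nowhere to write its output.
                throw new IllegalArgumentException("no sink registered for pipe: " + name);
            }
            bound.put(name, sinkPath);
        }
        return bound;
    }

    public static void main(String[] args) {
        String curCrawlDirName = "crawl/latest";   // hypothetical directory name
        Map<String, String> sinkMap = new LinkedHashMap<>();
        sinkMap.put("lucene pipe", curCrawlDirName + "/lucene");   // hypothetical pipe name
        System.out.println(bind(sinkMap, Arrays.asList("lucene pipe")));
    }
}
```

In this model, a tail pipe whose name has no entry in the map fails at bind time, which is why the key used when adding to the sinkMap has to match the pipe's name.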

Please let me know if this is not the right way,
Sanjoy



--- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@...> wrote:
>
> Hi Sanjoy,
>
> In the code below, it looks like you're creating a new tuple that has
> just the parsed text.
>
> If so, then how does this tie into the process of generating a Lucene
> index?
>
> Thanks,
>
> -- Ken
>
>
> On Nov 3, 2009, at 3:05pm, sanjoy wrote:
>
> > Hi Ken,
> >
> > Could you pass me the utility code you have?
> >
> > For indexing the output of the crawler, my only concern was this
> > will double the storage requirement since we are storing all the
> > content and also indexing it. That's why I was indexing on the fly.
> > Plus it will be faster since the stored content is not reread.
> >
> > If you still think we should index the stored content I will code
> > for that.
> >
> > Here's what I have for LuceneFunction:
> >
> > public class LuceneFunction extends BaseOperation<NullContext>
> >         implements Function<NullContext> {
> >
> >     private Fields _metaDataFields;
> >
> >     public LuceneFunction (Fields metaDataFields) {
> >         // Declare the single output field this function emits.
> >         super( new Fields( "ParsedDatum-parsedText"));
> >         _metaDataFields = metaDataFields;
> >     }
> >
> >     @Override
> >     public void operate(FlowProcess process, FunctionCall funcCall) {
> >         TupleEntry entry = funcCall.getArguments();
> >         TupleEntryCollector collector = funcCall.getOutputCollector();
> >
> >         // Emit a new tuple that holds only the parsed text.
> >         String value = entry.getString( "ParsedDatum-parsedText");
> >         TupleEntry boost = new TupleEntry( new Fields( "ParsedDatum-parsedText"),
> >                 new Tuple( value));
> >         collector.add(boost);
> >     }
> > }
> >
> > Thanks,
> > Sanjoy
> >
> > --- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@> wrote:
> > >
> > > Hi Sanjoy,
> > >
> > > Sorry for the delay in responding, I've been busy with the ACM data
> > > mining unconference (prep, and writeups - see
> > > http://bixolabs.com/blog/xxx).
> > >
> > > Re how to integrate indexing into SimpleCrawlTool - I would create a
> > > separate SimpleIndexTool, which can be run on the output directories
> > > of SimpleCrawlTool.
> > >
> > > You could clone the SimpleStatusTool, since that's very similar, but
> > > with the change (as per a previous email) of processing the data using
> > > Cascading versus directly opening taps & iterating.
> > >
> > > I just wrote some utility code to help create a Cascading source tap
> > > that's the set of crawl output subdirs you'd want to process for
> > > something like building the Lucene index.
> > >
> > > Re how to output the Lucene index - since this is an output format,
> > > conceptually I think the right place is in a Cascading Scheme, which
> > > already exists as IndexScheme in Bixo.
> > >
> > > But I think I'd need to see your new code to better understand what
> > > looks different when handling this as a LuceneFunction.
> > >
> > > Thanks,
> > >
> > > -- Ken
> > >
> > > On Nov 1, 2009, at 11:32am, sanjoy wrote:
> > >
> > > > Hi Ken,
> > > >
> > > > This is what I am thinking. I want to get Lucene indexing done
> > > > before I attempt Solr.
> > > >
> > > > 1) I am using Lucene 2.9.0. Nothing against 2.4.1, which seems to be
> > > > the version in Maven, but I couldn't download the sources for Lucene
> > > > 2.4.1. The tar and zip seem to have a problem on the Lucene site.
> > > >
> > > > 2) I have added a couple of lines to SiteCrawler.java.
> > > > Tap luceneSink = new Hfs(new
> > > > SequenceFile( FetchedDatum.FIELDS.append(MetaData.FIELDS)),
> > > > curCrawlDirName + "/lucene");
> > > >
> > > > LucenePipe lucenePipe = new
> > > > LucenePipe( fetchPipe.getContentTailPipe(), MetaData.FIELDS);
> > > >
> > > > sinkMap.put( LucenePipe.LUCENE_PIPE_NAME, luceneSink);
> > > >
> > > > Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe,
> > > > urlPipe, lucenePipe);
> > > >
> > > > 3) Adding a class named bixo.pipes.LucenePipe modeled on ParsePipe.
> > > >
> > > > 4) Adding a class named bixo.operations.LuceneFunction modeled on
> > > > ParseFunction. The writing of the Lucene index will happen in the
> > > > operate() function.
> > > >
> > > > I have these done. I will borrow the code to write the index from
> > > > existing test code.
> > > >
> > > > Let me know what you think,
> > > > Sanjoy
> > > >
> > > >
> > > >
> > >
> > > --------------------------------------------
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://bixolabs.com
> > > e l a s t i c w e b m i n i n g
> > >
> >
> >
> >
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
>





Fri Nov 6, 2009 12:16 am

sanjoy