Message #156 of 294
Re: Integrate Lucene

Hi Ken,

Could you pass me the utility code you have?

As for indexing the output of the crawler, my only concern is that it will double
the storage requirement, since we would be storing all the content and also
indexing it. That's why I was indexing on the fly. It should also be faster, since
the stored content does not have to be reread.

If you still think we should index the stored content, I will code that instead.
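
To put the storage concern in concrete terms, here is a back-of-the-envelope sketch. The class, the method names, and the indexRatio parameter are made up for illustration and are not part of Bixo. It also assumes that the on-the-fly variant does not keep the raw content, and that an index holding a full copy of the text is about as large as the text itself.

```java
// Hypothetical back-of-the-envelope model; none of these names exist in Bixo.
public class StorageEstimate {

    // Bytes on disk when the crawl output is kept and an index is built from it.
    // indexRatio is the assumed index size as a fraction of the content size.
    public static long storeThenIndex(long contentBytes, double indexRatio) {
        return contentBytes + Math.round(contentBytes * indexRatio);
    }

    // Bytes on disk when text is indexed as it is parsed and the content is not kept.
    public static long indexOnTheFly(long contentBytes, double indexRatio) {
        return Math.round(contentBytes * indexRatio);
    }

    public static void main(String[] args) {
        long content = 100L * 1024 * 1024 * 1024; // assumed: 100 GB of parsed text
        double ratio = 1.0;                        // assumed: index keeps a full copy
        System.out.println("store then index: " + storeThenIndex(content, ratio));
        System.out.println("index on the fly: " + indexOnTheFly(content, ratio));
    }
}
```

With a ratio of 1.0 the first figure is exactly twice the second, which is the "double the storage" case; a smaller ratio narrows the gap.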

Here's what I have for LuceneFunction:

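import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;
import cascading.tuple.TupleEntryCollector;
// NullContext is Bixo's empty operation-context class; import it from its Bixo package.

public class LuceneFunction extends BaseOperation<NullContext> implements Function<NullContext> {

    private static final String PARSED_TEXT_FIELD = "ParsedDatum-parsedText";

    // Stored, but not yet used by operate().
    private Fields _metaDataFields;

    public LuceneFunction(Fields metaDataFields) {
        // Declare the single output field.
        super(new Fields(PARSED_TEXT_FIELD));
        _metaDataFields = metaDataFields;
    }

    @Override
    public void operate(FlowProcess process, FunctionCall<NullContext> funcCall) {
        TupleEntry entry = funcCall.getArguments();
        TupleEntryCollector collector = funcCall.getOutputCollector();

        // Re-emit just the parsed text as a one-field tuple.
        String value = entry.getString(PARSED_TEXT_FIELD);
        TupleEntry result = new TupleEntry(new Fields(PARSED_TEXT_FIELD), new Tuple(value));
        collector.add(result);
    }
}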
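
As it stands, operate() only re-emits the parsed text; the index write itself still has to happen somewhere. Below is a library-free sketch of the "index on the fly" idea: each parsed text is tokenized and added to an in-memory inverted index as it arrives, and the text itself is not retained. TinyIndex and its methods are hypothetical stand-ins, not Lucene or Bixo APIs. A real version would hand the text to a Lucene IndexWriter instead.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a Lucene index: maps each term to the IDs of
// the documents that contain it.
public class TinyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Index one document's parsed text as it arrives; the text is not kept.
    public int add(String parsedText) {
        int docId = nextDocId++;
        for (String term : parsedText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Return the IDs of documents containing the term (empty set if none).
    public Set<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<Integer>emptySet()
                : Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("Bixo is a web mining toolkit");
        index.add("Lucene builds the search index");
        System.out.println(index.search("lucene"));
    }
}
```

The point of the sketch is that add() is the only place the text is touched, which mirrors doing the index write inside operate() rather than rereading stored content in a second pass.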

Thanks,
Sanjoy



--- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@...> wrote:
>
> Hi Sanjoy,
>
> Sorry for the delay in responding, I've been busy with the ACM data
> mining unconference (prep, and writeups - see http://bixolabs.com/blog/xxx).
>
> Re how to integrate indexing into SimpleCrawlTool - I would create a
> separate SimpleIndexTool, which can be run on the output directories
> of SimpleCrawlTool.
>
> You could clone the SimpleStatusTool, since that's very similar, but
> with the change (as per a previous email) of processing the data using
> Cascading versus directly opening taps & iterating.
>
> I just wrote some utility code to help create a Cascading source tap
> that's the set of crawl output subdirs you'd want to process for
> something like building the Lucene index.
>
> Re how to output the Lucene index - since this is an output format,
> conceptually I think the right place is in a Cascading Scheme, which
> already exists as IndexScheme in Bixo.
>
> But I think I'd need to see your new code to better understand what
> looks different when handling this as a LuceneFunction.
>
> Thanks,
>
> -- Ken
>
> On Nov 1, 2009, at 11:32am, sanjoy wrote:
>
> > Hi Ken,
> >
> > This is what I am thinking. I want to get Lucene indexing done
> > before I attempt Solr.
> >
> > 1) I am using Lucene 2.9.0. Nothing against 2.4.1 which seems to be
> > the version in Maven, but I couldn't download the sources for Lucene
> > 2.4.1. The tar and zip seem to have a problem on the Lucene site.
> >
> > 2) I have added a couple of lines to SiteCrawler.java:
> >
> >     Tap luceneSink = new Hfs(
> >         new SequenceFile(FetchedDatum.FIELDS.append(MetaData.FIELDS)),
> >         curCrawlDirName + "/lucene");
> >
> >     LucenePipe lucenePipe =
> >         new LucenePipe(fetchPipe.getContentTailPipe(), MetaData.FIELDS);
> >
> >     sinkMap.put(LucenePipe.LUCENE_PIPE_NAME, luceneSink);
> >
> >     Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe,
> >         urlPipe, lucenePipe);
> >
> > 3) Adding a class named bixo.pipes.LucenePipe modeled on ParsePipe.
> >
> > 4) Adding a class named bixo.operations.LuceneFunction modeled on
> > ParseFunction. The writing of the Lucene index will happen in the
> > operate() function.
> >
> > I have these done. I will borrow the code to write the index from
> > existing test code.
> >
> > Let me know what you think,
> > Sanjoy
> >
> >
> >
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
>
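
Ken's suggestion above (a separate SimpleIndexTool that reads the crawl output directories) needs some way to enumerate those subdirectories. A minimal sketch using only java.nio follows. The class name CrawlDirs, the method, and the assumption that each crawl loop writes a subdirectory containing a child named by requiredChild are all hypothetical. This is not the utility code Ken mentions, and in a real job the resulting paths would still need to be wrapped in Cascading taps.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of Bixo.
public class CrawlDirs {

    // Return, in sorted order, every immediate subdirectory of crawlDir that
    // contains a child directory called requiredChild.
    public static List<Path> subdirsWith(Path crawlDir, String requiredChild)
            throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(crawlDir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)
                        && Files.isDirectory(entry.resolve(requiredChild))) {
                    result.add(entry);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path crawlDir = Paths.get(args.length > 0 ? args[0] : ".");
        // "lucene" matches the sink path used in the SiteCrawler snippet.
        for (Path dir : subdirsWith(crawlDir, "lucene")) {
            System.out.println(dir);
        }
    }
}
```

Filtering on a required child directory skips loops that never produced the output being indexed, so a partially finished crawl does not break the indexing run.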





Tue Nov 3, 2009 11:05 pm

sanjoy