Re: [bixo-dev] Integrate Lucene

Hi Sanjoy,

Sorry for the delay in responding; I've been busy with the ACM data mining unconference (prep and writeups, see http://bixolabs.com/blog/xxx).

Re how to integrate indexing into SimpleCrawlTool - I would create a separate SimpleIndexTool, which can be run on the output directories of SimpleCrawlTool.

You could clone SimpleStatusTool, since it's very similar, with one change (as per a previous email): process the data using Cascading rather than directly opening taps and iterating over them.

I just wrote some utility code to help create a Cascading source tap that's the set of crawl output subdirs you'd want to process for something like building the Lucene index.
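The utility code itself is not included in this message. As a rough illustration of the idea only, here is a minimal sketch that collects one named subdirectory from each crawl output directory. It uses plain java.nio on a local filesystem rather than Hadoop or Cascading, and the class name, method name, and directory layout (one child directory per crawl loop, each holding a subdirectory such as "content") are assumptions, not Bixo's actual API or layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Bixo.
class CrawlOutputDirs {

    /**
     * Returns crawlDir/<child>/subdirName for every direct child directory
     * of crawlDir that actually contains such a subdirectory, sorted by path.
     */
    static List<Path> findSubdirs(Path crawlDir, String subdirName) throws IOException {
        List<Path> result = new ArrayList<>();
        // Files.list holds an open directory handle, so close it when done.
        try (Stream<Path> children = Files.list(crawlDir)) {
            children.filter(Files::isDirectory)              // skip plain files
                    .map(child -> child.resolve(subdirName)) // candidate subdir
                    .filter(Files::isDirectory)              // keep only existing ones
                    .forEach(result::add);
        }
        Collections.sort(result);
        return result;
    }
}
```

In a real tool, each returned path would presumably be wrapped in a tap and the taps combined into a single source for the indexing flow.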

Re how to output the Lucene index - since this is an output format, conceptually I think the right place is in a Cascading Scheme, which already exists as IndexSchema in Bixo.

But I think I'd need to see your new code to better understand what looks different when handling this as a LuceneFunction.

Thanks,

-- Ken

On Nov 1, 2009, at 11:32am, sanjoy wrote:

Hi Ken,

This is what I am thinking. I want to get Lucene indexing done before I attempt Solr.

1) I am using Lucene 2.9.0. Nothing against 2.4.1, which seems to be the version in Maven, but I couldn't download the sources for Lucene 2.4.1. The tar and zip on the Lucene site both seem to have a problem.

2) I have added a couple of lines to SiteCrawler.java.
Tap luceneSink = new Hfs(
    new SequenceFile(FetchedDatum.FIELDS.append(MetaData.FIELDS)),
    curCrawlDirName + "/lucene");

LucenePipe lucenePipe = new LucenePipe(fetchPipe.getContentTailPipe(), MetaData.FIELDS);

sinkMap.put(LucenePipe.LUCENE_PIPE_NAME, luceneSink);

Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe, urlPipe, lucenePipe);

3) Adding a class named bixo.pipes.LucenePipe modeled on ParsePipe.

4) Adding a class named bixo.operations.LuceneFunction modeled on ParseFunction. The writing of the Lucene index will happen in the operate() function.

I have these done. I will borrow the code to write the index from existing test code.
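The LuceneFunction code is not shown in this thread. To illustrate what writing an index from operate() implies, here is a minimal sketch of the lifecycle: open the writer once, add one document per tuple, and close the writer once at the end. Cascading operations expose prepare and cleanup hooks around the per-tuple operate() call, but everything below is a simplified stand-in: InMemoryIndex and IndexingFunctionSketch are hypothetical classes, not Lucene's IndexWriter or Cascading's Function interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index writer; not Lucene's IndexWriter.
class InMemoryIndex implements AutoCloseable {
    final List<String> docs = new ArrayList<>();
    boolean closed = false;

    void addDocument(String text) {
        if (closed) {
            throw new IllegalStateException("index already closed");
        }
        docs.add(text);
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Hypothetical per-tuple function with a prepare/operate/cleanup lifecycle.
class IndexingFunctionSketch {
    private InMemoryIndex index;

    // Called once before any tuples arrive: acquire the writer here,
    // not inside operate(), so it is not reopened for every tuple.
    void prepare(InMemoryIndex target) {
        index = target;
    }

    // Called once per tuple: add one document for each non-empty parsed text.
    void operate(String parsedText) {
        if (parsedText == null || parsedText.isEmpty()) {
            return; // nothing to index for this tuple
        }
        index.addDocument(parsedText);
    }

    // Called once after the last tuple: close the writer so the index is complete.
    void cleanup() {
        index.close();
    }
}
```

The point of the sketch is that the writer's lifetime spans many operate() calls, which is one of the things that differs from handling the index as an output format.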

Let me know what you think,
Sanjoy


--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g






Tue Nov 3, 2009 10:39 pm

kkrugler

Hi Ken, This is what I am thinking. I want to get Lucene indexing done before I attempt Solr. 1) I am using Lucene 2.9.0. Nothing against 2.4.1 which seems...
sanjoy
Nov 1, 2009
7:33 pm

Hi Sanjoy, Sorry for the delay in responding, I've been busy with the ACM data mining unconference (prep, and writeups - see http://bixolabs.com/blog/xxx) . Re...
Ken Krugler
kkrugler
Nov 3, 2009
10:40 pm

Hi Ken, Could you pass me the utility code you have. For indexing the output of the crawler, my only concern was this will double the storage requirement since...
sanjoy
Nov 3, 2009
11:06 pm

Hi Sanjoy, In the code below, it looks like you're creating a new tuple that has just the parsed text. If so, then how does this tie into the process of...
Ken Krugler
kkrugler
Nov 4, 2009
10:33 pm

Yup, that's right. It then does Fields indexFields = new Fields( "ParsedDatum-parsedText"); Store[] storeSettings = new Store[] { Store.YES }; Index[]...
sanjoy
Nov 6, 2009
12:16 am