bixo-dev · Bixo Web Mining Toolkit

Group Information

  • Members: 113
  • Category: Open Source
  • Founded: May 17, 2009
  • Language: English
Messages

Messages 152 - 181 of 1321
#152 From: sanjoy
Date: Sun Nov 1, 2009 7:32 pm
Subject: Integrate Lucene
sanjoy
 
Hi Ken,

This is what I am thinking.  I want to get Lucene indexing done before I attempt
Solr.

1) I am using Lucene 2.9.0.  Nothing against 2.4.1 which seems to be the version
in Maven, but I couldn't download the sources for Lucene 2.4.1.  The tar and zip
seem to have a problem on the Lucene site.

2) I have added a couple of lines to SiteCrawler.java.
Tap luceneSink = new Hfs(new SequenceFile(
FetchedDatum.FIELDS.append(MetaData.FIELDS)), curCrawlDirName + "/lucene");

LucenePipe lucenePipe = new LucenePipe( fetchPipe.getContentTailPipe(),
MetaData.FIELDS);

sinkMap.put( LucenePipe.LUCENE_PIPE_NAME, luceneSink);

Flow flow = flowConnector.connect(inputSource, sinkMap, fetchPipe, urlPipe,
lucenePipe);


3) Adding a class named bixo.pipes.LucenePipe modeled on ParsePipe.

4) Adding a class named bixo.operations.LuceneFunction modeled on ParseFunction.
The writing of the Lucene index will happen in the operate() function.

I have these done.  I will borrow the code to write the index from existing test
code.

Let me know what you think,
Sanjoy

#153 From: "Freddy" <fuad@...>
Date: Sun Nov 1, 2009 7:51 pm
Subject: Neko HTML Parser
fouad_efendi
 
Guys,

I am currently using ElementRemover with a list of tags to be ignored (removed)
from the stream _before_ parsing the content; it also makes it possible to deal
with invalid XML right before parsing, and more... Are we interested in [p],
[table], [div] tags in the DOM, or just anchors [a] with href? Plus images, of
course... I believe it runs faster:

    ElementRemover remover = new ElementRemover();
    for (HtmlTag t : HtmlTag.TAGS) {
        if (t.accept)
            remover.acceptElement(t.tag, t.attributes);
        if (t.remove)
            remover.removeElement(t.tag);
    }

    XMLDocumentFilter[] filters = { remover };
    parser = new DOMParser();
    ...
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);


-Fuad

P.S.
I am interested in structural tags also, such as [table], [div]... for some kind
of "mining"... but I haven't implemented it yet.

#154 From: Ken Krugler <KKrugler_lists@...>
Date: Tue Nov 3, 2009 10:39 pm
Subject: Re: Integrate Lucene
kkrugler
 
Hi Sanjoy,

Sorry for the delay in responding, I've been busy with the ACM data mining unconference (prep, and writeups - see http://bixolabs.com/blog/xxx).

Re how to integrate indexing into SimpleCrawlTool - I would create a separate SimpleIndexTool, which can be run on the output directories of SimpleCrawlTool.

You could clone the SimpleStatusTool, since that's very similar, but with the change (as per a previous email) of processing the data using Cascading versus directly opening taps & iterating.

I just wrote some utility code to help create a Cascading source tap that's the set of crawl output subdirs you'd want to process for something like building the Lucene index.

Re how to output the Lucene index - since this is an output format, conceptually I think the right place is in a Cascading Scheme, which already exists as IndexScheme in Bixo.

But I think I'd need to see your new code to better understand what looks different when handling this as a LuceneFunction.

Thanks,

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g





#155 From: Ken Krugler <KKrugler_lists@...>
Date: Tue Nov 3, 2009 10:39 pm
Subject: Re: Neko HTML Parser
kkrugler
 
Hi Fuad,

1. I'll need to look into how CyberNeko uses filters, but from what I've seen with other parsers, these filters get applied during parsing, not before. Otherwise it seems very hard to be able to properly handle broken HTML where (for example) there are missing end tags.

2. Also note that Tika is switching to TagSoup (Jira issue TIKA-310).

3. Re what tags are of interest - depends on what Bixo is being used to do. If it's just generating an index, then often it's only the content and links, so tags without either can be stripped.

But I think the more common use case is for data mining, where you'd want all of the tags to be able to do appropriate pattern matching on layout - that's key for extracting semi-structured data.

-- Ken



#156 From: sanjoy
Date: Tue Nov 3, 2009 11:05 pm
Subject: Re: Integrate Lucene
sanjoy
 
Hi Ken,

Could you pass me the utility code you have?

For indexing the output of the crawler, my only concern was that this will double
the storage requirement, since we are storing all the content and also indexing
it. That's why I was indexing on the fly. Plus it will be faster, since the
stored content is not reread.

If you still think we should index the stored content I will code for that.

Here's what I have for LuceneFunction:

public class LuceneFunction extends BaseOperation<NullContext> implements
        Function<NullContext> {

    // Stored, but not used by operate() below.
    private Fields _metaDataFields;

    public LuceneFunction(Fields metaDataFields) {
        // Declare the single output field.
        super(new Fields("ParsedDatum-parsedText"));
        _metaDataFields = metaDataFields;
    }

    @Override
    public void operate(FlowProcess process, FunctionCall funcCall) {
        TupleEntry entry = funcCall.getArguments();
        TupleEntryCollector collector = funcCall.getOutputCollector();

        // Pass the parsed text through as a one-field tuple.
        String value = entry.getString("ParsedDatum-parsedText");
        TupleEntry boost = new TupleEntry(new Fields("ParsedDatum-parsedText"),
                new Tuple(value));
        collector.add(boost);
    }
}

Thanks,
Sanjoy



--- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@...> wrote:
>
> Hi Sanjoy,
>
> Sorry for the delay in responding, I've been busy with the ACM data
> mining unconference (prep, and writeups - see http://bixolabs.com/blog/xxx)
> .
>
> Re how to integrate indexing into SimpleCrawlTool - I would create a
> separate SimpleIndexTool, which can be run on the output directories
> of SimpleCrawlTool.
>
> You could clone the SimpleStatusTool, since that's very similar, but
> with the change (as per a previous email) of processing the data using
> Cascading versus directly opening taps & iterating.
>
> I just wrote some utility code to help create a Cascading source tap
> that's the set of crawl output subdirs you'd want to process for
> something like building the Lucene index.
>
> Re how to output the Lucene index - since this is an output format,
> conceptually I think the right place is in a Cascading Scheme, which
> already exists as IndexScheme in Bixo.
>
> But I think I'd need to see your new code to better understand what
> looks different when handling this as a LuceneFunction.
>
> Thanks,
>
> -- Ken
>

#157 From: Ken Krugler <KKrugler_lists@...>
Date: Wed Nov 4, 2009 10:33 pm
Subject: Re: Re: Integrate Lucene
kkrugler
 
Hi Sanjoy,

In the code below, it looks like you're creating a new tuple that has just the parsed text.

If so, then how does this tie into the process of generating a Lucene index?

Thanks,

-- Ken


On Nov 3, 2009, at 3:05pm, sanjoy wrote:

> Hi Ken,
>
> Could you pass me the utility code you have.
>
> For indexing the output of the crawler, my only concern was this will double
> the storage requirement since we are storing all the content and also indexing
> it. That's why I was indexing on the fly. Plus it will be faster since the
> stored content is not reread.
>
> If you still think we should index the stored content I will code for that.
>
> Here's what I have for LuceneFunction:
>
> public class LuceneFunction extends BaseOperation<NullContext> implements
> Function<NullContext> {
>
>     private Fields _metaDataFields;
>
>     public LuceneFunction (Fields metaDataFields) {
>         super( new Fields( "ParsedDatum-parsedText"));
>         _metaDataFields = metaDataFields;
>     }
>
>     @Override
>     public void operate(FlowProcess process, FunctionCall funcCall) {
>         TupleEntry entry = funcCall.getArguments();
>         TupleEntryCollector collector = funcCall.getOutputCollector();
>
>         String value = entry.getString( "ParsedDatum-parsedText");
>         TupleEntry boost = new TupleEntry( new Fields(
> "ParsedDatum-parsedText"), new Tuple( value));
>         collector.add(boost);
>     }
> }
>
> Thanks,
> Sanjoy


#158 From: sanjoy
Date: Fri Nov 6, 2009 12:16 am
Subject: Re: Integrate Lucene
sanjoy
 
Yup, that's right.

It then does

Fields indexFields = new Fields( "ParsedDatum-parsedText");
Store[] storeSettings = new Store[] { Store.YES };
Index[] indexSettings = new Index[] { Index.ANALYZED };
indexScheme = new IndexScheme( indexFields, storeSettings, indexSettings, false,
StandardAnalyzer.class, MaxFieldLength.UNLIMITED.getLimit());

in LucenePipe.java.  It basically pipes the field into IndexScheme.

This IndexScheme is defined as a Sink in SiteCrawler.java.

LucenePipe lucenePipe = new LucenePipe( parsePipe, MetaData.FIELDS);
Tap luceneSink = new Lfs( lucenePipe.getIndexScheme(), curCrawlDirName +
"/lucene", SinkMode.REPLACE);

and I add this into the sinkMap in SiteCrawler.java.  That's the connection.

Please let me know if this is not the right way,
Sanjoy



--- In bixo-dev@yahoogroups.com, Ken Krugler <KKrugler_lists@...> wrote:
>
> Hi Sanjoy,
>
> In the code below, it looks like you're creating a new tuple that has
> just the parsed text.
>
> If so, then how does this tie into the process of generating a Lucene
> index?
>
> Thanks,
>
> -- Ken
>
>

#159 From: Ken Krugler <KKrugler_lists@...>
Date: Tue Nov 24, 2009 11:31 pm
Subject: Crawler-commons project
kkrugler
 
Just an early note that the crawler-commons project is getting up and
running.

This is an effort to share common code between Nutch, Heritrix, Droids
and Bixo.

Nothing has been moved to the code repo (at code.google) yet, but high
on my list is the robots.txt parsing code I've written for Bixo.

See http://code.google.com/p/crawler-commons/ and
http://groups.google.com/group/crawler-commons

-- Ken


#160 From: "Freddy" <fuad@...>
Date: Fri Nov 27, 2009 2:41 am
Subject: Cygwin (Windows) & Classpath, running bin/bixo
fouad_efendi
 
I have problems...

Did we forget to add Bundle-ClassPath: runtime-libs/ to MANIFEST.MF? It seems
that the "job" jar can't see it...

I am running Cygwin, trying to run bin/bixo.

#161 From: "Freddy" <fuad@...>
Date: Fri Nov 27, 2009 3:03 am
Subject: Re: Cygwin (Windows) & Classpath, running bin/bixo
fouad_efendi
 
I don't know whether this is Hadoop-specific or not... but MANIFEST.MF doesn't
have a classpath. That's why I can't run bin/bixo.

To fix that (sorry if the formatting gets removed by Yahoo):

	 <!-- ================================================================== -->
	 <!-- Hadoop job jar                                                     -->
	 <!-- ================================================================== -->

	 <target name="job"
	         depends="compile"
	         description="--> create a Hadoop ready jar with all dependencies">

		 <copy todir="${build.dir}/runtime-libs" flatten="true">
			 <path refid="runtime.classpath" />
		 </copy>

		 <pathconvert property="mf.classpath" pathsep=" ">
			 <path refid="runtime.classpath" />
			 <chainedmapper>
				 <flattenmapper/>
			  	 <globmapper from="*" to="runtime-libs/*"/>
			 </chainedmapper>
		 </pathconvert>

		 <jar destfile="${build.dir}/${job.name}" compress="true">
			 <fileset dir="${build.dir.main-classes}" />
			 <fileset dir="${build.dir}" includes="runtime-libs/" />
			 <manifest>
				 <attribute name="Main-Class" value="${job.main.class}" />
				 <attribute name="Class-Path" value="${mf.classpath}"/>
			 </manifest>
		 </jar>

	 </target>
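
To make the pathconvert step above concrete, here is a minimal Java sketch of the string it is meant to produce: each jar path is flattened to its file name, prefixed with runtime-libs/, and the entries are joined with single spaces, which is the separator the manifest Class-Path attribute uses. The class name, method name, and jar names are made up for illustration; none of them come from Bixo or Ant.

```java
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only; it mirrors the Ant <pathconvert> above.
final class ManifestClassPathSketch {

    // Flatten each path to its file name, add the prefix, and join with single spaces.
    static String toClassPath(List<String> jarPaths, String prefix) {
        return jarPaths.stream()
                .map(p -> prefix + Paths.get(p).getFileName().toString())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Made-up jar names, standing in for the entries of runtime.classpath.
        List<String> jars = List.of("lib/first.jar", "build/libs/second.jar");
        System.out.println(toClassPath(jars, "runtime-libs/"));
    }
}
```

Whether the resulting Class-Path entries are actually picked up when Hadoop runs the job jar is a separate question that this sketch does not answer.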






#162 From: "Freddy" <fuad@...>
Date: Fri Nov 27, 2009 2:29 pm
Subject: URL Database becomes empty at some point; BIXO stops, can't recrawl
fouad_efendi
 
Hi,

I just noticed it... it is even easier to see with the "-duration 1" command-line
parameter.

At some point in the loop, the URL database contains no tuples. A subsequent
recrawl obviously doesn't have any URLs to fetch.

It happens after SKIPPED_TIME_LIMIT.

I probably need to join JIRA and submit patches... I am making many minor fixes
in my environment, and I don't want to lose the connection to newer releases...

About "plugin points": we need some solution. I need to plug in specific
processing (update SOLR, update MySQL) in some places, and it shouldn't be part
of BIXO...

#163 From: Ken Krugler <KKrugler_lists@...>
Date: Fri Nov 27, 2009 3:02 pm
Subject: Re: Re: Cygwin (Windows) & Classpath, running bin/bixo
kkrugler
 
Hi Fuad,

Thanks for the input on this issue.

It's Thanksgiving break here in the US, but I'll look into this on Monday.

-- Ken

PS - I see that I've got an old email from Sanjoy that I need to respond to as well...my bad.


#164 From: "Freddy" <fuad@...>
Date: Fri Nov 27, 2009 3:03 pm
Subject: Re: URL Database becomes empty at some point; BIXO stops, can't recrawl
fouad_efendi
 
The problem is that SKIPPED_TIME_LIMIT is in the TODO list of
SiteCrawler.CreateUrlFromStatusFunction.

And I need to make many changes quickly; I don't have any idea how to push many
tiny patches...


#165 From: "Freddy" <fuad@...>
Date: Fri Nov 27, 2009 9:12 pm
Subject: Re: URL Database becomes empty at some point; BIXO stops, can't recrawl
fouad_efendi
 
Am I on the correct path?

I need to be able to recrawl a page after some timeout (for instance, 30 days).

1. I added code to SiteCrawler.CreateUrlFromStatusFunction:
} else if (status == UrlStatus.SKIPPED_TIME_LIMIT) {
    fetchTime = 0;
}

2. I added "importPipe" here:
Pipe urlPipe = new GroupBy("url pipe", Pipe.pipes(urlFromFetchPipe,
urlFromOutlinksPipe, importPipe), new Fields(UrlDatum.URL_FIELD));
urlPipe = new Every(urlPipe, new LatestUrlBuffer(), Fields.RESULTS);

3. And "refreshDelay":
private static class SkipFetchedScoreGenerator implements IScoreGenerator {

    @Override
    public double generateScore(GroupedUrlDatum datum) {
        if (datum.getLastFetched() != 0) {
            if (datum.getLastFetched() + _refreshDelay < System.currentTimeMillis()) {
                return 1.0;
            } else {
                return IScoreGenerator.SKIP_URL_SCORE;
            }
        } else {
            return 1.0;
        }
    }
}
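
The refresh-delay decision in step 3 can be tried in isolation with a small, self-contained sketch. Everything here is hypothetical and independent of Bixo: the class name, the SKIP_SCORE constant (a stand-in for IScoreGenerator.SKIP_URL_SCORE, whose real value is not shown in this thread), and the choice to pass the current time in as a parameter so the rule can be checked with fixed numbers.

```java
// Hypothetical, self-contained sketch of the refresh-delay check; not Bixo's classes.
final class RefreshDelaySketch {

    static final double FETCH_SCORE = 1.0;
    // Stand-in for IScoreGenerator.SKIP_URL_SCORE; the real value is an unknown here.
    static final double SKIP_SCORE = Double.NEGATIVE_INFINITY;

    // lastFetched == 0 means "never fetched"; otherwise refetch only after the delay has passed.
    static double score(long lastFetchedMs, long refreshDelayMs, long nowMs) {
        if (lastFetchedMs == 0) {
            return FETCH_SCORE;
        }
        return (lastFetchedMs + refreshDelayMs < nowMs) ? FETCH_SCORE : SKIP_SCORE;
    }

    public static void main(String[] args) {
        long thirtyDays = 30L * 24 * 60 * 60 * 1000;
        long now = System.currentTimeMillis();
        System.out.println(score(0, thirtyDays, now));                    // never fetched
        System.out.println(score(now - 1000, thirtyDays, now));           // fetched a second ago
        System.out.println(score(now - thirtyDays - 1, thirtyDays, now)); // older than the delay
    }
}
```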


Thanks!


#166 From: "Freddy" <fuad@...>
Date: Mon Nov 30, 2009 12:17 am
Subject: Crawl Delay
fouad_efendi
 
Hi Ken,


I couldn't understand why changing parameters in MyFetchPolicy doesn't change
anything...

After experimenting a lot against a single domain (with different parameters for
FetcherPolicy), I found a simple thing in SimpleCrawlTool:
SimpleGroupingKeyGenerator grouper = new SimpleGroupingKeyGenerator(_userAgent);

So it uses MyFetchPolicy from SimpleCrawlTool to fetch pages, but it uses its own
FetcherPolicy inside the GroupingKeyGenerator.

I changed it to:
IHttpFetcher fetcher = new SimpleHttpFetcher(_maxThreads, _fetcherPolicy,
_userAgent);
SimpleGroupingKeyGenerator grouper = new SimpleGroupingKeyGenerator(fetcher,
false);

-Fuad

#167 From: "Freddy" <fuad@...>
Date: Mon Nov 30, 2009 3:17 am
Subject: Defect in SimpleRobotRules
fouad_efendi
 
We are using the path instead of path + query.
robots.txt may contain rules such as
Disallow: /?field=


===
     @Override
     public boolean isAllowed(URL url) {
         String path = getPath(url);
         ... ... ...
         return _robotRules.isAllowed(path.toLowerCase());
     }
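
One possible way to build the "path + query" string is sketched below. It is an illustration only, not the actual SimpleRobotRules code: the class and method names are hypothetical, and it uses java.net.URI so the example throws no checked exceptions (java.net.URL offers the equivalent getPath() and getQuery() accessors).

```java
import java.net.URI;

// Hypothetical helper, for illustration only; not part of SimpleRobotRules.
final class RobotsPathSketch {

    // Returns the path, plus "?query" when a query is present,
    // so that a rule such as "Disallow: /?field=" has something to match against.
    static String pathWithQuery(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        return (query == null) ? path : path + "?" + query;
    }

    public static void main(String[] args) {
        System.out.println(pathWithQuery(URI.create("http://example.com/?field=1")));
        System.out.println(pathWithQuery(URI.create("http://example.com/a/b")));
    }
}
```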

#168 From: "Freddy" <fuad@...>
Date: Mon Nov 30, 2009 5:33 pm
Subject: Simple Cascading Question...
fouad_efendi
 
Hi,

Simple question; scenario:

A.
urlFromOutlinksPipe = new Each(urlFromOutlinksPipe, new UrlFilter(_urlFilter,
MetaData.FIELDS));

B.
urlFromOutlinksPipe = new Each(urlFromOutlinksPipe, new NormalizeUrlFunction(new
SimpleUrlNormalizer(), MetaData.FIELDS));


Questions:
- are A and B single-threaded? (I believe "yes")
- will Cascading remove duplicates between A and B? (I believe "no")

Thanks,

#169 From: Ken Krugler <KKrugler_lists@...>
Date: Mon Nov 30, 2009 6:09 pm
Subject: Re: Simple Cascading Question...
kkrugler
 
Hi Freddy,

On Nov 30, 2009, at 9:33am, Freddy wrote:

> Hi,
>
> Simple question; scenario:
>
> A.
> urlFromOutlinksPipe = new Each(urlFromOutlinksPipe, new
> UrlFilter(_urlFilter, MetaData.FIELDS));
>
> B.
> urlFromOutlinksPipe = new Each(urlFromOutlinksPipe, new
> NormalizeUrlFunction(new SimpleUrlNormalizer(), MetaData.FIELDS));
>
> Questions:
> - are A and B single-threaded? (I believe "yes")

What do you mean by "single-threaded"?

These will be mapper tasks in Hadoop land, which means they will be parallelized across the number of job runners you've configured for map tasks.

> - will Cascading remove duplicates between A and B? (I believe "no")

Not sure what you mean by "duplicates".

If your code matches the above (you do A first, then B) then I believe Cascading will set these up as a series of map tasks (chained).
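
As a rough analogy only (plain Java streams, not Cascading), a filter step followed by a per-record function handles one record at a time, so two records that become identical after normalization both survive. The filter rule and the normalization below are made-up stand-ins for UrlFilter and SimpleUrlNormalizer.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: Java streams standing in for two chained per-tuple steps.
final class PerTupleStepsSketch {

    // Step A (filter), then step B (normalize); each step sees one URL at a time.
    static List<String> filterThenNormalize(List<String> urls) {
        return urls.stream()
                .filter(u -> u.startsWith("http://")) // hypothetical filter rule
                .map(String::toLowerCase)             // hypothetical normalization
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> in = List.of("http://A.com/", "ftp://a.com/", "http://a.com/");
        // The ftp URL is filtered out; the two http URLs normalize to the same string
        // and both remain in the output.
        System.out.println(filterThenNormalize(in));
    }
}
```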

-- Ken



> Thanks,



#170 From: "Freddy" <fuad@...>
Date: Mon Nov 30, 2009 6:28 pm
Subject: Re: Simple Cascading Question...
fouad_efendi
 
Hi Ken,

I believe A and B will be chained inside the same map task, but I am unsure;
that's why I asked here... What's important for me is when duplicates are
removed, and here's why:

I added a new field to UrlDatum: _lastAnchor.

And, after A and B, I have an additional step C dealing with anchor text.
We may have the same URL but with different anchor text.

I believe A (filter) and B (normalizer) won't delete duplicate UrlDatum (same
URL, different anchor). (?)


Thanks,
Fuad



#171 From: Ken Krugler <KKrugler_lists@...>
Date: Mon Nov 30, 2009 7:01 pm
Subject: Re: Re: Simple Cascading Question...
kkrugler
 
Hi Freddy,

On Nov 30, 2009, at 10:28am, Freddy wrote:

> Hi Ken,
>
> I believe A and B will be chained inside same Map, but I am unsure that's why I asked here... important for me is when duplicates are removed, and that's why:
>
> I added to UrlDatum new field: _lastAnchor.
>
> And, after A and B, I have additional step C dealing with Anchor Text.
> We may have same URL but with different Anchor Text.
>
> I believe A (filter) and B (normalizer) won't delete duplicate UrlDatum (same URL, different Anchor). (?)

Correct. That kind of de-duping would require that you first group by URL (e.g. use a Cascading CoGroup).

-- Ken
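
To make the point concrete, here is a minimal plain-Java sketch (not Cascading code) of why chaining a filter (A) and a normalizer (B) leaves duplicates in place: both look at one record at a time, so two records that end up with the same URL but different anchor text both survive. The Link class, keep() and normalize() are hypothetical stand-ins for UrlDatum, UrlFilter and NormalizeUrlFunction, and stripping the fragment is only a toy normalization.

```java
import java.util.ArrayList;
import java.util.List;

class ChainedEachSketch {

    /** Hypothetical stand-in for a UrlDatum: a URL plus the anchor text it was found with. */
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    /** Step A: a filter sees one record at a time and either keeps or drops it. */
    static boolean keep(Link link) {
        return link.url.startsWith("http://") || link.url.startsWith("https://");
    }

    /** Step B: a normalizer rewrites one record at a time (toy rule: strip the #fragment). */
    static Link normalize(Link link) {
        int hash = link.url.indexOf('#');
        String url = (hash >= 0) ? link.url.substring(0, hash) : link.url;
        return new Link(url, link.anchor);
    }

    /** Applies A then B to each record in turn; no step ever compares two records. */
    static List<Link> filterThenNormalize(List<Link> input) {
        List<Link> output = new ArrayList<>();
        for (Link link : input) {
            if (keep(link)) {
                output.add(normalize(link));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Link> input = new ArrayList<>();
        input.add(new Link("http://example.com/a#top", "home"));
        input.add(new Link("http://example.com/a", "start page"));
        input.add(new Link("mailto:someone@example.com", "contact"));

        // Both http records survive with the same URL and different anchors.
        for (Link link : filterThenNormalize(input)) {
            System.out.println(link.url + " [" + link.anchor + "]");
        }
    }
}
```

Running main prints http://example.com/a twice, once per anchor. Removing one of them needs a step that brings records with the same key together, which is what grouping does.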




--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g





#172 From: "Fuad Efendi" <fuad@...>
Date: Tue Dec 1, 2009 12:31 am
Subject: RE: Re: Simple Cascading Question...
fouad_efendi
 

Hi Ken,

 

So I’ll need a step D (CoGroup) after C, since I need to remove duplicate (URL, Anchor) pairs and Cascading doesn’t do that by default?

Will Cascading automatically remove duplicates of the same Tuple? I hope “yes”...

Thanks,

Fuad

 


Fuad Efendi

+1 416-993-2060

http://www.tokenizer.ca

Data Mining, Vertical Search

 


#173 From: "Freddy" <fuad@...>
Date: Tue Dec 1, 2009 12:37 am
Subject: Hadoop Cluster 0.19.2
fouad_efendi
 
Small problem:
java.lang.NoClassDefFoundError: cascading/flow/PlannerException

It doesn't complain about other external classes (such as the command-line
parser, which is loaded before Cascading).

#174 From: "Fuad Efendi" <fuad@...>
Date: Tue Dec 1, 2009 2:04 am
Subject: RE: Hadoop Cluster 0.19.2
fouad_efendi
 

Hadoop expects dependency jars under /lib inside the job jar, but we have /runtime-libs... am I right?

 

 

Fuad Efendi

+1 416-993-2060

http://www.tokenizer.ca

Data Mining, Vertical Search

 



#175 From: Ken Krugler <KKrugler_lists@...>
Date: Tue Dec 1, 2009 2:54 am
Subject: Re: Re: Simple Cascading Question...
kkrugler
 

On Nov 30, 2009, at 4:31pm, Fuad Efendi wrote:


> Hi Ken,
>
> So that... I’ll need step D (CoGroup) after C: I need to remove duplicate pairs (URL, Anchor) and Cascading doesn’t do it by default?
>
> Will cascading automatically remove duplicates of same Tuple? I hope “yes”...

The simple approach is to do a GroupBy (with the URL as the key), followed by an Every() that uses the First() operator. This will discard all but the first tuple with duplicate URLs.

Assuming these are UrlDatum tuples, then it would look something like:

pipe = new GroupBy(pipe, new Fields(UrlDatum.URL_FIELD));
pipe = new Every(pipe, new First(), Fields.RESULTS);

See the "Get 'DISTINCT' (unique) values from a Tuple stream" in the Cascading Cook Book (http://www.cascading.org/documentation/cook-book.html)

-- Ken
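
The GroupBy + First recipe above can be mimicked in plain Java to see what it keeps and what it throws away. This is only a sketch of the semantics, not Cascading code: Link is a hypothetical stand-in for UrlDatum, and "first" here means first in input order, whereas in a real Hadoop job the order of tuples inside a group is not guaranteed unless you sort. Note the choice of key: grouping on the URL alone drops the extra anchors, while grouping on (URL, anchor) keeps one tuple per distinct pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FirstPerKeySketch {

    /** Hypothetical stand-in for a UrlDatum with the extra anchor field. */
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    /** Group on the URL only and keep the first record seen for each URL. */
    static List<Link> firstPerUrl(List<Link> input) {
        Map<String, Link> firstSeen = new LinkedHashMap<>();
        for (Link link : input) {
            firstSeen.putIfAbsent(link.url, link);
        }
        return new ArrayList<>(firstSeen.values());
    }

    /** Group on the (URL, anchor) pair and keep the first record seen for each pair. */
    static List<Link> firstPerUrlAndAnchor(List<Link> input) {
        Map<List<String>, Link> firstSeen = new LinkedHashMap<>();
        for (Link link : input) {
            // A two-element list works as a composite key (it has equals and hashCode).
            firstSeen.putIfAbsent(Arrays.asList(link.url, link.anchor), link);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        List<Link> input = new ArrayList<>();
        input.add(new Link("http://example.com/a", "home"));
        input.add(new Link("http://example.com/a", "start page"));
        input.add(new Link("http://example.com/a", "home"));
        input.add(new Link("http://example.com/b", "about"));

        System.out.println("by URL: " + firstPerUrl(input).size());
        System.out.println("by (URL, anchor): " + firstPerUrlAndAnchor(input).size());
    }
}
```

With this input, grouping by URL leaves 2 records and grouping by (URL, anchor) leaves 3. If the goal is to keep every distinct anchor for a URL, the anchor field has to be part of the grouping key.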

 


--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g





#176 From: "Freddy" <fuad@...>
Date: Tue Dec 1, 2009 3:03 am
Subject: Running in Production!
fouad_efendi
 
Running in production; slightly modified version from trunk, with added SOLR
Tokenizer; http://www.tokenizer.org/bot.html; agent@...

Crawling about 10000 shopping malls, let's see... and Thank You!!!

First exception:
Exception running tool: step failed: (3/6) ...esponseRate', 'FetchedDatum-httpHeaders', 'crawl-depth', 'FetcherBuffer-fetch-exception']]"][fetch_pipe/7882/]
cascading.flow.FlowException: step failed: (3/6) ...esponseRate', 'FetchedDatum-httpHeaders', 'crawl-depth', 'FetcherBuffer-fetch-exception']]"][fetch_pipe/7882/]
         at cascading.flow.FlowStep$FlowStepJob.call(FlowStep.java:491)
         at cascading.flow.FlowStep$FlowStepJob.call(FlowStep.java:420)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:619)
[


I didn't have such things on my laptop...

Also, I need to understand the real-life difference between -duration and
-numloops...

-Fuad

#177 From: "Freddy" <fuad@...>
Date: Tue Dec 1, 2009 3:14 am
Subject: Re: Running in Production!
fouad_efendi
 
It seems I need to use FQDN (DNS) names in the Hadoop configuration...
I currently use short names (aliases) from the /etc/hosts file.


#178 From: Ken Krugler <KKrugler_lists@...>
Date: Tue Dec 1, 2009 4:10 am
Subject: Re: Re: Running in Production!
kkrugler
 
It's been a while since I've set up Hadoop on my own cluster - I've been using EC2 for all my crawls.

I'll be running a crawl in EC2 tomorrow with the latest from Bixo trunk (and Hadoop 0.19.2), so I'll let you know what issues I run into.

-- Ken



--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g





#179 From: "enachb" <erich@...>
Date: Tue Dec 1, 2009 7:42 am
Subject: Redirect count in status
enachb
 
Hi,

I'm currently using bixo to crawl potentially spammy websites.

Spammers often redirect several times through URL shortening services and
intermediate sites under their control before reaching the final destination,
which makes the number of redirects a useful metric.

I couldn't find the number of redirects in the FetchStatus pipe, so my question is:
How difficult would it be to pass on the number of hops/redirects?

Thanks!
-Erich

#180 From: Ken Krugler <KKrugler_lists@...>
Date: Tue Dec 1, 2009 2:11 pm
Subject: Re: Redirect count in status
kkrugler
 
Hi Erich,

On Nov 30, 2009, at 11:42pm, enachb wrote:

> Hi,
>
> I'm currently using bixo to crawl potentially spammy websites.
>
> Spammers often redirect several times through URL shortening services and intermediate sites under their control before reaching the final destination, which makes the number of redirects a useful metric.
>
> I couldn't find the number of redirects in the FetchStatus pipe, so my question is: How difficult would it be to pass on the number of hops/redirects?

Trivial, especially now that I added a redirect handler to the SimpleHttpFetcher configuration. I needed this to keep track of any permanent redirects.

Let me take a quick look at it, and I might be able to add a one-liner to get you what you need, though you'd have to modify the SimpleHttpFetcher to pull this result out of the HttpClient execution context and put it into the status and fetched datum as metadata.

-- Ken
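
As a rough illustration of the metric Erich is asking about, the sketch below counts hops along a redirect chain. It is plain Java and does not use Bixo's SimpleHttpFetcher or HttpClient: the redirect table is a hypothetical stand-in for real 3xx responses, and countRedirects and maxRedirects are made-up names for this example.

```java
import java.util.HashMap;
import java.util.Map;

class RedirectCountSketch {

    /**
     * Follows redirects through a lookup table (source URL -> target URL) and
     * returns the number of hops taken to reach a URL that does not redirect.
     * Returns -1 if more than maxRedirects hops would be needed, which also
     * covers redirect loops.
     */
    static int countRedirects(Map<String, String> redirects, String startUrl, int maxRedirects) {
        int hops = 0;
        String current = startUrl;
        while (redirects.containsKey(current)) {
            if (hops == maxRedirects) {
                return -1;
            }
            current = redirects.get(current);
            hops++;
        }
        return hops;
    }

    public static void main(String[] args) {
        // Hypothetical chain: shortener -> intermediate site -> tracking page -> final page.
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://short.example/x", "http://mid.example/go");
        redirects.put("http://mid.example/go", "http://track.example/r");
        redirects.put("http://track.example/r", "http://final.example/");

        System.out.println(countRedirects(redirects, "http://short.example/x", 10));
    }
}
```

In a real fetcher the count would come from the HTTP client as it follows each redirect, and would then be attached to the fetch status or fetched datum as metadata, as Ken describes.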


--------------------------------------------
Ken Krugler
+1 530-210-6378
e l a s t i c   w e b   m i n i n g





#181 From: "Fuad Efendi" <fuad@...>
Date: Tue Dec 1, 2009 7:25 pm
Subject: RE: Re: Running in Production!
fouad_efendi
 

Hi Ken,

 

OK, I increased the default value (600 seconds) of <name>mapred.task.timeout</name>, and I can run it now.

There seems to be a performance bottleneck somewhere... I believe it crawls only a few (2-3) URLs concurrently... 3 URLs are returned by getFetcherRequest of (my) FetcherPolicy...

-maxthreads 64 doesn’t seem to have any effect...

 

 

Fuad Efendi

+1 416-993-2060

http://www.tokenizer.ca

Data Mining, Vertical Search

 



Copyright © 2010 Yahoo! Inc. All rights reserved.