search_dev · Independent Search Engine Developers
Message #490 of 845
Re: [search_dev] Re: Indexing Portal content

Rup,

I totally agree: let the users decide how they like it and what the requirements are.

Others,

There is a fancier way to try and determine duplicates, using statistics and word densities.  Again, not saying this is the "right" way to do it, but just a more modern approach.

(long rant, but of interest to some I hope)

In this approach, each document is considered to be a "bag of words", and using statistical methods, the top N "most important" words are collected and stored as a "fingerprint" for this document.  At a later time, this word fingerprint can be compared with that of other documents, to see if they are the same or not.  In some cases, this is the only measure used.  In other cases, this method is only employed as a tie breaker.

Notice my loose use of N (the number of terms), "important", and "fingerprint".  There are different methods for each of those.  I'll give some examples here.

For N, you could go as low as 5, or perhaps as high as 100.  For the problem Avi mentions, where the title and summary are the same but later parts of the document vary widely, you'd probably want a count on the higher side, say 100 (or more).  I agree with what Avi points out: substantially different documents with a similar title and extract are not a "fringe" case.  Think of replies to emails, or comments on a new mailing list posting, where the original text is often left in place.  Or a set of legal contracts, where only the terms in the last few sections of the doc change (like what the price is!).  In all these cases, it's the end of the doc which may have really important changes, while the upper portion of the doc gives similar-looking results.

But even in these cases, if you set N to 100, or heck, 250, you are more likely to see these changes.

For "important" (selecting the most "important" words to store), there are at least two methods I've heard of.  One method uses the TF/IDF mathematics.  TF being "term frequency", or how densely this word appears in the document.  IDF being "inverse doc frequency", meaning how RARE the word is in the OTHER documents.  So in plain English, important words are the ones that occur a lot in that particular document, and/or that are rarely used in other documents.  Different formulas can be used to calculate TF and IDF, and to combine those two scores.
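
To make the TF/IDF idea concrete, here is a minimal Python sketch, not a reference implementation.  The function names (tokenize, document_frequencies, fingerprint), the smoothed IDF formula, and the three sample documents are all assumptions chosen for illustration; other TF and IDF formulas would be just as valid.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def document_frequencies(corpus_tokens):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    return df

def fingerprint(tokens, df, num_docs, n=100):
    """Return the top-n (word, score) pairs ranked by TF-IDF, highest first."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    scores = {}
    for word, count in counts.items():
        tf = count / total                                    # density in this doc
        idf = math.log((1 + num_docs) / (1 + df[word])) + 1.0  # rarity elsewhere (assumed formula)
        scores[word] = tf * idf
    # Sort by score descending; break ties alphabetically so output is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Hypothetical toy corpus: two near-identical contracts and one unrelated doc.
docs = [
    "the contract price is ten thousand dollars payable on delivery",
    "the contract price is twenty thousand dollars payable on delivery",
    "the weather report says rain on sunday",
]
corpus_tokens = [tokenize(d) for d in docs]
df = document_frequencies(corpus_tokens)
fp0 = fingerprint(corpus_tokens[0], df, len(docs), n=5)
fp1 = fingerprint(corpus_tokens[1], df, len(docs), n=5)
```

With this toy corpus, the one word that differs between the two contracts ("ten" versus "twenty") ranks first in each fingerprint, because it is the rarest term across the corpus.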

A more interesting approach is to use more direct Bayesian or PCA style mathematics to calculate, for each word in a document, how well this word discriminates this document from the others.  Various tests are run for each word (or short phrases), and words that have the biggest variance are chosen.

And finally, for "fingerprint", you can store just the words as an unsorted list.  Or have them sorted most important to least.  Or even store the words and some type of "factor" or "score".  And when you compare two fingerprints, you can decide how many words need to match, and whether or not they need to be in the same order.  And if you stored scores, how closely the scores need to match.
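
The three comparison styles can be sketched in Python as follows.  This is only an illustration: the function names, the thresholds, and the sample fingerprints (lists of (word, score) pairs, most important first) are hypothetical.

```python
def words_match(fp_a, fp_b, min_shared):
    """Unordered test: at least min_shared words appear in both fingerprints."""
    shared = {w for w, _ in fp_a} & {w for w, _ in fp_b}
    return len(shared) >= min_shared

def order_matches(fp_a, fp_b, prefix):
    """Ordered test: the first `prefix` words are identical and in the same order."""
    return [w for w, _ in fp_a[:prefix]] == [w for w, _ in fp_b[:prefix]]

def scores_match(fp_a, fp_b, tolerance):
    """Score test: every shared word has scores within `tolerance` of each other."""
    a, b = dict(fp_a), dict(fp_b)
    shared = a.keys() & b.keys()
    return bool(shared) and all(abs(a[w] - b[w]) <= tolerance for w in shared)

# Hypothetical fingerprints for two nearly identical contracts.
fp_a = [("contract", 0.9), ("price", 0.7), ("delivery", 0.4), ("ten", 0.3)]
fp_b = [("contract", 0.9), ("price", 0.6), ("delivery", 0.4), ("twenty", 0.3)]

same_words = words_match(fp_a, fp_b, min_shared=3)
same_order = order_matches(fp_a, fp_b, prefix=3)
same_scores = scores_match(fp_a, fp_b, tolerance=0.05)
```

Here the two fingerprints pass the unordered and ordered tests on their first three words but fail the score test, which shows how a stricter policy can separate documents that a looser one would merge.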

These calculations can be expensive to compute, so they might be applied only in "tie breaker" scenarios.

Back to Avi's scenario, if you had a 10 page email, and then somebody replied at the very bottom with a 1/2 page response, you are still likely to see a difference in this type of a "fingerprint".  The words in that extra half page should jostle the fingerprint enough to matter.

A very interesting field (related to clustering and spam detection), and something we've actually had to review with customers.

One "cheat" that can foil simpler versions of these methods is having random gibberish text, which still has valid English words, but that happens to meet (or vary, in the case of spam detection) the statistical tests.  So a human looking at the text can see it's random words, but a simple statistics engine is fooled.  This is one reason I favor using more than single words in fulltext analysis.  You can defeat those too by using random sentences, though that dataset takes longer to gather.
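
One way to look at "more than single words" is to compare overlapping word sequences (often called shingles) instead of individual words.  The Python sketch below is an assumption about how that could look, using Jaccard overlap as the similarity measure; the function names and sample strings are made up for the example.

```python
import re

def word_shingles(text, size=3):
    """Return the set of consecutive `size`-word sequences found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def shingle_overlap(text_a, text_b, size=3):
    """Jaccard overlap of the two texts' shingle sets, from 0.0 to 1.0."""
    a, b = word_shingles(text_a, size), word_shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)

# The second string has exactly the same words, in reversed (gibberish) order.
original = "the quick brown fox jumps over the lazy dog"
shuffled = "dog lazy the over jumps fox brown quick the"

single_word_score = shingle_overlap(original, shuffled, size=1)
three_word_score = shingle_overlap(original, shuffled, size=3)
```

With single words the two strings look identical, while with three-word sequences they share nothing, so the reordered gibberish no longer passes as the original.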

Hope this is of interest to at least a few other folks out there.  :-)

Mark

On Feb 1, 2008 6:35 AM, p_rupendra <p_rupendra@...> wrote:
Mark, We are still debating with users what marks content as duplicate:
whether they consider it a duplicate if the title and summary shown in
results are the same while the content itself is not. We want to take a
more iterative approach and leave the option to decide which set of
fields should be the same before a user (business user) calls it a
duplicate result.

Thanks Sam for the insight, here is the complete setup in IDOL server cfg:

In the IDOL cfg, set up two field processing types for reference types:

[Server]
KillDuplicates=*/DREREFERENCE

[FieldProcessing]
1=setupreferencefields1
2=setupreferencefields2


[setupreferencefields1]
Property=ReferenceFields1
PropertyFieldCSVs=*/DREREFERENCE,*/URL

[setupreferencefields2]
Property=ReferenceFields2
PropertyFieldCSVs=*/DRETITLE

[Properties]
0=ReferenceFields1
1=ReferenceFields2

[ReferenceFields1]
ReferenceType=TRUE
TrimSpaces=TRUE

[ReferenceFields2]
ReferenceType=TRUE
TrimSpaces=TRUE


In the above setup, IDOL eliminates duplicate content matching the same
DREREFERENCE and the same URL fields (it keeps the last matched content).
Again, more fields can be added here so that they are all considered
before judging a content item as a duplicate.

When it comes to querying,
Combine=DRETITLE can be used to return only the most relevant result
when the DRETITLEs are the same for more than one result.

You can specify more than one field in Combine as well, so that all the
fields are checked to be the same before the most relevant result of them
all is returned.

A hybrid of both KillDuplicates and Combine in the query gives very
powerful and flexible options to choose and eliminate what users agree
is a duplicate.
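
As a rough illustration of how the two stages fit together, here is a generic Python sketch. It does not call IDOL and is not meant to reproduce its exact behavior; the function names kill_duplicates and combine_results, the field names, and the sample records are all hypothetical.

```python
def kill_duplicates(documents, key_fields):
    """Index-time dedup: a later document replaces an earlier one with the same key."""
    index = {}
    for doc in documents:
        key = tuple(doc.get(f, "").strip() for f in key_fields)
        index.pop(key, None)  # drop the older copy so the newer one takes its place
        index[key] = doc
    return list(index.values())

def combine_results(results, key_fields):
    """Query-time dedup: keep only the highest-scoring result for each key."""
    best = {}
    for hit in results:
        key = tuple(hit.get(f, "").strip() for f in key_fields)
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Hypothetical records: two versions of doc1, plus doc2 with the same title.
docs = [
    {"reference": "doc1", "url": "http://portal/a", "title": "Pricing", "score": 0.4},
    {"reference": "doc1", "url": "http://portal/a", "title": "Pricing (updated)", "score": 0.8},
    {"reference": "doc2", "url": "http://portal/b", "title": "Pricing (updated)", "score": 0.6},
]

indexed = kill_duplicates(docs, ["reference", "url"])  # index-time stage
shown = combine_results(indexed, ["title"])            # query-time stage
```

In this sketch the index-time stage keeps the later copy of doc1, and the query-time stage then shows only the higher-scoring of the two results that share a title. Changing either list of key fields changes what counts as a duplicate at that stage.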

Thanks,
Rup


--- In search_dev@yahoogroups.com, Walter Underwood <wunderwood@...>
wrote:
>
> That goes way back, to before snippets. You have the reasoning correct: if
> two search results look the same, then they are the same.
>
> Ultraseek has options for different deduping policies. One of those is based
> on all the extracted text. This one is affected by the Page Expert text
> extraction rules, so it can sometimes spot duplicates even on different site
> designs.
>
> To get back to SAP Portal, if this is being spidered with Ultra Spider in
> front of IDOL, you could use some of the Ultraseek features on that,
> including the rules for which dupe should be preferred.
>
> wunder
> Former Ultraseek Architect
>
> On 1/31/08 2:47 PM, "Mark Bennett" <mbennett@...> wrote:
>
> > Other folks have argued that, at least in the public portal search space,
> > perception is more important than subtle changes, and that if the content
> > appears duplicate to the casual user, it should be treated as such and
> > removed.  For example, I believe Ultraseek used to treat content with the same
> > title and summary as a duplicate, even if the full text of the 2 pages were
> > not quite the same.
>










--
Mark Bennett / New Idea Engineering, Inc.
Office: 408-733-0387 / Cell: 408-829-6513

Fri Feb 1, 2008 5:35 pm

ttennebkram

p_rupendra, Jan 30, 2008 3:58 pm
We are working on implementing search for SAP Portal and came across an interesting scenario. Search engine used is IDOL 7.2. One approach is to index at...

Garth (gdgrimm), Jan 31, 2008 5:14 pm
I'm not real familiar with IDOL 7.2. But most engines have some mechanism to deduplicate result sets based on an arbitrary criteria. Sometimes that's an easy...

Sam Mefford (sammefford), Jan 31, 2008 9:35 pm
I'm real familiar with IDOL 7.2, but unfortunately that doesn't make me smarter than Garth ;) All I can say is he's right on the money. What he proposes can...

Mark Bennett (ttennebkram), Jan 31, 2008 10:47 pm
Checksums are a solution in some cases, but not in others. The problem is that checksums can be very volatile. Even a single character change will completely...

Avi Rappoport (searchtools1), Jan 31, 2008 11:58 pm
... I agree with you on the problem, but there's more. I've found far too many cases where the title and top part of the content were the same, so they seemed...

Chris Wildgoose (chriswildgoose), Feb 1, 2008 4:23 pm
This is why IDOL Server also has the ability to remove duplicates based on the conceptual relevancy of the content to the other content in the index. The...

Walter Underwood (walter_under...), Feb 1, 2008 12:04 am
That goes way back, to before snippets. You have the reasoning correct: if two search results look the same, then they are the same. Ultraseek has options...

p_rupendra, Feb 1, 2008 4:45 pm
Mark, We are still debating with users what marks a duplicate content. whether they consider it duplicate if title and summary shown in results is the same...

Mark Bennett (ttennebkram), Feb 1, 2008 5:35 pm
Rup, I totally agree, let the users decide how they like it, what the requirements are. Others, There is a fancier way to try and determine duplicates, using...

Garth (gdgrimm), Feb 1, 2008 7:10 pm
Another nuance to this from the technical side is whether you're deduping the index, or deduping the results. The technique mentioned by Emory appears to...

Mark Bennett (ttennebkram), Feb 1, 2008 7:25 pm
Good point Garth, about the engine having only one page in the index, but recording multiple URLs for it. At least in simple cases, FAST ESP has both "url" and...

Garth (gdgrimm), Feb 1, 2008 7:31 pm
Does ESP then choose which of the "urls" is most appropriate, or does it return them all with the result and then custom app code has the task of choosing...

Mark Bennett (ttennebkram), Feb 1, 2008 8:10 pm
It picks one of them at index time. So "url" is always singular. Then "urls" will have at least one URL, possibly more. It uses spider rules at index time,...

p_rupendra, Feb 1, 2008 8:58 pm
I would say we are deduping the index as well as deduping the results. Index time deduping is setup using the standard set of fields which when same a content...

Emory Emrich (emoryemrich), Jan 31, 2008 7:29 pm
IDOL can de-dupe based on either a field (by default its DREREFERENCE which is the URL for web content) or by a checksum on fields you specify. If what you're...

p_rupendra, Jan 31, 2008 8:18 pm
Thanks Garth and Emory for your inputs. In this case checksum on DRECONTENT would be more applicable as urls can be different. Please note that IDOL should...

Emory Emrich (emoryemrich), Jan 31, 2008 8:36 pm
You can specify multiple fields for what goes into the MD5 hash, so yes, this is configurable. ... From: p_rupendra <p_rupendra@...> To:...