search_dev · Independent Search Engine Developers
Re: [search_dev] Re: Indexing Portal content

Checksums are a solution in some cases, but not in others.
 
The problem is that checksums can be very volatile.  Even a single character change will completely change an MD5 checksum.  A good filter could normalize out things like whitespace and punctuation.  But then you might also have different advertisements in the same content downloaded at different times; again, you can filter those out.  Then there could be a field that is a "timestamp" or something; this can also be filtered out.  But as you can see, finding a "generic" solution is rough.  I've heard that there are also more forgiving checksums, whose amount of change is somehow proportional to the amount of change in the source text.
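
A minimal Python sketch of the filtering idea, for illustration only. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) and the sample strings are assumptions, not something any particular engine does. The second half is not a checksum at all: it swaps in word-shingle Jaccard similarity as one simple stand-in for a "more forgiving" measure, where the score moves roughly in proportion to how much the text changed.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def md5_of(text: str) -> str:
    """Hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word n-grams taken from the normalized text."""
    words = normalize(text).split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str) -> float:
    """Shingle-set overlap: 1.0 for identical text, lower as the texts diverge."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Two hypothetical downloads of the same page, differing in spacing and punctuation.
page_v1 = "Quarterly results:  revenue rose, costs fell."
page_v2 = "Quarterly results: revenue rose costs fell"

raw_match = md5_of(page_v1) == md5_of(page_v2)  # False: raw bytes differ
normalized_match = md5_of(normalize(page_v1)) == md5_of(normalize(page_v2))  # True
```

Advertisements and timestamp fields would need their own site-specific filters before hashing, which is exactly why a generic solution is hard. With the similarity score, a project picks a threshold (say 0.9, also an assumption) instead of requiring an exact match.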

Then there is the problem of slightly different versions of content.  An engineer would be tempted to say that if the content has changed, it is not a duplicate, and both should be presented.

Other folks have argued that, at least in the public portal search space, perception is more important than subtle changes, and that if the content appears duplicate to the casual user, it should be treated as such and removed.  For example, I believe Ultraseek used to treat content with the same title and summary as a duplicate, even if the full text of the two pages was not quite the same.  A further argument is that a user will probably not notice if near dupes are removed, but certainly would notice if near dupes are presented, and users can't complain about something if they don't know it's missing.  Since average users can't audit the "Internet", they won't know what's been withheld.  Google even gives you a choice, near the end of the results, to see the near duplicate content.
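
As a rough illustration of the "looks the same, so treat it as the same" rule, here is a Python sketch that collapses hits sharing a title and summary and keeps the withheld ones so the UI can still offer them. The `Hit` class, the field names, and the example URLs are hypothetical, and this is not a description of how Ultraseek or Google actually implement it.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    title: str
    summary: str


def display_key(hit: Hit) -> tuple:
    """What a casual user sees: case- and whitespace-insensitive title and summary."""
    return (" ".join(hit.title.lower().split()),
            " ".join(hit.summary.lower().split()))


def collapse_lookalikes(hits):
    """Split ranked hits into (shown, withheld), keeping the first of each look-alike group."""
    seen = set()
    shown, withheld = [], []
    for hit in hits:
        key = display_key(hit)
        if key in seen:
            withheld.append(hit)   # looks identical to something already shown
        else:
            seen.add(key)
            shown.append(hit)
    return shown, withheld


hits = [
    Hit("http://example.com/a", "Release Notes", "What changed in version 2."),
    Hit("http://example.com/b", "Release  notes", "What changed in version 2."),
    Hit("http://example.com/c", "Install Guide", "How to install version 2."),
]
shown, withheld = collapse_lookalikes(hits)
```

Returning the withheld list, rather than discarding it, is what makes a "show omitted results" link possible at the end of the page.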

I'm not going to take sides on the subject of near duplicate removal; I think the final decision should be based on the specific project at hand.  The next "Google" might want to remove apparent duplicates, whereas a compliance or eDiscovery search application might choose to keep both copies since they do differ by some (possibly important) amount.

Mark
 
On 1/31/08, Sam Mefford <meffords@...> wrote:
I'm real familiar with IDOL 7.2, but unfortunately that doesn't make me smarter than Garth ;)  All I can say is he's right on the money.

What he proposes can be a good option and it's one we've used before.  FYI, the way to do this in IDOL is to make sure the "unique id" field (let's call it UNIQUE_ID for now) is either a ReferenceType field (according to the instructions under "Using KillDuplicates and Combine on ReferenceType fields" on pp 139-141 of IDOLServer_7.2_Admin.pdf) or a FieldCheckType field, then use the "Combine" parameter added to your /action=query commands.

One more tip on this approach is that if you don't want it to choose randomly which URL is preferred, you can add a field to each result depending on where it shows in the portal, then use the BIAS parameter on your queries to prefer the result appropriate for the portion of the portal the searcher is currently using.  We like to call this "Context-based relevancy".
Sam Mefford
Enterprise Search Practice Lead
Avalon Consulting, LLC
(801) 706-9731
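
The "context-based relevancy" idea above can be sketched in application code. This is only a Python emulation of the preference logic, not IDOL's BIAS parameter, and the `portal_section` field name and sample URLs are hypothetical. It assumes each hit already carries the UNIQUE_ID field and arrives in relevance order.

```python
def pick_preferred(hits, current_section):
    """For each UNIQUE_ID keep one hit, preferring the searcher's current section."""
    chosen = {}
    for hit in hits:  # hits are assumed to arrive in relevance order
        uid = hit["UNIQUE_ID"]
        best = chosen.get(uid)
        if best is None:
            chosen[uid] = hit  # first hit seen for this content item
        elif (hit["portal_section"] == current_section
              and best["portal_section"] != current_section):
            chosen[uid] = hit  # swap in the copy from the searcher's own section
    return list(chosen.values())


hits = [
    {"UNIQUE_ID": "doc-1", "portal_section": "hr", "url": "/hr/policy"},
    {"UNIQUE_ID": "doc-1", "portal_section": "finance", "url": "/finance/policy"},
    {"UNIQUE_ID": "doc-2", "portal_section": "hr", "url": "/hr/forms"},
]
for_finance_user = pick_preferred(hits, "finance")
```

A searcher in the finance area gets `/finance/policy` for doc-1, while a searcher anywhere else falls back to the highest-ranked copy, so the choice is no longer random.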


Garth wrote:
I'm not real familiar with IDOL 7.2.

But most engines have some mechanism to deduplicate result sets 
based on arbitrary criteria.  Sometimes that's an easy thing to do, 
sometimes it can be pretty tough.  But in any case, it's usually 
possible.

So if you can tell that the content is the same over numerous URLs, 
there should be some way to tag the multiple URLs of the unique 
content with the same value (if there isn't already a field that makes the 
uniqueness of the content obvious).  Then at query time, before final 
processing of the result set for display, deduplicate the set based on 
that field.
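
The two steps above (tag at index time, deduplicate at query time) can be sketched in Python as follows. The `content_key` field name, the record layout, and the choice of a SHA-1 digest of the body as the tag are all assumptions for illustration; any value that is identical across the URLs of one content item would do.

```python
import hashlib


def tag_with_content_key(records):
    """Index time: give every record with the same body the same content_key."""
    for record in records:
        digest = hashlib.sha1(record["body"].encode("utf-8")).hexdigest()
        record["content_key"] = digest
    return records


def dedupe_by_field(results, field):
    """Query time: keep the first (highest-ranked) result for each field value."""
    seen = set()
    unique = []
    for result in results:
        value = result[field]
        if value not in seen:
            seen.add(value)
            unique.append(result)
    return unique


records = tag_with_content_key([
    {"url": "/portal/page1/item", "body": "Travel policy text"},
    {"url": "/portal/page2/item", "body": "Travel policy text"},
    {"url": "/portal/page3/other", "body": "Expense policy text"},
])
results = dedupe_by_field(records, "content_key")
```

If the content already has a stable identifier, that identifier is a better tag than a digest, since it does not change when the body is edited.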

--- In search_dev@yahoogroups.com, "p_rupendra" <p_rupendra@...> wrote:

We are working on implementing search for SAP Portal and came across an interesting scenario. The search engine used is IDOL 7.2. One approach is to index at the content-item level along with the metadata and the portal URL. However, the content-item can be present in multiple portal pages, and there can be several URLs for it.

To simplify the indexing process, when content is published into Portal, the publishing process is set up to update an index table which contains all the metadata for the content-item, the destination URL on the portal page for the content-item (this is the issue: it can have more than one destination URL), and the location of the item on the server. Later, a separate process creates the IDX file for IDOL server by looking at this table, and the IDX is indexed into IDOL.

As the content-item is present on multiple pages on Portal, the index table used by the indexing process would have multiple rows with only the destination URL different per row. This returns duplicate results while searching: though the destination URLs are different in each result, the content is the same once the user lands there.

If anyone has any ideas for the above scenario, or any other approaches for indexing the Portal content, please share your comments.
Thanks,
Rup
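
One way to picture the situation is to collapse the index-table rows before the IDX file is generated, so each content-item becomes a single record carrying all of its destination URLs. The Python sketch below assumes hypothetical column names (`item_id`, `title`, `server_path`, `destination_url`) and says nothing about the IDX format itself.

```python
def collapse_rows(rows):
    """Merge index-table rows that differ only in destination_url."""
    merged = {}
    for row in rows:
        # Everything except the URL identifies the content item.
        key = tuple(sorted((k, v) for k, v in row.items() if k != "destination_url"))
        if key not in merged:
            record = {k: v for k, v in row.items() if k != "destination_url"}
            record["destination_urls"] = []
            merged[key] = record
        merged[key]["destination_urls"].append(row["destination_url"])
    return list(merged.values())


rows = [
    {"item_id": "42", "title": "Travel policy", "server_path": "/docs/42.pdf",
     "destination_url": "/portal/hr/travel"},
    {"item_id": "42", "title": "Travel policy", "server_path": "/docs/42.pdf",
     "destination_url": "/portal/finance/travel"},
    {"item_id": "43", "title": "Expense form", "server_path": "/docs/43.pdf",
     "destination_url": "/portal/finance/expenses"},
]
items = collapse_rows(rows)
```

With one record per content-item, the engine indexes the content once, and the application decides at display time which of the stored URLs to link to.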

    
 
 
  



--
Mark Bennett / New Idea Engineering, Inc.
Office: 408-733-0387 / Cell: 408-829-6513

Thu Jan 31, 2008 10:47 pm

ttennebkram

We are working on implementing search for SAP Portal and came across an interesting scenario. Search engine used is IDOL 7.2. One approach is to index at...
p_rupendra
Jan 30, 2008
3:58 pm

I'm not real familiar with IDOL 7.2. But most engines have some mechanism to deduplicate result sets based on arbitrary criteria. Sometimes that's an easy...
Garth
gdgrimm
Jan 31, 2008
5:14 pm

I'm real familiar with IDOL 7.2, but unfortunately that doesn't make me smarter than Garth ;) All I can say is he's right on the money. What he proposes can...
Sam Mefford
sammefford
Jan 31, 2008
9:35 pm

Checksums are a solution in some cases, but not in others. The problem is that checksums can be very volatile. Even a single character change will completely...
Mark Bennett
ttennebkram
Jan 31, 2008
10:47 pm

... I agree with you on the problem, but there's more. I've found far too many cases where the title and top part of the content were the same, so they seemed...
Avi Rappoport
searchtools1
Jan 31, 2008
11:58 pm

This is why IDOL Server also has the ability to remove duplicates based on the conceptual relevancy of the content to the other content in the index. The...
Chris Wildgoose
chriswildgoose
Feb 1, 2008
4:23 pm

That goes way back, to before snippets. You have the reasoning correct: if two search results look the same, then they are the same. Ultraseek has options...
Walter Underwood
walter_under...
Feb 1, 2008
12:04 am

Mark, We are still debating with users what marks a duplicate content. whether they consider it duplicate if title and summary shown in results is the same...
p_rupendra
Feb 1, 2008
4:45 pm

Rup, I totally agree, let the users decide how they like it, what the requirements are. Others, There is a fancier way to try and determine duplicates, using...
Mark Bennett
ttennebkram
Feb 1, 2008
5:35 pm

Another nuance to this from the technical side is whether you're deduping the index, or deduping the results. The technique mentioned by Emory appears to...
Garth
gdgrimm
Feb 1, 2008
7:10 pm

Good point Garth, about the engine having only one page in the index, but recording multiple URLs for it. At least in simple cases, FAST ESP has both "url" and...
Mark Bennett
ttennebkram
Feb 1, 2008
7:25 pm

Does ESP then choose which of the "urls" is most appropriate, or does it return them all with the result and then custom app code has the task of choosing...
Garth
gdgrimm
Feb 1, 2008
7:31 pm

It picks one of them at index time. So "url" is always singular. Then "urls" will have at least one URL, possibly more. It uses spider rules at index time,...
Mark Bennett
ttennebkram
Feb 1, 2008
8:10 pm

I would say we are deduping the index as well as deduping the results. Index time deduping is setup using the standard set of fields which when same a content...
p_rupendra
Feb 1, 2008
8:58 pm

IDOL can de-dupe based on either a field (by default its DREREFERENCE which is the URL for web content) or by a checksum on fields you specify. If what you're...
Emory Emrich
emoryemrich
Jan 31, 2008
7:29 pm

Thanks Garth and Emory for your inputs. In this case checksum on DRECONTENT would be more applicable as urls can be different. Please note that IDOL should...
p_rupendra
Jan 31, 2008
8:18 pm

You can specify multiple fields for what goes into the MD5 hash, so yes, this is configurable. ... From: p_rupendra <p_rupendra@...> To:...
Emory Emrich
emoryemrich
Jan 31, 2008
8:36 pm