Checksums are a solution in some cases, but not in others.
The problem is that checksums can be very volatile. Even a single character change will completely change an md5 checksum. A good filter could normalize out things like whitespace and punctuation. But then you might also have different advertisements in the same content downloaded at different times; again, you can filter those out. Then there could be a field that is a "timestamp" or something; this can also be filtered out. But as you can see, finding a "generic" solution is rough. I've heard that there are also more forgiving checksums, whose amount of change is somehow proportional to the amount of change in the source text.
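To make the filtering idea concrete, here is a minimal Python sketch that normalizes text before checksumming it. The timestamp pattern and the helper names (normalize, raw_checksum, normalized_checksum) are my own assumptions for illustration; real content would need filters tuned to its own ads, dates, and markup.

```python
import hashlib
import re
import string

# Hypothetical pattern: assumes timestamps look like "2008-01-31 14:05:09".
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Filter out the volatile parts: timestamps, case, punctuation, whitespace."""
    # Remove timestamps first, while their hyphens and colons are still present.
    text = TIMESTAMP_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    # Collapse every run of whitespace (including newlines) to one space.
    return " ".join(text.split())


def raw_checksum(text):
    """md5 of the text exactly as downloaded."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def normalized_checksum(text):
    """md5 of the text after the filters above."""
    return raw_checksum(normalize(text))


a = "Quarterly Report, fetched 2008-01-31 14:05:09.\nRevenue  is up!"
b = "quarterly report fetched 2008-02-01 09:30:00. Revenue is up"

print(raw_checksum(a) == raw_checksum(b))                # False
print(normalized_checksum(a) == normalized_checksum(b))  # True
```

The two copies differ in case, punctuation, spacing, and fetch time, so their raw checksums differ, but they collapse to the same normalized checksum. This still only catches copies that become identical after filtering; it is not one of the "more forgiving" checksums mentioned above.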
Then there is the problem of slightly different versions of content. An engineer would be tempted to say that if the content has changed, it is not a duplicate, and both should be presented.
Other folks have argued that, at least in the public portal search space, perception is more important than subtle changes, and that if the content appears duplicate to the casual user, it should be treated as such and removed. For example, I believe Ultraseek used to treat content with the same title and summary as a duplicate, even if the full text of the 2 pages was not quite the same. A further argument is that a user will probably not notice if near dupes are removed, but certainly would notice if near dupes are presented, and users can't complain about something if they don't know it's missing. Since average users can't audit the "Internet", they won't know what's been withheld. Google even gives you a choice, near the end of the results, to see the near duplicate content.
I'm not going to take sides on the subject of near duplicate removal; I think the final decision should be based on the specific project at hand. The next "Google" might want to remove apparent duplicates, whereas a compliance or eDiscovery search application might choose to keep both copies since they do differ by some (possibly important) amount.
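As a rough illustration of the "same title and summary" rule, here is a small Python sketch that collapses a ranked result list on that key while keeping the withheld hits available for a "show near duplicates" option. The hit dictionaries, field names, and the collapse function are hypothetical; this is not how Ultraseek or Google actually implement it.

```python
from collections import OrderedDict


def dedupe_key(hit):
    """Treat hits as duplicates when title and summary match, ignoring case and spacing."""
    title = " ".join(hit["title"].lower().split())
    summary = " ".join(hit["summary"].lower().split())
    return (title, summary)


def collapse(hits):
    """Split ranked hits into (shown, withheld), keeping the best-ranked hit per key."""
    groups = OrderedDict()
    for hit in hits:  # hits are assumed to arrive in relevance order
        groups.setdefault(dedupe_key(hit), []).append(hit)
    shown = [group[0] for group in groups.values()]
    withheld = [hit for group in groups.values() for hit in group[1:]]
    return shown, withheld


hits = [
    {"url": "/hr/policies", "title": "Travel Policy", "summary": "How to book travel."},
    {"url": "/finance/policies", "title": "Travel  policy", "summary": "How to book travel."},
    {"url": "/hr/benefits", "title": "Benefits", "summary": "Health and dental plans."},
]

shown, withheld = collapse(hits)
print([h["url"] for h in shown])     # ['/hr/policies', '/hr/benefits']
print([h["url"] for h in withheld])  # ['/finance/policies']
```

A public portal might display only the shown list with a link to the withheld hits, while a compliance application could simply skip the collapse step and present everything.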
Mark
On 1/31/08, Sam Mefford <meffords@...> wrote:
I'm real familiar with IDOL 7.2, but unfortunately that doesn't make me smarter than Garth ;) All I can say is he's right on the money.
What he proposes can be a good option and it's one we've used before. FYI, the way to do this in IDOL is to make sure the "unique id" field (let's call it UNIQUE_ID for now) is either a ReferenceType field (according to the instructions under "Using KillDuplicates and Combine on ReferenceType fields" on pp 139-141 of IDOLServer_7.2_Admin.pdf) or a FieldCheckType field, then add the "Combine" parameter to your /action=query commands.
One more tip on this approach is that if you don't want it to choose randomly which URL is preferred, you can add a field to each result depending on where it shows in the portal, then use the BIAS parameter on your queries to prefer the result appropriate for the portion of the portal the searcher is currently using. We like to call this "Context-based relevancy".
Sam Mefford Enterprise Search Practice Lead Avalon Consulting, LLC (801) 706-9731
Garth wrote:

I'm not real familiar with IDOL 7.2. But most engines have some mechanism to deduplicate result sets based on arbitrary criteria. Sometimes that's an easy thing to do, sometimes it can be pretty tough. But in any case, it's usually possible. So if you can tell that the content is the same over numerous URLs, there should be some way to tag the multiple URLs of the unique content with the same value (if there isn't already a field that makes the uniqueness of the content obvious). Then at query time, before final processing of the result set for display, deduplicate the set based on that field.

--- In search_dev@yahoogroups.com, "p_rupendra" <p_rupendra@...> wrote:

We are working on implementing search for SAP Portal and came across an interesting scenario. The search engine used is IDOL 7.2. One approach is to index at content-item level along with the meta-data and the portal URL. However, the content-item can be present in multiple portal pages and there can be several URLs for it.

To simplify the indexing process, when content is published into Portal, the publishing process is set up to update an index table which contains all the metadata for the content-item, the destination URL on the portal page for the content-item (this is the issue, it can have more than one destination URL), and the location of the item on the server. Later, a separate process creates the IDX file for IDOL server by looking at this table, and the IDX is indexed into IDOL.

As the content-item is present on multiple pages on Portal, the index table used by the indexing process would have multiple rows with only the destination URL different per row. This returns duplicate results while searching: though the destination URLs are different in each result, the content is the same once the user lands there.

If anyone has any ideas for the above scenario, or any other approaches for indexing the Portal content, please share your comments.

Thanks, Rup
--
Mark Bennett / New Idea Engineering, Inc.
Office: 408-733-0387 / Cell: 408-829-6513