Grokbase Groups Lucene dev July 2010

Mark Harwood commented on LUCENE-2557:

I think we're agreed that the effects of IDF are troublesome when ranking variant term matches, but I question whether the default solution should be to remove IDF from the equation completely.

Doing that reminds me of the time my mother thought the shadow in a photograph was annoying and cut it out with a pair of scissors, leaving a big hole in its place.
What we're proposing here instead is the equivalent of some "photoshopping" to retain some of the original information but suitably blurred to provide a more natural balance to the overall picture.

Some degree of IDF can be usefully retained from a FuzzyQuery in order to achieve balance with all the other (potentially non-fuzzy) optional clauses that may exist in a BooleanQuery.
The proposal is that the most natural blending of IDF scores within a FuzzyQuery is to use only the IDF of the input term (which defines the user's original intent) and use this to score a match on any suggested variant. If the input term does not exist in the index, the average IDF of all variants is used as the next best alternative for scoring each variant.
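
To make the proposal concrete, here is a rough standalone sketch of the blending rule. It is not the code in LUCENE-2557.patch. The class name BlendedFuzzyIdf, the method names, and the use of a plain map of document frequencies are all made up for illustration, and "average IDF" is read here as the mean of the per-variant IDF values, which is only one possible interpretation. The idf() formula is the classic 1 + ln(numDocs / (docFreq + 1)) form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from the attached patch.
public class BlendedFuzzyIdf {

    // Classic IDF form: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Returns the one IDF value that would be applied to every matched variant.
    // docFreqs maps each expanded term (variants, plus the input term if it
    // is in the index) to its document frequency.
    static double blendedIdf(String inputTerm, Map<String, Integer> docFreqs, int numDocs) {
        Integer inputDf = docFreqs.get(inputTerm);
        if (inputDf != null && inputDf > 0) {
            // The input term exists: its IDF stands in for all variants.
            return idf(inputDf, numDocs);
        }
        // Otherwise fall back to the mean IDF of the variants (assumed reading).
        double sum = 0.0;
        int count = 0;
        for (int df : docFreqs.values()) {
            if (df > 0) {
                sum += idf(df, numDocs);
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical index size
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("smith", 400);  // hypothetical counts
        docFreqs.put("smiith", 2);
        docFreqs.put("smiht", 1);
        System.out.println("IDF applied to every variant: "
                + blendedIdf("smith", docFreqs, numDocs));
    }
}
```

Because every variant receives the same value, IDF no longer reorders matches inside the fuzzy clause, yet the clause still carries a meaningful weight relative to sibling clauses.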

This approach has exactly the same ranking effect as the existing "remove IDF" policy within a single FuzzyQuery but has the added advantage of sitting better with the other optional clauses that may exist in a containing query.

The question over core vs contrib comes down to what is considered the more natural/expected behaviour.

FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Key: LUCENE-2557
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 3.0.2
Reporter: Jingkei Ly
Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch

The FuzzyQuery often causes misspellings to be ranked higher than the exact match, which seems to be an undesirable property generally.
For example, in an index of surnames, if I search using a FuzzyQuery for "smith", misspellings such as "smiith" or "smiht" would appear near the top of the search results, ahead of documents that match "smith".
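
A small sketch of why this happens, assuming the classic IDF form 1 + ln(numDocs / (docFreq + 1)), which is how I understand the default similarity to compute it. The class name IdfSkewDemo and all the counts below are hypothetical, chosen only to show the direction of the effect: the rarer a term is, the larger its IDF, so a rare typo gets a bigger scoring boost than the common, correctly spelled term.

```java
// Hypothetical illustration; the document counts are invented.
public class IdfSkewDemo {

    // Classic IDF form: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 100000;     // hypothetical size of the surname index
        int smithDocFreq = 5000;  // hypothetical: common, correctly spelled
        int smihtDocFreq = 2;     // hypothetical: rare misspelling

        // The rare misspelling ends up with the larger IDF weight.
        System.out.println("idf(smith) = " + idf(smithDocFreq, numDocs));
        System.out.println("idf(smiht) = " + idf(smihtDocFreq, numDocs));
    }
}
```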

Group: dev
Posted: Jul 23, '10 at 4:23p
Active: Jul 26, '10 at 4:26p