Grokbase Groups Lucene dev July 2010

Mark Harwood commented on LUCENE-2557:

bq. I don't understand why we need to average any IDFs? This seems really costly

IDF lookups and averaging should only be calculated for the top "n" terms that finally make it into the query, where "top" is determined by some edit-distance threshold or synonymy measure. All the doc frequency information required for IDF is available in RAM on the TermEnum, which is iterated across anyway, so it shouldn't incur any extra disk seeks. So, given a query that expands to 1,000 terms, the cost of computing the average IDF for that set of terms is surely lost in the cost of 1,000 disk seeks on the TermDocs during query evaluation? I need to review the code to remind myself how it is processed, but it feels like it should be cheap.
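To make the cost argument concrete, here is a minimal sketch of averaging IDF over the top "n" expanded terms using only doc frequencies already held in memory. It is not the code in the attached patch. The class and method names are hypothetical, the doc frequencies are made up, and the formula is assumed to be the classic 1 + ln(numDocs / (docFreq + 1)) form.

```java
final class AverageIdfSketch {

    // Assumed classic IDF formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Averages the IDF of at most topN terms. The docFreqs array is assumed
    // to be ordered best-first (e.g. by edit distance), so only the leading
    // entries are read. No I/O happens here: it is one pass over ints in RAM.
    static double averageIdf(int[] docFreqs, int numDocs, int topN) {
        int n = Math.min(topN, docFreqs.length);
        if (n <= 0) {
            throw new IllegalArgumentException("need at least one term");
        }
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += idf(docFreqs[i], numDocs);
        }
        return sum / n;
    }

    public static void main(String[] args) {
        // Hypothetical doc frequencies for five expanded terms.
        int[] docFreqs = {900, 40, 3, 1, 1};
        System.out.println(averageIdf(docFreqs, 10_000, 5));
    }
}
```

The work is a single loop of logarithms over at most "n" integers, which is the basis for the claim that it should be small next to the per-term TermDocs access.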

bq. average docfreq across all 50 terms even, maybe the top-5 or so is sufficient.

That could work. The IDF score simply has to be a value that is used as a constant for all the expanded terms in a fuzzy query and that, as an added bonus, can be usefully contrasted with other query clauses. The averaging policy is just a fall-back position for the rarer situations where the user's original input term has no associated IDF value we can use. If this policy is a performance concern, we could reduce the number of terms as you suggest, or just ignore IDF entirely in this case, but I'm not sure the averaging costs represent any real performance concern given the IO costs of accessing TermDocs.
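The policy described above can be sketched as follows, again as an illustration rather than the patch itself. All names are hypothetical. The sketch assumes the classic 1 + ln(numDocs / (docFreq + 1)) IDF formula and treats a doc frequency of 0 as "the original term is not in the index".

```java
final class FuzzyIdfPolicySketch {

    // Assumed classic IDF formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Returns the single IDF value shared by every expanded term.
    // 1. If the user's original term exists in the index, use its IDF.
    // 2. Otherwise fall back to the average IDF of the expanded terms.
    // 3. With no expanded terms, return 1.0, i.e. ignore IDF (an assumption).
    static double sharedIdf(int originalDocFreq, int[] expandedDocFreqs, int numDocs) {
        if (originalDocFreq > 0) {
            return idf(originalDocFreq, numDocs);
        }
        if (expandedDocFreqs.length == 0) {
            return 1.0;
        }
        double sum = 0.0;
        for (int docFreq : expandedDocFreqs) {
            sum += idf(docFreq, numDocs);
        }
        return sum / expandedDocFreqs.length;
    }

    public static void main(String[] args) {
        // Original term present in the index: its own IDF is used for all terms.
        System.out.println(sharedIdf(1000, new int[]{1, 2}, 10_000));
        // Original term absent: average of the expanded terms is used instead.
        System.out.println(sharedIdf(0, new int[]{1, 2}, 10_000));
    }
}
```

Limiting the fall-back to the top 5 terms, as suggested in the quote, would only mean passing a shorter array.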

FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Key: LUCENE-2557
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 3.0.2
Reporter: Jingkei Ly
Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch

FuzzyQuery often ranks misspellings higher than the exact match, which is generally an undesirable property.
For example, in an index of surnames, a FuzzyQuery search for "smith" places misspellings such as "smiith" or "smiht" near the top of the search results, ahead of documents that match "smith".
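One likely reason, consistent with the IDF discussion in this thread, is that a misspelling is rare in the index and therefore receives a higher IDF than the common, correctly spelled term. The sketch below illustrates this with made-up doc frequencies and the assumed classic 1 + ln(numDocs / (docFreq + 1)) formula; the class name is hypothetical.

```java
final class IdfRankingSketch {

    // Assumed classic IDF formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10_000;      // hypothetical index size
        int smithDocFreq = 1_000;  // hypothetical: "smith" is common
        int smihtDocFreq = 1;      // hypothetical: "smiht" is a rare typo

        double smithIdf = idf(smithDocFreq, numDocs);
        double smihtIdf = idf(smihtDocFreq, numDocs);

        // The rare misspelling gets the larger IDF, so with per-term IDF
        // its documents can outscore exact matches on "smith".
        System.out.println("idf(smith) = " + smithIdf);
        System.out.println("idf(smiht) = " + smihtIdf);
        System.out.println("misspelling favoured: " + (smihtIdf > smithIdf));
    }
}
```

Using one shared IDF for every expanded term removes this difference, leaving the edit-distance based boost to separate exact matches from near matches.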

Posted: Jul 23, '10 at 4:23p
Active: Jul 26, '10 at 4:26p
