
Robert Muir commented on LUCENE-2557:

bq. If this policy is a performance concern then we could reduce the number of terms as you suggest or just ignore IDF entirely in this case but I'm not sure the averaging costs represent any kind of real performance concern given the IO costs of accessing TermDocs.

I suggested reducing the number of terms (for the averaging), but also the number of default expansions.
I think in general expanding to 1024 is obscene...

But also, if we reduce this number, FuzzyTermsEnum itself gets faster, too.
FuzzyTermsEnum is aware (via an attribute) when the priority queue is filled, and it knows the minimal score to be competitive.
When a certain edit distance is no longer competitive, it optimizes itself by swapping in a more efficient Automaton.
This is safe because the priority queue's comparator orders by score first, then by the term's compareTo (lexicographic order).
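
To make the priority-queue argument concrete, here is a minimal sketch of a bounded queue ordered by score and then by term. It is not Lucene's actual code: the class TopTermsSketch and its methods offer, minCompetitiveBoost and contains are hypothetical names chosen for this illustration, and the boosts are made-up values.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical bounded collector of fuzzy expansions; not Lucene's implementation. */
final class TopTermsSketch {
    /** A candidate expansion: the term text and its boost (score). */
    static final class ScoredTerm {
        final String term;
        final float boost;
        ScoredTerm(String term, float boost) { this.term = term; this.boost = boost; }
    }

    // Least competitive entry first: lowest boost, and among equal boosts
    // the lexicographically largest term.
    static final Comparator<ScoredTerm> LEAST_COMPETITIVE_FIRST = (a, b) -> {
        int byBoost = Float.compare(a.boost, b.boost);
        return byBoost != 0 ? byBoost : b.term.compareTo(a.term);
    };

    private final int maxSize;
    private final PriorityQueue<ScoredTerm> pq = new PriorityQueue<>(LEAST_COMPETITIVE_FIRST);

    TopTermsSketch(int maxSize) { this.maxSize = maxSize; }

    /** Adds a candidate, then evicts the least competitive entry if over capacity. */
    void offer(String term, float boost) {
        pq.add(new ScoredTerm(term, boost));
        if (pq.size() > maxSize) pq.poll();
    }

    /** Boost of the weakest kept entry once the queue is full; -infinity before that. */
    float minCompetitiveBoost() {
        return pq.size() < maxSize ? Float.NEGATIVE_INFINITY : pq.peek().boost;
    }

    /** True if the given term is currently kept in the queue. */
    boolean contains(String term) {
        for (ScoredTerm t : pq) {
            if (t.term.equals(term)) return true;
        }
        return false;
    }
}
```

If terms are offered in sorted order, a later term whose boost only equals minCompetitiveBoost() loses the tie-break and is evicted immediately, so under that assumption the enum can skip everything that cannot score strictly higher.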

Simple example: let's say you ask for a maximum of 1 expansion, with a fuzzy query of maximum edit distance 1.
As soon as the enum finds a term of ed=1, further terms of ed=1 are no longer competitive, so it will then try to seek
to an exact match (swapping in an ed=0 automaton) and exit, instead of wasting time seeking to useless terms.
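
The example can be simulated in a few lines. This is only a sketch under simplifying assumptions: the sorted term dictionary is a plain list, the automaton swap is stood in for by a binary search for the exact term, and the boost is treated as depending on edit distance alone. SingleExpansionSketch, Result and bestSingleExpansion are hypothetical names, not Lucene classes.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical simulation of "1 expansion, max edit distance 1"; not Lucene code. */
final class SingleExpansionSketch {
    /** The chosen term (or null) and how many terms were scanned one by one. */
    static final class Result {
        final String term;
        final int scanned;
        Result(String term, int scanned) { this.term = term; this.scanned = scanned; }
    }

    /** Plain Levenshtein distance using two rows of dynamic programming. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    /** Walks sortedTerms in order, keeping at most one expansion within edit distance 1. */
    static Result bestSingleExpansion(List<String> sortedTerms, String query) {
        int scanned = 0;
        for (int i = 0; i < sortedTerms.size(); i++) {
            String term = sortedTerms.get(i);
            scanned++;
            int ed = editDistance(query, term);
            if (ed == 0) return new Result(term, scanned);
            if (ed == 1) {
                // The single slot is taken by an ed=1 term, so only an exact
                // match can still win: seek straight to it instead of scanning on.
                List<String> rest = sortedTerms.subList(i + 1, sortedTerms.size());
                boolean exactExists = Collections.binarySearch(rest, query) >= 0;
                return new Result(exactExists ? query : term, scanned);
            }
        }
        return new Result(null, scanned);
    }
}
```

With a made-up dictionary of smash, smiht, smiith, smile, smith, smitt, smyth and the query smith, the scan stops at smiith and jumps directly to smith without examining smile.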

It's a bit more complicated, since the boost value is really not just edit distance but also string length, but I think this illustration works.
It's one reason why I think we should try to 'improve the defaults'.

FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Key: LUCENE-2557
Project: Lucene - Java
Issue Type: Bug
Components: Query/Scoring
Affects Versions: 3.0.2
Reporter: Jingkei Ly
Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch

The FuzzyQuery often causes misspellings to be ranked higher than the exact match, which seems to be an undesirable property generally.
For example, in an index of surnames, if I search using a FuzzyQuery for "smith", misspellings such as "smiith" or "smiht" would appear near the top of the search results, ahead of documents that match "smith".
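
The quoted comment mentions IDF, and a small calculation shows how IDF can favour a rare misspelling. This is a sketch only: it assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)), and the index size and document frequencies are invented for illustration. IdfSketch is a hypothetical class name.

```java
/** Hypothetical illustration of IDF for a common term versus a rare misspelling. */
final class IdfSketch {
    // Assumed formula: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;  // invented index size
        // Invented frequencies: "smith" is common, "smiith" appears once.
        System.out.printf("idf(smith)  = %.2f%n", idf(10_000, numDocs));
        System.out.printf("idf(smiith) = %.2f%n", idf(1, numDocs));
    }
}
```

Under these assumptions the misspelling's IDF is much larger than that of the correct spelling, so unless the fuzzy boost compensates, a document matching only the misspelling can outscore one matching "smith" exactly.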

Posted: Jul 23, '10 at 4:23p
Active: Jul 26, '10 at 4:26p