FAQ
Why not just create a new field for this? That is, if you have
FieldA, create field FieldAFuzzy and put the various permutations there.

The fuzzy scorer/parser can be changed to automatically use the
XXXXFuzzy field when required.

You could also store positions, and allow that the first term is the
"closest", next is the second closest, etc. to add support for a slop
factor.

This is similar to the same way fast phonetic searches can be
implemented.

If you do it this way, you don't have any of the synchronization
issues between the index and the external "fuzzy" index.

On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-1513?
page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel&focusedCommentId=12661314#action_12661314 ]

Robert Muir commented on LUCENE-1513:
-------------------------------------

otis, discussion was on java-user.

again, I apologize for the messy code. as mentioned there, my setup
is very specific to exactly what I am doing and in no way is this
code ready. But since i'm currently pretty busy with other things
at work I just wanted to put something up here for anyone else
interested.

theres the issues you mentioned, and also some i mentioned on java-
user. for example how to handle updates to indexes that introduce
new terms (they must be added to auxiliary index), or even if
auxiliary index is the best approach.

the general idea is that instead of enumerating terms to find
terms, the deletion neighborhood as described in the paper is used
instead. this way search time is not linear based on number of
terms. yes you are correct that it only can guarantee edit
distances of K which is determined at index time. perhaps this
should be configurable, but i hardcoded k=1 for simplicity. i think
its something like 80% of typos...

as i mentioned on the list another idea is you could implement
FastSS (not the wC variant) with deletion positions maybe by using
payloads. This would require more space but eliminate the candidate
verification step. maybe it would be nice to have some of their
other algorithms such as block-based,etc available also.


fastss fuzzyquery
-----------------

Key: LUCENE-1513
URL: https://issues.apache.org/jira/browse/
LUCENE-1513
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Robert Muir
Priority: Minor
Attachments: fastSSfuzzy.zip


code for doing fuzzyqueries with fastssWC algorithm.
FuzzyIndexer: given a lucene field, it enumerates all terms and
creates an auxiliary offline index for fuzzy queries.
FastFuzzyQuery: similar to fuzzy query except it queries the
auxiliary index to retrieve a candidate list. this list is then
verified with levenstein algorithm.
sorry but the code is a bit messy... what I'm actually using is
very different from this so its pretty much untested. but at least
you can see whats going on or fix it up.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 6 of 17 | next ›
Discussion Overview
groupjava-dev @
categorieslucene
postedJan 6, '09 at 6:03p
activeJan 7, '09 at 1:30a
posts17
users3
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase