a deletion neighborhood can be pretty large (for example robert is something
like robert obert rbert robrt robet ...)
so if you have a 100 million docs with 1 billion words, but only 100k unique
terms, it definitely would be wasteful to have 1 billion deletion
neighborhoods when you only need 100k.
On Tue, Jan 6, 2009 at 4:02 PM, robert engels wrote:

Why not just create a new field for this? That is, if you have FieldA,
create field FieldAFuzzy and put the various permutations there.

The fuzzy scorer/parser can be changed to automatically use the XXXXFuzzy
field when required.

You could also store positions, and allow that the first term is the
"closest", next is the second closest, etc. to add support for a slop

This is similar to the same way fast phonetic searches can be implemented.

If you do it this way, you don't have any of the synchronization issues
between the index and the external "fuzzy" index.

On Jan 6, 2009, at 2:57 PM, Robert Muir (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314

Robert Muir commented on LUCENE-1513:

otis, discussion was on java-user.

again, I apologize for the messy code. as mentioned there, my setup is
very specific to exactly what I am doing and in no way is this code ready.
But since i'm currently pretty busy with other things at work I just wanted
to put something up here for anyone else interested.

theres the issues you mentioned, and also some i mentioned on java-user.
for example how to handle updates to indexes that introduce new terms (they
must be added to auxiliary index), or even if auxiliary index is the best

the general idea is that instead of enumerating terms to find terms, the
deletion neighborhood as described in the paper is used instead. this way
search time is not linear based on number of terms. yes you are correct that
it only can guarantee edit distances of K which is determined at index time.
perhaps this should be configurable, but i hardcoded k=1 for simplicity. i
think its something like 80% of typos...

as i mentioned on the list another idea is you could implement FastSS (not
the wC variant) with deletion positions maybe by using payloads. This would
require more space but eliminate the candidate verification step. maybe it
would be nice to have some of their other algorithms such as block-based,etc
available also.

fastss fuzzyquery

Key: LUCENE-1513
URL: https://issues.apache.org/jira/browse/LUCENE-1513
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Robert Muir
Priority: Minor
Attachments: fastSSfuzzy.zip

code for doing fuzzyqueries with fastssWC algorithm.
FuzzyIndexer: given a lucene field, it enumerates all terms and creates
an auxiliary offline index for fuzzy queries.
FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary
index to retrieve a candidate list. this list is then verified with
levenstein algorithm.
sorry but the code is a bit messy... what I'm actually using is very
different from this so its pretty much untested. but at least you can see
whats going on or fix it up.
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Robert Muir

Search Discussions

Discussion Posts


Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 7 of 18 | next ›
Discussion Overview
groupdev @
postedJan 6, '09 at 6:03p
activeFeb 19, '10 at 9:28p



site design / logo © 2021 Grokbase