FAQ
robert theres only one problem i see: i don't see how you can do a single
search since fastssWC returns some false positives (with k=1 it will still
return some things with ED of 2). maybe if you store the deletion position
information as a payload (thus using original fastss where there are no
false positives) it would work though. i looked at storing position
information but it appeared like it might be complex and the api was (is)
still marked experimental so i didn't go that route.

i also agree lucene index might not be the best possible data structure...
just convenient thats all. i used it because i store other things related to
the term besides deletion neighborhoods for my fuzzy matching.

i guess i'll also mention that i do think storage size should be a big
consideration. you really don't need this kind of stuff unless you are
searching pretty big indexes in the first place (for <= few million docs the
default fuzzy is probably just fine for a lot of people).

for me, the whole thing was about turning 30second queries into 1 second
queries by removing a linear algorithm, i didn't really optimize much beyond
that because i was just very happy to have reasonable performance..
On Tue, Jan 6, 2009 at 6:26 PM, robert engels wrote:

I understand now.
The index in my case would definitely be MUCH larger, but I think it would
perform better, as you only need to do a single search - for obert (if you
assume it was a misspelling).

In your case you would eventually do an OR search in the lucene index for
all possible matches (robert, roberta, roberto, ...) which could be much
larger with some commonly prefixed/postfixed words).

Classic performance vs. size trade-off. In your case where it is not for
misspellings, the performance difference might be worthwhile.

Still, in your case, I am not sure using a Lucene index as the external
index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit
permutation) with a a key on both would allow easy search and update. If
implemented as a service, some intelligent caching of common misspellings
could really improve the performance.

On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:


On Tue, Jan 6, 2009 at 5:15 PM, robert engels wrote:

It is definitely going to increase the index size, but not any more than
than the external one would (if my understanding is correct).
The nice thing is that you don't have to try and keep documents numbers in
sync - it will be automatic.

Maybe I don't understand what your external index is storing. Given that
the document contains 'robert' but the user enters' obert', what is the
process to find the matching documents?
heres a simple example. neighborhood stored for robert is 'robert obert
rbert roert ...' this is indexed in a tokenized field.

at query time user typoes robert and enters 'tobert'. again neighborhood is
generated 'tobert obert tbert ...'
the system does a query on tobert OR obert OR tbert ... and robert is
returned because 'obert' is present in both neighborhoods.
in this example, by storing k=1 deletions you guarantee to satisfy all edit
distance matches <= 1 without linear scan.
you get some false positives too with this approach, thats why what comes
back is only a CANDIDATE and true edit distance must be used to verify. this
might be tricky to do with your method, i don't know.



Is the external index essentially a constant list, that given obert, the
source words COULD BE robert, tobert, reobert etc., and it contains no
document information so:
no. see above, you generate all possible 1-character deletions of the index
term and store them, then at query time you generate all possible
1-character deletions of the query term. basically, LUCENE and LUBENE are 1
character different, but they are the same (LUENE) if you delete 1 character
from both of them. so you dont need to store LUCENE LUBENE LUDENE, you just
store LUENE.
given the source word X, and an edit distance k, you ask the external
dictionary for possible indexed words, and it returns the list, and then use
search lucene using each of those words?

If the above is the case, it certainly seems you could generate this list
in real-time rather efficiently with no IO (unless the external index only
stores words which HAVE BEEN indexed).

I think the confusion may be because I understand Otis's comments, but
they don't seem to match what you are stating.

Essentially performing any term match requires efficient
searching/matching of the term index. If this is efficient enough, I don't
think either process is needed - just an improved real-time fuzzy
possibilities word generator.

On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

i see, your idea would definitely simplify some things.

What about the index size difference between this approach and using
separate index? Would this separate field increase index size?

I guess my line of thinking is if you have 10 docs with robert, with
separate index you just have robert, and its deletion neighborhood one time.
with this approach you have the same thing, but at least you must have
document numbers and the other inverted index stuff with each neighborhood
term. would this be a significant change to size and/or performance? and
since the documents have multiple terms there is additional positional
information for slop factor for each neighborhood term...

i think its worth investigating, maybe performance would actually be
better, just curious. i think i boxed myself in to auxiliary index because
of some other irrelevant thigns i am doing.
On Tue, Jan 6, 2009 at 4:42 PM, robert engels wrote:

I don't think that is the case. You will have single deletion
neighborhood. The number of unique terms in the field is going to be the
union of the deletion dictionaries of each source term.
For example, given the following documents A which have field 'X' with
value best, and document B with value jest (and k == 1).

A will generate est bst, bet, bes, B will generate est, jest, jst, jes

so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)

I don't think the storage requirement is any greater doing it this way.


3.2.1 Indexing
For all words in a dictionary, and a given number of edit operations k,
FastSS
generates all variant spellings recursively and save them as tuples of
type
v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of
deletion
positions.

Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for
n
dictionary words of length m with k mismatches.


3.2.2 Retrieval
For a query p and edit distance k, first generate the neighborhood Ud (p,
k).
Then compare the words in the neighborhood with the index, and find
matching candidates. Compare deletion positions for each candidate with
the deletion positions in U(p, k), using Theorem 4.


--
Robert Muir
rcmuir@gmail.com


--
Robert Muir
rcmuir@gmail.com


--
Robert Muir
rcmuir@gmail.com

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 15 of 18 | next ›
Discussion Overview
groupdev @
categorieslucene
postedJan 6, '09 at 6:03p
activeFeb 19, '10 at 9:28p
posts18
users3
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase