On Tue, Jan 6, 2009 at 5:29 PM, robert engels wrote:
To clarify a statement in the last email.
Generating the 'possible source words' in real-time is not as difficult as it
first seems, if you assume some sort of first-character prefix (which is what
google appears to do).
For example, assume the user typed 'robrt' instead of 'robert'. You see
that this word has very low frequency (or none), so you want to find the
possible intended words, so you do a fuzzy search starting with 'r'. This
search can be optimized: as the edit/delete position moves to the right,
the prefix remains the same, so those possibilities can be skipped quickly.
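As a rough illustration (not the actual Lucene mechanics), here is a sketch of matching a misspelling against dictionary words that share its first character; the dictionary and the single-edit check are toy stand-ins for a real term index:

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insert/delete/substitute."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Skip the shared prefix, then compare the remainders once past
    # the first point of difference.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[i + 1:] == b[i + 1:] or a[i:] == b[i + 1:] or a[i + 1:] == b[i:]

dictionary = ["robert", "roberta", "rupert", "engels"]  # toy term index
query = "robrt"
# Only consider terms sharing the first-character prefix.
candidates = [w for w in dictionary
              if w[0] == query[0] and within_one_edit(w, query)]
print(candidates)  # ['robert']
```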
If you don't find any words with high enough frequency within the edit
distance, try [a-z]robrt, assuming the user may have dropped the first
character (possibly trying this in "known combination" order rather than
alphabetical order, i.e. try 'sr' before 'nr').
For example, searching google for 'robrt engels' works. So do 'obert
engels' and 'robt engels'; all ask me if I meant 'robert engels'. But
searching for 'obrt engels' does not.
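The dropped-first-character case can be sketched like this; the frequency-ordered letter list and the toy term frequencies are assumptions, not anything from Lucene itself:

```python
# Toy term index with document frequencies; real code would consult
# the Lucene term dictionary instead.
dictionary = {"robert": 10, "tobert": 1}

query = "obert"  # the user may have dropped the first character
# Try prepending each letter, ordered by rough English letter frequency
# (an assumed stand-in for "known combination" order) rather than a-z.
letter_order = "etaoinshrdlcumwfgypbvkjxqz"
restored = [c + query for c in letter_order if (c + query) in dictionary]
# Prefer higher-frequency terms, matching the "high enough frequency" test.
restored.sort(key=lambda w: -dictionary[w])
print(restored)  # ['robert', 'tobert']
```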
On Jan 6, 2009, at 4:15 PM, robert engels wrote:
It is definitely going to increase the index size, but not any more than
the external one would (if my understanding is correct).
The nice thing is that you don't have to try to keep document numbers in
sync - it will be automatic.
Maybe I don't understand what your external index is storing. Given that
the document contains 'robert' but the user enters 'obert', what is the
process to find the matching documents?
Is the external index essentially a constant list, such that given 'obert'
the source words COULD BE robert, tobert, reobert, etc., and it contains no
document information? So: given the source word X and an edit distance k,
you ask the external dictionary for possible indexed words, it returns the
list, and you then search Lucene using each of those words?
If the above is the case, it certainly seems you could generate this list
in real-time rather efficiently with no IO (unless the external index only
stores words which HAVE BEEN indexed).
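If that reading is right, the two-step process looks roughly like this; the variant-to-source-word mapping here is a hypothetical stand-in for whatever the external index actually stores:

```python
# Hypothetical external index: deletion variant -> possible source words.
# It stores no document information; Lucene would be queried per candidate.
external_index = {
    "obert": ["robert", "tobert"],
    "robrt": ["robert"],
}

def candidate_terms(word, index):
    """Step 1: ask the external dictionary for possible indexed words."""
    return index.get(word, [])

# Step 2 (not shown): run an ordinary Lucene TermQuery for each candidate.
print(candidate_terms("obert", external_index))  # ['robert', 'tobert']
```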
I think the confusion may be because I understand Otis's comments, but they
don't seem to match what you are stating.
Essentially performing any term match requires efficient searching/matching
of the term index. If this is efficient enough, I don't think either process
is needed - just an improved real-time fuzzy possibilities word generator.
On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:
i see, your idea would definitely simplify some things.
What about the index size difference between this approach and using a
separate index? Would this separate field increase the index size?
I guess my line of thinking is: if you have 10 docs with robert, with a
separate index you just have robert, and its deletion neighborhood one time.
with this approach you have the same thing, but you must also have
document numbers and the other inverted index stuff with each neighborhood
term. would this be a significant change to size and/or performance? and
since the documents have multiple terms, there is additional positional
information for the slop factor with each neighborhood term...
i think it's worth investigating, maybe performance would actually be
better, just curious. i think i boxed myself into an auxiliary index because
of some other irrelevant things i am doing.
On Tue, Jan 6, 2009 at 4:42 PM, robert engels wrote:
I don't think that is the case. You will have a single deletion
neighborhood. The set of unique terms in the field is going to be the
union of the deletion neighborhoods of the source terms.
For example, take document A, which has field 'X' with value best, and
document B with value jest (and k == 1).
A will generate best, est, bst, bet, bes; B will generate jest, est, jst,
jet, jes.
So field FieldXFuzzy contains (best:A, est:AB, bst:A, bet:A, bes:A,
jest:B, jst:B, jet:B, jes:B).
I don't think the storage requirement is any greater doing it this way.
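The best/jest field contents can be reproduced with a small sketch; including the original term in its own neighborhood is an assumption about the scheme:

```python
from collections import defaultdict

def deletion_neighborhood(word):
    """The word itself plus every single-character deletion (k = 1)."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

# Build the FieldXFuzzy postings from the two example documents.
postings = defaultdict(set)
for doc, term in [("A", "best"), ("B", "jest")]:
    for variant in deletion_neighborhood(term):
        postings[variant].add(doc)

# 'est' is shared by both documents; every other variant maps to one.
for variant in sorted(postings):
    print(variant, "".join(sorted(postings[variant])))
```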
For all words in a dictionary, and a given number of edit operations k,
generate all variant spellings recursively and save them as tuples:
v′ ∈ Ud(v, k) → (v, x), where v is a dictionary word and x a list of
deletion positions.
Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants
for the n dictionary words of length m with k mismatches.
For a query p and edit distance k, first generate the neighborhood Ud(p, k).
Then compare the words in the neighborhood with the index, and find
matching candidates. Compare the deletion positions for each candidate with
the deletion positions in Ud(p, k), using Theorem 4.
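The (v, x) tuples above can be sketched as follows; this covers only neighborhood generation with recorded deletion positions, not the Theorem 4 position comparison:

```python
from itertools import combinations

def variants_with_positions(word, k):
    """All strings obtained by deleting up to k characters from word,
    paired with the tuple of deleted positions: the (v, x) tuples."""
    out = []
    for d in range(k + 1):
        for pos in combinations(range(len(word)), d):
            v = "".join(c for i, c in enumerate(word) if i not in pos)
            out.append((v, pos))
    return out

print(variants_with_positions("jest", 1))
# [('jest', ()), ('est', (0,)), ('jst', (1,)), ('jet', (2,)), ('jes', (3,))]
```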