FAQ
To clarify a statement in the last email.

To generate the 'possible source words' in real-time is not a
difficult as first seems, if you assume some sort of first character
prefix (which is what it appears google does).

For example, assume the user typed 'robrt' instead of 'robert'. You
see that this word has very low frequency (or none), so you want to
find possible misspellings, so do a fuzzy search starting with r. But
this search can be optimized, because as the edit/delete position
moves to the right, the prefix remains the same, so these
possibilities can be quickly skipped.

If you don't find any words with high enough frequency as possible
edit distances, try [a-z]robrt, assuming the user may have dropped
the first character (possibly try this in "know combination" order,
rather than alpha (i.e. try sr before nr).

For example, searching google for 'robrt engels' works. So does
'obert engels', so does 'robt engels', all ask me if I meant 'robert
engels', but searching for 'obrt engels' does not.
On Jan 6, 2009, at 4:15 PM, robert engels wrote:

It is definitely going to increase the index size, but not any more
than than the external one would (if my understanding is correct).

The nice thing is that you don't have to try and keep documents
numbers in sync - it will be automatic.

Maybe I don't understand what your external index is storing. Given
that the document contains 'robert' but the user enters' obert',
what is the process to find the matching documents?

Is the external index essentially a constant list, that given
obert, the source words COULD BE robert, tobert, reobert etc., and
it contains no document information so:

given the source word X, and an edit distance k, you ask the
external dictionary for possible indexed words, and it returns the
list, and then use search lucene using each of those words?

If the above is the case, it certainly seems you could generate
this list in real-time rather efficiently with no IO (unless the
external index only stores words which HAVE BEEN indexed).

I think the confusion may be because I understand Otis's comments,
but they don't seem to match what you are stating.

Essentially performing any term match requires efficient searching/
matching of the term index. If this is efficient enough, I don't
think either process is needed - just an improved real-time fuzzy
possibilities word generator.
On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

i see, your idea would definitely simplify some things.

What about the index size difference between this approach and
using separate index? Would this separate field increase index size?

I guess my line of thinking is if you have 10 docs with robert,
with separate index you just have robert, and its deletion
neighborhood one time. with this approach you have the same thing,
but at least you must have document numbers and the other inverted
index stuff with each neighborhood term. would this be a
significant change to size and/or performance? and since the
documents have multiple terms there is additional positional
information for slop factor for each neighborhood term...

i think its worth investigating, maybe performance would actually
be better, just curious. i think i boxed myself in to auxiliary
index because of some other irrelevant thigns i am doing.

On Tue, Jan 6, 2009 at 4:42 PM, robert engels
wrote:
I don't think that is the case. You will have single deletion
neighborhood. The number of unique terms in the field is going to
be the union of the deletion dictionaries of each source term.

For example, given the following documents A which have field 'X'
with value best, and document B with value jest (and k == 1).

A will generate est bst, bet, bes, B will generate est, jest, jst,
jes

so field FieldXFuzzy contains
(est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)

I don't think the storage requirement is any greater doing it this
way.


3.2.1 Indexing
For all words in a dictionary, and a given number of edit
operations k, FastSS
generates all variant spellings recursively and save them as
tuples of type
v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a
list of deletion
positions.

Theorem 5. Index uses O(nmk+1) space, as it stores al l the
variants for n
dictionary words of length m with k mismatches.


3.2.2 Retrieval
For a query p and edit distance k, first generate the neighborhood
Ud (p, k).
Then compare the words in the neighborhood with the index, and find
matching candidates. Compare deletion positions for each candidate
with
the deletion positions in U(p, k), using Theorem 4.





--
Robert Muir
rcmuir@gmail.com

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 12 of 18 | next ›
Discussion Overview
groupdev @
categorieslucene
postedJan 6, '09 at 6:03p
activeFeb 19, '10 at 9:28p
posts18
users3
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase