What about the index size difference between this approach and using

separate index? Would this separate field increase index size?

I guess my line of thinking is if you have 10 docs with robert, with

separate index you just have robert, and its deletion neighborhood one time.

with this approach you have the same thing, but at least you must have

document numbers and the other inverted index stuff with each neighborhood

term. would this be a significant change to size and/or performance? and

since the documents have multiple terms there is additional positional

information for slop factor for each neighborhood term...

i think its worth investigating, maybe performance would actually be better,

just curious. i think i boxed myself in to auxiliary index because of some

other irrelevant thigns i am doing.

On Tue, Jan 6, 2009 at 4:42 PM, robert engels wrote:

I don't think that is the case. You will have single deletion neighborhood.

The number of unique terms in the field is going to be the union of the

deletion dictionaries of each source term.

For example, given the following documents A which have field 'X' with

value best, and document B with value jest (and k == 1).

A will generate est bst, bet, bes, B will generate est, jest, jst, jes

so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)

I don't think the storage requirement is any greater doing it this way.

3.2.1 Indexing

For all words in a dictionary, and a given number of edit operations k,

FastSS

generates all variant spellings recursively and save them as tuples of type

v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of

deletion

positions.

Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for n

dictionary words of length m with k mismatches.

3.2.2 Retrieval

For a query p and edit distance k, first generate the neighborhood Ud (p,

k).

Then compare the words in the neighborhood with the index, and find

matching candidates. Compare deletion positions for each candidate with

the deletion positions in U(p, k), using Theorem 4.

I don't think that is the case. You will have single deletion neighborhood.

The number of unique terms in the field is going to be the union of the

deletion dictionaries of each source term.

For example, given the following documents A which have field 'X' with

value best, and document B with value jest (and k == 1).

A will generate est bst, bet, bes, B will generate est, jest, jst, jes

so field FieldXFuzzy contains (est:AB,bst:A,bet:A,bes:A,jest:B,jst:B,jes)

I don't think the storage requirement is any greater doing it this way.

3.2.1 Indexing

For all words in a dictionary, and a given number of edit operations k,

FastSS

generates all variant spellings recursively and save them as tuples of type

v′ ∈ Ud (v, k) → (v, x) where v is a dictionary word and x a list of

deletion

positions.

Theorem 5. Index uses O(nmk+1) space, as it stores al l the variants for n

dictionary words of length m with k mismatches.

3.2.2 Retrieval

For a query p and edit distance k, first generate the neighborhood Ud (p,

k).

Then compare the words in the neighborhood with the index, and find

matching candidates. Compare deletion positions for each candidate with

the deletion positions in U(p, k), using Theorem 4.

--

Robert Muir

rcmuir@gmail.com