fastss fuzzyquery
-----------------

Key: LUCENE-1513
URL: https://issues.apache.org/jira/browse/LUCENE-1513
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Robert Muir
Priority: Minor


Code for doing fuzzy queries with the FastSSwC algorithm.

FuzzyIndexer: given a Lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
FastFuzzyQuery: similar to FuzzyQuery, except it queries the auxiliary index to retrieve a candidate list; this list is then verified with the Levenshtein algorithm.

Sorry, but the code is a bit messy... what I'm actually using is very different from this, so it's pretty much untested. But at least you can see what's going on, or fix it up.
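
As a rough illustration of the verification step described above (this is not the attached code; the class and method names are invented), candidates retrieved from the auxiliary index could be filtered with a plain Levenshtein computation:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the attached patch: keeps only the candidate
// terms from the auxiliary index that are within true edit distance k of the query term.
public class CandidateVerifier {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Keep only candidates whose true edit distance to the query term is <= k.
    static List<String> verify(String queryTerm, List<String> candidates, int k) {
        List<String> accepted = new ArrayList<String>();
        for (String candidate : candidates) {
            if (levenshtein(queryTerm, candidate) <= k) accepted.add(candidate);
        }
        return accepted;
    }
}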



  • Robert Muir (JIRA) at Jan 6, 2009 at 6:05 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-1513:
    --------------------------------

    Attachment: fastSSfuzzy.zip
  • Otis Gospodnetic (JIRA) at Jan 6, 2009 at 7:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661286#action_12661286 ]

    Otis Gospodnetic commented on LUCENE-1513:
    ------------------------------------------

    References provided by Glen Newton:

    - Fast Similarity Search in Large Dictionaries: http://fastss.csg.uzh.ch/
    - Paper: Fast Similarity Search in Large Dictionaries. http://fastss.csg.uzh.ch/ifi-2007.02.pdf
    - FastSimilarSearch.java: http://fastss.csg.uzh.ch/FastSimilarSearch.java
    - Paper: Fast Similarity Search in Peer-to-Peer Networks. http://www.globis.ethz.ch/script/publication/download?docid=506

  • Otis Gospodnetic (JIRA) at Jan 6, 2009 at 8:40 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661302#action_12661302 ]

    Otis Gospodnetic commented on LUCENE-1513:
    ------------------------------------------

    I feel like I missed some FastSS discussion on the list... was there one?

    I took a quick look at the paper and the code. Is the following the general idea?
    # Index "fuzzy"/"misspelled" terms in addition to the normal terms (=> larger index, slower indexing). How much fuzziness one wants to allow or handle is decided at index time.
    # Rewrite the query to include variations/misspellings of each term and use that to search (=> more clauses, slower than a normal search, but faster than the "normal" fuzzy query, whose speed depends on the number of indexed terms). A rough sketch of this rewrite step follows below.
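
    For illustration only (not from the attached patch; the field name and helper are invented), the rewrite step might build a BooleanQuery of SHOULD clauses over the query term's k=1 deletion neighborhood, using the Lucene query API of that era:

    import java.util.LinkedHashSet;
    import java.util.Set;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Hypothetical sketch of the query-rewrite step: expand the query term into its
    // k=1 deletion neighborhood and OR the variants together against the fuzzy field.
    public class NeighborhoodQueryBuilder {

        // The query term itself plus every string obtained by deleting one character.
        static Set<String> neighborhood(String term) {
            Set<String> variants = new LinkedHashSet<String>();
            variants.add(term);
            for (int i = 0; i < term.length(); i++) {
                variants.add(term.substring(0, i) + term.substring(i + 1));
            }
            return variants;
        }

        static BooleanQuery rewrite(String fuzzyField, String queryTerm) {
            BooleanQuery query = new BooleanQuery();
            for (String variant : neighborhood(queryTerm)) {
                query.add(new TermQuery(new Term(fuzzyField, variant)), BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }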

    Quick code comments:
    * Need to add the ASL header
    * Need to replace tabs with 2 spaces and fix the formatting in FuzzyHitCollector
    * No @author tags
    * Unit test if possible
    * Should FastSSwC not be able to take a variable K?
    * Should variables named after types (e.g. "set" in public static String getNeighborhoodString(Set<String> set)) be renamed so they describe what's in them instead? (easier-to-understand API?)

  • Robert Muir (JIRA) at Jan 6, 2009 at 8:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314 ]

    Robert Muir commented on LUCENE-1513:
    -------------------------------------

    Otis, the discussion was on java-user.

    Again, I apologize for the messy code. As mentioned there, my setup is very specific to exactly what I am doing, and in no way is this code ready. But since I'm currently pretty busy with other things at work, I just wanted to put something up here for anyone else interested.

    There are the issues you mentioned, and also some I mentioned on java-user: for example, how to handle updates to indexes that introduce new terms (they must be added to the auxiliary index), or even whether an auxiliary index is the best approach at all.

    The general idea is that instead of enumerating terms, the deletion neighborhood described in the paper is used. This way search time is not linear in the number of terms. Yes, you are correct that it can only guarantee edit distances up to K, which is determined at index time. Perhaps this should be configurable, but I hardcoded k=1 for simplicity; I think that covers something like 80% of typos...

    As I mentioned on the list, another idea is that you could implement FastSS (not the wC variant) with deletion positions, maybe by using payloads. This would require more space but eliminate the candidate verification step. Maybe it would be nice to have some of their other algorithms, such as the block-based one, etc., available also.
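
    To illustrate the configurable-K idea (invented names; this is not in the attached zip), the deletion neighborhood for an arbitrary k can be generated recursively:

    import java.util.LinkedHashSet;
    import java.util.Set;

    // Hypothetical sketch: every variant of a term obtainable by deleting up to k characters.
    // k = 1 is what the attached code hardcodes; a larger k trades index size for recall.
    public class NeighborhoodGenerator {

        static Set<String> neighborhood(String term, int k) {
            Set<String> variants = new LinkedHashSet<String>();
            collect(term, k, variants);
            return variants;
        }

        private static void collect(String s, int k, Set<String> out) {
            if (!out.add(s) || k == 0) return; // stop on duplicates or when no deletions remain
            for (int i = 0; i < s.length(); i++) {
                collect(s.substring(0, i) + s.substring(i + 1), k - 1, out);
            }
        }

        public static void main(String[] args) {
            // [robert, obert, rbert, roert, robrt, robet, rober]
            System.out.println(neighborhood("robert", 1));
        }
    }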


  • Robert engels at Jan 6, 2009 at 9:03 pm
    Why not just create a new field for this? That is, if you have
    FieldA, create field FieldAFuzzy and put the various permutations there.

    The fuzzy scorer/parser can be changed to automatically use the
    XXXXFuzzy field when required.

    You could also store positions, and arrange that the first term is the
    "closest", the next is the second closest, etc., to add support for a slop
    factor.

    This is similar to the way fast phonetic searches can be implemented.

    If you do it this way, you don't have any of the synchronization
    issues between the index and the external "fuzzy" index.
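
    A minimal sketch of what building such a per-document fuzzy field could look like (the field name and helper are illustrative, not from any patch): for each token of FieldA, emit the token plus its 1-deletion variants into a companion FieldAFuzzy value, which would then be indexed as a tokenized, unstored field alongside the original.

    import java.util.LinkedHashSet;
    import java.util.Set;

    // Illustrative only: builds the text for a hypothetical "FieldAFuzzy" companion field
    // by expanding every token of the original field into its k=1 deletion neighborhood.
    public class FuzzyFieldBuilder {

        static String buildFuzzyFieldValue(String[] tokens) {
            Set<String> variants = new LinkedHashSet<String>();
            for (String token : tokens) {
                variants.add(token);                       // 0 deletions: the term itself
                for (int i = 0; i < token.length(); i++) { // one deletion at each position
                    variants.add(token.substring(0, i) + token.substring(i + 1));
                }
            }
            StringBuilder sb = new StringBuilder();
            for (String v : variants) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(v);
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Example: a field whose only token is "best", and one whose only token is "jest".
            System.out.println(buildFuzzyFieldValue(new String[] {"best"})); // best est bst bet bes
            System.out.println(buildFuzzyFieldValue(new String[] {"jest"})); // jest est jst jet jes
        }
    }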

  • Robert Muir at Jan 6, 2009 at 9:07 pm
    A deletion neighborhood can be pretty large (for example, robert expands to something
    like robert obert rbert robrt robet ...), so if you have 100 million docs with 1 billion
    words but only 100k unique terms, it definitely would be wasteful to index 1 billion
    deletion neighborhoods when you only need 100k.
  • Robert engels at Jan 6, 2009 at 9:42 pm
    I don't think that is the case. You will have a single deletion
    neighborhood: the number of unique terms in the field is going to be
    the union of the deletion dictionaries of each source term.

    For example, given document A with field 'X' containing the value best,
    and document B containing the value jest (and k == 1):

    A will generate best, est, bst, bet, bes; B will generate jest, est, jst, jet, jes

    so field FieldXFuzzy contains
    (best:A, est:AB, bst:A, bet:A, bes:A, jest:B, jst:B, jet:B, jes:B)

    I don't think the storage requirement is any greater doing it this way.


    3.2.1 Indexing
    For all words in a dictionary, and a given number of edit operations k, FastSS
    generates all variant spellings recursively and saves them as tuples of type
    v′ ∈ Ud(v, k) → (v, x), where v is a dictionary word and x is a list of
    deletion positions.

    Theorem 5. The index uses O(n * m^(k+1)) space, as it stores all the variants
    for n dictionary words of length m with k mismatches.

    3.2.2 Retrieval
    For a query p and edit distance k, first generate the neighborhood Ud(p, k).
    Then compare the words in the neighborhood with the index, and find
    matching candidates. Compare deletion positions for each candidate with
    the deletion positions in Ud(p, k), using Theorem 4.
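
    A small sketch of the indexing step described above (nothing here is from the attached code): recursively generate up to k deletions of each dictionary word, recording the deletion positions x so that candidates can later be checked by comparing positions rather than by running a full edit-distance computation.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative FastSS-style neighborhood generation with deletion positions:
    // each variant v' of a word v is recorded together with the list of positions x
    // (indices into the progressively shortened string) at which characters were deleted.
    public class DeletionNeighborhoodWithPositions {

        // Maps variant -> all deletion-position lists that produce it from 'word'.
        static Map<String, List<List<Integer>>> generate(String word, int k) {
            Map<String, List<List<Integer>>> result = new LinkedHashMap<String, List<List<Integer>>>();
            recurse(word, k, new ArrayList<Integer>(), result);
            return result;
        }

        private static void recurse(String s, int k, List<Integer> positions,
                                    Map<String, List<List<Integer>>> result) {
            List<List<Integer>> lists = result.get(s);
            if (lists == null) {
                lists = new ArrayList<List<Integer>>();
                result.put(s, lists);
            }
            lists.add(new ArrayList<Integer>(positions)); // record how this variant was produced
            if (k == 0) return;
            for (int i = 0; i < s.length(); i++) {
                positions.add(i);
                recurse(s.substring(0, i) + s.substring(i + 1), k - 1, positions, result);
                positions.remove(positions.size() - 1);
            }
        }

        public static void main(String[] args) {
            // {jest=[[]], est=[[0]], jst=[[1]], jet=[[2]], jes=[[3]]}
            System.out.println(generate("jest", 1));
        }
    }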
  • Robert Muir at Jan 6, 2009 at 9:59 pm
    I see, your idea would definitely simplify some things.

    What about the index size difference between this approach and using a
    separate index? Would this separate field increase index size?

    I guess my line of thinking is: if you have 10 docs with robert, with a
    separate index you just have robert and its deletion neighborhood one time.
    With this approach you have the same thing, but you must also have
    document numbers and the other inverted-index data with each neighborhood
    term. Would this be a significant change to size and/or performance? And
    since the documents have multiple terms, there is additional positional
    information (for the slop factor) for each neighborhood term...

    I think it's worth investigating; maybe performance would actually be better,
    just curious. I think I boxed myself into an auxiliary index because of some
    other irrelevant things I am doing.
  • Robert engels at Jan 6, 2009 at 10:15 pm
    It is definitely going to increase the index size, but not any more
    than the external one would (if my understanding is correct).

    The nice thing is that you don't have to try and keep document
    numbers in sync - it will be automatic.

    Maybe I don't understand what your external index is storing. Given
    that the document contains 'robert' but the user enters 'obert', what
    is the process to find the matching documents?

    Is the external index essentially a constant list, such that given obert,
    the source words COULD BE robert, tobert, reobert etc., and it
    contains no document information? So:

    given the source word X and an edit distance k, you ask the external
    dictionary for possible indexed words, it returns the list, and
    you then search Lucene using each of those words?

    If the above is the case, it certainly seems you could generate this
    list in real-time rather efficiently with no IO (unless the external
    index only stores words which HAVE BEEN indexed).

    I think the confusion may be because I understand Otis's comments,
    but they don't seem to match what you are stating.

    Essentially, performing any term match requires efficient searching/
    matching of the term index. If this is efficient enough, I don't
    think either process is needed - just an improved real-time fuzzy
    possibilities word generator.
  • Robert Muir at Jan 6, 2009 at 10:29 pm

    On Tue, Jan 6, 2009 at 5:15 PM, robert engels wrote:

    Maybe I don't understand what your external index is storing. Given that
    the document contains 'robert' but the user enters 'obert', what is the
    process to find the matching documents?

    Here's a simple example. The neighborhood stored for robert is 'robert obert
    rbert roert ...'; this is indexed in a tokenized field.

    At query time the user means robert but mistypes it as 'tobert'. Again a
    neighborhood is generated: 'tobert obert tbert ...'
    The system does a query on tobert OR obert OR tbert ... and robert is
    returned because 'obert' is present in both neighborhoods.
    In this example, by storing k=1 deletions you guarantee to satisfy all edit
    distance matches <= 1 without a linear scan.
    You get some false positives too with this approach; that's why what comes
    back is only a CANDIDATE and true edit distance must be used to verify. This
    might be tricky to do with your method, I don't know.

    Is the external index essentially a constant list, such that given obert, the
    source words COULD BE robert, tobert, reobert etc., and it contains no
    document information? So: given the source word X and an edit distance k, you
    ask the external dictionary for possible indexed words, it returns the list,
    and you then search Lucene using each of those words?

    No. See above: you generate all possible 1-character deletions of the index
    term and store them, then at query time you generate all possible
    1-character deletions of the query term. Basically, LUCENE and LUBENE are 1
    character apart, but they are the same (LUENE) if you delete 1 character
    from each of them. So you don't need to store LUCENE, LUBENE, LUDENE; you just
    store LUENE.

  • Robert engels at Jan 6, 2009 at 11:26 pm
    I understand now.

    The index in my case would definitely be MUCH larger, but I think it
    would perform better, as you only need to do a single search - for
    obert (if you assume it was a misspelling).

    In your case you would eventually do an OR search in the Lucene index
    for all possible matches (robert, roberta, roberto, ...), which could
    be much larger with some commonly prefixed/postfixed words.

    Classic performance vs. size trade-off. In your case where it is not
    for misspellings, the performance difference might be worthwhile.

    Still, in your case, I am not sure using a Lucene index as the
    external index is appropriate. Maybe a simple BTREE (Derby?) index of
    (word, edit permutation) with a key on both would allow easy search
    and update. If implemented as a service, some intelligent caching of
    common misspellings could really improve the performance.
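
    A toy stand-in for the external dictionary sketched here (names invented; a real implementation might live in Derby or another BTREE-backed store): map each deletion variant to the indexed words that produce it, and each word to its variants, so that both lookups and updates stay cheap.

    import java.util.Set;
    import java.util.TreeMap;
    import java.util.TreeSet;

    // Toy in-memory stand-in for the (word, edit permutation) dictionary discussed above.
    // A production version could be a BTREE-backed table keyed on both columns.
    public class FuzzyDictionary {

        private final TreeMap<String, Set<String>> variantToWords = new TreeMap<String, Set<String>>();
        private final TreeMap<String, Set<String>> wordToVariants = new TreeMap<String, Set<String>>();

        // Register an indexed word together with one of its deletion variants.
        public void add(String word, String variant) {
            put(variantToWords, variant, word);
            put(wordToVariants, word, variant);
        }

        // Given a variant generated from the query term, return indexed words that could match.
        public Set<String> candidates(String variant) {
            Set<String> words = variantToWords.get(variant);
            return words != null ? words : new TreeSet<String>();
        }

        // Remove a word, e.g. when it disappears from the main index.
        public void remove(String word) {
            Set<String> variants = wordToVariants.remove(word);
            if (variants == null) return;
            for (String v : variants) {
                Set<String> words = variantToWords.get(v);
                if (words != null) {
                    words.remove(word);
                    if (words.isEmpty()) variantToWords.remove(v);
                }
            }
        }

        private static void put(TreeMap<String, Set<String>> map, String key, String value) {
            Set<String> set = map.get(key);
            if (set == null) {
                set = new TreeSet<String>();
                map.put(key, set);
            }
            set.add(value);
        }
    }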
  • Robert Muir at Jan 6, 2009 at 11:45 pm
    Robert, there's only one problem I see: I don't see how you can do a single
    search, since FastSSwC returns some false positives (with k=1 it will still
    return some things with an edit distance of 2; for example, 'stop' and 'tops'
    share the 1-deletion variant 'top' but are edit distance 2 apart). Maybe if
    you store the deletion position information as a payload (thus using the
    original FastSS, where there are no false positives) it would work, though. I
    looked at storing position information, but it appeared like it might be
    complex, and the API was (is) still marked experimental, so I didn't go that
    route.

    I also agree a Lucene index might not be the best possible data structure...
    just convenient, that's all. I used it because I store other things related to
    the term besides deletion neighborhoods for my fuzzy matching.

    I guess I'll also mention that I do think storage size should be a big
    consideration. You really don't need this kind of stuff unless you are
    searching pretty big indexes in the first place (for <= a few million docs the
    default fuzzy is probably just fine for a lot of people).

    For me, the whole thing was about turning 30-second queries into 1-second
    queries by removing a linear algorithm; I didn't really optimize much beyond
    that because I was just very happy to have reasonable performance.
  • Robert engels at Jan 7, 2009 at 1:04 am
    I think you would need to store the position in the stream using a position
    equal to the k factor. Pretty straightforward, both for indexing and for
    searching.

    I think if you want the utmost in performance this is the way to go.

    If you don't want to store all of the additional data, I still think a
    better fuzzy search can be done without the external index entirely. As I
    see it, the external index's sole purpose (in your case) is to provide
    "indexed" words at a certain edit distance given a certain source word. That
    could be done with a combination of inverting the algorithm and a binary
    selective search on the term index.
  • Robert Muir at Jan 7, 2009 at 1:30 am
    For the k=1 case, in my mind your last comment might not really be that much
    slower than storing the additional data... sounds worth investigating.
    On Tue, Jan 6, 2009 at 8:04 PM, robert engels wrote:

    I think you would need to store the position in the stream using position
    == to the k factor. Pretty straightforward, both for indexing and for
    searching.
    I think if you want the utmost in performance this is the way to go.

    If you don't want to store all of the additional data, I still think a
    better fuzzy search can be done without the external index entirely. As I
    see it, the external index's sole purpose (in your case) is to provide
    "indexed" words at a certain edit distance given a certain source word.
    Using a combination of inverting the alg, and a binary selective search on
    the term index.

    On Jan 6, 2009, at 5:44 PM, Robert Muir wrote:

    robert theres only one problem i see: i don't see how you can do a single
    search since fastssWC returns some false positives (with k=1 it will still
    return some things with ED of 2). maybe if you store the deletion position
    information as a payload (thus using original fastss where there are no
    false positives) it would work though. i looked at storing position
    information but it appeared like it might be complex and the api was (is)
    still marked experimental so i didn't go that route.

    i also agree lucene index might not be the best possible data structure...
    just convenient thats all. i used it because i store other things related to
    the term besides deletion neighborhoods for my fuzzy matching.

    i guess i'll also mention that i do think storage size should be a big
    consideration. you really don't need this kind of stuff unless you are
    searching pretty big indexes in the first place (for <= few million docs the
    default fuzzy is probably just fine for a lot of people).

    for me, the whole thing was about turning 30second queries into 1 second
    queries by removing a linear algorithm, i didn't really optimize much beyond
    that because i was just very happy to have reasonable performance..
    On Tue, Jan 6, 2009 at 6:26 PM, robert engels wrote:

    I understand now.
    The index in my case would definitely be MUCH larger, but I think it would
    perform better, as you only need to do a single search - for obert (if you
    assume it was a misspelling).

    In your case you would eventually do an OR search in the lucene index for
    all possible matches (robert, roberta, roberto, ...) which could be much
    larger with some commonly prefixed/postfixed words).

    Classic performance vs. size trade-off. In your case where it is not for
    misspellings, the performance difference might be worthwhile.

    Still, in your case, I am not sure using a Lucene index as the external
    index is appropriate. Maybe a simple BTREE (Derby?) index of (word,edit
    permutation) with a a key on both would allow easy search and update. If
    implemented as a service, some intelligent caching of common misspellings
    could really improve the performance.

    On Jan 6, 2009, at 4:29 PM, Robert Muir wrote:


    On Tue, Jan 6, 2009 at 5:15 PM, robert engels wrote:

    It is definitely going to increase the index size, but not any more than
    than the external one would (if my understanding is correct).
    The nice thing is that you don't have to try and keep documents numbers
    in sync - it will be automatic.

    Maybe I don't understand what your external index is storing. Given that
    the document contains 'robert' but the user enters' obert', what is the
    process to find the matching documents?
    heres a simple example. neighborhood stored for robert is 'robert obert
    rbert roert ...' this is indexed in a tokenized field.

    at query time user typoes robert and enters 'tobert'. again neighborhood
    is generated 'tobert obert tbert ...'
    the system does a query on tobert OR obert OR tbert ... and robert is
    returned because 'obert' is present in both neighborhoods.
    in this example, by storing k=1 deletions you guarantee to satisfy all
    edit distance matches <= 1 without linear scan.
    you get some false positives too with this approach, thats why what comes
    back is only a CANDIDATE and true edit distance must be used to verify. this
    might be tricky to do with your method, i don't know.



    Is the external index essentially a constant list, that given obert, the
    source words COULD BE robert, tobert, reobert etc., and it contains no
    document information so:
    no. see above, you generate all possible 1-character deletions of the
    index term and store them, then at query time you generate all possible
    1-character deletions of the query term. basically, LUCENE and LUBENE are 1
    character different, but they are the same (LUENE) if you delete 1 character
    from both of them. so you don't need to store LUCENE LUBENE LUDENE, you just
    store LUENE.
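
    (to make the shared-deletion idea concrete, a minimal sketch of k=1
    neighborhood generation - illustrative only, not the attached code:)

        import java.util.LinkedHashSet;
        import java.util.Set;

        class DeletionNeighborhood {
            /** The term itself plus every string obtained by deleting exactly one character (k = 1). */
            static Set<String> k1(String term) {
                Set<String> out = new LinkedHashSet<String>();
                out.add(term);
                for (int i = 0; i < term.length(); i++) {
                    out.add(term.substring(0, i) + term.substring(i + 1));
                }
                return out;
            }

            public static void main(String[] args) {
                // LUCENE and LUBENE differ by one substitution; both neighborhoods contain LUENE.
                System.out.println(k1("LUCENE").contains("LUENE")); // true
                System.out.println(k1("LUBENE").contains("LUENE")); // true
                // robert vs. the typo tobert: the neighborhoods share exactly one variant.
                Set<String> shared = k1("robert");
                shared.retainAll(k1("tobert"));
                System.out.println(shared); // [obert]
            }
        }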
    given the source word X, and an edit distance k, you ask the external
    dictionary for possible indexed words, and it returns the list, and then you
    search lucene using each of those words?

    If the above is the case, it certainly seems you could generate this list
    in real-time rather efficiently with no IO (unless the external index only
    stores words which HAVE BEEN indexed).

    I think the confusion may be because I understand Otis's comments, but
    they don't seem to match what you are stating.

    Essentially performing any term match requires efficient
    searching/matching of the term index. If this is efficient enough, I don't
    think either process is needed - just an improved real-time fuzzy
    possibilities word generator.

    On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

    i see, your idea would definitely simplify some things.

    What about the index size difference between this approach and using
    separate index? Would this separate field increase index size?

    I guess my line of thinking is if you have 10 docs with robert, with
    separate index you just have robert, and its deletion neighborhood one time.
    with this approach you have the same thing, but at least you must have
    document numbers and the other inverted index stuff with each neighborhood
    term. would this be a significant change to size and/or performance? and
    since the documents have multiple terms there is additional positional
    information for slop factor for each neighborhood term...

    i think it's worth investigating, maybe performance would actually be
    better, just curious. i think i boxed myself into an auxiliary index because
    of some other irrelevant things i am doing.
    On Tue, Jan 6, 2009 at 4:42 PM, robert engels wrote:

    I don't think that is the case. You will have a single deletion
    neighborhood. The number of unique terms in the field is going to be the
    union of the deletion dictionaries of each source term.
    For example, given the following documents: A, which has field 'X' with
    value best, and B, which has value jest (and k == 1).

    A will generate best, est, bst, bet, bes; B will generate jest, est, jst, jet, jes

    so field FieldXFuzzy contains
    (best:A, est:AB, bst:A, bet:A, bes:A, jest:B, jst:B, jet:B, jes:B)

    I don't think the storage requirement is any greater doing it this way.
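
    (a rough sketch of indexing such a per-document fuzzy field; the XFuzzy
    field name follows the example above, the Field flags are just one
    plausible choice for the Lucene 2.x API, and this is not code from the
    attachment:)

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;

        class FuzzyFieldExample {
            /** Space-separated k=1 deletion neighborhood of one term, e.g. "best est bst bet bes". */
            static String neighborhoodTokens(String term) {
                StringBuilder sb = new StringBuilder(term);
                for (int i = 0; i < term.length(); i++) {
                    sb.append(' ').append(term, 0, i).append(term, i + 1, term.length());
                }
                return sb.toString();
            }

            /** Document A: field X = "best"; its neighborhood goes into a parallel tokenized field. */
            static Document docA() {
                Document doc = new Document();
                doc.add(new Field("X", "best", Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("XFuzzy", neighborhoodTokens("best"), Field.Store.NO, Field.Index.ANALYZED));
                return doc;
            }
        }

    (for multi-term fields the same expansion would presumably have to happen
    per token, e.g. in a custom TokenFilter, which is where the extra postings
    and position information discussed above come from.)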


    3.2.1 Indexing
    For all words in a dictionary, and a given number of edit operations k, FastSS
    generates all variant spellings recursively and saves them as tuples of type
    v′ ∈ Ud(v, k) → (v, x), where v is a dictionary word and x is a list of
    deletion positions.

    Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants
    for n dictionary words of length m with k mismatches.


    3.2.2 Retrieval
    For a query p and edit distance k, first generate the neighborhood Ud(p, k).
    Then compare the words in the neighborhood with the index, and find
    matching candidates. Compare deletion positions for each candidate with
    the deletion positions in Ud(p, k), using Theorem 4.
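
    (a small sketch of what generating Ud(v, k) with deletion positions could
    look like for general k - names and structure are illustrative, not the
    paper's or the attachment's code; for brevity it keeps only one position
    list per variant:)

        import java.util.ArrayList;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;

        class FastSSNeighborhood {
            /** Ud(v, k): every variant reachable by up to k single-character deletions,
                mapped to the deletion positions that produced it. */
            static Map<String, List<Integer>> ud(String v, int k) {
                Map<String, List<Integer>> out = new LinkedHashMap<String, List<Integer>>();
                recurse(v, k, new ArrayList<Integer>(), out);
                return out;
            }

            private static void recurse(String v, int k, List<Integer> positions,
                                        Map<String, List<Integer>> out) {
                if (!out.containsKey(v)) {
                    out.put(v, new ArrayList<Integer>(positions));
                }
                if (k == 0) return;
                for (int i = 0; i < v.length(); i++) {
                    positions.add(i);   // position is relative to the already-shortened string
                    recurse(v.substring(0, i) + v.substring(i + 1), k - 1, positions, out);
                    positions.remove(positions.size() - 1);
                }
            }

            public static void main(String[] args) {
                // For "jest" with k = 1: {jest=[], est=[0], jst=[1], jet=[2], jes=[3]}
                System.out.println(ud("jest", 1));
            }
        }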


    --
    Robert Muir
    rcmuir@gmail.com


    --
    Robert Muir
    rcmuir@gmail.com


    --
    Robert Muir
    rcmuir@gmail.com


    --
    Robert Muir
    rcmuir@gmail.com
  • Robert engels at Jan 6, 2009 at 10:30 pm
    To clarify a statement in the last email.

    To generate the 'possible source words' in real-time is not as
    difficult as it first seems, if you assume some sort of first-character
    prefix (which is what it appears google does).

    For example, assume the user typed 'robrt' instead of 'robert'. You
    see that this word has very low frequency (or none), so you want to
    find possible misspellings, so do a fuzzy search starting with r. But
    this search can be optimized, because as the edit/delete position
    moves to the right, the prefix remains the same, so these
    possibilities can be quickly skipped.

    If you don't find any words with high enough frequency as possible
    edit distances, try [a-z]robrt, assuming the user may have dropped
    the first character (possibly try this in "known combination" order,
    rather than alphabetical order, i.e. try sr before nr).
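
    (a tiny illustration of that dropped-first-character fallback; the
    frequency check and the "known combination" ordering are left out, and
    the names are made up:)

        import java.util.ArrayList;
        import java.util.List;

        class FirstCharFallback {
            /** Candidate respellings of a low-frequency query term, assuming its first character was dropped. */
            static List<String> prependCandidates(String query) {
                List<String> out = new ArrayList<String>();
                for (char c = 'a'; c <= 'z'; c++) {
                    out.add(c + query);   // "arobrt", "brobrt", ..., "zrobrt"
                }
                return out;
            }

            public static void main(String[] args) {
                // each candidate would then be checked against the term dictionary / frequencies,
                // ideally in likely-combination order rather than alphabetically
                System.out.println(prependCandidates("robrt").size()); // 26
            }
        }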

    For example, searching google for 'robrt engels' works. So does
    'obert engels', so does 'robt engels', all ask me if I meant 'robert
    engels', but searching for 'obrt engels' does not.
    On Jan 6, 2009, at 4:15 PM, robert engels wrote:

    It is definitely going to increase the index size, but not any more
    than the external one would (if my understanding is correct).

    The nice thing is that you don't have to try and keep document
    numbers in sync - it will be automatic.

    Maybe I don't understand what your external index is storing. Given
    that the document contains 'robert' but the user enters 'obert',
    what is the process to find the matching documents?

    Is the external index essentially a constant list, that given
    obert, the source words COULD BE robert, tobert, reobert etc., and
    it contains no document information so:

    given the source word X, and an edit distance k, you ask the
    external dictionary for possible indexed words, and it returns the
    list, and then you search lucene using each of those words?

    If the above is the case, it certainly seems you could generate
    this list in real-time rather efficiently with no IO (unless the
    external index only stores words which HAVE BEEN indexed).

    I think the confusion may be because I understand Otis's comments,
    but they don't seem to match what you are stating.

    Essentially performing any term match requires efficient searching/
    matching of the term index. If this is efficient enough, I don't
    think either process is needed - just an improved real-time fuzzy
    possibilities word generator.
    On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

    i see, your idea would definitely simplify some things.

    What about the index size difference between this approach and
    using separate index? Would this separate field increase index size?

    I guess my line of thinking is if you have 10 docs with robert,
    with separate index you just have robert, and its deletion
    neighborhood one time. with this approach you have the same thing,
    but at least you must have document numbers and the other inverted
    index stuff with each neighborhood term. would this be a
    significant change to size and/or performance? and since the
    documents have multiple terms there is additional positional
    information for slop factor for each neighborhood term...

    i think it's worth investigating, maybe performance would actually
    be better, just curious. i think i boxed myself into an auxiliary
    index because of some other irrelevant things i am doing.

    On Tue, Jan 6, 2009 at 4:42 PM, robert engels
    wrote:
    I don't think that is the case. You will have a single deletion
    neighborhood. The number of unique terms in the field is going to
    be the union of the deletion dictionaries of each source term.

    For example, given the following documents: A, which has field 'X'
    with value best, and B, which has value jest (and k == 1).

    A will generate best, est, bst, bet, bes; B will generate jest, est,
    jst, jet, jes

    so field FieldXFuzzy contains
    (best:A, est:AB, bst:A, bet:A, bes:A, jest:B, jst:B, jet:B, jes:B)

    I don't think the storage requirement is any greater doing it this
    way.


    3.2.1 Indexing
    For all words in a dictionary, and a given number of edit
    operations k, FastSS generates all variant spellings recursively and
    saves them as tuples of type v′ ∈ Ud(v, k) → (v, x), where v is a
    dictionary word and x is a list of deletion positions.

    Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the
    variants for n dictionary words of length m with k mismatches.


    3.2.2 Retrieval
    For a query p and edit distance k, first generate the neighborhood
    Ud(p, k).
    Then compare the words in the neighborhood with the index, and find
    matching candidates. Compare deletion positions for each candidate
    with the deletion positions in Ud(p, k), using Theorem 4.





    --
    Robert Muir
    rcmuir@gmail.com
  • Robert Muir at Jan 6, 2009 at 10:39 pm
    i see what you are saying here. this is different than fastss but sounds
    nice for spelling correction.

    i suppose one reason why i like fastss is that for my application i need the
    true complete edit distance; i'm actually not using it for spelling correction
    but as a first step for other tasks.

    but maybe fastss is too complex for the general case (spelling correction)
    and we should do what you mention below. i think either way it would be nice
    to have some kind of fuzzy matching in lucene that isn't a linear scan.
    On Tue, Jan 6, 2009 at 5:29 PM, robert engels wrote:

    To clarify a statement in the last email.
    To generate the 'possible source words' in real-time is not as difficult as
    it first seems, if you assume some sort of first-character prefix (which is
    what it appears google does).

    For example, assume the user typed 'robrt' instead of 'robert'. You see
    that this word has very low frequency (or none), so you want to find
    possible misspellings, so do a fuzzy search starting with r. But this search
    can be optimized, because as the edit/delete position moves to the right,
    the prefix remains the same, so these possibilities can be quickly skipped.

    If you don't find any words with high enough frequency as possible edit
    distances, try [a-z]robrt, assuming the user may have dropped the first
    character (possibly try this in "known combination" order, rather than
    alphabetical order, i.e. try sr before nr).
    For example, searching google for 'robrt engels' works. So does 'obert
    engels', so does 'robt engels', all ask me if I meant 'robert engels', but
    searching for 'obrt engels' does not.

    On Jan 6, 2009, at 4:15 PM, robert engels wrote:

    It is definitely going to increase the index size, but not any more than
    the external one would (if my understanding is correct).
    The nice thing is that you don't have to try and keep document numbers in
    sync - it will be automatic.

    Maybe I don't understand what your external index is storing. Given that
    the document contains 'robert' but the user enters 'obert', what is the
    process to find the matching documents?

    Is the external index essentially a constant list, that given obert, the
    source words COULD BE robert, tobert, reobert etc., and it contains no
    document information so:

    given the source word X, and an edit distance k, you ask the external
    dictionary for possible indexed words, and it returns the list, and then you
    search lucene using each of those words?

    If the above is the case, it certainly seems you could generate this list
    in real-time rather efficiently with no IO (unless the external index only
    stores words which HAVE BEEN indexed).

    I think the confusion may be because I understand Otis's comments, but they
    don't seem to match what you are stating.

    Essentially performing any term match requires efficient searching/matching
    of the term index. If this is efficient enough, I don't think either process
    is needed - just an improved real-time fuzzy possibilities word generator.

    On Jan 6, 2009, at 3:58 PM, Robert Muir wrote:

    i see, your idea would definitely simplify some things.

    What about the index size difference between this approach and using
    separate index? Would this separate field increase index size?

    I guess my line of thinking is if you have 10 docs with robert, with
    separate index you just have robert, and its deletion neighborhood one time.
    with this approach you have the same thing, but at least you must have
    document numbers and the other inverted index stuff with each neighborhood
    term. would this be a significant change to size and/or performance? and
    since the documents have multiple terms there is additional positional
    information for slop factor for each neighborhood term...

    i think it's worth investigating, maybe performance would actually be
    better, just curious. i think i boxed myself into an auxiliary index because
    of some other irrelevant things i am doing.
    On Tue, Jan 6, 2009 at 4:42 PM, robert engels wrote:

    I don't think that is the case. You will have a single deletion
    neighborhood. The number of unique terms in the field is going to be the
    union of the deletion dictionaries of each source term.
    For example, given the following documents: A, which has field 'X' with
    value best, and B, which has value jest (and k == 1).

    A will generate best, est, bst, bet, bes; B will generate jest, est, jst, jet, jes

    so field FieldXFuzzy contains (best:A, est:AB, bst:A, bet:A, bes:A, jest:B, jst:B, jet:B, jes:B)

    I don't think the storage requirement is any greater doing it this way.


    3.2.1 Indexing
    For all words in a dictionary, and a given number of edit operations k, FastSS
    generates all variant spellings recursively and saves them as tuples of type
    v′ ∈ Ud(v, k) → (v, x), where v is a dictionary word and x is a list of
    deletion positions.

    Theorem 5. The index uses O(nm^(k+1)) space, as it stores all the variants
    for n dictionary words of length m with k mismatches.


    3.2.2 Retrieval
    For a query p and edit distance k, first generate the neighborhood Ud(p, k).
    Then compare the words in the neighborhood with the index, and find
    matching candidates. Compare deletion positions for each candidate with
    the deletion positions in Ud(p, k), using Theorem 4.


    --
    Robert Muir
    rcmuir@gmail.com



    --
    Robert Muir
    rcmuir@gmail.com
  • Robert Muir (JIRA) at Feb 19, 2010 at 9:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir closed LUCENE-1513.
    -------------------------------

    Resolution: Not A Problem

    For Lucene, LUCENE-2089 will always be faster than even FastSS, as our FuzzyQuery is really a top-N query, and we can exploit properties of the priority queue to make it even faster.

    LUCENE-2089 also works without any auxiliary index or data structures, operating solely on lucene's terms dict, so it works great for updates/NRT/whatever, with no back compat problems.

    I'm cancelling this issue as the alternative is superior in every aspect.
    fastss fuzzyquery
    -----------------

    Key: LUCENE-1513
    URL: https://issues.apache.org/jira/browse/LUCENE-1513
    Project: Lucene - Java
    Issue Type: New Feature
    Components: contrib/*
    Reporter: Robert Muir
    Priority: Minor
    Attachments: fastSSfuzzy.zip


    code for doing fuzzyqueries with fastssWC algorithm.
    FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
    FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with levenstein algorithm.
    sorry but the code is a bit messy... what I'm actually using is very different from this so its pretty much untested. but at least you can see whats going on or fix it up.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
