Hi all,


I tried to figure out how the fuzzy search implementation
in Apache Lucene works and I'm kinda stuck here.
--> Version: Apache Lucene 3.0.1 (Java)


[What I want / need]
I'm looking for a way to combine a prefix-, fuzzy- and wildcard query.

Q: Is it possible to have a query like "user_input~0.5*" ?


[JavaDoc for org.apache.lucene.search.FuzzyQuery c-tor]
@param minimumSimilarity: a value between 0 and 1 to set
the required similarity between the query term and the
matching terms. For example, for a minimumSimilarity of
0.5 a term of the same length as the query term is
considered similar to the query term if the edit distance
between both terms is less than length(term)*0.5

Q: Mh... what if the query term differs in length from the
term in my document?


[Test case]
I have written a small test program (JUnit test case) to
explain my problem / confusion in detail:

--> http://eugeneciurana.com/pastebin/pastebin.php?show=42619



[Examples] Search term --> Subset of expected result
Cinamo~0.5 --> Cinema, Cinnamon [works]
Strawbarr~0.8 --> Strawberry [doesn't work]

-->
As far as I understand, the "edit distance"
(aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
is 2 (one replacement and one insertion to transform "Strawbarr" into
"Strawberry").

The query "Strawbarr~0.8" in my opinion (and from what I read in
the JavaDocs) should work just fine, because
len(Strawbarr)*0.8 == 9*0.8 == 7.2 and 7.2 >= 2. Still,
it doesn't work. Is that because the length of the search term
and the word in my document differ?
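
For reference, the edit distance of 2 can be reproduced with a plain
dynamic-programming Levenshtein implementation. This is a minimal standalone
sketch, not Lucene's code; the class name EditDistanceSketch and the method
levenshtein are made up for illustration.

```java
class EditDistanceSketch {

    // Classic Levenshtein distance: insertion, deletion and replacement each cost 1.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];

        // Turning a prefix into the empty string (or back) costs its length.
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int replacement = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), replacement);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Strawbarr", "Strawberry")); // prints 2
        System.out.println(levenshtein("Cinamo", "Cinema"));        // prints 2
    }
}
```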

I already searched the wiki, the mailing list archive
and had a look in all the "obvious" places but had no luck
so far.

If I am missing something obvious here I would be glad to receive
some pointers in the right direction.
<--




Regards
-benjamin-

--
Benjamin Jung <bpjung@terreon.de>
Terreon, http://terreon.de/
Tel.: +49 (0)69 / 8484 65 37
Fax: +49 (0)6054 / 909 788 2
Mobil +49 (0)1577 / 159 788 3


  • Robert Muir at Mar 29, 2010 at 3:06 pm

    On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung wrote:


    > [Examples] Search term --> Subset of expected result
    > Cinamo~0.5 --> Cinema, Cinnamon [works]
    > Strawbarr~0.8 --> Strawberry [doesn't work]
    >
    > -->
    > As far as I understand, the "edit distance"
    > (aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
    > is 2 (one replacement and one insertion to transform "Strawbarr" into
    > "Strawberry")

    Yes, you are correct; the scaling is a bit strange in my opinion. You can see
    it in FuzzyTermsEnum's javadocs (if you look at the code):

    Similarity returns a number that is 1.0f or less (including negative
    numbers) based on how similar the Term is compared to a target term. It
    returns exactly 0.0f when

        editDistance > maximumEditDistance

    Otherwise it returns:

        1 - (editDistance / length)

    where length is the length of the shortest term (text or target) including a
    prefix that are identical and editDistance is the Levenshtein distance for
    the two words.
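
Applied to the examples from the question, that formula can be sketched as
follows. This is only an illustration of the javadoc text quoted above, not
the actual Lucene source: it assumes a prefix length of zero, leaves out the
maximumEditDistance cut-off, and the names FuzzySimilaritySketch and
similarity are hypothetical.

```java
class FuzzySimilaritySketch {

    // Levenshtein distance, keeping only two rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (editDistance / length), where length is the length of the SHORTER term.
    static float similarity(String text, String target) {
        int length = Math.min(text.length(), target.length());
        if (length == 0) {
            // Assumption for this sketch: two empty terms are identical, otherwise no match.
            return text.length() == target.length() ? 1.0f : 0.0f;
        }
        return 1.0f - ((float) levenshtein(text, target) / length);
    }

    public static void main(String[] args) {
        // 1 - 2/9 is roughly 0.778, which is below the requested 0.8.
        System.out.println(similarity("Strawbarr", "Strawberry"));
        // 1 - 2/6 is roughly 0.667, which is above the requested 0.5.
        System.out.println(similarity("Cinamo", "Cinema"));
    }
}
```

Under these assumptions "Strawbarr" vs. "Strawberry" scores about 0.778, so a
minimum similarity of 0.8 rejects it, while "Cinamo" vs. "Cinema" scores about
0.667 and passes a minimum of 0.5.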


    I think other implementations instead tend to use 1 - (editDistance /
    length) for scaling, where length is the length of the longest term.
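
To make the difference between the two scalings concrete, here is a small
sketch that plugs the thread's example (edit distance 2, term lengths 9 and
10) into both variants. The class name ScalingComparisonSketch and the helper
scaled are made up for illustration; the edit distance is taken as given
rather than computed.

```java
class ScalingComparisonSketch {

    // 1 - (editDistance / length) for whichever length is chosen for scaling.
    static float scaled(int editDistance, int length) {
        return 1.0f - ((float) editDistance / length);
    }

    public static void main(String[] args) {
        int editDistance = 2; // "Strawbarr" vs. "Strawberry", as worked out in the thread
        int shortest = Math.min("Strawbarr".length(), "Strawberry".length()); // 9
        int longest = Math.max("Strawbarr".length(), "Strawberry".length());  // 10

        // Scaling by the shortest term: 1 - 2/9, roughly 0.778.
        System.out.println("shortest: " + scaled(editDistance, shortest));
        // Scaling by the longest term: 1 - 2/10, roughly 0.8.
        System.out.println("longest:  " + scaled(editDistance, longest));
    }
}
```

Scaling by the longest term gives a more forgiving score for terms of
different lengths. Whether a score of about 0.8 would then satisfy "~0.8"
depends on whether the comparison against the minimum similarity is strict.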

    --
    Robert Muir
    rcmuir@gmail.com

Discussion Overview
group: java-user
categories: lucene
posted: Mar 29, '10 at 2:58p
active: Mar 29, '10 at 3:06p
posts: 2
users: 2
website: lucene.apache.org
