Hello,

We are currently migrating from 2.4.1 to 2.9.0. We've noticed some
changes in the results of fuzzy queries.
We have made this small test case:

********
StandardAnalyzer analyzer = new StandardAnalyzer();

Directory index = new RAMDirectory();
IndexWriter w = new IndexWriter(index, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);

addDoc(w, "Lucene in Action");
addDoc(w, "Lucene for Dummies");

addDoc(w, "Giga byte");

addDoc(w, "ManagingGigabytesManagingGigabyte");
addDoc(w, "ManagingGigabytesManagingGigabytes");

addDoc(w, "The Art of Computer Science");
addDoc(w, "J. K. Rowling");
addDoc(w, "JK Rowling");
addDoc(w, "Joanne K Roling");
addDoc(w, "Bruce Willis");
addDoc(w, "Willis bruce");
addDoc(w, "Brute willis");
addDoc(w, "B. willis");
w.close();
***************

Here's the problem:
We would expect the query
Query q = new QueryParser("title", analyzer).parse( "giga~0.9" );

to match at least "Giga byte".

With Lucene version 2.4.1 it returns:
1. Giga byte with score : 1.7948763

With 2.9 there are no matches; we have to go as low as 0.7
("giga~0.7") to get any.

Could this be a regression?


http://www.nabble.com/file/p25924689/FirstShot.java Simple test case (1 file
here)




--
View this message in context: http://www.nabble.com/Difference-between-2.4.1-and-2.9.0-%28possible-regression-%29-tp25924689p25924689.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Michael McCandless at Oct 16, 2009 at 4:03 pm
    This looks to have been caused by:

    http://issues.apache.org/jira/browse/LUCENE-1124

    That change short-circuits all matching if the term is too short relative to
    the minimum similarity. But I guess something must be wrong with the
    formula.

    I'll reopen that issue and mark it to be fixed in 2.9.1.

    Mike
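
For readers following along, here is a minimal sketch of the kind of length check described above. It assumes the short circuit has the form "term length must exceed 1 / (1 - minimum similarity)"; that form is an assumption chosen because it is consistent with the numbers reported in this thread, not a copy of the Lucene source. The class and method names are hypothetical.

```java
public class FuzzyLengthCheck {

    // Assumed form of the LUCENE-1124 short circuit (hypothetical helper,
    // not Lucene API): a term counts as "long enough" only if its length
    // exceeds 1 / (1 - minimumSimilarity).
    static boolean termLongEnough(String text, float minimumSimilarity) {
        return text.length() > 1.0f / (1.0f - minimumSimilarity);
    }

    public static void main(String[] args) {
        float[] thresholds = {0.7f, 0.8f, 0.9f};
        for (float t : thresholds) {
            // Prints whether the 4-character term "giga" passes the check
            // at each minimum similarity.
            System.out.println("giga~" + t + " long enough: "
                    + termLongEnough("giga", t));
        }
    }
}
```

Under this assumption, "giga" (4 characters) passes at 0.7, where the bound is about 3.3, but not at 0.8 or 0.9, where the bound is about 5 and 10. If a failed check returns no hits at all, even an exactly matching term is lost, which is the behavior reported here.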
  • Michael McCandless at Oct 16, 2009 at 4:52 pm
    OK I've committed the fix on the 2.9.x branch, so it'll be included in
    the 2.9.1 release.

    Thanks for raising this!

    Mike

  • Stefcl at Oct 16, 2009 at 5:34 pm
    Thanks,
    Even if you add a document called "giga" to the example, I'm not sure that
    searching "giga~0.8" would return anything.

    It seems a bit weird, because an exact search (which I guess should be more
    or less equivalent to a fuzzy search with a similarity of nearly 1) would
    actually return some results.

    I guess it was part of an attempt to prevent insignificant terms from having
    an unreasonable impact on the score, but can we still call that factor
    "minimum similarity" then?

    I really suspect there's something broken here, or perhaps I just fail to
    understand the logic. The way it worked in 2.4.1 seemed much more
    useful. Now even a 100% exact match isn't enough for the query to
    succeed. In my opinion this should have been implemented as a completely
    different query type.

    I don't mean any offense here, I'm just trying to
    understand...
    Kind regards


  • Mark Miller at Oct 16, 2009 at 5:36 pm
    It was a bug and Mike fixed it. The bug was that exact matches were not
    being returned, as you state. It will be fixed in 2.9.1.

    --
    - Mark

    http://www.lucidimagination.com
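
To illustrate why an exact match should always clear any minimum similarity, here is a small worked sketch. It assumes the fuzzy similarity is defined as 1 - (edit distance / length of the shorter term) and ignores any prefix length; treat that formula as an assumption for illustration rather than the exact Lucene implementation. All names are hypothetical.

```java
public class FuzzySimilaritySketch {

    // Plain Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity: 1 - distance / min(length); not Lucene API.
    static float similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) {
            return query.length() == candidate.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(query, candidate) / shorter;
    }

    public static void main(String[] args) {
        // "giga" and "byte" are the two tokens of the document "Giga byte".
        System.out.println(similarity("giga", "giga"));
        System.out.println(similarity("giga", "byte"));
    }
}
```

With this definition an identical term has distance 0 and similarity 1.0, so it satisfies "giga~0.9" or any other threshold, while "byte" shares no characters with "giga" and scores 0.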




  • Stefcl at Oct 16, 2009 at 5:39 pm
    Apologies, my previous message crossed yours.
    Good to hear that it's not intended behavior; I was worried.

    Thanks for the fix!
    Kind regards



Discussion Overview
group: java-user
categories: lucene
posted: Oct 16, '09 at 12:36p
active: Oct 16, '09 at 5:39p
posts: 6
users: 3
website: lucene.apache.org