Hi List,

I've been redirected from general@lucene.apache.org to here to discuss
my issue.


---------- My original email ----------


I'm trying to provide relevant results for the users of a lyrics site, even
in the case of misspellings, by indexing artists and songs with Lucene.

The problem is that Lucene provides irrelevant search results. For
example searching for "Coldplay" returns "Longplay" as the most relevant
result.

This is how I create individual documents:

Document document = new Document();
document.add(new Field("artist", artist, Field.Store.YES,
        Field.Index.UN_TOKENIZED));
document.add(new Field("song", song, Field.Store.YES,
        Field.Index.UN_TOKENIZED));
document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(document);

And this is how I compose the actual query:

BooleanQuery query = new BooleanQuery();
if (artist.length() > 0) {
    FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
    query.add(artist_query, BooleanClause.Occur.MUST);
}
if (song.length() > 0) {
    FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
    query.add(song_query, BooleanClause.Occur.MUST);
}

Please let me know what's wrong, I'd like to make this work right.

Thanks in advance!


---------- My reply to an answer ----------

On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
On Tuesday, 17 June 2008, László Monda wrote:

FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));

You should try the FuzzyQuery constructor that takes a minimum similarity
and a prefix length. The general problem, however, is that the degree of
similarity is only one factor. The other factors are the same as for other
searches, e.g. the number of occurrences of the term in the document and in
the whole index.

You could try to write your own Similarity implementation that disables all
these factors, see
http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html

I understand some essential concepts related to Lucene such as the
Levenshtein distance and tokenization, but I really don't want to go
this deep if it's not necessary.

Since fuzzy searching is based on the Levenshtein distance, the distance
between "coldplay" and "coldplay" is 0 and the distance between
"coldplay" and "downplay" is 3, so how on earth is it possible that when
searching for "coldplay", Lucene returns "longplay"? This shouldn't
happen regardless of the minimum similarity and prefix length factors.

Additional info: Lucene seems to do the right thing when only a few
documents are present, but goes crazy when there are about 1.5 million
documents in the index.


---------------------------------------------------------------------


I hope that some of you can help me, because I have no idea what
could be wrong here.

Thanks in advance!
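
The distance claim in the post above can be checked by hand. The following is a minimal sketch in plain Java with no Lucene dependency: a textbook Levenshtein distance, plus a length-normalised similarity of the form 1 - distance / (length of the shorter string). That normalisation is an assumption about how Lucene 2.x compares a candidate term against the minimum similarity (default 0.5), not code taken from the library, and the class and method names are made up for illustration.

```java
class EditDistanceCheck {

    // Classic dynamic-programming Levenshtein distance
    // (insertion, deletion and substitution each cost 1).
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            // Swap the rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1 - distance / length of the shorter string.
    public static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) levenshtein(a, b) / shorter;
    }

    public static void main(String[] args) {
        String query = "coldplay";
        for (String term : new String[] {"coldplay", "longplay", "downplay"}) {
            System.out.println(term + ": distance=" + levenshtein(query, term)
                    + ", similarity=" + similarity(query, term));
        }
    }
}
```

Both "longplay" and "downplay" are three substitutions away from "coldplay". Under the assumed formula that gives a similarity of 0.625, which is above a 0.5 threshold, so a distance of 3 does not by itself exclude either term from the match set. Why one of them can rank above the exact match is a separate scoring question.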


  • Daniel Naber at Jun 18, 2008 at 6:36 pm

    On Wednesday, 18 June 2008, László Monda wrote:

    Since fuzzy searching is based on the Levenshtein distance, the distance
    between "coldplay" and "coldplay" is 0 and the distance between
    "coldplay" and "downplay" is 3, so how on earth is it possible that when
    searching for "coldplay", Lucene returns "longplay"?

    You can use query.explain() to get the details of the ranking. In your
    case, just build a query like: term^100 OR term~, i.e. boost the original
    (non-fuzzy) term with a large number.

    Regards
    Daniel

    --
    http://www.danielnaber.de

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
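
The "term^100 OR term~" suggestion above is written in Lucene's query-parser syntax. As a small illustration, the string can be assembled per field as below. This is only a sketch: the class and method names are hypothetical, the term is assumed to be a single word with no query-syntax special characters, and the resulting string would still have to be handed to a query parser configured to match how the field was indexed.

```java
class BoostedFuzzyQueryString {

    // Builds "field:term^boost OR field:term~" in Lucene query-parser syntax.
    // Assumes `term` is one word and needs no escaping.
    public static String exactOrFuzzy(String field, String term, int boost) {
        String exact = field + ":" + term + "^" + boost;
        String fuzzy = field + ":" + term + "~";
        return exact + " OR " + fuzzy;
    }

    public static void main(String[] args) {
        // prints: artist:coldplay^100 OR artist:coldplay~
        System.out.println(exactOrFuzzy("artist", "coldplay", 100));
    }
}
```

The idea is that an exact hit collects the large boost on top of its fuzzy score, while misspelled variants only collect the fuzzy part.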
  • László Monda at Jun 23, 2008 at 11:23 am
    Hi Daniel,
    On Wed, 2008-06-18 at 20:37 +0200, Daniel Naber wrote:
    On Wednesday, 18 June 2008, László Monda wrote:

    Since fuzzy searching is based on the Levenshtein distance, the distance
    between "coldplay" and "coldplay" is 0 and the distance between
    "coldplay" and "downplay" is 3, so how on earth is it possible that when
    searching for "coldplay", Lucene returns "longplay"?

    You can use query.explain() to get the details of the ranking. In your
    case, just build a query like: term^100 OR term~, i.e. boost the original
    (non-fuzzy) term with a large number.

    According to the current Lucene documentation at
    http://lucene.apache.org/java/2_3_2/api/index.html it seems to me that
    the Query class doesn't have any explain() methods.

    Am I missing something?
  • Daniel Naber at Jun 23, 2008 at 6:03 pm

    On Monday, 23 June 2008, László Monda wrote:

    According to the current Lucene documentation at
    http://lucene.apache.org/java/2_3_2/api/index.html it seems to me that
    the Query class doesn't have any explain() methods.

    It's in IndexSearcher, and it takes a query and a document number as its
    arguments.

    Regards
    Daniel

    --
    http://www.danielnaber.de

  • Daniel Naber at Jun 18, 2008 at 7:09 pm

    On Wednesday, 18 June 2008, László Monda wrote:

    Additional info: Lucene seems to do the right thing when only a few
    documents are present, but goes crazy when there are about 1.5 million
    documents in the index.

    Lucene works well with more documents (I'm currently using it with 9 million),
    but the fuzzy query requires iteration over all terms, which makes this
    query slow. This can be avoided by setting the prefixLength parameter of the
    FuzzyQuery constructor to 1 or 2. Or maybe you should use an n-gram index;
    see the spellchecker in the contrib area.

    Regards
    Daniel

    --
    http://www.danielnaber.de
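
The effect of the prefixLength parameter mentioned above can be pictured without Lucene: only terms that share the first few characters with the query are compared at all. The following is a rough sketch in plain Java, where the class name is made up and an in-memory list stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixFilterSketch {

    // Keeps only the terms that share the first `prefixLength` characters
    // with the query; the expensive edit-distance check would then run on
    // these candidates only.
    public static List<String> candidates(String query, List<String> terms, int prefixLength) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> kept = new ArrayList<String>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("coldplay", "coldpaly", "longplay", "downplay", "cold");
        System.out.println(candidates("coldplay", terms, 0)); // every term is a candidate
        System.out.println(candidates("coldplay", terms, 1)); // only terms starting with "c"
    }
}
```

A side effect relevant to this thread: with a prefix length of 1, "longplay" is no longer a candidate for "coldplay", at the cost of missing misspellings in the first letter.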

  • László Monda at Jun 23, 2008 at 11:10 am

    On Wed, 2008-06-18 at 21:10 +0200, Daniel Naber wrote:
    On Wednesday, 18 June 2008, László Monda wrote:

    Additional info: Lucene seems to do the right thing when only a few
    documents are present, but goes crazy when there are about 1.5 million
    documents in the index.

    Lucene works well with more documents (I'm currently using it with 9 million),
    but the fuzzy query requires iteration over all terms, which makes this
    query slow. This can be avoided by setting the prefixLength parameter of the
    FuzzyQuery constructor to 1 or 2. Or maybe you should use an n-gram index;
    see the spellchecker in the contrib area.

    Thanks for the suggestion. I don't have any performance problems
    yet, but I do have serious problems with the relevance of the results
    with fuzzy queries.
  • Markharw00d at Jun 18, 2008 at 8:11 pm
    This looks like it is related to an issue I first raised here:
    http://markmail.org/message/37ywsemfudpos6uh

    At the time I identified 2 issues with FuzzyQuery: the usual "coord"
    and "idf" scoring factors shouldn't be applied to fuzzy queries.
    The coord factor got fixed, but idf remains an issue in FuzzyQuery, I believe.
    There is a class I added to contrib/queries, "FuzzyLikeThisQuery",
    which can be used as a replacement for FuzzyQuery and fixes the idf
    issue. You can subclass the QueryParser to create an instance of that
    rather than FuzzyQuery if required.

    Cheers,
    Mark
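
The idf issue described above can be illustrated with plain arithmetic. The sketch below uses the formula log(numDocs / (docFreq + 1)) + 1, which is believed to be what Lucene's DefaultSimilarity of that era computes; treat that as an assumption. The document frequencies are invented for illustration, and the class name is hypothetical.

```java
class IdfSketch {

    // Inverse document frequency: log(numDocs / (docFreq + 1)) + 1.
    // The fewer documents contain a term, the larger the value.
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 1500000;    // index size mentioned in the thread
        int commonDocFreq = 500;  // hypothetical: documents containing the exact term
        int rareDocFreq = 1;      // hypothetical: documents containing a rare variant
        System.out.println("idf(common exact term) = " + idf(commonDocFreq, numDocs));
        System.out.println("idf(rare variant)      = " + idf(rareDocFreq, numDocs));
    }
}
```

Because a fuzzy query expands into one term query per matching variant, a rare variant can receive a higher idf weight than the frequently occurring exact term, which is how a near miss can end up ranked above an exact match.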

  • László Monda at Jun 23, 2008 at 11:22 am
    Hi Mark,
    On Wed, 2008-06-18 at 21:09 +0100, markharw00d wrote:
    This looks like it is related to an issue I first raised here:
    http://markmail.org/message/37ywsemfudpos6uh

    At the time I identified 2 issues with FuzzyQuery: the usual "coord"
    and "idf" scoring factors shouldn't be applied to fuzzy queries.
    The coord factor got fixed, but idf remains an issue in FuzzyQuery, I believe.
    There is a class I added to contrib/queries, "FuzzyLikeThisQuery",
    which can be used as a replacement for FuzzyQuery and fixes the idf
    issue. You can subclass the QueryParser to create an instance of that
    rather than FuzzyQuery if required.

    Frankly, I neither understand the deeper anatomy of Lucene, nor do I
    want to. I'd expect Lucene to do the right thing with fuzzy queries, and
    it obviously doesn't do the right thing in my case.

    I get clearly irrelevant results: I search for a word using a fuzzy
    query, that word is stored in my Lucene index, and Lucene returns a
    word that is pretty different. It seems to me to be a bug.

    I've downloaded FuzzyQuery from SVN and replaced the references to the
    Fuzzy class with it. I got the same irrelevant results as before.

    Do you have any ideas what's going on here?
    --
    Laci <http://monda.hu>
  • Mark harwood at Jun 23, 2008 at 11:30 am
    I do have serious problems with the relevance of the results with fuzzy queries.

    Please take the time to read my response here:

    http://www.gossamer-threads.com/lists/lucene/java-user/62050#62050

    I had a work colleague come up with exactly the same problem this week and the solution is the same.

    Just tested my index with a standard Lucene FuzzyQuery for "Paul~" - this gives "Phul", "Saul", and "Paulo" before ANY "Paul" records due to IDF issues.
    Using FuzzyLikeThisQuery puts all the "Paul" records ahead of the variants.



  • László Monda at Jun 23, 2008 at 12:12 pm
    Thanks for your reply, Mark.



    This was my original code for constructing my query using FuzzyQuery:

    BooleanQuery query = new BooleanQuery();
    if (artist.length() > 0) {
        FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
        query.add(artist_query, BooleanClause.Occur.MUST);
    }
    if (song.length() > 0) {
        FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
        query.add(song_query, BooleanClause.Occur.MUST);
    }



    This is my first attempt to use FuzzyLikeThisQuery (with no success):

    FuzzyLikeThisQuery query = new FuzzyLikeThisQuery(2, new SimpleAnalyzer());
    if (artist.length() > 0) {
        query.addTerms(artist, "artist", 0.5f, 0);
    }
    if (song.length() > 0) {
        query.addTerms(song, "song", 0.5f, 0);
    }



    This is my second attempt to use FuzzyLikeThisQuery (with no success):

    BooleanQuery query = new BooleanQuery();
    if (artist.length() > 0) {
        FuzzyLikeThisQuery artist_query =
                new FuzzyLikeThisQuery(1, new SimpleAnalyzer());
        artist_query.addTerms(artist, "artist", 0.5f, 0);
        query.add(artist_query, BooleanClause.Occur.MUST);
    }
    if (song.length() > 0) {
        FuzzyLikeThisQuery song_query =
                new FuzzyLikeThisQuery(1, new SimpleAnalyzer());
        song_query.addTerms(song, "song", 0.5f, 0);
        query.add(song_query, BooleanClause.Occur.MUST);
    }



    I think it's my lack of understanding of how to use FuzzyLikeThisQuery
    that is giving me irrelevant results.

    Could you tell me what's wrong here, please?

    Thank you.
    --
    Laci <http://monda.hu>
  • Mark harwood at Jun 23, 2008 at 12:53 pm
    Could you tell me what's wrong here, please?

    There are potentially a number of factors at play here.

    Your use of FuzzyLikeThisQuery is fine. I just tried the code on my single-term "Paul" query and, as I outlined before, it does a much better job of matching: for Paul~, FuzzyLikeThisQuery returns Paul, Paul, Paul, ..., Phul, whereas FuzzyQuery returns Phul, Saul, Paulo, Paul, Paul, ...

    Try the query on just the term artist:Coldplay and see the results. What artists does FuzzyLikeThisQuery return vs. FuzzyQuery?

    If you aren't getting Coldplay as the first result from FuzzyLikeThisQuery, double-check that the content is indexed using the same analyzer that you pass to FuzzyLikeThisQuery (your code uses SimpleAnalyzer). If you indexed with WhitespaceAnalyzer, for example, or as UN_TOKENIZED, the index and the query differ, so "Coldplay" != "coldplay".

    I notice the song title in your original code is treated as a single term in your query. Is that how it is indexed? I can see that artist might possibly make sense as a single term which gets fuzzy matched, but song titles are generally longer, which means they may work better as a tokenized field.

    Cheers
    Mark
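
The analyzer mismatch described above comes down to comparing a term that was indexed as-is with a query term that has been lower-cased. The following is a minimal stand-alone sketch in plain Java: the two helper methods are stand-ins invented for illustration (they are not Lucene calls), and a single-word input is assumed so that tokenization itself plays no role.

```java
import java.util.Locale;

class CaseMismatchSketch {

    // Stand-in for an un-tokenized field: the value is indexed exactly as given.
    public static String indexUntokenized(String value) {
        return value;
    }

    // Stand-in for a lower-casing analyzer such as SimpleAnalyzer
    // (single-word input assumed).
    public static String analyzeLowercase(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String queryTerm = analyzeLowercase("Coldplay");

        // Indexed as-is, queried lower-cased: the terms differ.
        System.out.println(indexUntokenized("Coldplay").equals(queryTerm)); // false

        // Indexed and queried through the same lower-casing step: the terms agree.
        System.out.println(analyzeLowercase("Coldplay").equals(queryTerm)); // true
    }
}
```

When the two sides disagree like this, the exact term is itself only a fuzzy match (one edit away), so it competes with every other near miss instead of standing out as an exact hit.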


  • László Monda at Jun 28, 2008 at 3:27 pm

    On Mon, 2008-06-23 at 12:52 +0000, mark harwood wrote:
    Could you tell me what's wrong here, please?

    There are potentially a number of factors at play here.

    Your use of FuzzyLikeThisQuery is fine. I just tried the code on my single-term "Paul" query and, as I outlined before, it does a much better job of matching: for Paul~, FuzzyLikeThisQuery returns Paul, Paul, Paul, ..., Phul, whereas FuzzyQuery returns Phul, Saul, Paulo, Paul, Paul, ...

    Try the query on just the term artist:Coldplay and see the results. What artists does FuzzyLikeThisQuery return vs. FuzzyQuery?

    If you aren't getting Coldplay as the first result from FuzzyLikeThisQuery, double-check that the content is indexed using the same analyzer that you pass to FuzzyLikeThisQuery (your code uses SimpleAnalyzer). If you indexed with WhitespaceAnalyzer, for example, or as UN_TOKENIZED, the index and the query differ, so "Coldplay" != "coldplay".

    I notice the song title in your original code is treated as a single term in your query. Is that how it is indexed? I can see that artist might possibly make sense as a single term which gets fuzzy matched, but song titles are generally longer, which means they may work better as a tokenized field.

    You were right, tokenization was the issue. Using TOKENIZED instead of
    UN_TOKENIZED immediately provided relevant results, even when using it
    with FuzzyQuery.

    Using FuzzyLikeThisQuery made the relevance much better, so I'm really
    happy with the results.

    Thank you very much!
    --
    Laci <http://monda.hu>
  • Qaz zaq at Jun 28, 2008 at 10:35 pm
    Hi,

    I ran into a very strange situation regarding document retrieval slowness and urgently need some advice.

    I have 2 FSDirectory indexes, each about 500M in size. I have 2 parallel search threads fetching 200 documents from these 2 indexes, which usually takes less than 16ms. However, every time after some heavy disk operation (such as copying a 1G file onto that disk), document retrieval immediately slows down to a couple of seconds, and it stays slow long after the disk operation has finished. It appears Lucene never returns to its original speed, and I have to restart my application in order to get it back to normal.

    Has anybody encountered similar problems?
  • Daniel Naber at Jun 28, 2008 at 11:15 pm

    On Sunday, 29 June 2008, qaz zaq wrote:

    indexes which usually take less than 16ms. However, every time after some
    heavy disk operations (such as copying a 1G file onto that disk),
    the document retrieval slows down to a couple of seconds immediately,
    even long after this disk operation has finished.
    Is this copy operation initiated by your Java code and does it require a
    lot of RAM? If so, this could be caused by the JVM garbage collection.
    There are options to start the JVM so it prints out garbage collection
    progress information. Also, are you sure the copy operation is finished?
    Maybe data is still in the cache of the operating system and the disk is
    still busy writing it?

    Regards
    Daniel

    --
    http://www.danielnaber.de

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
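
One way to check the garbage collection theory is the standard JVM option -verbose:gc, which prints a line per collection. The same information can also be sampled from inside the application. A minimal sketch using the java.lang.management API follows; the class name GcReport and the placement of the measurement are assumptions, not something from this thread.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcReport {
    // Sums the accumulated collection time (ms) over all collectors.
    // Collectors that report -1 (undefined) are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // ... run the slow document retrieval here ...
        long after = totalGcMillis();
        System.out.println("GC time during retrieval: " + (after - before) + " ms");
    }
}
```

If the GC time measured around a slow retrieval stays near zero, garbage collection is unlikely to be the cause and the disk or OS cache becomes the more likely suspect.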
  • Qaz zaq at Jun 28, 2008 at 11:37 pm
    Thanks for your quick reply!

    The copy operation is not related to Java. I use the Windows (2003 Enterprise) copy command. The problem persists even 5 hours after the copy is done (it should not take that long for the OS to write a 1G file from the system cache to disk, right?). The only way to resolve this problem is to restart my application.

    BTW, how can I tell if a file is still in the system cache vs. on disk?

  • Daniel Naber at Jun 29, 2008 at 6:17 am

    On Sunday, 29 June 2008, qaz zaq wrote:

    I have 2 FSDirectory indexes each with size about 500M. I have 2
    parallel search threads fetching 200 documents from these 2
    indexes which usually take less than 16ms.
    Fetching documents means that about 2 disk seeks are needed per document.
    For 200 documents, that's not possible in 16ms on a common hard disk
    unless everything is cached by the operating system. Your copy operation
    seems to clear that cache, so search times get slower. A solution might
    be to re-warm the cache after each copy process.

    Regards
    Daniel

    --
    http://www.danielnaber.de
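
A rough estimate shows the scale of the problem: assuming about 10 ms per seek (an assumed figure for a typical hard disk, not measured here), 200 documents x 2 seeks x 10 ms comes to about 4 seconds, which is in line with the slowdown reported above. Re-warming can be as simple as reading the index files once so the operating system caches them again. A minimal sketch, assuming the index is a flat directory of files; IndexCacheWarmer and warm are hypothetical names, not a Lucene API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class IndexCacheWarmer {
    // Reads every regular file in indexDir once and returns the number of bytes read.
    // The data itself is discarded; the point is to pull it through the OS file cache.
    static long warm(Path indexDir) throws IOException {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                try (InputStream in = Files.newInputStream(file)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        total += n;
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexCacheWarmer <index directory>");
            return;
        }
        System.out.println("Read " + warm(Paths.get(args[0])) + " bytes");
    }
}
```

Running a few representative searches after the copy is another way to warm the cache. Whether the pages stay cached depends on the operating system and on available memory.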

  • Eks dev at Jun 29, 2008 at 1:44 pm
    Yes, we have seen this many times. The problem, especially on Windows, is that simple commands like copy wreak havoc on the file system cache. As a matter of fact, we are not sure it is the cache that is causing the problems; generally, all IO operations start blocking like crazy. We have seen this effect on a 32 GB machine where the complete index (ca. 2 GB) fits comfortably in RAM, and then a copy of another file of this size caused Lucene to wait endlessly for the OS to show some signs of life. Adding one more disk helped a lot, as did some coordination between Lucene and external processes that are IO intensive. Warm-up is easy, as long as you can achieve coordination between processes.

    Good luck
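
The coordination mentioned above could be as simple as a marker file that the IO-heavy job creates before it starts and deletes when it is done. This is only one possible convention, not something Lucene provides; IoBusyWatcher, rewarmDue and the marker file name are all hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class IoBusyWatcher {
    private final Path marker;
    private boolean wasBusy;

    IoBusyWatcher(Path marker) {
        this.marker = marker;
    }

    // Returns true once, on the first poll after the marker file has disappeared,
    // i.e. when the external IO job has finished and a cache re-warm is due.
    synchronized boolean rewarmDue() {
        boolean busy = Files.exists(marker);
        boolean due = wasBusy && !busy;
        wasBusy = busy;
        return due;
    }

    public static void main(String[] args) {
        // A real application would call rewarmDue() periodically, e.g. from a timer.
        Path marker = Paths.get(args.length > 0 ? args[0] : "io-busy.marker");
        IoBusyWatcher watcher = new IoBusyWatcher(marker);
        System.out.println("Re-warm due: " + watcher.rewarmDue());
    }
}
```

The search application would poll rewarmDue() and trigger its warm-up when it returns true; the external copy job only has to create and delete the marker file.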







Discussion Overview
group: java-user
categories: lucene
posted: Jun 18, '08 at 2:06p
active: Jun 29, '08 at 1:44p
posts: 17
users: 5
website: lucene.apache.org
