FAQ
I'm receiving a number of searches with many ORs, so that the total number
of matches is huge (> 1 million) although only the first 20 results are
required. Analysis shows most time is spent scoring the results. Now it
seems to me that if you send a query with 10 OR components, documents that
match most of the terms are bound to get a better score than a match
that only matches one or two of the terms. So does Lucene do any
optimization to not bother working out the scores of the poor matches?

EDIT: Actually I'm not sure about that statement, because if only one term
matches, a document could still get the highest score if the match was on the
shortest term.

But you can see my point: is there a way to get Lucene to discount the less
good matches without scoring them, or is there another approach? At the
moment we allow the full Lucene syntax, use QueryParser to parse a
query, and pass the resultant query to search unchanged (except for
handling of numeric fields). Should I be modifying the query somehow?

Thanks, Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Ahmet Arslan at May 4, 2011 at 11:39 am
    I'm receiving a number of searches with many ORs, so that the total number of matches is huge (> 1 million) although only the first 20 results are required. Analysis shows most time is spent scoring the results. Now it seems to me that if you send a query with 10 OR components, documents that match most of the terms are bound to get a better score than a match that only matches one or two of the terms. So does Lucene do any optimization to not bother working out the scores of the poor matches?

    EDIT: Actually I'm not sure about that statement, because if only one term matches, a document could still get the highest score if the match was on the shortest term.

    But you can see my point: is there a way to get Lucene to discount the less good matches without scoring them, or is there another approach? At the moment we allow the full Lucene syntax, use QueryParser to parse a query, and pass the resultant query to search unchanged (except for handling of numeric fields). Should I be modifying the query somehow?


    You can restrict the number of returned results by using an adaptively computed BooleanQuery.setMinimumNumberShouldMatch(int) parameter.
    For example, if you have 10 optional clauses you can set minimum should match to 60% of 10 = 6.

    A similar mechanism exists in Solr:
    http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
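The adaptive computation described above can be sketched as a small plain-Java helper. The class and method names here (`MinShouldMatch`, `computeMinShouldMatch`) are purely illustrative, not Lucene or Solr API; the truncating-division rounding is an assumption in the spirit of Solr's percentage-style "mm" parameter:

```java
// Illustrative sketch: turn a percentage like "60%" into a concrete
// minimum-should-match clause count for a BooleanQuery with N optional
// (SHOULD) clauses. Names and rounding policy are assumptions.
public class MinShouldMatch {

    // Returns how many of the optional clauses must match.
    public static int computeMinShouldMatch(int optionalClauseCount, int percent) {
        if (optionalClauseCount <= 0) {
            return 0;
        }
        // Integer division truncates towards zero: 60% of 10 = 6,
        // 60% of 7 = 4 (4.2 rounded down).
        return optionalClauseCount * percent / 100;
    }

    public static void main(String[] args) {
        System.out.println(computeMinShouldMatch(10, 60)); // 6
        System.out.println(computeMinShouldMatch(7, 60));  // 4
    }
}
```

The result would then be passed to BooleanQuery.setMinimumNumberShouldMatch(int) before searching.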


  • Paul Taylor at May 4, 2011 at 11:52 am

    On 04/05/2011 12:39, Ahmet Arslan wrote:


    You can restrict the number of returned results by using an adaptively computed BooleanQuery.setMinimumNumberShouldMatch(int) parameter.
    For example, if you have 10 optional clauses you can set minimum should match to 60% of 10 = 6.

    A similar mechanism exists in Solr:
    http://wiki.apache.org/solr/DisMaxQParserPlugin#mm_.28Minimum_.27Should.27_Match.29
    Thanks for the hint, so this could be done by overriding
    getBooleanQuery() in QueryParser?

    Paul

  • Paul Taylor at May 4, 2011 at 12:34 pm

    On 04/05/2011 12:51, Paul Taylor wrote:
    Thanks for the hint, so this could be done by overriding
    getBooleanQuery() in QueryParser?

    Paul
    Well I did extend QueryParser, and the method is being called, but rather
    disappointingly it had no noticeable effect on how long queries took. I
    really thought that by reducing the number of matches the corresponding
    scoring phase would be quicker.

    @Override
    protected Query getBooleanQuery(List<BooleanClause> clauses, boolean disableCoord)
            throws ParseException
    {
        BooleanQuery query = (BooleanQuery) super.getBooleanQuery(clauses, disableCoord);
        if (query != null)
        {
            if (clauses.size() > 5)
            {
                query.setMinimumNumberShouldMatch(3);
            }
        }
        return query;
    }


  • Ahmet Arslan at May 4, 2011 at 2:02 pm
    Thanks for the hint, so this could be done by overriding getBooleanQuery() in QueryParser?

    I think something like this should do the trick, without overriding anything.


    // assuming an existing parser instance, e.g.
    // QueryParser queryParser = new QueryParser(Version.LUCENE_31, "field", analyzer);
    Query query = queryParser.parse("User Entered String");

    if (query instanceof BooleanQuery)
        ((BooleanQuery) query).setMinimumNumberShouldMatch(3);

    You can steal code from Solr too (e.g. how to calculate mm, the optional clause count, etc.).


  • Paul Taylor at May 4, 2011 at 4:33 pm

    On 04/05/2011 15:02, Ahmet Arslan wrote:
    Thanks for the hint, so this could be done by overriding getBooleanQuery() in QueryParser ?
    I think something like this should do the trick. Without overriding anything.


    Query query= QueryParser.parse("User Entered String");

    if (query instanceof BooleanQuery)
    ((BooleanQuery)query).setMinimumNumberShouldMatch(3);

    You can steal code from solr too, ( e.g. how to calculate mm and optional clause count etc.)
    Thanks again, I've now done that but it's still not having much effect on
    total time; I haven't analysed the results much yet, so maybe I don't
    quite have the algorithm right.

  • Ahmet Arslan at May 4, 2011 at 11:24 pm

    Thanks again, now done that but still not having much
    effect on total time,
    So your main concern is improving the running time, not decreasing the number of returned results?
    Additionally: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

  • Paul Taylor at May 5, 2011 at 7:21 am

    On 05/05/2011 00:24, Ahmet Arslan wrote:
    Thanks again, now done that but still not having much
    effect on total time,
    So your main concern is improving the running time, not decreasing the number of returned results?
    Additionally http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
    Yes, correct, but I have looked at the list of optimizations before.
    What was clear from profiling was that it wasn't the searching part that
    was slow (a query run on the same index with only a few matching docs
    ran super fast); the slowness only occurs when there are loads of
    matching docs, and most of the time is spent in the scorer, which is why
    I was trying to remove the poor matches.

  • Ahmet Arslan at May 5, 2011 at 10:14 am

    Yes correct, but I have looked and the list of
    optimizations before. What was clear from profiling was that
    it wasnt the searching part that was slow (a query run on
    the same index with only a few matching docs ran super fast)
    the slowness only occurs when there are loads of matching
    docs, and spends most of its time in scorer that is why I
    was trying to remove the poor matches.
    Okay, all clear. Can you give us some example query strings where there are loads of matches?

    Do you use a stop word filter? Could it be the case described as:

    "As you approach the upper limits of a single machine,
    extremely frequent terms (called stop words) can become very
    expensive in the wrong query. If part of a top level BooleanQuery, a
    SHOULD clause that appears in every document will cause a match and
    score for every document in your index."

    http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

  • Paul Taylor at May 5, 2011 at 10:32 am

    On 05/05/2011 11:13, Ahmet Arslan wrote:
    Okey all clear. Can you give us some example query strings where there are loads of matching?

    Do you use stop word filter? Could it be case described as

    "As you approach the upper limits of a single machine,
    extremely frequent terms (called stop words) can become very
    expensive in the wrong query. If part of a top level BooleanQuery, a
    SHOULD clause that appears in every document will cause a match and
    score for every document in your index."

    http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
    We used to use the default stop word list but now have no stop words,
    because Lucene is used to match very short fields relating to music data
    such as artist or album name, so the default stop words really need to
    be included to get good matches. For example, how would you match the
    artist 'The The' otherwise? So use of a stop word list is not an option.

    If people construct good queries there is no problem, but the trouble is
    that many users just OR everything they are looking for, because they
    don't want a good match rejected just because one term fails. The
    problem is that there are a number of very popular terms; for example,
    the following query:

    tnum:(6) qdur:(189) artist:(tama) track:(ibata) tracks:(10) release:(the
    global rhythm september 2002)

    will match any song that is on an album with 10 tracks, any song which
    is track no. 6 on an album, and any release containing the word 'the',
    when really what they are looking for is the song 'ibata' by the artist
    'tama'.

    This matches over a million documents (songs), but doesn't match any of
    them well, because the song 'ibata' by 'tama' isn't actually in the index!

    So I don't think the query is very good, but I cannot force users to
    submit better queries. I want to protect the server by reducing the time
    these kinds of query take (up to 1 second as opposed to the more usual
    100 milliseconds), and I hope that forcing x number of terms to match
    would do that.

    Paul



  • Ian Lea at May 5, 2011 at 11:00 am
    See http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
    for an excellent article and solution to the problem with common
    words.

    You could also consider using, and caching and reusing, filters for
    the tnum and tracks fields.


    --
    Ian.
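The caching idea Ian suggests can be sketched outside of Lucene's API: precompute the set of matching doc IDs for a frequently used term once, then reuse it on later queries. In Lucene 3.x this would be done with something like CachingWrapperFilter around a term-based filter; the `FilterCache` class below is purely illustrative and not Lucene API:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a filter cache: one bitset of matching doc IDs
// per field:value key, computed once and reused. This is NOT Lucene API.
public class FilterCache {
    private final Map<String, BitSet> cache = new HashMap<String, BitSet>();

    public interface DocMatcher {
        boolean matches(int docId);   // e.g. "does this doc have tracks == 10?"
    }

    // Returns the cached bitset for field:value, computing it on first use.
    public BitSet getFilter(String field, String value, int maxDoc, DocMatcher matcher) {
        String key = field + ":" + value;
        BitSet bits = cache.get(key);
        if (bits == null) {
            bits = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (matcher.matches(doc)) {
                    bits.set(doc);
                }
            }
            cache.put(key, bits);
        }
        return bits;
    }
}
```

The win is that the expensive per-document matching for common values like tracks:(10) happens once per index generation instead of once per query.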

  • Paul Taylor at May 5, 2011 at 12:11 pm

    On 05/05/2011 11:59, Ian Lea wrote:
    See http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
    for an excellent article and solution to the problem with common
    words.
    Would this work when the user doesn't actually use a phrase query?
    You could also consider using, and caching and reusing, filters for
    the tnum and tracks fields.
    This does sound promising, because tracks has a limited number of values.
    So I guess you create the filter after indexing, cache it somehow, and
    then modify the query to use the filter rather than a query clause. I'll
    read up on it.

    Paul

  • Chris Hostetter at May 4, 2011 at 11:24 pm
    : Well I did extend QueryParser, and the method is being called but rather
    : disappointingly it had no noticeable effect on how long queries took. I
    : really thought by reducing the number of matches the corresponding scoring
    : phase would be quicker.

    "matching" and "scoring" go hand in hand ... the Searcher iterates over
    all docs to determine if they match the Query (I'm grossly simplifying;
    in truth many docs can be skipped wholesale during this iteration because
    of conjunctions) and when a matching document is found, its score is
    computed to determine if it's high enough to be included in the results
    (ie: is its score higher than the lowest-scoring document already
    collected).

    The bottom line: Lucene doesn't know that something is a "less good match"
    until it scores it ... the score is what determines how good it is.


    -Hoss
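The collection loop Hoss describes can be sketched in plain Java: the top-N results live in a small min-heap keyed by score, and a document can only be admitted or rejected once its score is known. This is an illustrative sketch, not Lucene's actual HitQueue/collector code:

```java
import java.util.PriorityQueue;

// Sketch of top-N collection by score: every match must be scored, and
// the heap keeps only the N best seen so far. Not Lucene's real collector.
public class TopNCollector {

    // Returns the n highest scores, best first.
    public static float[] topScores(float[] matchScores, int n) {
        PriorityQueue<Float> queue = new PriorityQueue<Float>(); // min-heap
        for (float score : matchScores) {
            if (queue.size() < n) {
                queue.add(score);
            } else if (score > queue.peek()) {
                queue.poll();      // evict the current lowest of the top N
                queue.add(score);
            }
        }
        float[] result = new float[queue.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = queue.poll();   // polls ascending, filled descending
        }
        return result;
    }
}
```

Note the comparison against queue.peek(): deciding whether a match belongs in the top N requires its score, which is exactly why poor matches cannot be skipped without scoring them.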

  • Paul Taylor at May 5, 2011 at 7:18 am

    On 05/05/2011 00:24, Chris Hostetter wrote:
    The bottom line: Lucene doesn't know that something is a "less good match"
    until it scores it ... the score is what determines how good it is.
    But doesn't setting setMinimumNumberShouldMatch(3) cause the search to decide that a document that only matches two terms does not match the query in the first place, and therefore doesn't need scoring?

    Paul



Discussion Overview
group: java-user
categories: lucene
posted: May 3, '11 at 11:50a
active: May 5, '11 at 12:11p
posts: 14
users: 4
website: lucene.apache.org
