Lucene Scoring Behavior
I've run across some puzzling behavior regarding scoring. I have a set of documents which contain, among others, a date field (whose content is a string in YYYYMMDD format). When I query on the date 20030917 (that is, today), I get 157 hits, all of which have a score of .23000652. If I use 20030916 (yesterday), I get 197 hits, each of which has a score of .22295427.

So far, all seems logical. However, when I search for all records for the date 20030915, the first two (of 174 hits) have a score of 1.0, while all the rest of the hits have a score of .03125. Here is a tabulation of these and a few more queries:

Query Date   Result (hit count in parentheses)
==========   ====================================================
20030917     all have a score of .23000652 (157)
20030916     all have a score of .22295427 (197)
20030915     first 2 have a 1.0 score, all the rest are .03125 (174)
20030914     all have a score of .21384604 (264)
20030913     first 2 have a 1.0 score, all the rest are .03125 (156)
20030912     all have a score of .2166833 (241)
20030911     first 3 have a 1.0 score, all the rest are .03125 (244)
20030910     all have a score of .2208193 (211)

I would expect all the hits to have the same score, and I would expect that score to be normalized to 1 (unless, I guess, the top score was less than 1, in which case normalization presumably doesn't occur).

Does anyone have any ideas as to what might be going on here? (I'm using the latest CVS sources, obtained this afternoon.)

Regards,

Terry


  • Erik Hatcher at Sep 17, 2003 at 7:46 pm
    Try using IndexSearcher.explain and dump out the contents of what it
    returns either as toString or toHtml (whichever format suits your
    environment best) and see what it has to say. It'll give you the
    low-down on the factors involved in the score calculation. I'm
    interested to see what you come up with.
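
    For illustration, here is a minimal sketch of what Erik describes (an
    editor's addition, not code from the thread); the index path, field
    name, and date value are placeholders:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ExplainDump {
        public static void main(String[] args) throws IOException {
            // Placeholder index location and query.
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Query query = new TermQuery(new Term("pub_date", "20030915"));
            Hits hits = searcher.search(query);
            // Dump the scoring breakdown for the first few hits.
            for (int i = 0; i < Math.min(5, hits.length()); i++) {
                Explanation explanation = searcher.explain(query, hits.id(i));
                System.out.println(hits.score(i) + " =>");
                System.out.println(explanation.toString()); // or explanation.toHtml()
            }
            searcher.close();
        }
    }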

    Erik

  • Doug Cutting at Sep 17, 2003 at 8:55 pm
    If you're using RangeQuery to do date searching, then you'll likely see
    unusual scoring. The IDF of a date, like any other term, is inversely
    related to the number of documents with that date. So documents whose
    dates are rare will score higher, which is probably not what you intend.
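
    For reference, an editor's sketch (not from the thread) of the stock
    idf formula used by DefaultSimilarity in the Lucene versions of this
    era, which shows why rarer dates get a larger factor:

    // idf = ln(numDocs / (docFreq + 1)) + 1 -- smaller docFreq, larger idf.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }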

    Using a Filter for date searching is one way to remove dates from the
    scoring calculation. Another is to provide a Similarity implementation
    that gives an IDF of 1.0 for terms from your date field, e.g., something
    like:

    public class MySimilarity extends DefaultSimilarity {
        public float idf(Term term, Searcher searcher) throws IOException {
            if ("date".equals(term.field())) {  // compare field name by value
                return 1.0f;
            } else {
                return super.idf(term, searcher);
            }
        }
    }

    Or you could just give date clauses of your query a very small boost
    (e.g., .0001) so that other clauses dominate the scoring.
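
    As an editor's sketch (not from the thread) of the Filter approach Doug
    mentions: a filter can mark the documents for one pub_date so the date
    term never enters the score. The class name is hypothetical and the API
    shown is the old BitSet-based Filter:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    // Restricts results to documents containing the given date term.
    public class DateTermFilter extends Filter {
        private final Term dateTerm;

        public DateTermFilter(String field, String yyyymmdd) {
            this.dateTerm = new Term(field, yyyymmdd);
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs termDocs = reader.termDocs(dateTerm);
            try {
                // Set a bit for every document that has this date.
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            } finally {
                termDocs.close();
            }
            return bits;
        }
    }

    It would be used as searcher.search(contentQuery, new
    DateTermFilter("pub_date", "20030915")); the boost alternative is simply
    dateQuery.setBoost(0.0001f) on the date clause before adding it to the
    BooleanQuery.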

    Doug

  • Terry Steichen at Sep 17, 2003 at 9:33 pm
    Doug/Erik,

    I do use RangeQuery to get a range of dates, but in this case I'm
    querying on a single date (string), so I believe I'm just using a
    regular query.

    Per Erik's suggestion, I checked the Explanation output for some of
    these anomalies. I've included a condensation of the data it generated
    below (which, frankly, I don't understand). Perhaps that will give you
    or Erik some insight into what's happening?

    Regards,

    Terry

    PS: I note that the 'docFreq' parameters displayed below correspond exactly
    to the number of hits for the query. Also, here's the Similarity class I'm
    using (per an earlier suggestion of Doug's):

    public class WESimilarity2 extends org.apache.lucene.search.DefaultSimilarity {

        public float lengthNorm(String fieldName, int numTerms) {
            if (fieldName.equals("headline") || fieldName.equals("summary")
                    || fieldName.equals("ssummary")) {
                return 4.0f * super.lengthNorm(fieldName, Math.max(numTerms, 750));
            } else {
                return super.lengthNorm(fieldName, Math.max(numTerms, 750));
            }
        }
    }




    Query #1: pub_date:20030917
    All items: Score: .23000652
    0.23000652 = weight(pub_date:20030917 in 91197), product of:
      0.99999994 = queryWeight(pub_date:20030917), product of:
        7.360209 = idf(docFreq=157)
        0.1358657 = queryNorm
      0.23000653 = fieldWeight(pub_date:20030917 in 91197), product of:
        1.0 = tf(termFreq(pub_date:20030917)=1)
        7.360209 = idf(docFreq=157)
        0.03125 = fieldNorm(field=pub_date, doc=91197)

    Query #2: pub_date:20030916
    All items: Score: .22295427
    0.22295427 = fieldWeight(pub_date:20030916 in 90992), product of:
      1.0 = tf(termFreq(pub_date:20030916)=1)
      7.1345367 = idf(docFreq=197)
      0.03125 = fieldNorm(field=pub_date, doc=90992)

    Query #3: pub_date:20030915
    Items 1 & 2: Score: 1.0
    7.2580175 = weight(pub_date:20030915 in 90970), product of:
      0.99999994 = queryWeight(pub_date:20030915), product of:
        7.258018 = idf(docFreq=174)
        0.13777865 = queryNorm
      7.258018 = fieldWeight(pub_date:20030915 in 90970), product of:
        1.0 = tf(termFreq(pub_date:20030915)=1)
        7.258018 = idf(docFreq=174)
        1.0 = fieldNorm(field=pub_date, doc=90970)

    Query #3 (same as above): pub_date:20030915
    Other items: Score: .03125
    0.22681305 = weight(pub_date:20030915 in 90826), product of:
      0.99999994 = queryWeight(pub_date:20030915), product of:
        7.258018 = idf(docFreq=174)
        0.13777865 = queryNorm
      0.22681306 = fieldWeight(pub_date:20030915 in 90826), product of:
        1.0 = tf(termFreq(pub_date:20030915)=1)
        7.258018 = idf(docFreq=174)
        0.03125 = fieldNorm(field=pub_date, doc=90826)

    Query #4: pub_date:20030914
    0.21384604 = weight(pub_date:20030914 in 90417), product of:
      0.99999994 = queryWeight(pub_date:20030914), product of:
        6.843074 = idf(docFreq=264)
        0.14613315 = queryNorm
      0.21384606 = fieldWeight(pub_date:20030914 in 90417), product of:
        1.0 = tf(termFreq(pub_date:20030914)=1)
        6.843074 = idf(docFreq=264)
        0.03125 = fieldNorm(field=pub_date, doc=90417)

    Query #5: pub_date:20030913
    Items 1 & 2: Score: 1.0
    7.366558 = fieldWeight(pub_date:20030913 in 90591), product of:
      1.0 = tf(termFreq(pub_date:20030913)=1)
      7.366558 = idf(docFreq=156)
      1.0 = fieldNorm(field=pub_date, doc=90591)

    Query #5 (same as above): pub_date:20030913
    Other items: Score: .03125
    0.23020494 = fieldWeight(pub_date:20030913 in 90383), product of:
      1.0 = tf(termFreq(pub_date:20030913)=1)
      7.366558 = idf(docFreq=156)
      0.03125 = fieldNorm(field=pub_date, doc=90383)


  • Doug Cutting at Sep 17, 2003 at 9:51 pm

    Terry Steichen wrote:
    0.03125 = fieldNorm(field=pub_date, doc=90992)
    1.0 = fieldNorm(field=pub_date, doc=90970)

    It looks like the fieldNorms are what differ, not the IDFs. These are
    the product of the document and/or field boost and 1/sqrt(numTerms),
    where numTerms is the number of terms in the "pub_date" field of the
    document. Thus, if each document is assigned only one date and you
    didn't boost the field or the document when you indexed it, this should
    be 1.0. But if a document has two dates, it would be 1/sqrt(2). Or if
    you boosted a document's pub_date field, it will carry whatever boost
    you provided.

    So, did you boost anything when indexing? Or could a single document
    have two or more different values for pub_date? Either would explain this.
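
    To make the arithmetic concrete, a small editor's sketch (not part of
    the thread): the norm Doug describes is the boost times 1/sqrt(numTerms),
    and depending on the Lucene version the stored value may also be
    quantized to a single byte, so Explanation can print a slightly coarser
    number than the raw product.

    // Editor's sketch of the fieldNorm arithmetic described above.
    public class FieldNormSketch {
        static float fieldNorm(float boost, int numTermsInField) {
            return boost * (float) (1.0 / Math.sqrt(numTermsInField));
        }

        public static void main(String[] args) {
            System.out.println(fieldNorm(1.0f, 1)); // 1.0    -> one date, no boost
            System.out.println(fieldNorm(1.0f, 2)); // ~0.707 -> two dates in the field
            System.out.println(fieldNorm(2.0f, 1)); // 2.0    -> field or document boost of 2
        }
    }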

    Doug
  • Terry Steichen at Sep 17, 2003 at 11:37 pm
    Doug,

    (1) No, I did *not* boost the pub_date field, either in the indexing process
    or in the query itself.

    (2) And, each pub_date field of each document (which is in XML format)
    contains only one instance of the date string.

    (3) And only the pub_date field itself is indexed. There are other
    attributes of this field that may contain the date string, but they aren't
    indexed - that is, they are not included in the instantiated Document class.

    Regards,

    Terry

  • Doug Cutting at Sep 18, 2003 at 3:15 am
    Hmm. This makes no sense to me. Can you supply a reproducible
    standalone test case?
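
    An editor's sketch (not from the thread) of the kind of standalone test
    case Doug asks for: index a handful of identical documents into a
    RAMDirectory, search on the date term, and print the scores, which
    should all come out equal. To mirror Terry's setup, the custom
    Similarity would also need to be installed (e.g., via
    Similarity.setDefault(new WESimilarity2())) before indexing and
    searching.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class ScoreTestCase {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            for (int i = 0; i < 10; i++) {
                Document doc = new Document();
                doc.add(Field.Keyword("pub_date", "20030915")); // untokenized date
                doc.add(Field.Text("headline", "headline number " + i));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            Hits hits = searcher.search(new TermQuery(new Term("pub_date", "20030915")));
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.score(i)); // expect identical scores
            }
            searcher.close();
        }
    }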

    Doug

  • Terry Steichen at Sep 18, 2003 at 2:10 pm
    Doug,

    I just extracted a portion of the database, reindexed it, and found the
    scores come out much more like we'd expect. It appears this may be an
    indexing issue - I index new material each day and merge the new index
    with the master index, and I only rebuild the master when I can't avoid
    it (because it takes so long). I probably merge 100 times or more
    between full reindexes. This evening I'll reindex and let you know
    whether the apparent problem clears up. If so, I'll keep track of it as
    I continue to merge and see if any issue shows up there.
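
    For readers unfamiliar with that workflow, a hedged editor's sketch of
    the daily merge step (paths and class name are hypothetical; the API is
    the old IndexWriter.addIndexes(Directory[]) call):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeDaily {
        public static void main(String[] args) throws Exception {
            // Hypothetical paths for the master index and the day's new index.
            Directory master = FSDirectory.getDirectory("/indexes/master", false);
            Directory daily = FSDirectory.getDirectory("/indexes/daily", false);

            IndexWriter writer = new IndexWriter(master, new StandardAnalyzer(), false);
            writer.addIndexes(new Directory[] { daily }); // merge the new index in
            writer.optimize();
            writer.close();
        }
    }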

    Thanks for the input (and to Erik for pointing me to the Explanation -
    it's pretty neat).

    Question: The new scores for the test database portion mentioned above all
    seem to come out in the range of .06 to .07. I assume this is because they
    never get normalized. If this is the case, (a) would it hurt anything to
    "normalize up" (so the scores range up to 1), and if so (b) is there an
    easy, non-disruptive (to the source code) way to do this?
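
    One low-impact possibility (an editor's sketch, not a reply from the
    thread) is to rescale in application code: divide every hit's score by
    the top hit's score so the best match reports 1.0. The helper below
    assumes the default descending-score ordering and the usual
    org.apache.lucene.search imports:

    // Editor's sketch: client-side rescaling so the top hit scores 1.0.
    static void printNormalizedScores(Searcher searcher, Query query)
            throws IOException {
        Hits hits = searcher.search(query);
        if (hits.length() == 0) {
            return;
        }
        float top = hits.score(0); // Hits is ordered by descending score
        for (int i = 0; i < hits.length(); i++) {
            System.out.println((hits.score(i) / top) + "  doc id " + hits.id(i));
        }
    }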

    Regards,

    Terry


  • Terry Steichen at Sep 21, 2003 at 9:23 pm
    Doug,

    Well, my good intentions (to reindex on Thursday night) were interrupted by
    Hurricane Isabel (followed by a 44-hour power outage).

    Excuses aside, I did get the reindex done today, and the scores for all
    hits from a single-date query now come out the same (as they should). I
    don't have any idea what screwed up the previous index - though, as
    promised, I'll keep an eye on it as I continue to merge new material over
    the next few days/weeks.

    Is there a way, using standard Lucene configuration parameters and/or
    APIs, to force the hit scores to come out so that the highest one is 1
    and the others are proportionately lower?

    Regards,

    Terry


Discussion Overview
Group: java-user
Category: lucene
Posted: Sep 17, 2003 at 7:33 PM
Active: Sep 21, 2003 at 9:23 PM
Posts: 9
Users: 3
Website: lucene.apache.org
