FAQ
Hi,

In my application, I index only one file and enter a single-term query to
check the Lucene score. I used the explain method to see how the result is
obtained, and the system gave me the score as the product of tf, idf, and
fieldNorm.

1) Although Lucene uses tf to calculate the score, it seems to me that term
frequency has not been normalized. Even if I index several documents, it
does not normalize the tf value. Since the total number of words varies
between indexed documents, can't there be a fault in Lucene's scoring?

2) What is the formula to calculate this fieldNorm value?

Could somebody please help me?

Thanks in advance,
Manjula.

  • Rebecca Watson at Jul 8, 2010 at 4:05 am
    hi,
    > 1) Although Lucene uses tf to calculate scoring it seems to me that term
    > frequency has not been normalized. Even if I index several documents, it
    > does not normalize tf value. Therefore, since the total number of words
    > in index documents are varied, can't there be a fault in Lucene's scoring?

    tf is the term frequency, i.e. the number of times the term appears in the
    document, while idf is the inverse document frequency, a measure of how rare
    a term is, i.e. related to how many documents the term appears in.

    If term1 occurs more frequently in a document, i.e. tf is higher, you want
    to weight the document higher when you search for term1.

    But if term1 is a very frequent term, i.e. it appears in lots of documents,
    then it's probably not as important to the overall search (where we have
    term1, term2, etc.), so you want to downweight it (this is where idf comes
    in).

    Then normalisations such as length normalisation (which allows for 'fair'
    scoring across varied field lengths) come in too.
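
To make the tf and idf factors described above concrete, here is a minimal plain-Java sketch. It does not call Lucene. It assumes the formulas documented for Lucene 3.0's DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and the class and method names are hypothetical.

```java
// Illustrative sketch only: re-implements the tf and idf formulas documented
// for Lucene 3.0's DefaultSimilarity in plain Java. TfIdfSketch is a
// hypothetical name, not a Lucene class.
final class TfIdfSketch {

    // tf grows with the square root of the raw count, so 4 occurrences
    // count twice as much as 1 occurrence, not four times as much.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf is larger for rare terms (small docFreq relative to numDocs).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // One document in the index, and the term occurs in it.
        System.out.println("idf(1, 1)  = " + idf(1, 1));
        // A term found in 9 of 10 documents versus 1 of 10 documents.
        System.out.println("idf(9, 10) = " + idf(9, 10));
        System.out.println("idf(1, 10) = " + idf(1, 10));
        System.out.println("tf(4)      = " + tf(4));
    }
}
```

Under these assumptions, an index holding a single document gives idf(1, 1) = 1 + ln(1/2), roughly 0.307. Note that neither function looks at the document length; in this scoring model that adjustment is left to the norm factor.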

    The tf-idf scoring formula used by Lucene has been around a long, long
    time... there are competing scoring metrics, but that's an IR thing and not
    an argument you want to start on the Lucene lists! :)

    These are IR ('information retrieval') concepts, and you might want to start
    by going through some explanations of tf-idf scoring:

    http://en.wikipedia.org/wiki/Tf%E2%80%93idf
    http://wiki.apache.org/lucene-java/InformationRetrieval

    > 2) What is the formula to calculate this fieldNorm value?

    For how Lucene implements its tf-idf scoring, see:
    http://lucene.apache.org/java/3_0_2/scoring.html
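
As a rough guide to the fieldNorm question: the scoring page linked above describes the norm as the product of the document boost, the field boost, and a length norm, which DefaultSimilarity computes as 1/sqrt(number of terms in the field). The plain-Java sketch below assumes those formulas and does not call Lucene; the class and method names are hypothetical. Lucene stores the norm in a single byte, so the fieldNorm value printed by explain is a rounded version of what this computes.

```java
// Illustrative sketch only: mirrors the norm formula documented for
// Lucene 3.0's DefaultSimilarity. FieldNormSketch is a hypothetical name.
final class FieldNormSketch {

    // Length norm: fields with more terms get a smaller value.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Norm before Lucene's lossy one-byte encoding.
    static double fieldNorm(double docBoost, double fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    // With boosts of 1, tf * fieldNorm equals sqrt(freq / numTerms),
    // i.e. a term frequency that is normalized by field length.
    static double normalizedTf(int freq, int numTerms) {
        return Math.sqrt(freq) * fieldNorm(1.0, 1.0, numTerms);
    }

    public static void main(String[] args) {
        System.out.println("fieldNorm, 4 terms:   " + fieldNorm(1.0, 1.0, 4));
        System.out.println("fieldNorm, 100 terms: " + fieldNorm(1.0, 1.0, 100));
        // Same relative frequency (25%) in a short and a long field.
        System.out.println("normalizedTf(4, 16):  " + normalizedTf(4, 16));
        System.out.println("normalizedTf(8, 32):  " + normalizedTf(8, 32));
    }
}
```

Seen this way, the raw tf is not normalized on its own, but its product with fieldNorm is: two fields in which the term makes up the same share of the text end up with about the same tf * fieldNorm.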

    Also, the Lucene in Action book is a really good book if you are starting
    out with Lucene (and will save you a lot of grief with understanding Lucene
    and setting up your application!). It covers all the basics, then moves on
    to more advanced material, and has lots of code examples too:
    http://www.manning.com/hatcher2/

    hope that helps,

    bec :)

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Manjula wijewickrema at Jul 9, 2010 at 6:52 am
    Hi Rebecca,

    Thanks for your valuable comments. Yes, I observed that once the number of
    terms in the document goes up, the fieldNorm value goes down correspondingly.
    I think, therefore, that there won't be any fault due to the variation in the
    total number of terms per document. Am I right?

    Manjula.
  • Uwe Schindler at Jul 9, 2010 at 7:40 am

    > Thanks for your valuable comments. Yes, I observed that once the number of
    > terms in the document goes up, the fieldNorm value goes down correspondingly.
    > I think, therefore, that there won't be any fault due to the variation in the
    > total number of terms per document. Am I right?

    With the current scoring model, more advanced statistics are not available.
    There are currently some efforts to add BM25 support to Lucene, for which
    the index format needs to be extended to contain more statistics (number of
    terms per document, average number of terms per document, ...).
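
For context on why BM25 needs those extra statistics, here is a sketch of the standard Okapi BM25 per-term score in plain Java. It is not Lucene code; the class name is hypothetical, and the parameter values k1 = 1.2 and b = 0.75 used in main are common textbook choices assumed for illustration.

```java
// Illustrative sketch only: the textbook Okapi BM25 term score.
// Bm25Sketch is a hypothetical name, not part of Lucene.
final class Bm25Sketch {

    // freq:      occurrences of the term in the document
    // docLen:    number of terms in the document
    // avgDocLen: average number of terms per document in the index
    // k1, b:     tuning parameters (tf saturation, length normalization)
    static double termScore(double idf, int freq, int docLen,
                            double avgDocLen, double k1, double b) {
        double lengthFactor = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf * (freq * (k1 + 1.0)) / (freq + lengthFactor);
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75;
        // Same term count in a short, an average, and a long document.
        System.out.println("short:   " + termScore(1.0, 2, 50, 100.0, k1, b));
        System.out.println("average: " + termScore(1.0, 2, 100, 100.0, k1, b));
        System.out.println("long:    " + termScore(1.0, 2, 200, 100.0, k1, b));
    }
}
```

The formula needs both the length of each document and the average length over the whole index, which are the statistics mentioned above.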
  • Manjula wijewickrema at Jul 9, 2010 at 10:31 am
    Thanx

Discussion Overview
group: java-user
category: lucene
posted: Jul 8, '10 at 3:45a
active: Jul 9, '10 at 10:31a
posts: 5
users: 3
website: lucene.apache.org
