Hello group,

The coord(q,d) normalisation is "a score factor based on how many of the query terms are found in the specified document." and described here:

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_coord

Does this have a theoretical basis? On what basis was the decision made to include it? Does anybody know of a paper (in Information Retrieval, Information Seeking, etc.) or other more general information about this?

Best Regards,
Karl

P.S.: This is my second question about Lucene scoring (current version). It relates to the question I posted about the older scoring version. I decided to repost because most people here seem not to have read it, assuming it relates to an old version. Actually, it doesn't.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Steven Rowe at Dec 12, 2006 at 3:01 pm

    Karl Koch wrote:
    The coord(q,d) normalisation is "a score factor based on how many of
    the query terms are found in the specified document." and described
    here:

    http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_coord

    Does this have a theoretical basis? On what basis was the decision
    made to include it? Does anybody know of a paper (in Information Retrieval,
    Information Seeking, etc.) or other more general information about
    this?

    Following is quoted from: Krovetz, R. & Croft, W. B. (1992) Lexical
    Ambiguity and Information Retrieval. ACM Transactions on Information
    Systems, 10(2): 115-141.

    Many retrieval systems represent documents and queries
    by the words they contain, and base the comparison on
    the number of words they have in common. The more
    words the query and document have in common, the
    higher the document is ranked; this is referred to as
    a "coordination match." Performance is improved by
    weighting query and document words using frequency
    information from the collection and individual
    document texts [27].

    27. Salton, G. & McGill, M. Introduction to Modern Information
    Retrieval. McGraw-Hill, New York, 1983.
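
A minimal sketch of the "coordination match" described in the quoted passage: documents are ranked purely by how many query words they share with the query. The function name and the toy documents are assumptions made for illustration; they do not come from the paper or from Lucene.

```python
def coordination_level(query_terms, doc_terms):
    """Count the distinct query terms that also occur in the document."""
    return len(set(query_terms) & set(doc_terms))

# Hypothetical toy data, not taken from the paper.
query = ["lexical", "ambiguity", "retrieval"]
docs = {
    "d1": "lexical ambiguity in information retrieval".split(),
    "d2": "retrieval of ambiguous records".split(),
    "d3": "cooking with herbs".split(),
}

# Rank documents by the number of query terms they share with the query.
ranking = sorted(docs, key=lambda name: coordination_level(query, docs[name]), reverse=True)
print(ranking)
```

Frequency information plays no role here; the weighting mentioned in the last sentence of the quote would be applied on top of this count.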


  • Karl Koch at Dec 12, 2006 at 4:47 pm
    Hello Steven,

    I looked up the paper and read the relevant part. The text you quoted is from the introduction. I believe that quote refers to the basic purpose of an information retrieval system in general, or at least to the purpose of a vector space model IR system.

    If this is the theoretical justification of the coord_q_d normalisation, then it replicates the other part of the scoring formula to some degree. The entire formula is concerned with this: comparing the term frequencies of query and document.

    Is there any other paper that actually shows the benefit of doing this particular normalisation with coord_q_d? I am not suggesting here that it is not useful, I am just looking for evidence how the idea developed.

    Karl




  • Steven Rowe at Dec 12, 2006 at 10:16 pm

    Karl Koch wrote:
    Is there any other paper that actually shows the benefit of doing
    this particular normalisation with coord_q_d? I am not suggesting
    here that it is not useful, I am just looking for evidence how the
    idea developed.

    I think it's a mischaracterization to call coordination a
    "normalization". In my mind, "normalization" is something applied
    equally to all documents' scores. The coordination component of a
    document's score varies from document to document, and so doesn't meet
    this criterion.

    I repeat the citation of the book cited by the paper I cited :)

    Salton, G. & McGill, M. Introduction to Modern Information
    Retrieval. McGraw-Hill, New York, 1983.

    In addition to the above book, here are two other books that I've seen
    cited as describing "coordination-level matching" (a.k.a. "overlap
    ranking"):

    Salton, G. (1968). Automatic information organization and retrieval.
    New York: McGraw-Hill.

    Lancaster, F.W. (1979). Information retrieval systems: Characteristics,
    testing and evaluation (2nd ed.). New York: Wiley.

    I don't know the answer to your larger question: why use a coordination
    component in a similarity measure when other components (tf*idf) seem to
    serve the same function? What you seem to be looking for is a study
    that directly compares a system using a coordination component in its
    similarity measure with the *same* system, varying the measure only in
    that coordination is elided. Unfortunately, I know of no such study.

    Good luck,
    Steve


  • Karl Koch at Dec 13, 2006 at 3:01 pm
    Hello Steven,

    Unfortunately I don't have access to these books right now. I will try to get hold of them. Thank you for these pointers. :)

    I had a quick look at "coordination level matching" on the web and found evidence that it was an early retrieval strategy. My main question is why one should use coordination level matching if one is already doing (proper) TFxIDF-based matching. When I look at Lucene's scoring formula, it seems to me that two kinds of matching are performed and combined in a single formula.

    In the paper "Exploiting the Similarity of Non-matching Terms at Retrieval Time", which can be found here:

    http://www.cis.strath.ac.uk/~fabioc/papers/00-jir.pdf

    it is directly compared with TFxIDF. To me, it seems that coordination level matching could be used if I don't want to use TFxIDF but not together with it. In this context, I wonder what benefit the "coordination level matching" has in combination with TFxIDF?

    It is likely that I have some kind of misunderstanding here. Perhaps with your help I can untangle it a bit further. As I said earlier, I am only looking for a reasonable explanation (perhaps backed by some evidence in the literature) that makes clear why it is used together with TFxIDF.

    Thank you,
    Karl



  • Yonik Seeley at Dec 13, 2006 at 3:32 pm

    On 12/13/06, Karl Koch wrote:
    To me, it seems that coordination level matching could be used if I don't want to use TFxIDF but not together with it. In this context, I wonder what benefit the "coordination level matching" has in combination with TFxIDF?

    Well, if I search for blue kangaroo, the coord is nice to get
    documents with "blue" and "kangaroo" to score higher than documents
    with just one term. And among documents with just one term, the idf
    factor will make "kangaroo" rank above "blue", which is generally
    desired.
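
The "blue kangaroo" behaviour can be sketched as follows. This is a simplified model, not Lucene's actual scoring code: it assumes coord is the fraction of query terms matched, uses 1 + ln(N / (df + 1)) as one possible idf variant, and ignores tf, norms and boosts. The collection size and document frequencies are hypothetical.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed idf variant: 1 + ln(N / (df + 1)). Other variants behave similarly.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query, doc_terms, doc_freq, num_docs):
    """coord times the sum of idf weights of the matched query terms."""
    matched = [t for t in query if t in doc_terms]
    coord = len(matched) / len(query)
    return coord * sum(idf(doc_freq[t], num_docs) for t in matched)

NUM_DOCS = 1000                           # hypothetical collection size
DOC_FREQ = {"blue": 200, "kangaroo": 5}   # hypothetical document frequencies
QUERY = ["blue", "kangaroo"]

both = score(QUERY, {"blue", "kangaroo"}, DOC_FREQ, NUM_DOCS)
only_kangaroo = score(QUERY, {"kangaroo", "zoo"}, DOC_FREQ, NUM_DOCS)
only_blue = score(QUERY, {"blue", "sky"}, DOC_FREQ, NUM_DOCS)

print(both, only_kangaroo, only_blue)
```

Under these assumptions the document containing both terms scores highest, and among the one-term documents the rarer "kangaroo" outranks "blue".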

    I have seen complaints about the default similarity though, where the
    coord factor does not give enough of a boost in relation to the idf of
    some of the individual terms.
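
A sketch of how such a complaint can arise, under the same simplifying assumptions (coord as the matched fraction, idf as 1 + ln(N / (df + 1)), no tf, norms or boosts). All numbers are hypothetical: a three-term query, one document matching two very common terms, and one document matching a single extremely rare term.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed idf variant: 1 + ln(N / (df + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(matched_doc_freqs, query_len, num_docs):
    """coord times the sum of idf weights; one document frequency per matched term."""
    coord = len(matched_doc_freqs) / query_len
    return coord * sum(idf(df, num_docs) for df in matched_doc_freqs)

NUM_DOCS = 1_000_000  # hypothetical collection size

# Document matching two common query terms (df = 300,000 and 200,000).
two_common = score([300_000, 200_000], query_len=3, num_docs=NUM_DOCS)
# Document matching only one very rare query term (df = 1).
one_rare = score([1], query_len=3, num_docs=NUM_DOCS)

print(two_common, one_rare)
```

With these numbers the single rare match still scores higher than the two common matches, even though coord favours the latter by a factor of two.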


    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server

  • Karl Koch at Dec 13, 2006 at 3:42 pm
    Do you know about any papers that discuss this?

    Karl

  • Paul Elschot at Dec 13, 2006 at 8:01 pm

    On Wednesday 13 December 2006 16:42, Karl Koch wrote:
    Do you know about any papers that discuss this?

    Coordination is called "co-ordination" in the original idf paper:
    K. Spärck Jones, "A statistical interpretation of term specificity
    and its application in retrieval", Journal of Documentation 28,
    11-21, 1972.
    http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf

    The paper is the first one on the idf page:
    http://www.soi.city.ac.uk/~ser/idf.html

    Regards,
    Paul Elschot

  • Karl Koch at Dec 14, 2006 at 1:39 am
    Hello Paul,

    Thank you for providing the link to that paper. I read it again, and you are right. I found the following passage:

    "In normal term co-ordination matches, if a request and document have a frequent term in common, this counts for as much as a non-frequent one; so if a request and document share three common terms, the document is retrieved at the same level as another one sharing three rare terms with the request. But it seems we should treat matches on non-frequent terms as more valuable than ones on frequent terms, without disregarding the latter altogether. The natural solution is to correlate a term's matching value with its collection frequency."

    If I do not misunderstand that extract, I would say it suggests combining coordination level matching with IDF. I am interested in your view and in the views of others reading this.
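
The contrast drawn in the quoted passage can be sketched as below. The weight log(N / df) is a simple stand-in chosen for illustration and is not claimed to be the exact weighting used in the paper; the term names and document frequencies are hypothetical.

```python
import math

def plain_match(query, doc):
    """Plain co-ordination level: every shared term counts as 1."""
    return len(set(query) & set(doc))

def weighted_match(query, doc, doc_freq, num_docs):
    """Each shared term counts log(N / df), so rare terms count for more."""
    return sum(math.log(num_docs / doc_freq[t]) for t in set(query) & set(doc))

NUM_DOCS = 1000  # hypothetical collection size
# f1..f3 are frequent terms, r1..r3 are rare terms (hypothetical frequencies).
DOC_FREQ = {"f1": 400, "f2": 350, "f3": 300, "r1": 4, "r2": 3, "r3": 2}
query = list(DOC_FREQ)

doc_frequent = ["f1", "f2", "f3"]   # shares three frequent terms with the query
doc_rare = ["r1", "r2", "r3"]       # shares three rare terms with the query

print(plain_match(query, doc_frequent), plain_match(query, doc_rare))
print(weighted_match(query, doc_frequent, DOC_FREQ, NUM_DOCS),
      weighted_match(query, doc_rare, DOC_FREQ, NUM_DOCS))
```

Plain matching puts both documents at the same level, while the weighted version prefers the document sharing rare terms without reducing the frequent-term matches to zero.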

    Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous?

    Cheers,
    Karl

  • Soeren Pekrul at Dec 14, 2006 at 8:41 am

    Karl Koch wrote:
    If I do not misunderstand that extract, I would say it suggests combining coordination level matching with IDF. I am interested in your view and in the views of others reading this.

    I understand the sentence
    "The natural solution is to correlate a term's matching value with its
    collection frequency."
    in exactly that way: it combines coordination level matching with IDF.

    The score for a document is the sum of the term weights w(tf, idf) of
    the query terms it contains, so you already have a combination of
    coordination level matching and IDF. Now suppose your query contains
    three terms A, B and C. Two of them (A and B) occur quite often in the
    collection; one (C) is very rare. Documents matching only C could then
    score higher than documents containing both A and B. To avoid this,
    you can give coordination more influence by multiplying the sum of
    term weights by the coordination factor.
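
The adjustment described above can be written out as a small worked example. The weights w(A) = w(B) = 1 and w(C) = 3 are assumptions chosen for illustration, coord is taken to be the fraction of query terms matched, and tf, norms and boosts are ignored:

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\sum_{t \in q \cap d} w(t,d),
\qquad
\mathrm{coord}(q,d) \;=\; \frac{|q \cap d|}{|q|}
\]
\[
\text{without coord:}\quad
\mathrm{score}(D(C)) = 3 \;>\; \mathrm{score}(D(A,B)) = 1 + 1 = 2
\]
\[
\text{with coord:}\quad
\mathrm{score}(D(A,B)) = \tfrac{2}{3}\cdot 2 = \tfrac{4}{3}
\;>\;
\mathrm{score}(D(C)) = \tfrac{1}{3}\cdot 3 = 1
\]
```

With these assumed weights, the extra factor is just enough to move the two-term document above the document that matches only the rare term.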

    Sören

  • Karl Koch at Dec 14, 2006 at 9:27 am
    I think I understand now. I also have evidence from literature. So I would say that my question is solved. :)

    Thank you, Otis, and everybody else for contributing!
    Karl

  • Soeren Pekrul at Dec 14, 2006 at 10:01 am

    Addendum:
    For the query Q(A, B, C) with
    A: df++ (idf--)
    B: df++ (idf--)
    C: df-- (idf++)
    the user would probably expect the following ranking:
    1. D(A, B, C)
    2. D(A, C), D(B, C)
    3. D(A, B)
    4. D(C)
    5. D(A), D(B)
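
The expected ranking can be checked with a short sketch. The term weights below (1 for the common terms A and B, 3 for the rare term C) are hypothetical, coord is taken to be the fraction of query terms matched, and tf, norms and boosts are left out.

```python
from itertools import combinations

WEIGHTS = {"A": 1.0, "B": 1.0, "C": 3.0}  # hypothetical weights: A, B common; C rare
QUERY = ("A", "B", "C")

def score(doc_terms, use_coord):
    """Sum of matched term weights, optionally multiplied by coord."""
    matched = [t for t in QUERY if t in doc_terms]
    total = sum(WEIGHTS[t] for t in matched)
    coord = len(matched) / len(QUERY) if use_coord else 1.0
    return coord * total

# All seven non-empty document types D(...) over the three query terms.
docs = [frozenset(c) for n in (1, 2, 3) for c in combinations(QUERY, n)]

def ranking(use_coord):
    """Group documents into tiers of equal score, best tier first."""
    tiers = {}
    for d in docs:
        key = round(score(d, use_coord), 6)
        tiers.setdefault(key, []).append("".join(sorted(d)))
    return [sorted(tiers[s]) for s in sorted(tiers, reverse=True)]

with_coord = ranking(True)
without_coord = ranking(False)
print(with_coord)
print(without_coord)
```

With these weights the coord version reproduces the five tiers listed above, while the version without coord places D(C) ahead of D(A, B). The result depends on the weights: in this model, once the weight of C exceeds 4, D(C) outranks D(A, B) again even with coord.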

    Sören

  • Grant Ingersoll at Dec 14, 2006 at 12:31 pm
    FYI: The Wiki has a fair number of resources on IR:
    http://wiki.apache.org/jakarta-lucene/InformationRetrieval (I have added a
    link to this conversation, which contains a lot of useful information).

    Karl, if you are so inclined, please feel free to add any of the
    references you have found helpful that aren't already
    on this page (anyone can edit the Wiki with a login).

    -Grant
    --------------------------
    Grant Ingersoll
    Center for Natural Language Processing
    http://www.cnlp.org

    Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



  • Doug Cutting at Dec 19, 2006 at 8:57 pm

    Karl Koch wrote:
    Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous?

    We independently developed coordination-level matching combined with
    TFxIDF when I worked at Apple. This is documented in:

    http://www.informatik.uni-trier.de/~ley/db/conf/trec/trec1996.html#RoseS96

    (I had left Apple when this was written, but it largely describes work
    done while I was there.)

    Doug

  • Otis Gospodnetic at Dec 14, 2006 at 1:47 am
    Hi,

    But isn't "coord" + TFIDF pretty intuitive? Independently, they are both useful and contribute to the final score for the match.

    Otis


Discussion Overview
group: java-user
categories: lucene
posted: Dec 12, '06 at 10:32a
active: Dec 19, '06 at 8:57p
posts: 15
users: 8
website: lucene.apache.org
