Hi,

sorry, I already asked this a few days ago, but I got no reply and I really
need some help with it.

I'm running several queries against a document collection. The queries are
documents from the collection itself: I need to measure how similar each
document is to the rest of the collection.

Now, Lucene returns a score per query, but I've been told that such scores
are not comparable across queries. Is this correct?

For example, aren't these scores comparable?
query1, score:8.324234
query2, score:3.324238

If so, why not? Isn't the score the cosine similarity between the query
vector and the collection's document vectors? I really need a comparable
measure.

thanks


  • Uwe Schindler at Mar 28, 2011 at 8:04 am
    No, scores are in general not comparable between different queries.
    Several things cause this:
    - Each query has a norm factor that makes scores more comparable when the
    queries are sub-clauses of a BooleanQuery. But you are right, this norm
    factor should be the same.
    - Some queries, like FuzzyQuery, depend on the terms in the index and on
    which of them match the query.
    - Inside Boolean queries, there is also a coord factor involved.

    If you always use the same simple type of query (e.g. a simple
    TermQuery, only with a different term) on the same index, you can compare
    the scores. As soon as you use complex queries (e.g. several terms
    combined in a BooleanQuery, as QueryParser produces), the scores are no
    longer comparable.

    You can read more about all the factors included in scoring:
    http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/search/Similarity.html
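
To make the effect of the coord factor concrete, here is a minimal sketch in plain Java. It does not call Lucene; it only assumes the DefaultSimilarity behaviour documented for Lucene 3.x, where coord is overlap / maxOverlap. The class name CoordSketch, the method names, and the per-clause scores are made up for illustration.

```java
// Illustrative model only: it mimics how a coord factor rescales a
// BooleanQuery-style score. It does not use the Lucene API.
final class CoordSketch {

    // Assumed to mirror DefaultSimilarity.coord in Lucene 3.x:
    // the fraction of query clauses that matched the document.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Sum of the matching clauses' scores, optionally scaled by coord.
    static float score(float[] matchedClauseScores, int totalClauses,
                       boolean disableCoord) {
        float sum = 0f;
        for (float s : matchedClauseScores) {
            sum += s;
        }
        if (disableCoord) {
            return sum;
        }
        return sum * coord(matchedClauseScores.length, totalClauses);
    }

    public static void main(String[] args) {
        float[] matched = {1.5f, 0.5f}; // hypothetical per-clause scores

        // Same matching clauses, but the queries differ in size.
        System.out.println(score(matched, 2, false)); // 2 of 2 clauses matched
        System.out.println(score(matched, 4, false)); // 2 of 4 clauses matched
        System.out.println(score(matched, 4, true));  // coord disabled
    }
}
```

With coord enabled, the four-clause query scores half as much as the two-clause query for the same matching clauses, so the raw numbers depend on query size. With coord disabled, that particular difference disappears.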

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 9:44 AM
    To: java-user@lucene.apache.org
    Subject: comparing lucene scores across queries


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Patrick Diviacco at Mar 28, 2011 at 8:09 am
    Hi, thanks for the reply.

    Yes, I've read the Similarity class documentation several times, but I need
    some tips.

    My queries are BooleanQueries, but they always have the same structure (the
    same structure as the docs; they are actually docs from the collection): 3
    fields.

    What if I simplify the similarity score by removing the coord factor and
    keeping only the cosine similarity, which is comparable?

    I want to stress that my boolean queries are just a combination of
    "field:term" items, and I always have the same 3 fields, obviously with
    different terms.

    Thanks



  • Uwe Schindler at Mar 28, 2011 at 8:12 am
    Hi Patrick,

    You can disable the coord factor in the constructor of BooleanQuery.

    Uwe

  • Patrick Diviacco at Mar 28, 2011 at 8:36 am
    Cool. So, just to be sure: if I disable the coord factor, can I finally
    compare my BooleanQuery results?


  • Patrick Diviacco at Mar 28, 2011 at 8:49 am
    One more thing: instead of extending the BooleanQuery class to remove the
    coord factor, can I also extend the Similarity class to do it?

    The other question is still open: just to be sure, if I disable the coord
    factor, can I finally compare my BooleanQuery results?

    Thanks

  • Uwe Schindler at Mar 28, 2011 at 9:36 am
    Hi,

    You don't need to extend BooleanQuery, you can just pass "true" in its ctor,
    see: http://s.apache.org/QvK
    Of course you can also subclass DefaultSimilarity and return 1 as coord, but
    that is more work than passing true to a ctor.

    For your type of queries, disabling coord should be enough, but I am not
    100% sure! Why not simply try it out?

    Uwe

  • Patrick Diviacco at Mar 28, 2011 at 9:39 am
    OK thanks, I will pass "true". Still, I don't know how to verify it: even
    if I try it, I get some scores, but I don't know whether comparing them is
    reliable.
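
One way to get a reference value to check scores against is to compute a plain cosine similarity independently. The sketch below is plain Java and does not use Lucene. It assumes each document is reduced to a map from term to weight (raw term counts here; TF-IDF weights would work the same way). CosineSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: cosine similarity between sparse term-weight vectors.
final class CosineSketch {

    // Euclidean length of a sparse term-weight vector.
    static double norm(Map<String, Double> v) {
        double sumSquares = 0.0;
        for (double w : v.values()) {
            sumSquares += w * w;
        }
        return Math.sqrt(sumSquares);
    }

    // Cosine of the angle between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    // Builds a term-count vector from whitespace-separated text.
    static Map<String, Double> counts(String text) {
        Map<String, Double> v = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                v.merge(term, 1.0, Double::sum);
            }
        }
        return v;
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = counts("lucene scoring query");
        Map<String, Double> d2 = counts("lucene query parser");
        System.out.println(cosine(d1, d2)); // two of three terms shared
        System.out.println(cosine(d1, d1)); // a document against itself
    }
}
```

Because the value is divided by both vector lengths, it stays between 0 and 1 for non-negative weights, is symmetric, and is close to 1 for a document compared with itself. Those properties give simple sanity checks that a raw score does not offer.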

  • Uwe Schindler at Mar 28, 2011 at 9:45 am
    Hi,

    As you seem to want to do very specific things, it might still be
    worthwhile to provide a modified Similarity (by subclassing
    DefaultSimilarity). You could then, for example, also return 1.0 from
    queryNorm() to disable it, since it may also be a problem (but it isn't for
    your queries). Theoretically, you can change the Similarity so that only
    the cosine similarity is left over, if that is the only part you want to
    use.

    Uwe
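
The role of queryNorm() can be sketched in a few lines of plain Java, without any Lucene calls. The sketch assumes the DefaultSimilarity definition documented for Lucene 3.x, 1 / sqrt(sumOfSquaredWeights). The class QueryNormSketch and the example term weights are made up for illustration.

```java
// Illustrative model only: shows that queryNorm is a per-query constant.
final class QueryNormSketch {

    // Assumed to mirror DefaultSimilarity.queryNorm in Lucene 3.x.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Sum of the squared weights of all terms in one query.
    static float sumOfSquaredWeights(float[] termWeights) {
        float sum = 0f;
        for (float w : termWeights) {
            sum += w * w;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] queryA = {3f, 4f};         // hypothetical term weights
        float[] queryB = {1f, 1f, 1f, 1f}; // hypothetical term weights

        // Each query gets its own constant scaling factor.
        System.out.println(queryNorm(sumOfSquaredWeights(queryA)));
        System.out.println(queryNorm(sumOfSquaredWeights(queryB)));
    }
}
```

The factor is the same for every document matched by one query, so it does not change the ranking within that query, but it differs from query to query. Returning 1.0 instead removes that per-query scaling.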

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 11:39 AM
    To: java-user@lucene.apache.org
    Subject: Re: comparing lucene scores across queries

    ok thanks, I will pass well I dunno how to verify it. Even if I try then I get some
    scores, but I dunno if comparing them is reliable.

    On 28 March 2011 11:36, Uwe Schindler wrote:

    Hi,

    You don't need to extend BooleanQuery, you can just pass "true" in its
    ctor,
    see: http://s.apache.org/QvK
    Of course you can also subclass DefaultSimilarity and return 1 as
    coord, but that is more work than passing true to a ctor.

    For your type of queries, disabling coord should be enough, but I am
    not 100% sure! Why not simply try it out?

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 10:49 AM
    To: java-user@lucene.apache.org
    Subject: Re: comparing lucene scores across queries

    One more thing, instead of extending the BooleanQuery class to
    remove the coord factor, can I also extend the Similarity class to do
    it ?
    Still the other question is open: just to be sure, if I disable the
    coord factor I
    can finally compare my BooleanQuery results ?

    thanks

    On 28 March 2011 10:11, Uwe Schindler wrote:

    Hi Patrick,

    You can disable the coord factor in the constructor of
    BooleanQuery.
    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 10:09 AM
    To: java-user@lucene.apache.org
    Subject: Re: comparing lucene scores across queries

    Hi, thanks for reply.

    Yeah, I've read the Similarity class documentation several times,
    but I need
    some tip.

    My queries are BooleanQueries but they always have the same
    structure (the same structure of the docs, they are actually docs
    from
    collection):
    3
    fields.

    What if I simplify the similarity scores, by removing coord
    factor
    and just
    leaving the cosine similarity which is comparable ?

    I want to underline the fact that my boolean queries are just a
    combination
    of "field:term" items, and I always have the same 3 fields with different
    terms obviously.

    Thanks



    On 28 March 2011 10:03, Uwe Schindler wrote:

    No, scores are in general not comparable between different
    queries.
    The problem lies in many things:
    - Each query has a norm factor that makes it more compareable
    if
    they are sub clauses of a BooleanQuery. But you are right, this
    norm factor should be the same.
    - Some queries like FuzzyQuery rely on the terms in index and
    those matches the query
    - Inside Boolean queries, there is also a coord-factor involved

    If you are always using the same simple type of query (e.g.
    simple TermQuery, only with different term) on the same index,
    you can compare the scores. As soon as you are using complex
    queries (e.g several terms compared in a BooleanQuery as
    QueryParser produces), the scores are no longer comparable.

    You can read more on all factors that are included in scoring:
    http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/sear
    ch/
    Simila
    rity.html

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 9:44 AM
    To: java-user@lucene.apache.org
    Subject: comparing lucene scores across queries

    Hi,

    sorry I've already asked few days ago, but I got no reply and
    I
    really need
    some help on this..

    I'm running several queries against a doc collection. The
    queries
    are documents of the collection itself, I need to measure how
    similar is each document to the rest of the collection.

    Now, Lucene returns me a score per query, but I've been told
    such
    score is
    not comparable across queries. Is this correct ?

    For example, arem't these scores comparable ?
    query1, score:8.324234
    query2, score:3.324238

    If so, why not ? Isn't the cosine similarity between the
    query
    vector and collection docs vectors ? I really need a
    comparable
    measure.
    thanks
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Patrick Diviacco at Mar 28, 2011 at 10:22 am
    I see. If you say the norm isn't a problem in my case, I will just
    disable the coord factor by constructing the query with
    new BooleanQuery(true), and I should be done.

    If this is not correct, please let me know.
    On 28 March 2011 11:44, Uwe Schindler wrote:

    Hi,

    As you seem to want to do very specific things, it might still be
    interesting to provide a modified Similarity (by subclassing
    DefaultSimilarity). You could then, for example, also return 1.0 from
    queryNorm() to disable it, since it may also be a problem (but it isn't
    for your queries). Theoretically, you can change the Similarity so that
    only the cosine similarity is left over, if that is the only one you
    want to use.
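For reference, plain cosine similarity between two term-weight vectors can be sketched in standalone Java as below. This is not Lucene code: the class name `CosineSketch`, the map-based vectors, and the example weights are all assumptions made for illustration. Because the dot product is divided by both vector lengths, the result stays between 0 and 1 for non-negative weights, which is the property that makes it comparable from one query to the next.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: term -> weight maps stand in for query and document
// vectors. None of these names come from Lucene.
final class CosineSketch {

    // Euclidean length of a sparse vector.
    static double length(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // dot(a, b) / (|a| * |b|); defined as 0 when either vector has length 0.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double la = length(a);
        double lb = length(b);
        return (la == 0.0 || lb == 0.0) ? 0.0 : dot / (la * lb);
    }

    public static void main(String[] args) {
        // Hypothetical weights for a query document and a collection document.
        Map<String, Double> query = new HashMap<>();
        query.put("field1:lucene", 1.0);
        query.put("field2:score", 2.0);

        Map<String, Double> doc = new HashMap<>();
        doc.put("field1:lucene", 3.0);
        doc.put("field3:index", 1.0);

        System.out.println(cosine(query, doc));
    }
}
```

Two vectors that point in the same direction score 1 regardless of their lengths, and two vectors with no shared terms score 0.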

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 11:39 AM
    To: java-user@lucene.apache.org
    Subject: Re: comparing lucene scores across queries

    ok thanks, I will pass well I dunno how to verify it. Even if I try then
    I get some scores, but I dunno if comparing them is reliable.

    On 28 March 2011 11:36, Uwe Schindler wrote:

    Hi,

    You don't need to extend BooleanQuery, you can just pass "true" in its
    ctor, see: http://s.apache.org/QvK
    Of course you can also subclass DefaultSimilarity and return 1 as
    coord, but that is more work than passing true to a ctor.

    For your type of queries, disabling coord should be enough, but I am
    not 100% sure! Why not simply try it out?
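To make the effect of the coord factor concrete, here is a minimal standalone sketch. It re-implements the default formula described in the Lucene 3.x Similarity documentation (matched clauses divided by total clauses) in plain Java rather than calling Lucene; the class `CoordSketch`, its method names, and the raw score of 2.0 are assumptions for illustration only.

```java
// Illustrative re-implementation, not the Lucene classes themselves.
final class CoordSketch {

    // Default behaviour: the fraction of the query's clauses that the
    // document matched.
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // What disabling coord amounts to: the factor is always 1.
    static float disabledCoord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    public static void main(String[] args) {
        float rawScore = 2.0f; // hypothetical sum of the matched clause scores

        // The same document matches 3 clauses of a 6-clause query and
        // 3 clauses of a 7-clause query.
        System.out.println(rawScore * defaultCoord(3, 6));
        System.out.println(rawScore * defaultCoord(3, 7));

        // With coord disabled, the clause count no longer scales the score.
        System.out.println(rawScore * disabledCoord(3, 6));
        System.out.println(rawScore * disabledCoord(3, 7));
    }
}
```

With the default factor, the same three matches are scaled differently depending on how many clauses the query has; with the factor fixed at 1 that particular source of variation disappears, although the other scoring factors still apply.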

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 10:49 AM
    To: java-user@lucene.apache.org
    Subject: Re: comparing lucene scores across queries

    One more thing, instead of extending the BooleanQuery class to
    remove the coord factor, can I also extend the Similarity class to do
    it ?
    Still the other question is open: just to be sure, if I disable the
    coord factor I can finally compare my BooleanQuery results ?

    thanks

    On 28 March 2011 10:11, Uwe Schindler wrote:

    Hi Patrick,

    You can disable the coord factor in the constructor of BooleanQuery.

    Uwe

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
    eMail: uwe@thetaphi.de

    -----Original Message-----
    From: Patrick Diviacco
    Sent: Monday, March 28, 2011 10:09 AM
    To: java-user@lucene.apache.org
    Subject: Re: comparing lucene scores across queries

    Hi, thanks for reply.

    Yeah, I've read the Similarity class documentation several times, but
    I need some tips.

    My queries are BooleanQueries, but they always have the same
    structure (the same structure as the docs; they are actually docs
    from the collection): 3 fields.

    What if I simplify the similarity scores by removing the coord factor
    and just leaving the cosine similarity, which is comparable?

    I want to underline the fact that my boolean queries are just a
    combination of "field:term" items, and I always have the same 3
    fields, with different terms obviously.

    Thanks



  • Chris Hostetter at Mar 28, 2011 at 11:58 pm
    : I see, well if you say the norm isn't a problem for my case, I will just
    : disable the coord factor by initializing BooleanQuery(true); and I should be
    : done.

    querynorm shouldn't be a problem (since your booleanqueries all have the
    same structure, and don't use query boosts ... i assume) but field norm
    might be; i also don't see anything mentioned so far in this thread that
    describes how you'll work around the tf and idf values being theoretically
    unbounded (unless your docs are all of identical length)

    ultimately, attempts at comparing scores across different searches all
    come down to normalizing (either explicitly or implicitly) and normalizing
    requires that you have a "max possible score" you can normalize relative
    to -- not just a "max score for the index", but a max score in the scope
    of all theoretical documents (because otherwise the comparison isn't fair
    given an arbitrary corpus)

    with the default similarity, you can't really define a "max possible
    score" for a given query because tf and idf are not bounded functions.
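The unboundedness mentioned above can be seen from the default formulas given in the Lucene 3.x Similarity documentation: tf is the square root of the term frequency, and idf is 1 + ln(numDocs / (docFreq + 1)). The sketch below re-implements those two formulas in plain Java; the class name `TfIdfSketch` and the sample frequencies and index sizes are assumptions chosen for illustration.

```java
// Plain-Java copies of the documented default formulas, for illustration.
final class TfIdfSketch {

    // Default tf: square root of how often the term occurs in the field.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Default idf: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // tf keeps growing as the term frequency grows.
        for (float freq : new float[] {1f, 100f, 10000f}) {
            System.out.println("tf(" + freq + ") = " + tf(freq));
        }
        // idf of a rare term keeps growing as the index grows.
        for (int numDocs : new int[] {1000, 1000000, 1000000000}) {
            System.out.println("idf(1, " + numDocs + ") = " + idf(1, numDocs));
        }
    }
}
```

Neither function has an upper limit, so there is no fixed maximum that a raw score could be divided by.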


    There have been a few nice discussions about this general concept over the
    years, here's the first one i found doing a quick search...

    http://www.gossamer-threads.com/lists/lucene/java-user/61075





    -Hoss

  • Patrick Diviacco at Mar 29, 2011 at 8:31 am
    hey Hoss,

    thanks for your reply. I thought I had solved the issue: according to
    Uwe, the queries without the coord factor were reasonably comparable,
    but now you have actually reopened it.

    So, I need to be sure I'm making them comparable, and I would like to
    ask the following.

    My BooleanQueries have a similar structure. Important: they only contain
    TermQueries. There are always 3 fields, but the number of terms can
    vary. This is an example of a BooleanQuery (sorry for the syntax):

    field1:term1, SHOULD
    field1:term2, SHOULD
    field2:term1, SHOULD
    field2:term2, SHOULD
    field2:term3, SHOULD
    field3:term1, SHOULD
    ...

    If the structure of the BooleanQueries is not clear, I can print some of
    them for you. They have the same number of fields but a different number
    of terms.

    1- Do you still think QueryNorm is not an issue? Funny, because in the
    documentation I can read:
    QueryNorm(q) is a normalizing factor used to make scores between queries
    comparable. This factor does not affect document ranking (since all ranked
    documents are multiplied by the same factor), but rather just attempts to
    make scores from different queries (or even different indexes) comparable.

    From the documentation, it seems I can compare scores across queries.



    2- I don't think I'm using query boosts. Are they enabled by default in
    BooleanQuery?

    3- FieldNorm is not mentioned in the Similarity class. How can I disable
    it? Should I disable it? Is it an issue?

    4- If I'm not wrong, Uwe told me I can compute comparable cosine
    similarities even with documents of different lengths. Tf and idf are
    unbounded, and my docs have different lengths. Can't I measure the
    similarity between query and doc vectors anyway?

    5- Again, I've been told I can compare queries, and from the documentation
    I can see that the queryNorm factor normalizes all queries. But you are
    saying I should manually normalize them somehow? That is not clear to me.

    thanks
    Patrick

  • Uwe Schindler at Mar 29, 2011 at 8:49 am

    : 1- Do you still think QueryNorm is not an issue ? Funny, because in the
    : documentation I can read:
    : QueryNorm(q) is a normalizing factor used to make scores between queries
    : comparable. This factor does not affect document ranking (since all ranked
    : documents are multiplied by the same factor), but rather just attempts to
    : make scores from different queries (or even different indexes) comparable.
    :
    : It seems I can compare queries from the documentation.

    But as you are always using the same type of query (TermQuery), the
    QueryNorm should not change, so it is no issue at all. It is different if
    you have a variable number of Boolean clauses: there the query norm could
    help you to make the queries comparable. But if you always have the same
    looking BQ with the exact same number of TQs in it (only different
    terms), it's not an issue at all. In all other cases, the query norm helps
    to compare e.g. a BQ with 5 TQ clauses with another BQ that has 8 TQ
    clauses.
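As a rough illustration of how the query norm reacts to the number of clauses, the sketch below re-implements the default formula from the Lucene 3.x Similarity documentation, queryNorm = 1 / sqrt(sumOfSquaredWeights), in plain Java. It assumes every boost is 1.0, so each TermQuery clause contributes idf squared to the sum; the class name `QueryNormSketch` and the idf value of 2.0 are hypothetical.

```java
// Plain-Java sketch of the documented default queryNorm, for illustration.
final class QueryNormSketch {

    // Sum of (idf * boost)^2 over the clauses, with every boost assumed 1.0.
    static float sumOfSquaredWeights(float[] termIdfs) {
        float sum = 0.0f;
        for (float idf : termIdfs) {
            sum += idf * idf;
        }
        return sum;
    }

    // Default queryNorm: 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Hypothetical queries: 5 and 8 TermQuery clauses, each with idf 2.0.
        float[] five = {2f, 2f, 2f, 2f, 2f};
        float[] eight = {2f, 2f, 2f, 2f, 2f, 2f, 2f, 2f};

        System.out.println(queryNorm(sumOfSquaredWeights(five)));
        System.out.println(queryNorm(sumOfSquaredWeights(eight)));
    }
}
```

Under these assumptions the 8-clause query gets a smaller norm than the 5-clause one, which scales down the larger sum of clause scores. Note that the formula also depends on the idf of each term, so in this sketch two queries with the same number of clauses but different terms would not necessarily get the same norm.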

    : 2- I don't think I'm using queryBoosts, are they enabled by default in the
    : BooleanQuery ?

    Query boosts are only active if you call
    TermQuery.setBoost(anything != 1.0f).

    : 3- FieldNorm is not mentioned in Similarity class. How can I disable it ?
    : Should I disable it ? Is it an issue ?

    FieldNorm should not be a problem, as it's an indexed feature. So the same
    document always has the same FieldNorm (which is a combination of length
    norm and index-time document boost). If two queries hit the same document,
    the scores for this document should be comparable, as the FieldNorm is
    the same in both cases.

    See point 6) in the Similarity docs: norm(t,d)
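For readers who do not have the Similarity docs at hand, norm(t,d) is described there as the product of the document boost, the field boost, and a length norm, where the default length norm is 1 / sqrt(number of terms in the field). The sketch below re-implements that product in plain Java. The class name `FieldNormSketch` and the sample values are assumptions, and the sketch ignores the fact that Lucene stores the norm in a single byte, which loses precision.

```java
// Plain-Java sketch of the documented norm(t,d) factors, for illustration.
final class FieldNormSketch {

    // Default length norm: shorter fields get a larger value.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // norm(t,d) = document boost * field boost * lengthNorm(field).
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Hypothetical fields of 4 and 100 terms, with no boosts applied.
        System.out.println(norm(1f, 1f, 4));
        System.out.println(norm(1f, 1f, 100));
    }
}
```

The value depends only on the indexed document, not on the query, which is why it is the same for every query that hits that document. It does differ between documents of different lengths.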

    : 4- If I'm not wrong Uwe told me I can compute comparable cosine
    : similarities even with documents of different length. Tf and Idf are
    : unbounded, and my docs have different length. Can't I measure the
    : similarity between query and doc vectors anyway ?

    The field norm normalizes that. So where is the problem?

    : 5 - Again, I've been told I can compare queries and from documentation, I
    : can see that queryNorm factor normalizes all queries. But you are saying I
    : should manually normalize them somehow ? It is not clear

    It only affects different queries (e.g. when the number of Boolean clauses
    differs or the type of query differs).

    Uwe


  • Patrick Diviacco at Mar 29, 2011 at 8:58 am
    hey Uwe, so from your last answer, I understand I'm done: no need to do
    anything, I can already compare the queries.

    However, there is actually a misunderstanding: my BooleanQueries have a
    variable number of clauses, because the fields are fixed but the number
    of terms per field is not. So, for example, I have:

    BooleanQuery1:
    field1:term, SHOULD
    field1:term, SHOULD
    field2:term, SHOULD
    field2:term, SHOULD
    field2:term, SHOULD
    field3:term, SHOULD

    BooleanQuery2:
    field1:term, SHOULD
    field2:term, SHOULD
    field3:term, SHOULD
    field3:term, SHOULD
    field3:term, SHOULD
    field3:term, SHOULD
    field3:term, SHOULD

    Are any of the points we discussed so far no longer valid?

    thanks
    On 29 March 2011 10:48, Uwe Schindler wrote:

    thanks for your reply. I thought I've solved the issue according to Uwe, the
    queries without coord function were reasonably comparable, but now you
    actually reopened it.

    So, I need to be sure I'm making them comparable and I would like to ask the
    following.

    My BooleanQueries have similar structure. Important: they only contain
    TermQueries. The fields are always 3 but the terms number can vary...
    this
    is
    an example of BooleanQuery (sorry for the syntax):

    field1:term1, SHOULD
    field1:term2, SHOULD
    field2:term1, SHOULD
    field2:term2, SHOULD
    field2:term3, SHOULD
    field3:term1, SHOULD
    ...

    If it is not clear how the BooleanQueries are, I can print some of them for
    you. They have same number of fields but different number of terms.

    1- Do you still think QueryNorm is not an issue ? Funny, because in the
    documentation I can read:
    QueryNorm(q) is a normalizing factor used to make scores between queries
    comparable. This factor does not affect document ranking (since all ranked
    documents are multiplied by the same factor), but rather just attempts to
    make scores from different queries (or even different indexes)
    comparable.
    It seems I can compare queries from the documentation.
    But as you are always using the same type of query (TermQuery), the
    QueryNorm should not change, so no issue at all. It differs if you have a
    variable number of Boolean clauses, the Query norm could help you to make
    the queries comparable. But if you only have always the same looking BQ
    with
    exact same number of TQ in it (only different terms) its not an issue at
    all. In all other cases, the query norm helps to compare e.g. a BQ with 5
    TQ
    clauses with another BQ that has 8 TQ clauses.
    2- I don't think I'm using queryBoosts, are they enabled by default in the
    BooleanQuery ?
    Query boost are only active if you do TermQuery.setBoost(anything != 1.0f).
    3- FieldNorm is not mentioned in Similarity class. How can I disable it ?
    SHould I disable it ? Is it a issue ?
    FieldNorm should not be a problem, as it's an indexed feature. So the same
    document has always the same FieldNorm (which is a combination of length
    norm, indexing document boost). If two queries hit the same document the
    scores for this document should be comparable, as the FieldNorm is the same
    for both cases.

    See point 6) in the Similarity docs: norm(t,d)
    4- If I'm not wrong Uwe told me I can compute comparable cosine
    similarities
    even with documents of different length. Tf and Idf are unbounded, and my
    docs have different length. Can't I measure the similarity between query and
    doc vectors anyway ?
    The field norm normalizes that. So where is the problem?
    5 - Again, I've been told I can compare queries and from documentation, I
    can see that queryNorm factor normalizes all queries. But you are saying I
    should manually normalize them somehow ? It is not clear
    It only affects different querys (e.g. number of Boolean clauses differ,
    type of queries differ).

    Uwe


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Vinaya Kumar Thimmappa at Mar 29, 2011 at 6:58 am
    Hello All,

    I am looking for a Japanese/Chinese stemmer. Does one exist? Do we
    require one?
    (Analyzers are already present in Lucene.)

    I did a Google search and did not find any conclusive answer.

    Thanks in advance
    vinaya


Discussion Overview
group: java-user @
categories: lucene
posted: Mar 28, '11 at 7:44a
active: Mar 29, '11 at 8:58a
posts: 15
users: 4
website: lucene.apache.org
