I'm using BoostingTermQuery to boost the score of documents with terms
containing payloads (boost value > 1). I'd like to change the scoring
behavior such that if a query contains multiple BoostingTermQuery terms
(either required or optional), documents containing more matching terms with
payloads always score higher than documents with fewer terms with payloads.
Currently, if one of the terms has a high IDF weight and carries a boosting
payload, a document with no payloads on its other matching terms may still
score higher than docs whose other matching terms have payloads but lower IDF.

I think what I need is a way to increase the weight of a matching term in
BoostingSpanScorer.score() if 'payloadsSeen > 0', but I don't see how to do
this. Any suggestions?

Thanks,
Peter
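
The ranking inversion described above can be reproduced with toy numbers. The sketch below is not Lucene code: the class name, the per-term weights (standing in for IDF-derived weights), and the payload boost of 5 are all made up for illustration. It assumes a document's score is the sum of its per-term weights, each multiplied by the payload boost when that term carries a payload.

```java
public class PayloadRankingDemo {
    // Hypothetical boost applied to a matching term that carries a payload.
    static final double PAYLOAD_BOOST = 5.0;

    /**
     * Toy score: sum of per-term weights, each multiplied by PAYLOAD_BOOST
     * when the term carries a payload in this document.
     */
    static double score(double[] termWeights, boolean[] hasPayload) {
        double total = 0.0;
        for (int i = 0; i < termWeights.length; i++) {
            total += termWeights[i] * (hasPayload[i] ? PAYLOAD_BOOST : 1.0);
        }
        return total;
    }

    /** Number of matching terms that carry a payload in this document. */
    static int payloadCount(boolean[] hasPayload) {
        int count = 0;
        for (boolean p : hasPayload) {
            if (p) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up weights: term 0 is rare (high IDF), terms 1 and 2 are common.
        double[] weights = {60.0, 10.0, 10.0};
        boolean[] docA = {true, false, false}; // payload only on the rare term
        boolean[] docB = {false, true, true};  // payloads on both common terms

        System.out.println("docA payload terms=" + payloadCount(docA)
                + " score=" + score(weights, docA));
        System.out.println("docB payload terms=" + payloadCount(docB)
                + " score=" + score(weights, docB));
    }
}
```

With these assumed weights, docA outscores docB even though docB has more payload-bearing terms, which is the behavior the question is trying to avoid.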

  • Grant Ingersoll at Nov 6, 2008 at 1:01 pm
    Not sure, but it sounds like you are interested in a higher level
    Query, kind of like the BooleanQuery, but then part of it sounds like
    it is per document, right? Is it that you want to deal with multiple
    payloads in a document, or multiple BTQs in a bigger query?
    --------------------------
    Grant Ingersoll


    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Peter Keegan at Nov 6, 2008 at 6:09 pm
    Let me give some background on the problem behind my question.

    Our index contains many fields (title, body, date, city, etc). Most queries
    search all fields, but for best performance, we create an additional
    'contents' field that contains all terms from all fields so that only one
    field needs to be searched. Some fields, like title and city, are boosted by
    a factor of 5. In order to make term boosting work, we create an additional
    field 'boost' that contains all the terms from the boosted fields (title,
    city).

    Then, at search time, a query for "petroleum engineer" gets rewritten to:
    (+contents:petroleum +contents:engineer) (+boost:petroleum +boost:engineer).
    Note that the two clauses are OR'd so that a term that exists in both fields
    will get a higher weight in the 'boost' field. This works quite well at
    boosting documents with terms that exist in the boosted fields. However, it
    doesn't work properly if excluded terms are added, for example:

    (+contents:petroleum +contents:engineer -contents:drilling)
    (+boost:petroleum +boost:engineer -boost:drilling)

    If a document contains the term 'drilling' in the 'body' field, but not in
    the 'title' or 'city' field, a false hit occurs.
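
The false hit can be illustrated with a small stand-alone sketch. It does not use Lucene: the class, the method names, and the toy document are hypothetical, and each clause is modeled simply as "all required terms present and no prohibited term present in that field".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExcludedTermDemo {
    /** One clause: every required term is in the field and no prohibited term is. */
    static boolean clauseMatches(Set<String> fieldTerms,
                                 List<String> required,
                                 List<String> prohibited) {
        return fieldTerms.containsAll(required)
                && prohibited.stream().noneMatch(fieldTerms::contains);
    }

    /** The two clauses are OR'd: a match in either field makes the document a hit. */
    static boolean matches(Map<String, Set<String>> doc,
                           List<String> required,
                           List<String> prohibited) {
        return clauseMatches(doc.get("contents"), required, prohibited)
                || clauseMatches(doc.get("boost"), required, prohibited);
    }

    public static void main(String[] args) {
        // Toy document: title "petroleum engineer", body mentions "drilling".
        // 'contents' holds all terms; 'boost' holds only the title terms.
        Map<String, Set<String>> doc = Map.of(
                "contents", Set.of("petroleum", "engineer", "drilling"),
                "boost", Set.of("petroleum", "engineer"));

        boolean hit = matches(doc,
                List.of("petroleum", "engineer"), List.of("drilling"));
        System.out.println("document matches: " + hit);
    }
}
```

In this model the 'contents' clause rejects the document, but the 'boost' clause accepts it because 'drilling' never made it into the 'boost' field, so the OR of the two clauses still returns the document.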

    Enter payloads and 'BoostingTermQuery'. At indexing time, as terms are added
    to the 'contents' field, they are assigned a payload (value=5) if the term
    also exists in one of the boosted fields. The 'scorePayload' method in our
    Similarity class returns the payload value as a score. The query no longer
    contains the 'boost' fields and is simply:

    +contents:petroleum +contents:engineer -contents:drilling

    The goal is to make the payload technique behave like the 'boost' field
    technique. The problem is that relevance scores of the top hits are
    sometimes quite different. The reason is that the IDF value for a given
    term in the 'boost' field is often much higher than for the same term in
    the 'contents' field. This makes sense because the 'boost' field contains a
    fairly small subset of the 'contents' field. Even with a payload of '5', a
    low IDF in the 'contents' field usually erases the effect of the payload.
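
A rough calculation shows how a payload of 5 can be outweighed. The sketch below uses the IDF formula of Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), and assumes IDF enters the term weight squared, as in Lucene's classic TF-IDF scoring. The index size and document frequencies are invented for illustration, and the class and method names are hypothetical.

```java
public class FieldIdfDemo {
    /** IDF in the style of Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Simplified term weight: payload factor times IDF squared (tf and norms ignored). */
    static double termWeight(double idf, double payload) {
        return payload * idf * idf;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;  // hypothetical index size
        int dfContents = 100_000; // hypothetical: term is common in 'contents'
        int dfBoost = 1_000;      // hypothetical: term is rare in 'boost'

        double idfContents = idf(dfContents, numDocs);
        double idfBoost = idf(dfBoost, numDocs);

        // 'contents' term with a payload of 5 versus the same term in 'boost' with no payload.
        System.out.printf("contents: idf=%.2f weight with payload=%.1f%n",
                idfContents, termWeight(idfContents, 5.0));
        System.out.printf("boost:    idf=%.2f weight without payload=%.1f%n",
                idfBoost, termWeight(idfBoost, 1.0));
    }
}
```

Under these assumed frequencies, the payload-boosted weight in 'contents' still comes out below the unboosted weight in 'boost', which matches the observation that the low IDF erases the payload's effect.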

    I have found a fairly simple (albeit inelegant) solution that seems to work.
    The 'boost' field is still created as before, but it is only used to compute
    IDF values for the weight class 'BoostingTermQuery.BoostingTermWeight'. I had
    to make this class 'public' so that I could override the IDF value as
    follows:

    public class MNSBoostingTermQuery extends BoostingTermQuery {
        public MNSBoostingTermQuery(Term term) {
            super(term);
        }

        protected class MNSBoostingTermWeight extends
                BoostingTermQuery.BoostingTermWeight {
            public MNSBoostingTermWeight(BoostingTermQuery query, Searcher searcher)
                    throws IOException {
                super(query, searcher);
                java.util.HashSet<Term> newTerms = new java.util.HashSet<Term>();
                // Recompute IDF based on 'boost' field
                Iterator i = terms.iterator();
                Term term = null;
                while (i.hasNext()) {
                    term = (Term) i.next();
                    newTerms.add(new Term("boost", term.text()));
                }
                this.idf = this.query.getSimilarity(searcher).idf(newTerms, searcher);
            }
        }
    }

    Any thoughts about a better implementation are welcome.

    Peter
  • Peter Keegan at Nov 6, 2008 at 9:25 pm
    I've discovered another flaw in the two-clause 'boost' field technique:

    (+contents:petroleum +contents:engineer +contents:refinery)
    (+boost:petroleum +boost:engineer +boost:refinery)

    It's possible that the first clause will produce a matching doc and none of
    the terms in the second clause are used to score that doc. Yet another
    reason to use BoostingTermQuery.

    Peter
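
The all-or-nothing behavior of the second clause can be sketched without Lucene. The class, the method, and the toy document below are hypothetical; a clause made only of required terms is modeled as contributing to the score only when every one of its terms is present in the field.

```java
import java.util.List;
import java.util.Set;

public class AllRequiredClauseDemo {
    /**
     * Number of terms a clause of all-required terms contributes to scoring:
     * all of them if every term is in the field, otherwise none.
     */
    static int scoringTerms(Set<String> fieldTerms, List<String> required) {
        return fieldTerms.containsAll(required) ? required.size() : 0;
    }

    public static void main(String[] args) {
        List<String> query = List.of("petroleum", "engineer", "refinery");

        // Toy document: title "petroleum engineer", body mentions "refinery".
        Set<String> contents = Set.of("petroleum", "engineer", "refinery");
        Set<String> boost = Set.of("petroleum", "engineer");

        System.out.println("contents clause scoring terms: "
                + scoringTerms(contents, query));
        System.out.println("boost clause scoring terms: "
                + scoringTerms(boost, query));
    }
}
```

Here the document matches through the 'contents' clause, but the 'boost' clause fails as a whole because 'refinery' is missing from it, so the two title terms earn no boost at all.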

  • Steven A Rowe at Nov 6, 2008 at 11:57 pm
    Hi Peter,

    On 11/06/2008 at 4:25 PM, Peter Keegan wrote:
    > I've discovered another flaw in using this technique:
    >
    > (+contents:petroleum +contents:engineer +contents:refinery)
    > (+boost:petroleum +boost:engineer +boost:refinery)
    >
    > It's possible that the first clause will produce a matching
    > doc and none of the terms in the second clause are used to
    > score that doc. Yet another reason to use BoostingTermQuery.

    I think you could address this, without BTQ, using something like:

    boost:(+petroleum +engineer +refinery)
    (+contents:(+petroleum +engineer +refinery)
     +((*:* -boost:petroleum)
       (*:* -boost:engineer)
       (*:* -boost:refinery)))

    The last three lines give you the set of documents that are missing at
    least one of the terms in the "boost" field. The *:* thingy, which denotes
    a MatchAllDocsQuery, is necessary to get all documents that don't have a
    given term; Lucene's (sub-)query document exclusion operation needs a
    non-empty set on which to operate.
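
The set logic behind the (*:* -boost:term) sub-clauses can be sketched with plain Java collections. This is not Lucene code: the class, the method, the document ids, and the postings map are hypothetical. Each sub-clause is modeled as "all documents minus the documents containing the term", and the sub-clauses are OR'd together.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MissingBoostTermDemo {
    /**
     * Documents missing at least one of the given terms: the union, over all
     * terms, of (allDocs minus the documents that contain the term).
     */
    static Set<Integer> missingAtLeastOne(Set<Integer> allDocs,
                                          Map<String, Set<Integer>> postings,
                                          List<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            Set<Integer> withoutTerm = new TreeSet<>(allDocs);             // *:*
            withoutTerm.removeAll(postings.getOrDefault(term, Set.of()));  // -boost:term
            result.addAll(withoutTerm);                                    // OR of sub-clauses
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> allDocs = Set.of(1, 2, 3, 4);
        // Made-up postings for the 'boost' field: term -> ids of documents containing it.
        Map<String, Set<Integer>> boostPostings = Map.of(
                "petroleum", Set.of(1, 2, 3),
                "engineer", Set.of(1, 2),
                "refinery", Set.of(1, 4));

        Set<Integer> missing = missingAtLeastOne(allDocs, boostPostings,
                List.of("petroleum", "engineer", "refinery"));
        System.out.println("docs missing at least one boost term: " + missing);
    }
}
```

Only document 1 has all three terms in 'boost', so it is the only one left out of the result; starting each sub-clause from the full document set is what gives the exclusion something to subtract from.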

    On 11/06/2008 at 1:08 PM, Peter Keegan wrote:
    > Then, at search time, a query for "petroleum engineer" gets rewritten
    > to: (+contents:petroleum +contents:engineer) (+boost:petroleum
    > +boost:engineer). Note that the two clauses are OR'd so that a term that
    > exists in both fields will get a higher weight in the 'boost' field.
    > This works quite well at boosting documents with terms that exist in the
    > boosted fields. However, it doesn't work properly if excluded terms are
    > added, for example:
    >
    > (+contents:petroleum +contents:engineer -contents:drilling)
    > (+boost:petroleum +boost:engineer -boost:drilling)
    >
    > If a document contains the term 'drilling' in the 'body'
    > field, but not in the 'title' or 'city' field, a false hit occurs.

    I think you could address this problem like this:

    +(boost:(+petroleum +engineer)
      (+contents:(+petroleum +engineer)
       +((*:* -boost:petroleum)
         (*:* -boost:engineer))))
    -contents:drilling

    You don't have to include "-boost:drilling", because this condition is entailed by "-contents:drilling".

    Steve

  • Peter Keegan at Nov 7, 2008 at 6:33 pm
    On Thu, Nov 6, 2008 at 6:56 PM, Steven A Rowe wrote:
    > boost:(+petroleum +engineer +refinery)
    > (+contents:(+petroleum +engineer +refinery)
    > +((*:* -boost:petroleum)
    > (*:* -boost:engineer)
    > (*:* -boost:refinery)))

    That's an interesting solution. Would this result in many more documents
    being visited by the scorer, possibly impacting performance? (I haven't
    tried it yet).

    Thanks,
    Peter

Discussion Overview
group: java-user
categories: lucene
posted: Nov 4, '08 at 2:43p
active: Nov 7, '08 at 6:33p
posts: 6
users: 3
website: lucene.apache.org
