FAQ
Is there any way to build a query where the occurrence of a particular
Term (in a Keyword field) causes the rank of the document to be
decreased? I have various types of documents, and some of them are less
interesting than others, so I want them to be pushed towards the bottom
of the results ranking. However, I do not want to eliminate them
entirely, so I can't use a boolean not.

Using negative weights would seem logical here, but apparently has no
effect on rankings - negative weights appear to be treated as zeros.

Any ideas would be appreciated.

Thanks,
Boris


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


  • Doug Cutting at Mar 18, 2004 at 6:33 pm
    Have you tried assigning these very small boosts (0 < boost < 1) and
    assigning other query clauses relatively large boosts (boost > 1)?

  • Boris Goldowsky at Mar 19, 2004 at 2:39 pm

    I asked:
    > Is there any way to build a query where the occurrence of a particular
    > Term (in a Keyword field) causes the rank of the document to be
    > decreased?

    On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
    > Have you tried assigning these very small boosts (0 < boost < 1) and
    > assigning other query clauses relatively large boosts (boost > 1)?

    Thanks for the suggestion! Unfortunately it doesn't have the desired
    effect. I wanted

        title: asparagus
        various fields...
        doctype: bad

    to score lower than

        title: asparagus
        various similar fields...
        doctype: good

    I was trying to formulate a query like, say

        +(title: asparagus) (doctype:bad)^-3

    which would make sure the "bad" document was ranked lower than any other
    value for doctype. But negative boosts are illegal.

    I tried your suggestion of putting large boost on the first clause and a
    small one (0.01) on the second, but the "bad" document is still ranked
    higher than the good one -- it gets a slight improvement from the
    doctype:bad match, times 0.01, which is a very slight improvement but
    still positive. Then it gets a big boost because it has a 1.0 rather
    than a 0.5 coordination factor, so the bad item gets top billing.
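
The arithmetic behind this can be sketched with a toy model. This is not Lucene's scoring code: it assumes each matching clause contributes a made-up raw score of 1.0 times its boost, and that the sum is multiplied by a coordination factor of matched clauses over total clauses. The class name `CoordToy` and the boost of 10 on the title clause are illustrative choices.

```java
class CoordToy {
    /**
     * Toy score: sum of (raw score * boost) over the matched clauses,
     * multiplied by the fraction of clauses that matched.
     */
    static double score(double[] rawScores, double[] boosts, boolean[] matched) {
        double sum = 0.0;
        int overlap = 0;
        for (int i = 0; i < boosts.length; i++) {
            if (matched[i]) {
                sum += rawScores[i] * boosts[i];
                overlap++;
            }
        }
        return sum * overlap / boosts.length;
    }

    public static void main(String[] args) {
        double[] raw = {1.0, 1.0};       // assumed raw clause scores
        double[] boosts = {10.0, 0.01};  // title:asparagus^10, doctype:bad^0.01

        double bad = score(raw, boosts, new boolean[] {true, true});   // matches both clauses
        double good = score(raw, boosts, new boolean[] {true, false}); // matches the title only

        System.out.println("bad  = " + bad);
        System.out.println("good = " + good);
    }
}
```

In this model the "bad" document scores about 10.01 (coordination 2/2) and the "good" one 5.0 (coordination 1/2), so shrinking the boost on the second clause cannot reverse the order.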

    I think I've identified a few ways to solve the puzzle, though:

    (a) enumerate all the possible "good" types of documents and search for
    them, rather than the single bad one. Harder to maintain since doctypes
    can be introduced, but possible.

    (b) attach boost values less than one to the "bad" Documents at indexing
    time. Not as flexible as modifying the query, but plausible.

    (c) a more complex query like this:
    (title:asparagus) OR (title:asparagus -doctype:bad)
    so for good documents both clauses will match and the coordination
    factor will be in their favor. This increases query complexity (they
    aren't really simple one-term queries like this toy example), but
    hopefully that will not be a performance issue.

    Bng
  • Doug Cutting at Mar 19, 2004 at 4:55 pm

    Boris Goldowsky wrote:
    > On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
    >> Have you tried assigning these very small boosts (0 < boost < 1) and
    >> assigning other query clauses relatively large boosts (boost > 1)?
    >
    > I was trying to formulate a query like, say
    > +(title: asparagus) (doctype:bad)^-3
    >
    > which would make sure the "bad" document was ranked lower than any other
    > value for doctype. But negative boosts are illegal.
    >
    > I tried your suggestion of putting large boost on the first clause and a
    > small one (0.01) on the second, but the "bad" document is still ranked
    > higher than the good one -- it gets a slight improvement from the
    > doctype:bad match, times 0.01, which is a very slight improvement but
    > still positive. Then it gets a big boost because it has a 1.0 rather
    > than a 0.5 coordination factor, so the bad item gets top billing.

    I don't think you understood my proposal. You should try boosting the
    documents when you add them. Instead of adding a "doctype" field with
    "good" and "bad" values, use Document.setBoost(0.01) at index time.

    Also, you could disable coordination if you like by defining your own
    Similarity class.

    > I think I've identified a few ways to solve the puzzle, though:
    >
    > (a) enumerate all the possible "good" types of documents and search for
    > them, rather than the single bad one. Harder to maintain since doctypes
    > can be introduced, but possible.

    That would indeed work better in an additive scoring system like this.

    > (b) attach boost values less than one to the "bad" Documents at indexing
    > time. Not as flexible as modifying the query, but plausible.

    Yes, that's what I proposed. You can reset boost values later now too.

    > (c) a more complex query like this:
    > (title:asparagus) OR (title:asparagus -doctype:bad)
    > so for good documents both clauses will match and the coordination
    > factor will be in their favor. This increases query complexity (they
    > aren't really simple one-term queries like this toy example), but
    > hopefully that will not be a performance issue.

    I think modifying the coordination function would be better. Note that,
    in the current CVS codebase, you can modify the Similarity
    implementation on a per-clause basis. So you could construct a query
    that had negative coordination, i.e., that gives lower scores when more
    clauses match. This could be done by subclassing BooleanQuery and
    overriding its getSimilarity(Searcher) method.
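
The idea of "negative coordination" can be illustrated with a small stand-alone model. This is only a sketch and does not use Lucene classes: the `Coord` interface, the `demoting` helper and the clause scores are hypothetical, and giving the doctype:bad clause zero weight (so that it influences only the coordination factor) is an assumption made for the illustration.

```java
class NegativeCoordToy {
    /** Maps (matched clauses, total clauses) to a score multiplier. */
    interface Coord {
        double factor(int overlap, int maxOverlap);
    }

    /** Usual behaviour: the more clauses match, the larger the factor. */
    static final Coord DEFAULT = (overlap, max) -> (double) overlap / max;

    /** "Negative" coordination: demote only when every clause matched. */
    static Coord demoting(double demotion) {
        return (overlap, max) -> overlap == max ? demotion : 1.0;
    }

    /** Sum of the matched clause scores, times the coordination factor. */
    static double score(double[] clauseScores, boolean[] matched, Coord coord) {
        double sum = 0.0;
        int overlap = 0;
        for (int i = 0; i < clauseScores.length; i++) {
            if (matched[i]) {
                sum += clauseScores[i];
                overlap++;
            }
        }
        return sum * coord.factor(overlap, clauseScores.length);
    }

    public static void main(String[] args) {
        // Clause 0: title:asparagus (assumed score 1.0).
        // Clause 1: doctype:bad, given zero weight so it only affects coordination.
        double[] clauses = {1.0, 0.0};
        boolean[] bad = {true, true};
        boolean[] good = {true, false};

        Coord demote = demoting(0.01);
        System.out.println("default:  bad=" + score(clauses, bad, DEFAULT)
                + " good=" + score(clauses, good, DEFAULT));
        System.out.println("demoting: bad=" + score(clauses, bad, demote)
                + " good=" + score(clauses, good, demote));
    }
}
```

With the default factor the "bad" document wins because it matches more clauses; with the demoting factor the order flips, while the "bad" document still appears in the results.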

    Doug

  • Doug Cutting at Mar 19, 2004 at 4:59 pm

    Doug Cutting wrote:
    > On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
    >> Have you tried assigning these very small boosts (0 < boost < 1) and
    >> assigning other query clauses relatively large boosts (boost > 1)?
    >
    > I don't think you understood my proposal. You should try boosting the
    > documents when you add them. Instead of adding a "doctype" field with
    > "good" and "bad" values, use Document.setBoost(0.01) at index time.

    Sorry. My mistake. You did understand my proposal, it was just a bad
    proposal. Boosting documents is a better approach, but is less
    flexible. I think the final proposal in my previous message might be
    the best approach (defining a custom coordination function for these
    query clauses).

    Again, sorry for the false accusation,

    Doug

  • Boris Goldowsky at Mar 22, 2004 at 1:26 pm

    Thanks for the ideas - I love the flexibility of Lucene that there are
    so many ways to accomplish what at first seemed so difficult.

    Boris
  • Markharw00d at Mar 27, 2004 at 2:29 am
    I have not been able to work out how to get custom coordination going to
    demote results based on a specific term but have an alternative suggestion
    that looks like it might work:

    I've created a "MissingTermQuery", which is the opposite of a TermQuery
    and can be used to boost documents that DON'T have a specific term.
    This seems to have the desired effect of demoting, but not necessarily
    excluding, documents with a specific term. It has the side effect of producing
    some irrelevant low-scoring results when none of the other terms match.
    This can be counteracted by making the positive terms mandatory, e.g.

        +(wine beer) !cheap

    The ! character in the example above is used to denote a MissingTermQuery.
    This says: find all documents with "wine" or "beer" and favour ones that don't
    use the word "cheap". Of course, MissingTerms can be boosted with a value
    to emphasise the effect, e.g. !cheap^2

    It doesn't look like there's currently an elegant way of negating all the other
    query types (phrase, prefix...) without creating new "MissingXxxQuery" classes
    for each type.

    I've put an example implementation here:
    http://www.inperspective.com/lucene/demote.zip


    Cheers
    Mark
  • Doug Cutting at Mar 29, 2004 at 6:14 pm

    markharw00d@yahoo.co.uk wrote:
    > I have not been able to work out how to get custom coordination going to
    > demote results based on a specific term [ ... ]

    Yeah, it's a little more complicated than perhaps it should be.

    I've attached a class which does this. I think it's faster and more
    effective than what you proposed. This only works in the 1.4 codebase
    (current CVS), as it requires the new Query.getSimilarity() method.

    To use this, change the line in your test program from:

        Query balancedQuery =
            NegatingQuery.createQuery(positiveQuery, negativeQuery, 1);

    to

        Query balancedQuery =
            new BoostingQuery(positiveQuery, negativeQuery, 0.01f);

    Please tell me if you find it useful.
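
The attached class is not reproduced in this thread, so the following is only a toy model of the effect being discussed: documents that match the positive query are kept, and those that also match the negative query have their score multiplied by the third argument (0.01 here). Documents are plain strings, the predicates stand in for queries, and `BoostingToy` and its methods are hypothetical names rather than Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

class BoostingToy {
    /** Assumed base score of 1.0, multiplied by `boost` if the negative predicate matches. */
    static double score(String doc, Predicate<String> negative, double boost) {
        double base = 1.0; // made-up score for a positive match
        return negative.test(doc) ? base * boost : base;
    }

    /** Returns the documents matching `positive`, highest score first. */
    static List<String> rank(List<String> docs, Predicate<String> positive,
                             Predicate<String> negative, double boost) {
        List<String> hits = new ArrayList<>();
        for (String d : docs) {
            if (positive.test(d)) {
                hits.add(d);
            }
        }
        hits.sort(Comparator.comparingDouble(
                (String d) -> score(d, negative, boost)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("cheap wine", "fine wine", "cheese");
        // "cheap wine" is demoted below "fine wine" but is still returned.
        System.out.println(rank(docs, d -> d.contains("wine"), d -> d.contains("cheap"), 0.01));
    }
}
```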

    Doug
  • Markharw00d at Mar 28, 2004 at 10:42 pm
    I've found an elegant way of doing this now for all types of search: a new
    "NegatingQuery" class that takes any Query object in its constructor,
    selects all documents that DON'T match, and gives them a user-definable boost.

    The code is here:
    http://www.inperspective.com/lucene/demote.zip

    Cheers
    Mark
  • Stephane James Vaucher at Mar 29, 2004 at 3:27 am
    Mark,

    I've added a section in the wiki called:

    http://wiki.apache.org/jakarta-lucene/CommunityContributions

    and have added an entry for your message. If you want to edit the
    message, go for it. I believe that the wiki can support attached files if
    you want to upload there.

    cheers,
    sv
  • Markharw00d at Mar 29, 2004 at 8:13 pm
    Hi Doug,
    Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally
    useful than my implementation :-)
    Unless anyone has a particularly good reason, I'll remove the link to my code that
    Stephane put on the Wiki contributions page.
    I definitely find BoostingQuery very useful and would be happy to see it in Lucene
    core, but I'm not sure it's popular enough to warrant adding special support to the
    query parser.

    BTW, I've had a thought about your suggestion for making the highlighter use some
    form of RAM index of sentence fragments and then querying it to get the best
    fragments. This is nice in theory but could fail to find anything if the query is of
    these forms:

        a AND b
        "a b"

    When the code that breaks a doc into "sentence docs" splits co-occurring "a" and "b"
    terms into separate docs, this would produce no match. I don't think there's an easy
    way round that, so I'll stick to the current approach of scoring fragments simply
    based on terms found in the query.


    Cheers
    Mark
  • Doug Cutting at Mar 29, 2004 at 8:28 pm

    markharw00d@yahoo.co.uk wrote:
    > Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally
    > useful than my implementation :-)

    Great! Glad to hear it was useful.

    > BTW, I've had a thought about your suggestion for making the highlighter use some
    > form of RAM index of sentence fragments and then querying it to get the best
    > fragments. This is nice in theory but could fail to find anything if the query is
    > of these forms:
    > a AND b
    > "a b"
    > When the code that breaks a doc into "sentence docs" splits co-occurring "a" and
    > "b" terms into separate docs, this would produce no match. I don't think there's an
    > easy way round that, so I'll stick to the current approach of scoring fragments
    > simply based on terms found in the query.

    You could, if you fail to find any fragments that match the entire
    query, re-query the fragments with a flattened query containing just an
    OR of all of the original query terms.
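
A minimal sketch of that fallback, using plain string lists instead of a Lucene index: fragments are first required to contain every term, and only if none qualifies is the requirement relaxed to any term. The class and method names are hypothetical, tokenization is a simple whitespace split, and the terms are assumed to be lower case already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FragmentFallback {
    /** Fragments containing all terms, or, if there are none, fragments containing any term. */
    static List<String> select(List<String> fragments, List<String> terms) {
        List<String> strict = matching(fragments, terms, true);
        return strict.isEmpty() ? matching(fragments, terms, false) : strict;
    }

    /** Filters fragments by term presence; requireAll switches between AND and OR. */
    static List<String> matching(List<String> fragments, List<String> terms, boolean requireAll) {
        List<String> out = new ArrayList<>();
        for (String fragment : fragments) {
            List<String> words = Arrays.asList(fragment.toLowerCase().split("\\s+"));
            int found = 0;
            for (String term : terms) {
                if (words.contains(term)) {
                    found++;
                }
            }
            boolean ok = requireAll ? found == terms.size() : found > 0;
            if (ok) {
                out.add(fragment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b");
        // "a" and "b" were split into separate fragments, so the AND pass finds
        // nothing and the OR pass is used instead.
        System.out.println(select(Arrays.asList("a x", "b y", "z"), terms));
    }
}
```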

    Doug
  • Stephane James Vaucher at Mar 29, 2004 at 9:55 pm
    Mark,

    Thanks for the update. Since I contributed the page, I was going to modify
    it (I don't want to force work on others).

    sv
  • Markharw00d at Mar 29, 2004 at 9:04 pm

    > You could, if you fail to find any fragments that match the entire
    > query, re-query the fragments with a flattened query containing just an
    > OR of all of the original query terms.

    The other issue with this approach I'm still struggling with is simply the cost of
    creating the temporary index. I don't know if you got a chance to look at the
    "FastIndex" implementation I put together using TreeMaps. I was getting a 2x speed
    improvement over RAM indexes, but it was still 4 times slower than the basic cost of
    tokenization used by the current highlighter code. Costs for processing 50k worth of
    docs are as follows:

        fast indexing   : 1182 ms
        ramindexing     : 2413 ms
        just tokenizing :  310 ms

    Still quite an overhead, and I couldn't see any obvious means of improving on this.


Discussion Overview
group: java-user
categories: lucene
posted: Mar 17, '04 at 8:13p
active: Mar 29, '04 at 9:55p
posts: 14
users: 4
website: lucene.apache.org
