hi

How can I get Boolean-style matching for a query whose terms have weights?
I think my question is unclear or misleading, so here is an example.

I have a document that reads:
"I am eating an apple while using apple computer"

My Xapian query:
apple(weight:4)
computer(weight:3)

Instead of this document getting a weight of 11 (2 x apple, 1 x computer),
how can I make the matching Boolean, so that the document gets a weight of 7?

Is it possible to add a "penalty" to a query?
docA = "How to eat an apple while using apple computer"
docB = "I am eating an apple while using apple computer"

Query(apple:4,computer:3,how:-1) << Is it possible to apply a penalty (reduce
the weight) when a document contains the term "how", so that docB ranks higher?

How heavy would it be if I added a value of "hash(md5 HTML<title> X
websiteDomain)" to each document, and then used this key to collapse
duplicate titles within a domain using set_collapse_key? Is it way too heavy?

Thanks and really appreciated
Andrey K.


  • Olly Betts at Nov 2, 2007 at 5:43 am

    On Thu, Nov 01, 2007 at 09:49:39PM -0700, Andrey wrote:
    > I have a document that reads:
    > "I am eating an apple while using apple computer"
    >
    > My Xapian query:
    > apple(weight:4)
    > computer(weight:3)
    >
    > Instead of this document getting a weight of 11 (2 x apple, 1 x computer),
    > how can I make the matching Boolean, so that the document gets a weight of 7?

    If I understand correctly, you want to ignore the wdf (within-document
    frequency) of terms - you can do that by setting BM25's k1 parameter to 0:

    http://www.xapian.org/docs/apidoc/html/classXapian_1_1BM25Weight.html#_details

    That's not what I'd call "boolean" weighting though, so perhaps I'm
    misunderstanding you...
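
To see why k1 = 0 removes the effect of wdf, here is a minimal pure-Python sketch of the wdf part of the BM25 formula. It is a simplified model, not Xapian's implementation: the function names, the fixed normalised length of 1.0, and the use of the query weights 4 and 3 as stand-ins for the per-term weights are all assumptions made for illustration, and the other parts of BM25 are left out. In Xapian itself the corresponding step would be passing a BM25Weight object constructed with k1 = 0 to Enquire::set_weighting_scheme.

```python
def bm25_wdf_factor(wdf, k1=1.0, b=0.5, normlen=1.0):
    """Simplified BM25 factor for within-document frequency (wdf).

    factor = (k1 + 1) * wdf / (k1 * ((1 - b) + b * normlen) + wdf)

    With k1 = 0 this collapses to wdf / wdf = 1 for any wdf > 0,
    so repeated occurrences of a term add nothing extra.
    """
    if wdf <= 0:
        return 0.0
    return (k1 + 1.0) * wdf / (k1 * ((1.0 - b) + b * normlen) + wdf)


def score(doc_wdfs, query_weights, k1):
    """Sum each query term's weight scaled by its wdf factor."""
    return sum(
        weight * bm25_wdf_factor(doc_wdfs.get(term, 0), k1=k1)
        for term, weight in query_weights.items()
    )


# "I am eating an apple while using apple computer"
doc = {"apple": 2, "computer": 1}
# Hypothetical per-term weights taken from the question.
query = {"apple": 4.0, "computer": 3.0}

flat = score(doc, query, k1=0.0)      # wdf ignored
default = score(doc, query, k1=1.0)   # wdf still matters

print(flat, default)
```

With k1 = 0 each matching term contributes its weight exactly once, which gives the 4 + 3 = 7 asked for in the question. With a non-zero k1 the second "apple" still raises the score.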

    > Is it possible to add a "penalty" to a query?
    > docA = "How to eat an apple while using apple computer"
    > docB = "I am eating an apple while using apple computer"
    >
    > Query(apple:4,computer:3,how:-1) << Is it possible to apply a penalty (reduce
    > the weight) when a document contains the term "how", so that docB ranks higher?

    I don't think that's currently possible without indexing each document
    which doesn't contain "how" with a "XNOThow" term, or something similar.

    Several of the matcher's optimisations rely on the current fact that
    terms can't contribute a negative amount, so I think the only way to do
    this would be to add something to all documents which don't contain
    "how". It would probably be possible to implement a query operator
    which did that.

    You can completely exclude documents which contain a particular term
    though, using OP_AND_NOT.
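
The "XNOThow" workaround and the OP_AND_NOT alternative can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the whitespace tokeniser, the helper names, the simple additive scoring, and the weight of 1 for "XNOThow" are all hypothetical, and nothing here calls Xapian. The idea is that a negative weight on "how" is replaced by a positive weight on a marker term carried by every document that lacks "how".

```python
def index_terms(text, penalised=("how",)):
    """Return the set of terms to index for a document.

    For each penalised word the document does NOT contain, add a marker
    term "XNOT<word>" so the query can reward its absence.
    """
    terms = set(text.lower().split())
    for word in penalised:
        if word not in terms:
            terms.add("XNOT" + word)
    return terms


def score(terms, weights):
    """Boolean-style score: each matching term counts once."""
    return sum(w for term, w in weights.items() if term in terms)


def and_not(docs, excluded):
    """Mimic AND_NOT: drop every document containing the excluded term."""
    return {name: terms for name, terms in docs.items() if excluded not in terms}


docs = {
    "docA": index_terms("How to eat an apple while using apple computer"),
    "docB": index_terms("I am eating an apple while using apple computer"),
}

# All weights stay non-negative; the "penalty" becomes a bonus for XNOThow.
weights = {"apple": 4, "computer": 3, "XNOThow": 1}
scores = {name: score(terms, weights) for name, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

remaining = and_not(docs, "how")

print(scores, ranking, sorted(remaining))
```

The marker approach keeps docA in the results but ranks it below docB, whereas the AND_NOT variant removes docA entirely. The cost of the marker approach is that the penalised words must be known at index time.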

    > How heavy would it be if I added a value of "hash(md5 HTML<title> X
    > websiteDomain)" to each document, and then used this key to collapse
    > duplicate titles within a domain using set_collapse_key? Is it way too heavy?

    How much overhead it incurs will depend on the nature of your data (for
    example if the sites you are indexing each have millions of pages with
    each title, the cost will probably be higher as you'll be rejecting a
    large number of matches).

    It's not an obviously ridiculous idea in general, so all I can really
    suggest is that you try it on your data and see if it performs
    acceptably.
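
As a rough sketch of the collapse idea from the question, the following Python builds an MD5 key from the domain and title and keeps only the best-ranked hit per key. The hit dictionaries, the function names, and the separator byte are assumptions for illustration. In Xapian the key would instead be stored in a document value slot with Document::add_value, and the matcher would do the collapsing once Enquire::set_collapse_key is given that slot (the slot number is your choice).

```python
import hashlib


def collapse_key(title, domain):
    """MD5 of domain + title; a NUL separator avoids accidental collisions
    between e.g. ("ab", "c") and ("a", "bc")."""
    raw = (domain.lower() + "\0" + title.strip()).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def collapse(ranked_hits):
    """Keep the first (best-ranked) hit for each collapse key.

    `ranked_hits` must already be sorted best first. Returns the kept
    hits and the number of hits that were rejected as duplicates.
    """
    seen = set()
    kept = []
    rejected = 0
    for hit in ranked_hits:
        key = collapse_key(hit["title"], hit["domain"])
        if key in seen:
            rejected += 1
            continue
        seen.add(key)
        kept.append(hit)
    return kept, rejected


# Hypothetical, already-ranked results.
hits = [
    {"url": "http://a.example/1", "domain": "a.example", "title": "Apple pie"},
    {"url": "http://a.example/2", "domain": "a.example", "title": "Apple pie"},
    {"url": "http://b.example/1", "domain": "b.example", "title": "Apple pie"},
    {"url": "http://a.example/3", "domain": "a.example", "title": "Apple tart"},
]

kept, rejected = collapse(hits)
print([h["url"] for h in kept], rejected)
```

The rejected count is the thing to watch on real data: the more hits share a key, the more candidates are discarded before a page of results is filled, which is the overhead described above.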

    Cheers,
    Olly

Discussion Overview
group: xapian-discuss @
categories: xapian
posted: Nov 2, '07 at 4:49a
active: Nov 2, '07 at 5:43a
posts: 2
users: 2
website: xapian.org
irc: #xapian
