Another thought on fuzzy scoring:
shouldn't all these queries which automatically expand
terms favour common words over rare ones? The default
scoring behaviour at the moment favours rare words. As
a user aren't I more likely to be looking for the most
common expansions?

If I'm not sure how to spell I might search for:
accomodation~
or
accom*
The fuzzy scoring algorithms will currently favour all
of the mis-spellings of accommodation in the ranking
of results because they are more rare.

Ideally within the expansions of a term the score
contribution should be based on df (as opposed to the
usual idf) BUT within the overall query the usual idf
scheme applies. To clarify:
If I search for:
the cheapest accomodation~ in london
I want to see the most common spellings of
accommodation before all other variants of this word
BUT I then want these variants scored against the
OTHER words ("in", "the" etc) on the usual basis of
rarity.

This suggests a sort order within another, different
sort order.
This seems like it would not be easy to do. Any bright
ideas?

Cheers
Mark
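
To make the concern in this post concrete, here is a minimal sketch of why an idf-based weight ranks rare misspellings above the common spelling. The formula follows the shape of the classic default idf in Lucene (natural log of numDocs over docFreq + 1, plus 1), as far as I understand it; the corpus size and document frequencies are invented purely for illustration.

```java
// Sketch only: a Lucene-style idf computed outside of Lucene.
// The corpus size and document frequencies are hypothetical.
class IdfFavoursRareTerms {

    // Same shape as the classic default idf: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 10_000;
        int dfCorrect = 900;  // "accommodation" (hypothetical count)
        int dfMisspelt = 40;  // "accomodation" (hypothetical count)

        // The rarer misspelling receives the larger weight, so documents
        // containing it tend to rank above documents with the common spelling.
        System.out.printf("accommodation idf = %.3f%n", idf(dfCorrect, numDocs));
        System.out.printf("accomodation  idf = %.3f%n", idf(dfMisspelt, numDocs));
    }
}
```

Because the weight falls as docFreq rises, every expansion that is rarer than the intended word gets a boost, which is the behaviour questioned above.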





---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


  • Paul Elschot at Dec 23, 2004 at 6:41 pm
    Mark,
    On Thursday 23 December 2004 14:25, mark harwood wrote:
    > Another thought on fuzzy scoring:
    > shouldn't all these queries which automatically expand
    > terms favour common words over rare ones? The default
    > scoring behaviour at the moment favours rare words. As
    > a user aren't I more likely to be looking for the most
    > common expansions?
    >
    > If I'm not sure how to spell I might search for:
    > accomodation~
    > or
    > accom*
    > The fuzzy scoring algorithms will currently favour all
    > of the mis-spellings of accommodation in the ranking
    > of results because they are more rare.
    >
    > Ideally within the expansions of a term the score
    > contribution should be based on df (as opposed to the
    > usual idf) BUT within the overall query the usual idf
    > scheme applies. To clarify:
    > If I search for:
    > the cheapest accomodation~ in london
    > I want to see the most common spellings of
    > accommodation before all other variants of this word
    > BUT I then want these variants scored against the
    > OTHER words ("in", "the" etc) on the usual basis of
    > rarity.
    >
    > This suggests a sort order within another, different
    > sort order.
    > This seems like it would not be easy to do. Any bright
    > ideas?

    The brightest idea I had so far is to drop the idf altogether.
    Idf just doesn't seem to make much sense for terms related
    through expansion as fuzzy terms or as truncated terms.

    But since dropping idf is probably too controversial,
    one solution that uses idf is to use the minimum idf for
    all the expanded terms.
    Also, the within-document frequencies of the expanded terms
    could be summed over these terms before applying tf(),
    with no coordination factor, as you suggested
    in the previous post.
    These three measures together would effectively treat
    each expanded term as having equal value for scoring.

    This would score the most common spellings equal to
    the less common ones.

    Regards,
    Paul Elschot
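
The three measures in the post above (a shared minimum idf, frequencies summed before tf(), and no coordination factor) can be sketched outside of Lucene as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the square-root tf and the idf formula mirror the classic defaults as I understand them, the numbers are invented, and norms, boosts and query normalisation are left out.

```java
import java.util.Map;

// Sketch only: scores one group of expanded terms against one document.
// All names and figures here are hypothetical, not Lucene API.
class ExpandedGroupScoreSketch {

    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Minimum idf over all expansions, i.e. the idf of the most frequent one.
    static double minIdf(Map<String, Integer> docFreqs, int numDocs) {
        double min = Double.POSITIVE_INFINITY;
        for (int df : docFreqs.values()) {
            min = Math.min(min, idf(df, numDocs));
        }
        return min;
    }

    // freqsInDoc: within-document frequency of each term in a single document.
    // docFreqs:   document frequency of each expansion of the query term.
    static double groupScore(Map<String, Integer> freqsInDoc,
                             Map<String, Integer> docFreqs, int numDocs) {
        int summedFreq = 0;
        for (String term : docFreqs.keySet()) {
            summedFreq += freqsInDoc.getOrDefault(term, 0);
        }
        if (summedFreq == 0) {
            return 0.0; // none of the expansions occurs in this document
        }
        // One tf() over the summed frequency, one shared idf, no coord factor.
        return tf(summedFreq) * minIdf(docFreqs, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 10_000;
        Map<String, Integer> docFreqs = Map.of("accommodation", 900, "accomodation", 40);
        double common = groupScore(Map.of("accommodation", 2), docFreqs, numDocs);
        double rare = groupScore(Map.of("accomodation", 2), docFreqs, numDocs);
        System.out.println(common + " vs " + rare);
    }
}
```

With these choices a document containing the common spelling and a document containing a rare spelling the same number of times receive the same group score, which matches the "equal value" outcome described in the post.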


  • Markharw00d at Dec 23, 2004 at 9:20 pm
    Thanks for the suggestions, Paul.

    I've just tried a scheme using the max docFreq of the expanded terms as
    the docFreq shared by all expanded terms in their idf calculations
    (giving a lower, shared, IDF) and I'm still removing the coordination
    factor on the BooleanQuery that groups the term queries.
    Results seem much more sensible than the existing way of handling fuzzy
    queries. Here are some example results:

    Query: smith~
    ==============
    New scheme top result: Smith Smith
    New scheme top score: 1.0
    Existing scheme top result: Smita Khurana
    Existing scheme top score: 0.02


    Query: pete~ smith~
    ==============
    New Scheme top result: Peter Smith
    New Scheme top score: 0.99
    Existing Scheme top result: Morrissey Pete
    Existing Scheme top score: 0.07

    Query: David Harland~
    ==============
    New scheme top result: David Harland
    New scheme top score: 0.68
    Existing scheme top result: David Burland
    Existing scheme top score: 0.18


    I've currently amended FuzzyQuery to create new subclasses of
    BooleanQuery and TermQuery which override the similarity methods coord
    (for BooleanQuery) and idf (for TermQuery). This approach will need to
    be taken by the other multi-term queries.
    Does this sound like the best way to do this?

    Cheers
    Mark
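
The override approach described in this post might look roughly like the sketch below. The code Mark mentions was not posted, so this is not his implementation: SimilaritySketch is a hypothetical stand-in for the two similarity hooks named in the post (idf and coord), not the real Lucene class, and the document frequencies are invented.

```java
// Sketch only: stand-in classes, not Lucene's Similarity API.
class SharedIdfDemo {
    public static void main(String[] args) {
        int numDocs = 10_000;
        int[] expansionDocFreqs = {900, 40, 3}; // hypothetical df of each expansion
        SimilaritySketch sim = new SharedDocFreqSketch(expansionDocFreqs);

        // Every expansion now gets the idf of the most frequent expansion.
        System.out.println(sim.idf(3, numDocs) == sim.idf(900, numDocs));
        // Coordination no longer rewards matching several expansions.
        System.out.println(sim.coord(1, 3));
    }
}

// The two hooks named in the post, with hypothetical signatures.
abstract class SimilaritySketch {
    abstract double idf(int docFreq, int numDocs);
    abstract double coord(int overlap, int maxOverlap);
}

// Behaviour modelled on the classic defaults.
class DefaultSketch extends SimilaritySketch {
    @Override
    double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    @Override
    double coord(int overlap, int maxOverlap) {
        return overlap / (double) maxOverlap;
    }
}

// Variant for the group of terms produced by one fuzzy expansion.
class SharedDocFreqSketch extends DefaultSketch {
    private final int sharedDocFreq;

    SharedDocFreqSketch(int[] expansionDocFreqs) {
        int max = 0;
        for (int df : expansionDocFreqs) {
            max = Math.max(max, df);
        }
        this.sharedDocFreq = max; // max docFreq gives the lowest, shared idf
    }

    @Override
    double idf(int docFreq, int numDocs) {
        return super.idf(sharedDocFreq, numDocs); // ignore the term's own docFreq
    }

    @Override
    double coord(int overlap, int maxOverlap) {
        return 1.0; // coordination factor switched off for the group
    }
}
```

Keeping the change inside the similarity hooks means the same pair of overrides could, in principle, be reused by the other multi-term queries the post mentions.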



  • Paul Elschot at Dec 23, 2004 at 9:55 pm
    Mark,
    On Thursday 23 December 2004 22:20, markharw00d wrote:
    > Thanks for the suggestions, Paul.
    >
    > I've just tried a scheme using the max docFreq of the expanded terms as
    > the docFreq shared by all expanded terms in their idf calculations
    > (giving a lower, shared, IDF) and I'm still removing the coordination
    > factor on the BooleanQuery that groups the term queries.
    > Results seem much more sensible than the existing way of handling fuzzy
    > queries. Here are some example results:

    That's quick. Do you have a time shrinking machine there?

    > Query: smith~
    > ==============
    > New scheme top result: Smith Smith
    > New scheme top score: 1.0
    > Existing scheme top result: Smita Khurana
    > Existing scheme top score: 0.02
    >
    >
    > Query: pete~ smith~
    > ==============
    > New Scheme top result: Peter Smith
    > New Scheme top score: 0.99
    > Existing Scheme top result: Morrissey Pete
    > Existing Scheme top score: 0.07
    >
    > Query: David Harland~
    > ==============
    > New scheme top result: David Harland
    > New scheme top score: 0.68
    > Existing scheme top result: David Burland
    > Existing scheme top score: 0.18
    >
    >
    > I've currently amended FuzzyQuery to create new subclasses of
    > BooleanQuery and TermQuery which override the similarity methods coord
    > (for BooleanQuery) and idf (for TermQuery). This approach will need to
    > be taken by the other multi-term queries.
    > Does this sound like the best way to do this?

    The results look pretty good and it sounds like the code is compact.
    What more could one wish?

    Does it also do summing before tf()? That would make it perfect, I think,
    but it may be somewhat harder to implement. Summing before
    tf() is useful in documents that have more than one variation
    of the expanded term.

    Regards,
    Paul Elschot.
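
The point about summing before tf() can be illustrated with a small comparison. This sketch assumes the classic square-root tf; the class name, method names and frequencies are hypothetical. It contrasts applying tf() to each variant separately with applying it once to the summed frequency.

```java
// Sketch only: compares two ways of combining within-document frequencies
// of several spelling variants. Names and numbers are hypothetical.
class SumBeforeTfSketch {

    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // tf() applied to each variant, then added (one scoring clause per term).
    static double tfPerTerm(int[] freqs) {
        double total = 0.0;
        for (int f : freqs) {
            total += tf(f);
        }
        return total;
    }

    // Frequencies added first, then a single tf() over the sum.
    static double tfOfSum(int[] freqs) {
        int sum = 0;
        for (int f : freqs) {
            sum += f;
        }
        return tf(sum);
    }

    public static void main(String[] args) {
        int[] oneSpelling = {4};     // one variant used four times
        int[] twoSpellings = {2, 2}; // two variants used twice each

        System.out.println("per term:  " + tfPerTerm(oneSpelling) + " vs " + tfPerTerm(twoSpellings));
        System.out.println("summed tf: " + tfOfSum(oneSpelling) + " vs " + tfOfSum(twoSpellings));
    }
}
```

Applying tf() per term rewards a document for mixing spellings, because the square root is taken on smaller counts and the results are then added. Summing first treats four occurrences the same way however they are spelt.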


  • Markharw00d at Dec 23, 2004 at 11:35 pm
    > That's quick. Do you have a time shrinking machine there?

    :) Actually, time's up. It'll be after Christmas before I spend any more
    time on this now but initial results looked promising so I'll make some
    code available, probably in the new year.
    I've got an update to the highlighter to release too that uses gradient
    highlighting so "more important" terms are highlighted more strongly
    using a sliding scale of font color. This works well with "more like
    this" queries which tend to produce a lot of query terms.

    > Does it also do summing before tf()? That would make it perfect, I think,

    No, not yet. I need to think through what this would mean.

    Thanks again.


Discussion Overview
group: dev
category: lucene
posted: Dec 23, '04 at 1:27p
active: Dec 23, '04 at 11:35p
posts: 5
users: 2 (Markharw00d: 3 posts, Paul Elschot: 2 posts)
website: lucene.apache.org
