Hello,

I'm using the default Lucene QueryParser on the search text : fleming
roofing inc., marietta ga

These items are in my index.

doc 1: fleming ga
doc 2: marietta ga
doc 3: marietta il
doc 4: marietta ok
doc 5: marietta ok
doc 6: fleming pa

The first match is always "fleming ga", even though "marietta ga" appears
closer together in the search text. I assume this is because "fleming" has a
higher idf than "marietta". What should I change in the way I'm querying or
indexing so that "marietta ga" ranks first?

Also, I don't want to modify the search text by putting quotes around
"marietta ga" which forces the query parser to make a phrase query.

thanks,
Rajiv
--
View this message in context: http://www.nabble.com/IDF-scoring-issue-tp21045385p21045385.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


  • Erick Erickson at Dec 17, 2008 at 2:49 am
    Note a couple of things:

    1> how a doc scores also takes into account how many other words
    are in the field you're querying on.
    2> Is "text" your default field? Because what you posted is really
    searching text:fleming <default field>:roofing <default
    field>:inc......
    Note also the implicit OR between each of them. Is this really your
    intent?
    3> Searcher.explain (as I remember) is your friend to figure out how the
    weights are being calculated. If you haven't got a copy of Luke, I'd
    *strongly* advise getting one and looking at the "explain" tab...

    Best
    Erick
  • Rajiv2 at Dec 17, 2008 at 3:44 am
    To answer your questions:
    1. There are only two words in each document I'm searching -- city and state
    abbreviation, lowercased and analyzed by WhitespaceAnalyzer.
    2. The only field, and the default field, is text, so the query becomes
    text:fleming text:roofing text:inc., ...etc.
    Using the query operator AND instead of OR gives me no results, which does
    not help.
    3. I've been using explain in Luke, and the only difference between "fleming
    ga" and "marietta ga" is that the idf value is higher for "fleming"; that's
    why "fleming ga" has a higher score.

    Basically I'm just trying to get the "marietta ga" doc to score higher. In
    the query text the two words are closer together than "fleming" and "ga".

    rajiv
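
The idf gap reported by Luke can be reproduced by hand. The sketch below
assumes the classic DefaultSimilarity formula, idf = 1 + ln(numDocs /
(docFreq + 1)), and the six example documents listed in the question. The
class name IdfSketch is made up for illustration and is not part of Lucene.
Under those assumptions "fleming" (2 of 6 docs) comes out at roughly 1.69 and
"marietta" (4 of 6 docs) at roughly 1.18.

```java
import java.util.Locale;

// Illustrative only: reproduces the classic Lucene idf formula by hand.
final class IdfSketch {

    // Assumed formula (classic DefaultSimilarity): 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 6; // six documents in the example index

        // docFreq = number of example documents containing the term
        System.out.println(String.format(Locale.ROOT, "fleming  idf = %.4f", idf(2, numDocs)));
        System.out.println(String.format(Locale.ROOT, "marietta idf = %.4f", idf(4, numDocs)));
        System.out.println(String.format(Locale.ROOT, "ga       idf = %.4f", idf(2, numDocs)));
    }
}
```

Because both candidate documents contain "ga", the ranking between them comes
down to the rarer term, which here is "fleming".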



  • Anshum at Dec 17, 2008 at 5:04 am
    Hi Rajiv,
    If I'm interpreting your problem correctly, I'd suggest you try using a
    PhraseQuery with an appropriate slop value. Though again, it depends on
    what exactly you are trying to fetch.

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............
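
One way to apply the slop suggestion without asking users to type quotes is to
wrap their text in the query parser's proximity syntax ("..."~N) before
parsing. The helper below is a minimal sketch of that idea; the class and
method names are hypothetical, and it only builds the query string, leaving the
actual parsing to QueryParser. Keep in mind that a phrase clause only matches
documents containing every term of the phrase, so with two-word documents it
would have to be applied to word pairs rather than to the whole input.

```java
// Illustrative only: turns raw user text into a sloppy phrase in query parser syntax.
final class SloppyPhraseSketch {

    // Escapes backslashes and double quotes, then wraps the text as "text"~slop.
    static String toSloppyPhrase(String userText, int slop) {
        String escaped = userText.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"~" + slop;
    }

    public static void main(String[] args) {
        // Prints the phrase form of a two-word pair with a slop of 2.
        System.out.println(toSloppyPhrase("marietta ga", 2));
    }
}
```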

  • Grant Ingersoll at Dec 17, 2008 at 1:54 pm

    On Dec 16, 2008, at 8:19 PM, Rajiv2 wrote:
    > Hello,
    >
    > I'm using the default lucene Queryparser on the search text : fleming
    > roofing inc., marietta ga
    >
    > Also, I don't want to modify the search text by putting quotes around
    > "marietta ga" which forces the query parser to make a phrase query.

    Why not?

  • Rajiv2 at Dec 17, 2008 at 2:27 pm
    Because the search term is provided by a user, and that user would explicitly
    have to put quotes around "marietta ga", when I believe the search text as it
    is : fleming roofing inc., marietta ga -- should score higher for "marietta
    ga".

    rajiv


  • Matthew Hall at Dec 17, 2008 at 3:14 pm
    Well, you could also do a simple test of removing IDF from the scoring
    equation and seeing if the query then reacts the way you want it to.

    Simply write your own custom similarity that does this, and test out to
    see how it works.

    Handily enough, I've already done this, so here's some code you can try:


    Fix the package declaration to something that works for you, and then
    simply use the custom similarity at the appropriate times.

    ======================================================================
    package org.jax.mgi.shr.searchtool;

    import org.apache.lucene.search.DefaultSimilarity;

    /**
     * This is our custom similarity class, which removes document frequency
     * from the calculation of score.
     *
     * It extends the DefaultSimilarity class, and thus inherits most of its
     * methods from it.
     *
     * @author mhall
     */
    public class MGISimilarity extends DefaultSimilarity {

        /**
         * If the term has any doc frequency at all in the index, normalize
         * its idf to 1 (the term exists). Otherwise, return 0 (does not exist).
         *
         * @param docFreq
         *            The number of documents that contain this term.
         * @param numDocs
         *            The total number of documents in the index.
         *
         * This signature is dictated by the DefaultSimilarity class.
         */
        @Override
        public float idf(int docFreq, int numDocs) {
            if (docFreq > 0) {
                return 1.0f;
            } else {
                return 0.0f;
            }
        }

    }

    ===================================================================
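
Before relying on the flat idf, it may help to estimate what it does to the two
documents in question. The sketch below is a deliberately simplified model, not
Lucene's scorer: it assumes classic TF/IDF scoring where each matching term
contributes idf squared, and it leaves out tf, length norm, coord and
queryNorm because they should be identical for two two-word documents that
each match two query terms. FlatIdfComparison and its methods are hypothetical
names.

```java
import java.util.function.IntToDoubleFunction;

// Illustrative only: compares term weights under the default and the flat idf.
final class FlatIdfComparison {

    static final int NUM_DOCS = 6; // size of the example index

    // Assumed classic formula: 1 + ln(numDocs / (docFreq + 1)).
    static double defaultIdf(int docFreq) {
        return 1.0 + Math.log(NUM_DOCS / (double) (docFreq + 1));
    }

    // Mirrors the custom similarity: 1 if the term occurs anywhere, else 0.
    static double flatIdf(int docFreq) {
        return docFreq > 0 ? 1.0 : 0.0;
    }

    // Sum of idf^2 over the matching terms, given each term's doc frequency.
    static double weight(IntToDoubleFunction idf, int... docFreqs) {
        double sum = 0.0;
        for (int docFreq : docFreqs) {
            double value = idf.applyAsDouble(docFreq);
            sum += value * value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int fleming = 2, marietta = 4, ga = 2; // doc frequencies in the example index

        System.out.println("default, fleming ga:  " + weight(FlatIdfComparison::defaultIdf, fleming, ga));
        System.out.println("default, marietta ga: " + weight(FlatIdfComparison::defaultIdf, marietta, ga));
        System.out.println("flat,    fleming ga:  " + weight(FlatIdfComparison::flatIdf, fleming, ga));
        System.out.println("flat,    marietta ga: " + weight(FlatIdfComparison::flatIdf, marietta, ga));
    }
}
```

Under these assumptions the flat idf removes the advantage of "fleming ga" but
leaves the two documents tied, so on its own it would likely not be enough to
put "marietta ga" first; some proximity signal would still be needed.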


    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    [email protected]
    (207) 288-6012



  • Grant Ingersoll at Dec 17, 2008 at 3:32 pm

    On Dec 17, 2008, at 9:26 AM, Rajiv2 wrote:
    > Because, the search term is provided by a user, and that user would
    > explicitly have to put quotes around "marietta ga" when I believe the
    > search text as it is : fleming roofing inc., marietta ga -- should score
    > higher for "marietta ga"

    Just because the user doesn't do it, doesn't mean you can't. You're
    stating that there is an implied ordering in their query, yet you
    don't want to take advantage of that. You can often achieve better
    results by generating phrase queries implicitly based on 2 or 3
    grams. You might also even try generating the whole thing as a phrase
    query with a really large slop value (like 100 or more). Thus,
    scoring will reward things when they are closer together, but you
    still get the flexibility of an AND-like query. The downside is,
    possibly, a small performance hit, but you could test it first. Or,
    you could add in the phrase query as an optional OR query to the
    original query, something like: fleming OR roofing OR marietta OR ga
    OR ("fleming roofing" OR "roofing marietta" OR "marietta ga").
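
The last variant, single terms plus optional 2-gram phrases, can be generated
from the user's words before the text reaches the query parser. The following
is a minimal sketch of that rewrite. BigramQuerySketch is a hypothetical name,
and the input is assumed to be already tokenized and free of query syntax
characters (for example, "inc.," has been dropped, as in the example above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: expands a term list into "terms OR (adjacent-pair phrases)".
final class BigramQuerySketch {

    static String build(List<String> terms) {
        String singles = String.join(" OR ", terms);

        // One quoted phrase per pair of adjacent terms.
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i + 1 < terms.size(); i++) {
            phrases.add("\"" + terms.get(i) + " " + terms.get(i + 1) + "\"");
        }

        // With fewer than two terms there is nothing to pair up.
        if (phrases.isEmpty()) {
            return singles;
        }
        return singles + " OR (" + String.join(" OR ", phrases) + ")";
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("fleming", "roofing", "marietta", "ga")));
    }
}
```

A document such as "marietta ga" then matches two single terms and one phrase
clause, while "fleming ga" matches only the two single terms, which is the
extra proximity signal the original query was missing.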

    You could also try using a more intelligent Query Parser that is tuned
    to your domain. You could also try to factor in click-through stats
    into your results. Probably not the answer you want to hear, but it
    is doable and useful.

    Do you have any a priori knowledge about Marietta GA over Fleming, GA
    to begin with? Have you done any broader scale relevance assessment?
    It is often the case that "fixing" one query results in breaking a
    whole bunch of others. What I typically recommend is that you take
    the top 50 queries plus 10-30 random queries from your logs and do an
    assessment of the top 5/10 results for: relevant, somewhat relevant,
    not relevant and embarrassing. The goal is to maximize relevant while
    minimizing embarrassing and not relevant.

    Is this particular example an isolated case or do you feel this is
    systemic to your application? I've said it before, but it bears
    repeating: Just because someone typed search terms into your search
    box does not mean you have to actually do a search in order to present
    them results. If you KNOW the Marietta result is a better result for
    this query, then make it the top result. Solr has this feature via
    the "QueryElevationComponent" (horrible name, I know), but I call it
    Editorial Placement. It's not that hard to implement.

    Finally, I'd say I wouldn't split hairs over position too much, if the
    Marietta result is #2 and the Fleming result is #1. Now, if you're
    telling me the Marietta result is something like #100 and Fleming is
    #1, that's a different story. The fact is, b/c your user didn't put
    quotes, you don't actually know for a fact that the Fleming result is
    what they wanted (but I agree, it is highly likely). The point is, I
    wouldn't quibble over anything that is in the top ten. Lucene is
    doing what you told it to do, that is, rank the results according to
    TF/IDF, etc. If you have other pertinent information about Marietta or
    the query then you should tell Lucene that via phrases, boosts or
    payloads or altering the Similarity. But, like I said, be careful
    that you aren't breaking other queries.

    HTH,
    Grant
