FAQ
I am baffled by the results of the following queries. Can it be something to
do with the boosting factor? All of these queries are performed in the same
environment with the same crawled index/data.

A. query1 = +(content:(Pepsi)) resulted in 228
hits.
B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 ) resulted in 398 hits.
C. query3 = +(host:(ca)^10 ) resulted in 212
hits.

Two questions (strictly just one):
1. query1 of any content contains Pepsi yielded 228 hits, how could a more
limiting query2 (give me all docs that have Pepsi in it with a domain of ca)
yield more hits (398)?
2. Since there are 212 hits of Canadian domains, how can query2 return 398
hits?

Thanks for any pointers!
Cheers,
student_t


--
View this message in context: http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Daniel Naber at Sep 26, 2008 at 8:39 pm

    On Freitag, 26. September 2008, student_t wrote:

    A. query1 = +(content:(Pepsi))
    I guess this is the string input you use for your queries, isn't it? It's
    more helpful to look at the toString() output of the parsed query to see
    how Lucene interpreted your input.

    Regards
    Daniel

    --
    http://www.danielnaber.de

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Student_t at Sep 26, 2008 at 10:59 pm
    Hi Dan,

    Thanks for your suggestion. I will definitely check that out.

    -student_t



    Daniel Naber-10 wrote:
    On Freitag, 26. September 2008, student_t wrote:

    A. query1 = +(content:(Pepsi))
    I guess this is the string input you use for your queries, isn't it? It's
    more helpful to look at the toString() output of the parsed query to see
    how Lucene interpreted your input.

    Regards
    Daniel

    --
    http://www.danielnaber.de

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    --
    View this message in context: http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697619.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Sep 26, 2008 at 8:41 pm
    That certainly doesn't look right. What analyzers are you using at index
    and query time?

    Two things that will help track down what's really happening:

    1> query.toString() is your friend.
    2> get a copy of the excellent Luke tool and have it do its explain magic on
    your query. Watch that the analyzer you choose when querying is what you
    expect....

    If neither of those things sheds any light on the problem, let us know what
    you find....

    Best
    Erick
    On Fri, Sep 26, 2008 at 3:55 PM, student_t wrote:


    I am baffled by the results of the following queries. Can it be something
    to
    do with the boosting factor? All of these queries are performed in the same
    environment with the same crawled index/data.

    A. query1 = +(content:(Pepsi)) resulted in 228
    hits.
    B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 ) resulted in 398 hits.
    C. query3 = +(host:(ca)^10 ) resulted in 212
    hits.

    Two questions (strictly just one):
    1. query1 of any content contains Pepsi yielded 228 hits, how could a more
    limiting query2 (give me all docs that have Pepsi in it with a domain of
    ca)
    yield more hits (398)?
    2. Since there are 212 hits of Canadian domains, how can query2 return 398
    hits?

    Thanks for any pointers!
    Cheers,
    student_t


    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Student_t at Sep 26, 2008 at 11:01 pm
    Hi Eric,

    Thanks a bunch for your pointers. I will need to find out the analyzers at
    index and query time. But is it critical to have the same analyzers during
    these two times?

    I had tested with lucli from some of my local segment data and they appeared
    working fine (i.e., their result sets are reasonable.)

    Is Luke part of Lucene contrib? I recall there is a GUI that lets you view
    the indices. Would you please elaborate?

    Thanks again!
    student_t


    Erick Erickson wrote:
    That certainly doesn't look right. What analyzers are you using at index
    and query time?

    Two things that will help track down what's really happening:

    1> query.toString() is your friend.
    2> get a copy of the excellent Luke tool and have it do its explain magic
    on
    your query. Watch that the analyzer you choose when querying is what you
    expect....

    If neither of those things sheds any light on the problem, let us know
    what
    you find....

    Best
    Erick
    On Fri, Sep 26, 2008 at 3:55 PM, student_t wrote:


    I am baffled by the results of the following queries. Can it be something
    to
    do with the boosting factor? All of these queries are performed in the
    same
    environment with the same crawled index/data.

    A. query1 = +(content:(Pepsi)) resulted in
    228
    hits.
    B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 ) resulted in 398
    hits.
    C. query3 = +(host:(ca)^10 ) resulted in
    212
    hits.

    Two questions (strictly just one):
    1. query1 of any content contains Pepsi yielded 228 hits, how could a
    more
    limiting query2 (give me all docs that have Pepsi in it with a domain of
    ca)
    yield more hits (398)?
    2. Since there are 212 hits of Canadian domains, how can query2 return
    398
    hits?

    Thanks for any pointers!
    Cheers,
    student_t


    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context: http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697605.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Sep 27, 2008 at 2:43 pm
    For Luke, see http://www.getopt.org/luke/ or just google lucene luke

    About the same analyzers at index and search. No, it's not *required*
    that you use the same analyzer, but unless and until you understand
    what analyzers do, I would *strongly* recommend that you use the
    same one. See PerFieldAnalyzerWrapper for using different analyzers
    for different fields....

    Conceptually, think of an analyzer as giving you the "smallest meaningful
    token" from your stream that your program then does something with,
    either index or analyze.

    Consider the line "this is the 2nd Line in Erick's program". What are the
    tokens?
    Well, depending on the analyzer you could get
    nd, line, erick, program (assuming your analyzer lowercases, strips numbers
    and punctuation and removes stopwords)

    You could get
    "this is the 2nd Line in Erick's program" (assuming your analyzer did
    nothing, just returned the entire input stream as a token, which would then
    NOT get a hit searching, say, 'Line')

    You could get
    this, is, the, 2nd, line, in erick's, program (assuming your analyzer just
    lowercased and split on whitespace)

    You could get
    nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
    analyzer that always returned the word "nonsense")

    Really, any transformations you wanted to perform can be done in an
    analyzer.
    So using different analyzers during indexing and searching will very often
    produce "surprising" results. Don't do it unless you know exactly what's
    being
    done to your input streams. Or be prepared to spend a lot of hours tracking
    down "what went wrong" <G>.

    All that said, it doesn't explain the behaviour you're seeing because you're
    right,
    adding successively more restrictions should produce smaller result counts.
    I suspect
    that if you see what actual queries you're generating (and when you get
    Luke, be
    sure to find the drop-down for "which analyzer" to use) you'll find
    something
    surprising.

    Best
    Erick

    On Fri, Sep 26, 2008 at 6:57 PM, student_t wrote:


    Hi Eric,

    Thanks a bunch for your pointers. I will need to find out the analyzers at
    index and query time. But is it critical to have the same analyzers during
    these two times?

    I had tested with lucli from some of my local segment data and they
    appeared
    working fine (i.e., their result sets are reasonable.)

    Is Luke part of Lucene contrib? I recall there is a GUI that lets you view
    the indices. Would you please elaborate?

    Thanks again!
    student_t


    Erick Erickson wrote:
    That certainly doesn't look right. What analyzers are you using at index
    and query time?

    Two things that will help track down what's really happening:

    1> query.toString() is your friend.
    2> get a copy of the excellent Luke tool and have it do its explain magic
    on
    your query. Watch that the analyzer you choose when querying is what you
    expect....

    If neither of those things sheds any light on the problem, let us know
    what
    you find....

    Best
    Erick
    On Fri, Sep 26, 2008 at 3:55 PM, student_t wrote:


    I am baffled by the results of the following queries. Can it be
    something
    to
    do with the boosting factor? All of these queries are performed in the
    same
    environment with the same crawled index/data.

    A. query1 = +(content:(Pepsi)) resulted in
    228
    hits.
    B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 ) resulted in 398
    hits.
    C. query3 = +(host:(ca)^10 ) resulted in
    212
    hits.

    Two questions (strictly just one):
    1. query1 of any content contains Pepsi yielded 228 hits, how could a
    more
    limiting query2 (give me all docs that have Pepsi in it with a domain of
    ca)
    yield more hits (398)?
    2. Since there are 212 hits of Canadian domains, how can query2 return
    398
    hits?

    Thanks for any pointers!
    Cheers,
    student_t


    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697605.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Student_t at Sep 29, 2008 at 3:08 pm
    Thank you Erick!

    I got Luke and it's a great tool! I verified from Luke my queries posted
    originally worked as expected (i.e., "Canadian pepsi" produced fewer results
    than "pepsi" along.)

    Based on your suggestion, I found out the program re-wrote the query before
    it was sent to Nutch as the following:

    (content:"+(content:pepsi) +((host:ca^10.0)^10.0)")

    I think this is messing up the query results. Literally, this is what the
    code was doing:

    Query nutchQuery = new Query();
    nutchQuery.addRequiredTerm("+(content:pepsi) +((host:ca^10.0)^10.0)",
    "content");
    nutchBean.search(nutchQuery, maxHits, maxHitsPerSite);

    The above code must be using a default analyzer, which I have not found out
    yet. But after looking at Luke results and this code, I think my problem is
    in the query construction. Could you shed some light about how the new query
    be constructed based on the following String?

    +(content:pepsi) +((host:ca^10.0)^10.0)

    Thanks again!


    Erick Erickson wrote:
    For Luke, see http://www.getopt.org/luke/ or just google lucene luke

    About the same analyzers at index and search. No, it's not *required*
    that you use the same analyzer, but unless and until you understand
    what analyzers do, I would *strongly* recommend that you use the
    same one. See PerFieldAnalyzerWrapper for using different analyzers
    for different fields....

    Conceptually, think of an analyzer as giving you the "smallest meaningful
    token" from your stream that your program then does something with,
    either index or analyze.

    Consider the line "this is the 2nd Line in Erick's program". What are the
    tokens?
    Well, depending on the analyzer you could get
    nd, line, erick, program (assuming your analyzer lowercases, strips
    numbers
    and punctuation and removes stopwords)

    You could get
    "this is the 2nd Line in Erick's program" (assuming your analyzer did
    nothing, just returned the entire input stream as a token, which would
    then
    NOT get a hit searching, say, 'Line')

    You could get
    this, is, the, 2nd, line, in erick's, program (assuming your analyzer just
    lowercased and split on whitespace)

    You could get
    nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
    analyzer that always returned the word "nonsense")

    Really, any transformations you wanted to perform can be done in an
    analyzer.
    So using different analyzers during indexing and searching will very often
    produce "surprising" results. Don't do it unless you know exactly what's
    being
    done to your input streams. Or be prepared to spend a lot of hours
    tracking
    down "what went wrong" <G>.

    All that said, it doesn't explain the behaviour you're seeing because
    you're
    right,
    adding successively more restrictions should produce smaller result
    counts.
    I suspect
    that if you see what actual queries you're generating (and when you get
    Luke, be
    sure to find the drop-down for "which analyzer" to use) you'll find
    something
    surprising.

    Best
    Erick

    On Fri, Sep 26, 2008 at 6:57 PM, student_t wrote:


    Hi Eric,

    Thanks a bunch for your pointers. I will need to find out the analyzers
    at
    index and query time. But is it critical to have the same analyzers
    during
    these two times?

    I had tested with lucli from some of my local segment data and they
    appeared
    working fine (i.e., their result sets are reasonable.)

    Is Luke part of Lucene contrib? I recall there is a GUI that lets you
    view
    the indices. Would you please elaborate?

    Thanks again!
    student_t


    Erick Erickson wrote:
    That certainly doesn't look right. What analyzers are you using at index
    and query time?

    Two things that will help track down what's really happening:

    1> query.toString() is your friend.
    2> get a copy of the excellent Luke tool and have it do its explain magic
    on
    your query. Watch that the analyzer you choose when querying is what you
    expect....

    If neither of those things sheds any light on the problem, let us know
    what
    you find....

    Best
    Erick
    On Fri, Sep 26, 2008 at 3:55 PM, student_t wrote:


    I am baffled by the results of the following queries. Can it be
    something
    to
    do with the boosting factor? All of these queries are performed in the
    same
    environment with the same crawled index/data.

    A. query1 = +(content:(Pepsi)) resulted
    in
    228
    hits.
    B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 ) resulted in 398
    hits.
    C. query3 = +(host:(ca)^10 ) resulted
    in
    212
    hits.

    Two questions (strictly just one):
    1. query1 of any content contains Pepsi yielded 228 hits, how could a
    more
    limiting query2 (give me all docs that have Pepsi in it with a domain
    of
    ca)
    yield more hits (398)?
    2. Since there are 212 hits of Canadian domains, how can query2 return
    398
    hits?

    Thanks for any pointers!
    Cheers,
    student_t


    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697605.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context: http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19725730.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Sep 29, 2008 at 4:11 pm
    OK, you're officially beyond where I can help. But the
    rewritten query is your problem, and I'm going to appeal to
    people who understand things waaay better than I do to answer
    it. Can you recognize it when I run away <G>...

    You might consider posting that question over on the nutch
    user's list if you don't get anything here, but I'm pretty sure that
    some folks who are very knowledgeable about nutch might
    see it here.

    Best
    Erick
    On Mon, Sep 29, 2008 at 11:07 AM, student_t wrote:


    Thank you Erick!

    I got Luke and it's a great tool! I verified from Luke my queries posted
    originally worked as expected (i.e., "Canadian pepsi" produced fewer
    results
    than "pepsi" along.)

    Based on your suggestion, I found out the program re-wrote the query before
    it was sent to Nutch as the following:

    (content:"+(content:pepsi) +((host:ca^10.0)^10.0)")

    I think this is messing up the query results. Literally, this is what the
    code was doing:

    Query nutchQuery = new Query();
    nutchQuery.addRequiredTerm("+(content:pepsi) +((host:ca^10.0)^10.0)",
    "content");
    nutchBean.search(nutchQuery, maxHits, maxHitsPerSite);

    The above code must be using a default analyzer, which I have not found out
    yet. But after looking at Luke results and this code, I think my problem is
    in the query construction. Could you shed some light about how the new
    query
    be constructed based on the following String?

    +(content:pepsi) +((host:ca^10.0)^10.0)

    Thanks again!


    Erick Erickson wrote:
    For Luke, see http://www.getopt.org/luke/ or just google lucene luke

    About the same analyzers at index and search. No, it's not *required*
    that you use the same analyzer, but unless and until you understand
    what analyzers do, I would *strongly* recommend that you use the
    same one. See PerFieldAnalyzerWrapper for using different analyzers
    for different fields....

    Conceptually, think of an analyzer as giving you the "smallest meaningful
    token" from your stream that your program then does something with,
    either index or analyze.

    Consider the line "this is the 2nd Line in Erick's program". What are the
    tokens?
    Well, depending on the analyzer you could get
    nd, line, erick, program (assuming your analyzer lowercases, strips
    numbers
    and punctuation and removes stopwords)

    You could get
    "this is the 2nd Line in Erick's program" (assuming your analyzer did
    nothing, just returned the entire input stream as a token, which would
    then
    NOT get a hit searching, say, 'Line')

    You could get
    this, is, the, 2nd, line, in erick's, program (assuming your analyzer just
    lowercased and split on whitespace)

    You could get
    nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
    analyzer that always returned the word "nonsense")

    Really, any transformations you wanted to perform can be done in an
    analyzer.
    So using different analyzers during indexing and searching will very often
    produce "surprising" results. Don't do it unless you know exactly what's
    being
    done to your input streams. Or be prepared to spend a lot of hours
    tracking
    down "what went wrong" <G>.

    All that said, it doesn't explain the behaviour you're seeing because
    you're
    right,
    adding successively more restrictions should produce smaller result
    counts.
    I suspect
    that if you see what actual queries you're generating (and when you get
    Luke, be
    sure to find the drop-down for "which analyzer" to use) you'll find
    something
    surprising.

    Best
    Erick

    On Fri, Sep 26, 2008 at 6:57 PM, student_t wrote:


    Hi Eric,

    Thanks a bunch for your pointers. I will need to find out the analyzers
    at
    index and query time. But is it critical to have the same analyzers
    during
    these two times?

    I had tested with lucli from some of my local segment data and they
    appeared
    working fine (i.e., their result sets are reasonable.)

    Is Luke part of Lucene contrib? I recall there is a GUI that lets you
    view
    the indices. Would you please elaborate?

    Thanks again!
    student_t


    Erick Erickson wrote:
    That certainly doesn't look right. What analyzers are you using at index
    and query time?

    Two things that will help track down what's really happening:

    1> query.toString() is your friend.
    2> get a copy of the excellent Luke tool and have it do its explain magic
    on
    your query. Watch that the analyzer you choose when querying is what you
    expect....

    If neither of those things sheds any light on the problem, let us know
    what
    you find....

    Best
    Erick
    On Fri, Sep 26, 2008 at 3:55 PM, student_t wrote:


    I am baffled by the results of the following queries. Can it be
    something
    to
    do with the boosting factor? All of these queries are performed in
    the
    same
    environment with the same crawled index/data.

    A. query1 = +(content:(Pepsi)) resulted
    in
    228
    hits.
    B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 ) resulted in 398
    hits.
    C. query3 = +(host:(ca)^10 ) resulted
    in
    212
    hits.

    Two questions (strictly just one):
    1. query1 of any content contains Pepsi yielded 228 hits, how could a
    more
    limiting query2 (give me all docs that have Pepsi in it with a domain
    of
    ca)
    yield more hits (398)?
    2. Since there are 212 hits of Canadian domains, how can query2
    return
    398
    hits?

    Thanks for any pointers!
    Cheers,
    student_t


    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697605.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    --
    View this message in context:
    http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19725730.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedSep 26, '08 at 7:56p
activeSep 29, '08 at 4:11p
posts8
users3
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase