Hi,
I use a DSE stack that includes Solr 4.10.
I want to control the number of rows in the result set as a percentage of either the hit count ('numFound') or the top score ('maxScore') for a query.
For example:
1) For a query 'foo' with 100 hits, asking for the top 5% (say rows=5%) should return only 5 rows.
For a query 'bar' with 1000 hits, asking for the top 5% (rows=5%) should return the top 50 rows.

2) For a query 'foo' with a maxScore of 4.5, I want all records within 10% of maxScore, i.e. all records whose score is between 4.5 and roughly 4.0 (this could be any number of records).

In other words, the returned set is a percentage of the hits instead of a static row count.
Is there a way to do this readily or via some custom implementation?

Thanks
Cheers
Prasanna Josium

  • Binoy Dalal at Jun 9, 2016 at 5:27 am
    I don't think you can do such a thing out of the box with Solr, but this
    is pretty easy to achieve using a custom search component.

    Just write some custom code which will limit your result set and plug it
    into your request handler as the last component.
    --
    Regards,
    Binoy Dalal
  • Erick Erickson at Jun 9, 2016 at 5:13 pm
    Why do this at all? I have a hard time understanding what benefit this
    is to the _user_.

    And even returning 5% is risky. I mean, what happens for a query of
    *:*? For a corpus of 100M docs that's still 5M documents, which
    would hurt.

    Sure, you say, well I'll cap it at XXX docs. The principle still holds though.
    Users usually don't want to deal with very many docs at a time.

    If you must do this for some kind of reporting or something, just fire
    two queries. The first has rows=0 and the second has rows set to 5%
    of the numFound that was returned the first time.
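A minimal sketch of the two-query approach described above, in Python. The `search` callable is a hypothetical stand-in for whatever client sends parameters to the Solr select handler and returns the parsed JSON response; `rows_for_percent`, `top_percent` and the `cap` default are illustrative names and values, not part of Solr. It assumes the standard JSON response layout (`response.numFound`, `response.docs`).

```python
import math

def rows_for_percent(num_found, percent, cap=1000):
    """Row count for the top `percent` of `num_found` hits, never above `cap`."""
    if num_found <= 0 or percent <= 0:
        return 0
    return min(cap, math.ceil(num_found * percent / 100.0))

def top_percent(search, query, percent, cap=1000):
    """Run two queries: one for the hit count, one for the actual rows.

    `search` is a hypothetical callable: it takes a dict of request
    parameters and returns the parsed JSON response as a dict.
    """
    # Query 1: rows=0 fetches no documents, only the total hit count.
    first = search({"q": query, "rows": 0, "wt": "json"})
    num_found = first["response"]["numFound"]

    rows = rows_for_percent(num_found, percent, cap)
    if rows == 0:
        return []

    # Query 2: same query, with rows derived from numFound.
    second = search({"q": query, "rows": rows, "wt": "json"})
    return second["response"]["docs"]
```

The cap addresses the `*:*` concern: for a very large hit count the percentage is ignored once it exceeds the cap. Note that the index may change between the two requests, so the second query can see a slightly different hit count.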

    Under the covers, you really can't do this without writing some sort
    of custom collector. Solr (well, Lucene) uses the rows parameter as
    the size of the list where the most relevant docs are stored, and
    entries are replaced as "better" docs come along. You can't know how
    many docs are going to be found before you score them all. So how
    would you know what 5% was when you start? You'd have to write
    something that would keep 20X whatever your max was set to and then
    grow it as necessary... but by that time you _might_ have already
    thrown away docs that should be in the expanded list. Or you'd have
    to keep _all_ the results, which would usually be very expensive.
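The point about the fixed-size list can be illustrated with a small Python sketch. This is not Lucene's collector code, only a simplified model of the same idea: a min-heap of size `k` holds the best docs seen so far, and the total hit count is known only after the whole stream has been scored. The name `collect_top_k` and the `(doc_id, score)` input format are assumptions made for the illustration.

```python
import heapq

def collect_top_k(scored_docs, k):
    """Keep the k highest-scoring docs from a stream and count all hits.

    `scored_docs` yields (doc_id, score) pairs. `k` must be chosen before
    the stream is consumed, which is why "5% of numFound" cannot be used
    as k: num_found is only known once the loop has finished.
    """
    heap = []       # min-heap of (score, doc_id); heap[0] is the weakest kept doc
    num_found = 0
    for doc_id, score in scored_docs:
        num_found += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif heap and score > heap[0][0]:
            # A better doc arrived: evict the weakest one for good.
            heapq.heapreplace(heap, (score, doc_id))
    top = sorted(heap, reverse=True)
    return num_found, [(doc_id, score) for score, doc_id in top]
```

Once a doc has been evicted it cannot be recovered, so growing `k` after the fact would require either rescoring or keeping every hit.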

    All in all, I think a 2-query solution is much simpler than hacking into
    your own collector, not to mention far more efficient in the general case.

    Best,
    Erick
  • Prasanna Josium at Jun 10, 2016 at 4:03 am
    Thanks Erick & Binoy,
    I will try out the two-query technique. I guess this will work for the numFound case.

    I guess I was not very clear in stating my problem. The problem I'm dealing with is mostly about maxScore.
    I have a collection (~500K docs) that I search for matches to a query.
    Because of the nature of the data in the collection, some docs get a very high score, which quickly fades to a very low score for the others (from 5 down to 0.5).
    For some queries this happens even within the first 10 docs: 8 have scores between 5 and 3.8, and from the 9th onwards the score falls to 0.4, 0.3 and so on into a long tail.

    The business people think that docs with a very low score compared to the high-scoring ones should not be part of the result set
    and must be cut off below a threshold defined as a percentage of maxScore. Any thoughts on how to work with maxScore?
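One way to apply such a cutoff on the client side is sketched below in Python. It assumes each returned doc carries a `score` field (for example by requesting `fl=*,score`) and that the response's `maxScore` may be passed in; the function name `cut_by_max_score` and the `min_fraction` parameter are hypothetical. It only filters the docs that were actually fetched, so `rows` has to be large enough to reach the point where the scores drop off.

```python
def cut_by_max_score(docs, min_fraction, max_score=None):
    """Keep docs whose score is at least `min_fraction` of the top score.

    `docs` is a list of dicts with a numeric "score" key. If `max_score`
    is not given, the highest score among `docs` is used instead.
    """
    if not docs:
        return []
    if max_score is None:
        max_score = max(d["score"] for d in docs)
    threshold = max_score * min_fraction
    return [d for d in docs if d["score"] >= threshold]
```

With the score pattern described above (a few docs between 5 and 3.8, then a long tail around 0.4), a `min_fraction` of 0.5 would keep the high-scoring group and drop the tail. Since scores are not comparable across queries, the fraction is a per-query relative cutoff, not an absolute quality measure.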

    Thanks
    Prasanna Josium




    -----Original Message-----
    From: Erick Erickson
    Sent: 09 June 2016 22:43
    To: solr-user
    Subject: Re: Returned number of result rows as a function of maxScore or numFound.


Discussion Overview
group: solr-user @
categories: lucene
posted: Jun 9, '16 at 3:23a
active: Jun 10, '16 at 4:03a
posts: 4
users: 3
website: lucene.apache.org...
