Hi there.

I am trying to implement sorting in a large index (3 million documents).
My sort field is a simple integer with values between 1 and 100.

With IndexSearcher's search(Query, Sort) everything works fine, but the
problem is resource consumption. I need to make it use less memory
and CPU.

I thought of setting a boost value for documents at index time, using the
value of my sort field, and then writing a custom Similarity class which
would disregard Lucene scoring and take into account only this document
boost.

Has anyone tried this? Is it even possible? Do you think I would gain
anything from this in terms of resource consumption?



Regards,

Dragan


  • Dragan Jotanovic at Sep 11, 2008 at 4:54 pm
    Thanks Mark for the quick response,
    but the problem is that I need to make incremental additions to this
    index, which means that I cannot keep the sort order,
    and a full reindex is a costly process that I cannot run often.
    That's why I need to find some other solution.


    -----Original Message-----
    From: Mark Miller
    Sent: Thursday, September 11, 2008 5:38 PM
    To: Dragan Jotanovic
    Subject: Re: Sorting in lucene through Document boosting

    You can sort by index order after adding the docs in the sorted order to
    the index.

    - Mark

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Hostetter at Sep 13, 2008 at 11:00 pm
    : I thought of setting boost value for documents at index time, with the
    : value of my sort field, and then making custom Similarity class which
    : would disregard Lucene scoring and take in evaluation only this document
    : boost.

    the general idea should work, but a few things to pay attention to...

    1) document boosts are folded into the fieldNorm, so make sure you don't
    "setOmitNorms(true)"

    2) your lengthNorm function needs to return a constant

    3) you'll need to adjust your boost values so that when the fieldNorms are
    converted to the internal 'byte' representation they are still unique ...
    with some simple experimentation you can find an approach that helps you
    generate a mapping from 1,2,3,4,5... to a,b,c,d,... where a<b<c<....
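
To make point 3 concrete, here is a minimal stand-alone sketch. It assumes the norm byte is an 8-bit float with 3 mantissa bits and 5 exponent bits, which is believed to mirror what Lucene's norm encoding did at the time; verify this against Similarity.encodeNorm in your version. The class and method names are hypothetical and not part of Lucene. It counts how many distinct bytes the naive boosts 1.0, 2.0, ..., 100.0 would produce.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a stand-alone 8-bit float
// (3 mantissa bits, 5 exponent bits) ASSUMED to mirror the norm encoding.
final class NormCollisionDemo {

    // Pack a float into one byte by truncating the mantissa to 3 bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;        // sign, exponent and top 3 mantissa bits
        int zero = (63 - 15) << 3;     // offset of the smallest encodable exponent
        if (small <= zero) {
            // zero or negative -> 0, positive underflow -> smallest positive byte
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zero + 0x100) {
            return (byte) -1;          // overflow -> largest encodable value
        }
        return (byte) (small - zero);
    }

    // Count the distinct encoded bytes for boosts 1.0, 2.0, ..., maxRank.
    static int distinctBytes(int maxRank) {
        Set<Byte> seen = new HashSet<>();
        for (int rank = 1; rank <= maxRank; rank++) {
            seen.add(encode((float) rank));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct norm bytes for ranks 1..100: "
                + distinctBytes(100));
    }
}
```

Under this assumed layout, neighbouring integers such as 16 and 17 share the same byte, so using the raw sort value as the boost would merge many ranks together. That is why a mapping onto values that survive the encoding is needed.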



    -Hoss

  • Dragan Jotanovic at Sep 15, 2008 at 12:09 pm
    Thanks Chris.

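    I made a simple Similarity implementation:

    // Ignore field length, so that the norm carries only the document boost.
    public float lengthNorm(String fieldName, int numTokens) {
        return 1f;
    }

    // Treat every term frequency as 1.
    public float tf(float freq) {
        return 1f;
    }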

    My boost values are calculated simply by calling:
    document.setBoost(DefaultSimilarity.decodeNorm((byte)rank));

    It works perfectly. I just need to check if I gain something with this,
    in terms of performance and resource consumption.
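
The mapping described above (taking the float that the byte value `rank` decodes to) can be checked in isolation. The sketch below is a stand-alone approximation that assumes the same 8-bit float layout (3 mantissa bits, 5 exponent bits) rather than calling Lucene itself; `RankBoostMapping`, `boostForRank` and `isLossless` are hypothetical names introduced only for illustration. It verifies that each rank survives an encode/decode round trip and that the boosts increase with the rank.

```java
// Hypothetical stand-alone sketch, not Lucene code: an 8-bit float
// (3 mantissa bits, 5 exponent bits) ASSUMED to match the norm encoding.
final class RankBoostMapping {

    static final int ZERO = (63 - 15) << 3; // offset of the smallest exponent

    // Pack a float into one byte by truncating the mantissa to 3 bits.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        if (small <= ZERO) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO + 0x100) {
            return (byte) -1;
        }
        return (byte) (small - ZERO);
    }

    // Expand a byte back into the float it represents.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xff) << 21) + ((63 - 15) << 24);
        return Float.intBitsToFloat(bits);
    }

    // Boost for a rank in 1..255: the float that the byte `rank` decodes to.
    static float boostForRank(int rank) {
        return decode((byte) rank);
    }

    // True if every rank in 1..maxRank round-trips through the byte
    // encoding and the boosts are strictly increasing with the rank.
    static boolean isLossless(int maxRank) {
        float previous = 0f;
        for (int rank = 1; rank <= maxRank; rank++) {
            float boost = boostForRank(rank);
            if ((encode(boost) & 0xff) != rank || boost <= previous) {
                return false;
            }
            previous = boost;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("ranks 1..100 round-trip: " + isLossless(100));
    }
}
```

Because a larger rank yields a larger boost, relevance ordering would place documents with higher sort values first, provided all other score factors are equal across the matching documents.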



  • Karl Wettin at Sep 15, 2008 at 3:57 pm

    On 15 Sep 2008, at 14:08, Dragan Jotanovic wrote:

    I made simple Similarity implementation:
    public float tf(float arg0) {
    return 1f;
    }
    Why do you touch the term frequency? Is that perhaps unrelated to
    what's discussed in this thread?


    karl

  • Dragan Jotanovic at Sep 15, 2008 at 4:10 pm
    Hm, probably that is not needed.
    I thought that tf would influence the score if I didn't set it to a
    constant value, but it seems that it is sufficient to override just
    lengthNorm.




Discussion Overview
group: java-user
categories: lucene
posted: Sep 11, '08 at 4:32p
active: Sep 15, '08 at 4:10p
posts: 6
users: 3
website: lucene.apache.org
