The simplest sorting would be to sort your collection before indexing, because Lucene will preserve the order of added documents. I think Nutch sorts the index afterward somehow, but I do not know how this works.
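A minimal sketch of the "sort before indexing" idea, in plain Java. The names `DatedRecord`, `SortedIndexing` and the `addToIndex` callback are made up for illustration; the callback stands in for whatever call adds a document to your index (for example `IndexWriter.addDocument`), and the timestamp stands in for the field you filter on most often.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical record standing in for whatever you turn into a Lucene Document.
final class DatedRecord {
    final String id;
    final long timestamp; // e.g. epoch millis of the date field used in range queries

    DatedRecord(String id, long timestamp) {
        this.id = id;
        this.timestamp = timestamp;
    }
}

final class SortedIndexing {
    // Feeds records to the indexer in ascending timestamp order.
    // addToIndex is a placeholder for the real "add document" call.
    static void indexSortedByDate(List<DatedRecord> records, Consumer<DatedRecord> addToIndex) {
        List<DatedRecord> sorted = new ArrayList<>(records); // leave the caller's list untouched
        sorted.sort(Comparator.comparingLong(r -> r.timestamp));
        for (DatedRecord r : sorted) {
            addToIndex.accept(r);
        }
    }
}
```

Documents with neighbouring dates then end up next to each other in the index, which is what lets a date-range query touch a contiguous region of the files.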


By omitTf() I mean the new feature in the trunk version; see https://issues.apache.org/jira/browse/LUCENE-1340







----- Original Message ----
From: Cedric Ho <cedric.ho@gmail.com>
To: java-user@lucene.apache.org
Sent: Wednesday, 20 August, 2008 3:28:36 AM
Subject: Re: Are there any Lucene optimizations applicable to SSD?

Hi eks,

My index is fully optimized, but I wasn't aware that I can sort it by
fields in Lucene. Could you elaborate on how to do that?

By omitTf(), do you mean Fieldable.setOmitNorms(true)? I'll try that.

Thanks,
Cedric Ho

If you have the possibility to sort your index once in a while on something like
DateRange, you will be surprised how well the OS file cache exploits locality of
reference. We had dramatic (ca. 30%) improvements just by having the index sorted
once a week on the most-used fields. This depends on the nature of your collection and
is not always possible, but where it is possible, it does the job. If this field is also only used
as a boolean condition to select a range of documents, not affecting the score (I guess
not), give omitTf() a try; your index will be smaller as well.

Send instant messages to your online friends http://uk.messenger.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Cedric Ho at Aug 20, 2008 at 1:18 pm
    Hi eks,

    On Wed, Aug 20, 2008 at 3:04 PM, eks dev wrote:
    The simplest sorting would be to sort your collection before indexing, because Lucene will preserve the order of added documents. I think Nutch sorts the index afterward somehow, but I do not know how this works.
    The way we update our index probably already ensures it is more or
    less sorted by date, but I should also try this, since our index on the
    SSD will not be updated.


    By omitTf() I mean the new feature in the trunk version; see https://issues.apache.org/jira/browse/LUCENE-1340
    This seems great! We have 5-6 fields that could be indexed this way.
    I'll definitely check it out.

    Thanks for the great tips =)

    Regards,
    Cedric Ho

  • Halsey, Stephen at Aug 20, 2008 at 2:33 pm
    Hi,

    We are using lucene to index a large number of documents (millions) and
    we currently optimize half the index in the background every 2 days, to
    stop it becoming too fragmented. This takes about an hour and we are
    finding during this time searches are slowed down dramatically on that
    machine. This is not due to CPU as it is a dual CPU box, so I'm
    thinking it must be the large amounts of IO being used to optimize the
    index.

    I was wondering if anyone has any ideas for alleviating this problem?

    One option I've come up with is to slowly copy the index to a second
    offline box, optimize there and then slowly copy the newly
    optimized index back onto the search box. To slow down the IO so that
    bandwidth and IO are not maxed out I thought I could use something like
    the linux Traffic Control (tc) program
    http://tldp.org/HOWTO/Traffic-Control-HOWTO/elements.html#e-shaping (see
    also http://gentoo-wiki.com/HOWTO_Apache_2_bandwidth_limiting ) or tar,
    nfs and http://www.ivarch.com/programs/quickref/pv.shtml and its
    rate-limit option to limit how quickly the index directory is copied to
    and from the remote machine. This option doesn't seem ideal as it would
    involve other programs, servers and scripts.

    The other option is to do it all within the existing Java program, by
    rate-limiting/throttling the IO of the lucene Directory being used to do
    the optimize. I've done this in Lucene by extending the FSDirectory and
    the FSIndexOutput classes and putting a small sleep in the
    FSIndexOutput.flushBuffer, and it seems to work OK. I'm not that keen
    on copying and modifying lucene code though, because I'll have to check
    and possibly modify it every time I upgrade lucene, so if there is a
    reasonable alternative I'd be interested in hearing anyone's ideas? If
    people think IO throttling FSDirectory may be a good idea and useful for
    them, I could develop it more and possibly contact lucene-dev to look
    into getting it added to the lucene trunk?
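The in-process throttling idea above can be sketched without touching Lucene classes at all. The following is only an illustration, assuming a generic `java.io.OutputStream` rather than the FSIndexOutput subclass described in the message; `ThrottledOutputStream` and `maxBytesPerSecond` are made-up names. Instead of a fixed sleep per flush, it sleeps just long enough to keep the average write rate under a cap.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative wrapper, not a Lucene class: caps average write throughput by sleeping.
final class ThrottledOutputStream extends FilterOutputStream {
    private final long maxBytesPerSecond;
    private final long startNanos = System.nanoTime();
    private long bytesWritten = 0;

    ThrottledOutputStream(OutputStream out, long maxBytesPerSecond) {
        super(out);
        if (maxBytesPerSecond <= 0) {
            throw new IllegalArgumentException("maxBytesPerSecond must be positive");
        }
        this.maxBytesPerSecond = maxBytesPerSecond;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        throttle(1);
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        out.write(buf, off, len); // bypass FilterOutputStream's byte-at-a-time default
        throttle(len);
    }

    // Sleeps until wall-clock time catches up with the time the bytes
    // written so far should have taken at the configured rate.
    private void throttle(int justWritten) throws IOException {
        bytesWritten += justWritten;
        long expectedNanos = (long) (bytesWritten * 1e9 / maxBytesPerSecond);
        long aheadNanos = expectedNanos - (System.nanoTime() - startNanos);
        if (aheadNanos > 0) {
            try {
                Thread.sleep(aheadNanos / 1_000_000L, (int) (aheadNanos % 1_000_000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while throttling", e);
            }
        }
    }
}
```

Wrapping a stream this way keeps the rate limit in one place, but it only helps if the code doing the optimize can be handed the wrapped stream, which is exactly the part that required extending Lucene's classes in the approach described above.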

    Cheers



    steve

  • Michael McCandless at Aug 23, 2008 at 1:43 pm
    I think IO throttling would be a useful built-in feature.

    I imagine many people are actually unknowingly affected by this, when
    they make changes to their index on the same machine that also does
    simultaneous searching. It's not only optimize() that will cause
    this, but also normal flushing of segments, addIndex*, expungeDeletes,
    normal segment merging, etc.

    It'd be nice if the throttling could somehow be conditional on whether
    there is "contention", i.e., searches are currently reading.
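One way to picture throttling that is conditional on contention is a shared counter of in-flight searches that the writing side checks before it pauses. This is a rough sketch only; `ContentionAwareThrottle` and its methods are hypothetical and not part of Lucene, and it assumes your own search code can call the two bracket methods around each query.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Lucene: pauses writers only while searches are running.
final class ContentionAwareThrottle {
    private final AtomicInteger activeSearches = new AtomicInteger();
    private final long pauseMillis;

    ContentionAwareThrottle(long pauseMillis) {
        this.pauseMillis = pauseMillis;
    }

    // Search threads bracket each query with these two calls.
    void searchStarted() {
        activeSearches.incrementAndGet();
    }

    void searchFinished() {
        activeSearches.decrementAndGet();
    }

    // Called by the indexing or merging thread, e.g. after each buffer flush.
    // Returns true if it slept because at least one search was in flight.
    boolean pauseIfContended() throws InterruptedException {
        if (activeSearches.get() == 0) {
            return false; // no searches running, write at full speed
        }
        Thread.sleep(pauseMillis);
        return true;
    }
}
```

This only sees searches inside the same JVM, so it does nothing about IO from other processes; that is the part that would need help from the OS.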

    Really the OS should provide this facility to us, but it doesn't (at
    least not up through Java's APIs). Linux does let you pick the IO
    Scheduler to use, and at least one of these IO Schedulers lets you
    prioritize whole processes wrt IO. It's not an easy problem to solve!

    Mike


Discussion Overview
group: java-user
categories: lucene
posted: Aug 20, '08 at 7:06a
active: Aug 23, '08 at 1:43p
posts: 4
users: 4
website: lucene.apache.org