  • Beard, Brian at Jul 9, 2008 at 1:18 pm
    I just did an update from lucene 2.2.0 to 2.3.2 and thought I'd give
    some kudos for the indexing performance enhancements.

    The lucene indexing portion is about 6-8 times faster. Previously we
    were doing ~60-120 documents per second, now we're between 400-1000,
    depending on the type of document, size, and how many fields there are.
    Haven't done a formal comparison side by side, but certainly is
    substantially faster.

    The gain would have been equal to the 8-10 times in the readme, but
    using custom tokenizers slows things down a little vs. using the
    standard one. At first I didn't realize I had to use reusableTokenStream,
    so the custom tokenizers were bypassed and I saw the 8-10x improvement.
    Maybe there's some more gain to be had that I can pursue.

    We index from 5 to 20 fields per document (in serial through
    IndexWriter). Most are 3-5K total size, but can vary quite a bit. Total
    index size (eventually) will be ~15G.

    Thanks,

    Brian Beard





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
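As a rough check of the figures in the post above, the arithmetic can be sketched as follows. The class and method names are illustrative only. The time estimate rests on two loud assumptions that the post does not state: that the ~15G figure can be treated as 15e9 bytes of document data, and that an average document is 4000 bytes (the middle of the quoted 3-5K range).

```java
/** Back-of-the-envelope arithmetic on the rates quoted in the post. */
final class ThroughputEstimate {

    /** Speedup factor between an old and a new indexing rate (documents/second). */
    static double speedup(double oldDocsPerSec, double newDocsPerSec) {
        return newDocsPerSec / oldDocsPerSec;
    }

    /** Hours needed to index totalBytes of documents of avgDocBytes each. */
    static double hoursToIndex(double totalBytes, double avgDocBytes, double docsPerSec) {
        double docs = totalBytes / avgDocBytes;
        return docs / docsPerSec / 3600.0;
    }

    public static void main(String[] args) {
        // Rates from the post: ~60-120 docs/s before, 400-1000 docs/s after.
        System.out.printf("speedup, low ends:  %.1fx%n", speedup(60, 400));
        System.out.printf("speedup, high ends: %.1fx%n", speedup(120, 1000));

        // ASSUMPTIONS (not from the post): 15e9 bytes in total, 4000 bytes per document.
        double totalBytes = 15e9;
        double avgDocBytes = 4000;
        System.out.printf("hours at  400 docs/s: %.1f%n", hoursToIndex(totalBytes, avgDocBytes, 400));
        System.out.printf("hours at 1000 docs/s: %.1f%n", hoursToIndex(totalBytes, avgDocBytes, 1000));
    }
}
```

Comparing the low ends and the high ends of the two ranges gives factors in the same region as the "about 6-8 times" quoted in the post.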
  • Michael McCandless at Jul 9, 2008 at 2:38 pm
    This is great to hear!

    If you tweak things a bit (increase RAM buffer size, use
    autoCommit=false, use threads, etc) you should be able to eke out some
    more gains...

    Are you storing fields & using term vectors on any of your fields?

    Mike

  • Beard, Brian at Jul 9, 2008 at 3:35 pm
    I will try tweaking RAM, and check about autoCommit=false. It's on the
    future agenda to multi-thread through the index writer. The indexing
    time I quoted includes the document creation time which would definitely
    improve with multi-threading.

    I'm doing batch updates of up to 1000 a pop, and closing and re-opening
    the IndexWriter in between.

    Not using term vectors on any fields. Not using compression either. I
    did change the tokenizers to be reused instead of instantiating a new
    one each time.

    About half of the fields are stored in the index. Most are small, but
    unfortunately the largest one is stored which uses a lot of memory and
    probably takes additional time to write. The only reason is so a snippet
    can be returned from it, but eventually I'd like to get rid of that and
    return snippets as tokens in the stream (I'm guessing that it might be
    ok to return analyzed data as a snippet given it would save a lot on
    index size, which would speed up copy time during swapping between
    search and update indexes).
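The tokenizer reuse mentioned in the post above can be sketched in plain Java. This is a minimal illustration of the pattern only, not Lucene code: `WhitespaceSplitter`, `ReusingAnalyzer` and their methods are hypothetical stand-ins. The idea is that each thread keeps one tokenizer object and re-targets it at the next field's text, instead of allocating a fresh tokenizer per field.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a custom tokenizer; NOT a Lucene class. */
final class WhitespaceSplitter {
    private Reader input;

    /** Points this instance at new input so the object itself can be reused. */
    void reset(Reader newInput) {
        this.input = newInput;
    }

    /** Returns the next whitespace-delimited token, or null at end of input. */
    String next() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        return sb.toString();
                    }
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }
}

/** Hands out one splitter per thread and re-targets it for every field. */
final class ReusingAnalyzer {
    private final ThreadLocal<WhitespaceSplitter> perThread =
            ThreadLocal.withInitial(WhitespaceSplitter::new);

    /** Returns the calling thread's splitter, reset to read from the given reader. */
    WhitespaceSplitter reusableStream(Reader reader) {
        WhitespaceSplitter splitter = perThread.get(); // same object on every call from this thread
        splitter.reset(reader);
        return splitter;
    }

    /** Drains a splitter into a list, for demonstration purposes. */
    static List<String> tokens(WhitespaceSplitter splitter) {
        List<String> out = new ArrayList<>();
        for (String t = splitter.next(); t != null; t = splitter.next()) {
            out.add(t);
        }
        return out;
    }
}
```

Keeping the instance in a `ThreadLocal` means no locking is needed if several threads index at once, at the cost of one tokenizer per thread.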

  • Yonik Seeley at Jul 9, 2008 at 3:51 pm

    On Wed, Jul 9, 2008 at 11:35 AM, Beard, Brian wrote:
    > I will try tweaking RAM, and check about autoCommit=false. It's on the
    > future agenda to multi-thread through the index writer. The indexing
    > time I quoted includes the document creation time which would definitely
    > improve with multi-threading.
    >
    > I'm doing batch updates of up to 1000 a pop, and closing and re-opening
    > the IndexWriter in between.

    autoCommit=false will definitely help, and there is normally no reason
    not to use it.
    Bigger batches (or a single batch) will also help indexing speed. A
    single IndexWriter session can now avoid copying stored fields on
    segment merges.

    -Yonik
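The two suggestions in this part of the thread, building documents on several threads and keeping a single writer session open for the whole run, can be sketched together. This is a plain-Java illustration under stated assumptions: `CountingWriter` is a hypothetical thread-safe stand-in for the index writer, not a Lucene class, and the `trim().toLowerCase()` call is only a placeholder for real document creation.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a thread-safe index writer; NOT a Lucene class. */
final class CountingWriter {
    private final AtomicInteger added = new AtomicInteger();

    /** Records that one document was added; safe to call from many threads. */
    void addDocument(String doc) {
        added.incrementAndGet();
    }

    int count() {
        return added.get();
    }
}

final class ParallelIndexing {

    /** Builds documents on several threads and feeds them all to ONE writer session. */
    static int indexAll(List<String> rawRecords, int threads) {
        CountingWriter writer = new CountingWriter(); // opened once for the whole run
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String raw : rawRecords) {
            pool.execute(() -> {
                String doc = raw.trim().toLowerCase(); // placeholder for document creation
                writer.addDocument(doc);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return writer.count(); // every batch went through the same session
    }
}
```

The point of the design is that the writer is created once outside the loop, so there is no close and re-open between batches of 1000.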

  • Beard, Brian at Jul 10, 2008 at 3:14 pm
    Currently the default setting is being used with our setup, so
    autoCommit is true. I'll set this to false to see if it improves.

    Question: If autoCommit is false, does this apply to optimization also,
    so that if an hour-long optimization gets killed in the middle, the
    index will be left in the initial state from before optimization
    started?

  • Yonik Seeley at Jul 10, 2008 at 3:21 pm

    On Thu, Jul 10, 2008 at 11:13 AM, Beard, Brian wrote:
    > Question: If autoCommit is false, does this apply to optimization also,
    > so that if an hour-long optimization gets killed in the middle, the
    > index will be left in the initial state from before optimization
    > started?

    Yes. But the longest merge is the biggest, so that would probably
    happen almost as often with autoCommit=true too.

    -Yonik
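The behaviour described in this answer, where a session killed partway through leaves the previously visible state untouched, can be illustrated with a toy model. This is a conceptual sketch only: `StagedIndex` is hypothetical, holds everything in memory, and says nothing about how Lucene actually lays out or commits index files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of "nothing becomes visible until close"; NOT how Lucene stores an index. */
final class StagedIndex {
    private List<String> committed = new ArrayList<>(); // what readers see
    private List<String> pending;                       // the open session's private copy

    /** Starts a session that works on a private copy of the committed state. */
    void open() {
        pending = new ArrayList<>(committed);
    }

    void add(String doc) {
        pending.add(doc);
    }

    /** A rewrite of the pending state, standing in for a long optimization. */
    void optimize() {
        Collections.sort(pending);
    }

    /** Publishes all of the session's changes in one step. */
    void close() {
        committed = pending;
        pending = null;
    }

    /** Simulates the process being killed: the pending work is simply dropped. */
    void kill() {
        pending = null;
    }

    List<String> visible() {
        return Collections.unmodifiableList(committed);
    }
}
```

In this model a session that is killed after `optimize()` but before `close()` leaves `visible()` exactly as it was when the session was opened.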


Discussion Overview
group: java-user
categories: lucene
posted: Jul 9, '08 at 1:04p
active: Jul 10, '08 at 3:21p
posts: 8
users: 3
website: lucene.apache.org
