We're getting up there in terms of corpus size for our Lucene indexing application:
* 20 million documents
* all fields need to be stored
* 10 short fields / document
* 1 long free text field / document (analyzed with a custom shingle-based analyzer)
* 140GB total index size
* Optimized into a single segment
* Must run over NFS due to VMware setup

I think I've already taken the most common steps to reduce memory requirements and increase performance on the searching side including:
* omitting norms on all fields except two
* omitting term vectors
* indexing as few fields as possible
* reusing a single searcher
* splitting the index up into N shards for ParallelMultiSearcher
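
For reference, here is a rough sketch (not our exact code) of the sharded-searcher setup described above; the shard paths, count, and layout are placeholders, and the API is the Lucene 3.x one:

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.store.FSDirectory;
    import java.io.File;

    // Hypothetical layout: /nfs/index/shard0 ... /nfs/index/shard49
    int numShards = 50;
    IndexSearcher[] shards = new IndexSearcher[numShards];
    for (int i = 0; i < numShards; i++) {
        shards[i] = new IndexSearcher(
            FSDirectory.open(new File("/nfs/index/shard" + i)), true); // read-only
    }
    // A single ParallelMultiSearcher is created once and reused for every query
    Searcher searcher = new ParallelMultiSearcher(shards);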

The application will run with a 10GB -Xmx heap, but with any less it bails out. It seems happier if we feed it 12GB. Searches are starting to bog down a bit (5-10 seconds for some queries)...

Our next step is to deploy the shards as RemoteSearchables for the same ParallelMultiSearcher over RMI - but before I do that I'm curious:
* are there other ways to get that memory usage down?
* are there performance optimizations that I haven't thought of?

Thanks,
-Chris


  • Paul Libbrecht at Jul 13, 2010 at 10:10 pm

    On 13 Jul 2010, at 23:49, Christopher Condit wrote:

    * are there performance optimizations that I haven't thought of?
    The first and most important one I'd think of is to get rid of NFS.
    You can happily make a local copy, which even for 10 GB might take less
    than 30 seconds at server start.

    paul
  • Michael McCandless at Jul 14, 2010 at 9:53 am
    You can also set the termsIndexDivisor when opening the IndexReader.
    The terms index is an in-memory data structure and it can consume a lot
    of RAM when your index has many unique terms.
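
    For example, a minimal sketch against the 3.x API (the path and the
    divisor value of 4 are placeholders): a divisor of N loads only every
    Nth indexed term into RAM, shrinking the terms index roughly N-fold at
    the cost of slower term lookups.

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.store.FSDirectory;
        import java.io.File;

        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/path/to/shard")),  // placeholder path
            null,   // no custom deletion policy
            true,   // read-only
            4);     // termInfosIndexDivisor: keep every 4th term in RAM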

    Flex (only on Lucene's trunk / next major release (4.0)) has reduced
    this RAM usage (as well as the RAM required when sorting by string
    field with mostly ascii content) substantially -- see
    http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html

    Mike
    On Tue, Jul 13, 2010 at 6:09 PM, Paul Libbrecht wrote:


    On 13 Jul 2010, at 23:49, Christopher Condit wrote:
    * are there performance optimizations that I haven't thought of?
    The first and most important one I'd think of is to get rid of NFS.
    You can happily make a local copy, which even for 10 GB might take less than
    30 seconds at server start.

    paul
  • Toke Eskildsen at Jul 14, 2010 at 8:14 am

    On Tue, 2010-07-13 at 23:49 +0200, Christopher Condit wrote:
    * 20 million documents [...]
    * 140GB total index size
    * Optimized into a single segment
    I take it that you do not have frequent updates? Have you tried to see
    if you can get by with more segments without significant slowdown?
    The application will run with 10G of -Xmx but any less and it bails out.
    It seems happier if we feed it 12GB. The searches are starting to bog
    down a bit (5-10 seconds for some queries)...
    10G sounds like a lot for that index. Two common memory-eaters are
    sorting by field value and faceting. Could you describe what you're
    doing in that regard?

    Similarly, the 5-10 seconds for some queries seems very slow. Could you
    give some examples of the queries that cause problems, together with
    some examples of fast queries and how long they take to execute?


    The standard silver bullet for an easy performance boost is to buy a couple
    of consumer-grade SSDs and put them in the local machine. If you're
    gearing up to use more machines you might want to try this first.

    Regards,
    Toke


  • Christopher Condit at Jul 14, 2010 at 6:32 pm
    Hi Toke-
    * 20 million documents [...]
    * 140GB total index size
    * Optimized into a single segment
    I take it that you do not have frequent updates? Have you tried to see if you
    can get by with more segments without significant slowdown?
    Correct - in fact there are no updates and no deletions. We index everything offline when necessary and just swap the new index in...
    By more segments do you mean not calling optimize() at index time?
    The application will run with 10G of -Xmx but any less and it bails out.
    It seems happier if we feed it 12GB. The searches are starting to bog
    down a bit (5-10 seconds for some queries)...
    10G sounds like a lot for that index. Two common memory-eaters are sorting
    by field value and faceting. Could you describe what you're doing in that
    regard?
    No faceting and no sorting (other than score) for this index...
    Similarly, the 5-10 seconds for some queries seems very slow. Could you give
    some examples of the queries that cause problems, together with some
    examples of fast queries and how long they take to execute?
    Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) AND (Salsa OR Sauce) AND (This OR That)
    The latter is most typical.

    With a single keyword it will execute in < 1 second. In a case where there are 10 clauses it becomes much slower (which I understand, just looking for ways to speed it up)...
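
    For concreteness, a sketch of how one of those compound queries might be
    built programmatically (the field name "text" and the terms are made up):

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanClause.Occur;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.TermQuery;

        // (chip OR nacho OR foo) AND (salsa OR sauce)
        BooleanQuery group1 = new BooleanQuery();
        group1.add(new TermQuery(new Term("text", "chip")), Occur.SHOULD);
        group1.add(new TermQuery(new Term("text", "nacho")), Occur.SHOULD);
        group1.add(new TermQuery(new Term("text", "foo")), Occur.SHOULD);

        BooleanQuery group2 = new BooleanQuery();
        group2.add(new TermQuery(new Term("text", "salsa")), Occur.SHOULD);
        group2.add(new TermQuery(new Term("text", "sauce")), Occur.SHOULD);

        BooleanQuery query = new BooleanQuery();   // top-level AND of the groups
        query.add(group1, Occur.MUST);
        query.add(group2, Occur.MUST);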

    Thanks,
    -Chris
  • Glen Newton at Jul 15, 2010 at 1:15 am
    There are a number of strategies, on the Java or OS side of things:
    - Use huge pages[1], especially on 64-bit systems with lots of RAM. For
    long-running, large-memory (and GC-busy) applications this has achieved
    significant improvements - on the order of 300% for EJBs. See [2],[3],[4].
    For a great article introducing and benchmarking huge pages, both in C
    and Java, see [5].
    To see if huge pages might help you, run
    cat /proc/meminfo
    and check the "PageTables" entry (e.g. "PageTables: 26480 kB").
    If PageTables is, say, more than 1-2GB, you should consider
    using huge pages.
    - Assuming multicore: there are times (very application dependent) when
    having your application run on all cores turns out not to produce the
    best performance. Taking one core out, making it available to look after
    system things (I/O, etc.), sometimes improves performance. Use numactl[6]
    to bind your application to n-1 cores, leaving one out.
    - - numactl also allows you to restrict memory allocation to 1-n
    cores, which may also be useful depending on your application
    - The Java VM from Sun-Oracle has a number of options[7] (an example
    invocation combining several of these appears below):
    - -XX:+AggressiveOpts [You should have this one on always...]
    - -XX:+StringCache
    - -XX:+UseFastAccessorMethods
    - -XX:+UseBiasedLocking [My experience has this helping some
    applications, hindering others...]
    - -XX:ParallelGCThreads= [Usually this is the number of cores; try
    reducing it to n/2]
    - -Xss128k
    - -Xmn [Make this large, like 40% of your -Xmx heap. If you do
    this, use -XX:+UseParallelGC. See [8]]
    You can also play with the many GC parameters. This is pretty arcane,
    but can give you good returns.
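
    Purely as an illustration (the jar name, heap sizes, and thread count
    below are placeholders, not recommendations), a launch line combining
    several of these flags might look like:

        java -Xmx12g -Xmn5g -Xss128k -XX:+AggressiveOpts -XX:+UseBiasedLocking \
             -XX:+UseParallelGC -XX:ParallelGCThreads=4 -jar search-app.jar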

    And of course, I/O is important: data on multiple disks with multiple
    controllers; RAID; filesystem tuning; turning off atime; increasing the
    readahead buffer (from 128KB to 8MB on Linux: see [9]); general OS tuning.
    See [9] for a useful filesystem comparison (for Postgres).

    -glen
    http://zzzoot.blogspot.com/

    [1]http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
    [2]http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html
    [3]http://kirkwylie.blogspot.com/2008/11/linux-fork-performance-redux-large.html
    [4]http://orainternals.files.wordpress.com/2008/10/high_cpu_usage_hugepages.pdf
    [5]http://lwn.net/Articles/374424/
    [6]http://www.phpman.info/index.php/man/numactl/8
    [7]http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp#PerformanceTuning
    [8]http://java.sun.com/performance/reference/whitepapers/tuning.html#section4.2.5
    [9]http://assets.en.oreilly.com/1/event/27/Linux%20Filesystem%20Performance%20for%20Databases%20Presentation.pdf
    On 15 July 2010 04:28, Christopher Condit wrote:
    Hi Toke-
    * 20 million documents [...]
    * 140GB total index size
    * Optimized into a single segment
    I take it that you do not have frequent updates? Have you tried to see if you
    can get by with more segments without significant slowdown?
    Correct - in fact there are no updates and no deletions. We index everything offline when necessary and just swap the new index in...
    By more segments do you mean not call optimize() at index time?
    The application will run with 10G of -Xmx but any less and it bails out.
    It seems happier if we feed it 12GB. The searches are starting to bog
    down a bit (5-10 seconds for some queries)...
    10G sounds like a lot for that index. Two common memory-eaters are sorting
    by field value and faceting. Could you describe what you're doing in that
    regard?
    No faceting and no sorting (other than score) for this index...
    Similarly, the 5-10 seconds for some queries seems very slow. Could you give
    some examples of the queries that cause problems, together with some
    examples of fast queries and how long they take to execute?
    Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) AND (Salsa OR Sauce) AND (This OR That)
    The latter is most typical.

    With a single keyword it will execute in < 1 second. In a case where there are 10 clauses it becomes much slower (which I understand, just looking for ways to speed it up)...

    Thanks,
    -Chris


  • Lance Norskog at Jul 15, 2010 at 1:20 am
    Glen, thank you for this very thorough and informative post.

    Lance Norskog

  • Toke Eskildsen at Jul 15, 2010 at 8:23 am
    On Wed, 2010-07-14 at 20:28 +0200, Christopher Condit wrote:

    [Toke: No frequent updates]
    Correct - in fact there are no updates and no deletions. We index
    everything offline when necessary and just swap the new index in...
    So everything is rebuilt from scratch each time? Or do you mean that
    you're only adding new documents, not changing old ones?

    Either way, optimizing to a single 140GB segment is heavy. Ignoring the
    relatively light processing of the data, the I/O for merging still means,
    at the very minimum, reading and writing the full 140GB. Even if you can
    read and write 100MB/sec, that is roughly 2 x 1,400 seconds - close to an
    hour. This is of course not that relevant if you're fine with a nightly
    batch job.
    By more segments do you mean not call optimize() at index time?
    Either that or calling it with maxNumSegments 10, where 10 is just a
    wild guess. Your mileage will vary:
    http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/IndexWriter.html#optimize%28int%29
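
    A minimal sketch of that call, assuming a writer opened over one existing
    index directory (the path and analyzer below are placeholders):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;
        import java.io.File;

        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/path/to/index")),   // placeholder path
            new StandardAnalyzer(Version.LUCENE_30),        // placeholder analyzer
            false,                                          // open existing index
            IndexWriter.MaxFieldLength.UNLIMITED);
        writer.optimize(10);  // merge down to at most 10 segments instead of 1
        writer.close();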

    [Toke: 10GB is a lot for such an index. What about sorting & faceting?]
    No faceting and no sorting (other than score) for this index...
    [Toke: What queries?]
    Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo)
    AND (Salsa OR Sauce) AND (This OR That)
    The latter is most typical.

    With a single keyword it will execute in < 1 second. In a case where
    there are 10 clauses it becomes much slower (which I understand,
    just looking for ways to speed it up)...
    As Erick Erickson recently wrote: "Since it doesn't make sense to me,
    that must mean I don't understand the problem very thoroughly".

    Your queries seem simple enough and I would expect response times well
    under a second with a warmed index and conventional local hard drives.
    Together with the unexpectedly high memory requirement, my guess is that
    there's something going on with your terms. If you open the index with
    Luke, it'll tell you the number of terms. If that is very high for the
    fields you search on, it would explain the memory usage.

    You can also take a look at the rank of the most common terms. If it is
    very high, this would explain the long execution times for compound
    queries that use one or more of these terms. A stopword filter would
    help in this case, if such a filter is acceptable to you.
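
    If running Luke against the index isn't convenient, here is a rough sketch
    of counting unique terms per field with the 3.x API (the shard path is a
    placeholder):

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.TermEnum;
        import org.apache.lucene.store.FSDirectory;
        import java.io.File;
        import java.util.HashMap;
        import java.util.Map;

        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/path/to/shard")), true);  // read-only
        Map<String, Integer> counts = new HashMap<String, Integer>();
        TermEnum terms = reader.terms();      // walks every unique term
        while (terms.next()) {
            String field = terms.term().field();
            Integer c = counts.get(field);
            counts.put(field, c == null ? 1 : c + 1);
        }
        terms.close();
        reader.close();
        System.out.println(counts);           // unique term count per field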

    Regards,
    Toke Eskildsen


  • Christopher Condit at Jul 15, 2010 at 6:53 pm

    [Toke: No frequent updates]

    So everything is rebuild from scratch each time? Or do you mean that you're
    only adding new documents, not changing old ones?
    Everything is reindexed from scratch - indexing speed is not essential to us...
    Either way, optimizing to a single 140GB segment is heavy. Ignoring the
    relatively light processing of the data, the I/O for merging is still at the very
    minimum to read and write the 140GB. Even if you can read and write
    100MB/sec it still takes an hour. This is of course not that relevant if you're
    fine with a nightly batch job.
    Sorry - I wasn't clear here. The total index size ends up being 140GB, but to try to improve performance we build 50 separate indexes (which end up being a bit under 3GB each) and then open them with a ParallelMultiSearcher. The only reason I tried this multisearcher approach was to toy around with Katta, which ended up not working out for us. I can also deploy it as a RemoteSearchable (although I'm not sure whether that is deprecated or not).
    By more segments do you mean not call optimize() at index time?
    Either that or calling it with maxNumSegments 10, where 10 is just a wild
    guess. Your mileage will vary:
    http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/index/IndexWriter.html#optimize%28int%29
    Is this preferred (in terms of performance) to the above approach (splitting into multiple indexes)?
    As Erick Erickson recently wrote: "Since it doesn't make sense to me, that
    must mean I don't understand the problem very thoroughly".
    Not yet! I've added some benchmarking code to track performance as I make these changes. Do you happen to know if the Lucene benchmark package is still in use / a good thing to toy around with?

    Thanks for all your suggestions,
    -Chris
  • Toke Eskildsen at Jul 16, 2010 at 8:35 am
    On Thu, 2010-07-15 at 20:53 +0200, Christopher Condit wrote:

    [Toke: 140GB single segment is huge]
    Sorry - I wasn't clear here. The total index size ends up being 140GB
    but to try to help improve performance we build 50 separate indexes
    (which end up being a bit under 3gb each) and then open them with a
    parallel multisearcher.
    Ah! That is a whole other matter then. Now I understand why you go for
    single-segment indexes.

    [Toke (assuming a single index): Why not optimize to 10 segments?]
    Is this preferred (in terms of performance) to the above approach (splitting
    into multiple indexes)?
    It's been 2 or 3 years since I experimented with the MultiSearcher, so
    this is mostly guesswork on my part. Searching a single index with
    multiple segments and searching multiple single-segment indexes carry the
    same penalties: the weighting of the query requires merging the query term
    statistics from the parts. In principle it should be the same, but as
    always the devil is in the details.


    50 parts does sound like a lot, though. Even without range searches or
    similar query-exploding searches, there are an awful lot of seeks to be
    done. The logarithmic nature of term lookups works against you here.

    A rough estimate: a simple boolean query with 5 field/terms is weighted
    by each searcher. Each index has 50K terms (conservative guess), so for
    each condition a searcher performs ~log2(50K) ~= 16 lookups. With 50
    indexes that's 50 * 5 * 16 = 4000 lookups.

    The 4K lookups do of course not all result in remote NFS requests, but
    with 10-12GB of RAM on the search machine already taken, I would guess
    that there is not much left for caching the 140GB of index data?

    Is it possible for you to measure the number of read requests that your
    NFS server receives for a standard search? Another thing to try would be
    to run the same slow query 5 times in a row, thereby ensuring that
    everything is fully cached. This should indicate whether the remote I/O
    is the main bottleneck or not.

    The other extreme, a single fully optimized index, would (pathological
    worst case compared to the rough estimate above) require 1 * 5 *
    log2(50*50K) ~= 110 lookups for the terms.

    I would have guessed that the 50 indexes are partly responsible for your
    speed problems, but it sounds like you started out with a lower number
    and later increased it?
    Not yet! I've added some benchmarking code to keep track of all
    performance as I add these changes. Do you happen to know if the
    Lucene benchmark package is still in use / a good thing to toy around with?
    Sorry, no. The only performance testing we've done extensively is for
    searches and for that we used our standard setup with logged queries in
    order to emulate the production setting.

    Regards,
    Toke Eskildsen


