Analyzing performance and memory consumption for boolean queries
Our query performance is surprisingly inconsistent, and I'm trying to figure
out why. I've realized that I need to better understand what's going on
internally in Lucene when we're searching. I'd be grateful for any answers
(including pointers to existing docs, if any).

Our situation is this: We have roughly 250 million docs spread across four
indexes. Each doc has about a dozen fields, all stored and most indexed.
(They're the usual document things like author, date, title, contents,
etc.) Queries differ in complexity but always have at least a few terms in
boolean combination, up to some larger queries with dozens or even hundreds
of terms combined with ands, ors, nots, and parens. There's no sorting,
even by relevance: we just want to know what matches. Query performance is
often sub-second, but not infrequently it can take over 20 seconds (we use the
time-limited hit collector, so anything over 20 seconds is stopped).
Obviously the more complex queries are slower on average, but a given query
can sometimes be much slower or much faster.
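
(For reference, the time-limited, match-only collection is conceptually along
these lines -- a hand-rolled sketch using the pre-2.9 HitCollector API, not our
actual code and not the built-in time-limited collector; names are made up:)

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;

// Records matching doc IDs only (no scoring, no sorting) and aborts the
// search by throwing once a wall-clock deadline has passed.
public class TimeBoundedCollector extends HitCollector {
    private final BitSet matches = new BitSet();
    private final long deadline;

    public TimeBoundedCollector(long timeAllowedMillis) {
        this.deadline = System.currentTimeMillis() + timeAllowedMillis;
    }

    public void collect(int doc, float score) {
        if (System.currentTimeMillis() > deadline) {
            throw new RuntimeException("time limit exceeded"); // caller catches this
        }
        matches.set(doc);
    }

    public BitSet getMatches() {
        return matches;
    }
}

// usage (searcher and query assumed):
//   searcher.search(query, new TimeBoundedCollector(20000L));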

My assumption is that we're having memory problems or disk utilization
problems or both. Our app has a 5gb JVM heap on an 8gb server with no other
user processes running, so we shouldn't be paging and should have some room
for Linux disk cache. The server is lightly loaded and concurrent queries
are the exception rather than the norm. Two of the four indexes are updated
a few times a day via rsync and subsequently closed and re-opened, but poor
query performance doesn't seem to be correlated with these times.

So, getting to some specific questions:

1) How is the inverted index for a given field structured in terms of what's
in memory and what's on disk? Is it dynamic, based on available memory, or
tuneable, or fixed? Is there a rule of thumb that could be used to estimate
how much memory is required per indexed field, based on the number of terms
and documents? Likewise, is there a rule of thumb to estimate how many disk
accesses are required to retrieve the hits for that field? (I'm thinking,
by perhaps false analogy, of how a database maintains a b-tree structure
that may reside partially in RAM cache and partially in disk pages.)

2) When boolean queries are searched, is it as simple as iterating the hits
for each ANDed or ORed term and applying the appropriate logical operators
to the results? For example, is searching for "foo AND bar" pretty much the
same resource-wise as doing two separate searches, and therefore should the
query performance be a linear function of the number of search
terms? Or is there some other caching and/or decision logic (perhaps kind
of like a database's query optimizer) at work here that makes the I/O and
RAM requirements more difficult to model from the query? (Remember that
we're not doing any sorting.)
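
(To make question 2 concrete, here is the kind of evaluation I'm imagining for
"foo AND bar": two ascending streams of doc IDs leapfrogging each other. This
is purely illustrative Java with made-up types; whether Lucene actually works
this way is exactly what I'm asking:)

class AndSketch {

    // Two streams of doc IDs in ascending order, like postings lists.
    interface DocIdIterator {
        int next();              // next doc ID, or -1 when exhausted
        int advance(int target); // next doc ID >= target, or -1 when exhausted
    }

    // Counts documents present in both streams by leapfrogging the iterators,
    // rather than materializing each term's full hit list separately.
    static int countIntersection(DocIdIterator foo, DocIdIterator bar) {
        int count = 0;
        int a = foo.next();
        int b = bar.next();
        while (a != -1 && b != -1) {
            if (a == b) {              // document matches both terms
                count++;
                a = foo.next();
                b = bar.next();
            } else if (a < b) {
                a = foo.advance(b);    // skip foo forward to at least b
            } else {
                b = bar.advance(a);    // skip bar forward to at least a
            }
        }
        return count;
    }
}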

I'm hoping that with some of this knowledge, I'll be able to better model
the RAM and I/O usage of the indexes and queries, and thus eventually
understand why things are slow or fast.

Thanks,
Chris

  • Ken Krugler at Jun 23, 2009 at 10:17 pm
    Hi Chris,

    Others on this list will be able to provide much better optimization
    suggestions, but based on my experience with some large indexes...

    1. For search time to vary from < 1 second => 20 seconds, the only
    two things I've seen are:

    * Serious JVM garbage collection problems.
    * You're in Linux swap hell.

    We tracked similar issues down by creating a testbed that let us run
    a set of real-world queries, such that we could trigger these types
    of problems when we had appropriate instrumentation on and recording.

    2. 250M/4 = 60M docs/index. The old rule of thumb was 10M docs/index
    as a reasonable size. You might just need more hardware.

    3. We had better luck running more JVMs per system, versus one JVM
    with lots of memory. E.g. run 3 32-bit JVMs with 1.5GB/JVM. Though
    this assumes you've got one drive/JVM, to avoid disk contention.

    4. I'm assuming, since you don't care about scoring, that you've
    turned off field norms. There are other optimizations you can do to
    speed up query-style searches (find all docs that match X) when you
    don't care about scores, but others on the list are much better
    qualified to provide input in this area.

    5. It seems like storing content outside the index helps with
    performance, though I can't say for certain what the impact might be.
    E.g. only store a single unique ID field in the index, and use that
    to access the content (say, from a MapFile) when you're processing
    the matched entries (see the sketch after this list).

    6. Having most of the index loaded into the OS cache was the biggest
    single performance win. So if you've got 3 GB of unused memory on a
    server, limiting the size of the index to some low multiple of 3GB
    would be a good target.
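
    (To make point 5 concrete: a minimal sketch, assuming the Lucene 2.4-era
    Field API; externalId, contentsText and writer are just example names:)

    // Index the searchable fields but store only an external ID; the full
    // content lives outside the index and is fetched by ID after matching.
    Document doc = new Document();
    doc.add(new Field("id", externalId,
            Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", contentsText,
            Field.Store.NO, Field.Index.ANALYZED)); // indexed but not stored
    writer.addDocument(doc);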

    -- Ken

    --
    Ken Krugler
    +1 530-210-6378

  • Uwe Schindler at Jun 24, 2009 at 7:33 am

    1. For search time to vary from < 1 second => 20 seconds, the only
    two things I've seen are:

    * Serious JVM garbage collection problems.
    * You're in Linux swap hell.

    We tracked similar issues down by creating a testbed that let us run
    a set of real-world queries, such that we could trigger these types
    of problems when we had appropriate instrumentation on and recording.

    I had similar problems with our configuration, too. Sometimes the server
    suddenly did not respond at all. The problem was (and I think it is the same
    here) the GC. The standard Java GC is not multithreaded, so if you have lots
    of traffic at some point, the JVM halts all threads and starts to GC, which
    can take a very long time with such big heap sizes.

    On our server with indexes of similar disk space size (not documents), I
    changed the JVM options to use:

    -Xms4096M -Xmx8192M -XX:MaxPermSize=512M -Xrs -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC -verbosegc -XX:+PrintGCDetails -XX:+UseLargePages

    This also turns on GC debugging, and ParNewGC and ConcMarkSweepGC work
    much better here (but please do not simply copy these settings; read about
    them in the JVM docs, as the exact settings depend on your use case!). I have
    had no hangs since this change. The JVM prints information about garbage
    collection to stderr (which you should study; there is a paper from Sun
    about it). Our web server (Sun Java System Webserver 7.0, Solaris 10 x64)
    also reports the total time spent on GC: during a server uptime of 11
    days it spent about 4 hours on GC in parallel threads. This config works well
    with multiple CPUs; in our case, one could say "one CPU is GCing the whole
    time" :-)

    There is also a howto on the Lucid Imagination site about different GCs and
    Lucene.

    Uwe


  • Eks dev at Jun 24, 2009 at 8:46 am
    We've also had the same problem on a 150M-doc setup (Win 2003, Java 1.6). After monitoring the response time distribution over a couple of weeks, it was clear that such long-running response times were due to bad warming-up. There were peaks shortly after index reload (even comprehensive warming-up did not help?! Maybe we did something wrong) ... We did not use reload(). After loading the index into a RAM disk the problems disappeared (proof that this was not gc() related). We have tried MMAP as well, but MMAP had the same problems (you cannot force the OS to load as much as possible into RAM; that will soon be possible).

    What helped us a lot before that "RAM luxury" was to give less memory to the JVM in order to leave more for the OS; Lucene is fine with less memory.

    The remaining long runners happen rarely; it could be that these are due to gc()...

    As you do not care about scoring, I guess you set omitNorms() and omitTf() during indexing for all fields? If not, try this; it helps a lot.
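
    A minimal sketch of what that looks like at indexing time, assuming the
    Lucene 2.4-era Field API (check the exact method names for your version);
    text, doc and writer are assumed to exist:

    Document doc = new Document();
    Field contents = new Field("contents", text,
            Field.Store.NO, Field.Index.ANALYZED);
    contents.setOmitNorms(true); // drop norms (per-doc, per-field scoring bytes)
    contents.setOmitTf(true);    // drop term frequencies and positions
                                 // (phrase/proximity queries stop working on this field)
    doc.add(contents);
    writer.addDocument(doc);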

    good luck,
    eks



  • Nigel at Jun 24, 2009 at 6:54 pm
    Hi Uwe,

    Good points, thank you. The obvious place where GC really has to work hard
    is when index changes are rsync'd over and we have to open the new index and
    close the old one. Our slow performance times don't seem to be directly
    correlated with the index rotation, but maybe it just appears that way,
    since it may take a little while before GC kicks in to try to recover the
    objects used by the closed index.

    Chris
  • Uwe Schindler at Jun 24, 2009 at 8:48 pm
    Have you checked whether GC affects you? A first step would be to turn on GC
    logging with -verbosegc -XX:+PrintGCDetails

    If you see some relation between query time and gc messages, you should try
    to use a better parallelized GC and change the perm size and so on (see the
    docs about GC tuning).

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
  • Nigel at Jun 26, 2009 at 1:39 am

    On Wed, Jun 24, 2009 at 4:47 PM, Uwe Schindler wrote:

    Have you checked whether GC affects you? A first step would be to turn on GC
    logging with -verbosegc -XX:+PrintGCDetails

    If you see some relation between query time and gc messages, you should try
    to use a better parallelized GC and change the perm size and so on (see the
    docs about GC tuning).

    I'll definitely try this, and thank you for the GC options details that you
    posted previously. I have a test server that's exactly the same
    configuration as our production boxes, which I need to use to try this out
    (with the same indexes of course). It's busy doing some unrelated work at
    the moment but next week I should be able to do some tests.

    Thanks,
    Chris
  • Eks dev at Jun 24, 2009 at 9:04 am
    Another performance tip: what helps a lot is collection sorting before you index.

    If you can somehow logically partition your index, you can improve locality of reference by sorting.

    What I mean by this:
    imagine an index with the following fields: zip, user_group, some text.

    If a typical query on this index contains zip and user_group as MUST fields, then by sorting on zip and user_group before indexing you will get an index where Lucene does not have to seek back and forth during search, reducing disk contention. It is quite OK to sort once in a while... that is a standard trick database admins use.

    This improved our average response times by almost a factor of 2 with the index on disk (FSDirectory). Of course, it depends on your index distribution, but if you have a lot of categorical variables that segment your collection logically, it is worth trying.
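
    For example, something along these lines before feeding documents to the
    IndexWriter (a sketch only; Record, toLuceneDoc and writer are made-up names):

    // Sort the batch on the categorical fields that dominate queries, so that
    // documents sharing zip/user_group end up adjacent and their postings
    // are more contiguous on disk.
    java.util.Collections.sort(records, new java.util.Comparator<Record>() {
        public int compare(Record a, Record b) {
            int c = a.zip.compareTo(b.zip);
            return c != 0 ? c : a.userGroup.compareTo(b.userGroup);
        }
    });
    for (Record r : records) {
        writer.addDocument(toLuceneDoc(r));
    }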



  • Otis Gospodnetic at Jun 24, 2009 at 3:16 am
    Nigel,

    Based on the description, I'd suspect an unnecessarily(?) large JVM heap and insufficient RAM for caching the actual index. Run vmstat while querying the index and watch columns: bi, bo, si, so, wa, and id. :) If what I said above is correct, then you should see more data loaded from disk during those slow queries and probably a jump in the wa column if you are running multiple concurrent queries.

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Nigel at Jun 24, 2009 at 6:42 pm
    Thanks Otis -- I'll give that a try. I think this relates to the first
    question in my original message, which was what (if any) of the inverted
    index structure is explicitly cached by Lucene in the JVM. Clearly there's
    something, since a large JVM heap is required to avoid running out of
    memory, but it can't be everything, otherwise OS caching would have no
    effect.

    Thanks,
    Chris
  • Michael McCandless at Jun 24, 2009 at 9:06 am
    Is it possible the occasional large merge is clearing out the IO cache
    (thus "unwarming" your searcher)? (Though since you're rsync'ing your
    updates in, it sounds like a separate machine is building the index).

    Or... linux will happily swap out a process's core in favor of IO
    cache (though I'd expect this effect to be much less spikey). You can
    tune "swappiness" to have it not do that:

    http://kerneltrap.org/node/3000

    Maybe Lucene's norms/deleted docs/field cache were getting swapped out?

    Lucene's postings reside entirely on disk (ie, Lucene doesn't cache
    those in RAM; we rely on the OS's IO cache). Lucene does a linear
    scan through the terms in the query... Linux will readahead, though,
    if things are fragmented this could mean lots of seeking.
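
    (Related to question 1: Lucene does keep a sparse terms index in RAM to
    locate terms on disk, and its granularity is set at indexing time. A rough
    sketch, assuming the 2.x-era IndexWriter API and an existing dir and
    analyzer; check the exact signatures for your version:)

    // Only every Nth term goes into the in-RAM terms index; a larger interval
    // means a smaller in-heap index but a bit more scanning per term lookup.
    // (The default in this era is 128, if I recall correctly.)
    IndexWriter writer = new IndexWriter(dir, analyzer,
            IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setTermIndexInterval(256);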

    Have you tried putting the index on an SSD instead of a spinning magnetic disk?

    Mike

  • Otis Gospodnetic at Jun 24, 2009 at 2:02 pm
    Stealing this thread/idea, but changing subject, so we can branch and I don't look like a thread thief.


    I never played with /proc/sys/vm/swappiness, but I wonder if there are points in the lifetime of an index where this number should be changed. For example, does it make sense to in/decrease that number once we know the index is going to be read-only for a while? Does it make sense to in/decrease it during merges or optimizations?

    Thanks,
    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Michael McCandless at Jun 24, 2009 at 2:46 pm
    My opinion is swappiness should generally be set to zero, thus turning
    off "swap core out in favor of IO cache".

    I don't think the OS's simplistic LRU policy is smart enough to know
    which RAM (that Lucene had allocated & filled) is OK to move to
    disk. E.g. you see the OS evict stuff because Lucene does a big segment
    merge (because from Java we can't inform the OS *not* to cache those
    bytes).

    Lucene loads the terms index, deleted docs bit vector, norms and
    FieldCache's into RAM. Just because my search app hasn't been used in
    a while doesn't mean the OS should up and swap stuff out, because then
    when a query finally does come along, that query pays a massive
    swapfest price.

    So highish swappiness (the default in many linux distros) can really
    kill a search app that has 1) a big index, and 2) relatively slow
    query rate. If the query rate is fast, it should keep the pages hot
    and the OS shouldn't swap them out too badly.

    I don't like swappiness in a desktop setting either: I hate coming
    back to a unix desktop to discover say my web browser and mail program
    were 100% swapped out because say my mencoder was reading & writing
    lots of bytes (OK, so mencoder should have called
    madvise/posix_fadvise so that the OS wouldn't put those bytes into the
    IO cache in the first place, but it doesn't seem to... and other IO
    intensive programs seem not to as well). You then wait for a looong
    time while a swapfest ensues, to get those pages back in RAM, just to
    check your email. I don't like waiting ;) I've disabled swapping
    entirely on my desktop for this reason.

    Windows (at least Server 2003) has an "Adjust for best performance of
    Programs vs System Cache" as well, which I'm guessing is the same
    thing as swappiness.

    Even as we all switch to SSDs, which'll make swapping back in a lot
    faster, it's still far slower than had things not been swapped out in
    the first place.

    I'll add "check your swappiness" to the ImproveSearchPerformance page!

    Mike


  • Nigel at Jun 24, 2009 at 7:54 pm
    This is interesting, and counter-intuitive: more queries could actually
    improve overall performance.

    The big-index-and-slow-query-rate does describe our situation. I'll try
    running some tests that run queries at various rates concurrent with
    occasional big I/O operations that use the disk cache. (And then set
    swappiness to zero if it looks like it will help.)

    Thanks,
    Chris
    On Wed, Jun 24, 2009 at 10:46 AM, Michael McCandless wrote:

    So highish swappiness (the default in many linux distros) can really
    kill a search app that has 1) a big index, and 2) relatively slow
    query rate. If the query rate is fast, it should keep the pages hot
    and the OS shouldn't swap them out too badly.
  • Michael McCandless at Jun 24, 2009 at 10:29 pm
    You can also run vmstat or iostat and watch if the high latency
    queries correspond to lots of swap-ins.

    Mike

  • Nigel at Jun 24, 2009 at 7:38 pm
    Hi Mike,

    Yes, we're indexing on a separate server, and rsyncing from index snapshots
    there to the search servers. Usually rsync has to copy just a few small
    .cfs files, but every once in a while merging will produce a big one. I'm
    going to try to limit this by setting maxMergeMB, but of course that's a
    trade-off with having more segments.

    It sounds like surely any swapping out of the JVM memory could cause big and
    unpredictable performance drops. As I just mentioned in reply to Uwe, our
    poor performance times don't always directly correlate with index updates,
    but it may be that the damage is done and the effects are only seen sometime
    later.

    We don't store norms (since we don't care about sort order), and we don't
    have any deleted docs (since the index is read-only on the search servers).
    What exactly is stored in the field cache?

    p.s. I haven't tried SSDs yet, or for that matter faster disks of any sort.
    First I'd like to get a better understanding of what I/O is required and
    when during the search process, ideally to be able to have an approximate
    model that predicts I/O based on the query (the way a DBA might do when
    estimating how a SQL query would work with certain tables and indexes).

    Thanks,
    Chris
  • Michael McCandless at Jun 24, 2009 at 10:26 pm

    On Wed, Jun 24, 2009 at 3:38 PM, Nigel wrote:

    Yes, we're indexing on a separate server, and rsyncing from index snapshots
    there to the search servers.  Usually rsync has to copy just a few small
    .cfs files, but every once in a while merging will produce a big one.  I'm
    going to try to limit this by setting maxMergeMB, but of course that's a
    trade-off with having more segments.
    OK.
    It sounds like surely any swapping out of the JVM memory could cause big and
    unpredictable performance drops.  As I just mentioned in reply to Uwe, our
    poor performance times don't always directly correlate with index updates,
    but it may be that the damage is done and the effects are only seen sometime
    later.
    OK... I wonder whether the bytes written by rsync go into the IO
    cache. I would assume they do. But in your case that might be OK
    since presumably you'll then cutover to those new files, so, they've
    been pre-warmed by rsync.
    We don't store norms (since we don't care about sort order), and we don't
    have any deleted docs (since the index is read-only on the search servers).
    Ahh excellent.
    What exactly is stored in the field cache?
    FieldCache is used when sorting by field (not relevance) and by
    function queries, or, if you directly load values, eg
    FieldCache.DEFAULT.getInts(...).
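
    For example (a sketch; the field name and the reader are assumed):

    // Uninverts the indexed terms of "date" into one int per document, held in RAM.
    int[] dates = FieldCache.DEFAULT.getInts(reader, "date");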
    p.s. I haven't tried SSDs yet, or for that matter faster disks of any sort.
    First I'd like to get a better understanding of what I/O is required and
    when during the search process, ideally to be able to have an approximate
    model that predicts I/O based on the query (the way a DBA might do when
    estimating how a SQL query would work with certain tables and indexes).
    Sounds like a good plan!! And, after that, upgrade ;) There's no
    going back once you make the switch...

    Mike

  • Toke Eskildsen at Jun 26, 2009 at 9:14 am

    On Wed, 2009-06-24 at 21:38 +0200, Nigel wrote:
    It sounds like surely any swapping out of the JVM memory could cause big and
    unpredictable performance drops. As I just mentioned in reply to Uwe, our
    poor performance times don't always directly correlate with index updates,
    but it may be that the damage is done and the effects are only seen sometime
    later.
    We were hit by GC hell some time ago when we ran JVMs with 5-7GB
    allocated. It turned out that using Sun's RMI forces a total garbage
    collection once a minute. At least for the version we used at the time.

    Some info at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6200091
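
    The usual workaround (besides avoiding RMI) is to raise the DGC intervals
    so those forced full collections happen much less often, e.g. with the
    standard Sun JVM system properties (values are just an example):

    -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000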
    p.s. I haven't tried SSDs yet, or for that matter faster disks of any sort.
    We've tested MTRON SSDs from 2007 vs. 10,000 and 15,000 RPM hard disks in
    RAID 0 and 1 on a dual-core Intel Xeon machine. There was no doubt that
    the SSDs were substantially faster than the hard disks for searching.
    A related observation was that the need for warm-up and disk cache
    was greatly reduced.

    Regards,
    Toke Eskildsen


  • Nigel at Jun 24, 2009 at 5:35 pm
    Hi Ken,

    Thanks for your reply. I agree that your overall diagnosis (GC problems
    and/or swapping) sounds likely. To follow up on some of the specific things
    you mentioned:

    2. 250M/4 = 60M docs/index. The old rule of thumb was 10M docs/index as a
    reasonable size. You might just need more hardware.

    I'm curious if there's any practical difference in terms of memory overhead
    between several smaller indexes vs. fewer bigger indexes. For example, if
    we had 25 indexes of 10M docs each on the same server, would it be more or
    less efficient than 4 indexes of 60M docs each? (Assuming that they're
    partitioned so that a given search doesn't have to aggregate results from
    many indexes.)

    3. We had better luck running more JVMs per system, versus one JVM with
    lots of memory. E.g. run 3 32-bit JVMs with 1.5GB/JVM. Though this assumes
    you've got one drive/JVM, to avoid disk contention.

    Good point; I've run into this approach before when dealing with very large
    heap sizes and lots of GC. My impression is that GC improvements in the JVM
    over the years have made this less beneficial than it used to be. I wonder
    in this case whether the benefit was really the separate disks rather than
    the separate JVMs.

    4. I'm assuming, since you don't care about scoring, that you've turned off
    field norms. There are other optimizations you can do to speed up
    query-style searches (find all docs that match X) when you don't care about
    scores, but others on the list are much better qualified to provide input in
    this area.

    That's correct; we don't store norms. I've read up on some of the other
    techniques, such as cached filters, but that seems most appropriate when
    you're doing the same or similar queries frequently, and in our case the
    queries are fairly different. Also, more caching could just lead to more GC
    and swapping issues.

    5. It seems like storing content outside the index helps with performance,
    though I can't say for certain what the impact might be. E.g. only store a
    single unique ID field in the index, and use that to access the content
    (say, from a MapFile) when you're processing the matched entries.

    I've thought about this as well, and I know people sometimes store the docs
    themselves in things like Berkeley DB. But, since in Lucene the stored
    fields are not cached in memory (apart from OS caching), it doesn't seem
    like storing things in Lucene should make searching itself slower. In other
    words, if you're going to load the doc from disk somehow (Lucene or BDB or
    flat file or whatever), it might as well be in Lucene to keep things
    architecturally simpler. Of course, it could be more efficient to move the
    document store to a different server, but a similar benefit could be
    achieved by moving some of the Lucene indexes to a different server.

    Thanks,
    Chris
