FAQ
Hi Everyone,

I have an index of relatively small size (400mb) , containing roughly
0.7 million documents. The index is actually a copy of an existing
database table. Hence, most of my queries are of the form

" +field1:value1 +field2:value2 +field3:value3..... ~20 fields"

I have been running performance tests using this query. Strangely, I
noticed that if I remove some specific clauses... I get a performance
improvement of atleast 5 times. Here are the numbers and examples, so
that I could be more precise

1) Complete Query: 90 requests per second using 10 threads
2) If I remove few specific clauses : 500 requests per second using 10
threads
3) If I form a new query using only 2 clauses from the set of removed
clauses -> 100 requests per second using 10 threads

Now, some of these specific clauses are such that they match around half
of the entire document set. Also, note that I need all the query terms
to be present in the documents retrieved. My target is to obtain 300
requests per second with the given query (20 clauses). It includes 2
range queries. However, I am unable to get 300 rps unless I remove some
of the clauses (which include these range queries) .
I have tried using filters without any significant improvement in
performance. Also, I have more than enough RAM, so I am using the
RAMDirectory to read the index. I have optimized my index before
searching. All the tests have been warmed for 5 seconds ( the test
duration is 10 seconds).

My first question is, is this kind of decrease in performance expected
as the number of clauses shoot up ? Using a single clause out of these
20 , I was able to get 2000 requests per second!
Could someone please guide me if there are any other ways in which I can
obtain improvement in performance ?
Particularly, I am interested to know more about what further caching
could be done apart from the default caching which lucene does.

Thanks In Advance,
Prafulla

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Paul Elschot at Dec 20, 2008 at 3:03 pm

    Op Saturday 20 December 2008 15:23:43 schreef Prafulla Kiran:
    Hi Everyone,

    I have an index of relatively small size (400mb) , containing roughly
    0.7 million documents. The index is actually a copy of an existing
    database table. Hence, most of my queries are of the form

    " +field1:value1 +field2:value2 +field3:value3..... ~20 fields"

    I have been running performance tests using this query. Strangely, I
    noticed that if I remove some specific clauses... I get a performance
    improvement of atleast 5 times. Here are the numbers and examples, so
    that I could be more precise

    1) Complete Query: 90 requests per second using 10 threads
    2) If I remove few specific clauses : 500 requests per second using
    10 threads
    3) If I form a new query using only 2 clauses from the set of removed
    clauses -> 100 requests per second using 10 threads

    Now, some of these specific clauses are such that they match around
    half of the entire document set. Also, note that I need all the
    query terms to be present in the documents retrieved. My target is to
    obtain 300 requests per second with the given query (20 clauses). It
    includes 2 range queries. However, I am unable to get 300 rps unless
    I remove some of the clauses (which include these range queries) .
    I have tried using filters without any significant improvement in
    performance. Also, I have more than enough RAM, so I am using the
    RAMDirectory to read the index. I have optimized my index before
    searching. All the tests have been warmed for 5 seconds ( the test
    duration is 10 seconds).

    My first question is, is this kind of decrease in performance
    expected as the number of clauses shoot up ? Using a single clause
    out of these 20 , I was able to get 2000 requests per second!
    Could someone please guide me if there are any other ways in which I
    can obtain improvement in performance ?
    You might try and add brackets and a + around a group
    of the less frequently occurring terms, like this:

    +field1:frequentValue1 +field2:frequentValue2 +(+field3:inFrequentValue3
    +field4:inFrequentValue4)

    This may help, and at least it should not degrade performance much.
    Also, it will affect score values somewhat.
    Particularly, I am interested to know more about what further caching
    could be done apart from the default caching which lucene does.
    More caching is probably not going to help.

    Regards,
    Paul Elschot

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Dec 20, 2008 at 3:31 pm
    What specifically are you measuring when you time the queries? I've been
    mislead by including in my measurement say, creating the response. I realize
    that throughput includes assembling the response, but the solution is
    different
    depending upon whether it's the actual search or what you do with the
    results that takes the time.

    Are you doing any sorting?

    Are you using a Hits object and iterating on it? This gets very inefficient.

    You might post your code where you time the query. Also what do the "few
    specific clauses" you remove look like? Do they have anything to do with
    time?
    How many unique values do the fields have that you remove to see the
    improvement?

    Do you start your timings *after* you've fired up a few warmup queries?

    Best
    Erick
    On Sat, Dec 20, 2008 at 9:23 AM, Prafulla Kiran wrote:

    Hi Everyone,

    I have an index of relatively small size (400mb) , containing roughly 0.7
    million documents. The index is actually a copy of an existing database
    table. Hence, most of my queries are of the form

    " +field1:value1 +field2:value2 +field3:value3..... ~20 fields"

    I have been running performance tests using this query. Strangely, I
    noticed that if I remove some specific clauses... I get a performance
    improvement of atleast 5 times. Here are the numbers and examples, so that I
    could be more precise

    1) Complete Query: 90 requests per second using 10 threads
    2) If I remove few specific clauses : 500 requests per second using 10
    threads
    3) If I form a new query using only 2 clauses from the set of removed
    clauses -> 100 requests per second using 10 threads

    Now, some of these specific clauses are such that they match around half of
    the entire document set. Also, note that I need all the query terms to be
    present in the documents retrieved. My target is to obtain 300 requests per
    second with the given query (20 clauses). It includes 2 range queries.
    However, I am unable to get 300 rps unless I remove some of the clauses
    (which include these range queries) .
    I have tried using filters without any significant improvement in
    performance. Also, I have more than enough RAM, so I am using the
    RAMDirectory to read the index. I have optimized my index before searching.
    All the tests have been warmed for 5 seconds ( the test duration is 10
    seconds).

    My first question is, is this kind of decrease in performance expected as
    the number of clauses shoot up ? Using a single clause out of these 20 , I
    was able to get 2000 requests per second!
    Could someone please guide me if there are any other ways in which I can
    obtain improvement in performance ?
    Particularly, I am interested to know more about what further caching could
    be done apart from the default caching which lucene does.

    Thanks In Advance,
    Prafulla

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Prafulla Kiran at Dec 22, 2008 at 3:32 am
    Hi,

    Here's the code which I am using to time the query:

    long startTime = System.currentTimeMillis();
    TopDocCollector collector = new TopDocCollector(10);
    is.search(query,collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    long endTime = System.currentTimeMillis();

    Most of the clauses which I removed, had very few unique terms: say 2 or 3.
    I have started taking the timings after I've fired the warmup queries.
    Also, I am not doing any kind of sorting or iterating through the hits
    object.

    Regards,
    Prafulla

    Erick Erickson wrote:
    What specifically are you measuring when you time the queries? I've been
    mislead by including in my measurement say, creating the response. I realize
    that throughput includes assembling the response, but the solution is
    different
    depending upon whether it's the actual search or what you do with the
    results that takes the time.

    Are you doing any sorting?

    Are you using a Hits object and iterating on it? This gets very inefficient.

    You might post your code where you time the query. Also what do the "few
    specific clauses" you remove look like? Do they have anything to do with
    time?
    How many unique values do the fields have that you remove to see the
    improvement?

    Do you start your timings *after* you've fired up a few warmup queries?

    Best
    Erick

    On Sat, Dec 20, 2008 at 9:23 AM, Prafulla Kiran wrote:

    Hi Everyone,

    I have an index of relatively small size (400mb) , containing roughly 0.7
    million documents. The index is actually a copy of an existing database
    table. Hence, most of my queries are of the form

    " +field1:value1 +field2:value2 +field3:value3..... ~20 fields"

    I have been running performance tests using this query. Strangely, I
    noticed that if I remove some specific clauses... I get a performance
    improvement of atleast 5 times. Here are the numbers and examples, so that I
    could be more precise

    1) Complete Query: 90 requests per second using 10 threads
    2) If I remove few specific clauses : 500 requests per second using 10
    threads
    3) If I form a new query using only 2 clauses from the set of removed
    clauses -> 100 requests per second using 10 threads

    Now, some of these specific clauses are such that they match around half of
    the entire document set. Also, note that I need all the query terms to be
    present in the documents retrieved. My target is to obtain 300 requests per
    second with the given query (20 clauses). It includes 2 range queries.
    However, I am unable to get 300 rps unless I remove some of the clauses
    (which include these range queries) .
    I have tried using filters without any significant improvement in
    performance. Also, I have more than enough RAM, so I am using the
    RAMDirectory to read the index. I have optimized my index before searching.
    All the tests have been warmed for 5 seconds ( the test duration is 10
    seconds).

    My first question is, is this kind of decrease in performance expected as
    the number of clauses shoot up ? Using a single clause out of these 20 , I
    was able to get 2000 requests per second!
    Could someone please guide me if there are any other ways in which I can
    obtain improvement in performance ?
    Particularly, I am interested to know more about what further caching could
    be done apart from the default caching which lucene does.

    Thanks In Advance,
    Prafulla

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    ------------------------------------------------------------------------


    No virus found in this incoming message.
    Checked by AVG - http://www.avg.com
    Version: 8.0.176 / Virus Database: 270.9.19/1857 - Release Date: 12/19/2008 10:09 AM

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Dec 22, 2008 at 1:55 pm
    Well, you haven't run afoul of the usual suspects, that's pretty
    clean timing code. I'm afraid I'll have to defer to the people who
    know the internals of Lucene.....

    Best
    Erick

    On Sun, Dec 21, 2008 at 10:31 PM, Prafulla Kiran
    wrote:
    Hi,

    Here's the code which I am using to time the query:

    long startTime = System.currentTimeMillis();
    TopDocCollector collector = new TopDocCollector(10);
    is.search(query,collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    long endTime = System.currentTimeMillis();

    Most of the clauses which I removed, had very few unique terms: say 2 or 3.
    I have started taking the timings after I've fired the warmup queries.
    Also, I am not doing any kind of sorting or iterating through the hits
    object.

    Regards,
    Prafulla

    Erick Erickson wrote:
    What specifically are you measuring when you time the queries? I've been
    mislead by including in my measurement say, creating the response. I
    realize
    that throughput includes assembling the response, but the solution is
    different
    depending upon whether it's the actual search or what you do with the
    results that takes the time.

    Are you doing any sorting?

    Are you using a Hits object and iterating on it? This gets very
    inefficient.

    You might post your code where you time the query. Also what do the "few
    specific clauses" you remove look like? Do they have anything to do with
    time?
    How many unique values do the fields have that you remove to see the
    improvement?

    Do you start your timings *after* you've fired up a few warmup queries?

    Best
    Erick

    On Sat, Dec 20, 2008 at 9:23 AM, Prafulla Kiran <prafulla@tachyontech.net
    wrote:
    Hi Everyone,

    I have an index of relatively small size (400mb) , containing roughly 0.7
    million documents. The index is actually a copy of an existing database
    table. Hence, most of my queries are of the form

    " +field1:value1 +field2:value2 +field3:value3..... ~20 fields"

    I have been running performance tests using this query. Strangely, I
    noticed that if I remove some specific clauses... I get a performance
    improvement of atleast 5 times. Here are the numbers and examples, so
    that I
    could be more precise

    1) Complete Query: 90 requests per second using 10 threads
    2) If I remove few specific clauses : 500 requests per second using 10
    threads
    3) If I form a new query using only 2 clauses from the set of removed
    clauses -> 100 requests per second using 10 threads

    Now, some of these specific clauses are such that they match around half
    of
    the entire document set. Also, note that I need all the query terms to
    be
    present in the documents retrieved. My target is to obtain 300 requests
    per
    second with the given query (20 clauses). It includes 2 range queries.
    However, I am unable to get 300 rps unless I remove some of the clauses
    (which include these range queries) .
    I have tried using filters without any significant improvement in
    performance. Also, I have more than enough RAM, so I am using the
    RAMDirectory to read the index. I have optimized my index before
    searching.
    All the tests have been warmed for 5 seconds ( the test duration is 10
    seconds).

    My first question is, is this kind of decrease in performance expected as
    the number of clauses shoot up ? Using a single clause out of these 20 ,
    I
    was able to get 2000 requests per second!
    Could someone please guide me if there are any other ways in which I can
    obtain improvement in performance ?
    Particularly, I am interested to know more about what further caching
    could
    be done apart from the default caching which lucene does.

    Thanks In Advance,
    Prafulla

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    ------------------------------------------------------------------------


    No virus found in this incoming message.
    Checked by AVG - http://www.avg.com Version: 8.0.176 / Virus Database:
    270.9.19/1857 - Release Date: 12/19/2008 10:09 AM


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedDec 20, '08 at 2:24p
activeDec 22, '08 at 1:55p
posts5
users3
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase