Term browsing much slower in Lucene 3.x.x
We recently upgraded from lucene 2.4.0 to lucene 3.0.2. Our load testing revealed a serious performance drop specific to traversing the list of terms and their associated documents for a given indexed field. Our code looks something like this:

for(Term term : terms) {
    TermDocs termDocs = indexReader.termDocs(term);
    while(termDocs.next()) { // much slower here
        int doc = termDocs.doc();
        ...do something with each doc...
    }
}


The slowness is all on the first call to TermDocs.next() for each term. Further investigation comparing 2.4.0 and 3.0.2 revealed that there is some new synchronization on the SegmentTermDocs constructor and the SegmentReader.getTermsReader(). The first call to next() hits this synchronization, causing a 4x slowdown on an 8 CPU machine.

My first question is should we be using a different approach to process each term's doc list that would be more efficient? The synchronization appears to be on aspects of these classes that the next() operation is not concerned with.

My other question is whether there are planned performance enhancements to address this loss of performance?

Thanks.

John


  • Michael McCandless at Jul 29, 2010 at 9:55 am

    On Wed, Jul 28, 2010 at 2:39 PM, Nader, John P wrote:
    > We recently upgraded from lucene 2.4.0 to lucene 3.0.2.  Our load testing revealed a serious performance drop specific to traversing the list of terms and their associated documents for a given indexed field.  Our code looks something like this:
    >
    > for(Term term : terms) {
    >     TermDocs termDocs = indexReader.termDocs(term);
    >     while(termDocs.next()) {   //  much slower here
    >         int doc = termDocs.doc();
    >         ...do something with each doc...
    >     }
    > }
    Is that IndexReader reading multiple segments or single segment?
    > The slowness is all on the first call to TermDocs.next() for each term.  Further investigation comparing 2.4.0 and 3.0.2 revealed that there is some new synchronization on the SegmentTermDocs constructor and the SegmentReader.getTermsReader().  The first call to next() hits this synchronization, causing a 4x slowdown on an 8 CPU machine.
    There was some added sync, however, the code within those sync blocks
    is minuscule (looking up a field). It's weird that you're seeing a 4X
    hit because of this. We could conceivably optimize this code to avoid
    the sync blocks if the reader is readOnly.
    > My first question is should we be using a different approach to process each term's doc list that would be more efficient?  The synchronization appears to be on aspects of these classes that the next() operation is not concerned with.
    Are you sorting your terms in index-sort order (UTF16, ie
    String.compareTo)? This can be an important gain especially if you
    have many terms.
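As a rough illustration of what "index-sort order" means here: Lucene 3.x orders terms by field name and then by term text compared as UTF-16 code units, which is what String.compareTo does. The sketch below uses plain strings rather than Lucene Term objects (the class and method names are made up for the example) and shows the one place where that order differs from code point order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Stand-in example: plain strings instead of Lucene Term objects.
class TermSortOrder {

    // Sorts term texts by UTF-16 code unit, which is what String.compareTo does.
    static List<String> sortUtf16(List<String> termTexts) {
        List<String> sorted = new ArrayList<String>(termTexts);
        Collections.sort(sorted);
        return sorted;
    }

    // For contrast only: sorts by Unicode code point instead.
    static List<String> sortCodePoint(List<String> termTexts) {
        List<String> sorted = new ArrayList<String>(termTexts);
        Collections.sort(sorted, new Comparator<String>() {
            public int compare(String a, String b) {
                int i = 0;
                int j = 0;
                while (i < a.length() && j < b.length()) {
                    int ca = a.codePointAt(i);
                    int cb = b.codePointAt(j);
                    if (ca != cb) {
                        return ca < cb ? -1 : 1;
                    }
                    i += Character.charCount(ca);
                    j += Character.charCount(cb);
                }
                // Whichever string still has characters left sorts last.
                return (a.length() - i) - (b.length() - j);
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<String>();
        terms.add("\uFF21");        // U+FF21, a single UTF-16 code unit
        terms.add("\uD835\uDC00");  // U+1D400, stored as a surrogate pair
        terms.add("apple");
        System.out.println(sortUtf16(terms));
        System.out.println(sortCodePoint(terms));
    }
}
```

The two orders agree for most text. They differ only when a supplementary character (a surrogate pair) is compared with a character at or above U+E000, so for typical term sets a plain Collections.sort on the term text already matches the index order.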

    Also, if you are working with your top reader, you should see some
    perf gain by instead working w/ the sub readers directly, ie:

    for(IndexReader subReader : indexReader.getSequentialSubReaders()) {
    ...
    }

    Also, instead of getting a new TermDocs every time, you should get a
    single TermDocs up front (IndexReader.termDocs()), and then seek it to
    your term (termDocs.seek(term)), validate that the term it seek'd to in
    fact matches what you asked for, then iterate its docs.
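The "seek, then validate" step can be pictured with an ordinary sorted map. This is only a stand-in, not Lucene code: java.util.TreeMap plays the role of the sorted term dictionary, and the class and method names are invented for the illustration. Seeking in a sorted dictionary lands on the first entry at or after the requested key, so the caller has to check that the entry really is the one asked for before reading its documents.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-in for a sorted term dictionary; not Lucene's API.
class SeekAndValidate {

    private final TreeMap<String, int[]> postings = new TreeMap<String, int[]>();

    // Registers a term together with the ids of the documents that contain it.
    void add(String term, int... docs) {
        postings.put(term, docs);
    }

    // Returns the doc ids for exactly this term, or an empty array when the
    // seek landed on a different (later) term or ran off the end.
    int[] docsFor(String term) {
        Map.Entry<String, int[]> hit = postings.ceilingEntry(term); // first key >= term
        if (hit == null || !hit.getKey().equals(term)) {
            return new int[0]; // validation failed: not the term we asked for
        }
        return hit.getValue();
    }

    public static void main(String[] args) {
        SeekAndValidate dict = new SeekAndValidate();
        dict.add("apple", 1, 4);
        dict.add("cherry", 2);
        System.out.println(dict.docsFor("apple").length);  // exact match
        System.out.println(dict.docsFor("banana").length); // seek lands on "cherry"
    }
}
```

Without the equals check, a lookup for "banana" would silently return the documents for "cherry".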
    > My other question is whether there are planned performance enhancements to address this loss of performance?
    These APIs are very different in the next major release (4.0) of
    Lucene, so except for problems spotted by users like you, there's not
    much more dev happening against them.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Nader, John P at Jul 29, 2010 at 9:49 pm
    Thanks much for your response. Yes, our terms are sorted in index-sort order. I think you have a good suggestion, which is to get the term docs once and then seek to each term. I will try that approach and report back to the forum on the results.

    Like you, I am surprised by the overhead of the added synchronization. I don't think it is waiting on locks, but rather the memory flush and loading that goes on.
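One way to get a feel for this outside Lucene is a small timing harness that calls the same trivial method, with and without synchronized, on one shared object from several threads. This is a rough sketch and not a rigorous benchmark: there is no warm-up control, the JIT may treat the two paths differently, and it does not separate lock waiting from memory effects. All names are invented for the example.

```java
import java.util.concurrent.CountDownLatch;

// Rough timing sketch; not a rigorous benchmark and not Lucene code.
class SyncCostSketch {

    private final int value = 42;

    int plainGet() {
        return value;
    }

    synchronized int syncGet() {
        return value;
    }

    // Makes `calls` invocations on each of `threads` threads against one shared
    // target and returns the elapsed wall-clock time in nanoseconds.
    static long timeCalls(final SyncCostSketch target, final boolean useSync,
                          int threads, final int calls, final long[] sums) {
        final CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await(); // release all workers at the same moment
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    long sum = 0;
                    for (int i = 0; i < calls; i++) {
                        sum += useSync ? target.syncGet() : target.plainGet();
                    }
                    sums[slot] = sum; // keep the result so the loop is not discarded
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        SyncCostSketch target = new SyncCostSketch();
        int threads = 8;
        int calls = 1000000;
        long plain = timeCalls(target, false, threads, calls, new long[threads]);
        long synced = timeCalls(target, true, threads, calls, new long[threads]);
        System.out.println("plain ns: " + plain + ", synchronized ns: " + synced);
    }
}
```

The numbers will vary a lot between JVM builds and machines, so they are only useful for comparing runs on the same setup.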

    -John

  • Chris Hostetter at Jul 29, 2010 at 10:23 pm
    : > My other question is whether there are planned performance
    : > enhancements to address this loss of performance?
    :
    : These APIs are very different in the next major release (4.0) of
    : Lucene, so except for problems spotted by users like you, there's not
    : much more dev happening against them.

    To be clear: the 3.x branch is still under active development, so if you
    spot performance improvements that can be made (while maintaining API back
    compat) then those suggestions would absolutely be welcome -- but many
    developers working on Lucene internals are more focused on potential
    performance enhancements to the 4.x branch of development (trunk)


    -Hoss


  • Nader, John P at Jul 30, 2010 at 7:17 pm
    Mike,

    We took your suggestion and refactored like this:

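    TermEnum termEnum = indexReader.terms(new Term(field, "0"));
    TermDocs allTermDocs = indexReader.termDocs();

    // terms(Term) is already positioned on the first term >= the one given,
    // so test the current term before calling next() (otherwise it is skipped).
    do {
        Term t = termEnum.term();
        if (t == null || !t.field().equals(field)) break;
        allTermDocs.seek(termEnum);
        while(allTermDocs.next()) {
            ...do something to each doc...
        }
    } while(termEnum.next());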

    The results were much better than creating a new TermDocs for each term. We were about 6x faster than the old algorithm in Lucene 3.0.2, and 3x faster than the old algorithm in Lucene 2.4.0.

    Thanks much for your help.

    With respect to a 3.0.2 enhancement that would yield the same performance without using different APIs, I'm not sure what the impact would be. We definitely have proven the synchronization had a dramatic impact in our environment. But the synchronization in the constructor looks like it is necessary in other API calls.

    BTW, that environment is Java 1.6.0_12 on 64-bit SUSE Linux with 32G of RAM and using MMapDirectory.

    Thanks.

    -John


  • Nader, John P at Aug 19, 2010 at 2:49 am
    This is a follow-up to my original post about term browsing performance problems after our upgrade to Lucene 3.0.2. The suggestions were helpful and did give us a performance increase. However, in a full-scale environment under load, the performance issue remained.

    Our investigation led us to an issue with our patch level of Java. This is not too surprising considering we were on revision 12, and the current is 21. The behavior we saw was that some JVMs would come up in a state where browsing ran very slowly, while others would run as expected. The JVM would stay in that state until it was restarted.

    The only change we made was to upgrade our JVM to 1.6.0_21. At that point, all JVMs performed consistently with no issues. I wanted to make sure I share this info with the forum as others may encounter similar problems.

    I have no specific info about what changed between 2.4.0 and 3.0.2 that would cause an issue with Java 1.6.0_12. Nor do any Java release notes indicate what might have been fixed to address this issue. I strongly suspect that it is JIT compiler related and would be glad to share thoughts on this with anyone who is interested.
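For anyone who wants to guard against running on one of the older Java 6 builds, a startup check along these lines is one option. It is only a sketch: the threshold of update 21 comes from the observation above, not from any documented JVM fix, the class and method names are invented, and it only understands the 1.6.0_NN version format.

```java
// Sketch of a startup guard; the update-21 threshold is taken from this thread.
class Java6UpdateCheck {

    private static final String JAVA6_PREFIX = "1.6.0_";

    // Returns the update number from a "1.6.0_NN" style version string,
    // or -1 when the string is not a Java 6 version in that format.
    static int java6Update(String version) {
        if (version == null || !version.startsWith(JAVA6_PREFIX)) {
            return -1;
        }
        String rest = version.substring(JAVA6_PREFIX.length());
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        if (end == 0) {
            return -1;
        }
        return Integer.parseInt(rest.substring(0, end));
    }

    // True when the version is a Java 6 build older than update 21.
    static boolean isSuspectJava6(String version) {
        int update = java6Update(version);
        return update >= 0 && update < 21;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        if (isSuspectJava6(version)) {
            System.err.println("Warning: Java " + version
                    + " is older than 1.6.0_21; term browsing may be slow.");
        }
    }
}
```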

    -John


  • Michael McCandless at Aug 19, 2010 at 9:39 am
    Phew... thanks for bringing closure! And, good sleuthing.

    So the takeaway is JRE 1.6.0_12 = BAD and JRE 1.6.0_21 = GOOD.

    Mike

Discussion Overview
group: java-user
categories: lucene
posted: Jul 28, '10 at 6:40p
active: Aug 19, '10 at 9:39a
posts: 7
users: 3
website: lucene.apache.org
