Hello all,

We have very large documents with large numbers of unique terms. Our documents average about 800,000 KB and about 200,000 tokens. In trying to understand how often the ramBuffer gets flushed to disk we turned on the IndexWriter log.
<infoStream file="/tmp/IndexWriter.log">true</infoStream>

With the Solr default setting of ramBufferSizeMB=32 it appears that the buffer is flushing every 2-10 documents.

When we see this in the IndexWriter log: "flush postings as segment _9hw numDocs=2", does this mean that the buffer is writing a segment that contains only 2 documents?

Also how do we interpret the following?

"DW: ramUsed=33.467 MB newFlushedSize=6764406 docs/MB=0.155 new/old=19.276%"

What do "docs/MB" and "new/old" mean?

Tom


  • Michael McCandless at Mar 17, 2011 at 8:33 pm
    Hi Tom,

    Answers below...
    On Thu, Mar 17, 2011 at 1:19 PM, Burton-West, Tom wrote:
    Hello all,

    We have very large documents with large numbers of unique terms.   Our documents average about 800,000 KB and about 200,000 tokens.  In trying to understand how often the ramBuffer gets flushed to disk we turned on the IndexWriter log.
    <infoStream file="/tmp/IndexWriter.log">true</infoStream>

    With the Solr default setting of ramBufferSizeMB=32 it appears that the buffer is flushing every 2-10 documents.

    When we see this in the IndexWriter log: "flush postings as segment _9hw numDocs=2", does this mean that the buffer is writing a segment that contains only 2 documents?

    Yes.

    Also how do we interpret the following?

    "DW:   ramUsed=33.467 MB newFlushedSize=6764406 docs/MB=0.155 new/old=19.276%"

    What do "docs/MB" and "new/old" mean?

    docs/MB is the document count divided by the size of the newly flushed segment in MB.

    new/old measures "RAM efficiency", i.e., the size of the flushed segment
    divided by the RAM consumed before the flush. In this case IndexWriter
    flushed because the RAM buffer hit 33.467 MB, but the newly
    flushed segment was only 6.45 MB (= 6764406 / 1024 / 1024). The new/old
    ratio is expected to be well below 100%, i.e., we pay a price to have the
    malleable data structures in RAM that allow for incoming docs to be
    inverted... though a "good" number is maybe 30%. The more unique
    terms you have, the lower this number will be...
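
As a rough cross-check of the arithmetic above, here is a minimal Python sketch that parses the quoted "DW:" line and recomputes the figures. The regular expression and the helper name `explain_dw_line` are assumptions based only on the single sample line in this thread, not on any documented log format, so other Lucene versions may print these fields differently.

```python
import re

# Pattern for the "DW:" flush summary line quoted in this thread.
# Assumption: the field order and spelling match that one sample line.
DW_LINE = re.compile(
    r"ramUsed=(?P<ram_mb>[\d.]+) MB\s+"
    r"newFlushedSize=(?P<flushed_bytes>\d+)\s+"
    r"docs/MB=(?P<docs_per_mb>[\d.]+)\s+"
    r"new/old=(?P<new_old_pct>[\d.]+)%"
)

BYTES_PER_MB = 1024 * 1024


def explain_dw_line(line):
    """Recompute the derived figures from one DW flush summary line."""
    m = DW_LINE.search(line)
    if m is None:
        raise ValueError("not a DW flush summary line")
    ram_mb = float(m["ram_mb"])
    flushed_mb = int(m["flushed_bytes"]) / BYTES_PER_MB
    docs_per_mb = float(m["docs_per_mb"])
    return {
        "ram_mb": ram_mb,
        "flushed_mb": flushed_mb,
        # new/old: flushed segment size as a percentage of RAM used before the flush
        "new_old_pct": 100.0 * flushed_mb / ram_mb,
        # docs/MB = docCount / flushed_mb, so docCount is roughly docs/MB * flushed_mb
        "approx_docs": docs_per_mb * flushed_mb,
    }


sample = "DW: ramUsed=33.467 MB newFlushedSize=6764406 docs/MB=0.155 new/old=19.276%"
stats = explain_dw_line(sample)
print(stats)
```

For the sample line, the recomputed ratio agrees with the logged new/old value to within rounding, and docs/MB multiplied by the flushed size implies that this particular flush held roughly one document.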

    --
    Mike

    http://blog.mikemccandless.com


Discussion Overview
group: java-user
category: lucene
posted: Mar 17, 2011 at 5:20 pm
active: Mar 17, 2011 at 8:33 pm
posts: 2
users: 2
website: lucene.apache.org