
[Solr-user] Permanently Full Old Generation...

Annette Newton
Nov 28, 2012 at 4:06 pm
Hi,



I'm hoping someone can help me with an issue we are encountering with
SolrCloud.



We are seeing strange GC behaviour after running SolrCloud under quite
heavy insert load for a period of time. The old generation becomes full and
no amount of garbage collection will free up the memory. I have attached a
memory profile; as you can see, it gets progressively worse as the day goes
on, to the point where we are doing full garbage collections all the time.
The only way I have found to resolve this is to reload the core; subsequent
garbage collections then reclaim the used space, which is what happened at
3pm on the memory profile. All the nodes eventually display the same
behaviour.
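
For reference, the core reload described above is the standard CoreAdmin RELOAD
call; the host, port and core name below are placeholders rather than values
from this thread:

    http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1

A reload builds a fresh SolrCore (new searcher and caches) and closes the old
one, which is consistent with the old generation only becoming reclaimable
after the reload.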



We have multiple threads running, adding batches of up to 100 documents at a
time. I have also attached our Schema and Config.



We are running 4 shards, each with a single replica, have a 3-node ZooKeeper
setup, and the 8 Solr instances are AWS High-Memory Double Extra Large boxes
with 34.2 GB memory and 4 virtual cores.



Thanks in advance



Annette Newton
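
The multi-threaded batched-add pattern described above comes down to something
like the following SolrJ sketch (Solr 4.x API; the ZooKeeper address,
collection name and field names are placeholders, not taken from the thread):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // Route updates through ZooKeeper so each document reaches its shard leader.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100; i++) {                  // batches of up to 100 documents
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);              // placeholder field names
                doc.addField("text", "example body " + i);
                batch.add(doc);
            }
            server.add(batch);                               // one request per batch
            server.shutdown();
        }
    }

Each indexing thread would run its own copy of the batching loop; commits can
be left to autoCommit in solrconfig.xml rather than issued per batch.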

8 responses

  • Jack Krupansky at Nov 28, 2012 at 4:23 pm
    Have you done a Java heap dump to see what the most common objects are?

    -- Jack Krupansky

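    A class histogram like the one in the next message is normally captured
    with the stock JDK tools, e.g. "jmap -histo:live <pid>", or taken from a
    full heap dump browsed with jhat (which serves on port 7000 by default,
    matching the localhost:7000 links below). A programmatic sketch using the
    HotSpot diagnostic MBean, with a placeholder output path:

        import java.lang.management.ManagementFactory;

        import com.sun.management.HotSpotDiagnosticMXBean;

        public class HeapDumper {
            public static void main(String[] args) throws Exception {
                // Proxy to the HotSpot diagnostic MBean of the current JVM.
                HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                        ManagementFactory.getPlatformMBeanServer(),
                        "com.sun.management:type=HotSpotDiagnostic",
                        HotSpotDiagnosticMXBean.class);
                // live = true dumps only objects that are still reachable.
                diag.dumpHeap("/tmp/solr-heap.hprof", true);
            }
        }

    The MBean dumps the heap of the JVM it runs in, so this would have to run
    inside Solr (or be invoked over JMX); jmap against the Solr process id is
    the simpler route.
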
  • Annette Newton at Nov 28, 2012 at 4:29 pm
    Hi,




    Class histogram (from http://localhost:7000/histo):

    Instance Count   Total Size    Class
    3410607          699552656     [C
    319813           332605520     [Lorg.apache.lucene.util.fst.FST$Arc;
    898433           170462152     [Ljava.lang.Object;
    856551           149091216     [Ljava.util.HashMap$Entry;
    2802560          100892160     java.util.HashMap$Entry
    3295405          52726480      java.lang.String
    750750           42042000      java.util.HashMap
    319896           39027312      org.apache.lucene.util.fst.FST
    748801           35942448      org.apache.lucene.index.FieldInfo
    1032862          28932353      [B
    516680           26867360      java.util.LinkedHashMap$Entry
    319831           24307156      org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader
    749522           23984704      java.util.Collections$UnmodifiableMap
    863640           17272800      java.util.HashMap$FrontCache
    709550           11352800      org.apache.lucene.util.BytesRef
    109810           7137650       java.util.LinkedHashMap
    320987           5268068       [I
    59540            5239520       org.apache.lucene.analysis.core.WhitespaceTokenizer
    1149625          4598500       java.lang.Integer
    169168           2706688       org.apache.lucene.util.AttributeSource$State
    19876            1371444       java.util.TreeMap$Node
    56394            1353456       [Lorg.apache.lucene.util.AttributeSource$State;
    56394            1127880       org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl
    52888            951984        org.apache.lucene.analysis.util.CharacterUtils$CharacterBuffer
    56404            902464        org.apache.lucene.analysis.Analyzer$TokenStreamComponents
    5706             867312        java.lang.Class
    14017            686833        org.apache.lucene.util.fst.FST$Arc
    56409            451272        java.util.HashMap$Values
    56380            451040        org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl
    56396            225584        org.apache.lucene.analysis.tokenattributes.PositionIncrementAttributeImpl
    1                161048        [Ljava.lang.Integer;
    3456             96768         java.util.concurrent.ConcurrentHashMap$HashEntry
                                   [[C





    This is a sample of the heap dump I took when we encountered the problem previously.



  • Shawn Heisey at Nov 28, 2012 at 4:48 pm

    Looking at that jconsole graph, I am not seeing a problem. It looks
    fairly normal to me, especially for heavy indexing.

    Solr has been up for at least 29 hours, just based on the graph I can
    see, which may not reflect the full JVM uptime - click on VM Summary to
    see that. In that time, you've only spent a total of two minutes in
    garbage collection, which does not seem problematic to me. Also, your
    Old Gen is not full - it's only using 1.2GB out of 5.5GB available. A
    full GC (ConcurrentMarkSweep) would only be automatically triggered when
    the Old Gen reaches that 5.5GB mark.

    Are you seeing actual performance problems, or are you just concerned
    about what you see when you watch the memory?

    General note about memory and Solr: I have had very good luck with the
    following java memory options. Each machine handles three 22GB index
    shards on one JVM.

    -Xms4096M
    -Xmx8192M
    -XX:NewRatio=1
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled

    Thanks,
    Shawn
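
    The "two minutes in garbage collection" figure comes from the JVM's
    garbage-collector MBeans, which jconsole reads over JMX; the same counters
    can be read in-process with standard java.lang.management calls, roughly:

        import java.lang.management.GarbageCollectorMXBean;
        import java.lang.management.ManagementFactory;

        public class GcReport {
            public static void main(String[] args) {
                // Cumulative collection count and time for each collector in this JVM.
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%s: %d collections, %d ms total%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
            }
        }

    With the ParNew/CMS flags discussed in this thread, the two collectors
    report as "ParNew" and "ConcurrentMarkSweep".
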
  • Annette Newton at Nov 28, 2012 at 5:02 pm
My JVM settings:


    -Xmx8192M -Xms8192M -XX:+CMSScavengeBeforeRemark -XX:NewRatio=2
    -XX:+CMSParallelRemarkEnabled -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
    -XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=70
    -XX:+UseCMSInitiatingOccupancyOnly -XX:-CMSIncrementalPacing
    -XX:CMSIncrementalDutyCycle=75

I turned off incremental pacing and set CMSInitiatingOccupancyFraction after
issues with nodes being reported as down due to long garbage collection
pauses. The problem with the memory profile was visible before the drop down
to 1.2GB (which was when I reloaded the core). My concern is that collecting
the old generation didn't seem to free any of the heap, and we went from
occasionally collecting the old gen to collecting it all the time.

    Please see the attached gc log.

  • Shawn Heisey at Nov 29, 2012 at 3:29 pm

    I am on the train for my morning commute, so I have some time, but no
    access to the log or graph.

Confession time: GC logs make me go glassy-eyed and babble incoherently,
but I did take a look at it. I saw 18 CMS collections and three entries
near the end that said Full GC. It looks like these collections take 6 to
    8 seconds. That is pretty nasty, but probably unavoidable, so the goal is
    to make them happen extremely infrequently - do young generation
    collections instead.

The thing that seems to make GC less of a problem for Solr is maximizing
the young generation memory pool. Based on the available info, I would
start with making NewRatio 1 instead of 2. This will increase the eden
size and decrease the old gen size. You may even want to use an explicit
-Xmn of 6144m (6 GB). If that doesn't help, you might actually need 6GB or
so of old gen heap, so try increasing the overall heap size to 9 or 10 GB
and going back to a NewRatio of 2.

    Thanks,
    Shawn
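
    For reference, HotSpot treats NewRatio as the old:young ratio, so
    (ignoring survivor-space details) the suggestions above work out roughly
    as:

        young ~ heap / (NewRatio + 1),   old ~ heap * NewRatio / (NewRatio + 1)

        -Xmx8192M -XX:NewRatio=2   ->  young ~ 2.7 GB, old ~ 5.5 GB   (current settings)
        -Xmx8192M -XX:NewRatio=1   ->  young ~ 4.0 GB, old ~ 4.0 GB
        -Xmx8192M -Xmn6144m        ->  young = 6.0 GB, old = 2.0 GB   (the explicit -Xmn suggestion)

    The ~5.5 GB old generation in the first row matches the "out of 5.5GB
    available" figure earlier in the thread.
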
  • Andy Kershaw at Nov 29, 2012 at 5:45 pm
Thanks for responding, Shawn.

Annette is away until Monday, so I am looking into this in the meantime.
Looking at the times of the Full GC entries at the end of the log, I think
they are collections we started manually through jconsole to try to reduce
the size of the old generation. This only seemed to have an effect when we
reloaded the core first, though.

It is my understanding that the eden size is deliberately kept small to keep
the ParNew collection time down. If a collection takes too long, the node is
flagged as down.
  • Shawn Heisey at Nov 29, 2012 at 6:16 pm

    Your ParNew collections are taking less than 1 second (some WAY less
    than one second) to complete and the CMS collections are taking far
    longer -- 6 seconds seems to be a common number in the GC log. GC is
    unavoidable with Java, so if there has to be a collection, you
    definitely want it to be on the young generation (ParNew).

Controversial idea coming up, with nothing concrete to back it up, so you
might want to wait for a committer to weigh in: I have seen a lot of recent
development work relating to SolrCloud and shard stability. You may want to
check out branch_4x from SVN and build that, rather than use 4.0. I don't
have any idea what the timeline for 4.1 is, but based on what I saw for 3.x
releases, it should be released relatively soon.

The above advice is a bad idea if you have to be able to upgrade from
one 4.1 snapshot to a later one without reindexing. There is a
possibility that the 4.1 index format will change before release and
require a reindex; it has happened at least twice already.

    Thanks,
    Shawn
  • Walter Underwood at Nov 29, 2012 at 6:31 pm
    Several suggestions.

1. Adjust the traffic load so that CPU stays around 75%. When you hit 100%, you are already in an overload state and the variance of the response times goes way up. You'll have very noisy benchmark data.

    2. Do not force manual GCs during a benchmark.

    3. Do not force merge (optimise). That is a very expensive operation and will cause slowdowns.

4. Make eden big enough to hold all the data allocated during a request, for all simultaneous requests. All of that is garbage after the end of the request; if eden fills up, allocations spill into the tenured space and cause it to grow unnecessarily. We use an 8GB heap and 2GB eden. I like setting the size better than setting ratios (see the sketch after this list).

    5. What version of the JVM are you using?
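
    Point 4 in concrete terms - a sketch of the explicit-size style Walter
    prefers rather than his exact production flags (note that -Xmn sizes the
    whole young generation, i.e. eden plus the survivor spaces):

        -Xms8192M -Xmx8192M -Xmn2048M        (explicit young-generation size)
        -Xms8192M -Xmx8192M -XX:NewRatio=3   (ratio form; also gives a ~2 GB young generation)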

    wunder
    --
    Walter Underwood
    wun...@...org
