On Thu, Jun 16, 2016 at 1:08 PM, Jeff Wartes wrote:
Check your gc log for CMS “concurrent mode failure” messages.
If a concurrent CMS collection fails, it does a stop-the-world pause while
it cleans up using a *single thread*. This means the stop-the-world CMS
collection in the failure case is typically several times slower than a
concurrent CMS collection. The single-thread business means it will also be
several times slower than the Parallel collector, which is probably what
you’re seeing. I understand that it needs to stop the world in this case,
but I really wish the CMS failure case would fall back to the Parallel
collector instead.
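A quick way to check for these is something like the following (the gc.log path is an assumption; point it at whatever your -Xloggc flag actually writes):

```shell
# Count CMS concurrent mode failures in a GC log.
# Assumption: the log is at ./gc.log; substitute your -Xloggc path.
LOG="gc.log"
if [ -f "$LOG" ]; then
  echo "concurrent mode failures: $(grep -c 'concurrent mode failure' "$LOG")"
else
  echo "no GC log found at $LOG"
fi
```

If the count is nonzero and lines up with your long pauses, that's your smoking gun.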
The Parallel collector is always going to be the fastest at getting rid of
garbage, but only because it stops all the application threads while it
runs, so it’s got less complexity to deal with. That said, it’s probably
not going to be orders of magnitude faster than a (successful) concurrent
CMS collection.
Regardless, the bigger the heap, the bigger the pause.
If your application is generating a lot of garbage, or can generate a lot
of garbage very suddenly, CMS concurrent mode failures are more likely. You
can turn down the -XX:CMSInitiatingOccupancyFraction value in order to
give the CMS collection more of a head start at the cost of more frequent
collections. If that doesn’t work, you can try using a bigger heap, but you
may eventually find yourself trying to figure out what about your query
load generates so much garbage (or causes garbage spikes) and trying to
address that. Even G1 won’t protect you from highly unpredictable garbage
generation spikes.
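As a sketch of the head-start tuning (assuming the flags go into GC_TUNE in solr.in.sh; 65 is just an illustrative starting point, not a recommendation):

```shell
# Kick off CMS cycles earlier so concurrent collection has more headroom
# before the old generation fills up. Assumption: Solr picks these up
# via GC_TUNE in solr.in.sh.
GC_TUNE="-XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=65 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

Without -XX:+UseCMSInitiatingOccupancyOnly the JVM treats the fraction as a hint only for the first cycle, which makes the tuning hard to reason about.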
In my case, for example, I found that a very small subset of my queries
were using the CollapseQParserPlugin, which requires quite a lot of memory
allocations, especially on a large index. Although generally this was fine,
if I got several of these rare queries in a very short window, it would
always spike enough garbage to cause CMS concurrent mode failures. The
single-threaded concurrent-mode failure would then take long enough that
the ZK heartbeat would fail, and things would just go downhill from there.
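One stopgap for the heartbeat side (assuming the stock solr.xml, which resolves zkClientTimeout from a system property) is to raise the ZK session timeout so a single long pause doesn't drop the node, though that only masks the GC problem:

```shell
# Assumption: default solr.xml, which resolves ${zkClientTimeout:...}
# from this system property. A larger timeout tolerates longer pauses
# but does not fix the underlying garbage spikes.
SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=30000"
```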
On 6/15/16, 3:57 PM, "Cas Rusnov" wrote:
Hey Shawn! Thanks for replying.
Yes I meant HugePages not HugeTable, brain fart. I will give the
transparent off option a go.
I have attempted to use your CMS configs as is and also the default
settings and the cluster dies under our load (basically a node will get a
35-60s GC STW and then the others in the shard will take the load, and they
will in turn get long STWs until the shard dies), which is why basically in
a fit of desperation I tried out ParallelGC and found it to be half-way
acceptable. I will run a test using your configs (and the defaults) again
just to be sure (since I'm certain the machine config has changed since we
used your unaltered settings).
On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey wrote:
On 6/15/2016 3:05 PM, Cas Rusnov wrote:
After trying many of the off the shelf configurations (including CMS
configurations but excluding G1GC, which we're still taking the
warnings about seriously), numerous tweaks, rumors, various instance
sizes, and all the rest, most of which regardless of heap size and
newspace size resulted in frequent 30+ second STW GCs, we settled on
the following configuration which leads to occasional high GCs but
mostly stays between 10-20 second STWs every few minutes (which is
almost acceptable):

  -XX:+AggressiveOpts -XX:+UnlockDiagnosticVMOptions
  -XX:+UseAdaptiveSizePolicy -XX:+UseLargePages
  -XX:+UseParallelGC -XX:+UseParallelOldGC
  -XX:MaxGCPauseMillis=15000 -XX:MaxNewSize=12000m
  -XX:ParGCCardsPerStrideChunk=4096 -XX:ParallelGCThreads=16
  -Xms31000m
You mentioned something called "HugeTable" ... I assume you're talking
about huge pages. If that's what you're talking about, have you also
turned off transparent huge pages? If you haven't, you might want to
completely disable huge pages in your OS. There's evidence that the
transparent option can affect performance.
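For what it's worth, on most Linux distributions transparent huge pages can be switched off at runtime (root required; the setting doesn't survive a reboot, so it needs to be re-applied from an init script):

```shell
# Disable transparent huge pages at runtime; these are the common
# Linux sysfs locations (path may vary by distribution).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Verify: the bracketed value is the active mode,
# e.g. "always madvise [never]"
cat /sys/kernel/mm/transparent_hugepage/enabled
```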
I assume you've probably looked at my GC info at the following URL:
http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
The parallel collector is most definitely not a good choice. It does
not optimize for latency. It's my understanding that it actually
prefers full GCs, because it is optimized for throughput. Solr thrives
on good latency, throughput doesn't matter very much.
If you want to continue avoiding G1, you should definitely be using
CMS. My recommendation right now would be to try the G1 settings on my
wiki page under the heading "Current experiments" or the CMS settings
just below that.
The out-of-the-box GC tuning included with Solr 6 is probably a better
option than the parallel collector you've got configured now.