Following up on my previous post, here are the results of my research.
First, I reproduced the issue with a devbox deployment of CF from
github.com/cloudfoundry/vcap (commit hash:
695fd2fff754bc25e7b6b8fccc4e2dc6c4e97b3d). The commit hash of Cloud
Controller was 31ab65cdf0b9863677675b3812aac7305001267e. This is a bit
old, but it doesn't matter, as explained below.
The first graph shows the number of subscriptions of a NATS client in Cloud
Controller (*) while the vcap-yeti full scenario test is running. The test
took about 7000 seconds, and the number grew to about 300.
The growth is not very fast, but it is clearly monotonic:
subscriptions are leaking. They may eventually eat up memory of both the
NATS server and Cloud Controller.
After some investigation, I found the cause in the stager-client gem, not in
Cloud Controller itself. It's in the VCAP::Stager::Client::EmAware#stage
method. At line 32
(https://github.com/cloudfoundry/stager-client/blob/master/lib/vcap/stager/client/em_aware.rb#L32),
sid = @nats.request(@queue, request_details_json) do |result|
a NATS request is issued.
Then, at line 46
(https://github.com/cloudfoundry/stager-client/blob/master/lib/vcap/stager/client/em_aware.rb#L46),
@nats.timeout(sid, timeout_secs) do
a timeout is set.
It appears OK, but if you want the timeout to fire unconditionally after the
period, you have to pass the :expected option to the timeout method. This is
because, if the expected number of responses is received within the timeout
period, the timeout is *cancelled*. This is the intended behavior of NATS,
not a bug. The default value of :expected is 1, so if no value is given
explicitly, the NATS client cancels the timeout as soon as (just) one
response is received.
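This cancellation behavior can be sketched with a small plain-Ruby model of the client-side bookkeeping (the class and method names below are invented for illustration; the real client is the EventMachine-based nats gem, which needs a running server):

```ruby
# Minimal plain-Ruby model of how a NATS client handles `timeout`
# with the :expected option. Illustrative only; not the real nats gem.
class TimeoutModel
  def initialize
    @received = 0
    @expected = nil
    @timeout_active = false
    @subscribed = false
  end

  # Stand-in for NATS#request: registers a subscription, returns a fake sid.
  def request(&blk)
    @subscribed = true
    :sid_1
  end

  # Stand-in for NATS#timeout: :expected defaults to 1.
  def timeout(sid, secs, opts = {})
    @expected = opts.fetch(:expected, 1)
    @timeout_active = true
  end

  # Simulates a response arriving on the subscription.
  def deliver_response(sid)
    @received += 1
    # Once :expected responses have arrived, the timeout is *cancelled*,
    # but the subscription itself is NOT removed.
    @timeout_active = false if @received >= @expected
  end

  def timeout_active?
    @timeout_active
  end

  def subscribed?
    @subscribed
  end
end

client = TimeoutModel.new
sid = client.request { }
client.timeout(sid, 30)       # no :expected given, so it defaults to 1
client.deliver_response(sid)  # a single response arrives
client.timeout_active?        # => false: timeout cancelled, never fires
client.subscribed?            # => true:  subscription still registered (leaked)
```

So with the default :expected of 1, the first staging response silently cancels the timeout while leaving the subscription alive, which matches the monotonic growth in the graph.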
If you want to unsubscribe from a subject after a certain number of responses
is received, you have to set that number with the :max option of the request
method. :max is evaluated before :expected, so when both are given the same
value, :max wins and the subscription is unsubscribed.
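The evaluation order can likewise be sketched with a small plain-Ruby model (again, this class is invented for illustration and is not the real nats gem):

```ruby
# Minimal plain-Ruby model of :max vs :expected on a NATS subscription.
# Illustrative only; not the real nats gem.
class SubscriptionModel
  attr_reader :subscriptions

  def initialize
    @subscriptions = {}
    @next_sid = 0
  end

  # Stand-in for NATS#request: registers a subscription, returns its sid.
  def request(subject, opts = {}, &blk)
    sid = (@next_sid += 1)
    @subscriptions[sid] = { received: 0, max: opts[:max] }
    sid
  end

  # Stand-in for NATS#timeout (the timeout side is omitted here;
  # see the previous sketch for the :expected cancellation behavior).
  def timeout(sid, secs, opts = {})
    @subscriptions[sid][:expected] = opts.fetch(:expected, 1)
  end

  # Simulates a response arriving on the subscription.
  def deliver(sid)
    sub = @subscriptions[sid] or return
    sub[:received] += 1
    # :max is evaluated first: reaching it removes the subscription
    # entirely, so nothing is left behind to leak.
    @subscriptions.delete(sid) if sub[:max] && sub[:received] >= sub[:max]
  end
end

nats  = SubscriptionModel.new
leaky = nats.request("staging.queue") { }          # no :max (the bug)
fixed = nats.request("staging.queue", max: 1) { }  # :max => 1 (the fix)
nats.timeout(leaky, 30)
nats.timeout(fixed, 30)
nats.deliver(leaky)
nats.deliver(fixed)
nats.subscriptions.key?(leaky)  # => true  (still registered: leaked)
nats.subscriptions.key?(fixed)  # => false (cleanly unsubscribed)
```

With :max => 1 the client removes the subscription as soon as the single staging response arrives, so there is nothing left to clean up later.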
To fix the current subscription leak, I have set the :max option to 1. The
graph of the same experiment with the fixed version looks OK (actually,
almost all the figures were 1 or 2; 3 appeared only once).
Now I have filed a pull request with this fix on GitHub.
Please review it.
Thanks in advance.
On Monday, May 20, 2013 6:43:19 AM UTC+9, Noburou TANIGUCHI wrote:
Recently I had time to research this issue again. I succeeded in reproducing
the issue with the github.com/cloudfoundry/vcap version of CF and
ascertained its cause.
I will report the details of the research in another post.
On Friday, February 22, 2013 10:06:50 PM UTC+9, James Bayer wrote:
We are aiming to have a Team Edition system (also known as v2 or ng)
recommended for new development in the April timeframe. The engineering
team is spending virtually all of their time in that new code base
which includes cloud_controller_ng. We have not observed the same issue in
our production environments to my knowledge. Therefore, if you can identify
the root cause of the issue, we would gladly take a pull request on the
cloud_controller env to fix it. However, it is highly unlikely we will
prioritize attempting to find the root cause for your issue that we have
not observed in our environments.
On Friday, February 22, 2013 4:43:38 AM UTC-8, Noburou TANIGUCHI wrote:
I understand that what you said is that we should wait for the Next
Generation Cloud Controller.
So the next question is: when is the Next Generation CC's production
release? Or when will the legacy CC be abandoned? Whether we invest our time
in fixing this issue depends on the answer(s).
Thank you.
On Friday, February 22, 2013 2:52:53 AM UTC+9, Matt Reider wrote:
This is happening in the legacy Cloud Controller. The Cloud Foundry
development team is focused on the Next Generation Cloud Controller. This
unsavory flavor of product planning (having two code bases) will be avoided
at all costs in the future.
A pull request against the legacy Cloud Controller's github repo, from
the community, would be welcome and merged quickly.
On Thu, Feb 21, 2013 at 5:24 AM, Derek Collison wrote:
The nats-server's memory growth is attributed to the CloudController
creating subscriptions and not removing them, i.e., leaking them. You can
see this using the nats-top utility.
The CF team should fix it; it's a bug.
On Wednesday, February 20, 2013 2:16:35 PM UTC-8, Noburou TANIGUCHI wrote: