Hi,

A short while ago, we found a memory problem with the NATS server.
After some discussion with Derek Collison (the repository owner of NATS),
we reached the conclusion that there is a "subscription leak" in the Cloud
Controller.
You can see the discussion here:
https://github.com/derekcollison/nats/issues/49

The data in the discussion shows that the number of subscriptions grew from
500 to 16600 in about 5 days.
The leak causes a slow but steady increase in the NATS server's memory
consumption.

The first graph attached is a monthly report (taken by munin) of memory
usage on the NATS server host. The NATS server is almost the only application
on the host, so the green area can be assumed to be memory used by the NATS
server. We restarted the NATS server in the middle of the graph, and you can
see a sharp decrease in application memory usage there.

The next graph is a yearly report. It shows gradual growth in memory usage
at the rightmost end.

I suspect a large part of the subscription leak may be the temporary subjects
used to subscribe to and publish responses for NATS requests.

Because NATS is a key component of Cloud Foundry, we think it is important
to solve this problem, and we hope it is solved soon.
Thank you.

https://lh3.googleusercontent.com/-3_Xnp93qd6c/USVCL9gujvI/AAAAAAAAAAM/H557D7UBESM/s1600/memory-pinpoint%253D1358456074%252C1361307274.png
https://lh6.googleusercontent.com/-TtcB2eZosbg/USVCRyrFKII/AAAAAAAAAAU/qW2t3l5uwv8/s1600/memory-pinpoint%3D1326747274%2C1361307274.png


  • David Laing at Feb 20, 2013 at 11:08 pm
    Nice sleuthing!

    Would it be fair to assume that none of the other CF components have a
    similar memory leak? (i.e., you are running munin on all your CF
    instances, and the NATS one is the only one that exhibited this memory
    growth pattern.)

    :D


    --
    David Laing
    Trading API @ City Index
    david@davidlaing.com
    http://davidlaing.com
    Twitter: @davidlaing
  • Noburou TANIGUCHI at Feb 22, 2013 at 12:28 pm
    I want to clarify a bit more about the issue (and my previous post).

    There are three points to understand about this issue.

    1. Only the number of subscriptions from CC shows abnormal growth

    I wrote:
    The data in the discussion shows that the number of subscriptions grew from
    500 to 16600 in about 5 days.

    This was missing a few words. I should have written:
    The data in the discussion shows that the number of subscriptions *from CC*
    grew from 500 to 16600 in about 5 days.

    Let me also explain how to read the table in
    https://github.com/derekcollison/nats/issues/49#issuecomment-13169542

    On day 1 (the first day), the number of subscriptions from CC's host was about 500.
    (Because I couldn't insert images properly, I am pasting the URLs of the images here.)
    http://www.hostedredmine.com/attachments/download/46396/nats-top-day-1.png

    On day 6 (the last day), it had grown to 16.6K.
    http://www.hostedredmine.com/attachments/download/46397/nats-top-day-6.png

    Subscriptions from the other components' hosts did not show such odd
    behavior.

    2. A NATS server assigns a memory buffer to each subscription.

    When a NATS client subscribes to a subject on a NATS server, the server
    assigns a memory buffer to the subscription.
    (It's my understanding that the memory buffer is assigned per subscription,
    not per subject.)
    http://www.hostedredmine.com/attachments/download/46398/nats-sub.png

    So the more subscriptions there are, the more memory a NATS *server* consumes.


    3. A NATS request uses a temporary subject for sending/receiving responses.

    When a request is sent to a subject, the subscribers of that subject
    send responses to the requester via a temporary subject.
    http://www.hostedredmine.com/attachments/download/46399/nats-request.png

    After it has served its purpose, the requester should unsubscribe from the
    temporary subject.
    But, going back to point 1, CC doesn't seem to unsubscribe.
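
    As a rough illustration of this request/response pattern, here is a minimal,
    hypothetical sketch using the Ruby nats client (the subject name and payload
    are made up; this is not CC's actual code). The point is that each request
    subscribes to a temporary inbox subject, and the requester has to unsubscribe
    from it at some point:

         require 'nats/client'

         NATS.start do
           # request() creates a temporary inbox subject, subscribes to it, and
           # publishes the request with that inbox as the reply-to subject.
           sid = NATS.request('some.subject', 'request payload') do |response|
             puts "got response: #{response}"
             # The requester is responsible for removing the inbox subscription;
             # if this never happens, subscriptions accumulate on the server.
             NATS.unsubscribe(sid)
           end
         end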


    That's what I guess is happening in a CF environment.


  • Yohei Sasaki at Feb 22, 2013 at 1:02 pm
    Hi,
    I'm also interested in this topic.

    Could you tell us the commit hash of CC you used to reproduce the situation?
  • Noburou TANIGUCHI at Feb 23, 2013 at 4:28 pm
    We are using a forked, private version of CC, so I can't give you the
    exact commit hash you are asking for (I could write the private version's
    commit hash, but it would be meaningless to you). The CC we used was,
    however, based on the public version at commit
    31ab65cdf0b9863677675b3812aac7305001267e.

    I also used a modified version of nats-top that can access a monitoring
    URL that requires authorization. It is available at
    https://github.com/nsnt/nats/blob/51f2d760504bd0c4a2f4cffde7a77caa1c598fe4/bin/my-nats-top
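
    For reference, here is a hypothetical sketch of the kind of polling such a
    tool does: fetching the NATS server's HTTP monitoring endpoint with basic
    auth. The host, port, credentials, the '/connz' path, and the JSON field
    names are all assumptions for illustration, not taken from the tool above:

         require 'net/http'
         require 'json'
         require 'uri'

         # Assumed monitoring endpoint and credentials (illustrative only).
         uri = URI('http://nats.example.com:8222/connz')
         req = Net::HTTP::Get.new(uri)
         req.basic_auth('monitor_user', 'monitor_pass')

         res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
         data = JSON.parse(res.body)

         # Print per-connection subscription counts (field names assumed).
         Array(data['connections']).each do |conn|
           puts "#{conn['ip']}:#{conn['port']} subscriptions=#{conn['subscriptions']}"
         end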

  • Derek Collison at Feb 21, 2013 at 1:24 pm
    The nats-server's memory growth is attributed to the CloudController
    creating subscriptions and not removing them, i.e. leaking them. You can
    see this using the nats-top utility.

    The CF team should fix it; it's a bug.
  • Matt Reider at Feb 21, 2013 at 5:52 pm
    This is happening in the legacy Cloud Controller. The Cloud Foundry
    development team is focused on the Next Generation Cloud Controller.
    This unsavory flavor of product planning (having two code bases) will be
    avoided at all costs in the future.

    A pull request against the legacy Cloud Controller's GitHub repo, from
    the community, would be welcome and merged quickly.





  • Noburou TANIGUCHI at Feb 22, 2013 at 12:43 pm
    OK, Matt.
    I understand what you are saying to be that we should wait for the Next
    Generation Cloud Controller.

    So the next question is: when is the Next Generation CC's production
    release? Or, when will the legacy CC be abandoned? Whether we invest our
    time in fixing this issue depends on the answer(s).

    Thank you.

  • James Bayer at Feb 22, 2013 at 1:06 pm
    We are aiming to have a Team Edition system (also known as v2 or ng)
    recommended for new development in the April timeframe. The engineering
    team is spending virtually all of their time in that new code base,
    which includes cloud_controller_ng. To my knowledge, we have not observed
    the same issue in our production environments. Therefore, if you can
    identify the root cause of the issue, we would gladly take a pull request
    on cloud_controller to fix it. However, it is highly unlikely we will
    prioritize attempting to find the root cause of an issue that we have not
    observed in our environments.
  • Noburou TANIGUCHI at May 19, 2013 at 9:43 pm
    Hi,

    Recently I had time to research this issue again. I succeeded in
    reproducing the issue with the github.com/cloudfoundry/vcap version of CF
    and ascertained its cause.

    I will report the details of the research in another post from another
    account (dev@nota.m001.jp). I will use that account hereafter.

  • Dev at May 19, 2013 at 10:02 pm
    Hello again.

    Following on from the previous post, I will describe the results of the
    research.

    First, I reproduced the issue with a devbox deployment of CF from
    github.com/cloudfoundry/vcap (commit hash:
    695fd2fff754bc25e7b6b8fccc4e2dc6c4e97b3d). The commit hash of the Cloud
    Controller was 31ab65cdf0b9863677675b3812aac7305001267e. This may seem a
    bit old, but it doesn't matter, as explained below.

    The first graph shows the number of subscriptions of a NATS client in the
    Cloud Controller (*) while the vcap-yeti full scenario test is running.
    The test took about 7000 seconds, and the number grew to about 300.

    <http://www.hostedredmine.com/attachments/download/56357/nats-subs-leak-nats-top-115639.png>

    The growth is not very fast, but it is clearly monotonic: subscriptions
    are leaking. They may eat up memory on both the NATS server and the
    client side.

    After some investigation, I found the cause in the stager client gem, not
    in the Cloud Controller itself. It is in the
    VCAP::Stager::Client::EmAware#stage method.

    https://github.com/cloudfoundry/stager-client/blob/master/lib/vcap/stager/client/em_aware.rb#L27-L52

    At line 32,
    https://github.com/cloudfoundry/stager-client/blob/master/lib/vcap/stager/client/em_aware.rb#L32

         sid = @nats.request(@queue, request_details_json) do |result|
           ..

    a NATS request is issued.

    Then, at line 46,
    https://github.com/cloudfoundry/stager-client/blob/master/lib/vcap/stager/client/em_aware.rb#L46

         @nats.timeout(sid, timeout_secs) do
           ..

    a timeout is set.

    It appears OK, but if you want the timeout to fire unconditionally after
    the period, you have to give the :expected option to the timeout method.
    This is because, if the expected number of responses is received within
    the timeout period, the timeout is *cancelled*. This is correct NATS
    behavior, not a bug. The default value of the :expected option is 1, so
    if no value is given explicitly, the NATS client cancels the timeout as
    soon as (just) one response is received.

    If you want to unsubscribe from a subject after a certain number of
    responses has been received, you have to set that number in the :max
    option of the request method. :max is evaluated before :expected, so when
    they are given the same value, :max wins and the subscription is
    unsubscribed.
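
    A minimal sketch of the difference, assuming the Ruby nats client behavior
    described above (the subject name, payload, and timeout value are made up
    for illustration):

         require 'nats/client'

         NATS.start do
           # Leaky pattern: once a single response arrives, the timeout
           # (:expected defaults to 1) is cancelled, so the reply-inbox
           # subscription is never removed.
           sid = NATS.request('staging.queue', 'request json') do |response|
             # handle the staging response
           end
           NATS.timeout(sid, 120) { puts 'request timed out' }

           # Fixed pattern: :max => 1 auto-unsubscribes the reply inbox after
           # one response, whether or not the timeout fires.
           sid = NATS.request('staging.queue', 'request json', :max => 1) do |response|
             # handle the staging response
           end
           NATS.timeout(sid, 120) { puts 'request timed out' }
         end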

    To fix the current subscription leak, I have set the :max option to 1.
    The fixed version's graph for the same experiment looks OK (actually,
    almost all the values were 1 or 2; 3 appeared only once).

    <http://www.hostedredmine.com/attachments/download/56360/nats-subs-leak-nats-top-022236.png>

    I have now filed a pull request with this fix against the stager-client
    repository on GitHub:
    https://github.com/cloudfoundry/stager-client/pull/1

    Please review it.
    Thanks in advance.


