On the Cloud Foundry Runtime team, we're beginning to discuss monitoring
for deployments that may not have internet access. I was hoping to get
feedback from the community on this decision, which will likely affect the
architecture of Cloud Foundry.
To briefly go over where we are and how we got here:
- Prior to AWS, we were using OpenTSDB <http://opentsdb.net/>
- We decided to use CloudWatch <http://aws.amazon.com/cloudwatch/> for
Amazon AWS deployments of Cloud Foundry
- We decided to use Datadog <http://www.datadoghq.com/> instead of
- IIRC the reasons we used Datadog included but were not limited to:
  - Higher resolution when viewing charts
  - More charting functions
  - Out-of-the-box integration with PagerDuty
  - Costs money now, but lets us focus on higher priorities until
    it's appropriate to revisit
I really like Datadog. It's easy to use, the charts look great, and their
support has been fantastic. But Datadog just won't work in everyone's
circumstances (e.g. non-AWS deployments with no internet access). We're
about at the point where it's time to consider options for monitoring that
will not require external internet access.
We're thinking that statsd is our most obvious candidate to replace
Datadog. It seems to be a buzzworthy tool, and Datadog can optionally use a
customized statsd backend <http://docs.datadoghq.com/guides/dogstatsd/> (potentially
saving us some work).
However, many of our metrics currently rely on tags, which are specific to
Datadog. For instance, we record the CPU load average of each component
and use tags to mark a data point as belonging to job "dea" and
index 0. We can then filter the data by tags, so we can see the
mean CPU load average across all DEAs, or show a dashboard with a
separate chart for each Cloud Controller. Once I learned that we were
accomplishing this through a non-standard statsd backend, I began to wonder
if needing tags at all was a statsd antipattern. Surely our use case is
very typical for users of statsd.
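
To make the tag question concrete, here is a rough sketch of the two wire
formats involved. A stock statsd datagram looks like `name:value|type` and has
no field for tags, while DogStatsD appends a `|#key:value,...` section. The
metric name `cpu.load_avg` and both helper functions are made up for
illustration, and folding the tag values into the metric name is just one
possible workaround, not something we have decided on.

```python
def dogstatsd_gauge(name, value, tags):
    """Format a gauge using the DogStatsD tag extension (Datadog-specific)."""
    tag_part = ",".join(f"{key}:{val}" for key, val in tags.items())
    return f"{name}:{value}|g|#{tag_part}"


def plain_statsd_gauge(name, value, tags):
    """Format a gauge for stock statsd by folding tag values into the name."""
    # Stock statsd has no tag field, so job and index must live in the name.
    # The job.index.metric ordering is an assumption made for this sketch.
    prefix = ".".join(str(tags[key]) for key in ("job", "index"))
    return f"{prefix}.{name}:{value}|g"


tags = {"job": "dea", "index": 0}
print(dogstatsd_gauge("cpu.load_avg", 1.5, tags))     # cpu.load_avg:1.5|g|#job:dea,index:0
print(plain_statsd_gauge("cpu.load_avg", 1.5, tags))  # dea.0.cpu.load_avg:1.5|g
```

With the second form, "all DEAs" would have to be expressed as a pattern over
metric names in whatever frontend we pick, rather than as a tag filter.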
So it's going to take some non-trivial effort to switch to statsd, and we
haven't even touched on frontends yet. I don't think I've talked to anyone
on the team who has had significant experience using statsd, so we're not
certain that statsd is the best choice.
Whichever tool we use, our major use cases include:
- Record arbitrary time series data
- View charts from the recorded data and from realtime data
- Alert on-call support when the recorded data meets certain predefined
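
For the first use case, recording a data point with statsd amounts to sending
a small UDP datagram. Here is a minimal sketch, assuming a statsd daemon
listening on its default UDP port 8125; the host, the metric name and the
`send_gauge` helper are hypothetical.

```python
import socket


def send_gauge(name, value, host="127.0.0.1", port=8125):
    """Send one gauge sample to a statsd daemon as a single UDP datagram."""
    payload = f"{name}:{value}|g".encode("ascii")
    # UDP is fire-and-forget: nothing is returned by the daemon, and the
    # sender is not told whether anything received the datagram.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


send_gauge("dea.0.cpu.load_avg", 1.5)
```

Charting and alerting are not covered by this; they would depend on whichever
backend and frontend end up sitting behind statsd.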
What do you say, community? Should we go forward with statsd?