Get Started with Production Monitoring

I bet you have an application in production right now. Actually, you probably have more than one. Are you in the Microservices camp? Then you likely have dozens of applications and services deployed.

The question is, are you effectively monitoring your production applications?

The rise of DevOps and Continuous Delivery/Deployment (do you know the difference between Continuous Delivery and Continuous Deployment?) has made knowing what is really happening in production more important than ever, and it has always been important. As the applications that provide value to the business become more distributed, knowing that you’re tracking the right metrics becomes more difficult.

All too often, I see applications deployed with a default set of metrics, mostly around the machines or virtual machines (VMs) hosting them. The metrics aren’t monitored for change; they are monitored only for the extremes:

Are we out of disk?
Are we out of memory?
How many requests are we serving?

While these metrics are all important, there’s no cohesiveness to what is measured and what it means.

I am not the neck-bearded, old school System Administrator who is going to sigh, roll my eyes, and issue edicts about production monitoring. I am a developer who has played many roles and had to support various applications in production. I’ve been on teams where I’ve watched others, made suggestions, and missed the boat on monitoring. I’ve read a few books and a slew of articles on the subject. This has led me to an approach that stands on the shoulders of giants.

If you are supporting an application and you don’t really have a solid approach to what you are monitoring, then you’ll find this article useful.

Smart people that support large infrastructure and enable Continuous Delivery have identified what metrics are important to start and, in most cases, how you can track them for the various components of a system.

Books like Google’s Site Reliability Engineering (SRE) and accompanying workbook have shed a light on the metrics that are tracked in the “big time” applications. While it’s unlikely that you’re at Google’s scale, these books are worth reading. There are others contributing in this space (some of which I will mention later), providing content and approaches to help us all improve production monitoring.

Internal vs. External Production Monitoring

At a high level, this approach splits out production monitoring into two camps: Internal vs External. Internal monitoring consists of the servers and other resources that are hosting your application, software, and services. Think VMs, containers and container hosts, cloud-hosted database services, and the like. External metrics are focused on the users or clients of your infrastructure. Measuring these values gives an indication of the experience your software is providing.

Let’s go through both camps using cool TLAs (three-letter acronyms) and this will become clear.

USE This Method for Internal Metrics

The first method we’ll visit is called the USE Method, and it is the brainchild of Brendan Gregg. USE is a TLA that stands for Utilization, Saturation, and Errors. The idea is to quickly find servers/VMs/containers that are struggling in some capacity. The USE metrics can be tracked for components like CPU, memory, disk I/O, and network I/O.

Each of these metrics is defined here:

Utilization: How busy has this resource been over a time period? For example, CPU usage or memory levels over the last 24 hours.
Saturation: How much work is queuing up for a resource? CPUs have a run queue length, but be careful. Load average might be a better metric here (more below).
Errors: The count of error events for the resource.

One of the issues with the USE method is that it’s exceedingly difficult to track some of these metrics against certain resources. For example, getting CPU errors is difficult to impossible. Using run queue length for saturation of a CPU is confusing and may not mean what you think it does. Brendan has suggestions on how to get these metrics, and there is a fantastic article from Circonus on how they have created a USE dashboard for their service. Spoiler: They simply don’t capture some of these metrics, which is reasonable.
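
To make the three USE values a little more concrete, here is a minimal Python sketch for a CPU. Everything in it is an assumption for illustration: the CpuSample type, its field names, and the choice to treat load average above the CPU count as saturation are mine, not part of the USE Method itself. On Linux, load average also counts tasks waiting on I/O, so read the saturation number with care.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One hypothetical CPU reading; the field names are illustrative."""
    busy_seconds: float      # CPU time spent doing work during the window
    window_seconds: float    # length of the sampling window
    cpu_count: int           # number of logical CPUs
    load_average: float      # e.g. the 1-minute load average
    errors: int = 0          # often unavailable for CPUs, so default to 0


def cpu_use(sample: CpuSample) -> dict:
    """Return Utilization, Saturation, and Errors for one sample."""
    capacity = sample.window_seconds * sample.cpu_count
    utilization = sample.busy_seconds / capacity
    # Assumption: load beyond the CPU count is work that is queuing up.
    saturation = max(0.0, sample.load_average - sample.cpu_count)
    return {
        "utilization": round(utilization, 3),
        "saturation": round(saturation, 3),
        "errors": sample.errors,
    }


sample = CpuSample(busy_seconds=180.0, window_seconds=60.0,
                   cpu_count=4, load_average=5.5)
print(cpu_use(sample))
```

In this made-up sample the CPUs are 75% utilized, yet work is still queuing, which is exactly the kind of struggling host the USE Method is meant to surface.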

You obviously need to make your own decisions around how (and if) you get these metrics, but using a service (like Circonus, which is not paying me) is the best route if you can do it.

Know When Your Users Are Seeing RED

Now that you know what your servers, et al., are doing, it’s time to focus on your users (coughs, maybe users should be first?). How can we easily measure something that gives a quick indication of the experience our software is providing? Tom Wilkie of Grafana Labs came up with what has been described as a “microservices-oriented monitoring philosophy.” The method is called the RED Method, which stands for:

Rate: Requests/second
Errors: How many requests are failing?
Duration: The latency of those requests

These are easy to get and services like New Relic and AWS CloudWatch will give you this information in some form with little to no configuration. In some cases, you’ll need developers to expose code for these metrics, especially if you are using something like Prometheus (see below) to capture and expose them.
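
If you do end up computing these yourself, the arithmetic is small. Here is a minimal Python sketch, assuming you already have a list of served requests for a time window. The Request type, its field names, and the decision to count only 5xx responses as failures are my assumptions, not something the RED Method prescribes.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One served request; the field names are illustrative."""
    status: int          # HTTP status code returned
    duration_ms: float   # time taken to serve the request


def red_summary(requests: list[Request], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(requests)
    # Assumption: only 5xx responses count as failed requests.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_sec": total / window_seconds,
        "error_ratio": failed / total if total else 0.0,
        "median_duration_ms": (
            median(r.duration_ms for r in requests) if total else 0.0
        ),
    }


window = [
    Request(200, 100.0),
    Request(200, 200.0),
    Request(500, 900.0),
    Request(200, 300.0),
]
print(red_summary(window, window_seconds=2.0))
```

A real service would more likely update counters and histograms as requests arrive rather than keep every request in a list, but the three numbers that come out are the same idea.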

My DBA is a Fan of the CELTics

(Note: I realize my section naming skills are degrading as this article goes on.)
The last set of metrics to visit are based on the “Four Golden Signals” from the aforementioned SRE book. I first heard of this variation reading through the ebook “DevOps for the Database” by Baron Schwartz of VividCortex. CELT stands for:

Concurrency: The number of queries executing at the same time
Errors: Self-explanatory. Database errors.
Latency: How long queries are taking
Throughput: Queries/second

Baron suggests using CELT+USE as the “Seven Golden Signals” for database monitoring. Getting these signals is database specific, of course, but I’ll link to an article or two at the end of this post that has done some legwork here.
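
To show how the four CELT values relate to each other, here is a minimal Python sketch that derives them from a list of query timings. The Query type and its fields are hypothetical, every query is assumed to start and finish inside the window, and concurrency is estimated as total query time divided by window length, which is one reasonable reading rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """One executed query; the field names are illustrative."""
    start_ms: int     # when the query started, relative to the window
    end_ms: int       # when the query finished
    ok: bool = True   # False if the database reported an error


def celt_summary(queries: list[Query], window_ms: int) -> dict:
    """Summarize Concurrency, Errors, Latency, Throughput for one window."""
    durations = [q.end_ms - q.start_ms for q in queries]
    total = len(queries)
    return {
        # Average number of queries in flight during the window.
        "concurrency": sum(durations) / window_ms,
        "errors": sum(1 for q in queries if not q.ok),
        "latency_ms": sum(durations) / total if total else 0.0,
        "throughput_per_sec": total / (window_ms / 1000),
    }


queries = [
    Query(0, 500),
    Query(200, 700),
    Query(400, 1000, ok=False),
    Query(1000, 1400),
]
print(celt_summary(queries, window_ms=2000))
```

Latency here is a plain mean to keep the sketch short. For anything you plan to alert on, a percentile is usually the better summary.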

How Shall I Track Thee?

The last item I’d like to cover is what to do once you have these metrics in place. There are a metric ton (see what I did there?) of values to handle, so aggregation is in order. Many people think that tracking the average value of these metrics is the way to go, but using the average can hide some glaring problems. A more useful value to track is a percentile, such as the 50th (p50), 95th (p95), or 99th (p99).

These percentiles point to a value that is the maximum for that percentage of values. For example, if p99 for Request Latency is 800ms, it means 99% of the requests were 800ms or faster. The long tail that a percentile shows is very often indicative of issues that need to be addressed for a large number of users. If you want to track average too, p50 (the median) is likely a better number.
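
Here is a minimal Python sketch of why the average can mislead. It uses the nearest-rank definition of a percentile, which matches the "maximum for that percentage of values" description above; monitoring tools often interpolate or estimate instead, so their numbers can differ slightly. The latency values are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value that is greater than
    or equal to p percent of the values (p is a whole number, 1 to 100)."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [3000] * 5

print("average:", sum(latencies_ms) / len(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

The average of 245ms describes no request that actually happened. The median says the typical request took 100ms, and p99 exposes the 3000ms tail that the average smears away.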

Also, these percentiles should be aggregated against a time period. For recent time periods (the last hour, etc.) you will want per-minute values. For less recent periods (the last week/month) you’ll want to use larger interval aggregates. As a rule of thumb, look at CloudWatch’s aggregation and retention policies.
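
As a rough illustration of interval aggregation, here is a minimal Python sketch that groups timestamped samples into fixed-size buckets and reports one percentile per bucket. The aggregate function, the sample data, and the interval sizes are all hypothetical, and the percentile helper uses the nearest-rank definition.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(samples, interval_seconds, p):
    """Group (timestamp_seconds, value) samples into fixed intervals and
    return {interval_start: percentile_of_values_in_that_interval}."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        start = timestamp // interval_seconds * interval_seconds
        buckets[start].append(value)
    return {start: percentile(vals, p) for start, vals in sorted(buckets.items())}


# Made-up (timestamp in seconds, latency in ms) samples.
samples = [(0, 100), (30, 200), (59, 900), (60, 150), (119, 250)]

print(aggregate(samples, interval_seconds=60, p=50))    # per-minute view
print(aggregate(samples, interval_seconds=120, p=50))   # coarser view
```

Note that the coarser view is recomputed from the raw samples. Averaging per-minute percentiles together does not give you the percentile for the longer period, which is worth keeping in mind when a tool rolls data up for you.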

Lastly, once you have all these metrics, they need to be visible to your team. In other words, you need a dashboard. Services like CloudWatch allow you to create a dashboard with both out-of-the-box and calculated metrics. Circonus, New Relic, Datadog, and VividCortex will supply you a nice dashboard for a fee.

If you want to roll your own, which is a fine idea, Prometheus is the current belle of the ball for collecting metrics. Grafana supports Prometheus as a data source, so you can display, query, and analyze your metrics your way.

Wrapping Up

Like I said, you are probably supporting one or more applications or services today. I am sure you are tracking some metrics, but do you have a plan? If not, maybe starting with some of the items discussed here can get you on track to observability nirvana.

There is really so much great content out there on these methods. I would suggest starting with the Google SRE book and lacing in the links above. Of course, if you’d like some help getting started or assessing your current system, Method is here to help!

Resources

Here is a collection of links if you want to learn more:

“How to Monitor the SRE Golden Signals” (Faun, Medium) takes all of this a bit deeper, offering links to how to track these signals for web servers, databases, etc.
“The RED Method: How to Instrument Your Services” (Grafana Labs Blog) is based on Tom Wilkie’s talk and work around the RED method.