*metric of 99th percentile.* This is very common in systems like Graphite, and it doesn’t achieve what people sometimes think it does. This blog post explains how percentiles might trick you, the degree of the mistake or problem (it depends), and what you can do if percentile metrics aren't right for you.

## Away From Averages

Over the last few years, a lot of people have started talking about the problems with averages in monitoring. It’s good this topic is in wider discussion now, because for a long time averages were accepted without much deeper inspection. Averages can be unhelpful when it comes to monitoring. If you’re merely looking at averages, you’re potentially missing the outliers, which might matter a lot more. There are two issues with averages in the presence of outliers:

- Averages hide the outliers, so you can’t see them.
- Outliers skew averages, so in a system with outliers, the average doesn’t represent typical behavior.

*exactly* right in practice: specifically, the median, or 50th percentile, provides this property.) Optimizely did a write-up in this blog post from a couple years ago. It illustrates beautifully why averages can backfire:

“While the average might be easy to understand it’s also extremely misleading. Why? Because looking at your average response time is like measuring the average temperature of a hospital. What you really care about is a patient’s temperature, and in particular, the patients who need the most help.”

Brendan Gregg also puts it well:

“As a statistic, averages (including the arithmetic mean) have many practical uses. Properly understanding a distribution isn't one of them.”

## And Towards Percentiles

Percentiles (more broadly, quantiles) are often praised as a potential way to bypass this fundamental issue with averages. The idea of the 99th percentile is to take a population of data (say, a collection of measurements from a system) and sort them, then discard the worst 1% and look at the largest remaining value. The resulting value has two important properties:

- It’s the largest value that occurs 99% of the time. If it’s a webpage load time, for example, it represents the worst experience 99% of your visitors have.
- It's robust in the face of truly extreme outliers, which come from all sorts of causes including measurement errors.
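
The sort-and-discard idea can be sketched in a few lines of Python. This is a minimal illustration using the nearest-rank method, which is only one of several common percentile definitions; the function name and the sample data are made up for the example.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # 1-based rank of the largest value left after discarding the worst (100 - pct)%.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical page-load times in ms: 1..1000, so the worst 1% are 991..1000.
load_times_ms = list(range(1, 1001))
p99 = percentile_nearest_rank(load_times_ms, 99)

# Replace the slowest measurement with an absurd outlier (say, a measurement error).
with_outlier = load_times_ms[:-1] + [10**9]
p99_with_outlier = percentile_nearest_rank(with_outlier, 99)

print(p99, p99_with_outlier)
```

Both calls return the same value, because the extreme outlier sits inside the discarded 1%.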

**How Time Series Databases Store and Transform Metrics**

There’s a big problem with most time series data and percentiles. Time series databases are almost always storing *aggregate* metrics over time ranges, not the *full population* of events originally measured. Time series databases then *average* these metrics over time in a number of ways. Most importantly:

- They average the data whenever you request it at a time resolution that differs from the stored resolution. If you want to render a chart of a metric over a day at 600px wide, each pixel will represent 144 seconds of data. This averaging is implicit and isn’t disclosed to the user. They ought to put a warning on that!
- They average the data when they archive it for long-term storage at a lower resolution, which almost all time series databases do.

*Percentiles are computed from a population of data, and have to be recalculated every time the population (time interval) changes. Time series databases with traditional metrics don’t have the original population.*
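
To make the problem concrete, here is a small Python sketch with invented numbers: a busy, fast minute and a quiet, slow minute. It compares the average of the two stored per-minute 99th percentiles with the 99th percentile of the combined population. The nearest-rank `p99` helper is an assumption for illustration, not how any particular database computes it.

```python
import math
from statistics import mean

def p99(values):
    """99th percentile by nearest rank: sort, then step past the worst 1%."""
    ordered = sorted(values)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical data: a busy, fast minute and a quiet, slow minute.
minute_1 = [100] * 1000   # 1,000 requests at 100 ms
minute_2 = [2000] * 10    # 10 requests at 2,000 ms

stored = [p99(minute_1), p99(minute_2)]   # what a traditional metric keeps
averaged_p99 = mean(stored)               # what a zoomed-out chart would plot
true_p99 = p99(minute_1 + minute_2)       # needs the original population

print(f"average of p99s: {averaged_p99} ms, p99 of all requests: {true_p99} ms")
```

The averaged figure weights both minutes equally even though one of them holds 100 times as many requests, so it lands far from the percentile of the real population.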

**Alternative Ways To Compute Percentiles**

If a percentile requires the population of original events, such as measurements of every web page load, we have a big problem. A Big Data problem, to be exact. Percentiles are notoriously expensive to compute because of this.

Lots of ways to compute *approximate* percentiles are almost as good as keeping the entire population and querying and sorting it. You can find tons of academic research on a variety of techniques, including:

- Histograms, which partition the population into ranges or bins, and then count how many fall into various ranges.
- Approximate streaming data structures and algorithms (sketches).
- Databases sampling from populations to give fast approximate answers.
- Solutions bounded in time, space, or both.

*distribution* of the population in some way. From the distribution, you can compute at least the approximate percentiles, as well as other interesting things. From the Optimizely blog post, again, there’s a nice example of a distribution of response times and the average and 99th percentile:

*Source: Catchpoint.com, data from Oct. 15, 2013 to Nov. 25, 2013 for 30KB Optimizely snippet.*

There are tons of ways to compute and store approximate distributions, but histograms are popular because of their relative simplicity. Some monitoring solutions actually support histograms. Circonus is one, for example. Circonus CEO Theo Schlossnagle often writes about the benefits of histograms.

Ultimately, having the distribution of the original population isn’t just useful for computing a percentile, it’s very revealing in ways the percentile isn’t. After all, a percentile is a single number trying to represent a lot of information. I wouldn’t go as far as Theo did when he tweeted “99th percentile is as bad as an average,” because I agree with percentile fans that it’s more representative of some important characteristics of the underlying population than an average is. But it’s not as representative as histograms, which are much more granular. The chart above from Optimizely contains way more information than any single number could ever show.

**Percentiles Done Better in Time Series Databases**

A better way to compute percentiles with a time series database is to collect banded metrics. I say "with a time series database" deliberately, because lots of time series databases are just ordered, timestamped collections of named values, without the capability of storing histograms.

Banded metrics provide a way to get the same effect as a series of histograms over time. What you’d do is select limits that divide the space of values up into ranges or bands, and then compute and store metrics about each band over time. The metric will be just as it is in histograms: the count of observations that fall into the range.

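
As a rough sketch of what collecting banded metrics could look like, the Python below turns raw events into per-interval counts for each band. The band limits, the 60-second interval, and the function names are all assumptions chosen for the example. In a real setup each band would be emitted as its own named metric.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical upper limits of each band in milliseconds; the last band is open-ended.
BAND_LIMITS_MS = [10, 100, 1000]

def band_index(value_ms):
    """Index of the band a value falls into (0 = fastest, last = overflow band)."""
    return bisect_left(BAND_LIMITS_MS, value_ms)

def banded_counts(events, interval_s=60):
    """Turn (timestamp_s, latency_ms) events into {interval_start: [count per band]}."""
    series = defaultdict(lambda: [0] * (len(BAND_LIMITS_MS) + 1))
    for ts, latency in events:
        start = ts - ts % interval_s          # start of the interval the event belongs to
        series[start][band_index(latency)] += 1
    return dict(series)

# Hypothetical events: (timestamp in seconds, latency in ms).
events = [(0, 5), (10, 50), (59, 100), (60, 101), (61, 5000), (119, 7)]
series = banded_counts(events)
print(series)
```

Each stored value is a plain count, which is why an ordinary time series database can hold it without native histogram support.
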
Choosing the ranges well is a hard problem, generally. Common solutions include logarithmic ranges, and ranges that provide a given number of significant digits, which may be faster to calculate at the cost of not growing uniformly. Even divisions are rarely a good choice. For more on these topics, please read Brendan Gregg’s excellent write-up.

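
For illustration, logarithmic band limits can be generated with a simple loop. This sketch assumes integer limits and a growth factor of 2; both are arbitrary choices, and the function name is hypothetical.

```python
def log_band_limits(lowest, highest, factor=2):
    """Upper limits that grow by `factor` until they cover `highest`."""
    if lowest <= 0 or factor <= 1:
        raise ValueError("lowest must be positive and factor greater than 1")
    limits = []
    limit = lowest
    while limit < highest:
        limits.append(limit)
        limit *= factor
    limits.append(limit)  # first limit at or beyond `highest`
    return limits

limits_ms = log_band_limits(1, 1000)
print(limits_ms)
```

Eleven limits cover three orders of magnitude here, whereas evenly spaced bands of the same count would lump almost every fast request into the first band.
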
The fundamental tension is between the amount of data retained and the fineness of the resolution. However, even coarse banding can be effective for showing more than simple averages. For example, Phusion Passenger Union Station shows banded metrics of request latencies using 11 bands. (I don’t think the visualization is the most effective; the y-axis’s meaning is confusing and it’s essentially a 3d chart mapped into 2d in a nonlinear way. Nevertheless, it still shows more detail than an average would reveal.)

How would you do this with popular open source time series tools? You’d have to define ranges and create stacked charts like the one described above.

Computing a percentile from this would be much more difficult. You’d have to range over the bands in reverse order, from biggest to smallest, summing up as you go. The first band at which the running sum exceeds 1% of the total contains the 99th percentile. There are lots of nuances in this: strict inequalities, how to handle edge cases, what value to use for the percentile (upper or lower bin limit? in the middle? weighted?).

And the math can be confusing. You might think, for example, you need at least 100 bands to compute the 99th percentile, but it depends. If you have two bands and the uppermost band’s value contains 1% of the values, you’ve got your 99th percentile. (If that sounds counterintuitive, take a moment to ponder quantiles; I think a deep understanding of quantiles is worthwhile.)

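
Here is one way the reverse walk over the bands could be written in Python. It is a sketch under stated assumptions: counts are ordered from the lowest band to the highest, the rule used is "the first band, walking down from the top, where the running sum exceeds the allowed tail", and the function only reports which band holds the percentile, leaving the choice of a representative value open.

```python
def percentile_band(counts, pct):
    """Return the index of the band that contains the pct-th percentile.

    counts: observations per band, ordered from the lowest band to the highest.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    running = 0
    # Walk from the highest band down, accumulating the "worst" tail.
    for index in range(len(counts) - 1, -1, -1):
        running += counts[index]
        # Integer form of: running > total * (100 - pct) / 100
        if running * 100 > total * (100 - pct):
            return index
    return 0  # only reached when pct is 0

# Hypothetical counts for four bands, fastest first.
four_bands = percentile_band([700, 250, 40, 10], 99)

# Two bands where the top band holds exactly 1% of the observations.
two_bands = percentile_band([99, 1], 99)

print(four_bands, two_bands)
```

In the two-band case the answer is the lower band, so the boundary between the two bands is the 99th percentile, and no further resolution is needed.
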
So this is complicated. It’s possible in the abstract, but it also largely depends on whether a database’s query language supports the calculations you’d need to get an approximate percentile. If you can confirm systems in which this is definitely possible, please comment and let me know.

The nice thing about banded metrics in a system like Graphite, which naively assumes all of its metrics can be averaged and resampled at will, is that banded metrics are robust to this type of transformation. You’ll get correct answers because the calculations are commutative over all time ranges.

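
A quick way to convince yourself of this is to average some per-minute band counts, as a resampling step would, and check that each band's share of the total is unchanged. The counts below are invented, and the sketch assumes every interval is present (no missing data points).

```python
from fractions import Fraction

def downsample_by_average(rows):
    """Average each band's count across intervals, as a resampling step would."""
    n = len(rows)
    return [Fraction(sum(column), n) for column in zip(*rows)]

def band_shares(counts):
    """Fraction of all observations that falls into each band."""
    total = sum(counts)
    return [Fraction(c) / total for c in counts]

# Hypothetical per-minute counts for three bands (fast, medium, slow).
per_minute = [
    [980, 15, 5],
    [400, 80, 20],
    [1000, 0, 0],
]

summed = [sum(column) for column in zip(*per_minute)]   # the true 3-minute population
averaged = downsample_by_average(per_minute)            # what the chart would receive

print(band_shares(summed) == band_shares(averaged))
```

Averaging scales every band by the same factor, so the proportions, and therefore any percentile derived from them, are the same as for the summed population.
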
**Beyond Percentiles: Heatmaps**

A percentile is still a single number, just like an average. An average shows the center of gravity of a population, if you will; a percentile shows a high-water mark for a given portion of the population. Think of percentiles as wave marks on a beach. But although this reveals the boundaries of the population and not just its central tendency as an average does, it’s still not as revealing or descriptive as a distribution, which shows the shape of the entire population.

Enter heatmaps, which are essentially 3-D charts where histograms are turned sideways and stacked together, collected over time, and visualized with the darkness of a color. Again, Circonus provides an excellent example of heatmap visualizations.

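
To show the idea without a charting library, this Python sketch renders a tiny text heatmap from per-interval band counts: one column per time interval, one row per band, and a denser character for a higher count. The shading characters and the sample counts are arbitrary choices for the illustration.

```python
SHADES = " .:*#"  # from empty to densest

def heatmap_rows(series):
    """Render per-interval band counts as lines of text, highest band on top.

    series: list of intervals (oldest first), each a list of counts (lowest band first).
    """
    peak = max(max(column) for column in series) or 1
    lines = []
    for band in range(len(series[0]) - 1, -1, -1):
        line = ""
        for column in series:
            # Scale the count to an index into SHADES.
            level = column[band] * (len(SHADES) - 1) // peak
            line += SHADES[level]
        lines.append(line)
    return lines

# Hypothetical counts for three intervals and three latency bands.
series = [[8, 2, 0], [4, 4, 0], [2, 4, 8]]
lines = heatmap_rows(series)
print("\n".join(lines))
```

Reading left to right, the dense cells move from the bottom row to the top row, which is the kind of shift in the shape of the distribution that a single percentile line cannot show.
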
On the other hand, as far as I know, Graphite doesn't have the ability to produce heatmaps with banded metrics. If I’m wrong and it can be done with a clever trick, please let me know.

Heatmaps are great for visualizing the shape and density of latencies, in particular. Another example of heatmap latency is Fastly’s streaming dashboard.

Even some old-fashioned tools you might think of as primitive can produce heatmaps. Smokeping, for example, uses shading to show the range of values. The bright green is the average: