If you’re looking at response time, what’s more useful: a mean or a percentile? Not sure?
You probably know how to calculate the average (mean) of a sample, but what about percentiles? It’s simple. Let’s say you have a 100-element set and you want the 95th percentile value. Sort the set from least to greatest and take the 95th element. For the 50th percentile (the median), take the 50th element.
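The nearest-rank method described above can be sketched in a few lines of Go. The function name here is ours, not part of any library:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile of values using the
// nearest-rank method: sort, then take the ceil(p/100 * n)-th element.
func percentile(values []float64, p float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// A 100-element set: 1, 2, ..., 100.
	data := make([]float64, 100)
	for i := range data {
		data[i] = float64(i + 1)
	}
	fmt.Println(percentile(data, 95)) // 95
	fmt.Println(percentile(data, 50)) // 50
}
```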
It’s very easy to calculate percentiles for small sample sets. But what if the set is a couple of orders of magnitude larger? What if values stream in rapidly? At that scale, sorting takes longer, storing every value consumes a lot of space, and our method of finding percentiles becomes inefficient.
If we’re dealing with a large number of values and only need to know the characteristics of our data, we should be fine with approximations. If we settle for statistics that are good enough, we’re suddenly exposed to a variety of neat techniques that solve our problem. One technique uses histograms. Another is reservoir sampling, which maintains a statistically representative subset of a large stream of data.
With reservoir sampling, we still have to maintain a set of values, and we still have to sort them to calculate quantiles. That beats storing every single value, but what if we don’t want to store individual values at all? Fortunately, we don’t have to.
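For context, here is a minimal sketch of reservoir sampling (Vitter’s Algorithm R) in Go. The type and method names are ours for illustration:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Reservoir keeps a uniform random sample of at most k values from a
// stream of unknown length (Algorithm R). After the first k values,
// each new value replaces a random slot with probability k/seen.
type Reservoir struct {
	k      int
	seen   int
	sample []float64
}

func NewReservoir(k int) *Reservoir {
	return &Reservoir{k: k, sample: make([]float64, 0, k)}
}

func (r *Reservoir) Add(v float64) {
	r.seen++
	if len(r.sample) < r.k {
		r.sample = append(r.sample, v)
		return
	}
	// rand.Intn(seen) < k happens with probability k/seen.
	if j := rand.Intn(r.seen); j < r.k {
		r.sample[j] = v
	}
}

func main() {
	r := NewReservoir(10)
	for i := 0; i < 1000; i++ {
		r.Add(float64(i))
	}
	fmt.Println(len(r.sample)) // 10 — space stays constant as the stream grows
}
```

Quantiles are then estimated by sorting just the reservoir, which is why the sort cost remains bounded.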
Histograms are useful representations of distributions. Each “bin” or bar in a histogram represents a range of values, and the height of the bar represents the frequency (or count) of values in that bin. With histograms, we don’t have to keep track of individual values. We can just find which bin a value belongs to and increment the frequency counter.
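A classic fixed-bin histogram makes the “find the bin, increment the counter” idea concrete. This sketch (names are ours) uses equal-width bins over a known range:

```go
package main

import "fmt"

// fixedHistogram counts values into equal-width bins over [min, max).
type fixedHistogram struct {
	min, width float64
	counts     []uint64
}

func newFixedHistogram(min, max float64, bins int) *fixedHistogram {
	return &fixedHistogram{
		min:    min,
		width:  (max - min) / float64(bins),
		counts: make([]uint64, bins),
	}
}

// Add finds the bin a value belongs to and increments its counter;
// no individual values are stored.
func (h *fixedHistogram) Add(v float64) {
	i := int((v - h.min) / h.width)
	if i < 0 {
		i = 0 // clamp out-of-range values to the edge bins
	}
	if i >= len(h.counts) {
		i = len(h.counts) - 1
	}
	h.counts[i]++
}

func main() {
	h := newFixedHistogram(0, 100, 10)
	for _, v := range []float64{3, 14, 15, 92, 65} {
		h.Add(v)
	}
	fmt.Println(h.counts) // [1 2 0 0 0 0 1 0 0 1]
}
```

The catch with fixed bins is that you must know the value range up front, which motivates the dynamic variant described next.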
We’ve decided to implement histograms that are dynamic, compressed, and offer efficient quantile approximations. They’re dynamic in that they do not have fixed bin values, and bins adapt to the data that streams in. The histograms also have a fixed maximum bin count, so resource usage stays constant regardless of how many values stream in.
The algorithm is fairly simple. A histogram is created with a maximum bin count. Each new value streamed into the histogram creates an additional bin with a count of 1. If the resulting bin count exceeds the maximum, the two closest bins are merged, where “closest” refers to the distance between the bins’ values. The higher the maximum bin count, the better the approximations, at the cost of greater resource utilization and computation time.
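Here is a rough sketch of that insert-then-merge loop, not our actual implementation. Each merged bin sits at the count-weighted mean of the pair it replaces; all names are ours:

```go
package main

import (
	"fmt"
	"sort"
)

type bin struct {
	value float64 // representative value of the bin
	count float64 // how many observations it stands for
}

// streamHist keeps bins sorted by value; every new value becomes a
// bin, and when the bin count exceeds maxBins, the two closest bins
// are merged into one at their count-weighted mean.
type streamHist struct {
	maxBins int
	bins    []bin
}

func (h *streamHist) Add(v float64) {
	h.bins = append(h.bins, bin{value: v, count: 1})
	sort.Slice(h.bins, func(i, j int) bool { return h.bins[i].value < h.bins[j].value })
	if len(h.bins) <= h.maxBins {
		return
	}
	// Find the adjacent pair with the smallest gap between values.
	minGap, at := h.bins[1].value-h.bins[0].value, 0
	for i := 1; i < len(h.bins)-1; i++ {
		if gap := h.bins[i+1].value - h.bins[i].value; gap < minGap {
			minGap, at = gap, i
		}
	}
	// Merge the pair into one bin at the count-weighted mean.
	a, b := h.bins[at], h.bins[at+1]
	merged := bin{
		value: (a.value*a.count + b.value*b.count) / (a.count + b.count),
		count: a.count + b.count,
	}
	h.bins = append(h.bins[:at], append([]bin{merged}, h.bins[at+2:]...)...)
}

func main() {
	h := &streamHist{maxBins: 4}
	for _, v := range []float64{1, 2, 10, 11, 20, 21} {
		h.Add(v)
	}
	fmt.Println(len(h.bins)) // 4 — never exceeds maxBins
}
```

A production version would use binary search instead of re-sorting on every insert, but the merge step is the heart of the technique.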
We’ve also added an implementation of a weighted histogram, in which each bin’s frequency count is an exponentially weighted moving average. This allows the histograms to be used for long periods of time with many values without worrying too much about overflow. Moving averages also give more recent values more weight.
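One way to sketch the weighting idea (an assumption on our part, not necessarily how the library does it): instead of adding 1 to the matching bin, decay every count by a factor before incrementing, so old observations fade and counts stay bounded.

```go
package main

import "fmt"

// decayAndIncrement applies an exponentially weighted update: all bin
// counts shrink by (1 - alpha), then the hit bin gains alpha. The
// function and parameter names are illustrative, not the library's API.
func decayAndIncrement(counts []float64, hit int, alpha float64) {
	for i := range counts {
		counts[i] *= 1 - alpha
	}
	counts[hit] += alpha
}

func main() {
	counts := []float64{10, 10}
	decayAndIncrement(counts, 0, 0.1)
	fmt.Println(counts) // bin 0 gains relative weight; bin 1 decays toward zero
}
```

Because every count is multiplied by a factor below one on each update, the totals converge rather than grow without bound, which is what avoids overflow.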
The Apache Hive project includes an implementation in Java. They recommend using between 20 and 80 bins.
We’ve open-sourced our implementations (written in Go) under the MIT license. They’re available on GitHub.