Streaming Approximate Histograms in Go
July 8, 2013
Database
If you’re looking at response time, what’s more useful: a mean or a percentile? Not sure?
You probably know how to calculate the average or mean of a sample, but what about percentiles? It’s very simple. Let’s say you have a 100-element set and you wanted to know the 95th percentile value. You would have to sort the set from least to greatest and get the 95th element. If you wanted the 50th percentile value, you would look at the 50th element.
It’s very easy to calculate percentiles for small sample sets. What if we’re working with those that are larger? Maybe a couple of orders of magnitude larger? What if we keep getting a stream of values rapidly? At that scale, it’s going to take longer to sort all those elements, and we’ll be using a lot of space to store them, and our method of finding percentiles becomes inefficient.
If we’re dealing with a large amount of values and we need know the characteristic of our data, we should be fine with approximations. If we settle for statistics that are good enough, we’re suddenly exposed to a variety of neat techniques that solve our problem. One technique uses histograms. The other technique is called reservoir sampling, which uses a statistically representative subset of a large amount of data.