Many of our customers use Graphite, and I don’t think anyone would argue with me when I say it’s probably the most commonly used time series database in the DevOps community. Not only does it have a huge installed base, it also has a robust community of advocates and developers, including Jason Dixon, who’s writing a book on monitoring with Graphite.
I remember when Graphite first came onto the scene. It was praised to the heavens. Recently, though, the conversation has really changed. Not as many people seem to love it. What happened? The answer points to broader trends in the technology landscape and community.
In The Beginning...
I first heard about Graphite from Percona’s Peter Zaitsev, just after he came back from an on-site consulting job where it was in use. This would have been around 2009 or 2010. It doesn’t seem like a long time ago, but it’s an eon in tech.
Peter was pretty excited. You have to travel back in time to understand why. At that time, most of the people I was aware of were using Cacti and other RRD-based tools for monitoring. These tools were really frustrating as the pace of change in technology picked up. It was so bad that I ended up writing an entire set of meta-software for creating graphing templates for Cacti and similar, which are now part of the Percona Monitoring Plugins. Even with these helpers, Cacti was painful. The process of adding a new host for monitoring was tedious, and custom metrics were really hard to get into Cacti. In practice, nobody did it.
The reason is simple. In Cacti, you couldn’t just start pushing metrics at it and expect them to be recorded. The definition of the universe of possible metrics lived inside Cacti, so you had to set up an RRD archive to accept metrics. Then you could start sending them. (This is the moral equivalent of a database schema in the monitoring world.)
Graphite, in contrast, was much more flexible. Just sending it metrics was enough. It would record, and allow you to graph and analyze, arbitrary metrics.
Oh, and the user interface was worlds better than the other tools that existed at that time. People raved about it.
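To make “just sending it metrics” concrete, here is a minimal sketch in Python. It assumes a Carbon daemon listening for Graphite’s plaintext protocol on its conventional port, 2003; the host, the metric name, and the function names are placeholders of mine, not anything from a real deployment.

```python
import socket
import time


def format_carbon_line(path, value, timestamp=None):
    """Build one line of the plaintext protocol: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Open a TCP connection to Carbon and send a single data point.

    The host and port are assumptions; adjust them for your own setup.
    """
    line = format_carbon_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example (needs a reachable Carbon daemon, so it is left commented out):
# send_metric("servers.web01.cpu.user", 12.5)
```

Note what is missing: there is no step where the metric is declared ahead of time. The first point that arrives under a new name is enough for it to be recorded, which is exactly what Cacti could not do.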
Nobody Seems To Love Graphite Anymore
At the last few conferences I attended, speakers and audiences clearly had a completely different view than they did 3-5 years ago. One of the speakers, for example, talked about Grafana; another about a custom dashboard they built in-house. Both mentioned how awful the Graphite user interface is. And there were laughs, nods, and other indications of agreement.
If you weren’t around in 2008, maybe this wouldn’t seem like such a change, but I was struck by the difference in opinions. To go from praising Graphite to lambasting it in the space of a few years is quite a dramatic change.
Another change is how many people complain about the cost of running Graphite. It’s widely considered to require too much hardware. People call it “hungry” and speak of the difficulty of installing and maintaining clustered Graphite. There have been alternative storage engines and other projects to try to reduce the cost, enable it to scale to larger sizes, and make it faster.
Graphite seems to be falling into disfavor.
Graphite belongs to a category of time series databases that serves the purposes of 5-10 years ago, but increasingly not today’s or tomorrow’s.
Let’s rewind even farther in time. In the beginning there was BMC and HP and … OK, that’s not the beginning. But there was a time when if you wanted to record, graph, and analyze system metrics, your best options were pretty much proprietary tools from big companies.
This set the stage for RRDTool, which democratized metrics and changed the world. Now time series data was easy for everyone. The number of tools built on top of RRDTool is large. I remember using MRTG, Cacti, SmokePing, Munin, and probably some others. And in comparison to what existed at that time, they were good. You could do a lot of things with them from the command line, even some crude anomaly detection. They maintained themselves, gracefully aging out and averaging data, dealing with missing and late points pretty well by default. If you didn’t live through this time, it might be difficult to appreciate what a revolution this was. One way to say it: RRDTool is to metrics what MySQL is to databases.
But times were changing. In the era of cloud, for example, you now had a lot more servers, and they were ephemeral, not long-lived. RRDTool wasn’t built with that in mind. And with the philosophy that “if it moves, you should graph it”, we had a lot more metrics. Importantly, everyone in software-driven organizations (which is now practically every company) was beginning to realize system metrics aren’t the most important use of a time series database. From that same blog post:
Application metrics are usually the hardest, yet most important, of the three. They’re very specific to your business, and they change as your applications change (and Etsy changes a lot). Instead of trying to plan out everything we wanted to measure and putting it in a classical configuration management system, we decided to make it ridiculously simple for any engineer to get anything they can count or time into a graph with almost no effort.
And this was the beginning of the rise of the metrics.
If there was a watershed moment in systems monitoring around this time, it was StatsD, not Graphite. StatsD, and the mindset that you should be able to just send metrics from your application to a monitoring system and then do useful things with them with as little friction as possible.
That means no predefined notion of what metrics will be arriving at the monitoring system.
Graphite and StatsD could accommodate this, but RRDTool could not.
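For readers who haven’t seen it, the friction really is that low. The sketch below (Python, with function and metric names I made up for illustration) builds StatsD’s line format, `<name>:<value>|<type>`, and fires it at the daemon over UDP. I’m assuming the default StatsD port, 8125, on the local machine.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build one StatsD datagram: '<name>:<value>|<type>'.

    Common types are 'c' (counter), 'ms' (timer), and 'g' (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_statsd(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP; host and port are assumed defaults."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


# Anything you can count or time, with no registration step beforehand:
login_counter = format_statsd("app.logins", 1, "c")        # "app.logins:1|c"
checkout_timer = format_statsd("app.checkout", 320, "ms")  # "app.checkout:320|ms"

# send_statsd(login_counter)   # left commented out: needs a StatsD daemon
```

Nothing on the receiving side knows about `app.logins` until the first datagram shows up, and that is the whole point.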
But now we’re at another crossroads in monitoring.
If RRDTool assumed that servers are few, relatively long-lived, and once set up, will send metrics continually, Graphite assumed some things itself.
- There are relatively few metrics
- Metrics are relatively long-lived
- Once a metric is created, its points will exist more or less continually
EDIT/UPDATE: I think it’s important to clarify what I mean by “few metrics.” I mean low cardinality, as in, each thing you measure doesn’t have a large number of distinct things to measure. A CPU, for example, has a handful of metrics (typically different kinds of utilization, such as idle, system, user, steal…) and a network interface has a handful too (things in, things out, things that errored or dropped). Cardinality of metrics is in contrast to the rate of points measured over time (e.g., measuring CPU utilization once per second instead of once per minute produces 60x as many points). High-cardinality metrics explode the overall volume of metrics data much faster than higher granularity does.
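A little arithmetic shows the difference. The numbers below are hypothetical, chosen only to illustrate the shape of the problem: one host with 10 series, sampled once per minute, compared against (a) sampling once per second and (b) keeping per-minute sampling but splitting each series by a dimension with 10,000 distinct values.

```python
SECONDS_PER_DAY = 86_400


def points_per_day(series_count, interval_seconds):
    """Points stored per day if every series reports at every interval."""
    return series_count * (SECONDS_PER_DAY // interval_seconds)


# Baseline: 10 series, one point per minute.
base = points_per_day(series_count=10, interval_seconds=60)

# Higher granularity: the same 10 series, one point per second.
finer = points_per_day(series_count=10, interval_seconds=1)

# Higher cardinality: each series split 10,000 ways, still per-minute.
wider = points_per_day(series_count=10 * 10_000, interval_seconds=60)

print(base)           # 14400
print(finer // base)  # 60
print(wider // base)  # 10000
```

Granularity has a natural ceiling (there are only so many seconds in a day), whereas cardinality multiplies with every new dimension you decide to break a metric down by.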
The RRDTool view of the world was “up to hundreds of servers, each sending dozens or hundreds of metrics.” The Graphite+StatsD view was “up to thousands of servers/applications/services, each sending hundreds or thousands of metrics.”
But give people a taste of this, and they want more. How about…
- There are tens of thousands of applications, containers, services, and servers
- There are millions of metrics for each of them (high cardinality)
- They are often sparse, containing points as seldom as once in all time
- They are tracked in high resolution, such as once per second instead of per-minute or per-5-minutes
Graphite can’t handle that, but it’s a real use case. When you go beyond global averages (such as average QPS for the whole database) and start tracking metrics in fine detail (such as QPS by category of query), you automatically get this, for example. There are other use cases where this arises–lots of them.
A lot of our customers use Graphite, as I mentioned. Probably several times a month people ask us, “how hard would it be to set up the Database Performance Monitor agents to forward metrics to Graphite via StatsD?” The answer is it’d be easy, since our agents communicate with our aggregator agent via the standard StatsD protocol.
But the result would crush Graphite, immediately. There are several reasons for it:
- Graphite isn’t built for metrics at such high granularity. Its time series back end, Whisper, just isn’t efficient enough. If it’s “hungry” already with a limited number of metrics and limited granularity in time, it’s orders of magnitude too inefficient to handle this kind of load. (StatsD could buffer and reduce the granularity, but that’s not the point).
- Graphite’s assumptions don’t match the characteristics of this data. Sparseness, for example, means huge amounts of wasted space in preallocated, fixed-size RRD-like database files. If we pointed our agents at StatsD, Graphite would instantly fill up all of its disks with mostly-empty files and everything would crash.
- Graphite tries to store a metric per file, but this doesn’t work at that cardinality of metrics. Even in the most modern, high-performance file systems, lots of files in a directory is a problem, and processes with lots of open filehandles are too.
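To put a rough number on the wasted space, here is a back-of-the-envelope sketch in Python. The byte sizes reflect the Whisper file layout as I understand it (a 16-byte header, 12 bytes of archive info per archive, and 12 bytes per preallocated point); the retention schema and the count of one million metrics are hypothetical values picked for illustration.

```python
METADATA_BYTES = 16       # file header (assumed from the Whisper layout)
ARCHIVE_INFO_BYTES = 12   # per-archive header entry
POINT_BYTES = 12          # 4-byte timestamp + 8-byte double


def whisper_file_bytes(archives):
    """Size of one preallocated file.

    archives: list of (seconds_per_point, retention_seconds) tuples.
    """
    total_points = sum(retention // step for step, retention in archives)
    return (METADATA_BYTES
            + ARCHIVE_INFO_BYTES * len(archives)
            + POINT_BYTES * total_points)


# Hypothetical schema: one-second resolution kept for one day.
one_second_for_a_day = [(1, 86_400)]
file_size = whisper_file_bytes(one_second_for_a_day)

# A sparse metric that only ever receives one point uses 12 of those bytes.
useful_fraction = POINT_BYTES / file_size

# One million such metrics, one file each.
total_bytes = file_size * 1_000_000

print(file_size)    # 1036828
print(total_bytes)  # 1036828000000
```

That is roughly a terabyte of disk to hold about twelve megabytes of actual data points, before counting the million files themselves.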
So if you want to destroy your Graphite cluster with one weird trick, you could certainly do it without much effort. Just send the kind of metrics workload at it that a modern developer wants to be able to send!
The Emergence Of New Monitoring Systems
Database Performance Monitor isn’t the only service or product capable of handling this kind of workload. There are several companies bringing these capabilities to market, such as SignalFX, and more (edit: SolarWinds AppOptics, I forgot that one; I’m sure I’m forgetting more; apologies in advance). Database Performance Monitor is built for different purposes than those are, so in practice we never “compete” with them, despite the capabilities we share with them.
On the open-source front, if you want to build it yourself (but you shouldn’t except in unusual cases), there’s a limited number of options. Company after company founded to handle this kind of data workload has built their own instead of using open source; many founders have told me that they essentially designed their back end by using my blog post as a first draft of their design document. Some are using standard open source solutions like Cassandra. InfluxDB is beginning to gain prominence as well.
Whether you use open source, buy it from a hosted service, or build your own, it’s not easy–everyone seems to agree there.
Another point of agreement is that people are unsatisfied with Graphite. In just the space of a couple of years, Graphite has gone from being a widely accepted monitoring and graphing solution to being viewed as increasingly unsatisfactory, even in the cases where it doesn’t outright fall short of the required capabilities.
We’re clearly moving into a new era of monitoring. I’ve heard that Netflix’s monitoring infrastructure costs a double-digit percentage of their entire budget. A few years ago this would have been shocking, but now we recognize the value of measuring. “If you can’t measure it, you can’t improve it,” and making it possible for everyone to measure and analyze things is an enormous driver of IT speed and efficiency.
Whether it’s in specialized areas such as database performance, or the ability to create sophisticated custom dashboards from arbitrary metrics, it’s just no longer okay for most companies to live without these capabilities. Those who do are finding themselves out-innovated and out-maneuvered by their competition, who are delivering products to market faster, with better quality, and at much lower cost.
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.
The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.
Where’s the puck going beyond that? Probably analytics. As Adrian Cockcroft has said, we really don’t need more monitoring systems. By this he means gathering and storing metrics isn’t what we should continue to reinvent. Instead, deriving meaningful knowledge is the next challenge.
This post seems to have touched a nerve with some people. I apologize if it seemed like trolling. I’ll avoid editing too much, in order to avoid pulling the rug out from under the conversations and changing what people think they’re talking about while they’re talking about it.
My intention was not to troll, just to express what people have shared with me and what I think it means. Maybe I got carried away and pontificated too much (where was I going in the end there… that wasn’t much of a conclusion). Maybe I should have been more thoughtful and taken more time, or asked for people to review it for me. But I would never intentionally try to stir up an argument.
For reference, Jason Dixon had a thoughtful response to this post.