This week’s Facebook outage is in the news right now, and it’s inevitable that some people are going to rush to post their hot-as-the-surface-of-the-sun takes. Thankfully, this blog is not going to be that kind of place. It’s going to take time—time, investigation, analysis, and (perhaps most of all) patience—for the meaningful details of that event to come out, and for people to be able to come to the RIGHT conclusions about what happened, and how it might have been avoided. Nobody should be offering their Monday-morning armchair SysAdmin thoughts until that happens.
BUT... the phrase “server configuration change” keeps getting bandied about, and that gives me a chance to jump in with a few thoughts, and perhaps a few lessons of my own.
First, let’s lay out what we know:
- Facebook was down for several hours on Wednesday, March 13
THE HORROR! OH, THE HUMANITY!
- Facebook’s initial comment the next day (on Twitter, as it happens, but that’s another thing entirely) included: “as a result of a server configuration change, many people had trouble accessing our apps and services.”
Now at this point, if you’ve worked in IT for more than 15 minutes, you should be scratching your head. *A* server configuration change? How is that even possible, at the scale we know Facebook is operating? Like I said, a lot of the technical details have not yet come to light. And that’s OK. In any outage of this magnitude, it takes time to get all the facts on the table and assemble an accurate picture of what really happened.
Just to be clear, nothing about the Facebook outage (or any outage, if it’s happening to one of THOSE companies) is going to be simple, trite, or accurately encapsulated in a single sentence. In fact, it has all the earmarks of a “Black Swan.” (I’ve written about the IT version of Black Swans before.) So, it bothers me that many sites are leading into their analysis with headlines like:
“Facebook blames outage on server configuration change!”
Let’s talk about server configuration changes for a minute.
In most organizations, this (along with its sibling, the network configuration change) is a real challenge. Depending on which report you read, uncontrolled change accounts for anywhere between 40% and 80% of all business-critical outages. Being able to track, alert on, and roll back changes on servers can mean the difference between a 10-hour outage with a side of rebuild-the-server and a 10-minute outage with a generous helping of “most of our customers didn’t even notice.” For most organizations, the problem isn’t some weird automation glitch like Facebook’s; it’s not having a server configuration monitoring tool that tracks changes at all.
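To make that last point concrete: the core idea behind tracking configuration change isn’t exotic. Below is a minimal sketch (in Python, with hypothetical file paths and no particular product in mind) of the simplest possible version: fingerprint the files you care about, save that as a known-good baseline, and flag anything that drifts from it. A real monitoring tool layers on history, alerting, reporting, and rollback across thousands of servers, but the underlying principle is the same.

```python
#!/usr/bin/env python3
"""Minimal config-drift check: compare watched files against a saved baseline.

Illustrative sketch only. The watched paths and baseline location below are
hypothetical examples, not recommendations.
"""
import hashlib
import json
import sys
from pathlib import Path

# Hypothetical examples: the files you care about on a given server.
WATCHED_FILES = [
    "/etc/nginx/nginx.conf",
    "/etc/haproxy/haproxy.cfg",
    "/etc/sysctl.conf",
]
BASELINE_PATH = Path("/var/lib/config-baseline.json")


def fingerprint(path: str) -> str:
    """Return a SHA-256 hash of a file's contents, or 'MISSING' if it's gone."""
    try:
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()
    except FileNotFoundError:
        return "MISSING"


def save_baseline() -> None:
    """Record the current state of every watched file as the known-good baseline."""
    baseline = {p: fingerprint(p) for p in WATCHED_FILES}
    BASELINE_PATH.write_text(json.dumps(baseline, indent=2))
    print(f"Baseline saved for {len(baseline)} files.")


def check_drift() -> int:
    """Compare the current state to the baseline; return how many files drifted."""
    if not BASELINE_PATH.exists():
        print("No baseline found; run with 'baseline' first.")
        return 0
    baseline = json.loads(BASELINE_PATH.read_text())
    drifted = [p for p, h in baseline.items() if fingerprint(p) != h]
    for path in drifted:
        print(f"CHANGED: {path}")  # in real life, this is where alerting happens
    return len(drifted)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "baseline":
        save_baseline()
    else:
        sys.exit(1 if check_drift() else 0)
```

Run it once with the baseline argument to record the known-good state, then on a schedule (or after every deploy) to flag drift. It’s nowhere near a real configuration monitoring product, but it shows how little it takes to go from “we have no idea what changed” to “we know exactly which file changed, and when we noticed.”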
So, I’d like to propose—hot takes aside—that we IT professionals use this latest highly public outage as a wake-up call, both within our teams and as a way to open a healthy dialogue with management. Here’s why:
Whatever you may or may not like about the platform, you cannot deny that Facebook has effectively infinite resources. Because of this, they can afford both deeply talented engineers and a giant staff of them. If this type of outage can happen to them, it can happen to anyone. For those of us without infinite funds, a deep bench, or a giant staff, the only way to close the gap is with tools that help us be in more places and know more things than we otherwise could. And “now” is always a better time to start evaluating and adopting those tools than after the post-outage dust settles and the finger-pointing begins.