After It Broke: Executing Good Postmortems
No matter how much automation, redundancy, and protection you build into your systems, things are always going to break. It might be a change that breaks an API to another system. It might be a change in a metric. Perhaps you just experienced a massive hardware failure. Many IT organizations have traditionally had a postmortem, or root cause analysis, process to try to improve the overall quality of their processes. The major problem with most postmortem processes is that they devolve into circular finger-pointing matches. The database team blames the storage team, who in turn blames the network team, and everyone walks out of the meeting angry.
As I’m writing this article, I’m working on a system where someone restarted a database server in the middle of a large operation, causing database corruption. This is a classic example of an event that might trigger a postmortem. In this scenario, we had moved to new hardware, and no one had tested the restore times of the largest databases. That gap is hurting us now: the database restore is still running a few hours after I started this article. Other triggers include unexpected data loss, on-call pages, or a monitoring failure that missed a major system fault.
How can we do a better postmortem? The first step is to run blameless postmortems. This process assumes that everyone involved in an incident had good intentions and acted reasonably based on the information available at the time. The technique originates in medicine and aviation, where human lives are at stake. Instead of assigning blame to any one person or team, the situation is analyzed with an eye toward figuring out what happened. Writing a blameless postmortem can be hard, but the outcome is more openness in your organization. You don’t want engineers trying to hide outages to avoid an ugly, blame-filled process.
Some common talking points for your postmortems include:
- Was enough data collected to determine the root cause of the incident?
- Would more monitoring data help with the process analysis?
- Is the impact of the incident clearly defined?
- Was the outcome shared with stakeholders?
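One way to keep these talking points from being skipped is to track them as a checklist on each postmortem record. The following is a minimal sketch in Python, not a prescribed tool: the `Postmortem` class, its fields, and the `open_questions` helper are hypothetical names invented for illustration, and the recorded answer is sample text.

```python
from dataclasses import dataclass, field

# The four talking points, used as checklist keys.
TALKING_POINTS = (
    "Was enough data collected to determine the root cause of the incident?",
    "Would more monitoring data help with the process analysis?",
    "Is the impact of the incident clearly defined?",
    "Was the outcome shared with stakeholders?",
)


@dataclass
class Postmortem:
    """Hypothetical record for one incident review (illustrative only)."""

    title: str
    # Maps each talking point to the review's answer; None means not yet discussed.
    answers: dict = field(
        default_factory=lambda: {question: None for question in TALKING_POINTS}
    )

    def record(self, question: str, answer: str) -> None:
        """Store the answer for one talking point, rejecting unknown questions."""
        if question not in self.answers:
            raise KeyError(f"Unknown talking point: {question}")
        self.answers[question] = answer

    def open_questions(self) -> list:
        """Return the talking points that have not been answered yet."""
        return [q for q, a in self.answers.items() if a is None]


# Usage: record one answer, then see how many talking points remain open.
review = Postmortem(title="Database corruption after mid-operation restart")
review.record(TALKING_POINTS[2], "Yes, documented in the incident summary.")
print(len(review.open_questions()))  # prints 3
```

The record deliberately has no field for naming a responsible person or team. It captures what was learned, which fits the blameless approach described above.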