Disaster Recovery - The Postmortem
So you’ve made it through a disaster recovery event. Well done! Whether it was a simulation or an actual recovery, it was probably a tense and trying time for you and your operations team. The business is no doubt happy to know that everything is back up and running from an infrastructure perspective, and they’re likely scrambling to test their applications to make sure everything has come up intact.
How do you know that your infrastructure is back up and running though? There are probably a lot more green lights flashing in your DC than there are amber or red ones. That’s a good sign. And you probably have visibility into monitoring systems for the various infrastructure elements that go together to keep your business running. You might even be feeling pretty good about getting everything back without any major problems.
But would you know if anything did go wrong? Some of your applications might not be working. Or worse, they could be working, but with corrupt or stale data. Your business users are hopefully going to know if something’s up, but that’s going to take time. It could be something as simple as a server that’s come up with the wrong disk attached, or a snapshot that’s no longer accessible.
Syslog gives you a record of what happened during the recovery, and you can use it to validate your recovery process both after the event and while you’re still in the midst of it. When all of those machines come back on after a DC power failure, for example, they’re going to send a bunch of messages to your (hopefully centralized) syslog targets. You can go back through those logs to confirm that hosts came back in the correct order. More importantly, if an application is still broken after the first pass at recovery, syslog can help you pinpoint where the problem may be. Rather than manually checking every piece of infrastructure and code that makes up the application stack, you can narrow your focus and, hopefully, resolve the problem faster than if you were just told “there’s something wrong with our key customer database.”
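To make the boot-order check concrete, here is a minimal Python sketch. It assumes your central syslog target writes one message per line as an ISO 8601 timestamp, then a hostname, then the message text; your collector's format may well differ. The hostnames, the sample lines, and the expected order are all made up for illustration. The idea is simply to find the first message each host sent after the outage and compare that sequence with the order your runbook says they should come back in.

```python
from datetime import datetime


def first_seen(lines):
    """Map each hostname to the timestamp of its earliest log line.

    Assumes each line looks like: "<ISO 8601 timestamp> <hostname> <message>".
    Lines that do not fit that shape are skipped.
    """
    seen = {}
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        try:
            ts = datetime.fromisoformat(parts[0])
        except ValueError:
            continue
        host = parts[1]
        if host not in seen or ts < seen[host]:
            seen[host] = ts
    return seen


def order_violations(seen, expected_order):
    """Return human-readable problems with the observed recovery order."""
    problems = [f"{h}: no messages received" for h in expected_order if h not in seen]
    present = [h for h in expected_order if h in seen]
    # Compare each host with the next one that should have followed it.
    for earlier, later in zip(present, present[1:]):
        if seen[earlier] > seen[later]:
            problems.append(f"{later} logged before {earlier}")
    return problems


# Hypothetical sample data: storage should come up first, then the
# database, then the application tier, then the web tier.
SAMPLE_LINES = [
    "2024-05-01T10:00:05+00:00 storage01 boot complete",
    "2024-05-01T10:01:30+00:00 app01 boot complete",
    "2024-05-01T10:02:10+00:00 db01 boot complete",
    "2024-05-01T10:02:45+00:00 db01 accepting connections",
]
EXPECTED_ORDER = ["storage01", "db01", "app01", "web01"]

if __name__ == "__main__":
    for problem in order_violations(first_seen(SAMPLE_LINES), EXPECTED_ORDER):
        print(problem)
```

With the sample data above, the script reports that web01 never sent anything and that app01 logged before db01. Both are worth a closer look before the business finds them for you. A first log message is only a rough proxy for "the host is up," so treat a clean result as a starting point, not a guarantee.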
Beyond troubleshooting though, I think syslog is a great tool to use when you need to provide some kind of proof back to the business that their application is either functional or having problems because of an issue outside of the infrastructure layer. You’ve likely heard someone say that “it’s always the network,” when the problem often has nothing to do with the network. But proving that to an unhappy end-user or developer who’s got a key application that isn’t working anymore can be tricky. Having logs available to show them will at least give them some comfort that the problem isn’t with the network, or the storage, or whatever.
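One way to put that kind of evidence together is to count the error-level messages each host sent during the recovery window. The Python sketch below assumes your collector stores messages in the RFC 5424 layout, where each line starts with a `<PRI>` value followed by the version, timestamp, and hostname. The severity is the PRI value modulo 8, and 3 or lower means error or worse. The hostnames and messages are invented for the example.

```python
import re
from collections import Counter

# RFC 5424 header: "<PRI>VERSION TIMESTAMP HOSTNAME ..."
HEADER_RE = re.compile(r"^<(\d{1,3})>\d+ \S+ (\S+) ")

ERROR_OR_WORSE = 3  # severities: 0 emerg, 1 alert, 2 crit, 3 err


def errors_by_host(lines):
    """Count messages at severity 'err' or worse for each hostname."""
    counts = Counter()
    for line in lines:
        match = HEADER_RE.match(line)
        if not match:
            continue  # not in the assumed format; ignore it
        severity = int(match.group(1)) % 8
        if severity <= ERROR_OR_WORSE:
            counts[match.group(2)] += 1
    return counts


# Hypothetical messages from a core switch and a database server.
SAMPLE_LINES = [
    "<30>1 2024-05-01T10:00:05+00:00 sw-core01 netd - - - interface up",
    "<30>1 2024-05-01T10:00:09+00:00 sw-core01 netd - - - link negotiated",
    "<27>1 2024-05-01T10:02:10+00:00 db01 dbd - - - cannot open data file",
    "<27>1 2024-05-01T10:02:11+00:00 db01 dbd - - - startup aborted",
    "<14>1 2024-05-01T10:02:30+00:00 app01 app - - - waiting for database",
]

if __name__ == "__main__":
    for host, count in errors_by_host(SAMPLE_LINES).most_common():
        print(f"{host}: {count} error-level messages")
```

In this made-up window the switch logged nothing at error level, while db01 logged two errors. That doesn’t prove the network is healthy, but it’s a lot more persuasive than “trust me,” and it tells you where to look first.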
Syslog also gives you a way of validating your recovery process, and of giving the business, or the operations manager, evidence that you did the right thing during the recovery. It’s not about burying the business under thousands of pages of logs, but rather demonstrating that you know what you’re doing, you have a handle on what’s happening at any given time, and you can pinpoint issues quickly. Next time there is an issue, the business is going to have a lot more confidence in your ability to resolve the crisis. This can only be a good thing when it comes to improving the relationship between business users and their IT teams.