Disaster Recovery Testing - Learning to Walk the Walk
March 6, 2018
Monitoring and Observability
With the influx of natural disasters, hacks, and increasingly more common ransomware, being able to recover from a disaster is quickly moving up the priority list for IT departments across the globe. In addition to awareness, we are seeing our data centers move from a very static deployment to an ever-changing environment. Each day we see more and more applications getting deployed, either on-premises or in the cloud, and each day we, as IT professionals, have the due diligence to ensure that when disaster strikes we can recover these applications. Without the proper procedures in place to consistently update our DR plans, no matter how well-crafted or detailed they are, the confidence in completing successful failovers decreases. So what now?
We’ve already discussed the first step in our DR process: creating our plan. We’ve also touched on the second step, which is to make it a living document to accommodate for data center change. But there is one more step we need to put in place for a successful failover, and that's testing. It boosts the confidence in the IT department and the organization as a whole.
Testing our DR plan - We learn by doing!
When thinking of disaster recovery plan testing, I always like to compare it to a child. I know, a weird analogy, but if we think about how children learn and get better, it begins to make sense. Children learn by doing; they learn to talk by talking, learn to play sports by playing, etc. The point is that by “walking the walk,” we tend to improve ourselves. The same applies to our DR plans. We can have as many details and processes laid out on paper as we want, but if we can't restore when we need to, we've failed. Essentially, our DR plans are set up for success by also walking the walk, aka testing.
Start small, get bigger!
I’m not recommending going and pulling the plug on your data center tomorrow to see if your plan works. That would certainly be a career-changing move. Instead, you should start small. Take a couple key services as defined in your disaster recovery plan and begin to draft a plan on how to test a failover of the components and servers contained within them. Just as when creating our DR plan, details and coordination are the key to success when creating our testing plan. Know exactly what you are testing for. Don’t simply acknowledge that the servers have booted as a success. Instead, go deeper. Can you log into the application? Can you use the application? Can a member of the department that owns the application sign off stating that it is indeed functioning normally? By knowing exactly what the end goal is you can sign off on a successful test, or on the flip side, take the failures which have occurred and learn from them, updating our plan to reflect any changes, and be prepared for the next testing cycle.
Once you have a couple services defined go ahead and begin to integrate more and ensure that recurring time has been set aside and defined within the disaster recovery plan to carry out these tests. A full-scale DR test is not something that can be performed on a regular basis, but we can carry out smaller tests on a monthly or quarterly basis. Without a consistent schedule and attention to detail we can almost guarantee that items like configuration drift will soon creep up and cause our DR testing to fail, or worse, our DR execution to fail.
I’ve mentioned before that not keeping our DR plans up to date is perhaps one the biggest flaws in the whole DR process. However, not applying a consistent testing plan trumps this. Disaster Recovery, in my opinion, cannot be classified as a project. It cannot have an end date and a closing. We must always ensure, when deploying new services and changing existing applications, that we revisit the DR plan, updating both the process of recovering and the process for testing said recovery. Testing our disaster recovery plan is a key component in ensuring that all that work we have done in creating our plan will be successful when the plan is most needed. Let’s face it. A failed recovery will put a blemish on the entire DR planning process and all the work that has gone into it. Test and test often to make sure this doesn't happen to you.