There are two sayings you can find in virtually every article about data protection: "If your data doesn't exist in at least two places, then it doesn't exist," and "If you can't restore your data, then it doesn't exist." This is for good reason. Not testing backups and (believe it or not) not having backups are the two most common business-affecting mistakes that IT teams make.
Perhaps we should add a third saying to the usual two: “How do you know you can actually restore your data from your current backups? Have you tried?”
Most people are at least passingly familiar with Erwin Schrödinger’s
thought experiment involving quantum mechanics, a radioactive atom, and a cat. The unfortunate feline is neither alive nor dead (more accurately, it’s both) until its state has actually been observed. You can probably see where this analogy is going.
Let’s put a practical (though pessimistic) slant on it: your organization’s backups are definitely dead unless you’ve observed them being successfully restored to production. No quantum superposition here. If you can’t restore your data from backups, it doesn’t exist.
Test your ability to restore from backups. You don't want to discover whether a restore actually works after a disaster has already occurred and the business is counting on you. Murphy's Law is a terrible force, and none of us should trifle with it.
Scary Numbers
Let's look at some numbers. A recent Dell/EMC survey found that just 18% of respondents believe that their current data protection solution will meet all future business challenges. The same survey reported the average cost of data loss as $900,000 and pegged the average cost of downtime at $555,000. These numbers are in line with other surveys covering the data protection space, so let's ponder them for a while.
That only 18% of respondents believe their organization's data protection approach is adequate is disconcerting, to say the least. More accurately, it’s terrifying. To look at it another way, 82% of organizations aren't confident their data protection works.
As for the cost figures ($900,000 for data loss, $555,000 for non-data-loss outages), take them with a grain of salt: the survey covered respondents from organizations of varying sizes. Data loss and outages at large organizations can be extremely costly, and the high-end numbers reported in the survey strayed far from the average.
That said, the costs are worth paying attention to, even for organizations as small as 50 people. It's rarely the direct costs that get you so much as the staffing costs required to solve the problem. If those costs aren't incurred by hiring someone to make the problem go away, they usually manifest as overtime hours for existing IT staff.
Did We Mention You Should Test Your Backups?
This brings us back to "test your backups." More importantly, automate the testing of your backups. Consider what success looks like for each backup and workload type, then set up success conditions for that automated testing beforehand.
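To make that concrete, here is a minimal sketch in Python of what success conditions defined ahead of time might look like. Everything in it (the workload labels, the check names, the run_check() stub) is a hypothetical illustration rather than a feature of any particular backup product; the point is that the criteria live in version-controlled data that an automated job evaluates the same way on every run.

```python
import sys

# Hypothetical sketch: success conditions are declared up front as data, so
# the nightly restore-test job evaluates every workload against the same
# criteria each run. Workload labels and check names are illustrative only.
SUCCESS_CONDITIONS = {
    "file-share": ["tree_is_traversable", "canary_files_match"],
    "vm-workload": ["image_boots", "console_screenshot_captured"],
    "database": ["restore_completes", "integrity_check_passes"],
}

def run_check(workload: str, check: str) -> bool:
    """Stub dispatcher; a real runner would call actual verification logic."""
    print(f"[{workload}] running check: {check}")
    return True  # replace with the real test for each check name

def main() -> int:
    failures = [
        (workload, check)
        for workload, checks in SUCCESS_CONDITIONS.items()
        for check in checks
        if not run_check(workload, check)
    ]
    for workload, check in failures:
        print(f"FAILED {workload}: {check}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

A nonzero exit code makes it trivial to wire the job into cron, a CI pipeline, or a monitoring system that pages someone when a restore test fails.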
Consider, for example, the backup of a file share on a file server. There are several possible success conditions: you might want to make sure the directory structure is readable and traversable, and you might also want to verify that the files themselves are what you expect.
One way to do this is to place canary test files at specific points in the file structure and verify that the version offered by the data protection solution is identical to the version you know was placed in that path. You may want to test not only that the most recent backup is traversable in this manner, but that at least the last three versions are as well. This helps with problems such as ransomware, where the most recent backups may already contain encrypted files.
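A minimal sketch of that canary-file check might look like the following. It assumes, hypothetically, that the last three restored versions of the share have been mounted read-only under /mnt/restore/v1 through /mnt/restore/v3, and that known-good SHA-256 hashes were recorded when the canary files were planted; the paths and hash values shown are placeholders.

```python
import hashlib
import os
from pathlib import Path

# Known-good hashes recorded when the canary files were planted.
# Paths and hash values below are placeholders for illustration.
CANARY_HASHES = {
    "finance/canary.txt": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    "engineering/deep/nested/canary.txt": "60303ae22b998861bce3b28f33eec1be758a213c86c93c076dbe9f558c11c752",
}

def tree_is_traversable(root: Path) -> bool:
    """Walk the entire restored tree; any unreadable directory fails the test."""
    errors = []
    for _ in os.walk(root, onerror=errors.append):
        pass
    for err in errors:
        print(f"traversal error: {err}")
    return not errors

def canaries_match(root: Path) -> bool:
    """Compare each restored canary file against its known-good SHA-256 hash."""
    ok = True
    for rel_path, expected in CANARY_HASHES.items():
        target = root / rel_path
        if not target.is_file():
            print(f"missing canary: {target}")
            ok = False
        elif hashlib.sha256(target.read_bytes()).hexdigest() != expected:
            print(f"hash mismatch (corruption? ransomware?): {target}")
            ok = False
    return ok

# Check the three most recent restore points, not just the newest one.
for version in ("v1", "v2", "v3"):
    root = Path("/mnt/restore") / version
    passed = tree_is_traversable(root) and canaries_match(root)
    print(f"{version}: {'PASS' if passed else 'FAIL'}")
```

A canary that matches in the oldest restore point but not the newest is exactly the early-warning signal you want for silent corruption or in-progress ransomware encryption.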
Testing workloads is trickier. There's nothing special about attaching a file to a hypervisor, calling it a disk image, and trying to get a virtual machine (VM) to boot off of it; actually having the workload boot from that image is the important part.
Some backup offerings can attach a workload image to a hypervisor and try booting it, wait until VM activity settles (indicating that booting is likely to have completed), and then email a screenshot of that VM's console to an administrator.
This allows you to verify that backups of bare-metal workloads (which won't have Hyper-V Integration Services or VMware Tools in them) can successfully boot in the disaster recovery environment. Bare-metal workloads are usually the hardest to verify, especially since most disaster recovery options are physical-to-virtual, at least until the backup can be pushed back to the original iron.
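As a rough illustration of the boot-verify-screenshot loop described above, here is a sketch using libvirt/KVM as a stand-in hypervisor; commercial backup products implement the equivalent internally. It assumes a throwaway domain named restore-test-vm has already been defined around the restored disk image, and the settle heuristic, domain name, and mail settings are all assumptions made for the example.

```python
import io
import smtplib
import time
from email.message import EmailMessage

import libvirt  # pip install libvirt-python; assumes a KVM/libvirt DR host

SETTLE_SECONDS = 60  # how long CPU usage must stay quiet before we screenshot
POLL_INTERVAL = 10

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("restore-test-vm")  # hypothetical pre-defined domain
dom.create()  # power on the restored image

# Crude "activity has settled" heuristic: CPU time growth stays below a
# threshold for SETTLE_SECONDS straight. Real products use smarter signals.
quiet_for = 0
prev = dom.getCPUStats(True)[0]["cpu_time"]  # cumulative, in nanoseconds
while quiet_for < SETTLE_SECONDS:
    time.sleep(POLL_INTERVAL)
    cur = dom.getCPUStats(True)[0]["cpu_time"]
    busy = (cur - prev) / (POLL_INTERVAL * 1e9)  # fraction of one CPU used
    prev = cur
    quiet_for = quiet_for + POLL_INTERVAL if busy < 0.05 else 0

# Grab a console screenshot so a human can eyeball a login prompt.
stream = conn.newStream(0)
mime = dom.screenshot(stream, 0)  # screen 0; returns the image's MIME type
shot = io.BytesIO()
stream.recvAll(lambda st, data, buf: buf.write(data), shot)
stream.finish()

dom.destroy()  # hard power-off; this was only a test boot

# Mail the screenshot to an administrator (SMTP details are placeholders).
msg = EmailMessage()
msg["Subject"] = "Restore test: restore-test-vm console screenshot"
msg["From"] = "backups@example.com"
msg["To"] = "admin@example.com"
msg.set_content("Boot activity settled; console screenshot attached.")
maintype, subtype = mime.split("/", 1)
msg.add_attachment(shot.getvalue(), maintype=maintype, subtype=subtype,
                   filename="console." + subtype)
with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)
```

The screenshot-to-a-human step matters because a bare-metal image can reach a kernel panic or a recovery console and still look "booted" to a naive liveness check; a pair of eyes on the console image settles the question quickly.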
Testing, Testing, 1, 2, 3…
The critical part, as repeatedly mentioned here, is that we all test our backups. If this level of repetition is frustrating, or even infuriating, that's good! The topic is important enough to be worth a little vexation. Hopefully, it will spur you to go test your backups or, better yet, automate the testing of those backups.