Data protection has traditionally involved an on-prem storage vault, but as data protection solutions evolve, this requirement is becoming increasingly optional. On-prem storage vaults exist for a reason: doing away with them eliminates cost and complexity, but it also creates risk. Understanding how "cloud-first" data protection works is important to making informed decisions.
Why On-prem Vaults Exist
To understand the purpose of on-prem data protection vaults, we must first understand the common “3-2-1” industry data protection catechism. This refers to maintaining three copies of one's data, on two different mediums, with one copy being offsite. This rule is generally considered to be the industry-standard approach to backups.
The three copies of one's data typically refer to the production copy, a copy in an on-prem data protection vault, and the offsite copy. The offsite copy is increasingly in the cloud, which makes the "at least two mediums" something that’s in the eye of the beholder.
Traditionally, the two mediums in question would be disk and tape. Whether the copy in the cloud counts as another medium depends on why the "two mediums" clause exists in the first place.
Strictly speaking, one can have the production copy, the on-prem vault, and the cloud copy all running on disk. The point of the two mediums, however, is not to guard against a magical elf one day making all disks stop working at the same time. Instead, the point is that putting data on tape meant that if malware got into your network and ruined both your production copy and your on-prem vault, the tapes were physically disconnected and couldn't be affected. Depending on how your cloud data protection is set up, it can serve the same "air-gapped" purpose.
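The difference between the letter of the rule and its intent can be sketched in a few lines of Python. This is only an illustration: the DataCopy class, its fields, and the check_3_2_1 function are hypothetical names, and the "isolated" check is an assumption that models the reason behind the "two mediums" clause rather than anything in the rule itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    medium: str     # e.g. "disk" or "tape" (labels are illustrative)
    offsite: bool   # stored away from the production site
    isolated: bool  # unreachable by malware on the production network


def check_3_2_1(copies):
    """Return the list of conditions the given copies fail to meet."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than two mediums")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    # Not part of the literal rule: models the intent behind "two mediums".
    if not any(c.isolated for c in copies):
        problems.append("no copy isolated from the production network")
    return problems


all_disk = [
    DataCopy("production", "disk", offsite=False, isolated=False),
    DataCopy("on-prem vault", "disk", offsite=False, isolated=False),
    DataCopy("cloud", "disk", offsite=True, isolated=True),
]
print(check_3_2_1(all_disk))  # ['fewer than two mediums']
```

In this all-disk setup the literal "two mediums" test fails, yet the isolation test passes, which is the eye-of-the-beholder situation described above.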
What's worth noting here is that the on-prem data protection vault should never be considered a separate medium from the production storage. This is because the purpose of the on-prem vault is not to provide disaster-proof data protection. Instead, on-prem data protection vaults exist to ensure that in most cases where a restore is requested, that restore can happen almost instantly.
On-prem vaults can provide this near-instant restore capability in two ways. The first is that by virtue of being on the same network as the production systems, an on-prem vault can restore data to those production systems faster than downloading that same data over what is typically a much slower internet connection.
Secondly, on-prem vaults can frequently be used to light up copies of a workload on an emergency basis, making them useful for coping with the failure of primary infrastructure.
However, this capability should never be viewed as anything other than a convenience.
The ability to host a limited number of workloads on one's on-prem data vault is not a disaster recovery capability. An actual disaster (such as a fire or flood) would take out the whole data center, including both the production compute capacity and the on-prem data vault. True disaster recovery capacity must exist offsite.
One other advantage of on-prem data vaults is that they offer a place for workloads to back up to when the internet is down. This is of mixed value; the data isn't going offsite, so it's not disaster-proof. That said, it can and does provide protection against failure of an individual infrastructure item, even if it can't protect against a data center-destroying disaster.
The flip side of this advantage is that data protection solutions which back workloads up to an on-prem vault before sending them to the cloud also add a delay to the process of getting those workloads offsite. This delay can be minimal, such as for solutions that utilize changed block tracking (CBT) technologies, or quite significant, such as for solutions that use the on-prem vault as a full-on cloud storage gateway (CSG).
Why On-prem Vaults Are Optional
Not all workloads need to be restored quickly. In many corporate environments, fewer than 20% of the workloads actually need to keep the recovery time objective (RTO) as close to zero as possible. (The RTO is the target for how long it may take to restore a workload to expected functionality.)
For most other workloads, if it takes a business day to pull the whole workload down from the cloud, it’s not a big problem. It’s annoying but doesn’t hugely affect the business.
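Whether a full restore from the cloud fits within a business day is simple arithmetic. The sketch below is a rough estimate only: the restore_hours function is a hypothetical helper, and the 500 GB workload size, the link speeds, and the 80% link efficiency are all assumed figures, not measurements.

```python
def restore_hours(size_gb, link_mbps, efficiency=0.8):
    """Hours needed to move size_gb over a link rated at link_mbps.

    efficiency is the assumed fraction of the rated speed actually achieved.
    """
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


workload_gb = 500  # assumed workload size
links = [("10 Gbps LAN", 10_000), ("1 Gbps internet", 1_000), ("100 Mbps internet", 100)]
for label, mbps in links:
    print(f"{label}: {restore_hours(workload_gb, mbps):.1f} h")
```

Under these assumptions the LAN restore takes minutes, the 1 Gbps internet restore takes under two hours, and the 100 Mbps restore takes roughly 14 hours, which is longer than a typical business day.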
Another consideration is that restoring full workloads from backups can be a rare event. There might be a request to pull an individual file out of a file server's backups every other day, and maybe once a week, someone might need to retrieve a configuration file from a virtual machine's backup image. But with the right backup software, these restores are not only quick and simple, they can be done by the end users themselves.
In many cases, if you need to completely restore a workload from backup, it’s a disaster scenario. Either the local production copy is gone (dead SAN) or the data center is out of commission (flooding, for example). In such a scenario, you’d be using the planned disaster recovery infrastructure rather than restoring back to the on-prem location.
Today, this usually means bringing up a copy of the backup in the cloud, because running one's own second site is a rather expensive luxury.
The RPO Problem
At the core of the on-prem data protection vault discussion is the recovery point objective (RPO) problem. RPO measures the amount of data considered to be okay to lose for a given workload.
In the case of a point-of-sale database, for example, the answer to this is probably "zero." At any time, someone could be making a transaction. If that point-of-sale database were rolled back to a version even one second old, a sale could be missed, which would not only mean the organization's financials were off, but inventory records could be affected. It could even affect someone's tips, bonuses, or whether or not they met their sales quota.
Losing even a fraction of a second of data from some workloads is a serious problem.
As a consequence, these real-time workloads usually use application-native clustering to ensure their data is replicated to multiple sites. They often have their own data protection solutions and requirements that no third-party solution is ever going to fully cover. That's just life in IT.
For these workloads, which are also almost always the workloads that require the lowest RTOs, disaster recovery typically involves failing over to another cluster member. It wouldn’t require restoring these workloads from backup except under the most exceptional of circumstances.
Application- vs. Crash-consistent Workloads
Once we get past workloads with real-time requirements, the discussion changes. Here, rolling a workload back to a previous version does happen, almost always because the production copy of the workload got hit by ransomware. Here, workloads exist in two groups: application-consistent workloads and crash-consistent workloads.
With application-consistent workloads, if the production copy of the workload is ever terminated less than gracefully (due to abrupt power loss, for example), data can be corrupted or left in an inconsistent state. These might be databases, mail servers or similar workloads.
Application-consistent workloads are the sort of workloads that really should be operating in application-native clusters, but where rolling back to a previous version is considered acceptable because there are other ways to recover the lost data. In the case of a mail server, this could be because the spam filter keeps a buffer of the past few days of mail messages and can "replay" the sent messages if required.
Similarly, certain financial databases might be fine to roll back because they only receive their orders from something that can be replayed. Perhaps an application (typically an e-store package) writes an individual XML file for every order, and these orders are read by a parser, which both injects the information into the database and moves the parsed file into a “done” folder. Replaying the transactions in this case is as simple as selecting the missed transactions in the “done” folder and moving them back into the input folder.
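The replay step for the XML order example could look something like the following. It is a minimal sketch under stated assumptions: the replay_orders function and its parameters are hypothetical, and it assumes each order file's modification time reflects when the order was written and that moving a file into the "done" folder left that timestamp unchanged.

```python
import shutil
from pathlib import Path


def replay_orders(done_dir, input_dir, restored_at):
    """Move order files written after restored_at back into the input folder.

    restored_at is a POSIX timestamp for the point in time the database was
    rolled back to. Returns the names of the files moved, oldest first.
    """
    done, inbox = Path(done_dir), Path(input_dir)
    # Select only the orders the restored database has not seen.
    missed = [p for p in done.glob("*.xml") if p.stat().st_mtime > restored_at]
    # Replay in the order the files were originally written.
    missed.sort(key=lambda p: p.stat().st_mtime)
    for path in missed:
        shutil.move(str(path), str(inbox / path.name))
    return [p.name for p in missed]
```

A call such as replay_orders("done", "input", restored_at) would leave older orders in place and let the parser pick up the rest on its next pass. If the parser is not idempotent, the cutoff timestamp needs to be chosen carefully so that no order is injected twice.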
Crash-consistent workloads are workloads where it doesn't really matter if the workload has a high RPO. Workloads like render engines, where patches and the configuration are the only things that might change over the course of years, can be restored from backups with minimal effort or grief.
Rational Restores
Modern cloud-first data protection solutions don't have to pull the whole backup image of a workload down to roll that workload back to a previous version. If your workload became compromised by ransomware, for example, you could roll the workload back to a previous version and only have to pull down the changed blocks. This competes with one of the greatest advantages of an on-prem data protection vault: the near-zero RTO it can provide.
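The idea of pulling down only what changed can be illustrated with a toy example. Note that this sketch compares per-block hashes, which is a stand-in for illustration and not how any particular product tracks changed blocks; the function names and the 4-byte block size are assumptions, and the backup argument stands for data that would really live in the cloud.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use far larger blocks


def block_hashes(data, block_size=BLOCK_SIZE):
    """SHA-256 digest of each fixed-size block of data."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def roll_back(local, backup, block_size=BLOCK_SIZE):
    """Rebuild the backup version, counting the blocks that must be fetched."""
    have = block_hashes(local, block_size)
    wanted = block_hashes(backup, block_size)
    restored = bytearray()
    fetched = 0
    for i, digest in enumerate(wanted):
        start = i * block_size
        if i < len(have) and have[i] == digest:
            # Block is unchanged: reuse the local copy, no download needed.
            restored += local[start:start + block_size]
        else:
            # Block differs or is missing locally: "download" it.
            restored += backup[start:start + block_size]
            fetched += 1
    return bytes(restored), fetched


good = b"AAAABBBBCCCCDDDD"         # last known-good version in the backup
encrypted = b"AAAAXXXXCCCCDDDD"    # local copy with one damaged block
restored, fetched = roll_back(encrypted, good)
print(restored == good, fetched)   # True 1
```

Here only one of the four blocks crosses the wire, which is why a rollback after a limited compromise can be far quicker than a full download.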
Modern backup solutions can also perform a “continuous restore.” Essentially, they pre-warm a disaster recovery location in anticipation of a planned failover. This removes the RTO constraints from planned outages. And in the case of unplanned outages, as previously discussed, having an on-prem data protection vault won't matter, because both that vault and production compute capacity will be equally affected.
On-prem data protection vaults cost money to buy or build. They cost time and money to manage and maintain. And, ultimately, most workloads don't need them.
For those workloads where the lowest possible RTO matters, and application-native clustering isn't the answer, by all means use an on-prem data vault. Similarly, if internet connectivity is unreliable where you are, having the on-prem vault to back up to during outages may matter. Everything else, however, is perfectly okay backing up directly to the cloud.
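The guidance above amounts to a short decision rule that can be applied per workload. The sketch below is one possible encoding of it; the Workload class, its fields, and the returned labels are hypothetical and would need to be adapted to a real inventory.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    needs_lowest_rto: bool       # the business needs near-instant restores
    app_native_clustering: bool  # failover is handled by the application


def backup_target(workload, reliable_internet=True):
    """Pick a backup path for one workload, following the rule of thumb above."""
    if not reliable_internet:
        # The vault gives workloads somewhere to back up to during outages.
        return "on-prem vault, then cloud"
    if workload.needs_lowest_rto and not workload.app_native_clustering:
        return "on-prem vault, then cloud"
    return "direct to cloud"


inventory = [
    Workload("point-of-sale database", True, True),
    Workload("legacy line-of-business app", True, False),
    Workload("render engine", False, False),
]
for w in inventory:
    print(f"{w.name}: {backup_target(w)}")
```

In this made-up inventory only the legacy application is routed through the vault, which matches the expectation that most workloads can go straight to the cloud.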
Cloud First is Not Cloud Only
It's important to differentiate those data protection vendors marketing their solution as "cloud first" from those who are marketing it as "cloud only." Cloud only is bad, for all of the reasons mentioned above. Some workloads should have on-prem data protection vaults, and you should run, not walk, away from any data protection solution that doesn't offer them as an option.
In marketing backup products as "cloud first," companies are saying they believe the default should be backing workloads up to the cloud without involving an on-prem data protection vault; but additionally, that workloads should be evaluated on an individual basis. Those workloads that truly need to use an on-prem vault should have that option.
This is a big change in the industry approach to data protection. For most organizations, it can dramatically lower the costs associated with on-prem data protection vaults. Fewer workloads backing up to them means that they can be smaller, and even less complicated.
There is no one data protection approach that will fit all workloads. With luck, however, one can now make informed choices.