Disaster recovery (DR) is one facet of the overarching business continuance plans devised and maintained by the various lines of business within an organization. An effective disaster recovery plan mitigates the impact that unplanned—or even planned—outages of mission- and business-critical systems have on an enterprise’s ability to operate and continue to earn revenue.
To do this, a DR plan provides organizations with a flexible structure that enables them to operate in a unified and collaborative manner to restore, redevelop, and revitalize their systems and build a more resilient infrastructure.
How long could a business continue to operate if it were to lose its payroll system just before pay is calculated and accounts funded? What penalties would a company incur due to the delayed payment of federal, state, and local taxes? What consequences would the company face as a result of not paying employees on time—and how long would workers remain working?
To do or not to do disaster recovery? That is no longer the question. The real question is this: what is the true cost of losing minutes, days, or weeks of important data, and of losing in an instant the trust built over years?
Disaster recovery can no longer remain an afterthought or something considered only when there’s enough budget because today’s organizations are expected to respond promptly to disruptive events and stay operational. Rather than being deterred by the cost of implementing a resiliency plan, organizations must go deeper and assess the real cost of not having a plan at all. For instance, examine the service level agreements (SLAs) that could not be met or the penalties and loss of revenue that would result from an outage. Compare the cost of implementing DR with the penalties, loss of revenue, and lost customer confidence and the choice is clear.
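The comparison described above can be made concrete with a little arithmetic. The following is only a sketch: the `OutageEstimate` fields, the function names, and every figure a caller passes in are hypothetical placeholders, and the model leaves out harder-to-quantify losses such as customer confidence.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Hypothetical per-system inputs; none of these values come from the article."""
    expected_outage_hours_per_year: float
    revenue_lost_per_hour: float
    sla_penalty_per_hour: float

    @property
    def hourly_cost(self) -> float:
        # Cost of one hour of downtime: lost revenue plus SLA penalties.
        return self.revenue_lost_per_hour + self.sla_penalty_per_hour


def annual_outage_cost(est: OutageEstimate) -> float:
    """Expected yearly outage cost when no DR plan shortens the outages."""
    return est.expected_outage_hours_per_year * est.hourly_cost


def dr_is_justified(est: OutageEstimate, annual_dr_cost: float,
                    residual_outage_hours: float) -> bool:
    """True when DR spend plus the remaining outage cost is below the no-DR cost."""
    cost_with_dr = annual_dr_cost + residual_outage_hours * est.hourly_cost
    return cost_with_dr < annual_outage_cost(est)
```

Because lost reputation and goodwill are excluded, a result of `False` from this sketch does not by itself show that DR is not worthwhile.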
Whether the outage is caused by a natural disaster, IT operator/service provider errors, data corruption, or ransomware attacks, an organization must be able to shield itself from disruptions to business operations that result in catastrophic losses, being replaced by opportunistic competitors, and loss of reputation and goodwill.
While these outcomes seem dramatic, they reflect the recent experiences of many well-known organizations that failed to recover in a timely fashion and lost large amounts of critical transactional data, customer information, and trust.
A wide variety of scenarios and root causes can disrupt business operations.
Disaster recovery as a service (DRaaS) in the cloud provides enterprises with unparalleled options for operational flexibility. Organizations use disaster recovery solutions for planned outages more frequently than to recover from catastrophic events.
The two primary goals of disaster recovery are to return affected systems to an operational state as fast as possible and to do so with as little data loss as possible after a catastrophic event or planned outage. The metrics for these two key goals are universally known as the recovery time objective (RTO) and recovery point objective (RPO) respectively.
Each business system will have different requirements for these two metrics depending on the service level agreement between IT operations and the business owners.
The recovery time objective is the target time within which a business system must be restored to a fully operational state after a planned or unplanned outage.
A recovery point objective is the maximum amount of data that can be lost before causing detrimental harm to any given business system. The RPO is usually measured as the time elapsed since the last data backup, replication, or snapshot.
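As an illustration of how the two metrics relate to timestamps, the sketch below measures the data-loss window and the downtime of one outage and compares them with the agreed objectives. The function and parameter names are assumptions made for this example, not part of any product API.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last usable recovery point and the outage."""
    return outage_start - last_recovery_point


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the start of the outage and full restoration."""
    return service_restored - outage_start


def meets_objectives(last_recovery_point: datetime, outage_start: datetime,
                     service_restored: datetime,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only when both the data-loss window and the downtime are within target."""
    return (achieved_rpo(last_recovery_point, outage_start) <= rpo
            and achieved_rto(outage_start, service_restored) <= rto)
```

A system whose last replication finished 10 minutes before the outage and which came back 45 minutes later would satisfy an RPO of 15 minutes and an RTO of one hour, but not an RTO of 30 minutes.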
Traditionally, high availability (HA) protects against single points of failure, while disaster recovery protects against multiple points of failure. In cloud computing, protection against single points of failure at the physical infrastructure layer, including power, cooling, storage, networks, and physical servers, is completely abstracted into the overall architecture through availability and fault domains.
High availability in this case is scalable and highly resilient to data loss or loss of processing performance. Almost every enterprise-class cloud provider builds high availability into their offering.
High availability disaster recovery solutions that deliver zero data loss and zero downtime protection for databases get expensive when complex data mapping and replication technology is involved. These solutions don’t provide ransomware protection, which is achieved via a comprehensive backup with a point-in-time recovery point objective and immutable storage.
Traditional HA solutions work well in conjunction with a low-cost cloud DR (CDR) solution. The add-on CDR technology provides an extra layer of protection for databases that require zero-data-loss, zero-downtime HA and also need ransomware protection with long-term incremental rollback.
On-premises DR is an inflexible and expensive solution because of the high cost of duplicating IT infrastructure in a source location and a fixed target data center location. It cannot function over the WAN, so it requires a dedicated circuit between the source and target data centers to operate like a single interconnected LAN environment. Traditional DR also cannot migrate servers between disparate heterogeneous environments, such as an on-premises environment and a cloud platform or between two different cloud platforms.
DRaaS uses a patchwork of vendor-supplied backup solutions and open source migration tools to create a tightly controlled and highly specific environment. The end user can only recover workloads in the DRaaS provider’s traditional hosting environment from their VMware on-premises environment. These vendor-supplied solutions can be expensive and limited in the range of workloads they can protect and the compute environments they can support.
Oracle typically refers to a DR workflow as a DR plan. A disaster recovery plan—or runbook—is a list of predetermined steps or tasks that need to be completed to transition the compute instance, platform, database, and applications to a predetermined recovery region in the cloud. These include all the tasks or individual steps that are executed by either a human or some sort of automation during a disaster recovery operation such as a switchover or failover. The terms DR plan, DR runbook, and DR workflow are interchangeable.
A disaster recovery operation is the process of executing each predetermined step or task in a DR plan that is required to restore the infrastructure, database, and applications to a fully operational state. Two different terms are used to describe the transition of an application stack to a different location: failover and switchover.
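One way to picture a DR plan is as an ordered list of steps that an operation executes one after another. The sketch below is a minimal model under that assumption; `Step`, `DrPlan`, and the step names a caller supplies are hypothetical and do not mirror any vendor's runbook format.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # automation hook or a prompt for a human; raises on failure


@dataclass
class DrPlan:
    name: str
    steps: List[Step] = field(default_factory=list)

    def execute(self) -> List[Tuple[str, str]]:
        """Run the steps in their predetermined order and stop at the first failure,
        so later steps never run against a partially recovered stack."""
        log: List[Tuple[str, str]] = []
        for step in self.steps:
            try:
                step.action()
            except Exception as exc:
                log.append((step.name, f"failed: {exc}"))
                break
            log.append((step.name, "ok"))
        return log
```

Returning a log rather than raising keeps a record of which steps completed, which is useful when a human has to resume the plan by hand.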
A failover implies an unplanned outage where applications, databases, and virtual machines have crashed and all resources, including storage, data, and guest operating systems, are in an inconsistent state. In this case, the primary cloud region is assumed to be completely inaccessible and unavailable, which indicates that a true disaster recovery operation needs to be triggered.
Therefore, all disaster recovery tasks in a DR plan can only be performed at the standby cloud region. It is crucial that a cloud provider’s DRaaS solution be designed for high availability at the standby region to ensure it is accessible and functional when a catastrophic disaster strikes.
Switchover implies planned downtime that includes an orderly shutdown of applications, databases, and virtual machines or servers. In this case, both the primary and standby regions operate normally, and the IT operations staff is likely focused on “moving” one or more systems from one region to another for maintenance or completing rolling upgrades.
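The practical difference between the two operations is where tasks are able to run. The sketch below encodes that rule: a failover assumes the primary region is unreachable, so only standby-side tasks remain, while a switchover can use both regions. The `Task` type and the region labels are illustrative assumptions.

```python
from enum import Enum
from typing import List, NamedTuple


class Operation(Enum):
    SWITCHOVER = "switchover"  # planned; both regions operate normally
    FAILOVER = "failover"      # unplanned; primary region assumed inaccessible


class Task(NamedTuple):
    name: str
    region: str  # "primary" or "standby" (labels assumed for this sketch)


def runnable_tasks(tasks: List[Task], operation: Operation) -> List[Task]:
    """Return the tasks that can be executed for the given operation, in plan order."""
    if operation is Operation.FAILOVER:
        # Nothing can be executed at the primary region during a failover.
        return [t for t in tasks if t.region == "standby"]
    return list(tasks)
```

A plan that contains an orderly shutdown task at the primary region would therefore skip that task during a failover and run it during a switchover.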
Modern enterprises may take advantage of one or more of the following cloud deployment models for a variety of reasons, including cost, performance, and business continuance requirements. A robust cloud deployment strategy is becoming more and more prevalent as companies continue to move operations into the cloud.
Cross-regional disaster recovery solutions protect organizations from complete outages that would impact access to business systems hosted in the cloud infrastructure of a single cloud provider. As the name implies, an entire application stack for any given business system, including virtual machines, databases, and applications, can be transitioned on demand to an entirely different cloud region in a different geographical location.
DRaaS solutions should be able to transition an entire enterprise system, such as a human resources portal, financial services portal, or enterprise resource management application, to an entirely different cloud region. DRaaS should be able to meet the recovery time and recovery point objectives required by the service level agreements for each individual application.
Cross-regional DR solutions such as Oracle Cloud Infrastructure (OCI) Full Stack Disaster Recovery protect entire enterprise applications from the catastrophic loss of access to an entire cloud region, including loss of all availability domains in that region.
Hybrid cloud DR is a very popular solution that allows enterprises to transition workloads and virtual machines from their own data centers to cloud infrastructure. A hybrid cloud is often used as a disaster recovery solution to protect the data and viability of a corporation’s critical business systems.
For Oracle customers, hybrid cloud is a loose confederation between OCI and physical servers, Oracle cloud appliances, or other systems in a company’s physical data center. Oracle cloud appliances such as Oracle Private Cloud Appliance X9-2 and Oracle Exadata Cloud@Customer are excellent examples of on-premises systems that integrate nicely with OCI.
Some products from Oracle partners, such as Rackware, can be used to achieve DR between on-premises infrastructure and OCI. Rackware is available through the Oracle Cloud Marketplace.
Multicloud DR solutions protect applications and data by spreading the various components of an application stack across the cloud infrastructures of two or more cloud providers. Oracle’s partnership with Microsoft Azure is a good example of a multicloud relationship.
Disaster recovery as a service should be able to accommodate cross-regional DR, hybrid cloud DR, and multicloud DR. OCI can currently provide DRaaS for cross-regional DR and hybrid cloud DR.
Viable disaster recovery solutions involving databases should always include the means to ensure data consistency. The point at which a database can be recovered to full data consistency, including in-flight transactions, defines the recovery point objective.
Data consistency is much less of a problem for serverless or flat file databases, but recovering sophisticated relational databases, such as Oracle Database or MySQL, to a data-consistent state for a given point in time is very complex—yet still crucial.
When using MySQL Database Service in OCI, a MySQL database system is a logical container for one or more MySQL instances. It provides an interface that enables the management of tasks such as provisioning, automatic backups, and point-in-time recovery.
MySQL binary log and native replication technologies enable point-in-time recovery, high availability, and disaster recovery. MySQL Group Replication is commonly used to create local fault-tolerant clusters for high availability, while MySQL asynchronous replication is typically used for geo-distributed disaster recovery.
A database system with high availability enabled has three MySQL instances placed in different availability domains or fault domains. The database system guarantees that if one instance fails, another takes over, with zero data loss and minimal downtime. For disaster recovery in different regions, inbound replication channels between two database systems can also be created.
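The choice of three instances follows from majority-based consensus. Assuming a cluster that needs a majority of its members to stay online, as MySQL Group Replication does, a group of n members tolerates (n - 1) // 2 failures. The helper names below are made up for this sketch.

```python
def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a majority still remains."""
    if members < 1:
        raise ValueError("a group needs at least one member")
    return (members - 1) // 2


def has_quorum(members: int, failed: int) -> bool:
    """True when the surviving members still form a strict majority of the group."""
    if not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    return (members - failed) > members // 2
```

With three instances, one failure leaves two of three online and the group keeps its majority; a second failure would not.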
Oracle Maximum Availability Architecture (MAA) provides architecture, configuration, and lifecycle best practices for Oracle databases, enabling high-availability service levels for databases residing in on-premises, cloud, or hybrid configurations.
Deployment architecture describes how various components such as compute, platform/database, and applications are deployed between cloud regions to create a resilient means of recovering from the total failure of a data center. Deployment architecture describes where everything is located during the normal operation of an application suite and what needs to be recovered at the standby region to get things running again.
Oracle believes DRaaS solutions should have the flexibility to incorporate any combination of DR deployment architectures to meet the individual service level requirements for each unique business system. Cloud providers should offer their customers the freedom to choose “all of the above” solutions to meet SLAs for the wide variety of business systems that organizations typically support.
Many terms are used to describe DR strategies and deployment architectures. However, the various approaches and patterns for describing how to deploy the infrastructure, platform, and applications for disaster recovery fall into two broad strategic categories: active/active and active/standby DR.
The following table breaks down the common terms used to describe the different deployment architectures that fall under the two broad DR strategies.
Strategy | Deployment architecture | RPO | RTO |
---|---|---|---|
Active/standby | Cold standby | Hours | Hours |
Active/standby | Pilot light | Minutes | Hours |
Active/standby | Warm standby | Seconds | Hours |
Active/standby | Hot standby | Seconds | Minutes |
Active/active | Active/active | Near zero | Near zero |
This section attempts to demystify some of the variations of active/active and active/standby approaches to disaster recovery. Both disaster recovery strategies have their place in the arsenal of weapons available for achieving business continuity.
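The table can also be read as a lookup: given the RPO and RTO tiers that a business system requires, which deployment architectures qualify? The sketch below copies the table's rows and treats its four labels as ordered tiers. The ordering of the tiers and the function name are assumptions made for this example.

```python
from typing import List

# Tiers from strictest to most relaxed (ordering assumed for this sketch).
TIERS = ["Near zero", "Seconds", "Minutes", "Hours"]

# (strategy, deployment architecture, RPO, RTO), copied from the table.
ARCHITECTURES = [
    ("Active/standby", "Cold standby", "Hours", "Hours"),
    ("Active/standby", "Pilot light", "Minutes", "Hours"),
    ("Active/standby", "Warm standby", "Seconds", "Hours"),
    ("Active/standby", "Hot standby", "Seconds", "Minutes"),
    ("Active/active", "Active/active", "Near zero", "Near zero"),
]


def candidates(required_rpo: str, required_rto: str) -> List[str]:
    """Architectures whose RPO and RTO tiers are at least as strict as required."""
    def satisfies(offered: str, required: str) -> bool:
        return TIERS.index(offered) <= TIERS.index(required)

    return [name for _, name, rpo, rto in ARCHITECTURES
            if satisfies(rpo, required_rpo) and satisfies(rto, required_rto)]
```

For example, a system that needs an RPO of seconds but can accept an RTO of hours matches warm standby, hot standby, and active/active, and the final choice among them would come down to cost and operational overhead.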
There are many variations of active/standby deployment architectures. Active/standby deployments, sometimes called active/passive deployments, are often characterized as pilot light, cold, warm, and hot standby deployments.
Pilot light, cold standby, warm standby, and hot standby scenarios all represent some form of the same theme where 100% of an application stack is running at the primary region while less than 100% of the same business system is actively running in the standby region.
The following series of very high-level conceptual diagrams is meant to illustrate some fundamental differences between common deployment architectures. The diagrams are not meant as reference architectures; they do not describe how to implement DR for an application stack.
In a cold standby scenario, the virtual machines (VMs), database, and applications exist only at the current primary region. The file and block storage volume groups containing the boot disk and any other virtual disks for each VM are replicated to the standby region.
During a DR operation such as a switchover or failover, the replicated storage containing the same VMs is activated at the standby region, and the same exact VMs are started again at the point where they were stopped or crashed. The VMs are the same exact virtual machines that were formerly running at the active region.
This deployment architecture has several advantages, including lower deployment costs, lower maintenance overhead, and lower operating costs. However, this may not be the best solution for applications that rely on relational databases for the back end since the only way to ensure data consistency is to perform periodic cold backups. This approach works fine for applications that can tolerate occasional short, complete outages.
Most IT operations solve this problem by periodically shutting down the database, taking a snapshot of the block storage, and then resuming normal operations. This defines the recovery point objective since you can only restore to the point in time the backup was completed.
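To show why the snapshot schedule fixes the recovery point, the sketch below finds the most recent snapshot completed at or before a failure and the resulting data-loss window. The function names are hypothetical, and the timestamps are assumed to be snapshot completion times.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import List, Optional


def latest_recovery_point(snapshots: List[datetime],
                          failure: datetime) -> Optional[datetime]:
    """Most recent snapshot completed at or before the failure, or None if there is none."""
    ordered = sorted(snapshots)
    index = bisect_right(ordered, failure)
    return ordered[index - 1] if index else None


def data_loss_window(snapshots: List[datetime],
                     failure: datetime) -> Optional[timedelta]:
    """Time between the usable recovery point and the failure; None if nothing can be restored."""
    point = latest_recovery_point(snapshots, failure)
    return None if point is None else failure - point
```

With snapshots every six hours, a failure just before the next snapshot loses close to six hours of changes, which is why the snapshot interval effectively bounds the achievable RPO for this architecture.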
This architecture will have a slightly longer recovery time, and there’s a much greater risk of data loss, but it’s a viable solution for the right workload. For example, this may be an excellent solution for applications that rely on a flat file database, a serverless database, or no database at all or for customers who simply want to make a set of virtual machines mobile between regions for greater flexibility.
The process flow for DR operations in cold standby deployments depends on the type of operation. In the case of a switchover, tasks are performed at both the primary and standby regions. In the case of a failover, tasks are performed only at the standby region.
In a warm standby scenario, virtual machines exist at both the primary and standby regions but are completely independent of each other and have their own unique host names and preassigned IP addresses. The VMs at the standby region exist and can be stopped or running depending on customer preference.
Databases should be running at both the primary and standby regions.
For Oracle databases, we recommend using Oracle Data Guard to keep the standby database synchronized with the primary database. Refer to our Gold MAA reference architecture for more details.
For non-Oracle databases, respective native replication technologies will be used to keep the databases synchronized between primary and standby regions.
Applications also exist at the standby cloud region but are not running or accessible.
The process flow for DR operations in warm standby deployments depends on the type of operation. In the case of a switchover, tasks are performed at both the primary and standby regions. In the case of a failover, tasks are performed only at the standby region.
In a hot standby scenario, virtual machines exist and are running at both the primary and standby regions with their own unique host names and preassigned IP addresses.
Databases should be running at both the primary and standby regions.
For Oracle databases, we recommend using Oracle Data Guard to keep the standby database synchronized with the primary database. Refer to our Gold MAA reference architecture for more details.
For non-Oracle databases, respective native replication technologies will be used to keep the databases synchronized between primary and standby regions.
Applications are running at the standby cloud region, but are not accessible over the public-facing network.
The process flow for DR operations in hot standby deployments depends on the type of operation. In the case of a switchover, tasks are performed at both the primary and standby regions. In the case of a failover, tasks are performed only at the standby region.
In an active/active scenario, the entire application stack is fully functional and handles a workload at both the primary and standby regions.
Databases should be running at both the primary and standby regions.
For Oracle databases, we recommend using Oracle GoldenGate to support an active/active application configuration. In addition, we recommend maintaining a local standby database in each region using Oracle Data Guard to achieve a near-zero RTO and RPO. Refer to our Platinum MAA reference architecture for more details.
For non-Oracle databases, respective native replication and active/active technologies will be used to keep the database synchronized between primary and standby regions.
Applications are running and accessible over the public-facing network at the standby cloud region and have a running workload.
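The four scenarios described above differ mainly in what already exists or runs at the standby region. The sketch below summarizes that as data and lists which layers still need work before the standby region can serve users. The state labels and the target states are assumptions made for this example, the warm standby VMs are modeled as stopped even though they may also be running, and database role transitions are not modeled.

```python
from typing import Dict, List

# State labels, ordered from least to most ready (assumed for this sketch).
STATES = ["absent", "installed", "running", "serving"]

# What each layer must reach before the standby region can take the workload.
TARGET = {"vms": "running", "database": "running", "applications": "serving"}

# State of each layer at the standby region during normal operation.
STANDBY_STATE: Dict[str, Dict[str, str]] = {
    # Only replicated storage exists at the standby region.
    "cold standby": {"vms": "absent", "database": "absent", "applications": "absent"},
    # VMs exist (modeled as stopped); database runs; applications exist but do not run.
    "warm standby": {"vms": "installed", "database": "running", "applications": "installed"},
    # Everything runs, but applications are not reachable over the public network.
    "hot standby": {"vms": "running", "database": "running", "applications": "running"},
    # The full stack runs and handles a workload.
    "active/active": {"vms": "running", "database": "running", "applications": "serving"},
}


def pending_layers(architecture: str) -> List[str]:
    """Layers that are below their target state at the standby region."""
    state = STANDBY_STATE[architecture]
    return [layer for layer, target in TARGET.items()
            if STATES.index(state[layer]) < STATES.index(target)]
```

The shrinking list from cold standby to active/active mirrors the shorter recovery times shown for those architectures in the table earlier in the article.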
Oracle teams have gone to great lengths to design high availability into their products—including the vast majority of Oracle’s enterprise-class databases and applications—and often devise best practices and deployment architectures for achieving disaster recovery in traditional on-premises settings as well as in the cloud. Each product line devises a DR approach for their individual applications that incorporates loosely coupled steps to recover all the components needed to support their application.
Cloud-based disaster recovery as a service can tie all the loosely coupled steps devised by developers, cloud architects, and product solution architects into a single, seamless workflow that can be initiated with a single click. OCI offers a DRaaS solution called Full Stack Disaster Recovery that is flexible, highly scalable, and highly extensible.
OCI Full Stack Disaster Recovery manages the transition of infrastructure, databases, and applications between OCI regions from anywhere around the globe with a single click. Customers can deploy disaster recovery environments without redesigning or redeploying existing infrastructure, databases, or applications while eliminating the need for specialized management or conversion servers.