What is disaster recovery?

Disaster recovery (DR) is one facet of the overarching business continuance plans devised and maintained by the various lines of business within an organization. An effective disaster recovery plan mitigates the impact that unplanned—or even planned—outages of mission- and business-critical systems have on an enterprise’s ability to operate and continue to earn revenue.

To do this, a DR plan provides organizations with a flexible structure that enables them to operate in a unified and collaborative manner to restore, redevelop, and revitalize their systems and build a more resilient infrastructure.

Why is disaster recovery important?

How long could a business continue to operate if it were to lose its payroll system just before pay is calculated and accounts are funded? What penalties would a company incur due to the delayed payment of federal, state, and local taxes? What consequences would the company face for not paying employees on time, and how long would those employees keep working?

To do or not to do disaster recovery? That is no longer the question. The real question is this: What is the true cost of losing minutes, days, or weeks of important data, along with the trust built over years, in an instant?

Disaster recovery can no longer remain an afterthought, or something considered only when there’s enough budget, because today’s organizations are expected to respond promptly to disruptive events and stay operational. Rather than being deterred by the cost of implementing a resiliency plan, organizations must go deeper and assess the real cost of not having a plan at all. For instance, examine the service level agreements (SLAs) that could not be met or the penalties and loss of revenue that would result from an outage. Compare the cost of implementing DR with the penalties, lost revenue, and lost customer confidence, and the choice is clear.

Revenue, productivity, and lost loyalty (image description): The image presents three key statistics. On the left, 53% of those surveyed responded “Lost revenue.” In the middle, 47% responded “Lost productivity.” On the right, 41% responded “Lost brand equity or trust.”
Source: A commissioned study conducted by Forrester Consulting on behalf of IBM, August 2019. “Which of the following costs does your organization face due to planned and unplanned downtime?”
Base: 100 IT directors in large US enterprises (rank top 3)

Unplanned outages

Whether the outage is caused by a natural disaster, IT operator or service provider errors, data corruption, or a ransomware attack, an organization must be able to shield itself from disruptions to business operations that can result in catastrophic losses, displacement by opportunistic competitors, and loss of reputation and goodwill.

While these outcomes seem dramatic, they reflect the recent experiences of many well-known organizations that failed to recover in a timely fashion and lost large amounts of critical transactional data, customer information, and trust.

A wide variety of scenarios and root causes can disrupt business operations.

Top causes of unplanned downtime (chart description): The bar chart shows causes of unplanned outages segmented into three categories: operational failures on the left, natural disasters in the middle, and human-caused events on the right.
Source: Forrester Research, Inc. Presented at the Gartner Data Center Conference 2014, “When downtime is not an option.”
Base: 94 global disaster recovery decision-makers and influencers who were asked “What was the cause(s) of your most significant disaster declaration(s) or major business disruption?” (Does not include “don’t know” responses; multiple responses were accepted.)

Planned outages

Disaster recovery as a service (DRaaS) in the cloud provides enterprises with unparalleled options for operational flexibility. Organizations use disaster recovery solutions for planned outages more frequently than to recover from catastrophic events.

Typical pain points

  • Traditional approaches to disaster recovery require investments in automation.
  • Even business systems in tier 1 data centers can be impacted by power outages, which are all too frequent. How often has a common incident such as a power outage caused a day or two of lost productivity? IT staff end up spending hours or even days on conference calls getting systems up and running again using stopgap solutions.
  • Some companies spend significant portions of their IT budgets developing in-house automation to manage disaster recovery, only to be afraid of using it, or even of testing it periodically to ensure it continues to work as expected.
  • It often takes a day (or days) to recover from a planned maintenance window.
  • Even well-documented DR plans or runbooks can involve days of lost productivity while IT operations staff perform acts of heroism to get things running again.

Key goals of disaster recovery

The two primary goals of disaster recovery are to return affected systems to an operational state as quickly as possible and to do so with as little data loss as possible after a catastrophic event or planned outage. The metrics for these two key goals are universally known as the recovery time objective (RTO) and the recovery point objective (RPO), respectively.

Each business system will have different requirements for these two metrics depending on the service level agreement between IT operations and the business owners.

Data protection terminology (image description): Tolerance for data loss and tolerance for downtime are depicted on a straight line that extends in opposite directions from the center of the image, with “Data loss” on the left and “Downtime” on the right.

Recovery time objective

The recovery time objective is the time it takes to restore a business system to a fully operational state after planned or unplanned outages.

Recovery point objective

A recovery point objective is the maximum amount of data that can be lost before causing detrimental harm to a given business system. The RPO is usually expressed as the time elapsed since the last data backup, replication, or snapshot.
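To make these two metrics concrete, the following minimal Python sketch shows one way an operations team might record per-system RTO and RPO targets and check current RPO exposure. The system names, target values, and helper functions are hypothetical and are not part of any specific DR product.

```python
from datetime import datetime, timedelta

# Hypothetical RTO/RPO targets per business system; real values come from
# the SLA agreed on with each business owner.
SLA_TARGETS = {
    "payroll":   {"rto": timedelta(hours=1), "rpo": timedelta(minutes=15)},
    "hr_portal": {"rto": timedelta(hours=8), "rpo": timedelta(hours=4)},
}

def rpo_exposure(last_recovery_point: datetime, now: datetime) -> timedelta:
    """Worst-case data loss if a disaster struck right now: the time elapsed
    since the last usable backup, replica sync, or snapshot."""
    return now - last_recovery_point

def meets_rpo(system: str, last_recovery_point: datetime, now: datetime) -> bool:
    return rpo_exposure(last_recovery_point, now) <= SLA_TARGETS[system]["rpo"]

# Example: a payroll snapshot that completed 25 minutes ago violates a
# 15-minute RPO target.
now = datetime(2024, 1, 1, 12, 0)
print(meets_rpo("payroll", now - timedelta(minutes=25), now))  # False
```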

Disaster recovery vs. high availability

Traditionally, high availability (HA) protects against single points of failure, while disaster recovery protects against multiple points of failure. In cloud computing, protection against single points of failure at the physical infrastructure layer, including power, cooling, storage, networks, and physical servers, is completely abstracted into the overall architecture through availability and fault domains.

High availability in this case is scalable and highly resilient to data loss or loss of processing performance. Almost every enterprise-class cloud provider builds high availability into their offering.

High availability disaster recovery solutions that deliver zero data loss and zero downtime protection for databases become expensive when complex data mapping and replication technology is involved. These solutions also don’t provide ransomware protection, which requires a comprehensive backup strategy with a point-in-time recovery point objective and immutable storage.

Traditional HA solutions work well in conjunction with a low-cost cloud DR (CDR) solution. The add-on CDR technology provides an extra layer of protection for databases that require zero data loss, zero downtime HA and need ransomware protection with long-term incremental rollback.

On-premises DR is an inflexible and expensive solution because of the high cost of duplicating IT infrastructure in a source location and a fixed target data center location. It cannot function over the WAN, so it requires a dedicated circuit between the source and target data centers to operate like a single interconnected LAN environment. Traditional DR also cannot migrate servers between disparate, heterogeneous environments, such as between an on-premises environment and a cloud platform or between two different cloud platforms.

DRaaS offerings often use a patchwork of vendor-supplied backup solutions and open source migration tools to create a tightly controlled and highly specific environment. The end user can only recover workloads from their on-premises VMware environment into the DRaaS provider’s traditional hosting environment. These vendor-supplied solutions can be expensive and limited in the range of workloads they can protect and the compute environments they can support.

DR workflows, runbooks, and plans

Oracle typically refers to a DR workflow as a DR plan. A disaster recovery plan—or runbook—is a list of predetermined steps or tasks that need to be completed to transition the compute instance, platform, database, and applications to a predetermined recovery region in the cloud. These include all the tasks or individual steps that are executed by either a human or some sort of automation during a disaster recovery operation such as a switchover or failover. The terms DR plan, DR runbook, and DR workflow are interchangeable.
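As a rough illustration (not Oracle’s implementation), a DR plan can be modeled as an ordered list of steps, each tagged with the region where it runs and whether it is automated or manual. The Python sketch below uses hypothetical names to show that idea; a later example reuses run_plan() to contrast switchover and failover.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DrStep:
    """One predetermined step in a DR plan; a step with no action is manual."""
    name: str
    region: str                                   # "primary" or "standby"
    action: Optional[Callable[[], None]] = None   # None => performed by a human

def run_plan(steps: list[DrStep], available_regions: set[str]) -> None:
    """Execute the plan in order, skipping steps whose region is unreachable."""
    for step in steps:
        if step.region not in available_regions:
            print(f"SKIP   {step.name} (region '{step.region}' is unreachable)")
        elif step.action is None:
            print(f"MANUAL {step.name}: waiting for operator confirmation")
        else:
            print(f"RUN    {step.name}")
            step.action()
```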

DR operations (switchover vs. failover)

A disaster recovery operation is the process of executing each predetermined step or task in a DR plan that is required to restore the infrastructure, database, and applications to a fully operational state. Two different terms are used to describe the transition of an application stack to a different location: failover and switchover.

Failover

A failover implies an unplanned outage where applications, databases, and virtual machines have crashed and all resources, including storage, data, and guest operating systems, are in an inconsistent state. In this case, the primary cloud region is assumed to be completely inaccessible and unavailable, which indicates that a true disaster recovery operation needs to be triggered.

Therefore, all disaster recovery tasks in a DR plan can only be performed at the standby cloud region. It is crucial that a cloud provider’s DRaaS solution be designed for high availability at the standby region to ensure it is accessible and functional when a catastrophic disaster strikes.

Switchover

Switchover implies planned downtime that includes an orderly shutdown of applications, databases, and virtual machines or servers. In this case, both the primary and standby regions operate normally, and the IT operations staff is likely focused on “moving” one or more systems from one region to another for maintenance or completing rolling upgrades.
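Building on the DrStep/run_plan sketch above, the difference between the two operations can be expressed as which regions are reachable when the same plan is executed. The step names below are illustrative only.

```python
# The same plan drives both operations; only the set of reachable regions differs.
plan = [
    DrStep("Quiesce applications",             "primary", lambda: None),
    DrStep("Sync database to standby region",  "primary", lambda: None),
    DrStep("Stop virtual machines gracefully", "primary", lambda: None),
    DrStep("Activate replicated storage",      "standby", lambda: None),
    DrStep("Open databases read/write",        "standby", lambda: None),
    DrStep("Start and validate applications",  "standby", lambda: None),
    DrStep("Update DNS and load balancers",    "standby", lambda: None),
]

# Switchover: planned downtime, both regions are healthy, so every step runs.
run_plan(plan, available_regions={"primary", "standby"})

# Failover: the primary region is assumed lost, so only standby-region steps run.
run_plan(plan, available_regions={"standby"})
```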

Cloud deployment strategy

Modern enterprises may take advantage of one or more of the following cloud deployment models for a variety of reasons, including cost, performance, and business continuance requirements. A robust cloud deployment strategy is becoming more and more prevalent as companies continue to move operations into the cloud.

Cross-regional DR solutions

Cross-regional disaster recovery solutions protect organizations from complete outages that would impact access to business systems hosted in the cloud infrastructure of a single cloud provider. As the name implies, an entire application stack for any given business system, including virtual machines, databases, and applications, can be transitioned on demand to an entirely different cloud region in a different geographical location.

DRaaS solutions should be able to transition an entire enterprise system, such as a human resources portal, financial services portal, or enterprise resource management application, to an entirely different cloud region. DRaaS should be able to meet the recovery time and recovery point objectives required by the service level agreements for each individual application.

Cross-regional DR solutions such as Oracle Cloud Infrastructure (OCI) Full Stack Disaster Recovery protect entire enterprise applications from the catastrophic loss of access to an entire cloud region, including loss of all availability domains in that region.

Hybrid cloud DR solutions

Hybrid cloud DR is a very popular solution that allows enterprises to transition workloads and virtual machines from their own data centers to cloud infrastructure. A hybrid cloud is often used as a disaster recovery solution to protect the data and viability of a corporation’s critical business systems.

For Oracle customers, hybrid cloud is a loose confederation between OCI and physical servers, Oracle cloud appliances, or other systems in a company’s physical data center. Oracle cloud appliances such as Oracle Private Cloud Appliance X9-2 and Oracle Exadata Cloud@Customer are excellent examples of on-premises systems that integrate nicely with OCI.

Some products from Oracle partners, such as Rackware, can be used to achieve DR between on-premises infrastructure and OCI. Rackware is available through the Oracle Cloud Marketplace.

Multicloud DR solutions

Multicloud DR solutions protect applications and data by spreading the various components of an application stack across the cloud infrastructures of two or more cloud providers. Oracle’s partnership with Microsoft Azure is a good example of a multicloud relationship.

Disaster recovery as a service should be able to accommodate cross-regional DR, hybrid cloud DR, and multicloud DR. OCI can currently provide DRaaS for cross-regional DR and hybrid cloud DR.

Data consistency for databases

Viable disaster recovery solutions involving databases should always include the means to ensure data consistency. The point at which a database can be recovered to full data consistency, including in-flight transactions, defines the recovery point objective.

Data consistency is much less of a problem for serverless or flat file databases, but recovering sophisticated relational databases, such as Oracle Database or MySQL, to a data-consistent state for a given point in time is very complex—yet still crucial.

Disaster recovery considerations for MySQL databases

In MySQL Database Service in OCI, a MySQL database system is a logical container for one or more MySQL instances. It provides an interface that enables the management of tasks such as provisioning, automatic backups, and point-in-time recovery.

MySQL binary log and native replication technologies enable point-in-time recovery, high availability, and disaster recovery. MySQL Group Replication is commonly used to create local fault-tolerant clusters for high availability, while MySQL asynchronous replication is typically used for geo-distributed disaster recovery.

A database system with high availability enabled has three MySQL instances placed in different availability domains or fault domains. The database system guarantees that if one instance fails, another takes over, with zero data loss and minimal downtime. For disaster recovery in different regions, inbound replication channels between two database systems can also be created.
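As a minimal monitoring sketch, the following Python snippet checks whether an asynchronous replication channel used for cross-region DR is keeping its lag within an RPO target. It assumes MySQL 8.0.22 or later (for SHOW REPLICA STATUS) and the mysql-connector-python driver; the host name, credentials, and 60-second target are placeholders.

```python
import mysql.connector

RPO_TARGET_SECONDS = 60  # illustrative target, not a recommendation

conn = mysql.connector.connect(
    host="standby-db.example.com", user="monitor", password="********"
)
cur = conn.cursor(dictionary=True)
cur.execute("SHOW REPLICA STATUS")
for channel in cur.fetchall():
    name = channel.get("Channel_Name") or "default"
    lag = channel.get("Seconds_Behind_Source")
    if lag is None:
        print(f"Channel {name}: replication stopped, RPO at risk")
    elif lag > RPO_TARGET_SECONDS:
        print(f"Channel {name}: lag {lag}s exceeds the {RPO_TARGET_SECONDS}s RPO target")
    else:
        print(f"Channel {name}: lag {lag}s is within the RPO target")
conn.close()
```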

Use the following links to learn much more about disaster recovery for MySQL in the cloud.

Disaster recovery considerations for Oracle databases

Oracle Maximum Availability Architecture (MAA) provides architecture, configuration, and lifecycle best practices for Oracle databases, enabling high-availability service levels for databases residing in on-premises, cloud, or hybrid configurations.

Use the following links to learn much more about Oracle Maximum Availability Architecture for disaster recovery in the cloud.

Cloud-based deployment architectures

Deployment architecture describes how various components, such as compute, platform/database, and applications, are deployed between cloud regions to create a resilient means of recovering from the total failure of a data center. It describes where everything is located during the normal operation of an application suite and what needs to be recovered at the standby region to get things running again.

Oracle believes DRaaS solutions should have the flexibility to incorporate any combination of DR deployment architectures to meet the individual service level requirements for each unique business system. Cloud providers should offer their customers the freedom to choose “all of the above” solutions to meet SLAs for the wide variety of business systems that organizations typically support.

Many terms are used to describe DR strategies and deployment architectures. However, the various approaches and patterns for describing how to deploy the infrastructure, platform, and applications for disaster recovery fall into two broad strategic categories: active/active and active/standby DR.

The following table breaks down the common terms used to describe the different deployment architectures that fall under the two broad DR strategies.

Strategy        Deployment architecture   RPO        RTO
Active/standby  Cold standby              Hours      Hours
Active/standby  Pilot light               Minutes    Hours
Active/standby  Warm standby              Seconds    Hours
Active/standby  Hot standby               Seconds    Minutes
Active/active   Active/active             Near zero  Near zero

This section attempts to demystify some of the variations of active/active and active/standby approaches to disaster recovery. Both disaster recovery strategies have their place in the arsenal of weapons available for achieving business continuity.

Active/standby deployment architectures

There are many variations of active/standby deployment architectures. Active/standby deployments, sometimes called active/passive deployments, are often characterized as pilot light, cold, warm, and hot standby deployments.

Pilot light, cold standby, warm standby, and hot standby scenarios are all variations on the same theme, in which 100% of an application stack runs at the primary region while less than 100% of the same business system actively runs at the standby region.

The following series of very high-level conceptual diagrams illustrates some fundamental differences between common deployment architectures. These are not reference architectures, and they do not describe how to implement DR for an application stack.

Cold standby

In this scenario, the virtual machines (VMs), database, and applications exist only at the current primary region. The file and block storage volume groups containing the boot disk and any other virtual disks for each VM are replicated to the standby region.

Cold standby (image description): The primary region is on the left and the standby region is on the right. The primary region has three blocks: application, database, and infrastructure, each containing their respective icons. Both regions have an icon representing a load balancer at the top; the load balancer icon in the standby region is a lighter shade than the one in the primary region. The standby region also has three blocks: application, database, and infrastructure. In the standby region, only the infrastructure block is populated with icons, one each for object storage, block storage, and file storage; the database and application blocks are empty. Two arrows between the infrastructure blocks represent object storage replication and storage replication from the primary to the standby region.

During a DR operation such as a switchover or failover, the replicated storage containing the VMs is activated at the standby region, and the exact same virtual machines that were formerly running at the primary region are started again from the point where they were stopped or crashed.

This deployment architecture has several advantages, including lower deployment costs, lower maintenance overhead, and lower operating costs. However, this may not be the best solution for applications that rely on relational databases for the back end since the only way to ensure data consistency is to perform periodic cold backups. This approach works fine for applications that can tolerate occasional short, complete outages.

Most IT operations teams solve this problem by periodically shutting down the database, taking a snapshot of the block storage, and then resuming normal operations. This cycle defines the recovery point objective, since you can only restore to the point in time when the backup was completed.
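A hypothetical sketch of that cycle follows; the hook functions stand in for whatever database and cloud tooling a team actually uses, and the six-hour interval is only an example. The key point is that the worst-case RPO can never be better than the interval between completed backup cycles.

```python
import time

BACKUP_INTERVAL_HOURS = 6   # worst-case RPO for this system is therefore ~6 hours

def cold_backup_cycle(shutdown_db, snapshot_volume_group, start_db, replicate):
    """One cycle of the periodic cold backup described above."""
    shutdown_db()            # quiesce the database to guarantee on-disk consistency
    snapshot_volume_group()  # take a consistent point-in-time copy of block storage
    start_db()               # resume normal operations
    replicate()              # ship the snapshot/volume group to the standby region

def run_backup_schedule(*hooks):
    while True:
        cold_backup_cycle(*hooks)
        time.sleep(BACKUP_INTERVAL_HOURS * 3600)
```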

This architecture will have a slightly longer recovery time, and there’s a much greater risk of data loss, but it’s a viable solution for the right workload. For example, this may be an excellent solution for applications that rely on a flat file database, a serverless database, or no database at all or for customers who simply want to make a set of virtual machines mobile between regions for greater flexibility.

Example DR workflows for this deployment architecture

The following two scenarios outline how the process flow for DR operations for cold standby deployments might progress.

In the case of a switchover, the following tasks are performed at both the primary and standby regions:

  • Primary applications are quiesced.
  • Primary databases are quiesced and synced to the standby region.
  • Primary virtual machines are stopped gracefully.
  • Primary storage is synced to the standby region.
  • Replicated storage is brought online, replication is reversed, and it is now primary at the standby region.
  • Replicated copies of primary virtual machines are launched, and any system applications or tools required by the application stack are started and are now primary at the standby region.
  • Replicated copies of primary databases are recovered at the standby region using the latest cold backup or transaction and redo logs from the replicated storage.
  • Replicated copies of primary databases are started and mounted and are now primary at the standby region.
  • Replicated copies of primary applications are launched and validated and are now primary at the standby region.
  • Any changes to the DNS and load balancers are made to enable access to the application over the public-facing network.

In the case of a failover, the following tasks are performed only at the standby region (see the sketch after this list for the same sequence expressed as plan data):

  • Replicated storage is brought online, replication is reversed, and it is now primary at the standby region.
  • Replicated copies of primary virtual machines are launched, and any system applications or tools required by the application stack are started and are now primary at the standby region.
  • Replicated copies of primary databases are recovered at the standby region using the latest cold backup or transaction and redo logs from the replicated storage.
  • Replicated copies of primary databases are started and mounted and are now primary at the standby region.
  • Replicated copies of primary applications are launched and validated and are now primary at the standby region.
  • Any changes to the DNS and load balancers are made to enable access to the application over the public-facing network.
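Continuing the earlier DrStep/run_plan sketch, the failover sequence above might be captured as plan data like this; the step descriptions are illustrative, and steps defined without an automated action are treated as manual.

```python
# All steps target the standby region because the primary is assumed unreachable;
# no actions are attached here, so the executor treats every step as manual.
cold_standby_failover = [
    DrStep("Activate replicated storage and reverse replication", "standby"),
    DrStep("Launch replicated VMs and required system tools",     "standby"),
    DrStep("Recover databases from the latest cold backup/logs",  "standby"),
    DrStep("Start and mount the recovered databases",             "standby"),
    DrStep("Launch and validate applications",                    "standby"),
    DrStep("Update DNS and load balancers for public access",     "standby"),
]

run_plan(cold_standby_failover, available_regions={"standby"})
```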

Warm standby

In this scenario, virtual machines exist at both the primary and standby regions but are completely independent of each other and have their own unique host names and preassigned IP addresses. The VMs at the standby region exist and can be stopped or running depending on customer preference.

Warm standby (image description): The primary region is on the left and the standby region is on the right. The primary region has three blocks: application, database, and infrastructure, each containing their respective icons. Both regions have an icon representing a load balancer at the top; the load balancer icon in the standby region is a lighter shade than the one in the primary region. The standby region also has three blocks: application, database, and infrastructure. In the standby region, the infrastructure block is populated with icons, one each for object storage, block storage, and file storage; there is also an icon for virtual machines at the infrastructure level, but it is a lighter shade. The database icons and application icon in the standby region are also a lighter shade. Two arrows between the infrastructure blocks represent object storage replication and storage replication from the primary to the standby region.

Databases should be running at both the primary and standby regions.

For Oracle databases, we recommend using Oracle Data Guard to replicate data from the primary database to the standby database. Refer to our Gold MAA reference architecture for more details.

For non-Oracle databases, respective native replication technologies will be used to keep the databases synchronized between primary and standby regions.

Applications also exist at the standby cloud region but are not running or accessible.

Example DR workflows for this deployment architecture

The following two scenarios outline how the process flow for DR operations for warm standby deployments might progress.

In the case of a switchover, the following tasks are performed at both the primary and standby regions:

  • Primary applications are quiesced.
  • Primary databases are quiesced and synced to the standby region.
  • Primary virtual machines are stopped gracefully.
  • Primary storage is synced to the standby region.
  • Replicated storage is brought online, replication is reversed, and it is now primary at the standby region.
  • Standby virtual machines are launched if not already running, and any system applications or tools required by the application stack are started and are now primary at the standby region.
  • Standby databases are switched to full read/write access and are now primary at the standby region.
  • Standby applications are launched and validated and are now primary at the standby region.
  • Any changes to the DNS and load balancers are made to enable access to the application over the public-facing network.

In the case of a failover, the following tasks are performed only at the standby region:

  • Replicated storage is brought online, replication is reversed if possible, and it becomes primary at the standby region.
  • Standby virtual machines are launched if not already running, and any system applications or tools required by the application stack are started and are now primary at the standby region.
  • Standby databases are recovered using the latest transaction and redo logs from the replicated storage.
  • Standby databases are switched to full read/write access and are now primary at the standby region.
  • Standby applications are launched and validated and are now primary at the standby region.
  • Any changes to the DNS and load balancers are made to enable access to the application over the public-facing network.

Hot standby

In this scenario, virtual machines exist and are running at both the primary and standby regions with their own unique host names and preassigned IP addresses.

Hot standby (image description): The primary region is on the left and the standby region is on the right. The primary region has three blocks: application, database, and infrastructure, each containing their respective icons. Both regions have an icon representing a load balancer at the top; the load balancer icon in the standby region is a lighter shade than the one in the primary region. The standby region also has three blocks: application, database, and infrastructure. Both the primary and the standby regions have icons in the application, database, and infrastructure blocks; the infrastructure block has icons for virtual machines, file storage, object storage, and block storage in both regions. Only the database icons in the standby region are a lighter shade. Two arrows between the infrastructure blocks represent object storage replication and storage replication from the primary to the standby region.

Databases should be running at both the primary and standby regions.

For Oracle databases, we recommend using Oracle Data Guard to replicate data from the primary database to the standby database. Refer to our Gold MAA reference architecture for more details.

For non-Oracle databases, respective native replication technologies will be used to keep the databases synchronized between primary and standby regions.

Applications are running at the standby cloud region, but are not accessible over the public-facing network.

Example DR workflows for this deployment architecture

The following two scenarios outline how the process flow for DR operations for hot standby deployments might progress.

In the case of a switchover, the following tasks are performed at both the primary and standby regions:

  • Primary applications are quiesced.
  • Primary databases are quiesced and synced to the standby region.
  • Primary virtual machines are stopped gracefully.
  • Primary storage is synced to the standby region.
  • Replicated storage is brought online, replication is reversed, and it is now primary at the standby region.
  • Standby virtual machines are already running, and any system applications or tools required by the application stack are started and are now primary at the standby region.
  • Standby databases are switched to full read/write access and are now primary at the standby region.
  • Standby applications are launched and validated and are now primary at the standby region.
  • Any changes to the DNS and load balancers are made to enable access to the application over the public-facing network.

In the case of a failover, the following tasks are performed only at the standby region:

  • Replicated storage is brought online, replication is reversed if possible, and it becomes primary at the standby region.
  • Standby virtual machines are launched if not already running, and any system applications or tools required by the application stack are started and are now primary at the standby region.
  • Standby databases are recovered using the latest transaction and redo logs from the replicated storage.
  • Standby databases are switched to full read/write access and are now primary at the standby region.
  • Standby applications are launched and validated and are now primary at the standby region.
  • Any changes to the DNS and load balancers are made to enable access to the application over the public-facing network.

Active/active deployment architecture

In this scenario, the entire application stack is fully functional and handles a workload at both the primary and standby regions.

Active/active deployment architecture (image description): The primary region is on the left and the standby region is on the right. Each region has three blocks: application, database, and infrastructure, each containing their respective icons. Both regions have an icon representing a load balancer at the top, and neither one is grayed out. Icons in the application, database, and infrastructure blocks in both regions are shown in color. One arrow between the infrastructure blocks represents optional storage replication from the primary to the standby region.

Databases should be running at both the primary and standby regions.

For Oracle databases, we recommend using Oracle GoldenGate to achieve an active/active application configuration. In addition, we recommend maintaining a local standby database in each region using Oracle Data Guard to achieve near-zero RTO and RPO. Refer to our Platinum MAA reference architecture for more details.

For non-Oracle databases, respective native replication and active/active technologies will be used to keep the database synchronized between primary and standby regions.

Applications at the standby cloud region are running, accessible over the public-facing network, and actively handling a workload.

Automating disaster recovery tasks with DRaaS

Oracle teams have gone to great lengths to design high availability into their products—including the vast majority of Oracle’s enterprise-class databases and applications—and often devise best practices and deployment architectures for achieving disaster recovery in traditional on-premises settings as well as in the cloud. Each product line devises a DR approach for their individual applications that incorporates loosely coupled steps to recover all the components needed to support their application.

Cloud-based disaster recovery as a service can tie all the loosely coupled steps devised by developers, cloud architects, and product solution architects into a single, seamless workflow that can be initiated with a single click. OCI offers a DRaaS solution called Full Stack Disaster Recovery that is flexible, highly scalable, and highly extensible.

OCI Full Stack Disaster Recovery manages the transition of infrastructure, databases, and applications between OCI regions from anywhere around the globe with a single click. Customers can deploy disaster recovery environments without redesigning or redeploying existing infrastructure, databases, or applications while eliminating the need for specialized management or conversion servers.