Disclaimer: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.
This FAQ answers common questions about how Oracle achieves resilience and continuous availability of our core infrastructure services and hosting platform. Customers of Oracle Cloud might be interested in these answers for several reasons:
We don’t make such distinctions. Instead, we categorize our services by dependency level, availability scope, and data plane versus control plane. These categories are designed to provide various useful tradeoffs among availability, durability, performance, and convenience.
These levels might be considered layers or tiers in an architectural block diagram. Each layer may depend only on the layers below it.
From bottom to top:
To meet the goals for availability and durability for a service, one of the following availability scopes is chosen for each service:
Control Plane Versus Data Plane
The data plane of a service is the collection of data-processing interfaces and components that implement the functionality of the service that is intended to be used by applications. For example, the virtual cloud network (VCN) data plane includes the network packet processing system, virtualized routers, and gateways, while the Block Volumes data plane includes the implementation of the iSCSI protocol and the fault-tolerant replicated storage system for volume data.
The control plane of a service is the set of APIs and components responsible for the following tasks:
For all types of service, we use the same set of engineering principles to achieve resilience and availability, because the fundamental engineering challenges of building fault-tolerant, scalable, distributed systems are the same for all types of service.
To achieve resilience and continuous availability, it’s necessary to understand and then deal with all of the causes of unavailability—degraded performance and unhandled failures—in cloud-scale systems. There are a vast number of such causes, so we group them into categories according to their fundamental nature.
Traditionally, analysis of the availability of enterprise IT systems has focused on the category of hardware failure. However, for cloud systems, hardware failure is a relatively minor and well-understood problem. It's now relatively easy to avoid or mitigate most single points of hardware failure. For example, racks can have dual power feeds and associated power distribution units, and many components are hot-swappable. Large-scale hardware failure and loss is of course possible, for example because of natural disasters. However, our experience, and reports in public post-mortems from other cloud vendors, show that failure or loss of an entire data center happens extremely rarely, relative to the other causes of unavailability. Large-scale hardware failure must still be handled (for example, with disaster recovery and other mechanisms), but it is far from being the dominant availability problem.
The dominant causes of unavailability in cloud-scale systems are as follows:
These challenges are universal—they are part of the "laws of physics" for cloud-scale distributed systems.
For each of the preceding categories, we use proven engineering strategies to tackle the problem. The most important of these are:
Principles of Architecture and System Design
Many of these principles exist, but we'll focus on those most relevant to resilience and availability.
To handle software bugs and mistakes by operators that have relatively localized effects, we follow the principles of recovery-oriented computing [1]. At a high level, this means that rather than trying to guarantee that we never have a problem (which is impossible to test), we focus on handling any problems unobtrusively, in a way that can be tested. In particular, we focus on minimizing mean time to recovery (MTTR), which is a combination of mean time to detect, mean time to diagnose, and mean time to mitigate.
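The decomposition of MTTR described above can be illustrated with a small calculation. This is a minimal sketch, not a description of any Oracle tooling: the `Incident` record, its field names, and the sample durations are all assumptions made up for the example. It shows only that the three phase means add up to the overall MTTR, so a reduction in any one phase reduces the total.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    # All durations are in minutes; the field names are illustrative.
    time_to_detect: float
    time_to_diagnose: float
    time_to_mitigate: float

    @property
    def time_to_recover(self) -> float:
        return self.time_to_detect + self.time_to_diagnose + self.time_to_mitigate


def mttr(incidents: list[Incident]) -> float:
    """Mean time to recovery across a non-empty list of incidents."""
    return mean(i.time_to_recover for i in incidents)


def mttr_breakdown(incidents: list[Incident]) -> dict[str, float]:
    """Mean of each phase; the three values sum to mttr(incidents)."""
    return {
        "detect": mean(i.time_to_detect for i in incidents),
        "diagnose": mean(i.time_to_diagnose for i in incidents),
        "mitigate": mean(i.time_to_mitigate for i in incidents),
    }


# Made-up sample data, not real measurements.
incidents = [
    Incident(2.0, 5.0, 3.0),
    Incident(1.0, 10.0, 4.0),
    Incident(3.0, 3.0, 2.0),
]
print(mttr(incidents))
print(mttr_breakdown(incidents))
```

In this made-up data set, diagnosis is the largest phase, which is the kind of signal that would point engineering effort toward better diagnostics rather than faster detection.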
Our aim is to recover so quickly that human users aren’t inconvenienced by the issue. The following points help us to achieve this goal:
Minimizing the Effects of Issues
To deal with bugs and mistakes that might have broader effects, we build mechanisms to minimize the "blast radius" of any issues. That is, we focus on minimizing the number of customers, systems, or resources that are affected by any issues, including the particularly challenging issues of multitenant "noisy neighbors," offered overload, degraded capacity, and distributed thrash. We achieve this by using various isolation boundaries and change-management practices (see the following sections).
Architectural Concepts Arising from Design Principles
Many of these concepts exist, but we’ll focus on concepts for limiting the blast radius.
Placement Concepts Enshrined in Our Public API: Regions, Availability Domains, and Fault Domains
Because fault domains are relatively new, we’ll describe those in more detail.
Fault domains are used to limit the blast radius of problems that happen when a system is being actively changed—for example, deployments, patching, hypervisor restarts, and physical maintenance.
The guarantee is that, in a given availability domain, resources in at most one fault domain are being changed at any point in time. If something goes wrong with the change process, some or all of the resources in that fault domain might be unavailable for a while, but the other fault domains in the availability domain aren't affected. Each availability domain contains at least three fault domains, in order to allow quorum-based replication systems (for example, Oracle Data Guard) to be hosted with high availability within a single availability domain.
As a result, for a dominant category of availability problems—software bugs, configuration errors, mistakes by operators, and performance issues that occur during a change procedure—each fault domain acts as a separate logical data center within an availability domain.
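The change-time guarantee described above can be sketched in a few lines. This is an illustration under stated assumptions, not Oracle's deployment system: the fault domain labels, the `change_succeeds` callback, and the simple majority rule are all hypothetical. The sketch shows why a three-way replicated system spread across three fault domains keeps its quorum while one fault domain at a time is being changed, and why a failed change stays confined to one fault domain.

```python
FAULT_DOMAINS = ["FD-1", "FD-2", "FD-3"]  # illustrative labels


def has_quorum(replica_fds, unavailable_fds):
    """True if a strict majority of replicas sit in available fault domains."""
    up = sum(1 for fd in replica_fds if fd not in unavailable_fds)
    return up > len(replica_fds) // 2


def rolling_change(replica_fds, change_succeeds):
    """Change one fault domain at a time; stop at the first failed change.

    change_succeeds(fd) is a hypothetical callback that applies the change
    to one fault domain and reports whether the domain came back healthy.
    Returns (domains_changed, quorum_held_throughout).
    """
    changed = []
    quorum_held = True
    for fd in FAULT_DOMAINS:
        # While fd is being changed, treat everything in it as unavailable.
        quorum_held = quorum_held and has_quorum(replica_fds, {fd})
        if not change_succeeds(fd):
            break  # the bad change stays confined to this fault domain
        changed.append(fd)
    return changed, quorum_held


# One replica per fault domain; the change goes wrong in FD-2.
spread = ["FD-1", "FD-2", "FD-3"]
print(rolling_change(spread, lambda fd: fd != "FD-2"))

# All replicas in one fault domain: quorum is lost while FD-1 is changed.
stacked = ["FD-1", "FD-1", "FD-1"]
print(rolling_change(stacked, lambda fd: True))
```

The second call shows the contrast: a system that ignores fault domains when placing replicas gets no benefit from the one-at-a-time guarantee.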
Fault domains also protect against some kinds of localized hardware failure. The properties of fault domains guarantee that resources placed in different fault domains don't share any potential single points of hardware failure within the availability domain, to the greatest practical extent. For example, resources in different fault domains don't share the same "top-of-rack" network switch, because the standard design of such switches lacks redundancy.
However, the ability of fault domains to protect against problems in hardware or in the physical environment stops at that local level. In contrast to availability domains and regions, fault domains do not provide any large-scale physical isolation of infrastructure. In the rare case of a natural disaster or availability-domain-wide infrastructure failure, resources in multiple fault domains would likely be affected at the same time.
Our internal services use fault domains in the same way that customers should be using them. For example, the Block Volumes, Object Storage, and File Storage services store replicas of data in three separate fault domains. All components of all control planes and data planes are hosted in all three fault domains (or, in a multiple-availability-domain region, in multiple availability domains).
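A customer application can follow the same placement rule. The sketch below is a hypothetical helper, not part of any Oracle SDK: the inventory structure, host names, and fault domain labels are assumptions. It picks one host from each of three distinct fault domains and refuses to place replicas if that spread is not possible.

```python
def place_replicas(hosts_by_fd, replica_count=3):
    """Pick one host from each of `replica_count` distinct fault domains.

    hosts_by_fd maps a fault domain label to a list of candidate hosts.
    Raises ValueError if there are not enough non-empty fault domains.
    """
    usable = sorted(fd for fd, hosts in hosts_by_fd.items() if hosts)
    if len(usable) < replica_count:
        raise ValueError("not enough fault domains for the requested replicas")
    return {fd: hosts_by_fd[fd][0] for fd in usable[:replica_count]}


# Illustrative inventory; labels and host names are made up.
inventory = {
    "FD-1": ["host-a", "host-b"],
    "FD-2": ["host-c"],
    "FD-3": ["host-d", "host-e"],
}
print(place_replicas(inventory))
```

Failing loudly when fewer than three fault domains have capacity is a deliberate choice in this sketch: silently stacking two replicas in one fault domain would remove the isolation the replicas are meant to provide.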
Service cells are used to limit the blast radius of issues that happen even when a system is not being actively changed. Such problems can arise because the workload of a multitenant cloud system can change in extreme ways at any time, and because complex partial failures can occur in any large distributed system at any time. These scenarios might trigger subtle hidden bugs or emergent performance issues.
In addition, service cells also limit the blast radius in some rare but challenging scenarios when the system is being actively changed. A classic example is when deployment to an individual fault domain appears successful—no errors or change in performance—but as soon as the second or final fault domain has been updated, new interactions within the system (at full cloud scale with production workload) cause a performance issue.
Note that the use of service cells is an architectural pattern, not a concept that is explicitly named in the Oracle Cloud API or SDK. Any multitenant system can use this architectural pattern; it doesn't require special support from the cloud platform.
Service cells work as follows:
The result is that each service cell is yet another kind of "logical data center"—a logical grouping of performance isolation and fault isolation—within a single availability domain or region.
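Because service cells are an architectural pattern rather than a platform feature, any multitenant application can apply them. The following is a minimal sketch under an assumed policy: tenants are pinned to cells with a stable hash. That assignment rule, the function names, and the tenant IDs are assumptions for illustration and do not describe how Oracle assigns workloads to cells.

```python
import hashlib


def cell_for(tenant_id: str, cell_count: int) -> int:
    """Map a tenant to a cell with a stable hash (an assumed policy)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_tenants(tenants, cell_count, failed_cell):
    """Tenants whose requests land in the failed cell."""
    return [t for t in tenants if cell_for(t, cell_count) == failed_cell]


tenants = [f"tenant-{n}" for n in range(1000)]
hit = affected_tenants(tenants, cell_count=10, failed_cell=3)
print(f"{len(hit)} of {len(tenants)} tenants affected")
```

A problem confined to one cell reaches only the tenants mapped to that cell. Note that plain modulo hashing remaps many tenants when the cell count changes, so a real system would more likely keep an explicit tenant-to-cell table or use consistent hashing.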
In summary, service cells and fault domains complement each other in the following ways:
We combine the properties of fault domains and service cells into a unified strategy when we perform deployments and patching.
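One way to picture such a combined strategy is as an ordered plan of small steps with a validation gate after each one. The ordering below (finish every fault domain of one cell before touching the next cell) and the `deploy` and `validate` callbacks are assumptions for illustration; the source does not specify the exact sequence Oracle uses.

```python
def deployment_plan(cells, fault_domains):
    """Order deployment steps: one cell at a time, one fault domain at a time."""
    return [(cell, fd) for cell in cells for fd in fault_domains]


def run_deployment(plan, deploy, validate):
    """Apply steps in order; halt at the first step that fails validation.

    deploy(cell, fd) and validate(cell, fd) are hypothetical callbacks.
    Returns (completed_steps, failed_step_or_None).
    """
    completed = []
    for step in plan:
        deploy(*step)
        if not validate(*step):
            return completed, step  # everything after this step is untouched
        completed.append(step)
    return completed, None


plan = deployment_plan(["cell-1", "cell-2"], ["FD-1", "FD-2", "FD-3"])

# Simulate a problem that appears only once the final fault domain
# of the first cell has been updated.
done, failed = run_deployment(
    plan,
    deploy=lambda cell, fd: None,
    validate=lambda cell, fd: (cell, fd) != ("cell-1", "FD-3"),
)
print(done, failed)
```

In the simulated run, the issue surfaces only after the last fault domain of the first cell is updated, yet the second cell has not been changed at all, so the blast radius is one cell.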
Service Engineering Procedures
Because both testing and operational excellence are critical to the reliability of cloud systems, we have a large number of engineering procedures. Following are some of the more important ones that leverage the concepts mentioned in the preceding section:
Yes. In each region, all availability domains offer the same set of services.
In single-availability-domain regions, customers can use fault domains (logical groups with decorrelated failure modes between groups) to achieve most of the properties of separate "logical data centers." Customers can also use multiple regions for disaster recovery (DR).
In multiple-availability-domain regions, customers can use fault domains in the same way. Customers can also use a combination of availability-domain-local services, inter-availability-domain failover features (such as DBaaS with Data Guard), and regional services (Object Storage, Streaming) to achieve full HA across higher-level "logical data centers" (availability domains). Finally, customers can also use multiple regions for DR.
In all cases, customers can use the concept of service cells to further isolate even the most severe issues, such as distributed thrash.
We achieve this via fault domains, service cells, and our operational procedures for incremental deployment and validation. See the discussion earlier in this document.
Yes. All categories of services are deployed across multiple logical data centers—separate logical groupings of fault isolation and performance isolation—for resilience and continuous availability.
In single-availability-domain regions, we offer fault domains as the mechanism for providing multiple "logical data centers," as discussed elsewhere in this document.
In multiple-availability-domain regions, we offer services and features that provide an even higher level of physical durability of synchronously replicated data. This comes at a modest cost in performance, because of the distance between availability domains in the region and the speed of light.
We do not offer automatic HA or failover mechanisms across regions, because doing so would couple regions tightly together and incur the risk that multiple regions experience problems at the same time. Instead, we enable various forms of asynchronous replication between regions, and offer a growing list of features, such as asynchronous copy and backup, to enable disaster recovery (DR) across regions.
This is a complicated question, so to clarify, we’ll restate it in a couple of different ways:
The answer is in two parts.
We use architectural principles that significantly reduce correlated failure across dependent services. In some cases, this technique reduces the probability of correlated failure to a degree that it can be ignored from the perspective of meeting an availability service level agreement (SLA).
In particular, we use service cells, as described earlier in this document. Cells help with this problem because if internal service A is affected by a problem in one of its dependencies, service B, then the problem with service B is very likely confined to a single cell. Other higher-level services that use service B, and the customer's own applications, are likely to be using other cells that are not affected. This is a probabilistic argument that depends on the number of cells, which is a hidden internal parameter that changes (it increases) over time. For that reason, we give no quantification or guarantee beyond the standalone SLAs of services A and B. In practice, however, cells can significantly decorrelate failures between services.
Many of our shared internal services—for example, the Workflow and Metadata services for control planes, and the Streaming/Messaging service—use service cells to decorrelate outages for the upstream services that use them.
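The probabilistic argument about cells can be made concrete with a deliberately simple model. The assumptions here are ours, not Oracle's: callers are spread uniformly across the cells of service B, and a given problem impairs a fixed number of cells. The cell counts are arbitrary examples, because the real number is an internal parameter that is not published.

```python
from fractions import Fraction


def fraction_affected(cell_count: int, impaired_cells: int = 1) -> Fraction:
    """Share of callers routed to an impaired cell, assuming a uniform spread."""
    if cell_count < 1 or not 0 <= impaired_cells <= cell_count:
        raise ValueError("need cell_count >= 1 and 0 <= impaired_cells <= cell_count")
    return Fraction(impaired_cells, cell_count)


for n in (1, 4, 16, 64):
    print(n, float(fraction_affected(n)))
```

Under this model, the share of upstream callers exposed to a single-cell problem in service B falls as 1/N, which is why adding cells over time weakens the correlation between an outage in B and an outage in the services that depend on it.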
The following guidance is high level because the low-level implementation and details of services can and do change. But for the key dimensions of compute, storage, networking, and authentication/authorization, we indicate the following dependencies.
For control planes, the common dependencies are as follows:
Some control planes obviously have service-specific dependencies. For example, the Compute control plane, when launching a bare metal or VM instance, depends on:
For core service data planes, the general principle is that each data plane is intentionally designed to have minimal dependencies, in order to achieve high availability, fast time to diagnosis, and fast time to recovery. The results of that principle are as follows:
For IaaS data planes, the general principle is to depend only on core or lower-level data planes (in order to avoid cyclic dependencies).
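The rule that a service may depend only on lower layers is easy to check mechanically. The sketch below uses made-up service names and layer numbers that do not correspond to actual Oracle services; it flags any dependency that does not point strictly downward.

```python
def layering_violations(level, depends_on):
    """Return (service, dependency) pairs that do not point strictly downward.

    level maps a service name to an integer layer (0 is the lowest).
    depends_on maps a service name to the services it calls.
    If the result is empty, the dependency graph cannot contain a cycle,
    because every edge goes from a higher level to a lower one.
    """
    return [
        (svc, dep)
        for svc, deps in depends_on.items()
        for dep in deps
        if level[dep] >= level[svc]
    ]


# Hypothetical services and layers, for illustration only.
level = {"core-a": 0, "core-b": 0, "iaas-x": 1, "iaas-y": 1}
depends_on = {
    "iaas-x": ["core-a", "core-b"],  # fine: both dependencies are lower
    "iaas-y": ["core-a", "iaas-x"],  # violation: iaas-x is on the same layer
}
print(layering_violations(level, depends_on))
```

A check like this could run in a build pipeline so that a new dependency that would break the layering is caught before deployment.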
Yes. Oracle Cloud Infrastructure services are architected to be region-independent, so that services in one region can continue to operate even when that region is isolated from other Oracle Cloud Infrastructure regions, from the global control plane, or from both. Both data plane and control plane functionality, including service API endpoints, continue to be available even if the region is isolated.
Many Oracle Cloud Infrastructure services offer cross-region functionality, such as the cross-region object copy function offered by Oracle Cloud Infrastructure Object Storage. Cross-region functionality in Oracle Cloud Infrastructure is always architected as a layer on top of the core service, so that region isolation doesn't impact the core service even if it impacts cross-region functionality. For example, the Object Storage cross-region copy function is a layer on top of the Object Storage service. Consequently, isolation of a region may impact the cross-region copy function, but it will not impact the core Object Storage service in that region.
Yes. Oracle Cloud Infrastructure services are architected so that data plane functionality in every logical data center continues to operate even when it is isolated from the corresponding regional control plane. For example, Compute instances in a logical data center continue to function, along with their attached block volumes and associated virtual network functionality, even when the data center is isolated from the control plane functions of Compute, Block Volumes, VCN, and Identity and Access Management.
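One common way to obtain this property is for the data plane to keep serving from its last known good copy of control-plane state. The sketch below illustrates that general pattern only; the `DataPlane` class, the `fetch_config` callable, and the routing table are hypothetical and do not describe Oracle's implementation.

```python
class DataPlane:
    """Serves requests from a locally held copy of control-plane state."""

    def __init__(self, fetch_config):
        # fetch_config is a hypothetical callable that asks the control plane
        # for the current configuration and raises ConnectionError on failure.
        self._fetch_config = fetch_config
        self._config = None

    def refresh(self) -> bool:
        """Try to pull fresh state; keep the last known good copy on failure."""
        try:
            self._config = self._fetch_config()
            return True
        except ConnectionError:
            return False

    def route(self, destination: str) -> str:
        """Answer from local state only; never calls the control plane."""
        if self._config is None:
            raise RuntimeError("no configuration has ever been loaded")
        return self._config[destination]


control_plane_up = {"value": True}


def fetch_config():
    if not control_plane_up["value"]:
        raise ConnectionError("control plane unreachable")
    return {"10.0.0.0/24": "gateway-a"}


dp = DataPlane(fetch_config)
dp.refresh()                       # control plane reachable: state loaded
control_plane_up["value"] = False  # simulate isolation from the control plane
dp.refresh()                       # fails, but the cached state is kept
print(dp.route("10.0.0.0/24"))
```

The key design choice is that `route` never calls the control plane, so existing resources keep working during isolation. Only changes that need new control-plane state, such as creating or reconfiguring resources, would have to wait.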
Yes. Oracle Cloud Infrastructure is connected to the Internet via multiple redundant providers in all commercial regions. These connections use BGP (Border Gateway Protocol).