Guide to Oracle Data Guard Fast-Start Failover

Architect: High Availability

by John Smiley Published March 2009

This document will guide you through configuring Oracle Data Guard Fast-Start Failover (FSFO) using a physical standby database.  FSFO can provide substantial gains in high availability and disaster recovery preparedness for all environments, from inexpensive Cloud-based systems to global distributed data centers. 

The information in this guide is based on practical experience gained from deploying FSFO in a global corporate production environment. The guide makes few assumptions about your existing environment and includes examples for creating a physical standby database and Data Guard Broker configuration. To get started, all you'll need is Oracle Database Enterprise Edition Release 10.2 or later, a database, and three hosts: two for the databases and a small host for the FSFO observer. The guide attempts to be operating system agnostic; however, some examples may contain platform specific elements such as path and file naming conventions.

DOWNLOAD

Major Components of an FSFO Environment

FSFO builds upon a number of other Oracle technologies and features such as Data Guard, Flashback Database, and Data Guard Broker.

Data Guard

The foundation of FSFO is Data Guard - a primary and at least one standby. The standby can be physical or logical and there can be multiple standbys, but only one of the standbys can be the failover target at any given time. The following paragraphs describe the supported availability modes.

Maximum Availability Mode (Oracle Database 10g Rel 2 and later)

In Maximum Availability mode, FSFO guarantees that no transaction that has received a commit acknowledgment will be lost during a failover. The price for this guarantee is increased commit latency ( log file sync waits). Maximum Availability mode uses synchronous redo transfer and FSFO imposes the additional requirement that the redo is recorded in the standby redo log (SRL) of the target standby (AFFIRM option of log_archive_dest_ n). Overall commit latency is increased by the round-trip network latency. With increased latency comes decreased throughput; however, in some cases the difference in throughput may be made up by increasing parallelism.

Although redo transfer is synchronous, Maximum Availability mode allows the primary to remain available if the standby database becomes unavailable for any reason (e.g. standby database, host, or network failure, etc.). If the primary is unable to contact the standby after a user specified period of time (NET_TIMEOUT option of log_archive_dest_ n), it drops out of synchronous transfer mode and begins operating as though it were in Maximum Performance mode. When the standby becomes available again, the primary and standby re-synchronize and resume synchronous redo transfer.

Maximum Performance Mode (Oracle Database 11g Rel 1 and later)

Oracle Database 11g FSFO adds support for Maximum Performance mode (async redo transfer), providing the flexibility to trade durability for performance. Commit latency is not affected by redo transfer, but committed transactions whose redo has not been received by the standby will be lost during failover. FSFO configurations in Maximum Performance mode may limit potential data loss by specifying the maximum allowable age of transactions that are lost during a failover. For example, if the limit specified is 30 seconds (the default), FSFO guarantees that all transactions that committed prior to 30 seconds ago are preserved during failover. The minimum allowable limit is 10 seconds.

Data Guard Broker

Broker is a Data Guard management utility that maintains state information about a primary and its standby databases. It automatically sets Data Guard related database initialization parameters on instance start and role transitions, starts apply services for standbys, and automates many of the administrative tasks associated with maintaining a Data Guard configuration. FSFO is a feature of Broker which records information about the failover target, how long to wait after a failure before triggering a failover, and other FSFO specific properties.

Flashback Database

Flashback Database is a continuous data protection (CDP) solution integrated with the Oracle Database. It provides a way to quickly restore a database to a previous point in time or SCN using on-disk data structures called flashback logs. Flashing back a database is much faster and more seamless (one simple DDL statement) than traditional point-in-time or SCN-based recovery. FSFO uses Flashback Database as part of the process of reinstating a failed primary as a standby.

Problems with automatic reinstatement are frequently due to misconfiguration, so let's look at this in a bit more detail.

Flashback Database records the before-image of changed blocks. To avoid the overhead of recording every change to every block, Flashback Database takes a "fuzzy" snapshot every 30 minutes and only records the before-image block upon its first change since the last snapshot. Subsequent changes to the same block during the same snapshot are not recorded.

Flashing back a database occurs in two stages:

  • Restore - Flashback Database restores the datafiles to the closest snapshot prior to the specified SCN. This can be compared to performing an RMAN restore of the datafiles from a backup taken prior to the specified SCN, but is much faster.
  • Media Recovery - Once the restore is complete, recovery proceeds as a typical media recovery, applying redo from archived and online redologs and rolling back uncommitted changes with undo. This means that in order for a flashback database operation to succeed,Flashback Database requires all archive redo logs generated between the snapshot time and restore SCN (typically the past 30 minutes of redo). Use the V$RECOVERY_PROGRESS view to monitor recovery status.

For FSFO environments, set db_flashback_retention_target = 60 or higher to provide sufficient Flashback Database history for automatic standby reinstatement. Metadata for the fuzzy snapshot is stored in the flashback log itself. If that metadata is pushed out, Oracle can no longer find a fuzzy snapshot so it will not be able to flash back. To avoid problems due to timing variations, values less than 60 minutes are not recommended and values of 30 or less virtually guarantee Flashback Database failure.

Flashback Database stores its logs in the Flash Recovery Area (FRA), so the FRA must be large enough to store at least 60 minutes of Flashback Database history. The total storage requirement is proportional to the number of distinct blocks changed during snapshots - e.g. 1,000,000 block changes on a small set of blocks generates less Flashback Database history than 1,000,000 changes on a larger set of blocks. A good method to determine Flashback Database storage requirements is to enable Flashback Database and observe the amount of storage it uses during several peak loads. There is little risk in enabling Flashback Database to determine its storage requirements - it can be disabled while the primary is open if necessary. However, re-enabling Flashback Database will require a bounce since the database must be mounted and not open.

FSFO observer

The observer is the third party in an otherwise typical primary/standby Data Guard configuration. It is actually a low-footprint OCI client built into the DGMGRL CLI (Data Guard Broker Command Line Interface) and, like any other client, may be run on a different hardware platform than the database servers. Its primary job is to perform a failover when conditions permit it to do so without violating the data durability constraints set by the DBA. Only the observer can initiate FSFO failover. It's secondary job is to automatically reinstate a failed primary as a standby if that feature is enabled (the default). The observer is the key element that separates Data Guard failover from its pre-FSFO role as the plan of last resort to its leading role in a robust high availability solution.

Note: the FSFO observer version must match the database version. Oracle Database 11g observers are incompatible with 10g databases and vice-versa.

Conditions for FSFO Failover

By default, the observer will initiate failover to the target standby if and only if ALL of the following are true:

  • observer is running
  • observer and the standby both lose contact with the primary
    • Note: if the observer loses contact with the primary, but the standby does not, the observer can determine that the primary is still up via the standby.
  • observer is still in contact with the standby
  • durability constraints are met
  • failover threshold timeout has elapsed

User configurable failover conditions (11g and later)

Oracle Database 11g Rel 1 introduced user configurable failover conditions that can trigger the observer to initiate failover immediately.

Health conditions

Broker can be configured to initiate failover on any of the following conditions. Conditions shown in blue are enabled by default.

  • Datafile Offline (due to IO errors)
  • Corrupted Controlfile
  • Corrupted Dictionary
  • Inaccessible Logfile (due to IO errors)
  • Stuck Archiver

Oracle errors (ORA-NNNNN)

You can also specify a list of ORA- errors that will initiate FSFO failover. The list is empty by default.

Application initiated

Applications can initiate FSFO failover directly using the DBMS_DG.INITIATE_FS_FAILOVER procedure with an optional message text that will be displayed in the observer log and the primary's alert log.

Walkthrough Overview

The walkthrough begins with a single database that will become the primary of a Data Guard configuration. For this build, we will use a single physical standby database. FSFO can also be used with logical standbys and an FSFO-enabled configuration may have multiple standbys with a mix of physical and logical, but only one standby can be the failover target at any given time.

The major steps in the walkthrough are:

  1. Configure Oracle Net (aka SQL*Net)
  2. Prepare the primary database
  3. Create a physical standby
  4. Enable Flashback Database
  5. Create a Broker configuration
  6. Configure the observer
  7. Enable and test FSFO

Conventions used

Database hosts are referred to as "a" and "b" hosts and the databases themselves are referred to as the "a" and "b" databases. The observer host is 'observer.demo.org'.

Names used in the examples:

Database name db1
Database unique names db1_a db1_b
Domain name demo.org
Hostnames dbhost-a dbhost-b observer
Data Guard listener name LISTENER_DG
TNS aliases db1_a db1_b

Input commands are shown in shaded boxes in normal text. Expected output is shown in blue text.

Configure Oracle Net

Data Guard uses Oracle Net (SQL*Net) for communication between the primary and standby databases and the FSFO observer. Getting the Oracle Net configuration right is one of the key factors in a successful FSFO deployment. Improper Oracle Net configuration is a leading cause of reported FSFO issues.

Note: Data Guard requires dedicated server connections for proper operation. Do not use Shared Server (formerly MTS) for Data Guard.

Configure listeners

It's good practice to use separate listeners for application connections and Data Guard connections. This allows Data Guard to remain functional during maintenance periods when the application listeners are down. Be sure to include the Data Guard listener in the local_listeners database parameter.

Most of the network services used in a FSFO environment may use dynamic registration, but to enable Broker to restart instances during role transitions or during reinstatement after a failover, you must define a static service named db_unique_name_dgmgrl. db_domain . (Note: 11.1.0.7 adds the StaticConnectIdentifier Broker database property to allow you to specify a different service name.) If you will be using RMAN to create the standby database, it also needs a static service to restart the database being created. In order to maintain separation of Broker and non-Broker activity, a second static service is recommended.

listener.ora configuration for host "a":

  LISTENER_DG = (DESCRIPTION= (ADDRESS_LIST= (ADDRESS=(PROTOCOL=tcp)(HOST=dbhost-a)(PORT=1522)) ) ) SID_LIST_LISTENER_DG= (SID_LIST= (SID_DESC= (SID_NAME=db1) (SDU=32767) (ORACLE_HOME=/u01/app/oracle/product/11.1.0/db_1) (GLOBAL_DBNAME=db1_a_static.demo.org) ) (SID_DESC= (SID_NAME=db1) (SDU=32767) (ORACLE_HOME=/u01/app/oracle/product/11.1.0/db_1) (GLOBAL_DBNAME=db1_a_dgmgrl.demo.org) ) ) 

listener.ora configuration for host "b":

  LISTENER_DG = (DESCRIPTION= (ADDRESS_LIST= (ADDRESS=(PROTOCOL=tcp)(HOST=dbhost-b)(PORT=1522)) ) ) SID_LIST_LISTENER_DG= (SID_LIST= (SID_DESC= (SID_NAME=db1) (SDU=32767) (ORACLE_HOME=/u01/app/oracle/product/11.1.0/db_1) (GLOBAL_DBNAME=db1_b_static.demo.org) ) (SID_DESC= (SID_NAME=db1) (SDU=32767) (ORACLE_HOME=/u01/app/oracle/product/11.1.0/db_1) (GLOBAL_DBNAME=db1_b_dgmgrl.demo.org) ) ) 

Configure naming method

Create a unique connect alias for each database. Using the db_unique_name of each database as the Oracle Net alias is simple and intuitive.

db1_a: Alias to connect to the dynamic Data Guard service on database "a"

db1_b: Alias to connect to the dynamic Data Guard service on database "b"

db1_a_static: Alias to connect to the static Data Guard service on database "a"

db1_b_static: Alias to connect to the static Data Guard service on database "b"

Example tnsnames.ora entries:

  db1_a= (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-a)(port=1522)) ) (connect_data= (service_name=db1_a.demo.org) (server=dedicated) ) ) db1_b= (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-b)(port=1522)) ) (connect_data= (service_name=db1_b.demo.org) (server=dedicated) ) ) db1_a_static= (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-a)(port=1522)) ) (connect_data= (service_name=db1_a_static.demo.org) (server=dedicated) ) ) db1_b_static= (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-b)(port=1522)) ) (connect_data= (service_name=db1_b_static.demo.org) (server=dedicated) ) ) 

Start the Data Guard listener

Start the Data Guard listener on both "a" and "b" hosts.

lsnrctl start LISTENER_DG

Test the Oracle Net connections

Verify the configuration from both hosts.

  tnsping db1_a Attempting to contact (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-a)(port=1522))) (connect_data= (service_name=db1_a.demo.org) (server=dedicated))) OK (0 msec) 
  tnsping db1_b Attempting to contact (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-b)(port=1522))) (connect_data= (service_name=db1_b.demo.org) (server=dedicated))) OK (0 msec) 

Preparing the Primary

A number of prerequisites must be met on the primary in order to use Fast-Start Failover. This section describes how to configure and verify each prerequisite. To see if your primary has already met a prerequisite, follow the instructions in the Verify section.

Steps that require the primary to be in a mounted (not open) state are grouped together in the section below entitled Steps Requiring a Bounce of the Primary.

Add the Data Guard listener to local_listeners parameter

To allow the database to register with the Data Guard listener, the listener endpoint must be added to the database's local_listener parameter. If local_listener is already in use, add the Data Guard listener to the list. After setting local_listener, register the database with the listener and verify the services have been registered.

Note: You can also use TNS aliases defined in the tnsnames.ora file when setting the local_listener parameter. This is particularly useful when registering with multiple listeners where the parameter value would otherwise exceed the 255 character limit.

Set

  alter system set local_listener='(address=(host=dbhost-a)(port=1522)(protocol=tcp))'; alter system register; 

Verify

  show parameter local_listener NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ local_listener string (address=(host=dbhost-a)(port= 1522)(protocol=tcp)) 
  lsnrctl status listener_dg ... Services Summary... Service "DB1_A" has 1 instance(s). Instance "db1", status READY, has 1 handler(s) for this service... Service "DB1_A_XPT" has 1 instance(s). Instance "db1", status READY, has 1 handler(s) for this service... Service "db1XDB" has 1 instance(s). Instance "db1", status READY, has 1 handler(s) for this service... Service "db1_a_dgmgrl.demo.org" has 1 instance(s). Instance "db1", status UNKNOWN, has 1 handler(s) for this service... Service "db1_a_static.demo.org" has 1 instance(s). Instance "db1", status UNKNOWN, has 1 handler(s) for this service... The command completed successfully 

Enable Force Logging

All Data Guard environments should enable force logging at the database level in order to guard against nologging tablespaces from being added.

Enable

alter database force logging;

Verify

  select force_logging from v$database; FOR --- YES 

Create an spfile

Broker changes database parameters during startup and role transitions via ALTER SYSTEM commands. An spfile is required to persist these changes.

Create

  create spfile='?/dbs/spfile${ORACLE_SID}.ora' from pfile='?/dbs/init${ORACLE_SID}.ora'; alter system set spfile='?/dbs/spfiledb1.ora'; 

Verify

show parameter spfile;

  NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ spfile string ?/dbs/spfiledb1.ora 

Create a password file

All Data Guard environments require the use of a password file in order to allow the databases to connect to each other.

Create

orapwd file=$ORACLE_HOME/dbs/orapw$ORACLE_SID

Verify

  select * from v$pwfile_users; USERNAME SYSDB SYSOP SYSAS ------------------------------ ----- ----- ----- SYS TRUE TRUE FALSE 

Enable remote login

Remote login is required, along with a password file, to allow the databases in a Data Guard configuration to connect to each other.

Enable

alter system set remote_login_passwordfile=exclusive scope=spfile; 

Verify

  show parameter remote_login_passwordfile NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ remote_login_passwordfile string EXCLUSIVE 

Set db_unique_name

Each database in a Data Guard configuration must have a unique name. This guide uses the naming convention of appending an underscore followed by a letter to the db_name to create the db_unique_name.

Set

  alter system set db_unique_name = db1_a scope=spfile; NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ db_unique_name string db1_a 

Verify

  show parameter db_unique_name NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ db_unique_name string db1_a 

Configure Flash Recovery Area

If you don't already have a Flash Recovery Area (FRA), you will need to create one for Flashback Database. If you already have an FRA, you may need to increase its size in order to accommodate the Flashback Database files. See theFlashback Database section above for information on storage requirements.

Configure

  alter system set db_recovery_file_dest_size = 20g scope=both; alter system set db_recovery_file_dest = '/u01/fra' scope=both; 

Verify

  show parameter db_recovery_file NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ db_recovery_file_dest string /u02/flash_recovery_area db_recovery_file_dest_size big integer 2G 

Enable automatic standby file management

Not a hard requirement, but recommended.

Enable

alter system set standby_file_management=auto;

Verify

  show parameter standby_file_management NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ standby_file_management string AUTO 

Set log_archive_config

This parameter must be set before the primary can be opened in Maximum Availability mode. The remaining Data Guard-related parameters will be set by Broker later in the walkthrough.

Set

alter system set log_archive_config='DG_CONFIG=(db1_b)' scope=both; 

Verify

  show parameter log_archive_config NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ log_archive_config string DG_CONFIG=(db1_b) 

Create standby redo logs

In order to accommodate all load conditions, Oracle recommends having at least one more SRL group than the number of ORL groups of the same size. This walkthrough assumes that all ORLs and SRLs on the primary and standby databases are the same size.

Create

Determine the number and size of the Online Redo Log files (ORLs)

  select bytes, count(group#) from v$log group by bytes; BYTES COUNT(GROUP#) ---------- ------------- 52428800 3 

Find the highest group#

  select max(group#) from v$log; MAX(GROUP#) ----------- 3 

Add the SRLs. Unlike ORLs, SRLs should be created with only one member per group. There is no need to multiplex SRLs in order to protect redo as with ORLs (the redo is already protected in the ORLs of the primary). Multiplexing SRLs merely adds unnecessary IO and can increase commit latency. For systems with multiple RAID controllers, consider creating SRLs such that their IO is balanced across the controllers.

In this example, there are 3 ORLs with a max group# of 3. We will create 4 SRLs starting with group# 11. Starting with 11 is purely cosmetic - it allows new ORL groups to be added later while keeping their group# in the same sequence as the existing ORLs.

  alter database add standby logfile group 11 '/u02/oradata/db1/stby-t01-g11-m1.log' size 52428800; alter database add standby logfile group 12 '/u02/oradata/db1/stby-t01-g12-m1.log' size 52428800; alter database add standby logfile group 13 '/u02/oradata/db1/stby-t01-g13-m1.log' size 52428800; alter database add standby logfile group 14 '/u02/oradata/db1/stby-t01-g14-m1.log' size 52428800; 

Verify

  select group#, type, member from v$logfile where type = 'STANDBY'; GROUP# TYPE MEMBER ---------- ------- ---------------------------------------- 11 STANDBY /u02/oradata/db1/stby-t01-g11-m1.log 12 STANDBY /u02/oradata/db1/stby-t01-g12-m1.log 13 STANDBY /u02/oradata/db1/stby-t01-g13-m1.log 14 STANDBY /u02/oradata/db1/stby-t01-g14-m1.log 

Steps Requiring a Bounce of the Primary

The following steps all require the database to be in a mounted (not open) state. They can all be done at the same time in a single bounce.

Enable archivelog mode

Enable

alter database archivelog;

Verify

  select log_mode from v$database; LOG_MODE ------------ ARCHIVELOG 

Enable Flashback Database

Enable

  alter database flashback on; alter system set db_flashback_retention_target = 60 scope=both; 

Verify

  select flashback_on from v$database; FLASHBACK_ON ------------------ YES show parameter db_flashback_retention_target NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ db_flashback_retention_target integer 60 

Enable Maximum Availability Mode

As mentioned above, Maximum Availability mode is mandatory for Oracle Database 10g and optional for Oracle Database 11g.

Enable

alter database set standby database to maximize availability;

Verify

  select protection_mode from v$database; PROTECTION_MODE -------------------- MAXIMUM AVAILABILITY 

Create the Standby Database

If you don't already have a standby database, use your favorite method to create one. The example below takes advantage of the 11g RMAN Active Database Duplication feature. This feature enables RMAN to duplicate an existing database over the network without requiring a backup to disk or tape. The following assumes that the standby host has been setup according to Oracle's recommendations and that the operating system, accounts, security, resource limits, directory structure, etc. are configured correctly.

Copy the password file from the primary to the standby

In Oracle Database 11g, the password file on the standby must be a physical copy of the password file on the primary due to security enhancements introduced in Oracle Database 11g. Oracle Database 10g allows a different password file to be used as long as the SYS passwords are the same on the primary and standby.

Create oratab entry

Add an entry to the oratab file for the standby

db1:/u01/app/oracle/product/11.1.0/db_1:Y

Create an init.ora file

For the RMAN duplicate active database method, the init.ora file (initdb1.ora in the example) requires only one parameter: db_name (it doesn't even have to be the real name of the database - any name will do). RMAN will copy the spfile from the primary, so this init.ora file is only needed during the first phase of the duplication.

db_name = db1

Set environment

Make sure that your OS environment on the standby is setup. Use the oraenv script provided by Oracle.

Startup nomount the standby

startup nomount

Create the standby with RMAN

Run the RMAN utility and connect to the target (primary) and auxiliary (new standby).

  rman target sys/password@db1_a_static auxiliary sys/password@db1_b_static connected to target database: DB1 (DBID=1234567890) connected to auxiliary database: X (not mounted) 

Set the default device type to disk

CONFIGURE DEFAULT DEVICE TYPE TO DISK;

Duplicate the database from the active primary

The procedure for using RMAN to create a standby database is fully explained in Appendix F of Oracle Oracle Data Guard Concepts and Administration document (10g Rel 2 and 11g Rel 1). The example uses the FROM ACTIVE DATABASE clause introduced in 11g that allows RMAN to create a standby database by copying the primary across the network without the need to store the backup files on disk or tape. RMAN also copies the spfile and password files and you can change the values for individual parameters. At a minimum, you must set db_unique_name. The example assumes that the standby uses the same directory structure as the primary.

Note: If you have just enabled archivelog mode, force an archive log creation ( alter system archive log current) to ensure that at least one archive log exists. Otherwise, the DUPLICATE TARGET DATABASE command will fail with "RMAN-20208: UNTIL CHANGE is before RESETLOGS change".

  DUPLICATE TARGET DATABASE FOR STANDBY FROM ACTIVE DATABASE DORECOVER SPFILE SET db_unique_name='db1_b' SET log_archive_config='' SET log_file_name_convert= ' ',' ' SET local_listener='(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=dbhost-b)(PORT=1522)))' NOFILENAMECHECK; 

Enable Flashback Database on the standby

Enable

  alter database flashback on; alter system set db_flashback_retention_target = 60 scope=both; 

Verify

  select flashback_on from v$database; FLASHBACK_ON ------------------ YES show parameter db_flashback_retention_target NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ db_flashback_retention_target integer 60 

Set the location of the Data Guard Broker configuration files

Broker stores it configuration information in a mirrored set of files outside the database. By default, both files are stored in $ORACLE_HOME/dbs. To protect the files, it's good practice to store them in separate filesystems.

Set (primary and standby)

  alter system set dg_broker_config_file1='/u01/app/oracle/admin/db1/dgbroker/dg1db1.dat'; alter system set dg_broker_config_file2='/u02/app/oracle/admin/db1/dgbroker/dg2db1.dat'; 

Verify

  show parameter dg_broker_config_file NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ dg_broker_config_file1 string /u01/app/oracle/admin/db1/dgbr oker/dg1db1.dat dg_broker_config_file2 string /u02/app/oracle/admin/db1/dgbr oker/dg2db1.dat 

Enable Data Guard Broker

Enable (primary and standby)

alter system set dg_broker_start=true;

Verify

  show parameter dg_broker_start NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ dg_broker_start boolean TRUE 

Configure Broker

Start the dgmgrl utility and connect as SYS on the primary

dgmgrl sys/password@db1_a

Create the Broker Configuration

Add the primary

  create configuration 'FSF' as primary database is db1_a connect identifier is db1_a; Configuration "FSF" created with primary database "db1_a" 

Add the standby

  add database db1_b as connect identifier is db1_b maintained as physical; Database "db1_b" added 

Verify the configuration

  show configuration Configuration Name: FSF Enabled: NO Protection Mode: MaxAvailability Databases: db1_a - Primary database db1_b - Physical standby database Fast-Start Failover: DISABLED Current status for "FSF": DISABLED 

Edit Database Properties

LogXptMode

Broker will set the primary to use asynchronous log transport by default. For Maximum Availability environments, change this to synchronous.

NetTimeout

The NetTimeout property specifies the number of seconds LGWR will block waiting for acknowledgment from the standby in synchronous mode before considering the connection lost (corresponds to the NET_TIMEOUT option of log_archive_dest_n). The default value is 30 seconds. When using Maximum Availability mode, consider lowering this to reduce the time commits block when the standby becomes unavailable. Choose a value high enough to avoid false disconnects from intermittent network trouble. The example uses 10 seconds.

ObserverConnectIdentifier (11g and later)

Oracle Database 11g adds the ObserverConnectIdentifier database property to the Broker configuration, allowing you to specify a connect identifier for the observer to use for monitoring the primary and failover target. By default, the observer uses the same connect identifiers used by Data Guard for redo transfer and information exchange between the primary and standby ( DGConnectIdentifier in Oracle Database 11g, InitialConnectIdentifier in Oracle Database 10g). ObserverConnectIdentifier allows you to specify different connect identifiers for the observer to use. You might, for instance, use this to allow the observer to monitor the databases using the same connect identifiers as the client applications.

We'll leave the other properties at their default values for the walkthrough, but you should become familiar with all of the Broker config and database properties. Chapter 9 of the Data Guard Broker documentation (10g and 11g) contains a description of each property. Some properties have changed between those releases.

Note: Many of the Broker database properties correspond to database spfile parameters. Broker maintains these parameters by issuing ALTER SYSTEM commands as appropriate during role transitions, database startup/shutdown, and other events. If these parameters are modified outside of Broker, it raises a warning. To see the specific parameter, use the "show database ... StatusReport" command.

  edit database db1_a set property LogXptMode='SYNC'; edit database db1_a set property NetTimeout=10; edit database db1_b set property NetTimeout=10; 

Enable the configuration

Broker will validate the configuration, set parameters on both databases, and start managed recovery. This may take a few minutes. Tailing the alert logs on the primary and standby is a good way to watch Broker in action and get familiar with how it performs various tasks.

enable configuration;

Verify the configuration

Make sure everything is working before moving on.

  show configuration Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_a - Primary database db1_b - Physical standby database Fast-Start Failover: DISABLED Current status for "FSF": SUCCESS 

Enable Fast-Start Failover

This is a good time to enable FSFO to make sure that all of the prerequisites have been met. Broker will verify that the configuration meets all prerequisites before enabling FSFO and will report any problems it finds. The most common problems are mismatched Data Guard protection modes and LogXptMode properties and forgetting to enable Flashback Database on the primary or standby.

Note that enabling FSFO does not make the configuration ready for automatic failover - that requires an observer, which we'll get to next.

  enable fast_start failover; Enabled. 

With FSFO enabled, Broker expects to find an observer, which we haven't started yet, so if you verify the at this point with 'show configuration', Broker will report a warning (if it doesn't, give it a minute to discover that the observer isn't there).

  show configuration Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_a - Primary database db1_b - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Current status for "FSF": Warning: ORA-16608: one or more databases have warnings 

Running a StatusReport on the primary should verify that the error is due to a missing observer.

  show database db1_a statusreport STATUS REPORT INSTANCE_NAME SEVERITY ERROR_TEXT * ERROR ORA-16819: fast-start failover observer not started 

Configure the Observer

To maximize the benefits of FSFO, the observer should run on a different host than the primary and standby databases. Ideally the primary, standby, and observer will be in geographically separate areas. The observer is very lightweight, requiring few system resources. Unlike the primary / standby interconnect, where bandwidth and latency are determining performance factors, the observer requires very little network bandwidth and is not overly latency sensitive, allowing the it to be placed practically anywhere a reliable connection is available.

Since the observer is a specialized instance of a dgmgrl session, the observer host should be installed with either the Oracle Client Administrator software or the full Oracle Database software stack.

Verify connectivity to the primary and standby

  tnsping db1_a Attempting to contact (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-a)(port=1522))) (connect_data= (service_name=db1_a.demo.org) (server=dedicated))) OK (0 msec) tnsping db1_b Attempting to contact (description= (SDU=32767) (address_list= (address=(protocol=tcp)(host=dbhost-b)(port=1522))) (connect_data= (service_name=db1_b.demo.org) (server=dedicated))) OK (0 msec) 

Start the observer

Start the observer by running dgmgrl and logging in using SYS credentials. We'll start it interactively for now to verify that everything's working. Notice that the terminal session appears to hang after starting the observer. This is normal. The Appendix provides information oncreating a simple wrapper script to start the observer as a background process.

  dgmgrl sys/password@db1_a start observer; observer started 

The terminal session will appear to hang at this point.

Verify the configuration

In a separate terminal session, verify the configuration. This example shows the verbose mode of the 'show configuration' command that provides FSFO-specific information. If the status is SUCCESS, you're ready to start testing role transitions.

  dgmgrl sys/password@db1_a show configuration verbose Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_a - Primary database db1_b - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Threshold: 30 seconds Target: db1_b Observer: observer.demo.org Lag Limit: 30 seconds (not in use) Shutdown Primary: TRUE Auto-reinstate: TRUE Current status for "FSF": SUCCESS 

Use the 'show fast_start failover' command to see which user configurable FSFO failover conditions are in effect.

  show fast_start failover; Fast-Start Failover: ENABLED Threshold: 30 seconds Target: db1_b Observer: observer.demo.org Lag Limit: 30 seconds (not in use) Shutdown Primary: TRUE Auto-reinstate: TRUE Configurable Failover Conditions Health Conditions: Corrupted Controlfile YES Corrupted Dictionary YES Inaccessible Logfile NO Stuck Archiver NO Datafile Offline YES Oracle Error Conditions: (none) 

Testing the Configuration

The real test of the configuration is a successful role transition in both directions with both switchover and FSFO failover. We'll start with switchovers.

Test switchover in both directions

In order to fully automate switchover, Broker needs SYSDBA credentials in order to restart one or both databases. Without the credentials, Broker will complete the role transition, but will leave the databases in need of a manual restart.

  dgmgrl sys/password@db1_a switchover to db1_b Performing switchover NOW, please wait... New primary database "db1_b" is opening... Operation requires shutdown of instance "db1" on database "db1_a" Shutting down instance "db1"... ORA-01109: database not open Database dismounted. ORACLE instance shut down. Operation requires startup of instance "db1" on database "db1_a" Starting instance "db1"... ORACLE instance started. Database mounted. Switchover succeeded, new primary is "db1_b" 
  show configuration Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_b - Primary database db1_a - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Current status for "FSF": SUCCESS 

Now let's test switchover in the other direction.

  switchover to db1_a Performing switchover NOW, please wait... New primary database "db1_a" is opening... Operation requires shutdown of instance "db1" on database "db1_b" Shutting down instance "db1"... ORA-01109: database not open Database dismounted. ORACLE instance shut down. Operation requires startup of instance "db1" on database "db1_b" Starting instance "db1"... ORACLE instance started. Database mounted. Switchover succeeded, new primary is "db1_a" 
  show configuration Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_a - Primary database db1_b - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Current status for "FSF": SUCCESS 

Test FSFO failover in both directions

Now that we know switchovers work, it's time to test failovers. Testing FSFO failover requires simulating loss of the primary. The simplest way to do this is to abort the primary. This will signal the observer to initiate failover after the FSFO threshold timeout has been reached (default is 30 seconds). It is instructive to watch the alert logs on both databases as well as the observer log after aborting the primary to gain insight into what happens during FSFO failover.

It is also possible to initiate failover using the dgmgrl failover command. This exercises the configuration, but triggers failover differently than losing contact with the primary.

Check Flashback Database retention

We want the observer to be able to automatically reinstate the former primary as a standby after our failover tests, so before each test, make sure that Flashback Database has at least 30 minutes of history. Do this prior to every failover test. If Flashback Database history is insufficient, the observer will not be able to reinstate and you will have to manually reinstate from backup or by primary duplication.

On the primary to be aborted:

  select (sysdate - oldest_flashback_time)*24*60 as history from v$flashback_database_log; HISTORY ---------- 140.35 

Don't initiate failover unless at least 30 minutes of history is available.

Create some test data

It wouldn't be much of a test if we didn't verify that our durability constraints were being met, so let's make a change on the primary and see if it survives the failover. This walkthrough uses Maximum Availability mode to achieve "zero data loss".

Log in as a test user and make some changes that won't impact other parts of the system.

  -- Note that DDL statements automatically commit create table x as select * from all_objects; Table created. select count(*) from x; COUNT(*) ---------- 68855 

Initiate FSFO failover

Abort the primary.

shutdown abort

observer log:

  Initiating Fast-Start Failover to database "db1_b"... Performing failover NOW, please wait... Failover succeeded, new primary is "db1_b" 

Examine the Broker configuration by logging into dgmgrl on the new primary. Notice that the former primary is now disabled.

  dgmgrl sys/password@db1_b show configuration Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_b - Primary database db1_a - Physical standby database (disabled) - Fast-Start Failover target Fast-Start Failover: ENABLED Current status for "FSF": Warning: ORA-16608: one or more databases have warnings 

Check the test data

Log into the new primary and verify that the changes made it across.

  select count(*) from x; COUNT(*) ---------- 68855 

Reinstate the aborted primary as a standby

Initiate reinstatement by mounting the database. Note that the database will not open at this point. A database in the primary role will not open until it has verified with the observer that it is still the primary. If the observer finds that the database is no longer the primary, it will attempt to reinstate it as the failover target standby.

The first step in reinstatement is to flash the database back to the SCN where the standby became the primary (v$database.standby_became_primary_scn on the new primary). As described in theFlashback Database section, Flashback Database takes place in two stages: a restore stage and a media recovery stage. In the restore stage, Flashback Database restores the database to a point prior to the standby_became_primary_scn using the before-image blocks in the Flashback Database logs. In the media recovery phase, Flashback Database applies redo to bring the database up to the standby_became_primary_scn. In order for Flashback Database to succeed, there must be sufficient history available in the Flashback Database logs and all of the redo generated between the restore point and the standby_became_primary_scn must be available. If Flashback Database fails, automatic reinstatement stops and you will have to perform a manual SCN-based recovery to the standby_became_primary_scn and complete the reinstatement.

Once Flashback Database has succeeded, the observer will convert the database to a standby, bounce it, and begin apply services.

startup mount

observer log:

  Initiating reinstatement for database "db1_a"... Reinstating database "db1_a", please wait... Operation requires shutdown of instance "db1" on database "db1_a" Shutting down instance "db1"... ORA-01109: database not open Database dismounted. ORACLE instance shut down. Operation requires startup of instance "db1" on database "db1_a" Starting instance "db1"... ORACLE instance started. Database mounted. Continuing to reinstate database "db1_a" ... Reinstatement of database "db1_a" succeeded 

dgmgrl status:

  Configuration Name: FSF Enabled: YES Protection Mode: MaxAvailability Databases: db1_b - Primary database db1_a - Physical standby database - Fast-Start Failover target Fast-Start Failover: ENABLED Current status for "FSF": SUCCESS 

Repeat in the other direction

Now test FSFO failover back to the original primary. Make some new changes and verify that they are preserved after failover. Remember to check Flashback Database history before aborting the primary.

Stalling the Primary

Another good test is to simulate network failures that leave the primary up, but isolated from the failover target standby and the observer. When a primary loses contact with both the failover target and the observer simultaneously, it enters a "stalled" state (v$database.fs_failover_status = 'STALLED') and any sessions still connected to the primary will block on commit. Queries and DML will continue to run - only sessions that commit will block. This prevents a "split brain" condition if a failover occurs since none of the changes made to the isolated primary can be made permanent. Oracle Database 10g databases running versions prior to 10.2.0.4 will remain in a stalled state until aborted or signaled to remain the primary by the observer once connectivity has been restored. Starting with 10.2.0.4 (including all versions of 11g and later), Oracle provides the FastStartFailoverPmyShutdown Broker property that allows you to specify what the primary should do if it is still in a stalled state when the FSFO threshold timeout has elapsed. Setting this property to 'TRUE' (the default) causes the primary to self-terminate. Setting it to 'FALSE' leaves the database open and stalled until it is terminated or signaled to proceed in the event a failover did not take place (e.g. the observer was killed after the stall began, but before the failover timeout had elapsed).

The simple tests described in this guide are fine for making sure the basics are working, but you'll probably want to develop a more comprehensive set of tests suited to your environment and requirements.

Monitoring FSFO Readiness

There's a big difference between a system that's FSFO enabled and one that's FSFO ready. Being FSFO ready means that all conditions are met for a successful failover, including having a running observer and sufficient redo transmitted to the failover target to meet durability requirements. This section describes how to stay on top of your FSFO environments.

Ask Broker

Use Broker's "show configuration" command to determine FSFO status and the "show database <db_unique_name> statusreport" command to drill down for details if Broker reports a problem.

Ask the primary

The v$database view has has columns specifically for monitoring FSFO status. The values that indicate FSFO is ready for failover are listed below. See the Oracle Reference and Data Guard Administrator guides for your release for details.

  fs_failover_status = 'SYNCHRONIZED' for MaxAvail 'TARGET UNDER LAG LIMIT' for MaxPerf fs_failover_observer_present = 'YES' 

Standby apply

Nothing will ruin your day faster than finding out that the standby the observer just failed over to is 12 hours behind in applying redo. The observer is perfectly satisfied if all of the redo it needs to meet your durability requirements has been received by the failover target. It doesn't consider how much of that redo has been applied. Have a means of notifying someone if standby apply falls too far behind.

Flashback Database history

Slightly less critical than making sure you've got a good primary is making sure the failed primary can be automatically reinstated. Monitoring flashback database history and reacting when it drops below 30 minutes will save you time and improve availability. It will also alert you to databases that have had Flashback Database disabled at some point after FSFO was enabled. Broker checks to see if Flashback Database is enabled on the primary and failover targets when FSFO is enabled. After FSFO is enabled, Broker will continue to check that Flashback Database is enabled during health checks. If it detects that Flashback Database was disabled, either manually, or automatically because Flashback Database discovered a problem, Broker signals "ORA-16827: Flashback Database is disabled".

Client Notification

Failing over the database won't do much good if applications and other database clients don't know where the primary went. If clients are already configured to automatically time out and reconnect if they don't get a response from the database, a simple but effective approach is to use a network alias (e.g. DNS CNAME) that always resolves to the primary. After a role change, the naming service can be updated with the new primary's address. A trigger on the DB_ROLE_CHANGE system event can be used to update the naming service and, with the proper client cache TTL settings, clients can connect to the new primary very quickly.

Oracle also provides Fast Application Notification (FAN) for OCI clients and Fast Connect Failover for JDBC clients. These facilities allow applications written to take advantage of them to receive asynchronous notification of database events, including role transitions.

Appendix

Creating an observer wrapper

This section will help you get started with creating a wrapper script to automatically start and restart the FSFO observer. Use the wrapper script to start the observer process when the observer host boots or to restart it if it dies. Writing the wrapper itself and the means to determine when to execute it are up to you.

Create a wallet

While not strictly required, creating a wallet provides a secure way to store the credentials needed to automatically connect to the primary when starting the observer. The "Configuring Authentication" chapter of the Oracle Database Security Guide provides detailed instructions for creating a wallet.

A simple example for *nix is provided below that will work with both releases. The Oracle Database 10g FSFO observer is limited to using the default username and password defined in the wallet. In 10g, a single wallet can be used for multiple observers, but they must all use the same SYS password. The Oracle Database 11g observer can make use of specific credentials, allowing the same wallet to be used for multiple observers with different SYS passwords.

Create a directory to store the wallet.

mkdir -p /u01/app/oracle/admin/wallet

Create a wallet and set the default username and password to the database's SYSDBA credentials (usually SYS).

  mkstore -wrl /u01/app/oracle/admin/wallet -createEntry oracle.security.client.default_username SYS mkstore -wrl /u01/app/oracle/admin/wallet -createEntry oracle.security.client.default_password <sys password> 

Add the wallet location and override to sqlnet.ora.

  WALLET_LOCATION = (SOURCE = (METHOD = FILE) (METHOD_DATA = (DIRECTORY = /u01/app/oracle/admin/wallet) ) ) SQLNET.WALLET_OVERRIDE = TRUE 

Determine where to store the observer state file (fsfo.dat file)

The observer maintains state information in a file. By default the file is named fsfo.dat and is created in the working directory where the observer is started. The state file is locked when the observer is running to prevent multiple observers from using the same file. It's generally a good idea to store the state file in a directory associated with the database to avoid locking issues when running multiple observers on the same host.

mkdir -p /u01/app/oracle/admin/db1/observer

Observer startup command

Here's a one-liner observer startup for *nix. Note the use of "/@<tns_alias>" to login using the wallet.

  nohup dgmgrl /@db1 "start observer file='/u01/app/oracle/admin/db1/observer/fsfo.dat'" \ >> /u01/app/oracle/admin/db1/observer/dgmgrl.log & 

Useful things to know about the FSFO observer

  • All database connections associated with the observer, including the initial connection, must use dedicated server connections. Using Shared Server (MTS) or connection pooling can result in unpredictable behavior.
  • For reliable startup, the initial connection should always be made to the primary. A running observer will follow the primary automatically after a role transition, but a newly (re)started observer won't start if the initial connection is to a down database or one with an out of date or corrupted Broker config file.
  • Startup can fail with "ORA-16647: could not start more than one observer" even when no observer is actually running if the previous observer process terminated without deregistering itself and the new observer isn't using the previous fsfo.dat file. If this occurs, run 'stop observer' and try again.
  • It's a good idea to have at least two hosts configured to run observers so that one can take over if the other fails.

Failover script

Create a script to automate FSFO failover initiation and use it as your standard method for standby flips. This not only saves time and minimizes problems by automating an otherwise manual process, it exercises your failover and DR procedures with every flip so that you know the FSFO configuration is sound and, in a real emergency, everyone knows what to do. Failovers become routine. In fact, failovers are so reliable, fast, and simple that switchovers become the exception rather than the rule. Just be sure to include a Flashback Database history check in the script to provide an option to abort if a failover would require a manual reinstate.

DB_ROLE_CHANGE system event

The DB_ROLE_CHANGE event will fire whenever a database is opened for the first time after a role transition. Create a trigger on this event to perform actions specific to your environment after a switchover or failover, such as updating the name resolution service to point to the new primary. Keep this trigger as simple and reliable as possible, limiting it to only what is absolutely necessary at the moment of role transition, since any failures at this point may affect availability. If there are many actions that need to take place, put them in a separate script and use the trigger to run the script in an orphan process or thread independent of the database.

FSFO configurations with multiple standbys

Bystanders

All standbys other than the failover target are considered bystanders (v$database.fs_failover_status = 'BYSTANDER'). Bystanders are part of the Data Guard configuration, but not part of the FSFO configuration. Only two databases, the primary and the failover target, can be in the FSFO configuration at any given time.

During failover, bystanders "follow" the primary by default, flashing back and reapplying redo from the new primary as necessary. (Yes, bystanders need Flashback Database too).

After a failover, a bystander will not automatically become the new failover target. Unless action is taken to change the failover target to one of the bystanders, the new primary will be without a failover target until the former primary is reinstated as a standby. If the Broker configuration is changed to make a bystander the new failover target (probably a good idea if the failed database will be down for a while), the observer will not automatically reinstate the former primary because it is no longer part of the FSFO configuration. Reinstatement will have to be accomplished by other means (manual or scripted Broker commands).

Switchover restrictions

FSFO enabled configurations having multiple standbys cannot switchover to a standby that is not the failover target. To switchover to a standby that is not the current failover target:

  1. Disable FSFO
  2. Change the failover target to the standby being switched over to
  3. Enable FSFO
  4. Switchover

or

  1. Disable FSFO
  2. Switchover
  3. Change the failover target to the desired standby
  4. Enable FSFO

John Smiley [jrsmiley@gmail.com] is a persistent storage architect for a major online retailer.