|
Data Guard Redo Shipping
Oracle9i
Data Guard - Not Your Father's Log Shipping
Joe Meeks, Server Technologies, Oracle Corporation
Larry Carpenter, Server Technologies, Oracle Corporation
Introduction
Oracle9i Data Guard includes
many features that automate the maintenance of a standby database and enable
zero data loss disaster recovery configurations. In Oracle9i,
fundamental to achieving enhanced
data protection, the Oracle8i-based
approach to "log shipping" is replaced with the concept of "redo shipping". Beginning with Oracle9i
(and likewise in all subsequent releases, including Oracle Database 10g),
Data Guard makes it possible to ship redo to a standby database,
either synchronously or asynchronously, as quickly as transactions are committed
using the Log Writer (LGWR) process.
With the conclusion of regular support for Oracle8i
quickly approaching, many sites are actively planning their upgrades to Oracle9i
and Data Guard. This Tech-Tip helps explain the benefits of using LGWR, and
provides a simple method to evaluate network performance of a Data Guard configuration
under a production workload.
Oracle8i
Log Shipping
Lets quickly review Oracle8i
to understand the difference between log shipping and redo shipping. Many Oracle8i
sites use Oracle
Managed Standby, Oracle8i
Data Guard (the predecessors to Oracle9i
Data Guard), or home grown scripts to ship archive logs from the production
database (primary) to a remote standby database for disaster recovery purposes.
When a log switch occurs on the primary, it triggers an Archiver (ARCH) process to archive
the contents of the online redo log to a local archive log. The ARCH process
simultaneously archives the contents to a remote destination on the standby
server. An RFS (remote file server) process on the standby server receives the
redo and writes it to an archive log. When transmission of the log is complete,
it is registered in the standby control file and managed recovery applies the
redo to the standby database. This process is highlighted in the area shaded
in beige, in the figure below.
Log shipping using the ARCH process has the disadvantage
of much greater risk of data loss when compared to LGWR. Under the best circumstances,
ARCH-based log shipping will result in the loss of all data in any online redo
logs that have not yet been archived should there be a sudden loss of the primary
system. With online redo logs frequently sized from 200MB to 500MB, the amount
of data loss can be significant. To work around this limitation, many Oracle8i
users configure smaller archive logs, or deploy third party remote mirroring
solutions to mirror the primary database online redo logs on the remote standby.
Both workarounds incur additional overhead on the primary server, and/or the
additional cost and complexity of purchasing and supporting third party mirroring
technology.
Oracle9i
Redo Shipping
In contrast, Oracle9i Data Guard uses LGWR to ship redo directly to the remote standby
server as it is being written to the local online redo log. LGWR does not wait
for a log switch to ship redo. Redo can be shipped synchronously (LGWR SYNC), to
achieve zero data loss protection, or asynchronously (LGWR
ASYNC). Asynchronous redo shipping increases the exposure to data loss slightly,
but the exposure is limited by the size of the ASYNC buffer on the primary database
(this parameter is user configurable). The gray highlighted area in the figure
below illustrates the direct route taken by redo shipped to the standby server
using the LGWR process.

Practical Considerations
So this sounds great - all production sites can
benefit from using LGWR for maximum data protection. Yes, but one practical
consideration is that fact that LGWR is more dependent than ARCH on having sufficient
network bandwidth to handle peak processing loads. In LGWR SYNC, the primary
will wait for an acknowledgment from the standby server that it has received
the redo previously sent before the commit is allowed to complete. The advantage
is that this mechanism makes it possible to achieve zero data loss protection.
But doing so requires sufficient network bandwidth and low latency to avoid
impacting the performance of the primary database.
LGWR ASYNC makes this much less of an issue because
the primary does not wait for the network acknowledgment before processing the
next transaction. But it is still possible, usually during peak processing periods,
for the primary database to generate redo faster than the network can ship.
If redo transmission falls too far behind, the LGWR ASYNC buffer becomes full
and Data Guard automatically reverts back to using ARCH based log shipping until
LGWR ASYNC can reconnect to the standby and reinitiate the redo shipping process
at a later time.
Will LGWR Work For You?
The paper titled "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices"
available at http://otn.oracle.com/deploy/availability/htdocs/maa.htm,
is part of Oracle Maximum Availability Architecture (MAA) series of white papers,
and provides a useful framework to show the correlation between the peak redo
rate and the required bandwidth (within this paper refer to Appendix F: Network Throughput and
Peak Redo Rates).
But even when armed with the above information,
it is often difficult for DBAs to predict network performance under a peak production
workload. Fortunately, the ability to dynamically configure many characteristics
of a Data Guard configuration allows DBAs to incrementally "ratchet"
up redo transport in a stepwise fashion under production workload, without impacting
application availability.
Five Easy Steps
The procedure below assumes you have followed the
Oracle9i
Data Guard Documentation and have correctly configured other
aspects of your Data Guard configuration that are not addressed by this note.
The procedure also assumes that all reasonable effort has been made in a test
environment to simulate production workload and network performance, and that
you have followed the guidelines included in the above-mentioned Network
Best Practices paper. Notwithstanding these prior efforts, it may still
be difficult to know for sure how the network will impact the performance of
your Data Guard configuration. Below are five steps that eliminate all guesswork.
All parameter changes described can be made dynamically, without risk of impacting
application availability.
Step 1
Begin by configuring Data Guard in MAXIMUM PERFORMANCE
Mode (the default protection mode) using the ARCH process:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS ARCH';
This mode uses the equivalent log shipping mechanism
used by Oracle8i. Run
in this mode through a complete processing cycle that includes time periods
with peak redo generation rates. Follow the guidelines in the Data Guard documentation
and the Network
Best Practices white paper to be insure optimal tuning. Use Statspack
to obtain a performance baseline from which the impact of the following configuration
changes can be compared.
Step 2
Assuming success with step 1, upgrade redo transport
services from ARCH to LGWR ASYNC using the following command.
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS LGWR ASYNC';
At the next log switch, LGWR will begin shipping
redo asynchronously to the standby server. From this point forward, as long
as the LGWR ASYNC buffer (initially set at its default value of 1MB) does not
become full and as long as the primary can sustain a network connection with
the standby, redo is always shipped as quickly as the network will allow.
Users have employed different strategies during
this step in the process. One common approach is to initially run in LGWR ASYNC
during less active periods of the day, and gradually increase the duration of
the time period run in LGWR ASYNC until it includes peak periods as well. The
ability to switch between ARCH and LGWR ASYNC is completely dynamic. If there
is ever a need to revert to ARCH, simply use the command in #1 above.
Whichever strategy you use, be sure to monitor
two aspects of performance. First, use Statspack
to confirm there is little or no performance impact on the primary server.
Second, monitor V$DATAGUARD_STATUS on the primary
database looking for error code of 16166. This error code denotes that LGWR
had problems sending redo to the standby server. The 16166 error definition
is as follows.
ORA-16166 LGWR
network server failed to send remote message
Cause: The LGWR network server could not communicate with the remote
host.
Action: The alert log contains more problem-specific information.
An example query is:
SQL> alter session set nls_date_format = 'MON-DD-YYYY
HH24:MI:SS';
Session altered.
SQL> column MESSAGE format a50
SQL> select timestamp, message from v$dataguard_status where error_code=16166;
TIMESTAMP MESSAGE
-------------------- --------------------------------------------------
JUN-10-2004 13:14:04 LGWR: I/O error 16166 archiving log 2 to 'Primary'
JUN-10-2004 13:19:22 LGWR: I/O error 16166 archiving log 2 to 'Primary'
The above query shows there were two instances
where LGWR ASYNC failed. More information on each of these events is available
in the alert log of the primary database. For example, in the alert log detail
for the second failure is:
Thu Jun 10 13:19:22 2004
Timing out on NetServer 0 prod=80,cons=8,threshold=64
ORA-16166: LGWR timed out on Network Server 0 due to buffer full condition.
No action is required since the log file transfer will be attempted via ARCH
Thu Jun 10 13:19:22 2004
Errors in file /private1/oracle/OraHome92/admin/Physical/bdump/physical_lgwr_3597.trc:
ORA-16166: LGWR network server failed to send remote message
LGWR: I/O error 16166 archiving log 2 to 'Primary'
Thu Jun 10 13:19:22 2004
Errors in file /private1/oracle/OraHome92/admin/Physical/bdump/physical_lgwr_3597.trc:
ORA-16166: LGWR network server failed to send remote message
This alert log message is typical of ORA-16166
errors for Data Guard configurations using LGWR ASYNC. Note from the information
in the alert log that: LGWR timed out on Network Server 0 due to buffer full
condition. The primary database generated redo faster than LGWR could ship
until finally reaching the point where the ASYNC buffer became full and LGWR
abandoned shipping redo to the standby.
The default setting for the LGWR ASYNC
buffer is 1MB. The buffer size can be increased to a maximum of 10MB (in Oracle
10g this the maximum
value has been increased to 50MB). The increased buffer is only used if it is
absolutely needed. This means that increasing the buffer size has no impact
on how quickly LGWR will ship redo to the standby, it always ships as quickly
as the network will allow. If there is a frequent occurrence of LGWR failing
due to a buffer full condition, consider increasing the size of the ASYNC buffer
by using the following command:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS
LGWR ASYNC=20480';
Also note that the primary server continues processing
transactions in cases where LGWR abandons shipping to the standby server, therefore
no immediate action is required by the DBA. Data Guard automatically resolves
the resulting gap (a situation where there are archive logs on the primary contain
redo that has not been shipped to the standby) using the ARCH process. This
can be seen in the same alert log message above: No action is required since
the log file transfer will be attempted via ARCH. At the next log switch,
LGWR will attempt to reconnect to the standby server. When successful, LGWR
begins shipping the then-current redo stream. When ARCH completes the process
of resolving the gap, the primary and standby servers are back in sync.
Step 3
Lets assume success strikes a second time. You
find that primary performance is unaffected using LGWR ASYNC. You find that
LGWR ASYNC proves capable of utilizing available bandwidth and keeping pace
with the primary redo generation rate. Instances where the ASYNC buffer becomes
full and shipping reverts to ARCH are limited to bursts of peak activity that
occur after hours during batch processing. Generally, during work hours, it
is possible to sustain LGWR ASYNC and benefit from enhanced data protection.
Things have gone so well, it is desirable to test LGWR SYNC and see if zero
data loss protection is feasible given your network environment. Use the following
command to dynamically alter redo transport to use LGWR SYNC:
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS
LGWR SYNC';
Repeat the monitoring described in #2 above.
Step 4
Lets assume yet another successful outcome.
Your evaluation of network performance is complete, and you decide to make LGWR
SYNC the default redo shipping transport by upgrading the Data Guard protection
mode from MAXIMUM PERFORMANCE to MAXIMUM AVAILABILITY. Be sure to review the
Data Guard documentation for a full understanding of MAXIMUM AVAILABILITY protection
mode (for example, archive destination attributes for NET_TIMEOUT and REOPEN).
Upgrade to MAXIMUM AVAILABILITY by using the command, below. Unlike changes
to redo transport in steps 1-3, the change in the protection mode does require
a bounce of the primary database, but it need not be done until the next maintenance
period.
SQL> ALTER DATABASE SET STANDBY TO MAXIMIZE
AVAILABILITY;
Step 5
OK - the world is not perfect. You have proceeded
through the above steps and somewhere along the way, you find that the network
can not keep pace, regardless of how well you tune it. For example, you made
it through step 3, and determined that LGWR ASYNC with a 10MB buffer works well,
but there is an unacceptable impact to primary database performance of using
LGWR SYNC. Simply revert to LGWR ASYNC by repeating the command from step 2.
SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS
LGWR ASYNC=20480';
Likewise, if at any time you desire to revert to
ARCH, repeat the command from step 1:
SQL> ALTER SYSTEM SET
LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS
ARCH';
Conclusion
Oracle9i Data Guard significantly enhances
past mechanisms for maintaining a standby database for disaster recovery. Understanding
and deploying redo shipping using the LGWR process is essential to achieve optimum
data protection. The network's ability to keep pace with Data Guard has become
more important than ever. Fortunately, the ability to dynamically configure
Data Guard's redo transport services affords DBAs the opportunity to know for
certain how their data guard configuration will respond to production workload,
and make adjustments on the fly with zero impact to application availability.
References
-
Oracle
Data Guard Overview
-
Oracle
Data Guard Concepts and Administration
Manual
-
Oracle9i
Database Performance Tuning Guide and Reference Manual - Chap. 21: Using
Statspack
-
MetaLink Note 94224.1 - "FAQ - Statspack Complete Reference"
-
Oracle Maximum
Availability Architecture
-
Oracle9i Data Guard: Primary Site and Network Configuration
Best Practices
-
MetaLink Note 225633.1 - "Implementing SSH port forwarding
with 9i Data Guard"
-
MetaLink
Note 241925.1
- "Troubleshooting 9i Data Guard Network Issues"
Joe Meeks is responsible for managing Customer Programs
with Oracle's Database High Availability Group. His principal focus area is
Oracle Data Guard, which is Oracle's disaster recovery solution for enterprise
data. Joe has 24+ years of experience in the computer industry addressing the
high availability needs of business critical applications in manufacturing,
finance, telecommunications, healthcare, and the public sector. He has a BS
in Environmental Science and an MBA.
Larry Carpenter is a member of the Data Guard development
team in Server Technologies. Larry has 30+ years of experience in the computer
industry, with the last 15 years focused on the high availability requirements
of business critical databases. Larry's expertise is insuring the successful
deployment of Oracle Disaster Recovery Solutions in diverse computing environments
world wide. In addition to English, he speaks fluent Italian, French, and German.
|