Data Guard Redo Shipping

Oracle9i Data Guard - Not Your Father's Log Shipping

Joe Meeks, Server Technologies, Oracle Corporation

Larry Carpenter, Server Technologies, Oracle Corporation


 Introduction

Oracle9i Data Guard includes many features that automate the maintenance of a standby database and enable zero data loss disaster recovery configurations. In Oracle9i, fundamental to achieving enhanced data protection, the Oracle8i-based approach to "log shipping" is replaced with the concept of "redo shipping". Beginning with Oracle9i (and likewise in all subsequent releases, including Oracle Database 10g), Data Guard makes it possible to ship redo to a standby database, either synchronously or asynchronously, as quickly as transactions are committed using the Log Writer (LGWR) process.

With the conclusion of regular support for Oracle8i quickly approaching, many sites are actively planning their upgrades to Oracle9i and Data Guard. This Tech-Tip helps explain the benefits of using LGWR, and provides a simple method to evaluate network performance of a Data Guard configuration under a production workload.

 

Oracle8i Log Shipping

Lets quickly review Oracle8i to understand the difference between log shipping and redo shipping. Many Oracle8i sites use Oracle Managed Standby, Oracle8i Data Guard (the predecessors to Oracle9i Data Guard), or home grown scripts to ship archive logs from the production database (primary) to a remote standby database for disaster recovery purposes. When a log switch occurs on the primary, it triggers an Archiver (ARCH) process to archive the contents of the online redo log to a local archive log. The ARCH process simultaneously archives the contents to a remote destination on the standby server. An RFS (remote file server) process on the standby server receives the redo and writes it to an archive log. When transmission of the log is complete, it is registered in the standby control file and managed recovery applies the redo to the standby database. This process is highlighted in the area shaded in beige, in the figure below.

Log shipping using the ARCH process has the disadvantage of much greater risk of data loss when compared to LGWR. Under the best circumstances, ARCH-based log shipping will result in the loss of all data in any online redo logs that have not yet been archived should there be a sudden loss of the primary system. With online redo logs frequently sized from 200MB to 500MB, the amount of data loss can be significant. To work around this limitation, many Oracle8i users configure smaller archive logs, or deploy third party remote mirroring solutions to mirror the primary database online redo logs on the remote standby. Both workarounds incur additional overhead on the primary server, and/or the additional cost and complexity of purchasing and supporting third party mirroring technology.

 

Oracle9i Redo Shipping

In contrast, Oracle9i Data Guard uses LGWR to ship redo directly to the remote standby server as it is being written to the local online redo log. LGWR does not wait for a log switch to ship redo. Redo can be shipped synchronously (LGWR SYNC), to achieve zero data loss protection, or asynchronously (LGWR ASYNC). Asynchronous redo shipping increases the exposure to data loss slightly, but the exposure is limited by the size of the ASYNC buffer on the primary database (this parameter is user configurable). The gray highlighted area in the figure below illustrates the direct route taken by redo shipped to the standby server using the LGWR process.

 

 

Practical Considerations

So this sounds great - all production sites can benefit from using LGWR for maximum data protection. Yes, but one practical consideration is that fact that LGWR is more dependent than ARCH on having sufficient network bandwidth to handle peak processing loads. In LGWR SYNC, the primary will wait for an acknowledgment from the standby server that it has received the redo previously sent before the commit is allowed to complete. The advantage is that this mechanism makes it possible to achieve zero data loss protection. But doing so requires sufficient network bandwidth and low latency to avoid impacting the performance of the primary database.

LGWR ASYNC makes this much less of an issue because the primary does not wait for the network acknowledgment before processing the next transaction. But it is still possible, usually during peak processing periods, for the primary database to generate redo faster than the network can ship. If redo transmission falls too far behind, the LGWR ASYNC buffer becomes full and Data Guard automatically reverts back to using ARCH based log shipping until LGWR ASYNC can reconnect to the standby and reinitiate the redo shipping process at a later time.

Will LGWR Work For You?

The paper titled "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" available at http://otn.oracle.com/deploy/availability/htdocs/maa.htm, is part of Oracle Maximum Availability Architecture (MAA) series of white papers, and provides a useful framework to show the correlation between the peak redo rate and the required bandwidth (within this paper refer to Appendix F: Network Throughput and Peak Redo Rates).

But even when armed with the above information, it is often difficult for DBAs to predict network performance under a peak production workload. Fortunately, the ability to dynamically configure many characteristics of a Data Guard configuration allows DBAs to incrementally "ratchet" up redo transport in a stepwise fashion under production workload, without impacting application availability.

Five Easy Steps

The procedure below assumes you have followed the Oracle9i Data Guard Documentation and have correctly configured other aspects of your Data Guard configuration that are not addressed by this note. The procedure also assumes that all reasonable effort has been made in a test environment to simulate production workload and network performance, and that you have followed the guidelines included in the above-mentioned Network Best Practices paper. Notwithstanding these prior efforts, it may still be difficult to know for sure how the network will impact the performance of your Data Guard configuration. Below are five steps that eliminate all guesswork. All parameter changes described can be made dynamically, without risk of impacting application availability.

Step 1

Begin by configuring Data Guard in MAXIMUM PERFORMANCE Mode (the default protection mode) using the ARCH process:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS ARCH';

This mode uses the equivalent log shipping mechanism used by Oracle8i. Run in this mode through a complete processing cycle that includes time periods with peak redo generation rates. Follow the guidelines in the Data Guard documentation and the Network Best Practices white paper to be insure optimal tuning. Use Statspack to obtain a performance baseline from which the impact of the following configuration changes can be compared.

Step 2

Assuming success with step 1, upgrade redo transport services from ARCH to LGWR ASYNC using the following command.

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS LGWR ASYNC';

At the next log switch, LGWR will begin shipping redo asynchronously to the standby server. From this point forward, as long as the LGWR ASYNC buffer (initially set at its default value of 1MB) does not become full and as long as the primary can sustain a network connection with the standby, redo is always shipped as quickly as the network will allow.

Users have employed different strategies during this step in the process. One common approach is to initially run in LGWR ASYNC during less active periods of the day, and gradually increase the duration of the time period run in LGWR ASYNC until it includes peak periods as well. The ability to switch between ARCH and LGWR ASYNC is completely dynamic. If there is ever a need to revert to ARCH, simply use the command in #1 above.

Whichever strategy you use, be sure to monitor two aspects of performance. First, use Statspack to confirm there is little or no performance impact on the primary server.

Second, monitor V$DATAGUARD_STATUS on the primary database looking for error code of 16166. This error code denotes that LGWR had problems sending redo to the standby server. The 16166 error definition is as follows.

ORA-16166 LGWR network server failed to send remote message
Cause: The LGWR network server could not communicate with the remote host.
Action: The alert log contains more problem-specific information.

An example query is:

SQL> alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
Session altered.
SQL> column MESSAGE format a50
SQL> select timestamp, message from v$dataguard_status where error_code=16166;

TIMESTAMP            MESSAGE
-------------------- --------------------------------------------------
JUN-10-2004 13:14:04 LGWR: I/O error 16166 archiving log 2 to 'Primary'
JUN-10-2004 13:19:22 LGWR: I/O error 16166 archiving log 2 to 'Primary'

 

The above query shows there were two instances where LGWR ASYNC failed. More information on each of these events is available in the alert log of the primary database. For example, in the alert log detail for the second failure is:

Thu Jun 10 13:19:22 2004
Timing out on NetServer 0 prod=80,cons=8,threshold=64
ORA-16166: LGWR timed out on Network Server 0 due to buffer full condition.
No action is required since the log file transfer will be attempted via ARCH
Thu Jun 10 13:19:22 2004
Errors in file /private1/oracle/OraHome92/admin/Physical/bdump/physical_lgwr_3597.trc:
ORA-16166: LGWR network server failed to send remote message
LGWR: I/O error 16166 archiving log 2 to 'Primary'
Thu Jun 10 13:19:22 2004
Errors in file /private1/oracle/OraHome92/admin/Physical/bdump/physical_lgwr_3597.trc:
ORA-16166: LGWR network server failed to send remote message

This alert log message is typical of ORA-16166 errors for Data Guard configurations using LGWR ASYNC. Note from the information in the alert log that: LGWR timed out on Network Server 0 due to buffer full condition. The primary database generated redo faster than LGWR could ship until finally reaching the point where the ASYNC buffer became full and LGWR abandoned shipping redo to the standby. 

The default setting for the LGWR ASYNC buffer is 1MB. The buffer size can be increased to a maximum of 10MB (in Oracle 10g this the maximum value has been increased to 50MB). The increased buffer is only used if it is absolutely needed. This means that increasing the buffer size has no impact on how quickly LGWR will ship redo to the standby, it always ships as quickly as the network will allow. If there is a frequent occurrence of LGWR failing due to a buffer full condition, consider increasing the size of the ASYNC buffer by using the following command:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS LGWR ASYNC=20480';

Also note that the primary server continues processing transactions in cases where LGWR abandons shipping to the standby server, therefore no immediate action is required by the DBA. Data Guard automatically resolves the resulting gap (a situation where there are archive logs on the primary contain redo that has not been shipped to the standby) using the ARCH process. This can be seen in the same alert log message above: No action is required since the log file transfer will be attempted via ARCH. At the next log switch, LGWR will attempt to reconnect to the standby server. When successful, LGWR begins shipping the then-current redo stream. When ARCH completes the process of resolving the gap, the primary and standby servers are back in sync.

Step 3

Lets assume success strikes a second time. You find that primary performance is unaffected using LGWR ASYNC. You find that LGWR ASYNC proves capable of utilizing available bandwidth and keeping pace with the primary redo generation rate. Instances where the ASYNC buffer becomes full and shipping reverts to ARCH are limited to bursts of peak activity that occur after hours during batch processing. Generally, during work hours, it is possible to sustain LGWR ASYNC and benefit from enhanced data protection. Things have gone so well, it is desirable to test LGWR SYNC and see if zero data loss protection is feasible given your network environment. Use the following command to dynamically alter redo transport to use LGWR SYNC:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS LGWR SYNC';

Repeat the monitoring described in #2 above.

Step 4

Lets assume yet another successful outcome. Your evaluation of network performance is complete, and you decide to make LGWR SYNC the default redo shipping transport by upgrading the Data Guard protection mode from MAXIMUM PERFORMANCE to MAXIMUM AVAILABILITY. Be sure to review the Data Guard documentation for a full understanding of MAXIMUM AVAILABILITY protection mode (for example, archive destination attributes for NET_TIMEOUT and REOPEN). Upgrade to MAXIMUM AVAILABILITY by using the command, below. Unlike changes to redo transport in steps 1-3, the change in the protection mode does require a bounce of the primary database, but it need not be done until the next maintenance period.

SQL> ALTER DATABASE SET STANDBY TO MAXIMIZE AVAILABILITY;

Step 5

OK - the world is not perfect. You have proceeded through the above steps and somewhere along the way, you find that the network can not keep pace, regardless of how well you tune it. For example, you made it through step 3, and determined that LGWR ASYNC with a 10MB buffer works well, but there is an unacceptable impact to primary database performance of using LGWR SYNC. Simply revert to LGWR ASYNC by repeating the command from step 2.

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS LGWR ASYNC=20480';

Likewise, if at any time you desire to revert to ARCH, repeat the command from step 1:

SQL> ALTER SYSTEM SET LOG_ARCHIVE_DEST_2='SERVICE=STANDBYTNS ARCH';

 

Conclusion

Oracle9i Data Guard significantly enhances past mechanisms for maintaining a standby database for disaster recovery. Understanding and deploying redo shipping using the LGWR process is essential to achieve optimum data protection. The network's ability to keep pace with Data Guard has become more important than ever. Fortunately, the ability to dynamically configure Data Guard's redo transport services affords DBAs the opportunity to know for certain how their data guard configuration will respond to production workload, and make adjustments on the fly with zero impact to application availability.

 

References

  1. Oracle Data Guard Overview

  2. Oracle Data Guard Concepts and Administration Manual

  3. Oracle9i Database Performance Tuning Guide and Reference Manual - Chap. 21: Using Statspack

  4. MetaLink Note 94224.1 - "FAQ - Statspack Complete Reference"

  5. Oracle Maximum Availability Architecture

  6. Oracle9i Data Guard: Primary Site and Network Configuration Best Practices

  7. MetaLink Note 225633.1 - "Implementing SSH port forwarding with 9i Data Guard"

  8. MetaLink Note 241925.1 - "Troubleshooting 9i Data Guard Network Issues"

 


Joe Meeks is responsible for managing Customer Programs with Oracle's Database High Availability Group. His principal focus area is Oracle Data Guard, which is Oracle's disaster recovery solution for enterprise data. Joe has 24+ years of experience in the computer industry addressing the high availability needs of business critical applications in manufacturing, finance, telecommunications, healthcare, and the public sector. He has a BS in Environmental Science and an MBA.

Larry Carpenter is a member of the Data Guard development team in Server Technologies. Larry has 30+ years of experience in the computer industry, with the last 15 years focused on the high availability requirements of business critical databases. Larry's expertise is insuring the successful deployment of Oracle Disaster Recovery Solutions in diverse computing environments world wide. In addition to English, he speaks fluent Italian, French, and German.

 

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy