Oracle Data Guard and Network Bandwidth Issues
Network Bandwidth Implications of Oracle Data Guard
Ray Dutcher, Server Technologies, Oracle Corporation
Ashish Ray, Server Technologies, Oracle Corporation
Introduction
Oracle Data Guard is Oracle's data protection and disaster recovery solution. One of the questions customers most frequently ask the Data Guard team is how much network bandwidth Data Guard requires. A variation of the same question is: can the network link between the production (primary) data center and the disaster recovery (DR, or secondary) data center support a Data Guard configuration?
At a high level, the answer is simple enough: it depends on how busy the production database is. Let's look into it in a bit more detail.
It's the Redo
What is the basis of Data Guard's operation? Data Guard sends the redo data generated by the primary database to one or more secondary, or standby, databases. That is how Data Guard keeps the standby databases transactionally consistent with the primary database. The more redo data the primary database generates, the more redo data Data Guard needs to transmit to the standby database. In other words, the faster the primary database generates redo, the faster Data Guard needs to send that redo to the standby database. Otherwise, either the standby database may fall behind or processing on the primary database may slow down (the exact behavior depends on the Data Guard protection mode chosen; more on that later). This means that the available network between the primary and standby databases must be capable of supporting this redo generation rate or, more precisely, the peak redo generation rate.
Since the amount of redo generated by a database is proportional to its transactional (write) activity, a very busy OLTP (on-line transaction processing) system, such as a leading e-commerce website, will require more network bandwidth for Data Guard than a non-OLTP system or a system that primarily supports read-intensive transactions (e.g. the technical support knowledge bank of a hi-tech company, or systems that let you view your present and previous credit card statements, your bank account balances, etc.).
What are the typical redo generation rates of Oracle production databases out there? The following graph summarizes the responses to the question "What is your peak redo generation rate?" from approximately 100 Oracle customers who were interested in Data Guard and attended OracleWorld San Francisco 2003, and provides some insights:
[Graph: peak redo generation rates reported by the surveyed customers]
It shows that more than 70% of these customers reported a peak redo rate of less than 500 KB/sec.
Measuring the Peak Redo Rate
How does one measure the peak redo generation rate for
a database? Use the Oracle Statspack
utility for an accurate measurement of the redo rate.
Based on your business, you should have a good idea of what your peak periods of normal business activity are. For example, you may be running an online store which historically sees peak activity for 4 hours every Monday between 10:00 am and 2:00 pm. Or you may be running a merchandising database which batch-loads a new catalog every Thursday for 2 hours between 1:00 am and 3:00 am. Note that we say "normal" business activity: on certain days of the year you may see much heavier business volume than usual, e.g. the 2-3 days before Mother's Day or Valentine's Day for an online florist business. You might allocate higher bandwidth than usual just for those days and not consider them "normal" business activity. However, if such periodic surges of traffic are regularly expected as part of your business operations, you must include them in your redo rate calculation.
During your peak business period, take Statspack measurements at periodic intervals. For example, you may measure three times during your peak hours, each time over a five-minute interval. The Statspack report for each interval includes a "Redo size" line under the "Load Profile" section near the beginning of the report. This line gives the "Per Second" and "Per Transaction" measurements of the redo size in bytes during the snapshot interval. Make a note of the "Per Second" value. The highest "Redo size" "Per Second" value among these three reports is your peak redo generation rate. For example, this highest "Per Second" value may be 394,253 bytes.
Note that if your primary database is a RAC database, you must run the Statspack snapshot on every RAC instance. Then, for each snapshot interval, sum the "Redo size" "Per Second" values of all instances to obtain the net peak redo generation rate for the primary database. The reason is that each node of a RAC primary database generates its own redo and independently sends that redo to the standby database.
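The measurement procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `peak_redo_rate` and the per-instance figures are hypothetical, and the values are assumed to have been copied by hand from the "Redo size" "Per Second" line of each Statspack report.

```python
def peak_redo_rate(snapshots):
    """Return the peak redo generation rate in bytes per second.

    `snapshots` holds one entry per snapshot interval. Each entry is a list
    of the "Redo size" "Per Second" values reported by every instance for
    that interval (a one-element list for a non-RAC database).
    """
    if not snapshots:
        raise ValueError("at least one snapshot interval is required")
    # Sum across instances within each interval, then take the highest sum.
    return max(sum(per_instance) for per_instance in snapshots)


# Hypothetical values for a two-node RAC primary over three intervals.
rac_snapshots = [
    [150_000, 120_000],
    [210_253, 184_000],
    [190_000, 160_000],
]
print(peak_redo_rate(rac_snapshots))  # prints 394253
```

Summing within an interval before taking the maximum matters: adding each instance's individual peak could overstate the rate if the instances peak at different times.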
Redo Generation Rate and the Required Network
Bandwidth
The paper titled "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices", available at http://otn.oracle.com/deploy/availability/htdocs/maa.htm, is part of the Oracle Maximum Availability Architecture (MAA) series of white papers. It provides a useful framework showing the correlation between the peak redo rate and the required bandwidth (ref. Appendix F: Network Throughput and Peak Redo Rates). This article does not go into the details of the formula's derivation since they are already explained in the paper. The formula used in the paper (assuming a conservative TCP/IP network overhead of 30%) is:
Required bandwidth (Mbps) = ((Redo rate in bytes per sec. / 0.7) * 8) / 1,000,000
Thus, our example peak rate of 394,253 bytes/sec (about 385 KB/sec) would require an available network bandwidth of at least
((394,253 / 0.7) * 8) / 1,000,000 = 4.5 Mbps.
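For readers who want to plug in their own numbers, the formula can be written as a small Python function. This is a minimal sketch: the function name and the `overhead` parameter are illustrative choices, with the 30% default taken from the assumption stated above.

```python
def required_bandwidth_mbps(redo_bytes_per_sec, overhead=0.30):
    """Bandwidth in Mbps needed to carry a given peak redo rate.

    `overhead` is the assumed fraction of the link consumed by TCP/IP
    overhead (30% in the formula quoted from the MAA paper).
    """
    if redo_bytes_per_sec < 0:
        raise ValueError("redo rate must be non-negative")
    if not 0 <= overhead < 1:
        raise ValueError("overhead must be in the range [0, 1)")
    usable_fraction = 1 - overhead
    # Bytes/sec -> bits/sec (x8) -> megabits/sec (/1,000,000).
    return (redo_bytes_per_sec / usable_fraction) * 8 / 1_000_000


print(f"{required_bandwidth_mbps(394_253):.1f} Mbps")  # prints "4.5 Mbps"
```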
For this Data Guard configuration, a standard T1 line providing up to 1.544 Mbps between primary and standby will not be adequate. However, a T3 connection (typically providing up to 44.736 Mbps) may be more than adequate, provided of course that this connection is not heavily shared by other applications that reduce the effective bandwidth for the primary-standby connection. So, while the peak redo generation rate is a good indication of your Data Guard-related network requirements, when specifying your network requirements to your network service provider you should also consider the other applications that may be sharing this network, and their Service Level Agreements (SLAs). Remember that the formula above indicates the network bandwidth that should be available to Data Guard; it does not indicate what the entire network bandwidth between your primary and DR data centers should be.
If this network link may be shared with other critical applications, consider configuring a higher-bandwidth network (e.g. dark fibre, OC1, or OC3) and/or using Quality of Service (QoS) to prioritize network traffic or to allocate dedicated bandwidth to a particular class of traffic, so that bursty traffic does not adversely affect your latency-sensitive traffic (such as Data Guard redo traffic).
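The effect of link sharing can be illustrated with a short Python check. This is a rough sketch under stated assumptions: the function names are hypothetical, the bandwidth used by other applications is a figure you would have to estimate yourself, and the 4.5 Mbps requirement is the example value computed earlier.

```python
# Nominal link capacities in Mbps, as quoted in the text.
LINKS_MBPS = {"T1": 1.544, "T3": 44.736}


def available_to_data_guard(link_mbps, other_traffic_mbps):
    """Bandwidth left for redo transport after other applications' share."""
    return max(link_mbps - other_traffic_mbps, 0.0)


def link_is_adequate(link_mbps, other_traffic_mbps, required_mbps):
    """True if the leftover bandwidth covers the Data Guard requirement."""
    return available_to_data_guard(link_mbps, other_traffic_mbps) >= required_mbps


required = 4.5  # Mbps, from the worked example
print(link_is_adequate(LINKS_MBPS["T1"], 0.0, required))   # prints False
print(link_is_adequate(LINKS_MBPS["T3"], 30.0, required))  # prints True
print(link_is_adequate(LINKS_MBPS["T3"], 42.0, required))  # prints False
```

The last case shows why the total link capacity alone is not enough to go on: a T3 that is almost fully committed to other traffic (a hypothetical 42 Mbps here) leaves less than the example configuration needs.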
Data Guard Protection Modes and the Network
Data Guard can be configured in one of three protection
modes - Maximum Protection, Maximum Availability or Maximum Performance. These
protection modes essentially differ in the following:
- their recommended redo data transport settings,
- the behavior of the primary when the last standby in the chosen protection mode is unavailable, and
- their capabilities for zero data loss in the event of a disaster at the primary site.
For Maximum Protection and Maximum Availability, the redo data transport setting requires the LGWR SYNC AFFIRM attributes in the log_archive_dest_n entry for the particular standby. For the Maximum Performance mode, the redo transport is set with the LGWR ASYNC attributes or, alternatively, the ARCH attribute.
Synchronous transport, as implied by the LGWR SYNC AFFIRM attributes, means that primary database transactions are not committed until their redo is also available on disk at the standby. The potential impact on production transactions is therefore correlated with the latency of the network link between the primary and the standby. Since the latency, or round-trip time, of a network is usually correlated with the length of the network, i.e. the physical distance between the two end points (in this case the primary and standby), Maximum Protection and Maximum Availability modes are not recommended for Data Guard deployments over a Wide Area Network (WAN). Note that this recommendation is driven by the laws of physics (the speed-of-light limitation): the greater the distance a network spans, the longer it takes for data packets to traverse it, and hence the longer it takes for primary database transactions to commit.
For WAN deployments of Data Guard, the Maximum Performance protection mode is recommended. All three protection modes can, however, be used for Local Area Network (LAN) or Metropolitan Area Network (MAN) deployments of Data Guard. As demonstrated in the previously mentioned MAA paper titled "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices", Maximum Protection/Availability modes are viable for a Data Guard deployment spanning approximately 345 miles, with minimal performance impact on the primary (no more than 3% in Oracle's internal tests). A US coast-to-coast (i.e. WAN) deployment of Data Guard using the Maximum Performance mode has almost no performance impact on the primary (1% in tests).
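The speed-of-light limitation can be made concrete with a back-of-the-envelope calculation in Python. The sketch assumes that signals travel through optical fibre at roughly 200,000 km/s (about two thirds of the speed of light in a vacuum) and that the cable follows a straight line. Real round-trip times are higher because of routing, switching and protocol overhead, so treat the result as a lower bound. The function name is illustrative, and the 2,500-mile figure is a hypothetical stand-in for a coast-to-coast distance.

```python
FIBER_KM_PER_SEC = 200_000  # assumption: roughly 2/3 of c in optical fibre
KM_PER_MILE = 1.609344


def min_round_trip_ms(distance_miles):
    """Lower bound on network round-trip time, from propagation delay only."""
    distance_km = distance_miles * KM_PER_MILE
    # Out and back, converted from seconds to milliseconds.
    return 2 * distance_km / FIBER_KM_PER_SEC * 1000


print(f"{min_round_trip_ms(345):.1f} ms")    # about 5.6 ms
print(f"{min_round_trip_ms(2500):.1f} ms")   # about 40.2 ms
```

With synchronous transport, every commit on the primary waits at least this long in addition to the standby's disk write, which is why the recommendation depends on distance.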
Network Bandwidth Issues During
Standby Creation
If you are creating the standby database from a backup of your multi-terabyte production database, an issue that you have to resolve is how to ship the initial backup to the standby site. Sending this initial multi-terabyte backup to the standby site over the network may not be feasible. You may be better off shipping the backup tape(s) to the standby site and subsequently using the network to copy incremental backups to the standby site.
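A quick estimate shows why sending the initial backup over the network may not be feasible. The Python sketch below is illustrative only: the 2 TB backup size is a hypothetical figure, the function name is made up for this example, and the 70% usable-bandwidth factor simply reuses the 30% TCP/IP overhead assumption from the bandwidth formula.

```python
SECONDS_PER_DAY = 86_400


def transfer_time_days(backup_bytes, link_mbps, efficiency=0.7):
    """Days needed to copy a backup over a link dedicated to the transfer.

    `efficiency` is the assumed usable fraction of the nominal link speed.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed and efficiency must be positive")
    usable_bits_per_sec = link_mbps * 1_000_000 * efficiency
    return backup_bytes * 8 / usable_bits_per_sec / SECONDS_PER_DAY


backup = 2 * 10**12  # hypothetical 2 TB backup
print(f"T3: {transfer_time_days(backup, 44.736):.1f} days")  # about 5.9 days
print(f"T1: {transfer_time_days(backup, 1.544):.0f} days")   # about 171 days
```

Even on a T3 link used for nothing else, the copy takes most of a week under these assumptions, which is the reasoning behind shipping tapes for the initial backup.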
Data Guard provides an important optimization in
this regard. While the backup tapes are in transit, the standby database may be
mounted and started, based on the standby control file and initialization file
sent to the standby site over the network. In such a situation, the standby
database acts as an archive log repository. For example, any archive logs
generated at the primary server since the backup of the primary database can be
manually copied to the standby site over the network. Also, after redo shipping
is enabled on the primary, any new redo data generated on the primary can
automatically be sent to the standby server by Data Guard. This redo data is not
applied to the standby database since it is not yet fully restored with the
backups, but at least the archive logs will be available at the standby site. This way, Data
Guard minimizes any risk of data loss in the event of a severe outage at the
primary server while the backup tapes are in transit, and enables faster
synchronization of the standby database with the primary since the required
archive logs are already available locally at the standby site.
Once the backup tapes arrive at the standby site
and the backups (full and incremental) are restored at the standby database, the
standby database and the apply process can be started. All accumulated redo data
at the standby site will now be automatically applied to the standby database.
If necessary, Data Guard will use the network to automatically
send any new primary database archive logs, or any missing archive logs, to the standby site and
rapidly bring the
standby up-to-date with the primary database.
What if I have a Slow Network?
If you have a slow network link between the
production and DR data centers, seriously consider upgrading the network.
Remember, Disaster Recovery is not an area where you would want to cut corners,
especially if your business has strict availability requirements. In case there
is a severe outage at the production site and your business operations are down,
the last thing that you want to do at that critical moment is to figure out how
much data you might have lost because redo data was not shipped to the
standby because of a slow network, or figure out how much the standby database
is behind the currently unavailable primary database.
Data Guard does provide you with some additional
options in case you want to reduce the demands on your network resources for a
highly active production database. If you have configured multiple standbys,
consider the Cascaded
Redo Log Destinations feature, with which you can have one standby database
sending redo data to one or more standby databases, instead of requiring the
primary database to send this redo to all standbys. This feature not only saves
network resource consumption around the production data center, but also saves
valuable processing cycles for the production database.
Another option that you may evaluate is configuring the link with SSH port forwarding with compression. For a high-latency, low-bandwidth network, SSH port forwarding is recommended for Maximum Performance mode. Oracle's internal tests in a high-latency WAN showed that using SSH with compression significantly reduced both network traffic and redo data transfer time. Refer to the "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" paper for further details on the test results. Please also refer to MetaLink Note 225633.1, "Implementing SSH port forwarding with 9i Data Guard", for configuration guidelines.
For additional guidelines on tuning the relevant parameters for Data Guard, Oracle Net Services and your operating system, refer to the "Oracle9i Data Guard: Primary Site and Network Configuration Best Practices" paper as well as the following MetaLink Notes:
- MetaLink Note 241925.1: "Troubleshooting 9i Data Guard Network Issues"
- MetaLink Note 260040.1: "Refining Remote Archival Over a Slow Network with the ARCH Process"
A question that we commonly get about this slow network issue is whether there is any way Data Guard can filter out selected transactions before sending the redo data to the standby sites. The answer is no. Every bit of redo data that is generated on the primary database will be sent over to the standby site; no filtering is possible. Make sure you understand the rationale here: Data Guard is a disaster recovery mechanism, so the general goal should be to keep the standby databases transactionally consistent with the primary, such that during a switchover or a failover, a chosen standby database may easily be transitioned into the primary role. If you need to transform or filter your redo data before sending it over to the standby site, consider an alternative solution such as Oracle Streams. Unlike Data Guard, Streams also allows the replication of a subset of the tables on the source database to the target database, which could be another way to ensure that only the data that needs to be protected is transmitted across the network, especially when the available network bandwidth is not enough to keep up with the redo generation rate.
Note that after the redo data reaches the standby
site, Data Guard SQL Apply (logical standby database) offers some flexibility in that it allows you to skip applying that redo for certain tables.
This is not possible with Data Guard Redo Apply (physical standby database),
which by definition is a block-for-block copy of the primary database.
A follow-up question is whether one can do NOLOGGING operations on the primary database in a Data Guard configuration, to reduce the load on the network. The answer is that one shouldn't. The redo data is the basis of Data Guard's operations. Since NOLOGGING operations write directly to the data files and bypass the redo logs, Data Guard will not be able to keep the standby database consistent with the primary during NOLOGGING operations. In fact, to ensure this doesn't happen, Oracle9i introduced the command ALTER DATABASE FORCE LOGGING; to make sure that all database write operations are logged. It is always a good idea to run this command on your production database so that you are protected against any application that may have NOLOGGING operations built into its code. Refer to MetaLink Note 216211.1, "Nologging In The E-Business Suite", for further details on this matter.
Conclusion
This article focused on the network bandwidth
implications of a Data Guard configuration. The objective of the article was to
convey the most relevant issues in a concise manner and provide readers with
helpful pointers for further reading.
Network bandwidth management is not a one-off exercise. It needs careful planning, review and understanding of SLAs for the supported applications, as well as SLAs promised by the network service provider, and continuous monitoring of the network to ensure that the business operations goals and availability requirements are being met. Fortunately for administrators, several bandwidth management and monitoring tools are available on the market that allow them to extract the maximum value from their network connectivity investments, instead of buying extra bandwidth that ultimately may not be necessary.
Data Guard is an excellent choice for data protection and disaster recovery not just because of its comprehensive functionality, but also because of the way it is architected to handle data transmission over a network. It is based on standard TCP/IP protocols, which means organizations can leverage existing resources rather than buy extra hardware or incur extra training costs. The redo transmission is efficient: even though a write transaction affects the redo log files, archive log files and data files, Data Guard sends only the redo data to keep the standby databases synchronized with the primary. This is in contrast to certain storage-level remote mirroring solutions which may send all of those changes, requiring up to 3 times more network resource consumption. Data Guard also offers administrators the flexibility to configure their desired redo transport mechanism based on their business requirements. Finally, Data Guard comes with a rich set of configuration guidelines and best practice blueprints that make it easy to implement and use.
References
- Oracle Data Guard Overview
- Oracle Data Guard Concepts and Administration Manual
- Oracle9i Database Performance Tuning Guide and Reference Manual - Chap. 21: Using Statspack
- MetaLink Note 94224.1 - "FAQ - Statspack Complete Reference"
- Oracle Maximum Availability Architecture
- Oracle9i Data Guard: Primary Site and Network Configuration Best Practices
- MetaLink Note 225633.1 - "Implementing SSH port forwarding with 9i Data Guard"
- MetaLink Note 241925.1 - "Troubleshooting 9i Data Guard Network Issues"
- MetaLink Note 260040.1 - "Refining Remote Archival Over a Slow Network with the ARCH Process"
- Oracle Streams Overview
- MetaLink Note 216211.1 - "Nologging In The E-Business Suite"
Ray Dutcher (Raymond.Dutcher@oracle.com) is a Principal Member of Technical Staff with Oracle's High Availability Systems Group. He has 20+ years of combined experience in software development, database management, technical support and developing database best practices in the areas of availability, performance, and management.
Ashish Ray (Ashish.Ray@oracle.com) is a Senior Product Manager with Oracle's Database High Availability Group. His principal focus area is Oracle Data Guard, which is Oracle's disaster recovery solution for enterprise data. He has 10+ years of combined experience in software architecture design, software development and product management, focusing largely on the reliability, availability and scalability issues of enterprise and e-business computing.