Technical Comparison of Oracle Data Guard vs. IBM DB2 HADR
Ashish Ray, Server Technologies, Oracle Corporation
[Note: This article is excerpted from the
comprehensive white paper: Technical Comparison of Oracle Database 10g
vs. IBM DB2 v8.2: Focus on High Availability, available at the OTN
HA Collateral site.]
OVERVIEW
Oracle Data Guard
Oracle Data Guard is the disaster recovery (DR) solution for the
Oracle database. Available as an integrated feature of Oracle Database
Enterprise Edition, it ensures fault isolation, high availability and
disaster recovery through the use of one or many standby databases that
are transactionally consistent copies of the production or primary
database. In the event of a planned or unplanned outage at the
production site, Data Guard ensures that a chosen standby database can
be easily switched to a primary database role, and continue serving the
enterprise data needs.
DB2 HADR
The new high availability feature in DB2 version 8.2 is called "High
Availability Disaster Recovery", or HADR [1]. HADR is based on a
similar feature, called High
Availability Data Replication (HDR for
short) from IBM's Informix Dynamic Server acquisition]. It is
similar to Oracle Data Guard in the sense that it replicates data
changes from a source database, called the primary, to a target
database, called the standby. The idea is that in the event of a
partial or complete site failure, the standby database can take over
for the primary database.
DB2 – What Existed in Previous Releases
What existed in version 8.1 was simply called DB2 log shipping,
which involves manual administration and custom scripts. This
log-shipping mechanism requires setting up a user-exit program
(specifically called “db2uxt2”) at a specified location, and setting
the database configuration parameter “userexit” to “Yes”. Once
these settings are in place, the DB2 server makes a call every five
minutes to the user-exit program to check for log files that can be
archived to the special directory/location specified by the program.
The user-exit program could be written such that it copies the
archived logs to a directory accessible by the standby server, or
simply FTPs the logs to the standby server.
This mechanism also requires setting up a scheduled job on the standby
system to periodically issue a db2 rollforward command (e.g. “db2
rollforward db <dbname> to end of logs”). When the rollforward db
command is invoked on the standby server, the DB2 logger automatically
attempts to retrieve the next consecutive log file from the archive
target path. The roll forward operation continues to retrieve log files
until there are no more left to process. The frequency at which this
job runs determines how quickly archived logs can be picked up and
applied by the standby database.
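
The polling cycle described above can be sketched as a toy model in Python. This is not DB2 code: the directory layout, the S0000000.LOG-style file names, and the function names ship_archived_logs and rollforward_to_end_of_logs are assumptions made purely for illustration. The sketch shows the two halves of the mechanism, a copy step standing in for the user-exit program and an apply step standing in for the scheduled rollforward job, and how the apply step stops at the first missing log.

```python
import shutil
import tempfile
from pathlib import Path


def ship_archived_logs(archive_dir, standby_dir):
    """Copy archived logs the standby does not have yet (the user-exit's job)."""
    shipped = []
    for log in sorted(archive_dir.glob("S*.LOG")):
        target = standby_dir / log.name
        if not target.exists():
            shutil.copy2(log, target)
            shipped.append(log.name)
    return shipped


def rollforward_to_end_of_logs(standby_dir, next_seq):
    """Consume consecutive logs starting at next_seq; stop at the first gap.

    Returns the sequence number of the next log still to be applied.
    """
    while (standby_dir / f"S{next_seq:07d}.LOG").exists():
        # A real standby would replay the log records here.
        next_seq += 1
    return next_seq


def demo():
    """Archive logs 0, 1 and 3 (log 2 is missing), ship them, then roll forward."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp, "archive")
        standby = Path(tmp, "standby")
        archive.mkdir()
        standby.mkdir()
        for seq in (0, 1, 3):
            (archive / f"S{seq:07d}.LOG").write_text("log records")
        shipped = ship_archived_logs(archive, standby)
        return shipped, rollforward_to_end_of_logs(standby, 0)
```

In this model the standby is never more current than the last complete log file that was shipped and applied, which is the granularity limit the next section contrasts with HADR.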
Obviously, this script-driven approach is very cumbersome and
error-prone. With HADR in Version 8.2, IBM has attempted to remove some
of the complexities associated with this approach.
How is v8.2 HADR Different from v8.1 Log Shipping?
The DB2 version 8.1 log-shipping feature still exists in version 8.2.
The way HADR is different from log-shipping is that IBM has built some
more automation around this concept, using technology acquired from
Informix. The dependence on manual user-exit programs and explicit
roll-forward commands has been reduced through the “START HADR” and
“STOP HADR” commands.
However, the caveat here is that the existing “rollforward database”
command cannot be used in a HADR configuration,
because it may produce some inconsistencies [1]. IBM advises using
“start hadr on db <dbname> as standby” instead.
The data protection is more granular in HADR compared to the earlier
log-shipping. Instead of waiting for complete archived logs to be
generated and then sending those, log pages are sent to the standby
database as they are generated on the primary database. The state of
replication is also controlled in a more granular manner through the
HADR_SYNCMODE configuration parameter, with its SYNC, NEARSYNC and
ASYNC values, which are somewhat similar to the protection modes of
Data Guard. HADR also provides automation around the role transition,
through the TAKEOVER HADR command.
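
As a rough illustration of what the three HADR_SYNCMODE values control, the Python sketch below models when a log write on the primary would count as complete. The article does not spell out these rules, so they are assumptions based on the commonly documented semantics: SYNC waits for the standby to write the log to disk, NEARSYNC waits for the standby to receive it into memory, and ASYNC waits only until the log has been handed to the network layer on the primary host. The function and parameter names are hypothetical.

```python
from enum import Enum


class SyncMode(Enum):
    SYNC = "SYNC"
    NEARSYNC = "NEARSYNC"
    ASYNC = "ASYNC"


def log_write_complete(mode, on_primary_disk, handed_to_network,
                       in_standby_memory, on_standby_disk):
    """Return True if a log write counts as complete under the given mode.

    Assumed rules (a simplification, not taken from the article):
      SYNC      primary disk + standby disk
      NEARSYNC  primary disk + standby memory
      ASYNC     primary disk + handed to the primary host's network layer
    """
    if not on_primary_disk:
        return False
    if mode is SyncMode.SYNC:
        return on_standby_disk
    if mode is SyncMode.NEARSYNC:
        return in_standby_memory
    return handed_to_network


# The same moment in time, seen under each mode: the standby has the log
# pages in memory but has not yet written them to disk.
snapshot = dict(on_primary_disk=True, handed_to_network=True,
                in_standby_memory=True, on_standby_disk=False)
results = {mode.name: log_write_complete(mode, **snapshot) for mode in SyncMode}
```

Under these assumptions, the stricter the mode, the later the primary can treat a log write as complete, and the less data is exposed if the primary is lost.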
DATA GUARD: COMPARATIVE STRENGTHS
HADR is a new feature. In contrast, Data Guard has been around for
several years, has evolved and been enhanced through several Oracle
database releases, and is deployed for mission-critical applications at
major customer sites all over the world. The following table provides a
quick summary of the comparative strengths of Data Guard over HADR.
Table 1: Addressing Disaster Recovery – Oracle vs. DB2
The following sections provide further details on Data Guard's
comparative strengths.
HADR does not offer any functionality equivalent to Data Guard SQL Apply
Oracle Data Guard supports two kinds of standby databases – physical
standby databases that use Redo Apply technology, and logical standby
databases that use SQL Apply technology. These two types of standby
databases are well integrated with each other – for example, a single
Data Guard configuration can be created to contain a mix of physical
and logical standby databases. Data Guard uses the same redo transport
mechanism to keep these standby databases transactionally consistent
with the primary. No extra integration is required to maintain these
two types of standby database – they are part of the same feature. The
same management interface – whether it is SQL*Plus, or DGMGRL, or
Enterprise Manager Grid Control, can be used to manage these two types
of standby databases.
The kind of standby database supported by HADR is similar to physical
standby. However, SQL Apply provides a very powerful capability for
Data Guard, allowing a logical standby database to be open for
read/write access and be utilized as a reporting database while SQL is
being applied to it. With real-time apply, a new feature in Oracle
Database 10g Release 1, logical standby databases can also be used as a
real-time reporting solution. This means that the logical standby
server can also be utilized for other valuable business purposes besides
disaster recovery. This is critical because effective system resource
utilization is a very important criterion for any disaster recovery
solution – to be cost effective, one simply can’t afford to have system
resources idling away waiting for the next disaster to happen.
A standby database in a HADR configuration cannot be open – whether
read, or read/write, while log data is being applied to it.
Applications cannot access this standby database in any state, which
means HADR customers cannot extract value out of their DR investment,
to the extent Data Guard customers can.
HADR does not provide any integrated automatic failover
HADR does not provide any integrated capability to automatically
perform a failover after a severe outage at the production site. The
failover operation has to be manually initiated through the TAKEOVER …
BY FORCE command.
In contrast, Data Guard in Oracle Database 10g Release 2 offers the
Fast-Start Failover feature that allows Data Guard to automatically
fail over to a previously chosen, synchronized standby database in the
event of loss of the primary database, without requiring any manual
steps to invoke the failover. Not only that, following a failover, once
connection to the old primary database is established, it is
automatically reinstated as a new standby in the configuration,
restoring high availability and data protection capabilities for the
configuration.
To implement automatic failover, DB2 documentation [1] suggests
integration with a Cluster Manager, which manages the primary-standby
pair. There are several flaws with this design. For one thing, it
requires a separate integration with a third-party clustering product
for the OS under consideration. For AIX, IBM offers its own
clustering product, HACMP, but that still requires separate
integration. For other operating systems, e.g. Linux, IBM depends on
whatever clustering product the OS vendor provides, and the
integration complexities are likely to be more serious. Also, this
approach can perhaps be made to work where the primary and standby
databases are located a short distance apart. For
any distance that is recommended for a practical level of disaster
protection, DB2 would require a geo-cluster implementation for
automatic failover, which will increase the integration and operational
complexities even more.
Data Guard’s integrated automatic failover capability is an excellent
fit for mission-critical business applications, which must tolerate
server failures transparently, while at the same time, being protected
from data failures. Without such integrated support for automatic
failover, DB2 is not a fit for lights-out high availability that is
increasingly a critical need for today’s 24x7 global business
applications.
A HADR standby cannot be open read-only like Redo Apply
In Data Guard, the physical standby database can be open read-only to
satisfy read-only reporting requirements. While in this state, redo is
still sent to the standby server – it is just not applied to it. After
the reporting is complete, redo apply can be restarted with a simple
command, or a mouse click. In this manner, a physical standby database
can be transitioned back and forth between redo apply mode and
read-only mode as many times as needed, to suit specific
business requirements.
A HADR standby does not have any such capability. To allow any
client/application access, the HADR standby database either has to
switch roles to become a primary database, or be activated as a
standard database.
Note that the SQL Apply continuously-open or Redo Apply read-only
capabilities offer an excellent way to test out or validate the DR
configuration, without causing any disruption to the primary database.
In contrast, since applications cannot access the HADR standby database
at any time, the only way such a validation can be done in a HADR
configuration is to do a role change to the standby, which is
disruptive to the primary database.
A HADR standby cannot be seamlessly activated to be a read/write database, and back
Once a HADR standby database is activated to be a standard database, it
cannot just go back to be a standby database. If it needs to go back to
the standby state, it needs to be completely reinstantiated as a
standby database from a backup of the primary database [1]. In
contrast, a physical standby database can be activated to support
read/write reporting capabilities, and then – using the Flashback
Across Resetlogs feature in Oracle Database 10g Release 2, it can be
converted back to a standby database with a single command. This
enhances the reporting as well as testing/cloning capabilities of the
Data Guard physical standby database.
A HADR standby cannot be used for backups
A Data Guard physical standby database can be used for backups, which
can be used to restore primary databases. This offloads the backup
operation from the production database, reduces resource contention on
the production server, boosts performance, and enables no-downtime
backup windows. Furthermore, using RMAN, the backups can occur while
redo is being applied to the physical standby database. A HADR standby
database cannot be used for such backups. This is yet another example
where customers investing in a DB2 HADR configuration will not be able
to extract value out of their investment and will instead waste money
on system resources that are essentially sitting idle.
HADR does not have any built-in mechanisms to prevent/undo data corruptions related to human errors
Human errors are one of the leading causes of downtime, yet HADR, in
contrast with Data Guard, does not have any built-in mechanisms to
prevent data corruptions related to human errors.
One way Data Guard prevents such data corruptions is using delayed
apply. The redo is still sent to the standby as fast as possible,
however the apply (Redo Apply or SQL Apply) can be delayed on the
standby by a configurable amount of time. This provides administrators
a safety time window in which to fail over to the standby in case the
primary has been corrupted, for example by a bad batch job that was run
on the primary database. This delay is very flexible in that it can be
configured on a per-standby basis – a primary database with two
standbys may have one standby configured with a delay of 4 hrs, the
other with a delay of 12 hrs – for varied protection from such
corruptions.
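
The delayed-apply behaviour described above can be modelled in a few lines of Python. This is a conceptual sketch only, not Oracle code: the DelayedStandby class and its method names are hypothetical, and time is counted in plain minutes. It shows redo arriving immediately while the apply step holds back anything younger than the configured delay, using the 4-hour and 12-hour delays from the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedStandby:
    delay_minutes: int
    received: list = field(default_factory=list)  # (arrival_minute, change)
    applied: list = field(default_factory=list)

    def receive(self, now, change):
        """Redo arrives immediately, regardless of the apply delay."""
        self.received.append((now, change))

    def apply_due(self, now):
        """Apply only redo that has aged past the configured delay."""
        still_waiting = []
        for arrived, change in self.received:
            if now - arrived >= self.delay_minutes:
                self.applied.append(change)
            else:
                still_waiting.append((arrived, change))
        self.received = still_waiting


standby_4h = DelayedStandby(delay_minutes=4 * 60)
standby_12h = DelayedStandby(delay_minutes=12 * 60)

for standby in (standby_4h, standby_12h):
    standby.receive(0, "good update")
    standby.receive(600, "bad batch job")  # corruption on the primary
    standby.apply_due(660)                 # noticed one hour later
```

One hour after the bad batch job, neither standby has applied it, so either one is still a clean failover target; the 4-hour standby is simply closer to current than the 12-hour one.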
In Oracle Database 10g, administrators may choose not to use delayed
apply but use real-time apply instead (e.g. to get the benefits of
real-time reporting in logical standby databases). If a human error
were to occur in such cases, the primary and standby databases may
simply be flashed back to a safe point in time in the past, using the
Flashback Database feature, providing yet another flexibility for Data
Guard in preventing corruptions due to human errors. HADR completely
lacks this capability.
HADR does not support DB2’s own clustering feature
HADR does not support DB2’s own partitioning feature (Database
Partitioning Feature, or DPF), which is its clustering feature [1].
This is a critical deficit, since partitioning is IBM’s premier HA
feature. This effectively means that the HA part of HADR is really
missing! Data Guard on the other hand, is completely integrated with
Oracle's clustering solution, RAC. Either or both of the primary and
standby databases can be a RAC cluster. All protection modes are
supported in these configurations. Automated transmission of redo data
and recovery are available for all configurations. With a
well-integrated RAC and Data Guard offering, Oracle offers an
end-to-end High Availability solution that is simply unparalleled in
the industry.
The HADR asynchronous mode is not really asynchronous, since it can stall the primary
HADR uses synchronization modes (the values SYNC, NEARSYNC and ASYNC of
the HADR_SYNCMODE configuration parameter), to manage the transmission
of log data between the primary and standby databases. The asynchronous
(ASYNC) value is meant to minimize the impact in the primary database,
but even in this mode, the primary can stall in cases where there is
high traffic, as pointed out in the DB2 documentation [1], [2]:
For example, when the HADR synchronization mode is asynchronous and the
primary and standby databases are in peer state, if the primary
database is experiencing a high transaction load, the log receive
buffer on the standby database might fill to capacity and the log
shipping operation from the primary database might stall.
or
If HADR synchronization mode (the HADR_SYNCMODE database configuration
parameter) is set to ASYNC, during peer state, a slow standby may cause
the send operation on the primary to stall and therefore block
transaction processing on the primary.
To manage these temporary peaks, the documentation suggests tweaking
the DB2_HADR_BUF_SIZE registry variable. This tweaking seems to be
purely arbitrary, without any advisories provided to assist the
customer.
In contrast, Data Guard in Oracle Database 10g has been architected
such that the Maximum Performance mode will not block the primary.
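
The stall described in the quoted documentation is essentially a bounded-buffer effect, which the following Python sketch tries to illustrate. It is a deliberately simplified tick-based model, not a description of HADR internals: the function name, the unit of "log pages per tick", and the assumption that the primary is throttled whenever the standby's receive buffer is full are all hypothetical.

```python
def simulate_log_shipping(generate_per_tick, replay_per_tick, buffer_pages, ticks):
    """Count the ticks in which the primary could not ship all its log pages.

    Each tick the standby first replays up to replay_per_tick buffered pages,
    then the primary tries to send generate_per_tick new pages into whatever
    space is free in the standby's receive buffer.
    """
    buffered = 0
    stalled_ticks = 0
    for _ in range(ticks):
        # The standby drains the buffer by replaying log pages.
        buffered -= min(buffered, replay_per_tick)
        free = buffer_pages - buffered
        sent = min(generate_per_tick, free)
        buffered += sent
        if sent < generate_per_tick:
            # Buffer full: in this model the primary has to wait.
            stalled_ticks += 1
    return stalled_ticks


balanced = simulate_log_shipping(10, 10, buffer_pages=50, ticks=100)
overloaded = simulate_log_shipping(12, 10, buffer_pages=50, ticks=100)
bigger_buffer = simulate_log_shipping(12, 10, buffer_pages=100, ticks=100)
```

In this model a larger buffer delays the first stall but does not remove it while the primary keeps generating log faster than the standby replays it, which is one way to read the article's point that buffer tuning only absorbs temporary peaks.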
A HADR configuration does not support multiple standbys
A Data Guard configuration supports multiple standbys, which allows
standbys to be used whether they are physical or logical standbys, or
whether they are located on a LAN or a WAN. Customers find this option
quite flexible to meet their unique business needs. For example, with
Data Guard, a multiple standby configuration is possible in which there
is a logical standby in a LAN serving as a local reporting database,
and a physical standby in a WAN serving as a remote DR database. A HADR
configuration, which allows only a primary-standby pair, does not offer
this flexibility.
Log data transport / log gap resolution in HADR configuration is not well-architected
The fact that the primary database can be blocked even in an
asynchronous log data transport in a HADR configuration in the event of
a high transaction load points out the inefficient log data transport
architecture for HADR. There are other issues related to this matter
that are worth pointing out.
IBM collateral indicates that there is an HADR process that takes
the log buffer and passes it over to the HADR process on the standby
machine. The fact that there is one process implies a potential
bottleneck in transmission of the data from the log buffers to the
standby database, which indeed explains why the primary database may
stall for high transaction loads. In contrast, for Data Guard, it is
possible to set up multiple Archiver (ARCH) processes to transmit redo
data to the standby database (even using parallel streaming in Oracle
Database 10g Release 2) without impacting the primary database even for
high transaction loads.
A related issue is the case of network disconnect problems – which is
especially a concern that needs to be addressed if the disaster
recovery solution is deployed over a long distance. In the case of
HADR, if network connection is lost, the standby database enters the
Remote Catchup Pending state, and remains in that state until the
connection is restored. When the connection is restored, it enters the
Remote Catchup state, and expects primary database archival methodology
to send archived log data to the standby database – presumably through
using specialized user-exit programs and configuring the database
configuration parameters logarchmeth1 and logarchmeth2. When all of the
log files on disk of the primary have been replayed by the standby
database, the primary and standby enter Peer state, at which time log
pages can be sent to the standby directly from the log buffer of the
primary.
The serious architectural limitation here is that for a busy system,
following a network disconnect problem and subsequent restoring of the
connectivity, the standby may perpetually be in the Remote Catchup
state – because there is always more archived data to catch up to.
Consequently, it may never be able to move to the Peer state, implying
a significant data loss exposure in such cases.
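
The catch-up argument above comes down to simple arithmetic, sketched here in Python. The function name and the figures in the usage lines are illustrative assumptions, not measurements: the backlog shrinks only by the difference between the rate at which log can be shipped and replayed and the rate at which the primary keeps generating new log.

```python
def catchup_seconds(backlog_mb, ship_mb_per_s, generate_mb_per_s):
    """Seconds until the standby has no backlog left, or None if it never does.

    backlog_mb         log accumulated on the primary during the outage
    ship_mb_per_s      rate at which log can be shipped to and replayed on the standby
    generate_mb_per_s  rate at which the primary keeps producing new log
    """
    if backlog_mb <= 0:
        return 0.0
    net_rate = ship_mb_per_s - generate_mb_per_s
    if net_rate <= 0:
        # New log arrives at least as fast as old log is cleared.
        return None
    return backlog_mb / net_rate


quiet_system = catchup_seconds(3600, ship_mb_per_s=10, generate_mb_per_s=4)
busy_system = catchup_seconds(3600, ship_mb_per_s=4, generate_mb_per_s=4)
```

With these made-up numbers the quiet system clears its backlog in ten minutes, while the busy one never does, which corresponds to the standby staying in the Remote Catchup state indefinitely.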
Data Guard, in contrast, allows simultaneous redo transmission using
both the Log Writer (LGWR) and Archiver (ARCH) processes. Combined with
support for multiple network connections to the same standby server,
this allows a Data Guard standby to catch up much more quickly
following restoration of network connectivity.
For resolution of large archive-log gaps, in order to minimize the
delay in transmitting several large log files over the network, it may
be prudent to apply incremental backups on the standby database and
bring it up-to-date with the primary, or at least get missing log files
from another local standby. HADR does not support applying incremental
backups on the standby database to bring it up to date with the
primary. Also, since HADR does not support multiple standbys, it is
solely dependent on the primary server for missing log data. Not only
that – if a HADR standby enters the Remote Catchup Pending state, and
more log files become available that can be manually copied to the
standby server, the standby database must be restarted to ensure that
it can recognize those logs [1]. For Data Guard – if missing log files
are registered with the standby control file – Redo Apply or SQL Apply
will automatically recognize them and start applying, without needing
any database restart.
HADR does not support rolling upgrades across major database releases
Both HADR and Data Guard support rolling upgrades of databases, but
HADR's rolling upgrades support is much more restrictive because unlike
Data Guard, it does not support rolling upgrades across major database
releases. It supports only rolling upgrades across fixpack releases
(equivalent to Oracle patchsets).
A related issue regarding system requirements is that the OS on the
primary and standby databases in a HADR configuration should be the
same version, including patches. In contrast, in a Data Guard
configuration, the OS on the primary and standby databases should be
the same, but they can be of different versions. For example, a Solaris
8 – Solaris 9 Data Guard configuration is supported.
Dynamic reconfiguration of HADR configuration parameters is not supported
Any changes made to any HADR configuration parameter are not effective
until the database has been shut down and restarted [1].
Examples of these parameters are: HADR_TIMEOUT, HADR_SYNCMODE, etc.
This problem is compounded by the fact that most of these parameters
require identical values between the primary and standby databases.
This means that to change any of these values, both the primary as well
as the standby databases have to be shut down and restarted. This
certainly reduces the availability and flexibility of HADR.
In contrast, almost all the parameters that are relevant to a Data
Guard configuration can be dynamically altered without requiring the
restart of the database.
A failover in a HADR configuration requires full re-instantiation of the old primary database
A failover in a HADR configuration, done through the TAKEOVER HADR
command with the BY FORCE option, typically requires the old primary
database to be recreated using a backup of the new primary database,
before it can rejoin the configuration. Only in one specific case [1],
when HADR runs in SYNC mode and the primary was in the peer state when
it failed, can the old primary be resynchronized as a new standby
instead of needing to be recreated. However, even in SYNC mode it is
possible that the primary was not in the peer state when it failed, and
in that case any effort to start it as a standby database (through
the START HADR … AS STANDBY command) will fail, and it has to be
recreated from a full backup of the new primary.
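
The rule in the preceding paragraph can be written down as a small decision function. This Python sketch encodes only what the article states, namely that reintegration is possible solely in SYNC mode with the primary in peer state at the time of failure; the function name and the returned strings are hypothetical, and the real conditions in the DB2 documentation may be more detailed.

```python
def old_primary_next_step(sync_mode, was_in_peer_state):
    """Return what has to happen to the old primary after a forced takeover."""
    if sync_mode.upper() == "SYNC" and was_in_peer_state:
        return "restart as standby"
    return "recreate from backup of new primary"


# Every combination of synchronization mode and state at failure time.
outcomes = {
    (mode, peer): old_primary_next_step(mode, peer)
    for mode in ("SYNC", "NEARSYNC", "ASYNC")
    for peer in (True, False)
}
```

Only one of the six combinations avoids a full re-instantiation, which is the asymmetry the rest of this section contrasts with Flashback Database.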
DB2 in general requires this full database re-instantiation because
with HADR, it is not possible to revert the primary database to the
point in time when the failure occurred [1]. This may consume
significant time and resources considering today’s multi-TB databases.
Besides, while the old primary is being recreated from a backup – not
only does this leave the HADR configuration unprotected for this
duration, it may also create logistics problems if the HADR
configuration is a WAN, because that may involve shipment of backup
tapes from a remote site.
In contrast, since Data Guard is integrated with the Oracle Flashback
Database feature, a primary database, after a failover, can be simply
flashed back to the point in time when the failover occurred, and then
can simply rejoin the Data Guard configuration as a new standby and
automatically catch up. This capability has been available since Oracle
Database 10g Release 1. As discussed previously, the Fast-Start
Failover feature in Oracle Database 10g Release 2 automates this even
further, by automatically reinstating the old primary database as a new
standby database, without requiring any manual backups, cumbersome
shipping of tapes, or a subsequent manual restore operation, as is the
case with DB2.
This means that compared to DB2 HADR, complete data protection can be
restored much more quickly in a Data Guard configuration after a
failover, and a Data Guard configuration is much better protected from
possible subsequent failures.
HADR does not replicate stored procedures
Stored procedures are not replicated in a HADR configuration – they
must be manually recreated. In contrast, for Data Guard, this is not an
issue for Redo Apply, which replicates all stored procedures. SQL Apply
also replicates all stored procedures, except Oracle PL/SQL supplied
packages that modify system metadata.
HADR does not offer built-in authentication/encryption
HADR does not offer built-in security mechanisms such as encrypting log
data while in transit, or authenticating every new primary-to-standby
connection [3]. Data Guard in Oracle Database 10g offers a built-in
security feature such that every new connection between a primary and
standby database is authenticated based on an administrator-supplied
password. Furthermore, with Oracle Advanced Security Option (ASO), all
redo traffic between the primary and standby databases will be
encrypted. Support for such security mechanisms is very important
considering that highly sensitive business critical data may be
transmitted between primary and standby databases.
HADR does not support raw devices
HADR does not support a configuration in which the primary/standby
database is based on raw devices (as opposed to a file system). No such
restriction exists for Data Guard.
HADR does not support cascaded standbys
HADR does not support cascaded standby configurations in which a
standby can retransmit redo to a second layer of standbys. Cascading
saves CPU-processing and networking resources around the primary data
center. Data Guard supports this capability.
HADR does not properly replicate all BLOBs and CLOBs
An IBM support note indicates that BLOBs and CLOBs larger than 1 GB
cannot be logged, so they also cannot be replicated. Data Guard does
not have this restriction – whether that is for physical standby, or
logical standby.
CONCLUSION
Recognizing the high availability challenges every business faces,
Oracle provides comprehensive, unique, powerful, and simple-to-use
capabilities that protect businesses against all forms of unplanned
downtime, including system faults, data corruption, disasters, and
human errors. Oracle achieves this in an environment where the downtime
that occurs during planned maintenance activities is also minimized.
Unlike DB2, Oracle’s high availability solutions are not isolated,
disjointed solutions. Oracle offers a well-integrated high
availability solution stack – comprised of components such as RAC,
Data Guard, RMAN, Flashback, etc., that do not need consultants to
stitch them together. This saves customers time, money and
system/people resources – factors that are extremely critical in
today’s economy. Oracle has gone one step further by publishing best
practice guidelines for configuring a High Availability solution
through its Maximum Availability Architecture framework, and making it
available for its customers. The long list of Oracle customers who have
embraced its High Availability
solutions is a testimonial to Oracle’s unparalleled technical
leadership and vision in this area.
In contrast to Oracle, DB2 offers a basic set of backup and recovery
features and lacks the completeness and depth of High Availability
functionality required by most businesses today. DB2 continues to lag
several releases behind Oracle in this regard and is not an appropriate
choice for today’s business applications demanding high levels of
uptime.
REFERENCES
[1] IBM DB2 Universal Database – Data Recovery and
High Availability Guide and Reference, Version 8.2, Chapter 7: High
Availability Disaster Recovery (HADR)
[2] IBM DB2 Universal Database – Administration
Guide: Performance, Version 8.2, Appendix A: DB2 Registry and
Environment Variables
[3] IBM DB2 Universal Database – Introduction to
Replication and Event Publishing, Version 8.2, Chapter 7: Comparison of
Q replication to high availability disaster recovery (HADR)
Ashish Ray (Ashish.Ray at oracle.com) is a Group Product Manager with
Oracle's Database High
Availability Group. He has 12+ years of combined experience in software
architecture design, software development and product management,
focusing largely on the reliability, availability and scalability
issues of enterprise and e-business computing.