Data Protection - The Oracle Data Guard Difference
Data Protection
The Oracle Data Guard Difference - It's Not Just About Disasters
Joseph Meeks, Server Technologies, Oracle Corporation
Don't Let This Happen To You . . .
You are one year into implementing a disaster recovery (DR) site for
your critical IT operations. Your database is Oracle. You have selected
a leading manufacturer of storage subsystems to implement asynchronous
remote-mirroring to maintain a current copy of your Oracle data at your
DR site. You are three months into production and you have just
experienced your second SAN failure at your primary site. You also
discover that your storage solution is so good at remote-mirroring that
the corruptions at your primary site have been faithfully replicated to
your remote site. Unfortunately, this is not discovered until mirroring
to the target volumes at the remote site is turned off and you attempt
to mount and start the standby database. You have no choice but to do a
full restore and point-in-time recovery of your primary database,
incurring significant production downtime for the second time in three
months. You fail to deliver on the SLAs used to justify your two million
dollar investment in a remote disaster recovery site. This is not a
fictional event; it is the actual experience of a global business jet
charter and aircraft management company.
If Only . . . They Had Used Oracle Data Guard
The above experience could have been
avoided if they had used Oracle
Data Guard [1], a standard
feature of Oracle Database Enterprise
Edition. A second customer experience, but with an entirely
different outcome, is described below.
Imagine you are the DBA Manager for a Fortune 100 manufacturing
company. Your business runs on packaged applications from SAP. Your
database is Oracle. You also have strategic relationships with leading
hardware manufacturers that provide world-class computer systems and
storage.
If your database is not available, manufacturing and logistics are
severely impacted. High availability (HA) isn't a concept; it's how you
keep your job. Traditionally you have deployed HA solutions provided by
system and storage vendors. But over time, as hardware and OS
architectures have changed, so have the HA solutions bundled with them,
significantly impacting operations and increasing business risk as you
transition from one generation to the next. At the end of the day, you
have become a systems integrator crafting an HA environment for your
mission-critical databases from technologies provided by different
vendors.
You decide that you can no longer
afford to play the role of systems integrator. After
careful evaluation of your next generation HA/DR strategy, you
decide
on Oracle
Real Application Clusters (RAC)
[2] and Oracle Data Guard. A
synchronized copy of your primary RAC database is maintained by Data
Guard at a remote data center using Data
Guard's integrated asynchronous
redo transport
services.
Moving away from hardware-based components to Oracle Data Guard was a
big departure from your traditional reliance on hardware-based
remote-mirroring solutions. Within two months of going into production,
you are glad you did it, and this is why.
On a Nice Summer Day . . .
You log a TAR (Technical Assistance Request) with Oracle Support and
report that your UNIX administrator has identified errors related to a
host bus adapter on Node A of your 2-node production RAC database.
A log dump of one of the RAC threads showed:
ERROR at line 1:
ORA-00353 : log corruption near block change time
ORA-00353 : log corruption near block 7458 change 5160206351 time 07/17/2004 03:32:54
ORA-00334 : archived log: '/oracle/PRD/saparch/PRDarch1_299739.dbf.20040717.0420'
SQL> alter system dump logfile '/oracle/PRD/saparch/PRDarch1_299739.dbf';
System altered.
SQL> alter system dump logfile '/oracle/PRD/saparch/PRDarch1_299740.dbf';
alter system dump logfile '/oracle/PRD/saparch/PRDarch1_299740.dbf'
*
ERROR at line 1:
ORA-00399 : corrupt change description in redo log
ORA-00353 : log corruption near block 6166 change 5160261501 time 07/17/2004 03:48:41
ORA-00334 : archived log: '/oracle/PRD/saparch/PRDarch1_299740.dbf'
SQL>
And so on. These errors were
followed by a continuous
stream of ORA-00600s in the trace files as more corruptions were
reported.
While Oracle Support is analyzing the errors, you update the TAR to say
that you are working with the hardware vendor to identify and fix
faulty hardware. You find significant corruption in eleven database
files and archived logs. The first node in your RAC database had
crashed, but before you lost the remaining node you managed to complete
a no-data-loss switchover to your standby database maintained by Data
Guard. The transition of the standby database to the primary role took
less than 10 minutes of total database downtime (this is an Oracle9i
Data Guard environment), well within your service level agreement. You
know that corruption did not propagate to your standby database and
that all of the data in the standby database is good.
You also recall a similar circumstance using your previous HA/DR
solution based on remote-mirroring, where corruptions to primary
database files were also mirrored to target volumes on your standby. In
that instance, your standby database was useless, and a point-in-time
recovery was required that blew your service level agreements for both
availability and data loss "out of the water." As you recall this past
event, your staff observes you shaking your head and saying out loud to
yourself, "I am glad we made the move to Data Guard."
Over the next 4 days you work with your hardware vendors and Oracle to
determine the cause of the problem. You discover that a firmware issue
in new drives installed on the primary caused dropped I/Os and
incomplete writes. The same issue did not impact the standby because
the standby was using the previous generation of drives. The storage
vendor provides updated firmware. Because a Data Guard physical standby
is an exact replica of its primary database, rather than rebuild the
original primary database (2.5TB) you can use a copy of the affected
data files from the new primary server to repair the corruption. When
you restart the original primary database, now a standby database for
the new primary, Data Guard automatically brings it up to date by
processing the more than 300 log files (500MB each) generated since the
switchover. After several hours of successfully running as a standby
database, you execute a switchover and return all systems to their
original roles.
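For readers unfamiliar with the mechanics, a switchover like the ones in
this story could be sketched roughly as the following SQL*Plus sequence
for a physical standby. This is an illustrative outline, not the
customer's actual procedure. The exact steps vary by release (the
Oracle9i environment described here needed additional restarts), a RAC
primary must first be reduced to a single running instance, and the
comments about where each statement runs assume a simple two-site
configuration.

```sql
-- Step 1, on the current primary: confirm a switchover is possible.
-- Expect TO STANDBY (or SESSIONS ACTIVE if users are still connected).
SELECT switchover_status FROM v$database;

-- Step 2, on the current primary: convert it to a physical standby,
-- then restart it in mounted mode.
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
SHUTDOWN IMMEDIATE
STARTUP MOUNT

-- Step 3, on the old standby: take over the primary role and open.
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE OPEN;

-- Step 4, on the new standby (the old primary): resume redo apply.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
```

Because both databases stay intact and only exchange roles, the same
sequence run in the opposite direction returns the systems to their
original roles.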
Round 2 . . . Murphy on the Loose . . .
Within an hour of being back in production, the corruption quickly
reappears, and this time there is a hard crash of both nodes of the
primary RAC database. When this occurs you calmly invoke a Data Guard
failover to the standby database, a process that gives you another
production database running out of your DR data center in a matter of a
few minutes. Meanwhile, the hardware vendor is brought back in to
correct the problems. A second issue, this time in the HBA (host bus
adapter) firmware, is discovered. It takes another week for it to be
fixed. Your staff is much the worse for wear, and you have a hard time
sleeping at night knowing that you are running at your disaster recovery
site without a standby copy while your hardware vendor fixes the
problem, but the good news is that production ran smoothly at the remote
site throughout the process. You delivered on your service level
agreements. The investment in your second site and your wisdom in
selecting Oracle Data Guard are firmly established in the minds of your
team, application developers, and line-of-business managers.
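Unlike a switchover, a failover such as the one invoked above is run
entirely on the standby, because the primary is no longer available. A
minimal sketch for a physical standby follows. It assumes standby redo
logs are configured and a 10g-era release; syntax and required restarts
differ across releases, so treat it as an outline rather than a runbook.

```sql
-- On the standby: finish recovery using all redo that has been received.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;

-- Convert the standby to the primary role and open it for users.
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE OPEN;

-- Confirm the role change.
SELECT database_role, open_mode FROM v$database;
```

Until the failed primary is repaired or a new standby is created, the
new primary runs without a standby copy, which is the exposure described
in this section.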
Protection Against Data Corruption
Data Guard is an inherently
database-aware process that uses
redo data to synchronize one or more standby databases with a primary
production database. Redo data is
fundamental to the recoverability of an Oracle database, making it
possible to recover a database to any desired point in time. With
the Oracle parameters db_block_checking and db_block_checksum enabled
on a primary database, redo is
logically validated after each change, and a checksum is generated before
shipping it to the standby. A
Data Guard standby database is at least
mounted at all times while redo data is applied (note: a physical
standby database is mounted, a logical standby database is open). With
the db_block_checksum parameter also enabled on the standby database,
Data Guard is able to verify the redo
checksum
before applying redo to the standby database. In the event a
corruption is introduced into the redo
stream, the standby database is able to detect the corruption before it
is
applied. The standby
database will not apply the corrupt data, and instead will generate
an error immediately notifying administrators that corrective action is
required.
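As a rough illustration of the settings discussed above, the two
parameters could be enabled with statements like the following. This
sketch assumes the instance uses a server parameter file (required for
SCOPE = BOTH); TRUE is used because it is accepted across releases,
while later releases also accept finer-grained values, so check the
reference for your release. Block checking in particular adds CPU
overhead that should be measured before enabling it in production.

```sql
-- On the primary. Run the checksum statement on the standby as well so
-- that redo checksums are verified before redo is applied there.
ALTER SYSTEM SET db_block_checksum = TRUE SCOPE = BOTH;
ALTER SYSTEM SET db_block_checking = TRUE SCOPE = BOTH;

-- SQL*Plus: confirm the current values of both parameters.
SHOW PARAMETER db_block_check
```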
In contrast, a remote-mirroring solution must mirror every I/O to every
file (online logs, archived logs, data files, and control files) in
order to synchronize the standby database. There is no notion of
a database transaction in this process. All I/O is faithfully
replicated to the standby site. Furthermore, the target volumes
are not even mounted while data is mirrored.
This makes a standby database maintained by remote-mirroring a "cold,
lights-out" database whose state cannot be predictably determined until
the mirroring process is turned off and the database mounted and
started.
This creates a situation where you can't discover there is a problem
until the worst possible time, well after the problem has occurred and
much too late to easily fix it. A conversation with an IT manager at a
global logistics services provider described this in the following way:
"With
remote-mirroring it's
scary. You don't know what you have is working, unless you bring
it up at failover time. So at failover time, I can be the hero if
it all works, or I can lose my job if it doesn't. With Data
Guard, you always know it's working, so that's great."
For a more complete
discussion comparing Data Guard to remote-mirroring, see Oracle
Data Guard and Remote-Mirroring Solutions [3].
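Because a Data Guard standby is mounted while redo is applied, its state
can be inspected at any time rather than only at failover. As a hedged
illustration, queries along these lines could be run on a physical
standby. The views are standard dynamic performance views, but the
columns selected are just one reasonable choice, and newer releases
offer other views for the same purpose.

```sql
-- On the physical standby: which Data Guard processes are running and
-- which log sequence each one is currently working on.
SELECT process, status, thread#, sequence#, block#
  FROM v$managed_standby;

-- On the physical standby: the most recent archived log applied for
-- each redo thread (one row per RAC thread of the primary).
SELECT thread#, MAX(sequence#) AS last_applied_sequence
  FROM v$archived_log
 WHERE applied = 'YES'
 GROUP BY thread#;
```

Comparing the last applied sequence with the primary's current log
sequence gives a simple, continuous check that the standby is keeping up.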
And Then . . . Human Error
Data corruptions as described above may
be rare, but it only
takes one event
to
create a lasting impression. In truth, disasters come in many
shapes and sizes. A disaster recovery solution must achieve the
highest possible degree of isolation between primary and standby
databases while protecting against a wide range of potential
failures. A faulty line of microcode buried deep
within a system's
infrastructure can have the same impact on system
availability as a hurricane. And then of course there is
always human error to
contend with.
Consider a third customer, a high-volume chip manufacturer.
Administrator error caused a data file and archive logs to be deleted
from disk. Remote-mirroring was in use, and as expected, the same files
were also deleted from the mirrored copy. This simple mistake caused 5
hours of downtime, with costly impact to manufacturing operations. If
they had been using Data Guard 10g, they could have quickly switched
over to the standby database, with the standby assuming the primary
production role. When following Oracle Best Practices for Switchover
and Failover [4], database downtime could have been as little as 15
seconds.
As noted above, a Data Guard standby is
maintained by applying redo
data to the standby database. Because Data Guard is completely
database-aware, administrator errors that are external to the
database never make it into the redo stream and are not propagated to
the standby. Not only could this customer have switched production to
the standby with just seconds of downtime, but good files from the new
production database could also be used to repair the old primary.
The old primary could then be automatically resynchronized and
reinstated by Data Guard as a standby database
for the new primary.
Data Guard Value Proposition
Oracle Data Guard is a standard feature
of Oracle Database Enterprise
Edition and is integrated with other Oracle features to provide a
comprehensive HA/DR solution to protect mission critical data. It runs
on any hardware platform Oracle supports. It doesn't require
proprietary storage subsystems, special (and costly) network devices,
or third-party software to integrate and maintain. The bottom line is
that for Oracle databases, Data Guard provides better data protection,
at lower cost, with less performance overhead, and with higher
availability than all other DR solutions available today.
When combined with:
. . . Data Guard is part of an integrated
capability provided by Oracle
for
deploying a highly available computing platform. Together, these
technologies form the Oracle
Maximum Availability Architecture
[10], an Oracle-tested blueprint
and best practices for high availability.
Conclusion
A fourth customer, a large public utility company, proved it is never
too late to consider Data Guard. They had previously deployed
remote-mirroring but were now testing Data Guard for possible use as an
alternative HA/DR solution. For over a month they had a
"pre-production" standby database maintained by Data Guard, completely
synchronized with the primary database and running in parallel with
their "production" replica maintained by remote-mirroring. They
experienced corruption on their primary database, and when they mounted
the mirrored copy and attempted to restart the database, they
discovered that the same corruption prevented them from opening it.
Customer Service would have been severely impacted by an extended
outage, so they were faced with a dilemma: tell the business they were
down, or fail over to the Data Guard standby. Not wanting to take the
downtime hit, they quickly failed over to the Data Guard copy and kept
the business running.
They have been Data Guard users ever since.
Oh - By The Way . . .
Remember the first customer at the top of this article? Shortly after
their second SAN failure they called Oracle to talk about Data Guard.
In less than 2 weeks of elapsed time, covering analysis, planning,
testing, and deployment, they began using Oracle Data Guard to replace
remote-mirroring for their Oracle data. They didn't need to make any
additional purchases, they leveraged their existing in-house Oracle DBA
skill set, and they didn't have to be involved in any software or
hardware integration to make this work. They could begin using Data
Guard immediately and eliminate their dependence on a non-Oracle
technology to provide high availability and data protection for their
Oracle database.
References
[1] Oracle Data Guard - http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardOverview.html
[2] Oracle Real Application Clusters - http://www.oracle.com/technology/products/database/clustering/index.html
[3] Oracle Data Guard and Remote-Mirroring Solutions - http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardRemoteMirroring.html
[4] Oracle Data Guard 10g Release 2, Switchover and Failover Best Practices - http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gR2_SwitchoverFailoverBestPractices.pdf
[5] Oracle Recovery Manager - http://www.oracle.com/technology/deploy/availability/htdocs/rman_overview.htm
[6] Oracle Flashback Technologies - http://www.oracle.com/technology/deploy/availability/htdocs/Flashback_Overview.htm
[7] Online Reorganization and Redefinition - http://www.oracle.com/technology/deploy/availability/htdocs/online_ops.html
[8] Automatic Storage Management - http://www.oracle.com/technology/products/database/asm/index.html
[9] Oracle Enterprise Manager - http://www.oracle.com/technology/products/oem/index.html
[10] Oracle Maximum Availability Architecture (MAA) - http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
Joseph Meeks is a Director of Product
Management with Oracle's High Availability and Maximum Availability
Architecture Group with 25 years of experience assisting customers in
the design and deployment of highly available systems for mission
critical applications.