Data Protection
The Oracle Data Guard Difference - It's Not Just About Disasters
Joseph Meeks, Server Technologies, Oracle Corporation
Don't Let This Happen To You . . .
You are one year into implementing a disaster recovery (DR) site for your critical IT operations. Your database is Oracle. You have selected a leading manufacturer of storage subsystems to implement asynchronous remote-mirroring to maintain a current copy of your Oracle data at your DR site. You are three months into production and you have just experienced your second SAN failure at your primary site. You also discover that your storage solution is so good at remote-mirroring that the corruptions at your primary site have been faithfully replicated to your remote site. Unfortunately, this is not discovered until mirroring to the target volumes at the remote site is turned off and you attempt to mount and start the standby database. You have no choice but to do a full restore and point-in-time recovery of your primary database, incurring significant production downtime for the second time in three months. You fail to deliver on the SLAs used to justify your two-million-dollar investment in a remote disaster recovery site. This is not a fictional event; it is the actual experience of a global business jet charter and aircraft management company.
If Only . . . They Had Used Oracle Data Guard
The above experience could have been avoided if they had used Oracle Data Guard [1], a standard feature of Oracle Database Enterprise Edition. A second customer experience, but with an entirely different outcome, is described below.
Imagine you are the DBA Manager for a Fortune 100 manufacturing company. Your business runs on packaged applications from SAP. Your database is Oracle. You also have strategic relationships with leading hardware manufacturers that provide world-class computer systems and storage.
If your database is not available, manufacturing and logistics are severely impacted. High availability (HA) isn't a concept, it's how you keep your job. Traditionally you have deployed HA solutions provided by system and storage vendors. But over time, as hardware and OS architectures have changed, so have the HA solutions bundled with them, significantly impacting operations and increasing business risk as you transition from one generation to the next. At the end of the day, you have become a systems integrator crafting an HA environment for your mission-critical databases from technologies provided by different vendors.
You decide that you can no longer afford to play the role of systems integrator. After careful evaluation of your next generation HA/DR strategy, you decide on Oracle Real Application Clusters (RAC) [2] and Oracle Data Guard. A synchronized copy of your primary RAC database is maintained by Data Guard at a remote data center using Data Guard's integrated asynchronous redo transport services.
Moving away from hardware-based components to Oracle Data Guard was a big departure from your traditional reliance on hardware-based remote-mirroring solutions. Within two months of going into production, you are glad you did it, and this is why.
On a Nice Summer Day . . .
You log a TAR with Oracle support and report that your UNIX administrator has identified errors related to a host bus adapter on Node A of your two-node production RAC database.
A log dump of one of the RAC threads showed:
ERROR at line 1:
ORA-00353 : log corruption near block change time
ORA-00353 : log corruption near block 7458 change 5160206351 time 07/17/2004
03:32:54
ORA-00334 : archived log:
'/oracle/PRD/saparch/PRDarch1_299739.dbf.20040717.0420'
SQL> alter system dump logfile '/oracle/PRD/saparch/PRDarch1_299739.dbf';
System altered.
SQL> alter system dump logfile '/oracle/PRD/saparch/PRDarch1_299740.dbf';
alter system dump logfile '/oracle/PRD/saparch/PRDarch1_299740.dbf'
*
ERROR at line 1:
ORA-00399 : corrupt change description in redo log
ORA-00353 : log corruption near block 6166 change 5160261501 time 07/17/2004
03:48:41
ORA-00334 : archived log: '/oracle/PRD/saparch/PRDarch1_299740.dbf'
SQL>
And so on. These errors were followed by a continuous stream of ORA-00600s in the trace files as more corruptions were reported.
While Oracle support is analyzing the errors, you update the TAR to say that you are working with the hardware vendor to identify and fix faulty hardware. You find significant corruption in eleven database files and archived logs. The first node in your RAC database has crashed, but before you lose the remaining node you manage to complete a no-data-loss switchover to your standby database maintained by Data Guard. The transition of the standby database to the primary role takes less than 10 minutes of total database downtime (this is an Oracle9i Data Guard environment), well within your service level agreement. You know that corruption did not propagate to your standby database and that all of the data in the standby database is good.
You also recall a similar circumstance using your previous HA/DR solution based on remote-mirroring, where corruptions to primary database files were also mirrored to target volumes on your standby. In that instance, your standby database was useless and a point-in-time recovery was required that blew your service level agreements for both availability and data loss "out of the water". As you remember this past event, your staff observes you shaking your head and saying out loud to yourself, "I am glad we made the move to Data Guard".
Over the next 4 days you work with your hardware vendors and Oracle to determine the cause of the problem. You discover that a firmware issue in new drives installed on the primary caused dropped I/Os and incomplete writes. The same issue did not impact the standby, which was using the previous generation of drives. The storage vendor provides updated firmware. Because a Data Guard physical standby is an exact replica of its primary database, you do not have to rebuild the original primary database (2.5TB); instead, you can use a copy of the affected data files from the new primary server to repair the corruption. When you restart the original primary database, now a standby database for the new primary, Data Guard automatically brings it up to date by processing the more than 300 log files (500MB each) generated since the switchover. After several hours of successfully running as a standby database, you execute a switchover and return all systems to their original roles.
Round 2 . . . Murphy on the Loose . . .
Within an hour of being back in production, the corruption reappears, and this time there is a hard crash of both nodes of the primary RAC database. When this occurs you calmly invoke a Data Guard failover to the standby database, a process that enables you to have another production database running out of your DR data center in a matter of a few minutes.
Meanwhile, the hardware vendor is brought back in to correct the problems. A second issue, this time in the HBA (host bus adapter) firmware, is discovered. It takes another week for it to be fixed. Your staff is much the worse for wear, and you have a hard time sleeping at night knowing that you are running at your disaster recovery site without a standby copy while your hardware vendor fixes the problem, but the good news is that production runs smoothly at the remote site throughout the process. You deliver on your service level agreements. The investment in your second site and your wisdom in selecting Oracle Data Guard are firmly established in the minds of your team, application developers, and line-of-business managers.
Protection Against Data Corruption
Data Guard is an inherently database-aware process that uses redo data to synchronize one or more standby databases with a primary production database. Redo data is fundamental to the recoverability of an Oracle database, making it possible to recover a database to any desired point in time. With the Oracle parameters db_block_checking and db_block_checksum enabled on a primary database, redo is logically validated after each change, and a checksum is generated before the redo is shipped to the standby.
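As a minimal sketch of the primary-side settings described above, the following SQL*Plus commands enable both parameters. The values shown are assumptions: TRUE is the boolean form, later releases also accept graded values such as TYPICAL or FULL, and SCOPE=BOTH assumes the instance uses a server parameter file. Block checking adds CPU overhead, so measure its impact on a test system before enabling it in production.

```sql
-- Sketch only: enable logical block checking and block checksums on the primary.
-- SCOPE=BOTH assumes an spfile is in use; omit the clause otherwise.
ALTER SYSTEM SET db_block_checking = TRUE SCOPE=BOTH;
ALTER SYSTEM SET db_block_checksum = TRUE SCOPE=BOTH;

-- Confirm the settings now in effect (SQL*Plus command; the partial
-- name matches both db_block_checking and db_block_checksum).
SHOW PARAMETER db_block_check
```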
A Data Guard standby database is at least mounted at all times while redo data is applied (note: a physical standby database is mounted, a logical standby database is open). With the db_block_checksum parameter also enabled on the standby database, Data Guard is able to verify the redo checksum before applying redo to the standby database. In the event a corruption is introduced into the redo stream, the standby database is able to detect the corruption before it is applied. The standby database will not apply the corrupt data, and instead will generate an error immediately notifying administrators that corrective action is required.
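One way to watch for the condition described above is to query the Data Guard views on the standby. This is a sketch, assuming a physical standby on which V$DATAGUARD_STATUS and V$ARCHIVED_LOG are available; the severity filter and the choice of columns are illustrative, not a prescribed monitoring procedure.

```sql
-- Run on the standby. Sketch only.

-- Enable checksum verification on the standby as well.
ALTER SYSTEM SET db_block_checksum = TRUE;

-- Recent Data Guard messages that need an administrator's attention.
SELECT timestamp, severity, message
  FROM v$dataguard_status
 WHERE severity IN ('Error', 'Fatal')
 ORDER BY timestamp;

-- Highest archived log sequence applied so far, per redo thread.
SELECT thread#, MAX(sequence#) AS last_applied
  FROM v$archived_log
 WHERE applied = 'YES'
 GROUP BY thread#;
```

Because the standby is mounted while redo is applied, checks like these can run at any time, not only at failover time.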
In contrast, a remote-mirroring solution must mirror every I/O to every file (online logs, archive logs, data files, and control file) in order to synchronize the standby database. There is no notion of a database transaction in this process. All I/O is faithfully replicated to the standby site. Furthermore, the target volumes are not even mounted while data is mirrored. This makes a standby database maintained by remote-mirroring a "cold, lights-out" database whose state cannot be predictably determined until the mirroring process is turned off and the database is mounted and started. This creates a situation where you can't discover there is a problem until the worst possible time, well after the problem has occurred and much too late to easily fix it. An IT manager at a global logistics services provider described this in the following way:
"With remote-mirroring it's scary. You don't know what you have is working, unless you bring it up at failover time. So at failover time, I can be the hero if it all works, or I can lose my job if it doesn't. With Data Guard, you always know it's working, so that's great."
For a more complete discussion comparing Data Guard to remote-mirroring, see Oracle Data Guard and Remote-Mirroring Solutions [3].
And Then . . . Human Error
Data corruptions as described above may be rare, but it only takes one event to create a lasting impression. In truth, disasters come in many shapes and sizes. A disaster recovery solution must achieve the highest possible degree of isolation between primary and standby databases while protecting against a wide range of potential failures. A faulty line of microcode buried deep within a system's infrastructure can have the same impact on system availability as a hurricane. And then of course there is always human error to contend with.
Consider a third customer, a high-volume chip manufacturer. Administrator error caused a data file and archive logs to be deleted from disk. Remote-mirroring was in use and, as expected, the same files were also deleted from the mirrored copy. This simple mistake caused 5 hours of downtime, with costly impact to manufacturing operations. If they had been using Data Guard 10g, they could have quickly switched over to the standby database, with the standby assuming the primary production role. When following Oracle Best Practices for Switchover and Failover [4], database downtime could have been as little as 15 seconds.
As noted above, a Data Guard standby is maintained by applying redo data to the standby database. Because Data Guard is completely database aware, administrator errors that are external to the database never make it into the redo stream and are not propagated to the standby. Not only could this customer have switched production to the standby with just seconds of downtime, but good files from the new production database could also be used to repair the old primary. The old primary could then be automatically resynchronized and reinstated by Data Guard as a standby database for the new primary.
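For a physical standby, the role transition described above is typically a short sequence of SQL statements issued on each database. The sketch below assumes a single primary and a single physical standby managed from SQL*Plus rather than through the Data Guard broker. Exact steps vary by release (for example, whether the new primary must be restarted before it can be opened), so treat this as an outline and follow the best-practices paper [4] for your version.

```sql
-- Sketch only: switchover between a primary and one physical standby.

-- 1. On the current primary: check readiness, then convert it to a standby.
SELECT database_role, switchover_status FROM v$database;
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
SHUTDOWN IMMEDIATE
STARTUP MOUNT

-- 2. On the old standby: assume the primary role and open for users.
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE OPEN;

-- 3. On the old primary (now a standby): restart redo apply so it is
--    resynchronized with the new primary.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
```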
Data Guard Value Proposition
Oracle Data Guard is a standard feature of Oracle Database Enterprise Edition and is integrated with other Oracle features to provide a comprehensive HA/DR solution to protect mission-critical data. It runs on any hardware platform Oracle supports. It doesn't require proprietary storage subsystems, special (and costly) network devices, or third-party software to integrate and maintain. The bottom line is that for Oracle databases, Data Guard provides better data protection, at less cost, with less performance overhead, and with higher availability than all other DR solutions available today.
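When combined with:
- Oracle Recovery Manager [5]
- Oracle Flashback Technologies [6]
- Online Reorganization and Redefinition [7]
- Automatic Storage Management [8]
- Oracle Enterprise Manager [9]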
. . . Data Guard is part of an integrated capability provided by Oracle for deploying a highly available computing platform. Together, these technologies form the Oracle Maximum Availability Architecture [10], an Oracle-tested blueprint and best practices for high availability.
Conclusion
A fourth customer, a large public utility company, proved it is never too late to consider Data Guard. They had previously deployed remote-mirroring but were now testing Data Guard for possible use as an alternative HA/DR solution. For over a month they had a "pre-production" standby database maintained by Data Guard, completely synchronized with the primary database and running in parallel with their "production" replica maintained by remote-mirroring. They experienced corruption on their primary database, and when they mounted the mirrored copy and attempted to restart the database, they discovered that the same corruption prevented them from opening it. Customer Service would have been severely impacted by an extended outage, so they were faced with a dilemma: tell the business they were down, or fail over to the Data Guard standby. Not wanting to take the downtime hit, they quickly failed over to the Data Guard copy and kept the business running.
They have been Data Guard users since.
Oh - By The Way . . .
Remember the first customer at the top of this article? Shortly after their second SAN failure they called Oracle to talk about Data Guard. In less than 2 weeks of elapsed time, covering analysis, planning, testing, and deployment, they began using Oracle Data Guard to replace remote-mirroring for their Oracle data. They didn't need to make any additional purchases, they leveraged their existing in-house Oracle DBA skill set, and they didn't have to be involved in any software or hardware integration to make this work. They could begin using Data Guard immediately and eliminate their dependence on a non-Oracle technology to provide high availability and data protection for their Oracle database.
References
[1] Oracle Data Guard - http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardOverview.html
[2] Oracle Real Application Clusters - http://www.oracle.com/technology/products/database/clustering/index.html
[3] Oracle Data Guard and Remote-Mirroring Solutions - http://www.oracle.com/technology/deploy/availability/htdocs/DataGuardRemoteMirroring.html
[4] Oracle Data Guard 10g Release 2, Switchover and Failover Best Practices - http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gR2_SwitchoverFailoverBestPractices.pdf
[5] Oracle Recovery Manager - http://www.oracle.com/technology/deploy/availability/htdocs/rman_overview.htm
[6] Oracle Flashback Technologies - http://www.oracle.com/technology/deploy/availability/htdocs/Flashback_Overview.htm
[7] Online Reorganization and Redefinition - http://www.oracle.com/technology/deploy/availability/htdocs/online_ops.html
[8] Automatic Storage Management - http://www.oracle.com/technology/products/database/asm/index.html
[9] Oracle Enterprise Manager - http://www.oracle.com/technology/products/oem/index.html
[10] Oracle Maximum Availability Architecture (MAA) - http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
Joseph Meeks is a Director of Product Management with Oracle's High Availability and Maximum Availability Architecture Group with 25 years of experience assisting customers in the design and deployment of highly available systems for mission critical applications.