As Published In
Oracle Magazine
March/April 2007

COVER FEATURE


Always Available

By Alan Joch

Oracle high-availability technologies drive business 24/7.

Don't be fooled by appearances—Red Nose Day is no laughing matter. Comic Relief's biennial fund-raiser invites people throughout the United Kingdom to don red noses as they get sponsored to do all sorts of silly things and donate money to support antipoverty programs locally and in Africa. The event culminates with a five-hour telethon, during which comedians such as John Cleese and Ricky Gervais might show up to push for a final round of telephone or Web donations. Donors apparently laugh all the way to their wallets: A recent campaign netted more than US$123.6 million.

But peek behind the scenes at those make-or-break telethons and you'll find a lot of serious IT staff who could use a laugh themselves. Because so much of the fund-raising success hinges on efficiently processing more than 250,000 electronic transactions in a short time frame, Comic Relief's data center is fine-tuned for high availability (HA). "If the system isn't working, then obviously we can't collect contributions," says Martin Gill, head of new media for Comic Relief. "We can't just go back to people a couple of days later and say, 'Remember when you were feeling generous and you wanted to give us £100? Please try again now, because we fixed the glitch that prevented you doing it the other night.'" Given the narrow window in which the charitable organization needs to process pledges, any glitch can cost the fund-raising effort, so systems need to be up, available, and fast.

To avoid any awkward second acts, Comic Relief's new-media managers rely on a dual-data-center architecture built with Oracle Database 10g, Oracle Real Application Clusters (Oracle RAC), and Oracle Data Guard. Together these technologies ensure that the charitable fun doesn't grind to a halt if a hard drive, server, network switch, or entire site crashes.

This high-availability approach came in handy during a recent Red Nose Day when processing times were rising above targeted levels. The flexible high-availability environment enabled Comic Relief to switch all the transaction processing to one data center while using the second one for diagnostics, ultimately solving the problem with a quick upgrade before rejoining the two centers. "We had no loss of service to our donors," Gill says. "Only some sweaty engineers who worked hard and brilliantly to successfully manage the situation."

Easier Choices

High availability and disaster recovery (DR) have long been like life insurance: IT managers know they need them, but their hard-to-quantify return on investment (ROI) poses a challenge. Budget watchers struggle with how much is enough to spend for server and storage resources that might remain idle most of the time. 

Staying Secure with Oracle Data Guard


Oracle Data Guard is a feature of Oracle Database 10g that creates, synchronizes, and monitors one or more standby databases to protect data from failures, disasters, errors, and corruptions. These standby databases can be located at remote disaster recovery sites thousands of miles away from the production data center, or they may be located in the same city, same campus, or even in the same building. If the production database becomes unavailable because of a planned or unplanned outage, Oracle Data Guard can switch any standby database to the production role, thus minimizing the downtime associated with the outage and preventing any data loss.

In keeping with the imperative to make maximum use of standby database resources even while in standby role, an Oracle Data Guard standby database can field queries and run backups to relieve processing demands on the main database. Operating system and hardware maintenance can be done in rolling fashion: First at the standby side, and then after a planned role transition (referred to as a switchover) the standby database assumes the production role and the same maintenance can be executed on the original primary with minimal downtime. Oracle's Sushil Kumar says, "People have assumed that certain tasks require downtime, but Oracle's goal is to eliminate planned maintenance downtime."
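The rolling maintenance sequence described above can be sketched as a toy model. This is only an illustration of the order of operations (patch the standby, switch roles, patch the former primary); the `Database` class, the site names, and the `patched` flag are hypothetical and do not correspond to any Oracle Data Guard interface.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    role: str              # "primary" or "standby"
    patched: bool = False  # hypothetical flag: has maintenance been applied?


def switchover(primary, standby):
    """Planned role transition: the two databases swap roles."""
    assert primary.role == "primary" and standby.role == "standby"
    primary.role, standby.role = "standby", "primary"
    return standby, primary  # (new primary, new standby)


def rolling_maintenance(primary, standby):
    """Patch the standby first, switch roles, then patch the old primary."""
    log = []
    standby.patched = True
    log.append(f"patched {standby.name} while in standby role")
    primary, standby = switchover(primary, standby)
    log.append(f"switchover: {primary.name} is now primary")
    standby.patched = True
    log.append(f"patched {standby.name} while in standby role")
    return primary, standby, log


site_a = Database("site_a", "primary")
site_b = Database("site_b", "standby")
new_primary, new_standby, steps = rolling_maintenance(site_a, site_b)
for step in steps:
    print(step)
```

In this sketch each site is only ever patched while it holds the standby role, which is the property that keeps the production role available apart from the switchover itself.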

Now, new HA/DR strategies mean IT managers don't have to make that choice. The right technologies configured within an effective high-availability architecture keep data and systems protected against extended outages, while also contributing processing power for day-to-day tasks when crises aren't looming. "Many companies think their disaster recovery infrastructure is an investment that cannot provide other operational benefits," says Sushil Kumar, senior director of database product management at Oracle. "Oracle's philosophy is for customers to derive the most out of their disaster recovery infrastructure, even in times when there is no disaster."

"K" Line America, a global transportation company specializing in ocean transport, found this high-availability balance when it installed Oracle Database 10g and Oracle RAC. Its dual server cluster protects "K" Line's global transportation management system with automatic failover should either server node crash for any reason. "K" Line uses Oracle RAC to automatically balance transaction processing between its two servers. The result is a boost in processing capacity: All resources are used all the time, and users are protected from server failure.

"Oracle RAC really impressed us because it allows us to take advantage of both servers," says Knut LaVine, general manager of application development at "K" Line America. "We saw a dramatic improvement in the performance of the application because we were able to utilize both servers at the same time."

Budget Relief

The right high-availability architecture delivers other economic advantages as well. Because Oracle software can provide the highest level of availability on commodity hardware, such as x86-based servers, high-availability designers aren't forced to buy expensive proprietary platforms, long thought to be essential for reliability. This expensive philosophy dates back to mainframe models and argues that the more you spend on hardware, the fewer breakdowns you'll experience.

Today, enterprises can achieve comparable reliability at a fraction of the mainframe cost using Oracle's high-availability functionality and commodity-priced hardware. "We used to spend a tremendous amount of money buying very expensive proprietary UNIX systems," says Hernan Alvarez, director of engineering operations for Farecast, an online travel-booking site based in Seattle, Washington. "With the advent of clustering software and open source operating systems, that paradigm has shifted. Now it's the software that's really making things happen."

Farecast invested in Oracle RAC, which automatically transfers and rebalances workload from a failed server to surviving servers in a cluster. The ability to deploy a high-availability solution on commodity hardware using Oracle RAC is a cornerstone of Farecast's strategy.

The travel site applies its proprietary algorithms to fare data collected from airlines and third-party sources to predict prices for customers shopping for the best deals. Customers access the site from around the world, which means that any downtime, whether for planned maintenance or resulting from technical problems, would almost always affect some customers during the business day.

To cope, Farecast uses 100 x86 servers with 64-bit processors and large amounts of RAM. These powerful servers came at a relative bargain of only about US$5,000 each. The redundancy available from these Oracle RAC-running econo-models gives Alvarez confidence about his HA capabilities. "If we lose a box, who cares—we're not dependent on any one device in our network," Alvarez says.

Farecast's predictive engine relies on an Oracle data warehouse with more than 5 terabytes of data for storing and analyzing data for its airfare predictions. Before Oracle Database 10g and Oracle RAC, Farecast relied on a MySQL database, a product that Farecast just outgrew, according to Alvarez. "Clustering is what's compelling about the Oracle technology," he says. "We looked at other clustering and database alternatives, including IBM DB2 and Microsoft SQL Server. But we have a very large database, so with partitioning, compression, and clustering on top of that, there really wasn't any other choice. SQL Server just wasn't going to get it done."

Alvarez adds that Oracle RAC's ability to configure multiple low-cost commodity servers and create a highly available and scalable grid that requires no change to application or database structures keeps total costs under control. "I'd say our hardware costs are one-tenth of what they were five years ago," he says.

That cost control helps Farecast align its high-availability needs with its business demands. "We could always roll out a US$10 million solution and get the HA job done, but does that meet our business goals?" he asks. "We're able to stay within budget and get the performance and availability we're looking for [with Oracle], so it's a huge business success."

Snapshots


Comic Relief

www.comicrelief.com
 Location: London
 Industry: Charitable fund-raising
 Employees: 95
 Oracle products: Oracle Database 10g Release 2, Oracle Application Server 10g, Oracle Data Guard, Oracle Real Application Clusters

Fannie Mae

www.fanniemae.com
 Location: Washington DC
 Industry: Financial services
 Employees: 5,500
 Oracle products: Oracle Database 10g, Oracle Real Application Clusters, Oracle Data Guard, Oracle Enterprise Manager

Farecast, Inc

www.farecast.com
 Location: Seattle, Washington
 Industry: Online travel
 Employees: 25
 Oracle products: Oracle Database 10g, Oracle Real Application Clusters, Oracle Enterprise Manager

"K" Line America, Inc.

(A wholly owned subsidiary of Tokyo-based Kawasaki Kisen Kaisha, Ltd.)
www.kline.com
 Location: Richmond, Virginia
 Industry: Transportation
 Employees: 560
 Oracle products: Oracle Database 10g, Oracle Real Application Clusters

Kemira GrowHow (U.K.)

www.kemira-growhow.com
 Location: Chester, U.K.
 Industry: Fertilizer and agricultural products
 Employees: 450 (in the U.K.)
 Oracle products: Oracle E-Business Suite, Oracle Real Application Clusters, Oracle Data Guard

Low-Cost Redundancy

Oracle RAC can grow from an initial two nodes to as many as 100 nodes as more power is needed. Because all servers in an Oracle RAC cluster are active, application performance scales as additional servers are added to the cluster. In addition, the multiple servers in Oracle RAC all have access to all of the data. Cache Fusion coordinates that access so that every server can modify any of the data. This allows work requests to run on any server, instead of being limited to a specific server because of some "partitioning" algorithm required of shared-nothing environments. The combination of these attributes makes it possible to build clusters of low-priced commodity servers that can provide higher availability and better performance than much more costly and proprietary mainframe-based high-availability architectures.
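The difference between partitioned routing and "any server can take the request" can be illustrated with a small toy model. The node names, the byte-sum partitioning function, and the round-robin choice are all assumptions made for this sketch; they are not how Oracle RAC or any particular shared-nothing product assigns work.

```python
SERVERS = ["node1", "node2", "node3"]  # hypothetical cluster members


def owner_of(key, servers=SERVERS):
    """Shared-nothing style: a deterministic function ties each key to one server."""
    return servers[sum(key.encode()) % len(servers)]


def route_partitioned(key, up):
    """The request can only run on the key's owner; if that node is down, it fails."""
    owner = owner_of(key)
    return owner if owner in up else None


def route_any(request_no, up, servers=SERVERS):
    """Shared-data style: any surviving server may take the request (round-robin)."""
    alive = [s for s in servers if s in up]
    return alive[request_no % len(alive)] if alive else None


key = "order-42"
failed = owner_of(key)          # simulate losing the node that owns this key
up = set(SERVERS) - {failed}

partitioned_result = route_partitioned(key, up)  # no server can serve the key
shared_result = route_any(0, up)                 # a surviving node takes the work
print(partitioned_result, shared_result)
```

In the partitioned case the request has nowhere to go until the owning node returns or the data is repartitioned, while in the shared-data case it simply lands on a survivor.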

The Oracle database shared by all servers in a cluster is exposed to storage subsystem failures that can cause data file corruptions on the primary database. Such failures are infrequent, but when they occur they result in unacceptable downtime for mission-critical applications. Oracle Data Guard isolates the standby database from such corruption by continually validating all data before it is applied. Corruptions caused by the primary database storage subsystem, or corruptions introduced by the network during the course of transmitting data to the standby site, are never applied to the standby database. This concept of isolating the standby site from failures that occur on the production database is one of the major benefits provided by Oracle Data Guard.
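The idea of validating data before it is applied to a standby can be sketched with a checksum. This is a minimal illustration under stated assumptions: the record layout, the use of CRC32, and the function names are hypothetical and are not a description of how Oracle Data Guard validates redo internally.

```python
import zlib


def make_record(payload: bytes) -> dict:
    """Sender side: attach a checksum computed over the payload."""
    return {"payload": payload, "checksum": zlib.crc32(payload)}


def apply_if_valid(standby: list, record: dict) -> bool:
    """Receiver side: recompute the checksum and apply only if it matches."""
    if zlib.crc32(record["payload"]) != record["checksum"]:
        return False  # corrupted record is rejected, standby stays clean
    standby.append(record["payload"])
    return True


good = make_record(b"UPDATE donations SET amount = 100")
bad = make_record(b"UPDATE donations SET amount = 250")
# Simulate corruption introduced by storage or the network after checksumming.
bad["payload"] = b"UPDATE donations SET amount = 25\x00"

standby_log = []
results = [apply_if_valid(standby_log, r) for r in (good, bad)]
print(results, len(standby_log))
```

The point of the sketch is the ordering: the check happens before the write, so a damaged record never reaches the standby copy.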

Like Oracle RAC, Oracle Data Guard is implemented on top of commodity hardware. "It requires only a standard network link between the two computers," says Oracle's Kumar. "Oracle Data Guard not only saves customers money, but also provides them with a more efficient and better disaster recovery solution."

These technologies work together to provide a complete solution. "Oracle has developed very powerful availability technologies including Oracle RAC, Oracle Data Guard, Flashback, and RMAN. These technologies are each generally acknowledged to be 'best-of-breed' solutions in their space," says Juan Loaiza, senior vice president for the Systems Technology group at Oracle. "Just as importantly, we have architected these technologies to work seamlessly together to cover all classes of downtime. Without this end-to-end approach, you continue to run the risk of unwanted downtime even if you build resilience into each and every component of your architecture."

Serious Protection

By the time Comic Relief's Red Nose Day ends, about 14 million people will have watched the telethon broadcast on the BBC, and the Comic Relief Web site will have received hits from up to 1 million unique visitors. The organization's financial model depends on quickly depositing donations into interest-bearing accounts. Interest yields are one of the income streams that Comic Relief uses to avoid paying for administrative overhead from the direct public donation pool. "We promise that for every pound a member of the public directly gives us, we will give a pound directly to the projects we support," says Gill. "All time savings and systematic efficiencies offered by digital technology help us meet this promise to the public."

On the key nights, both of Comic Relief's data centers move into action. Oracle RAC provides protection from server failure within the primary site. Unless there's an emergency, the centers combine their processing capacities to keep individual e-commerce transaction processing times to less than two seconds. "Oracle RAC is the cornerstone of our high-availability solution. It gives us a chance to cope with flash crowds, high demand, or even a failure in one of our database servers with little or no degradation of performance to an end user," says Gill.

As an extra precaution, Oracle Data Guard provides disaster recovery failover between data centers by keeping a standby copy of the database at the second data center synchronized with the database at the primary location. "If there's a failure in our primary environment, we can shift sessions to the other site," he says. Comic Relief's standby database also provides a critical layer of additional data protection and high availability.

High Availability Is a Philosophy

Fannie Mae, a financial services company that collaborates with mortgage lenders to ensure that loans are available for home buyers, needs guaranteed system uptime for its financial systems and to meet federal regulations. "Regulatory requirements dictate that all of our critical applications have complete redundancy to meet specific high-availability needs," says Mano Malayanur, manager of technical operations and infrastructure management and infrastructure architect for Fannie Mae's Guarantee Businesses Systems (GBS) group.

The GBS arm of Fannie Mae runs Oracle Database 10g Release 2, a local high-availability cluster, and Oracle Data Guard so that its production database stays running if a breakdown occurs. Oracle Data Guard provides what Malayanur calls "our preferred solution for disaster recovery." Some applications in Fannie Mae also use Oracle RAC for additional high availability and horizontal scaling.

Before Fannie Mae's GBS group made its choice, it conducted an extensive proof-of-concept project that tested how well its database, middle layers, and client infrastructure could handle the projected business demand. The project simulated 1 million transactions per hour, with each transaction including many hits to the Oracle database. But beyond speed, the project also had to prove data reliability—Fannie Mae wanted nothing less than zero when it came to data loss protection.

This approach is fairly typical for other applications in Fannie Mae. The performance tests help Fannie Mae reduce the time required for a failover to occur between servers at a site, and between sites. "When we go into our test lab," says Malayanur, "we have a clear idea of what we want to see: Is the solution compatible with the other layers in our technology stack? How long does it take to fail over from server to server and site to site? Is the failover automated? And then we do tests to verify that the numbers are where we expect them to be."

Oracle Data Guard proved to be a key technology component in the organization's data protection scheme. The pilot project showed Oracle Data Guard could sustain synchronous zero data loss protection and avoid any loss of information for workloads running as high as 16 megabytes per second of redo data. This is very high database throughput, easily surpassing what is seen in most mission-critical applications. Fannie Mae found that Oracle Data Guard could run at even faster rates in asynchronous mode, for applications that didn't have such stringent data loss requirements.
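To put the quoted figures in perspective, the short calculation below converts the redo rate and the simulated transaction rate into other units, and adds a simplified model of why asynchronous transport can run faster than synchronous transport. The latency values (`LOCAL_WRITE_MS`, `ROUND_TRIP_MS`) are hypothetical placeholders, not measurements from Fannie Mae, and the two source figures come from different tests, so they are not combined here.

```python
SECONDS_PER_HOUR = 3600

REDO_MB_PER_SEC = 16       # from the pilot project
TX_PER_HOUR = 1_000_000    # from the proof-of-concept simulation

redo_mb_per_hour = REDO_MB_PER_SEC * SECONDS_PER_HOUR   # 57,600 MB
redo_gb_per_hour = redo_mb_per_hour / 1000              # 57.6 GB (decimal units)
tx_per_second = TX_PER_HOUR / SECONDS_PER_HOUR          # about 277.8

# Simplified commit model with hypothetical latencies.
LOCAL_WRITE_MS = 2.0   # assumed time to write redo locally
ROUND_TRIP_MS = 5.0    # assumed network round trip to the standby site


def commit_latency_ms(synchronous: bool) -> float:
    """Synchronous commits also wait for the standby's acknowledgment."""
    return LOCAL_WRITE_MS + (ROUND_TRIP_MS if synchronous else 0.0)


print(f"{redo_gb_per_hour:.1f} GB of redo per hour")
print(f"{tx_per_second:.1f} transactions per second")
print(commit_latency_ms(True), commit_latency_ms(False))
```

Under this model the synchronous mode pays the network round trip on every commit in exchange for zero data loss, which is the trade-off the paragraph describes.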

Fannie Mae's testing didn't stop once the pilot project was over. The company continues to test and audit the system regularly to make sure that changes to the environment don't reduce its protection. "High availability is not something you build on top of your existing applications," Malayanur says. "It needs to be thinking that pervades your entire process of setting up an IT infrastructure."

Clear ROI

Because they traditionally haven't been designed to contribute to daily business profitability, high-availability technologies have been difficult to evaluate for ROI. The consequences of downtime may be real, but how can companies know if the high-availability choices they're making are financially sound?

Kemira GrowHow (U.K.) grappled with this question as it reevaluated its high-availability service contract in 2003. The fertilizer and agricultural products company in the United Kingdom had been relying on a service-level agreement with an outside firm that promised replacements for any faulty hardware within four hours of crashing and a restoration of data to the most recent daily backup within 24 hours. But because manual processes would take over in the interim, "resynching" of the systems could take much longer.

Kemira grew increasingly uncomfortable with that timetable, concerned that disrupted order and shipment processing would hurt the company long before repairs were made. Adding to the problem, the company was paying about US$37,000 for the service contract.

Next Steps


READ more about high availability
oracle.com/database/high-availability.html

The company made some changes. Thanks to a 2004 rearchitecting of its high-availability resources, Kemira reduced its maintenance costs, eliminated the service contract, and simultaneously boosted its availability so it would see little disruption in the aftermath of a system crash or site failure. Kemira now runs the Oracle E-Business Suite on a two-node cluster in its primary data center. Oracle Data Guard synchronizes the primary database with a standby database at a remote facility. If the production cluster or database fails, Kemira uses Oracle Data Guard to fail over to the remote standby database and promote it to the production role, all without needing to perform any recovery tasks.

As a result of this configuration, Kemira not only eliminated the service-contract fee but also cut its database downtime window from the 24 hours outlined in its service contract to a few minutes. It estimates it can return its applications to full production status in two hours or less in the worst-case scenario of a complete data-center failure. "The investment we made upfront for the new technology was justified by cutting costs from the old contracts," says Dave Allen, IS facilities manager at Kemira GrowHow.
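One way to compare these recovery windows is to express them as yearly availability. The sketch below assumes exactly one outage per year, which is a hypothetical figure chosen only to make the comparison concrete; the 24-hour and 2-hour windows come from the text.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def availability_pct(downtime_hours, outages_per_year=1):
    """Percentage of the year the system is up, given downtime per outage."""
    lost = downtime_hours * outages_per_year
    return 100 * (1 - lost / HOURS_PER_YEAR)


old_contract = availability_pct(24)   # restoration within 24 hours
worst_case_now = availability_pct(2)  # full production within two hours

print(f"old contract:   {old_contract:.3f}%")
print(f"worst case now: {worst_case_now:.3f}%")
```

With one outage a year, the old 24-hour window corresponds to roughly 99.726 percent availability, while the two-hour worst case corresponds to roughly 99.977 percent.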

Kemira chose Oracle technology because of its proven track record. "We wouldn't trust our business processes to other offerings. Oracle products have scaled from large enterprise-class systems down to smaller business servers, not up from PCs," Allen says. "The Oracle database is bombproof."

 

Comprehensive High Availability


Enterprises need a wide range of tools to develop an effective HA architecture. These tools from Oracle can help ensure success.

Oracle Real Application Clusters (Oracle RAC)

Oracle RAC lets enterprises set up clusters of multiple servers to provide processing power to applications that access a single database. No application or database changes are required. If a server fails, the surviving servers in the cluster automatically take over processing chores. Oracle RAC provides scalability; additional servers can be introduced into the cluster in a nondisruptive fashion to help with increasing workloads. Oracle RAC is a database option for Oracle Database 10g Enterprise Edition and is included with Oracle Database 10g Standard Edition.

Oracle Data Guard

Oracle Data Guard enables IT managers to automatically maintain a synchronized, standby copy of a production database in another location that can immediately be elevated to primary status should the production database fail.

Oracle Enterprise Manager

This systemwide administration tool gives IT managers a Web-based interface for monitoring database performance, allocating resources, and managing Oracle RAC installations and Oracle Data Guard standby databases.

Automatic Storage Management (ASM)

This integrated file system and volume manager eliminates the need for third-party storage management software and simplifies storage management for Oracle databases. A feature of Oracle Database 10g, ASM makes it easy to optimize performance of storage subsystems. ASM's built-in mirroring provides high availability by protecting against local disk failure.

Flash Recovery Area

This area consists of a unified disk-based storage location for all recovery-related files, even daily backups, for an Oracle database. Automatically managed by Oracle Database, this resource cuts recovery time by eliminating downtime associated with restoring archived files from tape media.

Recovery Manager (RMAN)

RMAN automates and manages backup and recovery processes for Oracle databases, including automatically backing up and recovering data to the flash recovery area. RMAN is integrated with the core database engine, allowing it to do its operations most efficiently.

Flashback

Flashback helps companies recover from human error or logical corruption that results in damaged data, and it does so at a level of granularity appropriate for the corrective action required. Flashback Query can roll back changes from individual queries; Flashback Table can roll back data from a single table to its last correct version. For widespread data corruption, Flashback Database lets users roll the entire database to a desired point in time, many times faster than traditional point-in-time recovery methods.

LogMiner

LogMiner allows redo log files to be read, analyzed, and interpreted using SQL. IT managers can use the analyses to audit changes to data or recover deleted data.

Oracle Secure Backup

This secure network tape backup software for Oracle databases and file systems lets administrators perform direct tape backups without third-party software. It provides an integrated, easy-to-use backup solution that encrypts data to tape to safeguard against the misuse of sensitive data if backup tapes are lost or stolen.

Maximum Availability Architecture (MAA)

MAA provides an Oracle-validated blueprint of technology and architectural best practices to achieve highest availability.

 


Alan Joch (ajoch@worldpath.net) is a technology writer based in New England who specializes in enterprise, Web, and high-performance-computing applications.

