Always Available
By Alan Joch
Oracle high-availability technologies drive business 24/7.
Don't be fooled by appearances—Red Nose Day is no laughing matter. Comic Relief's biennial fund-raiser invites people throughout the United Kingdom to don red noses as they get sponsored to do all sorts of silly things and donate money to support antipoverty programs locally and in Africa. The event culminates with a five-hour telethon, during which comedians such as John Cleese and Ricky Gervais might show up to push for a final round of telephone or Web donations. Donors apparently laugh all the way to their wallets: A recent campaign netted more than US$123.6 million.
But peek behind the scenes at those make-or-break telethons and you'll find a lot of serious IT staff who could use a laugh themselves. Because so much of the fund-raising success hinges on efficiently processing more than 250,000 electronic transactions in a short time frame, Comic Relief's data center is fine-tuned for high availability (HA). "If the system isn't working, then obviously we can't collect contributions," says Martin Gill, head of new media for Comic Relief. "We can't just go back to people a couple of days later and say, 'Remember when you were feeling generous and you wanted to give us £100? Please try again now, because we fixed the glitch that prevented you doing it the other night.'" Given the narrow window in which the charitable organization needs to process pledges, any glitch can cost the fund-raising effort, so systems need to be up, available, and fast.
To avoid any awkward second acts, Comic Relief's new-media managers rely on a dual-data-center architecture built with Oracle Database 10g, Oracle Real Application Clusters (Oracle RAC), and Oracle Data Guard. Together these technologies ensure that the charitable fun doesn't grind to a halt if a hard drive, server, network switch, or entire site crashes.
This high-availability approach came in handy during a recent Red Nose Day when processing times were rising above targeted levels. The flexible high-availability environment enabled Comic Relief to switch all the transaction processing to one data center while using the second one for diagnostics, ultimately solving the problem with a quick upgrade before rejoining the two centers. "We had no loss of service to our donors," Gill says. "Only some sweaty engineers who worked hard and brilliantly to successfully manage the situation."
High availability and disaster recovery (DR) have long been like life insurance: IT managers know they need them, but their hard-to-quantify return on investment (ROI) poses a challenge. Budget watchers struggle with how much is enough to spend for server and storage resources that might remain idle most of the time.
"K" Line America, a global transportation company specializing in ocean transport, found this high-availability balance when it installed Oracle Database 10g and Oracle RAC. Its dual server cluster protects "K" Line's global transportation management system with automatic failover should either server node crash for any reason. "K" Line uses Oracle RAC to automatically balance transaction processing between its two servers. The result is a boost in processing capacity: All resources are used all the time, and users are protected from server failure.
"Oracle RAC really impressed us because it allows us to take advantage of both servers," says Knut LaVine, general manager of application development at "K" Line America. "We saw a dramatic improvement in the performance of the application because we were able to utilize both servers at the same time."
The right high-availability architecture delivers other economic advantages as well. Because Oracle software can provide the highest level of availability on commodity hardware, such as x86-based servers, high-availability designers aren't forced to buy expensive proprietary platforms, long thought to be essential for reliability. This expensive philosophy dates back to mainframe models and argues that the more you spend on hardware, the fewer breakdowns you'll experience.
Today, enterprises can achieve comparable reliability at a fraction of the mainframe cost using Oracle's high-availability functionality and commodity-priced hardware. "We used to spend a tremendous amount of money buying very expensive proprietary UNIX systems," says Hernan Alvarez, director of engineering operations for Farecast, an online travel-booking site based in Seattle, Washington. "With the advent of clustering software and open source operating systems, that paradigm has shifted. Now it's the software that's really making things happen."
Farecast invested in Oracle RAC, which automatically transfers and rebalances workload from a failed server to surviving servers in a cluster. The ability to deploy a high-availability solution on commodity hardware using Oracle RAC is a cornerstone of Farecast's strategy.
The travel site applies its proprietary algorithms to fare data collected from airlines and third-party sources to predict prices for customers shopping for the best deals. Customers access the site from around the world, which means that any downtime, whether for planned maintenance or resulting from technical problems, would almost always affect some customers during the business day.
To cope, Farecast uses 100 x86 servers with 64-bit processors and large amounts of RAM. These powerful servers came at a relative bargain of only about US$5,000 each. The redundancy available from these Oracle RAC-running econo-models gives Alvarez confidence about his HA capabilities. "If we lose a box, who cares—we're not dependent on any one device in our network," Alvarez says.
Farecast's predictive engine relies on an Oracle data warehouse that stores and analyzes more than 5 terabytes of data for its airfare predictions. Before Oracle Database 10g and Oracle RAC, Farecast relied on a MySQL database, a product that Farecast just outgrew, according to Alvarez. "Clustering is what's compelling about the Oracle technology," he says. "We looked at other clustering and database alternatives, including IBM DB2 and Microsoft SQL Server. But we have a very large database, so with partitioning, compression, and clustering on top of that, there really wasn't any other choice. SQL Server just wasn't going to get it done."
Alvarez adds that Oracle RAC's ability to configure multiple low-cost commodity servers and create a highly available and scalable grid that requires no change to application or database structures keeps total costs under control. "I'd say our hardware costs are one-tenth of what they were five years ago," he says.
That helps Farecast align its high-availability needs with its business demands. "We could always roll out a US$10 million solution and get the HA job done, but does that meet our business goals?" he asks. "We're able to stay within budget and get the performance and availability we're looking for [with Oracle], so it's a huge business success."
Oracle RAC can grow from an initial two nodes to as many as 100 nodes as more power is needed. Because all servers in an Oracle RAC cluster are active, application performance scales as additional servers are added to the cluster. In addition, the multiple servers in Oracle RAC all have access to all of the data. Cache Fusion coordinates access so that all servers can modify any of the data. This allows work requests to run on any server, instead of being limited to a specific server because of some "partitioning" algorithm required of shared-nothing environments. The combination of these attributes makes it possible to build clusters of low-priced commodity servers that can provide higher availability and better performance than much more costly and proprietary mainframe-based high-availability architectures.
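To make the idea of an all-active cluster concrete, here is a minimal sketch in Python. It is a toy model, not Oracle RAC's implementation: the `ToyCluster` class, the node names, and the round-robin routing are all assumptions made for illustration. It shows only the two behaviors described above, namely that any request can land on any live node and that a failed node's sessions move to the survivors.

```python
from collections import Counter


class ToyCluster:
    """Toy active-active cluster: any session may run on any live node."""

    def __init__(self, node_names):
        self.live = list(node_names)   # nodes currently accepting work
        self.sessions = {}             # session id -> node name
        self._next = 0                 # round-robin cursor (hypothetical policy)

    def _pick_node(self):
        node = self.live[self._next % len(self.live)]
        self._next += 1
        return node

    def connect(self, session_id):
        """Place a new session on the next live node."""
        self.sessions[session_id] = self._pick_node()
        return self.sessions[session_id]

    def fail_node(self, name):
        """Remove a node and move its sessions to the surviving nodes."""
        self.live.remove(name)
        if not self.live:
            raise RuntimeError("no surviving nodes")
        for session_id, node in self.sessions.items():
            if node == name:
                self.sessions[session_id] = self._pick_node()

    def load(self):
        """Return the number of sessions on each node that has any."""
        return Counter(self.sessions.values())


cluster = ToyCluster(["node1", "node2", "node3"])
for i in range(9):
    cluster.connect(f"session-{i}")

before = cluster.load()      # three sessions on each of the three nodes
cluster.fail_node("node2")
after = cluster.load()       # all nine sessions now on node1 and node3
print(before, after)
```

In this sketch no session is lost when `node2` fails; the remaining two nodes simply carry more of the load. A real cluster also has to coordinate access to shared data, which this model leaves out entirely.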
The Oracle database shared by all servers in a cluster is exposed to storage subsystem failures that can cause data file corruptions on the primary database. Such failures are infrequent, but when they occur they result in unacceptable downtime for mission-critical applications. Oracle Data Guard isolates the standby database from such corruption by continually validating all data before it is applied. Corruptions caused by the primary database storage subsystem, or corruptions introduced by the network during the course of transmitting data to the standby site, are never applied to the standby database. This concept of isolating the standby site from failures that occur on the production database is one of the major benefits provided by Oracle Data Guard.
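The principle of validating data before it is applied can be sketched in a few lines of Python. The article does not describe how Oracle Data Guard performs its checks, so everything here is an assumption for illustration: the `RedoRecord` and `ToyStandby` names are hypothetical, and a CRC-32 checksum stands in for whatever validation the product actually uses.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RedoRecord:
    sequence: int
    payload: bytes
    checksum: int  # CRC-32 computed on the primary when the change is generated


def make_record(sequence, payload):
    """Build a record whose checksum matches its payload."""
    return RedoRecord(sequence, payload, zlib.crc32(payload))


class ToyStandby:
    """Standby that refuses to apply any record that fails validation."""

    def __init__(self):
        self.applied = []    # payloads accepted, in arrival order
        self.rejected = []   # sequence numbers that failed validation

    def receive(self, record):
        """Apply a record only if its payload still matches its checksum."""
        if zlib.crc32(record.payload) != record.checksum:
            self.rejected.append(record.sequence)
            return False
        self.applied.append(record.payload)
        return True


standby = ToyStandby()
good = make_record(1, b"UPDATE pledges SET amount = 100 WHERE id = 7")
standby.receive(good)

# Simulate damage on disk or in transit: one byte of the payload has changed,
# but the checksum still describes the original bytes.
damaged = RedoRecord(2, b"UPDATE pledges SET amount = 900 WHERE id = 7", good.checksum)
standby.receive(damaged)

print(standby.applied, standby.rejected)
```

The damaged record never reaches `applied`, so the toy standby keeps a clean copy even though the primary's copy of that change is corrupt.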
Like Oracle RAC, Oracle Data Guard is implemented on top of commodity hardware. "It requires only a standard network link between the two computers," says Oracle's Kumar. "Oracle Data Guard not only saves customers money, but also provides them with a more efficient and better disaster recovery solution."
These technologies work together to provide a complete solution. "Oracle has developed very powerful availability technologies including Oracle RAC, Oracle Data Guard, Flashback, and RMAN. These technologies are each generally acknowledged to be 'best-of-breed' solutions in their space," says Juan Loaiza, senior vice president for the Systems Technology group at Oracle. "Just as importantly, we have architected these technologies to work seamlessly together to cover all classes of downtime. Without this end-to-end approach, you continue to run the risk of unwanted downtime even if you build resilience into each and every component of your architecture."
By the time Comic Relief's Red Nose Day ends, about 14 million people will have watched the telethon broadcast on the BBC, and the Comic Relief Web site will have received hits from up to 1 million unique visitors. The organization's financial model depends on quickly depositing donations into interest-bearing accounts. Interest yields are one of the income streams that Comic Relief uses to avoid paying for administrative overhead from the direct public donation pool. "We promise that for every pound a member of the public directly gives us, we will give a pound directly to the projects we support," says Gill. "All time savings and systematic efficiencies offered by digital technology help us meet this promise to the public."
On the key nights, both of Comic Relief's data centers move into action. Oracle RAC provides protection from server failure within the primary site. Unless there's an emergency, the centers combine their processing capacities to keep individual e-commerce transaction processing times to less than two seconds. "Oracle RAC is the cornerstone of our high-availability solution. It gives us a chance to cope with flash crowds, high demand, or even a failure in one of our database servers with little or no degradation of performance to an end user," says Gill.
As an extra precaution, Oracle Data Guard provides disaster recovery failover between data centers by keeping a standby copy of the database at the second data center synchronized with the database at the primary location. "If there's a failure in our primary environment, we can shift sessions to the other site," he says. Comic Relief's standby database also provides a critical layer of additional data protection and high availability.
High Availability Is a Philosophy
Fannie Mae, a financial services company that collaborates with mortgage lenders to ensure that loans are available for home buyers, needs guaranteed system uptime for its financial systems and to meet federal regulations. "Regulatory requirements dictate that all of our critical applications have complete redundancy to meet specific high-availability needs," says Mano Malayanur, manager of technical operations and infrastructure management and infrastructure architect for Fannie Mae's Guarantee Businesses Systems (GBS) group.
The GBS arm of Fannie Mae runs Oracle Database 10g Release 2, a local high-availability cluster, and Oracle Data Guard so that its production database stays running if a breakdown occurs. Oracle Data Guard provides what Malayanur calls "our preferred solution for disaster recovery." Some applications at Fannie Mae also use Oracle RAC for additional high availability and horizontal scaling.
Before Fannie Mae's GBS group made its choice, it conducted an extensive proof-of-concept project that tested how well its database, middle layers, and client infrastructure could handle the projected business demand. The project simulated 1 million transactions per hour, with each transaction including many hits to the Oracle database. But beyond speed, the project also had to prove data reliability: Fannie Mae would accept nothing less than zero data loss.
This approach is fairly typical for other applications in Fannie Mae. The performance tests help Fannie Mae reduce the time required for a failover to occur between servers at a site, and between sites. "When we go into our test lab," says Malayanur, "we have a clear idea of what we want to see: Is the solution compatible with the other layers in our technology stack? How long does it take to fail over from server to server and site to site? Is the failover automated? And then we do tests to verify that the numbers are where we expect them to be."
Oracle Data Guard proved to be a key technology component in the organization's data protection scheme. The pilot project showed Oracle Data Guard could sustain synchronous zero data loss protection and avoid any loss of information for workloads running as high as 16 megabytes per second of redo data. This is very high database throughput, easily surpassing what is seen in most mission-critical applications. Fannie Mae found that Oracle Data Guard could run at even faster rates in asynchronous mode, for applications that didn't have such stringent data loss requirements.
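The difference between the two modes comes down to when a commit is acknowledged. The Python sketch below is a toy model under stated assumptions, not Data Guard's transport mechanism: the `ToyPrimary` class and its method names are hypothetical. In synchronous mode the change reaches the standby before the application is told the commit succeeded; in asynchronous mode the acknowledgment comes first and shipping happens later, which avoids waiting on the network but leaves a window in which committed changes exist only on the primary.

```python
class ToyPrimary:
    """Primary that ships changes to a standby synchronously or asynchronously."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.committed = []   # changes acknowledged to the application
        self.standby = []     # changes the standby has received
        self.unshipped = []   # async only: acknowledged but not yet shipped

    def commit(self, change):
        if self.synchronous:
            # Acknowledge only after the standby holds the change.
            self.standby.append(change)
        else:
            # Acknowledge immediately; ship later in the background.
            self.unshipped.append(change)
        self.committed.append(change)

    def ship_backlog(self):
        """Background transport catches up (nothing to do in synchronous mode)."""
        self.standby.extend(self.unshipped)
        self.unshipped.clear()

    def lost_if_primary_fails_now(self):
        """Committed changes the standby would be missing after a crash."""
        return [c for c in self.committed if c not in self.standby]


sync = ToyPrimary(synchronous=True)
async_ = ToyPrimary(synchronous=False)
for primary in (sync, async_):
    for change in ("pledge-1", "pledge-2", "pledge-3"):
        primary.commit(change)

sync_loss = sync.lost_if_primary_fails_now()     # nothing at risk
async_loss = async_.lost_if_primary_fails_now()  # all three still unshipped
async_.ship_backlog()
async_loss_after_catchup = async_.lost_if_primary_fails_now()
print(sync_loss, async_loss, async_loss_after_catchup)
```

The exposure in asynchronous mode lasts only until the backlog is shipped, which is why it can suit applications with less stringent data loss requirements.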
Fannie Mae's testing didn't stop once the pilot project was over. The company continues to test and audit the system regularly to make sure that changes to the environment don't reduce its protection. "High availability is not something you build on top of your existing applications," Malayanur says. "It needs to be thinking that pervades your entire process of setting up an IT infrastructure."
Because they traditionally haven't been designed to contribute to daily business profitability, high-availability technologies have been difficult to evaluate for ROI. The consequences of downtime may be real, but how can companies know if the high-availability choices they're making are financially sound?
Kemira GrowHow (U.K.) grappled with this question as it reevaluated its high-availability service contract in 2003. The fertilizer and agricultural products company in the United Kingdom had been relying on a service-level agreement with an outside firm that promised replacements for any faulty hardware within four hours of crashing and a restoration of data to the most recent daily backup within 24 hours. But because manual processes would take over in the interim, "resynching" of the systems could take much longer.
Kemira grew increasingly uncomfortable with that timetable, concerned that disrupted order and shipment processing would hurt the company long before repairs were made. Adding to the problem, the company was paying about US$37,000 for the service contract.
The company made some changes. Thanks to a 2004 rearchitecting of its high-availability resources, Kemira reduced its maintenance costs, eliminated the service contract, and simultaneously boosted its availability so it would see little disruption in the aftermath of a system crash or site failure. Kemira now runs the Oracle E-Business Suite on a two-node cluster in its primary data center. Oracle Data Guard synchronizes the primary database with a secondary instance at a remote facility. If the production cluster or database fails, Kemira uses Oracle Data Guard to fail over to the remote standby database and promote it to the production role, all without needing to perform any recovery tasks.
As a result of this configuration, Kemira not only eliminated the service-contract fee but also cut its database downtime window from the 24 hours outlined in its service contract to a few minutes. It estimates it can return its applications to full production status in two hours or less in the worst-case scenario of a complete data-center failure. "The investment we made upfront for the new technology was justified by cutting costs from the old contracts," says Dave Allen, IS facilities manager at Kemira GrowHow.
Kemira chose Oracle technology because of its proven track record. "We wouldn't trust our business processes to other offerings. Oracle products have scaled from large enterprise-class systems down to smaller business servers, not up from PCs," Allen says. "The Oracle database is bombproof."
Alan Joch is a technology writer based in New England who specializes in enterprise, Web, and high-performance-computing applications.