Cover Feature
Always Available
By David A. Kelly
New and existing technologies from Oracle help keep your critical systems available.
High availability, or the ability of a business and that business's customers to use an application or a service at the appropriate time with the level of functionality they expect, is no longer just a key requirement of today's business systems; it's increasingly becoming the key requirement. Customer and user expectations are significantly higher than even just a few years ago, and there's even more pressure on businesses to ensure continuous availability of core systems for internal users and external customers.
"More and more organizations' business processes are extremely dependent on their applications and infrastructure, so if there are any outages in the environment, it's very costly from a business perspective," says Donna Scott, vice president and distinguished analyst at Gartner, Inc., in Stamford, Connecticut. "It takes a very well-designed and well-managed environment to mitigate the risks of downtime and prevent it from happening or, if you can't prevent it from happening, to enable as quick a recovery as possible."
In its simplest form, a high-availability plan encompasses three aspects: resiliency (making sure applications and systems are as reliable as possible), recoverability (ensuring that if a component does fail, there's a way to recover within a given time period), and continuous operation (ensuring that systems or applications are available even during maintenance activities).
Whereas disaster recovery is concerned with getting core business (and IT) systems up and running again after an unplanned outage, high availability involves proactively ensuring that there's no single point of failure for essential systems (be they applications, databases, networks, storage, or any other IT component). "High availability involves both planned and unplanned outages," says Charles Garry, senior program director of infrastructure services at Meta Group, Inc. "Planned outages occur for things like application upgrades, hardware upgrades, patches, and basic maintenance. Unplanned outages can include user error, operator error, and actual hardware failure."
In fact, although most people probably think about a server crashing, a disk controller dying, hardware or software glitches, or catastrophic disasters when the topic of high availability comes up, there's another side to the story. "Forty percent of downtime, on average, is caused by operations errors," says Scott. "Oracle, like the rest of the industry, focuses on trying to reduce manageability requirements for software products. This not only reduces the amount of labor needed to manage them but also reduces the chance for errors from touching a product."
But high availability isn't just about databases, application servers, networks, backups, or new grid technologies; it's also about users and their expectations: giving users what they want, when they expect it, at the level of performance they expect. Advances in a wide variety of technologies, including products such as Oracle Real Application Clusters (RAC) and Oracle Data Guard as well as new functionality such as the Flashback capabilities in Oracle Database 10g, not only make it easier to meet these user expectations but also significantly reduce the manageability and development requirements associated with highly available applications.
"We're not trying to build a system in which no piece ever fails; we're trying to build a system in which we can tolerate the failures of individual pieces and have the system as a whole keep going," says Juan Loaiza, Oracle vice president of Data and Systems Technologies. "We've integrated several technologies into our stack that basically allow organizations to assemble a lot of low-cost servers, storage, and network technologies to create a highly available system that is also highly scalable. It's unbreakable and inexpensive."
Oracle: Integrating High Availability
Traditionally, high-availability solutions were implemented by organizations that had mission-critical applications that had to be up and running no matter what. They often required specialized hardware and software, as well as expensive consulting services, for correct design and implementation. One result of this approach was that most organizations felt that high-availability solutions were beyond their financial abilities or resources and simply didn't implement them. The other result of this approach was that many organizations have ended up backing into high-availability solutions because of some catastrophic event that highlighted their vulnerability.
"High-availability spending is often driven by a crisis situation," says Gartner's Scott. "For example, a company's e-mail became corrupted, so its systems went down for three days and the CEO became involved in making changes. That's the sort of thing that often happens."
It doesn't have to be that way. Recent advances in everything from hardware to software to operations management are making it easier to increase availability without having to sell the farm or hire a team of consultants. For example, many hardware components are now coming with redundancy built in. And although organizations still need to design high availability into specific applications, more infrastructure components are coming with high-availability capabilities already integrated, making it easier and less expensive for organizations to create highly available solutions. "Oracle has been very strong in its focus on designing for availability," says Scott. "For example, RAC provides a very highly available environment, because you can recover in less than a minute if there's a failure. That's a very, very strong value proposition from an availability perspective."
But implementing Oracle RAC isn't the only way to increase availability. In addition to the steps outlined in the sidebar section "Steps to Higher Availability," you can use a wide variety of Oracle products and capabilities (some of which you probably already own) to decrease potential downtime. Keep in mind that many of these products may have multiple effects on enterprise capabilities; for example, implementing Oracle RAC not only increases availability but also increases scalability. Consider the following products and capabilities, which can all have a positive effect on availability when implemented:
- Oracle Real Application Clusters (RAC). Oracle RAC provides clustering capabilities, so that if one node in a cluster goes down, the users can transparently fail over to another node. For example, if a user is querying the database and trying to retrieve thousands of rows and one node in a cluster fails, the system takes care of migrating the user and the query to another node that's available and continues the operation.
- Oracle Data Guard. Data Guard provides the ability to create, maintain, manage, and monitor standby databases as transactionally consistent copies of the production database. Oracle9i Release 2 provides additional capabilities such as Data Guard SQL Apply (logical standby database), which allows additional objects such as indexes and materialized views to be created in the standby database, enabling it to be used for reporting while protecting your data from possible disasters.
- Maximum Availability Architecture (MAA). Oracle's MAA provides a practical framework based on best practices for implementing high availability across key Oracle solutions, including Oracle RAC and Data Guard.
- Oracle Enterprise Manager (OEM). In addition to providing complete enterprise management capabilities for Oracle products, OEM helps reduce downtime by proactively monitoring systems and configurations; its automation capabilities greatly reduce the potential for downtime due to operator error.
- Backup and Recovery (Recovery Manager). Oracle provides a variety of backup and recovery capabilities, including Recovery Manager (RMAN) and the new Flashback capabilities in Oracle Database 10g, which help reduce operator errors associated with backing up and recovering files and significantly decrease the time required to restore a database to a given point in time.
- Automatic Storage Management (ASM). New in Oracle Database 10g, ASM automatically manages the underlying storage systems, eliminating the time-consuming (and potentially error-prone) need for DBAs to manage files and drives individually. Like Oracle RAC's clustering capabilities, ASM enables organizations to build highly available clusters of inexpensive storage devices that are highly resilient.
- Grid capabilities. Oracle's core products can all be used in an enterprise grid environment, offering a new level of resource and data availability. In an enterprise grid, resources and data can be dynamically provisioned to meet computing demands, keeping systems functioning optimally.
These features and products (as well as many others) contribute to increasing the availability of an application, but you should keep the overall picture in mind. "High availability is not a specific product," says Murali Vallath, an independent Oracle consultant with Summersky Enterprises, based in Charlotte, North Carolina, and the president of the Oracle Real Application Cluster Special Interest Group (SIG). "It comes from the entire enterprise architecture, combined with the features of the infrastructure applications that run on it."
Luckily, with Oracle9i and Oracle Database 10g, Oracle has provided a solid set of infrastructure components that integrate high-availability capabilities into the heart of the enterprise IT system.
Resiliency and Recoverability in Practice
It is exactly these types of advances in software and infrastructure support for high availability that have enabled Steve Sorem, CTO and principal of Payment Technologies, Inc., in Mechanicsburg, Pennsylvania, to be confident that his company can process hundreds of thousands of payment transactions in real time for companies such as Chase Merchant Services, Liberty Mutual Insurance, First Data Corporation, and even Exxon Mobil Corporation's SpeedPass Network of gas station payment tags. "I think we've reached a level of availability that will be hard for us to exceed," says Sorem.
Broadly speaking, Payment Technologies (PayTec) provides hosted transaction management services to financial processors and their large retail customers. In effect, it combines a customer relationship management (CRM) product with a payment processing product that handles individual payment transactions and associates those purchases with multiple other accounts, for example, tracking a consumer's accumulated points in a loyalty program or in one that provides discounts or rewards.
A key attribute of the system, named Valexia, is that it has to operate in real time and be available 24/7. To accomplish this, PayTec is using Oracle RAC to enable both scalability and availability. "We've been very satisfied with Oracle9i RAC, because it's really helped us from a high-availability perspective," says Don Smith, vice president of information services, Payment Technologies. "We had been using Oracle8 Parallel Server and noticed nice improvements in speed, performance, and administrative costs when we moved to Oracle9i RAC."
PayTec uses a three-tier architecture: a front adapter tier, where the point-of-sale and other transactions come in; a second application tier, comprising Pure Java J2EE applications running on JBoss servers; and the third, database tier, running Oracle9i RAC (9.2.0.4) on two Sun V880s. Although PayTec knew that RAC would theoretically fail over from one database node to another when there was a problem, it was gratified to learn that it actually worked when put to the test. "This stuff is real. You'll have one node lock up or have a hardware or network hit, and Oracle Database will automatically fail over to the second node," says Sorem. "We had several server-related failures early on, and Oracle RAC really saved us in those cases."
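The failover behavior Sorem describes can be pictured with a small simulation. The sketch below is illustrative only: it is not Oracle RAC's mechanism or any Oracle API, and the `Node`, `NodeDown`, and `run_with_failover` names are hypothetical. It simply shows the general idea of retrying a request on a surviving node when the first one is unavailable.

```python
class NodeDown(Exception):
    """Raised when a (simulated) cluster node cannot serve a request."""


class Node:
    """Hypothetical stand-in for one database node in a cluster."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def run(self, query):
        if not self.up:
            raise NodeDown(self.name)
        return f"{self.name} ran: {query}"


def run_with_failover(nodes, query):
    """Try each node in order and return the first successful result."""
    failed = []
    for node in nodes:
        try:
            return node.run(query)
        except NodeDown:
            failed.append(node.name)  # remember the failure and try the next node
    raise RuntimeError(f"all nodes unavailable: {failed}")


# Two-node cluster in which the first node has locked up.
cluster = [Node("node1", up=False), Node("node2")]
result = run_with_failover(cluster, "SELECT 1")
print(result)
```

In this toy version the caller never sees the failure of the first node; it only sees an error if every node is down, which is the property a two-node cluster is meant to provide.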
Not only has the company's Oracle RAC implementation saved the applications from going down but it's also made maintenance significantly easier. With RAC, PayTec can shift all processes over to the other node and then shut the first server down and do maintenance. "It's beautiful at the database tier to be able to cycle one of those database nodes (in a nonpeak period, of course) and shift all the transactional activity to the other node," says Smith. "We leverage some of the Oracle Resource Manager components to lower the priorities on our back-office processes when we're doing that." After maintenance, PayTec cycles one node and brings it back up, and the application recognizes that the node has reappeared and starts migrating appropriate transactional activity to the right node.
In addition to using Oracle RAC to ensure availability, PayTec takes advantage of other Oracle features, Data Guard and Recovery Manager, to provide recoverability in the case of a major, unplanned outage at its main facility.
"We're a hosted platform and have some pretty stringent SLAs [service-level agreements] with some of our customers, with some as high as 99.99 percent uptime," says Smith. "So in the event of a disaster, to keep our uptime, we have implemented an Oracle Data Guard standby database in Maximum Performance mode." The standby database is located at a disaster recovery site 130 miles away, connected via SONET ring with two 10-Mbps interconnects.
"We use Oracle9i time-based log switching, so every 10 minutes (at the most), we are automatically switching a log and pushing it up to the disaster recovery site. Thus, our worst-case scenario is a potential 10 minutes of transactional loss," adds Sorem. At peak periods, PayTec's Valexia is operating at more than 6.5 million I/O operations per hour, with a throughput of 500,000 transactions per hour. During these times, the system creates a 500MB log every three minutes that has to be moved to a disaster recovery facility in a round-robin fashion over the two network links.
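The exposure Sorem describes can be put into rough numbers. The sketch below assumes, purely for illustration, that transactions arrive at a uniform rate and that every transaction since the last shipped log would be lost; the function name is hypothetical. It uses only figures quoted above: 500,000 transactions per hour at peak, a 10-minute maximum switch interval, and a log filling roughly every three minutes at peak.

```python
def transactions_at_risk(tx_per_hour, minutes_since_last_switch):
    """Transactions not yet shipped if the primary fails this long after a log switch."""
    return tx_per_hour * minutes_since_last_switch / 60


PEAK_TX_PER_HOUR = 500_000

# Upper bound: the full 10-minute switch interval elapses at the peak rate.
ceiling = transactions_at_risk(PEAK_TX_PER_HOUR, 10)

# At peak the log fills about every 3 minutes, so switches happen sooner.
at_peak = transactions_at_risk(PEAK_TX_PER_HOUR, 3)

print(f"10-minute ceiling at peak rate: about {ceiling:,.0f} transactions")
print(f"3-minute window at peak: about {at_peak:,.0f} transactions")
```

Under these assumptions the time-based ceiling matters most during quiet periods, when logs fill slowly; at peak, the logs fill and switch well before the 10-minute limit is reached.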
Although Valexia was originally built on Oracle8i, Smith is thrilled with some of the new features in Oracle9i that make it easier for PayTec to ensure availability. For example, Oracle networking's integration into Oracle9i meant that the company didn't have to write its own scripts to push the logs to the disaster recovery site. Instead, Data Guard in Oracle9i has a completely automated redo-data-transport and apply mechanism that takes care of keeping the disaster recovery site up to date.
"It's very clean and stable and works really well," Sorem says. "These automatic features greatly reduce the administrative costs, time, and resources required as well as dependency on third-party products." In fact, with just one primary DBA, PayTec is able to manage 16 Oracle instances, including its production environment, disaster recovery environment, and others.
24/7 Availability
Although ensuring that a payment or a transaction goes through correctly is a business-critical process for an organization such as Payment Technologies, Austrian Railways (ÖBB) is more concerned that its 10,000 kilometers of track, 16,000 switches, 240 tunnels, 6,000 bridges, and 6,700 road crossings are safe for railway passengers. Historically, ÖBB had left responsibility for its rail networks to regional authorities, with the information stored in local databases on paper records. As a result, accurately measuring the quality of the track and planning strategic maintenance and upgrades were difficult and expensive.
However, since 1996 ÖBB has radically redesigned and centralized the management of its track network to provide new competitive services and ensure the physical reliability of the track. Core to this initiative has been ÖBB's implementation of Oracle RAC running on a six-node HP AlphaServer Tru64-Cluster housed in two locations separated by 1.5 km and connected via a fiber connection.
"High availability has become more important for us since more and more business-critical processes are supported by IT solutions," says Friedrich Brimmer, CIO, Austrian Railways Fahrweg, the ÖBB's infrastructure department. "Our integrated database applications have more than a thousand users accessing more than 10TB of data to handle and manage maintenance information 24 hours a day."
For example, ÖBB has railroad cars that continuously circle the entire rail network, measuring track tolerances every 25 centimeters and transmitting the measured data via wireless LAN to the central database. Applications, including ones using a geographic information system (GIS) and the Oracle Spatial option, can then extract and analyze data to calculate variations in the track conditions by running on multiple CPUs in the clustered environment, resulting in detailed information on reliability deficiencies that can be sent to maintenance crews within 24 hours.
"In terms of deploying high-availability solutions, we've learned that it's important to plan carefully and check the resiliency and cooperation of all hardware and software components under exceptional conditions and severe stress tests," says Brimmer. "If you're going to build a cluster system, it's important to do so on a stable operating system and a stable database, such as Oracle."
Although collecting and analyzing railway data by using Oracle RAC is an "always on" application for ÖBB that's critical to the safety and security of its customers, it's also helping the bottom line. "We believe that by combining technologies such as Oracle RAC with process efficiencies, we can increase productivity by more than 70 percent over five years," Brimmer adds.
Reducing Human Error
"High availability for us usually starts with the twin data center concept, where we deliver an application running on two systems, located in different data centers that are interconnected by something like a fiber connection," says Erik Snel, manager of Oracle Run Services at Atos Origin, a global IT services company with 50,000 employees located in Hoofddorp, The Netherlands. "If one of the two fails, the other can take over."
That's why Atos Origin, which manages more than 1,800 Oracle databases in Holland alone, uses Oracle9i Data Guard to enable that failover between the locations. With that many customers, ensuring availability is important. "For example, with our warehouse management systems, there are a lot of people standing around and not doing any work and products that can't be delivered to their customers if we fail to deliver high availability. The impact is very high when we can't deliver the service," adds Nico Sponselee, an Atos Origin technical consultant.
"We find that moving beyond 99.9 percent availability is not easy, and Oracle definitely helps us try to reach the goal of adding another 9 after the decimal," says Snel, who's excited about the new capabilities of Oracle Database 10g. "We think that Oracle Database 10g is another step in that direction. It will help us make even higher availability possible at even less cost, specifically because we think the human error component of an operating system and an application becomes less. Human failure can be driven down by decreasing interaction by humans, and Oracle Database 10g helps in that respect."
When authorized people make mistakes, you need the tools to correct these errors. Oracle Database 10g provides a family of human-error correction technologies called Flashback, which revolutionizes data recovery. In the past, it might have taken minutes to damage a database but hours to recover it. With Flashback the time to correct errors equals the time it took to make the error. Flashback provides fine-grained surgical analysis and repair for localized damage, such as when the wrong customer order is deleted. Flashback also allows for correction of more-widespread damage and lets you do it quickly to avoid long downtime, such as when all of this month's customer orders have been deleted. Flashback is unique to the Oracle Database and supports recovery at all levels, including the row, transaction, table, and tablespace levels and databasewide.
The Importance of Rapid Recovery
For most businesses, there's no way around the dramatic growth of database sizes over the past five years, and First American Real Estate Solutions (RES) is no exception. "We have approximately 600,000 users who rely on our 2TB real estate database to make decisions daily," says Ben Graboske, chief technical officer at First American RES, in Anaheim Hills, California. "We operate 24/7 to meet the needs of our clients who often access our services outside of normal business hours. From that perspective, high availability is really important to us."
The problem is, restoring such huge databases after a hardware failure or a system crash takes a bit longer than two or three minutes. That's why First American RES implemented Oracle Data Guard on its key billing and customer tracking databases: to ensure rapid recovery and continued availability in the event of a problem. "It takes a long time to restore a large database from a backup database," says Daniel Liu, senior technical consultant at First American RES. "With so many users and so many applications, there is minimal to no downtime. I think the greatest value of Data Guard is that it enables us to efficiently handle failover on large databases."
More than 600,000 professionals depend on First American RES to provide them with the information they need; although not all of those individuals access the system at the same time, First American typically has thousands of concurrent users on the system at any one time. And although First American works hard to collect, consolidate, and manage data from approximately 3,000 U.S. counties, it wouldn't be a Fortune 500 company if it couldn't accurately track and bill each of its hundreds of thousands of customers.
"We have multiple billing databases that are key to our business," says Liu. "Because we charge users by how many searches they do and how much time they spend searching our database, we need to ensure availability. So for each of the billing databases, we've used Oracle Data Guard to set up standby databases for protection."
According to Liu, First American RES uses two standby databases for each billing database, and it uses the Maximum Performance setting within Data Guard to push the transactions as quickly as possible to the remote site so that during a failover, all the data will already be on the standby databases. "The process actually goes very quickly. We can have all the users, more than 2,000 of them concurrently, fail over to our standby site in less than a few minutes without losing any transactions at all."
Making IT Available
Payment Technologies, Inc.
Mechanicsburg, Pennsylvania
Provides hosted transaction services (Valexia) to financial processors and their large retail customers
Software: Oracle RAC, Oracle Data Guard, Oracle Enterprise Manager, RMAN
Austrian Railways (ÖBB)
Austria
Manages 10,000 kilometers of railroad track, monitoring continuously for reliability and safety
Software: Oracle RAC, Oracle Spatial Option
First American Real Estate Solutions
Anaheim Hills, California
The nation's largest source of real estate data, delivered to more than 500,000 users
Software: Oracle Data Guard
Atos Origin
Hoofddorp, The Netherlands
Global IT services company with 50,000 employees
Software: Oracle Data Guard
Key to deploying any type of high-availability solution is testing. "Before we implemented our high-availability solution, we built a testing and staging environment in which we could mimic failures and cause the database or server to crash and fail over to the standby database," says Liu. Apparently the testing and extra work have paid off. Since the company deployed the system, it has had two hardware-related failures, and both times the Data Guard failover process worked successfully and within minutes: thousands of users were shifted to the new primary server without a problem.
Using Best Practices to Increase Availability
With a concept as broad as high availability, it can be hard for organizations to figure out where to start. And when you're designing the high-availability infrastructure for Oracle's Outsourcing business, it's even tougher. Oracle Outsourcing had to take into account the wide variety of availability and disaster recovery solutions that are uniquely designed for individual customer scenarios and find a way to deliver them as standard services with well-defined SLAs and customer expectations.
That's where the Oracle Maximum Availability Architecture (MAA) came in. MAA is the overall guideline or blueprint that customers can use to implement high availability with Oracle solutions. It's a thorough document that not only gives generalized architecture advice but also gets down to specific configuration details and settings based on best practices.
"We've built our recovery strategy around the MAA architecture, so we can tell customers exactly what to expect when something goes wrong," says Ken Piro, vice president, Oracle Outsourcing. "Even if we lose our Austin data center, we can tell them they can expect to be back in business in this amount of time, with this amount of data loss. We couldn't do that without an architecture such as MAA in place; it gives us amazing flexibility. For example, we can even have failover architectures based on Data Guard where the primary site uses EMC storage and the secondary site uses Network Appliance storage."
Although it's comprehensive, MAA is also very flexible, so you can implement it incrementally. With MAA, you don't have to do it all at once but can decide what's most important for your specific scenario. "For example, if you have a primary site, you may decide that disaster recovery is important, so you implement Data Guard," says Tammy Bednar, Oracle senior product manager, High Availability. "Then you might decide that the next step is to protect your primary database from a host failure, so you implement Oracle RAC. You just continue to add your levels of data protection into your high-availability architecture."
The Grid Moves High Availability to a New Level
Until recently, high availability has been measured in figures under 100 percent. It simply hasn't been possible to consider a system that never goes down and isn't susceptible to some failure, no matter how remote, so a highly available system might be guaranteed to be available 99.99 percent of the time. New grid technologies, however, may just offer the promise of even more uptime.
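To see what such percentages mean in practice, an availability figure can be converted into the downtime it permits. The short calculation below is a sketch that assumes a 365-day year and treats all downtime, planned or unplanned, as counting against the figure; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct} percent available: {hours:.2f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")
```

On these assumptions, 99.9 percent availability allows a little under nine hours of downtime a year, while 99.99 percent allows less than an hour, which is why each additional nine is so much harder to reach.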
"Grid computing takes the issue of high availability to a different level," says Piro. "Grid is making people realize that performance, availability, and capacity are all wrapped together and have to be there 100 percent of the time. Customers expect them to work together. Grid takes all these requirements and puts them together in one uniform concept."
With Oracle9i, Oracle began offering grid technologies (features such as Transportable Tablespaces and Oracle Streams) that dynamically provision pooled resources and data to the users and programs that need them. With Oracle Database 10g, all Oracle core technologies are now enabled for use in an enterprise grid environment, providing even higher availability.
Oracle Database 10g enables a new model for high availability by combining high-volume, inexpensive processors and inexpensive storage to produce a high-quality system. New capabilities such as disk-based recovery, Flashback, and rolling upgrades now make it possible for Oracle users to build highly available IT infrastructures with low-cost, high-volume components. Says Oracle's Loaiza, "There's a new economics that organizations should be aware of: trading inexpensive disk space for expensive downtime. With Oracle Database 10g, we're enabling organizations to spend less money on hardware but still achieve resilient, high-availability deployments."
Oracle Database 10g also helps organizations increase availability by providing ways to compensate for human error (such as a DBA deleting or restoring the wrong files). "When you're talking about availability and recovery from every angle, it's not just system errors that are important; it's data availability as well," says Summersky's Vallath. "In Oracle Database 10g you have the Flashback option, dynamic redefinition and reconfiguration options, enhanced backup and recovery options, and more. All these great features give you an extra advantage for achieving maximum availability."
Across the board, Oracle Database 10g's support for high availability now gives organizations a new opportunity to reevaluate their high-availability strategy and move toward one that utilizes low-cost, standard hardware and storage. By integrating availability into the fabric of the computing infrastructure the way grid computing does, organizations will have even more flexibility to use their IT resources in ways that make the most business sense.
David A. Kelly (davekellt@attbi.com) is a business, technology, and travel writer who lives in West Newton, Massachusetts.
The Role of Enterprise Manager in High Availability
Oracle Enterprise Manager (OEM) helps organizations reduce downtime by monitoring their systems and applications, everything from Oracle RAC to Data Guard to RMAN. In Oracle Database 10g, OEM supports new features such as Flashback and the Incrementally Updated Backup feature, enabling administrators to automate their backup strategies.
"Although you can access and manage features of the different products through SQL*Plus and in other ways, Enterprise Manager provides a high degree of automation. For example, it does everything that's required to make a backup of your primary site and push it out, transferring it to the standby site and creating your control files, listener entries, and everything else you need," says Venkat Maddali, Oracle senior development manager, OEM. "All with a click of a button."
In addition, OEM has capabilities such as Verify Configuration for Data Guard, which allows it to proactively verify that standby databases are healthy and that they have all the standby redo logs needed to take over if the primaries fail. If they don't, it can notify an administrator proactively. OEM can also help increase availability by ensuring that best practices for configuration and change management are followed. "OEM can collect information on all managed hosts and can verify that information against a set of best practices, proactively notifying administrators if any are violated," says Maddali. "It not only detects policy violations of best practices but also sees if you're behind in your software maintenance, by comparing your patch levels to the ones available on MetaLink."
Automatic Storage Management (ASM)
ASM is a new Oracle Database 10g feature that simplifies the management process for DBAs. It provides an integrated volume manager and file system that are purpose-built for Oracle Database files, allowing Oracle to manage the underlying storage. From a business perspective, it reduces the management requirements for the database and reduces the potential of operator error. By simplifying the provisioning of the original database layout and then, more important, the ongoing maintenance of the database, it helps deliver a system that is easier to manage and more automated.
"From a high-availability standpoint, ASM helps reduce planned downtime by allowing administrators to add more storage to a disk pool and have the database dynamically redistribute the data across the storage pool nonintrusively, without having to schedule downtime to remap or move data files around," says Paul Manning, product manager for ASM. "On the unplanned-downtime side, ASM helps eliminate potential operator error and creates a more bullet-proof database environment." Like RAC, ASM is built on the idea of assembling less-expensive storage devices into a highly available storage system, instead of purchasing one big, expensive, mainframelike storage system.
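The idea of redistributing data when a disk joins the pool can be sketched with a toy model. The code below is not ASM's actual algorithm; the `rebalance` function, the disk names, and the extent numbering are all hypothetical. It only illustrates the general principle of moving just enough data onto a new disk to even out the load while the existing disks stay in use.

```python
def rebalance(layout, new_disk):
    """Move extents from the fullest disks onto new_disk until it holds an even share.

    layout maps disk name -> list of extent ids. Returns (new_layout, extents_moved).
    """
    layout = {disk: list(extents) for disk, extents in layout.items()}  # work on a copy
    layout[new_disk] = []
    total = sum(len(extents) for extents in layout.values())
    target = total // len(layout)  # even share per disk, rounded down
    moved = 0
    while len(layout[new_disk]) < target:
        donor = max(layout, key=lambda disk: len(layout[disk]))  # fullest disk gives one extent
        layout[new_disk].append(layout[donor].pop())
        moved += 1
    return layout, moved


before = {"disk1": list(range(0, 6)), "disk2": list(range(6, 12))}
after, moved = rebalance(before, "disk3")
print({disk: len(extents) for disk, extents in after.items()}, "moved:", moved)
```

In this model only a third of the extents move when a third disk is added; the rest stay where they are, which is the property that lets storage grow without a full reorganization.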
ASM technology improves performance as well as availability. There's nothing worse than having a system that's always there but moving at a snail's pace. "ASM automatically distributes the data across the underlying storage resources in a manner that keeps the database I/O at an optimal level," says Manning. In addition, ASM offers three redundancy levels: double (two-way) mirroring, triple (three-way) mirroring, and external redundancy, in which ASM does not mirror and relies on the storage hardware instead.
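To make the rebalancing and mirroring ideas concrete, here is a minimal Python sketch. It is a toy model, not ASM's actual algorithm: the function names, the extent-to-disk map, and the "raw capacity divided by number of copies" rule are all illustrative assumptions that ignore metadata and reserved free space.

```python
from typing import Dict, List

# Copies kept of each extent per redundancy level (illustrative assumption).
# "external" means the storage hardware mirrors, so the volume manager keeps one copy.
COPIES = {"external": 1, "normal": 2, "high": 3}


def usable_gb(raw_gb: float, redundancy: str) -> float:
    """Rough usable capacity: raw capacity divided by the number of copies."""
    return raw_gb / COPIES[redundancy]


def rebalance(placement: Dict[int, str], disks: List[str]) -> Dict[int, str]:
    """Return a new extent-to-disk map with extents spread evenly over `disks`.

    Extents move only from overloaded disks to the least-loaded disk, so most
    data stays where it is when a disk is added to the pool.
    Assumes every disk named in `placement` is also listed in `disks`.
    """
    result = dict(placement)
    load = {disk: 0 for disk in disks}
    for disk in result.values():
        load[disk] += 1

    moved = True
    while moved:  # repeat until no disk holds 2+ more extents than another
        moved = False
        for extent in sorted(result):
            source = result[extent]
            target = min(load, key=load.get)
            if load[source] - load[target] > 1:
                result[extent] = target
                load[source] -= 1
                load[target] += 1
                moved = True
    return result


if __name__ == "__main__":
    before = {i: f"disk{i % 3}" for i in range(12)}  # 12 extents on 3 disks
    after = rebalance(before, ["disk0", "disk1", "disk2", "disk3"])
    print(after)
    print(usable_gb(600, "normal"), usable_gb(600, "high"))
```

In this toy run, adding a fourth disk moves only a quarter of the extents, which is the intuition behind adding storage without taking the database offline to relocate whole data files.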
Backup and Recovery
"You need to look at high availability from all different sorts of levels, from the simplest thing such as dropping a database table or even simply deleting a row from a table, all the way up to losing your building," says Tammy Bednar, Oracle senior product manager, High Availability. "High availability can protect you from something as small as a human error that causes a table to be dropped from a production database."
One of the most basic steps in building highly available systems is to ensure adequate backups. But even more important is determining how quickly you can (or need to) recover from a database outage. "It's becoming more important to look at the recoverability of your database, not necessarily just the backup," says Bednar.
With databases growing by leaps and bounds, restoring from tape takes a long time and can cause a significant amount of downtime. As a result, organizations are looking to recover more quickly, and Oracle is addressing this need with new capabilities, such as disk-based recovery. For example, Oracle Database 10g introduces the Flash Recovery Area, a location on disk that you designate to hold all your recovery-related files, so that if you have to recover a data file, the backup is available locally and is faster to restore than it would be from tape.
Oracle's Flashback technologies go a step further and help organizations eliminate the time it takes to restore the database backup. It takes only a very short period of time to "rewind" the database with the Flashback Database technology. "No matter what the size of your database, it takes only 5 or 10 minutes to recover, versus 14 hours if you have a terabyte database," says Bednar. "We're eliminating that 14-hour recovery time."
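As a back-of-the-envelope check on those numbers, the short Python sketch below relates database size, restore throughput, and restore time. The article does not state a throughput figure; the value computed here is only what the quoted "14 hours for a terabyte" would imply, assuming a decimal terabyte (10^12 bytes) and a constant transfer rate.

```python
SECONDS_PER_HOUR = 3600
TERABYTE = 1e12  # decimal terabyte, an assumption for this estimate


def restore_hours(db_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to copy db_bytes back at a constant throughput."""
    return db_bytes / throughput_bytes_per_s / SECONDS_PER_HOUR


def implied_throughput_mb_s(db_bytes: float, hours: float) -> float:
    """Sustained MB/s implied by restoring db_bytes in the given time."""
    return db_bytes / (hours * SECONDS_PER_HOUR) / 1e6


if __name__ == "__main__":
    rate = implied_throughput_mb_s(TERABYTE, 14)
    print(f"Implied restore rate: {rate:.1f} MB/s")
    print(f"Hours to restore 2 TB at that rate: {restore_hours(2 * TERABYTE, rate * 1e6):.1f}")
```

The point of the estimate is that restore time grows linearly with database size, whereas a rewind that touches only changed blocks depends on how much has changed.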
Flashback is ideally suited for the correction of human errors, such as when someone makes a mistake (by deleting something incorrectly, for example) and all you want to do is go back to 15 minutes before that happened. Traditionally that would be very difficult, involving a lengthy restore from tape and then rolling forward to the right time, a process that could easily take hours. "In Oracle Database 10g, we can use Flashback to track just the changes to the database, so if a database needs to be restored to 15 minutes ago, you simply say, 'Flash this database back to 2:00 p.m.' and restore only the data blocks that have changed in that time period," says Oracle's Juan Loaiza. "It's like a rewind button for the database."
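The "rewind button" idea can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Oracle's implementation: the `ToyDatabase` class, its in-memory log of before-images, and the numeric timestamps are all hypothetical. It shows only why rewinding costs time proportional to the number of recent changes rather than to the size of the database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyDatabase:
    """Hypothetical block store that logs a before-image on every write."""

    blocks: Dict[int, str] = field(default_factory=dict)
    # Each entry: (timestamp, block_id, before_image); None = block did not exist.
    flashback_log: List[Tuple[float, int, Optional[str]]] = field(default_factory=list)

    def write(self, ts: float, block_id: int, value: str) -> None:
        """Record the old contents, then overwrite. Assumes ts never decreases."""
        self.flashback_log.append((ts, block_id, self.blocks.get(block_id)))
        self.blocks[block_id] = value

    def flashback_to(self, ts: float) -> int:
        """Undo every change made after `ts`, newest first.

        Returns the number of log entries applied, which is the only work done:
        unchanged blocks are never touched.
        """
        applied = 0
        while self.flashback_log and self.flashback_log[-1][0] > ts:
            _, block_id, before = self.flashback_log.pop()
            if before is None:
                self.blocks.pop(block_id, None)
            else:
                self.blocks[block_id] = before
            applied += 1
        return applied


if __name__ == "__main__":
    db = ToyDatabase()
    db.write(13.0, 1, "orders v1")
    db.write(13.5, 2, "customers v1")
    db.write(14.25, 1, "orders v2 (mistake)")  # the human error, after 2:00 p.m.
    db.flashback_to(14.0)                      # "flash back to 2:00 p.m."
    print(db.blocks)
```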
Recovery Manager (RMAN) is Oracle's utility for backing up and recovering databases. What's important about it from the high-availability perspective (beyond its ability to back up the database) is that it's tightly integrated with the Oracle Database and thus knows exactly which files to back up and how to recover. According to Bednar, "Having received a single command, RMAN knows how to back up all the database files." In addition, RMAN checks for corrupted blocks during backup and repairs them (via a feature called Block Media Recovery) in memory. "That's a huge manageability and availability feature we introduced in Oracle9i," notes Bednar.
RMAN also knows how to intelligently restore data files. "It restores the file you need and automatically applies the archive logs required for recovery," says Bednar.
Steps to Higher Availability
In many ways, availability is like security: there's no single way to achieve complete availability or complete security but, rather, a wide variety of processes and technologies that can be applied to increase availability or security. Although there is no one-size-fits-all solution for high availability, there are some basic steps any organization can take to start down the road to higher availability, including the following:
- Define requirements. One of the first steps toward higher availability is defining the requirements of users, including acceptable downtimes and costs associated with them, as well as testing for synergies between high availability and the need for increased performance from various systems.
- Consider it early. For best results, organizations should be considering high-availability requirements during a system's design stage. "Consider it not only for the infrastructure but also for the applications and operations architecture," notes Gartner's Scott.
- Standardize. By standardizing as much hardware and software as possible, organizations cut down on the variables and differences they have to keep track of. This can be especially important when you are trying to recover from a failure. "One of the main reasons Oracle Outsourcing standardized on Linux was that it allowed us to focus our energies and get really good at deploying Linux-based solutions," says Ken Piro, vice president, Oracle Outsourcing.
- Define metrics. Building a highly available solution is a good step, but you also need to ensure that the solution actually works within the business guidelines required. Defining (and implementing) metrics that measure whether the solution works is important, both for verifying it and for documenting its capabilities for auditors or boards of directors.
- Do regular backups. Yes, it's rudimentary. Yes, it's trite. But it's still critical to make sure you've got appropriate backups of your database, applications, configurations, operating systems, and everything else. "Even if you don't employ Data Guard or RAC and if you don't have a disaster recovery site, if you at least back up your systems and maintain the backups off-site, you'll have a chance to recover," says Oracle's Bednar.
- Reduce the amount of change. The first rule in any kind of problem determination is to understand what has changed. "Organizations should introduce change in small, related batches if possible. Introducing too much change at one time inevitably lengthens the application's mean time to recovery, because it takes longer to determine what the problem is," says Meta Group's Charles Garry.
- Document procedures. Remarkably but understandably, many organizations simply don't document what they have deployed. Know (and document) where your database files are located, what operating systems (and patches) you have installed, and what the critical passwords are. "Documenting just the basic information can help you recover from simple outages a lot faster," notes Bednar.
- Build a change infrastructure. Make sure you've invested to create the infrastructure and processes necessary to support change in a consistent and efficient way. Have a well-planned and well-managed quality assurance test environment, and establish a change management group with responsibility for coordinating, documenting, communicating, and approving changes to environments, including promotions from development to test, to QA, to production. "This is something most companies that have a mainframe environment are already really good at," says Garry.
- Improve communication. According to Garry, planned changes should be communicated to all interested parties within a reasonable timeframe, so that they can evaluate any related or conflicting changes.
- Read MAA. Download and read Oracle's Maximum Availability Architecture document to get an overview of high-availability best practices, recommendations, and even configuration settings.
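The "Define requirements" and "Define metrics" steps above come down to simple arithmetic that can be sketched in a few lines of Python. The function names and the sample outage durations below are illustrative assumptions, not figures from any organization quoted here; the sketch converts an availability target into a yearly downtime budget and computes measured availability and mean time to recovery from a list of outages.

```python
from typing import Sequence

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def allowed_downtime_hours(target_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget for an availability target such as 99.9 (percent)."""
    return period_hours * (1 - target_pct / 100)


def measured_availability(outage_minutes: Sequence[float],
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability (percent) actually achieved, given each outage's length in minutes."""
    downtime_hours = sum(outage_minutes) / 60
    return 100 * (1 - downtime_hours / period_hours)


def mean_time_to_recovery(outage_minutes: Sequence[float]) -> float:
    """Average outage length in minutes; 0.0 if there were no outages."""
    return sum(outage_minutes) / len(outage_minutes) if outage_minutes else 0.0


if __name__ == "__main__":
    outages = [30, 90, 12]  # hypothetical outage durations for one year, in minutes
    print(f"Budget at 99.9%: {allowed_downtime_hours(99.9):.2f} hours per year")
    print(f"Measured availability: {measured_availability(outages):.3f}%")
    print(f"Mean time to recovery: {mean_time_to_recovery(outages):.0f} minutes")
```

Tracking both numbers matters: the availability percentage shows whether the requirement was met, while mean time to recovery shows whether practices such as small change batches and documented procedures are shortening individual outages.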