Confused about when to implement Oracle Parallel Server (OPS), or replication, or a standby database? Here's an overview of the options.
Lisa Phillips (firstname.lastname@example.org), an application architect for Learning Pays.com, in Ft. Lauderdale, Florida, writes, "My company is developing a Web application that will utilize an Oracle database. We need 100-percent uptime. What are your suggestions on the use of replication versus Oracle Parallel Server? Which would you recommend, and why? In terms of scalability, redundancy, availability, and backup issues, what are the pros and cons of each?"
You have at least five options to consider, not just two:
Oracle Parallel Server (OPS)
A standby database
Let's look at each of these briefly.
Using OPS in the context of availability means that you are using the Oracle database server's ability to run on more than one computer; multiple instances of the server can mount and open a single database. If one machine fails, the other machines continue processing with little or no interruption. You might, for example, use a server configuration with two nodes. If one node were to fail, the other node would recover any transactions the failed node had outstanding and roll back any in-progress transactions from the failed node. The system would never go down. (However, users who were processing transactions on the failed node would receive error messages and would have to resubmit their transactions; your software could, of course, detect the failure and automatically resubmit transactions as well.)
OPS has the following upsides:
It has increased horizontal scalability. If you need higher capacity, you can add another machine (or node). You do not necessarily need to buy more-powerful machines—you need more machines.
It uses proven technology. OPS has been around for a long time. Many people use it, so it's well known and understood. There are design considerations you have to be aware of—but they are well known (which is half the battle). For example, partitioning plays a key role in using OPS for designing an application.
It is fairly transparent. Replication affects how an application interacts with data, while OPS does not. At the application level, the fact that you are using OPS is not as relevant as using replication would be: OPS impacts DBAs but not necessarily programmers; replication impacts both.
Some of the downsides are as follows:
You have to design your application to work with OPS. You have to understand OPS and be aware of the potential performance bottlenecks. (Oracle8i Release 8.1.6 has a special two-node-cluster mode of OPS. If you use this mode, 100 percent of the performance issues are nonissues. Both nodes are used to read data, but only one node is used to write data. The second node takes over for the first node in the event of a failure and starts writing as well.)
Upgrades will most likely cause server downtime. Typically, all nodes go offline for upgrades (of the Oracle software, not the OS). Upgrade time will vary, but it may be as long as one hour.
DBAs must be fully aware of the OPS environment. DBAs must fully understand OPS, not just any Oracle product.
Although you can use replication for failover, it wasn't really designed for it. It was designed to make data available in more than one location. With replication, database 1 receives a transaction and stores it in a local queue. When the transaction commits, database 1 forwards (in the background, not as part of the user transaction) the transaction stored in the queue to other database servers that are in the replication loop. These other replication servers then also perform the same transaction.
Replication is asynchronous in nature. At any point in time, the various databases can and do contain different data, as they each contain a version of the database at different points in time. This fact can make application development somewhat harder. Let's say you sell widgets around the world from a Web site that's built on a replicated database model. You have an inventory of x widgets. People in two different locations simultaneously order x widgets, but one site goes to your database 1 and the mirror site dips into database 2. As far as your software is concerned, you can fill both orders. It is only after the transaction at each site has been performed that you discover in the background that you now have -x widgets in inventory (you've received orders for 2x widgets). This situation is known as an update conflict, and it is up to you to fix it. Oracle database server discovers it but then invokes your code to fix the problem. Your application design must contain code for handling this situation.
Many people avoid conflicts by partitioning data—for example, you can have a West Coast server handling West Coast data and an East Coast server handling East Coast data. However, if you fail over from the West Coast server to the East Coast server, you must remember that the failed West Coast machine probably had transactions in its database queue that will be propagated to the East Coast server. So you still have the potential for update conflicts when you use replication for failover, and your design still needs to be able to handle them.
In terms of OPS compared with replication:
If failover and availability are the key issues, using OPS wins out in my opinion. If redundancy is the main issue, then using replication is the answer (because using OPS is not a redundant solution—you have only one database).
Replication and OPS are of equal value for update and backup. Each option makes your database as available as the machines involved. Each is the same for backup.
Both replication and OPS have certain performance implications, and you have to make allowances for these in your design. With OPS and a two-node configuration, you can eliminate the performance considerations. With OPS, a single write node, and many (more than two) read nodes, you can eliminate most performance considerations. With replication, a client transaction generates about twice the I/O for any given write transaction and does not affect read transactions at all.
As for long-distance replication for disaster-recovery purposes—replication works in a store-and-forward fashion. If a user updates a row in database 1, database 1 not only updates that row but also queues the transaction (locally). After the user commits the transaction, database 1 will pick it up from the queue and send it to database 2, to database 3, and so on. You can update the same row in any number of databases, but you have to be able to handle the resulting update conflicts. Oracle database server detects them, but you must provide the code for fixing them. OPS allows only one person at a time to modify a single row (just like a single database does, since OPS is a single database).
Replication has some of the following upsides:
It works over long distances and wide-area networks.
It tolerates short network outages. "Short" is relative here and is a function of the number of changes made to the database in a given time period. You don't want the database servers to get too far out of sync. If the data diverges too much, it could also affect your application.
It has a fixed overhead. The overhead is twice the data modified for inserts/deletes and three times for updates (before and after images are stored in the queue).
It provides you with a copy of the data at another location. With OPS, you have one database. An earthquake could take out your only copy.
Some of the downsides are as follows:
You must carefully study and design for conflict resolution and its impact.
You need a DBA who is familiar with this technology, and your developers need to be familiar with its impact.
Note that you can use OPS and replication together—OPS for rapid failover and replication for a hot spare.
Using a standby database is a nice solution for failover. In this configuration, the Oracle database server forwards its logs to another computer on the network; changes in the logs are applied to a database instance that is in constant recovery mode. You can open the standby database for reporting and such (read only), but it is generally unavailable while in recovery mode. The standby database lags behind the master database by some small amount (governed by the size of your redo files), but you can bring it online rapidly after a failure and can totally catch up with the master database if you can get to the file system on the failed machine (for example, plug the disks that contain the redo in to another machine). Be aware that replication has the same sort of lag time as a standby database.
DOWNLOAD Oracle Database 11g Release 2
Setup is extremely easy. You set up and maintain an exact replica of your production database. The replica is maintained for you, though, and once you've set it up, it doesn't require any ongoing administration.
It has no performance impact on the master machine. The master machine performs as it normally does; the standby database uses the archive redo logs from the master to update itself independently of the master machine.
It imposes no design considerations on the application. Your application runs on the master machine. In the event of a failure, it will run on the new master (the standby), but this changeover is transparent to your application.
Some of the downsides are as follows:
DBAs must be aware of the standby database. If you make structural changes to the master database, you must synchronize the standby, but the work involved in doing this is minimal.
Once you fail over to the standby, you have no failback. The standby becomes the master, which means you no longer have a standby database, so you will have to immediately build a new standby to cover the new master database.
Here, I am talking about dual-ported RAID devices or other types of operating-system failover solutions in which two machines have the same set of disks mounted. If one machine fails, the second machine takes over the IP address and runs a recovery script of some sort. The script starts up the Oracle instance and lets it recover, and you're good to go. Talk to your OS vendor about these types of solutions.
Overall site architecture is another important consideration. Whether your Web application relies on a single database running on a single server or many databases running on many servers will have an impact on what users— your customers, your clients, and the world at large—perceive about your company's success or failure at any given moment. For example, if a server goes down and your entire site content resides on that one server, then your site will no longer exist on the internet.
You need to take all these criteria into account when you choose the technology and architecture for your site. As always in any implementation, it depends on your specific needs and goals, and there are too many variables to just say, "Use this technology." I've worked on projects for which OPS was right, replication was right, OPS plus replication was right, standby was right, and so on. I hope this overview helps you better assess some of your options.