Review real-world experiences in deploying a business-critical PHP app over Oracle Real Application Clusters.
By John Lim
Published June 2009
Running a software application that is able to work reliably through hardware and software failures is incredibly hard. Our client, a Malaysian bank, commissioned us to develop a Web-based financial application for them that would be used in an intranet setting and was considered mission-critical. This meant that they that wanted a system that was reliable and had automatic failover to another server whenever any component failed.
In this article, I will cover the network, architecture and design of our RAC application. Then I will discuss the real-world experiences and problems we experienced. The experiences in general are also relevant even if you are using Java or .NET or your application and not PHP.
We proposed PHP for the application with the data stored in Oracle Database 10gR2 Standard Edition using Oracle Real Application Clusters (RAC) on Red Hat Linux Enterprise 4. Our final system looks like this:
In our network, we have two load balancers, in active-passive mode to distribute HTTP queries to our two application servers. The application servers are running Squid 3, Apache 2.2, and PHP 5, and use Oracle's client side load-balancing and failover to connect to the Oracle database cluster.
The Oracle RAC is a cluster of independent database servers that share the same data and cooperate as a single system. Our RAC is a two-node cluster using a multipath connection to a Hitachi Storage Area Network (SAN). We use cluster-aware file-systems on the SAN to ensure that we both RAC nodes are able to write to the same files simultaneously without corrupting the data. There is also a private interconnect between the two RAC nodes, which is just a 1GB private LAN for data synchronization purposes.
In order to understand our real-world experiences, I need to explain the architecture of Oracle RAC. If you want to learn more, there is an Oracle 2 Day + Oracle RAC Guide available.
Oracle RAC uses a shared everything data architecture. This means that all data storage needs to be globally available to all RAC nodes.
To implement shared everything on Linux, you can choose to use third-party cluster file systems or Oracle technologies such as Oracle Automated Storage Management (ASM), a volume manager that uses raw devices to store data, or Oracle's open source OCFS2 filesystem.In our case, using Oracle Database Standard Edition requires Oracle ASM raw devices for data storage. However for all other shared Oracle files, we choose to use OCFS2 as these files are then accessible to normal operating system commands such as ls and cp.
Oracle ASM is a raw storage format highly optimized for database use. If you are familiar with tablespaces and data files inside an Oracle database, then you will see similarities with the way ASM is organized. In an Oracle database, we store tables in virtual tablespaces made up of multiple physical data files. Similarly, in Oracle ASM, we store these data files in virtual diskgroups which is made up of multiple physical disks.
Oracle ASM allows you to create mirrored groups of raw disks, or use external raw devices that are already RAID compliant. In our case, the SAN is already RAID 1, so we configured the ASM to use external redundancy.
Oracle ASM is managed through a database instance. This is not the same database instance where your production data resides. This means you actually have two database instances running on each RAC node with separate SID's (Oracle database identifiers):
To connect to the Oracle ASM instance from the command line on server 1, you need to set the ORACLE_SID environment variable to connect to the right instance before running sqlplus:
su - oracle ORACLE_SID=+ASM1; export ORACLE_SID sqlplus / as sysdbaYou can then view and modify the configuration of the disks the using standard ASM views:
v$asm_diskgroup v$asm_disk v$asm_fileMore on Oracle ASM later when we discuss our experiences.
Oracle Clusterware manages the startup, shutdown and monitoring of database instances, listeners and other services. When a service go down, Clusterware will automatically restart the service for you.
The cluster communicates through a private interconnect, in our case a 1Gbit private LAN; this means you need at least two network interface cards on each database node. One for the bank network, and another for the private interconnect.
If both the private and bank networks go down, the cluster can still communicate through the cluster filesystem. Special voting files were created in the OCFS2 file system to allow messages to be exchanged between the cluster nodes as a health check and to arbitrate cluster ownership among the instances in case of network failures.
Each database instance in the cluster has its own configuration such as SID, spfile and startup parameters. This means many of the familiar Oracle management techniques still apply.
However the fact that data is shared by all instances means that Oracle requires some means of synchronizing data buffered in memory within multiple instances. Oracle uses a technology called Cache Fusion that uses a high-speed interconnect to manage and synchronize the buffer caches. The speed of this synchronization is a good measure of the effectiveness of the RAC cluster. We will discuss more about this when we talk about Oracle Enterprise Manager.
Installation of Oracle RAC has been extensively covered in articles by John Smiley and Jeffrey Hunter, and the reader is directed to them for further information on the procedure. Instead I will talk about the glitches we faced.
The biggest problem we had during the installation was connecting to the SAN reliably. When you have multiple connections to the SAN using multipath, the same device will show up multiple times, e.g. as /dev/sda and /dev/sdc because there are two redundant paths to the same device. Another problem is that assignment of device names is by default based on the order of detection of devices. So a slight change in the bootup configuration could cause /dev/sda to be mapped to a different hard disk after the change.
That is we why we used the Linux multipath daemon to standardize the mapping of the devices. For example, this allowed us to remap hard disk 1 which has the paths /dev/sda and /dev/sdc to /dev/mapper/1HITACHI3450034. Then we use the more reliable 1HITACHI* path for all device connections.
The next problem we hit was when we formatted the 1HITACHI3450034 device as an ASM raw disk and thought we were good to go ...until we next rebooted: then ASM would loose the mappings and the database would be stuck without any data files.
After much frustration and searching, it turned out that ASM was detecting the sd* devices before the 1HITACHI* devices. We needed to configure ASM to change the order of device detection. This was done by putting this line in /etc/sysconfig/oracleasm:
ORACLEASM_SCANORDER="1HITACHI sd"This tells ASM to load all 1HITACHI* devices before sd* devices. For more info see this ASMLib multipath link.
Another gotcha we encountered was that we were running 64 bit Linux, so we only installed the required 64 bit RPMs. Mysterious errors occurred. It turns out that some 32-bit RPMs need to be installed too. Based on metalink document 339510.1 we installed both 32 and 64 bit versions of the following RPMs:
One final word of advice: most of the difficulties were because of problems interfacing to non-Oracle software and hardware. Once we overcame these issues the setup was just follow the documentation. For troubleshooting, we found My Oracle Support (formely MetaLink; support contract required) more helpful than Google, and would highly recommend it.
On the application servers, make sure you are running the latest release of your Oracle client libraries to ensure that Oracle Transparent Application Failover (TAF) is supported. In our case, we downloaded Oracle 11g Instant Client install (yes, the Oracle 11g Instant Client works fine with our 10gR2 database) and the PHP OCI8 1.3 extension source code. We then compiled the extension by hand. For instructions, see this guide.
Oracle supports load-balancing and failover of the client connection. You just need to configure TAF.To do this, you need to setup a database network service. The easiest way to do so is to use Oracle's Database Configuration Assistant (run $ORACLE_HOME/bin/dbca as root). Go to the Services Management page and create a service called RACSERVICE which uses two Preferred instances for load balancing and follows the basic TAF policy. The Assistant will also configure the cluster correctly.
After creating the service, open $ORACLE_HOME/network/admin/tnsnames.ora on either node and copy the RACSERVICE connection string generated by the Assistant. We use the connection string on the client side like this:
$RAC_CONN_STR = "(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = RAC1-VIP)(PORT = 1521)) (ADDRESS = (PROTOCOL = TCP)(HOST = RAC2-VIP)(PORT = 1521)) (LOAD_BALANCE = yes) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = RACSERVICE) (FAILOVER_MODE = (TYPE = SELECT) (METHOD = BASIC) (RETRIES = 180) (DELAY = 5) ) ) )"; $conn = oci_pconnect($USERID, $PASSWORD, $RAC_CONN_STR);RAC1-VIP and RAC2-VIP are the host names (that must be defined in /etc/hosts of database and application servers) of the virtual IPs used by the RAC servers. These virtual IPs are configured during installation and are used for fast connection failover. When one server goes down, the other server will take up both virtual IPs on the public interface, ensuring there is no delay in failover.
TYPE=SELECT means that if Oracle is executing a SELECT statement when a server fails, on failover that SELECT statement is re-executed and the cursor is positioned in a manner that the client can seamlessly continue fetching the rows. Transactions that INSERT or UPDATE are still rolled back however. You can also configure TYPE=SESSION meaning if the first connection fails, all work in progress at that point are irrevocably lost.
METHOD=BASIC means only connect to the failover server when any server fails. METHOD=PRECONNECT means make connections to all servers even before any server connection fails; this avoids the overhead of reconnection, but means that additional unused resources are taken up by each pre-connect connection.
Oracle 10gR2 introduced the ability to notify an application (and the Oracle client) when failover occurs using Fast Application Notification (FAN). This allows the Oracle client to remove terminated connections immediately instead of when it is actually called by our PHP application, and make TAF more responsive by causing the failover to occur more quickly.
Enabling FAN can be done by executing the following package:
execute dbms_service.modify_service ( service_name => 'RACSERVICE', aq_ha_notifications => true);You then need to define in php.ini:
oci8.events = On
Oracle client libraries and database must both be 10gR2 or later for this to work.
You can also tune load balancing on the server side. Internally Oracle has a load monitoring module called the Load Balancing Advisory (LBA) that will select the node to run your connection on.
In PHP, load balancing only occurs on connection (Connection Load Balancing in Oracle terminology) and is tuned for long running connections such as PHP persistent connections. Alternatively, you can configure RAC to be tuned for short running sessions (e.g. PHP non-persistent connections) using:
execute dbms_service.modify_service( service_name => 'RACSERVICE', clb_goal => dbms_service.clb_goal_short); -- or clb_goal_long
PHP Application Integration
Our PHP application contains 2,900 source code files and took over 20 man-years to develop. The financial software we built is based on our proprietary PhpLens Application Server technology and comes with a workflow engine, integrated graphical screen designer, and report writer. It uses the open source ADOdb library to connect to Oracle.
Integrating our PHP application was very straightforward. We did not have to make any PHP code changes to integrate with Oracle RAC. In terms of error handling, we have a transaction manager built into ADOdb that keeps track of all errors, and will auto-rollback a transaction whenever an error is detected. Something we are considering in a future release of ADOdb is the ability to restart failed transactions caused by failover as these are not automatically handled by Oracle. Instructions on how to do so are described in this Oracle white paper on PHP Scalability and High Availability.
We did have to make some changes on the Oracle side of our application to accommodate RAC behavior. In our application we use sequences extensively to generate surrogate primary keys. In Oracle RAC, a global lock is required to ensure that sequences are generated in order within the nodes, and will degrade database performance.
So we recreated our surrogate primary key sequences using the following command:
CREATE SEQUENCE seqname START WITH nextvalue CACHE 100 NOORDERThis prevents Oracle from invoking the global lock on every invocation of a new sequence number. The disadvantage of this approach is that although the sequence numbers are guaranteed to be unique, they are no longer in ascending order and gaps might occur in the numbering.
When running Oracle, one of the most important things you need to do is perform memory sizing. To do this you need to work out the maximum number of connections to allow. Our database servers had 8GB of RAM, and we wanted 2GB allocated for the operating system and other processes we needed to run. That left us with 6GB for Oracle to use.
In our case we made an assumption that the peak number of active web browsers connecting to our two application servers would be about 100. Assuming one application server went down, that means we still need to configure each server for 100 connections. Giving ourselves a 50% safety margin, both application servers were configured to run Apache in pre-fork mode with MaxClients set to 150. Note that if we had not been running Squid to manage the static files, the safety margin would have to be higher as at least half of all Apache child processes would be involved in downloading static files.
Each Apache child process would be running one PHP persistent connection, so that meant that if only one RAC node was up, we would need 300 Dedicated Server connections configured in the database to handle both application servers. After including other processes, dblink's from other Oracle servers, and a safety factor, we configured the maximum number of Dedicated Server processes (e.g. sessions) to 500.
From this we derived the following settings:
Number of dedicated server processes, and is equivalent to the number of connections you can make to the database. For easy calculation, we estimated that each process would take 4 MB.
500 x ~4MB = 2GB
Memory for Shared Global Area. Used for data buffers and other shared data structures.
We tested failover of the Oracle RAC cluster by running SQL*Plus 11g (with its -F option) from the application server. We tested the following scenarios:
select instance_name from v$instance
There is a learning curve to using Oracle RAC. Here are some things that I would advise you know.
Take the time to learn about the new features of Oracle Enterprise Manager that support Oracle RAC. For example, originally I was maintaining the ASM setup using SQL*Plus; I was flabbergasted when I found how easy it was within Enterprise Manager.You can start Oracle Enterprise Manager by running:
emctl start dbconsoleThen you can access the Oracle Enterprise Manager Database Control through the URL: http://server-address:1158/em.
Oracle RAC performance statistics are another important feature of Oracle Enterprise Manager. A measure of how efficiently RAC is performing is global cache block access latency. This metric measures how long it takes for Oracle RAC to synchronize data blocks between the nodes. If the measure is over consistently over 5 ms, you will need to look into ways of reducing latency, such as database sharding (partitioning the cluster so each node manages a subset of the data). You can view this statistic from the Performance tab:
If you are using Oracle ASM raw devices, be prepared to spend time educating the IT infrastructure team who maintain the systems. They are not necessarily DBAs and might be more familiar with Unix tools for monitoring disk usage such as df that do not work with raw devices. And if you want to copy files from the raw device you need to use Oracle's asmcmd utility if you are running Oracle Database 11g or RMAN on 10gR2.
In order to bridge the learning curve, we provide a monitoring Web page for their system operators that displayed the free disk space in the ASM partitions. To extract this information, we have a cron job that connects to our ASM instance using sqlplus and runs the following script:
connect / as sysdba; spool '/path/to/asmlogfile.txt'; select name, free_mb, total_mb from v$asm_diskgroup; exit;We then run a PHP script that reads the contents of the asmlogfile.txt and writes it to the production database instance.
Management of batch jobs in a high availability environment is challenging. This is because you need to have infrastructure to failover your batch jobs. Oracle provides you with tools to manage this with the DBMS_SCHEDULER package. This allows you associate a job with a database service that supports failover.
First you need to define a service that uses failover in tnsnames.ora. In the following example, FAILOVER is defined, but not LOAD_BALANCE: this means all jobs will run on RAC2-VIP by default, unless that server is down:
FAILOVER_SERVICE = (DESCRIPTION = (FAILOVER = ON) (ADDRESS = (PROTOCOL = TCP)(HOST = RAC2-VIP)(PORT = 1521)) (ADDRESS = (PROTOCOL = TCP)(HOST = RAC1-VIP)(PORT = 1521)) (CONNECT_DATA = (SERVICE_NAME = BATCH_SERVICE) (FAILOVER_MODE = (TYPE = SELECT) (METHOD = BASIC) (RETRIES = 20) (DELAY = 1) ) ) )Then define a job class for all jobs that uses this service:
DBMS_SCHEDULER.create_job_class( job_class_name => 'MY_JOB_CLASS', service => 'FAILOVER_SERVICE');Lastly you create a job using that job class:
DBMS_SCHEDULER.create_job ( job_name => 'my_job', job_type => 'PLSQL_BLOCK', job_action => 'BEGIN NULL; END;', -- your code here start_date => trunc(SYSDATE)+23/24, -- at 11 pm repeat_interval => 'FREQ=DAILY;', job_class => 'MY_JOB_CLASS', end_date => NULL, enabled => TRUE);
Our servers have three network cards, one of which was not utilized. For some reason, when there was a public network failure on one of the servers. The Virtual IP would failover to the spare unused network card on the same server instead of failing over to the other server. This meant that when the network was restored, that server would still be incapacitated as the virtual IP would be bound to the unused interface card.
To fix this, we ran the following command to explicitly bind the Virtual IP to only eth0 for the server with hostname node1.
srvctl modify nodeapps -n node1 -A 10.1.199.24/255.255.255.0/eth0
Once we got past the learning hurdle, our Oracle RAC experiences have been very positive. Uptime is high and performance is good. At peak load, there are 650 users logged in, running 70 transactions per second. For our customer, what is most important is to them is peace of mind that the data is managed reliably, safely and with no interruption to business. Oracle RAC gives them that assurance.
John Lim is based in Malaysia. He is the developer of ADOdb, a popular database library for PHP. John has been also been eating, drinking, and sleeping professionally with Oracle since 1995.