DBA and Sysadmin: Linux

Introduction to Linux Cluster Filesystems
by Sheryl Calish

Cluster filesystems complement the database cluster facilities in Oracle RAC in various ways. Here's how they compare.

Traditionally, a cluster is simply a group of servers, either PCs or workstations, acting as a single system. That definition is stretching considerably, however; cluster technology is now a dynamic field with diverse applications that is continually absorbing new features. Furthermore, cluster filesystem technologies, whether open source or proprietary, are rapidly converging in their capabilities.

Many people refer to cluster applications and the filesystem software used in them as if they were one and the same. More accurately, most clusters have two main components: servers, which are connected to some sort of shared storage media through a fast network, and filesystems, which act as the software "glue" that keeps the cluster nodes working together.

In "Guide to Linux Filesystem Mastery," I explained how filesystem methods and data structures provide a user-level view of the physical organization of a hard disk partition. Although there are variations among projects, cluster filesystems do the same for multiple nodes of a cluster: they make all the nodes appear to be part of one monolithic system and, at the same time, allow concurrent reads and writes by all nodes of the cluster.

In this follow-up article, we will take a high-level look at differences among cluster filesystems in general and some of the characteristics of the Oracle Real Application Clusters (RAC) environment specifically. DBAs or sysadmins who are new to clustering, Linux filesystems, or Oracle RAC should find it educational.

Introduction to Cluster Applications

Cluster applications have varying levels of maturity and capabilities. They include:
  • High-performance clusters, also referred to as parallel or computational clusters, are usually used in systems that support large volumes of computational processing. In these clusters a parallel filesystem distributes processing resources across the nodes, thereby allowing each node to access the same files at the same time via concurrent reads and writes. The Beowulf Linux cluster, developed in the early 1990s by NASA, is the most familiar example.
  • High-availability (HA) clusters are designed for fault tolerance or redundancy. Because these clusters use multiple servers for processing, the surviving servers can assume the processing responsibilities of any that go down.
  • Load-leveling or load-balancing clusters distribute workload as evenly as possible across multiple servers (commonly web servers).
  • Storage clusters, which are used between SANs and servers with different operating systems, provide shared access to data blocks on common storage media.
  • Database clusters, for which Oracle RAC is a platform, bring many cluster filesystem features into the application itself.
These cluster applications have overlapping features, and features of one or more of them are commonly found in a single cluster application as well—especially in HA and load-balancing clusters. For example, Oracle RAC can be installed over HA cluster filesystems to bring the benefits of database clustering to an HA cluster application, such as:
  • Shared resources (including data, storage, hard disks, and metadata), so that multiple nodes look like a single filesystem and all members of the cluster can read and write to the filesystem simultaneously.
  • Pooling of storage devices into a single disk volume, thereby improving performance because no data copying is needed
  • Scalable capacity, bandwidth, and connectivity
  • A single system image that provides all nodes with the same view of the data.
Now let's take a look at some of the cluster-aware Linux filesystem options that support Oracle RAC, as well as how they complement its capabilities.

Cluster Filesystem Options for Running Oracle

Oracle RAC technology already provides features such as load balancing, redundancy, failover, scalability, caching, and locking, so functions are duplicated when Oracle datafiles reside on a block device with a traditional Linux filesystem such as ext2/ext3. Performance suffers in this case because the same data is cached twice, once by Oracle and once by the filesystem, which drains memory resources.

As of this writing, in addition to third-party cluster filesystems, there are four filesystem options for running Oracle RAC. They are, in order of recommendation by Oracle:

  1. Oracle Automatic Storage Management
  2. Oracle Cluster File System
  3. Network Filesystem
  4. Raw devices.

Oracle Automatic Storage Management (ASM)

One hallmark of Oracle is that regardless of the environment on which it runs, once you get to an Oracle API everything looks, feels, and works the same. Oracle ASM, a feature of Oracle Database 10g, extends this consistent environment to storage management with the use of SQL statements, Oracle Enterprise Manager Grid Control, or the Database Configuration Assistant to create and manage storage contents and metadata. It is considered best practice to use ASM for Oracle Database 10g data file storage.

The basic data structure in ASM is the disk group, which comprises one or more disks. A "disk" in this context can be a disk partition, an entire disk, a concatenated disk, a partition of a storage spindle, or an entire storage spindle.

It's important to understand that ASM is not a general-purpose cluster filesystem. Rather, ASM is a cluster-aware filesystem specifically designed to handle Oracle database files, controlfiles and logfiles. ASM should not be used with a Logical Volume Manager (LVM), as the latter will obscure the disks from ASM.

ASM performs the following functions:
  • Recognizes disks via ASM ID in disk header
  • Dynamically distributes data across all storage within a disk group, provides optional redundancy protection, and is cluster aware.
  • Allows for major storage manipulation to take place while an Oracle database is fully operational—no downtime needed to add, remove, or even move diskgroups (although rare) to a new storage array
  • Performs automatic load balancing and rebalancing when disks are added or dropped
  • Provides additional redundancy protection through the use of failure groups
  • Optimizes use of storage resources.
When installed on raw devices or, as recommended by Oracle, on block devices using the ASM library driver, ASM runs as its own instance that starts before a database instance. It enables the DBA to create, extend, and shrink a disk and maps such changes to disk groups on other nodes that share access to those groups. Database instances can share a clustered pool of storage across a number of nodes in a cluster.

ASM installs with the Oracle Universal Installer. If ASM is added to an existing database, make sure you set the database to depend on the ASM instance so that at startup the ASM instance is started before the dependent database. For example:

$ srvctl modify instance -d O10G -i O10G1 -s +ASM1

makes the O10G1 instance dependent on the +ASM1 instance.
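
In a RAC database with several instances, the same dependency has to be set once per instance. The following is a minimal sketch that only prints the srvctl commands so they can be reviewed before being run. The function name and the pairing of instance O10Gn with +ASMn are assumptions made for illustration; check the actual instance names on your cluster.

```shell
#!/bin/sh
# Print (not run) one "srvctl modify instance" command per instance.
# Assumption: database instance <db>N depends on ASM instance +ASMN.
print_asm_dependencies() {
    db=$1      # database name, e.g. O10G
    nodes=$2   # number of instances in the cluster
    i=1
    while [ "$i" -le "$nodes" ]; do
        echo "srvctl modify instance -d $db -i ${db}${i} -s +ASM${i}"
        i=$((i + 1))
    done
}

print_asm_dependencies O10G 2
```

Piping the output to sh would execute the commands; printing first keeps the step reviewable.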

An ASM instance differs from an Oracle database instance in several ways:
  1. No data dictionary exists, although you can use several V$ views to obtain information on the ASM instance: V$ASM_DISKGROUP, V$ASM_CLIENT, V$ASM_DISK, V$ASM_FILE, V$ASM_TEMPLATE, V$ASM_ALIAS, and V$ASM_OPERATION.
  2. You can connect to an ASM instance only as SYSDBA or SYSOPER.
  3. There are five initialization parameters used for an ASM instance, one of which, INSTANCE_TYPE, is required and should be set as such: INSTANCE_TYPE = ASM.
In the ASM instance, the DBA can use SQL syntax or Enterprise Manager to:
  1. Define a disk group for a pool of storage using one or more disks
  2. Add and remove disks from a disk group
  3. Define a failure group for additional data redundancy protection. This is usually a set of disks in a disk group sharing a common resource, such as a controller, that require constant uptime.
The status of ASM disk groups can be monitored through Enterprise Manager or through V$ASM views. You can also refer to them in a database instance to assign storage when you create your database structures.
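
As an illustration of the three tasks listed above, the sketch below prints SQL that could be fed to SQL*Plus while connected to the ASM instance as SYSDBA. The disk group name, failure group names, disk name, and raw device paths are all hypothetical.

```shell
#!/bin/sh
# Emit ASM DDL for review. To run it, pipe the output to SQL*Plus
# connected to the ASM instance as SYSDBA.
# dgroup1, ctlr1, ctlr2, extra1 and the /dev/raw paths are hypothetical.
asm_diskgroup_sql() {
    cat <<'EOF'
-- 1. Define a disk group with one failure group per controller
CREATE DISKGROUP dgroup1 NORMAL REDUNDANCY
  FAILGROUP ctlr1 DISK '/dev/raw/raw1'
  FAILGROUP ctlr2 DISK '/dev/raw/raw2';
-- 2. Add a disk to the group, then remove it again
ALTER DISKGROUP dgroup1 ADD FAILGROUP ctlr1 DISK '/dev/raw/raw3' NAME extra1;
ALTER DISKGROUP dgroup1 DROP DISK extra1;
-- 3. Monitor the result through a V$ASM view
SELECT name, state, type, total_mb, free_mb FROM v$asm_diskgroup;
EOF
}

asm_diskgroup_sql
```

With NORMAL REDUNDANCY, ASM mirrors each extent across failure groups, which is why the two disks are placed in separate groups here.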

From within your database instance you can refer to the ASM disk groups when you create your tablespaces, redo log, archive log files, and control files, either by specifying the disk group in the initialization parameters or in your DDL.
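
Both approaches can be sketched as follows. The script prints SQL intended for the database instance (not the ASM instance); the tablespace names are hypothetical and the disk group name is passed in as an argument.

```shell
#!/bin/sh
# Emit database-side DDL that places files in an ASM disk group.
# app_data and app_data2 are hypothetical tablespace names.
db_storage_sql() {
    dg=$1    # disk group name without the leading "+"
    cat <<EOF
-- Option 1: name the disk group directly in the DDL
CREATE TABLESPACE app_data DATAFILE '+${dg}' SIZE 100M;
-- Option 2: make the disk group the default file destination,
-- then let Oracle name and place the datafile
ALTER SYSTEM SET db_create_file_dest = '+${dg}';
CREATE TABLESPACE app_data2;
EOF
}

db_storage_sql DGROUP1
```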

For more details about ASM, see Lannes Morris-Murphy's OTN article "Storage on Automatic," Arup Nanda's ASM installment in "Oracle Database 10g: Top 20 Features for DBAs," and Chapter 12 of the Oracle Database Administrator's Guide 10g Release 1 (10.1).

Oracle Cluster File System (OCFS)

OCFS is designed specifically to support data and disk sharing for Oracle RAC applications. It provides a consistent file system image across the server nodes in a RAC cluster, functioning as a replacement for raw devices. In addition to simplifying cluster database administration, it overcomes the limitations of raw devices while maintaining their performance advantages.

OCFS version 1 supports Oracle data files, spfiles, control files, quorum disk file, archive logs, configuration files and the Oracle Cluster Registry (OCR) file, which is new in Oracle Database 10g. It is not designed for use with any other files, not even the Oracle software itself, which must therefore be installed on each node of the cluster unless you use a third-party solution. OCFS also does not provide LVM capabilities such as I/O distribution (striping), nor does it provide redundancy.

Oracle supports Oracle databases on OCFS version 1 for Red Hat Advanced Server 2.1, Red Hat Enterprise Linux 3, and Novell SUSE (United Linux) on 32-bit and 64-bit distributions when it is installed from the downloadable binaries. If you recompile it yourself, no Oracle support is available.

Three different rpm packages are available:

  • The OCFS kernel module, which differs for Red Hat and United Linux distributions. Take care to verify your kernel version:
    $ uname -a
    Linux linux 2.4.18-4GB #1 Wed Mar 27 13:57:05 UTC 2002 i686 unknown
    
  • The OCFS support package
  • The OCFS tools package.
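
Because the kernel module must match the running kernel, it can help to pull the release string apart before choosing a package. This sketch only parses the kind of string that uname -r reports; the helper names are made up for illustration, and the actual package names should be taken from the download page.

```shell
#!/bin/sh
# Split a kernel release string such as "2.4.18-4GB" into its parts.
kernel_series() {   # prints major.minor, e.g. 2.4
    echo "$1" | cut -d. -f1,2
}
kernel_flavor() {   # prints whatever follows the first "-", e.g. 4GB
    case $1 in
        *-*) echo "${1#*-}" ;;
        *)   echo "" ;;
    esac
}

release=$(uname -r)
echo "running kernel: $release (series $(kernel_series "$release"))"
```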
When you have downloaded these rpm packages, the installation steps are as follows:
  1. Install the packages by issuing the rpm -Uhv ocfs*.rpm command in the directory in which the rpm packages have been downloaded.
  2. Verify automatic mounting during boot has been enabled.
  3. Configure OCFS on each node in the cluster automatically using ocfstool. An optional manual method of configuring also exists and is explained in the OCFS Users Guide. The end result of this step is the creation of the /etc/ocfs.conf file, used to configure OCFS.
  4. Ensure OCFS loads at startup by running ocfs load_ocfs.
  5. Use either the ocfstool command and the GUI environment or mkfs.ocfs to format the OCFS partitions.
  6. Mount OCFS partitions manually or place an entry in the /etc/fstab for automatic mounting.
For more detailed instructions for these steps, see the "Best Practices" documentation.
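
Steps 2 and 6 can be checked with a small helper. This is a sketch that assumes the standard six-field /etc/fstab layout with the filesystem type "ocfs" in the third field. The device, the mount point, and the use of the _netdev option in the sample entry are assumptions for illustration, not values taken from the OCFS documentation.

```shell
#!/bin/sh
# List the mount points of type "ocfs" found in an fstab-style file.
ocfs_mounts() {
    fstab=${1:-/etc/fstab}
    awk '$1 !~ /^#/ && $3 == "ocfs" { print $2 }' "$fstab"
}

# Demonstrate on a throwaway file rather than the real /etc/fstab.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/sda1  /            ext3  defaults  1 1
# hypothetical OCFS volume; _netdev defers the mount until networking is up
/dev/sdb1  /u02/oradata ocfs  _netdev   0 0
EOF
ocfs_mounts "$sample"    # prints /u02/oradata
rm -f "$sample"
```

An empty result from ocfs_mounts on the real /etc/fstab means the partitions will have to be mounted manually after each boot.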

Because OCFS version 1 was not written to be POSIX compliant, file commands such as cp, dd, and tar, along with the textutils, need versions of the coreutils that provide an O_DIRECT switch. This switch enables these commands to function as expected on Oracle datafiles, even while Oracle is operating on these same files (which is only an issue if you run third-party software for doing hot backups). Using RMAN avoids this issue altogether. If you still need to use these facilities for various maintenance tasks, you can download the OCFS Tools that enable these commands from oss.oracle.com/projects/coreutils/files.

OCFS version 2 (in Beta as of March 2005), in contrast, is POSIX compliant and supports Oracle database software, which can be installed on one node and shared across other nodes in the cluster. Besides the shared ORACLE_HOME, other new features of OCFS version 2 include improved metadata caching, space allocation, and locking. There is also improved journaling and node recovery.

Network File System (NFS)

Although ASM and OCFS are the preferred filesystems for Oracle RAC, Oracle also supports NFS on certified network file servers. NFS is a distributed filesystem, a full discussion of which is beyond the scope of this article. For more information, visit the NFS homepage.

Raw Devices

There was a time when raw devices were the only option for running Oracle RAC. A raw device is simply a disk drive without a filesystem on it that can be divided into raw partitions. Raw devices allow direct access to a hardware partition by bypassing the filesystem buffer cache.

To make use of a raw device for Oracle RAC, you must bind a block device to the raw device before installing Oracle software via the Linux raw command:

# raw /dev/raw/raw1 /dev/sda
/dev/raw/raw1:  bound to major 8, minor 0
# raw /dev/raw/raw2 /dev/sda1
/dev/raw/raw2:  bound to major 8, minor 1
# raw /dev/raw/raw3 /dev/sda2
/dev/raw/raw3:  bound to major 8, minor 2
When the bindings have been made you can then use the raw command to query all raw devices.
# raw -qa
/dev/raw/raw1:  bound to major 8, minor 0
/dev/raw/raw2:  bound to major 8, minor 1
/dev/raw/raw3:  bound to major 8, minor 2
The major and minor numbers identify the device location and driver to the kernel. The major number identifies the general device type, while the minor number identifies one specific device of that type. In the examples above, major 8 is the device type of SCSI disks such as /dev/sda; minor 0 is the whole disk, and minors 1 and 2 are its first two partitions.
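
To look up these numbers for a device node without reading ls -l output, something like the following can be used. It is a sketch that assumes GNU stat, which reports the major and minor numbers in hexadecimal; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the decimal major and minor numbers of a device node.
# GNU stat reports them in hex via %t (major) and %T (minor).
dev_numbers() {
    set -- $(stat -c '%t %T' "$1")
    echo "major $((0x$1)), minor $((0x$2))"
}

dev_numbers /dev/null    # on Linux: major 1, minor 3
```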

Note that the device does not need to be accessible to run the above commands. There was no SCSI disk connected to my system when I ran the above commands for demonstration purposes. The effects of these commands will be lost with my next reboot unless I place them in a startup script like /etc/init.d/boot.local or /etc/init.d/dbora, which will run whenever my system boots.
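
A fragment for such a startup script might look like the sketch below. The binding list repeats the three bindings shown earlier; the variable and function names are assumptions, and the dry-run argument is included only so the fragment can be tried safely.

```shell
#!/bin/sh
# Fragment for a boot-time script: re-create the raw bindings after a reboot.
# Each entry is "<raw device number>:<block device>".
BINDINGS="1:/dev/sda 2:/dev/sda1 3:/dev/sda2"

bind_raw_devices() {
    run=$1    # pass "echo" for a dry run, or "" to execute for real (as root)
    for pair in $BINDINGS; do
        n=${pair%%:*}      # text before the colon
        blk=${pair#*:}     # text after the colon
        $run raw "/dev/raw/raw$n" "$blk"
    done
}

bind_raw_devices echo    # dry run: prints the three raw commands
```

In the real boot script the last line would be bind_raw_devices "" so that the commands are executed rather than printed.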

After mapping the block devices to the raw devices, you still need to make sure the raw devices belong to the oracle user and oinstall group.

# ls -l /dev/raw/raw1
crw-rw----    1 root     disk     162,   1 Mar 23  2002 /dev/raw/raw1
# chown oracle:oinstall /dev/raw/raw1
# ls -l /dev/raw/raw1
crw-rw----    1 oracle   oinstall 162,   1 Mar 23  2002 /dev/raw/raw1
You can then use symbolic links between the Oracle datafile(s) and the raw device to make things easier to manage.
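
A minimal sketch of that linking step follows. The datafile path is hypothetical, and the demonstration runs in a scratch directory so it does not touch a real Oracle data directory.

```shell
#!/bin/sh
# Give a raw device a descriptive datafile name via a symbolic link.
link_datafile() {
    raw_dev=$1     # e.g. /dev/raw/raw1
    link_path=$2   # e.g. /u02/oradata/orcl/system01.dbf (hypothetical)
    mkdir -p "$(dirname "$link_path")" && ln -s "$raw_dev" "$link_path"
}

# Demonstrate in a scratch directory; the link target need not exist.
demo=$(mktemp -d)
link_datafile /dev/raw/raw1 "$demo/oradata/system01.dbf"
ls -l "$demo/oradata/system01.dbf"
rm -rf "$demo"
```

Oracle can then be pointed at the descriptive name, while the link records which raw device actually holds the file.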

Raw device limitations in version 2.4 of the Linux kernel include a limit of one raw device per partition, and a limit of 255 raw devices per system. Novell SUSE Enterprise Linux comes with 63 raw device files, but you can create up to 255 raw devices using the mknod command (root privileges required).

# ls /dev/raw/raw64
ls: /dev/raw/raw64: No such file or directory
# cd /dev/raw
linux:/dev/raw # mknod raw64 c 162 64
# ls /dev/raw/raw64
/dev/raw/raw64
The mknod command above requires a device name, device type, and major and minor numbers. The device name in this example is "raw64" and the device type is "c" (indicating that it is a character device). The major and minor numbers of the new device are 162 and 64 respectively. Alternatively, Novell SUSE users can install these devices by installing the orarun rpm.
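
Creating the remaining device files one at a time is tedious, so the commands can be generated instead. The sketch below prints a mknod command for every raw device file missing from a directory. It assumes, as in the example above, major number 162 and a minor number equal to the device number; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the mknod commands for raw device files missing from a directory.
# Assumes major 162 and a minor number equal to the raw device number.
missing_raw_nodes() {
    dir=$1; first=$2; last=$3
    n=$first
    while [ "$n" -le "$last" ]; do
        [ -e "$dir/raw$n" ] || echo "mknod $dir/raw$n c 162 $n"
        n=$((n + 1))
    done
}

missing_raw_nodes /dev/raw 64 255    # review the output, then run it as root
```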

Other drawbacks of using raw devices include:

  • The number of raw partitions on a disk is limited to 14.
  • Oracle Managed Files (OMF) are not supported.
  • Raw device partitions cannot be resized, so if you run out of space you must create another partition to add a database file.
  • Raw devices appear as unused space, which can lead to overwriting by other applications.
  • The only way to write to a raw device is with the low-level command dd, which transfers raw data between devices or files. However, you need to take extra care that an I/O operation is correctly aligned in memory and on disk.
  • A raw partition can hold only one file, whether that is a datafile, a control file, or a redo log. Unless you use ASM, you need a separate raw device for each datafile associated with a tablespace. However, a tablespace can have multiple datafiles in different raw device partitions.
Conclusion

Oracle RAC provides many of the functions of a filesystem, whether clustered or not, so the responsibilities of the filesystem itself are minimal. What is needed, as I've discussed above, is a filesystem that complements the existing, inherent database cluster facilities of Oracle RAC. In most cases, ASM will serve that purpose best and is considered Oracle best practice, although OCFS, NFS, and raw devices may also be viable options. It is also possible to run ASM for datafiles and OCFS for voting disk, OCR, and Oracle Home as well as run ASM on NFS storage.

In the future we can look forward to one more option when OCFS version 2 enables the use of a shared Oracle Home, thereby complementing shared storage on ASM.


Sheryl Calish (scalish@earthlink.net) is an Oracle developer specializing in Linux for Blue Heron Consulting. She has also served as funding chair for the Central Florida Oracle Users Group and marketing chair for the IOUG Linux SIG.

