Wim Coekaerts Talking Linux

Update on OCFS
by Wim Coekaerts

Why move to OCFS version 1.0.9.12? And when will support for async I/O be available? Oracle's director of Linux engineering Wim Coekaerts has the answers.

Because we often get questions and bug reports on specific usage of Oracle Cluster File System (OCFS) on Linux, as well as the timeline of product releases and capabilities, I thought it would be a good idea to provide a relatively short summary of where we are today as well as answer some of the most commonly asked questions.

Why should I upgrade to the new release of version 1?
Version 1 was released about a year ago and is still the current major release number. The main goal of OCFS was to provide a replacement for raw devices. It was not designed as a general-purpose cluster filesystem, nor was it designed to store nondatabase files. By database files, I mean datafiles, redo log files, archive log files, control files, the quorum disk file, and the spfile.

Using the filesystem for any other types of files is not well tested, because that capability was not part of the original goal; OCFS is not comparable to other filesystems such as ufs, vxfs, and ext2/3. In other words, updates of ctime, atime, and mtime do not necessarily behave as they do on those filesystems. OCFS also has a different way of showing inode numbers when a file moves, and you end up with different sorts of timestamp updates. Basically, we do as much as possible to preserve performance and keep the RDBMS happy. Anything beyond that is nice but not a requirement.

The raw performance of OCFS when used by Oracle is similar to that of doing raw I/O. The code itself is almost identical, and we rely on the database to handle or maintain I/O concurrency. This approach allows us to maintain performance, but to do so, we had to think up some tricks.

When the database opens a file on OCFS, it opens it with an extra flag, O_DIRECT. When a file on any node is opened with O_DIRECT, regular file access is not allowed. Attempting a cp, an md5sum, or a dd through the standard OS tools will fail with -EPERM. If you want to be able to use cp or dd while the database is up and running—to make a hot backup, for instance—you need to download updated fileutils packages or use RMAN.

These updated fileutils (cp, dd, and the like) take an option, o_direct=yes, that makes the tool also use O_DIRECT when opening the file.

Using O_DIRECT makes the filesystem perform better. If you want to use dd or cp, you should also specify a large block size; the ideal size for OCFS is between 512K and 1MB. For example:

dd o_direct=yes if=/ocfs/foo of=/backup/foo.bak bs=1M
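As a rough illustration of what opening a file with O_DIRECT and reading it in large blocks involves, here is a minimal Python sketch. It is not the code of the updated fileutils, and it has not been tested against OCFS; the function name copy_large_blocks and its direct parameter are made up for this example. It assumes Linux and Python 3, where os.O_DIRECT is available, and it uses an anonymous mmap only because O_DIRECT reads generally need an aligned buffer.

```python
import mmap
import os

BLOCK_SIZE = 1024 * 1024  # 1 MB, the top of the 512K-1MB range suggested above


def copy_large_blocks(src, dst, block_size=BLOCK_SIZE, direct=True):
    """Copy src to dst, reading src in large blocks (hypothetical helper).

    With direct=True the source is opened with O_DIRECT, so reads bypass
    the page cache. Returns the number of bytes copied.
    """
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(src, flags)
    # An anonymous mmap is page-aligned, which satisfies the buffer
    # alignment that O_DIRECT reads generally require.
    buf = mmap.mmap(-1, block_size)
    total = 0
    try:
        with open(dst, "wb") as out:
            while True:
                n = os.readv(fd, [buf])  # read up to block_size bytes
                if n == 0:  # end of file
                    break
                out.write(buf[:n])  # destination uses ordinary buffered I/O
                total += n
    finally:
        os.close(fd)
        buf.close()
    return total
```

Calling copy_large_blocks("/ocfs/foo", "/backup/foo.bak") would mirror the dd example above. On a filesystem that does not support O_DIRECT, the open call may fail with EINVAL; passing direct=False falls back to ordinary reads.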

For Red Hat Enterprise Linux 2.1 Advanced Server (RHAS2.1), the preferred kernels are 2.4.9-e.25 and higher. The OCFS version you should use is 1.0.9.9 or higher (currently 1.0.9.12 is the latest version). For those who have an older version running, there are several reasons to move to this release:

  • Kernel stack overflow warnings that end up in hangs or kernel "oopses": The Linux kernel has an 8K kernel stack per process. The process struct sits right under it, so if drivers or kernel code overuse the stack, they can overwrite process structures and cause a system hang or a system oops. Often the Linux kernel generates a do_IRQ warning in /var/log/messages. Although this warning itself is not a problem, it does signify that you are getting close to the 8K limit. In 1.0.9.9, we did a major cleanup of the OCFS kernel stack usage. Several customers with a heavy I/O load (both network and disk I/O) saw some inexplicable lockups; with our cleanup, however, most of those went away. Note that you can easily get the same warnings without having OCFS in the stack. So if you experience hangs or get crashes that reveal a lot of instances of do_IRQ() in the trace, I strongly advise you to move to this release.
  • Customers in a multinode setup archiving heavily to a single OCFS partition: There is a very small window in which two nodes can allocate space on the same disk location. This scenario, while rare, will cause archive log corruption.
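To see whether a machine has been logging the do_IRQ warnings mentioned in the first item, something like the following Python sketch could be used. The function name find_do_irq_lines is hypothetical, and because the exact wording of the warning can differ between kernels, the sketch only looks for the substring do_IRQ rather than a specific message.

```python
def find_do_irq_lines(path="/var/log/messages", needle="do_IRQ"):
    """Return (line_number, text) pairs for log lines that mention needle.

    Hypothetical helper: it does a plain substring match and makes no
    assumption about the rest of the message format.
    """
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if needle in line:
                hits.append((lineno, line.rstrip("\n")))
    return hits
```

A long or growing list of hits suggests the system is regularly running close to the kernel stack limit; as noted above, that can happen with or without OCFS in the stack.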

A few hints on problems:
If a system totally locks up or crashes and the report doesn't show OCFS in the stack, the cause is not likely to be OCFS. If a system is pingable but otherwise doesn't respond, a generic kernel VM problem is probably involved.

If a command such as ls is hanging, check ps several times and see whether the process state changes. If the process is in D state and never changes, it is likely to be stuck. If the state moves among D, S, and R, the process is just doing work.
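The check described above can be automated by sampling the process state a few times. The Python sketch below reads the State field from /proc/<pid>/status on Linux; the helper names, the sample count, and the interval are arbitrary choices for illustration, not recommended values.

```python
import time


def read_state(pid):
    """Return the one-letter process state (such as R, S, or D)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("State:"):
                # The line looks like "State:  S (sleeping)".
                return line.split()[1]
    raise RuntimeError(f"no State line found for pid {pid}")


def sample_states(pid, samples=10, interval=1.0):
    """Collect the state of pid `samples` times, `interval` seconds apart."""
    states = []
    for i in range(samples):
        states.append(read_state(pid))
        if i < samples - 1:
            time.sleep(interval)
    return states


def looks_stuck(states):
    """True if every sample was D (uninterruptible sleep)."""
    return bool(states) and all(state == "D" for state in states)
```

For example, looks_stuck(sample_states(pid)) returns True only if the process was in D state at every sample. A short sampling window can still misjudge a process that is merely slow, so treat the result as a hint rather than proof.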

Please contact support or file bugs against any serious issues so they can get resolved.

What versions and ports are upcoming for OCFS version 1?
The same 1.0.9.9 version has been released on Red Hat Enterprise Linux 3 and will be released on the forthcoming SUSE/UnitedLinux 1.0 SP2a and SP3.

For IA64, we are planning an update for RHEL2.1 and a release for RHEL3 and SUSE/UnitedLinux 1.0 SP2. We expect all these releases to be in sync by mid-November.

A version for AMD64 is in the works and should be available as we release Oracle9i Database R2 production for AMD64.

When will support for async I/O be available?

What's new in OCFS version 2?
Version 2, soon to be released in beta, will allow a Shared Oracle Home install and will support the installation of Oracle binaries and log/trace files. Version 2 will probably also support several non-Oracle products, but it will not be a general-purpose cluster filesystem. The target timeframe for version 2 is before Oracle Database 10g's production release. Version 2 will continue to support the Oracle RDBMS, just as it does today, without performance penalties.

Version 2 also contains a substantial rewrite to allow caching of data and to reduce the number of structures allocated inside the filesystem.

Let's talk about Linux. Email your questions to me at talkinglinux_ww@oracle.com.


Wim Coekaerts (talkinglinux_ww@oracle.com) is principal member of technical staff, Corporate Architecture. After having designed a Linux-based internet appliance expressly for Larry Ellison, he earned the title Oracle's "Mr. Linux." His team works on continuing enhancements to the Linux kernel and publishes source code under the GPL at oss.oracle.com.