OCFS2 - Oracle Cluster File System for Linux

TECHNICAL INFORMATION

OCFS2 (Oracle Cluster File System 2) is a free, open source, general-purpose, extent-based clustered file system that Oracle developed and contributed to the Linux community. It was accepted into Linux kernel 2.6.16.

Building on Oracle's long-term commitment to the Linux and open source community, OCFS2 offers an open source, enterprise-class alternative to proprietary cluster file systems, with both high performance and high availability. OCFS2 provides local file system semantics and can be used with any application. Cluster-aware applications can leverage cache-coherent parallel I/O for higher performance, and other applications can use the file system in a fail-over setup to increase availability.

Most Complete, Open, Integrated Enterprise Software Stack for Linux
OCFS2 is among the many key Oracle value-added technologies in Oracle Linux, and Oracle VM uses OCFS2 as its cluster filesystem to host virtual machine images, as well as the OCFS2 heartbeat to handle high availability. Only Oracle delivers the most complete, open, integrated enterprise software stack for Linux, including database, middleware, applications, and virtualization.

"This is a vital contribution to the open source community. The endorsement of OCFS2 by the Linux community represents a significant milestone for Oracle and demonstrates how Oracle's continued contributions are driving adoption of open source technologies." - Andrew Morton, Linux Kernel Maintainer, Google


Availability/Support From Oracle
  Oracle Linux 6.0
    • ocfs2 version 1.6.3 (using the Unbreakable Enterprise Kernel)
    • ocfs2-tools version 1.6.4 (default)
  Oracle Linux 5.6
    • ocfs2 version 1.6.3 (using the Unbreakable Enterprise Kernel)
    • ocfs2 version 1.4.8 (alternative kernel)
    • ocfs2-tools version 1.6.3 (default)

Technical Specifications (OCFS2 version 1.6)
  Supported Platforms: x86, x86_64
  Networking: TCP/IP
  Supported Operating Systems: Oracle Linux, RHEL (OCFS2 1.4.8 only)
  Block Size: 512 bytes to 4KB
  Cluster Size: 4KB to 1MB
  Max File Size: 16TB
  Max Number of Subdirectories: 32,000
  Max Number of Addressable Clusters: 2^32 (a future release will increase the limit to 2^64)
  Max File System Size: 4PB* (1MB cluster size)
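
The 4PB maximum follows from the two limits listed above: 2^32 addressable clusters multiplied by the largest cluster size of 1MB. The short sketch below only checks that arithmetic; the function name is illustrative and is not part of any OCFS2 tool.

```python
# Illustrative arithmetic only: maximum volume size = cluster size x cluster count.
KB = 1024
MB = 1024 * KB
TB = 1024 ** 4
PB = 1024 ** 5

def max_filesystem_bytes(cluster_size_bytes, addressable_clusters=2 ** 32):
    """Return the largest volume size reachable with the given cluster size."""
    return cluster_size_bytes * addressable_clusters

largest = max_filesystem_bytes(1 * MB)    # 1MB clusters, the upper bound
smallest = max_filesystem_bytes(4 * KB)   # 4KB clusters, the lower bound

print(largest // PB, "PB with 1MB clusters")
print(smallest // TB, "TB with 4KB clusters")
```

Under the same formula, a smaller cluster size lowers the ceiling proportionally, which is why the 4PB figure is quoted for the 1MB cluster size.
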

Version Compatibility
  Compatibility between OCFS2 versions: OCFS2 strives for backwards compatibility with older versions. OCFS2 Release 1.6 is fully compatible with OCFS2 Releases 1.4 and 1.2. A node with the new release can join a cluster of nodes running the older file system.
  OCFS2 Release 1.6 is on-disk compatible with OCFS2 Release 1.2. However, not all the new features are activated by default; users can enable and disable features as needed using tunefs.ocfs2. The latest version of ocfs2-tools supports all existing versions of the file system.


OCFS2 Features
Feature: Details
Variable Block and Cluster Sizes: Supports block sizes ranging from 512 bytes to 4KB and cluster sizes ranging from 4KB to 1MB.
Extent-based Allocations: Tracks the allocated space in ranges of clusters, making it especially efficient for storing large files.
Metadata Checksums: Ensures integrity by detecting silent corruption in metadata objects such as inodes and directories. The error correction code is capable of fixing single-bit errors automatically.
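
To illustrate how an error correction code can locate and repair a single flipped bit, here is a minimal sketch using a textbook Hamming(7,4) code. This is an assumption chosen for illustration: it is not OCFS2's on-disk format or checksum code, and the function names are hypothetical.

```python
# Textbook Hamming(7,4): 4 data bits protected by 3 parity bits.
# Illustration only; not the code OCFS2 uses for its metadata blocks.

def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based error position or 0 if clean)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome names the bad position
    if position:
        c[position - 1] ^= 1          # flip the damaged bit back
    return c, position

word = hamming74_encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1                        # simulate a silent single-bit flip
repaired, where = hamming74_correct(damaged)
print("repaired:", repaired == word, "at position", where)
```

The parity bits are arranged so that the failed checks, read as a binary number, give the position of the damaged bit. That is what makes automatic repair of one-bit errors possible without a second copy of the data.
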
Extended Attributes: Supports attaching an unlimited number of name:value pairs to file system objects such as regular files, directories, and symbolic links.
Advanced Security: Supports POSIX ACLs (Access Control Lists) and SELinux attributes in addition to the traditional file access permission/ownership model.
Quotas: Supports user and group quotas.
File Snapshots (REFLINK): This feature allows a regular user to create multiple writeable snapshots of regular files. The snapshot created is a point-in-time image of the file that includes both the file data and all its attributes (including extended attributes). The file system creates a new inode with the same extent pointers as the original inode. Multiple inodes are thus able to share data extents. Because of this, creating a REFLINK snapshot requires very little space initially. Space usage grows only when a snapshot is modified, using a copy-on-write mechanism. REFLINK works across the cluster.
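
The extent sharing and copy-on-write behaviour described above can be sketched with a toy in-memory model. Everything here (the `ToyVolume` class, its methods, the reference counts) is a hypothetical illustration of the idea, not OCFS2's data structures or API.

```python
# Toy model: inodes hold lists of extent ids; extents are shared until written.
class ToyVolume:
    def __init__(self):
        self.extents = {}     # extent id -> data
        self.refcount = {}    # extent id -> number of inodes pointing at it
        self.inodes = {}      # file name -> list of extent ids
        self._next_id = 0

    def _new_extent(self, data):
        extent_id = self._next_id
        self._next_id += 1
        self.extents[extent_id] = data
        self.refcount[extent_id] = 1
        return extent_id

    def create(self, name, chunks):
        self.inodes[name] = [self._new_extent(c) for c in chunks]

    def reflink(self, source, snapshot):
        """New inode, same extent pointers: no data is copied."""
        self.inodes[snapshot] = list(self.inodes[source])
        for extent_id in self.inodes[snapshot]:
            self.refcount[extent_id] += 1

    def write(self, name, index, data):
        """Copy-on-write: a shared extent is replaced, never modified in place."""
        extent_id = self.inodes[name][index]
        if self.refcount[extent_id] > 1:
            self.refcount[extent_id] -= 1
            self.inodes[name][index] = self._new_extent(data)
        else:
            self.extents[extent_id] = data

    def read(self, name):
        return b"".join(self.extents[e] for e in self.inodes[name])

vol = ToyVolume()
vol.create("db.img", [b"AAAA", b"BBBB"])
vol.reflink("db.img", "db.snap")
extents_after_snapshot = len(vol.extents)   # snapshot added no extents
vol.write("db.snap", 1, b"CCCC")            # only the modified extent is copied
print(vol.read("db.img"), vol.read("db.snap"), len(vol.extents))
```

In this model the snapshot costs only a new list of pointers, and space grows by one extent per modified region, which mirrors the behaviour the feature description attributes to REFLINK.
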
Journaling: Supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.
  • Ordered Journal Mode
    This new default journal mode (mount option data=ordered) forces the file system to flush file data to disk before committing the corresponding metadata. This flushing ensures that data written to newly allocated regions will not be lost due to a file system crash. While this feature removes the small probability of stale or null data appearing in a file after a crash, it does so at the expense of some performance. Users can revert to the older journal mode by mounting with the data=writeback mount option. File system metadata integrity is preserved by both journaling modes.
In-built Cluster-stack with DLM: Includes an easy-to-configure, in-kernel cluster stack with a distributed lock manager.
Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os: Supports all modes of I/O for maximum flexibility and performance.
Large Inodes: Block-sized inodes allow the file system to store small files in the inode itself.
Comprehensive Tools Support: Provides a familiar EXT3-style toolset that uses similar parameters for ease of use. The toolset is cluster-aware in that it prevents users from formatting a volume from a node if the volume is in use on some other node. Other tools such as tunefs.ocfs2 detect active volumes and allow only operations that can be performed on a live volume.
Performance Enhancements: Enhances performance by either reducing the number of I/Os or by doing them asynchronously.
  • Indexed Directories – This feature allows a user to perform quick lookups of a directory entry in a very large directory. It also results in faster creates and unlinks and thus provides better overall performance.
  • Directory Readahead - Directory operations asynchronously read the blocks that may get accessed in the future.
  • File Lookup - Improves cold-cache stat(2) times by cutting the required amount of disk I/O in half.
  • File Remove and Rename - Replaces broadcast file system messages with DLM locks for unlink(2) and rename(2) operations. This improves node scalability, as the number of messages does not grow with the number of nodes in the cluster.
Splice I/O: Adds support for the new splice(2) system call. This allows for efficient copying between file descriptors by moving the data within the kernel.
Access Time Updates: Access times are now updated consistently and are propagated throughout the cluster. Since such updates can have a negative performance impact, the file system allows users to tune them via the following mount options:
  • atime_quantum=<number_of_secs> - Defaults to 60 seconds. OCFS2 will not update atime unless this number of seconds has passed since the last update. Set to zero to always update it.
  • noatime - This standard mount option turns off atime updates completely.
  • relatime - This is another standard mount option (added in Linux v2.6.20) supported by OCFS2. Relative atime only updates the atime if the previous atime is older than the mtime or ctime. This is useful for applications that only need to know that a file has been read since it was last modified.
Additionally, all time updates in the file system have nanosecond resolution.
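
The three access-time policies above can be summarized as a small decision function. This is a simplified model of the rules as stated in this section, not kernel code: the function name and the mode strings are hypothetical, the three options are treated as mutually exclusive modes for clarity, and real kernels may apply further rules that are not modelled here.

```python
# Simplified model of the atime policies described above (times in seconds).
def should_update_atime(atime, mtime, ctime, now,
                        mode="atime_quantum", atime_quantum=60):
    """Decide whether a read should update the access time."""
    if mode == "noatime":
        # atime updates are turned off completely
        return False
    if mode == "relatime":
        # update only if the previous atime is older than mtime or ctime
        return atime < mtime or atime < ctime
    # default policy: update once atime_quantum seconds have passed;
    # a quantum of zero therefore always updates
    return (now - atime) >= atime_quantum

# A file last read at t=100 and last modified at t=50, read again at t=130:
print(should_update_atime(100, 50, 50, now=130))                   # quantum not yet reached
print(should_update_atime(100, 50, 50, now=130, atime_quantum=0))  # always update
print(should_update_atime(100, 120, 50, now=130, mode="relatime")) # modified since last read
```
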
Flexible Allocation: The file system now supports some advanced features that are intended to allow users more control over file data allocation. These features entail an on-disk change.
  • Sparse File Support - It adds the ability to support holes in files. This allows the ftruncate(2) system call to efficiently extend files. The file system can postpone allocating space until the user actually writes to those clusters.
  • Unwritten Extents - It adds the ability for an application to request a range of clusters to be pre-allocated, but not initialized, within a file. Pre-allocation allows the file system to optimize the data layout with fewer, larger extents. It also provides a performance boost, delaying initialization until the user writes to the clusters. Users can access these features via an ioctl(2), or via fallocate(2) on current kernels.
  • Punching Holes - It adds the ability for an application to remove arbitrary allocated regions within a file, essentially creating holes. This can be more efficient than zeroing the data. Users can access these features via an ioctl(2), or via fallocate(2) on later kernels.
  • Discontiguous Block Group - It allows the allocation of space for inodes to grow in smaller, variable-sized chunks.
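
As a sketch of the sparse file behaviour described above, the following extends an empty file with ftruncate(2) and compares its apparent size with the space actually allocated. It runs against a temporary directory on whatever file system hosts it, so the allocation figure depends on that file system; the helper names are illustrative. Pre-allocation and hole punching are not shown.

```python
import os
import tempfile

def create_sparse_file(path, size):
    """Extend an empty file to `size` bytes with ftruncate(2); no data is written."""
    with open(path, "wb") as f:
        os.ftruncate(f.fileno(), size)

def describe(path):
    """Report apparent size and allocated bytes (st_blocks is in 512-byte units on Linux)."""
    st = os.stat(path)
    return {"apparent_bytes": st.st_size, "allocated_bytes": st.st_blocks * 512}

SIZE = 64 * 1024 * 1024

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.dat")
    create_sparse_file(path, SIZE)
    info = describe(path)
    # Reading from the hole returns zeros even though nothing was written there.
    with open(path, "rb") as f:
        f.seek(SIZE // 2)
        hole = f.read(4096)

print(info)
print("hole reads as zeros:", hole == b"\0" * 4096)
```

On a file system with sparse file support, the allocated figure is expected to stay far below the apparent size until data is written into the hole.
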
Shared Writeable mmap(2): Shared writeable memory mappings are now fully supported on OCFS2. The file system supports cluster-coherent shared writeable mmap. Processes on different nodes can mmap() a file and write to the memory region, fully expecting the writes to transparently show up on other nodes.
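
A minimal single-node sketch of a shared writeable mapping is shown below: one writer stores bytes through the mapping and a separate read of the file observes them. It uses a temporary file on the local file system; the cross-node visibility described above is an OCFS2 property that this sketch does not demonstrate.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.dat")

    # The file must already be at least as long as the region to be mapped.
    with open(path, "wb") as f:
        f.write(b"\0" * mmap.PAGESIZE)

    # Map the file shared and writeable, then store through the mapping.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), mmap.PAGESIZE,
                       flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as region:
            region[0:5] = b"hello"
            region.flush()          # push the dirty page back to the file

    # An independent reader sees the bytes written through the mapping.
    with open(path, "rb") as f:
        data = f.read(5)

print(data)
```
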
Inline Data: This feature makes use of OCFS2's large inodes by storing the data of small files and directories in the inode block itself. This saves space and can have a positive impact on cold-cache directory and file operations. Data is transparently moved out to an extent when it no longer fits inside the inode block. This feature entails an on-disk change.
Online File System Resize: Users can now grow the file system without having to unmount it. This feature requires a compatible clustered logical volume manager. Compatible volume managers will be announced when support is available.
Clustered flock(2): The flock(2) system call is now cluster-aware. File locks taken on one node from user space will interact with those taken on other nodes. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate kill signal is received.
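
The locking interaction described above can be sketched on a single node with two independent opens of the same file standing in for two lock holders. This assumes Linux flock(2) semantics on a local file system; on OCFS2 the same calls are described as interacting across nodes, which this sketch does not exercise. The `try_lock` helper is illustrative.

```python
import fcntl
import os
import tempfile

def try_lock(fd):
    """Attempt a non-blocking exclusive flock(2); return True if acquired."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lockfile")
    # Two separate open() calls give two open file descriptions,
    # which flock(2) treats as independent lock owners.
    holder = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    contender = os.open(path, os.O_RDWR)

    first = try_lock(holder)          # lock is free, so this succeeds
    second = try_lock(contender)      # refused while the holder keeps the lock
    fcntl.flock(holder, fcntl.LOCK_UN)
    third = try_lock(contender)       # succeeds once the lock is released

    os.close(holder)
    os.close(contender)

print(first, second, third)
```
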
Endian and Architecture Neutral: Supports a cluster of nodes with mixed architectures. Allows concurrent mounts on 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.
Linux Community Adoption: OCFS2 has been ported to many architectures, including ppc64, ia64, and s390x, and it is also integrated with many Linux distributions, including SLES, Ubuntu, openSUSE, Fedora Core, and Debian.
* Theoretical Maximum. File systems up to 16TB have been tested.



 