By Jeff Wright
Data is a critical business asset. As organizations grow, their data storage requirements increase and subsequently drive increases in IT infrastructure. To maintain or increase profitability during times of growth, IT organizations must effectively manage data storage and protection. Often, data growth results when disparate users or organizations create separate copies of the same data. In this case, storage system deduplication technology can provide a dramatic increase in storage efficiency, reducing the cost to support growth.
This paper describes the technical aspects of the Sun ZFS Storage Appliance deduplication feature and how to apply this feature to solve common IT storage challenges.
Data deduplication is a companion technology to data compression. It removes redundancy from stored data in addition to that which is removed by data compression alone.
Architecturally, data compression works by applying a transformation function, such as
gzip, to a block of data. The transformation function identifies patterns of data that can be stored as smaller amounts of data along with input values to the transformation function. Compression is especially effective for ASCII text due to the regular patterns of bits common to ASCII characters. Thus, compression makes possible a significant reduction in the capacity required to store a single file.
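As a concrete illustration of the transformation step described above, the following Python sketch uses the standard zlib library (which implements the same DEFLATE algorithm as gzip) to show how repetitive ASCII text shrinks under compression:

```python
import zlib

# Repetitive ASCII text, the kind of regular bit pattern compression exploits
text = b"The quick brown fox jumps over the lazy dog. " * 200

packed = zlib.compress(text)   # apply the DEFLATE transformation function
print(len(text), len(packed))  # the compressed form is far smaller

# The transformation is reversible: the original file is fully recoverable
assert zlib.decompress(packed) == text
```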
Although compression is an important first step in storage capacity management, it reduces capacity requirements only when data is first created. The redundancy introduced when compressed data is copied to duplicate it for other purposes cannot be removed by applying compression. Consequently, growth due to duplication of information must be managed through additional technology. Data deduplication, which is designed to remove redundancy resulting from business processes that duplicate information, addresses the issue of capacity growth. Combined with compression, deduplication offers a complete solution to maximize storage efficiency.
Data deduplication removes redundant copies of information by storing multiple copies of a set of data as a single copy along with an index of references to the copy. For business processes that generate duplicate information, such as deploying operating systems, storing the individual data used by different members of an organization, or maintaining system backups, data deduplication is a logical technological solution. In business processes such as these, small unique data sets are distributed to many different consumers and stored as many separate but identical copies, driving rapid growth in physical storage capacity.
Deduplication counters growth resulting from storage of redundant data by physically storing unique data a single time and then storing subsequent copies of this data as indices pointing to the unique data. As a result, 100 copies of a 10-GB operating system can be stored with 10 GB of physical capacity; 1000 copies of the same 1-MB PDF file can be stored with 1 MB of storage; and backups of many similar client systems can be stored with a fraction of the space that would otherwise be required without deduplication.
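The store-once, reference-many mechanism described above can be sketched in a few lines of Python. This is a toy model for illustration (not the appliance's actual implementation): each unique block is stored once, keyed by a content hash, and every copy is recorded as a list of references to those blocks:

```python
import hashlib

class DedupStore:
    """Toy block-level dedup store: unique blocks are kept once,
    keyed by content hash; copies become lists of references."""

    def __init__(self):
        self.blocks = {}  # hash -> block contents (stored once)
        self.refs = []    # one list of block hashes per logical write

    def write(self, data, block_size=4096):
        hashes = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)  # store only if not seen before
            hashes.append(h)
        self.refs.append(hashes)

    def physical_size(self):  # capacity actually consumed
        return sum(len(b) for b in self.blocks.values())

    def logical_size(self):   # capacity the copies appear to consume
        return sum(len(self.blocks[h]) for hashes in self.refs for h in hashes)

store = DedupStore()
image = b"".join(bytes([i]) * 4096 for i in range(10))  # a 40 KB "OS image"
for _ in range(100):                                    # 100 identical copies
    store.write(image)
print(store.logical_size() // store.physical_size())    # dedup ratio: 100
```

As in the operating-system example above, 100 identical copies consume the physical capacity of one copy plus the (small) index of references.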
Implementations of data deduplication vary among vendors. Practical deduplication implementations differ in two key ways:
Deduplication can be applied as the data is written to the storage target (synchronous) or during a background process that analyzes the data after creation (asynchronous). In general, users of systems with synchronous deduplication realize an immediate benefit in capacity utilization, while users of asynchronous deduplication must provision extra storage to keep redundant copies of information available until post-processing can consolidate them.
Deduplication granularity can be executed either at the block level or the file level. Block-level deduplication works by inspecting the contents of a file and removing redundancy both within a file and between files. File-level deduplication only removes redundancy between files and cannot remove redundancy within a specific file. Block-level deduplication provides a broader and more general implementation that outperforms simple file-level deduplication.
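The difference in granularity can be illustrated with a short Python sketch (a simplified model, not vendor code). Two files that share most of their content but each end in a unique block defeat file-level deduplication entirely, while block-level deduplication collapses the shared blocks, including repeated blocks within each file:

```python
import hashlib

BLOCK = 4096

def unique_bytes_file_level(files):
    # File-level dedup: hash whole files; only identical files collapse
    seen = {hashlib.sha256(f).hexdigest(): len(f) for f in files}
    return sum(seen.values())

def unique_bytes_block_level(files):
    # Block-level dedup: any repeated block collapses, within or across files
    seen, total = set(), 0
    for f in files:
        for i in range(0, len(f), BLOCK):
            block = f[i:i + BLOCK]
            h = hashlib.sha256(block).hexdigest()
            if h not in seen:
                seen.add(h)
                total += len(block)
    return total

shared = bytes([1]) * BLOCK * 8           # 32 KB of repeated, shared content
a = shared + bytes([2]) * BLOCK           # file A: shared content + unique tail
b = shared + bytes([3]) * BLOCK           # file B: shared content + unique tail

print(unique_bytes_file_level([a, b]))    # 73728: files differ, nothing saved
print(unique_bytes_block_level([a, b]))   # 12288: shared and repeated blocks
                                          # stored once each
```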
The data deduplication feature provided in the Sun ZFS Storage Appliance is available with Software Release 2010.Q1. This feature is implemented to provide synchronous block-level deduplication and is designed to be applicable to any data stored on the appliance. Deduplication may be applied to file- and block-accessed shares, such as NFS, CIFS, iSCSI, or Fibre Channel. From an operational perspective, it may be enabled or disabled at any time for any share. It may be used in combination with any other Sun ZFS Storage Appliance feature, including compression, snapshot, and remote replication.
To help ensure successful deployment of Sun ZFS Storage Appliance deduplication, the following sections of this paper describe the types of environments most suited to this feature, the expected effectiveness (dedup ratio), implementation and configuration guidelines, and performance expectations.
The Sun ZFS Storage Appliance deduplication feature may be applied to any share supporting any workload. In practice, Sun ZFS Storage Appliance deduplication is most effective for read-mostly copies of a specific data set, such as copies of the operating system images used to support virtual machines or identical copies of files in home directories.
Engineering tests of virtualized environments based on VMware and Oracle VM show near-ideal deduplication rates:
In addition to deploying operating system images, Sun ZFS Storage Appliance deduplication is also well-suited to general user data produced through business processes that copy information in order to share it. For example, home directories in which users store read-only copies of content that is distributed from a central resource, such as PDF documents distributed by human resources, are ideal candidates for Sun ZFS Storage Appliance deduplication.
In practice, the distribution of unique and duplicate content is implementation specific, so customers should consider their specific data usage to understand how to optimize deduplication for their specific environments. The Sun ZFS Storage Appliance includes a rich set of monitoring tools (called Analytics) that can show system planners how effective deduplication will be in their environment and how to best plan for growth with deduplication.
The synchronous and block-level implementation of Sun ZFS Storage Appliance deduplication is less effective for structured data, such as backup sets containing files from many different sources. Engineering tests using Oracle Recovery Manager (RMAN) and Symantec NetBackup demonstrated deduplication rates up to 3:1 with Sun ZFS Storage Appliance deduplication. Consequently, deduplication should be considered as a supplement to Sun ZFS Storage Appliance compression for structured backup data.
To perform deduplication efficiently, the Sun ZFS Storage Appliance requires the deduplication table (DDT) to be resident in memory (DRAM). If the DDT can be completely cached in the memory of the Sun ZFS Storage Appliance, throughput to and from shares with deduplication enabled is within 30% of the throughput available without deduplication enabled. If the DDT cannot be cached in memory, system throughput may be significantly lower and more variable because parts of the DDT must be recalled from flash or spinning media.
The deduplication ratio is the ratio of the total amount of data stored to the amount of unique data in a specific data set. This ratio is inherent to the data set, so, for the purposes of capacity planning, system architects must estimate or measure the deduplication ratio for their specific data sets. For example, for ten copies of an operating system, the deduplication ratio is 10:1 because one unique image is referenced ten times. (For more information, see Application Guidelines for Deduplication.)
Deduplication requires DRAM to run effectively. The practical available capacity for unique data can be estimated based on the following configurable options:
Given these parameters, the practical available capacity for unique data (C) in TB is calculated as follows:
C = 0.1 * D * S * R / 320
The factor 0.1 represents 10% of the available DRAM dedicated to the DDT, and the factor 320 is the size of a DDT entry in bytes. Table 1 shows sample calculations for several common configurations.
Table 1. Practical Capacity Available for Unique Data (TB)
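For planning purposes, the formula above can be wrapped in a small calculator. This is an illustration only: the units noted in the comments (D as DRAM in GB, R as share record size in KB, S as the remaining scale factor) are assumptions made here to match the TB result, not definitions taken from the configuration list.

```python
def practical_capacity_tb(d, s, r):
    """Practical capacity available for unique data, in TB, per the
    paper's formula C = 0.1 * D * S * R / 320.
    Assumed units (illustrative): d = DRAM in GB, r = share record
    size in KB, s = remaining configurable scale factor."""
    ddt_fraction = 0.1     # 10% of available DRAM dedicated to the DDT
    ddt_entry_bytes = 320  # size of one DDT entry in bytes
    return ddt_fraction * d * s * r / ddt_entry_bytes

# Smaller records consume DDT entries faster, so capacity per GB of
# DRAM is lower; larger records stretch the same DRAM much further.
print(round(practical_capacity_tb(128, 1, 8), 2))    # → 0.32
print(round(practical_capacity_tb(256, 1, 128), 2))  # → 10.24
```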
Performance of the Sun ZFS Storage Appliance with deduplication also depends on the share record size. In general, a smaller share record size (4 KB to 16 KB) results in lower throughput but provides a higher deduplication rate, while a larger share record size (64 KB to 128 KB) results in higher throughput but provides a lower deduplication rate. By appropriately configuring the share record size and system memory, the system planner can optimize system behavior.
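The record-size tradeoff can be demonstrated with a small simulation (a simplified model, not appliance behavior): ten near-identical 128-KB images, each differing in a single 4-KB region, deduplicate well when chunked into 4-KB records but not at all when each image is one 128-KB record, because a single changed byte makes the whole record unique:

```python
import hashlib

KB = 1024

def dedup_ratio(files, record_size):
    """Logical bytes divided by unique bytes when files are chunked into
    fixed-size records and each duplicate record is stored only once."""
    logical, physical, seen = 0, 0, set()
    for f in files:
        for i in range(0, len(f), record_size):
            rec = f[i:i + record_size]
            logical += len(rec)
            h = hashlib.sha256(rec).hexdigest()
            if h not in seen:
                seen.add(h)
                physical += len(rec)
    return logical / physical

# A 128 KB base "image" built from 32 distinct 4 KB blocks
base = b"".join(hashlib.sha256(bytes([i])).digest() * (4 * KB // 32)
                for i in range(32))

# Ten near-copies, each with a different 4 KB region overwritten
files = [base[:i * 4 * KB] + b"\xff" * (4 * KB) + base[(i + 1) * 4 * KB:]
         for i in range(10)]

print(round(dedup_ratio(files, 4 * KB), 2))    # small records: high ratio
print(round(dedup_ratio(files, 128 * KB), 2))  # large records: no sharing
```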
To further enhance system planning when using deduplication, the Sun ZFS Storage Appliance Analytics monitoring tools provide metrics that can be used to track system resources used by deduplication operations. Figure 1 shows a metric that tracks ZFS data management unit (DMU) operations per second broken down by DMU object type. Other metrics that provide insight into system dynamics when running deduplication include ARC accesses (broken down by hit and miss), L2ARC accesses (broken down by hit and miss), disk I/O operations (broken down by type), and disk I/O operations (broken down by state).
To accurately track storage response time and determine whether deduplication operations are changing it, the system planner can enable an Analytics metric that tracks protocol operations (such as NFSv3, iSCSI, or Fibre Channel) broken down by type, and then drill down on read operations and write operations by latency (see Figure 1).
Figure 1. Sun ZFS Storage Appliance Analytics for Tracking System Resources and Response Time When Using Deduplication
If the amount of data stored in a deduplicated share causes the DDT to grow beyond the available cache in a Sun ZFS Storage Appliance, system performance may be reduced.
Specific performance reductions related to deduplication include increased response times for the following types of I/O operations:
To ensure an optimal system, a recommended storage system configuration includes the following:
A high ratio of DRAM and flash capacity to spinning disk capacity helps ensure that the DDT is accessed from solid-state media, thus providing more predictable performance over the entire capacity range of the system.
Storage system deduplication is an important technology for managing physical capacity growth in data centers. When combined with data compression, it provides a compelling and effective solution for increasing efficiency. When deduplication is applied using the application and configuration guidelines presented in this paper, Sun ZFS Storage Appliance deduplication can save physical space with minimal reduction in system performance. When using deduplication, a system planner can enable the Advanced Analytics feature in the Sun ZFS Storage Appliance to monitor the relationships among system throughput, system response time, cache operations, and ZFS DMU operations to manage growth effectively while ensuring optimal system performance.
Revision 1.0, 03/09/2011