by Maureen Chew
Many SAS solutions are moving towards cloud and/or grid-enabled deployment architectures that require horizontally scaled servers and a high-performance shared file system. As SAS Grid Computing environments grow, the bottleneck is often I/O throughput from the shared file system. This article describes a flexible architecture that is designed and proven to meet the most demanding SAS Grid Computing workloads with a cost-effective solution. It provides performance characterizations for two SAS applications that compare three different network configurations with Oracle's Sun Blade 6000 modular system and Oracle's Sun ZFS Storage 7420 appliance.
Blade servers are a natural fit for SAS Grid Computing deployments because the environment lends itself to a horizontally scaled server architecture where server density, manageability, and cost efficiencies are top IT considerations for running batch SAS jobs.
As the SAS Grid Computing environment grows, however, one of the key challenges is I/O throughput for the shared file system. SAS Grid Computing requires a shared file system rather than block-based storage as in a Storage Area Network (SAN). For small SAS installations, a simple NFS file server works well, but these file servers typically cannot scale to deliver the I/O performance needed for SAS Grid Computing.
Many enterprises with large deployments have, therefore, been compelled to move to cluster file systems, which add cost and complexity to the solution architecture. In a typical cluster file system, for example, every client node (SAS compute node) must have a host bus adapter (HBA) card installed to connect it to the shared Fibre Channel storage. Maintaining the cluster file system and tuning it for high performance also requires a highly trained storage administrator, adding to the total cost of ownership.
SAS and Oracle have designed a cost-effective architecture that delivers enough storage I/O throughput and blade server processing power to meet even the most demanding SAS Grid Computing needs. The architecture is based on blade servers and unified storage platforms from Oracle that are simple to deploy and manage.
The solution architecture takes advantage of the Sun Blade 6000 modular system and the Sun ZFS Storage Appliance for high scalability and simplicity of deployment and management. As shown in Figure 1, a variety of network configurations can be deployed depending on the application needs.
Figure 1. SAS Grid Computing Solution Architecture
The Sun Blade 6000 modular system was chosen for its extreme flexibility and maximum performance in a small footprint with easy manageability. It supports a wide choice of compute and I/O modules, enabling SAS customers to scale the capacity of both processing power and I/O throughput with fine or coarse granularity.
SAS Grid Computing requires a shared file system, and the Sun ZFS Storage Appliance is an excellent fit because it offers the performance of a complex cluster file system while providing the simplicity of an NFS file server. The Sun ZFS Storage Appliance presents a common shared file system to all SAS Grid Computing nodes through NFS. (Other protocols, such as CIFS, HTTP, FTP, WebDAV, and iSCSI, are also available and are included in the appliance at no additional cost.) Sun ZFS Storage Appliances come preconfigured and ready to run, so they can often be deployed in minutes. Their excellent storage monitoring capabilities also offer an important advantage for SAS Grid Computing environments.
DTrace Analytics in Sun ZFS Storage Appliances provides the ability to drill down into a detailed view of I/O traffic, as illustrated by the screenshots in the performance results sections below. It can provide valuable insight at a comprehensive level about I/O resources per node, per job, and so on. This real-time access to detailed analytics helps SAS administrators stay on top of usage patterns and system performance in their SAS environment as workloads change, and it also enables them to quickly identify and fix performance issues. In addition, knowledge about I/O traffic patterns enables accurate planning for future storage capacity and I/O throughput requirements.
There are four options for networking between the Sun Blade 6000 modular system and the Sun ZFS Storage Appliance:
Figure 2 illustrates an aggregation of four Ethernet ports into a single logical datalink. In this example, the I/O devices with the names ixgbe0 through ixgbe3 have been aggregated into a logical interface called aggr1.
As annotated on Figure 2, when a link is selected, the Sun ZFS Storage Appliance management interface automatically highlights all the components that comprise that link, making it easy for administrators to understand the network configuration. When data is sent over the aggregated link, the Sun ZFS Storage Appliance works with the network switch to deliver the I/O packets across all four Ethernet links.
Figure 2. Network Configuration and Link Aggregation
Implementing the Ethernet connections on the Sun Blade 6000 modular system is accomplished by using the Sun Blade 6000 Ethernet Switched Network Express Module (NEM) 24p 10GbE to provide a 10 GbE non-blocking concurrent switching network fabric. The Sun Blade 6000 Ethernet Switched NEM 24p 10GbE is integrated into the Sun Blade 6000 chassis, enabling all server blades in the chassis to have access to the Ethernet interface. For more information, see the "Sun Blade 6000 I/O and Management Architecture" white paper, which is available on the Oracle Technology Network.
The InfiniBand network fabric consists of the Sun Datacenter InfiniBand Switch 36, 10 Sun dual-port 4x QDR PCIe modules (one for each Sun Blade X6270 M2 server module), a 4x QDR PCIe card for the Sun ZFS Storage 7420 appliance (dual-ported to the switch), and associated cabling.
In the test configuration, the Sun Blade 6000 modular system was deployed with the following hardware and software components:
The Sun ZFS Storage 7420 appliance was configured as follows:
Disk mirroring is a desirable configuration for SAS environments that require extra data protection. The test results described later in this document illustrate that excellent I/O throughput can be achieved with mirroring on the Sun ZFS Storage 7420 appliance, making mirroring a viable option for SAS users.
Figure 3. Storage Pool Configuration
Three file systems were allocated from the single ZFS mirrored pool:
The SAS work file system (/work) contains the SASWORK directory of temporary runtime files. These files are heavily used by SAS applications and are typically deployed on a local disk on each SAS Grid node to avoid network and storage I/O contention.
For this test, the SASWORK directory was purposely placed on the Sun ZFS Storage 7420 appliance in an attempt to maximize I/O throughput of the Sun ZFS Storage 7420 appliance. The test results below show that the Sun ZFS Storage 7420 appliance can be a viable and practical place for the SASWORK directory on an NFS-based storage system. This is especially useful for large and volatile SASWORK directories that need to be available for the general population of SAS users.
Both of the workloads were tested with both 10 GbE networking and an InfiniBand network fabric.
Each of the 10 Sun Blade X6270 M2 server modules contained both a single 10 GbE interface and an InfiniBand interface. The network configuration was selected simply by modifying the client NFS V3 mounts to switch among the Sun ZFS Storage 7420 appliance mount points assigned to the various network interfaces.
All other variables (application, workload, servers, and so on) remained unchanged for all the network configurations tested.
To optimize performance, the following tuning modifications were done on each server node.
The following changes were made to the /kernel/drv/ixgbe.conf file for the 10 GbE driver:
Jumbo Frames were enabled through the default_mtu setting. (See "Configure Jumbo Frames in Solaris OS" in the "Configuring the Driver Parameters" section of the Sun Dual 10GbE SFP+ PCIe 2.0 Low Profile Adapter User's Guide.) A value of 9000 was used because it matches the Jumbo Frame size on the Sun ZFS Storage 7420 appliance.
The intr_throttling setting is related to disabling interrupt blanking. With interrupt blanking enabled, interrupts are grouped or batched together. (See "ixgbe Parameters" in the "Network Driver Parameters" section of Chapter 2, "Oracle Solaris Kernel Tunable Parameters" in the Oracle Solaris Tunable Parameters Reference Manual.)
The following was changed in the InfiniBand /kernel/drv/ibd.conf file. (See ibd(7d) in the "Devices and Network Interfaces" section of the man pages section 7: Device and Network Interfaces manual.)
The "1" entries correspond to the two InfiniBand ports on each host channel adapter (HCA) card and specify the use of connected mode. Connected mode for IP over InfiniBand (IPoIB) can provide better performance than datagram mode.
Additionally, the following NFS changes were made in /etc/system, which set the NFS client logical buffer size and read-ahead operations. (See Chapter 3, "NFS Module Parameters," of the Oracle Solaris Tunable Parameters Reference Manual.)
set nfs:nfs3_bsize=131072
set nfs:nfs3_nra=32
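To put the two values in more familiar units, the short Python sketch below converts them. It assumes that each read-ahead operation covers one logical block, so the product of the two settings is a rough upper bound on read-ahead data in flight per file. That interpretation is an assumption for illustration and is not stated in this article; the function name is hypothetical.

```python
# Values taken from the /etc/system settings shown above.
NFS3_BSIZE = 131072  # NFS V3 client logical block size, in bytes
NFS3_NRA = 32        # number of read-ahead operations queued


def read_ahead_window_bytes(bsize: int, nra: int) -> int:
    """Assumed upper bound: one logical block per read-ahead operation."""
    return bsize * nra


bsize_kib = NFS3_BSIZE // 1024
window_mib = read_ahead_window_bytes(NFS3_BSIZE, NFS3_NRA) // (1024 * 1024)

print(f"logical block size: {bsize_kib} KiB")
print(f"assumed read-ahead window: {window_mib} MiB")
```

Under that assumption, the settings correspond to 128 KiB blocks and roughly 4 MiB of read-ahead per sequentially read file.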
In order to switch testing between 10 GbE and InfiniBand, we unmounted and remounted the correct NFS server assignments. For instance, /etc/vfstab would have the following entry for 10 GbE:
s7420-011ge2:/export/sasgeno_data - /data nfs - yes vers=3,noxattr
s7420-011ge2:/export/sasgeno_swsg - /apps nfs - yes vers=3,noxattr
s7420-011ge2:/export/sasgeno_work - /work nfs - yes vers=3,noxattr
Or it would have entries similar to the following for InfiniBand:
s7420-011ib0:/export/sasgeno_data - /data nfs - yes vers=3,noxattr,proto=tcp
s7420-011ib0:/export/sasgeno_swsg - /apps nfs - yes vers=3,noxattr,proto=tcp
s7420-011ib0:/export/sasgeno_work - /work nfs - yes vers=3,noxattr,proto=tcp
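Because the two sets of entries differ only in the server host name and the mount options, they can be generated from a small table. The following Python sketch is one way to do that, not the procedure used in the tests. The host names, exports, and options are copied from the entries above; the helper name vfstab_entries is hypothetical, and the sketch assumes /work is the SASWORK mount point for both fabrics.

```python
# Exports and mount points as listed in the /etc/vfstab entries above
# (assumption: /work is the SASWORK mount point for both fabrics).
EXPORTS = [
    ("/export/sasgeno_data", "/data"),
    ("/export/sasgeno_swsg", "/apps"),
    ("/export/sasgeno_work", "/work"),
]

# Per-fabric server host name and mount options, copied from the entries above.
FABRICS = {
    "10gbe": ("s7420-011ge2", "vers=3,noxattr"),
    "infiniband": ("s7420-011ib0", "vers=3,noxattr,proto=tcp"),
}


def vfstab_entries(fabric: str) -> list[str]:
    """Return one vfstab line per export for the chosen fabric (hypothetical helper)."""
    server, options = FABRICS[fabric]
    # Fields: device to mount, device to fsck, mount point, FS type,
    # fsck pass, mount at boot, mount options.
    return [
        f"{server}:{export} - {mount} nfs - yes {options}"
        for export, mount in EXPORTS
    ]


for line in vfstab_entries("infiniband"):
    print(line)
```

Keeping the mount points identical across fabrics means that nothing on the SAS side has to change when the file systems are unmounted and remounted over a different network.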
The SAS Grid Mixed Analytic workload represents the type of SAS processing commonly seen in grid-based SAS applications today. More than 130 resource-intensive jobs are submitted to the grid. The number of concurrent jobs per node is configurable by SAS Grid Manager. For this workload test, 40 concurrent jobs were selected. Many of the I/O intensive tests in the SAS Grid Mixed Analytic workload involve multiple passes through extremely large data sets.
This workload heavily utilizes the SAS data step and many commonly used procedures in Base SAS and SAS/STAT. Common characteristics of this type of workload include the following:
The test simulates roughly 30 CPU hours that represent a typical workday for a team of analytic users. The user personas that these jobs are composed of include the following:
| SAS PROCS used | MIXED, RANK, LOGISTIC, REG, GLM, SORT, SUMMARY, FREQ, MEANS, SQL |
| Data input | Approximately 50 distinct files totaling almost 600 GB; 24 of the files are between 12 GB and 50 GB. |
| Data output | Total I/O is divided roughly evenly between input and output; the output is written to both SASWORK and designated output directories. |
The SAS Grid Mixed Analytic workload comprises more than 130 SAS applications that are submitted to the grid with predefined timing delays to represent a typical period of heavy usage. The cumulative total of all the job runtimes represents 30 CPU hours, but with the power of the grid, the test is completed in roughly 45 minutes.
For the SAS Grid Mixed Analytic workload, the highest overall throughput was delivered with the InfiniBand network fabric. As Figure 4 illustrates, peak throughput results for the various network fabrics were as follows:
Figure 4. NFS Network Fabric I/O Performance Comparison for the SAS Grid Mixed Analytic Workload
This performance data was gathered using DTrace Analytics, the industry's only comprehensive and intuitive storage analytics environment. The screenshots in Figures 5 and 6 below were captured by DTrace Analytics on the Sun ZFS Storage 7420 appliance along with other data to capture a complete I/O profile for the SAS Grid Mixed Analytic workload. Figure 5 shows the I/O throughput for the InfiniBand configuration, and Figure 6 shows the I/O throughput for the quad 10 GbE test configuration.
Note that DTrace Analytics plotted the throughput broken down by direction ("in" = write, "out" = read), enabling administrators to see the balance between reads and writes. The I/O throughput is plotted over time, giving administrators the ability to see when bursts of I/O operations happen and whether the peak throughput is a sharp spike or a gradual buildup. While there are significant peaks in both cases, throughput is well sustained for both the InfiniBand and the quad 10 GbE configurations.
Figure 5. Peak I/O Throughput for InfiniBand (Plotted with DTrace Analytics)
Figure 6. Peak I/O Throughput for Quad 10 GbE (Plotted with DTrace Analytics)
DTrace Analytics also enables I/O to be broken down by a wide range of additional variables, such as client, file name, packet size, and more. A breakdown of activity by client, for example, could help an administrator determine whether a particular node behaved unexpectedly or whether it was scheduled with jobs that had unbalanced I/O requirements.
To demonstrate high throughput and scalable I/O, a Concurrent SAS Data Step microbenchmark was used in addition to the SAS Grid Mixed Analytic workload. This test consists of running multiple instances of a single, write-intensive SAS data step that outputs 10 million observations. Each observation is 824 bytes (consisting of 103 variables), and 10 million observations result in an 8.3 GB data set. Throughput is increased by running 10 instances concurrently on each of 10 nodes for a total of 100 concurrent jobs. This microbenchmark does write I/O almost exclusively.
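The data volumes follow directly from the figures above, as the Python sketch below shows. It computes only the raw observation payload; the small gap between that result and the 8.3 GB quoted for the data set is presumably file overhead, which is an assumption rather than something this article states.

```python
# Figures taken from the microbenchmark description above.
OBSERVATIONS = 10_000_000
BYTES_PER_OBSERVATION = 824
JOBS = 100  # 10 instances on each of 10 nodes

raw_bytes_per_job = OBSERVATIONS * BYTES_PER_OBSERVATION
raw_gb_per_job = raw_bytes_per_job / 1e9   # decimal gigabytes
total_raw_gb = raw_gb_per_job * JOBS

print(f"raw payload per job: {raw_gb_per_job:.2f} GB")
print(f"raw payload for {JOBS} jobs: {total_raw_gb:.0f} GB")
```

The raw payload works out to about 8.24 GB per job and about 824 GB across all 100 jobs.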
The microbenchmark is realistic in the sense that it represents typical SAS data step code. Yet it is contrived in the sense that all jobs are launched simultaneously under controlled circumstances. The benchmark's usefulness is in generating anecdotal performance proof points for a simple and known workload that is uniformly distributed in a predictable pattern.
The aim of the Concurrent SAS Data Step microbenchmark was to compare NFS over different network fabrics and to demonstrate I/O scalability.
Figure 7 shows the average time for completion of 100 identical SAS jobs when using SASWORK over NFS via the single 10 GbE interface, the quad 10 GbE link aggregation interface, and the InfiniBand interface.
Performance was roughly 10% better for the Sun ZFS Storage 7420 appliance's quad 10 GbE interface. Although performance averages were better for the quad 10 GbE interface, we did see higher peak throughputs for InfiniBand, as shown in Figure 8.
Figure 7. Average Job Completion Times Across the Three Different NFS Network Fabrics
Figure 8. Peak I/O Throughput for the Three Different NFS Network Fabrics
DTrace Analytics provides additional insight into the I/O performance differences between 10 GbE and InfiniBand. Figure 9 shows an I/O profile for a quad 10 GbE test followed by a run over InfiniBand. While the I/O profile runtimes are close in length, notice that the InfiniBand profile on the right is denser and exhibits a higher peak throughput.
Figure 9. I/O Profile for Quad 10 GbE Test Versus InfiniBand Test
For a contextual baseline, maximum throughput performance expectations for different network fabrics are on the order of the following:
Many factors affect InfiniBand throughput, such as cable type, switch, host adapter, and so on. In the tested configuration, 3 GB/sec is a reasonable estimate of the practical upper limit due to the unidirectional limit of the InfiniBand network card that was used.
Under perfect conditions, the quad 10 GbE link aggregation (4 x ~1 GB/sec = 4 GB/sec) might be expected to outperform the InfiniBand configuration by roughly 20%. The results for this microbenchmark do show better performance for the quad 10 GbE link aggregation. However, the difference is on the order of 5% to 10% rather than 20%. As described in the upcoming I/O Considerations section, the number of disk spindles in the tested configuration appears to have been a key limiting factor in scaling I/O throughput. This factor would tend to equalize performance across the two network fabrics.
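The "roughly 20%" expectation can be checked against the bandwidth figures quoted in this article (about 4 GB/sec for the quad 10 GbE aggregation and 3 GB/sec to 3.5 GB/sec for a single InfiniBand card). The Python sketch below is only that arithmetic; the function name is hypothetical.

```python
QUAD_10GBE_GB_S = 4.0                 # 4 x ~1 GB/sec
INFINIBAND_RANGE_GB_S = (3.0, 3.5)    # range quoted for a single InfiniBand PCIe card


def advantage_pct(faster: float, slower: float) -> float:
    """Percentage by which `faster` exceeds `slower`."""
    return (faster / slower - 1.0) * 100.0


for ib in INFINIBAND_RANGE_GB_S:
    pct = advantage_pct(QUAD_10GBE_GB_S, ib)
    print(f"vs InfiniBand at {ib} GB/sec: {pct:.0f}% theoretical advantage")
```

The theoretical advantage falls between about 14% and 33% depending on which InfiniBand figure is used, which brackets the 20% estimate and sits well above the 5% to 10% that was measured.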
Figure 10 confirms the effectiveness of the 10 GbE link aggregation. For background, in the 10 GbE link aggregation, connections are multiplexed using an administrator-designated strategy (for example, hash based on MAC, IP, MAC+IP, and so on). Since the test configuration has 10 blades and four potential 10 GbE interfaces, a strategy was devised to target an even distribution.
Since there were 10 nodes and network interface assignments are on a node basis, the best-case scenario would be for each of the four network interfaces in the link aggregation to be assigned either two or three nodes. The actual distribution was as follows:
ixgbe0: Two nodes
ixgbe1: Four nodes
ixgbe2: Two nodes
ixgbe3: Two nodes
Figure 10 shows two views from DTrace Analytics for I/O traffic when 10 jobs were run on each of the 10 different nodes in the SAS Grid Computing environment.
The top chart shows the aggregated logical interface (aggr1) broken down by the actual NICs assigned. (NIC assignments were added on top of the screenshot for clarity.) In the naming scheme (for example, n2/be1), n2 represents traffic to or from node 2 and be1 represents interface ixgbe1 on the Sun ZFS Storage 7420 appliance.
The bottom graph shows a color-coding of each of the 10 GbE NICs that comprise the link aggregation.
The hashing scheme used was L3, which is based on the IP address of the nodes and resulted in an almost perfectly balanced workload. If one of the nodes assigned to ixgbe1 had been assigned to any of the other interfaces, it would have been a perfect distribution. DTrace Analytics provided a simple and easy way to visualize the link aggregation distribution and determine its efficiency.
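The balance argument can be made concrete with a small Python sketch. The first part compares the reported distribution with the most even split possible; the second part uses a toy address-modulo hash purely to illustrate how an IP-based policy spreads nodes across links. The toy hash is not the appliance's actual L3 algorithm, and the node IP addresses are hypothetical because the article does not list them.

```python
import ipaddress
from collections import Counter

LINKS = ["ixgbe0", "ixgbe1", "ixgbe2", "ixgbe3"]


def best_case(nodes: int, links: int) -> list[int]:
    """Most even split of nodes across links, largest counts first."""
    base, extra = divmod(nodes, links)
    return [base + 1] * extra + [base] * (links - extra)


def moves_to_balance(counts: list[int]) -> int:
    """Number of nodes that would have to change links to reach the best case."""
    ideal = best_case(sum(counts), len(counts))
    observed = sorted(counts, reverse=True)
    return sum(max(0, have - want) for have, want in zip(observed, ideal))


# Distribution reported in the article for the L3 policy.
observed = {"ixgbe0": 2, "ixgbe1": 4, "ixgbe2": 2, "ixgbe3": 2}
print(best_case(10, 4))                           # [3, 3, 2, 2]
print(moves_to_balance(list(observed.values())))  # 1

# Toy L3-style hash (NOT the appliance's real algorithm): IP address modulo link count.
# Hypothetical node addresses, chosen only for illustration.
node_ips = [f"192.168.1.{host}" for host in range(11, 21)]
toy = Counter(LINKS[int(ipaddress.ip_address(ip)) % len(LINKS)] for ip in node_ips)
print(dict(toy))
```

A single move separates the observed distribution from the best case, which matches the observation above that reassigning one node from ixgbe1 would have balanced the links.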
Figure 10. Two Views of I/O Traffic from DTrace Analytics
As mentioned previously, the stated maximum bandwidth is approximately 4 GB/sec for the quad 10 GbE link aggregation and 3 GB/sec to 3.5 GB/sec for a single InfiniBand PCIe card. Using the link aggregation interface, we can see the DTrace Analytics traces for a pure network bandwidth test.
For the traffic "out" (green) from the Sun ZFS Storage 7420 appliance, read-only applications were run from each of the blades to ensure that all reads were coming from the appliance's read cache. For the traffic flowing "in" to the Sun ZFS Storage 7420 appliance, a public domain application, netperf, was used to write over the network interface. Thus no traffic flowing over the network interface should have resulted from any disk I/O.
Figure 11 shows that both the read-only and write-only tests can sustain 4+ GB/sec, demonstrating excellent network performance. This matches our expected maximum network throughput.
Figure 11. Read-Only and Write-Only Tests Sustain 4+ GB/sec
Recall that we had two disk shelves in a mirrored configuration, which is a typical enterprise deployment configuration. Each shelf has 24 drives. In our test configuration, one drive was reserved for a spare, and four drives were flash-based and were used as write cache devices.
Assuming that a 7200 rpm drive can write approximately 50 MB/sec, we would expect the baseline write performance for a single shelf to be around 1 GB/sec (19 drives * 50 MB/sec = 950 MB/sec). In practice, with the hybrid storage features of the Sun ZFS Storage 7420 appliance, we were able to realize 2 to 3 times that in peak throughput.
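The baseline estimate can be reproduced with the following Python sketch, which uses only the shelf layout and the per-drive write rate assumed in the text. The 50 MB/sec figure is the article's working assumption, not a measured value.

```python
# Shelf layout described above.
DRIVES_PER_SHELF = 24
SPARE_DRIVES = 1
FLASH_WRITE_CACHE_DRIVES = 4

# Assumed sustained write rate of one 7200 rpm drive (from the text).
ASSUMED_MB_PER_SEC_PER_DRIVE = 50

data_drives = DRIVES_PER_SHELF - SPARE_DRIVES - FLASH_WRITE_CACHE_DRIVES
baseline_mb_s = data_drives * ASSUMED_MB_PER_SEC_PER_DRIVE

# "2 to 3 times" the baseline, expressed in GB/sec.
peak_low_gb_s = 2 * baseline_mb_s / 1000
peak_high_gb_s = 3 * baseline_mb_s / 1000

print(f"data drives per shelf: {data_drives}")
print(f"baseline write rate: {baseline_mb_s} MB/sec")
print(f"observed peak range: {peak_low_gb_s:.2f} to {peak_high_gb_s:.2f} GB/sec")
```

Two to three times the 950 MB/sec baseline corresponds to peaks of roughly 1.9 GB/sec to 2.85 GB/sec.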
Despite the excellent throughput (the Concurrent SAS Data Step microbenchmark wrote a total of 800 GB in 10 minutes) and consistent performance (each of 100 jobs performed very close in time), having 38 spindles available for the 100 SAS I/O jobs was deemed to be a limiting factor in the mirrored configuration.
Since the storage configuration was fixed, the only way to test the theory about spindles limiting the throughput was to break the ZFS mirror and surface the mirror as an alternate pool. While it is never recommended that you run your storage without any redundancy, we were able to validate and affirm the I/O scalability of the Sun ZFS Storage 7420 appliance by sending half the jobs (50 total from 5 nodes) to one pool/mirror and the other half to the other pool.
By splitting the ZFS pool and effectively doubling the storage configuration, performance was nearly doubled, as shown in Figure 12. The average time for each of the 10 jobs running simultaneously on each of the 10 nodes went from approximately 9 minutes to approximately 5 minutes.
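As a quick check on "nearly doubled," the Python sketch below converts the two approximate completion times into a speedup and a scaling efficiency relative to the doubled storage resources. The inputs are the rounded figures from the text, so the result is approximate as well.

```python
# Approximate average job completion times from the text, in minutes.
ONE_POOL_MINUTES = 9.0
TWO_POOLS_MINUTES = 5.0
RESOURCE_FACTOR = 2.0  # storage resources were doubled

speedup = ONE_POOL_MINUTES / TWO_POOLS_MINUTES
scaling_efficiency = speedup / RESOURCE_FACTOR

print(f"speedup: {speedup:.1f}x")
print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

The speedup is about 1.8x, or roughly 90% of perfectly linear scaling.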
Figure 12. Completion Times for 100 SAS Jobs over InfiniBand with One Disk Shelf Versus Two Disk Shelves
Additionally, the I/O profile in Figure 13 shows that the runtimes are much shorter with two disk shelves and that throughput is significantly higher with less wait time (the graph is denser).
Figure 13. I/O Profile for 100 SAS Jobs over InfiniBand with One Disk Shelf (Left) Versus Two Disk Shelves (Right)
This large difference in performance when using one versus two disk shelves was a strong indication that the storage configuration was the limiting factor on overall throughput. To test this hypothesis and to confirm that there was excess CPU capacity, the 100-job Concurrent SAS Data Step microbenchmark was run again with just two nodes.
For this test, rather than running 10 jobs on each of 10 nodes, we ran 50 jobs on each of two nodes. Average performance time for this configuration was only slightly worse than when the 100 jobs were distributed over 10 nodes. Since two nodes could almost keep up with the storage I/O, this confirms that there was excess CPU capacity and that storage I/O was likely the limiting factor. This also explains why only four SAS Grid Manager job slots were allocated per node in the SAS Grid Mixed Analytic workload. Since the data sets were so large, we saw better peak performance and throughput with that setting.
From this series of tests we can conclude that additional disk shelves would have created a more balanced configuration that would have scaled I/O throughput even higher.
The combination of Oracle's Sun Blade 6000 modular system with Sun Blade X6270 M2 server modules from Oracle, Oracle's Sun ZFS Storage 7420 appliance, and the Oracle Solaris 10 operating system provides an ideal environment for SAS Grid Computing applications. It offers robust, enterprise reliability with scalable I/O and compute performance.
The SAS workloads tested in our labs provided an extremely strenuous test for SAS Grid Computing, stressing both the compute and I/O aspects of the architecture. The results showed the following:
A look at system resource utilization during the tests showed that there was excess capacity from a CPU and network perspective. The network-only throughput test showed excess capacity and the link aggregation was shown to be capable of approximately 4 GB/sec of throughput.
The demonstration of I/O scalability is clear and simple. The Concurrent SAS Data Step microbenchmark scaled almost linearly when the storage resources were doubled.
The choice of an appropriate configuration for storage is enterprise-specific and all factors, such as the total of the grid node compute resources, the number of concurrent users, the I/O utilization and heuristics, and the profile of the applications themselves, have to be taken into consideration. Knowing which storage configuration is right for your environment is often a very difficult question to answer, because business demands are dynamic and workloads are unpredictable.
The Sun ZFS Storage 7420 appliance provides great flexibility in terms of configuration, enabling you to start with a relatively modest configuration and grow the storage system as needed.
To balance CPU performance with storage I/O performance, additional disk shelves would have been helpful. In many cases, this would mean excess storage capacity, but the additional disk spindles would provide greater I/O throughput. From our review of the test data, five or six disk shelves (or 10 to 12, if mirrored for availability) would have provided a more balanced configuration.
Every deployment is unique. While storage is one of the more complicated components of the overall system configuration, a generous storage configuration can also provide the best "insurance" against sizing and capacity planning uncertainty.
If both compute node memory and storage I/O throughput are generously configured, this provides an insurance mechanism that enables the configuration to adequately handle spikes in usage patterns. Since over-sizing these components does not generally affect software licensing costs for SAS or for Sun ZFS Storage Appliances, this risk-averse approach is still cost-effective.
The fact that this solution architecture is based on NFS also provides tremendous IT flexibility and eases the administrative burden. A well-thought-out architecture can support major changes under the covers with little to no downtime. For example, moving a client to a different network interface could be as simple as remapping the NFS server IP address and host name and remounting the file system.
Providing the right balance of configuration, sizing, and cost tradeoffs requires a multi-faceted knowledge base. The performance characterizations discussed in this article are intended to provide insight to make it easier for you to formulate the appropriate criteria for sizing and configuration exercises.
Here are resources referenced earlier in this document:
ibd(7d) in the "Devices and Network Interfaces" section of man pages section 7: Device and Network Interfaces: http://download.oracle.com/docs/cd/E19253-01/index.html
For more information on Oracle and SAS, visit oracle.com/sas.
Revision 1.1, 09/07/2011