By Hugo Rivero, June 2006
1. Introduction to Throughput Computing
2. Throughput Computing in the Solaris OS
3. Development Best Practices
4. Deployment Best Practices
5. Application Qualification
6. For More Information
Throughput computing is a new approach to system design that delivers higher throughput -- the aggregate amount of work done -- by relying on processors with chip multithreading (CMT) technology, which provide multiple threads of execution on a single chip.
Traditional processor designs have focused on increasing the speed of execution of a single instruction stream. However, those designs are limited in that memory speeds have not been increasing at the same rate as processor speeds, so the processor often spends most of its time waiting for memory references. The example below shows how a reduction in processor compute time for a single thread of execution results in small time savings when the workload's time is already dominated by memory latencies, a common case in today's server applications (see Figure 1).
Throughput computing, in contrast, exploits the fact that server workloads typically run multiple jobs at the same time. For a given transistor density, the processor is optimized for parallel computation rather than single-thread speed. In the example below, the processor can run four execution threads, and can alternate among them at every clock cycle. When one thread is stalled waiting for memory, it is simply skipped. This approach leads to higher processor utilization (the sum of the "C" blocks), even when running at the same clock rate as in the previous example.
Throughput workloads that are comparatively memory intensive, therefore, see little benefit from clock rate improvements on traditional processor architectures, whereas on threaded CMT architectures (for example, using the UltraSPARC T1 processor), these workloads can see enormous benefit. These workloads spend more of their time waiting for memory requests to be satisfied, and that time can now be used to execute other threads (see Figure 2).
Taking this approach further, a CMT processor can replicate multiple processing units (also known as cores) on the same chip, leading to a substantial improvement in processing density. For example, the UltraSPARC T1 processor can have up to 8 cores and 32 hardware threads, all on the same chip. This technology not only delivers better aggregate throughput, but is also more power-efficient.
For more details, visit the Throughput Computing section on sun.com.
With the Solaris OS on a CMT architecture (for example, using the UltraSPARC T1 processor), the first thing to notice is that each of the hardware threads is treated as a logical processor. The Solaris OS will schedule LWPs (either processes or software threads) on each of them, and let the chip handle the low-level thread switching in hardware, where it can be done at every clock cycle. Contrast this with the thousands of instructions needed for a context switch in software, and it is clear why this architecture is so efficient at running multiple jobs in parallel.
The Solaris OS will assign each hardware thread its own CPU ID. On an eight-core UltraSPARC T1 processor, the psrinfo(1M) command would give us:
# psrinfo
0       on-line   since 01/30/2006 17:51:38
1       on-line   since 01/30/2006 17:51:39
2       on-line   since 01/30/2006 17:51:39
3       on-line   since 01/30/2006 17:51:39
4       on-line   since 01/30/2006 17:51:39
... <some lines deleted> ...
28      on-line   since 01/30/2006 17:51:39
29      on-line   since 01/30/2006 17:51:39
30      on-line   since 01/30/2006 17:51:39
31      on-line   since 01/30/2006 17:51:39
In the above output, (logical) processors 0 through 3 correspond to the hardware threads on the first core, (logical) processors 4 through 7 correspond to the hardware threads on the second core, and so on. Thus the server looks, for practical purposes, like an SMP on a chip. Similarly, if an application inquires about the number of processors configured (for example, via the sysconf(3C) library call), the answer will be the number of hardware threads available.
Despite the radical chip design with a highly threaded architecture, CMT did not require fundamental changes to the operating system. The Solaris OS has been optimized over many years to scale to a large number of processors. For example, a Sun Fire E25K server has 144 (single-threaded) cores, so scheduling jobs on 32 logical processors was familiar territory. The CMT-specific optimizations required by the multicore and multithreaded nature of the processor have been implemented in the Solaris 10 OS. By making the OS aware of the relationship between logical CPUs and physical resources (cores, caches, and Translation Lookaside Buffers [TLBs]), the scheduler can make educated decisions on how best to use the available resources. For example, here's the mpstat(1M) snapshot when running eight CPU-intensive jobs on an eight-core system.
Notice that the eight jobs were evenly distributed, one running on each core, instead of being blindly assigned to any available hardware thread.
Another Solaris enhancement is the support of a new machine architecture, sun4v. This is needed to make use of the UltraSPARC T1 processor's new Hypervisor interface, a thin layer of firmware that presents a virtualized machine environment to the operating system:
# uname -a
SunOS rumble13 5.10 Generic_118822-20 sun4v sparc SUNW,Sun-Fire-T200
This new machine architecture is transparent to applications: it only affects the contract between the processor and the Solaris kernel. It does not affect the interfaces between the Solaris OS and user applications. The UltraSPARC T1 is a fully compatible SPARC v9 implementation, and existing SPARC binaries will run unchanged. More details on Hypervisor are available on the OpenSPARC web site.
It is a common misconception to think that CMT only works with multithreaded applications. Because the term "thread" is used for the hardware, people may assume that the software needs to be threaded as well. This is not the case. Multiple single-threaded processes can also take good advantage of CMT. The Solaris OS handles processes and threads in a similar fashion: they are both scheduled as LWPs (lightweight processes). As long as there are enough active LWPs to keep the cores busy, applications will reap the performance benefits.
In traditional SMP architectures, when a processor became idle, it entered an idle loop looking for new work to do (and consuming CPU cycles in the process). Clearly, this would be suboptimal in a CMT architecture, where the core's compute cycles are shared among its hardware threads. To improve efficiency, the Solaris platform has been modified to park an idle hardware thread when it is not running any job. The hardware thread then stays out of the way, and is reactivated only when the OS scheduler finds something to run on it.
This means that the concept of idle time, as reported by traditional monitoring tools like mpstat, has to be interpreted in a slightly different way: a core is not really idle until all of its hardware threads are idle.
For example, here's an mpstat excerpt showing the four hardware threads on the first core, over three sampling intervals.
This shows CPU 1 as being idle, so it could seem that 25 percent of the core's cycles are wasted. In reality, the processor is doing its magic, distributing its compute cycles among hardware threads 0, 2, and 3. For people interested in a low-level analysis of processor utilization, the Solaris OS allows access to the processor's hardware counters via the cpustat(1M) and cputrack(1M) commands.
The most important consideration for software developers looking at Sun's CMT architecture is that the same practices still apply. Development recommendations are no different from those for traditional SMP architectures. As radical as the processor architecture is, it does not introduce new paradigms in software design. In fact, it has lowered the barriers to entry for highly threaded servers, so good parallel programming becomes even more relevant than before. As the industry moves to chips with multiple cores and threads, the applications that are designed to scale to large numbers of logical processors will be more competitive.
Having said that, here are some specific recommendations that developers should keep in mind:
At first glance, it may seem reasonable to configure applications according to the number of cores on the system. For example, an UltraSPARC T1 processor with eight cores can be thought of as an eight-way system. However, it is important to remember that the CMT architecture delivers the most benefit when all of its hardware threads are in use, as this increases the cores' utilization and, therefore, the processor's overall throughput. These are, after all, "thread-hungry" processors. So, when deploying applications on a CMT processor, it is recommended to configure them to use a large enough number of active LWPs. Most applications provide tunable parameters to set the number of software threads or processes used, for example, to adjust the size of a process pool or the number of worker threads.
The mpstat(1M) command can be used to check whether all hardware threads are being utilized. The following mpstat excerpt shows the activity on the hardware threads of the first core of an UltraSPARC T1 processor.
Notice that only one of the hardware threads on this core is active. This is most likely a suboptimal configuration, and performance could improve with a higher number of LWPs.
To better monitor LWP activity, the ps command has been enhanced to include a -L flag. Previously, ps showed only a summary of the activity within a process. Now we can also see how many LWPs that process has (the NLWP column), and the time accumulated by each (the LTIME column). In this case, we can see that only 5 of its 14 LWPs have been active.
This information can help administrators to adjust the number of active LWPs so that it better matches the available hardware threads.
Because of the increased throughput capabilities and large number of logical processors, CMT architectures are good candidates for workload consolidation -- an example would be combining multiple logical tiers, such as application server and database tiers, on the same physical host. The Solaris OS does a good job of balancing the load over all the cores in a CMT processor, so in most cases no additional tuning is needed. However, there may be situations, especially if the applications have very different runtime characteristics, when some kind of isolation can provide additional benefits, such as better management of the system resources and improved hit rates on the cores' caches.
The Solaris platform offers several ways to segregate applications running on the same host. One is to use processor sets, with the psrset(1M) command. Processor sets allow grouping of processors on a system, and binding of LWPs to them. For example, the following command creates a processor set on the first two cores of an UltraSPARC T1 processor:
# psrset -c 0-7
created processor set 1
processor 0: was not assigned, now 1
processor 1: was not assigned, now 1
processor 2: was not assigned, now 1
processor 3: was not assigned, now 1
processor 4: was not assigned, now 1
processor 5: was not assigned, now 1
processor 6: was not assigned, now 1
processor 7: was not assigned, now 1
Then, processes or even individual LWPs can be assigned to this processor set:
# psrset -b 1 5364 5365 5366 5367
process id 5364: was not bound, now 1
process id 5365: was not bound, now 1
process id 5366: was not bound, now 1
process id 5367: was not bound, now 1
Another way to segregate multiple applications running on the same host is to use Solaris Containers and the Solaris Resource Manager. Containers, a virtualization tool, allow the creation of multiple private execution environments within a single instance of the Solaris OS. Processor sets can be created in the context of a resource pool, and then associated with a specific container. For detailed instructions, see the How To guide to Solaris Containers.
On a CMT processor like the UltraSPARC T1, it is recommended to keep the grouping at the core level. Splitting a core's hardware threads over multiple processor sets or resource manager pools could lead to suboptimal use of shared resources like the Level 1 cache.
In general, software vendors qualify their products to a target Solaris version, and not to specific hardware configurations. For example, they test and support their software on the Solaris 10 OS for SPARC platforms, instead of testing it on every single hardware configuration that runs this operating system. This is possible because Sun maintains the same instruction set and the same operating system across the entire SPARC product line.
Sun's implementation of CMT does not disrupt this model. The UltraSPARC T1 chip is binary compatible with existing SPARC processors, as shown by the output of the isalist(1) command:

# isalist
sparcv9 sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc sparcv9+vis sparcv9+vis2 sparcv8plus+vis sparcv8plus+vis2
Furthermore, servers with the UltraSPARC T1 processor use the same Solaris 10 OS as other servers based on the SPARC platform. Thus, applications that are qualified to run on the Solaris 10 release on the SPARC architecture are automatically qualified to run on servers using the UltraSPARC T1 processors, such as the Sun servers based on CoolThreads technology.