Maximizing Application Performance on Chip Multithreading (CMT) Architectures

By Hugo Rivero, June 2006  
1. Introduction to Throughput Computing
2. Throughput Computing in the Solaris OS
3. Development Best Practices
4. Deployment Best Practices
5. Application Qualification
6. For More Information
7. Glossary

This article provides a set of best practices for developers and system administrators who want to achieve maximum application performance on chip multithreading (CMT) architectures such as Sun servers with CoolThreads technology. A brief introduction to throughput computing is presented along with several examples that illustrate what CMT means in the context of the Solaris Operating System. Finally, several best practices for application development and deployment on CMT architectures are presented.

For the latest information, see the Sun Fire Servers with CoolThreads Technology page on Sun's web site.

1. Introduction to Throughput Computing

Throughput computing is a new approach to system design that delivers higher throughput -- the aggregate amount of work done -- by relying on processors with CMT technology, providing multiple threads of execution on a single chip.

Traditional processor designs have focused on increasing the speed of execution of a single instruction stream. However, those designs are limited in that memory speeds have not been increasing at the same rate as processor speeds, so the processor often spends most of its time waiting for memory references. The example below shows how a reduction in processor compute time for a single thread of execution results in small time savings when the workload's time is already dominated by memory latencies, a common case in today's server applications (see Figure 1).

Figure 1: Example of Single-Threaded Execution

Throughput computing, in contrast, exploits the fact that server workloads typically run multiple jobs at the same time. For a given transistor density, the processor is optimized for parallel computation. In the example below, the processor is enabled to run four execution threads and can alternate between them at every clock cycle. When one thread is stalled waiting for memory, it is simply skipped. This approach leads to higher processor utilization (the sum of the "C" blocks), even when running at the same clock rate as in the previous example.

Throughput workloads that are comparatively memory intensive, therefore, see little benefit from clock rate improvements on traditional processor architectures, whereas on threaded CMT architectures (for example, using the UltraSPARC T1 processor), these workloads can see enormous benefit. These workloads spend more of their time waiting for memory requests to be satisfied, and that time can now be used to execute other threads (see Figure 2).

Figure 2: Example of Multithreaded Execution

Taking this approach further, a CMT processor can replicate multiple processing units (also known as cores) on the same chip, leading to substantial improvement in processing density. For example, the UltraSPARC T1 processor can have up to 8 cores and 32 hardware threads, all on the same chip. This technology not only delivers better aggregate throughput, but is also more efficient on power consumption.

For more details, visit the Throughput Computing section on Sun's web site.

2. Throughput Computing in the Solaris OS

With the Solaris OS on a CMT architecture (for example, using the UltraSPARC T1 processor), the first thing to notice is that each of the hardware threads is treated as a logical processor. The Solaris OS will schedule LWPs (either processes or software threads) on each of them, and let the chip handle the low-level thread switching in hardware, where it can be done at every clock cycle. Contrast this to the thousands of instructions needed to do a context switch in software, and we can see why this architecture is so efficient in running multiple jobs in parallel.

The Solaris OS will assign each hardware thread its own CPU ID. On an eight-core UltraSPARC T1 processor, the command psrinfo(1M) would give us:

# psrinfo
0 on-line since 01/30/2006 17:51:38
1 on-line since 01/30/2006 17:51:39
2 on-line since 01/30/2006 17:51:39
3 on-line since 01/30/2006 17:51:39
4 on-line since 01/30/2006 17:51:39
... <some lines deleted> ...
28 on-line since 01/30/2006 17:51:39
29 on-line since 01/30/2006 17:51:39
30 on-line since 01/30/2006 17:51:39
31 on-line since 01/30/2006 17:51:39

In the above output (logical) processors 0 through 3 correspond to the hardware threads on the first core, (logical) processors 4 through 7 correspond to the hardware threads on the second core, and so on. Thus the server looks, for practical purposes, like an SMP on a chip. Similarly, if an application inquires about the number of processors configured (for example, via the sysconf(3C) library call), the answer will be the number of hardware threads available.
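An application can discover this count at runtime through sysconf(3C). A minimal sketch in C (the wrapper function names here are illustrative, not part of any Solaris API):

```c
#include <unistd.h>

/* Return the number of logical processors configured and online, as
 * seen by a user application.  On a fully populated UltraSPARC T1
 * system, both calls report 32 -- one per hardware thread. */
long cpus_configured(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}

long cpus_online(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```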

Despite the radical chip design with a highly threaded architecture, CMT did not require fundamental changes to the operating system. The Solaris OS has been optimized over many years to scale to a large number of processors. For example, a Sun Fire E25K server has 144 (single-threaded) cores, so scheduling jobs on 32 logical processors was familiar territory. The CMT-specific optimizations required by the multicore and multithreaded nature of the processor have been implemented in the Solaris 10 OS. By making the OS aware of the relationship between logical CPUs and physical resources (cores, caches, and Translation Lookaside Buffers [TLBs]), the scheduler can make educated decisions on how best to use the available resources. For example, here's the mpstat(1M) snapshot when running eight CPU-intensive jobs on an eight-core system.

Figure 3: Code Sample #1

Notice that the eight jobs were evenly distributed, one running on each core, instead of being blindly assigned to any available hardware thread.

Another Solaris enhancement is the support of a new machine architecture, sun4v. This is needed to make use of the UltraSPARC T1 processor's new Hypervisor interface, a thin layer of firmware that presents a virtualized machine environment to the operating system:

# uname -a
SunOS rumble13 5.10 Generic_118822-20 sun4v sparc SUNW,Sun-Fire-T200

This new machine architecture is transparent to applications: it only affects the contract between the processor and the Solaris kernel. It does not affect the interfaces between the Solaris OS and user applications. The UltraSPARC T1 is a fully compatible SPARC v9 implementation, and existing SPARC binaries will run unchanged. More details on Hypervisor are available on the OpenSPARC web site.

Multithreaded vs. Multi-process

It is a common misconception to think that CMT only works with multithreaded applications. Because the term "thread" is used for the hardware, people may assume that the software needs to be threaded as well. This is not the case. Multiple, single-threaded processes can also take good advantage of CMT. The Solaris OS handles processes and threads in a similar fashion: they are both scheduled as LWPs (lightweight processes). As long as there are enough active LWPs to keep the cores busy, applications will reap the performance benefits.
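To illustrate, here is a sketch of a pool of single-threaded worker processes. Each child contributes one LWP to the scheduler, so eight of them can keep eight cores busy just as well as eight threads inside one process would. The function name and structure are illustrative:

```c
#include <sys/wait.h>
#include <unistd.h>

/* Spawn n single-threaded worker processes and wait for them all.
 * Each child is one LWP from the scheduler's point of view.
 * Returns the number of children that exited successfully. */
int run_worker_processes(int n)
{
    int ok = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* ... single-threaded application work would go here ... */
            _exit(0);
        }
    }
    for (int i = 0; i < n; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```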

Idle Time on CMT

In traditional SMP architectures, when a processor became idle, it entered an idle loop looking for new work to do (and consuming CPU cycles in the process). Clearly, this would be suboptimal in a CMT architecture, where the core's compute cycles are shared among its hardware threads. To improve efficiency, the Solaris platform has been modified to park an idle hardware thread when it is not running any job. The hardware thread then stays out of the way, and is reactivated only when the OS scheduler finds something to run on it.

This means that the concept of idle time, as reported by traditional monitoring tools like mpstat, has to be interpreted in a slightly different way: a core is not really idle until all of its hardware threads are idle.

For example, here's the mpstat excerpt showing the four hardware threads on core one, over three sampling intervals.

Figure 4: Code Sample #2

This shows CPU 1 as being idle, so it could seem that 25 percent of the core's cycles are wasted. In reality, the processor is doing its magic, distributing its compute cycles among hardware threads 0, 2 and 3. For people interested in a low-level analysis of processor utilization, the Solaris OS allows access to the processor counters via the cpustat(1M) or cputrack(1) utilities.

3. Development Best Practices

The most important consideration for software developers looking at Sun's CMT architecture is that the same practices still apply. Development recommendations are no different from those for traditional SMP architectures. As radical as the processor architecture is, it does not introduce new paradigms in software design. In fact, it has lowered the barrier to entry for highly threaded servers, so good parallel programming becomes even more relevant than before. As the industry moves to chips with multiple cores and threads, the applications that are designed to scale to large numbers of logical processors will be more competitive.

Having said that, here are some specific recommendations that developers should keep in mind:

  • Do not assume that the application will only need to scale to a small number of CPUs. CMT is a reality, and hardware threads are becoming cheaper. Avoid coarse-grained locks whenever possible, and keep critical sections short. When applicable, create per-CPU structures that can be increased dynamically based on the number of processors available at runtime.
  • Allow ways to adjust the number of threads or processes spawned, through a configurable parameter that system administrators can adjust as needed. Applications can also try to do runtime detection of the number of processors; however, this approach may be misleading when other applications are running on the same host.
  • Take advantage of scalable software infrastructure whenever possible. Use commercial software with a proven record of taking care of the "plumbing" (concurrency, resource management, pooling, and the like), so that developers can focus on their areas of expertise. Java Platform, Enterprise Edition, is a good example: It provides a standard, portable platform for enterprise applications, with implementations available from a variety of software vendors.
  • Make locks adaptive. Lock implementations often rely on spinning to avoid being context switched while waiting to grab a lock. In a CMT environment, though, spinning consumes processor cycles that could be used for other hardware threads, incurring a higher cost than on a traditional SMP setup. Providing the spin count as a runtime tunable helps to better adapt the application to the underlying hardware.
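The last recommendation can be sketched as a spin-then-block lock built on POSIX threads. This is a minimal illustration of the idea, not a production lock; the APP_SPIN_COUNT environment variable, the default of 100, and the type names are all hypothetical:

```c
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

/* Illustrative spin-then-block mutex: spin briefly in the hope that
 * the holder releases the lock soon, then fall back to a blocking
 * wait.  The spin count is read from a hypothetical environment
 * variable so administrators can tune it per deployment -- on a CMT
 * system a lower count wastes fewer of the core's shared cycles. */
typedef struct {
    pthread_mutex_t mutex;
    long spin_limit;
} adaptive_lock_t;

void adaptive_lock_init(adaptive_lock_t *l)
{
    const char *s = getenv("APP_SPIN_COUNT"); /* hypothetical tunable */
    l->spin_limit = (s != NULL) ? atol(s) : 100;
    pthread_mutex_init(&l->mutex, NULL);
}

void adaptive_lock_acquire(adaptive_lock_t *l)
{
    for (long i = 0; i < l->spin_limit; i++) {
        if (pthread_mutex_trylock(&l->mutex) == 0)
            return;                /* got the lock while spinning */
        sched_yield();             /* let another LWP use the core */
    }
    pthread_mutex_lock(&l->mutex); /* give up spinning; block */
}

void adaptive_lock_release(adaptive_lock_t *l)
{
    pthread_mutex_unlock(&l->mutex);
}
```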
4. Deployment Best Practices

At first glance, it may seem reasonable to configure applications according to the number of cores on the system. For example, an UltraSPARC T1 processor with eight cores can be thought of as an eight-way system. However, it is important to remember that the CMT architecture delivers the most benefits when all of its hardware threads are in use, as this increases the cores' utilization and, therefore, their overall throughput. These are, after all, "thread-hungry" processors. So, when deploying applications on a CMT processor, it is recommended to configure them to use a large enough number of active LWPs. Most applications provide tunable parameters to set the number of software threads or processes used, for example, to adjust the size of a process pool or the number of worker threads.
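One common sizing policy is to honor an administrator-supplied tunable and default to the number of online logical processors otherwise. A sketch, assuming a hypothetical WORKER_THREADS environment variable and illustrative function names:

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative sizing policy: take the LWP count from a hypothetical
 * WORKER_THREADS tunable when the administrator sets one, otherwise
 * default to the number of online logical processors (32 on a fully
 * populated UltraSPARC T1 system). */
long worker_count(void)
{
    const char *s = getenv("WORKER_THREADS");
    if (s != NULL && atol(s) > 0)
        return atol(s);
    return sysconf(_SC_NPROCESSORS_ONLN);
}

static void *worker(void *arg)
{
    (void)arg;   /* ... application work loop would go here ... */
    return NULL;
}

/* Spawn and join n worker threads; returns the number started. */
long run_workers(long n)
{
    pthread_t *tids = malloc((size_t)n * sizeof(pthread_t));
    long started = 0;

    for (long i = 0; i < n; i++)
        if (pthread_create(&tids[i], NULL, worker, NULL) == 0)
            started++;
    for (long i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return started;
}
```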

The mpstat(1M) command can be used to check if all hardware threads are being utilized. The following excerpt from mpstat output shows the activity on the hardware threads of the first core of an UltraSPARC T1 processor.

Figure 5: Code Sample #3

Notice that only one of the hardware threads on this core is active. This is most likely a suboptimal configuration, and performance could improve with a higher number of LWPs.

To better monitor LWP activity, the ps command has been enhanced to include a -L flag. Without it, ps shows only a summary of the activity within a process.

Figure 6: Code Sample #4

With -L, we can see how many LWPs the process has (NLWP column) and the time accumulated by each (LTIME column). In this case, we can see that only 5 of its 14 LWPs have been active.

Figure 7: Code Sample #5

This information can help administrators to adjust the number of active LWPs so that it better matches the available hardware threads.

Application Consolidation

Because of the increased throughput capabilities and large number of logical processors, CMT architectures are good candidates for workload consolidation -- an example would be combining multiple logical tiers, such as application server and database tiers, on the same physical host. The Solaris OS does a good job of balancing the load over all the cores in a CMT processor, so in most cases no additional tuning is needed. However, there may be situations, especially if the applications have very different runtime characteristics, when some kind of isolation can provide additional benefits, such as better management of the system resources and improved hit rates on the cores' caches.

The Solaris platform offers several ways to segregate applications running on the same host. One is to use processor sets, with the Solaris command psrset(1M). Processor sets allow grouping of processors on a system, and binding of LWPs to them. For example, the following command creates a processor set on the first two cores of an UltraSPARC T1 processor:

# psrset -c 0-7
created processor set 1
processor 0: was not assigned, now 1
processor 1: was not assigned, now 1
processor 2: was not assigned, now 1
processor 3: was not assigned, now 1
processor 4: was not assigned, now 1
processor 5: was not assigned, now 1
processor 6: was not assigned, now 1
processor 7: was not assigned, now 1

Then, processes or even individual LWPs can be assigned to this processor set:

# psrset -b 1 5364 5365 5366 5367
process id 5364: was not bound, now 1
process id 5365: was not bound, now 1
process id 5366: was not bound, now 1
process id 5367: was not bound, now 1

Another way to segregate multiple applications running on the same host is to use Solaris Containers and the Solaris Resource Manager. Containers, a virtualization tool, allow the creation of multiple private execution environments within a single instance of the Solaris OS. Processor sets can be created in the context of a resource pool, and then associated with a specific container. For detailed instructions, see the How To guide to Solaris Containers.

On a CMT processor like the UltraSPARC T1, it is recommended to keep the grouping at the core level. Splitting a core's hardware threads over multiple processor sets or resource manager pools could lead to suboptimal use of shared resources like the Level 1 cache.

5. Application Qualification

In general, software vendors qualify their products to a target Solaris version, and not to specific hardware configurations. For example, they test and support their software on the Solaris 10 OS for SPARC platforms, instead of testing it on every single hardware configuration that runs this operating system. This is possible because Sun maintains the same instruction set and the same operating system across the entire SPARC product line.

Sun's implementation of CMT does not disrupt this model. The UltraSPARC T1 chip is binary compatible with existing SPARC processors, as shown by the isalist(1) command:

# isalist
sparcv9 sparcv8plus sparcv8 sparcv8-fsmuld 
sparcv7 sparc sparcv9+vis sparcv9+vis2 
sparcv8plus+vis sparcv8plus+vis2

Furthermore, servers with the UltraSPARC T1 processor use the same Solaris 10 OS as other servers based on the SPARC platform. Thus, applications that are qualified to run on the Solaris 10 release on the SPARC architecture are automatically qualified to run on servers using the UltraSPARC T1 processors, such as the Sun servers based on CoolThreads technology.

6. For More Information

7. Glossary

Throughput computing:
System design principle that aims at delivering higher server throughput. The underlying processor technology is CMT.
CMT:
Chip multithreading. Processor technology that allows multiple hardware threads of execution (also known as strands) on the same chip, through multiple cores per chip, multiple threads per core, or a combination of both.
CMP:
Chip multiprocessing. Processor technology that combines multiple cores on the same chip.
SMP:
Symmetric multiprocessing. Computer architecture where multiple processors are connected via shared memory.
LWP:
Lightweight process. The basic job scheduling unit for the operating system.
Chip:
A single piece of silicon. Also called a die.
Hardware thread:
The basic execution unit within a processor, and the basic scheduling unit at the processor level.
UltraSPARC T1:
Offers up to 8 four-way multithreaded cores with typical processor power consumption of 72 watts. The SPARC v9 implementation delivers binary compatibility with previous generations of Sun systems and Solaris software.
Hypervisor:
A layer of software that provides greater isolation between operating system software and the underlying processor implementation.