Improving Application Efficiency Through Chip Multi-Threading

By Nagendra Nagarajayya, March 10, 2005, Updated October 2005  
Abstract: This article introduces CMT technology, such as chip multi-processing and simultaneous multi-threading, from a software developer's perspective.


1. Introduction to Chip Multi-Threading
2. Improving Application Performance on a System With CMT
3. CMT System Efficiency
4. Conclusion
5. Acknowledgments
6. References
7. Footnotes
Appendix A: Glossary
Appendix B: Prototype of Java Technology-Based Application

Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. Software multi-threading refers to the execution of multiple tasks within a process. The tasks are executed using software threads. The software threads are executed on a single processor or on many processors simultaneously. A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously within a processor on cores. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases a processor's efficiency as wait latencies are minimized.

The paper introduces CMT technology, such as chip multi-processing (CMP), chip multi-processing with hardware threads, and simultaneous multi-threading, from a software developer's perspective. It introduces Sun's throughput computing strategy and Sun's first CMT processors, UltraSPARC US-IV and the Niagara.

Software tuning techniques to improve application efficiency on a CMT processor-based system are then introduced. A prototype Java technology-based application using some of techniques in the paper shows the scalability on a CMT system.

Throughput Computing: Sun introduced throughput computing [1], [2] to cut the cost and complexity of network computing. The idea is to maximize application workload throughput, or the aggregate amount of work done in a given amount of time [1] by using CMT-enabled processors. The CMT processors execute multiple software threads simultaneously -- contrasting to a single thread processed by the current non-CMT processors -- and this can help improve price/performance and reduce the total cost of ownership.

Throughput computing maximizes the throughput per processor and per system. So a processor with multiple cores will be able to increase the throughput by the number of cores per processor. This increase in performance comes at a lower cost, fewer systems, reduced power consumption, and lower maintenance and administration, with increase in reliability due to fewer systems.

1 Introduction to Chip Multi-Threading

CMT brings to hardware the concept of multi-threading, similar to software multi-threading. Software multi-threading refers to the execution of multiple tasks within a process. The tasks are executed using software threads. This allows other tasks to execute when one thread is blocked, maybe on an IO operation. A central processing unit or a processor executes instructions from a software application [3]. Figure 1 shows a one-core processor with the L1 cache on the core and the L2 cache of the core. The processor executes instructions from one software thread and waits, wasting clock cycles, when it executes instructions that need memory operations. This wait takes most of the time [4].

 Figure 1: Processor With One Core (no CMT)
Figure 1: Processor With One Core (no CMT)

Figure 2 shows compute cycles followed by a wait, as memory is accessed. The compute cycles are 'C', and the memory wait time is 'M'. The wait time is dependent on the speed of the memory operations.

Memory operations are slow compared to the processor speed. Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years [1]. Memory access like a cache miss can take quite a few clock cycles (it could be in hundreds of clock cycles), and the processor has to wait until the memory access is complete. Increases in processor speed without an increase in memory speed will only improve performance by a small amount, since the processor still has to wait for the clock cycles for the memory access to complete [5]. Figure 3 shows a faster processor; compute cycles are faster, but memory access time is still the same. The compute time saved is not much as the memory access slows the processor. (The faster processor waits the same amount of time for memory access.)

 Figure 2: No CMT; Process Waits for Memory Access
Figure 2: No CMT; Process Waits for Memory Access

 Figure 3: No CMT; Increase in Processor Speed With No Change in Memory Speed
Figure 3: No CMT; Increase in Processor Speed With No Change in Memory Speed

Similar to software multi-threading, instruction execution could also be multi-threaded in hardware using additional cores on the same silicon die. The additional cores can continue processing instructions from other threads while one core is waiting for memory to be accessed. The use of multiple cores (two or more) on a silicon die to process instructions from multiple threads is chip multi-processing (CMP) [6]. The core using additional registers and logic could switch the threads in hardware to continue execution instead of waiting for the memory operation to complete. This switching in hardware is known as vertical multi-threading (VMT) [7]. The core can process instructions from multiple threads simultaneously without switching between software threads. This is known as horizontal multi-threading (HMT) or simultaneous multi-threading (SMT). These techniques to improve application performance by executing more than one software thread on a silicon die are known as chip multi-threading.

CMT improves application efficiency by executing instructions from multiple software threads in parallel per clock cycle. These instructions are executed in parallel without any changes to the multi-threaded application code. Using the latest compilers to optimize the multi-threaded application code should increase this performance. The operating system needs to be CMT aware, to minimize resource conflicts, maximizing the parallelization and application throughput. The throughput can be increased by the number of cores per processor.

1.1 Chip Multi-Processing (CMP)

Chip multi-processing (CMP) involves having cores on a silicon die. Figure 4 shows a two-core CMP processor with an on-core L1 cache and an external split L2 cache. A core is made up of functional units, pipelines, caches, and so on, which can process computer instructions. The cores could have their own L1 and L2 cache and share a L3 cache. The cores could also share the memory and cache bus.

 Figure 4: Processor With CMP
Figure 4: Processor With CMP

If access is required to memory due to a cache miss, a memory stall, or a Translation Lookaside Buffer (TLB) miss, this can take from tens to hundreds of cycles. A processor with just one core needs to wait until this access is completed (Figure 2). The OS can context switch [6] the thread instead of waiting for the memory access to complete. Software context switches can take quite a few cycles. A processor with multiple cores can continue executing software threads while a thread on one core waits for a memory access. Having more than one core improves performance per processor since other threads continue execution. This also enables thread level parallelism (TLP) [8] as instructions from multiple threads are executed simultaneously per processor. Figure 6 shows two threads executing simultaneously on two cores.

1.1.1 UltraSPARC US-IV

UltraSPARC US-IV [9], [10] is a CMP processor derived from the UIII [11] processor but with two cores. UIII has only one core. Each core on a US-IV has an internal L1 cache (64 Kbyte data, 32 Kbyte instruction) and an external L2 cache (16 MB). The L2 cache is a split L2 cache and each core can access 8 MB of the split cache. The address and data bus to the cache are shared. The memory controller is on the processor, and the path to the memory is shared between the two cores. Each core can process 4 instructions per clock cycle. The cores have a 14-stage pipeline and use instruction level parallelism (ILP) [12] to execute 4 instructions every cycle. The two cores could together execute a maximum of 8 instructions per clock cycle. Figure 5 shows independent instructions in a single thread being executed simultaneously on a core.

 Figure 5: Instruction-Level Parallelism
Figure 5: Instruction-Level Parallelism

The two cores appear as processors to the OS, and the OS schedules software threads on the cores. Each core can execute one software thread, so two software threads can be executed simultaneously. On a memory access, only the core involved in the access waits, while the second core continues processing instructions. The number of instructions processed by the processor is now two times that of a processor with a single core. The processor executes instructions from two different threads simultaneously, providing thread level parallelism (TLP) in addition to the instruction level parallelism on each core. TLP can help in increasing the performance of applications by almost doubling the throughput.

1.2 Chip Multi-Processing With Hardware Threads

The OS schedules software threads on the cores in a US-IV processor. In the case of a memory access, a core needs to wait for the access to complete while the other core continues processing instructions. This is not very efficient as a core wastes clock cycles. The core using more registers and logic in hardware can context-switch the software thread to another software thread. This logic in hardware holding the thread state is called a hardware thread. The hardware thread appears as a logical processor to the OS. A processor with n hardware threads would appear as n logical processors. For example, if the US-IV processor had 4 hardware threads per core, it would appear to have 8 logical processors (4 hardware threads * 2 cores).

 Figure 6: TLP (Instructions From Two Threads Executing Simultaneously in a Processor)
Figure 6: TLP (Instructions From Two Threads Executing Simultaneously in a Processor)

The OS schedules software threads on these logical processors. On a memory access, the core switches to a new hardware thread so it does not have to wait for the access to be complete. The context switch is in hardware so no clock cycles are wasted. Hardware threads improve the efficiency of a processor by processing instructions every clock cycle. Figure 7 shows a context switch to a new thread on a memory access. The switch enables the core to continue processing instructions from the new thread while memory is being accessed from the old thread. (Faster processors can execute threads in parallel.)

 Figure 7: With CMT; Increase in Processor Speed With No Change in Memory Speed
Figure 7: With CMT; Increase in Processor Speed With No Change in Memory Speed

1.2.1 Niagara Processor

The Niagara 1 processor (code name for Sun's new CMT processor) [13] has eight cores and four hardware threads per core. Each core can process one instruction every cycle. It has a L1 cache per core and a shared L2 cache. The processor does not use ILP [14] but focuses more on TLP. The processor has a simple mechanism to prevent execution stalls [13], switching to another hardware thread. This leads to a simpler processor with very low power consumption. It is well suited for systems such as blade servers, middleware, database servers, and so on.

The Niagara processor has an on-chip Ethernet controller and a TCP/IP off load engine. The on-chip Ethernet processing saves processor cycles and avoids the memory access needed to process the network packets. Locks in a Single CMT Processor

Niagara based systems are single processor based systems. The L2 cache has 12-way associativity and is shared among the logical processors. Once a shared variable like a lock is accessed into the L2 cache then it can be shared by all the threads. This improves the performance compared to a symmetric multi-processing (SMP) system, where threads could be running on different processors, and the L2 cache is shared on each processor but not among processors. Performance

The Niagara processor can process eight instructions simultaneously every clock cycle or one instruction per core.

1.3 Simultaneous Multi-Threading

Simultaneous multi-threading [15], [16], [17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core.

SMT processors have a L1 cache per logical processor while the L2 and L3 cache is usually shared. The L2 cache is usually on the processor with the L3 off the processor. SMT processors usually have logic for ILP as well as TLP. The core is is not only usually multi-issue for a single thread, but can simultaneously process multiple streams of instructions from multiple software threads.

1.4 Chip Multi-Threading

Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread. There is a limit to the ILP [1], [12], [18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18], [19].

1.5 Symmetric Multi-Processing With CMT

SMP involves multiple processors on a single system sharing a single memory system. The memory appears as one to all the processors and can be accessed with the same latency.

A SMP-based system with CMT has multiple processors, with each processor having cores and logical processors. The cores/logical processors share the single memory system.

2 Improving Application Performance on a System With CMT

2.1 Operating System

The OS is the most important component that can improve the performance of applications on a system with CMT. CMT enables a single processor to process instructions from multiple software threads simultaneously. The software threads are scheduled on the cores in CMP processors and logical processors in CMP processors with hardware threads and SMT processors. If the OS is CMT aware then it can schedule the application threads to minimize contention for the resources. The resources of contention are the system bus and shared caches. The OS needs to schedule threads per processor and then per core.

2.1.1 Solaris Operating System

The Solaris Operating System (OS) is CMT enabled with the Solaris 10 release. Applications can see improved performance with the new TCP/IP stack, improved CMT scheduling (available on x86 platforms also), improved MPO, Solaris Containers, DTrace, and so on. The Solaris 9 OS also has some CMT features but applications can reap more benefits by being on the Solaris 10 OS.

The Solaris OS provides the following features to improve application performance. Multi-Threaded Library

The Solaris OS provides a POSIX compliant as well as a native multi-threaded library for multi-threading applications. The library uses a 1:1 threading model. In a 1:1 thread model, an application thread has a corresponding kernel thread. A multi-threaded application with n application threads will have n kernel threads. Each application thread has a light-weight process (LWP) connecting it to the kernel thread. An LWP is always associated with a kernel thread, but not every kernel thread is associated with an LWP [20]. The kernel threads are the entities scheduled on the cores. Usage Using POSIX Threads

CC -mt program -lpthread Using Solaris Threads

CC -mt program

Note: In the Solaris 10 OS, POSIX and Solaris thread library calls are now in, so -lpthread and -lthread are no longer needed. and are provided for compatiblity with older releases. Adaptive Mutexes

The Solaris OS offers adaptive mutexes for use in device drivers. With an adaptive mutex, a thread will spin if the current thread holding the lock is on a core. This avoids the overhead of going to sleep. The thread also spins before sleeping, if there are n cores.

Another feature in the Solaris OS has this effect: Before unlocking a lock, a thread spins to see if other threads are waiting for the lock and if the lock is acquired. This avoids waking up sleeping threads unnecessarily. Usage

void mutex_init(kmutex_t *
             mp, char *
             name, kmutex_type_t  
             type, void *
void mutex_destroy(kmutex_t *
void mutex_enter(kmutex_t *
void mutex_exit(kmutex_t *
int mutex_owned(kmutex_t *
int mutex_tryenter(kmutex_t *

   Scheduler Activation

The Solaris OS provides support for an application thread to communicate with the kernel. The kernel informs the application thread of the current execution state of the LWP. Similarly, the application thread informs the kernel about the priority of the currently running thread, and also about preemption control. If the application thread is in a critical section, it can inform the kernel to not preempt it. Usage

Preemption initialization function
A hint to the scheduler that preemption should be avoided on the current LWP
Removes the hint set by schedctl_start() Rechoose Interval

Rechoose Interval 2 is a thread affinity tunable used by the scheduler to switch a thread from one processor to another [21]. This interval tells the scheduler that the L2 cache is still warm, and it could schedule the thread back on the same processor. This tunable is now core specific, and if L2 caches are shared by the cores in a processor, this tunable is for all the cores.

This interval could be varied for some types of application to improve the performance [22]. Usage

Rechoose interval can be set in /etc/system

set rechoose_interval=150 Memory Placement Optimization (MPO)

MPO 3 can help ensure that data related to a process is stored as close to the processor as possible. This is to improve locality and reduce access latency and improve bandwidth. Access latency refers to how fast data can be brought from memory to the cache, and bandwidth refers to how much data can be moved. MPO [23] is a smart algorithm in the Solaris OS which places process data close to the thread accessing the data by optimizing for latency and bandwidth.

The Solaris OS uses the concept of locality groups ( lgroups) to model the underlying hardware to determine processors and memory proximity. MPO can deliver good performance on servers with memory locality properties without any changes to application code. The Solaris OS also provides APIs that can be used to optimize MPO for an application.

Note: MPO works using a "first touch" algorithm. An application that initializes all of its data structures within an initial serial thread and then spins off lots of compute threads will have all of its memory allocated as close as possible to the processor which ran the initialization thread (thereby guaranteeing that the N-1 threads will be working with "remote" memory). MPO can degrade performance in such cases. The degradation can be severe, a 30% decrease in performance.


Obtain information on scheduling decisions
Obtain information on sched
Get lgroup affinity
Get CPU IDs contained in specified lgroup
Initialize lgroup interface
Get home lgroup
Get latency between two lgroups
Provide information about memory
Provide advice to VM system Multiple Page Size Support

An application using the default page size (8 Kbyte) will have a high number of TLB misses. A TLB miss is very expensive [24]. The Solaris OS provides support for multiple page [25] sizes and allows applications to choose an optimum page size from 8 Kbyte to 4 Mbyte. The page sizes supported can be obtained using the pagesize -a command. Bigger page sizes can improve application performance by reducing the TLB misses. Usage


export LD_PRELOAD=$
export MPSSHEAP=

a1. Example

export MPSSHEAP=

b. Using ppgsz command

The ppgsz command can be used to set the preferred page size for stack, heap, and/or other anonymous segments within scripts, and so on.

/usr/bin/ppgsz [-F] -o option[,option]  cmd | -p pid...

b1. Example of using ppgsz

/usr/bin/ppgsz -o heap=4m,anon=4m,stack=4m command

c. Using memcntl

The memcntl API can be used to request explicit page sizes.

c1. Example of using memcntl

int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,int attr, int mask);

The cmd argument can be used to specify a new control operation, MC_HAT_ADVISE, for page-size operations [25].

d. Using pagesize

The pagesize command can be used to print default and the page sizes supported by the system.

/usr/bin/pagesize -a Library

/usr/lib/ Processor Sets

Processor sets in the Solaris OS allow processors to be grouped. A process can then be scheduled to run in this set. Software threads from the process are scheduled on processors in the set. This can be used to increase the efficiency of scheduling and locality. Each set has its own dispatcher queue, so the scheduler looks in this queue to dispatch the threads. This reduces the latency of scheduling since the scheduler does not have to look at a global queue.

Processors are also organized by system boards on Sun systems. So creating a processor set by using processors on the same board helps in improving locality. Interrupts within a processor set can be disabled to allow threads to run without being context switched.

Note: Create processor sets using cores across processors. This will allow threads to be scheduled across processors minimizing resource conflicts. For Niagara based systems, create processor sets across cores. Usage

a. Processor sets can be created and cores assigned using the following command:

%psrset -c

%psrset -a 1 1 2 3 4 512 513 514 515

b. Interrupts can be turned off within a set using:

%psrset -f 1

c. An application can be run within the set or LWPs of an application can be bound into the set using:

%psrset -e command
%psrset -b [application pid] Multi-Threaded malloc

malloc and free are single-threaded operations and are among the bottlenecks for multi-threaded applications. A multi-threaded malloc scales with multi-threaded requests and can improve multi-threaded application performance. The Solaris OS has two types of multi-threaded malloc libraries, mt-malloc and umem. Usage

a. Using mt-malloc:

b. Using libumem:

cc [ flag... ] file...  
             -lumem [ library... ]

From version 9 onwards, Sun Studio software provides compile options 4 that allow optimizations for UltraSPARC US-IV processors. [21], [26], [27] The compiler also provides software pre-fetch insertions. This allows data to be pre-fetched from memory. The compiler also provides a high performance library, especially optimized for multi-threaded applications. It also provides options to multi-thread single-threaded applications, an optimized math library, and profile feedback. The profile feedback can be used to track runtime paths and generate optimized code for branch predictions. Usage

a. Compile options

-xchip and -xtarget
This generates code optimized for a specific processor.
Allows the compiler to insert pre-fetch instructions into the code. This allows data to be pre-fetched into the cache before it is needed.
-xdepend along with -xprefetch
Allows the compiler to analyze the dependencies between loop iterations, and to determine the memory access patterns to generate better code.
Option allows the most optimized code for a particular platform. [26] The meaning of the -fast option can change with compiler releases. Use the flag -xdryrun to list the components of -fast. The -fast option lets the compiler assume that the target platform the code will run on is the same platform on which it was compiled (because it includes -xtarget=native). Therefore you may need to explicitly set the target platform. For example:

-fast -xtarget=ultra3 -xarch=v8plusa

when compiling applications for 32-bit platforms, or

-fast -xtarget=ultra3 -xarch=v9a

when compiling applications for 64-bit platforms.
Performance library
Sun provides a high performance library for high-speed math (linear algebra and other numerically intensive problems). The library has routines which have been parallelized, and also optimized for performance on UltraSPARC platforms [21]. A platform-specific version is also available.
Allows single-threaded applications to be multi-threaded. Turns on automatic parallelization for multiple processors.
Profile feedback
Allows the compiler to collect information about runtime performance and branches taken. Uses this information as feedback for compilation. The compiler uses the runtime paths to predict branches and generates optimized code for this execution.


  1. %cc -o progt -xO5 -xprofile=collect:prog file.c
  2. %progt
  3. %cc -o prog -xO5 -xprofile=use:prog file.c
Enables optimal reordering of functions and variables by the linker. Reordering variables can solve cache and page contention caused by unrelated variables near each other in memory, unnecessarily large working sets as a result of related variables not near each other, and so on.

This option needs to be used with the linker by creating a map file and linking the map file using the -M option.

1. %CC -xF
2. %CC -M mapfile -o program object-files
Allows the linker to provide additional optimizations that could improve runtime performance.

  1. %cc -o progt -xO5 -xprofile=collect:prog file.c
  2. %progt
  3. %cc -o prog -xO5 -xprofile=use:prog -xlinkopt file.c
-xpagesize= n
Can be used to specify the preferred page size for the stack and the heap. n 5 can be 8K, 64K, 512K, 4M, 32M, 256M, 2G, or 16G. Increasing the page size will reduce page faults and reduce TLB misses, increasing application performance. This option is similar to using the library.

b. New Pragmas Introduced With Sun Studio 9 Software

#pragma does_not_read_global_data
#pragma does_not_write_global_data
This pragma allows the compiler to optimize knowing that the functions will not access any global data.
#pragma does_not_read_global_data( funcname [, funcname])
#pragma does_not_write_global_data( funcname [, funcname])
#pragma no_side_effect
This pragma allows the compiler to optimize knowing that the function will not change any persistent state, do any IO, access read/write state in other parts of the program, or anything else.
#pragma no_side_effect( name[, name...])
#pragma returns_new_memory
This pragma allows the compiler to optimize scheduling and pipelining knowing that the function returns a new memory location.
#pragma returns_new_memory( name[, name...]) Java Virtual Machine

The JVM in the Solaris OS is optimized for CMT processors. The JVM is written in C++ and is compiled with the latest version of Sun Studio software. Using the latest Sun compilers with CMT optimizations can further improve application performance.

2.2 C/C++/Java Applications on the Solaris OS

2.2.1 Multi-Threaded Applications

Multi-threading is the execution of many tasks within a single process. The tasks could be executed concurrently on different software threads. On an SMP system with no CMT, the software threads are executed on different processors in the system. On an SMP system with CMT, the threads are executed on cores in CMP processors, and logical processors in CMP processors with hardware threads and SMT processors.

Multi-threaded applications run with no changes on CMT-enabled systems. Optimizations made for SMP-enabled system hold for CMT-enabled systems. The optimization specific to CMT-enabled systems would be to avoid cache thrashing as multiple threads now run simultaneously on a processor. Other optimizations would be to keep threads accessing shared data closer to the shared L2 cache, lock-free data structures, atomic operations, thread local storage, multi-threaded malloc, and so on. Lock-Free Applications

Locks are memory locations used to synchronize access to a critical section. Locks are accessed to read the current value. The value of a lock needs to be changed in an atomic way. Locks are shared by threads which need access to a critical section. If the lock is not in the L2 cache then this needs to be accessed from memory (an expensive operation). If it is in the L2 cache then threads running on cores within a processor can access the lock value in the shared L2 cache, without a memory operation.

A lock protects a critical section, so when one thread enters the critical section, other threads accessing the critical section will be blocked. When the lock is released, these blocked threads need to be woken up, resulting in context switches. Lock-free structures [28], [29], [30] avoid thread blocking and context switches. The data can also be in the shared caches, avoiding memory access. Usage

Refer to Practical lock-free data structures and AppCore: A Portable High Performance Thread Synchronization Library. Atomic Operations

Atomic operations [31], [32], [33] are similar to lock-free structures and are completed without needing a lock to access/change data. The operation could be incrementing/decrementing/setting a value at a memory location. This will be faster if the memory location value is in the shared L2 cache and accessed by threads running on cores within a processor. Atomic operations also avoid a thread context switch. Usage

Atomic libraries are part of the Solaris 10 OS and part of Java 2 Platform, Standard Edition (J2SE) 5.0.

a. C/C++ applications

a1. Include file


a2. Library


Releases below Solaris 10 can use the library referenced in [33].

b. Java Applications

import java.util.concurrent.atomic.*;

AtomicBoolean.set(true) Cache Thrashing

Threads now run simultaneously within a processor. Since they all access memory locations simultaneously, if the size of the shared L1 and L2 cache is small, then this might result in the destruction of the cache data in use, as new cases of access fill up the caches. A running thread might need information to be brought in again from memory-degrading application performance [34].

Most applications cannot control how data is laid out in the cache [35], [36], but system-level applications like database servers, JVM, drivers, and so on, can make use of the cache sizes to keep the most accessed data in the cache.

Cache statistics can be measured using cpustat or cputrack utilities on the Solaris OS. har [37] is a utility developed by a Sun engineer that can make it easier to measure the instructions executed and cache statistics. Usage

Example A

cpustat [-c events] [-nhD] [interval [count]]

        cpustat -c IC_ref,IC_hit -c Cycle_cnt,Instr_cnt 1 3

Example B

cputrack [-T secs] [-N count] [-Defhnv] [-o file]

        -c events [command [args] | -p pid]

        cputrack -T 0 -fve -c Cycle_cnt,Instr_cnt sh -c date

Example C

har [-{i|s|r|M}] [-{u|a|k}] [-DnmRh] [interval [count]]

        har -r 1 Thread Local Storage

Thread local storage (TLS) is data local to a thread and not accessible from other threads. This is similar to local variables in a function or a method. Local variables are allocated on the stack knowing that these can be destroyed at the end of the function or method. Similarly, thread local allocations can be kept on the stack but are needed for the lifetime of the thread. The compiler can optimize to pre-fetch these into the L2 cache for improved performance. Reducing global data to thread local data, if the data is not shared between threads, should improve performance and eliminate synchronization. Usage

a. C/C++ Applications

Variables are declared thread-local using the __thread keyword.

__thread int i;
__thread char *p;
__thread struct state s;
cc -xthreadvar   

b. Java Applications


public class SerialNum {
         // The next serial number to be assigned
   private static int nextSerialNum = 0;
   private static ThreadLocal serialNum = new ThreadLocal() {
        protected synchronized Object initialValue() {
        return new Integer(nextSerialNum++);

   public static int get() {
        return ((Integer) (serialNum.get())).intValue();
} Use VIS Instruction Set

VIS 6 is an extension of the SPARC instruction set. The VIS instructions can process four 16-bit data, packed into a register with one instruction. This instruction set can accelerate processing of some algorithms by as much as 7 times, by performing up to 10 operations in parallel per cycle [38]. Applications using imaging, linear algebra, signal processing, audio, video and networking can benefit from this increased performance. Applications written in the Java programming language can use VIS by means of wrappers. Wrappers in Java technology use Java Native Interface (JNI) to access the library. Usage

VIS can be used with the VSDK [39] or the Media Library [40].

a. Examples:

VectorAdd(), MatrixAdd(), MatrixMul()

b. Wrappers for Java applications are also available for accessing the VIS library using JNI.

2.3 Single-Threaded Applications

As the Solaris OS uses a 1:1 threading model, an application thread has a corresponding kernel thread. A multi-threaded application has n application threads (LWPs), and n kernel threads. A single-threaded application has one LWP.

Running multiple single-threaded processes results in your application appearing as multiple LWPs to the kernel. The difference between the single-threaded application and multi-threaded application is the size of the process, how the data segments are loaded in memory, and the address space. In single-threaded apps, text and libraries segments are shared, but data can only be shared through interprocess communication across an address space. On the other hand, in a multi-threaded process, data between LWPs can be shared through intra-process communication within an address space, including direct access to a global variable.

The second difference is the scheduling. The Time Share (TS) class has the time-quanta per process, so each LWP gets more core time. A single-threaded application waiting for IO will be context switched, but in a multi-threaded app, a second thread could continue processing using the time-quanta. Running multiple copies of a single-threaded process results in a process context switch, which is more expensive than a thread context switch.

2.3.1 Multi-Threading a Single-Threaded Application

If a single-threaded application cannot be multi-threaded, the loops and some of the library routines it uses can be made multi-threaded to improve the efficiency [41]. The Sun Studio compiler offers an autopar option, which can be used to make loops multi-threaded. A high-performance multi-threaded library ( perflib) is also available that can be used along with the autopar option to improve the single-threaded application performance. The number of threads to use can be specified through an environment variable. Usage

cc -fast -xO4 -xautopar example.c -o example

setenv PARALLEL 2
setenv SUNW_MP_THR_IDLE value

2.3.2 Single-Threaded Applications in a Container

The Solaris 10 OS provides containers called zones [42], [54], which provide fault isolation, resource allocation and management. Solaris Zones can be used to sand-box applications on a server within a single operating system instance. This can be used to run multiple copies of a single-threaded application in isolation and is similar to running the single-threaded application on multiple systems. Zones provide a solution to virtualize the operating environment. Each zone can have its own IP address and its own file system. Single-threaded applications with dependency on ports, IP address, and so on, can now be run in zones. This allows single-threaded applications to scale horizontally on a vertical system. Usage

zonecfg -z lisa 'create; set zonepath=/aux0/lisa'
zoneadm -list -vc
zoneadm -z lisa install
zoneadm -list -vc
zoneadm -z lisa boot
zoneadm -list -vc
zlogin lisa

[install and run your app]

2.4 Java Technology-Based Applications on the Solaris OS

Java applications need no code changes since the JVM is optimized for a CMT system. Using lock-free, atomic, or TLS data structures should improve the performance.

2.4.1 Garbage Collection Tuning

Tuning garbage collection [43], [44], [45], [46] is rarely considered. However, just making a few changes to the runtime environment, changing the garbage collector, can result in quite a bit of improvement in performance (see 3.1.2 for a 39% improvement). J2SE 5.0 technology offers automatic tuning but this is limited to the garbage collection (GC) [47] at the moment. GC can be tuned differently for throughput applications and for low-pause applications [47]. For throughput applications, using automatic tuning in J2SE 5.0 software offers a better performance advantage. Low-pause applications can use automatic tuning or the concurrent collector for the optimum performance at low pause. Usage

a. For throughput applications, use:

java -XX:+UseParallelGC -XX:GCTimeRatio  

b. For low-pause applications, use:

b1. Automatic tuning

java -XX:+UseParallelGC -XX:GCTimeRatio=xx

        -XX:MaxGCPauseMillis=xx application

b2. Concurrent collector

java -Xconcgc application

2.5 C/C++ Applications on the Solaris OS

A multi-threaded C/C++ application that has been optimized for scaling on an SMP system runs with no changes on a CMT-enabled SMP system. Using the latest compilers like Sun Studio with the compiler optimizations mentioned above should further improve the performance.

3 CMT System Efficiency

The efficiency of a CMT-enabled system can be identified by the memory stalls, cache references and misses, and number of instructions processed per cycle. Cache misses, memory stalls, and instructions processed can be obtained using the cputrack, cpustat, and har utilities on the Solaris OS. An ideal system with no cache misses or stalls should run at 100% efficiency.

An SMP system with CMT should be able to increase the number of instructions processed by the number of cores.

3.1 Test Application in Java Programming Language

I wrote a small Java application to find the scalability on an UltraSPARC US-IV based system. The system is a Sun Fire E6900 server with multiple domains running the Solaris 10 OS (Build 63). A domain with 4 processors (8 cores) @ 1.2 GHz (16 Gbyte of memory) was used to test the Java application.

I wrote three different tests to test the scalability. One of the tests is a version in Java programming language of a working set problem from an earlier paper [24]. The output of the test is units of work done in a known amount of time. A thread checks if time has elapsed, and if not, goes ahead and starts performing the work. To reduce concurrency locking overhead, an atomic boolean test has been used [32]. The work done is stored on the stack (thread local storage could be used here), and this is updated in the instance through an atomic add [34] at the end of the test. This reduces contention to the boolean test, so increasing the number of threads should increase the work done linearly.

Each test was run for 300 seconds, and the number of threads was doubled for each test. The test outputs the results at the end of the test. har [37] was used to capture the instructions per cycle. The tests were run twice, without har and with har to capture the instructions per cycle. Running har impacts the results of the test. The tests were run sequentially, increasing the number of threads to make use of the warm cache.

The program was run with no options to capture the baseline performance and scalability. Adding GC tuning, MPSS, and multi-threaded malloc options results in performance improvement for some of the configurations. The performance of the working set configuration shown in section 3.2.1 (8 threads) improves by 39% to 121 units of work done.

The java command line with (no options) was used to run the program:

%java CMT.Cmt 1 0 0 300000

Here is the java command line with automatic GC tuning, mpss, and multi-malloc options:

export LD_PRELOAD=$
export MPSSHEAP=512k
export MPSSSTACK=64k

%java -ms2g -mx2g -XX:+UseParallelGC -XX:GCTimeRatio=30 
        CMT.Cmt 0 8 0 300000

The program takes four arguments where each argument represents a type of work and the number of threads to use. The first argument represents compute work, the second represents working set work, the third represents IO work, and the fourth the time to work. Increasing this number should increase the work done linearly, unless no more compute resources are available or a bottleneck occurs.

3.1.1 Test 1: Computation Work Results

Table 1: Test 1:1


Threads Units Work Done Avg. Time Scalability
1 59982202100 300 1
2 119963907231 300 2
4 239884329382 300 4
8 478137581539 300 7.97
16 478310064785 300 7.97

Note: Tests were run to get the best numbers.

Table 2: Test 1:2


Threads MIPS CPI Efficiency
1 1597.5 2.78 4.16
2 3183.5 1.53 8.29
4 4180.11 1.23 10.89
8 12761.94 0.8 33.23

 Figure 8: Graph Test 1:1
Figure 8: Graph Test 1:1

The system scales almost linearly and flattens out at eight threads as the system has only eight cores. Increasing the number of threads beyond eight does not increase the work done. The maximum efficiency achieved was 33% at 8 threads. The efficiency was calculated using the maximum MIPS 7 possible. At 100% efficiency, the system with eight cores can execute 38400 MIPS @ 0.25 CPI. The formula used is (processor speed * cores * number of instructions possible per cycle)/1000000. So for a US-IV processor at 1.2 GHz the maximum MIPS would be (1200000000 * 8 * 4)/1000000 = 38400 MIPS. With eight threads the MIPS achieved was 12761 which is 33% of the maximum MIPS.

3.1.2 Test 2: Working Set Results

Table 3: Test 2:1


Threads Units Work Done Avg. Time Scalability
1 8 331 1
2 20 306 2.5
4 64 312.75 8
8 89 321 11.13
16 83 332.81 10.38

Table 4: Test 2:2


Threads MIPS CPI Efficiency
1 69.05 11.02 0.18
2 210.05 10.93 0.55
4 888.21 8.89 2.31
8 696.22 15.94 1.81

 Figure 9: Graph Test 2:1
Figure 9: Graph Test 2:1

The second test was to show a working set access. The tests were run sequentially to use a warm cache. The scalability seen is due to the caching. This again peaks at 8 threads. The efficiency is poor due to the high cache miss and memory stalls.

3.1.3 Test 3: IO Work Results

Table 5: Test 3:1


Threads MIPS CPI Efficiency
1 370.33 2.96 0.96
2 611.89 3.31 1.59
4 553.06 2.81 1.44
8 511.29 2.87 1.33

The IO test was to check the scalability with disk access. The biggest bottleneck turned out to be file open. If this is changed by opening the file once, then disk performance becomes the bottleneck. The domain had only 1 root disk and at 8 threads, iostat writes/sec were up to 400, and the actv length up to 250, with w% and b% at 100%. A faster disk (hardware RAID) should show good scalability.

Table 6: Test 3:2


Threads Units Work Done Avg. Time Scalability
1 3943 300 1
2 5886 300 1.49
4 5636 300 1.43
8 5178 300 1.31
16 5269 301 1.34

 Figure 10: Graph Test 3:3
Figure 10: Graph Test 3:3

4 Conclusion

CMT offers a very easy way to improve application efficiency without changes to the application code. The throughput improves by the number of cores/logical processors. The power of CMT is in scheduling. The OS needs to be CMT aware, and it should also be able to schedule threads, aware of resource sharing to maximize performance.

Applications using non-locking structures, atomic operations, mt-malloc and thread local storage for data should see good improvement in performance and scalability. Processor sets can also be used to improve scheduling efficiency. Using the latest compilers like Sun Studio with CMT optimizations can further improve application performance.

5 Acknowledgments

I would like to thank my colleagues for some very good feedback which has provided structure to this paper. Morgan Herrington did a great review and suggested the current structure. His experience on memory management is reflected in the MPO and MPSS sections, including the first touch degradation example. His review was a valuable input that increased the readability of the paper. Teresa Majer provided some very good feedback, including pointing me to CMT material, and feedback on CMP and CMT when I first started the paper. Nils Gura gave great feedback, including suggesting a glossary to manage the new terminology in the paper. Denis Sheahan provided valuable feedback on Niagara based systems. Hugo Rivero provided access to a Sun Fire E6900 server, and his feedback increased the readability of the figures and charts in the paper, including the scalability results. Bruce Chapman and S.R. Venkatramanan offered some excellent review and feedback, including the start of the huge reference section. I would especially like to thank a lot of folks on the Sun internal aliases who again provided valuable feedback. Don Kretsch's suggestion of including single-threaded applications opened up the sections on compiler options and automatic parallelization. Patrick McGehearty and Paul Ryan's suggestions increased the readability of the paper, while Chien-Hua Yen's suggestions increased the accuracy of the sections on the Solaris 10 OS.

I would like to thank my ISVs, Sonic Software and Highdeal Transactive, who took part in the OSS/J IP Billing Massive Scalability Study, for also reviewing the paper, especially Simon Waterer who provided some very good suggestions.

I also wish to thank the folks on the Sun developer portals who spent time on editing, proofing and reindexing this paper.

Finally, I would like to thank everyone else not mentioned in the paper but who provided valuable suggestions that made writing this paper possible.

6 References
  1. Introduction to Throughput Computing (pdf), Sun Microsystems, Inc.
  2. Beat The Clock (feature article)
  3. The PC Guide - Topic Index
  4. The Processor-Memory bottleneck: Problems and Solutions, Nihar R. Mahapatra and Balakrishna Venkatrao
  5. Sun's Throughput Computing (pdf), Vernon Turner, Dr. Christopher G. Willard
  6. On-Chip Multiprocessing, Paul Mazzucco
  7. Sun MAJC 5200: A Processor for the 21st Century (ppt), Marc Tremblay
  8. TLP and the Return of KISS, Chris Rijk
  9. UltraSPARC IV Processor Architecture Overview (pdf), Quinn Jacobson, Sun Microsystems, Inc.
  10. UltraSPARC IV Processor: User's Manual Supplement (pdf)
  11. UltraSPARC III Cu User's Manual (pdf), Sun Microsystems, Inc.
  12. PC Processor Microarchitecture, ExtremeTech
  13. Niagara: A Torrent of Threads, Chris Rijk
  14. Limits of Instruction-Level Parallelism (pdf), David W. Wall
  15. Alpha EV8 (Part 2): Simultaneous Multi-Threat, Paul DeMone
  16. Hyper-Threading Technology on the Intel Xeon Processor Family for Servers(pdf), Intel Corporation
  17. Simultaneous Multithreading: Present Developments and Future Directions (pdf), Miquel Pericas
  18. Converting Thread-Level Parallelism Instruction-Level Parallelism via Simultaneous Multithreading (pdf)
  19. A Comparative Study of SMT and CMP Multiprocessors (pdf), Rajat A Dua, Bhusan Lokhande
  20. Multithreading in the Solaris Operating Environment (pdf)
  21. Techniques for Optimizing Applications: High Performance Compiling, Rajat P. Garg, Ilya Sharpov, Sun BluePrints Publications
  22. MPO white paper on iForce partner portal
  23. Supporting Multiple Page Sizes in the Solaris Operating System
  24. Application Performance Optimization (pdf), Borje Lindh, Sun BluePrints OnLine
  25. Practical lock-free data structures, University of Cambridge
  26. Some Notes on Lock-Free and Wait-Free Algorithms, Ross Bencina
  27. AppCore: A Portable High-Performance Thread Synchronization Library (under construction)
  28. Atomic Instructions in Java (pdf), David Hovemeyer, William Pugh, and Jaime Spacco
  29. Package java.util.concurrent.atomic
  30. A performance methodology for commercial servers, S. R. Kunkel, R. J. Eickemeyer, M. H. Lipasti, T. J. Mullins, B. O'Krafka, H. Rosenberg, S. P. VanderWiel, P. L. Vitale, and L. D. Whitley
  31. Memory Hierarchy in Cache-Based Systems (pdf), Ruud van der Pas, Sun BluePrints Publications
  32. Tag Storage
  33. VIS Instruction Set User's Manual (pdf)
  34. VIS Instruction Set
  35. mediaLib Libraries
  36. Parallelizing Sun C Code, C User's Guide (
  37. Solaris Zones: Operating System Support for Consolidating Commercial Workloads (pdf), Daniel Price, Andrew Tucker
  38. Garbage Collection (GC) Tuning Techniques and the GC Portal, Nagendra Nagarajayya, Alka Gupta, SR Venkatramanan
  39. GC Portal, Alka Gupta
  40. USENIX WIP JVM 2002, Self Tuning JVM to Improve Application Performance and Scalability, Nagendra Nagarajayya, Alka Gupta, SR Venkatramanan, Mayank Srivastava (not available online)
  41. Java Performance Tuning From A Garbage Collection Perspective, JA SIG 2004 (pdf), Nagendra Nagarajayya
  42. Chip Multiprocessing, ComputerWorld article
  43. (Glossary Search: L2 cache)
  44. Solaris Processes: LWP, Princeton University
  45. Wikipedia: Instruction Pipeline
  46. TLB page on Cornell University site (formerly located at
  47. Appendix: Vertical/Horizontal Multi-threading (SaskTel web site)
  48. Get in the Zone with Solaris 10, Dennis Clarke

7 Footnotes
  1. Niagara based systems require the Solaris 10 OS and above.
  2. Rechoose interval does not apply to the Niagara processor.
  3. Niagara based systems cannot make use of MPO since they are single-processor-based systems.
  4. Niagara based systems might need different optimizations as there is no branch prediction and the pipe is a simple pipe.
  5. Niagara based systems have 8k, 16k, 64k, 4m, 256m page sizes. Use a bigger page size like 4m to improve the performance since the L2 cache is small.
  6. Niagara based systems emulate VIS instructions. Performance degradation could occur with VIS.
  7. Niagara based systems do not have cycle counts. Only misses are counted, so efficiency is difficult to calculate.

Appendix A: Glossary
Term Description
Architecture States The additional registers and logic that are needed in a core to keep track of a thread execution.

A chip is the same as a silicon die. Chip could also include the plastic and the pins.

Chip Multi-Processing (CMP)

Chip multi-processing is a technique in CPU design that combines two or more processor cores on a single piece of silicon (called a die) to enhance computing performance over servers with single or multiple discrete CPUs. It's also known as "on-chip multiprocessing" or "multiple-processor system-on-a-chip". [48]

Chip Multi-Threading (CMT)

A processor capable of executing two or more software threads simultaneously without restoring to a software context switch. Chip multi-threading may be achieved through the use of multiple processor cores, supporting multiple threads per core, or a combination of these strategies [10].

Core A hardware unit that instantiates one or more logical processors [10].


Cycles per instruction


CPU is the same as a core.

Execution Unit

An execution unit is the same as a core.

Hardware Threads

Hardware threads and logical processors are the same. The hardware thread is a view from the core, while a logical processor is a view from the OS.

Horizontal Multi-Threading (HMT)

Multi-threading allows threads to share functional units on a core without halting the execution of a thread (an idle functional unit can be assigned to any thread that needs it).

Instruction Level Parallelism (ILP)

Instruction Level Parallelism exploits parallelism within a single thread using compiler and processor technology to simultaneously execute independent instructions from a single thread.

L1 Cache

Level 1 cache for instructions and data. It is a small piece of very fast memory that is almost always on the core itself. It sits between the core registers and the L2 cache. Typically L1 cache has a lower latency than L2 cache, making it more expensive to produce and harder to produce in larger quantities without additional complexity.

L2 Cache

A piece of fast memory that sits between the L1 cache of the processor and main memory. It is usually larger than L1 cache, and the L1 cache checks the L2 cache before going to main memory for data [49]. L2 caches could also be on the same die as the L1 cache.

L3 Cache

This type of cache is becoming more prevalent as processors are built with L1 and L2 cache on the same die. L3 cache is the extra cache that sits between the processor and main memory.

Light-Weight Process (LWP)

A light-weight process is a "virtual CPU" that performs the processing for an application. Application threads are attached to available light-weight processes, which are attached to a kernel thread, which is scheduled on the system's core dispatch queue. [50]

Logical Processor

The abstraction of a processor's architecture that maintains the state and management of an executing thread [10] as seen by the OS.


Same as a system.


Millions of instructions per second


Operating system


Pipelining seeks to improve performance by reducing the amount of idle time. To do this the core includes new circuitry that examines the instructions, and breaks them down into their sub-instructions. These smaller instructions are then dispatched to the processing logic in step, with the pipeline controller watching the execution [51].


A processor is a single piece of silicon that interprets and executes operating system functions and other software tasks. A processor is implemented by one or more cores [38].

Silicon Die

A square of silicon that contains a CPU.

Simultaneous Multi-Threading (SMT)

Same as horizontal multi-threading

Symmetric Multi-Processing (SMP)

A system with many processors with a single memory system. All the processors access the memory with the same latency.


A system has processors, memory, power and storage. It has an OS managing the system.

Thread Level Parallelism (TLP)

Multiple software threads executing simultaneously on cores in a processor.

Threads Software threads from a multi-threaded application.
Translation Lookaside Buffer (TLB)

The TLB [52] is a table in the processor that contains cross-references between the virtual and real addresses of recently referenced pages of memory. It functions like a "hot list" or quick-lookup index of the pages in main memory that have been most recently accessed.

When a cache miss occurs, data must be fetched from an address in virtual memory. This virtual memory address must be translated into a main memory real address. If the real-memory address of a desired page is not in the TLB, a further delay is incurred while the real address is determined by other means.

Vertical Multi-Threading (VMT)

Vertical multi-threading allows a core to switch execution between threads without needing to save thread state. VMT generally uses duplicated registers, and is usually used to continue execution with another thread when one thread hits a delay due to a cache miss and must wait. [53]

Appendix B: Prototype of Java Technology-Based Application

In this prototype, I have changed the working set and float values to integers to demonstrate better computational scalability. The prototype might show numbers which are much higher than the results in the paper.

Source code

About the Author

Nagendra Nagarajayya has been working at Sun for the last 11 years. He is a Staff Engineer in Market Development Engineering (MDE), where he works with Independent Software Vendors (ISVs) in the telecommunications industry, on issues related to architecture, performance tuning, sizing and scaling, benchmarking, POCs, porting, and so on. His interests include multi-threading issues, concurrency and parallelism, HA, distributed computing, networking, and performance tuning for Java and Solaris platforms.

Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.
Left Curve
System Administrator
Right Curve
Left Curve
Developer and ISVs
Right Curve
Left Curve
Related Products
Right Curve