Chip Multithreading and Multiprocessing

By Erik Fischer, Technical Ambassador, Sun Microsystems, Hungary, December 14, 2005  

Increasing microprocessor performance by on-chip resource replication

It has been demonstrated over the last couple of years that the traditional approach to increasing the performance of a microprocessor, adding more execution units, is not efficient. (Execution units are the parts of the microprocessor that do the real work: executing integer and floating-point arithmetic instructions or moving data to and from the processor.) The performance of a microprocessor does not increase in proportion to the number of execution units because of on-chip resource limitations in different subsystems (e.g. the number of ports of the register file, the associativity of the cache or the number of entries in the branch predictor array). Most of the time these resource limitations cannot be avoided due to physical restrictions. However, in the last three years two new technologies have emerged and been implemented in commercial processors to try to hide the effect of these restrictions. Rather than adding execution units, these technologies duplicate or multiply a rather different set of features in the microprocessor.

It is key to understand these new architectures, since Sun Microsystems and other companies are already using them or will start using them this year. One of these two technologies is Simultaneous Multithreading (SMT). Intel announced processors using this technology under the marketing name Hyperthreading in the Pentium 4 product family.

Simultaneous Multithreading (Hyperthreading)

In an SMT processor, designers duplicate the resources needed to handle more than one computational entity (program, task, process or thread) context directly in the hardware. What is it good for? To figure it out, we have to go down to the code of those computational entities. 

Since there are many dependencies between the instructions of any single computational entity (for example, an add instruction that needs the result of a preceding load) and those dependencies limit instruction-level parallelism, some resources, namely execution units, will sit unused in the processor. To occupy the unused resources, the processor can execute instructions from multiple entities, since there are no dependencies between those instructions (except some physical ones, which can be dealt with relatively easily). How a processor fetches instructions from the computational entities is implementation dependent, but the most common implementation is time multiplexing, where in each cycle instructions are fetched from a different entity.
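The effect of dependencies on execution unit utilization can be illustrated with a toy model. The Python sketch below is not a description of any real processor: the function names, the one-cycle instruction latency and the round-robin choice between threads are all assumptions made for illustration. Each thread is a list of instructions, and each instruction lists the earlier instructions of the same thread whose results it needs.

```python
def simulate(threads, units):
    """Return the number of cycles needed to issue every instruction.

    threads[t][i] lists the indices of EARLIER instructions in thread t
    that instruction i depends on. Assumptions (illustrative only): every
    instruction takes one cycle, any ready instruction may issue, and at
    most `units` instructions issue per cycle.
    """
    issued_at = [[None] * len(t) for t in threads]
    remaining = sum(len(t) for t in threads)
    cycle = 0
    while remaining > 0:
        # An instruction is ready once all of its producers have issued
        # in an earlier cycle.
        ready = [
            [i for i, deps in enumerate(instrs)
             if issued_at[t][i] is None
             and all(issued_at[t][d] is not None for d in deps)]
            for t, instrs in enumerate(threads)
        ]
        slots = units
        start = cycle % len(threads)  # rotate the starting thread each cycle
        while slots > 0 and any(ready):
            for k in range(len(threads)):
                t = (start + k) % len(threads)
                if slots > 0 and ready[t]:
                    issued_at[t][ready[t].pop(0)] = cycle
                    slots -= 1
                    remaining -= 1
        cycle += 1
    return cycle


def utilization(threads, units):
    """Fraction of issue slots that did useful work."""
    total = sum(len(t) for t in threads)
    return total / (simulate(threads, units) * units)


# A chain of four instructions, each depending on the previous one.
chain = [[], [0], [1], [2]]
print(utilization([chain], 4))       # 0.25: one thread leaves 3 of 4 units idle
print(utilization([chain] * 2, 4))   # 0.5
print(utilization([chain] * 4, 4))   # 1.0
```

In this model a single fully dependent thread can never use more than one unit per cycle, while independent threads fill the remaining units without lengthening the run.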

The major advantage of the SMT concept is that it requires only 5-25% extra transistors. These extra transistors store information separately for each computational entity context, so an SMT processor has multiple Instruction Pointers/Program Counters and Register Sets. The exact number of extra transistors is implementation dependent: the processor architect decides which resources are replicated beyond the mandatory ones, and also which structures in the microprocessor are divided equally or shared arbitrarily among the computational entities. The overall design space is large enough to allow many different implementations with different performance characteristics. Due to the relatively small increase in transistor count, an SMT processor can be manufactured with almost the same efficiency as the non-SMT version of the same processor. The drawback of the concept is that verification complexity is much higher, so it takes more time to verify the correct operation of an SMT processor. Another issue is that an SMT processor must be initialized in single-threaded mode and later switched into SMT mode, which requires some modifications in the operating system and in the boot facility (Boot PROM or BIOS). The operating system sees each hardware-supported computation context as a separate processor.
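Because each hardware context appears to the operating system as a processor, software usually sees the number of logical processors rather than the number of physical cores or chips. A minimal Python sketch, assuming a standard Python 3 installation (the helper name logical_processors is ours, chosen for illustration):

```python
import os


def logical_processors():
    """Number of logical processors reported by the operating system.

    os.cpu_count() may return None when the count cannot be determined;
    in that case this sketch falls back to 1.
    """
    return os.cpu_count() or 1


print(logical_processors())
```

On a machine with multithreaded cores, this figure typically counts every hardware thread, so it can be larger than the number of cores.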

Other multithreading techniques

SMT is not the only way to utilize multiple computational entity contexts in a microprocessor. Designers have introduced time-multiplexed and Switch-on-Event (SoE) multithreading processors as well. Time-multiplexed multithreading switches the computational entity context in every clock cycle and in this way tries to hide long-latency events (e.g. L2 cache misses or TLB misses). In contrast, SoE multithreading switches the computational entity context if and only if a long-latency event happens. It is also possible to combine these two techniques into a hybrid multithreaded operation. Sun Microsystems' UltraSPARC-T1 processor implements this hybrid multithreading.
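The difference between the two switching policies can be made concrete with a small simulation. The Python sketch below is a simplified model and not the behavior of any particular processor: it assumes a single-issue core, a fixed miss latency, and threads written as strings in which 'c' is one cycle of computation and 'm' is an operation that triggers a long-latency miss. The function name run and the policy names are hypothetical.

```python
def run(threads, policy, miss_latency=3):
    """Return, for each cycle, the index of the thread that executed
    (or None if the core was idle).

    policy == "interleave": the preferred thread changes every cycle.
    policy == "soe":        stay on the current thread until it stalls.
    """
    n = len(threads)
    pc = [0] * n          # next operation of each thread
    ready_at = [0] * n    # first cycle in which each thread may run again
    current = 0
    cycle = 0
    trace = []

    def runnable(t):
        return pc[t] < len(threads[t]) and ready_at[t] <= cycle

    while any(pc[t] < len(threads[t]) for t in range(n)):
        start = cycle % n if policy == "interleave" else current
        candidates = [(start + k) % n for k in range(n)]
        chosen = next((t for t in candidates if runnable(t)), None)
        if chosen is not None:
            op = threads[chosen][pc[chosen]]
            pc[chosen] += 1
            if op == "m":
                # The thread cannot run again until the miss is serviced.
                ready_at[chosen] = cycle + 1 + miss_latency
            current = chosen
        trace.append(chosen)
        cycle += 1
    return trace


print(run(["cmcc"], "soe"))                 # [0, 0, None, None, None, 0, 0]
print(run(["cmcc", "cccc"], "soe"))         # [0, 0, 1, 1, 1, 1, 0, 0]
print(run(["cmcc", "cccc"], "interleave"))  # [0, 1, 0, 1, 1, 1, 0, 0]
```

With a single thread the core idles for the whole miss. With two threads, both policies fill those cycles with work from the other thread; they differ only in when the switch happens.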

Chip Multiprocessing

The second performance increasing technology is completely different, but looks similarly promising. This technology, called Chip Multiprocessing (CMP), duplicates or multiplies the entire processor core with almost all of its subsystems on a single die. It is also possible to "simply" co-package two already existing and only slightly modified processors with some additional logic, which will behave exactly as a dual-core die. 

CMP has the benefit of easily being pin compatible with the previous generation, so a CMP processor can fit into an existing computer and multiply the number of processors available in the box. Replicating a complete processor core exactly brings another advantage: the validation effort increases only moderately. Nevertheless, as always, there is a dark side: the replication requires a large number of transistors, which occupy a large die area. Larger dies are more expensive to fabricate and to package. Furthermore, a chip with more transistors dissipates more heat. Keeping these parameters at least at the same level requires new fabrication technologies.

Many CMP processors are available commercially. The very first dual-core chip, though not widely recognized as such, was Sun Microsystems' MAJC-5200 processor, which was followed by the IBM Power4 processor. Later, Sun Microsystems introduced the UltraSPARC-IV and IV+ processors, HP released the PA-RISC 8800, AMD started to sell dual-core Opteron and Athlon 64 chips, and Intel shipped various dual-core Pentium and Xeon chips. Intel also plans to release a dual-core Itanium derivative. It is also possible to combine the CMP concept with the multithreading techniques described earlier to design and produce multicore, multithreaded processors. The first such chip was the IBM Power5 processor, with two cores and two threads per core. Recently, Sun Microsystems announced its UltraSPARC-T1 processor, which not only implements a hybrid multithreading technique but also contains eight such multithreaded cores.


To evaluate the performance of the two technologies, we have to understand that, by the very nature of their designs, both methods increase the performance of a multiprogrammed environment and not necessarily the performance of a single application, though some single-application improvement is possible in special circumstances. Both technologies can have similar limitations related to resources shared by multiple on-die cores or thread contexts. These limitations are mostly centered on the data and instruction paths connecting the processor with other processors, caches and off-chip memory banks.

In most cases, the higher execution efficiency of these advanced designs requires more instructions and data, which puts a higher load on these I/O paths. This can be an issue that is not necessarily easy to circumvent. In addition, for both architectures, optimal performance requires new optimizing compilers and recompilation. According to research and real-life benchmarks, a 2-way SMT processor can show a 5-10% improvement, while a 4-way design can achieve a 10-30% improvement compared to a non-threaded processor. Dual-core CMP processors can easily reach a 30-80% performance improvement compared to a single core.
