Calculating Processor Utilisation From the UltraSPARC T1 and UltraSPARC T2 Performance Counters

   
By Darryl Gove, September 2007  

Use the performance counters for the UltraSPARC T1 and UltraSPARC T2 processors to estimate core load and find potential areas for performance improvement.

Introduction

The UltraSPARC T1 and UltraSPARC T2 processors are designed for high throughput, and as such they replicate a very simple core multiple times so that the processor can handle many threads. The UltraSPARC T1 processor has eight cores, and each core can issue one instruction per cycle and supports four threads, making a total of 32 virtual processors. The UltraSPARC T2 processor has eight cores, and each core can issue two instructions per cycle and can support eight threads, making a total of 64 virtual processors. The peak performance of the processor can be calculated by multiplying the frequency by the number of cores and the number of instructions that can be issued for each core. Assuming that both processors run at 1.4GHz, then the UltraSPARC T1 processor can sustain 1.4*8 = 11.2 billion instructions per second, and the UltraSPARC T2 processor can sustain 22.4 billion instructions per second.
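The peak-rate arithmetic above can be sketched in a couple of lines of Python (the 1.4GHz figure is the assumption stated in the text; the function name is illustrative):

```python
def peak_instructions_per_second(freq_hz, cores, issue_width):
    """Peak issue rate = frequency x number of cores x instructions
    issued per core per cycle."""
    return freq_hz * cores * issue_width

# UltraSPARC T1: 8 single-issue cores at 1.4 GHz -> 11.2 billion instructions/s
t1_peak = peak_instructions_per_second(1.4e9, 8, 1)
# UltraSPARC T2: 8 dual-issue cores at 1.4 GHz -> 22.4 billion instructions/s
t2_peak = peak_instructions_per_second(1.4e9, 8, 2)
```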

Because multiple threads are sharing a single core, the question of whether the core is fully loaded or not becomes interesting. For example, suppose that the core is fully loaded. That means each thread should be getting its fair share of the available instruction slots. So each thread should be able to issue an instruction every four cycles. In this instance it becomes interesting to ask whether running fewer threads on the core would improve the latency of the application. Alternatively, if not all the threads on a particular core are busy, it is interesting to ask whether the core has sufficient instruction issue capacity to handle additional threads.

This article examines the issue of utilisation of UltraSPARC T1 and UltraSPARC T2 cores and attempts to determine whether performance would benefit from fewer or more virtual processors being assigned work.

Instruction Utilisation

It is relatively easy to determine the instruction issue rate on a system-wide basis using the cpustat command (with superuser privileges). The following code example shows using cpustat to collect instruction count data once every second for 10 seconds.

$ cpustat -c Instr_cnt,sys 1 10

time cpu event pic1
1.009 22 tick 10298
1.009 23 tick 10744
1.009 24 tick 55105
1.009 11 tick 11154
1.009 5 tick 483468
1.009 26 tick 10731
1.009 21 tick 79061
1.009 16 tick 83529
1.009 18 tick 22184
1.009 2 tick 41845
....

The cpustat output has one line per CPU. It is relatively easy to post-process this output to provide formatted output showing the utilisation of each core as a percentage of the maximum possible issue rate for the core. (The utility psrinfo -v reports the processor type and clock speed information necessary for this calculation.) In fact, the downloadable tool corestat already performs this calculation.
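The kind of post-processing that corestat performs can be sketched as follows. This is an illustrative helper, not the actual tool; the column layout follows the sample output above, and the mapping of four consecutive virtual processors to one core matches the UltraSPARC T1:

```python
def core_utilisation(lines, freq_hz, interval_s=1.0, threads_per_core=4, issue_width=1):
    """Aggregate per-CPU instruction counts from cpustat output lines into
    per-core utilisation, as a percentage of the core's peak issue rate."""
    per_core = {}
    for line in lines:
        fields = line.split()
        # Expect: time cpu event count  (e.g. "1.009 22 tick 10298")
        if len(fields) == 4 and fields[2] == "tick":
            cpu, count = int(fields[1]), int(fields[3])
            core = cpu // threads_per_core   # virtual processors 0-3 -> core 0, etc.
            per_core[core] = per_core.get(core, 0) + count
    peak = freq_hz * issue_width * interval_s  # max instructions per interval
    return {core: 100.0 * total / peak for core, total in per_core.items()}

sample = ["1.009 22 tick 10298", "1.009 23 tick 10744", "1.009 21 tick 79061"]
print(core_utilisation(sample, 1.2e9))  # CPUs 21-23 all map to core 5
```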

Information on the instruction count is useful in determining whether the core is fully saturated, and as such whether there is spare capacity for adding additional threads, or whether it might be possible to improve latency by reducing the number of active threads.

Floating-Point Computation

The UltraSPARC T1 processor has a single floating-point unit shared by the entire chip, and the UltraSPARC T2 processor has one floating-point unit per core. Consequently a similar question can be asked about the utilisation of the floating-point unit. Both processors have a counter that records floating-point instructions. The following code example shows cpustat reporting both the floating-point instruction count and the total instruction count.

$ cpustat -c FP_instr_cnt,sys,Instr_cnt,sys 1 10

time cpu event pic0 pic1
1.011 26 tick 0 10642
1.011 6 tick 4 31433
1.011 12 tick 50 467295
1.011 21 tick 12 33915
1.011 14 tick 0 10304
1.011 23 tick 0 10496
...

Again, it is possible to calculate utilisation for the floating-point units of both processor types.
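As a sketch, that utilisation is the observed floating-point instruction rate divided by the unit's peak rate. The helper and the numbers below are illustrative; the peak rate is processor specific and is an input here, not a claim about either chip:

```python
def fp_utilisation(fp_instr_per_sec, peak_fp_per_sec):
    """Floating-point utilisation as a fraction of the unit's peak rate.
    peak_fp_per_sec is the unit's maximum FP instruction rate, which is
    processor specific (the T1's single shared unit sustains far less
    than one FP instruction per cycle)."""
    return fp_instr_per_sec / peak_fp_per_sec

# Hypothetical figures: 5M FP instructions/s against a peak of 40M/s
print(fp_utilisation(5e6, 40e6))  # 0.125
```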

Stall Budget Utilisation

There is a difference between traditional processors and the UltraSPARC T1 and UltraSPARC T2 processors in the way that they respond to optimisation. On a traditional processor, optimisation of the code typically involves identifying the cause of processor stalls and working to eliminate these stalls. For example, you can add prefetch instructions to reduce cache misses and the corresponding memory stall time.

However, on the UltraSPARC T1 and UltraSPARC T2 processors these stall cycles are used by the other threads that are sharing the core. So the stall cycles can no longer be looked at as places for optimisation. It is still interesting to know about the stall cycles, but removing stall cycles from a single thread will not necessarily lead to a direct improvement in throughput, although it may help the latency of the thread.

To explain this situation, consider that under an ideal load each thread gets to issue an instruction every four cycles. This means that for every instruction, the thread has three cycles during which it cannot issue an instruction. If a thread has more than three cycles of stall for every instruction, it ends up getting less than its fair share of the instruction budget. If a thread has fewer than three cycles of stall per instruction, it does not follow that it will be able to issue instructions any more frequently, because the other three threads might still have instructions to issue.

Another way of looking at this is to imagine that each thread has an instruction budget that corresponds to the number of instructions that it should expect to issue per second, under fair load. Each thread also has a stall budget which corresponds to the number of cycles per second that the thread can be stalled before the stall events will start to reduce the number of instructions executed. This stall budget is three times the instruction budget.
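The two budgets follow directly from the core parameters. A minimal sketch, assuming fair sharing and the clock rates used elsewhere in this article:

```python
def budgets(freq_hz, threads_per_core, issue_width):
    """Per-thread instruction budget under fair sharing, and the matching
    stall budget: the cycles per second a thread can stall before it
    starts losing issue slots."""
    instr_budget = freq_hz * issue_width / threads_per_core       # instructions/s
    stall_budget = (threads_per_core / issue_width - 1) * instr_budget  # cycles/s
    return instr_budget, stall_budget

# UltraSPARC T1 core at 1.2 GHz: 4 threads, single issue
# -> 300M instructions/s per thread, 900M stall cycles/s (three times as many)
instr, stall = budgets(1.2e9, 4, 1)
```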

Various events, such as cache misses, will cause stalls, and consequently use up this budget of stall cycles. Therefore it is of interest to know the large contributors to stall. If the counter counts events rather than cycles, it is necessary to multiply the count of events by the estimated cost per event. These costs are processor specific.

Estimating Stall Budget Usage for the UltraSPARC T1 Processor

Unfortunately, most of the performance counters on the UltraSPARC T1 processor count events rather than cycles. The exception is the store queue counter, which counts the number of cycles in which the store queue is full. However, because of the simplicity of the pipeline, it is possible to estimate the number of cycles lost to the various stalls by multiplying the number of events by an estimate of the cost of the event, as shown in the following table.

Table 1: Estimated Number of Cycles Lost to Stalls

Counter        Comment                                  Cost in Cycles
SB_full        Cycles when store buffer is full         1
FP_instr_cnt   Floating-point instruction count         30
IC_miss        Instruction cache miss                   20
DC_miss        Data cache miss                          20
ITLB_miss      Instruction TLB miss                     100
DTLB_miss      Data TLB miss                            100
L2_imiss       Instruction fetches that miss L2 cache   100
L2_dmiss_ld    Loads that miss L2 cache                 100
Instr_cnt      Instruction count                        1

A simple script can be written that runs an application multiple times and calculates the number of cycles contributed to processor stall by the various events. The results of running this script on an application whose data set is resident in the on-chip second-level cache are shown in the following table.
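The scaling step of such a script can be sketched as follows, using the per-event costs from Table 1. The helper names are illustrative and the sample counts are taken from the run reported in Table 2; this is not the actual tool:

```python
# Estimated cost in cycles per event, from Table 1 (UltraSPARC T1)
T1_COSTS = {
    "SB_full": 1, "FP_instr_cnt": 30, "IC_miss": 20, "DC_miss": 20,
    "ITLB_miss": 100, "DTLB_miss": 100, "L2_imiss": 100, "L2_dmiss_ld": 100,
}

def stall_cycles(raw_counts, costs=T1_COSTS):
    """Scale each raw event count by its estimated per-event cost in cycles."""
    return {name: count * costs[name] for name, count in raw_counts.items()}

def stall_breakdown(raw_counts, total_cycles, costs=T1_COSTS):
    """Percentage of the total runtime attributed to each stall source."""
    scaled = stall_cycles(raw_counts, costs)
    return {name: 100.0 * cycles / total_cycles for name, cycles in scaled.items()}

counts = {"DC_miss": 16_829_760, "L2_dmiss_ld": 771_285}
print(stall_breakdown(counts, 636_000_000))  # DC_miss dominates at about 53%
```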

Table 2: Sample Results of Calculating UltraSPARC T1 Processor Stall Cycles

Comment                                  Cost in Cycles   Raw Count     Scaled Count   Est. Time at 1.2GHz (s)   % of Total Runtime
Cycles when store buffer is full         1                3,626,323     3,626,323      0                         1%
Floating-point instruction count         30               31            930            0                         0%
Instruction cache miss                   20               99,418        1,988,360      0                         0%
Data cache miss                          20               16,829,760    336,595,200    0.28                      53%
Instruction TLB miss                     100              139           13,900         0                         0%
Data TLB miss                            100              25,669        2,566,900      0                         0%
Instruction fetches that miss L2 cache   100              3,661         366,100        0                         0%
Loads that miss L2 cache                 100              771,285       77,128,500     0.06                      12%
Instruction count                        1                87,048,958    87,048,958     0.07                      14%
Cycles                                   1                636,000,000   636,000,000    0.53                      100%

Unsurprisingly, the majority (53%) of the stall time comes from loads that miss the on-chip data cache but are resident in the second-level cache. A small number of loads also miss the second-level cache and contribute significant time because of the additional cost of fetching data from memory. If this were real code, then focusing on improving the cache utilisation and footprint would be the way to improve overall performance.

However, looking at instruction count, it is apparent that the application is using 14% of the instruction budget. An ideal application would use 25%, because there are four threads per core, and each core can issue one instruction per cycle. So although reducing the number of cache misses would improve performance, the maximum performance gain would be from 14% utilisation of the instruction budget to 25% utilisation, nearly doubling performance, assuming that all the threads on the core are active.
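The ceiling on the possible gain can be worked out directly. A sketch, where the helper name is illustrative:

```python
def max_speedup_from_stall_reduction(current_share, threads_per_core=4):
    """Upper bound on the speedup available to one thread on a fully
    threaded core: it can at best grow to its fair share (1/threads)
    of the instruction issue budget."""
    fair_share = 1.0 / threads_per_core
    return fair_share / current_share

# The sample application uses 14% of the issue slots; the ceiling is 25%,
# so removing every stall could at most yield about a 1.79x speedup.
print(max_speedup_from_stall_reduction(0.14))
```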

Placing this in context, on a traditional processor, all the memory stall time could potentially be converted into performance, but on a CMT processor, there is an upper bound imposed by the sharing of cycles between the multiple threads.

Estimating Stall Budget Usage for the UltraSPARC T2 Processor

The UltraSPARC T2 processor has eight threads sharing a core that is capable of issuing two instructions per cycle. This gives the same budget of instructions: each thread should be able to issue an instruction every four cycles, and has three cycles where it is unable to issue an instruction.

The UltraSPARC T2 processor has a different set of performance counters, with new cost multipliers. There is no counter for cycles spent in store queue stalls, and the floating-point instruction counter has a different name. The following table shows the performance counters and their multipliers.

Table 3: Performance Counters for the UltraSPARC T2 Processor

Counter                Comment                                  Cost in Cycles
Instr_FGU_arithmetic   Floating-point instruction count         8
IC_miss                Instruction cache miss                   20
DC_miss                Data cache miss                          20
ITLB_miss              Instruction TLB miss                     100
DTLB_miss              Data TLB miss                            100
L2_imiss               Instruction fetches that miss L2 cache   100
L2_dmiss_ld            Loads that miss L2 cache                 100
Instr_cnt              Instruction count                        1

Results from running the same code on an UltraSPARC T2 processor are shown in the following table.

Table 4: Sample Results of Calculating UltraSPARC T2 Processor Stall Cycles

Comment                                  Cost in Cycles   Raw Count     Scaled Count   Est. Time at 1.4GHz (s)   % of Total Runtime
Floating-point instruction count         8                37            296            0                         0%
Instruction cache miss                   20               39,749        794,980        0                         0%
Data cache miss                          20               16,803,200    336,064,000    0.24                      62%
Instruction TLB miss                     100              29            2,900          0                         0%
Data TLB miss                            100              6,408         640,800        0                         0%
Instruction fetches that miss L2 cache   100              421           42,100         0                         0%
Loads that miss L2 cache                 100              2,062         206,200        0                         0%
Instruction count                        1                86,756,212    86,756,212     0.06                      16%
Cycles                                   1                546,000,000   546,000,000    0.39                      100%

Again, most of the time is spent in load operations on data that is resident in the second-level cache. The application is getting about 16% of the total cycles, which is still less than the theoretical peak of 25% of the total cycles. So reducing the cache misses could improve performance further. One issue that needs to be taken into consideration is that the memory pipe is shared between the eight threads on a core, so the peak performance of the application depends on there being at most one load for every two instructions. Otherwise the application could be limited to issuing one instruction every eight cycles.

Ramifications for Optimising for CMT Processors

The previous discussion shows that applications running on CMT processors have a great tolerance for cycles spent in memory stalls. On more traditional processors, memory stall times can be minimised, which leads directly to performance gains. On a CMT processor, reduction in stall times leads to performance gains only up to the point at which the process consumes 25% of the instruction issue budget. Reductions in stall events beyond that are unlikely to lead to significant performance gains, so efforts to further reduce instruction stalls are wasted work.

For CMT processors, there are three ways of improving performance, shown here in order from the most effective to the least effective:

  • Use more threads. Each additional thread gets a new instruction issue budget. Two threads can potentially do twice the work of a single thread.
  • Reduce instruction count. For UltraSPARC T1 and UltraSPARC T2 processors, each thread gets to issue a single instruction at a time, so the instruction count corresponds directly to the length of time it takes to complete the task. A traditional processor might be able to issue multiple instructions from a single thread in the same cycle, so some of the instructions issued could be obtained for free.
  • Reduce stall time. This might not directly improve performance because stall time on one thread is an opportunity for another thread to do work. When the core is issuing its peak instruction rate there are no possible performance gains from reducing cycles spent on stall events. Of course reducing stall time on one of the threads might enable that thread to get its fair share of the instruction budget, and it might be possible to reduce the latency of one of the threads, but it will not have an impact on the throughput of the system.
Author

Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the U.K. Before joining Sun, Darryl held various software architecture and development roles in the U.K. Read Darryl's blog.
