By Darryl Gove, Compiler Performance Engineering, Sun Microsystems. September 29, 2006
The Binary Instrumentation Tool (BIT) has been released as part of the Cool Tools effort on SPARC systems. These tools are add-ons to Sun Studio 11. BIT can gather data on the number of times basic blocks are executed, or the probability that a given branch is taken.
In order to gather data using BIT, the binary needs to be built with the Sun Studio 11 compilers, using an optimisation level of at least -xO1 and the flag -xbinopt=prepare. This flag tells the compiler to produce an annotated application, suitable for further analysis at a later time. More details on binary optimisation with -xbinopt can be found in a recent article.
A good example of what can be achieved is to run the BIT tool on the test program shown below.
$ more test.c
Here there is an inner loop that will get executed 1000 times. On the first iteration of the loop the variable j will be decremented.
The application is then built as described above; in this case low optimisation is used because at higher levels of optimisation the entire program will be eliminated. The command to build an executable for binary optimisation is:
$ /opt/SUNWspro/bin/cc -O -xbinopt=prepare -o test test.c
The man page for BIT describes the options that are available for the tool. The data most useful for determining whether a training workload is appropriate are the basic block counts and the branch taken probabilities.
The following command will run an instrumented application and output basic block counts and branch probabilities.
$ /opt/SUNWspro/extra/bin/bit collect -R -o bbc.txt -a bbc -o branch.txt -a branch a.out
The basic block count output is as follows:
$ more bbc.txt
It is also possible to get disassembly from BIT which includes information about the execution counts for the assembly language instructions.
$ /opt/SUNWspro/extra/bin/bit analyze -a dis test
Disassembly for routine main
ROUTINE: main FREQUENCY: 1.0
BLOCK: main: FREQUENCY: 1.0 PC: 0x11794
[    1.0] 0x11794: or %g0, #sint=0, %o5
[    1.0] 0x11798: or %g0, #sint=0, %o4
[    1.0] 0x1179c: subcc %o4, %o5, %g0
BLOCK: $LABEL_main_3_3_117a0: FREQUENCY: 1000.0 PC: 0x117a0
[ 1000.0] 0x117a0: br,pn@(ne),%icc $LABEL_main_1_1_117ac
[ 1000.0] 0x117a4: add %o4, #sint=1, %o4
BLOCK: $LABEL_main_2_2_117a8: FREQUENCY: 1.0 PC: 0x117a8
[    1.0] 0x117a8: add %o5, #sint=-1, %o5
BLOCK: $LABEL_main_1_1_117ac: FREQUENCY: 1000.0 PC: 0x117ac
[ 1000.0] 0x117ac: subcc %o4, #sint=999, %g0
[ 1000.0] 0x117b0: br,pt@(le),%icc $LABEL_main_3_3_117a0
[ 1000.0] 0x117b4: subcc %o4, %o5, %g0
BLOCK: $LABEL_main_4_4_117b8: FREQUENCY: 1.0 PC: 0x117b8
[    1.0] 0x117b8: jmpl [%o7, #sint=8], %g0
[    1.0] 0x117bc: nop
It is possible to see from the disassembly how the routine is structured, with three basic blocks in the loop, one of which is executed only when the variables i and j have the same value.
There are two branches in the code. The first (at 0x117a0) branches around the j-- statement when i and j are not equal; this branch will be taken 999 times out of 1000. The second (at 0x117b0) is the branch back to the top of the loop; it will also be taken 999 times out of 1000.
The report on branch probabilities confirms this; the report shown here has been trimmed to reduce the number of columns.
$ more branch.txt
Branch taken/not taken report for whole program:
PC     Trip Cnt  Taken  Not Taken  Instruction
117a0  1000      999    1          br,pn@(ne),%icc $LABEL_main_1_1_117ac
117b0  1000      999    1          br,pt@(le),%icc $LABEL_main_3_3_117a0
...
The methodology proposed here is to gather both basic block count and branch probability data for the training and reference workloads. If there are multiple training workloads, the aggregate data for all of them should be gathered. The following is a step-by-step walkthrough of the process of gathering the data for either the training or the reference workload. The middle step, where the application is run, can be repeated multiple times if there are multiple training (or reference) workloads.
$ /opt/SUNWspro/extra/bin/bit instrument -R test
For a workload to adequately train an application for a reference workload, it should at least execute the same parts of the code that the reference workload does. If the training workload does not execute the same code as the reference workload, then at best it is leaving some opportunities for performance gain behind; at worst it is misleading the compiler by indicating that an important part of the code for the reference workload is not important.
Coverage can be estimated from the basic block counts. If a basic block is executed at least once, it has been covered. It is not possible to draw strong conclusions from the magnitude of the count attributed to a basic block -- the magnitude can be a function of the duration of the run, or even a fixed count which depends on the size of a data structure.
It is possible to calculate the coverage as a single value. Assume that T_i is the number of times that basic block i is executed during the run with the training workload, and similarly that R_i is the number of times that basic block i is executed during the run with the reference workload. The coverage can then be defined as:

    Coverage = 100% x ( sum of R_i over the blocks i that the training workload executes ) / ( sum of R_i over all blocks i )
The coverage ranges from 0 to 100% (when all of the basic blocks that are critical to the reference workload are covered by the training workload).
Whilst coverage is a minimum criterion for accepting a training workload, good coverage does not necessarily mean that the training workload is ideal. However, the coverage information for a particular pair of training and reference workloads can also be examined visually.
One of the problems with basic block count data is that the counts depend on the runtime of the application. So a longer running workload will have higher counts. Therefore it does not make sense to try and compare the absolute count numbers. To present the data visually, the basic blocks are ordered according to their counts, basic blocks with higher counts are plotted further from the origin. For each basic block the ordering from the reference workload is used to determine position on the x-axis, and the ordering from the training workload is used to determine position on the y-axis. As a further refinement the size of the marker used to plot each point depends on the frequency of execution (relative to the most frequently executed basic block) of the reference workload.
In an ideal situation, the most frequently executed basic blocks in the training workload will also be the most frequently executed basic blocks in the reference workload. As a graph, this should look something like a line of points going up at 45° with small markers near the origin, and large markers at the top right. This shape is probably best described as a 'lolly-pop' shape.
Basic blocks that are frequently executed by the reference workload, but not adequately covered by the training workload will appear as large markers below the diagonal line.
The paper presented at the SPEC workshop had results for all of the CPU2000 suite. In this paper, we'll just look at an example of good behaviour, and an example of bad behaviour.
The benchmark 300.twolf got a coverage of 100% using this methodology. The coverage graph for this benchmark is shown in figure 1.
Figure 1: Basic block coverage for benchmark 300.twolf
Figure 2: Basic block coverage for benchmark 301.apsi
However, the benchmark 301.apsi got a coverage of only 37%. The coverage plot for 301.apsi is shown in figure 2. The plot readily identifies the reason for the very poor coverage result: the training workload does not exercise a large part of the code that is frequently executed by the reference workload.
The next step beyond coverage of the frequently executed basic blocks is to determine whether the branches in the code behave in the same way in both workloads. At the simplest level, a branch can be usually taken or usually untaken. If a given branch is usually taken (or usually untaken) in both the reference and training workloads, then the training workload is appropriate.
A branch is declared to be usually taken if it is taken more than half of the number of times that it is encountered in the instruction stream; otherwise it is declared to be usually untaken. It is possible to calculate a single value, which we will call a Correspondence Value, which denotes how well a training workload represents the branch behaviour of the reference workload. The formulation of the Correspondence Value is very similar to that of the coverage calculated above. Assume that the execution frequency of branch instruction i in the reference workload is denoted F_i. Assume that R_i is given the value 1 if branch instruction i is usually taken in the reference workload, and 0 otherwise. Similarly, assume that T_i is given the value 1 if branch instruction i is usually taken in the training workload, and 0 otherwise. The Correspondence Value can then be calculated as:

    Correspondence Value = 100% x ( sum of F_i over the branches i where R_i = T_i ) / ( sum of F_i over all branches i )
The Correspondence Value ranges from 0% to 100%; a value of 100% means that all branches behave the same way in both the training and reference workloads.
A high Correspondence Value indicates a strong agreement between the behaviour of the branches in the training and reference workloads. However, a low agreement does not necessarily mean that the training workload is inappropriate for the reference workload. There are two situations which may lead to a lower than expected Correspondence Value. The first situation is where the branches are taken about half the time, and the branches in the reference and training workloads happen to fall on different sides of the half-way mark. The second situation is where the branch behaviour of the application is unpredictable. It is possible to better distinguish the branch behaviours by plotting the branches graphically.
Branch probability data is always bounded between 0 (never taken) and 1 (always taken). This makes it simple to plot each branch instruction on a chart, where the location on the x-axis is determined by the probability of the branch being taken in the reference workload, and the location on the y-axis by the probability of the branch being taken in the training workload. To convey the importance of the various branch instructions, the size of the marker depends on the frequency of execution of the branch instruction in the reference workload (as a proportion of the execution frequency of the most frequently executed branch).
In an ideal version of this graph, the branches would all lie on the diagonal line running from the origin, at (0,0), to the top right-hand corner, at (1,1). Most workloads will not conform to this ideal. A more general template is to imagine the graph split into four quadrants. The upper-right and lower-left quadrants indicate that the branch behaviour is similar in the reference and training workloads; branches that appear in the upper-left or lower-right quadrants are being mistrained.
Workloads which have a level of uncertainty (or randomness) about branch behaviour will typically be indicated by a smear of branches around the diagonal.
The first benchmark to examine is 300.twolf, which has a Correspondence Value of 100% and shows good correspondence between the branch behaviour in the training and reference workloads. In figure 3, it is apparent that all the branches that are frequently executed by the reference workload appear in either the upper-right or lower-left quadrant, indicating good agreement between the training and reference workloads.
Figure 3: Branch probabilities for 300.twolf
Figure 4: Branch probabilities for 301.apsi
Once again, the benchmark 301.apsi scores a low Correspondence Value of 72%. Figure 4 shows this in a graphical format. It is apparent that one reason for the low score is that there are a number of branches which have a probability of being taken in the training workload of about 50%, but are more strongly taken in the reference workload. Another reason for the low score is that there are several branches which are usually taken in the reference workload, but usually untaken in the training workload.
This paper has presented several methodologies for determining if a particular training workload is appropriate for a particular reference workload. The same approach can be extended to determine if the behaviour of an application is similar over a range of reference workloads.
The cumulative effect is that given a situation where building an application with profile feedback leads to a faster runtime for a given reference workload, it is possible to look at the behaviour of the branches and basic blocks in an application and determine whether the training workload is appropriate for the range of reference workloads that the application actually has to deal with. If, as a result of this analysis, it is determined that the training workloads do not adequately train for a particular reference workload, then it is possible to identify additional training workloads to be used.
Two Perl scripts were written to support this analysis:
blcompare.pl which compares two basic block data files and determines the Coverage of the reference workload by the training workload.
brcompare.pl which compares two branch probability files and calculates the Correspondence Value.
Both scripts will generate the appropriate graphs if they are able to locate gnuplot on the path.