Use Profile Feedback To Improve Performance

   
By Darryl Gove, Chris Aoki, Sun Microsystems, September, 2005  
Profile feedback is a useful mechanism for providing the compiler with information about how a code behaves at runtime. Having this information can lead to significant improvements in the performance of the application. As with all optimisations, it is only worth using profile feedback if it does produce a gain in performance.

Some degree of care is required in selecting representative workloads for providing training data to the compiler. The representativeness of the workloads can be examined by comparing the profiles gathered by tools such as tcov or the Performance Analyzer.

Introduction

When an application is compiled, the compiler does its best to select the optimal instructions to use and the optimal layout for the code. It has to make decisions based on the source code alone, but the source code contains no information about the dynamic behaviour of the program, so the compiler has to use heuristics to provide a best guess.

The heuristics are used to determine how to structure the code, which routines should be inlined, which bits of code are executed frequently, and many other details.

An example of one of the problems that the compiler faces is the code shown in Example 1:

 

void calculate(...)
{
  if (...some condition...)
  {
    // A: do calculation
    ...
  }
  else
  {
    // B: do calculation
    ...
  }
  // do more work
  ...
}
     

Code Example 1

In the code shown in Example 1 the compiler has an interesting decision to make, since there are different ways of structuring the code. Should the compiler make A or B the default (and therefore make that path faster), or should it structure the code so that both branches have equal performance?

The question of how best to arrange if statements is one of the many decisions affecting code layout that the compiler has to make. Examples of other decisions are:

  • Is a routine executed sufficiently often that inlining it will improve performance?
  • Can the code be laid out in memory so it uses the instruction cache more effectively?
  • Are there loops that are iterated over sufficient numbers of times that it is worth unrolling them?

Most of these decisions can only be governed by heuristics since the compiler has no information about what happens to the program at runtime. However, there is a mechanism, called profile feedback, which enables the compiler to gather information about what happens to the program at runtime using data from a run of a representative 'training' dataset.

Using profile feedback on the benchmark suite SPEC CPU2000 leads to an average of about 7% performance gain for the floating point suite and a 16% performance gain for the integer suite. For individual codes within the suite, the performance gains vary from no gain for some codes, to significant gains for others.


Building with profile feedback

The idea with profile feedback is to run the program for a short time and gather data about what happens to the program during that run. The compiler then uses that data to refine its optimisation decisions.

The process of using profile feedback is:

  1. The binary is built with the flag -xprofile=collect. This flag produces a special version of the application (called an instrumented binary) which, when run, gathers data about its own execution.
  2. The application is then run with a 'training' workload. A training workload is representative of the real work that the application will do, but does not need to last as long as a real workload.
  3. The binary is rebuilt using the flag -xprofile=use. This flag makes the compiler use the previously collected data to optimise the binary.

This means that a build using profile feedback takes about twice as long as a build without it, because it involves two passes through the compiler plus a short run of the application. It is therefore important that the gains in performance seen at runtime are worth the extra build complexity.


Selecting a representative workload

Building with profile feedback requires that a 'representative' training workload is used to inform the compiler about the runtime behaviour of the application. The key points about this workload are:

  • It should take little time to run. Running for a long time does not necessarily improve the data being fed to the compiler.
  • The workload should exercise all the critical parts of the application. Tools such as the Sun Studio Performance Analyzer or tcov(1) can be used to assess whether the training workload covers the critical sections of code, and whether the profile of the training workload is similar to that of the real workload.
  • Several training workloads can be used if this improves the coverage of the code.

A concern that is sometimes raised is whether using the wrong training workload might lead to worse performance for some cases. This is possible, but it typically comes about for one of two reasons:

  • The training workload did not cover the entire application. The problem code happens to use part of the application for which the compiler had inadequate information.
  • The behaviour of the problem workload is significantly different from the training workload.

In both cases it may be that adding another training workload will improve the performance for the problem workload. It is also worth looking at the code coverage or time spent in the various routines so that the reason for the difference in performance can be identified. It is rare that training for one workload will force another workload to run slower. It is more likely that the training data has indicated to the compiler that a particular optimisation is unnecessary, and using additional training data which provides evidence that the optimisation is necessary will improve performance for the problem workload whilst not impacting performance for other workloads.


The benefits of profile feedback

The more information that the compiler has, the better job it can do at optimising the application. As with all optimisations, some codes will benefit greatly, while other codes will see no gains. The result is strongly dependent on the type of code.

The type of code that is likely to benefit from profile feedback is code which has a large number of conditional statements (if statements). The largest benefit is seen in codes whose runtime behaviour is very predictable, but not obvious to the compiler from the source.

A simple example of this kind of code is where there are checks for correct values. The compiler cannot easily determine whether the programmer expects the checks to pass or fail, so it will typically make the null assumption that passing and failing are equally likely. However, if the test is for 'valid data', and most of the time the data is indeed valid, then profile feedback enables the compiler to identify this and optimise the code appropriately.

Another situation where profile feedback can lead to performance gains is when the profile can be used to select the best set of routines for the compiler to inline. There are two benefits from inlining: the first is to eliminate the cost of the call to the routine, and the second is to expose further opportunities for optimisation. The downside of inlining is that it can lead to an increase in code size; if the inlined code turns out not to be useful, then this increase in code size may actually reduce performance. Profile feedback enables the compiler to correctly select the routines which are frequently called and are therefore candidates for inlining, whilst rejecting routines which are rarely called.


The profile feedback compiler flags

The flag that tells the compiler to either build the application and collect a profile, or build the application and use an existing profile is -xprofile. The use of the flag has some subtleties which require a bit more explanation.

  • -xprofile=collect can take an optional parameter which tells the compiler where to place the profile information. For example:

    -xprofile=collect:myapp will place the profile data in a directory called myapp.profile in the current directory at the time that the program is executed. Similarly, -xprofile=collect:/tmp/myapp will place the profile data in the directory /tmp/myapp.profile. If the location is not specified, then the profile is placed in the directory <prog>.profile, where <prog> is the name of the executable at the time the executable is run.

  • -xprofile=use can also take an optional parameter telling the compiler where the profile data is located.

    -xprofile=use:/tmp/myapp will use the profile data located in /tmp/myapp.profile. If no location is specified, then the compiler will look for data in a.out.profile in the current directory. Notice that this is different behaviour from the -xprofile=collect phase. The reason for the difference is that the profile collector can determine the name of the executable when collecting the data, but when the compiler is building the new application using profile data, it does not know the name of the executable that was used to generate the profile.

    Note: It is a good practice to always specify the full location of the profile data when building the executable.


Specifying other compiler flags with profile feedback

When the application is compiled with -xprofile=collect, to collect profile information, the binary is produced with a lower level of optimisation than would otherwise occur – this is so that the data gathered is more detailed than the data that would be gathered using an optimised binary. The instrumented binary produced will have a particular layout of the code depending on both the source code, and the flags used to build it. If the flags are changed, the layout of the code may change.

Note: Apart from the arguments to -xprofile, it is best to specify the same flags for both the collection and use phases.


Running the executable to collect profile information

When the executable is run, the profile data is written into the file system. The write takes place at the end of the run, so if the application fails to run to completion, then there may well be no profile data written. If the application is run multiple times, then the profile data accumulates the results from all the runs.

If the source code is modified, it is not a good idea to reuse old profile data. It may happen that the compiler does not complain or report an error, but it is unlikely that the compiler is taking the optimal decisions.

Note: It is a good practice to remove the old profile data whenever a new -xprofile=collect binary is built, and for new profile data to be collected every time the source is changed.


Compiler options that use data collected by profile feedback

There are several compiler options which use profile feedback information:

  • At optimisation level -xO5, profile feedback enables the compiler to generate speculative instructions in frequently executed regions of code. In the absence of profile feedback, speculative instructions are still generated at -xO5, but much more sparingly.

  • The compiler flags -xipo and -xcrossfile perform crossfile optimisation, meaning optimisations that work across multiple source files. One example of this kind of optimisation is inlining a routine from one source file into code from another source file. In the presence of profile feedback, the compiler has a much better model of the set of routines that are worth inlining.

  • The compiler flag -xlinkopt causes the compiler to perform link time optimisation. This final phase of compilation uses all the knowledge of the generated code in order to do some final tweaking of the code layout. This is useful for large codes where performance can be gained by laying out the code to keep all the frequently executed code together.


Example code using profile feedback

The code shown in Example 2 has opportunities for improving code layout with profile feedback. From inspection of the code it is obvious that the time is spent calling function f. This function sums the six values passed into it, but before performing the sum it checks that each of the pointers to the values is valid. In the example, all the values are valid, and for most checks of this kind found in programs it is usual for the data to be valid. However, the compiler cannot tell that the tests will usually pass, so it has to assume that the two outcomes of each if statement are equally likely.

#include <stdio.h>
#include <stdlib.h>

static unsigned f( unsigned *a0, unsigned *a1, unsigned *a2,
                   unsigned *a3, unsigned *a4, unsigned *a5)
{
  unsigned result = 0;
  if (a0 == NULL) {printf("a0 == NULL");} else {result += (*a0);}
  if (a1 == NULL) {printf("a1 == NULL");} else {result += (*a1);}
  if (a2 == NULL) {printf("a2 == NULL");} else {result += (*a2);}
  if (a3 == NULL) {printf("a3 == NULL");} else {result += (*a3);}
  if (a4 == NULL) {printf("a4 == NULL");} else {result += (*a4);}
  if (a5 == NULL) {printf("a5 == NULL");} else {result += (*a5);}
  return result;
}

int main(int argc, const char *argv[])
{
  int i, j, niters, n = 6;
  unsigned sum = 0, answer = 0, a[6];

  niters = 1000000000;
  if (argc == 2) { niters = atoi(argv[1]); }

  for (j = 0; j < n; j++)
  {
    a[j] = rand();
    answer += a[j];
  }

  for (i = 0; i < niters; i++) { sum = f(a+0, a+1, a+2, a+3, a+4, a+5); }

  if (sum == answer) { printf("answer = %u\n", answer); }
  else { printf("error sum=%u, answer=%u\n", sum, answer); }
  return 0;
}

Code Example 2 - Demonstrating performance gains with profile feedback

Example 3 shows the results of compiling and running this program without profile feedback.

$ cc -O -o example example.c
$ timex example 1000000000
answer = 86902

real 43.87
user 43.28
sys 0.00

Code Example 3 - Compiling and running without profile feedback

Example 4 shows the process of compiling this code with profile feedback; notice that the training run of the program uses far fewer iterations of the main loop.

 

$ cc -O -xprofile=collect:./example -o example example.c
$ example 100
answer = 86902

$ cc -O -xprofile=use:./example -o example example.c
$ timex example 1000000000
answer = 86902

real 34.52
user 33.93
sys 0.01
                    

Code Example 4 - Compiling and running with profile feedback

The difference in runtime of roughly nine seconds between the two binaries represents an improvement of just over 20%. Obviously this particular example has been put together to demonstrate profile feedback optimisations, but the principles that it shows appear in most codes.


About the Authors
Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.
Chris Aoki is an engineer in Sun's SPARC compiler backend team. He has worked on code generation and optimization in several generations of compiler technology at Sun. His current projects primarily involve compiler and runtime support for feedback based optimization.

(Page last updated September 7, 2005)