By Darryl Gove, Chris Aoki, Sun Microsystems, September 2005
When an application is compiled, the compiler does its best to select the optimal instructions and the optimal layout for the code. It has to make its decisions based on the source code, but the source code contains no information about the dynamic behaviour of the program, so the compiler has to use heuristics to provide a best guess.
The heuristics are used to determine how to structure the code, which routines should be inlined, which bits of code are executed frequently, and many other details.
An example of one of the problems that the compiler faces is the code shown in Example 1:
if (condition) {
  // Code for A
} else {
  // Code for B
}
//do more work
Code Example 1 - An if statement with two possible paths
In the code shown in Example 1 the compiler has an interesting decision to make; there are different ways of structuring the code. Should the compiler make A or B the default (and therefore make that path faster), or should it structure the code so that both branches have equal performance?
The question of how best to arrange if statements is one of the many decisions affecting code layout that the compiler has to make; other examples include which routines to inline and how to place the frequently executed parts of the code.
Most of these decisions can only be governed by heuristics, since the compiler has no information about what happens to the program at runtime. However, there is a mechanism, called profile feedback, which enables the compiler to gather that runtime information from a run on a representative 'training' dataset.
Using profile feedback on the benchmark suite SPEC CPU2000 leads to an average of about 7% performance gain for the floating point suite and a 16% performance gain for the integer suite. For individual codes within the suite, the performance gains vary from no gain for some codes, to significant gains for others.
The idea with profile feedback is to run the program for a short time and gather data about what happens to the program during that run. The compiler then uses that data to refine its optimisation decisions.
The process of using profile feedback is:
1. Compile the application with the flag -xprofile=collect to produce an instrumented binary.
2. Run the instrumented binary on a representative training workload, which writes out profile data.
3. Recompile the application with the flag -xprofile=use so that the compiler can use the profile data.
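Assuming an application built from a single source file app.c (a hypothetical name used here only for illustration), the process might look like:

```shell
$ cc -O -xprofile=collect:./app -o app app.c   # instrumented build
$ ./app                                        # training run; writes ./app.profile
$ cc -O -xprofile=use:./app -o app app.c       # rebuild using the profile data
```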
This means that, with profile feedback, the build process takes about twice as long as building without it. This is because the build involves two passes through the compiler, plus a short run of the application. It is therefore important that the gains in performance seen at runtime are worth the extra build complexity.
Building with profile feedback requires that a 'representative' training workload is used to inform the compiler about the runtime behaviour of the application. The key points about this workload are that it should exercise the application in the same way that real workloads do, and that it need only be a short run - just long enough to capture the application's typical behaviour.
A concern that is sometimes raised is whether using the wrong training workload might lead to worse performance in some cases. This is possible, but it typically comes about for one of two reasons: either the training workload did not exercise the code that matters to the problem workload, or it exercised that code in a different way.
In both cases it may be that adding another training workload will improve the performance for the problem workload. It is also worth looking at the code coverage or time spent in the various routines so that the reason for the difference in performance can be identified. It is rare that training for one workload will force another workload to run slower. It is more likely that the training data has indicated to the compiler that a particular optimisation is unnecessary, and using additional training data which provides evidence that the optimisation is necessary will improve performance for the problem workload whilst not impacting performance for other workloads.
The more information the compiler has, the better job it can do of optimising the application. As with all optimisations, some codes will benefit greatly while others will see no gain; it is strongly dependent on the type of code.
The type of code that is likely to benefit from profile feedback is code which has a large number of conditional statements (if statements). The largest benefit will be seen by codes whose behaviour is very predictable, but not in a way that is obvious to the compiler.
A simple example of this kind of code is where there are checks for correct values. The compiler cannot easily determine whether the programmer expects the checks to pass or fail, so it will typically make the null assumption that passing and failing are equally likely. However, if the test is for 'valid data', and most of the time the values in the program are valid, then profile feedback will enable the compiler to identify this and optimise the code appropriately.
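A minimal sketch of such a check, with a hypothetical routine name: the NULL test almost never fails at runtime, but without profile data the compiler must treat both branches as equally likely.

```c
#include <stddef.h>

/* Hypothetical validity check: returns the length of s, or -1 when s
 * is NULL. The error branch is almost never taken in practice, but
 * the compiler cannot know that from the source alone. */
int checked_length(const char *s)
{
    if (s == NULL) {        /* rare error path */
        return -1;
    }
    int n = 0;              /* common path: count the characters */
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

With profile feedback, the compiler can lay out the common path as the straight-line fall-through case.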
Another situation where profile feedback can lead to performance gains is when the profile can be used to select the best set of routines for the compiler to inline. There are two benefits from inlining: the first is eliminating the cost of the call to the routine; the second is exposing further opportunities for optimisation. The downside of inlining is that it can lead to an increase in code size; if the inlined code turns out not to be useful, then this increase in code size may actually reduce performance. Profile feedback enables the compiler to correctly select the routines which are frequently called and are therefore candidates for inlining, whilst rejecting routines which are rarely called.
The flag that tells the compiler to either build the application and collect a profile, or build the application and use an existing profile is -xprofile. The use of the flag has some subtleties which require a bit more explanation.
-xprofile=collect:myapp will place the profile data in a directory called myapp.profile in the current directory at the time that the program is executed. Similarly -xprofile=collect:/tmp/myapp will place the profile data in the directory /tmp/myapp.profile. If the location is not specified, the profile is placed in the directory <prog>.profile, where <prog> is the name of the executable at the time it is run.
-xprofile=use:/tmp/myapp will use the profile data located in /tmp/myapp.profile. If no location is specified, then the compiler will look for data in a.out.profile in the current directory. Notice that this is different behaviour from the -xprofile=collect phase. The reason for the difference is that the profile collector can determine the name of the executable when collecting the data, but when the compiler is building the new application using profile data, it does not know the name of the application that was used to generate the profile.
Note: It is a good practice to always specify the full location of the profile data when building the executable.
When the application is compiled with -xprofile=collect, to collect profile information, the binary is produced with a lower level of optimisation than would otherwise occur – this is so that the data gathered is more detailed than the data that would be gathered using an optimised binary. The instrumented binary produced will have a particular layout of the code depending on both the source code, and the flags used to build it. If the flags are changed, the layout of the code may change.
Note: Apart from the arguments to -xprofile, it is best to specify the same flags for both the collection and use phases.
When the executable is run, the profile data is written into the file system. The write takes place at the end of the run, so if the application fails to run to completion, then there may well be no profile data written. If the application is run multiple times, then the profile data accumulates the results from all the runs.
If the source code is modified, it is not a good idea to reuse old profile data. The compiler may not complain or report an error, but it is unlikely to be making the optimal decisions.
Note: It is a good practice to remove the old profile data whenever a new -xprofile=collect binary is built, and for new profile data to be collected every time the source is changed.
There are several compiler options which use profile feedback information:
At optimisation level -xO5, profile feedback enables the compiler to generate speculative instructions in frequently executed regions of code. In the absence of profile feedback, speculative instructions are still generated at -xO5, but much more sparingly.
The compiler flags -xipo and -xcrossfile perform crossfile optimisation - meaning optimisations that span multiple source files. One example of this kind of optimisation is inlining a routine from one source file into code from another source file. In the presence of profile feedback, the compiler has a much better model of the set of routines that are worth inlining.
The compiler flag -xlinkopt causes the compiler to perform link-time optimisation. This final phase of compilation uses all the knowledge of the generated code in order to do some final tweaking of the code layout. This is useful for large applications, where performance can be gained by keeping all the frequently executed code together.
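As a sketch, again assuming a single-file application app.c (a hypothetical name), these flags are combined with profile feedback in both phases - the same flags, apart from the -xprofile argument, should be used for collection and use:

```shell
$ cc -xO5 -xipo -xlinkopt -xprofile=collect:./app -o app app.c
$ ./app                   # training run writes ./app.profile
$ cc -xO5 -xipo -xlinkopt -xprofile=use:./app -o app app.c
```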
The code shown in Example 2 has opportunities for improvement to code layout from profile feedback. From inspection of the code it is obvious that the time is spent calling function f. This function sums up the six values passed into it, but before performing the sum it checks that each of the pointers to the values is valid. In the example, all the values are valid, and for most checks of this kind found in programs, it is usual for the data to be valid. However, the compiler cannot identify that the tests will usually pass, so it has to assume that the two outcomes of each if statement are equally likely.
Code Example 2 - Demonstrating performance gains with profile feedback
Example 3 shows the results of compiling and running this program without profile feedback.
$ cc -O -o example example.c
Code Example 3 - Compiling and running without profile feedback
Example 4 shows the process of compiling this code with profile feedback; notice that the training run of the program uses far fewer iterations of the main loop.
$ cc -O -xprofile=collect:./example -o example example.c
Code Example 4 - Compiling and running with profile feedback
The 10 second difference in runtime between the two codes represents about a 25% improvement. Obviously this particular example has been put together to demonstrate profile feedback optimisations, but the principles that it shows appear in most codes.
Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.
Chris Aoki is an engineer in Sun's SPARC compiler backend team. He has worked on code generation and optimization in several generations of compiler technology at Sun. His current projects primarily involve compiler and runtime support for feedback based optimization.