Building Enterprise Applications with Sun Studio Profile Feedback

 
By Giri Mandalika, Sun Microsystems, April 11, 2006 (updated June 2007)  

Large, CPU-intensive applications may perform better when built with profile feedback. Profile feedback optimization requires that the application be built twice: once to collect the profile data, and again to use that profile to generate optimal code. This requirement may prevent many software vendors from building their applications with profile feedback, because generating a profile for each software release can be impractical. However, it is possible to reuse old profiles to minimize the overhead of profile feedback builds in a development environment without compromising the advantages of feedback-directed optimization.

This article introduces all the stages of profile feedback with examples, and offers some tips for making profile feedback builds feasible.

0. Introduction

In general, compilers generate object code based on predefined heuristics and the optimization flags supplied during compilation. However, since compilers cannot predict the dynamic behavior of the code, they have to rely on heuristics for the best possible guess; hence the generated code may or may not perform well with typical workloads.

Processor stalls are one of the problems that can occur in large applications with a very large number of instructions. Since the processor cannot hold all the instructions on chip at any given time, it has to wait while some instructions are fetched from memory. So it is up to the developer to lay out the high-level instructions carefully to reduce processor stalls. As developers are usually not the end users, it is cumbersome for them to gather the application's run-time data, identify the hot code (where the application spends most of its time), and rewrite or rearrange blocks of code to improve run-time performance. Programmers can be relieved of such tasks by using the feedback based optimization technique supported by the Sun Studio compilers. When the run-time behavior of the code is available in the form of a profile, the compiler can lay out the object code so that the on-chip (Level-1 or L1) cache and the memory are used efficiently at run time.

Note that the Sun Studio C, C++ & Fortran compilers have the ability to generate optimal code using profile feedback data. Even though the examples in this article were written in C, the execution methodology is the same for all applications, regardless of the high-level language used to develop them. Also, the steps outlined in this article can be used to build any kind of application, and not just enterprise applications as the title suggests.

1. Feedback-Based Optimization

In some situations, the desired code improvements may not be achieved directly with the compiler's classical optimization flags. For example, a hot routine may not be auto-inlined by the compiler at optimization level 4 (-O4) or higher, if its inclusion violates the threshold heuristic defined in the compiler. In this case, using profile feedback data may help inline the hot code.

Feedback based optimization (FBO) is the term used to describe any technique that alters a program based on information gathered at run time. It is also widely known as feedback directed optimization (FDO) and profile feedback optimization (PFO). The idea behind this technique is to supply the compiler with information about the run-time behavior of the program. The compiler first instruments the object code; the instrumented program is then run to produce a profile, and the compiler uses this profile data to generate optimal code that runs faster.

When the profile data is available, the compiler's front end reads the execution counts of each block from the profile feedback file, and attaches them to the program's intermediate representation (IR). This happens before any optimization begins. Many compiler optimizations subsequently use the execution counts from the IR. Based on the profile data, the compiler can perform optimizations of the following types:

  • Code Layout: Arrange code in a way that the frequently executed code in a routine is grouped together. Its goal is to reduce instruction cache (I$) misses, and to improve instruction fetch by using profile information to guide the layout of code in memory. The article Improving Code Layout Can Improve Application Performance explains code reordering using profile feedback.

  • Inlining: Inline routines that are frequently called. Rarely executed functions may not be inlined even if they are eligible for auto-inlining. Inlining eliminates the cost of the call to the routine; and exposes further opportunities for optimization.

  • Register allocation: utilizes the block counts from profile data, to determine register spills.

  • Loop transformations: Loop invariant code motion, loop fusion, loop interchange, loop peeling etc.

  • Branch optimizations: Tail splitting, branch interchange, loop unswitching etc.

  • Block straightening or outlining moves infrequently executed blocks to separate sections.

  • Switch case code generation: Most frequent cases tested early, to avoid branching.

  • Global instruction scheduling: groups instructions with no dependencies together to avoid processor stalls in the pipeline.

  • Delay slot scheduling: for processors with branch/call delay slots, such as SPARC, an instruction is said to require a delay slot if some instructions following the instruction are executed as if they were located before it. With profile data, the scheduler can reduce branch penalties by using instructions from more frequent blocks for normal branches.

  • sethi hoisting: hoists sethi instructions to reduce register pressure.

  • Copy loop detection: uses profile data to select frequently executed copy loops.

  • Branch prediction: using profile data, the compiler can minimize pipeline stalls for processors that support branch prediction (such as SPARC) by setting the branch prediction bit in the opcodes for conditional branch instructions.

The typical steps involved in using the profile feedback mechanism are as follows:

  • Compile for profile data collection

  • Training run/profile feedback data collection run

  • Re-build the application with profile feedback

You can find a quick introduction to these steps in Use Profile Feedback To Improve Performance. The following sections detail these steps with examples.

1.1. Compile for profile data collection

Build the application with the -xprofile=collect compiler option. In this step, the object code is instrumented to gather profile data; that is, counters are inserted into the object code to record the number of times each piece of code is executed. Instrumented objects can also be referred to as profiled objects. Instrumented code runs slower than non-instrumented code. Use instrumented code only to collect profile data.

When the instrumented binaries are run, the application may appear unchanged to the end user, but profile data is collected as a side effect of execution. This data will be used by the compiler in the use phase of FBO, to generate highly optimized binaries.

FBO requires the code to be compiled at optimization level 2 or above. If no optimization level is specified on the compile line, the compiler uses level 2 optimization (that is, -O2) by default. The compiler may suppress certain optimizations with the -xprofile=collect option, to record accurate information about the run-time behavior of the code. Nevertheless, it is recommended to specify exactly the same compiler flags, except for the value of -xprofile, in both phases of feedback based optimization.

The following example shows the steps involved in generating instrumented binaries with the Sun Studio C compiler.

% cat bubblesort.c
#include <stdio.h>
#include <stdlib.h>

#define COUNT 50

void swap (int *Array, int i, int j)
{
        int temp;

        temp = Array[i];
        Array[i] = Array[j];
        Array[j] = temp;
}

void bubblesort(int *Array, int length)
{
        int i, j;

        for (i = 0; i < length; ++i)
        {
                for (j = (i + 1); j < length; ++j)
                {
                        if (Array[i] > Array[j])
                        {
                                swap (Array, i, j);
                        }
                }
        }
}

int main()
{
        int i, *Array;

        Array = (int *) malloc (sizeof (int) * COUNT);

        for (i = COUNT; i > 0; --i)
                Array[COUNT - i] = i;

        bubblesort(Array, COUNT);

        for (i = 0; i < COUNT; ++i)
                printf("\nArray[%d] = %d", i, Array[i]);

        return (0);
}
                

To enable profile data collection, compile this code with -xprofile=collect and -xO2 options.

% cc -o bubblesort -xO2 -xprofile=collect bubblesort.c


Use -xprofile_ircache[=path] with the -xprofile=collect|use option, to improve compilation time during the use phase by reusing compilation data (Intermediate Representation or IR) saved from the collect phase. Be aware that the saved data could increase disk space requirements considerably.

See the Options Reference chapter in the Sun Studio 12 C User's Guide for details.

1.2. Training run/profile feedback data collection run

Run the instrumented binary (that is, the binary compiled with -xprofile=collect), with one or more representative workloads. If the workload is representative, then the branches that are normally taken in the training run are normally taken in the real workload.

In general, if your program is only ever run with a single input file, running it with that input file is enough to collect good profile data. However, if you are creating a general purpose application that can receive a variety of inputs which exercise different parts of the program, you should choose several kinds of representative sample inputs. Using only certain kinds of input biases the compiler toward the executed paths of the program at the expense of the non-executed paths. So it is important to find one training workload, or a combination of them, that gives the best possible results in almost all scenarios.

In this phase, the compiler-instrumented code collects the branch frequencies for all branches, and the counts for all basic blocks. As a side effect of the execution, a directory named after the program is created, with a .profile extension. The feedbin file under the <program>.profile directory holds the execution frequencies of the various blocks, for later use by the optimizer when the source code is compiled again with the -xprofile=use option. The feedbin file can be referred to as the profile feedback file.

The profile data collection is additive. That is, if you run the profiled executable more than once, with similar or different inputs, the data from the most recent run is added to the data collected from previous runs. Therefore, the profile data is an aggregate of all your runs with the profiled executable.

But you do need to be careful here. If you have profile data from earlier training runs, and you recompile the program with -xprofile=collect and re-run it, the compiler-instrumented code that writes out the profile data will detect it as a different program, and overwrite the old data.

By default, the <program>.profile directory will be created in the same directory, from where the executable is being run. If you wish to change the directory in which the profile data resides, you can use the SUN_PROFDATA_DIR environment variable, as shown in the following example.

Let's continue with the bubble sort example:

% ./bubblesort
Array[0] = 1
Array[1] = 2
..
..
Array[48] = 49
Array[49] = 50

% ls -dF *.profile
bubblesort.profile/

% ls -dF /tmp/*.profile
No match

% setenv SUN_PROFDATA_DIR /tmp

% ./bubblesort
Array[0] = 1
Array[1] = 2
..
..

% ls -dF /tmp/*.profile
/tmp/bubblesort.profile/
                

1.2.1 Single feedbin for all profiled processes

By default, the profiler thread creates one profile feedback file (that is, feedbin) for each profiled executable. The default behavior is good enough for small programs or applications with very few executables. However, for large applications with tens of executables, having so many profile feedback files is inconvenient in the use phase, where each feedback file has to be specified on the compile line with a -xprofile=use:<path_to_profdir> option to produce optimal binaries.

For example, if the application consists of twenty executables, we need to have twenty -xprofile=use flags on the compile line, as shown below:

             
% cc -xO2 -xprofile=use:feedback1 -xprofile=use:feedback2 [...] \
-xprofile=use:feedback19 -xprofile=use:feedback20 -o optimalbin <sourcefile>.c
              
            
          

There are two major inconveniences with this:

  • If the makefile takes all compiler options from environment variables like CFLAGS, it may not be possible to specify all instances of -xprofile=use in a single CFLAGS, due to the underlying shell's restrictions on the number of characters per variable.

  • The compile line may become too long, and look ugly with too many instances of -xprofile=use.

To get around these inconveniences, it is recommended to use the compiler-supported environment variables SUN_PROFDATA_DIR and SUN_PROFDATA in the profile data collection phase, to request the profiler to write all the profile data from different profiled processes into a single feedbin file, instead of creating one per executable. If these environment variables are set, the profiler writes the profile data into the file pointed to by SUN_PROFDATA, under the directory SUN_PROFDATA_DIR. That is, the profile data from all processes will be written into $SUN_PROFDATA_DIR/$SUN_PROFDATA.

The following trivial example illustrates the default behavior, as well as the behavior with the SUN_PROFDATA and SUN_PROFDATA_DIR environment variables.

                   
% cat a.c
#include <stdio.h>

int main()
{
   printf("In a.c\n");
   return (0);
}

% cat b.c
#include <stdio.h>

int main()
{
   printf("In b.c\n");
   return (0);
}

% cc -xO2 -xprofile=collect -o a a.c

% cc -xO2 -xprofile=collect -o b b.c
                  
                

1.2.1.1 Default behavior

Note that in the following example there are two profiles, one per executable. a.profile holds the profile data for the executable "a", and b.profile holds the profile data for the executable "b".

                   
% setenv SUN_PROFDATA_DIR /tmp/default

% ./a
In a.c

% ./b
In b.c

% ls /tmp/default
a.profile/  b.profile/
                

To use the feedback data, the programs have to be compiled as follows:

             
% cc -xO2 -xprofile=use:/tmp/default/a -o a a.c
% cc -xO2 -xprofile=use:/tmp/default/b -o b b.c
            
          

1.2.1.2 Requesting a single feedback file

A single feedback file can be requested:

             
% setenv SUN_PROFDATA_DIR /tmp/consolidate
% setenv SUN_PROFDATA singlefeedbin.profile
            
          

At run time, the profiler thread reads the values of SUN_PROFDATA and SUN_PROFDATA_DIR, and writes all profile feedback data from the different profiled processes into a single feedbin file under the /tmp/consolidate/singlefeedbin.profile directory.

Here's an example:

                   
% mkdir /tmp/consolidate

% ./a
In a.c

% ./b
In b.c

% ls /tmp/consolidate
singlefeedbin.profile/
                

Observe that singlefeedbin.profile holds the feedback data for both executables "a" and "b". If there are more profiled processes, their profile data will be appended to this feedbin file.

To use this profile, compile as follows:

             
% cc -xO2 -xprofile=use:/tmp/consolidate/singlefeedbin -o a a.c
% cc -xO2 -xprofile=use:/tmp/consolidate/singlefeedbin -o b b.c
            
          

But note that writing to a single profile feedback file helps only when several instrumented objects serve as dependencies for several profiled processes. The purpose of the above example is only to show how to request the profiler to write the profile data into a single feedback file.

1.2.2 Asynchronous profile data collection

By default, profile data collection is synchronous. The profiler thread waits for shared library finalization (if any), and for the process to call exit(), before writing all the profile data to the feedback file. In effect, the process must exit normally for the profile data to be written. As a result, multi-threaded applications may lose some profile data to race conditions among threads. Also, there is no guarantee that all applications, especially multi-threaded applications, are designed to terminate gracefully. If a profiled process does not call exit() but is terminated in some other way, for example with a SIGKILL, it is unlikely that a usable profile can be obtained from that process. If the profiled process dynamically loads and unloads other libraries with the dlopen() and dlclose() functions, it will lead to indirect call profiling, with its own share of problems collecting the profile data.

To alleviate the problems described above, we need a mechanism to collect the profile data from a running process without requiring it to terminate gracefully. An asynchronous profile data collection feature was added in the Sun Studio 11 compiler release. It was then backported to the Sun Studio 9 and 10 releases. Applying patch 115983-06 (or later) to Studio 9, and 117832-06 (or later) to Studio 10, gives you the ability to control the way the profile data is collected. As a result, the chances of getting a good profile from many single-threaded and multi-threaded applications are high, irrespective of how the profiled processes exit.

1.2.2.1 Enabling asynchronous profile data collection

Asynchronous profile collection is not enabled by default. To enable it, set the SUN_PROFDATA_ASYNC_INTERVAL environment variable before running the application. If SUN_PROFDATA_ASYNC_INTERVAL has been set to a positive integer value n at the start-up of an application, the profiler thread collects profile data periodically, every n seconds, and updates the corresponding feedbin file. n is the time interval between periodic profile snapshots, in seconds.

When data for a snapshot is collected, the profiler updates a single profile directory whose name is of the form: <procname>.<hostname>.<pid>[.profile]

where:
<procname> is the name of the process being profiled
<hostname> is the host name of the machine executing the profiled process
<pid> is the process id of the profiled process

.profile will be appended to the name of the profile directory unless <dir_name> is specified using the value of the environment variable SUN_PROFDATA.

Note that the profiler thread collects profile snapshots only for the process in which it was initiated. Forked processes will not inherit the profiler thread.

The collected profile data can be used in the use phase of profile feedback by specifying the compiler option: -xprofile=use:<procname>.<hostname>.<pid>. The profile directory can be renamed as you wish before specifying it in the -xprofile=use option.

1.2.2.2 Multiple profile snapshots per process

Asynchronous profile collection also enables the collection of profile data more than once per process. If the environment variable SUN_PROFDATA_ASYNC_SEQUENCE is defined, and set to an integer value num_snapshots ≥ 1, the profiler generates a sequence of distinct profile snapshots whose names are of the form: <procname>.<hostname>.<pid>.<n>[.profile]

where:
<n> is a positive integer in the range [1..num_snapshots].

Subsequent profile snapshots are applied to update the <procname>.<hostname>.<pid>[.profile] directory for the remaining lifetime of the process.

The time sequence of profile snapshots generated by setting SUN_PROFDATA_ASYNC_SEQUENCE might be used to determine how long profile data should be collected from a given application in order to obtain good performance with -xprofile=use.

Here's an example:

Let's assume that the program mtserver is compiled with -xprofile=collect. The async profile data collection can be done as follows:

                   
% uname -n
v890appserv

% setenv SUN_PROFDATA_ASYNC_INTERVAL 30
% setenv SUN_PROFDATA_ASYNC_SEQUENCE 3
% setenv SUN_PROFDATA_VERBOSE
% setenv SUN_PROFDATA_DIR /tmp/profile

% ./mtserver &
[1] 8529
                

This example collects a snapshot of profile data from process 8529 every 30 seconds, as long as it runs. The first 3 snapshots will be saved in their own profile directories: /tmp/profile/mtserver.v890appserv.8529.1.profile, /tmp/profile/mtserver.v890appserv.8529.2.profile and /tmp/profile/mtserver.v890appserv.8529.3.profile. Then the subsequent snapshots will update the feedback directory: /tmp/profile/mtserver.v890appserv.8529.profile.

To get any warning messages during profile data collection, define the environment variable SUN_PROFDATA_VERBOSE. For multi-threaded programs, observe that the thread count increases by one if the program is compiled with -xprofile=collect. The extra thread that you didn't create is the profiler thread – the compiler adds necessary code to create this thread as part of its instrumentation.

1.3. Re-build the application with profile feedback

Once you gather the profile data from the profiled process, feed it to the compiler with the flag -xprofile=use:<path_to_profdir>. The compiler uses this data to do a better job of optimizing the application code. Make sure to name the profile data directory: if you specify only -xprofile=use, the compiler does not know what the profile data directory is called, and therefore looks for a.out.profile by default. Note that it is not necessary to add .profile when specifying the profile data directory name in -xprofile=use. In the bubble sort example, it is valid to specify either -xprofile=use:bubblesort.profile or -xprofile=use:bubblesort on the compile line.

Except for the -xprofile option which changes from -xprofile=collect to -xprofile=use, the source files and other compiler options must be exactly the same as those used for the compilation of profiled objects. The same version of the compiler must be used for both collect and use builds.

If both -xprofile=collect and -xprofile=use are specified on the same compile line, the rightmost -xprofile option in the compile line is applied.

If you are compiling the object file with -xprofile=use in a directory that is different from the directory in which the object file was previously compiled with -xprofile=collect, make sure to add the -xprofile_pathmap=<collect_prefix>:<use_prefix> option on the compile line, so the compiler can find the profile data for the object file. collect_prefix is the prefix of the pathname of the directory in which the object file was compiled using -xprofile=collect, and use_prefix is the prefix of the pathname of the directory in which the object file is to be compiled using -xprofile=use. Refer to the C compiler options reference for detailed information about the -xprofile_pathmap compiler option.

Continuing with the bubble sort example:

                   
% cc -o bubblesort -xO2 -xprofile=use:bubblesort.profile bubblesort.c

% ./bubblesort
Array[0] = 1
Array[1] = 2
..
..
                

Important Note:
Measure the application performance with profile feedback, and compare it with baseline numbers before you put this into a build environment. Because it requires compiling the entire application code twice, it is intended to be used only after other debugging and tuning is finished, and as one of the last steps before putting the application into production or releasing it to the customers.

1.3.1 Compiling with multiple profiles

Sun Studio compilers accept multiple profiles on the compile line, with multiple -xprofile=use:<path_to_profdir> options. Listing several directories in a single option, as in -xprofile=use:<path_to_profdir1>:<path_to_profdir2>..<path_to_profdirn>, results in a compilation error.

For example:

% cc -xO2 -xprofile=use:/tmp/prof1.profile -xprofile=use:/tmp/prof2.profile

When the compiler encounters multiple profiles on the compile line, all the profile data will be merged before any code transformations are performed, based on the profile feedback data.

1.3.2 Extracting execution counts

If you are curious about the compiler code transformations performed based on the profile feedback data, use the following code generator (cg) options, to dump the execution count of each basic block in an assembly listing.

C (cc):
-xprofile=use:<path_to_profdir> -Wc,-assembly,-Qcg-V

C++ (CC) and Fortran (f95):
-xprofile=use:<path_to_profdir> -Qoption cg -assembly,-Qcg-V

The -assembly option will generate a .s file with the same basename and dirname as the object file (e.g., bubblesort.o will be accompanied by bubblesort.s in the same directory). The -Qcg-V option adds more information as assembler comments to the generated .s file. If -xprofile=use has been specified, this information includes execution counts derived from the <path_to_profdir>.

Here's an example:

                   
% cc -xO2 -xprofile=use:bubblesort.profile -Wc,-assembly,-Qcg-V bubblesort.c

% cat bubblesort.s
...
...
!   15                              !void bubblesort(int *Array, int length)
!   16                              !{

!
! SUBROUTINE bubblesort
!
! OFFSET    SOURCE LINE LABEL   INSTRUCTION     (ISSUE TIME)    (COMPLETION TIME)

                       .global bubblesort

                       bubblesort:             /* frequency 1.0 confidence 1.0 */
/* 000000         16 ( 0  1) */         save    %sp,-96,%sp
/* 0x0004            ( 1  2) */         orcc    %g0,%i1,%i1

!   17                              !        int i, j;
!   19                              !        for (i = 0; i < length; ++i)

/* 0x0008         19 ( 1  2) */         ble,pn  %icc,.L77000015 ! tprob=0.00
/* 0x000c            ( 1  2) */         or      %g0,0,%l6 ! const ! hoisted

! Registers live out of bubblesort:
! o2 sp l6 i0 i1 i4 fp i7 gsr
!

! predecessor blocks : bubblesort

                       .L77000031:             /* frequency 1.0 confidence 1.0 */
/* 0x0010         19 ( 0  1) */         or      %g0,%i0,%l7

!   20                              !   {
!   21                              !                for (j = (i + 1); j < length; ++j)

/* 0x0014         21 ( 0  1) */         add     %l6,1,%l5 ! no_overflow

! Registers live out of .L77000031:
! o2 sp l5 l6 l7 i0 i1 i4 fp i7 gsr
!

! predecessor blocks : .L77000031 .L900000205

                       .L900000206:            /* frequency 50.0 confidence 1.0 */
/* 0x0018         21 ( 0  1) */         cmp     %l5,%i1
/* 0x001c            ( 0  1) */         bge,pn  %icc,.L77000032 ! tprob=0.02
/* 0x0020            ( 0  1) */         add     %l7,4,%i4 ! hoisted

! Registers live out of .L900000206:
! o2 sp l5 l6 l7 i0 i1 i4 fp i7 gsr
!

...
...
                

1.3.2.1 Alternatives

The code coverage analysis tool, tcov, can be used to find the frequency of execution of blocks, and instructions. If the source code is compiled with -g or -g0 debug options, the Sun Studio er_src utility can be used to read the compiler inserted commentary about the code transformations.

Please refer to Sun Studio's Performance Analyzer documentation, for more detailed information about these tools.

2. Building Patches For An Enterprise Application

A frequently asked question about using the profile feedback mechanism to build applications is: Is it necessary to go through the entire profile feedback life cycle whenever changes are made to the source code? The simple answer is: No. The following explains a simple way to avoid building the entire application with -xprofile=collect when there aren't many changes in the code base.

If the application is very big and only a few objects were changed, profile only those objects that will be re-built for the patch. However, in order to collect a meaningful profile, there need to be -xprofile=collect versions of all object files comprising a re-built executable or shared library. For example, if the executable mtserver is built by linking the object files a.o and b.o, re-compile those objects with -xprofile=collect, and re-link to build a new copy of mtserver. Then: (i) replace the old binaries in the previously saved collect build with the newly built binaries; (ii) re-run the training run, and collect the profile data for the entire build; (iii) finally, re-compile all object files comprising the binary (executable or library) with -xprofile=use, and then re-link to build the actual binary to be shipped to the customer as a patch.

Here's an example:

Assume that a shared library libABC.so was built with profile feedback, by linking the objects A.o, B.o and C.o. If the objects A and B are modified or enhanced later, re-build libABC.so with profile feedback as outlined below:

  1. Compile the objects A and B with -xprofile=collect.

  2. Link the objects A.o (new), B.o (new) and C.o (old) to build libABC.so. Make sure to specify the -xprofile=collect compiler flag on the link line.

  3. Replace libABC.so in the previous full collect build with the newly built libABC.so. The assumption here is that the full collect build of the application, which was used for collecting the profile data when building the previous version of the application, is still available.

  4. Collect profile data for the entire application with the training run, preferably with the workload used in the previous training run(s).

  5. Compile the objects A and B again with the -xprofile=use compiler flag, and with the new profile data from step #4.

  6. Re-link the objects A.o (new), B.o (new) and C.o (old) to build libABC.so. Make sure to specify the -xprofile=use compiler flag on the link line, along with the new profile data from step #4.

  7. Release libABC.so as a patch to the customers.

Repeat the above steps for all binaries (executables or shared libraries) that will be released as a patch. Note that step #4 needs to be done only once, even if multiple binaries need to be re-built for the patch. If several binaries need to be re-built due to changes in the source code, consider building the whole application with -xprofile=collect, instead of building only the binaries that go into the patch (as explained in the above example).
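The compile and link lines for the libABC.so example can be sketched as a small C helper that prints the commands for steps 1 and 2 (collect phase) and steps 5 and 6 (use phase). This is only an illustration under stated assumptions: the source file names A.c and B.c, the -xO2 level, and the helper names (compile_cmd, link_cmd, print_phase) are hypothetical, and the profile name passed for the use phase stands for wherever the training run in step #4 stored its data.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Phase of the profile feedback cycle a command belongs to. */
enum phase { PHASE_COLLECT, PHASE_USE };

/* Write "-xprofile=collect" or "-xprofile=use:<profile>" into buf.
   `profile` is only consulted for PHASE_USE. */
static void profile_flag(char *buf, size_t size, enum phase p,
                         const char *profile)
{
    if (p == PHASE_COLLECT) {
        snprintf(buf, size, "-xprofile=collect");
    } else {
        assert(profile != NULL && strlen(profile) > 0);
        snprintf(buf, size, "-xprofile=use:%s", profile);
    }
}

/* Steps 1 and 5: command that compiles one modified source file. */
int compile_cmd(char *buf, size_t size, enum phase p,
                const char *profile, const char *source)
{
    char flag[128];

    profile_flag(flag, sizeof flag, p, profile);
    return snprintf(buf, size, "cc -xO2 %s -c %s", flag, source);
}

/* Steps 2 and 6: command that links new and old objects into the library. */
int link_cmd(char *buf, size_t size, enum phase p, const char *profile,
             const char *library, const char *objects)
{
    char flag[128];

    profile_flag(flag, sizeof flag, p, profile);
    return snprintf(buf, size, "cc -xO2 %s -G -o %s %s",
                    flag, library, objects);
}

/* Print the compile and link commands of one phase for libABC.so. */
void print_phase(enum phase p, const char *profile)
{
    char cmd[512];

    compile_cmd(cmd, sizeof cmd, p, profile, "A.c");
    puts(cmd);
    compile_cmd(cmd, sizeof cmd, p, profile, "B.c");
    puts(cmd);
    link_cmd(cmd, sizeof cmd, p, profile, "libABC.so", "A.o B.o C.o");
    puts(cmd);
}
```

Calling print_phase(PHASE_COLLECT, NULL) prints the commands for steps 1 and 2. After the training run (steps 3 and 4), print_phase(PHASE_USE, "app") prints the commands for steps 5 and 6, where "app" is a placeholder for the real profile name.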

In general, it is desirable to collect profile data whenever the code base changes. However, doing so may not be feasible for very large applications built with profile feedback. When the source code changes are limited to very few lines, it is therefore reasonable to skip the profile data collection and use the existing profile data, which reduces the overhead to some extent. Be aware that the gains from profile feedback may diminish over time if previously collected profile data continues to be used while a large number of changes accumulate in the code base. So for optimal performance, collect the profile data again for the whole application when the number of source code changes becomes large enough to warrant a bigger patch, that is, one that distributes a large number of modified binaries.

3. Compiling Modified Source With Old Profile Data

It is important to know how a simple change in source code affects the feedback based optimization in the presence of old profile data. Assume that a program was linked with a library libstrimpl.so, that implements string comparison, __strcmp, and string length calculation, __strlen.

Example:

% cat strimpl.h
int __strcmp(const char *, const char *);
int __strlen(const char *);

% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"

int __strcmp(const char *str1, const char *str2)
{
  int rc = 0;

  for(;;)
  {
     rc = *str1 - *str2;
     if(rc != 0 || *str1 == 0)
     {
        return (rc);
     }
     ++str1;
     ++str2;
  }
}

int __strlen(const char *str)
{
       int length = 0;

       for(;;)
       {
               if (*str == 0)
               {
                       return (length);
               }
               else
               {
                       ++length;
                       ++str;
               }
       }
}

% cat driver.c
#include <stdio.h>
#include "strimpl.h"

int main()
{
       printf("\nstrcmp(pod, podcast) = %d", __strcmp("pod", "podcast"));
       printf("\nstrlen(Solaris10) = %d", __strlen("Solaris10"));

       return (0);
}

Assume that the shared library, libstrimpl.so, was built with profile feedback, as shown below:

% cc -xO2 -xprofile=collect -G -o libstrimpl.so strimpl.c
% cc -xO2 -xprofile=collect -lstrimpl -o driver driver.c
% ./driver
% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
% cc -xO2 -xprofile=use:driver -lstrimpl -o driver driver.c

The library was extended with a new routine for string reversal, __strreverse, for its next release. Let's see what happens if we skip the profile data collection for this library after integrating the code for __strreverse routine. Since the programmer may not care much about the organization of the independent routines within the source file, the new routine can be placed anywhere (top, middle or at the end) in the source file.

Case 1: The routine was added at the bottom of the file, i.e., after all existing routines

% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"

int __strcmp(const char *str1, const char *str2) {  ...  }

int __strlen(const char *str) {  ...  }

char *__strreverse(const char *str)
{
       int i, length = 0;
       char *revstr = NULL;

       length = __strlen(str);
       /* one extra byte for the terminating null character */
       revstr = (char *) malloc (sizeof (char) * (length + 1));
       if (revstr == NULL)
       {
               return (NULL);
       }

       for (i = length; i > 0; --i)
       {
               *(revstr + i - 1) = *(str + length - i);
       }
       *(revstr + length) = '\0';

       return (revstr);
}

% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
warning: Profile feedback data for function __strreverse is inconsistent. Ignored.

If you do not want to collect profile data for the new code, appending the new code at the bottom of the source file is the recommended way. By doing so, the existing profile data remains consistent, and can be used by the compiler in optimizing the untouched (existing) code, as before. Since there is no profile feedback data available for the new routine, the compiler simply performs the other optimizations, as it usually does without the -xprofile compiler option.
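When the profile is eventually collected again, the training run has to reach the new routine, since a profile can only describe code that was actually executed. The following is a minimal sketch of a step that a training driver could call. The names str_length, str_reverse and exercise_reverse are hypothetical stand-ins for the library routines (renamed here to avoid identifiers that begin with a double underscore), and the round-trip check is an added sanity test, not something the library itself provides.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the library's string length routine. */
int str_length(const char *str)
{
    int length = 0;

    assert(str != NULL);
    while (str[length] != '\0') {
        ++length;
    }
    return length;
}

/* Hypothetical stand-in for the string reversal routine.
   Returns a newly allocated, null-terminated string; the caller frees it. */
char *str_reverse(const char *str)
{
    int i;
    int length = str_length(str);
    char *revstr = malloc((size_t) length + 1);

    if (revstr == NULL) {
        return NULL;
    }
    for (i = 0; i < length; ++i) {
        revstr[i] = str[length - 1 - i];
    }
    revstr[length] = '\0';
    return revstr;
}

/* Training-run step: execute the new routine and check that reversing
   twice gives back the input. Returns 0 on success, -1 otherwise. */
int exercise_reverse(const char *input)
{
    int status = -1;
    char *once = str_reverse(input);
    char *twice = (once != NULL) ? str_reverse(once) : NULL;

    if (twice != NULL && strcmp(twice, input) == 0) {
        printf("strreverse(%s) = %s\n", input, once);
        status = 0;
    }
    free(twice);
    free(once);
    return status;
}
```

A driver would call exercise_reverse() with inputs that resemble the strings seen in production, alongside the existing calls to the comparison and length routines.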

Case 2: The routine was added somewhere in the middle of the source file

% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"

int __strcmp(const char *str1, const char *str2) {  ...  }

char *__strreverse(const char *str) {  ...  }

int __strlen(const char *str) {  ...  }

% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
warning: Profile feedback data for function __strreverse is inconsistent. Ignored.
warning: Profile feedback data for function __strlen is inconsistent. Ignored.

The compiler reads the line numbers of the blocks and their execution counts from the feedback (feedbin) file. As a result, introducing new code in a routine makes its profile data inconsistent. Also, since the position of all other routines that are underneath the newly introduced code may change, their profile data becomes inconsistent as well. Hence the compiler ignores the profile data of such routines to avoid introducing functional errors.

The same explanation holds true when the new code is added at the top of the source file, above all existing routines. Such an action leaves all the profile data for this object in an unusable (inconsistent) state. Observe the warnings in the following example.

Case 3: The routine was added at the top of the source file

% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"

char *__strreverse(const char *str) {  ...  }

int __strcmp(const char *str1, const char *str2) {  ...  }

int __strlen(const char *str) {  ...  }

% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
warning: Profile feedback data for function __strreverse is inconsistent. Ignored.
warning: Profile feedback data for function __strcmp is inconsistent. Ignored.
warning: Profile feedback data for function __strlen is inconsistent. Ignored.

The bottom line: If the plan is to skip profile data collection in favor of using old profile data from previous training run(s), always add the new code at the bottom of the source file (unless it needs to be placed somewhere else to avoid compilation errors), to keep the data consistent for at least the majority of the existing code.
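The three cases can be reproduced with a deliberately simplified model in which the old profile remembers only the line on which each routine started. This is a sketch for intuition, not a description of the real feedback file: the struct, the function names and all line numbers below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified record: a routine and the line it starts on. */
struct routine {
    const char *name;
    int start_line;
};

/* Layout of strimpl.c when the profile was collected
   (line numbers are illustrative only). */
const struct routine old_profile[] = {
    { "__strcmp", 4 }, { "__strlen", 18 }
};

/* Layouts after adding __strreverse at the bottom, middle and top. */
const struct routine case1[] = {
    { "__strcmp", 4 }, { "__strlen", 18 }, { "__strreverse", 36 }
};
const struct routine case2[] = {
    { "__strcmp", 4 }, { "__strreverse", 18 }, { "__strlen", 34 }
};
const struct routine case3[] = {
    { "__strreverse", 4 }, { "__strcmp", 20 }, { "__strlen", 34 }
};

/* Return 1 if the profile has a record for r at the same starting line. */
int is_consistent(const struct routine *profile, int nprofile,
                  const struct routine *r)
{
    int i;

    for (i = 0; i < nprofile; ++i) {
        if (strcmp(profile[i].name, r->name) == 0) {
            return profile[i].start_line == r->start_line;
        }
    }
    return 0; /* no record at all, e.g. a newly added routine */
}

/* Report every routine whose old data cannot be used; return the count. */
int check_source(const struct routine *profile, int nprofile,
                 const struct routine *source, int nsource)
{
    int i, ignored = 0;

    assert(profile != NULL && source != NULL);
    for (i = 0; i < nsource; ++i) {
        if (!is_consistent(profile, nprofile, &source[i])) {
            printf("profile data for %s would be ignored\n", source[i].name);
            ++ignored;
        }
    }
    return ignored;
}
```

In this model, check_source(old_profile, 2, case1, 3) reports one routine, while case2 and case3 report two and three routines respectively, which matches the number of warnings in the three cases.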

4. Other Compiler Options That Could Use Profile Data

The compiler option -xipo performs cross-file optimization: optimizations that extend across multiple source files. One example of this kind of optimization is inlining a routine from one source file into code from another source file. In the presence of profile feedback, the compiler has a much better model of the set of routines that are worth inlining.
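What profile feedback contributes to the inlining decision is, in essence, how often each call site is executed. The sketch below counts two call sites by hand to make that visible. It is an illustration only: the counters, the routine names and the fallback path are hypothetical, and in a real -xprofile=collect build the compiler inserts its own instrumentation, so no such code is written manually.

```c
#include <assert.h>
#include <stdio.h>

/* Hand-written counters standing in for the execution counts that an
   instrumented build would record automatically. */
static unsigned long hot_calls = 0;
static unsigned long cold_calls = 0;

/* Small routine that, in a real project, might live in another source
   file and therefore only be inlinable with cross-file optimization. */
int clamp_to_byte(int value)
{
    if (value < 0) {
        return 0;
    }
    if (value > 255) {
        return 255;
    }
    return value;
}

/* Sum the clamped values. The call in the loop is the hot call site;
   the fallback call runs only for empty input and is the cold one. */
long sum_clamped(const int *values, int count, int fallback)
{
    long total = 0;
    int i;

    assert(count <= 0 || values != NULL);
    if (count <= 0) {
        ++cold_calls;
        return clamp_to_byte(fallback);      /* cold call site */
    }
    for (i = 0; i < count; ++i) {
        ++hot_calls;
        total += clamp_to_byte(values[i]);   /* hot call site */
    }
    return total;
}

/* Print the counts gathered so far. */
void report_call_counts(void)
{
    printf("hot call site: %lu, cold call site: %lu\n",
           hot_calls, cold_calls);
}
```

With counts like these available, inlining clamp_to_byte at the hot call site is likely to pay off, while the cold call site can be left as an ordinary call.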

Option -xlinkopt causes the compiler to perform link time optimization. This final phase of compilation uses all the knowledge of the generated code in order to do some final tweaking of the code layout. This is useful for large codes where performance can be gained by laying out the code to keep all the frequently executed code together.

Refer to the technical article Improving Code Layout Can Improve Application Performance for more information about link time optimization.

5. Profile Data Portability Across Different Platforms

To reduce the build-time overhead of profile feedback, it would be desirable to use the profile data collected on one platform when building the application on other platforms, provided the application code is portable. However, at the time of this writing, profile data collected with Sun Studio compilers on SPARC platforms is not compatible with profile data collected on x86/x64 platforms. That is, profile data collected on one platform cannot be used on another platform.

6. Alternatives To Feedback-Based Optimization

Sun introduced a static optimizer, binopt, as part of the Sun Studio 11 compiler suite. binopt works directly on binaries. If feedback-based optimization is not feasible, or did not help much because the workloads used in the training run(s) were not representative, binopt can be used as an alternative to improve the performance of the application.

Refer to the Sun Studio Binary Code Optimizer technical article for further information about using binopt.

7. Summary

The discussion of this article can be summarized as follows:

  • Depending on the version of your compiler, make sure the most recent versions of the following patches are installed:
    115983 for Studio 9, 117832 for Studio 10, and 120760 for Studio 11 releases.

  • Build the application with compiler flags: -xprofile=collect, -xO2 or higher, -xipo, -xlinkopt, and other optimization flags.

    For example:

    % cc -xO2 -xprofile=collect -xipo -xlinkopt -o application application.c

  • Collect the profile feedback data asynchronously, by running the application with one or more representative workloads.

    % mkdir /tmp/myapp
    % setenv SUN_PROFDATA_ASYNC_INTERVAL 30
    % setenv SUN_PROFDATA_DIR /tmp/myapp
    % ./application args

  • Re-build the application with -xprofile=use, optimization level -xO2 or above, -xipo, -xlinkopt and other optimization flags.

    % cc -xO2 -xprofile=use:/tmp/myapp.profile -xipo -xlinkopt -o application application.c



Acknowledgements

The techniques described in this article are derived from earlier work done by Vinod Grover and Chris Aoki, and the author wishes to acknowledge their input.

About The Author

Giri Mandalika is a software engineer in Sun Microsystems' Market Development Engineering group, working with independent software vendors to make sure their products run well on Sun platforms. He holds a Master's degree in Computer Science from The University of Texas at Dallas.
