How to Optimize the Serial Performance of Applications

Oracle Solaris Studio 12.3

by Darryl Gove, December 2011


This article provides advice on using Oracle Solaris Studio to select appropriate compiler flags and other optimizations in order to improve the serial performance of applications.




Introduction

Getting the best performance for SPARC or x86 applications involves using the latest compilers and selecting the most appropriate set of compiler options. Oracle Solaris Studio compilers strive to provide the best out-of-the-box performance for any application built using them. However, minor refinements to the compiler options can often yield further performance gains. As a result, it is important to approach optimization and tuning experimentally before you release the final version of an application.


As a part of this process, it is important to understand exactly what is expected of the compiler in concert with the assumptions made in the application. In particular, you should ask two key questions when selecting the appropriate compiler options:

  • What is known about the platforms where the compiled application will eventually run?
  • What is known about the assumptions that are made in the code?

In addition, it is helpful to consider the purpose of a particular compilation. Compiler options can present various trade-offs depending on whether a given compilation is meant to assist with debugging, testing, tuning, or final performance optimization.

Note: This article addresses optimizing applications for serial performance. Optimizing multithreaded or parallel applications is covered in How to Optimize the Parallel Performance of Applications.

Identifying the Target Platform

Knowing where the code will eventually run is essential in order to understand what optimization options make sense. The choice of platform determines the following:

  • A 32-bit or 64-bit instruction set
  • Instruction set extensions the compiler can use to accelerate performance
  • Instruction scheduling, depending on instruction execution times
  • Cache configuration

Generating 32-Bit or 64-Bit Code

The SPARC and x86 processor families can run both 32-bit and 64-bit code. The principal advantage of 64-bit code is that the application can handle a larger data set than with 32-bit code. However, the cost of this larger address space is a larger memory footprint for the application, since long variable types and pointers increase in size from 4 bytes to 8 bytes. The increase in memory footprint can cause a 64-bit version of an application to run more slowly than the 32-bit version.

At the same time, the x86 platform presents some architectural advantages when running 64-bit code compared to running 32-bit code. In particular, the application can use more registers and a better calling convention. On the x86 platform, these advantages typically allow a 64-bit version of an application to run faster than a 32-bit version of the same code, unless the memory footprint of the application has significantly increased.

The SPARC line of processors takes a different approach, because it was architected to enable a 32-bit version of an application to use the architectural features of the 64-bit instruction set. As a result, there is no architectural performance gain in going from 32-bit to 64-bit code. Consequently, 64-bit applications compiled for SPARC processors will see only the additional cost of the increased memory footprint.

Compiler flags determine whether a 32-bit or 64-bit binary is generated.

  • To generate a 32-bit binary, use the -m32 flag.
  • To generate a 64-bit binary, use the -m64 flag.
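
The effect of the two data models is easy to observe. The following minimal sketch (the file name size.c is hypothetical) prints the sizes of a long and a pointer; built with -m32 it reports 4 bytes for each, while built with -m64 it reports 8 bytes for each.

#include <stdio.h>

int main(void)
{
   /* long and pointer types are 4 bytes under -m32 and 8 bytes under -m64 */
   printf("sizeof(long)   = %d\n", (int)sizeof(long));
   printf("sizeof(void *) = %d\n", (int)sizeof(void *));
   return 0;
}

% cc -m32 size.c && ./a.out
% cc -m64 size.c && ./a.out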

For additional details about migrating from 32-bit to 64-bit code, refer to Converting 32-bit Applications Into 64-bit Applications: Things to Consider and 64-bit x86 Migration, Debugging, and Tuning with the Sun Studio 10 Toolset.

Specifying an Appropriate Target Processor

Oracle Solaris Studio compilers provide considerable flexibility for selecting a target processor through the -xtarget compiler flag. The default for the compiler is to produce a "generic" binary, namely a binary that will work well on all platforms (-xtarget=generic). In many situations, a generic binary is the best choice. However, there are some situations, including the following, in which it is appropriate to select a different target:

  • To override a previous target setting. The compiler evaluates options from left to right. If you specify the -fast flag on the compile line, it might be appropriate to override the implicit setting of -xtarget=native with a different choice.
  • To exploit the features of a particular processor. For example, newer processors tend to have more features that can be exploited for performance gains. The compiler can use these features at the expense of producing a binary that does not run on older processors that do not have these features.

The -xtarget flag actually sets three flags:

  • The -xarch flag specifies the architecture of the target machine. This architecture is basically the instruction set that the compiler can use. If the processor that runs the application does not support the appropriate architecture, the application might not run.
  • The -xchip flag tells the compiler which processor to assume is running the code. This flag tells the compiler which patterns of instructions to favor when it has a choice between multiple ways of coding the same operation. It also tells the compiler the instruction latency to use in order to schedule the instructions to minimize stalls.
  • The -xcache flag tells the compiler the cache hierarchy to assume. This selection can have a significant impact on floating point code in which the compiler is able to make a choice about how to arrange loops so that the data being manipulated fits into the caches.

Target Architectures for the SPARC Processor Family

For the SPARC processor family, the default setting, -xtarget=generic, should be appropriate for most situations. This setting generates a 32-bit binary that uses the SPARC V8 instruction set if the -m32 flag is used or a 64-bit binary that uses the SPARC V9 instruction set if the -m64 flag is used. The most common situation in which you need to take the target architecture into account and specify a different setting is when you are compiling code that contains significant floating point computations.

For example, recent SPARC processors support floating point multiply-accumulate (FMA or FMAC) instructions. These instructions combine a floating point multiply and a floating point addition (or subtraction) into a single operation. An FMA operation typically takes the same number of cycles to complete as either a floating point addition or a floating point multiplication, so the performance gain from using these instructions can be significant. However, the results from an application compiled to use FMA instructions might be different than the same application compiled not to use the instructions. In addition, code compiled to take advantage of FMA instructions will not run on a platform that does not support FMA instructions.

As an illustration, consider the operation shown below. The use of the word ROUND in the equation indicates that the value is rounded to the nearest representable floating point number when it is stored into Result.

Result = ROUND( (value1 * value2) + value3)

This single FMA instruction replaces the following two instructions:

tmp = ROUND(value1 * value2)
Result = ROUND(tmp + value3)

Notice that the two-instruction version has two round operations. It is this difference in the number of rounding operations that can result in a difference in the least-significant bits of the calculated result. The FMA implementation is referred to as a fused FMA.

To generate FMA instructions, compile the binary using the following flags:

-xarch=sparcfmaf -fma=fused

Alternatively, you can use the flags -xtarget=sparc64vi -fma=fused or -xarch=sparcvis3 -fma=fused to enable the generation of the FMA instructions. As mentioned, the resulting code will not run on a platform that does not support FMA instructions.
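
The rounding difference can be demonstrated with the C99 fma() routine from <math.h>, which performs a fused multiply-add with a single rounding (using the hardware instruction where the target supports it, a software implementation otherwise). The following is a minimal, hypothetical sketch; the input values are chosen so that the single rounding produces a result that differs in the least-significant bits. Linking with -lm may be required.

#include <stdio.h>
#include <math.h>

int main(void)
{
   /* a*b needs more than 53 bits of significand to represent exactly */
   double a = 1.0 + 0x1.0p-27;
   double b = 1.0 + 0x1.0p-27;
   double c = -1.0;

   /* without -fma=fused, the multiply and the add are each rounded */
   double separate = a * b + c;

   /* fma() rounds the exact value of a*b + c only once */
   double fused = fma(a, b, c);

   printf("separate multiply and add: %a\n", separate);  /* 0x1p-26         */
   printf("fused multiply-add:        %a\n", fused);     /* 0x1.0000001p-26 */
   return 0;
}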

Target Architectures for the x86 Processor Family

By default, the Oracle Solaris Studio compiler targets a 32-bit generic x86-based processor, so generated code will run on any x86 processor, from an Intel Pentium Pro to the latest Intel and AMD Opteron processors.

While -xtarget=generic produces code that can run over the widest range of processors, this code will not take advantage of the Streaming SIMD Extensions 2 (SSE2) extensions offered by the latest processors. To exploit these instructions, use the flag -xarch=sse2. However, the compiler might not recognize all opportunities to use these instructions unless you also use the vectorization flag -xvector=simd.
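
As an illustration, a simple element-wise loop such as the following hypothetical routine is the kind of code the compiler can translate into SSE2 SIMD instructions when the file is built with -xarch=sse2 -xvector=simd. The C99 restrict qualifiers tell the compiler that the two arrays do not overlap, which makes the loop easier to vectorize.

/* candidate for SIMD code generation with -xarch=sse2 -xvector=simd */
void scale(double * restrict out, const double * restrict in, double k, int n)
{
   int i;
   for (i = 0; i < n; i++)
   {
      out[i] = in[i] * k;
   }
}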

Summary of Recommended Compiler Flags for SPARC and x86 Target Architectures

Table 1 provides a summary of Oracle Solaris Studio compiler flags recommended for compilation for various SPARC and x86 target architectures.

Table 1. Oracle Solaris Studio Flags for Specifying Architecture and Address Space
ARCHITECTURE        32-BIT ADDRESS SPACE                               64-BIT ADDRESS SPACE
SPARC               -xtarget=generic -m32                              -xtarget=generic -m64
SPARC64, SPARC T3   -xtarget=sparc64vi -m32 -fma=fused                 -xtarget=sparc64vi -m64 -fma=fused
x86                 -xtarget=generic -m32                              -xtarget=generic -m64
x86/SSE2            -xtarget=generic -xarch=sse2 -m32 -xvector=simd    -xtarget=generic -xarch=sse2 -m64 -xvector=simd

Choosing Compiler Optimization Options

Choosing compiler options represents a trade-off between compilation time, runtime, and (possibly) application behavior. The optimization flags you choose alter three important characteristics:

  • The runtime of the compiled application
  • The length of time that the compilation takes
  • The amount of debug activity that is possible with the final binary

In general, the higher the level of optimization, the faster the application runs (and the longer it takes to compile), but less debug information is available. Ultimately, the particular impact of optimization levels will vary from application to application. The easiest way of thinking about these trade-offs is to consider three degrees of optimization, as outlined in Table 2.

Table 2. Three Degrees of Optimization and the Implications for Resulting Code
PURPOSE                   FLAGS                        COMMENTS
Full debug capabilities   -g [no optimization flags]   The application will have full debug capabilities, but almost no optimization will be performed on the application, leading to lower performance.
Optimization              -g -O                        The application will have good debug capabilities, and a reasonable set of optimizations will be performed on the application, typically leading to significantly better performance.
High optimization         -g -fast                     The application will have good debug capabilities, and a large set of optimizations will be performed on the application, typically leading to higher performance.
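
In terms of compile lines, these three degrees correspond to invocations such as the following (app.c is a hypothetical source file):

% cc -g       -c app.c    # full debug capabilities, minimal optimization
% cc -g -O    -c app.c    # debug information plus basic optimization
% cc -g -fast -c app.c    # debug information plus aggressive optimization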

Compiling for Debugging (-g)

The -g option is a high-fidelity debug option that lets you check for algorithmic errors. With the flag set, code performs exactly as written and you can inspect variables under the debugger. For lower levels of optimization, the -g flag disables some minor optimizations (to make the generated code easier to debug). At higher levels of optimization, the presence of the flag does not alter the code generated (or its performance). However, it is important to be aware that at high levels of optimization, it is not always possible for the debugger to relate the disassembled code to the exact line of the source code or for it to determine the value of local variables held in registers rather than stored to memory.

A very strong reason for compiling with the -g flag is that the Oracle Solaris Studio Performance Analyzer can then attribute time spent in the code directly to lines of source code, making the process of finding performance bottlenecks considerably easier. For information on using the Performance Analyzer, see How to Analyze and Improve Application Performance.
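
For example, a typical workflow is to build with -g, record an experiment with the collect command, and then examine it in the Performance Analyzer. The file names below are hypothetical, and the default experiment name is assumed.

% cc -g -O -o app app.c    # build with debug information and optimization
% collect ./app            # run the application and record an experiment
% analyzer test.1.er       # attribute time to lines of source code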

Compiling for Basic Optimization (-O)

You can achieve basic optimization by using the -O compiler flag. The -O flag offers decent runtime performance, without taking excessively long to compile the application. Add the -g flag to the -O flag to get optimization with debugging information built in.

Multiple possible levels of optimization are offered with Oracle Solaris Studio compilers, including -O3, -O4, and -O5. See the Oracle Solaris Studio documentation for a full description of these options.

Compiling for Aggressive Optimization (-fast)

The -fast option is a good starting point when optimizing code, but it might not represent the desired optimizations for the finished application. Note that because the -fast option is defined as a particular selection of compiler options, it is subject to change from one release to another, as well as between compilers.

In addition, some of the component options selected by -fast might not be available on some platforms. You must also take care if you compile and link the application in separate steps. In that case, to ensure proper behavior, make sure that the application is both compiled and linked with -fast.

The -fast option enables many individual compilation optimizations, and you can turn these component options on or off individually. Ideally, evaluate the -fast option empirically: for instance, if compiling with -fast yields a five-fold performance gain, it is worth determining which of the specific options included in -fast provide that advantage. You can then use those options individually in subsequent builds for a more deterministic and focused optimization.

Be aware of the following implications for using the -fast compilation flag:

  • Implications for the target architecture. Setting the -fast compiler flag sets -xtarget=native for the compilation. This option detects the native chip and instruction set of the development system, and targets the code for that system. As a result, use -xtarget=native only if you know that the target platform is the same as the development system. Otherwise, set -xtarget=generic, or use the -xtarget flag to select the desired target architecture.

    For instance, FMA instructions are implemented on recent SPARC processors, but they are not currently implemented on older processors. As a result, a binary that was built on a recent SPARC system and compiled with -xtarget=native -fma=fused will not run on an older system. The same issue applies to Streaming SIMD Extensions (SSE) instructions in the Intel x86 architecture, which might not be available on older x86 processors and systems.

  • Implications for floating point arithmetic. The -fast option also includes floating point arithmetic simplifications through the -fns and -fsimple flags. Using -fns and -fsimple can result in significant performance gains. However, these flags can also result in a loss of precision, and they allow the compiler to perform some optimizations that do not comply with the IEEE-754 floating point arithmetic standard. Language standards are also relaxed regarding floating point expression reordering. For example:

    • When you set the -fns flag, subnormal numbers are flushed to zero. Subnormal numbers are very small numbers that are too small to be represented in normal form.
    • With -fsimple, the compiler can treat floating point arithmetic as a mathematics textbook might express the operation, for example, by assuming that the order in which additions are performed doesn't matter and that it is safe to replace a divide operation with multiplication by the reciprocal. These kinds of assumptions and transformations seem perfectly acceptable when performed on paper, but they can result in a loss of precision when algebra becomes real numerical computation with numbers of limited precision. Also, -fsimple allows the compiler to make optimizations that assume that the data used in floating point calculations will not be NaNs (Not a Number). Compiling with -fsimple is not recommended if computation with NaNs is expected.

    Therefore, before you commit to using these flags in production code, carefully evaluate any performance gains and carefully check the results.
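
    As a small illustration of the -fns behavior, the following hypothetical sketch produces a subnormal intermediate result. Compiled normally it prints a tiny nonzero value; compiled with -fns, the result is flushed to zero.

    #include <stdio.h>

    int main(void)
    {
       double a = 1e-300;
       double b = 1e-20;
       double c = a * b;   /* the exact result, about 1e-320, is subnormal  */
       printf("%g\n", c);  /* roughly 1e-320 by default; 0 under -fns       */
       return 0;
    }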

  • Implications for pointer aliasing. Using the -fast compiler optimization flag asserts that basic types don't alias, so you should check coding assumptions accordingly. Aliased pointers point to the same region of memory, so an update of a value accessed through one pointer should cause an update of the value accessed through the other pointer.

    In the following code fragment, if a and b point to the same (initially zero) memory location, the output should be a=2 b=2. However, if the compiler assumes no aliasing, it could read a, read b, increment a, increment b, store a back to memory, store b back to memory, and then print a=1 b=1.

    void function (int *a, int *b)
    {
       (*b)++;
       (*a)++;
       printf("a = %i b = %i\n", *a, *b);
    }
    

    For the compiler, aliasing means that stores to the memory addressed by one pointer can change the memory addressed by the other pointer. As a result, the compiler has to be very careful never to reorder stores and loads in expressions containing pointers, and it might also have to reload the values of memory accessed through pointers after new data is stored into memory. The compiler does not check to see whether the assertion is ever violated, so if the source code violates the assertion, the application might not behave in the intended fashion. The results generated by the application will be unpredictable if the source code does not adhere to the degree of aliasing allowed by the compiler flags.

    The following flags tell the compiler what degree of aliasing to assume in the code.

    • -xrestrict asserts that all pointers passed into functions are restricted pointers. This means that if two pointers are passed into a function, under -xrestrict, the compiler can assume that those two pointers never point at overlapping memory.
    • -xalias_level indicates what assumptions can be made about the degree of aliasing between two different pointers. You can consider -xalias_level to be a statement about coding style. By using this flag, you inform the compiler how pointers are treated in the coding style employed. For example, the compiler flag -xalias_level=basic informs the compiler that a pointer to an integer value will never point to the same location as a pointer to a floating point value, as illustrated in the sketch following this list.
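
    For instance, under -xalias_level=basic the compiler assumes that pointers to different basic types never refer to the same storage. In a hypothetical routine such as the following, the compiler can therefore keep *count in a register across the stores made through the float pointer:

    void update(int *count, float *total, int n)
    {
       int i;
       for (i = 0; i < n; i++)
       {
          *total += 1.0f;   /* a store through a float pointer is assumed  */
          *count += 1;      /* not to modify *count, so *count need not    */
       }                    /* be reloaded from memory on each iteration   */
    }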

Additional Optimizations

In addition to optimization flags, a number of other flags and techniques can be used to increase performance.

Crossfile Optimization (-xipo)

The -xipo option performs interprocedural optimizations over the whole program at link time. Through this approach, object files are examined again at link time to see if there are any further optimization opportunities. The most common opportunity is to inline code from one file into code from another file. The term inlining means that the compiler replaces a call to a routine with the actual code from that routine.

Inlining can be good for two reasons, the most obvious being that it eliminates the overhead of calling another routine. A second, less obvious reason is that inlining might expose additional optimizations that can be performed on the object code. For example, the following routine calculates the color of a particular point in an image by taking the x and y position of the point and calculating the location of the point in the block of memory containing the image.

int position(int x, int y)
{
  return x + y*row_length;
}

for (x=0; x<100; x++)
{
  value +=array[position(x,y)];
}

By inlining position() into the routine that works over all the pixels in the image, the compiler is able to generate code that just adds one to the current offset to get to the next point, instead of performing a multiplication and an addition to calculate the address of each point, resulting in a performance gain.

for (x=0; x<100; x++)
{
  value += array[x + y*row_length];
}

This code can then be further optimized.

ytmp=y*row_length;
for (x=0; x<100; x++)
{
  value += array[x+ytmp];
}

The downside of using -xipo is that it can significantly increase the compile time of the application and it might also increase the size of the executable. It is worth compiling with -xipo to see whether the increase in compile time is worth the gain in performance.
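
Because the interprocedural analysis happens at link time, specify -xipo both when compiling the object files and when linking them. A hypothetical build (the file names are illustrative) might look like the following:

% cc -O -xipo -c image.c                   # compile each source file with -xipo
% cc -O -xipo -c position.c
% cc -O -xipo -o app image.o position.o    # link with -xipo to enable crossfile inlining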

Profile Feedback (-xprofile=collect, -xprofile=use)

When compiling a program, the compiler makes a best guess at how the flow of the program might proceed (which branches are taken and which are not). For floating point-intensive code, this approach generally gives good performance. However, for integer programs with many branching operations, the compiler's approximations might not result in the best performance.

Profile feedback assists the compiler in optimizing the application by giving it real information about the paths that are actually taken based on a sample run of the program. Knowing the critical routes through the code allows the compiler to make sure these routes are optimized.

To use profile feedback, do the following:

  1. Compile a version of your application with the -xprofile=collect flag set.
  2. Run the application with representative input data to collect a runtime performance profile.
  3. Recompile the application with -xprofile=use, pointing it at the performance profile data that was collected (see the sketch after this list).
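
In terms of compile lines, the cycle looks something like the following sketch. The source file, binary name, and profile directory name are hypothetical; -xprofile=collect and -xprofile=use each accept an optional profile-directory argument, and the same name should be used for both steps.

% cc -O -xprofile=collect:./profdata -o app app.c   # build an instrumented binary
% ./app                                             # run on representative input; writes profile data
% cc -O -xprofile=use:./profdata -o app app.c       # rebuild using the collected profile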

The downside of this approach is that the compile cycle can be significantly longer, since it comprises two compiles and a run of the application. The upside is that the compiler can produce much more optimal execution paths, yielding a faster runtime for the application.

A representative data set is one that exercises the code in ways similar to the actual data that the application will see in production. You can also run the application multiple times with different workloads to build up a representative data set. Of course, if the training data exercises the code in ways that are not representative of the real workloads, performance might not be optimal. However, code is typically executed through similar routes, so the performance often improves even when the data is not perfectly representative.

For more information on determining whether a workload is representative, see the article Selecting Representative Training Workloads for Profile Feedback Through Coverage and Branch Analysis.

Example Optimizations in Practice

Optimization is an incremental process where different optimizations are evaluated against the advantages they provide. Those optimizations that make a substantial performance difference are then noted as candidates for building the final executable application. As an example of various tuning options, this section considers a simple program that calculates the Mandelbrot set. The code for this application is shown in Listing 1.

Listing 1. Code for Mandelbrot Application
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SIZE 4000
int ** data;

int ** setup()
{
   int i;
   int **data;
   data=(int**)malloc(sizeof(int*)*SIZE);
   for (i=0; i<SIZE; i++)
   {
     data[i]=(int*)malloc(sizeof(int)*SIZE);
   }
   return data;
}

int inset(double ix, double iy)
{
   int iterations=0;
   double x=ix, y=iy, x2=x*x, y2=y*y;
   while ((x2+y2<4) && (iterations<1000))
   {
     y = 2 * x * y + iy;
     x = x2 - y2 + ix;
     x2 = x * x;
     y2 = y * y;
     iterations++;
   }
   return iterations;
}

void loop()
{
   int x,y;
   double xv,yv;
   #pragma omp parallel for private(y,xv,yv) schedule(guided)
   for (x=0; x<SIZE; x++)
   {
     for (y=0; y<SIZE; y++)
     {
       xv = ((double)(x-SIZE/2))/(double)(SIZE/4);
       yv = ((double)(y-SIZE/2))/(double)(SIZE/4);
       data[x][y]=inset(xv,yv);
     }
   }
   if (data[7][7]<0) {printf("Error");} 
}

void main()
{
   data = setup();
   loop();
}

To determine a baseline, the application is first compiled using the -g, -O, and -xtarget=generic compiler flags. Timing information for the application runtime is provided below.

% cc -g -O -xtarget=generic mandle.c
% timex ./a.out
real          33.02
user          32.88
sys            0.09

Because the development system in this case was based on the x86 architecture, it made sense to specify the use of SSE2 instructions to see if using those instructions would provide an additional performance advantage. Note that -xtarget=native would produce the same result in this case, since the -xarch=sse2 flag would be implied.

% cc -g -O -xarch=sse2 mandle.c
% timex ./a.out
real          12.05
user          11.92
sys            0.08

In this case, the code runs nearly three times faster using SSE2 instructions, compared to when the compiler is told not to generate those instructions. Fortunately, most x86 processors now support SSE2 instructions, so it is relatively safe to assume that the bulk of the available hardware will support them.

Next, the -xopenmp flag is set to enable the OpenMP directive that delineates the for loop performing the Mandelbrot computation. The -xvpara and -xloopinfo flags are specified to generate information on which loops are parallelized and to report any potential issues.

Note: For more information on parallelization and the -xopenmp, -xvpara, and -xloopinfo compiler flags, see How to Optimize the Parallel Performance of Applications.

% cc -g -O -xopenmp -xvpara -xloopinfo mandle.c
"mandle.c", line 13: not parallelized, call may be unsafe
"mandle.c", line 25: not parallelized, loop has multiple exits
"mandle.c", line 41: PARALLELIZED, user pragma used
"mandle.c", line 43: not parallelized, loop inside OpenMP region

The resulting code is then run with the environment variable OMP_NUM_THREADS set to equal 2.

% export OMP_NUM_THREADS=2
% timex ./a.out
real           8.72
user          11.92
sys            0.08

In this case, it is important to note that the user time is the same (11.92 seconds), because the same amount of work is performed. However, the real (or wall-clock) time is reduced, because there are now two threads performing the work. Unfortunately, the performance doesn't double, because the work is unbalanced between the two threads. One thread finishes first, so the performance improvement is limited by the slower thread. This behavior can be checked by collecting a profile using the Performance Analyzer and looking at the timeline view, as shown in Figure 1. (For more information on using the Performance Analyzer, see How to Analyze and Improve Application Performance.)

Figure 1

Figure 1. Viewing the timeline in the Performance Analyzer reveals that the thread work is not evenly split between the two threads.

The OpenMP directive schedule(guided) changes the scheduling used for the loop. Rather than statically dividing the work across the threads equally, this directive dynamically divides the work at runtime so that each thread takes about the same amount of time. Once this minor source code change is made and the application has been recompiled, the runtime performance improves even further, dropping from a high of over 33 seconds to less than 7 seconds.

% timex ./a.out
real           6.90
user          11.94
sys            0.08

Figure 2 shows the Oracle Solaris Studio integrated development environment (IDE) displaying a run of the final Mandelbrot application.

Figure 2

Figure 2. The IDE shows the final profiled result of the Mandelbrot application.

For More Information

For an exhaustive description of compiler flags and options, see the complete Oracle Solaris Studio product documentation at http://oracle.com/technetwork/server-storage/solarisstudio/documentation/oss123-docs-1357739.html.


Revision 1.0, 12/13/2011
