How to Make Your Solaris Applications Run Faster by Selecting the Best Compiler Options

Updated February 2011

By Darryl Gove

This article suggests how to get the best performance from an UltraSPARC or x86/EM64T (x64) processor running on the latest Oracle Solaris platforms by compiling with the best set of compiler options and the latest compilers. These are suggestions of things you should try, but before you release the final version of your program, you should understand exactly what you have asked the compiler to do.


The Fundamental Questions

There are two questions that you need to ask when compiling your program:

  • What do I know about the platforms that this program will run on?
  • What do I know about the assumptions that are made in the code?

The answers to these two questions determine what compiler options you should use.

The Target Platform

What platforms do you expect your code to run on? The choice of platform determines the following:

  • 32-bit or 64-bit instruction set
  • Instruction set extensions the compiler can use
  • Instruction scheduling, depending on instruction execution times
  • Cache configuration

The first three are often the ones that will have the greatest impact on the performance of the application.

32-Bit Versus 64-Bit Code

The UltraSPARC (Oracle SPARC) and x64 families of processors can run both 32-bit and 64-bit code. The critical advantage of 64-bit code is that the application can handle a larger data set than 32-bit code, which has a size limit of 4GB for the application and data. However, the cost of this larger address space is a larger memory footprint for the application; long variable types and pointers increase in size from 4 bytes to 8 bytes. The increase in footprint will cause the 64-bit application to run more slowly than the 32-bit version.

However, the x86/x64 platform has some architectural advantages when running 64-bit code compared to running 32-bit code. In particular, the application can use more registers and can use a better calling convention. On an x86 processor, these advantages will typically enable a 64-bit version of an application to run faster than a 32-bit version of the same code, unless the memory footprint of the application has significantly increased.

The UltraSPARC line of processors was architected to enable the 32-bit version of the application to already use the architectural features of the 64-bit instruction set. So there is no architectural performance gain going from 32-bit to 64-bit code. Consequently, the UltraSPARC processors will see only the additional cost of the increase in memory footprint.

Hence, best performance is likely to be attained if SPARC binaries are compiled as 32-bit and x86 binaries are compiled as 64-bit. The compiler flags that determine whether a 32-bit or 64-bit binary is generated are the flags -m32 and -m64.

For additional details about migrating from 32-bit to 64-bit code, refer to "Converting 32-bit Applications Into 64-bit Applications: Things to Consider" and "64-bit x86 Migration, Debugging, and Tuning, With the Sun Studio 10 Toolset".

Specifying an Appropriate Target Processor

The default for the compiler is to produce a "generic" binary, that is, a binary that will work well on all platforms. In many situations, this will be the best choice. However, there are some situations where it is appropriate to select a different target.

  • To override a previous target setting. The compiler evaluates options from left to right. If the flag -fast has been specified on the compile line, then it may be appropriate to override the implicit setting of -xtarget=native with a different choice.

  • To take advantage of features of a particular processor. For example, newer processors tend to have more features. The compiler can use these features at the expense of producing a binary that does not run on the older processors that do not have these features.

The -xtarget flag actually sets three flags:

  • The -xarch flag specifies the architecture of the machine. This is basically the instruction set that the compiler can use. If the processor that runs the application does not support the appropriate architecture, then the application may not run.

  • The -xchip flag tells the compiler which processor to assume is running the code. This tells the compiler which patterns of instructions to favor when it has a choice between multiple ways of coding the same operation. It also tells the compiler the instruction latency to use for scheduling instructions to minimize stalls.

  • The -xcache flag tells the compiler the cache hierarchy to assume. This can have a significant impact on floating point codes where the compiler is able to make a choice about how to arrange loops so the data being manipulated fits into the caches.

The impact of these three performance settings will depend on the characteristics of the application. Code that spends time in floating-point computation tends to show the most sensitivity to the settings used for the target.

Target Architectures for SPARC Processors

The default -xtarget=generic option should be appropriate for most situations. The compiler will generate a 32-bit binary that uses the SPARC V8 instruction set or a 64-bit binary that uses the SPARC V9 instruction set. The most common situation where a different setting might be required would be with code doing a significant number of floating-point computations. Here, use of the hardware floating-point multiply-accumulate (FMA or FMAC) instructions would be effective.

The SPARC64 line of processors supports FMA instructions. These instructions combine a floating-point multiply and a floating-point addition (or subtraction) into a single operation. An FMA instruction typically takes the same number of cycles to complete as either a floating-point addition or a floating-point multiplication, so the performance gain from using these instructions can be significant. However, the results from an application compiled to use FMA instructions may differ from the results of the same application compiled without them.

An FMAC instruction performs the following operation, called a “fused multiply-accumulate”:

Result = ROUND((value1 * value2) + value3)


Here ROUND indicates that the value is rounded to the nearest representable floating-point number when it is stored into the result. This single FMAC instruction replaces the following two instructions:

tmp    = ROUND(value1 * value2)
Result = ROUND(tmp + value3)

Notice that the two-instruction version has two round operations, and this difference can result in a difference in the least significant bits of the calculated result.

To generate FMA instructions, the binary needs to be compiled with two flags: one to specify an architecture that supports the FMA instructions and another to tell the compiler that it is acceptable to use these instructions:

-xarch=sparcfmaf -fma=fused


Alternatively, the flags -xtarget=sparc64vi -fma=fused will enable the generation of the FMA instruction and will also tell the compiler to assume the characteristics of the SPARC64 VI processor when compiling the code. This will produce optimal code for the SPARC64 VI platform. Code compiled to contain FMA instructions will not run on platforms that do not support the instructions.

Specifying the Target Processor for the x86/x64 Processor Family

By default, the compiler targets a 32-bit generic x86-based processor, so the code will run on any x86 processor from a Pentium Pro up to an AMD Opteron architecture. While this produces code that can run over the widest range of processors, it does not take advantage of the extensions offered by the latest processors. Most currently available x86 processors have the SSE2 instruction set extensions. To take advantage of these instructions, the flag -xarch=sse2 should be used. However, the compiler may not recognize all opportunities to use these instructions unless the vectorization flag -xvector=simd is also used.

So, for x86/x64 processors, compile with at least the following:

-xarch=sse2 -xvector=simd

Summary of Target Settings for Various Address Spaces and Architectures

The following tables summarize the options to use for various processors and architectures.
 

Address Space    SPARC                    SPARC64
32-bit           -xtarget=generic -m32    -xtarget=sparc64vi -m32 -fma=fused
64-bit           -xtarget=generic -m64    -xtarget=sparc64vi -m64 -fma=fused


 

Address Space    x86                      x64/sse2
32-bit           -xtarget=generic -m32    -xtarget=generic -xarch=sse2 -m32 -xvector=simd
64-bit                                    -xtarget=generic -xarch=sse2 -m64 -xvector=simd


Optimizing and Debugging

Compiling with an optimization flag alters three important characteristics: the run time of the compiled application, the length of time the compilation takes, and the amount of debugging that is possible with the final binary. In general, the higher the level of optimization, the faster the application runs (and the longer it takes to compile) and the less debugging information is available. However, the particular impact of optimization levels will vary from application to application.

The easiest way of thinking about this is to consider three degrees of optimization, as outlined in the following table.

Purpose              Flags                         Comments
Full debug           [no optimization flags] -g    The application will have full debug capabilities, but almost no compiler optimizations will be performed, leading to lower performance.
Optimized            -g -O [-g0 for C++]           The application will have good debug capabilities, and a reasonable set of optimizations will be performed, typically leading to significantly better performance.
High optimization    -g -fast                      The application will have good debug capabilities and a full set of compiler optimizations, typically leading to higher performance.

 

Note: For C++ at optimization levels of -O and lower, the debug flag -g will inhibit some of the inlining of methods. This can have a significant performance impact on the binary. The flag -g0 will provide debug information without inhibiting the inlining of these methods. Consequently, it can be useful to use the flag -g0 with -O if it is important to have the same level of performance as the non-debug version. The behavior of -g for C++ was changed to this in Oracle Solaris Studio 12 Update 1; prior releases of the C++ compiler always disabled front-end inlining when the flag -g was used.

Suggestion: In general, an optimization level of at least -O is suggested. However, the two situations where lower levels might be considered are (1) where more detailed debug information is required and (2) where the semantics of the program require that variables are treated as volatile, in which case the optimization level should be lowered to -xO2.

More on Debugging

The compiler will generate information for the debugger if the -g flag is present. For lower levels of optimization, the -g flag disables some minor optimizations to make the generated code easier to debug. At higher levels of optimization, the presence of the flag does not alter the code generated (or its performance), but be aware that at high levels of optimization, it is not always possible for the debugger to relate the disassembled code to the exact line of source or to determine the value of local variables held in registers rather than stored to memory.

As discussed earlier, at low levels of optimization, the C++ compiler will disable some of the inlining performed by the compiler when the -g compiler flag is used. However, the flag -g0 will tell the compiler to do all the inlining that it would normally do as well as generate the debug information.

A very strong reason for compiling with the -g flag is that the generated debug information lets the Oracle Solaris Studio Performance Analyzer attribute time spent in the code directly to lines of source code, making the process of finding performance bottlenecks considerably easier. Also, if the application produces a core file, the debugger will usually be able to report the line of code that produced the core file.

Suggestion: Always compile with -g or -g0. It rarely makes any difference to performance, and your program will be easier to debug and analyze.

Using -fast for Performance

The flag -fast is a good starting point when optimizing code. However, it might not give you the right set of optimizations for the finished program. The -fast flag is a macro that enables a full set of optimizations that often lead to near-optimal performance for many applications. However, some of these optimizations might not be appropriate for your particular application.

  • The -fast flag assumes that the platform doing the compiling is representative of the type of machine that will run the resulting binary (-xtarget=native). The compiler will use the instruction set extensions that are supported by the compiling platform. The application might not run if these instructions are not also available on the platform where the application is deployed. Overriding the implied -xtarget=native with an -xtarget flag to specify a more generic target might be required.

  • On x86 platforms, -xregs=frameptr allows the compiler to use the frame pointer as an unallocated callee-saves register, which can result in increased run-time performance. This option is included in -fast for C. Using this flag might mean that some tools are unable to correctly generate call stack information.

  • For the C compiler, the -fast flag includes -xalias_level=basic, which declares that the application does not contain pointer aliasing between different data types. Code that does not comply with the language standards might not run correctly when compiled with this flag. Pointer aliasing is discussed later in this article in the Advanced Compiler Options: C/C++ Pointer Aliasing section.

  • The -fast flag also enables certain floating-point optimizations, which are discussed in the next section, The Implications for Floating-Point Arithmetic When Using the -fast Option.

The -fast flag is a good starting point for getting the best performance out of your application. It is recommended that you review the optimizations it enables before deciding on a final set of compiler flags for the production build of your application. The flags -#, -xdryrun, or -V will cause the compiler to print out the options that -fast includes, and the list can be used to select the appropriate ones for your application.

The expansion of the -fast flag tends to change with each Solaris Studio release. Refer to the compiler man pages for instructions on how to determine the -fast flag expansions by the Solaris Studio C, C++, and Fortran compilers, cc, CC, and f95, respectively.

The Implications for Floating-Point Arithmetic When Using the -fast Option

One issue to be aware of is the inclusion of certain floating-point arithmetic simplifications implied with -fast. These are the options -fns and -fsimple=2, which allow the compiler to do some optimizations that do not comply with the IEEE-754 floating-point arithmetic standard, and also allow the compiler to relax language standards regarding floating-point expression reordering.

With -fns, subnormal numbers (that is, very small numbers that are too small to be represented in normal form) are flushed to zero. Calculations on subnormal numbers are often done in software, and they are very slow, so code that has a significant number of calculations on subnormal numbers will also run slowly. Subnormal numbers are stored with fewer significant figures of accuracy, so code that sees many of them will not only run slower, but may also perform inaccurate calculations. Hence, the presence of subnormals not only causes performance problems but also necessitates an investigation of the calculations.

With -fsimple=2, the compiler can treat floating-point arithmetic as you would expect to find in a mathematics textbook. For example, the order in which additions are performed doesn't matter, and it is considered safe to replace a divide operation with multiplication by the reciprocal. These transformations seem perfectly acceptable when performed on paper, and they can provide some performance gains, but they can result in a loss of precision when algebra becomes real numerical computation with numbers of limited precision.

Also, -fsimple=2 allows the compiler to make optimizations that assume the data used in floating-point calculations will not be NaN (Not a Number). Compiling with -fsimple=2 is not recommended if you expect computation with NaN data or if your application is sensitive to the exact order in which floating-point computations are performed.

Suggestions:

  • The use of the flags -fns and -fsimple can result in significant performance gains. However, they may also result in a loss of precision. Before committing to using them in production code, it is best to evaluate the performance gain you get from using the flags and whether there is any difference in the results of the application.

  • Avoid using -fsimple=2 with applications that perform calculations on NaN data or are known to be sensitive to the order of floating-point computation.

For more information on floating-point computation, see the Sun Studio Numerical Computation Guide.

Crossfile Optimization

The -xipo option performs interprocedural optimizations on the whole program at link time. This means that the object files are examined again at link time to see if there are any further optimization opportunities. The most common opportunity is to inline some code from one file into code from another file. The term inlining means that the compiler replaces a call to a routine with the actual code from that routine.

Inlining is good for two reasons, the most obvious being that it eliminates the overhead of calling another routine. A second, less-obvious reason is that inlining may expose additional optimizations that can be performed on the object code. For example, imagine that a routine calculates the color of a particular point in an image by taking the x and y position of the point and calculating the location of the point in the block of memory containing the image: (image_offset = y * row_length + x). By inlining the code in the routine that works over all the pixels in the image, the compiler is able to generate code to just add one to the current offset to get to the next point instead of having to do a multiplication and an addition to calculate each address of each point. So inlining results in a performance gain.

The downside of using -xipo is that it can significantly increase the compile time of the application and might also increase the size of the executable.

Suggestion: Try compiling with -xipo to see if the increase in compile time is worth the gain in performance.

Profile Feedback

When compiling a program, the compiler takes a best guess at how the flow of the program might go, for example, which branches are taken and which branches are not taken. For code that is floating-point intensive, this generally gives good performance. But programs with many branching operations might not obtain the best performance.

Profile feedback assists the compiler in optimizing a program by giving the compiler real information about the paths actually taken by the program. Knowing the critical routes through the code allows the compiler to make sure these are the optimized routes.

Profile feedback requires that you first compile a version of your application with -xprofile=collect and then run that version with representative input data to collect a run-time performance profile. You then recompile with -xprofile=use so that the compiler can use the collected profile data. The downside of doing this is that the compile cycle can be significantly longer (you are doing two compiles and a run of your application), but the compiler can produce better-optimized execution paths, which means a faster run time.

A representative data set should be one that will exercise the code in ways similar to the actual data that the application will see in production. The program can be run multiple times with different workloads to build up the representative data set. Of course, if the representative data manages to exercise the code in ways that are not representative of real workloads, then performance might not be optimal. However, it is often the case that the code is always executed through similar routes, and so regardless of whether the data is representative or not, the performance will improve.

For more information on determining whether a workload is representative, read my article Selecting Representative Training Workloads for Profile Feedback Through Coverage and Branch Analysis.

Suggestions:

  • Try compiling with profile feedback and see whether the performance gain is worth the additional compile time.
  • Try compiling with profile feedback and -xipo, because the profile information will also help the compiler make better choices about inlining.

Using Large Pages for Data

If a program manipulates large data sets, using large pages to hold the data might improve performance. A page is a region of contiguous physical memory. The processor works with virtual memory, which allows the processor the freedom to move the data around in physical memory or even store it to and load it from disk. However, working with virtual memory means the processor has to look up virtual addresses in a table to find the actual physical location of that data page in real memory. This takes a small amount of time, but if it happens often, the time spent in table lookups can become significant.

The default size of these pages is 8 KB for SPARC and 4 KB for x86. However, the processor can use a range of page sizes. The advantage of using a large page size is that the processor will perform fewer lookups, but the disadvantage is that the processor may not be able to find a sufficiently large chunk of contiguous memory in which to allocate the large page (in which case, a set of smaller sized pages will be allocated instead).

The compiler option that controls page size is -xpagesize=size. The options for the size depend on the platform. On UltraSPARC processors, allowable sizes are 4K, 8K, 64K, 512K, 2M, 4M, 32M, 256M, 2G, or 16G. For example, changing the page size from 8K (the default) to 64K will reduce the number of lookups by a factor of eight. On the x86 platform, the default page size is 4K, and the actual sizes that are available, often 4K, 2M, 4M, and 1G, depend on the processor.

It is possible to detect performance issues from page sizes using trapstat, if it is available and if the processor traps into Oracle Solaris to handle Translation Lookaside Buffer (TLB) misses. Alternatively, cpustat can be used when the processor provides hardware performance counters that count TLB miss events.

The command that reports the page sizes available on a particular system is pagesize -a.

If the application incurs significant numbers of TLB miss events during its run, then it is likely that recompiling with a setting for -xpagesize will improve performance.

Advanced Compiler Options: C/C++ Pointer Aliasing

Two pointers "alias" if they point to the same location in memory. For the compiler, aliasing means that stores to the memory addressed by one pointer may change the memory addressed by the other pointer. This means that the compiler has to be very careful never to reorder stores and loads in expressions containing pointers, and it might also have to reload the values of memory accessed through pointers after new data is stored into memory.

There are two flags you can use to make assertions about the use of pointers in your program. These flags tell the compiler something it can assume about the use of pointers in your source code. The compiler does not check to see if the assertion is ever violated, so if your code violates the assertion, then your program might not behave in the way you intended. Note that lint can help you do some validity checking of the code at a particular -xalias_level. (See Chapter 4, lint Source Code Checker, in Oracle Solaris Studio 12.2: C User’s Guide.)

The following are the two assertions:

  • -xrestrict asserts that all pointers passed into functions are restricted pointers. This means that if a function gets two pointers passed into it, under -xrestrict, the compiler can assume that those two pointers never point at overlapping memory.

  • -xalias_level indicates what assumptions can be made about the degree of aliasing between two different pointers. -xalias_level can be considered to be a statement about coding style. You are telling the compiler how you treat pointers in the coding style you use. For example, you can tell the compiler that an int* will never point to the same memory location as a float*.

The following table summarizes the options for -xalias_level for C (cc).

cc -xalias_level=    Comment
any                  Any pointers can alias (default).
basic                Basic types do not alias each other (for example, int* and float*).
weak                 Structure pointers alias by offset. Structure members of the same type at the same offset (in bytes) from the structure pointer may alias.
layout               Structure pointers alias by common fields. If the first few fields of two structure pointers have identical types, then they may potentially alias.
strict               Pointers to structures that contain different variable types do not alias.
std                  Pointers to differently named structures do not alias. (So even if all the elements in the structures have the same types, if they have different names, then the structures do not alias.) This is the level of aliasing allowed by the language standard.
strong               There are no pointers to the interiors of structures and char* is considered a basic type. (At lower levels, char* is considered as potentially aliasing with any other pointers.)


 

The following table summarizes the options for -xalias_level for C++ (CC).

CC -xalias_level=    Comment
any                  Any pointers can alias (default).
simple               Basic types do not alias (same as basic for C).
compatible           Corresponds to layout for C.


 

Notes:

  • Specifying -xrestrict and -xalias_level correctly can lead to significant performance gains. But if your code does not conform to the requirements of the flags, then the results of running the application might be unpredictable.

  • For C, -xalias_level=std means that pointers behave in the same way as the 1999 ISO C standard suggests. It is specified for standard-conforming code.

  • The flag -fast for C includes -xalias_level=basic. If the code contains aliasing of different basic types, then -fast needs to be followed by the flag -xalias_level=any to tell the compiler that any pointers may potentially alias.

A Set of Flags to Try

The final thing to do is to pull all these points together to make a suggestion for a good set of flags. Remember, this set of flags might not actually be appropriate for your application, but it is hoped that they will give you a good starting point.

Note: The use of the flags within square brackets ([ ]) depends on special circumstances.

Flags                                                                               Comment
-g                                                                                  Generates debugging information (may use -g0 for C++).
-fast                                                                               Aggressive optimization.
-xtarget=generic [-xtarget=sparc64vi -fma=fused] [-xarch=sse2 -xvector=simd]        Specifies target platform.
-xipo                                                                               Enables interprocedural optimization.
-xprofile=[collect|use]                                                             Compiles with profile feedback.
[-fsimple=0 -fns=no]                                                                No floating-point arithmetic optimizations. Use if IEEE-754 compliance is important.
[-xalias_level=val]                                                                 Sets the level of pointer aliasing (for C and C++). Use only if you know the option to be safe for your program.
[-xrestrict]                                                                        Uses restricted pointers (for C). Use only if you know the option to be safe for your program.


 

Final Remarks

There are many other options that the compilers recognize. The ones presented here probably give the most noticeable performance gains for most programs and are relatively easy to use. When selecting the compiler options for your program, remember the following:

  • It is important to be aware of what you are telling the compiler to do. A program might have unpredictable results if it does not conform to the requirements of the flags.
  • Optimization is a tradeoff between increased compile time and improved run-time performance.
  • Only use the flags that give you a performance benefit and make acceptable assertions about the code.

For details on all these options, see the Oracle Solaris Studio compiler user guides and man pages.

About the Author

Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc. (now Oracle) analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK. He is the author of the books Solaris Application Programming and The Developer's Edge. He also maintains a blog focused on developer issues.