Advanced Compiler Options for Performance

By Oracle Solaris Studio Compiler Engineering Staff, Revised April 2012

Users wanting the best performance from CPU-intensive codes may wish to explore the use of additional libraries and advanced compiler options that control individual compiler components.

Performance libraries

There are several libraries that can help performance, including:

  1. The optimized math library, selected by the switch -xlibmopt in Fortran and C++, or by including -lmopt in C. This library may produce slightly different results, usually differing only in the last bit, in order to achieve higher performance.

  2. There are various memory allocation libraries that can be used. A guide to several choices can be found in the NOTES section of the umem_alloc(3MALLOC) man page. In addition, the library libfast.a is available. Like libbsdmalloc, libfast keeps free lists of various sizes to provide very fast allocation, but at the expense of additional memory. Also, libfast samples the allocation stream and will create a new free list if consecutive samples are sufficiently smaller than the current free list size. The libfast library is used when -lfast is added to the C, Fortran, or C++ compile flags for single-threaded applications. Threaded applications (for example, those compiled with -xautopar, -xopenmp, or -mt) require the reentrant version, -lfast_r. The libfast library is available on Oracle Solaris SPARC and x86 platforms in -m32 and -m64 formats. There is no Linux version available at this time.

  3. The Sun Performance Library, which is a set of optimized, high-speed mathematical subroutines that are used to solve linear algebra and other numerically intensive problems. The library is linked to your application by using the switch -xlic_lib=sunperf.

    Note: The optimized math library, sunperf, and several other math libraries are described in the Numerical Computation Guide.

  4. An optimized memset/memcpy library, suitable for UltraSPARC-III or later systems. To use it, add the switch -ll2amm to your compile command. You might also need to add -xarch=v8plusb to your command line. Note that compiling with -ll2amm produces a binary that cannot be used on previous generation processors.

  5. The Apache C++ Library, also referenced as stdcxx. This library is included with some versions of Oracle Solaris. If included with the system, it can be selected using the switch "-library=stdcxx4", as described in the C++ Users Guide.

    For versions of Oracle Solaris that do not include stdcxx, it is still possible to use it, by following these procedures:

    1. Build the library:

      1. Download V4.2.1 from http://stdcxx.apache.org/ and unpack it.

      2. Ensure that you have a copy of GNU Make in your PATH.

      3. Ensure that Oracle Solaris Studio C++ is first in your PATH.

      4. For 64-bit, enter the command:

            gmake BUILDTYPE=8D CONFIG=sunpro.config 

        For 32-bit, enter the command:

            gmake BUILDTYPE=8d CONFIG=sunpro.config 

        Note that the above commands build the library without support for multi-threaded applications. Therefore, if you link to this library, your calling program should not use -xautopar, -mt, -xopenmp, or explicit parallelization directives.

    2. Add stdcxx to your Makefile via commands such as these:

        ST_INC     = /data1/stdcxx-4.2.1/include
        ST_BLD_INC = /data1/stdcxx-4.2.1/build/include
        ST_LIB     = /data1/stdcxx-4.2.1/build/lib
        STDCXX     = std8D
      
        EXTRA_CXXFLAGS = -library=no%Cstd -I$(ST_INC) -I$(ST_BLD_INC)
        EXTRA_CXXLIBS  = -library=no%Cstd -L$(ST_LIB) -R$(ST_LIB) -l$(STDCXX) 

      The above example would need to be modified to fit your particular Makefile conventions and directory structures. The key point is that the flags designated here as "EXTRA_CXXFLAGS" would get added to every compile, and the flags designated as "EXTRA_CXXLIBS" would get added to every link.

    Support for stdcxx varies depending on whether you are using a copy pre-installed with Oracle Solaris, or a copy you have built yourself:

    1. If you are using a copy of stdcxx that comes pre-installed with Oracle Solaris, bug reports will be accepted on the library itself, and will be addressed. It is possible that Oracle may choose to modify the library in order to resolve a support issue.

    2. If you have built your own copy, then support requests should demonstrate that the compiler or runtime environment is incorrectly interacting with stdcxx, and the compiler team will attempt to improve such interaction, but obviously will not modify your copy of the stdcxx source code. The team can also point out incorrect or unportable constructs in stdcxx if that is the reason for the failure.

Compiler Component Options

The Oracle Solaris Studio compilers are divided into several components. Performance can sometimes be improved by passing switches directly to individual components (see, for example, the cc -W option). The components include:

  • CC, driver for C++

  • cg, the code generator

  • d, driver for C

  • f90comp, front end for Fortran

  • iropt, the global optimizer

  • ld, the link editor

  • ube, the x86/x64 code generator

  • ube_ipa, the x86/x64 interprocedural optimizer

NOTE: Although the use of these options is supported, there are several caveats that must be understood before using them.

  • Usually, these options are set automatically:

    Usually, the compiler itself picks values for these options, based on other options that are selected. Most users will achieve adequate performance without needing to set these options.

  • Subject to change:

    These options may change from release to release of the compiler. The spelling, effect, and even presence of these performance options may evolve from time to time; Makefile authors should be prepared to cope with such evolution.

  • Performance testing is required:

    Some of these options may help your code run faster; others might actually make it run more slowly. You should not use one of these options unless you believe that you have a good test case to demonstrate its effect. A good test case:

    • represents an important real-life use of the code;

    • can be compiled both with and without the compiler option in question;

    • includes a repeatable workload; and

    • has available a machine environment where changes can be reliably measured.

  • Understand what the driver is doing:

    If you choose to experiment with the options documented on this page, you will probably find it useful to examine what the compiler driver is passing to each stage of the compilation, both before and after your changes. You can do this by adding -v to your Fortran or C++ compile, or by adding -# to your C compile. For example, if you wanted to check whether the driver passes -Aheap to iropt, you could find out by typing (in a C shell) something like this:

     
    % cc -# -fast -W2,-Aheap tmp.c |& grep bin/iropt | fold -s -20 | \
    grep Aheap
    -O5 -Aheap %

    The above command pipes stderr to grep, which looks for the line that invokes iropt. Since that line is very long, we fold it into smaller chunks, and then examine the chunks for one containing "Aheap".

  • Correctness testing is recommended:

    The use of performance options may sometimes lead to unexpected results. For example, a program may have a bug which is harmless when compiled without optimization, but which causes incorrect operation when the compiler uses more advanced optimizations. In addition, although these options have been tested by Oracle internally, they have had less exposure to customer applications.

    Therefore, as in any other performance improvement exercise, it is prudent to include testing for correct output as you tune performance.

Below is a selected list of options that can be passed directly to Oracle Solaris Studio compiler components. These are passed using the -W flag (when using the C compiler) or the -Qoption flag (when using the Fortran or C++ compilers). The table below shows, for each component, how to pass an option from the Fortran, C++, and C compilers. The f90comp component is Fortran-specific and not available from the C or C++ compiler. The ld component does not need to be addressed this way from the C or C++ compiler; just use the -M flag.

Component   Fortran                       C++                           C
CC          -                             -Qoption CC suboption         -
cg          -Qoption cg suboption         -Qoption cg suboption         -Wc,suboption
d           -                             -                             -Wd,suboption
f90comp     -Qoption f90comp suboption    -                             -
iropt       -Qoption iropt suboption      -Qoption iropt suboption      -W2,suboption
ld          -Qoption ld suboption         -                             -
ube         -Qoption ube suboption        -Qoption ube suboption        -Wu,suboption
ube_ipa     -Qoption ube_ipa suboption    -Qoption ube_ipa suboption    -Wi,suboption

For example, the following compile lines pass the -Abopt option to the iropt component of the compiler when compiling a C, Fortran, and C++ program, respectively.

cc -W2,-Abopt c_example.c
f95 -Qoption iropt -Abopt fortran_example.f95
CC -Qoption iropt -Abopt cxx_example.cxx

The "Component" column below is the name of the compiler component as specified for the Fortran and C++ compilers. The "Option" column is the action requested of the specific compiler component. The "Description" column describes what the option does.

Component

Option

Description

CC, d

-iropt-prof

Use iropt in the profile phase of the compilers (iropt is the global optimizer).

cg

-Qdepgraph-early_cross_call=1

Enable cross-call instruction scheduling.
This option controls whether the "early" schedulers may move instructions across a call instruction. The early schedulers are those run before register allocation. Because of SPARC register windows, this is sometimes useful.

cg

-Qeps:do_spec_load=1

Allow generation of speculative loads during Enhanced Pipeline Scheduling (EPS). A speculative load may reduce load latency, if the speculation is correct; but if the speculation is incorrect (e.g. the other path is taken, and the load misses in the cache or the TLB), then the overhead may be hundreds of cycles for the incorrect speculation. The EPS scheduler needs to have a very good chance of speculating correctly in order for EPS speculative loads to be an overall win.

cg

-Qeps:enabled=1

Use enhanced pipeline scheduling (EPS) and selective scheduling algorithms for instruction scheduling. The EPS scheduler improves the performance of some applications, but causes others to run more slowly.

cg

-Qeps:rp_filtering_margin=n

The number of live variables allowed at any given point is n more than the number of physical registers. Setting n to a significantly large number (e.g., 100) will disable register pressure heuristics in EPS.

cg

-Qeps:ws=n

Set the EPS window size to n, that is, the number of instructions it will consider across all paths when trying to find independent instructions to schedule a parallel group. Larger values may result in better run time, at the cost of increased compile time.

cg

-Qgsched-trace_late=1

Enable the (late) trace scheduler. This is a new feature of the compiler which is being tuned from release to release. It may become the default in a future release.

cg

-Qgsched-T n

When performing trace scheduling, set the aggressiveness of the trace formation to level n, where n is 4, 5, or 6. The higher the value of n, the lower the branch probability needed to include a basic block in a trace.

cg

-Qgsched-trace_spec_load=1

When performing trace scheduling, enable the conversion of loads to non-faulting loads inside the trace.

cg

-Qicache-chbab=1

Enable optimizations to reduce branch after branch penalty. On some machines, the instruction fetcher will operate more effectively if branches are separated from each other; for example, not having one branch occupy the delay slot of another branch. Adding no-ops into the code may make the fetcher run more effectively. -Qicache-chbab is not currently on by default because it may increase code size and therefore make the icache less effective, and the algorithm for adding the nops has not been shown to benefit all applications.

cg

-Qinline_memcpy=n

Inline calls to memcpy with n bytes or fewer being copied. If there are many calls to memcpy with a small number of bytes, the call overhead may be significant.

cg

-Qipa:valueprediction

Use profile feedback data to predict values and attempt to generate faster code along these control paths, even at the expense of possibly slower code along paths leading to different values. Correct code is generated for both paths.

cg

-Qiselect-funcalign=n

Align function entry points at n-byte boundaries. Aligning functions may make the instruction fetcher more effective on some machines. In general, this option causes the binary to be larger, and it may cause the I-cache to be less well packed. Default settings are likely to differ from machine to machine.

cg

-Qiselect-sw_pf_tbl_th=n

Peels the most frequent test branches/cases off a switch until the branch probability falls below 1/n. This is effective only when profile feedback is used.

cg

-Qlp[=n][-av=n][-t=n][-fa=n][-fl=n][-ip=n][-it=n][-imb=n][-pt=weak][-ol=n]

Control prefetching for loops with control flow:
lp=n Turns the module on (1) or off (0). The default is on for f95 and off for C/C++.
lp In Fortran, equivalent to -Qlp=1, and used as a means for setting the sub-options listed below. In C/C++, equivalent to -Qlp=0. However, when used with the options -xprefetch=auto or -xprefetch_level=[2|3], it is equivalent to -Qlp=1, and used as a means for setting the sub-options listed below.
-av=n Sets the prefetch look ahead distance, in bytes. Default is 256.
-t=n Sets the number of attempts at prefetching. If not specified, t=2 if -xprefetch_level=3 has been set; otherwise, defaults to t=1.
-fa=n 1=Force user settings to override internally computed values.
-fl=n 1=Force the optimization to be turned on for all languages.
-ip=n Turns on (1) prefetching for one-level indirect memory accesses.
-it=n Indicates to the compiler to insert n extra prefetches for each indirect access in outer loops.
-imb=n Indicates to the compiler (1) to insert indirect prefetches when the indirect access chain spans across basic blocks.
-pt=weak Use weak prefetches in the general loop prefetch.
-ol=n Turns on (1) prefetching for outer loops.

cg

-Qms_pipe+alldoall

Specifies that all loops can be pipelined without needing to be concerned about loop-carried dependencies.

cg

-Qms_pipe+intdivusefp

Use fp divide for signed integer division.

cg

-Qms_pipe+prefolim= n

Set the prefetch ahead distance assuming that the number of outstanding prefetches is n. With larger n, the ahead distance gets larger.

cg

-Qms_pipe-pref

Disable prefetching within modulo scheduling (used in software pipelining).

cg

-Qms_pipe-pref_prolog

Turn off prefetching in the prolog of modulo scheduled loops.

cg

-Qms_pipe-prefst

Turn off prefetching for stores in the pipeliner.

cg

-Qms_pipe-pref_prefstrong=0

Turn off the use of strong prefetches in modulo scheduled loops.

cg

-Qms_pipe+unoovf

Assert (to the pipeliner) that unsigned int computations will not overflow.

cg

-Qpeep-Sh0

Disable the max live base registers algorithm for sethi hoisting. A sethi is a SPARC instruction for forming large constants, especially address constants. Sethi hoisting uses an algorithm that may increase register pressure. Usually, this option is likely to help performance.

cg

-Qlp-prt=1

Use prefetch with function code 1 (prefetch for one read) for memory accesses which are read only.

cg

-Qlp-prwt=3

Use prefetch with function code 3 (prefetch for one write) for memory accesses which are read and then written.

f90comp

-O[3-5]

Set the optimization level of the f95 front/middle end to the specified optimization level (Fortran only).

f90comp

-array_pad_rows, n

Enable padding of f95 arrays by n (Fortran only).

f90comp

-hoist_expensive,-hoist_trivial

Enables additional loop invariant code motion, hoisting operations out of loops.

iropt

-Abcopy

Increase the probability that the compiler will perform memcpy/memset transformations.

iropt

-Abopt

Enable aggressive optimizations of all branches, such as reversing the branch condition. This is only useful when profile feedback is used.

iropt

-Abuiltin_opt:assume_standard_func=on

Allow standard library functions to be inlined even when their respective include files are not specified.

iropt

-Adata_access

This option turns on analysis of data access patterns for scalars and array regions accessed in each loop. The information is used by various loop transformations, such as loop fusion, to determine the profitability of those transformations.

Unlike regular data dependence analysis, this analyzes detailed array sections accessed in a loop, so the analysis can be expensive in terms of compilation time.

iropt

-Addint:ignore_parallel

Ignore parallelization factors in loop interchange heuristics.

iropt

-Addint:sf=n

Set memory store operation weight for loop interchange to n. A higher value of n indicates a greater performance cost for stores. This flag gives more weight to store operations in determining whether some loop transformations such as loop interchange should be done.

iropt

-Aheap

Allow the compiler to recognize malloc-like memory allocation functions. If -xbuiltin is specified, this option is implied.

iropt

-Ainline[:cp=n][:cs=n][:inc=n][:irs=n][:mi][:rs=n][:recursion=n]

cp=n The minimum call site frequency counter in order to consider a routine for inlining.
cs=n Set inline callee size limit to n. The unit roughly corresponds to the number of instructions.
inc=n The inliner is allowed to increase the size of the program by up to n%.
irs=n Allow routines to increase by up to n. The unit roughly corresponds to the number of instructions.
mi Perform maximum inlining (without considering code size increase).
rs=n Inliner only considers routines smaller than n pseudo instructions as possible inline candidates.
recursion=n Allow a recursive call to be inlined up to n levels.

iropt

-Aivel:duplicate_loops

More aggressive strength reduction by replicating loops.

iropt

-Aivsub3

Increase the probability that loop induction variables will be replaced, so that some extraneous code can be eliminated from loops.

iropt

-Aloop_dist:ignore_parallel

Ignore parallelization factors in loop distribution heuristics.

iropt

-Amemopt:arrayloc

Reconstruct array subscripts during memory allocation merging and data layout program transformation. The transformation uses the same arrays, but modifies the ways the arrays are referenced to make them more efficient globally.

iropt

-Apf:[llist=n | noinnerllist]

Do speculative prefetching for linked-list data structures:
llist=n Perform prefetching n iterations ahead.
noinnerllist Do not attempt this for innermost loops.

iropt

-Apf:pdl=n

Allow prefetching through up to n levels of indirect memory references.

iropt

-Aparallel:nthreads=count

Instructs the compiler on the number of threads to use for automatically parallelized regions.

iropt

-Arestrict_g

Assumes global pointers are not aliased (restricted).

iropt

-Ashort_ldst[:ldld]

Convert multiple short memory operations into single long memory operations.
ldld Convert multiple short memory loads into single long load operations.

iropt

-Atile:skew=on

Perform loop tiling that is enabled by loop skewing. Loop skewing transforms a non-fully interchangeable loop nest to a fully interchangeable loop nest.

iropt

-Atile:skewp[:b n]

Perform loop tiling that is enabled by loop skewing. Loop skewing transforms a non-fully interchangeable loop nest to a fully interchangeable loop nest. The optional b n sets the tiling block size to n.

iropt

-Aujam:inner=g

Increase the probability that small-trip-count inner loops will be fully unrolled.

iropt

-Aujam:noinner

Do not unroll small-trip-count inner loops.

iropt

-Aunroll

Enable outer-loop unrolling.

iropt

-crit

Enable optimization of critical control paths. This is based on profile data to select critical paths and create super blocks so that more optimizations and better scheduling can be done on the critical paths, and result in better overall performance.

iropt

-MR

Do not inline calls when parameters are arrays and the actual and formal array dimensions are mismatched.

iropt

-Ma n

Enable inlining of routines with frame size up to n.

iropt

-Mm n

Set the maximum code increase due to inlining to n instruction triples per module. A higher value of n allows more inlining to occur.

iropt

-Mr n

Set the maximum code increase due to inlining to n instruction triples per routine. A higher value of n allows more inlining to occur.

iropt

-Ms n

Set the maximum level of recursive inlining to a depth of n. A higher value of n allows more inlining to occur.

iropt

-Mt n

Set the maximum size of a routine body eligible for inlining to n instruction triples. A higher value of n allows larger routines to be inlined.

iropt

-Rscalarrep

Disable scalar replacement optimization. Generally, scalar replacement reduces memory accesses in a loop and therefore improves the loop's performance. However, it can also increase register pressure, which can lead to register spills (stores of registers to memory), an expensive operation.

iropt

-Rloop_dist

Do not perform loop distribution transformations.

iropt

-reroll=1

Enable loop rerolling.

iropt

-Rtile

Disable loop tiling optimization in iropt.

iropt

-Rujam

Disable loop unroll and jam optimization in iropt.

iropt

-whole

Do whole program optimizations. Allows the compiler to do a better job of inter-procedural analysis.

iropt

-xprefetch_mult=iterations

Specifies how far to prefetch ahead (in loop iterations).

iropt

-xrestrict

Treat formal pointer parameters as restricted pointers (not aliased).

ld

-M,/usr/lib/ld/map.bssalign

Instructs linker to use mapfile from /usr/lib/ld/map.bssalign. This provides an appropriate alignment for large page mapping of the heap, allowing for more efficient usage of large pages. (Fortran)

ube

-fsimple=3

Allow the optimizer to use x87 hardware instructions for sine, cosine, and rsqrt. The precision and rounding effects are determined by the underlying hardware implementation, rather than by standard IEEE 754 semantics (x86).

ube

-gra_loop_based_splits=on|off

Enables|disables spilling of registers to memory before a loop if it requires more free registers than available and the spilled variables are unused inside the loop. Default is on.

ube

-nontemporal

Allows the compiler to use streaming stores, that is, stores that avoid writing to the caches and instead go directly to memory. Also allows use of a prefetch hint that data is unlikely to be re-used, and therefore caching should be avoided as much as possible.

ube

-sched_first_pass=1

Enable the instruction scheduling phase before the global register allocator.

ube

-xcallee[=yes|no]

Controls the assumption about callee-save registers (x86): yes (the default) assumes that callee-save registers are saved; no assumes that they are not saved.

ube_ipa

-inl_alt

Enables more aggressive inlining, especially with profile feedback (x86).