Advanced Compiler Options for Performance

By Oracle Solaris Studio Compiler Engineering Staff, Revised March 2017

Users wanting the best performance from CPU-intensive codes may wish to explore the use of additional libraries and advanced compiler options that control individual compiler components.

Performance libraries

There are several libraries that can help performance, including:

The optimized math library, selected by the switch -xlibmopt in Fortran and C++, or by including -lmopt in C. This library may produce slightly different results, usually differing only in the last bit, in order to achieve higher performance.
There are various memory allocation libraries that can be used. A guide to several choices can be found in the NOTES section of the umem_alloc(3MALLOC) man page. In addition, the library libfast.a is available. Like libbsdmalloc, libfast keeps free lists of various sizes to provide very fast allocation, at the expense of additional memory. libfast also samples the allocation stream and creates a new free list if consecutive samples are sufficiently smaller than the current free list size. To use libfast in a single-threaded application, add -lfast to the C, Fortran, or C++ compile flags. Threaded applications (for example, those compiled with -xautopar, -xopenmp, or -mt) require the reentrant version, -lfast_r. The libfast library is available on Oracle Solaris SPARC and x86 platforms in -m32 and -m64 formats. There is no Linux version available at this time.
The Sun Performance Library, which is a set of optimized, high-speed mathematical subroutines that are used to solve linear algebra and other numerically intensive problems. The library is linked to your application by using the switch -xlic_lib=sunperf.
Note: The optimized math library, sunperf, and several other math libraries are described in the Numerical Computation Guide.
An optimized memset/memcpy library, suitable for UltraSPARC-III or later systems. To use it, add the switch -ll2amm to your compile command. You might also need to add -xarch=v8plusb to your command line. Note that compiling with -ll2amm produces a binary that cannot be used on previous generation processors.
The Apache C++ Library, also referenced as stdcxx. This library is available on Oracle Solaris 10 8/11 (Update 10) and later, and on Oracle Solaris 11. The library may not be installed by default on a supported system. If it is missing, install it from the appropriate Oracle Solaris repository.
If the library is installed, select the library by using the option -library=stdcxx4 on every CC command, compiling and linking, as described in the C++ Users Guide, chapter 12.

If you cannot upgrade to a version of Oracle Solaris that supports Apache stdcxx, you can build your own copy of the library from source.

(Note: Do not use the version found at apache.org. It will not work on Oracle Solaris.) You do not need to build the entire "userland" gate; you can build just libstdcxx and install it and the associated header files.

Support for stdcxx varies depending on whether you are using a copy pre-installed with Oracle Solaris, or a copy you have built yourself:
1. If you are using a copy of stdcxx that comes pre-installed with Oracle Solaris, bug reports will be accepted on the library itself, and will be addressed. It is possible that Oracle may choose to modify the library in order to resolve a support issue.
2. If you have built your own copy, then support requests should demonstrate that the compiler or runtime environment is incorrectly interacting with stdcxx, and the compiler team will attempt to improve such interaction, but obviously will not modify your copy of the stdcxx source code. The team can also point out incorrect or unportable constructs in stdcxx if that is the reason for the failure.
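
The trade-off described above for the memory allocation libraries (free lists of various sizes giving fast allocation at the cost of extra memory) can be illustrated with a minimal sketch. This is not the libfast or libbsdmalloc implementation; the size classes and the names `fl_alloc`, `fl_free`, and `fl_reuses_freed_block` are hypothetical and chosen only for illustration. The sketch is single-threaded and never returns small blocks to the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical size classes: 16, 32, 64, 128, and 256 bytes. */
#define NUM_CLASSES    5
#define MIN_CLASS_SIZE 16

/* A freed block is reused to hold the link to the next free block. */
typedef struct free_block {
    struct free_block *next;
} free_block;

static free_block *free_lists[NUM_CLASSES];

/* Map a request size to the smallest class that fits, or -1 if too large. */
static int size_class(size_t n)
{
    size_t class_size = MIN_CLASS_SIZE;
    for (int c = 0; c < NUM_CLASSES; c++, class_size <<= 1) {
        if (n <= class_size)
            return c;
    }
    return -1;
}

/* Allocate: pop a block from the matching free list when one is available. */
void *fl_alloc(size_t n)
{
    int c = size_class(n);
    if (c < 0)
        return malloc(n);           /* large requests use the system allocator */
    if (free_lists[c] != NULL) {
        free_block *b = free_lists[c];
        free_lists[c] = b->next;
        return b;
    }
    /* Empty list: request a full class-sized block, which may waste space. */
    return malloc((size_t)MIN_CLASS_SIZE << c);
}

/* Free: push the block onto its class list instead of releasing it. */
void fl_free(void *p, size_t n)
{
    int c = size_class(n);
    if (p == NULL)
        return;
    if (c < 0) {
        free(p);
        return;
    }
    free_block *b = p;
    b->next = free_lists[c];
    free_lists[c] = b;
}

/* Usage check: a freed block is handed back to the next same-class request. */
int fl_reuses_freed_block(void)
{
    void *a = fl_alloc(24);
    assert(a != NULL);
    fl_free(a, 24);
    void *b = fl_alloc(30);         /* 24 and 30 both map to the 32-byte class */
    int reused = (b == a);
    fl_free(b, 30);
    return reused;
}
```

Allocation and release are a few pointer operations each, which is where the speed comes from; the cost is that every request is rounded up to its class size and freed blocks stay reserved for that class.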

Compiler Component Options

The Oracle Solaris Studio compilers are divided into several different components. Sometimes, it can be helpful to performance if switches (see the cc -W option for example) are sent directly to individual components, including:

CC, driver for C++
cg, the code generator
d, driver for C
f90comp, front end for Fortran
iropt, the global optimizer
ld, the link editor
ube, the x86/x64 code generator
ube_ipa, the x86/x64 interprocedural optimizer

NOTE: Although the use of these options is supported, there are several caveats to understand before using them.

Usually, these options are set automatically:

Usually, the compiler itself picks values for these options, based on other options that are selected. Most users will achieve adequate performance without needing to set these options.
Subject to change:

These options may change from release to release of the compiler. The spelling, effect, and even presence of these performance options may evolve from time to time; Makefile authors should be prepared to cope with such evolution.
Performance testing is required:

Some of these options may help your code run faster; others might actually make it run more slowly. You should not use one of these options unless you believe that you have a good test case to demonstrate its effect. A good test case:
- represents an important real-life use of the code;
- can be compiled both with and without the compiler option in question;
- includes a repeatable workload; and
- has available a machine environment where changes can be reliably measured.
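
One way to get a repeatable measurement is to time the same workload several times and keep the fastest run, then build the harness with and without the option under test. The sketch below assumes a POSIX system with `clock_gettime`; `workload` is a hypothetical stand-in for the real code, and `best_time` is an illustrative name, not part of any Oracle tool.

```c
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical workload: sum of squares of 0..n-1. Replace with real code. */
static double workload(size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)i * (double)i;
    return sum;
}

/* Wall-clock seconds read from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run the workload `repeats` times and return the fastest elapsed time.
   The computed value is handed back through `result` so that the caller
   can also check it for correctness. */
double best_time(size_t n, int repeats, double *result)
{
    assert(repeats > 0 && result != NULL);
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        double t0 = now_seconds();
        *result = workload(n);
        double elapsed = now_seconds() - t0;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Taking the minimum over several runs reduces the influence of other activity on the machine. Comparing `*result` between the two builds gives a basic correctness check at the same time.
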
Understand what the driver is doing:

If you choose to experiment with the options documented on this page, you will probably find it useful to examine what the compiler driver is passing to each stage of the compilation, both before and after your changes. You can do this by adding -v to your Fortran or C++ compile, or by adding -# to your C compile. For example, if you wanted to check whether the driver passes -Aheap to iropt, you could find out by typing (in a C shell) something like this:
```
% cc -# -fast -W2,-Aheap tmp.c |& grep bin/iropt | fold -s -20 | \
grep Aheap
-O5 -Aheap
%
```
The above command pipes both stdout and stderr to grep, which looks for the line that invokes iropt. Since that line is very long, we fold it into smaller chunks, and then examine the chunks for one containing "Aheap".
Correctness testing is recommended:

The use of performance options may sometimes lead to unexpected results. For example, a program may have a bug which is harmless when compiled without optimization, but which causes incorrect operation when the compiler uses more advanced optimizations. In addition, although these options have been tested by Oracle internally, they have had less exposure to customer applications.

Therefore, as in any other performance improvement exercise, it is prudent to include testing for correct output as you tune performance.
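
When checking numerical output, an exact comparison can be too strict, because options and libraries that trade accuracy for speed may change the last bit of a result. A minimal sketch of a tolerance check measured in units in the last place (ULPs) follows. It assumes IEEE 754 binary64 `double` values; `within_ulps` and `ordered_bits` are hypothetical helper names, and the acceptable tolerance is something each application has to decide for itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a double's bit pattern to an integer that increases with the value.
   Assumes IEEE 754 binary64 (sign-magnitude) representation. */
static int64_t ordered_bits(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);
    return (i < 0) ? INT64_MIN - i : i;
}

/* Return 1 if a and b differ by at most max_ulps representable values,
   otherwise 0. A NaN never matches anything. */
int within_ulps(double a, double b, int64_t max_ulps)
{
    assert(max_ulps >= 0);
    if (a != a || b != b)           /* true only for NaN */
        return 0;
    int64_t ia = ordered_bits(a);
    int64_t ib = ordered_bits(b);
    /* Subtract as unsigned values so a large distance cannot overflow. */
    uint64_t diff = (ia > ib) ? (uint64_t)ia - (uint64_t)ib
                              : (uint64_t)ib - (uint64_t)ia;
    return diff <= (uint64_t)max_ulps;
}
```

For example, `within_ulps(reference, tuned, 1)` accepts a result that differs from the reference only in the last bit, while a larger difference is reported as a mismatch.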

Below is a selected list of options that can be passed directly to Oracle Solaris Studio compiler components. These are passed using the -W flag (when using the C compiler) and the -Qoption flag (when using the Fortran or C++ compilers). The table below shows the relationship between the C and Fortran/C++ compilers and how to invoke the options. The f90comp component is Fortran specific and not available from the C or C++ compiler. The ld component is not needed from the C or C++ compiler; just use the -M flag.

Component	Fortran	C++	C
CC	`-`	`-Qoption CC suboption`	`-`
cg	`-Qoption cg suboption`	`-Qoption cg suboption`	`-Wc,suboption`
d	`-`	`-`	`-Wd,suboption`
f90comp	`-Qoption f90comp suboption`	`-`	`-`
iropt	`-Qoption iropt suboption`	`-Qoption iropt suboption`	`-W2,suboption`
ld	`-Qoption ld suboption`	`-`	`-`
ube	`-Qoption ube suboption`	`-Qoption ube suboption`	`-Wu,suboption`
ube_ipa	`-Qoption ube_ipa suboption`	`-Qoption ube_ipa suboption`	`-Wi,suboption`

As an example, the following shows how the compile line would look to invoke the -Abopt option to the iropt component of the compiler when compiling a program for C, Fortran and C++.

```
cc -W2,-Abopt c_example.c
f95 -Qoption iropt -Abopt fortran_example.f95
CC -Qoption iropt -Abopt cxx_example.cxx
```

The "Component" column below is the name of the compiler component as specified for the Fortran and C++ compilers. The "Option" column is the action requested of the specific compiler component. The "Description" column describes what the option does.

Component	Option	Description
CC, d	`-iropt-prof`	Use iropt in the profile phase of the compilers (iropt is the global optimizer).
cg	`-Qdepgraph-early_cross_call=1`	Enable cross-call instruction scheduling. This option controls whether the "early" schedulers may move instructions across a call instruction. The early schedulers are those run before register allocation. Because of SPARC register windows, this is sometimes useful.
cg	`-Qeps:do_spec_load=1`	Allow generation of speculative loads during Enhanced Pipeline Scheduling (EPS). A speculative load may reduce load latency, if the speculation is correct; but if the speculation is incorrect (e.g. the other path is taken, and the load misses in the cache or the TLB), then the overhead may be hundreds of cycles for the incorrect speculation. The EPS scheduler needs to have a very good chance of speculating correctly in order for EPS speculative loads to be an overall win.
cg	`-Qeps:enabled=1`	Use enhanced pipeline scheduling (EPS) and selective scheduling algorithms for instruction scheduling. The EPS scheduler will cause some applications to improve performance, but some will run more slowly.
cg	`-Qeps:rp_filtering_margin=n`	The number of live variables allowed at any given point is n more than the number of physical registers. Setting n to a significantly large number (e.g., 100) will disable register pressure heuristics in EPS.
cg	`-Qeps:ws=n`	Set the EPS window size to n, that is, the number of instructions it will consider across all paths when trying to find independent instructions to schedule a parallel group. Larger values may result in better run time, at the cost of increased compile time.
cg	`-Qgsched-trace_late=1`	Enable the (late) trace scheduler. This is a new feature of the compiler which is being tuned from release to release. It may become the default in a future release.
cg	`-Qgsched-T` `n`	When performing trace scheduling, set the aggressiveness of the trace formation to level n, where n is 4, 5, or 6. The higher the value of n, the lower the branch probability needed to include a basic block in a trace.
cg	`-Qgsched-trace_spec_load=1`	When performing trace scheduling, enable the conversion of loads to non-faulting loads inside the trace.
cg	`-Qicache-chbab=1`	Enable optimizations to reduce branch after branch penalty. On some machines, the instruction fetcher will operate more effectively if branches are separated from each other; for example, not having one branch occupy the delay slot of another branch. Adding no-ops into the code may make the fetcher run more effectively. -Qicache-chbab is not currently on by default because it may increase code size and therefore make the icache less effective, and the algorithm for adding the nops has not been shown to benefit all applications.
cg	`-Qinline_memcpy=` `n`	Inline calls to memcpy with n bytes or fewer being copied. If there are many calls to memcpy with a small number of bytes, the call overhead may be significant.
cg	`-Qipa:valueprediction`	Use profile feedback data to predict values and attempt to generate faster code along these control paths, even at the expense of possibly slower code along paths leading to different values. Correct code is generated for both paths.
cg	`-Qiselect-funcalign=` `n`	Align function entry points at n-byte boundaries. Aligning functions may make the instruction fetcher more effective on some machines. In general, this option causes the binary to be larger, and it may cause the I-cache to be less well packed. Default settings are likely to differ from machine to machine.
cg	`-Wc,-Qiselect-rcpa=` `2`	Single- and double-precision floating-point division operations are approximated based on the SPARC64 X reciprocal approximation instructions (frcpa[sd]). This option has no effect unless -xarch=sparcace or -xarch=sparcaceplus, and -fsimple=2 are both in effect. In this situation, the use of -fns=yes is strongly advised. These approximated floating-point division operations do not conform to IEEE-754. Furthermore, spurious floating-point exceptions can be raised in certain corner cases. In particular, the invalid operation exception is raised when the divisor is subnormal or an infinity, or when the dividend is an infinity and the divisor is near the overflow threshold (i.e. with magnitude greater than 2^126 or 2^1022 in single- or double-precision respectively).
cg	`-Wc,-Qiselect-rsqrta=` `2`	Single- and double-precision floating-point square root operations are approximated based on the SPARC64 X approximation instructions (frsqrta[sd]). This option has no effect unless -xarch=sparcace or -xarch=sparcaceplus, and -fsimple=2 are both in effect. In this situation, the use of -fns=yes is strongly advised. These approximated floating-point square root operations do not conform to IEEE-754.
cg	`-Wc,-Qiselect-rsqrta1x=` `2`	Reciprocals of single- and double-precision floating-point square root operations are approximated based on the SPARC64 X approximation instructions (frsqrta[sd]). This option has no effect unless -xarch=sparcace or -xarch=sparcaceplus, and -fsimple=2 are both in effect. In this situation, the use of -fns=yes is strongly advised. Furthermore, the division-by-zero (DZ) exception is never raised when the input is a positive subnormal or a zero, and a positive zero is returned instead of an infinity with the appropriate sign.
cg	`-Qiselect-sw_pf_tbl_th=` `n`	Peels the most frequent test branches/cases off a switch until the branch probability falls below 1/n. This is effective only when profile feedback is used.
cg	`-Qlp[=` `n``][-av=` `n``]` `[-t=` `n``][-fa=` `n``]` `[-fl=` `n``][-ip=` `n``]` `[-it=` `n``][-imb=` `n``]` `[-pt=weak][-ol=` `n``]`	Control prefetching for loops with control flow. `lp=`n: turns the module on (1) or off (0); the default is on for f95 and off for C/C++. `lp`: in Fortran, equivalent to `-Qlp=1` and used as a means for setting the sub-options listed below; in C/C++, equivalent to `-Qlp=0`, except that when used with the options `-xprefetch=auto` or `-xprefetch_level=`[2\|3] it is equivalent to `-Qlp=1` and used as a means for setting the sub-options listed below. `-av=`n: sets the prefetch look-ahead distance, in bytes; the default is 256. `-t=`n: sets the number of attempts at prefetching; if not specified, `t=2` if `-xprefetch_level=3` has been set, otherwise `t=1`. `-fa=`n: 1 forces user settings to override internally computed values. `-fl=`n: 1 forces the optimization to be turned on for all languages. `-ip=`n: 1 turns on prefetching for one-level indirect memory accesses. `-it=`n: inserts n extra prefetches for each indirect access in outer loops. `-imb=`n: 1 inserts indirect prefetches when the indirect access chain spans basic blocks. `-pt=weak`: uses weak prefetches in the general loop prefetch. `-ol=`n: 1 turns on prefetching for outer loops.
cg	`-Qms_pipe+alldoall`	Specifies that all loops can be pipelined without needing to be concerned about loop-carried dependencies.
cg	`-Qms_pipe+intdivusefp`	Use fp divide for signed integer division.
cg	`-Qms_pipe+prefolim=` `n`	Set the prefetch-ahead distance assuming that the number of outstanding prefetches is n. With larger n, the ahead distance gets larger.
cg	`-Qms_pipe-pref`	Disable prefetching within modulo scheduling (used in software pipelining).
cg	`-Qms_pipe-pref_prolog`	Turn off prefetching in the prolog of modulo scheduled loops.
cg	`-Qms_pipe-prefst`	Turn off prefetching for stores in the pipeliner.
cg	`-Qms_pipe-pref_prefstrong=0`	Turn off the use of strong prefetches in modulo scheduled loops.
cg	`-Qms_pipe+unoovf`	Assert (to the pipeliner) that unsigned int computations will not overflow.
cg	`-Qpeep-Sh0`	Disable the max live base registers algorithm for sethi hoisting. A sethi is a SPARC instruction for forming large constants, especially address constants. Sethi hoisting uses an algorithm that may increase register pressure. This option usually helps performance.
cg	`-Qlp-prt=1`	Use prefetch with function code 1 (prefetch for one read) for memory accesses which are read only.
cg	`-Qlp-prwt=3`	Use prefetch with function code 3 (prefetch for one write) for memory accesses which are read and then written.
f90comp	`-O[3-5]`	Set the optimization level of the f95 front/middle end to the specified optimization level (Fortran only).
f90comp	`-array_pad_rows,` `n`	Enable padding of f95 arrays by n (Fortran only).
f90comp	`-hoist_expensive,-hoist_trivial`	Enables additional loop invariant code motion, hoisting operations out of loops.
iropt	`-Abcopy`	Increase the probability that the compiler will perform memcpy/memset transformations.
iropt	`-Abopt`	Enable aggressive optimizations of all branches, such as reversing the branch condition. This is only useful when profile feedback is used.
iropt	`-Abuiltin_opt:assume_standard_func=on`	Allow standard library functions to be inlined even when their respective include files are not specified.
iropt	`-Adata_access`	This option turns on analysis of data access patterns for scalars and arrays regions accessed in each loop. The information is used by various loop transformations such as loop fusion for determining profitability of those transformations. Unlike regular data dependence analysis, this analyzes detailed array sections accessed in a loop, so the analysis can be expensive in terms of compilation time.
iropt	`-Addint:ignore_parallel`	Ignore parallelization factors in loop interchange heuristics.
iropt	`-Addint:sf=` `n`	Set memory store operation weight for loop interchange to n. A higher value of n indicates a greater performance cost for stores. This flag gives more weight to store operations in determining whether some loop transformations such as loop interchange should be done.
iropt	`-Afully_unroll:always=on`	Do aggressive loop fully unrolling based on the size and trip count of the loop.
iropt	`-Aheap`	Allow the compiler to recognize malloc-like memory allocation functions. If `-xbuiltin` is specified, this option is implied.
iropt	`-Ainline[:cp=` `n``]` `[:cs=` `n``][:inc=` `n``]` `[:irs=` `n``][:mi]` `[:rs=` `n``][:recursion=` `n``]`	`cp=`n: sets the minimum call site frequency counter required to consider a routine for inlining. `cs=`n: sets the inline callee size limit to n; the unit roughly corresponds to the number of instructions. `inc=`n: allows the inliner to increase the size of the program by up to n%. `irs=`n: allows routines to increase by up to n; the unit roughly corresponds to the number of instructions. `mi`: performs maximum inlining (without considering code size increase). `rs=`n: makes the inliner consider only routines smaller than n pseudo instructions as inline candidates. `recursion=`n: allows a recursive call to be inlined up to n levels.
iropt	`-Aivel:duplicate_loops`	More aggressive strength reduction by replicating loops.
iropt	`-Aivsub3`	Increase the probability that loop induction variables will be replaced, so that some extraneous code can be eliminated from loops.
iropt	`-Aloop_dist:ignore_parallel`	Ignore parallelization factors in loop distribution heuristics.
iropt	`-Amemopt:arrayloc`	Reconstruct array subscripts during memory allocation merging and data layout program transformation. The transformation uses the same arrays, but modifies the ways the arrays are referenced to make them more efficient globally.
iropt	`-Apf:`[`llist=n` \| `noinnerllist`]	Do speculative prefetching for linked-list data structures. `llist=`n: perform prefetching n iterations ahead. `noinnerllist`: do not attempt prefetching for innermost loops.
iropt	`-Apf:pdl=n`	Allow prefetching through up to n levels of indirect memory references.
iropt	`-Aparallel:nthreads=count`	Instructs the compiler on the number of threads to use for automatically parallelized regions.
iropt	`-Arestrict_g`	Assumes global pointers are not aliased (restricted).
iropt	`-Asac`	Structure Array Contraction reduces strides in a hot loop accessing a big array. This is done by collecting only the hot fields into a new structure and rearranging the dimensions in the new array (of the new structure) to minimize stride width, e.g. a[x][y] to a[y][x].
iropt	`-Ashort_ldst[:ldld]`	Convert multiple short memory operations into single long memory operations. `ldld` Convert multiple short memory loads into single long load operations.
iropt	`-Atile:routine=on`	Enable routine level loop tiling optimization. Routine level loop tiling is an optimization that applies tiling on multiple loop-nests within the whole routine.
iropt	`-Atile:skew=on`	Perform loop tiling that is enabled by loop skewing. Loop skewing transforms a non-fully interchangeable loop nest to a fully interchangeable loop nest.
iropt	`-Atile:skewp[:b` `n``]`	Perform loop tiling that is enabled by loop skewing. Loop skewing transforms a non-fully interchangeable loop nest to a fully interchangeable loop nest. The optional `b` n sets the tiling block size to n.
iropt	`-Aujam:inner=g`	Increase the probability that small-trip-count inner loops will be fully unrolled.
iropt	`-Aujam:noinner`	Do not unroll small-trip-count inner loops.
iropt	`-Aunroll`	Enable outer-loop unrolling.
iropt	`-crit`	Enable optimization of critical control paths. This is based on profile data to select critical paths and create super blocks so that more optimizations and better scheduling can be done on the critical paths, and result in better overall performance.
iropt	`-MR`	Do not inline calls when parameters are arrays and the actual and formal array dimensions are mismatched.
iropt	`-Ma` `n`	Enable inlining of routines with frame size up to n.
iropt	`-Mm` `n`	Set the maximum code increase due to inlining to n instruction triples per module. A higher value of n allows more inlining to occur.
iropt	`-Mr` `n`	Set the maximum code increase due to inlining to n instruction triples per routine. A higher value of n allows more inlining to occur.
iropt	`-Ms` `n`	Set the maximum level of recursive inlining to a depth of n. A higher value of n allows more inlining to occur.
iropt	`-Mt` `n`	Set the maximum size of a routine body eligible for inlining to n instruction triples. A higher value of n allows larger routines to be inlined.
iropt	`-reroll=1`	Enable loop rerolling.
iropt	`-Rloop_dist` `n`	Do not perform loop distribution transformations.
iropt	`-Rscalarrep`	Disable scalar replacement optimization. Generally, scalar replacement reduces memory accesses in a loop and therefore improves the loop's performance. However, it can also increase register pressure, which can lead to register spills (stores of registers to memory, an expensive operation).
iropt	`-Rtile`	Disable loop tiling optimization in iropt.
iropt	`-Rujam`	Disable loop unroll and jam optimization in iropt.
iropt	`-whole`	Do whole program optimizations. Allows the compiler to do a better job of inter-procedural analysis.
iropt	`-xprefetch_mult=iterations`	Specifies how far to prefetch ahead (in loop iterations).
iropt	`-xrestrict`	Treat formal pointer parameters as restricted pointers (not aliased).
ld	`-M,/usr/lib/ld/map.bssalign`	Instructs linker to use mapfile from `/usr/lib/ld/map.bssalign`. This provides an appropriate alignment for large page mapping of the heap, allowing for more efficient usage of large pages. (Fortran)
ube	`-fsimple=3`	Allow optimizer to use x87 hardware instructions for sine, cosine, and rsqrt. The precision and rounding effects are determined by the underlying hardware implementation, rather than by standard IEEE754 semantics (x86).
ube	`-gra_loop_based_splits=on\|off`	Enables\|disables spilling of registers to memory before a loop if it requires more free registers than available and the spilled variables are unused inside the loop. Default is on.
ube	`-nontemporal`	Allows the compiler to use streaming stores, that is, stores that avoid writing to caches and instead go directly to memory. Also allows use of a prefetch hint that data is unlikely to be re-used, and therefore caching should be avoided as much as possible.
ube	`-sched_first_pass=1`	Enable the instruction scheduling phase before the global register allocator.
ube	`-xcallee[=yes\|no]`	With yes (the default), assume that callee-save registers are saved; with no, assume that they are not saved (x86).
ube_ipa	`-inl_alt`	Enables more aggressive inlining, especially with profile feedback (x86).
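
Several entries in the table, such as `-xrestrict` and `-Arestrict_g`, work by letting the optimizer assume that pointers are not aliased. In C99 source the same promise can be made for individual parameters with the `restrict` qualifier, which is a narrower alternative to asserting it for every formal pointer parameter through a component option. The sketch below is illustrative only; the function names `axpy` and `axpy_may_alias` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i]. The restrict qualifiers promise that x and y never
   overlap, so loads from x need not be repeated after stores to y. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same loop without the promise, shown for contrast. Here the
   compiler must allow for x and y referring to overlapping storage. */
void axpy_may_alias(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Both functions give the same answer when the arrays are distinct. If a caller passes overlapping arrays to a routine whose pointers are treated as restricted, the behavior is undefined, which is one reason correctness testing is recommended alongside these options.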