Getting The Best AMD64 Performance With Sun Studio Compilers

   
By Stanislav Mekhanoshin, Sun Microsystems, May 23, 2006; updated August 22, 2008  

Changes in Sun Studio 12

With the release of Sun Studio 12, you should no longer use -xarch=amd64 or any similar value to specify bitness explicitly. Instead, use the -m64 option to compile a 64-bit application (8-byte pointers, wider registers, and so on) and the -m32 option to compile a 32-bit application. The -xarch flag now specifies only the instruction set to use, such as SSE, SSE2, or SSE2a. Complete information can be found in the Sun Studio 12 product documentation.

Support for the AMD Family 10h CPUs has been added to the Sun Studio 12 release. The following options should be used to support this:

-xtarget=barcelona
-xarch=amdsse4a
-xchip=barcelona

 


Sun Microsystems Inc. system solutions based on the AMD Opteron processor have attracted worldwide attention due to their outstanding performance, low prices, and unique performance/Watt energy utilization. These AMD64-based systems have achieved 65 world records over the past 2 years (http://www.sun.com/x64/benchmarks/). The new Sun Fire X4100 and X4200 systems set world performance records just after being released. This outstanding achievement is a result of productive collaboration between AMD and Sun.

Performance is a function of both hardware and software. To extract the maximum performance from the new AMD64-based systems on your critical C/C++ and Fortran applications, choose the best compilers in the industry: Sun Studio 11. Then, by setting compiler options to take advantage of the Opteron system features, you'll maximize your performance benefit. This article will show you how.

Sun Studio 11 Compilers and Tools

While our focus below is Opteron based systems, the techniques that we discuss in this note can be applied with small changes to the SPARC platform and to Intel processor based systems running Solaris. In most cases what's required are changes to the compiler option arguments.

In this note our focus is compilation. Hence we do not address coding techniques. Refer to the Sun Studio 11 documentation for more details.

Test environment

Our test environment consists of a Sun W2100 workstation running Solaris 10.

We use the GNU sed 4.1.4 utility source to illustrate the compiler optimizations. The sed utility is used to filter text. Specifically sed takes text as input and then performs one or more operations on the text and outputs the modified text. The sed utility is a C application. You can obtain the sed source from http://ftp.gnu.org/pub/gnu/sed/sed-4.1.4.tar.gz

We run a sequence of performance tests. For each test, we compile sed with the Sun Studio 11 compilers and with selected compiler options. We then determine the impact of the selected compiler options on performance by measuring the execution time of sed to convert a text file to html using the following script:

demo% cat txt2htm.sed
s/\&/\&amp\;/g
s/[\<]/\&lt\;/g
s/[\>]/\&gt\;/g
s/^\s\+/\<\/p\>\<p\>/
s/^$/\<\/p\>\&nbsp\;\<p\>/
s/^/\<html\>\n\<head>\n\<\/head\>\n\<body\>\n\<\/p\>/
s/\s\s/\&nbsp\;\&nbsp\;/g
$s/$/\<\/p\>\n\<\/body\>\n/
          

We then call the time command to obtain the performance numbers:

time -p sed -f txt2htm.sed >/dev/null <test.txt

If you have access to an Opteron-based Sun workstation or server running Solaris 10 you can follow along to get a sense of the impact of the option settings on the performance of your application. If you do not already have Sun Studio 11 installed on your system, you can download it without charge.

Measuring with default options

Let's build and measure the performance of our sample sed program. First we compile and run with no options set, using the compiler's default settings. We get a test execution time of 12.22 seconds. This number is for the default options provided by the configure script, without any optimizations.

Specifying Target Computer: -xarch, -xchip, -xcache

Often programmers compile their source code with the compiler's default option settings. By careful choice of compiler options, significant performance improvements can be had. These compiler options are not turned on by default: you must set them explicitly.

-xarch

The -xarch compiler option tells the compiler what instruction set is available for code generation.

If you know the target platform for your application, specify it with the -xarch compiler option. The compiler will then generate code optimized for the target platform.

What x86 platform architectures are available? First, let's look at the Pentium III. Some time ago computers had separate chips for integer and floating point calculations. The floating point coprocessor used a special floating point stack for its operations. Later, both floating point and integer units were placed on the same chip, so now any CPU is able to perform both kinds of calculations. Some Pentium CPUs have the MMX (Multimedia Extension) instruction set, which provides a faster and more convenient way to perform typical multimedia calculations on packed integer data. That support was expanded in the Pentium III with the SSE (Streaming SIMD Extensions) floating point instruction set, making the old floating point stack calculations obsolete. With SSE you get faster floating point calculations and 8 additional XMM floating point registers available to the compiler. On these processors each 128-bit XMM register holds four 32-bit single-precision values.

Starting with Pentium IV, the SSE2 instruction set was introduced as an expansion of the older SSE set. If you intend to run your executable program on a Pentium IV, you need to compile with -xarch=sse2.

Compared to the older Pentium CPUs, the current generation AMD Opteron and Athlon64 architectures have much improved multimedia extension support. These processors have 16 XMM registers and 8 more general purpose registers. The general purpose registers are now 64 bits wide, suitable for direct 64-bit calculations in addition to 32-bit computation, and the XMM registers are 128 bits wide. You can now directly access much more than 4GB of memory. To use these features, compile your application with -xarch=amd64. Note that the resulting executable will work not only on AMD64-based computers but on 64-bit Intel Xeon and other EM64T processors as well.

-xchip

A second target computer specific option is -xchip. This option specifies the target computer CPU type. By default the compiler generates code for a generic x86 chip and does not take advantage of the advanced Pentium IV or Opteron features. But you can set -xchip=opteron or -xchip=pentium4 to specify your primary target. Unlike the -xarch option, where the compiler uses the architecture argument to generate code specific to an instruction set, the -xchip option is more of a "hint" to the compiler: it helps the compiler optimize the program for a family of CPUs with the same minimal instruction set.

-xcache

A third target computer specific option is -xcache. This option specifies the target computer cache configuration. For an Opteron-based system, this should be set to -xcache=64/64/2:1024/64/16.

-xtarget

You can specify the -xarch, -xchip and -xcache options together by using the composite option -xtarget. -xtarget is a macro holding three settings: -xarch, -xchip, and -xcache. Specifying -xtarget=opteron is the equivalent of -xarch=sse2 -xchip=opteron -xcache=64/64/2:1024/64/16.

Note that -xtarget=opteron conservatively sets -xarch=sse2, and not -xarch=amd64. So to get the most performant code for the AMD64 CPU you need to set -xarch=amd64 in addition to -xtarget=opteron. Note that if an option is repeated on the compiler command line, the last occurrence of the option takes precedence. So if you set -xarch=amd64 -xtarget=opteron you will not get 64-bit code, since the macro expansion yields -xarch=sse2! In the current example set the options in this order: -xtarget=opteron -xarch=amd64.

64-bit Memory Considerations

Setting -xarch=amd64 tells the compiler about resources that are significant for code generation: more registers, additional instructions, expanded direct memory access. For instance your long type in C/C++ becomes 64 bit and you do not need to use the long long data type for 64-bit numerical calculations because the compiler can do that in a more efficient way. Function arguments are now passed in registers instead of memory, giving you an additional performance boost.

But 64-bit mode is not always faster than 32-bit mode. In 64-bit mode, even if your program is far from using 4GB of memory, you still point to a memory location with 64-bit addressing. If you are using pointers, they will take 8 bytes in 64-bit mode instead of 4 bytes in 32-bit mode. That is not usually a problem until you have a lot of them. For example, if you have large arrays of pointers, or structure or class instances that contain pointers, you will use much more memory. Your program may become slower due to swapping or to memory cache misses. So if your program uses large arrays of pointers, consider using 32-bit SSE2 mode instead of the 64-bit mode. It may be more efficient. You may need to experiment to determine the best option setting.

Also be aware that the C/C++ long type becomes 8 bytes in 64-bit mode. If you are using long data types extensively, consider using int instead or switching to 32-bit mode (int is 32 bits in both 32-bit and 64-bit modes).
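To make the memory cost concrete, here is a minimal sketch that compares a pointer-linked node with a node linked by 32-bit array indices. Both structures and the helper names are hypothetical illustrations, not part of sed or of any Sun library. The index-based layout is only one possible way to keep a data structure compact in 64-bit mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Doubly linked node that stores real pointers. In a 64-bit build each
   pointer takes 8 bytes; in a 32-bit build each takes 4 bytes. */
struct node_ptr {
    struct node_ptr *next;
    struct node_ptr *prev;
    int32_t value;
};

/* Hypothetical alternative: the same node linked by 32-bit indices into
   one array. Its size does not depend on the compilation mode. */
struct node_idx {
    int32_t next;   /* index of the next node, or -1 */
    int32_t prev;   /* index of the previous node, or -1 */
    int32_t value;
};

/* Bytes needed for an array of n nodes of each kind. */
size_t ptr_nodes_bytes(size_t n) { return n * sizeof(struct node_ptr); }
size_t idx_nodes_bytes(size_t n) { return n * sizeof(struct node_idx); }
```

Comparing ptr_nodes_bytes(n) and idx_nodes_bytes(n) in a 32-bit and a 64-bit build shows how much of the footprint comes from pointer size alone.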

This is not a concern for Fortran since it has fixed data type sizes for all platforms and also because Fortran does not make extensive use of pointers.

Safe versus aggressive compiler option settings

Several of the compiler options described below can be used to obtain significant performance gains, albeit with a possible loss of robustness for programs that do not strictly follow conservative coding techniques. Proper use of these advanced options is based on an agreement between the programmer and the compiler: the programmer promises not to code in specific ways, and the compiler uses that knowledge for optimization. These advanced optimizations need to be enabled explicitly by the programmer. They are not optimizations done by default.

Setting optimization levels: -xO1 to -xO5

There are five optimization levels available for the Sun Studio compilers, -xO1 to -xO5. All the optimizations selected with the -xOn options are safe unless your program uses low level memory manipulations that rely on knowledge of the internal data layout. Use the highest optimization level, -xO5, unless it exposes some significant programming errors in your code.

Re-measuring sed performance

The executable we started with was targeted at the generic (386) architecture. Selecting the amd64 architecture gives us 13.74 seconds, which is slower than the 32-bit build. But watch what happens as you start using the optimization options.

We compile sed with optimization level -xO5. We get 7.49 seconds for the 32-bit sse2 architecture and 6.68 seconds for the amd64 architecture. The 64-bit build is now about twice as fast as its un-optimized counterpart.

Notice also that the 64-bit version now runs faster than the 32-bit version. To get the full power of the 64-bit processor working with sed, we had to enable compiler optimizations.

-fast macro

The -fast macro option conveniently collects a number of powerful and relatively safe optimizations. It is a good first step for getting the best performance out of most applications.

So what are the options collected in -fast and under what conditions are they safe? Let's take a closer look at some of the options included in -fast and see what they do.

With the C compiler, the -fast option expands to the following options on x86/x64 processors:

-fns -fsimple=2 -fsingle -nofstore -xalias_level=basic -xbuiltin=%all -xdepend -xlibmil -xlibmopt -xO5 -xregs=frameptr -xtarget=native

Details on all these options can be found in the user guides and man pages for each of the compilers.

Note that -fast includes -xO5. It also includes -xtarget=native. The native argument instructs the compiler to assume that the architecture of the target computer is identical to that of the development machine. On an AMD64 platform, the compiler will still select -xarch=sse2, which is a 32-bit target. You need to specify your specific target options after -fast. For example,

-fast -xarch=amd64

Other options included in the -fast macro may also be overridden by adding the changed value following -fast on the command line. It is good practice to specify -fast as the first compiler option.

Re-measuring sed performance

We now compile the sed source using -fast with -xarch=amd64. We get 6.33 seconds.

Floating Point Options: -fns -fsimple=2 -fsingle -nofstore

The set of options -fns -fsimple=2 -fsingle -nofstore tell the compiler that it may relax concerns about floating point precision and runtime exceptions while compiling floating point expressions. In most cases this is a reasonable assumption, and allows the compiler to optimize over floating-point instructions. However, use of these options may result in some loss of standards conformance for floating-point operations and possible numerical differences if the program algorithms are sensitive to rounding errors.

Aliasing: -xalias_level=<level>

The -xalias_level option is a powerful optimization for most C/C++ programs. In general, optimizing C/C++ is made difficult by the use of pointers in the source code. The compiler can make few assumptions regarding how data is used in the program, thus inhibiting many optimizations and code restructuring.

Because the actual data and pointer values will be known only at run time, the compiler cannot make any assumptions at compile time. The situation where two pointers could cause the same data to be changed or read indirectly is called aliasing. There are many aliasing situations. Hence seven values are available for the -xalias_level option of the C compiler, the lowest being -xalias_level=any, which prevents any optimization that requires the assumption that one pointer does not alias another.

Although the -xalias_level=basic setting that appears in -fast is only the next level after any, it already allows the compiler to perform certain optimizations even though pointers are used in the code. This leads to a performance boost in most situations. The -xalias_level=basic option guarantees to the compiler, for example, that nowhere in the code will a double variable be accessed through a pointer to long. But the common practice of accessing any data with a char* pointer is allowed with this option setting.

You can assert other aliasing levels by adding the appropriate -xalias_level option after -fast. If you are confident that your program is written without pointer tricks, try using -xalias_level=strong for C programs or -xalias_level=compatible for C++ first. But be careful. The compiler will not warn you if the assumption regarding aliasing is wrong. In most cases, if you have set the alias level too aggressively, you will get runtime errors or memory exceptions. More information about the various aliasing levels can be found in the C User's Guide.
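As a hedged illustration of code that stays within these aliasing rules, the sketch below reads the bit pattern of a double by copying its bytes instead of dereferencing a pointer of a different basic type. The function name double_bits is hypothetical, and the example assumes an 8-byte IEEE 754 double, which is the case on x86 and AMD64.

```c
#include <stdint.h>
#include <string.h>

/* Returns the raw bit pattern of a double. memcpy accesses the object
   as bytes, which is the char-style access that remains permitted. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* The pattern to avoid once the compiler may assume that pointers to
   different basic types never alias:

       uint64_t bits = *(uint64_t *)&d;    double read as an integer
*/
```

Code written the first way should behave the same at any alias level, so it is a reasonable starting point before experimenting with more aggressive settings.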

Aliasing is one of the biggest problems for any C/C++ compiler. If performance is a main concern, avoid using pointers in compute-intense loops. Array indexing performs better than doing pointer arithmetic. For example:

for( int i=0; array[i]; i++ ) foo(array[i]); // recommended

for( T* p=array; *p; p++ ) foo(*p); // avoid

-xrestrict

You may also consider using -xrestrict, which tells the compiler there is no aliasing between the arguments in functions. For example strcpy is a typical function which takes two pointers (destination and source). These two pointers can never alias each other. So for this example, make it clear to the compiler by setting the -xrestrict option.

A counter-example, where the compiler option -xrestrict is not safe, is memmove. This is an example of a function where aliasing is expected. Compiling the memmove source with the -xrestrict option would result in bad code. In practice it is better to declare individual function arguments as restrict pointers where appropriate. However, when compiling a big third party project it may be much easier to specify -xrestrict. Experiment, but keep in mind that the option applies to all the source code in a compile unit.
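Declaring restrict on individual arguments might look like the following sketch. The function scaled_add is a hypothetical example, and the code assumes the compiler is running in a mode that accepts the C99 restrict keyword.

```c
#include <stddef.h>

/* Computes dst[i] += k * src[i]. The restrict qualifiers promise that
   dst and src never overlap, so the compiler is free to keep values in
   registers and to reorder loads and stores. */
void scaled_add(float *restrict dst, const float *restrict src,
                size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

The promise is made per function, so a routine such as memmove in the same compile unit can keep ordinary pointers. Calling scaled_add with overlapping arrays would break the promise and give undefined results.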

Re-measuring sed performance

We do not get any further performance benefits by using -xrestrict or -xalias_level=strong for our sed code right now. But note that adding -xalias_level=any after -fast to tell the compiler to be most conservative increases the run time to 7.03 seconds.

Intrinsics and Library Functions: -xbuiltin, -xlibmil, -xlibmopt

Many compiler optimizations are possible when the compiler knows exactly what the code is doing. A good example is a call to a standard library function. The compiler could replace the call with optimized code for the function inserted right in place. This depends on an understanding between the programmer and the compiler that the program does not override a standard function with its own custom version. When this agreement holds, the compiler can assume that all calls to standard functions use the library version, and replace the function calls with the intrinsic code itself. The option asserting this is -xbuiltin=%all, which is a part of the -fast expansion.

To benefit from intrinsics, do not forget to include in your source the system headers that declare the particular functions you use. The compiler's builtin intrinsics will not be used if you do not include the right header files. Look for compilation warnings about undeclared functions.

There are also two libraries of pre-optimized functions: libm.il and libmopt.so. Use of these libraries is specified by the -xlibmil -xlibmopt options, which are a part of the -fast macro. These two libraries provide optimized versions of standard math routines. The trade-off for the higher performance of these libraries is that information about arithmetic exceptions might not be available and that the errno variable will not be set. In many cases this is a reasonable trade-off between functionality and performance.
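The header requirement can be illustrated with a short sketch. The helper copy_truncated is hypothetical. Whether the calls to strlen and memcpy are actually expanded inline depends on the compiler and on the options in effect, so treat this as an illustration of the precondition rather than a guarantee.

```c
#include <string.h>   /* declares strlen and memcpy */

/* Copies src into dst, which has room for cap bytes, truncating if
   necessary and always terminating the result when cap > 0.
   Returns the number of characters copied. */
size_t copy_truncated(char *dst, const char *src, size_t cap)
{
    size_t len;
    if (cap == 0)
        return 0;
    len = strlen(src);
    if (len > cap - 1)
        len = cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

Without the include, the compiler sees only implicit declarations of the two functions and has no basis for treating them as the standard library versions.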

Frame Pointers: -xregs

Each function normally maintains a frame pointer: a register that points to the function's stack frame and helps manage the function's local data. Without frame pointers the system will have trouble unwinding the stack during exception handling and debugging. On the other hand, if you choose not to use frame pointers in your program you free one general purpose register for the compiler. The compiler can also generate a shorter function prologue, so function execution should be faster. The option -xregs=frameptr, which is a part of -fast, tells the compiler that it may use the frame-pointer register as a general purpose register, that is, not to maintain frame pointers.

While stack unwinding during exception handling is critical for C++, this ability will not be lost when the option -xregs=frameptr is set. The compiler does not omit frame pointers where they are really necessary. You may still have difficulty debugging so consider this option for achieving higher optimization in production code.

Inter-procedural Optimizations: -xipo

Optimization can be much more effective if the compiler has access to the full application project source, and not just individual source modules. For example, by looking at the entire application source code tree, the compiler could determine that it can inline a function called in one source file and defined in another. Function inlining removes the expensive function call by replacing it inline with the actual code of the called function. Inlining and related optimizations that read the entire application source code tree are called interprocedural optimizations. Use of these options can lead to significant performance gains.

To take advantage of inter-procedural optimization, you need to provide the compiler with all the source files in a single-step compile and link. However this is not always possible, especially with large projects. If you are doing mixed language development requiring compilation with a combination of cc, CC, and f95 commands, use one of the compiler commands to do the linking rather than using the ld command directly.

To enable interprocedural optimization add -xipo=2 on the compiler command line in both the compile and link steps. Even with separate module compilation you will still benefit from interprocedural optimization if you specify that option at every compiler invocation, including linking.

Re-measuring sed performance

When we add -xipo=2 to the compilation options for sed, we get 5.85 seconds – 7.5% faster!

Profile Feedback: -xprofile

The optimizations discussed so far are made by analyzing the program source code only. Profile feedback takes into account the program's actual runtime data to help the compiler determine whether certain optimizations are worth performing or not. If the compiler knows the typical data values at critical program junctions it can effectively rearrange some branches or even restructure the code completely so as to make it work faster in typical use cases. The performance benefits you could get with this option are high, but they come at the cost of additional programmer effort.

To use profile feedback you need to provide the compiler representative data samples. That is, it should be almost the same as real working data in most typical cases. Also, the sample data should be small enough to keep compilation time reasonable. Note also that the profiled executable collecting the sample run data will take longer to run than usual.

To generate a "compiler training run", compile your program with the same options that you plan to use in production, but with the additional option -xprofile=collect:./feedback. Now run the executable using your typical input data. This run will create the subdirectory ./feedback.profile in your current directory. The data collected will be used later for optimization. Now recompile everything with the same options but replacing "collect" with "use", as in -xprofile=use:./feedback. The compiler will now use data from ./feedback.profile to direct its optimizations.

To obtain the best representative data, you may want to run several "collect" executions with different data sets. When you do this, the results are merged. However, if you change your source or compilation options you must erase the ./feedback.profile directory and run new "collect" passes to get relevant results.
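The kind of code that can benefit is sketched below: a loop whose branch is rarely taken on typical input, which the compiler can learn only from a training run. The function count_escapes and the file and program names in the comment are hypothetical. The compiler options are the ones described above.

```c
#include <stddef.h>

/* Counts characters that need HTML escaping. In typical text the branch
   below is rarely taken, a fact that is visible in profile data but not
   in the source.

   Build steps from the text, with hypothetical file names and assuming
   a main function that feeds input to count_escapes:
     cc -fast -xarch=amd64 -xprofile=collect:./feedback count.c -o count
     ./count < typical-input.txt
     cc -fast -xarch=amd64 -xprofile=use:./feedback count.c -o count
*/
size_t count_escapes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == '&' || *s == '<' || *s == '>')
            n++;
    }
    return n;
}
```

If the training input were mostly markup instead of plain text, the collected profile would suggest the opposite branch layout, which is why representative data matters.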

Re-measuring sed performance

Let's try profiling on sed. First we generate the profile data by compiling sed with the options -fast -xarch=amd64 -xipo=2 -xprofile=collect. Then we use the profile data by compiling with -fast -xarch=amd64 -xipo=2 -xprofile=use. We get 5.18 seconds.

Currently the Sun Studio x86 compiler has two different profilers. The one that we selected with -xprofile is the current production version in Sun Studio 11. There is also a new profiler which is still in development, and which will become the default profiler with the next release of Sun Studio. Let's try this new profiler. To enable the new profiler add the driver option -iropt-prof along with the -xprofile option. To add it to the C compiler driver use -Wd,-iropt-prof. To add it to the C++ or Fortran compiler use -qoption CC -iropt-prof or -qoption f90 -iropt-prof.

We recompile sed with the -iropt-prof set in addition to -xprofile. We now get 4.78 seconds.

Sometimes an optimization option that may not improve performance by itself will provide a performance boost when used together with profiling and interprocedural optimization. The -xalias_level and -xrestrict options are examples of such options.

Earlier, when we tried the -xalias_level and -xrestrict options, we did not get performance improvements. Let's try them again: -fast -xarch=amd64 -xipo=2 -xprofile=... -Wd,-iropt-prof -xalias_level=strong -xrestrict. We now get 4.70 seconds with this combination.

Memory Allocation

Memory layout is a critical factor for performance. If you use the malloc function in your program, be aware that there are various versions and some perform better than others. The default malloc tries to save memory. However if you only use malloc occasionally, you might find that adding the option -lbsdmalloc to your compilation string can improve performance. Using the bsdmalloc library results in better alignment of allocated memory, and possibly better performance through better cache utilization.

Suppose that your application has declared the following structure of 64 bytes size and that memory is allocated for the array of such structures:

typedef struct _s_t {
//...
}s_t;
//...
assert(sizeof(s_t)==64);
s_t *sa = (s_t*) malloc(sizeof(s_t)*NELEM);

When using the default malloc we can end up in a situation where the starting address of the array, and therefore of each individual structure, is not aligned to a 64-byte boundary. As a result each structure will be spread over two cache lines instead of a single one. Referencing a field at the beginning and a field at the end of the structure will then request two different cache lines from memory. If we are doing non-sequential array access this may result in twice as many memory cache misses as would occur with strict alignment. With -lbsdmalloc, each structure is forced to fit into a single cache line.

The default malloc and free use mutexes, which is time consuming. bsdmalloc does not, so the allocation itself is faster. The trade-off is between memory allocation density on the one hand and the speed of allocation and use on the other.

However, it seems unwise to use bsdmalloc instead of malloc for our sed example, since sed is a memory intensive application with numerous memory allocation calls. We would just waste memory and most probably lose performance if we were not to use the default malloc.
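If switching the allocator for the whole program is not desirable, alignment can also be requested for a single array. The sketch below uses the POSIX function posix_memalign for that purpose. This is a different technique from linking with -lbsdmalloc, and the helper names and the assumed 64-byte cache line size are ours.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign in <stdlib.h> */
#endif
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

typedef struct s_t {
    char payload[CACHE_LINE];   /* placeholder fields, 64 bytes in total */
} s_t;

/* Allocates n structures starting on a cache-line boundary.
   Returns NULL on failure. Release the memory with free(). */
s_t *alloc_line_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(s_t)) != 0)
        return NULL;
    return (s_t *)p;
}

/* Nonzero if p lies on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

Because sizeof(s_t) equals the alignment, every element of the array starts on its own cache line once the first one does.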

SIMD: -xvector=simd

The SSE2 instruction set includes some special SIMD instructions. SIMD stands for Single Instruction Multiple Data. That means you may process several data values concurrently with one operation. Suppose that you scale a vector:

for( i=0; i < NELEM; i++ ) v[i] = v[i] * 2;

That would normally require NELEM multiplications. With the SIMD "vector" instructions, this can be done with fewer operations. If the data being multiplied is of type float it will require 4 times fewer operations. The size of a float is 32 bits, and SIMD instructions operate on data packed into a 128-bit XMM register. So we may place four 32-bit floats into one 128-bit XMM register and process them at once. Similar operations on data of type char would require 16 times fewer operations. A limitation of SSE2-style vectorization is that the data being processed must be in adjacent memory locations. So this would not work if we were striding through the odd elements of an array, for example.

Sun Studio 11 introduced basic support for vectorization on SSE2 platforms. The compiler looks for such operations and vectorizes them whenever possible. That serves both floating point and integer calculations. Currently the mode is still experimental and not as beneficial as it can be. The support will be extended in the next compiler releases. To enable vectorization support, you need to specify the instruction set (at least -xarch=sse2) along with the compiler flag -xvector=simd.
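The difference between a loop that fits this model and one that does not can be sketched as follows. Both functions are hypothetical examples. Whether a given compiler release actually vectorizes the first loop is not something the source guarantees.

```c
#include <stddef.h>

/* Unit-stride loop: the floats are adjacent in memory, so four of them
   fit into one 128-bit XMM register and can be multiplied together. */
void scale_all(float *v, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * k;
}

/* Stride-2 loop over the odd elements only: the data is not adjacent,
   which is the obstacle to SSE2-style vectorization described above. */
void scale_odd(float *v, size_t n, float k)
{
    for (size_t i = 1; i < n; i += 2)
        v[i] = v[i] * k;
}
```

When a computation naturally touches every other element, reorganizing the data so that the processed values are contiguous can bring it back to the first form.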

We get no performance benefit from -xvector=simd for sed. So not every program benefits from every optimization.

Prefetching: -xprefetch, -xprefetch_level=<level>

One of the biggest performance bottlenecks is memory speed. While utilizing a high-speed cache can serve data to the CPU quickly, these cache buffers have limited size. If your program processes big data arrays that do not fit into the cache, the CPU stalls waiting for more data. Programs like this are called memory bound in comparison to CPU bound, where the CPU speed is the bottleneck. And, as raw CPU processing speed increases, more and more programs fall into the memory bound category. This situation can be aggravated with SIMD which processes data arrays many times faster than before.

Most memory accesses to fill the cache can be done in parallel with computations. Advanced CPU architectures such as the AMD Opteron can automatically prefetch data into cache. While some automatic prefetching is done by the CPU, in certain situations the compiler may generate additional prefetch instructions. The heuristics used by the CPU or by the compiler to fill the cache in advance are speculative and might not always fetch the data actually needed. In practice prefetching could degrade performance by pushing needed data out of the cache. Another challenge is knowing how much data to prefetch. That depends on the CPU to memory speed ratio and varies from one computer to another.

You can enable the generation of prefetching instructions by compiling with the -xprefetch option. Fortran includes prefetching in its expansion of the -fast option. The -xprefetch=no option disables it completely. You can regulate how aggressively the compiler generates prefetches by setting the -xprefetch_level=<level> option. The higher the level value (from one to three), the more aggressively the compiler inserts prefetches.

The AMD CPUs implement the special 3DNow! instruction extensions. One of the 3DNow! instructions supports prefetches for memory stores. Store prefetches are generated along with read prefetches if one of the following special -xarch values is specified: pentium_proa, ssea, sse2a, or amd64a. Note that compiling for these architectures will make your executable program incompatible with non-AMD platforms.

While beneficial in many cases, prefetching does not improve performance for sed. That is an expected result, as sed does not access large data arrays sequentially – the normal case where prefetching usually helps.

Automatic Parallelization: -xautopar

Today's Sun Opteron-based systems are configured with multiple multi-core CPUs. To fully leverage the available CPU resources, applications are best parallelized through the extensive use of threads. The compiler can help you in the parallelization effort. Try compiling your application with the -xautopar option to see if there's a benefit. Then request the number of execution units at runtime by setting the PARALLEL environment variable to, typically, the maximum number of CPUs (or cores) on your system minus 1. You may have to experiment to find the best setting for your application. Some applications may run faster with auto parallelization when fewer than the maximum number of CPUs or cores is specified.
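As a rough guide to which loops are candidates, compare the two hypothetical functions below. The first has independent iterations. The second carries a dependence from one iteration to the next. Which loops -xautopar actually parallelizes depends on the compiler's own analysis, so this is a sketch of the general idea only.

```c
#include <stddef.h>

/* Independent iterations: each out[i] depends only on in[i], so the
   iterations can be divided among threads. The restrict qualifiers
   state that the two arrays do not overlap. */
void square_all(double *restrict out, const double *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Loop-carried dependence: iteration i reads the value written by
   iteration i-1, so the iterations cannot simply run concurrently. */
void prefix_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

Loops of the first kind over large arrays are where experimenting with the PARALLEL setting is most likely to pay off.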

Conclusion

By using the Sun Studio 11 compilers and the right compiler options we've sped up the sed utility from 12 seconds to less than 5 seconds, cutting its run time by about 60% without recoding. We now have a version of sed that runs 35% faster than the gcc compiled version.

We started with -fast -xarch=amd64 and added extra options, some of which helped while others did not.

We saw that some compiler options may not help on their own, but do so when selected in conjunction with other options, such as the -xalias_level option in conjunction with profiling.

For More Information

This portal has a number of other technical articles on performance and parallelization that are worth reading.

About The Author
Stanislav Mekhanoshin is the team lead of the Sun Studio Opteron Performance team. The team is based at the Sun Saint Petersburg Development Center in Russia. The main goal of the team is to improve the performance of code produced by the Sun Studio compilers for the x86 platform, and especially for Sun's Opteron-based systems. Stanislav graduated from Saint-Petersburg State Technological University in 1995 with a master's degree in computer science. Prior to joining Sun, Stanislav worked on various software technologies in Russia, including database, speech recognition, and programming for mobile devices.