Sun Studio 10 Compilers and the AMD64 ABI

   
By The Sun Studio Team, Revised: November, 2005  
The AMD64 ABI (Application Binary Interface) is evolved from the 32-bit x86 ABI. The ABI features binary compatibility between Solaris and Linux systems. Programs compiled for a 32-bit x86 platform do not need to be recompiled to run on the AMD64 platform, but performance may be enhanced as a result of recompiling.
 
Contents
 
AMD64 ABI Summary of Features
Where to Find the AMD64 ABI
Implications for 64-bit Code on Linux Platforms
Data Types - Differences Between 32-bit (ILP32) and 64-bit (LP64) x86
How Recompiling Affects Performance
Consequences of varargs Being Passed in Registers
Reusing the Frame Pointer Register
Address Space for Statically Allocated Data Objects under AMD64
 
 
AMD64 ABI Summary of Features

The AMD64 ABI provides the following enhancements over the 32-bit x86 ABI:

  • The number of integer registers has doubled, from 8 to 16.
  • The number of FP (XMM) registers has doubled, from 8 to 16.
  • Integer and XMM registers participate in the parameter-passing conventions, which makes calling routines faster than on 32-bit x86, where all parameters are passed on the memory stack.
  • Efficient (and different) alignment of data types (long, long long, pointer, uint64).
  • Small structs are optimized to pass and return in registers. This feature will provide most benefit to programs coded in C++.
  • varargs are defined differently. This has user code implications only if the code does type-spoofing.
  • There are seven code models: small, kernel, medium, large, small position-independent code (PIC), medium PIC, and large PIC. For details, refer to the AMD64 ABI specification.
  • The use of a register for the frame pointer is optional. A separate .eh_frame mechanism handles stack unwinding.

Where to Find the AMD64 ABI

http://www.x86-64.org/documentation

Implications for 64-bit Code on Linux Platforms

Our goal is object interoperability between Linux and Solaris systems for the 64-bit AMD Opteron instruction set over a useful range of programs. We are not yet at our goal, but we are working closely with AMD, Linux, and Solaris developers to produce a common Application Binary Interface (ABI). This document will likely result in changes to Linux, so you may need to upgrade to a newer version of Linux to get full object interoperability.

Note, however, that ABI compatibility has limitations when files appear in different places within the file system. Furthermore, the Solaris operating system is POSIX compliant and Linux is not. So, binary compatibility will only be effective if programmers code to the common subset of Linux and Solaris systems.

With the caveats given above, 64-bit code compiled in conformance to the AMD64 ABI can be linked together and run on either Linux or Solaris 10 x86 systems.

Size and Alignment of C Data Types - Differences Between 32-bit (ILP32) and 64-bit (LP64) x86 and SPARC

 

All sizes and alignments are in bytes.

                        LP64 (AMD64)      ILP32 (x86)       LP64 (SPARC v9)   ILP32 (SPARC v8)
C Type                  sizeof  align     sizeof  align     sizeof  align     sizeof  align

Integer
_Bool                        1      1          1      1          1      1          1      1
char                         1      1          1      1          1      1          1      1
signed char                  1      1          1      1          1      1          1      1
unsigned char                1      1          1      1          1      1          1      1
short                        2      2          2      2          2      2          2      2
signed short                 2      2          2      2          2      2          2      2
unsigned short               2      2          2      2          2      2          2      2
int                          4      4          4      4          4      4          4      4
signed int                   4      4          4      4          4      4          4      4
enum                         4      4          4      4          4      4          4      4
unsigned int                 4      4          4      4          4      4          4      4
long                         8      8          4      4          8      8          4      4
signed long                  8      8          4      4          8      8          4      4
unsigned long                8      8          4      4          8      8          4      4
long long                    8      8          8      4          8      8          8      8
signed long long             8      8          8      4          8      8          8      8
unsigned long long           8      8          8      4          8      8          8      8

Pointer
any-type *                   8      8          4      4          8      8          4      4
any-type (*) ()              8      8          4      4          8      8          4      4

Floating Point
float                        4      4          4      4          4      4          4      4
double                       8      8          8      4          8      8          8      8
long double                 16     16         12      4         16     16         16      8

Complex Types
float _Complex               8      4          8      4          8      4          8      4
double _Complex             16      8         16      4         16      8         16      8
long double _Complex        32     16         24      4         32     16         32     16

Imaginary Types
float _Imaginary             4      4          4      4          4      4          4      4
double _Imaginary            8      8          8      4          8      8          8      8
long double _Imaginary      16     16         12      4         16     16         16     16
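
To see which column of the table a particular build falls into, a small C routine can report the values directly. The following is only a sketch: it assumes a C11 compiler (for _Alignof), and the names detect_model and print_layout are illustrative, not part of any Sun Studio interface.

```c
#include <assert.h>
#include <stdio.h>

/* Data models distinguished by the table; the enum names are illustrative. */
enum data_model { MODEL_ILP32, MODEL_LP64, MODEL_OTHER };

/* Classify the current build from the sizes of int, long, and pointers. */
enum data_model detect_model(void)
{
    if (sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8)
        return MODEL_LP64;
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4)
        return MODEL_ILP32;
    return MODEL_OTHER;
}

/* Print one row: type name, sizeof, and alignment, both in bytes. */
#define ROW(out, T) \
    fprintf((out), "%-14s %2lu %2lu\n", #T, \
            (unsigned long) sizeof(T), (unsigned long) _Alignof(T))

/* Print the types whose layout differs between models; returns the row count. */
int print_layout(FILE *out)
{
    int rows = 0;

    assert(out != NULL);
    ROW(out, long);        rows++;
    ROW(out, long long);   rows++;
    ROW(out, void *);      rows++;
    ROW(out, double);      rows++;
    ROW(out, long double); rows++;
    return rows;
}
```

Calling print_layout(stdout) once from a 32-bit build and once from a 64-bit build of the same source gives a quick comparison against the table.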




How Recompiling Affects Performance

The following items will tend to increase the speed of recompiled code:

  • The AMD64 architecture has twice as many registers as 32-bit x86 platforms: 16 general registers versus 8, and 16 XMM registers versus 8. The compiler is therefore much better able to keep data in the fastest available location.
  • As can be seen from the table above, the AMD64 ABI requires types to be aligned on their size, which enables fast loads and stores.
  • Rather than passing parameters in memory on the stack, the AMD64 ABI passes integer and pointer parameters in general registers and floating-point parameters in XMM registers.
  • The AMD64 ABI passes and returns small structures in registers. This feature will mostly benefit programs written and compiled in C++.

The following items will tend to decrease the speed of recompiled code:

  • Heavy use of pointers
  • Heavy use of varargs
  • Heavy use of stack walkback
  • Extensive use of C++ exceptions, because the .eh_frame unwinding mechanism is much slower than the mechanisms used on 32-bit x86 platforms and SPARC-based systems.

Pointers can reduce the speed of a 64-bit program because they are larger. If your application data is mostly pointers to other data, and you spend most of your execution time waiting on main memory, the increased size of pointers decreases the number of pointers that fit in the cache, and will more likely saturate the bandwidth to memory, thus reducing performance.
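
The effect of pointer size on cache density can be estimated with a few lines of C. This is a rough sketch: the 64-byte cache line is an assumption about the target processor, and struct node and the helper names are made up for illustration.

```c
#include <stddef.h>

/* A typical pointer-heavy node: two links and a small payload. */
struct node {
    struct node *next;
    struct node *prev;
    int key;
};

/* Number of whole objects of object_size bytes that fit in one cache line. */
size_t objects_per_line(size_t object_size, size_t line_size)
{
    return object_size == 0 ? 0 : line_size / object_size;
}

/* Nodes per cache line for this build, assuming a 64-byte line. */
size_t nodes_per_line(void)
{
    return objects_per_line(sizeof(struct node), 64);
}
```

With 4-byte pointers the node is typically 12 bytes, so five fit in a 64-byte line. With 8-byte pointers it typically grows to 24 bytes including padding, and only two fit.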

Varargs processing is relatively slow on 64-bit x86 because arguments are passed in registers, and the callee must track a fair amount of information to fetch the next parameter from the proper place. Ordinary non-varargs functions should be faster because of this approach, but the varargs functions themselves will be slower.
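
For reference, a minimal varargs function looks like the sketch below (the function names are illustrative). On AMD64 each va_arg call has to decide, based on the requested type, whether the next argument was passed in a general register, in an XMM register, or on the stack, which is where the extra bookkeeping comes from. It also means the type given to va_arg must match the type the caller actually passed.

```c
#include <assert.h>
#include <stdarg.h>

/* Sum `count` arguments of type int. */
long sum_ints(int count, ...)
{
    va_list ap;
    long total = 0;
    int i;

    assert(count >= 0);
    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);      /* integer arguments */
    va_end(ap);
    return total;
}

/* Sum `count` arguments of type double. */
double sum_doubles(int count, ...)
{
    va_list ap;
    double total = 0.0;
    int i;

    assert(count >= 0);
    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, double);   /* floating-point arguments */
    va_end(ap);
    return total;
}
```

Calling sum_ints with double arguments, or sum_doubles with int arguments, is undefined behavior on any platform. On AMD64 it is especially likely to produce garbage, because the two kinds of argument travel in different register files.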

To perform stack walkback, the calling convention needs a lot of information about each function. Much of this information is stored in the executable as auxiliary information, separate from the actual code. The result is that object files are much larger, often as much as twice as large as they would be on 32-bit x86. Pulling together all the information necessary to walk back up the stack means that C++ exception processing, Java exception processing, and POSIX thread cancellation may be slower for a 64-bit application than for a similar 32-bit application.

It is generally hard to predict whether a specific application will be faster or slower when recompiled. Your best bet is to build the application as both 32-bit and 64-bit x86 code, measure the performance of each, and choose the faster one.

Consequences of varargs Being Passed in Registers

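The AMD64 ABI requires parameters to be passed in specific registers: integer and pointer values in general registers, and floating-point values in XMM registers. If you pass a floating-point value to an integer hex printf specifier, printf reads an unrelated general register, so the output is wrong unless the bits of the value are explicitly reinterpreted as an integer. Example: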

 #include <stdio.h>

 /* Reinterpret the bits of d as an unsigned 64-bit integer */
 #define L(d) ((unsigned long long *) &(d))[0]

 int main () {
     double dval = 132.674;

     /* This technique does not work on AMD64 (LP64) */
     printf("dval = %5.2f (%llX)\n", dval, dval);

     /* This technique will work on ILP32 and LP64,
        SPARC or x86 */
     printf("dval = %5.2f (%llX)\n", dval, L(dval));

     return 0;
 }

 amd64% /set/vulcan/lang/intel-S2/bin/cc t.c -xarch=amd64
 amd64% ./a.out
 dval = 132.67 (FFFFFD7FFFFFF5B8)
 dval = 132.67 (406095916872B021)
 amd64%
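
As an alternative to the pointer cast in the L macro, the bit pattern can be copied into an integer with memcpy. This is a sketch rather than the article's own method: double_bits is a hypothetical helper, and it assumes that double is a 64-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the bit pattern of a double as a 64-bit unsigned integer. */
uint64_t double_bits(double d)
{
    uint64_t bits;

    /* Assumption: double occupies exactly 64 bits on this platform. */
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

The second printf call in the example could then pass (unsigned long long) double_bits(dval) for the %llX specifier. Copying the bytes avoids reading a double object through an integer pointer, which optimizing compilers are allowed to treat as an aliasing violation.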
            

          
Reusing the Frame Pointer Register

The AMD64 ABI permits the compiler to reuse the register that normally contains the frame pointer. The reason is that one extra register can sometimes make a significant difference in the speed of loops. Unfortunately, without the frame pointer available and in a consistent location, debugging and performance analysis tools cannot easily follow the chain of function calls. In particular, when the compiler reuses the frame pointer register, dtrace will not work. Dtrace is a Solaris 10 OS application for whole-system performance analysis. It can help you identify the big problems in system performance. Because this facility is so important, Sun Studio compilers will not reuse the frame pointer by default.

For some applications, particularly benchmarks, the higher-level performance problems that dtrace helps you find have already been eliminated. In these circumstances, reusing the frame pointer register provides an extra boost of speed. To make this boost easy to obtain, the -fast option enables reuse of the frame pointer register.

Address Space for Statically Allocated Data Objects under AMD64

The AMD64 ABI specifies four memory address space models: kernel, small, medium and large. The larger address space models such as medium and large were not finalized in the ABI during the development of the Sun Studio 10 release, so only kernel and small models are implemented in the compilers, with the small model as the default mode. The medium and large models will appear in future releases.

The small model as defined in the AMD64 ABI covers an address range of about 2 gigabytes and provides the fastest data access. Note that this is smaller than the address space for 32-bit mode, which is about 4 gigabytes for absolute addressing. Some programs may therefore compile and link in 32-bit mode but fail in 64-bit mode, as shown below.

Describing the Linker Problem

The linker may issue an error message under -xarch=amd64 for large data objects. For example:

 % cc t.c -xarch=amd64
 ld: fatal: relocation error: R_AMD64_32: file t.o: symbol buf: value 0xffffffffc0410ba0 does not fit
          

where t.c may be:

 #pragma weak buf
 char arr[3L<<30], *buf;

 int main()
 {
     buf = arr;
 }
            

          

This is not a compiler error, but rather a misunderstanding of address space under AMD64 as stated in the ABI.

Currently, -xarch=amd64 supports -xmodel=[small|kernel]. The compiler may generate three different 32-bit relocation types to handle static memory access:

  1. R_AMD64_32 for -xmodel=small: the upper 33 bits MUST all be zero.
  2. R_AMD64_32S for -xmodel=kernel: the upper 33 bits MUST be either all zeros or all ones.
  3. R_AMD64_PC32: an optimization that uses AMD64 RIP-relative addressing.

If the above conditions of R_AMD64_32/32S are not satisfied, the linker will issue the "does not fit" relocation error. This requirement of the linker check is stated in the AMD64 ABI.

So in a sense, the address space for static memory objects is only about 2 gigabytes using R_AMD64_32/32S, since only 31 bits are available, smaller than the x86 32-bit mode that can access around 4 gigabytes.

In some cases, the compiler can optimize by using R_AMD64_PC32, which encodes the difference between the memory location to be accessed and the current code location. But this offset is also only 32 bits wide and might still be insufficient for programs with very large data objects.
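
The range conditions listed above for R_AMD64_32 and R_AMD64_32S can be written as two small predicates. The sketch below simply encodes the conditions as this article states them. The function names are hypothetical, and this is not the linker's actual code.

```c
#include <stdint.h>

/* R_AMD64_32 condition as stated above: the upper 33 bits are all zero. */
int fits_r_amd64_32(uint64_t value)
{
    return (value >> 31) == 0;
}

/* R_AMD64_32S condition as stated above: the upper 33 bits are
   either all zeros or all ones. */
int fits_r_amd64_32s(uint64_t value)
{
    uint64_t upper = value >> 31;   /* the top 33 bits of the value */

    return upper == 0 || upper == 0x1FFFFFFFFULL;
}
```

For the value in the error message, 0xffffffffc0410ba0, the upper 33 bits are all ones, so fits_r_amd64_32 returns 0 while fits_r_amd64_32s returns 1. This is consistent with the linker rejecting the value for an R_AMD64_32 relocation.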

Workaround Solution

The real solution is to use the medium model in the ABI, which will be available in future compiler releases. Under that model, small data objects are accessed via 32-bit memory references and large data objects via 64-bit memory references.

Meanwhile, you can work around a "does not fit" relocation error in either of two ways:

1. Use the -Kpic option, which creates position-independent code. The compiler then generates 64-bit memory references by using register indirection through the Global Offset Table, with the R_AMD64_GOTPCREL relocation type. This works as long as the difference between the current code location and the Global Offset Table entry for the corresponding data object fits in 32 bits.

2. Allocate all static data objects on the heap, then reference the objects via pointer indirection.

Note that either workaround may cause a small performance degradation in memory access because of the added indirection.
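
The second workaround might look like the following sketch, in which only a pointer remains in static storage and the large buffer itself comes from the heap. The function names are illustrative, and error handling is reduced to a return code.

```c
#include <stdlib.h>

/* Only this pointer lives in static storage, so its address fits easily. */
static char *buf;

/* Allocate `size` zero-filled bytes on the heap.
   Returns 0 on success and -1 if the allocation fails. */
int init_buf(size_t size)
{
    free(buf);                 /* release any earlier allocation */
    buf = calloc(size, 1);
    return buf != NULL ? 0 : -1;
}

/* Every access goes through the pointer, which is the added indirection. */
char *get_buf(void)
{
    return buf;
}

/* Release the buffer when it is no longer needed. */
void release_buf(void)
{
    free(buf);
    buf = NULL;
}
```

In the t.c example, the 3-gigabyte array arr would be replaced by a call such as init_buf((size_t) 3 << 30) early in main, with later code using get_buf() in place of arr.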


