Performance Tuning With Sun Studio Compilers and Inline Assembly Code

By Timothy Jacobson, Sun Microsystems, June 2007  
For developers who need faster performance out of C, C++, or Fortran programs, Sun Studio compilers provide several efficient methods. Performance tuning has always been a difficult task requiring extensive knowledge of the machine architecture and instructions. To make this process easier, the Sun Studio C, C++, and Fortran compilers provide easy-to-use performance flags.

By using performance flags, developers can quickly improve execution speed. However, sometimes compiler flags alone do not result in optimum performance. For this reason, Sun Studio compilers also allow inline assembly code to be placed in critical areas. The inline code behaves similarly to a function or subroutine call, which enables cleaner, more readable code and also enables variables to be directly accessed in the inline assembly code.

This paper provides a demonstration of how to measure the performance of a critical piece of code. An example using a compiler flag and another example using inline assembly code are provided. The results are compared to show the benefits and differences of each approach.



For demonstration purposes, this paper uses an academic program that generates the Mandelbrot set. The example program is written in C. First, the computation of all the pixel values is timed with the program compiled by the Sun Studio compiler without optimization. Then, optimization flags are used and the computation is timed again. Finally, Sun Studio inline assembly code is used, and the computation is timed once more and compared with the previous timings. The examples demonstrate two different methods for improving performance with the Sun Studio compiler: using flags and using inline assembly code.

Example 1: The Mandelbrot Set Algorithm

The Mandelbrot program calculates a value for each pixel of a display that is 1000 pixels by 1000 pixels. Each pixel represents a position in the complex plane. A value from 0 to 255 is calculated for it by performing a series of multiplications and additions. This iteration process is the heart of the Mandelbrot set algorithm, which is shown in Example 1.

Let c = a + bi, a coordinate in the complex plane
z_1 = c
z_2 = z_1^2 + c
z_3 = z_2^2 + c
z_4 = z_3^2 + c
z_5 = z_4^2 + c
z_{k+1} = z_k^2 + c
until |z_{k+1}|^2 >= 4.0 or k+1 = 255

The color values are placed in a two-dimensional array of integers that is also 1000 by 1000 elements in size. It is the calculation and placement of these values into the array that is timed so that any latency caused by displaying the pixels can be avoided.

Example 2: Mandelbrot Calculation in C

The Mandelbrot program is written in C (as shown in Example 2), but similar results can be found using Sun Studio C++ and Fortran compilers.

start = gethrtime();
for (i = 0; i < disp_width; i++) {
    for (j = 0; j < disp_height; j++) {
        /* map pixel (i, j) to the point x + yi in the complex plane */
        x = ((float)i * scale_real) - 2;
        y = ((float)j * scale_imag) - 2;
        u = 0.0;
        v = 0.0;
        u2 = 0.0;
        v2 = 0.0;
        iter = 0;
        /* iterate z = z*z + c until |z|^2 reaches 4.0 or max_iter is hit */
        while (u2 + v2 < 4.0 && iter < max_iter) {
            v = 2 * v * u + y;
            u = u2 - v2 + x;
            u2 = u * u;
            v2 = v * v;
            iter = iter + 1;
        }
        array[i][j] = iter;
    }
}
end = gethrtime();
printf("Time = %lld nsec\n", end - start);

In the Mandelbrot algorithm, the majority of the time is spent in the double-nested loop and calculating the values of the pixel colors, as shown in Example 2.
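
The timed region can also be tried outside Solaris. The following is a minimal sketch, not the paper's program: it assumes a POSIX system and substitutes clock_gettime() for gethrtime(), and the names now_nsec and time_mandelbrot, along with the flat out array, are hypothetical helpers introduced only for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical stand-in for gethrtime(): nanoseconds from a monotonic clock. */
static long long now_nsec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Fills out[i * height + j] with the iteration count for pixel (i, j)
   and returns the elapsed time of the calculation in nanoseconds. */
static long long time_mandelbrot(int width, int height, int max_iter, int *out)
{
    float scale_real = 4.0f / (float)width;
    float scale_imag = 4.0f / (float)height;
    long long start, end;
    int i, j;

    start = now_nsec();
    for (i = 0; i < width; i++) {
        for (j = 0; j < height; j++) {
            float x = ((float)i * scale_real) - 2;
            float y = ((float)j * scale_imag) - 2;
            float u = 0.0f, v = 0.0f, u2 = 0.0f, v2 = 0.0f;
            int iter = 0;

            while (u2 + v2 < 4.0f && iter < max_iter) {
                v = 2 * v * u + y;
                u = u2 - v2 + x;
                u2 = u * u;
                v2 = v * v;
                iter = iter + 1;
            }
            out[i * height + j] = iter;
        }
    }
    end = now_nsec();
    return end - start;
}
```

Calling time_mandelbrot(1000, 1000, 255, buffer) with a buffer of 1000 * 1000 integers corresponds to the display size used in this paper. Only the calculation is inside the timed region, so display latency does not affect the result.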

Example 3: Compiling and Timing mandelbrot.c

To establish a baseline for timing, the program is compiled using the Sun Studio C compiler with no special flags or optimizations, as shown in Example 3.

$  cc -xarch=amd64 -o Mandelbrot Mandelbrot.c -lX11
$  Mandelbrot
Time = 434277313 nsec

Example 4: Compiling and Timing mandelbrot.c With -fast

One of the easiest ways a developer can get faster performance is to use the -fast flag in the Sun Studio compilers. The -fast flag is an umbrella flag that invokes a collection of flags in the correct dependency order to achieve optimization. For further details on -fast, see the compiler man page (that is, man cc, man CC, or man f90). Example 4 shows how -fast is used as a compilation option.

$  cc -xarch=amd64 -fast -o Mandelbrot Mandelbrot.c -lX11
$  Mandelbrot
Time = 206874465 nsec

Wow, the -fast option has more than cut the time in half. The beautiful thing about this is that it was so easy to use. However, it is widely believed that for ultimate performance, a program should be written in assembly code. Writing assembly code is not trivial for most people. To make it easier, Sun provides a clean, inline assembly feature called .il, which looks like a function call in the code.

Example 5: Sun Studio Inline Assembly Declaration for C

For the next example, a .il descriptor named mandel_il is used. It is declared just like any function would be, as shown in Example 5.

The naming convention for inline code is name_il(arg1, arg2, ...);. Example 5 passes four arguments to the inline code and returns an integer to the main program. The argument list tells the compiler how to arrange variables in the correct registers.

int mandel_il(float, float, float, int);

Example 6: Mandelbrot Calculation in C With Inline Assembly Code

The inline assembly code replaces the critical code in the while loop and appears as a function call, as shown in Example 6. This makes the program more readable because the assembly instructions are hidden in the inline file. With a real function call, a jump to somewhere else in the code would occur, which causes latency. With inline code, there is no jump, so the stack pointer can continue without interruption. By placing only the variables needed in the argument list, the inline code knows which registers hold those arguments. The arguments are passed to the inline code portion in the same register order that would be found in a function call. Likewise, the return value is placed in the register that would normally be used for the return of a function. This allows inline assembly code to be consistent and reusable, similar to a macro.

scale_real = 4.0 / (float)disp_width;
scale_imag = 4.0 / (float)disp_height;

start = gethrtime();
for (i = 0; i < disp_width; i++) {
    for (j = 0; j < disp_height; j++) {
        x = ((float)i * scale_real) - 2;
        y = ((float)j * scale_imag) - 2;
        /* the whole while loop is replaced by the inline code */
        array[i][j] = mandel_il(x, y, 4.0, max_iter);
    }
}

The actual inline code is stored in a separate file whose name ends in .il. This example uses one such file. When a filename.il entry is included on the compile line, it tells the Sun Studio compiler that inline code is used. The compiler then searches that .il file for a section beginning with .inline name_il.

Example 7: Typical Inline Assembly Code Template

The key structures of the inline code are shown in Example 7. Each inline portion begins with the .inline keyword followed by the name used in the C code, a comma, and finally the value 0. The last line is .end, which is a keyword indicating the conclusion of the inline assembly code.

.inline name_il, 0

// inline code placed here

.end

For calculation of the Mandelbrot set, the inline code first needs to read the four input values. These are held in registers. Scalar floating-point parameters are passed in registers %xmm0, %xmm1, %xmm2, and so on. Scalar integer parameters are passed in %rdi, %rsi, %rdx, %rcx, and so on.

In this example, the AMD64 Application Binary Interface (ABI) is used to define the registers that are used. Each architecture has an ABI that defines the register order for passing parameters. Also, the ABI defines what register is used to pass a parameter back to the calling routine. This example returns an integer in the %rax register, according to the AMD64 ABI.

Example 8: Inline Assembly Code for the Iterative Mandelbrot Calculation

Knowing all these facts, the inline code can be written, as shown in Example 8.

.inline mandel_il,0
// x is stored in %xmm0
// y is stored in %xmm1
// 4.0 is stored in %xmm2
// max_iter is stored in %rdi

// set registers to zero
// u = %xmm3, v = %xmm4, u2 = %xmm5, v2 = %xmm6, scratch = %xmm7
  xorps %xmm3, %xmm3
  xorps %xmm4, %xmm4
  xorps %xmm5, %xmm5
  xorps %xmm6, %xmm6
  xorps %xmm7, %xmm7
  xorq %rax, %rax

.loop:
// check to see if u2 + v2 >= 4.0
  movss %xmm5, %xmm7
  addss %xmm6, %xmm7
  ucomiss %xmm2, %xmm7
  jp     .exit
  jae    .exit

// v = 2 * v * u + y
  mulss %xmm3, %xmm4
  addss %xmm4, %xmm4
  addss %xmm1, %xmm4
// u = u2 - v2 + x
  movss %xmm5, %xmm3
  subss %xmm6, %xmm3
  addss %xmm0, %xmm3
// u2 = u * u
  movss %xmm3, %xmm5
  mulss %xmm3, %xmm5
// v2 = v * v
  movss %xmm4, %xmm6
  mulss %xmm4, %xmm6

// iter = iter + 1; repeat while iter < max_iter
  incl %eax
  cmpl %edi, %eax
  jl .loop

.exit:
// end of mandel_il
.end
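
When writing a template like this, it can help to keep a plain C function with the same signature to check the assembly against. The following is a minimal sketch under that assumption; the name mandel_ref is hypothetical and is not part of the paper's program. It follows the control flow of the inline code: test the escape condition, update v, u, u2, and v2, then increment and compare the counter.

```c
/* Hypothetical C reference for the mandel_il inline template.
   Arguments arrive in the same order as in the inline code:
   x, y, and limit as floats, max_iter as an integer. */
static int mandel_ref(float x, float y, float limit, int max_iter)
{
    float u = 0.0f, v = 0.0f, u2 = 0.0f, v2 = 0.0f;
    int iter = 0;

    do {
        /* Leave when u2 + v2 >= limit, or when the comparison is
           unordered (NaN), matching the jae and jp branches. */
        if (!(u2 + v2 < limit))
            break;
        v = 2 * v * u + y;   /* v = 2 * v * u + y */
        u = u2 - v2 + x;     /* u = u2 - v2 + x   */
        u2 = u * u;
        v2 = v * v;
        iter++;
    } while (iter < max_iter);

    return iter;
}
```

For max_iter of 1 or more, this returns the same count as the while loop in Example 2. Comparing mandel_ref(x, y, 4.0, max_iter) against mandel_il for a handful of points is a quick way to catch a wrong register or a misplaced label.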

Example 9: Compiling and Timing for mandelbrot.c With Inline Assembly Code

To compile code that uses .il inline code, add filename.il to the compile line. The Sun Studio compiler then searches for the .inline name_il keyword to start reading the inline code into the program. Example 9 shows the compile line and timing result from the inline code.

$  cc -xarch=amd64 -o Mandelbrot2 Mandelbrot2.c -lX11
$  Mandelbrot2
Time = 242519854 nsec


The timing for the inline assembly code also shows a significant improvement over the baseline. However, it does not perform quite as well as the -fast method. There are several reasons this might be the case. One is that the Mandelbrot set iterates very few times for many points in the array. The inline example has to do extra work to move values into registers before and after. For cases where the inline code does not iterate, this might be an inefficient use of time. The longer the iteration, the more beneficial inline code becomes.

Another reason is that -fast might be unrolling the loops that traverse the array. This could be done in the inline code as well; however, the code would be much longer and more complicated to write by hand. In the inline code, only one floating-point number is stored in each register, but these registers can hold up to four 32-bit floating-point numbers. The -fast version might be using the multiple data features of Streaming SIMD Extensions 2 (SSE2) better than the inline version. Finally, -fast might be using prefetch instructions for the data.
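
To make the four-values-per-register point concrete, the following is a minimal sketch in portable C of iterating four pixels in lock step, one per lane. It is an illustration only: it does not show what -fast actually generates and does not use SSE2 instructions directly, and the names mandel4 and LANES are hypothetical.

```c
#define LANES 4   /* four 32-bit floats fit in one 128-bit register */

/* Hypothetical helper: computes the iteration counts for four points
   (x[k], y[k]) at once. A lane stops updating once it has escaped,
   while the remaining lanes keep iterating. */
static void mandel4(const float x[LANES], const float y[LANES],
                    float limit, int max_iter, int iter[LANES])
{
    float u[LANES] = {0}, v[LANES] = {0}, u2[LANES] = {0}, v2[LANES] = {0};
    int k, n, any_active = 1;

    for (k = 0; k < LANES; k++)
        iter[k] = 0;

    for (n = 0; n < max_iter && any_active; n++) {
        any_active = 0;
        for (k = 0; k < LANES; k++) {
            if (u2[k] + v2[k] < limit) {        /* lane k has not escaped */
                v[k] = 2 * v[k] * u[k] + y[k];
                u[k] = u2[k] - v2[k] + x[k];
                u2[k] = u[k] * u[k];
                v2[k] = v[k] * v[k];
                iter[k]++;
                any_active = 1;
            }
        }
    }
}
```

The sketch also shows why a hand-written version is harder: the four lanes usually escape after different numbers of iterations, so the code has to track which lanes are still active rather than branching on a single comparison.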

One might think that combining inline code with the -fast flag would yield even greater improvements. This is not the case for this example. Compilation fails because of duplicate labels: -fast unrolls the loop that contains the inline code. Because the inline code uses one label to jump to the top of the iteration loop (.loop) and another to jump to the end (.exit), each label appears multiple times in the unrolled loop. The compiler does not modify the inline code to rename these labels. With some tinkering, it would be possible to match the performance of -fast with inline assembly code. For the developer who has unlimited time and wants the satisfaction of getting every bit of performance, this might be the thing to do.

Since most developers want good performance with little time and effort, this raises an important question. Why bother to write inline code when the -fast flag performs better? The answer is that -fast doesn't always perform better than inline code. The -fast flag makes assumptions that work very well for an example such as this, but there are cases where -fast does not help much. So for performance tuning, trying -fast is a great place to start. If -fast shows significant improvement, then it might not be worth all the effort to write inline assembly code.

Conversely, if -fast does not make significant improvements, exploring inline assembly code is the better option. Sun Studio compilers provide the flexibility to do either, which benefits developers.


For further information on performance tuning, has the most current documentation on Sun Studio compilers. In the Product categories, Sun Studio is under Software -> Application Development -> Development Tools.
