Assembly Language Techniques for Oracle Solaris on x86 Platforms

By Paul Lovvik, June 2004

Contents
 
Introduction
Assembly Language Starter Template for the Solaris OS
Assembly Functions for C Applications
Inline Assembly Techniques
Using Assembly to Improve Performance
Conclusion
Appendix A
 
Introduction

Long ago, I took an assembly language course in college that focused on the x86 instruction set (286 at the time) and used MS-DOS as the foundation. A lot has changed since then. Processors are much faster, compilers are much smarter at generating code, and software engineers are writing much larger programs. Has the software world changed enough over the years that application programmers don't need to worry about assembly language any more? Yes and no.

Yes, because companies tend to be more and more concerned with portability and look to new hardware to provide them with the performance they require.

No, because many enterprise solutions are judged on their price/performance ratio, in which any advantage in performance could be rewarded with a higher profit margin.

I have found that many of Sun's partners still use assembly language in their products to ensure that hot code paths are as efficient as possible. While compilers generate much more efficient code today, the resulting code still doesn't always compete with hand-coded assembly written by an engineer who knows how to squeeze performance out of each microprocessor instruction. Assembly language remains a powerful optimization tool: it gives the programmer greater control and, used judiciously, can enhance performance. Assembly language can also be a liability. It requires more specialized talent to maintain than higher-level languages and is not portable.

This paper discusses assembly language techniques for the Solaris Operating System running on the x86 architecture. My focus is not just to help others integrate assembly language into their projects, but also to demonstrate that assembly language is not always the answer for better performance. I will not be covering the x86 instruction set or how to write assembly code. There are many books on the market that cover those topics extensively.

All of the examples provided in this paper can be compiled with either the compiler in Sun Studio software or GCC, except as noted in the text.

 
Assembly Language Starter Template for the Solaris OS

When writing assembly language modules, it is very useful to have a starter file. This file should be designed to improve an engineer's efficiency by reducing errors caused by misremembering the format or the correct order of the instructions for setting up and tearing down the stack for a function. Rather than simply provide my own rendition of such a file, I will demonstrate how to arrive at your own boilerplate that contains the features that you need. This can be done easily using a compiler and a bit of C code.

I will begin by creating a very simple C program. I have chosen an example that illustrates what goes into an assembly language module as well as why hand-coded assembly can be more efficient. Please see the section titled "Using Assembly to Improve Performance" for a discussion of when hand-coded assembly should be used.

My program is a simple variation of HelloWorld:

hello.c

#include <stdio.h>

int
main(int argc, char **argv) {
  int count = argc;
  if (count > 1) {
    int index;
    for (index = 1; index < count; index++) {
      printf("Hello, %s\n", argv[index]);
    }
  } else {
    printf("Hello, world!\n");
  }
  return (0);
}


I have extended the common HelloWorld implementation to greet each person by name, should their names be entered on the command line. I could compile this code, but instead I will use the compiler to generate assembly code:

cc -S hello.c

The -S option causes a new file, hello.s, to be created that contains the assembly version of the C application. GCC can also be used to generate the assembly, though there will be differences in the resulting code. I have added a few comments to the assembly file to make it easier to follow along and compare to the original C code.

hello.s

  .section	.text,"ax"
  .align	4

  .globl	main
  .type	main,@function
  .align	16
main:
  pushl	%ebp
  movl	%esp,%ebp
  subl	$12,%esp
.L93:

/ File hello.c:
/ Line 5
  movl	8(%ebp), %eax         / argc is referenced by
			      / 8(%ebp)
			      / %eax is a temp register
			      / used for loading a
			      / local variable (count)
  movl	%eax, -8(%ebp)        / int count = argc
			      / count is referenced by
			      / -8(%ebp)
/ Line 6
  movl	-8(%ebp), %eax        / set up for comparison
                              / moving count into a
			      / temp register
  cmpl	$1,%eax               / compare count to 1
  jle	.L95                  / if count <= 1, goto
			      / .L95
.L96:
/ Line 8
  movl	$1, -12(%ebp)         / index is a local
			      / variable referenced
                              / by -12(%ebp)
                              / index = 1 (from the
			      / for loop)
  movl	-12(%ebp), %eax       / put index into temp
			      / register for a compare
  cmpl	-8(%ebp),%eax         / index < count?
  jge	.L99                  / if not, goto .L99
.L100:
.L97:
/ Line 9
  movl	-12(%ebp), %edx       / calculate the address
			      / of argv[index]
  movl	12(%ebp), %eax        / argv is referenced by
                              / 12(%ebp)
  movl	0(%eax,%edx,4), %eax  / set tmp var to argv +
                              / (index * 4)
  pushl	%eax                  / Set up for printf call.
                              / All arguments are
			      / pushed onto the stack
			      / in reverse order from
			      / the C code.  Start with
			      / argv[index].
  pushl	$.L101+0              / push "Hello, %s\n"

  call	printf                / print (uses libc)
  addl	$8,%esp               / Fix up the stack after
                              / the function call.  All
                              / the arguments passed to
                              / the function must be
                              / removed from the stack.
                              / Here, we simply
			      / manipulate the stack
			      / pointer, which is more
			      / efficient.
  movl	-12(%ebp), %eax       / this is the last part
			      / of the for loop using a
                              / temp register to
  incl	%eax                  / increment the index
  movl	%eax, -12(%ebp)       / reassign to the index
                              / local variable
  movl	-12(%ebp), %eax       / LOOK HERE! The %eax
                              / register already holds
                              / the -12(%ebp) value
                              / from the line above.
                              / This is the comparison
                              / part of the for loop.
                              / Use a temporary
			      / variable to hold index
			      / for the compare.
  cmpl	-8(%ebp),%eax         / index < count?
  jl	.L97                  / yes, goto .L97
.L102:
.L99:
  jmp	.L103
  .align	4
.L95:
/ Line 12
  pushl	$.L104+0              / set up for the other
                              / printf call push
			      / "Hello, world!\n" onto
			      / the stack
  call	printf
  addl	$4,%esp               / fix up the stack - only
                              / one arg this time
.L103:
/ Line 14
  movl	$0, -4(%ebp)
  jmp	.L92
  .align	4
.L92:
  movl	-4(%ebp), %eax        / the return value goes
			      / in %eax
  movl	%ebp,%esp             / reset the stack pointer
  popl	%ebp
  ret
  .size	main,.-main

  .section	.bss,"aw"
Bbss.bss:
  .zero	0
  .type	Bbss.bss,@object
  .size	Bbss.bss,0

  .section	.data,"aw"
Ddata.data:
	.zero	0
	.type	Ddata.data,@object
	.size	Ddata.data,0

	.section	.rodata,"a"
Drodata.rodata:
	.zero	0
	.type	Drodata.rodata,@object
	.size	Drodata.rodata,0

	.section	.rodata1,"a"

	.align	4
.L101:  /         H    e    l    l    o    ,       
	.byte	0x48,0x65,0x6c,0x6c,0x6f,0x2c,0x20
        /         %    s   \n
	.byte   0x25,0x73,0xa,0x0
	.set	.,.+1
	.type	.L101,@object
	.size	.L101,12
	.align	4
.L104:  /         H    e    l    l    o    ,      
	.byte	0x48,0x65,0x6c,0x6c,0x6f,0x2c,0x20
        /         w    o    r	 l    d    !   \n
	.byte   0x77,0x6f,0x72,0x6c,0x64,0x21,0xa,0x0
	.type	.L104,@object
	.size	.L104,15
	.type	printf,@function

It is an interesting exercise to read machine-generated assembly code and find ways to make it more efficient. There are at least a couple of improvement opportunities in the code above. For example, at the instruction marked "LOOK HERE!", the contents of the index variable are loaded into %eax even though %eax already holds the contents of index. This is not ideal, but even having this inefficiency repeated within a loop does not significantly affect the overall performance of this application.

I used this compiler-generated assembly code to create a starting point for all of the assembly modules that will be presented in this paper. To create this template, I generated assembly code many times from the compiler in Sun Studio 9 software and GCC, both with and without optimization enabled. The resulting template represents a reasonable minimum set.

Start.s

/ This is a template for creating assembly language
/ programs or modules for Solaris x86.
/
/ Before coding, replace all instances of "main" with
/ the name of your function.
/
/ Build using cc
	
.section	.text      / code section
.globl	main
  .type	main, @function
  .align	16
main:                      / start of the function
preamble:                  / the preamble sets up the
                           / stack frame
  pushl	%ebp
  movl	%esp,%ebp
  pushl	%ebx
  pushl	%esi
  pushl	%edi

/ Function code goes here
	

finale:                    / the finale restores the
                           / stack
  popl	%edi               / and returns to the caller.
  popl	%esi
  popl	%ebx
  movl	%ebp,%esp
  popl	%ebp
  ret
  .size	main,.-main


.section	.data      / contains initialized data
Ddata.data:
  .type	Ddata.data,@object
  .size	Ddata.data,0

.section	.bss       / contains uninitialized
                           / data
Bbss.bss:
  .type	Bbss.bss,@object
  .size	Bbss.bss,0

The first section of the module is for assembly code. The label "main" indicates the starting point of the assembly function. The preamble sets up the stack frame by preserving the contents of the %ebp register and then configures %ebp to point to the top of the stack. If this module is to be linked with an optimized C application, the contents of %ebx, %esi, and %edi must be preserved.

The finale returns the stack pointer, %ebp, %edi, %esi, and %ebx registers to their original values and then returns control to the caller.

The Ddata.data section is the place to put initialized data. By convention, the data in the Ddata.data section is considered constant.

The Bbss.bss section is for non-constant data. This non-constant data can be module variables that are allocated and will be initialized and used during the execution of the application.

I have written this template such that the function name will be "main". To create a module containing a function other than main, simply do a global search and replace "main" with your function name.
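For example, to produce a module for a hypothetical function named myfunc (the name is only an illustration), a command like the following would do the renaming:

sed 's/main/myfunc/g' start.s > myfunc.s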

At this point, a simple assembly program can be written by hand using the start.s template. For this example, I will keep the function name as "main", and will write the classic Hello World program in assembly:

HelloWorld.s

.section	.text      / code Section
.globl	main
  .type	main, @function
  .align	16
main:                      / start of the function
preamble:                  / the preamble sets up the
                           / stack frame
  pushl	%ebp
  movl	%esp,%ebp
  pushl	%ebx
  pushl	%esi
  pushl	%edi

/ Function code goes here
  pushl	$Hello             / prepare for a call to
                           / printf by pushing the
                           / arguments onto the stack
  call	printf
  addl	$4, %esp           / fix the stack upon return
                           / of printf

  movl	$0, %eax           / set up the return value
                           / (returned in %eax)

finale:                    / the finale restores the
                           / stack
  popl	%edi               / and returns to the caller.
  popl	%esi
  popl	%ebx
  movl	%ebp,%esp
  popl	%ebp
  ret
  .size	main,.-main


.section	.data      / contains initialized data
Ddata.data:
  .type	Ddata.data,@object
Hello:
  .string "Hello, world!\n"	
  .size	Ddata.data,0

.section	.bss       / contains uninitialized
                           / data
Bbss.bss:
  .type	Bbss.bss,@object
  .size	Bbss.bss,0

The classic two-line C program translates into four lines of assembly body plus the standard wrapping for an assembly language application.

In the function code section, the address of the constant "Hello, world!\n" string is pushed onto the stack and the printf function is called. The return value is set into %eax, and finally the stack is cleaned up simply by changing the stack pointer and resetting the values of a few registers.

Compile this application with the command cc -o HelloWorld HelloWorld.s.
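Running the resulting program prints the greeting:

./HelloWorld
Hello, world!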

In the previous example, I chose to write the main function in assembly language. It is just as easy to write an assembly language module that can be linked with a C application. In this case, the name of the function must be changed and a C header file should be provided so the C compiler can do automatic parameter checking.
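A minimal header for such a module might look like this (the function name and signature here are hypothetical, chosen only for illustration):

myfunc.h

/*
 * Prototype for the assembly function, so that C callers
 * get automatic parameter checking.
 */
extern int myfunc(int count, char **names);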

 
Assembly Functions for C Applications

In the last section, I presented a template that will make it easy to get started writing an assembly language program or function. In this section, I will demonstrate how to write assembly language functions for the Solaris OS. This section focuses on the use of the stack to read parameters, return values, and allocate function-local variables.

The stack plays a central role in assembly language work. It is the scheme used for functions to communicate values to other functions and is used for creating temporary variables within a function.

Accessing Function Parameters

The parameters for a function call are pushed onto the stack in the reverse order that they appear in C code. That way, the first parameter in the C call will be the value closest to the top of the stack in the assembly function. The other parameters will then appear in the same order as in the C code, as they are popped off the stack.

The following example illustrates the idea:

int foo(int a, int b, int c);
...
int result = foo(1, 2, 3);

...
foo:
preamble:
...
	popl	%eax		/ %eax = 1
	popl	%ebx		/ %ebx = 2
	popl	%ecx		/ %ecx = 3

This is rarely how the assembly function would be written because the stack has to be returned to the caller in essentially the same condition. That is, the %esp register must point to the same address as before the call. Also, because popping values from the stack is less efficient than referencing them from within the stack, the popl instruction is rarely used to read the parameters.

Instead, the values are used within the stack. This also makes it possible to refer to the parameter values multiple times without having to set up a temporary variable to hold the parameter value or use a dedicated register. The values can be read into a register using a movl instruction.

This illustration shows the locations within the stack:

    
    ...
    %ebp + 12	/ Second parameter (if present)
    %ebp + 8	/ First parameter (if present)
    %ebp + 4	/ Return address
    %ebp + 0	/ Caller's %ebp value
    %ebp - 4	/ Caller's %ebx value
    %ebp - 8	/ Caller's %esi value
    %ebp - 12	/ Caller's %edi value
    %ebp - 16	/ temporary storage
    %ebp - 20	/ temporary storage
    ...

Note that this illustration has offsets specific to the way I set up the stack frame in my start.s template. If the template is changed (by pushing more or fewer registers on the stack in the preamble), the offsets in the stack must be changed accordingly.

Here is a simple function that adds the values of the three parameters passed to it and prints the result.

add3_a.s

add3:			/ start of the function
                        / void add3(int a, int b,
                        /           int c); 
preamble:               / the preamble sets up the
                        / stack frame
  pushl	%ebp
  movl	%esp,%ebp
  pushl	%ebx
  pushl	%esi
  pushl	%edi

/ Function code goes here
  movl	8(%ebp), %ecx   / %ecx represents the sum, set
                        / to parameter a
  addl	12(%ebp), %ecx  / add value of parameter b
  addl	16(%ebp), %ecx  / add value of parameter c

                        / set up for print - calling
                        / like this:
                        / printf("sum=%d\n", sum);
  pushl	%ecx            / push sum (2nd param) first
  pushl	$sumString      / push the address of the
                        / format string
  call	printf
  addl	$8, %esp        / adjust the stack pointer to
                        / account for the 2 parameters
                        / pushed on to call printf

finale:                 / the finale restores the stack
  popl	%edi            / and returns to the caller.
  popl	%esi
  popl	%ebx
  movl	%ebp,%esp
  popl	%ebp
  ret
  .size	add3,.-add3

.section	.data    / contains initialized data
Ddata.data:
	.type	Ddata.data,@object
sumString:
	.string	"sum=%d\n"

	.size	Ddata.data,0

The call to the printf function makes this a good example of how to read parameters and also how to call a function with parameters.
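For illustration, here is a minimal C driver of my own (the file names and the build command are assumptions, and it presumes that add3_a.s also carries the .text, .globl, and .type boilerplate from the start.s template):

main_add3.c (hypothetical driver)

extern void add3(int a, int b, int c);  /* the assembly function */

int
main(void) {
  add3(1, 2, 3);          /* should print "sum=6" */
  return (0);
}

A command along the lines of cc -o add3test main_add3.c add3_a.s would build the pair.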

Returning a Value

When a function returns, its return value is passed through %eax. To demonstrate this idea, I will remove the printf call from the previous add3 example. Instead of printing the result, the sum will be passed back to the caller as a return value.

add3_b.s

add3:                    / start of the function
                         / int add3(int a, int b,
                         /          int c); 
preamble:                / the preamble sets up the
                         / stack frame
  pushl	%ebp
  movl	%esp,%ebp
  pushl	%ebx
  pushl	%esi
  pushl	%edi

/ Function code goes here
  movl	8(%ebp), %ecx    / %ecx represents the sum, set
                         / to parameter a
  addl	12(%ebp), %ecx   / add value of parameter b
  addl	16(%ebp), %ecx   / add value of parameter c
  movl	%ecx, %eax       / set the return value

finale:                  / the finale restores the
                         / stack
  popl	%edi             / and returns to the caller
  popl	%esi
  popl	%ebx
  movl	%ebp,%esp
  popl	%ebp
  ret

It should be obvious from these examples (and the explanation of how the stack is used) that the values of the parameters on the stack can also be modified, should more than one return value be required, provided the caller reads those stack locations back before it removes the arguments from the stack.

Creating a Local Variable

It is also possible to allocate space on the stack to create local variables. In this short example, I add the parameters in the same way as in the previous code sample. But instead of using %ecx as the sum, I use space on the stack for that purpose. This technique is valuable for algorithms that require more variables than available registers.

add3_c.s

add3:                    / start of the function
                         / int add3(int a, int b,
                         /          int c); 
preamble:                / the preamble sets up the
                         / stack frame
	pushl	%ebp
	movl	%esp,%ebp
	pushl	%ebx
	pushl	%esi
	pushl	%edi

/ Function code goes here
	movl	8(%ebp), %eax
	movl	%eax, -16(%ebp)  / sum = -16(%ebp),
                                 / set to parameter a
	movl	12(%ebp), %eax
	addl	%eax, -16(%ebp)  / add value of
                                 / parameter b
	movl	16(%ebp), %eax
	addl	%eax, -16(%ebp)  / add value of
                                 / parameter c
	movl	-16(%ebp), %eax  / Set the return value

finale:                  / the finale restores the
                         / stack and returns to the
	popl	%edi     / caller
	popl	%esi
	popl	%ebx
	movl	%ebp,%esp
	popl	%ebp
	ret

It is not possible to move data from the stack directly into another part of the stack, so I use the %eax register to temporarily hold the parameter values for the movl and addl instructions.

These simple examples show how functions can be called using assembly language and how to deal with parameters, return values, and temporary variables.

Should variables with module scope be required, simply allocate that variable space within one of the data sections. Interestingly, a static variable within a function should also be allocated in one of the data sections. Static variables are not scoped within the function at the assembly level at all, but the C language makes them accessible only from within the function that established them.
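As a sketch of this idea (the function and variable names here are my own, not part of the template), a module-scope counter could be reserved in the .bss section and updated like this:

counter.s (sketch)

.section	.text
.globl	bump_counter
  .type	bump_counter, @function
  .align	16
bump_counter:              / int bump_counter(void);
  pushl	%ebp
  movl	%esp,%ebp
  movl	counter, %eax      / load the module variable
  incl	%eax               / counter++
  movl	%eax, counter      / store it back; %eax also
                           / carries the return value
  movl	%ebp,%esp
  popl	%ebp
  ret
  .size	bump_counter,.-bump_counter

.section	.bss       / uninitialized (writable) data
counter:
  .zero	4                  / module-scope counter,
                           / starts at zero

Note that this uses absolute addressing for counter, which is fine for an ordinary executable but would need adjustment for position-independent code.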

 
Inline Assembly Techniques

In the last section I talked about how to write assembly language functions. All of the code in the previous section was written in assembly language. Sometimes it makes sense to improve the performance of a C function using bits of assembly language rather than a wholesale replacement of a C function with an assembly function. This section examines how to do exactly that.

In C, the __asm function can be used to include assembly language in a function. The "Assembly Functions for C Applications" section is key to understanding how to integrate assembly language snippets in your C code so you can make use of function parameters and any local variables that have been established.

As a first example, I demonstrate how to write a C function that uses inline assembly to add three integers and return the result. I chose this example so the assembly module code can be compared directly with the inline assembly code.

add3_inline_a.c

int add3(int a, int b, int c) {
  volatile int result = 0;

  __asm("\n\
  movl	8(%ebp), %ecx  / %ecx is the sum, set to a  \n\
  addl	12(%ebp), %ecx / add b (the second param)   \n\
  addl	16(%ebp), %ecx / add c                      \n\
  movl	%ecx, -8(%ebp) / move the result into the   \n\
                       / result variable            \n\
        ");
  return (result);
}

 

Essentially the %ecx register is used to sum the three parameters. The value of that register is then set into the location representing the result local variable so it will be returned to the caller. The volatile modifier for the result variable is necessary so the compiler doesn't simply return 0. The volatile keyword is a hint to the compiler that it must actually read the contents of the variable and not rely on a value cached in a register.

That example seems straightforward, and this code might look like less trouble to write and maintain than some of the previous examples. Not true. This example works only if it is compiled with no optimization enabled. Because the stack frame is set up differently when the compiler generates optimized code, it will not work when compiled with optimization flags. The problem is that the result variable is placed at a different offset on the stack depending on how the function is compiled.

With a bit of investigation, I was able to get the code to work with the compiler in Sun Studio software, regardless of whether optimization was turned on. However, the same code did not work when compiled with GCC using optimization. Here is the new code:

add3_inline_b.c

int add3(int a, int b, int c) {
  volatile int result = 0;

  __asm("\n\
  movl	8(%ebp), %ecx	/ %ecx is the sum, set to a \n\
  addl	12(%ebp), %ecx	/ add b (the second param)  \n\
  addl	16(%ebp), %ecx	/ add c                     \n\
  movl	%ecx, (%esp)    / move the result into the  \n\
                        / result variable           \n\
  movl	%ecx, %eax      / also set the return value \n\
                        / just in case              \n\
       ");
  return (result);
}

The difference is that now I am referencing the local variable "result" based on the stack pointer rather than %ebp. Generally, a module that uses inline assembly in this way has very specific requirements about how it is compiled.

It really doesn't make sense to use inline assembly in such a function. In this case, the C code is only setting up the stack frame and returning. It would be far more beneficial to write this entire function in either C or assembly and avoid being tied to a single compiler or specific compiler flags for this module.

In the remainder of this section, I present three inline assembly solutions to solve the same problem. This allows the various schemes to be understood and compared. I hope this will make it easy to choose the most appropriate solution should inline assembly be called for in your project.

In this example, I take an array of strings and return a corresponding array of integers, each of which indicates the length of its respective string.

getStringLengths.c

#include <stdlib.h>
#include <string.h>

int *getStringLengths(int size, char **strings) {
  int *lengths = calloc(size, sizeof (int));
  int index;

  for (index = 0; index < size; index++) {
    lengths[index] = strlen(strings[index]);
  }
  return (lengths);
}

Because I am using the strlen function, I think I can improve the performance using assembly language, provided I don't have to endure the function call overhead or execute a jump that requires a memory access. That means I can't simply write my own assembly version of the strlen function; instead, I need to write an inline version of the strlen functionality.

Below, I include three different ways of doing this.

The first way is similar to the last inline example and will have similar problems with getting references to local variables. I simply use __asm() to include the assembly instructions directly into the code.

getStringLengths_inline_a.c

#include <stdlib.h>

int *asmGetStringLengths(int size, char **strings) {
  int *lengths = calloc(size, sizeof (int));
  int index;

  for (index = 0; index < size; index++) {
    __asm("\n\
  movl	0(%esp), %edx		/ index             \n\
  movl	12(%ebp), %ecx		/ string array      \n\
                                / address           \n\
  movl	0(%ecx, %edx, 4), %ecx	/ string element    \n\
                                / address           \n\
                                                    \n\
  xor	%eax, %eax		/ %eax = 0 (this is \n\
                                / the counter)      \n\
  jmp	asm_charCompare                             \n\
asm_loopStart:                                      \n\
  addl	$1, %eax                                    \n\
asm_charCompare:                                    \n\
  movsbl	(%ecx, %eax), %edx                  \n\
  cmpl	$0, %edx                                    \n\
  jne	asm_loopStart                               \n\
asm_loopEnd:		        / at this point,    \n\
                                / %eax contains the \n\
                                / length            \n\
  movl	%eax, %ecx		/ perform the       \n\
                                / assignment back to \n\
  movl	0(%esp), %edx		/ lengths[index]    \n\
  movl	4(%esp), %eax                               \n\
  movl	%ecx, 0(%eax,%edx,4)                        \n\
          ");
  }
  return (lengths);
}


This code works correctly using the compiler in Sun Studio software or GCC (provided that optimization flags are not used).
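To exercise any of the three variants, a small driver of my own devising (not part of the article's listings) can be linked in; only the extern declaration needs to match the variant being tested:

driver.c (hypothetical)

#include <stdio.h>
#include <stdlib.h>

extern int *asmGetStringLengths(int size, char **strings);

int
main(int argc, char **argv) {
  int *lengths = asmGetStringLengths(argc, argv);
  int index;

  for (index = 0; index < argc; index++) {
    printf("%s: %d\n", argv[index], lengths[index]);
  }
  free(lengths);
  return (0);
}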

For the second scheme, I use the inline procedure call expander. This is only available on the compiler in Sun Studio software, so the code I show here is compiler-specific.

_strlen.il


  .inline _strlen,4      / inline function name, size
                         / of args
  xor	%eax, %eax       / %eax = 0 (this is the 
                         / counter)
  movl	(%esp), %ecx     / %ecx = the string to
                         / determine the length of
  jmp	charCompare
loopStart:
  addl	$1, %eax
charCompare:
  movsbl	(%ecx, %eax), %edx
  cmpl	$0, %edx
  jne	loopStart
loopEnd:                 / at this point, %eax contains
                         / the length

  .end


getStringLengths_inline_b.c

#include <stdlib.h>

extern int _strlen(char *string);

int *getStringLengths(int size, char **strings) {
  int *lengths = calloc(size, sizeof (int));
  int index;

  for (index = 0; index < size; index++) {
    lengths[index] = _strlen(strings[index]);
  }
  return (lengths);
}

Notice that in the C code, I have provided a function prototype for my inline function, and I call the assembly code as if it were a function. Actually, it is not going to be treated as a function call; the assembly code is inserted within the body of the function.

The difference between the inline call expander and the __asm scheme is how values are accessed. With the inline call expander, the arguments of the inline call are pushed onto the stack just before the assembly code is inserted, which provides consistent access to those values. The parameters of the enclosing function can still be accessed within your assembly code exactly as shown in previous examples.

The _strlen.il file does not represent a complete assembly module, so it doesn't require the data sections and stack frame setup code.
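To use the template, the .il file is simply named on the cc command line along with the C source (file names here follow my examples); the compiler expands the inline code wherever _strlen is called:

cc -c getStringLengths_inline_b.c _strlen.il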

When using the inline call expander, it is important that you preserve the values of all registers except %eax, %ecx, and %edx. The return value is handed to the function body in register %eax, the same convention used for the return value of a function. More information is available on the inline call expander; see the manpage: inline(1).

One of the problems you may encounter with the inline call expander scheme is that the same inline call cannot be used in multiple places within a module if the inline code has labels, as my example does. The inline code is inserted into the resulting assembly code wherever you call it, and the labels cause the assembler to exit with an error complaining of duplicate labels. The way to work around this problem is to create additional modules such that the inline code would only be called at most once within each module, or only use this scheme if your code has no labels.

The final example of inline assembly I provide uses Extended Asm and the GCC compiler. Again, this solution is compiler-specific. GCC has extended the __asm command to solve the problem of matching C local variables with stack offsets. This interface is referred to as "Extended Asm".

getStringLengths_inline_c.c


#include <stdlib.h>

int *extasmGetStringLengths(int size, char **strings) {
  int *lengths = calloc(size, sizeof (int));
  int index;

  for (index = 0; index < size; index++) {
    __asm__ __volatile__ ("                         \n\
  xor %%eax, %%eax        / %%eax = 0 (this is the  \n\
                          / counter)                \n\
  jmp	extasm_charCompare                          \n\
extasm_loopStart:                                   \n\
  addl $1, %%eax                                    \n\
extasm_charCompare:                                 \n\
  movsbl (%1, %%eax), %%edx                         \n\
  cmpl $0, %%edx                                    \n\
  jne extasm_loopStart                              \n\
extasm_loopEnd:           / at this point, %%eax    \n\
                          / contains the length     \n\
  movl %%eax, %0                                    \n\
          "
	  : "=r" (lengths[index])
	  : "c" (strings[index])
	  : "eax", "edx", "cc");
  }
  return (lengths);
}


The directives at the end of the __asm command indicate the output registers, the input registers, and the clobbered registers, respectively.

The output operand is a constraint letter indicating which register should be configured to receive the output, and it must be preceded with "=". In this example, I chose "r" for the output, which indicates that the register should be dynamically allocated. At the end of the assembly code, I must move the result into the output register with the following:

	movl	%%eax, %0

The assembly code refers to the first register allocated as %0, the second as %1, and so on. Note that within the assembly I now must use %% to refer to registers.

The input register I have chosen is %ecx. I chose this simply so I could reuse much of the code from my previous two examples with little modification. This register will be loaded with the value of strings[index] before my assembly code is executed.

The clobber list tells the compiler what registers need to be saved before executing my assembly code and restored afterward. In this example, I am using %eax and %edx, and I am changing the state of the flags register (referred to as condition codes). GCC will use this information to save prior state and also to select appropriate nonconflicting registers for those specified as dynamically allocated.

Again, extended asm is specific only to GCC. I include it here for completeness, but my personal policy is to make my code portable between compilers wherever possible.

 
Using Assembly to Improve Performance

It is exciting to get back to bare metal again, isn't it? In reality, though, only the fairly rare situation lends itself to hand-coded assembly. It would be irresponsible to write everything in assembly language or even to include assembly modules for each application.

Assembly code is harder for code maintainers to understand. This translates into higher maintenance costs and a larger defect count. Throwing a bunch of assembly routines into your product means that you need a more specialized maintenance engineering staff. Otherwise, they will either leave the code alone or possibly maintain it in such a way that you lose the performance benefit of choosing assembly in the first place.

The performance improvement also comes at the expense of code portability. For each assembly function, a separate implementation for each supported platform will be required. This also feeds into the higher cost of code maintenance.

Assembly language shouldn't be the first answer when the performance of an application must be improved. Writing an inefficient algorithm in hand-coded assembly doesn't help significantly. It is important to choose the fastest algorithm for the problem and then if more performance is required or desired, recode the algorithm in assembly.

In this section I illustrate one process for improving performance that can be followed to ensure that the required performance is achieved without having to resort to assembly in every case. The process has the following steps:

  1. Write the code in C.
  2. Execute tests to measure the performance and determine if the C code is sufficient.
  3. Try other algorithms and test again.
  4. Use the compiler to generate assembly code from your fastest algorithm if you cannot get the desired performance.
  5. Hand optimize or rewrite the assembly and retest.
  6. Choose the variation that gives you the best measured performance.

To demonstrate the entire process, I use a relatively simple algorithm that can probably benefit from assembly.

The problem is a simple reverse copy function. This function should work similarly to memcpy, except that it copies the memory in reverse order. My goal is to match the performance of this new function with that of the memcpy function.

The performance measurement is performed by a simple test harness that I built specifically for this purpose. The results are written in HTML format to make it easy to spot the best performance at a glance. The code for this test harness is available in Appendix A.
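The harness itself is not reproduced in this section; as a rough sketch of the kind of measurement it performs (this is my own illustration, not the author's harness, and the buffer size and iteration count are arbitrary), a single data point could be collected like this:

timing_sketch.c (illustration only)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

extern void *rcopy_c1(void *dest, const void *src, size_t size);

int
main(void) {
  size_t size = 128 * 1024;        /* 128 KB buffer (arbitrary) */
  int iterations = 1000;           /* arbitrary repeat count */
  char *src = malloc(size);
  char *dest = malloc(size);
  struct timeval start, end;
  double seconds, mbps;
  int i;

  memset(src, 'x', size);
  gettimeofday(&start, NULL);
  for (i = 0; i < iterations; i++) {
    rcopy_c1(dest, src, size);     /* function under test */
  }
  gettimeofday(&end, NULL);
  seconds = (end.tv_sec - start.tv_sec) +
      (end.tv_usec - start.tv_usec) / 1e6;
  mbps = ((double)size * iterations) / (1024 * 1024) / seconds;
  printf("rcopy_c1: %.1f MB/sec\n", mbps);
  free(src);
  free(dest);
  return (0);
}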

I start with a simple C implementation of the reverse copy function:

rcopy_c1.c

#include <stddef.h>

/*
 * Copy the source onto the destination in reverse
 * order.
 */
void *
rcopy_c1(void *dest, const void *src, size_t size) {
  char *destChar = (char *)dest;
  char *srcChar = (char *)src;
  int index;

  for (index = 0; index < size; index++) {
    destChar[size - 1 - index] = srcChar[index];
  }
  return (dest);
}

This code is fairly straightforward. I use a simple loop to traverse the source memory block in order and copy each byte onto the destination memory block. Because the source memory is traversed in order, the destination memory is written in reverse order. Now I run the test harness to see how it compares to the performance of memcpy.

The following table shows the results of the first test.

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            227.4    233.5    233.6    228.7    220.9    223.9    224.2
Baseline (memcpy)   3062.2   1279.4   1294.7   999.3    330.3    338.5    341.6

All figures are in MB/sec.

This code was compiled using cc testharness.c rcopy_c1.c.

This first attempt falls well short of my goal. That's not unexpected, because on most operating systems memcpy is written in assembly language to squeeze all of the performance possible out of the underlying hardware.

This is a really simple algorithm, but maybe I can still improve on it a bit. Where I currently have only one index, I could instead use two. The additional variable makes the math within the loop much simpler (either increment or decrement instead of a variable subtraction).

I could also eliminate the temporary char * variables that hold references to the source and destination memory blocks. Something of the sort is necessary so the pointer arithmetic is done with the proper element size, but casts can serve the same purpose, allowing the temporary variables to be removed. Either way, I have to give the compiler a hint to use the movb instruction rather than movl.

rcopy_c2.c

#include <stddef.h>

void *
rcopy_c2(void *dest, const void *src, size_t size) {
  int srcIndex = 0;
  int destIndex = size - 1;

  while (srcIndex < size) {
    ((char *)(dest))[destIndex--] =
          ((char *)(src))[srcIndex++];
  }
  return (dest);
}

This table shows the results of the second test.

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            226.8    233.9    233.9    228.6    224.3    224.0    223.4
rcopy_c2            227.1    234.3    233.6    228.8    223.7    223.9    223.6
Baseline (memcpy)   3028.6   1500.1   1492.0   1083.7   344.1    352.6    354.6

All figures are in MB/sec.

Once again, the performance of my C version of the rcopy function is nowhere near that of memcpy. Just for fun, take a look at the assembly code for the rcopy_c2 function. I only show the code associated with the function, not all the other segments that are generated. Again, I have added a few comments to make it easier to understand what the code is doing.

rcopy_c2.s

rcopy_c2:
  pushl	%ebp
  movl	%esp,%ebp
  subl	$12,%esp
.L109:

/ File rcopy_c2.c:
/ Line 7
  movl	$0, -8(%ebp)    / -8(%ebp) refers to srcIndex
/ Line 8
  movl	16(%ebp), %eax  / %eax = size
  decl	%eax            / size - 1
  movl	%eax, -12(%ebp) / -12(%ebp) refers to destIndex
/ Line 10
  movl	-8(%ebp), %eax  / need to use %eax for the
                        / compare
  cmpl	16(%ebp),%eax   / srcIndex < size?
  jae	.L113           / if not, goto .L113
.L114:
.L111:
/ Line 11
  movl	12(%ebp), %eax  / %eax = src
  addl	-8(%ebp), %eax  / %eax += srcIndex
  movl	8(%ebp), %edx   / %edx = dest
  addl	-12(%ebp), %edx / %edx += destIndex
  movsbl 0(%eax), %eax  / %eax = contents of address
                        / (%eax)
  movb	%al,0(%edx)     / dest[destIndex] = lower bits
                        / of %eax
  movl	-8(%ebp), %eax  / prepare for increment
  incl	%eax            / increment the srcIndex
  movl	%eax, -8(%ebp)  / update the incremented value
  movl	-12(%ebp), %eax / prepare to decrement
                        / destIndex
  decl	%eax            / decrement destIndex
  movl	%eax, -12(%ebp) / update the destIndex variable
  movl	-8(%ebp), %eax  / prepare to compare srcIndex
                        / to size
  cmpl	16(%ebp),%eax   / srcIndex < size?
  jb	.L111           / if so, goto .L111
.L115:
.L113:
/ Line 13
  movl	8(%ebp), %eax   / prepare the return value
  movl	%eax, -4(%ebp)
  jmp	.L108			
	.align	4
.L108:
  movl	-4(%ebp), %eax  / return value goes in %eax
                        / (it's already there)
  movl	%ebp,%esp       / reset the stack pointer
  popl	%ebp
  ret                   / return to the caller

I see a lot of instructions devoted to moving data into registers for simple computations like comparisons, incrementing, and decrementing. I could improve this with some hand-coded assembly. Following is the third attempt at a reverse copy, written in assembly language. Again, all of the boilerplate has been omitted.

rcopy_a3.s


rcopy_a3:                  / start of the function
preamble:                  / the preamble sets up the
                           / stack frame
  pushl	%ebp
  movl	%esp,%ebp
  pushl	%ebx
  pushl	%esi
  pushl	%edi

/ Function code goes here
  movl	12(%ebp), %esi     / %esi = src. %esi will
                           / *always* be src.
  movl	16(%ebp), %ebx     / %ebx will *always* be
                           / srcIndex.  Move the size
                           / into %ebx. We will 
                           / decrement the srcIndex to
                           / traverse the source
                           / instead of increment it.
                           / This avoids additional
                           / register manipulation 
                           / within the loop.
	
  movl	8(%ebp), %edi      / %edi = dest.  %edi will
                           / *always* be dest.

  xor	%ecx, %ecx         / clear %ecx -  this will be
                           / destIndex
	
  cmpl	$0, %ebx           / is the size zero?
  je	endLoop            / if so, skip the loop
beginLoop:
  movb -1(%esi, %ebx), %al / move one byte
  movb	%al, (%edi, %ecx)  / dest[destIndex] =
                           /       src[srcIndex]

  decl	%ebx               / srcIndex--
  incl	%ecx               / destIndex++

  cmpl	$0, %ebx           / srcIndex > 0?
  ja	beginLoop          / if so, goto beginLoop

endLoop:
  movl	8(%ebp), %eax      / prepare the return value

finale:                    / the finale restores the
                           / stack and returns to the
  popl	%edi               / caller.
  popl	%esi
  popl	%ebx
  movl	%ebp,%esp
  popl	%ebp
  ret

I have cleaned up the code significantly. I have "glued" important indexes and addresses to specific registers which are not changed during the execution of the function. The following table shows the new results.

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            227.1    227.4    234.0    228.8    224.3    223.5    224.4
rcopy_c2            225.7    234.7    234.6    228.9    224.0    222.7    224.1
rcopy_a3            645.9    654.0    650.4    620.6    391.4    386.9    383.3
Baseline (memcpy)   3026.9   1391.4   1340.6   1044.7   358.7    362.3    360.7

All figures are in MB/sec.

These results are much better. The rcopy_a3 function even beats memcpy performance for some copy sizes!

I have now written a version of rcopy that has beaten memcpy for some sizes, and the original C algorithms pale in comparison. This assembly language thing has really paid off. So that's it, right?

Not quite. So far, I haven't used the most powerful tool in the toolbox to its full advantage. The compiler can optimize the C code automatically. I can use the compiler's optimization to see if I really have created the fastest algorithm possible in assembly:

cc -fast testharness.c rcopy_c1.c rcopy_c2.c rcopy_a3.s

After running the tests again, I get the results shown in the following table.

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            643.2    546.6    647.8    609.9    389.0    389.8    388.4
rcopy_c2            849.7    859.5    864.1    793.6    383.5    390.9    386.8
rcopy_a3            619.0    655.6    649.8    613.4    395.6    394.1    391.0
Baseline (memcpy)   2995.7   1435.0   1339.5   1028.8   352.5    352.3    349.5

All figures are in MB/sec.

My assembly language function is still cleanly beating its C counterparts for the larger copies, but look at the smaller ones. The optimized C versions are much more efficient at what are arguably the most common application sizes for my reverse copy function. If my assembly function can't beat the C functions at those smaller copies, I don't think it's worth the tradeoff to use assembly.

Further investigation is in order. If you compare the performance of rcopy_c2 to the other rcopy variants, you notice it has really good performance for small copy sizes. I will use that function as a guide to help fix my assembly function to be even faster. Once again, I generate the assembly code for rcopy_c2, but this time I use optimization.

cc -fast -S rcopy_c2.c

The optimized code is usually much harder to read than either hand-coded or non-optimized assembly, but in this case, it isn't incomprehensible.

rcopy_c2_optimized.s

rcopy_c2:
  push  %ebx
  push  %esi
  push  %edi
  mov  24(%esp),%edi    / %edi is size
  lea  -1(%edi),%edx    / %edx is destIndex, set to
                        / size -1
  xor  %ecx,%ecx        / srcIndex = 0
  test  %edi,%edi
  jbe  .LE0.29
.LP0.30:
  mov  20(%esp),%ebx	/ %ebx is src
  mov  16(%esp),%esi	/ %esi is dest
.LB0.31:
  movb  (%ebx,%ecx),%al
  movb  %al,(%esi,%edx)
  add  $0x1,%ecx        / srcIndex++
  add  $0xffffffff,%edx / destIndex--
.LC0.32:
  cmp  %edi,%ecx        / srcIndex < size?
  jb  .LB0.31           / if so, repeat the loop
.LX0.33:
.LE0.29:
.CG3.24:
  mov  16(%esp),%eax
  pop  %edi
  pop  %esi
  pop  %ebx
  ret        

After examining this code, I noticed that one of the tricks the optimized code uses is to always add rather than increment or decrement. I will apply this trick to my own assembly, making the following changes to rcopy_a3 to arrive at rcopy_a4.

	addl	$0xffffffff, %ebx	/ srcIndex--
	addl	$0x1, %ecx		/ destIndex++
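With those substitutions in place (this is my reconstruction of rcopy_a4; only the increment and decrement instructions differ from rcopy_a3), the loop reads:

beginLoop:
  movb	-1(%esi, %ebx), %al / move one byte
  movb	%al, (%edi, %ecx)   / dest[destIndex] =
                            /       src[srcIndex]

  addl	$0xffffffff, %ebx   / srcIndex--
  addl	$0x1, %ecx          / destIndex++

  cmpl	$0, %ebx            / srcIndex > 0?
  ja	beginLoop           / if so, goto beginLoop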

Testing again, I get the following table.

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            642.9    648.4    651.1    610.5    384.4    375.9    376.2
rcopy_c2            839.6    861.7    858.7    780.9    386.0    381.1    379.4
rcopy_a3            638.9    656.8    645.7    615.6    377.2    376.1    372.2
rcopy_a4            853.4    870.6    863.3    792.4    375.5    374.4    369.6
Baseline (memcpy)   3078.1   1519.5   1500.1   1061.3   353.3    353.5    352.7

All figures are in MB/sec.

My assembly code is much faster than before for the smaller copies, but it is slower than the other rcopy implementations on copies of 512 Kbyte and greater. The other major difference between my code and the optimized rcopy_c2 function is that my assembly reads the source buffer in reverse, while the compiled version reads in order and writes in reverse. If I made that change in my assembly code, the two implementations would be almost exactly the same. Looking at the optimized assembly, I cannot pick out a glaring inefficiency that would warrant a hand-coded assembly solution.

Given that result, I think it would be interesting to write a C version of my assembly algorithm and see how the optimized C performance compares to that of my hand-coded assembly function.

rcopy_c5.c

#include <stddef.h>

void *
rcopy_c5(void *dest, const void *src, size_t size) {
  int srcIndex = size - 1;
  int destIndex = 0;

  while (srcIndex >= 0) {
    ((char *)(dest))[destIndex++] =
        ((char *)(src))[srcIndex--];
  }
  return (dest);
}

The only difference between this and rcopy_c2.c is that I am reading the source in reverse and writing to the destination in order. You can see how my new implementation performs in the following table.

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            644.6    595.9    649.7    608.5    377.6    378.5    379.4
rcopy_c2            855.3    820.9    850.7    778.0    373.9    369.2    373.3
rcopy_a3            638.1    652.7    644.3    608.0    382.1    379.1    380.3
rcopy_a4            853.3    867.5    856.5    782.4    382.5    379.0    380.0
rcopy_c5            854.1    871.3    861.8    786.3    386.1    378.0    382.4
Baseline (memcpy)   3116.5   1363.2   1379.3   1026.8   354.8    355.3    357.0

All figures are in MB/sec.

I will also try this on my 2-CPU AMD Opteron-based workstation (Colfax International high-end workstation, model #691) to see how the performance numbers change (see the following table).

FUNCTION            16 KB    64 KB    128 KB   256 KB   512 KB   1 MB     2 MB
rcopy_c1            775.3    746.8    751.1    751.6    746.4    648.7    647.4
rcopy_c2            972.0    896.8    898.6    899.4    891.1    705.0    703.8
rcopy_a3            690.2    689.4    690.2    690.9    684.0    580.1    569.8
rcopy_a4            690.3    692.7    694.4    694.4    687.3    606.1    603.9
rcopy_c5            979.9    926.8    928.0    927.8    915.8    710.6    722.6
Baseline (memcpy)   5201.6   2398.8   2400.3   2407.1   2159.4   750.9    756.6

All figures are in MB/sec.

My assembly functions are the only ones that actually got worse performance on the faster machine. In fact, it takes 20 percent more time for the 128 Kbyte case, comparing the rcopy_a4 algorithm between my laptop and my Opteron workstation. That is very interesting when compared to the memcpy performance, which gains almost 75 percent copy speed for the same case.

Taking all of this into account, I would have to choose the rcopy_c5 algorithm. It is the most suitable implementation because of its functional portability (its relative performance probably is not portable to all architectures), its ease of maintenance, and the fact that it beat all the other algorithms. On the Opteron workstation, the performance improved by a minimum of 1 percent and a maximum of 3 percent. On the Pentium 4, the results ranged from a degradation of less than 1 percent to an improvement of 1 percent.

So given that I chose to stick with a C algorithm in the end, was it a waste of time to write two assembly language functions? No. Keep in mind that I now enjoy up to a 46 percent improvement on the Pentium and up to 26 percent better performance on the Opteron over my original code. That is significant no matter how I got there!

Key Point

This exercise demonstrates how assembly can play a role in your application performance whether or not you use your assembly language code in the end. It is unlikely I would have arrived at such a fast algorithm without understanding assembly language during the process. Keep this exercise in mind in your own pursuit of ultimate performance, and don't always put assembly language into your application just because you took the trouble to write it.

 
Conclusion

I have covered a lot of information that should interest anyone considering applying assembly language to their own projects. This is by no means everything you need to know about assembly language to make optimal use of it, but the exercises and key takeaways should help you organize your thoughts on how assembly can be integrated into your own work and how to approach that integration to ensure that you get the best performance win for your trouble.

I will summarize with a few thoughts:

For compiler portability, it is best to avoid inline assembly. If inline assembly must be used, be sure to test thoroughly to ensure that the code works with the compilers that you care about and with the compiler flags that are required. Keep in mind that when optimization is turned on, many compilers set up the stack frame in a different way, which can change the local variable offsets.

Compiler optimizations will have no performance effect on your hand-written assembly code. Also, be sure that the performance of your assembly code is compared to an optimized C language equivalent.

If assembly is used for performance enhancement, make sure it is applied in areas that matter. Before writing optimized functions in any language, you should profile your code to ensure you understand the actual software bottlenecks. Don't invest significant effort on a mere guess.

 
Appendix A

Source Code Listings

About the Author

Paul Lovvik, who has been with Sun for six years, is lead engineer in a group focused on partner adoption of the Solaris OS, x86 Platform Edition in the Market Development Engineering organization. Paul and his engineering team have helped many partners add Solaris x86 support to their products over the past year.
