Using Inline Templates to Improve Application Performance

Updated February 2011

By Darryl Gove, Compiler Performance Engineering Team, Oracle

Inline templates are a mechanism for directly inserting assembly code into an executable. Typically, this approach is used to obtain the best performance for a given function or to implement an algorithm in a specific way.

Introduction


In general, you should never need to use inline templates. It is normally possible to do all the coding in a high-level language, and the compiler does an excellent job of optimizing it. However, in some cases, you may know more about the target hardware or about the behavior of the code than the compiler does, or you may want to do something that the compiler doesn't readily support. In these rare situations, you will find inline templates to be helpful.

The following are examples where inline templates are particularly useful:

  • User-coded mutex locks. If you want to code a mutex lock, then you will probably want to use the atomic instructions.

  • Hardware-level access. If you are coding for a hardware device, or perhaps just accessing the registers already present on the system, then you may end up wanting to use inline templates.

  • Precise implementation of algorithms. If you have a short algorithm, which can be implemented optimally using hand-coding tricks that the compiler is unable to replicate, then you may wish to use inline templates.

To use inline templates, a regular function call is placed in the source code, then an inline template is produced with the appropriate name, and at compile time both the source file and the file containing the inline template are compiled together. The compiler will then insert the code from the inline template into the code generated from the source code.

The documentation for inlining using .il files is in the inline(1) man page. This paper is based on that information.

Listing 1: inline Man Page

man -M /opt/SUNWspro/man inline

The inline man page is also available in HTML from the documentation index of man pages.

Compiling with Inline Templates


You compile inline templates by placing them on the same compile line as the file that uses them. The code is inlined by the code-generator stage of compilation.

Listing 2: Compiling with an Inline Template File

cc -O prog.c code.il

The example above will compile prog.c and inline the code from code.il into the appropriate points.

Layout of Code in Inline Templates


The inline template file can contain a number of inline templates. Each template starts with a declaration, and ends with an end statement:

Listing 3: Layout of an Inline Template

.inline identifier,argument_size
  ...instructions...
.end

identifier is the name of the template, and argument_size is the size of the arguments in bytes (this is not required for the latest compiler versions). Multiple templates of the same name can be placed in the file, but the compiler will pick the first one.

There is no need for a return instruction since your template will be inlined directly into your code without a call.

Note that you must prototype the template in your high-level source code to ensure that the compiler assigns correct types for all the parameters.

Listing 4: Example of a Prototype for an Inline Template

void do_nothing();

Listing 5: Example of a Template

/* The following template does nothing*/
.inline do_nothing,0
  nop
.end

Listing 4 shows the prototype as it might end up in code.h. Listing 5 shows the inline template code as it might end up in a separate code.il file. Inline templates are always in files with the suffix .il.

Note: In the following examples, the prototype has been included in the same code block as the inline template code to make the paper more readable; however, they must go into different files.

Guidelines for Coding Inline Templates


The inline code can use only integer registers %o0 to %o5 and floating point registers %f0 to %f31 for temporary values. Other registers should not be used. These registers are referred to as the "caller-saved" registers. Calls can be made to other routines from the inline template, but these calls are subject to the same constraint.

The compiler will handle most of the SPARC instruction set. If the template contains only instructions that the compiler normally generates, then it will be early inlined (see Late and Early Inlining), and the code will be scheduled optimally. If the template contains instructions that the compiler understands, but does not typically generate (such as VIS instructions or atomics), then the code may be late inlined, and consequently, the code may not be optimally scheduled, resulting in a slight loss of performance.

Parameter Passing


Parameter passing obeys the parameter passing rules defined in the target architecture, so parameter passing is different for 32-bit and 64-bit code. Parameter passing is described by the SPARC ABI, which can be referenced at the SPARC International Technical Documents page. SCD 2.3 describes v8 (32-bit code) and SCD 2.4.1 describes v9 (64-bit code).

On entering the template, arguments will be passed in %o0 to %o5 and will continue on the stack. For 32-bit code, the stack arguments start at [%sp+0x5c], and %sp is guaranteed to be 8-byte aligned; for 64-bit code, they start at [%sp+0x8af].

Note: %sp+2047 is aligned to a 16-byte boundary.
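
The two stack offsets can be derived rather than memorized. The following is a minimal sketch, assuming the standard SPARC frame layout: in 32-bit code, a 64-byte register save area and a 4-byte structure-return slot precede the argument slots, and in 64-bit code a stack bias of 2047 and a 128-byte register save area precede them. The helper names and enum constants are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Assumed frame-layout constants (from the SPARC ABI, not from the compiler). */
enum {
    V8_ARG_BASE   = 68,    /* 64-byte save area + 4-byte struct-return slot */
    V8_SLOT_SIZE  = 4,     /* one 32-bit word per argument slot */
    V9_STACK_BIAS = 2047,  /* bias applied to %sp in 64-bit code */
    V9_SAVE_AREA  = 128,   /* sixteen 8-byte registers */
    V9_SLOT_SIZE  = 8      /* one 64-bit word per argument slot */
};

/* Offset from %sp of the slot for integer argument n (1-based), 32-bit code. */
int v8_arg_offset(int n)
{
    assert(n >= 1);
    return V8_ARG_BASE + (n - 1) * V8_SLOT_SIZE;
}

/* Offset from %sp of the slot for integer argument n (1-based), 64-bit code. */
int v9_arg_offset(int n)
{
    assert(n >= 1);
    return V9_STACK_BIAS + V9_SAVE_AREA + (n - 1) * V9_SLOT_SIZE;
}
```

Under these assumptions, v8_arg_offset(7) evaluates to 92 (0x5c) and v9_arg_offset(7) evaluates to 2223 (0x8af), which are the locations of the seventh argument, the first one that does not fit in %o0 to %o5.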

Listing 6: Example of 32-Bit Parameter Passing Using the Stack

int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);

/*Add up 7 integer parameters; last one will be passed on stack*/
.inline add_up,28
  add %o0,%o1,%o0
  ld [%sp+0x5c],%o1
  add %o2,%o3,%o2
  add %o4,%o5,%o4
  add %o0,%o1,%o0
  add %o2,%o4,%o2
  add %o0,%o2,%o0
.end
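
To follow the data flow through the template, it can help to model each register as a C variable. The function below is a hypothetical model written for this purpose (the name add_up_model is not part of the examples); it mirrors the 32-bit template instruction by instruction and can serve as a reference when checking the template's result.

```c
/* Hypothetical C model of the 32-bit add_up template.
   Each local variable stands for the register of the same name. */
int add_up_model(int v1, int v2, int v3, int v4, int v5, int v6, int v7)
{
    /* The first six arguments arrive in %o0 to %o5. */
    int o0 = v1, o1 = v2, o2 = v3, o3 = v4, o4 = v5, o5 = v6;
    /* The seventh argument arrives in memory at [%sp+0x5c]. */
    int stack_slot = v7;

    o0 = o0 + o1;      /* add %o0,%o1,%o0   */
    o1 = stack_slot;   /* ld  [%sp+0x5c],%o1 (reuses %o1 once v2 is consumed) */
    o2 = o2 + o3;      /* add %o2,%o3,%o2   */
    o4 = o4 + o5;      /* add %o4,%o5,%o4   */
    o0 = o0 + o1;      /* add %o0,%o1,%o0   */
    o2 = o2 + o4;      /* add %o2,%o4,%o2   */
    o0 = o0 + o2;      /* add %o0,%o2,%o0   */
    return o0;         /* integer results are returned in %o0 */
}
```

Note that the load into %o1 is only safe because the first add has already consumed the second argument.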

Here's an example for 64-bit code. Note that when a 32-bit int value is passed on the stack, the full 64 bits of the register are saved.

Listing 7: Example of 64-Bit Parameter Passing Using the Stack

int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
/*Add up 7 integer parameters; last one will be passed on stack*/
.inline add_up,56
  add %o0,%o1,%o0
  ldx [%sp+0x8af],%o1
  add %o2,%o3,%o2
  add %o4,%o5,%o4
  add %o0,%o1,%o0
  add %o2,%o4,%o2
  add %o0,%o2,%o0
.end

For 32-bit code, floating point values will be passed in the integer registers. For 64-bit code, they will be passed in the floating point registers.

Listing 8: Example of 32-Bit Parameter Passing by Value

double sum_val(double a, double b);
/*sum of two doubles by value*/
.inline sum_val,16
  st   %o0,[%sp+0x48]
  st   %o1,[%sp+0x4c]
  ldd  [%sp+0x48],%f0
  st   %o2,[%sp+0x48]
  st   %o3,[%sp+0x4c]
  ldd  [%sp+0x48],%f2
  faddd %f0,%f2,%f0
.end
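
The store and load sequence in the by-value template reassembles each double from the two 32-bit halves held in an integer register pair. The sketch below models that step in portable C. It assumes a 64-bit IEEE 754 double and that the first register of each pair holds the most significant word, as on big-endian SPARC; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Model of "st hi,[scratch]; st lo,[scratch+4]; ldd [scratch],%f0":
   two 32-bit words are written to memory and read back as one double.
   Assumes a 64-bit IEEE 754 double. */
double join_double(uint32_t hi, uint32_t lo)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the 8 bytes as a double */
    return d;
}

/* Hypothetical model of sum_val: a arrives in %o0/%o1, b in %o2/%o3. */
double sum_val_model(uint32_t o0, uint32_t o1, uint32_t o2, uint32_t o3)
{
    double f0 = join_double(o0, o1);   /* first  st/st/ldd group */
    double f2 = join_double(o2, o3);   /* second st/st/ldd group */
    return f0 + f2;                    /* faddd %f0,%f2,%f0 */
}
```

For instance, the value 1.5 has the bit pattern 0x3FF80000 00000000, so it would arrive with 0x3FF80000 in the first register of the pair and 0 in the second.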

Listing 9: Example of 64-Bit Floating Point Parameter Passing

double sum(double a, double b);
/*sum of two doubles 64-bit calling convention*/
.inline sum,16
  faddd %f0,%f2,%f0
.end

Values passed in memory, single-precision floating point values, and integers are guaranteed to be 4-byte aligned. Double-precision floating point values will be 8-byte aligned if their offset in the parameters is a multiple of 8-bytes.

Integer return values are passed in %o0. Floating point return values are passed in %f0/%f1 (single-precision values in %f0, double-precision values in the register pair %f0,%f1).

For 32-bit code, there are two ways of passing floating point values. The first way is to pass them by value, and the second is to pass them by reference. Either way, the compiler will do its best to optimize out the load and store instructions. It is often more successful at doing this if the floating point parameters are passed by reference.

Here is an example of 32-bit by reference parameter passing.

Listing 10: Example of 32-Bit Parameter Passing by Reference

double sum_ref(double *a, double *b);
/*sum of two doubles by reference*/
.inline sum_ref,16
  ldd [%o0],%f0
  ldd [%o1],%f2
  faddd %f0,%f2,%f0
.end

Stack Space


Sometimes, it is necessary to store variables to the stack in order to load them back later; this is the case for moving between the int and fp registers. The best way of doing this is to use the space already set aside for parameters that are passed into the function.

For example, in the v8 code shown in Listing 8, the location %sp+0x48 is 8-byte aligned (%sp is 8-byte aligned), and it corresponds to the place where the second and third 4-byte integer parameters would be stored if they were passed on the stack. (Note that the first parameter would be stored at a non-8-byte boundary.)
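
The alignment reasoning can be checked with a few lines of C. This is a minimal sketch with a hypothetical helper name; it assumes that %sp itself is aligned to at least the boundary being tested, so only the offset matters.

```c
#include <assert.h>

/* Returns nonzero when %sp+offset is aligned to `alignment` bytes,
   assuming %sp itself is aligned to at least that boundary. */
int sp_offset_is_aligned(unsigned offset, unsigned alignment)
{
    /* The mask test below is only valid for power-of-two alignments. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset & (alignment - 1)) == 0;
}
```

With this helper, sp_offset_is_aligned(0x48, 8) is nonzero, whereas sp_offset_is_aligned(0x44, 8) is zero. That is why the template uses %sp+0x48 rather than the first parameter slot as the scratch location for its load double instructions.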

Branches and Calls


Support for branching and calls is available. Every branch or call must be followed by a nop instruction to fill the branch delay slot. It is possible to put instructions in the delay slot of branches, which can be useful if you wish to use the processor support for annulled instructions, but doing so will cause the code to be late-inlined (described in Late and Early Inlining) and may result in sub-optimal performance.

Call instructions must have an extra last argument that indicates the number of registers used to pass arguments in the call parameters. In general, you should avoid inlining call instructions.

The destinations of branches must be indicated with a number, and the branch instructions should use this number to indicate the appropriate destination together with an f for a forward branch or a b for a backward branch.

Here is an example.

Listing 11: Example of Using Branches in an Inline Template

int is_true(int i);
/*return whether true*/
.inline is_true,4
   cmp  %o0,%g0
   bne  1f
   nop
   mov  1,%o0
   ba   2f
   nop
1:
   mov  0,%o0
2:
.end
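
The control flow of this template can be written out in C with goto statements standing in for the numbered labels. This is a hypothetical model for illustration only (is_true_model is not part of the examples). It shows that, as written, the template returns 1 when its argument is zero and 0 otherwise, which matches the program output in Listing 21.

```c
/* Hypothetical C model of the is_true template.
   label1 and label2 correspond to the numeric labels 1: and 2:. */
int is_true_model(int i)
{
    int o0 = i;          /* the argument arrives in %o0 */

    if (o0 != 0)         /* cmp %o0,%g0 ; bne 1f ; nop */
        goto label1;
    o0 = 1;              /* mov 1,%o0 (reached only when the argument is 0) */
    goto label2;         /* ba 2f ; nop */
label1:
    o0 = 0;              /* mov 0,%o0 (reached when the argument is nonzero) */
label2:
    return o0;           /* result is left in %o0 */
}
```

Both branches in the template are forward branches, so both use the f suffix; a loop that jumped back to an earlier label 1: would use 1b instead.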

Late and Early Inlining


Inlining of templates is done by the code generator part of the compiler. There are two opportunities for inlining: before and after optimization. If the inline template is "complex," then it will end up being inlined after optimization (that is, late inlined), which means that the code will more or less appear exactly as it appears in the template. If the code is inlined before optimization (early inlining), then it will be merged with the other code around the call site.

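Early inlining generally leads to better performance, because the template code is optimized and scheduled together with the code around the call site.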

Things that will cause late inlining are:

  • Use of instructions that the compiler cannot generate
  • Instructions in the delay slots of branches
  • Call instructions

You will get information in the compiler commentary on inlining when the code is compiled with -g. This information will tell you if a routine is late inlined. If there is no comment, then the routine was early inlined. An example of this is attempting to inline the following (incorrect) template.

Listing 12: Incorrect Inline Template

.inline sum_val,16
  st   %o0,[%fp+0x48]
  st   %o1,[%fp+0x4c]
  ldd  [%fp+0x48],%f0
  st   %o2,[%fp+0x48]
  st   %o3,[%fp+0x4c]
  ldd  [%fp+0x48],%f2
  faddd %f0,%f2,%f0
.end

The template in Listing 12 is incorrect because the code uses the frame pointer (%fp) rather than the stack pointer (%sp). The compiler will still inline the code, but because of this error, it is unable to early inline the code and will need to late inline the code.

Listing 13: Compiling with -g to Generate Debug Information

cc -g -O inline32.il driver32.c

Listing 13 shows the compile line used to generate a 32-bit executable with debug information. Note that the debug information is stored in the .o files by default, so it is necessary to keep these files available.

Listing 14: Using er_src to Output Compiler Commentary

er_src a.out main
Source file: /home/dg83945/book_code/inline/driver32.c
Object file: /home/dg83945/book_code/inline/driver32.o
Load Object: a.out

     1. #include <stdio.h>
     2.
     3. void do_nothing();
     4. int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
     5. double sum_val(double a, double b);
     6. double sum_ref(double *a, double *b);
     7. int is_true(int i);
     8.
     9.
     10. void main()
     11. {
     12.   double a=3.11,b=7.22;
     13.   do_nothing();
     14.   printf("add_up  %i\n",add_up(1,2,3,4,5,6,7));

   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
   Template could not be early inlined because it references the register %fp
     15.   printf("sum_val %f\n",sum_val(a,b));
     16.   printf("sum_ref %f\n",sum_ref(&a,&b));
     17.   printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
     18. }

You can use the er_src utility to examine the compiler commentary for a particular file. It takes two parameters: the name of the executable and the name of the function that you wish to examine. In this case, the template that cannot be early inlined is sum_val. Each time the compiler comes across the %fp register, it inserts a debug message, so you can tell that there are six references to %fp in the template.

Decoding the Calling Convention


The calling convention for the architecture can be a bit tricky to master. The easiest way of dealing with this is to write a test function to see how that gets converted into assembly language.

Listing 15: Examining the 32-Bit Calling Convention

# more fptest.c

double sum(double d1,double d2, double d3, double d4)
{
  return d1 + d2 + d3 + d4;
}

# cc -O -xarch=v8plusa -S fptest.c

# more fptest.s
....
                        .global sum
                       sum:
/* 000000          2 */         st      %o0,[%sp+68]
/* 0x0004            */         st      %o2,[%sp+76]
/* 0x0008            */         st      %o1,[%sp+72]
/* 0x000c            */         st      %o3,[%sp+80]
/* 0x0010            */         st      %o4,[%sp+84]
/* 0x0014            */         st      %o5,[%sp+88]

!    3                !  return d1 + d2 + d3 + d4;

/* 0x0018          3 */         ld      [%sp+68],%f2
/* 0x001c            */         ld      [%sp+72],%f3
/* 0x0020            */         ld      [%sp+76],%f10
/* 0x0024            */         ld      [%sp+80],%f11
/* 0x0028            */         ld      [%sp+84],%f4
/* 0x002c            */         faddd   %f2,%f10,%f12
/* 0x0030            */         ld      [%sp+88],%f5
/* 0x0034            */         ld      [%sp+92],%f6
/* 0x0038            */         ld      [%sp+96],%f7
/* 0x003c            */         faddd   %f12,%f4,%f14
/* 0x0040            */         retl    ! Result =  %f0
/* 0x0044            */         faddd   %f14,%f6,%f0
....

In the example code, you can see that the first three fp parameters are passed in %o0-%o5, and the fourth fp parameter is passed on the stack at locations %sp+92 and %sp+96. Note that this location is 4-byte aligned, so it is not possible to use a single floating point load double instruction to load it.
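
The mapping from double parameters to registers and stack locations seen in this listing can be reproduced with a short calculation. The sketch below assumes the 32-bit layout observed above: each double occupies two consecutive 4-byte argument words, the first six words travel in %o0 to %o5, and later words are stored at %sp+68 plus four bytes per word. The function and constant names are hypothetical.

```c
#include <assert.h>

enum {
    V8_REG_WORDS = 6,   /* argument words 0..5 are passed in %o0..%o5 */
    V8_ARG_BASE  = 68   /* assumed %sp offset of argument word 0 */
};

/* First 4-byte argument word (0-based) used by double number k (0-based),
   assuming every earlier argument is also a double. */
int v8_double_first_word(int k)
{
    assert(k >= 0);
    return 2 * k;
}

/* %sp offset of argument word w, or -1 if the word is passed in register %o<w>. */
int v8_word_stack_offset(int w)
{
    assert(w >= 0);
    return (w < V8_REG_WORDS) ? -1 : V8_ARG_BASE + 4 * w;
}
```

For the fourth double (k = 3), this gives argument words 6 and 7, at %sp+92 and %sp+96, which are the two locations loaded separately in the listing. Because 92 is not a multiple of 8, the pair cannot be fetched with one load double instruction.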

Here is an example for 64-bit code.

Listing 16: Examining the 64-Bit Calling Convention

# more inttest.c
long sum(long v1,long v2, long v3, long v4, long v5, long v6, long v7)
{
return v1 + v2 + v3 + v4 + v5 + v6 + v7;
}

# cc -O -xarch=v9 -S inttest.c

# more inttest.s
...
/* 000000          2 */         ldx     [%sp+2223],%g2
/* 0x0004          3 */         add     %o0,%o1,%g1
/* 0x0008            */         add     %o3,%o2,%g3
/* 0x000c            */         add     %g3,%g1,%g4
/* 0x0010            */         add     %o5,%o4,%g5
/* 0x0014            */         add     %g5,%g4,%o1
/* 0x0018            */         retl    ! Result =  %o0
/* 0x001c            */         add     %o1,%g2,%o0
...

In the code above, you can see that the first action is to load the seventh integer parameter from the stack.

Other Examples of Templates


Templates are used in libm.il (the inline math library) and in vis.il (the Visual Instruction Set inline library). These two files can be found in /opt/SUNWspro/prod/lib/. They are linked in by the compiler when flags -xlibmil (for the math templates) or -xvis (for the VIS templates) are specified. The include files that prototype the functions in the template libraries are math.h and vis.h.

Complete Source Code for 32-Bit Examples


Listing 17: inline32.il File for 32-Bit Inline Template Examples

/* The following template does nothing*/
.inline do_nothing,0
  nop
.end

/*Add up 7 integer parameters; last one will be passed on stack*/
.inline add_up,28
  add %o0,%o1,%o0
  ld [%sp+0x5c],%o1
  add %o2,%o3,%o2
  add %o4,%o5,%o4
  add %o0,%o1,%o0
  add %o2,%o4,%o2
  add %o0,%o2,%o0
.end

/*sum of two doubles by value*/
.inline sum_val,16
  st   %o0,[%sp+0x48]
  st   %o1,[%sp+0x4c]
  ldd  [%sp+0x48],%f0
  st   %o2,[%sp+0x48]
  st   %o3,[%sp+0x4c]
  ldd  [%sp+0x48],%f2
  faddd %f0,%f2,%f0
.end

/*sum of two doubles by reference*/
.inline sum_ref,16
  ldd [%o0],%f0
  ldd [%o1],%f2
  faddd %f0,%f2,%f0
.end

/*return whether true*/
.inline is_true,4
   cmp  %o0,%g0
   bne  1f
   nop
   mov  1,%o0
   ba   2f
   nop
1:
   mov  0,%o0
2:
.end

Listing 18: driver32.c Source File for 32-Bit Examples

#include <stdio.h>

void do_nothing();
int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
double sum_val(double a, double b);
double sum_ref(double *a, double *b);
int is_true(int i);

void main()
{
  double a=3.11,b=7.22;
  do_nothing();
  printf("add_up  %i\n",add_up(1,2,3,4,5,6,7));
  printf("sum_val %f\n",sum_val(a,b));
  printf("sum_ref %f\n",sum_ref(&a,&b));
  printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
}

Complete Source Code for 64-Bit Examples


Listing 19: inline64.il Template File for 64-Bit Template Examples

/* The following template does nothing*/
.inline do_nothing,0
  nop
.end

/*Add up 7 integer parameters; last one will be passed on stack*/
.inline add_up,56
  add %o0,%o1,%o0
  ldx [%sp+0x8af],%o1
  add %o2,%o3,%o2
  add %o4,%o5,%o4
  add %o0,%o1,%o0
  add %o2,%o4,%o2
  add %o0,%o2,%o0
.end

/*sum of two doubles 64-bit calling convention*/
.inline sum,16
  faddd %f0,%f2,%f0
.end

/*return whether true*/
.inline is_true,4
   cmp  %o0,%g0
   bne  1f
   nop
   mov  1,%o0
   ba   2f
   nop
1:
   mov  0,%o0
2:
.end

Listing 20: driver64.c Source File for 64-Bit Examples

#include <stdio.h>

void do_nothing();
int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
double sum(double a, double b);
int is_true(int i);

void main()
{
  double a=3.11,b=7.22;
  int v1=1, v2=2, v3=3, v4=4, v5=5, v6=6, v7=7;
  do_nothing();
  printf("add_up  %i\n",add_up(v1,v2,v3,v4,v5,v6,v7));
  printf("sum    %f\n",sum(a,b));
  printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
}

Running Examples


Listing 21: Compile and Run Sequence for the Examples

% cc -O driver32.c inline32.il
% a.out
add_up  28
sum_val 10.330000
sum_ref 10.330000
is_true 0=1,1=0

% cc -O -xarch=v9 driver64.c inline64.il
% a.out
add_up  28
sum    10.330000
is_true 0=1,1=0

About the Author


Darryl Gove is a staff engineer in the Compiler Performance Engineering group at Sun Microsystems and now at Oracle, analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.