Sun Studio: Using VIS Instructions To Speed Up Key Routines

   
By Darryl Gove, Geetha Vallabhaneni, Compiler Performance Engineering Team, Sun Microsystems, January 5, 2006  

Introduction

The VIS instruction set includes a number of instructions that can be used to handle several items of data at the same time. These are called SIMD (Single Instruction Multiple Data) instructions. The VIS instructions work on data held in the floating point registers. The floating point registers are 8 bytes in size, and the VIS instructions can operate on them as two 4-byte ints, four 2-byte shorts, or eight 1-byte chars, as shown in the following figure:

[Figure: an 8-byte floating point register viewed as two 4-byte ints, four 2-byte shorts, or eight 1-byte chars]

The advantage of using VIS instructions is that an operation can be applied to several items of data in parallel, meaning that it takes the same time to compute eight 1-byte results as it does to calculate one 8-byte result. In theory this means that code that uses VIS instructions can be many times faster than code without them.
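
As a minimal illustration of the SIMD idea (this example is not from the original article), the following sketch uses the vis_fpadd16 macro from <vis.h> to add four 2-byte shorts with a single instruction. It assumes the code is compiled with the -xvis and -xarch=v8plusa flags described later in this article.

#include <vis.h>

/* Pack four shorts into the 8-byte value seen by the VIS instructions */
typedef union { double d; short s[4]; } vis_d64;

void add_four_shorts(const short *a, const short *b, short *result)
{
  vis_d64 va, vb, vr;
  int i;
  for (i=0; i<4; i++) { va.s[i]=a[i]; vb.s[i]=b[i]; }
  vr.d = vis_fpadd16(va.d, vb.d);    /* one instruction adds all four shorts */
  for (i=0; i<4; i++) { result[i]=vr.s[i]; }
}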

Further information on the VIS instruction set, including manuals and libraries can be found at http://www.sun.com/processors/vis/ .


VIS performance

Using VIS instructions can give a performance gain, but how large that gain is can be hard to determine, since the following factors come into play:

  • The VIS instructions work on multiple data at one time, so doing a VIS add operation on eight chars can be faster than doing eight separate add operations.

  • The VIS instructions use values held in floating point registers, and the values need to be loaded and stored using floating point loads and stores. This has a benefit on the UltraSPARC-III family of processors since floating point values can be prefetched into the on-chip prefetch cache. Doing this can avoid processor stalls where the integer data needs to be fetched from memory.

  • The VIS instructions typically take longer than the equivalent integer operation. A VIS instruction will often have a four cycle latency (VIS instructions are handled by the floating point unit, so the latency for VIS instructions is the same as for floating point instructions), whereas the equivalent integer operation might take one cycle.

  • Unfortunately, not all integer operations are available as VIS instructions, so it may be necessary to move data back to the integer registers for processing. This can only be done through memory, which makes it quite a slow operation.


Compiling with VIS

Using VIS requires a target architecture of at least v8plusa or v9a. This can be achieved by compiling using the -xarch=v8plusa or -xarch=v9a compiler flag.

There are two ways to get the compiler to generate VIS instructions:

  • Macros for the VIS instructions are available in <vis.h>; the compiler will recognise these if the -xvis compiler flag is specified.

  • It is also possible to put VIS instructions into inline templates. Details on writing and using inline templates can be found in a separate article on inline templates.

Because the compiler does not generate VIS instructions of its own accord, the generated code may be suboptimal (the VIS instructions will typically be late-inlined, as discussed in the article on inline templates). If the performance is not as fast as expected, it is always worth checking the resulting assembly code to see whether it looks reasonable.


Example routine coded without VIS instructions

The example we will use is a simple search-type routine that works on integer (4-byte) data rather than characters. The routine scans an array for a particular value, and then reports the number of elements scanned before that value was found.

int search(int value, int* array)
{
  int len=0;
  while (*array++!=value) {len++;}
  return len;
}

Table 1 - example search routine coded in C without VIS

We also define a test harness so that the performance of the existing code can be measured.

#include <stdio.h>
#include <sys/time.h>

int array[10*1024*1024];
static hrtime_t last_time=0;

void seconds()
{
  hrtime_t time;
  time=gethrtime();
  if (last_time!=0)
  {
    printf( "Elapsed seconds = %5.3f\n",
      ((double)(time-last_time)*1.0e-9) );
  }
  last_time=time;
}

void init()
{
  int i;
  for(i=0; i<10*1024*1024; i++) {array[i]=i;}
}

int main(void)
{
  int loop;
  init();
  seconds();
  for (loop=0; loop <10*1024*1023; loop+=100*1024)
  {
    if (search(loop,array)!=loop)
    {
      printf("Miscompare\n");
      break;
    }
  }
  seconds();
  return 0;
}

Table 2 - Test harness code

The timing loop is more complex than might appear necessary. However, the loop has the following characteristics:

  • The timing loop also validates the results. This ensures that when the code is optimised, the new versions can also be checked for correctness.

  • The timing code tries out a variety of array lengths, to ensure that the optimisation has to give good performance over a wide range of lengths. Of course, the method of timing has the disadvantage that poor performance at one length can be masked by good performance at another length.

There are some weaknesses in the test harness:

  • The code does not attempt to warm the caches to a consistent state before each run, so it is possible that the earlier tests are penalised because the caches are cold.

  • The test code does not check the effect of alignment on performance. For example, the performance might be different if the array did not start on an eight-byte boundary; a sketch of such a check is shown below.
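
As a minimal sketch of such a check (this helper is not part of the original harness and its name is hypothetical), the search could be run starting one int into the existing global array, giving a start address that is 4-byte aligned but, assuming the global array itself is 8-byte aligned, not 8-byte aligned:

/* Hypothetical helper: exercise the alignment-fixup path by starting the
   search 4 bytes into the 8-byte aligned global array */
void check_misaligned()
{
  /* array[i]==i, so searching from &array[1] for 1000 should scan 999 ints */
  if (search(1000, &array[1]) != 999)
  {
    printf("Miscompare on misaligned start\n");
  }
}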

 

Building and running the example code

First we'll run the example code.

% cc test.c
% a.out
Elapsed seconds = 14.481

Table 3 - Performance with no optimisation

The code was compiled without optimisation, so performance is expected to be poor. It is interesting to look at the disassembly of the hot loop to see why.

10cb4: inc      %i5
10cb8: ld       [%fp + 72], %o0
10cbc: ld       [%o0], %i4
10cc0: ld       [%fp + 72], %o0
10cc4: inc      4, %o0
10cc8: st       %o0, [%fp + 72]
10ccc: ld       [%fp + 68], %o0
10cd0: cmp      %i4, %o0
10cd4: bne      search+0x34             ! 0x10cb4
10cd8: st       %i4, [%fp - 24]

Table 4 - Unoptimised assembly code

It is readily apparent that this is very poor code: the variables are kept on the stack rather than in registers. The array pointer is reloaded into %o0 and stored back every iteration, the value being searched for is also reloaded, and the element read from the array (%i4) is written out to the stack. A simple way to evaluate the loop is to count the memory operations: there are four loads and two stores per trip around the loop, whereas the source code suggests that a single load per iteration should be enough. The obvious way of improving the situation is to recompile with some amount of optimisation.

% cc -O test.c
% a.out
Elapsed seconds = 8.614

Table 5 - Performance of optimised code

So the performance improves by nearly a factor of two. The disassembly for the hot loop looks like the following:

10ca4:  ld         [%g5], %o4
10ca8:  cmp        %o4, %o5
10cac:  be,pn      %icc,search+0x44        ! 0x10ccc
10cb0:  inc        4, %g5
10cb4:  ld         [%g5], %o2
10cb8:  inc        %o0
10cbc:  inc        4, %g5
10cc0:  cmp        %o2, %o5
10cc4:  bne,a,pt   %icc,search+0x1c        ! 0x10ca4
10cc8:  inc        %o0

Table 6 - Optimised assembly code

In this case the loop has been unrolled twice (two iterations are performed before the predicted-taken branch at the end of the loop), but not pipelined (the two iterations are not interleaved). The most significant gain is that the variables are now held in registers and are not stored and reloaded every iteration. There are two loads in this block of code, but the block covers two iterations, so the code now has the optimal single load per iteration.
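
As a rough C sketch of what "unrolled twice but not pipelined" means here (this is not actual compiler output, and the pointer bookkeeping is simplified), two copies of the loop body are laid out back to back, each completing before the next starts:

int search_unrolled2(int value, int* array)
{
  int len=0;
  for (;;)
  {
    if (array[0]==value) { break; }   /* first copy of the loop body */
    len++;
    if (array[1]==value) { break; }   /* second copy of the loop body */
    len++;
    array+=2;
  }
  return len;
}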

However, the loop still does not contain prefetch instructions. Adding prefetch instructions will enable the processor to start fetching data in advance of the data being needed. This will mean that the data will often be ready when the processor needs it, and hence the processor will spend less time waiting for the data to be returned from memory.


Manually adding prefetch

Using the header file <sun_prefetch.h> it is possible to manually add prefetch statements into the source code for the program.

#include <sun_prefetch.h>

....

int prefetch_search(int value, int* array)
{
  int len=0;
  while (*array++!=value)
  {
    len++;
    sparc_prefetch_read_many(array+16);
  }
  return len;
}

Table 7 - Source code with manual prefetch statement

Running this code with the manual prefetch statements gets the following results.

% cc -O test.c
% a.out
Elapsed seconds = 5.606

Table 8 - Performance of optimised code with prefetch

The prefetch statement prefetches the data 16 ints ahead of the current location within array. This is 16*4 bytes = 64 bytes (each int takes four bytes of memory), or one cacheline ahead. Prefetch can be made more effective by allowing more time for the prefetch to complete before the load is issued. To demonstrate this, the offset can be changed from +16 to +64, which prefetches 64*4 = 256 bytes ahead, or four cachelines. The following result is obtained.

% cc -O test.c
% a.out
Elapsed seconds = 3.566

Table 9 - Performance with increased prefetch ahead distance
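
For reference, the only change from the code in Table 7 is the prefetch offset in the loop body:

  while (*array++!=value)
  {
    len++;
    sparc_prefetch_read_many(array+64);  /* 64 ints = 256 bytes = 4 cachelines ahead */
  }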

So from using optimisation and manually inserting prefetch, it is possible to obtain roughly a four times gain in performance for this code.


Including VIS instructions in the source code

One way of using VIS instructions is to include them in the source code of the application. This requires the <vis.h> header file and the -xvis compiler flag so that the compiler can recognise the VIS macros. VIS instructions also require a target architecture of at least v8plusa.

The code can be modified to use VIS instructions. Since the comparison is of 4-byte integers, two integers can be loaded and compared with the target value at the same time. The macro vis_fcmpeq32 generates the VIS instruction that performs the compare. The following code performs the search using VIS instructions:

#include <vis.h>

...

int srcvis_search(int value, int *array) {
  union {double d; int i[2];} tmpR;  /* search value duplicated in both halves */
  double* tmpI;
  int ret;
  int ind = 0;
  tmpR.i[0]=value;
  tmpR.i[1]=value;
  /* If the array is not 8-byte aligned, deal with the first int separately */
  if ((unsigned long)(&array[0]) & 4)
  {
    if (array[0]==value) { return 0; }
    ind = 1;
  }
  tmpI = (double*) &array[ind];
  /* Compare two ints per iteration until one of them equals the value */
  for(; !(ret=vis_fcmpeq32(tmpR.d, *tmpI++)); ind+=2)
  {
     sparc_prefetch_read_many(tmpI+32);
  }
  if (!(ret & 2)) { ind++; }  /* the match was the second (lower) int of the pair */
  return ind;
}

Table 10 - Using VIS instructions in C source code

The compile line for this code is:

% cc -O -xvis -xarch=v8plusa test.c
% a.out
Elapsed seconds = 2.639

Table 11 - Performance of C code containing VIS instructions

The VIS code is faster than the previous integer code, for two main reasons. There is some gain from being able to compare two integer values at once, but the compare instructions have a longer latency, so this on its own does not buy much (if the code were working with eight 1-byte chars or four 2-byte shorts per register, the VIS instructions would lead to greater gains; a sketch of a 2-byte variant appears after the list below). The other contributor is that the floating point load instructions can load data from the on-chip prefetch cache, which reduces the time spent waiting for data from the off-chip caches.

It is worth discussing the code, which looks significantly more complex than the previous versions of the same code. The reasons for the complexity are as follows:

  • Since VIS works on values held in the floating point registers, the value being searched for needs to be duplicated into both the upper and lower halves of a floating point register. This is achieved using the union. The duplicated value is compared against each pair of values loaded from memory.

  • The loads fetch eight-byte values, so they need to be eight-byte aligned. To get the correct alignment, the if statement checks for misalignment and, if the array is misaligned, handles the first value in the array separately before passing control to the for loop.

  • The return value from the VIS compare instruction is a bit pattern indicating which of the two values matched: bit 1 is set if the upper (first) int of the pair equals the search value, and bit 0 is set if the lower (second) int does. If the upper value did not match, one must be added to the index to show that the match was with the lower value. (Note that bit 0 being set does not by itself mean the reported match is the lower value; the upper value could also have matched, in which case it is the earlier match.)
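
As noted above, the gains are larger for smaller element types. The following is a minimal sketch (not from the original article) of the same approach applied to 2-byte shorts using the vis_fcmpeq16 macro, which compares four shorts per instruction and returns a four-bit mask. Alignment handling is omitted (the array is assumed to be 8-byte aligned), and the bit ordering of the mask is assumed to mirror the 32-bit case, with the most significant bit corresponding to the first short of each group.

#include <vis.h>

int srcvis_search16(short value, short *array)
{
  union {double d; short s[4];} target;
  double *p = (double*) array;   /* assumes array is 8-byte aligned */
  int ind = 0;
  int ret;
  target.s[0] = target.s[1] = target.s[2] = target.s[3] = value;
  /* Examine four shorts per iteration; a non-zero mask means one matched */
  while (!(ret = vis_fcmpeq16(target.d, *p++))) { ind+=4; }
  /* Walk the mask from the most significant bit to find the first match */
  while (!(ret & 8)) { ind++; ret<<=1; }
  return ind;
}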


Using VIS and inline templates

In order to obtain the best possible performance from VIS, it is necessary to use inline templates so that the instruction scheduling can be controlled by hand. The following code implements essentially the same algorithm as the C source code, but the layout is slightly tweaked to improve performance.

/* Routine vis_search(int value, unsigned int * array); */
/* %o0 = value*/
/* %o1 = address of array*/

.inline vis_search,8
  st %o0,[%sp+0x48]    /* store search value */
  clr %o3              /* counter = 0 */
  and %o1,4,%o2        /* check for not 8-byte aligned */
  cmp %o2,4
  bne 1f               /* skip ahead if 8-byte aligned */
  ld  [%o1],%o2        /* delay slot: load the first int (needed on the misaligned path) */

  cmp %o2,%o0
  be 2f                /* found */
  add %o1,4,%o1        /* dealt with misaligned int */

  add %o3,1,%o3        /* scanned the first (misaligned) int */
1:
  ld [%sp+0x48],%f0    /* load upper word */
  fmovs %f0,%f1        /* copy to lower word */
  ldd [%o1],%f2        /* load 2 ints */
  
3:  
  add %o1,8,%o1        /* move pointer to next pair */
  prefetch [%o1+256],0 /* Prefetch four lines ahead */
  fcmpne32 %f0,%f2,%o0 /* compare ints */
  ldda [%o1]%asi,%f2   /* non-faulting load 2 ints */
  cmp %o0,3            /* check result of compare */
  be,a 3b              /* branch if not found */
  add %o3,2,%o3        /* increment count by two; annulled if found */

  andcc %o0,2,%o0      /* isolate the not-equal bit for the upper word */
  bnz,a 2f
  add %o3,1,%o3        /* upper word did not match: the match is the lower word, so add one */
  
2:
  or %o3,%g0,%o0       /* Copy counter to output */
.end

Table 12 - Inline template using VIS instructions
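
The template is used from C like an ordinary function: a prototype is declared and the .il file is passed on the compile line. The following is a minimal sketch of the call site (the exact declaration used by the test harness is not shown in the article):

/* Prototype for the routine supplied by vis_search.il */
int vis_search(int value, unsigned int *array);

...

  if (vis_search(loop, (unsigned int*)array) != loop)
  {
    printf("Miscompare\n");
  }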

The inline template starts by checking for, and correcting, arrays that are not eight-byte aligned. It then duplicates the integer value into both halves of a floating point register. The inner loop is then very similar to the VIS example coded in C. Compiling and running this code (Table 13) shows a slight performance gain over the VIS instructions used at source code level. The gain can be attributed to the use of non-faulting (speculative) load instructions to fetch the next pair of values before the compare of the current pair has completed.

The non-faulting load instruction is equivalent to a normal load, except in the case where the load would access unmapped memory (for example off the end of an array). In this circumstance a normal load would cause a runtime error because the memory is not mapped. However in the same situation, a non-faulting load would not cause an error and would just return a zero value. The advantage of using this instruction is that the load can now be moved before the test of whether the end of the array has been reached, safe in the knowledge that no fault will occur if the load does happen to pass the end of the array.

% cc -O vis_search.il test.c
% a.out
Elapsed seconds = 2.416

Table 13 - Performance of inline template code

It is possible to further improve the performance of the hand-coded VIS routine. For example, the inner loop could be unrolled and pipelined so that two or more iterations are computed in parallel. Whilst that optimisation would improve performance, it would also make the code less clear, so it is not shown here.


Concluding remarks

This article has demonstrated a number of useful techniques for improving performance. To recap:

  • Whilst the compiler generally does a good job of inserting prefetch into code, there is always the possibility of manually inserting prefetch statements into the code to cover the cases where the compiler does not.

  • It is possible to write source code that includes VIS instructions in a way that leads to performance gains.

  • If necessary, further performance can be extracted from codes by manually recoding the hot points using inline templates.


Full source code

Here is a link to the full source code test.c and to the vis_search.il inline code.

Compiling and running the program should produce results similar to that shown below.

% cc -xvis -O -xarch=v8plusa test.c vis_search.il
% a.out
Elapsed seconds = 8.628
Elapsed seconds = 5.606
Elapsed seconds = 2.645
Elapsed seconds = 2.410

Table 14 - Compile line and timing data for all routines

 
About the Authors
Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.

Geetha Vallabhaneni is a member of the Compiler Performance Engineering team at Sun Microsystems Inc. Her work primarily focuses on performance analysis and optimization of applications for UltraSPARC systems.
 