Prefetching Pragmas and Intrinsics

Diane Meirowitz, Senior Staff Engineer, and Spiros Kalogeropulos, Staff Engineer, June, 2007

Introduction

Explicit data prefetching pragmas and intrinsics for the x86 platform, and additional pragmas and intrinsics for the SPARC platform, are now available in the Sun Studio 12 compilers, released June 2007.

Prefetch instructions can increase the speed of an application substantially by bringing data into cache so that it is available when the processor needs it. This benefits performance because today's processors are so fast that it is difficult to bring data into them quickly enough to keep them busy, even with hardware prefetching and multiple levels of data cache.

The compilers have several options that enable them to generate prefetch instructions automatically: -xprefetch, -xprefetch_level, and -xprefetch_auto_type (described below). The compilers generally do an excellent job of inserting prefetch instructions, and relying on them is the most portable and usually the best way to use prefetch. If finer control of prefetching is desired, prefetch pragmas or intrinsics can be used. Note that the performance benefit of prefetch instructions is hardware-dependent: prefetches that improve performance on one chip may not have the same effect on a different chip. It is a good idea to study the instruction reference manual for the target hardware before inserting prefetch pragmas or intrinsics. The Sun Studio Performance Analyzer can also be used to identify where an application incurs cache misses.

Prefetch Pragmas and Intrinsics

Prefetch pragmas are available in Fortran and prefetch intrinsics are available in C and C++. Prefetch can be specified generically, or, on SPARC platforms, with SPARC-specific versions.

Generic x86 and SPARC Prefetch (New)

PREFETCH TYPE | FORTRAN PRAGMA | C, C++ INTRINSIC
Prefetch data that is likely to be read more than once | c$pragma sun_prefetch_read_many(address) | sun_prefetch_read_many(address)
Prefetch data that is likely to be read only once | c$pragma sun_prefetch_read_once(address) | sun_prefetch_read_once(address)
Prefetch data that is likely to be written more than once | c$pragma sun_prefetch_write_many(address) | sun_prefetch_write_many(address)
Prefetch data that is likely to be written only once | c$pragma sun_prefetch_write_once(address) | sun_prefetch_write_once(address)
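
As a rough illustration of how the generic intrinsics might be called from C, the sketch below sums a buffer that is read only once and prefetches a fixed distance ahead. The helper name sum_once, the distance of 64 elements, and the fallback macros are assumptions made for this example. On compilers other than Sun Studio the macros map to the GCC/Clang __builtin_prefetch builtin, which is only an approximate stand-in for the Sun intrinsics and not an exact equivalent.

```c
#include <stddef.h>

/* With Sun Studio, use the real intrinsics. Elsewhere, fall back to an
   assumed stand-in so the sketch still compiles and runs. */
#if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#include <sun_prefetch.h>
#elif defined(__GNUC__)
#define sun_prefetch_read_many(p)  __builtin_prefetch((p), 0, 3)
#define sun_prefetch_read_once(p)  __builtin_prefetch((p), 0, 0)
#define sun_prefetch_write_many(p) __builtin_prefetch((p), 1, 3)
#define sun_prefetch_write_once(p) __builtin_prefetch((p), 1, 0)
#else
#define sun_prefetch_read_many(p)  ((void)(p))
#define sun_prefetch_read_once(p)  ((void)(p))
#define sun_prefetch_write_many(p) ((void)(p))
#define sun_prefetch_write_once(p) ((void)(p))
#endif

/* Hypothetical helper: sum a buffer whose elements are each read once. */
double sum_once(double *x, size_t n)
{
    double s = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Issue one prefetch per 8 elements (one cache line if lines are
           64 bytes, an assumption), 64 elements ahead, and only while the
           target address is still inside the array. */
        if ((i % 8) == 0 && i + 64 < n)
            sun_prefetch_read_once(&x[i + 64]);
        s += x[i];
    }
    return s;
}
```

The read_once variant is chosen here because each element is used a single time; a buffer that is revisited while still cached would call for read_many instead.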

SPARC Platforms only:

PREFETCH TYPE | FORTRAN PRAGMA | C, C++ INTRINSIC
Prefetch data that is likely to be read more than once | c$pragma sparc_prefetch_read_many(address) | sparc_prefetch_read_many(address)
Prefetch data that is likely to be read only once | c$pragma sparc_prefetch_read_once(address) | sparc_prefetch_read_once(address)
Prefetch data that is likely to be written more than once | c$pragma sparc_prefetch_write_many(address) | sparc_prefetch_write_many(address)
Prefetch data that is likely to be written only once | c$pragma sparc_prefetch_write_once(address) | sparc_prefetch_write_once(address)

Strong Prefetch and Instruction Cache Prefetch:

The UltraSPARC IV+ (ultra4plus) processor provides "strong" data prefetch instructions. It also provides a prefetch for instructions rather than data. Strong prefetches are more powerful than normal prefetches and are recommended when the data being prefetched has a very high probability of being used. They are not dropped on a TLB miss or when the prefetch queue is full. UltraSPARC III and UltraSPARC IV processors treat strong prefetches as normal prefetches.

PREFETCH TYPE | FORTRAN PRAGMA | C, C++ INTRINSIC
Prefetch data that is likely to be read more than once | c$pragma sparc_strong_prefetch_read_many(address) | sparc_strong_prefetch_read_many(address)
Prefetch data that is likely to be read only once | c$pragma sparc_strong_prefetch_read_once(address) | sparc_strong_prefetch_read_once(address)
Prefetch data that is likely to be written more than once | c$pragma sparc_strong_prefetch_write_many(address) | sparc_strong_prefetch_write_many(address)
Prefetch data that is likely to be written only once | c$pragma sparc_strong_prefetch_write_once(address) | sparc_strong_prefetch_write_once(address)
Prefetch instructions or data at address of label or data | c$pragma sparc_prefetch_instruction(address) | sparc_prefetch_instruction(address)

Command-line options related to prefetch:

-xprefetch[=auto|no%auto|explicit|no%explicit|latx:factor]
    Enable the compiler to insert prefetch instructions. The default is -xprefetch=auto,explicit.
-xprefetch_level[=1|2|3]
    Control the degree of insertion of prefetch instructions. The default is 1 for C and C++, and 2 for Fortran.
-xprefetch_auto_type=[no%]indirect_array_access
    Generate prefetches for indirect memory accesses.
-xarch=architecture
    Prefetch instructions are only inserted for architectures that support prefetch. See the compiler documentation.
-xO[12345]
    The optimization level must be 2 or higher for automatic prefetch.
-xdepend, -xrestrict, -xalias_level
    These options may affect how aggressively prefetch candidates are computed, because they enable better memory disambiguation.
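
Memory disambiguation means the compiler can establish that two references do not overlap. One source-level way to give a C99 compiler that information is the restrict qualifier, sketched below. The function name scale_restrict is hypothetical, and the sketch does not demonstrate whether any particular compiler release generates more prefetches as a result.

```c
#include <stddef.h>

/* Hypothetical example: restrict promises the compiler that a and b do
   not overlap, so stores to a[] cannot change values later read from b[]. */
void scale_restrict(double *restrict a, const double *restrict b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        a[i] = (b[i] + 1) * 2;
}
```

The caller is responsible for keeping that promise: passing overlapping arrays to a function declared this way is undefined behavior.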
 

Explicit Prefetch Insertion:

For best performance, loops should be unrolled so that each iteration uses one cache line of data. Since it may take several iterations for a cache line to arrive, the prefetch address should be a few iterations ahead of the data currently being processed. It is very important to avoid inserting too many prefetch instructions, since this can seriously degrade performance. Use the "read_many" variant if the data will be read again before being evicted from all levels of data cache, and use "read_once" otherwise.
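
The arithmetic behind these guidelines can be sketched as follows. The helper names are hypothetical, and the 64-byte cache line and 4-iteration distance used in the note after the code are assumed values for illustration only; the actual line size and a suitable distance depend on the target hardware.

```c
#include <stddef.h>

/* Number of elements of elem_bytes each that fit in one cache line.
   This is the unroll factor that makes one iteration use one line. */
size_t elems_per_line(size_t line_bytes, size_t elem_bytes)
{
    return line_bytes / elem_bytes;
}

/* Byte offset to prefetch when staying iters_ahead unrolled iterations
   ahead, with each iteration consuming exactly one cache line. */
size_t prefetch_offset_bytes(size_t line_bytes, size_t iters_ahead)
{
    return line_bytes * iters_ahead;
}
```

With 8-byte double-precision elements and an assumed 64-byte line, elems_per_line(64, sizeof(double)) gives an unroll factor of 8, and prefetch_offset_bytes(64, 4) gives an offset of 256 bytes.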

C/C++ example:

original loop:

    for (i = 0; i < n; i = i + 1) {
        a[i] = (b[i] + 1) * 2;
    }


unrolled loop with explicit prefetch:

 
    #include <sun_prefetch.h>

    static double a[N], b[N];
    int i;
    int m = (n/8) * 8;   /* largest multiple of 8 that does not exceed n */

    for (i = 0; i < m; i = i + 8) {
        /* prefetch ahead of the current position; note that &b[i]+256
           is pointer arithmetic and advances by 256 elements */
        sun_prefetch_read_many(&b[i]+256);
        sun_prefetch_write_many(&a[i]+256);
        a[i] = (b[i] + 1) * 2;
        a[i+1] = (b[i+1] + 1) * 2;
        a[i+2] = (b[i+2] + 1) * 2;
        a[i+3] = (b[i+3] + 1) * 2;
        a[i+4] = (b[i+4] + 1) * 2;
        a[i+5] = (b[i+5] + 1) * 2;
        a[i+6] = (b[i+6] + 1) * 2;
        a[i+7] = (b[i+7] + 1) * 2;
    }

    /* complete the remaining iterations when n is not a multiple of 8; */
    /* the loop starts at m because indices 0 through m-1 are done above. */
    /* The unroll(1) pragma tells the compiler not to unroll this loop. */
#pragma unroll(1)
    for (i = m; i < n; i = i + 1) {
        a[i] = (b[i] + 1) * 2;
    }

% cc -xO4 -xarch=sse2 -xprefetch=explicit,no%auto -c prefetch.c

Note that using -xprefetch=auto with the original loop yields the same code as doing explicit prefetch, so explicit prefetch should only be used for cases where the compiler does not insert adequate prefetches.
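
When a loop is restructured by hand like this, it is worth checking that the main loop and the remainder loop together still touch every element exactly once. The sketch below is one possible check, not part of the Sun Studio tooling: the names scale_simple, scale_blocked, same_result, PF_READ, PF_WRITE, and DIST are hypothetical, and the prefetch macros use the GCC/Clang __builtin_prefetch builtin as an assumed stand-in (or do nothing on other compilers).

```c
#include <stddef.h>

/* Assumed stand-ins for the prefetch intrinsics, so the check runs anywhere. */
#if defined(__GNUC__)
#define PF_READ(p)  __builtin_prefetch((p), 0, 3)
#define PF_WRITE(p) __builtin_prefetch((p), 1, 3)
#else
#define PF_READ(p)  ((void)(p))
#define PF_WRITE(p) ((void)(p))
#endif

#define DIST 32   /* prefetch distance in elements; an assumed value */

/* Reference version: the original loop. */
void scale_simple(double *a, const double *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = (b[i] + 1) * 2;
}

/* Restructured version: 8 elements per outer iteration plus a remainder
   loop. The inner k loop stands in for the 8 unrolled statements. */
void scale_blocked(double *a, const double *b, size_t n)
{
    size_t i, k;
    size_t m = (n / 8) * 8;

    for (i = 0; i < m; i += 8) {
        if (i + DIST < n) {          /* stay inside the arrays */
            PF_READ(&b[i + DIST]);
            PF_WRITE(&a[i + DIST]);
        }
        for (k = 0; k < 8; k++)
            a[i + k] = (b[i + k] + 1) * 2;
    }
    for (i = m; i < n; i++)          /* remainder: indices m .. n-1 */
        a[i] = (b[i] + 1) * 2;
}

/* Returns 1 if both versions produce identical arrays for this n (n <= 64). */
int same_result(size_t n)
{
    double b[64], x[64], y[64];
    size_t i;

    if (n > 64)
        return 0;
    for (i = 0; i < 64; i++) {
        b[i] = (double)i;
        x[i] = y[i] = -1.0;          /* sentinel exposes untouched elements */
    }
    scale_simple(x, b, n);
    scale_blocked(y, b, n);
    for (i = 0; i < 64; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

Calling same_result with lengths on both sides of a multiple of 8 (for example 7, 8, and 13) exercises the boundary between the main loop and the remainder loop, which is where off-by-one mistakes tend to hide.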

Fortran example:

original loop:

      do i = 1, n
         a(i) = (b(i) + 1) * 2
      end do

unrolled loop with explicit prefetch:

      double precision a(N), b(N)
      m = (n/8) * 8
      do i = 1, m, 8
c$pragma sun_prefetch_read_many(loc(b(i))+256)
c$pragma sun_prefetch_write_many(loc(a(i))+256)
         a(i) = (b(i) + 1) * 2
         a(i+1) = (b(i+1) + 1) * 2
         a(i+2) = (b(i+2) + 1) * 2
         a(i+3) = (b(i+3) + 1) * 2
         a(i+4) = (b(i+4) + 1) * 2
         a(i+5) = (b(i+5) + 1) * 2
         a(i+6) = (b(i+6) + 1) * 2
         a(i+7) = (b(i+7) + 1) * 2
      end do

      ! complete the remaining iterations when n is not a multiple of 8
      ! The unroll=1 pragma tells the compiler not to unroll this loop.
c$pragma sun unroll=1
      do i = m+1, n
         a(i) = (b(i) + 1) * 2
      end do

% f95 -O4 -xarch=sse2 -xprefetch=explicit,no%auto -c prefetch.f

Note that using -xprefetch=auto with the original loop yields the same code as doing explicit prefetch, so explicit prefetch should only be used for cases where the compiler does not insert adequate prefetches.
