Avoiding Performance Loss from RAW Hazards

by Darryl Gove

What causes performance loss from read-after-write (RAW) operations, and how to identify, fix, and avoid RAW hazards.


Published February 2014


Introduction

RAW hazards are a performance issue that can impact all types of processors. The cost of a RAW hazard can be surprisingly large. In many cases, these hazards can be easily avoided. To understand the problem, we need to understand a little bit about how processors work.

Want to comment on this article? Post the link on Facebook's OTN Garage page.  Have a similar article to share? Bring it up on Facebook or Twitter and let's discuss.

A RAW hazard is a read-after-write problem. An application has stored a value to memory, and subsequently it wants to load that value back from memory. This is a common situation, so most chips contain hardware that makes this an efficient operation. If this hardware were not present, there would be a cost to this common style of code. However, there are some situations where this hardware is unable to work, and this will cause the processor to stall due to a RAW hazard.

Storing Data to Memory

When a processor stores a value, the value does not instantly get sent to memory. Memory is divided into chunks called cache lines—often 64 bytes in size. In order to store a value into a cache line, the entire cache line needs to be fetched from memory, the stored bytes need to be updated, and then the modified cache line needs to be written back to memory. This process takes some time: basically the time needed to fetch the cache line from memory.

To ensure that the processor does not stall waiting for the cache line to be fetched from memory, there is a structure that holds a list of all the pending store operations. This structure could be called a store queue or a store buffer. When a store operation is performed, the stored value is added to this list, and it remains in the list until the cache line, where the value will be stored, has been fetched from memory. Once the cache line has been fetched, the store can proceed, and the modified cache line can be written out to the caches and, eventually, to memory. The value may remain in the store queue for a significant number of cycles, while it waits for the cache line to be fetched from memory.

Bypassing Values

If there is a store of a value to an address, and then later there is a load from that address, it would take a long time for the load to obtain the value if the processor had to wait for the completion of the fetch of the cache line from memory. So rather than having to wait, most processors have a bypass operation where a load operation will check the store queue for the most recent data, before fetching it from memory.

If the bypass operation works, then the load can quickly get the value from the store queue. A RAW hazard occurs when there has been a prior store to the address, but for some reason the bypass operation is unable to work. In this case, the load has to wait until the stored value reaches the caches. This can take a significant number of cycles.

There are a number of reasons why the bypass operation might not work. The exact situations will depend on the complexity of the bypass hardware. More-complex hardware can avoid more situations.

Bypass hardware should always be able to handle a situation in which there is a single store of a value to an address, and there is a subsequent load from the same address of a value of the same size as the store. For example, if an integer is stored to an address in memory, then the hardware should be able to bypass this value to a subsequent load of an integer from the same address.

There are some situations in which the bypass might not be able to work. If the bypass does not work, then there will be a stall until the stored data is available from the cache. This stall will often have the same kind of cost as a load from memory. The following is a non-exhaustive list of the kinds of situations in which the hardware might not be able to bypass a value from an earlier store to a later load:

  • Bypassing from multiple stores to the same address. Although the store queue is likely to be ordered by time, if there are multiple stores to the same address, then the hardware might not be able to determine which of the stored values is the one to bypass to the subsequent load.
  • Stores and loads of different sizes of data, for example, if a four-byte integer value is stored to an address, and then a subsequent load attempts to extract a single byte from that address.
  • Stores and loads of partial data, for example, if four bytes are stored to consecutive addresses, and then a subsequent load attempts to read those four bytes as an integer.
  • Other hardware limitations. There might be some other limitations on the bypass operation in the hardware. For example, there might be a time "window" during which the bypass operation works, but after that window, the bypass operation can no longer succeed, and the load needs to wait for the value to be available from the cache.

Common Situations that Cause RAW Hazards

There are some situations in which code, that looks reasonable to a developer, can cause a RAW hazard in the hardware. A common code pattern is when a developer needs to manipulate bytes and concatenate the bytes into an integer. This operation might be to load four misaligned bytes of data from a buffer into an integer, or it could be to simulate a big endian load on little endian hardware. Consider the snippet of code shown in Listing 1:

int misaligned_load_int(char* ptr)
{
  int temp;
  memcpy(&temp, ptr, sizeof(int));
  return temp;
}

Listing 1

The memory pointed to by ptr has no alignment guaranty, so on hardware that cannot handle misaligned loads, the call to memcpy() might well become a sequence of byte loads and stores equivalent to the code in Listing 2. In fact, many developers might write the code shown in Listing 2 in order to perform the same operation.

int misaligned_load_int(char* ptr)
{
  int temp;
  ((char*)&temp)[0] = ptr[0];
  ((char*)&temp)[1] = ptr[1];
  ((char*)&temp)[2] = ptr[2];
  ((char*)&temp)[3] = ptr[3];
  return temp;
}

Listing 2

From the code in Listing 2, the compiler will generate a sequence of four load-byte instructions, together with four store-byte instructions, followed by an integer load. This is a situation in which we are storing data of one size and then attempting to load data of a different size; consequently, the hardware might not be able to perform the bypass from the stored value to the load.

Another situation in which code might cause a potential RAW hazard is when the compiler is unable to directly use the value being stored to memory, and it needs to reload the code to ensure correctness. This happens when a variable is declared as being volatile or when there might be another pointer that could alias to the address. Consider the code in Listing 3:

void update(int *value1, int *value2)
{
  *value1++;
  *value2++;
  return *value1;
}

Listing 3

In Listing 3, the store to value1 needs to be performed because the compiler does not know whether value1 and value2 point to the same location. If they are held at the same location, then the update of value2 will change the result returned from the read of value1.

This case is interesting for a couple of reasons. Most hardware should be fine with bypassing the store of value1 to its subsequent load. So although the code has a potential RAW hazard, most processors will be able to deal with it. However, if value1 and value2 do point to the same location in memory, then there might be two stores to the same location in the store queue, and the bypass operation might not work.

Fixing RAW Hazards

Many situations with RAW hazards have trivial fixes. The fix is typically to avoid loading a value that has just been stored to memory. Sometimes the compiler can do this automatically, and sometimes it needs intervention from the developer.

The situation in which the RAW hazard is caused by different sizes requires the developer to recode using logical operations rather than passing the data through memory. If we need to access data stored using byte-sized store operations as an integer, we can load the bytes, and then combine them in a register, rather than combining them in memory. We can rewrite the misaligned_load_int() example shown in Listing 2, for big endian hardware, as shown in Listing 4:

int misaligned_load_int(char* ptr)
{
  int temp;
  temp = (ptr[3]<<24) + (ptr[2]<<16) + (ptr[1]<<8) + ptr[0];
  return temp;
}

Listing 4

If we take the earlier example of update() in Listing 3, we can manually check whether the two locations are separate and handle the two situations appropriately, as shown in Listing 5. Such a code sequence introduces the overhead of testing the two pointers and branching based on the results. So it is likely to be slower than the original code if the two pointers are different. However, if the processor does suffer from RAW hazards, and the two pointers frequently point to the same location, then the cost of the branches might be less than the cost of the RAW hazard.

void update(int *value1, int *value2)
{
  if (value1==value2)
  {
    *value1+=2;
    return value1;
  }
  else
  {
    int temp = *value1++;
    *value2++;
    return temp;
  }
}

Listing 5

Identifying RAW Hazards

A RAW hazard in a profile looks like time spent on a load instruction. It is often hard to tell that the time spent on the load is due to RAW hazards and not due to some other reason for a load taking time, for example, cache misses.

There are some clues in the disassembly. For the cause to be a RAW hazard, there must be previous stores to the same address. If the data is being held on the stack, then the disassembly will clearly indicate that the address is the same offset from the stack pointer.

However, inspecting the disassembly can require some patience. Many processors will provide hardware counters that indicate whether a RAW hazard occurs or not. So this can be an easy way to determine whether RAW hazards are occurring. Combining the performance counters with a profiling tool will often indicate the exact places in the code where the RAW hazards are occurring.

Listing 6 is an example of using the Oracle Solaris Studio Performance Analyzer on one of the example code snippets. The application has been profiled on a SPARC T4 processor from Oracle using the hardware performance counter named RAW_hit_st_buf, which counts RAW hazard events.

$ collect -h RAW_hit_st_buf ./a.out
$ er_print test.1.er
(er_print) metrics e.RAW_hit_st_buf
(er_print) src  misaligned_load_int_bad2
   Excl.
   RAW_hit_st_buf
   Events
...
                    42. int misaligned_load_int_bad2(char* ptr)
            0       43. {
                    44.   int temp;
            0       45.   ((char*)&temp)[0] = ptr[0];
            0       46.   ((char*)&temp)[1] = ptr[1];
            0       47.   ((char*)&temp)[2] = ptr[2];
            0       48.   ((char*)&temp)[3] = ptr[3];
            0       49.   return temp;
    100000302       50. }

Listing 6

The profile indicates the locations in the code where RAW hazards are a problem.

Example of Code Containing RAW Hazards

The code in Listing 7 contains examples of routines that have a RAW hazard and alternative versions of the code which avoid the RAW hazards. Note that the code uses the Oracle Solaris call gethrtime(), which is a low-overhead way of getting a high-resolution time stamp. A more portable way of doing this would be to use a call such as gettimeofday(), but this would have a higher overhead.

#include <stdio.h> 
#include <sys/time.h> 
#include <string.h> 

void tick() 
{ 
  hrtime_t now = gethrtime(); 
  static hrtime_t then = 0; 
  if (then>0) printf("Elapsed = %f ns\n", 1.0*(now-then)/100000000.0); 
  then = now; 
} 

int update_bad(int *value1, int *value2) 
{ 
  (*value1)+=1; 
  (*value2)+=1; 
  return *value1; 
} 

int update_good1(int *value1, int *value2) 
{ 
  if (value1==value2) 
  { 
    (*value1)+=2; 
    return *value1; 
  } 
  else 
  { 
    int temp = *value1+=1; 
    (*value2)+=1; 
    return temp; 
  } 
} 

int misaligned_load_int_bad1(char* ptr) 
{ 
  int temp; 
  memcpy(&temp, ptr, sizeof(int)); 
  return temp; 
} 

int misaligned_load_int_bad2(char* ptr) 
{ 
  int temp; 
  ((char*)&temp)[0] = ptr[0]; 
  ((char*)&temp)[1] = ptr[1]; 
  ((char*)&temp)[2] = ptr[2]; 
  ((char*)&temp)[3] = ptr[3]; 
  return temp; 
} 

int misaligned_load_int_good(char* ptr) 
{ 
  int temp; 
  temp = (ptr[3]<<24) + (ptr[2]<<16) + (ptr[1]<<8) + ptr[0]; 
  return temp; 
} 

#define COUNT 100000000 
#define COUNT2 300000000 

void main() 
{ 
  char buffer[1000]; 
  tick(); 
  printf("Misaligned load v1 (bad) memcpy()\n"); 
  for(int i=0;i<COUNT;i++) misaligned_load_int_bad1(&buffer[1]); 
  tick(); 
  printf("Misaligned load v2 (bad) byte copy\n"); 
  for(int i=0;i<COUNT;i++) misaligned_load_int_bad2(&buffer[1]); 
  tick(); 
  printf("Misaligned load good\n"); 
  for(int i=0;i<COUNT;i++) misaligned_load_int_good(&buffer[1]); 
  tick(); 
  int value1, value2; 
  printf("\nUpdate bad -- different addresses\n"); 
  for(int i=0;i<COUNT2;i++) update_bad(&value1,&value2); 
  tick(); 
  printf("Update bad -- same address\n"); 
  for(int i=0;i<COUNT2;i++) update_bad(&value1,&value1); 
  tick(); 
  printf("Update good -- different addresses\n"); 
  for(int i=0;i<COUNT2;i++) update_good1(&value1,&value2); 
  tick(); 
  printf("Update good -- same address\n"); 
  for(int i=0;i<COUNT2;i++) update_good1(&value1,&value1); 
  tick(); 
}

Listing 7

As with all microbenchmarks, you need to be careful that the compiler does not realize that the code is pointless and optimize it all out, or optimize it in other ways that break the effect that the code is trying to demonstrate.

For the code in Listing 7, the compiler must not perform a couple of optimizations. Function inlining needs to be switched off; if it is enabled, the compiler will almost certainly recognize that no useful work is being performed by the code, and it will eliminate all the loops. The second optimization that needs to be disabled is recognition of memcpy() function calls. If the compiler recognizes the memcpy() call, it will probably replace it with more-efficient code, and it might even generate code that removes the RAW hazard.

For the Oracle Solaris Studio compiler, an optimization level of -O will not enable function inlining, and the flag -xbuiltin=%none will tell the compiler not to replace calls to memcpy() with inline code. The results from compiling the microbenchmark and running it on an old Oracle Solaris x86 system are shown in Listing 8:

$ cc -O -xbuiltin=%none raw.c 
$ ./a.out 
Misaligned load v1 (bad) memcpy()
Elapsed = 24.738033 s 
Misaligned load v2 (bad) byte copy
Elapsed = 11.100576 s 
Misaligned load good 
Elapsed = 4.348355 s 

Update bad -- different addresses 
Elapsed = 13.264937 s 
Update bad -- same address 
Elapsed = 17.670419 s 
Update good -- different addresses 
Elapsed = 13.888094 s 
Update good -- same address 
Elapsed = 13.587685 s 

Listing 8

The results in Listing 8 show that code to handle misaligned data using memcpy() is quite painfully slow—about 6x slower than the logical operations that replace it: 24 seconds versus 4 seconds. This is partly due to the cost of a function call to memcpy(), but also due to the RAW hazard. The results from the code that uses loads and stores of bytes still has the RAW hazard and is about 3x slower than the logical operations: 11 seconds versus 4 seconds.

The results in Listing 8 from the code that updates two memory locations shows interesting, but expected, behavior. If the two locations are different, then the original code is marginally faster than the code that includes the overhead of checking for potential RAW hazards: 13.3 seconds versus 13.9 seconds. However, if the two addresses are the same, then the original code hits a significant slowdown: performance drops from 13.3 seconds to 17.7 seconds. The alternative code has some overhead above the original code, but avoids the slowdown when the two pointers indicate the same location; the original code took 13.3 seconds, and the code with the workaround takes about 13.6 seconds. The important thing to recognize here is that the new code is perhaps slightly slower, but it avoids a significant potential slowdown.

Conclusion

RAW hazards can become a significant cause of processor stall. However, they are a condition that can be resolved or avoided in many situations.

See Also

About the Author

Darryl Gove is a senior principal software engineer in the Oracle Solaris Studio team, working on optimizing applications and benchmarks for current and future processors. He is also the author of the books Multicore Application Programming, Solaris Application Programming, and The Developer's Edge. He has a blog at http://blogs.oracle.com/d/.

Revision 1.0, 02/21/2014

Follow us:
Blog | Facebook | Twitter | YouTube