Automatic DTrace Probe Insertion in Fortran

   
By Diane Meirowitz, February 2009  

This article describes a new feature of the Sun Studio 12 Fortran compiler—automatic insertion of DTrace static probes in optimized code. DTrace is a powerful, dynamic tracing facility that was introduced in the Solaris 10 operating environment. See http://www.opensolaris.org/os/community/dtrace/ for more details. The Fortran -dtrace option is in the technology preview stage, which means it is an experimental feature that is evolving and may not yet have the same high quality as other features. The names of the probes, the locations where the probes are inserted, and the command-line option are all likely to change in future releases.

Please take this new feature for a spin and let us know what you think!

Contents
 
Introduction to the Example
Initial Performance Profile
Using Fortran Automatic DTrace Probe Insertion
Finding Call Sites, Part 1: Using ustack to Find Call Stacks
Finding Call Sites, Part 2: Using dbx to Interpret Stack Traces
Performance Workaround for the Example
Effect of -dtrace and DTRACE_PROBE Macro on Optimization
Future Directions
References
 
Introduction to the Example

 

Analyzing the performance of Fortran programs can be more difficult than doing so for C or C++, because a single innocent-looking statement can be translated into hundreds of machine instructions. Furthermore, the performance can vary depending upon various run time conditions. Consider, for example,

   subroutine add_three (A, B)
                    

       integer, pointer, dimension(:) :: A, B
                    

       A = B + 3
                    

    end subroutine
                  
 

The statement 'A = B + 3' looks simple enough. However, the code generated for it is definitely not trivial. The semantic rules of Fortran say that the expression on the right of an assignment must be fully evaluated before the assignment to the variable on the left. Since both A and B have the POINTER attribute, the compiler has to assume that it is possible that they may occupy some of the same memory. In order to execute this statement correctly, the compiler generates something like the following fairly expensive code:

   temp = malloc(...)
                    

   do i = ...
                    

      temp(i) = B(i) + 3
                    

   end do
                    

   do i = ...
                    

      A(i) = temp(i)
                    

   end do
                    

   free(temp)
                    

                  
 

Since it is often the case that A and B do not occupy the same storage, the Sun Studio 12 Fortran compiler adds a check to avoid the extra code, yielding the following code:

   if (A and B do not overlap) then
      do i = ...
         A(i) = B(i) + 3
      end do
   else
      temp = malloc(...)
      do i = ...
         temp(i) = B(i) + 3
      end do
      do i = ...
         A(i) = temp(i)
      end do
      free(temp)
   end if
 

If this function happened to take up a large amount of time, it could be because A and B actually overlap. However, it would be difficult to discover that.

Initial Performance Profile

 

First, let's wrap the subroutine in a module, in add_three.f90:

    MODULE ADD_IT
                    

    contains
                    

       subroutine add_three (A, B)
                    

          integer, pointer, dimension(:) :: A, B
                    

          A = B + 3
                    

       end subroutine
                    

    END MODULE ADD_IT
                    

                  
 

Next, let's add a main program, in main.f90:

   program main
                    

      USE ADD_IT
                    

      parameter (n=100000,m=50000)
                    

      integer, target, dimension(n) :: c,d
                    

      integer, pointer, dimension(:) :: a, b
                    

      c = (/ (i, i = 1,n) /)
                    

      d = (/ (i, i = 1,n) /)
                    

      a => c(2:n)
                    

      b => c(1:n-1)
                    

      do i = 1, m
                    

         call add_three(a,b)   ! 'a' and 'b' pointers overlap
                    

      end do
                    

      b => d(1:n-1)
                    

      do i = 1, m
                    

         call add_three(a,b)   ! 'a' and 'b' pointers do not overlap
                    

      end do
                    

   end
                    

                  
 

We put the subprograms in separate files so that the compiler does not get too smart and optimize all the loops away!

Below we compile, collect performance data with the Collector tool, and then print profile information using the Analyzer's command-line tool er_print (we could also have done this with the Analyzer GUI). The -fast flag sets various compilation options for fast performance, and -g adds information for debugging and profiling.

%  
                     f95 -fast -g add_three.f90 main.f90
                    

add_three.f90:
                    

main.f90:
                    

Linking:
                    

                     

%  
                     collect a.out
                    

Creating experiment database test.1.er ...
                    

                     

%  
                     er_print -source main test.1.er
                    

                     

      Excl.     Incl.     
                    

      User CPU  User CPU  
                    

       sec.       sec.
                    

      0.        17.122  <Function: main, from source file main.f90>
                    

      ...
                    

      0.         0.             6.    c = (/ (i, i = 1,n) /)
                    

      0.         0.             7.    a => c(2:n)
                    

      0.         0.             8.    b => c(1:n-1)
                    

      0.         0.             9.    do i = 1, m
                    

  ##  0.        11.548         10.       call add_three(a,b)
                    

      0.         0.            11.    end do
                    

      0.         0.            12.    b => d(2:n)
                    

      0.         0.            13.    d = (/ (i, i = 1,n) /)
                    

      0.         0.            14.    do i = 1, m
                    

      0.         6.424         15.       call add_three(a,b)
                    

      0.         0.            16.    end do
                    

      0.         0.            17.    end
                    

                  
 

Notice that the call on line 10 takes nearly twice as much time as the call on line 15. From the code we can see that they are called the same number of times on the same amount of data. Why is the second one faster?

Using Fortran Automatic DTrace Probe Insertion

 

To figure out why the second call is faster, we can try the -dtrace option, new in Sun Studio 12. First, we recompile, adding -qoption f90comp -dtrace. (Note that -dtrace currently has no effect on SPARC processors with optimization level less than 2, but works on x86 at all optimization levels. Note also that -qoption is used because the syntax of the option will change when it becomes a fully supported feature.) Then we run DTrace -G to postprocess add_three.o, just as we would have to do if we had used the DTRACE_PROBE macro in C/C++ code.

Here is fortran_provider.d:

   provider fortran {
                    

      /* arguments are: description, file, line number, flag */
                    

      probe event(string, string, int, int);
                    

      probe start_event(string, string, int, int);
                    

      probe stop_event(string, string, int, int);
                    

   };
                    

                  
 

Here is fortran_action.d:

   BEGIN
                    

   {
                    

      start_ts = timestamp;
                    

   }
                    

                     

   fortran$target:::start_event
                    

   / pid == $target /
                    

   {
                    

      message = stringof(copyinstr(arg0));
                    

      file = stringof(copyinstr(arg1));
                    

      line = arg2;
                    

      self->ts[message,file,line] = timestamp;
                    

   }
                    

                     

   fortran$target:::stop_event
                    

   / pid == $target /
                    

   {
                    

      t = timestamp;
                    

      message = stringof(copyinstr(arg0));
                    

      file = stringof(copyinstr(arg1));
                    

      line = arg2;
                    

      ela = t - self->ts[message,file,line];
                    

      @time [message, line, file] = sum(ela);
                    

      @count[message, line, file] = count();
                    

   }
                    

                     

   END
                    

   {
                    

      total_elapsed = timestamp - start_ts;
                    

      printf("Count\n");
                    

      printa(" %12@u  %s, line %d, file %s\n", @count);
                    

      printf("\n");
                    

      normalize(@time, 1000);
                    

      printf("Time in microseconds\n");
                    

      printa(" %12@u  %s, line %d, file %s\n", @time);
                    

      printf(" %12u  Total elapsed time usec\n", total_elapsed/1000);
                    

   }
                    

                  
 

Before running DTrace, we need to have permission to do so. So we determine our process id and then ask our system administrator to use the ppriv command to allow us to run DTrace. For example:

%  
                                             echo $$                     
                    

12345
                    

%  
                                             su root                     
                    

Password:
                    

#  
                                             ppriv -s A+dtrace_proc 12345                     
                    

#  
                                             exit                     
                  
 

And we are ready to go.

%  
                                             f95 -fast -g -c -qoption f90comp -dtrace add_three.f90 main.f90                     
                    

add_three.f90:
                    

main.f90:
                    

                     

%  
                                             dtrace -G -32 -s fortran_provider.d add_three.o                     
                    

                     

                     

%  
                     f95 -fast -g fortran_provider.o add_three.o main.o
                    

                     

%  
                     dtrace -q -s fortran_action.d -c ./a.out
                    

                     

   Count
                    

      50000  array syntax no overlap, line 5, file add_three.f90
                    

      50000  array syntax with overlap, line 5, file add_three.f90
                    

                     

   Time in microseconds
                    

    5735866  array syntax no overlap, line 5, file add_three.f90
                    

   10126427  array syntax with overlap, line 5, file add_three.f90
                    

   17289700  Total elapsed time usec
                  
 

From the DTrace output above we can see that nearly twice as much time is spent in "array syntax with overlap" as in "array syntax without overlap." "array syntax with overlap" tells us that the arrays A and B in subroutine add_three share some storage. As we discussed earlier, the compiler handles this array overlap with slower malloc/free and copy code, rather than the single loop.

Finding Call Sites, Part 1: Using ustack to Find Call Stacks

 

Since there are two call sites for add_three, how can we determine which site was the more expensive one? To do this, we use DTrace's ustack function to obtain call stacks. Later, we print out the stacks in the END probe in dtrace_action.d. Here are the modified DTrace probes:

   fortran$target:::stop_event
   / pid == $target /
   {
                    

      t = timestamp;
                    

                     

      message = stringof(copyinstr(arg0));
                    

      file = stringof(copyinstr(arg1));
                    

      line = arg2;
                    

      ela = t - self->ts[message,file,line];
      @time [message, line, file, ustack(2)] = sum(ela);\
      # get 2 stack frames
      @count[message, line, file] = count();
   }
                    

                     

   END
                    

   {
                    

      total_elapsed = timestamp - start_ts;
                    

      printf("Count\n");
                    

      printa(" %12@u  %s, line %d, file %s\n", @count);
                    

      printf("\n");
                    

      normalize(@time, 1000);
                    

      printf("Time in microseconds\n");
                    

      printa(" %12@u  %s, line %d, file %s, stack%k\n", @time);
                    

      printf(" %12u  Total elapsed time usec\n", total_elapsed/1000);
                    

   }
                    

                  
 

We will need to use the f95 -xregs=no%frameptr option after -fast in order to use ustack. We also compile with -xunroll=1 to disable loop unrolling in order to simplify the output of DTrace.

%  
                     f95 -fast -xregs=no%frameptr -g -xunroll=1 -c\       -qoption f90comp -dtrace add_three.f90 main.f90
add_three.f90:
main.f90:
                    

                     

%  
                     dtrace -G -32 -s fortran_provider.d add_three.o
                    

                     

                     

%  
                     f95 -fast -g fortran_provider.o add_three.o main.o
                    

                     

%  
                     dtrace -q -s fortran_action.d -c ./a.out
                    

                     

Count
                    

        50000  array syntax no overlap, line 5, file add_three.f90
                    

        50000  array syntax with overlap, line 5, file add_three.f90
                    

                     

Time in microseconds
      7007362  array syntax no overlap, line 5,\
      file add_three.f90, stack
                    

a.out`add_it.add_three_+0x292
                    

a.out`MAIN_+0x13d

     14481704  array syntax with overlap, line 5,\
     file add_three.f90, stack
a.out`add_it.add_three_+0x1a9
a.out`MAIN_+0xad

     22925477 Total elapsed time usec
                    

                  
 
Finding Call Sites, Part 2: Using dbx to Interpret Stack Traces

 

Now we need to relate the stack trace to the source line. To do this, we will use dbx. Since we are using Fortran, we need to remove the trailing underscore from the name that DTrace gives us and use the actual Fortran name of the function.

%  
                     dbx a.out
                    

dbx>  
                     whereis -a MAIN+0x13d
                    

in `a.out`main.f90`MAIN at "main.f90":16
                    

                     

dbx>  
                     whereis -a MAIN+0xad
                    

in `a.out`main.f90`MAIN at "main.f90":11
                    

                  
 

The line number is one higher than the line with the call, but this information does tell us that:

  • 14481704 usecs were spent doing array syntax with overlap on line 10 of main.f90
  • 7007362 usecs were spent doing array syntax no overlap on line 15 of main.f90

From this we can conclude that arguments to the call on line 10 were arrays that overlapped each other requiring an expensive copy and calls to malloc/free, whereas the call on line 15 was not expensive.

Performance Workaround for the Example

 

Now that we know why the code is slower, how can we speed it up while preserving correctness? For the first case, recall that A points at elements 2:n of C, while B points at elements 1:n-1 of C.

   a => c(2:n)
                    

   b => c(1:n-1)
                  
 

The function add_three assigns A = B+3. Iterating forward through these arrays would give us the incorrect result, i.e.,

   a(1) = b(1) + 3      ! i.e., c(2) = c(1) + 3
                    

   a(2) = b(2) + 3      ! i.e., c(3) = c(2) + 3
                    

   a(3) = b(3) + 3      ! i.e., c(4) = c(3) + 3
                    

   and so on...
                    

                  
 

Notice that c(2) is read after it is written, which violates the rule that for A = B + 3, the expression B + 3 must be fully evaluated before A is written to. But if we run this loop in reverse, we will get the correct answer. One way to do this would be to create a new function which contains an explicit loop, rather than array syntax:

   subroutine add_three_reverse(a, b)
                    

      integer, pointer, dimension(:) :: a, b
                    

      do i = ubound(a,1), 1, -1
                    

         a(i) = b(i) + 3
                    

      end do
                    

   end
                  
 
Effect of -dtrace and DTRACE_PROBE Macro on Optimization

 

The insertion of DTrace probes in code can inhibit some of the compiler's optimizations. To the compiler, these probes look like ordinary function calls, so in general, the compiler cannot move code past the call to say, fuse loops or propagate constants. This is probably fine if the probes surround code that takes a long time to execute, but bad if the probes surround trivial code. The Fortran -dtrace switch inserts probes around code that is generally expected to be expensive, so the effect on optimization is probably minimal.

Future Directions

 

Since the -dtrace option is in the early stages of development, it currently only inserts probes for a few interesting cases involving array syntax: runtime overlap testing, array copy-in/out for calls, and loop collapsing for array syntax. We anticipate inserting probes for various array intrinsics such as MATMUL, DOT_PRODUCT, CSHIFT, etc. We will probably also have different start/stop probe names for different conditions, so it will be easy to ignore things that you don't care about. We want to make this a really useful feature, so please let us know if there are other probes that the compiler should insert. Comments on this article can be posted to the Sun Studio Fortran Forum.

References

 

Rate This Article
 
 
Comments