Automatic DTrace Probe Insertion in Fortran
| By Diane Meirowitz, February 2009
|
|
|
This article describes a new feature of the Sun Studio 12 Fortran compiler—automatic insertion of DTrace static probes in optimized code. DTrace is a powerful, dynamic tracing facility that was introduced in the Solaris 10 operating environment. See
http://www.opensolaris.org/os/community/dtrace/ for more details. The Fortran
-dtrace option is in the technology preview stage, which means it is an experimental feature that is evolving and may not yet have the same high quality as other features. The names of the probes, the locations where the probes are inserted, and the command-line option are all likely to change in future releases.
Please take this new feature for a spin and let us know what you think!
Introduction to the Example
Analyzing the performance of Fortran programs can be more difficult than doing so for C or C++, because a single innocent-looking statement can be translated into hundreds of machine instructions. Furthermore, the performance can vary depending upon various run time conditions. Consider, for example,
subroutine add_three (A, B)
integer, pointer, dimension(:) :: A, B
A = B + 3
end subroutine
|
The statement 'A = B + 3' looks simple enough. However, the code generated for it is definitely not trivial. The semantic rules of Fortran say that the expression on the right of an assignment must be fully evaluated before the assignment to the variable on the left. Since both A and B have the POINTER attribute, the compiler has to assume that it is possible that they may occupy some of the same memory. In order to execute this statement correctly, the compiler generates something like the following fairly expensive code:
temp = malloc(...)
do i = ...
temp(i) = B(i) + 3
end do
do i = ...
A(i) = temp(i)
end do
free(temp)
|
Since it is often the case that A and B do not occupy the same storage, the Sun Studio 12 Fortran compiler adds a check to avoid the extra code, yielding the following code:
if (A and B do not overlap) then
do i = ...
A(i) = B(i) + 3
end do
else
temp = malloc(...)
do i = ...
temp(i) = B(i) + 3
end do
do i = ...
A(i) = temp(i)
end do
free(temp)
end if
|
If this function happened to take up a large amount of time, it could be because A and B actually overlap. However, it would be difficult to discover that.
Initial Performance Profile
First, let's wrap the subroutine in a module, in add_three.f90:
MODULE ADD_IT
contains
subroutine add_three (A, B)
integer, pointer, dimension(:) :: A, B
A = B + 3
end subroutine
END MODULE ADD_IT
|
Next, let's add a main program, in main.f90:
program main
USE ADD_IT
parameter (n=100000,m=50000)
integer, target, dimension(n) :: c,d
integer, pointer, dimension(:) :: a, b
c = (/ (i, i = 1,n) /)
d = (/ (i, i = 1,n) /)
a => c(2:n)
b => c(1:n-1)
do i = 1, m
call add_three(a,b) ! 'a' and 'b' pointers overlap
end do
b => d(1:n-1)
do i = 1, m
call add_three(a,b) ! 'a' and 'b' pointers do not overlap
end do
end
|
We put the subprograms in separate files so that the compiler does not get too smart and optimize all the loops away!
Below we compile, collect performance data with the Collector tool, and then print profile information using the Analyzer's command-line tool
er_print (we could also have done this with the Analyzer GUI). The
-fast flag sets various compilation options for fast performance, and
-g adds information for debugging and profiling.
%
f95 -fast -g add_three.f90 main.f90
add_three.f90:
main.f90:
Linking:
%
collect a.out
Creating experiment database test.1.er ...
%
er_print -source main test.1.er
Excl. Incl.
User CPU User CPU
sec. sec.
0. 17.122 <Function: main, from source file main.f90>
...
0. 0. 6. c = (/ (i, i = 1,n) /)
0. 0. 7. a => c(2:n)
0. 0. 8. b => c(1:n-1)
0. 0. 9. do i = 1, m
## 0. 11.548 10. call add_three(a,b)
0. 0. 11. end do
0. 0. 12. b => d(2:n)
0. 0. 13. d = (/ (i, i = 1,n) /)
0. 0. 14. do i = 1, m
0. 6.424 15. call add_three(a,b)
0. 0. 16. end do
0. 0. 17. end
|
Notice that the call on line 10 takes nearly twice as much time as the call on line 15. From the code we can see that they are called the same number of times on the same amount of data. Why is the second one faster?
Using Fortran Automatic DTrace Probe Insertion
To figure out why the second call is faster, we can try the
-dtrace option, new in Sun Studio 12. First, we recompile, adding
-qoption f90comp -dtrace. (Note that
-dtrace currently has no effect on SPARC processors with optimization level less than 2, but works on x86 at all optimization levels. Note also that
-qoption is used because the syntax of the option will change when it becomes a fully supported feature.) Then we run
DTrace -G to postprocess add_three.o, just as we would have to do if we had used the
DTRACE_PROBE macro in C/C++ code.
Here is fortran_provider.d:
provider fortran {
/* arguments are: description, file, line number, flag */
probe event(string, string, int, int);
probe start_event(string, string, int, int);
probe stop_event(string, string, int, int);
};
|
Here is fortran_action.d:
BEGIN
{
start_ts = timestamp;
}
fortran$target:::start_event
/ pid == $target /
{
message = stringof(copyinstr(arg0));
file = stringof(copyinstr(arg1));
line = arg2;
self->ts[message,file,line] = timestamp;
}
fortran$target:::stop_event
/ pid == $target /
{
t = timestamp;
message = stringof(copyinstr(arg0));
file = stringof(copyinstr(arg1));
line = arg2;
ela = t - self->ts[message,file,line];
@time [message, line, file] = sum(ela);
@count[message, line, file] = count();
}
END
{
total_elapsed = timestamp - start_ts;
printf("Count\n");
printa(" %12@u %s, line %d, file %s\n", @count);
printf("\n");
normalize(@time, 1000);
printf("Time in microseconds\n");
printa(" %12@u %s, line %d, file %s\n", @time);
printf(" %12u Total elapsed time usec\n", total_elapsed/1000);
}
|
Before running DTrace, we need to have permission to do so. So we determine our process id and then ask our system administrator to use the
ppriv command to allow us to run DTrace. For example:
%
echo $$
12345
%
su root
Password:
#
ppriv -s A+dtrace_proc 12345
#
exit
|
And we are ready to go.
%
f95 -fast -g -c -qoption f90comp -dtrace add_three.f90 main.f90
add_three.f90:
main.f90:
%
dtrace -G -32 -s fortran_provider.d add_three.o
%
f95 -fast -g fortran_provider.o add_three.o main.o
%
dtrace -q -s fortran_action.d -c ./a.out
Count
50000 array syntax no overlap, line 5, file add_three.f90
50000 array syntax with overlap, line 5, file add_three.f90
Time in microseconds
5735866 array syntax no overlap, line 5, file add_three.f90
10126427 array syntax with overlap, line 5, file add_three.f90
17289700 Total elapsed time usec
|
From the DTrace output above we can see that nearly twice as much time is spent in "array syntax with overlap" as in "array syntax without overlap." "array syntax with overlap" tells us that the arrays A and B in subroutine add_three share some storage. As we discussed earlier, the compiler handles this array overlap with slower malloc/free and copy code, rather than the single loop.
Finding Call Sites, Part 1: Using ustack to Find Call Stacks
Since there are two call sites for add_three, how can we determine which site was the more expensive one? To do this, we use DTrace's ustack function to obtain call stacks. Later, we print out the stacks in the END probe in dtrace_action.d. Here are the modified DTrace probes:
fortran$target:::stop_event
/ pid == $target /
{
t = timestamp;
message = stringof(copyinstr(arg0));
file = stringof(copyinstr(arg1));
line = arg2;
ela = t - self->ts[message,file,line];
@time [message, line, file, ustack(2)] = sum(ela);\
# get 2 stack frames
@count[message, line, file] = count();
}
END
{
total_elapsed = timestamp - start_ts;
printf("Count\n");
printa(" %12@u %s, line %d, file %s\n", @count);
printf("\n");
normalize(@time, 1000);
printf("Time in microseconds\n");
printa(" %12@u %s, line %d, file %s, stack%k\n", @time);
printf(" %12u Total elapsed time usec\n", total_elapsed/1000);
}
|
We will need to use the
f95 -xregs=no%frameptr option
after
-fast in order to use ustack. We also compile with
-xunroll=1 to disable loop unrolling in order to simplify the output of DTrace.
%
f95 -fast -xregs=no%frameptr -g -xunroll=1 -c\
-qoption f90comp -dtrace add_three.f90 main.f90
add_three.f90:
main.f90:
%
dtrace -G -32 -s fortran_provider.d add_three.o
%
f95 -fast -g fortran_provider.o add_three.o main.o
%
dtrace -q -s fortran_action.d -c ./a.out
Count
50000 array syntax no overlap, line 5, file add_three.f90
50000 array syntax with overlap, line 5, file add_three.f90
Time in microseconds
7007362 array syntax no overlap, line 5,\
file add_three.f90, stack
a.out`add_it.add_three_+0x292
a.out`MAIN_+0x13d
14481704 array syntax with overlap, line 5,\
file add_three.f90, stack
a.out`add_it.add_three_+0x1a9
a.out`MAIN_+0xad
22925477 Total elapsed time usec
|
Finding Call Sites, Part 2: Using dbx to Interpret Stack Traces
Now we need to relate the stack trace to the source line. To do this, we will use dbx. Since we are using Fortran, we need to remove the trailing underscore from the name that DTrace gives us and use the actual Fortran name of the function.
%
dbx a.out
dbx>
whereis -a MAIN+0x13d
in `a.out`main.f90`MAIN at "main.f90":16
dbx>
whereis -a MAIN+0xad
in `a.out`main.f90`MAIN at "main.f90":11
|
The line number is one higher than the line with the call, but this information does tell us that:
From this we can conclude that arguments to the call on line 10 were arrays that overlapped each other requiring an expensive copy and calls to malloc/free, whereas the call on line 15 was not expensive.
Performance Workaround for the Example
Now that we know why the code is slower, how can we speed it up while preserving correctness? For the first case, recall that A points at elements 2:n of C, while B points at elements 1:n-1 of C.
a => c(2:n)
b => c(1:n-1)
|
The function add_three assigns A = B+3. Iterating forward through these arrays would give us the incorrect result, i.e.,
a(1) = b(1) + 3 ! i.e., c(2) = c(1) + 3
a(2) = b(2) + 3 ! i.e., c(3) = c(2) + 3
a(3) = b(3) + 3 ! i.e., c(4) = c(3) + 3
and so on...
|
Notice that c(2) is read after it is written, which violates the rule that for A = B + 3, the expression B + 3 must be fully evaluated before A is written to. But if we run this loop in reverse, we will get the correct answer. One way to do this would be to create a new function which contains an explicit loop, rather than array syntax:
subroutine add_three_reverse(a, b)
integer, pointer, dimension(:) :: a, b
do i = ubound(a,1), 1, -1
a(i) = b(i) + 3
end do
end
|
Effect of
-dtrace and
DTRACE_PROBE Macro on Optimization
The insertion of DTrace probes in code can inhibit some of the compiler's optimizations. To the compiler, these probes look like ordinary function calls, so in general, the compiler cannot move code past the call to say, fuse loops or propagate constants. This is probably fine if the probes surround code that takes a long time to execute, but bad if the probes surround trivial code. The Fortran
-dtrace switch inserts probes around code that is generally expected to be expensive, so the effect on optimization is probably minimal.
Future Directions
Since the
-dtrace option is in the early stages of development, it currently only inserts probes for a few interesting cases involving array syntax: runtime overlap testing, array copy-in/out for calls, and loop collapsing for array syntax. We anticipate inserting probes for various array intrinsics such as MATMUL, DOT_PRODUCT, CSHIFT, etc. We will probably also have different start/stop probe names for different conditions, so it will be easy to ignore things that you don't care about. We want to make this a really useful feature, so please let us know if there are other probes that the compiler should insert. Comments on this article can be posted to the
Sun Studio Fortran Forum.
References
Terms of Use