Improving Code Layout Can Improve Application Performance

By Darryl Gove, Senior Performance Engineer, June 22, 2005  
Large applications have a particular problem: they have a lot of instructions, and the processor does not have the capacity to hold the entire application on-chip at any one time. As a consequence, larger applications spend some of their run time stalled with the processor waiting to fetch new instructions from memory. This paper discusses several techniques that help the processor to hold more useful instructions on-chip, consequently reducing the time wasted fetching data from memory.

Not all instructions are equal

An application contains many instructions: code has to be written to cover all eventualities, even those that rarely (or perhaps never) happen. As a consequence, most applications end up with a small set of instructions that do the bulk of the work, and many other instructions that have to be present but are rarely used. Drawing 1 shows a way of visualising this. The green rectangle represents the whole application; within it there are a number of routines. Within each routine there are instructions that are frequently executed, coloured red, and instructions that are rarely executed, coloured blue.


The rarely executed instructions take up space in memory, in the caches, and often in other on-chip structures. For example, a single cacheline may contain a mix of hot and cold instructions; the cold instructions just take up space, so the application has to use more cachelines to hold its hot code. It is also possible that, due to the layout of the code in memory, some of the useful code may try to occupy the same place in the cache as other useful code -- this is known as thrashing in the cache -- with the result that only a limited set of the critical instructions is available at any one time.

The symptoms of problems with code layout are that the application has a high number of Instruction Cache miss events, Instruction TLB miss events, or branch misprediction events. All of these can be identified using the performance counters on the UltraSPARC-III derived processors (see the article Using UltraSPARC-IIICu Performance Counters to Improve Application Performance).

The next step is to look at some techniques for improving the layout of code in memory, but before doing that it is important to realise that this doesn't just happen at the level of individual instructions. Whole routines are often either heavily used or rarely used. Similarly, a library might be full of frequently used routines, or might be required only because of a single library call that almost never happens.

Since the compiler has the ability to change the way code is laid out in memory, it can use memory more efficiently, but it needs more information to do so. The remainder of this article covers three different approaches to improving the layout of an application in memory.

Reordering routines using mapfiles

One approach to improving the situation is to use mapfiles. Mapfiles are a facility that tells the linker how to lay out routines in memory. To use them to improve the layout of the code, it is necessary to order the routines from the most frequently used to the least frequently used. Drawing 2 shows the original program from drawing 1 laid out from hot routines to cold using a mapfile.


It is possible to manually generate mapfiles, but an easier approach is to use the Performance Analyzer:

  • Build the program using the flag -xF

  • Run the program with a representative workload under collect

  • Generate the mapfile using er_print -mapfile <app> <mapfilename>

  • Rebuild the application with the flags -xF -M <mapfile>

Once a mapfile has been generated for an application, the same mapfile can be used on subsequent compiles until the profile of the application changes, routines are renamed, or additional routines are added.

  $ cc -O -xF -o app *.c
  $ collect app < test_data
  Creating experiment ...
  $ er_print -mapfile app <mapfilename>
  $ cc -O -xF -M <mapfilename> -o app *.c

Table 1 - Creating a mapfile using the Performance Analyzer tools
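For reference, the mapfile produced by this process is a plain text file of linker directives that assign the per-function sections created by -xF to the text segment in the desired order. A fragment might look roughly like the following (the routine names here are hypothetical; consult the generated mapfile for the exact form on your system):

```
text = LOAD ?RXO;
text: .text%hot_routine_one;
text: .text%hot_routine_two;
text: .text%rarely_called_routine;
```

Each .text%name section corresponds to one routine, which is why the -xF flag (which places every function in its own section) is required for the mapfile to have any effect.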

Improving the layout of instructions using profile feedback

Mapfiles work very well at the routine level, separating frequently executed routines from infrequently executed ones. However, much of the cost occurs at the instruction level, where the processor has to jump over blocks of unexecuted code within a routine. Profile feedback is a compiler technique for improving this situation.

The idea of profile feedback is to give the compiler information about how the code is typically run. Based on this information, it can perform optimisations of the following types:

  • Arrange code so that the frequently executed code in a routine is grouped together.

  • Inline routines that are frequently called, to both remove the cost of calling the routine, and potentially to enable further optimisation of the inlined code.

Profile feedback works best with crossfile optimisation (controlled by the flag -xipo), since this allows the compiler to look for potential optimisations across all the source files.

Drawing 3 shows how profile feedback can rearrange the code within a routine to put the frequently executed code together.



Profile feedback is relatively straightforward to use:

  • Build the application with -xprofile=collect -xipo

  • Run the application with one or more representative workloads

  • Rebuild the application with -xprofile=use -xipo

Notice the inclusion of the -xipo flag to enable the compiler to do optimisations across the source files.

  $ cc -O -xprofile=collect:app.profile -xipo -o app *.c
  $ app < test_data
  $ cc -O -xprofile=use:app.profile -xipo -o app *.c

Table 2 - Using profile feedback to optimise an application

Link-time optimisation

Mapfiles work at the routine level, and profile feedback works within routines; it would seem to be a simple progression to do both optimisations at the same time. This is possible with link-time optimisation (also called post-optimisation).

The principle of link-time optimisation is that the compiler has already done its work; the code exists, and all that remains is to lay it out appropriately. In doing so, the link-time optimiser sorts the routines so that hot routines are placed together (in a similar way to mapfiles), and also lays out the code within those routines so that hot instructions are placed together. However, at link time it is possible to go beyond this:

  • Since the hot code has been identified, it is possible to place all the hot code together, and then place all the cold code together. The idea is to remove all cold code from the hot region by placing hot code from different routines into the same region of memory.

  • It is also possible to do further optimisations, since the addresses of variables and routines can now be calculated exactly. Hence the link-time optimiser can simplify expressions that calculate the address of a variable or routine -- this further reduces the instruction count.

Drawing 4 shows what an application will look like after it has been link-time optimised. The hot code will have been grouped together in one part of the binary, and the cold code in a separate part.


The link-time optimisation step requires profile feedback data to work, so the necessary steps are as follows:

  • Build the application with the flags -xprofile=collect -xipo

  • Run the application with one or more representative workloads

  • Rebuild the application with -xprofile=use -xipo -xlinkopt

  $ cc -O -xprofile=collect:app.profile -xipo -o app *.c
  $ app < test_data
  $ cc -O -xprofile=use:app.profile -xipo -o app *.c -xlinkopt

Table 3 - Combining link-time optimisation with profile feedback

Concluding remarks

Using these techniques on larger applications can yield significant performance gains. There is, however, a cost in terms of increased build time and build complexity, so the techniques should be evaluated to determine whether the gain is worth the additional effort in the build. Note also that not every build of the application needs to go through the process of optimising the code layout: development builds can be performed without it, and the process applied only to the final product build.

About the Author
Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.

(Last updated June 22, 2005)