By Sheldon Lobo, Compiler Technology And Performance Engineering, Sun Microsystems, November 30, 2005
Improving binary performance is a frequent request from customers. These requests usually come from end customers of Sun systems or even performance, benchmarking and production groups of large independent software vendors (ISVs). The common theme is the non-availability of the original source code. Without a re-compile, it is usually a hard, time-consuming and costly endeavor to meaningfully improve binary performance. Sometimes system tweaks to a non-optimized system will do the trick, but often a complete system upgrade is necessary.
The Binary Optimizer is a tool that improves binary performance without the need for system changes or upgrades. It rewrites the instructions in the binary to produce better optimized code. It can also instrument the binary for profile collection. When data from such a profile training run is fed back to the Binary Optimizer, significant performance improvements may be achieved. This is especially true for binaries that were not built with high levels of optimization, were built without profile data, or were built with profile data that is not representative of the end customer's unique workload.
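Using the Binary Optimizer involves four steps:
- Compile the application with optimization (-O or -xOn) and a special compiler option, -xbinopt=prepare.
- Instrument the binary with binopt, using the -binstrument flag.
- Run the instrumented binary on a representative workload.
- Optimize the binary with binopt, using the -buse flag.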
% cc -O -xbinopt=prepare -o myapp *.c
% binopt -binstrument myapp
% myapp < input_data
% binopt -buse myapp
The global optimizations performed by the Binary Optimizer usually show greater performance improvements on large applications. We see the following potential users of binary optimization technology:
End Users on SPARC Platforms:
Experienced users of Sun systems (for example, database administrators) are often looking for ways to improve binary performance. For such users, who are ready to go the extra mile to tune binaries they receive from software vendors, the Binary Optimizer is an ideal tool. It is also available for free as part of the Sun Studio toolkit.
For the software vendor, the necessary step to follow is:
- The vendor ships a binary, app, built with the -xbinopt=prepare flag.
% cc -O -xbinopt=prepare -o app
For the end user, the following steps will optimize the binary for their specific workload:
- Instrument the binary, using binopt.
- Run the instrumented binary on a representative workload.
- Use binopt again to optimize the binary with the collected runtime data.
For example, the end user optimizes app using binopt:
% binopt -binstrument -bdata=datafile -o app.instr app
% app.instr < input_data
% binopt -buse -bdata=datafile -o app.opt app
Software Vendors:
The Binary Optimizer performs optimizations that are not normally performed by the compiler. Hence by including the Binary Optimizer in the build process, a better performing production binary may be obtained.
The steps necessary to create a production binary with the Binary Optimizer are:
- Compile the application with the -xbinopt=prepare flag.
- Instrument the resulting binary for profile collection using the -binstrument flag of binopt.
- Run the application with one or more representative workloads.
- Optimize the binary using the collected profile data and the binopt -buse flag.
Example:
% cc -xO4 -xbinopt=prepare -o app *.c
% binopt -binstrument -bdata=datafile -o app.instr app
% app.instr < input_data
% binopt -buse -bdata=datafile -o app.opt app
It is important to note that if you are already using profile feedback (-xprofile=collect|use compiler flags) to build the application, it may be easier to use the -xlinkopt compiler flag in the build, rather than using the Binary Optimizer, to obtain similar optimizations.
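The instrument, train and optimize steps listed above can be sketched as a small dry-run helper that only prints the commands it would run. The function name training_plan and its argument order are hypothetical; the binopt flags are the ones used in the example above. The sketch assumes that every run of the instrumented binary adds its counts to the same data file, which is how we read the profile accumulation feature.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the binopt command sequence
# for one prepared binary and one or more training inputs.
# training_plan is a hypothetical helper, not part of Sun Studio.
training_plan() {
    app=$1       # binary built with -xbinopt=prepare
    data=$2      # profile data file shared by all training runs (assumption)
    shift 2      # remaining arguments are training input files

    echo "binopt -binstrument -bdata=${data} -o ${app}.instr ${app}"
    for input in "$@"; do
        echo "${app}.instr < ${input}"
    done
    echo "binopt -buse -bdata=${data} -o ${app}.opt ${app}"
}

training_plan app datafile input1 input2
```

Printing the commands first makes it easy to review the sequence before running it on a real build machine.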
binopt
We see significant performance improvements on large applications when the Binary Optimizer is used. This is especially true for applications that are not built with profile feedback or are built with feedback that does not truly represent the end customer's workload. In these situations, a 10% or more performance gain is not unheard of.
The user must also be aware that using the Binary Optimizer increases the size of the binary, because optimized code is cached in a new segment in the binary. On large applications, the binary can grow by up to 1.8x.
The Binary Optimizer runtime is usually a fraction of the build time of the entire binary. For large applications, where the build time is usually several hours, binopt runtime can be measured in minutes. For example, building a well-known database application from source takes over 5 hours. Performing binary optimizations on the resulting binary takes 8 minutes.
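To put the size and runtime figures above in perspective, here is a minimal arithmetic sketch. The helper names and the 100 MB input are hypothetical; the 1.8x factor, the 8 minutes and the 5 hours (300 minutes, a lower bound on the build time) come from the text.

```shell
#!/bin/sh
# Upper bound on the optimized binary size, using the "up to 1.8x" figure.
size_after_binopt() {
    echo $(( $1 * 18 / 10 ))    # integer arithmetic: bytes * 1.8
}

# binopt runtime as a percentage of build time (both in minutes).
binopt_time_pct() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * b / t }'
}

size_after_binopt 100000000    # prints 180000000
binopt_time_pct 8 300          # prints 2.7
```

Since the build takes over 5 hours, the 2.7% figure is an upper bound for this example.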
The -blevel=1 optimization level is the default level of optimization for binopt(1). At this level, code ordering and control flow optimizations are performed. While ordering code, functions may be split to optimize I-cache performance.
At the -blevel=2 optimization level, data-flow information is constructed and more aggressive optimizations are performed. These include inlining, address simplification and load instruction optimizations. Usually better performance is derived from using this higher level of optimization. The tradeoff is an increased binopt runtime.
At -blevel=0, no optimizations are performed.
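The three levels can be summarized in a small lookup, sketched below. The helper names are hypothetical, and the printed command line assumes that -blevel can be combined with -buse, -bdata and -o on one binopt invocation; check binopt(1) for the exact synopsis.

```shell
#!/bin/sh
# Hypothetical helper: describe what each -blevel value does, per the text above.
blevel_summary() {
    case $1 in
        0) echo "no optimizations" ;;
        1) echo "code ordering and control flow optimizations (default)" ;;
        2) echo "data-flow based optimizations: inlining, address simplification, load instruction optimizations" ;;
        *) echo "unknown level: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the optimize command for a given level (default 1).
# Combining -blevel with the other flags is an assumption.
optimize_cmd() {
    level=${1:-1}
    blevel_summary "$level" >/dev/null || return 1
    echo "binopt -buse -blevel=${level} -bdata=datafile -o app.opt app"
}

blevel_summary 2
optimize_cmd 2
```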
Collecting and using a profile of the execution characteristics of a binary is crucial to making effective use of the Binary Optimizer. Instrumenting a binary and executing a training run to collect the data is relatively easy when using this tool. A single command line instruments the binary. The instrumented binary may be freely copied to a potentially different run machine – it is self-contained, and no dependencies need to be maintained. After the training run is complete, a file containing the profile data is created. Accumulation of profile data from multiple training runs is another useful feature – the user just needs to specify the pre-existing data file on the -binstrument command line.
When collecting profile data for applications that contain one or more executables and/or shared objects, all binaries for which optimizations are planned need to be instrumented. In the example below, the executable app has a dependency on the shared object x.so. As demonstrated, both binaries need to be instrumented and optimized separately.
% binopt -binstrument -bdata=app.data -o app.instr app
% binopt -binstrument -bdata=x.so.data -o x.so x.so
% app.instr < input_data
% binopt -buse -bdata=app.data -o app.opt app
% binopt -buse -bdata=x.so.data -o x.so x.so
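When an application has many binaries, the per-binary commands above can be generated in a loop. The following dry-run sketch prints the commands rather than executing them. The function name plan_all is hypothetical, and the naming rules (one <name>.data file per binary, shared objects written back under their own name) simply mirror the example above.

```shell
#!/bin/sh
# Dry-run sketch: print one binopt command per binary for a given phase.
# plan_all is a hypothetical helper; it does not run binopt.
plan_all() {
    phase=$1     # "instrument" or "optimize"
    shift
    for bin in "$@"; do
        if [ "$phase" = instrument ]; then
            flag=-binstrument; suffix=.instr
        else
            flag=-buse; suffix=.opt
        fi
        case $bin in
            *.so) out=$bin ;;              # shared object keeps its name, as in the example
            *)    out=${bin}${suffix} ;;   # executable gets a new name
        esac
        echo "binopt ${flag} -bdata=${bin}.data -o ${out} ${bin}"
    done
}

plan_all instrument app x.so
echo "app.instr < input_data"
plan_all optimize app x.so
```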
The Binary Optimizer maintains full compatibility with tools that statically or dynamically examine a binary (analyzer(1), dbx(1), pstack(1), etc.). The symbol tables are updated to reflect all transformations. Mangled symbol names are often assigned (see the example below), which are automatically de-mangled when displayed by the Studio tools.
If the prepared binary was built for debugging (with the -g compiler option), debugging information is automatically propagated to the binary, instead of leaving it in the object file by default. When such a binary is optimized by binopt, the available debugging information is updated to reflect the transformations performed.
Here is a small example to help understand how the Binary Optimizer transforms the binary.
In the code below, there are three functions: main(), add() and sub(). The frequently executed parts of the code are denoted by the red rectangles, while the less frequently executed code is colored green. The layout of the optimized binary is shown on the right hand side. Here are some of the characteristics of the new binary:
- main() is split; the hot fragment that is not the entry point is given the mangled name _$o1cexhO0.main().
- Other mangled names that appear in the example are _$r1.main(), _$b1.add() and _$b1.sub().
Figure 1: Typical code layout from the binary optimizer
-xbinopt=prepare Considerations:
The -xbinopt=prepare compiler flag, when used to build a binary, adds certain information to the binary that allows it to be transformed by the Binary Optimizer. This information describes the location of the executable code, points out control flow structures like function boundaries and switch tables, and provides data flow information about the code. This data is stored in a new ELF section named .annotate. This additional information in the binary results in a 5% increase in size, on average. There is no noticeable build time impact when this flag is used.
In addition, prepared binaries built for debugging (with the -g compiler option) have an additional size increase due to the presence of debugging information. On average we see a 50% increase in binary file size when compared to a debuggable binary built without the -xbinopt=prepare option.
Profile Instrumentation (-binstrument) Considerations:
While doing a training run to collect binary profile information, the user will notice a slowdown in application performance. This is to be expected since there is an overhead associated with recording the execution count profile of the executable code. Usually we see a 2.5 to 3x slowdown in application performance.
There is also an increase in binary file size associated with adding instrumentation code. We usually see a 2.5x increase in binary size due to profile instrumentation.
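The instrumentation overheads above translate into a quick planning estimate. In the sketch below, the helper name and the sample inputs (a 120 second run, a 40 MB binary) are hypothetical; the 2.5 to 3x slowdown and the 2.5x size factor are the typical figures quoted above, so the results are rough guides rather than guarantees.

```shell
#!/bin/sh
# Estimate training-run time and instrumented binary size from the
# typical overhead factors quoted in the text (2.5x to 3x, and 2.5x).
instr_estimate() {
    run_secs=$1      # normal run time of the workload, in seconds
    size_bytes=$2    # size of the prepared binary, in bytes
    awk -v r="$run_secs" -v s="$size_bytes" 'BEGIN {
        printf "training run: %.0f to %.0f s\n", r * 2.5, r * 3
        printf "instrumented size: about %.0f bytes\n", s * 2.5
    }'
}

instr_estimate 120 40000000
```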
-bfinal Usage:
As mentioned above, a binary that may be optimized by the Binary Optimizer must be prepared using the -xbinopt=prepare compiler flag. This results in additional information being placed in an ELF section in the binary. When creating a final binary that is to be deployed on the run systems, and on which no future optimizations are planned, the -bfinal option may be used to strip the -xbinopt=prepare information from the resultant binary. This flag may be used to prevent users of the binary from making any further modifications to it. For example:
% cc -xO4 -xbinopt=prepare -o app *.c
% binopt -binstrument -bdata=datafile -o app.instr app
% app.instr < input_data
% binopt -buse -bdata=datafile -bfinal -o app.opt app
Handling Modules Not Built With -xbinopt=prepare
If the binary contains a combination of legacy code and newly created code (built with -xbinopt=prepare), the Binary Optimizer may still be gainfully employed. The Binary Optimizer optimizes only that code that was built with the -xbinopt=prepare compiler option, leaving the legacy code untouched.
Conflicts
The Binary Optimizer has some restrictions.
It will not optimize binaries built as follows:
- With the -xprofile=collect compiler option.
- With the -xlinkopt compiler option.
- With the -s compiler option or stripped using the strip(1) tool.
- Binopt will not optimize that part of the code compiled with the -xF compiler option.
- Binopt will not optimize the template code portion of a C++ application.
The Binary Optimizer also does not optimize those parts of the executable code in a binary that were derived from assembly language files. As mentioned earlier, code derived from object files compiled without the -xbinopt=prepare flag is not optimized either. On the other hand, the presence of assembly code or legacy object code in a binary does not prevent binopt(1) from optimizing the remainder of the binary.
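A build script can catch most of these conflicts before binopt is ever run. The sketch below checks a list of compiler flags against the restrictions listed above. The function name check_binopt_flags is hypothetical, and the check is purely textual: it cannot detect a binary that was stripped later with strip(1), or code that came from assembly or C++ templates.

```shell
#!/bin/sh
# Hypothetical pre-check: scan compiler flags for options that conflict
# with the Binary Optimizer. Returns 0 if the flags look suitable.
check_binopt_flags() {
    rc=0
    prepared=no
    for flag in "$@"; do
        case $flag in
            -xbinopt=prepare)
                prepared=yes ;;
            -xprofile=collect*|-xlinkopt*|-s)
                echo "conflict: $flag"; rc=1 ;;
            -xF*)
                echo "note: code compiled with $flag is not optimized" ;;
        esac
    done
    if [ "$prepared" = no ]; then
        echo "missing: -xbinopt=prepare"; rc=1
    fi
    return $rc
}

check_binopt_flags -xO4 -xbinopt=prepare -xlinkopt || echo "flags not suitable for binopt"
```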
The Sun Studio 11 release includes the Binary Optimizer, binopt. Updates in future releases may include new functionality and more optimizations. Also, several of the restrictions and conflicts will be addressed. Stay tuned!
Sheldon Lobo is a staff engineer in the SPARC compiler backend team. He works primarily on developing Sun's object and binary file optimization and analysis tools.
