Enabling User-Controlled Collection of Application Crash Data With DTrace

By Greg Nakhimovsky and Morgan Herrington, May 2005  

Abstract: This article introduces AppCrash, a tool for the automatic collection of diagnostic and debugging information when an application crashes under the Solaris Operating System.



AppCrash is a tool for the automatic collection of diagnostic and debugging information when any application crashes under a Solaris system. The tool does not require any changes to the applications or to the operating system and is based on DTrace (Dynamic Tracing), a new facility introduced in the Solaris 10 OS.

AppCrash can help significantly reduce the cost of software defects by shortening the time needed to gather the data required by an application's customer support, quality assurance (QA), and development staff, as well as by Sun technical support engineers. The tool is especially useful for tracking the specific details of sporadic, hard-to-reproduce application failures. AppCrash enables creation of postprocessing tools that can sort the failure reports and statistically analyze where the application software crashes most frequently. It also enables tools that can provide advice on how each type of problem can be debugged efficiently.

An important feature making the AppCrash method different from others is that the users (including the application developers and/or system administrators at user sites) can control precisely what data is collected and how it is processed. This allows collection of the necessary data without violating any of the users' security or privacy requirements.

Note that by application we mean any software other than the Solaris kernel. This includes middleware such as Mozilla, StarOffice software, various components of GNOME, OpenGL graphics, application servers, and others.

The article is intended for software engineers working at independent software vendors (ISVs), as well as for system administrators and end users working with Solaris applications.

The Problem

As an example, consider a large facility with many users running several multi-process and/or multi-tier applications. What steps are taken if one of the application processes crashes?

In the best of situations, the end user will recognize the specific inputs and conditions needed to reproduce the problem, and the application development team will be able to quickly reproduce and fix the problem.


Automatic and semiautomatic solutions exist for some systems (see the section Solutions on Other Systems below), but they may create security and privacy problems because the user has little or no control of the information gathered and forwarded to the application and system vendors. For example, see reference [1].

Previous Solutions

Core Files: Historically, failure analysis on UNIX systems has been based on collecting and inspecting the core file created when an application crashes. (Note: Application "core dump", "coredump", "core file", and "corefile" all mean the same thing.) However, the ways applications are created and used have changed dramatically since UNIX was created in the early 1970s. Typical application size and memory requirements are orders of magnitude larger than they were then. The users are not the same people as the application programmers; many end users do not know what a debugger is or how to use one. Due to the ubiquity of the Internet, security and privacy are much more serious concerns now than they were then.

Currently, the Core File method suffers from several types of problems:

  • Security and privacy concerns, which are more important now than ever. Core files often contain data that must not be sent outside of the user site (for details, see references [1], [2], [3]), such as:
    • Classified data
    • Proprietary data
    • Private (such as financial) data
  • Size problems. Core files can be huge (multi-gigabyte range), leading to:
    • Long time to dump, negatively affecting system performance (core files are dumped at kernel priority, making everything else stop)
    • Possibly overfilling disk with bad to disastrous (if on root partition) consequences
    • Size too large to send to vendor (cost, time, inconvenience)
  • Technical difficulties
    • An alternative to sending the core file to the vendor is for the end user to perform the analysis in place. However, this places an extra burden (time and effort) on the end user, and slows the debugging process down. Also, some users simply can't or don't want to perform such an analysis.
    • Local analysis may require a debugger and the ability to use it (cost, time, inconvenience).
    • An application-level debugger (such as dbx) used with a core file has severely limited functionality compared to debugging a running program.
    • Since application binaries at user sites are most often optimized, non-debuggable, and stripped, dbx functionality with those core files is even more limited. As a result, in most cases the ISVs cannot get much more from an application core dump than a traceback, a memory map, and so on, all of which can be obtained without a core dump.
    • Most application developers can't and don't want to do hardware-specific assembly-level debugging of such core files. Therefore, even when a core file contains additional debugging information, the ISVs can't use it in practice in most cases.
    • Many special problems are associated with core files produced on one (end-user) machine and then copied to another (ISV) machine for debugging. For additional information, issue the help core mismatch command in dbx; also see reference [4]. Due to these problems, the application core dumps often can't be used on a different computer at all, even if the users deliver them to the ISV. These problems are gradually being alleviated, but most of them are still present.
  • No core file available (largely as a result of all of the above)
    • Many applications inhibit core dumps altogether (by installing a signal handler or by calling setrlimit(RLIMIT_CORE, &zero)), so no core file is available for analysis. Also, some users and user sites either disable generation of core files completely (with user limit set to zero or with the coreadm(1) command) or have their user limit set so restrictively that an entire core can't be saved.

Interpose Libraries: Another technical approach to dealing with application crashes is to implement an interpose library that would install its own signal handlers for SIGBUS and SIGSEGV (which would then capture state information for an error report). For example, see reference [5]. However, this is difficult to implement in production because:

  • The library must not interfere with the normal signal handling operations of the application.
  • Each application invocation must have its environment changed in order to interpose this library. This could either be done by having every user change their environment or by adding a wrapper script around the invocation of every application to be watched. Neither solution is convenient for a large number of users or a large number of applications.
  • Creating library interposers requires a compiler and a certain level of system programming skills. Many users need an easier solution.
  • Generally, library interposition is a debugging and performance-tuning technique not recommended for production use. We need a method that can be used all the time in production.

Application Signal Handlers: Of course, it is also possible to install signal handlers for SIGBUS and SIGSEGV in each application and do whatever is deemed necessary there. However, this approach requires the ISVs to change every application. Even if the ISVs do it, their handling of the crashes won't be uniform. Also note that programming signal handlers can be tricky and platform-dependent. For example, a signal handler should never call any routine that is not Async-Signal-Safe. (For the definition of Async-Signal-Safe, see the attributes(5) man page for the Solaris OS.)

Some ISVs have handled these signals for years and done in their signal handlers whatever works for them. However, this approach does not work for most applications. We need a way to handle application crashes system-wide, without changing any application in any way.

truss(1): One more interesting solution is to use truss to monitor an application and watch for a signal. Aside from its more common role of tracing system and library calls, truss has the ability to silently watch a process and wait for a specific set of signals.

For example, the following sequence invokes an application in the background, with truss then watching for SIGSEGV and SIGBUS. If either signal happens, truss leaves the application in the STOPPED state. If the kill command detects that the process still exists, then the process memory map and traceback are captured using the /proc utilities pmap(1) and pstack(1). Finally, the process is allowed to exit by being restarted with the prun(1) command.

application_invocation &
pid=$!
truss -t \!all -m \!all -s \!all -S segv,bus -p $pid
if kill -0 $pid ; then
    pmap   $pid
    pstack $pid
    prun   $pid
fi

The main problem with the truss(1) solution is that it is limited to handling a single process per invocation (whereas many applications create hierarchies of processes). Also, like the library interposition technique, this requires each application to have an invocation script which must be edited to add this extra processing.

AppCrash: A DTrace-Based Solution

The solution we are proposing in this article uses the new Solaris facility, DTrace, to watch for application crashes and to process each with a user-supplied reporting script.

DTrace is a powerful facility introduced in the Solaris 10 OS for kernel and application performance tuning and debugging (see reference [6] and the dtrace(1M) man page for the Solaris OS). It works system-wide at the Solaris kernel level by dynamically instrumenting both the kernel and the application code. The dynamically inserted probes impose little or no overhead when disabled and can be used safely on production systems.

Our implementation of this solution consists of the following DTrace script. (Note: Please save file without .txt suffix.)


This is combined with a user-defined shell script such as the following template. (Note: Please save file without .txt suffix.)


While individual users can use AppCrash to monitor their own applications, it can also be used by a system administrator to watch applications being run by all users. To accomplish this, the administrator could start it at boot time as a daemon using the historical RC script mechanism (for example, by creating /etc/rc2.d/S97app_crash) or using the new Solaris Service Management Facility (SMF); see reference [7]. Note: To produce a well-behaved Solaris daemon, starting app_crash.d with nohup(1) would be a good idea. Also, a Perl script could be used for creating a daemon as described in reference [8].

Once the app_crash.d daemon is running, DTrace will react to any application generating a SIGSEGV or SIGBUS signal. If the process has environment variable $ON_APP_CRASH_INVOKE defined as a path to a user-controlled shell script (such as runme_on_app_crash), then the DTrace script will do the following:

  • Stop the process when the SIGSEGV or SIGBUS signal is generated.
  • Run the user-defined shell script (such as runme_on_app_crash) to collect all the necessary debugging data.
  • Resume normal processing. In particular, if the settings are to produce a core dump, then that is what will happen.

The described design using the environment variable $ON_APP_CRASH_INVOKE to specify a user-controlled shell script gives the users (either end users or the ISVs) complete control over the actions to be taken. The usage could be tailored to be application-specific (by setting it in the application wrapper script) or user-specific (by setting it in the user's personal startup script). Of course, it also lets the users or their system administrators do anything they want in the invoked script to fully control the debugging information collected by that script. The runme_on_app_crash provided above is only a template.

For example, a particular ISV may wish to install DTrace script app_crash.d to be run as a daemon, and then set the $ON_APP_CRASH_INVOKE environment variable in the startup script of its application. This way, AppCrash will only be used for that application and nothing else.
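A minimal sketch of such an application startup wrapper, assuming a Bourne-compatible shell. The helper name run_with_appcrash and the paths in the usage note are hypothetical; only the variable name ON_APP_CRASH_INVOKE comes from this article.

```sh
#!/bin/sh
# Sketch of an application startup wrapper (hypothetical helper name).
# It sets ON_APP_CRASH_INVOKE only for the command it launches, so that
# AppCrash reacts to this application and to nothing else the user runs.

run_with_appcrash() {
    crash_script=$1    # path to the user-controlled reporting script
    shift              # the remaining arguments are the application command line

    # The subshell keeps the exported variable out of the caller's environment.
    (
        ON_APP_CRASH_INVOKE=$crash_script
        export ON_APP_CRASH_INVOKE
        exec "$@"
    )
}
```

A wrapper could then start the application with something like run_with_appcrash /opt/myapp/bin/runme_on_app_crash /opt/myapp/bin/myapp "$@", where both paths are placeholders.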

Once the users have collected all the necessary information, they can review it, make sure that it contains no sensitive information, and then send it to their application vendor. If desired, they can even automate this emailing as a part of the script (see commented lines at the end of shell script runme_on_app_crash).
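As one illustration of such a review step, the following sketch masks the values of environment variables in a report before it leaves the site. It assumes report lines shaped like "envp[0]: HOME=/home/gregns", as in the sample report in this article; the function name redact_env is hypothetical, and a real site policy would likely need more rules than this.

```sh
#!/bin/sh
# Sketch: mask environment values in an AppCrash report before sending it.
# Assumes lines shaped like "envp[0]: HOME=/home/gregns", the form shown in
# the sample report in this article. redact_env is a hypothetical name.

redact_env() {
    # Keep "envp[N]: NAME=" and replace everything after the first "=".
    sed 's/^\(envp\[[0-9][0-9]*]: [^=]*=\).*$/\1[redacted]/'
}
```

For example, redact_env < /var/tmp/appcrash.test1.5174 > /var/tmp/appcrash.test1.5174.clean would write a masked copy and leave the original report untouched.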

Note that by itself the collected information may not be adequate for the ISV or Sun engineers to resolve the problem. No one can guarantee that the problem will be solved based on this information only. It is always best for the users to come up with a reproducible test case and provide that to their application software vendor. Nevertheless, the information collected with a script like runme_on_app_crash will definitely help the debugging process, and in many cases may be enough to resolve the problem.

If the users see question marks instead of the function names in the traceback, they can send the information to the ISV anyway. The application owners should be able to restore the actual function names, for example using a tool called unstrip_traceback described in reference [5], Generating and Handling Application Traceback on Crash. Here is an updated version of that Perl script. (Note: Please save file without .txt suffix.)


More possibilities are enabled by AppCrash:

  • ISVs may be able to collect the automatically generated crash reports and develop an automatic or semiautomatic system to statistically analyze where their applications crash most frequently. This could seriously help the QA efforts, as well as enable creation of metrics that could be used to provide efficient incentives to the development and QA engineers (for example, the fewer the crashes in your software, the larger the bonus you get).
  • A rule-based system can also be created to analyze the crash reports and provide advice on how each problem could be debugged most efficiently. For example, if the crash is in malloc(3C) or free(3C), chances are that memory has been corrupted and the best way to debug it is with tools such as watchmalloc(3MALLOC) or libumem(3LIB) built into the Solaris OS. Such a rule-based system would reduce the time and the cost of analyzing each crash, thus leading to faster problem report turnaround.

Implementing either or both of the above suggestions would help further reduce the costs of software defects.
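Both suggestions can be sketched in a few lines of shell. The sketch below assumes that each report contains a "> /bin/pstack PID" section whose frames look like " 08050652 sub2     (0) + 12", as in the sample report in this article. The function names are hypothetical, and the single advice rule is the malloc/free case mentioned above; it is an illustration, not a complete analysis system.

```sh
#!/bin/sh
# Sketch of a postprocessing helper for AppCrash reports (hypothetical names).
# Assumes each report holds a "> /bin/pstack PID" section with frame lines
# such as " 08050652 sub2     (0) + 12".

# Print the function name of the innermost (first) stack frame in one report.
top_frame() {
    awk '
        /^> \/bin\/pstack /  { in_stack = 1; next }   # start of the pstack section
        /^> /                { in_stack = 0 }         # any other section ends it
        in_stack && $1 ~ /^[0-9a-fA-F]+$/ && NF >= 2 { print $2; exit }
    ' "$1"
}

# Count how often each function is the top frame, most frequent first.
tally_crashes() {
    for report in "$@"; do
        top_frame "$report"
    done | sort | uniq -c | sort -rn
}

# One rule from the text: a crash inside malloc or free suggests heap corruption.
advise() {
    case "$1" in
        malloc|free) echo "possible heap corruption: try watchmalloc(3MALLOC) or libumem(3LIB)" ;;
        *)           echo "no rule for $1: start from the traceback" ;;
    esac
}
```

Running tally_crashes /var/tmp/appcrash.* would list the most frequent crash locations first, and advise "$(top_frame report_file)" would print the matching hint for a single report.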

Note that any automatic system to analyze multiple crash reports will require an agreed-upon standard defining what each report should contain and in what format. This standard can vary for different applications and user sites. The users and ISVs will still have full control over the gathered information. The involved parties will just need to coordinate it.

Non-root Running of DTrace and Related Security Issues

Running the app_crash.d script as a system-wide daemon by root is one way of using this method, but strictly speaking it is not the only way. The end users themselves can run this script provided they have been granted DTrace-related permissions, as described in this section.

DTrace scripts like app_crash.d can be run either by root or by the users who have permissions like the following in the /etc/user_attr file:
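As a hedged sketch, an entry granting the two DTrace privileges named in the ppriv(1) example in this section could be generated as follows. It assumes the usual user_attr(4) layout of five colon-separated fields with the privileges listed in a defaultpriv attribute; the helper name and the user name jdoe are hypothetical, and the exact entry should be checked against user_attr(4) before it is used.

```sh
#!/bin/sh
# Sketch: print a user_attr(4)-style line that adds DTrace privileges for a user.
# Assumption: five colon-separated fields, with privileges in "defaultpriv".
# user_attr_entry is a hypothetical helper; it only prints, it edits nothing.

user_attr_entry() {
    printf '%s::::defaultpriv=basic,dtrace_proc,dtrace_kernel\n' "$1"
}
```

Calling user_attr_entry jdoe prints the line for a system administrator to review and add by hand.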


File /etc/user_attr is owned by root, so only a system administrator with root access can modify it. Once /etc/user_attr has been modified, the user will need to log off and log on again to activate the new setting. This is a part of the Least Privilege facility providing fine-grained control over the actions of processes. For more information, see privileges(5). The system administrator can also provide such privileges temporarily using the ppriv(1) command like this:

# ppriv -s A+dtrace_proc,dtrace_kernel PID

where PID is the process ID of the user's shell.

A word of caution is in order. The DTrace privileges described above will allow the use of all facilities of DTrace (including the kernel facilities). Please use these privileges responsibly and be aware that they could permit Denial of Service (DoS) attacks on your systems.

Using these privileges for running our DTrace script app_crash.d described in this article is very safe. However, DTrace scripts can be easily created and used for many other actions, some of which can be destructive. If you do not want to introduce any such risk by allowing ordinary users to have DTrace privileges, you can always run app_crash.d as a daemon owned by root.

Implementation Details

The DTrace script app_crash.d shown in the previous section is quite simple but its operation may not be obvious if you have not been previously exposed to DTrace scripting. Therefore, let us consider what app_crash.d does line-by-line.

#!/usr/sbin/dtrace -qws

This means this script can be run directly (assuming that it has appropriate execute permissions) and that it will run the /usr/sbin/dtrace binary. "-q" means quiet (without generating extra messages). "w" means we allow what are called destructive actions such as the system() action that is used to invoke a system command or a shell script. "s" means that what follows is a DTrace script.

#pragma D option strsize=500

proc:::signal-send
/(args[2] == SIGBUS || args[2] == SIGSEGV) &&
                        pid == args[1]->pr_pid/

The #pragma line raises the DTrace string size limit (strsize) to 500 bytes, which leaves room for the long command string built later in the script. In DTrace terminology, the remaining lines specify the use of the signal-send probe from the proc provider, whenever the predicate condition is true. It means the DTrace code following these lines will be executed when any process on the system generates (sends) signal SIGBUS or SIGSEGV, where the receiving process (whose process ID is stored in args[1]->pr_pid) is the same as the sending process (pid).


  stop();

This means the process that generated such a signal is stopped until further notice.

  system(
  "%s=%d; %s=%d; %s=%d; %s=%s; %s %s %s %s %s %s %s %s %s",
  "CRASH_PID",  pid,
  "CRASH_UID",  uid,
  "DTRACE_UID", $uid,
  "PROG",       execname,
  "SCRIPT=`/bin/pargs -e $CRASH_PID | ",
  "  /bin/grep ON_APP_CRASH_INVOKE | /bin/cut -d= -f2`;",
  "[ -z \"$SCRIPT\" -o ! -x \"$SCRIPT\" ] && exit 0;",
  "if [ $DTRACE_UID -eq 0 -a $CRASH_UID -ne 0 ] ; then",
  "  USER_NAME=`/bin/getent passwd $CRASH_UID|/bin/cut -d: -f1`;",
  "  /bin/su $USER_NAME -c \"$SCRIPT $CRASH_PID $PROG\";",
  "else ",
  "  $SCRIPT $CRASH_PID $PROG;",
  "fi");

This long line executes the specified sequence of Bourne shell commands. We could have instead introduced a helper shell script that would be easier to read, but that would complicate installation somewhat, so we chose to use a one-line command.

Note that the system() action in DTrace scripts allows argument processing like that of printf().

The above script performs the following steps:

  1. Runs the pargs(1) command to extract the value of the environment variable ON_APP_CRASH_INVOKE from the crashing process.
  2. Tests if ON_APP_CRASH_INVOKE is not defined ($SCRIPT is empty) or if the user script it is pointing to ($SCRIPT) is not marked executable, in which case the script exits.
  3. Checks if the owner of the app_crash.d script is root ($uid is equal to zero) and the crashing process doesn't belong to root. If root is running app_crash.d, then the script extracts the user name of the owner of the crashing process:
    /bin/getent passwd <uid> |/bin/cut -d: -f1 

    and runs the user-defined script as that user using the su(1) command:

    su <user_name> -c <script>

    Alternate ways of determining the user ID, for example by looking at $USER, could introduce a security problem (hackers could set their $USER environment variable to root, set ON_APP_CRASH_INVOKE to a script starting a terminal emulator, then crash any application and thus gain access to a root shell).

  4. If the owner of app_crash.d is not root, su is not used and the user-defined script is run directly. This way, any user can run app_crash.d, but it will work only for the processes owned by that user.
  5. As the last step, app_crash.d resumes the crashing process with the prun(1) command:
    system("/bin/prun %d", pid);
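The decision logic in steps 1 to 4 can be traced outside DTrace with a small Bourne shell sketch. To stay runnable on any system, it reads saved pargs -e style text from standard input instead of calling pargs(1) on a live process. The function names are hypothetical, and the sketch only reports which path would be taken; it does not call su(1) or run any user script.

```sh
#!/bin/sh
# Sketch of the decision logic in steps 1 to 4 (hypothetical function names).

# Step 1: pull the value of ON_APP_CRASH_INVOKE out of lines such as
# "envp[3]: ON_APP_CRASH_INVOKE=/home/gregns/tests/runme_on_app_crash".
extract_script() {
    grep ON_APP_CRASH_INVOKE | cut -d= -f2
}

# Step 2: succeed only if a script path was found and it is executable.
script_usable() {
    [ -n "$1" ] && [ -x "$1" ]
}

# Steps 3 and 4: report how the user script would be started.
# $1 is the UID running app_crash.d, $2 is the UID of the crashing process.
invocation_mode() {
    if [ "$1" -eq 0 ] && [ "$2" -ne 0 ]; then
        echo su        # root daemon, non-root process: run as the process owner
    else
        echo direct    # otherwise run the user script directly
    fi
}
```

One design detail worth noting: cut -d= -f2, as used in the command string of app_crash.d, keeps only the text up to the next "=" character, so a script path that itself contains "=" would be truncated.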

The example user-defined script runme_on_app_crash does the following:

  • Obtains the process ID ($PID) and program name from the input arguments.
  • Sends a message to system console (if the permissions allow it).
  • Runs the following Solaris commands for the crashing process (note that the pfiles(1) command prints the relevant pathnames starting with the Solaris 10 OS):
    /bin/pstack $PID
    /bin/pmap -x $PID 
    /bin/pldd $PID
    /bin/ptree $PID
    /bin/pargs -ace $PID
    /bin/plimit -m $PID
    /bin/pwdx $PID
    /bin/pfiles $PID
  • Extracts system configuration data (see the script for details).

The commands specific to the crashing process are based on the Solaris proc(4) facility. They collect potentially useful information about the crashing process: traceback, memory map, library dependencies, process tree, process arguments and environment strings, process limits, working directory, and information about all open files. This information is brief, in ASCII text form, and easily accessible to the user, yet it can be very useful in the debugging process. The output of all these commands is redirected to file /var/tmp/appcrash.$PROG.$PID. Note that files in /var/tmp will survive a reboot, while those in /tmp normally won't.
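The reporting pattern described above can be sketched as follows. This is not the authors' runme_on_app_crash template; the function names and the APPCRASH_DIR override are assumptions made for illustration, and only three of the listed commands are included to keep the sketch short.

```sh
#!/bin/sh
# Sketch of the reporting pattern, not the authors' template (hypothetical names).

# Build the report path described in the text: /var/tmp/appcrash.PROG.PID.
# APPCRASH_DIR is an assumed override, handy for testing.
report_path() {
    echo "${APPCRASH_DIR:-/var/tmp}/appcrash.$1.$2"
}

# Echo a command the way the sample report does ("> command"), run it with
# stderr captured as well, and end the section with a blank line.
run_logged() {
    echo "> $*"
    "$@" 2>&1 || echo "(command failed with status $?)"
    echo
}

# Collect data for one crashing process into its report file.
# $1 is the program name, $2 is the process ID.
collect() {
    prog=$1
    pid=$2
    {
        run_logged /bin/pstack "$pid"
        run_logged /bin/pmap -x "$pid"
        run_logged /bin/pwdx "$pid"
    } > "$(report_path "$prog" "$pid")"
}
```

Because every command goes through run_logged, adding or removing a command from the report is a one-line change, which keeps the collected data easy for a user site to audit.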

A possibility exists that some applications use the signals SIGSEGV and SIGBUS for some special purposes unrelated to crashing. We have not encountered such programs so far, but they may exist. For such programs, AppCrash may create a lot of files in /var/tmp and degrade system performance, given enough of those SIGSEGV/SIGBUS signals. If this happens, the AppCrash scripts may have to be adjusted to account for the unusual situation, for example to exclude certain applications by name. It could be done using the predicate in the DTrace probe or in the runme_on_app_crash script.
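A minimal sketch of the second option, excluding programs by name inside the user script. The program names below are hypothetical placeholders, not known offenders, and should_report is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: skip reporting for programs known to use SIGSEGV or SIGBUS on purpose.
# The program names below are hypothetical placeholders.

should_report() {
    case "$1" in
        special_app|other_app*) return 1 ;;   # excluded: do not write a report
        *)                      return 0 ;;   # everything else is reported
    esac
}
```

Near the top of a script like runme_on_app_crash, a line such as should_report "$PROG" || exit 0 would then skip the excluded programs. The same exclusion could presumably be expressed in the D predicate by comparing execname against the program name, which would avoid starting the shell at all.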

Also note that one of the advantages of AppCrash is that all of its components are scripts that are easy to customize.


Consider the following simple test program which contains a bug. It dereferences a null pointer in subroutine sub2():

% cat test1.c
#include <stdio.h>
#include <stdlib.h>
static void sub2(int *p)
{
  int i;
  i = *p;
}
static void sub(int *p)
{
  sub2(p);
}
int main()
{
  int *p=NULL;
  sub(p);
  return 0;
}
% cc -o test1 test1.c

Step 1: Let us start the app_crash.d daemon, assuming the current user has the necessary permissions to run DTrace as described above.

% ./app_crash.d &
[1] 5707

Step 2: Now define the necessary environment variable in a different terminal window:

% setenv ON_APP_CRASH_INVOKE $HOME/tests/runme_on_app_crash

Step 3: Execute the test1 program in the terminal window where ON_APP_CRASH_INVOKE has been defined:

% test1
Segmentation Fault

At the time of the crash, the information was collected in the /var/tmp directory as specified in the runme_on_app_crash shell script (note that some output lines below have been reformatted for readability):

% ls -lt /var/tmp/ | head -2
total 42
-rw-r--r--   1 gregns   staff
       4037 Apr 20 11:30 /var/tmp/appcrash.test1.5174
% cat /var/tmp/appcrash.test1.5174

Output from runme_on_app_crash
Program: test1
Process ID: 5174

Application Debugging Data

> /bin/pstack 5174
5174: test1
 08050652 sub2     (0) + 12
 08050688 sub      (0) + 18
 080506bf main     (1, 8047cec, 8047cf4) + 1f
 080505aa ???????? (1, 8047db0, 0, 8047db6, 8047dc8, 8047e49)

> /bin/pmap -x 5174
5174: test1
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
08047000       4       4       4       - rwx--    [ stack ]
08050000       4       4       -       - r-x--  test1
08060000       4       4       4       - rwx--  test1
FEEE0000       4       4       4       - rwx--    [ anon ]
FEEF0000      24      12      12       - rwx--    [ anon ]
FEF00000     724     724       -       - r-x--  libc.so.1
FEFC5000      24      24      24       - rw---  libc.so.1
FEFCB000       8       8       8       - rw---  libc.so.1
FEFDA000     128     128       -       - r-x--  ld.so.1
FEFFA000       4       4       4       - rwx--  ld.so.1
FEFFB000       8       8       8       - rwx--  ld.so.1
-------- ------- ------- ------- -------
total Kb     936     924      68       -

> /bin/pldd 5174
5174: test1

> /bin/ptree 5174
225   /usr/lib/inet/inetd start
  5139  /usr/sbin/in.rlogind
    5141  -csh
      5174  test1

> /bin/pargs -ace 5174
5174: test1
argv[0]: test1

envp[0]: HOME=/home/gregns
... [removed more environment variable settings] ...

> /bin/plimit -m 5174
5174: test1
    resource             current         maximum
  time(seconds)         unlimited       unlimited
  file(mbytes)          unlimited       unlimited
  data(mbytes)          unlimited       unlimited
  stack(mbytes)         10              unlimited
  coredump(mbytes)      0               unlimited
  nofiles(descriptors)  256             65536
  vmemory(mbytes)       unlimited       unlimited

> /bin/pwdx 5174
5174: /home/gregns/tests

> /bin/pfiles 5174
5174: test1
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:28715
                 gid:7 rdev:24,4
   1: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:28715
                 gid:7 rdev:24,4
   2: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:28715
                 gid:7 rdev:24,4

System Configuration Data

> /bin/uname -a
SunOS rahova 5.10 Generic i86pc i386 i86pc

> /bin/cat /etc/release
                          Solaris 10 3/05 s10_74L2a X86
                     Copyright 2005 Sun Microsystems, Inc.
                             All Rights Reserved.
                        Use is subject to license terms.
                            Assembled 22 January 2005

> /usr/sbin/psrinfo -v
Status of virtual processor 0 as of: 04/20/2005 11:30:49
  on-line since 03/30/2005 14:43:48.
  The i386 processor operates at 2393 MHz,
and has an i387 compatible floating point processor.
Status of virtual processor 1 as of: 04/20/2005 11:30:49
  on-line since 03/30/2005 14:43:53.
  The i386 processor operates at 2393 MHz,
and has an i387 compatible floating point processor.

> /usr/sbin/swap -s
total: 62464k bytes allocated + 12248k reserved =
         74712k used, 6891300k available

> /usr/sbin/swap -l
swapfile             dev  swaplo blocks   free
/dev/dsk/c1t0d0s1   28,65      8 8389432 8389432

> /usr/sbin/prtconf|/bin/head -2
System Configuration:  Sun Microsystems  i86pc
Memory size: 3327 Megabytes

> /bin/showrev -p|/bin/cut -d' ' -f2|/bin/sort

The above file is ready to be sent to the owner of the faulty application for debugging. Note that the traceback produced by pstack(1) clearly points to the routine containing the bug, sub2() in this case. It also contains a chain of function calls leading to the faulty routine.

Solutions on Other Systems

Microsoft Windows has an interesting functionality in this area: see reference [9], Windows Error Reporting for Developers.

Not only does Microsoft provide the infrastructure for the ISVs to automatically collect the crash data (which is what AppCrash is all about), but it actually collects those error reports from the users and allows the ISVs to access those reports from the above Microsoft site.

Microsoft encrypts the collected data such that only the intended ISV or Microsoft employees can decrypt it. This is not a bad idea, but that method still doesn't allow the users to inspect the data before allowing it to leave their sites. Nor does it let the users control what information to collect.

For more information on how Microsoft collects the data on crashes, see reference [10], Microsoft Online Crash Analysis Data Collection Policy.

For further information about Microsoft minidumps, see reference [11], Post-Mortem Debugging Your Application with Minidumps and Visual Studio .NET.

We think the AppCrash method described in this article is more flexible and provides more freedom and power to the users and to the ISVs. Of course the users will decide which approach they prefer.


For Mac OS X, see reference [12], Mac OS X CrashReporter.


Specialized commercial products and services are available to perform automated crash monitoring and analysis for applications. For one example, see reference [13].

Related discussions are also available in reference [1] and reference [3].


This article describes a DTrace-based solution allowing ISVs and users of the Solaris OS to safely collect debugging information when any application crashes, and thus help improve the quality of the applications and reduce the costs of software defects. The users can fully automate such diagnostic data collection and transmission if they want, while having full control over which information is collected and sent to the application developer and/or system vendor for analysis and remediation.

For AppCrash updates and related discussions, see reference [14].

About the Authors

Greg Nakhimovsky and Morgan Herrington are Sun engineers working with application software vendors to make sure their products run well on Sun systems.
