Debugging Multithreaded Programs

By Ann Rice, April 2011 (updated June 2016)

While a traditional UNIX® process contains a single thread of control, multithreading separates a process into many execution threads, each of which runs independently. Multithreading your code has a number of benefits, but it can also introduce bugs that might be difficult to find. This article suggests ways of avoiding such bugs in your code as well as strategies for finding these bugs using the dbx command-line debugger.

Benefits From Multithreading

Multithreading your code can help in the following areas.

Improving Application Responsiveness

Any program in which many activities are not dependent upon each other can be redesigned so that each independent activity is defined as a thread. For example, the user of a multithreaded GUI does not have to wait for one activity to complete before starting another.

Using Multiprocessors Efficiently

Typically, applications that express concurrency requirements with threads need not take into account the number of available processors. The performance of the application improves transparently with additional processors because the operating system takes care of scheduling threads for the number of processors that are available. When multicore processors and multithreaded processors are available, a multithreaded application's performance scales appropriately because the cores and threads are viewed by the OS as processors.

Numerical algorithms and applications with a high degree of parallelism, such as matrix multiplications, can run much faster when implemented with threads on a multiprocessor.

Improving Program Structure

Many programs are more efficiently structured as multiple independent or semi-independent units of execution instead of as a single, monolithic thread. For example, a non-threaded program that performs many different tasks might need to devote much of its code just to coordinating the tasks. When the tasks are programmed as threads, the code can be simplified. Multithreaded programs, especially programs that provide service to multiple concurrent users, can be more adaptive to variations in user demands than single-threaded programs.

Using Fewer System Resources

Programs that use two or more processes that access common data through shared memory are applying more than one thread of control.

However, each process has a full address space and operating environment state. The cost of creating and maintaining this large amount of state information makes each process much more expensive than a thread in both time and space.

In addition, the inherent separation between processes can require a major effort by the programmer. This effort includes handling communication between the threads in different processes, or synchronizing their actions. When the threads are in the same process, communication and synchronization become much easier.

Combining Threads and RPC

By combining threads and a remote procedure call (RPC) package, you can exploit nonshared-memory multiprocessors, such as a collection of workstations. This combination distributes your application relatively easily and treats the collection of workstations as a multiprocessor.

For example, one thread might create additional threads. Each of these children could then place a remote procedure call, invoking a procedure on another workstation. Although the original thread has merely created threads that are now running in parallel, this parallelism involves other computers.

The Message Passing Interface (MPI) might be a more effective approach to achieve multithreading in applications that run across distributed systems. See http://www-unix.mcs.anl.gov/mpi/ for more information about MPI.

The Oracle Message Passing Toolkit includes Open MPI Message Passing Interface (OMPI), which is an open source implementation of MPI.

Oversights That Can Cause Bugs

The following list points out some of the more frequent oversights that can cause bugs in multithreaded programs:

  • Passing a pointer to the caller's stack as an argument to a new thread.
  • Accessing the shared changeable state of global memory without the protection of a synchronization mechanism, which leads to a data race. A data race occurs when two or more threads in a single process access the same memory location concurrently, and at least one of the threads tries to write to the location. When the threads do not use exclusive locks to control their accesses to that memory, the order of accesses is non-deterministic, and the computation might give different results from run to run depending on that order. Some data races might be benign (for example, when the memory access is used for a busy-wait), but many data races are bugs in the program. The Thread Analyzer tool is useful for detecting data races.
  • Deadlocks caused by two threads trying to acquire rights to the same pair of global resources in alternate order. One thread controls the first resource and the other controls the second resource. Neither can proceed until the other gives up. The Thread Analyzer tool is useful for detecting deadlocks.
  • Trying to reacquire a lock already held (recursive deadlock).
  • Creating a hidden gap in synchronization protection. This gap in protection occurs when a protected code segment contains a function that frees and then reacquires the synchronization mechanism before it returns to the caller. The result is misleading. To the caller, it appears that the global data has been protected when the data actually has not been protected.
  • Mixing UNIX signals with threads without using the sigwait(2) model for handling asynchronous signals.
  • Calling setjmp() and longjmp(), and then long-jumping away without releasing the mutex locks.
  • Failing to re-evaluate the conditions after returning from a call to *_cond_wait() or *_cond_timedwait().
  • Forgetting that, by default, threads are created as PTHREAD_CREATE_JOINABLE and must be reclaimed with pthread_join(3C). Note that pthread_exit(3C) does not free the thread's storage space.
  • Making deeply nested, recursive calls and using large automatic arrays, which can cause problems because multithreaded programs have a more limited stack size than single-threaded programs.
  • Specifying an inadequate stack size, or using nondefault stacks.
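
The first oversight in the list above can be avoided by giving each new thread an argument whose lifetime does not depend on the creator's stack frame. The following is a minimal sketch under that assumption, using POSIX threads; task_t, worker, start_worker, and finish_worker are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread argument; the names are illustrative only. */
typedef struct {
    int input;
    int result;
} task_t;

/* Thread start routine: works on the heap-allocated task it was given. */
static void *worker(void *arg)
{
    task_t *task = arg;
    task->result = task->input * 2;
    return task;    /* hand the task back to whoever calls pthread_join() */
}

/*
 * Start a worker whose argument outlives this function. Passing the
 * address of a local task_t here would leave the new thread with a
 * dangling pointer as soon as start_worker() returned.
 */
int start_worker(int input, pthread_t *tid)
{
    assert(tid != NULL);
    task_t *task = malloc(sizeof *task);
    if (task == NULL)
        return -1;
    task->input = input;
    task->result = 0;
    if (pthread_create(tid, NULL, worker, task) != 0) {
        free(task);
        return -1;
    }
    return 0;
}

/* Join the worker, copy out its result, and free the argument. */
int finish_worker(pthread_t tid, int *result)
{
    void *ret = NULL;
    assert(result != NULL);
    if (pthread_join(tid, &ret) != 0)
        return -1;
    task_t *task = ret;
    *result = task->result;
    free(task);
    return 0;
}
```

A caller would invoke start_worker(), do other work, and later call finish_worker() with the same thread ID. Ownership of the argument passes to the thread and returns through pthread_join(), so the memory is freed exactly once.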

Multithreaded programs, especially those containing bugs, often behave differently in two successive runs, given identical inputs, because of differences in the thread scheduling order.
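
A shared counter is a common source of this run-to-run variation. The sketch below, which assumes POSIX threads and uses hypothetical names (counter, add_to_counter, run_counter_threads), protects the read-modify-write with a mutex so that the result no longer depends on scheduling order. Removing the lock and unlock calls would reintroduce the data race described in the list above.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long counter;    /* shared, changeable state */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds INCREMENTS to the shared counter under the lock. */
static void *add_to_counter(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;      /* without the lock, this read-modify-write races */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run the workers to completion and return the final count. */
long run_counter_threads(void)
{
    pthread_t tids[NUM_THREADS];

    counter = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&tids[i], NULL, add_to_counter, NULL);
        assert(rc == 0);    /* this sketch treats creation failure as fatal */
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

With the lock in place, run_counter_threads() returns NUM_THREADS * INCREMENTS on every run. Locking once per increment is deliberately simple; a real program would usually accumulate locally and take the lock less often.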

In general, multithreading bugs are statistical rather than deterministic. Tracing is usually a more effective method of finding order-of-execution problems than breakpoint-based debugging.
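
The lock-ordering deadlock from the list of oversights is a typical statistical bug: it appears only when two threads interleave in a particular way. One common remedy, sketched here as an assumption rather than as a recommendation from this article, is to acquire every pair of locks in a single global order. The sketch uses POSIX threads and orders the locks by address; account_t, transfer, and run_opposing_transfers are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared resource guarded by its own lock. */
typedef struct {
    pthread_mutex_t lock;
    long balance;
} account_t;

/* Move funds while holding both locks, always taken in address order. */
void transfer(account_t *from, account_t *to, long amount)
{
    assert(from != to);     /* locking one mutex twice would self-deadlock */
    account_t *first  = ((uintptr_t)from < (uintptr_t)to) ? from : to;
    account_t *second = (first == from) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

typedef struct {
    account_t *from;
    account_t *to;
    int rounds;
} transfer_job_t;

static void *transfer_loop(void *arg)
{
    transfer_job_t *job = arg;
    for (int i = 0; i < job->rounds; i++)
        transfer(job->from, job->to, 1);
    return NULL;
}

/*
 * Run two threads that transfer in opposite directions and return the
 * combined balance, or -1 if a thread could not be created. The accounts
 * and jobs live on this stack frame, which is safe only because both
 * threads are joined before the function returns.
 */
long run_opposing_transfers(int rounds)
{
    account_t x, y;
    pthread_t t1, t2;

    pthread_mutex_init(&x.lock, NULL);
    pthread_mutex_init(&y.lock, NULL);
    x.balance = y.balance = 1000;

    transfer_job_t x_to_y = { &x, &y, rounds };
    transfer_job_t y_to_x = { &y, &x, rounds };

    if (pthread_create(&t1, NULL, transfer_loop, &x_to_y) != 0)
        return -1;
    if (pthread_create(&t2, NULL, transfer_loop, &y_to_x) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    long total = x.balance + y.balance;
    pthread_mutex_destroy(&x.lock);
    pthread_mutex_destroy(&y.lock);
    return total;
}
```

If transfer() instead locked from before to unconditionally, the two threads could each hold one lock and wait forever for the other, but only on some runs.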

Tracing and Debugging With DTrace

Oracle Solaris Dynamic Tracing (DTrace) is a comprehensive dynamic tracing facility built into the Oracle Solaris operating system. You can use DTrace to examine the behavior of your multithreaded program. DTrace inserts probes into running programs to collect data at points in the execution path that you specify. The collected data can be examined to determine problem areas. See the Oracle Solaris Dynamic Tracing Guide and the DTrace User's Guide for more information about using DTrace.

Detecting Data Races and Deadlocks With the Thread Analyzer

Oracle Developer Studio includes the Thread Analyzer tool. This tool lets you analyze the execution of a multithreaded program. It can detect multithreaded programming errors such as data races or deadlocks in code written using the POSIX thread API, the Oracle Solaris thread API, OpenMP directives, or a mix of these technologies. See the Thread Analyzer User's Guide for more information.

Debugging Multithreaded Programs With the dbx Debugger

Oracle Developer Studio includes the dbx command-line debugger, an interactive source-level debugging tool.

When it detects a multithreaded program, dbx tries to load libthread_db.so, a special system library for thread debugging located in /usr/lib. dbx is synchronous; when any thread or lightweight process (LWP) stops, all other threads and LWPs sympathetically stop. (An LWP is a thread in the Oracle Solaris kernel that executes kernel code and system calls.) This behavior is sometimes referred to as the “stop the world” model.

Setting Breakpoints in Multithreaded Code

You can set breakpoints in multithreaded code using the stop command, trace command, or when command. The basic syntax of these commands is:
stop event-specification [modifier]
trace event-specification [modifier]
when event-specification [ modifier ] { command; ... }

Two thread-specific events were added in Oracle Developer Studio 11 dbx:

  • The thr_create [thread_id] event occurs when a thread, or a thread with the specified thread_id, has been created. For example, in the following stop command, the thread ID t@1 refers to the creating thread, while the thread ID t@5 refers to the created thread.

    (dbx) stop thr_create t@5 -thread t@1 

  • The thr_exit event occurs when a thread has exited. To capture the exit of a specific thread, use the -thread option of the stop command as follows:

    (dbx) stop thr_exit -thread t@5

Understanding Thread Creation Activity

You can get an idea of how often your application creates and destroys threads by using the thr_create event and thr_exit event as in the following example:


(dbx) trace thr_create
(dbx) trace thr_exit
(dbx) run

trace: thread created t@2 on l@2
trace: thread created t@3 on l@3
trace: thread created t@4 on l@4
trace: thr_exit t@4
trace: thr_exit t@3
trace: thr_exit t@2

The application created three threads. Note that the threads exited in the reverse order of their creation. This pattern might indicate that, if the application had more threads, the threads would accumulate and consume resources.
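
A program with the creation pattern traced above might look like the following minimal sketch. It assumes POSIX threads (pthread_create) rather than the Solaris thr_create call used by the traced application, and echo_index and run_and_reap_workers are hypothetical names. Because threads are joinable by default, each one is reclaimed with pthread_join() so that exited threads do not accumulate.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_WORKERS 3

/* Start routine: returns its index so the joiner can check which thread ended. */
static void *echo_index(void *arg)
{
    return arg;
}

/* Create NUM_WORKERS joinable threads and reclaim each one with pthread_join(). */
int run_and_reap_workers(void)
{
    pthread_t tids[NUM_WORKERS];
    int reaped = 0;

    for (intptr_t i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&tids[i], NULL, echo_index, (void *)i);
        assert(rc == 0);    /* this sketch treats creation failure as fatal */
        (void)rc;
    }
    for (intptr_t i = 0; i < NUM_WORKERS; i++) {
        void *ret = NULL;
        /* pthread_join() releases the resources of the exited thread. */
        if (pthread_join(tids[i], &ret) == 0 && (intptr_t)ret == i)
            reaped++;
    }
    return reaped;
}
```

Tracing thread creation and exit on a program like this would show one creation and one exit per worker. A worker that is never joined or detached keeps its resources after it exits.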

To get more interesting information, you could try the following in a different session:


(dbx) when thr_create { echo "XXX thread $newthread created by $thread"; }
XXX thread t@2 created by t@1
XXX thread t@3 created by t@1
XXX thread t@4 created by t@1

The output shows that all three threads were created by thread t@1, which is a common multithreading pattern.

Suppose you want to debug thread t@3 from its outset. You could stop the application at the point that thread t@3 is created as follows:


(dbx) stop thr_create t@3
(dbx) run
t@1 (l@1) stopped in tdb_event_create at 0xff38409c
0xff38409c: tdb_event_create       :    retl
Current function is main
216       stat = (int) thr_create(NULL, 0, consumer, q, tflags, &tid_cons2);
(dbx)

If your application occasionally spawns a new thread from thread t@5 instead of thread t@1, you could capture that event as follows:

(dbx) stop thr_create -thread t@5

See Setting Event Specifications in the Debugging a Program With dbx manual for a complete list of event specifications. Bear in mind that the event you specify may occur in more than one thread, so your program may hit the breakpoint many times. You can specify a thread_id or lwp_id as a modifier to the stop command and the trace command. The action associated with the event is then executed only if the thread or LWP that caused the event matches the thread_id or lwp_id. However, the specific thread or LWP you have in mind might be assigned a different thread_id or lwp_id from one execution of the program to the next.

Stepping Through Multithreaded Code

dbx supports two basic single-step commands: next and step, plus two variants of step, called step up and step to. Both the next command and the step command let the program execute one source line before stopping again. The basic difference between the next and step commands is in how they handle function calls. If the line executed contains a function call:

  • The next command allows the call to be executed and stops at the following line (“steps over” the call).
  • The step command stops at the first line in a called function (“steps into” the call).

The syntax of the next command is:

next [ n ] [ -sig signal ] [ thread_id ] [ lwp_id ]

The syntax of the step command is:

step [ n ] [ up ] [ -sig signal ] [ thread_id ] [ lwp_id ] [ to function ]

To step one line in the current thread or LWP, type:

next

or

step

To step multiple (n) lines in the current thread or LWP, type:

next n

or

step n

To step one line in a different thread, type:

step thread_id

For example:

(dbx) step t@2

With multithreaded programs, when a function call is stepped into or stepped over, all LWPs are implicitly resumed for the duration of that function call in order to avoid deadlock.

You can pass a thread_id or lwp_id to the next command or the step command, which changes the current thread or LWP. However, doing so defeats this deadlock avoidance measure. To avoid deadlocks, it is safer to change the current thread or LWP using the thread command or lwp command, and then use the next command or step command to step in the new current thread or LWP.

Whenever you give a command that steps a single thread or LWP, you need to be aware of potential deadlocks. If the thread that continues executing needs to acquire a lock that is held by a thread that has not resumed execution, your program deadlocks. If such a deadlock occurs, you can break it only by typing ctrl-C and then resuming all threads.

To step up and out of the current function in the current thread or LWP, type:

step up

or

step up lwp_id

To step into a specified function at the current source line, type:

step to function_name

To step into the last function called as determined by the assembly code for the current source line, type:

step to

To deliver a signal while stepping, you can add -sig signal to any of the above next and step commands.

Resuming Execution

To resume execution of your multithreaded program after hitting a breakpoint or after single-stepping through your code, use the cont command. For multithreaded programs, the syntax is:

cont [ at line ] [ thread_id | lwp_id ] [ -sig signal ]

To continue execution of all threads, type:

cont

To continue execution of a specific thread or LWP, type:

cont thread_id

or

cont lwp_id

For example:

(dbx) cont l@3

To continue execution at a specific source line, type:

cont at line_number thread_id

or

cont at line_number lwp_id

To continue execution with a specific signal, add -sig signal to any of the above cont commands.

Whenever you give a command that resumes a single thread or LWP, you need to be aware of potential deadlocks. If the thread that continues executing needs to acquire a lock that is held by a thread that has not resumed execution, your program deadlocks. If such a deadlock occurs, you can break it only by typing ctrl-C and then resuming all threads.

Viewing the Threads List

To view the threads list, use the threads command. The syntax is:

threads [ -all ] [ -mode [ all|filter ] [ auto|manual ] ]

The threads command displays the thread information shown in the following example:


(dbx) threads
    t@1 a l@1  ?()  running   in main()
    t@2      ?() asleep on 0xef751450  in_swtch()
    t@3 b l@2  ?()  running in sigwait()
    t@4     consumer()  asleep on 0x22bb0 in _lwp_sema_wait()
  *>t@5 b l@4 consumer()  breakpoint     in Queue_dequeue()
    t@6 b l@5 producer()     running       in _thread_start()
(dbx)

For native code, each line of information displayed by the threads command is composed of the following:

  • The * (asterisk) indicates that an event requiring user attention has occurred in this thread. Usually this is a breakpoint.

    An 'o' instead of an asterisk indicates that a dbx internal event has occurred.

  • The > (arrow) denotes the current thread.
  • t@number, the thread id, refers to a particular thread. The number is the thread_t value passed back by thr_create.
  • b l@number or a l@number means the thread is bound to or active on the designated LWP, meaning the thread is actually runnable by the operating system.
  • The “Start function” of the thread as passed to thr_create. A ?() means that the start function is not known.
  • The thread state. (See the table below for descriptions of the thread states.)
  • The function that the thread is currently executing.
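
In the sample listing above, thread t@3 is running in sigwait(), which is what a dedicated signal-handling thread looks like in the threads list. The sketch below shows one way such a thread might be set up with POSIX threads. It is an illustration, not the code of the program in the listing; sigwait_ctx_t, signal_waiter, and wait_for_signal are hypothetical names, and the kill() call merely stands in for a signal that would normally arrive asynchronously.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical context shared with the signal-handling thread. */
typedef struct {
    sigset_t set;       /* signals the dedicated thread waits for */
    int received;       /* signal number reported by sigwait(), 0 if none */
} sigwait_ctx_t;

/* Dedicated thread: receives the signal synchronously, with no handler. */
static void *signal_waiter(void *arg)
{
    sigwait_ctx_t *ctx = arg;
    int sig = 0;

    if (sigwait(&ctx->set, &sig) == 0)
        ctx->received = sig;
    return NULL;
}

/*
 * Block signo, start the waiter, raise signo against the process, and
 * return the signal number the waiter saw (or -1 on error).
 */
int wait_for_signal(int signo)
{
    sigwait_ctx_t ctx;
    pthread_t tid;

    assert(signo > 0);
    ctx.received = 0;
    sigemptyset(&ctx.set);
    if (sigaddset(&ctx.set, signo) != 0)
        return -1;

    /* Block before creating threads so that every thread inherits the mask. */
    if (pthread_sigmask(SIG_BLOCK, &ctx.set, NULL) != 0)
        return -1;
    /* ctx lives on this stack frame; that is safe only because we join below. */
    if (pthread_create(&tid, NULL, signal_waiter, &ctx) != 0)
        return -1;

    kill(getpid(), signo);      /* stand-in for an asynchronous signal */
    pthread_join(tid, NULL);
    return ctx.received;
}
```

Because the signal is blocked in every thread, it stays pending until the waiter collects it with sigwait(), so no asynchronous handler ever interrupts a thread that holds a lock. The signal remains blocked in the calling thread after wait_for_signal() returns.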

Table 1 Thread and LWP States

State               Description
suspended           The thread has been explicitly suspended.
runnable            The thread is runnable and is waiting for an LWP as a computational resource.
zombie              When a detached thread exits (thr_exit()), it is in a zombie state until it has rejoined through the use of thr_join(). THR_DETACHED is a flag specified at thread creation time (thr_create()). A non-detached thread that exits is in a zombie state until it has been reaped.
asleep on syncobj   The thread is blocked on the given synchronization object. Depending on what level of support libthread.so and libthread_db.so provide, syncobj might be as simple as a hexadecimal address or something with more information content.
active              The thread is active on an LWP, but dbx cannot access the LWP.
unknown             dbx cannot determine the state.
lwpstate            A bound or active thread state has the state of the LWP associated with it.
running             The LWP was running but was stopped in synchrony with some other LWP.
syscall num         The LWP stopped on an entry into the given system call number.
syscall return num  The LWP stopped on an exit from the given system call number.
job control         The LWP stopped due to job control.
LWP suspended       The LWP is blocked in the kernel.
single stepped      The LWP has just completed a single step.
breakpoint          The LWP has just hit a breakpoint.
fault num           The LWP has incurred the given fault number.
signal name         The LWP has incurred the given signal.
process sync        The process to which this LWP belongs has just started executing.
LWP death           The LWP is in the process of exiting.
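
A thread typically shows up as asleep on a synchronization object when it is blocked in a condition wait, as a consumer waiting on an empty queue would be. The following sketch of a bounded queue assumes POSIX threads and uses hypothetical names (queue_t, queue_enqueue, queue_dequeue, produce_and_consume); it is not the implementation behind the sample listing. Each wait sits in a while loop so that the condition is re-evaluated after pthread_cond_wait() returns, which addresses one of the oversights listed earlier.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAPACITY 8

typedef struct {
    int items[QUEUE_CAPACITY];
    int head;
    int count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    pthread_cond_t not_full;
} queue_t;

void queue_init(queue_t *q)
{
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void queue_enqueue(queue_t *q, int item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)      /* re-check after every wakeup */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = item;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

int queue_dequeue(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                   /* re-check after every wakeup */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

typedef struct {
    queue_t *q;
    int n;
} producer_job_t;

/* Producer thread: enqueues the integers 1..n. */
static void *producer(void *arg)
{
    producer_job_t *job = arg;
    for (int i = 1; i <= job->n; i++)
        queue_enqueue(job->q, i);
    return NULL;
}

/* Consume n items produced by a second thread and return their sum. */
long produce_and_consume(int n)
{
    queue_t q;
    producer_job_t job = { &q, n };
    pthread_t tid;
    long sum = 0;

    queue_init(&q);
    int rc = pthread_create(&tid, NULL, producer, &job);
    assert(rc == 0);    /* this sketch treats creation failure as fatal */
    (void)rc;
    for (int i = 0; i < n; i++)
        sum += queue_dequeue(&q);
    pthread_join(tid, NULL);
    return sum;
}
```

Replacing either while with if would let a thread proceed after a spurious wakeup, or after another thread had already consumed the item, and the queue indexes would then be corrupted only on some runs.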

To print the list of all known threads, type:

threads

The output of this command might be:


*>    t@1  a  l@1   ?()   signal SIGINT in  _XFlushInt() 
      t@2  b  l@2   ?()   running          in  _signotifywait() 
      t@3  b  l@3   ?()   running          in  _lwp_sema_wait() 
      t@4      ?()   sleep on (unknown) in  _reap_wait() 

To print threads normally not printed (zombies), type:

threads -all

The output of this command might be:


*>    t@1  a  l@1   ?()   signal SIGINT in  _XFlushInt() 
      t@2  b  l@2   ?()   running          in  _signotifywait() 
      t@3  b  l@3   ?()   running          in  _lwp_sema_wait() 
      t@4      ?()   sleep on (unknown) in  _reap_wait() 
      t@5       myThread()   zombie  in  in 

By default, the threads command runs in filter mode, meaning that hidden threads and zombie threads are not printed. To print all threads including hidden threads and zombies, type:

threads -mode all

Displaying, Changing, Suspending, or Resuming the Current Thread

The thread command lists or changes the current thread. The syntax is:

thread [ -blocks ] [ -blockedby ] [ -info ] [ -hide ] [ -unhide ] [ -suspend ] [ -resume ] thread_id

To change the current thread, type:

thread thread_id

To print everything known about the current thread or given thread, type:

thread -info [thread_id]

For example, this command might produce the following output:


thread -info t@4

Thread t@4 (0xfe60bd70) at priority 127
        state: asleep on (unknown)
        base function: 0x0: 0x00000000() stack: 0xfe60bd70[1047920]
        flags: DETACHED|DAEMON 
        masked signals: HUP INT QUIT ILL TRAP ABRT EMT FPE BUS SEGV SYS PIPE ALRM TERM USR1 USR2 CLD PWR WINCH 
         URG POLL TSTP CONT TTIN TTOU VTALRM PROF XCPU XFSZ WAITING FREEZE THAW LOST RTMIN RTMIN+1 
         RTMIN+2 RTMIN+3 RTMIN+4 RTMIN+5 RTMIN+6 RTMIN+7 
        Currently inactive in _reap_wait 

To print all locks held by the current thread or given thread blocking other threads, type:

thread -blocks [thread_id]

To show which synchronization object the current thread or given thread is blocked by, if any, type:

thread -blockedby [thread_id]

To suspend the current thread or given thread, type:

thread -suspend [thread_id]

To resume (unsuspend) the current thread or given thread, type:

thread -resume [thread_id]

To hide the current thread or given thread so that it will not be displayed by the threads command, type:

thread -hide [thread_id]

To unhide the current thread or given thread so that it will be displayed by the threads command, type:

thread -unhide [thread_id]

Displaying LWP Information

Normally, you need not be aware of LWPs, but there are times when thread level queries cannot be completed. In these cases, you can use the lwp command and lwps command to show information about LWPs.

The syntax of the lwp command is:

lwp [ -info ] lwp_id

To list the current LWP, type:

lwp

For example:

lwp l@3

To change the current LWP, type:

lwp lwp_id

To display the name, home, and masked signals of the current LWP, type:

lwp -info

For example, this command might produce the following output:


lwp -info l@2

l@2 running          in _signotifywait()
masked signals are: 

To list all LWPs in the current process, use the lwps command:

lwps

The lwps command displays the LWP information shown in the following example:


(dbx) lwps
    l@1 running in main()
    l@2 running in sigwait()
    l@3 running in _lwp_sema_wait()
  *>l@4 breakpoint in Queue_dequeue()
    l@5 running in _thread_start()
(dbx)

Each line of the LWP list contains the following:

  • The * (asterisk) indicates that an event requiring user attention has occurred in this LWP.
  • The arrow denotes the current LWP.
  • l@number refers to a particular LWP.
  • The next item represents the LWP state.
  • in function_name() identifies the function that the LWP is currently executing.

Runtime Checking Multithreaded Applications

Runtime checking in dbx supports multithreaded applications. Along with each access checking error report, RTC prints the ID of the thread on which the error occurred. The leak report generated by RTC includes the leaks from all the threads in the program.

Potential Problems With Dynamic Function Calls

If you are accustomed to using dynamic function calls from dbx when debugging single-threaded programs, take care in using the same technique when debugging multithreaded code.

dbx allows you to use function calls in expressions. For example, the following command forces the target program to call foo():

print foo()

Forcing a function call can be useful because it lets you use the program code to examine the state of the program.

You can use the when command to stop execution at particular locations in the program, print data, and then resume execution:

when in bar {print var;}

If you combine these two examples, as in the following command, you are stopping the program at various locations, forcing it to call a function, and then continuing execution:

when in bar {print foo();}

If you give dbx such a command, you must be sure that calling foo() at the times you stop execution in bar() does not interfere with your program's intended execution.

If your program is multithreaded, it is more difficult to predict when it is safe to force a call to foo(). One thread might be stopped in bar(), which you know is safe, but other threads might be in the process of modifying data that foo() relies on.
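
One defensive option, offered here as an assumption and not as a dbx requirement, is to write diagnostic functions that are intended for forced calls so that they never block on a lock. The sketch below uses POSIX threads and hypothetical names (job_list_t, jobs_add, jobs_length_nonblocking). The diagnostic reports -1 when another thread holds the lock, which also signals that the data might be in the middle of an update.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared structure that several threads update. */
typedef struct {
    pthread_mutex_t lock;
    int length;
} job_list_t;

static job_list_t jobs = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* Normal program path: blocks until the lock is available. */
void jobs_add(void)
{
    int rc = pthread_mutex_lock(&jobs.lock);
    assert(rc == 0);    /* locking a valid default mutex should not fail */
    (void)rc;
    jobs.length++;
    pthread_mutex_unlock(&jobs.lock);
}

/*
 * Diagnostic meant for forced calls from a debugger. It never blocks:
 * if any thread holds the lock, possibly one that is currently stopped,
 * it returns -1 instead of waiting.
 */
int jobs_length_nonblocking(void)
{
    if (pthread_mutex_trylock(&jobs.lock) != 0)
        return -1;
    int length = jobs.length;
    pthread_mutex_unlock(&jobs.lock);
    return length;
}
```

With such a function, a command like print jobs_length_nonblocking() either returns a consistent value or reports that the list is busy. This reduces one risk only; the function must still have no side effects that the rest of the program depends on.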

For More Information

For further details, see: