Analyzing a `patchadd` or `patchrm` Failure in the Solaris OS

Enda O'Connor, April 2009 (Updated November 2009)


Note: This article contains references to some items that apply only to the Solaris 10 Operating System, for example, a new bootblk. The rest of the article and the accompanying script work on the Solaris 8, 9, or 10 OS.

Introduction

It is important to gather sufficient information before starting a root-cause analysis to determine why a patchadd or patchrm session did not succeed. This document is intended to help users of the Solaris Operating System for SPARC or x86 platforms do that. The focus is on what to do if the system that was patched did not reboot properly, either by panicking in a loop or dropping to an OK prompt (in the case of the Solaris OS for SPARC platforms).

This article outlines what files are most relevant, where to locate these files, and also (depending on which patch automation tool, if any, was used to apply the patch) where to locate any relevant output from such tools. The document is not intended to help in analyzing the actual failure, because patching-related issues can have any number of different causes, but it does try to provide some generic pointers.

Booting the System From CD-ROM or Network

If the system was rebooted and failed to come back up to the required run level, or it failed to boot into maintenance mode (that is, if the system dropped to the OK prompt or is panicking in a loop), it is necessary to first boot the system from the network or CD-ROM and mount the relevant file systems to be able to access the system.

1. For SPARC systems only, if the system drops down to the OK prompt, as in the following, perform these substeps:

{1} ok

a. First, try to capture the relevant console output from when the system started to reboot up to the point where the OK prompt is displayed. Keep this output, because it might be very relevant to diagnosing the actual underlying failure.

b. At this point, try to determine whether the problem can be resolved by booting from disk, without recourse to booting from the network or media.

c. If you need to boot to single-user mode from media and mount the root file systems, there are a couple of options:

Boot from network
Boot from CD-ROM

To boot from the network, make sure your client is properly configured in the boot server and the network connections and configuration are correct (this is outside the scope of this document). Then run this command:

{1} ok boot net -s

To boot from CD-ROM, run this command:

{1} ok boot cdrom -s

2. If the system is panicking in a continuous reboot loop:

a. First, try to capture the full panic output from the console.

b. Then, for a SPARC system, drop to the OK prompt and follow the instructions for Step 1 above.

For an x86 system, you need to make sure the BIOS boot priority allows the system to boot from either the network or a CD-ROM prior to booting from hard disk. If you are doing a network boot, make sure the client is properly configured in the boot server. For example, if you are using DHCP, ensure the client's network connections and configuration are correct, or if you are using NIS, ensure the client is set up correctly in the NIS server.

3. After the system has booted from CD-ROM or network, follow the instructions in the BigAdmin article How to Remove a Solaris OS Patch While Booted From a Network or CD-ROM to mount all relevant file systems that will be examined.

Gathering Various Data to Enable Root-Cause Analysis

At this stage, we assume that all of the preceding steps have been completed and that the file systems of the affected system are mounted under /a.

Note: Most of the following data is gathered by the patchanalysis_gather.txt script, with the exception of actual patchadd output to terminal. Here's the source code for the patchanalysis_gather.txt script file.

1. Gather the patchadd or patchrm related log files.

If a patchadd session was done solely through the patchadd utility, then unless you captured the patchadd output to a log file, this data is not retrievable. It might be possible to simply cut and paste the patchadd output from a terminal or console, if the output is still available. Note that you want the actual output (STDOUT/STDERR) from the patchadd command itself, as opposed to the log files in /var/sadm/patch/ generated by patchadd.

It is strongly recommended that all patchadd output be redirected to a file during patching, so the output can be retrieved easily if it is required later for examination.

For example, the following command directs the output of patchadd to a log file:

patchadd <PatchID> 2>&1|tee /opt/patchlogs/118833-36.$$
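As a minimal sketch of this recommendation, the wrapper below runs any command, shows its output on the terminal, and keeps a copy in a log file. The function name run_logged and the log path in the usage comment are illustrative and are not part of patchadd. The sketch assumes a POSIX-compatible shell such as ksh.

```shell
# Hypothetical helper: run a command, display its output, and save a copy.
# $1 = log file to write; remaining arguments = the command to run.
run_logged() {
    log_file=$1
    shift
    # Create the log directory if it does not exist yet.
    mkdir -p "$(dirname "$log_file")" || return 1
    # 2>&1 merges STDERR into STDOUT so that tee records both streams.
    # Note: the pipeline's exit status is that of tee, not of the command.
    "$@" 2>&1 | tee "$log_file"
}

# Usage (the patch ID is a placeholder):
#   run_logged "/opt/patchlogs/patchadd.$$" patchadd <PatchID>
```

Because the log file name includes the shell's process ID ($$), repeated runs from different shells do not overwrite each other.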

2. In the following examples, we will use /a as the prefix to all commands, because we assume we are booted from alternate media and the root file system is mounted under /a.

If the system was patched using the Traffic Light Patch (TLP) tool, TLP output is located in the following directory:



# ls /a/var/sadm/install_data/
PMGT:_TLP-Set_for_node_v4u-880c-muc07,_phase_GREEN,_snapshot_2008-10-28_log

This data is captured by patchanalysis_gather.txt. These files are standard text files containing patchadd output. One file is generated for every run of the TLP tool.

3. If the system was patched using Sun Update Connection - Enterprise (UCE) or Sun xVM Ops Center (xVMOC) 1.x or 2.x, verify that /a/var/opt/SUNWuce/agent exists, which confirms that UCE, xVMOC 1.x, or xVMOC 2.x was used.

If /a/var/opt/SUNWuce/agent does not exist, and you have verified that /a/var has been mounted correctly (assuming it is separate from the root file system), the system was not patched using any of these tools.

4. The following data is also captured by patchanalysis_gather.txt.

If /a/var/opt/SUNWuce/agent exists, run the following commands and note the output to identify which of the patch automation tools was used:

pkgparam -R /a -v SUNWucea VERSION

This output implies UCE is installed: VERSION='1.1.1-314'.

pkgparam -R /a -v SUNWscnconnmgt VERSION

This output implies that xVMOC 1.x is installed: VERSION='1.0.0'.

This output implies that xVMOC 2.x is installed: VERSION='2.0.0.820,REV=2009.01.26.07.57.17'.

It is useful to know which tool was used in order to help eliminate any potential issues in the tool or to reproduce the issue on another system to identify the underlying problem.
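The mapping from pkgparam output to tool can be sketched as a small shell function. This is an illustration only: the function name identify_patch_tool is hypothetical, and the rules simply mirror the three example outputs shown above (any SUNWucea version is treated as UCE, and the major version of SUNWscnconnmgt distinguishes xVMOC 1.x from 2.x).

```shell
# Hypothetical helper: classify the patch automation tool.
# $1 = package name, $2 = one line of pkgparam output, e.g. VERSION='1.0.0'
identify_patch_tool() {
    pkg=$1
    # Strip the VERSION='...' wrapper to get the bare version string.
    version=$(printf '%s\n' "$2" | sed -n "s/^VERSION='\(.*\)'\$/\1/p")
    case "$pkg:$version" in
        SUNWucea:?*)        echo "UCE" ;;
        SUNWscnconnmgt:1.*) echo "xVMOC 1.x" ;;
        SUNWscnconnmgt:2.*) echo "xVMOC 2.x" ;;
        *)                  echo "unknown" ;;
    esac
}

# Usage on the mounted system:
#   identify_patch_tool SUNWucea "$(pkgparam -R /a -v SUNWucea VERSION)"
```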

All the patch automation tools mentioned previously store their output in /a/var/opt/SUNWuce/agent/logs/:




# ls -l /a/var/opt/SUNWuce/agent/logs/
total 33032
-rw-r--r-- 1 root root  5610478 Feb 20 17:12 error.log
-rw-r--r-- 1 root root 10485681 Feb 20 13:15 error.log.ad_bak
-rw-r--r-- 1 root root    94418 Feb 20 13:28 job.log
-rw-r--r-- 1 root root    15478 Feb 19 11:02 job_50007101.tgz
-rw-r--r-- 1 root root    11377 Feb 20 13:04 job_50011801.tgz
-rw-r--r-- 1 root root    12189 Feb 20 13:15 job_50012001.tgz
-rw-r--r-- 1 root root    12185 Feb 20 13:28 job_50012002.tgz
-rw------- 1 root root   323598 Feb 19 10:53 last_nco_file.xml
-rw-r--r-- 1 root root    30730 Feb 20 13:28 last_seeking.tgz
-rw-r--r-- 1 root root    16602 Feb 20 13:28 nco.log
-rw-r--r-- 1 root root   253666 Feb 20 13:28 resolve.log
-rw-r--r-- 1 root root     1434 Feb 20 09:29 uce_agent.log

# gzcat /a/var/opt/SUNWuce/agent/logs/job_50012002.tgz | tar tvf -
drwxr-xr-x 0/0      0 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/
-rwx------ 0/0   4466 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/Task.out
-rw-r--r-- 0/0 167297 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/copy_inventory
-rw-r--r-- 0/0    497 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/copy_basket
-rw-r--r-- 0/0     17 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/copy_policy

The most important log file is Task.out. It contains the output of patchadd commands that were run. However, it is recommended that you copy all files in /a/var/opt/SUNWuce/agent/logs off the system for possible further examination. Also copy the output of ls -ltr of /a/var/opt/SUNWuce/agent/logs. This data is captured by patchanalysis_gather.txt.
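To read Task.out without unpacking each archive by hand, something like the following sketch could be used. The function name extract_job_logs and the destination directory are assumptions. The sketch uses gzip -dc, which is equivalent to gzcat, and assumes a POSIX-compatible shell.

```shell
# Hypothetical helper: unpack every job_*.tgz archive and list the Task.out files.
# $1 = logs directory (for example /a/var/opt/SUNWuce/agent/logs)
# $2 = scratch directory to unpack into
extract_job_logs() {
    logs_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for archive in "$logs_dir"/job_*.tgz; do
        # Skip the unexpanded pattern when no archives exist.
        [ -f "$archive" ] || continue
        gzip -dc "$archive" | (cd "$dest_dir" && tar xf -)
    done
    # Print the path of each extracted Task.out for review.
    find "$dest_dir" -name Task.out -print
}

# Usage:
#   extract_job_logs /a/var/opt/SUNWuce/agent/logs /tmp/uce_jobs
```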

Other Log Files That Should Be Examined

This section lists relevant data that is worth capturing, all of which is collected by the patchanalysis_gather.txt script.

patchadd -R /a -p
pkginfo -R /a -p
pkginfo -R /a
df -k /a and any other mountpoints under /a
/a/var/adm/messages*
/a/var/sadm/system/admin/CLUSTER
/a/var/sadm/install/contents (this file can be quite large)
/a/etc/system
/a/etc/vfstab
/a/etc/release
The directory contents from /a/var/sadm/system/logs/
The directory contents from /a/var/sadm/install_data/
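
If the script is not at hand, part of this collection can be approximated manually. The following is a rough sketch, not the patchanalysis_gather.txt script itself: the function name gather_files and the output layout are assumptions, and it copies only a few of the plain files listed above plus the df -k output.

```shell
# Hypothetical helper: copy selected files from a mounted root into an output directory.
# $1 = mounted root (for example /a), $2 = output directory
gather_files() {
    root=$1
    out=$2
    mkdir -p "$out" || return 1
    for rel in etc/system etc/vfstab var/sadm/system/admin/CLUSTER; do
        if [ -f "$root/$rel" ]; then
            # Recreate the relative directory structure under the output directory.
            mkdir -p "$out/$(dirname "$rel")"
            cp -p "$root/$rel" "$out/$rel"
        else
            echo "missing: $root/$rel" >&2
        fi
    done
    # Copy the current and rotated messages files.
    for msg in "$root"/var/adm/messages*; do
        [ -f "$msg" ] || continue
        mkdir -p "$out/var/adm"
        cp -p "$msg" "$out/var/adm/"
    done
    # Record file system usage for the mounted root.
    df -k "$root" > "$out/df-k.txt" 2>&1
}

# Usage:
#   gather_files /a /tmp/patch_analysis
```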

If there are non-global zones, the following might be useful:

/a/etc/zones/*

This data is captured by patchanalysis_gather.txt.

The following directory in every non-global zone contains the patch logs, and might contain useful data for analyzing any non-global zone issues. Note: Currently, the patchanalysis_gather.txt script does not collect this data, because this directory is unlikely to contain any data relevant to making a system unbootable.

<zonepath>/root/var/sadm/patch/*

Examining Output and Log Files

After the previous data has been gathered, it is advisable to start with an examination of the patchadd output. Then also examine the patchadd logs, which were gathered from /var/sadm/patch/*/log.

Look for errors and warnings in these logs. In particular, the patchadd output might have references to pkgadd failures, with a subsequent log file stored in /var/tmp.

If so, retrieve this log file from /a/var/tmp and examine it, because it is very relevant to determining what caused the problem.
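
A first pass over the per-patch logs can be sketched with grep. The function name scan_patch_logs and the search pattern are assumptions chosen for illustration; they will not catch every failure message. The sketch also assumes a grep that accepts -E (on Solaris, /usr/xpg4/bin/grep should, or egrep can be substituted).

```shell
# Hypothetical helper: list suspicious lines in the patch logs under a mounted root.
# $1 = mounted root (for example /a)
scan_patch_logs() {
    root=$1
    for log in "$root"/var/sadm/patch/*/log; do
        [ -f "$log" ] || continue
        # -i ignores case, -n prints line numbers; the extra /dev/null
        # argument makes grep prefix each match with the log file name.
        grep -i -n -E 'error|warning|fail' "$log" /dev/null
    done
    return 0
}

# Usage:
#   scan_patch_logs /a
```

Each match is printed as file:line:text, which makes it easy to go back to the full log for context.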

Also examine the patchadd output to detect if any patch-level scripts, such as prepatch or postpatch, failed or generated unexpected messages. Compare the patchadd output to known nominal patchadd output from a system where patch application succeeded, and look for any additional or omitted messages.

If there are non-global zones present on the affected system, examine the patchadd output looking for any errors that indicate issues particular to the presence of non-global zones. These might take the following form:

Failed to boot non-global zone <zone-name>

This message indicates that the non-global zone in question was halted, and patchadd was not able to move the affected zone into an internal state used for software maintenance. If this occurs, gather the /a/etc/zones/*.xml files along with /a/etc/vfstab. These files at least enable a support engineer to begin to determine the system state and zone configuration prior to patching.
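
One way to bundle these files for a support engineer is sketched below. The function name archive_zone_config and the archive path are assumptions, and the sketch expects at least one .xml file to exist under etc/zones.

```shell
# Hypothetical helper: archive the zone configuration files and vfstab.
# $1 = mounted root (for example /a), $2 = tar archive to create
archive_zone_config() {
    root=$1
    archive=$2
    # Change into the mounted root so the archive stores relative paths.
    (cd "$root" && tar cf - etc/vfstab etc/zones/*.xml) > "$archive"
}

# Usage:
#   archive_zone_config /a /tmp/zone_config.tar
```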

If the df -k output indicates that available space in the root file system or in /var reached 100% full, it is recommended that you contact Sun support, because depending on the patch that was being installed and what part of the patch installation failed, certain manual steps might be required to restore system consistency.
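
The check for full file systems can be sketched as a filter over df -k output. The function name full_filesystems is hypothetical, and the sketch assumes the usual df -k layout in which the capacity percentage is the second-to-last column and the mount point (containing no spaces) is the last.

```shell
# Hypothetical helper: read df -k output on standard input and print the
# mount point of every file system whose capacity is 100% or more.
full_filesystems() {
    awk 'NR > 1 {
        pct = $(NF - 1)          # capacity column, for example "100%"
        sub(/%/, "", pct)        # strip the percent sign
        if (pct + 0 >= 100) print $NF
    }'
}

# Usage:
#   df -k /a /a/var | full_filesystems
```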

For instance, if available space in /platform reached 100% during 137137-09 postpatch execution, then on reboot the system most likely will not boot beyond the OK prompt, and errors will indicate that boot load failed. In such cases, it is possible that the system can be rescued quite easily with no lasting damage, if sufficient space can be recovered to allow the boot archive to be rebuilt.

So, as you can see, it is vital to first understand the exact problem, because that can allow you to make a proper decision as to what course of action you need to take.

Possible Problems and Solutions

Issue: Cannot open /etc/path_to_inst.

To fix, run boot -ar and, when prompted to rebuild /etc/path_to_inst, choose yes.

Issue: Boot block problems occur (which are particular to Solaris 10 SPARC-based systems that have been patched to the Kernel Update patch 137137-09 level). Typically, these are identified by an error similar to one of the following:

The file just loaded does not appear to be executable.
Boot load failed. The file just loaded does not appear to be executable.

It is recommended that you contact Sun Support with the following information for further instructions. You can get $ROOTFSTYPE from df -n /a | awk '{print $3}' (if root is mounted on /a):


ls -l /platform/`uname -m`/boot_archive
ls -l /platform/`uname -m`/lib/fs/$ROOTFSTYPE/bootblk

As of the Solaris 10 10/08 release for SPARC platforms, or if Kernel Update patch 137137-09 is applied, a new bootblk is installed. This new bootblk uses a boot_archive to boot the system, as opposed to loading ufsboot, as was done in updates prior to the Solaris 10 10/08 release or when 137137-09 is not applied.

So it is vitally important to understand what bootblk is appropriate. Installing the wrong bootblk renders the system unbootable until the correct bootblk is installed. It is recommended that you get further instruction from Sun Support.

If the root file system runs out of space in /platform while building the boot_archive, this can lead to the following error:

The file just loaded does not appear to be executable.

Again, it is recommended that you contact Sun Support if this happens, because the system needs to be analyzed to determine the best long-term solution for freeing up space and building the boot_archive using the bootadm command.

It is important to note that when using tools such as installboot to install a boot block while booted from media, you must use installboot from the correct media. To install bootblk on a system patched to the 137137-09 level or on a system already running the Solaris 10 10/08 OS, you must be booted off the Solaris 10 10/08 or later media, or you must use installboot from the mounted system.

So, for example, if the system is booted from the network to a Solaris 10 5/08 image, you must run this:

# /a/usr/sbin/installboot

Not this:

# /usr/sbin/installboot

But if the system is booted from Solaris 10 10/08 or later media, it is OK to run the following:

# /usr/sbin/installboot

So, care must be taken when using system utilities while booted from media to make modifications to a mounted system. It is advised that you use the latest Solaris update image available at the time as the boot image.

For More Information

Here are some additional resources:

Sun download site
Sun training courses at https://www.oracle.com/sun/
Forums, such as Sun forums and the BigAdmin Discussions collection
Product documentation at https://docs.oracle.com/en/ and the Documentation Center
Support:
- Sun resources:
  - Services
- Community system administration experts
Events of interest to users of Sun products:
- Current Events

Updates

November 2009: Script was extended to collect additional data for detailed patch and package analysis.
