Analyzing a patchadd or patchrm Failure in the Solaris OS
Enda O'Connor, April 2009 (Updated November 2009)
Note: This article contains references to some items that apply only to the Solaris 10 Operating System, for example, a new bootblk. The rest of the article and the accompanying script work on the Solaris 8, 9, or 10 OS.
Introduction
It is important to gather sufficient information before starting a root-cause analysis to determine why a patchadd or patchrm session did not succeed. This document is intended to help users of the Solaris Operating System for SPARC or x86 platforms do that. The focus is on what to do if the system that was patched did not reboot properly, either by panicking in a loop or dropping to an OK prompt (in the case of the Solaris OS for SPARC platforms).
This article outlines what files are most relevant, where to locate these files, and also (depending on which patch automation tool, if any, was used to apply the patch) where to locate any relevant output from such tools. The document is not intended to help in analyzing the actual failure, because patching-related issues can have any number of different causes, but it does try to provide some generic pointers.
Booting the System From CD-ROM or Network
If the system was rebooted and failed to come back up to the required run level, or it failed to boot into maintenance mode (that is, if the system dropped to the OK prompt or is panicking in a loop), it is necessary to first boot the system from the network or CD-ROM and mount the relevant file systems to be able to access the system.
1. For SPARC systems only, if the system drops down to the OK prompt, as in the following, perform these substeps:
{1} ok
a. First, try to capture the relevant console output from when the system started to reboot up to the point where the OK prompt is displayed. Keep this output, because it might be very relevant to diagnosing the actual underlying failure.
b. At this point, try to identify whether the problem can be resolved without recourse to booting from the network or media, as opposed to booting from disk.
c. If you need to boot to single-user mode from media and mount the root file systems, there are a couple of options:
- Boot from network
- Boot from CD-ROM
To boot from the network, make sure your client is properly configured in the boot server and the network connections and configuration are correct (this is outside the scope of this document). Then run this command:
{1} ok boot net -s
To boot from CD-ROM, run this command:
{1} ok boot cdrom -s
2. If the system is panicking in continuous reboot:
a. First try to capture the full panic output from console.
b. Then, for a SPARC system, drop to the OK prompt and follow the instructions for Step 1 above.
For an x86 system, you need to make sure the BIOS boot priority allows the system to boot from either the network or a CD-ROM prior to booting from hard disk. If you are doing a network boot, make sure the client is properly configured in the boot server. For example, if you are using DHCP, ensure the client's network connections and configuration are correct, or if you are using NIS, ensure the client is set up correctly in the NIS server.
3. After the system has booted from CD-ROM or network, follow the instructions in the BigAdmin article How to Remove a Solaris OS Patch While Booted From a Network or CD-ROM to mount all relevant file systems that will be examined.
Gathering Various Data to Enable Root-Cause Analysis
At this stage, we will assume that all of the steps above have been completed and the root file system of the affected system has been mounted under /a.
Note: Most of the following data is gathered by the patchanalysis_gather.txt script, with the exception of the actual patchadd output to the terminal. Here's the source code for the patchanalysis_gather.txt script file.
1. Gather the patchadd or patchrm related log files.
If a patchadd session was done solely through the patchadd utility, then unless you captured the patchadd output to a log file, this data is not retrievable. It might be possible to simply cut and paste the patchadd output from a terminal or console, if the output is still available. Note that you want the actual output (STDOUT/STDERR) from the patchadd command itself, as opposed to the log files in /var/sadm/patch/ generated by patchadd.
It is strongly recommended that all patchadd output be redirected to a file during patching, so the output can be retrieved easily if it is required later for examination.
For example, the following command directs the output of patchadd to a log file:
patchadd <PatchID> 2>&1 | tee /opt/patchlogs/118833-36.$$
2. In the following examples, we will use /a as the prefix to all commands, because we assume we are booted from alternate media and the root file system is mounted under /a.
If the system was patched using the Traffic Light Patch (TLP) tool, TLP output is located in the following directory:
# ls /a/var/sadm/install_data/
PMGT:_TLP-Set_for_node_v4u-880c-muc07,_phase_GREEN,_snapshot_2008-10-28_log
This data is captured by patchanalysis_gather.txt. These files are standard text files containing patchadd output. One file is generated for every run of the TLP tool.
3. If the system was patched using Sun Update Connection - Enterprise (UCE) or Sun xVM Ops Center (xVMOC) 1.x or 2.x, verify that /a/var/opt/SUNWuce/agent exists, which confirms that UCE, xVMOC 1.x, or xVMOC 2.x was used.
If /a/var/opt/SUNWuce/agent does not exist, and you have verified that /a/var has been mounted correctly (assuming it is separate from the root file system), the system was not patched using any of these tools.
4. The following data is also captured by patchanalysis_gather.txt.
If /a/var/opt/SUNWuce/agent exists, run the following commands and note the output to identify which of the patch automation tools was used:
pkgparam -R /a -v SUNWucea VERSION
This output implies UCE is installed: VERSION='1.1.1-314'.
pkgparam -R /a -v SUNWscnconnmgt VERSION
This output implies that xVMOC 1.x is installed: VERSION='1.0.0'.
This output implies that xVMOC 2.x is installed: VERSION='2.0.0.820,REV=2009.01.26.07.57.17'.
It is useful to know which tool was used in order to help eliminate any potential issues in the tool or to reproduce the issue on another system to identify the underlying problem.
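The version checks above can be wrapped in a small helper. The following is only a sketch and is not part of patchanalysis_gather.txt: the function names classify_pkg_version and identify_tool are hypothetical, the mapping relies solely on the package names and version strings quoted above, and a POSIX shell (for example, ksh) is assumed.

```shell
# Sketch only. Assumes a POSIX shell (for example ksh), not the legacy /bin/sh.
# Function names are hypothetical and chosen for illustration.

# classify_pkg_version PKG VERSION_LINE
# Maps a package name and its "VERSION='...'" line to a tool name,
# using only the version patterns quoted in this article.
classify_pkg_version() {
    pkg=$1
    # Strip the VERSION= prefix and the surrounding quotes.
    ver=$(printf '%s\n' "$2" | sed -e "s/^VERSION=//" -e "s/'//g")
    case "$pkg:$ver" in
        SUNWucea:?*)        echo "UCE" ;;
        SUNWscnconnmgt:1.*) echo "xVMOC 1.x" ;;
        SUNWscnconnmgt:2.*) echo "xVMOC 2.x" ;;
        *)                  echo "unknown" ;;
    esac
}

# identify_tool [ROOT]
# ROOT is the alternate root where the system is mounted (default /a).
identify_tool() {
    root=${1:-/a}
    if [ ! -d "$root/var/opt/SUNWuce/agent" ]; then
        echo "no agent directory under $root/var/opt/SUNWuce"
        return 1
    fi
    for pkg in SUNWucea SUNWscnconnmgt; do
        # Skip packages that are not installed under ROOT.
        line=$(pkgparam -R "$root" -v "$pkg" VERSION 2>/dev/null) || continue
        if [ -n "$line" ]; then
            echo "$pkg: $(classify_pkg_version "$pkg" "$line")"
        fi
    done
    return 0
}
```

For example, running identify_tool /a after mounting the system prints one line per agent package found. Treat the result as a hint and keep the raw pkgparam output as well.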
All the patch automation tools mentioned previously store their output in /a/var/opt/SUNWuce/agent/logs/:
# ls -l /a/var/opt/SUNWuce/agent/logs/
total 33032
-rw-r--r-- 1 root root 5610478 Feb 20 17:12 error.log
-rw-r--r-- 1 root root 10485681 Feb 20 13:15 error.log.ad_bak
-rw-r--r-- 1 root root 94418 Feb 20 13:28 job.log
-rw-r--r-- 1 root root 15478 Feb 19 11:02 job_50007101.tgz
-rw-r--r-- 1 root root 11377 Feb 20 13:04 job_50011801.tgz
-rw-r--r-- 1 root root 12189 Feb 20 13:15 job_50012001.tgz
-rw-r--r-- 1 root root 12185 Feb 20 13:28 job_50012002.tgz
-rw------- 1 root root 323598 Feb 19 10:53 last_nco_file.xml
-rw-r--r-- 1 root root 30730 Feb 20 13:28 last_seeking.tgz
-rw-r--r-- 1 root root 16602 Feb 20 13:28 nco.log
-rw-r--r-- 1 root root 253666 Feb 20 13:28 resolve.log
-rw-r--r-- 1 root root 1434 Feb 20 09:29 uce_agent.log
# gzcat /a/var/opt/SUNWuce/agent/logs/job_50012002.tgz | tar tvf -
drwxr-xr-x 0/0 0 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/
-rwx------ 0/0 4466 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/Task.out
-rw-r--r-- 0/0 167297 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/copy_inventory
-rw-r--r-- 0/0 497 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/copy_basket
-rw-r--r-- 0/0 17 Feb 20 13:28 2009 va64-x4100a-muc07_job_50012002/copy_policy
The most important log file is Task.out. It contains the output of the patchadd commands that were run. However, it is recommended that you copy all files in /a/var/opt/SUNWuce/agent/logs off the system for possible further examination. Also copy the output of ls -ltr of /a/var/opt/SUNWuce/agent/logs. This data is captured by patchanalysis_gather.txt.
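Because Task.out is usually the first file to read, it can be convenient to pull it out of every job archive in one pass. The following is a minimal sketch, not part of patchanalysis_gather.txt: the function name extract_task_out and the output directory are hypothetical, a POSIX shell is assumed, and gzip -dc is used as an equivalent of gzcat.

```shell
# Sketch only. Assumes a POSIX shell (for example ksh), not the legacy /bin/sh.

# extract_task_out LOGDIR OUTDIR
# Extracts only the Task.out member of each job_*.tgz archive in LOGDIR
# into OUTDIR, keeping the per-job directory name from the archive.
extract_task_out() {
    logdir=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    for archive in "$logdir"/job_*.tgz; do
        [ -f "$archive" ] || continue
        # Find the archive member whose name ends in /Task.out.
        member=$(gzip -dc "$archive" | tar tf - | grep '/Task\.out$' | sed -n 1p)
        [ -n "$member" ] || continue
        # Decompress in the current directory, extract inside OUTDIR.
        gzip -dc "$archive" | (cd "$outdir" && tar xf - "$member")
    done
    return 0
}
```

A possible invocation is extract_task_out /a/var/opt/SUNWuce/agent/logs /tmp/taskout, where /tmp/taskout is an arbitrary scratch directory. The original archives are left untouched, so they can still be copied off the system in full.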
Other Log Files That Should Be Examined
This section lists relevant data that is worth capturing, all of which is collected by the patchanalysis_gather.txt script.
- patchadd -R /a -p
- pkginfo -R /a -p
- pkginfo -R /a
- df -k /a and any other mountpoints under /a
- /a/var/adm/messages*
- /a/var/sadm/system/admin/CLUSTER
- /a/var/sadm/install/contents (this file can be quite large)
- /a/etc/system
- /a/etc/vfstab
- /a/release
- The directory contents from /a/var/sadm/system/logs/
- The directory contents from /a/var/sadm/install_data/
If there are non-global zones, the following might be useful:
/a/etc/zones/*
This data is captured by patchanalysis_gather.txt.
The following directory in every non-global zone contains the patch logs, and might contain useful data for analyzing any non-global zone issues. Note: Currently, the patchanalysis_gather.txt script does not collect this data, because this directory is unlikely to contain any data relevant to making a system unbootable.
<zonepath>/root/var/sadm/patch/*
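If the zone patch logs are needed, the zone paths can be read from the zone index file under the alternate root. The following is a sketch under stated assumptions: the function name list_zone_patch_dirs is hypothetical, /a/etc/zones/index is assumed to use the colon-separated zonename:state:zonepath layout, the zone file systems are assumed to be mounted under the same alternate root, and a POSIX shell is assumed.

```shell
# Sketch only. Assumes a POSIX shell (for example ksh), not the legacy /bin/sh.

# list_zone_patch_dirs [ROOT]
# Prints the patch log directory of each non-global zone listed in
# ROOT/etc/zones/index (assumed format: zonename:state:zonepath[:uuid]).
list_zone_patch_dirs() {
    root=${1:-/a}
    index=$root/etc/zones/index
    [ -r "$index" ] || return 1
    # Skip comment lines and the global zone; field 3 is the zonepath.
    awk -F: '!/^#/ && $1 != "global" && $3 != "" { print $3 }' "$index" |
    while read -r zonepath; do
        echo "$root$zonepath/root/var/sadm/patch"
    done
}
```

Each printed directory can then be listed or archived by hand. A directory that does not exist usually means the zone's file system is not mounted under the alternate root.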
Examining Output and Log Files
After the previous data has been gathered, it is advisable to start with an examination of the patchadd output. Then also examine the patchadd logs, which were gathered from /var/sadm/patch/*/log.
Look for errors and warnings in these logs. In particular, the patchadd output might have references to pkgadd failures, with a subsequent log file stored in /var/tmp.
If so, retrieve this log file from /a/var/tmp and examine it, because it is very relevant to determining what caused the problem.
Also examine the patchadd output to detect if any patch-level scripts, such as prepatch or postpatch, failed or generated unexpected messages. Compare the patchadd output to known nominal patchadd output from a system where patch application succeeded, and look for any additional or omitted messages.
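A first pass over the output can be automated before reading it in full. The following is a minimal sketch: the function names scan_patch_log and compare_to_nominal are hypothetical, the search terms are an assumption rather than a complete list of patchadd messages, and a POSIX shell with a grep that supports -E is assumed.

```shell
# Sketch only. Assumes a POSIX shell and a POSIX grep (-E supported).

# scan_patch_log FILE
# Prints, with line numbers, every line that mentions an error, a warning,
# or a failure. The pattern is a starting point, not an exhaustive list.
scan_patch_log() {
    grep -n -i -E 'error|warning|fail' "$1"
}

# compare_to_nominal NOMINAL ACTUAL
# Shows lines that differ between output from a successful run (NOMINAL)
# and the output under investigation (ACTUAL).
compare_to_nominal() {
    diff "$1" "$2"
}
```

Differences reported by compare_to_nominal still need to be read by hand, because host names, dates, and patch revisions differ even between two successful runs.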
If there are non-global zones present on the affected system, examine the patchadd output looking for any errors that indicate issues particular to the presence of non-global zones. These might take the following form:
Failed to boot non-global zone <zone-name>
This message indicates that the non-global zone in question was halted, and patchadd was not able to move the affected zone into an internal state used for software maintenance. If this occurs, gather the /a/etc/zones/*.xml files along with /a/etc/vfstab. These files at least enable a support engineer to begin to determine the system state and zone configuration prior to patching.
If the df -k output indicates that available space in the root file system or in /var reached 100% full, it is recommended that you contact Sun support, because depending on the patch that was being installed and what part of the patch installation failed, certain manual steps might be required to restore system consistency.
For instance, if available space in /platform reached 100% during 137137-09 post-patch execution, then on reboot the system most likely will not boot beyond the OK prompt, and errors will indicate that boot load failed. In such cases, it is possible that the system can be rescued quite easily with no lasting damage, if sufficient space can be recovered to allow the boot archive to be rebuilt.
So, as you can see, it is vital to first understand the exact problem, because that can allow you to make a proper decision as to what course of action you need to take.
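The disk-space check described above can be scripted against saved or live df -k output. The following is a sketch only: the function name full_filesystems is hypothetical, the six-column df -k layout with the capacity in the fifth column is assumed, and a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris) is assumed.

```shell
# Sketch only. Assumes a POSIX shell and a POSIX awk (nawk on Solaris).

# full_filesystems [THRESHOLD]
# Reads `df -k` output on standard input and prints the mount point and
# capacity of every file system at or above THRESHOLD percent (default 100).
# Lines with fewer than six fields (for example wrapped device names)
# are skipped rather than guessed at.
full_filesystems() {
    awk -v limit="${1:-100}" 'NR > 1 && NF >= 6 {
        cap = $5
        sub(/%/, "", cap)
        if (cap + 0 >= limit + 0) print $6, $5
    }'
}
```

For example, df -k /a /a/var | full_filesystems 95 lists any of those file systems that are at least 95% full. Skipped lines mean the raw df -k output should still be reviewed.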
Possible Problems and Solutions
Issue: Cannot open /etc/path_to_inst.
To fix, run boot -ar, and when prompted to rebuild /etc/path_to_inst, choose yes.
Issue: Boot block problems occur (which are particular to Solaris 10 SPARC-based systems that have been patched to the Kernel Update patch 137137-09 level). Typically, these are identified by an error similar to one of the following:
The file just loaded does not appear to be executable.
Boot load failed. The file just loaded does not appear to be executable.
It is recommended that you contact Sun Support with the following information for further instructions. You can get $ROOTFSTYPE from df -n /a | awk '{print $3}' (if root is mounted on /a):
ls -l /platform/`uname -m`/boot_archive
ls -l /platform/`uname -m`/lib/fs/$ROOTFSTYPE/bootblk
As of the Solaris 10 10/08 release for SPARC platforms, or if Kernel Update patch 137137-09 is applied, a new bootblk is installed. This new bootblk uses a boot_archive to boot the system, as opposed to loading ufsboot, as was done in updates prior to the Solaris 10 10/08 release or when 137137-09 is not applied.
So it is vitally important to understand which bootblk is appropriate. Installing the wrong bootblk renders the system unbootable until the correct bootblk is installed. It is recommended that you get further instruction from Sun Support.
If the root file system runs out of space in /platform while building the boot_archive, this can lead to the following error:
The file just loaded does not appear to be executable.
Again, it is recommended that you contact Sun Support if this happens, because the system needs to be analyzed to determine the best long-term solution for freeing up space and building the boot_archive using the bootadm command.
It is important to note that when using tools such as installboot when booting from media to install a boot block, you must use installboot from the correct media. To install bootblk on a system patched to the 137137-09 level or on a system already running the Solaris 10 10/08 OS, you must be booted off the Solaris 10 10/08 or later media, or you must use installboot from the mounted system.
So, for example, if the system is booted from the network to a Solaris 10 5/08 image, you must run this:
# /a/usr/sbin/installboot
Not this:
# /usr/sbin/installboot
But if the system is booted from Solaris 10 10/08 or later media, it is OK to run the following:
# /usr/sbin/installboot
So, care must be taken when using system utilities while booted from media to make modifications to a mounted system. It is advised that you use the latest Solaris update image available at the time as the boot image.
For More Information
Here are some additional resources:
- Sun download site
- Sun training courses at https://www.oracle.com/sun/
- Forums, such as Sun forums and the BigAdmin Discussions collection
- Product documentation at https://docs.oracle.com/en/ and the Documentation Center
November 2009: Script was extended to collect additional data for detailed patch and package analysis.