How to Use Hardware Fault Management in Oracle Linux

by Robert Chase

Understanding, installing, enabling, and using IPMI and MCE—two hardware fault management tools in Oracle Linux.


Published September 2013


In this article, we will focus on two hardware fault management features and the corresponding tools used with these features in Oracle Linux:

  • Intelligent Platform Management Interface (IPMI) and the ipmitool tool
  • Machine Check Exceptions (MCEs) and the mcelog, mce-inject, and mce-test tools

The following sections provide an overview of the technology, describe common use cases, and provide instructions for installing and configuring the tools in Oracle Linux. Several examples show how the tools can be used to capture and report important hardware information you can use in your daily operations.

Note: These tools include a number of tuning parameters and options, which are not covered in this article. Refer to the resources in the "See Also" section for more information.

About Hardware Fault Management

The modern data center is agile and constantly evolving. It is tasked with driving business objectives and keeping mission-critical workloads available, and it comprises various hardware and software solutions that can be complex to manage efficiently. In an effort to control risk and meet demanding service-level commitments, features have been developed for hardware and software to help system administrators monitor the health of systems and identify issues early.

These features, called fault management, encompass a variety of solutions and standards aimed at providing the tools for monitoring, administering, identifying, and resolving a number of issues that plague system administrators. When combined with data center best practices, such as redundancy and high availability, hardware fault management features provide powerful tools for driving efficiency, raising awareness, mitigating risk, and supporting the demanding objectives placed on systems in the data center.

Using IPMI and ipmitool

IPMI is a specification that was first developed in 1998 by Intel, Dell, HP, and NEC. Its primary purpose is to provide a common command interface for accessing information from the system. It was developed to be management software–neutral; however, it is often used with system management features.

IPMI operates independently from the operating system, which means you can access the system "out of band" or before an operating system is booted. This can be useful in the event of an OS or system failure, since it provides tools you can use to gather critical information when traditional system management features are unavailable.

Some of the predefined commands and interfaces in IPMI are used for reading temperatures, voltage, fan speed, power, and network settings. The IPMI specification was developed to be expandable, though. So vendors can customize and create additional commands and sensors. For example, Oracle Integrated Lights Out Manager (Oracle ILOM) is compliant with both IPMI v.1.5 and 2.0. HP's Integrated Lights-Out (iLO) and Dell's DRAC (Dell Remote Access Controller) are other examples of solutions that integrate with or are compliant with IPMI. Each solution offers a set of features for out-of-band support. This is what the specification intended: a method for providing universal, cross-platform support while allowing a path for vendors to differentiate their individual solutions.

In Oracle Linux, the ipmitool utility is used to manage and configure devices that support the IPMI specification. IPMI support has been part of the Linux kernel since version 2.4. The ipmitool utility provides functions for managing field replaceable units (FRUs), LAN configuration, sensor readings, and remote chassis power control. The next section discusses installation and usage scenarios using features found in ipmitool.

Installation

The first step is to install ipmitool on the system. IPMI features are found in systems that support the IPMI specification. These systems will have a baseboard management controller (BMC), which is the intelligence of the IPMI architecture. Using OpenIPMI and ipmitool, you can interface directly with the BMC and interact with features implemented by the IPMI specification.

In order to access the IPMI features for a server, the local workstation or management machine needs to be on a network that can access the systems with the BMC, and it must have the OpenIPMI and ipmitool tools installed. To install these tools, go to the server console and type the following command:

yum install ipmitool.x86_64 OpenIPMI.x86_64

Then use the following commands to set up ipmitool for use on your system and to start the service. When you start the service, it loads the IPMI kernel modules and creates a /dev/ipmi0 device.

chkconfig ipmi on
service ipmi start

You can also install the ipmitool and OpenIPMI packages on other IPMI systems that have a BMC, which provides options for configuring IPMI settings on the BMC, as we will see in the following examples.

Once the tools are installed, configured, and running, we can interact with the features available for controlling and monitoring the system. Let's evaluate the following IPMI use cases utilizing ipmitool and Oracle Linux.

Remote System Access

One feature of IPMI is the ability to directly interface with a system over a network. This action is independent of any operating system installed on the target system and provides a useful option for administration. It provides you with a direct connection to the server's IPMI interface, allowing you to execute IPMI commands remotely. In fact, you can write scripts using this option, enabling you to control any number of servers from a single management machine.

To enable this feature, there are several steps you must first complete, such as setting up passwords and adding an IP address for the IPMI interface on the server that has the BMC. It is important to note that many servers have a separate Ethernet port for remote management. Check your hardware documentation for further information about your specific server's remote management.

The first step to access IPMI over the network is to set up a valid IP address on the system that has the BMC. The following example outlines how this step is accomplished using ipmitool. (Note: This example uses a Sun Fire X4170 M2 server from Oracle.) To configure an IP address using ipmitool, use the following commands at the server console:

ipmitool lan set 1 ipaddr 192.168.1.120
ipmitool lan set 1 netmask 255.255.255.0
ipmitool lan set 1 defgw ipaddr 192.168.1.1

Once the IP address is set on the IPMI interface, you need a method for authentication. In the following example, we change the password of the root user to PASSW0rd so that we can log in remotely.

Caution: A simple password like this one is not recommended and is used here only as an example. A secure password is strongly recommended.

First, we need to list the users to get the ID number, and then we will change the password using the ID number.

[root@test1 ~]# ipmitool user list 1
ID  Name     Callin  Link Auth  IPMI Msg   Channel Priv Limit
1            false   false      true       NO ACCESS
2   root     false   false      true       ADMINISTRATOR

[root@test1 ~]# ipmitool user set password 2 PASSW0rd

Once you complete these setup steps, you can test the setup by remotely sending a chassis status IPMI request to the server. You will be prompted to provide the password for the account with which you are connecting. If everything is set up correctly, the chassis status will appear on the local command line. From the command line of your management system, type the command shown in Listing 1:

[root@mgmt-vm ~]# ipmitool -I lan -H 192.168.1.120 -U root -a chassis status
Password:
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : true
Power Control Fault  : false
Power Restore Policy : always-on
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false

Listing 1

The command syntax for ipmitool uses the options -I to select the interface (here, lan), -H to identify the host address of the remote system (192.168.1.120), -U for the username (in this example, root), and -a to prompt for the remote user's password. These options are followed by the command for the status you would like to check: chassis status.

There are other useful things we can do remotely using ipmitool. Any of the ipmitool commands shown in this article can be run remotely using the -I, -H, -U, and -a options. Adding multiple servers to a shell script would allow us to power off an entire rack or data center full of equipment quickly and easily from a single management system or workstation.

Here is an example of remotely power cycling a server (the chassis power off command switches the power off instead):

[root@rchase-oracle-linux-vm ~]# ipmitool -I lan -H 192.168.1.120 -U root -a chassis power cycle
Password:
Chassis Power Control: Cycle
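
The multi-server scripting idea mentioned above can be sketched as a small shell function. This is a minimal sketch rather than a hardened tool: the function name ipmi_each, the hosts-file format (one BMC address per line, with # starting a comment line), and the IPMITOOL override for dry runs are all assumptions made for illustration. Because prompting with -a for every host is impractical in a loop, the sketch uses the ipmitool -E option, which reads the password from the IPMI_PASSWORD environment variable.

```shell
#!/bin/sh
# Sketch: run one ipmitool command against every BMC listed in a file.
# Assumptions (not from the article): the function name, the hosts-file
# format, and the IPMITOOL override used for dry runs.

IPMITOOL=${IPMITOOL:-ipmitool}

# usage: ipmi_each <hosts-file> <user> <ipmitool arguments...>
ipmi_each() {
    hosts_file=$1
    user=$2
    shift 2
    # File descriptor 3 keeps the host list separate from standard input
    while IFS= read -r host <&3; do
        case $host in
            ''|'#'*) continue ;;   # skip blank lines and comment lines
        esac
        printf '== %s ==\n' "$host"
        # -E takes the password from the IPMI_PASSWORD environment variable
        "$IPMITOOL" -I lan -H "$host" -U "$user" -E "$@" ||
            printf 'request to %s failed\n' "$host" >&2
    done 3< "$hosts_file"
}

# Example (commented out so that sourcing this file sends nothing):
# ipmi_each rack1.txt root chassis status
```

Setting IPMITOOL=echo before calling the function prints the commands that would be sent instead of sending them, which is a cheap way to check a hosts file before touching real hardware.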

We can also look at specific data about the server hardware in real time. For example, using the ipmitool sdr command, you can view the voltage and temperature of server components. In the example in Listing 2, we are looking at a Sun Fire X4170 M2 server's output. Each hardware vendor might provide the data in a slightly different format but should provide similar fields. Some of the output in this example has been truncated.

[root@test1 ~]# ipmitool sdr
sys.id           | 0x02              | ok
sys.intsw        | 0x00              | ok
sys.psfail       | 0x01              | ok
sys.tempfail     | 0x01              | ok
sys.fanfail      | 0x01              | ok
mb.t_amb         | 28 degrees C      | ok
mb.v_bat         | 2.78 Volts        | ok
mb.v_+3v3stby    | 3.25 Volts        | ok
mb.v_+3v3        | 3.29 Volts        | ok
mb.v_+5v         | 4.99 Volts        | ok
mb.v_+12v        | 12.22 Volts       | ok
mb.v_-12v        | -12.20 Volts      | ok
mb.v_+2v5core    | 2.56 Volts        | ok
mb.v_+1v8core    | 1.82 Volts        | ok
mb.v_+1v2core    | 1.22 Volts        | ok
fp.t_amb         | 21 degrees C      | ok
pdb.t_amb        | 21 degrees C      | ok
io.t_amb         | 19 degrees C      | ok

Listing 2

Readings such as the temperature of a system reflect the environmental conditions inside the data center and the cooling performance of the server. The data found in this output can help you identify issues before they become larger problems. For example, voltage readings might be useful for detecting a pending power supply failure, or they can be used to monitor power levels if the servers are operated off a DC power source. Fan RPMs might reveal a fan that's about to fail or a potential hot spot inside the server that could be corroborated with a high-temperature reading. Let's look at a few more examples where IPMI can provide critical system status data.
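
One way to act on sensor output like Listing 2 is to filter it for readings whose status column is anything other than ok. The function below is a sketch under that assumption: the name sdr_not_ok is hypothetical, and it relies on the three-column, pipe-separated layout shown in Listing 2, which other vendors might format differently.

```shell
#!/bin/sh
# Sketch: print every sensor whose status column is not "ok".
# Assumes the "name | reading | status" layout shown in Listing 2.

# usage: ipmitool sdr | sdr_not_ok
sdr_not_ok() {
    awk -F'|' '
        NF >= 3 {
            name = $1; reading = $2; status = $3
            # Trim the padding that ipmitool adds around each column
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", reading)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status != "ok")
                printf "%s: %s (%s)\n", name, reading, status
        }
    '
}
```

An empty result means every listed sensor reported ok. Sensors that are simply not populated on a given system would also be printed, so treat the output as a list to review rather than a list of confirmed faults.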

System Status Features

The ipmitool chassis status command captures generic information about a server and what its current state is. In the example in Listing 3, we can see that the system is powered on and the power policy has been set to always-off. We also see there is a power supply that has either faulted or has a loose cable, as indicated by Main Power Fault being true. The test system has a second power supply that has a disconnected cable.

[root@test1 ~]# ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : true
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false

Listing 3

The ipmitool utility also allows you to read and write configuration data to the system. For example, if we wanted to change the configuration of the power policy on the test server used in Listing 3, we could do this using ipmitool.

The following command shows how to change the power policy configuration on the test server. Remember from Listing 3 that the Power Restore Policy was set to always-off. The following command will change this policy to always-on. If the power to the server is interrupted, this change to the policy will result in the system restarting itself when power is restored.

[root@test1 ~]# ipmitool chassis policy always-on 

As shown in Listing 4, we can check the chassis status again and verify that the configuration has been changed to a power restore policy of always-on.

[root@test1 ~]# ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : true
Power Control Fault  : false
Power Restore Policy : always-on
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false

Listing 4
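
The manual check in Listing 4 can also be scripted. The following sketch pulls a single field out of chassis status output; the function name chassis_field is hypothetical, and it assumes the "name : value" layout shown in Listings 3 and 4.

```shell
#!/bin/sh
# Sketch: extract one field from "ipmitool chassis status" output.
# Assumes the "name : value" layout shown in Listings 3 and 4.

# usage: ipmitool chassis status | chassis_field 'Power Restore Policy'
chassis_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")          # split at the first colon only
            if (pos == 0) next
            name  = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) print value
        }
    '
}
```

A script could compare the result for Power Restore Policy with always-on, or check whether Main Power Fault is true, and raise an alert when the value is not the expected one.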

We can also view the system event log (SEL), which contains the hardware logs for the server, as shown in Listing 5. This is useful for tracking hardware events and comparing them against information found in the standard system logs.

[root@test1 ~]# ipmitool sel list
3501 | 09/29/2007 | 01:48:21 | System Firmware Progress | System boot initiated | Asserted
3601 | 09/29/2007 | 02:05:57 | System Firmware Progress | Motherboard initialization | Asserted
3701 | 09/29/2007 | 02:05:58 | System Firmware Progress | Video initialization | Asserted
3801 | 09/29/2007 | 02:06:08 | System Firmware Progress | USB resource configuration | Asserted
3901 | 09/29/2007 | 02:06:23 | System Firmware Progress | Option ROM initialization | Asserted
...

Listing 5

Listing 6 shows the SEL from a server that has a number of major issues. It shows that memory, fans, and the CPU have predictive failures. A predictive failure is a notification from the IPMI sensors that there is a potential hardware issue.

ca6 | 02/28/2013 | 06:13:31 | Memory #0x39 | Predictive Failure Asserted
ada6 | 02/28/2013 | 06:13:32 | Memory #0x38 | Predictive Failure Asserted
aea6 | 02/28/2013 | 06:13:32 | Fan #0x3d | Predictive Failure Asserted
afa6 | 02/28/2013 | 06:13:33 | Memory #0x2e | Predictive Failure Asserted
b0a6 | 02/28/2013 | 06:13:34 | Fan #0x3b | Predictive Failure Asserted
b1a6 | 02/28/2013 | 06:13:36 | Memory #0x30 | Predictive Failure Asserted
b2a6 | 02/28/2013 | 06:13:37 | Fan #0x3e | Predictive Failure Asserted
b3a6 | 02/28/2013 | 06:13:38 | Memory #0x2f | Predictive Failure Asserted
b4a6 | 02/28/2013 | 06:13:39 | Memory #0x37 | Predictive Failure Asserted
b5a6 | 02/28/2013 | 06:13:40 | Processor #0x36 | Predictive Failure Asserted
b6a6 | 02/28/2013 | 06:13:40 | Fan #0x40 | Predictive Failure Asserted
b7a6 | 02/28/2013 | 06:13:43 | Memory #0x38 | Predictive Failure Asserted
b8a6 | 02/28/2013 | 06:13:43 | Fan #0x3d | Predictive Failure Asserted
b9a6 | 02/28/2013 | 06:13:44 | Memory #0x2e | Predictive Failure Asserted
baa6 | 02/28/2013 | 06:13:44 | Fan #0x3c | Predictive Failure Asserted
bba6 | 02/28/2013 | 06:13:45 | Processor #0x2d | Predictive Failure Asserted

Listing 6
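
With a log as long as the one in Listing 6, a per-component count is easier to read than the raw entries. The sketch below tallies predictive failures by sensor type; the function name sel_predictive_summary is hypothetical, and it assumes the pipe-separated layout of Listings 5 and 6, where the fourth column holds the sensor.

```shell
#!/bin/sh
# Sketch: count "Predictive Failure" SEL entries per sensor type.
# Assumes the column layout of Listings 5 and 6 (sensor in column 4).

# usage: ipmitool sel list | sel_predictive_summary
sel_predictive_summary() {
    awk -F'|' '
        /Predictive Failure/ {
            sensor = $4
            gsub(/^[ \t]+|[ \t]+$/, "", sensor)
            # Drop the sensor number so "Memory #0x39" counts as "Memory"
            sub(/ #0x[0-9a-fA-F]+$/, "", sensor)
            count[sensor]++
        }
        END {
            for (type in count)
                printf "%s %d\n", type, count[type]
        }
    ' | sort
}
```

The output is one line per sensor type with its count, sorted by name, which makes it easy to see whether the failures cluster on one kind of component.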

The ipmitool utility also shows the size of the SEL on a system. In Listing 7, we see that our system event log is 99 percent full. On some hardware, this will cause an error light to come on to bring attention to the server.

[root@test1 ~]# ipmitool sel
SEL Information
Version          : 2.0 (v1.5, v2 compliant)
Entries          : 909
Free Space       : 72 bytes
Percent Used     : 99%
Last Add Time    : 02/26/2013 00:59:11
Last Del Time    : 02/26/2013 00:59:11
Overflow         : true
Supported Cmds   : 'Reserve' 'Get Alloc Info'
# of Alloc Units : 913
Alloc Unit Size  : 18
# Free Units     : 4
Largest Free Blk : 4
Max Record Size  : 0

Listing 7

Some servers are unable to write additional data to the SEL once it is full, while others might begin to remove the oldest entries. With ipmitool, you can clear the system event log on the server using the following command:

[root@test1 ~]# ipmitool sel clear
Clearing SEL.  Please allow a few seconds to erase.

You can then verify that the log has been cleared by executing the ipmitool sel command again, as shown in Listing 8:

[root@test1 ~]# ipmitool sel
SEL Information
Version          : 2.0 (v1.5, v2 compliant)
Entries          : 0
Free Space       : 16434 bytes
Percent Used     : 0%
Last Add Time    : 02/26/2013 01:07:17
Last Del Time    : 02/26/2013 01:09:01
Overflow         : false
Supported Cmds   : 'Reserve' 'Get Alloc Info'
# of Alloc Units : 913
Alloc Unit Size  : 18
# Free Units     : 913
Largest Free Blk : 913
Max Record Size  : 3

Listing 8

Notice that the log is now 0 percent used. Depending on the hardware, an error light that had been lit on the front of the server might now be off.
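
The check in Listings 7 and 8 lends itself to automation. The sketch below reads the Percent Used value from ipmitool sel output and compares it with a threshold; the function names and the default threshold of 90 percent are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: read "Percent Used" from "ipmitool sel" output and test it
# against a threshold. Function names and the default of 90 are assumptions.

# usage: ipmitool sel | sel_percent_used
sel_percent_used() {
    awk -F':' '/^Percent Used/ { gsub(/[ \t%]/, "", $2); print $2 }'
}

# usage: sel_needs_clearing <percent> [threshold]
# Succeeds (exit status 0) when the log is at or above the threshold.
sel_needs_clearing() {
    [ "$1" -ge "${2:-90}" ]
}

# Example (commented out so that sourcing this file changes nothing):
# pct=$(ipmitool sel | sel_percent_used)
# if sel_needs_clearing "$pct"; then
#     ipmitool sel save /var/tmp/sel-backup.txt && ipmitool sel clear
# fi
```

Saving the entries with ipmitool sel save before clearing keeps a copy of the hardware history; the backup path in the example is only a placeholder.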

Hardware Audit Features

Using ipmitool, it is possible to pull the part numbers from the system using the fru option. FRU stands for field-replaceable unit, which is a component of a server that can be replaced on site. The FRU data includes the part number and serial number, which are needed in order to replace the part in the event of a hardware failure. This feature is useful for gathering information that is required to receive warranty service on hardware systems in the data center. Listing 9 shows an example of the output of the ipmitool fru command. Some of the output has been truncated.

[root@test1 init.d]# ipmitool fru
FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg Date        : Sun Dec 31 18:00:00 1995
 Board Product         : ASSY,SERV PROCESSOR,G1/2
 Board Serial          : 1762TH1-0617000504
 Board Part Number     : 501-6979-03
 Board Extra           : 50
 Board Extra           : G1/2_GRASP
 Product Manufacturer  : SUN MICROSYSTEMS
 Product Name          : ILOM

FRU Device Description : sp.net0.fru (ID 2)
 Product Manufacturer  : MOTOROLA
 Product Name          : FAST ETHERNET CONTROLLER
 Product Part Number   : MPC8248 FCC
 Product Serial        : 00:14:4F:26:E4:C4
 Product Extra         : 01
 Product Extra         : 00:14:4F:26:E4:C4

FRU Device Description : mb.fru (ID 4)
 Chassis Type          : Rack Mount Chassis
 Chassis Part Number   : 541-0250-04
 Chassis Serial        : 0226-0615LHF0DED
 Board Mfg Date        : Sun Dec 31 18:00:00 1995
 Board Product         : ASSY,MOTHERBOARD,A64
 Board Serial          : 1762TH1-0618001211
 Board Part Number     : 501-7644-01
 Board Extra           : 01
 Board Extra           : A64_MB
 Product Manufacturer  : SUN MICROSYSTEMS
 Product Name          : SUN FIRE X4170 SERVER
 Product Part Number   : 602-3222-01
 Product Serial        : 0624AN1527

Listing 9
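
For a warranty request, usually only the part numbers and serial numbers are needed. The following sketch reduces output like Listing 9 to those fields; the function name fru_parts is hypothetical, and it assumes each device starts with a "FRU Device Description" line followed by "name : value" lines, as in the listing.

```shell
#!/bin/sh
# Sketch: list part numbers and serial numbers per FRU device.
# Assumes the layout of Listing 9.

# usage: ipmitool fru | fru_parts
fru_parts() {
    awk -F' : ' '
        /^FRU Device Description/ {
            device = $2
            gsub(/[ \t]+$/, "", device)
            next
        }
        /Part Number|Serial/ {
            key = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/[ \t]+$/, "", value)
            printf "%s | %s | %s\n", device, key, value
        }
    '
}
```

Each output line names the device, the field, and the value, so the result can be pasted into a service request or collected from many servers into one inventory file.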

The ipmitool utility supports activating locator LEDs for onsite service personnel. In order to use this option, you must determine what LEDs a server is equipped with. You can accomplish this using the command shown in Listing 10:

[root@test1 ~]# ipmitool sdr list generic
sys.psfail.led   | Generic @20:18.3  | ok
sys.tempfail.led | Generic @20:18.4  | ok 
sys.fanfail.led  | Generic @20:18.5  | ok
sys.power.led    | Generic @20:00.0  | ok
sys.locate.led   | Generic @20:00.0  | ok
sys.alert.led    | Generic @20:00.0  | ok
bp.power.led     | Generic @20:2D.0  | ok
bp.locate.led    | Generic @20:2D.1  | ok
bp.alert.led     | Generic @20:2D.2  | ok
fp.power.led     | Generic @20:18.0  | ok
fp.locate.led    | Generic @20:18.1  | ok
fp.alert.led     | Generic @20:18.2  | ok
io.hdd0.led      | Generic @20:1A.0  | ok
io.hdd1.led      | Generic @20:1A.1  | ok
io.hdd2.led      | Generic @20:1A.2  | ok
io.hdd3.led      | Generic @20:1A.3  | ok
p0.led           | Generic @20:2D.6  | ok
p0.d0.led        | Generic @20:1C.0  | ok
p0.d1.led        | Generic @20:1C.1  | ok
p0.d2.led        | Generic @20:1C.2  | ok
p0.d3.led        | Generic @20:1C.3  | ok
p1.led           | Generic @20:2D.7  | ok
p1.d0.led        | Generic @20:1C.4  | ok
p1.d1.led        | Generic @20:1C.5  | ok
p1.d2.led        | Generic @20:1C.6  | ok
p1.d3.led        | Generic @20:1C.7  | ok
ft0.fm0.led      | Generic @20:18.7  | ok
ft0.fm1.led      | Generic @20:19.1  | ok
ft0.fm2.led      | Generic @20:19.2  | ok
ft1.fm0.led      | Generic @20:19.3  | ok
ft1.fm1.led      | Generic @20:19.4  | ok
ft1.fm2.led      | Generic @20:19.5  | ok

Listing 10

You can see from the output in Listing 10 that there is an LED option sys.locate.led for this hardware. Using the following command, we can turn the locator LED on:

[root@test1 ~]# ipmitool sunoem sbled set sys.locate.led fast
fp.locate.led    | FAST
bp.locate.led    | FAST

Once the command has been issued, the LED on the front of the server will turn on immediately. The LED can be disabled using the following command. Again, once the command has been executed, the light will turn off.

[root@test1 ~]# ipmitool sunoem sbled set sys.locate.led off
fp.locate.led    | OFF
bp.locate.led    | OFF

Caution: The following information is for development and testing systems and not intended for ANY production system.

Using ipmitool, we can also inject artificial hardware events on the system for testing purposes. The following command will generate a false temperature warning on the server hardware and store the warning in the system event log. Keep in mind this will appear as an actual hardware problem and should be cleared after testing is complete.

[root@test1 ~]# ipmitool event 1
Sending SAMPLE event: Temperature - Upper Critical - Going High
   0 | Pre-Init Time-stamp   | Temperature #0x30 | Upper Critical going high

Once the command is issued, we can look at the ipmitool sel list output to see the error saved in the server's system event log.

35aa | 04/12/2013 | 22:11:44 | Temperature #0x30 | Upper Critical going high

We can generate many types of artificial hardware troubles. The command ipmitool event will output the usage information shown in Listing 11.

usage: event <num>
   Send generic test events
   1 : Temperature - Upper Critical - Going High
   2 : Voltage Threshold - Lower Critical - Going Low
   3 : Memory - Correctable ECC

usage: event file <filename>
   Read and generate events from file
   Use the 'sel save' command to generate from SEL

usage: event <sensorid> <state> [event_dir]
   sensorid  : Sensor ID string to use for event data
   state     : Sensor state, use 'list' to see possible states for sensor
   event_dir : assert, deassert [default=assert]

Listing 11

More information about ipmitool can be found in the "See Also" section of this article.

Using mcelog, mce-inject, and mce-test for Detecting Machine Check Errors

Hardware errors that the CPU detects and reports, whether correctable or uncorrectable, are known as Machine Check Exceptions (MCEs). The CPU itself has the ability to correct errors and notify the underlying operating system regarding issues with the CPU or cache. The CPU also has the ability to recover by itself from some errors. Oracle Linux can use mcelog as a logging subsystem for machine checks. To get started, install the package on the server and start its service using the following commands.

yum install mcelog.x86_64
service mcelogd start
chkconfig mcelogd on

The mcelog package works in two different ways depending on the version of Oracle Linux you are using. On Oracle Linux 6.0 and higher, it is controlled by a daemon. In older releases of Oracle Linux, a cron job in /etc/cron.hourly/mcelog.cron checks for MCEs and saves them to /var/log/mcelog every hour. Controlling mcelog by a daemon is better since hardware errors are detected more quickly and logged immediately, rather than waiting for the cron job to run. Errors such as bus errors, memory errors, and CPU cache errors can be detected using mcelog, giving you advance notice in the event of a pending hardware failure.

Two types of errors are captured by mcelog: corrected and uncorrected. Corrected errors are events that are handled by the CPU; they can be used to identify trends that might predict a larger problem.

An uncorrected error is a critical exception and often results in a kernel panic on the system if the CPU cannot recover. This results in a reset and a disruption to applications. With uncorrected errors, the ability of mcelog to capture the error depends on whether the error results in a warm reboot or a hard reset. With a warm reboot, the information will be captured in mcelog and is available after recovery. A hard reset results in lost data, and the event will likely not be captured by mcelog.

The example in Listing 12 shows an mcelog message for a corrected error on CPU 1:

Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 2
ADDR 1234
TIME 1364535025 Fri Mar 29 01:30:25 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58

Listing 12
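
Because corrected errors are mainly useful for spotting trends, a per-CPU tally of the log can be more informative than individual records. The sketch below counts corrected and uncorrected records per CPU; the function name mcelog_summary is hypothetical, and it assumes records shaped like the listings in this article, with a "CPU n BANK m" line followed by a "Corrected error" or "Uncorrected error" line.

```shell
#!/bin/sh
# Sketch: tally corrected and uncorrected machine check records per CPU.
# Assumes the record layout shown in this article's mcelog listings.

# usage: mcelog_summary < /var/log/mcelog
mcelog_summary() {
    awk '
        # Remember which CPU the current record belongs to
        $1 == "CPU" && $3 == "BANK" { cpu = $2 }
        /^Corrected error/   { corrected[cpu]++;   seen[cpu] = 1 }
        /^Uncorrected error/ { uncorrected[cpu]++; seen[cpu] = 1 }
        END {
            for (c in seen)
                printf "CPU %s corrected=%d uncorrected=%d\n",
                       c, corrected[c], uncorrected[c]
        }
    ' | sort
}
```

Running the function at intervals and comparing the counts is one simple way to notice a CPU whose corrected error rate is climbing.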

Caution: The following information is for development and testing systems and is not intended for a production system.

In the interest of testing and troubleshooting, the mce-test package can be used to generate false hardware MCE events and perform system testing. Links to the git repository and project page for this package can be found in the "See Also" section of this article.

The mce-test package has a wealth of default tests that can simulate real hardware failures and even panic the kernel. Several setup steps are required to prepare a system for this type of testing.

First, you need to install a few support packages to set up mce-test on the test system. Use the following command:

yum install gcc.x86_64 gcc-c++.x86_64 flex.x86_64 dialog.x86_64 ras-utils.x86_64 git.x86_64

Once this is done, some configuration is required to load a kernel module that mce-test uses during test execution. These steps are outlined below. The first command loads the mce-inject module, and the second command sets this module to load automatically at system boot.

modprobe mce-inject
echo "mce-inject" >> /etc/modules.conf  

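To make sure the module is present, use the following command, which lists the modules that are available to the kernel. (To confirm that the module is currently loaded, you can also run lsmod | grep mce_inject.) You should see the following output.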

[root@test]# modprobe -l | grep mce-inject 
kernel/arch/x86/kernel/cpu/mcheck/mce-inject.ko

Once you have the kernel module loaded, you should test mce-inject to make sure it's functioning. First, create a text file that contains the following content for testing mce-inject.

CPU 0 BANK 1
STATUS CORRECTED
ADDR 0xabcd
#
CPU 1 BANK 2
STATUS CORRECTED
ADDR 0x1234
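
A file in the layout shown above can also be generated rather than typed by hand, which helps when a test needs several records. The function below is a sketch: its name and the cpu:bank:address argument format are assumptions for illustration, and it only reproduces the corrected-error records and the # separator line from the sample file.

```shell
#!/bin/sh
# Sketch: print corrected-error records in the layout of the sample file.
# The function name and the cpu:bank:address argument format are assumptions.

# usage: mce_corrected_records 0:1:0xabcd 1:2:0x1234 > mce-input.txt
mce_corrected_records() {
    first=1
    for spec in "$@"; do
        # Split "cpu:bank:addr" using parameter expansion only
        cpu=${spec%%:*}
        rest=${spec#*:}
        bank=${rest%%:*}
        addr=${rest#*:}
        # Mirror the sample file: a "#" line between records
        [ "$first" -eq 1 ] || printf '#\n'
        first=0
        printf 'CPU %s BANK %s\nSTATUS CORRECTED\nADDR %s\n' "$cpu" "$bank" "$addr"
    done
}
```

Called with the arguments 0:1:0xabcd and 1:2:0x1234, the function prints the same seven lines as the sample file.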

Once you have created this text file, you can test mce-inject by providing it with the path to the text file.

mce-inject <filename>

You should see the output shown in Listing 13 in /var/log/mcelog.

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1
ADDR abcd
TIME 1371752847 Thu Jun 20 14:27:27 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 2
ADDR 1234
TIME 1371752847 Thu Jun 20 14:27:27 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58

Listing 13

The mce-inject executable can be used directly by providing it input via the text file, but a much more powerful way to utilize tests on a system is to use the mce-test program.

You will need to download the mce-test suite via git, as shown in the following example.

[root@test ~]# git clone https://github.com/andikleen/mce-test.git
Initialized empty Git repository in /root/mce-test/.git/
remote: Counting objects: 1768, done.
remote: Compressing objects: 100% (748/748), done.
remote: Total 1768 (delta 940), reused 1757 (delta 929)
Receiving objects: 100% (1768/1768), 333.46 KiB, done.
Resolving deltas: 100% (940/940), done.

Once you have cloned the git repository, you can change to the directory mce-test and execute mcemenu, which will bring you to the main menu of the mce-test utility (shown in Figure 1).

Figure 1 - Main Menu of the mce-test Utility

The first thing we want to do is compile the test suite, so select the Compile option to compile all the executables that will be used by this test suite. You can then execute the tests from the Execute menu. After the test run, you can use the Results menu to view the results of the tests. The mce-test/doc directory contains documentation that describes the tests and explains how to fully utilize the suite for your needs.

Once you have the packages configured and understand which tests provide the results you need, you can generate some interesting mcelog exceptions, such as the uncorrected error shown in Listing 14, which contains obviously artificial hex values (for example, RIP 10:deadbabe).

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 2 TSC 4c060c5ce0
RIP 10:deadbabe
ADDR 1234
TIME 1364534147 Fri Mar 29 01:15:47 2013
MCG status:RIPV EIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS f400000000000000 MCGSTATUS 7
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58

Listing 14

Keep in mind that your testing will affect both /var/log/mcelog and /var/log/messages, in addition to sending messages to the system console and possibly to the server's hardware system event log. For this reason, perform these types of tests only on a development system. Some of the syslog messages are helpful and indicate that the MCEs are fake. However, other messages can look quite real to people who are not aware of your testing, and they might think there are serious hardware troubles. Here's an example of one of the more obvious syslog entries.

Mar 29 01:15:48 test kernel: mce: [Hardware Error]: Fake kernel panic: Fatal Machine check

Conclusion

This article provided a short overview of the basic features for ipmitool, mcelog, mce-inject, and mce-test. As mentioned in the article, there are more features and configuration options you can leverage with these tools. This article was meant to provide an introduction and some common use cases. In the next section, you will find additional resources you can access to learn more about these tools. All of the tools discussed in this article are available with Oracle Linux or they can be easily installed on Oracle Linux using the links and references provided.

See Also

ipmitool resources:

mce-test resources:

About the Author

Robert Chase is a member of the Oracle Linux product management team. He has been involved with Linux and open source software since 1996. He has worked with systems as small as embedded devices and with large supercomputer-class hardware.

Revision 1.0, 09/03/2013
