by Robert Chase
Published September 2013
In this article, we will focus on two hardware fault management features and the corresponding tools used with these features in Oracle Linux: the Intelligent Platform Management Interface (IPMI), managed with the ipmitool utility, and Machine Check Exception (MCE) detection, handled with the mcelog and mce-test tools.
The following sections provide an overview of the technology, describe common use cases, and provide instructions for installing and configuring the tools in Oracle Linux. Several examples show how the tools can be used to capture and report important hardware information you can use in your daily operations.
Note: These tools include a number of tuning parameters and options, which are not covered in this article. Refer to the resources in the "See Also" section for more information.
The modern data center is agile and constantly evolving. It is tasked with driving business objectives and keeping mission-critical workloads available, and it comprises various hardware and software solutions that can be complex to manage efficiently. In an effort to control risk and meet demanding service-level commitments, features have been developed for hardware and software to help system administrators monitor the health of systems and identify issues early.
These features, called fault management, encompass a variety of solutions and standards aimed at providing the tools for monitoring, administering, identifying, and resolving a number of issues that plague system administrators. When combined with data center best practices, such as redundancy and high availability, hardware fault management features provide powerful tools for driving efficiency, raising awareness, mitigating risk, and supporting the demanding objectives placed on systems in the data center.
IPMI is a specification that was first developed in 1998 by Intel, Dell, HP, and NEC. Its primary purpose is to provide a common command interface for accessing information from the system. It was developed to be management software–neutral; however, it is often used with system management features.
IPMI operates independently from the operating system, which means you can access the system "out of band" or before an operating system is booted. This can be useful in the event of an OS or system failure, since it provides tools you can use to gather critical information when traditional system management features are unavailable.
Some of the predefined commands and interfaces in IPMI are used for reading temperatures, voltage, fan speed, power, and network settings. The IPMI specification was developed to be expandable, though, so vendors can customize it and create additional commands and sensors. For example, Oracle Integrated Lights Out Manager (Oracle ILOM) is compliant with both IPMI v1.5 and v2.0. HP's Integrated Lights-Out (iLO) and Dell's DRAC (Dell Remote Access Controller) are other examples of solutions that integrate with or are compliant with IPMI. Each solution offers a set of features for out-of-band support. This is what the specification intended: a method for providing universal, cross-platform support while allowing a path for vendors to differentiate their individual solutions.
In Oracle Linux, the ipmitool utility is used to manage and configure devices that support the IPMI specification. IPMI support has been part of the Linux kernel since version 2.4. The ipmitool utility provides functions for managing field replaceable units (FRUs), LAN configuration, sensor readings, and remote chassis power control. The next section discusses installation and usage scenarios using features found in ipmitool.
The first step is to install
ipmitool on the system. IPMI features are found in systems that support the IPMI specification. These systems will have a baseboard management controller (BMC), which is the intelligence of the IPMI architecture. Using
ipmitool, you can interface directly with the BMC and interact with features implemented by the IPMI specification.
In order to access the IPMI features for a server, the local workstation or management machine needs to be on a network that can access the systems with the BMC, and it must have the
ipmitool tools installed. To install these tools, go to the server console and type the following command:
yum install ipmitool.x86_64 OpenIPMI.x86_64
Then use the following commands to set up ipmitool for use on your system and to start the service. When you start the service, it loads the IPMI kernel modules and creates the device interface used to communicate with the BMC:

chkconfig ipmi on
service ipmi start
You can also install the
OpenIPMI packages on other IPMI systems that have a BMC, which provides options for configuring IPMI settings on the BMC, as we will see in the following examples.
Once the tools are installed, configured, and running, we can interact with the features available for controlling and monitoring the system. Let's evaluate the following IPMI use cases utilizing
ipmitool and Oracle Linux.
One feature of IPMI is the ability to directly interface with a system over a network. This action is independent of any operating system installed on the target system and provides a useful option for administration. It gives you a direct connection to the server's IPMI interface, allowing you to execute IPMI commands remotely. In fact, you can write scripts using this option, enabling you to control any number of servers from a single management machine.
To enable this feature, there are several steps you must first complete, such as setting up passwords and adding an IP address for the IPMI interface on the server that has the BMC. It is important to note that many servers have a separate Ethernet port for remote management. Check your hardware documentation for further information about your specific server's remote management.
The first step to access IPMI over the network is to set up a valid IP address on the system that has the BMC. The following example outlines how this step is accomplished using
ipmitool. (Note: This example uses a Sun Fire X4170 M2 server from Oracle.) To configure an IP address using
ipmitool, use the following commands at the server console:
ipmitool lan set 1 ipaddr 192.168.1.120
ipmitool lan set 1 netmask 255.255.255.0
ipmitool lan set 1 defgw ipaddr 192.168.1.1
The next step is to set a password for the root user to enable login over the network; in this example, we use the password PASSW0rd.
Caution: This is not recommended and is provided only as an example. A secure password is strongly recommended.
First, we need to list the users to get the ID number, and then we will change the password using the ID number.
[root@test1 ~]# ipmitool user list 1
ID  Name     Callin  Link Auth  IPMI Msg   Channel Priv Limit
1            false   false      true       NO ACCESS
2   root     false   false      true       ADMINISTRATOR
[root@test1 ~]# ipmitool user set password 2 PASSW0rd
Once you complete these setup steps, you can test the setup by remotely sending a
chassis status IPMI request to the server. You will be prompted to provide the password for the account with which you are connecting. If everything is set up correctly, the chassis status will appear on the local command line. From the command line of your management system, type the command shown in Listing 1:
[root@mgmt-vm ~]# ipmitool -I lan -H 192.168.1.120 -U root -a chassis status
Password:
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : true
Power Control Fault  : false
Power Restore Policy : always-on
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
The command syntax for ipmitool uses the following options: -I for the LAN interface, -H for identifying the host address of the remote system (192.168.1.120), -U for the username (in this example, we used root), and -a for prompting for the remote user's password. This is followed by the command for the status you would like to check: chassis status.
There are other useful things we can do remotely using ipmitool. All of the examples provided in the previous section can be run remotely using the -I, -H, -U, and -a options of ipmitool. Adding multiple servers to a shell script would allow us to power off an entire rack or data center full of equipment quickly and easily from a single management system or workstation.
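As a sketch of that idea, the following script powers off every server listed in a file over the IPMI LAN interface. The host-list file, user name, and password-file path are illustrative assumptions; ipmitool's -f option reads the remote password from a file so the script can run unattended, and the IPMITOOL variable lets you substitute a stub binary for a dry run.

```shell
#!/bin/sh
# Sketch: power off every BMC listed in a file via IPMI.
# The host list, user, and password file are illustrative assumptions.

power_off_all() {
    # $1 = file containing one BMC hostname or IP address per line
    while read -r host; do
        "${IPMITOOL:-ipmitool}" -I lan -H "$host" -U "${IPMI_USER:-root}" \
            -f "${PASS_FILE:-/root/.ipmipass}" chassis power off
    done < "$1"
}

# Example (as root): power_off_all /root/servers.txt
```

Swapping "chassis power off" for "chassis power cycle" or "chassis status" would make the same loop reboot or audit the whole list instead.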
Here is an example of remotely power-cycling a server:
[root@rchase-oracle-linux-vm ~]# ipmitool -I lan -H 192.168.1.120 -U root -a chassis power cycle
Password:
Chassis Power Control: Cycle
We can also look at specific data about the server hardware in real time. For example, using the
ipmitool sdr command, you can view the voltage and temperature of server components. In the example in Listing 2, we are looking at a Sun Fire X4170 M2 server's output. Each hardware vendor might provide the data in a slightly different format but should provide similar fields. Some of the output in this example has been truncated.
[root@test1 ~]# ipmitool sdr
sys.id           | 0x02          | ok
sys.intsw        | 0x00          | ok
sys.psfail       | 0x01          | ok
sys.tempfail     | 0x01          | ok
sys.fanfail      | 0x01          | ok
mb.t_amb         | 28 degrees C  | ok
mb.v_bat         | 2.78 Volts    | ok
mb.v_+3v3stby    | 3.25 Volts    | ok
mb.v_+3v3        | 3.29 Volts    | ok
mb.v_+5v         | 4.99 Volts    | ok
mb.v_+12v        | 12.22 Volts   | ok
mb.v_-12v        | -12.20 Volts  | ok
mb.v_+2v5core    | 2.56 Volts    | ok
mb.v_+1v8core    | 1.82 Volts    | ok
mb.v_+1v2core    | 1.22 Volts    | ok
fp.t_amb         | 21 degrees C  | ok
pdb.t_amb        | 21 degrees C  | ok
io.t_amb         | 19 degrees C  | ok
Data such as system temperature provides insight into the environmental conditions inside the data center and the cooling functionality of the server. The data found in this output can help you identify issues before they become larger problems. For example, information about voltages might be useful for detecting a pending power supply failure, or it can be used to monitor power levels if the servers are operated off a DC power source. Fan RPMs might reveal a fan that's about to fail or a potential hot spot inside the server that could be corroborated with a high temperature reading. Let's look at a few more examples where IPMI can provide critical system status data.
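A monitoring sketch along these lines: the function below scans ipmitool sdr output (in the name | value | status format of Listing 2) for ambient-temperature sensors and warns when one exceeds a threshold. The sensor-name pattern (t_amb) and the threshold are assumptions based on the example output; other hardware may label its sensors differently.

```shell
# Sketch: warn on high ambient-temperature readings in `ipmitool sdr` output.
# The t_amb naming and the threshold are assumptions taken from Listing 2.
check_temps() {
    # $1 = threshold in degrees C; sdr output is read from stdin
    awk -F'|' -v limit="$1" '
        $1 ~ /t_amb/ {
            gsub(/degrees C/, "", $2); gsub(/ /, "", $2)
            if ($2 + 0 > limit)
                printf "WARNING: %s reads %s C (limit %s)\n", $1, $2, limit
        }'
}

# Example: ipmitool sdr | check_temps 35
```

Run from cron, a check like this can flag a failing fan or a data center hot spot before it becomes an outage.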
The ipmitool chassis status command captures generic information about a server and its current state. In the example in Listing 3, we can see that the system is powered on and the power policy has been set to always-off. We also see there is a power supply that has either faulted or has a loose cable, as indicated by Main Power Fault being true. The test system has a second power supply that has a disconnected cable.
[root@test1 ~]# ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : true
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
ipmitool also allows you to read and write configuration data to the system. For example, if we wanted to change the configuration of the power policy on the test server used in Listing 3, we could do this using ipmitool.
The following command shows how to change the power policy configuration on the test server. Remember from Listing 3 that the
Power Restore Policy was set to
always-off. The following command will change this policy to
always-on. If the power to the server is interrupted, this change to the policy will result in the system restarting itself when power is restored.
[root@test1 ~]# ipmitool chassis policy always-on
As shown in Listing 4, we can check the chassis status again and verify that the configuration has been changed to a power restore policy of always-on:
[root@test1 ~]# ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : true
Power Control Fault  : false
Power Restore Policy : always-on
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
We can also check the system event log (SEL) to view the hardware logs for the server, as shown in Listing 5. This is useful for tracking hardware events and comparing them against information found in the standard system logs.
[root@test1 ~]# ipmitool sel list
3501 | 09/29/2007 | 01:48:21 | System Firmware Progress | System boot initiated | Asserted
3601 | 09/29/2007 | 02:05:57 | System Firmware Progress | Motherboard initialization | Asserted
3701 | 09/29/2007 | 02:05:58 | System Firmware Progress | Video initialization | Asserted
3801 | 09/29/2007 | 02:06:08 | System Firmware Progress | USB resource configuration | Asserted
3901 | 09/29/2007 | 02:06:23 | System Firmware Progress | Option ROM initialization | Asserted
...
Listing 6 shows the SEL from a server that has a number of major issues. It shows that memory, fans, and the CPU have predictive failures. A predictive failure is a notification from the IPMI sensors that there is a potential hardware issue.
ca6  | 02/28/2013 | 06:13:31 | Memory #0x39    | Predictive Failure Asserted
ada6 | 02/28/2013 | 06:13:32 | Memory #0x38    | Predictive Failure Asserted
aea6 | 02/28/2013 | 06:13:32 | Fan #0x3d       | Predictive Failure Asserted
afa6 | 02/28/2013 | 06:13:33 | Memory #0x2e    | Predictive Failure Asserted
b0a6 | 02/28/2013 | 06:13:34 | Fan #0x3b       | Predictive Failure Asserted
b1a6 | 02/28/2013 | 06:13:36 | Memory #0x30    | Predictive Failure Asserted
b2a6 | 02/28/2013 | 06:13:37 | Fan #0x3e       | Predictive Failure Asserted
b3a6 | 02/28/2013 | 06:13:38 | Memory #0x2f    | Predictive Failure Asserted
b4a6 | 02/28/2013 | 06:13:39 | Memory #0x37    | Predictive Failure Asserted
b5a6 | 02/28/2013 | 06:13:40 | Processor #0x36 | Predictive Failure Asserted
b6a6 | 02/28/2013 | 06:13:40 | Fan #0x40       | Predictive Failure Asserted
b7a6 | 02/28/2013 | 06:13:43 | Memory #0x38    | Predictive Failure Asserted
b8a6 | 02/28/2013 | 06:13:43 | Fan #0x3d       | Predictive Failure Asserted
b9a6 | 02/28/2013 | 06:13:44 | Memory #0x2e    | Predictive Failure Asserted
baa6 | 02/28/2013 | 06:13:44 | Fan #0x3c       | Predictive Failure Asserted
bba6 | 02/28/2013 | 06:13:45 | Processor #0x2d | Predictive Failure Asserted
ipmitool also shows the size of the SEL on a system. In Listing 7, we see that our system event log is 99 percent full. On some hardware, this will cause an error light to come on to bring attention to the server.
[root@test1 ~]# ipmitool sel
SEL Information
Version          : 2.0 (v1.5, v2 compliant)
Entries          : 909
Free Space       : 72 bytes
Percent Used     : 99%
Last Add Time    : 02/26/2013 00:59:11
Last Del Time    : 02/26/2013 00:59:11
Overflow         : true
Supported Cmds   : 'Reserve' 'Get Alloc Info'
# of Alloc Units : 913
Alloc Unit Size  : 18
# Free Units     : 4
Largest Free Blk : 4
Max Record Size  : 0
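Because a full SEL can stop recording new events, the check in Listing 7 lends itself to automation. The sketch below parses ipmitool sel output for the Percent Used field and warns past a threshold; the 90 percent limit is an illustrative assumption.

```shell
# Sketch: warn when the SEL is nearly full, based on `ipmitool sel` output.
# The 90% threshold is an illustrative assumption.
sel_check() {
    # reads `ipmitool sel` output from stdin
    awk -F':' '/Percent Used/ {
        gsub(/[% ]/, "", $2)
        if ($2 + 0 >= 90) print "SEL nearly full: " $2 "%"
    }'
}

# Example: ipmitool sel | sel_check
```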
Some servers are unable to write additional data to the SEL once it is full, while others might begin to remove the oldest entries. With
ipmitool, you can clear the system event log on the server using the following command:
[root@test1 ~]# ipmitool sel clear
Clearing SEL.  Please allow a few seconds to erase.
You can then verify that the log has been cleared by executing the
ipmitool sel command again, as shown in Listing 8:
[root@test1 ~]# ipmitool sel
SEL Information
Version          : 2.0 (v1.5, v2 compliant)
Entries          : 0
Free Space       : 16434 bytes
Percent Used     : 0%
Last Add Time    : 02/26/2013 01:07:17
Last Del Time    : 02/26/2013 01:09:01
Overflow         : false
Supported Cmds   : 'Reserve' 'Get Alloc Info'
# of Alloc Units : 913
Alloc Unit Size  : 18
# Free Units     : 913
Largest Free Blk : 913
Max Record Size  : 3
Notice it's now 0 percent used. At this point, depending on our hardware, if there had been an error light on the front of the server it might be off now.
Using ipmitool, it is possible to pull part numbers from the system using the fru option. FRU stands for field replaceable unit, the term used to describe a server component together with the part number and serial number needed to replace the part in the event of a hardware failure. This feature is useful for gathering information that is required to receive warranty service on hardware systems in the data center. Listing 9 shows an example of the output of the ipmitool fru command. Some of the output has been truncated.
[root@test1 init.d]# ipmitool fru
FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg Date        : Sun Dec 31 18:00:00 1995
 Board Product         : ASSY,SERV PROCESSOR,G1/2
 Board Serial          : 1762TH1-0617000504
 Board Part Number     : 501-6979-03
 Board Extra           : 50
 Board Extra           : G1/2_GRASP
 Product Manufacturer  : SUN MICROSYSTEMS
 Product Name          : ILOM

FRU Device Description : sp.net0.fru (ID 2)
 Product Manufacturer  : MOTOROLA
 Product Name          : FAST ETHERNET CONTROLLER
 Product Part Number   : MPC8248 FCC
 Product Serial        : 00:14:4F:26:E4:C4
 Product Extra         : 01
 Product Extra         : 00:14:4F:26:E4:C4

FRU Device Description : mb.fru (ID 4)
 Chassis Type          : Rack Mount Chassis
 Chassis Part Number   : 541-0250-04
 Chassis Serial        : 0226-0615LHF0DED
 Board Mfg Date        : Sun Dec 31 18:00:00 1995
 Board Product         : ASSY,MOTHERBOARD,A64
 Board Serial          : 1762TH1-0618001211
 Board Part Number     : 501-7644-01
 Board Extra           : 01
 Board Extra           : A64_MB
 Product Manufacturer  : SUN MICROSYSTEMS
 Product Name          : SUN FIRE X4170 SERVER
 Product Part Number   : 602-3222-01
 Product Serial        : 0624AN1527
The ipmitool feature set also supports activating locator LEDs for onsite service personnel. In order to use this option, you must determine which LEDs a server is equipped with. You can accomplish this using the command shown in Listing 10:
[root@test1 ~]# ipmitool sdr list generic
sys.psfail.led   | Generic @20:18.3 | ok
sys.tempfail.led | Generic @20:18.4 | ok
sys.fanfail.led  | Generic @20:18.5 | ok
sys.power.led    | Generic @20:00.0 | ok
sys.locate.led   | Generic @20:00.0 | ok
sys.alert.led    | Generic @20:00.0 | ok
bp.power.led     | Generic @20:2D.0 | ok
bp.locate.led    | Generic @20:2D.1 | ok
bp.alert.led     | Generic @20:2D.2 | ok
fp.power.led     | Generic @20:18.0 | ok
fp.locate.led    | Generic @20:18.1 | ok
fp.alert.led     | Generic @20:18.2 | ok
io.hdd0.led      | Generic @20:1A.0 | ok
io.hdd1.led      | Generic @20:1A.1 | ok
io.hdd2.led      | Generic @20:1A.2 | ok
io.hdd3.led      | Generic @20:1A.3 | ok
p0.led           | Generic @20:2D.6 | ok
p0.d0.led        | Generic @20:1C.0 | ok
p0.d1.led        | Generic @20:1C.1 | ok
p0.d2.led        | Generic @20:1C.2 | ok
p0.d3.led        | Generic @20:1C.3 | ok
p1.led           | Generic @20:2D.7 | ok
p1.d0.led        | Generic @20:1C.4 | ok
p1.d1.led        | Generic @20:1C.5 | ok
p1.d2.led        | Generic @20:1C.6 | ok
p1.d3.led        | Generic @20:1C.7 | ok
ft0.fm0.led      | Generic @20:18.7 | ok
ft0.fm1.led      | Generic @20:19.1 | ok
ft0.fm2.led      | Generic @20:19.2 | ok
ft1.fm0.led      | Generic @20:19.3 | ok
ft1.fm1.led      | Generic @20:19.4 | ok
ft1.fm2.led      | Generic @20:19.5 | ok
You can see from the output in Listing 10 that there is an LED option
sys.locate.led for this hardware. Using the following command, we can turn the locator LED on:
[root@test1 ~]# ipmitool sunoem sbled set sys.locate.led fast
fp.locate.led    | FAST
bp.locate.led    | FAST
Once the command has been issued, the LED light on the front of the server will turn on immediately. The LED can be disabled using the following command; again, once the command has been executed, the light will turn off.
[root@test1 ~]# ipmitool sunoem sbled set sys.locate.led off
fp.locate.led    | OFF
bp.locate.led    | OFF
Caution: The following information is for development and testing systems and not intended for ANY production system.
Using ipmitool, we can also inject artificial hardware events on the system for testing purposes. The following command generates a false temperature warning on the server hardware and stores the warning in the system event log. Keep in mind that this will appear as an actual hardware problem and should be cleared after testing is complete.
[root@test1 ~]# ipmitool event 1
Sending SAMPLE event: Temperature - Upper Critical - Going High
   0 | Pre-Init Time-stamp   | Temperature #0x30 | Upper Critical going high
Once the command is issued, we can look at the
ipmitool sel list output to see the error saved in the server's system event log.
35aa | 04/12/2013 | 22:11:44 | Temperature #0x30 | Upper Critical going high
We can generate many types of artificial hardware troubles. The ipmitool event command, run without arguments, will output the usage information shown in Listing 11.
usage: event <num>
  Send generic test events
  1 : Temperature - Upper Critical - Going High
  2 : Voltage Threshold - Lower Critical - Going Low
  3 : Memory - Correctable ECC

usage: event file <filename>
  Read and generate events from file
  Use the 'sel save' command to generate from SEL

usage: event <sensorid> <state> [event_dir]
  sensorid  : Sensor ID string to use for event data
  state     : Sensor state, use 'list' to see possible states for sensor
  event_dir : assert, deassert [default=assert]
More information about
ipmitool can be found in the "See Also" section of this article.
Using mcelog and mce-test for Detecting Machine Check Errors
Correctable and uncorrectable hardware errors are known as Machine Check Exceptions (MCEs). The CPU itself has the ability to correct errors and notify the underlying operating system regarding issues with the CPU or cache. The CPU also has the ability to recover by itself from some errors. Oracle Linux can use
mcelog as a logging subsystem for machine checks. To get started, you must install the package on the server using the following commands.
yum install mcelog.x86_64
service mcelogd start
chkconfig mcelogd on
The mcelog package works in two different ways depending on the version of Oracle Linux you are using. On Oracle Linux 6.0 and higher, it is controlled by a daemon. In older releases of Oracle Linux, a cron job in /etc/cron.hourly/mcelog.cron checks for MCEs and saves them to /var/log/mcelog every hour. Controlling mcelog with a daemon is better, since hardware errors are detected and logged immediately rather than waiting for the cron job to run. Errors such as bus errors, memory errors, and CPU cache errors can be detected using mcelog, giving you advance notice of a pending hardware failure.
Two types of errors are captured by
mcelog: corrected and uncorrected. Corrected errors are events that are handled by the CPU; they can be used to identify trends that might predict a larger problem.
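One way to watch for such trends: the sketch below tallies corrected errors per CPU and bank from a log file in mcelog's output format (the "CPU n BANK m" record layout shown in Listing 12). The log path in the example is the default used in this article; a bank whose count keeps climbing may point at a failing component.

```shell
# Sketch: tally corrected errors per CPU/bank from an mcelog-format log.
# Assumes the "CPU n BANK m" record format shown in Listing 12.
corrected_by_bank() {
    # $1 = path to a log file in mcelog output format
    awk '
        /^CPU [0-9]+ BANK [0-9]+/ { cpu = $2; bank = $4 }
        /Corrected error/         { count[cpu " bank " bank]++ }
        END {
            for (k in count)
                printf "CPU %s: %d corrected errors\n", k, count[k]
        }' "$1"
}

# Example: corrected_by_bank /var/log/mcelog
```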
An uncorrected error is a critical exception and often results in a kernel panic if the CPU cannot recover. This causes a reset and a disruption to applications. With uncorrected errors, the ability of mcelog to capture the error depends on whether the error results in a warm reboot or a hard reset. With a warm reboot, the information is captured in mcelog and is available after recovery. A hard reset results in lost data, and the event will likely not be captured by mcelog.
The example in Listing 12 shows an mcelog error message reporting a corrected error on CPU 1:
Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 2 ADDR 1234
TIME 1364535025 Fri Mar 29 01:30:25 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
Caution: The following information is for development and testing systems and is not intended for a production system.
In the interest of testing and troubleshooting, the
mce-test package can be used to generate false hardware MCE events and perform system testing. Links to the
git repository and project page for this package can be found in the "See Also" section of this article.
The mce-test package has a wealth of default tests that can simulate real hardware failures and even panic the kernel. Several setup steps are required to prepare a system for this type of testing.
First, you need to install a few support packages to set up
mce-test on the test system. Use the following command:
yum install gcc.x86_64 gcc-c++.x86_64 flex.x86_64 dialog.x86_64 ras-utils.x86_64 git.x86_64
Once this is done, some configuration is required to load a kernel module that
mce-test uses during test execution. These steps are outlined below. The first command loads the
mce-inject module, and the second command sets this module to load automatically at system boot.
modprobe mce-inject
echo "mce-inject" >> /etc/modules.conf
To make sure the module has loaded, use the following command. You should see the following output:

[root@test]# modprobe -l | grep mce-inject
kernel/arch/x86/kernel/cpu/mcheck/mce-inject.ko
Once you have the kernel module loaded, you should test mce-inject to make sure it's functioning. First, create a text file that contains the following content for testing mce-inject:

CPU 0 BANK 1
STATUS CORRECTED
ADDR 0xabcd
#
CPU 1 BANK 2
STATUS CORRECTED
ADDR 0x1234
Once you have created this text file, you can test
mce-inject by providing it with the path to the text file.
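The invocation can be sketched as follows. The file path is a hypothetical example (save the entries above wherever you like); the small wrapper lets the binary be substituted with a stub when experimenting on a machine without root access or the mce-inject module loaded.

```shell
# Sketch: feed the error-description file to mce-inject (run as root).
# The /tmp/mce-input path is a hypothetical example; MCE_INJECT allows
# substituting a stub binary when trying this wrapper without the module.
inject_mce_file() {
    # $1 = path to an mce-inject input file
    "${MCE_INJECT:-mce-inject}" "$1"
}

# Example (as root): inject_mce_file /tmp/mce-input
```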
You should see the output shown in Listing 13 in /var/log/mcelog:
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1 ADDR abcd
TIME 1371752847 Thu Jun 20 14:27:27 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58

Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 2 ADDR 1234
TIME 1371752847 Thu Jun 20 14:27:27 2013
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS 9400000000000000 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
The mce-inject executable can be used directly by providing it input via the text file, but a much more powerful way to run tests on a system is to use the mce-test suite.
You will need to download the
mce-test suite via
git, as shown in the following example.
[root@test ~]# git clone https://github.com/andikleen/mce-test.git
Initialized empty Git repository in /root/mce-test/.git/
remote: Counting objects: 1768, done.
remote: Compressing objects: 100% (748/748), done.
remote: Total 1768 (delta 940), reused 1757 (delta 929)
Receiving objects: 100% (1768/1768), 333.46 KiB, done.
Resolving deltas: 100% (940/940), done.
Once you have cloned the
git repository, you can change to the directory
mce-test and execute
mcemenu, which will bring you to the main menu of the
mce-test utility (shown in Figure 1).
Figure 1 - Main Menu of the mce-test Utility
The first thing we want to do is compile the test suite, so select the Compile option to compile all the executables that will be used by the suite. You can then execute the tests from the Execute menu. After the test run, you can use the Results menu to view the results. The mce-test/doc directory contains all the documentation covering the tests and how to fully utilize the suite for your needs.
Once you have the packages configured and understand which ones provide the results you need, you can generate some interesting
mcelog exceptions, such as the one shown in Listing 14, which has interesting hex values.
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 2 TSC 4c060c5ce0
RIP 10:deadbabe
ADDR 1234
TIME 1364534147 Fri Mar 29 01:15:47 2013
MCG status:RIPV EIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_ADDR register valid
MCA: No Error
STATUS f400000000000000 MCGSTATUS 7
MCGCAP c07 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 58
It's important to keep in mind that your testing will affect both /var/log/mcelog and /var/log/messages, in addition to sending messages to the system console and possibly to the server's hardware system event log. It is important to perform these types of tests only on a development system. Some of the syslog messages are helpful and indicate that the MCEs are fake. However, other messages can look quite real to people who are not aware of your testing, and they might think there are serious hardware troubles. Here's an example of one of the more obvious messages:
Mar 29 01:15:48 test kernel: mce: [Hardware Error]: Fake kernel panic: Fatal Machine check
This article provided a short overview of the basic features of ipmitool, mcelog, and mce-test. As mentioned, there are more features and configuration options you can leverage with these tools; this article was meant to provide an introduction and some common use cases. In the next section, you will find additional resources you can access to learn more about these tools. All of the tools discussed in this article are available with Oracle Linux or can be easily installed on Oracle Linux using the links and references provided.
Robert Chase is a member of the Oracle Linux product management team. He has been involved with Linux and open source software since 1996. He has worked with systems as small as embedded devices and with large supercomputer-class hardware.
|Revision 1.0, 09/03/2013|