by Arup Nanda
Learn how to use various metrics and reports available to measure the efficiency and effectiveness of your Exadata Database Machine environment.
(The purpose of this guide is educational; it is not intended to replace official Oracle-provided manuals or other documentation. The information in this guide is not validated by Oracle, is not supported by Oracle, and should only be used at your own risk.)
To keep a handle on the Oracle Exadata Database Machine’s reliability and efficiency you will need to measure several health metrics. Monitoring these metrics is likely to be your most common activity, but even so you would probably want a heads-up when something is amiss. In this final installment, you will learn how to check these metrics, either via the CellCLI interface or Oracle Enterprise Manager, and receive alerts based on certain conditions.
First, let’s see how to get the metrics through CellCLI. To check the current metrics, simply use the following command:
CellCLI> list metriccurrent
It will show all the metrics, which quickly becomes impossible to read. Instead you may want to focus on specific ones. For example, to display a specific metric called GD_IO_RQ_W_SM, use it as a modifier. (If you are wondering what this metric is about, you will learn that shortly.)
CellCLI> list metriccurrent gd_io_rq_w_sm
GD_IO_RQ_W_SM PRODATA_CD_00_cell01 2,665 IO requests
GD_IO_RQ_W_SM PRODATA_CD_01_cell01 5,710 IO requests
GD_IO_RQ_W_SM PRODATA_CD_02_cell01 2,454 IO requests
… output truncated …
The metric shows you the I/O requests to the different cell disks. While this may be good for a quick glance at some single parameter, it’s not super-useful for checking up on all the other metrics. To get a longer list of metrics, use the detail option.
CellCLI> list metriccurrent gd_io_rq_w_sm detail
metricValue: 2,456 IO requests
metricValue: 2,982 IO requests
metricValue: 1,631 IO requests
… output truncated …
The output shows the name metric, the object on which this metric is defined (e.g. PRODATA_CD_02_cell01), the type of that object (e.g. grid disk), when the metric was collected, and the current value, among others.
The objects on which the metrics have been defined are wide ranging in types. For instance, here is the metric for rate of reception for I/O packets, which is defined on the object type Cell.
CellCLI> list metriccurrent n_nic_rcv_sec detail
metricValue: 3.7 packets/sec
What other information does current metric show? To know that answer, use the following command.
CellCLI> describe metriccurrent
Note the attribute called alertState, which is probably the most important for you. If the metric is within the limit defined as safe, the alertState will be set to normal. You can use this command to display the interesting attributes where the alert has been triggered, i.e. alertState is non-zero:
CellCLI> list metriccurrent attributes name,metricObjectName,metricType,metricValue,objectType -
> where alertState != 'normal'
Alternatively you can select the current metrics on other types of objects as well. For instance, here is an example of checking the metrics on the interconnect.
CellCLI> list metriccurrent attributes name,metricObjectName,
> where objectType = 'HOST_INTERCONNECT'
N_MB_DROP proddb01.xxxx.xxxx Cumulative 0.0 MB normal
N_MB_DROP proddb02.xxxx.xxxx Cumulative 0.0 MB normal
N_MB_DROP proddb03.xxxx.xxxx Cumulative 0.0 MB normal
N_MB_DROP proddb04.xxxx.xxxx Cumulative 0.0 MB normal
… outout truncated …
As always, refer to the describe command to learn about the order of the attributes, since the output does not have a heading. You can also skip naming the attribute and just use ALL, which will show all the attributes. Let’s see the metrics for a cell. Since we are running it on cell01, the metrics of only that cell is shown:
CellCLI> list metriccurrent attributes all where objectType = 'CELL'
CL_BBU_CHARGE normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 93.0 % CELL
CL_BBU_TEMP normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 41.0 C CELL
CL_CPUT normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 2.4 % CELL
CL_FANS normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 12 CELL
CL_MEMUT normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 41 % CELL
CL_RUNQ normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 0.7 CELL
CL_TEMP normal 2011-05-26T11:37:10-04:00 cell01 Instantaneous 25.0 C CELL
IORM_MODE normal 2011-05-26T11:36:34-04:00 cell01 Instantaneous 0 CELL
N_NIC_NW normal 2011-05-26T11:37:28-04:00 cell01 Instantaneous 3 CELL
N_NIC_RCV_SEC normal 2011-05-26T11:37:28-04:00 cell01 Rate 3.2 packets/sec CELL
N_NIC_TRANS_SEC normal 2011-05-26T11:37:28-04:00 cell01 Rate 3.6 packets/sec CELL
Let’s examine the meaning of these metrics in some detail:
Battery charge on the disk controller
Temperature of the disk controller battery
Number of working fans
%age utilization of file systems
%age of memory use
The run queue
Temperature of the cell
I/O Resource Management mode in effect
Number of interconnects *not* working
Rate of reception of I/O packet
Rate of transmission of I/O packet
Speaking of current metrics, what are the names? You saw a few of them previously: N_MB_DROP, etc. To get a complete list, use the following command:
CellCLI> list metricdefinition
You can use any of these metrics in the metricType.
I am sure your next question is on the description of these metrics. To know more about each metric, issue the command with the name and the detail modifiers:
CellCLI> list metricdefinition cl_cput detail
description: "Cell CPU Utilization is the percentage of time over the
previous minute that the system CPUs were not idle (from /proc/stat)."
Not all metrics are available on all objects. For instance the temperature is a metric for cells but not the disks. Note the objectType attribute in the output above; it shows the type of the object. You can learn about metrics on a specific type of object.
First, see all the attributes you can choose:
CellCLI> describe metricdefinition
This will allow you to use attributes all keywords and let you understand the heading of the output.
CellCLI> list metricdefinition attributes all where objecttype='CELL'
CL_BBU_CHARGE "Disk Controller Battery Charge" Instantaneous CELL %
CL_BBU_TEMP "Disk Controller Battery Temperature" Instantaneous CELL
CL_CPUT "Cell CPU Utilization is the percentage of time over the previous minute
that the system CPUs were not idle (from /proc/stat)." Instantaneous CELL %
CL_FANS "Number of working fans on the cell" Instantaneous CELL Number
CL_MEMUT "Percentage of total physical memory on the cell that is currently used" Instantaneous CELL %
CL_RUNQ "Average number (over the preceding minute) of processes in the Linux
run queue marked running or uninterruptible (from /proc/loadavg)." Instantaneous CELL Number
CL_TEMP "Temperature (Celsius) of the server, provided by the BMC" Instantaneous CELL C
IORM_MODE "I/O Resource Manager objective for the cell" Instantaneous CELL Number
N_NIC_NW "Number of non-working interconnects" Instantaneous CELL Number
N_NIC_RCV_SEC "Total number of IO packets received by interconnects per second" Rate CELL packets/sec
N_NIC_TRANS_SEC "Total number of IO packets transmitted by interconnects per second" Rate CELL packets/sec
Using this command you can get the description on any metric. Some of the metric values are 0. Let’s see all metrics of Grid Disks which are bigger than 0:
CellCLI> list metriccurrent attributes all where objectType = 'GRIDDISK' -
> and metricObjectName = 'PRODATA_CD_09_cell01' -
> and metricValue > 0
GD_IO_BY_R_LG normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 3.9MB GRIDDISK
GD_IO_BY_R_SM normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 29.3MB GRIDDISK
GD_IO_BY_W_LG normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 797MB GRIDDISK
GD_IO_BY_W_SM normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 73.3MB GRIDDISK
GD_IO_RQ_R_LG normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 4 IO requests GRIDDISK
GD_IO_RQ_R_SM normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 1,504 IO requests GRIDDISK
GD_IO_RQ_W_LG normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 3,230 IO requests GRIDDISK
GD_IO_RQ_W_SM normal 2011-05-26T11:38:34-04:00 PRODATA_CD_09_cell01 Cumulative 3,209 IO requests GRIDDISK
Current metrics are like the pulse of the cell – they give you the health and effectiveness of the cell at that time.
Sometimes you may be interested in the history of the metrics. For instance you may want to get an idea about the trending of the values. Use the following command:
CellCLI> list metrichistory
It will get you everything and in fact the output may not be practical. Instead, you may want to examine metrics of a specific object only, since the metrics could be highly context relative. Here is an example where you can see the history of all the metric values for the cells.
CellCLI> list metrichistory where objectType = 'CELL'
IORM_MODE cell01 0 2011-05-26T12:59:39-04:00
CL_BBU_TEMP cell01 41.0 C 2011-05-26T13:00:10-04:00
CL_CPUT cell01 0.6 % 2011-05-26T13:00:10-04:00
CL_FANS cell01 12 2011-05-26T13:00:10-04:00
CL_MEMUT cell01 41 % 2011-05-26T13:00:10-04:00
Don’t know what the columns are for? Just give the command describe metrichistory to see the attributes.
While the output shows the history of metrics, it’s a little difficult to parse. You may want to examine a single statistic, not a bunch of them. The following example shows the history of the temperature of the cell.
CellCLI> list metrichistory where objectType = 'CELL' -
> and name = 'CL_TEMP'
CL_TEMP cell01 25.0 C 2011-05-26T13:00:10-04:00
CL_TEMP cell01 25.0 C 2011-05-26T13:01:10-04:00
CL_TEMP cell01 25.0 C 2011-05-26T13:02:10-04:00
CL_TEMP cell01 25.0 C 2011-05-26T13:03:10-04:00
CL_TEMP cell01 25.0 C 2011-05-26T13:04:10-04:00
CL_TEMP cell01 25.0 C 2011-05-26T13:05:10-04:00
… output truncated …
Two of the often forgotten features of CellCLI is that you can invoke it with the –x option, which suppresses the banner, and the –n option, which suppresses the command line.
# cellcli -x -n -e "list metrichistory where objectType = 'CELL' and name = 'CL_TEMP'"
CL_TEMP prodcel01 24.0 C 2011-05-27T02:00:47-04:00
CL_TEMP prodcel01 24.0 C 2011-05-27T02:01:47-04:00
CL_TEMP prodcel01 24.0 C 2011-05-27T02:02:47-04:00
… output truncated …
This output can be taken straight to a spreadsheet program for charting or other types of analysis.
Now that you learned about metrics, let’s see how you can use them to alert you when something happens. The Exadata Database Machine comes with a lot of predefined alerts out of the box. To see them use the following command. To get more information on them, use the detail modifier.
CellCLI> list alertdefinition detail
alertSource: "Automatic Diagnostic Repository"
description: "Incident Alert"
description: "Hardware Alert"
description: "Threshold Alert"
description: "Threshold Alert"
… output truncated …
The alerts are based on metrics explained earlier. For instance, the alert StatefulAlert_CG_FC_IO_BY_SEC is based on the metric CG_FC_IO_BY_SEC. There are a few alerts which are not based on metrics. Here is how you can find them:
CellCLI> list alertdefinition attributes all where alertSource!='Metric'
ADRAlert ADR "Automatic Diagnostic Repository" Stateless "Incident Alert"
HardwareAlert Hardware Hardware Stateless "Hardware Alert"
Stateful_HardwareAlert Hardware Hardware Stateful "Hardware Stateful Alert"
Stateful_SoftwareAlert Software Software Stateful "Software Stateful Alert"
The next step is to define an alert based on a threshold. Here is how you can define a threshold on the metric CD_IO_ERRS_MIN on database called PRODB. You want the alert to be fired when the value is more than 10.
CellCLI> create threshold cd_io_errs_min.prodb comparison=">", critical=10
Threshold DB_IO_RQ_SM_SEC.PRODB successfully created
When an alert triggers, the cell takes the action on the notification schedule – it either sends an email or an SNMP trap, or both. Here is an example of an alert received via email. It shows that the cell failed to reach either the NTP or DNS server.
You can modify the threshold:
CellCLI> alter threshold DB_IO_RQ_SM_SEC.PRODB comparison=">", critical=100
Threshold DB_IO_RQ_SM_SEC.PRODB successfully altered
And, when not needed, you can drop the threshold.
CellCLI> drop threshold DB_IO_RQ_SM_SEC.PRODB
Threshold DB_IO_RQ_SM_SEC.PRODDB successfully dropped
When you drop the threshold, the alert (if generated earlier) will be cleared and new alert is generated informing that the threshold was cleared. Here is an example email with that information.
As you learned previously, you can administer the Exadata Database Machine in several ways: using the command-line tool CellCLI (or a number of cells at the same time by DCLI), using SRVCTL and CRSCTL, and via plain-old SQL commands. But the easiest approach, hands down, is to use Oracle Enterprise Manager Grid Control. The graphical interface simply makes the command interface intuitive and easy to read. If you already have a Grid Control infrastructure, it makes the decision very easy – you just add the Database Machine to it. If you don’t have Grid Control, perhaps you should seriously think about it now.Setup
To manage the Database Machine via Grid Control, you need:
Let’s examine each approach in detail. The plugin is a jarfile you download and keep on the OMS server. The grid control agents are installed on the database compute nodes the same way you would have done for a normal database server. You can either push the agents from the Grid Control console, or download the agent software on the database compute nodes and install there. During the installation you will be asked the address and port number of the OMS server, which will complete the installation.
The storage plug-in is different. This is not an agent on the storage cells; rather it’s installed as a part of the OMS server. So you need to download the plugin (which is a file with .jar extension) to the machine on which you launched the browser (typically your desktop), not the storage cells. After that, fire up the Grid Control browser and follow the steps shown below.
That’s it; the storage servers are now ready to be monitored via Grid Control.
Once configured, your Exadata Database Machine shows up in Grid Control. The database compute nodes show up as normal database servers and the cluster shows up as a normal cluster database; there is nothing Exadata-specific in nature on those screens.
You may already be familiar with the normal Oracle RAC database management features on Grid Control, which applies to Exadata compute nodes as well. In this article we will assume you are familiar with the database management via Grid Control.
The real benefit comes in looking at the storage cells as an alternative to CellCLI. To check storage cells, go to Homepage > Targets > All Targets. In the Search dropdown box choose Oracle Exadata Storage Server and press Go. The resulting screen that comes up is similar to the picture shown below. Note the names have been erased.
This is a full rack Exadata so there are 14 cells, or storage servers named with a prefix _Cell01 through _Cell14. Clicking on each of the hyperlinked cell names takes you to the information screen of the respective cell (also known as Storage Server). Let’s click on cell#2. It brings up the cell details similar to a screen shown below:
This screen shows you three basic but very important metrics:
The screen gives only a glimpse of these three metrics for the last 24 hours or so. Note that memory utilization has been constant, which is pretty typical. Since this is a storage cell, the only software that runs here is Exadata Storage Server, not the database sessions which tend to fluctuate in memory usage. The temperature has been pretty steady but spiked a little at around 6AM. The CPU however may show a wide fluctuation because of the demand on the ES software as a result of cell offloading (described in detail in Part 1). If you click on CPU Utilization, you will see a dedicated screen for CPU utilization as shown below.
It shows a more detailed graph on the CPU busy-ness across a more granular scale. You can choose to get the refresh frequency of the metric by choosing from the drop-down menu View Data near the top right corner. Here I have chosen "Real Time: 5 Minute Refresh." It will get the data in the real time but refresh only every 5 minutes. That is why you will notice the graph is non-smooth. You can also choose a longer period such as “last 7 days” or even “last 31 days” but that information is no longer real time.
You can also choose the same information from Reports (described later in this installment). Clicking on Temperature and Memory graphs will provide similar data.
Since the cell contains hard drives, which are mechanical devices and generate a lot of heat, it will fail if the temperature is too high. The cells contain fans to force the hot air out and to reduce the temperature. There are 12 fans in each cell. Are they working? There are two power supplies in the cell for redundancy. Are both working? To know the answer to these questions and get other information on the cell, click on the hyperlink View Configuration to bring up a screen like the following.
The information shown here is pretty similar to what you would have seen in the list cell command in CellCLI. Let’s examine some of the interesting information on this screen and why you would like to know about them:
This is just a summary of the cell. The objective of the cell is to provide a storage platform for the Exadata Database Machine. So far, we have not seen anything about the actual storage in the cell. To know about the storage in the cells, look further down the page. There are five sections:
Note the set of 5 hyperlinks toward the top of the screen that will direct you to the appropriate section:
Instead of looking at the sections in the way they have been laid out, I suggest you look in the order they are created. For instance a cell contains physical disks, from which LUNs are created, from which cell disks are created, from which grid disks are created. Finally, IORM (I/O Resource Manager) is used on the disks. Let’s look at them in the same logical order.
To know about the physical disks available in the cell, this is the section to look at. Here is a partial screenshot.
If this looks familiar, it’s because this is formatted output from the CellCLI command LIST PHYSICALDISK DETAIL or LIST PHYSICALDISK ATTRIBUTES. The important columns here to note are:
Next, let’s look at the LUNs in the cell by clicking on the Cell LUN Configuration hyperlink.
Again, this is the graphical representation of the LIST LUN ATTRIBUTES command. Note that the flash drives are also listed here, toward the bottom of the output. They are named with a prefix FD in the Cell Disk, unlike “CD” in case of regular SAS hard disks. LUNs are created from hard disk partitions, which are presented to the host (the storage cell) as a device. The device ID shows up here, along with the cell disks created on the LUN. The two columns that you should pay attention to are “Status” and “Error Count”.
If you see just 10 LUNs, click on the dropdown box and select all 28.
Cell disks are created from the physical disks. In CellCLI, you got the list by using the LIST CELLDISK command. In Grid Control, clicking on the hyperlink Celldisk Configuration will give show the list of cell disks in a graphical manner, similar to one shown below. Like the previous section, click on the drop-down list to choose “All 28” to show all 28 cell disks on the same page. Again, you should pay attention to “Error Count” and “Status” columns.
Grid Disks are created from the cell disks, and are used to build the ASM diskgroups. Obviously, these are the actual building blocks of the storage the ASM instance is aware of. Typically there is a one-to-one relationship between cell disk and grid disk, but if there is not one in your case, you may see less available space in your ASM diskgroup. Clicking on the Griddisk Configuration hyperlink brings up the screen similar to the following:
The key columns in this screen to look at are:
This brings us to the end of the basic building blocks of the storage cells, or storage servers. Using these graphical screens or the CellCLI commands you can get most of the information on these building blocks.
Now let’s get on to some of the more complex data needs: to gather the performance and management metrics on these components. For example, is the cell disk performing optimally, or are there any issues with the grid disk that may result in data loss?Metrics
The best way to examine the metrics for these components is to explore them in the dedicated Metrics page in Grid Control. Note the set of hyperlinks toward the bottom of the page and click on All Metrics. It will bring up a screen as shown below.
Remember, metrics are gathered and stored in the OMS server and then presented as needed. Since they are stored in the OMS server, they affect the performance of the Exadata Database Machine little or none at all. The screen above shows the frequency these metrics are collected (15 minutes, 1 minute, etc.). Some of the metrics are real time (flash cache statistics for example); others do not have any schedule (such as error messages and alerts). The collection frequency is displayed under the column “Collection Schedule”.
When the metrics are collected, they are uploaded to the OMS server’s database. The screen above also shows how soon the metrics are uploaded after collection (the column “Upload Interval”), along with the date and time the last upload occurred (“Last Upload”).
Metrics can also be used to trigger alerts. For instance, if the cell offload efficiency falls below a certain threshold, you can trigger an alert. We will see how these alerts are set up later but the alerting threshold is shown in this screen as well, under the column “Threshold”.
The screen shows the metrics collected by Grid Control and the frequency of the collection. If you want, you can see all the sub categories and specific metric names by clicking on Expand All.
Let’s focus on only one category – Category Statistics. Click on the “+” before the category to show the various metrics; you will see a screen similar to the following:
Click on Category Small IO Requests/Sec.
Click on the eyeglass icon under the column “Details”.
Like all the other screens you saw earlier, you can choose a different refresh frequency by choosing the appropriate value from the drop-down menu “View Data”. Alternatively, you can choose historical records by selecting Last 24 hours, or Last 7 days, etc.
Reports are similar to screens reporting metrics but with some important differences. Some reports could be better for viewing certain metrics but the most valuable use of reports is to compare metrics across several components at a glance.
From the main Enterprise Manager page, choose the Reports tab from the ribbon of tabs at the top. Click on the Report tab and scroll down several pages to a collection of reports named “Storage”, as shown below:
This section shows all the metrics you might have seen earlier, but with a major difference: The reports are organized as independent entities, which can report the data on any component -- not driven from the details screen of the component itself. For instance, click on the hyperlink Celldisk Performance. It brings up a screen asking you to choose the cell disk you are interested in rather than choosing the cell disk screen first and then going to the performance metrics of that cell disk. You can choose a cell disk from the list by clicking on the flashlight icon (the list of values) and it brings up a screen as shown below.
If you click on one of the hyperlinked cell disks Cell Disk, you will see the data for that cell disk.
Grid Disk Performance
As you learned in Part 2 of this series, the grid disks are built from the cell disks and then used for ASM diskgroups. Thus the performance of the grid disks will directly affect the performance of the ASM diskgroups, which in turn will affect database performance. So it’s important to keep a tab on the grid disk performance and examine it if you see any issues in the database. From the report menu shown earlier, go to Storage > CELL > Grid Disk Performance. It will ask you to pick a cell from the list. Choosing Cell 02 will bring up a screen similar to this.
The other important metric to examine is host interconnect performance:
Sometimes you wonder how one component such as a cell disk is doing compared to other cell disks. The comparison is a very neat feature in the Grid Control. First get to the page of the metric you are interested in:
Click on the hyperlink just below Related Links shown as Compare Objects Celldisk Name/Cell Name/Realm Name.
Then click on the OK button. The comparison will appear in the screen similar to the one shown below:
The comparison shows that cell disk CD_00_cell01 has a higher write throughput compared to the other two.
A realm is a set of components enclosed within a logical boundary. Realms are used to separate systems for different usage – for instance you may create a realm each for database A, database B, etc. By default there is a single realm. Since all the cells belong to the realm, metrics for a realm go across the cells and are not limited to the current cell alone.
Let’s see an example. If you pull up the LUN Configuration report, you will see the following:
It shows 28 records since there are 28 LUNs in this cell. The scope of this report is the current cell, not the others. When you pull up the Realm LUN Configuration, it will show the LUNs of all the cells, as shown below.
Note the difference: 392 rows instead of 28. All the LUNs from all the cells have been shown here.
Choosing realms for reports can allow you to get the picture for a specific realm immediately. If there is only one realm, then all the cells are pulled together in the report. Consider a very important report – Realm Performance, shown below.
It shows you the metrics across all the cells in that realm. This is an excellent way to quickly compare various cells to see if they are uniformly loaded.
On the same line of thought, you may want to compare performance of the components of the cells, not just the cells. Choosing Realm Celldisk Performance from the View Report dropdown menu shows the performance of the cell disks across all the cells, not just that cell.
However, with the sheer amount of data in the report (392 rows), it may not be as useful. A better option may be is to compare the Cell Disk #1 of all the cells. You can do that by using a filter in the Celldisk field.
This shows the performance of the first cell disks of all the cells. From the above output, it is clear that Cell 05 is seeing some action while the others are practically idle. Since you don’t operate at the cell level, it can be assumed that the data distribution may not be uniform. Cell #5 may happen to have the most number of storage blocks.
In this installment you learned how to examine metrics from the storage cells using two different interfaces: the command-line tool CellCLI and Enterprise Manager Grid Control. Both have their own appeals. While the visual advantage of the Grid Control is undeniable, the command line tool may be useful in developing scripts and repeatable processes. The Exadata Database Machine comes with a lot of predefined alerts that can be configured via thresholds to make sure they trigger when the metric values fall outside the pre-specified boundary conditions.
This brings us to the end of this series. I hope you developed enough skills in this series to make the transition from DBA to DMA easily and quickly!