Beginning Performance Tuning: Diagnose the PastBy Arup Nanda
Take the next step in diagnosing Oracle Database performance issues.
In “Beginning Performance Tuning,” in the July/August 2012 issue of Oracle Magazine, I demonstrated the basics of diagnosing performance issues in Oracle Database. In this article, I will show how to take performance tuning to the next level—identifying patterns and measuring resource utilization. As was the case with the previous article, you can create this article’s examples by executing the setup script—setup.sql.
Recall from “Beginning Performance Tuning” that you can identify why a session is slow by looking at what event it is waiting for right now in the V$SESSION view. This technique is effective when you are able to peek into the view while the issue is occurring, but what if a user reports that there was an increase in the response time of a session but it is back to normal now? Because the issue is no longer present, the session may appear to be working productively with the STATE column value of WAITED KNOWN TIME. What’s worse, the current wait event shown in the V$SESSION view may be something unrelated to what the user encountered earlier. To diagnose this issue, you need to know all the different events the session had to wait for—not just what it is waiting for now. How difficult is it to get that information?
Fortunately, it’s quite trivial. There is another view—V$SESSION_EVENT—that shows the details of all the wait events the session has ever waited for. Here are the important columns in that view:
Let’s examine the usage of the view with an example. Make sure you have set up the example as described in “Test Case Setup.” Connect to the Oracle Database instance as ARUP, and execute the test1.sql script, also available in “Test Case Setup.” (Note that the script will take some time to complete). While that script is running, connect as SYS from a different session and find the SID of the ARUP session as follows:
select sid from v$session where username = 'ARUP'; SID ————— 37
Now display the session events of SID 37 by executing the script in Listing 1. Because the V$SESSION_EVENT columns report time in centiseconds, I have multiplied them by 10 to show them in milliseconds (ms), a unit of measure that is a bit more familiar. Study the output carefully—it shows the various events the session waited for earlier, not what it is experiencing now. You can see that the session has waited for one event in particular: “kfk: async disk IO.” The session has waited 904,818 times for this event, but it has waited for a total of only 3,050 ms. The average wait shows zero, but it is simply because the average wait was so short that it couldn’t be shown within the two digits after the decimal point. So although the total number of waits for this event was high, it added only 3,050 ms to the overall time of the session—an insignificant number. You can thus rule this wait event out as a cause of delay for the session.
Code Listing 1: History of wait events in a specific session
set lines 120 trimspool on col event head "Waited for" format a30 col total_waits head "Total|Waits" format 999,999 col tw_ms head "Waited|for (ms)" format 999,999.99 col aw_ms head "Average|Wait (ms)" format 999,999.99 col mw_ms head "Max|Wait (ms)" format 999,999.99 select event, total_waits, time_waited*10 tw_ms, average_wait*10 aw_ms, max_wait*10 mw_ms from v$session_event where sid = 37 / Total Waited Average Max Waited for Waits for (ms) Wait (ms) Wait (ms) —————————————————————————— ———————————— ——————————— —————————— ————————— Disk file operations I/O 8 .00 .10 .00 KSV master wait 2 350.00 173.20 340.00 os thread startup 1 20.00 19.30 20.00 db file sequential read 5 160.00 32.10 70.00 direct path read 1,521 51,010.00 33.50 120.00 direct path read temp 463,035 513,810.00 1.10 120.00 direct path write temp 20 370.00 18.70 50.00 resmgr:cpu quantum 21 520.00 24.60 110.00 utl_file I/O 8 .00 .00 .00 SQL*Net message to client 20 .00 .00 .00 SQL*Net message from client 20 9,620.00 481.20 9,619.00 kfk: async disk IO 904,818 3,050.00 .00 .00 events in waitclass Other 35 20.00 .70 20.00
In this case, you should focus your attention on the event that caused the session to wait for the maximum time. From the output in Listing 1, you can see that session 37 (SID = 37) waited 513,810 ms, or more than 8.5 minutes, for the “direct path read temp” event. Every time the session waited for this event, it waited 1.1 ms on average, so if you can reduce the time for this event, you can reduce the overall time for the session. Looking at the session event history enables you to identify the biggest contributors to the delay in the session, whether they are currently affecting the session or have already affected it.
Did you notice the column for the maximum time waited: “Max Wait (ms)”? Why is that information useful? You see, the average wait time does not tell the whole story. Consider the “SQL*Net message from client” event in the output in Listing 1. The session waited 20 times with an average wait time of 481 ms for that event. Does that mean that the session waited for approximately 481 ms at each of the 20 occurrences or that the session waited a very short time for most of the event instances and a very long time for one event? The latter will skew the average to a high value but will indicate an isolated issue rather than a persistent one—the two possibilities lead to very different conclusions.
The “Max Wait (ms)” column shows the maximum time the session had to wait for one occurrence of this event. The value here is 9,619 ms, and because the total wait time was 9,620 ms, it appears that the session waited 9,619 ms on one occasion, leaving a 1 ms combined total for the other 19 occasions—a very small wait each time. Considering the single incidence of a large wait, this event should not be a general cause of concern. On the other hand, had you seen a maximum time close to the average time, you could surmise that all occurrences had to wait about the same amount of time. In such a case, reducing the time for this event would likely apply uniformly to all occurrences and consequently reduce the overall elapsed time.
Although the V$SESSION_EVENT view shows what the session waited for earlier, it doesn’t show when. That information is visible in another view—V$ACTIVE_SESSION_HISTORY (part of the extra-cost Oracle Diagnostics Pack)—which is beyond the scope of this article.
Although wait events are great for helping with understanding the speed bumps the sessions experience, they do not show another important attribute of sessions: the use of resources such as CPU, I/O, and memory. A resource-hogging session deprives other sessions of the same resources, thus causing performance issues. When the root of the problem is that the session is consuming too much CPU, you should look at resource consumption—not the events waited for—by a session. Fortunately, finding that information is quite simple: it’s visible in a view named V$SESSTAT, which has three columns:
Let’s see how to use this information with a performance problem example.
Suppose you’ve heard from several users that the performance is terrible across the board. Further, the UNIX system administrator has reported that both CPU and memory consumption are very high on the server and most of the consumption is by Oracle Database–owned processes. Sound familiar? From my experience, it’s the second-most-common performance issue in Oracle Database systems. You can reproduce this problem by running the test1.sql script, available in this article’s example scripts, as user ARUP.
To diagnose this resource issue, review the top resource-consuming processes by issuing the top command at the UNIX command prompt. Listing 2 shows the output of the top command.
Code Listing 2: Output of the top command
top - 10:56:49 up 18 days, 18:48, 4 users, load average: 1.02, 0.92, 0.48 Tasks: 180 total, 2 running, 178 sleeping, 0 stopped, 0 zombie Cpu(s): 49.8%us, 0.5%sy, 0.0%ni, 49.2%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1815256k total, 1771772k used, 43484k free, 66120k buffers Swap: 2031608k total, 734380k used, 1297228k free, 747740k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 5946 oracle 25 0 706m 177m 159m R 100 10.0 9:20.26 oracle 6104 oracle 15 0 2324 1060 800 R 1 0.1 0:00.12 top 31446 oracle 15 0 688m 135m 129m S 0 7.7 0:08.24 oracle … output truncated …
In the output in Listing 2, you can see that the process with ID 5946 consumes the most CPU (100 percent) and memory (10 percent) and therefore should be the focus of your attention. To find out more about the process, enter the following command at the UNIX prompt:
$ ps -aef|grep 5946 oracle 5946 5945 63 10:59 ? 00:01:52 oracleD112D2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
The output shows the entire description of the process, which is clearly an Oracle “server process”—a process that is created by Oracle Database when a session is established—and that the process has been running for 1 minute and 52 seconds. The next question, then, is which Oracle Database session this process was created for. For that, you should look into another view—V$PROCESS—where the SPID column shows the server process ID. However, this view does not show the session information, so you need to join this view with the familiar V$SESSION view, as follows:
select sid from v$session s, v$process p where p.spid = 5946 and s.paddr = p.addr; SID ——— 37
Once you know the SID, you can get everything you need to know about the session—the user who established the session, the machine it came from, the operating system user, the SQL it is executing, and so on—from the V$SESSION view. This script for getting the information was described in Listing 5 in “Beginning Performance Tuning,” in the July/August 2012 issue of Oracle Magazine. To find the SQL being run by session 37, use this query:
select sql_fulltext from v$sql l, v$session s where s.sid = 37 and l.sql_id = s.sql_id;
Here is the output:
select max(test1.owner) from test1, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2, test2;
This SQL is performing multiple Cartesian joins, so it’s no wonder it’s consuming so much CPU and memory.
Now that you have found the culprit, should you just go ahead and kill the session to release the CPU consumption? Not so fast. First, you need to know if the CPU consumption was recent or if the session has been chewing it up since the beginning. This is where the V$SESSTAT view comes in very handy—it shows the resource consumption (CPU in this case) by a specific session. To find out the CPU used by session 37, you would use the following query:
select s.value from v$sesstat s, v$statname n where s.sid = 37 and n.statistic# = s.statistic# and n.name = 'CPU used by this session'; VALUE ————— 47379
The output shows the number of CPU “ticks” that have been consumed by this session since it started. Considering that this session has been running for about two minutes, the CPU consumption is pretty high, so it is likely that this session has been consuming CPU all the time. Again, checking the session’s other details, such as the SQL it is executing, makes it fairly easy to understand why this is the case: the session is performing a multiple-table Cartesian join, which is bound to consume a lot of CPU. At this point, you may decide to kill the session to stop the high CPU consumption, or you may decide to let it run for now and fix the SQL later to avoid the Cartesian product.
Let’s revisit the current problem by checking the CPU consumption once again with this:
select s.valuez from v$sesstat s, v$statname n where s.sid = 37 and n.statistic# = s.statistic# and n.name = 'CPU used by this session'; VALUE ————— 69724
Now the result—the CPU used—is 69,724. Note that this number is larger than the number the last time I checked CPU usage (47,379). This is because the statistic value increases over time. The first time I checked CPU usage, I surmised that the multitable Cartesian product was to blame for that CPU consumption—but can I prove it? The answer is a resounding “Yes, I can, using other statistics.”
Listing 3 shows the query for gathering all the statistics for session 37. In the output, note the “table scan rows gotten” statistic value: 1.0236E+10—about 10 billion rows! This is indeed a very high number of rows to be accessed by one session in two minutes. The value for the “consistent gets” statistic is 25,898,543—about 25.9 million blocks read from the buffer cache. The high number of buffer gets takes up a considerable amount of CPU.
Code Listing 3: All session statistics
select name, value from v$sesstat s, v$statname n where sid = 37 and n.statistic# = s.statistic# order by value desc / NAME VALUE ——————————————————————————————— ——————————————— table scan rows gotten 1.0236E+10 session logical reads 25898547 consistent gets 25898543 table scan blocks gotten 25325165 session pga memory max 21250020 session pga memory 21250020 session uga memory max 20156552 session uga memory 20156552 bytes sent via SQL*Net to client 878760 recursive calls 576848 opened cursors cumulative 143367 parse count (total) 143292 parse count (hard) 143118 table scans (short tables) 143086 sql area evicted 141996 DB time 70007 CPU used by this session 69724 … output truncated …
Another cause of CPU consumption is the parsing of SQL statements. In the output in Listing 3, note the “parse count (total)” statistic, a very high number at 143,292. In plain English, it means that the session had to parse—not just execute—SQL statements that many times in about two minutes, which is quite unusual. Examine the SQL statement executed by the session (shown in the script test2.sql script in “Test Case Setup”). You can see that it creates distinct literal SQL statements. Each literal SQL statement needs to be parsed, which is something you have confirmed by looking at the parse count. Therefore, you surmise from the Listing 3 output that there are two causes of high CPU usage for this session: a high number of buffer gets and a high number of parses.
In Listing 3, also note two other sets of statistics: “session pga memory max” and “session uga memory max,” which indicate the total memory consumed by the session. The very high numbers explain the high memory consumption by the Oracle server process that you noticed in the output of the operating system top command earlier. If you want to reduce the CPU and memory consumption of the server, you need to make sure the session consumes fewer of these resources, by appropriately modifying the SQL statement it issues.
Occasionally you may have a performance issue that will not appear as clearly at the OS level as CPU and memory consumption. One such case is redo generation by the database instance, which, in turn, increases both the rapid switching of redo logs and the creation rate and number of archived logs. This may cause an increase in overall I/O on file systems (or Oracle Automatic Storage Management disk groups), causing a systemwide performance issue. To alleviate this type of issue, you need to locate the session or sessions that caused the generation of high amounts of redo, but looking at OS metrics will not provide any insights into the offending session. In this case, you need to look at the sessions responsible for most of the load: the sessions generating maximum redo. Again, this information is available quite easily in the same V$SESSTAT view. The following query shows the sessions generating the most redo:
select sid, value from v$sesstat s, v$statname n where n.statistic# = s.statistic# and n.name = 'redo size' order by value desc; SID VALUE ———— ———————— 13 11982752 10 3372240 17 964912 26 571324 … output truncated …
It’s clear from the output that SID 13 produced most of the redo, followed by SID 10, and so on.
The Setup sidebar describes how to find the SQL statement that is generating the redo, provides a replacement SQL statement, and tests that new statement for redo generation.
These are just a few of the 604 such statistics available in the V$SESSTAT view. It is not possible to describe each one of them, but I hope you get the general idea of how to use statistics to see the various resources used by a session and focus on performance issues. (Note that there is another view—V$SYSSTAT—that shows the statistics of the entire instance.)
In this article, you learned about two very important sources of performance tuning information in Oracle Database: (1) the history of wait events experienced by sessions—visible in V$SESSION_EVENT, and (2) the resource consumption by session—visible in V$SESSTAT. From the history, you can find out why a session has waited in the past and for how long—very useful information for diagnosing performance issues when they are no longer present. The resource statistics show the consumption of various resources such as CPU, memory, and redo by a specific session—very useful for focusing on resource-hogging sessions. Using these two views and scripts mentioned in this article, you can resolve many performance issues encountered in Oracle Database instances. Happy tuning!
Arup Nanda (email@example.com) has been an Oracle DBA for more than 16 years, handling all aspects of database administration, from performance tuning to security and disaster recovery. He was Oracle Magazine’s DBA of the Year in 2003.
Send us your comments