by Randolf Geist
Maximizing the performance of the Parallel Execution feature in Oracle Database Enterprise Edition
In order to get the most out of Parallel Execution, the following prerequisites need to be met:
Efficient Parallel Execution plans were discussed in Part 1. In Part 2 we'll look at the remaining two points in detail.
There is a lot to say about this, but the following is important to understand:
In general, Parallel Execution makes sense only if the underlying system architecture or the specifics of the SQL statement allow saturating more than a single CPU. This is particularly relevant if the majority of the data needs to be read from disk: often the efficiency of Parallel Execution is severely limited by an I/O subsystem that cannot deliver the data fast enough, so that employing Parallel Execution simply leads to more processes waiting for data from disk rather than to a faster execution. Things can be different if the data can actually be served from the Buffer Cache, which by default is not the case with Parallel Execution: only segments smaller than the number of blocks defined by the _small_table_threshold parameter will be read via the Buffer Cache when using Parallel Execution. An exception to that rule is the newly introduced In-Memory Parallel Execution feature, which attempts to combine Parallel Execution with Buffer Cache usage, in particular when using multiple instances in a RAC cluster configuration, by caching different parts of the segment on different nodes.
I've come across systems where the disk throughput of a SQL statement execution could be scaled by using Parallel Execution although the serial execution wasn't CPU-constrained - at least not according to the available instrumentation - but mainly waiting for I/O. I usually wasn't able to determine the exact reason for this behaviour due to time constraints and access restrictions on such systems, but I want to point out that it makes sense to test this individually for your system architecture. It could simply be related to the fact that Parallel Execution usually employs direct path reads, which can make use of asynchronous I/O if that is available and configured properly.
Furthermore it is important to point out that there are cases where Parallel Execution allows increasing performance even when the disks don't normally provide sufficient throughput. This can be the case if complex expressions, for example based on regular expressions, or user-defined PL/SQL functions are in use that saturate a single CPU even with a moderate data flow. Other cases include the usage of compression, for example inserting data into a segment that is marked as compressed.
As outlined in Part 1, the optimizer generates an execution plan based on certain assumptions. In particular, the optimizer is pretty optimistic regarding the performance gain of performing an operation in parallel. Although recent releases include an artificial performance reduction factor of 10% in the cost calculation, the optimizer still assumes pretty linear scaling: for example, specifying a parallel degree of 8 for a full table scan ends up with a cost that is close to 12.5% of the serial cost. Except for very high-end hardware configurations it is, however, very unlikely to achieve such near-linear scaling with higher degrees of parallelism, so care should be taken when determining a sensible DOP.
Note that the default DOP is in most cases way too high, since on most platforms it assumes a DOP of two times the number of available CPUs for a single statement execution, which means that for non-trivial statements employing the Consumer / Producer model you'll end up with four times as many Parallel Slaves as CPUs. If that number of Parallel Slaves really starts consuming CPU (it would be sufficient for each slave to consume only 25% CPU on average), then you end up with 100% CPU usage of your whole configured system - according to the CPU_COUNT reported to Oracle by the operating system (since Oracle Database 11.2 you can set CPU_COUNT to a lower value for so-called "Instance Caging"). If your system for example reports 32 CPUs to Oracle, then the default DOP is determined by PARALLEL_THREADS_PER_CPU * CPU_COUNT * INSTANCES (for RAC). PARALLEL_THREADS_PER_CPU defaults to 2 on most platforms, so the default DOP per instance will be 64 in such a case. Note that you can set PARALLEL_THREADS_PER_CPU to a lower value, for example 1.
This means that per DFO there could be up to two Parallel Slave Sets active, each using 64 Parallel Slaves per RAC instance, resulting in 128 processes per instance for this single SQL execution. It should be verified that such a high DOP really speeds up processing as expected - you might find out that a much lower DOP actually provides better performance, or at least similar performance with much lighter resource consumption.
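A sensible, explicit DOP can be requested at the object level or per statement instead of relying on the default. A minimal sketch (the table name is made up for illustration):

```sql
-- Set a fixed object-level DOP of 4 (table name is hypothetical)
ALTER TABLE sales PARALLEL 4;

-- Or request the DOP per statement via the statement-level PARALLEL hint
SELECT /*+ PARALLEL(4) */ COUNT(*)
FROM   sales;

-- Or lower the factor used in the default DOP calculation instance-wide
-- from its usual default of 2 to 1
ALTER SYSTEM SET parallel_threads_per_cpu = 1;
```

With the settings above the 32-CPU example would arrive at a default DOP of 32 per instance instead of 64.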
But this means that you need to configure your database objects or use statement hints to use a sensible DOP instead of going with the default. Other options include introducing an artificial DOP limit via the Resource Manager and/or relying on the new Auto DOP feature (available since version 11.2).
The other point to consider in this regard is that at execution time there might be various reasons why the statement runs with a lower DOP than assumed by the optimizer.
This is particularly relevant when dealing with multiple, concurrent Parallel Executions.
In the past you had limited choices:
Since Oracle Database 11.2 there is the new Parallel Statement Queuing feature that attempts to address this problem by starting to queue concurrent Parallel Executions if insufficient Parallel Slaves are available (this is configurable via PARALLEL_SERVERS_TARGET). As soon as the required number of Parallel Slaves is available the statement is executed with the intended DOP rather than being downgraded. This should lead to a much better usage of the available system resources.
Note that this feature is by default only available together with the new Auto DOP feature that automatically determines a Degree Of Parallelism (Auto DOP).
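As a sketch, both features are enabled together via the PARALLEL_DEGREE_POLICY parameter, and the queuing threshold is controlled by PARALLEL_SERVERS_TARGET (the value 64 below is just an illustrative choice):

```sql
-- Enable Auto DOP and, with it, Parallel Statement Queuing (11.2)
ALTER SYSTEM SET parallel_degree_policy = AUTO;

-- Queuing kicks in once the number of busy Parallel Slaves would
-- otherwise exceed this value
ALTER SYSTEM SET parallel_servers_target = 64;
```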
As already mentioned, there are other possibilities, why an execution might not use the degree determined by the optimizer, for example restrictions imposed by an active Resource Manager plan could lead to such a downgrade. Note that from 11.2 on the optimizer can work together with the Resource Manager to avoid such a situation, but not all scenarios are covered yet.
Other possibilities include concurrent DDL operations that invalidate cursors while the Parallel Query Coordinator acquires Parallel Slaves. Since some of the Parallel Slaves might be forced to re-optimize the invalidated cursor, it is possible that the so-called "signature" of the cursor used by a Parallel Slave doesn't match the signature of the cursor used by the Parallel Query Coordinator, in which case the Parallel Slaves cannot be acquired. Oracle usually tries to cope with that situation by silently restarting the operation, but sometimes you might end up either with error ORA-12842 or with a serial execution of the statement, depending on the version and the exact circumstances.
So the upshot of this point is that the runtime engine needs to follow the execution plan generated by the optimizer, and the optimizer based its decisions on a particular DOP.
If the DOP at execution time differs significantly from the DOP at optimization time, it is not unlikely that the resulting runtime might be much longer than expected, possibly even longer than a corresponding serial execution.
This can be caused by various factors:
It is therefore crucial to determine the actual DOP used at execution time in order to understand whether a suddenly long-running query might have been caused by such downgrades. The actual DOP and the number of Parallel Slaves involved are easy to find out while a statement executes, even without the Real-Time SQL Monitoring feature, since you can simply query the available V$PX* views: for example, (G)V$PX_SESSION contains the QC* columns to identify the Query Coordinator, and the SERVER_GROUP, REQ_DEGREE and DEGREE columns allow answering that question easily.
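A sketch of such a query against the GV$PX_SESSION view while the statement is still running:

```sql
-- Requested vs. granted DOP per Query Coordinator and slave set,
-- queried while the Parallel Execution is active
SELECT qcinst_id
     , qcsid
     , server_group
     , server_set
     , req_degree              -- the DOP requested
     , degree                  -- the DOP actually granted
     , COUNT(*) AS slave_count -- slaves sampled for this set
FROM   gv$px_session
WHERE  req_degree IS NOT NULL
GROUP  BY qcinst_id, qcsid, server_group, server_set, req_degree, degree
ORDER  BY qcinst_id, qcsid, server_group, server_set;
```

A DEGREE lower than REQ_DEGREE immediately reveals a downgrade for that execution.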
But if you need to do a post-execution analysis these views won't be of any help. Even with a Tuning Pack license the corresponding Real-Time SQL Monitoring report might already have been aged out of the Shared Pool and therefore no longer be available. And as already mentioned, for more complex Parallel Execution plans that involve multiple DFOs, Real-Time SQL Monitoring only reports the highest DOP used and doesn't directly tell you the DOP used for each DFO - although by looking at the various details provided (the "Executions" column in the "Plan Statistics" tab, or the number of Parallel Slaves per "Parallel Group" in the "Parallel" tab) you might be able to work it out.
We will therefore look at different approaches to such a post-execution analysis based on the Active Session History information available with a Diagnostic Pack license (a Tuning Pack license is not required for that purpose).
The last point is also important: even if the execution is able to run with the expected DOP at runtime, a Parallel Execution will only scale as expected if the work is distributed as evenly as possible among the Parallel Slaves involved.
However, there are scenarios where this distribution can be skewed, leading to unbalanced usage of the involved Parallel Slaves and typically leaving many or even most of them idle. Another possible scenario is "temporal" skew, where the data arrives at the Parallel Slaves in such a way that only a few of them are actually active at the same time, but in the end they've done a similar amount of work and all complete at around the same time. I won't cover this scenario here because in my experience it is far less common than unbalanced data distribution among the Parallel Slaves.
The most straightforward way to determine the distribution of data across the Parallel Slaves is the view V$PQ_TQSTAT. There are various problems with that view, however:
So for simple Parallel Execution plans the view can be helpful during development and makes it pretty straightforward to determine whether the data distribution among the Parallel Slaves was skewed or not.
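For example, run in the same session immediately after the parallel statement has completed, something like:

```sql
-- Rows and bytes exchanged per Table Queue and Parallel Slave; the view
-- is only populated in the session that executed the parallel statement
SELECT dfo_number
     , tq_id
     , server_type
     , num_rows
     , bytes
     , waits
     , timeouts
     , process
FROM   v$pq_tqstat
ORDER  BY dfo_number, tq_id, server_type DESC, process;
```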
Here is an example output of a skewed Parallel Execution:
DFO_NUMBER      TQ_ID SERVER_TYP   NUM_ROWS      BYTES  OPEN_TIME AVG_LATENCY      WAITS   TIMEOUTS PROCESS
---------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ---------- ----------
         1          0 Producer       500807   54396692          0           0         20          6 P008
         1          0 Producer       499193   54259853          0           0         10          2 P009
         1          0 Consumer       500038   54332349          0           0       2876       1284 P010
         1          0 Consumer       499962   54324196          0           0       2861       1230 P011
         1          1 Producer       499826   57267724          0           0         20          5 P008
         1          1 Producer       500174   57357253          0           0          9          1 P009
         1          1 Consumer       499933   57304789          0           0       2467       1208 P010
         1          1 Consumer       500067   57320188          0           0       2481       1069 P011
         1          2 Producer       462069   52963939          0           0         19          7 P010
         1          2 Producer       537931   61681312          0           0         22          9 P011
         1          2 Consumer       500212   57346574          0           0         53         13 P008
         1          2 Consumer       499788   57298677          0           0        346        160 P009
         1          3 Producer       500038   57337041          0           0          0          0 P010
         1          3 Producer       499962   57328456          0           0          0          0 P011
         1          3 Consumer            0         48          0           0       2586       1200 P008
         1          3 Consumer      1000000  114665449          0           0       2611       1261 P009
         1          4 Producer            0         24          0           0          0          0 P008
         1          4 Producer      1000000  116668401          0           0          0          0 P009
         1          4 Consumer      1000000  116668425          0           0       1018        435 QC
So as you can see, for the so-called "Table Queue" TQ_ID = 3 the data was not evenly distributed. You can find the corresponding operation in the execution plan by looking at the TQ column.
There are various reasons for uneven work distribution among Parallel Slaves, but they have one thing in common: they are always related to the data or its distribution. Although some of the patterns could be recognized by the optimizer, I believe that at present this information is not used as part of the execution plan generation.
Depending on the PX ITERATOR used (block or partition), it might simply be that the PARTITION granule was favoured by the optimizer but the data is not evenly distributed across the partitions.
Another common reason is that the HASH distribution method leads to an uneven distribution of data, simply because the key used to build the hash isn't uniformly distributed or has a very low number of distinct values. This can happen even with originally evenly distributed key data in the case of massive outer joins.
But no matter how the data is distributed, you might be unlucky in that only some of the Parallel Slaves produce data to be processed further: all subsequent plan steps that are executed by that slave set without re-distribution will be affected - so sometimes it is desirable to re-distribute the data to achieve a better work distribution.
Since the straightforward skew determination using the V$PQ_TQSTAT view is often not available, determining the skew and the reason for it is a non-trivial task, in particular if you need to do a post-execution analysis (for example: why did statement XYZ take that long last night?).
If a Diagnostic Pack license is available, the Active Session History information can be used for shedding some light on that distribution issue and trying to answer some other interesting questions.
If you have a Diagnostic Pack license and are already using Oracle Database 11g, the new columns SQL_PLAN_OPERATION, SQL_PLAN_LINE_ID and SQL_PLAN_OPTIONS, along with SQL_EXEC_ID and SQL_EXEC_START, in the Active Session History views allow a thorough post-execution analysis of a particular SQL statement execution instance:
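A simplified sketch of such an analysis: each ASH sample is joined to its execution plan line, and the DFO is derived from the plan's OBJECT_NODE column (assuming values of the form ':Q1001', where "Q1" identifies the DFO). The bind variables are placeholders identifying one particular execution instance; a production-grade query would need to handle more edge cases.

```sql
-- Approximate per-DFO activity from ASH: count samples and distinct
-- slave sessions per DFO; the degree can then be estimated by dividing
-- the slave count by the number of slave sets active for that DFO
SELECT ash.inst_id
     , SUBSTR(p.object_node, 2, LENGTH(p.object_node) - 4) AS dfo
     , COUNT(*)                       AS sample_count
     , COUNT(DISTINCT ash.session_id) AS slave_count
FROM   gv$active_session_history ash
       JOIN gv$sql_plan p
         ON  p.inst_id      = ash.inst_id
         AND p.sql_id       = ash.sql_id
         AND p.child_number = ash.sql_child_number
         AND p.id           = ash.sql_plan_line_id
WHERE  ash.sql_id          = :sql_id
AND    ash.sql_exec_id     = :sql_exec_id
AND    ash.sql_exec_start  = :sql_exec_start
AND    ash.qc_session_id IS NOT NULL   -- Parallel Slave samples only
AND    p.object_node LIKE ':Q%'
GROUP  BY ash.inst_id, SUBSTR(p.object_node, 2, LENGTH(p.object_node) - 4)
ORDER  BY 1, 2;
```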
Of course, this assumes that sufficient samples of the Parallel Slaves are available; otherwise the information will be misleading, potentially underestimating the degree.
As already outlined above, the DOP per DFO is not that easy to extract from the Real-Time SQL Monitoring output - the "General Information" section only provides the maximum DOP detected.
If I run this query on ASH data for the sample execution plan described in Part 1 on a single node that has two DFOs that are not active at the same time, the following output will be generated:
   INST_ID DFO    SAMPLE_COUNT SLAVE_COUNT  SET_COUNT ASSUMED_DEGREE
---------- ------ ------------ ----------- ---------- --------------
         1 Q1               59           2          1              2
         1 Q2              119           8          2              4
The ASH data provides a strong indication that the two DFOs in fact used a different Parallel Degree (at execution time this can also be confirmed by looking at (G)V$PX_SESSION).
If you have a Tuning Pack license you can use Real-Time SQL Monitoring to determine whether there is a skew. The most useful tabs of the Active report are Activity, Parallel and Metrics; however, Real-Time SQL Monitoring doesn't tell you which execution plan operations are affected by the skew.
Let's have a look at each of them for an execution that suffers from skew. The Activity tab provides the following information (Figure 7).
Here it can be seen that the expected Parallel Degree was reached only for a short period at the beginning of the execution (32 active sessions on CPU in this case); after that there is a long tail where only a single CPU is active.
The Parallel tab reveals the following (Figure 8).
Here we can see the problem confirmed: a single Parallel Slave had to do much more work than the remaining ones.
The Metrics tab confirms the same in this case, since it is a CPU-bound operation and we would expect it to make use of 32 CPUs (Figure 9).
We can even use the new Advanced Analytics option and generate a nice graph using R based on the ASH data that shows the activity of the different Parallel Slaves along the different execution plan operation IDs (Figure 10).
Again it becomes clearly visible at which plan operation the work was not distributed as expected.
If you have determined such a skewed execution, depending on which operations are affected in the execution plan, you can attempt to improve the situation by choosing different Parallel Execution distribution methods.
As outlined in Part 1, the PQ_DISTRIBUTE hint can be used to influence the distribution for joins and load operations. However, you cannot yet directly influence the distribution used for other, non-join operations like aggregations, sorts or analytic functions.
A quite common scenario is skew caused by a HASH re-distribution. This is possible even if the underlying data itself is not skewed, in the case of outer joins: if the data gets re-distributed on join keys that become NULL due to the outer join, all rows with NULL values will be directed to the same Parallel Slave. So if you end up with a massive outer join and re-distribute the data based on the resulting massive number of NULLs, the distribution will become skewed.
In this case using a BROADCAST distribution can improve performance significantly; however, you need to be aware of the effects of the BROADCAST distribution as already outlined above.
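A sketch of forcing a BROADCAST distribution for such an outer join via the PQ_DISTRIBUTE hint (the table and column names are made up for illustration):

```sql
-- Hypothetical example: avoid HASH distribution on a join key that is
-- mostly NULL after the outer join by broadcasting the smaller row
-- source to every Parallel Slave instead
SELECT /*+ PARALLEL(4)
           LEADING(t1)
           USE_HASH(t2)
           PQ_DISTRIBUTE(t2 BROADCAST NONE) */
       t1.id
     , t2.some_col
FROM   t1
       LEFT OUTER JOIN t2
         ON t2.id = t1.fk_id;
```

Keep in mind that BROADCAST replicates the entire row source to every Parallel Slave, so it only pays off when that row source is reasonably small.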
The Parallel Execution option of the Oracle Database Enterprise Edition is a fascinating piece of technology. This article shed some light on the basic building blocks of this feature and hopefully offered some helpful insights into how Parallel Execution works, why it might not behave as expected, and what you can do about it.