Using Extended Statistics to Optimize Multi-Column Relationships and Function Based Statistics
Using Extended Statistics to Optimize Multi-Column
Relationships and Function-Based Statistics
The goal of this tutorial is to show you how to use extended
statistics to optimize multi-column relationships and function-based statistics.
Approximately 30 minutes
This tutorial covers the following topics:
Place
the cursor over this icon to load and view all the screenshots for this tutorial.
(Caution: This action loads all screenshots simultaneously, so response time
may be slow depending on your Internet connection.)
Note: Alternatively, you can place the cursor over an individual icon in the following steps to load and view only the screenshot associated with that step. You can hide an individual screenshot by clicking it.
In real-world data there is often a correlation between two
or more columns in a table. For example, job title and salary are related (the
VP of a company is likely to earn a lot more than the janitor does) or car make
and price (a BMW is likely to be a lot more expensive than a Honda). Up until
now, the optimizer has had no way of knowing that these relationships exist
between the columns in a table. When a query was executed against the table
with multiple, single column predicates, it was impossible for the optimizer
to calculate the correct selectivity of these predicates as it had no way of
knowing if the columns were related or not.
It has also been extremely hard for the optimizer to calculate the correct selectivity
for a column that has a function apply to it. For example, UPPER(surname)=.SMITH.
In Oracle Database 11g, extended statistics (multi-column statistics)
have been introduced, which allow you to collect statistics on a group of columns
as a whole and on functions, thus enabling the optimizer to calculate the selectivity
of these predicates correctly. As a result, the optimizer is aware of the correct
selectivity (cardinality). This tutorial demonstrates why you need extended
statistics and how you can create them.
Back to Topic List
Before starting this tutorial, you should first complete the
following steps:
| 1. |
Install Oracle Database 11g.
|
| 2. |
Make sure the SH user
is unlocked. Open a terminal window and execute the following:
sqlplus sys/<syspassword> as sysdba
alter user sh identified by sh account unlock;
exit
|
| 3. |
Download and unzip the multicolstats.zip
file into a working directory.
|
| 4. |
You need to create and load the data for the CUSTOMERS_OBE
table. From your terminal window, execute the following:
cd <sql_files_directory>
imp sh/sh file= customers_obe.dmp log=imp.log full=y
|
Back to Topic List
A good example of correlated or related columns are the country_id
and cust_state_province columns in the CUSTOMERS_OBE
table. When the value of cust_state_province is
'CA' , the value of country_id is 'US'. There is
a skew in the data in these two columns, which means the majority of rows in
the table have the values 'CA' and 'US'. Both the relationship between the columns
and the skew in the data make it difficult for the optimizer to calculate the
selectivity or cardinality of these columns correctly when they are used together
in a query. Extended statistics should help in this situation. Let's begin
by checking what the actual cardinality or number of rows in the CUSTOMERS_OBE
table is when the country_id is 'US' and cust_state_province
is 'CA'. Perform the following steps:
| 1. |
You first want to check what the actual cardinality or number of rows in the CUSTOMER_OBE table is when the country_id is 'US' and cust_state_province is 'CA'. Open a terminal window and execute the following commands:
CD<sql_files_directory>
sqlplus sh/sh
set echo on
@check_cardinality
The check_cardinality.sql code is as follows:
select count(*) from customers_obe where country_id = 'US' and cust_state_province = 'CA';
|
| 2. |
Now that you know the actual number of rows returned, you can gather statistics on the table. From your terminal window, execute the following SQL script:
@gather_stats
The gather_stats.sql code is as follows:
exec dbms_stats.gather_table_stats(null,'customers_obe', - method_opt => 'for all columns size 1');
|
| 3. |
Review the column statistics that were generated. From
your terminal window, execute the following SQL scripts:
@review_stats
The review_stats.sql code is as follows:
select column_name, num_distinct, histogram from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';
You see that the number of distinct values are for both
the country_id and cust_state_province
columns. Both of these values look accurate.
|
| 4. |
Given these statistics, check the number of rows the optimizer estimates will be returned by the query. From your terminal window, execute the following SQL script:
@explain_plan
The explain_plan.sql code is as follows:
explain plan for select * from customers_obe where country_id = 'US' and cust_state_province = 'CA';
select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));
With just basic statistics, the optimizer thinks there
will only be 1 row returned. Because you know that this is not true, you
need to provide better statistics to the optimizer so it can determine
the correct row count.
|
Back to Topic List
Prior to Oracle Database 11g, there was no way to tell
the optimizer what the correct combined selectivity is for multiple columns.
The only insight given to the optimizer was that there was a data skew in both
of these columns. By making the optimizer aware of the data skew in Oracle Database
11g, a more accurate selectivity can be calculated. In Oracle Database
11g, you can gather histograms on the skewed columns. After you have
histograms on the columns, you can recheck the optimizer's estimate for the
number of rows. Perform the following steps:
| 1. |
Gather a histogram on the skewed columns. From your terminal window, execute the following SQL script:
@gather_histogram
The gather_histogram.sql code is as follows:
exec dbms_stats.gather_table_stats(null,'customers_obe', - method_opt => 'for all columns size skewonly');
select column_name, num_distinct, histogram from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';
|
| 2. |
Now you can generate the explain plan to see what the optimizer estimates are now. From your terminal window, execute the following SQL script:
@explain_plan
The explain_plan.sql code is as follows:
explain plan for select * from customers_obe where country_id = 'US' and cust_state_province = 'CA';
select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));
There is a slight improvement in the estimate because
there are histograms for individual columns. However, the optimizer is
still not aware that there is a relationship or correlation between the
two columns.
|
Back to Topic List
In Oracle Database 11g, the optimizer can learn about
this correlation by creating extended statistics for the columns. Extended statistics
means the next time you gather statistics on the CUSTOMERS_OBE
table, an extra set of statistics for the combined group of country_id
and cust_state_province is gathered. Perform the
following steps:
| 1. |
Create the extended statistics group. From your terminal window, execute the following SQL script:
@create_extended_stats
The create_extended_stats.sql code is as follows:
select dbms_stats.create_extended_stats(null,'customers_obe', '(country_id, cust_state_province)') from dual;
The output of create_extended_stats.sql
is a system-generated name for the virtual column that gets created.
|
| 2. |
Execute the script you ran previously to gather a
histogram on the skewed columns. From your terminal window, execute the
following SQL script:
@gather_histogram
The gather_histogram.sql code is as follows:
exec dbms_stats.gather_table_stats(null,'customers_obe', - method_opt => 'for all columns size skewonly');
select column_name, num_distinct, histogram from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';
You should see a new column listed for the table that
has a system-generated name, which was returned by the create_extended_stats
procedure.
|
| 3. |
Rerun the explain plan. From your terminal window, execute the following script:
@explain_plan
The explain_plan.sql code is as follows:
explain plan for select * from customers_obe where country_id = 'US' and cust_state_province = 'CA';
select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));
Now the optimizer gets the correct number of rows for the query.
|
Back to Topic List
The optimizer can also have trouble finding the right cardinality
when there is a function or expression used on a column. Take the LOWER
function for example; this function takes a string or a character column and
returns that string in all lowercase letters. Perform the following steps:
| 1. |
You want to run a query using a function that selects
the total number of rows who have a country_id
equal to 'US'. The value passed to the query is in lowercase so the LOWER
function to the column country_id is specified.
From your terminal window, execute the following script:
@get_count_lower
The get_count_lower.sql code is as follows:
select count(*) from customers_obe where lower(country_id) = 'us';
The actual number of rows is returned.
|
| 2. |
Now you can run the explain plan again to see how many rows the optimizer thinks this query will return. From your terminal window, execute the following script:
@explain_plan_lowercase
The explain_plan_lowercase.sql code is as follows:
explain plan for select * from customers_obe where lower(country_id) = 'us';
select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));
The optimizer did not get close to the right number of rows.
|
| 3. |
If you create extended statistics for the expression
LOWER(country_id),
you can help the optimizer get the correct number of rows. In the standard
gather_table_stats procedure, the method_opt
argument allows you to specify the extension or the extended statistic
you want to collect. The gather_table_stats
procedure creates the extended statistics automatically if it does not
exist. From your terminal window, execute the following script:
@gather_stats_lower_col
The gather_stats_lower_col.sql code is as follows:
exec dbms_stats.gather_table_stats(null,'customers_obe', - method_opt => - 'for all columns size skewonly for columns (lower(country_id))');
|
| 4. |
Now that you created the extended statistic and gathered
statistics, you can look at the statistics for the CUSTOMERS_OBE
table. You should see a new column with a system-generated name. From
your terminal window, execute the following script:
@review_col_stats
The review_col_stats.sql code is as follows:
select column_name, num_distinct, histogram from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';
|
| 5. |
Rerun the explain plan to see what the optimizer estimates. From your terminal window, execute the following script:
@explain_plan_lowercase
The explain_plan_lowercase.sql code is as follows:
explain plan for select * from customers_obe where lower(country_id) = 'us';
select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));
|
| 6. |
You can also regenerate the virtual column definition
from the system-generated name. From your terminal window, execute the
following command:
select dbms_metadata.get_ddl('TABLE','CUSTOMERS_OBE') from dual;
|
Back to Topic List
Now you can cleanup your environment by performing the following steps:
| 1. |
From your SQL*Plus session, execute the following command:
DROP TABLE CUSTOMERS_OBE;
|
Back to Topic List
In this tutorial, you learned how to:
 |
Determine single column statistics |
 |
Gather histograms on skewed columns |
 |
Create extended statistics to correlate columns |
 |
Create extended statistics for a function used on a column |
Back to Topic List
|