This article describes how Oracle Database constructs a global histogram, an issue with that process, and how to correctly identify the top-n frequent values when creating a global histogram.
In Oracle Database 11g and 12c, the process of gathering statistics has been refined. Oracle Database can gather basic column statistics incrementally for partitioned tables. All basic column statistics (minimum, maximum, number of nulls, and so on) except the number of distinct values (NDV) can be easily derived from partition-level statistics. To incrementally gather the NDV at the global level, the database uses a synopsis data structure. Starting with Oracle Database 11g, histograms are also maintained incrementally.
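The synopsis itself is an internal, undocumented structure, but its defining property can be sketched: a HyperLogLog-style synopsis built per partition can be merged by taking a register-wise maximum, so the global NDV is derivable without rescanning any rows. The sketch below is purely illustrative; the register count, the hash, and the estimator constant are assumptions made for this sketch, not Oracle's actual synopsis format.

```python
import hashlib

# Illustrative HyperLogLog-style synopsis; Oracle's real synopsis layout is
# internal, so P, the hash, and the estimator constant are assumptions.
P = 64  # number of registers

def _hash64(value):
    # Stable 64-bit hash of the value's string form.
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

def synopsis(values):
    """One register per bucket, holding the maximum 'rank' (position of the
    lowest set bit of the remaining hash bits) observed for that bucket."""
    regs = [0] * P
    for v in values:
        h = _hash64(v)
        idx, rest = h % P, h // P
        rank = 1
        while rest % 2 == 0 and rank < 58:
            rank += 1
            rest //= 2
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(s1, s2):
    """Merging two synopses is a register-wise max: this is what lets the
    global NDV be aggregated from partition-level synopses."""
    return [max(a, b) for a, b in zip(s1, s2)]

def estimate_ndv(regs):
    """Raw HLL estimate (small/large-range corrections omitted for brevity)."""
    alpha = 0.709  # constant for 64 registers
    return alpha * P * P / sum(2.0 ** -r for r in regs)
```

Merging is idempotent and order-independent, which is exactly what incremental maintenance needs: regathering one partition only replaces that partition's synopsis.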
In this article, we will see how Oracle Database constructs a global histogram. Does it use a synopsis data structure? Is the synopsis data important for incrementally updating a histogram?
As we know, a frequency histogram (FH) is a collection of {(vi, fi)} pairs, where vi is a distinct value of a column and fi is that value's frequency. Assume a table has two partitions, (P1, P2), and in each partition the column Ci has an FH: {(vi^p1, fi^p1), i ∈ [1..n]} and {(vi^p2, fi^p2), i ∈ [1..m]}. We now need to create a new histogram for the whole table (the global level).
Obviously, if the combined NDV of the values (vi^p1, vj^p2) is less than N (the number of requested histogram buckets; the default is 254), the new histogram will also be a frequency histogram. First, we need to identify the number of buckets for the global FH, which is simply that combined NDV. Then, for each distinct value, we need to identify its frequency, which we can do by grouping the distinct values and summing their frequencies, as shown in the symbolic SQL statements in Query 1.
Select num_bucket, sum(freq) from (
Select vip1 as num_bucket,fip1 ...
Union all
Select vip2 as num_bucket,fip2 ...
)
Group by num_bucket
As a result, we will have a new FH whose buckets and frequencies are provided by Query 1.
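The aggregation that Query 1 expresses can be modeled in a few lines of Python (a conceptual sketch, not Oracle code), treating each partition-level FH as a mapping from distinct value to frequency:

```python
from collections import Counter

def merge_frequency_histograms(*partition_fhs, max_buckets=254):
    """Derive a global frequency histogram from partition-level FHs by
    summing frequencies per distinct value (the idea behind Query 1).
    Returns None when the combined NDV exceeds the bucket limit, i.e. when
    a plain FH is no longer feasible at the global level."""
    combined = Counter()
    for fh in partition_fhs:
        combined.update(fh)
    if len(combined) > max_buckets:
        return None
    return dict(combined)

# Toy partition histograms: {value: frequency}
p1 = {1: 333, 2: 298, 3: 297}
p2 = {2: 40, 3: 30, 4: 109}
print(merge_frequency_histograms(p1, p2))  # {1: 333, 2: 338, 3: 327, 4: 109}
```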
Now let's see Case 1, below, where we have partitioned the SALES table (with partitions SALES_P1 and SALES_P2) and the PROD_ID column has an FH in each partition.
CREATE TABLE sales
(prod_id NUMBER NOT NULL,
time_id DATE NOT NULL,
amount NUMBER(10,2) NOT NULL)
PARTITION BY RANGE (TIME_ID)
(
PARTITION sales_p1 VALUES LESS THAN (TO_DATE('2008-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
PARTITION sales_p2 VALUES LESS THAN (TO_DATE('2008-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
)
/
BEGIN
DBMS_STATS.set_table_prefs (ownname => null,
tabname => 'sales',
pname => 'approximate_ndv_algorithm',
pvalue => 'hyperloglog');
DBMS_STATS.set_table_prefs (null,
'sales',
'INCREMENTAL',
'TRUE');
END;
/
DECLARE
i INTEGER;
BEGIN
FOR i IN 1 .. 10000
LOOP
INSERT INTO sales
SELECT TRUNC (DBMS_RANDOM.VALUE (1, 50)),
TO_DATE ('2008-01-01', 'yyyy-mm-dd')
+ TRUNC (DBMS_RANDOM.VALUE (1, 80)),
TRUNC (DBMS_RANDOM.VALUE (1, 1000))
FROM DUAL;
INSERT INTO sales
SELECT TRUNC (DBMS_RANDOM.VALUE (1, 90)),
TO_DATE ('2008-04-01', 'yyyy-mm-dd')
+ TRUNC (DBMS_RANDOM.VALUE (1, 80)),
TRUNC (DBMS_RANDOM.VALUE (1, 1000))
FROM DUAL;
COMMIT;
END LOOP;
END;
/
How does Oracle Database derive a global-level FH from the partition-level FHs?
SELECT partition_name part_name, num_distinct, histogram
FROM dba_part_col_statistics
WHERE table_name = 'SALES' AND column_name = 'PROD_ID';
PART_NAME NUM_DISTINCT HISTOGRAM
---------- ------------ ---------------
SALES_P1 49 FREQUENCY
SALES_P2 89 FREQUENCY
And we can see the histogram data for each partition level as follows:
SELECT partition_name,
column_name,
bucket_number,
endpoint_value
FROM user_part_histograms
WHERE table_name = 'SALES' AND column_name = 'PROD_ID'
ORDER BY 1;
But for the partitions, we have 49 and 89 buckets, respectively.
SELECT partition_name,
count(*)
FROM user_part_histograms
WHERE table_name = 'SALES' AND column_name = 'PROD_ID'
group by partition_name;
PARTITION_NAME COUNT(*)
--------------- ---------
SALES_P1 49
SALES_P2 89
So, according to Query 1, we can derive a global histogram from the partition-level FH data, as shown in Query 2.
select ep_val, sum(freq) freq from
(
SELECT endpoint_value ep_val,
bucket_number
- LAG (bucket_number, 1, 0) OVER (partition by partition_name ORDER BY bucket_number)
freq
FROM user_part_histograms
WHERE table_name = 'SALES' AND column_name = 'PROD_ID'
)
group by ep_val
order by 1
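Query 2 relies on the fact that the dictionary views expose a frequency histogram as cumulative bucket numbers; the LAG differencing that recovers per-value frequencies can be sketched as follows (illustrative Python with made-up cumulative counts):

```python
def frequencies_from_cumulative(buckets):
    """Given (endpoint_value, cumulative_bucket_number) pairs -- the way
    *_PART_HISTOGRAMS exposes a frequency histogram -- return
    (value, frequency) pairs, mimicking bucket_number - LAG(bucket_number)."""
    out, prev = [], 0
    for value, cum in sorted(buckets, key=lambda b: b[1]):
        out.append((value, cum - prev))
        prev = cum
    return out

print(frequencies_from_cumulative([(1, 333), (2, 631), (3, 928)]))
# [(1, 333), (2, 298), (3, 297)]
```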
Or, we can write Query 2 in this alternative per-partition form:
select ep_val, sum(freq) freq from
(
SELECT endpoint_value ep_val,
bucket_number
- LAG (bucket_number, 1, 0) OVER (ORDER BY bucket_number)
freq
FROM user_part_histograms
WHERE table_name = 'SALES' AND partition_name='SALES_P1' and column_name = 'PROD_ID'
union all
SELECT endpoint_value ep_val,
bucket_number
- LAG (bucket_number, 1, 0) OVER (ORDER BY bucket_number)
freq
FROM user_part_histograms
WHERE table_name = 'SALES' AND partition_name='SALES_P2' and column_name = 'PROD_ID'
)
group by ep_val
order by 1
EP_VAL FREQ
---------- ----------
1 333
2 298
3 297
4 345
5 329
6 307
7 312
8 308
9 281
10 338
11 318
12 332
13 332
14 312
15 312
16 314
17 314
18 355
19 314
20 332
21 290
22 338
23 327
.....................
81 102
82 114
83 123
84 109
85 136
86 121
87 111
88 102
89 99
89 rows selected.
And we can compare the result with a table-level (global) FH with the following SQL statement:
SELECT endpoint_value ep_val,
endpoint_number
- LAG (endpoint_number, 1, 0) OVER (ORDER BY endpoint_number)
freq
FROM user_tab_histograms
WHERE table_name = 'SALES' and column_name = 'PROD_ID';
If we enable a SQL trace for the statistics gathering, we see the SQL statement shown in Query 3.
alter session set sql_trace=true;
execute DBMS_STATS.gather_table_stats(null,'SALES');
INSERT /*+ OPT_PARAM('_parallel_syspls_obey_force' 'false') */
INTO finalhist$ (endpoint, epvalue_raw, bucket)
SELECT endpoint, epvalue, CEIL (SUM (numrows))
FROM (SELECT (hist.bucket
- LAG (
hist.bucket,
1,
0)
OVER (PARTITION BY hist.obj#
ORDER BY hist.bucket))
AS numrows,
hist.endpoint,
CASE
WHEN hist.epvalue IS NOT NULL
THEN
UTL_RAW.cast_to_raw (hist.epvalue)
ELSE
hist.epvalue_raw
END
epvalue
FROM "_HISTGRM_DEC" hist, (SELECT obj#
FROM tabpart$ tp
WHERE bo# = :b1
UNION ALL
SELECT obj#
FROM tabcompart$ tcp
WHERE bo# = :b1) t
WHERE t.obj# = hist.obj#
AND intcol# = :b2
AND hist.bucket <> 0)
GROUP BY endpoint, epvalue
This statement expresses the same idea as Query 1 and Query 2, but Oracle Database considers and keeps the raw format of the endpoint values (plus statistics for composite partitions). In addition, the dbms_stats trace shows that Oracle Database first created the FH for the PROD_ID column in each partition and then derived the table-level (global) FH from the partition level alone, without scanning the entire table.
Gathering basic column statistics, including the NDV and FH, for the SALES_P1 partition:
DBMS_STATS: gather stats on partition SALES_P1: synopsis not gathered yet;
not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
SALES_P1 spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P1 objn: 101954
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
............................................................................................................
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:49 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 49, max # of buckects = 254, pct = 100, ssize
= 10000
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 1035
DBMS_STATS: done_hist in process_topn: TRUE csr.ccnt: 1
DBMS_STATS: Mark column "PROD_ID" as top N computed
Gathering basic column statistics, including the NDV and FH, for the SALES_P2 partition:
DBMS_STATS: gather stats on partition SALES_P2: synopsis not gathered yet;
not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
SALES_P2 spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P2 objn: 101955
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
.................................................
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:89 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 89, max # of buckects = 254, pct = 100, ssize
= 10000
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 1035
DBMS_STATS: done_hist in process_topn: TRUE csr.ccnt: 1
Deriving basic column statistics, including the NDV and FH (histogram data), from the partition level for the table:
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS: derive global freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 89, max # of buckects = 254, pct = 100, ssize
= 20000
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 1035
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES
In this section, we will see how Oracle Database constructs a global-level top frequency histogram (TFH). It should be noted that the resulting (global) histogram does not directly depend on the types of the local (partition-level) histograms: the database creates the global histogram just by scanning the existing partition-level histogram data. This means we could have two (or more) partition-level FHs whose result is a TFH or a hybrid histogram (HH), or we could have two FHs and one HH whose result is a TFH.
So, the global histogram depends only on the data (the top frequent values) provided by the local histograms. In Case 2, below, the structure of our SALES table is the same as in the previous example, but the data is different. I used specially crafted data that leads Oracle Database to create frequency and hybrid histograms at the partition level and a top frequency histogram at the table level.
DECLARE
i INTEGER;
BEGIN
DELETE sales;
COMMIT;
FOR i IN 1 .. 1000
LOOP
INSERT INTO sales
SELECT CASE
WHEN (i <= 974 OR i >= 984)
THEN
TRUNC (DBMS_RANDOM.VALUE (1, 20))
ELSE
i
END,
TO_DATE ('2008-01-01', 'yyyy-mm-dd')
+ TRUNC (DBMS_RANDOM.VALUE (1, 80)),
TRUNC (DBMS_RANDOM.VALUE (1, 1000))
FROM DUAL;
INSERT INTO sales
SELECT CASE
WHEN i <= 760 THEN TRUNC (DBMS_RANDOM.VALUE (1, 20))
ELSE i
END,
TO_DATE ('2008-04-01', 'yyyy-mm-dd')
+ TRUNC (DBMS_RANDOM.VALUE (1, 80)),
TRUNC (DBMS_RANDOM.VALUE (1, 1000))
FROM DUAL;
END LOOP;
COMMIT;
END;
/
As you can see, our table now has 2000 rows (1000 rows per partition). In the sales_p1 partition there can be at most 28 distinct values (19 possible random values plus the 9 literal values of i), and in the sales_p2 partition there can be at most 259 distinct values (19 plus the 240 literal values of i). I have crafted the data here to avoid a problem in generating a global histogram. In the next part of the article, we will see that problem and explore it.
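The distinct-value arithmetic can be double-checked by simulating the CASE expressions from the PL/SQL above in Python (remember that TRUNC(DBMS_RANDOM.VALUE(1, 20)) yields an integer between 1 and 19):

```python
# Possible PROD_ID values per partition, per the CASE logic above.
random_vals = set(range(1, 20))  # TRUNC(DBMS_RANDOM.VALUE(1, 20)) -> 1..19
p1_possible = random_vals | {i for i in range(1, 1001) if not (i <= 974 or i >= 984)}
p2_possible = random_vals | {i for i in range(1, 1001) if i > 760}
print(len(p1_possible), len(p2_possible))  # 28 259
```

These ceilings match the NUM_DISTINCT values (28 and 259) reported by the dictionary views below.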
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
BEGIN
DBMS_STATS.gather_table_stats (ownname => null,
tabname => 'SALES');
END;
/
SELECT partition_name part_name,
num_distinct,
histogram,
num_buckets,
sample_size
FROM user_part_col_statistics
WHERE table_name = 'SALES' AND column_name = 'PROD_ID';
PART_NAME NUM_DISTINCT HISTOGRAM NUM_BUCKETS SAMPLE_SIZE
----------- ------------ --------------- ----------- -----------
SALES_P1 28 FREQUENCY 28 1000
SALES_P2 259 HYBRID 254 1000
SELECT histogram, sample_size
FROM user_tab_col_statistics
WHERE table_name = 'SALES' AND column_name = 'PROD_ID';
HISTOGRAM SAMPLE_SIZE
--------------- -----------
TOP-FREQUENCY 2000
So the PROD_ID column has an FH and an HH in its two partitions, respectively, but the global level has a TFH. The dbms_stats trace shows that the database started to gather statistics for each partition individually and created histograms for each partition:
DBMS_STATS: gather stats on partition SALES_P1: synopsis not gathered yet;
not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
SALES_P1 spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P1 objn: 101954
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
Oracle Database created an FH for the sales_p1 partition:
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:28 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 28, max # of buckects = 254, pct = 100, ssize
= 1000
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 1039
DBMS_STATS: done_hist in process_topn: TRUE csr.ccnt: 1
DBMS_STATS: Mark column "PROD_ID" as top N computed
DBMS_STATS: Skip topn computed column "PROD_ID" numhist: 0
Then the database started to gather basic and histogram statistics for the second partition, sales_p2.
DBMS_STATS: gather stats on partition SALES_P2: synopsis not gathered yet;
not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
SALES_P2 spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P2 objn: 101955
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
...........................................................................
DBMS_STATS: start processing top n values for column "PROD_ID"
DBMS_STATS: >> frequency histograms is not feasible
(topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR
COLNAME
DBMS_STATS: Y Y Y Y
PROD_ID
DBMS_STATS:
TIME_ID
DBMS_STATS:
AMOUNT
DBMS_STATS: Building Histogram for PROD_ID
DBMS_STATS: bktnum=-1, nnv=1000, snnv=1000, sndv=259, est_ndv=259, mnb=254
DBMS_STATS: Trying hybrid histogram
DBMS_STATS: Starting query at 12-FEB-18 11.54.22.688238000 AM +04:00
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep, freq, cdn, ndv,
(sum(pop) over()) popcnt, (sum(pop*freq) over()) popfreq,
substrb(dump(max(val) over(),16,0,64),1,240) maxval, substrb(dump(min(val)
over(),16,0,64),1,240) minval from (select val, freq, (sum(freq) over())
cdn, (count(*) over()) ndv, (case when freq > ((sum(freq) over())/254) then
1 else 0 end) pop from (select /*+ no_parallel(t) no_parallel_index(t)
dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware'
'false') no_substrb_pad */ "PROD_ID" val, count("PROD_ID") freq from
"SYS"."SALES" t where "PROD_ID" is not null and
TBL$OR$IDX$PART$NUM("SYS"."SALES",0,4,0,"ROWID") = :objn group by "PROD_ID"))
order by val (objn=101955)
DBMS_STATS: > cdn 1000, popFreq 760, popCnt 19, bktSize
.8803418803418803418803418803418803418803, bktSzFrc
.8803418803418803418803418803418803418803
DBMS_STATS: Evaluating hybrid histogram: cht.count 254, mnb 254, ssize
1000, min_ssize 2500, appr_ndv TRUE, ndv 259, selNdv 19, selFreq 760, pct
100, avg_bktsize 4, csr.hreq FALSE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 15
DBMS_STATS: Accepting histogram
Figure 1 describes the process of choosing the histogram type.
Figure 1. Flowchart for choosing the histogram type
So, in our case we have 19 popular values and their total frequency is 760. Therefore, according to the criterion, (760/1000) = 0.76 < (1 − 1/254) = 0.9960, and thus Oracle Database decided to create an HH at the partition level. To understand how Oracle Database computes top-n values, refer to my previous article, "Histogram Construction in Oracle Database 12c."
The following lines indicate that Oracle Database used the HyperLogLog algorithm to compute the NDV, and that a global top frequency histogram was derived from the partition-level statistics without scanning the entire table.
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS: top-freq hist feasible? YES ndv: 259 mnb: 254 topn_freq: 1995 nnv: 2000
DBMS_STATS: derive global top-freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 254, max # of buckects = 254, pct = 100,
ssize = 2000
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 2063
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
DBMS_STATS: =============================================================================
=======================
DBMS_STATS: Statistics from clist:
DBMS_STATS: =============================================================================
=======================
DBMS_STATS: COL_NAME ACL NV SNNVDV NDV
SNDV DENS CCNT
DBMS_STATS: -----------------------------------------------------------------
-----------------------------------
DBMS_STATS: PROD_ID 4 0 2000 259
259 .00025 1
How did Oracle Database identify the total frequency of the top-n frequent values as 1995 (topn_freq: 1995) at the global level?
Let's create a SQL statement to find out. In this case, we have one FH and one HH across the partitions. We need to extract the top-n values from the histogram data, and then we can determine which type of histogram should be created at the global (table) level. We can extract the frequency information for the top-n values from an FH or a TFH using Query 2.
What about the HH? As you know, an HH contains two types of frequency information: a frequency for each endpoint value and a cumulative bucket frequency (to track rows per bucket). So we need to find the frequencies of the endpoint values. The Oracle Database query optimizer calculates the selectivity of an endpoint value as endpoint_repeat_count/max(bucket_number) and, therefore, its cardinality as endpoint_repeat_count/max(bucket_number) * partnum_rows, where partnum_rows = dba_tab_partitions.num_rows. If the HH is created without sampling (or with 100% sampling), then max(bucket_number) = partnum_rows, so endpoint_repeat_count represents the exact frequency of the endpoint value.
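A minimal sketch of that estimate (illustrative Python; the endpoint triples are invented, and the 100% sample assumption from the text is what makes the repeat counts exact):

```python
def hybrid_endpoint_frequencies(endpoints, part_num_rows):
    """Estimate per-endpoint frequencies from a hybrid histogram:
    freq = endpoint_repeat_count / max(bucket_number) * num_rows.
    With a 100% sample, max(bucket_number) equals num_rows, so the
    repeat count itself is the exact frequency."""
    max_bucket = max(bkt for _, bkt, _ in endpoints)
    return {val: rc * part_num_rows / max_bucket for val, _, rc in endpoints}

# (endpoint_value, bucket_number, endpoint_repeat_count), 100% sample, 1000 rows
eps = [(1, 82, 82), (2, 181, 99), (19, 760, 95), (1000, 1000, 1)]
print(hybrid_endpoint_frequencies(eps, 1000))
# {1: 82.0, 2: 99.0, 19: 95.0, 1000: 1.0}
```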
So, the query for top-n values that are extracted from the HH will be this:
SELECT endpoint_value,
endpoint_repeat_count * p.num_rows / MAX (bucket_number) OVER ()
freq
FROM user_part_histograms h, user_tab_partitions p
WHERE h.table_name = 'SALES'
AND h.partition_name = 'SALES_P2'
AND h.column_name = 'PROD_ID'
AND p.table_name = h.table_name
AND p.partition_name = h.partition_name
ORDER BY 2 DESC
Therefore, the final query for extracting top-n values will be the following:
SELECT ep_val,
freq,
(SUM (CASE WHEN ROWNUM <= 254 THEN freq ELSE 0 END) OVER ())
topn_freq
FROM ( SELECT ep_val, SUM (freq) freq
FROM (SELECT endpoint_value ep_val,
bucket_number
- LAG (bucket_number, 1, 0)
OVER (ORDER BY bucket_number)
freq
FROM user_part_histograms
WHERE table_name = 'SALES'
AND column_name = 'PROD_ID'
AND partition_name = 'SALES_P1'
UNION ALL
SELECT endpoint_value,
endpoint_repeat_count
* p.num_rows
/ MAX (bucket_number) OVER ()
freq
FROM user_part_histograms h, user_tab_partitions p
WHERE h.table_name = 'SALES'
AND h.partition_name = 'SALES_P2'
AND h.column_name = 'PROD_ID'
AND p.table_name = h.table_name
AND p.partition_name = h.partition_name)
GROUP BY ep_val
ORDER BY 2 DESC)
ORDER BY 1
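The outer ROWNUM <= 254 window in the query above simply sums the 254 largest frequencies; in Python terms:

```python
def topn_frequency(freqs, n_buckets=254):
    """Total frequency of the n_buckets most frequent values -- the quantity
    the ROWNUM <= 254 window computes as topn_freq."""
    return sum(sorted(freqs.values(), reverse=True)[:n_buckets])

print(topn_frequency({1: 82, 2: 99, 3: 90}, n_buckets=2))  # 189
```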
We can generalize our query without specifying partition names explicitly. As a result, Query 4 will not depend on the number of partitions and their names.
SELECT ep_val,
freq,
(SUM (CASE WHEN ROWNUM <= 254 THEN freq ELSE 0 END) OVER ())
topn_freq
FROM ( SELECT ep_val, SUM (freq) freq
FROM (SELECT endpoint_value ep_val,
bucket_number
- LAG (bucket_number, 1, 0)
OVER (ORDER BY bucket_number)
freq
FROM user_part_histograms h,
user_part_col_statistics pcs
WHERE h.table_name = 'SALES'
AND h.column_name = 'PROD_ID'
AND pcs.table_name = h.table_name
AND pcs.column_name = h.column_name
AND pcs.histogram IN
('FREQUENCY', 'TOP-FREQUENCY')
AND pcs.partition_name = h.partition_name
UNION ALL
SELECT endpoint_value,
endpoint_repeat_count
* p.num_rows
/ MAX (bucket_number) OVER ()
freq
FROM user_part_histograms h,
user_tab_partitions p,
user_part_col_statistics pcs
WHERE h.table_name = 'SALES'
AND h.column_name = 'PROD_ID'
AND p.table_name = h.table_name
AND p.partition_name = h.partition_name
AND pcs.table_name = h.table_name
AND pcs.column_name = h.column_name
AND pcs.histogram = 'HYBRID'
AND pcs.partition_name = h.partition_name)
GROUP BY ep_val
ORDER BY 2 DESC)
ORDER BY 1
EP_VAL FREQ TOPN_FREQ
---------- ---------- ----------
1 82 1995
2 99 1995
3 90 1995
4 91 1995
5 103 1995
6 101 1995
7 87 1995
8 94 1995
9 104 1995
10 84 1995
11 103 1995
12 81 1995
13 79 1995
14 85 1995
15 102 1995
16 87 1995
17 100 1995
18 84 1995
19 95 1995
761 1 1995
762 1 1995
763 1 1995
764 1 1995
765 1 1995
766 1 1995
................................
990 1 1995
991 1 1995
992 1 1995
993 1 1995
994 1 1995
1000 1 1995
254 rows selected.
This means the top-n frequent values represent 1995/2000*100 = 99.75% of all column values, and that is why Oracle Database decided to create the TFH: the condition TOPNFREQ >= p*num_rows(SALES) is true, with p = (1 − 1/254) = 0.9960 (99.60%). We can see that in the dbms_stats trace file, as follows:
DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS: top-freq hist feasible? YES ndv: 259 mnb: 254 topn_freq: 1995 nnv: 2000
DBMS_STATS: derive global top-freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 254, max # of buckects = 254, pct = 100, ssize = 2000
So, Query 4 allows us to extract the top-n frequent values from the existing partition-level histogram data. In our case, the condition TOPNFREQ >= p*num_rows(SALES) was true and, therefore, Oracle Database stored the top-n frequent values as a TFH.
It is obvious that after identifying the top-n frequent values, and based on the NDV, Oracle Database evaluates the condition TOPNFREQ < p*num_rows(SALES) and decides which type of histogram should be created: an FH, a TFH, or a hybrid histogram (HH). If TOPNFREQ < p*num_rows(SALES), where p = (1 − 1/N), then an HH needs to be created.
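Putting the conditions described above together, the type decision can be sketched as follows (a simplified model of the documented behavior, not Oracle's actual code path; N defaults to the 254-bucket limit):

```python
def choose_histogram_type(ndv, topn_freq, num_rows, n_buckets=254):
    """FH when every distinct value fits in a bucket; TFH when the top-n
    values cover at least p = 1 - 1/N of the rows; otherwise HH."""
    if ndv <= n_buckets:
        return "FREQUENCY"
    if topn_freq >= (1 - 1 / n_buckets) * num_rows:
        return "TOP-FREQUENCY"
    return "HYBRID"

print(choose_histogram_type(89, 20000, 20000))  # FREQUENCY  (Case 1, global)
print(choose_histogram_type(259, 1995, 2000))   # TOP-FREQUENCY  (Case 2)
print(choose_histogram_type(266, 5161, 11000))  # HYBRID  (Case 3, as traced)
```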
To produce an HH, we need to scan the existing histogram data, grouping it into buckets, and we also need to identify the popular values and their frequencies as endpoint values. Let's see Case 3, below.
DECLARE
i INTEGER;
BEGIN
DELETE sales;
COMMIT;
FOR i IN 1 .. 10000
LOOP
INSERT INTO sales
SELECT CASE
WHEN (i <= 9990)
THEN
TRUNC (DBMS_RANDOM.VALUE (1, 20))
ELSE
i
END,
TO_DATE ('2008-01-01', 'yyyy-mm-dd')
+ TRUNC (DBMS_RANDOM.VALUE (1, 80)),
TRUNC (DBMS_RANDOM.VALUE (1, 1000))
FROM DUAL;
END LOOP;
FOR i IN 1 .. 1000
LOOP
INSERT INTO sales
SELECT CASE
WHEN i <= 760 THEN TRUNC (DBMS_RANDOM.VALUE (1, 20))
ELSE i
END,
TO_DATE ('2008-04-01', 'yyyy-mm-dd')
+ TRUNC (DBMS_RANDOM.VALUE (1, 80)),
TRUNC (DBMS_RANDOM.VALUE (1, 1000))
FROM DUAL;
END LOOP;
COMMIT;
END;
/
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
alter session set sql_trace= true;
begin
dbms_stats.gather_table_stats (ownname => null,
tabname => 'SALES');
end;
/
The partitions and table-level column statistics are as follows:
SELECT partition_name part_name,
num_distinct,
histogram,
num_buckets,
sample_size
FROM user_part_col_statistics
WHERE table_name = 'SALES' AND column_name = 'PROD_ID';
PART_NAME NUM_DISTINCT HISTOGRAM NUM_BUCKETS SAMPLE_SIZE
---------- ------------ ---------- ----------- -----------
SALES_P1 29 FREQUENCY 29 10000
SALES_P2 259 HYBRID 254 1000
select column_name,
num_distinct,
num_nulls,
sample_size,
histogram
from user_tab_col_statistics
where table_name='SALES' AND column_name = 'PROD_ID';
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE HISTOGRAM
------------ ------------ ---------- ----------- ----------
PROD_ID 266 0 11000 HYBRID
As you noticed, the SALES table now contains 11000 rows. In the SALES_P1 partition, 9990 rows have values in the range 1 to 19, and in the SALES_P2 partition, 760 rows have values in the range 1 to 19. This means at least 9990+760=10750 of the 11000 rows are top frequent values (these are, in fact, the popular values). We can see the precise values:
SELECT SUM (freq) topn_freq
FROM ( SELECT prod_id, COUNT ( * ) freq
FROM sales
GROUP BY prod_id
ORDER BY 2 DESC)
WHERE ROWNUM <= 254;
TOPN_FREQ
----------
10985
This means 10985/11000*100=99.86 percent of the total rows are represented by top frequent values.
The question is: why, in this case, did Oracle Database decide to create an HH? Let's see the dbms_stats trace file. As in the previous example, Oracle Database gathered basic and histogram data for each partition individually.
Gathering statistics for the SALES_P1 partition and creating an FH for the partition's column:
DBMS_STATS: gather stats on partition SALES_P1: synopsis not gathered yet;
not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
SALES_P1 spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P1 objn: 101954
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
.......................................
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:29 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS: number of values = 29, max # of buckects = 254, pct = 100, ssize
= 10000
DBMS_STATS: csr.hreq: 0 Histogram gathering flags: 1039
DBMS_STATS: done_hist in process_topn: TRUE csr.ccnt: 1
DBMS_STATS: Mark column "PROD_ID" as top N computed
DBMS_STATS: Skip topn computed column "PROD_ID" numhist: 0
Gathering statistics for the SALES_P2 partition and creating an HH for the partition's column:
DBMS_STATS: Number of rows in the table = 10000, blocks = , average row
length = 15, chain count = , scan rate = 0, sample size = 10000
DBMS_STATS: prepare reporting structures...
DBMS_STATS: reporting_man_update_task: objn: 101954 auto_stats: FALSE status:
COMPLETED ctx.batching_coeff: 0
DBMS_STATS: gather stats on partition SALES_P2: synopsis not gathered yet;
not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
SALES_P2 spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P2 objn: 101955
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
.............................................................
DBMS_STATS: start processing top n values for column "PROD_ID"
DBMS_STATS: >> frequency histograms is not feasible
(topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR
COLNAME
DBMS_STATS: Y Y Y Y
PROD_ID
DBMS_STATS:
TIME_ID
DBMS_STATS:
AMOUNT
DBMS_STATS: Building Histogram for PROD_ID
DBMS_STATS: bktnum=-1, nnv=1000, snnv=1000, sndv=259, est_ndv=259, mnb=254
DBMS_STATS: Trying hybrid histogram
....................................................................
DBMS_STATS: > cdn 1000, popFreq 760, popCnt 19, bktSize
.8846153846153846153846153846153846153846, bktSzFrc
.8846153846153846153846153846153846153846
DBMS_STATS: Evaluating hybrid histogram: cht.count 254, mnb 254, ssize
1000, min_ssize 2500, appr_ndv TRUE, ndv 259, selNdv 19, selFreq 760, pct
100, avg_bktsize 4, csr.hreq FALSE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 15
DBMS_STATS: Accepting histogram
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
The following lines of the trace file indicate that the database derived an HH from the partition levels of histogram data and created an HH for the table (global HH):
DBMS_STATS: reporting_man_update_task: objn: 101955 auto_stats: FALSE status:
COMPLETED ctx.batching_coeff: 0
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
spname: execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS: top-freq hist feasible? NO ndv: 266 mnb: 254 topn_freq: 5161
nnv: 11000
DBMS_STATS: derive global hybrid hist: cdn 11000, popFreq 10750, popCnt 19,
bktSize 1.06382978723404255319148936170212765957, bktSzFrc
.06382978723404255319148936170212765957
DBMS_STATS: Evaluating hybrid histogram: cht.count 254, mnb 254, ssize
11000, min_ssize 2500, appr_ndv TRUE, ndv 266, selNdv 19, selFreq 10750, pct
100, avg_bktsize 43, csr.hreq FALSE, normalize FALSE
DBMS_STATS: Histogram gathering flags: 15
DBMS_STATS: Accepting histogram
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
As you can see, Oracle Database computed the total frequency of the top-n (top 254) frequent values as 5161 (topn_freq: 5161). It is obvious that Oracle Database computed this incorrectly: the total frequency of the popular values alone is 10750 (popFreq 10750), as we saw above. Oracle Database also computed the number of popular values as 19 (popCnt 19).
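A quick arithmetic check (using the p = 1 − 1/254 rule from earlier; 5161 comes from the dbms_stats trace and 10985 from the COUNT query above) shows that with the correct top-n frequency, a top frequency histogram would in fact have been feasible:

```python
n_buckets, num_rows = 254, 11000
threshold = (1 - 1 / n_buckets) * num_rows  # minimum topn_freq for a TFH

reported_topn = 5161   # topn_freq from the dbms_stats trace (miscomputed)
actual_topn = 10985    # computed directly from the table data

print(round(threshold, 2))         # 10956.69
print(reported_topn >= threshold)  # False -> Oracle falls back to a hybrid
print(actual_topn >= threshold)    # True  -> a TFH was actually feasible
```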
To extract the required information ourselves, we can use the same idea as the SQL statement Oracle Database itself runs when generating an HH, shown in Query 5.
select substrb(dump(val,16,0,64),1,240) ep,
freq,
cdn,
ndv,
(sum(pop) over()) popcnt,
(sum(pop*freq) over()) popfreq,
substrb(dump(max(val) over(),16,0,64),1,240) maxval,
substrb(dump(min(val) over(),16,0,64),1,240) minval
from (select val,
freq,
(sum(freq) over()) cdn,
(count(*) over()) ndv,
(case when freq > ((sum(freq) over())/num_bucket) then 1 else 0 end) pop
from (select "column" val,
count("column") freq
from "table" sample (.abcde...) t
where "column" is not null
group by "column"))
order by val
So, we should identify the endpoint values (ep), their frequencies (freq), the histogram-based cardinality (cdn), the histogram-based number of distinct values (ndv), the count of popular values (popcnt), and their total frequency (popfreq). In the previous section we saw that the frequency information can be extracted using Query 4. Therefore, we should replace the following part of Query 5 with the whole of Query 4:
select "column" val,
count("column") freq
from "table" sample (.abcde...) t
where "column" is not null
group by "column"
In addition, I have added a "bucket frequency" to the query in order to compute the number of rows (the cardinality, crdn) from the histogram data. Thus, the equivalent query will be Query 6.
SELECT ep_val,
       pop,
       freq,
       crdn,
       COUNT (*) OVER () ndv,
       SUM (pop) OVER () popcnt,
       SUM (pop * freq) OVER () popfreq,
       (SUM (CASE WHEN ROWNUM <= 254 THEN freq ELSE 0 END) OVER ()) topnfreq
  FROM (SELECT ep_val,
               ROUND (freq) freq,
               SUM (cdn) OVER () crdn,
               (CASE
                   WHEN freq > ROUND (SUM (freq) OVER ()) / 254 THEN 1
                   ELSE 0
                END) pop
          FROM (  SELECT ep_val, SUM (freq) freq, SUM (cdn) cdn
                    FROM (SELECT ep_val, freq, freq cdn
                            FROM (SELECT endpoint_value ep_val,
                                         bucket_number
                                         - LAG (bucket_number, 1, 0)
                                               OVER (ORDER BY bucket_number) freq
                                    FROM user_part_histograms h,
                                         user_part_col_statistics pcs
                                   WHERE h.table_name = 'SALES'
                                     AND h.column_name = 'PROD_ID'
                                     AND pcs.table_name = h.table_name
                                     AND pcs.column_name = h.column_name
                                     AND pcs.histogram IN ('FREQUENCY',
                                                           'TOP-FREQUENCY')
                                     AND pcs.partition_name = h.partition_name)
                          UNION ALL
                          SELECT endpoint_value,
                                 endpoint_repeat_count * p.num_rows
                                 / MAX (bucket_number) OVER () freq,
                                 bucket_number
                                 - LAG (bucket_number, 1, 0)
                                       OVER (ORDER BY bucket_number) cdn
                            FROM user_part_histograms h,
                                 user_tab_partitions p,
                                 user_part_col_statistics pcs
                           WHERE h.table_name = 'SALES'
                             AND h.column_name = 'PROD_ID'
                             AND p.table_name = h.table_name
                             AND p.partition_name = h.partition_name
                             AND pcs.table_name = h.table_name
                             AND pcs.column_name = h.column_name
                             AND pcs.histogram = 'HYBRID'
                             AND pcs.partition_name = h.partition_name)
                GROUP BY ep_val
                ORDER BY 2 DESC))
 ORDER BY ep_val;
EP_VAL POP FREQ CRDN NDV POPCNT POPFREQ TOPNFREQ
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
1 1 548 11000 264 19 10750 10985
2 1 561 11000 264 19 10750 10985
3 1 600 11000 264 19 10750 10985
4 1 567 11000 264 19 10750 10985
5 1 548 11000 264 19 10750 10985
6 1 566 11000 264 19 10750 10985
7 1 546 11000 264 19 10750 10985
8 1 531 11000 264 19 10750 10985
9 1 539 11000 264 19 10750 10985
10 1 622 11000 264 19 10750 10985
11 1 556 11000 264 19 10750 10985
12 1 556 11000 264 19 10750 10985
13 1 545 11000 264 19 10750 10985
14 1 577 11000 264 19 10750 10985
15 1 547 11000 264 19 10750 10985
16 1 573 11000 264 19 10750 10985
17 1 583 11000 264 19 10750 10985
18 1 571 11000 264 19 10750 10985
19 1 614 11000 264 19 10750 10985
761 0 1 11000 264 19 10750 10985
762 0 1 11000 264 19 10750 10985
763 0 1 11000 264 19 10750 10985
764 0 1 11000 264 19 10750 10985
765 0 1 11000 264 19 10750 10985
766 0 1 11000 264 19 10750 10985
767 0 1 11000 264 19 10750 10985
768 0 1 11000 264 19 10750 10985
769 0 1 11000 264 19 10750 10985
................................................................................
................................................................................
993 0 1 11000 264 19 10750 10985
994 0 1 11000 264 19 10750 10985
1000 0 1 11000 264 19 10750 10985
9991 0 1 11000 264 19 10750 10985
9992 0 1 11000 264 19 10750 10985
9993 0 1 11000 264 19 10750 10985
9994 0 1 11000 264 19 10750 10985
9995 0 1 11000 264 19 10750 10985
9996 0 1 11000 264 19 10750 10985
9997 0 1 11000 264 19 10750 10985
9998 0 1 11000 264 19 10750 10985
9999 0 1 11000 264 19 10750 10985
10000 0 1 11000 264 19 10750 10985
264 rows selected.
You can see how the histogram data (in our case, one FH and one TFH) describes our table data. The table has 11000 rows, and the histogram-based cardinality is also 11000. The actual NDV of the column is 269, but based on the histogram data, the NDV is 264. The numbers are very close to each other.
Let's look at another set of numbers. According to the histogram data (Query 6), we have 264 values in the histogram, and the sum of the top-n frequencies is 10985. In addition, there are 19 popular values, and the sum of their frequencies is 10750. What fraction of all column values do the top-n frequent values represent? Comparing TOPNFREQ with p*num_notnull_rows(SALES) gives 10985 vs. (1 − 1/254) * 11000 = 10956.69. This clearly indicates that the top-n frequent values represent more than p = (1 − 1/N) of all column rows (values).
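This comparison can be scripted directly; a small sketch using the numbers reported by Query 6 (the function name is illustrative):

```python
# Top frequency histogram criterion: the top-n values must cover at least
# p = (1 - 1/N) of the non-null rows, where N is the requested bucket count.
def tfh_qualifies(topnfreq, num_rows, n_buckets=254):
    threshold = (1 - 1 / n_buckets) * num_rows
    return topnfreq >= threshold, threshold

ok, threshold = tfh_qualifies(topnfreq=10985, num_rows=11000)
print(ok, round(threshold, 2))   # → True 10956.69
```

Since 10985 exceeds the threshold, a TFH should qualify.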
So, again, a TFH should have been created, but in fact Oracle Database created an HH, and the reason is an incorrect computation of the total frequency of the top-n values. We have already seen this in the trace file (topn_freq: 5161). This is a bug in the dbms_stats package implementation.
Let's explore how Oracle Database processes the frequent values and creates the HH. It must create 254 buckets (the default) from 264 frequent values, so it first needs to identify the bucket size used for grouping and then build the HH.
As you can see, Oracle Database has determined the bucket size as 1.0638. This was calculated from the histogram data provided by Query 6. The bucket size represents the average number of rows per bucket. It is known that an HH combines the features of an FH and an HBH, and the query optimizer uses the bucket size to determine the selectivity of nonpopular column values. In fact, the bucket size explains the meaning of density (and selectivity): Density = bucket_size/crdn, where crdn is the cardinality of the column as determined from the histogram data (see Query 6). The bucket size itself is calculated as follows:
Bucket size = (crdn − popfreq)/(N − popcnt)
where N is the requested number of buckets (254 by default). Query 6 gives us all the information needed to calculate the bucket size. Thus
Bucket size = (crdn − popfreq)/(N − popcnt) = (11000 − 10750)/(254 − 19) = 1.0638.
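As a quick sanity check, the bucket size formula can be evaluated directly with the values reported by Query 6 (a sketch; the function name is mine):

```python
# Hybrid histogram bucket size, using the values reported by Query 6.
def bucket_size(crdn, popfreq, popcnt, n_buckets=254):
    # Average rows per bucket after removing popular values and their buckets.
    return (crdn - popfreq) / (n_buckets - popcnt)

size = bucket_size(crdn=11000, popfreq=10750, popcnt=19)
density = size / 11000          # Density = bucket_size / crdn
print(round(size, 4))           # → 1.0638, matching the dbms_stats trace file
```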
As can be seen, the calculated bucket size matches the one reported in the dbms_stats trace file. To create an HH, Oracle Database processes each row returned by Query 6.
The maximum and minimum values of the histogram data are also the max and min values of the column, and they are always included in the HH. For every row, Oracle Database compares the average bucket size with the size of the current bucket. If the current bucket size is greater than the average bucket size, the value is stored in the HH as an endpoint value; the endpoint repeat count for the value is the "freq" column of Query 6.
At the same time, the bucket number (endpoint number) is produced and stored in the HH for each endpoint value. If a bucket is not stored in the HH, its row count (the current bucket size) is carried over and added to the size of the next bucket.
The process continues until the total bucket count reaches 254 (the number of requested buckets). As a result, Oracle Database created an HH from the existing partition-level histogram data. For the complete steps of histogram construction, refer to my earlier article, "Histogram Construction in Oracle Database 12c."
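The accumulation logic described above can be sketched as follows. This is a simplified illustration of the carry-over behavior, not the exact dbms_stats implementation (it omits, for example, the special handling of the minimum value):

```python
# Simplified sketch of hybrid histogram construction: walk the (value, freq)
# rows in value order, accumulate frequencies, and emit an endpoint whenever
# the accumulated bucket exceeds the average bucket size. Skipped values'
# frequencies are carried into the next bucket.
def build_hybrid(rows, avg_bucket_size):
    # rows: list of (value, freq) pairs sorted by value
    endpoints = []   # (endpoint_value, endpoint_number, endpoint_repeat_count)
    acc = 0          # current bucket size, including carried-over rows
    ep_num = 0       # running endpoint (bucket) number
    for i, (val, freq) in enumerate(rows):
        acc += freq
        last = i == len(rows) - 1
        if acc > avg_bucket_size or last:   # the max value is always kept
            ep_num += acc
            endpoints.append((val, ep_num, freq))
            acc = 0                         # carried rows have been absorbed
    return endpoints

rows = [(1, 5), (2, 1), (3, 1), (4, 6), (5, 1)]
print(build_hybrid(rows, avg_bucket_size=3))
# → [(1, 5, 5), (4, 13, 6), (5, 14, 1)]
```

Note how values 2 and 3 are not stored: their frequencies are added to value 4's bucket, whose endpoint number (13) reflects all rows up to and including it.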
Thus, using the described method, Oracle Database created an HH and stored it in the dictionary. But in reality, a TFH should have been created. To prove that, let's set the INCREMENTAL preference to FALSE and gather table statistics again; in this case, Oracle Database does not gather statistics incrementally.
execute DBMS_STATS.SET_TABLE_PREFS(null, 'sales', 'INCREMENTAL', 'FALSE');
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute DBMS_STATS.gather_table_stats (null, 'SALES');
And now, let's check which type of histograms have been created for the table and its partitions:
select partition_name,
       num_distinct,
       histogram
from user_part_col_statistics
where table_name='SALES'
and column_name='PROD_ID';
PARTITION_NAME NUM_DISTINCT HISTOGRAM
---------------- ------------ -----------
SALES_P1 29 FREQUENCY
SALES_P2 259 HYBRID
select column_name,
num_distinct,
num_nulls,
sample_size,
histogram
from user_tab_col_statistics
where table_name='SALES' AND column_name = 'PROD_ID';
COLUMN_NAME NUM_DISTINCT NUM_NULLS SAMPLE_SIZE HISTOGRAM
------------- ------------ ---------- ----------- ---------------
PROD_ID 269 0 11000 TOP-FREQUENCY
As you can see, Oracle Database created a TFH for the column, which means our prediction was right. As mentioned above, when constructing histograms incrementally, Oracle Database does not calculate the top-n frequent values correctly, and that is why it decided to create an HH.
So we see the problem in incrementally constructing histograms. When gathering statistics with INCREMENTAL=FALSE, Oracle Database does not report the exact total frequency of the top-n values in the trace file (because in our case it decided to create a TFH, not an HH), but the ROWIDs (and their frequencies) for the top-n frequent values are provided.
DBMS_STATS: Start gather_stats.. pfix: ownname: SYS tabname: SALES pname:
spname: execution phase: 1
DBMS_STATS: Using approximate NDV pct=0
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR
COLNAME
DBMS_STATS: Y Y Y Y Y Y Y
PROD_ID
DBMS_STATS: Y Y
TIME_ID
DBMS_STATS: Y Y Y
AMOUNT
DBMS_STATS: Approximate NDV Options
DBMS_STATS: TOPN,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,RWID,U254,U254,U254U
DBMS_STATS: select /*+ full(t) no_parallel(t) no_parallel_index(t)
dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware'
'false') no_substrb_pad
*/to_char(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),subst
rb(dump(max("PROD_ID"),16,0,64),1,240),to_char(count("TIME_ID")),substrb(dump
(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),t
o_char(count("AMOUNT")),substrb(dump(min("AMOUNT"),16,0,64),1,240),substrb(du
mp(max("AMOUNT"),16,0,64),1,240),count(rowidtochar(rowid)) from "SYS"."SALES"
t /* TOPN,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,RWID,U254,U254,U254U*/
DBMS_STATS: Ending query at 14-FEB-18 03.20.53.066712000 PM +04:00
DBMS_STATS: Approximate NDV Result
DBMS_STATS: <process_result>
<select_list_item>
<pos>0</pos>
<value>269</value>
<rowcnt>11000</rowcnt>
<split>0</split>
<ndv>269</ndv>
<nonnulls>11000</nonnulls>
<rsize>22246</rsize>
<topncnt>254</topncnt>
<topn_values>AAAY5CAABAAAbGPAAW,622,AAAY5CAABAAAbGPAAB,614,AAAY5CAABAAAbGPAAC
,600,AAAY5CAABAAAbGPAAO,583,AAAY5CAABAAAbGPAAI,577,AAAY5CAABAAAbGPAAV,573,AAA
Y5CAABAAAbGPAAA,571,AAAY5CAABAAAbGPAAM,567,AAAY5CAABAAAbGPABe,566,AAAY5CAABAA
AbGPAAS,561,AAAY5CAABAAAbGPAAp,556,AAAY5CAABAAAbGPAAf,556,AAAY5CAABAAAbGPAAJ,
548,AAAY5CAABAAAbGPAAT,548,AAAY5CAABAAAbGPAAN,547,AAAY5CAABAAAbGPAAG,546,AAAY
5CAABAAAbGPAAK,545,AAAY5CAABAAAbGPAAD,539,AAAY5CAABAAAbGPAAQ,531
.............................................................................
..........
DBMS_STATS: start processing top n values for column "PROD_ID"
DBMS_STATS: Parsing topn values..
DBMS_STATS: Extracted 254 rowid-freq pairs.
DBMS_STATS: topn sql (len: 9002):
DBMS_STATS: +++ select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false')
no_substrb_pad */ substrb(dump("PROD_ID",16,0,64),1,240) val,
rowidtochar(rowid) rwid from "SYS"."SALES" t where
rowid in (chartorowid('AAAY5CAABAAAbGPAAA'),chartorowid('AAAY5CAABAAAbGPAAB'),chartoro
wid('AAAY5CAABAAAbGPAAC'),chartorowid('AAAY5CAABAAAbGPAAD'),chartorowid('AAAY
5CAABAAAbGPAAG'),chartorowid('AAAY5CAABAAAbGPAAI'),chartorowid('AAAY5CAABAAAb
GPAAJ'),chartorowid('AAAY5CAABAAAbGPAAK'),chartorowid('AAAY5CAABAAAbGPAAM'),c
hartorowid('AAAY5CAABAAAbGPAAN'),chartorowid('AAAY5CAABAAAbGPAAO'),chartorowi
d('AAAY5CAABAAAbGPAAQ'),chartorowid('AAAY5CAABAAAbGPAAS'),chartorowid('AAAY5C
AABAAAbGPAAT'),chartorowid('AAAY5CAABAAAbGPAAV'),chartorowid('AAAY5CAABAAAbGP
AAW'),chartorowid('AAAY5CAABAAAbGPAAf'),chartorowid('AAAY5CAABAAAbGPAAp'),cha
rtorowid('AAAY5CAABAAAbGPABe'),chartorowid('AAAY5CAABAAAbGPACJ'),chartorowid(
'AAAY5CAABAAAbGPACK'),chartorowid('AAAY5CAABAAAbGPACL'),chartorowid('AAAY5CAA
BAAAbGPACM'),chartorowid('AAAY5CAABAAAbGPACN'),chartorowid('AAAY5CAABAAAbGPAC
O'),chartorowid('AAAY5CAABAAAbGPACP'),chartorowid('AAAY5CAABAAAbGPACQ'),chart
orowid('AAAY5CAABAAAbGPACR'),chartorowid('AAAY5CAABAAAbGPACS'),chartorowid('A
AAY5DAABAAAaPuAA/'),chartorowid('AAAY5DAABAAAaPuAA1'),chartorowid('AAAY5DAABA
AAaPuAA2'),chartorowid('AAAY5DAABAAAaPuAA3'),chartorowid('AAAY5DAABAAAaPuAA4'
),chartorowid('AAAY5DAABAAAaPuAA5'),chartorowid('AAAY5DAABAAAaPuAA6'),chartor
owid('AAAY5DAABAAAaPuAA7')...................................................
..........................
The provided ROWIDs allow us to compute exact top-n frequent values as:
select /*+ parallel(4) */ prod_id,
count(*),
sum(count(*)) over() topn
FROM sales t
where prod_id in(
select prod_id
from sales t where rowid in (
chartorowid('AAAY5CAABAAAbGPAAA'),chartorowid('AAAY5CAABAAAbGPAAB'),
chartorowid('AAAY5CAABAAAbGPAAC'),chartorowid('AAAY5CAABAAAbGPAAD'),chartorow
id('AAAY5CAABAAAbGPAAG'),chartorowid('AAAY5CAABAAAbGPAAI'),
chartorowid('AAAY5CAABAAAbGPAAJ'),chartorowid('AAAY5CAABAAAbGPAAK'),chartorow
id('AAAY5CAABAAAbGPAAM'),chartorowid('AAAY5CAABAAAbGPAAN'),
.............................................................................
chartorowid('AAAY5DAABAAAaPuAEK'),chartorowid('AAAY5DAABAAAaPuAEL'),chartorow
id('AAAY5DAABAAAaPuAEM'),chartorowid('AAAY5DAABAAAaPuAEN'))
)
group by prod_id
order by prod_id;
PROD_ID COUNT(*) TOPN
---------- ---------- ----------
1 548 10985
2 561 10985
3 600 10985
4 567 10985
5 548 10985
6 566 10985
7 546 10985
8 531 10985
9 539 10985
10 622 10985
11 556 10985
12 556 10985
13 545 10985
14 577 10985
15 547 10985
16 573 10985
17 583 10985
18 571 10985
19 614 10985
761 1 10985
763 1 10985
764 1 10985
765 1 10985
766 1 10985
767 1 10985
768 1 10985
................................
996 1 10985
997 1 10985
998 1 10985
999 1 10985
1000 1 10985
9991 1 10985
9992 1 10985
9993 1 10985
9994 1 10985
9995 1 10985
9996 1 10985
9997 1 10985
9998 1 10985
9999 1 10985
10000 1 10985
254 rows selected.
We also calculated the total frequency of the top-n values as 10985 using Query 6. If we set INCREMENTAL=TRUE and gather statistics again, we will see that Oracle Database again creates an HH at the table level.
To find the reason why Oracle Database computes top-n frequent values incorrectly, I have traced and found the SQL statement shown in Query 7.
SELECT /*+ OPT_PARAM('_parallel_syspls_obey_force' 'false') */
ENDPOINT,POP ,EPVALUE, FREQ, EP_INT_WIDTH,
NDV, CDN,
(SUM(POP) OVER()) POPCNT,
(SUM(POP*FREQ) OVER()) POPFREQ,
(SUM(CASE WHEN RNK <= 254 THEN FREQ ELSE 0 END) OVER ()) TOPNFREQ,
(MIN(ENDPOINT) OVER()) MIN_EP,rnk,
(MIN(CASE WHEN RNK <= 254 THEN FREQ ELSE 9999999999999 END)
OVER ()) MIN_FREQ
FROM (SELECT ENDPOINT, EPVALUE, FREQ, EP_INT_WIDTH, (COUNT(*) OVER ()) NDV,
(SUM(EP_INT_WIDTH) OVER ()) CDN,
ROW_NUMBER() OVER (ORDER BY FREQ ) RNK,
(CASE WHEN FREQ > (SUM(EP_INT_WIDTH) OVER ())/254
THEN 1 ELSE 0 END) POP
FROM (SELECT ENDPOINT, EPVALUE, SUM(NUMROWS) FREQ,
SUM(INTERVAL_WIDTH) EP_INT_WIDTH
FROM (SELECT CASE WHEN EP_REPEAT_COUNT > 0 THEN EP_REPEAT_COUNT
ELSE (HIST.BUCKET - LAG(HIST.BUCKET,1,0) OVER
(PARTITION BY HIST.OBJ# ORDER BY HIST.BUCKET))
END *
CASE WHEN ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET)
OVER (PARTITION BY HIST.OBJ#)) < 1.1 THEN 1
ELSE ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET) OVER
(PARTITION BY HIST.OBJ#)) END AS NUMROWS,
(HIST.BUCKET - LAG(HIST.BUCKET,1,0) OVER
(PARTITION BY HIST.OBJ# ORDER BY HIST.BUCKET)) *
CASE WHEN ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET)
OVER (PARTITION BY HIST.OBJ#)) < 1.1 THEN 1
ELSE ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET) OVER
(PARTITION BY HIST.OBJ#)) END AS INTERVAL_WIDTH,
HIST.ENDPOINT,
CASE WHEN HIST.EPVALUE IS NOT NULL THEN
UTL_RAW.CAST_TO_RAW(HIST.EPVALUE)
ELSE HIST.EPVALUE_RAW END EPVALUE
FROM "_HISTGRM_DEC" HIST, HIST_HEAD$ H,
(SELECT OBJ#, ROWCNT FROM TABPART$ TP WHERE BO# = 101953
UNION ALL
SELECT OBJ#, ROWCNT FROM TABCOMPART$ TCP
WHERE BO# = 101953) T
WHERE T.OBJ# = HIST.OBJ#
AND H.OBJ# = HIST.OBJ#
AND H.INTCOL# = HIST.INTCOL#
AND HIST.INTCOL# = 1
AND HIST.BUCKET <> 0)
GROUP BY ENDPOINT, EPVALUE
)
)
ORDER BY ENDPOINT, EPVALUE
As you can see, the query contains a ROW_NUMBER() OVER (ORDER BY FREQ) RNK clause, but it must be ROW_NUMBER() OVER (ORDER BY FREQ DESC) RNK to compute the top-n frequent values. This is the reason Oracle Database does the computation incorrectly.
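The effect of the missing DESC is easy to demonstrate with a small sketch on hypothetical frequencies: ranking by ascending frequency makes the RNK <= 254 filter select the *least* frequent values, so the computed top-n total is badly underestimated.

```python
# Demonstrating the ORDER BY FREQ vs. ORDER BY FREQ DESC bug on toy data:
# 5 frequent values, 10 singletons, and a "top-n" limit of 5.
freqs = [100, 90, 80, 70, 60] + [1] * 10
n = 5

asc = sorted(freqs)                  # what ROW_NUMBER() OVER (ORDER BY FREQ) ranks first
desc = sorted(freqs, reverse=True)   # what the DESC version ranks first

topn_wrong = sum(asc[:n])    # sums the 5 *least* frequent values
topn_right = sum(desc[:n])   # sums the 5 *most* frequent values
print(topn_wrong, topn_right)   # → 5 400
```

The same mechanism explains why the trace reported topn_freq: 5161 instead of 10985, pushing the decision toward an HH.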
What can we say about incremental maintenance of height-balanced histograms (HBHs)? It seems that in Oracle Database 12c, HBHs are not maintained incrementally. Even if we set INCREMENTAL=TRUE for the table, Oracle Database sets INCREMENTAL=FALSE for the entire statistics gathering process and does not construct an HBH incrementally. We can see that in the trace file:
execute DBMS_STATS.SET_TABLE_PREFS(null, 'sales', 'INCREMENTAL', 'TRUE');
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute DBMS_STATS.gather_table_stats (null, tabname => 'SALES',
estimate_percent=> 100); -- Also tested with other values of estimate percents
select column_name,
num_distinct,
histogram
from user_tab_col_statistics
where table_name='SALES' AND column_name = 'PROD_ID';
COLUMN_NAME NUM_DISTINCT HISTOGRAM
------------- ------------ ---------------
PROD_ID 269 HEIGHT BALANCED
DBMS_STATS: Specified granularity = AUTO, Fixed granularity = GLOBAL AND PARTITION
DBMS_STATS: parameters ** pfix: (parent pfix: ) ownname: tabname: SALES
partname: estimate_percent: 100 block_sample: FALSE method_opt: Z degree:
32766 granularity: Z gIdxGran: cascade: NULL stattab: statid: statown:
no_invalidate: NULL flush_colu: TRUE fxt: stattype: DATA start_time: 02-14-
2018 15:50:53 gathering_group_stats: FALSE cms_only: FALSE force: FALSE
options: Z executionPhase: SERIAL (NON-CONCURRENT) incremental: FALSE
concurrent: FALSE loc_degree: loc_method_opt: FOR ALL COLUMNS SIZE AUTO
fxt_typ: 0 tobjn: 101953
DBMS_STATS: Preferences for table SYS.SALES
DBMS_STATS: ================================================================================
DBMS_STATS: SKIP_TIME -
DBMS_STATS: STATS_RETENTION -
DBMS_STATS: MON_MODS_ALL_UPD_TIME -
DBMS_STATS: SNAPSHOT_UPD_TIME -
DBMS_STATS: TRACE - 34836
DBMS_STATS: DEBUG - 0
DBMS_STATS: SYS_FLAGS - 1
DBMS_STATS: SPD_RETENTION_WEEKS - 53
DBMS_STATS: CASCADE - DBMS_STATS.AUTO_CASCADE
DBMS_STATS: ESTIMATE_PERCENT - DBMS_STATS.AUTO_SAMPLE_SIZE
DBMS_STATS: DEGREE - NULL
DBMS_STATS: METHOD_OPT - FOR ALL COLUMNS SIZE AUTO
DBMS_STATS: NO_INVALIDATE - DBMS_STATS.AUTO_INVALIDATE
DBMS_STATS: GRANULARITY - AUTO
DBMS_STATS: PUBLISH - TRUE
DBMS_STATS: STALE_PERCENT - 10
DBMS_STATS: APPROXIMATE_NDV - TRUE
DBMS_STATS: APPROXIMATE_NDV_ALGORITHM - HYPERLOGLOG
DBMS_STATS: ANDV_ALGO_INTERNAL_OBSERVE - FALSE
DBMS_STATS: INCREMENTAL - TRUE
DBMS_STATS: INCREMENTAL_INTERNAL_CONTROL - TRUE
DBMS_STATS: AUTOSTATS_TARGET - AUTO
DBMS_STATS: CONCURRENT - OFF
DBMS_STATS: JOB_OVERHEAD_PERC - 1
DBMS_STATS: JOB_OVERHEAD - -1
DBMS_STATS: GLOBAL_TEMP_TABLE_STATS - SESSION
DBMS_STATS: ENABLE_TOP_FREQ_HISTOGRAMS - 3
DBMS_STATS: ENABLE_HYBRID_HISTOGRAMS - 3
DBMS_STATS: TABLE_CACHED_BLOCKS - 1
DBMS_STATS: INCREMENTAL_LEVEL - PARTITION
DBMS_STATS: INCREMENTAL_STALENESS - ALLOW_MIXED_FORMAT
DBMS_STATS: OPTIONS - GATHER
DBMS_STATS: GATHER_AUTO - AFTER_LOAD
DBMS_STATS: STAT_CATEGORY - OBJECT_STATS
DBMS_STATS: SCAN_RATE - 0
DBMS_STATS: GATHER_SCAN_RATE - HADOOP_ONLY
DBMS_STATS: PREFERENCE_OVERRIDES_PARAMETER - FALSE
DBMS_STATS: AUTO_STAT_EXTENSIONS - OFF
DBMS_STATS: WAIT_TIME_TO_UPDATE_STATS - 15
As you can see, although the table's INCREMENTAL preference is TRUE, Oracle Database set incremental=false for the process and traced it as executionPhase: SERIAL (NON-CONCURRENT) incremental: FALSE. Of course, an HBH could also be created from partition-level histogram data; there does not appear to be a fundamental restriction against this, because in Oracle Database 11g, HBHs are updated incrementally. But to maintain statistics incrementally, the database must first determine the global NDV, and only after that can it process further. During the NDV approximation, the database also identifies the top-n values, and approximating the global NDV requires ESTIMATE_PERCENT = DBMS_STATS.AUTO_SAMPLE_SIZE. That is why Oracle Database makes the decision based on the value of the estimate_percent parameter, for example:
IF ESTIMATE_PERCENT <> DBMS_STATS.AUTO_SAMPLE_SIZE then INCREMENTAL=FALSE.
So, DBMS_STATS.AUTO_SAMPLE_SIZE is what allows (and triggers) Oracle Database to gather basic column and histogram statistics incrementally. As we know, in Oracle Database 12c, an HBH is created only if you explicitly specify a value for ESTIMATE_PERCENT different from DBMS_STATS.AUTO_SAMPLE_SIZE. Therefore, an HBH is not maintained incrementally. In Oracle Database 11g, an HBH is maintained incrementally; however, that appears not to be allowed in Oracle Database 12c.
In addition, it is clear that incremental gathering of statistics is triggered by ESTIMATE_PERCENT = DBMS_STATS.AUTO_SAMPLE_SIZE. If there are HBHs for a column of a partition of a given table (which is possible only if statistics were gathered with ESTIMATE_PERCENT <> DBMS_STATS.AUTO_SAMPLE_SIZE), then when we gather statistics for the partitioned table with DBMS_STATS.AUTO_SAMPLE_SIZE, all old histograms (at both the local and global levels) will be deleted and new ones will be created.
In summary, we saw how Oracle Database constructs global-level histograms by using existing partition-level histograms. Synopsis data is not used to construct global histograms directly; it is used for NDV approximation. Global NDV and partition-level histogram data are enough to describe the table-level data.
Oracle Database first identifies the global NDV for a given column and then tries to extract the top-n frequent values by scanning partition-level histogram data. It performs special queries (Queries 1, 2, 3, 4, and 6), as described above.
If the global NDV <=N (N is number of requested buckets and the default is 254), the global histogram will be a frequency histogram. The top-n values and their frequencies can be determined via Query 4.
If NDV > N, then after identifying the top-n frequent values, the database decides which type of histogram to create, as shown in Figure 1. If the TOPNFREQ >= p*num_rows condition is true, a top frequency histogram is created; otherwise, Oracle Database creates a global hybrid histogram.
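The decision logic summarized above can be sketched as a small function (N and p as defined in the article; the function name is illustrative):

```python
# Histogram-type decision when building a global histogram from
# partition-level data (illustrative sketch of the article's Figure 1).
def global_histogram_type(ndv, topnfreq, num_rows, n_buckets=254):
    if ndv <= n_buckets:
        return 'FREQUENCY'
    p = 1 - 1 / n_buckets
    if topnfreq >= p * num_rows:
        return 'TOP-FREQUENCY'
    return 'HYBRID'

# With the article's numbers, the correct topnfreq (10985) yields a TFH,
# while the buggy undercount from the trace (5161) falls back to an HH.
print(global_histogram_type(269, 10985, 11000))   # → TOP-FREQUENCY
print(global_histogram_type(269, 5161, 11000))    # → HYBRID
```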
Also, we saw that Oracle Database does not identify the top-n frequent values correctly when creating a global histogram; there is a problem in the implementation of the dbms_stats package.
Moreover, we saw that height-balanced histograms are not maintained incrementally, and the reason is the use of DBMS_STATS.AUTO_SAMPLE_SIZE. Height-balanced histograms are created only when you explicitly specify a value for ESTIMATE_PERCENT that is different from AUTO_SAMPLE_SIZE, but the database considers "incremental maintenance of the statistics" only when ESTIMATE_PERCENT=DBMS_STATS.AUTO_SAMPLE_SIZE is set.
It is worth mentioning that after creating a global histogram, even if no DML operations have been performed, if we try to gather statistics again, Oracle Database will refresh the global statistics regardless of statistics staleness. I do not think this is a problem, because incrementally updating histograms is not a resource-intensive operation. Finally, the fact that a hybrid histogram can be updated incrementally from existing histogram data supports the idea that it is possible to construct a hybrid histogram without sampling.
Chinar Aliyev graduated from the Baku State University applied mathematics department. He has been working with Oracle Database for more than 10 years (since Oracle 8i) and specializes in tuning and troubleshooting Oracle Database performance problems. His areas of interest include SQL optimization, optimizer internals, parallel execution, and database design. He is an Oracle Database 11g Certified Master and an Oracle Database 11g Performance Tuning Certified Expert. He is also a regular presenter at international and local conferences such as Hotsos and Oracle Database user groups.