This article describes the mechanism of histogram construction in Oracle Database 12c and the issues related to the gathering process for a hybrid histogram. It also demonstrates how to avoid those problems by manually setting histogram data, and it explains the possibility of generating a hybrid histogram from a top-n structure without doing sampling from the given table.
Object statistics describe the data within a database and allow the Oracle Database query optimizer to make the right decision in order to produce an optimal execution plan. In the real world, due to nonuniform distribution, basic column statistics are not enough to describe the data correctly. There are two types of skew: nonuniform value repetition and nonuniformity in range.
Oracle Database checks whether a column is distributed nonuniformly before creating a histogram. But to do that, the database should collect some information about the column, including the frequencies of the distinct values. This information is used to create a histogram for the column.
A histogram is designed to store data distribution information in the dictionary and helps the query optimizer to generate an efficient execution plan. In Oracle Database 12c, there are four types of histograms: frequency histogram (FH), top frequency histogram (TFH), hybrid histogram (HH), and the old-style height-balanced histogram (HBH). In this article, we will see how histograms are created in Oracle Database 12c. These test cases were performed with Oracle Database 12c Release 2 (12.2.0.1.0).
In Oracle Database 12c, if histogram creation is requested or the database decides to create a histogram, Oracle Database first generates the top frequent values (TFVs) and then it decides which type of histogram should be created. It is obvious that if the number of distinct values (NDV) is equal to or less than the number of requested buckets (NRB), then a frequency histogram will be created. By default, Oracle Database 12c does not create height-balanced histograms and, therefore, we are not going to discuss them in this article.
Several types (classes) of algorithms have been proposed (counter-based, sketch-based, sampling-based, and so on) to solve the problem of finding the most frequent values in a given data stream. In this article, we will focus on counter-based algorithms, because according to the trace files, the techniques Oracle Database 12c uses for finding the most frequent values look like they are based on counter-based algorithms.
We have a data stream S = {a1, a2, a3, ..., aN}. Let's define f(ai) as the frequency of ai ∈ S. We need to identify, in one pass, the subset R(ai ∈ S, φ) of S such that f(ai) > φN, where φ ∈ [0,1]. It is obvious that the output result will not contain more than k = 1/φ elements.
Note: For more information on the frequency algorithm (F), see the paper at http://erikdemaine.org/papers/NetworkStats_TR2002/paper.pdf.
The frequency algorithm stores K pairs of (item, frequency). It is a generalization of the majority algorithm: each new item is compared against the stored items in R, and the matching counter is incremented if the item is found within R. If all 1/φ counters are already allocated to distinct items, all of them are decremented by 1, and after decrementing, all items whose frequency reaches zero are removed from R. The algorithm returns at most K = 1/φ of the most frequent elements of the data stream; implementations can differ.
To implement the above algorithm efficiently, we need a special data structure that supports adding, removing, and updating elements very effectively. First of all, the R table could be implemented as a hash table of limited size. The number of hash table entries could be equal to the requested number of histogram buckets (by default, 254), and the entries correspond to the distinct values of the column. During the scan, a uniform hash function is applied to each column value, and the resulting hash value is used to look up a position in the hash table. The entry indicates the frequency of the distinct value, because it counts the number of times the distinct value has been mapped to that entry.
Initially, the hash table is empty. During the scan, if the hash value of the element is found in the hash table, its frequency value is incremented. If the hash table reaches its size limit, infrequent values are deleted from it: first, all frequency values are decremented, and then the infrequent values (those whose frequency is zero) are deleted. To support decrementing all frequencies at once in constant time, a more sophisticated data structure could be used: the set of distinct values is maintained as a hash table, while the frequencies are maintained in a linked list L = {l1, l2, l3, ..., ln}, where n is the current maximum of the frequencies. Each lj has a pointer to a doubly linked list of the hash values in R whose frequency equals j. To increment the frequency of one of these values, the value is simply removed from the lj doubly linked list and inserted into lj+1. To decrement all frequencies by one, it is enough to move the head of the L linked list from l1 to l2 and then delete l1.
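To make the counter-based idea concrete, here is a minimal PL/SQL sketch of the frequency algorithm (often called Misra-Gries or frequent). It is only an illustration of the increment/decrement/evict steps described above, not Oracle's internal code; it assumes a table t1(col) such as the ones created later in this article, and it keeps at most k counters in an associative array instead of the linked-list structure.
create or replace procedure frequent_sketch (k in pls_integer)
is
   TYPE counter_tab IS TABLE OF PLS_INTEGER INDEX BY VARCHAR2 (40);
   r        counter_tab;                  -- the bounded counter table R
   key      VARCHAR2 (40);
   next_key VARCHAR2 (40);
BEGIN
   FOR c IN (SELECT col FROM t1)          -- one pass over the stream
   LOOP
      key := TO_CHAR (c.col);
      IF r.EXISTS (key)
      THEN
         r (key) := r (key) + 1;          -- known item: increment its counter
      ELSIF r.COUNT < k
      THEN
         r (key) := 1;                    -- free counter: start tracking the item
      ELSE
         key := r.FIRST;                  -- all k counters busy: decrement all ...
         WHILE key IS NOT NULL
         LOOP
            next_key := r.NEXT (key);
            IF r (key) = 1
            THEN
               r.DELETE (key);            -- ... and evict items that reach zero
            ELSE
               r (key) := r (key) - 1;
            END IF;
            key := next_key;
         END LOOP;
      END IF;
   END LOOP;
   key := r.FIRST;                        -- the survivors are the top-k candidates
   WHILE key IS NOT NULL
   LOOP
      DBMS_OUTPUT.put_line (key || ': ~' || r (key));
      key := r.NEXT (key);
   END LOOP;
END;
/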
Note: For more information on the lossy counting (LC) algorithm, see the paper at http://www.vldb.org/conf/2002/S10P03.pdf.
There is another approximate algorithm for finding the top frequent values. Finding the top frequent values of a data stream is a resource-intensive operation and depends on several factors, including properties of the data stream such as the skew factor; therefore, the exact top frequent values and their frequencies are not easy to attain. The lossy counting algorithm allows us to approximate the most frequent values with an ε error in a single pass. Let's define the following symbols: N is the current length of the stream, s ∈ (0,1) is the support threshold, ε ∈ (0,1) is the error parameter (with ε ≪ s), ƒ is the estimated frequency of an item, and Δ is the maximum possible error in ƒ.
According to the algorithm, the data stream is divided into buckets (or windows) with the same width w = ⌈1/ε⌉. Each bucket has an ID starting at 1; let's call the current bucket ID bcurrent. To support the algorithm, the data structure D consists of a set of entries of the form (e, ƒ, Δ). Initially, D is empty. During the scan of the stream, each item is checked to see whether it is already in D. If the item is already there, its frequency is incremented by one; otherwise, the item is inserted into D with a frequency of 1, as (e, 1, bcurrent − 1). At every bucket boundary, identified by the condition N mod w = 0, a deletion pass is performed: an entry is deleted if ƒ + Δ ≤ bcurrent. After scanning all items, the algorithm outputs the items in the D structure whose frequency satisfies ƒ ≥ (s − ε)·N, and each reported frequency underestimates the true one by at most ε·N. Note that Δ equals the bucket ID at insertion time. An implementation of this algorithm in PL/SQL could be as follows:
CREATE GLOBAL TEMPORARY TABLE lossytable (val NUMBER, freq NUMBER, bk NUMBER);
create or replace procedure find_topn (nm_rws integer, tpk integer)
is
   -- nm_rws: number of rows in the table, for example 1000000
   -- tpk   : number of top frequent values to track, for example 254
   CURSOR c_chk (cval NUMBER)
   IS
      SELECT *
        FROM lossytable
       WHERE val = cval;

   chk      c_chk%ROWTYPE;
   n        INTEGER := 0;
   b        NUMBER := 1;            -- current bucket (window) ID
   bktsize  INTEGER;                -- bucket width w = ceil(1/epsilon)
BEGIN
   COMMIT;                          -- clears the ON COMMIT DELETE ROWS temporary table
   bktsize := CEIL (nm_rws / tpk);  -- must be an integer for the MOD test below

   FOR c IN (SELECT * FROM t1)
   LOOP
      n := n + 1;
      OPEN c_chk (c.col);
      FETCH c_chk INTO chk;

      IF c_chk%FOUND
      THEN
         UPDATE lossytable                         -- known value: increment its frequency
            SET freq = freq + 1
          WHERE val = c.col;
      ELSE
         INSERT INTO lossytable (val, freq, bk)    -- new value: (e, 1, bcurrent - 1)
              VALUES (c.col, 1, b - 1);
      END IF;

      CLOSE c_chk;

      IF n MOD bktsize = 0                         -- bucket boundary: prune infrequent values
      THEN
         FOR j IN (SELECT * FROM lossytable)
         LOOP
            IF j.freq + j.bk <= b
            THEN
               DELETE lossytable
                WHERE val = j.val;
            END IF;
         END LOOP;

         b := b + 1;
      END IF;
   END LOOP;
   /* optionally prune the remaining infrequent elements:
      DELETE lossytable WHERE freq = 1 */
END;
/
Note: For more information on the space saving algorithm, see the paper at http://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf.
This algorithm stores the first K items and their exact frequencies in a data structure. If the next item is found in the data structure, its frequency is incremented by one. But when the item is not found, the (mitem, mfreq) pair is identified, where mfreq is the minimum frequency in the data structure and mitem is the corresponding item. The mitem item is replaced with the processed item, and its frequency is set to mfreq+1.
As with the previous algorithms, a special data structure is required to implement this algorithm efficiently, so a doubly linked list is used. All items with the same frequency are linked together in a linked list, and they all point to a parent bucket. Every bucket points to exactly one element among its child list, and the buckets are kept in a doubly linked list, sorted by their frequency values. The items themselves can be stored in a hash table to provide constant access cost. The linked-list operations (adding or deleting elements from the structure and updating frequencies) are the same as previously explained.
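For comparison, here is a similarly hedged PL/SQL sketch of the space saving algorithm, again over an assumed t1(col) table and not Oracle's code. For brevity it locates the minimum pair with a linear scan; a real implementation would use the stream-summary structure just described to find it in constant time.
create or replace procedure space_saving_sketch (k in pls_integer)
is
   TYPE counter_tab IS TABLE OF PLS_INTEGER INDEX BY VARCHAR2 (40);
   r       counter_tab;                   -- at most k (item, frequency) pairs
   key     VARCHAR2 (40);
   cur     VARCHAR2 (40);
   min_key VARCHAR2 (40);
BEGIN
   FOR c IN (SELECT col FROM t1)
   LOOP
      key := TO_CHAR (c.col);
      IF r.EXISTS (key)
      THEN
         r (key) := r (key) + 1;          -- monitored item: increment
      ELSIF r.COUNT < k
      THEN
         r (key) := 1;                    -- free slot: exact count so far
      ELSE
         min_key := r.FIRST;              -- locate the (mitem, mfreq) pair
         cur := r.NEXT (min_key);
         WHILE cur IS NOT NULL
         LOOP
            IF r (cur) < r (min_key)
            THEN
               min_key := cur;
            END IF;
            cur := r.NEXT (cur);
         END LOOP;
         r (key) := r (min_key) + 1;      -- replace mitem; freq := mfreq + 1
         r.DELETE (min_key);
      END IF;
   END LOOP;
END;
/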
Let's see how Oracle Database computes the top frequent values.
drop table t1;
CREATE TABLE t1 AS
SELECT
CASE
WHEN level <= 9700 THEN TRUNC(DBMS_RANDOM.value(1,20))
ELSE level
END AS col
FROM dual
CONNECT BY level <= 10000;
alter session set events '10832 trace name context forever, level 1';
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute dbms_stats.gather_table_stats(null,'t1', method_opt=> 'for columns col size 20');
The trace file shows the following lines:
DBMS_STATS: Approximate NDV Options
DBMS_STATS: TOPN,NIL,NIL,RWID,U20U
DBMS_STATS: select /*+ full(t) no_parallel(t) no_parallel_index(t)
dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware'
'false') no_substrb_pad
*/to_char(count("COL")),substrb(dump(min("COL"),16,0,64),1,240),substrb(dump(
max("COL"),16,0,64),1,240),count(rowidtochar(rowid)) from "SYS"."T1" t /*
TOPN,NIL,NIL,RWID,U20U*/
During the execution of this SQL code, Oracle Database computed the number of distinct values, maintained a ROWID table, and computed the top-n (20) frequency values.
qerandvInitAll: allocate synopsis for opc:0
Allocate rowid table for topn
qerandvGatherSqlStats: sel:0 mode:T
qerandvWriteXML sel:0 -> id:0
| column | #split | ndv | #nonnuls | rsize |
qesandvGetApproximateStats lvl:0 set:319
| 0 | 0 | 319 | 10000 | 20297 |
originally requested bucket num (via method_opt): 20
Computing topn from NDV structures...
Oracle Database first allocates a synopsis in order to compute the number of distinct values (NDV), and it also allocates space for the ROWID table in order to describe the top-n values. The mode:T
indicates that top-n values need to be extracted. Then Oracle Database computes the NDV, and during the process we see there were no splits. After the NDV computation, Oracle Database starts to extract the top-n values. I do not know exactly which algorithm is implemented to compute the top-n values, but from the trace file we see that Oracle Database did the calculation based on the NDV structure. How could that be possible? It looks like the NDV structure also helps to track frequency information for the distinct values. As mentioned above, that information could be maintained as a hash table: when processing the rows, a uniform hash function is applied to the column value, and if the corresponding hash value is found in the hash table, its frequency is incremented.
To efficiently perform the operations of finding, incrementing, and decrementing (or removing) values in this special data structure, a doubly linked list of groups could be used. Each group would represent a collection of equal frequencies and consist of two parts: (1) a doubly linked list of the values having that frequency, and (2) the difference between this group's frequency and the previous group's frequency (for the first group, the frequency itself). In this case, the synopsis did not reach its storage capacity, and it was enough to compute the NDV and the top-n values.
TopN Histogram Feasible?: YES
|-->tot_freq: 9701 tot_nonnull_rows: 10000 bucketnum >> actual: 20
requested: 20 ndv: 319
tot_freq/((double)tot_nonnull_rows): 0.970100, (1 −
1/((double)requested_bucketnum)): 0.950000
--> requested bucket num: 20
-------------------------------------------------------------------------
Rank Rowid Frequency
-------------------------------------------------------------------------
1 AAAXo1AABAAAbHRAAB 569
2 AAAXo1AABAAAbHRAAE 552
3 AAAXo1AABAAAbHRAAb 548
4 AAAXo1AABAAAbHRAAC 536
5 AAAXo1AABAAAbHRAAg 528
6 AAAXo1AABAAAbHRAAR 524
7 AAAXo1AABAAAbHRAAY 522
8 AAAXo1AABAAAbHRAAH 520
9 AAAXo1AABAAAbHRAAM 517
10 AAAXo1AABAAAbHRAAU 516
11 AAAXo1AABAAAbHRAAF 505
12 AAAXo1AABAAAbHRAAI 497
13 AAAXo1AABAAAbHRAAD 492
14 AAAXo1AABAAAbHRAAZ 490
15 AAAXo1AABAAAbHRAAS 488
16 AAAXo1AABAAAbHRAAJ 478
17 AAAXo1AABAAAbHRAAA 477
18 AAAXo1AABAAAbHRAAl 475
19 AAAXo1AABAAAbHRAAP 466
20 AAAXo1AABAAAbHgAAP 1
qerandvGatherSqlStats: sel:1 mode:_
qerandvGatherSqlStats: sel:2 mode:_
qerandvGatherSqlStats: sel:3 mode:R
qerandvGatherSqlStats: xml output -
The database computed the top frequent values and displayed the ROWID table content there. If we execute our procedure (the lossy counting implementation), we get the exact same result.
SQL> exec find_topn(10000,20);
SQL> select val,freq from lossytable order by 2 desc;
VAL FREQ
---------- ----------
10 569
11 552
1 548
9 536
2 528
13 524
3 522
16 520
4 517
5 516
6 505
17 497
15 492
7 490
18 488
8 478
19 477
12 475
14 466
In addition, Oracle Database displayed the data stream structure in the trace file. In my opinion, it is doubly linked lists that enable the efficient computation of the frequencies.
Dumping stream from 0 to 1000
----------------------------------
0000: 60 115 101 108 101 99 116 95 108 105 115 116 95 105 116 101 109 62 60 112
<select_list_item><p
0020: 111 115 62 48 60 47 112 111 115 62 60 118 97 108 117 101 62 51 49 57
os>0</pos><value>319
0040: 60 47 118 97 108 117 101 62 60 114 111 119 99 110 116 62 49 48 48 48
</value><rowcnt>1000
0060: 48 60 47 114 111 119 99 110 116 62 60 113 99 109 115 103 62 48 60 47
0</rowcnt><qcmsg>0</
0080: 113 99 109 115 103 62 60 115 112 108 105 116 62 48 60 47 115 112 108 105
qcmsg><split></spli
0100: 116 62 60 110 100 118 62 51 49 57 60 47 110 100 118 62 60 110 111 110
t><ndv>319</ndv><non
0120: 110 117 108 108 115 62 49 48 48 48 48 60 47 110 111 110 110 117 108 108
nulls>10000</nonnull
0140: 115 62 60 114 115 105 122 101 62 50 48 50 57 55 60 47 114 115 105 122
s><rsize>20297</rsiz
0160: 101 62 60 116 111 112 110 99 110 116 62 50 48 60 47 116 111 112 110 99
e><topncnt>20</topncnt>
In addition, by default, Oracle Database calculates user_tab_histograms.density as 1/(2*num_rows) for frequency and top frequency histograms; for our 10,000-row table, that is 1/20000 = 0.00005. The last value of the top frequency histogram is the maximum column value. The finding of top-n values is implemented in the dbms_sqltune_internal.i_process_sql_callout procedure:
PROCEDURE I_PROCESS_SQL_CALLOUT(
STMT IN OUT SQLSET_ROW,
ACTION IN BINARY_INTEGER,
TIME_LIMIT IN POSITIVE,
CTRL_OPTIONS IN XMLTYPE,
EXTRA_RESULT OUT CLOB,
ERR_CODE OUT BINARY_INTEGER,
ERR_MESG OUT VARCHAR2)
The value for the parameter ACTION is passed as 'GATHER_SQL_STATS' and EXTRA_RESULT is returned in the XMLType format, which contains approximate NDV and top-n information.
A hybrid histogram is created based on features of the frequency histogram and the height-balanced histogram, and it gives us more-accurate information for the popular values. Unlike the height-balanced histogram, popular values do not span multiple buckets. Each bucket in the hybrid histogram contains two types of frequency information: first, the cumulative frequency (the endpoint number), which indicates the total number of rows up to and including the bucket; second, the frequency of the bucket's endpoint value itself (the endpoint repeat count).
drop table t1;
CREATE TABLE t1 AS
SELECT level AS id,
CASE
WHEN level <= 983000 THEN TRUNC(DBMS_RANDOM.value(1,254))
ELSE level
END AS col,
lpad('desc',level) AS decr
FROM dual
CONNECT BY level <= 1000000;
SQL> set pagesize 9999
SQL> exec find_topn(nm_rws=>1000000,tpk=>254);
PL/SQL procedure successfully completed.
SQL> select val,freq from lossytable order by 2 desc;
VAL FREQ
---------- ----------
145 4113
37 4041
49 4035
179 4024
187 4020
151 4014
120 4010
115 4009
157 4008
135 4004
198 3999
................................
And the following statement gives us the total frequency of the TFVs:
SQL> select sum(freq) from lossytable ;
SUM(FREQ)
----------
983002
Select * from t1 where col=100;
alter session set events '10832 trace name context forever, level 1';
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute dbms_stats.gather_table_stats(null,'t1');
From the trace file, we can see that Oracle Database switched to the lossy counting algorithm.
compute approximation on (4:NTNR)
0->sel(0) type(ndv/acl/nonnuls)
1->sel(3) type(topn/ndv/acl/nonnuls)
2->sel(6) type(ndv/acl/nonnuls)
3->sel(9) type(rowid)
qerandvInitAll: allocate synopsis for opc:0
qerandvInitAll: allocate synopsis for opc:1
Allocate rowid table for topn
qerandvInitAll: allocate synopsis for opc:2
Switching to Lossy Counting, freq_threshold: 1
allocate top n structures...
qerandvGatherSqlStats: sel:0 mode:N
qerandvWriteXML sel:0 -> id:0
| column | #split | ndv | #nonnuls | rsize |
qesandvGetApproximateStats lvl:6 set:15656
| 0 | 6 | 1000000 | 1000000 | 3979802 |
qerandvGatherSqlStats: sel:1 mode:_
qerandvGatherSqlStats: sel:2 mode:_
qerandvGatherSqlStats: sel:3 mode:T
qerandvWriteXML sel:3 -> id:1
| column | #split | ndv | #nonnuls | rsize |
qesandvGetApproximateStats lvl:1 set:8658
| 1 | 1 | 17316 | 1000000 | 2624169 |
originally requested bucket num (via method_opt): 254
Computing approximate topn from native topn structures...
TopN Histogram Feasible?: NO
|-->tot_freq: 983002 tot_nonnull_rows: 1000000 bucketnum >> actual: 254
requested: 254 ndv: 17316
tot_freq/((double)tot_nonnull_rows): 0.983002, (1 −
1/((double)requested_bucketnum)): 0.996063
Oracle Database switched to lossy counting because the NDV structure was not sufficient to compute the top-n values: splits were performed while estimating the NDV for the column, so the native NDV structure can no longer provide accurate top-n frequencies. That is why Oracle Database used lossy counting here. As can clearly be seen, Oracle Database calculated the total frequency of the top values as 983002, the same result our find_topn
procedure produced. Oracle Database also displayed the stream data in the trace file:
Dumping stream from 0 to 2000
----------------------------------
0000: 60 115 101 108 101 99 116 95 108 105 115 116 95 105 116 101 109 62
60 112 <select_list_item><p
0020: 111 115 62 48 60 47 112 111 115 62 60 118 97 108 117 101 62 49
48 48 os>0</pos><value>100
0040: 48 48 48 48 60 47 118 97 108 117 101 62 60 114 111 119 99 110
116 62 0000</value><rowcnt>
0060: 49 48 48 48 48 48 48 60 47 114 111 119 99 110 116 62 60 113
99 109 1000000</rowcnt><qcm
0080: 115 103 62 48 60 47 113 99 109 115 103 62 60 115 112 108 105 116
62 54 sg>0</qcmsg><split>6
0100: 60 47 115 112 108 105 116 62 60 110 100 118 62 49 48 48 48 48
48 48 </split><ndv>1000000
0120: 60 47 110 100 118 62 60 110 111 110 110 117 108 108 115 62 49 48
48 48 </ndv><nonnulls>1000
0140: 48 48 48 60 47 110 111 110 110 117 108 108 115 62 60 114 115 105
122 101 000</nonnulls><rsize
0160: 62 51 57 55 57 56 48 50 60 47 114 115 105 122 101 62 60 47
115 101 >3979802</rsize></se
0180: 108 101 99 116 95 108 105 115 116 95 105 116 101 109 62 60 115 101
108 101 lect_list_item><sele
0200: 99 116 95 108 105 115 116 95 105 116 101 109 62 60 112 111 115 62
49 60 ct_list_item><pos>1<
0220: 47 112 111 115 62 60 118 97 108 117 101 62 84 121 112 61 50 32
76 101 /pos><value>Typ=2 Le
...
According to the criteria for creating a hybrid histogram, the top-n values occupied less than p percent of the total number of rows, where p = (1 − (1/n))*100, and that is why Oracle Database decided to create a hybrid histogram. In our case, we have 1000000*(1 − 1/254) = 996062.992 > 983002.
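We can reproduce this feasibility check ourselves with a simple query. It is only an illustration, assuming lossytable has been populated by the find_topn call above:
SELECT CASE
          WHEN SUM (freq) >= (1 - 1/254) * 1000000 THEN 'YES'
          ELSE 'NO'
       END topn_feasible
  FROM (SELECT freq
          FROM lossytable
         ORDER BY freq DESC
         FETCH FIRST 254 ROWS ONLY);
Here it returns NO, because 983002 < 996062.992, so the database falls through to the hybrid histogram path, as the dbms_stats trace confirms: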
DBMS_STATS: start processing top n values for column "COL"
DBMS_STATS: >> frequency histograms is not feasible
(topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
DBMS_STATS: Y ID
DBMS_STATS: Y Y Y Y COL
DBMS_STATS: Y DECR
DBMS_STATS: Building Histogram for COL
DBMS_STATS: bktnum=254, nnv=1000000, snnv=5500, sndv=17316, est_ndv=17316, mnb=254
DBMS_STATS: Trying hybrid histogram
To create a hybrid histogram, the database executed the following SQL statement in order to get the histogram data. So, Oracle Database used a sampling method to generate the data for the hybrid histogram. First, the sample size was determined via an adaptive method; in our case, the sampling percentage is 0.55 (roughly 5,500 of the 1,000,000 rows, matching snnv=5500 in the trace above).
select substrb(dump(val,16,0,64),1,240) ep, freq, cdn, ndv, (sum(pop)
over()) popcnt, (sum(pop*freq) over()) popfreq, substrb(dump(max(val)
over(),16,0,64),1,240) maxval, substrb(dump(min(val) over(),16,0,64),1,240)
minval from (select val, freq, (sum(freq) over()) cdn, (count(*) over())
ndv, (case when freq > ((sum(freq) over())/254) then 1 else 0 end) pop
from (select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false')
no_substrb_pad */ "COL" val, count("COL") freq from "SYS"."T1" sample (
.5500000000) t where "COL" is not null group by "COL")) order by val
Some questions arise: how does the hybrid histogram describe our data, and what can we say about the quality of the hybrid histogram?
DBMS_STATS: > cdn 5428, popFreq 2659, popCnt 106, bktSize
18.65986394557823129251700680272108843537, bktSzFrc
.65986394557823129251700680272108843537
DBMS_STATS: Evaluating hybrid histogram: cht.count 254, mnb 254, ssize
5428, min_ssize 2500, appr_ndv TRUE, ndv 17316, selNdv 19, selFreq 520, pct
.55, avg_bktsize 21, csr.hreq TRUE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 7
DBMS_STATS: Accepting histogram
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
Oracle Database calculated the following:
bktSize = ((cdn − popFreq − freq) / (mnb − popCnt − 1)) = (5428 − 2659 − 26)/(254 − 106 − 1) = 18.65986394557823129251700680272108843537
(here freq = 26 is the frequency of the first fetched value);
The bucket size can differ from bucket to bucket, but Oracle Database calculated the average bucket size as this:
avg_bktsize := cdn/mnb = 5428/254 ≈ 21.37 (reported as 21 in the trace)
We requested 254 buckets, but the sampling query returned 360 rows, as we can see from the tkprof
output.
Here is Query 1:
SQL ID: fuj9hfc4b30qf Plan Hash: 2782092977
select substrb(dump(val,16,0,64),1,240) ep, freq, cdn, ndv, (sum(pop) over())
popcnt, (sum(pop*freq) over()) popfreq, substrb(dump(max(val) over(),16,
0,64),1,240) maxval, substrb(dump(min(val) over(),16,0,64),1,240) minval
from
(select val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv, (case
when freq > ((sum(freq) over())/254) then 1 else 0 end) pop from (select
/*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */ "COL" val,
count("COL") freq from "SYS"."T1" sample (.5500000000) t where "COL"
is not null group by "COL")) order by val
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 361 3.61 3.93 997444 521514 9 360
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 363 3.62 3.94 997444 521514 9 360
How did Oracle Database process 360 rows and create 254 buckets for the hybrid histogram? What are the criteria for creating a hybrid histogram?
Oracle Database opens a cursor for Query 1 and processes each row that is returned by the cursor. Oracle Database performs the following operations and calculates the following variables.
1. ROWCNT: Initially it is zero; as rows are processed, the counter is incremented, and it finally represents the total number of rows returned by sampling (in our case, 360).
2. BKSIZE: For each iteration, Oracle Database calculates BKSIZE for each bucket. If popcnt is greater than or equal to the number of requested buckets (NRB) minus 1, it is calculated as (CDN − FREQ)/(MNB − 1); otherwise, as ((CDN − POPFREQ − FREQ)/(MNB − POPCNT − 1)).
3. CUMFREQ: Oracle Database cumulatively calculates frequencies as CUMFREQ = CUMFREQ + freq. The value is used to define the bucket numbers (endpoint numbers).
4. Bkcnt: This is the bucket count. Oracle Database increments it when it decides to create a bucket, and it is used to track how many buckets have been created and how many are still required (taking into account any that are currently not being used).
5. BKTROWS: This indicates the number of rows within each bucket; for each iteration it is cumulatively incremented by freq. It is compared to BKSIZE to determine whether a bucket should be created. If a bucket is created, BKTROWS is reset to 0 for the current iteration.
6. A bucket is created when BKTROWS is greater than BKSIZE, or when the nonprocessed NDV count is less than or equal to the number of noncreated buckets (that is, when the number of unprocessed rows is equal to or less than the number of uncreated buckets), because all rows (the NDV actually is equal to the final ROWCNT) should be placed among the requested number of histogram buckets (NRB).
7. The process stops when Bkcnt = HB − 1 and it is not the last record.
We can summarize all of these operations using the following script:
CREATE GLOBAL TEMPORARY TABLE tmp_hybrid_hist
(
ep_num NUMBER, -- endpoint number
ep_val NUMBER, -- endpoint value
ep_rpcnt NUMBER -- endpoint repeat count
);
/* Note that we can write and execute all of the statements below dynamically, but I
use the simple SQL statements explicitly to demonstrate the mechanism*/
CREATE OR REPLACE PROCEDURE create_hh (tnm_rws   INTEGER,
                                       maxcval   NUMBER,
                                       hbn       INTEGER,
                                       smplsize  NUMBER)
IS
   -- hbn and smplsize are kept for experimentation only: the cursor below
   -- hardcodes 254 buckets and the 0.55 percent sample, as in the trace
   val       VARCHAR2 (240);
   freq      NUMBER;
   cumfreq   NUMBER;
   bktrows   NUMBER;
   bktsize   NUMBER;
   rowcnt    NUMBER;
   cdn       NUMBER;
   ndv       NUMBER;
   popcnt    NUMBER;
   popfreq   NUMBER;
   bktcnt    NUMBER;
   hb        INTEGER := 254;        -- we are going to create 254 buckets

   CURSOR cur_hh
   IS
        SELECT val ep,
               freq,
               cdn,
               ndv,
               (SUM (pop) OVER ()) popcnt,
               (SUM (pop * freq) OVER ()) popfreq
          FROM (SELECT val,
                       freq,
                       (SUM (freq) OVER ()) cdn,
                       (COUNT (*) OVER ()) ndv,
                       (CASE
                           WHEN freq > ((SUM (freq) OVER ()) / 254) THEN 1
                           ELSE 0
                        END) pop
                  FROM (  SELECT "COL" val, COUNT ("COL") freq
                            FROM "SYS"."T1" SAMPLE (.5500000000) t
                           WHERE "COL" IS NOT NULL
                        GROUP BY "COL"))
      ORDER BY val;
BEGIN
   COMMIT;                          -- clears the temporary table from a previous run
   OPEN cur_hh;
   bktrows := 0;
   bktcnt  := 0;
   cumfreq := 0;
   rowcnt  := 0;
   cdn     := 0;
   ndv     := 0;

   LOOP
      FETCH cur_hh INTO val, freq, cdn, ndv, popcnt, popfreq;
      EXIT WHEN cur_hh%NOTFOUND;
      rowcnt := rowcnt + 1;

      IF rowcnt = 1                 -- the bucket size is fixed on the first fetch
      THEN
         IF hb - 1 <= popcnt
         THEN
            bktsize := (cdn - freq) / (hb - 1);
         ELSE
            bktsize := ((cdn - popfreq - freq) / (hb - popcnt - 1));
         END IF;
      END IF;

      bktrows := bktrows + freq;
      cumfreq := cumfreq + freq;

      IF (   bktrows >= bktsize                 -- condition 6: the bucket is full
          OR (ndv - rowcnt) <= (hb - bktcnt)    -- condition 6: remaining values fit 1:1
          OR rowcnt = 1                         -- the first value closes a bucket
          OR rowcnt = ndv)                      -- the last value closes a bucket
      THEN
         IF bktcnt = hb AND rowcnt <> ndv       -- condition 7: bucket limit reached
         THEN
            INSERT INTO tmp_hybrid_hist (ep_num, ep_val, ep_rpcnt)
                 VALUES (tnm_rws /* numrows in the table */,
                         maxcval /* max column value */,
                         1);
            CLOSE cur_hh;
            RETURN;
         END IF;

         bktrows := 0;
         bktcnt  := bktcnt + 1;
         INSERT INTO tmp_hybrid_hist (ep_num, ep_val, ep_rpcnt)
              VALUES (cumfreq, val, freq);
      END IF;
   END LOOP;

   CLOSE cur_hh;
END;
/
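For example, to mirror the gathering session above, we could invoke the procedure as follows (maxcval is the maximum column value, 1000000 here; as noted in the code, the hbn and smplsize parameters are placeholders, because the body hardcodes 254 buckets and the 0.55 percent sample):
exec create_hh(tnm_rws => 1000000, maxcval => 1000000, hbn => 254, smplsize => 0.55);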
SELECT * FROM tmp_hybrid_hist;
According to this mechanism, hybrid histograms might have two major issues. First, the sample size might not be enough to capture the most-popular values (probably, in most cases); in that case, the database misses the most-popular values and they are not included in the histogram. For example, Query 1 reported a total popular frequency of around 3,000 (popFreq 2659, as we can see clearly from the dbms_stats
trace file), but the total frequency of the top values is 983002, as we calculated with our find_topn
procedure and saw in the trace file. Second, condition numbers 6 and 7 shown above directly restrict picking up all top frequent values. This means that even when Query 1 does include the most-frequent values in the sample, the current implementation of hybrid histogram construction might not pick them up. The reason is that Query 1 contains the ORDER BY val clause, and conditions 6 and 7 then restrict which values are picked up: before the most frequent values are reached, bktcnt
can reach its limit (HB) and the process can be stopped. Let us demonstrate that with the following example.
drop table t1;
CREATE TABLE t1 AS
SELECT level AS id,
CASE
WHEN level <= 6000 THEN TRUNC(DBMS_RANDOM.value(1,20))
when level>=9000 and level<=9990 then 9990
ELSE level
END AS col,
'Description for ' || level AS decrp
FROM dual
CONNECT BY level <= 10000;
select * from t1 where col=9990;
alter session set sql_trace=true;
alter session set events '10832 trace name context forever, level 1';
execute dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute dbms_stats.gather_table_stats(null,'t1');
SQL> set pagesize 9999
SQL> select col,count(*) from t1
group by col
order by 2 desc;
COL COUNT(*)
---------- ----------
9990 991
17 350
18 348
1 342
19 330
4 322
3 319
5 318
16 318
11 316
14 316
7 313
13 312
12 304
15 304
10 301
8 300
2 297
6 296
9 294
6001 1
6013 1
6015 1
6023 1
6028 1
6034 1
...............
...............
...............
8916 1
8925 1
8942 1
8947 1
8973 1
8984 1
8989 1
9992 1
9993 1
9995 1
3029 rows selected.
So this means we have 3029 distinct values and the most-frequent value is 9990 (and its frequency is 991). Let's see the histogram data.
SQL> SELECT num_distinct, num_buckets, histogram
FROM user_tab_col_statistics
WHERE table_name = 'T1' AND column_name = 'COL';
NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ----------- ---------------
3029 254 HYBRID
SQL> SELECT endpoint_number epn,
endpoint_number
- LAG (endpoint_number, 1, 0) OVER (ORDER BY endpoint_number)
freq,
endpoint_value epv,
endpoint_repeat_count eprc
FROM user_tab_histograms
WHERE table_name = 'T1' AND column_name = 'COL'
ORDER BY endpoint_value;
EPN FREQ EPV EPRC
---------- ---------- ---------- ----------
342 342 1 342
639 297 2 297
958 319 3 319
1280 322 4 322
1598 318 5 318
1894 296 6 296
2207 313 7 313
2507 300 8 300
2801 294 9 294
3102 301 10 301
3418 316 11 316
3722 304 12 304
4034 312 13 312
4350 316 14 316
4654 304 15 304
4972 318 16 318
5322 350 17 350
5670 348 18 348
6000 330 19 330
6011 11 6011 1
6023 12 6023 1
6034 11 6034 1
6046 12 6046 1
6057 11 6057 1
6069 12 6069 1
6080 11 6080 1
.....................................
.....................................
8598 11 8598 1
8610 12 8610 1
8621 11 8621 1
8633 12 8633 1
8644 11 8644 1
8656 12 8656 1
8667 11 8667 1
8678 11 8678 1
10000 1322 10000 1
254 rows selected.
As you can see, the most-frequent value, 9990, has not been included in the histogram, while the second-most-frequent value has (val=17
with freq=350
). In addition, Oracle Database calculated the bucket size as 11 (or 12), but the last bucket contains 1322 rows! What is the reason? Is it related to the sample size? With a small sample size the database might indeed fail to read the most-frequent value, but there is another problem here. The table is very small and, therefore, the database did not use sampling (in other words, it used 100 percent of the rows) to construct the histogram; it actually performed a FULL TABLE SCAN. We can see that from the SQL TRACE/tkprof result, as follows:
select val,substrb(dump(val,16,0,64),1,240) ep, freq, cdn, ndv, (sum(pop) over())
popcnt, (sum(pop*freq) over()) popfreq, substrb(dump(max(val) over(),16,
0,64),1,240) maxval, substrb(dump(min(val) over(),16,0,64),1,240) minval
from
(select val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv, (case
when freq > ((sum(freq) over())/254) then 1 else 0 end) pop from (select
/*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */ "COL" val,
count("COL") freq from "SYS"."T1" t where "COL" is not null group by
"COL")) order by val
Here are the execution statistics for this query (Query 2):
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 3030 0.05 0.05 0 56 3 3029
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3032 0.05 0.05 0 56 3 3029
Query 2 actually contains the most-frequent value: 9990; if we execute Query 2 we can see it clearly:
SQL> select val, freq, cdn, ndv, (sum(pop) over())
popcnt, (sum(pop*freq) over()) popfreq
from
(select val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv, (case
when freq > ((sum(freq) over())/254) then 1 else 0 end) pop from (select
/*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact
use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl
opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */ "COL" val,
count("COL") freq from "SYS"."T1" t where "COL" is not null group by
"COL")) order by val;
VAL FREQ CDN NDV POPCNT POPFREQ
---------- ---------- ---------- ---------- ---------- ----------
1 342 10000 3029 20 6991
2 297 10000 3029 20 6991
3 319 10000 3029 20 6991
4 322 10000 3029 20 6991
5 318 10000 3029 20 6991
6 296 10000 3029 20 6991
7 313 10000 3029 20 6991
8 300 10000 3029 20 6991
9 294 10000 3029 20 6991
10 301 10000 3029 20 6991
11 316 10000 3029 20 6991
12 304 10000 3029 20 6991
13 312 10000 3029 20 6991
14 316 10000 3029 20 6991
15 304 10000 3029 20 6991
16 318 10000 3029 20 6991
17 350 10000 3029 20 6991
18 348 10000 3029 20 6991
19 330 10000 3029 20 6991
6001 1 10000 3029 20 6991
6002 1 10000 3029 20 6991
6003 1 10000 3029 20 6991
6004 1 10000 3029 20 6991
6005 1 10000 3029 20 6991
6006 1 10000 3029 20 6991
6007 1 10000 3029 20 6991
6008 1 10000 3029 20 6991
...........................................................
...........................................................
8992 1 10000 3029 20 6991
8993 1 10000 3029 20 6991
8994 1 10000 3029 20 6991
8995 1 10000 3029 20 6991
8996 1 10000 3029 20 6991
8997 1 10000 3029 20 6991
8998 1 10000 3029 20 6991
8999 1 10000 3029 20 6991
9990 991 10000 3029 20 6991
9991 1 10000 3029 20 6991
9992 1 10000 3029 20 6991
9993 1 10000 3029 20 6991
9994 1 10000 3029 20 6991
9995 1 10000 3029 20 6991
9996 1 10000 3029 20 6991
9997 1 10000 3029 20 6991
9998 1 10000 3029 20 6991
9999 1 10000 3029 20 6991
10000 1 10000 3029 20 6991
3029 rows selected.
It clearly can be seen that Oracle Database did not reach the value 9990: it had already created 253 buckets when it fetched the cursor value 8678, at which point condition numbers 6 and 7 evaluated to true and the database stopped the process. In my opinion, the order by val
clause in Query 2 should be replaced with the order by freq desc
clause; in that case, the most-frequent values would be included in the histogram, and the database could also adjust the number of buckets (254), so we would get the desired result.
But some questions could arise: will replacing ORDER BY VAL with ORDER BY FREQ DESC significantly change the bucket size, and does it influence the process of cardinality estimation? The answers are no, because for popular values we actually have endpoint_repeat_count
, while for nonpopular values Oracle Database uses density (NewDensity
), which is calculated at parse time, as below:
NewDensity = (num_rows − popfreq)/((NDV − popCnt)* num_rows)
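Plugging in the numbers from this example: NewDensity = (10000 − 6000)/((3029 − 19) × 10000) ≈ 0.000133, which is exactly the NewDensity value reported in the optimizer trace shown below.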
As you see, that density depends on the top frequent values: their total frequency (popfreq
) and their count (popCnt
). This means that if we lose one or more frequent values, it influences not only the estimation for the popular values but also the estimation for the nonpopular values. Before making the required change (replacing ORDER BY VAL with ORDER BY FREQ DESC), let's extract the data from the created histogram, as follows:
SELECT ep_val,
       pop,
       freq,
       crdn,
       COUNT (*) OVER () ndv,
       SUM (pop) OVER () popcnt,
       SUM (pop * freq) OVER () popfreq
  FROM (SELECT ep_val,
               ROUND (freq) freq,
               SUM (freq) OVER () crdn,
               (CASE WHEN endpoint_repeat_count > 1 THEN 1 ELSE 0 END) pop
          FROM (  SELECT ep_val,
                         endpoint_repeat_count,
                         ROUND (SUM (freq)) freq
                    FROM (SELECT endpoint_repeat_count,
                                 endpoint_value ep_val,
                                 endpoint_number
                                 - LAG (endpoint_number, 1, 0)
                                      OVER (ORDER BY endpoint_number) freq
                            FROM dba_tab_histograms
                           WHERE table_name = 'T1'
                             AND column_name = 'COL')
                GROUP BY ep_val, endpoint_repeat_count
                ORDER BY 2 DESC))
 ORDER BY ep_val;
EP_VAL POP FREQ CRDN NDV POPCNT POPFREQ
---------- ---------- ---------- ---------- ---------- ---------- ----------
1 1 342 10000 254 19 6000
2 1 297 10000 254 19 6000
3 1 319 10000 254 19 6000
4 1 322 10000 254 19 6000
5 1 318 10000 254 19 6000
6 1 296 10000 254 19 6000
7 1 313 10000 254 19 6000
8 1 300 10000 254 19 6000
9 1 294 10000 254 19 6000
10 1 301 10000 254 19 6000
11 1 316 10000 254 19 6000
12 1 304 10000 254 19 6000
13 1 312 10000 254 19 6000
14 1 316 10000 254 19 6000
15 1 304 10000 254 19 6000
16 1 318 10000 254 19 6000
17 1 350 10000 254 19 6000
18 1 348 10000 254 19 6000
19 1 330 10000 254 19 6000
6011 0 11 10000 254 19 6000
6023 0 12 10000 254 19 6000
6034 0 11 10000 254 19 6000
6046 0 12 10000 254 19 6000
6057 0 11 10000 254 19 6000
......................................................................
......................................................................
8598 0 11 10000 254 19 6000
8610 0 12 10000 254 19 6000
8621 0 11 10000 254 19 6000
8633 0 12 10000 254 19 6000
8644 0 11 10000 254 19 6000
8656 0 12 10000 254 19 6000
8667 0 11 10000 254 19 6000
8678 0 11 10000 254 19 6000
10000 0 1322 10000 254 19 6000
254 rows selected.
As you see, in the current case we have 19 popular (most-frequent) values, and their total frequency is 6000; we have lost the value 9990 and its frequency, 991. We can also see this information in the optimizer trace file for the query:
select * from t1 where col=9990
kkecdn: Single Table Predicate:"T1"."COL"=9990
Column (#2):
NewDensity:0.000133, OldDensity:0.000179 BktCnt:10000.000000, PopBktCnt:6000.000000, PopValCnt:19, NDV:3029
Column (#2): COL(NUMBER)
AvgLen: 4 NDV: 3029 Nulls: 0 Density: 0.000133 Min: 1.000000 Max: 10000.000000
Histogram: Hybrid #Bkts: 254 UncompBkts: 10000 EndPtVals: 254 ActualVal: yes
Using density: 1.3289e-04 of col #2 as selectivity of pred having unreasonably low value
But if we change ORDER BY VAL to ORDER BY FREQ DESC in our procedure (in the query of the CUR_HH cursor) and execute the CREATE_HH
procedure again, we will get the following result:
SELECT ep_val,
       pop,
       freq,
       crdn,
       COUNT (*) OVER () ndv,
       SUM (pop) OVER () popcnt,
       SUM (pop * freq) OVER () popfreq
  FROM (SELECT ep_val,
               ROUND (freq) freq,
               SUM (freq) OVER () crdn,
               (CASE WHEN ep_rpcnt > 1 THEN 1 ELSE 0 END) pop
          FROM (  SELECT ep_rpcnt,
                         ep_val,
                         ROUND (SUM (freq)) freq
                    FROM (SELECT ep_rpcnt,
                                 ep_val,
                                 ep_num
                                 - LAG (ep_num, 1, 0)
                                      OVER (ORDER BY ep_num) freq
                            FROM tmp_hybrid_hist)
                GROUP BY ep_val, ep_rpcnt
                ORDER BY 3 DESC))
 ORDER BY ep_val;
EP_VAL POP FREQ CRDN NDV POPCNT POPFREQ
---------- ---------- ---------- ---------- ---------- ---------- ----------
1 1 342 10000 253 20 6991
2 1 297 10000 253 20 6991
3 1 319 10000 253 20 6991
4 1 322 10000 253 20 6991
5 1 318 10000 253 20 6991
6 1 296 10000 253 20 6991
7 1 313 10000 253 20 6991
8 1 300 10000 253 20 6991
9 1 294 10000 253 20 6991
10 1 301 10000 253 20 6991
11 1 316 10000 253 20 6991
12 1 304 10000 253 20 6991
13 1 312 10000 253 20 6991
14 1 316 10000 253 20 6991
15 1 304 10000 253 20 6991
16 1 318 10000 253 20 6991
17 1 350 10000 253 20 6991
18 1 348 10000 253 20 6991
19 1 330 10000 253 20 6991
6032 0 9 10000 253 20 6991
6048 0 9 10000 253 20 6991
6052 0 9 10000 253 20 6991
6067 0 9 10000 253 20 6991
......................................................................
......................................................................
8920 0 9 10000 254 20 6991
8956 0 9 10000 254 20 6991
8961 0 9 10000 254 20 6991
8965 0 9 10000 254 20 6991
8966 0 9 10000 254 20 6991
8996 0 9 10000 254 20 6991
9990 1 991 10000 254 20 6991
10000 0 912 10000 254 20 6991
254 rows selected.
Note: After executing the CREATE_HH
procedure with the CUR_HH cursor that includes the ORDER BY freq DESC clause, we need to rearrange the bucket numbers for the histogram, and we can do that just by executing the following statement:
UPDATE tmp_hybrid_hist t
SET t.ep_num =
(SELECT SUM (ep_rpcnt)
FROM tmp_hybrid_hist tt
WHERE tt.ep_val <= t.ep_val);
As you see, the bucket size decreased but the number of frequent values increased: we have included all frequent values. A hybrid histogram is designed to provide accurate information for popular values and also to improve selectivity estimates for nonpopular values; in other words, a hybrid histogram should include as many frequent values as possible. This supports the idea that a hybrid histogram could also be created during a FULL TABLE SCAN, after extracting the top-n values, without sampling, because in that case we already have the top-n values and the ROWID table, the NDV, the min value, and the max value. If we look at the current estimation mechanism of the Oracle Database optimizer, we can easily see that the estimation for nonpopular values also depends on the popular values; that is why creating a hybrid histogram from a top-n structure is possible and achievable.
We can solve this problem (including the MFVs in the hybrid histogram) by using the DBMS_STATS.set_column_stats
procedure to set the hybrid histogram data. Before doing that, let me show the explain plan:
explain plan for
select * from t1 where col=9990;
select * from table(dbms_xplan.display);
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 28 | 14 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 1 | 28 | 14 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("COL"=9990)
And now we are going to change the histogram data using the dbms_stats
package (prepare_column_values
, set_column_stats
):
DECLARE
distcnt NUMBER;
density NUMBER;
nullcnt NUMBER;
avgclen NUMBER;
srec DBMS_STATS.statrec;
array_col_val DBMS_STATS.numarray;
BEGIN
/* To generate the information below (elements of the arrays), we can use
DBMS_STATS.get_column_stats or get the information directly from the t1 table
*/
DBMS_STATS.get_column_stats (ownname => USER,
tabname => 't1',
colname => 'col',
distcnt => distcnt,
density => density,
nullcnt => nullcnt,
srec => srec,
avgclen => avgclen);
/* This array contains distinct column values that we want to store in the Hybrid Histogram */
array_col_val :=
DBMS_STATS.numarray (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,6032,
6048,6052,6067,6076,6077,6082,6108,6134,6137,6151,6155,
6161,6190,6196,6202,6206,6242,6258,6260,6264,6265,6294,6300,
6315,6327,6330,6344,6358,6370,6382,6396,6408,6416,6431,6432,6449,
6461,6494,6498,6499,6506,6513,6533,6557,6570,6571,6579,6585,6601,6611,
6618,6626,6660,6661,6664,6671,6681,6699,6736,6740,6749,6751,6760,6772,
6795,6815,6824,6834,6860,6862,6878,6883,6885,6898,6928,6930,6944,6955,6979,
6995,6997,7004,7023,7045,7048,7049,7069,7084,7087,7097,7109,7129,7133,7148,7153,
7161,7190,7207,7219,7226,7230,7243,7277,7281,7283,7303,7313,7323,7328,7343,7347,
7351,7392,7414,7415,7421,7424,7425,7458,7466,7477,7487,7491,7507,7535,7545,7546,
7549,7567,7603,7605,7609,7620,7622,7640,7660,7665,7667,7676,7706,7723,7736,7749,
7752,7754,7765,7768,7795,7809,7814,7818,7824,7850,7874,7881,7891,7896,7917,7922,
7930,7958,7959,7984,7999,8009,8019,8042,8052,8059,8064,8106,8113,8120,8125,8155,
8162,8180,8213,8220,8230,8243,8251,8269,8277,8296,8334,8340,8348,8355,8385,8392,
8409,8413,8429,8461,8470,8488,8492,8494,8527,8536,8538,8542,8574,8600,8603,8615,
8625,8626,8678,8691,8693,8706,8709,8744,8761,8764,8778,8791,8806,8827,8843,8874,
8879,8880,8894,8920,8956,8961,8965,8966,8996,9990,10000);
/* This array contains the total frequency of values that are less than or equal
to each distinct value */
srec.rpcnts :=
DBMS_STATS.numarray (342,639,958,1280,1598,1894,2207,2507,2801,3102,3418,3722,4034,4350,
4654,4972,5322,5670,6000,6011,6023,6034,6046,6057,6069,6080,6092,6103,
6114,6126,6137,6149,6160,6172,6183,6195,6206,6217,6229,6240,6252,6263,6275,
6286,6298,6309,6320,6332,6343,6355,6366,6378,6389,6401,6412,6423,6435,6446,6458,
6469,6481,6492,6504,6515,6527,6538,6549,6561,6572,6584,6595,6607,6618,6630,6641,
6652,6664,6675,6687,6698,6710,6721,6733,6744,6755,6767,6778,6790,6801,6813,6824,
6836,6847,6858,6870,6881,6893,6904,6916,6927,6939,6950,6961,6973,6984,6996,7007,
7019,7030,7042,7053,7064,7076,7087,7099,7110,7122,7133,7145,7156,7168,7179,7190,
7202,7213,7225,7236,7248,7259,7271,7282,7293,7305,7316,7328,7339,7351,7362,7374,
7385,7396,7408,7419,7431,7442,7454,7465,7477,7488,7499,7511,7522,7534,7545,7557,
7568,7580,7591,7602,7614,7625,7637,7648,7660,7671,7683,7694,7705,7717,7728,7740,
7751,7763,7774,7786,7797,7809,7820,7831,7843,7854,7866,7877,7889,7900,7912,7923,
7934,7946,7957,7969,7980,7992,8003,8015,8026,8037,8049,8060,8072,8083,8095,8106,
8118,8129,8140,8152,8163,8175,8186,8198,8209,8221,8232,8243,8255,8266,8278,8289,
8301,8312,8324,8335,8346,8358,8369,8381,8392,8404,8415,8427,8438,8450,8461,8472,
8484,8495,8507,8518,8530,8541,8553,8564,8575,8587,8598,8610,8621,8633,8644,8656,
8667,9658 /* 8667+991; we have made this change to create the correct frequency for the value 9990 */,10000
);
/* This array contains the exact frequency for each distinct value that specified in
array_col_val; actually it represents endpoint repeat counts */
srec.bkvals :=
DBMS_STATS.numarray (342,297,319,322,318,296,313,300,294,301,316,304,312,316,304,318,350,348,330,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,991,1
);
srec.epc := 254;
DBMS_STATS.prepare_column_values (srec, array_col_val);
DBMS_STATS.set_column_stats (ownname => USER,
tabname => 't1',
colname => 'col',
distcnt => distcnt,
density => density,
nullcnt => nullcnt,
srec => srec,
avgclen => avgclen);
END;
/
explain plan for
select * from t1 where col=9990;
select * from table(dbms_xplan.display);
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 991 | 27748 | 14 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 991 | 27748 | 14 (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("COL"=9990)
We got the correct cardinality!
Now we are going to see an approach for how a hybrid histogram could be created from the top-n structure, without sampling the table.
We need to create 254 buckets (by default) and will include popular values as much as possible. If the column values were uniformly distributed, each bucket would contain num_rows/num_buckets = 10000/254 ≈ 39
rows (alternatively, (max_value − min_value)/num_buckets
). Because the values are not uniformly distributed, we assume an average bucket size of:
bktSize = ((cdn − popFreq − freq) / (mnb − popCnt − 1))
where:
- cdn is the number of rows in the table.
- popFreq is the total frequency of popular values; it is derived from the top-n values as sum(lossytable.freq where freq>1).
- freq is the frequency of min(lossytable.val).
- mnb is the number of requested buckets (the default is 254).
- popCnt is the total count of popular values.
So, in our case, it will be this:
bktSize = ((cdn − popFreq − freq) / (mnb − popCnt − 1)) = (10000 − 6991 − 342)/(254 − 20 − 1) = 11.44
We have the top-n values stored in lossytable
and need to create 254 buckets. If NRB <= popCnt
, all histogram buckets should contain popular values, and we just need to compute the endpoint numbers (bucket numbers) by considering the bucket size; only the last bucket could have a different size.
Otherwise, if NRB > popCnt
, we need to create additional buckets to bring the total up to 254. In our case, (NRB − popCnt) = (254 − 20) = 234
buckets are missing: 233 of them are created in the loop below, and the last one holds the maximum column value. Initially we will insert all top-n values into the tmp_hybrid_hist table, and then we will examine lossytable
to find the appropriate bucket boundaries that allow us to create the additional buckets. To do that, we have to find the gaps between neighboring endpoint values.
We will rearrange the endpoint numbers at the end of the process.
We can write a simple PL/SQL block to perform the above tasks:
/* find top-n elements */
execute find_topn(10000,254);
/* Insert topn values to the HH table
Initially we do not compute endpoint numbers
*/
insert into tmp_hybrid_hist
select 1,val,freq from lossytable order by 2;
/* Find appropriate bucket boundaries and create HH buckets */
DECLARE
s NUMBER := 0;
n INTEGER := 0;
BEGIN
FOR c
IN (SELECT cur_val,
prev_val,
dif,
ROUND (dif / 11) bkcnt
FROM ( SELECT val cur_val,
LAG (val, 1, 0) OVER (ORDER BY val) prev_val,
val - LAG (val, 1, 0) OVER (ORDER BY val) dif,
t.*
FROM lossytable t
ORDER BY val)
WHERE dif > 11 /*It is the bucket size */
)
LOOP
s := c.prev_val;
WHILE (s + 11) < c.cur_val
LOOP
n := n + 1;
s := s + 11;
INSERT INTO tmp_hybrid_hist (ep_num, ep_val, ep_rpcnt)
VALUES (s, s, 1);
IF n = 233 /* it is the required number of buckets -1 */
THEN
RETURN;
END IF;
END LOOP;
END LOOP;
END;
/* Rearrange/recalculate the bucket numbers */
UPDATE tmp_hybrid_hist t
   SET t.ep_num =
          (SELECT SUM (CASE WHEN ep_rpcnt = 1 THEN 8 ELSE ep_rpcnt END)
                  -- each synthetic single-value bucket is weighted as 8 rows
                  -- (see the freq column in the output below)
             FROM tmp_hybrid_hist tt
            WHERE tt.ep_val <= t.ep_val);
/* Add the last bucket to the table */
INSERT INTO tmp_hybrid_hist (ep_num,
ep_val,
ep_rpcnt)
VALUES (10000 /* the numrows in the table */
,
10000 /* the max value of the column */
,
1);
SQL> SELECT ep_num,ep_val,
ep_num - LAG (ep_num, 1, 0) OVER (ORDER BY ep_num) freq,
ep_rpcnt
FROM tmp_hybrid_hist order by 2;
EP_NUM EP_VAL FREQ EP_RPCNT
---------- ---------- ---------- ----------
342 1 342 342
639 2 297 297
958 3 319 319
1280 4 322 322
1596 5 316 316
1892 6 296 296
2205 7 313 313
2504 8 299 299
2797 9 293 293
3098 10 301 301
3414 11 316 316
3718 12 304 304
4030 13 312 312
4345 14 315 315
4649 15 304 304
4967 16 318 318
5316 17 349 349
5664 18 348 348
5994 19 330 330
6002 30 8 1
6010 41 8 1
.....................................
.....................................
7850 2571 8 1
7858 2582 8 1
8849 9990 991 991
10000 10000 1151 1
254 rows selected.
As a result, we have got a hybrid histogram, and we can migrate the data from our temporary table (tmp_hybrid_hist
) to the dictionary using the dbms_stats
package (prepare_column_values
, set_column_stats
), as previously explained.
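A hedged sketch of that migration step follows. Instead of hardcoding the arrays, it loads them straight from tmp_hybrid_hist, following the same prepare_column_values/set_column_stats pattern and the same rpcnts/bkvals mapping as the manual example shown earlier:
DECLARE
   distcnt NUMBER;
   density NUMBER;
   nullcnt NUMBER;
   avgclen NUMBER;
   srec    DBMS_STATS.statrec;
   vals    DBMS_STATS.numarray;   -- endpoint values
   epnums  DBMS_STATS.numarray;   -- cumulative endpoint numbers
   rpcnts  DBMS_STATS.numarray;   -- endpoint repeat counts
BEGIN
   DBMS_STATS.get_column_stats (ownname => USER,
                                tabname => 't1',
                                colname => 'col',
                                distcnt => distcnt,
                                density => density,
                                nullcnt => nullcnt,
                                srec    => srec,
                                avgclen => avgclen);

   SELECT ep_val, ep_num, ep_rpcnt
     BULK COLLECT INTO vals, epnums, rpcnts
     FROM tmp_hybrid_hist
    ORDER BY ep_val;

   srec.epc    := vals.COUNT;   -- number of endpoints (buckets)
   srec.rpcnts := epnums;       -- cumulative frequencies, as in the manual example
   srec.bkvals := rpcnts;       -- endpoint repeat counts

   DBMS_STATS.prepare_column_values (srec, vals);
   DBMS_STATS.set_column_stats (ownname => USER,
                                tabname => 't1',
                                colname => 'col',
                                distcnt => distcnt,
                                density => density,
                                nullcnt => nullcnt,
                                srec    => srec,
                                avgclen => avgclen);
END;
/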
In summary, we see that histogram construction has been improved in Oracle Database 12c. Starting with Oracle Database 12c, a streaming approach is used to create frequency and top-frequency histograms. Streaming algorithms help to estimate the top frequent values within a given memory bound.
During the NDV approximation, the database also maintains a ROWID table for the top-n values. If there are no splits, the database gives us exact frequencies (it computes the top-n values from the NDV structure). If the synopsis reaches its memory bound, Oracle Database uses a lossy counting algorithm to estimate the top-n values; at that moment, it allocates a new data structure to support lossy counting. It is possible that Oracle Database could maintain the required data structure while processing the NDV structure (that is, in parallel), and if the NDV structure turns out to be unsuitable, the database could switch directly to the already-populated structure. It is also possible that, before doing the first split, Oracle Database could migrate the required data (which can be derived from the NDV structure) to the new data structure: if we assume that Oracle Database has already computed some top-n values using the NDV structure, this information could be inserted into the lossytable
table before the first split, and the lossy counting algorithm could be continued from there.
Oracle Database creates frequency and top-frequency histograms very efficiently. The idea of the hybrid histogram is brilliant, but Oracle Database still uses a sampling method to create it. In this case, we can actually lose frequent values that could be very important for the query optimizer. As a result, the optimizer might not estimate cardinality for the most-frequent values correctly, and an inefficient execution plan could be produced.
The current construction mechanism of the hybrid histogram indicates that the problem is not related only to sampling. Oracle Database orders the values fetched by the sampling query, and this ordering restricts the inclusion of the (most) popular values in the histogram. In my opinion, simply replacing ORDER BY VAL with ORDER BY FREQ DESC in the sampling query and rearranging the bucket numbers would improve the quality of a hybrid histogram.
In addition, Oracle Database calculates the density for nonpopular values based on the number of popular values and their total frequency. During the NDV approximation, the top-n values are extracted; therefore, the database could create a hybrid histogram without sampling and include the top-n values as much as possible (not all top-n values might be included; it depends on the requested number of histogram buckets, the number of top-n values, and the NDV).
Finally, I want to note that if the database decides to create a hybrid histogram and the number of requested histogram buckets is greater than the number of popular values, it is not necessary to create NRB buckets. In this case, creating popCnt
(the popular value count) buckets plus one bucket for the maximum value (the last bucket) is enough for cardinality estimation.
As a result, we get a hybrid histogram with popCnt+1
buckets, which represents our data well enough (we can do that manually using dbms_stats
). The query optimizer can then estimate cardinality for popular and nonpopular values without NRB (254) buckets; it will produce the same single-table and join estimates as a histogram with NRB (254) buckets. Only the last bucket will contain num_rows(tab) − POPFREQ (the total frequency of popular values)
rows. I hope in the next version of Oracle Database, we will see some improvements in histogram creation.
Chinar Aliyev graduated from the Baku State University applied mathematics department. He has been working with Oracle Database for more than 10 years (since the Oracle 8i version) and specializes in tuning and troubleshooting Oracle Database performance problems. His areas of interest include SQL optimization, optimizer internals, parallel execution, and database design. He is an Oracle Database 11g Certified Master and an Oracle Database 11g Performance Tuning Certified Expert. He is also a regular presenter at international and local conferences, such as Hotsos and Oracle Database user groups.