Histogram Construction in Oracle Database 12c

by Chinar Aliyev

This article describes the mechanism of histogram construction in Oracle Database 12c and the issues related to the gathering process for a hybrid histogram. It also demonstrates how to avoid those problems by manually setting histogram data, and it explains the possibility of generating a hybrid histogram from a top-n structure without doing sampling from the given table.

Introduction

Object statistics describe the data within a database and allow the Oracle Database query optimizer to make the right decision in order to produce an optimal execution plan. In the real world, due to nonuniform distribution, basic column statistics are not enough to describe the data correctly. There are two types of skew: nonuniform value repetition and nonuniformity in range.

With nonuniform value repetition, column data is considered to be nonuniformly distributed if there is at least one popular value and the database analyzes the columns involved in equality or equijoin predicates.
With nonuniformity in range, the database tries to determine the skew of the columns involved in range predicates. This is determined based on the number of histogram buckets (NHB) and the number of endpoints that fall in the range of (max-min)/NHB.

Oracle Database checks whether a column is distributed nonuniformly before creating a histogram. But to do that, the database should collect some information about the column, including the frequencies of the distinct values. This information is used to create a histogram for the column.

A histogram is designed to store data distribution information in the dictionary and helps the query optimizer to generate an efficient execution plan. In Oracle Database 12c, there are four types of histograms: frequency histogram (FH), top frequency histogram (TFH), hybrid histogram (HH), and the old-style height-balanced histogram (HBH). In this article, we will see how histograms are created in Oracle Database 12c. These test cases were performed with Oracle Database 12c Release 2 (12.2.0.1.0).

Generating a Top Frequency Histogram

In Oracle Database 12c, if histogram creation is requested or the database decides to create a histogram, Oracle Database first generates the top frequent values (TFVs) and then it decides which type of histogram should be created. It is obvious that if the number of distinct values (NDV) is equal to or less than number of requested buckets (NRB) then a frequency histogram will be created. By default, Oracle Database 12c does not create a height-balanced histogram and, therefore, in this article, we are not going to interpret a height-balanced histogram.

Several types (classes) of algorithms have been provided (counter-based, sketch, sampling, and so on) to solve the problem of finding the most frequent values in a given data stream. In this article, we will focus on counter-based algorithms, because according to the trace files, the techniques used in Oracle Database 12c for finding the most frequent values look like they are based on counter-based algorithms.

The Problem Definition

We have a data stream S={a₁,a₂,a₃,a₄...a_N}. Let's define ƒ_i(a) as the frequencies of a_iεS. We need to identify the R(a_iεS,φ) subset of S using a one-pass method such that the ƒ_i(a)>φN condition is true, where φε[0..1]. It is obvious the output result will not contain more than k=1/φ elements.

Counter-Based Algorithms

Note: For more information on the frequency algorithm (F), see the paper at http://erikdemaine.org/papers/NetworkStats_TR2002/paper.pdf.

The frequency algorithm stores K pairs of (item, frequency). This algorithm is a generalization of the majority algorithm that compares each new item against stored items R, and is incremented if the selected item is found within R. If all 1/φ frequencies are allocated to distinct items, all are decremented by 1. After decreasing the counts, all items whose frequency is zero are removed from R. This algorithm will return K number of top frequent elements of the data stream. It is obvious the output result will not contain more than K=1/φ elements. Implementation could be different.

To improve the efficiency of the algorithm, we need to have a special data structure to support the above algorithm, and we should perform adding, removing, and updating elements very effectively. First of all, the R table could be implemented as a hash table with limited size. The number of hash table entries could be equal to the requested number of histogram buckets (by default, it is 254). The hash table entries correspond to the distinct value of the column. The hash value is generated during the scan by applying a uniform hash function and the value is used to look up a position in the hash table. The entry in the hash value will indicate the frequency of the distinct value, because it indicates the number of times the distinct value has been mapped to the entry and used to look up the entry.

Initially, the hash table is empty. During the scan, if the hash value of the element is found in the hash table, the frequency value is incremented (this means the position will be incremented). If the hash table reaches its limited size, infrequent values will be deleted from the hash table. (First, all frequency values need to be decremented and then infrequent values—those that have zero frequency—are deleted.) To support decrementing all frequencies at once in constant time, a more sophisticated data structure could be used. The set of distinct values could be maintained as a hash table but the frequencies could be maintained in a linked list L={l₁,l₂,l₃...l_n} where n is the current maximum of the frequencies. Each l_j has a pointer pointing to a doubly linked list of hash values in R that have a frequency equal to j. To increment the frequency of one of these values, the value just could be removed from the l_j doubly linked list and then inserted into l_j+1. But to decrement all frequencies by one, it is enough to move the head of the L linked list from l₁to l₂ and then delete l₁.

The Lossy Counting Algorithm

Note: For more information on the lossy counting (LC) algorithm, see the paper at http://www.vldb.org/conf/2002/S10P03.pdf.

There is another approximate algorithm for finding the top frequent values. Finding the top frequent values of a data stream is a resource-intensive operation and depends on several factors including properties of the data stream such as the skew factor. Therefore, to find the exact top frequent values and their frequencies is not easy to attain. The lossy counting algorithm allows us to approximate most top frequent values with ε error using a single pass. Let's define following symbols:

N: the number of items in the data stream
ƒ_e: the exact frequency of the item e in the data stream
ƒ: the approximated frequency of the item
s: the supported threshold s ∈ (0,1)
ε: a user-specified error threshold

According to the algorithm, the data stream is divided into buckets (or windows) with the same width w = [ 1/ε]. Each bucket has an ID starting at 1. Let's call the current bucket ID b_current. To support the algorithm, the data structure D consists of a set of elements of the form (e, f, Δ), where Δ (delta) is the maximum possible error in ƒ. Initially the required data structure D is empty. During the scan of the stream, each item is checked to see if it is already in D or not. If the item already is there, its frequency is incremented by one; otherwise, the item is inserted into D with a frequency of 1 as (e, 1, b_current−1). At every bucket boundary, the deletion operation has to be performed. Bucket boundaries are identified with the N mod w = 0 condition. All entries are deleted if the ƒ+ Δ≤b_current condition is true. After scanning all items, the algorithm provides the set of items in the D structure that for all items (s−ε)*N ≤ ƒ ≤ s*N is true. It can be seen that Δ is also equal to the bucket number. Implementation of this algorithm in PL/SQL could be as follows:


CREATE GLOBAL TEMPORARY TABLE lossytable (val NUMBER, freq NUMBER, bk NUMBER);

create or replace procedure find_topn(nm_rws  integer,tpk integer)
is 
    CURSOR c_chk (cval NUMBER)
    IS
        SELECT   *
          FROM   lossytable
         WHERE   val = cval;

    chk     c_chk%ROWTYPE;
    n       INTEGER := 0;
    delta   NUMBER := 0;
    b       NUMBER := 1;
    --nm_rws  integer:= 1000000; -- Number of rows in the table
    --tpk integer:=254; --top frequent values 
    bktsize integer; 
BEGIN
    COMMIT;
    Bktsize := nm_rws/tpk;  
    FOR c IN (SELECT   * FROM t1)
    LOOP
        n := n + 1;
        OPEN c_chk (c.col);
        FETCH c_chk INTO chk;
        IF c_chk%FOUND
        THEN
            UPDATE   lossytable
               SET   freq = freq + 1
             WHERE   val = c.col;
        ELSE
            INSERT INTO lossytable (val, freq, bk)
              VALUES   (c.col, 1, b-1);
        END IF;
        IF n MOD Bktsize = 0
        THEN
            FOR j IN (SELECT   * FROM lossytable)
            LOOP
                IF j.freq +j.bk <= b
                THEN
                    DELETE   lossytable
                     WHERE   val = j.val;
                END IF;
            END LOOP;
            b := b + 1;
        END IF;
        CLOSE c_chk;
    END LOOP;
   /*+ deleting infrequent element as delete lossytable where freq=1 */
END;

The Space Saving Algorithm

Note: For more information on the space saving algorithm, see the paper at http://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf.

This algorithm stores the first K items and their exact frequencies in a data structure. If the next item is found in the data structure, its frequency is incremented by one. But when the selected item is not found in the data structure, the required (mitem, mfreq) pair is identified, where mfreq is the minimum frequency of the data structure and mitem is the corresponding item. The processed item is replaced with the mitem item and its frequency is set to mfreq+1.

As with the previous algorithms, a special data structure is required to implement this algorithm. So a doubly linked list is used. All items with the same frequency are linked together in a linked list. They all point to a parent bucket. Every bucket points to exactly one element among its child list, and buckets are kept in a doubly linked list, sorted by their values. The items can be stored in a hash table to provide constant access cost. The linked list operations (adding or deleting elements from the structure and updating frequencies) are the same as previously explained.

Let's see how Oracle Database computes the top frequent values.


drop table t1;
CREATE TABLE t1 AS
SELECT 
       CASE
         WHEN level <= 9700 THEN TRUNC(DBMS_RANDOM.value(1,20))
         ELSE level
       END AS col 
FROM   dual
CONNECT BY level <= 10000;

alter session set events '10832 trace name context forever, level 1';

execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;

execute dbms_stats.gather_table_stats(null,'t1', method_opt=> 'for columns col size 20');

The trace file shows the following lines:


DBMS_STATS: Approximate NDV Options 
DBMS_STATS: TOPN,NIL,NIL,RWID,U20U
DBMS_STATS: select /*+  full(t)    no_parallel(t) no_parallel_index(t) 
dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 
'false') no_substrb_pad  
*/to_char(count("COL")),substrb(dump(min("COL"),16,0,64),1,240),substrb(dump(
max("COL"),16,0,64),1,240),count(rowidtochar(rowid)) from "SYS"."T1" t  /* 
TOPN,NIL,NIL,RWID,U20U*/

During the execution of this SQL code, Oracle Database computed the number of distinct values, maintained a ROWID table, and computed the top-n (20) frequency values.


qerandvInitAll: allocate synopsis for opc:0
Allocate rowid table for topn
qerandvGatherSqlStats: sel:0 mode:T
qerandvWriteXML sel:0 -> id:0
|  column   |  #split   |    ndv    |    #nonnuls   |  rsize    |
qesandvGetApproximateStats  lvl:0 set:319 
|       0   |       0   |         319   |       10000   |      20297    |
originally requested bucket num (via method_opt): 20
Computing topn from NDV structures...

Oracle Database first allocates a synopsis in order to compute the number of distinct values (NDV) and it also allocates space for the ROWID table in order to describe the top-n values. The mode:T indicates that top-n values that need to be extracted. Then Oracle Database computes the NDV, and during the process we see there were no splits. After the NDV computation, Oracle Database starts to extract the top-n values. I do not exactly know which algorithm is implemented in order to compute the top-n values, but from the trace file, we see that Oracle Database did the calculation based on the NDV structure. How could that be possible? It looks like the NDV structure also helps to track frequency information for the distinct values. As mentioned above, that information could be maintained as a hash table and when processing the rows, after applying a uniform hash function to the column value, if the corresponding hash value is found in the hash table then its frequency is incremented.

To efficiently perform the operations of finding, incrementing, and decrementing (or removing) values from the special data structure, a doubly linked list of groups could be used. Each group would indicate a collection of equal frequencies consisting of two parts: (1) a doubly linked list of frequencies and the difference in value between these frequencies, and (2) the frequencies in the previous group (for the first group, the value itself). In this case, the synopsis did not reach its storage capacity and it was enough to compute the NDV and top-n values.


TopN Histogram Feasible?: YES
 |-->tot_freq: 9701 tot_nonnull_rows: 10000 bucketnum >> actual: 20                
requested: 20 ndv: 319
 tot_freq/((double)tot_nonnull_rows): 0.970100,               (1 − 
1/((double)requested_bucketnum)): 0.950000
 --> requested bucket num: 20
-------------------------------------------------------------------------
 Rank          Rowid                     Frequency
-------------------------------------------------------------------------
    1  AAAXo1AABAAAbHRAAB                   569
    2  AAAXo1AABAAAbHRAAE                   552
    3  AAAXo1AABAAAbHRAAb                   548
    4  AAAXo1AABAAAbHRAAC                   536
    5  AAAXo1AABAAAbHRAAg                   528
    6  AAAXo1AABAAAbHRAAR                   524
    7  AAAXo1AABAAAbHRAAY                   522
    8  AAAXo1AABAAAbHRAAH                   520
    9  AAAXo1AABAAAbHRAAM                   517
   10  AAAXo1AABAAAbHRAAU                   516
   11  AAAXo1AABAAAbHRAAF                   505
   12  AAAXo1AABAAAbHRAAI                   497
   13  AAAXo1AABAAAbHRAAD                   492
   14  AAAXo1AABAAAbHRAAZ                   490
   15  AAAXo1AABAAAbHRAAS                   488
   16  AAAXo1AABAAAbHRAAJ                   478
   17  AAAXo1AABAAAbHRAAA                   477
   18  AAAXo1AABAAAbHRAAl                   475
   19  AAAXo1AABAAAbHRAAP                   466
   20  AAAXo1AABAAAbHgAAP                     1
qerandvGatherSqlStats: sel:1 mode:_
qerandvGatherSqlStats: sel:2 mode:_
qerandvGatherSqlStats: sel:3 mode:R
qerandvGatherSqlStats: xml output -

The database computed the top frequent values and displayed the ROWID table content there. If we execute our function (the lossy counting implementation), we get the exact same result.


SQL> find_topn(10000,20);
SQL> select val,freq from lossytable order by 2 desc;

       VAL       FREQ
---------- ----------
        10        569
        11        552
         1        548
         9        536
         2        528
        13        524
         3        522
        16        520
         4        517
         5        516
         6        505
        17        497
        15        492
         7        490
        18        488
         8        478
        19        477
        12        475
        14        466

In addition, Oracle Database displayed the data stream structure in the trace file. In my opinion, it is doubly linked lists that enable the efficient computation of the frequencies.


Dumping stream from 0 to 1000
----------------------------------
 
0000:  60 115 101 108 101  99 116  95 108 105 115 116  95 105 116 101 109  62  60 112    
<select_list_item><p
0020: 111 115  62  48  60  47 112 111 115  62  60 118  97 108 117 101  62  51  49  57    
os>0</pos><value>319
0040:  60  47 118  97 108 117 101  62  60 114 111 119  99 110 116  62  49  48  48  48    
</value><rowcnt>1000
0060:  48  60  47 114 111 119  99 110 116  62  60 113  99 109 115 103  62  48  60  47    
0</rowcnt><qcmsg>0</
0080: 113  99 109 115 103  62  60 115 112 108 105 116  62  48  60  47 115 112 108 105    
qcmsg><split></spli
0100: 116  62  60 110 100 118  62  51  49  57  60  47 110 100 118  62  60 110 111 110    
t><ndv>319</ndv><non
0120: 110 117 108 108 115  62  49  48  48  48  48  60  47 110 111 110 110 117 108 108    
nulls>10000</nonnull
0140: 115  62  60 114 115 105 122 101  62  50  48  50  57  55  60  47 114 115 105 122    
s><rsize>20297</rsiz
0160: 101  62  60 116 111 112 110  99 110 116  62  50  48  60  47 116 111 112 110  99    
e><topncnt>20</topncnt>

In addition, by default, Oracle Database calculates user_tab_histograms.density as 1/(2*num_rows) for frequency and top frequency histograms. The last value of the top frequency histogram is the maximum column value. The finding of top-n values was implemented in the dbms_sqltune_internal.i_process_sql_callout procedure.


  PROCEDURE I_PROCESS_SQL_CALLOUT(
    STMT         IN OUT SQLSET_ROW, 
    ACTION       IN     BINARY_INTEGER, 
    TIME_LIMIT   IN     POSITIVE,
    CTRL_OPTIONS IN     XMLTYPE,
    EXTRA_RESULT OUT    CLOB,
    ERR_CODE     OUT    BINARY_INTEGER,
    ERR_MESG     OUT    VARCHAR2)

The value for the parameter ACTION is passed as 'GATHER_SQL_STATS' and EXTRA_RESULT is returned in the XMLType format, which contains approximate NDV and top-n information.

Generating a Hybrid Histogram

A hybrid histogram is created based on features of the frequency histogram and the height-balanced histogram. A hybrid histogram gives us more-accurate information for the popular values. In spite of the height-balanced histogram, popular values do not span multiple buckets. Each bucket in the hybrid histogram contains two types of frequency information. First is the cumulative frequency number, which indicates the total number of rows in the bucket (the bucket number), and the second is about distinct frequency information for endpoint boundaries (the endpoint repeat count).


drop table t1;
CREATE TABLE t1 AS
SELECT level AS id,
       CASE
         WHEN level ><= 983000 THEN TRUNC(DBMS_RANDOM.value(1,254))
         ELSE level
       END AS col,
       lpad('desc',level) AS decr
FROM   dual
CONNECT BY level <= 1000000;

SQL> set pages 9999
SQL> set pagesize 9999
SQL> exec find_topn(nm_rws=>1000000,tpk=>254);

PL/SQL procedure successfully completed.

SQL> select val,freq from lossytable order by 2 desc;

       VAL       FREQ
---------- ----------
       145       4113
        37       4041
        49       4035
       179       4024
       187       4020
       151       4014
       120       4010
       115       4009
       157       4008
       135       4004
       198       3999
................................

And the following statement gives us the total frequency of the TFVs:


SQL> select sum(freq) from lossytable ;

 SUM(FREQ)
----------
    983002

Select * from t1 where col=100;

alter session set events '10832 trace name context forever, level 1';

execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;

execute dbms_stats.gather_table_stats(null,'t1');

From the trace file, we can see that Oracle Database switched to the lossy counting algorithm.


compute approximation on (4:NTNR)
0->sel(0) type(ndv/acl/nonnuls)
1->sel(3) type(topn/ndv/acl/nonnuls)
2->sel(6) type(ndv/acl/nonnuls)
3->sel(9) type(rowid)
qerandvInitAll: allocate synopsis for opc:0
qerandvInitAll: allocate synopsis for opc:1
Allocate rowid table for topn
qerandvInitAll: allocate synopsis for opc:2

Switching to Lossy Counting, freq_threshold: 1
allocate top n structures...
qerandvGatherSqlStats: sel:0 mode:N
qerandvWriteXML sel:0 -> id:0
|  column   |  #split   |    ndv    |    #nonnuls   |  rsize    |
qesandvGetApproximateStats  lvl:6 set:15656 
|       0   |       6   |     1000000   |     1000000   |    3979802    |
qerandvGatherSqlStats: sel:1 mode:_
qerandvGatherSqlStats: sel:2 mode:_
qerandvGatherSqlStats: sel:3 mode:T
qerandvWriteXML sel:3 -> id:1
|  column   |  #split   |    ndv    |    #nonnuls   |  rsize    |
qesandvGetApproximateStats  lvl:1 set:8658 
|       1   |       1   |       17316   |     1000000   |    2624169    |
originally requested bucket num (via method_opt): 254
Computing approximate topn from native topn structures...
TopN Histogram Feasible?: NO
 |-->tot_freq: 983002 tot_nonnull_rows: 1000000 bucketnum >> actual: 254                
requested: 254 ndv: 17316
 tot_freq/((double)tot_nonnull_rows): 0.983002,               (1 − 
1/((double)requested_bucketnum)): 0.996063

Oracle Database switched to lossy counting because the NDV structure was not enough to compute the top-n values. The reason is the number of splits that were performed to estimate the NDV for the column. And the natural NDV structure cannot provide accuracy for computing top-n values. That is why Oracle Database used lossy counting there. As can clearly be seen, Oracle Database calculated the top frequent values as 983002 and also, that's the same result of our find_topn procedure. Oracle Database also displayed the stream data in the trace file:


Dumping stream from 0 to 2000
----------------------------------
 
0000:  60 115 101 108 101  99 116  95 108 105 115 116  95 105 116 101 109  62  
60 112    <select_list_item><p
0020: 111 115  62  48  60  47 112 111 115  62  60 118  97 108 117 101  62  49  
48  48    os>0</pos><value>100
0040:  48  48  48  48  60  47 118  97 108 117 101  62  60 114 111 119  99 110 
116  62    0000</value><rowcnt>
0060:  49  48  48  48  48  48  48  60  47 114 111 119  99 110 116  62  60 113  
99 109    1000000</rowcnt><qcm
0080: 115 103  62  48  60  47 113  99 109 115 103  62  60 115 112 108 105 116  
62  54    sg>0</qcmsg><split>6
0100:  60  47 115 112 108 105 116  62  60 110 100 118  62  49  48  48  48  48  
48  48    </split><ndv>1000000
0120:  60  47 110 100 118  62  60 110 111 110 110 117 108 108 115  62  49  48  
48  48    </ndv><nonnulls>1000
0140:  48  48  48  60  47 110 111 110 110 117 108 108 115  62  60 114 115 105 
122 101    000</nonnulls><rsize
0160:  62  51  57  55  57  56  48  50  60  47 114 115 105 122 101  62  60  47 
115 101    >3979802</rsize></se
0180: 108 101  99 116  95 108 105 115 116  95 105 116 101 109  62  60 115 101 
108 101    lect_list_item><sele
0200:  99 116  95 108 105 115 116  95 105 116 101 109  62  60 112 111 115  62  
49  60    ct_list_item><pos>1<
0220:  47 112 111 115  62  60 118  97 108 117 101  62  84 121 112  61  50  32  
76 101    /pos><value>Typ=2 Le
...

According the criteria for creating a hybrid histogram, the top-n values occupied less than p percent of the total number of rows (p =(1−(1/n))*100) and that is why Oracle Database decided to create a hybrid histogram. In our case, we have 1000000*(1−1/254)=996062.992>983002.


DBMS_STATS: start processing top n values for column "COL"
DBMS_STATS:   >> frequency histograms is not feasible 
                       (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:                                                          Y    ID
DBMS_STATS:                       Y    Y    Y                        Y    COL
DBMS_STATS:                                                          Y    DECR
DBMS_STATS: Building Histogram for COL
DBMS_STATS:  bktnum=254, nnv=1000000, snnv=5500, sndv=17316, est_ndv=17316, mnb=254
DBMS_STATS:  Trying hybrid histogram

To create a hybrid histogram, the database executed the following SQL statement in order to get the histogram data. So, Oracle Database used a sampling method to generate the data for the hybrid histogram. First, the sample size was determined via an adaptive method; in our case, the sample size is equal to 0.55.


select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) 
over()) popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) 
over(),16,0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) 
minval  from (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) 
ndv,  (case when freq > ((sum(freq) over())/254)  then 1  else 0 end) pop 
from  (select /*+  no_parallel(t) no_parallel_index(t) dbms_stats 
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring 
xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') 
no_substrb_pad  */ "COL"  val, count("COL") freq  from "SYS"."T1" sample (   
.5500000000)  t  where "COL" is not null  group by "COL")) order by val

Some questions arise: how does the hybrid histogram describe our data, and what can we say about the quality of the hybrid histogram?


DBMS_STATS:  > cdn 5428, popFreq 2659, popCnt 106, bktSize 
18.65986394557823129251700680272108843537, bktSzFrc 
.65986394557823129251700680272108843537
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 254, mnb 254, ssize 
5428, min_ssize 2500, appr_ndv  TRUE, ndv 17316, selNdv 19, selFreq 520, pct 
.55, avg_bktsize 21, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram 
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE

Oracle Database calculated the following:


bktSize= ((cdn - popFreq - freq) / (mnb - popCnt - 1))= (5428-2659-25)/(254-
1-106)=18.65986394557823129251700680272108843537;

The bucket size could be different according to the buckets, but Oracle Database calculated the average bucket size as this:


avg_bktsize := Cdn/mnb=5428/254=21.39

We requested to create 254 buckets, but the sampling query can return more than 300 rows, and we can see that from the tkprof output.

Here is Query 1:


SQL ID: fuj9hfc4b30qf Plan Hash: 2782092977

select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) over())
   popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) over(),16,
  0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) minval  
from
 (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case 
  when freq > ((sum(freq) over())/254)  then 1  else 0 end) pop from  (select 
  /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact 
  use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl 
  opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val,
   count("COL") freq  from "SYS"."T1" sample (.5500000000)  t  where "COL" 
  is not null  group by "COL")) order by val


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch      361      3.61       3.93     997444     521514          9         360
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      363      3.62       3.94     997444     521514          9         360

How did Oracle Database process 360 rows and create 254 buckets for the hybrid histogram? What are the criteria for creating a hybrid histogram?

Oracle Database opens a cursor for Query 1 and processes each row that is returned by the cursor. Oracle Database performs the following operations and calculates the following variables.

ROWCNT: Initially it is zero and when rows are processed, its counter is incremented and it finally represents the total number of rows that are returned by sampling (in our case, it is 360).
BKSIZE: For each iteration, Oracle Database calculates BKSIZE for each bucket. It is calculated as follows: if popcnt is greater than or equal to the number of requested buckets (NRB) minus 1, then (CDN−FREQ/(MNB−1). Otherwise, ((CDN−POPFREQ−FREQ)/(MNB−POPCNT−1)).
CUMFREQ: Oracle Database cumulatively calculates frequencies as CUMFREQ=CUMFREQ+freq. The value is used to define bucket numbers (endpoint numbers).
Bkcnt: This is the bucket count. Oracle Database increments the count when it decides to create a bucket, and this is used to track how many buckets are created and how many are required (taking into account any that are currently not being used).
BKTROWS: This indicates the number of rows within each bucket, and for each iteration it is cumulatively incremented by freq. It is compared to BKSIZE to determine whether a bucket should be created. If a bucket should be created, then BKTROWS is set to 0 for the current iteration.
If BKTROWS is greater than BKSIZE or the nonprocessed NDV count is less than or equal to the number of noncreated buckets, then the buckets should be created (or if the number of unprocessed rows is equal to or less than the number of uncreated buckets), because all rows (the NDV actually is equal to the final ROWCNT) should be placed among the requested number of histogram buckets (NB).
The process needs to be stopped when Bkcnt=HB−1 and it is not the last record.

We can summarize all of these operations using the following script:


CREATE GLOBAL TEMPORARY TABLE tmp_hybrid_hist
(
    ep_num     NUMBER, -- endpoint number
    ep_val     NUMBER, -- endpoint value
    ep_rpcnt   NUMBER -- endpoint repeat count
);

/* Note that we can write and execute all of the statements below dynamically, but I 
use the simple SQL statements explicitly to demonstrate the mechanism*/

  CREATE OR REPLACE PROCEDURE CREATE_HH (tnm_rws integer, maxcval number,hbn integer,smplsize number) 
   IS
    val        VARCHAR2 (240);
    freq       NUMBER;
    cumfreq    NUMBER;
    bktrows    NUMBER;
    bktsize    NUMBER;
    rowcnt     NUMBER;
    cdn        NUMBER;
    ndv        NUMBER;
    popcnt     NUMBER;
    popfreq    NUMBER;
    bktcnt     NUMBER;
    hb         INTEGER := 254; -- we are going to create 254 buckets
    
    CURSOR cur_hh
    IS
          SELECT   val ep,
                   freq,
                   cdn,
                   ndv,
                   (SUM (pop) OVER ()) popcnt,
                   (SUM (pop * freq) OVER ()) popfreq 
            FROM   (SELECT   val,
                             freq,
                             (SUM (freq) OVER ()) cdn,
                             (COUNT ( * ) OVER ()) ndv,
                             (CASE
                                  WHEN freq > ( (SUM (freq) OVER ()) / 254)
                                  THEN
                                      1
                                  ELSE
                                      0
                              END)
                                 pop
                      FROM   (  SELECT   
                                         "COL"
                                             val,
                                         COUNT ("COL") freq
                                  FROM   "SYS"."T1" sample (.5500000000) t
                                 WHERE   "COL" IS NOT NULL
                              GROUP BY   "COL"))
        ORDER BY val;
BEGIN
    COMMIT;

    OPEN cur_hh;

    bktrows := 0;
    bktcnt := 0;
    cumfreq := 0;
    rowcnt := 0;
    cdn := 0;
    ndv := 0;

    LOOP
        FETCH cur_hh
        INTO   val,freq,cdn,ndv,popcnt,popfreq;

        EXIT WHEN cur_hh%NOTFOUND;

        rowcnt := rowcnt + 1;

        IF (rowcnt = 1)
        THEN
            IF (hb - 1 <= popcnt)
            THEN
                bktsize := (cdn - freq) / (hb - 1);
            ELSE
                bktsize := ( (cdn - popfreq - freq) / (hb - popcnt - 1));
            END IF;
        END IF;

        bktrows := bktrows + freq;
        cumfreq := cumfreq + freq;

        IF (   bktrows >= bktsize
            OR (ndv - rowcnt) <= (hb - bktcnt)
            OR rowcnt = 1
            OR rowcnt = ndv)
        THEN

            IF (bktcnt = hb  AND rowcnt <> ndv)
            THEN  
                       INSERT INTO tmp_hybrid_hist (ep_num, ep_val, ep_rpcnt)
              VALUES   (tnm_rws /* numrows in the table*/, maxcval /* max col value*/, 1);

                RETURN;
            END IF;

            bktrows := 0;
            bktcnt := bktcnt + 1;

            INSERT INTO tmp_hybrid_hist (ep_num, ep_val, ep_rpcnt)
              VALUES   (cumfreq, val, freq);
        END IF;
    END LOOP;

    CLOSE cur_hh;


END;
/
SELECT   * FROM tmp_hybrid_hist;

According to this mechanism, hybrid histograms might have two major issues. First, the sample size might not be enough to capture more-popular values (probably, in most cases) and in this case, the database might not gather the most-popular values and the values will not be included in the histogram. Also, Query 1 returns a popular frequency around 3,000 (we can see that clearly from the dbms_stats trace file) but the top frequent value is 983002 as we have calculated it using our find_topn and we have seen in trace file. In addition, condition numbers 6 and 7 shown above directly restrict picking up all top frequent values . This indicates that even though Query 1 does include the most-frequent values (even those included in the sample), the current implementation for constructing a hybrid histogram might not have picked up the most frequent values. And the reason is that Query 1 contains the ORDER BY val clause and condition numbers 6 and 7 actually restrict picking up the values. The ORDER BY val clause restricts reaching the most frequent values. Before reaching the most frequent values, bktcnt can reach its limit (HB) and the process can be stopped. Let us demonstrate that with following example.


drop table t1;
CREATE TABLE t1 AS
SELECT level AS id,
       CASE
         WHEN level <= 6000 THEN TRUNC(DBMS_RANDOM.value(1,20))
         when level>=9000 and level<=9990 then 9990
         ELSE level
       END AS col,
       'Description for ' || level AS decrp
FROM   dual
CONNECT BY level <= 10000;

select * from t1  where col=9990;

alter session set sql_trace=true;

alter session set events '10832 trace name context forever, level 1';

execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;

execute dbms_stats.gather_table_stats(null,'t1');

SQL> set pagesize 9999
SQL> select col,count(*) from t1
group by col
order by 2 desc;
       COL   COUNT(*)
---------- ----------
      9990        991
        17        350
        18        348
         1        342
        19        330
         4        322
         3        319
         5        318
        16        318
        11        316
        14        316
         7        313
        13        312
        12        304
        15        304
        10        301
         8        300
         2        297
         6        296
         9        294
      6001          1
      6013          1
      6015          1
      6023          1
      6028          1
      6034          1
      ...............
      ...............
      ...............

      8916          1
      8925          1
      8942          1
      8947          1
      8973          1
      8984          1
      8989          1
      9992          1
      9993          1
      9995          1

3029 rows selected.

So this means we have 3029 distinct values and the most-frequent value is 9990 (and its frequency is 991). Let's see the histogram data.


SQL> SELECT   num_distinct, num_buckets, histogram
  FROM   user_tab_col_statistics
 WHERE   table_name = 'T1' AND column_name = 'COL';  2    3

NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ----------- ---------------
        3029         254 HYBRID

SQL> SELECT   endpoint_number epn,
                endpoint_number
           - LAG (endpoint_number, 1, 0) OVER (ORDER BY endpoint_number)
               freq,
           endpoint_value epv,
           endpoint_repeat_count eprc
    FROM   user_tab_histograms
   WHERE   table_name = 'T1' AND column_name = 'COL'
ORDER BY   endpoint_value;  

       EPN       FREQ        EPV       EPRC
---------- ---------- ---------- ----------
       342        342          1        342
       639        297          2        297
       958        319          3        319
      1280        322          4        322
      1598        318          5        318
      1894        296          6        296
      2207        313          7        313
      2507        300          8        300
      2801        294          9        294
      3102        301         10        301
      3418        316         11        316
      3722        304         12        304
      4034        312         13        312
      4350        316         14        316
      4654        304         15        304
      4972        318         16        318
      5322        350         17        350
      5670        348         18        348
      6000        330         19        330
      6011         11       6011          1
      6023         12       6023          1
      6034         11       6034          1
      6046         12       6046          1
      6057         11       6057          1
      6069         12       6069          1
      6080         11       6080          1
      .....................................
      .....................................
      8598         11       8598          1
      8610         12       8610          1
      8621         11       8621          1
      8633         12       8633          1
      8644         11       8644          1
      8656         12       8656          1
      8667         11       8667          1
      8678         11       8678          1
     10000       1322      10000          1

254 rows selected.

As you can see, the most-frequent value is 9990 and it has not been included in the histogram. However, the second-most-frequent value has been included (val=17 and freq=350). In addition, Oracle Database calculated the bucket size as 11(12) but the last bucket includes 1322 rows! What is the reason? Do you think it is related to the sample size? There is the possibility that in the case of a small sample size, the database might not read the most-frequent value, but there is another problem also. The table size is very small and, therefore, the database does not use sampling (in other words, it used 100% sampling) to construct the histogram; it actually used FULL TABLE SCAN for that. We can see that from the SQL TRACE/Tkprof result, as follows:


select val,substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, (sum(pop) over())
   popcnt,  (sum(pop*freq) over()) popfreq,  substrb(dump(max(val) over(),16,
  0,64),1,240) maxval,  substrb(dump(min(val) over(),16,0,64),1,240) minval  
from
 (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case 
  when freq > ((sum(freq) over())/254)  then 1  else 0 end) pop from  (select 
  /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact 
  use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl 
  opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val,
   count("COL") freq  from "SYS"."T1" t  where "COL" is not null  group by 
  "COL")) order by val

Here is Query 2:


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch     3030      0.05       0.05          0         56          3        3029
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     3032      0.05       0.05          0         56          3        3029

Query 2 actually contains the most-frequent value: 9990; if we execute Query 2 we can see it clearly:


SQL> clear screen
SQL> select val,  freq, cdn, ndv, (sum(pop) over())
  2     popcnt,  (sum(pop*freq) over()) popfreq
from
 (select val, freq,  (sum(freq) over()) cdn, (count(*) over()) ndv,  (case
  when freq > ((sum(freq) over())/254)  then 1  else 0 end) pop from  (select
  /*+  no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl
  opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  */ "COL"  val,
   count("COL") freq  from "SYS"."T1" t  where "COL" is not null  group by
  "COL")) order by val; 

       VAL       FREQ        CDN        NDV     POPCNT    POPFREQ
---------- ---------- ---------- ---------- ---------- ----------
         1        342      10000       3029         20       6991
         2        297      10000       3029         20       6991
         3        319      10000       3029         20       6991
         4        322      10000       3029         20       6991
         5        318      10000       3029         20       6991
         6        296      10000       3029         20       6991
         7        313      10000       3029         20       6991
         8        300      10000       3029         20       6991
         9        294      10000       3029         20       6991
        10        301      10000       3029         20       6991
        11        316      10000       3029         20       6991
        12        304      10000       3029         20       6991
        13        312      10000       3029         20       6991
        14        316      10000       3029         20       6991
        15        304      10000       3029         20       6991
        16        318      10000       3029         20       6991
        17        350      10000       3029         20       6991
        18        348      10000       3029         20       6991
        19        330      10000       3029         20       6991
      6001          1      10000       3029         20       6991
      6002          1      10000       3029         20       6991
      6003          1      10000       3029         20       6991
      6004          1      10000       3029         20       6991
      6005          1      10000       3029         20       6991
      6006          1      10000       3029         20       6991
      6007          1      10000       3029         20       6991
      6008          1      10000       3029         20       6991
      ...........................................................
      ...........................................................
      8992          1      10000       3029         20       6991
      8993          1      10000       3029         20       6991
      8994          1      10000       3029         20       6991
      8995          1      10000       3029         20       6991
      8996          1      10000       3029         20       6991
      8997          1      10000       3029         20       6991
      8998          1      10000       3029         20       6991
      8999          1      10000       3029         20       6991
      9990        991      10000       3029         20       6991
      9991          1      10000       3029         20       6991
      9992          1      10000       3029         20       6991
      9993          1      10000       3029         20       6991
      9994          1      10000       3029         20       6991
      9995          1      10000       3029         20       6991
      9996          1      10000       3029         20       6991
      9997          1      10000       3029         20       6991
      9998          1      10000       3029         20       6991
      9999          1      10000       3029         20       6991
     10000          1      10000       3029         20       6991

3029 rows selected.

It clearly can be seen that Oracle Database did not reach the value 9990; it had already created 253 buckets when it met the cursor value 8678, so at that time condition numbers 6 and 7 evaluated as true and the database stopped the process. In my opinion, the order by val clause in Query 2 should be replaced with the order by freq desc clause, and in this case, the most-frequent values would be included in the histogram, and the database also could adjust the number of buckets (254), so we would have gotten the desired result.

But some questions could arise: will replacing ORDER BY VAL with ORDER BY FREQ DESC change bucketsize definitely, and does it influence the process of cardinality estimation? The answers are no, because for popular values, we actually have endpoint_repeat_count, but for nonpopular values Oracle Database uses density (NewDensity) that is calculated at parse time, as below:


      NewDensity = (num_rows − popfreq)/((NDV − popCnt)* num_rows)

As you see, that density depends on top frequent values: total frequency values (popfreq) and their count (popCnt). This means that if we lost one or more frequent values, it does not just influence the estimation of the popular values but also it influences the estimation for nonpopular values. Before doing the required changes (replacing ORDER BY VAL with ORDER BY FREQ DESC), let`s extract the data from the created histogram, as follows:


SELECT   ep_val,
          pop,
           freq,
           crdn,
           COUNT ( * ) OVER () ndv,
           SUM (pop) OVER () popcnt,
           SUM (pop * freq) OVER () popfreq
    FROM   (SELECT   ep_val,
                     ROUND (freq) freq,
                     SUM (freq) OVER () crdn,
                     (CASE
                          WHEN endpoint_repeat_count >1 THEN 1 else 0
                      END)
                       pop
              FROM   (  SELECT   ep_val,endpoint_repeat_count, ROUND (SUM (freq)) freq
                          FROM   (SELECT   
endpoint_repeat_count,endpoint_value ep_val,
                                           endpoint_number
                                           - LAG (
                                                 endpoint_number,
                                                 1,
                                                 0)
                                                 OVER (
                                                     ORDER BY 
endpoint_number)
                                               freq
                                    FROM  dba_tab_histograms
                                   WHERE   table_name = 'T1'
                                           AND column_name = 'COL'
)
                      GROUP BY   ep_val,endpoint_repeat_count
                      ORDER BY   2 DESC))
ORDER BY   ep_val;
    EP_VAL        POP       FREQ       CRDN        NDV     POPCNT    POPFREQ
---------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1        342      10000        254         19       6000
         2          1        297      10000        254         19       6000
         3          1        319      10000        254         19       6000
         4          1        322      10000        254         19       6000
         5          1        318      10000        254         19       6000
         6          1        296      10000        254         19       6000
         7          1        313      10000        254         19       6000
         8          1        300      10000        254         19       6000
         9          1        294      10000        254         19       6000
        10          1        301      10000        254         19       6000
        11          1        316      10000        254         19       6000
        12          1        304      10000        254         19       6000
        13          1        312      10000        254         19       6000
        14          1        316      10000        254         19       6000
        15          1        304      10000        254         19       6000
        16          1        318      10000        254         19       6000
        17          1        350      10000        254         19       6000
        18          1        348      10000        254         19       6000
        19          1        330      10000        254         19       6000
      6011          0         11      10000        254         19       6000
      6023          0         12      10000        254         19       6000
      6034          0         11      10000        254         19       6000
      6046          0         12      10000        254         19       6000
      6057          0         11      10000        254         19       6000
      ......................................................................
      ......................................................................
      8598          0         11      10000        254         19       6000
      8610          0         12      10000        254         19       6000
      8621          0         11      10000        254         19       6000
      8633          0         12      10000        254         19       6000
      8644          0         11      10000        254         19       6000
      8656          0         12      10000        254         19       6000
      8667          0         11      10000        254         19       6000
      8678          0         11      10000        254         19       6000
     10000          0       1322      10000        254         19       6000

254 rows selected.

As you see, in the current case, we have 19 popular (most-frequent) values and their total frequency is 6000. In that case, we have lost the value 9990 and its frequency, 991. We also can see this information from the optimizer trace file for the query:


select * from t1 where col=9990 

kkecdn: Single Table Predicate:"T1"."COL"=9990
  Column (#2): 
    NewDensity:0.000133, OldDensity:0.000179 BktCnt:10000.000000, PopBktCnt:6000.000000, PopValCnt:19, NDV:3029
  Column (#2): COL(NUMBER)
    AvgLen: 4 NDV: 3029 Nulls: 0 Density: 0.000133 Min: 1.000000 Max: 10000.000000
    Histogram: Hybrid  #Bkts: 254  UncompBkts: 10000  EndPtVals: 254  ActualVal: yes
  Using density: 1.3289e-04 of col #2 as selectivity of pred having unreasonably low value

But if we change ORDER BY VAL to ORDER BY FREQ DESC in our procedure (in the query of the CUR_HH cursor) and execute the CREATE_HH procedure again, we will get the following result:


SELECT   ep_val,
           pop,
           freq,
           crdn,
           COUNT ( * ) OVER () ndv,
           SUM (pop) OVER () popcnt,
           SUM (pop * freq) OVER () popfreq
    FROM   (SELECT   ep_val,
                     ROUND (freq) freq,
                     SUM (freq) OVER () crdn,
                     (CASE
                          WHEN ep_rpcnt > 1 THEN 1 else 0
                      END)
                         pop
              FROM   (  SELECT   ep_rpcnt,ep_val, ROUND (SUM (freq)) freq
                          FROM   (SELECT  ep_rpcnt, ep_val ep_val,
                                           ep_num
                                           - LAG (
                                                 ep_num,
                                                 1,
                                                 0)
                                                 OVER (
                                                     ORDER BY ep_num)
                                               freq
                                    FROM   tmp_hybrid_hist)
                      GROUP BY   ep_val,ep_rpcnt
                      ORDER BY   3 DESC))
ORDER BY   ep_val;
    EP_VAL        POP       FREQ       CRDN        NDV     POPCNT    POPFREQ
---------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1        342      10000        253         20       6991
         2          1        297      10000        253         20       6991
         3          1        319      10000        253         20       6991
         4          1        322      10000        253         20       6991
         5          1        318      10000        253         20       6991
         6          1        296      10000        253         20       6991
         7          1        313      10000        253         20       6991
         8          1        300      10000        253         20       6991
         9          1        294      10000        253         20       6991
        10          1        301      10000        253         20       6991
        11          1        316      10000        253         20       6991
        12          1        304      10000        253         20       6991
        13          1        312      10000        253         20       6991
        14          1        316      10000        253         20       6991
        15          1        304      10000        253         20       6991
        16          1        318      10000        253         20       6991
        17          1        350      10000        253         20       6991
        18          1        348      10000        253         20       6991
        19          1        330      10000        253         20       6991
      6032          0          9      10000        253         20       6991
      6048          0          9      10000        253         20       6991
      6052          0          9      10000        253         20       6991
      6067          0          9      10000        253         20       6991
      ......................................................................
      ......................................................................
      8920          0          9      10000        254         20       6991
      8956          0          9      10000        254         20       6991
      8961          0          9      10000        254         20       6991
      8965          0          9      10000        254         20       6991
      8966          0          9      10000        254         20       6991
      8996          0          9      10000        254         20       6991
      9990          1        991      10000        254         20       6991
     10000          0        912      10000        254         20       6991

254 rows selected.

Note: After executing the CREATE_HH procedure with the CUR_HH cursor that includes the ORDER BY freq DESC clause, we need to rearrange the bucket number for the histogram and we can do that just by executing the following statement:


UPDATE   tmp_hybrid_hist t
   SET   t.ep_num =
             (SELECT   SUM (ep_rpcnt)
                FROM   tmp_hybrid_hist tt
               WHERE   tt.ep_val <= t.ep_val)

As you see, the bucket size was decreased but the number of frequent values has been increased. We have included all frequent values. A hybrid histogram is designed to provide efficient information for popular values and also to improve selectivity for nonpopular values. In other words, a hybrid histogram should include all frequent values as much as possible. And this supports the idea that a hybrid histogram also could be created during FULL TABLE SCAN after extracting the top-n values without sampling. Because, in this case we have the top-n values and the ROWID table, the NDV, the min value, and the max value. If we see the current estimation mechanism of the Oracle Database optimizer, we easily can identify that the estimation for nonpopular values also depends on popular values; that is why creating a hybrid histogram from a top-n structure is possible and achievable.

We can solve this problem (to include the MFVs in the hybrid histogram) using the DBMS_STATS.set_column_values procedure to set the hybrid histogram data. Before doing that, let me show the explain plan:


explain plan for
select * from t1 where col=9990;   
select * from table(dbms_xplan.display);
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    28 |    14   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     1 |    28 |    14   (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):                       
---------------------------------------------------                       
   1 - filter("COL"=9990)

And now we are going to change the histogram data using the dbms_stats package (prepare_column_values, set_column_stats):


DECLARE
    distcnt   NUMBER; 
    density   NUMBER; 
    nullcnt   NUMBER; 
    avgclen   NUMBER; 
    srec        DBMS_STATS.statrec;
    array_col_val     DBMS_STATS.numarray;
BEGIN
    /* To generate the information below (elements of the arrays), we can use  
DBMS_STATS.get_column_stats or get the information directly from the t1 table 
   */

    DBMS_STATS.get_column_stats (ownname   => USER,
                                 tabname   => 't1',
                                 colname   => 'col',
                                 distcnt   => distcnt,
                                 density   => density,
                                 nullcnt   => nullcnt,
                                 srec      => srec,
                                 avgclen   => avgclen);
                                 
    /* This array contains distinct column values that we want to store in the Hybrid Histogram */
    
    array_col_val :=
        DBMS_STATS.numarray (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,6032,
                             6048,6052,6067,6076,6077,6082,6108,6134,6137,6151,6155,
                             6161,6190,6196,6202,6206,6242,6258,6260,6264,6265,6294,6300,
                             6315,6327,6330,6344,6358,6370,6382,6396,6408,6416,6431,6432,6449,
                             6461,6494,6498,6499,6506,6513,6533,6557,6570,6571,6579,6585,6601,6611,
                             6618,6626,6660,6661,6664,6671,6681,6699,6736,6740,6749,6751,6760,6772,
                             6795,6815,6824,6834,6860,6862,6878,6883,6885,6898,6928,6930,6944,6955,6979,
                             6995,6997,7004,7023,7045,7048,7049,7069,7084,7087,7097,7109,7129,7133,7148,7153,
                             7161,7190,7207,7219,7226,7230,7243,7277,7281,7283,7303,7313,7323,7328,7343,7347,
                             7351,7392,7414,7415,7421,7424,7425,7458,7466,7477,7487,7491,7507,7535,7545,7546,
                             7549,7567,7603,7605,7609,7620,7622,7640,7660,7665,7667,7676,7706,7723,7736,7749,
                             7752,7754,7765,7768,7795,7809,7814,7818,7824,7850,7874,7881,7891,7896,7917,7922,
                             7930,7958,7959,7984,7999,8009,8019,8042,8052,8059,8064,8106,8113,8120,8125,8155,
                             8162,8180,8213,8220,8230,8243,8251,8269,8277,8296,8334,8340,8348,8355,8385,8392,
                             8409,8413,8429,8461,8470,8488,8492,8494,8527,8536,8538,8542,8574,8600,8603,8615,
                             8625,8626,8678,8691,8693,8706,8709,8744,8761,8764,8778,8791,8806,8827,8843,8874,
                             8879,8880,8894,8920,8956,8961,8965,8966,8996,9990,10000);
                             
    /* This array contains the total frequency of values that are less than or equal 
to each distinct value  */
    
    srec.rpcnts :=
        DBMS_STATS.numarray (342,639,958,1280,1598,1894,2207,2507,2801,3102,3418,3722,4034,4350,
4654,4972,5322,5670,6000,6011,6023,6034,6046,6057,6069,6080,6092,6103,
6114,6126,6137,6149,6160,6172,6183,6195,6206,6217,6229,6240,6252,6263,6275,
6286,6298,6309,6320,6332,6343,6355,6366,6378,6389,6401,6412,6423,6435,6446,6458,
6469,6481,6492,6504,6515,6527,6538,6549,6561,6572,6584,6595,6607,6618,6630,6641,
6652,6664,6675,6687,6698,6710,6721,6733,6744,6755,6767,6778,6790,6801,6813,6824,
6836,6847,6858,6870,6881,6893,6904,6916,6927,6939,6950,6961,6973,6984,6996,7007,
7019,7030,7042,7053,7064,7076,7087,7099,7110,7122,7133,7145,7156,7168,7179,7190,
7202,7213,7225,7236,7248,7259,7271,7282,7293,7305,7316,7328,7339,7351,7362,7374,
7385,7396,7408,7419,7431,7442,7454,7465,7477,7488,7499,7511,7522,7534,7545,7557,
7568,7580,7591,7602,7614,7625,7637,7648,7660,7671,7683,7694,7705,7717,7728,7740,
7751,7763,7774,7786,7797,7809,7820,7831,7843,7854,7866,7877,7889,7900,7912,7923,
7934,7946,7957,7969,7980,7992,8003,8015,8026,8037,8049,8060,8072,8083,8095,8106,
8118,8129,8140,8152,8163,8175,8186,8198,8209,8221,8232,8243,8255,8266,8278,8289,
8301,8312,8324,8335,8346,8358,8369,8381,8392,8404,8415,8427,8438,8450,8461,8472,
8484,8495,8507,8518,8530,8541,8553,8564,8575,8587,8598,8610,8621,8633,8644,8656,
8667,9658 /*8678+991, we have made this change to create the correct frequency for the value 9990*/,10000 
  ---select  from dual
);

/* This array contains the exact frequency for each distinct value that specified in 
array_col_val; actually it represents endpoint repeat counts */
  srec.bkvals :=
        DBMS_STATS.numarray (342,297,319,322,318,296,313,300,294,301,316,304,312,316,304,318,350,348,330,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,991,1
);
    srec.epc := 254;

    DBMS_STATS.prepare_column_values (srec, array_col_val);


    DBMS_STATS.set_column_stats (ownname   => USER,
                                 tabname   => 't1',
                                 colname   => 'col',
                                 distcnt   => distcnt,
                                 density   => density,
                                 nullcnt   => nullcnt,
                                 srec      => srec,
                                 avgclen   => avgclen);
END;
/

explain plan for
select * from t1 where col=9990;
select * from table(dbms_xplan.display);
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   991 | 27748 |    14   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   991 | 27748 |    14   (0)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):                       
---------------------------------------------------                       
   1 - filter("COL"=9990)

We got correct cardinality!

Now we are going to see an approach for how a hybrid histogram could be created from the top-n structure, without doing sampling from the table.

Define the bucket size.

We need to create 254 buckets (by default) and will include popular values as much as possible. If column values are uniformly distributed, each bucket will have num_rows/num_buckets=10000/254=39 rows, or we get (max_value-min_value)/num_buckets. Because the values are not uniformly distributed, we can assume average bucket size:


                bktSize= ((cdn − popFreq − freq) / (mnb − popCnt - 1))

Where

cdn is the number of rows in the table.
popFreq is the total frequency of popular values and is derived from top-n values, so it is sum(lossytable.freq where freq>1).
freq is the frequency of the min(lossytable.val).
mnb is the number of requested buckets (the default is 254).
popCnt is the total count of popular values.

So, in our case, it will be this:


bktSize= ((cdn − popFreq − freq) / (mnb − popCnt − 1))=
(10000 − 6991 − 342)/(254 − 20 − 1) = 11.44

Identify the required bucket numbers.

We have top-n values stored in losssytable and need to create 254 buckets. If NRB<= popCnt, all histogram buckets should contain popular values and we just need to formulate the endpoint number (bucket numbers) by considering bucket size. Only the last bucket could have a different size.

Otherwise, if NRB>popCnt, we need to create additional buckets in order to complete the number of buckets to 254. So in this case, we are required to create (NRB−popCnt) number of buckets. In our case, (NRB−popCnt)=(254−20)=233 additional buckets will be created. Initially we will insert all top-n values into the tmp_hybrid_hist table and then we will examine lossytable to find the appropriate bucket boundaries that allow us to create additional buckets. To do that, we have to find the differences between the bucket sizes of neighbor buckets (endpoint values).

Rearrange the endpoint numbers.

We will rearrange the endpoint numbers at the end of the process.

We can write a simple PL/SQL block to perform the above tasks:


/*+ find topn elements */
execute find_topn(10000,254);
/* Insert topn values to the HH table 
   Initially we do not compute endpoint numbers
*/
insert into tmp_hybrid_hist
select 1,val,freq from lossytable  order by 2;
/* Find appropriate bucket boundaries and create HH buckets */
DECLARE
    s   NUMBER := 0;
    n   INTEGER := 0;
BEGIN
    FOR c
    IN (SELECT   cur_val,
                 prev_val,
                 dif,
                 ROUND (dif / 11) bkcnt
          FROM   (  SELECT   val cur_val,
                             LAG (val, 1, 0) OVER (ORDER BY val) prev_val,
                             val - LAG (val, 1, 0) OVER (ORDER BY val) dif,
                             t.*
                      FROM   lossytable t
                  ORDER BY   val)
         WHERE   dif > 11 /*It is the bucket size */
                         )
    LOOP
        s := c.prev_val;

        WHILE (s + 11) < c.cur_val
        LOOP
            n := n + 1;
            s := s + 11;
            INSERT INTO tmp_hybrid_hist (ep_num, ep_val, ep_rpcnt)
              VALUES   (s, s, 1);

            IF n = 233  /* it is the required number of buckets -1 */
            THEN
                RETURN;
            END IF;
        END LOOP;
    END LOOP;
END;
/* Re arrange/calculate bucket numbers*/
UPDATE   tmp_hybrid_hist t
   SET   t.ep_num =
             (SELECT   SUM (case when ep_rpcnt=1 then 8 else ep_rpcnt end)
                FROM   tmp_hybrid_hist tt
               WHERE   tt.ep_val <= t.ep_val)
/* Add the lst bucket to the table, */
INSERT INTO tmp_hybrid_hist (ep_num,
                             ep_val,
                             ep_rpcnt)
  VALUES   (10000  /* the numrows in the table */
                 ,
            10000 /* the max value of the column */
                 ,
            1);
  SQL> SELECT   ep_num,ep_val,
         ep_num - LAG (ep_num, 1, 0) OVER (ORDER BY ep_num) freq,
         ep_rpcnt
  FROM   tmp_hybrid_hist order by 2;  2    3    4

    EP_NUM     EP_VAL       FREQ   EP_RPCNT
---------- ---------- ---------- ----------
       342          1        342        342
       639          2        297        297
       958          3        319        319
      1280          4        322        322
      1596          5        316        316
      1892          6        296        296
      2205          7        313        313
      2504          8        299        299
      2797          9        293        293
      3098         10        301        301
      3414         11        316        316
      3718         12        304        304
      4030         13        312        312
      4345         14        315        315
      4649         15        304        304
      4967         16        318        318
      5316         17        349        349
      5664         18        348        348
      5994         19        330        330
      6002         30          8          1
      6010         41          8          1
      .....................................
      .....................................
      7850       2571          8          1
      7858       2582          8          1
      8849       9990        991        991
     10000      10000       1151          1

254 rows selected.

As a result, we have got a hybrid histogram and we can migrate data using the dbms_stats package (prepare_column_values, set_column_stats) from our temporary table (tmp_hybrid_hist) to the dictionary, as previously explained.

Conclusion

In summary, we see that in Oracle Database 12c, histogram construction has been improved. Starting with Oracle Database 12c, a streaming approach is used to create frequency and top-frequency histograms. Streaming algorithms help to estimate top-frequency values with a given memory bound.

During the NDV approximation, the database also maintains a ROWID table for top-n values. If there are not any splits, the database gives us exact frequencies (it computes the top-n values from the NDV structure). If the synopsis reaches its memory bound, Oracle Database uses a lossy counting algorithm to estimate the top-n values. In that moment, Oracle Database allocates a new data structure to support the lossy counting algorithm. It is possible that Oracle Database could maintain the required data structure even by processing/maintaining the NDV structure (so Oracle Database can do that in parallel/at the same time). But if the NDV structure is not suitable, the database could directly switch to the previously created data structure. Also, it is possible that before doing the first split, Oracle Database could migrate the required data (that can be derived from the NDV structure) to the new data structure (if we can assume that, using the NDV structure, Oracle Database actually computed some top-n values, this information could be inserted to the lossytable table before doing the first split and lossy counting algorithm can be continued).

Oracle Database creates frequency and top-frequency histograms very efficiently. The idea to create a hybrid histogram is brilliant, but Oracle Database still uses a sampling method to create a hybrid histogram. In this case, we actually could lose frequent values that could be very important for the query optimizer. As a result, the optimizer could not estimate cardinality for the most-frequent values correctly and an inefficient execution plan could be produced.

The current construction mechanism of the hybrid histogram indicates that the problem is not related only to sampling. Oracle Database orders the values that are fetched by the sampling query and it is restricted to include the (most) popular values in the histogram. In my opinion, just replacing ORDER BY VAL with ORDER BY FREQ DESC in the sampling query and rearranging bucket numbers will help to improve the quality of a hybrid histogram.

In addition, Oracle Database calculates density for nonpopular values based on the number of popular values and their total frequencies. During the NDV approximation, the top-n values are extracted and, therefore, the database could create a hybrid histogram without sampling and can include the top-n values as much as possible (all top-n values might not be included and it depends on the requested number of histogram buckets, the number of top-n values, and the NDV).

Finally, I want to note that if it is decided to create a hybrid histogram and the number of requested histogram buckets is greater than the number of popular values, then it is not necessary to create NRB buckets. It does not matter. In this case, just creating popCnt (the popular value count) number of buckets plus one bucket for the maximum value (last bucket) is enough for cardinality estimation.

As a result, we get a hybrid histogram with popCnt+1 buckets, which represents our data quite well enough (we can do that manually using dbms_stats). In this case, the query optimizer can estimate cardinality for popular and nonpopular values without having NRB (254) buckets; it will estimate cardinality for a single table or join as an estimation with NRB (254) buckets for the histogram. Only the last bucket will contain num_rows(tab) − POPFREQ(total frequency of popular values). I hope in next version of Oracle Database, we will see some improvements for creating histograms.

About the Author

Chinar Aliyev graduated from the Baku State University applied mathematics department. He has been working with Oracle Database more than 10 years (since the Oracle 8i version) and specializes in tuning and troubleshooting Oracle Database performance problems. His areas of interest include SQL optimization, optimizer internals, parallel execution, and database designing. He is an Oracle Database 11g Certified Master and an Oracle Database 11g Performance Tuning Certified Expert. He also is a regular presenter at international and local conferences such as Hotsos and Oracle Database user groups.