
Incremental Maintenance of Histograms in Oracle Database 12c

 

by Chinar Aliyev

 

This article describes how Oracle Database constructs a global histogram, an issue with that process, and how to correctly identify the top-n frequent values when creating a global histogram.

 

Introduction

 

In Oracle Database 11g and 12c, the process of gathering statistics has been refined. Oracle Database can gather basic column statistics incrementally for partitioned tables. All basic column statistics (minimum, maximum, number of nulls, and so on) except the number of distinct values (NDV) can be easily derived from partition-level statistics. To gather the global-level NDV incrementally, the database uses a synopsis data structure. Starting with Oracle Database 11g, histograms are also maintained incrementally.

 

In this article, we will see how Oracle Database constructs a global histogram. Does it use a synopsis data structure? Is the synopsis data important for incrementally updating a histogram?

 

Construction of a Global-Level Frequency Histogram

 

As we know, a frequency histogram (FH) is a collection of {(vi,fi)} pairs, where vi represents a distinct value of a column and fi represents that value's frequency. Assume a table has two partitions, (P1,P2), and the column Ci in each partition has an FH such as {(vip1,fip1), i∈ [1...n]} and {(vip2, fip2), i∈ [1...m]}. We then need to create a new histogram for the whole table (global level).
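The merge can be sketched outside of SQL as well. The following short Python sketch (my own illustration; the names are mine, not Oracle's code) merges partition-level FHs by summing per-value frequencies:

```python
from collections import Counter

def merge_frequency_histograms(*partition_histograms):
    # Each partition histogram is a mapping {distinct value v_i: frequency f_i}.
    # The global FH is obtained by summing frequencies per distinct value.
    merged = Counter()
    for hist in partition_histograms:
        merged.update(hist)
    return dict(merged)

# Two partitions that share some column values:
p1 = {10: 5, 20: 3, 30: 2}
p2 = {20: 4, 30: 1, 40: 6}
merge_frequency_histograms(p1, p2)
# {10: 5, 20: 7, 30: 3, 40: 6} -- the NDV of the union (4) gives the bucket count
```

The NDV of the merged result determines the number of buckets of the global FH, exactly as the GROUP BY in the symbolic Query 1 below does.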

 

It is obvious that if the NDV of the combined values (vip1,vjp2) is less than N (the number of requested histogram buckets; the default is 254), the new histogram will be a frequency histogram. First, we need to identify the number of buckets for the global FH; it will be the NDV of (vip1,vjp2). Then, for each distinct value (bucket) we need to identify its frequency count, which we can do by grouping the distinct values and summing their frequencies, for example, as shown in the symbolic SQL statement in Query 1.

 

Query 1:

 

SELECT num_bucket, SUM(freq) FROM (
SELECT vip1 AS num_bucket, fip1 ...
UNION ALL
SELECT vip2 AS num_bucket, fip2 ...
)
GROUP BY num_bucket

As a result, we will have a new FH whose buckets and frequencies are produced by Query 1.

 

Now let's see Case 1, below, where the SALES table is partitioned (partitions SALES_P1 and SALES_P2) and the PROD_ID column has an FH in each partition.

 

Case 1:

 

CREATE TABLE sales
    (prod_id                        NUMBER NOT NULL,
    time_id                         DATE NOT NULL,
    amount                         NUMBER(10,2) NOT NULL)
     PARTITION BY RANGE (TIME_ID)
  (
  PARTITION sales_p1 VALUES LESS THAN (TO_DATE('2008-04-01 00:00:00', 'SYYYY-
MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')),
  PARTITION sales_p2 VALUES LESS THAN (TO_DATE('2008-07-01 00:00:00', 'SYYYY-
MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
   )
/

BEGIN
    DBMS_STATS.set_table_prefs (ownname   => null,
                                tabname   => 'sales',
                                pname     => 'approximate_ndv_algorithm',
                                pvalue    => 'hyperloglog');

    DBMS_STATS.set_table_prefs (null,
                                'sales',
                                'INCREMENTAL',
                                'TRUE');
END;
/


DECLARE
    i   INTEGER;
BEGIN
    FOR i IN 1 .. 10000
    LOOP
        INSERT INTO sales
            SELECT   TRUNC (DBMS_RANDOM.VALUE (1, 50)),
                     TO_DATE ('2008-01-01', 'yyyy-mm-dd')
                     + TRUNC (DBMS_RANDOM.VALUE (1, 80)),
                     TRUNC (DBMS_RANDOM.VALUE (1, 1000))
              FROM   DUAL;

        INSERT INTO sales
            SELECT   TRUNC (DBMS_RANDOM.VALUE (1, 90)),
                     TO_DATE ('2008-04-01', 'yyyy-mm-dd')
                     + TRUNC (DBMS_RANDOM.VALUE (1, 80)),
                     TRUNC (DBMS_RANDOM.VALUE (1, 1000))
              FROM   DUAL;

        COMMIT;
    END LOOP;
END;
/

How can Oracle Database derive a global-level FH from the partition-level histograms?

 

SELECT   partition_name part_name, num_distinct, histogram
  FROM   dba_part_col_statistics
 WHERE   table_name = 'SALES' AND column_name = 'PROD_ID'; 
 
PART_NAME  NUM_DISTINCT HISTOGRAM
---------- ------------ ---------------
SALES_P1   49           FREQUENCY
SALES_P2   89           FREQUENCY

And we can see the histogram data for each partition level as follows:

 

  SELECT   partition_name,
           column_name,
           bucket_number,
           endpoint_value
    FROM   user_part_histograms
   WHERE   table_name = 'SALES' AND column_name = 'PROD_ID'
ORDER BY   1;

For the two partitions, we have 49 and 89 histogram buckets, respectively:

 

  SELECT   partition_name,
           count(*)
    FROM   user_part_histograms
   WHERE   table_name = 'SALES' AND column_name = 'PROD_ID'
group by partition_name;

PARTITION_NAME   COUNT(*) 
---------------  ---------
SALES_P1         49
SALES_P2         89

So, according to Query 1, we can derive a global histogram from the partition-level FH data, as shown in Query 2.

 

Query 2:

 

select ep_val, sum(freq) freq  from
(
SELECT   endpoint_value ep_val,
         bucket_number
         - LAG (bucket_number, 1, 0) OVER (partition by partition_name ORDER BY bucket_number)
             freq
  FROM   user_part_histograms
 WHERE   table_name = 'SALES' AND column_name = 'PROD_ID'
 )
 group by ep_val
 order by 1

Or, we can write Query 2 in this alternative form:

 

select ep_val, sum(freq) freq  from
(
SELECT   endpoint_value ep_val,
         bucket_number
         - LAG (bucket_number, 1, 0) OVER (ORDER BY bucket_number)
             freq
  FROM   user_part_histograms
 WHERE   table_name = 'SALES' AND partition_name='SALES_P1' and column_name = 'PROD_ID'
 union all 
SELECT   endpoint_value ep_val,
         bucket_number
         - LAG (bucket_number, 1, 0) OVER (ORDER BY bucket_number)
             freq
  FROM   user_part_histograms
 WHERE   table_name = 'SALES' AND partition_name='SALES_P2' and column_name = 'PROD_ID'
 )
 group by ep_val
 order by 1

 

 

    EP_VAL       FREQ
---------- ----------
         1        333
         2        298
         3        297
         4        345
         5        329
         6        307
         7        312
         8        308
         9        281
        10        338
        11        318
        12        332
        13        332
        14        312
        15        312
        16        314
        17        314
        18        355
        19        314
        20        332
        21        290
        22        338
        23        327
.....................
        81        102
        82        114
        83        123
        84        109
        85        136
        86        121
        87        111
        88        102
        89         99

89 rows selected.

And we can compare the result with a table-level (global) FH with the following SQL statement:

 

SELECT   endpoint_value ep_val,
         endpoint_number
         - LAG (endpoint_number, 1, 0) OVER (ORDER BY endpoint_number)
             freq
  FROM   user_tab_histograms
 WHERE   table_name = 'SALES'  and column_name = 'PROD_ID';

If we enable a SQL trace for the statistics gathering, we see the SQL statement shown in Query 3.

 

Query 3:

 

alter session set sql_trace=true;
execute DBMS_STATS.gather_table_stats(null,'SALES');
    INSERT /*+ OPT_PARAM('_parallel_syspls_obey_force' 'false') */
    INTO    finalhist$ (endpoint, epvalue_raw, bucket)
      SELECT   endpoint, epvalue, CEIL (SUM (numrows))
        FROM   (SELECT   (hist.bucket
                          - LAG (
                                hist.bucket,
                                1,
                                0)
                                OVER (PARTITION BY hist.obj#
                                      ORDER BY hist.bucket))
                             AS numrows,
                         hist.endpoint,
                         CASE
                             WHEN hist.epvalue IS NOT NULL
                             THEN
                                 UTL_RAW.cast_to_raw (hist.epvalue)
                             ELSE
                                 hist.epvalue_raw
                         END
                             epvalue
                  FROM   "_HISTGRM_DEC" hist, (SELECT   obj#
                                                 FROM   tabpart$ tp
                                                WHERE   bo# = :b1
                                               UNION ALL
                                               SELECT   obj#
                                                 FROM   tabcompart$ tcp
                                                WHERE   bo# = :b1) t
                 WHERE       t.obj# = hist.obj#
                         AND intcol# = :b2
                         AND hist.bucket <> 0)
    GROUP BY   endpoint, epvalue

This statement expresses the same idea as Query 1 and Query 2, but Oracle Database considers and keeps the raw format of the endpoint values (plus statistics for composite partitions). In addition, the dbms_stats trace shows that Oracle Database first created the FH for the PROD_ID column in each partition, and then derived the table-level (global) FH from the partition level only, without scanning the entire table.

 

Gathering basic column statistics including the NDV and FH for the SALES_P1 partition:

 

not analyzed yet; 
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname: 
SALES_P1 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P1 objn: 101954 
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
............................................................................................................
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:49 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 49, max # of buckects = 254, pct = 100, ssize 
= 10000
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1035
DBMS_STATS: done_hist in process_topn: TRUE  csr.ccnt: 1
DBMS_STATS: Mark column "PROD_ID" as top N computed

Gathering basic column statistics including the NDV and FH for the SALES_P2 partition:

 

DBMS_STATS: gather stats on partition SALES_P2: synopsis not gathered yet; 
not analyzed yet; 
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname: 
SALES_P2 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P2 objn: 101955 
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
.................................................
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:89 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 89, max # of buckects = 254, pct = 100, ssize 
= 10000
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1035
DBMS_STATS: done_hist in process_topn: TRUE  csr.ccnt: 1

Deriving basic column statistics including the NDV and FH (histogram data) from partition levels for the table:

 

DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname:  
spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for PROD_ID


DBMS_STATS:  derive global freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 89, max # of buckects = 254, pct = 100, ssize 
= 20000
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1035
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES

Construction of a Global-Level Top Frequency Histogram

 

In this section, we will see how Oracle Database constructs a global-level top frequency histogram (TFH). It should be noted that the resulting (global) histogram does not directly depend on the types of the local (partition-level) histograms. The database creates the global histogram by just scanning the existing partition-level histogram data. This means we could have two (or more) partition-level FHs and the result could be a TFH or a hybrid histogram (HH), or we could have two FHs and one HH and the global histogram could be a TFH.

 

So, the global histogram depends only on the data (the top frequent values) provided by the local histograms. In Case 2, below, the structure of our SALES table is the same as in the previous example, but the data is different. I used special data that leads Oracle Database to create frequency and hybrid histograms at the partition level and a top frequency histogram at the table level.

 

Case 2:

 

DECLARE
    i   INTEGER;
BEGIN
    DELETE   sales;
    COMMIT;

    FOR i IN 1 .. 1000
    LOOP
        INSERT INTO sales
            SELECT   CASE
                         WHEN (i <= 974 OR i >= 984)
                         THEN
                             TRUNC (DBMS_RANDOM.VALUE (1, 20))
                         ELSE
                             i
                     END,
                     TO_DATE ('2008-01-01', 'yyyy-mm-dd')
                     + TRUNC (DBMS_RANDOM.VALUE (1, 80)),
                     TRUNC (DBMS_RANDOM.VALUE (1, 1000))
              FROM   DUAL;

        INSERT INTO sales
            SELECT   CASE
                         WHEN i <= 760 THEN TRUNC (DBMS_RANDOM.VALUE (1, 20))
                         ELSE i
                     END,
                     TO_DATE ('2008-04-01', 'yyyy-mm-dd')
                     + TRUNC (DBMS_RANDOM.VALUE (1, 80)),
                     TRUNC (DBMS_RANDOM.VALUE (1, 1000))
              FROM   DUAL;

    END LOOP;
  COMMIT;
END;
/

As you can see, our table has 2000 rows (1000 rows per partition). In the sales_p1 partition there can be at most 28 distinct values, and in the sales_p2 partition there can be at most 259 distinct values. I have produced special data here to avoid a problem in generating a global histogram; in the next part of the article, we will see and explore that problem.

 

execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;

BEGIN
    DBMS_STATS.gather_table_stats (ownname    => null,
                                   tabname    => 'SALES');
END;
/
SELECT   partition_name part_name,
         num_distinct,
         histogram,
         num_buckets,
         sample_size
  FROM   user_part_col_statistics
 WHERE   table_name = 'SALES' AND column_name = 'PROD_ID';
 
PART_NAME   NUM_DISTINCT  HISTOGRAM       NUM_BUCKETS SAMPLE_SIZE
----------- ------------  --------------- ----------- -----------
SALES_P1    28            FREQUENCY       28          1000
SALES_P2    259           HYBRID          254         1000

SELECT   histogram, sample_size
  FROM   user_tab_col_statistics
 WHERE   table_name = 'SALES' AND column_name = 'PROD_ID';

HISTOGRAM       SAMPLE_SIZE
--------------- -----------
TOP-FREQUENCY          2000

The PROD_ID column has an FH and an HH according to the partitions, but the global level has a TFH. The dbms_stats trace shows that the database started to gather statistics for each partition individually and created histograms for each partition:

 

DBMS_STATS: gather stats on partition SALES_P1: synopsis not gathered yet; 
not analyzed yet; 
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname: 
SALES_P1 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P1 objn: 101954 
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0

Oracle Database created an FH for the sales_p1 partition:

 

DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:28 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 28, max # of buckects = 254, pct = 100, ssize 
= 1000
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1039
DBMS_STATS: done_hist in process_topn: TRUE  csr.ccnt: 1
DBMS_STATS: Mark column "PROD_ID" as top N computed
DBMS_STATS: Skip topn computed column "PROD_ID" numhist: 0

Then the database started to gather basic and histogram statistics for the second partition, sales_p2.

 

DBMS_STATS: gather stats on partition SALES_P2: synopsis not gathered yet; 
not analyzed yet; 
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname: 
SALES_P2 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P2 objn: 101955 
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
...........................................................................
DBMS_STATS: start processing top n values for column "PROD_ID"
DBMS_STATS:   >> frequency histograms is not feasible 
                       (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  
COLNAME
DBMS_STATS:                       Y    Y    Y    Y                        
PROD_ID
DBMS_STATS:                                                               
TIME_ID
DBMS_STATS:                                                               
AMOUNT
DBMS_STATS: Building Histogram for PROD_ID
DBMS_STATS:  bktnum=-1, nnv=1000, snnv=1000, sndv=259, est_ndv=259, mnb=254
DBMS_STATS:  Trying hybrid histogram 
DBMS_STATS: Starting query at 12-FEB-18 11.54.22.688238000 AM +04:00
DBMS_STATS: select substrb(dump(val,16,0,64),1,240) ep,  freq, cdn, ndv, 
(sum(pop) over()) popcnt,  (sum(pop*freq) over()) popfreq,  
substrb(dump(max(val) over(),16,0,64),1,240) maxval,  substrb(dump(min(val) 
over(),16,0,64),1,240) minval  from (select val, freq,  (sum(freq) over()) 
cdn, (count(*) over()) ndv,  (case when freq > ((sum(freq) over())/254)  then 
1  else 0 end) pop from  (select /*+  no_parallel(t) no_parallel_index(t) 
dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 
'false') no_substrb_pad  */ "PROD_ID"  val, count("PROD_ID") freq  from 
"SYS"."SALES" t  where "PROD_ID" is not null  and 
TBL$OR$IDX$PART$NUM("SYS"."SALES",0,4,0,"ROWID") = :objn group by "PROD_ID")) 
order by val (objn=101955)
DBMS_STATS:  > cdn 1000, popFreq 760, popCnt 19, bktSize 
.8803418803418803418803418803418803418803, bktSzFrc 
.8803418803418803418803418803418803418803
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 254, mnb 254, ssize 
1000, min_ssize 2500, appr_ndv  TRUE, ndv 259, selNdv 19, selFreq 760, pct 
100, avg_bktsize 4, csr.hreq FALSE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 15
DBMS_STATS:  Accepting histogram

Figure 1 describes the process of choosing the histogram type, where

 

  • NDV is the number of distinct values of the column.
  • N is the number of requested histogram buckets. The default value is 254.
  • P is an internal percentage threshold that is equal to (1-(1/N))*100.
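Under these definitions, the decision in Figure 1 can be condensed into a few lines of illustrative Python (my own sketch of the documented criteria, not Oracle's internal code):

```python
def choose_histogram_type(ndv, topn_freq, num_rows, n=254):
    # ndv: number of distinct values of the column;
    # n: number of requested histogram buckets (default 254);
    # topn_freq: total frequency of the n most frequent values;
    # num_rows: number of non-null rows.
    if ndv <= n:
        return 'FREQUENCY'          # every distinct value gets its own bucket
    p = 1.0 - 1.0 / n               # threshold fraction, (1 - 1/N)
    if topn_freq >= p * num_rows:
        return 'TOP-FREQUENCY'      # top-n values cover enough of the data
    return 'HYBRID'

# Case 2, global level: ndv=259, topn_freq=1995, num_rows=2000
choose_histogram_type(259, 1995, 2000)   # -> 'TOP-FREQUENCY'
```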

 


Figure 1. Flowchart for choosing the histogram type

So, in our case we have 19 popular values and their total frequency is 760. Therefore, according to the criteria, (760/1000) = 0.76 < (1−1/254) = 0.9960, and thus Oracle Database decided to create an HH at the partition level. To understand how Oracle Database computes top-n values, refer to my previous article, "Histogram Construction in Oracle Database 12c."
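The popular-value figures the trace reports (popCnt 19, popFreq 760) can be reproduced with a small sketch. The function below is my own illustration of the criterion used in the dbms_stats query shown above: a value is counted as popular when its frequency exceeds the average bucket cardinality cdn/N.

```python
def popular_value_stats(freqs, n=254):
    # freqs: per-value frequencies of the column in one partition.
    # A value is "popular" when its frequency exceeds the average
    # bucket cardinality cdn / n (cdn = total non-null row count).
    cdn = sum(freqs)
    popular = [f for f in freqs if f > cdn / n]
    return len(popular), sum(popular)

# SALES_P2-like data: 19 heavy values (40 rows each) plus 240 singletons
pop_cnt, pop_freq = popular_value_stats([40] * 19 + [1] * 240)
# pop_cnt = 19, pop_freq = 760; since 760/1000 = 0.76 < 1 - 1/254,
# a hybrid histogram is chosen for this partition
```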

 

The following lines indicate that Oracle Database used the HyperLogLog algorithm to compute the NDV; also, a global top frequency histogram was derived from the partition-level statistics without scanning the entire table.

 

DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname:  
spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS:  top-freq hist feasible? YES ndv: 259 mnb: 254 topn_freq: 1995 nnv: 2000
DBMS_STATS:  derive global top-freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 254, max # of buckects = 254, pct = 100, 
ssize = 2000
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 2063
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
DBMS_STATS: =============================================================================
=======================
DBMS_STATS: Statistics from clist:
DBMS_STATS: =============================================================================
=======================
DBMS_STATS: COL_NAME                       ACL       NV        SNNVDV    NDV       
SNDV      DENS      CCNT     
DBMS_STATS: -----------------------------------------------------------------
-----------------------------------
DBMS_STATS: PROD_ID                        4         0         2000      259       
259       .00025    1   

How did Oracle Database identify the total frequency of top-n frequent values as 1995 (topn_freq: 1995) for the global level?

 

Let's create a SQL statement to find out. In this case, the partitions have one FH and one HH. We need to extract the top-n values from the histogram data, and then we can determine which type of histogram should be created at the global (table) level. We can extract frequency information for the top-n values from an FH or a TFH using Query 2.

 

What about the HH? As you know, an HH contains two types of frequency information: the frequency of each endpoint value (endpoint_repeat_count) and a cumulative bucket frequency (to track rows per bucket). So we need to find the frequencies of the endpoint values. The Oracle Database query optimizer calculates selectivity for an endpoint value as endpoint_repeat_count/max(bucket_number) and, therefore, its cardinality will be endpoint_repeat_count/max(bucket_number) * partnum_rows, where partnum_rows = dba_tab_partitions.num_rows. If the HH is created without sampling (or with 100% sampling), then max(bucket_number) = partnum_rows, so endpoint_repeat_count represents the exact frequency of the endpoint value.
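The frequency estimate just described can be sketched as follows (my own illustrative Python; the tuple layout mirrors the columns of the *_PART_HISTOGRAMS views):

```python
def hh_endpoint_frequencies(buckets, part_num_rows):
    # buckets: (endpoint_value, bucket_number, endpoint_repeat_count) tuples
    # taken from a hybrid histogram in a *_PART_HISTOGRAMS view.
    # Estimated frequency of an endpoint value:
    #   endpoint_repeat_count * part_num_rows / max(bucket_number)
    # With 100% sampling, max(bucket_number) equals part_num_rows, so the
    # repeat count itself is the exact frequency.
    max_bucket = max(bucket for _, bucket, _ in buckets)
    return {value: rep * part_num_rows / max_bucket
            for value, _, rep in buckets}

# Three buckets of a toy hybrid histogram over 10 rows (100% sample):
hh_endpoint_frequencies([(1, 4, 4), (2, 8, 4), (100, 10, 1)], 10)
# {1: 4.0, 2: 4.0, 100: 1.0}
```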

 

So, the query for top-n values that are extracted from the HH will be this:

 

  SELECT   endpoint_value,
           endpoint_repeat_count * p.num_rows / MAX (bucket_number) OVER ()
               freq
    FROM   user_part_histograms h, user_tab_partitions p
   WHERE       h.table_name = 'SALES'
           AND h.partition_name = 'SALES_P2'
           AND h.column_name = 'PROD_ID'
           AND p.table_name = h.table_name
           AND p.partition_name = h.partition_name
ORDER BY   2 DESC

Therefore, the final query for extracting top-n values will be the following:

 

  SELECT   ep_val,
           freq,
           (SUM (CASE WHEN ROWNUM <= 254 THEN freq ELSE 0 END) OVER ())
               topn_freq
    FROM   (  SELECT   ep_val, SUM (freq) freq
                FROM   (SELECT   endpoint_value ep_val,
                                 bucket_number
                                 - LAG (bucket_number, 1, 0)
                                       OVER (ORDER BY bucket_number)
                                     freq
                          FROM   user_part_histograms
                         WHERE       table_name = 'SALES'
                                 AND column_name = 'PROD_ID'
                                 AND partition_name = 'SALES_P1'
                        UNION ALL
                        SELECT   endpoint_value,
                                   endpoint_repeat_count
                                 * p.num_rows
                                 / MAX (bucket_number) OVER ()
                                     freq
                          FROM   user_part_histograms h, user_tab_partitions p
                         WHERE       h.table_name = 'SALES'
                                 AND h.partition_name = 'SALES_P2'
                                 AND h.column_name = 'PROD_ID'
                                 AND p.table_name = h.table_name
                                 AND p.partition_name = h.partition_name)
            GROUP BY   ep_val
            ORDER BY   2 DESC)
ORDER BY   1 

We can generalize our query without specifying partition names explicitly. As a result, Query 4 will not depend on the number of partitions and their names.

 

Query 4:

 

  SELECT   ep_val,
           freq,
           (SUM (CASE WHEN ROWNUM <= 254 THEN freq ELSE 0 END) OVER ())
               topn_freq
    FROM   (  SELECT   ep_val, SUM (freq) freq
                FROM   (SELECT   endpoint_value ep_val,
                                 bucket_number
                                 - LAG (bucket_number, 1, 0)
                                       OVER (ORDER BY bucket_number)
                                     freq
                          FROM   user_part_histograms h,
                                 user_part_col_statistics pcs
                         WHERE       h.table_name = 'SALES'
                                 AND h.column_name = 'PROD_ID'
                                 AND pcs.table_name = h.table_name
                                 AND pcs.column_name = h.column_name
                                 AND pcs.histogram IN
                                            ('FREQUENCY', 'TOP-FREQUENCY')
                                 AND pcs.partition_name = h.partition_name
                        UNION ALL
                        SELECT   endpoint_value,
                                   endpoint_repeat_count
                                 * p.num_rows
                                 / MAX (bucket_number) OVER ()
                                     freq
                          FROM   user_part_histograms h,
                                 user_tab_partitions p,
                                 user_part_col_statistics pcs
                         WHERE       h.table_name = 'SALES'
                                 AND h.column_name = 'PROD_ID'
                                 AND p.table_name = h.table_name
                                 AND p.partition_name = h.partition_name
                                 AND pcs.table_name = h.table_name
                                 AND pcs.column_name = h.column_name
                                 AND pcs.histogram = 'HYBRID'
                                 AND pcs.partition_name = h.partition_name)
            GROUP BY   ep_val
            ORDER BY   2 DESC)
ORDER BY   1

 

 

    EP_VAL       FREQ  TOPN_FREQ
---------- ---------- ----------
         1         82       1995
         2         99       1995
         3         90       1995
         4         91       1995
         5        103       1995
         6        101       1995
         7         87       1995
         8         94       1995
         9        104       1995
        10         84       1995
        11        103       1995
        12         81       1995
        13         79       1995
        14         85       1995
        15        102       1995
        16         87       1995
        17        100       1995
        18         84       1995
        19         95       1995
       761          1       1995
       762          1       1995
       763          1       1995
       764          1       1995
       765          1       1995
       766          1       1995
................................
       990          1       1995
       991          1       1995
       992          1       1995
       993          1       1995
       994          1       1995
      1000          1       1995

254 rows selected.

This means the top-n frequent values represent 1995/2000*100 = 99.75% of all column values, and that is why Oracle Database decided to create the TFH (because the condition TOPNFREQ >= p*num_rows(SALES) is true, where p = (1−1/254)*100 = 99.60). We can see that in the dbms_stats trace file, as follows:

 

DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS:  top-freq hist feasible? YES ndv: 259 mnb: 254 topn_freq: 1995 nnv: 2000
DBMS_STATS:  derive global top-freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 254, max # of buckects = 254, pct = 100, ssize = 2000

So, Query 4 allows us to extract the top-n frequent values from the existing partition-level histogram data. And in our case, the condition TOPNFREQ>=p*num_rows(SALES) was true and, therefore, Oracle Database stored the top-n frequent values as a TFH.
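The topn_freq figure the trace prints (and that Query 4 computes as TOPN_FREQ) is just the frequency sum of the 254 most frequent merged values. A minimal sketch of that step (my own Python, not Oracle's code):

```python
def topn_frequency(value_freqs, n=254):
    # value_freqs: merged per-value frequencies at the global level
    # (the output of Query 4's inner aggregation).
    # Returns the total frequency of the n most frequent values.
    return sum(sorted(value_freqs.values(), reverse=True)[:n])

# Toy example: with n=2 only the two heaviest values count
topn_frequency({1: 5, 2: 3, 3: 1}, n=2)   # -> 8
```

In Case 2 this sum is 1995, and 1995 >= (1 − 1/254) * 2000 ≈ 1992.1, which is why the TFH was feasible.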

 

Construction of a Global-Level Hybrid Histogram

 

It is obvious that after identifying the top-n frequent values and based on the NDV, Oracle Database evaluates the condition TOPNFREQ < p*num_rows(SALES) and decides which type of histogram should be created: FH, TFH, or hybrid histogram (HH). If TOPNFREQ < p*num_rows(SALES), where p=(1-1/N), then an HH needs to be created.

 

To produce an HH, we need to scan the existing histogram data, grouping the data into buckets, and we also need to identify popular values and their frequencies as endpoint values. Let's see Case 3, below.

 

Case 3:

 

DECLARE
    i   INTEGER;
BEGIN
    DELETE   sales;
    COMMIT;

    FOR i IN 1 .. 10000
    LOOP
        INSERT INTO sales
            SELECT   CASE
                         WHEN (i <= 9990)
                         THEN
                             TRUNC (DBMS_RANDOM.VALUE (1, 20))
                         ELSE
                             i
                     END,
                     TO_DATE ('2008-01-01', 'yyyy-mm-dd')
                     + TRUNC (DBMS_RANDOM.VALUE (1, 80)),
                     TRUNC (DBMS_RANDOM.VALUE (1, 1000))
              FROM   DUAL;
    END LOOP;

    FOR i IN 1 .. 1000
    LOOP
        INSERT INTO sales
            SELECT   CASE
                         WHEN i <= 760 THEN TRUNC (DBMS_RANDOM.VALUE (1, 20))
                         ELSE i
                     END,
                     TO_DATE ('2008-04-01', 'yyyy-mm-dd')
                     + TRUNC (DBMS_RANDOM.VALUE (1, 80)),
                     TRUNC (DBMS_RANDOM.VALUE (1, 1000))
              FROM   DUAL;
    END LOOP;

    COMMIT;
END;
/


execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
alter session set sql_trace= true;
begin
   dbms_stats.gather_table_stats
    (ownname=>null,
     tabname=>'SALES');
end;
/

The partitions and table-level column statistics are as follows:

 

SELECT   partition_name part_name,
         num_distinct,
         histogram,
         num_buckets,
         sample_size
  FROM   user_part_col_statistics
 WHERE   table_name = 'SALES' AND column_name = 'PROD_ID';

PART_NAME  NUM_DISTINCT  HISTOGRAM   NUM_BUCKETS SAMPLE_SIZE
---------- ------------ ----------  ----------- -----------
SALES_P1   29            FREQUENCY   29          10000
SALES_P2   259           HYBRID      254         1000

select  column_name, 
         num_distinct, 
         num_nulls,
         sample_size,
         histogram
 from user_tab_col_statistics
where table_name='SALES' AND column_name = 'PROD_ID';

COLUMN_NAME  NUM_DISTINCT  NUM_NULLS   SAMPLE_SIZE   HISTOGRAM
------------ ------------  ----------  -----------   ----------
PROD_ID      266           0           11000         HYBRID

As you can see, the SALES table now has 11000 rows. In the SALES_P1 partition, 9990 rows have values in the range 1 through 19, and in the SALES_P2 partition, 760 rows have values in the same range. This means at least 9990+760 = 10750 of the 11000 rows belong to top frequent values (these are, in fact, popular values). We can see the precise figures:

 

SELECT   SUM (freq) topn_freq
  FROM   (  SELECT   prod_id, COUNT ( * ) freq
              FROM   sales
          GROUP BY   prod_id
          ORDER BY   2 DESC)
 WHERE   ROWNUM <= 254;

 TOPN_FREQ
----------
     10985

This means that 10985/11000*100 = 99.86 percent of the total rows are represented by the top-n frequent values.

 

The question is why, in this case, Oracle Database decided to create an HH. Let's look at the dbms_stats trace file. As in the previous example, Oracle Database gathered basic and histogram statistics for each partition individually.

 

Gathering statistics for the SALES_P1 partition and creating an FH for the column of the partition:

 

DBMS_STATS: gather stats on partition SALES_P1: synopsis not gathered yet; 
not analyzed yet; 
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname: 
SALES_P1 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P1 objn: 101954 
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
DBMS_STATS: delete synopses of a single partition
.......................................
DBMS_STATS: hist_type in exec_get_topn: 1024 ndv:29 mnb:254
DBMS_STATS: Evaluating frequency histogram for col: "PROD_ID"
DBMS_STATS:  number of values = 29, max # of buckects = 254, pct = 100, ssize 
= 10000
DBMS_STATS:   csr.hreq: 0  Histogram gathering flags: 1039
DBMS_STATS: done_hist in process_topn: TRUE  csr.ccnt: 1
DBMS_STATS: Mark column "PROD_ID" as top N computed
DBMS_STATS: Skip topn computed column "PROD_ID" numhist: 0

Gathering statistics for the SALES_P2 partition and creating an HH for the column of the partition:

 

DBMS_STATS: Number of rows in the table = 10000, blocks = , average row 
length = 15, chain count = , scan rate = 0, sample size = 10000
DBMS_STATS: prepare reporting structures...
DBMS_STATS: reporting_man_update_task: objn: 101954 auto_stats: FALSE status: 
COMPLETED ctx.batching_coeff: 0
DBMS_STATS: gather stats on partition SALES_P2: synopsis not gathered yet; 
not analyzed yet; 
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname: 
SALES_P2 spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: reporting_man_log_task: target: SYS.SALES.SALES_P2 objn: 101955 
auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
.............................................................
DBMS_STATS: start processing top n values for column "PROD_ID"
DBMS_STATS:   >> frequency histograms is not feasible 
                       (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  
COLNAME
DBMS_STATS:                       Y    Y    Y    Y                        
PROD_ID
DBMS_STATS:                                                               
TIME_ID
DBMS_STATS:                                                               
AMOUNT
DBMS_STATS: Building Histogram for PROD_ID
DBMS_STATS:  bktnum=-1, nnv=1000, snnv=1000, sndv=259, est_ndv=259, mnb=254
DBMS_STATS:  Trying hybrid histogram 
....................................................................
DBMS_STATS:  > cdn 1000, popFreq 760, popCnt 19, bktSize 
.8846153846153846153846153846153846153846, bktSzFrc 
.8846153846153846153846153846153846153846
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 254, mnb 254, ssize 
1000, min_ssize 2500, appr_ndv  TRUE, ndv 259, selNdv 19, selFreq 760, pct 
100, avg_bktsize 4, csr.hreq FALSE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 15
DBMS_STATS:  Accepting histogram 
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE

The following lines of the trace file show that the database derived an HH from the partition-level histogram data and created it for the table (a global HH):

 

DBMS_STATS: reporting_man_update_task: objn: 101955 auto_stats: FALSE status: 
COMPLETED ctx.batching_coeff: 0
DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname:  
spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: HLL (incremental)
DBMS_STATS: get_agg_colstats: HLL only
DBMS_STATS: Deriving Global Histogram for PROD_ID
DBMS_STATS:  top-freq hist feasible? NO ndv: 266 mnb: 254 topn_freq: 5161 
nnv: 11000
DBMS_STATS:  derive global hybrid hist:  cdn 11000, popFreq 10750, popCnt 19, 
bktSize 1.06382978723404255319148936170212765957, bktSzFrc 
.06382978723404255319148936170212765957
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 254, mnb 254, ssize 
11000, min_ssize 2500, appr_ndv  TRUE, ndv 266, selNdv 19, selFreq 10750, pct 
100, avg_bktsize 43, csr.hreq FALSE, normalize FALSE
DBMS_STATS:   Histogram gathering flags: 15
DBMS_STATS:  Accepting histogram 
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE

As you can see, Oracle Database computed the total frequency of the top-n frequent values (top 254) as 5161 (topn_freq: 5161). This is obviously incorrect: as we saw above, the total frequency of the popular values alone is 10750 (popFreq 10750). Oracle Database also computed the number of popular values as 19 (popCnt 19).

 

To extract the required information, we can borrow Oracle Database's own approach: when generating an HH, it runs a SQL statement such as the one shown in Query 5.

 

Query 5:

 

select substrb(dump(val,16,0,64),1,240) ep,
       freq,
       cdn,
       ndv,
       (sum(pop) over()) popcnt,
       (sum(pop*freq) over()) popfreq,
       substrb(dump(max(val) over(),16,0,64),1,240) maxval,
       substrb(dump(min(val) over(),16,0,64),1,240) minval
  from (select val,
               freq,
               (sum(freq) over()) cdn,
               (count(*) over()) ndv,
               (case when freq > ((sum(freq) over())/num_bucket)  then 1  else 0 end) pop
          from  (select "column"  val,
                        count("column") freq
                   from "table" sample (.abcde...)  t
                  where "column" is not null
                  group by "column"))
order by val

So, we need to identify the endpoint values (ep), their frequencies (freq), the histogram-based cardinality (cdn), the histogram-based number of distinct values (ndv), and the count of popular values (popcnt) together with their total frequency (popfreq). As we saw in the previous section, the frequency information can be extracted with Query 4. Therefore, we replace the following part of Query 5 with the whole of Query 4:

 

select "column"  val,
                        count("column") freq
                   from "table" sample (.abcde...)  t
                  where "column" is not null
                  group by "column"
                  

In addition, I have added a "bucket frequency" column to the query in order to compute the number of rows (cardinality, crdn) from the histogram data. The resulting equivalent query is Query 6.

 

Query 6:

 

      SELECT   ep_val,
           pop,
           freq,
           crdn,
           COUNT ( * ) OVER () ndv,
           SUM (pop) OVER () popcnt,
           SUM (pop * freq) OVER () popfreq,
           (SUM (CASE WHEN ROWNUM <= 254 THEN freq ELSE 0 END) OVER ())
               topnfreq
    FROM   (SELECT   ep_val,
                     ROUND (freq) freq,
                     SUM (cdn) OVER () crdn,
                     (CASE
                          WHEN freq > ROUND (SUM (freq) OVER ()) / 254 THEN 1
                          ELSE 0
                      END)
                         pop
              FROM   (  SELECT   ep_val, SUM (freq) freq, SUM (cdn) cdn
                          FROM   (SELECT   ep_val, freq, freq cdn
                                    FROM   (SELECT   endpoint_value ep_val,
                                                     bucket_number
                                                     - LAG (
                                                           bucket_number,
                                                           1,
                                                           0)
                                                           OVER (
                                                               ORDER BY
                                                                   bucket_number)
                                                         freq
                                              FROM   user_part_histograms h,
                                                     user_part_col_statistics pcs
                                             WHERE   h.table_name = 'SALES'
                                                     AND h.column_name =
                                                            'PROD_ID'
                                                     AND pcs.table_name =
                                                            h.table_name
                                                     AND pcs.column_name =
                                                            h.column_name
                                                     AND pcs.histogram IN
                                                                ('FREQUENCY',
                                                                 'TOP-FREQUENCY')
                                                     AND pcs.partition_name =
                                                            h.partition_name)
                                  UNION ALL
                                  SELECT   endpoint_value,
                                             endpoint_repeat_count
                                           * p.num_rows
                                           / MAX (bucket_number) OVER ()
                                               freq,
                                           bucket_number
                                           - LAG (bucket_number, 1, 0)
                                                 OVER (ORDER BY bucket_number)
                                               cdn
                                    FROM   user_part_histograms h,
                                           user_tab_partitions p,
                                           user_part_col_statistics pcs
                                   WHERE       h.table_name = 'SALES'
                                           AND h.column_name = 'PROD_ID'
                                           AND p.table_name = h.table_name
                                           AND p.partition_name =
                                                  h.partition_name
                                           AND pcs.table_name = h.table_name
                                           AND pcs.column_name = h.column_name
                                           AND pcs.histogram = 'HYBRID'
                                           AND pcs.partition_name =
                                                  h.partition_name)
                      GROUP BY   ep_val
                      ORDER BY   2 DESC))
ORDER BY   ep_val;

 

 

    EP_VAL        POP       FREQ       CRDN        NDV     POPCNT    POPFREQ   TOPNFREQ
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
         1          1        548      11000        264         19      10750      10985
         2          1        561      11000        264         19      10750      10985
         3          1        600      11000        264         19      10750      10985
         4          1        567      11000        264         19      10750      10985
         5          1        548      11000        264         19      10750      10985
         6          1        566      11000        264         19      10750      10985
         7          1        546      11000        264         19      10750      10985
         8          1        531      11000        264         19      10750      10985
         9          1        539      11000        264         19      10750      10985
        10          1        622      11000        264         19      10750      10985
        11          1        556      11000        264         19      10750      10985
        12          1        556      11000        264         19      10750      10985
        13          1        545      11000        264         19      10750      10985
        14          1        577      11000        264         19      10750      10985
        15          1        547      11000        264         19      10750      10985
        16          1        573      11000        264         19      10750      10985
        17          1        583      11000        264         19      10750      10985
        18          1        571      11000        264         19      10750      10985
        19          1        614      11000        264         19      10750      10985
       761          0          1      11000        264         19      10750      10985
       762          0          1      11000        264         19      10750      10985
       763          0          1      11000        264         19      10750      10985
       764          0          1      11000        264         19      10750      10985
       765          0          1      11000        264         19      10750      10985
       766          0          1      11000        264         19      10750      10985
       767          0          1      11000        264         19      10750      10985
       768          0          1      11000        264         19      10750      10985
       769          0          1      11000        264         19      10750      10985
       ................................................................................
       ................................................................................
       993          0          1      11000        264         19      10750      10985
       994          0          1      11000        264         19      10750      10985
      1000          0          1      11000        264         19      10750      10985
      9991          0          1      11000        264         19      10750      10985
      9992          0          1      11000        264         19      10750      10985
      9993          0          1      11000        264         19      10750      10985
      9994          0          1      11000        264         19      10750      10985
      9995          0          1      11000        264         19      10750      10985
      9996          0          1      11000        264         19      10750      10985
      9997          0          1      11000        264         19      10750      10985
      9998          0          1      11000        264         19      10750      10985
      9999          0          1      11000        264         19      10750      10985
     10000          0          1      11000        264         19      10750      10985

264 rows selected.

You can see how the histogram data (in our case, one FH and one HH) describe our table data. The table has 11000 rows, and the histogram-based cardinality is also 11000. The NDV of the column is 266, while the histogram-based NDV is 264; the numbers are very close.
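Conceptually, Query 6 merges the per-partition (value, frequency) pairs into global pairs by summing the frequencies of matching values. A minimal Python sketch of that merge follows (the partition data here is hypothetical, and the sketch ignores the scaling of hybrid endpoint repeat counts that the real query performs):

```python
from collections import Counter

def merge_partition_histograms(*partition_histograms):
    """Each argument is a dict {value: frequency} recovered from one
    partition's histogram; the global pairs are the summed frequencies."""
    merged = Counter()
    for hist in partition_histograms:
        merged.update(hist)  # adds frequencies for values seen in both partitions
    return dict(merged)

# Hypothetical data: p1 mimics a frequency histogram,
# p2 mimics hybrid endpoints already scaled to row counts
p1 = {1: 540, 2: 530, 3: 525}
p2 = {1: 40, 2: 38, 761: 1, 762: 1}
g = merge_partition_histograms(p1, p2)
print(g[1], len(g))  # frequency of value 1 is 580; histogram-based ndv is 5
```

The GROUP BY ep_val ... SUM(freq) core of Query 6 performs exactly this aggregation, only over the dictionary views instead of Python dicts.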

 

Let's look at another set of numbers. According to the histogram data (Query 6), we have 264 distinct values, and the sum of the top-n frequent values is 10985. In addition, there are 19 popular values, and the sum of their frequencies is 10750. What percentage of all column values do the top-n frequent values represent? We can compare TOPNFREQ against p*num_notnull_rows(SALES): (10985 vs (1−1/254)*11000) => (10985 vs 10956.69). This clearly indicates that the top-n frequent values represent more than p=(1−1/N) of all column rows (values).
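The feasibility test above is easy to sketch in Python (an illustration of the p = 1 − 1/N rule described in the text, using the figures reported by Query 6):

```python
def top_frequency_feasible(topn_freq, num_notnull_rows, n_buckets=254):
    """A TFH is feasible when the top-n values cover more than
    p = (1 - 1/N) of the non-null column rows."""
    return topn_freq > (1 - 1 / n_buckets) * num_notnull_rows

# Figures from Query 6: TOPNFREQ = 10985 out of 11000 non-null rows
print(top_frequency_feasible(10985, 11000))  # True: 10985 > 10956.69
```

By contrast, the incorrectly computed topn_freq of 5161 fails this test, which is why the incremental path fell through to a hybrid histogram.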

 

So, again, a TFH should have been created; instead, Oracle Database created an HH, and the reason is the incorrect computation of the total frequency of the top-n values, which we have already seen in the trace file (topn_freq: 5161). This is an issue (bug) in the dbms_stats package implementation.

 

Let's explore how Oracle Database processes the frequent values and creates the HH. It must create 254 buckets (the default) while there are 264 distinct values. Therefore, it needs to identify the bucket size for grouping the values and then create the HH.

 

As you can see, Oracle Database determined the bucket size to be 1.0638, calculated from the histogram data provided by Query 6. The bucket size is the average number of rows per bucket. Recall that an HH combines the features of an FH and an HBH: the query optimizer uses the bucket size to determine selectivity for nonpopular column values. In fact, the bucket size explains the meaning of density (and selectivity): Density = bucket_size/crdn, where crdn is the cardinality of the column as determined from the histogram data (see Query 6). The bucket size itself is calculated as follows:

 

Bucket size=(crdn−popfreq)/(N−popcnt)

where N is the requested number of buckets (254 by default). Query 6 gives us all the information needed to calculate the bucket size. Thus

 

Bucket size=(crdn−popfreq)/(N−popcnt)=(11000−10750)/(254−19)= 1.0638.

As can be seen, the calculated bucket size matches the one reported in the dbms_stats trace file. To create an HH, Oracle Database then processes each row returned by Query 6 in turn.
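The calculation can be verified with a few lines of Python (an illustrative check of the formula above, using the figures returned by Query 6):

```python
def hybrid_bucket_size(crdn, popfreq, n_buckets, popcnt):
    """Bucket size = (crdn - popfreq) / (N - popcnt): the average number
    of rows per bucket once the popular values are set aside."""
    return (crdn - popfreq) / (n_buckets - popcnt)

# Figures from Query 6 and the dbms_stats trace file
bkt_size = hybrid_bucket_size(crdn=11000, popfreq=10750, n_buckets=254, popcnt=19)
density = bkt_size / 11000          # Density = bucket_size / crdn
print(round(bkt_size, 4))           # 1.0638, matching bktSize in the trace
```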

 

The maximum and minimum values of the histogram data also indicate the max and min values of the column, and they will be included in the HH. For each row, Oracle Database compares the size of the current bucket with the average bucket size. If the current bucket size is greater than the average bucket size, the value is stored in the HH as an endpoint value. The endpoint repeat count for the value is simply the "freq" column of Query 6.

 

At the same time, the bucket number (endpoint number) is produced and stored in the HH for each endpoint value. If a bucket is not stored in the HH, its row count (the current bucket size) carries over and is added to the size of the next bucket.

 

The process continues until the total bucket count reaches 254 (the number of requested buckets). As a result, Oracle Database created an HH from the existing partition-level histogram data. For the complete steps of histogram construction, refer to my "Histogram Construction in Oracle Database 12c" article.
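The accumulate-and-compare walk described above can be sketched roughly as follows (a simplified illustration, not the exact dbms_stats implementation: among other things, it omits the special handling that always stores the minimum and maximum values as endpoints):

```python
def build_hybrid_histogram(value_freqs, avg_bucket_size):
    """value_freqs: (value, frequency) pairs sorted by value.
    Returns (endpoint_number, endpoint_value, endpoint_repeat_count) rows.
    A value closes a bucket when the rows accumulated so far reach the
    average bucket size; otherwise its rows spill into the next bucket."""
    histogram = []
    current, cumulative = 0, 0
    for value, freq in value_freqs:
        current += freq       # size of the bucket being filled
        cumulative += freq    # endpoint number is the cumulative row count
        if current >= avg_bucket_size:
            histogram.append((cumulative, value, freq))
            current = 0       # start a new bucket
    return histogram

pairs = [(1, 5), (2, 1), (3, 1), (4, 6), (5, 2)]
print(build_hybrid_histogram(pairs, avg_bucket_size=3.0))
# [(5, 1, 5), (13, 4, 6)] -- values 1 and 4 close buckets; 2, 3, 5 spill over
```

Note how the repeat count stored for each endpoint is the value's own frequency, exactly as the "freq" column of Query 6 suggests.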

 

Thus, using the described method, Oracle Database created an HH and stored it in the dictionary. But in reality, a TFH should have been created. To prove that, let's set INCREMENTAL to FALSE and gather table statistics again; in this case, Oracle Database does not gather statistics incrementally.

 

execute  DBMS_STATS.SET_TABLE_PREFS(null, 'sales', 'INCREMENTAL', 'FALSE');
execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute  DBMS_STATS.gather_table_stats (null, 'SALES');

And now, let's check which types of histograms have been created for the table and its partitions:

 

select partition_name,
       num_distinct,
       histogram from user_part_col_statistics
where table_name='SALES'
and column_name= 'PROD_ID';

PARTITION_NAME   NUM_DISTINCT  HISTOGRAM
---------------- ------------  -----------
SALES_P1         29            FREQUENCY
SALES_P2         259           HYBRID

select  column_name, 
         num_distinct, 
         num_nulls,
         sample_size,
         histogram
 from user_tab_col_statistics
where table_name='SALES' AND column_name = 'PROD_ID';

COLUMN_NAME   NUM_DISTINCT  NUM_NULLS   SAMPLE_SIZE HISTOGRAM
------------- ------------  ----------  ----------- ---------------
PROD_ID       269           0           11000       TOP-FREQUENCY

As you can see, Oracle Database created a TFH for the column, which means our prediction was right. As mentioned above, when constructing histograms incrementally, Oracle Database does not calculate the top-n frequent values correctly, and that is why it decided to create an HH.

 

So we can see the problem in constructing histograms incrementally. When gathering statistics with INCREMENTAL=FALSE, Oracle Database does not report the exact total frequency of the top-n frequent values in the trace file (because, in our case, it decided to create a TFH rather than an HH). However, the ROWIDs (and frequencies) of the top-n frequent values are provided.

 

DBMS_STATS: Start gather_stats.. pfix:  ownname: SYS tabname: SALES pname:  
spname:  execution phase: 1
DBMS_STATS: Using approximate NDV pct=0
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  
COLNAME
DBMS_STATS:        Y    Y    Y    Y    Y    Y    Y                        
PROD_ID
DBMS_STATS:        Y         Y                                            
TIME_ID
DBMS_STATS:        Y    Y    Y                                            
AMOUNT
DBMS_STATS: Approximate NDV Options 
DBMS_STATS: TOPN,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,RWID,U254,U254,U254U
DBMS_STATS: select /*+  full(t)    no_parallel(t) no_parallel_index(t) 
dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
no_monitoring xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 
'false') no_substrb_pad  
*/to_char(count("PROD_ID")),substrb(dump(min("PROD_ID"),16,0,64),1,240),subst
rb(dump(max("PROD_ID"),16,0,64),1,240),to_char(count("TIME_ID")),substrb(dump
(min("TIME_ID"),16,0,64),1,240),substrb(dump(max("TIME_ID"),16,0,64),1,240),t
o_char(count("AMOUNT")),substrb(dump(min("AMOUNT"),16,0,64),1,240),substrb(du
mp(max("AMOUNT"),16,0,64),1,240),count(rowidtochar(rowid)) from "SYS"."SALES" 
t  /* TOPN,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,RWID,U254,U254,U254U*/

DBMS_STATS: Ending query at 14-FEB-18 03.20.53.066712000 PM +04:00
DBMS_STATS: Approximate NDV Result 
DBMS_STATS: <process_result>
  <select_list_item>
    <pos>0</pos>
    <value>269</value>
    <rowcnt>11000</rowcnt>
    <split>0</split>
    <ndv>269</ndv>
    <nonnulls>11000</nonnulls>
    <rsize>22246</rsize>
    <topncnt>254</topncnt>
    <topn_values>AAAY5CAABAAAbGPAAW,622,AAAY5CAABAAAbGPAAB,614,AAAY5CAABAAAbGPAAC
,600,AAAY5CAABAAAbGPAAO,583,AAAY5CAABAAAbGPAAI,577,AAAY5CAABAAAbGPAAV,573,AAA
Y5CAABAAAbGPAAA,571,AAAY5CAABAAAbGPAAM,567,AAAY5CAABAAAbGPABe,566,AAAY5CAABAA
AbGPAAS,561,AAAY5CAABAAAbGPAAp,556,AAAY5CAABAAAbGPAAf,556,AAAY5CAABAAAbGPAAJ,
548,AAAY5CAABAAAbGPAAT,548,AAAY5CAABAAAbGPAAN,547,AAAY5CAABAAAbGPAAG,546,AAAY
5CAABAAAbGPAAK,545,AAAY5CAABAAAbGPAAD,539,AAAY5CAABAAAbGPAAQ,531
.............................................................................
..........
DBMS_STATS: start processing top n values for column "PROD_ID"
DBMS_STATS: Parsing topn values.. 
DBMS_STATS:    Extracted 254 rowid-freq pairs.
DBMS_STATS: topn sql (len: 9002): 
DBMS_STATS: +++ select /*+  no_parallel(t) no_parallel_index(t) dbms_stats 
cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring 
xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') 
no_substrb_pad  */ substrb(dump("PROD_ID",16,0,64),1,240) val, 
                      rowidtochar(rowid) rwid from "SYS"."SALES" t where 
rowid in (chartorowid('AAAY5CAABAAAbGPAAA'),chartorowid('AAAY5CAABAAAbGPAAB'),chartoro
wid('AAAY5CAABAAAbGPAAC'),chartorowid('AAAY5CAABAAAbGPAAD'),chartorowid('AAAY
5CAABAAAbGPAAG'),chartorowid('AAAY5CAABAAAbGPAAI'),chartorowid('AAAY5CAABAAAb
GPAAJ'),chartorowid('AAAY5CAABAAAbGPAAK'),chartorowid('AAAY5CAABAAAbGPAAM'),c
hartorowid('AAAY5CAABAAAbGPAAN'),chartorowid('AAAY5CAABAAAbGPAAO'),chartorowi
d('AAAY5CAABAAAbGPAAQ'),chartorowid('AAAY5CAABAAAbGPAAS'),chartorowid('AAAY5C
AABAAAbGPAAT'),chartorowid('AAAY5CAABAAAbGPAAV'),chartorowid('AAAY5CAABAAAbGP
AAW'),chartorowid('AAAY5CAABAAAbGPAAf'),chartorowid('AAAY5CAABAAAbGPAAp'),cha
rtorowid('AAAY5CAABAAAbGPABe'),chartorowid('AAAY5CAABAAAbGPACJ'),chartorowid(
'AAAY5CAABAAAbGPACK'),chartorowid('AAAY5CAABAAAbGPACL'),chartorowid('AAAY5CAA
BAAAbGPACM'),chartorowid('AAAY5CAABAAAbGPACN'),chartorowid('AAAY5CAABAAAbGPAC
O'),chartorowid('AAAY5CAABAAAbGPACP'),chartorowid('AAAY5CAABAAAbGPACQ'),chart
orowid('AAAY5CAABAAAbGPACR'),chartorowid('AAAY5CAABAAAbGPACS'),chartorowid('A
AAY5DAABAAAaPuAA/'),chartorowid('AAAY5DAABAAAaPuAA1'),chartorowid('AAAY5DAABA
AAaPuAA2'),chartorowid('AAAY5DAABAAAaPuAA3'),chartorowid('AAAY5DAABAAAaPuAA4'
),chartorowid('AAAY5DAABAAAaPuAA5'),chartorowid('AAAY5DAABAAAaPuAA6'),chartor
owid('AAAY5DAABAAAaPuAA7')...................................................
..........................

The provided ROWIDs allow us to compute the exact top-n frequent values:

 

select /*+ parallel(4) */ prod_id,
         count(*),
         sum(count(*)) over() topn
   FROM   sales t
   where prod_id in(
select                prod_id
                      from sales t where rowid in ( 
chartorowid('AAAY5CAABAAAbGPAAA'),chartorowid('AAAY5CAABAAAbGPAAB'),
                      chartorowid('AAAY5CAABAAAbGPAAC'),chartorowid('AAAY5CAABAAAbGPAAD'),chartorow
id('AAAY5CAABAAAbGPAAG'),chartorowid('AAAY5CAABAAAbGPAAI'),
                      chartorowid('AAAY5CAABAAAbGPAAJ'),chartorowid('AAAY5CAABAAAbGPAAK'),chartorow
id('AAAY5CAABAAAbGPAAM'),chartorowid('AAAY5CAABAAAbGPAAN'),
.............................................................................
chartorowid('AAAY5DAABAAAaPuAEK'),chartorowid('AAAY5DAABAAAaPuAEL'),chartorow
id('AAAY5DAABAAAaPuAEM'),chartorowid('AAAY5DAABAAAaPuAEN'))
                 )
                 group by prod_id
                 order by prod_id;

   PROD_ID   COUNT(*)       TOPN
---------- ---------- ----------
         1        548      10985
         2        561      10985
         3        600      10985
         4        567      10985
         5        548      10985
         6        566      10985
         7        546      10985
         8        531      10985
         9        539      10985
        10        622      10985
        11        556      10985
        12        556      10985
        13        545      10985
        14        577      10985
        15        547      10985
        16        573      10985
        17        583      10985
        18        571      10985
        19        614      10985
       761          1      10985
       763          1      10985
       764          1      10985
       765          1      10985
       766          1      10985
       767          1      10985
       768          1      10985
................................
       996          1      10985
       997          1      10985
       998          1      10985
       999          1      10985
      1000          1      10985
      9991          1      10985
      9992          1      10985
      9993          1      10985
      9994          1      10985
      9995          1      10985
      9996          1      10985
      9997          1      10985
      9998          1      10985
      9999          1      10985
     10000          1      10985

254 rows selected.

We also calculated the total frequency of the top-n values as 10985 using Query 6. If we set INCREMENTAL=TRUE and gather statistics again, we will see that Oracle Database again creates an HH at the table level.

 

To find out why Oracle Database computes the top-n frequent values incorrectly, I traced the statistics gathering and found the SQL statement shown in Query 7.

 

Query 7:

 

SELECT /*+ OPT_PARAM('_parallel_syspls_obey_force' 'false') */ 
     ENDPOINT,POP ,EPVALUE, FREQ, EP_INT_WIDTH, 
          NDV, CDN, 
          (SUM(POP) OVER()) POPCNT,
          (SUM(POP*FREQ) OVER()) POPFREQ,
          (SUM(CASE WHEN RNK <= 254 THEN FREQ ELSE 0 END) OVER ()) TOPNFREQ,
          (MIN(ENDPOINT) OVER()) MIN_EP,rnk,
          (MIN(CASE WHEN RNK <= 254 THEN FREQ ELSE 9999999999999 END) 
            OVER ()) MIN_FREQ
   
   FROM (SELECT ENDPOINT, EPVALUE, FREQ, EP_INT_WIDTH, (COUNT(*) OVER ()) NDV, 
                (SUM(EP_INT_WIDTH) OVER ()) CDN,
                ROW_NUMBER() OVER (ORDER BY FREQ ) RNK,
                (CASE WHEN FREQ > (SUM(EP_INT_WIDTH) OVER ())/254 
                      THEN 1 ELSE 0 END) POP
       
       FROM (SELECT ENDPOINT, EPVALUE, SUM(NUMROWS) FREQ, 
                    SUM(INTERVAL_WIDTH) EP_INT_WIDTH
            
            FROM (SELECT CASE WHEN EP_REPEAT_COUNT > 0 THEN EP_REPEAT_COUNT
                             ELSE (HIST.BUCKET - LAG(HIST.BUCKET,1,0) OVER 
                              (PARTITION BY HIST.OBJ# ORDER BY HIST.BUCKET))
                         END *
                         
                         CASE WHEN ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET)
                                    OVER (PARTITION BY HIST.OBJ#)) < 1.1 THEN 1
                         ELSE ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET) OVER 
                                 (PARTITION BY HIST.OBJ#)) END AS NUMROWS,
                         (HIST.BUCKET - LAG(HIST.BUCKET,1,0) OVER 
                              (PARTITION BY HIST.OBJ# ORDER BY HIST.BUCKET)) *
                           
                           CASE WHEN ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET)
                                    OVER (PARTITION BY HIST.OBJ#)) < 1.1 THEN 1
                           ELSE ((T.ROWCNT - H.NULL_CNT)/ MAX(HIST.BUCKET) OVER 
                                (PARTITION BY HIST.OBJ#)) END AS INTERVAL_WIDTH,
                        HIST.ENDPOINT, 
                        CASE WHEN HIST.EPVALUE IS NOT NULL THEN
                                  UTL_RAW.CAST_TO_RAW(HIST.EPVALUE)
                             ELSE HIST.EPVALUE_RAW END EPVALUE           
                             
                             
                FROM   "_HISTGRM_DEC" HIST, HIST_HEAD$ H,
                       (SELECT OBJ#, ROWCNT FROM TABPART$ TP WHERE BO# = 101953
                        UNION ALL
                        SELECT OBJ#, ROWCNT FROM TABCOMPART$ TCP 
                         WHERE BO# = 101953) T
                WHERE T.OBJ# = HIST.OBJ# 
                  AND H.OBJ# = HIST.OBJ#
                  AND H.INTCOL# = HIST.INTCOL#
                  AND HIST.INTCOL# = 1
                  AND HIST.BUCKET <> 0)
   GROUP BY ENDPOINT, EPVALUE
         )
      )
   ORDER BY ENDPOINT, EPVALUE

As you can see, the query contains a ROW_NUMBER() OVER (ORDER BY FREQ) RNK clause, but it must be ROW_NUMBER() OVER (ORDER BY FREQ DESC) RNK to compute the top-n frequent values. Because the sort is ascending, the RNK <= 254 filter sums the 254 least frequent values instead of the most frequent ones, and this is why Oracle Database does the computation incorrectly (topn_freq: 5161 instead of 10985).
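The effect of the missing DESC is easy to reproduce (an illustration of the ranking mistake in Python, not Oracle's code):

```python
def topn_frequency(freqs, n, descending=True):
    """Sum the frequencies of the top-n values.
    descending=False mimics the ORDER BY FREQ (no DESC) bug: it sums
    the n least frequent values instead of the most frequent ones."""
    ranked = sorted(freqs, reverse=descending)
    return sum(ranked[:n])

# A hypothetical mix of popular and rare value frequencies
freqs = [548, 561, 600, 3, 2, 1, 1]
print(topn_frequency(freqs, 3))                    # 1709: the real top-3
print(topn_frequency(freqs, 3, descending=False))  # 4: the buggy "top-3"
```

With mostly-popular data such as our SALES column, the buggy ascending rank drastically understates the top-n coverage, so the TFH feasibility test fails and an HH is built instead.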

 

What can we say about incremental maintenance of height-balanced histograms (HBHs)? It seems that in Oracle Database 12c, HBHs are not maintained incrementally. Even if we set INCREMENTAL=TRUE for the table, Oracle Database sets incremental: FALSE for the entire statistics-gathering process and does not construct the HBH incrementally. We can see that in the trace file:

 

execute  DBMS_STATS.SET_TABLE_PREFS(null, 'sales', 'INCREMENTAL', 'TRUE');
execute  dbms_stats.set_global_prefs('trace',to_char(2048+32768+4+16)) ;
execute  DBMS_STATS.gather_table_stats (null, tabname => 'SALES', 
estimate_percent=> 100); -- Also tested with other values of estimate percents

select  column_name, 
         num_distinct,
         histogram
 from user_tab_col_statistics
where table_name='SALES' AND column_name = 'PROD_ID';

COLUMN_NAME   NUM_DISTINCT  HISTOGRAM
------------- ------------ ---------------
PROD_ID       269          HEIGHT BALANCED

DBMS_STATS: Specified granularity = AUTO, Fixed granularity = GLOBAL AND PARTITION
DBMS_STATS: parameters ** pfix:  (parent pfix: ) ownname:  tabname: SALES 
partname:  estimate_percent: 100 block_sample: FALSE method_opt: Z degree: 
32766 granularity: Z gIdxGran:  cascade: NULL stattab:  statid:  statown:  
no_invalidate: NULL flush_colu: TRUE fxt:  stattype: DATA start_time: 02-14-
2018 15:50:53 gathering_group_stats: FALSE cms_only: FALSE force: FALSE 
options: Z executionPhase: SERIAL (NON-CONCURRENT) incremental: FALSE 
concurrent: FALSE loc_degree:  loc_method_opt: FOR ALL COLUMNS SIZE AUTO 
fxt_typ: 0 tobjn: 101953
DBMS_STATS: Preferences for table SYS.SALES
DBMS_STATS: ================================================================================
DBMS_STATS: SKIP_TIME                                         - 
DBMS_STATS: STATS_RETENTION                                   - 
DBMS_STATS: MON_MODS_ALL_UPD_TIME                             - 
DBMS_STATS: SNAPSHOT_UPD_TIME                                 - 
DBMS_STATS: TRACE                                             - 34836
DBMS_STATS: DEBUG                                             - 0
DBMS_STATS: SYS_FLAGS                                         - 1
DBMS_STATS: SPD_RETENTION_WEEKS                               - 53
DBMS_STATS: CASCADE                                           - DBMS_STATS.AUTO_CASCADE
DBMS_STATS: ESTIMATE_PERCENT                                  - DBMS_STATS.AUTO_SAMPLE_SIZE
DBMS_STATS: DEGREE                                            - NULL
DBMS_STATS: METHOD_OPT                                        - FOR ALL COLUMNS SIZE AUTO
DBMS_STATS: NO_INVALIDATE                                     - DBMS_STATS.AUTO_INVALIDATE
DBMS_STATS: GRANULARITY                                       - AUTO
DBMS_STATS: PUBLISH                                           - TRUE
DBMS_STATS: STALE_PERCENT                                     - 10
DBMS_STATS: APPROXIMATE_NDV                                   - TRUE
DBMS_STATS: APPROXIMATE_NDV_ALGORITHM                         - HYPERLOGLOG
DBMS_STATS: ANDV_ALGO_INTERNAL_OBSERVE                        - FALSE
DBMS_STATS: INCREMENTAL                                       - TRUE
DBMS_STATS: INCREMENTAL_INTERNAL_CONTROL                      - TRUE
DBMS_STATS: AUTOSTATS_TARGET                                  - AUTO
DBMS_STATS: CONCURRENT                                        - OFF
DBMS_STATS: JOB_OVERHEAD_PERC                                 - 1
DBMS_STATS: JOB_OVERHEAD                                      - -1
DBMS_STATS: GLOBAL_TEMP_TABLE_STATS                           - SESSION
DBMS_STATS: ENABLE_TOP_FREQ_HISTOGRAMS                        - 3
DBMS_STATS: ENABLE_HYBRID_HISTOGRAMS                          - 3
DBMS_STATS: TABLE_CACHED_BLOCKS                               - 1
DBMS_STATS: INCREMENTAL_LEVEL                                 - PARTITION
DBMS_STATS: INCREMENTAL_STALENESS                             - ALLOW_MIXED_FORMAT
DBMS_STATS: OPTIONS                                           - GATHER
DBMS_STATS: GATHER_AUTO                                       - AFTER_LOAD
DBMS_STATS: STAT_CATEGORY                                     - OBJECT_STATS
DBMS_STATS: SCAN_RATE                                         - 0
DBMS_STATS: GATHER_SCAN_RATE                                  - HADOOP_ONLY
DBMS_STATS: PREFERENCE_OVERRIDES_PARAMETER                    - FALSE
DBMS_STATS: AUTO_STAT_EXTENSIONS                              - OFF
DBMS_STATS: WAIT_TIME_TO_UPDATE_STATS                         - 15

As you can see, although the INCREMENTAL preference of the table is TRUE, Oracle Database set incremental to FALSE for the process and traced it as executionPhase: SERIAL (NON-CONCURRENT) incremental: FALSE. Of course, an HBH also could be created from partition-level histogram data; I do not think there is a restriction against doing this, because in Oracle Database 11g, HBHs are updated incrementally. But to maintain statistics incrementally, the database must first determine the global NDV, and only after that can it process further. Also, during the NDV approximation, the database identifies the top-n values, and approximation of the global NDV requires setting ESTIMATE_PERCENT = DBMS_STATS.AUTO_SAMPLE_SIZE. That is why Oracle Database makes the decision based on the value of the estimate_percent parameter, for example:

 

IF ESTIMATE_PERCENT <> DBMS_STATS.AUTO_SAMPLE_SIZE then INCREMENTAL=FALSE.

So, this means that DBMS_STATS.AUTO_SAMPLE_SIZE allows and triggers Oracle Database to gather basic column and histogram statistics incrementally. As is known, in Oracle Database 12c, an HBH is created only if you explicitly specify a value for ESTIMATE_PERCENT different from DBMS_STATS.AUTO_SAMPLE_SIZE. Therefore, an HBH is not maintained incrementally. In Oracle Database 11g, an HBH is maintained incrementally; however, it seems that is not allowed in Oracle Database 12c.

 

In addition, it is clear that incremental gathering of statistics is triggered by ESTIMATE_PERCENT = DBMS_STATS.AUTO_SAMPLE_SIZE. If there are HBHs for a column of a partition of a given table (which is possible only if statistics were gathered with ESTIMATE_PERCENT <> DBMS_STATS.AUTO_SAMPLE_SIZE), then when we gather statistics with DBMS_STATS.AUTO_SAMPLE_SIZE for the partitioned table, all old histograms (at both the local and global levels) will be deleted and new ones will be created.

 

Conclusion

 

In summary, we saw how Oracle Database constructs global-level histograms by using existing partition-level histograms. Synopsis data is not used to construct global histograms directly; it is used for NDV approximation. Global NDV and partition-level histogram data are enough to describe the table-level data.

 

Oracle Database first identifies the global NDV for a given column and then tries to extract the top-n frequent values by scanning partition-level histogram data. It performs special queries (Query 1, 2, 3, 4, and 6), as described above.

 

If the global NDV <= N (where N is the number of requested buckets and the default is 254), the global histogram will be a frequency histogram. The top-n values and their frequencies can be determined via Query 4.
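This merge step can be sketched in a few lines (an illustrative model, not Oracle's implementation; the function name and sample data are hypothetical): group the partition-level (value, frequency) pairs by value and sum the frequencies, as the symbolic Query 1 does.

```python
# Building a global frequency histogram from partition-level frequency
# histograms: per-value sum of the partition frequencies.
from collections import Counter

def merge_frequency_histograms(*partition_histograms):
    """Each histogram is a dict {value: frequency}; the global histogram
    is the per-value sum of the partition frequencies."""
    merged = Counter()
    for h in partition_histograms:
        merged.update(h)  # Counter.update adds frequencies for shared values
    return dict(merged)

p1 = {10: 100, 20: 50, 30: 5}   # frequency histogram of partition P1
p2 = {20: 70, 30: 10, 40: 3}    # frequency histogram of partition P2
glob = merge_frequency_histograms(p1, p2)
print(glob)       # -> {10: 100, 20: 120, 30: 15, 40: 3}
print(len(glob))  # global NDV = 4; if NDV <= N, the result is a valid FH
```

If the merged NDV fits within the requested bucket count N, this merged structure is exactly the global frequency histogram; no re-sampling of the table is required.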

 

If NDV > N, then after identifying the top-n frequent values, the database decides which type of histogram should be created as shown in Figure 1. If the TOPNFREQ>=p*num_rows condition is true, a top frequency histogram will be created; otherwise, Oracle Database will create a global hybrid histogram.
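The decision flow above can be summarized in a short sketch (assumptions: the function name is hypothetical, and p stands for the internal threshold fraction from Figure 1, not a documented Oracle parameter):

```python
# Sketch of the histogram-type decision described in the text and Figure 1.
def choose_global_histogram(ndv, n_buckets, topn_freq, num_rows, p):
    """Return the histogram type built at the global level:
    - NDV <= N                     -> frequency histogram
    - NDV > N, TOPNFREQ >= p*rows  -> top frequency histogram
    - otherwise                    -> hybrid histogram
    """
    if ndv <= n_buckets:
        return "FREQUENCY"
    if topn_freq >= p * num_rows:
        return "TOP FREQUENCY"
    return "HYBRID"

print(choose_global_histogram(200, 254, 0, 10000, 0.99))     # -> FREQUENCY
print(choose_global_histogram(500, 254, 9950, 10000, 0.99))  # -> TOP FREQUENCY
print(choose_global_histogram(500, 254, 6000, 10000, 0.99))  # -> HYBRID
```

In other words, a top frequency histogram is worthwhile only when the top-n values cover nearly all rows; otherwise the remaining values carry too much weight and a hybrid histogram is built instead.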

 

Also, we saw that Oracle Database does not identify the top-n frequent values correctly when creating a global histogram. There is a problem in the implementation of the dbms_stats package.

 

Moreover, we saw that height-balanced histograms are not maintained incrementally, and the reason is the use of DBMS_STATS.AUTO_SAMPLE_SIZE. Height-balanced histograms are created only when you explicitly specify a value for ESTIMATE_PERCENT that is different from AUTO_SAMPLE_SIZE, but the database considers incremental maintenance of statistics only when ESTIMATE_PERCENT = DBMS_STATS.AUTO_SAMPLE_SIZE is set.

 

It is worth mentioning that after creating a global histogram, even if no DML operations have been performed, if we try to gather statistics again, Oracle Database will refresh the global statistics regardless of statistics staleness. I think this is not a problem, because incrementally updating histograms is not a resource-intensive process. Finally, the fact that a hybrid histogram can be updated incrementally from existing histogram data supports the idea that it is possible to construct a hybrid histogram without sampling.

 

About the Author

 

Chinar Aliyev graduated from the applied mathematics department of Baku State University. He has been working with Oracle Database for more than 10 years (since Oracle 8i) and specializes in tuning and troubleshooting Oracle Database performance problems. His areas of interest include SQL optimization, optimizer internals, parallel execution, and database design. He is an Oracle Database 11g Certified Master and an Oracle Database 11g Performance Tuning Certified Expert. He is also a regular presenter at international and local conferences such as Hotsos and Oracle Database user group events.