TECHNOLOGY: Ask Tom
On History, Invisibility, and Moving
By Tom Kyte
Our technologist counts on histograms, proves hash tables by example, and moves CLOBs.
I have a question about using histograms with character data. I’m trying to improve performance on a set of very large tables. My table T1 has more than 120 million rows, is not partitioned, and is representative of the other tables in my schema. Over time, the data in a couple of key varchar2 columns has become very skewed.
I’ve gathered histograms on this information and have been pleased with the results. The optimizer is better able to gauge the estimated cardinalities during optimization and has been coming up with better plans.
I’m concerned about the future, however. I’ve heard that histograms gathered against varchar2 columns have limitations. Specifically, I’ve heard that only the first few characters are considered when the histograms are being generated, so that as my varchar2 data gets longer and longer, I might see a decrease in the efficiency of my histograms. Is this true, and what are the limits?
Before I address the question, let’s first define what a histogram is. A histogram—in the context of statistics—contains information that gives the optimizer a clear picture of the data in a column. In many cases, the histogram on a column will tell the optimizer precisely how many rows in the table have a certain value, so that when the optimizer parses a query in the form “where column_with_histogram = 'some value',” the optimizer will be able to very accurately estimate how many rows that predicate will return. When the column with a histogram generated against it has fewer than 255 distinct values, the histogram will contain a complete picture of the values in the column. When the column has more than 254 distinct values, the histogram will become less accurate but will still paint a good picture of the data in the column. This limit exists because the database stores at most 254 rows of information for a single histogram.
That aside, yes, you are correct that for varchar2 data, only so many bytes of information are stored in the histogram for each of the 254 values. That is, the leading edge of the varchar2 string, not necessarily the entire string, is stored in the histogram. I stress the word bytes here because in many cases, bytes are not the same as characters in a varchar2 string. If you are using a multibyte character set, it is quite possible that 32 bytes is able to store far fewer than 32 characters.
I can show you this 32-byte limit rather easily by running a small test. First I’ll create a table containing various varchar2 strings. These strings will be of different fixed lengths, and their leading edge will be constant. I’ll create four columns, containing 30, 31, 32, and 33 ‘x’ characters respectively, each followed by a single character in the range A–L. This will create strings of 31, 32, 33, and 34 bytes in length (I am using a single-byte character set, so I know that a byte is the same as a character in my system).
SQL> create table t
     as
     select
        rpad( 'x', 30, 'x' ) ||
        chr( ascii('A') +
             case when rownum < 20000
                  then 0
                  else mod(rownum,12)
             end ) len30,
        rpad( 'x', 31, 'x' ) ||
        chr( ascii('A') +
             case when rownum < 20000
                  then 0
                  else mod(rownum,12)
             end ) len31,
        rpad( 'x', 32, 'x' ) ||
        chr( ascii('A') +
             case when rownum < 20000
                  then 0
                  else mod(rownum,12)
             end ) len32,
        rpad( 'x', 33, 'x' ) ||
        chr( ascii('A') +
             case when rownum < 20000
                  then 0
                  else mod(rownum,12)
             end ) len33
     from all_objects
     /

Table created.
As I’ve generated the data, I’ve also introduced some skew to it. If you look at the function for assigning the last letter—after all of the ‘x’s—you’ll see that the first 20,000 rows (out of about 72,000) will have the letter ‘A’ assigned to them. After the first 20,000 rows, I’ll get an even distribution of A–L values. That means that strings that end in ‘A’ happen often and that strings that end in B–L do not happen often.
Next I’ll gather statistics and ensure that I have histograms that provide a perfect picture of the data:
SQL> begin
        dbms_stats.gather_table_stats
        ( user, 'T',
          method_opt       => 'for all columns size 12',
          estimate_percent => 100
        );
     end;
     /

PL/SQL procedure successfully completed.
The method_opt parameter here is key for this example, and I used it to gather histograms for all the columns in the table. The resulting histograms will contain 12 rows of information for each column—for each of the distinct A–L values I loaded. Because I have 12 distinct values in each column, this will provide a perfect picture of the count of rows for each value.
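If you want to confirm what was gathered, you can query the data dictionary. The following is a minimal sketch using the standard USER_TAB_HISTOGRAMS view; after the gather above, I would expect 12 rows of endpoint information for each of the four columns:

```sql
-- Count the histogram endpoints recorded for each column of T.
select column_name, count(*) buckets
  from user_tab_histograms
 where table_name = 'T'
 group by column_name
 order by column_name;
```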
Now I can use explain plan on a simple query to see how many rows the optimizer believes would be returned, given a predicate on each of the columns, as shown in Listing 1.
Code Listing 1: Checking the optimizer’s use of histograms on “long” columns
SQL> set autotrace on explain
SQL> select 'len30', count(*) from t where len30 = rpad('x',30,'x')||'A'
     union all
     select 'len30', count(*) from t where len30 = rpad('x',30,'x')||'B'
     union all
     select 'len31', count(*) from t where len31 = rpad('x',31,'x')||'A'
     union all
     select 'len31', count(*) from t where len31 = rpad('x',31,'x')||'B'
     union all
     select 'len32', count(*) from t where len32 = rpad('x',32,'x')||'A'
     union all
     select 'len32', count(*) from t where len32 = rpad('x',32,'x')||'B'
     union all
     select 'len33', count(*) from t where len33 = rpad('x',33,'x')||'A'
     union all
     select 'len33', count(*) from t where len33 = rpad('x',33,'x')||'B'
     /

'LEN3   COUNT(*)
-----  ---------
len30      24307
len30       4308
len31      24307
len31       4308
len32      24307
len32       4308
len33      24307
len33       4308

8 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2994643399

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     8 |   268 |  3079  (88)| 00:00:37 |
|   1 |  UNION-ALL          |      |       |       |            |          |
|   2 |   SORT AGGREGATE    |      |     1 |    32 |            |          |
|*  3 |    TABLE ACCESS FULL| T    | 24307 |   759K|   385   (1)| 00:00:05 |
|   4 |   SORT AGGREGATE    |      |     1 |    32 |            |          |
|*  5 |    TABLE ACCESS FULL| T    |  4308 |   134K|   385   (1)| 00:00:05 |
|   6 |   SORT AGGREGATE    |      |     1 |    33 |            |          |
|*  7 |    TABLE ACCESS FULL| T    | 24307 |   783K|   385   (1)| 00:00:05 |
|   8 |   SORT AGGREGATE    |      |     1 |    33 |            |          |
|*  9 |    TABLE ACCESS FULL| T    |  4308 |   138K|   385   (1)| 00:00:05 |
|  10 |   SORT AGGREGATE    |      |     1 |    34 |            |          |
|* 11 |    TABLE ACCESS FULL| T    | 71696 |  2380K|   385   (1)| 00:00:05 |
|  12 |   SORT AGGREGATE    |      |     1 |    34 |            |          |
|* 13 |    TABLE ACCESS FULL| T    | 71696 |  2380K|   385   (1)| 00:00:05 |
|  14 |   SORT AGGREGATE    |      |     1 |    35 |            |          |
|* 15 |    TABLE ACCESS FULL| T    | 71696 |  2450K|   385   (1)| 00:00:05 |
|  16 |   SORT AGGREGATE    |      |     1 |    35 |            |          |
|* 17 |    TABLE ACCESS FULL| T    | 71696 |  2450K|   385   (1)| 00:00:05 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("LEN30"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxA')
   5 - filter("LEN30"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxB')
   7 - filter("LEN31"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxA')
   9 - filter("LEN31"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxB')
  11 - filter("LEN32"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxA')
  13 - filter("LEN32"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxB')
  15 - filter("LEN33"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxA')
  17 - filter("LEN33"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxB')
As you can see, the optimizer is able to very accurately estimate the cardinality of each predicate as long as the string is 32 bytes or less. So in steps 3, 5, 7, and 9 in Listing 1, the estimated cardinality is very exact: the estimate matches the actual count(*) values. However, when the optimizer gets to step 11, the estimated cardinality goes awry. All of a sudden, the optimizer starts guessing “every row will be returned,” regardless of what the string ends in. This is because the histogram contains only the first 32 bytes of string data, and when the optimizer gets to steps 11 through 17, the strings all start with 32 or more ‘x’s. In the histogram, they all appear to have the same value.
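You can also see the truncation directly in the dictionary. As a sketch, the ENDPOINT_ACTUAL_VALUE column of USER_TAB_HISTOGRAMS shows the leading edge the database stored for each endpoint (whether it is populated varies by release and by the data). I would expect LEN30 to show 12 distinct values ending in A through L, whereas every LEN33 endpoint would show the same 32-byte prefix of ‘x’s:

```sql
-- Compare the stored leading edges for a short and a long column.
select column_name,
       endpoint_number,
       endpoint_actual_value
  from user_tab_histograms
 where table_name = 'T'
   and column_name in ('LEN30','LEN33')
 order by column_name, endpoint_number;
```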
This does not render histograms useless. They are very useful on most data you tend to search on: numbers, dates, and strings less than 33 bytes in length. If you have longer strings, you will typically be performing a text search on them with Oracle Text and the CONTAINS operator.
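For completeness, here is a minimal sketch of that kind of text search. It assumes Oracle Text is installed, and the DOC table with its ID and TXT columns is a hypothetical example of mine, not something from the question:

```sql
-- Hypothetical table DOC(id, txt) with an Oracle Text CONTEXT index:
create index doc_txt_idx on doc(txt)
  indextype is ctxsys.context;

-- CONTAINS returns a score greater than zero for matching rows:
select id
  from doc
 where contains( txt, 'histogram' ) > 0;
```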
You should be aware of this histogram implementation, however, because it could prompt you to store some data attributes differently. For example, suppose you have a system that stores URLs in a table (an audit trail, say). If you look at a URL such as the ones you see on asktom.oracle.com, you’ll discover that the first 39 characters of my URL are constant, regardless of what page you end up on in my site. Having a histogram on that data would not be very useful.
However, what if I stored the data in a more meaningful way, such as storing the URL in a series of fields—specifically, using the fields PROTOCOL, HOST, PORT, and PATH? In this case, the PROTOCOL field would contain values such as HTTP or HTTPS, the HOST field would store asktom.oracle.com (and other host names), the PORT field would contain one of the standard HTTP port numbers, and the PATH field would be the unique bit. You retain the ability to render the original URL in its entirety, so you haven’t lost any data. And you gain the ability to provide even better searching capabilities, because you have the components all split out. You also gain the ability to compress this data in the database very easily. (Oracle’s basic and advanced compression for tables works by removing repetitive attributes in a block. PROTOCOL, HOST, and PORT values would all repeat often.) And you would now have an attribute (PATH) that would be able to support a histogram nicely.
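As a sketch of what that might look like (the table and column widths here are hypothetical, chosen only for illustration):

```sql
create table url_audit
( protocol varchar2(8),    -- 'http', 'https', ... : highly repetitive
  host     varchar2(255),  -- 'asktom.oracle.com', ... : repetitive
  port     number,         -- 80, 443, ... : repetitive
  path     varchar2(4000)  -- the unique bit; a good histogram candidate
);
```

The original URL is then easy to reassemble, for example as protocol || '://' || host || ':' || port || path.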
This is just another case where knowing how things work enables you to implement a solution more efficiently or at least justify why you have chosen to do something in a particular fashion.
I encourage you to take a look at what a colleague of mine, Jonathan Lewis, has written on this subject recently as well. See the Next Steps box at the end of this column for links to his excellent series of articles on histograms and the 32-byte limit.
Scalar Subquery Caching
Reading the posting at bit.ly/cSqDc6, I wondered about this part of Jonathan Lewis’ reply: “If so, the cache is probably a hash table of 256 rows in 8i and 9i, and 1,024 rows in 10g.” What is this hash table? Where can I see it in Oracle Database?
For readers who might not be able to see the posting in question: it was an article referring to an internal optimization called scalar subquery caching implemented in the database. If you execute a scalar subquery in the context of a larger query, the database will attempt to cache the output of that subquery in hopes of reusing it over and over again during the larger query’s execution rather than having to re-evaluate the subquery many times.
In answer to the question, however, you cannot “see” the hash table anywhere. It is an internal data structure that lives in your session memory for the duration of the query. Once the query is finished, it goes away. It is a cache associated with your query—nothing more, nothing less.
You can, however, “see” the hash table in action by measuring how many times your function is called. In Listing 2, I start by creating a function that counts how many times it has been called and stores the results in CLIENT_INFO in V$SESSION.
Code Listing 2: Function f counts calls and stores information
SQL> create or replace function f( x in varchar2 ) return number
     as
     begin
        dbms_application_info.set_client_info( userenv('client_info')+1 );
        return length(x);
     end;
     /

Function created.
Now, in Listing 3, I can call that function from a SQL query and see how much CPU the query consumed and how many times the function was called.
Code Listing 3: Calling function f and measuring CPU consumed
SQL> exec :cpu := dbms_utility.get_cpu_time;
SQL> exec dbms_application_info.set_client_info(0);
SQL> set autotrace traceonly statistics;
SQL> select owner, f(owner) from stage;

71653 rows selected.

Statistics
----------------------------------------------------------
...
71653 rows processed

SQL> set autotrace off
SQL> select dbms_utility.get_cpu_time-:cpu cpu_hsecs,
            userenv('client_info')
       from dual;

 CPU_HSECS  USERENV('CLIENT_INFO')
----------  ----------------------
       164  71653
The STAGE table is a copy of ALL_OBJECTS, and, as you can see in the Listing 3 statistics, function f was called once per row (71,653 times). And the function processing used 1.64 CPU seconds.
Now I simply convert the PL/SQL function call into a scalar subquery and rerun the example, as shown in Listing 4. This time the function was called only 69 times, and the query consumed only 0.25 CPU seconds. The effect of scalar subquery caching is now apparent.
Code Listing 4: Replacing function f with a scalar subquery
SQL> exec :cpu := dbms_utility.get_cpu_time;
SQL> exec dbms_application_info.set_client_info(0);
SQL> set autotrace traceonly statistics;
SQL> select owner, (select f(owner) from dual) f from stage;

71653 rows selected.

Statistics
----------------------------------------------------------
71653 rows processed

SQL> set autotrace off
SQL> select dbms_utility.get_cpu_time-:cpu cpu_hsecs,
            userenv('client_info')
       from dual;

 CPU_HSECS  USERENV('CLIENT_INFO')
----------  ----------------------
        25  69
Direct Access to Partition
We have set up some list-partitioned tables. My developers want to run direct statements (inserts, updates, selects) against partitions, not against the tables. For example, they would like to use this:
INSERT INTO status_delta_stg PARTITION (status_delta_stg_company1) ...

UPDATE status_fact PARTITION (job_status_fact_company1) ...
Are there any benefits of doing it this way? (The optimizer should be able to do partition elimination if we provide the field we partitioned on in the queries.)
My main problem with this approach is that I lose the ability to do partition exchanges, drops, merges, and so on without having an impact on their code (if I decide to use a different partition name).
The only time I can see a benefit would be during a direct path load. Direct path loads—using INSERT /*+ APPEND */, for example—lock the entire segment they are targeting, so if you do
insert /*+ APPEND */ into t select ...
the entire table T will get locked until you commit. If, however, you are loading data for just one partition and you specify the partition
insert /*+ APPEND */ into t partition (p1) select ...
only partition P1 will get locked. Note that this insert will fail, of course, if you attempt to load any data that doesn’t actually belong in partition P1.
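As a sketch, with a hypothetical list-partitioned table of my own invention (ORA-14401, "inserted partition key is outside specified partition," is the error I would expect when a row does not belong in the named partition):

```sql
create table t2
( company varchar2(10),
  payload varchar2(40) )
partition by list (company)
( partition p1 values ('A'),
  partition p2 values ('B') );

-- A direct path load that locks only partition P1:
insert /*+ APPEND */ into t2 partition (p1)
select 'A', object_name from all_objects where rownum <= 100;

-- This row belongs in P2, so naming P1 should raise ORA-14401:
insert into t2 partition (p1) values ('B', 'wrong partition');
```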
Otherwise, running statements against partitions generally would not help your developers improve performance. (Note that the inserts, updates, and deletes will figure out what partitions they must hit and will hit only those partitions.)
In addition to the negatives of running statements against partitions you’ve pointed out, there’s at least one more issue I foresee. I can easily envision developers writing a WHERE clause they believe resolves to a certain partition, adding the PARTITION clause to the DML so it hits only that one partition, and then discovering that the WHERE clause actually targeted other partitions as well. At that point they have caused logical data corruption on a huge scale.
Unless your developers have a specific technical, sound, and logical reason, I would not want this in the code at all.
Better with CLOB
We are currently upgrading from Oracle9i Database to Oracle Database 10g Release 2. One of our Oracle9i Database scripts moved data from one table to another when certain conditions applied. The script that copied the data in Oracle9i Database is as follows:
copy from username/password@dbase -
     to username/password@dbase -
append target_table -
using select * from source_table -
where field1=condition1 -
and field2=condition2;
But with the database upgrade, a field that was type LONG in Oracle9i Database is now type CLOB in Oracle Database 10g Release 2. How do I achieve the old script’s results in Oracle Database 10g Release 2 for a record where one of the fields is a CLOB datatype?
Congratulations on migrating to the LOB datatype—you will be much happier once you discover how much more flexible large objects are than LONGs. In the past, you had to use this deprecated SQL*Plus COPY command to copy LONG types from place to place, or you had to write a program outside of the database to do it. You cannot use INSERT as SELECT with LONG types, and PL/SQL can handle only LONGs that are 32 K or less in size.
With the CLOB type, you will find that a simple
insert into target_table
select * from source_table
 where field1 = <something>
   and field2 = <something else>;
in PL/SQL or any program will accomplish your task. You can just use straightforward SQL.
Now, if I could only convince you to upgrade to Oracle Database 11g Release 2. Oracle Database 10g Release 2 will reach end of life a lot sooner than Oracle Database 11g Release 2 will.
Tom Kyte is a database evangelist in Oracle’s Server Technologies division and has worked for Oracle since 1993. He is the author of Expert Oracle Database Architecture (Apress, 2005, 2010) and Effective Oracle by Design (Oracle Press, 2003), among other books.