TECHNOLOGY: Ask Tom
On Partitioning and Pipelining
By Tom Kyte
Our technologist speaks on UNION ALL, pipelines, and Saturdays.
I have a high-volume OLTP database in which I have partitioned tables by date to facilitate purging. All non-date-related indexes are global nonpartitioned. Because updating global indexes takes a very long time (my tables are 4 to 25 gigabytes in size), I'm concerned about locking and other concurrency issues on the table where DROP partition ... UPDATE GLOBAL INDEXES takes place, because people continue to use UPDATE and INSERT with the table at a pretty high rate (60 to 80 DML operations per second per table).
Well, it is not the aggregate size of the tables that is the issue; it is the individual size of the partitions. If you slice the table into small enough partitions, each DROP will not be that large and will happen rather quickly.
In answer to the question about locking, as long as the partition being dropped has no active transactions running against it (during the DROP command, you'll get an ORA-54 error if the partition does have active transactions), the DROP will take place—even with concurrent/outstanding DML running against other partitions. The UPDATE GLOBAL INDEXES is done via DML-like operations (as if a DELETE had taken place) and happens concurrently with other transactions.
UNION or UNION ALL
I'm tuning a query that has a WHERE EXISTS, like this: WHERE EXISTS (SELECT 'x' FROM t1 WHERE ... UNION SELECT 'x' FROM t2 WHERE ... UNION SELECT 'x' FROM t3 WHERE ...). The WHERE EXISTS is taking an inordinate amount of time to execute. Is there a better way to do this?
Definitely. I see this frequently: the use of a UNION where UNION ALL would be much more efficient. The problem is that in a UNION , Oracle finds all the qualifying rows and then "deduplicates" them. To see what I mean, you can simply compare the following queries:
SQL> select * from dual 2 union 3 select * from dual; D - X SQL> select * from dual 2 union ALL 3 select * from dual; D - X X
Note how the first query returns only one record and the second returns two. A UNION forces a big sort and deduplication—a removal of duplicate values. Most of the time, this is wholly unnecessary. To see how this might affect you, I'll use the data dictionary tables to run a WHERE EXISTS query using UNION and UNION ALL and compare the results with TKPROF . The results are dramatic.
First, I'll do the UNION query:
SQL> select * 2 from dual 3 where exists 4 (select null from all_objects 5 union 6 select null from dba_objects 7 union 8 select null from all_users); call cnt cpu ela query ---- --- ---- --- ------ Parse 1 0.01 0.00 0 Execute 1 2.78 2.75 192234 Fetch 2 0.00 0.00 3 ----- ---- ---- ---- ------ total 4 2.79 2.76 192237
As you can see, that was a lot of work—more than 192,000 I/Os just to see if I should fetch that row from DUAL . Now I add a UNION ALL to the query:
SQL> select * 2 from dual 3 where exists 4 (select null from all_objects 5 union all 6 select null from dba_objects 7 union all 8 select null from all_users); call cnt cpu ela query ------ ---- ---- ---- ----- Parse 1 0.00 0.00 0 Execute 1 0.01 0.00 9 Fetch 2 0.00 0.00 3 ------ ---- ---- ---- ----- total 4 0.01 0.00 12
Quite a change! What happened here was that the WHERE EXISTS stopped running the subquery when it got the first row back, and because the database did not have to bother with that deduplicate step, getting the first row back was very fast indeed.
The bottom line: If you can use UNION ALL , by all means use it over UNION to avoid a costly deduplication step—a step that is probably not even necessary most of the time.
Can you illustrate the usage of pipelined functions with a simple (EMP, DEPT) example? Under what circumstances can using a pipelined function be effective?
Pipelined functions are simply code that you can pretend is a database table. Pipelined functions give you the (amazing, to me) ability to use SELECT * FROM <PLSQL_FUNCTION>; .
Anytime you think you have the ability to use SELECT * from a function instead of a table, it might be useful. Consider the following extract/ transform/load (ETL) process, whereby a flat file is fed into a PL/SQL function that transforms it and the transformed data is then used to update existing table data. It demonstrates quite a few database features, including external tables, pipelined functions, and MERGE .
To create and use an external table, I need to use a directory object. I'll start by mapping a directory object to the temp directory:
SQL> create or replace 2 directory data_dir as '/tmp/' 3 / Directory created.
Now, I'll create the external table. Part of its definition looks like a control file, because part of creating an external table is, in effect, creating a control file:
SQL> create table external_table 2 (EMPNO NUMBER(4) , 3 ENAME VARCHAR2(10), 4 JOB VARCHAR2(9), 5 MGR NUMBER(4), 6 HIREDATE DATE, 7 SAL NUMBER(7, 2), 8 COMM NUMBER(7, 2), 9 DEPTNO NUMBER(2) 10 ) 11 ORGANIZATION EXTERNAL 12 (type oracle_loader 13 default directory data_dir 14 access parameters 15 (fields terminated by ',') 16 location ('emp.dat') 17 ) 18 / Table created.
Now I'll use the flat utility to create a flat file from my EMP table data. You can find the flat utility at asktom.oracle.com.
SQL> host flat scott/tiger - > emp > /tmp/emp.dat
Now I am ready to test the external table; the flat file I created now works just like a database table:
SQL> select empno, ename, hiredate 2 from external_table 3 where ename like '%A%' 4 / EMPNO ENAME HIREDATE ---------- ------ --------- 7499 ALLEN 20-FEB-81 7521 WARD 22-FEB-81 7654 MARTIN 28-SEP-81 7698 BLAKE 01-MAY-81 7782 CLARK 09-JUN-81 7876 ADAMS 12-JAN-83 7900 JAMES 03-DEC-81 7 rows selected.
I'll set up a PL/SQL ETL routine to ingest the flat file and output live data to be merged or inserted. A pipelined function needs to return a collection type, and I want to return a collection that looks like the EMP table itself, so I create the scalar object type and then I create a table of that type:
SQL> create or replace type 2 emp_scalar_type as object 3 (EMPNO NUMBER(4) , 4 ENAME VARCHAR2(10), 5 JOB VARCHAR2(9), 6 MGR NUMBER(4), 7 HIREDATE DATE, 8 SAL NUMBER(7, 2), 9 COMM NUMBER(7, 2), 10 DEPTNO NUMBER(2) 11 ) 12 / Type created. SQL> create or replace type 2 emp_table_type as table 3 of emp_scalar_type 4 / Type created.
Now I am ready to create the pipelined function itself. Note that the ETL function below is very simplistic; it involves modifying the ename column, but you can include any complex logic you want, including the ability to log error records and the like:
create or replace function emp_etl (p_cursor in sys_refcursor) return emp_table_type PIPELINED as l_rec external_table%rowtype; begin loop fetch p_cursor into l_rec; exit when (p_cursor%notfound); -- validation routine -- log bad rows elsewhere -- lookup some value -- perform conversion pipe row( emp_scalar_type(l_rec.empno, LOWER(l_rec.ename), l_rec.job, l_rec.mgr, l_rec.hiredate, l_rec.sal, l_rec.comm, l_rec.deptno) ); end loop; return; end; / Function created.
The emp_etl pipelined function works just like a table. The following query selects columns ( empno, ename ) from the function, and the function selects all columns from the external table:
SQL> select empno, ename 2 from TABLE(emp_etl( 3 cursor(select * 4 from external_table 5 ) ) ) 6 where ename like '%a%'; EMPNO ENAME ---------- ------ 7499 allen 7521 ward 7654 martin 7698 blake 7782 clark 7876 adams 7900 james 7 rows selected.
Note the use of the keyword PIPELINED in the definition of this function; the keyword is mandatory in the making of a pipelined function. Also note the use of the pipe row directive in PL/SQL—that is the magic that makes a pipelined function really interesting. The pipe row directive returns data to the client immediately, meaning that I am getting output from this function in my client routine before the function generates the last row of data. If the cursor I send to this pipelined function returns 1,000,000 rows, I will not have to wait for PL/SQL to process all 1,000,000 rows to get the first row; data will start coming back as soon as it is ready. That is why these are called pipelined functions: Data streams—as if in a big pipe—from the cursor to the PL/SQL function to the caller.
Now, to finish the job, I'll create a table of data I would like to refresh from the source system, which sends me the flat file I produced above. The logic is as follows: If the record already exists in my database, UPDATE the ename and the sal columns; if the record does not exist, INSERT it. I'll start with some of the data from the EMP table:
SQL> create table emp as 2 select * from scott.emp 3 where mod(empno,2) = 0; Table created.
And here is the MERGE , which manages data from the flat file, through ETL, straight to the table, without hitting the disk with staging files:
SQL> merge into EMP e1 2 using (select * 3 from TABLE 4 (emp_etl( 5 cursor(select * 6 from external_table)) 7 ) 8 ) e2 9 on (e2.empno = e1.empno) 10 when matched then 11 update set e1.sal = e2.sal, 12 e1.ename = e2.ename 13 when not matched then 14 insert (empno, ename, job, mgr, 15 hiredate, sal, comm, deptno) 16 values (e2.empno, e2.ename, 17 e2.job, e2.mgr, 18 e2.hiredate, e2.sal, 19 e2.comm, e2.deptno) 20 / 14 rows merged.
The Last Saturday of Each Month
I need to develop a query to return the date of the last Saturday in any given month. How can I do that?
That's pretty easy with the NEXT_DAY and LAST_DAY functions. You can simply get the last day in the month, subtract 7 days from that, and then use NEXT_DAY to find the next Saturday after that one. I'll do two sample queries, one that returns the last Saturday in every month for the current year and another for which I provide a month as input and it returns the last day just of that month. Listing 1 shows the query that returns the last Saturday of each month for a given year.
You might be asking, "What is the 29-01-1927 all about?" Well, it is just a random date I chose that I knew was a Saturday—any Saturday would do. I did this instead of using SAT in the query for international reasons, because in languages other than English, SAT isn't Saturday. This query should work in any language out there.
Listing 2 shows the query that takes a month as input and returns the last Saturday of that month.
That works with the current year. If you need it to return the last Saturday for any given year, it would be easy to add that— just change the date format from MM to whatever you need.
A reader ("Ant" from New York) followed up with a great improvement on the first query. Using the new SQL MODEL clause of Oracle Database 10g, he was able to produce the answer without causing any I/O against the database at all. His solution is in Listing 3.
I am running Oracle9i Database Release 2 (126.96.36.199) on an ALPHA OPEN VMS server with two CPUs. I need help interpreting a STATSPACK report I was reading. The snapshot window was 10 minutes. There are 600 seconds in 10 minutes. However, when I look at the Top 5 Timed events—I see that I had well over 900 seconds of wait time for db_file_sequential_read —the total of all wait events was well over 1,000 seconds. How can this be? I had only 600 seconds to work with. Also, I observe only 54 seconds of CPU time for 27 sessions. I'm trying to compute a service time from STATSPACK and am just getting confused by what I see.
Well, if you have 1,000 people wait for 1 second simultaneously, you'll have 1,000 seconds of wait time—in that 1 second.
The only thing this particular STATSPACK report says is that the cumulative wait for db_file_sequential_read (due to index abuse, I'm guessing) was 906 seconds. Suppose you have just 27 sessions. That means that each session waited an average of 33 seconds (27 * 33 = 891) during your 10-minute (600-second) snapshot window.
As for the CPU, it might mean that each session used an average of two CPU seconds (I stress might) during the period of observation. Then again, it very well might not. The problem is that if 26 of those sessions called a stored procedure right before you took your snapshot and that stored procedure was still running when you took the closing snapshot (it hadn't returned to the client yet) and that stored procedure was burning CPU—not doing any I/O, just burning CPU—these procedures would contribute exactly zero CPU seconds to this report. CPU is reported only at the end of a call, and these calls never completed.
So a procedure that had been running for 10 hours and finished in your snapshot window would dump all of its CPU time into your window.
Conversely, a procedure that had been running for 10 hours, was running when you started, and is still running when you stop the STATSPACK contributes zero CPU time.
So be careful with CPU time; it is sometimes misleading in a batch system but pretty much dead-on in a transactional system. But all of these considerations make it physically impossible to compute a service time from STATSPACK reports; the data you have to work with is too aggregated to do that. You cannot in any way, shape, or form compute service time from STATSPACK reports—ignore anything that says otherwise.
The only way to accurately report service times is to have the client applications record that information somewhere.
Tom Kyte (firstname.lastname@example.org) has worked for Oracle since 1993. He is a vice president in the Oracle Public Sector group and the author of Effective Oracle by Design (Oracle Press, 2003) and Expert One-on-One: Oracle (Apress, 2003).