TECHNOLOGY: Ask Tom
On Better Loading and Fact-Checking
By Tom Kyte
Our technologist enjoys the rule of external tables and debunks index scan myths.
I’m working to improve the efficiency of an existing process that starts with staging an input file (generated by a client program) for the database. In some situations, the input file can include 20,000,000 lines or more, but with one-third or so of those lines being of no use as input.
At present all the lines are transferred (as rows in a staging table), with unneeded lines pruned at the beginning of processing with a DELETE. The DELETE obviously generates a lot of redo and undo and takes some time. Further, we’re transferring lines we don’t even need to bother with in the first place.
I’m trying to modify the SQL*Loader control file to discard these unneeded records to save both the transfer time and the pruning time. However, I’m running up against SQL*Loader’s lack of an OR in its condition processing. I’ve tried structuring the control file to perform multiple tests, so that each successful test would result in a row’s being sent to the database. My results show that nearly all rows are being rejected.
Here’s my staging table:

create table staging

Here’s my sample input file, inputdata.csv, and here’s my control file:

LOAD DATA REPLACE
Well, I’ll point out two things here. First: SQL*Loader doesn’t “start over” with multiple INTO clauses. That is, the first INTO clause will get the record and start processing it, and the second INTO clause will get the same record and pick up processing where the first INTO clause stopped. The process does not go back to column 1. In short, the documentation states:
The important point in this example is that the second empno field is found immediately after the first ename, although it is in a separate INTO TABLE clause. Field scanning does not start over from the beginning of the record for a new INTO TABLE clause. Instead, scanning continues where it left off.
So a quick fix for you would be to use POSITION(1) in your control file for each of the RECTYPE attributes after the first INTO clause. That is, just change RECTYPE in the control file to RECTYPE POSITION(1). Once I made that change, your data loaded right up.
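As a concrete sketch of that fix, here is what a multiple-INTO control file with POSITION(1) might look like. The field list follows the simplified control file shown later in this answer; the WHEN conditions are invented purely for illustration, since the original conditions were not shown:

```
LOAD DATA REPLACE
INTO TABLE staging
WHEN (NAME = 'rn')
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
( RECTYPE, ID, NAME, VAL )
INTO TABLE staging
WHEN (NAME = 'rd')
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
( RECTYPE POSITION(1), ID, NAME, VAL )
```

The only difference between the two INTO clauses is POSITION(1) on the first field of the second one, which sends field scanning back to column 1 for that clause.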
The second thing I’d like to point out is that as far as I’m concerned, SQL*Loader is dead—because external tables rule. The SQL*Loader control file syntax is very inflexible compared to simple SQL. Also, in your example, you are loading a staging table, which likely will be further processed and placed into “real” tables. Therefore, you are loading the file into a staging table with SQL*Loader and then rereading that staging table to process the rows into the real tables. With external tables, you can simplify that process to a single step: read the input file directly, filter it with SQL, and insert the rows you need into the real tables.
You can skip major pieces of your current processing, including loading a table temporarily and rereading all the data. Let’s take your control file and use it to create an external table definition to see how easy this would all be if you used an external table instead of SQL*Loader. If I take your control file and remove all but one of the INTO clauses and get rid of the WHEN clause—
LOAD DATA REPLACE
INTO TABLE staging
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
( RECTYPE, ID, NAME, VAL )
—I can use SQL*Loader to generate a CREATE TABLE statement easily by running the command
sqlldr userid=/ control=test.ctl external_table=generate_only
The resulting log file will include the CREATE TABLE statement in Listing 1 for me. Listing 1 also includes a SQL query against the new table.
Code Listing 1: CREATE TABLE and a query against it
CREATE TABLE "SYS_SQLLDR_X_EXT_STAGING"
(
  "RECTYPE" CHAR(1),
  "ID" NUMBER,
  "NAME" VARCHAR2(64),
  "VAL" VARCHAR2(256)
)
ORGANIZATION external
(
  TYPE oracle_loader
  DEFAULT DIRECTORY EXEC_DIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
      CHARACTERSET US7ASCII
    BADFILE 'EXEC_DIR':'test.bad'
    LOGFILE 'test.log_xt'
    READSIZE 1048576
    FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
      LDRTRIM
    MISSING FIELD VALUES ARE NULL
    REJECT ROWS WITH ALL NULL FIELDS
    (
      "RECTYPE" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "NAME" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "VAL" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
    )
  )
  location ( 'test.dat' )
)
REJECT LIMIT UNLIMITED
/
Now, all I have to do is issue a query such as
select *
  from sys_sqlldr_x_ext_staging
 where (name = 'rn' and val is not null)
    or (name = 'rd' and val is not null)
    or (name = 'ac' and val <> 'NO')
    or (name not in ('ie', 'at', 'rn', 'rd', 'ac'))
Note that the query in Listing 1 expresses the entire set of INTO/WHEN clause logic in its WHERE clause. Because I am using an external table, I have the entire power of SQL at my disposal. In SQL, OR conditions are not hard; they are quite natural. And best of all, I avoided a lot of work. As you all know, the fastest way to do something is to not do it. Using external tables enables you to not do things, without losing anything along the way.
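Better still, because the external table is just a table to SQL, the staging table itself can disappear. Here is a sketch of a single INSERT ... SELECT that moves the filtered rows straight from the file into their destination (REAL_TABLE is a hypothetical name for your destination table; the WHERE clause is the same filter as the query above):

```sql
-- REAL_TABLE is hypothetical; substitute your actual destination table.
insert /*+ append */ into real_table ( rectype, id, name, val )
select rectype, id, name, val
  from sys_sqlldr_x_ext_staging
 where (name = 'rn' and val is not null)
    or (name = 'rd' and val is not null)
    or (name = 'ac' and val <> 'NO')
    or (name not in ('ie', 'at', 'rn', 'rd', 'ac'));
```

One statement reads the file, discards the unneeded lines, and loads the real table, with no staging table, no DELETE, and none of the undo and redo the DELETE used to generate.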
Difference Between Full Index Scans and Fast Full Index Scans
What is the difference between full index scan and fast full index scan?
I’ve read elsewhere that a fast full index scan is an alternative to a full table scan when the index contains all the columns needed for the query and at least one column in the index key has the NOT NULL constraint. A fast full index scan accesses the data in the index itself without accessing the table. It cannot be used to eliminate a sort operation, because the data is not ordered by the index key. It reads the entire index by using multiblock reads, unlike a full index scan, and can be parallelized. Fast full index scans cannot be performed against bitmap indexes. A fast full index scan is faster than a full index scan, because it can use multiblock I/O and can be parallelized just like a table scan.
I am not clear about this. Could you please clarify?
I’ll start with a definition and then get into clearing up some of these misconceptions, because many of the stated “facts” are incorrect. They are common misconceptions but untrue nonetheless.
A fast full index scan reads the entire index, unsorted, as it exists on disk. It is basically using the index as a “skinny” version of the table. The query in question would be accessing only attributes in the index. (We are not using the index as a way to get to the table—we are using the index instead of the table.) We use multiblock I/O and read all the leaf, branch, and root blocks. We ignore the branch and root blocks when executing the query and just process the (unordered) data in the leaf blocks.
A full index scan reads the index a block at a time, from start to finish. It reads the root block, navigates down the left-hand side of the index (or the right-hand side for a descending full scan), and then when it hits the leaf block, it reads across the entire bottom of the index—a block at a time—in sorted order. It uses single-block, not multiblock, I/O for this operation.
Now, let’s address the misconceptions the questioner “read elsewhere.” It is true that a fast full index scan is an alternative to a full table scan when the index contains all the columns referenced in the query, but “at least one column in the index key has the NOT NULL constraint” is not a requirement. I can prove this by example. Suppose I have the following table:
SQL> create table t
  2  as
  3  select *
  4  from all_objects;
Table created.

SQL> alter table t
  2  modify owner null;
Table altered.

SQL> create index t_idx
  2  on t(status,owner);
Index created.

SQL> desc t
 Name          Null?     Type
 ————————————— ————————— ————————————
 OWNER                   VARCHAR2(30)
 OBJECT_NAME   NOT NULL  VARCHAR2(30)
 ...
 STATUS                  VARCHAR2(7)
 ...
As you can see, neither STATUS nor OWNER is defined as NOT NULL. However, if I execute the query in Listing 2, I will observe a fast full index scan in action. That’s because we can use the index in place of the table (the index is acting as a skinny version of the table), and the predicate where owner = 'SCOTT' means that a row must have a non-null OWNER value to be in the result set. And because any non-null value would appear in the index, we can use the index. If the query were SELECT COUNT(*) FROM T, it is true that we would not be able to use a fast full scan of the index instead of the table, because the table could have rows that are not in the index: any row in which both STATUS and OWNER were null would not appear in the index, and we’d miss counting it. But because the WHERE clause explicitly asks for a not-null value of OWNER, I know that every row I need appears in the index.
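To make even the unrestricted COUNT(*) answerable from the index, you would have to guarantee that no all-null keys exist. Here is a sketch of two ways to do that (both are hypothetical changes to the demo, and the first assumes the column in fact contains no nulls):

```sql
-- Option 1 (hypothetical): tell the optimizer no all-null keys can exist,
-- so every row must appear in T_IDX.
alter table t modify owner not null;
select count(*) from t;      -- now answerable from T_IDX alone

-- Option 2: count a column that is in the index; rows where OWNER is null
-- contribute nothing to the count, so the index has every row that matters.
select count(owner) from t;  -- counts only rows with a non-null OWNER
```

Note that COUNT(owner) is a different question from COUNT(*): it deliberately ignores rows where OWNER is null.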
Code Listing 2: Query uses fast full index scan
SQL> set autotrace traceonly explain
SQL> select count(*) from t where owner = 'SCOTT';

Execution Plan
—————————————————————————————————————————————————————————————————————
Plan hash value: 1058879072
—————————————————————————————————————————————————————————————————————
| Id |Operation             |Name |Rows |Bytes |Cost (%CPU)|Time    |
—————————————————————————————————————————————————————————————————————
|  0 |SELECT STATEMENT      |     |    1|   17 |    69  (2)|00:00:01|
|  1 | SORT AGGREGATE       |     |    1|   17 |           |        |
|* 2 |  INDEX FAST FULL SCAN|T_IDX|   12|  204 |    69  (2)|00:00:01|
Moving on, you have the statement “Fast full index scans cannot be performed against bitmap indexes.” This is not true, and I can once again prove that by example, using the same table T. I’ll continue by adding a bitmap index—
SQL> create bitmap index
  2  bm_idx on t(owner);
—and then run the simple query in Listing 3.
Code Listing 3: Simple query uses bitmap index for fast full index scan
SQL> set autotrace traceonly explain
SQL> select distinct owner from t;

Execution Plan
—————————————————————————————————————————————————————————————————————————————
Plan hash value: 4145652518
—————————————————————————————————————————————————————————————————————————————
| Id |Operation                    |Name  |Rows |Bytes |Cost (%CPU)|Time    |
—————————————————————————————————————————————————————————————————————————————
|  0 |SELECT STATEMENT             |      |82657| 1372K|  465   (1)|00:00:06|
|  1 | HASH UNIQUE                 |      |82657| 1372K|  465   (1)|00:00:06|
|  2 |  BITMAP INDEX FAST FULL SCAN|BM_IDX|82657| 1372K|    7   (0)|00:00:01|
As you can see, we definitely would use a bitmap index to perform a fast full index scan—in fact, we can do it even when OWNER (the only indexed column) is NULLABLE. That is because, unlike the T_IDX B*Tree index, bitmap indexes always index NULL keys. Every row in a table will be pointed to by a bitmap index, regardless of the data values.
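This difference has a practical payoff: because every row appears in BM_IDX, even the unrestricted COUNT(*) that the B*Tree index could not answer can become an index-only query. A sketch, using the same table T:

```sql
-- With BM_IDX in place, the optimizer can answer this from the bitmap
-- index alone, even though OWNER is nullable; the plan would typically
-- show a BITMAP CONVERSION (COUNT) over a bitmap index scan of BM_IDX.
select count(*) from t;
```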
And last, there is the statement that “a fast full index scan is faster than a full index scan, because it can use multiblock I/O and can be parallelized just like a table scan.” I have a saying that goes like this: “Never say never and never say always, I always say.” When you see any statement such as “X is faster than Y,” be suspicious. If X were always faster than Y, technology Y would not have been implemented or invented in the first place! There must be some times when the converse is true. So, from a performance perspective, that statement is something that may or may not be true.
The answer to whether that statement is true is, “It depends.” Again using the table T example, if I run the query
select status, owner
  from t
 where owner = 'some user'
 order by status, owner;
the optimizer will sometimes pick a fast full index scan on T_IDX and sometimes pick a full index scan on T_IDX. It all depends on the estimated cardinalities involved. If this query is going to retrieve a very small set of rows, the cost of sorting them will be very low, and we’ll probably use a fast full index scan to reduce the I/O cost. On the other hand, if the query is going to retrieve a lot of rows, the cost of sorting them begins to go way up, and we might choose a full index scan, with a higher I/O cost but zero cost for sorting, instead. This is borne out by the example in Listing 4.
Code Listing 4: Choosing the right index scan for the right query
SQL> select status, owner from t
  2  where owner = 'SYS' order by status, owner;

Execution Plan
———————————————————————————————————————————————————————————————
Plan hash value: 2277287974
———————————————————————————————————————————————————————————————
| Id |Operation       |Name |Rows |Bytes |Cost (%CPU)|Time    |
———————————————————————————————————————————————————————————————
|  0 |SELECT STATEMENT|     |34755| 746K |   247  (1)|00:00:03|
|* 1 | INDEX FULL SCAN|T_IDX|34755| 746K |   247  (1)|00:00:03|
———————————————————————————————————————————————————————————————

SQL> select status, owner from t
  2  where owner = 'SCOTT' order by status, owner;

Execution Plan
—————————————————————————————————————————————————————————————————————
Plan hash value: 3276893591
—————————————————————————————————————————————————————————————————————
| Id |Operation            |Name |Rows |Bytes |Cost (%CPU)|Time    |
—————————————————————————————————————————————————————————————————————
|  0 |SELECT STATEMENT     |     |    8|  176 |    70  (3)|00:00:01|
|  1 | SORT ORDER BY       |     |    8|  176 |    70  (3)|00:00:01|
|* 2 |  INDEX FAST FULL SCAN|T_IDX|    8|  176 |    69  (2)|00:00:01|
As Listing 4 demonstrates, when we are going to retrieve about 35,000 rows, we use a full index scan to avoid the sort. When we go after about 8 rows, we use the fast full index scan and just sort the results.
Rollback Segment for SELECT
Does Oracle Database need a rollback segment for queries? I thought it used rollback segments only for data manipulation language (DML) operations, but it recently failed, resulting in the error message “Unable to extend rollback segment” for a SELECT statement.
We added some space to that segment, and then the SELECT statement worked fine. Does the database use rollback segments for SELECT statements?
First, let’s refine the question. Instead of “does the database use rollback segments for SELECT statements,” I’ll answer the question “does a SELECT statement generate or read rollback segments?” The verb use is just not specific enough.
Every SELECT statement has the potential to “use” undo (rollback) data. Oracle Database uses undo naturally when processing all queries to produce read-consistent result sets. (See “On Consistent Results, Different Constraints, and Unlearning” for more details on this topic.)
However, this use of undo would not cause an “unable to extend” error message. It could quite easily cause the dreaded “ORA-1555 Snapshot too old” error message, but it would not by itself generate an ORA-1650 or ORA-1651 error message, “Unable to extend rollback/undo segment. . . .”
To generate the ORA-1650/1651 error message, the SELECT statement would have to be generating undo itself, and the most likely cause of that is a SELECT FOR UPDATE. When you issue a SELECT FOR UPDATE, Oracle Database will lock all the rows for you before you get the first row back. Locking a row in Oracle Database involves modifying the database block to place the lock there. And anytime you modify a database block, you generate undo (and redo) for that operation. You can see this easily:
SQL> create table t
  2  as
  3  select *
  4  from all_objects;
Table created.

SQL> select used_ublk
  2  from v$transaction;
no rows selected

SQL> begin
  2  for x in
  3  (select *
  4  from t
  5  for update)
  6  loop
  7  null;
  8  end loop;
  9  end;
 10  /
PL/SQL procedure successfully completed.

SQL> select used_ublk
  2  from v$transaction;

 USED_UBLK
——————————
       629
Processing that SELECT statement generated 629 blocks of undo (I was in a “single user” database, so I know that row in V$TRANSACTION is mine).
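For contrast, the same loop without FOR UPDATE locks nothing and starts no transaction, so it leaves no row in V$TRANSACTION at all. Here is a sketch of what you would expect to see in the same session, once the previous transaction has been committed:

```sql
SQL> commit;
Commit complete.

SQL> begin
  2  for x in
  3  (select * from t)
  4  loop
  5  null;
  6  end loop;
  7  end;
  8  /
PL/SQL procedure successfully completed.

SQL> select used_ublk
  2  from v$transaction;
no rows selected
```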
What Is 1/19?
Oracle Database is under the impression that 1/19 * 19 – 1 = –6 × 10^–40. You can replace 19 with 19 multiplied by any positive power of 100, and the problem remains. When you multiply 19 by an odd power of 10, the answer is 0. For example, 1/1900 * 1900 – 1 = –6 × 10^–40, but 1/190 * 190 – 1 = 0. I understand that it has to do with the way numbers are stored internally, but is there a simple way around it?
First, it is very important to realize that this isn’t specific to Oracle Database; spreadsheets and all other computer programs I’ve ever seen are subject to this somewhere.
The problem arises in any computer program when you combine really small and really big numbers: the precision of the intermediate result maxes out and cannot hold all the digits.
The problem starts with 1/19. Limited digits cannot represent that number “perfectly,” as shown in Listing 5. You run out of places to hold the decimal, so that representation is not exactly 1/19, but it is close.
Code Listing 5: Limited digits cannot accurately represent 1/19
SQL> set numformat 0.999999999999999999999999999999999999999999999
SQL> select 1/19 from dual;

                                            1/19
————————————————————————————————————————————————
 0.052631578947368421052631578947368421052600000
And when you multiply by 19, you run into another issue. You have 38 digits of precision, but the numbers being multiplied are so far apart that you’d need more than 38 to do it precisely, as shown in Listing 6.
Code Listing 6: 38 digits of precision are not enough to equal 1
SQL> select 1/19*19 from dual;

                                         1/19*19
————————————————————————————————————————————————
 0.999999999999999999999999999999999999999400000
And now you see the issue. If you were to do this in floating-point numbers, it would work (with these numbers) by accident, as shown in Listing 7.
Code Listing 7: Reduced precision with floating-point numbers
SQL> select 1f/19f from dual;

                                          1F/19F
————————————————————————————————————————————————
 0.052631579300000000000000000000000000000000000
You see the massively reduced precision with floating-point number datatypes.
You get an answer that is much farther away from 1/19 than when you use Oracle NUMBER datatypes. Later, when you multiply, as shown in Listing 8, it appears to work, by accident, but only because of the reduced precision.
Code Listing 8: Reduced precision accidentally helps result
SQL> select 0.0526315793f * 19f from dual;

                               0.0526315793F*19F
————————————————————————————————————————————————
 1.000000000000000000000000000000000000000000000
Using more-precise numbers, in Listing 9, the result isn’t really 1, and the error is actually much larger, but you “lost” it because you had a number with 10 digits to the right of the decimal point and a number with 2 digits to the left. The reduced-precision arithmetic lost the error (by accident).
Code Listing 9: More-precise numbers create larger error
SQL> select 0.0526315793 * 19 from dual;

                                 0.0526315793*19
————————————————————————————————————————————————
 1.000000006700000000000000000000000000000000000
The only way I know to get around it is the old programming technique of using a tolerance rather than an equal sign (=). You cannot use another type of number (such as a float or a double), because it will produce errors as well when working with different numbers. If you are mixing really small, precise numbers with much larger numbers, you might need to use “if (abs(x-0) < tolerance) then” instead of “if (x=0) then” in your code.
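As a concrete sketch of that technique in PL/SQL (the variable names and the tolerance of 1e-30 are my own choices for illustration; pick a tolerance appropriate to the magnitudes in your data):

```sql
declare
  x         number := 1/19 * 19 - 1;   -- "should" be zero, but is not quite
  tolerance number := 1e-30;
begin
  if abs( x - 0 ) < tolerance then
    dbms_output.put_line( 'x is zero, within tolerance' );
  else
    dbms_output.put_line( 'x differs from zero by more than the tolerance' );
  end if;
end;
/
```

With the numbers in this question, the residual error is on the order of 10^–40, comfortably inside a 10^–30 tolerance, so the first branch fires.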
Tom Kyte is a database evangelist in Oracle’s Server Technologies division and has worked for Oracle since 1993. He is the author of Expert Oracle Database Architecture (Apress, 2005, 2010) and Effective Oracle by Design (Oracle Press, 2003), among other books.