TECHNOLOGY: Ask Tom
On Unlearning and ModelingBy Tom Kyte
Our technologist goes back to school for Oracle Exadata and vows to be a MODEL student.
I frequently receive questions (on AskTom, at asktom.oracle.com) about Oracle Exadata Database Machine. Many of them center on relearning—or, more specifically, “unlearning” (a concept I’ve written about before). I’d like to briefly describe some of the concepts that need to be unlearned for working with an Oracle Exadata Database Machine for data warehousing.
First, you need to understand how Oracle Exadata is architected in order to appreciate why some of your past tried-and-true techniques are not the right approaches in an Oracle Exadata environment.
An Oracle Exadata Database Machine is big, a common configuration being eight server nodes coupled with 14 storage arrays. Each of these components has 12 CPU cores, delivering 96 cores at the database server level and 168 cores at the storage level for processing data. That is a lot of CPU horsepower right there. And all of the Oracle Exadata storage, I/O bandwidth, cache, and CPU is dedicated to your databases. This is a departure from most environments today, in which many, if not all, of those resources are shared by many other functions. A key to providing consistent, guaranteed performance is to ensure that your operations do not have to share resources with other functions.
So, you have this big server machine, coupled with a large amount of dedicated storage—storage that comes with its own CPU and memory for processing data at the storage layer. That in itself is impressive, but Oracle Exadata goes two steps further. First, all of this storage is connected to the database via a series of 40-gigabit-per-second InfiniBand switches—a private, dedicated network whose sole purpose is to move your bits and bytes from disk to database server CPU at a rate of up to 75 GB/sec. (See bit.ly/tkTYqD for information on this 75 GB/sec performance.) Compare this with your typical corporate network, in which you might have a 1 Gb/sec or maybe a 10 Gb/sec network capable of maybe 100 MB/sec or 1 GB/sec over a shared network. (And you’ll never really get the 100 MB/sec or 1 GB/sec transfer rate, because the resource is shared.) The second step is the addition of up to 5 TB of Exadata Smart Flash Cache flash memory dedicated to caching disk-based information at the storage level. With Oracle Exadata, you have about 1 TB of database server memory for caching information, along with 5 TB of flash memory at the storage level.
The servers, storage arrays, network switching, and memory in Oracle Exadata are impressive, but, as they say on late-night television, “Wait, there’s more!” The CPUs and memory at the storage level are there to run more software. Oracle Exadata includes eight nodes for the conventional database server, but its storage-level software understands that what is stored on disk is Oracle Database blocks. For the first time, the storage array not only knows how to perform I/O but also understands that the I/O it is performing is being done to retrieve Oracle Database blocks. This enables the storage to not only read the database blocks but to process them as well.
If you were to do a full scan of a terabyte of data with conventional SAN or NFS servers, you would have to not only read a terabyte of data but also transmit that same terabyte over the wire and then process it all at the database server level. With the storage cells in Oracle Exadata, however, you don’t need to do all that. Instead, the storage cells will read the terabyte of information, process it (before sending anything over the wire), and then send just the bits and bytes that are relevant to the database server. For example, a query such as “SELECT A, B, C FROM some_table WHERE D > 42” would cause the storage cells to do a full-table scan in parallel (all 14 storage cells would be processing a bit of the table), and as they were reading the data, only columns A, B, C, and D would be considered relevant and sent back to the database server. Furthermore, only rows in which D was greater than 42 would be sent back to the database server. (There could be hundreds of other columns, but the storage cells would not send them to the database server.) In short, only the rows and columns of interest would be transmitted to the database server. Even though you scanned a terabyte of data, you might send back only 200 MB over the wire—a considerable difference.
So, you have lots of disk storage with excellent read performance, the ability to retrieve just the rows and columns of interest, the ability to transmit massive amounts of information if needed (at 75 GB/sec), and the ability to process the resulting data with eight database server nodes and 96 CPU cores.
In short, with Oracle Exadata—all of a sudden—I/O isn’t your problem. Historically, I/O has been the major gating factor in a data warehouse, with many systems “underconfigured” and therefore delivering performance reduced by orders of magnitude. When you remove I/O from the equation, the game changes entirely. The processes you had in place in an I/O-constrained system do not work effectively in a system without I/O constraints. Not that those processes were efficient in the I/O-constrained system either! They were suboptimal there as well, but given the poor I/O performance, it was not as noticeable.
What does this mean to you? What do you need to unlearn for Oracle Exadata? The first thing to look at is your load and query process. On AskTom I frequently see people asking how to speed up their load, claiming they cannot load 5,000,000 new records in their multihour processing window. When they say it takes hours to load millions of rows, I know something is seriously wrong, because I can load millions of rows in minutes—on my laptop. It is all in how you approach the load.
Most people adopt what I’ll call a slow-by-slow (row-by-row) approach. They load and validate a single row millions of times. What they need to do is load and validate millions of rows once. There is a big difference between those two approaches, because they scale entirely differently.
Let’s look at the numbers: If you wanted to load 5,000,000 rows and it took a slow-by-slow process 1 millisecond (1/1000 second) to process each row during a validation/transformation phase, you would be looking at 5,000 seconds—or almost 1.5 hours. Now, 1 millisecond is probably being very generous, because there would be network round-trips for each row, database CPU, client CPU, and reading of input files to be loaded and other steps involved in the process. You can easily see how a row-by-row approach to loading 5,000,000 rows will quickly become your bottleneck as the row counts go up and up. When you take a big number (millions or billions) and multiply it by a little number (the amount of processing time per row) you end up with . . . still a big number!
If, on the other hand, you bulk-load your data, using direct path operations and parallel query to validate and transform the data, you can do in seconds or minutes what would otherwise take hours. If you don’t believe me, watch a series of videos available on YouTube. They were developed by a team including Graham Wood, father of Statspack and Automatic Workload Repository, and Andrew Holdsworth, senior director of Oracle’s Real World Performance group, and delivered by Holdsworth. They show how you would approach loading 1 TB of data (8 billion rows) into an Oracle Exadata Database Machine, gathering statistics, and validating all of it in about 20 minutes—yes, 20 minutes. This can be achieved only with parallel direct path operations. The videos are parts 1, 2, and 3 of Migrate a 1 TB Datawarehouse in 20 Minutes:
Once you’ve decided to use large, bulk, direct path loads, choosing a tool is important as well. Historically, many of us have used SQLLDR as our tool of choice for performing large loads. That worked well in the 1900s, but it isn’t the right tool for the twenty-first century. A new century demands new tools, because the scale of our loading problems has increased by many orders of magnitude.
Consider how SQLLDR works. If you wanted to do a parallel direct path load with it, you would have to set up a script that started n SQLLDR processes (where n was your chosen degree of parallelism). You would need to have n input files, one for each SQLLDR process. Additionally, each SQLLDR process would need to allocate a large chunk of memory to cache the metadata about the segment it was loading into, and that memory can be considerable. Today, it would not be outrageous to have a table with tens of thousands of partitions (say 40,000), with each partition having hundreds of extents. All of a sudden, a table structure might include millions of pieces of information, and each and every SQLLDR process would need to cache this metadata over and over. That can quickly add up to needing a terabyte of memory—just for SQLLDR.
The solution to this dilemma? The external table introduced in Oracle9i Database. External tables allow for a file—or the output of a program—to be read as a table source, using SQL directly. That is, you can invoke “SELECT * FROM some_file” and read directly from a file that exists outside of the database. (See bit.ly/vYSoDo for an article that demonstrates the power of external tables.) Because SQL can execute in parallel and perform direct path loads, a parallel direct path load of a table becomes simply “INSERT /*+ append */ INTO some_table SELECT * FROM some_file”. There is no script to run outside of the database. Changing the degree of parallelism to support more (or fewer) CPUs is trivial (just run the SQL statement in a database that has the cpu_count initialization parameter set differently) compared to changing parallelism in SQLLDR, in which you would be reprogramming a script. Additionally, with direct path loads, the memory considerations are moot. SQL uses the shared pool in the system global area (SGA), so there would be one cached copy of the table’s metadata—not one copy per SQLLDR process. Last, using an external table is a lot more flexible than using SQLLDR.
The next thing you might need to unlearn for Oracle Exadata is how to tune queries. Your toolbox has always included using indexes as a query tuning technique. This works well in systems that have to process tens or hundreds or maybe even thousands of rows, but it quite simply does not scale up when you are considering millions or billions of rows. It is time again for a little math to put things in perspective.
Suppose you want to retrieve 1 percent of a table to process. For most of you, the word that would immediately pop into your mind would be index—you are getting only 1 percent of the rows, so an index would be appropriate. Well, what if I told you the table in question has 100,000,000 rows? Would an index still be appropriate? Let’s work the numbers and see.
To begin, I can safely assume that the table blocks I need are likely not in the buffer cache, so I’ll have to perform physical I/O to read most of them. I can also assume that my buffer cache would not likely hold this result set (given a database with an 8 K block size, the table blocks I’d need to cache could total as much as 7.6 GB), because there are many other objects in this database as well. (Over time, repeated runs of the query would have to perform as many I/Os as the first run.) I’ll also make an assumption that will make the math easier: the index is entirely cached and is “free” to access. It takes 0.0 seconds to retrieve a block from the index, which leaves just the table blocks to consider. Now, seeing as a good I/O system would likely do a random seek, read, and transfer in about 5 milliseconds, I can see that it would take 5,000 seconds to perform the 1,000,000 random I/Os, which means that query is going to take 1.5 hours just to do the I/O! But if I turned that index scan into a full-table scan, I could get those I/Os performed in a few seconds or maybe minutes. (See bit.ly/srjb0c for a classic example of this.)
With Oracle Exadata, you need to unlearn most all of the rules of thumb you learned for online transaction processing (OLTP). For example, you learned, “If you are getting a small percentage of a table, use an index.” The rule should be, “If you are retrieving a small number of rows from a table, use an index.” With Oracle Exadata, you need to employ large bulk operations to process large amounts of data and you need to carry out CPU- and random I/O-intensive operations to process small amounts of data. In short, your mind-set needs to change.
The next mind-set change required for Oracle Exadata involves resource management. I/O used to be your bottleneck—your resource manager, in effect. To get your data warehouses looking busy—to get their CPUs doing work—you would try to process as many queries simultaneously as possible. Because most queries used to be throttled by lack of I/O resources, they would not demand too much of the CPU resources at any given point in time and the database would appear idle if you ran just a few queries at a time. Now, if you “fix” the I/O problem with Oracle Exadata—if you eliminate the I/O bottleneck—queries will not be throttled by the lack of I/O resources. Instead, queries will now all be simultaneously fighting with each other for CPU resources. If that happens, if you try to run more things that all have heavy CPU demands at the same time, you’ll have a serious problem on your hands and you’ll basically have a machine whose plug you have to pull.
Enter resource management and queuing. The goal is to fully utilize the machine but not attempt to overutilize it. You want to “fill the machine up” with queries, but when it gets full, you want to make any new queries wait—to queue them—until resources (CPU, memory) become available. If you allow the machine to become overwhelmed, you will not get your answer in a reasonable amount of time. On the other hand, if you do as much work as possible (and not one drop more than that) and you queue up new requests until you finish processing some of your current workload, everyone will get their responses faster (even those who have to wait in line for some period of time).
Again, I could talk about this and try to convince you, or you can see it for yourself in the real world. I’ll point you to YouTube again for Holdsworth’s fourth installment of Migrate a 1 TB Datawarehouse in 20 Minutes, at bit.ly/sYSLih. In this example, Holdsworth demonstrates the various techniques for managing many queries attempting to run concurrently, with the end result that queuing provides the best overall runtimes, from an average, minimum, and maximum runtime perspective.
So, in short, the main concepts you need to unlearn for Oracle Exadata are
I need to write a SQL statement to group data in ranges. Here is the data:
I need running totals that do not exceed 65,000, and for every running total, I need to get the starting site number, ending site number, and the sum of records for those sites:
start_site end_site running_total
As you can see, sites 1001, 1002, and 1003 form the first group and site 1004 starts the second group in my output. (If site 1004 remained in the first group, the running total would have been 110,500, which exceeds 65,000, so the first group is full and I start the second group. The second group is 1004 all by itself because if I had put 1005 in with it, the running total would have again exceeded 65,000.)
This is difficult to answer efficiently with just straight SQL. What I’ll do here is propose three approaches. The first approach was my initial one and employs a PL/SQL pipelined function. The second and third approaches were supplied by other users on AskTom. The second approach utilizes a new Oracle Database 11g Release 2 feature—recursive subquery factoring—and the third involves a new Oracle Database 10g feature: the MODEL clause. (You can see the evolution of this question and answer and the refinement of the two SQL-based approaches at bit.ly/vJiMVw) This answer will show that there is more than one way to attack a problem. I particularly liked the MODEL clause approach, and it prompted me to make a New Year’s resolution to learn and master the MODEL clause in 2012! Expect to see it used to answer many more questions in the future.
The question also involves a bit of a bin-fitting problem, where the goal is to group rows until some threshold is met and then start a new group. This is quite challenging to do in “pure SQL,” so I cheated a little with my approach, which uses a small bit of PL/SQL.
To implement the PL/SQL pipelined function, I first need to create some object types. These types, shown in Listing 1, describe the “table” the function will return.
Code Listing 1: Creating object types for PL/SQL pipelined function
SQL> create or replace type myScalarType as object 2 ( study_site number, cnt number, the_group number ) 3 / Type created. SQL> create or replace type myTableType as table of myScalarType 2 / Type created.
Next, I need the procedural code that will return the result I want. Basically, I’ll be sending this function a result set via a SYS_REFCURSOR type. This result set will be an ordered selection of the rows I want to see, and it will be ordered by STUDY_SITE so that I can process the data in ascending order—because the answer requires that. I’ll also pass in the “threshold” of 65,000 instead of hard-coding 65,000 in the PL/SQL, and I’ll be a little more flexible and make it a parameter. Listing 2 shows the code I implemented.
Code Listing 2: PL/SQL pipelined function solution
SQL> create or replace function foo( p_cursor in sys_refcursor, p_threshold in number ) return myTableType 2 pipelined 3 as 4 type array is table of t%rowtype index by binary_integer; 5 6 l_data array; 7 l_running_total number := 0; 8 l_group number := 1; 9 n number := 100; 10 begin 11 loop 12 fetch p_cursor bulk collect into l_data limit N; 13 for i in 1 .. l_data.count 14 loop 15 l_running_total := l_running_total + l_data(i).cnt; 16 if ( l_running_total > p_threshold ) 17 then 18 l_group := l_group + 1; 19 l_running_total := l_data(i).cnt; 20 end if; 21 pipe row( myScalarType( l_data(i).study_site, l_data(i).cnt, l_group )); 22 end loop; 23 exit when p_cursor%notfound; 24 end loop; 25 close p_cursor; 26 return; 27 end; 28 /
As you can see, I just read the rows from the SYS_REFCURSOR, and because my running total exceeds the threshold that was passed in, I reset the counters and increment the group number. The PIPE ROW() call returns the data I am making up. (Pipelined functions are great for implementing a tiny bit of procedural processing in relational queries.) To use this function and see the required output, I would run the query in Listing 3.
Code Listing 3: Query for running PL/SQL pipelined function solution
SQL> select min(study_site), max(study_site), sum(cnt) 2 from ( 3 select * 4 from TABLE( foo( cursor(select study_site, cnt from t order by study_site), 65000 ) ) 5 ) 6 group by the_group 7 order by the_group 8 / MIN(STUDY_SITE) MAX(STUDY_SITE) SUM(CNT) --------------- --------------- -------- 1001 1022 48081 1023 1044 62203 1045 1045 3360
That PL/SQL pipelined function works, but as I mentioned earlier, there were also some SQL solutions. The first I’ll use is the new recursive subquery factoring. This is a new feature in Oracle Database 11g Release 2, and I first looked at this feature when it came out: bit.ly/tYWtBG. (Look for “Recursive Subquery Factoring” for an explanation of how it works and what it does.) Listing 4 contains the recursive subquery approach.
Code Listing 4: Factored subquery solution
with data1 as ( select row_number()over(order by study_site) as rno, study_site Site_no, cnt tot_rec from t ) , rec (rno, Site_no, tot_rec, Total, flg) as ( select rno, Site_no, tot_rec, tot_rec, 0 from data1 where rno=1 union all select d.rno, d.Site_no, d.tot_rec, case when r.total + d.tot_rec > 65000 Then d.tot_rec Else r.total + d.tot_rec END, case when r.total + d.tot_rec > 65000 Then r.flg+1 Else r.flg END from data1 d, rec r where d.rno=r.rno+1 ) select min(site_no), max(site_no), sum(tot_rec) from rec group by flg order by flg
Then I have the second factored subquery—REC—which is, in fact, a recursive factored subquery, in that its definition references itself. Note the second part of the UNION ALL—it queries itself.
Here’s what the REC recursive factored subquery does. When the first part of the UNION ALL is executed, the subquery finds the first record from DATA1 and outputs that record into the result set. It then feeds that single row it found into the second part of the query; in short, it joins the first record to the second record. If the value of R.TOTAL (the first row’s total) plus the second row’s total—D.TOT_REC—exceeds the threshold, then the running total will reset and become just D.TOT_REC. If R.TOTAL+D.TOT_REC does not exceed that threshold, the running total will not reset and will continue to be R.TOTAL+D.TOT_REC. The FLG column—the grouping column—will either increment if R.TOTAL+D.TOT_REC exceeds its threshold or stay the same if it does not.
Then the subquery takes this second fabricated record and joins it to the third record in DATA1, performing the same operations. After that the subquery joins the third record to the fourth and so on until there are no records left to process. When you execute that query in full, it returns the same result as the pipelined function.
However, before you get too excited about this approach, think about the work it requires. You have to take the entire result set and sort it to assign the row numbers (which is OK, because all approaches will pretty much include a sort). Then you have to query this subresult to find the first row, scan it again to find the second row, scan it again to find the third, and so on. Suffice it to say, this query does a ton of work over and over again; using it is not the most efficient approach. If you are interested in the numbers, visit bit.ly/vJiMVw to see a tkprof report of this query, demonstrating how much work it performs.
The last approach, using the SQL MODEL clause, was supplied by Jichao Li from Tianjun, China. Li supplied the query in Listing 5.
Code Listing 5: SQL MODEL clause solution
SELECT s, MAX(e), MAX(sm) FROM ( SELECT s, e, cnt, sm FROM t MODEL DIMENSION BY(row_number() over(order by study_site) rn) MEASURES(study_site s, study_site e, cnt, cnt sm) RULES(sm[rn > 1] = CASE WHEN (sm[cv() - 1] + cnt[cv()]) > 20000 OR cnt[cv()] > 20000 THEN cnt[cv()] ELSE sm[cv() - 1] + cnt[cv()] END, s[rn > 1] = CASE WHEN(sm[cv() - 1] + cnt[cv()]) > 20000 OR cnt[cv()] > 20000 THEN s[cv()] ELSE s[cv() - 1] END)) GROUP BY s ORDER BY s;
Li used a threshold of 20,000 to demonstrate a different grouping, but the net result is the same. (Simply replace or bind in a different value wherever you see 20,000 in this query to change it.)
I used procedural code in PL/SQL to answer the question, and Li used procedural code—to a degree—directly in SQL to answer the question. (See bit.ly/rZtatb for documentation on the SQL MODEL clause.) This SQL query takes the same basic approach as the solutions above but uses different syntax. It starts by ordering the data by STUDY_SITE, assigns a row number to each row, and then uses that row number as an index on the result set, retrieving specific rows and columns (which is sort of like writing functions in a spreadsheet—only this spreadsheet is a result set). And it does the same sort of processing as in the PL/SQL and recursive subquery factoring solutions.
The end result: the SQL MODEL clause would be the best-performing approach overall to this problem by far. The PL/SQL pipelined function would come in second, and the recursive subquery-factored result third. Does that mean that the MODEL clause is always better than PL/SQL or the recursive approach? Absolutely not. Any of these three will be the “best” approach in different circumstances. It is always good to have many approaches in your tool belt and to benchmark them to see which is best for solving a specific problem.
Tom Kyte is a database evangelist in Oracle’s Server Technologies division and has worked for Oracle since 1993.
Send us your comments