TECHNOLOGY: Ask Tom
On Top-n and Pagination QueriesBy Tom Kyte
Our technologist gets more results using ROW_NUMBER, RANK, and DENSE_RANK.
In a recent Ask Tom column (September/October 2006), I wrote about using ROWNUM to perform top- n queries, getting the first n number of rows from an ordered result set, and pagination queries, getting rows n through m from a result set.
In that column, I demonstrated how to use ROWNUM like this:
select * from (your_query) where rownum <= 10;
in order to get the first 10 rows from "your query," and how to use ROWNUM like this:
select * from ( select rownum rnum, a.* from (your_query) a where rownum <= :M ) where rnum >= :N;
in order to get rows n through m from "your query."
As always, there is more than one way to do things in Oracle Database, and these top- n and pagination queries are no exception. In this column, I'd like to take a look at alternative ways to achieve this functionality by using analytic functions and introduce another variation on the top- n query: getting the top- n records by some set of attributes , such as getting the "set of rows representing the top three salaries by DEPTNO" from EMP.
Top- N Query
Another way to write a top- n query (which is similar to using the LIMIT clause in MySQL or using "set rowcount" in SQL Server to limit the number of records returned by a query) is to use the analytic function ROW_NUMBER. It behaves similarly to the ROWNUM pseudocolumn but is more flexible and has more capabilities. Specifically, I'll be able to use ROW_NUMBER to get the "top- n records by some set of attributes" in the next section (" Top- n Queries by Something ").
Let's look at what ROW_NUMBER can do. Here is an example query using ROW_NUMBER to assign an increasing number to each row in the EMP table after sorting by SAL DESC:
SQL> select ename, sal, 2 row_number() 3 over (order by sal desc) rn 4 from emp 5 order by sal desc 6 / ENAME SAL RN ----- ---- -- KING 5000 1 FORD 3000 2 SCOTT 3000 3 JONES 2975 4 . . . JAMES 950 13 SMITH 800 14 14 rows selected.
So, ROW_NUMBER allows me to easily assign this increasing number after sorting, but at first glance, it doesn't seem to support retrieval of just the first n rows:
SQL> select ename, sal, 2 row_number() 3 over (order by sal desc) rn 4 from emp 5 where 6 row_number() 7 over (order by sal desc) <= 3 8 order by sal desc 9 / row_number() * ERROR at line 6: ORA-30483: window functions are not allowed here
The problem here is that analytic functions are evaluated after the WHERE clause and therefore cannot be used in the WHERE clause. That is not really a problem, however, because I have inline views—I can apply a predicate to ROW_NUMBER after it is assigned. For example
SQL> select * 2 from ( 3 select ename, sal, 4 row_number() 5 over (order by sal desc) rn 6 from emp 7 ) 8 where rn <= 3 9 order by sal desc 10 / ENAME SAL RN ----- ---- -- KING 5000 1 SCOTT 3000 2 FORD 3000 3 3 rows selected.
So, that demonstrates how to perform a top- n query by using ROW_NUMBER and also points out a general issue with top- n queries. If you look at that result, you see two rows with the value 3000. What if, in the EMP table, three people, instead of just two, had a salary of 3000? The result obtained by the above query would be ambiguous—I would get three records, but the records I retrieved would be somewhat random. For example
SQL> update emp 2 set sal = 3000 3 where ename = 'SMITH'; 1 row updated. SQL> select * 2 from ( 3 select ename, sal, 4 row_number() 5 over (order by sal desc) rn 6 from emp 7 ) 8 where rn <= 3 9 order by sal desc 10 / ENAME SAL RN ----- ---- -- KING 5000 1 SMITH 3000 2 SCOTT 3000 3 3 rows selected.
As you can see, FORD disappeared from this result set. However, you might find that if you perform this update and query your data, FORD does not disappear but rather SCOTT or SMITH does. Which row would disappear would be quite arbitrary. That shows a problem with top- n queries in general: If you are ordering by something nonunique, the first set of n rows you get back may be different, given exactly the same input data. For example
SQL> create table t 2 ( x number, 3 y number 4 ); Table created. SQL> insert into t 2 values ( 1, 100 ); SQL> insert into t 2 values ( 1, 200 ); SQL> insert into t 2 values ( 1, 300 ); SQL> insert into t 2 values ( 1, 400 ); SQL> select * 2 from 3 (select x,y, 4 row_number() 5 over (order by x) rn 6 from t) 7 where rn <= 3; X Y RN --- --- -- 1 100 1 1 200 2 1 300 3 3 rows selected.
Now, here's what happens if I reload this table with the same data, just in a different order :
SQL> truncate table t; Table truncated. SQL> insert into t 2 values ( 1, 400 ); SQL> insert into t 2 values ( 1, 200 ); SQL> insert into t 2 values ( 1, 300 ); SQL> insert into t 2 values ( 1, 100 ); SQL> select * 2 from 3 (select x,y, 4 row_number() 5 over (order by x) rn 6 from t) 7 where rn <= 3; X Y RN --- --- -- 1 400 1 1 200 2 1 300 3 3 rows selected.
All of a sudden, I get an entirely different answer. This is something to be aware of when you perform top- n queries and the attribute you order by is not unique.
Bearing this in mind, I can use other analytic functions to remove the ambiguity. They will do so, but the analytic functions might return more than n rows. In my opinion, when the attribute I order by is not unique, I want my query to return all of the relevant records—not just the first n arbitrary ones. To that end, I can use the RANK and DENSE_RANK analytic functions. Let's take a look at what they do:
SQL> select ename,sal, 2 row_number() 3 over (order by sal desc)rn, 4 rank() 5 over (order by sal desc)rnk, 6 dense_rank() 7 over (order by sal desc)drnk 8 from emp 9 order by sal desc 10 / ENAME SAL RN RNK DRNK ----- ---- -- --- ---- KING 5000 1 1 1 FORD 3000 2 2 2 SCOTT 3000 3 2 2 JONES 2975 4 4 3 BLAKE 2850 5 5 4 CLARK 2450 6 6 5 . . . 14 rows selected.
The main things to note here are the following:
You can use RANK and DENSE_RANK in the same way you would use ROW_NUMBER to restrict the number of rows returned, but obviously you'll get subtly different results. For example
SQL> select * 2 from ( 3 select ename,sal, 4 dense_rank() 5 over (order by sal desc)drnk 6 from emp 7 ) where drnk <= 3 8 order by sal desc 9 / ENAME SAL DRNK ----- ---- ---- KING 5000 1 SCOTT 3000 2 FORD 3000 2 JONES 2975 3 4 rows selected.
That query returns "the set of people who make the top three salaries," which is likely the desired result. Getting the first three records from EMP after sorting by SAL is rather arbitrary, because using exactly the same set of data, simply inserted in different orders, you could observe different result sets with ROW_NUMBER (because SAL is not unique). Using DENSE_RANK, however, I don't get precisely three records but, instead, a repeatable (deterministic) result set. And I suspect that I retrieve the set the end user really meant to retrieve—the set of people making the top three salaries.
Using analytics is superior to using the ROWNUM pseudocolumn when it comes to processing more-complex queries. For example, suppose you wanted the top- n highest-paid people from the EMP table by DEPTNO . That is, for each department, display the top three highest-paid people (or the set of people making the top three salaries, using DENSE_RANK). To achieve this, I need to break up the result set by DEPTNO, sort it by SAL DESC, and then assign the analytic function result to each row. Fortunately, analytics are designed to do exactly this type of work. For example
SQL> break on deptno skip 1 SQL> select deptno, ename,sal, 2 row_number() 3 over (partition by deptno 4 order by sal desc)rn, 5 rank() 6 over (partition by deptno 7 order by sal desc)rnk, 8 dense_rank() 9 over (partition by deptno 10 order by sal desc)drnk 11 from emp 12 order by deptno, sal desc 13 / DEPTNO ENAME SAL RN RNK DRNK -------- ----- ---- -- --- ---- 10 KING 5000 1 1 1 CLARK 2450 2 2 2 MILLER 1300 3 3 3 20 SCOTT 3000 1 1 1 FORD 3000 2 1 1 JONES 2975 3 3 2 ADAMS 1100 4 4 3 SMITH 800 5 5 4 30 BLAKE 2850 1 1 1 ALLEN 1600 2 2 2 TURNER 1500 3 3 3 MARTIN 1250 4 4 4 WARD 1250 5 4 4 JAMES 950 6 6 5 14 rows selected.
Here I've used the PARTITION clause of the analytic function. This breaks up my result set virtually into some number of groups and then applies the analytic function to each group in turn, resetting itself as it goes from group to group.
That allows me to assign an increasing number—using ROW_NUMBER, for example—to the rows in each department, resetting the ROW_NUMBER to 1 as I go from group to group. As you can see, it would be quite easy to get the "top 3 by DEPTNO" now, using an inline view and applying the predicate WHERE RN <= 3 or RNK <= 3 or DRNK <= 3, depending on which numbering method returns the answer you deem most applicable.
So, for example, to retrieve the set of employees who make the top three salaries in each of the departments, I would code the following:
SQL> break on deptno skip 1 SQL> select * 2 from ( 3 select deptno, ename, sal, 4 dense_rank() 5 over (partition by deptno 6 order by sal desc)drnk 7 from emp 8 ) where drnk <= 3 9 order by deptno, sal desc 10 / DEPTNO ENAME SAL DRNK ------ ----- ---- ---- 10 KING 5000 1 CLARK 2450 2 MILLER 1300 3 20 SCOTT 3000 1 FORD 3000 1 JONES 2975 2 ADAMS 1100 3 30 BLAKE 2850 1 ALLEN 1600 2 TURNER 1500 3 10 rows selected.
Pagination in Getting Rows N Through M
The last thing I'll take a look at is performing pagination with analytics. Pagination is common in Web-based applications: The user inputs a set of criteria, which runs a query and then lets the user click the Previous and Next buttons to page through the result set. To achieve this paging functionality, the application needs to be able to get a certain set of rows from the query. In the prior column dealing with ROWNUM and in the "Top- n Query" section above, I demonstrated how to achieve that pagination, but it required two levels of inline views and could be considered rather unwieldy. A syntactically easier way to write that pagination query is
select * from ( select /*+ first_rows(25) */ your_columns, row_number() over (order by something unique)rn from your_tables ) where rn between :n and :m order by rn;
For example, suppose you wanted to paginate through the ALL_OBJECTS view, wanted to show 25 rows on a page, and needed to show page 5. The query would be
SQL> variable n number SQL> variable m number SQL> exec :n := 101; :m := 125 PL/SQL procedure successfully completed. SQL> select * 2 from ( 3 select /*+ first_rows(25) */ 4 object_id,object_name, 5 row_number() over 6 (order by object_id) rn 7 from all_objects) 8 where rn between :n and :m 9 order by rn; OBJECT_ID OBJECT_NAME RN --------- ----------- --- 102 I_SEQ1 101 103 I_OBJAUTH1 102 . . . 124 I_ACCESS1 123 125 I_TRIGGER1 124 126 I_TRIGGER2 125 25 rows selected.
You need to order by something unique for these pagination queries, so that ROW_NUMBER is assigned deterministically to the rows each and every time. Otherwise, you end up with a problem similar to what I pointed out with the salary top- n query. For more details on deterministic queries, go to oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html. Also, for complete information on using analytic functions, see download.oracle.com/docs/cd/B19306_01/server.102/b14223/analysis.htm#i1007779 (chapter 21 of the Data Warehousing Guide ). In my opinion, analytic functions are the coolest thing to happen to SQL since the SELECT keyword was introduced.
Does GROUP BY Sort?
Does a GROUP BY clause in a query guarantee that the output data will be sorted on the GROUP BY columns in order, even if there is no ORDER BY clause?
Unless and until there is an ORDER BY statement on a query, the rows returned cannot be assumed to be in any order. Without an ORDER BY, the data may be returned in any order in which the database feels like returning it. This has always been true and will always be true.
In fact, in Oracle Database 10g Release 2, you'll see GROUP BY returning data in a random order much more often than before:
SQL> set autotrace on explain SQL> select job, count(*) 2 from emp 3 group by job 4 / JOB COUNT(*) --------- -------- CLERK 4 SALESMAN 4 PRESIDENT 1 MANAGER 3 ANALYST 2 ----------------------------------- | Id | Operation | Name | ----------------------------------- | 0 | SELECT STATEMENT | | | 1 | HASH GROUP BY | | | 2 | TABLE ACCESS FULL | EMP | -----------------------------------
The HASH GROUP BY step in the explain plan is a newer and generally faster and more efficient query plan step for a GROUP BY operation, and it definitely doesn't sort data.
Don't be misled if the results are not the same on your system. It does not matter if, when you execute that query, the results are sorted on your system—it takes only one countercase to demonstrate that something is not always true. Only with an ORDER BY can you make any assumptions about the sorted order of data. Many things—from parallel query to partitioning, to reverse-key indexes, to hash-partitioned indexes, to HASH GROUP BY steps, to the presence (or lack) of B*tree indexes—can and will cause data not to be returned in the sorted order you anticipated.
If you need data sorted, you have to use an ORDER BY statement; there is no avoiding that. For more of this ORDER BY discussion, check out asktom.oracle.com/tkyte/OrderBy.html.
Tom Kyte has worked for Oracle since 1993. He is a vice president in the Oracle Public Sector group and the author of Expert Oracle Database Architecture: 9i and 10g Programming Techniques and Solutions (Apress, 2005) and Effective Oracle by Design (Oracle Press, 2003), among others.