Build Custom Aggregate Functions
By Jonathan Gennick
Summarize data in new ways by writing your own aggregate functions.
One aspect of Oracle Database I've long found very powerful is the ability to write custom procedural code and combine it with relational processing. Such code often takes the form of a stored function to be invoked from a SELECT or other statement. The creation and use of single-row functions in Oracle Database is well understood, but did you know that you can also create your own aggregate functions? You can, and more easily than you might think. This article shows you how.
Suppose you work for a car rental company. You have the table shown in Listing 1 with one row representing each car rental to a customer. (You can create the data by using the CustomAggregatesData.sql script, available with the sample code for this article). For each rental, you have a time out and a time in (the rental_out and rental_in columns). Time out corresponds to when the customer picked up the car, and time in is the time at which the car was returned. Your table stores these two values as type TIMESTAMP. Your job is to produce various reports based on averaging of the rental time periods. For example, you want to be able to determine the average rental period by state and to compare each rental period with a moving average.
Code Listing 1: Car_rental table
CAR_ID CUST_ID RENTAL_STATE RENTAL_OUT RENTAL_IN -------- -------- --------- ------------------------ ------------------------ 1 101 MI 05-MAR-06 08.00.00.00 AM 11-MAR-06 02.00.00.00 PM 1 102 MI 12-MAR-06 08.34.24.00 AM 17-MAR-06 07.23.19.00 PM 1 103 MI 17-MAR-06 11.19.00.00 PM 26-MAR-06 06.00.00.00 AM 1 104 MI 28-MAR-06 09.45.15.00 AM 30-MAR-06 03.27.13.00 PM 2 113 WI 06-MAR-06 08.00.00.00 AM 11-MAR-06 02.00.00.00 PM 2 114 WI 13-MAR-06 08.34.24.00 AM 17-MAR-06 07.23.19.00 PM 2 115 WI 18-MAR-06 11.19.00.00 PM 26-MAR-06 06.00.00.00 AM 2 116 WI 29-MAR-06 09.45.15.00 AM 30-MAR-06 03.27.13.00 PM . . .
Your use of the TIMESTAMP types makes computing the duration of each individual rental a trivial matter. You simply subtract one TIMESTAMP from the other, and the result is an interval of type INTERVAL DAY TO SECOND:
SELECT rental_in - rental_out FROM car_rental; RENTAL_IN-RENTAL_OUT 6 6:0:0.0 5 10:48:55.0 8 6:41:0.0 . . .
Computing the average rental period quickly turns into a less-trivial job, however:
SELECT AVG(rental_in - rental_out) FROM car_rental; SQL Error: ORA-00932: inconsistent datatypes: expected NUMBER got INTERVAL DAY TO SECOND
The built-in aggregate function AVG does not support the interval types (and the same holds for SUM, by the way). For many, this problem might lead to running procedural code on the client to retrieve and summarize large amounts of data across the network through cursor loops, with all the attendant performance and scalability problems that such an approach might bring. Knowing how to write aggregate functions, however, makes this problem nothing more than a slight speed bump.
Writing an Aggregate Function
Figure 1 shows conceptually how an aggregate function—any type of aggregate function—must be evaluated. First you must have some sort of initialization. Then you pass in several values as input to the function. Finally you receive back a single result.
Writing an aggregate function in Oracle Database is a matter of creating an object type with a method corresponding to each of the phases in Figure 1, plus one additional method to support parallel evaluation. Figure 2 illustrates the methods involved.
When evaluating an aggregate function against a group of values, Oracle Database does the following:
1. Invokes ODCIAggregateInitialize to create an instance of the type you've created to implement the aggregation. This instance is known as an aggregation context.
This process is the simplest form of aggregation. Oracle Database may also parallelize the operation, by splitting the data into chunks, running these steps on each chunk, and then making one or more ODCIAggregateMerge calls to combine the results from each chunk into a single result.
Creating the Type
Returning to the car rental scenario, your task is to write an aggregate function to compute the average of a group of car rental durations expressed as INTERVAL DAY TO SECOND values. Begin by creating an object type to implement the aggregate function methods. Listing 2 shows the specification for an object type named AvgInterval. Listing 3 shows the object type body. You must implement at least the methods shown in Listing 2. These include the three methods for beginning, feeding, and terminating the aggregation, as well as the one method for merging results from parallel evaluation. All methods must return a NUMBER, which is a success/fail status value. You can rename the parameters, but the number and order of the parameters and their purposes are fixed.
Code Listing 2: AvgInterval type specification
CREATE OR REPLACE TYPE AvgInterval AS OBJECT ( runningSum INTERVAL DAY(9) TO SECOND(9), runningCount NUMBER, STATIC FUNCTION ODCIAggregateInitialize ( actx IN OUT AvgInterval ) RETURN NUMBER, MEMBER FUNCTION ODCIAggregateIterate ( self IN OUT AvgInterval, val IN DSINTERVAL_UNCONSTRAINED ) RETURN NUMBER, MEMBER FUNCTION ODCIAggregateTerminate ( self IN AvgInterval, returnValue OUT DSINTERVAL_UNCONSTRAINED, flags IN NUMBER ) RETURN NUMBER, MEMBER FUNCTION ODCIAggregateMerge (self IN OUT AvgInterval, ctx2 IN AvgInterval ) RETURN NUMBER ); /
Code Listing 3: Body of the AvgInterval type
CREATE OR REPLACE TYPE BODY AvgInterval AS STATIC FUNCTION ODCIAggregateInitialize ( actx IN OUT AvgInterval ) RETURN NUMBER IS BEGIN IF actx IS NULL THEN dbms_output.put_line('NULL INIT'); actx := AvgInterval (INTERVAL '0 0:0:0.0' DAY TO SECOND, 0); ELSE dbms_output.put_line('NON-NULL INIT'); actx.runningSum := INTERVAL '0 0:0:0.0' DAY TO SECOND; actx.runningCount := 0; END IF; RETURN ODCIConst.Success; END; MEMBER FUNCTION ODCIAggregateIterate ( self IN OUT AvgInterval, val IN DSINTERVAL_UNCONSTRAINED ) RETURN NUMBER IS BEGIN DBMS_OUTPUT.PUT_LINE('Iterate ' || TO_CHAR(val)); IF val IS NULL THEN /* Will never happen */ DBMS_OUTPUT.PUT_LINE('Null on iterate'); END IF; self.runningSum := self.runningSum + val; self.runningCount := self.runningCount + 1; RETURN ODCIConst.Success; END; MEMBER FUNCTION ODCIAggregateTerminate ( self IN AvgInterval, ReturnValue OUT DSINTERVAL_UNCONSTRAINED, flags IN NUMBER ) RETURN NUMBER IS BEGIN dbms_output.put_line('Terminate ' || to_char(flags) || to_char(self.runningsum)); IF self.runningCount <> 0 THEN returnValue := self.runningSum / self.runningCount; ELSE /* It *is* possible to have an empty group, so avoid divide-by-zero. */ returnValue := self.runningSum; END IF; RETURN ODCIConst.Success; END; MEMBER FUNCTION ODCIAggregateMerge (self IN OUT AvgInterval, ctx2 IN AvgInterval ) RETURN NUMBER IS BEGIN self.runningSum := self.runningSum + ctx2.runningSum; self.runningCount := self.runningCount + ctx2.runningCount; RETURN ODCIConst.Success; END; END; /
As part of the object type, define any variables needed to maintain the state of the aggregation as it progresses. How you maintain state is up to you and depends on how you choose to implement your aggregation logic. It's reasonable to use a running sum and count to compute an average (although other approaches are possible), so the following two variables are defined near the top of the AvgInterval specification in Listing 2:
runningSum INTERVAL DAY(9) TO SECOND(9), runningCount NUMBER,
runningSum is defined with the maximum precision (9) for both the day and the fractional-second components. The default is to allow only two digits for the day and only six past the decimal point for the fractional seconds. Allowing for the maximum precision in both cases makes AvgInterval generally useful over any INTERVAL DAY TO SECOND values, regardless of their precision.
Next in the AvgInterval specification come the member function declarations for the ODCIAggregate interface. These can be in any order, but let's begin at the beginning, with the initialization function:
. . . STATIC FUNCTION ODCIAggregateInitialize ( actx IN OUT AvgInterval ) RETURN NUMBER, . . .
Oracle Database invokes this function to initialize an aggregation. The one argument will be of the underlying type you are creating to implement the aggregate function (in this case, AvgInterval). Generally speaking, the purpose of the ODCIAggregateInitialize function is to create a new instance of the underlying object type and then to initialize the instance variables used in computing the aggregation result. The STATIC keyword allows Oracle Database to invoke this function independently of whether an instance of the type already exists. ODCIAggregateInitialize is the only one of the ODCIAggregate functions to be defined as STATIC.
The object instance returned by ODCIAggregateInitialize becomes the aggregation context. If you invoke the same aggregate function several times in a single statement, Oracle Database will create at least one context for each invocation. If Oracle Database parallelizes execution, it will create a context for each execution thread.
In Listing 3, you can see the code behind ODCIAggregateInitialize. For now, focus on the case in which the input parameter is null. The code invokes the type's implicitly defined constructor to create a new instance of the type. The two values passed to the constructor correspond to runningSum and runningCount, and both are initialized to 0.
Next, look at ODCIAggregateIterate in Listing 3. It is the workhorse function of the aggregation type. Oracle Database will invoke the function once per row—once for each value feeding into the aggregation. The logic here is simple:
self.runningSum := self.runningSum + val; self.runningCount := self.runningCount + 1;
As each new value comes into the aggregation, the function adds that value to a running sum and increments the counter. After all values have been processed, Oracle Database invokes ODCIAggregateTerminate, shown in Listing 3, which uses the two running values to compute the average:
returnValue := self.runningSum / self.runningCount;
Do be aware of the possibility that a context will be initialized and terminated for an empty group. Given the computation shown here for returnValue, an empty group would result in a divide-by-0 error. The code in Listing 3 avoids the division, and thus the resulting error, when the running count is 0. (A divide-by-0 in this article's scenario will manifest itself as an "ORA-01873: the leading precision of the interval is too small" error.)
Note in Listing 3 the datatype used for ODCIAggregateIterate's val argument, namely DSINTERVAL_UNCONSTRAINED, which is equivalent to INTERVAL DAY(9) TO SECOND(9). It is necessary to use the unconstrained type, because you cannot specify precision in a parameter list. You cannot, however, use the unconstrained type for the running sum.
The last member function to look at is ODCIAggregateMerge, also shown in Listing 3. Recall from Figure 2 that aggregation can be parallelized. Oracle Database will create separate aggregation contexts for each parallel execution thread. Assuming three parallel threads, the sequence of events becomes
1. ODCIAggregateInitialize invoked for thread 1
Oracle Database invokes ODCIAggregateMerge with two arguments. Each is an aggregation context:
MEMBER FUNCTION ODCIAggregateMerge (self IN OUT AvgInterval, ctx2 IN AvgInterval ) RETURN NUMBER IS
The job of the merge function is to change the first aggregation context—self—so that it reflects the combination of the first and second—ctx2—contexts. You're computing an average, so you can add the running sums and counts:
. . . self.runningSum := self.runningSum + ctx2.runningSum; self.runningCount := self.runningCount + ctx2.runningCount; RETURN ODCIConst.Success; . . .
The effect of this merge operation is that the running sum and count in self (the first context in the argument list) now reflect the values they would hold if all values had been passed through that one context. The ctx2 context is no longer necessary. (If you ever write a custom aggregate in a language such as C or Java and you've allocated memory or some other resource to the ctx2 context, release that memory or resource at this point in the process.)
Defining the Function
After you create the object type and methods for implementing a new aggregate function, you must create a linkage between the function name you want to use from SQL and the underlying type. Do that by using a statement such as:
CREATE OR REPLACE FUNCTION avg_interval ( x DSINTERVAL_UNCONSTRAINED ) RETURN DSINTERVAL_UNCONSTRAINED PARALLEL_ENABLE AGGREGATE USING AvgInterval; /
This statement creates the function avg_interval, which has a single argument that is compatible with any INTERVAL DAY TO SECOND value. The PARALLEL_ENABLE clause enables parallelism. The AGGREGATE USING clause links the avg_interval function to the underlying AvgInterval type. Invoke avg_interval, and Oracle Database will invoke the various member functions on AvgInterval to generate your results.
Using the New Function
Using this new avg_interval aggregate function couldn't be simpler: Invoke it the same way as any built-in aggregate function. You can easily compute the average rental period over all rentals:
SELECT avg_interval (rental_in - rental_out) FROM car_rental; AVG_INTERVAL(RENTAL_IN-RENTAL_OUT) +000000004 15:25:17.000000000
And you can just as easily compute the average rental period by state:
SELECT rental_state, avg_interval (rental_in - rental_out) FROM car_rental GROUP BY rental_state; RE AVG_INTERVAL(RENTAL_IN-RENTAL_O... ---- --------------------------------------- MI +000000005 12:31:08.250000000 MN +000000003 12:07:43.250000000 WI +000000004 12:31:08.250000000
The code for this article contains calls to DBMS_OUTPUT. If you run the preceding SELECT from SQL Developer, enable the viewing of DBMS_OUTPUT by clicking the Enable DBMS Output toolbar icon on the DBMS Output tab of the SQL Worksheet. (From SQL*Plus, issue the SET SERVEROUTPUT ON command.) Then execute the query, and you'll see how Oracle Database invokes the various ODCI methods on the underlying AvgInterval type:
NULL INIT Iterate +000000006 06:00:00.000000000 Iterate +000000005 10:48:55.000000000 Iterate +000000008 06:41:00.000000000 . . . NULL INIT Iterate +000000005 06:00:00.000000000 Iterate +000000004 10:48:55.000000000 Iterate +000000007 06:41:00.000000000 . . . NULL INIT Iterate +000000003 14:37:00.000000000 Iterate +000000003 00:48:55.000000000 Iterate +000000007 12:41:00.000000000 . . . Terminate 0+000000066 06:13:39.000000000 Terminate 0+000000028 01:01:46.000000000 Terminate 0+000000054 06:13:39.000000000
The three occurrences of NULL INIT show how Oracle Database creates a new aggregation context for each group. The Iterate lines show the values feeding into each group. Finally, Oracle Database invokes the terminate method once per group to generate the final results. (If you are running on a multi-CPU system, parallelism may occur, and then your DBMS_OUTPUT results won't be as nicely ordered as mine.) Of course, you probably would not want DBMS_OUTPUT calls in production code.
Behind the scenes, Oracle Database does quite a bit of housekeeping work for you:
All you have to worry about when writing the underlying ODCI methods is two things: how to summarize a single group of values and how to merge two groups. You even get free support for analytic syntax.
Any aggregate function you create can also be used as an analytic function. Oracle Database handles all the details of partitioning and ordering for you. For example, to compare each individual rental period with the average period in the same state, you could issue a query such as this:
SELECT rental_in - rental_out time, rental_state st, avg_interval (rental_in - rental_out) OVER (PARTITION BY rental_state) state_avg FROM car_rental;
Your output would look like this:
RENTAL_TIME ST STATE_AVG ------------ -- ----------- 6 6:0:0.0 MI 5 12:31:8.250 5 10:48:55.0 MI 5 12:31:8.250 8 6:41:0.0 MI 5 12:31:8.250 6 6:41:0.0 MN 3 12:7:43.250 4 10:48:55.0 WI 4 12:31:8.250 ...
You'll notice some interesting details if you look at the DBMS Output results from this query. On a single-processor machine, you will see one occurrence of NULL INIT followed by multiples of NON-NULL INIT. This indicates that Oracle Database is creating one aggregation context and then reusing that context for each partition defined by the analytic function's OVER clause. The ODCIAggregateInitialize method in Listing 3 accounts for that possibility. When the first parameter is non-null, the method does not instantiate a new object but instead reinitializes the existing object.
You may also notice a great many more termination calls than you would otherwise expect, because Oracle may invoke the termination function anew to obtain the statewide average for each row of output. On most of those termination calls, the flags argument (see Listing 3) will be set to 1 (ODCIConst.AGGREGATE_REUSE_CTX), indicating that you should expect Oracle Database to reuse the context.
Analytic Framing Support
Analytic function syntax includes support for aggregation over sliding windows of data. The specific clause involved is referred to as a framing clause . The following query uses such a framing clause to compare each rental period with the running average of a three-period window centered on the current rental. The framing clause is
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
The sliding window thus consists of the row currently under consideration, the one row immediately preceding the current row, and the one row immediately following the current row. Here's the full query:
SELECT rental_in - rental_out time, rental_state st, avg_interval (rental_in - rental_out) OVER (PARTITION BY rental_state ORDER BY rental_out ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) state_avg FROM car_rental;
Oracle Database can execute this query by using the methods defined in Listings 2 and 3. However, as the three-period window moves, each movement results in a new grouping that must be evaluated from scratch. You can gain efficiency by writing a method to remove a value from the aggregation, enabling Oracle Database to slide the window by removing values from the trailing end of the range and adding values at the leading end. The method you need is named ODCIAggregateDelete. Here is the declaration to add to the type specification in Listing 2 (don't forget to separate it from the other declarations with a comma):
MEMBER FUNCTION ODCIAggregateDelete ( self IN OUT AvgInterval, val IN DSINTERVAL_UNCONSTRAINED ) RETURN NUMBER
And here is the implementation that goes into the type body (Listing 3):
MEMBER FUNCTION ODCIAggregateDelete ( self IN OUT AvgInterval, val IN DSINTERVAL_UNCONSTRAINED ) RETURN NUMBER IS BEGIN DBMS_OUTPUT.PUT_LINE ('Delete ' || TO_CHAR(val)); IF val IS NULL THEN /* Will never happen */ DBMS_OUTPUT.PUT_LINE ('Null on delete'); END IF; self.runningSum := self.runningSum - val; self.runningCount := self.runningCount - 1; RETURN ODCIConst.Success; END;
Now, as the ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING window changes for each row processed by the query, Oracle Database will make a call to ODCIAggregateDelete to remove the interval for the row leaving the window, followed by a call to ODCIAggregateIterate to add the interval for the next row coming into the window. This is a more efficient approach than aggregating from scratch with each move of the window.
Custom aggregate functions open the door to an interesting world of possibilities. You could write a SUM function that returns 0 instead of null when no rows are in a group. You could write a MUL function to generate the product of a group of numbers by multiplying them all (a request I actually encountered a few months ago). You can even easily solve the not-unheard-of problem of concatenating all the string values in a group. Custom aggregate functions can be a powerful tool for solving SQL problems.
Jonathan Gennick (Jonathan@Gennick.com) is an experienced Oracle professional and member of the Oak Table Network. He wrote the best-selling SQL Pocket Guide and the Oracle SQL*Plus Pocket Reference, both from O'Reilly Media.