General Case: Function behavior depends on the partitioning of the input data

Consider a transformation along the lines of…

select avg ( salary ), department_id from employees group by department_id;

…where the aggregation operation to be performed on the set of salaries for a given department is arbitrarily complex such that a classical SQL implementation is impossible, slow by virtue of a function invocation for each row of the source table, or prohibitively challenging to write and debug. For example, it might be that the cost to the employer of paying a given salary depends on the hire date because of changes in benefits packages that affect only employees hired after the date of change. This is illustrated in code sample, but to avoid obscuring it with a complicated algorithm, the aggregation is simply the sum for the salary for each distinct department. This has the general form…

create or replace function Aggregate_Xform ( p_input_rows in My_Types.cur_t )
  return My_Types.dept_sals_tab
  pipelined
is
  ...

begin
  Get_Next_Row();
  while Got_Next_Dept() /* relies on assumption that
all rows for given dept are delivered consecutively */
loop v_total_sal := 0; while Got_Next_Row_In_Dept() loop v_total_sal := v_total_sal + g_in_row.sal; Get_Next_Row(); end loop; g_out_row.sal := v_total_sal; g_out_row.dept := g_current_dept; pipe row ( g_out_row ); end loop; close p_input_rows; return; end Aggregate_Xform; /

Given that the input rows will be partitioned between different slaves, the integrity of the algorithm requires that all the rows for a given department go to the same slave, and that all these rows are delivered consecutively. (Strictly speaking, the requirement for consecutive delivery is negotiable, but the design of the algorithm to handle this case would need to be much more elaborate. For that reason, Oracle commits to consecutive delivery.) We use the term clustered to signify this type of delivery, and cluster key for the column (in this case “department”) on which the aggregations done. But significantly, the algorithm does not care in what order of cluster key it receives each successive cluster, and Oracle does not guarantee any particular order here.

This allows the possibility of a quicker algorithm than if rows were required to be clustered and delivered in order of the cluster key. It scales as order N rather than order N.log(N), where N is the number of rows. The syntax is…

create or replace function Aggregate_Xform ( p_input_rows in My_Types.cur_t   )
  return My_Types.dept_sals_tab
  pipelined
  cluster p_input_rows by (dept)
  parallel_enable

    ( partition p_input_rows by hash (dept))

is...

We can choose between hash (dept) and range (dept) depending on what we know about the distribution of the values. (hash will be quicker than range and is the natural choice to be used with cluster... by.) Here, to be partitioned by a specified column, the input ref cursor must be strongly typed. cluster... by is not allowed without parallel_enable ( partition... by.

Note: at version 9.0.1 it is necessary to include ORDER BY on the cluster key in the SELECT used to invoke the table function thus…

select * from table (
  Aggregate_Xform (
    cursor (
      select salary, department_id from employees
       where department_id is not null
       order by department_id ) ) );

…to preserve correctness of behavior, but this restriction will be removed when the order N clustering algorithm is productized.