# SQL Statistical Functions

Scalable In-Database SQL Statistical Functions

## Overview

Included in every Oracle Database is a collection of basic statistical functions accessible via SQL. These include descriptive statistics, hypothesis testing, correlations analysis, test for distribution fits, cross tabs with Chi-square statistics, and analysis of variance (ANOVA). The basic statistical functions are implemented as SQL functions and leverage all the strengths of the Oracle Database.  The SQL statistical functions work on Oracle tables and views and exploit all database parallelism, scalability, user privileges and security schemes.  Hence the SQL statistical functions can be included and exposed within SQL queries, BI dashboards and embedded in real-time Applications.    The SQL statistical functions can be used in a variety of ways.  For example, users can call Oracle's SQL statistical functions to obtain mean, max, min, median, mode and standard deviation information for their data; or users can measure the correlations between attributes and measure the strength of relationships using hypothesis testing statistics such as a t-test, f-test or ANOVA. The SQL Aggregate functions return a single result row based on groups of rows, rather than on single rows while the SQL Analytical functions compute an aggregate value based on a group of rows.

SQL statistical functions include:

• Descriptive statistics
• Hypothesis testing
• Correlations analysis (parametric and nonparametric)
• Ranking functions
• Cross Tabulations with Chi-square statistics
• Linear regression
• ANOVA
• Test Distribution fit
• Window Aggregate functions
• Statistical Aggregates
• Reporting aggregate functions

STATS_T_TEST_INDEPU Example: The following example determines the significance of the difference between the average sales to men and women where the distributions are known to have significantly different (unpooled) variances:

SELECT SUBSTR(cust_income_level, 1, 22) income_level,
AVG(DECODE(cust_gender, 'M', amount_sold, null)) sold_to_men,
AVG(DECODE(cust_gender, 'F', amount_sold, null)) sold_to_women,
STATS_T_TEST_INDEPU(cust_gender, amount_sold, 'STATISTIC', 'F') t_observed,
STATS_T_TEST_INDEPU(cust_gender, amount_sold) two_sided_p_value
FROM sh.customers c, sh.sales s WHERE c.cust_id = s.cust_id
GROUP BY ROLLUP(cust_income_level)
ORDER BY income_level, sold_to_men, sold_to_women, t_observed;

INCOME_LEVEL           SOLD_TO_MEN SOLD_TO_WOMEN T_OBSERVED TWO_SIDED_P_VALUE
---------------------- ----------- ------------- ---------- -----------------
A: Below 30,000          105.28349    99.4281447 -2.0542592        .039964704
B: 30,000 - 49,999       102.59651    109.829642 2.96922332        .002987742
C: 50,000 - 69,999      105.627588    110.127931  2.3496854        .018792277
D: 70,000 - 89,999      106.630299    110.47287  2.26839281        .023307831
E: 90,000 - 109,999     103.396741    101.610416 -1.2603509        .207545662
F: 110,000 - 129,999     106.76476    105.981312 -.60580011        .544648553
G: 130,000 - 149,999    108.877532    107.31377  -.85219781        .394107755
H: 150,000 - 169,999    110.987258    107.152191 -1.9451486        .051762624
I: 170,000 - 189,999    102.808238    107.43556  2.14966921        .031587875
J: 190,000 - 249,999    108.040564    115.343356 2.54749867        .010854966
K: 250,000 - 299,999    112.377993    108.196097 -1.4115514        .158091676
L: 300,000 and above    120.970235    112.216342 -2.0726194        .038225611
107.121845    113.80441  .689462437        .490595765
106.663769    107.276386 1.07853782        .280794207
14 rows selected.

(See link below to SQL Language Reference for STATS_T_TEST_*)

Most statistical software vendors charge license fees for these statistical capabilities.  Oracle includes them in every Oracle Database. Users can reduce annual license fees and perform the equivalent basic statistical functionality while keeping big data and analytics simple in a single, unified, consistent, scalable and secure Oracle Database platform.  Because the statistical functions are native SQL functions, statistical results can be immediately used across the Oracle stack - unleashing many more opportunities to leverage your results in spontaneous and unexpected ways.

Additionally, Oracle Advanced Analytics' Oracle R Enterprise component exposes the SQL statistical functions through the R statistical programming language and allows R users to use R statistical functions e.g. Summary but then pushes down the R functions to the equivalent SQL statistical functions for avoidance of data movement and significant in-database performance gains. The SQL Developer Oracle Data Miner workflow GUI extension also leverages the SQL statistical functions in the Explore, Graph, SQL Query and Transform nodes.