
The latest release of Oracle R Advanced Analytics for Hadoop (ORAAH), release 2.8.0, is one of the components of the Oracle Big Data Connectors software suite. 

ORAAH is available:

ORAAH is a set of R packages and Java libraries that provide:
  • An R interface for manipulating data stored in a local File System, HDFS, HIVE, Impala or JDBC sources, and creating Distributed Model Matrices across a Cluster of Hadoop Nodes in preparation for ML
  • A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages
  • Parallel and distributed Machine Learning algorithms that take advantage of all the nodes of a Hadoop cluster for scalable, high performance modeling on big data. Functions use the expressive R formula object optimized for Spark parallel execution
  • ORAAH's custom LM/GLM/MLP NN algorithms on Spark scale better and run faster than the open-source Spark MLlib functions, but ORAAH provides interfaces to MLlib as well
  • Starting with ORAAH 2.8.0, all the Core Analytics functionality has been decoupled and consolidated into a standalone ORAAH Core Analytics Java library that can be used directly without the R language, and can be called from any Java or Scala platform. This allows direct, efficient and easy integration with ORAAH Core Analytics. For more information, refer to "oraah-analytics-library-2.8.0-scaladoc", packed as a zip file in the installer package.

Customer Video

Energy Australia Improves Customer Experience with Oracle and ORAAH

Gaurav Singh, Big Data and Data Warehouse Solution Architect, Design & Execution of Energy Australia, speaks about using Big Data Analytics to better understand customers and improve their experience.

To learn more about the use case, click here.

New for release 2.8.0:
  • Interface with Spark 2.x (due to new Spark APIs, starting with this release ORAAH is no longer compatible with Spark 1.6.0)
  • Certified with CDH 5.13, 5.14 and 5.15, and Spark releases 2.1.0 up to 2.3.1
  • New Algorithm: ELM (Extreme Learning Machines) and Hierarchical-ELM (Oracle’s MPI/Spark-engines) - for Classification and Regression
  • New Algorithm: Distributed Stochastic PCA (Oracle’s MPI/Spark-engines)
  • New Algorithm: Distributed Stochastic SVD (Oracle’s MPI/Spark-engines)
  • New model statistics for LM (lm2) and GLM (glm2) Algorithms
  • A new Activation Function, "softmax", is available for ORAAH's MLP Neural Networks algorithm and can be used for binary and multi-class Targets (previously only "entropy" could be used for binary targets)
  • Probabilities are now returned for the Spark MLlib Classification algorithms
  • New options for GLM solver: IRLS (Iteratively Reweighted Least Squares) for precision, and L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) for speed and larger problems.
  • New IMPALA transparency interface, for fast in-memory queries and Data Preparation on large Datasets from R
  • New "orch.summary()" function to run summary statistics on HIVE tables with improved performance
  • New Spark DF data manipulation and summary functions: collect, scale (12 standard scaling techniques), SQL query, CSV import into Spark DF, summary (statistics), describe and persist/unpersist in memory
  • New JDBC Interface to define any SQL source as input to the formula engine and any algorithm
  • New interface to Spark MLlib GBT - Gradient Boosted Trees - for Classification and Regression
  • New faster loading of the ORAAH libraries after the first successful loading for the same user
  • Performance enhancements when interfacing with HIVE, HDFS (new Java-based client interface) and R data.frames
  • spark.connect() now automatically resolves the active HDFS Namenode for the cluster if the user does not specify the dfs.namenode parameter
  • spark.connect() now loads the default Spark configuration from the spark-defaults.conf file if it is found in the CLASSPATH
  • hdfs.toHive() and hdfs.fromHive() now work with the newly supported Apache Impala connection as well
  • A new field "queue" has been added to the mapred.config object, which is used to configure MapReduce jobs written in R using ORAAH. It allows the user to select the queue to which the MapReduce task will be submitted
  • New support for the option "-Xms" to improve the configuration of the rJava JVM memory that runs alongside ORAAH
  • The ORAAH R Formula engine introduces support for:
    • Main statistical distributions, with density, cumulative distribution, quantile and random deviates functions for the following distributions: Beta, Binomial, Cauchy, Chi-squared, Exponential, F-distribution, Gamma, Geometric, Hypergeometric, Log-normal, Normal, Poisson, Student's t, Triangular, Uniform, Weibull and Pareto
    • Special functions:  gamma(x), lgamma(x), digamma(x), trigamma(x), lanczos(x), factorial(x), lfactorial(x), lbeta(a, b), lchoose(n, k) - natural logarithm of the binomial coefficient
    • New Aggregate Functions: avg(colExpr) which is the same as mean(colExpr), max(colExpr), min(colExpr), sd(colExpr) which is the same as stddev(colExpr), sum(colExpr), variance(colExpr) which is the same as var(colExpr), kurtosis(colExpr), skewness(colExpr), where colExpr is an arbitrary (potentially nonlinear) column expression.
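The special and aggregate functions listed above have standard mathematical definitions. The following is a minimal Python sketch of a few of them, not ORAAH's implementation: the helper names `lbeta`, `lchoose` and `aggregates` are illustrative Python functions, and the conventions chosen for variance (sample, n - 1 denominator), skewness and kurtosis (population moments, excess kurtosis) are assumptions, since the page does not state which conventions the formula engine uses.

```python
import math


def lbeta(a, b):
    """Natural log of the Beta function, via log-gamma to avoid overflow."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def lchoose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def aggregates(values):
    """Summary statistics of a numeric column (needs >= 2 distinct values)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    variance = m2 * n / (n - 1)  # assumed: sample variance
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "variance": variance,
        "sd": math.sqrt(variance),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # assumed: excess kurtosis
    }


stats = aggregates([1.0, 2.0, 3.0, 4.0, 5.0])
log_ways = lchoose(5, 2)  # log of C(5, 2) = log(10)
```

Working in log space, as `lchoose` and `lbeta` do, is what makes these functions usable for large arguments where the plain binomial coefficient or Beta function would overflow or underflow.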
Continuing on the great benefits of ORAAH
  • A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages. Support for binary RData representation of input data enables R-based MapReduce jobs to match the I/O performance of pure Java-based MapReduce programs.
  • Parallel and distributed machine learning algorithms take advantage of all the nodes of your Hadoop cluster for scalable, high performance modeling on big data. Algorithms include linear regression, generalized linear models, neural networks, low rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis, and multivariate analysis. Functions use the expressive R formula object optimized for Spark parallel execution.
  • R functions wrap Apache Spark MLlib algorithms within the ORAAH framework using the R formula specification and Distributed Model Matrix data structure. ORAAH's MLlib R functions can be executed either on a Hadoop cluster using YARN to dynamically form a Spark cluster, or on a dedicated standalone Spark cluster.
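The MapReduce framework in the first bullet above follows the usual map, shuffle and reduce flow. The sketch below simulates that flow on a single machine in Python, purely to illustrate the roles of a custom mapper and reducer. In ORAAH these would be R functions submitted to Hadoop, and nothing here reflects the ORAAH API: the function names and the small flight-delay records are hypothetical.

```python
from collections import defaultdict


def mapper(record):
    """Emit (key, value) pairs: carrier as key, (delay, count) as value."""
    carrier, delay = record
    yield carrier, (delay, 1)


def reducer(key, values):
    """Combine all values that share a key into the mean delay."""
    total = sum(delay for delay, _ in values)
    count = sum(n for _, n in values)
    return key, total / count


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one call per key
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Hypothetical (carrier, arrival delay in minutes) records.
flights = [("AA", 10.0), ("UA", 5.0), ("AA", 20.0), ("UA", 15.0), ("DL", 0.0)]
mean_delay = run_mapreduce(flights, mapper, reducer)
```

On a real cluster the map calls run in parallel on the nodes that hold the data blocks, and the shuffle moves each key's values to the node running its reducer. The user-supplied logic stays as small as the two functions shown.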
Same great Platform for R Analytics at scale:
See this blog post for additional details.

Sample performance on a Big Data Appliance X7-2 (Spark 2.2.0 on YARN, 6 Nodes with 48 cores and 256 GB of RAM per Node)

Benchmark of all available Binary Classification algorithms on an airline Dataset (ontime) with 1 billion rows. Model build and Model Scoring times are compared.

Benchmark of scalability of ORAAH's GLM vs. Spark MLlib's Logistic algorithm on an airline Dataset (ontime). Model build times are compared. At 10 billion rows, the dataset of 1 TB no longer fits in memory (learn more about Spark Memory Management) and Spark MLlib crashes.
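A rough back-of-the-envelope check shows why a 1 TB dataset can exceed Spark's usable memory on this hardware even though the cluster has about 1.5 TB of RAM in total. The calculation below is a sketch under stated assumptions, not the benchmark's actual configuration: it optimistically gives each node's entire RAM to a single executor heap, reads "1 TB" as 1000 GB, and uses the Spark 2.x defaults of `spark.memory.fraction = 0.6` and 300 MB of reserved memory per executor.

```python
NODES = 6                    # from the benchmark description
RAM_PER_NODE_GB = 256        # from the benchmark description
DATASET_GB = 1000.0          # assumption: "1 TB" read as 1000 GB
ROWS = 10_000_000_000        # 10 billion rows

SPARK_MEMORY_FRACTION = 0.6  # Spark 2.x default for spark.memory.fraction
RESERVED_GB = 0.3            # 300 MB reserved per executor JVM

# Raw RAM across the cluster.
total_ram_gb = NODES * RAM_PER_NODE_GB

# Average row size implied by the dataset size and row count.
bytes_per_row = DATASET_GB * 1e9 / ROWS

# Optimistic upper bound on Spark's unified (execution + storage) memory:
# assumes one executor per node whose heap is the whole node's RAM.
unified_memory_gb = NODES * (RAM_PER_NODE_GB - RESERVED_GB) * SPARK_MEMORY_FRACTION

fits_in_unified_memory = DATASET_GB <= unified_memory_gb
```

Under these assumptions the cluster has 1536 GB of raw RAM but only about 920 GB of unified Spark memory, which is below the 1 TB dataset size. Real deployments leave room for the operating system, YARN and off-heap overhead, so the practical limit would be lower still.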

Benchmark of ORAAH's GLM (Logistic Regression) scalability for Model building and Model Scoring, from 100k to 10 billion rows.

ORAAH high-performance Spark (and MPI) based algorithms available from R (source data can be CSVs in HDFS, HIVE tables or Spark DataFrames) 
  • Linear Regression - orch.lm2() 
  • Logistic Regression - orch.glm2()
  • Multilayer Perceptron Neural Networks - orch.neural2()
  • Extreme Learning Machines (ELM) - orch.elm()
  • Hierarchical-ELM - orch.helm()
  • Distributed Stochastic PCA - orch.dspca()
  • Distributed Stochastic SVD - orch.dssvd()
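The release notes above list IRLS (Iteratively Reweighted Least Squares) as a GLM solver option. As an illustration of the technique only, here is a minimal single-machine Python sketch of IRLS for logistic regression with an intercept and one predictor. It is not ORAAH's distributed implementation, and the function name, data and tolerances are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def irls_logistic(xs, ys, max_iter=25, tol=1e-10):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton/IRLS updates."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Accumulate the gradient X^T (y - p) and the Hessian X^T W X,
        # where W is diagonal with weights p * (1 - p).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 weighted least-squares system H * delta = g.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Small, non-separable toy data so that the maximum likelihood fit is finite.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(xs, ys)
```

Each iteration needs only sums over the rows (the five accumulators above), which is why this kind of solver parallelizes naturally: partitions can compute partial sums independently and a driver can combine them before solving the small linear system.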
Set of Apache Spark MLlib algorithms available from R in ORAAH (source data can be CSVs in HDFS, HIVE tables or Spark DataFrames)
  • Gradient-Boosted Trees -
  • Gaussian Mixture Models - 
  • Linear Regression -
  • LASSO - Least Absolute Shrinkage and Selection Operator -
  • Ridge Regression -
  • Logistic Regression -
  • Decision Trees -
  • Random Forest -
  • Support Vector Machines -
  • k-Means Clustering -
  • Principal Component Analysis -
Features of ORAAH Spark MLlib algorithms

To support the new Machine Learning algorithms from Apache Spark, several special functions are available:
  • Updated predict() functions for scoring new datasets with the Spark-based models; scoring runs on Spark.
  • A new function hdfs.write() to write model objects and prediction results from Spark RDDs and Data Frames back to HDFS.
  • A new set of functions, model() and orch.load.model(), to save and load Model objects that can be used to score new datasets on the same or other clusters via Spark
  • ORAAH includes a Distributed Model Matrix engine and a Distributed Formula parser that are used in conjunction with all Spark MLlib-based algorithm interfaces to greatly improve the performance and enhance functional compatibility with R (following closely the R formula syntax). Internally the Distributed Model Matrices are stored as Spark RDDs (Resilient Distributed Datasets).
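To make the idea of a model matrix concrete, the sketch below expands an R-style additive formula into a numeric design matrix in Python: an intercept column, numeric predictors as-is, and categorical predictors as dummy columns with the first level dropped (R's default treatment coding). This is a deliberately simplified, single-machine illustration and not ORAAH's parser. It handles only `response ~ a + b` formulas, and the function name and sample rows are hypothetical.

```python
def model_matrix(rows, formula):
    """Expand 'y ~ a + b' over a list of dict rows into (columns, X, y)."""
    response, rhs = [part.strip() for part in formula.split("~")]
    terms = [term.strip() for term in rhs.split("+")]

    columns = ["(Intercept)"]
    encoders = []
    for term in terms:
        values = [row[term] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric predictor: a single column holding the value itself.
            columns.append(term)
            encoders.append(lambda row, t=term: [float(row[t])])
        else:
            # Categorical predictor: one 0/1 column per level, except the
            # first (baseline) level, as in R's treatment contrasts.
            levels = sorted(set(values))[1:]
            columns.extend(term + level for level in levels)
            encoders.append(
                lambda row, t=term, lv=levels: [
                    1.0 if row[t] == level else 0.0 for level in lv
                ]
            )

    X = [[1.0] + [v for enc in encoders for v in enc(row)] for row in rows]
    y = [row[response] for row in rows]
    return columns, X, y


# Hypothetical rows resembling an airline on-time dataset.
rows = [
    {"delay": 12.0, "distance": 300, "carrier": "AA"},
    {"delay": 3.0, "distance": 1200, "carrier": "UA"},
    {"delay": 25.0, "distance": 650, "carrier": "DL"},
]
columns, X, y = model_matrix(rows, "delay ~ distance + carrier")
```

Once the set of factor levels is known, each row can be encoded independently of every other row, which is the property that lets a model matrix be built partition by partition and stored as a distributed dataset.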
Additional ORAAH Platform Updates

Support for Cloudera Distribution of Hadoop (CDH) release 5.14.x and 5.15.x. Both “classic” MR1 and YARN MR2 APIs are supported.

For more information, see the Change List for ORAAH version 2.8.0 and the Release Notes for ORAAH version 2.8.0.

The first part of a series of Blog Posts dedicated to ORAAH illustrates more examples of the execution speeds of the ORAAH Spark-based algorithms:

Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks
