
The latest release of Oracle R Advanced Analytics for Hadoop (ORAAH), release 2.7.1, is a component of the Oracle Big Data Connectors software suite, an option for the Oracle Big Data Appliance. At its core, ORAAH provides an R interface for manipulating data stored in HDFS, using HIVE transparency capabilities and mapping HDFS data as direct input to Machine Learning algorithms that can run as MapReduce jobs or inside an Apache Spark container.

New in release 2.7.1

  • Compatible with Oracle R Distribution 3.3.0 and Oracle R Enterprise 1.5.1
  • New interface to Graph Analytics using Oracle's PGX OAA Graph R package, allowing for Machine Learning and Graph Analytics to be executed on the same R script in a Big Data environment
  • Performance enhancements when interfacing with HIVE, HDFS and R data.frames
  • Experimental option to use Spark as the HIVE Execution Engine: ore.hiveOptions(exeEngine='spark')
  • Improved installation experience, with many automated configuration features
  • Improved demos with enhanced cleanup of sample data
Same great Platform for R Analytics at scale:
  • A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages. Support for binary RData representation of input data enables R-based MapReduce jobs to match the I/O performance of pure Java-based MapReduce programs.
  • Parallel and distributed machine learning algorithms take advantage of all the nodes of your Hadoop cluster for scalable, high performance modeling on big data. Algorithms include linear regression, generalized linear models, neural networks, low rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis, and multivariate analysis. Functions use the expressive R formula object optimized for Spark parallel execution.
  • R functions wrap Apache Spark MLlib algorithms within the ORAAH framework using the R formula specification and Distributed Model Matrix data structure. ORAAH's MLlib R functions can be executed either on a Hadoop cluster using YARN to dynamically form a Spark cluster, or on a dedicated standalone Spark cluster.
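
The mapper and reducer model described in the first item above can be sketched locally in base R. This is only an illustration of the pattern, assuming a word-count task; it makes no ORAAH or Hadoop calls, and the names mapper(), reducer() and run_local_mr() are hypothetical helpers, not part of the ORAAH API.

```r
# Local emulation of the map / shuffle / reduce pattern in base R.
# Assumption: no ORAAH or Hadoop calls are used here; mapper(), reducer()
# and run_local_mr() are hypothetical names chosen for this sketch.

# Mapper: receives one input record (a line of text) and emits key/value pairs.
mapper <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1L))
}

# Reducer: receives one key together with all values emitted for that key.
reducer <- function(key, values) {
  list(key = key, value = sum(unlist(values)))
}

run_local_mr <- function(records, mapper, reducer) {
  # Map phase: apply the mapper to each record and concatenate the emitted pairs.
  pairs <- do.call(c, lapply(records, mapper))
  keys <- vapply(pairs, function(p) p$key, character(1))
  vals <- lapply(pairs, function(p) p$value)
  # Shuffle phase: group the emitted values by key.
  grouped <- split(vals, keys)
  # Reduce phase: one reducer call per distinct key.
  out <- Map(reducer, names(grouped), grouped)
  vapply(out, function(o) o$value, integer(1))
}

records <- c("big data in R", "R on Hadoop", "big models on big data")
counts <- run_local_mr(records, mapper, reducer)
print(counts)
```

On a cluster the map and reduce phases run in parallel across nodes; the sequential version here only shows how the two user-written functions divide the work.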

See this blog post for additional details.

Sample performance of the ORAAH LM and GLM algorithms compared with the equivalent Spark MLlib algorithms on the same hardware and with the same Spark settings.

ORAAH high-performance Spark-based algorithms available from R (source data can be CSVs in HDFS or HIVE tables) 

  • Linear Regression - orch.lm2() 
  • Logistic Regression - orch.glm2()
  • Multilayer Perceptron Neural Networks - orch.neural2()

Set of Apache Spark MLlib algorithms available from R in ORAAH (source data can be CSVs in HDFS or HIVE tables) 

  • Gaussian Mixture Models - 
  • Linear Regression -
  • LASSO - Least Absolute Shrinkage and Selection Operator -
  • Ridge Regression -
  • Logistic Regression -
  • Decision Trees -
  • Random Forest -
  • Support Vector Machines -
  • k-Means Clustering -
  • Principal Component Analysis -
Features of ORAAH Spark MLlib algorithms

To support the new Machine Learning algorithms from Apache Spark, several special functions are available:
  • Updated predict() functions for scoring new datasets in Spark with the Spark-based models.
  • A new function, hdfs.write(), for writing model objects and prediction results from Spark RDDs back to HDFS.
  • A new set of functions, model() and orch.load.model(), to save and load Model objects that can be used to score new datasets on the same or other clusters via Spark.
ORAAH 2.7.1 includes a Distributed Model Matrix engine and a Distributed Formula parser that are used in conjunction with all Spark MLlib-based algorithm interfaces to greatly improve the performance and enhance functional compatibility with R (following closely the R formula syntax). Internally the Distributed Model Matrices are stored as Spark RDDs (Resilient Distributed Datasets).
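
To make the idea of a model matrix built from an R formula and split across partitions more concrete, here is a minimal base R sketch. It is not ORAAH's implementation and uses no Spark or ORAAH calls: it assumes a least-squares fit, uses the built-in mtcars data as a stand-in for a distributed dataset, and fit_partitioned() is a hypothetical helper name.

```r
# Illustration only: base R, no ORAAH or Spark calls.
# Assumption: fit_partitioned() is a hypothetical helper written for this sketch.
fit_partitioned <- function(formula, data, n_parts = 4) {
  # Assign each row to a partition, standing in for blocks of a distributed dataset.
  part_id <- rep_len(seq_len(n_parts), nrow(data))
  parts <- split(data, part_id)

  # Per-partition work: expand the formula into a model matrix and
  # compute the partial cross-products X'X and X'y for that block.
  partials <- lapply(parts, function(chunk) {
    mf <- model.frame(formula, chunk)
    X <- model.matrix(attr(mf, "terms"), mf)
    y <- model.response(mf)
    list(xtx = crossprod(X), xty = crossprod(X, y))
  })

  # Combine step: the partial cross-products simply add up.
  xtx <- Reduce(`+`, lapply(partials, `[[`, "xtx"))
  xty <- Reduce(`+`, lapply(partials, `[[`, "xty"))

  # Solve the normal equations for the coefficient vector.
  drop(solve(xtx, xty))
}

beta <- fit_partitioned(mpg ~ wt + hp, mtcars)
reference <- coef(lm(mpg ~ wt + hp, data = mtcars))
print(all.equal(beta, reference, tolerance = 1e-6))
```

Because each partition only contributes small summary matrices, the full model matrix never has to be assembled in one place, which is the general reason a partitioned representation suits cluster execution.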

Additional ORAAH Platform Updates
  • Support for Cloudera Distribution of Hadoop (CDH) release 5.12.0. Both “classic” MR1 and YARN MR2 APIs are supported.
  • A new Intel MKL release was added to the installer package: Intel® Math Kernel Library Version 2017 for Intel® 64 architecture applications.

For more information, see the Change List for ORAAH version 2.7.1 and the Release Notes for ORAAH version 2.7.1.

The first part of a series of blog posts dedicated to ORAAH gives more examples of the execution speeds of the ORAAH Spark-based algorithms:

Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks