NEW ORAAH RELEASE 2.7.0: Introducing the fastest GLM and LM algorithms on Spark with full summary, enhanced Deep Neural Networks and support for Spark MLlib Gaussian Mixture Models.
Oracle R Advanced Analytics for Hadoop (ORAAH) release 2.7.0 is one of the components of the Oracle Big Data Connectors software suite, an option to the Oracle Big Data Appliance. At its core, ORAAH provides an R interface for manipulating data stored in HDFS, both through HIVE transparency capabilities and by mapping HDFS data as direct input to Machine Learning algorithms that can run as Map Reduce jobs or inside an Apache Spark container.

New in release 2.7.0 are updated ORAAH GLM and LM algorithms, which are much faster, more stable and lighter on memory than the comparable GLM and LM methods from Spark MLlib. Both methods also bring a new summary feature that makes their output comparable to that of open-source R glm and lm, while being capable of handling Big Data at enterprise scale.
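
For reference, the open-source R behaviour that the new summary feature is compared with can be sketched as follows. This uses only base R's lm(), glm() and summary() on the built-in mtcars data set; it does not call ORAAH, and the choice of data set and formula is purely illustrative.

```r
# Reference behaviour in open-source R (base "stats" package only).
# The mtcars data set and the formulas below are illustrative assumptions.

# Linear model: fuel consumption as a function of weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
lm_summary <- summary(fit_lm)

# Coefficient table: estimate, standard error, t value and p-value per term.
print(coef(lm_summary))
print(lm_summary$r.squared)

# Generalized linear model: logistic regression on transmission type.
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial())
glm_summary <- summary(fit_glm)

# Coefficient table: estimate, standard error, z value and p-value per term.
print(coef(glm_summary))
print(glm_summary$aic)
```

The coefficient tables, R-squared and AIC shown here are the kind of statistics a full summary provides in open-source R.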

The Neural Networks algorithm has been enhanced to support full formula processing, with both model build and scoring running in Spark.

Gaussian Mixture Models is a new addition to the set of Spark MLlib algorithms supported by ORAAH.

New functionality and higher performance for ORAAH's own algorithms:

  • ORAAH's Spark-based LM with full formula support and summary - orch.lm2() - NEW!
  • ORAAH's Spark-based GLM with full formula support and summary - orch.glm2() - NEW features!

[Figure: sample performance of both algorithms against the equivalent Spark MLlib algorithms, on the same hardware and with the same Spark settings.]

Set of Apache Spark MLlib algorithms available from R in ORAAH (source data can be CSV files in HDFS or HIVE tables):

  • Gaussian Mixture Models - orch.ml.gmm() - NEW!
  • Linear regression - orch.ml.linear()
  • LASSO - Least Absolute Shrinkage and Selection Operator - orch.ml.lasso()
  • Ridge Regression - orch.ml.ridge()
  • Logistic Regression - orch.ml.logistic()
  • Decision Trees - orch.ml.dt()
  • Random Forest - orch.ml.random.forest()
  • Support Vector Machines - orch.ml.svm()
  • k-Means Clustering - orch.ml.kmeans()
  • Principal Component Analysis - orch.ml.pca()

Features of ORAAH Spark MLlib algorithms

To support the new Machine Learning algorithms from Apache Spark, several special functions are available:
  • Updated predict() functions for scoring new datasets in Spark using the Spark-based models.
  • A new function hdfs.write() that writes model objects and prediction results from Spark RDDs back to HDFS.
  • A new set of functions, orch.save.model() and orch.load.model(), to save and load model objects that can be used to score new datasets on the same or other clusters via Spark.
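
The build, save, load and score workflow that these functions support can be sketched locally with base R analogues. This is only an analogy under stated assumptions: saveRDS(), readRDS(), predict() and write.csv() are base R functions operating on the local file system, not the ORAAH functions listed above, and the data split and file names are hypothetical.

```r
# Local analogue of a build / save / load / score workflow, base R only.
# No ORAAH function is called here; the ORAAH argument lists are not shown
# because they are not described in this announcement.

# Hypothetical split of the built-in mtcars data into training and new data.
train <- mtcars[1:24, ]
new_data <- mtcars[25:32, ]

# Build a model on the training rows.
model <- lm(mpg ~ wt + hp, data = train)

# Persist the model object (local stand-in for saving a model).
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# Restore the model, as another session or machine would.
restored <- readRDS(model_path)

# Score the new rows with the restored model.
scores <- predict(restored, newdata = new_data)

# Write the predictions out (local stand-in for writing results to HDFS).
out_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(car = rownames(new_data), predicted_mpg = scores),
  out_path,
  row.names = FALSE
)
```

Separating the model object from the scoring step is what allows a model built once to be reused against new datasets later or elsewhere.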

ORAAH 2.7.0 includes a Distributed Model Matrix engine and a Distributed Formula parser that are used in conjunction with all Spark MLlib-based algorithm interfaces to greatly improve the performance and enhance functional compatibility with R (following closely the R formula syntax). Internally the Distributed Model Matrices are stored as Spark RDDs (Resilient Distributed Datasets).
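
As a point of reference for the R formula syntax mentioned above, the following minimal sketch shows how open-source R expands a formula into a model matrix in memory. It uses base R's model.matrix() on a small made-up data frame; it illustrates the formula semantics only and says nothing about how ORAAH implements its distributed engine.

```r
# How base R turns a formula into a model matrix (in memory, single machine).
# The data frame below is made-up illustrative data.
df <- data.frame(
  y = c(1.0, 2.5, 3.1, 4.8),
  x = c(0.5, 1.0, 1.5, 2.0),
  g = factor(c("a", "b", "a", "b"))
)

# Numeric term, factor term (dummy-coded) and their interaction.
mm <- model.matrix(y ~ x + g + x:g, data = df)

# One row per observation; the factor g becomes the indicator column "gb".
print(colnames(mm))  # "(Intercept)" "x" "gb" "x:gb"
print(dim(mm))       # 4 rows, 4 columns
```

A distributed model matrix has to produce the same kind of expansion (intercept, dummy-coded factors, interactions) across partitions of the data instead of in a single in-memory matrix.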

Additional ORAAH Platform Updates
  • Support for Cloudera Distribution of Hadoop (CDH) release 5.8.0. Both “classic” MR1 and YARN MR2 APIs are supported.
  • A new Intel MKL release was added to the installer package: Intel® Math Kernel Library Version 2017 Update 1 for Intel® 64 architecture applications.

For more information, see the Change List and the Release Notes documents for ORAAH version 2.7.0.

In addition to these new interfaces to Spark MLlib algorithms, ORAAH provides eight prepackaged algorithms, including:

  • High-performance Multi-Layer Perceptron Feed Forward Neural Networks (Spark and Map-Reduce versions) - orch.neural()
  • Generalized Linear Models (GLM) Map-Reduce - orch.glm()
  • Linear Regression Map-Reduce - orch.lm()
  • Principal Component Analysis (PCA) Map-Reduce - orch.pca()
  • k-Means Clustering Map-Reduce - orch.kmeans()
  • Non-negative Matrix Factorization (NMF) Map-Reduce - orch.nmf()
  • Low-Rank Matrix Factorization (for collaborative filtering) Map-Reduce - orch.lmf()
  • Correlation and Covariance matrix computations based on Map-Reduce - orch.cor()/orch.cov()

The first part of a series of blog posts dedicated to ORAAH shows more examples of the execution speeds of the ORAAH Spark-based algorithms:
Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks