NEW ORAAH RELEASE 2.6.0: Introducing support for Apache Spark MLlib machine learning algorithms from R.

The latest release of Oracle R Advanced Analytics for Hadoop (ORAAH), release 2.6.0, is one of the components of the Oracle Big Data Connectors release 4.5 software suite, an option for the Oracle Big Data Appliance. At its core, ORAAH provides an R interface for manipulating data stored in HDFS, both through its Hive transparency capabilities and by mapping HDFS data as direct input to machine learning algorithms that can run as Map-Reduce jobs or inside an Apache Spark container.

New in release 2.6.0 is a set of APIs that lets the ORAAH user call nine different Apache Spark MLlib algorithms from a single line of R code.

New Apache Spark MLlib algorithms available from R in ORAAH (source data can be CSV files in HDFS or Hive tables):

  • Linear Regression
  • LASSO (Least Absolute Shrinkage and Selection Operator)
  • Ridge Regression
  • Logistic Regression
  • Decision Trees
  • Random Forest
  • Support Vector Machines
  • k-Means Clustering
  • Principal Component Analysis

New features created for ORAAH Spark MLlib algorithms

To support the new machine learning algorithms from Apache Spark, several new functions were created:
  • Updated predict() functions that use Spark to score new datasets with the Spark-based models.
  • A new function, hdfs.write(), that writes model objects and prediction results from Spark RDDs back to HDFS.
  • A new pair of functions, orch.save.model() and orch.load.model(), to save and load model objects that can be used to score new datasets on the same or other clusters via Spark.

ORAAH 2.6.0 also introduces a Distributed Model Matrix engine and a Distributed Formula parser, which are used with all of the Spark MLlib-based algorithm interfaces to greatly improve performance and to enhance functional compatibility with R by closely following the R formula syntax. Internally, the Distributed Model Matrices are stored as Spark RDDs (Resilient Distributed Datasets).
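
For readers less familiar with R formulas, here is a minimal sketch in plain base R of what a formula parser and a model matrix do on a single machine. It only illustrates the R formula syntax that ORAAH is described as following; it does not call ORAAH or Spark, and the toy data frame and its column names are hypothetical.

```r
# Plain base R, run locally; no ORAAH or Spark calls are used here.
# Hypothetical toy data: a numeric predictor and a three-level factor.
df <- data.frame(
  y     = c(1.2, 2.3, 2.9, 4.1, 5.0, 6.2),
  x     = c(1, 2, 3, 4, 5, 6),
  group = factor(c("a", "b", "c", "a", "b", "c"))
)

# The formula names the response (left) and the predictors (right).
f <- y ~ x + group

# model.matrix() expands the right-hand side into a numeric design matrix:
# an intercept column, x as-is, and one dummy column for each
# non-reference level of the factor (default treatment contrasts).
mm <- model.matrix(f, data = df)

print(colnames(mm))
print(dim(mm))
```

With R's default contrasts, the factor level "a" becomes the reference level, so the matrix has four columns: the intercept, x, and one indicator column each for levels "b" and "c". A distributed engine has to produce the same kind of expansion, but over data partitioned across a cluster.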

Additional ORAAH Platform Updates
  • Support for Cloudera Distribution of Hadoop (CDH) release 5.7.0. Both “classic” MR1 and YARN MR2 APIs are supported.
  • Support for HiveServer2 in the Hive transparency layer, a major improvement. ORAAH now uses a JDBC/Thrift connection to communicate with Hive and execute Hive queries on remote or local Hadoop clusters, replacing the previous CLI execution layer. This greatly improves performance and lowers the latency of Hive queries.
  • A new Intel MKL release added to the installer package: Intel® Math Kernel Library Version 11.1.0 Product Build 20130711 for Intel® 64 architecture applications.

For more information, see the Change List for ORAAH version 2.6.0 and the Release Notes for ORAAH version 2.6.0.

In addition to these new interfaces to Spark MLlib algorithms, ORAAH provides nine prepackaged algorithms, including:

  • High-performance Logistic Regression (Spark-based) - orch.glm2()
  • High-performance Multi-Layer Perceptron Feed Forward Neural Networks (Spark and Map-Reduce versions) - orch.neural()
  • Generalized Linear Models (GLM) Map-Reduce - orch.glm()
  • Linear Regression Map-Reduce - orch.lm()
  • Principal Component Analysis (PCA) Map-Reduce - orch.pca()
  • k-Means Clustering Map-Reduce - orch.kmeans()
  • Non-negative Matrix Factorization (NMF) Map-Reduce - orch.nmf()
  • Low-Rank Matrix Factorization (for collaborative filtering) Map-Reduce - orch.lmf()
  • Correlation and Covariance matrix computations based on Map-Reduce - orch.cor()/orch.cov()

As a sample of what the new Spark-based algorithms achieve: compared with the same models running on Map-Reduce, they run up to 217X faster at 1 billion records.

The first part of a series of blog posts dedicated to ORAAH gives more examples of the execution speeds of the ORAAH Spark-based algorithms:
Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks