
Oracle R Advanced Analytics for Hadoop (ORAAH) is one of the components in the Oracle Big Data Connectors software suite, an option to the Oracle Big Data Appliance. At its core, ORAAH provides an R interface for manipulating data stored in HDFS, both through Hive transparency capabilities and by mapping HDFS data as direct input to machine learning algorithms.

The newly released Oracle R Advanced Analytics for Hadoop 2.5.0 includes two new algorithm implementations that can take advantage of an Apache Spark cluster for significant performance gains in model build and scoring time. These algorithms are a redesigned version of the Multi-Layer Perceptron Neural Networks (orch.neural) and a brand new implementation of a Logistic Regression model (orch.glm2).

The platform also allows mapper and reducer functions to be written in R, where open source CRAN packages can be leveraged. Users can pass R objects from the client R object space to their mapper and reducer functions, and can test MapReduce jobs locally in their client R engine without changing any code, just by switching a system flag. This makes it easy to debug code before unleashing it on the full Hadoop cluster.
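The local-testing idea described above can be illustrated with a minimal, language-agnostic sketch. It is written in Python rather than R and does not use the ORAAH API: the names `DRY_RUN`, `run_local`, and `run_job` are hypothetical, chosen only to show how the same mapper and reducer can run unchanged in-process when a single flag is switched.

```python
from collections import defaultdict

# Hypothetical flag standing in for the "system flag" mentioned in the text.
DRY_RUN = True


def mapper(key, value):
    # Emit one (word, 1) pair per word in the input line.
    for word in value.split():
        yield word.lower(), 1


def reducer(key, values):
    # Sum the counts collected for one key.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Run a map-reduce job in-process: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


def run_job(records, mapper, reducer, dry_run=DRY_RUN):
    # The mapper and reducer are identical in both modes; only the runner changes.
    if dry_run:
        return run_local(records, mapper, reducer)
    raise NotImplementedError("cluster submission is outside the scope of this sketch")


sample = [(1, "Spark and Hadoop"), (2, "hadoop and R")]
counts = run_job(sample, mapper, reducer)
```

Because the job logic never branches on the flag, whatever is debugged locally on a small sample is the same code that would later run on the cluster.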

Other capabilities include using the Oracle Big Data Connector called Oracle Loader for Hadoop to quickly push data from HDFS into an Oracle Database, or using Sqoop to pull data from the Oracle Database into HDFS, all within an R session and using only the R functions provided.

If parallel distributed map-reduce programming isn't your strength, ORAAH also allows you to manipulate Hive data using the same type of transparency provided by Oracle R Enterprise, but on top of Hive tables. Just as Oracle R Enterprise maps data.frame functions to Oracle SQL, Oracle R Advanced Analytics for Hadoop uses the same abstraction to map those data.frame functions to HiveQL.
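To make the transparency idea concrete, here is a small conceptual sketch, again in Python and not based on the ORAAH or Oracle R Enterprise APIs. The class `LazyTable`, its methods, and the `flights` table are all hypothetical; the point is only that data-frame-style calls can be recorded and translated into a HiveQL string instead of being executed on local data.

```python
class LazyTable:
    """Hypothetical stand-in for a data-frame proxy over a Hive table."""

    def __init__(self, name, columns=None, conditions=None):
        self.name = name
        self.columns = list(columns) if columns else []
        self.conditions = list(conditions) if conditions else []

    def select(self, *columns):
        # Record the projection; nothing is executed yet.
        return LazyTable(self.name, columns, self.conditions)

    def filter(self, condition):
        # Record one more row filter; filters are combined with AND.
        return LazyTable(self.name, self.columns, self.conditions + [condition])

    def to_hiveql(self):
        # Translate the recorded operations into a single query string.
        cols = ", ".join(self.columns) if self.columns else "*"
        query = f"SELECT {cols} FROM {self.name}"
        if self.conditions:
            query += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return query


flights = LazyTable("flights")
query = flights.filter("delay > 15").select("carrier", "delay").to_hiveql()
print(query)
```

In a design like this the data never leaves the cluster: the client only builds the query, and the heavy lifting is done where the table lives.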

In addition, ORAAH provides ten prepackaged distributed advanced analytics algorithms: Logistic Regression (Spark-based), Multi-Layer Perceptron Feed Forward Neural Networks (Spark and Map-Reduce versions), Generalized Linear Models (GLM), Linear Regression models, Principal Component Analysis (PCA), k-Means clustering, Non-negative Matrix Factorization, Low-Rank Matrix Factorization (for collaborative filtering), and Correlation and Covariance matrix computations.

So even if you’re not comfortable turning serial algorithms into parallel distributed algorithms in map-reduce, you can get the benefit of the Hadoop cluster using our high-level R interfaces.

As a sample of the results achieved by the new Spark-based algorithms: compared with the same models running on Map-Reduce, they are up to 217X faster at 1 billion records.

The first part of a series of blog posts dedicated to ORAAH gives more examples of the execution speeds of the new ORAAH 2.5.0 Spark-based algorithms:
Oracle R Advanced Analytics for Hadoop on the Fast Lane: Spark-based Logistic Regression and MLP Neural Networks
BIWA Summit 2016