Energy Australia Improves Customer Experience with Oracle and ORAAH
Gaurav Singh, Big Data and Data Warehouse Solution Architect, Design & Execution, at Energy Australia, speaks about using Big Data Analytics to better understand customers and improve their experience.
New for release 2.8.0:
- Interface with Spark 2.x (due to new Spark APIs, starting with this release ORAAH is no longer compatible with Spark 1.6.0)
- Certified with CDH 5.13, 5.14 and 5.15, and Spark releases 2.1.0 up to 2.3.1
- New Algorithm: ELM (Extreme Learning Machines) and Hierarchical-ELM (Oracle’s MPI/Spark-engines) - for Classification and Regression
- New Algorithm: Distributed Stochastic PCA (Oracle’s MPI/Spark-engines)
- New Algorithm: Distributed Stochastic SVD (Oracle’s MPI/Spark-engines)
- New model statistics for LM (lm2) and GLM (glm2) Algorithms
- A new Activation Function "softmax" is available for ORAAH's MLP Neural Networks algorithm, which can be used for binary and multi-class Targets (previously only "entropy" could be used, and only for binary targets)
- Probabilities are now returned for the Spark MLlib Classification algorithms
- New options for GLM solver: IRLS (Iteratively Reweighted Least Squares) for precision, and L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) for speed and larger problems.
- New IMPALA transparency interface, for fast in-memory queries and Data Preparation on large Datasets from R
- New "orch.summary()" function to run summary statistics on HIVE tables with improved performance
- New Spark DF data manipulation and summary functions: collect, scale (12 standard scaling techniques), SQL query, CSV import into Spark DF, summary (statistics), describe and persist/unpersist in memory
- New JDBC Interface to define any SQL source as input to the formula engine and any algorithm
- New interface to Spark MLlib GBT - Gradient Boosted Trees - for Classification and Regression
- Faster loading of the ORAAH libraries after the first successful load for the same user
- Performance enhancements when interfacing with HIVE, HDFS (new Java-based client interface) and R data.frames
- spark.connect() now automatically resolves the active HDFS Namenode for the cluster if the user does not specify the dfs.namenode parameter
- spark.connect() now loads the default Spark configuration from the spark-defaults.conf file if it is found in the CLASSPATH
- hdfs.toHive() and hdfs.fromHive() now work with the newly supported Apache Impala connection as well
- A new field "queue" has been added to the mapred.config object used to configure MapReduce jobs written in R with ORAAH, allowing the user to select the queue to which the MapReduce job will be submitted
- New support for the option "-Xms" to improve the configuration of the rJava JVM memory that runs alongside ORAAH
- The ORAAH R Formula engine introduces support for:
- Main statistical distributions, with density, cumulative density, quantile and random deviates functions for the following distributions: Beta, Binomial, Cauchy, Chi-squared, Exponential, F-distribution, Gamma, Geometric, Hypergeometric, Log-normal, Normal, Poisson, Student's t, Triangular, Uniform, Weibull and Pareto
- Special functions: gamma(x), lgamma(x), digamma(x), trigamma(x), lanczos(x), factorial(x), lfactorial(x), lbeta(a, b), lchoose(n, k) - natural logarithm of the binomial coefficient
- New Aggregate Functions: avg(colExpr) which is the same as mean(colExpr), max(colExpr), min(colExpr), sd(colExpr) which is the same as stddev(colExpr), sum(colExpr), variance(colExpr) which is the same as var(colExpr), kurtosis(colExpr), skewness(colExpr), where colExpr is an arbitrary (potentially nonlinear) column expression.
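As a sketch of how several of these 2.8.0 features fit together (the automatic Spark connection, HDFS data access, and the formula engine with nonlinear column expressions), an R session might look like the following. The HDFS path and column names are hypothetical, and exact argument names may differ by environment and ORAAH release:

```r
library(ORCH)  # ORAAH is loaded through the ORCH package

# Connect to Spark on YARN; in 2.8.0 the active HDFS Namenode and the
# Spark defaults (spark-defaults.conf) are resolved automatically.
spark.connect(master = "yarn")

# Attach a CSV dataset already stored in HDFS (hypothetical path).
dat <- hdfs.attach("/user/oraah/ontime")

# Formula engine: arbitrary (potentially nonlinear) column expressions
# such as log(DISTANCE + 1) can appear directly in the formula.
fit <- orch.glm2(CANCELLED ~ DISTANCE + log(DISTANCE + 1) + DEPDELAY,
                 data = dat)
summary(fit)

spark.disconnect()
```

This is a minimal sketch, not a definitive recipe; consult the ORAAH 2.8.0 Release Notes for the authoritative signatures.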
Continuing the great benefits of ORAAH:
- A general computation framework where users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages. Support for binary RData representation of input data enables R-based MapReduce jobs to match the I/O performance of pure Java-based MapReduce programs.
- Parallel and distributed machine learning algorithms take advantage of all the nodes of your Hadoop cluster for scalable, high performance modeling on big data. Algorithms include linear regression, generalized linear models, neural networks, low rank matrix factorization, non-negative matrix factorization, k-means clustering, principal components analysis, and multivariate analysis. Functions use the expressive R formula object optimized for Spark parallel execution.
- R functions wrap Apache Spark MLlib algorithms within the ORAAH framework using the R formula specification and Distributed Model Matrix data structure. ORAAH's MLlib R functions can be executed either on a Hadoop cluster using YARN to dynamically form a Spark cluster, or on a dedicated standalone Spark cluster.
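The general MapReduce computation framework described above can be sketched from R roughly as follows. The dataset path, column names, and queue name are hypothetical, and the exact signatures may vary across ORAAH releases:

```r
library(ORCH)

# Input: an HDFS dataset attached as an ORAAH descriptor (hypothetical path).
dat <- hdfs.attach("/user/oraah/ontime")

# Custom mapper and reducer written as plain R functions; the new
# "queue" field of mapred.config routes the job to a specific YARN queue.
res <- hadoop.run(
  dat,
  mapper  = function(key, val) orch.keyval(val$DAYOFWEEK, val$ARRDELAY),
  reducer = function(key, vals) orch.keyval(key, mean(unlist(vals), na.rm = TRUE)),
  config  = new("mapred.config", job.name = "mean-delay-by-day", queue = "analytics")
)

# Pull the (small) result back into the R session.
hdfs.get(res)
```

Because ORAAH supports a binary RData representation of the input, R-based jobs like this can approach the I/O performance of pure Java MapReduce programs.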
Same great Platform for R Analytics at scale:
Sample performance on a Big Data Appliance X7-2 (Spark 2.2.0 on YARN; 6 nodes with 48 cores and 256 GB of RAM per node):
Benchmark of all available Binary Classification algorithms on an airline Dataset (ontime) with 1 billion rows; Model build and Model Scoring times are compared.
Benchmark of the scalability of ORAAH's GLM vs. Spark MLlib's Logistic Regression algorithm on an airline Dataset (ontime); Model build times are compared. At 10 billion rows, the 1 TB dataset no longer fits in memory (learn more about Spark Memory Management) and Spark MLlib crashes.
Benchmark of ORAAH's GLM (Logistic Regression) scalability for Model building and Model Scoring, from 100 thousand to 10 billion rows.
ORAAH high-performance Spark (and MPI) based algorithms available from R (source data can be CSVs in HDFS, HIVE tables or Spark DataFrames):
- Linear Regression - orch.lm2()
- Logistic Regression - orch.glm2()
- Multilayer Perceptron Neural Networks - orch.neural2()
- Extreme Learning Machines (ELM) - orch.elm()
- Hierarchical-ELM - orch.helm()
- Distributed Stochastic PCA - orch.dspca()
- Distributed Stochastic SVD - orch.dssvd()
Set of Apache Spark MLlib algorithms available from R in ORAAH (source data can be CSVs in HDFS, HIVE tables or Spark DataFrames):
- Gradient-Boosted Trees - orch.ml.gbt()
- Gaussian Mixture Models - orch.ml.gmm()
- Linear Regression - orch.ml.linear()
- LASSO - Least Absolute Shrinkage and Selection Operator - orch.ml.lasso()
- Ridge Regression - orch.ml.ridge()
- Logistic Regression - orch.ml.logistic()
- Decision Trees - orch.ml.dt()
- Random Forest - orch.ml.random.forest()
- Support Vector Machines - orch.ml.svm()
- k-Means Clustering - orch.ml.kmeans()
- Principal Component Analysis - orch.ml.pca()
Features of ORAAH Spark MLlib algorithms
To support the new Machine Learning algorithms from Apache Spark, several special functions are available:
- Updated predict() functions for scoring new datasets with the Spark-based models, executed on Spark.
- A new function hdfs.write() that writes model objects and prediction results from Spark RDDs and DataFrames back to HDFS.
- A new set of functions, orch.save.model() and orch.load.model(), to save and load Model objects that can be used to score new datasets on the same or other clusters via Spark.
- ORAAH includes a Distributed Model Matrix engine and a Distributed Formula parser that are used in conjunction with all Spark MLlib-based algorithm interfaces to greatly improve the performance and enhance functional compatibility with R (following closely the R formula syntax). Internally the Distributed Model Matrices are stored as Spark RDDs (Resilient Distributed Datasets).
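Putting these pieces together, a model built through the MLlib interface can be saved, reloaded on another cluster, and used for scoring. The paths, column names, and argument names below are hypothetical illustrations, not confirmed signatures:

```r
library(ORCH)
spark.connect(master = "yarn")

# Hypothetical airline dataset stored as CSVs in HDFS.
dat <- hdfs.attach("/user/oraah/ontime")

# Build an MLlib-backed classifier through the ORAAH formula interface.
mod <- orch.ml.logistic(CANCELLED ~ DISTANCE + DEPDELAY, data = dat)

# Persist the model so it can be reloaded on this or another cluster.
orch.save.model(mod, "/user/oraah/models/cancel_logistic")

# Later, possibly elsewhere: reload, score, and write results to HDFS.
mod2  <- orch.load.model("/user/oraah/models/cancel_logistic")
preds <- predict(mod2, newdata = dat)  # 2.8.0 also returns class probabilities
hdfs.write(preds, "/user/oraah/scores/cancel_logistic")

spark.disconnect()
```

The Distributed Model Matrix and Distributed Formula parser handle the formula expansion on Spark RDDs behind the scenes, so the R code stays close to a standard glm()-style workflow.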
Additional ORAAH Platform Updates
Support for Cloudera Distribution of Hadoop (CDH) release 5.14.x and 5.15.x. Both “classic” MR1 and YARN MR2 APIs are supported.
For more information, see the Change List and the Release Notes documents for ORAAH version 2.8.0.
The first part of a series of Blog Posts dedicated to ORAAH illustrates more examples of the execution speeds of the ORAAH Spark-based algorithms: Advanced Analytics for Hadoop on the Fast Lane - Spark-based Logistic Regression and MLP Neural Networks.