Oracle Machine Learning for Spark
The OML4Spark R API provides functions for manipulating data stored in a local file system, HDFS, Apache Hive, Spark DataFrames, Apache Impala, Oracle Database, and other JDBC sources. OML4Spark takes advantage of all the nodes of a Hadoop cluster for scalable, high-performance machine learning modeling in big data environments. Its machine learning algorithms use the expressive R formula object, optimized for parallel execution on Spark.
OML4Spark provides custom Linear Model (LM), Generalized Linear Model (GLM), and Multilayer Perceptron (MLP) Neural Network algorithms that execute on the Spark infrastructure. OML4Spark also provides interfaces to Apache SparkML algorithms; the custom OML4Spark algorithms typically scale and perform better than their SparkML counterparts. R functions wrap the SparkML algorithms within the OML4Spark framework using the R formula specification and the Distributed Model Matrix data structure.
Oracle Cloud SQL and OML4Spark can be used together from Oracle Database or Autonomous Database to address large, complex data-driven problems where the source data and the patterns to be discovered may reside in big data, relational data, or some combination of the two. OML4Spark supports machine learning processing outside the database, either standalone or as a component of larger, more complex machine learning pipelines.
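As a minimal sketch of the R-formula workflow described above, the following assumes the OML4Spark/ORAAH `ORCH` package is installed on an edge node of a reachable Hadoop/Spark cluster; the HDFS path and column names are illustrative, and the function names (`spark.connect()`, `hdfs.attach()`, `orch.glm2()`) follow the ORAAH API:

```r
# Sketch: fit a Spark-based GLM using a standard R formula.
# Assumes the ORAAH/OML4Spark package and a YARN-managed cluster.
library(ORCH)

# Dynamically form a Spark session on the Hadoop cluster via YARN.
spark.connect(master = "yarn", memory = "4g")

# Attach a CSV file already stored in HDFS as an HDFS data object
# (path and columns are hypothetical).
dat <- hdfs.attach("/user/oracle/churn.csv")

# Fit a GLM; the R formula is translated into a Distributed Model
# Matrix and executed in parallel on Spark.
fit <- orch.glm2(CHURNED ~ TOTAL_CALLS + TENURE, dat)
summary(fit)

spark.disconnect()
```

The point of the formula interface is that model specification looks identical to base R's `glm()`; only the data object and execution engine change.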
What is Oracle Machine Learning for Spark?
OML4Spark is provided through Oracle R Advanced Analytics for Hadoop (ORAAH), a component of Oracle Big Data Connectors, and offers:
- Parallel, distributed machine learning algorithms that leverage all the nodes of a Hadoop cluster for scalable, high-performance modeling on big data. Functions use the expressive R formula object, optimized for parallel Spark execution.
- Custom Spark-based Linear Model, Generalized Linear Model, and MLP Neural Network algorithms, plus interfaces to Apache SparkML algorithms that use the R formula and Distributed Model Matrix infrastructure.
- An R interface for manipulating data stored in a local file system, HDFS, Hive, Impala, and JDBC sources, and for creating Distributed Model Matrices across the nodes of a Hadoop cluster in preparation for machine learning.
- A general computation framework in which users invoke parallel, distributed MapReduce jobs from R, writing custom mappers and reducers in R while also leveraging open source CRAN packages.
- A Core Analytics Java library that can be called from Java or Scala for direct, efficient, and easy integration with applications.
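The MapReduce-from-R capability above can be sketched as follows. This assumes ORAAH-style function names (`hadoop.run()`, `orch.keyval()`, `hdfs.attach()`, `hdfs.get()`) from the `ORCH` package; the HDFS path and column names are illustrative:

```r
# Sketch: a custom MapReduce job written entirely in R, with the
# mapper and reducer free to call any CRAN function.
library(ORCH)

dat <- hdfs.attach("/user/oracle/ontime.csv")

res <- hadoop.run(dat,
  mapper = function(key, val) {
    # Emit one key/value pair per record: airline -> arrival delay.
    orch.keyval(val$AIRLINE, val$ARRDELAY)
  },
  reducer = function(key, vals) {
    # Aggregate per airline; mean() here could be any R function.
    orch.keyval(key, mean(as.numeric(vals), na.rm = TRUE))
  }
)

# Pull the (small) aggregated result back into the R session.
hdfs.get(res)
```

Because both the mapper and reducer are plain R closures, existing R code and CRAN packages can be reused inside the distributed job without rewriting them in Java.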
Why Oracle Machine Learning for Spark?
- Empowers data scientists with a range of powerful custom and open source algorithms under a common framework
- Dramatically improves execution performance for data analysis, preparation, and modeling through a natural R interface; custom algorithms achieve linear scalability and perform well even on low-memory, low-CPU hardware and on narrow, wide, or sparse data
- Keeps data in the Hadoop/Spark environment, minimizing data access latency
- Executes SparkML algorithms either on a Hadoop cluster using YARN (to dynamically form a Spark cluster) or on a dedicated standalone Spark cluster
- Uses Spark SQL to manipulate Spark DataFrames within R code, with interfaces to a local file system, HDFS, Hive, Impala, JDBC sources, and Oracle Database
- Leverages open source CRAN packages in combination with MapReduce jobs written in R
- Supports scoring with models in deployed clusters through the ability to store and load model objects
- Matches the I/O performance of pure Java-based MapReduce programs through a binary RData representation of input data in R-based MapReduce jobs
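The store-and-load capability for scoring in deployed clusters might look like the sketch below. It assumes ORAAH-style `orch.save.model()` and `orch.load.model()` functions; the model, formula, and HDFS paths are all illustrative:

```r
# Sketch: persist a fitted model, then reload it in a separate
# scoring session on a deployed cluster.
library(ORCH)
spark.connect(master = "yarn")

# Train on data in HDFS (path and formula are hypothetical).
dat <- hdfs.attach("/user/oracle/train.csv")
fit <- orch.lm(Y ~ X1 + X2, dat)

# Store the fitted model object so other sessions can reuse it.
orch.save.model(fit, "/user/oracle/models/lm_y")

# Later, in the scoring session:
fit2  <- orch.load.model("/user/oracle/models/lm_y")
newd  <- hdfs.attach("/user/oracle/score.csv")
preds <- predict(fit2, newdata = newd)
```

Separating training from scoring this way lets a production cluster score new data without retraining, which is the deployment pattern the bullet above refers to.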