Data Mining Using the RDOM Package

By Casimir Saternos

 
A primer on using the open-source R statistical analysis language with Oracle Database Enterprise Edition

Published July 2010

Downloads:


Data mining, predictive analysis, and statistical techniques generally do not make headlines. However, in the last year, the New York Times cited Google’s Chief Economist Hal Varian as predicting a significant demand for statisticians in the next ten years. They also posted an article about the growing popularity of the R language as a tool for data analysis.

R is an open source alternative to statistical software such as SPSS and SAS. It is similar to the S language (developed by John Chambers and others at Bell Laboratories) and is also influenced by Scheme (a major Lisp dialect). It can be used to perform data analysis and visualization through an interactive programming environment.

The Oracle Database includes extensive statistical and analytical functionality. Many features, including Oracle SQL analytic functions and the DBMS_STAT_FUNCS package, are available in standard installations. Oracle Database also has optional packages that complement these capabilities, including Oracle OLAP.

Perhaps the most powerful of all advanced analytical capabilities is the Oracle Data Mining (ODM) option, which provides 12 data mining algorithms for discovering patterns and relationships and building and applying descriptive and predictive data mining models inside the database.

This article will demonstrate the use of Oracle Database 11g Enterprise Edition with the ODM option accessed by an R interface (RODM). ODM allows for analysis of data within the database. This eliminates the time, expense, and resources required to transfer data. The data remains secure in the database resulting in fewer problems due to information latency.

Oracle Database also has a graphical user interface to ODM called Oracle Data Miner “Classic” and a totally new Oracle Data Miner 11g Release 2 workflow GUI, an extension to SQL Developer, that will be available to early adopters soon. And now, for data analysts who are already familiar with the open source R language, there is now another solution: the RODM package.

The RODM package allows R users to interact with the Oracle Database and ODM functionality. Users can analyze and manipulate data without the use of SQL or PL/SQL. Written by Pablo Tamayo and Ari Mozes, it is available for download from the Comprehensive R Archive Network (CRAN). The package depends upon the RODBC package to make Oracle Database connections and do basic data manipulation.

RODM allows R users to access the power of the ODM in-database functions using the familiar R syntax. RODM provides a powerful environment for prototyping data analysis and data mining methodologies. RODM is especially useful for:

  • Quick prototyping of vertical or domain-based applications where the Oracle Database supports the application
  • Scripting of "production" data mining methodologies
  • Customizing graphics of ODM data mining results (examples: classification, regression, anomaly detection)


The RODM interface allows R users to mine data using ODM from the R programming environment. It consists of a set of function wrappers written in source R language that pass data and parameters from the R environment to Oracle Database 11g Enterprise Edition as standard user PL/SQL queries via an Open Database Connectivity (ODBC) interface. The RODM interface code is a thin layer of logic and SQL that calls through an ODBC interface. RODM does not use or expose any Oracle product code because it is completely an external interface and not part of any Oracle product. RODM is similar to the example scripts (for example, the PL/SQL demo code) that illustrate the use of ODM, for example, how to create data mining models, pass arguments, retrieve results, and so on.

Software Prerequisites

If you have not done so already, install the R language on your Microsoft Windows workstation. You can then install any R packages that you would like, including the RODM package.

Please note that you will need to access (or install) an Oracle 11g database with the ODM option to use RODM. If you don't have an installed Oracle database in place and you need to install one from scratch, we strongly recommend that you follow the guidelines in the Oracle Data Mining Administrator's Guide. Oracle Database Express Edition (Oracle Database XE) does not include this option. You will also need a database user with appropriate security and privileges as well as an ODBC connection to the database. This database user should have privileges to connect to the database, create tables, create views, and create mining models.

The following is a script to create such a user. This script (modified for your environment) would need to be run by a DBA or other person with administrative privileges.

CREATE USER dm IDENTIFIED BY  
DEFAULT TABLESPACE users
TEMPORARY TABLESPACE temp
QUOTA 20M ON users;

GRANT create mining model TO dm;
GRANT create table TO dm;
GRANT create view TO dm;
GRANT create session TO dm;


ODBC Connection Configuration

The RODBC package is used to connect to Oracle Database. To configure an ODBC Data Source in Microsoft Windows, click the Start Menu and choose Control Panel, Administrative Tools, and Data Sources (ODBC).

saternos-r-fig1


If a connection is not already set up for your Oracle 11g database, click the Add button and enter the relevant information. After you have entered your connection information and credentials, choose the Test Connection button to verify connectivity.

saternos-r-fig2


Connect to Oracle Database from R

R’s interactive programming environment allows commands to be run and evaluated individually. Each command listed below may be executed in R individually. The first step is to load the required R libraries. Open R and enter the following command.

> library(RODM)


This command loads the RODM library and as well the dependent RODBC package. The next step is to make a database connection.

> DB <- RODM_open_dbms_connection(dsn="orcl", uid="dm", pwd="dm")


Subsequent commands use the DB object (an instance of the RODBC class) to connect to the database. The DNS specified in the command is the name you used earlier for the Data Source Name during the ODBC connection configuration. You can view the actual R code being executed by the command by simply typing the function name (without parentheses).

> RODM_open_dbms_connection


This function calls the RODBC odbcConnect function to make a connection to the database, and it also verifies the version of the database (in the Oracle Data Dictionary product_component_version table) and ensures that the ODM option is available (by checking the v$option table where the parameter column is set to Data Mining). It also verifies that the user making the connection has the necessary credentials (by looking at the user_sys_privs table) and creates the RODM_SETTINGS_TABLE temporary table if needed.

The bottom line for those not interested in the underlying details is that a successful call to this function indicates that you have connected to an Oracle Database using credentials that will allow you to do work with the ODM packages.

Data Frames and Oracle Tables

R users routinely manipulate objects such as data frames, lists, and vectors. Oracle Database stores data in tables. RODM (and RODBC) provide a translation layer that maps R data frames to Oracle Database tables in a single command.

To do your first tests with data mining in Oracle Database, select one of the standard data sets used for statistical analysis and predicative analysis tasks. The ones available on your system can be listed using the data function.

> data()

We will use the Orange data set, which is a table containing a tree number, its age, and its circumference. View the first few lines of the data to get an idea of the structure and contents of the Orange object.

> head(Orange)


Rather than using the R data object directly, we will create a variable with its own name:

> orange_data=Orange


This is not required, but is done to emphasize a couple of points. The names of variables in R are used to automatically create tables in Oracle Database. R is case sensitive, but Oracle Database object names are generally case insensitive.

Create the table in Oracle Database (which will be used by ODM algorithms and processing):

> RODM_create_dbms_table(DB, "orange_data")

You can view the table that has been created by querying the user_tables table using SQL or by simply listing the tables available using the following R command:

> sqlTables(DB, schema='DM')


You can list the columns in the table as well within R:

> sqlColumns(DB, 'orange_data')$COLUMN_NAME


This functionality may require you to make some adjustments to your coding style. For instance, R users frequently include dots in their variable names.

> orange.data<-Orange


A variable name like this produces an error when used to create a table in Oracle Database.

> RODM_create_dbms_table(DB, "orange.data")

Error in sqlSave(database, dat = eval(parse(text = data_table_name)), :
HY000 1918 [Oracle][ODBC][Ora]ORA-01918: user 'ORANGE' does not exist

[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE ORANGE.DATA ("TREE" varchar(255),
"AGE" double precision, "CIRCUMFERENCE" double precision)'


The error is clear enough and easily avoided. Choose variable names that are limited to alpha numeric characters and underscores and are less than 30 characters long.

Some data mining algorithms require a data frame to include a unique “Case ID” to uniquely identify a row. The following set of R commands can make just such an adjustment to the Orange data set.

> numrows <- length(orange_data[,1])
> orange_data.rows <- length(orange_data[,1])
> orange_data.id <- matrix(seq(1, orange_data.rows),  nrow=orange_data.rows, ncol=1, dimnames= list(NULL, c("CASE_ID")))
> orange_data <- cbind(orange_data.id, orange_data)

This adjustment to the data frame then needs to be propagated to the database. You can confirm the change using the sqlColumns function, as listed earlier.

> RODM_create_dbms_table(DB, "orange_data")
> sqlColumns(DB, 'orange_data')$COLUMN_NAME


Notice that there was no need to explicitly drop the table and recreate it. These actions are taken automatically, if needed.

Create and Apply a Model

The following table shows the mapping between the RODM functions and their corresponding data mining algorithm.  

Function

Description

RODM_create_ai_model

Attribute Importance

RODM_create_assoc_model

Association Rules

RODM_create_dt_model 

Decision Tree

RODM_create_glm_model

Generalized Linear Model

RODM_create_kmeans_model

Hierarchical k-means

RODM_create_nb_model 

Naive Bayes

RODM_create_nmf_model

Non-Negative Matrix Factorization

RODM_create_oc_model

O-cluster

RODM_create_svm_model

Support Vector Machine


All of these functions call the RODM_create_model function behind the scenes and retrieve data back to R. If you are interested in detailed information about SQL and PL/SQL statements being executed behind the scenes, set the sql.log.file parameter and view the commands in the log file.

The following is an example of how a model can be created. In this case, a Generalized Linear Model is used to perform a regression analysis on orange_data. Most of the parameters are self explanatory. Keep in mind that the model_name specified will be used in the names of a number of database objects. The default mining function is classification, so regression needs to be specified in this example.

> glm <- RODM_create_glm_model(
database = DB,
data_table_name = "orange_data",
case_id_column_name = "CASE_ID",
target_column_name = "circumference",
model_name = "GLM_MODEL",
mining_function = "regression")

Information about this model can then be obtained by analyzing value returned from the model and stored in the variable named glm.

> glm$model.model_settings
> glm$glm.globals
> $glm.coefficients


Once you have a model, you can apply the model to a new set of data. To begin, create or retrieve sample data in the same format as the training data.

> query<-('select 999 case_id, 1 tree, 120 age, 
32 circumference from dual')

> orange_test<-sqlQuery(DB, query)
> RODM_create_dbms_table(DB, "orange_test")


This example is intended to demonstrate that the data analyzed does not need to originate in R; it can come from the database itself. Data is brought back into R and sent back to the database. However, if you have some knowledge of SQL, you could accomplish this without moving the data back and forth between R and Oracle Database using CREATE TABLE AS syntax.

> RODM_drop_dbms_table(DB, "orange_test")
> sqlQuery(DB, paste('create table orange_test as ', query))


This technique will improve performance and productivity especially when working with large data sets. Finally, the model can be applied to the new data set and the results analyzed.

results <- RODM_apply_model(database = DB, 
data_table_name = "orange_test",
model_name = "GLM_MODEL",
supplemental_cols = "circumference")


When your session is complete, you can clean up objects that were created (if you like) and you should close the database connection:

> RODM_drop_model(database=DB,'GLM_MODEL')
> RODM_drop_dbms_table(DB, "orange_test")
> RODM_drop_dbms_table(DB, "orange_data")
> RODM_close_dbms_connection(DB)

 

Conclusion

There is a growing awareness of the need to effectively analyze astronomical amounts of data, much of which is stored in Oracle databases. Statistics and modeling techniques are used to improve a wide variety of business functions. ODM accessed using the R language increases the value of your data by uncovering additional information. RODM is a powerful tool to enable your organization to make predictions, classify data, and create visualizations that maximize effectiveness and efficiencies.

Casimir Saternos is an Oracle-certified DBA and Sun certified Java programmer living in Pennsylvania.