Developer: BI
   DOWNLOAD
 Oracle Database 10g EE
   TAGS
datamining, bi, database, All

Using the Oracle Data Mining API


by Agust Egilsson

Learn how to safely and directly let end users access Oracle's analytic APIs.

Published August 2006

Following best practices, Oracle releases PL/SQL and Java application-programming interfaces (APIs) well before there is a point-and-click tool or builder available taking advantage of a newly released API. This practice enables developers to incorporate new functionality into their applications immediately and allows their organizations to benefit accordingly.

However, considerable development is required to create specialized tools that take advantage of a newly released API. As a consequence, the lag time following the release and before an intuitive end-user tool utilizing the API is available is several months in the best case and, more realistically, one or two years. In the meantime, taking advantage of the API may require organizations either to come up with their own tools or more commonly with ad hoc solutions, such as embedding calls to the API in report generators and spreadsheets.

One common approach is to wrap the API in several scripts and then access the scripts from a spreadsheet using custom buttons and menus. However, a major disadvantage with this approach is that today’s spreadsheets handle text and numbers only as values in individual cells and are therefore usually not the appropriate tools for dealing with the new functionality.

In this article, I will show how to quickly incorporate analytic and other APIs into a spreadsheet platform from which the end user can easily access the new code. As a guiding example, I will explain how I repackage the Oracle Data Mining (PL/SQL) API as a Java API and how it is then accessed from a spreadsheet called J Cells. J Cells is written entirely in Oracle JDeveloper. It is not limited to only allowing text and numbers as values of cells but instead it also handles Java objects as values and can access any Java API directly from its cells, for immediate deployment.

The Spreadsheet Platform

I use the spreadsheet interface but allow the user to create any Java object in the cells as well as to use primitive Java types. Each cell can be used as a variable in another cell, and the user can choose between writing Java code directly in the cells and using other formats. Bridging the spreadsheet interface and the use of objects instead of just numbers and text in a normal spreadsheet is automatic: J Cells calculates an indication value for each object that fits into a cell, and this indication value gives the user a sufficient clue about the object being displayed. Additionally, a full value system is implemented that displays the object in various other formats when requested—for example, when the user double-clicks on a given cell. Since, even in a spreadsheet, formulas can be complicated to define, the system also recognizes if the object being created has a wizard associated with it. A wizard is usually a graphical code generator specific to an object type. The use of wizards in J Cells will become clear later in our example.

Figure 1 shows the spreadsheet interface for our example.

Figure 1

Figure 1:
J Cells’ spreadsheet interface

The Data Mining API

Oracle supports two compatible APIs for accessing data mining functionality in the database. The first is a PL/SQL API, which includes the DBMS_DATA_MINING package, and there is also a Java API called Oracle Data Mining Java API. Since J Cells is currently best suited to access a Java API, the PL/SQL API needs to be packaged in such a way that it can be accessed directly from Java. The two principal Oracle Data Mining concepts are settings and models. The settings concept is basically built around settings tables with two columns: setting_name and setting_value, where setting_name is the name of an attribute used by a mining algorithm and setting_value is the value corresponding to that attribute.

The DBMS_DATA_MINING package contains several procedures, including CREATE_MODEL and APPLY. The CREATE_MODEL procedure creates a mining model for a given mining function and dataset, based on the values in the settings table that is supplied as one of the arguments to the procedure. The procedure is simple and straightforward to use. Essentially, the user provides the names of the model to be created, the mining function to be used, the table containing the data to be used, the columns to be modeled, and the settings table. The advantage of this approach is that all the different algorithms are invoked in a similar way. The fine-tuning of each algorithm is incorporated into settings tables, but in many cases the various setting coefficients can be determined automatically by the algorithms themselves. The complexity of entries in a settings table varies depending on the algorithm and on the expertise of the user. Many expert users may want to manually set all possible coefficients, while most of us greatly appreciate the system’s ability to figure out the appropriate settings automatically. Oracle provides a list of constants to be used as settings keys, and values are either named constants or numeric intervals.

Table 1: Values for the algo_name (algorithm name) settings key

Values Algorithm Description

algo_adaptive_bayes_network

Adaptive Bayes network algorithm (classification)

algo_ai_mdl

Minimum description length algorithm (attribute importance)

algo_apriori_association_rules

Apriori algorithm (association)

algo_decision_tree

Decision tree algorithm (classification)

algo_kmeans

k-means algorithm (clustering)

algo_naive_bayes

Naive Bayes algorithm (classification)

algo_nonnegative_matrix_factor

Non-negative matrix factorization algorithm (feature selection)

algo_o_cluster

O-Cluster algorithm (clustering)

algo_support_vector_machines

Support vector machine algorithm (classification or regression)

Oracle’s constant values for the algorithm name (algo_name) key are listed above. For each of these values, a different set of possible keys and values is used, and so on. Below (Figure 2) it is shown how a wizard function maps these keys into a tree structure and allows the user to define the settings table by manipulating the settings tree.

As the mining model has been created in the Oracle database, the DBMS_DATA_MINING.APPLY procedure is used to apply the model to new datasets. Again, this is a simple-to-use procedure that requires as input only the name of the mining model, the name of a table containing the new dataset, a column for identifying rows in the new dataset, and the name of the resulting dataset. The Java class OracleMiningModel (below) utilizes the APPLY procedures whenever a prediction, score, or apply method is called. Additionally, the DBMS_DATA_MINING package contains several functions that return details of each model based on type, as result sets or in XML format. These detail functions are also accessed by using instances of the OracleMiningModel class, which represent different models in the database.

Packaging the (PL/SQL) settings concept in Java is achieved here by creating a Java class called OracleModelSettings having flexible constructors with various signatures, including

public OracleModelSettings ( String modelSettingsName,
                                Connection databaseConnection, 
                                String[] keyToValueStringMap) 
                throws SQLException

The keyToValueStringMap is simply an array of strings of the form “<key> -> <value>”. This array specifies the rows of the settings table, and the class is responsible for maintaining the settings table in the Oracle database.

Similarly, packaging the model concept in Java may be achieved by creating a Java class called OracleMiningModel having constructors, again, with various signatures, including

public OracleMiningModel ( String modelName,
                        OracleModelSettings oms,
                        String[] keyToValueMappings, 
                        boolean recreate)
                throws SQLException

Here the keyToValueMappings array is used to determine the algorithm used and other named attributes required to create the data mining model in the Oracle database. The purpose of the class is to create and maintain the data mining model. The OracleMiningModel class additionally defines methods that are used to retrieve and apply the model to new datasets. These methods include the following, just to name a few.

public OracleResultSet infoAprioriAssociationRules(int topn)
public OracleResultSet infoAprioriFrequentItemsets(int topn)
public OracleResultSet infoAdaptiveBayesNetwork()
public OracleResultSet infoAIMinimumDescLength()
public OracleResultSet infoKMeans()
public OracleResultSet infoNaiveBayes()
public OracleResultSet infoNonnegativeMatrixFactorization()
public OracleResultSet infoOCluster()
public OracleResultSet infoSupportVectorMachines()
public XMLType infoDecisionTree()
public Object getPrediction(String[] signature, double[] doubleVal)
public HashMap score(String[] signature, double[] doubleVal)
public OracleResultSet apply(   String dataTable, 
                                String caseID, 
                                String resultTable, 
                                String schema, 
                                boolean overwrite)

All the methods may throw SQL exceptions in addition to having the signature above. Once the data mining functionality is manageable from these two simple classes, the spreadsheet platform may be invoked to access any of the available data mining algorithms in order to model datasets in the Oracle database.

A Data Mining Example

So let’s look at a small data mining application written in this system that accesses the Oracle database to create and run an ODM (Oracle Data Mining) regression model. The purpose of this regression model is to predict heart rate based on inputs such as blood pressure, height, and weight. Using J Cells, one accesses the Java API directly to instantiate objects and to invoke methods on the objects. The first thing to do is to connect to the Oracle database. A DataSource object may be instantiated by entering the formula

() = ~ OracleDataSource("agust","agust","dbVaio","vaioFS");

into cell b3 in the spreadsheet. Here the tilde symbol (~) indicates a shorthand notation and allows J Cells to translate the (right hand) statement into the constructor “new cell.OracleDataSource( "agust", "agust", "dbVaio", "vaioFS");” giving the system access to the database “dbVaio” on server “vaioFS” as user “agust”.

Obtaining a database connection and inspecting the source data in the database may now be achieved by invoking the proper methods on the DataSource object, such as by entering the following formulas in cells b4 and b5, respectively:

(*) = b3.getConnection(); 
(*) = b3.query("select * from pulse_clinical");

The first statement returns a java.sql.Connection object into cell b4, and the second statement returns a java.sql.ResultSet object into cell b5. The result set can be inspected by simply double-clicking on the cell (b5), which opens up the resulting table in a table frame for viewing.

So far, I have just created a few simple data objects in the spreadsheet. Now it is time to invoke the data mining API in order to define a settings object and then to create a simple data mining model. First the settings object is created by entering the following statement in cell b6:

(*) = new cell.odm.OracleModelSettings("xyz_settings", b4,
                new String[]{
                        "algo_name -> algo_support_vector_machines",
                        "svms_kernel_function -> svms_linear"} );

Immediately I see a problem with this formula being end user–friendly, so it may be a good idea to register a wizard with J Cells that generates this formula automatically after prompting the user. It is typical for a spreadsheet to assist users when creating complex formulas, so users may expect to be guided when instantiating complex objects.

The deployed wizard is shown in Figure 2.

Figure 2

Figure 2:
A typical wizard interface

Similarly, the data mining model is created by using a wizard or by typing the formula; in either case, the resulting model is instantiated in cell b7 specifying a call to the API directly:

(*)= new cell.odm.OracleMiningModel("xyz_model", b6,
                new String[]{
                        "data_table_name -> pulse_clinical",
                        "mining_function -> regression",
                        "target_column_name -> pulse",
                        "case_id_column_name -> subject"},
                        false );
Instantiating the data mining model using this formula results in the creation of a standard Oracle Data Mining model in the Oracle database. The model can be inspected by double-clicking on cell b7, resulting in the full value for the model being displayed as shown in Figure 3.

Figure 3

Figure 3:
Inspecting the model in cell b7

Applying the model to datasets in Oracle Database is done by using the OracleMiningModel methods described previously. As a simple and interactive scoring of the model—typical for a spreadsheet application—the user may want to enter blood pressure, height, and weight values and have the database predict the heart rate using the model just defined. The Java API method getPrediction defined on model objects is well suited for this purpose. After having entered the input values in cells e4, e5, e6, and e7 (in the order specified by the signature array in cell b8), the scoring is done by entering the following formula:

(*) = b7.getPrediction(b8,new double[]{e4,e5,e6,e7});

Again, this formula accesses the Java API directly to obtain and display the scoring result into cell e9 as shown in Figure 4.

Figure 4

Figure 4:
Applying the model to input values

The Spreadsheet Advantage

It is somewhat baffling how popular the spreadsheet is with end users. On the one hand, today’s most common spreadsheet systems have long frustrated many developers, who are used to more-flexible and -powerful systems. For the nondeveloper, on the other hand, there are some clear advantages to using spreadsheet systems: not having to build graphical user interfaces, building and testing each piece of formula (code) individually, and hiding the formulas and seeing instead the much simpler results of calculations. These advantages are only present in today’s popular spreadsheets when working with formulas that return numbers or text into cells, which is a major limitation with the systems. Here, I have demonstrated how to remove this limitation and create a more powerful tool. It can then be used to directly access Oracle’s data mining functionality as well as other APIs.

Conclusion

By employing a more powerful spreadsheet, it is possible to significantly reduce the time it requires to introduce new technology releases to end users, such as the Java APIs and PL/SQL APIs published by Oracle. In fact, using the ideas described here, it is often feasible to hand a raw Java API directly to the nonprogrammer for immediate incorporation into decision-making processes or for prediction and analysis.


Agust Egilsson ( egilsson@hi.is) is a mathematics professor, an experienced Oracle database expert, and an avid PL/SQL and Java developer.