One of the new SQL features introduced into the Oracle 12c Database is the ability to create on the fly predictive models for your data. All you need to do is to write a SQL query that will build a predictive model and will then apply this model to your data. This is all done in one step and does not require you have knowledge of its inner works. The predictive models that are built during the execution of the Predictive Query only exists while the query is being run. When the query is finished all the models and associated settings are deleted. These are called transient models.
One of the confusing aspects of the new feature is that is goes by many names. In the documentation it can be found under On-the-Fly Models, and sometimes it is called Dynamic Queries. This feature is also part of the Oracle Data Miner tool that is part of SQL and here it goes by the name of Predictive Queries.
On-the-Fly Models = Dynamic Queries = Predictive Queries. Yes that is a bit confusing. From talking to some of the Oracle folks is seems that the name Predictive Queries might be the name they will be using going forward.
Predictive Queries enable you to build and score data quickly using the in-database data mining algorithms, without the complexity of needing to understand the required settings and fine-tuning of the models. All models created during the execution of the Predictive Query will not exist once the Predictive Query is finished executing. You cannot inspect or tune the algorithms being used or the models that are generated. So you have to trust what Oracle is doing behind the scenes. If you are a Data Scientist type of person you will typically want to tune the models, so this approach is maybe not for you. But if you want to very quickly build models and score your data then Predictive Queries is something you should consider.
A major advantage of using Predictive Queries is that you can partition the data so that predictive models can be build specific to each partition. What this will do, is that it will divide your data into the relevant partitions and will then create a predictive model specific to that partition and then score the data in the partition. Typically for most data mining tools you will have to specifically define the creation of the data subset for the partition, define how to build the model and then run the model to score the data. All of these steps have to be manually defined. By defining a partition in the Predictive Query all of these work will be done automatically for you. There are two major advantages to this. The first is that as new data partitions (i.e. a new value exists for the partition attribute) are created the Predictive Query will automatically pick this up and do all the work for you. The second is that Predictive Queries allows you to use the Parallel Query option to speed up the process of scoring the data. This will be particularly useful when you are working with Big Data.
Using Predictive Queries is very similar to using the PREDICT procedure in the DMBS_PREDICTIVE_ANALYTICS PL/SQL package, except that instead of using PL/SQL, Predictive Queries uses SQL.
There are two ways you can go about creating your predictive queries. The first option is to write a SQL statement to prompt the database to use the in-database data mining algorithms. The second (and much easier) option is to use the Predictive Queries nodes in the Oracle Data Miner tool.
The following sections of this article will illustrate how to build Predictive Queries using Oracle Data Miner and using SQL.
The Predictive Queries option in the Oracle Data Miner GUI, that is part of SQL Developer, will only become visible when you are connected to an Oracle 12c Database. You will see the Predictive Queries section under the Components windows, as shown in Figure 1.
Figure 1 : Predictive Queries section of the Components Window
There are four types of Predictive Query Nodes available in Oracle Data Miner.
Predictive Queries Node
Anomaly Detection Node
The input data set will be used to build an anomaly model and this model will then be applied to the same input data set.
Clusters will be automatically detected and the input data set will be returned with an assigned cluster for each record.
Feature Extraction Query
This can be used to automatically identify and extract key features from the input data set.
The Predictive Query node can be used to perform classification or regression on the input data set. For classification the target attribute needs to have a character data type. For Regression the data type should be numeric.
To illustrate how to build and use a Predictive Query node in the Oracle Data Miner tool, the following example will use the Predictive Query node to build and apply a Classification model.
To create a Data Source and Predictive Query nodes on your workflow you need to:
1. Create and Define a Data Source Node for the table that contains the data. The Data Source Node is located in the Data section of the Components window.
2. Create the Predictive Query node on your workflow.
3. Connect the Data Source node to the Predictive Query node. To do this right click on the Data Source node, select Connect from the menu, move the mouse to the Predictive Query node and click the mouse again. A gray arrow will now appear between the two nodes. This means they are connected.
To define the settings for the Predictive Query node you will need to:
1. Open the Predictive Query node by double clicking on the node.
2. The Edit Predictive Query node window will open. This window have 4 tabs/sections that allows you to define the Predictions to be made, how the data will be Partitioned (and hence the number of Predictive Query models that will be created), the Input list of attributes and the Output list of attributes.
3. Define the Predictions.
a) Define the Case ID for the input data set. The Case ID can be the primary key or some attribute that gives uniqueness to each attribute. The Case ID can be left null, but by defining the Case ID allows for the repeatability of the process and outputs. In our sample data set the Case ID is CUST_ID.
b) Define the target attribute for the data set. Click on the green plus icon to open the Add Target window. In this window select the required Target attribute. In our data set this is the AFFINITY_CARD attribute.
c) Define the Data Mining Type. This will be either Classification or Regression. To define the Data Mining Type you will need to click on the icon under the Data Mining Type column. A drop down list will appear that lists Categorical or Numerical. In our scenario we want to perform Classification so you will need to select Categorical.
d) Define the Prediction outputs. In this section you can define what kind of prediction outputs you want. By default for Classification the output will consist of Prediction, Prediction Details and Prediction Probability. For Regression you will have Prediction and Prediction Details. If you want to remove one or more of these prediction attributes you can select them and then click on the red X icon to remove them.
4. Define the Partition. This section allows you to define what partitions you want the data mining models to be defined on. This is perhaps the biggest benefit of Predictive Queries. A separate data mining model will be build for each value of defined by the partition. A partition can be created based on the values of attributes or by you defining a particular expression. To create a partition select the green plus icon. This will open a window listing the attributes in the data set. In our scenario we want a separate data mining model to be created for each Gender (M/F) by country. So if you have 10 countries then Predictive Queries will build 20 different data mining models.
5. Define the Input. By default all the attributes from our Data Source node are listed. If there are any attributes you do not want to be used as inputs to the building of the data mining models then you can remove them from the input list.
6. Define the Additional Output. In this section you can define what attributes you want displayed as the output of the Predictive Query. By default the output attributes will consist of the defined Case ID, the Target Attribute defined (in our example AFFINITY_CARD) and the Partition Attributes. If you would like additional attributes from the data source included then you can use the green plus icon to add them.
When you have the above steps completed your Predictive Query defined. You can now close the Predictive Query window.
Now you are ready to run the Predictive Query. To run this node right click the node and select Run from the menu. The query will be run in the database using the data source as input, data mining models will be built using the details you entered and then these models will be used to score or label the same data set. When the Predictive Query node is finished you can view the results by right clicking the node and selecting View Data from the menu. Figure 2 shows a sample of the output that is generated.
Figure 2 : Output generated by the Predictive Queries node.
A similar approach can be taken with setting up and defining the other Predictive Query nodes for Anomaly Detection, Clustering and Feature Extraction.
Predictive Queries provides use with a very easy to of constructing data mining models and applying them to your data, in a very efficient manner that allows for the dynamic creation of these based on how the partitions are defined.
In the previous section you have seen how to build up a predictive query using the Oracle Data Miner tool that is part of SQL Developer. The Oracle Data Miner tool also allows you to create Predictive Queries for each of the main types of Data Mining methods.
In this section we will look at how you can go about writing your Predictive Queries using SQL. We will base the examples on the sample example we used in building the Predictive Query using the Oracle Data Miner tool. That way you can easily relate back and compare the two approaches.
If you have already built an Oracle Data Mining model you are able to use the PREDICTION and PREDICTION_PROBABILITY functions to apply that model to your data. Similarly for Predictive Queries we can use these functions but they will have some additional specification to enable Predictive Queries to work.
In the following example it shows a Predictive Query that will use the data in the MINING_DATA_BUIL_V view to build a number of data mining models.
select cust_id, affinity_card, PREDICTION( FOR to_char(affinity_card) USING *) OVER () pred_affinity_card from mining_data_build_v);
In the results shown above we get to see and compare the actual value with the predicted value. You may notices that it got one of the predictions incorrect. This is typical in data mining. It wont get everything correct. But that is OK. Part of the reason for this is that the data mining algorithm is trying to work out what works best for all the data across all the records. One way to improve the accuracy is to look to build data mining models based on different groupings of the data. For example we have a COUNTRY_NAME attribute. Maybe it would be better to build separate data mining models for each country.
To do this we need to example what is entered in the PARTITION BY clause. Here we can add the COUNTRY_NAME and when the query is run a separate data mining model will be created for each country. These models will then be used to score the data.
select cust_id, affinity_card, PREDICTION( FOR to_char(affinity_card) USING *) OVER (PARTITION BY "COUNTRY_NAME") pred_affinity_card from mining_data_build_v);
From the sample output shown above we can see what the actual value is for the predicted attribute (AFFINITY_CARD) and what the predicted value is. In the sample shown above we see that both values are the same and if we looked at all the records we would see a very high degree of accuracy. This is good and it shows us that the Predictive Query generated a very efficient in-database data mining model.
But the problem with Predictive Queries is that these very efficient data mining models no longer exist in the database. It only exists for the duration of the query execution.
For your typical classification type of problem your target attribute will need to be a character data type. If is isn’t for example it could be defined as a number, then you will need to apply the TO_CHAR function to the target attribute to convert it into a categorical value. If you don’t TO_CHAR the number attribute then the Predictive Queries will treat the data as a regression data mining problem.
Let us now look at a regression example.
For this example we are going to use another of the sample data sets that comes when you install the Oracle Data Miner repository. One of these tables is called INSUR_CUST_LTV_SAMPLE. What we want to do with this data is to predict what the potential Life Time Value (LTV) is for each each customer.
The following examples shows a Predictive Query that calculates a predicted LTV and displays this with the current values. For records where you don’t have the LTV value the predicted value will give you an idea of what the LTV would be.
select customer_id, ltv, PREDICTION( FOR ltv USING *) OVER ( ) pred_ltv from insur_cust_ltv_sample;
Just like we did for the previous example of using Predictive Queries we can ask the database to create multiple data mining models that are based on the values of an attribute or a set of attributes. In the following example our LTV Predictive Query is expanded so that it will create separate data mining models for each combination of STATE and SEX. In this case it will be creating 44 different data mining models, as we have 22 distinct states in the data set and 2 distinct values for SEX.
select customer_id, ltv, PREDICTION( FOR ltv USING *) OVER ( PARTITION BY STATE, SEX ) pred_ltv from insur_cust_ltv_sample;
The use of Predictive Queries has limitations. When you run a Predictive Query Oracle will create a transient or temporary model. This model is only available during the execution of the query. The model produced by the Predictive Query does not exist afterwards and is not available for inspection or tuning of the model. If you need to inspect the model to correlate scoring results with the model, specify special algorithm settings, or execute multiple scoring queries that use the same model, you must create a predefined model using the Oracle Data Miner Model Nodes or by using the Oracle Data Mining PL/SQL package called DBMS_DATA_MINING.
Another alternative is to use the DBMS_PREDICTIVE_ANALYTICS PL/SQL package. This allows you to create some transient models but with Predictive Queries you get a much better and more efficient solution.
Brendan Tierney, Oracle ACE Director, is an independent consultant and lectures on Data Mining and Advanced Databases in the Dublin Institute of Technology in Ireland. He 20+ years of extensive experience working in the areas of Data Mining, Data Warehousing, Data Architecture and Database Design. Brendan has worked on projects in Ireland, UK, Belgium and USA. Brendan is the editor of the UKOUG Oracle Scene magazine and deputy chair of the OUG Ireland BI SIG. Brendan is a regular speaker at conferences across Europe and the USA and has written technical articles for OTN, Oracle Scene, IOUG SELECT Journal and ODTUG Technical Journal. Brendan has published his Oracle Press book called Predictive Analytics using Oracle Data Miner. Follow Brendan on twitter @brendantierney, or his blog at www.oralytics.com.