Using Predictive Queries With Oracle Data Miner 4.0

Overview

    Purpose

    This tutorial covers the use of Predictive Queries against mining data by using Oracle Data Miner 4.0.

    Time to Complete

    Approximately 30 mins.

    Introduction

    As you learned in the lesson Using Oracle Data Miner 4.0, data mining is the process of extracting useful information from masses of data by extracting patterns and trends from the data.

    Data mining can be used to solve many kinds of predictive analysis problems, including the following:

    • Predicting outcomes or values (Classification or Regression models)
    • Finding natural segments or clusters in a population (Clustering models)
    • Finding fraudulent or rare events (Anomaly Detection models)
    • Creating new attributes (features) for a target variable by combining original attributes (Feature Extraction models)

    Oracle Data Miner 4.0 provides predictive query capabilities for these specific model types. The predictive query options enable dynamic scoring of these model by generating a transient model that is not persisted.

    In this lesson, you will learn how to use each type of predictive query.

    Software Requirements

    The following is a list of software requirements:

Have access to or have installed:

    • Oracle Database 12c Enterprise Edition, Release 12.1 with Advanced Analytics Option.
    • The Oracle Database sample data, including the unlocked SH schema.
    • SQL Developer 4.0

Prerequisites

Before starting this tutorial, you should have:

  • Set up Oracle Data Miner for use within Oracle SQL Developer 4.0. If you have not already set up Oracle Data Miner, complete the lesson: Setting Up Oracle Data Miner 4.0.
  • Completed the lesson: Using Oracle Data Miner 4.0.

 

Creating Predictive Queries

    Before getting started, let’s preview the Predictive Queries node group in the Components pane. As shown below, the Models and Predictive Queries node groups are opened.

    There are four options in the Predictive Queries node group:

    • Anomaly Detection Query
    • Clustering Query
    • Feature Extraction Query
    • Prediction Query

    As shown in the image, each Predictive Query option enables dynamic scoring of an associated model type or types.

    Open the first subtopic to begin.

    Create a Workflow and Add a Data Source

      As you learned previously, a Data Miner Workflow is a collection of connected nodes that describe a data mining process. Here, you add a new workflow to an existing project that you created in the "Using Oracle Data Miner 4.0" tutorial. Then, you add a data source to the workflow.

      Perform the following steps.

      Right-click the ABC Insurance project and select New Workflow from the menu.

      Result: The Create Workflow window appears.

      In the Create Workflow window, enter Predictive Queries as the name and click OK.

      Result: A new workflow window with the workflow name opens in the middle of the SQL Developer window.

      As you've learned, the first element of any workflow is the source data.

      A. In the Components tab, open the Data category.

      B. Drag and drop the Data Source node onto the Workflow pane.

      Result: A Data Source node appears in the Workflow pane and the Define Data Source wizard automatically opens.

      C. As shown above, select MINING_DATA_BUILD_V from the Available Tables/Views list.

      D. Then, click FINISH at the bottom of the wizard window.

      Result: As shown below, the data source node name is updated with the selected view name.

      Note: You will use this same data source for all four predictive query options. Therefore, you will not need to create an additional workflow, nor add additional data sources, to complete this tutorial.

    Create and Execute a Prediction Query

      When creating a Classification or a Regression model, you must define a target attribute for the prediction. Predictive results are generated by the model and placed in the target attribute for each case in the model scoring process.

      When defining a Prediction Query node, you specify one or more prediction targets, which serve as the equivalent of a target attribute in a Classification or Regression model. Then, when the Prediction Query option is run, the Data Mining engine performs dynamic scoring for the prediction target(s) by generating a transient model that is not persisted.

      In this topic, you define a Prediction Query node to predict those customers who are most likely to select an Affinity card with the company. Then, you run the node and view the results.

      Follow these steps:

      Add a Prediction Query node to the workflow.

      A. First, collapse the Data category, and open the Predictive Queries category in the Components tab.

      B. Then, drag and drop the Prediction Query node from the Components tab to the Workflow pane, like this:

      Connect the data source node to the Prediction Query node:

      A. Right-click the data source node and select Connect from the menu.

      B. Click the Prediction Query node to complete the connection.

      Result: The workflow should look something like this.

      Note: By connecting these nodes, you enable the definition of the Prediction Query options based on the source data.

      A Prediction Query node is defined by specifying the following:

      • First, define the Case ID attribute.
      • Second, add one or more target columns.
      • Optionally, select one or more partition columns.
      • Optionally, add or remove prediction output columns.
      • Optionally specify advanced settings.

      Double-click the node to open the Edit Prediction Query Node window. Then, perform the following:

      A. On the Predictions tab, select CUST_ID as the Case ID attribute.

      Note: Although not required, it is advised that you define a Case ID to uniquely define each record. The Case ID helps with model repeatability and is consistent with good data mining practices.

      B. Next, add three prediction target attributes as specified here:

      • Click the Add Prediction Target tool (green “+” icon). The Add Target dialog box appears.
      • In the Available list, select the AFFINITY_CARD, AGE, and CUST_GENDER attributes to serve as a prediction targets.
      • Move the chosen attributes to the Selected list.
      • Then, click OK.

      Result: The Targets region of the Predictions tab should look like this:

      Notes:

      • As stated previously, the Prediction Query feature supports dynamic scoring for both Regression and Classification modeling. As shown in the Targets list on the Prediction tab, each attribute has an associated data type and mining type.
      • The mining type defines how the attribute is treated in the analytic query:
        • For Regression query analysis, the mining type should be set to Numerical ().
        • For Classification query analysis, the mining type should be set to Categorical ().

      In the example, AFFINITY_CARD and AGE are both Numerical, and CUST_GENDER is Categorical, as a result of the associated data types. However, we want to perform classification analysis on AFFINITY_CARD, because we are attempting to predict a Yes (1) or No (0), rather than a number. You can change the mining type of any attribute with a NUMBER data type as desired.

      Change the mining type for the AFFINITY_CARD attribute from Numerical to Categorical.

      A. For the AFFINITY_CARD target, click the Mining Type icon and select Categorical from the drop-down list, as shown here:

      B. The following Warning window opens when you change the Mining Type value. Click Yes.

      Result: The prediction targets are now defined correctly for the Prediction Query.

      Next, click the Partition tab.

      Notes:

      • If you select a Partition column, the predictive query builds a virtual model for each unique partition. You may select one or more partition columns.
      • Ensure that the level of partitioning does not get so small that there is insufficient data to build a good model.

      Define the following partition column:

      A. Click the Add Partitioning Columns tool (green “+” icon). The Add Partition Column dialog box appears.

      B. Select the EDUCATION attribute in the Available list and move it to the Selected list.

      C. Finally, click OK.

      Note: This definition instructs the Data Mining server to build a unique virtual model for each distinct EDUCATION value. Because the virtual model uses data only from that partition, it can potentially predict cases more accurately than if no partition were selected.

      Select the Input tab. All of the input columns are listed.

      By default, inputs are automatically determined using heuristics. If you want to remove or modify the mining type of any input column, you must deselect this option.

      Note: The Additional Output tab enables you to add (or remove) columns to the predictive output that is automatically generated by the node.

      Once again, select the Predictions tab and examine the Prediction Outputs region at the bottom of the tabbed pane.

      Notes:

      • By default, three prediction output columns are generated for a target attribute with a Categorical mining type (AFFINITY_CARD and CUST_GENDER), and two output colums generated for a target with a Numerical mining type (AGE).
      • You can add or remove prediction output columns.

      For this analytic query, remove the Prediction Details outputs for all three targets by performing the following:

      A. Select the three Prediction Details functions in the Prediction Outputs pane.

      B. Click the Remove Prediction Output Function tool (red “x” icon).

      Result: The Predictions tab should look like this:

      Note: The column names for your Prediction Outputs are automatically generated, and may be different from those shown in this example.

      Click OK to close the Select Columns window.

      To execute the prediction query and view the results, right-click the Prediction Query node and select either Run or View Data from the menu.

      Notes: Given that virtual models may take a while to be formulated, running a query may take a while. However, the View Data option generates a smaller sample output of the query.

      In this example, we select the View Data option.

      Result: A Prediction Query tabbed window opens with a small sample of query output, including all of the Additional Output columns and the Prediction Output columns.

      Notes:

      • Although you are only viewing the first few rows of the query result, the analytic query processes all the data in the query.
      • Your prediction column names may be slightly different.
      • In the example, output columns have been resized to provide a more complete view of the results. You can scroll horizontally and vertically to view more of the prediction query data.

      Click the Sort button and specify the following criteria:

      A. Prediction Affinity Card, in descending order (1 is YES, 0 is NO), and then,

      B. Probability Affinity Card, in descending order (the higher the number, the greater probability of the prediction)

      C. Click Apply Sort to view the results.

      Notes:

      • The sorted results show customers sorted by with the highest probability of choosing an Affinity Card.
      • This information could be used for a targeted affinity card promotion -- to those customers in this list that don't already have an affinity card.

      Close the Prediction Query tabbed window and Save the workflow by clicking the Save All icon in main toolbar.

    Create and Execute a Clustering Query

      Clustering models define segments, or “clusters,” of a population and then decide the likely cluster membership of each new case (although it is possible for an item to be in more than one cluster).

      These models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments

      When defining a Clustering Query node, you specify similar attributes to a Clustering model, but when the Clustering Query option is run, the Data Mining engine performs dynamic scoring of the clustering query without having to pre-define a model.

      In this topic, you define a Clustering Query node to predict homogeneous clusters of customers partitioned on their Country of residence level. Then, you run the node and view the results.

      Follow these steps:

      Add a Clustering Query node to the workflow, as shown here:

      Notes:

      • As you've learned, a yellow exclamation mark on the border indicates that the node needs more information before it is complete.
      • In this case, a connection must be created between the source data node and the Clustering Query node, and certain options must be specified within the Clustering Query node.

      Connect the data source node to the Clustering Query node:

      A. Right-click the data source node and select Connect from the menu.

      B. Click the Clustering Query node to complete the connection.

      Result: The workflow should look like this.

      Note: By connecting these nodes, you enable the definition of the Clustering Query options based on the source data.

      A Clustering Query node is defined by specifying the following:

      • First, define the Case ID attribute.
      • Second, specify the number of clusters.
      • Optionally, select one or more partition columns.
      • Optionally, add or remove prediction output columns.

      Double-click the node to open the Edit Clustering Query Node window. Then, perform the following on the Cluster Predictions tab:

      A. Select CUST_ID as the Case ID attribute

      B. Accept the default value of 10 for number of clusters to compute.

      By default, several Cluster Prediction Output functions are automatically defined, including:

      • Cluster Details
      • Cluster Distance
      • Cluster ID (Identifies each cluster)
      • Cluster Probability

      You can manually add new prediction outputs and remove existing outputs by using the same techniques shown in the Prediction Query topic.

      Next, select the COUNTRY_NAME attribute as the Partition column, as shown here:

      Note: This definition instructs the Data Mining server to build a unique virtual model for each distinct COUNTRY_NAME value.

      Click OK to close the Edit Clustering Query Node window.

      Right-click the Clustering Query node and select View Data from the menu.

      Result: A tabbed window displays the data, including:

      Output Columns (CUST_ID, COUNTRY_NAME)

      Cluster Prediction Outputs (CLID_1, PROB_1, CDET_1, CDST_1)

      Notes:

      • Your prediction column names may be slightly different.
      • In the example, output columns have been resized.

      Next, sort and filter the output as follows:

      A. Click the Sort button, sort by Probability in descending order, and then click Apply Sort.

      B. Next, apply the following Where clause in the Filter box: CLID_1 = 8

      Note: Make sure to use the correct column name from your example in the Filter box.

      Result: The filter selects only those records that are predicted to be in the eighth cluster.

      Close the Clustering Query tabbed window and Save the workflow by clicking the Save All icon in main toolbar.

    Create and Execute an Anomaly Detection Query

      Anomaly Detection models use one-class classification. In this approach, the model trains on data that is homogeneous. Then, the model determines whether a new case is similar to the cases observed, or is somehow “abnormal” or “suspicious.” It detects abnormal cases but does not give reasons.

      When defining an Anomaly Detection Query node, you specify similar attributes to an Anomaly Detection model, but when the query option is run, the Data Mining engine performs dynamic scoring of the anomaly detection query without having to pre-define a model.

      In this topic, you define an Anomaly Detection Query node to predict potentially anomalous cases, and view the supporting prediction details.

      Follow these steps:

      Add an Anomaly Detection Query node to the workflow, as shown here:

      Connect the data source node to the Anomaly Detection Query node.

      Result: The workflow should look something like this.

      An Anomaly Detection Query node is defined by specifying the following:

      • First, define the Case ID attribute.
      • Second, optionally select one or more partition columns.
      • Third, optionally add or remove prediction output columns.

      Double-click the node to open the Edit Anomaly Detection Query Node window. Then, on the Anomaly Predictions tab, select CUST_ID as the Case ID attribute.

      Next, select the COUNTRY_NAME attribute as the Partition column, as shown here:

      Click OK to close the Edit Anomaly Detection Query Node window.

      Right-click the Anomaly Detection Query node and select View Data from the menu.

      Result: A tabbed window displays the data, including:

      • Output Columns (CUST_ID, COUNTRY_NAME)
      • Anomaly Prediction Outputs (PRED_1, PROB_1, PDET_1)

      A. Sort the query results by using the following criteria::

      1. Prediction, in descending order.
      2. Probability in descending order.

      B. Click Apply Sort to requery the data.

      Note: Your prediction column names may be slightly different.

      The Anomaly Prediction Query options enable viewing of expanded prediction details.

      Note: This feature requires SQL Developer 4.0 EA3 or later.

      A. First, click the PDET_# column for the case you want to view.

      B. Then, scroll to the far right and click the sunglasses icon to display the View Value window. This window contains, in ranked order, information about the attributes that are considered material for the anomaly prediction.

      In the example, we select case number 4, with the Customer ID of 101,829. You can examine the prediction details if any case in the same way.

      C. After you examine the prediction details, close the View Value window.

      Close the Anomaly Detection Query tabbed window and Save the workflow by clicking the Save All icon in main toolbar.

    Create and Execute a Feature Extraction Query

      Feature Extraction models create new attributes (features) by using combinations of the original attribute. For example, you can group the demographic attributes for a set of customers into general characteristics that describe the customer.

      When defining a Feature Extraction Query node, you specify similar attributes to a Feature Extraction model, but when the Query option is run, the Data Mining engine performs dynamic scoring of the clustering query without having to pre-define a model.

      Follow these steps:

      Add a Feature Extraction Query node to the workflow, as shown here:

      Connect the data source node to the Feature Extraction Query node:

      Result: The workflow should look like this.

      A Feature Extraction Query node is defined by specifying the following:

      • First, define the Case ID attribute.
      • Second, specify the number of features to extract.
      • Optionally, select one or more partition columns.
      • Optionally, add or remove prediction output columns.

      Double-click the node in the workflow to open the Edit Feature Extraction Query Node window. Then, on the Feature Predictions tab, perform the following:

      A. Select CUST_ID as the Case ID attribute.

      B. Accept the default value of 10 features to extract.

      By default, only one Feature Prediction Output function is automatically defined: the Feature Set.

      Next, select the Partition tab and perform the following:

      A. Click the Add tool (green "+" icon).

      B. Select the CUST_GENDER and EDUCATION attributes as partition columns.

      C. Click OK.

      Note: This definition instructs the Data Mining server to build a unique virtual model for each distinct CUST_GENDER value and each distinct EDUCATION value.

      Finally, click OK to close the Edit Feature Extraction Query Node window.

      To view the query results, right-click the Feature Extraction Query node and select View Data from the menu.

      Result: A tabbed window displays the data, including the output columns and Feature Set definition column. Like the Anomaly Detection Query, expanded prediction details are also provided for Feature Extraction Query, as shown here:

      Close the Feature Extraction Query tabbed window and Save the workflow by clicking the Save All icon in main toolbar.

Summary

    In this lesson, you learned how to use the four types of Predictive Query options, which include:

    • Prediction Query
    • Clustering Query
    • Anomaly Detection Query
    • Feature Extraction Query

    Resources

    To learn more about Oracle Data Mining:

    Credits

    Lead Curriculum Developer: Brian Pottle

    Other Contributors: Charlie Berger, Mark Kelly, Margaret Taft, Kathy Talyor

To help navigate this Oracle by Example, note the following:

Hiding Header Buttons:
Click the Title to hide the buttons in the header. To show the buttons again, simply click the Title again.
Topic List Button:
A list of all the topics. Click one of the topics to navigate to that section.
Expand/Collapse All Topics:
To show/hide all the detail for all the sections. By default, all topics are collapsed
Show/Hide All Images:
To show/hide all the screenshots. By default, all images are displayed.
Print:
To print the content. The content currently displayed or hidden will be printed.

To navigate to a particular section in this tutorial, select the topic from the list.