Using Feature Selection and Generation with GLM

Overview

Purpose

This tutorial covers the use of Oracle Data Miner 4.0 to leverage enhancements to the Oracle implementation of Generalized Linear Models (GLM) for Oracle Database 12c. These enhancements include support for Feature Selection and Generation. In this lesson, you learn how to use these GLM enhancements in a Classification model.

Time to Complete

Approximately 30 minutes.

Introduction

Generalized Linear Models provide great transparency, which may be achieved at the expense of accuracy. With the introduction of a feature selection and generation capability, GLMs can maintain a high degree of accuracy without sacrificing transparency (the ability to explain the predictions made by the model).

Feature Selection is the process of selecting the most meaningful attributes. With feature selection, GLMs can be created with fewer predictors, leading to smaller models and faster scoring.
Feature Generation is the process of combining attributes into new features. With feature generation, GLMs use non-linear terms (up to cubic terms), leading to more powerful models and increased accuracy.
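
To make the idea of non-linear terms concrete, the following Python sketch enumerates every product of attributes up to the cubic level and evaluates those products for a single row. It is only an illustration of what "combining attributes into new features" can mean; it is not Oracle's implementation, and the attribute names AGE and INCOME and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def generate_terms(names, max_degree=3):
    """Return every product of attribute names up to max_degree (cubic by default)."""
    terms = []
    for degree in range(1, max_degree + 1):
        terms.extend(combinations_with_replacement(names, degree))
    return terms


def evaluate_terms(row, terms):
    """Compute each generated feature for one row (a dict of attribute -> number)."""
    return {"*".join(term): prod(row[name] for name in term) for term in terms}


# Hypothetical attributes and values, for illustration only.
attributes = ["AGE", "INCOME"]
terms = generate_terms(attributes)
features = evaluate_terms({"AGE": 2.0, "INCOME": 3.0}, terms)

print(len(terms))                  # number of candidate features
print(features["AGE*AGE*INCOME"])  # one cubic term
```

Even two attributes yield nine candidate terms, which suggests why feature generation is normally paired with feature selection.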

In this lesson, you will create a new workflow that illustrates these enhancements.

Scenario

This lesson focuses on a business problem that can be solved by applying a Classification model. In our scenario, ABC Company wants to know which customer attributes are most significant in predicting the gender of a customer. The new feature selection / generation enhancements are used as part of this mining exercise.

In this new workflow, you:

Identify and select two new data sources from the Oracle Database sample SH schema: the SALES and CUSTOMERS tables.
Summarize the QUANTITY_SOLD and AMOUNT_SOLD measures from the SALES table by Customer and Product, over the Promotion and Channel dimensions.
Place the summarized data into a new table.
Join the summarized sales data with customer data to provide a pool of data for the Classification model.
Apply the Feature Selection / Generation option with a GLM algorithm and examine the results.

The completed workflow looks like this:

Software Requirements

To complete this tutorial, you must have access to or have installed:
- Oracle Database 12c Enterprise Edition, Release 12.1 with Advanced Analytics Option.
- The Oracle Database sample data, including the unlocked SH schema.
- SQL Developer 4.0

Prerequisites

Before starting this tutorial, you should have:

Set up Oracle Data Miner for use within Oracle SQL Developer 4.0. If you have not already set up Oracle Data Miner, complete the lesson: Setting Up Oracle Data Miner 4.0
Completed the lesson: Using Oracle Data Miner 4.0

Create a Data Miner Project

Here, you will create a new project using the same techniques shown in the "Using Oracle Data Miner 4.0" tutorial.

NOTE: If you have already completed the lesson Using the SQL Query Node in a Workflow, skip this topic and go to Build the Data Miner Workflow.

To create a Data Miner Project, perform the following steps:

In the Data Miner tab, right-click dmuser and select **New Project**, as shown here:

In the Create Project window, enter a project name (in this example SH Schema) and then click OK.

Build the Data Miner Workflow

As discussed in the "Using Oracle Data Miner 4.0" tutorial, a Data Miner Workflow is a collection of connected nodes that describes a data mining process.

Sample Data Mining Scenario

In this topic, you will create a data mining process that identifies which attributes are most significant in predicting the gender of a customer.

To accomplish this goal, you build a workflow that enables you to:

Identify and combine data from multiple data sources
Create three GLM Classification models
Build and compare model results

To create the workflow for this process, perform the following steps.

Create a Workflow and Add Data Sources

Right-click your project (SH Schema) and select **New Workflow** from the menu.

In the Create Workflow window, enter **Predicting Customer Gender** as the name and click OK.

The first element of any workflow is the source data. Here, you add the first of two Data Source nodes to the workflow.

A. In the Components tab, drill on the **Data** category. A group of six data nodes appears, as shown here:

In Step 1 of the wizard:

A. Click **Add Schemas**, beneath the Available Tables/Views list, as shown here:

Next, scroll down and select the **SH.CUSTOMERS** table. Then click **Finish** in the wizard.

A. Using the same technique just described, add a second Data Source node to the workflow for the **SH.SALES** table, just underneath the CUSTOMERS data source, like this:

Aggregate and Join Data

The Transforms node group contains a number of tools that enable you to transform data for use within a workflow. In this topic, you will:

Use an Aggregate node to aggregate the AMOUNT_SOLD and QUANTITY_SOLD measures from the SALES table. The measures will be aggregated by Customer and Product, over the Promotion and Channel dimensions.
Create a table for the aggregated sales data.
Use a Join node to join the aggregated sales data to the CUSTOMERS table.
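
The data flow that the Aggregate and Join nodes produce can be sketched in plain Python. This is only an approximation under stated assumptions: the rows are invented, the grouping is read as "sum the measures per customer and product across all promotions and channels", the join is assumed to be an inner join on CUST_ID, and the function names are hypothetical. The nodes themselves generate SQL inside the database.

```python
from collections import defaultdict

# Toy rows standing in for SH.SALES and SH.CUSTOMERS; all values are invented.
sales = [
    {"CUST_ID": 1, "PROD_ID": 10, "PROMO_ID": 999, "CHANNEL_ID": 3, "QUANTITY_SOLD": 1, "AMOUNT_SOLD": 20.0},
    {"CUST_ID": 1, "PROD_ID": 10, "PROMO_ID": 350, "CHANNEL_ID": 2, "QUANTITY_SOLD": 2, "AMOUNT_SOLD": 40.0},
    {"CUST_ID": 2, "PROD_ID": 11, "PROMO_ID": 999, "CHANNEL_ID": 3, "QUANTITY_SOLD": 1, "AMOUNT_SOLD": 15.5},
    {"CUST_ID": 3, "PROD_ID": 10, "PROMO_ID": 999, "CHANNEL_ID": 4, "QUANTITY_SOLD": 1, "AMOUNT_SOLD": 20.0},
]
customers = [
    {"CUST_ID": 1, "CUST_GENDER": "M"},
    {"CUST_ID": 2, "CUST_GENDER": "F"},
]


def aggregate_sales(rows):
    """Sum the two measures for each (CUST_ID, PROD_ID) pair."""
    totals = defaultdict(lambda: {"QUANTITY_SOLD": 0, "AMOUNT_SOLD": 0.0})
    for row in rows:
        key = (row["CUST_ID"], row["PROD_ID"])
        totals[key]["QUANTITY_SOLD"] += row["QUANTITY_SOLD"]
        totals[key]["AMOUNT_SOLD"] += row["AMOUNT_SOLD"]
    return [
        {"CUST_ID": cust_id, "PROD_ID": prod_id, **measures}
        for (cust_id, prod_id), measures in sorted(totals.items())
    ]


def join_on_cust_id(aggregated, customer_rows):
    """Inner join: keep aggregated rows whose CUST_ID has a customer record."""
    by_id = {customer["CUST_ID"]: customer for customer in customer_rows}
    return [
        {**by_id[row["CUST_ID"]], **row}
        for row in aggregated
        if row["CUST_ID"] in by_id
    ]


aggregated = aggregate_sales(sales)
joined = join_on_cust_id(aggregated, customers)
for row in joined:
    print(row)
```

In this toy data, customer 3 has sales but no customer record, so the assumed inner join drops that row from the pool of data passed to the model.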

Follow these steps:

In the Components tab, drill on the **Transforms** group, and then drag and drop the **Aggregate** node to the Workflow, like this:

Create Classification Models

Next, you will add a Classification node to the workflow, as you did in the "Using Oracle Data Miner 4.0" tutorial.

However, in this scenario, you will:

Remove all of the default algorithms from the Class Build node except the GLM algorithm.
Add a second GLM algorithm to the node, and modify it to use the Feature Selection option.
Add a third GLM algorithm to the node, and modify it to include the Feature Selection and Feature Generation options.

Then, in the next topic, you will build the classification models and compare the results of the three GLM models.
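
As a rough intuition for what a feature selection step does, the sketch below ranks candidate attributes by the absolute value of their Pearson correlation with a 0/1-encoded target and keeps the strongest ones. This is a deliberately simple stand-in, not the selection criterion that the Oracle GLM algorithm applies (this tutorial does not describe it); the data values, the CONSTANT column, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, target, keep):
    """Return the `keep` column names most correlated (in absolute value) with target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical data: 1 encodes CUST_GENDER = M, 0 encodes F.
target = [1, 1, 0, 0, 1, 0]
columns = {
    "AMOUNT_SOLD": [90.0, 80.0, 20.0, 30.0, 85.0, 25.0],
    "QUANTITY_SOLD": [2, 3, 2, 3, 2, 3],
    "CONSTANT": [1, 1, 1, 1, 1, 1],
}

selected = select_features(columns, target, keep=1)
print(selected)
```

A model built on only the selected attributes has fewer coefficients to inspect, which is the transparency benefit this lesson is concerned with.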

Follow these steps:

A. First, expand the **Models** category in the Components tab, and add a **Classification** node to the Workflow pane, like this:

Select all of the model settings except for GLM, and then click the **Remove** tool (red "x" icon), as shown below. (Select **Yes** in the warning message window.)

In the Edit Classification Build Node window:

A. Select **CUST_GENDER** as the Target attribute.

B. Select **CUST_ID** as the Case ID attribute.

Next, with the GLM model setting selected, click the **Duplicate Selected Model** tool, as shown here:

Next, you view the default settings for the GLM models, and then modify the duplicated model to add Feature Selection.

A. With the duplicated model selected (CLAS_GLM_2_2 in this example), click the Edit Advanced Model Settings tool (pencil icon).

Once again, select the second GLM model in the list, select the Feature Selection/Generation option, and then click the associated **Option** button, as shown here:

Back in the Edit Classification Build Node window, select the modified GLM model and click the **Duplicate Selected Model** tool, as shown here.

Next, select the new GLM model (CLAS_GLM_3_2 in this example), and then click the Edit Advanced Model Settings tool (pencil icon).

A. In the Algorithm Settings tab of the Advanced Model Settings window, click the **Option** button next to Feature Selection/Generation, as shown here:

A. To save your change, click OK in the Advanced Model Settings window.

B. Then, click OK in the Edit Classification Build Node window, as shown here:

Build and Compare the Models

In this topic, you build the three GLM models against the joined source data. Then, you examine the model results. As stated before, we are interested in those input attributes (features) that are most significant in predicting the outcome of customer gender.

In this scenario, we will compare the results of the first GLM model that uses Ridge Regression, to the other two GLM models. The second model uses Feature Selection, and the third model uses both Feature Selection and Feature Generation. We want to see which model produces the highest degree of predictive accuracy without sacrificing transparency.

Right-click the classification build node and select **Run** from the pop-up menu.

Notes:

When the node runs it builds and tests all of the models that are defined in the node.

As before, a green gear icon appears on the node borders to indicate a server process is running, and the status is shown at the top of the workflow window.

When the build is complete, all nodes contain a green check mark in the node border.

Right-click the Class Build node and select **Compare Test Results** from the menu, like this:

Next, select the **Performance Matrix** tab.
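
If you want to see how the counts in such a matrix relate to overall accuracy, the following Python sketch tallies actual versus predicted target values and derives the share of correct predictions. It assumes the Performance Matrix plays the role of a standard confusion matrix; the label lists are invented and the function names are hypothetical.

```python
from collections import Counter


def performance_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of target values occurs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Share of cases where the predicted value equals the actual value."""
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / sum(matrix.values())


# Invented test results for eight customers.
actual = ["M", "M", "F", "F", "M", "F", "M", "F"]
predicted = ["M", "F", "F", "F", "M", "M", "M", "F"]

matrix = performance_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual={a} predicted={p} count={count}")
print(accuracy(matrix))
```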

Now, we’ll use the **View Models** short-cut menu option from the Class Build node to view data about each model individually. In each case, a model window opens that contains four tabs. We’ll examine the Coefficients tab for each model to compare the attributes that are considered significant in predicting the outcome. For all three models, we will look at the Target Value of M (Male).
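
When reading the coefficients, it can help to recall how a GLM classification model (logistic regression) turns them into a prediction: each coefficient is multiplied by its attribute value, the products are added to the intercept, and the sum passes through the logistic function. The sketch below shows that calculation with made-up numbers; the intercept, coefficients, and row values are hypothetical and are not taken from any model built in this lesson.

```python
from math import exp


def predict_probability(intercept, coefficients, row):
    """Probability of the target value under a logistic model."""
    linear = intercept + sum(coefficients[name] * row[name] for name in coefficients)
    return 1.0 / (1.0 + exp(-linear))


# Hypothetical coefficients for the target value M.
intercept = -1.0
coefficients = {"AMOUNT_SOLD": 0.02, "QUANTITY_SOLD": -0.5}

row = {"AMOUNT_SOLD": 100.0, "QUANTITY_SOLD": 2}
p_male = predict_probability(intercept, coefficients, row)
print(round(p_male, 3))

# A positive coefficient raises the probability as its attribute grows.
bigger_spender = {"AMOUNT_SOLD": 200.0, "QUANTITY_SOLD": 2}
print(predict_probability(intercept, coefficients, bigger_spender) > p_male)
```

Because each term's contribution is visible in this sum, a model with fewer coefficients is easier to explain.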

A. Right-click the Class Build node and select **View Models > CLAS_GLM_1_#** (the first model, which does not use Feature Selection).

Right-click the Class Build node and select **View Models > CLAS_GLM_2_#** (the second model, which uses Feature Selection only).

A. In the **Coefficients** tab, use the same criteria and sorting as with the first model.

Summary

In this lesson, you learned how to use the Feature Selection/Generation option with a GLM algorithm classification model.

You also learned how to:

Add data sources from other schemas
Use the Aggregate and Join nodes from the Transforms node group
Copy and modify existing models for comparative purposes

Resources

To learn more about Oracle Data Mining:

See the Oracle Data Mining and Oracle Advanced Analytics pages on OTN.
Refer to additional OBEs in the Oracle Learning Library
See the Data Mining Concepts manuals:
- Oracle Database 12c Release 1 (12.1)
- Oracle Database 11g Release 2 (11.2)

Credits

Lead Curriculum Developer: Brian Pottle

Other Contributors: Charlie Berger, Mark Kelly, Margaret Taft, Kathy Taylor
