Star Schema Mining Using Oracle Data Miner


Purpose

This tutorial covers the use of Oracle Data Miner to perform star schema mining against Oracle Database 11g Release 2.

A star schema is the simplest style of data warehouse schema. A star schema consists of one or more fact tables referencing any number of dimension tables. A fact table holds the main measure data; dimension tables contain attributes that describe the data in the fact table.

In this tutorial, you use Oracle Data Miner against a very simple star schema consisting of one fact table (SALES) and one dimension table (CUSTOMERS) in the sample Sales History (SH) schema that is shipped with Oracle Database.
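For reference, queries against a star schema join the fact table to its dimensions on the dimension keys. A minimal sketch against these two tables (purely illustrative; in this tutorial, all of the work is done through the Data Miner interface):

    -- SALES (fact) references CUSTOMERS (dimension) through CUST_ID.
    SELECT c.cust_id, c.cust_gender, s.prod_id, s.quantity_sold, s.amount_sold
    FROM   sh.sales s
           JOIN sh.customers c ON s.cust_id = c.cust_id
    WHERE  ROWNUM <= 10;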

Time to Complete

Approximately 30 minutes.

Overview

Data mining is the process of extracting useful information from masses of data by identifying patterns and trends within it. Data mining can be used to solve many kinds of business problems.

The phases of solving a business problem using Oracle Data Mining are as follows:

  1. Problem Definition in Terms of Data Mining and Business Goals
  2. Data Acquisition and Preparation
  3. Building and Evaluation of Models
  4. Deployment

Problem Definition and Business Goals

When performing data mining, the business problem must be well-defined and stated in terms of data mining functionality. For example, retail businesses, telephone companies, financial institutions, and other types of enterprises are interested in customer “churn” – that is, a previously loyal customer switching to a rival vendor.

The statement “I want to use data mining to solve my churn problem” is much too vague. From a business point of view, the reality is that it is much more difficult and costly to win back a customer who has defected than to prevent a disaffected customer from leaving; furthermore, you may not be interested in retaining a low-value customer. Thus, from a data mining point of view, the problem is to predict, with high probability, which customers are likely to churn, and also to predict which of those are potentially high-value customers.

Data Acquisition and Preparation

A general rule of thumb in data mining is to gather as much information as possible about each individual, and then let the data mining operations indicate any filtering of the data that might be beneficial. In particular, you should not eliminate an attribute because you think that it might not be important – let ODM’s algorithms make that decision. Moreover, since the goal is to build a profile of behavior that can be applied to any individual, you should eliminate specific identifiers such as name, street address, and telephone number. (However, attributes that indicate a general location without identifying a specific individual, such as postal code, may be helpful.)
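For example, a data-preparation query against SH.CUSTOMERS might keep demographic and general-location attributes while omitting identifiers (the column choice here is illustrative):

    -- Omit identifiers such as CUST_FIRST_NAME, CUST_LAST_NAME,
    -- CUST_STREET_ADDRESS, and CUST_MAIN_PHONE_NUMBER.
    SELECT cust_id,              -- kept only as a neutral case identifier
           cust_gender,
           cust_year_of_birth,
           cust_marital_status,
           cust_postal_code      -- general location, not a specific individual
    FROM   sh.customers;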

It is generally agreed that the data gathering and preparation phase consumes more than 50% of the time and effort of a data mining project.

Building and Evaluation of Models

Oracle Data Miner's workflow creation process automates many of the difficult tasks involved in building and testing models. It’s difficult to know in advance which algorithms will best solve the business problem, so normally several models are created and tested.

No model is perfect, and the search for the best predictive model is not necessarily a question of determining the model with the highest accuracy, but rather a question of determining the types of errors that are tolerable in view of the business goals.

Deployment

Oracle Data Mining produces actionable results, but the results are not useful unless they can be placed into the correct hands quickly. The Oracle Data Miner user interface provides several options for publishing the results.

Scenario

As stated previously, the data for this tutorial comes from one of the Sample Schemas shipped with Oracle Database: the Sales History (SH) schema.

This sample data set is not designed for data mining, nor has it been through the "Data Acquisition and Preparation" phase described above in the Overview section. Therefore, the focus of this tutorial is not on solving a particular business problem by applying Classification models. Rather, this lesson teaches you how to complete certain important tasks while performing star schema mining.

Specifically, you will:

  • Create a Data Miner project and workflow
  • Add data sources for the CUSTOMERS and SALES tables in the SH schema
  • Aggregate the SALES transaction data on a per-customer basis
  • Join the aggregated data with the CUSTOMERS data
  • Create an output table for increased performance
  • Build, run, and compare Classification models against the output table

Prerequisites

Before starting this tutorial, you should:

1.

Have access to or have installed Oracle Database 11g Enterprise Edition, Release 2 (11.2.0.1), with the Data Mining option.

In addition, you must install the Oracle Database Sample Schemas, including the Sales History (SH) schema.

 

2.

Have access to or have installed Oracle SQL Developer, version 3.0 or later.

Note: SQL Developer does not have an installer. To install SQL Developer, just unzip the free SQL Developer download to any directory on your PC.

 

3.

Have completed the Setting Up Oracle Data Miner 11g Release 2 tutorial, which creates the dmuser database connection and installs the Data Miner Repository.

Create a New Data Miner Project

In the Setting Up Oracle Data Miner 11g Release 2 tutorial, you learned how to create a database connection to a Data Miner user and install the Oracle Data Miner Repository, all from within SQL Developer.

In the Using Oracle Data Miner 11g Release 2 tutorial, you learned how to create Data Miner Project and Workflow.

Here, you will create a new Data Miner project using the existing Data Miner user connection.

Note: The layout and contents of your SQL Developer window may be different from the example shown below. Simply reorganize your SQL Developer environment to match the example given.

To create a new project, perform the following steps:

1.

First, select the Data Miner tab.

Result: The dmuser connection that you created previously appears.

Note: If you have not yet created the dmuser connection, follow the instructions in Setting Up Oracle Data Miner 11g Release 2.

 

2.

Right-click dmuser and select New Project, as shown here:

 

3.

In the Create Project window, enter High Value as the project name and then click OK.

Note: You may optionally enter a comment that describes the intent of the project. This description can be modified at any time.

Result: The new project appears below the connection node.


Note: The ABC Insurance project may have been created by completing the Using Oracle Data Miner 11g Release 2 tutorial.

 

Perform Star Schema Mining Tasks

In the previous tutorial, you selected data from the DMUSER schema to create and apply Classification models. In this tutorial, you will perform star schema mining in a different schema.

As you have already learned, data source nodes allow you to select any table or view in the current account (schema). In addition to tables in the current account, you can select tables in other accounts that you have permission to use. The DMUSER account also provides access to the Sales History (SH) schema, one of the Oracle Database Sample Schemas.
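If you want to confirm this access from a SQL Developer worksheet, a quick data-dictionary query (assuming the SH sample schema is installed and its objects are visible to dmuser) is:

    -- List the SH tables that the current user is allowed to see.
    SELECT owner, table_name
    FROM   all_tables
    WHERE  owner = 'SH'
    ORDER  BY table_name;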

In this topic, you will create a workflow and add two data source nodes for objects in the SH schema: one for the CUSTOMERS table and another for the SALES table.

Create a Workflow and Add Data Sources

In order to access the star schema data using the Data Miner interface, you must first create a Workflow, as described in the Using Oracle Data Miner 11g Release 2 tutorial. In that lesson, you learned that a workflow is a collection of connected nodes that describes a data mining process for the Data Mining server to execute.

To create the workflow and add the appropriate data sources for our star schema mining lesson, perform the following steps.

1.

Right-click the High Value project and select New Workflow from the menu.

Result: The Create Workflow window appears.

 

2.

In the Create Workflow window, enter Star Schema Mining as the name and click OK.

Result:

  • In the middle of the SQL Developer window, an empty workflow canvas opens with the workflow name that you specified.
  • On the right-hand side of the interface, the Component Palette tab of the Workflow Editor appears (shown below with a red border).
  • In addition, two other Oracle Data Miner interface elements appear:
    • The Thumbnail tab appears below the Data Miner tab on the left
    • The Property Inspector tab appears below the Workflow Editor on the right.



3.

Next, add the first of two Data Source nodes for the SH schema to the workflow.

A. In the Component Palette, click the Data category. A list of data nodes appears, as shown here:

B. Drag and drop the Data Source node onto the Workflow pane.

Result: A Data Source node appears in the Workflow pane and the Define Data Source wizard opens.

Notes:

  • Workspace node names and model names are generated automatically by Oracle Data Miner. In this example, the name "Data Source" is generated. You may not get exactly the same node and model names as shown in this lesson.
  • You can change the name of any workspace node or model using the Property Inspector.
  • The tables and views associated with the DMUSER account are displayed in the wizard by default.


Now, you will add SH schema tables and views as available data sources.

 

4.

In Step 1 of the wizard:

A. Click the Add Schemas button.

Result: The Edit Schema List window appears.

B. In the Edit Schema List window, move SH from the list of Available Schemas to the list of Selected Schemas. Then, click OK, as shown here:


C. Back in the Define Data Source wizard, select the Include Tables from Other Schema option.

Result: The Available Tables/Views list now also includes objects in the added schema.

D. Scroll down in the list to see the SH schema objects, as shown here.


Note: You may use the two tabs in the bottom pane in the wizard to view and examine the selected table. The Columns tab displays information about the table structure, and the Data tab shows a subset of data from the selected table or view.
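The Data tab is, in effect, a row-limited preview. The worksheet equivalent would be something like:

    -- Preview a handful of rows, much as the wizard's Data tab does.
    SELECT * FROM sh.customers WHERE ROWNUM <= 10;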

 

5.

A. Select the SH.CUSTOMERS table.

Note: In Step 2 of the wizard, you may remove individual columns that you don't need in your data source. In our case, we'll keep all of the attributes that are defined in the table.

B. Click Finish at the bottom of the wizard window.

Result: As shown below, the data source node name is updated with the selected table name, and the properties associated with the node are displayed in the Property Inspector, located below the Component Palette pane.


Notes:

  • You can resize nodes in the workflow canvas by entering or selecting a different value from the Zoom options. Notice that 100% is selected by default in the Zoom pull-down list.
  • You can add descriptive information about any node by using the Details tab in the Property Inspector.
  • The Thumbnail tab also provides a smaller display of the larger workflow window. As you drag nodes around the workflow window, the thumbnail view automatically adjusts.

 

6.

Now, add a second Data Source node, this time for the SH.SALES table.

Once again, drag and drop the Data Source node from the Workflow Editor of the Component Palette onto the Workflow pane.


Result: A second Data Source node appears in the Workflow pane and the Define Data Source wizard opens.

 

7.

In Step 1 of the Define Data Source wizard:

A. Scroll down in the Available Tables/Views list and select the SH.SALES table.

B. Click Finish at the bottom of the wizard window.


Result: As shown below, the data source node name is updated with the selected table name, and the properties associated with the node are displayed in the Property Inspector, located below the Component Palette pane.


 

8.

Click the Save All tool in the SQL Developer toolbar.

Result: The workflow document is saved, and the workflow name changes in appearance from italic to a normal font display.

 

Aggregate Data

Next, you use an Aggregate node to transform the data in SALES so that it is summarized (aggregated) on a per-customer basis.

1.

To create the aggregation node, follow these steps:

A. In the Workflow Editor of the Component Palette, expand the Transforms section.

B. Drag and drop the Aggregate node to the workflow canvas, like this:

Result: A new Aggregate node appears in the workflow pane.

C. Using the Property Inspector, change the node name to Aggregate SALES, as shown here.

D. Then, click on the Aggregate node to see the change in the workflow canvas.

Notes:

  • A yellow Information (!) icon in the border around any node indicates that it is not complete. Therefore, at least one additional step is required before the aggregate node can be used.
  • In this case, the aggregate node must be connected to the SALES data source node, and then the aggregation process must be defined.

 

2.

To link the aggregation and data source nodes, use the following instructions:

A. Right-click the SALES node, select Connect from the pop-up menu, and then drag the pointer to the Aggregate SALES node, as shown here:

B. Then, click the Aggregate SALES node to connect the two nodes. The resulting display looks like this:

 

3.

Next, select a "Group By" attribute for the aggregation.

A. Double-click the Aggregate SALES node to display the Edit Aggregation window.

B. Click Edit to open the Edit Group By window, as shown here:

C. With Column selected as the Type value, move CUST_ID from the Available Attributes list to the Selected Attributes list, as shown below.

D. Then, click OK.


 

4.

Next, use the Aggregation Wizard to define the aggregation rules.

A. Back in the Edit Aggregation window, click the Aggregation Wizard tool (as shown here) to launch the wizard:

B. In the Define Aggregation Wizard, perform the following:

  • In Step 1 of the wizard, select SUM() from the Numerical function list and click Next.

  • In Step 2 of the wizard, move QUANTITY_SOLD and AMOUNT_SOLD from the Available Attributes list to the Selected Attributes list, and click Next.

Note: In Step 2, you chose the column or columns on which to apply the aggregation function.

  • In Step 3, with Column selected as the Type value, move PROD_ID from the Available Attributes list to Selected Attributes list, and click Next.


Notes:

  • In Step 3, you selected a "Sub Group By" column for the aggregation function.

  • The selections made in wizard steps 1 - 3 define the structure of a nested table. In this case, for each customer, the new table will contain the total quantity sold and the total amount sold on a per-product basis.

  • In Step 4, the wizard provides default names for the new columns. Simply click Finish to accept the default names and complete the specification of the aggregation.


C. Back in the Edit Aggregation window, click OK.


Note: The yellow "!" icon no longer appears in the Aggregate node border, indicating that the node is ready to run.
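In plain SQL, a flattened version of the aggregation you just defined would look roughly like this (the node itself stores the per-product totals in a nested column, one row per customer; the column aliases below are illustrative, not the wizard's generated names):

    -- Total quantity and amount sold, per customer and per product.
    SELECT cust_id,
           prod_id,
           SUM(quantity_sold) AS quantity_sold_sum,
           SUM(amount_sold)   AS amount_sold_sum
    FROM   sh.sales
    GROUP  BY cust_id, prod_id;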


5.

A. Right-click on Aggregate SALES and select Run to run the aggregation.

Result: The workflow document is saved, and the Aggregate node is executed. The process creates a nested table that can be used for data mining.

After the process is done, the workflow display looks like this:


Join Data Sources

At this stage, the workflow contains three data sources: CUSTOMERS, SALES, and the output of Aggregate SALES.

Next, you use a Join node to combine the CUSTOMERS data and the Aggregate SALES output into one data source.

Note: A Join combines rows from two or more tables, views, or materialized views. The join can select any columns from any of the tables being combined.

Here, you create an inner (or simple) join, which is a join of two or more tables that returns only those rows that satisfy the join condition.
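In SQL terms, the join you are about to build is conceptually the following (AGGREGATED_SALES is a stand-in name for the output of the Aggregate SALES node, which exists only as a data flow at this point):

    -- Inner join: keep only the customers that have aggregated sales rows.
    SELECT c.*, a.*
    FROM   sh.customers c
           JOIN aggregated_sales a ON c.cust_id = a.cust_id;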

Follow these steps:

1.

A. Once again, open the Transforms section of the Component Palette.

B. Drag and drop the Join node to the workflow canvas, like this:


C. Using the Property Inspector, name the node Join Data.


Notes:

  • As stated previously, a yellow exclamation mark on the border indicates that more information needs to be specified before the node is complete.
  • In this case, two actions are required:
  1. The CUSTOMERS and Aggregate SALES source data nodes must be connected to the Join node.
  2. The join columns must be specified in the Join node.

 

2.

First, connect the CUSTOMERS data source node to the Join Data node using the right-mouse menu and the same technique described previously.


 

3.

Next, connect the Aggregate SALES data source node to the Join Data node, like this:



4.

To define the join, perform the following:

A. Double-click the Join Data node to open the Edit Join Details window.

B. Next, click the "+" tool to the right of the Join Columns header, as shown here:


Result: The Edit Join Column Dialog opens.

C. In the Edit Join Column Dialog, select CUSTOMERS in the Source 1 box, and Aggregate SALES in the Source 2 box, like this:

D. To define the join columns:

  • Select CUST_ID as the join column in both Source 1 and in Source 2.
  • Click the Add button to populate the Join Columns box.

Result: The Edit Join Column Dialog should look like this:


Note: The dialog automatically determines that the Join Type is Inner.

E. Click OK in the Edit Join Column Dialog to complete the join specification.

Result: The join specification appears in the Edit Join Details window.

F. Finally, click OK in the Edit Join Details window.

Result: The Join Data node no longer contains the "!" warning, indicating that it is ready to run.


 

5.

Next, right-click Join Data and select Run from the pop-up menu to create the join.

Result: The workflow document is saved, and the join is executed. When complete, all of the nodes in the workflow canvas should contain green check marks:


 

Create an Output Table for Increased Performance

The output of the Join node is a data flow, not a stored table. If you perform a sequence of complex transformations, saving the result of that sequence as a table makes subsequent operations faster.

Therefore, you will create a table from the join node. The Classification models are built against this table.
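Conceptually, the node you are about to add materializes the accumulated data flow with a CREATE TABLE ... AS SELECT. A hand-written sketch of the same idea, reusing the stand-in names from the earlier examples:

    -- Materialize the joined flow so that later nodes read a real table.
    CREATE TABLE fast_table AS
    SELECT c.*,
           a.quantity_sold_sum,
           a.amount_sold_sum
    FROM   sh.customers c
           JOIN aggregated_sales a ON c.cust_id = a.cust_id;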

To create the table, follow these steps:

1.

First, open the Data section of the Component Palette to display the available nodes. Then, drag and drop the Create Table or View node to the workflow canvas, like this:

Result: An OUTPUT node appears. A generic name is automatically created for the node.

 

2.

Next, using the technique you've learned to link nodes, connect the Join Data node to the OUTPUT node.


 

3.

A. Double-click the OUTPUT node.

B. In the Edit Create Table or View window:

  • Name the output table FAST_TABLE.
  • Ensure that Table is selected as the Type.
  • Click OK.


Notes:

  • The name of the OUTPUT node changes to FAST_TABLE.
  • In addition, FAST_TABLE will be the name of the table once created.

4.

Finally, right-click FAST_TABLE and select Run from the pop-up menu.

Result: The table is created. When the process is complete, the display should look something like this:


5.

Save the workflow document by clicking the Save All icon in the main toolbar.


Build and Run Classification Models

In this topic, you build models that predict customer gender (CUST_GENDER) based on the aggregated data. Therefore, you will specify a Classification model. By default, Oracle Data Miner selects all of the supported algorithms for a Classification model.

Here, you define a Classification node that uses all algorithms for the model. Then, you run the node to create the models.

1.

To begin, add a classification node to the workflow.

A. First, click Models in the Component Palette to display the available list of nodes:


B. Then, drag the Classification node from the palette to the Workflow pane, like this:

C. Drop the node onto the workflow. After a moment, a "Class Build" node appears in the workflow, like this:

Notes: As stated previously, a yellow exclamation mark on the border indicates that more information needs to be specified before the node is complete. In this case, two actions are required:

  1. A link must be created between the source data node and the classification build node.
  2. One or more attributes should be specified for the classification build process.

 

2.

Next, connect the data source node FAST_TABLE to the classification build node using the same technique described previously.

Result: The Edit Classification window appears.

Note: Notice that a yellow "!" indicator is displayed next to the Target field. This means that an attribute must be selected for this item.

 

3.

In the Edit Classification window, select CUST_GENDER as the Target attribute.

Notes:

  • Although not required, you may define a Case ID to uniquely identify each record. However, we will leave this option undefined.
  • As stated previously, all four algorithms for Classification modeling are selected by default. They will be automatically run unless you specify otherwise.


4.

Optionally, you can modify specific settings for each of the algorithms by using the Advanced button.

A. Click Advanced at the bottom of the Edit Classification window to display the Advanced Settings window, as shown here:

Notes:

  • The Advanced Settings window enables you to specify data usage, algorithm settings, and performance settings for each of the four classification algorithms.
  • You can also de-select (and re-select) any algorithm from this window.

B. Select the Support Vector Machine algorithm and click the Algorithm Settings tab.

C. Then, in the Kernel Function option, change the setting from System Determined to Linear, as shown here:

Note: We change the value of this Support Vector Machine (SVM) algorithm setting from the system-determined value to Linear in order to ensure model transparency for the user. Model transparency refers to the ability of the model to communicate its logic or rationale to the user.

D. Feel free to view any of the tabs for each algorithm; however, do not modify any of the other default settings.

E. When you are done browsing, click OK to save the SVM algorithm setting and close the Advanced Settings window.

F. Finally, click OK in the Edit Classification window to save your changes.

Result: The classification build node is ready to run.
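Under the covers, Oracle Data Miner builds models through the DBMS_DATA_MINING PL/SQL package. Purely as an illustrative sketch (this is not the exact code the tool generates; the model name is invented, and CUST_ID is supplied as the case identifier even though the GUI steps leave it unset), an SVM classification model with a linear kernel could be created like this:

    -- Settings table: choose the SVM algorithm and force a linear kernel.
    CREATE TABLE svm_settings (
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(4000)
    );

    BEGIN
      INSERT INTO svm_settings VALUES
        (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
      INSERT INTO svm_settings VALUES
        (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_linear);

      dbms_data_mining.create_model(
        model_name          => 'PREDICT_GENDER_SVM',   -- hypothetical name
        mining_function     => dbms_data_mining.classification,
        data_table_name     => 'FAST_TABLE',
        case_id_column_name => 'CUST_ID',
        target_column_name  => 'CUST_GENDER',
        settings_table_name => 'SVM_SETTINGS');
    END;
    /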

 

5.

First, select the Class Build node and use the Details tab of the Property Inspector to change the name to Predict Gender.

Note: In the Models tab of the Property Inspector, you can see the current status for each of the selected algorithms, as shown below:

 

6.

Before building the models, click the Test tab in the Property Inspector, and change the split for Build Data Testing to 50, as shown here:

With this setting, Oracle Data Miner will split the source data 50/50 into build data and test data.
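A common way to express this kind of split in plain SQL is a deterministic hash on the case identifier (purely illustrative; Data Miner performs the split internally):

    -- Hash CUST_ID into 100 buckets; buckets 0-49 form one half of the data.
    SELECT *
    FROM   fast_table
    WHERE  ORA_HASH(cust_id, 99) < 50;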

Now, you are ready to build the models.


7.

Right-click the Predict Gender node and select Run from the pop-up menu.

Note: As before, a green gear icon appears on the node border to indicate a server process is running, and the Workflow Jobs tab shows the status of the build.

When the build is complete, the status column displays a green check mark.

In the workflow pane, the icon on the build node border changes from a turning green gear to a green check mark, like this:

 

8.

Once the build process is complete, you can view several pieces of information about the build using the Property Inspector.

For example, select the Predict Gender node in the workflow, and then choose the Models tab in the Property Inspector.

Notes:

  • All four models have been successfully built.
  • The models all have the same target (CUST_GENDER) but use different algorithms.
  • The source data is automatically divided into test data and build data.

 

View the Models

When the execution of Predict Gender finishes, you have built and tested four classification models. You can view and compare the models.

Viewing the models shows how they tie together the aggregated sales data and the customer demographic data as part of the analysis.

Here, you view the relative results of the classification models.

Follow these steps:

1.

Right-click the classification (Predict Gender) node and select Compare Test Results from the menu.

Result: The Predict Gender display tab opens, showing a graphical comparison of the four models, as shown here:

Notes: As stated previously, the sample data set is not prepared for mining per se; however, the Data Mining server still creates the specified models and generates predictions based on the aggregated data.

In terms of the Compare Test Results output, note the following:

  • The histogram colors that you see may be different from those shown in this example.
  • The comparison results include five tabs: Performance, Performance Matrix, ROC, Lift, and Profit.
  • The Performance tab provides numeric and graphical information for each model on Predictive Confidence, Average Accuracy, and Overall Accuracy.
  • The Performance tab seems to indicate that the Naive Bayes (NB) and Decision Tree (DT) algorithms are providing the highest confidence and accuracy results.

 

2.

Select the Performance Matrix tab.

Notes:

  • The Performance Matrix shows that the NB and DT models have a higher Correct Prediction percentage than the other models, at over 70% each.
  • Again, as the data is simplistic and not prepared for data mining, the test results serve only as an illustration of Classification model output.


3.

Compare the details for the NB and DT models.

A. First, select the DT model to view the Target Value Details for this model. Recall that the "Target Value" for each of the models is the CUST_GENDER attribute.

Note: The DT model indicates a 98% correct prediction rate for customers that are male and an 18.3% correct prediction rate for customers that are female.

B. Next, select the NB model.

Note: The NB model indicates exactly the same correct prediction percentages as the DT model.


4.

Dismiss the Predict Gender tab, as shown below, and then close the Star Schema Mining workflow.

 

Note: For information on examining and applying Classification models, see the tutorial Using Oracle Data Miner 11g Release 2. That tutorial provides a comprehensive example of creating, testing, and applying Classification models.
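As a preview of the apply phase, models stored in the database can also be scored directly in SQL with the PREDICTION family of functions. A sketch using the hypothetical model name from the earlier PL/SQL example:

    -- Score each customer and show the model's confidence in the prediction.
    SELECT cust_id,
           PREDICTION(predict_gender_svm USING *)             AS predicted_gender,
           PREDICTION_PROBABILITY(predict_gender_svm USING *) AS probability
    FROM   fast_table;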

Summary

In this lesson, you examined tasks associated with performing star schema mining with the Oracle Data Miner graphical user interface, which is included as an extension to SQL Developer, version 3.0.

In this tutorial, you have learned how to:

  • Create a Data Miner project and workflow
  • Add data sources for tables in another schema
  • Aggregate transactional data on a per-customer basis
  • Join data sources into a single data flow
  • Create an output table for increased performance
  • Build, run, and compare Classification models

Resources

To learn more about Oracle Data Mining, see the Oracle Data Mining page on the Oracle Technology Network (OTN) and the related Oracle By Example tutorials referenced in this lesson.

Curriculum Developer: Brian Pottle

Technical Contributors: Charlie Berger, Mark Kelly, Margaret Taft, Kathy Taylor
