TECHNOLOGY: Business Intelligence
Learn and Predict
By Ron Hardman
Explain your data and forecast results, using predictive analytics.
We predict potential outcomes based on actions every day. If I speed in my car, I have a much higher likelihood of being pulled over by the police than if I observe the speed limit. Choosing a burger and fries every day at lunch rather than a more nutritious meal is likely to lead to poor health over time. Each choice and potential outcome are part of our life's data set. The amount of data is relatively small, and the potential outcomes are well-known. We don't give the decision process much thought.
Many data sets we collect are not as easily analyzed, though. Large data sets such as those related to housing, education, security, taxation, sales, and even sports are too large to analyze by hand. Not all patterns are easily recognized; many contain seemingly unrelated data.
What if it were possible to know the factors that lead to certain outcomes and predict with confidence the impact of decisions? This is what predictive analytics is about.
This article examines three implementations of the predictive analytics feature in Oracle Data Mining and demonstrates them against a sample data set.
The objective of this analysis and demonstration of predictive analytics is to find out which attributes in the sample data set are related to the result and the degree to which they influence the final outcome.
Note that the steps in this article require the Oracle Data Mining option. To verify that this option is installed, run the following SELECT statement:
SELECT parameter, value FROM v$option WHERE parameter = 'Data Mining';
This must return TRUE.
Data mining is about finding patterns in data and reporting these patterns to the user, who can then use that information to predict future events. Oracle Database 10g Release 2 adds a new PL/SQL package called DBMS_PREDICTIVE_ANALYTICS. The Oracle Data Miner application includes a one-click interface to this package. The predictive analytics spreadsheet add-in provides another interface.
The predictive analytics data mining feature analyzes data, finds patterns, and predicts outcomes or results with an associated level of confidence. The confidence is important, because it quantifies the certainty of the predictions. The range of values is 0 to 1. The higher the confidence value, the more certain the prediction. Getting started with the sample data. The example in this article, including the data set, is available for download. The ZIP file is called DogsPlayingPoker.zip. To walk through the steps, you need to create a schema; run the tables.sql script from the ZIP file; and load the data, using SQL*Loader with the following syntax:
sqlldr <<username>>/<<password>> control=loader.ctl
Replace <<username>> and <<password>> with your own, and add the directory path to the control file, if necessary.
When the CANINE_POKER_RESULTS table is successfully loaded, it will contain 5,916 records.
Command-line implementation. The DBMS_PREDICTIVE_ANALYTICS package is new in Oracle Database 10g Release 2. The package definition is shown in Listing 1.
Listing 1: Describing the DBMS_PREDICTIVE_ANALYTICS package
DESC DBMS_PREDICTIVE_ANALYTICS PROCEDURE EXPLAIN Argument Name Type In/Out Default? --------------- ----------- ---- ----- DATA_TABLE_NAME VARCHAR2 IN EXPLAIN_COLUMN_NAME VARCHAR2 IN RESULT_TABLE_NAME VARCHAR2 IN DATA_SCHEMA_NAME VARCHAR2 IN DEFAULT PROCEDURE PREDICT Argument Name Type In/Out Default? --------------- ----------- ---- ----- ACCURACY NUMBER OUT DATA_TABLE_NAME VARCHAR2 IN CASE_ID_COLUMN_NAME VARCHAR2 IN TARGET_COLUMN_NAME VARCHAR2 IN RESULT_TABLE_NAME VARCHAR2 IN DATA_SCHEMA_NAME VARCHAR2 IN DEFAULT
The EXPLAIN procedure in the DBMS_PREDICTIVE_ANALYTICS package analyzes the data set to determine the explanatory value of each attribute in the data set. The higher the explanatory value for an attribute, the stronger the relationship between it and the result.
In this article, the EXPLAIN procedure reads the data from the CANINE_POKER_RESULTS table, creates the EXPLAIN_RESULTS table, and writes the results from the procedure to it. The following anonymous block executes the EXPLAIN procedure:
BEGIN DBMS_PREDICTIVE_ANALYTICS.EXPLAIN ( DATA_TABLE_NAME => 'CANINE_POKER_RESULTS', EXPLAIN_COLUMN_NAME => 'FINISH', RESULT_TABLE_NAME => 'EXPLAIN_RESULTS'); END; /
Listing 2 selects the results from the EXPLAIN_RESULTS table.
Listing 2: Viewing the EXPLAIN Results
SET PAGES 9999 SELECT * FROM EXPLAIN_RESULTS; ATTRIBUTE_NAME EXPLANATORY_VALUE RANK -------------- ----------- ----- FOOD_TYPE .269033829 1 GAME .261283299 2 SHAGGY_EYEBROWS .030682642 3 AGE 0 4 CASE_ID 0 4 BREED 0 4
The three attributes in Listing 2 that have a positive EXPLANATORY_VALUE are the attributes that can be used to predict the success each dog will have in a poker game. The type of food (FOOD_TYPE) consumed by the dog is most closely aligned with the result, followed closely by the type of game played (GAME). Shaggy eyebrows (SHAGGY_EYEBROWS) come in a distant third. CASE_ID is the unique identifier for each record, so there is no correlation between it and the result. The analysis also shows an EXPLANATORY_VALUE of 0 for both AGE and BREED, so there is no pattern that would allow them to be used in the prediction.
The code in Listing 3 predicts the probability of winning a game, using the PREDICT procedure.
Listing 3: Running PREDICT
SET SERVEROUTPUT ON DECLARE v_predict_accuracy NUMBER(30,10); BEGIN DBMS_PREDICTIVE_ANALYTICS.PREDICT ( ACCURACY => v_predict_accuracy, DATA_TABLE_NAME => 'CANINE_POKER_RESULTS', CASE_ID_COLUMN_NAME => 'CASE_ID', TARGET_COLUMN_NAME => 'FINISH', RESULT_TABLE_NAME => 'PREDICT_RESULTS'); DBMS_OUTPUT.PUT_LINE('*** Accuracy ***'); DBMS_OUTPUT.PUT_LINE(v_predict_accuracy); END; / *** Accuracy *** .5666155894
Listing 3 shows that I can expect a better-than-average—57 percent (for the value .5666155894)—ability to predict the outcome based on the three attributes identified with the EXPLAIN procedure. Using predictive analytics increases the accuracy of predicting the winner by 57 percent over the accuracy of using the average of the target value (the FINISH column).
Oracle Data Miner implementation. Note: Oracle Data Miner 10.2's graphical user interface is not included with Oracle Database 10g Release 2. You can download it from oracle.com/technology/products/bi/odm.
You can use the Oracle Data Miner user interface to quickly and easily arrive at the same results as with the command-line implementation. To use Oracle Data Miner to explain the data, perform the following steps:
The target table is created and populated with the EXPLAIN results.
To see the results, press the F5 key on your keyboard to refresh Oracle Data Miner and find the target table you created. The Data tab displays the results, as shown in Figure 1.
Predict is done in similar fashion. Select Data -> Predict , and follow the instructions provided in the wizard. Once finished, press F5 to refresh the interface and select the results table that was generated. The Data tab, shown in Figure 2, shows a record-by-record account of the prediction.
The 0s and 1s in the PREDICTION column are the predicted values for the FINISH column in the original data set. A value of 0 indicates a predicted loss, and a value of 1 means a predicted win.
Spreadsheet add-in implementation. The third implementation uses a spreadsheet add-in.
The spreadsheet add-in comes as a ZIP file that includes complete installation instructions, and the process is quite simple. The add-in itself is the Predictive_Analytics.xla file. To add it to the Microsoft Excel user interface, select Tools -> Add-Ins in Excel and click Browse . Select the Predictive_Analytics.xla file, click OK , select the Oracle Predictive Analytics check box, and click OK .
With the add-in selected, a new menu—OraclePA—becomes available. Figure 3 shows the menu.
The spreadsheet add-in provides the ability to connect to the tables in the database for analysis or to analyze data directly in the spreadsheet.
To explain the data and predict the results by using the spreadsheet add-in, open the DogsPlayingPoker_Results.xls file (included in the DogsPlayingPoker.zip file) in Excel. Select OraclePA , and click Connect to create a new connection to your database.
To use Explain, perform the following steps:
The Explain results are added to a new worksheet in your spreadsheet. Once the data is in the spreadsheet, it is easy to graph the results. Highlight A1 through C4, and select Insert -> Chart .
The steps for using Predict in the spreadsheet are similar to those required to use Predict in Oracle Data Miner.
To use Predict, perform the following steps:
The Predict results are added to a new worksheet in your spreadsheet.
Figure 7, shows the Predict results. The overall predictive confidence is .57. This is the same value returned by the DBMS_PREDICTIVE_ANALYTICS.PREDICT procedure earlier in this article.
Adding New Records
The information gleaned from the Dogs Playing Poker data set shows that breed and age have no bearing on whether the entrant is a winner or a loser. The diet is the most critical, followed closely by the game being played. To illustrate, add the information in Listing 4 as the last two rows of the DogsPlayingPoker_Results.xls spreadsheet.
Code Listing 4: New data for spreadsheet implementation
5908 Standard Poodle 12 CANNED YES 5 CARD DRAW 5909 Yorkshire Terrier 5 TABLE SCRAPS YES BLACK JACK
Rerun Predict, and scroll to the bottom of the new worksheet. Figure 8 shows the result.
So, the first entrant—5908, which eats canned dog food and plays five-card draw—is projected to win (value of 1) with 95 percent (0.953579664) certainty. The second entrant—5909, which eats table scraps and plays blackjack—is projected to lose with 98 percent certainty.
How much better is it to take action when you know the likely outcome, instead of simply hoping for the best and reporting results afterward?
Although the sample data in this article is not based on enterprise business data, it shows that predictive analytics really doesn't have any bounds, as long as the data is accurate and doesn't lack any major attributes. Predictive analytics has even made its way to television. Characters in the television show Numbers use math and statistics to solve crimes. Data mining, with an emphasis on predicting future actions based on patterns, is frequently used. Although the example in this article is completely fictional, predictive analytics is not.
Ron Hardman works with Academy District 20 schools in Colorado Springs, Colorado, and is the founder of Peak Retrieval, LLC, a firm specializing in information retrieval technologies. He is coauthor of Oracle Database 10g PL/SQL Programming and Expert PL/SQL, both from Oracle Press, and is an Oracle ACE.