by Brendan Tierney
Oracle Database 12.2 introduces a number of new analytical features. One of these is the ability to create partitioned machine learning models.
Oracle Database comes with a suite of machine learning algorithms that are built into the core of the database and are available in the Enterprise Edition. Having the machine learning algorithms built into the database means you do not have to extract and move your data to another analytical environment. The data is secure and the scalability and performance features of the database are utilized, allowing machine learning to be performed on data volumes from a few hundred records right up to billions of records. Believe me, this is possible on a mid-spec server.
A typical problem that many data scientists encounter is having to create many tens or hundreds of predictive models based on the values in a particular attribute. For example, if a dataset contains data from many countries, the data scientist will have to manually partition the data into separate datasets, one for each country, and then create a separate predictive model for each country. You can imagine the challenges with managing this process and then the even greater challenge of having to update all of these models.
With the 12.2 version of Oracle Database, all of this work has been automated for you. With a few clicks of a button or a few SQL queries, you can create and use partitioned machine learning models that automatically partition the data, build one model per partition, associate the resulting set of models as a single high-level model, and when scoring, select the appropriate component model that applies to the partition column data values.
This article will show you how you can create a partitioned machine learning model using the Oracle Data Miner tool and how you can use this model to score or label new data using a SQL query.
Note: Oracle SQL Developer 4.2 or higher is required to use the Oracle Database 12.2 new features.
The example scenario described is illustrated in Figure 1. In this figure, the chart represents the percentage of customers in each country. On closer examination, the United States represents 90% of the customers in the dataset. If a predictive model were built using all the customers, it would be biased toward customers from the United States. Consequently, the predictive model would not be very accurate for the other countries.
To overcome this problem, the data scientist would partition or separate the dataset into subsets containing the data for each country. The data scientist would then have to create, test, and validate the models to ensure that they are appropriate and performant for the data subset.
Figure 1: Plot of number of customers per country
In the scenario illustrated in Figure 1, there are 19 different countries. This would mean the data scientist would have to create 19 different data subsets and 19 different machine learning models.
Yes, that is a lot of work and a lot of predictive model maintenance to manage two, three, or four times a year.
In Oracle Database 12.2, all or most of this can be automatically managed by the in-database machine learning algorithms.
As illustrated in the previous section, the dataset contains data from a number of countries. Unfortunately, some of these countries have a tiny number of case records and, as such, these cannot really be used as input to the machine learning algorithms. Therefore, some care is needed to understand the data distributions in the dataset and to perform some pruning if needed. The following example will walk through an initial scenario. This will be expanded further later in the article to show how to handle such cases with limited data.
For the initial scenario, the input dataset for the machine learning algorithms will be trimmed to include only the top three most-popular countries (United States of America, Argentina, and Italy). A partitioned machine learning model will be generated based on these three countries.
To generate the models, the Oracle Data Miner tool will be used. This comes as part of Oracle SQL Developer, and the new Oracle Database 12.2 features will appear only in Oracle SQL Developer 4.2 (or higher).
The first step is to define a Data Source node and then to assign the initial input dataset to this node. The example shown in Figure 2 uses MINING_DATA_BUILD_V that is part of the sample datasets setup and is configured when Oracle Data Miner is configured.
The second step is to create a Filter Rows node and connect it to the Data Source node. Edit this node and add the filter on the COUNTRY_NAME attribute.
Figure 2: Adding a Filter Rows node to select required countries
The next step involves creating the Classification node and joining the Filter Rows node to the Classification node. The first thing that needs to be defined in the Classification node is the Target attribute and the Case ID attribute, as shown in Figure 3. The red circle in Figure 3 shows the Partition tab. This will appear only if Oracle SQL Developer 4.2 is being used with Oracle Database 12.2.
Figure 3: Initial setup of the Classification node
The next step involves defining the partition for the algorithm. A partition can consist of one or more attributes in the dataset. In our example scenario, COUNTRY_NAME is the partitioning attribute, as shown in Figure 4.
Figure 4: Defining the partitioning attribute for the machine learning algorithm
The Classification node can now be run to generate the partitioned machine learning models.
Note: You might get some informational messages regarding the Generalized Linear Model (GLM). This is OK and the model will be generated. For this example, you can ignore the messages.
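For those working outside the tool, an equivalent partitioned model can be built directly in SQL with the DBMS_DATA_MINING package. The following is a minimal sketch: the settings table and model name (DEMO_MODEL_SETTINGS, DEMO_PART_CLASS_MODEL) are illustrative, and AFFINITY_CARD is assumed as the target attribute, as in the Oracle Data Miner sample schema.

```sql
-- Settings table layout follows the DBMS_DATA_MINING documentation.
CREATE TABLE demo_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

-- Choose the GLM algorithm and register COUNTRY_NAME as the
-- partitioning column (new ODMS_PARTITION_COLUMNS setting in 12.2).
INSERT INTO demo_model_settings VALUES ('ALGO_NAME', 'ALGO_GENERALIZED_LINEAR_MODEL');
INSERT INTO demo_model_settings VALUES ('ODMS_PARTITION_COLUMNS', 'COUNTRY_NAME');
COMMIT;

-- Build the partitioned classification model; one component model
-- is generated per COUNTRY_NAME value in the input dataset.
BEGIN
  dbms_data_mining.create_model(
    model_name          => 'DEMO_PART_CLASS_MODEL',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'MINING_DATA_BUILD_V',
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'AFFINITY_CARD',
    settings_table_name => 'DEMO_MODEL_SETTINGS');
END;
/
```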
When the Oracle Data Miner tool is used to create a machine learning model, all the associated objects for this are contained in Oracle Database. A number of data dictionary views exist that give the details of the component models. The first data dictionary view to use is the ALL_MINING_MODELS view. This will list the machine learning models that exist in the database. In Oracle Database 12.2, there is a new attribute in this view called PARTITIONED, as shown in Figure 5. This attribute will contain "YES" if the model is a partitioned model or "NO" if it isn't.
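A query along the following lines lists the models and shows the new PARTITIONED attribute (the ORDER BY is purely cosmetic):

```sql
-- List in-database models; PARTITIONED is 'YES' or 'NO' (12.2+).
SELECT model_name,
       mining_function,
       algorithm,
       partitioned
FROM   all_mining_models
ORDER  BY model_name;
```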
Figure 5: Listing the partitioned models in the ALL_MINING_MODELS view
The ALL_MINING_MODEL_SETTINGS view displays the parameters that were set/configured for creating the model. There is a new parameter setting to use for defining the partitioning attribute(s). This parameter is called ODMS_PARTITION_COLUMNS. This can be seen in Figure 6 and will occur for each of the created partitioned models. This parameter setting was defined in the Oracle Data Miner tool, as shown in Figure 4.
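This setting can be inspected with a simple query against the data dictionary, for example:

```sql
-- Show which models are partitioned and on which column(s).
SELECT model_name,
       setting_name,
       setting_value
FROM   all_mining_model_settings
WHERE  setting_name = 'ODMS_PARTITION_COLUMNS';
```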
Figure 6: Parameter settings for machine learning models indicating the partitioned column setting
The partitioned machine learning model can be used in the same way as any other machine learning model developed using the Oracle Data Miner GUI tool. There are a number of SQL functions available that can be used to apply the model to the data. In the following example, two of these SQL functions are used. The first function, PREDICTION, returns the predicted value from the model. The second function, PREDICTION_PROBABILITY, returns a probability score for the prediction. This is a measure of how strong a prediction the model has made. The higher the value (that is, the closer it is to one), the more likely the prediction is to be correct.
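A scoring query along the following lines applies both functions; the model name DEMO_PART_CLASS_MODEL is a placeholder for whatever name the Classification node generated:

```sql
-- Score each customer. The partitioned model automatically routes
-- each row to the component model matching its COUNTRY_NAME value.
SELECT cust_id,
       country_name,
       PREDICTION(demo_part_class_model USING *)             AS predicted_value,
       PREDICTION_PROBABILITY(demo_part_class_model USING *) AS prediction_prob
FROM   mining_data_apply_v;
```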
Looking at Figure 7, you can see that not every record has a prediction. This is because the partitioned model was built on only a subset of the data covering three countries, and therefore consists of only three component models. There was insufficient data for the remaining countries for the partitioned machine learning algorithm to work with. The next section will look at how to address this issue.
Figure 7: Results produced by the partitioned machine learning model
In the previous example, the dataset included just three countries. To make sure all other countries are included and a prediction can be made for all your customers, you need to do some additional data wrangling/preparation.
Instead of excluding the other countries, they can be recoded into some default country code. This can be done using SQL code using a CASE statement or, alternatively, the Transform node in the Oracle Data Miner tool can be used. Figure 8 illustrates creating a new attribute that contains the transformed data.
Figure 8: Creating a new attribute using the Transform node
After this new COUNTRY_NAME2 attribute has been defined, the previous COUNTRY_NAME attribute needs to be excluded from the output of the Transform node.
Then, create a new Classification node and follow the previous steps to create the partitioned machine learning model. The only thing that is different is that the new COUNTRY_NAME2 attribute will be fed into the machine learning algorithms instead of the original COUNTRY_NAME attribute.
After the models have been created, the next step is to apply the same transformation to the dataset to be scored. To do this, create a view over the original scoring dataset and then apply the same data transformation, as shown below.
create or replace view mining_data_apply_partition_v as
select CUST_ID, CUST_GENDER, AGE, CUST_MARITAL_STATUS, COUNTRY_NAME,
       CUST_INCOME_LEVEL, EDUCATION, OCCUPATION, HOUSEHOLD_SIZE,
       YRS_RESIDENCE, AFFINITY_CARD, BULK_PACK_DISKETTES,
       FLAT_PANEL_MONITOR, HOME_THEATER_PACKAGE, BOOKKEEPING_APPLICATION,
       PRINTER_SUPPLIES, Y_BOX_GAMES, OS_DOC_SET_KANJI,
       case country_name
         when 'United States of America' then 'United States of America'
         when 'Argentina'                then 'Argentina'
         when 'Italy'                    then 'Italy'
         else 'Other'
       end country_name2
from mining_data_apply_v;
Using a similar query to what was used earlier, the new partitioned machine learning model can be applied to the data to label it. The only difference is that the machine learning model names are different. The asterisk (*) in the two functions, shown in Figure 9, takes all the attributes from the new dataset in MINING_DATA_APPLY_PARTITION_V and feeds these into the models. But the models will use only a subset of these attributes and will ignore the rest. This way, you can include COUNTRY_NAME and COUNTRY_NAME2 in the input dataset.
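The query takes roughly the following shape; the model name DEMO_PART_CLASS_MODEL2 is a placeholder for the second model, built on COUNTRY_NAME2:

```sql
-- Score all customers, including those recoded to 'Other'.
SELECT cust_id,
       country_name2,
       PREDICTION(demo_part_class_model2 USING *)             AS predicted_value,
       PREDICTION_PROBABILITY(demo_part_class_model2 USING *) AS prediction_prob
FROM   mining_data_apply_partition_v;
```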
Figure 9 shows the results with a prediction and probability score for the other countries.
Figure 9: Results from second partitioned machine learning model
In-database partitioned machine learning models are a new feature of Oracle Database 12.2. This new feature allows a data scientist to be more productive by not having to create multiple data subsets, create multiple machine learning models based on each of these and, finally, ensure the right model is used for scoring. The data scientist can now perform all of this work with a few lines of SQL code and/or a few clicks in the Oracle Data Miner tool. In addition to this increased productivity, there are also opportunities to parallelize the model building, allowing this work to easily scale to more than 500 million records and to take no longer than a few seconds to run. Yes, this is realistic—I have performed this on datasets consisting of more than 3.5 billion records all inside Oracle Database 12.2.
Oracle ACE Director Brendan Tierney is an independent consultant (Oralytics) and lectures on data science, databases, and big data at the Dublin Institute of Technology/Dublin Technological University. He has 24+ years of experience working in the areas of data mining, data science, big data, and data warehousing. As a recognized data science and big data expert, Tierney has worked on projects in Ireland, the UK, Belgium, Holland, Norway, Spain, Canada, and the US. He is active in the UK Oracle User Group (UKOUG) community and one of the user group leaders in Ireland. Tierney has also been editor of the UKOUG Oracle Scene magazine, is a regular speaker at conferences around the world, and writes for several publications. In addition, he has published four books, three with Oracle Press/McGraw-Hill (Predictive Analytics Using Oracle Data Miner, Oracle R Enterprise: Harnessing the Power of R in Oracle Database, and Real World SQL and PL/SQL: Advice from the Experts) and one with MIT Press (Essentials of Data Science).