Data Profiling and Automated Cleansing Using Oracle Warehouse Builder 10g Release 2
by Mark Rittman
Published September 2006
See how Oracle Warehouse Builder 10g Release 2 enables you to graphically profile and then automatically correct the data within your data warehouse.
Most organizations build a data warehouse to provide an integrated, reliable, and consistent “single version of the truth.” Data is usually sourced from a number of systems and has to be extracted, cleansed, and integrated before being made available for users to query.
The quality of the data loaded into the data warehouse is often variable, however, and for that reason, historically the process of profiling your source data has been a time-consuming, manual process that has required either lots of experience with SQL*Plus or the purchase of an expensive third-party tool.
With the release of Oracle Warehouse Builder 10g Release 2 , however, the ability to profile your data is built into the tool and no knowledge of SQL*Plus is required. Furthermore, the data profiles that you build using Oracle Warehouse Builder can be used to generate automatic corrections to your data. In this article, you’ll learn all the nuances of this important new feature.
Data Profiling and Correcting Within Oracle Warehouse Builder
Data within your data warehouse can only be turned into actionable information when you are confident of its reliability. When you bring data into your data warehouse, you need to first understand the structure and the meaning of your data, and then assess the quality and the extent to which you may need to cleanse and transform it. Once you know what actions you need to take, you then need to make the required corrections to the data, and put in place a means to detect and correct any more errors that might occur in future loads. To do this, Oracle Warehouse Builder 10g Release 2 includes three new features that make this process simple and straightforward:
- Graphical Data Profiler—Enables you to understand the structure, semantics, content, anomalies, and outliers present in your data, and derive data rules that will later be used within your data warehouse
- Correction Wizard—Takes your data rules and applies them to your data, automatically generating correction mappings to cleanse and transform your data
- Data Auditor—Takes your data rules and monitors the quality of subsequent data loads
Apart from removing the need for complex SQL*Plus scripts or third-party tools, doing your data profiling and corrections within Oracle Warehouse Builder has several advantages. The metadata that you generate about your data quality is stored alongside the other metadata in your design repository. Also, the mappings used to correct your data are regular Oracle Warehouse Builder mappings and can be monitored and managed with all of your other ETL (extract, transform, and load) processes. Doing your data cleansing and profiling within Oracle Warehouse Builder means that you only have to learn a single tool, and in addition, by integrating this process with your other ETL work, you ensure that data quality and data cleansing becomes an integral part of your data warehouse build process, and not just an afterthought.
If you are new to Oracle Warehouse Builder, or you have experience with earlier versions of the tool, you should note that the packaging and licensing has changed with this latest release. In the future, the “core ETL” features of Oracle Warehouse Builder, which roughly equate to the functions previously available with earlier versions of the tool, will be provided free as part of the database license. Additional functionality that supports deployments in the enterprise is now provided via options to the Enterprise Edition of Oracle Database. To take advantage of the functionality described in this article, you will need to license the Warehouse Builder Data Quality Option for Oracle Database 10g.
To show how you can use Oracle Warehouse Builder 10g Release 2 to profile, correct, and then audit your data, we will consider a situation where you have a requirement to profile and cleanse some product data required for your data warehouse.
Profiling and Cleansing Product Data
In this example, you will use Oracle Warehouse Builder 10g Release 2 to first profile, and then correct data about the products offered by your company. You will use the data profiling feature within Oracle Warehouse Builder to determine the structure and characteristics of your data, and to automatically derive a set of data rules that will be applied to your data. Using these data rules, you will then automatically generate a series of data corrections that will take your source table and create a corrected table from it.
The data that you want to profile is contained in a table called PRODUCT. This is defined as follows:
SQL*Plus: Release 10.2.0.1.0 - Production on Fri Nov 11 15:52:17 2005 Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected to: Oracle Database 10 g Enterprise Edition Release 10.2.0.1.0 - Production With the Partitioning, OLAP and Data Mining options SQL> desc products Name Null? Type ----------------------------------------- -------- ---------------------------- PROD_ID NUMBER PROD_NAME VARCHAR2(50) PACK_COUNT VARCHAR2(10) AVAILABLE_DATE DATE MARKET_SEGMENT VARCHAR2(50) MANUF_COUNTRY VARCHAR2(50) REORDER_YN VARCHAR2(1)
You can install the sample data using the Oracle export file provided in the sample code download. To import the data, create the user that will contain the data (PRODUCT_CATALOG, for example), grant your usual privileges, and then import the data as follows:
Imp product_catalog/ password@connect_string file=product_catalog.dmp fromuser=product_catalog touser=product_catalog
Using Oracle Warehouse Builder, you then view the data in the table.
You notice that there appears to be a few anomalies in the data. For example, one of the market segments is misspelled. In another, a manufacturing country is incorrectly listed as “England” rather than “UK.” At this point, you can now decide to use Oracle Warehouse Builder 10g Release 2 to profile and correct your data.
Create a Data Profile
Once you have created a project using Oracle Warehouse Builder, and imported the metadata for the PRODUCT table, you should then right-click on the Data Profiles node in the Project Explorer and then select New… to start the Data Profiler wizard. Once the wizard has been used to select the PRODUCT table for profiling, and the profiling job submitted for asynchronous processing, you would be presented with the Data Profile Editor, as shown below. Note that if this is the first data profile that you have created, Oracle Warehouse Builder will prompt you to create a profile module to hold the results, and you can specify the database user and tablespace if you want to store them away from the data you will be profiling.
The Data Profile Editor has a number of panels. These show objects that have been profiled and the results of the profiling.
On the left side of the Data Profile Editor are panels that show the objects—tables, views, materialized views, external tables, dimensions, and facts—that have been profiled during this exercise, and details of any corrected modules that have been created. Below the list of profiled objects and corrected modules is a listing of the properties associated with the data profile. Using this list of properties, you can fine-tune the parameters for your data profile; enable or disable certain data profiling components; and enable data rule profiling for the selected table. At the bottom of this set of panels is the Monitor Panel. This shows the progress of any profiling jobs that you have submitted. Because data profiling can sometimes take awhile to complete, you can submit jobs to be completed in the background while you perform other Oracle Warehouse Builder tasks; Oracle Warehouse Builder will alert you when the job is complete.
On the right side of the Data Profile Editor is a set of panels that show you the results of the data profile. The top panel contains the Profile Results Canvas with a number of tabs containing summaries of the profiling results.
The Data Type tab details the documented and dominant data types for each column in the object and their minimum and maximum lengths.
The Unique Key tab shows any columns where either a unique or primary key constraint has been detected for a column. It also shows columns where the number of unique values is sufficiently high enough to suggest that a unique key could be defined if the nonunique rows were removed or corrected. Note the Six Sigma score for each column. This is a figure between 1 and 7 that indicates the number of “defects per thousands”—rows that would fail a unique constraint—in the object.
The Profile Object tab allows you to query the records within the profiled object, and can be used instead of SQL*Plus to view data within the object, optionally using the Where Clause, Execute Query, and Get More buttons to refine your query.
The Domain tab shows for each column in the profiled object the suggested domain, and the degree to which the data is compliant with this domain. A domain is defined as the set of allowable values for the column, and Oracle Warehouse Builder will include within the domain any column value that occurs more than once.
Once you have a domain defined for a particular column, you can use it to derive a data rule that will be applied to your data. Oracle Warehouse Builder will then implement this data rule as a check constraint on the column, enforcing your data rule at the database level to ensure that it is never broken. For those columns where values are present that would otherwise break the data rule, you can use Oracle Warehouse Builder to automatically correct your data.
Creating a Data Correction
While examining the data profile for your products table, you will notice that Oracle Warehouse Builder has suggested a domain for the MANUF_COUNTRY column containing Canada, USA, and UK. These countries were included in the suggested domain because they occurred more than once in your dataset. England and Mexico were excluded because they only occurred once.
From speaking to your business users, you know that in fact the domain for this column should be Not Known, Canada, USA, UK, and Mexico, even though Mexico only occurs once; whereas any occurrences of England should in fact be changed to UK. What you would like to do therefore is use Oracle Warehouse Builder to correct this incoming source data, and place a constraint on the data warehouse’s copy of the table to enforce this domain.
The first step in this process is to use the Derive Data Rule button below the suggested domain results to invoke a wizard. This wizard takes you through the steps of viewing the suggested domain, making any amendments, and then creating the data rule. In your instance, you first move the “Mexico” value over to the Selected values list, and then type in “Not Known” at the bottom of the list.
Now that you have created the data rule, the second step is to create a correction. When you create a correction, Oracle Warehouse Builder will first ask you to either select an existing target module, or create a new one to contain the corrected version of the object. This will by default have the same definition as the original object, except with the new data rules (implemented as check constraints or unique keys) additionally present.
Once the target module has been specified, the next step in the process is to select which of the data rules will be used to create the correction (see below). In your instance, you will use the data rule defined previously and two others you have defined previously—one that excludes all records that have a REORDER_YN value of “N” and another that ensures that all values in the MARKET_SEGMENT table have a value of either “Economy” or “Executive.”
Now that the rules on which the data corrections are to be based have been selected, you should specify the action and the cleanse strategy for the corrections.
When specifying the actions, you can either
- Ignore—Do not take any action
- Report—Log, in the ERR$$$ columns in the table, the instances when the data rule would have been violated
- Cleanse—Apply a data cleansing strategy to correct data that violates the data rule
For those columns where you have chosen to cleanse the data, you can select one of four cleansing strategies:
- Remove—Excludes from the corrected object those rows that fail this data rule. We will use this rule for rows that break the REORDER_YN = “Y” rule.
- Similarity Match—Uses the built-in Match-Merge functionality in Oracle Warehouse Builder to change the erroneous value to the one that is most similar to it within the column domain.
- Soundex Match—Uses the SOUNDEX function to change erroneous values to the one within the domain that is the closest phonetic match to it; you will use this strategy to cleanse the MARKET_SEGMENT column.
- Custom—Used where the logic to cleanse the column is more complex, involves other PL/SQL functions, uses conditional logic, or otherwise requires the creation of a custom PL/SQL function. You will use this strategy to cleanse the MANUF_COUNTRY column.
When you complete the selection of your cleansing actions and strategy, the wizard will then take your specification and create a mapping within the target module to implement the corrections.
When you then return to the Design Center, you will notice the new target module that will contain your corrected data. Within the module, you will notice the table definitions that hold your corrected data, and the mapping and transformations that implement the data correction.
If you examine the mapping that implements the correction, you will note that the mapping first reads data from the original PRODUCT table, and then attempts to load it into a staging copy of the table with the data rule applied to it.
Those rows that pass the data rule are then copied into the corrected table. Those that fail any of the rules are then cleansed via “pluggable mapping”: a feature new to Oracle Warehouse Builder 10g Release 2 that allows you to take a series of mapping steps and “plug” them into another mapping. You can then examine the contents of the pluggable mapping to see how it has been implemented—although currently the only way in which you can amend this correction is to delete the correction and generate it again.
Going back to the Project Explorer, you will notice the two transformations that the correction wizard has created for you: (1) the CUS_MANUF_COUNTRY function, a “shell” PL/SQL function that will hold your custom cleansing logic, and (2) the SIM_MARKET_SEGMENT function that the wizard will have automatically implemented.
To add the program logic that will implement your custom cleansing action, double-click on the CUS_MANUF_COUNTRY function and add the required logic:
Finally, to test the data correction you should deploy the correction objects, transformations, and mappings via the Control Center, and then run the correction mapping.
Now, you check the contents of the corrected PRODUCT table to verify that the results are as you expected.
You note that all MARKET_SEGMENTS are now either “Executive” or “Economy,” the one row that had REORDER_YN equal to “N” has been removed, and the instance of “England” as a MANUF_COUNTRY has been altered to “UK.”
Now that you are satisfied that the corrections you have created using Oracle Warehouse Builder are working correctly, you finally set up a Data Auditor within Oracle Warehouse Builder to monitor the quality of further incoming data.
A Data Auditor is a process that can either be run on an ad hoc basis, or can be included along with mappings, subprocesses, and transformations in a process flow, and then scheduled for execution on a predetermined basis. Data Auditors use the data rules that you either derive or manually define, and can provide statistical reports on how compliant your data is, which can then be stored in audit and error logging tables. Data Auditors are also programmable, such that you can specify that they contact you with a notification after scoring below a certain threshold, and then with your permission run a cleansing mapping to clean up the data. Once this cleansing mapping has run, you can program the auditor to only continue with the rest of the ETL process if the audit score is now above a certain level or Six Sigma score, avoiding the situation where dirty data is loaded into the data warehouse and effort is needed to remove it later.
Oracle Warehouse Builder represents a step change in the process of profiling, correcting, and then monitoring the quality of data within your data warehouse. The graphical data profiling feature within Oracle Warehouse Builder provides an easy to use and easy to understand facility. While this feature allows you to view the structure, meaning, and quality of your data, the correction wizard provides a means of taking your profile results and using them to automatically generate mappings to correct your data. Once you have used the Data Profiler to assess the quality of your data, you can then use the data rules generated to create Data Auditors, which will allow you to periodically monitor the quality of new data coming into your data warehouse.
Mark Rittman is an Oracle Certified Professional Database Administratorand is director of consulting at SolStonePlus, an Oracle partner based in the U.K. that specializes in business intelligence and data warehousing. Mark is the Oracle Magazine Editor’s Choice ACE of the Year 2005, is chair of the U.K. Oracle User Group Business Intelligence & Reporting Tools SIG, and runs a blog at www.rittman.net.