The goal of this tutorial is to profile a data source, derive a data rule, and create a correction mapping to correct the data based on a similarity match.
Time to complete: approximately 1 hour.
This tutorial covers the following topics:
- Overview
- Prerequisites
- Creating a Data Profile
- Deriving a Data Rule
- Using Correction Mappings to Cleanse Data
- Summary
Data profiling is a prominent feature of Warehouse Builder that enables you to investigate data and detect anomalies. You can then create business rules and generate a corrective ETL process based on them.
Data profiling can be defined as a process that automates the identification of problematic data and metadata, while enabling the correction of inconsistencies, redundancies, and inaccuracies in the data.
You can profile your source or target data to discover structure, relationships, and data rules. Data profiling provides statistical information about compliant data and outliers, including Six Sigma metrics. You can drill down into the anomalies and derive data rules manually or automatically, and then use these data rules to monitor the quality of data in your data warehouse.
You can perform attribute analysis, referential analysis, or functional dependency analysis, or profile data against custom-designed rules.
The steps are displayed in the following diagram.

In step 1, the source data is profiled and the information it holds is detected. In step 2, you derive data rules from the profiling results, and in step 3 you use those derived rules to derive corrections. These corrections are regular Warehouse Builder mappings, which you deploy in step 4. In step 5, you run the correction mappings to cleanse the data in the target.
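The domain analysis at the heart of step 1 can be illustrated outside Warehouse Builder. The following Python sketch is only a rough illustration of the idea, not Warehouse Builder's actual algorithm; the sample values, counts, and the 95 percent coverage threshold are invented, chosen to mirror the JOB_ID findings later in this tutorial:

```python
from collections import Counter

def profile_domain(values, threshold=0.95):
    """Toy domain analysis: if a small set of frequent values covers at
    least `threshold` of the rows, treat that set as the column's domain
    and everything else as an outlier."""
    counts = Counter(values)
    total = len(values)
    domain, covered = [], 0
    for value, count in counts.most_common():
        if covered / total >= threshold:
            break
        domain.append(value)
        covered += count
    outliers = [v for v in values if v not in domain]
    return domain, outliers

# 98 of 100 rows use one of four job codes; two rows are misspelled.
job_ids = (["SA_REP"] * 40 + ["ST_CLERK"] * 30 + ["SH_CLERK"] * 20 +
           ["PU_CLERK"] * 8 + ["SA_REPP", "PU_CLIRK"])
domain, outliers = profile_domain(job_ids)
print(domain)    # the frequent values form the candidate domain
print(outliers)  # ['SA_REPP', 'PU_CLIRK']
```

The two outliers are exactly the kind of anomaly for which you derive a data rule and a correction in the later sections.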
Before starting this tutorial, you should:
1. Complete the tutorial Setting Up the Oracle Warehouse Builder 11g Release 2 Tutorial Environment, which provides the setup files to configure an environment for this tutorial.
2. Connect to the OWB Design Center with the username/password dp_owner/dp_owner. Note: This tutorial uses the HANDSON project, which you can access after you log in as dp_owner in the OWB Design Center.
Before you begin profiling data, you need to define the source to be profiled. The DQ_SRC source module has already been created for you. Now you are ready to create the data profile by performing the following tasks:
- Creating a Data Profile Using a Wizard
- Enabling the Data Profile Properties
- Profiling the Source Data
Creating a Data Profile Using a Wizard

1. Assuming you are logged in as the dp_owner user, in the Projects Navigator panel, expand the HANDSON project, right-click Data Profiles, and select New Data Profile.
2. The Create Data Profile wizard launches. On the Welcome page, click Next.
3. On the Name and Description page, enter DP1 as the name of the data profile and optionally enter a description. Click Next.
4. On the Select Objects page, hold [Ctrl] and select DQ_SRC from both the Tables and the Views lists, then click > to move all the objects to the Selected list. Click Next.
5. On the Summary page, click Finish. You have created a metadata container to store all the information on the data profile.
Enabling the Data Profile Properties

1. The Data Profile Editor appears, and a note appears in a dialog box. Click OK.
2. Locate the Property Inspector on the left side and select the Enable Common Format Discovery and Enable Pattern Discovery check boxes. With Common Format Discovery, Warehouse Builder checks your data against a list of known format rules to determine whether any of your fields comply; this is useful for finding items such as US Social Security Numbers. When Warehouse Builder detects such an attribute, it marks it as a common format (such as SSN). With Pattern Discovery, Warehouse Builder detects patterns within your data; these patterns require you to interpret the results and then define data rules that act on them. For example, pattern recognition can find the most common pattern in columns such as product codes or SKUs (stock keeping units). Once you have discovered a pattern, you can interpret it and act on it by defining a data rule.
3. From the Profile menu in the Data Profile Editor, select Save All, then click Yes.
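Pattern discovery of the kind described above is often implemented by reducing each value to a character-class signature and counting how often each signature occurs. The sketch below is an illustration of that idea, not Warehouse Builder's implementation; the SKU values and the 'A'/'9' notation are invented for the example:

```python
import re
from collections import Counter

def to_pattern(value):
    """Reduce a value to a character-class pattern:
    letters become 'A', digits become '9', other characters stay as-is."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))

# Hypothetical SKU column: four values share a pattern, one deviates.
skus = ["AB-1234", "CD-5678", "EF-9012", "GH-345", "IJ-3456"]
patterns = Counter(to_pattern(s) for s in skus)
print(patterns.most_common(1))  # [('AA-9999', 4)] -- the dominant pattern
```

Once the dominant pattern is known, a data rule can flag rows (such as "GH-345" above) that do not conform to it.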
Profiling the Source Data

1. In the Data Profile Editor, from the Profile menu, select Profile.
2. If this is the first profile run, you may be prompted for the system password. Enter your system password and click Show Details. Note: Warehouse Builder creates a database schema and a location using default names. Enter the details for the new profiling schema and click OK.
3. Profiling takes approximately 2-5 minutes, depending on the machine.
4. When the job starts, you receive a message. Click OK.
5. You can see the status of the profiling operation in the Monitor panel (in the lower right corner of the Data Profile Editor). The operation may take 1-3 minutes. When it completes (indicated by a check mark), you can synchronize the results. This option is useful if you have previously profiled your data in the same data profile, because it gives you control over when the new profiling results become visible in the Data Profile Editor. Click Yes to retrieve the results. The profiling is now complete and the results are shown.
Data profiling does not stop at analyzing the profiled data. You can derive data rules from the profiling results, either manually or automatically.
In the following steps, you derive a data rule for the JOB_ID domain values:
1. Ensure the Profile Objects tab is selected. Under DQ_SRC, select the EMPLOYEES table.
2. In the Profile Results Canvas panel, click the Domain tab.
3. Scroll down to the JOB_ID column and click the hyperlink in the Found Domain column.
4. In the Data Drill panel, select the first row. Observe that 98 percent of the records comply with the domain values (SH_CLERK, ST_CLERK, SA_REP, and PU_CLERK), and that 2 records do not. The first row indicates that employee_id 1001 is not compliant.
5. Click the second row. It indicates that employee_id 1000 is not compliant.
6. In the top panel, make sure the JOB_ID row is still highlighted, then click the Derive Data Rule button.
7. The Derive Data Rule wizard appears. Click Next on the Welcome page.
8. Enter DM_JOB_ID as the data rule name and click Next.
9. On the Define Rule page, the list of values deemed compliant with the domain is populated automatically (SH_CLERK, ST_CLERK, and so on). Ensure that only the correct values are in the domain list (all but the two misspelled JOB_ID values), then click Next.
10. Click Finish to create the derived data rule.
11. Notice that the data rule appears as applied to the EMPLOYEES table in the Data Rule panel at the bottom of the screen, and that the derivable symbol in the Found Domains cell changes to the derived symbol. You can view the derived data rule in the Data Rule panel.
12. Switch to the Design Center window (you can leave the Data Profile Editor window open). In the Projects Navigator panel, expand Data Rules > DERIVED_DATA_RULES, right-click DM_JOB_ID, and select Open. The Edit Data Rule window opens.
13. In the Edit Data Rule window, click Define Rule. The domain values are listed. Click OK to close the window.
You can use a data rule to create a correction: Warehouse Builder automatically generates a mapping that enforces the data rule and corrects the records that do not comply with it. When executed, this mapping moves the data from the profiled source to the corrected target.
- Creating a Correction Mapping
- Inspecting the Correction Mapping
- Deploying the Correction Mapping
- Executing the Correction Mapping to Cleanse the Data
- Viewing the Corrected Data
Creating a Correction Mapping

1. Switch back to the Data Profile Editor. You are interested only in the rule you just created, so deselect all the data rules except DM_JOB_ID in the Data Rule panel.
2. Select EMPLOYEES in the list of Profile Objects. From the Profile menu, select Create Correction.
3. The Create Correction wizard starts. Click Next on the Welcome page.
4. Choose DP_TGT as the target module for the corrections and click Next.
5. On the Select Objects page, the EMPLOYEES table is already selected. Click Next.
6. A domain rule can be implemented on the new target table as a check constraint. Click Next.
7. Warehouse Builder generates the appropriate schema metadata. Change the name of the new table to CORR_EMPLOYEES, then click the Constraints tab.
8. Rename the check constraint to DM_JOB_ID_CK1, then scroll to the right.
9. Notice that this constraint is derived from the data rule. Click Next.
10. On the Choose Data Correction Actions page, click the ellipsis in the Cleanse Strategy column. In the Cleanse Strategy Selection dialog box, select Similarity Match and click OK. Click Next.
11. A progress dialog appears.
12. After the progress dialog disappears, click Finish on the Summary page to exit the Create Correction wizard.
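The check constraint chosen in step 6 simply rejects any row whose JOB_ID falls outside the allowed domain. As a minimal sketch of what DM_JOB_ID_CK1 enforces (the domain values and employee IDs come from this tutorial; the Python validation is a simplification of the generated database constraint):

```python
# Allowed JOB_ID values, as captured by the DM_JOB_ID data rule.
JOB_ID_DOMAIN = {"SH_CLERK", "ST_CLERK", "SA_REP", "PU_CLERK"}

def check_dm_job_id(row):
    """Mimic the DM_JOB_ID_CK1 check constraint: a row passes only if
    its JOB_ID is one of the allowed domain values."""
    return row["JOB_ID"] in JOB_ID_DOMAIN

rows = [
    {"EMPLOYEE_ID": 1000, "JOB_ID": "PU_CLIRK"},  # misspelled, fails
    {"EMPLOYEE_ID": 1001, "JOB_ID": "SA_REPP"},   # misspelled, fails
    {"EMPLOYEE_ID": 1002, "JOB_ID": "SA_REP"},    # compliant
]
failing = [r["EMPLOYEE_ID"] for r in rows if not check_dm_job_id(r)]
print(failing)  # [1000, 1001]
```

Rows that fail this test are the ones routed through the cleanse strategy (Similarity Match) rather than loaded as-is.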
Inspecting the Correction Mapping
You can review the correction mapping you just created. Perform the following steps:
1. Click the Corrected Modules tab in the top left corner of the Data Profile Editor window.
2. All the objects are generated based on the choices you made in the wizard. Expand Functions and double-click SIM_JOB_ID. The function opens in a new panel. Click the Implementation tab.
3. Notice the call to the similarity algorithm, which ensures that the cleansing uses the similarity match. In your results, you should see: score:=owbsys_owb_match_merge.similarity(s(1), JOB_ID); Close the function panel and switch back to the Data Profile Editor.
4. In the Corrected Modules tab, expand Mappings and double-click M_EMPLOYEES.
5. To see all the tables, click Fit in Window on the toolbar.
6. Select the ATTR_VALUE_1 table and click Expand Child Graph on the toolbar.
7. ATTR_VALUE_1 is the operator that does the actual cleansing; it is a pluggable mapping. When you have finished reviewing, close the mapping panel and the Data Profile Editor.
8. In the Design Center window, from the File menu, select Save All, then click Yes to save your work.
9. In the Design Center, expand HANDSON > Databases > Oracle > DP_TGT > Mappings. Notice the same M_EMPLOYEES mapping: all the metadata seen in the Data Profile Editor is now stored in this target module.
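The SIM_JOB_ID function computes a similarity score between a bad value and each domain value, then substitutes the best match. The exact owb_match_merge.similarity algorithm is not shown in this tutorial, so the sketch below uses Python's difflib ratio as a stand-in; the 0.8 threshold is an invented illustration, not Warehouse Builder's default:

```python
from difflib import SequenceMatcher

# Allowed JOB_ID values from the DM_JOB_ID data rule.
DOMAIN = ["SH_CLERK", "ST_CLERK", "SA_REP", "PU_CLERK"]

def cleanse_job_id(value, threshold=0.8):
    """Similarity-match cleansing: keep compliant values as-is; otherwise
    replace the value with the closest domain value, provided the
    similarity score clears the threshold."""
    if value in DOMAIN:
        return value
    best = max(DOMAIN, key=lambda d: SequenceMatcher(None, value, d).ratio())
    score = SequenceMatcher(None, value, best).ratio()
    return best if score >= threshold else value

print(cleanse_job_id("SA_REPP"))   # -> 'SA_REP'
print(cleanse_job_id("PU_CLIRK"))  # -> 'PU_CLERK'
```

This is the behavior you verify at the end of the tutorial, when the misspelled values for employees 1000 and 1001 appear corrected in CORR_EMPLOYEES.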
Deploying the Correction Mapping
To deploy the Correction Mapping, perform the following steps:
1. Select Tools > Control Center Manager.
2. Expand DP_TGT_LOCATION and select DP_TGT. Click Default Actions. This sets the objects for deployment; note that the Deploy Action changes from None to Create.
3. Select File > Deploy > To Control Center.
4. Note: You may receive a message indicating that the location was not registered. If you receive this message, make sure the host name is set to localhost instead of the machine name, click Test Connection, then click OK. You can monitor the deployment job in the Control Center Jobs panel.
5. Once the job has finished, note that all the objects are deployed successfully. You can now run the mapping to cleanse the data into the target table.
Executing the Correction Mapping to Cleanse the Data
To cleanse the data, you execute the correction mapping, which loads the corrected records into the target table. Perform the following steps:

1. Under DP_TGT, expand Mappings, right-click M_EMPLOYEES, and select Start.
2. In the Control Center Jobs panel, monitor the job execution; note that the Execution tab is active. When the job finishes, double-click it.
3. In the Job Details dialog box, examine the rows selected and rows inserted details. From the File menu, select Close to close the dialog box, then close the Control Center window.
Viewing the Corrected Data

To view the corrected data, perform the following steps:
1. In the Projects Navigator panel, expand DP_TGT > Tables, right-click CORR_EMPLOYEES, and select Data.
2. In the Data panel, click the Where Clause... button. The Where Clause dialog box opens.
3. Enter employee_id=1001 and click OK.
4. Notice that the JOB_ID value for EMPLOYEE_ID 1001 has been corrected to SA_REP instead of SA_REPP.
5. Click the Where Clause... button again, enter employee_id=1000, and click OK.
6. Notice that the JOB_ID value for EMPLOYEE_ID 1000 has been corrected to PU_CLERK instead of PU_CLIRK. Close the Data panel. From the File menu, select Exit to exit the Design Center. In the Exit Confirmation dialog box, click Yes.
Data profiling is an integral part of Warehouse Builder and can be a major step in the design and implementation of a business intelligence system. With as little as connection information for the data source, data profiling enables you to make a fair assessment of the source data and how well it meets its business objectives.
In this tutorial, you learned how to:
- Profile a set of source tables
- Derive a data rule automatically for an anomaly discovered through data profiling
- Create a correction to enforce the derived data rule on the profiled source