Examining Source Data Using Data Profiling

The goal of this tutorial is to profile a data source, derive a data rule, and create a correction mapping that corrects the data based on a similarity match.

Note: The screenshots in this tutorial may appear slightly different from yours depending on the order in which you perform the other tutorials in this OBE Series. You can ignore the irrelevant items you see on your screens.

Approximately 1 hour

Topics

This tutorial covers the following topics:

Overview
Prerequisites
Creating a Data Profile
Deriving a Data Rule
Using Correction Mappings to Cleanse Data
Summary


Overview

Data profiling is a prominent new addition to Warehouse Builder 10g Release 2 that allows you to investigate data and detect anomalies, then create business rules and generate a corrective ETL process based on those rules.

Data Profiling can be defined as a process that automates the identification of problematic data and metadata while enabling the correction of inconsistencies, redundancies and inaccuracies in the data.

You can profile your source or target data to discover structure, relationships, and data rules. Data profiling provides statistical information about compliant data and outliers, including Six Sigma metrics. You can drill down into the anomalies and derive data rules manually or automatically, and you can use these data rules to monitor the quality of data in your data warehouse.

You can perform attribute analysis, referential analysis, and functional dependency analysis, or profile data against custom-designed rules.
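
To make this concrete, the following minimal sketch, written in plain Python rather than in Warehouse Builder, shows the kind of statistics attribute and domain analysis produce for a single column. The function name, domain threshold, and sample data are illustrative assumptions only:

    from collections import Counter

    def profile_column(values, domain_threshold=0.95):
        """Compute simple attribute and domain statistics for one column."""
        total = len(values)
        non_null = [v for v in values if v is not None]
        counts = Counter(non_null)

        # A small set of values covering most rows suggests a domain.
        covered, domain = 0, []
        for value, n in counts.most_common():
            domain.append(value)
            covered += n
            if covered / total >= domain_threshold:
                break

        outliers = total - covered           # rows outside the candidate domain
        dpmo = outliers / total * 1_000_000  # Six Sigma style defects per million
        return {
            "rows": total,
            "nulls": total - len(non_null),
            "distinct": len(counts),
            "candidate_domain": domain,
            "compliance_pct": round(covered / total * 100, 1),
            "dpmo": round(dpmo),
        }

    # Hypothetical JOB_ID sample: two misspelled values among 109 rows.
    sample = (["SH_CLERK"] * 40 + ["ST_CLERK"] * 35 + ["SA_REP"] * 20
              + ["PU_CLERK"] * 12 + ["SA_REPP", "PU_CLIRK"])
    print(profile_column(sample))

On this sample, the sketch reports four candidate domain values, 98.2 percent compliance, and about 18,349 defects per million, mirroring the kind of figures you will see for the JOB_ID column later in this tutorial.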

Data profiling in Warehouse Builder follows five steps:

1. Profile the source data to detect all the information the data holds.
2. Derive data rules from the profiling results.
3. Use the derived rules to generate corrections. The corrections are regular Warehouse Builder mappings.
4. Deploy the correction mappings.
5. Run the correction mappings to cleanse the data in the target.

Back to Topic List

Prerequisites

Before you perform this tutorial, you should:

1.

Complete the Setting Up the Oracle Warehouse Builder Project tutorial.

Back to Topic List

Creating a Data Profile

Before you begin profiling data, you need to define the source to be profiled. You have already created the DQ_SRC source module and imported its metadata in the Setting Up the Oracle Warehouse Builder Project tutorial. Now you are ready to create the data profile by performing the following tasks:

Creating a Data Profile Using a Wizard
Enabling the Data Profile Properties
Profiling the Source Data

Back to Topic List

Creating a Data Profile Using a Wizard
1.

Assuming you are logged in as the OWB user, from the main Warehouse Builder console, expand the HANDSON project, right-click Data Profiles, and select New.

 

2.

The Create Data Profile wizard launches. On the Welcome page, click Next.

 

3.

On the Name and Description page, enter DP1 as the name of the data profile and, optionally, a description. Click Next.

 

4.

On the Select Objects page, hold down Ctrl and select the DQ_SRC objects in both the Tables and Views lists, then click > to move all the objects to the Selected list. Click Next.

 

5.

On the Summary page, click Finish. You have created a metadata container that stores all the information about the data profile.

 

Back to Topic

Enabling the Data Profile Properties
1.

The Data Profile Editor appears. If an informational note appears, click OK to dismiss it.

 

2.

Locate the Property Inspector on the left side, and select the Enable Common Format Discovery and Enable Pattern Discovery check boxes.

With Common Format Discovery enabled, Warehouse Builder discovers common formats in your data by checking your data fields against a list of rules. This is useful for finding items such as US Social Security Numbers. When Warehouse Builder detects such an attribute, it is marked as a common format (for example, SSN).
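
As an illustration only (this is not Warehouse Builder's internal rule list), a common-format rule for US Social Security Numbers could amount to a check like this:

    import re

    # One hypothetical rule from a common-format rule list: US SSN as NNN-NN-NNNN.
    SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

    def looks_like_ssn_column(values, threshold=0.9):
        """Flag a column as SSN-formatted if most non-null values match."""
        non_null = [v for v in values if v]
        if not non_null:
            return False
        matches = sum(1 for v in non_null if SSN_PATTERN.match(v))
        return matches / len(non_null) >= threshold

    print(looks_like_ssn_column(["123-45-6789", "987-65-4321", "000-12-3456"]))  # True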

Enabling Pattern Discovery allows Warehouse Builder to detect any patterns within your data. The discovered patterns require interpretation: you examine the results and then define data rules that act on them. For example, pattern discovery can find the most common pattern in columns such as product codes or SKUs (Stock Keeping Units). Once you have discovered a pattern, you can interpret it and act on it by defining a data rule.
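
A minimal sketch of the idea, again outside Warehouse Builder and with made-up sample data: reduce each value to a character-class signature, and the dominant pattern in a column stands out:

    from collections import Counter

    def signature(value):
        """Map each character to a class: digit becomes 9, letter becomes A."""
        return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                       for c in value)

    def discover_patterns(values):
        """Return pattern signatures ordered by how often they occur."""
        return Counter(signature(v) for v in values if v).most_common()

    # Hypothetical SKU column: one value deviates from the dominant pattern.
    skus = ["AB-1234", "CD-5678", "EF-9012", "GH3456"]
    print(discover_patterns(skus))  # [('AA-9999', 3), ('AA9999', 1)]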

 

3.

From the Profile menu in the Data Profile Editor, select Save All then click Yes.

 

Back to Topic

Profiling the Source Data

1.

In the Data Profile Editor, from the Profile menu select Profile.

 

2.

You may be prompted for the SYS password if this is the first profile run. Enter your SYS password and click OK.

 

3.

Preparing and submitting the profiling job takes approximately 2 to 5 minutes, depending on the machine.

 

4.

When the job starts, a message appears. Click OK.

 

5.

You can see the status of your profiling operation in the Monitor Panel (lower right corner in the Data Profile Editor).

 

6.

The profiling operation may take 1 to 3 minutes. When it completes (indicated by a check mark), you can synchronize the results. This option is useful if you have previously profiled your data in the same data profile, because it gives you control over when the new profiling results become visible in the Data Profile Editor. Click Yes to retrieve the results.

The profiling is now complete. The job status changes in the Monitor Panel and the results are shown.

 

Back to Topic

Back to Topic List

Deriving a Data Rule

Data profiling does not stop at analyzing the profiled data. You can also derive data rules from the profiling results, either manually or automatically.

In the following steps, you derive a data rule for the JOB_ID domain values:

1.

In the Object Trees panel, ensure the Profile Objects tab is selected. Under DQ_SRC, select the EMPLOYEES table.

 

2.

Click the Domain tab.

 

3.

Scroll down to the JOB_ID column and select the hyperlink in the Found Domain column.

 

4.

In the Data Drill Panel, select the first row. Observe that 98 percent of the records comply with the discovered domain values (SH_CLERK, ST_CLERK, SA_REP, and PU_CLERK), and that 2 records do not comply with them. The first row indicates that employee_id 1001 is not compliant. Click the second row.

 

5.

The second row indicates employee_id 1000 is not compliant.
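
The drill down amounts to checking each row against the discovered domain. Here is a small sketch using the two non-compliant rows from this tutorial; the other employee_id values are made up:

    # Domain values discovered by profiling the JOB_ID column.
    JOB_ID_DOMAIN = {"SH_CLERK", "ST_CLERK", "SA_REP", "PU_CLERK"}

    # Hypothetical extract of (employee_id, job_id) pairs from DQ_SRC.
    rows = [(100, "SA_REP"), (1000, "PU_CLIRK"), (1001, "SA_REPP"), (205, "SH_CLERK")]

    # List the rows whose JOB_ID falls outside the domain.
    non_compliant = [(emp, job) for emp, job in rows if job not in JOB_ID_DOMAIN]
    print(non_compliant)  # [(1000, 'PU_CLIRK'), (1001, 'SA_REPP')]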

 

6.

In the top panel, make sure the JOB_ID row is still highlighted and click the Derive Data Rule button.

 

7.

On the Welcome page of the wizard, click Next.

 

8.

Enter DM_JOB_ID for the Data Rule Name and click Next.

 

9.

On the Define Rule page, the list of values deemed compliant with the domain is automatically populated (SH_CLERK, ST_CLERK, and so on). Ensure that only the correct values are in the domain list (all but the two misspelled JOB_IDs) and click Next.

 

10.

Click Finish to create the Derived Data Rule.

 

11.

Notice that the data rule now appears as applied to the EMPLOYEES table in the Data Rule panel at the bottom of the screen, and that the derivable symbol in the Found Domains cell changes to the derived symbol.

 

12.

Switch to the Design Center window (you can leave the Data Profile Editor window open). Expand Data Rules and double-click DM_JOB_ID.

 

13.

Click the Define Rule tab.

 

14.

The domain values are listed. Click OK to close the window.

 

Back to Topic List

Using Correction Mappings to Cleanse Data

You can use a data rule to create a correction: Warehouse Builder automatically generates a mapping that enforces the data rule and corrects the records that do not comply with it. When executed, this mapping moves the data from the profiled source to the corrected target.
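
Conceptually, the generated mapping splits rows on rule compliance: compliant rows pass straight through to the target, while violating rows are first routed through the chosen cleansing strategy. A rough Python analogy, not the code Warehouse Builder actually generates:

    def run_correction(rows, domain, cleanse):
        """Split rows on rule compliance and cleanse only the violators.

        rows:    (employee_id, job_id) pairs from the profiled source
        domain:  the set of JOB_ID values the data rule allows
        cleanse: the chosen strategy, for example a similarity match
                 (sketched in the Inspecting the Correction Mapping topic)
        """
        target = []
        for emp_id, job_id in rows:
            if job_id in domain:
                target.append((emp_id, job_id))           # compliant: pass through
            else:
                target.append((emp_id, cleanse(job_id)))  # violator: cleanse first
        return target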

Creating a Correction Mapping
Inspecting the Correction Mapping
Deploying the Correction Mapping
Executing the Correction Mapping to Cleanse the Data
Viewing the Corrected Data

Back to Topic List

Creating a Correction Mapping

1.

Switch back to the Data Profile Editor. You are only interested in the rule you just created, so uncheck all the data rules except for DM_JOB_ID in the Data Rule Panel.

 

2.

Select EMPLOYEES in the list of Profile Objects. From the Profile menu select Create Correction.

 

3.

The Create Correction Wizard starts. Click Next on the welcome screen.

 

4.

Choose DP_TGT as the target module for the corrections and click Next.

 

5.

In the Select Objects page, you see the EMPLOYEES table is already selected. Click Next.

 

6.

A domain rule can be implemented on the new target table as a check constraint. Click Next.
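
For reference, a domain rule of this kind corresponds to a check constraint along the following lines. The exact DDL Warehouse Builder generates may differ; the table and constraint names used here are the ones you assign in the next two steps:

    # Assumed form of the check constraint implied by the domain rule
    # (a sketch, not the exact statement Warehouse Builder generates).
    domain = ["SH_CLERK", "ST_CLERK", "SA_REP", "PU_CLERK"]
    in_list = ", ".join(f"'{v}'" for v in domain)
    ddl = (
        "ALTER TABLE CORR_EMPLOYEES\n"
        f"  ADD CONSTRAINT DM_JOB_ID_CK1 CHECK (JOB_ID IN ({in_list}))"
    )
    print(ddl)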

 

7.

Warehouse Builder generates the appropriate schema metadata. Change the name of the new table to CORR_EMPLOYEES. Click the Constraints tab.

 

8.

Rename the check constraint to DM_JOB_ID_CK1, then scroll to the right.

 

9.

Notice that this constraint is derived from the data rule. Click Next.

 

10.

In the Choose data correction actions section, select DM_JOB_ID and select Similarity Match from the list of Cleanse Strategies, then click Next.

 

11.

A progress dialog appears.

 

12.

After the progress dialog disappears, click Finish to exit the Correction wizard.

 

Back to Topic

Inspecting the Correction Mapping

You can review the correction mapping you just created. Perform the following steps:

1.

Click the Corrected Modules tab in the top left corner of the Data Profile Editor window.

 

2.

All the objects are generated based on choices made in the wizard. Expand Functions and double-click on SIM_JOB_ID.

 

3.

Click the Implementation tab.

 

4.

Notice the call to the similarity algorithm, which ensures that the cleansing happens using the similarity match. Click OK to close the window.
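
Warehouse Builder uses its own similarity algorithm here. As a rough analogy only, the Python standard library's difflib corrects a misspelled value to its closest domain value in the same spirit:

    import difflib

    DOMAIN = ["SH_CLERK", "ST_CLERK", "SA_REP", "PU_CLERK"]

    def similarity_cleanse(value, domain=DOMAIN, cutoff=0.6):
        """Replace a non-compliant value with its closest domain value."""
        matches = difflib.get_close_matches(value, domain, n=1, cutoff=cutoff)
        return matches[0] if matches else value  # leave it alone if nothing is close

    print(similarity_cleanse("SA_REPP"))   # SA_REP
    print(similarity_cleanse("PU_CLIRK"))  # PU_CLERK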

 

5.

On the Corrected Modules tab again, expand Mappings and double-click the M_EMPLOYEES mapping.

 

6.

To see all the tables, click the Auto-layout button in the toolbar. Review the mapping. Notice that the CORR_EMPLOYEES table receives all data, and both target operators are bound to this single table.

 

7.

Select the ATTR_VALUE_1 table and click the Expand Child Graph button in the toolbar.

 

8.

ATTR_VALUE_1 is the operator that does the actual cleansing. This operator is a pluggable mapping. When you have finished reviewing, close the Mapping window and the Data Profile Editor.

 

9.

In the Design Center window, from the Design menu, select Save All then click Yes to save your work.

 

10.

Within the Design Center, expand DP_TGT > Mappings. Notice the same M_EMPLOYEES mapping: all the metadata you saw in the Data Profile Editor is now stored in this target module.

 

Back to Topic

Deploying the Correction Mapping

To deploy the Correction Mapping, perform the following steps:

1.

Select Tools > Control Center Manager.

 

2.

Expand DP_TGT_LOCATION and select DP_TGT. Click Default Actions. This marks the objects for deployment with their default actions.

 

3.

Select File > Deploy > To Control Center.

 

4.

Note: You may receive a message indicating that the location is not registered. If you do, make sure the hostname is set to localhost instead of the machine name, click Test Connection, and then click OK.

You can monitor the deployment job in the Job Panel.

 

5.

Once the job has finished successfully, you can run the mapping to actually cleanse data into the target table.

 

Back to Topic

Executing the Correction Mapping to Cleanse the Data

To cleanse the data, you need to execute the correction mapping, which loads the corrected data into the target table. Perform the following steps:

1.

Under DP_TGT, expand Mappings, and right-click on M_EMPLOYEES. Then select Start.

 

2.

In the job panel, monitor the job execution. When finished, double-click on the job.

 

3.

Click the Execution Results tab to inspect the results.

 

4.

In the results, 109 is the total number of records selected from the source. Two records violate the data rule; these are the 2 records you see counted in the next three rows (inserted into the temporary table, into the error table, and, after cleansing, into the target table). The remaining 107 records comply with the rule and are moved directly into the target. When you have finished reviewing, close this window and the Control Center Manager window.

 

Back to Topic

Viewing the Corrected Data

To view the corrected data, perform the following steps:

1.

In the Design Center, expand DP_TGT > Tables and right-click CORR_EMPLOYEES then select Data.

 

2.

Click the Where Clause... button.

 

3.

Enter employee_id=1001 and click OK.

 

4.

Notice that the JOB_ID value for EMPLOYEE_ID 1001 has been corrected to SA_REP instead of SA_REPP. Click the Where Clause... button again.

 

5.

Enter employee_id=1000 and click OK.

 

6.

Notice that the JOB_ID value for EMPLOYEE_ID 1000 has been corrected to PU_CLERK instead of PU_CLIRK.

 

Back to Topic

Back to Topic List

Summary

Data profiling is an integral part of Warehouse Builder and can be a major step in the design and implementation of a business intelligence system. With as little as the connection information for a data source, data profiling enables you to make a fair assessment of the source data and how well it supports your business objectives.

In this tutorial, you learned how to:

Profile a set of source tables

Derive a data rule automatically for an anomaly discovered through data profiling

Create a correction to enforce the derived data rule on the profiled source

Back to Topic List
