As Published In
Oracle Magazine
November/December 2006

TECHNOLOGY: Data Warehousing


Managing Data Quality

By Ron Hardman Oracle ACE

Oracle Warehouse Builder 10g Release 2 handles the truth.

How much confidence do you have in the quality of your data? How about the data your business partners are feeding into your data warehouse? As reported in a November 2005 Computerworld article, "Dirty Data Blights the Bottom Line," Gartner estimates that more than 25 percent of critical data in Fortune 1000 companies will remain flawed through 2007.

It may be impossible to tell the good data from the bad, because both are reported identically through the same interface. And if you can't tell which data is good and which is bad, all of the data is suspect.

So, how can data quality be improved in the organization?

1. Involve users. Data quality is ultimately a business problem, so people in the business must be involved. People frequently enter the data being used, so they are the first line of defense. People are also the final consumers in most cases and provide the last line of defense.

2. Monitor processes. Bad data actually might have been accurate at one time but has since decayed. For example, prospect lists get outdated. The more outdated the information, the more time and money is wasted trying to sell goods or services to the wrong people. Business processes can ensure timely and accurate updates to data. Streamlining processes where possible can reduce the number of hands touching data, thereby reducing the chances of manual data corruption.

3. Use Oracle Warehouse Builder. In addition to offering database design and extract, transform, and load (ETL) features, it includes the ability to profile, cleanse, and audit data, based on data rules. This technology provides an umbrella over the data warehouse, using predefined rules to catch critical mistakes before they make their way into the decision-making process.

This article demonstrates how Oracle Warehouse Builder 10g Release 2 can profile datasources, determine data rules, and generate corrections. To run the examples in this article, install and configure Oracle Warehouse Builder 10g Release 2 and create a repository by using the Repository Assistant. See Oracle Warehouse Builder Installation and Configuration Guide for Microsoft Windows and UNIX, for setup instructions.

Introducing Oracle Warehouse Builder

Oracle Warehouse Builder is a design tool for modeling and maintaining a data warehouse. Its datasources are not limited to Oracle databases. Oracle Warehouse Builder also supports other databases, including DB2, SQL Server, Sybase, Informix, and Teradata. It also works with file and enterprise application datasources, so it supports the complete collection of data for the organization.

Oracle Warehouse Builder provides profiling and auditing capabilities through its user interface, allowing for complete data quality management when combined with the previously mentioned user participation and process improvements. Historically, profiling and cleansing data required good knowledge of SQL and PL/SQL, but the Oracle Warehouse Builder integrated profiler and Data Correction wizard make this task relatively easy. Auditing capabilities apply consistent data rules to the incoming data stream, making the cleansing of new data a very simple process.

Oracle Warehouse Builder core components are now bundled with Oracle Database 10g Release 2 at no additional charge.

Oracle Warehouse Builder Setup

To see profiling and data cleansing for yourself, run the LOAD_DATA.sql script, available for download at oracle.com/technetwork/issue-archive/2006/06-nov/o66owb-128300.zip, in the same database as your Oracle Warehouse Builder repository owner. The script creates users named STUDENTS and STUDENT_TARGET and creates a table called STUDENT_TESTS, with the following definition:  

Name          Null?     Type
------        --------- --------------            
TEST_ID       NOT NULL  NUMBER(10)
TEST_NAME     NOT NULL  VARCHAR2(30)
CONTENT_AREA  NOT NULL  VARCHAR2(10)
GRADE         NOT NULL  VARCHAR2(2)
STATE         NOT NULL  VARCHAR2(2)
SCHOOL_ID     NOT NULL  VARCHAR2(15)
FIRST_NAME    NOT NULL  VARCHAR2(30)
MIDDLE_NAME             VARCHAR2(30)
LAST_NAME     NOT NULL  VARCHAR2(30)
ETHNICITY               VARCHAR2(1)
NEW_TO_SCHOOL           VARCHAR2(1)
AGE_TESTED    NOT NULL  VARCHAR2(3)
GENDER        NOT NULL  VARCHAR2(1)
TOTAL_SCORE   NOT NULL  NUMBER(3)
PERFORMANCE   NOT NULL  NUMBER(1)

The LOAD_DATA.sql script then populates this table with sample test results for use in the steps in this article.

Next, create a repository user as follows:

1. To start the repository assistant, select Start -> All Programs -> OWB Home -> Warehouse Builder -> Administration -> Repository Assistant .
2. Select Advanced Setup and click Next .
3. Click Next .
4. Select Manage Warehouse Builder repository users and click Next .
5. Select Create the registration of one or more Warehouse Builder repository users and click Next .
6. Select your repository owner, supply the password, and click Next .
7. Select the STUDENTS and STUDENT_TARGET users and click Next .
8. Supply the passwords for STUDENTS (which is STUDENTS by default) and STUDENT_TARGET (which is STUDENT_TARGET); click Next ; and click Finish .

With user, data, and repository user configuration complete, you can import the data.

Project Definition

The Oracle Warehouse Builder Design Center organizes metadata by project. Nodes are listed under each project and include items such as databases, data profiles, and data rules.

When you open the Design Center and are logged in as the repository owner, you will see a seed project called MY_PROJECT. You can either use this project or create your own. Perform the following steps to create your own project:

1. Right-click in the Project Explorer and click New .
2. Enter STUDENT_PROJECT as the new project name.
3. Expand the STUDENT_PROJECT project.
4. Expand the Databases node.
5. Right-click Oracle and click New .
6. Enter STUDENT_SOURCE as the module name, click Data Source , and click Next .
7. Select Edit next to the default.

  • Provide all connection information, including the STUDENTS / STUDENTS user name and password.

  • Make sure you select the correct database version.

  • Click Test Connect to verify your connection information.

8. Click Finish .

At this point, the datasource connection is established but you must still select the objects in the datasource. The Import Metadata wizard opens automatically and prompts you for filter information. Complete the wizard as follows:

1. Uncheck everything except Table and click Next .
2. Expand Tables and select STUDENT_TESTS for import.
3. Click Finish .

The datasource is now defined. Next, create the target:

1. Right-click Oracle in the Databases node and click New .
2. Provide STUDENT_TARGET as the module name, click Warehouse Target , and click Next .
3. Select Edit next to the default.

  • Provide all connection information, including the STUDENT_TARGET / STUDENT_TARGET user name and password.
  • Make sure you select the correct database version.
  • Click Test Connect to verify your connection information.
4. Click Finish .

The data is ready for profiling.

Profiling Datasources

Profiling identifies data attributes such as data structure, common formats, common values, and minimum/maximum values. From these findings, Oracle Warehouse Builder determines domain values, or values it believes should be allowed in that column. Not all findings are absolute, however. If your data includes the abbreviations for only half of the states in the United States, for example, only those 25 states will be shown as existing values. You must enter the other valid values before using the values to establish a rule.

To see this in action, do the following:

1. Under the STUDENT_PROJECT project in the Design Center, right-click Data Profiles , and select New .
2. Enter STUDENT_PROFILE for the name and click Next .
3. Select the STUDENT_TESTS table and click Finish .

The Data Profile Editor opens, but the data has not been profiled yet. The Profile Objects window (top left) shows the STUDENT_TESTS table being analyzed. Below that is the Property Inspector window. Modifying the properties in this window modifies the way the profiler analyzes the data.

To start profiling, navigate to Profile on the navigation menu at the top of the Data Profile Editor and click the Profile option. If you have never created a profile before, you will be prompted to create a schema to hold profile data. Follow the wizard's instructions, and continue to create the profile. When you're finished, Oracle Warehouse Builder submits a job that does the work, so even though it returns to the main window, it is working behind the scenes. You can monitor the status of the job in the Monitor Panel at the bottom left of the Data Profile Editor.

When the job completes, you are asked whether you want to retrieve the results now. Click Yes , and the profile results appear in the Profile Results Canvas . Click the Domain tab, and you will see the results shown in Figure 1.

 If there is more than one of any value in the source data, it will appear in the Found Domain column. Also note the percentage displayed in the % Compliant column. This reflects the percentage of the source data that complies with the identified domains. CONTENT_AREA , for example, shows READING and MATH . Click the value in the Found Domain column, and examine the Data Drill Panel in the region beneath it to see how these values were derived, as shown in Figure 1.

SCIENCE and WRITING are in the table, but they occur only once, so they are not included in the Found Domain column by default. You'll also notice a value of WRITEING (misspelled) in the list. Click this value, and the source row is shown to the right.

Defining Data Rules

Now that the profiling is finished, it is time to create some data rules. With the value in the Found Domain column and the CONTENT_AREA row still selected, click the Derive Data Rule button to launch the wizard. Follow these steps:

1. Enter CONTENT_AREA_CHECK as the name for the data rule.
2. On the Define Rule page, choose SCIENCE as a selected value.
3. Type WRITING as a fourth value and click Finish .

The rule is displayed with a check box beside it in the Data Rule Panel , as shown in Figure 2. If additional rules are defined in the profile, they are listed here as well. 

Data Correction Wizard

The data rule can be used to create an automated correction strategy. To create a correction, use the Data Correction wizard as follows:

1. Select Profile -> Create Correction from the menu at the top of the Data Profile Editor.
2. Choose Select an existing module , choose STUDENT_TARGET , and click Next .
3. Verify that the STUDENT_TESTS table is selected and click Next .
4. Verify that CONTENT_AREA_CHECK is selected as a data rule and click Next .
5. Click Next .
6. Choose Cleanse as the action and Soundex Match as the cleanse strategy, as shown in Figure 3, and click Next .
7. Complete the wizard by accepting the defaults on all remaining screens and click Finish .

 The wizard generates the mappings and functions that are required to perform the correction. To see the mappings, tables, and functions created by the wizard, click the Corrected Modules tab, as shown in Figure 4. These mappings show how Oracle Warehouse Builder processes the data through temporary and staging tables to complete the correction. 

Double-click the M_STUDENT_TESTS mapping to open the Mappings Editor. The tables may appear one behind the other in the window. To see how the mappings are laid out, separate the tables and organize them as shown in Figure 5.

 

figure 1
Figure 1: Domain tab in the Profile Results Canvas

figure 2
Figure 2: Data Rule Panel

figure 3
Figure 3: Cleansing with the Data Correction wizard

figure 4
Figure 4: Corrected Modules tab

figure 5
Figure 5: Viewing tables in the Mappings Editor


To see the function created by the Data Correction wizard, return to the Data Profile Editor and click the SDX_CONTENT_AREA function. Click the Implementation tab, and the function in Listing 1 is displayed.

Code Listing 1: SDX_CONTENT_AREA function 

type t_arr is table of varchar2(2000) index by binary_integer;
s t_arr;
begin
  s(0):= 'MATH';  s(1):= 'READING';  s(2):= 'SCIENCE';  s(3):= 'WRITING';

  -- check for equality first. Function may be called when there is no error.
  for i in s.first..s.last loop
    if (s(i) = CONTENT_AREA) then
      return CONTENT_AREA;
    end if;
  end loop;

  for i in s.first..s.last loop
    if (soundex(s(i)) = soundex(CONTENT_AREA)) then
      return s(i);
    end if;
  end loop;
  return CONTENT_AREA;
end;

 

Next Steps



READ more about
Warehouse Builder
Oracle Warehouse Builder Installation and Configuration Guide for Microsoft Windows and UNIX

DOWNLOAD
Oracle Warehouse Builder 10g Release 2
sample code for this article

To deploy the correction, return to the Design Center and navigate to Tools -> Control Center Manager . In the Object Details window, click the Default Actions button and then the Deploy icon in the top menu bar. This submits a job to deploy the objects. Run the mapping to correct the data in the target table, by selecting the M_STUDENT_TESTS mapping in the Control Center Manager and clicking the Run button. Confirm that WRITEING is now shown as WRITING in the corrected STUDENT_TESTS table.

Conclusion

Oracle Warehouse Builder's Data Profile Editor simplifies rule definition. Simply specify the source and target, and all of your data is analyzed. Domain values are derived from existing data, providing a great starting point for rule definition. After you create rules, Oracle Warehouse Builder's Data Correction wizard makes generating corrections easy. Specify the rules that should be applied as well as the action to take when a rule is violated, and all mappings, tables, and functions are generated automatically.

 


Ron Hardman works with Academy District 20 schools in Colorado Springs, Colorado, and is the founder of 5-Mile Software. He is coauthor of Oracle Database 10g PL/SQL Programming and Expert PL/SQL, both from Oracle Press, and is an Oracle ACE.

 



Send us your comments