TECHNOLOGY: Data Warehousing
Managing Data QualityBy Ron Hardman
Oracle Warehouse Builder 10g Release 2 handles the truth.
How much confidence do you have in the quality of your data? How about the data your business partners are feeding into your data warehouse? As reported in a November 2005 Computerworld article, "Dirty Data Blights the Bottom Line," Gartner estimates that more than 25 percent of critical data in Fortune 1000 companies will remain flawed through 2007.
It may be impossible to tell the good data from the bad, because both are reported identically through the same interface. And if you can't tell which data is good and which is bad, all of the data is suspect.
So, how can data quality be improved in the organization?
1. Involve users. Data quality is ultimately a business problem, so people in the business must be involved. People frequently enter the data being used, so they are the first line of defense. People are also the final consumers in most cases and provide the last line of defense.
2. Monitor processes. Bad data actually might have been accurate at one time but has since decayed. For example, prospect lists get outdated. The more outdated the information, the more time and money is wasted trying to sell goods or services to the wrong people. Business processes can ensure timely and accurate updates to data. Streamlining processes where possible can reduce the number of hands touching data, thereby reducing the chances of manual data corruption.
3. Use Oracle Warehouse Builder. In addition to offering database design and extract, transform, and load (ETL) features, it includes the ability to profile, cleanse, and audit data, based on data rules. This technology provides an umbrella over the data warehouse, using predefined rules to catch critical mistakes before they make their way into the decision-making process.
This article demonstrates how Oracle Warehouse Builder 10g Release 2 can profile datasources, determine data rules, and generate corrections. To run the examples in this article, install and configure Oracle Warehouse Builder 10g Release 2 and create a repository by using the Repository Assistant. See Oracle Warehouse Builder Installation and Configuration Guide for Microsoft Windows and UNIX, for setup instructions.
Introducing Oracle Warehouse Builder
Oracle Warehouse Builder is a design tool for modeling and maintaining a data warehouse. Its datasources are not limited to Oracle databases. Oracle Warehouse Builder also supports other databases, including DB2, SQL Server, Sybase, Informix, and Teradata. It also works with file and enterprise application datasources, so it supports the complete collection of data for the organization.
Oracle Warehouse Builder provides profiling and auditing capabilities through its user interface, allowing for complete data quality management when combined with the previously mentioned user participation and process improvements. Historically, profiling and cleansing data required good knowledge of SQL and PL/SQL, but the Oracle Warehouse Builder integrated profiler and Data Correction wizard make this task relatively easy. Auditing capabilities apply consistent data rules to the incoming data stream, making the cleansing of new data a very simple process.
Oracle Warehouse Builder core components are now bundled with Oracle Database 10g Release 2 at no additional charge.
Oracle Warehouse Builder Setup
To see profiling and data cleansing for yourself, run the LOAD_DATA.sql script, available for download at oracle.com/technetwork/issue-archive/2006/06-nov/o66owb-128300.zip, in the same database as your Oracle Warehouse Builder repository owner. The script creates users named STUDENTS and STUDENT_TARGET and creates a table called STUDENT_TESTS, with the following definition:
Name Null? Type ------ --------- -------------- TEST_ID NOT NULL NUMBER(10) TEST_NAME NOT NULL VARCHAR2(30) CONTENT_AREA NOT NULL VARCHAR2(10) GRADE NOT NULL VARCHAR2(2) STATE NOT NULL VARCHAR2(2) SCHOOL_ID NOT NULL VARCHAR2(15) FIRST_NAME NOT NULL VARCHAR2(30) MIDDLE_NAME VARCHAR2(30) LAST_NAME NOT NULL VARCHAR2(30) ETHNICITY VARCHAR2(1) NEW_TO_SCHOOL VARCHAR2(1) AGE_TESTED NOT NULL VARCHAR2(3) GENDER NOT NULL VARCHAR2(1) TOTAL_SCORE NOT NULL NUMBER(3) PERFORMANCE NOT NULL NUMBER(1)
The LOAD_DATA.sql script then populates this table with sample test results for use in the steps in this article.
Next, create a repository user as follows:
1. To start the repository assistant, select Start -> All Programs -> OWB Home -> Warehouse Builder -> Administration -> Repository Assistant .
With user, data, and repository user configuration complete, you can import the data.
The Oracle Warehouse Builder Design Center organizes metadata by project. Nodes are listed under each project and include items such as databases, data profiles, and data rules.
When you open the Design Center and are logged in as the repository owner, you will see a seed project called MY_PROJECT. You can either use this project or create your own. Perform the following steps to create your own project:
1. Right-click in the Project Explorer and click New .
8. Click Finish .
At this point, the datasource connection is established but you must still select the objects in the datasource. The Import Metadata wizard opens automatically and prompts you for filter information. Complete the wizard as follows:
1. Uncheck everything except Table and click Next .
The datasource is now defined. Next, create the target:
1. Right-click Oracle in the Databases node and click New .
The data is ready for profiling.
Profiling identifies data attributes such as data structure, common formats, common values, and minimum/maximum values. From these findings, Oracle Warehouse Builder determines domain values, or values it believes should be allowed in that column. Not all findings are absolute, however. If your data includes the abbreviations for only half of the states in the United States, for example, only those 25 states will be shown as existing values. You must enter the other valid values before using the values to establish a rule.
To see this in action, do the following:
1. Under the STUDENT_PROJECT project in the Design Center, right-click Data Profiles , and select New .
The Data Profile Editor opens, but the data has not been profiled yet. The Profile Objects window (top left) shows the STUDENT_TESTS table being analyzed. Below that is the Property Inspector window. Modifying the properties in this window modifies the way the profiler analyzes the data.
To start profiling, navigate to Profile on the navigation menu at the top of the Data Profile Editor and click the Profile option. If you have never created a profile before, you will be prompted to create a schema to hold profile data. Follow the wizard's instructions, and continue to create the profile. When you're finished, Oracle Warehouse Builder submits a job that does the work, so even though it returns to the main window, it is working behind the scenes. You can monitor the status of the job in the Monitor Panel at the bottom left of the Data Profile Editor.
When the job completes, you are asked whether you want to retrieve the results now. Click Yes , and the profile results appear in the Profile Results Canvas . Click the Domain tab, and you will see the results shown in Figure 1.
If there is more than one of any value in the source data, it will appear in the Found Domain column. Also note the percentage displayed in the % Compliant column. This reflects the percentage of the source data that complies with the identified domains. CONTENT_AREA , for example, shows READING and MATH . Click the value in the Found Domain column, and examine the Data Drill Panel in the region beneath it to see how these values were derived, as shown in Figure 1.
SCIENCE and WRITING are in the table, but they occur only once, so they are not included in the Found Domain column by default. You'll also notice a value of WRITEING (misspelled) in the list. Click this value, and the source row is shown to the right.
Defining Data Rules
Now that the profiling is finished, it is time to create some data rules. With the value in the Found Domain column and the CONTENT_AREA row still selected, click the Derive Data Rule button to launch the wizard. Follow these steps:
1. Enter CONTENT_AREA_CHECK as the name for the data rule.
The rule is displayed with a check box beside it in the Data Rule Panel , as shown in Figure 2. If additional rules are defined in the profile, they are listed here as well.
Data Correction Wizard
The data rule can be used to create an automated correction strategy. To create a correction, use the Data Correction wizard as follows:
1. Select Profile -> Create Correction from the menu at the top of the Data Profile Editor.
The wizard generates the mappings and functions that are required to perform the correction. To see the mappings, tables, and functions created by the wizard, click the Corrected Modules tab, as shown in Figure 4. These mappings show how Oracle Warehouse Builder processes the data through temporary and staging tables to complete the correction.
Double-click the M_STUDENT_TESTS mapping to open the Mappings Editor. The tables may appear one behind the other in the window. To see how the mappings are laid out, separate the tables and organize them as shown in Figure 5.
To see the function created by the Data Correction wizard, return to the Data Profile Editor and click the SDX_CONTENT_AREA function. Click the Implementation tab, and the function in Listing 1 is displayed.
Code Listing 1: SDX_CONTENT_AREA function
type t_arr is table of varchar2(2000) index by binary_integer; s t_arr; begin s(0):= 'MATH'; s(1):= 'READING'; s(2):= 'SCIENCE'; s(3):= 'WRITING'; -- check for equality first. Function may be called when there is no error. for i in s.first..s.last loop if (s(i) = CONTENT_AREA) then return CONTENT_AREA; end if; end loop; for i in s.first..s.last loop if (soundex(s(i)) = soundex(CONTENT_AREA)) then return s(i); end if; end loop; return CONTENT_AREA; end;
Oracle Warehouse Builder's Data Profile Editor simplifies rule definition. Simply specify the source and target, and all of your data is analyzed. Domain values are derived from existing data, providing a great starting point for rule definition. After you create rules, Oracle Warehouse Builder's Data Correction wizard makes generating corrections easy. Specify the rules that should be applied as well as the action to take when a rule is violated, and all mappings, tables, and functions are generated automatically.
Ron Hardman works with Academy District 20 schools in Colorado Springs, Colorado, and is the founder of 5-Mile Software. He is coauthor of Oracle Database 10g PL/SQL Programming and Expert PL/SQL, both from Oracle Press, and is an Oracle ACE.