|
ORACLE9i Warehouse Builder 9.2
Data Quality
1.0 Name and Address
Cleansing
2.0 Match-Merge
3.0 Other Information
1.0 Name
and Address Cleansing
1.1 What is Name and Address Cleansing?
Name
and Address Cleansing is the group of transformations performed on data
containing individual and business names as well as domestic and foreign
addresses for the purpose of improving the quality of data. Such transformations
are usually referred to as parsing,
standardization, correction and augmentation:
- Name/address parsing is the breakdown of non-discrete
input into discrete name or address components.
- Name/address standardization is the modification of
components to a standard version acceptable to a postal service or
suitable for record matching.
- Postal correction involves matching an input address
with postal database entries to verify and/or correct an address.
- Augmentation adds derived information to the data,
such as gender based on name, or collection of census and geo-location
data
1.2 How is Name and Address Cleansing accomplished in Warehouse
Builder?
Name and
address cleansing in Warehouse Builder is a built-in transformation
and is integrated into the ETL functionality as Name and Address Operator.
Name and Address Operator offers the full range of transformations for
name and address data: parsing, standardization, correction and augmentation.
At the design time this wizard-driven operator is used to model Name
and Address rules inside Warehouse Builder. At runtime, the deployed
Name and Address transformations transparently access parsers and data
provided by the third-party vendors.
1.3 What is Oracle9i Pure Name and Address?
Oracle9i Pure
Name and Address is the prior implementation of Name and Address cleansing,
based on bundling a single third-party vendor software. Pure Name and
Address ended with Warehouse Builder 9.0.4 and is no longer available.
1.4 What is the current architecture of Name and Address cleansing in
Warehouse Builder?
Warehouse
Builder users now have a choice of multiple certified third-party data
quality vendors. The use of Name and Address Operator at the design
time does not change in any way. However, at runtime Warehouse Builder
transparently accesses adaptors build by the third party vendors and
the data they supply.
1.5 How are Name and Address components of Warehouse Builder installed?
Warehouse
Builder Name and Address Server is installed automatically as part of
Warehouse Builder server side install. Third-party adapters and data
are installed separately by following the vendors’ own installation
programs or scripts. Warehouse Builder becomes aware of their location
through the use of environment variables and property files modified
by third-party installation programs automatically.
1.6 How are Name and Address components of Warehouse Builder licensed?
Warehouse
Builder Name and Address Server is included at no extra charge. Third
party data quality vendors sell their software separately and directly
to the customer.
1.7 Who are the third-party data quality vendors partnering with Warehouse
Builder?
At present
Warehouse Builder is certified to work with Trillium Software and First
Logic. Other leading data quality vendors have partnered with Oracle
and are in the process of building their adapters for Warehouse Builder.
1.8 How can the sales, consulting or product support of the third-party
data quality vendors be contacted?
The data
quality partner matrix is available at /products/warehouse/index.html.
It lists contacts and is updated periodically.
1.9 What countries are supported?
The country
support varies by vendor. The data quality partner matrix referenced
above provides a list of each vendor’s supported countries.
1.10 What are the prices of third-party
data quality software?
The prices
obviously vary by vendor. Contact the vendor directly for pricing information.
One of the benefits of Warehouse Builder working with multiple partners is the ability to offer different price points.
1.11
What about customers with current licenses for Oracle 9i Pure
Name and Address?
These customers can continue using Pure Name and Address.
Warehouse Builder 9.2 is certified to work with Oracle9i Pure
Name and Address 9.0.4. They would need to perform slightly different
installation steps as described in the 9.2 release notes. Alternatively,
they can choose to take advantage of special offers from the data quality
vendors and trade-in their Pure Name and Address licenses. Please contact
the data quality partners directly for details.
1.12
How does Name and Address cleansing work in the Oracle 9i RAC environment?
Warehouse Builder supports Real Application Clusters (RAC)
features of Oracle 9i. One way this support is accomplished is by allowing
the install of OWB's runtime service on multiple nodes in a cluster.
The benefits of such configuration are the transparent failover between
nodes and the near linear increase of processing performance with each
added node.
Taking advantage of OWB Name and Address Cleansing in the RAC environment
requires installation of the following components on each node in a
cluster:
- OWB Name and Address Server - performed by "OWB Runtime Assistant",
option "Register RAC Instance"
- A third-party Name and Address adapter - performed by third-party
installation programs
- A third-party Name and Address set of data - performed by third-party
installation programs
Back to Top
2.0 Match-Merge
2.1
What is Matching and Merging?
Matching
is the process of determining, through business rules, which records
refer to the same logical data. Merging is the business rules driven
consolidation of the data from the matched set into a single record.
2.2
What are the Deduplication, Householding and Record Linking?
In the
area of Data Quality, all three are instances of using Match-Merge for
a particular purpose:
- Deduplication is the process
of matching and merging for the purpose of removing the duplicate
records, especially customer related records. This contributes to
achieving the single view of the customer.
- Householding is the process
of matching customers belonging to the same household, usually identified
by the same address. Customer names are not merged, however, they
are linked to the address that is stored once. The benefit of householding
is the improved ability to understand and target customers.
- Record linking is the more generic
instance of householding. The records may need to be linked for purposes
other than determining households, for example, linking business customers’
branches and subsidiaries to one parent entity.
2.2 How is Matching and Merging accomplished in Warehouse
Builder?
Matching
and Merging in Warehouse Builder is a built-in transformation and is
integrated into the ETL functionality as Match-Merge Operator. Match-Merge
Operator has a wizard driven interface with a powerful set of UI controlled
matching and merging algorithms.
2.3
How is Warehouse Builder Match-Merge installed?
Match-Merge is inseparable from Warehouse Builder and is
installed automatically like any other operator.
2.4
How is Warehouse Builder Match-Merge licensed?
Match-Merge
is included with Warehouse Builder at no extra charge. There is no dependency
on third-party software for Match-Merge.
2.5 What are fuzzy
matching, deterministic matching and probabilistic matching?
Warehouse Builder employs the elements of the fuzzy logic and
provides both deterministic and probabilistic matching algorithms.
- In general, fuzzy logic resembles
human reasoning in its use of approximate information and uncertainty
to generate decisions. In relation to matching, the term is used loosely
to describe the approach that relies on rules that are imprecise rather
than precise and operates on data with boundaries that are not sharply
defined.
- Deterministic matching gives
equal weight to different types of information a record may contain.
For example, a deterministic approach might place equal reliance on
a match between the names on two records or a match between two birth
dates.
- Probabilistic matching exploits
the statistical probability that a match on particular items is more
or less likely to indicate that the records are the same. For example,
birth date information is subject to errors made by a mistake on a
single digit, and the number of possible birth dates is relatively
small. Names, in contrast, are more likely to be recognizable even
if a single error is made. Probabilistic matching thus allows assigning
appropriate weights to different attributes and then compares the
total score to the threshold that defines a successful match.
2.6
What Match rule types does Warehouse
Builder
offer?
- Person Rule. Built-in
rules for matching names. This rule requires users to set what
data within the record make up a person’s name. The rule
offers different parameters that enable specific person name matching
such as “Match on initials”, “Match hyphenated names”.
- Firm Rule. Built-in
rules for matching business names. This rule requires users to set
what data within the record make up a firm name. The rule offers different
parameters that enable specific firm matching such as “Strip
noise words” (Inc, Corp), “Match on acronyms”.
- Address Rule. Built-in
rules for matching Addresses. This rule requires users to set what
data within the record make up an address. The rule offers different
parameters that enable specific address matching such as “Match
either street or post office box”, “Match on blank secondary
address”.
- Conditional Rule. The
user sets each column to be compared with a comparison algorithm to
use. Algorithms such as Exact, Standard Exact, Soundex, Similarity,
Standard Similarity, Partial Name, Abbreviation, and Acronym are supplied.
Each column must match when compared using the desired algorithm for
two records to be considered a match. In other words, the comparisons
are logically ‘AND’ed together.
- Weight Rule. The
user gives a weight to each column included in the comparison and
an overall score that must be achieved for the match. Each column
that has a weight is compared using a similarity algorithm. This rule
implements the probabilistic matching approach.
- Custom Match Rule. The
users can write their own comparison algorithms desired for matching.
They can also reference one of the other match rule types to logically
‘AND’ or ‘OR’ them together to produce a unique
rule type.
2.7
What Merge rule types does Warehouse
Builder
offer?
Once the
records are matched, they can be merged based on the following rule
types:
- Any: This rule will pick any
(first non-null) value of the match set for this attribute.
- Rank: Rank the values of the
match set for a particular attribute and attribute value, and use
the associated attribute value of the record that satisfies the highest
rank. For example, choose the record where ‘Is_Found’
flag is True.
- Sequence: User can specify a
database sequence for this rule. The nextval of the sequence will
be used for the value.
- Min/Max: User can specify an
attribute and a relation to choose the record to be used as a source
for the merge attribute. For example, choose the record with a greater
date.
- Copy: User can choose a value
from a different previously merged value.
- Custom: User can create a PL/SQL
package function to select the merge value.
2.8 What are Match Bins?
Match
Bins group records together prior to matching to limit the set of necessary
comparisons to a manageable number. The user designates an attribute
or attributes by which to group the data. When Warehouse Builder matches
the rows, it compares each row with the subsequent row for all rows
within the same group. This can greatly enhance performance since Warehouse
Builder searches for matches only within the “match bins”
and not throughout the entire data set. The number of comparisons performed
grows with the size of the bin. The formula is n=(b*(b-1))/2, where
n is number of comparisons, b is number of records in a bin.
To match 5 records, Warehouse Builder must perform 10 comparisons. To
match 50 records, Warehouse Builder must perform 1,225 comparisons.
To match 500 records, Warehouse Builder must perform 124,750 comparisons.
2.9 What is the relationship between Name and Address cleansing and Match-Merge?
It is
recommended to match the data after it has been cleansed through Name/Address.
However, the data that is not Name/Address centric, such as product
data, or the data that is already deemed clean can be processed by Match-merge
alone.
Back
to Top
3.0 Other Information
3.1 What are the differences between Warehouse Builder
and Pure Integrate?
Pure Integrate has been desupported
and all of its functionality has been migrated to Warehouse Builder
as of this release 9.2. Oracle9i Warehouse Builder combines complete
ETL, metadata management and Data Quality functionality into one tool
and through the same User Interface.
3.2 Where can I learn more about
Warehouse Builder
Name & Address cleansing as well as Match-merge?
- Oracle9i Warehouse Builder
User Guide and Online Help
- Data Quality White Paper
- Name/Address and Match-Merge
viewlets
Back
to Top
|