The Word Match Count comparison enables matching of multi-word String values that contain a number of common distinct words (separated by whitespace), regardless of the order in which they are found.
Use the Word Match Count comparison when matching multi-word String identifier values (such as people's names) that may have common words, but where the values are not always in a standard order, which might cause other comparisons not to match them. For example, the values "David SMITH" and "Smith, David" would not match using a Character Edit Distance comparison on a name field, but the fact that they have two words in common might mean they are a strong match, especially if the name data is known to contain a maximum of 3 words.
This comparison supports the use of result bands.
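The order-independent counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the function name `common_word_count` is hypothetical, and it assumes words are split on whitespace only, with no punctuation handling and no error tolerance.

```python
def common_word_count(a: str, b: str, ignore_case: bool = True) -> int:
    """Count the distinct whitespace-separated words found in both values."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    # Sets discard word order and duplicates, so only distinct common words count.
    return len(set(a.split()) & set(b.split()))


print(common_word_count("David Sheldon Turner", "TURNER Sheldon David"))
print(common_word_count("Joseph Andrew COLE", "Joseph Andrew Cole", ignore_case=False))
```

The first call counts all three words as common despite the different order; the second shows how a case-sensitive comparison loses the match on "COLE".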
| Option | Type | Purpose | Default Value |
|---|---|---|---|
| Match No Data pairs? | Yes/No | This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier. If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison will return a result of 0 when comparing a No Data value against a No Data value (as the number of matching words will be 0). In that case, a 'no data' result will only be returned if a No Data value is compared against a populated value. | No |
| Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. For example, if case is ignored, the Word Match Count between "Joseph Andrew COLE" and "Joseph Andrew Cole" will be 3. If case is not ignored, it will be 2. | Yes |
| Character error tolerance | Integer | This option specifies a number of character edits that are 'tolerated' when comparing words with each other. All words with a Character Edit Distance of less than or equal to the specified figure will be considered as the same. For example, if set to 1, the Word Match Count between "95 Charnwood Court, Mile End, Parnham, Middlesex" and "95 Charwood Court, Mile End, Parnam, Middlesex" would be 7, as all words match each other considering this tolerance. | 0 |
| Ignore tolerance on numbers? | Yes/No | This option allows the Character error tolerance to be ignored for words that consist entirely of numerics. For example, if set to Yes, and using a Character error tolerance of 1, the Word Match Count between "95 Charnwood Court, Mile End, Parnham, Middlesex" and "96 Charnwood Court, Mile End, Parnam, Middlesex" would be 6, rather than 7, because the numbers 95 and 96 would be considered as different, despite the fact that they only differ by a single character. | Yes |
| Treat tolerance value as percentage? | Yes/No | This allows the character error tolerance to be treated as a percentage of the word length in characters. For example, to tolerate a single character error for every five characters in a word, a value of 20% should be used. This option may be useful to avoid treating short words as the same when they differ by a single character, but to retain the ability to be tolerant of typos in longer words - for example, to consider "Parnham" and "Parnam" as the same, but to treat "Bath" and "Batt" as different. If set to Yes, the Character error tolerance property should be entered as the maximum percentage of the characters in a word that may differ while the words are still considered the same. For example, if set to Yes, a Character error tolerance of 20% will mean "Parnam" and "Parnham" will be considered as the same, as they have an edit distance of 1, and a longer word length of 7 characters - meaning a character error percentage of 14%, which is below the 20% threshold. The values "Bath" and "Batt", however, will not be considered as the same, as they have a character error percentage of 25% (1 error in 4 characters). If set to No, the Character error tolerance property will be treated as a character edit tolerance between words. | No |
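The difference between an absolute and a percentage tolerance comes down to simple arithmetic, which the following Python sketch works through. The function name `within_tolerance` and its parameters are hypothetical, the edit distance is taken as an input rather than computed, and treating a percentage exactly equal to the threshold as a match is an assumption carried over from the "less than or equal to" rule stated for the absolute case.

```python
def within_tolerance(edit_distance: int, len_a: int, len_b: int,
                     tolerance: float, as_percentage: bool) -> bool:
    """Decide whether two words count as the same under the tolerance setting."""
    if as_percentage:
        # Error percentage relative to the longer word, e.g. 1 / 7 is about 14%.
        error_pct = 100 * edit_distance / max(len_a, len_b)
        return error_pct <= tolerance  # assumption: equality still matches
    # Otherwise the tolerance is an absolute number of character edits.
    return edit_distance <= tolerance


# "Parnam" (6 characters) vs "Parnham" (7), one edit, 20% tolerance
print(within_tolerance(1, 6, 7, 20, as_percentage=True))   # prints True
# "Bath" vs "Batt" (4 characters each), one edit, 20% tolerance
print(within_tolerance(1, 4, 4, 20, as_percentage=True))   # prints False
```

With an absolute tolerance of 1 both pairs would match, which is why the percentage setting is useful for protecting short words.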
Example configuration
In this example, the Word Match Count comparison is used to match people's names. The following options are specified:
Match No Data pairs? = No
Ignore case? = Yes
Character error tolerance = 2
Ignore tolerance on numbers? = No
Treat tolerance value as percentage? = No
Example results
With the above configuration, the following table illustrates some comparison results:
| Value A | Value B | Comparison result (Word Match Count) |
|---|---|---|
| David Sheldon Turner | TURNER David Shelldon | 3 |
| David Sheldon Turner | TURNER Sheldon David | 3 |
| David Turner | David Turner | 2 |
| David Turner | Dave Turner | 2 |
| Mr David Sheldon Turner | David Turner | 2 |
| Alexander Graham Bell | Alexander BELL | 2 |
| Mrs Susan Chung | Mrs Susane Chung | 3 |
| Susan Smith | Suzanne Smith | 1 |
| Susan Simpson | Susan Musslewhite | 1 |
| Alexander Wallace | Alex Walace | 1 |
| Alexander Wallace | Alex Wace | 0 |
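The results above can be reproduced with a short Python sketch. It is a rough model under stated assumptions rather than the product's actual algorithm: the names `edit_distance` and `word_match_count` are hypothetical, words are split on whitespace only, each word in Value A is paired greedily with the first unused word in Value B that falls within the tolerance, and a word pair is treated as numeric only when both words consist entirely of digits. The percentage option is left out for brevity.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def word_match_count(a: str, b: str, tolerance: int = 0,
                     ignore_case: bool = True,
                     ignore_tolerance_on_numbers: bool = True) -> int:
    """Count distinct words in a that pair with a distinct word in b."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    words_a = list(dict.fromkeys(a.split()))    # distinct words, order kept
    remaining = list(dict.fromkeys(b.split()))
    count = 0
    for wa in words_a:
        for wb in remaining:
            numeric = wa.isdigit() and wb.isdigit()
            allowed = 0 if (numeric and ignore_tolerance_on_numbers) else tolerance
            if edit_distance(wa, wb) <= allowed:
                remaining.remove(wb)    # each word in b is used at most once
                count += 1
                break
    return count


# Example configuration: tolerance 2, ignore case, tolerance applies to numbers.
pairs = [("David Sheldon Turner", "TURNER David Shelldon"),
         ("David Turner", "Dave Turner"),
         ("Susan Smith", "Suzanne Smith"),
         ("Alexander Wallace", "Alex Wace")]
for value_a, value_b in pairs:
    print(value_a, "|", value_b, "|",
          word_match_count(value_a, value_b, tolerance=2,
                           ignore_tolerance_on_numbers=False))
```

Greedy pairing is the simplest choice but can undercount when one word is within tolerance of several candidates; a stricter model would try exact matches first or solve an assignment problem.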
Oracle® Enterprise Data Quality Help version 9.0
Copyright © 2006, 2011 Oracle and/or its affiliates. All rights reserved.