Comparison: Longest Common Substring

The Longest Common Substring comparison compares two String values and assesses whether they might match by finding the length of the longest sequence of characters (substring) that is common to both values, whether that substring makes up the whole or only a part of each String value.
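To make the idea concrete, here is a minimal Python sketch of a standard dynamic-programming approach to finding a longest common substring. It is an illustration only: the function name is hypothetical, and it is not a description of how the product implements the comparison internally.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest run of consecutive characters found in both a and b."""
    best_len = 0
    best_end = 0  # index in `a` just past the end of the best run found so far
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous character of each value
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring("Jill Lewis", "Bill Lewis"))  # prints "ill Lewis"
```

The comparison result corresponds to the length of the returned substring (9 in this case, because the space is counted when whitespace has not been removed).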

Use

Use the Longest Common Substring comparison to find matches between String values where there may be 'noise' at either the beginning or the end of a String that is difficult to remove by stripping words before the comparison. It is also useful where you know that String values sharing a common sequence of characters over a certain length are likely to be related. For example, "Nomura Securities Co., Ltd." and "Nomura Investor Relations Co., Ltd." share the 6-character sequence "Nomura".

The Longest Common Substring comparison is often used in match rules that are low down in the decision table, in order to find and review possible matches that are similar but have failed to match using other rules, perhaps because of ordering issues or excess 'noise'.

This comparison supports the use of result bands.

Options

| Option | Type | Purpose | Default Value |
|---|---|---|---|
| Match No Data pairs? | Yes/No | This option determines the result of a comparison when it compares two No Data (Null, or containing only whitespace characters) values for an identifier. If set to No, the comparison will give a 'no data' result when comparing a No Data value against another No Data value. If set to Yes, the comparison will give a result of 0 when comparing a No Data value against another No Data value. A 'no data' result will only be returned if a No Data value is compared against a populated value. | No |
| Ignore case? | Yes/No | Sets whether or not to ignore case when comparing values. | Yes |
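The effect of the two options can be sketched in Python as follows. This is a hedged illustration, not the product's API: the function and parameter names are hypothetical, `None` stands in for a 'no data' result, and returning 'no data' for a No Data value compared against a populated value under both settings is an assumption read from the option descriptions. The substring length is obtained with the standard library's `difflib.SequenceMatcher.find_longest_match`.

```python
from difflib import SequenceMatcher
from typing import Optional


def is_no_data(value: Optional[str]) -> bool:
    """A value is No Data if it is Null (None) or contains only whitespace."""
    return value is None or value.strip() == ""


def lcs_comparison(
    a: Optional[str],
    b: Optional[str],
    match_no_data_pairs: bool = False,  # mirrors "Match No Data pairs?" (default No)
    ignore_case: bool = True,           # mirrors "Ignore case?" (default Yes)
) -> Optional[int]:
    """Return the longest common substring length, or None for a 'no data' result."""
    a_empty, b_empty = is_no_data(a), is_no_data(b)
    if a_empty and b_empty:
        # Two No Data values: 0 if the option is Yes, otherwise 'no data'
        return 0 if match_no_data_pairs else None
    if a_empty or b_empty:
        # Assumption: No Data against a populated value always gives 'no data'
        return None
    if ignore_case:
        a, b = a.lower(), b.lower()
    # autojunk=False disables difflib's heuristic that can skip frequent characters
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size
```

For example, `lcs_comparison("JILL", "jill")` gives 4 with the default options, while passing `ignore_case=False` gives 0 because no characters match exactly.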

Example

Example configuration

In this example, the Longest Common Substring comparison is used to identify possible matches in customer names. The following options are specified:

Match No Data pairs? = No

Ignore case? = Yes

A Trim Whitespace transformation is used to remove all whitespace from values before they are compared.

Example results

With the above configuration, the following table illustrates some comparison results:

| Value A | Value B | Comparison result (Longest Common Substring) |
|---|---|---|
| Jill Lewis | Jill Lewis-Thompson | 9 |
| Jill Lewis | Bill Lewis | 8 |
| Jill Lewis | Jill Lonerghan | 5 |
| Michael Davis **DO NOT CALL** | Michael Davis | 12 |
| Tom Featherstone ----DECEASED---- | Thomas David Featherstone | 12 |
| Tom Featherstone | John Feathers | 8 |
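The results above can be reproduced with a short Python sketch. It assumes that removing all whitespace and lowercasing each value is a fair stand-in for the Trim Whitespace transformation and the Ignore case? = Yes setting; the helper names `prepare` and `lcs_length` are hypothetical and not part of the product.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters common to a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                # Extend the common run ending at the previous characters
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def prepare(value: str) -> str:
    """Remove all whitespace and ignore case, as in the example configuration."""
    return "".join(value.split()).lower()


PAIRS = [
    ("Jill Lewis", "Jill Lewis-Thompson"),
    ("Jill Lewis", "Bill Lewis"),
    ("Jill Lewis", "Jill Lonerghan"),
    ("Michael Davis **DO NOT CALL**", "Michael Davis"),
    ("Tom Featherstone ----DECEASED----", "Thomas David Featherstone"),
    ("Tom Featherstone", "John Feathers"),
]

results = [lcs_length(prepare(a), prepare(b)) for a, b in PAIRS]
for (a, b), result in zip(PAIRS, results):
    print(f"{a!r} vs {b!r}: {result}")
```

Because whitespace is removed first, "Jill Lewis" and "Bill Lewis" share "illlewis" (8 characters) rather than the 9 characters they would share with the space retained.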

Oracle® Enterprise Data Quality Help version 9.0
Copyright © 2006, 2011 Oracle and/or its affiliates. All rights reserved.