Phrase Profiler

The Phrase Profiler analyzes a number of attributes and searches for common words and phrases.

The words and phrases found are returned in order of their frequency across all the input attributes.

Use

The Phrase Profiler is a quick way of discovering the most frequent and significant words and phrases in the data, and where they occur. You can then use the results of phrase profiling to drive the configuration of the Parse processor. For example, you can add the words and phrases that were found to Reference Data lists used to classify data, and, by seeing which words and phrases occur in which attributes, work out which token checks to apply to which attributes.

The Phrase Profiler is therefore an important tool to use when understanding the content of text fields, especially when you may need to improve or otherwise change the structure of the data (for example, for a data migration).

Configuration

Inputs

Options

A large dataset containing free text will typically contain a large number of distinct phrases with only a few of them being significant in understanding the content of the dataset.

The Phrase Profiler provides two main settings to help eliminate insignificant results: the Cutoff frequency and the Allowable variation.

Cutoff frequency

Typically, the Phrase Profiler will generate a relatively small collection of phrases that occur in a large number of records and are potentially significant, together with a very large number of phrases that occur in a small number of records and so are less significant. You may want to exclude the less frequent phrases from the results. As the absolute cutoff frequency varies depending on the size of the dataset, it is convenient to express the Cutoff frequency setting as a frequency per million input records.
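To illustrate why a per-million setting is convenient, the following Python sketch converts a Cutoff frequency in parts per million into an absolute number of occurrences for datasets of different sizes. The function name `cutoff_count` is hypothetical and is not part of the product; it only shows the arithmetic.

```python
def cutoff_count(cutoff_ppm: float, total_records: int) -> float:
    """Convert a parts-per-million cutoff into an absolute number of occurrences."""
    return cutoff_ppm * total_records / 1_000_000


# The same 100 ppm setting scales with the size of the dataset.
for records in (50_000, 1_000_000, 20_000_000):
    print(records, cutoff_count(100, records))
```

With a setting of 100 parts per million, the threshold is 5 occurrences in 50,000 records but 2,000 occurrences in 20 million records, so the same setting stays meaningful as the dataset grows.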

Allowable variation

Where a phrase consists of many words (or a substring consists of many characters), longer phrases will include shorter phrases, so that data that includes the phrase ‘Newcastle Upon Tyne’ will also include at least the same number of sub-phrases ‘Newcastle Upon’ and ‘Upon Tyne’.

If the two sub-phrases occur with exactly the same frequency as the full phrase and there is no variation in their frequencies, then the full phrase is significant (a 'top-level phrase') and the sub-phrases are not. The sub-phrases are therefore excluded from the results.

If the sub-phrases occur more frequently than the full phrase, however, then they become more interesting, and the variation in frequency between a phrase and a sub-phrase is a measure of the independent significance of the sub-phrase. You can therefore specify an Allowable variation to remove sub-phrases whose variation in frequency does not exceed this value. Again, as the absolute variation varies depending on the size of the dataset, it is convenient to express the Allowable variation setting as a variation per million input records.
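The relationship between a phrase and its sub-phrases can be sketched in Python by counting every run of consecutive words in each record. This is only an illustration of the idea, not the product's implementation: the function name `count_phrases`, the splitting on whitespace, and the sample records are all assumptions.

```python
from collections import Counter


def count_phrases(records, max_words=10):
    """Count every run of 1..max_words consecutive words in each record."""
    counts = Counter()
    for text in records:
        words = text.split()
        for size in range(1, max_words + 1):
            for start in range(len(words) - size + 1):
                counts[" ".join(words[start:start + size])] += 1
    return counts


records = [
    "Newcastle Upon Tyne",
    "Newcastle Upon Tyne",
    "Newcastel Upon Tyne",  # misspelled variant
]
counts = count_phrases(records, max_words=3)
full = counts["Newcastle Upon Tyne"]

# Variation = how much more often the sub-phrase occurs than the full phrase.
print(counts["Newcastle Upon"] - full)  # 0: never occurs outside the full phrase
print(counts["Upon Tyne"] - full)       # 1: also occurs in the misspelled record
```

Here 'Newcastle Upon' has no independent significance, while 'Upon Tyne' occurs once outside the full phrase, which is the variation that the Allowable variation setting is compared against.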

Example

For example, consider the following:

1 million records are analyzed by the Phrase Profiler
The Cutoff frequency is set to 100 parts per million
The Allowable variation is set to 50 parts per million
There are 400 occurrences of the phrase ‘Newcastle Upon Tyne’
There are 50 occurrences of the phrase ‘Newcastel Upon Tyne’

The phrase ‘Newcastle Upon Tyne’ appears in the results, but ‘Newcastel Upon Tyne’ does not, because its 50 occurrences fall below the cutoff of 100. The sub-phrase 'Upon Tyne' has a frequency of 450 (400 + 50) and so is unaffected by the cutoff, but it does not appear in the results because its frequency variation of 50 from its containing phrase ‘Newcastle Upon Tyne’ is just within the allowable limit. If 'Upon Tyne' appeared in just one more record, anywhere within the data, then it would appear in the results as potentially significant. It is generally appropriate to set the Cutoff frequency and Allowable variation to the same value.
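The decisions in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the boundary behavior is an assumption inferred from the example: a phrase is dropped when it occurs less often than the cutoff, and a sub-phrase is dropped when its variation does not exceed the allowable limit.

```python
TOTAL_RECORDS = 1_000_000
CUTOFF_PPM = 100
ALLOWABLE_VARIATION_PPM = 50


def ppm_to_count(ppm, total_records):
    """Convert a parts-per-million setting into an absolute count."""
    return ppm * total_records / 1_000_000


def passes_cutoff(frequency):
    # Assumption: phrases occurring less often than the cutoff are dropped.
    return frequency >= ppm_to_count(CUTOFF_PPM, TOTAL_RECORDS)


def sub_phrase_is_kept(sub_frequency, containing_frequency):
    # Assumption (from the example): a variation equal to the limit is
    # still "within" it, so only a strictly larger variation is kept.
    variation = sub_frequency - containing_frequency
    return variation > ppm_to_count(ALLOWABLE_VARIATION_PPM, TOTAL_RECORDS)


print(passes_cutoff(400))                                   # 'Newcastle Upon Tyne': True
print(passes_cutoff(50))                                    # 'Newcastel Upon Tyne': False
print(passes_cutoff(450) and sub_phrase_is_kept(450, 400))  # 'Upon Tyne': False
print(passes_cutoff(451) and sub_phrase_is_kept(451, 400))  # one more record: True
```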

Marking top-level phrases

Sometimes it is useful to know whether a phrase is a sub-phrase of something else or a 'top-level phrase'. In the above example, ‘Newcastle Upon Tyne’ may be a top-level phrase, in which case it presumably represents a city. However, if there were just one occurrence of the phrase ‘Newcastle Upon Tyne Borough Council’, and this occurrence were included in the results (not excluded by either the Cutoff frequency or Allowable variation options), then ‘Newcastle Upon Tyne’ would no longer be a top-level phrase and so may sometimes represent something other than a city. The Phrase Profiler flags top-level phrases in the results.
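One way to picture the top-level flag is as a check of whether any longer phrase in the results contains the phrase in question. The Python sketch below assumes exactly that definition; the function names `contains_phrase` and `flag_top_level` are hypothetical, and the product's actual logic may differ.

```python
def contains_phrase(longer, shorter):
    """True if the words of `shorter` appear as a consecutive run inside a strictly longer phrase."""
    long_words, short_words = longer.split(), shorter.split()
    n = len(short_words)
    return len(long_words) > n and any(
        long_words[i:i + n] == short_words
        for i in range(len(long_words) - n + 1)
    )


def flag_top_level(result_phrases):
    """Map each phrase to True when no other phrase in the results contains it."""
    return {
        phrase: not any(contains_phrase(other, phrase) for other in result_phrases)
        for phrase in result_phrases
    }


print(flag_top_level(["Newcastle Upon Tyne", "Upon Tyne"]))
print(flag_top_level(["Newcastle Upon Tyne Borough Council", "Newcastle Upon Tyne"]))
```

In the first call 'Newcastle Upon Tyne' is flagged as top-level; in the second it is not, because the longer 'Borough Council' phrase is present in the results.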

Outputs

Data attributes

Flags

Execution

Results Browsing

The Phrase Profiler produces a summary view of its results, showing the words and phrases that were found in the input attributes in order of their frequency of occurrence.

Output Filters

Example

In this example, Customer Name and Address data is analyzed with a view to parsing it to resolve any structural issues.

The Phrase Profiler is run in order to find the most common words and phrases in the name and address attributes. Note that in this case, the options were configured as follows:

From the above information, we can quickly see that the words 'Mr', 'Ms', 'Mrs' and 'Miss' are frequently occurring, and valid, Titles, so we might create a Reference Data list for classifying them in parsing:

We can then sort the results by the Title attribute to find further values that occur here:

Looking further down the list of phrases and words, we can quickly find phrases and words whose meaning in the data is ambiguous and depends on context. For example:

In the above list, we can see that 'VICTORIA' and 'EDWARD' do not only occur in the NAME attribute, but also in the ADDRESS1 attribute. Drilling down on one of them reveals why:

So, when parsing the data, we may wish to classify 'VICTORIA' as a Valid Forename when it appears in the NAME attribute. When it appears in the ADDRESS1 attribute, we might not classify the word on its own at all, but instead classify 'Victoria Centre' as a Valid Building, in three quick steps, as follows:

1. Right-click on the data containing 'Victoria Centre', and select Add to Reference Data...:

3. Edit the list entry in the Reference Data Editor to the required value ('Victoria Centre'):

Once the most significant words and phrases have been added to the required classification lists, we might begin parsing the data, knowing that we can come back to the Phrase Profiler's results at any time.

Option	Type	Purpose	Default Value
Cutoff frequency (parts per million)	Number	Allows you not to return words or phrases that only occur a small number of times in the data set, expressed in parts per million to represent a small percentage of the records analyzed. For example, you can exclude values that occur less frequently than 100 times in each million records (that is, in fewer than 0.01% of records). (See note below)	5000 parts per million
Allowable variation (parts per million)	Number	Allows you to cut off further insignificant phrases (that are contained within others), and mark top-level phrases as more significant, by expressing the allowable variation in frequency between two phrases that contain each other. (See note below)	5000 parts per million
Maximum words in a phrase	Number	Sets a maximum length of phrases to return, in number of words.	10 (Note: The maximum value for this option is 20, for performance reasons.)
Additional word delimiter	Selection of common delimiter characters	Allows the definition of an additional separator character (as well as the normal space character) that will be used to separate words and phrases.	None
Word delimiter regular expression	Regular expression	Allows the definition of a regular expression to be used to separate words and phrases.	None
Ignore case?	Yes/No	Sets whether or not to distinguish between words or phrases that are the same except for case differences. (See note below)	No
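
The effect of the delimiter and case options on how text is split into words can be sketched in Python. This is an illustration only: the function `split_words` and its parameters are hypothetical, and the way the options combine here (a regular expression taking precedence over the additional delimiter, and case folding applied when case is ignored) is an assumption, not documented product behavior.

```python
import re


def split_words(text, additional_delimiter=None, delimiter_regex=None, ignore_case=False):
    """Split text into words using a space plus optional extra delimiters."""
    if delimiter_regex is not None:
        # Assumption: a regular expression, when given, defines the separators.
        pattern = delimiter_regex
    elif additional_delimiter is not None:
        # Space plus one extra separator character.
        pattern = "[ " + re.escape(additional_delimiter) + "]+"
    else:
        pattern = " +"
    if ignore_case:
        # Assumption: folding to one case makes 'Victoria' and 'VICTORIA' match.
        text = text.upper()
    return [word for word in re.split(pattern, text) if word]


print(split_words("Flat 2,Victoria Centre", additional_delimiter=","))
print(split_words("Victoria VICTORIA", ignore_case=True))
```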

Execution Mode	Supported
Batch	Yes
Real time Monitoring	Yes
Real time Response	No

Statistic	Meaning
Size	The size of the phrase, in number of words.
Top Phrase	Indicates whether or not the phrase is a top-level phrase. See the note above explaining the Allowable variation setting.
Phrase	The word or phrase that was found in the data.
Frequency	The number of occurrences of the phrase or word. Note that when drilling down to the data, you may see fewer records than this frequency, because the same phrase or word may occur more than once in some records.
[Attribute].freq	The number of occurrences of the phrase or word within each input attribute.
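
The difference between the Frequency statistic and the number of records seen on drill-down can be illustrated with a small Python sketch. The function `count_word`, the whitespace splitting, and the sample records are assumptions made for illustration; they are not taken from the product.

```python
from collections import Counter


def count_word(records, word):
    """Return (total frequency, per-attribute frequency, number of matching records)."""
    per_attribute = Counter()
    matching_records = 0
    for record in records:
        found = False
        for attribute, value in record.items():
            hits = value.split().count(word)
            per_attribute[attribute] += hits
            found = found or hits > 0
        if found:
            matching_records += 1
    return sum(per_attribute.values()), dict(per_attribute), matching_records


records = [
    {"NAME": "VICTORIA SMITH", "ADDRESS1": "VICTORIA CENTRE"},
    {"NAME": "EDWARD JONES", "ADDRESS1": "12 VICTORIA ROAD"},
    {"NAME": "ANNE BROWN", "ADDRESS1": "3 HIGH STREET"},
]
frequency, by_attribute, record_count = count_word(records, "VICTORIA")
print(frequency, by_attribute, record_count)
```

In this sample the word occurs three times (once in NAME, twice in ADDRESS1) but in only two records, because the first record contains it in both attributes.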