Parsing Concept Guide

Why parsing is needed

An important aspect of whether data is fit for purpose is the structure in which it is held. Often, that structure is not suited to the way the data needs to be used. For example:

Alternatively, the structure of the data may be sound, but its use may be insufficiently controlled or subject to error. For example:

These issues all lead to poor data quality, which may in many cases be costly to the business. It is therefore important for businesses to be able to analyze data for these problems, and to resolve them where necessary.

The OEDQ Parser

The OEDQ Parse processor is designed to be used by developers of data quality processes to create packaged parsers for the understanding and transformation of specific types of data - for example Names data, Address data, or Product Descriptions. However, it is a generic parser that has no default rules that are specific to any type of data. Data-specific rules can be created by analyzing the data itself, and setting the Parse configuration.

Terminology

Parsing is a frequently used term both in the realm of data quality, and in computing in general. It can mean anything from simply 'breaking up data' to full Natural Language Parsing (NLP), which uses sophisticated artificial intelligence to allow computers to 'understand' human language. A number of other terms are also frequently used in relation to parsing, and these too can have slightly different meanings in different contexts. It is therefore important to define what parsing, and its associated terms, mean in OEDQ.

Please note the following terms and definitions:

Term

Definition

Parsing

In OEDQ, Parsing is defined as the application of user-specified business rules and artificial intelligence in order to understand and validate any type of data en masse, and, if required, improve its structure in order to make it fit for purpose.

Token

A token is a piece of data that is recognized as a unit by the Parse processor using rules. A given data value may consist of one or many tokens.

A token may be recognized using either syntactic or semantic analysis of the data.

Tokenization

The initial syntactic analysis of data, in order to split it into its smallest units (base tokens) using rules. Each base token is given a tag. For example, the tag <A> represents an unbroken sequence of alphabetic characters.

Base Token

An initial token, as recognized by Tokenization. A sequence of Base Tokens may later be combined to form a new Token, in Classification or Reclassification.

Classification

Semantic analysis of data, in order to assign meaning to base tokens, or sequences of base tokens. Each classification has a tag, such as 'Building', and a classification level (Valid or Possible) that is used when selecting the best understanding of ambiguous data.

Token Check

A set of classification rules that is applied against an attribute in order to check for a specific type of token.

Reclassification

An optional additional classification step which allows sequences of classified tokens and unclassified (base) tokens to be reclassified as a single new token.

Token Pattern

An explanation of a string of data using a pattern of token tags, either in a single attribute, or across a number of attributes.

A string of data may be represented using a number of different token patterns.

Selection

Where a record has many possible explanations (or token patterns), the process by which the Parse processor attempts to select the 'best' explanation of the data using a tuneable algorithm.

Resolution

The categorization of records that share a given selected explanation (token pattern), by assigning a Result (Pass, Review or Fail) and an optional Comment. Resolution may also resolve records into a new output structure using rules based on the selected token pattern.
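To make the relationship between these terms more concrete, the following is a minimal Python sketch of how Tokenization, Classification, Selection and Resolution might fit together. It is an illustration only and is not the OEDQ implementation or its API. Apart from the <A> tag and the 'Building' classification, which appear in the definitions above, every name in it is an assumption made for the example: the <N>, <_> and <P> base tags, the 'Thoroughfare' tag, the 'Base' level for unclassified tokens, the reference data, the scoring used for Selection, and the resolution rule. Reclassification is omitted for brevity.

```python
import re
from itertools import product

# Tokenization rules. <A> (alphabetic run) comes from the guide; the
# <N>, <_> and <P> tags are assumptions made for this sketch.
BASE_RULES = [
    ("A", re.compile(r"[A-Za-z]+")),
    ("N", re.compile(r"[0-9]+")),
    ("_", re.compile(r"\s+")),
]

# Hypothetical token checks: lower-cased value -> candidate (tag, level) pairs.
TOKEN_CHECKS = {
    "house": [("Building", "Valid")],
    "court": [("Building", "Possible"), ("Thoroughfare", "Valid")],
    "road": [("Thoroughfare", "Valid")],
}

# Assumed scoring for Selection; "Base" marks an unclassified base token.
LEVEL_SCORE = {"Valid": 2, "Possible": 1, "Base": 0}

# Hypothetical resolution rules: token pattern -> (Result, Comment).
RESOLUTION_RULES = {
    "<N> <A> <Thoroughfare>": ("Pass", "number, name, thoroughfare"),
}


def tokenize(value):
    """Split a value into (tag, text) base tokens using syntactic rules."""
    tokens, pos = [], 0
    while pos < len(value):
        for tag, rule in BASE_RULES:
            match = rule.match(value, pos)
            if match:
                tokens.append((tag, match.group()))
                pos = match.end()
                break
        else:
            # Any other single character is tagged <P> (assumed tag).
            tokens.append(("P", value[pos]))
            pos += 1
    return tokens


def classify(tokens):
    """Give each non-whitespace base token its candidate classifications."""
    candidates = []
    for tag, text in tokens:
        if tag == "_":
            continue
        candidates.append(TOKEN_CHECKS.get(text.lower(), [(tag, "Base")]))
    return candidates


def token_patterns(candidates):
    """Enumerate every possible explanation (token pattern) of the data."""
    return [list(combo) for combo in product(*candidates)]


def select(patterns):
    """Pick the highest-scoring pattern (a toy stand-in for Selection)."""
    return max(patterns, key=lambda p: sum(LEVEL_SCORE[lvl] for _, lvl in p))


def resolve(pattern):
    """Assign a Result and Comment based on the selected token pattern."""
    key = " ".join(f"<{tag}>" for tag, _ in pattern)
    result, comment = RESOLUTION_RULES.get(
        key, ("Review", "no resolution rule for this pattern")
    )
    return key, result, comment


def parse(value):
    """Run the four steps in order on a single value."""
    return resolve(select(token_patterns(classify(tokenize(value)))))


print(parse("12 Rose Court"))
```

With this sample data, "12 Rose Court" has two possible token patterns, because "Court" is a Possible 'Building' and a Valid 'Thoroughfare'. The sketch selects the second pattern and prints ('<N> <A> <Thoroughfare>', 'Pass', 'number, name, thoroughfare'). Summing level scores is only one simple way to rank explanations, and it should not be read as a description of the processor's own tuneable algorithm.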

Summary of the OEDQ Parse processor

The following diagram shows a summary of the way the OEDQ Parse processor works:

See the help pages for the OEDQ Parse processor for full instructions on how to configure it.

Oracle ® Enterprise Data Quality Help version 9.0
Copyright © 2006,2011 Oracle and/or its affiliates. All rights reserved.