COMMENT: Information Matters
Zen and the Art of Information
By George Demarest
What is data quality?
There I stood, seemingly paralyzed by the persistence of The Question. All features of the room receded into the distance, the ambient sounds first amplifying to a deafening din and then reverberating into a diffuse whisper, and then silence. Again and again, I was tortured by my seeming inability to answer the basic question "What is data quality?" Then I had a cup of coffee, and I felt much better.
Happily, the pursuit of superior data quality, and thus useful business information, is a lot less torturous than Robert Pirsig's journey in Zen and the Art of Motorcycle Maintenance. But we can reach a similar Zen-inspired conclusion: Data quality is more a journey than a destination. Why is that? It is because getting useful information greatly depends on human time and effort. In time, data (and thus information) will change.
For example, consider the value of your stock portfolio. It can change in a moment, due to a profit warning here, a lawsuit there, or a new invention on the horizon. So it goes with all sorts of information in the enterprise. In a matter of seconds, a happy customer can turn into your worst nightmare or a waffling potential customer can put you over quota thanks to an extra morning doughnut.
This article, the first in a two-part series about data quality, considers the implications of data quality challenges for both business intelligence and other areas. In the next issue, part 2 discusses the remarkable rise of one of the most practical and influential examples of data quality, data hubs.
The Information Equation
Before proceeding, we need to make a distinction between data and information. These terms are so commonly used interchangeably that they are both weakened for the purposes of our discussion. To bring some clarity to this situation, I defer to Ali El Kortobi and Paul Narth, of the Oracle Warehouse Builder development team. Ali and Paul think about information quality a lot and don't require coffee to get them going about it.
The first thing Ali wrote on the whiteboard when I interviewed him for this piece was
Data ≠ Information
OK, fair enough. Ali went on to make a distinction between data and information that was so elegant and clear that I've named a theorem after its inventors. Here is the El Kortobi-Narth theorem:
Information = quality (data + metadata)
In plain English, this states that information (that is, human-readable, no-batteries-required information) equals data (which our computers love to collect and trade) plus metadata (which is data about data, or context), with quality applied. For you mathematicians out there, don't try to go home and prove this, but rather accept it at face value. In simple terms, the clear distinction between data and information is the application of context and quality.
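One way to make the equation concrete is the small Python sketch below. It is only an illustration under stated assumptions: the `Metadata` fields, the range-based `quality` check, and the portfolio example are all hypothetical, and none of them comes from Oracle Warehouse Builder or from Ali and Paul's own formulation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metadata:
    """Context for a raw value (hypothetical fields, for illustration only)."""
    name: str       # what the value represents
    unit: str       # how to read it
    minimum: float  # smallest plausible value
    maximum: float  # largest plausible value


def quality(value: float, meta: Metadata) -> bool:
    """A stand-in quality check: the value must fall in the plausible range."""
    return meta.minimum <= value <= meta.maximum


def to_information(value: float, meta: Metadata) -> Optional[str]:
    """Information = quality(data + metadata).

    Returns a readable statement, or None when the value fails the check.
    """
    if not quality(value, meta):
        return None
    return f"{meta.name}: {value} {meta.unit}"


portfolio = Metadata(name="Portfolio value", unit="USD", minimum=0.0, maximum=1e9)
print(to_information(125000.0, portfolio))  # Portfolio value: 125000.0 USD
print(to_information(-42.0, portfolio))     # None
```

The bare number 125000.0 is data; it becomes information only once the metadata says what it measures and the quality check says it can be trusted.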
Data collection is easy and has become steadily easier and cheaper. With seemingly every Web site, application, and home appliance just dying to tell us how and what they're doing, our friends at EMC, NetApp, and Seagate will sleep soundly tonight. Creating metadata is also a well-understood process that can be devised when and where the data is collected.
The creation and maintenance of metadata will greatly affect the information theorem. Thus, it is the quality component of the equation that is the hard (and costly) part. According to a study by the Data Warehousing Institute, U.S. businesses lose more than US$600 billion each year due to data quality problems.
Quality Is Not a Place
Data quality has long been associated with data warehousing, the idea being that you can establish the elusive "single source of truth" by extracting data from many source systems and then cleansing (or transforming) the data and loading it over there in that system. It's nice to be able to sit next to a machine and say, "I've got good data quality here!"
But this notion no longer tells the complete data quality story. Data quality can't be considered a many-to-one proposition anymore. IT professionals increasingly act on the premise that numerous core systems, not just data warehouses, require high-quality data in order to create valuable information. Thus, this many-to-many approach to data quality requires the tools and techniques to evolve so that the aggregate level of data quality is higher. To stick with the motorcycle maintenance metaphor, the machine reaches peak performance when all of the individual components (such as the clutch, pistons, gears, and steering) are performing at their best.
Oracle has been working on data quality challenges for years. Oracle Warehouse Builder is one of the more visible products in this area, a product that Gartner recently moved into its coveted leadership quadrant for extraction, transformation, and loading (ETL) tools. This recognition is partly a result of a market catching up with a technology. As Gartner puts it, "Market demand is taking extraction, transformation, and loading tools into new application areas" as these types of tools are applied "beyond the domain of business intelligence."¹
This idea is not lost on the Oracle Warehouse Builder development team. The main reason Oracle Warehouse Builder is gaining accolades and new users is our increasing focus on the T in ETL. Extracting (and by extension, loading) data to and from different computer systems is a mature, well-understood process.
The Oracle database, the application server, and various tools have been refined so that data is portable and mobile. Moving data and application instances, tables, tablespaces, files, and even entire systems and data centers has been highly automated and simplified in more recent generations of our products.
But it is the "T" that looms large. Transformation: a longish word that represents the gargantuan problem of reconciling how myriad applications and data sources can agree. It is this huge collection of data streams, application interfaces, and target systems for this data that has multiplied and evolved almost beyond reckoning.
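To see why the T is the hard part, consider the hedged Python sketch below. It assumes two made-up source systems (a CRM and a billing application) that describe the same customer in different shapes; the field names and formats are hypothetical and stand in for whatever your own systems use. This is not how Oracle Warehouse Builder expresses mappings, only a picture of the problem.

```python
from datetime import date, datetime


def from_crm(record: dict) -> dict:
    """Hypothetical CRM source: one name field, US-style dates."""
    return {
        "customer": record["full_name"].strip().title(),
        "signed_up": datetime.strptime(record["signup"], "%m/%d/%Y").date(),
    }


def from_billing(record: dict) -> dict:
    """Hypothetical billing source: split name fields, ISO dates."""
    name = f"{record['first']} {record['last']}"
    return {
        "customer": name.strip().title(),
        "signed_up": date.fromisoformat(record["created"]),
    }


crm_row = {"full_name": "  ada LOVELACE ", "signup": "03/15/2004"}
billing_row = {"first": "Ada", "last": "lovelace", "created": "2004-03-15"}

# After transformation, both systems agree on one representation.
assert from_crm(crm_row) == from_billing(billing_row)
print(from_crm(crm_row))
```

Each mapping function encodes a human decision about what the source fields mean, which is why the number of such decisions grows with every new system added.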
So the notion that a data warehouse, data mart, operational data store, or what-have-you is the ultimate recipient of good-quality data and the ultimate repository of a single source of truth is not as satisfying as it once was. In fact, the name Warehouse Builder no longer really captures what this product delivers and what it is being used for. Maybe the name should be changed.
Quality on the Grid
Anyone who attended last year's Oracle OpenWorld conference enjoyed a detailed view of Oracle's grid computing strategy. We focused on new-application design, integration, and middleware (after having given grid infrastructure a thorough treatment the previous year).
From top to bottom, grid computing is all about achieving better information and better overall quality: quality of service, data quality, maintenance of quality in rapid application development, and hopefully better quality of life for the people who maintain IT systems. In infrastructure terms, Oracle's grid mantra of consolidation, standardization, and automation is about reducing complexity, lowering the number of variables, and creating an environment in which better data quality is easier to attain.
But easier does not mean easy. And here we get to my main point on data quality: Everywhere people and technology meet and everywhere two or more technologies meet, you have a data quality opportunity. A choice of opportunities, in fact: you have an opportunity to apply a data quality discipline and an opportunity to pollute your data streams.
When I say, "Everywhere people and technology meet," I am referring to the user interface and the act of data entry. All we can really do here is develop better front ends, add better error-checking mechanisms, and train our people as best we can. But when two or more technologies meet, I am talking about integration.
In a very frank article in the November 2003 issue of Business Integration Journal, author and technology architect Russell Levine talks about "the myth of disappearing interfaces." He states, "The dirty little secret of integration is that no technology is ever going to resolve the semantic data mapping issue." Levine goes on to say, "Data mapping requires intimate knowledge of the data and how it's used. This can be accomplished only with the network of knowledge and analytic processing power possessed by higher-order, carbon-based life forms. It takes time and effort."
I guess we'll have to wait a bit longer for completely automated application integration tools, flying cars, and hotels on the moon. For the time being, the best we can hope for is a technology infrastructure that doesn't get you out of bed at 2:00 in the morning and a good set of tools that gets the most out of that brain of yours.
On the Road Again
Data quality has evolved and must continue to evolve into a ubiquitous operation, a de rigueur discipline, a Zen-like practice. It should be as much a part of your IT operations as your strategy for system backup and recovery or your hardware maintenance schedule. And although achieving high levels of data quality can at times seem like what Pirsig would call "a continually receding horizon where perfection is impossible," take heart. Knowing that you've got a smoothly running engine and dry pavement in front of you and that you are heading in the right direction, you might as well enjoy the scenery. It's bound to improve.
[All due acknowledgment to Robert Pirsig for my somewhat haphazard references to his grand work. If you haven't read it, put it on your reading list. It's a classic.]
George Demarest (george.demarest@oracle.com) is a senior director of product marketing at Oracle.