Service-Oriented Architecture
by Lucas Jellema
An introduction to the basics of data integrity enforcement in a variety of environments
April 2013
Downloads
Oracle Database
Oracle SOA Suite
To be useful, data held and used in information systems has to live up to a number of expectations. The data should be an accurate representation of its source. It should be reliable. The data should have internal consistency. The data should adhere to rules based on the logic of the real world. This accuracy, internal quality, and reliability of data is frequently referred to as data integrity.
Safeguarding the integrity of data is a challenge, one that increases in complexity when multiple users access and manipulate the data simultaneously, which is obviously a common situation. And that challenge reaches new heights when the data is managed in multiple independent data stores rather than a single database.
This article introduces some of the basics of data integrity enforcement. It then discusses how, in a multi-user environment with a single [Oracle] database, that integrity is realized. The article then extends that discussion to an environment with multiple data stores that jointly present unified data services to consumers and that are together responsible for data integrity: not just for the internal, encapsulated integrity within each store, but also for integrity across those stores.
The integrity of data is guarded by specific business rules called data integrity constraints. The rules that apply in the real world to events, actions, manipulations, and situations are translated into constraints that apply to the data handled in information systems. Data integrity means that the data recorded in the system does not violate these constraints.
The constraints are typically classified into a number of categories of increasing complexity and scope. For example:
| Scope | Example |
| --- | --- |
| Attribute | The value of the Salary attribute must be an integer between 1000 and 5000. |
| Tuple (record) | For employees with a Job (attribute value) equal to SALESMAN, the value of the Salary attribute should not be higher than 2500. The sum of the attributes Salary and Commission may not be higher than 10000. |
| Entity (table) | Uniqueness across (a subsection of) the table of records of a certain attribute. The difference between the maximum salary for all employees with the same job and their average salary may not exceed 30% of the average salary. An employee who is the manager of other employees should either have the Job attribute value of MANAGER or should have the same Job attribute value as each of his/her subordinates. |
| Inter-Entity (cross-table) | Referential integrity (foreign key): any value used for the Department attribute must exist in the DepartmentId attribute of a record in the Departments table. |
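To make these categories concrete, here is a minimal sketch in Java that expresses each scope as a simple validation predicate over an in-memory Employee collection. It is purely illustrative and not tied to any particular data store; the attribute names and limits follow the examples in the table.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch: the four constraint scopes from the table expressed as
// plain predicates over an Employee record set. This is not how a database
// engine implements them; it only shows what each scope ranges over.
public class ConstraintScopes {

    record Employee(int id, String name, String job, int salary, int commission,
                    Integer managerId, int departmentId) {}

    // Attribute scope: Salary must be an integer between 1000 and 5000.
    static boolean attributeConstraint(Employee e) {
        return e.salary() >= 1000 && e.salary() <= 5000;
    }

    // Tuple scope: a SALESMAN may not earn more than 2500, and
    // Salary + Commission may not exceed 10000.
    static boolean tupleConstraint(Employee e) {
        if ("SALESMAN".equals(e.job()) && e.salary() > 2500) return false;
        return e.salary() + e.commission() <= 10000;
    }

    // Entity scope: Name must be unique across the whole Employees collection.
    static boolean entityConstraint(List<Employee> employees) {
        return employees.stream().map(Employee::name).distinct().count() == employees.size();
    }

    // Inter-entity scope: every DepartmentId reference must exist in Departments.
    static boolean interEntityConstraint(List<Employee> employees, Map<Integer, String> departments) {
        return employees.stream().allMatch(e -> departments.containsKey(e.departmentId()));
    }
}
```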
Note that in an environment with multiple stand-alone data stores that together present a common data service interface, the attribute, tuple, and entity scope constraints are encapsulated inside each data store; these constraints do not cross the boundaries of the data store (unless of course the data stores each contain partitions of the same data collection). Inter-entity constraints however can involve entities (data sets) that live in different data stores. We will discuss that situation further on.
Some rules apply to the change itself, not to the status of the data. For example: "Salaries may not be updated in February or March." Or: "A salary increase may not be higher than 25%." Data authorization rules also apply to the change, not to the status. They may state things like "Only managers [users who have been granted the role manager] can update the value of the salary attribute" or "Creating a new employee record can only be done by a manager who is associated with the department into which the employee is created."
These rules do not describe the integrity of data (as it is) and cannot, for example, be verified for a static data collection. These rules only govern the validity of transactions, and should be enforced during the time of the change.
Data integrity is about the internal, system-level consistency of a data collection. Without data integrity, the quality of the data would be highly suspect. However, data integrity is definitely not the same thing as data quality. Even when the data recorded in the system abides by all the integrity constraints, the data may not be a correct representation of the real world and will therefore have less than optimal quality. Spelling mistakes may have made their way into names of people and streets. There may also be unintentional duplication -- multiple records representing the same real-world instance such as a person, object, or event. Unintentional duplication can be further complicated if the records are mutually inconsistent.
Data may be outdated because it is no longer relevant or (even worse) no longer accurate. The world changes—people change addresses, they buy and sell property, they change jobs, they die. When these changes are not reflected in the system that supposedly represents the real world, the recorded data becomes less accurate—even though its integrity may be intact.
One data constraint may not be as important as the next. Some may require less rigorous enforcement than others. The same holds true in daily life, of course. For example, many rules govern people's behavior in social contexts. Some are enforced by law, others by common courtesy, and there is a world in between. Another example: sometimes, to spite an employer, employees may practice extreme adherence to formal rules with the specific intention of causing delays and frustration.
Apparently, there can be a distinction in the level of rigorousness with which rules have to be enforced. And this, too, applies to data constraints. The distinction arises from several considerations, including the grounds for the constraint (legal, health, KPI, QA), the cost of implementation, and the overhead of strict enforcement.
A simple constraint for a market research environment could be that "a consumer should not participate more than once in a specific questionnaire." It is an intention—but probably not one worth an enormous effort to enforce. For a classic rock radio station a constraint might be: "The same song should not be played more than once per week." Again, this defines an intention, the violation of which is probably not worth firing someone over.
When we discuss the implementation of data constraints, it will become clear that there is a substantial difference in the effort required to enforce some constraints with 100% certainty compared with the effort to "almost always enforce with a small risk of some incidents."
There is also another choice: must there never be a violation of the constraint - one that may be visible to consumers of the data - or will it be sufficient to periodically scan the data and make sure that in the long run no invalid data exists?
In the first case, constraint validation has to be performed before any changes are finalized in the system. That means that constraint validation is performed synchronously, with direct impact on system response time as perceived by the party making the change and - as we will see later on - with resource locks that may impact system scalability. In this case any party trying to submit changes to the system will know immediately whether those changes are acceptable and are recorded successfully.
In the second case the constraint validation is performed asynchronously with regard to the actual change itself. A background process scans the data that has already been recorded and tests for validity, i.e. integrity. When violations are found, some process—probably with some human involvement—is started to deal with the undesirable situation. Records can be removed, marked as invalid, added to a 'to rectify' list or dealt with in other ways, depending on the seriousness of the violation.
Note that in this case there may be a two-step process: changes are submitted provisionally, and only after the off-line check has been performed will these changes become real. Between the time of submission and this okay-stamp from the background validation process, these changes are not yet visible to consumers of the system.
If there is no such two-step process, invalid data may be read from the system by consumers in the time between the submission of the data and the background validation. Depending on the nature of the constraints that may not be satisfied, this may not be a problem at all.
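As a minimal sketch of this second, asynchronous approach, a background job could periodically scan the committed data for violations of the questionnaire rule mentioned earlier and hand them off for rectification. All names and the scheduling interval below are illustrative assumptions.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Illustrative sketch of asynchronous, off-line constraint validation:
// a background job periodically scans already-committed data and reports
// violations instead of blocking the original transaction.
public class BackgroundValidator {

    record Participation(long consumerId, long questionnaireId) {}

    interface ParticipationStore {
        List<Participation> findAll();          // committed data only
        void flagAsSuspect(Participation p);    // e.g. add to a 'to rectify' list
    }

    // "A consumer should not participate more than once in a specific questionnaire."
    static void scan(ParticipationStore store) {
        store.findAll().stream()
             .collect(Collectors.groupingBy(p -> p))   // group identical (consumer, questionnaire) pairs
             .forEach((key, duplicates) -> {
                 if (duplicates.size() > 1) {
                     duplicates.forEach(store::flagAsSuspect);
                 }
             });
    }

    static void start(ParticipationStore store) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run the integrity scan every hour; violations are handled out of band.
        scheduler.scheduleAtFixedRate(() -> scan(store), 0, 1, TimeUnit.HOURS);
    }
}
```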
This article will focus on the case of real-time, on-line synchronous constraint enforcement that strives for continuous data integrity.
We would typically not criticize a sculptor about the quality of his work while he is still working on his masterpiece, nor would we complain about the taste of the food the cook is still working on or the spelling errors in an article the reporter has not yet submitted for publication. Commenting on the smoothness of a road while it is still under construction is equally nonsensical.
The same applies to data integrity constraints. They should be enforced only when the changes that a user or system is in the process of making to the data are complete. Constraints need not be applied during every stage of the process. Only when the work is complete and "submitted for publication" is there a need for full enforcement of and compliance with constraints. When we talk about data in an IT environment, the term "commit" is used to describe this act of submitting for publication. The unit of work that comprises all the changes is called a "transaction." During the preparation of the transaction—before the changes are committed—the constraints can be temporarily violated. During this time, the state of data in the transaction is not visible outside that transaction. Outside the transaction, no invalid data can be seen, and consumers are unable to retrieve data from the system that does not comply with the integrity constraints.
The term "rollback" is used to describe the undoing of all changes wrought in a transaction. When the cook breaks off the cooking process and throws all the pots and pans (or at least their contents) in the bin or the writer crumples up her article or the sculptor tears down his unfinished sculpture there is little point in testing the taste, checking the spelling or evaluating the art quality -- the work has been undone and no longer has relevance. The same applies to a roll-backed transaction: it is as if it never existed. All changes that may have been underway during the transaction are reversed and nothing remains of the transaction itself.
A session is the context in which a consumer interacts with the system - both for retrieving data and for executing transactions that manipulate data. A consumer may, and frequently does, correspond with an end user, but that consumer might also be an automated agent (such as a batch procedure), an external service consumer, or an automated business process. Today's systems typically deal with many simultaneous sessions - from dozens to thousands or even tens of millions in the case of global systems such as Twitter, Facebook, and Google Mail.
Sessions, like transactions, are private: sessions do not interact with each other. Actions in one session are not visible in another, nor do they impact other sessions. Only when a transaction in one session commits are other sessions able to experience the effects from that transaction. Note, however, that a session may claim a resource - through a lock - and may thereby impact other sessions that at that moment are prevented from claiming the same resource. Examples of resources include individual records, entire data collections, or a custom token.
The notion of a session—and even that very term—is used at various levels. A user in a web browser performs actions in a session. This browser session starts when a user arrives at a certain web site or accesses a web application, and lasts until the site is abandoned, the user exits the application, or the browser is closed. An agent accessing a web service may do so in a session that includes multiple interactions. A batch process interacting with a data store may perform many operations in one or more transactions that all take place through the same connection with the data store and in the same session context.
Associated with the concept of a session are the terms connection (with a system such as an application or a data store) and identity (of the agent who initiated the session using authentication based on credentials).
The easiest way to enforce constraints and reach integrity nirvana is refusing to allow any data manipulation at all. Only through change can the existing situation be taken from a state of complete integrity to one that lacks that integrity. By preventing change, the status quo is preserved.
Obviously that is not a viable strategy, but it does clearly illustrate a point: change is the trigger for enforcement. We do not have to continually check all our constraints—we only do so when changes are made to the data. And we do not have to check every constraint for any data manipulation: changing the name of a customer is not likely to cause a product price constraint to be violated. Also, assuming we have an attribute constraint on customer names, changing one customer's name would not then require a check of all untouched customer names to verify continued adherence to the constraint. We can focus on the customer record(s) that have undergone a name change.
That point lays the foundation for the implementation of constraint enforcement:
We already know that enforcement is necessary only as part of the transaction commit. Before that time, while the transaction is being composed and the data is still in limbo, the constraints may be violated. Adherence to constraints at any moment prior to the commit is irrelevant: it only counts at commit time. Note that the change must not be allowed to somehow become invalid during the time between its validation and the commit. In a multi-session environment, this is not trivial. See the next section, "Pulling the rug," for more on this.
If we know which data manipulation events can cause a constraint to be violated and we know which data manipulations a transaction contains, we can map the latter onto the former and determine which constraints to validate at transaction commit time.
The implementation of an efficient and robust data constraint validation mechanism requires a careful analysis of the constraints: which data manipulations—create, update, delete—to which entities (and in case of updates: which attributes specifically) have the potential of violating the constraint. Additionally, at run time for a transaction this mechanism needs to be able to determine: which entities undergo which change—create, update, delete—and for updates: which attributes are touched. Together, these factors tell us which constraints need to be validated for which records.
The table below contains an example of the constraint data manipulation analysis for a number of constraints of different types:
| Scope | Constraint | Data Event Analysis |
| --- | --- | --- |
| Attribute | The value of the Salary attribute must be an integer between 1000 and 5000 | Create; Update of Salary attribute |
| Tuple (record) | For employees with a Job (attribute value) equal to SALESMAN the value of the Salary attribute should not be higher than 2500. The sum of the attributes Salary and Commission may not be higher than 10000 | Create; Update of Salary or Job attribute. Create; Update of Salary or Commission attribute |
| Entity (table) | The name of an employee should be unique in a certain job. The difference between the maximum salary for all employees with the same job and their average salary may not exceed 30% of the average salary. An employee who is the manager of other employees should either have the Job attribute value of MANAGER or should have the same Job attribute value as each of his/her subordinates | Create (for each of the three constraints) |
| Inter-Entity (cross-table) | Referential integrity (foreign key): any value used for the Department attribute must exist in the DepartmentId attribute of a record in the Departments table | Create (of Employee); Delete (of Department); Update of Department reference (of Employee); Update of DepartmentId attribute (of Department) |
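The outcome of such an analysis can be captured as a simple lookup structure that a constraint enforcer consults at commit time. The sketch below is illustrative only; the constraint names (SALARY_RANGE, SALESMAN_CAP, EMP_DEPT_FK) are made up for the example.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch: the result of the constraint/data-event analysis stored as
// a lookup structure. Given the entity, the kind of manipulation, and (for updates)
// the touched attribute, it yields the constraints that must be validated.
public class ConstraintEventRegistry {

    enum Operation { CREATE, UPDATE, DELETE }

    // Which (entity, operation, attribute) combinations can violate which constraints.
    private static final Map<String, Set<String>> TRIGGERS = Map.of(
        "Employee:CREATE:*",            Set.of("SALARY_RANGE", "SALESMAN_CAP", "EMP_DEPT_FK"),
        "Employee:UPDATE:Salary",       Set.of("SALARY_RANGE", "SALESMAN_CAP"),
        "Employee:UPDATE:Job",          Set.of("SALESMAN_CAP"),
        "Employee:UPDATE:DepartmentId", Set.of("EMP_DEPT_FK"),
        "Department:DELETE:*",          Set.of("EMP_DEPT_FK")
    );

    static Set<String> constraintsFor(String entity, Operation op, String attribute) {
        String key = entity + ":" + op + ":" + (op == Operation.UPDATE ? attribute : "*");
        return TRIGGERS.getOrDefault(key, Set.of());
    }

    public static void main(String[] args) {
        // Changing only the Name of an Employee triggers no constraint validation at all.
        System.out.println(constraintsFor("Employee", Operation.UPDATE, "Name"));    // []
        System.out.println(constraintsFor("Employee", Operation.UPDATE, "Salary"));  // [SALARY_RANGE, SALESMAN_CAP]
    }
}
```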
With the event analysis of the constraints at our disposal and a way to determine for a pending transaction which data manipulations it consists of, the constraint validating mechanism can do its job. It is to be kicked off when the transaction is complete and submitted for commit by the user or the agent owning the session. It performs the validations in the context of that session, in which all pending changes are visible together with all previously committed changes from other sessions. Any uncommitted changes that may be pending in other sessions are not visible, and since they have not been committed (and may never be) they are irrelevant for the constraint validation of the transaction.
One possible algorithm for the Constraint Enforcer could be:
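In outline: determine which constraints the transaction's pending manipulations can violate, validate each of those constraints in the context of the session (pending changes plus previously committed data), and commit only if every validation passes. A minimal, purely illustrative Java sketch of such an enforcer, assuming a registry like the one above and per-constraint validators:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of a constraint enforcer invoked at commit time. It assumes a
// registry mapping data events to constraints (see the earlier sketch), a way to
// list the manipulations in the pending transaction, and per-constraint validators.
// All interfaces here are illustrative, not an actual product API.
public class ConstraintEnforcer {

    interface DataEvent { String entity(); String operation(); Set<String> touchedAttributes(); }
    interface Transaction { Set<DataEvent> pendingEvents(); void commit(); void rollback(); }
    interface ConstraintRegistry { Set<String> constraintsFor(DataEvent event); }
    interface Validator { boolean isSatisfied(String constraint, Transaction tx); }

    private final ConstraintRegistry registry;
    private final Validator validator;

    ConstraintEnforcer(ConstraintRegistry registry, Validator validator) {
        this.registry = registry;
        this.validator = validator;
    }

    void enforceAndCommit(Transaction tx) {
        // 1. Determine which constraints the transaction's manipulations can violate.
        Set<String> toValidate = new HashSet<>();
        for (DataEvent event : tx.pendingEvents()) {
            toValidate.addAll(registry.constraintsFor(event));
        }
        // 2. Validate each of them in the context of this session: pending changes
        //    of this transaction plus all previously committed data are visible.
        for (String constraint : toValidate) {
            if (!validator.isSatisfied(constraint, tx)) {
                tx.rollback();
                throw new IllegalStateException("Constraint violated: " + constraint);
            }
        }
        // 3. All validations passed: publish the changes.
        tx.commit();
    }
}
```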
In a multi-session environment, changes committed in other sessions may transform a data manipulation that has already been validated successfully against all constraints from perfectly valid to in violation of a constraint. This in turn could lead to invalid data being committed.
Figure 9 shows a simple example. An entity has been defined with a single constraint that states that the NAME attribute is unique and therefore must have a different value for each record in the collection. Initially this condition is met. Two sessions are active. Both sessions create a new record with the Name attribute set to the value JOHNSON.
Constraint enforcement for one session is kicked off when the user commits the transaction. Since there is only one record with NAME='JOHNSON' in the scope of this session, the change validates against the uniqueness constraint. Meanwhile, in the second session the transaction has also been committed - or, to put it more precisely, submitted for commit. The constraint enforcer does the same thing for the second session as it does for the first. And it too finds no fault with the creation of a record with NAME='JOHNSON' — because no such record currently exists in the scope of this second session. The second transaction is also given the thumbs up and may proceed.
As a result, both records are successfully committed. However, this results in a clear violation of the uniqueness constraint, something that we set out to prevent. Note that this violation could occur because the creation in the first session was not visible in the second session (because the commit of session 1 was not yet complete when the validation took place in session 2). Figure 10 shows this: the yellow ellipse indicates a validation in session two that does not include the changes from session one, and the orange ellipse indicates a validation that does include the effects from the transaction in session one.
We seem to be facing a Catch-22: the validation in session two cannot include the changes from session one before they have been committed, yet it cannot ignore these changes that so clearly collide with the manipulation in session two. The solution to this challenge is the use of exclusive locks. Such locks can be used by one session to prevent other sessions from making changes that could result in invalid outcomes, as shown in the above example.
These locks can be used in a very coarse-grained way, making it impossible for other sessions to safely perform many operations. For example, for the Name uniqueness constraint from this example, a lock can be used to prevent any update or create operation in other sessions once a create or update of Name has been performed. This is an approach destined for lots of frustration, though quite unnecessarily so. A fine-grained approach can be selected instead, one that does not use locks to prevent data manipulation, but rather uses locks that claim the exclusive right to perform a constraint validation. In the previously discussed example, the lock would be used as illustrated in Figure 11.
When session one starts the validation of the uniqueness constraint, it tries to acquire a system-wide lock on that constraint. Because no other session currently holds that lock, the lock is granted and session one can proceed. As before, it does not find a problem with the new JOHNSON record.
Session two, meanwhile, has also started the validation phase of the transaction. It also must validate against the uniqueness constraint for the Name attribute. It tries to acquire the PSN_UK23 lock for that constraint, but fails to do so. We know why: it is currently held by session one, which is going through its own validations. Session two can either wait until the lock becomes available, or it can stop processing and return with an exception. If it waits, it will be able to continue as soon as the lock is relinquished by session one. This happens when session one commits. At that moment, session two can start the validation of the transaction against the uniqueness constraint. And at that time, this validation will include the changes from session one, including the new record with NAME='JOHNSON'. This means that the validation against the uniqueness constraint will fail for session two and the transaction will be rejected and possibly rolled back.
Note that our lock can be defined in an even more fine-grained way: instead of taking the lock for the constraint PSN_UK23, we can take a lock on that constraint in conjunction with the new value of the Name attribute, in this case {PSN_UK23, JOHNSON}. That means that validations of the constraint for other values of Name can still go ahead and acquire a lock even while session one is chewing on PSN_UK23 for the JOHNSON value.
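A minimal sketch of such fine-grained constraint locking, using an in-process lock manager keyed on the combination of constraint name and value. This is only a stand-in for whatever system-wide locking facility the data store actually offers.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of fine-grained constraint locking: the lock key is the
// combination of constraint name and the value being validated (e.g.
// {PSN_UK23, JOHNSON}), so validations for other Name values are not blocked.
// A real system would release these locks as part of commit or rollback.
public class ConstraintLockManager {

    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Block until the lock for this constraint/value pair is available.
    public void acquire(String constraintName, String value) {
        locks.computeIfAbsent(constraintName + ":" + value, k -> new ReentrantLock()).lock();
    }

    // Released when the owning transaction commits or rolls back.
    public void release(String constraintName, String value) {
        ReentrantLock lock = locks.get(constraintName + ":" + value);
        if (lock != null && lock.isHeldByCurrentThread()) {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ConstraintLockManager manager = new ConstraintLockManager();
        // Session one validates the uniqueness constraint for JOHNSON...
        manager.acquire("PSN_UK23", "JOHNSON");
        // ...while another session could still validate PSN_UK23 for SMITH concurrently.
        manager.acquire("PSN_UK23", "SMITH");
        manager.release("PSN_UK23", "SMITH");
        manager.release("PSN_UK23", "JOHNSON");
    }
}
```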
As we will see further on in this article, this fine-grained locking approach is adopted as the internal way the Oracle Database enforces its declarative constraints.
Up to this point our discussion has not focused on a specific environment. The data store to which the data constraints have to be applied can be a relational database, an in-memory data grid, a content management system, a file system, cloud-based storage, or any other mechanism for managing data. We did, however, assume that we were dealing with a single data store that handles data, transactions, sessions, and constraint enforcement. But in actual practice that may not always be the case. Therefore, we have to introduce another level of complexity into our story: multiple data stores and transactions that span data stores. If the data can be held in multiple data stores, we can describe the data integrity for each data store on its own and leave it to consumers to interact with each of these stores in turn. That would be too much of a short-cut, however.
Especially in service-oriented environments, we may not want consumers to know about multiple data stores. In the decoupled world, with encapsulated service implementations, we probably want to offer entity services and composite business services that expose operations that may span multiple data stores.
The transaction a user or agent is trying to execute is a logical unit. Either all of the transaction succeeds - or all of it fails. The transaction should not be partially committed - with one data store changed but not the others.
This is where the concept of a global transaction comes in, sometimes referred to as a distributed transaction. The global transaction is a single transaction, a single, coherent, logical unit that should be committed - or not committed - in its entirety. The global transaction spans two or more transactional resources such as data stores. Each of these resources is capable of managing its own transactions, and the data integrity of such transactions.
The challenge with global transactions is in extending the transaction mechanism for each of these resources to a higher level, where they participate in a joint transaction in which all resources together coordinate success (commit) or failure (rollback). Such coordination usually involves two phases (hence the term Two-Phase Commit or 2PC): each transactional resource starts a session, applies all changes, and performs all constraint validations. The resource then reports its success (or failure) to the transaction coordinator. Note that at the end of phase one, locks will be held in each transactional resource, on the records that have been changed and on the constraints that have been validated for these records, in order to ensure that the rug is not pulled by other (distributed or local) transactions.
If all resources have reported success, the global transaction has been given the thumbs-up from all parties involved and the second phase can start. The commit is completed in each transactional resource, publishing the changes and releasing all locks. Note that if even one of the transactional resources reports failure (for example, because of constraint validation), the global transaction is aborted and each resource is instructed to roll back all changes.
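The coordination itself can be sketched as follows. The resource interface below is a simplified, illustrative stand-in for a real protocol such as XA, but the flow mirrors the two phases just described.

```java
import java.util.List;

// Illustrative sketch of two-phase commit coordination. Phase one asks every
// resource to prepare (apply changes, validate constraints, keep locks);
// phase two commits everywhere, or rolls back everywhere if anyone failed.
public class TwoPhaseCommitCoordinator {

    interface TransactionalResource {
        boolean prepare();   // apply changes, validate constraints, hold locks; report success
        void commit();       // publish changes and release locks
        void rollback();     // undo changes and release locks
    }

    void execute(List<TransactionalResource> resources) {
        // Phase one: every resource does its local work and votes.
        boolean allPrepared = true;
        for (TransactionalResource resource : resources) {
            if (!resource.prepare()) {
                allPrepared = false;
                break;
            }
        }
        // Phase two: unanimous success means commit everywhere; any failure means
        // the global transaction is aborted and every resource rolls back.
        if (allPrepared) {
            resources.forEach(TransactionalResource::commit);
        } else {
            resources.forEach(TransactionalResource::rollback);
        }
    }
}
```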
From a conceptual point of view, as well as during implementation, the requirement to use global transactions that span multiple data stores is not trivial. Even though there is robust support for distributed transactions in many environments—including Java/JEE, Oracle SOA Suite, and Oracle Service Bus—the added complexity is not to be taken lightly.
This brings up an interesting and sometimes heated discussion about how to deal with this situation: where to enforce such distributed constraints and how to implement the distributed locks that prevent transactions from interfering with each other. Some even question the validity of constraints that cross transactional resources, arguing that independent data stores are not truly independent if constraints are defined between them.
In addition to this theoretical debate, there is also the matter of the effort required to implement constraint enforcement and the question of the capabilities of the technology involved. Will the constraint enforcer in one data store even be able to access the data in another data store? Will it be able to request a lock on the constraint in the other data store(s) and release that lock in the appropriate way?
This section will discuss the implementation of a cross-Data Store service (using Oracle SOA Suite) that not only uses a global, distributed transaction but also takes care of cross-Data Store constraints. In short, the general algorithm for this situation would have to be something like:
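Since the steps are easiest to follow in code, here is a minimal, purely illustrative sketch of that general algorithm: apply the changes and enforce the local constraints in each data store, then acquire a lock per affected distributed constraint and validate it across all participating stores, and only then commit everywhere (via a coordinated, two-phase commit in practice). All interfaces are assumptions made for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal sketch of the general flow for a cross-data-store service: local work
// and local constraint enforcement per data store, then distributed constraint
// enforcement under a constraint lock, then a coordinated commit. All interfaces
// are illustrative; a real implementation would sit on top of XA/JTA resources.
public class CrossDataStoreService {

    interface DataStoreSession {
        void applyChanges();                      // phase-one work: changes, local constraints, local locks
        boolean localConstraintsSatisfied();
        void commit();
        void rollback();
    }
    interface DistributedConstraint {
        String name();
        boolean isSatisfied(List<DataStoreSession> sessions);  // sees pending changes in all sessions
    }
    interface LockProvider { void acquire(String constraintName); void release(String constraintName); }

    void execute(List<DataStoreSession> sessions, Set<DistributedConstraint> affectedConstraints,
                 LockProvider locks) {
        List<String> acquiredLocks = new ArrayList<>();
        try {
            // 1. Apply the changes in every data store and enforce its local constraints.
            for (DataStoreSession session : sessions) {
                session.applyChanges();
                if (!session.localConstraintsSatisfied()) throw new IllegalStateException("local violation");
            }
            // 2. For each affected distributed constraint: acquire its lock, then validate it
            //    against the pending changes in all participating data stores.
            for (DistributedConstraint constraint : affectedConstraints) {
                locks.acquire(constraint.name());
                acquiredLocks.add(constraint.name());
                if (!constraint.isSatisfied(sessions)) throw new IllegalStateException(constraint.name());
            }
            // 3. Everything holds: commit all local transactions (two-phase in practice).
            sessions.forEach(DataStoreSession::commit);
        } catch (RuntimeException e) {
            sessions.forEach(DataStoreSession::rollback);
            throw e;
        } finally {
            // 4. Release the distributed constraint locks, whether the transaction succeeded or not.
            acquiredLocks.forEach(locks::release);
        }
    }
}
```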
An interesting question comes out of this algorithm that must be addressed: where is the lock acquired for a distributed constraint? The lock is needed to prevent multiple concurrent transactions from each performing a validation of the same constraint at the same time, each in their own context with only their own changes in scope, with the risk of together producing an invalid situation. The lock ensures that only one transaction can enforce the constraint including its own uncommitted changes. If all transactions were global transactions, a global lock from some lock provider service would be sufficient. Frequently, however, there will be local transactions against each individual data source in addition to the global transactions. In those cases local locks are required. We will discuss this in more detail a little bit further on.
Let us now look at an example of a simple distributed constraint and how the generic algorithm introduced above would play out for this constraint. Data Source A contains a data collection of Employee records, Data Source B has a data collection of Department records, and there is a distributed constraint EMP_DEPT_REF1 that states: "each Employee record must refer through its DepartmentId attribute to a Department record in the collection in Data Source B."
A quick data event analysis for this distributed constraint tells us that this constraint should be enforced for these data events:
- the creation of an Employee record (in Data Source A)
- an update of the DepartmentId attribute of an Employee record (in Data Source A)
- the deletion of a Department record (in Data Source B)
We will assume that the identity attribute of the Department records used for the reference from Employee cannot be updated. The above means that data manipulation events in both Data Source A and Data Source B can trigger the enforcement of the Distributed Referential Constraint EMP_DEPT_REF1. Both Data Sources need to be aware of that fact. And both Data Sources need to have a way to make this enforcement happen — somehow, somewhere.
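A minimal sketch of what that enforcement could look like from each side, assuming each Data Source can answer a simple existence query on behalf of the other. The interfaces and method names are illustrative.

```java
// Illustrative sketch of enforcing the distributed referential constraint
// EMP_DEPT_REF1 from both sides. It assumes each data source can answer a
// simple existence query for the other; the lock on the constraint discussed
// earlier is acquired before either check runs.
public class EmpDeptRefConstraint {

    interface DepartmentStore {                 // Data Source B
        boolean departmentExists(long departmentId);
    }
    interface EmployeeStore {                   // Data Source A
        boolean anyEmployeeInDepartment(long departmentId);
    }

    // Triggered by: create of an Employee, or update of its DepartmentId attribute.
    static boolean validEmployeeReference(long departmentId, DepartmentStore departments) {
        return departments.departmentExists(departmentId);
    }

    // Triggered by: delete of a Department in Data Source B.
    static boolean departmentCanBeDeleted(long departmentId, EmployeeStore employees) {
        return !employees.anyEmployeeInDepartment(departmentId);
    }
}
```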
Figure 17 shows the commit cycle for the case where a new employee is created:
The steps can formally be described as follows:
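A minimal sketch of these steps for the new-employee case, under the assumptions of this example: all interfaces are illustrative, and the essential point is that the lock on EMP_DEPT_REF1 must be respected by transactions against both Data Sources.

```java
// Minimal sketch of the commit cycle for creating a new Employee in Data Source A.
// The lock on the distributed constraint EMP_DEPT_REF1 must be honoured by
// transactions against Data Source B as well; otherwise the referenced Department
// could be removed concurrently. All interfaces and names are illustrative.
public class CreateEmployeeCommitCycle {

    interface EmployeeSession {                       // local transaction in Data Source A
        void insertEmployee(String name, long departmentId);
        void commit();
        void rollback();
    }
    interface DepartmentStore {                       // query facade on Data Source B
        boolean departmentExists(long departmentId);
    }
    interface ConstraintLock {                        // lock covering both data sources
        void acquire(String constraintName, String value);
        void release(String constraintName, String value);
    }

    static void createEmployee(String name, long departmentId, EmployeeSession sessionA,
                               DepartmentStore dataSourceB, ConstraintLock lock) {
        // 1. Apply the change in Data Source A (uncommitted) and enforce its local constraints.
        sessionA.insertEmployee(name, departmentId);
        // 2. Acquire the lock on EMP_DEPT_REF1, scoped to the referenced department.
        lock.acquire("EMP_DEPT_REF1", Long.toString(departmentId));
        try {
            // 3. Enforce the distributed constraint: the Department must exist in Data Source B.
            if (!dataSourceB.departmentExists(departmentId)) {
                sessionA.rollback();
                throw new IllegalStateException("EMP_DEPT_REF1 violated: no department " + departmentId);
            }
            // 4. Commit the local transaction; the new Employee becomes visible.
            sessionA.commit();
        } finally {
            // 5. Release the constraint lock so waiting transactions can validate in turn.
            lock.release("EMP_DEPT_REF1", Long.toString(departmentId));
        }
    }
}
```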
Figure 18 shows a situation that could occur when the lock on the distributed constraint is not taken for both Data Sources:
If Data Source B is not covered by the lock acquired by the transaction that creates the new employee in Data Source A, the following might happen: while the creation of the employee is happening in one transaction, another transaction is removing the Department that the new Employee refers to. If that concurrent transaction in Data Source B could acquire its own lock on EMP_DEPT_REF1, it would continue, checking for any employee referencing the Department. Failing to find the new employee that is not yet committed, the transaction would then remove the Department that the new Employee believes it is referencing (because in the first transaction the removal is not yet visible, so its constraint enforcement also succeeds).
Let us assume again the situation of two independent Data Sources A and B with data collections Employees and Departments respectively. Let us further assume the following data constraints defined for these collections:
Employees:
Departments:
Distributed:
Let us further assume a global transaction that hits both Data Sources with a number of data manipulations:
Now we need the works in terms of transaction coordination and constraint enforcement, following the steps that were described earlier on:
The business value of IT systems is determined by several factors, one of which is the quality of the data produced by the systems. This quality depends on the degree to which the data in the system is an accurate representation of the real world. One of the measures we can take to enhance the data quality is to enforce constraints on the data that are derived from business rules. More specifically, we need to ensure that transactions that manipulate the data in the system can only complete when all constraints are satisfied. Note however that sometimes the cost of enforcing all constraints immediately, during each transaction, is too high - in terms of performance overhead, implementation effort or technical feasibility - and an asynchronous, off-line validation is the better alternative.
This article introduces four categories of constraints, with a scope that ranges from a single attribute through a record to multiple records in multiple data collections that may even live in multiple independent data sources. Each category of constraints is triggered in different ways and is enforced in different ways, depending on the technology available in the data source.
This article also explains why robust, synchronous (or real time) constraint enforcement requires a locking mechanism to prevent undue interference between concurrent sessions and transactions. This locking mechanism can be fine-grained, locking only the constraint and the smallest possible data scope.
Distributed constraints and distributed transactions that span data sources require a slightly more involved procedure for validating transactions. After applying the changes in each data source and performing the per-data-source constraint enforcement, the distributed constraints need to be enforced, also using a constraint lock to prevent session interference. Only when all local validations succeed and the distributed constraints are enforced can all local transactions be committed.
The author wishes to thank Toon Koppelaars, ACE Director Alumnus, for his thorough review of this paper and his many valuable suggestions for improvement.
Lucas Jellema has been active in IT (and with Oracle) since 1994. An Oracle ACE Director specializing in Oracle Fusion Middleware, Lucas is a consultant, trainer, and instructor in diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, ADF, and Java. He is the author of the Oracle SOA Suite 11g Handbook, and a frequent presenter at JavaOne, Oracle Open World, ODTUG Kaleidoscope, Devoxx, OBUG and other conferences.
Connect with Lucas Jellema