
On the Integrity of Data

by Lucas Jellema

An introduction to the basics of data integrity enforcement in a variety of environments

April 2013

Downloads
Oracle Database
Oracle SOA Suite

To be useful, data held and used in information systems has to live up to a number of expectations. The data should be an accurate representation of its source. It should be reliable. The data should have internal consistency. The data should adhere to rules based on the logic of the real world. This accuracy, internal quality, and reliability of data is frequently referred to as data integrity.

Safeguarding the integrity of data is a challenge, one that increases in complexity when multiple users access and manipulate the data simultaneously, which is obviously a common situation. And that challenge reaches new heights when the data is managed in multiple independent data stores rather than a single database.

This article introduces some of the basics of data integrity enforcement. It then discusses how that integrity is realized in a multi-user environment with a single [Oracle] database. The article then extends that discussion to an environment with multiple data stores that jointly present unified data services to consumers and that are together responsible for data integrity: not just the internal, encapsulated integrity within each store, but also integrity across those stores.

Data Integrity Constraints

The integrity of data is guarded by specific business rules called data integrity constraints. The rules that apply in the real world to events, actions, manipulations, and situations are translated into constraints that apply to the data handled in information systems. Data integrity means that the data recorded in the system does not violate these constraints.

The constraints are typically classified into a number of categories of increasing complexity and scope. For example:

Scope: Attribute
  Example: The value of the Salary attribute must be an integer between 1000 and 5000
  (Figure 1)

Scope: Tuple (record)
  Example: For employees with a Job (attribute value) equal to SALESMAN, the value of the Salary attribute should not be higher than 2500
  Example: The sum of the attributes Salary and Commission may not be higher than 10000
  (Figure 2)

Scope: Entity (table)
  Example: Uniqueness of a certain attribute across (a subsection of) the records in the table
  Example: The difference between the maximum salary for all employees with the same job and their average salary may not exceed 30% of the average salary
  Example: An employee who is the manager of other employees should either have the Job attribute value of MANAGER or should have the same Job attribute value as each of his/her subordinates
  (Figure 3)

Scope: Inter-Entity (cross-table)
  Example: Referential integrity (foreign key): any value used for the Department attribute must exist in the DepartmentId attribute of a record in the Departments table
  (Figure 4)

 

Note that in an environment with multiple stand-alone data stores that together present a common data service interface, the attribute, tuple, and entity scope constraints are encapsulated inside each data store; these constraints do not cross the boundaries of the data store (unless, of course, the data stores each contain partitions of the same data collection). Inter-entity constraints, however, can involve entities (data sets) that live in different data stores. We will discuss that situation further on.
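As a concrete illustration, the attribute, tuple, and inter-entity categories (and the uniqueness variant of the entity category) can be expressed declaratively in an Oracle database. The following is a minimal sketch; the table and constraint names are assumptions made for this example, and the more complex entity-scope rules (the 30% rule, the manager rule) cannot be declared this way and require the enforcement mechanisms discussed later in this article.

    -- Hypothetical tables; names and columns are illustrative only.
    CREATE TABLE departments (
      department_id  NUMBER         PRIMARY KEY,
      name           VARCHAR2(100)  NOT NULL
    );

    CREATE TABLE employees (
      employee_id    NUMBER         PRIMARY KEY,
      name           VARCHAR2(100)  NOT NULL,
      job            VARCHAR2(30),
      salary         NUMBER,
      commission     NUMBER,
      department_id  NUMBER,
      -- Attribute scope: the value of a single column
      CONSTRAINT emp_salary_range CHECK (salary BETWEEN 1000 AND 5000
                                         AND salary = TRUNC(salary)),
      -- Tuple scope: rules involving several columns of the same record
      CONSTRAINT emp_salesman_cap CHECK (job <> 'SALESMAN' OR salary <= 2500),
      CONSTRAINT emp_sal_comm_cap CHECK (NVL(salary, 0) + NVL(commission, 0) <= 10000),
      -- Entity scope (uniqueness variant): unique across the records in the table
      CONSTRAINT emp_name_job_uk  UNIQUE (name, job),
      -- Inter-entity scope: referential integrity to another table
      CONSTRAINT emp_dept_fk      FOREIGN KEY (department_id)
                                  REFERENCES departments (department_id)
    );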

Figure 5

Some rules apply to the change itself, not to the status of the data. For example: "Salaries may not be updated in February or March." Or: "A salary increase may not be higher than 25%." Data authorization rules also apply to the change, not to the status. They may state things like "Only managers [users who have been granted the role manager] can update the value of the salary attribute" or "Creating a new employee record can only be done by a manager who is associated with the department into which the employee is created."

These rules do not describe the integrity of data (as it is) and cannot, for example, be verified for a static data collection. These rules only govern the validity of transactions, and should be enforced during the time of the change.
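Such transition rules are typically enforced at the moment of the change itself, for example in a row-level trigger. A minimal sketch for the two salary rules above, reusing the hypothetical EMPLOYEES table from the earlier sketch:

    -- Illustrative trigger: transition rules compare the old and new state of a row.
    CREATE OR REPLACE TRIGGER emp_salary_transition
    BEFORE UPDATE OF salary ON employees
    FOR EACH ROW
    BEGIN
      -- "Salaries may not be updated in February or March"
      IF TO_CHAR(SYSDATE, 'MM') IN ('02', '03') THEN
        RAISE_APPLICATION_ERROR(-20002,
          'Salaries may not be updated in February or March');
      END IF;
      -- "A salary increase may not be higher than 25%"
      IF :new.salary > :old.salary * 1.25 THEN
        RAISE_APPLICATION_ERROR(-20001,
          'Salary increase exceeds 25% of the current salary');
      END IF;
    END;
    /

Authorization rules can be enforced in a similar place, for instance by checking the session's application role before allowing the update.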

Data Quality

Data integrity is about the internal, system-level consistency of a data collection. Without data integrity, the quality of the data would be highly suspect. However, data integrity is definitely not the same thing as data quality. Even when the data recorded in the system abides by all the integrity constraints, the data may not be a correct representation of the real world and will therefore have less than optimal quality. Spelling mistakes may have made their way into names of people and streets. There may also be unintentional duplication -- multiple records representing the same real world instance such as person, object, or event. Unintentional duplication can be further complicated if the records are mutually inconsistent.

Data may be outdated because it is no longer relevant or (even worse) no longer accurate. The world changes—people change addresses, they buy and sell property, they change jobs, they die. When these changes are not reflected in the system that supposedly represents the real world, the recorded data becomes less accurate—even though its integrity may be intact.

Levels of Robustness

One data constraint may not be as important as the next. Some may require less rigorous enforcement than others. The same holds true in daily life, of course. For example, many rules govern people's behavior in social contexts. Some are enforced by law, others by common courtesy, and there is a world in between. Another example: sometimes, to spite an employer, employees may practice extreme adherence to formal rules with the specific intention of causing delays and frustration.

Clearly, there can be a distinction in the level of rigor with which rules have to be enforced. And this, too, applies to data constraints. The distinction arises from several considerations, including the grounds for the constraint (legal, health, KPI, QA), the cost of implementation, and the overhead of strict enforcement. The required level of robustness can often be read from the way a constraint is phrased:

  • "It is imperative that such is always the case".
  • "This may never be untrue".
  • "This is how it should be".
  • "Ideally this should hold true".
  • "It is preferable to do it like this".

A simple constraint for a market research environment could be that "a consumer should not participate more than once in a specific questionnaire." It is an intention—but probably not one worth an enormous effort to enforce. For a classic rock radio station, a constraint might be: "The same song should not be played more than once per week." Again, this defines an intention, the violation of which is probably not worth firing someone over.

When we discuss the implementation of data constraints, it will become clear that there is a substantial difference in the effort required to enforce some constraints with 100% certainty compared with the effort to "almost always enforce with a small risk of some incidents."

There is also another choice to make: must a violation of the constraint never occur - not even one that is briefly visible to consumers of the data - or will it be sufficient to periodically scan the data and make sure that in the long run no invalid data exists?

In the first case, constraint validation has to be performed before any changes are finalized in the system. That means that constraint validation is performed synchronously, with a direct impact on system response time as perceived by the party making the change and - as we will see later on - with resource locks that may impact system scalability. In this case any party trying to submit changes to the system will know immediately whether those changes are acceptable and have been recorded successfully.

In the second case the constraint validation is performed asynchronously with regard to the actual change itself. A background process scans the data that has already been recorded and tests for validity, i.e. integrity. When violations are found, some process—probably with some human involvement—is started to deal with the undesirable situation. Records can be removed, marked as invalid, added to a 'to rectify' list or dealt with in other ways, depending on the seriousness of the violation.

Note that in this case there may be a two-step process: changes are submitted provisionally, and only after the off-line check has been performed will these changes become real. Between the time of submission and this okay-stamp from the background validation process, these changes are not yet visible to consumers of the system.

If there is no such two-step process, invalid data may be read from the system by consumers in the time between the submission of the data and the background validation. Depending on the nature of the constraints that may not be satisfied, this may not be a problem at all.
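For constraints that only need to hold in the long run, the asynchronous check can be as simple as a periodically executed query that reports violations for follow-up. A sketch for the radio-station rule; the PLAYLIST table and its columns are assumptions made for this example:

    -- Background scan: report songs played more than once within any 7-day window.
    SELECT p1.song_id,
           p1.played_at,
           p2.played_at AS repeated_at
    FROM   playlist p1
    JOIN   playlist p2
      ON   p2.song_id   = p1.song_id
     AND   p2.played_at > p1.played_at
     AND   p2.played_at <= p1.played_at + 7
    ORDER  BY p1.song_id, p1.played_at;

The result set feeds whatever follow-up process (a report, a correction, a stern word with the DJ) is proportional to the rule.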

This article will focus on the case of real-time, on-line synchronous constraint enforcement that strives for continuous data integrity.

Transaction

We would typically not criticize a sculptor about the quality of his work while he is still working on his masterpiece, nor would we complain about the taste of the food the cook is still working on or the spelling errors in an article the reporter has not yet submitted for publication. Commenting on the smoothness of a road while it is still under construction is equally nonsensical.

The same applies to data integrity constraints. They should be enforced only when the changes that a user or system is in the process of making to the data are complete. Constraints need not be applied during every stage of the process. Only when the work is complete and "submitted for publication" is there a need for full enforcement of and compliance with constraints. When we talk about data in an IT environment, the term "commit" is used to describe this act of submitting for publication. The unit of work that comprises all changes is called a "transaction." During the preparation of the transaction—before the changes are committed—the constraints can be temporarily violated. During this time, the state of data in the transaction is not visible outside that transaction. Outside the transaction, no invalid data can be seen, and consumers are unable to retrieve data from the system that does not comply with the integrity constraints.

Figure 6

The term "rollback" is used to describe the undoing of all changes wrought in a transaction. When the cook breaks off the cooking process and throws all the pots and pans (or at least their contents) in the bin, or the writer crumples up her article, or the sculptor tears down his unfinished sculpture, there is little point in testing the taste, checking the spelling, or evaluating the artistic quality -- the work has been undone and no longer has relevance. The same applies to a rolled-back transaction: it is as if it never existed. All changes that may have been underway during the transaction are reversed and nothing remains of the transaction itself.
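Oracle's deferrable constraints illustrate this commit-time behavior nicely: a constraint declared DEFERRABLE INITIALLY DEFERRED may be violated while the transaction is being composed and is only enforced when the transaction commits. A minimal sketch, assuming the hypothetical EMPLOYEES and DEPARTMENTS tables from earlier, with the foreign key redefined as deferrable:

    -- Make the foreign key a commit-time (deferred) constraint.
    ALTER TABLE employees DROP CONSTRAINT emp_dept_fk;
    ALTER TABLE employees ADD CONSTRAINT emp_dept_fk
      FOREIGN KEY (department_id) REFERENCES departments (department_id)
      DEFERRABLE INITIALLY DEFERRED;

    -- Inside the transaction the constraint may be temporarily violated:
    INSERT INTO employees (employee_id, name, job, salary, department_id)
    VALUES (101, 'JOHNSON', 'CLERK', 3000, 42);   -- department 42 does not exist yet

    INSERT INTO departments (department_id, name)
    VALUES (42, 'RESEARCH');                      -- now the reference is satisfied

    COMMIT;  -- the deferred constraint is validated here; had department 42 not
             -- been created, the commit would fail and the transaction would be
             -- rolled back

A ROLLBACK issued at any point before the commit would likewise leave no trace of either insert.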

Session

A session is the context in which a consumer interacts with the system - both for retrieving data and for executing transactions that manipulate data. A consumer may, and frequently does, correspond to an end user, but that consumer might also be an automated agent (such as a batch procedure), an external service consumer, or an automated business process. Today's systems typically deal with many simultaneous sessions - from dozens to thousands or even tens of millions in the case of global systems such as Twitter, Facebook, and Google Mail.

Sessions, like transactions, are private: sessions do not interact with each other. Actions in one session are not visible in another, nor do they impact other sessions. Only when a transaction in one session commits are other sessions able to experience the effects from that transaction. Note, however, that a session may claim a resource - through a lock - and may thereby impact other sessions that at that moment are prevented from claiming the same resource. Examples of resources include individual records, entire data collections, or a custom token.

The notion of a session—and even that very term—is used at various levels. A user in a web browser performs actions in a session. This browser session starts when a user arrives at a certain web site or accesses a web application, and lasts until the site is abandoned, the user exits the application, or the browser is closed. An agent accessing a web service may do so in a session that includes multiple interactions. A batch process interacting with a data store may perform many operations in one or more transactions that all take place through the same connection with the data store and in the same session context.

Figure 7

Associated with the concept of a session are the terms connection (with a system such as an application or a data store) and identity (of the agent who initiated the session using authentication based on credentials).

Anatomy of Constraint Enforcement

The easiest way to enforce constraints and reach integrity nirvana is refusing to allow any data manipulation at all. Only through change can the existing situation be taken from a state of complete integrity to one that lacks that integrity. By preventing change, the status quo is preserved.

Obviously that is not a viable strategy, but it does clearly illustrate a point: change is the trigger for enforcement. We do not have to continually check all our constraints—we only do so when changes are made to the data. And we do not have to check every constraint for any data manipulation: changing the name of a customer is not likely to cause a product price constraint to be violated. Also, assuming we have an attribute constraint on customer names, changing one customer's name would not then require a check of all untouched customer names to verify continued adherence to the constraint. We can focus on the customer record(s) that have undergone a name change.

That point lays the foundation for the implementation of constraint enforcement:

  • Enforce only when data is changed.
  • Enforce the constraints that are potentially violated by the type of data change.
  • Enforce the constraints for the records that are in danger of violating the constraints given the data changes taking place.

We already know that enforcement is necessary only as part of the transaction commit. Before that time, while the transaction is being composed and the data is still in limbo, the constraints may be violated. Adherence to constraints at any moment prior to the commit is irrelevant: it only counts at commit time. Note that there must be no room for the change to somehow become invalid during the time between its validation and the commit. In a multi-session environment, this is not trivial. See the section "Pulling the Rug," below, for more on this.

If we know which data manipulation events can cause a constraint to be violated and we know which data manipulations a transaction contains, we can map the latter onto the former and determine which constraints to validate at transaction commit time.

The implementation of an efficient and robust data constraint validation mechanism requires a careful analysis of the constraints: which data manipulations—create, update, delete—to which entities (and in case of updates: which attributes specifically) have the potential of violating the constraint. Additionally, at run time for a transaction this mechanism needs to be able to determine: which entities undergo which change—create, update, delete—and for updates: which attributes are touched. Together, these factors tell us which constraints need to be validated for which records.

The table below contains an example of the constraint data manipulation analysis for a number of constraints of different types:

Scope: Attribute
  Constraint: The value of the Salary attribute must be an integer between 1000 and 5000
  Data Event Analysis: Create; Update of Salary attribute
  (Figure 1)

Scope: Tuple (record)
  Constraint: For employees with a Job (attribute value) equal to SALESMAN, the value of the Salary attribute should not be higher than 2500
  Data Event Analysis: Create; Update of Salary or Job attribute
  Constraint: The sum of the attributes Salary and Commission may not be higher than 10000
  Data Event Analysis: Create; Update of Salary or Commission attribute
  (Figure 2)

Scope: Entity (table)
  Constraint: The name of an employee should be unique within a certain job
  Data Event Analysis: Create; Update of Name or Job attribute
  Constraint: The difference between the maximum salary for all employees with the same job and their average salary may not exceed 30% of the average salary
  Data Event Analysis: Create; Delete; Update of Salary or Job attribute
  Constraint: An employee who is the manager of other employees should either have the Job attribute value of MANAGER or should have the same Job attribute value as each of his/her subordinates
  Data Event Analysis: Create; Update of Job attribute or Manager reference
  (Figure 3)

Scope: Inter-Entity (cross-table)
  Constraint: Referential integrity (foreign key): any value used for the Department attribute must exist in the DepartmentId attribute of a record in the Departments table
  Data Event Analysis: Create (of Employee); Delete (of Department); Update of Department reference (of Employee); Update of DepartmentId attribute (of Department)
  (Figure 4)

 

With the event analysis of the constraints at our disposal and a way to determine for a pending transaction which data manipulations it consists of, the constraint validating mechanism can do its job. It is to be kicked off when the transaction is complete and submitted for commit by the user or the agent owning the session. It performs the validations in the context of that session, in which all pending changes are visible together with all previously committed changes from other sessions. Any uncommitted changes that may be pending in other sessions are not visible, and since they have not been committed (and may never be) they are irrelevant for the constraint validation of the transaction.

Figure 8: Constraint Enforcer

One possible algorithm for the Constraint Enforcer could be (a schematic sketch follows the list):

  • Determine which Entities and Attributes are changed through which manipulation in the current transaction.
  • Determine which constraints are potentially violated by these changes and should therefore be enforced.
  • Determine for which records the enforcement must be performed, given the scope of the transaction.
  • Iterate over all constraints and records; validate the constraint for the record(s); ensure that at this point there cannot be a commit in a different session that can impact the validity of the records with regard to the constraint just checked (a locking mechanism for this is introduced shortly).
  • Report findings: either all validations succeeded and the transaction gets the thumbs up, or one or more validations failed, in which case the transaction cannot proceed and the constraint violations are reported.
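The sketch below shows what such an enforcer could look like in PL/SQL. It is schematic, and everything in it is an assumption made for illustration: the TRANSACTION_CHANGE_LOG table (populated during the transaction, for example by triggers), the CONSTRAINT_EVENT_REGISTRY table (holding the outcome of the data event analysis), and the per-constraint validation procedures, which are expected to raise an exception on violation.

    -- Schematic constraint enforcer, run when the transaction is submitted for commit.
    CREATE OR REPLACE PROCEDURE enforce_constraints IS
    BEGIN
      -- 1. Which entities and attributes were changed, and how, in this transaction?
      FOR chg IN (SELECT entity_name, attribute_name, dml_type, record_id
                  FROM   transaction_change_log) LOOP
        -- 2. Which constraints are potentially violated by this type of change?
        FOR con IN (SELECT reg.constraint_name, reg.validation_proc
                    FROM   constraint_event_registry reg
                    WHERE  reg.entity_name = chg.entity_name
                    AND    reg.dml_type    = chg.dml_type
                    AND   (reg.attribute_name IS NULL
                           OR reg.attribute_name = chg.attribute_name)) LOOP
          -- 3./4. Validate the constraint for the affected record; the validation
          --       procedure takes the constraint lock discussed in the next section
          --       and raises an exception when the constraint is violated.
          EXECUTE IMMEDIATE
            'BEGIN ' || con.validation_proc || '(:record_id); END;'
            USING chg.record_id;
        END LOOP;
      END LOOP;
      -- 5. Reaching this point means all validations succeeded; any exception raised
      --    above aborts the commit and reports the violation.
    END enforce_constraints;
    /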

Pulling the Rug - Constraint Validation in a Multi-Session Environment

In a multi-session environment, changes committed in other sessions may turn a data manipulation that has already been validated successfully against all constraints from perfectly valid into a constraint violation. This, in turn, could lead to invalid data being committed.

Figure 9 shows a simple example. An entity has been defined with a single constraint that states that the NAME attribute is unique and therefore must have a different value for each record in the collection. Initially this condition is met. Two sessions are active. Both sessions create a new record with the Name attribute set to the value JOHNSON.

Constraint enforcement for one session is kicked off when the user commits the transaction. Since there is only one record with NAME='JOHNSON' in the scope of this session, the change validates against the uniqueness constraint. Meanwhile, in the second session the transaction has also been committed - or rather, submitted for commit. The constraint enforcer does the same thing for the second session as it does for the first. And it too finds no fault with the creation of a record with NAME='JOHNSON' — because no such record currently exists in the scope of this second session. The second transaction is also given the thumbs up and may proceed.

Figure 9: Multi-Session Environment

As a result, both records are successfully committed. However, this results in a clear violation of the uniqueness constraint, something that we set out to prevent. Note that this violation could occur because the creation in the first session was not visible in the second session (because the commit of session 1 was not yet complete when the validation took place in session 2). Figure 10 shows this: the yellow ellipse indicates a validation in session two that does not include the changes from session one, and the orange ellipse indicates a validation that does include the effects from the transaction in session one.

Figure 10: Violation of the Uniqueness Constraint

We seem to be facing a Catch-22: the validation in session two cannot include the changes from session one before they have been committed, yet it cannot ignore these changes that so clearly collide with the manipulation in session two. The solution to this challenge is the use of exclusive locks. Such locks can be used by one session to prevent other sessions from making changes that can result in invalid outcomes, as shown in the above example.

These locks can be used in a very coarse-grained way, making it impossible for other sessions to safely perform many operations. For example, for the Name uniqueness constraint from this example, a lock could be used to prevent any update or create operation in other sessions once a create or update of Name has been performed. This is an approach destined for lots of frustration, though quite unnecessarily so. A fine-grained approach can be selected instead, one that does not use locks to prevent data manipulation, but rather uses locks that claim the exclusive right to perform a constraint validation. In the previously discussed example, the lock would be used as illustrated in Figure 11.

Figure 11: Locks

When session one starts the validation of the uniqueness constraint, it tries to acquire a system-wide lock on that constraint. Because no other session currently holds that lock, the lock is granted and session one can proceed. As before, it does not find a problem with the new JOHNSON record.

Session two, meanwhile, has also started the validation phase of its transaction. It also must validate against the uniqueness constraint for the Name attribute. It tries to acquire the PSN_UK23 lock for that constraint, but fails to do so. We know why: it is currently held by session one, which is going through its own validations. Session two can either wait until the lock becomes available, or stop processing and return with an exception. If it waits, it will be able to continue as soon as the lock is relinquished by session one. This happens when session one commits. At that moment, session two can start the validation of its transaction against the uniqueness constraint. And at that time, this validation will include the changes from session one, including the new record with NAME='JOHNSON'. This means that the validation against the uniqueness constraint will fail for session two and the transaction will be rejected and possibly rolled back.

Note that our lock can be defined in an even more fine-grained way: instead of taking the lock for the constraint PSN_UK23, we can take a lock on that constraint in conjunction with the new value of the Name attribute, in this case {PSN_UK23, JOHNSON}. That means that validations of the constraint for other values of Name can still go ahead and acquire a lock even while session one is chewing on PSN_UK23 for the JOHNSON value.

As we will see further on in this article, this fine-grained locking approach is essentially how the Oracle Database internally enforces its declarative constraints.
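In a custom enforcement mechanism built on the Oracle Database, such a fine-grained constraint lock can be implemented with the DBMS_LOCK package: a user lock id is derived from the constraint name plus the value being validated, and the lock is held until the transaction ends. A sketch; the PERSONS table, the constraint name PSN_UK23, and the error codes are assumptions made for this example:

    -- Serialize validation of uniqueness constraint PSN_UK23 per Name value.
    CREATE OR REPLACE PROCEDURE enforce_psn_uk23 (p_name IN VARCHAR2) IS
      l_lock_id INTEGER;
      l_result  INTEGER;
      l_count   INTEGER;
    BEGIN
      -- Derive a user lock id from constraint name plus value: {PSN_UK23, JOHNSON}
      -- rather than a lock on the whole constraint. (A hash collision only causes
      -- some unnecessary serialization, never an incorrect result.)
      l_lock_id := DBMS_UTILITY.GET_HASH_VALUE('PSN_UK23#' || UPPER(p_name),
                                               0, 1073741824);

      -- Exclusive lock, released automatically when this transaction commits or
      -- rolls back; other sessions validating the same value must wait.
      l_result := DBMS_LOCK.REQUEST(id                => l_lock_id,
                                    lockmode          => DBMS_LOCK.X_MODE,
                                    timeout           => 10,
                                    release_on_commit => TRUE);
      IF l_result NOT IN (0, 4) THEN   -- 0 = granted, 4 = already held by this session
        RAISE_APPLICATION_ERROR(-20010, 'Could not acquire lock for PSN_UK23');
      END IF;

      -- The validation sees all committed data plus this session's pending changes.
      SELECT COUNT(*) INTO l_count FROM persons WHERE name = p_name;
      IF l_count > 1 THEN
        RAISE_APPLICATION_ERROR(-20011, 'Name ' || p_name || ' violates PSN_UK23');
      END IF;
    END enforce_psn_uk23;
    /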

Global Transaction and Distributed Transaction

Up to this point our discussion has not focused on a specific environment. The data store to which the data constraints have to be applied can be a relational database, an in-memory data grid, a content management system, a file system, cloud-based storage, or any other mechanism for managing data. We did, however, assume that we were dealing with a single data store that handles data, transactions, sessions, and constraint enforcement. But in actual practice that may not always be the case. Therefore, we have to introduce another level of complexity into our story: multiple data stores and transactions that span data stores. If the data can be held in multiple data stores, we can describe the data integrity for each data store on its own and leave it to consumers to interact with each of these stores in turn. That would be too much of a short-cut, however.

Figure 12: Spanning Multiple Data Stores

Especially in service-oriented environments, we may not want consumers to know about multiple data stores. In the decoupled world, with encapsulated service implementations, we probably want to offer entity services and composite business services that expose operations that may span multiple data stores.

The transaction a user or agent is trying to execute is a logical unit. Either all of the transaction succeeds - or all of it fails. The transaction should not be partially committed - with one data store changed but not the others.

This is where the concept of a global transaction comes in, sometimes referred to as a distributed transaction. The global transaction is a single, coherent, logical unit of work that should be committed - or not committed - in its entirety. The global transaction spans two or more transactional resources, such as data stores. Each of these resources is capable of managing its own transactions, and the data integrity of such transactions.

The challenge with global transactions is in extending the transaction mechanism of each of these resources to a higher level, where they participate in a joint transaction in which all resources together coordinate success (commit) or failure (rollback). Such coordination usually involves two phases (hence the term Two-Phase Commit, or 2PC): each transactional resource starts a session, applies all changes, and performs all constraint validations. The resource then reports its success (or failure) to the transaction coordinator. Note that at the end of phase one, locks will be held in each transactional resource, on the records that have been changed and on the constraints that have been validated for these records, in order to ensure that the rug is not pulled by other (distributed or local) transactions.

Figure 13: Phase 1

If all resources have reported success, the global transaction has been given the thumbs-up from all parties involved and the second phase can start. The commit is completed in each transactional resource, publishing the changes and releasing all locks. Note that if even one of the transactional resources reports failure (for example, because of a constraint violation) the global transaction is aborted and each resource is instructed to roll back all changes.
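When both data stores are Oracle databases, the database itself can act as the coordinator: a transaction that changes data locally and over a database link is automatically completed with a two-phase commit. A minimal sketch, assuming a hypothetical database link named datasource_b to the second database:

    -- One logical unit of work spanning two data stores; Oracle runs the prepare
    -- and commit phases behind this single COMMIT.
    INSERT INTO employees (employee_id, name, job, salary, department_id)
    VALUES (102, 'SMITH', 'CLERK', 2000, 42);          -- local data store

    INSERT INTO departments@datasource_b (department_id, name)
    VALUES (43, 'OPERATIONS');                         -- remote data store

    COMMIT;  -- phase 1: both databases prepare; phase 2: both commit.
             -- If either prepare fails, both are instructed to roll back.

In heterogeneous environments the same coordination is typically provided by an XA-capable transaction manager, such as the one used by Java EE application servers and Oracle SOA Suite.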

Distributed Constraints

Inter-entity constraints involve two or more data collections. Potentially, not all of these data collections are part of the same data store. Validation of such constraints therefore may cross the boundaries between data stores.
Figure 14: Phase 2

From a conceptual point of view, as well as during implementation, the requirement to use global transactions that span multiple data stores is not trivial. Even though there is robust support for distributed transactions in many environments—including Java/JEE, Oracle SOA Suite, and Oracle Service Bus—the added complexity is not to be taken lightly.

Figure 15

This brings up an interesting and sometimes heated discussion about how to deal with this situation: where to enforce such distributed constraints and how to implement the distributed locks that prevent transactions from interfering with each other. Some even question the validity of constraints that span transactional resources, arguing that independent data stores are not truly independent if constraints are defined between them.

In addition to this theoretical debate, there is also the matter of the effort required to implement constraint enforcement and the question of the capabilities of the technology involved. Will the constraint enforcer in one data store even be able to access the data in another data store? Will it be able to request a lock on the constraint in the other data store(s) and release that lock in the appropriate way?

This section discusses the implementation of a cross-Data Store service (using Oracle SOA Suite) that not only uses a global, distributed transaction but also takes care of cross-Data Store constraints. In short, the general algorithm for this situation would have to be something like:

  • Transaction is submitted for commit by the agent or end user.
  • A global transaction is initiated.
  • In the context of this global transaction, transactions are started for each data source involved; the data manipulations submitted by the agent or end user are routed to the data source that should handle them.
  • Each data source performs phase one of the 2PC cycle: applying changes, taking locks on changed records and on data constraints to be validated, and enforcing constraints in the normal (local) manner.
  • Report the result of phase one back to the transaction manager: either all changes were applied successfully and all constraints were successfully enforced, or they were not. In the latter case, the global transaction is aborted and all data sources are instructed to roll back their local transaction, reverting all changes and releasing all locks.
  • Determine which distributed constraints are potentially violated by the manipulations in the global transaction and determine for which records the enforcement of these distributed constraints must be performed.
  • Iterate over all these distributed constraints; acquire a lock on the constraint (possibly in combination with specific value(s) based on the records involved); validate the constraint using the data in all involved data sources, which is now in the final state it will have at the end of the current transaction. Note that validation of the distributed constraint needs to take place in the context of the current global transaction in order to see all the effects of that transaction - it cannot use a separate session for this validation.
  • If all triggered distributed constraints were enforced successfully, then the global transaction may proceed (if not, the global transaction is aborted, all data sources are instructed to that effect, and the constraint violations are reported).
  • Every data source involved in the global transaction performs a local commit and releases all locks held on changed records or on enforced constraints; finally any global resource locks are released as well.

An interesting question comes out of this algorithm that must be addressed: where is the lock acquired for a distributed constraint? The lock is needed to prevent multiple concurrent transactions from each performing a validation of the same constraint at the same time, each in their own context with only their own changes in scope, with the risk of together producing an invalid situation. The lock ensures that only one transaction can enforce the constraint including its own uncommitted changes. If all transactions were global transactions, a global lock from some lock provider service would be sufficient. Frequently, however, there will be local transactions against each individual data source in addition to the global transactions. In those cases local locks are required. We will discuss this in more detail a little bit further on.
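One pragmatic way to implement such a lock, without a dedicated global lock provider, is to take the same named lock in every data source involved, inside the same global transaction. The sketch below uses a small CONSTRAINT_LOCKS table that is assumed to exist in each data source, seeded with one row per distributed constraint, and a hypothetical database link named datasource_b:

    -- Locking 'the same' row in both stores serializes enforcement of the constraint.
    CREATE OR REPLACE PROCEDURE lock_distributed_constraint (p_constraint IN VARCHAR2) IS
      l_dummy NUMBER;
    BEGIN
      -- Local data source
      SELECT 1 INTO l_dummy
      FROM   constraint_locks
      WHERE  constraint_name = p_constraint
      FOR UPDATE;

      -- Remote data source, reached over the database link
      SELECT 1 INTO l_dummy
      FROM   constraint_locks@datasource_b
      WHERE  constraint_name = p_constraint
      FOR UPDATE;

      -- Both row locks are held until the global transaction commits or rolls back.
    END lock_distributed_constraint;
    /

A finer-grained variant could add a lock value column and a row per combination of constraint and value, mirroring the {PSN_UK23, JOHNSON} idea discussed earlier.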

Distributed Referential Integrity

Let us now look at an example of a simple distributed constraint and how the generic algorithm introduced above would play out for this constraint. Data Source A contains a data collection of Employee records, Data Source B has a data collection of Department records, and there is a distributed constraint EMP_DEPT_REF1 that states: "each Employee record must refer through its DepartmentId attribute to a Department record in the collection in Data Source B."

Figure 16: Simple Distributed Constraint

A quick data event analysis for this distributed constraint tells us that this constraint should be enforced for these data events:

  • Create of Employee (in Data Source A).
  • Update of Employee - Attribute DepartmentId (in Data Source A).
  • Delete of Department (in Data Source B).

We will assume that the identity attribute of the Department records used for the reference from Employee cannot be updated. The above means that data manipulation events in both Data Source A and Data Source B can trigger the enforcement of the Distributed Referential Constraint EMP_DEPT_REF1. Both Data Sources need to be aware of that fact. And both Data Sources need to have a way to make this enforcement happen — somehow, somewhere.

Figure 17 shows the commit cycle for the case where a new employee is created:

Figure 17: Commit Cycle

The steps can formally be described as follows:

  • Transaction is submitted for commit by the agent or end user.
  • Write the pending data - the new employee record - to Data Source A.
  • The 'create employee' event triggers a single data constraint in this example: EMP_DEPT_REF1. A lock should now be acquired by this transaction on that constraint in the context of the Department indicated in the DepartmentId attribute of the new employee record. It is imperative that this lock is taken in both Data Source A and Data Source B, or at some global level that spans these two Data Sources (a sketch of this enforcement follows these steps).
  • If the constraint was enforced successfully, then the transaction may proceed. The local commit is completed and the lock held on the constraint is released (globally and/or in both Data Source A and Data Source B).
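A sketch of what the enforcement on the 'create employee' side could look like in Data Source A, reusing the hypothetical tables, database link, and locking procedure introduced earlier (for brevity the lock is taken at constraint granularity rather than per department):

    -- Enforce EMP_DEPT_REF1 for a newly created or re-assigned employee.
    -- Runs inside the (global) transaction that made the change.
    CREATE OR REPLACE PROCEDURE enforce_emp_dept_ref1 (p_department_id IN NUMBER) IS
      l_dummy NUMBER;
    BEGIN
      -- Acquire the distributed constraint lock in both data sources.
      lock_distributed_constraint('EMP_DEPT_REF1');

      -- Validate: the referenced department must exist in Data Source B.
      -- The query runs in the context of the current (global) transaction.
      BEGIN
        SELECT 1 INTO l_dummy
        FROM   departments@datasource_b
        WHERE  department_id = p_department_id;
      EXCEPTION
        WHEN NO_DATA_FOUND THEN
          RAISE_APPLICATION_ERROR(-20020, 'EMP_DEPT_REF1 violated: department '
            || p_department_id || ' does not exist');
      END;
    END enforce_emp_dept_ref1;
    /

The mirror-image check for the 'delete department' event in Data Source B would, after taking the same lock, verify that no employee in Data Source A still references the department being deleted.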

Figure 18 shows a situation that could occur when the lock on the distributed constraint is not taken for both Data Sources:

Figure 18

If Data Source B is not covered by the lock acquired by the transaction that creates the new employee in Data Source A, the following might happen: while the creation of the employee is happening in one transaction, another transaction is removing the Department that the new Employee refers to. If that concurrent transaction in Data Source B could acquire its own lock on EMP_DEPT_REF1, it would continue, checking for any employee referencing the Department. Failing to find the new employee, which is not yet committed, it would then remove the Department that the new Employee believes it is referencing (and because the removal is not yet visible in the first transaction, constraint enforcement there also succeeds).

Example of Distributed Constraint Enforcement and Global Transaction

Let us assume again the situation of two independent Data Sources A and B with data collections Employees and Departments respectively. Let us further assume the following data constraints defined for these collections (the local constraints could be declared as sketched after this list):

Employees:

  • Salary < 5000 (EMP_ATT_1).
  • Not more than one Employee with Job equal to MANAGER per Department (EMP_ENT_1).
  • For employees with a Job (attribute value) equal to SALESMAN the value of the Salary attribute should not be higher than 2500 (EMP_TPL_1).

Departments:

  • The Name of the Department must be unique (DEPT_ENT_1).
  • No Departments should be located in Boston (DEPT_ATT_1).

Distributed:

  • The DepartmentId used in Employee records (in Data Source A) to indicate a Department should in fact correspond with an Id value for exactly one Department (in Data Source B) (EMP_DEPT_REF1).
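The local constraints could be declared in each data source roughly as follows; the DDL reuses the hypothetical tables from earlier and assumes the DEPARTMENTS table also has a LOCATION column:

    -- Data Source A: Employees
    ALTER TABLE employees ADD CONSTRAINT emp_att_1 CHECK (salary < 5000);
    ALTER TABLE employees ADD CONSTRAINT emp_tpl_1
      CHECK (job <> 'SALESMAN' OR salary <= 2500);
    -- EMP_ENT_1 ("not more than one MANAGER per Department") can be expressed with
    -- a function-based unique index: only MANAGER rows are indexed.
    CREATE UNIQUE INDEX emp_ent_1 ON employees
      (CASE WHEN job = 'MANAGER' THEN department_id END);

    -- Data Source B: Departments
    ALTER TABLE departments ADD CONSTRAINT dept_ent_1 UNIQUE (name);
    ALTER TABLE departments ADD CONSTRAINT dept_att_1 CHECK (UPPER(location) <> 'BOSTON');

    -- EMP_DEPT_REF1 spans both data sources and cannot be declared as a foreign key;
    -- it requires the distributed enforcement discussed in this article.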

Let us further assume a global transaction that hits both Data Sources with a number of data manipulations:

  • Create a new employee with DepartmentId equal to 42.
  • Update an employee: the Job is changed to MANAGER and the salary is changed to 4000.
  • An employee record is deleted.
  • The name of a Department is changed to Child Affairs.
  • Delete a department - with an identifier value of 567.

Now we need the full machinery of transaction coordination and constraint enforcement, following the steps that were described earlier on:

  • The transaction is submitted for commit by the agent or end user.
Figure 19
  • A global transaction is initiated.
  • In the context of this global transaction, transactions are started for each data source involved. The data manipulations submitted by the agent or end user are routed to the data source that should handle them: one create, one update, and one delete of an employee record to the blue data source, and one update and one delete of a department record to the red data source.
  • Each data source performs phase one of the 2PC cycle, applying changes, taking locks on changed records and on data constraints to be validated, and enforcing constraints in the normal (local) manner:
Figure 20
  • Report the result of phase one back to the transaction manager. Let us assume that all changes were applied successfully and all constraints were successfully enforced.
  • The global transaction - wherever it is started and coordinated - determines which distributed constraints are potentially violated by the manipulations in the global transaction and determines for which records the enforcement of these distributed constraints must be performed. Clearly, distributed constraint EMP_DEPT_REF1 runs a risk of being violated by this global transaction, for Department 42 (the one referenced in the new Employee record) and for Department 567 (the one deleted in the global transaction).
  • Iterate over all these distributed constraint instances; acquire a lock (from an as yet unidentified global lock coordinator) on the constraint EMP_DEPT_REF1 in combination with DepartmentId 42 and 567 respectively; validate the constraint using the data posted to the two sources. This validation cannot use a separate session, but should use the sessions used for posting the changes to the respective data sources. (Note: all local locks on data constraints that have been successfully enforced need to be retained until the global transaction is ready to commit.)
Figure 21
  • Note: depending on the implementation of distributed constraints and on the question of whether data sources can be accessed locally (without going through a global transaction coordinator), it may not be enough to hold only global locks on distributed constraints. It may be required to explicitly acquire locks for these distributed constraints in one or more of the data sources involved. Figure 22 shows an example of the locks on constraint EMP_DEPT_REF1. Note that such local locks in one or more of the local transactions could in fact be the way the global lock gets implemented when a global lock manager is unavailable.

    For as long as this global transaction has the locks on constraint EMP_DEPT_REF1 for departments 42 and 567, no other transaction can commit changes that could violate that constraint for these two departments. This is because these other transactions would also have to enforce that constraint for their changes but cannot do so as they cannot acquire the lock they need for enforcing. That is a good thing. In fact, it is exactly what we were looking for: a way to prevent other transactions from pulling the rug from under the constraint validation taking place in one transaction. No one should be able to delete department 42 while we are validating or have already validated that we can create a new employee in that very department.
  • If and when the triggered distributed constraint instances EMP_DEPT_REF1 for departments 42 and 567 are enforced successfully, the global transaction may proceed.
  • The global transaction informs the data sources that they may perform the local commit, thereby finalizing the changes and releasing all locks held on the changed records and on the enforced constraints. Finally, any global resource locks are released as well.
Figure 22

Summary

The business value of IT systems is determined by several factors, one of which is the quality of the data produced by the systems. This quality depends on the degree to which the data in the system is an accurate representation of the real world. One of the measures we can take to enhance the data quality is to enforce constraints on the data that are derived from business rules. More specifically, we need to ensure that transactions that manipulate the data in the system can only complete when all constraints are satisfied. Note however that sometimes the cost of enforcing all constraints immediately, during each transaction, is too high - in terms of performance overhead, implementation effort or technical feasibility - and an asynchronous, off-line validation is the better alternative.

This article introduces four categories of constraints, with a scope that ranges from a single attribute through a record to multiple records in multiple data collections that may even live in multiple independent data sources. Each category of constraints is triggered in different ways and is enforced in different ways, depending on the technology available in the data source.

This article also explains why robust, synchronous (or real time) constraint enforcement requires a locking mechanism to prevent undue interference between concurrent sessions and transactions. This locking mechanism can be fine-grained, locking only the constraint and the smallest possible data scope.

Distributed constraints and distributed transactions that span data sources require a slightly more involved procedure for validating transactions. After applying the changes in each data source and performing the per-data-source constraint enforcement, the distributed constraints need to be enforced, also using a constraint lock to prevent session interference. Only when all local validations succeed and the distributed constraints are enforced can all local transactions be committed.

Credits

The author wishes to thank Toon Koppelaars, ACE Director Alumnus, for his thorough review of this paper and his many valuable suggestions for improvement.

Resources

The following resources provided inspiration for this article as well as additional information.

Data Integrity in the Oracle Database

SOA Suite 11g, Database Adapter and BPEL, Global Transactions

About the Author

Lucas Jellema has been active in IT (and with Oracle) since 1994. An Oracle ACE Director specializing in Oracle Fusion Middleware, Lucas is a consultant, trainer, and instructor in diverse areas including Oracle Database (SQL & PLSQL), Service Oriented Architecture, ADF, and Java. He is the author of the Oracle SOA Suite 11g Handbook, and a frequent presenter at JavaOne, Oracle Open World, ODTUG Kaleidoscope, Devoxx, OBUG and other conferences.