by Porus Homi Havewala
Getting started with the Oracle Data Masking Pack
Published August 2010
Data confidentiality is an absolute must in today’s corporate world. As databases have expanded in every sphere of life, the plethora of information gathered in these databases is simply staggering.
These databases include personal and confidential information, which companies are duty-bound to protect due to confidentiality contracts with their customers and privacy regulations to safeguard personally identifiable information (PII). Companies must also ensure compliance with globally applicable standards and regulations, such as the worldwide Payment Card Industry (PCI) Data Security Standards, the Sarbanes Oxley Act (SOX) in the U.S., J-SOX in Japan, and the European Union Directive 95/46/EC on Data Protection.
In the real world, databases are often cloned or copied, either from production to test or from test to development. There might be multiple test or development copies. For example, it is normal for one production Oracle E-Business Suite instance to have ten or more corresponding test or development instance copies for use by several different teams of developers or departments.
Then, there is the possibility that external vendors request the data for their own application development or testing purposes. Market researchers might need data for analysis. Nowadays, due to the global economy paradigm, application vendors and market researchers might be in offshore countries, so your entire data set has the potential to end up in a different country—where confidentiality of data might not be strictly observed as in the data’s country of origin.
In all these cases, the database administrator (DBA) had to be responsible for the ultimate protection of data before providing an off-site or development copy of the database. Typically, this involved manually looking through all the tables and columns in the numerous databases in the DBA’s domain to attempt to find any columns that held potentially confidential data.
Custom SQL or PL/SQL scripts then had to be created to massage and mask the data to turn it into fake data that still looked realistic—a tall order for today’s complex enterprise applications.
A separate custom script had to be written to mask each separate database, even though the databases had similar columns that needed to be masked. There was no centralized way to share masking techniques across multiple databases.
Once the custom masking scripts were written and deployed, they had to be maintained manually for future changes to the columns containing sensitive data. As the number of databases increased, more and more scripts needed to be created and maintained, making it clear that this was not a scalable solution.
Custom masking scripts were also not very auditable, since auditors could not examine each and every script customized for every database. Therefore, auditors could not be satisfied as to the efficiency of the manual masking process.
Those were the days of the so-called DBA 1.0, when many DBAs struggled with command-line actions to manage an increasing number of databases that were also growing in size. All the day-to-day administrative tasks were done manually with scripts or without scripts: database backups, performance diagnostics and tuning, patch applications, and creation and management of standby databases.
And then Oracle Enterprise Manager entered the arena enabling a new breed of DBA—DBA 2.0—in the process.
This DBA saw the value in using the graphical user interface of Oracle Enterprise Manager and realized that this product would boost productivity, enable many tasks to get done sooner, and enable the “managing of many as one.”
It was this flagship enterprise management product from Oracle that streamlined and automated the day-to-day DBA tasks, leaving more time for the DBA to concentrate on really important things—the security, architecture, and ever-expanding new technology of Oracle Database and associated products.
Of course, for any medium-to-large site with hundreds of targets, the Oracle Enterprise Manager Grid Control site must be properly architected. For an overview of the required architecture, please refer to a recently published article by this author on Oracle Technology Network (OTN), Oracle Enterprise Manager Grid Control Architecture for Very Large Sites. This article lucidly explains how a large central Oracle Enterprise Manager Grid Control site was able to manage and monitor more than a thousand hosts containing multiple targets, such as databases, listeners, application servers, and so on.
Being aware of the continually increasing requirements to mask data in the real world, Oracle released Oracle Data Masking Pack for the first time in Oracle Enterprise Manager 10g Release 4 in order to assist DBAs with this issue.
Oracle Data Masking Pack is seen in Oracle Enterprise Manager Grid Control and in Oracle Enterprise Manager Database Control (from 11g Release 1 onward). The latter is used to monitor and manage only one database. Oracle Enterprise Manager Grid Control, on the other hand, is able to monitor and manage multitudes of databases, all from a single Web console accessing a central repository.
The centralized capabilities of Oracle Enterprise Manager Grid Control are well suited to the new Oracle Data Masking Pack. The idea is to store all data masking formats centrally in a library, and then apply them to any database that is managed by Oracle Enterprise Manager Grid Control.
These centralized data masking formats can be created by a corporate information security team, and then they can be handed over to the DBAs responsible for each project. The DBAs then create data masking definitions that map the specific columns and tables in their databases to suitable data masking formats supplied to them.
The masking definitions can then be applied on confidential data in any of the managed databases, and they can also be shared between DBA teams who are looking after different target groups of databases.
The Oracle Data Masking Pack, therefore, allows a central, corporate-wide strategy for compliance with privacy regulation, since data privacy rules can be defined and applied in a consistent manner to all the sensitive data in the corporate database space.
We will explore this in further detail in the following sections.
Oracle Enterprise Manager 10g Grid Control Release 5 is the release used for illustration purposes in this article. In this release, as well as in the current release (Oracle Enterprise Manager 11g Grid Control Release 1, at the time of writing), the Oracle Data Masking Pack is accessible under the Data Masking section on the Database>Schema tab. (see Figure 1). Whereas, in the very first release of Data Masking in Oracle Enterprise Manager 10g Release 4, this section was found in the Database>Administration tab.
Figure 1: Data Masking Section in the Schema Tab
Select the Format Library link to access a collection of predefined out-of-the-box masking formats. These formats are for common confidential data, such as MasterCard numbers, Visa numbers, Social Security numbers, and so on, as seen in Figure 2.
You can use the predefined formats for your data, or you can even create your own formats, as will be seen soon.
Figure 2: Format Library Screen
From the list of masking formats, select National Insurance Number Formatted and click the View button. This brings up the format entries that comprise this particular format (see Figure 3), which is for the UK.
From first glance, you can tell that there is an array list type of format entry, which allows a list of values that can be defined, and there is also a random digits type of format entry.
You can also specify a post-processing function, which, in this case, is supplied out of the box. The post-processing function can check the validity of the generated fields and do some extra formatting; in this case, spaces are being inserted between the generated values.
Figure 3: Format Entries Screen
Let us now examine the task ahead of us.
The IT Manager has authorized you, as the DBA, to create and supply a copy of the Human Resources table “Employees” to an external vendor for development purposes. However, all confidential data must be protected, and the onus is on you to ensure this is done.
For the purpose of this test, a copy of the Employees table is created as Employees_test. The data in this table is then examined to see which columns are likely candidates for the data masking process.
See Figure 4 for the fields of the table and the data contained in the table.
Figure 4: Employees_test Table
One of the confidential fields in this table is clearly the NRIC column, which is the National Registration Identity Card number in Singapore where the employees are based.
The NRIC number is similar to the Social Security number in the U.S. It is personally identifiable information (PII) that should be masked.
For the purpose of the masking demonstration, this article will assume that the NRIC column is the only column that needs to be masked in this table. Other columns, such as First_Name, Last_Name, Phone_Number, and Salary, are also columns that can be masked in real-life situations.
You decide that a new masking format is required for the NRIC column in the table.
From the Data Masking Definitions > Format Library page shown in Figure 2, click the Create button, which allows you to create your own specific data masking format.
See Figure 5 for the Create Format screen that appears.
Figure 5: Create Format Screen
The new format is named Singapore NRIC Number. There are no format entries initially, and you can add the entries one by one.
The drop-down box displayed in Figure 5 shows the various types of format types available. These formats are filtered so only the format types applicable to the column data type are displayed. The format types are known as the mask primitives. Examples are Random Digits, Random Dates, Random Strings, Fixed Strings, and so on.
One of the in-built masking routines provided here is Shuffle, which can be used to randomly shuffle the values in a column across different rows. This can be used when the format of the column is not known and it is not possible to generate values.
Data such as names can also be replaced by fictitious names from external data sources using the Table Column format entry. It is also possible to use User Defined Function to generate the masked value, as in the case of financial institutions that use algorithms to derive values such as an account number.
The Substitute mask format type can be used for deterministic masking, which means using the same type of masking for a data column in all tables of a database and even in multiple databases. For example, the employee number might need to be masked similarly in multiple tables and multiple databases. Using the Substitute mask format, each employee number is masked to a unique masked value using hash-based substitution, and the same mask value is generated for any specific input value.
The case of the NRIC number is a different matter. Since you are aware of the format of the NRIC number from officially supplied information, select Array List for the first format entry.
On the screen shown in Figure 6, type the valid alphabetical values that can start an NRIC number: S,F,T,G.
Figure 6: Array List Screen
In this manner, you combine the format entries using different types to make up the entire NRIC number. The next entry is specified as the Random Digits format with values ranging from 7 to 7, meaning that 7 random digits will be generated.
This is followed by a format entry type of Random Strings of string length 1 to 1, meaning a single letter. See Figure 7.
Sample NRIC numbers are immediately generated and displayed at the bottom of the page, based on the format entries you specified, so you can see if you are on the right track.
No post-processing function is specified. In this case, you could possibly specify a function to make the last letter uppercase.
Figure 7: Create Format Screen
When you click the OK button, this new masking format appears in the library of masking formats, as was seen in Figure 2.
Go back to the Database > Schema tab and click Definitions under the Data Masking section (see Figure 1).
This is where you can select the actual data columns to be masked and associate them with the masking formats in the library. The data definition is itself a mapping between columns and masking formats.
The Data Masking Definitions screen appears (see Figure 8). This screen shows a list of all definitions that have been defined in this database. The list is empty, since there are no definitions at present.
Figure 8: Data Masking Definitions Screen
You can import a data masking definition on this screen or create a new one.
Click the Create button. The Create Masking Definition screen that appears allows you to create the HR Employees_Test Masking Definition.
Select the actual table and column in the definition—the NRIC column in the HR.EMPLOYEES_TEST table (see Figure 9).
You can add multiple columns in different tables in the database to a definition on this screen, but we will keep only this one column for this particular definition.
If there are any foreign key columns pointing back to the columns you have chosen, the foreign key columns are automatically discovered and added at this point to maintain referential integrity.
There is also a facility to manually add dependant columns, which can be columns referencing the masked column at the application level, since some applications, such as Oracle E-Business Suite and Oracle PeopleSoft applications, do not use foreign keys in the database but instead prefer to handle referential integrity at their code level.
If the foreign key columns or the dependant columns are used in this way and are also masked, the integrity of the application will be maintained even with the masked data.
In the old way of using manually written masking scripts, it was possible that the other columns were not included in the masking, thus breaking the integrity and making most of the manual masking process useless, since applications will not work with masked data that does not have referential integrity.
Figure 9: Create Masking Definition Screen
In Figure 9, you have added the NRIC column in the HR.EMPLOYEES_TEST table. A red spanner icon is visible under Format, which indicates you need to set up the masking format for this column. Click this spanner icon.
The Define Column Mask screen appears (see Figure 10). On this screen, which shows the column you are masking and the table and schema it is contained in, you can either add the masking format entry directly via the Add button, or you can click the Import Format button to retrieve the format from the Format Library.
Use the Import Format button, since you already have a defined masking format for the NRIC number.
Figure 10: Define Column Mask Screen
You can now select the masking format you want from the Format Library list, which appears in Figure 11.
Select the Singapore NRIC Number and proceed.
Figure 11: Format Library List
This brings the masking format you have selected back into the Define Column Mask screen, where you can associate it with the NRIC column.
Figure 12 displays the format entries that have been retrieved from the Singapore NRIC Number masking format in the library.
It is also possible to change the format at this stage if you want it to be slightly different from the one defined in the Format Library.
Figure 12: Define Column Mask Screen
On the Define Column Mask screen, it is possible to set up the new feature Conditional Data Masking. This is done by clicking the Add Condition button and then typing in a separate condition for which a different masking format can be applied.
For example, in Figure 13, a new condition is specified as “last_name like upper('%RAJAH%')” and if this condition is satisfied, the format entry selected is Preserve Original Data.
This means that for all employees with the string “RAJAH” in their last name, the NRIC number will not be changed in the masking process and will stay the same.
For the new condition, you can either add a format entry directly, or you can import a format from the library on this screen.
In our case, we do not need such a condition. All NRICs must be masked regardless of the employee names, so we remove the added condition.
Figure 13: Conditional Data Masking
Click the OK button, which brings you back to the Create Masking Definition screen. The spanner icon is no longer visible, since you defined the masking format for this column (see Figure 14). The spanner icon is now replaced by an edit icon (pencil and paper).
In this way, if there are multiple columns defined in the masking definition, you can associate a masking format to each of the columns by clicking the spanner icons or the edit icons.
Other advanced options on the Create Masking Definition screen allow you to disable database logging (redo log generation) during masking, refresh table and index statistics after the masking process, drop the temporary tables created during masking, and use parallel execution, when possible, to speed up the masking process for execution on large tables.
If redo log generation is enabled, you have the opportunity to roll back masking changes using Flashback technology in the database, in case the masking results were not satisfactory. If redo log generation is disabled, the masking process might be faster on large tables.
The temporary tables are the mapping tables generated during the masking process. They can be retained to allow business users to map the changed data to the original data, if this is needed to verify application integrity.
Otherwise, if space is at a premium and is to be reclaimed, the temporary tables can be automatically dropped right after the masking process is completed.
Figure 14: Advanced Options
There is one other point to note.
When you are adding columns on the Create Masking Definition screen, it is possible to select Mask selected columns as a group.
This option is seen in Figure 15, where the FIRST_NAME and LAST_NAME columns have both been selected, and this option has been selected.
This process is known as compound masking, and it is another new feature in the Oracle Data Masking Pack. Compound masking allows columns to be associated together for the masking process, so the masked data appears consistent.
Figure 15: Compound Masking
A better example of compound masking would be fields such as city and country. If these are masked, the application might still expect the associated information to be correct.
The city of Mumbai, which is a famous city in India, cannot be associated with the country of Iran after the masking. Neither can Teheran, the capital of Iran, be associated with India. Mumbai, India and Teheran, Iran must always be grouped together even in the masking process.
When masked data needs to be credible in this manner across fields, you can use the compound masking option.
The result of grouping the masked columns together is seen when you click the spanner icon on the Create Masking Definition screen (see Figure 9 for the spanner icon on this screen).
You are asked to define a group mask for the two columns. Figure 16 shows the Define Group Mask screen.
Figure 16: Define Group Mask Screen
The options visible for the format type with group masking are only Shuffle, Substitute, Table Column, or User Defined Function.
The other format types possible with single column masking (such as Random Digits and Random Strings), are not visible and cannot be used with group masking. Only the visible format types can be used.
For example, for the case of the City and Country columns, you could use Shuffle to change these columns among different rows but keep the values together. So Mumbai, India would stay together and just move to a different row.
Remember to deselect Preserve Original Data for both the columns and then select Shuffle; otherwise, there will be no changes to the data in these columns and they will not be shuffled or changed in any other way.
Back on the screen shown in Figure 14, click OK to save the new masking definition you created.
The main Data Masking Definitions page now appears (see Figure 17), where the newly created HR Employees_Test Masking Definition can be seen in the list. This masking definition is now centrally available in the Oracle Enterprise Manager Grid Control system.
Figure 17:Data Masking Definitions Screen Showing New Masking Definition
The Export button on this page allows you to export the masking definition as a file, in a portable XML format called Application Masking Template. The exported definition can then be shared with other Oracle Enterprise Manager repositories, if required.
The exported definition can also be used by application vendors and supplied with new applications to customers, so customers get an out-of-the-box mapping of mask formats to sensitive data that can be deployed in a few clicks.
Now, you can generate the actual data masking script that will be used by Oracle Enterprise Manager to mask the data. Click the Generate Script button.
This generates the script. (A warning is given that the generation process might take up to 15 minutes to complete.)
Initially, a series of validation steps are performed to ensure that there will be no errors in the actual data masking process. The validation steps include space availability checks (since a replacement table will be created during the process) and validation of the masking formats used to make sure they match the type and length of the masked columns.
The checks also ensure that the masking formats meet the integrity requirements of the database and application by generating unique values for the column being masked, if there are unique constraints.
Finally, the script is displayed on the screen (see Figure 18).
The Impact Report is displayed at the bottom of the page. This report is the result of the validation process and shows that there are no errors, so you can proceed with the masking.
A PL/SQL script is generated. You can either view the script summary or the full script. It is possible to save the full script at this stage on your workstation.
Figure 18: Script Generation Results Screen
You can examine the generated script to get an idea of the actual masking process. The full script can be seen in Listing 1.
Oracle Enterprise Manager Grid Control has obviously saved you the time and effort of writing this PL/SQL code.
In the highly optimized PL/SQL code that is generated, bulk operations are used to replace the original table containing confidential data with a masked-data copy of the table. The original indexes, constraints, partitions, and grants are all retained.
Advanced features in the database, such as parallel execution and the disabling of redo log generation, are used to speed up the masking process. At the end, the original table with the confidential data is dropped and the new masked table takes it place.
It is possible to schedule the data masking job directly from the screen shown in Figure 18, the Script Generation Results screen. Click the Schedule Job button.
Clicking this button allows you to create an Oracle Enterprise Manager job that will execute the script at the time and date you specify or immediately. The host credentials also need to be specified at this stage. See Figure 19.
Figure 19: Schedule Data Masking Job Screen
Note that the masking script must be generated before the masking job can be scheduled. Even though there is a Schedule Job button on the main Data Masking Definitions page (Figure 17), this button will not work unless the script has been generated.
You can also click the Save Full Script button on the Script Generation Results screen (Figure 18) to save the PL/SQL masking script as an external file.
Having an external file allows you to use a scheduler to run the script manually, if your database environment is command-line based and does not use Oracle Enterprise Manager to manage the databases.
The masking job executes immediately and succeeds in a few seconds since the table is small. The execution steps that succeeded are shown in Figure 20.
Figure 20: Job Run Results
It is possible to drill down on the execution steps to see the actual output log, which is shown in Figure 21.
This is the execution result of the script we examined in the code listing.
Figure 21: Output Log
If you scroll down on this page, the ending comments in the output log can be seen as follows:
Completed Data Masking. Starting cleanup phase. Starting cleanup of recovery tables Completed cleanup of recovery tables Starting cleanup of generated procedures Completed cleanup of generated procedures Script execution complete SQL> set ver on SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 220.127.116.11.0 - Production With the Partitioning, OLAP, Data Mining and Real Application Testing options
Figure 22: SQL Worksheet
Go back to the Data Masking Definitions screen shown in Figure 17 and click the Clone Database button to see how the data masking process is tied in with the cloning process in a secure clone-and-mask workflow.
This brings up the Clone Database wizard page (see Figure 23).
Note that database cloning using Oracle Enterprise Manager Grid Control requires a license for the Oracle Enterprise Manager Provisioning and Patch Automation Pack in addition to the Oracle Data Masking Pack that is used for masking.
The first page of the Clone Database wizard allows you to specify the source database for the cloning operation. You can make an online Oracle Recovery Manager (RMAN) backup of the source database or use an existing database backup.
The RMAN backup uses the RMAN duplicate feature behind the scenes when staging areas are not used. If staging areas are selected, the duplicate feature is not used. The first page explains the steps in more detail.
Figure 23: Clone Database Wizard
On the next page, you specify the source options for the clone database process. Concurrent file copy processes can be specified to speed up the cloning via the RMAN duplicate feature.
Source host credentials also need to be specified at this point. See Figure 24.
Figure 24: Source Host Credentials
If the source database is in NOARCHIVELOG mode, it must be shut down and restarted in the mounted mode for the cloning to proceed, or you can switch to ARCHIVELOG mode at this stage of the process. See Figure 25.
Figure 25: Archiving Mode Screen
The next page of the wizard allows you to specify the destination Oracle Home on the server you want to clone the database to. You can also name the destination database on this page (see Figure 26).
The first requirement is that this should be a server on which the Oracle Enterprise Manager agent has been installed, and the agent should be in communication with the central Oracle Enterprise Manager Management Service (OMS). This means the server has been discovered and is visible as a target in Oracle Enterprise Manager Grid Control.
The other requirements are that the operating systems of the source and target servers should match, and the Oracle Homes at the source and target should also match at the version level of the database software.
Figure 26: Select Destination Screen
Click Next to move to the Destination Options page (Figure 27) where you can specify the file locations and either customize them or use a standard Oracle Optimal Flexible Architecture (OFA) layout.
Simplest of all, you can use the Database Area and Flash Recovery Area option to store all database, control, redo, archive, and backup files.
Figure 27: Destinations Options
The next page is where you can see the crux of the clone-and-mask workflow.
On this page of the Clone Database wizard, seen in Figure 28, you are allowed to specify the masking definitions to apply as part and parcel of the clone database process.
Select Execute masking steps after cloning the database, and then add the masking definitions that were previously defined and the script generated for the source database. In this case, you add the HR Employees_Test Masking Definition.
On the same page, you can also specify the network configuration files, the post cloning scripts (if required), and whether to register the newly cloned database in Oracle Enterprise Manager Grid Control (so it appears as a new database target).
Figure 28: Database Configuration Screen
Moving to the next step allows you to schedule the Clone Database job, either immediately or at a future date and time using the Oracle Enterprise Manager job scheduler. See Figure 29.
Figure 29: Schedule Screen
Finally, the Clone Database: Review screen appears (Figure 30). You can verify the cloning process information, including the files that are to be created, under the Database Storage section on this page.
The masking definition status is shown as “Specified” on the Review page.
When the Clone Database job is submitted, the new database is created as a copy of the original. The masking script is generated and applied on the new database as a part of the Clone Database process.
This clone-and-mask workflow is secure because when the cloning process has completed, the new database is opened in RESTRICTED mode, which allows only administrator access to the database. The masking process is then executed, and the workflow verifies the masking process has completed successfully before the new database is opened up for normal use.
This process ensures that the confidential data will not be visible in the new database at the end of the process or even during the process.
A masked database has been created that is safe for use on development servers, on test servers, or for offsite vendor use (provided all confidential columns, such as the NRIC, have been properly located and masked using the techniques described).
Figure 30: Review Screen
Oracle Enterprise Manager 11g Grid Control Release 1 was released in April 2010. The enhancements to data masking in this release are as follows:
The Oracle Data Masking Pack is available at an extra license cost, but it is well worth the price if you consider the benefits: a secure, centralized way of defining your masking formats for use throughout the corporate space and the ability to apply the masks to any column in any table in any database that is managed by the centralized installation of Oracle Enterprise Manager Grid Control.
You can protect the confidentiality of your data while also achieving regulatory compliance.
The integration of masking with the cloning of databases, which requires the Oracle Enterprise Manager Provisioning and Patch Automation Pack license, is a welcome addition. This integration precludes the presence of confidential data in cloned databases, if the masking formats are defined properly and applied to the appropriate columns via the masking definitions created for the database.
The data masking scripts are automatically generated, and the masking definitions can be transported between different Oracle Enterprise Manager Grid Control installations. DBA productivity is vastly increased by automating the discovery (via the foreign key columns) of confidential data, as well as automating the masking process.
These features combine to enable, for the first time, a corporate-wide data masking and confidentiality policy that can be easily set up and applied using Oracle Enterprise Manager Grid Control.
Of course, this flagship enterprise management product from Oracle does a lot of other things as well; it is truly a behemoth. If you want to learn how to automate the patching of all Oracle Real Application Clusters (RAC) or non-RAC databases, Oracle Automatic Storage Management instances, and Oracle Clusterware in your environment, read a recent article from the author titled Patch a Thousand Databases, Using Oracle Enterprise Manager Grid Control. For information about easily setting up Oracle Recovery Manager (RMAN) backups for your corporate databases using Oracle Enterprise Manager Grid Control, read Oracle RMAN Backups: Pushing the Easy Button.
You might also be interested in data integration, which you can read about in Using Oracle GoldenGate for Real-Time Data Integration. And for understanding how Oracle Enterprise Manager Grid Control can save a lot of time and resources when used for Oracle Data Guard setup, management (including switchover or failover), and monitoring, read another recent article from the author, Easy Disaster Proof Production with Grid Control.
This is the age of the DBA 2.0, and the prudent DBA will realize and understand that Oracle Enterprise Manager Grid Control is essential in this age.
Porus Homi Havewala works as a Senior Manager (Enterprise Technology) at Oracle Singapore and was previously a Principal Consultant at S&I Systems Singapore (Oracle Platinum Partner). He was awarded the Oracle ACE Director title by Oracle HQ in 2008 and is currently an Oracle Employee ACE. He has extensive experience in Oracle technology since 1994, including as a Senior Production DBA, Principal Database Consultant, Database Architect, E-Business Technical DBA, Development DBA, and Database Designer Modeler (using Oracle Designer, of course). Porus is also the author of the book Oracle Enterprise Manager Grid Control – Advanced Techniques for the Real World.