|
Technology BUSINESS INTELLIGENCE
Capturing Change
By Sanjay Mishra
Use Change Data Capture to simplify ETL in data warehouse applications.
Data extraction is an integral
part of all data warehousing.
Data is often extracted on a
nightly basis from the transac
tional systems and transported to
the data warehouse. Typically, all the
data in the data warehouse is refreshed with data extracted from the source system. But this involves the extraction and transportation of huge volumes of data and is very expensive in both resources and time.
Since the data extraction takes place daily, it would be much more efficient to extract and transport only the data that has changed since the last extraction. However, in most source systems, it is extremely difficult, if not impossible, to identify and extract only the recently changed data.
Beyond the challenge of identifying the recently changed data, many extraction, transformation, and
loading (ETL) environments involve one source system feeding data into multiple target systems. It is also a challenge to sync up change data from one source to many targets.
The Oracle Change Data Capture (CDC) feature simplifies the process of identifying the change data since the last extraction. CDC provides easy-to-use APIs to set up a CDC environment. If you are a DBA or a developer involved in a data integration project that involves the regular extraction and transportation of a large amount of data from one system to another system or systems, consider CDC to help reduce your ETL time.
This article gives you a step-by-step approach for setting up a CDC environment, outlines the prerequisites
for the setup, and discusses the APIs you need to know to set up CDC for your application.
Overview
The CDC feature was introduced in Oracle9i Database. CDC helps identify the data in the source system that has changed since the last extraction.
With CDC, data extraction takes place at the same time the INSERT, UPDATE,
or DELETE operations occur in the source tables, and the change data is stored inside the database in change tables. The change data, thus captured, is then made available to the target systems in a controlled manner, using database views.
Figure 1 shows the basic architecture of a CDC environment.
CDC creates the subscriber view
to control accessibility of the change table data to the target systems. A target system is any application that wants to use the data captured from the source system.
The Publisher and the Subscriber
The CDC architecture is based on the publisher/subscriber model. The publisher captures the change data and makes it available to the subscribers. The subscribers utilize the change data obtained from the publisher. Usually, a CDC system has one publisher and many subscribers.
The publisher first identifies the source tables from which the change data needs to be captured. It then captures the change data and stores it in specially created change tables. Then it allows the subscribers controlled access to the change data.
The subscriber needs to know what change data it is interested in. A subscriber may not be interested in all the data that is published by the publisher. The subscriber creates a subscriber view to access the change data to which it has been granted access by the publisher.
Figure 1: Change Data Capture architecture
Setting Up a CDC Environment
The CDC feature is part of the Oracle9i (and later) Database software; you
don't need to install any software other than Oracle9i Database to use the
CDC feature. To make use of Oracle9i Database's CDC feature, do the following:
- Enable Java (the Java virtual machineJVM) in the database.
Refer to the Oracle9i Database Java Developer's Guide for detailed instructions on configuring Oracle JVM. When creating a database using Database Configuration Assistant (DBCA), you can choose to install Oracle JVM.
- Enable system triggers. When you create a database, system triggers are enabled by default.
Once your database is ready for CDC, you can follow this step-by-step approach to set up a CDC environment:
1. Identify the source tables.
2. Set up a publisher.
3. Create change tables.
4. Set up a subscriber.
5. Subscribe to the source tables, and activate the subscription.
6. Set up the CDC window.
7. Prepare a subscriber view.
8. Access data from the change tables.
9. Drop the subscriber view, and purge the CDC window.
10. Repeat steps 6 through 9 to see
new data.
For the purposes of this article, I will use a sales application example. The source application is a transactional system that records the information on each sale. The tables for the transactional application are owned by schema owner SALES_OP.
I am setting up a data warehouse based on the sales operational data.
The user SALES_DW is the owner of the data warehouse tables. The purpose of my CDC system is to transfer change data from the transactional system (source tables owned by SALES_OP) to the data warehouse system (target tables owned by SALES_DW) efficiently. The change data consists of
- Data inserted (INSERT) in the source tables since the last extraction
- Data modified (UPDATE) in the source tables since the last extraction
- Data deleted (DELETE) from the source tables since the last extraction
In the following few sections, I will show how the change data is captured in the source system and made available to the target system.
1. Identify the Source Tables
The database user SALES_OP owns two tables, PRICE_LIST and SALES_TRAN, which contain the price list of the items sold and the transaction details of each sale, respectively.
SQL> DESC PRICE_LIST
Name Null? Type
---------- ------- -------
PRODUCT_ID NOT NULL VARCHAR2(20)
UOM NOT NULL VARCHAR2(20)
PRICE_PER_UOM NOT NULL NUMBER(10,2)
SQL> DESC SALES_TRAN
Name Null? Type
---------- ------- -------
PRODUCT_ID NOT NULL VARCHAR2(20)
SALE_DATE_TIME NOT NULL TIMESTAMP(6)
QUANTITY NOT NULL NUMBER(10,2)
Every morning, the changes made in these two tables must be extracted and made available to the data warehouse application.
2. Set Up a Publisher
A publisher is a database user who creates and maintains the change tables. As mentioned earlier, the responsibility of the publisher is to identify and extract the change data and make it available to the subscribers. To be able to do that, the following two database roles must be granted to the publisher:
- SELECT_CATALOG_ROLE
- EXECUTE_CATALOG_ROLE
For simplicity, let the user SALES_OP assume the role of the publisher. Since SALES_OP owns the source table, it is easier to designate this user as the publisher. However, you can designate any other database user as the publisher. In situations where tables owned by multiple database users constitute the source tables, it is better to designate another user as the publisher. In such a situation, the publisher needs the SELECT privilege on the source tables.
3. Create Change Tables
A change table is required for each source table. The publisher uses the
procedure DBMS_LOGMNR_CDC_PUBLISH .CREATE_CHANGE_TABLE to create change tables, as shown in Listing 1. In this example, the change tables corresponding to PRICE_LIST and SALES_TRAN are named CDC_PRICE_LIST and CDC_SALES_TRAN respectively.
The CREATE_CHANGE_TABLE procedure passes the name of the change set as an argument. A change set is a collection of change tables. In Listing 1, I have used the system-generated change set: SYNC_SET.
4. Set Up a Subscriber
In my example, the target system, SALES_DW, is the subscriber. To be able to subscribe to the change data, the subscriber needs the SELECT privilege on the source tables and the change tables, as shown:
connect sales_op/sales_op
grant select on PRICE_LIST to sales_dw;
grant select on SALES_TRAN to sales_dw;
grant select on CDC_PRICE_LIST
to sales_dw;
grant select on CDC_SALES_TRAN
to sales_dw;
5. Subscribe to the Source Tables, and Activate the Subscription
In a typical CDC environment there
are several subscribers, and each
subscriber needs to subscribe only
to the tables he or she requires. Moreover, a subscriber may not want to see all the columns of the table to which he or she subscribes.
Before you can subscribe to the published data, you need to get a subscription handle. The subscription handle lets the subscriber manage the change tables and columns of interest. Within Oracle's CDC environment, a subscription is uniquely identified by the subscription handle, and no matter how many tables you are subscribing to, you need to create just one subscription handle. Use the DBMS_LOGMNR_ CDC_SUBSCRIBE.GET_SUBSCRIPTION HANDLE procedure to get the subscription handle. This procedure returns the value of the handle in a variable. Define a variable within your session
to receive the value from the procedure, as follows:
variable sh_sales number;
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE
.GET_SUBSCRIPTION_HANDLE( -
CHANGE_SET=>'SYNC_SET', -
DESCRIPTION=>'Change data from
Sales Appln.', -
SUBSCRIPTION_HANDLE=>:sh_sales);
Once you create the handle, you can subscribe to the change data you want. While subscribing to the change data, specify which tables and columns you want to subscribe to, as follows:
EXECUTE DBMS_LOGMNR_CDC_SUBSCRIBE
.SUBSCRIBE( -
SUBSCRIPTION_HANDLE=>:sh_sales, -
SOURCE_SCHEMA=>'SALES_OP', -
SOURCE_TABLE=>'PRICE_LIST', -
COLUMN_LIST=>'PRODUCT_ID,UOM,
PRICE_PER_UOM');
EXECUTE DBMS_LOGMNR_CDC_SUBSCRIBE
.SUBSCRIBE( -
SUBSCRIPTION_HANDLE=>:sh_sales, -
SOURCE_SCHEMA=>'SALES_OP', -
SOURCE_TABLE=>'SALES_TRAN', -
COLUMN_LIST=>'PRODUCT_ID,SALE_DATE_TIME,
QUANTITY');
After you specify which tables and columns you want to subscribe to, you can activate your subscription. No matter how many tables you are subscribing to, you need to activate the subscription just once. Once a subscription is activated, you can no longer add any additional tables or columns to your subscription. To activate the subscription, call the DBMS_LOGMNR_CDC_SUBSCRIBE.ACTIVATE_
SUBSCRIPTION procedure and pass the subscription handle to it:
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE
.ACTIVATE_SUBSCRIPTION( -
SUBSCRIPTION_HANDLE=>:sh_sales);
6. Set Up the CDC Window
Data keeps changing (inserted, updated, or deleted) in the source tables, regardless of the subscriptions on the change data. Therefore, the subscription is responsible for setting up (extending) a windowthe CDC windowduring which it wants to see the changes. The DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW procedure is used to set up the CDC window, as follows:
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW( -
SUBSCRIPTION_HANDLE=>:sh_sales);
The changes done to the source table after you execute the EXTEND_WINDOW procedure will not be visible to the subscription until you purge the window and extend it again (more about this later).
7. Prepare a Subscriber View
Once you have set up the CDC window, you need to prepare a view for the subscriber so that the subscriber can see the change data. You need to prepare a view for each source table the subscriber is subscribing to. The procedure DBMS_LOGMNR_CDC_SUBSCRIBE.PREPARE_SUBSCRIBER_VIEW is used to prepare a subscriber view. You need to pass the subscription handle, source schema name, and source table name to this procedure, and it will return the name of the subscriber view in a variable. In this example, the subscriber view names for the source tables PRICE_LIST and SALES_TRAN are returned in variables sv_price_list and sv_sales_tran, respectively:
variable sv_price_list varchar2(30);
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE
.PREPARE_SUBSCRIBER_VIEW( -
SUBSCRIPTION_HANDLE =>:sh_sales, -
SOURCE_SCHEMA=>'SALES_OP', -
SOURCE_TABLE=>'PRICE_LIST', -
VIEW_NAME => :sv_price_list);
variable sv_sales_tran varchar2(30);
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE
.PREPARE_SUBSCRIBER_VIEW( -
SUBSCRIPTION_HANDLE =>:sh_sales, -
SOURCE_SCHEMA=>'SALES_OP', -
SOURCE_TABLE=>'SALES_TRAN', -
VIEW_NAME => :sv_sales_tran);
You can get the view names from these two variables. The view names are system-generated and generally contain a prefix of CDC and a number. For example:
SQL> select :sv_price_list from dual;
:SV_PRICE_LIST
----------------
CDC#CV$1430621
SQL> select :sv_sales_tran from dual;
:SV_SALES_TRAN
-----------------
CDC#CV$1430623
8. Access Data from the Change Tables
The subscriber view contains the change data you need. The subscriber view also contains some additional information to facilitate usage of the change data:
SQL> desc CDC#CV$1430621
Name Null? Type
---------- ------- --------
OPERATION$ CHAR(2)
CSCN$ NUMBER
COMMIT_TIMESTAMP$ DATE
TIMESTAMP$ DATE
USERNAME$ VARCHAR2(30)
PRICE_PER_UOM NUMBER(10,2)
PRODUCT_ID VARCHAR2(20)
UOM VARCHAR2(20)
For example, the column OPERATION$ indicates whether the change is an INSERT, an UPDATE, or a DELETE. The columns CSCN$, COMMIT_TIMESTAMP$,
TIMESTAMP$, and USERNAME$ indicate
who made the change in the source
table and when the change was
made. The other columns are the
subscribed columns from the source table. You can use a SELECT statement to see the data from the subscriber view. For example:
SQL> select * from CDC#CV$1430621;
OP PRICE_PER_UOM PRODUCT_ID UOM
-- ------------ ------------ ----
I 2.99 P101 pound
I 4.99 P102 pound
9. Drop the Subscriber View, and Purge the CDC Window
|
Next Steps
READ more on Change Data Capture
Oracle9i Data Warehousing Guide Release 2 (9.2)
Oracle9i Supplied PL/SQL Packages and Types Reference Release 2 (9.2)
otn.oracle.com/documentation
|
When you have finished reading and extracting the change data and you no longer need the change data that is visible in the subscriber view, you need to drop the subscriber view and purge the CDC window so that you can set up (extend) a new window and see new change data. To drop an existing subscriber view, use the procedure DBMS_LOGMNR_CDC_SUBSCRIBE.DROP_SUBSCRIBER_VIEW. You need to drop
each subscriber view for your sub-scription individually, before you can
purge the old CDC window and set
up a new window:
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE
.DROP_SUBSCRIBER_VIEW( -
SUBSCRIPTION_HANDLE =>:sh_sales, -
SOURCE_SCHEMA=>'SALES_OP', -
SOURCE_TABLE=>'PRICE_LIST');
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE
.DROP_SUBSCRIBER_VIEW( -
SUBSCRIPTION_HANDLE =>:sh_sales, -
SOURCE_SCHEMA=>'SALES_OP', -
SOURCE_TABLE=>'SALES_TRAN');
After you drop the subscriber
views, you can purge the window.
Use the DBMS_LOGMNR_CDC_SUBSCRIBE. PURGE_WINDOW procedure to purge
the window:
EXECUTE -
DBMS_LOGMNR_CDC_SUBSCRIBE.PURGE_WINDOW( -
SUBSCRIPTION_HANDLE=>:sh_sales);
10. Repeat Steps 6 through 9 to See
New Data
To extract change data periodically, you need to repeat steps 6 through
9. If you extract change data on a
daily basis, your daily (or, preferably, nightly) data-extraction job will include setting up the subscription window (EXTEND_WINDOW), preparing subscriber views (PREPARE_SUBSCRIBER_ VIEW), accessing data from the subscriber views, dropping subscriber views (DROP_SUBSCRIBER_VIEW), and purging the window (PURGE_WINDOW).
Summary of Steps
Steps 1 through 5 are configuration steps that need to be set up just
once for a given publisher/subscriber scenario. Execute steps 6 through
9 periodically to incrementally
extract the change data from the source system.
Sanjay Mishra (smishra_tech@yahoo.com) has coauthored three Oracle books: Oracle Parallel Processing, Oracle SQL Loader: The Definitive
Guide, and Mastering Oracle SQLall published
by O'Reilly and Associates.
|