Place the cursor
over this icon to load and view all the screenshots for this tutorial. (Caution:
Because this action loads all screenshots simultaneously, response time may
be slow depending on your Internet connection.)
Note:
Alternatively, you can place the cursor over each individual icon in the following
steps to load and view only the screenshot associated with that step.
Overview
What Happens During the ETL Process?
What
happens during the ETL process? During Extraction,
the desired data has to be identified and extracted from many different sources,
including database systems and applications. Very often, it’s not possible
to identify the specific subset of interest, therefore more than necessary
data has to be extracted, and the identification of the relevant data is done
at a later point in time. Depending on the source system’s capabilities
(for example, OS resources), some transformations may take place during this extraction
process. The size of the extracted data varies from hundreds of kilobytes to hundreds
of gigabytes, depending on the source system and the business situation. The same
is true for the time delta between two (logically) identical extractions: the
time span may vary between days/hours and minutes to near real-time. For example, Web server
log files can easily become hundreds of megabytes in a very short
period of time.
After extracting data, it has to be physically transported to the target
system or an intermediate system for further processing. Depending on the chosen
way of transportation, some transformations can be done during this process, too.
For example, a SQL statement that directly accesses a remote target through
a gateway, can concatenate two columns as part of the SELECT statement .
After extracting and transporting the data, the most challenging (and
time consuming) part of ETL follows: Transformation
and Loading into the target system.
This may include:
Application of
complex filters
Data
has to be validated against information already existing in target database tables
Data extracted
without the knowledge of new versus changed information has to be checked against
the target objects to determine whether it must be updated or inserted
The same data
has to be inserted several times as detail level and aggregated information
This should be done as quickly as possible in a scaleable manner and must
not affect the existing target with respect to concurrent access for information
retrieval.
Oracle offers a wide variety of capabilities
to address all the issues and tasks relevant in an ETL scenario.The Oracle Database 10g is the ETL transformation engine.
Implement Schema Changes for the Sales History
Schema
Some changes
on the existing Sales History schema objects are necessary, and some additional
system privileges must be granted to the user SH. The SQL file for applying those
changes is modifySH_10g.sql. This
file is provided to you in the setup_dwh.zip file.
To use the setup files for the Data Warehousing tutorials, perform the following
steps:
1.
Start
a SQL*Plus session and Run the modifySH_10g.sqlscript from your SQL*Plus session.
MyCompany gets some nonrelational data structures from one of its partner
companies, which sells our products for a special advertisement campaign. The
data structure is a denormalized, nonrelational record structure from proprietary
mainframe systems consisting of one record per customer and product per week.
Those data structures have to be inserted into our data warehouse. Because sales
record data is stored per customer and product per day, you have to transform
the incoming data.
As part of the transformation, the nonrelational denormalized data structure
must be transformed from one record per week into seven records, each consisting
the information of a business day (first example business transformation). In
addition, the data warehouse keeps track of all new customers with a creditlimit
above a certain limit. Those customers should be tracked in addition separately.
In this section, you will implement those business transformations by leveraging
Oracle’s new multi-table insert capabilities. To do this, perform
the following steps:
Before issuing the multi-table insert, you want to investigate plans for
the multi-table insert and how what looks like without having this functionality.
To do this, you perform the following steps:
1.
Show the execution plan for the new
multi-table insert. From a SQL*Plus session logged on to the SH schema, execute
the following SQL script:
@explain_mti_new.sql
DELETE FROM PLAN_TABLE;
EXPLAIN PLAN FOR INSERT ALL INTO sales VALUES(product_id, customer_id,weekly_start_date,2,9999,
q_sun,sales_sun) INTO sales VALUES(product_id, customer_id,weekly_start_date+1,2,9999,
q_mon,sales_mon) INTO sales VALUES(product_id, customer_id,weekly_start_date+2,2,9999,
q_tue,sales_tue) INTO sales VALUES(product_id, customer_id,weekly_start_date+3,2,9999,
q_wed,sales_wed) INTO sales VALUES(product_id, customer_id,weekly_start_date+4,2,9999,
q_thu,sales_thu) INTO sales VALUES(product_id, customer_id,weekly_start_date+5,2,9999,
q_fri,sales_fri) INTO sales VALUES(product_id, customer_id,weekly_start_date+6,2,9999,
q_sat,sales_sat) SELECT * FROM sales_input_table;
SET linesize 140 SELECT * from table(dbms_xplan.display);
Note: The input
source table is scanned only once! The complexity of the denormalization is handled
within the several INSERT INTO branches, thus avoiding multiple scans.
2.
Now show the execution plan for the
multi-table insert based on a UNION ALL set operation. From a SQL*Plus session
logged on to the SH schema, execute the following SQL script:
@explain_mti_old.sql
DELETE FROM PLAN_TABLE;
COMMIT;
EXPLAIN PLAN FOR INSERT
INTO sales (prod_id, cust_id, time_id, channel_id,promo_id,amount_sold,quantity_sold)
SELECT product_id, customer_id,weekly_start_date,2,9999,sales_sun,q_sun FROM
sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+1,2,9999,sales_mon,q_mon
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+2,2,9999,sales_tue,q_tue
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+3,2,9999,sales_wed,q_wed
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+4,2,9999,sales_thu,q_thu
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+5,2,9999,sales_fri,q_fri
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+6,2,9999,sales_sat,q_sat
FROM sales_input_table;
SET linesize 140 SELECT
* from table(dbms_xplan.display);
COMMIT;
Note: The input
source table is scanned seven times! The complexity of the denormalization is
handled within the several SELECT operations.
With an increasing number of input
records, the superiority and the performance improvement of the new multi-table
insert statement—by reducing the statement to only one SCAN—will become more
and more obvious.
3.
Now, you are ready to execute the
following SQL script to perform the new multi-table insert:
@do_mti.sql
set timing on alter system
flush buffer_cache;
INSERT ALL INTO sales
VALUES(product_id, customer_id,weekly_start_date,2,9999,q_sun,sales_sun) INTO
sales VALUES(product_id, customer_id,weekly_start_date+1,2,9999, q_mon,sales_mon)
INTO sales VALUES(product_id, customer_id,weekly_start_date+2,2,9999, q_tue,sales_tue)
INTO sales VALUES(product_id, customer_id,weekly_start_date+3,2,9999, q_wed,sales_wed)
INTO sales VALUES(product_id, customer_id,weekly_start_date+4,2,9999, q_thu,sales_thu)
INTO sales VALUES(product_id, customer_id,weekly_start_date+5,2,9999, q_fri,sales_fri)
INTO sales VALUES(product_id, customer_id,weekly_start_date+6,2,9999, q_sat,sales_sat)
SELECT * FROM sales_input_table; Rollback;
4.
You can now compare the run-time statistics
to the old way, using a UNION ALL pivoting approach:
@do_mti1_old.sql
set timing on alter system
flush buffer_cache;
INSERT INTO sales (prod_id,
cust_id, time_id, channel_id,promo_id,amount_sold,quantity_sold) SELECT product_id,
customer_id,weekly_start_date,2,9999,sales_sun,q_sun FROM sales_input_table
UNION ALL SELECT product_id, customer_id,weekly_start_date+1,2,9999,sales_mon,q_mon
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+2,2,9999,sales_tue,q_tue
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+3,2,9999,sales_wed,q_wed
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+4,2,9999,sales_thu,q_thu
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+5,2,9999,sales_fri,q_fri
FROM sales_input_table UNION ALL SELECT product_id, customer_id,weekly_start_date+6,2,9999,sales_sat,q_sat
FROM sales_input_table;
Using
the Multi-Table Insert for Conditional Insertion
In this multi-table insert, you insert data into several tables with different
table structures based on some conditions. To do this, you perform the following
steps:
1.
Create an intermediate table consisting
of new information. From a SQL*Plus session logged on to the SH schema, execute
the following SQL script, which contains the SQL statements from steps 1-4:
@mti2_prepare.sql
Rem create intermediate table
with some records
CREATE TABLE customers_new
AS
SELECT * FROM customers
WHERE cust_id BETWEEN 2000 AND 5000;
2.
Disable constraints on the SALES table;
necessary for step 3.
ALTER TABLE
sales DISABLE CONSTRAINT sales_customer_fk;
3.
Delete some data from the PRODUCTS
table.
DELETE FROM
customers WHERE cust_id BETWEEN 2000 AND 5000;
4.
Create an empty table for our special
promotion information.
CREATE TABLE
customers_special AS SELECT cust_id, cust_credit_limit FROM customers
WHERE rownum > 1;
5.
Issue the multi-table insert into
several tables with different table structures. From a SQL*Plus session logged
on to the SH schema, execute the following SQL script:
@do_mti2.sql
INSERT /*+ APPEND NOLOGGING
*/FIRST WHEN cust_credit_limit >= 4500 THEN INTO customers INTO
customers_special VALUES (cust_id, cust_credit_limit) ELSE INTO customers
SELECT * FROM customers_new;
6.
Now, you can see what was inserted.
Execute the following SQL script:
@control_mti2.sql
SELECT COUNT(*) FROM customers; SELECT COUNT(*) FROM customers_special;
SELECT MIN(cust_credit_limit) FROM customers_special;
7.
Before continuing you will want to
reset your environment by executing the following SQL script:
@reset_mti2.sql
ALTER TABLE sales
MODIFY CONSTRAINT sales_customer_fk RELY ENABLE NOVALIDATE;
DROP TABLE customers_special;
DROP TABLE customers_new;
Using the New Upsert
Functionality, SQL MERGE Keyword Overview
MyCompany has to update its product information in the data warehouse periodically
with data from its products database. Unfortunately, you are getting product delta
information without the possibility to distinguish between new and updated information. Therefore, you have to figure this out on the data warehouse site.
To use the new SQL MERGE command to either update
or insert the data conditionally, you perform the following steps:
1.
Creating an External Table (and Directories) for the External Products Information
Before you can create the external table, you need to create a directory
object in the database that will point to the directory on the file system where
the data files will reside. Optionally, you can separate the location for the
logged, bad and discarded files from the location of the data files, as shown in the
following example.
1.
To create the directory, perform the
following. If you have already completed High-Speed Data Load and Rolling Window
Operations, the directories will already exist and you can skip this step. From
a SQL*Plus session logged on to the SH schema, execute the following SQL script:
@create_directory.sql
DROP DIRECTORY data_dir;
DROP DIRECTORY log_dir;
Rem ***** Rem CREATE DIRECTORIES
Rem note that security is controlled Rem (a) with privileges on directory
(read/write) Rem (b) with privileges on external table Rem *****
CREATE DIRECTORY data_dir
AS '/home/oracle/wkdir/'; CREATE DIRECTORY log_dir AS '/home/oracle/wkdir/';
2.
When creating an external table, you
are defining two parts of information:
1.
The metadata information for the table
representation inside the database
2.
The access parameter definition of
how to extract the data from the external file
After creation of this meta information,
the external data can be accessed from within the database, without the necessity
of an initial load. To create the external table, from a SQL*Plus session logged
on to the SH schema, execute the following SQL script:
@create_external_table2.sql
Rem ***** Rem preparation
for upsert Rem file prodDelta.dat, 1.2 MB must Rem reuse of directory
STAGE_DIR Rem access via external table Rem *****
2.
Performing an Upsert by Using the New SQL MERGECommand
To perform an upsert using the new SQL MERGE
command, perform the following step:
1.
From a SQL*Plus session logged on
to the SH schema, execute the following SQL script:
@do_merge_new.sql
Rem ***** Rem MERGE
Rem new and modified product information has arrived Rem use of external table
again Rem - seemlessly within the MERGE command !!! Rem *****
MERGE INTO products t
USING products_delta s ON ( t.prod_id=s.prod_id ) WHEN MATCHED THEN
UPDATE SET t.prod_name=s.prod_name, t.prod_list_price=s.prod_list_price, t.prod_min_price=s.prod_min_price
WHEN NOT MATCHED THEN INSERT (prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_id, prod_subcategory_desc, prod_category, prod_category_id,
prod_category_desc, prod_status, prod_list_price, prod_min_price, prod_total,
prod_total_id, prod_weight_class, prod_pack_size, supplier_id) VALUES
(s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory, s.prod_subcategory_id,
s.prod_subcategory_desc, s.prod_category, s.prod_category_id, s.prod_category_desc,
s.prod_status, s.prod_list_price, s.prod_min_price, s.prod_total, s.prod_total_id,
s.prod_weight_class, s.prod_pack_size, s.supplier_id) ;
ROLLBACK;
Note: You will
ROLLBACK so that you can reissue
the same upsert with two SQL statements in a subsequent operation.
3.
Showing the Execution Plan of the New MERGECommand
To look at the execution plan of the MERGE
statement, you perform the following steps:
1.
From a SQL*Plus session logged on
to the SH schema, run explain_merge_new.sql, or copy the following SQL statement
into your SQL*Plus session:
@explain_merge_new.sql
DELETE FROM plan_table;
COMMIT;
EXPLAIN PLAN FOR MERGE
INTO products t USING products_delta s ON ( t.prod_id=s.prod_id )
WHEN MATCHED THEN UPDATE SET t.prod_name=s.prod_name, t.prod_list_price=s.prod_list_price,
t.prod_min_price=s.prod_min_price WHEN NOT MATCHED THEN INSERT (prod_id,
prod_name, prod_desc, prod_subcategory, prod_subcategory_id, prod_subcategory_desc,
prod_category, prod_category_id, prod_category_desc, prod_status, prod_list_price,
prod_min_price, prod_total, prod_total_id, prod_weight_class, prod_pack_size,
supplier_id) VALUES (s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory,
s.prod_subcategory_id, s.prod_subcategory_desc, s.prod_category, s.prod_category_id,
s.prod_category_desc, s.prod_status, s.prod_list_price, s.prod_min_price,
s.prod_total, s.prod_total_id, s.prod_weight_class, s.prod_pack_size,
s.supplier_id) ;
4.
Performing an Upsert by Using Two Separate SQL Commands (Before Oracle9i)
Before Oracle9i, such upsert
functionality could be implemented as a combination of two SQL statements (insert
plus update) or in a procedural manner. To demonstrate how this was done, perform
the following steps:
1.
To leverage an updatable join view,
a primary or unique key must exist on the join column. Therefore, you cannot
use the external table directly. As a result, you need to create an intermediate
table in the database with the appropriate unique key constraint. Run
the following script:
do_merge_816.sql
CREATE TABLE prod_delta NOLOGGING
AS SELECT * FROM products_delta;
INSERT INTO products
(prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_id,
prod_subcategory_desc, prod_category, prod_category_id, prod_category_desc,
prod_status, prod_list_price, prod_min_price, prod_total, prod_total_id,
prod_weight_class, prod_pack_size, supplier_id)
SELECT prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_id,
prod_subcategory_desc, prod_category, prod_category_id, prod_category_desc,
prod_status, prod_list_price, prod_min_price, prod_total, prod_total_id,
prod_weight_class, prod_pack_size, supplier_id
FROM prod_delta s WHERE NOT EXISTS (SELECT 1 FROM products t WHERE t.prod_id
=
s.prod_id);
ROLLBACK;
You have to issue two separate SQL
statements to accomplish the same functionality than the new MERGE command.
To leverage the updatable join view functionality of 8.1.6, we also had
to copy the external table into a real database table and create a
unique index on the join column. In short, more operations, more space
requirements, and more processing time.
In the ETL process, the data extracted from a source system passes through
a sequence of transformations before it is loaded into a data warehouse. Complex
transformations are implemented in a procedural manner, either outside the database
or inside the database in PL/SQL. When the results of a transformation become
too large to fit into memory, they must be staged by materializing them either
into database tables or flat files. This data is then read and processed as input
to the next transformation in the sequence. Oracle table functions provide
support for pipelined and parallel execution of such transformations implemented
in PL/SQL, C, or Java.
A table function is defined as a function that can produce a set of rows
as output; additionally, Oracle table functions can take a set of rows as input.
These sets of rows are processed iteratively in subsets, thus enabling a pipelined
mechanism to stream these subset results from one transformation to the next before
the first operation has finished. Furthermore, table functions can be processed
transparently in parallel, which is equivalent to SQL operations such as a table
scan or a sort. Staging tables are no longer necessary.
Examining Basic Implementation Issues
for Table Functions