|
Oracle9i
introduces a family of new features to support more scalable
and efficient ETL (Extraction, Transformation, Load) processing
for data warehouses and business intelligence systems.
One of the most exciting ETL additions is
the External Table feature. The External Table feature allows
for flat files, which reside outside the database, to be accessed
just like relational tables within the database: the flat-file
data can be queried and joined to other tables using standard
SQL. Data access can be serial or parallel for maximum scalability.
From a users point of view, the main difference between
an external table and a regular table is that the external
table is read-only.
To understand the value of external tables,
consider that a data warehouse always contains data from external
data sources. For example, a data warehouse may contain data
extracted from an OLTP system or data purchased from a third-party
data provider. When adding new data into a data warehouse
from an external data sources, the data has to be transformed.
Transformation tasks may include complex filtering, validation
of new data against information already in the warehouse,
and insertion of data at both detail and aggregate levels.
Prior to Oracle9i,
performing complex transformations within Oracle required
that the external data be transformed using one of two possible
strategies:
- Load the data into Oracle and store it
in a staging table. When the transformations were finished
the table could be deleted. This approach devoted disk space,
administrative resources and processing time to data that
was already available on disk.
- Transform the external data in flat files
outside the database. Since some transformations require
data from the data warehouse, this process requires not
only flat-file manipulation but also database access. When
the transformations were completed, the transformed data
is loaded into the data warehouse. This approach also requires
redundant data storage and processing time.
Oracle9i's
external tables avoid the problems inherent in the approaches
described above: external tables provide a whole new model
for loading and transforming external data. The data is no
longer transformed outside the database, nor must it be stored
in a staging table. Instead, the external data is presented
to the database as a virtual table, enabling full data processing
inside the database engine. This seamless integration of external
data access with the transformation phase is generally referred
to as "pipelining." With pipelining, there is no
interruption of the data stream, so every ETL process can
show increased performance.
Example
External tables are defined to Oracle
using a CREATE TABLE statement. The example below defines
an external table called "products_ext" which is
based on two flat files outside Oracle called 'new_prod1.txt'
and 'new_prod2.txt'.
CREATE
TABLE products_ext
(prod_id NUMBER, prod_name VARCHAR2(50), prod_desc VARCHAR2(4000),
prod_category VARCHAR2(50), prod_category_desc VARCHAR2(4000),
list_price NUMBER(6,2), min_price NUMBER(6,2),
last_updated DATE)
ORGANIZATION EXTERNAL
(
TYPE oracle_loader
DEFAULT DIRECTORY stage_dir
ACCESS PARAMETERS
( RECORDS DELIMITED BY NEWLINE
BADFILE bad_dir:'bad_products_ext'
LOGFILE log_dir:'log_products_ext'
FIELDS TERMINATED BY ','
MISSING FIELD VALUES ARE NULL
(prod_id, prod_name, prod_desc, prod_category,
prod_category_desc, price, price_delta,
last_updated char date_format
date mask "dd-mon-yyyy")
)
LOCATION ('new_prod1.txt','new_prod2.txt')
)
PARALLEL 5
REJECT LIMIT UNLIMITED;
The access of this external table is defined
as PARALLEL 5. Access to external files is transparently parallelized,
independent of the number of files (intra-file parallelism).
The parallelism available through external tables increases
performance dramatically compared to serial processing on
a file-by-file basis.
The external table defined above can now be
used directly like a regular table using SQL. For example,
the external data can be used to synchronize existing products
information in the database with the changes of the last 24
hours.
MERGE
INTO products d
USING (SELECT * FROM products_ext
WHERE last_updated > SYSDATE
- 1) s
ON (d.prod_id=s.prod_id)
WHEN MATCHED THEN
UPDATE
SET d.prod_list_price =
s.list_price,
d.prod_min_price = s.min_price
WHEN NOT MATCHED THEN
INSERT (prod_id, prod_name,
prod_desc, prod_category,
prod_category_desc,
prod_list_price, prod_min_price)
VALUES (s.prod_id, s.prod_name,
s.prod_desc, s.prod_category,
s.prod_category_desc,
s.list_price, s.min_price);
This statement processes the rows which have
been changed in the past 24 hours and uses these rows directly
in a MERGE command. This not only reduces the amount of data
to be processed, it also avoids any intermediate staging in
the database by combining the access and transform operation.
Since any SQL query can be executed against an external table,
the external table data can be transformed in much more complex
ways as the data is loaded.
External tables provide a whole new way to
access and process external data directly from within the
database. They simplify data access and transformations by
treating external files as standard tables. Through pipelining
and parallelism, external tables significantly increase overall
ETL performance. External tables are just one example of the
new ETL features of Oracle9i.
The full set of new ETL functionality creates a powerful framework
for handling all ETL tasks within the Oracle database.
More
Info
Oracle9i
Daily Features
|
 |