|
TECHNOLOGY: Data Warehousing
Moving Data Faster
By Arup Nanda
Find the fastest way to move data from warehouse to mart.
Lora, the DBA at Acme Bank, is at the center of attention in a high-profile meeting of the bank's top management team. The objective is to identify ways of enabling end users to slice and dice the data in the company's main data warehouse. At the meeting, one idea presented is to create several small data martseach based on a particular functional areathat can each be used by specialized teams.
To effectively implement the data mart approach, the data specialists must get data into the data marts quickly and efficiently. The challenge the team faces is figuring out how to quickly refresh the warehouse data to the data marts, which run on heterogeneous platforms. And that's why Lora is at the meeting. What options does she propose for moving the data?
An experienced and knowledgeable DBA, Lora provides the meeting attendees with three possibilities, as follows:
- Using transportable tablespaces
- Using Data Pump (Export and Import)
- Pulling tablespaces
This article shows Lora's explanation of these options, including their implementation details and their pros and cons.
Transportable Tablespaces
Lora starts by describing the transportable tablespaces option. The quickest way to transport an entire tablespace to a target system is to simply transfer the tablespace's underlying files, using FTP (file transfer protocol) or rcp (remote copy). However, just copying the Oracle data files is not sufficient; the target database must recognize and import the files and the corresponding tablespace before the tablespace data can become available to end users. Using transportable tablespaces involves copying the tablespace files and making the data available in the target database.
A few checks are necessary before this option can be considered. First, for a tablespace TS1 to be transported to a target system, it must be self-contained. That is, all the indexes, partitions, and other dependent segments of the tables in the tablespace must be inside the tablespace. Lora explains that if a set of tablespaces contains all the dependent segments, the set is considered to be self-contained. For instance, if tablespaces TS1 and TS2 are to be transferred as a set and a table in TS1 has an index in TS2, the tablespace set is self-contained. However, if another index of a table in TS1 is in tablespace TS3, the tablespace set (TS1, TS2) is not self-contained.
To transport the tablespaces, Lora proposes using the Data Pump Export utility in Oracle Database 10g. Data Pump is Oracle's next-generation data transfer tool, which replaces the earlier Oracle Export (EXP) and Import (IMP) tools. Unlike those older tools, which use regular SQL to extract and insert data, Data Pump uses proprietary APIs that bypass the SQL buffer, making the process extremely fast. In addition, Data Pump can extract specific objects, such as a particular stored procedure or a set of tables from a particular tablespace. Data Pump Export and Import are controlled by jobs, which the DBA can pause, restart, and stop at will.
Lora has run a test before the meeting to see if Data Pump can handle Acme's requirements. Lora's test transports
the TS1 and TS2 tablespaces
as follows:
1. Check that the set of TS1
and TS2 tablespaces is self-
contained. Issue the
following command:
BEGIN
SYS.DBMS_TTS.TRANSPORT_SET_CHECK ('TS1,TS2');
END;
2. Identify any nontransportable sets. If no rows are selected, the tablespaces are self-contained:
SELECT * FROM TRANSPORT_SET_VIOLATIONS;
no rows selected
3. Ensure the tablespaces are read-only:
SELECT STATUS
FROM DBA_TABLESPACES
WHERE TABLESPACE_NAME IN ('TS1','TS2');
STATUS
---------
READ ONLY
READ ONLY
4. Transfer the data files of each tablespace to the remote system, into the directory /u01/oradata, using a transfer mechanism such as FTP or rcp.
5. In the target database, create a database link to the source database (named srcdb in the line below).
CREATE DATABASE LINK srcdb
USING 'srcdb';
6. In the target database, import the tablespaces into the database, using Data Pump Import.
impdp lora/lora123
TRANSPORT_DATAFILES=
"'/u01/oradata/ts1_1.dbf',
'/u01/oradata/ts2_1.dbf'"
NETWORK_LINK='srcdb'
TRANSPORT_TABLESPACES=\(TS1,TS2\)
NOLOGFILE=Y
This step makes the TS1 and TS2 tablespaces and their data available in the target database.
Note that Lora doesn't export the metadata from the source database. She merely specifies the value srcdb, the database link to the source database, for the parameter NETWORK_LINK in the impdp command above. Data Pump Import fetches the necessary metadata from the source across the database link and re-creates it in the target.
7. Finally, make the TS1 and TS2 tablespaces in the source database read-write.
ALTER TABLESPACE TS1 READ WRITE;
ALTER TABLESPACE TS2 READ WRITE;
This step makes the TS1 and TS2 tablespaces immediately available to users in the source database.
Note that of all the preceding steps, the most time-consuming is step 4, in which the data files are transferred across systems.
The Cross-Platform Challenge
Lora knows that one of the complicating factors in data replication is the frequent mismatch between source and target platforms. In Acme's environment, for instance, the current data warehouse is located on an HP server running Tru64 UNIX and the proposed data marts will be deployed on Intel hardware running Linux and Windows.
Before the advent of Oracle Database 10g, using transportable tablespaces would not have been a viable approach at Acme. You couldn't have transported tablespaces unless the source and the target were running on the same platform.
With Oracle Database 10g, this restriction has been relaxed significantly. Data files can be freely copied across operating systems when tablespaces are being transported. In the preceding example, the data files can be copied from Tru64 UNIX to Linux or Windows, and the target database still recognizes them.
The team must consider one other limitation, however. By default, copying across operating systems is possible only if both operating systems have the same byte order (also known as "endian-ness"). In Acme's case, Tru64 UNIX, Intel-based Linux, and Windows all use the little-endian byte order, so copying between them is possible. However, some team members discuss the idea of using the Solaris operating system in the data-warehouse/data-mart proposal. Solaris uses the big-endian byte order, which makes simply copying files between the databases impossible.
Lora notes that Oracle provides a byte-order-conversion solution that uses Oracle Recovery Manager (RMAN). She describes a test for the RMAN solution, which is based on the foregoing seven-step transporting tablespaces solution and altered slightly to accommodate
the Solaris-to-Intel/Linux environment. The steps are identical, except for
one extra step before or after step 5. This additional step, performed on the source (Solaris) database server, is shown in Listing 1.
With this code, the file /u01/tts/
TS1_34 is created in Intel Linux format from the file /u01/oradata/ts1_01.dbf, which is in Solaris format. Note that the original file is not disturbed; a new file is created that can be transferred to the target Linux box and imported.
Lora then discusses a couple of modifications to the RMAN solution. First, to enhance performance, she can specify the clause PARALLELISM=<degree> to increase the number of execution threads. Second, she can specify creation of the data files with the same filename but in different directories. These changes follow:
RMAN> CONVERT TABLESPACE TS1
2> TO PLATFORM 'Linux IA (32-bit)'
3> DB_FILE_NAME_CONVERT
4> '/u01/oradata','/u01/tts'
5> PARALLELISM=4
6> ;
This command creates the converted data file, with the same name, ts1_01.dbf, but in the directory /u01/tts, from the original file, /u01/oradata/ts1_01.dbf. This approach puts all the converted files into one location, making it easier to identify which files to transfer.
Lora can also perform the conversion on the target (Linux) database server instead of on the source server. In such a case, the conversion would be performed on the Linux server as follows:
RMAN> CONVERT DATAFILE
2> '/u01/oradata/ts1_1.dbf',
'/u01/oradata/ts2_1.dbf'
3> TO PLATFORM='Linux IA (32-bit)'
4> FROM PLATFORM='Solaris[tm] OE
(64-bit)'
5> DB_FILE_NAME_CONVERT="ts","tslinux"
6> ;
This command creates the files by substituting tslinux for ts (in other words, ts1_1.dbf is converted to Linux format and has the new filename tslinux1_1.dbf). Inside the database, the files will have this filename.
How does Lora know which platforms use which byte order? The following query against a data dictionary view provides the answer:
SELECT * FROM V$TRANSPORTABLE_PLATFORM ORDER BY PLATFORM_ID;
Data Pump Export and Import
One limitation of the transportable tablespaces option is that the source tablespaces need to be in read-only mode while the files are being transferred. In the real world, it may not always be feasible to meet this requirement. For instance, in OLTP databases, tables are subject to constant activity requiring read/write access.
As an alternative, Lora proposes
using the Data Pump utility in Oracle Database 10g to transfer tablespaces. Her proposed test of this approach involves the following steps to
move the contents of the TS1 and
TS2 tablespaces:
1. Create a directory object to hold the dump file.
CREATE DIRECTORY dump_dir AS '/u01/dumps';
2. Export the data, using Data Pump Export.
expdp lora/lora123 TABLESPACES=\(ts1,ts2\) DUMPFILE=ts1_ts2.dmp DIRECTORY=dump_dir
This step creates a file /u01/dumps/
ts1_ts2.dmp with the contents of the TS1 and TS2 tablespaces.
3. Transfer the file ts1_ts2.dmp to the remote system, into the directory /u01/dumps (using a file transfer method such as FTP or rcp).
4. Create a directory object in the target database.
CREATE DIRECTORY dump_dir
AS '/u01/dumps';
5. Import the file into the database, using Data Pump Import.
impdp lora/lora123 DIRECTORY=dump_dir DUMPFILE=ts1_ts2.dmp
If the amount of data inside the tablespaces is relatively small, Lora can perform all the preceding steps with a single command:
impdp lora/lora123 DIRECTORY=dump_dir NETWORK_LINK='srcdb' TABLESPACES=\(ts1,ts2\)
This command loads the tables, using Data Pump Import from the data retrieved across the database link srcdb (discussed in an earlier section). However, because network bandwidth is often limited, this approach may be slower than using an export/transfer/import cycle.
If only a specific table or set of tables needs to be transferred, Lora can use the TABLES=<tablelist> clause in the expdp command to download only a specific table or set of tables.
Pulling Tablespaces
As a third option, Lora suggests using a new tool in Oracle Database 10g that simplifies the transportable tablespace approach so that it involves only the execution of a single packaged procedure. In this approach, the user "pulls" the tablespaces from the source system, using a supplied package, DBMS_STREAMS_TABLESPACE_ADM. This package uses Data Pump to transport tablespaces and transfers the data files to the target system's format. It also performs any required endian conversion automatically.
The procedure for the easiest case using this optioninvolving a single, simple tablespacefollows (more-complex cases are shown in the next section). A tablespace is called a simple tablespace if it has only one data file. Lora demonstrates the use of the procedure PULL_SIMPLE_TABLESPACE in the DBMS_STREAMS_TABLESPACE_ADM package:
1. Create a directory object in the (remote) data warehouse database in the directory where the data files reside.
CREATE DIRECTORY dbf_dir AS '/u01/oradata/dw';
2. Make the tablespace TS1 on the remote database read-only.
ALTER TABLESPACE TS1 READ ONLY;
The rest of the steps occur in the local (data mart) database.
3. Create a database link to connect to the remote (data warehouse) database (dwdb in Lora's example).
CREATE DATABASE LINK dwdb USING 'dwdb';
4. Create a directory object to which to transfer the data file.
CREATE DIRECTORY dbf_dir AS '/u01/oradata/mart';
5. Pull the tablespace from the remote database.
BEGIN
DBMS_STREAMS_TABLESPACE_ADM
.PULL_SIMPLE_TABLESPACE (
tablespace_name => 'TS1',
database_link => 'dwdb',
directory_object => 'DBF_DIR',
conversion_extension => 'linux'
);
END;
This operation performs a lot of steps behind the scenes. It makes the source tablespace read-only, it creates a dump of the tablespace's metadata by use of Data Pump Export, it moves the data file and the dump file by use of the DBMS_FILE_TRANSFER package, it returns the source tablespace to its original read-write state, and it plugs the tablespace into the local database by use of Data Pump Import. Because the source database runs on Linux and the target runs on Solaris, this operation first copies the raw data file (in Linux format) and then converts it to the target platform (Solaris). The copy process preserves the original transferred file and creates a new one for the conversion. The new file has the same name as the original, but with a linux extension, as specified by the CONVERSION_EXTENSION parameter. The tablespace is created in the target database as read-only.
This operation also creates a log file named ts1_01.plg in the same directory as the data file. If running the procedure returns errors, examining the contents of this file may help identify the reason.
Pulling Multiple Tablespaces
The preceding example addresses the case of a single simple tablespace. But what if Lora wants to move a set of tablespaces or if some of the tablespaces have more than one data file? In such a scenario, she can use a different procedure, PULL_TABLESPACES, in the same package. Listing 2 shows an example of how Lora can transfer two tablespaces, TS7 and TS8, regardless of how many data files they have.
The procedure expects the tablespace and directory names to be given in a data type of VARCHAR2. Lines 2 through 5 in Listing 2 show the declarations for those variables, and lines 10 through 13 show the variable assignments to the corresponding tablespace and directory names. Because two directories are being defined, the first file is created in the first directory, the next file in the second directory, the third file in the first directory again, and so on. The operation is performed through a Data Pump job, whose name is assigned in line 17. The conversion of the data file from the endian byte order of the source system to that of the target happens automatically, if necessary. The new file created on the target database receives a linux extension, as shown in line 21. The process is recorded in a log file named ts7_ts8.log in the directory specified by the directory object LOG_DIR (line 14).
The advantage of this approach is clear. All the tasks needed to transport a tablespace from one system to the other are encapsulated inside one program unit, with the details being completely transparent to the user. Even the function of transferring files from the source to the target is handled inside this procedure with the supplied package DBMS_FILE_TRANSFER. Users simply call this procedure with the names of the tablespaces that will be refreshed in the local database. They
do not have to worry about underlying details (such as the operating system), because the file transfer converts the files automatically.
Is there anything not to like about this approach? The main disadvantage Lora describes in the meeting is that the encapsulation of the individual functions inside a single procedure may mask an error attributable to one step, complicating problem diagnosis. The manual approach of transporting a tablespace, with the required transparency of commands, does have the advantage of letting the user see the result of each individual action.
Meeting Adjourned
Lora has presented several data movement options for Acme's proposed data-warehouse/data-mart architecture.
The first option, using transportable tablespaces, enables a full set of tablespaces to be moved (including not only tables but also indexes, materialized views, and other objects). It is also generally the fastest approach of the three options. However, one primary disadvantage is that the specified tablespaces need to be set to read-only while the files are copied.
The next option, using Data Pump, does not carry the requirement for read-only tablespaces. This option
is useful when the team needs to
move only specific tables rather than entire tablespaces.
Finally, pulling tablespaces combines all the steps in the transportable tablespaces approach into a single step. This approach makes it very easy to replicate data, but it offers the DBA less flexibility in scheduling each specific step for optimal performance.
At the conclusion of the meeting, senior management thanks Lora, and Lora thanks senior management for having supported Acme's migration to Oracle Database 10g, because that made the different data movement options possible.
Arup Nanda (arup@proligence.com) is the manager of Database Systems at Starwood Hotels and Resorts in White Plains, New York. He is the recipient of Oracle Magazine's 2003 DBA of the Year award and the coauthor of Oracle Privacy Security Auditing, from Rampant Press (rampant-books.com), 2003.
|