|
TECHNOLOGY: Import and Export
Supercharging the Pump
By Jonathan Gennick
New utilities in Oracle Database 10g take performance and versatility to new levels.
New import and export features introduced in Oracle Database 10g, called Oracle Data Pump, represent a radical departure from the client/server approach to which database users have grown accustomed over the past several generations of Oracle Database. The server now runs export and import jobs. You can load or unload massive amounts of data quickly using parallelization, and you can adjust the degree of parallelism on the fly. Export and import jobs are now restartable, so a failure doesn't necessarily mean starting over. The API is exposed and easy to use;
it's simple to create an import or export job from PL/SQL. And once started, such jobs run in the background, but you can check status and make modifications, from anywhere, using the client utilities.
Architecture
Before Oracle Database 10g, (Oracle7 through Oracle9i) the import and export utilities ran as clients and did the bulk of the work. Data being exported was read by the database instance, passed over the connection to the export client, and then written to disk. All the data was single-threaded through the one export process. Data volumes today are often magnitudes larger than when this architecture was first put in place, making that single export process a bottleneck because performance of an export job is limited by the throughput that the export utility can sustain.
With Oracle Database 10g and the new Data Pump architecture, all the work is now done by the database instance, which can parallelize the work in two ways: by creating multiple Data Pump worker-processes to read/write data being exported/imported, and by creating parallel I/O server processes to more quickly SELECT or INSERT that data. Gone is the single-process bottleneck.
Data Pump jobs are created, monitored, and adjusted using the new
DBMS_DATAPUMP PL/SQL API. The new import and export utilitiesimpdp and expdp, respectivelyare nothing more than command-line interfaces to the API. You can initiate a jobsay, an export jobusing the Data Pump export utility. You can then shut down your client, go home for the evening, and eat dinner; all while your job is still running. Later in the evening, you can reconnect to that same job, check the status, and even increase the degree of parallelism to get more work done at night while users aren't on the system. The next morning, you can decrease the degree of parallelism, or even suspend the job, to free up resources for online users during the day.
The ability to restart jobs is an important feature of the Data Pump architecture. You can stop and restart a Data Pump job at any time, perhaps to free up resources for online users. You can also recover easily from file system space problems. If a 12-hour export fails for lack of disk space 11 hours into the job, you no longer need to restart the job from scratch, repeating the first 11 hours of work. Instead, you can attach to the failed job, add one or more new dump files, restart from the point of failure, and you'll be done in an hour. This is a huge benefit when you're working with large amounts of data.
File System Access
Having the server handle all file
I/O is a great boon for DBAs performing exports and imports remotely. It's easy enough now on UNIX-like systems (such as Linux) to telnet or ssh into a server, get a command prompt, and initiate an export or import
job that actually runs on the server. However, that's not so easily done on other operating systems, Windows being a notable example. Before Data Pump, to export a large amount of data from an Oracle database on Windows, you pretty much had to be sitting at the server console to issue the commands. The alternative of exporting over a TCP/IP connection is viable
only for very small amounts of data. Data Pump changes all this, because even when you initiate an export or import by running the new export
and import utilities on your client,
the job runs on the server; all the I/O happens on the server.
For security purposes, Data Pump requires you to specify target directories, those containing dump files that you wish to create or to read, using Oracle directory objects. For example:
CREATE DIRECTORY export_dumps
AS 'c:\a';
GRANT read, write
ON DIRECTORY export_dumps
TO gennick;
I executed these statements on my lab database while logged in as SYSTEM to create a directory object pointing to a scratch directory on my disk that I want to use to hold export dump files. The GRANT statement gives the user gennickthat's meaccess to the directory. I gave myself read/write access, because I'll be exporting and importing. You can grant a user read access to limit that user to importing.
Starting an Export Job
You start an export using the new expdp utility. The parameters are not the same as for the old exp utility, so you'll want to familiarize yourself with the new parameters. You can specify parameters on the command line, but for this article I'm using parameter files. I want to export my entire schema, and I'll use the following parameters:
DUMPFILE=gnis%U.dmp
DIRECTORY=export_dumps
LOGFILE=gnis_export.log
JOB_NAME=gnis_export
DUMPFILE specifies the file to which I want to write exported data. The %U syntax gives me an incrementing counter, resulting in the filenames gnis01.dmp, gnis02.dmp, and so forth. DIRECTORY specifies my target directory.
My LOGFILE parameter gives a name to the log file that is created by default for any export job. JOB_NAME gives a name to the job. I chose a name that is easy to remember (and to type), because I may need to attach to this job later. Take care to specify job names that don't conflict with schema object names in your login schema. Data Pump creates a table known as the job's master table in your login schema with a name matching the job name. This table tracks the status of the job and is ultimately written to the dump file as a record of what that file contains.
Listing 1 shows an export job being started. One of the first things the job does is to estimate the amount of disk space required. After that estimate displays, I press ctrl-C to get to an interactive export prompt, and then I use the EXIT_CLIENT command to return to my operating system command prompt. The export job is still running on the server.
Note that if I wanted to do a parallel export and wanted to spread my I/O across two disks, I could make the following changes to the DUMPFILE parameter values and add the PARALLEL parameter and value as follows:
DUMPFILE=export_dumps01:gnis%U.dmp,
export_dumps02:gnis%U.dmp
PARALLEL=2
Note that in this parallel export,
the directory is specified as part of
the filename.
Checking Status
You can attach to a running job at any time to check the status. To attach to an export job, issue an expdp command, using the ATTACH parameter to specify the job name. Listing 2 shows an attach to the GNIS_EXPORT job. When you attach to a job, expdp displays information about the job and the current status of the job and presents you with an EXPORT> prompt.
At any time while you're attached to a job, you can issue the STATUS command to see the current status, as shown in Listing 3. You can also issue the CONTINUE_CLIENT command, which you can abbreviate to just CONTINUE as shown in Listing 4, to return to the state in which log output is displayed as the job progresses.
You can quickly check the state of all Data Pump jobs by querying the DBA_DATAPUMP_JOBS view. You won't get the detail that the STATUS command gives, but you can quickly see which jobs
are executing, which are idle, and so forth. Another view to be aware of is
DBA_DATAPUMP_SESSIONS, which lists all active Data Pump worker-processes.
Recovering from Failure
The ability to restart a job lets you recover from certain types of failure. For example, Listing 5 shows the tail of a log from an export job that ran out of dump file space. All is not lost, however. The job has merely entered an idle state, which you can see if you attach to the job and view the status output. The status does not indicate the reason the job is idling. To be certain it's from lack of dump file space, you need to check the log file.
After attaching to a job that has stopped for lack of dump file space, you can choose between two courses of action: you can use the KILL_JOB command to abort the job, or you can add one or more dump files and continue the job. If the space problem is from lack of space on a drive, you of course want to be sure the files you add are on a different disk, one with available space. You may need to create a new Oracle directory object pointing to this new location.
Listing 6 uses the ADD_FILES command to add two files to my idle job. The two files are in different directories, neither of which is the same as that specified for the job's first dump file. I use the START_JOB command to restart the job, and then I use CONTINUE to watch the remaining log output scroll by on the screen.
Import jobs won't suffer from lack of dump file space. Rather, they may suffer from lack of tablespace, or the inability to extend a tablespace. The recovery process is much the same as for an export job. First make space availableby adding a datafile to a tablespace, by extending a datafile, or by some other means. Then attach to the job, and issue a START_JOB command. The import will pick up from where it left off.
Importing Selected Data
The example so far in this article shows a schema-level export of all the objects owned by the user GENNICK. To demonstrate some of Data Pump's new versatility, I'll import that data, and to make things interesting, I'll lay down the following requirements:
- Import only the GNIS table
- Import that table into the MICHIGAN schema
- Import only rows related to Michigan
- Do not import the original storage parameters
To begin, I can write the following four lines in my import parameter file:
DUMPFILE=gnis%U.dmp
DIRECTORY=export_dumps
LOGFILE=gnis_import.log
JOB_NAME=gnis_import
These four lines are nothing new. They name the dump file, the directory, the log file, and the job. Moving on to our four requirements, I can use the INCLUDE parameter to restrict the import to just the one table of interest:
INCLUDE=TABLE:"= 'GNIS'"
INCLUDE is an interesting parameter. When it comes to importing part of a dump file, you can go down two paths:
- You can use one or more INCLUDE parameters to list just the objects that you wish to import.
- You can use EXCLUDE parameters to list what you don't want, importing everything else.
Since I want only one object, it's easier to specifically include that object than it is to specifically exclude everything else. The first part of my INCLUDE parameter value is the keyword TABLE, indicating that I wish to include an object that is a table (as opposed to, say, a function or a procedure). Next comes a colon, and following that a WHERE clause predicate. I specifically want the table named GNIS, so the predicate is "= 'GNIS'". You can write more elaborate predicates if necessary. Using the INCLUDE and EXCLUDE parameters, you can be
quite granular in what gets imported
or exported. I urge you to read the
documentation on these two parameters carefully. They're quite powerful and versatile, well beyond what I can describe in this article.
I can accomplish the schema change, remapping the table from the GNIS schema into the MICHIGAN schema, quite easily:
REMAP_SCHEMA=gennick:michigan
I only want rows for Michigan. For that, I can use the QUERY parameter to specify a WHERE clause:
QUERY="WHERE gnis_state_abbr='MI'"
QUERY was available with the old utilities, but only for export. Data Pump can make QUERY available for import too, because Data Pump leverages Oracle's relatively new external table functionality. Whenever possible, Data Pump will choose to export or import data using the direct path, which involves reading data from the database datafiles and writing directly to an export dump file, or reading the dump file and writing directly to the database datafiles respectively. However, when you specify the QUERY parameter, Data Pump uses an external table. For an import job, Data Pump will create an external table using the ORACLE_DATAPUMP access driver and will issue an INSERT...SELECT...FROM statement.
My last requirement is to avoid importing the storage parameters associated with the table that was exported. I want the new GNIS table in the MICHIGAN schema to take on the default storage parameters for the schema's default tablespace. The reason for this is that the default tablespace for MICHIGAN is not large enough to contain segments of the original size, but it is large enough to hold just those rows that pertain to Michigan. Using the TRANSFORM parameter, I can tell import not to include any segment attributes associated with the original table:
TRANSFORM=SEGMENT_ATTRIBUTES:N
It may seem like a little thing, but many times in the past I've wished for functionality like that of the TRANSFORM parameter in the old import utility. I've often been tripped up trying to import small amounts of production data into test systems just because the production extents, even though they were largely empty, were far larger than what my test system would support. For a single table import like this, precreating the table is one way around the problem. However, precreating can become cumbersome as the number of tables grows, and it's nice to have an easy switch like TRANSFORM to cause all segment attributes in the dump file to be universally ignored.
Putting all the options I've just described into a parameter file, I can invoke the import as follows:
impdp michigan/password
parfile=gnis_import.par
When importing as a nonprivileged user, you need to connect to the target schema. If you have the IMP_FULL_DATABASE role, you can log in as yourself and import into any target schema.
Performance and Versatility
Oracle Data Pump represents a great improvement in performance over the previous export and import utilities. Much of this performance improvement comes from the use of parallelism to read and write dump files. You specify the degree of parallelism
to make whatever trade-off between speed and resource consumption
your needs demand.
Data Pump elegantly leverages other recently developed and innovative
features of Oracle Database. Flashback
is used to ensure consistency of exported data, and the FLASHBACK_SCN
and FLASHBACK_TIME parameters give you complete control over that functionality. The direct-path API is used to maximize performance whenever possible, and when the direct-path API cannot be used, data is transferred using external tables and the new ORACLE_DATAPUMP external table access driver.
Data Pump provides you with flexibility in addition to performance. This shows in the implementation of the INCLUDE and EXCLUDE parameters, the QUERY parameter, the TRANSFORM parameter, and other parameters that give you fine-grained control over data and objects being loaded or unloaded.
In a world that is constantly redefining the meaning of "large database" upward to heights only dreamed of a decade ago, Data Pump is a welcome addition to your DBA arsenal, enabling you to move more data in and out of your database faster than ever before.
Jonathan Gennick (Jonathan@Gennick.com) is an experienced Oracle DBA and an Oracle Certified Professional living in Michigan's Upper Peninsula. He runs the Oracle-article e-mail list, which you can learn about by visiting gennick.com. Gennick recently collaborated on the Oracle Regular Expressions Pocket Reference (O'Reilly & Associates, 2003).
|