Archiving Data Using XML
By Arup Nanda
Build a powerful, flexible data archive.
Acme Insurance Company is ready to archive data more than three years old to tape and to delete it from the production database. The company's IT architects have suggested an archive strategy that also uses transportable tablespaces, and the production tables have been carefully partitioned on the transaction date with this specific objective in mind.
According to the original archiving plan, the partitions with data more than three years old would be converted into standalone tables by use of partition exchange, and then the tablespaces containing those tables could be transported to tape and dropped from the main database. If deleted data needed to be reinstated, the tablespaces transported earlier could be restored from tape and plugged back into the main database. Transporting tablespaces and exchanging partitions do not modify data, so they do not produce UNDO and REDO and therefore perform significantly better than the traditional approaches for data movement, such as using INSERT and DELETE statements. And if the archived data needed to be searched for some information, it would be as simple as querying the standalone tables created during partition exchange.
The plan would be perfect, but there is a problem. As part of other projects, the IT architects are planning incremental changes to the structure of the production tables. Columns will be added and dropped, datatypes will be changed, constraints will be modified, and so on. The moment a table's structure is modified, it will no longer be possible to plug the tablespace transported earlier back into the main database, because the table structures will not match!
John, the company's lead DBA, has been called back from his vacation to find a way to make the archiving strategy work in a world of changing database structures.
An XML Solution
John puts down his luggage, rolls up his sleeves, and summarizes the archiving requirements in five main points:
1. The data must be archived based on its age.
John's solution is to use XML as the format for the archived data.
DBAs and programmers at Acme, responding to John's solution, express concern that their applications have been developed to use the relational model with regular tables, that these applications and tables can't handle XML data, and that there isn't time for a thorough impact analysis.
John responds that the programs will still see the data as relational, even though the storage is actually in XML.
The Archival Table
John offers an explanation of the XML archiving solution to the concerned programmers and DBAs, starting with a representative table—TRANS—from their database, created and loaded as shown in Listing 1. The TRANS table has been partitioned on the TRANS_DATE column, with the idea that after three years, the oldest partition can be transported off and dropped. Because dropping a partition has virtually no impact on database performance and generates very little REDO and UNDO information, this is also a faster way to purge data than methods such as deletion and truncation.
Code Listing 1: Creating and populating the TRANS table
create table trans ( trans_date date, trans_id number(10), trans_type varchar2(1), trans_amount number(12,2) ) partition by range (trans_date) ( partition Jan2003 values less than (to_date('01/02/2003', 'dd/mm/yyyy')), partition Feb2003 values less than (to_date('01/03/2003', 'dd/mm/yyyy')), . . . ); begin for ctr in 1..100 loop insert into trans values ( sysdate - dbms_random.value(1,300), ctr, decode (round(dbms_random.value(1,2)),1,'C','D'), dbms_random.value(1,10000) ); end loop; end; /
However, because the production table structure needs to change, John explains that Acme will need to create another table specifically for archiving. The columns in this archive table will not be the same as the columns in the production table; most of the column values from the production table will be placed in the archive table in a single column—trans_details—of datatype XMLTYPE:
create table arch_trans ( arch_date date, trans_date date, trans_details xmltype ) partition by range (trans_date) ( partition Jan2003 values less than (to_date('01/02/2003', 'dd/mm/yyyy')), partition Feb2003 values less than (to_date('01/03/2003', 'dd/mm/yyyy')), . . . ) /
John notes two important points:
To convert the data from relational to XML format, John uses two SQL functions: XMLFOREST and XMLELEMENT. XMLFOREST converts the relational data into XML format. The column values are converted into XML data enclosed by tag names identical to the column names. To demonstrate, John selects two columns of the TRANS table, modified by the XMLFOREST function:
select xmlforest(trans_id,trans_type) from trans; XMLFOREST(TRANS_ID,TRANS_TYPE) ---------------------------------------------------- <TRANS_ID>1</TRANS_ID><TRANS_TYPE>D</TRANS_TYPE> <TRANS_ID>2</TRANS_ID><TRANS_TYPE>C</TRANS_TYPE> . . .
The XMLELEMENT function places a user-defined tag around the records. John uses the <TransRec> tag to contain records of the TRANS table:
select xmlelement("TransRec",xmlforest( trans_id,trans_type)) from trans; XMLELEMENT("TRANSREC",XMLFOREST( TRANS_ID,TRANS_TYPE)) --------------------------------------------- <TransRec><TRANS_ID>1</TRANS_ID> <TRANS_TYPE>D</TRANS_TYPE> </TransRec> <TransRec><TRANS_ID>2</TRANS_ID> <TRANS_TYPE>C</TRANS_TYPE> </TransRec> . . .
Using these two functions, John demonstrates the SQL script shown in Listing 2, for converting the data from the TRANS table and loading it into the ARCH_TRANS table.
Code Listing 2: Archiving as XML
1 insert into arch_trans 2 ( 3 arch_date, 4 trans_date, 5 trans_details 6 ) 7 select 8 sysdate, 9 trans_date, 10 xmlelement("TransRec", 11 xmlforest( 12 trans_id, 13 trans_type, 14 trans_amount 15 ) 16 ) 17 from trans
The columns of TRANS converted to XML format are stored in the TRANS_DETAILS column of the ARCH_TRANS table. After the data is inserted into the ARCH_TRANS table, John selects from it, as shown in Listing 3, to show how the data is stored. Note that John uses the SET LONG statement to enable the display of a long list of data (the default is only 80 characters).
Code Listing 3: Looking at ARCH_TRANS
SQL> set long 999999999 SQL> select * from ARCH_TRANS; ARCH_DATE TRANS_DAT TRANS_DETAILS ---------------- -------------------- ------------------------------------ 07-JAN-06 27-DEC-05 <TransRec> <TRANS_ID>80</TRANS_ID> <TRANS_TYPE>D</TRANS_TYPE> <TRANS_AMOUNT>4142.68</TRANS_AMOUNT> </TransRec>
When it is time to archive a particular partition, John converts that partition of the ARCH_TRANS table to a standalone table:
alter table arch_trans exchange partition jan2003 with table arch_trans_jan2003 /
The ARCH_TRANS_JAN2003 table is created in the same tablespace as the JAN2003 partition. John can find the tablespace name by using this query:
select tablespace_name from dba_tab_partitions where table_name = 'TRANS' and partition_name = 'JAN2003'; TABLESPACE_NAME --------------------------------- JAN2003
John transports the JAN2003 tablespace in which the ARCH_TRANS_JAN2003 table resides.
expdp \"/ as sysdba\" transport_tablespaces=jan2003 dumpfile=jan2003.dmp
Finally, John drops the exchanged table (ARCH_TRANS_JAN2003) and the partition from both the production table (TRANS) and the archive table (ARCH_TRANS).
drop table arch_trans_jan2003; alter table trans drop partition jan2003; alter table arch_trans drop partition jan2003;
John's demonstrations to the Acme programmers and DBAs of how to archive the table are well received, but reinstating the table easily is another important archiving requirement. To show this process, John outlines how to reverse the activities performed during archiving.
First, he plugs the tablespace back into the database:
impdp dump_file=jan2003.dmp transport_datafiles='/u01/jan2003.dbf'
This instantly makes the ARCH_TRANS_JAN2003 table available in the database. The table can be queried now, or it can be merged with the main table (ARCH_TRANS) as a partition and queried. To merge it back, John issues
alter table arch_trans exchange partition jan2003 with table arch_trans_jan2003 /
Now the ARCH_TRANS table also contains the data for January 2003. Reinstatement is complete.
Querying the Archives
The Acme DBAs and programmers are still concerned. They remind John that the data in the ARCH_TRANS table is in XML—not relational—format.
John writes the query shown in Listing 4 to transform the XML data to relational format. The query uses XPath notation to extract all the columns from the XML representation of the data. Acme's applications still see the archived data as relational, and they do not need to change. The developers in attendance seem satisfied with this solution.
Code Listing 4: Transforming XML to relational format
col Trans_ID format a5 col Trans_Type format a1 col Trans_Amount format 999999.99 select arch_date, trans_date, extractvalue(trans_details,'/TransRec/TRANS_ID') Trans_ID, extractvalue(trans_details,'/TransRec/TRANS_TYPE') Trans_Type, to_number(extractvalue(trans_details,'/TransRec/TRANS_AMOUNT')) Trans_Amount from arch_trans;
The key element of the query in Listing 4 is the EXTRACTVALUE function, which extracts the value of an element from the XML document or data. John reminds everyone that the data in the TRANS_DETAILS column includes the XML tags that show the nature of the data. For instance, the data in one record looks like this:
<TransRec> <TRANS_ID>80</TRANS_ID> <TRANS_TYPE>D</TRANS_TYPE> <TRANS_AMOUNT>4142.68 </TRANS_AMOUNT> </TransRec>
To get the value of TRANS_TYPE, John starts at the highest-level element, TransRec, and then uses XPath notation to get to the appropriate level:
John notes an important point here: Unlike SQL, the names in XML are case-sensitive, so TransRec and transRec are not the same.
To make the data retrieval faster, John proposes to create an index on ARCH_TRANS. Most of the queries select data based on TRANS_TYPE, so that is a good candidate for the index. John creates the index:
create index in_arch_trans_type on arch_trans ( extractvalue ( TRANS_DETAILS, '/TransRec/TRANS_TYPE'));
This SQL statement may resemble a function-based index, John explains, but it actually creates the index based on the XPath notation.
Now only one critical requirement is left: the ability to allow for the production tables to change. John shows an example in which a column called ACC_NO is added to the TRANS table after data has been archived and removed from the table.
alter table trans add (acc_no varchar2(10))
Now the new ACC_NO column is in the TRANS table, but it is not in the TRANS_DETAILS column of the ARCH_TRANS table. The insertion of the data from the TRANS_DETAILS column into TRANS will not fail, assures John.
When a new column is added, the INSERT script needs to be updated to reflect the existence of the new column. The new script is shown in Listing 5. It's identical to Listing 2 except at line 15, where the new ACC_NO column is selected.
Code Listing 5: Modifying the archival script to reflect the added column
1 insert into arch_trans 2 ( 3 arch_date, 4 trans_date, 5 trans_details 6 ) 7 select 8 sysdate, 9 trans_date, 10 xmlelement("TransRec", 11 xmlforest( 12 trans_id, 13 trans_type, 14 trans_amount 15 acc_no 16 ) 17 ) 18 from trans;
John also modifies the retrieval script to show the new column, as shown in Listing 6.
Code Listing 6: Modified retrieval script
select arch_date, trans_date, extractvalue(trans_details,'/TransRec/TRANS_ID') Trans_ID, extractvalue(trans_details,'/TransRec/TRANS_TYPE') Trans_Type, to_number(extractvalue(trans_details,'/TransRec/TRANS_AMOUNT')) Trans_Amount, extractvalue(trans_details,'/TransRec/ACC_NO') Acc_No from arch_trans;
Again, Listing 6 is almost identical to the script in Listing 2, except for the new ACC_NO column name.
Several of the Acme DBAs and developers point out that the older records will have no ACC_NO values in the TRANS_DETAILS column, whereas newer records will have the ACC_NO values. They question whether the SELECT statement will be successful.
John says that XML's flexibility allows the referencing of elements that may not be present. In such a case, the EXTRACTVALUES function returns NULL. John executes the query in Listing 6 and highlights two records from the output as shown in Listing 7. The value of ACC_NO is shown as NULL in the first record that doesn't have the ACC_NO column. The second record shows the ACC_NO value as entered.
Code Listing 7: Two Sample Records
ARCH_DATE TRANS_DAT TRANS T TRANS_AMOUNT ACC_NO --------- --------- ----- - ------------ --------- 01-JAN-06 20-OCT-05 100 C 5740.29 01-FEB-06 07-JAN-06 101 X 100.03 M101
Similarly, the columns could have been dropped from the table as well, in which case the XML function would have reported them as NULL and actual values where they are present. This allows modification of the main table while making archiving and retrieval a breeze. In fact, over time, all the columns of a table may be gone, replaced by a set of entirely new columns, but using the XML archive format will make sure the data is accessible by the same SQL. The audience murmurs its approval.
In closing, John revisits the original archiving requirements and shows how they have been addressed by the proposed solutions, shown in Table 1. There are no more questions; there are no more concerns.
Arup Nanda (email@example.com) has been an Oracle DBA for more than 12 years, handling all aspects of database administration—from performance tuning to security and disaster recovery. He is a coauthor of PL/SQL for DBAs (O'Reilly Media, 2005) and was Oracle Magazine's DBA of the Year in 2003.