As Published In
Oracle Magazine
July/August 2006

TECHNOLOGY: XML


Archiving Data Using XML

By Arup Nanda Oracle ACE Director

Build a powerful, flexible data archive.

Acme Insurance Company is ready to archive data more than three years old to tape and to delete it from the production database. The company's IT architects have suggested an archive strategy that also uses transportable tablespaces, and the production tables have been carefully partitioned on the transaction date with this specific objective in mind.

According to the original archiving plan, the partitions with data more than three years old would be converted into standalone tables by use of partition exchange, and then the tablespaces containing those tables could be transported to tape and dropped from the main database. If deleted data needed to be reinstated, the tablespaces transported earlier could be restored from tape and plugged back into the main database. Transporting tablespaces and exchanging partitions do not modify data, so they do not produce UNDO and REDO and therefore perform significantly better than the traditional approaches for data movement, such as using INSERT and DELETE statements. And if the archived data needed to be searched for some information, it would be as simple as querying the standalone tables created during partition exchange.

The plan would be perfect, but there is a problem. As part of other projects, the IT architects are planning incremental changes to the structure of the production tables. Columns will be added and dropped, datatypes will be changed, constraints will be modified, and so on. The moment a table's structure is modified, it will no longer be possible to plug the tablespace transported earlier back into the main database, because the table structures will not match!

John, the company's lead DBA, has been called back from his vacation to find a way to make the archiving strategy work in a world of changing database structures.

An XML Solution

John puts down his luggage, rolls up his sleeves, and summarizes the archiving requirements in five main points:

1. The data must be archived based on its age.
2. The archived data must be dropped from the main database without much performance impact.
3. Reinstating the archived data must be quick and easy.
4. The solution must allow for table structures to change.
5. It must be possible to search inside the archived data without restoring it to the main database.

John's solution is to use XML as the format for the archived data.

DBAs and programmers at Acme, responding to John's solution, express concern that their applications have been developed to use the relational model with regular tables, that these applications and tables can't handle XML data, and that there isn't time for a thorough impact analysis.

John responds that the programs will still see the data as relational, even though the storage is actually in XML.

The Archival Table

John offers an explanation of the XML archiving solution to the concerned programmers and DBAs, starting with a representative table—TRANS—from their database, created and loaded as shown in Listing 1. The TRANS table has been partitioned on the TRANS_DATE column, with the idea that after three years, the oldest partition can be transported off and dropped. Because dropping a partition has virtually no impact on database performance and generates very little REDO and UNDO information, this is also a faster way to purge data than methods such as deletion and truncation.

Code Listing 1: Creating and populating the TRANS table 

create table trans
(
        trans_date              date,
        trans_id                        number(10),
        trans_type              varchar2(1),
        trans_amount    number(12,2)
)
partition by range (trans_date)
(
        partition Jan2003 values less than (to_date('01/02/2003', 'dd/mm/yyyy')),
        partition Feb2003 values less than (to_date('01/03/2003', 'dd/mm/yyyy')),
.
.
.
);

begin
        for ctr in 1..100 loop
                insert into trans values
                (    
                        sysdate - dbms_random.value(1,300),
                        ctr, 
                        decode (round(dbms_random.value(1,2)),1,'C','D'),
                        dbms_random.value(1,10000)
                );   
        end loop;        
end;                 
/                


However, because the production table structure needs to change, John explains that Acme will need to create another table specifically for archiving. The columns in this archive table will not be the same as the columns in the production table; most of the column values from the production table will be placed in the archive table in a single column—trans_details—of datatype XMLTYPE: 

create table arch_trans
(
   arch_date date,
   trans_date date,
   trans_details xmltype
)
partition by range (trans_date)
(
   partition Jan2003 values less than (to_date('01/02/2003', 'dd/mm/yyyy')),
   partition Feb2003 values less than (to_date('01/03/2003', 'dd/mm/yyyy')),
.
.
.
)
/


John notes two important points: 

  • The ARCH_TRANS archive table has been partitioned on TRANS_DATE, in the same way as the TRANS table.

  • Besides the TRANS_DATE column, no other column from the original table is in the ARCH_TRANS table. Rather, a new column called TRANS_DETAILS holds other production table column values in XML format.

Archiving

To convert the data from relational to XML format, John uses two SQL functions: XMLFOREST and XMLELEMENT. XMLFOREST converts the relational data into XML format. The column values are converted into XML data enclosed by tag names identical to the column names. To demonstrate, John selects two columns of the TRANS table, modified by the XMLFOREST function: 

select xmlforest(trans_id,trans_type) 
from trans;

XMLFOREST(TRANS_ID,TRANS_TYPE)
----------------------------------------------------
<TRANS_ID>1</TRANS_ID><TRANS_TYPE>D</TRANS_TYPE>
<TRANS_ID>2</TRANS_ID><TRANS_TYPE>C</TRANS_TYPE>
.
.
.


The XMLELEMENT function places a user-defined tag around the records. John uses the <TransRec> tag to contain records of the TRANS table: 

select xmlelement("TransRec",xmlforest(
trans_id,trans_type)) from trans;
XMLELEMENT("TRANSREC",XMLFOREST(
TRANS_ID,TRANS_TYPE))
---------------------------------------------
<TransRec><TRANS_ID>1</TRANS_ID>
<TRANS_TYPE>D</TRANS_TYPE>
</TransRec>
<TransRec><TRANS_ID>2</TRANS_ID>
<TRANS_TYPE>C</TRANS_TYPE>
</TransRec>
.
.
.


Using these two functions, John demonstrates the SQL script shown in Listing 2, for converting the data from the TRANS table and loading it into the ARCH_TRANS table.

Code Listing 2: Archiving as XML 

        1       insert into arch_trans
        2       (
        3               arch_date,
        4               trans_date,
        5               trans_details
        6       )
        7       select
        8               sysdate,
        9               trans_date,
        10              xmlelement("TransRec",
        11                      xmlforest(
        12                              trans_id,
        13                              trans_type,
        14                              trans_amount
        15                      )
        16              )
        17      from trans


The columns of TRANS converted to XML format are stored in the TRANS_DETAILS column of the ARCH_TRANS table. After the data is inserted into the ARCH_TRANS table, John selects from it, as shown in Listing 3, to show how the data is stored. Note that John uses the SET LONG statement to enable the display of a long list of data (the default is only 80 characters).

Code Listing 3: Looking at ARCH_TRANS 

SQL> set long 999999999
SQL> select * from ARCH_TRANS;

ARCH_DATE           TRANS_DAT              TRANS_DETAILS
----------------    --------------------   ------------------------------------ 
07-JAN-06           27-DEC-05              <TransRec>
                                           <TRANS_ID>80</TRANS_ID>
                                           <TRANS_TYPE>D</TRANS_TYPE>
                                           <TRANS_AMOUNT>4142.68</TRANS_AMOUNT>
                                           </TransRec>


When it is time to archive a particular partition, John converts that partition of the ARCH_TRANS table to a standalone table: 

alter table arch_trans
exchange partition jan2003
with table arch_trans_jan2003
/


The ARCH_TRANS_JAN2003 table is created in the same tablespace as the JAN2003 partition. John can find the tablespace name by using this query: 

select tablespace_name
from dba_tab_partitions
where table_name = 'TRANS'
and partition_name = 'JAN2003';

TABLESPACE_NAME
---------------------------------
JAN2003


John transports the JAN2003 tablespace in which the ARCH_TRANS_JAN2003 table resides. 

expdp \"/ as sysdba\" 
transport_tablespaces=jan2003 dumpfile=jan2003.dmp


Finally, John drops the exchanged table (ARCH_TRANS_JAN2003) and the partition from both the production table (TRANS) and the archive table (ARCH_TRANS). 

drop table arch_trans_jan2003;

alter table trans
drop partition jan2003;

alter table arch_trans
drop partition jan2003;


Reinstatement

John's demonstrations to the Acme programmers and DBAs of how to archive the table are well received, but reinstating the table easily is another important archiving requirement. To show this process, John outlines how to reverse the activities performed during archiving.

First, he plugs the tablespace back into the database: 

impdp dump_file=jan2003.dmp 
transport_datafiles='/u01/jan2003.dbf' 


This instantly makes the ARCH_TRANS_JAN2003 table available in the database. The table can be queried now, or it can be merged with the main table (ARCH_TRANS) as a partition and queried. To merge it back, John issues 

alter table arch_trans
exchange partition jan2003
with table arch_trans_jan2003
/


Now the ARCH_TRANS table also contains the data for January 2003. Reinstatement is complete.

Querying the Archives

The Acme DBAs and programmers are still concerned. They remind John that the data in the ARCH_TRANS table is in XML—not relational—format.

John writes the query shown in Listing 4 to transform the XML data to relational format. The query uses XPath notation to extract all the columns from the XML representation of the data. Acme's applications still see the archived data as relational, and they do not need to change. The developers in attendance seem satisfied with this solution.

Code Listing 4: Transforming XML to relational format 

col Trans_ID format a5
col Trans_Type format a1
col Trans_Amount format 999999.99
select arch_date, trans_date,
extractvalue(trans_details,'/TransRec/TRANS_ID') Trans_ID,
extractvalue(trans_details,'/TransRec/TRANS_TYPE') Trans_Type,
to_number(extractvalue(trans_details,'/TransRec/TRANS_AMOUNT')) Trans_Amount
from arch_trans;


The key element of the query in Listing 4 is the EXTRACTVALUE function, which extracts the value of an element from the XML document or data. John reminds everyone that the data in the TRANS_DETAILS column includes the XML tags that show the nature of the data. For instance, the data in one record looks like this: 

<TransRec>
  <TRANS_ID>80</TRANS_ID>
  <TRANS_TYPE>D</TRANS_TYPE>
  <TRANS_AMOUNT>4142.68
</TRANS_AMOUNT>
</TransRec>


To get the value of TRANS_TYPE, John starts at the highest-level element, TransRec, and then uses XPath notation to get to the appropriate level: 

extractvalue(trans_details,'/TransRec/TRANS_TYPE')


John notes an important point here: Unlike SQL, the names in XML are case-sensitive, so TransRec and transRec are not the same.

To make the data retrieval faster, John proposes to create an index on ARCH_TRANS. Most of the queries select data based on TRANS_TYPE, so that is a good candidate for the index. John creates the index: 

create index 
  in_arch_trans_type
on arch_trans (
  extractvalue (
    TRANS_DETAILS,
    '/TransRec/TRANS_TYPE'));


This SQL statement may resemble a function-based index, John explains, but it actually creates the index based on the XPath notation.

Structural Changes

Now only one critical requirement is left: the ability to allow for the production tables to change. John shows an example in which a column called ACC_NO is added to the TRANS table after data has been archived and removed from the table. 

alter table trans add (acc_no varchar2(10))


Now the new ACC_NO column is in the TRANS table, but it is not in the TRANS_DETAILS column of the ARCH_TRANS table. The insertion of the data from the TRANS_DETAILS column into TRANS will not fail, assures John.

When a new column is added, the INSERT script needs to be updated to reflect the existence of the new column. The new script is shown in Listing 5. It's identical to Listing 2 except at line 15, where the new ACC_NO column is selected.

Code Listing 5: Modifying the archival script to reflect the added column 

1       insert into arch_trans
2         (
3               arch_date,
4               trans_date,
5               trans_details
6         )
7         select
8               sysdate,
9               trans_date,
10              xmlelement("TransRec",
11                      xmlforest(
12                              trans_id,
13                              trans_type,
14                              trans_amount
15                              acc_no
16                      )
17              )
18        from trans;


John also modifies the retrieval script to show the new column, as shown in Listing 6.

Code Listing 6: Modified retrieval script 

select arch_date, trans_date,
extractvalue(trans_details,'/TransRec/TRANS_ID') Trans_ID,
extractvalue(trans_details,'/TransRec/TRANS_TYPE') Trans_Type,
to_number(extractvalue(trans_details,'/TransRec/TRANS_AMOUNT')) Trans_Amount,
extractvalue(trans_details,'/TransRec/ACC_NO') Acc_No
from arch_trans;


Again, Listing 6 is almost identical to the script in Listing 2, except for the new ACC_NO column name.

Next Steps


READ more about
XML functionality in Oracle Database 10g
SQL functions used in this article

Several of the Acme DBAs and developers point out that the older records will have no ACC_NO values in the TRANS_DETAILS column, whereas newer records will have the ACC_NO values. They question whether the SELECT statement will be successful.

John says that XML's flexibility allows the referencing of elements that may not be present. In such a case, the EXTRACTVALUES function returns NULL. John executes the query in Listing 6 and highlights two records from the output as shown in Listing 7. The value of ACC_NO is shown as NULL in the first record that doesn't have the ACC_NO column. The second record shows the ACC_NO value as entered.

Code Listing 7: Two Sample Records 

ARCH_DATE TRANS_DAT TRANS T TRANS_AMOUNT ACC_NO
--------- --------- ----- - ------------ ---------
01-JAN-06 20-OCT-05 100   C 5740.29
01-FEB-06 07-JAN-06 101   X  100.03      M101


Similarly, the columns could have been dropped from the table as well, in which case the XML function would have reported them as NULL and actual values where they are present. This allows modification of the main table while making archiving and retrieval a breeze. In fact, over time, all the columns of a table may be gone, replaced by a set of entirely new columns, but using the XML archive format will make sure the data is accessible by the same SQL. The audience murmurs its approval.

Conclusion

In closing, John revisits the original archiving requirements and shows how they have been addressed by the proposed solutions, shown in Table 1. There are no more questions; there are no more concerns. 

Requirement Solution
Time-based archiving Partitioning of the ARCH_TRANS table in exactly the same way as the main table, TRANS. This way, a whole partition can be exchanged with a table and then transported.
Minimal impact during purging of data The purging means dropping a partition, which has virtually no overhead.
Quick and low-overhead restoration Reinstating the purged data involves plugging the tablespace back into the database, which has virtually no overhead.
Flexibility for table structural changes The data is archived in XML format, which removes restrictions on the table structural changes.
Ability to search within archived data The XML data is shown in a relational format, enabling it to be queried as such, with the familiar SQL interface.

Table 1: Acme archiving requirements and solutions

More XMLType


You can use XMLType in many other cases. Consider a scenario in which the attributes of an entity are not clearly known in advance and evolve over time. An example might be a car, whose attributes might be make, model, trim level, year, and color. Later, for some specially manufactured cars, you may want to add other details such as assembly plant name and engine plant name. If you have a relational format, you can add columns, but suddenly you may have to change all your INSERT statements, even if those new attributes are not inserted. If the attributes are stored in XMLType, all you have to do is to add those new attributes as tags for those specific records only. The SQL would have to change only when those specific attributes are accessed, not otherwise.

Another advantage of XMLType is free search of data, using the Context option, in which case you can create a CTXSYS type of index and search inside the XMLType data for specific strings. This technique is very useful when you don't know which column of the table has a particular value you are searching for; for instance, you are searching for the value ARUP but don't know if that's part of the customer's first name, middle name, last name, street address, city, county, partner's name, or whatever else. The use of context search will reveal that value in a record and identify which tag includes it.


Arup Nanda (arup@proligence.com) has been an Oracle DBA for more than 12 years, handling all aspects of database administration—from performance tuning to security and disaster recovery. He is a coauthor of PL/SQL for DBAs (O'Reilly Media, 2005) and was Oracle Magazine's DBA of the Year in 2003.

Send us your comments