Oracle Database 11g: by Arup Nanda
Support for data compression is nothing new in Oracle Database, but 11g takes the concept to a new level with Advanced and Hybrid Columnar Compression.
Compression is nothing new to Oracle; it has been around in the form of COMPRESS BASIC since Oracle9i Database.
Compression is always CPU-intensive and it takes time. Usually, if you compress data, it has to be uncompressed to be usable. While this requirement is acceptable in data warehouse environments, where the SQL typically operates on a large number of rows and long response times are generally tolerable, it may not be acceptable in OLTP environments.
Now, in Oracle Database 11g (but only if you have licensed the Advanced Compression Option), you can do this:
create table my_compressed_table (
   col1 number(20),
   col2 varchar2(300),
   ...
)
compress for all operations

The clause "compress for all operations" enables compression during all DML activities, such as INSERT and UPDATE, and not just during direct path inserts as in previous versions.
Will that slow the DML? Not necessarily. This is where the new feature works best. The compression does not occur when the row is inserted into the table. Instead, the rows are inserted uncompressed and in the routine way. When a certain number of rows are inserted (or updated) in the uncompressed manner, the compression algorithm kicks in and compresses all the uncompressed rows in the block. In other words, the block is compressed, not the row. The threshold at which the compression occurs is internally defined inside the RDBMS code.
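The deferred, block-level trigger described above can be sketched as a small model. The following Python is purely illustrative: the class, the threshold value, and the use of a generic compressor are my own assumptions, since the real trigger point lives inside the RDBMS code and is not documented.

```python
# Conceptual model (NOT Oracle internals): rows are stored uncompressed
# until the block passes a threshold, then the whole block is compressed
# at once. The threshold value here is an invented, illustrative number.
import zlib

BLOCK_THRESHOLD = 8192  # hypothetical "block nearly full" trigger, in bytes

class Block:
    def __init__(self):
        self.rows = []          # rows kept uncompressed on insert
        self.compressed = None  # set only once the threshold is crossed

    def insert(self, row: bytes):
        # Normal DML pays no compression cost on this path...
        self.rows.append(row)
        if sum(len(r) for r in self.rows) >= BLOCK_THRESHOLD:
            # ...the CPU cost is paid once, for the whole block.
            self.compress()

    def compress(self):
        self.compressed = zlib.compress(b"\n".join(self.rows))
        self.rows = []

block = Block()
for i in range(400):
    block.insert(f"ACCT-{i:05d}|John|New York".encode())
print(block.compressed is not None)  # True: the threshold was crossed mid-stream
```

Note that individual inserts below the threshold are just ordinary appends; the compression step runs only when the accumulated uncompressed rows cross the limit, which is why the steady-state DML cost is unaffected.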
Consider a table, ACCOUNTS, with records shown below:
Inside the database, assume a single database block has all the above rows.
This is what an uncompressed block looks like: every record carries the full data in every field (column). When the block is compressed, the database first identifies the repeating values found across the rows, moves them out of the rows, and puts them near the header of the block. Each repeating value in the rows is replaced with a symbol that represents it. This is shown conceptually in the figure below, where you can see a block before and after compression.
Note how the values have been taken out of the rows and put in a special area at the top called the "Symbol Table". Each value in the columns is assigned a symbol that takes its place inside the rows. As symbols are smaller than the actual values, the records become considerably smaller than the originals as well. The more repeating data the rows contain, the more compact the block becomes.
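As a toy model of this substitution (illustrative only, not Oracle's actual block format), a per-block symbol table might be built like this; the function name and symbol scheme are my own:

```python
# Toy illustration of block-level symbol substitution: repeated column
# values are moved into a per-block "symbol table" and the rows store
# short symbols instead of the full values.
def compress_block(rows):
    symbol_table = {}   # value -> symbol, conceptually kept near the block header
    out_rows = []
    for row in rows:
        encoded = []
        for value in row:
            if value not in symbol_table:
                symbol_table[value] = f"S{len(symbol_table)}"
            encoded.append(symbol_table[value])
        out_rows.append(encoded)
    return symbol_table, out_rows

rows = [
    ("John", "New York", "Checking"),
    ("Jack", "New York", "Checking"),
    ("John", "New York", "Savings"),
]
symbols, compressed = compress_block(rows)
print(len(symbols))   # 5 distinct values across 9 fields
print(compressed[0])  # ['S0', 'S1', 'S2']
```

The more often a value repeats, the more rows share a single symbol-table entry, which is exactly why highly repetitive data compresses best.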
Because compression occurs as a triggered event, not when each row is inserted, the performance impact of compression during normal DML processing is nil. When the compression is triggered, the demand on the CPU is of course high, but at all other times the CPU impact is zero, making compression suitable for OLTP applications as well, which is the sweet spot for compression in Oracle Database 11g.
Apart from reduced space consumption, the compressed data takes less time to go across the network, uses less space for backup, and makes it feasible to maintain full copies of the production database in QA and testing.
In 11g Release 2 (but only on Oracle Exadata Storage Server v2), compression technology has been extended in the form of Hybrid Columnar Compression (HCC).
First of all, why is there a need to extend the compression capability? The reason is simple: not all data in a database is accessed in the same way. For example, some data (such as company email) needs to be in the database purely for legal reasons; it must be accessible all the time, even if actually needed only rarely. This data needs to be stored, and storage costs money: for the disks, for the power to run them, for cooling, for floor space. So why use expensive storage for data that is rarely, if ever, accessed?
This is where HCC comes in. Typically, compression works by replacing repeating values with smaller symbols, reducing overall space consumption. For instance, suppose the rows of uncompressed data look like this (with columns separated by "|"):
Row1: Quite_a_large_value1|Quite_a_long_value1|Another_quite_long_value1
Row2: Quite_a_large_value2|Quite_a_long_value1|Another_quite_long_value1
Row3: Quite_a_large_value1|Quite_a_long_value2|Another_quite_long_value1
Row4: Quite_a_large_value1|Quite_a_long_value1|Another_quite_long_value2
The total size of the four rows is 264 bytes. Note that there are actually just six distinct values in these four rows, each used several times:
Quite_a_large_value1
Quite_a_large_value2
Quite_a_long_value1
Quite_a_long_value2
Another_quite_long_value1
Another_quite_long_value2
When this block is compressed, a symbol is assigned to each distinct value, and the actual values are replaced by the corresponding symbols. For instance, here are the assignments:
Value                       Replaced by Symbol
-------------------------   ------------------
Quite_a_large_value1        A1
Quite_a_large_value2        A2
Quite_a_long_value1         B1
Quite_a_long_value2         B2
Another_quite_long_value1   C1
Another_quite_long_value2   C2
The rows now look like:

Row1: A1|B1|C1
Row2: A2|B1|C1
Row3: A1|B2|C1
Row4: A1|B1|C2
Total size: 32 bytes, a substantial reduction from 264, by about 88%. (Of course the percentage of reduction will depend on the data, in particular how many distinct values are present, but the principle is the same.) The relationship between the symbols (A1, A2, etc.) and the values they represent ("Quite_a_large_value1", etc.) is stored in a structure called a symbol directory, kept in the block header. Each block header must store the symbols used in that block, so there will naturally be repetition, since the same values recur across blocks.
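The arithmetic above can be checked directly. This short snippet recomputes the 264-byte and 32-byte totals, assuming two-character symbols and retained "|" separators, as in the example:

```python
# Recompute the space figures from the example above.
rows = [
    "Quite_a_large_value1|Quite_a_long_value1|Another_quite_long_value1",
    "Quite_a_large_value2|Quite_a_long_value1|Another_quite_long_value1",
    "Quite_a_large_value1|Quite_a_long_value2|Another_quite_long_value1",
    "Quite_a_large_value1|Quite_a_long_value1|Another_quite_long_value2",
]
uncompressed = sum(len(r) for r in rows)
print(uncompressed)  # 264 bytes, separators included

# Each row becomes e.g. "A1|B1|C1": three 2-byte symbols plus 2 separators.
compressed = len("A1|B1|C1") * len(rows)
print(compressed)    # 32 bytes
print(round(100 * (1 - compressed / uncompressed)))  # 88 (percent reduction)
```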
If you look at real-life data, you will notice that values typically repeat within a column, not within a row. For instance, a column called FIRST_NAME will have values like John and Jack, while one called CITY_NAME will have values like New York and Los Angeles; you would not expect to see New York in the FIRST_NAME column. Since values repeat more frequently within columns than within rows, you can keep just one symbol directory per column. Symbol directories take up space, so this significant reduction in their number reduces overall space compared to traditional compression, where every block must have its own symbol directory.
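A simplified sketch of the columnar idea (my own model, not HCC's actual on-disk format or compression-unit layout) keeps one dictionary per column instead of one symbol table per block:

```python
# Why columnar dictionaries pay off: values repeat within a column, so a
# single per-column dictionary can serve all rows, replacing the symbol
# table that row-organized compression needs in every block.
def columnar_compress(rows):
    ncols = len(rows[0])
    dictionaries = [{} for _ in range(ncols)]  # one dictionary per column
    encoded = []
    for row in rows:
        enc = []
        for c, value in enumerate(row):
            d = dictionaries[c]
            if value not in d:
                d[value] = len(d)  # small integer code for this column
            enc.append(d[value])
        encoded.append(enc)
    return dictionaries, encoded

rows = [
    ("John", "New York"),
    ("Jack", "Los Angeles"),
    ("John", "New York"),
    ("Jack", "New York"),
]
dicts, encoded = columnar_compress(rows)
print([len(d) for d in dicts])  # [2, 2]: two first names, two cities
print(encoded)                  # [[0, 0], [1, 1], [0, 0], [1, 0]]
```

Because each column's values cluster around a few distinct entries, the dictionaries stay small while serving every row, which is the space win over repeating a symbol directory in each block.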
Hybrid Columnar Compression uses this method. It achieves a significant amount of compression but affects DML performance, so it is best used for tables that are queried but not subject to DML.
Here is an example of a table created with this type of compression:
create table trans_comp nologging compress for query low as select * from trans;
The clause "compress" compresses the table; the clause "for query" compresses it in the hybrid columnar manner; and the clause "low" compresses it less aggressively, so more space is consumed but compression and decompression do not consume as much CPU. More aggressive compression can be achieved by replacing "low" with "high". This type of HCC is known as warehouse compression, since it is useful for data warehouses, where a lot of data is stored and queried frequently.
If a table will be accessed very rarely, you can compress it even more, using a “for archive” clause:
create table trans_comp nologging compress for archive low as select * from trans;
This will reduce space consumption even more, but at the cost of CPU. Like "for query", it has two settings, high and low. This is known as archive compression, intended for data that is accessed infrequently. Here is an example of the compression ratios on a set of representative data; of course, they will vary widely with the data.
[Table: Type of Compression vs. Compressed Table Size]
Hybrid Columnar Compression allows you to compress tables that are not subject to too much DML but whose data can’t be dropped for legal or other reasons. Now you can store these tables yet consume far less storage, and possibly use less CPU as well. Note again that this feature is available only on Oracle Exadata Storage Server v2. For more information, refer to this white paper.