Use the Character Set Scanner to Migrate a Database to Support Unicode
In this module, you will learn how to address the issues involved in migrating your database to a new database character set. This lesson will focus on the detection of invalid or incompatible character data prior to performing an actual database character set migration.
After completing this module, you should be able to:
Use the Oracle Character Set Scanner Utility to perform character set migration tests
Identify and resolve issues involved in migrating a database to a new database character set
Note: Because our sample database is used in other lessons, we will not actually perform a database character set migration. The steps for performing the character set migration are provided for reference purposes only.
Before starting this module, you should have:
Downloaded the scan.zip module files and unzipped them into your working directory
The following is a list of useful reference material if you want additional information about the topics in this module:
Documentation: Database Character Set Migration Guide
Migrating legacy data to Unicode requires careful step-by-step planning. Invalid data needs to be identified for a successful Unicode migration to occur. Otherwise, character set conversion may cause data loss or further data corruption. This lesson will focus on detecting and avoiding potential data migration issues.
How does invalid data get into the database?
A common scenario that allows invalid data into a database is a client whose NLS_LANG character set setting is the same as the database character set. When these settings are the same, Oracle assumes that the data being sent or received is already in the database character set, so no validation or conversion is performed. For example, a database whose character set is US7ASCII will only be able to handle 7-bit ASCII data. If a Western European Windows NT client enters 8-bit data, the following can occur:
If the client's NLS_LANG setting is set correctly to reflect the client character set, WE8MSWIN1252, then characters outside the ASCII range are detected on insertion and replaced with valid replacement characters.
If the client's NLS_LANG is set incorrectly to the database character set, then accented characters, for example, end up in the database as binary values with no corresponding code points in ASCII. Converting this data as part of an export/import operation produces corrupted output.
Another possible source of invalid data is data from mixed character sets inside the database. For example, if the database character set is WE8MSWIN1252, and German and Greek Windows clients both set the NLS_LANG character set to WE8MSWIN1252, then the database will contain a mixture of German and Greek characters.
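On a Windows client, the remedy is to set NLS_LANG to the character set the client actually uses before starting any Oracle tool. A minimal sketch follows; the connect string orcl and the OE account are illustrative assumptions, not values from this lesson:

```shell
REM Set NLS_LANG to match the real client code page before connecting
set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252
sqlplus oe/oe@orcl
```

With this setting in place, Oracle knows the client data is WE8MSWIN1252 and can convert or validate it against the database character set instead of passing bytes through unchecked.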
The sizes of the character data types CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data. This is acceptable when the database character set uses a single-byte character encoding scheme because the number of characters is equivalent to the number of bytes. If the database character set uses a multi-byte character encoding scheme, however, there is no such correspondence: the number of bytes no longer equals the number of characters because a character can consist of one or more bytes. This situation can cause problems. During migration to a new character set, it is important to verify the column widths of existing CHAR and VARCHAR2 columns because they might need to be extended to accommodate characters that require multi-byte storage. During import, data can be truncated if conversion to the new character set causes it to expand beyond the existing column width.
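The distinction can be made concrete with two column definitions; the table names below are purely illustrative:

```sql
-- With the default BYTE length semantics, this column holds 20 bytes,
-- which may be far fewer than 20 characters in a multi-byte character
-- set such as AL32UTF8:
CREATE TABLE demo_byte (c VARCHAR2(20));        -- same as VARCHAR2(20 BYTE)

-- From Oracle9i onward, CHAR length semantics size the column in
-- characters, regardless of how many bytes each character occupies:
CREATE TABLE demo_char (c VARCHAR2(20 CHAR));
```

Character semantics sidestep the expansion problem for new tables, but existing byte-sized columns still need to be checked during migration.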
Before migrating your character set, you need to identify areas of possible database character set conversion problems and truncation of data. This step is called data scanning. Data scanning identifies the amount of effort required to migrate data into the new character encoding scheme prior to the change of the database character set. Examples of what a data scan finds include the number of schema objects whose column widths need to be expanded and the extent of the data that does not exist in the target character repertoire. This information will assist in determining the best approach for the conversion of the database character set.
What is the Character Set Scanner Utility?
The Character Set Scanner (csscan) provides an assessment of the feasibility and potential issues in migrating an Oracle database to a new database character set. The Scanner checks all character data in the database and tests for the effects and problems of changing the character set encoding. At the end of the scan, it generates a summary report of the database scan. This report provides estimates of the amount of work required to convert the database to a new character set.
Based on the information in the summary report, you will be able to decide on the most appropriate method to migrate the database's character set. These methods are:
Full Export and Import
ALTER DATABASE CHARACTER SET statement
ALTER DATABASE CHARACTER SET with selective Export and Import
If there are conversion exceptions reported by the Scanner, these problems must be fixed before using any of the above methods to do the conversion. This may involve modifying the problem data to eliminate those exceptions. In extreme cases, both the database and the application might need to be modified. Oracle recommends that you contact Oracle Consulting Services for assistance with database character set migration.
Conversion Tests on Character Data
The Character Set Scanner provides three modes of database scan:
Full Database Scan
User Table Scan
Single Table Scan
The Scanner reads the character data and tests for the following conditions on each data cell:
Do character codes of the data cells change when converted to the new character set?
Can the data cells be successfully converted to the new character set?
Will the post-conversion data fit into the current column size?
Initial Setup for the Scanner
To use the Scanner, you must first run the system script CSMINST.SQL on the database that you plan to scan. CSMINST.SQL needs to be run only once; it is not necessary to run it each time you scan the database. The script performs the following tasks to prepare the database for scanning:
Creates a user named CSMIG
Assigns the necessary privileges to CSMIG
Assigns the default tablespace to CSMIG
Connects as CSMIG
Creates the Scanner system tables under CSMIG
Open a DOS prompt and run the following commands:
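A typical way to run the script looks like the following sketch. The ORACLE_HOME path and the SYS password are assumptions for illustration; csminst.sql ships in the rdbms\admin directory of the Oracle home:

```shell
cd %ORACLE_HOME%\rdbms\admin
sqlplus /nolog
SQL> CONNECT sys/password AS SYSDBA
SQL> @csminst.sql
```

When the script completes, the CSMIG user and the Scanner system tables exist and the database is ready to be scanned.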
Create a Table and Populate with ASCII data
You will run the Scanner against a table that you create and populate with ASCII data. To set up the table, perform the following:
Execute the following commands to run the books_ascii.sql script:
The books_ascii.sql script contains the following statements:
/* Populate with ASCII data */
INSERT INTO books VALUES ('01','Dead Poets Society', 'N. H. Kleinbaum');
INSERT INTO books VALUES ('02','Hunt for Red October','Tom Clancy');
INSERT INTO books VALUES ('03','Professional Apache','Peter Wainwright');
INSERT INTO books VALUES ('04','New Joy of Cooking','Irma S. Rombauer');
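The CREATE TABLE statement at the top of the script is not reproduced above. A minimal definition consistent with these INSERT statements (and with the column names BOOK_NAME and AUTHOR referenced later in this lesson) might look like the following; the BOOK_ID column name and the exact widths are assumptions:

```sql
CREATE TABLE books (
  book_id   VARCHAR2(2),
  book_name VARCHAR2(20),
  author    VARCHAR2(16)
);
```

Note that the column widths are sized in bytes, which is what makes this table a useful test case for multi-byte expansion later in the lesson.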
Invoking the Scanner (US7ASCII to AL32UTF8 Migration - No Invalid Data Scenario)
The first Scanner test you will perform is a best-case scenario in which the source and destination character sets are fully compatible. It shows how to scan a user table to see the effects of migrating it to AL32UTF8. You will override your database character set with the US7ASCII character set to force the desired results.
There are three methods that can be used to invoke the Scanner:
Using a parameter file
Specifying all parameters on the command line
Interactively, supplying parameters in response to the Scanner's prompts
The csscan executable should be run on the server where the database is installed. The Scanner provides online help when you enter csscan help=y at the OS command line. If the fromchar parameter is not specified, it defaults to the database character set.
At the OS prompt, enter the following command to perform a single table scan:
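A single-table csscan invocation might look like the following sketch; the OE password and connect string are assumptions for illustration:

```shell
csscan oe/oe@orcl table=OE.BOOKS fromchar=US7ASCII tochar=AL32UTF8 array=4096 process=1
```

Here fromchar overrides the database character set as the assumed source, tochar names the target character set, and array and process control the fetch buffer size and the number of scanning processes.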
The Scanner generates two reports per scan in the directory from which it is executed:
Database Scan Summary Report (scan.txt)
Individual Exception Report (scan.err)
Execute the following command to see the list of files:
Open the Database Scan Summary Report (scan.txt) and scroll down to the Scan Summary. Notice that it tells you that "All character type application data remain the same in the new character set".
When all data remains the same in the new character set, it means that the data encoding of the original character set is identical to that of the target character set. This is called changeless data. Often, when all of the data is changeless, the ALTER DATABASE CHARACTER SET statement can be used instead of a full export/import, although restrictions apply. See the Globalization Guide for more details.
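When the scan confirms that every cell is changeless, the documented sequence for changing the character set is short. The sketch below assumes a SYSDBA session and AL32UTF8 as the target; review the restrictions in the Globalization Guide before attempting this on a real database:

```sql
-- Run as SYSDBA; the database must be opened in RESTRICTED mode with
-- background job processing disabled before the character set is changed.
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER SYSTEM ENABLE RESTRICTED SESSION;
ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;
ALTER DATABASE OPEN;
ALTER DATABASE CHARACTER SET AL32UTF8;
SHUTDOWN IMMEDIATE;
STARTUP;
```

Remember that this statement only relabels the data; it does not convert it, which is why it is safe only when the Scanner reports all data as changeless.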
Populating BOOK Table with Additional Western European Data
Now you can insert conflicting character set data into your table so that it can be detected using the Scanner. To do this, perform the following:
Execute the following commands to run the books_we.sql script to add rows of Western European data to the original ASCII data.
The books_we.sql script contains the following statements:
/* Populate with Western European data */
INSERT INTO books VALUES ('05','Chalés Encantados','Maríann Rupestre');
INSERT INTO books VALUES ('06','Matière à rire','Raymond Devos');
INSERT INTO books VALUES ('07','God natt min älskade','Inger Frimansson');
INSERT INTO books VALUES ('08','Susaya Sis Söktü','Elçiene Cavansir');
Invoking the Scanner (US7ASCII to AL32UTF8 Migration - Some Invalid Data Scenario)
At this point you will again override your database character set with the US7ASCII character set to force specific results. This will help illustrate some common problems occurring during conversion. Perform the following:
Once again, enter the following command at the OS prompt:
Execute the following command to see the list of files:
Open the Database Scan Summary Report (scan.txt) in Notepad and scroll down to the Scan Summary. This time, it indicates that "Some character type application data are not convertible to the new character set".
Open the Database Scan Exception Report (scan.err) that is generated by the Scanner and scroll down to the Application data individual exceptions section.
The results now indicate that you have data that cannot be properly converted, a sign that not all of the data is ASCII. You must examine the flagged data and decide whether it belongs in the database. Characters such as accented European characters are often unanticipated yet may be necessary data. If the data is invalid, you can manually eliminate it and rerun the Scanner to confirm success.
Invoking the Scanner using a Western European Character Set
Now you want to preserve the data as is, including the accented European characters. When you run the Scanner this time, you will override the database character set with WE8MSWIN1252 so that all of the existing data is treated as valid Western European characters. Perform the following:
From a sqlplus session logged on as OE, execute the following commands:
column parameter format a30
column value format a30
select * from nls_database_parameters;
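The scan itself can then be rerun with the source character set overridden. Assuming the same illustrative connect string used earlier in this lesson, the invocation might look like:

```shell
csscan oe/oe@orcl table=OE.BOOKS fromchar=WE8MSWIN1252 tochar=AL32UTF8 array=4096 process=1
```

Because every byte in the table is now interpreted as a valid WE8MSWIN1252 character, the scan should report no inconvertible data, leaving only the column-expansion exceptions discussed next.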
Open the newly generated Database Scan Exception Report (scan.err) and scroll down to the Application data individual exceptions section.
Conversion of Data
The exception report indicates that you must also extend two columns, BOOK_NAME and AUTHOR, to accommodate their expansion when you convert to UTF-8. An ASCII character converts to 1 byte in UTF-8, but an accented European character takes 2 bytes, and Asian characters generally take 3 bytes. The exception report indicates that the column sizes must be expanded to at least VARCHAR2(21) for BOOK_NAME and VARCHAR2(17) for AUTHOR to accommodate the largest expansion required by the current data. Existing PL/SQL code should also be reviewed to ensure that byte-based SQL functions such as LENGTHB, SUBSTRB, and INSTRB, as well as PL/SQL CHAR and VARCHAR2 declarations, are still valid. Once the column sizes have been adjusted, you are ready to do a full export and import to properly convert all data to UTF-8.
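The widening itself is a pair of ALTER TABLE statements. The widths below come from the exception report; BOOKS is the table created earlier in this lesson:

```sql
-- Widen the byte-sized columns to hold the post-conversion UTF-8 data:
ALTER TABLE books MODIFY (book_name VARCHAR2(21));
ALTER TABLE books MODIFY (author    VARCHAR2(17));

-- On Oracle9i and later, an alternative is character length semantics,
-- so the declared width no longer depends on the encoding:
-- ALTER TABLE books MODIFY (book_name VARCHAR2(20 CHAR));
```

Widening a VARCHAR2 column is always safe because it never touches existing data; only then should the export/import conversion be performed.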
Lesson Cleanup (Optional)
To delete the BOOKS table that you created in this lesson, execute the following:
drop table books;
In this module, you should have learned how to:
Use the Oracle Character Set Scanner Utility to perform character set conversion tests
Identify issues involved in migrating a database to a new database character set
Choose the appropriate database character set migration method