Technology GLOBALIZATION
Unicode Enables Globalization
By Jonathan Gennick and Peter Linsley
Unicode encodes everything and represents everyone.
Letters, numerals, and punctuation (all characters, in fact) are represented in a computer as numbers, and there are dozens of different schemes for encoding characters.
As computer applications became more global, driving the need to support characters from languages the world over, it became clear to many that a single, unified encoding scheme was needed. Enter the Unicode Standard. Unicode is the default encoding for XML, is required by LDAP, is the underlying character set used by Java and Windows XP, and more. Years ago, you needed to understand ASCII to be a successful programmer. Today you need to understand Unicode.
What Is Unicode?
Unicode is a single encoding scheme covering all known characters currently in use worldwide. Each character is assigned a distinct numeric value called a code point. The standard organizes characters into blocks of related characters. For example, Unicode uses the block of code points from 0x0400 through 0x04FF to represent characters of the Cyrillic script used by Russian, Ukrainian, and related alphabets. (The notation 0x indicates that what follows is a hexadecimal number. Unicode code points are commonly expressed in hexadecimal.) A script is a collection of symbols used in one or more related writing systems or languages. Thus, the Cyrillic script is a superset of all the characters used by the different Cyrillic alphabets. Unicode is constantly evolving to increase the number of scripts supported, and to encompass new characters such as the Euro sign (€).
Unicode defines the term character to mean the smallest component of a written language that conveys meaning. Letters such as A, B, and C are obvious examples of characters, as are the digits 1, 2, and 3. However, the term "character" can be somewhat ambiguous, and the more you work with Unicode, the more you'll find yourself speaking in terms of code points. Consider the character Ë, which is a capital E with a diaeresis. You see it as one character, and indeed, it corresponds to the Unicode code point 0x00CB. However, Unicode also lets you treat the two elements of Ë separately: you can encode the same character as 0x0045, the code point for the capital letter E, followed by 0x0308, the code point for the combining diaeresis (¨). The two code points combine to form one character. Unicode defines a large number of combining diacritical marks and combining characters that you can use in this manner.
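The difference between a character and its code points is easy to inspect outside the database. The following is a minimal Python sketch, not Oracle code; the helper name describe is our own invention. It lists the code points behind the two ways of writing Ë described above.

```python
import unicodedata

precomposed = "\u00CB"   # one code point: capital E with diaeresis
combined = "E\u0308"     # two code points: capital E + combining diaeresis

def describe(text):
    """Return (hex code point, Unicode name) for each code point in text.

    'describe' is a hypothetical helper written for this illustration.
    """
    return [(f"0x{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(precomposed))
# [('0x00CB', 'LATIN CAPITAL LETTER E WITH DIAERESIS')]
print(describe(combined))
# [('0x0045', 'LATIN CAPITAL LETTER E'), ('0x0308', 'COMBINING DIAERESIS')]
```

Both strings display as the same character, yet they hold one and two code points, respectively.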
How Characters Are Encoded
With ASCII text, the numeric value assigned to a character is also used to represent that character in memory, or on disk. The ASCII code point for the letter A is 0x41, end of story. Unicode distinguishes between a code point's value and how that value is represented in memory. The letter A, for example, could end up as 0x41, or as 0x0041, and that's just a simple example. How a code point is represented in memory is defined by Unicode's encoding forms.
Unicode defines three encoding forms: UTF-8, UTF-16, and UTF-32. The two discussed here are UTF-8 and UTF-16. UTF-8 encoding is backward-compatible with ASCII. Any valid ASCII encoding is automatically valid UTF-8 encoding, making it fairly easy to convert existing ASCII databases forward into Unicode databases. UTF-8 uses a code point value to generate a bit pattern that is distributed over one to four bytes.
Each byte in the UTF-8 encoding of a given character is known as a code unit. Code points for characters in the standard ASCII character set are encoded using single code units in the range 0x00 through 0x7F. Most non-Asian scripts are represented using one or two code units (or bytes) per character. For example, the UTF-8 representation of the Ukrainian letter ghe (Ґ), code point 0x0490, is 0xD2 0x90. Asian scripts typically require three code units per character. In Unicode 3.1, a large number of new, supplementary characters were defined, each requiring four UTF-8 code units.
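To see how a code point value is distributed over one to four code units, consider the following Python sketch of the standard UTF-8 bit layout. The function name utf8_code_units is hypothetical; in practice you would let a library do this, as Python's built-in str.encode("utf-8") does.

```python
def utf8_code_units(code_point):
    """Encode one code point as a list of UTF-8 code units (bytes).

    Illustrative helper only; the name is an assumption of this sketch.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # This range is reserved for UTF-16 and has no UTF-8 encoding.
        raise ValueError("reserved code point")
    if code_point <= 0x7F:        # one code unit, identical to ASCII
        return [code_point]
    if code_point <= 0x7FF:       # two code units: 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point <= 0xFFFF:      # three code units: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]

print([hex(unit) for unit in utf8_code_units(0x0041)])  # ['0x41']
print([hex(unit) for unit in utf8_code_units(0x0490)])  # ['0xd2', '0x90']
```

The second call reproduces the two code units given above for ghe.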
The second encoding form of importance is UTF-16, which uses a two-byte code unit. For the most part, UTF-16 is as simple as ASCII in that the code unit values are the same as the code point values for any code point in the range 0x0000 through 0xFFFF. For characters in that range, UTF-16 is a fixed-length, two-byte encoding. Revisiting our earlier example involving ghe, you represent it in UTF-16 using the code unit 0x0490, which directly corresponds to the code point for that character.
Choosing Between UTF-8 and UTF-16
Oracle supports two methods for Unicode data storage. The first is known as the Unicode Database solution and involves creating a Unicode-based database using UTF-8 as the encoding not only for CHAR and VARCHAR2 character datatypes but also for all SQL names and literals. (CLOB is a special case here, as Oracle Database 10g always uses UTF-16 for CLOBs in a Unicode Database.) In order to implement the Unicode Database solution, configure your database character set as AL32UTF8, the Oracle name for UTF-8.
The alternative Unicode approach is the Unicode Datatype solution, in which UTF-16 data is stored in the NCHAR, NVARCHAR2, and NCLOB Unicode datatypes. This is an ideal solution when you want Unicode support at the column level in a non-Unicode database. To implement this solution, set your national character set to AL16UTF16, the Oracle name for UTF-16. In fact, beginning with Oracle9i Database, using AL16UTF16 as the national character set is the default behavior.
There are several factors involved in deciding whether to use UTF-8 or UTF-16, and Oracle even enables you to apply both solutions simultaneously. Your application may be written to work only in terms of CHAR and VARCHAR2, which might drive you to opt for a Unicode Database because the changes you would then need to make at the application level are minimal, if any. If this is the case, then your encoding choice is limited to UTF-8, because UTF-16 is valid only in the Unicode datatypes. On the other hand, Windows or Java, which both use UTF-16 internally, might push you toward a UTF-16 solution.
Storage considerations might make you lean toward one encoding form over the other. Generally speaking, UTF-8 uses less space than UTF-16 when the majority of data is in Western European characters, because in UTF-8 such data requires only one, or at most two, bytes per character. The reverse is true when the majority of your data is in Asian characters, because UTF-8 requires at least three bytes per Asian character whereas UTF-16 encodes most Asian characters using only two bytes. Performance can also be a consideration, due to the fixed-width property of UTF-16, under which the performance of string manipulation functions often surpasses that under UTF-8.
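You can get a rough feel for the storage trade-off by measuring sample data before committing to an encoding. The Python sketch below compares byte counts only; the helper name encoded_sizes and the sample strings are our own, and actual database storage involves additional overhead that this does not model.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text.

    Hypothetical helper; "utf-16-be" is used so that no byte order mark
    is included in the count.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

print(encoded_sizes("Oracle"))   # (6, 12): ASCII letters, one byte each in UTF-8
print(encoded_sizes("\u00CB"))   # (2, 2): accented Western European letter
print(encoded_sizes("日本語"))    # (9, 6): three bytes each in UTF-8, two in UTF-16
```

Running the same measurement over a representative extract of your own data gives a better basis for the decision than any rule of thumb.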
UTF-8 is the more widely used of the two, primarily because internet protocols are geared toward sending byte-oriented data over the network. UTF-8 is also the only solution that provides Unicode support across the entire database, allowing you to define metadata such as table names and user names in Unicode.
Surrogates
The first few versions of the Unicode standard supported 65,536 distinct code points in the range 0x0000 through 0xFFFF. This seemingly large number of code points proved inadequate, and Unicode 3.0 sacrificed 2,048 of them to provide over one million (1,024*1,024) new, supplementary code points, beginning where the old range left off at 0x10000 and continuing through 0x10FFFF. Characters corresponding to these new code points are called supplementary characters.
For example, 0x1D11E represents the musical symbol G (treble) clef: 𝄞
UTF-8 is a variable-length encoding and simply encodes these new code points using four bytes, one more than the previous maximum of three. However, such large code point values can't be represented in the two-byte code units used by UTF-16. Instead, UTF-16 encodes supplementary code points using a combination of two code units called a surrogate pair. The first such code unit always falls in the range 0xD800 through 0xDBFF and is called a high surrogate. The other code unit is the low surrogate and falls in the range 0xDC00 through 0xDFFF. These well-defined ranges prevent any ambiguity as to whether a given code unit is part of a surrogate pair or stands alone.
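The arithmetic behind a surrogate pair is straightforward: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The following Python sketch shows the calculation defined by the Unicode standard; the function names are our own.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1D11E)   # the treble clef
print(hex(high), hex(low))               # 0xd834 0xdd1e
print(hex(from_surrogate_pair(high, low)))  # 0x1d11e
```

Because the high and low ranges do not overlap each other or any other assigned code point, a program scanning UTF-16 data can always tell which half of a pair it is looking at.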
String Processing
To make it easy to allocate proper storage for Unicode values, Oracle9i Database introduced character semantics. You can now use a declaration such as VARCHAR2(3 CHAR), and Oracle will set aside the correct number of bytes to accommodate three characters in the underlying character set. In the case of AL32UTF8, Oracle will allocate 12 bytes, because the maximum length of a UTF-8 character encoding is four bytes (3 characters * 4 bytes/character = 12 bytes). On the other hand, if you're using AL16UTF16, in which case your declaration would be NVARCHAR2(3), Oracle allocates just six bytes (3 characters * 2 bytes/character = 6 bytes). One difference worth noting is that for UTF-8, a declaration using character semantics allows enough room for supplementary characters, whereas for UTF-16 that is not the case. A declaration such as NVARCHAR2(3) provides room for three UTF-16 code units, but a single supplementary character may consume two of those code units.
To help you create and use strings consisting of code points that you aren't able to type or view from your client software, Oracle Database provides the UNISTR and ASCIISTR functions. UNISTR enables you to embed arbitrary code points into a string. For example, UNISTR('\0490\0435\043D\0438\043A') returns Jonathan's surname in its original Ukrainian: Ґеник
You can reverse the process via a call to ASCIISTR, which converts all non-ASCII characters in a string to escape sequences in the form \xxxx.
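If you want to experiment with these escape sequences away from the database, the round trip can be approximated in a few lines of Python. This is a simplified emulation under our own assumptions, not Oracle's implementation: the functions unistr and asciistr below are hypothetical, they treat each \xxxx escape as one UTF-16 code unit, and they ignore details such as escaping of the backslash itself.

```python
import re

_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{4})")

def unistr(text):
    """Rough stand-in for UNISTR: expand \\xxxx escapes into characters."""
    pieces = []
    pos = 0
    for match in _ESCAPE.finditer(text):
        pieces.append(text[pos:match.start()].encode("utf-16-be"))
        pieces.append(bytes.fromhex(match.group(1)))  # one UTF-16 code unit
        pos = match.end()
    pieces.append(text[pos:].encode("utf-16-be"))
    # Decoding as UTF-16 joins adjacent surrogate escapes into one character.
    return b"".join(pieces).decode("utf-16-be")

def asciistr(text):
    """Rough stand-in for ASCIISTR: escape every non-ASCII character."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            data = ch.encode("utf-16-be")
            for i in range(0, len(data), 2):
                out.append("\\%04X" % int.from_bytes(data[i:i + 2], "big"))
    return "".join(out)

surname = unistr(r"\0490\0435\043D\0438\043A")
print(surname)             # Ґеник
print(asciistr(surname))   # \0490\0435\043D\0438\043A
```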
When it comes to string functions such as LENGTH, SUBSTR, and INSTR, Oracle Database provides you with choices. When you use UTF-8 encoding, such string functions operate in terms of code points, which correspond to characters or combining marks. For example, LENGTH will always return the number of code points in a UTF-8 encoded string. The rules change slightly when you use UTF-16, in which case LENGTH, SUBSTR, and INSTR all operate in terms of code units, making it possible to treat UTF-16 as a fixed-width character set. This is a subtle but important difference to understand. LENGTH applied to a UTF-16 encoded string will return not the number of code points in the string, but rather the number of code units. This distinction might even factor into your choice of encoding. For example, given an application written with the expectation that string functions count characters, it may be less costly to use UTF-8, even for Asian characters, than to modify the application to handle code units.
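The gap between the two ways of counting only shows up when a string contains supplementary characters. The Python sketch below illustrates the distinction; it does not run against Oracle, and the two function names are our own. Python strings happen to count code points, which parallels the UTF-8 behavior described above, and the UTF-16 code unit count is derived by encoding the string.

```python
def length_in_code_points(text):
    """Number of code points (Python's len on a str counts these)."""
    return len(text)

def length_in_utf16_code_units(text):
    """Number of two-byte UTF-16 code units needed to hold text."""
    return len(text.encode("utf-16-be")) // 2

sample = "G clef: \U0001D11E"   # ends with one supplementary character

print(length_in_code_points(sample))        # 9
print(length_in_utf16_code_units(sample))   # 10
```

Note that neither count equals the number of characters a user perceives when combining marks are present: E followed by a combining diaeresis counts as two under both measures.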
Finally, you have the COMPOSE and DECOMPOSE functions. One use for these functions lies in comparing two Unicode strings. The functions allow you to decide whether to treat precomposed characters and their decomposed equivalents as if they are the same or different. For example, a comparison such as DECOMPOSE(UNISTR('\00CB')) = DECOMPOSE(UNISTR('E\0308')) will be considered true, because DECOMPOSE transforms 0x00CB into an E followed by the code point 0x0308. The COMPOSE function enables you to approach this problem from the opposite direction: COMPOSE(UNISTR('\00CB')) = COMPOSE(UNISTR('E\0308')) also is true.
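The same comparison can be tried in Python using the Unicode normalization forms NFD (canonical decomposition) and NFC (canonical composition). We are assuming here that these forms are a reasonable analogue of DECOMPOSE and COMPOSE for this example; the sketch does not claim to reproduce Oracle's behavior in every case.

```python
import unicodedata

precomposed = "\u00CB"   # single code point
decomposed = "E\u0308"   # E followed by combining diaeresis

# Without normalization the strings differ, code point by code point.
raw_equal = precomposed == decomposed

# Decompose both sides, or compose both sides, and they compare equal.
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfd_equal, nfc_equal)   # False True True
```

Whichever direction you choose, apply it to both operands; comparing a composed string with a decomposed one is what causes the mismatch in the first place.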
Some scripts supported by Unicode are read from right to left. Arabic is one example. The Unicode standard specifies that characters always be stored in logical order, the order in which you read them, regardless of whether display is left-to-right or right-to-left. This is something to be aware of if you find yourself building a Unicode string a character at a time. Don't attempt to impose display order on the characters in your strings.
Collation
Every language has its own means of sorting data, and Unicode allows you to mix characters from many languages in one string. How, then, do you sort such strings? The collation standard ISO-14651 attempts to address the way in which multilingual data should be sorted. Oracle supports this standard in the form of sort orders with names ending in _M, for multilingual. The most generic of these sorts is GENERIC_M, which will sort data in the order described by ISO-14651. Within GENERIC_M, scripts are sorted roughly in West-to-East order, starting with those scripts used by Western European languages and ending with those used by Asian languages. Supplementary characters are sorted within their associated scripts.
GENERIC_M is acceptable for most languages, but some do require special collating rules of their own. To meet this need, Oracle also provides several variations on GENERIC_M. For example, the sort named SCHINESE_STROKE_M sorts primarily by the number of strokes in characters of the Simplified Chinese script. When you apply such a language-specific sort, characters that are part of that language are always sorted first; all other characters follow in the order specified by GENERIC_M.
Character Set Conversion
There are many scenarios in which data needs to be represented in a different character encoding. The process of moving from one encoding to another is referred to as character set conversion and is most often required when a client operating in one character set communicates with a database that stores data in another. For example, you might have a client using Windows Code Page 1251, an encoding that supports Cyrillic. That client might access an Oracle database containing Cyrillic data encoded in Unicode UTF-8. When the client queries that database, any data returned to the client is automatically translated from the database's UTF-8 encoding into the Windows Code Page 1251 encoding used by the client. Recall our earlier mention of the Cyrillic ghe (Ґ), which is represented in UTF-8 as the two bytes 0xD2 and 0x90. Those two bytes would be translated to 0xA5, the Code Page 1251 representation of ghe. Likewise, the reverse translation occurs for data sent from the client to the database. Oracle Database supports such character set conversion among all supported character sets. In cases where the data you are selecting contains a Unicode character that is not available in your client character set, the unavailable character will be changed to the Oracle-defined replacement character, which is often a question mark (?).
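The byte-level effect of such a conversion can be reproduced with Python's built-in codecs, as the sketch below shows. It illustrates the conversion itself, not Oracle's client/server machinery. The function name convert is our own, ISO-8859-1 ("latin-1") stands in for a hypothetical client character set that lacks Cyrillic, and Python's "replace" error handler happens to substitute a question mark as well.

```python
def convert(data, source, target):
    """Re-encode bytes from one character set to another.

    Characters missing from the target character set become '?'.
    Hypothetical helper written for this illustration.
    """
    return data.decode(source).encode(target, errors="replace")

ghe_utf8 = bytes([0xD2, 0x90])   # UTF-8 encoding of ghe, code point 0x0490

print(convert(ghe_utf8, "utf-8", "cp1251"))         # b'\xa5'
print(convert(b"\xa5", "cp1251", "utf-8"))          # b'\xd2\x90'
print(convert(ghe_utf8, "utf-8", "latin-1"))        # b'?'
```

The last line shows why the replacement is lossy: once ghe has become a question mark, converting back to UTF-8 cannot recover the original character.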
Character set conversion also extends to converting between Unicode encoding forms. For example, if you store Unicode data in CHAR columns using UTF-8 only to later discover that you are storing many Asian characters and that using UTF-16 would save you a great deal of disk space, help is available. Oracle provides a quick and painless migration solution whereby you can easily change the encoding from UTF-8 to UTF-16 by using the ALTER TABLE MODIFY statement to change the column's datatype from CHAR to NCHAR. The reverse conversion is also possible. The conversion from one Unicode encoding form to another is seamlessly handled.
Unicode in Oracle Database 10g
Oracle Database 10g has full support for the recently released Unicode 3.2 standard. In addition, the performance of Unicode string processing has improved over previous releases, and Oracle has introduced automatic language detection to the character set scanner so you can report on language usage within your database.
Oracle Database 10g also adds two important features to aid application development in a Unicode database.
The first is the addition of Unicode-enabled regular expression pattern matching, which allows you to tap into character properties and perform complex linguistic searches, including case/accent-insensitive searches.
The second feature is an enhanced collation algorithm that provides case- and accent-insensitive sorting that can be obtained by appending '_CI' or '_AI' to the end of an NLS_SORT name. String functions and regular expression functions will honor the rules of the current NLS_SORT. With a case-insensitive sort, case variants of a particular character, as defined by the Unicode standard, are all considered equivalent. Likewise, an accent-insensitive sort disregards accents on base characters.
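To get a feel for what case- and accent-insensitive comparison means, here is a small Python approximation. It is not Oracle's collation algorithm: the functions ci_key and ai_key are hypothetical, case is folded with str.casefold, accents are dropped by decomposing and discarding combining marks, and we assume that an accent-insensitive comparison also ignores case.

```python
import unicodedata

def strip_accents(text):
    """Decompose text and drop combining marks such as the diaeresis."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ci_key(text):
    """Comparison key that ignores case but keeps accents."""
    return text.casefold()

def ai_key(text):
    """Comparison key that ignores both accents and case (an assumption)."""
    return strip_accents(text).casefold()

print(ci_key("RÉSUMÉ") == ci_key("résumé"))   # True: case variants match
print(ci_key("résumé") == ci_key("resume"))   # False: accents still differ
print(ai_key("résumé") == ai_key("RESUME"))   # True: accents disregarded
```

Keys like these can also be passed to a sort, for example sorted(names, key=ai_key), though a real linguistic sort involves far more than folding case and accents.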
Jonathan Gennick (Jonathan@Gennick.com) is an experienced Oracle DBA and an Oracle Certified Professional living in Michigan's Upper Peninsula. He runs the Oracle-article e-mail list, which you can learn about at http://gennick.com. Peter Linsley (peter.linsley@oracle.com) is a principal member of technical staff in the Server Globalization Technology group at Oracle. Gennick and Linsley recently collaborated on the Oracle Regular Expression Pocket Reference (O'Reilly & Associates, 2003).