Technology SQL
Globalize with Character Semantics
By Jonathan Gennick
Character semantics help enable globalization and enhance code portability.
Globalization is a big word
that refers to the practice of developing applications that work seamlessly across languages and cultures. The need for such applications is growing, driven
in part by ever-larger transnational corporations and in part by a global internet that makes it possible for a user in, say, Iqaluit to run an application against a database that might be hosted in Dnipropetrovs'k.
One of the first issues you run up against when building a global application is the need to represent characters from more than one language. Ukrainians, for example, might prefer to see Dnipropetrovs'k in its Cyrillic form. When you develop an application to support multiple languages, it's increasingly likely that you'll end up using a multibyte character set. Unicode UTF-8, or one of the other Unicode character sets, is a likely choice.
Working with multibyte character sets can present some special challenges, especially if, like me, you've grown up using single-byte character sets such as US7ASCII. Look no further than Listing 1 for an example
of what you might run into when you take an application developed under the implicit assumption of a single-byte character set and port it to a Unicode UTF-8 environment.
Listing 1 shows two different lengths for an ostensibly fixed-length character string. If you're used to thinking of CHAR variables (or database columns) as fixed-length strings, you'll need to readjust your world view. Why? Because there's a disconnect between characters and bytes, which I'll be exploring throughout this article. I'll also introduce Oracle9i's new support for character semantics. Seemingly simple on the surface, character semantics eliminate this disconnect and can make it tremendously easy to convert existing applications and database schemas to use multibyte character sets.
Bytes Versus Characters
Is "a" a byte or a character? Is there a difference? In the western world, with our simple alphabets and single-byte character sets, we've long been accustomed to thinking of bytes and characters as being one and the same. We often don't distinguish between the two as we should. We look at a declaration such as CHAR(3), and we see that as a three-character string. But there is a distinction to be made: a byte is a unit of storage, whereas a character is a symbol used in a language or system of writing. The declaration of CHAR(3) shown in Listing 1 resulted in a 3-byte string, which may or may not hold three characters.
Working with multibyte character sets when your declarations and string functions all operate in terms of bytes can be a bit of a challenge. Listing 1, for example, was generated on a database that used Unicode UTF-8 (AL32UTF8) as the database character set. The variable x was declared as CHAR(3), which resulted in an allocation of three bytes. When the single-byte value a was assigned to x, the other two bytes were padded with spaces (a space is also a single byte), and thus the resulting string was three characters long. When the two-byte character was assigned to x, only one byte remained to be padded with a single space. The result? A two-character-long string that happened to be using three bytes of storage.
Character Semantics
Recognizing the confusion surrounding bytes versus characters, Oracle formalized the distinction between byte and character semantics when it released Oracle9i Database. Rather than being forced to declare strings in terms of bytes and then work with them in terms of characters, you can choose whether to declare strings in terms of characters or bytes. You can write declarations such as the following:
x CHAR(3 BYTE)
x CHAR(3 CHAR)
The first of these declarations explicitly declares x as a 3-byte CHAR variable (or it could be a database column). The second explicitly declares x as a three-character CHAR variable. How much storage will be allocated for those three characters? That depends on the underlying character set. The general rule is that Oracle9i Database will look at the maximum number of bytes per character used by the underlying character set and will multiply that value by the number of characters specified in your declaration. In the case of Unicode UTF-8, three bytes will be allocated for each character, so a declaration of CHAR(3 CHAR) results in an allocation of nine bytes.
Declaring a variable or column in terms of bytes is referred to as using byte semantics. Declaring a variable or column in terms of characters is referred to as using character semantics.
The BYTE and CHAR qualifiers can also be applied to VARCHAR2 declarations:
x VARCHAR2(3 BYTE)
x VARCHAR2(3 CHAR)
However, you cannot specify the
BYTE and CHAR qualifiers for NCHAR and
NVARCHAR2 declarations. Neither of the following declarations is valid:
x NCHAR(3 BYTE)
x NVARCHAR2(3 CHAR)
In Oracle9i Database, declarations of NCHAR and NVARCHAR2 variables and columns are always in terms of characters. Thus, there's no need for the BYTE and CHAR qualifiers in such declarations.
Why Bother?
If a declaration of CHAR(3 CHAR) results in an allocation of nine bytes when using Unicode UTF-8 as the database character set, why not use CHAR(9 BYTE) or simply CHAR(9)? Are there reasons to bother with character semantics and the CHAR qualifier? The answer is yes; I think there are some compelling reasons to use character semantics. In fact, I'd go so far as to argue that you should use character semantics as your default allocation method.
When working with character strings, you do almost everything in terms of characters. In the western world, when you declare a string as CHAR(3), you usually do that because you want to put three characters into it. When you compute the length of a string, it's usually to find the number of characters in it. When you take the substring of a string, you want a substring of complete characters; you don't want only part of a multibyte character. When you search a string to see whether it contains a substring, you are typically looking for a substring of characters, not of random bytes. Given all this, it makes sense to extend the concept of working with characters all the way back to the beginning, to the initial declaration of the string.
Using byte semantics, when you move an application from US7ASCII to UTF-8 in order to support characters from different languages, your CHAR(3) strings may or may not hold three characters, depending on which three characters you happen to be storing at any given time. Under those circumstances, your application will likely fail in short order. You would then need to go through all your PL/SQL source code and all your CREATE TABLE statements and adjust all of your character string declarations to accommodate the new character set. That's not only a huge amount of unnecessary work, it's also a huge opportunity to make mistakes, thereby introducing bugs into what used to be a working application.
There's a better way: declare character variables using character semantics rather than byte semantics. A declaration of CHAR(3 CHAR) will support a maximum of three characters regardless of which character set you end up using. A CHAR(3 CHAR) variable will hold three characters when you use US7ASCII, and it will still hold exactly three characters when you convert your application to Unicode UTF-8. Thus, using character semantics greatly increases the portability of your applications and your database schemas.
The issue of exact length is an interesting one. A declaration of CHAR(3 CHAR) will always support a maximum of three characters, regardless of the number of bytes used to represent those characters. Use a declaration of CHAR(9 BYTE) to allow for up to three Unicode UTF-8 characters, and you'll find that the string will often hold more than three characters. Listing 2 illustrates this. Whether this is a problem or not depends on your application. It's more of a problem with CHAR variables than with VARCHAR2 variables, because CHAR variables are always padded with spaces to fill out their maximum length. Thus, when mixing byte semantics with multibyte character sets, CHAR variables suddenly appear to be variable-length. Using character semantics eliminates all that confusion. Listing 3 uses character semantics to declare x as a fixed, three-character string. The underlying character set is still UTF-8, but you really don't need to be concerned with knowing that. Using character semantics guarantees you a three-character string regardless of character set, so the string is consistently padded to exactly three characters, not bytes, in length.
Length Semantics and String Functions
Oracle supports several built-in functions that operate on characters within a string. These functions are:
- INSTR. Searches a string for a specified sequence of characters.
- LENGTH. Returns the number of characters in a string.
- SUBSTR. Returns a specified portion, or substring, of a string.
Your choice of character or byte semantics affects only the declaration of a PL/SQL variable or a database column. Regardless of how you declare a variable or a column, in Oracle9i the INSTR, LENGTH, and SUBSTR functions always operate in terms of characters. You've seen this already in the operation of the LENGTH function. In Listings 1, 2 and 3, LENGTH always returned the number of characters in the stringnever the number of bytes. Listing 4 provides an even more comprehensive example of how these functions operate. UTF-8 is still the character set. The string ãa is two characters long but consumes three bytes because ã is a two-byte character in UTF-8. The INSTR function finds the letter a in the second character position, even though the bytecode for that letter occupies the third byte. As before, LENGTH returns the number of characters in the string (2). SUBSTR(x,1,1) actually returns two bytes of the string, because the first character happens to be a two-byte character.
It's good that INSTR, LENGTH, and SUBSTR operate in terms of characters. You can take an application using these functions and install that application on a database using a different character setit doesn't matter what that character set isand all your application's calls to these functions will still provide sane, sensible, and correct results.
There are times, however, when you do actually need to work with strings as sequences of bytes rather than characters. For this reason, Oracle provides the functions INSTRB, LENGTHB, and SUBSTRB, which are shown in Listing 5. Note the suffix B on all these function names, which indicates that these functions operate in terms of bytes, not characters. Other than this difference, the three functions are identical to their non-B counterparts.
Character Semantics Default
If all Oracle did was add the ability to use CHAR(3 CHAR) rather than CHAR(3), that would be good, but you'd still have your work cut out for you when converting existing applications and schemas to UTF-8 or some other multibyte character set. Instead of being faced with the task of expanding CHAR(3) to CHAR(9), you'd be faced with the equally dismal task of changing all CHAR(3)s to CHAR(3 CHAR)s. Fortunately, Oracle9i Database provides the ability to set the default character semantics on either the session or instance level using the NLS_LENGTH_SEMANTICS parameter.
To convert an existing schema and its associated data from byte semantics and a single-byte character set to character semantics and a multibyte character set, such as UTF-8, you need only follow these steps: [The following steps have been corrected since the magazine was printed.]
- Export the schema.
- Issue an ALTER SYSTEM SET NLS_LENGTH_SEMANTICS=CHAR SCOPE=BOTH command on the target database.
- Stop and restart the instance so that the parameter change takes effect.
- Drop the original schema.
- Recreate the original schema and its tables (you can use import's show=Y option to get the CREATE TABLE statements). Columns in the recreated tables will use character semantics, because that's now the default.
- Import the schema into the target database using the IGNORE=Y import option.
Because step 2 changed the default length semantics from byte semantics to character semantics, any CHAR or VARCHAR2