As Published In

Oracle Magazine
March/April 2003
Technology SQL

Globalize with Character Semantics

By Jonathan Gennick

Character semantics help enable globalization and enhance code portability.

Globalization is a big word that refers to the practice of developing applications that work seamlessly across languages and cultures. The need for such applications is growing, driven in part by ever-larger transnational corporations and in part by a global internet that makes it possible for a user in, say, Iqaluit to run an application against a database that might be hosted in Dnipropetrovs'k.

One of the first issues you run up against when building a global application is the need to represent characters from more than one language. Ukrainians, for example, might prefer to see Dnipropetrovs'k in its Cyrillic form. When you develop an application to support multiple languages, it's increasingly likely that you'll end up using a multibyte character set. Unicode UTF-8, or one of the other Unicode character sets, is a likely choice.

Working with multibyte character sets can present some special challenges, especially if, like me, you've grown up using single-byte character sets such as US7ASCII. Look no further than Listing 1 for an example of what you might run into when you take an application developed under the implicit assumption of a single-byte character set and port it to a Unicode UTF-8 environment.

Listing 1 shows two different lengths for an ostensibly fixed-length character string. If you're used to thinking of CHAR variables (or database columns) as fixed-length strings, you'll need to readjust your world view. Why? Because there's a disconnect between characters and bytes, which I'll be exploring throughout this article. I'll also introduce Oracle9i's new support for character semantics. Seemingly simple on the surface, character semantics eliminate this disconnect and can make it tremendously easy to convert existing applications and database schemas to use multibyte character sets.

Bytes Versus Characters

Is "a" a byte or a character? Is there a difference? In the western world, with our simple alphabets and single-byte character sets, we've long been accustomed to thinking of bytes and characters as being one and the same. We often don't distinguish between the two as we should. We look at a declaration such as CHAR(3), and we see that as a three-character string. But there is a distinction to be made: a byte is a unit of storage, whereas a character is a symbol used in a language or system of writing. The declaration of CHAR(3) shown in Listing 1 resulted in a 3-byte string, which may or may not hold three characters.

Working with multibyte character sets when your declarations and string functions all operate in terms of bytes can be a bit of a challenge. Listing 1, for example, was generated on a database that used Unicode UTF-8 (AL32UTF8) as the database character set. The variable x was declared as CHAR(3), which resulted in an allocation of three bytes. When the single-byte value a was assigned to x, the other two bytes were padded with spaces (a space is also a single byte), and thus the resulting string was three characters long. When the two-byte character was assigned to x, only one byte remained to be padded with a single space. The result? A two-character-long string that happened to be using three bytes of storage.

Character Semantics

Recognizing the confusion surrounding bytes versus characters, Oracle formalized the distinction between byte and character semantics when it released Oracle9i Database. Rather than being forced to declare strings in terms of bytes and then work with them in terms of characters, you can choose whether to declare strings in terms of characters or bytes. You can write declarations such as the following:

x CHAR(3 BYTE)
x CHAR(3 CHAR)

The first of these declarations explicitly declares x as a 3-byte CHAR variable (or it could be a database column). The second explicitly declares x as a three-character CHAR variable. How much storage will be allocated for those three characters? That depends on the underlying character set. The general rule is that Oracle9i Database will look at the maximum number of bytes per character used by the underlying character set and will multiply that value by the number of characters specified in your declaration. In the case of Unicode UTF-8, three bytes will be allocated for each character, so a declaration of CHAR(3 CHAR) results in an allocation of nine bytes.

Declaring a variable or column in terms of bytes is referred to as using byte semantics. Declaring a variable or column in terms of characters is referred to as using character semantics.

The BYTE and CHAR qualifiers can also be applied to VARCHAR2 declarations:

x VARCHAR2(3 BYTE)

x VARCHAR2(3 CHAR)

However, you cannot specify the BYTE and CHAR qualifiers for NCHAR and NVARCHAR2 declarations. Neither of the following declarations is valid:

x NCHAR(3 BYTE)
x NVARCHAR2(3 CHAR)

In Oracle9i Database, declarations of NCHAR and NVARCHAR2 variables and columns are always in terms of characters. Thus, there's no need for the BYTE and CHAR qualifiers in such declarations.

Why Bother?

If a declaration of CHAR(3 CHAR) results in an allocation of nine bytes when using Unicode UTF-8 as the database character set, why not use CHAR(9 BYTE) or simply CHAR(9)? Are there reasons to bother with character semantics and the CHAR qualifier? The answer is yes; I think there are some compelling reasons to use character semantics. In fact, I'd go so far as to argue that you should use character semantics as your default allocation method.

When working with character strings, you do almost everything in terms of characters. In the western world, when you declare a string as CHAR(3), you usually do that because you want to put three characters into it. When you compute the length of a string, it's usually to find the number of characters in it. When you take the substring of a string, you want a substring of complete characters; you don't want only part of a multibyte character. When you search a string to see whether it contains a substring, you are typically looking for a substring of characters, not of random bytes. Given all this, it makes sense to extend the concept of working with characters all the way back to the beginning, to the initial declaration of the string.

Using byte semantics, when you move an application from US7ASCII to UTF-8 in order to support characters from different languages, your CHAR(3) strings may or may not hold three characters, depending on which three characters you happen to be storing at any given time. Under those circumstances, your application will likely fail in short order. You would then need to go through all your PL/SQL source code and all your CREATE TABLE statements and adjust all of your character string declarations to accommodate the new character set. That's not only a huge amount of unnecessary work, it's also a huge opportunity to make mistakes, thereby introducing bugs into what used to be a working application.

There's a better way: declare character variables using character semantics rather than byte semantics. A declaration of CHAR(3 CHAR) will support a maximum of three characters regardless of which character set you end up using. A CHAR(3 CHAR) variable will hold three characters when you use US7ASCII, and it will still hold exactly three characters when you convert your application to Unicode UTF-8. Thus, using character semantics greatly increases the portability of your applications and your database schemas.

The issue of exact length is an interesting one. A declaration of CHAR(3 CHAR) will always support a maximum of three characters, regardless of the number of bytes used to represent those characters. Use a declaration of CHAR(9 BYTE) to allow for up to three Unicode UTF-8 characters, and you'll find that the string will often hold more than three characters. Listing 2 illustrates this. Whether this is a problem or not depends on your application. It's more of a problem with CHAR variables than with VARCHAR2 variables, because CHAR variables are always padded with spaces to fill out their maximum length. Thus, when mixing byte semantics with multibyte character sets, CHAR variables suddenly appear to be variable-length. Using character semantics eliminates all that confusion. Listing 3 uses character semantics to declare x as a fixed, three-character string. The underlying character set is still UTF-8, but you really don't need to be concerned with knowing that. Using character semantics guarantees you a three-character string regardless of character set, so the string is consistently padded to exactly three characters, not bytes, in length.

Length Semantics and String Functions

Oracle supports several built-in functions that operate on characters within a string. These functions are:

  • INSTR. Searches a string for a specified sequence of characters.
  • LENGTH. Returns the number of characters in a string.
  • SUBSTR. Returns a specified portion, or substring, of a string.

Your choice of character or byte semantics affects only the declaration of a PL/SQL variable or a database column. Regardless of how you declare a variable or a column, in Oracle9i the INSTR, LENGTH, and SUBSTR functions always operate in terms of characters. You've seen this already in the operation of the LENGTH function. In Listings 1, 2 and 3, LENGTH always returned the number of characters in the string—never the number of bytes. Listing 4 provides an even more comprehensive example of how these functions operate. UTF-8 is still the character set. The string ãa is two characters long but consumes three bytes because ã is a two-byte character in UTF-8. The INSTR function finds the letter a in the second character position, even though the bytecode for that letter occupies the third byte. As before, LENGTH returns the number of characters in the string (2). SUBSTR(x,1,1) actually returns two bytes of the string, because the first character happens to be a two-byte character.

It's good that INSTR, LENGTH, and SUBSTR operate in terms of characters. You can take an application using these functions and install that application on a database using a different character set—it doesn't matter what that character set is—and all your application's calls to these functions will still provide sane, sensible, and correct results.

There are times, however, when you do actually need to work with strings as sequences of bytes rather than characters. For this reason, Oracle provides the functions INSTRB, LENGTHB, and SUBSTRB, which are shown in Listing 5. Note the suffix B on all these function names, which indicates that these functions operate in terms of bytes, not characters. Other than this difference, the three functions are identical to their non-B counterparts.

Character Semantics Default

If all Oracle did was add the ability to use CHAR(3 CHAR) rather than CHAR(3), that would be good, but you'd still have your work cut out for you when converting existing applications and schemas to UTF-8 or some other multibyte character set. Instead of being faced with the task of expanding CHAR(3) to CHAR(9), you'd be faced with the equally dismal task of changing all CHAR(3)s to CHAR(3 CHAR)s. Fortunately, Oracle9i Database provides the ability to set the default character semantics on either the session or instance level using the NLS_LENGTH_SEMANTICS parameter.

To convert an existing schema and its associated data from byte semantics and a single-byte character set to character semantics and a multibyte character set, such as UTF-8, you need only follow these steps: [The following steps have been corrected since the magazine was printed.]

  1. Export the schema.
  2. Issue an ALTER SYSTEM SET NLS_LENGTH_SEMANTICS=CHAR SCOPE=BOTH command on the target database.
  3. Stop and restart the instance so that the parameter change takes effect.
  4. Drop the original schema.
  5. Recreate the original schema and its tables (you can use import's show=Y option to get the CREATE TABLE statements). Columns in the recreated tables will use character semantics, because that's now the default.
  6. Import the schema into the target database using the IGNORE=Y import option.

Because step 2 changed the default length semantics from byte semantics to character semantics, any CHAR or VARCHAR2 declarations encountered during the import will be interpreted as if the CHAR qualifier was specified. For example, CHAR(3) will be interpreted as CHAR(3 CHAR). The ability to set the default-length semantics on an instance-wide or session basis enables you to easily convert existing schema definitions or PL/SQL code to use character semantics without the need to tediously go through and add the CHAR qualifier to all your declarations.

Note: This process applies only to schemas composed entirely of tables. If your schema includes stored PL/SQL code, you'll need to recompile all that code to use character semantics. If your schema includes object types, you should precreate those types in step 5. Be sure to run a test conversion before attempting to convert a production schema.

Next Steps
READ
about Unicode
unicode.org

You can check the current value of NLS_LENGTH_SEMANTICS at any time by querying the NLS_INSTANCE_PARAMETERS and NLS_SESSION_PARAMETERS views. Changing the ?arameter does not affect any existing column or variable definitions. Columns declared using byte semantics will not be changed simply by changing the parameter. The parameter affects only new column or variable definitions.

Be aware that converting a schema to use a multibyte character set may involve more work than simply rebuilding the schema and recompiling any PL/SQL code to use character semantics. For example, you'll need to look at any external programs, such as those written in C++ or Java, because they may need to be adjusted to handle the multibyte data now coming from the database. Due diligence is still required.

The NLS_LENGTH_SEMANTICS parameter gives you the option of switching to character semantics globally or on a column-by-column basis. To introduce character semantics on a column-by-column basis, leave NLS_LENGTH_SEMANTICS at its default value of BYTE, and use the CHAR qualifier when declaring variables and columns. Change the default when it makes sense to make a wholesale change—for example, when you are developing a brand-new schema and application.

Conclusion

Newly introduced in Oracle9i Database, character semantics makes it much easier than before to work with multibyte character sets. No longer must you declare variables and columns in terms of bytes and then try to keep straight the number of characters such variables and columns will hold. Using character semantics eliminates that confusion by allowing you to work consistently in terms of characters.

Jonathan Gennick (Jonathan@Gennick.com) is an experienced Oracle DBA and an Oracle Certified Professional. He likes to explore new Oracle technologies and recently completed work on the Oracle SQL*Plus Pocket Reference, Second Edition (O'Reilly & Associates, 2002).

UNICODE STRING FUNCTIONS

There are variations of the INSTR, LENGTH, and SUBSTR functions designed specifically for working with Unicode. Looking at the LENGTH function, you'll find the LENGTHC, LENGTH2, and LENGTH4 functions for Unicode.

These functions distinguish between Unicode characters, Unicode code points, and Unicode code units. A code point is the numeric value corresponding to an entry in a Unicode encoding table, and often, but not always, to a Unicode character. For example, 0x00E3 is the code point for the letter "ã". However, that same letter can also be expressed using the code point 0x0061 (for "a") followed by the code point 0x0303 (for "˜").

The standard LENGTH function will interpret code point 0x0061 followed by code point 0x0303 as two separate characters. The LENGTHC function, because it is designed for use with Unicode, will recognize that in Unicode, 0x0061 followed by 0x0303 really represents just one character.

The term code unit refers to the actual representation of a Unicode code point. For example, the UTF-8 representation of the code point 0x0061 is the single byte 0x61. That one byte is a code unit. The UTF-16 representation of 0x0061 is the two bytes 0x0061, and those two bytes form one code unit. Sometimes a single code point is large enough that it needs to be represented as multiple code units. For example, the code point 0x1D11E, which represents the musical symbol G-clef (available at http://www.unicode.org/charts/PDF/U1D100.pdf), is represented in UTF-16 using two code units: 0xD834 followed by 0xDD1E. Neither value by itself is a code point. Only together do these two code units represent a single code point.

The LENGTH2 and LENGTH4 functions enable you to distinguish between code units and code points. LENGTH2 returns the number of code units in a Unicode string. LENGTH4 returns the number of code points in a Unicode string.

The Unicode function variants of the INSTR and SUBSTR function families parallel the Unicode variants of the LENGTH function. Just as there are LENGTHC, LENGTH2, and LENGTH4 Unicode functions, there are also INSTRC, INSTR2, INSTR4, SUBSTRC, SUBSTR2, and SUBSTR4 Unicode functions.

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy