Technology: Globalization
The Unicode Imperative
by Barry Trute
Deploying a new database system starts with Unicode.
Designing and setting up a new database system can take a lot of planning. But typically, one important decision is overlooked: choosing the right character set.
A basic consideration in choosing a character set is making sure it can handle every language that needs to be supported, both immediately and in the foreseeable future. Another often overlooked consideration is which applications and technologies you may want to use with, or connect to, the database. Choosing Unicode as the database character set ensures a strong foundation for whatever is built into and on top of the database.
About Unicode
Unicode is a universal character-encoding scheme that allows you to store information from any major language using a single character set. Unicode defines properties for each character, standardizes script behavior, provides a standard algorithm for bidirectional text, and defines cross-mappings to other standards. Version 4.0 of the Unicode Standard, the current version developed by the Unicode Consortium, assigns a unique identifier to each of 96,382 characters (up from 95,156 in version 3.2), covering the scripts of the world's principal written languages and many mathematical and other symbols.
Unicode Secures Maximum Usability
Industry leaders such as Apple, HP, IBM, Microsoft, and Oracle have adopted the Unicode Standard. Modern standards and technologies such as CORBA 3.0, Java, JavaScript, LDAP, WML, and XML require Unicode, and it is the official means to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products.
Incorporating Unicode into applications and Web sites offers significant cost savings over the use of other character sets. Unicode enables a single software product or a single Web site to be targeted across multiple platforms, languages, and countries without re-engineering. It also allows data to be transported across many different systems without corruption.
Unicode Ensures Extensibility
Unicode has started to replace character-encoding schemes such as ASCII, EUC, and ISO 8859 at all levels. With the UTF-8 encoding, Unicode can be used in a convenient and backward-compatible way in environments that were designed entirely around ASCII, such as UNIX. UTF-8 is the encoding used for UNIX, Linux, and similar systems.
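The backward compatibility described above can be checked directly. The following is a minimal Python sketch, not anything Oracle-specific; the helper name same_bytes_in_ascii_and_utf8 and the sample strings are made up for illustration. It shows that pure ASCII text produces exactly the same bytes in UTF-8, which is why ASCII-era tools keep working.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes in ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains non-ASCII characters, so it has no ASCII form.
        return False


# Hypothetical sample: a plain ASCII string is unchanged by UTF-8.
print(same_bytes_in_ascii_and_utf8("SELECT ename FROM emp"))  # True

# A non-ASCII character becomes a multibyte sequence; ASCII bytes stay as-is.
print("café".encode("utf-8"))  # b'caf\xc3\xa9'
```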
All de facto Web standards (HTML, XML, and so on) support or require Unicode. By deploying new systems in Unicode today, you ensure compatibility with today's latest technologies and put yourself in a position to leverage future advances without the need for costly migrations.
Deploying Your System Without Unicode
So what happens if you don't deploy your new system in Unicode now? If you don't require Unicode today and choose a different character set appropriate to your needs, you can still implement your system very successfully. However, in our experience working with numerous customers, in the long run this tends to be a mistake. There are so many circumstances in which Unicode eventually becomes essential, including mergers and acquisitions, data and systems consolidation, internationalization of operations, support for new regulations or standards, and just trying to stay ahead of competitors.
The time and cost involved in migrating legacy systems to Unicode can be significant and even prohibitive. A critical factor for the migration of a database is the downtime, which may last anywhere from hours to days. Careful preparations are required before any migration, including doing backups, taking the system offline, scanning for invalid data (using the Database Character Set Scanner), dropping and recreating indexes, and running Oracle Database Utilities such as Oracle Data Pump or the Export and Import utilities. Some companies cannot afford any downtime and may have hundreds of databases to migrate.
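To give a feel for what "scanning for invalid data" means, here is a rough Python sketch. It is not the Database Character Set Scanner and does not reproduce its behavior; the function find_invalid_rows and the sample rows are hypothetical. The idea is only that every stored byte string should be decodable in the character set the database claims to use, and rows that are not need attention before conversion.

```python
def find_invalid_rows(rows, declared_charset):
    """Return the indexes of byte strings that are not valid in declared_charset.

    rows is a list of raw byte strings as they might be stored in a column;
    declared_charset is a Python codec name (an assumption for this sketch,
    not an Oracle character set name).
    """
    invalid = []
    for index, raw in enumerate(rows):
        try:
            raw.decode(declared_charset)
        except UnicodeDecodeError:
            invalid.append(index)
    return invalid


# Hypothetical data: the second value holds an 8-bit byte for the u-umlaut,
# which is not valid if the declared character set is 7-bit ASCII.
stored = [b"Smith", "Müller".encode("latin-1"), b"Jones"]
print(find_invalid_rows(stored, "ascii"))  # [1]
```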
The time investment and potential difficulty of migrating the database character set to Unicode are often just the tip of the iceberg for many customers. Customers are often faced with simultaneously having to migrate applications and hundreds of end users along with multiple database instances. Applications may need to be internationalized to handle multibyte data, which includes, as a start, providing locale-sensitive operations, expanding field sizes for storage, and properly converting input to Unicode. Often third-party tools are not chosen with Unicode in mind and must be upgraded or replaced. With no other options available, customers in these situations frequently must stage migration to Unicode in phases that can sometimes take years to complete.
Oracle's Support of Unicode
Oracle's support of Unicode is quite comprehensive. Oracle Database 10g Release 2 provides full support for Unicode 4.0, the standard for multilingual support. This support allows customers to develop, deploy, and host multiple languages in a single central database or as part of a grid. Oracle also offers the flexibility to store all data in a Unicode database in UTF-8 or to incrementally store select columns in the Unicode datatype in UTF-8 or UTF-16.
Oracle provides several database-access products for inserting and retrieving Unicode data. Oracle offers support for the most commonly used programming environments, such as Java and C/C++. Data is transparently converted between the database and client programs, to ensure that client programs are independent of the database character set.
Oracle's support for Unicode extends throughout the database technology stack. Several key technologies, such as Oracle HTML DB, Oracle Text, SQL Regular Expressions, Oracle XML DB, and XQuery, not only support Unicode but offer enhanced capabilities and flexibility when it is used.
Comprehensive support for Unicode extends throughout the Oracle product family, including Oracle Fusion Middleware, Oracle E-Business Suite, and Oracle Enterprise Manager.
Oracle Enhances the Unicode Experience
While many customers realize the need for and benefits of using Unicode, they are often intimidated by its perceived complexity. In some cases it is the fear of the unknown, but some challenges to using Unicode need to be addressed. Most of these revolve around UTF-8 being a varying-width encoding scheme. ASCII characters occupy 1 byte each. Accented Latin characters, Arabic, Cyrillic, Greek, and Hebrew occupy 2 bytes each. Most other characters, including Chinese, Indian, Japanese, and Korean, occupy 3 bytes each, and supplementary characters are represented in 4 bytes. Here are some common issues and how Oracle addresses them.
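The byte widths listed above are easy to confirm for yourself. This short Python sketch uses a hypothetical helper, utf8_width, and one sample character per script chosen for illustration; it simply counts the bytes each character occupies once encoded as UTF-8.

```python
def utf8_width(character: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(character.encode("utf-8"))


# One illustrative character per group; the selection is arbitrary.
samples = {
    "ASCII letter A": "\u0041",
    "Accented Latin e-acute": "\u00e9",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0416",
    "Hebrew alef": "\u05d0",
    "Arabic ain": "\u0639",
    "Devanagari a": "\u0905",
    "Chinese zhong": "\u4e2d",
    "Japanese hiragana a": "\u3042",
    "Korean han": "\ud55c",
    "Supplementary: musical G clef": "\U0001d11e",
}

for label, character in samples.items():
    print(f"{label}: {utf8_width(character)} byte(s)")
```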
What about storage management and string handling with a Unicode database?
Many people are under the impression that supporting a UTF-8 database means doubling or tripling storage requirements and having to deal with very complicated string handling. Storage requirements typically do increase, but not that dramatically. The first thing to keep in mind is that databases contain many noncharacter data fields that will not be subject to expansion, such as numeric, date, time stamp, and binary data (such as images and documents). The amount of data expansion often depends on the language of the text data. For a database consisting primarily of English data, the amount of expansion should be very low; only the non-ASCII symbols will expand, from 1 byte to 2 or 3 bytes. Western European languages contain some diacritics; these characters will generally expand to 2 bytes in UTF-8. Characters used in Asian languages will experience the most expansion; these will typically go from a 2-byte character to a 3-byte character.
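A quick way to get a feel for these numbers is to encode sample text both ways and compare sizes. The Python sketch below is only an estimate under stated assumptions: expansion_ratio is a hypothetical helper, the sample strings are invented, and the legacy encodings are Python codec names standing in for whatever single-byte or double-byte character set a real database uses.

```python
def expansion_ratio(text: str, legacy_charset: str) -> float:
    """Return UTF-8 size divided by the size in a legacy character set."""
    legacy_size = len(text.encode(legacy_charset))
    utf8_size = len(text.encode("utf-8"))
    return utf8_size / legacy_size


# English text: every character is ASCII, so there is no growth at all.
print(expansion_ratio("Order shipped", "latin-1"))  # 1.0

# German word with two diacritics out of five characters: 5 bytes become 7.
print(expansion_ratio("Größe", "latin-1"))  # 1.4

# Japanese text: each character goes from 2 bytes to 3.
print(expansion_ratio("日本語", "shift_jis"))  # 1.5
```

Real text is a mix of characters, and character columns are only part of a database, so the overall growth is normally well below the per-character worst case shown in the last line.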
Oracle9i introduced length semantics, which lets you choose whether to declare string lengths in characters or in bytes. Byte semantics is still the default, and character-length semantics can be declared at the column, table, session, or database level. Length semantics makes storage management more intuitive and allows a common database schema to run successfully on different databases with different character sets. This makes migrating applications and databases to Unicode much simpler. The choice of character or byte semantics affects only the declaration of a PL/SQL variable or a database column.
Regardless of how you declare a variable or a column, the INSTR, LENGTH, and SUBSTR functions always operate in terms of characters. The benefit here is that you get identical behavior whether your character set is single byte or multibyte such as Unicode.
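The difference between the two kinds of declaration, for example VARCHAR2(6 BYTE) versus VARCHAR2(6 CHAR), can be illustrated outside the database. The Python sketch below is an analogy only, not Oracle code: the function fits is hypothetical, Python codec names stand in for database character sets, and Python's len() counts Unicode code points, which is used here as a stand-in for counting characters.

```python
def fits(value: str, limit: int, semantics: str = "BYTE",
         charset: str = "utf-8") -> bool:
    """Check whether value fits a column of the given length and semantics."""
    if semantics == "CHAR":
        # Character semantics: the limit counts characters.
        return len(value) <= limit
    if semantics == "BYTE":
        # Byte semantics: the limit counts bytes in the storage character set.
        return len(value.encode(charset)) <= limit
    raise ValueError(f"unknown length semantics: {semantics!r}")


city = "Zürich"  # 6 characters, 7 bytes in UTF-8, 6 bytes in Latin-1

print(fits(city, 6, "BYTE", "latin-1"))  # True: fits the single-byte schema
print(fits(city, 6, "BYTE", "utf-8"))    # False: same schema overflows in UTF-8
print(fits(city, 6, "CHAR", "utf-8"))    # True: character semantics still fits
```

This is the situation in which character semantics lets one schema definition work unchanged across single-byte and Unicode databases.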
Will having a Unicode database impact performance?
Oracle works on improving the performance of Unicode with every database release. Generally, a Unicode database will provide nearly the same performance as a single-byte database for the same release. Performance will be impacted more on a Unicode database than on a single-byte database in environments where PL/SQL string-manipulation functions are heavily used.
The performance of a Unicode database can exceed the performance of a single-byte database implemented in a previous release. For example, a Unicode database in Oracle Database 10g will typically run faster than a single-byte database running on Oracle9i, all other things being equal. And of course there are other great Oracle features, such as Real Application Clusters (RAC) and Grid Computing, to accelerate performance as needed.
What about integration with my existing systems?
Projects are rarely begun completely from scratch. While new systems are built all the time, there is often some dependency on existing data or systems. Since Unicode supports virtually every language and is a superset of all Oracle character sets, you can properly convert and store valid incoming data. Should your Unicode database be a feed to other systems, you need to make sure all data sent to the target system can be supported. Existing schemas and applications can more easily be reused thanks to the storage management and string handling capabilities I mentioned previously. For many customers, implementing a new system with Unicode can be a great opportunity to break away from legacy practices and start down the path to upgrading all their systems to Unicode.
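When a Unicode database feeds a system that uses a narrower character set, the check described above amounts to asking whether each outgoing character exists in the target set. Here is a minimal Python sketch of that idea; unsupported_characters is a hypothetical helper, and the target is named with a Python codec rather than an Oracle character set name.

```python
def unsupported_characters(text: str, target_charset: str) -> list:
    """Return the characters in text that target_charset cannot represent."""
    unsupported = []
    for character in text:
        try:
            character.encode(target_charset)
        except UnicodeEncodeError:
            unsupported.append(character)
    return unsupported


# Hypothetical feed to a Western European (Latin-1) target system.
print(unsupported_characters("Zürich", "latin-1"))       # []
print(unsupported_characters("Zürich 東京", "latin-1"))  # ['東', '京']
```

Data flowing in the other direction, from a legacy system into Unicode, does not have this problem as long as the incoming data is valid in its declared character set.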
The Oracle Recommendation
Oracle recommends using Unicode for all new system deployments. Migrating legacy systems to Unicode is also recommended. Deploying your systems today in Unicode offers many advantages in usability, compatibility, and extensibility. Oracle's comprehensive support allows you to deploy high-performing systems faster and more easily while leveraging the true power of Unicode. Even if you don't need to support multilingual data today and have no requirement for Unicode, it is still likely to be the best choice for a new system in the long run and will ultimately save you time and money and give you competitive advantages. Maximize your investment in Oracle technology and use Unicode.
For more information on Unicode and the features described here, visit OTN's Globalization Support home page.
Barry Trute is a Principal Product Manager in the Server Globalization Technology group at Oracle. He is responsible for driving new globalization features based on inbound customer requirements, promoting Oracle's Globalization Support features, and working as the liaison between Oracle customers and the development organization.