Legal | Privacy
Sorting Multilingual Data
 
 

Sorting Multilingual Data

Module Objectives

Purpose

In this module, you will learn how to sort multilingual data that is stored in the database in different ways.

Objectives

After completing this module, you should be able to:

Understand the differences between monolingual & multilingual linguistic sorts
Discuss GENERIC_M multilingual linguistic sorts
Create Linguistic Indexes to improve sorting performance
Understand the differences between the different types of Asian sorts

Prerequisites

Before starting this module, you should have completed the following:

Preinstallation Tasks

Install the Oracle9i Database

Postinstallation Tasks

Review the Sample Schema
Deploying Unicode with NCHAR SQL Data Types
Downloaded the sort.zip module files and unzipped them into your working directory

Reference Material

The following is a list of useful reference material if you want additional information about the topics in this module:

Documentation: Oracle9i Globalization Support Guide

 

Overview

Oracle provides linguistic sort capabilities that handle the complex sorting requirements of different languages and cultures.

Prior to Oracle9i, only monolingual and binary sorting were available:

Conventionally, when character data is stored, the sort sequence is based on the numeric values of the characters defined by the character encoding scheme. This is called a binary sort. Binary sorts produce reasonable results for the English alphabet because the ASCII and EBCDIC standards define the letters in ascending numeric value. When characters used in other languages are present, a binary sort generally does not product resonalbe results.

Linguistic sorts provide an alternative sort technique that sorts characters independently of their numeric values in the character encoding scheme. A linguistic sort operates by replacing characters with numeric values that reflect each character's proper linguistic order. These numeric values are found in a table containing major and minor values. With monolingual sorting, Oracle makes two passes when comparing strings. The first pass is to compare the major value of the entire string from the major table and the second pass is to compare the minor value from the minor table. Each major table entry contains the Unicode codepoint and major value. Usually, letters with the same appearance will have the same major value. Oracle defines letters with diacritic and case differences for the same major value but different minor values. Sorting with two levels is limited because many languages have properties that can not be sorted in two levels. Monolingual Linguistic Sort, as the name implies, sorts data in one language order, it can support only around 1000 characters, and it can not perform sorts for Chinese, Japanese and Korean Languages.

Oracle9i extends monolingual linguistic sorts with multilingual linguistic sorts so that you can now sort additional languages as part of one sort.  For multilingual data, Oracle provides a sorting mechanism based on an ISO standard (ISO14651 - International String Ordering) and the Unicode 3.0 standard. This is useful for certain regions or languages that have complex sorting rules or global multilingual databases. Multilingual linguistic sorts also work for Asian language sorts ordered by the number of strokes, PinYin, or radicals. Additionally, Oracle9i still supports all the sort orders defined by previous releases.

 

Create Sort Table and Load Data

In order for you to understand how sorting can be used you will create a product descriptions table that contains some specific data that will be used to demonstrate how multilingual data can be sorted.

 
1.

From a SQL*Plus session logged in as OE, execute the @creatpds.sql. You can also perform a describe to see the contents of the table you just created.

 

2.

Now you are ready to populate the table with data. Execute the @populate_pd.sql.

 

 

Binary sorts and GENERIC_M Linguistic sorts

With the new UTF-8 compliant, browser based version of SQLPLUS (iSQL*Plus), you can query the database to see the product names that you have just added. To display text of different languages, you need to change the character encoding in your iSQL*Plus user interface. To do this, perform the following:

 
1.

Open your browser and select Encoding (Microsoft Internet Explorer) or Character Set (Netscape Navigator) from the View Menu and click Unicode (UTF-8). Make sure fonts are associated with the UTF-8 character encoding. Enter the following commands:

connect oe/oe@orcl.world

select language_id, translated_name
 from product_descriptions_sort 
 order by translated_name;

The above example illustrates the default binary sorting sequence for the SQL NCHAR datatype, which is the binary order of the national character set. In general, binary sort is not very useful, because the upper case ASCII characters are always sorted before the lower case ones, and their accented ancestors follow afterwards in the order that they are encoded in the particular character set . The binary order changes depending on the character set you are using. The ISO standard 14651 ( International String Ordering) defines the ordering of international data.

 

2.

In Oracle9i, this is implemented in the linguistic sort GENERIC_M, where the _M means that this is a Multilingual linguistic sort. GENERIC_M generally handles the order of the Latin, cyrillic, greek characters and other European characters. Non-European characters will be sorted in the Unicode binary sequence. Execute the following commands:

alter session set nls_sort=generic_m;

select translated_name, language_id from product_descriptions_sort
order by translated_name;

GENERIC_M sorts data by grouping them by their base characters, so that all variants of the same base characters are placed together , for the above example you can see that all the 'a' s are group together with the lowercase versions first followed by uppercase.

 

 

Linguistic Indexes

Linguistic sort is language specific and requires more data processing than binary sorting. Binary sorting is fast because it utilizes the order of the encoding of your character set . When data of multiple languages is stored in the database, you may want your applications to collate a result set returned from a SELECT statement using the ORDER BY clause with different collating sequences based upon the language being used. This can accomplished without sacrificing performance, by using linguistic indexes.

There are three ways to build linguistic indexes for data in multiple languages:

 
1.

Build a linguistic index for each language which the application needs to support

This approach offers simplicity but requires more disk space. For each index, the rows in the language other than the one on which the index is built are collated together at the end of the sequence. The following example builds linguistic indexes for French and German.

CREATE INDEX french_index ON product_descriptions_sort (NLSSORT(translated_name, 'NLS_SORT=FRENCH'));
CREATE INDEX german_index ON product_descriptions_sort (NLSSORT(translated_name, 'NLS_SORT=GERMAN'));

Which index to use is based on the NLS_SORT session parameters or the arguments of the NLSSORT function you specified in the ORDER BY clause. For example, if the session variable NLS_SORT is set to FRENCH, you can use french_index and when it is set to GERMAN, you can use german_index.

 

2.

Build a single linguistic index for all languages using LANG_COL

This can be accomplished by including a language column (LANG_COL in the example below) that contains NLS_LANGUAGE values for the corresponding column on which the index is built as a parameter to the NLSSORT function. The following example builds a single linguistic index for multiple languages. With this index, the rows with the same values for NLS_LANGUAGE are collated together.

CREATE INDEX i ON t (NLSSORT(col, 'NLS_SORT=' || LANG_COL));

The index that will be used, is based on the argument of the NLSSORT function you specified in the ORDER BY clause.

 

3.

Build a single linguistic index for all languages using a sort sequence

This can be accomplished by using one of the sorting sequences such as GENERIC_M or FRENCH_M. These indexes collate characters according to the rules defined in ISO 14651.

CREATE INDEX i on t (NLSSORT(col, 'NLS_SORT=GENERIC_M');

 

 

Requirements for Linguistic Indexes

Whether using a single linguistic index or multiple linguistic indexes, some requirements must be met for the linguistic index to be used:

The initialization parameter QUERY_REWRITE_ENABLED needs to be true. This is not a specific requirement for linguistic indexes, but for all function based indexes.
NLS_COMP needs to be ANSI.
NLS_SORT needs to indicate the linguistic definition you want to use for the linguistic sort. If you want a French linguistic sort order, NLS_SORT needs to be FRENCH.
You need to use the cost based optimizer, because they are not recognized by the rule based optimizer.
You need to specify WHERE NLSSORT(column_name) IS NOT NULL when you want to use ORDER BY column_name where the column_name is the column with the linguistic index. This is necessary only when you use an ORDER BY clause.

 

Creating and Using a Linguistic Index

To create and use a linguistic index, perform the following:

Load and execute the script @crlingindex.sql from a sqlplus session to create a Linguistic Index and SELECT from the PRODUCT_DESCRIPTIONS_SORT table using the linguistic index GENERIC_M.

CONNECT oe/oe@orcl.world
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE; 
ALTER SESSION SET NLS_COMP=ANSI;        
CREATE INDEX NLS_GENERIC ON product_descriptions_sort
 ( NLSSORT(translated_name,'NLS_SORT=GENERIC_M'));        
ALTER SESSION SET NLS_SORT='GENERIC_M';        
SELECT * FROM PRODUCT_DESCRIPTIONS_SORT 
  WHERE  NLSSORT(TRANSLATED_NAME) IS NOT NULL          
  ORDER BY TRANSLATED_NAME;        

 

 

Asian Multilingual Linguistic Sorts

With the introduction of Multilingual Linguistic sorts and the support for differentiating characters now in 3 distinct levels, complex Asian sorts have been added in Oracle9i to cover the need of ordering Asian data in a way that users are custom to. These sorts are all based on the GENERIC_M multilingual linguistic sort, so that the order of the Latin, cyrillic, greek characters and other European characters will also be sorted according to the ISO 14651 standard.

Here is a list of the Asian Multilingual Linguistic Sorts supported in Oracle9i.

Basic Name

Explanation

JAPANESE_M 

Japanese sort supports SJIS character set order and EUC characters which are not included in SJIS 

KOREAN_M  Korean sort: Hangul characters are based on Unicode binary order. Hanja characters based on pronunciation order. All Hangul characters are before Hanja characters 
THAI_M  Thai sort supports swap characters for some vowels and consonants 
SCHINESE_STROKE_M  Simplified Chinese sort uses number of strokes as primary order and radical as secondary order 
SCHINESE_PINYIN_M  Simplified Chinese PinYin sorting order 
TCHINESE_RADICAL_M  Traditional Chinese sort based on radical as primary order and number of strokes order as secondary order 
TCHINESE_STROKE_M  Traditional Chinese sort uses number of strokes as primary order and radical as secondary order 

To see how the asian multilingual lingustic sort works, perform the following:

1.

Enter the following commands to show the Simplified Chinese Pin Yin (pronunciation) sort.

alter session set nls_sort=schinese_pinyin_m;

select translated_name, language_id
  from product_descriptions_sort
  where language_id='ZHS'
  order by translated_name;

 

2.

Enter the following commands to show the Simplified Chinese sort ordered by the number of strokes.

alter session set nls_sort=schinese_stroke_m;

select translated_name, language_id
  from product_descriptions_sort
  where language_id='ZHS'
  order by translated_name;

 

 

Module Summary

In this module, you should have learned how to:

Understand the differences between monolingual & multilingual linguistic sorts
Discuss GENERIC_M multilingual linguistic sorts
Create Linguistic Indexes to improve sorting performance
Understand the differences between the different types of Asian sorts

 

Close Window

E-mail this page
Printer View Printer View
Oracle Is The Information Company About Oracle | Oracle RSS Feeds | Careers | Contact Us | Site Maps | Legal Notices | Terms of Use | Privacy