| |
Sorting Multilingual Data
Module Objectives
Purpose
In this module, you will learn how to sort multilingual data that is
stored in the database in different ways.
Objectives
After completing this module, you should be able to:
 |
Understand the differences between monolingual &
multilingual linguistic sorts |
 |
Discuss GENERIC_M multilingual linguistic sorts |
 |
Create Linguistic Indexes to improve sorting performance |
 |
Understand the differences between the different types
of Asian sorts |
Prerequisites
Before starting this module, you should have completed the following:
Reference Material
The following is a list of useful reference material if you want additional
information about the topics in this module:
 |
Documentation: Oracle9i
Globalization Support Guide
|
Overview
Oracle provides linguistic sort capabilities that
handle the complex sorting requirements of different languages and cultures.
Prior to Oracle9i, only monolingual and binary
sorting were available:
Conventionally, when character data is stored,
the sort sequence is based on the numeric values of the characters defined
by the character encoding scheme. This is called a binary sort.
Binary sorts produce reasonable results for the English alphabet because
the ASCII and EBCDIC standards define the letters in ascending numeric
value. When characters used in other languages are present, a binary sort
generally does not product resonalbe results.
Linguistic sorts provide an alternative sort technique
that sorts characters independently of their numeric values in the character
encoding scheme. A linguistic sort operates by replacing characters with
numeric values that reflect each character's proper linguistic order.
These numeric values are found in a table containing major and minor values.
With monolingual sorting, Oracle makes two passes when comparing
strings. The first pass is to compare the major value of the entire string
from the major table and the second pass is to compare the minor value
from the minor table. Each major table entry contains the Unicode codepoint
and major value. Usually, letters with the same appearance will have the
same major value. Oracle defines letters with diacritic and case differences
for the same major value but different minor values. Sorting with two
levels is limited because many languages have properties that can not
be sorted in two levels. Monolingual Linguistic Sort, as the name implies,
sorts data in one language order, it can support only around 1000 characters,
and it can not perform sorts for Chinese, Japanese and Korean Languages.
Oracle9i extends monolingual linguistic sorts with
multilingual linguistic sorts so that you can now sort additional
languages as part of one sort. For multilingual data, Oracle provides
a sorting mechanism based on an ISO standard (ISO14651 - International
String Ordering) and the Unicode 3.0 standard. This is useful for certain
regions or languages that have complex sorting rules or global multilingual
databases. Multilingual linguistic sorts also work for Asian language
sorts ordered by the number of strokes, PinYin, or radicals. Additionally,
Oracle9i still supports all the sort orders defined by previous releases.
Create Sort Table and Load Data
In order for you to understand how sorting can be used you will create
a product descriptions table that contains some specific data that will
be used to demonstrate how multilingual data can be sorted.
| 1. |
From a SQL*Plus session logged in as OE, execute the @creatpds.sql.
You can also perform a describe to see the contents of the table
you just created.
|
| 2. |
Now you are ready to populate the table with data. Execute the
@populate_pd.sql.
|
Binary sorts and GENERIC_M Linguistic sorts
With the new UTF-8 compliant, browser based version of SQLPLUS (iSQL*Plus),
you can query the database to see the product names that you have just
added. To display text of different languages, you need to change the
character encoding in your iSQL*Plus user interface. To do this, perform
the following:
| 1. |
Open your browser and select Encoding (Microsoft Internet
Explorer) or Character Set (Netscape Navigator) from the
View Menu and click Unicode (UTF-8). Make sure fonts are associated
with the UTF-8 character encoding. Enter the following commands:
connect oe/oe@orcl.world
select language_id, translated_name
from product_descriptions_sort
order by translated_name;
The above example illustrates the default
binary sorting sequence for the SQL NCHAR datatype, which is the
binary order of the national character set. In general, binary sort
is not very useful, because the upper case ASCII characters are
always sorted before the lower case ones, and their accented ancestors
follow afterwards in the order that they are encoded in the particular
character set . The binary order changes depending on the character
set you are using. The ISO standard 14651 ( International String
Ordering) defines the ordering of international data.
|
| 2. |
In Oracle9i, this is implemented in
the linguistic sort GENERIC_M, where the _M means
that this is a Multilingual linguistic sort. GENERIC_M generally
handles the order of the Latin, cyrillic, greek characters and other
European characters. Non-European characters will be sorted in the
Unicode binary sequence. Execute the following commands:
alter session set nls_sort=generic_m;
select translated_name, language_id from product_descriptions_sort
order by translated_name;
GENERIC_M sorts data by grouping them by their base characters,
so that all variants of the same base characters are placed together
, for the above example you can see that all the 'a' s are group
together with the lowercase versions first followed by uppercase.
|
Linguistic Indexes
Linguistic sort is language specific and requires more data processing
than binary sorting. Binary sorting is fast because it utilizes the order
of the encoding of your character set . When data of multiple languages
is stored in the database, you may want your applications to collate a
result set returned from a SELECT statement using the ORDER BY clause
with different collating sequences based upon the language being used.
This can accomplished without sacrificing performance, by using linguistic
indexes.
There are three ways to build linguistic indexes for data in multiple
languages:
| 1. |
Build a linguistic index for each language which the application
needs to support
This approach offers simplicity but requires more disk space. For
each index, the rows in the language other than the one on which
the index is built are collated together at the end of the sequence.
The following example builds linguistic indexes for French and German.
CREATE INDEX french_index ON product_descriptions_sort (NLSSORT(translated_name,
'NLS_SORT=FRENCH'));
CREATE INDEX german_index ON product_descriptions_sort (NLSSORT(translated_name,
'NLS_SORT=GERMAN'));
Which index to use is based on the NLS_SORT session parameters
or the arguments of the NLSSORT function you specified in the ORDER
BY clause. For example, if the session variable NLS_SORT is set
to FRENCH, you can use french_index and when it is set to GERMAN,
you can use german_index.
|
| 2. |
Build a single linguistic index for all languages using LANG_COL
This can be accomplished by including a language column (LANG_COL
in the example below) that contains NLS_LANGUAGE values for the
corresponding column on which the index is built as a parameter
to the NLSSORT function. The following example builds a single linguistic
index for multiple languages. With this index, the rows with the
same values for NLS_LANGUAGE are collated together.
CREATE INDEX i ON t (NLSSORT(col, 'NLS_SORT=' || LANG_COL));
The index that will be used, is based on the argument of the NLSSORT
function you specified in the ORDER BY clause.
|
| 3. |
Build a single linguistic index for all languages
using a sort sequence
This can be accomplished by using one of the sorting sequences
such as GENERIC_M or FRENCH_M. These indexes collate characters
according to the rules defined in ISO 14651.
CREATE INDEX i on t (NLSSORT(col, 'NLS_SORT=GENERIC_M');
|
Requirements for Linguistic Indexes
Whether using a single linguistic index or multiple linguistic indexes,
some requirements must be met for the linguistic index to be used:
 |
The initialization parameter QUERY_REWRITE_ENABLED needs
to be true. This is not a specific requirement for linguistic indexes,
but for all function based indexes. |
 |
NLS_COMP needs to be ANSI. |
 |
NLS_SORT needs to indicate the linguistic definition
you want to use for the linguistic sort. If you want a French linguistic
sort order, NLS_SORT needs to be FRENCH. |
 |
You need to use the cost based optimizer, because they
are not recognized by the rule based optimizer. |
 |
You need to specify WHERE NLSSORT(column_name) IS NOT
NULL when you want to use ORDER BY column_name where the column_name
is the column with the linguistic index. This is necessary only when
you use an ORDER BY clause. |
Creating and Using a Linguistic Index
To create and use a linguistic index, perform the following:
 |
Load and execute the script @crlingindex.sql
from a sqlplus session to create a Linguistic Index and SELECT from
the PRODUCT_DESCRIPTIONS_SORT table using the linguistic index GENERIC_M.
CONNECT oe/oe@orcl.world
ALTER SESSION SET QUERY_REWRITE_ENABLED=TRUE;
ALTER SESSION SET NLS_COMP=ANSI;
CREATE INDEX NLS_GENERIC ON product_descriptions_sort
( NLSSORT(translated_name,'NLS_SORT=GENERIC_M'));
ALTER SESSION SET NLS_SORT='GENERIC_M';
SELECT * FROM PRODUCT_DESCRIPTIONS_SORT
WHERE NLSSORT(TRANSLATED_NAME) IS NOT NULL
ORDER BY TRANSLATED_NAME;

|
Asian Multilingual Linguistic Sorts
With the introduction of Multilingual Linguistic sorts and the support
for differentiating characters now in 3 distinct levels, complex Asian
sorts have been added in Oracle9i to cover the need of ordering
Asian data in a way that users are custom to. These sorts are all based
on the GENERIC_M multilingual linguistic sort, so that the
order of the Latin, cyrillic, greek characters and other European characters
will also be sorted according to the ISO 14651 standard.
Here is a list of the Asian Multilingual Linguistic
Sorts supported in Oracle9i.
| Basic Name |
Explanation
|
| JAPANESE_M |
Japanese sort supports SJIS character set order and EUC characters
which are not included in SJIS
|
| KOREAN_M |
Korean sort: Hangul characters are based on Unicode
binary order. Hanja characters based on pronunciation order. All Hangul
characters are before Hanja characters |
| THAI_M |
Thai sort supports swap characters for some vowels and
consonants |
| SCHINESE_STROKE_M |
Simplified Chinese sort uses number of strokes as primary
order and radical as secondary order |
| SCHINESE_PINYIN_M |
Simplified Chinese PinYin sorting order |
| TCHINESE_RADICAL_M |
Traditional Chinese sort based on radical as primary
order and number of strokes order as secondary order |
| TCHINESE_STROKE_M |
Traditional Chinese sort uses number of strokes as primary
order and radical as secondary order |
To see how the asian multilingual lingustic sort works, perform the following:
| 1. |
Enter the following commands to show the Simplified Chinese Pin
Yin (pronunciation) sort.
alter session set nls_sort=schinese_pinyin_m;
select translated_name, language_id
from product_descriptions_sort
where language_id='ZHS'
order by translated_name;
|
| 2. |
Enter the following commands to show the Simplified Chinese sort
ordered by the number of strokes.
alter session set nls_sort=schinese_stroke_m;
select translated_name, language_id
from product_descriptions_sort
where language_id='ZHS'
order by translated_name;
|
Module Summary
In this module, you should have learned how to:
 |
Understand the differences between monolingual &
multilingual linguistic sorts |
 |
Discuss GENERIC_M multilingual linguistic sorts |
 |
Create Linguistic Indexes to improve sorting performance |
 |
Understand the differences between the different types
of Asian sorts |
Close Window
|