Internationalization is the process of designing software so that it can be adapted (localized) to various languages and regions easily, cost-effectively, and in particular without engineering changes to the software. This generally involves isolating the parts of a program that are dependent on language and culture. For example, the text of error messages must be kept separate from program source code because they must be translated during localization.
Localization is the process of adapting a program for use in a specific locale. A locale is a geographic or political region that shares the same language and customs. Localization includes the translation of text such as user interface labels, error messages, and online help. It also includes the culture-specific formatting of data items such as monetary values, times, dates, and numbers.
It's best to start internationalization right from the beginning, when you determine the requirements for your software. To design the flexibility into your software that's necessary to enable easy localization, you need to understand how the requirements differ among all the countries and languages (locales) that you plan to support. You can use the Sun Software Product Internationalization Taxonomy to guide you in this process. The Java Tutorial also provides a simple Checklist that helps you identify some common issues. Once you have identified the requirements, the Internationalization Trail of the Java Tutorial and other materials referenced from the Java Internationalization page can help you find appropriate solutions for design and implementation.
Yes, Sun's JREs let you type the euro character, render it, convert it from and to numerous character encodings, and use it when formatting numeric values as currency. For text input and rendering, you need the appropriate support in the host operating system - see the documentation for Windows and Solaris. For formatting with a currency symbol, Sun's JREs from version 1.4 use the euro as the default currency for the member countries of the European Monetary Union, while for Sun's JRE 1.3.1 you need to select locales with the "EURO" variant.
The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char arrays, implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.
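As a small sketch of what this means in practice (the class name here is hypothetical): a BMP character such as the euro sign occupies exactly one char, i.e. one UTF-16 code unit, in a String.

```java
public class Utf16Demo {
    public static void main(String[] args) {
        // Both "A" (U+0041) and the euro sign (U+20AC) are BMP characters,
        // so each occupies exactly one UTF-16 code unit in a String.
        String s = "A\u20AC";
        System.out.println(s.length());        // prints 2 (code units)
        System.out.println((int) s.charAt(1)); // prints 8364, i.e. 0x20AC
    }
}
```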
Unicode is an international character set standard which supports all of the major scripts of the world, as well as common technical symbols. The original Unicode specification defined characters as fixed-width 16-bit entities, but the Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF. You can learn more about the Unicode standard at the Unicode Consortium web site.
Character handling in J2SE 5 is based on version 4.0 of the Unicode standard. This includes support for supplementary characters, which has been specified by the JSR 204 expert group and implemented throughout the JDK. See the article Supplementary Characters in the Java Platform, the Java Specification Request 204 or the Character class documentation for more information.
J2SE 1.4 uses version 3.0 of the Unicode standard, and J2SE 1.3 uses version 2.1. They generally don't support supplementary characters.
A coded character set is a character set (a collection of characters) where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the number 0041 and the letter "€" (the symbol for the euro currency) the number 20AC, both in hexadecimal notation. The Unicode standard always uses hexadecimal numbers and writes them with the prefix "U+", so the number for "A" is written as "U+0041".
Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.
Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.
A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are 8-bit bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
A character encoding is a mapping from a set of characters to sequences of code units; it applies a character encoding scheme to one or more coded character sets. Some commonly used character encodings are UTF-8, ISO-8859-1, GB18030, and Shift_JIS.
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: the values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means that software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
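For example, here is a sketch (assuming a J2SE 5.0 or later runtime, and using U+1D11E, the musical G clef symbol, as an arbitrary supplementary character) of how a surrogate pair appears in a String:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E is a supplementary character, so it is encoded
        // as a surrogate pair (two code units) in UTF-16.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 character
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}
```

Because the surrogate ranges are reserved, methods such as Character.isHighSurrogate can classify any code unit in isolation.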
A locale is a geographic or political region that shares the same language and customs. In the Java platform, a locale is represented by a Locale object. Locale-sensitive operations, such as collation and date formatting, vary according to locale.
The supported locales vary between different implementations of the Java platform and between areas of functionality. Information about the supported locales in Sun's JREs is provided by the Supported Locales documents for Java SE 6, J2SE 5.0, J2SE 1.4.2, and J2SE 1.3.1.
Many locale-sensitive APIs let you specify the desired locale. This allows you to write applications that can handle different locales at the same time. For example, a web application written in Java can handle different requests using different locales simultaneously.
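For example, a sketch of passing explicit Locale objects so that a single program serves two locales at once (the class name is hypothetical):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class TwoLocales {
    public static void main(String[] args) {
        double amount = 1234.5;
        // The same value formatted for two locales in the same program,
        // without touching the default locale.
        NumberFormat us = NumberFormat.getNumberInstance(Locale.US);
        NumberFormat de = NumberFormat.getNumberInstance(Locale.GERMANY);
        System.out.println(us.format(amount)); // 1,234.5
        System.out.println(de.format(amount)); // 1.234,5
    }
}
```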
This depends on the implementation of the Java platform you're using. The initial default locale is normally determined from the host operating system's locale. Versions 1.4 and higher of Sun's JREs let you override this by setting the user.language, user.country, and user.variant system properties from the command line. For example, to select Locale("th", "TH", "TH") as the initial default locale, you would use:
java -Duser.language=th -Duser.country=TH -Duser.variant=TH MainClass
Since not all runtime environments provide this feature, it should only be used for testing.
A ResourceBundle object allows you to isolate localizable elements from the rest of the application. With all resources separated into a bundle, the application simply loads the appropriate bundle for the requested locale. If a different locale is requested, the application just loads a different bundle.
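A minimal self-contained sketch of the lookup (the bundle and key names here are hypothetical; real applications typically put translations in .properties files rather than ListResourceBundle subclasses):

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

public class BundleDemo {
    // Base bundle, used when no locale-specific bundle matches.
    public static class Labels extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] { { "greeting", "Hello" } };
        }
    }
    // French bundle, selected for French locales.
    public static class Labels_fr extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] { { "greeting", "Bonjour" } };
        }
    }
    public static void main(String[] args) {
        Locale.setDefault(Locale.ENGLISH); // make the fallback deterministic
        ResourceBundle fr = ResourceBundle.getBundle("BundleDemo$Labels", Locale.FRENCH);
        ResourceBundle us = ResourceBundle.getBundle("BundleDemo$Labels", Locale.US);
        System.out.println(fr.getString("greeting")); // Bonjour
        System.out.println(us.getString("greeting")); // no _en bundle, falls back: Hello
    }
}
```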
You can specify any Unicode character with a \uXXXX escape sequence, where XXXX denotes the 4 hexadecimal digits for the value of the UTF-16 code unit. For a supplementary character, you need two escape sequences, one for each of the two UTF-16 code units. For example, a properties file might have the following entries:
s1=hello there
s2=\u3053\u3093\u306b\u3061\u306f
If you have edited and saved the file in a non-ASCII encoding, you can convert it to ASCII with the native2ascii tool. For example, you might want to do this when editing a properties file in Shift_JIS, a popular Japanese encoding.
If your source file contains non-ASCII characters, you need to tell the compiler which encoding the file is using. For example, you would compile a Japanese resource bundle written in the Shift_JIS encoding as follows:
javac -encoding Shift_JIS LabelsResource_ja.java
java.text.Format and its subclasses are generally not synchronized. It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally.
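One common pattern (a sketch under the assumption that per-thread instances are acceptable for your workload; the class name is hypothetical) is to keep one format instance per thread with ThreadLocal instead of synchronizing a shared one:

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PerThreadFormat {
    // One DateFormat instance per thread, so no external synchronization
    // is needed when many threads format dates concurrently.
    private static final ThreadLocal<DateFormat> FORMAT =
        new ThreadLocal<DateFormat>() {
            @Override protected DateFormat initialValue() {
                SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
                f.setTimeZone(TimeZone.getTimeZone("UTC")); // deterministic output
                return f;
            }
        };

    public static String format(Date d) {
        return FORMAT.get().format(d);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L))); // prints 1970-01-01
    }
}
```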
The Collator class and its subclasses are used for building sorting routines. These classes are locale-sensitive, and when created with the no-argument getInstance factory method will use the collating sequence of the default locale.
Since decomposing takes time, turning decomposition off makes comparisons go faster. However, for Latin languages the NO_DECOMPOSITION mode is not useful if the text contains accents. You should use the default decomposition unless you really know what you're doing.
The strength property you choose depends on what your application is trying to accomplish. For example, when performing a text search you may allow a "weak" match, in which accents and differences in case (upper vs. lower) are ignored. This type of search employs the PRIMARY strength. If you are sorting a list of words, you might want to use the TERTIARY strength. In this mode the properties that must match are the base character, accent, and case.
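A sketch contrasting the two strengths with a US English collator (the class name and the sample words are arbitrary):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengths {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.US);
        String a = "resume";
        String b = "R\u00e9sum\u00e9"; // "Résumé"

        c.setStrength(Collator.PRIMARY);   // ignore accents and case
        System.out.println(c.equals(a, b)); // true

        c.setStrength(Collator.TERTIARY);  // base letter, accent, and case all matter
        System.out.println(c.equals(a, b)); // false
    }
}
```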
The term "charset" is used as a synonym for character encoding in several Internet-related specifications, as well as in Java APIs. It should not be confused with a character set or coded character set. The names of many (but unfortunately not all) charsets are registered with IANA.
The Converting Non-Unicode Text section of The Java Tutorial explains how to perform the conversions within an application using APIs in the java.io package. The java.nio.charset.Charset class, available since J2SE 1.4, provides more direct access to character conversion. Numerous other Java interfaces, from the java.net package to the JSP compiler, rely on these low-level interfaces to provide character conversion as necessary for their respective functionality. To convert data files, use the native2ascii tool.
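For example, a round-trip conversion sketch with the Charset class (ISO-8859-1 is chosen here because it is one of the encodings every Java platform implementation must support; the class name is hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        String text = "caf\u00e9"; // "café"

        // Encode characters to bytes, then decode the bytes back.
        ByteBuffer bytes = latin1.encode(text);
        System.out.println(bytes.remaining()); // 4 bytes, one per character

        String decoded = latin1.decode(bytes).toString();
        System.out.println(decoded.equals(text)); // true
    }
}
```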
The supported encodings vary between different implementations of the Java platform. Information about the supported encodings in Sun's JREs is provided by the Supported Encodings documents for Java SE 6, J2SE 5.0, J2SE 1.4.2, and J2SE 1.3.1.
This depends on which other software your application communicates with as well as on the language(s) used in the text. Where circumstances permit, UTF-8 is an excellent character encoding for the following reasons:
Here are some examples where UTF-8 cannot be used:
UTF-8 stands for Unicode (or UCS) Transformation Format, 8-bit encoding form. It is a transmission format for Unicode that uses 8-bit code units.
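The variable width of UTF-8 is easy to observe; in this sketch the characters are arbitrary examples of 1-, 2-, 3-, and 4-byte encodings:

```java
public class Utf8Widths {
    public static void main(String[] args) throws Exception {
        // UTF-8 uses 1 to 4 bytes per character, depending on the code point.
        System.out.println("A".getBytes("UTF-8").length);      // 1: ASCII letter
        System.out.println("\u00e9".getBytes("UTF-8").length); // 2: é (U+00E9)
        System.out.println("\u3053".getBytes("UTF-8").length); // 3: Japanese こ (U+3053)
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.getBytes("UTF-8").length);     // 4: supplementary character
    }
}
```

Note that ASCII text is unchanged under UTF-8, which is one reason for its wide interoperability.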
The default encoding is selected by the JRE based on the host operating system and its locale. For example, in the US locale on Windows, windows-1252 is used. In the Simplified Chinese locale on Solaris, GB2312, GBK, GB18030, or UTF-8 can be the default encoding, depending on the selection made when logging into Solaris.
The default encoding is significant because the JRE commonly exchanges text with the host operating system in the default encoding. The default encoding has to match the encoding used by the host operating system to ensure correct interaction.
An application can determine the default encoding by calling the Charset.defaultCharset method, available since J2SE 5.0. In older versions of the Java platform, you can use the expression
(new OutputStreamWriter(new ByteArrayOutputStream())).getEncoding()
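Both approaches can be combined in one sketch (the class name is hypothetical). Note that the two calls may report different names for the same encoding, since getEncoding returns the historical name:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class DefaultEncoding {
    public static void main(String[] args) {
        // J2SE 5.0 and later: the canonical charset name, e.g. "UTF-8".
        System.out.println(Charset.defaultCharset().name());
        // Works on older releases too, but may print a historical alias
        // such as "UTF8" rather than "UTF-8".
        System.out.println(new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
    }
}
```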
There are many character encodings that don't support all European characters (such as "ß" or "é"), but we get this question particularly often from users of the Solaris C locale. On Solaris and Linux, the JRE determines the default encoding by calling the nl_langinfo function. This function returns "646" when run in the C locale, indicating US-ASCII as the default encoding. US-ASCII only includes half the characters of ISO-8859-1, so many commonly used European characters are missing.
An easy workaround is to use the Solaris en_US locale, which uses ISO-8859-1 as its character encoding. You can set the Solaris locale from the login screen or by setting the LC_ALL environment variable. Another solution is to explicitly specify the desired character encoding in all interfaces that you use to perform encoding conversion.
No. The windows-1252 encoding contains some additional characters in the range from 0x80 to 0x9F. See the Microsoft documentation for more information.
The java.nio.charset.spi.CharsetProvider class, available since J2SE 1.4, lets developers create their own character converters.
The input method framework enables all text editing components to receive Japanese, Chinese, or Korean text input through input methods. An input method lets users enter thousands of different characters using keyboards with far fewer keys. Typically a sequence of several characters needs to be typed and then converted to create one or more characters. For specifications and examples see the web page, Input Method Framework.
A user may have multiple input methods available. For example, the user may have input methods for different languages or input methods that accept various types of input. Such a user must be able to select the input method used for a particular language or the input method that provides the fastest input.
An application can request an input method that supports a specific locale using the InputContext.selectInputMethod method, but it cannot select a specific input method - that selection is up to the user.
An application can activate an input method using the
An application using lightweight components can select fonts in four different ways:
An application using peered AWT components can only use logical font names.
Here's a brief summary:
The answer depends on how your application selects fonts - see above.
The font configuration files are used in Sun's JREs since 5.0 to map logical font names to physical fonts; earlier versions used font.properties files. There are several files to support different mappings depending on the host operating system version. The files are located in the lib directory within the JRE installation.
Note that font configuration and font.properties files are implementation dependent. Not all implementations of the Java platform use them, and the format and content vary between different runtime environments as well as between releases.
Since the mapping from logical fonts to physical fonts is implementation dependent, the answer varies. For Sun's JRE 5.0, the easiest way is to install the font into the JRE's lib/fonts/fallback directory - it will be automatically added as a fallback font to all logical fonts for 2D rendering. For AWT, you may need to modify a font configuration file - see the web page Font Configuration Files. For earlier versions of Sun's JRE, you need to edit font.properties files - see the Font.properties Files documents for J2SE 1.4.2 and J2SE 1.3.1. Note however that editing these files is a modification of the JRE, and Sun does not support modified JREs. For other implementations, see their respective documentation.
Swing user interface components use a different mechanism to render text than peered AWT components. The Swing components use the Graphics.drawString method, typically specifying a logical font name. The logical font name is then mapped to a set of physical fonts to cover a large range of characters. Peered AWT components, on the other hand, are implemented using host operating system components. These host operating system components often do not support Unicode, so the text gets converted to some other character encoding, depending on the host operating system and locale. These encodings often cover a smaller range of characters than the physical fonts used to implement logical font names. For example, on a Japanese Windows 98 system, many European accented characters are mapped to the Arial font for Swing components, but get lost when converting the text to the Shift_JIS encoding for peered AWT components.
The components of the XAWT toolkit, available since J2SE 5.0 on Solaris and Linux, use the Graphics.drawString method. Their text rendering behavior is therefore similar to that of Swing components.
As in the Chinese/Japanese/Korean case above, this may be because text is not rendered using the Unicode font at all or only for some characters. If your application selects the Unicode font using its physical font name, and it still cannot render all characters, it could be that the Unicode font doesn't in fact cover the entire Unicode character set - sometimes a font is called a Unicode font if it just provides the tables that support the Unicode character encoding.
The short answer is yes. The long answer needs to look at which languages you want to display at the same time, and how your application selects fonts.
Among the South and South-East Asian scripts, Sun's JREs have supported Thai and Devanagari since version 1.4. For a complete list of all supported writing systems, see the Supported Locales documents for Java SE 6, J2SE 5.0, J2SE 1.4.2, and J2SE 1.3.1. Support for other writing systems may be added in future releases.
The component orientation property is respected only by Swing components and layout managers, not by peered AWT components. It is independent of the host operating system. The following classes support component orientation in Java SE 6, J2SE 5.0 and J2SE 1.4.2:
In J2SE 1.3.1, the following classes shown in the table above do not support component orientation: java.awt.GridBagLayout, javax.swing.BoxLayout, javax.swing.JColorChooser, javax.swing.JComboBox, javax.swing.JFileChooser, and javax.swing.JOptionPane.
In some cases web applications set the response character encoding (which corresponds to the charset value of the content type), but the web page sent to the browser is actually encoded in a different encoding. This problem can occur when using a Servlet 2.3 based container together with the JavaServer Pages Standard Tag Library. The sequence of events is:
This problem was solved in the Servlet 2.4 specification by distinguishing between explicit and implicit character encoding specifications. Setting the character encoding through the content type or via the new ServletResponse.setCharacterEncoding method is an explicit specification, while determining it from a locale setting is an implicit specification. Implicit specifications cannot override explicit specifications, so event 4) above does not occur.
If your application needs to be compatible with containers based on older specifications, you must freeze the character encoding by calling ServletResponse.flushBuffer between the explicit character encoding specification and the first use of custom actions that might implicitly determine the character encoding.
The Internationalization and Localization chapter of Designing Enterprise Applications with the J2EE Platform and the article Developing Multilingual Web Applications Using JavaServer Pages Technology discuss many of the issues that you need to address.