Encoding of Character Data

In the classical client/server architecture, the database and the client may use different encoding schemes to represent character data. For example in a Japanese application, the database might use (a variety of) EUC character set and the client might use (a variety of) SJIS character set. Thus character encoding conversion is required. The solution is well known and long established: Oracle Net transparently handles the conversion (as specified by the database character set and the NLS_LANG client environment variable). A corresponding issue exists for Utl_Http. When a request is sent it might need to be encoded differently than the database character set (because the sender knows that the target URL requires this). And when a response is received, it may again be encoded differently from the database character set (because that's the non-negotiable behavior of the target URL).

There are two areas of concern when sending a request: the URL and the request body. When sending by the "GET" method, all request parameterization is via the URL itself, typically after the ? delimiter. Search terms for example are normally handled this way. HTTP defines no convention for specifying different character sets for the URL and expects that everything is 7-bit ASCII. Other character encoding schemes should be represented as the hex codes of their bytes using the %nn notation. (The sender of the request must know from documentation what character set the URL expects to decode from the hex representation.) Oracle9i introduces the Utl_Url package which has functions to convert from the database character set to a hex coded representation of a specified character set, and vice versa. In addition, these functions handle the conversion of the reserved symbols: percent (%), semi-colon (;) slash (/), question mark (?), colon (:), at sign (@), ampersand (&), equals sign (=), plus sign (+), dollar sign ($), and comma (,).

When sending by the "PUT" method, the character set of the request body should be set via the charset attribute of the Content-Type in the request header, using the new Utl_Http.Set_Header procedure. If this is done, it gives Oracle sufficient information to transform appropriately when sending a character request body (by using Utl_Http.Write_Text). If the charset attribute is not set in the request header, then no character set conversion takes place unless the user has catered for it via the overloaded procedure Utl_Http.Set_Body_Charset. The variant Set_Body_Charset(charset varchar2) - a.k.a. the global variant - allows the user to set a fallback character set, to be assumed, if no other information is provided, for both requests and responses for the session. The variant Set_Body_Charset(r Utl_Http.Req, charset varchar2) - a.k.a the request variant, allows the user to insist on a character set for the body for this request. (A record of PL/SQL type Utl_Http.Req is returned when the HTTP request is begun with Utl_Http.Begin_Request.) The choice made via the request variant will not only override that made via the global variant but will also override that made via the charset attribute of the request header. For this reason, the recommended way to specify the character set conversion for the request body is via the charset attribute of the header. Only if the user has a special reason for leaving this unspecified in the request header would he use the request variant of Set_Body_Charset.

There is just one area of concern when receiving the response: the response body. If the implementation of the URL is well-mannered, then the character set of the response body will be specified correctly in the charset attribute of the Content-Type in the response header, accessible to the user via the procedure Utl_Http.Get_Header. Oracle will implicitly perform the appropriate conversion in connection with calling Utl_Http.Read_Text. However, this is often not set. In this case the user can use the global variant of Set_Body_Charset to determine the character set conversion. However, the charset attribute of the response header is sometimes set wrong. (This is likely when pages in different character sets are served up as files from the filesystem seen by the webserver, since the Content-Type header information will often be set globally for the server with no mechanism to make it file specific.) For this reason a third overloaded variant Set_Body_Charset(r Utl_Http.Resp,charset varchar2) is provided - a.k.a. the response variant. (A record of PL/SQL type Utl_Http.Resp is returned when the HTTP response is got with Utl_Http.Get_Response.) The choice made via the response variant will override that made via the global variant and that expressed via the charset attribute of the response header.

Note: from Oracle8i v8.1.6 and pre-Oracle9i, Oracle detected the charset of the response body (if this was specified) and used the information to do the character set conversion. And if the charset attribute of the response body was not specified then no conversion took place and no overriding or fallback mechanism was provided. Under special circumstances (eg fetching a SJIS Japanese response where the charset attribute is not specified into a EUC database) problems arose pre-Orcale9i.

Thus the user now has full control over all character set conversion issues. In an extreme case, where the response body is Content-Type text/html and where the HTML <meta> tag is used to specify the character set, the user can retrieve the response body into a PL/SQL RAW with Utl_Http.Read_Raw and then write custom code to parse the HTML and to convert to the database character set in a PL/SQL VARCHAR2 once the response character set is discovered.