In the classical client/server architecture, the database and the client
may use different encoding schemes to represent character data. For example
in a Japanese application, the database might use (a variety of) EUC character
set and the client might use (a variety of) SJIS character set. Thus character
encoding conversion is required. The solution is well known and long established:
Oracle Net transparently handles the conversion (as specified by the database
character set and the NLS_LANG client environment variable). A corresponding
issue exists for Utl_Http. When a request is sent it might need to
be encoded differently than the database character set (because the
sender knows that the target URL requires this). And when a response is received,
it may again be encoded differently from the database character set (because
that's the non-negotiable behavior of the target URL).
There are two areas of concern when sending a request: the URL and the request
body. When sending by the "GET" method, all request parameterization
is via the URL itself, typically after the ? delimiter. Search terms
for example are normally handled this way. HTTP defines no convention for
specifying different character sets for the URL and expects that everything
is 7-bit ASCII. Other character encoding schemes should be represented as
the hex codes of their bytes using the %nn notation. (The sender of
the request must know from documentation what character set the URL expects
to decode from the hex representation.) Oracle9i introduces the Utl_Url
package which has functions to convert from the database character set to
a hex coded representation of a specified character set, and vice versa. In
addition, these functions handle the conversion of the reserved symbols: percent
(%), semi-colon (;) slash (/), question mark (?), colon (:), at sign (@),
ampersand (&), equals sign (=), plus sign (+), dollar sign ($), and comma
(,).
When sending by the "PUT" method, the character set of the request
body should be set via the charset attribute of the Content-Type in
the request header, using the new Utl_Http.Set_Header procedure. If
this is done, it gives Oracle sufficient information to transform appropriately
when sending a character request body (by using Utl_Http.Write_Text).
If the charset attribute is not set in the request header, then no
character set conversion takes place unless the user has catered for it via
the overloaded procedure Utl_Http.Set_Body_Charset. The variant Set_Body_Charset(charset
varchar2) - a.k.a. the global variant - allows the user to set
a fallback character set, to be assumed, if no other information is provided,
for both requests and responses for the session. The variant Set_Body_Charset(r
Utl_Http.Req, charset varchar2) - a.k.a the request variant, allows
the user to insist on a character set for the body for this request. (A record
of PL/SQL type Utl_Http.Req is returned when the HTTP request is begun
with Utl_Http.Begin_Request.) The choice made via the request variant
will not only override that made via the global variant but will also override
that made via the charset attribute of the request header. For this
reason, the recommended way to specify the character set conversion for the
request body is via the charset attribute of the header. Only if the
user has a special reason for leaving this unspecified in the request header
would he use the request variant of Set_Body_Charset.
There is just one area of concern when receiving the response: the response
body. If the implementation of the URL is well-mannered, then the character
set of the response body will be specified correctly in the charset attribute
of the Content-Type in the response header, accessible to the user via the
procedure Utl_Http.Get_Header. Oracle will implicitly perform the appropriate
conversion in connection with calling Utl_Http.Read_Text. However,
this is often not set. In this case the user can use the global variant of
Set_Body_Charset to determine the character set conversion. However,
the charset attribute of the response header is sometimes set wrong.
(This is likely when pages in different character sets are served up as files
from the filesystem seen by the webserver, since the Content-Type header information
will often be set globally for the server with no mechanism to make it file
specific.) For this reason a third overloaded variant Set_Body_Charset(r
Utl_Http.Resp,charset varchar2) is provided - a.k.a. the response variant.
(A record of PL/SQL type Utl_Http.Resp is returned when the HTTP response
is got with Utl_Http.Get_Response.) The choice made via the response
variant will override that made via the global variant and that expressed
via the charset attribute of the response header.
Note: from Oracle8i v8.1.6 and pre-Oracle9i, Oracle detected
the charset of the response body (if this was specified) and used the
information to do the character set conversion. And if the charset
attribute of the response body was not specified then no conversion took place
and no overriding or fallback mechanism was provided. Under special circumstances
(eg fetching a SJIS Japanese response where the charset attribute is
not specified into a EUC database) problems arose pre-Orcale9i.
Thus the user now has full control over all character set conversion issues.
In an extreme case, where the response body is Content-Type text/html and
where the HTML <meta> tag is used to specify the character set,
the user can retrieve the response body into a PL/SQL RAW with Utl_Http.Read_Raw
and then write custom code to parse the HTML and to convert to the database
character set in a PL/SQL VARCHAR2 once the response character set is discovered.