The IP Multimedia Subsystem (IMS) is the 3rd Generation Partnership Project's (3GPP) vision for a converged telecommunications architecture that merges cellular and Internet technologies to uniformly deliver voice, video, and data on a single network. Currently one of the hottest topics in telecom, IMS is rapidly becoming the architecture of choice for operators who wish to upgrade their existing cellular and fixed-line networks.
In this article, we introduce the key points of the IMS architecture and present a complete application, based on Java SIP servlets and VoiceXML, that is consistent with this architecture. The application is a "Personal Assistant": a service that answers calls on behalf of its owner and then attempts to find the owner by placing a series of calls to their work, home, and cell phones. Once found, the owner can choose whether to accept the call and be connected to the caller.
Full source code is made available for the application and links are provided to download evaluation versions of the BEA WebLogic SIP Server and Voxpilot Media Resource Function to allow the application to be run. After reading this article and studying the code, you should be in a position to create your own applications that can easily be extended to incorporate more advanced features such as speech recognition, speech synthesis, and video interactivity by simply leveraging features readily available in VoiceXML.
We'll begin by reviewing the IMS architecture and VoiceXML, and then look at how to create interactive applications for IMS using VoiceXML and SIP. The rest of this article will describe the sample application.
IMS is a standardized architecture that employs Voice- and Video-over-IP technology based on a 3GPP profile of SIP, and runs over the standard packet-based IP network. Figure 1 presents a simplified view of the IMS architecture; we have divided the network into three parts: the service layer, the control layer, and the access network.
Figure 1. Simplified view of the IMS architecture
IMS applications are hosted in the service layer. This layer consists of SIP Application Servers (AS), such as the WebLogic SIP Server, which execute IMS applications and services by manipulating SIP signaling and interfacing with other systems. The AS may also include HTTP capabilities, allowing it to double as a content server for resources such as media files and VoiceXML application scripts. Typically, the AS offers a programming language and framework for creating new services, for example Java SIP and HTTP servlets. See An Introduction to SIP, Part 2: SIP Servlets (Dev2Dev, February 2006) for examples.
The control layer of the IMS network consists of nodes responsible for call establishment, management, and release. At the heart of the control layer is a specialized SIP server called the Call Session Control Function (CSCF); all SIP signaling traverses this essential node. The CSCF inspects each SIP message and determines whether the signaling should visit one or more application servers en route to its final destination. The CSCF interacts with the Home Subscriber Server (HSS), which provides a central repository of user-related information. When the CSCF or AS requires media capabilities, it routes the signaling to the Media Resource Function (MRF), such as the Voxpilot MRF, which provides centralized media processing capabilities (described in the next section). Finally, the Media Gateway (MGW), together with its controlling node, the Media Gateway Control Function (MGCF), connects to circuit-switched networks.
The access network consists of IP routers, which provide access to the IMS network from contemporary IP telephony devices, and legacy PSTN switches, which provide access from older circuit-switched devices. IP devices compatible with IMS incorporate a SIP user agent that is used to place voice or video calls toward the network.
VoiceXML is a markup-based, declarative programming language standardized by the W3C for creating speech-based telephony applications. VoiceXML supports dialogs featuring synthesized speech, digitized audio, recognition of spoken and DTMF key input, audio recording, basic telephony call control, and mixed-initiative conversations. More recently, VoiceXML has proven to be an ideal language for creating interactive video services, since video can be regarded simply as an additional input/output channel for VoiceXML. Future versions of VoiceXML will include new features specifically for video services.
VoiceXML enables Web-based development and content delivery paradigms to be used with interactive voice and video response applications. A Web or application server hosts VoiceXML documents that define the Voice User Interface (VUI) presented to a user, in much the same way as HTML documents define a Graphical User Interface (GUI). A VoiceXML interpreter fetches VoiceXML documents from the Web or application server via HTTP, interprets them, and instructs media processing resources as appropriate to render the VUI.
VoiceXML is primarily a dialog language: It is designed to manage a voice or video dialog between a human and a computer. While VoiceXML does include some basic features for call transfer (via its <transfer> element), this capability is really only appropriate when there is no SIP AS present. Rather, the AS is ideally suited to performing all call control and SIP signaling manipulation.
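To make this concrete, here is a minimal sketch of a VoiceXML 2.0 dialog that prompts the caller and collects a yes/no answer by speech or DTMF (the built-in boolean grammar accepts "yes"/"no" or the keys 1/2). The field name and submit target are illustrative, not part of the sample application:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="screen">
    <field name="accept" type="boolean">
      <prompt>
        You have an incoming call.
        Say yes or press 1 to accept; say no or press 2 to decline.
      </prompt>
      <filled>
        <!-- Return the caller's choice to the application server via HTTP -->
        <submit next="accept.jsp" namelist="accept"/>
      </filled>
    </field>
  </form>
</vxml>
```

The interpreter renders the prompt, listens for input, and posts the result back to the server, which responds with the next VoiceXML document, mirroring the request/response cycle of a Web application.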
The IMS architecture consists of many nodes that are required for delivering core telecommunication functions such as mobility, billing, interworking, and quality of service. Delivering interactive services, however, principally requires the interplay of just three nodes, namely a user agent (UA), the AS, and the MRF. Figure 2 illustrates a canonical architecture derived from IMS for delivering SIP-based interactive services.
Figure 2. Canonical architecture for delivering SIP-based interactive services
The UA is either a SIP phone or a gateway to a traditional cellular or fixed-line phone. SIP signaling is routed independently of the media; the media flows directly between the UA and the MRF over the Real-Time Transport Protocol (RTP). The AS is responsible for call control (that is, SIP signaling manipulation) and application management functions. We consider the AS to be a converged container, with the ability to support the HTTP protocol. The AS hosts VoiceXML documents that describe portions of the user interaction by voice and video. When the service logic running on the AS requires interaction with the user, it forwards SIP signaling to the MRF and supplies it with an HTTP URI identifying the VoiceXML application to run. This URI typically points back to a VoiceXML application on the AS, which the MRF subsequently fetches via HTTP. The MRF executes the VoiceXML document, which drives the interaction with the user, and continues to request new VoiceXML documents from the AS until the application ends or the AS terminates it. The MRF may send data back to the AS via HTTP or via SIP (by encoding it in the SIP BYE request, for example).
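How the HTTP URI is conveyed to the MRF varies by product; one widely used convention, also adopted in the IETF's SIP/VoiceXML work, is to carry it as a voicexml parameter on the MRF's SIP Request-URI, percent-encoding the URL so the SIP URI stays well-formed. A minimal plain-Java sketch of this convention, with illustrative host names:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class MrfUriBuilder {

    /**
     * Build a Request-URI that directs the MRF to fetch and run a
     * VoiceXML document, using the "voicexml" URI-parameter convention.
     * Reserved characters in the HTTP URL are percent-encoded so they
     * do not clash with SIP URI syntax.
     */
    public static String buildMrfUri(String mrfHost, String vxmlUrl)
            throws UnsupportedEncodingException {
        return "sip:dialog@" + mrfHost
             + ";voicexml=" + URLEncoder.encode(vxmlUrl, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Illustrative hosts; a real deployment supplies its own.
        System.out.println(buildMrfUri("mrf.example.com",
                "http://as.example.com/pa/start.vxml"));
        // -> sip:dialog@mrf.example.com;voicexml=http%3A%2F%2Fas.example.com%2Fpa%2Fstart.vxml
    }
}
```

On an AS, this string would become the Request-URI of the INVITE forwarded to the MRF; the MRF then dials out over HTTP to fetch and interpret the referenced document.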
It should be noted that the interface to the MRF (known as the "Mr" interface) is not yet fully defined by 3GPP, beyond the fact that it is based on SIP. While the IMS specifications are rich in detail about how to create call control applications via the AS, they are largely silent on how to program media interactivity. In the meantime, VoiceXML has emerged as the industry standard for programming interactive telephony applications and is the natural choice for IMS. A new IETF Internet-Draft, "SIP Interface to VoiceXML Media Services" (see References), describes how SIP and VoiceXML work together. Finally, note that the general architecture presented in Figure 2 is equally applicable outside IMS networks and is ideal for deploying applications in pre-IMS networks, enterprises, and call centers.