The design goals for the Java Speech API
Using speech technology in Java applications and applets
Technical overview of the Java Speech API
The Java Speech API in the Java Platform
The Java Speech API roadmap
Speech technology, once limited to the realm of science fiction, is now available for use in real applications in personal and enterprise computing. The Java™ Speech API, being developed by Sun Microsystems in cooperation with many other companies, defines a software interface that allows developers to take advantage of speech technology for personal and enterprise computing. By leveraging the inherent strengths of the Java Platform, the Java Speech API enables developers to incorporate more sophisticated and natural user interfaces into Java-based applications and applets that can be deployed on a wide range of platforms. The Java Speech API is part of the Java Media APIs, a suite of class libraries that provide capabilities in areas such as audio/video playback, capture and conferencing, 2D and 3D graphics, animation, computer telephony, collaboration, and advanced imaging. In combination with the other Java Media APIs, the Java Speech API will help developers enrich Java applications and applets with the media and communications capabilities that today's users expect, and enhance business communications and the flow of information.
The Java Speech API will provide a standard, easy to use cross-platform interface to state-of-the-art speech technology for both desktop and telephony environments.
Two core speech technologies will be supported through the Java Speech API: speech recognition and speech synthesis. Speech recognition provides computers with the ability to listen to spoken language and to determine what has been said. In other words, it processes audio input containing speech by converting it to text. Speech synthesis provides the reverse process of producing synthetic speech from text generated by an application, an applet or a user. It is often referred to as text-to-speech technology.
Enterprises will benefit from a wide range of applications of speech technology using the Java Speech API. For instance, interactive voice response systems are an attractive alternative to touch-tone interfaces; dictation systems offer a substantial improvement over typed input; and speech technology can improve computer accessibility for many people with physical limitations.
Speech interfaces will give Java developers the opportunity to implement distinct and engaging personalities for their applications and to differentiate their products. Java developers will be able to access the capabilities of state-of-the-art speech technology from leading speech vendors. With a standard API for speech, users will be able to choose the speech products which best meet their needs and their budget.
The Java Speech API will leverage the audio capabilities of other Java Media APIs, and when combined with the Java Telephony API, will support advanced computer telephony integration. On desktop systems, the widespread availability of audio input/output capabilities, the increasing power of CPUs and the growing availability of telephony devices all enable the use of speech technology.
Along with the other Java Media APIs, the Java Speech API is designed to enable developers to incorporate advanced user interfaces into Java applications. The following are the design goals for the Java Speech API.
|The Java Speech API will support speech synthesis, command-and-control recognizers and dictation systems.|
|The Java Speech API will be simple and compact.|
|"Write Once, Run Anywhere™" access to speech synthesis and speech recognition will be consistent across the major Java platforms and across products from different speech technology companies.|
|Existing speech technology should be accessible through the Java Speech API using bridging software provided by Sun, licensees of Java and others.|
|The Java Speech API will complement other Java features, including the Java Media APIs.|
Speech technology is becoming increasingly important in both personal and enterprise computing as it is used to improve existing user interfaces and to support new means of human interaction with computers. Speech technology can allow hands-free use of computers and support access to many computing facilities away from the desk and over the telephone. Speech recognition and speech synthesis can improve accessibility for disabled users and reduce the risk of repetitive strain injury and other problems caused by current interfaces.
The Java Speech API will open up these opportunities to developers of Java applications and applets, but with the added advantages of developing in the Java environment. Java enables "Write Once, Run Anywhere" applications in a robust, object-oriented, secure, network-aware, multi-threaded language with advanced development tools.
Speech technology is already being used by many enterprises to handle customer calls to provide access to information and resources. Speech recognition is being used to provide a more natural and efficient interface than touch-tone systems for facilities such as voice mail, menu selections and entering numbers. Systems are available for telephone access to email, calendars and other computing facilities that have previously been available only on the desktop or with special equipment. Such systems allow convenient computer access by telephone in hotels, airports and airplanes.
Prototype telephone systems are providing a conversational style of interacting with these systems. For example: "Do I have any email?" "Yes, you have 7 messages including 2 high priority messages from the production manager." "Please read me the mail from the production manager." "Email arrived at 12:30pm...".
Speech technology can augment traditional graphical user interfaces. It can be used to provide audible prompts with spoken "Yes/No/OK" responses that do not distract the user's focus. Complex, structured commands can provide rapid access to features that are traditionally hidden in sub-menus. For example, speaking "Use 12-point, bold, Helvetica font" replaces multiple menu selections and mouse clicks. Drawing and CAD applications can be enhanced by using speech commands in combination with mouse and keyboard actions to improve the speed at which users can manipulate objects. For example, while dragging an object, a speech command could be used to change its color and line type all without moving the pointer to the menu-bar or a tool palette.
Speech dictation systems are widely available and are already popular with users who have restricted typing ability. For experienced users, dictation systems can provide typing rates exceeding 100 words per minute and word accuracy over 95%.
Speech synthesis can enhance applications in many ways. Speech synthesis of text in a word processor is a reliable aid to proof-reading, as many users find it easier to detect grammatical and stylistic problems when listening than when reading. Speech synthesis can provide background notification of events and status changes, such as printer activity, without requiring a user to lose their current context. Finally, applications that currently produce speech output from pre-recorded messages can be enhanced by using speech synthesis to reduce storage space by a factor of up to 1000 and to remove the restriction that output sentences be defined in advance.
Other Speech Applications
Hands Free Computer Interfaces
In many situations where keyboard input is impractical and visual displays are restricted or unavailable, speech provides the only way of interacting with a computer. For example, surgeons and other medical staff can use speech dictation to enter reports when their hands are busy and where touching a keyboard represents a hygiene risk. In vehicle and airline maintenance, warehousing and many other hands-busy tasks, speech interfaces can provide practical data input and output and can enable computer-based training.
Small Computing Devices
Speech technology is being adopted in a wide range of small-scale and embedded computing devices to enhance their usability by recognizing simple commands and producing synthesized speech. Such devices include Personal Digital Assistants (PDAs), telephone handsets, toys and consumer product controllers. In some devices the speech processing is implemented in dedicated hardware, while in others digital signal processor (DSP) capabilities are used. The Java Platform is being widely adopted in these same computing devices because of its compactness, robustness and portability, capabilities necessary to support constrained speech recognition and speech synthesis in this class of devices.
Speech and the Internet
The Java Speech API will allow applets transmitted over the Internet or intranets to access speech capabilities on the user's machine. This will provide the ability to enhance World Wide Web sites with speech and support new ways of browsing. Speech recognition can be used to control browsers, fill out forms, control Java applets and enhance the WWW/Internet experience in many other ways. Speech synthesis can be used to bring web pages alive, inform users of the progress of applets, and dramatically improve browsing time by reducing the amount of audio sent across the Internet.
Limitations of Speech Technology
Despite very substantial improvements in speech technology in the last 40 years, speech synthesis and speech recognition technologies still have many limitations, and often do not meet the high expectations of users familiar with natural speech communication. It is therefore important for developers to consider which applications are suited to speech technology and how to develop those applications within the constraints of the technology.
The two major issues to consider when using speech synthesis are understandability and naturalness. Understandability, the ability of a listener to correctly hear the spoken words, may be reduced by a number of factors. The synthesizer may incorrectly "guess" the pronunciation of words, or produce a word unintelligibly, or it may incorrectly speak textual forms such as punctuation, dates, times and abbreviations. In some contexts, users are insensitive to mispronunciation because their knowledge of the domain allows them to infer what was said. Many applications, however, will need to use the markup language being developed for the Java Speech API that supports explicit specification of pronunciations for words and phrases.
Naturalness is a more complex area. Some speech synthesis voices sound mechanical, and thus unnatural. More subtle factors that influence naturalness include phrasing and pausing, the placement of emphasis, and intonation. The markup language for the Java Speech API will allow developers to control many of these factors to produce speech that has improved naturalness and that is not tedious to the listener.
Speech recognition technology has not yet reached a point where systems can accurately transcribe free-form speech input. The two major issues to consider with speech recognition are how to constrain what may be said by a user to a specific grammar, and the possibility of misrecognition of what a user says.
Standard methods of constraining the speech recognition task have evolved to improve performance and reliability. The Java Speech API will support the two most widely available types of recognition system: recognizers constrained by rule grammars, and dictation systems. These approaches trade off between computing requirements, complexity of language and robustness. Even using grammars to constrain speech input, accuracy is still not 100%. Factors that influence the recognition accuracy include the complexity of the recognition task, the amount of noise in the background, the type of microphone, and the accent, style and voice quality of the speaker.
Because speech recognizers can misrecognize incoming speech, developers need to incorporate error-handling capabilities into their applications. Dictation systems provide correction mechanisms that allow users to select from the recognizer's alternative guesses of what was said. In command-and-control applications, users must be able to undo the results of an unwanted action and may be required to confirm important commands, for example, "delete all files" or "quit."
Speech is not an ideal interface for all tasks. In many instances, other interactive techniques are more effective and natural than speech recognition and synthesis. For example, trying to control mouse movements by speech commands is far less efficient than using pointing devices. Likewise, reading text on a screen (by vision) is much faster than listening to speech synthesis and allows users to scan documents much more rapidly.
User Interface Design
Natural human communication is a complex process and designing reliable, intuitive speech interfaces can require considerable effort. Part of the reason is that speech interfaces must be designed in a different manner than graphical user interfaces (GUIs). The following are examples:
|Long lists of items (such as email headers or menu options) are conveniently displayed in a graphical interface where the user can scan the list, but are tedious when read in sequence by a speech synthesizer.|
|The terms used in the menus and options of a GUI often differ from spoken vocabulary. For example, rather than the standard GUI "delete", users may say "remove", "rub out" or "erase."|
|GUI-based calendar applications use absolute dates (e.g. 10/12/97). However, in a speech interface, users like to refer to relative days, for example, "next Tuesday" and "the day after Christmas".|
|Menus in a GUI make the functionality of an application visible to a user. By contrast, in a speech application, there is not always a clear way to inform a user of what they may or may not say.|
|A GUI constrains the actions of a user so that applications can unambiguously respond to mouse and keyboard input. However, in speech applications, misrecognitions of commands occur. Such errors are difficult to detect, can be difficult to repair, and can be confusing to users.|
Because well-designed speech interfaces can differ in so many ways from graphical user interfaces, automatic speech-enabling of existing applications does not usually produce an intuitive or robust speech interface. Thus, automatic speech-enabling of Java applications and applets is not a direct goal for the Java Speech API. It is anticipated, however, that libraries will become available that will simplify the process of speech-enabling applications and that the Java Accessibility initiative, in particular, will use the Java Speech API to provide basic speech input and output for many Java applications.
The Java Speech API is being developed by Sun Microsystems with contributions from six other companies, all leaders in speech technology. While the API will not be publicly available until later in 1997, this section provides a brief technical overview of the API's coverage and capabilities for the benefit of Java application developers and speech technology companies.
The API can be divided into three relatively self-contained areas: resource management, speech synthesis and speech recognition. Resource management provides audio connectivity and supports the selection and management of speech recognition and speech synthesis engines. The following sections introduce the speech synthesis and speech recognition interfaces.
Speech synthesizers are defined and selected by characteristics such as language, gender, and age. The core API specifications for speech synthesis in the Java Speech API handle the control of speech output, notifications of speech output, vocabulary management and the format of text to be spoken.
There are a number of controls for speech output by Java speech synthesizers. Applications provide a speech synthesizer with a text string to be spoken. The output of that text may be paused, resumed or canceled. The pitch, speaking rate and volume for speech output can be controlled through calls to the synthesizer's interface.
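The control surface described above can be sketched as a small Java interface. Because the Java Speech API has not yet been published, every name below (Synthesizer, speak, pause and so on) is an assumption for illustration only; the toy implementation merely records state rather than producing audio.

```java
// Illustrative sketch only: the Java Speech API is not yet published, so the
// interface and method names below are assumptions, not the final API.
public class SynthesizerSketch {

    /** Hypothetical subset of a speech synthesizer's control surface. */
    interface Synthesizer {
        void speak(String text);        // queue a text string for speech output
        void pause();                   // pause audio output
        void resume();                  // resume paused output
        void cancel();                  // discard the current utterance
        void setPitch(float hertz);     // baseline pitch
        void setSpeakingRate(float wordsPerMinute);
        void setVolume(float level);    // 0.0 (silent) .. 1.0 (full)
    }

    /** Toy implementation that only tracks state, for demonstration. */
    static class LoggingSynthesizer implements Synthesizer {
        String state = "IDLE";
        float pitch = 110f, rate = 150f, volume = 1.0f;
        public void speak(String text) { state = "SPEAKING"; }
        public void pause()            { state = "PAUSED"; }
        public void resume()           { state = "SPEAKING"; }
        public void cancel()           { state = "IDLE"; }
        public void setPitch(float hertz) { pitch = hertz; }
        public void setSpeakingRate(float wordsPerMinute) { rate = wordsPerMinute; }
        public void setVolume(float level) { volume = level; }
    }

    public static void main(String[] args) {
        LoggingSynthesizer synth = new LoggingSynthesizer();
        synth.setSpeakingRate(180f);
        synth.speak("Welcome to the Java Speech API.");
        synth.pause();
        synth.resume();
        System.out.println(synth.state); // prints SPEAKING
    }
}
```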
As text is spoken, the synthesizer can be requested to send notifications of major events to the application. The events include the start and end of audio output for the text, start of words in the text, output of markers embedded in the input text and possibly the output of individual phonemes. These events may be used by applications for a broad range of purposes including coordination and synchronization between multiple synthesizers and with other media, facial animation and the highlighting of displayed text as words are spoken.
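The notification model can be sketched with a listener interface. The listener and event names below are assumptions (the API is not yet published); the dispatcher simply walks the text word by word to simulate the events a real synthesizer would generate as audio is produced, which is the hook an application would use for text highlighting or animation.

```java
// Sketch of synthesizer event notification; names are assumptions since the
// Java Speech API is not yet published.
import java.util.ArrayList;
import java.util.List;

public class SpeechEventsSketch {

    /** Hypothetical listener for speech output events. */
    interface SpeakableListener {
        void speechStarted();
        void wordStarted(String word, int charIndex); // e.g. highlight this word
        void speechEnded();
    }

    /**
     * Toy dispatcher: walks the text word by word and fires the events a real
     * synthesizer would generate while producing audio.
     */
    static void dispatch(String text, SpeakableListener listener) {
        listener.speechStarted();
        int index = 0;
        for (String word : text.split("\\s+")) {
            listener.wordStarted(word, text.indexOf(word, index));
            index += word.length();
        }
        listener.speechEnded();
    }

    public static void main(String[] args) {
        final List<String> log = new ArrayList<>();
        dispatch("hello speech world", new SpeakableListener() {
            public void speechStarted() { log.add("start"); }
            public void wordStarted(String w, int i) { log.add("word:" + w); }
            public void speechEnded() { log.add("end"); }
        });
        System.out.println(log); // [start, word:hello, word:speech, word:world, end]
    }
}
```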
Applications may provide speech synthesizers with vocabulary information. The vocabulary is typically used to define the pronunciation and grammatical information for words that may be incorrectly pronounced by a synthesizer.
The Java Synthesis Markup Language (JSML), an SGML-based markup language, is being specified for formatting text input to speech synthesizers. JSML allows applications to control important characteristics of the speech produced by a synthesizer. Pronunciations can be specified for words, phrases, acronyms and abbreviations to ensure understandability. Explicit control of pauses, boundaries and emphasis can be provided to improve naturalness. Explicit control of pitch, speaking rate and loudness at the word and phrase level can be used to improve naturalness and understandability.
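As a rough illustration of what such markup might look like, the fragment below marks emphasis on one word, a spoken form for a time, and a slower speaking rate for a sentence. JSML is still being specified, so the element and attribute names shown here are assumptions, not the final markup language.

```
<PARA>
  <SENT>Your <EMP>first</EMP> meeting is at
    <SAYAS CLASS="time">10:30</SAYAS>.</SENT>
  <SENT><PROS RATE="-20%">Please confirm before noon.</PROS></SENT>
</PARA>
```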
Text is provided to a speech synthesizer as a Java String object. The Java Platform uses the Unicode character set for all strings. Unicode provides excellent multi-lingual support and also includes the full International Phonetic Alphabet (IPA), which can be used to accurately define the pronunciations of words and phrases.
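Because strings are Unicode, IPA symbols can be embedded directly in Java source using Unicode escapes. The pronunciation string below (a rendering of the word "tomato") is purely illustrative.

```java
// IPA symbols are ordinary Unicode characters, so they can appear in any
// Java String. The pronunciation below is illustrative only.
public class IpaString {
    // "tomato" written in IPA using Unicode escapes.
    public static final String TOMATO = "t\u0259\u02C8m\u0251\u02D0t\u0259\u028A";

    public static void main(String[] args) {
        System.out.println(TOMATO.length()); // 9 code units
    }
}
```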
Speech Synthesis Libraries
The Java Speech API provides developers with the opportunity to distribute speech-enabled Java applications and compliant speech synthesizers. There will also be an opportunity for the development of Java libraries that reside above the API and which simplify the development of speech-enabled applications. Such software should be 100% Pure Java for portability. Examples include:
|A library that uses JSML to mark up email, HTML and other text representations for output by a speech synthesizer.|
|A queue management library that coordinates speech output from multiple speech synthesizers.|
Speech recognizers are defined and selected by characteristics including language, the type of grammar supported, and the preferred audio quality. The core API specifications for speech recognition in the Java Speech API handle the control of speech input, notifications of recognition results and other important events, vocabulary management and the definition of recognition grammars.
The basic controls for a speech recognizer are the pausing and resuming of recognition. The most important control, however, is the management of recognition grammars. The two basic grammar types currently being specified, rule-based grammars and dictation grammars, are described below in more detail. For both grammar types, an application can load multiple grammars into a recognizer and activate and deactivate each grammar as required.
As speech input from a user is detected, the recognizer sends notifications of major events to the application. The basic events are the start and end of audio input and the notification of recognition results when the recognizer has finished processing the incoming speech. The results include information on alternative guesses of what the user said, and may provide access to a recording of the audio stream of the user's speech. Applications may provide speech recognizers with vocabulary information. The vocabulary defines the pronunciation and grammatical information for words and is used by recognizers to improve the recognition accuracy.
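The shape of a recognition result can be sketched as follows. The class and listener names are assumptions (the API is not yet published); the point is that a result carries a ranked list of guesses, which is what makes the correction mechanisms described later possible.

```java
// Sketch of recognition result delivery; names are assumptions since the
// Java Speech API is not yet published.
import java.util.Arrays;
import java.util.List;

public class ResultSketch {

    /** Hypothetical recognition result: best guess plus ranked alternatives. */
    static class Result {
        final List<String> guesses; // ordered best-first
        Result(String... guesses) { this.guesses = Arrays.asList(guesses); }
        String bestGuess() { return guesses.get(0); }
        List<String> alternatives() { return guesses.subList(1, guesses.size()); }
    }

    /** Hypothetical listener an application registers with a recognizer. */
    interface ResultListener {
        void resultAccepted(Result result);
    }

    public static void main(String[] args) {
        ResultListener listener = result ->
            System.out.println("Heard: " + result.bestGuess()
                    + " (alternatives: " + result.alternatives() + ")");
        // A real recognizer would invoke this after processing audio input.
        listener.resultAccepted(new Result("open the file", "open the fine"));
    }
}
```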
In a rule-based speech recognition system, an application provides the recognizer with rules that define what the user is expected to say. These rules constrain the recognition process. Careful design of the rules (often combined with careful user interface design) will provide rules that allow users reasonable freedom of expression while limiting the range of things that may be said so that the recognition process is as fast and accurate as possible. The following is an example of a simple recognition grammar using a tentative grammar format.
RuleGrammar SimpleCommand;

<COMMAND> = [<POLITE>] <ACTION> <OBJECT> [and <OBJECT>];
<ACTION> = open | close | delete;
<OBJECT> = the window | the file;
<POLITE> = please;
This grammar defines one public rule, COMMAND, that may be spoken by users. The rule is a combination of the sub-rules ACTION, OBJECT and POLITE. It allows a user to say commands such as "Open the window" or "Please close the window and the file."
The grammar specified above may import rules from other grammars and may itself be imported into other grammars. Complex language behavior can be produced by this process.
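A toy matcher shows how such a grammar constrains input. A real recognizer compiles rules internally; the regular-expression encoding below is only an illustration, and it treats the trailing "and <OBJECT>" clause as optional so that single-object commands such as "Open the window" are accepted.

```java
// Toy matcher for the SimpleCommand grammar, illustrating how a rule grammar
// constrains what a recognizer will accept. This regex encoding is an
// illustration only, not how a real recognizer works.
import java.util.regex.Pattern;

public class SimpleCommandMatcher {
    private static final String ACTION = "(open|close|delete)";
    private static final String OBJECT = "(the window|the file)";
    private static final Pattern COMMAND = Pattern.compile(
            "(please )?" + ACTION + " " + OBJECT + "( and " + OBJECT + ")?");

    /** Returns true if the utterance is covered by the grammar. */
    public static boolean matches(String utterance) {
        return COMMAND.matcher(utterance).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("open the window"));                       // true
        System.out.println(matches("please close the window and the file"));  // true
        System.out.println(matches("paint the door"));                        // false
    }
}
```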
Dictation systems impose fewer restrictions on what can be said and so are closer to providing the ideal of free-form speech input. The cost of this greater freedom is that they require more substantial computing resources and tend to make more errors. Furthermore, many current commercial dictation systems require users to speak with short pauses between words, known as discrete speech. In the last year, continuous dictation systems have become available for specific domains such as medical reporting and are likely to become more widely available over the next few years.

Speech dictation grammars are more complex than rule-based grammars. Dictation systems that support the Java Speech API provide built-in dictation grammars that are typically statistically trained. These grammars may be specialized for domains such as legal texts or medical reporting, or may be for general text of a language. Applications and users can augment the built-in grammar by defining lists of words and phrases for which the recognizer should listen. These lists may include specialized vocabulary for an application or user (for example, the names of application objects, colleagues or companies) or application-specific commands (for example "delete text", "use italic font").
Applications that use dictation recognition need to provide special capabilities in handling the results. First, correction mechanisms need to be implemented so that users can easily fix errors made by the recognizer. Second, in some contexts, the results need to be converted from spoken forms to written forms (e.g. "twenty dollars" to "$20"). The Java Speech API provides support for both these capabilities.
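The spoken-to-written conversion can be sketched with a toy normalizer. Real dictation systems handle many more classes of text (dates, times, ordinals and so on); the class below, with its small hand-built number table, covers only the currency pattern from the example above and is an illustration, not part of any API.

```java
// Toy spoken-form to written-form converter for dictation results, covering
// one pattern ("<number> dollars" -> "$<digits>") as an illustration.
import java.util.HashMap;
import java.util.Map;

public class SpokenFormConverter {
    private static final Map<String, Integer> NUMBERS = new HashMap<>();
    static {
        NUMBERS.put("one", 1);
        NUMBERS.put("two", 2);
        NUMBERS.put("five", 5);
        NUMBERS.put("ten", 10);
        NUMBERS.put("twenty", 20);
    }

    /** Converts e.g. "twenty dollars" to "$20"; other text passes through. */
    public static String convert(String spoken) {
        String[] words = spoken.split(" ");
        if (words.length == 2 && words[1].equals("dollars")
                && NUMBERS.containsKey(words[0])) {
            return "$" + NUMBERS.get(words[0]);
        }
        return spoken;
    }

    public static void main(String[] args) {
        System.out.println(convert("twenty dollars")); // $20
        System.out.println(convert("read my mail"));   // read my mail
    }
}
```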
Speech Recognition Libraries
The Java Speech API provides the opportunity to distribute compliant speech recognizers and Java applications that use speech recognition. As with speech synthesis, there will be a market for Java libraries that reside above the API and that simplify the development of speech-enabled applications. Such software should be 100% Pure Java™ for portability. Examples include:
|Grammar management software that simplifies activation and de-activation of individual grammars as the focus of an application changes.|
|Parsing software that analyzes recognition results and executes Java methods defined with the grammar.|
|Dialog management libraries that initiate and coordinate complex interactions using both speech synthesis and recognition.|
|Development tools that simplify the building of grammars, dialog systems and other speech components.|
Java has rapidly established itself as a platform for building portable Internet and intranet applications. The Java Speech API extends Java's strengths by allowing developers to implement more sophisticated and natural user interfaces. The Java Platform provides many advantages to developers, including developers of speech applications:
|Portable platform: the language, APIs and virtual machine are available for a wide variety of hardware platforms and operating systems and are supported by all the major WWW browsers.|
|Powerful and compact environment: The Java Platform provides developers with a powerful object-oriented language while eliminating much of the complexity and housekeeping associated with other languages and programming environments.|
|Network aware and secure: From its inception, the Java Platform has been network aware and has included security mechanisms that provide users with protection from untrusted applets.|
Integration with Other APIs
The Java Media APIs are being developed to support a wide range of integrated multi-media capabilities. The APIs most related to speech include the Java Media Framework, the Java Telephony API, and the Java Sound API. The Java Media Framework provides a unified architecture for the media APIs. The Java Telephony API integrates telephones and computers with support for a range of services including simple phone calls, teleconferencing, call transfer, caller ID and DTMF decode/encode. Java Sound will allow for cross-platform audio input and output with access to microphones, speakers and headphones as well as advanced audio capabilities such as mixing of audio channels and synchronization. Speech applications will also be able to incorporate other Java Media APIs: Java 2D, Java 3D and Java Animation for advanced graphical capabilities; Java Share for cooperation of applications and users; Java Media Framework for synchronized video, audio and MIDI.
The Java Speech API is being developed to support a wide range of speech technology software and hardware including much of the existing speech technology. The following are some of the key implementation mechanisms.
|Java Implementations: Speech synthesizers and speech recognizers may be written in Java. These engines will benefit from the portability of Java, from access to cross-platform audio and from the continuing improvements in the execution speed of Java.|
|Native Implementations: Much of the existing speech technology base is implemented to support C, C++ and Pascal APIs including Microsoft SAPI, SRAPI, the Apple Speech Managers, and vendor-specific APIs. Using the Java Native Interface and Java wrappers, speech vendors can implement the Java Speech API using their existing speech software. This is expected to be the major mechanism of support for JSAPI in the first year of deployment.|
|Telephony Implementations: Enterprise telephony applications often require special hardware to support large numbers of simultaneous connections using, for example, DSP cards on special boards. These configurations can be implemented with similar software to that described above for native implementations combined with the Java Telephony API for call management.|
Sun Microsystems has set up strategic collaborations with leading speech technology companies to define the initial specification for the Java Speech API. These companies are Apple Computer, Inc., Dragon Systems, Inc., IBM Corporation, Novell, Inc., Philips Electronics, BV, and Texas Instruments, Inc. These companies bring decades of research on speech technology, a product portfolio covering most major operating systems, experience with application development, and great depth of experience with the major technologies being supported by the Java Speech API.
Speech technology is in transition from niche applications and research systems to wide deployment. As this transition takes place, the technology is changing rapidly to meet user and application demands. It is likely that the Java Speech API will be upgraded and extended to meet future requirements and technology capabilities. While it is difficult to anticipate such changes, it is likely that a future version will include support for the speaker verification and speaker recognition technologies that are widely used for security purposes. Recognizers with word spotting capability may be supported in future versions.
The Java Speech Application Programming Interface (JSAPI) will allow applications and applets to use speech technology for the development of advanced user interfaces. The Java Speech API will provide a standard, cross-platform interface to speech synthesizers, command and control recognition and dictation systems for both desktop and telephony environments. The Java Speech API supports state-of-the-art speech technology capabilities but remains easy to use and maintain. With the Java Speech API, developers will be able to develop applications with dynamic and compelling user interfaces.
For More Information
See the Java Speech API Data Sheet