Multimodal Conversational Systems

Multimodal conversational systems are computer systems that engage human users in intelligent conversation through speech and other modalities such as gesture and gaze. These systems are motivated largely by human-human conversation, in which nonverbal modalities such as hand gestures, body postures, eye gaze, head movements, and facial expressions complement spoken language. Studies have shown that multimodal conversational systems provide more natural and effective human-machine interaction than speech-only systems. This entry provides a brief overview of the types of systems, their general architecture, and the key components of automated multimodal interpretation and generation in such systems.

Types of Systems

A variety of multimodal conversational systems have been developed in the past three decades. They range from multimodal conversational interfaces to embodied conversational agents and, more recently, situated dialogue agents. Multimodal conversational interfaces support interaction with graphical interfaces on computers or other devices (e.g., handheld devices). A user can look at the interface, point to regions on it, and talk to the system. These interfaces are particularly useful for map-based applications. Embodied conversational agents (also called virtual humans) allow users to carry on conversations with virtual embodied agents (often life-size) through multiple modalities such as speech, facial expressions, hand gestures, and head movements. These systems are often applied in cultural training, tutoring, and education. Situated dialogue agents represent a new generation of dialogue agents that are co-present with human partners in a shared world, which may be virtual or physical. In situated dialogue (e.g., human-robot dialogue), the perception of the shared environment and the mobility and embodiment of the partners play an important role in the success of the dialogue. In these systems, language processing needs to be combined with vision processing, gesture recognition, and situation modeling. Situated dialogue in virtual worlds can be applied to interactive games, training, and education, while dialogue in the physical world can benefit a range of applications involving human-robot interaction.

System Architecture

Most multimodal dialogue systems share a similar architecture with four major components: a multimodal interpreter, a dialogue manager, an action manager, and a multimodal generator, as shown in Figure 1. The multimodal interpreter combines the different modalities and identifies the semantic meaning of the user's multimodal input. Based on this understanding of user intent, the dialogue manager decides what to do in response, for example, ask for clarification or provide the information the user requested. Once this decision has been made, the action manager takes charge of any required backend processes, such as retrieving relevant information. The multimodal generator uses the gathered information to produce specific responses, such as multimedia presentations on graphical interfaces or multimodal conversational behaviors for embodied agents. Each of these components is critical to the overall performance of a multimodal conversational system. The multimodal interpreter and the multimodal generator are the two components that distinguish multimodal conversational systems from traditional spoken dialogue systems.

Figure 1 A general architecture for multimodal conversational systems
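
The flow through the four components can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than an implementation of any particular system: the names UserInput, Intent, interpret, decide, act, generate, and respond are hypothetical, the keyword matching stands in for real speech and gesture understanding, and the database is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInput:
    """One user turn: recognized speech plus an optional pointing target."""
    speech: str
    pointed_at: Optional[str] = None  # hypothetical ID of the object a gesture selects


@dataclass
class Intent:
    """Semantic representation produced by the multimodal interpreter."""
    act: str
    target: Optional[str] = None


def interpret(user_input: UserInput) -> Intent:
    """Multimodal interpreter: combine speech and gesture into one intent."""
    text = user_input.speech.lower()
    # Toy keyword rule (an assumption); a real interpreter would use trained models.
    act_type = "request_info" if "what" in text or "show" in text else "unknown"
    return Intent(act=act_type, target=user_input.pointed_at)


def decide(intent: Intent) -> str:
    """Dialogue manager: choose between informing and asking for clarification."""
    if intent.act == "unknown" or intent.target is None:
        return "clarify"
    return "inform"


def act(decision: str, intent: Intent, database: dict) -> Optional[str]:
    """Action manager: run backend processes, here a simple lookup."""
    if decision == "inform":
        return database.get(intent.target)
    return None


def generate(decision: str, intent: Intent, content: Optional[str]) -> dict:
    """Multimodal generator: pair a spoken reply with a graphical highlight."""
    if decision == "inform" and content is not None:
        return {"speech": content, "highlight": intent.target}
    return {"speech": "Which item do you mean?", "highlight": None}


def respond(user_input: UserInput, database: dict) -> dict:
    """Run one turn through interpreter, dialogue manager, action manager, generator."""
    intent = interpret(user_input)
    decision = decide(intent)
    content = act(decision, intent, database)
    return generate(decision, intent, content)
```

With a database such as {"house_3": "This house has three bedrooms."}, calling respond(UserInput("What is this?", "house_3"), database) yields a spoken answer together with a highlight on house_3, while the same utterance without a pointing target leads to a clarification question.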

Multimodal Interpretation

The capability to process user multimodal inputs and identify their semantic meaning is one of the most critical capabilities of a multimodal conversational system. A large body of research has focused on how different modalities are aligned, how different modalities and/or shared visual environments can be integrated to derive an overall semantic representation (such as user intent), and how nonverbal modalities may improve spoken language understanding.
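
As one illustration of what aligning modalities can mean, the Python sketch below pairs deictic words in a time-stamped utterance with the pointing gesture closest to them in time. It is a simple greedy heuristic chosen for illustration, not a method attributed to this entry or to any specific system; the names Word, Gesture, DEICTIC, gap, and align, and the one-second threshold, are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the utterance
    end: float


@dataclass
class Gesture:
    target: str   # hypothetical ID of the object the gesture selects
    start: float
    end: float


# Small illustrative set of deictic words (an assumption, not an exhaustive list).
DEICTIC = {"this", "that", "here", "there"}


def gap(word: Word, gesture: Gesture) -> float:
    """Time between the two intervals in seconds; 0.0 when they overlap."""
    return max(0.0, word.start - gesture.end, gesture.start - word.end)


def align(words: List[Word], gestures: List[Gesture],
          max_gap: float = 1.0) -> List[Tuple[str, Optional[str]]]:
    """Pair each deictic word with the closest unused gesture within max_gap."""
    unused = list(gestures)
    pairs = []
    for word in words:
        if word.text.lower() not in DEICTIC:
            continue
        candidates = [g for g in unused if gap(word, g) <= max_gap]
        if not candidates:
            pairs.append((word.text, None))  # no gesture close enough in time
            continue
        best = min(candidates, key=lambda g: gap(word, g))
        unused.remove(best)  # each gesture resolves at most one word
        pairs.append((word.text, best.target))
    return pairs
```

For an utterance such as "move this there" accompanied by one gesture at a lamp and a later one at a table, align links "this" to the lamp and "there" to the table. The resulting pairs could then feed a combined semantic representation such as user intent.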

...
