This book is based on publications from the ISCA Tutorial and Research
Workshop on Multi-Modal Dialogue in Mobile Environments held at Kloster
Irsee, Germany, in 2002. The workshop covered various aspects of development
and evaluation of spoken multimodal dialogue systems and components
with particular emphasis on mobile environments, and discussed the state of the
art within this area. On the development side the major aspects addressed
include speech recognition, dialogue management, multimodal output generation,
system architectures, full applications, and user interface issues.
The presentation "Multimodal discourse analysis" covers the background to multimodal discourse analysis, an introduction to the picture under study, an analysis of the picture, and the limitations of multimodal discourse analysis.
Systems that attempt to understand natural human input make mistakes, as do humans. However, humans avoid misunderstandings by confirming doubtful input. Multimodal systems, which combine simultaneous input from more than one modality (for example speech and gesture), have historically been designed so that they either request confirmation of speech, their primary modality, or do not request confirmation at all. Instead, we experimented with delaying confirmation until after the speech and gesture were combined into a complete multimodal command. ...
We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary.
This paper describes Dico II+, an in-vehicle dialogue system demonstrating a novel combination of flexible multimodal menu-based dialogue and a “speech cursor” which enables menu navigation as well as browsing long lists using haptic input and spoken output.
We investigate the use of machine learning in combination with feature engineering techniques to explore human multimodal clarification strategies and the use of those strategies for dialogue systems. We learn from data collected in a Wizard-of-Oz study where different wizards could decide whether to ask a clarification request in a multimodal manner or else use speech alone. We show that there is a uniform strategy across wizards which is based on multiple features in the context. These are generic runtime features which can be implemented in dialogue systems. ...
In order to effectively access the rapidly increasing range of media content available in the home, new kinds of more natural interfaces are needed. In this paper, we explore the application of multimodal interface technologies to searching and browsing a database of movies. The resulting system allows users to access movies using speech, pen, remote control, and dynamic combinations of these modalities.
The aim of this paper is to develop animated agents that can control multimodal instruction dialogues by monitoring the user’s behaviors. First, this paper reports on our Wizard-of-Oz experiments, and then, using the collected corpus, proposes a probabilistic model of fine-grained timing dependencies among multimodal communication behaviors: speech, gestures, and mouse manipulations.
Software to translate English text into American Sign Language (ASL) animation can improve information accessibility for the majority of deaf adults with limited English literacy. ASL natural language generation (NLG) is a special form of multimodal NLG that uses multiple linguistic output channels. ASL NLG technology has applications for the generation of gesture animation and other communication signals that are not easily encoded as text strings.
In order to realize their full potential, multimodal systems need to support not just input from multiple modes, but also synchronized integration of modes. Johnston et al. (1997) model this integration using a unification operation over typed feature structures. This is an effective solution for a broad class of systems, but limits multimodal utterances to combinations of a single spoken phrase with a single gesture. We show how the unification-based approach can be scaled up to provide a full multimodal grammar formalism.
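As a rough illustration of what unification-based integration of speech and gesture involves, the following minimal sketch unifies two feature structures represented as nested Python dictionaries. It is a simplification under stated assumptions, not the formalism of the cited work: it ignores the type hierarchy and structure sharing of typed feature structures, and the feature names (`action`, `location`, `coords`) and sample values are hypothetical.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature structures; return None if they conflict.

    Assumption: structures are nested dicts with non-None atomic values.
    Real typed feature structures also carry types and reentrancy,
    which this sketch leaves out.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values for the same feature
                result[key] = merged
            else:
                result[key] = b_val  # feature supplied by b only
        return result
    # Atomic values unify only when they are equal.
    return a if a == b else None


# Hypothetical spoken command with an underspecified location.
speech = {"action": "move", "location": {"type": "point"}}
# Hypothetical pen gesture that supplies the missing coordinates.
gesture = {"location": {"type": "point", "coords": (12, 40)}}

command = unify(speech, gesture)
```

Here the spoken phrase contributes the action and the gesture fills in the location, so `command` holds a complete multimodal command. A gesture of an incompatible kind, such as `{"location": {"type": "area"}}`, fails to unify and yields `None`.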
Mobile interfaces need to allow the user and system to adapt their choice of communication modes according to user preferences, the task at hand, and the physical and social environment. We describe a multimodal application architecture which combines finite-state multimodal language processing, a speech-act based multimodal dialogue manager, dynamic multimodal output generation, and user-tailored text planning to enable rapid prototyping of multimodal interfaces with flexible input and adaptive output. ...
Human face-to-face conversation is an ideal model for human-computer dialogue. One of the major features of face-to-face communication is its multiplicity of communication channels that act on multiple modalities. To realize a natural multimodal dialogue, it is necessary to study how humans perceive information and determine the information to which humans are sensitive. A face is an independent communication channel that conveys emotional and conversational signals, encoded as facial expressions.
Recent empirical research has shown conclusive advantages of multimodal interaction over speech-only interaction for map-based tasks. This paper describes a multimodal language processing architecture which supports interfaces allowing simultaneous input from speech and gesture recognition. Integration of spoken and gestural input is driven by unification of typed feature structures representing the semantic contributions of the different modes.
We address two problems in the field of automatic optimization of dialogue strategies: learning effective dialogue strategies when no initial data or system exists, and evaluating the result with real users. We use Reinforcement Learning (RL) to learn multimodal dialogue strategies by interaction with a simulated environment which is “bootstrapped” from small amounts of Wizard-of-Oz (WOZ) data.
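To make the idea of learning a strategy by interaction with a simulated environment more concrete, here is a toy sketch using tabular Q-learning. It is not the setup of the paper: the two-slot task, the action names, the rewards, and the understanding probability `P_UNDERSTOOD` are all hypothetical, and Q-learning is just one RL algorithm chosen for brevity. In a bootstrapped setting, a quantity like `P_UNDERSTOOD` would be estimated from the WOZ data rather than fixed by hand.

```python
import random

ACTIONS = ["ask", "present"]
N_SLOTS = 2          # hypothetical: the task needs two slots filled
P_UNDERSTOOD = 0.8   # hypothetical; would be estimated from WOZ data


def simulate_turn(state, action, rng):
    """Toy simulated user: return (next_state, reward, done)."""
    if action == "present":
        # Presenting results only pays off once every slot is filled.
        return state, (10.0 if state == N_SLOTS else -10.0), True
    # "ask": each question costs a turn and may or may not fill a slot.
    if state < N_SLOTS and rng.random() < P_UNDERSTOOD:
        state += 1
    return state, -1.0, False


def train(episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
    """Learn Q-values by epsilon-greedy interaction with the simulation."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_SLOTS + 1) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(20):  # cap on dialogue length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = simulate_turn(state, action, rng)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if done:
                break
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in range(N_SLOTS + 1)}


policy = greedy_policy(train())
```

In this toy task the learned policy keeps asking until both slots are filled and only then presents results. Evaluating such a policy with real users, the second problem the abstract raises, is outside what a simulation like this can show.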
We discuss Image Sense Discrimination (ISD), and apply a method based on spectral clustering, using multimodal features from the image and text of the embedding web page. We evaluate our method on a new data set of annotated web images, retrieved with ambiguous query terms. Experiments investigate different levels of sense granularity, as well as the impact of text and image features, and global versus local text features.
We introduce a new multi-threaded parsing algorithm on unification grammars designed specifically for multimodal interaction and noisy environments. By lifting some traditional constraints, namely those related to the ordering of constituents, we overcome several difficulties of other systems in this domain. We also present several criteria used in this model to constrain the search process using dynamically loadable scoring functions. Some early analyses of our implementation are discussed. ...
This paper presents Archivus, a multimodal language-enabled meeting browsing and retrieval system. The prototype is in an early stage of development, and we are currently exploring the role of natural language for interacting in this relatively unfamiliar and complex domain. We briefly describe the design and implementation status of the system, and then focus on how this system is used to elicit useful data for supporting hypotheses about multimodal interaction in the domain of meeting retrieval and for developing NLP modules for this specific domain. ...
The system is an in-car multimodal dialogue system for an MP3 application. It is used as a testing environment for our research in natural, intuitive mixed-initiative interaction, with particular emphasis on multimodal output planning and realization aimed at producing output adapted to the context, including the driver’s attention state w.r.t. the primary driving task.
This paper describes MIMUS, a multimodal and multilingual dialogue system for the in-home scenario, which allows users to control some home devices by voice and/or clicks. Its design relies on Wizard of Oz experiments and is targeted at disabled users. MIMUS follows the Information State Update approach to dialogue management, and supports English, German and Spanish, with the possibility of changing language on-the-fly. MIMUS includes a gesture-enabled talking head which endows the system with a human-like personality. ...
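For readers unfamiliar with the Information State Update approach, the sketch below shows its general shape: a structured information state plus update rules that each check a precondition and, if it holds, apply an effect. This is an illustrative assumption about how such a manager could be organized, not MIMUS's actual design; the state fields, move format, rule names, and the `lamp` device are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InfoState:
    """Hypothetical information state for a home-control dialogue."""
    language: str = "en"
    devices: dict = field(default_factory=lambda: {"lamp": "off"})
    pending: list = field(default_factory=list)  # moves not yet integrated
    agenda: list = field(default_factory=list)   # system moves to realize


def rule_switch_device(state):
    """Integrate a device command (from voice or a click)."""
    move = state.pending[0]
    if move["type"] != "command" or move.get("device") not in state.devices:
        return False  # precondition not met
    state.devices[move["device"]] = move["value"]
    state.agenda.append(("inform", move["device"], move["value"]))
    state.pending.pop(0)
    return True


def rule_change_language(state):
    """Switch the dialogue language on the fly."""
    move = state.pending[0]
    if move["type"] != "set_language":
        return False
    state.language = move["value"]
    state.pending.pop(0)
    return True


def rule_reject(state):
    """Fallback: no other rule could integrate the move."""
    state.agenda.append(("reject", state.pending.pop(0)))
    return True


RULES = [rule_switch_device, rule_change_language, rule_reject]


def update(state, move):
    """Add a user move, then fire the first applicable rule per move."""
    state.pending.append(move)
    while state.pending:
        for rule in RULES:
            if rule(state):
                break
    return state


state = update(InfoState(), {"type": "command", "device": "lamp", "value": "on"})
state = update(state, {"type": "set_language", "value": "es"})
```

Because the input modality is abstracted into a dialogue move before it reaches the rules, a spoken command and a click on the same device can be handled by the same rule.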