Handling Emotions in Human-Computer Dialogues
Johannes Pittermann · Angela Pittermann · Wolfgang Minker
Johannes Pittermann
Universität Ulm, Inst. Informationstechnik
Albert-Einstein-Allee 43, 89081 Ulm, Germany
johannes.pittermann@alumni.uni-ulm.de

Angela Pittermann
Universität Ulm, Inst. Informationstechnik
Albert-Einstein-Allee 43, 89081 Ulm, Germany
angelapittermann@gmx.de

Wolfgang Minker
Universität Ulm, Fak. Ingenieurwissenschaften und Elektrotechnik
Albert-Einstein-Allee 43, 89081 Ulm, Germany
wolfgang.minker@uni-ulm.de

ISBN 978-90-481-3128-0
e-ISBN 978-90-481-3129-7
DOI 10.1007/978-90-481-3129-7
Springer Dordrecht Heidelberg London New York

Library of Congress Control Number: 2009931247

© Springer Science+Business Media B.V. 2010
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Cover design: Boekhorst Design b.v.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The finest emotion of which we are capable is the mystic emotion.
(Albert Einstein, 1879–1955)

During the past years, the mystery of emotions has increasingly attracted interest in research on human-computer interaction. In this work we investigate the problem of how to incorporate the user's emotional state into a spoken language dialogue system. The book describes the recognition and classification of emotions and proposes models integrating emotions into adaptive dialogue management.

In computer and telecommunication technologies, the way in which people communicate with each other is changing significantly, from strictly structured and formatted information transfer to flexible and more natural communication. Spoken language is the most natural way of communication between humans, and it also provides an easy and quick way to interact with a computer application. Such systems range from information kiosks, where travelers can book flights or buy train tickets, to handheld devices which show tourists around cities while interactively giving information about points of interest. Generally, spoken language dialogue not only offers simplicity, comfort and time savings, but moreover contributes to safety in critical environments such as cars, where hands-free operation is indispensable in order to keep the driver's distraction minimal. Within the context of ubiquitous computing in intelligent environments, dialogue systems facilitate everyday work, e.g., at home, where lights or household appliances can be controlled by voice commands, and provide the possibility, especially in assisted living, to quickly summon help in emergencies.

In parallel to the progress made in technical development, customers' demands on products have increased.
While car owners in the 1920s might have been completely satisfied once they arrived at a destination without any major complications, people in the 1970s would already have tended to become annoyed if their engine refused to start at the first turn of the ignition key. And nowadays a navigation system showing the wrong way may cause even more anger. For ubiquitous technology like cars this means, on the one hand, that the driver is literally at the mercy of sophisticated technology; on the other hand, this does not hinder him or her from building some kind of personal relation to the car, ranging from decorations
like car fresheners or fuzzy dice to expensive tuning. Such a relation also includes the expression of emotions towards the car: just imagine drivers spurring on their cars when climbing a steep hill and being glad to have reached the top, or drivers shouting at their non-functioning navigation system, hitting or kicking their cars... A similar behavior can be observed among computer users. Having successfully written a book using word processing software might arouse happiness; however, a sudden hard disk crash destroying all documents will probably drive the author up the wall. Normally, neither the car nor the computer is capable of replying to the user's affect. So why not enable devices to react accordingly? Think of a car that refuses to start and the driver shouting angrily: "Stupid car, I paid more than $40,000 and now it's only causing trouble!" Here a reply from the car like "I am sorry that the engine does not run properly. This is due to a defective spark plug which needs to be replaced." would certainly defuse the tense situation, and it moreover provides useful information on how to solve the problem. This again contributes to safety in the car, as the driver can be calmed down, e.g., in the case of a delay due to a traffic jam which the driver would otherwise try to make up for by speeding. Here the car's computer could try to rearrange the planned meeting and inform the user: "Due to our delay I have rescheduled your meeting one hour later. So there is no need to hurry."

To implement such a more flexible system, the typical architecture of a spoken language dialogue system needs to be equipped with additional functionality. This includes the recognition of emotions and the detection of situation-based parameters, as well as user-state and situation managers which calculate models based on these parameters and influence the course of the dialogue accordingly.
Constituting a hot topic in current research, several approaches exist to classify the user's emotions. These methods include the measurement of physiological values using biosensors, the interpretation of gestures and facial expressions using cameras, natural language processing spotting emotive keywords and fillers in recognized utterances, and the classification of prosodic features extracted from the speech signal. Concentrating on a monomodal system without video input and trying to reduce inconvenience to the user, this work focuses on the recognition of emotions from the speech signal using Hidden Markov Models (HMMs). Based on a database of emotional speech, a set of prosodic features has been selected, and HMMs have been trained and tested for six emotions and ten speakers. By varying the model parameters, multiple recognizers have been implemented. According to the output of the emotion recognizer(s), the course of the dialogue is influenced. With the help of a user-state model and a situation model, the dialogue strategy is adapted and an appropriate stylistic realization of its prompts is chosen. For example, if the user is in a neutral mood and speaks clearly, no confirmations are necessary and the dialogue can be kept relatively short. However, if the user is angry and speaks correspondingly unclearly, the system has to try to calm down the user, but it also has to ask often for confirmation, which again makes the user turn angry... In principle, there exist two methods to model the influence of these so-called control parameters like emotions: a rule-based approach where every eventuality in the
user's behavior is covered by a rule which contains a suitable reply, or a stochastic approach which models the probability of a certain reply depending on the user's previous utterances and the corresponding control parameters.

So how is this book organized? An introduction to the research topic is followed by an overview of emotion theories and emotions in speech. In the third chapter, dialogue strategy concepts with regard to integrating emotions into spoken dialogue are described. Signal processing and speech-based emotion recognition are discussed in Chapter 4, and improvements to our proposed emotion recognizers as well as the implementation of our adaptive dialogue manager are discussed in Chapter 5. Chapter 6 presents evaluation results for the emotion recognition component and for the end-to-end system with respect to existing spoken language dialogue system evaluation paradigms. The book concludes with a final discussion and an outlook on future research directions.

Ulm, May 2009
Johannes & Angela Pittermann
Wolfgang Minker
Contents

1 Introduction ... 1
  1.1 Spoken Language Dialogue Systems ... 2
  1.2 Enhancing a Spoken Language Dialogue System ... 6
  1.3 Challenges in Dialogue Management Development ... 8
  1.4 Issues in User Modeling ... 11
  1.5 Evaluation of Dialogue Systems ... 14
  1.6 Summary of Contributions ... 16
2 Human Emotions ... 19
  2.1 Definition of Emotion ... 19
  2.2 Theories of Emotion and Categorization ... 22
  2.3 Emotional Labeling ... 36
  2.4 Emotional Speech Databases/Corpora ... 42
  2.5 Discussion ... 45
3 Adaptive Human-Computer Dialogue ... 47
  3.1 Background and Related Research ... 48
  3.2 User-State and Situation Management ... 61
  3.3 Dialogue Strategies and Control Parameters ... 65
  3.4 Integrating Speech Recognizer Confidence Measures into Adaptive Dialogue Management ... 66
  3.5 Integrating Emotions into Adaptive Dialogue Management ... 72
  3.6 A Semi-Stochastic Dialogue Model ... 78
  3.7 A Semi-Stochastic Emotional Model ... 90
  3.8 A Semi-Stochastic Combined Emotional Dialogue Model ... 95
  3.9 Extending the Semi-Stochastic Combined Emotional Dialogue Model ... 100
  3.10 Discussion ... 104
4 Hybrid Approach to Speech Emotion Recognition ... 107
  4.1 Signal Processing ... 108
  4.2 Classifiers for Emotion Recognition ... 120
  4.3 Existing Approaches to Emotion Recognition ... 127
  4.4 HMM-Based Speech Recognition ... 131
  4.5 HMM-Based Emotion Recognition ... 135
  4.6 Combined Speech and Emotion Recognition ... 142
  4.7 Emotion Recognition by Linguistic Analysis ... 144
  4.8 Discussion ... 149
5 Implementation ... 151
  5.1 Emotion Recognizer Optimizations ... 151
  5.2 Using Multiple (Speech-)Emotion Recognizers ... 159
  5.3 Implementation of Our Dialogue Manager ... 173
  5.4 Discussion ... 185
6 Evaluation ... 187
  6.1 Description of Dialogue System Evaluation Paradigms ... 187
  6.2 Speech Data Used for the Emotion Recognizer Evaluation ... 190
  6.3 Performance of Our Emotion Recognizer ... 192
  6.4 Evaluation of Our Dialogue Manager ... 217
  6.5 Discussion ... 223
7 Conclusion and Future Directions ... 227
A Emotional Speech Databases ... 237
B Used Abbreviations ... 251
References ... 253
Index ... 273