EUROPEAN ORGANISATION FOR THE SAFETY OF AIR NAVIGATION

EUROCONTROL EXPERIMENTAL CENTRE

MULTIMODAL INTERFACES: A BRIEF LITERATURE REVIEW

EEC Note No. 01/07
Project: MMF
Issued: April 2007

The information contained in this document is the property of the EUROCONTROL Agency; no part of it is to be reproduced in any form without the Agency's permission. The views expressed herein do not necessarily reflect the official views or policy of the Agency.


REPORT DOCUMENTATION PAGE

Reference: EEC Note No. 01/07
Security Classification: Unclassified
Originator (Corporate Author) Name/Location: DeepBlue s.r.l., Piazza Buenos Aires, Rome, Italy
Sponsor: Marc Bourgois, Manager Innovative Studies, EUROCONTROL Experimental Centre
Sponsor (Contract Authority) Name/Location: EUROCONTROL Agency, Rue de la Fusée 96, B-1130 Brussels
TITLE: MULTIMODAL INTERFACES: A BRIEF LITERATURE REVIEW
Author: Monica Tavanti
Date: 04/2007
Pages: viii + 32
Figures: 12
Tables: 1
Annexes: -
References: 62
Project: MMF
Task no. sponsor: C61PT/
Period:
Distribution statement: (a) Controlled by: Marc Bourgois (b) Special limitations: None
Descriptors (keywords): Multimodal interaction, Interfaces

Abstract: Multimodal interfaces have for quite some time been considered the "interfaces of the future", aiming to allow more natural interaction and offering new opportunities for parallelism and individual capacity increases. This document provides the reader with an overview of multimodal interfaces and of the results of empirical studies assessing users' performance with multimodal systems. The study tackles applications from various domains, including air traffic control. The document also discusses a number of limitations of current multimodal interfaces, particularly for the ATC domain, for which a deeper analysis of the costs and benefits associated with multimodal interfaces should be carried out.


FOREWORD

Recent ATC working-position interfaces would appear to have converged on a mouse-windows paradigm for all human-system interactions. Is there no future for any of the other futuristic devices we see in virtual and augmented reality applications, such as wands, speech input and haptics? Could these devices not be more natural channels for interaction? Could they not offer opportunities for parallelism and thus for individual capacity increases?

In order to start studying these issues and to draw a more informed picture of the pros and cons of multimodal interaction, we searched the literature for generic lessons from multimodal interfaces which would be transferable to the ATM domain. The report is supplemented with a number of landmark experiments in ATM itself.

Marc Bourgois, Manager Innovative Studies

TABLE OF CONTENTS

FOREWORD ... V
LIST OF FIGURES ... VII
LIST OF TABLES ... VII
1. PURPOSES OF THIS DOCUMENT
2. DEFINING MULTIMODALITY
   2.1. FIRST DEFINITION
   2.2. WHY MULTIMODAL INTERFACES?
3. MYTHS OF MULTIMODAL INTERACTION
   HUMAN-COMPUTER COMMUNICATION CHANNELS (HCCC)
   HUMAN-CENTRED PERSPECTIVE
   SYSTEM-CENTRED PERSPECTIVE
   DESIGN SPACE FOR MULTIMODAL SYSTEMS
   AN EXAMPLE OF CLASSIFICATION
4. DEVICES: A SHORT SUMMARY
   4.1. INPUT
   4.2. OUTPUT
5. NON-ATM DOMAIN
   EMPIRICAL RESULTS
      DO PEOPLE INTERACT MULTIMODALLY?
      MULTIMODALITY AND TASK DIFFICULTY
      MUTUAL DISAMBIGUATION
      MEMORY
      THE MCGURK EFFECT
      CROSS-MODAL INTERACTIONS
6. ATM MULTIMODAL INTERFACES
   6.1. DIGISTRIPS
      DigiStrips evaluation
   6.2. THE ANOTO PEN
   6.3. VIGIESTRIPS
      Vigiestrips evaluation
   3D SEMI-IMMERSIVE ATM ENVIRONMENT (LINKÖPING UNIVERSITET)
      Flight information
      Weather information
      Terrain information
      Orientation
      Conflict detection
      Control
      Positional audio
      Interaction mechanisms
      Evaluation
   AVITRACK
      AVITRACK evaluation ... 27

7. CONCLUSIONS
ACKNOWLEDGMENTS
REFERENCES ... 29

LIST OF FIGURES

Figure 1: Basic model for HCCC (adapted from Schomaker et al., 1995) ... 5
Figure 2: Design space (after Nigay & Coutaz, 1993) ... 6
Figure 3: The NoteBook example within the design space (after Nigay & Coutaz, 1993) ... 8
Figure 4: The manipulation properties preserved in DigiStrips (after Mertz & Vinot, 1999)
Figure 5: DigiStrips strokes (after Mertz, Chatty & Vinot, 2000a)
Figure 6: Simple strokes to open menus (after Mertz, Chatty & Vinot, 2000a)
Figure 7: An annotated strip (after Mertz & Vinot, 1999)
Figure 8: The ANOTO pen technical limitations
Figure 9: Vigiestrips main working areas (after Salaun, Pene, Garron, Journet & Pavet, 2005)
Figure 10: Snapshot of the 3D ATM system (after Lange, Cooper, Ynnerman & Duong, 2004)
Figure 11: Interaction usability test (after Le-Hong, Tavanti & Dang, 2004)
Figure 12: AVITRACK overview (adapted from ...) ... 26

LIST OF TABLES

Table 1: Classification of senses and modalities (adapted from Silbernagel, 1979) ... 5


1. PURPOSES OF THIS DOCUMENT

This document provides an overview of multimodal interfaces and of the results of empirical studies assessing users' performance with multimodal systems. It is structured as follows: In section 2, the notion of multimodality will be defined and explained. In section 3, the main ideas in support of multimodal interfaces (as well as the "false myths" related to multimodal systems) will be presented. Subsequently, a number of models will also be proposed attempting to classify and structure the available research on multimodal systems. In section 4, a brief overview will be given of the available devices most commonly used in multimodal research and of a number of strong and weak points for each technology. In sections 5 and 6, a collection of empirical results assessing users' performance with multimodal systems will be provided. As the multimodal literature is not very large with respect to air traffic control (ATC), this document will present both non-ATC and ATC-related results. The last section will sum up the work and put forward a number of criticisms and possible limitations of multimodal interfaces, particularly for the ATC domain. This document and the preliminary conclusions call for an analysis of the costs and benefits associated with multimodal interfaces, especially for ATC.

10 EUROCONTROL Multimodal Interfaces: a Brief Literature Review 2. DEFINING MULTIMODALITY 2.1. FIRST DEFINITION Multimodal systems "process two or more combined user input modes such as speech, pen, touch, manual gestures, gaze and head and body movements in a coordinated manner with multimedia system output" (Oviatt, 2002). These interfaces are different from traditional graphical interfaces since they aim "to recognize naturally occurring forms of human language and behaviour, which incorporate at least one recognition-based technology" (Oviatt, 2002) WHY MULTIMODAL INTERFACES? Historically, the birth of multimodal interfaces is often identified with the "Put That There" system (Bolt, 1980), in which both speech and gesture are recognised by the system and interpreted as command inputs. The user, sitting in a room in front of a large display, can provide vocal inputs accompanied by deictic gestures which contribute to the identification of an object (Robin, 2004). Therefore, the user can give the command "Put That There" while pointing at an object. Bolt states: "there, now indicated by gesture, serves in lieu of the entire phrase "...to the right of the green square". The power of this function is even more general. The place description "... to the right of the green square" presupposes an item in the vicinity for reference: namely, the green square. There may be no plausible reference frame in terms of already extant items for a word description of where the moved item is to go. The intended spot, however, may readily be indicated by voiceand-pointing: there. In this function, as well as others, some variation in expression is understandably a valuable option" (Bolt, 1980). The interesting point is that Bolt's system is able to disambiguate an unclear and vague noun "there" by interpreting the meaning of the pointing gestures. The great innovation of Bolt's system was to allow the users to carry out a task using a more natural way of interacting with the system, exploiting everyday communication strategies such as pointing to objects. Multimodal interfaces are made with the objective of providing flexible interaction, since the users can choose from among a selection of input modalities, using one input type, using multiple simultaneous input modalities, or alternating among different modalities. Multimodal interfaces are also characterised by availability, because they are intended to accommodate several user types under multiple circumstances, making available a number of interaction modalities. Moreover, the possibility of using several interaction modalities characterises these interfaces in terms of adaptability, since the user can choose the most appropriate modality depending on the circumstances, and in terms of efficiency, because the interfaces can process inputs in a parallel manner (Robbins, 2004). Finally, as suggested by the example "Put-That-There", multimodal interfaces aim to support more natural human-computer interaction. Multimodal interfaces are based on the recognition of natural human behaviour such as speech, gestures, and gaze; new computational capabilities will eventually allow for automatic and seamless interpretation of these behaviours so that the systems will intelligently adapt and respond to the users (Oviatt, 2002). The main benefit would appear to be "natural interaction", a way of carrying out tasks with systems which are able to grasp and understand the way we behave in everyday life. 
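As an illustration of how speech and a pointing gesture can jointly resolve a deictic expression such as "that" or "there", here is a minimal sketch that fuses a recognised utterance with a time-stamped pointing event. The class names, fields and selection logic are illustrative assumptions, not a description of Bolt's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PointingEvent:
    x: float          # screen coordinates of the pointing gesture
    y: float
    timestamp: float  # seconds since the start of the interaction

@dataclass
class SceneObject:
    name: str
    x: float
    y: float

def resolve_deictic(utterance: str, pointing: PointingEvent,
                    scene: List[SceneObject]) -> dict:
    """Replace a deictic word ('that', 'there') with the object or location
    indicated by the accompanying pointing gesture."""
    words = utterance.lower().split()
    if "that" in words:
        # 'that' is taken to refer to the scene object closest to the pointed position
        target = min(scene,
                     key=lambda o: (o.x - pointing.x) ** 2 + (o.y - pointing.y) ** 2)
        return {"action": "select", "object": target.name}
    if "there" in words:
        # 'there' is taken to refer to the pointed-at location itself
        return {"action": "move_to", "position": (pointing.x, pointing.y)}
    return {"action": "unknown"}

# Example: "put that there" split into two spoken fragments, each with its own pointing event
scene = [SceneObject("green square", 120, 80), SceneObject("blue circle", 400, 300)]
select = resolve_deictic("put that", PointingEvent(118, 83, 0.4), scene)
place = resolve_deictic("there", PointingEvent(520, 210, 1.1), scene)
print(select, place)
```

In this toy version the gesture alone supplies the referent, which is the essence of Bolt's observation that pointing can stand in for a whole verbal place description.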
However, the design of multimodal interfaces is often based on common-sense and false assumptions, constituting a sort of misleading mythology. 2 Project MMF EEC Note No. 01/07

3. MYTHS OF MULTIMODAL INTERACTION

There are great expectations surrounding multimodal interfaces and multimodal interaction. Often, these expectations lead to mistaken beliefs, which have little or nothing to do with the actual "empirical reality" (Oviatt, 2002). Oviatt (1999) summarises these false beliefs based on empirical evidence. She provides a list of ten myths of multimodal interaction and explains how these myths should be "corrected" in order to meet real users' requirements.

Myth No. 1: If you build a multimodal system, the user will interact multimodally. According to a study carried out by Oviatt (1997), 95% of users interacted multimodally when they were free to use either speech or pen input in a spatial domain. Users would appear to mix unimodal and multimodal interaction, depending on the requirements of the task at hand. Multimodal interactions would appear related to spatial content (e.g. calculation of the distance between objects, specification of distances among objects, etc.). When the action does not entail spatiality, users are not likely to interact multimodally.

Myth No. 2: Speech-and-pointing is the dominant multimodal integration pattern. This type of interaction seems dominant because most interfaces implement this interaction modality, especially to resolve "deictic" forms (i.e. to resolve the meaning of expressions such as "that" or "there", which require a reference to something). However, this choice would appear to be a sort of "new implementation style" for the traditional mouse interaction paradigm. Speech-and-point interaction accounts for only 14% of all spontaneous multimodal interactions (Oviatt, DeAngeli & Kuhn, 1997). Oviatt mentions the results of past research (McNeill, 1992) indicating that analysis of interpersonal communications shows that pointing accounts for less than 20%.

Myth No. 3: Multimodal input involves simultaneous signals. It is often assumed that users "act multimodally", using different modalities in a simultaneous manner. Taking deictic expressions as an example, one might think that users would speak while simultaneously pointing at something and saying, for example, "there". This overlapping is yet another myth: deictic expressions overlapped in time with pointing in only 25% of cases in the empirical study of Oviatt et al. (1997). In actual fact, gesturing would often appear to precede spoken inputs. The presence of a degree of synchronisation between signals should not be misunderstood, since synchronisation is not co-occurrence.

Myth No. 4: Speech is the primary input mode in any multimodal system that includes it. Another commonplace belief is that speech is the primary mode, and that the presence of different input modalities should therefore be considered a sort of compensation, i.e. redundant modes which can "take over", especially if the primary mode (speech) is degraded. This myth should remain just that: a myth. In fact, there are modes which can convey information not efficiently conveyed by speech (e.g. spatial information). Moreover, as previously stated, different modalities are used in a very articulated and not necessarily redundant manner; for example, many gesture signals precede speech.

Myth No. 5: Multimodal language does not differ linguistically from unimodal language. According to Oviatt (1999), every language has its own peculiarity. For example, pen/voice language seems briefer and syntactically simpler than unimodal speech. 
When the users are free to interact using the preferred modality of their choice, they are likely to selectively avoid linguistic complexities. Myth No. 6: Multimodal integration involves redundancy of content between modes. Project MMF EEC Note No. 01/07 3

12 EUROCONTROL Multimodal Interfaces: a Brief Literature Review Multimodal communication can be considered as a means of "putting together" content in a complementary manner rather than redundantly. Different modes contribute to different and complementary information. As stated above, locative information is often written, while subjectverb-object information is more likely to be spoken. Multiple communication modes do not imply duplicate information. Myth No. 7: Individual error-prone recognition technologies combine multimodally to produce even greater unreliability. In general it is thought that when using error-prone recognition technologies (such as speech and pen-input recognition) many composite errors will be produced. In fact, multimodal systems are reasonably robust to errors. Users naturally know when and how to use a given input mode (instead of another) in the most efficient manner. They are likely to deploy the most effective mode and avoid using the more error-prone input. Oviatt (1999) also speaks of mutual disambiguation of two input signals, i.e. the recovery of errors thanks to the interpretation of two input signals. For example, if the system recognises not only the word "ditch" but also a number of parallel graphic marks, then the word "ditch" will be interpreted as "ditches". Myth No. 8: All users' multimodal commands are integrated in a uniform way. Users are characterised by individual differences and deploy different strategies while interacting in accordance with their preferences. Multimodal systems, then, should detect these differences and adapt to the users' dominant integration patterns. Myth No. 9: Different input modes are capable of transmitting comparable content. From a technology-oriented perspective, the various modes might appear to be interchangeable, able to efficiently transmit comparable content. This is not the case. Every mode is unique; the type of information transmitted, the way it is transmitted, and the functionality of each mode during communication is specific. Myth No. 10: Enhanced efficiency is the main advantage of multimodal systems. It has not been demonstrated that efficiency is a substantial gain, unless restricted to spatial domains, i.e. during multimodal-pen interaction in a spatial domain a 10% speed-up gain was obtained in comparison to a simple speech-only interface (Oviatt, 1997). Apart from efficiency, however, there are other substantial advantages characterising multimodal interfaces. For example, such interfaces are more flexible (the users can switch among modalities and make choices), multimodal interfaces can accommodate a wider range of users and tasks than unimodal interfaces, etc Human-computer communication channels (HCCC) When speaking of multimodal interfaces, it is important to define some basic notions concerning the input/output (I/O) modalities implied in the interaction between the interfaces (or, in a more general sense, computers) and humans. According to Schomaker et al. (1995) we can identify four I/O channels (cf. Figure 1). The HOC and CIM describe the input, while COM and HIC define the output (or feedback). 4 Project MMF EEC Note No. 01/07

[Figure 1 in the original shows the four human-computer communication channels — Computer Input Modalities (CIM), Human Output Channels (HOC), Computer Output Media (COM) and Human Input Channels (HIC) — connected across the interface by the intrinsic perception/action loop and the interaction information flow.]

Figure 1: Basic model for HCCC (adapted from Schomaker et al., 1995)

There are two processes involved in interaction: perception (entailing human input and computer output) and control (entailing human output and computer input). Information flow can be defined as the sum of cross-talking perception and control communication channels; however, the complexity of human information processing channels hinders the elaboration of a model describing multimodal integration, and the improvements in multimodal interface design can only be supported by empirical results (Popescu, Burdea & Trefftz, 2002). This model introduces interaction and communication flows between humans and computers. In fact, there are some attempts in the literature to provide an "organised vision" of multimodal interfaces, focusing either on the human or on the system.

Human-centred perspective

Raisamo (1999) provides a classification of two different perspectives that seem to guide the development of multimodal interaction. The first view is human-centred. It builds on the idea that modality is closely related to the human senses. Raisamo makes reference to the classification by Silbernagel (1979), in which the senses and their corresponding modalities are listed.

Table 1: Classification of senses and modalities (adapted from Silbernagel, 1979)

Sensory perception | Sensory organ        | Modality
Sense of sight     | Eyes                 | Visual
Sense of hearing   | Ears                 | Auditive
Sense of touch     | Skin                 | Tactile
Sense of smell     | Nose                 | Olfactory
Sense of taste     | Tongue               | Gustatory
Sense of balance   | Organ of equilibrium | Vestibular

14 EUROCONTROL Multimodal Interfaces: a Brief Literature Review A multimodal system thus implies that the user will make use of different senses (e.g. auditive, visual, tactile, etc.) to interact with the system, and that interaction might be enhanced and eased if based on a plurality of sensorial channels System-centred perspective The second view reported by Raisamo is system-centred. In order to define this perspective in multimodality, Raisamo refers to the definition provided by Nigay and Coutaz (1993): "Modality refers to the type of communication channels used to convey or acquire information. [ ] Mode refers to a state that determines the way information is interpreted to extract or convey meaning". Nigay and Coutaz adopt this system-centred definition and stress the difference between the two approaches, providing a mailing system (the NeXT) by way of an example. This system allows the user to send mail conveying graphics, text and voice messages; "from the user's point of view this system is perceived as being multimodal: the user employs different modalities (referring to the human senses) to interpret mail messages". However, adopting a system-centred view, "the NeXT system is not multimodal" (Nigay & Coutaz, 1993). The system-centred perspective of Nigay and Coutaz takes into account those salient features which would appear relevant to the design of multimodal systems: 1. the fusion of different types of data from/to different input/output devices; and 2. the temporal constraints imposed on information processing from/to input/output devices Design space for multimodal systems Starting from the above premises, Nigay and Coutaz aim to provide a systematic description of multimodal interfaces, presenting a framework classifying the features of multimodal interfaces, from a software-engineering perspective. Figure 2: Design space (after Nigay & Coutaz, 1993) According to this model, there are three main axes along which a system can be classified. 1. Use of modalities: This term refers to the temporal availability of the modalities of a system, which can be either sequential (one after the other) or parallel (at the same time). 2. Fusion: This term refers to how the types of data (and thus various modalities) can be combined. 6 Project MMF EEC Note No. 01/07

3. Level of abstraction: This term refers to the level of abstraction of the data to be processed by a system. For example, a speech input can be recorded by the system either as a simple signal or as a meaningful sentence. This definition is valid for both input and output messages. A vocal output message can thus be "replayed from a previous recording" or actually "synthesised from an abstract representation of meaning". However, it is unclear why this distinction is introduced in this framework, because "a multimodal system automatically models the content of the information". By definition, therefore, "a multimodal system falls in the meaning category" of multi-feature space design.

An example of classification

Since the value of the level of abstraction of the framework is necessarily meaning, the remaining combinations of fusion and use of modalities provide four categories: exclusive, alternate, concurrent and synergistic. Nigay and Coutaz explain these categories with an example, the NoteBook, which is a personal book allowing users to create, browse and edit notes. The NoteBook implements multimodal interactions. For instance, the user can insert a note with a vocal command ("insert a note"), simultaneously selecting the place of insertion with the mouse (i.e. two input modalities). The insertion of the note is done by typing (a single modality). Browsing can be done by clicking on dedicated buttons or by vocal commands (two modalities). Clearing a note can be done either by pressing a "clear" option with the mouse or by vocal command (two modalities).

According to the model of Nigay and Coutaz, every classification is based on a set of features (fi), and every feature has a weight (wi) based on a number of importance criteria, namely that feature's frequency of use (rare, normal, frequent or very frequent). Every feature can be defined according to its weight (w) and the position (p) which it occupies in the design space, following the formal rule: fi = (pi, wi). The main goal here is to attempt to define the position of a certain system within the design space. The position (C) corresponds to the centre of gravity of the system's features. This definition is also formalised, with the following equation:

C = (Σi wi pi) / (Σi wi)

The representation of the features of the NoteBook system is given in the following diagram. Feature <1> ("insert a note") occupies the synergistic part of the design space, since it is provided via two simultaneous modalities; moreover, it is used frequently. The feature is thus defined by the couple "synergistic and frequent" use. The second feature, <2> ("edit the content of a note"), is characterised by the couple "exclusive [since only one modality is available for this feature] and very frequent" use. The third and fourth features, <4> and <3>, correspond respectively to "browsing" and "clear the note". Feature <4> is exclusive (two modalities are available, but only one command type can be integrated by the system) and used very frequently; feature <3> is also exclusive, but is rarely used. By applying the formula provided above, the centre of gravity, C, is obtained: NoteBook is very close to the exclusive category of multimodal system.
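To make the centre-of-gravity computation concrete, here is a minimal sketch. The numeric coordinates for the design-space plane and the weights attached to each frequency label are illustrative assumptions, not values given by Nigay and Coutaz.

```python
# Hypothetical encoding of the design-space plane:
#   x = use of modalities (0 = sequential ... 1 = parallel)
#   y = fusion            (0 = independent ... 1 = combined)
# Weights encode frequency of use; the numeric values are illustrative only.
FREQUENCY_WEIGHT = {"rare": 1, "normal": 2, "frequent": 3, "very frequent": 4}

# NoteBook features as (name, position (x, y), frequency of use)
features = [
    ("insert a note", (1.0, 1.0), "frequent"),        # synergistic
    ("edit a note",   (0.0, 0.0), "very frequent"),   # exclusive
    ("browse notes",  (0.0, 0.0), "very frequent"),   # exclusive
    ("clear a note",  (0.0, 0.0), "rare"),            # exclusive
]

def centre_of_gravity(features):
    """C = (sum_i w_i * p_i) / (sum_i w_i) over all features f_i = (p_i, w_i)."""
    total_w = sum(FREQUENCY_WEIGHT[freq] for _, _, freq in features)
    cx = sum(FREQUENCY_WEIGHT[freq] * pos[0] for _, pos, freq in features) / total_w
    cy = sum(FREQUENCY_WEIGHT[freq] * pos[1] for _, pos, freq in features) / total_w
    return cx, cy

print(centre_of_gravity(features))  # (0.25, 0.25): close to the exclusive corner
```

With these illustrative numbers the centre of gravity falls near the exclusive corner of the design space, mirroring the NoteBook conclusion above.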

16 EUROCONTROL Multimodal Interfaces: a Brief Literature Review Figure 3: The NoteBook example within the design space (after Nigay & Coutaz, 1993) The authors claim that the use of such a design space may fulfill three goals: 1. to make explicit the ways in which the different modalities are supported by a system; 2. to locate a system within a certain part of the design space; and 3. to provide a means of evaluating the usability of a system. To summarise, this model seems useful in that it attempts to provide a systematic framework to describe and analyse multimodal interfaces. Moreover, it aims to support the software engineers in assessing usability issues and problems during the very early phase of the system's design. However, the model should be further refined in order to provide more "practical" tools to properly address the usability of a given system. More specifically, the model should: 1. make a distinction in relation to the "weight" of multimodalities between the I/O conditions; 2. provide a set of criteria to assign values to the weights and positions of the features; 3. combine the design space with a detailed model of the users, in order to adequately determine the weight of every feature; and 4. provide a set of rules identifying what is the "best" centre of gravity (e.g. why should a more synergistic interaction be better than an exclusive one? under what conditions? etc.). 8 Project MMF EEC Note No. 01/07

4. DEVICES: A SHORT SUMMARY

A detailed technical overview of every single technology is beyond the scope of this document. However, in order to afford a better understanding of the most common technologies for multimodal interfaces, this section provides a quick reference guide to the tools and devices (for both input and output) used in the multimodal research mentioned in the empirical studies included in the present document.

4.1. INPUT

Nowadays, there are a number of technologies allowing an alternative interaction style to the traditional keyboard and mouse. Special pens are widely used input devices for multimodal interaction (e.g. consider the pointing and selection tasks which can be carried out with pens for PDAs). These pens can be extended with sophisticated technologies for enhanced capabilities such as character recognition. However, these capabilities often impose constraints on interaction. For instance, fully unconstrained handwriting is not yet very well recognised. In most cases, written input is geometrically constrained, requiring users to write in a much more regular fashion (which implies a learning effort and demands accuracy).

Other input devices rely on gesture recognition. For examples of this interaction type, see section 2.2 (the pointing detected by Bolt's "Put-That-There") or even more novel and advanced interfaces such as HandVu (Kölsch, 2004), which is capable of detecting the hand in a standard posture, tracking it and recognising key postures. Another example is that of eye-tracking technologies, in which the movement of the user's eyes is sensed and the gaze direction interpreted and mapped onto an electronic space. Despite the initial "attractiveness" of gaze-tracking technologies, several limitations hinder the efficient exploitation of this means, as the extraction of semantics from gaze movements is ambiguous (e.g. it is difficult to discriminate efficiently between unconscious eye movements and intentionally driven movements).

Audio signals can also be used for inputting data into a multimodal system. Speech has great potential for multimodal interaction, and many available systems are capable of parsing and recognising natural language speech with great accuracy (Dragon, 1999). However, the available systems do not allow completely unconstrained use of natural language; they permit the use of a limited vocabulary set only, implying a certain degree of user effort to learn and recall the available commands.

There are also touchpads and touch screens. This type of device allows the user to directly point at objects (on special "sensitive" screens) and select them simply by touch. Other types of haptic devices exist as well, such as Teletact (Stone, 1991) and PHANToM (Massie & Salisbury, 1994). Teletact is a system comprising two data gloves, one with tactile feedback, used for outputs to the user, and a second used for inputting information to the computer. PHANToM is a 3D input device that can be operated with the fingertip.

4.2. OUTPUT

There are several strategies for implementing outputs and feedback. Besides visual output (which does not need to be explained in this report), acoustic or haptic feedback might be mentioned. But the first and most intuitive is speech.

Available technologies for Voice-over-IP (VoIP) or teleconferencing systems allow the generation, transmission and reception of speech (Thompson, 2003). However, sounds can also be used in a simpler but nevertheless sophisticated manner. For example, auditory icons (Gaver, 1994) are sounds which make use of metaphors and analogies with everyday sounds, while earcons are acoustic signals which, by analogy with verbal language, make use of rhythm or pitch to denote certain meanings. Sonification is the transformation of data relations into perceived relations in an acoustic signal for the purposes of facilitating communication or interpretation (Kramer et al., 1997). Another way in which sound can be used is known as sound spatialisation, a technique using sound filtering to map the sound to its 3D source position, so that it can be perceived as originating from different sources. The main drawback of this technology is that it cannot account for individual characteristics; in some cases, therefore, the sound information is not correctly mapped.

Haptic devices such as Teletact and PHANToM (cf. 4.1) are provided with tactile feedback or force-feedback capabilities. These systems therefore address users' interaction in "two ways", both supporting the inputting of data and providing haptic feedback as the result of the interaction.
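To make the sonification idea concrete, the sketch below maps a small data series onto tone pitch and writes the result to a WAV file. The linear value-to-frequency mapping and all parameter values are illustrative assumptions, not a prescribed technique.

```python
import math
import struct
import wave

def sonify(values, filename="sonification.wav", rate=44100, note_s=0.25,
           f_min=220.0, f_max=880.0):
    """Map each data value linearly onto a pitch between f_min and f_max
    and render one short sine tone per value (a simple parameter mapping)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    samples = []
    for v in values:
        freq = f_min + (v - lo) / span * (f_max - f_min)
        for n in range(int(rate * note_s)):
            samples.append(0.4 * math.sin(2 * math.pi * freq * n / rate))
    with wave.open(filename, "w") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"".join(
            struct.pack("<h", int(s * 32767)) for s in samples))

# Example: a rising and falling trend becomes a rising and falling melody
sonify([1, 3, 5, 8, 6, 4, 2])
```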

5. NON-ATM DOMAIN

In the literature, the number of multimodal interfaces and technical attempts to develop forms of multimodal interaction is fairly large. However, only a small part of the literature presents results concerning structured evaluations of multimodal systems. As stated in section 3, the claim that "multimodal is better" is somewhat sterile unless it is supported by empirical results showing that user performance and/or subjective acceptability are enhanced. Ultimately, what is interesting is to understand how the distribution of I/O information across different communication channels can improve people's ability to interact efficiently. Therefore, this section reviews relevant studies that provide consistent and potentially generalisable empirical results.

EMPIRICAL RESULTS

Do people interact multimodally?

An important study involving multimodal interaction, carried out by Oviatt, DeAngeli and Kuhn (1997), aimed to identify when people are more likely to interact multimodally. The study entailed a simulation in which subjects could interact using pen and speech input, while accomplishing some tasks with an interactive map. The system was simulated, which means that a (hidden) assistant was actually providing the correct feedback in reply to the subjects' inputs. The results indicate that the subjects had a strong tendency to prefer multimodal interaction (95% of cases). Multimodal interactions were likely to take place when location or spatially based commands were involved, and unimodal interaction was mostly used for selection commands. It would appear that the tools used to interact had a close relation with the type of task to be performed (and this provides information about the actual properties which a specific tool needs to implement in order to guarantee more efficient and usable interaction). For example, the authors discovered that a pen was mostly used to draw rather than to write. Pen inputs were used to convey spatial information, and speech constituted a duplicate of the gesture-based information in only 2% of cases. According to the authors, this implies that speech and gestures are used in a complementary manner, rather than providing mere redundancy of input data. The analysis of the semantics of the language used by the subjects indicates that most multimodal constructions (59% of cases) did not contain any spoken deictics. Moreover, spoken deictics overlapped in time with pen inputs in only 25% of cases. Another result of the study is that the subjects would appear to have followed a precise pattern when multimodal interaction was involved: the writing of inputs preceded speech commands, suggesting that pen annotations supported the understanding and elaboration of spatial data, and that only when this "spatial understanding" was achieved could the users speak about it.

Multimodality and task difficulty

Another study, carried out by Oviatt, Coulston and Lunsford (2004), aimed to empirically evaluate an important assumption supporting the idea of multimodal interfaces. Concretely, the authors asked the question: "is multimodal interaction an efficient support to difficult tasks?" The evaluation entailed interaction with a simple two-dimensional system displaying information to be used to manage flood emergencies. Basically, the users had to input information in order to logistically manage rescue teams and facilities, gather information, etc. The task entailed various degrees of difficulty (controlled through the number of location-based pieces of information to be managed by the users). The subjects could interact via a pen, speech or both. The results showed that, in general, the subjects were likely to interact multimodally (61.8% of cases) and that in the majority of cases the modalities were simultaneous. Moreover, as the tasks increased in difficulty, the users were more likely to deploy multimodal interaction strategies (from 59.2% up to 75.0% in highly difficult tasks). This result is very important because it indicates that, as the degree of difficulty increased, the users faced the complexity of the task by distributing the load across multiple modalities.

Mutual disambiguation

The results discussed above were collected with simple two-dimensional systems in which interaction with the system was simulated or, at most, entailed very simple gesture interpretation. However, another claim of multimodality is that the possibility of processing information across more than one channel also enhances the disambiguation of input information. This claim is supported by the empirical evidence of a study carried out by Oviatt (1999). Oviatt defines a set of reasons explaining why a multimodal system should be better at handling errors than a unimodal interface:

1. If users are free to choose, they will probably select the input mode which they judge to be least error-prone.
2. When interacting multimodally, utterances are fewer and briefer; the degree of spoken-language complexity can thus be reduced.
3. Recovery is facilitated by multimodal interaction: if an input mode does not yield the desired results, users can switch to another modality in order to avoid repeated system failures.
4. Users report a lesser sense of frustration when they use a multimodal system (Oviatt, Bernard & Levow, 1998).
5. The last reason in support of multimodal systems could be "mutual disambiguation (MD) of input signals, that is, the recovery from unimodal recognition errors within a multimodal architecture".

In order to properly address this last issue, Oviatt included in the study non-native (or accented) English speakers. She thus assessed whether mutual disambiguation was higher for this type of user, hypothesising that degraded speech recognition (more likely to happen for non-native English speakers) would be disambiguated by gestural inputs more often than for native English speakers. An adaptation of the QuickSet pen-and-voice system (Cohen, Johnston et al., 1997) was implemented. A closed set of commands was available (and the vocabulary of the system was also limited, including 9 types of gestures, 400 spoken commands and 200 multimodal utterances). Using these commands, the subjects were engaged in two sets of tasks involving emergency fire and flood logistics.

21 Multimodal Interfaces: a Brief Literature Review EUROCONTROL The subjects were instructed to interact multimodally with the system (using both pen and spoken inputs). Since recognition errors were possible (e.g. the system could fail to correctly interpret spoken commands), the users were instructed to repeat their input up to three times, and then to proceed to the next task if the failure persisted. The results indicate that (approximately) 25.2% of multimodal utterances contained parse-level mutual disambiguation for native speakers, but this figure rose to more than 30% for accented speakers, confirming a significant difference between the two groups of users. In addition, the speech-recognition degradation was affected by the speakers' status. These results suggest that, when the working conditions are challenging, mutual disambiguation could support a more efficient and flexible option for reducing system failures and a more reliable means of sustaining users' needs. In fact, users' multimodal interaction styles can vary substantially, suggesting that the design of multimodal interfaces should be tailored to each user's model. For example, it was discovered that the use of sequential or simultaneous multimodal construction is affected by people's ages (Xiao, Girand & Oviatt, 2002; Xiao, Lunsford, Coulston, Wesson & Oviatt, 2003). The following sections describe the results of studies that have tackled the issue of multimodality at a more "basic" level, mainly investigating how the co-occurrence of different input signals affects subjects' perceptual/cognitive abilities Memory A recent study addressed the issue of how multimodality impacts on memory performances (Stefanucci & Proffitt, 2005a). Stefanucci and Proffitt's study was motivated by two main assumptions. The first assumption was that memory performance would be improved if there was congruence between the learning and retrieval phases. This congruence took account of several factors, one of which was the learning context in which information was learned: contextual cues present during encoding would play a role in later retrieval (Smith & Vela, 2001). The second assumption was that the presence of auditory cues during the learning phase would support memory during the testing phase (Davis, Scott, Pair, Hodges & Oliverio, 1999). Consequently, Stefanucci and Proffitt carried out a number of experiments in order to assess whether contextual and auditory cues would impact on memory performance. The InfoCockpit (Tan, Stefanucci, Proffitt & Pausch, 2001) is a system including three flat monitors and a large projection screen. This tool was used to test memory performance. Subjects involved in the "multimodal condition" were asked to learn a list of words displayed on the three monitors, while contextual images were displayed on the projection screen (the images were of the Lawn at the University of Virginia) and coherent sounds were played to accompany the projection images (e.g. dogs barking or birds). The other group of subjects had to learn the lists of words from a standard computer with one monitor and no additional cues. The retrieval phase took place 24 hours after the learning task. During the retrieval, no cues were present; however, the performance of the subjects who learnt the list of words under multimodal conditions was superior (by 56%). 
Another study (Stefanucci & Proffitt, 2005b) attempted to assess the contribution of isolated cues (sights and sounds) to memory performance, in a four-condition experiment (multimodal with sights and sounds; unimodal with sights only; unimodal with sounds only; and a simple computer desktop). Again, the results showed that performance was superior under multimodal conditions, indicating that the presence of sounds provided a strong cue for binding the visuals to the information learnt.

The McGurk effect

The combination of auditory and visual information has also been investigated in perception studies. One famous example of such an investigation illustrated what is known as the "McGurk effect" (McGurk & MacDonald, 1976). An auditory stimulus (ba) was paired with a visual stimulus (ga): the consequence of this combination was that the subjects reported hearing the sound da. However, if the stimuli were presented the other way around (auditory = ga, visual = ba), the subjects reported hearing the sound bga. An obvious implication of this discovery is that information processing from one modality can affect the perception of information processed from another modality. The argument is controversial, and to date opinions have varied on the impact of the McGurk effect (and similar subsequent experimental research). The McGurk effect has at times been used to support the idea of a dominant modality of perception (i.e. vision); a number of researchers have, however, questioned this assumption (Massaro, 1998).

Cross-modal interactions

Modalities are not separate modules "working" independently (Shimojo & Shams, 2001), and cross-modal effects are determined by more complex mechanisms than a simple hierarchical dependency in which vision predominates over the other senses. Further empirical evidence supports this idea. For example, animal models suggest the presence of cross-modal plasticity (Sur, Pallas & Roe, 1990); Sadato et al. (1996) discovered that in blind subjects the primary and secondary visual cortical areas were activated by tactile tasks, whereas they were deactivated in the control group. In a PET study, Renier et al. (2005) trained subjects to use a prosthesis (substituting vision with audition) in order to recognise objects and estimate depth distances. The results indicate that some areas of the visual cortex are relatively multimodal and that senses other than vision could be involved in the processing of depth; in addition, there is evidence suggesting a transformation of visual input into auditory information in working-memory processing (Suchan et al., 2005). So far, however, it has proved difficult to provide any general clarification of cross-modal effects, and various theoretical explanations (as well as empirical evidence) have been proposed. For example, Massaro (1998) suggests that all the modalities contribute to the perceptual experience, and that the greatest influence is exerted by the least ambiguous information source. Other hypotheses claim that cross-modal effects can be explained in terms of the "ecology of the combined stimuli". A study carried out to assess the impact of sounds on visual stimuli (Sekuler, Sekuler & Lau, 1997) showed that sounds have great influence in determining the perception of "ambiguous motion" (the "bouncing illusion"). Two targets moving along crossing trajectories are perceived as streaming through (and overlapping) each other, but if a sound is added at the "crossing point" of the two targets, they are perceived as bouncing away and following diverging paths. This effect would appear to be related to the ecological perception of collisions, for which a sound is naturally associated with objects clashing. Another hypothesis (attempting to provide a general framework for the understanding of cross-modal effects) is put forward by Welch and Warren (1980), who argue for the "modality appropriateness hypothesis". 
In this approach, the most appropriate modality for a given task is also the dominant modality (i.e. vision for spatial resolution tasks and audition for temporal resolution tasks). Some evidence is provided by Scheier, Nijwahan and Shimojo (1999). They empirically attempted to assess the influence of sounds on visual stimuli. When two lights were turned on with a temporal delay, subjects' judgment about the temporal order in which the lights were turned on improved if one sound preceded and another sound followed the visual stimulus (audio-visual-audio-visual). However, the performance broke down if the sounds were placed 14 Project MMF EEC Note No. 01/07

23 Multimodal Interfaces: a Brief Literature Review EUROCONTROL between the visual stimuli (visual-audio-audio-visual). This result suggests that temporal resolution can be altered by the order in which sounds are presented in the sequence of acoustic/visual stimuli, so that when two sounds are inserted between two visual stimuli, the temporal resolution task performance worsens. However, this explanation cannot be accepted as fully exhaustive with regard to cross-modal phenomena. In fact, further studies have shown that the "bouncing illusion" is also present when other modalities are associated with the visual stimuli. For instance, Watanabe (2001) and Watanabe and Shimojo (1998) suggest that the bouncing illusion is also induced when a haptic feedback or a visual stimulus is presented, synchronised with the "crossing point" of the two moving targets. Watanabe (2001) explains this effect in terms of saliency, as follows: "Any salient sensory transient around the moment of the visual coincidence biases visual perception toward bouncing". To summarise, the authors state that what "affects the direction of cross-modal interactions is the structure of the stimuli instead of the appropriate use of modalities" (Shimojo & Shams, 2001). Project MMF EEC Note No. 01/07 15

24 EUROCONTROL Multimodal Interfaces: a Brief Literature Review 6. ATM MULTIMODAL INTERFACES 6.1. DIGISTRIPS Mertz & Vinot (1999) Mertz, Chatty & Vinot (2000a) Mertz, Chatty & Vinot (2000b) The papers listed above provide some suggestions for the design of new concepts and innovative interfaces for ATC, implementing novel interaction and display techniques. The work is embedded in a larger project, the Toccata project, carried out at CENA (Centre d'etudes sur la Navigation Aérienne [= Centre for Air Navigation Studies]) in Toulouse and Athis-Mons. The Toccata project federates the results of several years of ATC research at CENA. According to the authors, there is still a long way to go in the design of innovative tools for ATC. Previous efforts have failed to fully exploit the opportunities offered by state-of-the-art technologies and transpose them into an acceptable operative form. However, there are examples of innovation. For example, DigiStrips represent a new approach to the design of electronic strips. DigiStrips are based on the idea that strips should not simply be transposed into a digital form but should also be augmented in such a way as to preserve the same manipulative and interactive properties as paper strips (Mackay, Fayard, Frobert & Médini, 1998). Figure 4 represents one of the features which a digital strip should maintain, i.e. the possibility to use one's hands to manipulate and change the strip position: Figure 4: The manipulation properties preserved in DigiStrips (after Mertz & Vinot, 1999) Interaction with DigiStrips is easier and more natural than mouse interaction, because it deploys direct manipulation supported by a touch screen. In addition, the interaction is enhanced with the help of expressive and meaningful animations. The DigiStrips experience allowed the authors to suggest a set of guidelines for the design of innovative tools for ATC. Graphical design 1. Appropriate fonts and good graphical design can increase the amount of displayable information (an accurate choice of these details allows more information to be displayed on a single screen). 2. Texture and colour gradation can code information (different nuances of colours and textures can be used to code information; for example, in DigiStrips, a different texture codes the area of the strip which can be handled and moved). 3. Different fonts can convey different types of information (in DigiStrips, different fonts were used to display system-computed information and users' input data). 16 Project MMF EEC Note No. 01/07

25 Multimodal Interfaces: a Brief Literature Review EUROCONTROL Animations 1. Animations can be strategically used, for example to display status changes or transitions. 2. Animations can also be used for menus (used to input data into the system). For instance, past studies have suggested that controllers can feel uncertain about which flight on the radar a given menu applied to (Marais, Torrent-Guell & Tremblay, 1999). Graphical animation displaying the opening and closing of the menu, with a smooth transition, provides reinforced feedback on which specific flight the menu (and thus the input) was applied to. 3. Animations can also notify users that events (e.g. the arrival of a new strip) have taken place and provide "intermediate feedback", e.g. while the user is waiting for the system's "final feedback". The authors give a number of suggestions on how to use touch-screen devices for ATC. The touch screens allow for direct manipulation of objects. Compared to traditional mouse interaction, this requires less visual attention from the users because visual tracking of the mouse pointer is not necessary. Since the members of a working team can easily see what a colleague is doing (moving a strip within an electronic strip bay), the gesturing implied by a touch screen could enhance mutual awareness. In addition, touch screens can be shared. Other innovative devices such as pen input could also enhance interaction with strips; however, current technology does not allow for efficient interpretation of letter-like or digit-like strokes. DigiStrips implement gesture input (via a pen or fingers) in a very simplified form (simple short lines or arrow-like shapes). Figure 5 displays the simplified strokes available in DigiStrips (although not all of the strokes displayed in the picture are available). Figure 5: DigiStrips strokes (after Mertz, Chatty & Vinot, 2000a) These strokes are used to open special menus containing information which can be inputted in relation to a given flight. For example, in order to input headings, a right-bound stroke opens a "turn-to-right menu" (cf. Figure 6); then, the controller can choose the appropriate information from a selection of heading values. Figure 6: Simple strokes to open menus (after Mertz, Chatty & Vinot, 2000a) Users can input written information on the DigiStrips, but no character recognition is implemented. In this interface, therefore, writing represents a simple enhancement to interaction and perhaps a means of supporting memory in relation to actions performed and past events. Project MMF EEC Note No. 01/07 17
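As an illustration of the kind of simplified stroke interpretation described above, the sketch below classifies a stroke by its dominant direction and maps it onto a menu action. The thresholds, command names and mapping are illustrative assumptions, not the DigiStrips implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Hypothetical mapping from a stroke's dominant direction to a menu action.
STROKE_COMMANDS = {
    "right": "open turn-to-right heading menu",
    "left":  "open turn-to-left heading menu",
    "up":    "open climb menu",
    "down":  "open descend menu",
}

def classify_stroke(points: List[Point], min_length: float = 30.0) -> str:
    """Classify a short, roughly straight stroke by comparing its horizontal
    and vertical extent; strokes that are too short are treated as annotations."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    if (dx * dx + dy * dy) ** 0.5 < min_length:
        return "free annotation (no command)"
    if abs(dx) >= abs(dy):
        direction = "right" if dx > 0 else "left"
    else:
        direction = "down" if dy > 0 else "up"   # screen y grows downwards
    return STROKE_COMMANDS[direction]

# Example: a mostly horizontal, right-bound stroke drawn on a strip
print(classify_stroke([(10, 40), (30, 42), (80, 45)]))
```

The point of such a scheme is exactly the trade-off mentioned above: by restricting recognition to a few unambiguous shapes, the system avoids the unreliability of full handwriting recognition.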

26 EUROCONTROL Multimodal Interfaces: a Brief Literature Review DigiStrips evaluation A simplified version of DigiStrips was formally evaluated, with a comparison of some "manipulation tasks" (changes to the position of the strips) in two situations: hand interaction and mouse interaction. The results showed quicker performance with touch-based interaction (from 10% to 14% quicker), even though the "moving strips" animations caused a number of time delays in the interaction with the touch screen. However, the authors inform us of a number of errors which the subjects were likely to make with the novel interface. These included: 1) false releases of the strips, caused by insufficient pressure of the subjects' fingers on the touch screen; this type of error was reduced after some practice with the interface; 2) parallax errors, i.e. the selection of the target strip was sometimes incorrect. This effect was mostly due to the type of touch screen used (cathodic). The use of flat screens should reduce the problem. The authors point out that the errors were mostly determined by a lack of experience with the touch screens (insufficient pressure on the screen) and by technical limitations (the screen was not flat), and that adequate training and the use of an appropriate screen type could reduce the problems THE ANOTO PEN The SkyPen project attempted to analyse the applicability of a special tool, the ANOTO pen, as an input device for marking and annotating paper strips (Begouin, 2002). The ANOTO pen was equipped with a special camera (placed at the tip end of the pen, capable of tracking the written signs) and a character recogniser. Unfortunately, this technology was not suitable for interpreting the marks written on a paper strip. Fig. 8 shows a schematic representation of the limitations of the ANOTO pen. The camera is coupled (very closely) to the tip of the pen and records information. However, the height of the strip is very small and, if the tip of the pen approaches the borders of the strip, the camera is displaced out of the strip area; the written information is thus not visible and cannot be tracked or interpreted. This technology would better work if the marks were annotated at the very centre of the paper strip (Hering, 2005). In real working settings, the way the paper strip is marked is reasonably free and, in a way, hectic (cf. Figure 7); it would thus not be feasible to constrain controllers' movements without paying a cost in terms of execution speed and efficiency. Figure 7: An annotated strip (after Mertz & Vinot, 1999) Figure 8: The ANOTO pen technical limitations 18 Project MMF EEC Note No. 01/07

27 Multimodal Interfaces: a Brief Literature Review EUROCONTROL 6.3. VIGIESTRIPS Pavet & Garron Coullon (2001) Salaun, Pene, Garron, Journet & Pavet (2005) Garron, Journet & Pavet (2005) This work was motivated by the dramatic increase in the number of paper strips managed in the tower. As the number of strips which can be safely managed was sometimes exceeded, the authors tried to address the issue. Inspired by the DigiStrips experience, the authors tried to take into consideration two main ideas: 1. Maintenance of the paper strips' cognitive value 2. Use of the latest interaction technologies while producing an interface which preserves the original manipulation requirements of paper strips (i.e. use of touch input devices, animations, gesture recognition) An in-depth analysis of paper strips use was therefore carried out. Specifically, the authors aimed to discover how controllers use and manipulate the strips, and to define the most important properties (both physical and cognitive) of these tools. Thanks to recordings obtained by a wearable camera, the authors discovered that 80% of the controllers' attention was directed "inside the tower", thus underlining the importance of the strip bay in the controllers' tasks. According to the authors, the strip board was a sort of "thinking space" characterised by a set of important properties, such as: 1. the manipulation of strips by controllers in order to build representations of reality; manipulating the strips means acting on the aircraft. The authors state that, as in the DigiStrips approach, the "manipulative properties" of the strip should be preserved; 2. promotion by the strips of active monitoring; manual manipulations and annotations reduce the cognitive loads necessary to memorise the current situation and strengthen memorisation of controllers' past actions. Strips are also planning aids; they are a sort of "organiser" allowing the controllers to be ahead of the situation. Moreover, they act as a reminder: since they are continuously checked, they allow the controllers to spot forgotten items or properties; 3. the collaborative nature of strips as a medium. The exchange of strips among controllers is often accompanied by vocal warnings, and every strip is placed in a special position on the strip bay. This coded action represents a transfer of knowledge between controllers; 4. the permanence, tangibility, and reliability of strips. In line with the results of the field study, the authors suggest that: 1. for every aircraft, a single electronic strip should be produced; 2. the size of the strip could be variable in relation to the status of the strip; 3. the possibility of manually manipulating the strip should be retained; 4. controllers should be able to input data onto the strip, following current annotation mechanisms; and 5. free annotations should be allowed. Project MMF EEC Note No. 01/07 19
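To illustrate how these design recommendations might translate into a data model for an electronic strip, here is a minimal sketch; the field names, types and sizing rule are assumptions for illustration and do not describe the Vigiestrips software.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ElectronicStrip:
    """One strip per aircraft (recommendation 1), with a status-dependent size
    (2), a freely chosen position on the strip board (3), structured clearance
    inputs (4) and free-hand annotations (5)."""
    callsign: str
    status: str = "pending"                       # e.g. pending / cleared / airborne
    position: Tuple[float, float] = (0.0, 0.0)    # where the controller placed it
    clearances: List[str] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)   # free pen annotations
    runway: Optional[str] = None

    def size(self) -> Tuple[int, int]:
        # Illustrative rule: active strips are drawn larger than pending ones.
        return (320, 60) if self.status == "cleared" else (320, 40)

strip = ElectronicStrip("AFR1234")
strip.clearances.append("cleared to line up runway 26R")
strip.status = "cleared"
strip.annotations.append("hold short: crossing traffic")
print(strip.callsign, strip.size(), strip.clearances)
```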

The availability of new tools supporting the management of strips could enhance current working procedures to an extent. Having acquired knowledge of actual ATC working procedures in the tower, and inspired by the experience of DigiStrips, which introduced the idea of new interaction modalities, the authors designed a system which implements electronic strips within a touch-input device (shown in Figure 9).

Figure 9: Vigiestrips main working areas (after Salaun, Pene, Garron, Journet & Pavet, 2005)

The operator is allowed to freely manipulate and organise the strips. Animations support the tracing of the strips' movements; moreover, a number of special functions make it possible to dynamically anticipate the final position of the strip currently being handled, thanks to a "ghost strip". An automatic function allows the precise lining-up of strips (even if the position determined by the user is not 100% precise). A special function allows the strip to be placed in the central part of the strip column, and an anti-overlapping function prevents the occlusion of strips. Vigiestrips can be integrated within a real working environment and can be coupled with an A-SMGCS system, for example displaying consistent alert messages on both displays. Furthermore, it allows free annotations on the strip, reminders of the clearances given to the aircraft, and flight data modifications.

In summary, there are many similarities between DigiStrips and Vigiestrips. However, the latter interface was specially created for tower tasks. Moreover, Vigiestrips is certainly more advanced, integrating innovative tools and functions with current ATC tools (such as the A-SMGCS).

6.3.1 Vigiestrips evaluation

The Vigiestrips interface was evaluated in a set of small-scale experiments addressing very specific tasks. The main idea was to discover usability issues before setting up a more sophisticated and realistic experiment including all the components of the tower environment. The tasks evaluated entailed manipulation (grabbing, moving and placing) and free pen annotations. The evaluation was comparative, i.e. it assessed the performance of the tasks with Vigiestrips and with paper strips.

The results indicate that paper strips supported quicker interaction than Vigiestrips. However, this was mainly caused by the major limitations imposed by the technology. First of all, the time taken by the system to process the users' actions caused a delay in the system's responses.

Secondly (and in a similar way to DigiStrips), the Vigiestrips digital board required the users to operate with the fingers (or the pen) while keeping a certain pressure on the touch screen. For writing tasks in particular, this "pressure problem" imposed a trade-off between accuracy and speed of execution.

A third problem concerns the lack of tangible tactile perception with Vigiestrips. Clearly, paper strips allow strong and concrete tactile perception, so that controllers can shift their visual attention away from the paper strip while maintaining control of the tool. By contrast, this is not possible with Vigiestrips: visual control of the digital strip has to be maintained in order to properly manipulate the e-strip.

Another problem was the inadequate recognition of a number of characters (namely the letters A, F, E and T). The problem was even more severe when the annotation consisted of pairs of characters. In the end, the subjects had to spend a significant amount of time deleting the characters mistakenly interpreted by the system and/or re-writing the correct characters.

The Vigiestrips project is still ongoing. The authors indicate that new experiments will be carried out and that new hardware capable of reducing the technical drawbacks of Vigiestrips will be deployed.

6.4 3D SEMI-IMMERSIVE ATM ENVIRONMENT (LINKÖPING UNIVERSITET)

Cooper, Ynnerman & Duong (2002)
Lange, Cooper, Ynnerman & Duong (2004)
Bourgois, Cooper, Duong, Hjalmarsson, Lange & Ynnerman (2005)

The environment developed at Linköping University consists of a semi-immersive 3D interface for en-route ATC and covers a large area representing the southern part of Sweden, Arlanda Airport and the Baltic Sea. Typically, this system makes use of head-tracking techniques and a stereoscopic display. The interface is quite extensive in that it has state-of-the-art visualisation capabilities and allows the display of several types of air traffic information, which can be described as follows.

Figure 10: Snapshot of the 3D ATM system (after Lange, Cooper, Ynnerman & Duong, 2004)

6.4.1 Flight information

The interface displays flights, trajectories and waypoints. Labels (called flags), which are semi-transparent data blocks containing the call sign, altitude, destination airport and speed, are "linked" to the flights. An anti-overlapping algorithm guarantees that every flag remains visible even when the number of flights would otherwise cause clutter and occlusion. The waypoints are associated with the altitude corresponding to the level of each flight.

The flight trajectories are colour-coded on the basis of the "direction" of the flight. Keeping Arlanda Airport as the reference point, flight trajectories are classified, on the basis of their direction, as "incoming", "outgoing" or "unused". A new type of trajectory was implemented in order to take account of "other" airports. The users can also change their "target", for example by focusing on another airport (as a consequence, the camera view will re-centre and the trajectories will change colour on the basis of this new "airport in focus").

6.4.2 Weather information

Realistic data sets are used to display weather information such as predicted turbulence and icing risks within a certain area. Again, colours are used to code different information types: for example, icing warnings are coloured blue, while turbulence warnings are white. Air pressure is displayed by means of three-dimensional isobars which graphically depict the air formation; wind is displayed as a 3D isobar structure and animated streaming particles. The authors acknowledge that the amount and richness of weather information is considerable, and could negatively impact on the user's ability to distinguish relevant data from "noise". Therefore, an alternative approach was used: surface extraction techniques were employed to obtain and present extreme weather conditions in a "lighter manner", with the aim of presenting information in a less cluttered and simpler way.

6.4.3 Terrain information

The scene is enhanced with geographical and terrain information. A colour map is used to display the terrain, on which relevant features such as high ground and water stand out against the flatter parts of the scene (typically coloured in green).

6.4.4 Orientation

Terrain information could also be used to provide orientation cues to the users. However, this type of cueing was not sufficient, especially when the users zoomed in close. A new visual tool called the ground compass was thus added to the scene. At first, this tool was a compass rose displayed in the top-right-hand corner of the environment providing orientation information; more recently, the compass was blended and projected onto the centre of the terrain map.

6.4.5 Conflict detection

An approach based on the detection of solid intersections allows the position of each flight to be checked against the other flights in the database within a user-defined horizon. A pair of line segments is built for every pair of flights which could be in conflict. These segments are then "extended" so as to represent the future flight trajectories within a certain horizon. If the estimate provides evidence of insufficient separation (or even collision), audio and visual warnings are provided to the users.
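To illustrate this kind of pairwise look-ahead check, the sketch below extrapolates two flights along straight segments over a user-defined horizon and raises a flag if their minimum predicted separation falls below a threshold. It is a generic closest-point-of-approach computation written for this review, not the Linköping implementation, and the positions, velocities, horizon and separation minimum are illustrative assumptions.

```python
import numpy as np

def min_predicted_separation(p1, v1, p2, v2, horizon_s):
    """Minimum distance between two flights assumed to fly straight segments.

    p1, p2    -- current positions in metres, e.g. [x, y, z]
    v1, v2    -- constant velocities in metres per second
    horizon_s -- look-ahead horizon in seconds
    Returns (minimum separation in metres, time of closest approach in seconds).
    """
    dp = np.asarray(p2, float) - np.asarray(p1, float)   # relative position
    dv = np.asarray(v2, float) - np.asarray(v1, float)   # relative velocity
    if np.allclose(dv, 0.0):
        t_star = 0.0                                      # same velocity: distance is constant
    else:
        # Unconstrained time of closest approach, clamped to [0, horizon].
        t_star = float(np.clip(-np.dot(dp, dv) / np.dot(dv, dv), 0.0, horizon_s))
    return float(np.linalg.norm(dp + dv * t_star)), t_star


def check_pair(p1, v1, p2, v2, horizon_s=600.0, separation_min_m=9260.0):
    """Flag the pair if predicted separation drops below ~5 NM (9260 m)."""
    sep, tca = min_predicted_separation(p1, v1, p2, v2, horizon_s)
    return {"conflict": sep < separation_min_m, "separation_m": sep, "tca_s": tca}


# Two flights converging at the same level, checked over a 10-minute horizon;
# their separation drops below 5 NM roughly five minutes ahead.
print(check_pair(p1=[0, 0, 10000], v1=[250, 0, 0],
                 p2=[150000, 20000, 10000], v2=[-250, -40, 0]))
```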

6.4.6 Control

The traffic scenes can be fairly complex; the users have access to a multitude of information, which could be customised in order to better meet the controllers' specific needs. Several options for controlling the interface were investigated, for example a separate control window (which requires the users to shift their attention away from the traffic environment), menus (which could hinder the visibility of the traffic scene) and speech commands. This last solution was deployed. The command interface runs on a separate system; it includes approximately 150 commands and has a very good recognition rate (nearly 95%). The vocal interactions comprise basic commands allowing the users to switch the visual features of the interface on and off. Moreover, additional commands are available for rotation, zoom, focus-on and elevation. More complex interactions (such as interaction with a precise flight, or other fine-tuned manipulations) rely on a tracked wand, i.e. a hand-held tracked input device provided with a small joystick and four buttons (a picture of the wand is given in Figure 11).

6.4.7 Positional audio

Positional audio feedback enhances the interface. If the user is focusing on a certain part of the environment, positional sounds are used to "capture the user's attention" (for example, suggesting the presence of a possible conflict), inviting him/her to shift the focus to another area of the screen.

6.4.8 Interaction mechanisms

The interface allows interaction with the air traffic scene by means of devices typically used in three-dimensional environments, such as the wand. The interface usually sets the users' point of view to a point of interest, but this point of view can be changed using the wand. Moreover, the users can rotate the camera around the selected point of view, zoom, and enlarge areas and/or objects. The objects displayed (flights and trajectories) can be manipulated. For example, the trajectory of a flight can be changed by selecting a flight (the path of which will be highlighted) and then choosing a precise waypoint.

Glove selection methods can be used to interact with the environment. A new approach based on alignment between the hand and the dominant eye proved to be reasonably efficient in properly tracking the users' gestures (a minimal geometric sketch of this selection principle is given after the evaluation points below). The main drawback of this interaction method is that the existing tracking and glove technologies are fairly heavy, causing users' arms to become tired.

6.4.9 Evaluation

The interface was the object of several small, qualitative evaluations, although its "multimodal capabilities" were not addressed. The most important usability aspects of the environment concern:

1. the users' ability to have a concrete sense of their "in-focus" position in relation to the context, at any given time;
2. the need to control the interface without shifting attention away from the working space; and
3. the trade-off between visualisation enhancements and the need for simple and (yet) efficient visual information.
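Returning to the glove-based selection mentioned under "Interaction mechanisms", the sketch below illustrates the general eye-hand ray principle: a ray is cast from the dominant eye through the tracked hand position, and the displayed flight closest to that ray (within a threshold) is selected. It is written for this review rather than taken from the Linköping system, and the scene coordinates, call signs and selection threshold are assumptions.

```python
import numpy as np

def select_by_eye_hand_ray(eye, hand, objects, max_offset=0.5):
    """Pick the object nearest to the ray from the dominant eye through the hand.

    eye, hand  -- 3-D positions of the dominant eye and the tracked hand (scene units)
    objects    -- mapping from object id (e.g. a call sign) to its 3-D position
    max_offset -- maximum distance from the ray for an object to be selectable
    Returns the id of the selected object, or None if nothing is close enough.
    """
    eye = np.asarray(eye, float)
    direction = np.asarray(hand, float) - eye
    direction /= np.linalg.norm(direction)        # unit vector of the pointing ray

    best_id, best_dist = None, max_offset
    for obj_id, pos in objects.items():
        rel = np.asarray(pos, float) - eye
        along = np.dot(rel, direction)
        if along <= 0:                             # ignore objects behind the user
            continue
        dist_to_ray = np.linalg.norm(rel - along * direction)
        if dist_to_ray < best_dist:
            best_id, best_dist = obj_id, dist_to_ray
    return best_id


# Illustrative scene: three flights, with the user pointing roughly at "SAS402".
flights = {"SAS402": [2.0, 0.1, 5.0], "DLH77": [3.0, 2.0, 6.0], "AFR12": [-1.0, 0.5, 4.0]}
print(select_by_eye_hand_ray(eye=[0, 0, 0], hand=[0.4, 0.0, 1.0], objects=flights))
```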

At the very beginning of the development phase, a number of controllers commented on the system. The first concern was that the interface displayed a very wide area and provided too much information. In order to address this problem, a clip-box restricting the view to a specific area of interest was implemented. Additional visual aids (such as the compass) were introduced to avoid user disorientation and to maintain focus-plus-context information. Audio feedback and speech commands were also implemented in order to enhance the sense of presence within the working environment (avoiding continuous attentional shifts). The weather representation was too rich: the preliminary implementation was too "heavy" and cluttered, and a new approach (described in the Weather information subsection above) was adopted in order to provide only selected and very salient weather features.

Another evaluation, carried out by Le-Hong, Tavanti and Dang (2004), addressed two problems: 1) "getting lost" during translations, i.e. when the "focus" of the camera was changed; and 2) "lacking control" during rotations. Two experiments were thus carried out, comparing:

1. dragging with jumping (where dragging involved holding the wand button and moving the scene as if it were "attached" to the wand, while jumping involved double-clicking on the cursor point to re-centre the scene, which then jumped to the new centre without providing a reference point); and
2. Joystick with Wand (where Joystick involved turning the scene by pressing the joystick buttons, while Wand involved tilting the wand to move the scene).

A schematic contrast of the dragging and jumping techniques is sketched below, after Figure 11.

Figure 11: Interaction usability test (after Le-Hong, Tavanti & Dang, 2004)
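For illustration only (the study itself used a wand in a semi-immersive display, and no code is given in the paper), the sketch below contrasts the two translation techniques: "dragging" moves the scene centre continuously with each wand displacement while the button is held, whereas "jumping" re-centres the scene on the cursor point in a single step, without intermediate reference frames.

```python
from dataclasses import dataclass

@dataclass
class SceneView:
    """2-D stand-in for the scene's centre of view (illustrative only)."""
    centre_x: float = 0.0
    centre_y: float = 0.0

    def drag(self, wand_dx: float, wand_dy: float) -> None:
        """Dragging: while the wand button is held, the scene stays 'attached'
        to the wand, so the centre moves continuously with each wand delta."""
        self.centre_x += wand_dx
        self.centre_y += wand_dy

    def jump(self, cursor_x: float, cursor_y: float) -> None:
        """Jumping: a double-click re-centres the view on the cursor point in one
        step, with no intermediate frames providing a reference point."""
        self.centre_x, self.centre_y = cursor_x, cursor_y


view = SceneView()
for dx, dy in [(0.5, 0.0), (0.5, 0.1), (0.4, 0.1)]:   # small continuous wand movements
    view.drag(dx, dy)
print("after dragging:", view)
view.jump(10.0, -3.0)                                  # single discontinuous re-centre
print("after jumping:", view)
```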

Very simple scenes were built, with a flat, lit area of ground mapped onto the x- and y-axes, surrounded by a transparent box (cf. the upper image in Figure 11). Although the number of subjects was reasonably small (below ten), the results indicated that interaction was quicker using "dragging" than using "jumping". No differences in terms of time were found between the wand and joystick techniques. A possible explanation for this last result was that, despite the different interaction methods, the time ratio for the rotation mapping was equivalent in both conditions. However, the users declared that they preferred rotating the scene with the new technique (i.e. the joystick solution was judged easier to use in 71.4% of cases).

Another evaluation is currently ongoing. This study aims to evaluate the suitability and usability of the interface's weather representations (Dang, 2005).

6.5 AVITRACK

Borg, Thirde, Ferryman, Fusier, Valentin, Brémond & Thonnat (2006)
Ferryman, Borg, Thirde, Fusier, Valentin, Bremond, Thonnat, Aguilera & Kampel (2005)

These works describe the European project AVITRACK, the aim of which is to automate ground service operations on aprons. The project combines visual surveillance and video event recognition algorithms in order to provide automatic real-time recognition of the activities carried out on aprons. The main goal of AVITRACK is to support a more efficient, safer and prompter management of apron areas, in which ground operations such as fuelling, baggage loading, etc. can have a negative impact on the whole functioning of the airport, including traffic time delays. AVITRACK's goals include the following:

1. to achieve departure punctuality;
2. to optimise the use of resources;
3. to avoid sector overloading;
4. to manage identified actions undertaken for aircraft operations;
5. to meet the estimated off-block time;
6. to improve security;
7. to manage the existing scarce airport capacity; and
8. to provide apron allocation supervision.

A number of main events (both static and dynamic), involving aircraft, vehicles and people, can be recognised and interpreted by the AVITRACK system. In order to support this "smart surveillance", the system uses a set of cameras with overlapping fields of view, all placed at strategic points in the apron areas. The camera streams are temporally synchronised by a central video server. Object categories are extracted from the camera information; this "digital data" is further processed, classified on the basis of pre-defined object categories (people, vehicles, aircraft, etc.), associated with a function, and "shaped" into a three-dimensional form through a data-fusion method. Relevant activities (usually carried out in apron areas) were previously modelled and then used as a form of "knowledge pool" with which the "captured" real-time data are compared. The basic idea, then, is to automatically report to the competent human operators information on the apron area which matches relevant scenarios. A schematic description of the main idea underlying AVITRACK is shown in Figure 12.

Figure 12: AVITRACK overview (adapted from ...)

On a more technical level, AVITRACK works as follows. The tracking of objects is achieved by means of bottom-up tracking, which is composed of two sub-processes: motion detection and object tracking. Video-event recognition algorithms analyse the results of the tracking procedure in order to recognise high-level activities taking place within the target scene.

AVITRACK is composed of several modules. Frame-to-frame trackers implement special algorithms capable of distinguishing foreground from background, moving from stationary objects, and possible object interactions (a minimal sketch of such a motion-detection stage is given at the end of this subsection). As stated above, a bottom-up approach was deployed in order to categorise the various object types (people, vehicles, etc.), while a top-down method is used to apply three-dimensional features to the detected objects. All the information related to the scene is then fused in a data-fusion module and provided to the scene-understanding phase. To simplify, this last phase involves video event recognition, which identifies which events are taking place within the video-streaming information.

The idea of AVITRACK is indeed very interesting, because it attempts to provide important information concerning some airport "bottlenecks", in real time and in an automatic manner. The potential gain in terms of time and resources is straightforward. However, there are technical limitations which pose problems for the applicability of this application (Borg et al., 2006). Camera-based surveillance implies a number of basic problems, such as:

1. long occlusions during ground operations;
2. the generation, as a result of fog, of a number of foreground pixels which are classified as background pixels;
3. the detection of shadows as part of mobile objects;
4. confusion in the detection of background/foreground objects of the same colour;
5. the very similar appearance of many of the objects, which means that appearance-based recognition can be ambiguous;
6. the continuing presence of "ghosts" (i.e. if an object remains steady for a long period of time, it is integrated into the background information, but when it starts to move again, ghost-like images are left behind).
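The sketch below illustrates the kind of motion-detection stage referred to above, using OpenCV's MOG2 background subtractor followed by a connected-components pass to extract moving blobs. It is a generic illustration written for this review, not AVITRACK's actual algorithm; the video file name, the area threshold and the shadow handling are assumptions.

```python
import cv2
import numpy as np

def detect_moving_objects(video_path: str, min_area_px: int = 800):
    """Minimal motion-detection pass: background subtraction + blob extraction.

    Yields, for each frame, a list of bounding boxes (x, y, w, h) around regions
    that differ from the learned background model.
    """
    capture = cv2.VideoCapture(video_path)
    # MOG2 keeps a per-pixel mixture-of-Gaussians background model;
    # detectShadows=True marks shadow pixels with an intermediate grey value.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                    detectShadows=True)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = np.where(mask == 255, 255, 0).astype(np.uint8)   # drop shadow pixels (value 127)
        # Group foreground pixels into blobs and keep only reasonably large ones.
        n_labels, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        boxes = [tuple(stats[i, :4]) for i in range(1, n_labels)
                 if stats[i, cv2.CC_STAT_AREA] >= min_area_px]
        yield boxes
    capture.release()


if __name__ == "__main__":
    # "apron_camera.avi" is a placeholder file name, not AVITRACK data.
    for frame_index, boxes in enumerate(detect_moving_objects("apron_camera.avi")):
        if boxes:
            print(f"frame {frame_index}: {len(boxes)} moving object(s)", boxes[:3])
```

This simple scheme also makes the "ghost" problem in the list above easy to reproduce: an object that stays still long enough is absorbed into the background model, and a spurious foreground blob appears where it stood once it starts to move again.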
