ModaDJ: Development and Evaluation of a Multimodal User Interface
Course: Master of Computer Science
Professor: Denis Lalanne
Renato Corti (Institute of Computer Science, University of Bern)
Alina Petrescu (Department of Informatics, University of Fribourg)
Contents
1 Introduction
1.1 Objectives and research question
1.2 Overview and project description
2 Input devices
2.1 Conventional interface (mouse and touch screen)
2.2 Multimodal interface (Kinect)
3 Modalities
4 Implementation
5 Evaluation
6 Conclusions
7 References
1 Introduction

This report describes a mini-project prepared and carried out in the course. We developed a simple music application that can be controlled either by mouse and touch screen or by a Kinect used for gesture recognition. In line with common usage in the music world, velocity and volume are used synonymously here to describe loudness.

1.1 Objectives and research question

The main question behind this project is how different age groups make use of the different modalities the application provides when creating music. Since the idea is to make live music, i.e. without pausing the playback to add notes and take time to think, a second focus lies on how quickly given tasks can be carried out with each modality.

1.2 Overview and project description

The application is composed of a grid of notes on the left and a control panel on the right, as depicted in figure 1.

Figure 1: Main window of the application

The grid of notes has twelve rows and sixteen columns: each row corresponds to a certain pitch, and the columns represent elapsed time. Each square can be filled with one of four colours (black, white, red or green). Black represents an empty cell; the other three colours represent different instruments. Additionally, each note has a velocity value, which translates to the volume of the note. The velocity and the colour are chosen in the control panel on the right. The panel also contains three buttons: Play, which starts or pauses the playback; Stop, which stops the playback; and Clear, which deletes all entered notes. The Tempo slider controls the speed of the playback and the
Pitch Bend slider applies a pitch-bending effect to the playing melody. Both sliders can be changed while the music is playing. The application is written in Python 3.5 and uses the Tkinter UI toolkit. The audio output is not handled by the application itself; instead, MIDI messages are emitted through the mido library and can then be processed by any MIDI-compatible program, such as the FluidSynth audio synthesizer.

2 Input devices

We compared two modes of input for our application: a conventional mouse or touch screen interface operated with the right hand, and a multimodal interface using the Kinect operated with the left hand. Without loss of generality, the inputs can be swapped for left-handed users.

2.1 Conventional interface (mouse and touch screen)

The user places notes on the grid with either the mouse or a finger. Before placing a note, the user chooses a colour by clicking or tapping it and sets the velocity with the slider, again using either the mouse or a finger. Once all desired notes are placed on the grid, the user presses the Play button and the music playback loops continuously. The velocity and the pitch bend can also be changed with the sliders while the application is running.

2.2 Multimodal interface (Kinect)

First, the user has to calibrate the colours of the cube and the hand recognition, with the Kinect facing down on the table as depicted in figure 2. The cube can carry up to six colours, one per face. The calibration consists of two screens: the first lets the user crop the image around the hand to make its detection easier, as shown in figure 3; the second lets the user calibrate the cube's colour faces well enough to use the application properly.
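As a rough illustration of the first calibration step, restricting hand detection to a user-chosen subregion of the camera image amounts to a simple numpy slice. This is a minimal sketch, not the project's actual code; the frame size and coordinates are made-up placeholders.

```python
# Sketch of the first calibration step: the user selects a subregion of
# the camera image, and hand detection then runs only on that crop.
# Frame size and crop coordinates below are placeholder assumptions.

import numpy as np

def crop(frame, top, left, height, width):
    """Return the selected subregion of a colour frame (H x W x 3)."""
    return frame[top:top + height, left:left + width]

frame = np.zeros((424, 512, 3), dtype=np.uint8)   # placeholder camera frame
roi = crop(frame, 100, 150, 120, 160)             # region chosen by the user
print(roi.shape)                                  # (120, 160, 3)
```

Slicing returns a view of the original array, so the crop adds no copying overhead when applied to every incoming frame.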
Figure 2: Setup of the Kinect device
Figure 3: The first calibration screen
Figure 4: The second calibration screen

Once the calibration is done, the user arrives at the actual application shown in figure 1. With the help of the Kinect, the user can adjust any note's velocity by sliding a hand vertically up and down beneath it, or adjust the whole melody's pitch bend by sliding a hand horizontally left and right. These two hand gestures were well appreciated and fast to use. Finally, the user can choose any instrument implemented in the application simply by rotating the cube to show the desired colour face.

3 Modalities

On the device side, the Kinect provides one way to select a specific note and multiple ways to modify or adjust the selection. This can be done in sequence or at the same time, for example selecting a note and changing the instrument in one movement. According to the CASE model, this qualifies as a synergistic or alternate user interface. In addition, the two input modalities can be redundant: the pitch, for example, can be changed with either modality, which adds a concurrent communication type to the application. Exclusive input is neither demanded by the application nor intended. In the case of contradicting inputs, for
example raising the volume with a gesture while decreasing it on the touch interface, the touch input device should take precedence.

On the user side, the CARE model describes the usability properties the user encounters when composing live music. Normally the inputs are carried out in a complementary way, i.e. within a given temporal window, to reach a given state such as a live change of the notes. No task is assigned to a single input modality: the application always offers several choices for carrying out a task. The user can also employ the modalities redundantly, for example selecting a different instrument on screen while turning the cube to a different face; this was one of the objectives. The same holds for the equivalence property, which lets the user choose among different modalities. Such freedom of choice sometimes yields interesting results, as explained in chapter 5. Fission of the output channels to provide users with appropriate feedback was unobtrusive and worked without noteworthy problems. Fusion of the different modalities was also largely unproblematic and limited to a few particular situations, such as the contradicting inputs described earlier in this chapter.

4 Implementation

Unfortunately, under Linux the choice of libraries to interface with the Kinect is limited. We opted for libfreenect2 and its Python binding pyfreenect2. In addition, numpy, a library for scientific computing with Python, was used for certain image and array manipulation functions. Before launching the program, the user is presented with two calibration screens. The first allows the user to choose a subregion of the Kinect's field of view (see figure 3) to make the detection of their hand easier. On the second screen, displayed in figure 4, the user calibrates the colour detection by showing each of the cube's faces to the Kinect's camera and clicking the corresponding colour's button.
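The colour calibration described above can be sketched as a nearest-colour classifier: each stored reference is the colour recorded for one cube face, and at runtime the face shown to the camera is matched to the closest reference. This is a hypothetical sketch with made-up reference values, not the project's actual detection code.

```python
# Sketch of cube-face colour detection: compare the average colour of the
# observed region against calibrated reference colours and pick the
# nearest one. The reference values below are illustrative assumptions;
# in the application they would come from the calibration screen.

import numpy as np

references = {
    "white": np.array([200.0, 200.0, 200.0]),
    "red":   np.array([180.0, 40.0, 40.0]),
    "green": np.array([40.0, 160.0, 60.0]),
    "black": np.array([20.0, 20.0, 20.0]),
}

def classify_colour(roi):
    """Return the calibrated colour closest to the mean colour of the region."""
    mean = roi.reshape(-1, 3).mean(axis=0)        # average RGB over the crop
    return min(references, key=lambda c: np.linalg.norm(references[c] - mean))

patch = np.full((40, 40, 3), (185, 35, 45), dtype=np.uint8)  # mostly red face
print(classify_colour(patch))                     # "red"
```

Averaging over the whole region before matching makes the classification robust to pixel noise, which also explains why fewer, well-separated reference colours detect more reliably than six similar ones.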
Note that we initially planned to offer six possible colour choices, but the detection results were poor. We therefore reduced the number of choices to four colours to make the colour detection more accurate.
5 Evaluation

The user feedback contained both negatives and positives. The cube, for example, was often seen negatively, because people tend to tap directly on the touch screen to select the desired colour. Moreover, instead of colours on the cube's faces, we could use images of the actual instruments, which would let users with colour blindness distinguish the sides. In contrast, the up/down and left/right hand gestures were highly appreciated as more intuitive: changing the velocity and the pitch bend with the Kinect's hand recognition was faster than touching the screen. A further improvement would be to enlarge the buttons on the right panel of the application so that they are easier to manipulate with a finger.

The following table summarizes the user evaluations. We had six testers of different ages and genders, mostly family and friends. The third column answers the questions "Do you know music? Do you play an instrument?". The fourth column indicates the approximate time the user needed to get used to the application. The fifth and sixth columns show what users liked and disliked about the overall system. The last column reports a small challenge proposed to them: reproduce a song, for example Frère Jacques, only by ear, and measure how much time they needed to replicate it in the application with their preferred modalities.

Gender | Age | Familiar with music? | Accommodation time | Preferred modalities                               | Did not like                             | Task completed in
Female | 20  | Yes                  | ~5 min             | Hand gestures (up/down, left/right) + touch screen | The cube + the mouse for the instruments | ~15 min
Female | 58  | No                   | >10 min            | Touch screen only                                  | The cube + the Kinect                    | ~30 min
Female | 25  | Yes                  | <5 min             | Hand gestures (up/down, left/right) + touch screen | The cube + the mouse for the instruments | ~10 min
Male   | 28  | Yes                  | <5 min             | Hand gestures (up/down, left/right) + touch screen | The cube + the mouse for the instruments | ~15 min
Male   | 59  | No                   | ~8 min             | Mouse + touch screen only                          | The cube + the Kinect                    | ~25 min
Male   | 23  | No                   | >10 min            | Hand gestures (up/down, left/right) + touch screen | The cube + the mouse for the instruments | Abandoned

6 Conclusions

The results showed that older testers preferred the touch and mouse input modalities over the Kinect, while younger testers took less time to get used to the whole system, not only the application. The last column also shows that users who used the Kinect as an input device were slightly faster at completing the given task successfully. Future improvements might include a voice modality limited to certain commands such as Start, Pause and Stop, and an additional third gesture axis to provide the user with one more input channel. At that point, studying the users' usage behaviour would be important for evaluating how intuitively the modalities are chosen and combined.

7 References

Lalanne, Denis (2017). Slides from the course. Department of Informatics, University of Fribourg.