Perceptual effects of visual images on out-of-head localization of sounds produced by binaural recording and reproduction

Eiichi Miyasaka (1)

1 Introduction

Large-screen HDTV sets with screen sizes over 37 inches have become widespread in Japan. 3DTV sets that require passive or active special glasses were also launched this year (2010). Home theaters with surround sound systems have not spread so widely in Japan, although they are widespread in the USA, Australia, and Europe. One reason is considered to be that rooms are smaller in Japan than in those countries, so the 5.1-channel surround sound systems used in home theaters are hard to set up: they require a relatively large space that includes the rear loudspeakers. A new surround sound system called the 22.2ch surround sound system, suited to super-HDTV, has been developed at NHK Laboratories [1], where researchers are now trying to reduce the number of loudspeakers used in the multi-channel system as far as possible with only small impairments.

Whereas multi-channel sound reproduction with loudspeakers localizes the sound images in the 3D audio space around the loudspeakers, apart from the listener, binaural reproduction with headphones localizes them inside the listener's head. It can, however, localize them out of the head when the sounds are recorded through a dummy head. It has been widely recognized that the extent of out-of-head localization of sounds recorded by an arbitrary dummy head and reproduced through headphones decreases because of the differences in shape between the dummy head and the real head of the listener. The head-related transfer function (HRTF), the transfer function between a sound source and the eardrum of a listener, is one of the useful tools for improving out-of-head localization (externalization).
It is, however, dependent on the individual's torso, head, shoulders and pinnae because it captures the diffraction of a sound wave for a certain angle of incidence [2]. It will therefore be necessary to introduce a modified HRTF into the system in order to improve the localization accuracy. Many researchers are now working hard on large numbers of accurate measurements of individual HRTFs. The difficulty of introducing accurate HRTFs into such systems, owing to the wide variety of HRTFs, is considered to have prevented successful out-of-head localization [3].

It has also been recognized that, if no visual images are presented at the same time, it is difficult for a listener to identify sound images at the same positions in free space as those intended by the creators or directors. In multi-channel radio dramas that make use of binaural recording and reproduction, sound images intended to be localized in different areas of free space are frequently perceived by listeners with headphones at positions different from those intended by the directors, because the listeners have no background knowledge of the dramas.

(1) Professor, Faculty of Environmental and Information Studies, Tokyo City University
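
To make the role of the HRTF discussed above more concrete, the following minimal Python sketch renders a mono signal binaurally by convolving it with a left-ear and a right-ear head-related impulse response (HRIR, the time-domain counterpart of the HRTF). The function names and the HRIRs themselves are hypothetical placeholders: the pair is built from a pure interaural time and level difference only, is not measured data, and is not taken from this paper. A practical system would substitute HRIRs measured on a dummy head or on the individual listener.

```python
import numpy as np

def toy_hrir_pair(fs, itd_s=0.0006, ild_db=6.0, length=64):
    """Hypothetical HRIR pair for a source on the listener's right.

    Assumption: only an interaural time difference (ITD) and level
    difference (ILD) are modelled; pinna, head and torso filtering,
    which make real HRTFs individual, are ignored.
    """
    delay = int(round(itd_s * fs))          # far-ear delay in samples
    near = np.zeros(length)
    far = np.zeros(length)
    near[0] = 1.0                           # right (near) ear: direct impulse
    far[delay] = 10.0 ** (-ild_db / 20.0)   # left (far) ear: delayed and attenuated
    return far, near                        # (left, right)

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with each ear's HRIR; returns an (N, 2) array."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)

fs = 48000                                  # sampling rate in Hz (assumed)
click = np.zeros(256)
click[0] = 1.0                              # unit impulse as a test signal
hl, hr = toy_hrir_pair(fs)
stereo = render_binaural(click, hl, hr)     # column 0 = left, column 1 = right
```

Because the two ears receive different filters, the mismatch between a generic HRIR pair and the listener's own ears is exactly where the individual differences described above enter.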
One possible system that could overcome the problems mentioned above is a new system combining a head-mounted HDTV display with multi-channel headphones. Such a system requires no physical space to present multi-channel audio-visual images. It will be successful if visual images facilitate out-of-head localization of sounds, so that the sounds are perceived as if they come from the corresponding visual images. This paper presents some trial perceptual experiments examining whether visual images are effective for out-of-head localization of sounds that are binaurally recorded and reproduced through headphones.

2 Experiment 1

2.1 Visual stimuli with the sounds

Table 1 shows the stimuli used in the experiment. Seven visual stimuli accompanied by sounds were prepared. Stimuli ST 1 to ST 6 are moving pictures, while the remaining one, ST 7, is a mobile phone with a flash lamp.

Table 1 List of the stimuli used in the experiment

Stimulus | Visual stimulus                                         | Auditory stimulus
ST 1     | A golf ball rolls from right to left on a desk          | The sound moves from right to left
ST 2     | A golf ball rolls from front to back (depth) on a desk  | The sound moves from front to back
ST 3     | A golf ball bounces at a center point on a desk         | The sound diminishes without movement
ST 4     | A ping-pong ball bounces from right to left on a desk   | The sound moves from right to left
ST 5     | A ping-pong ball bounces from front to back on a desk   | The sound moves from front to back
ST 6     | A golf ball bounces at a center point on a desk         | The sound diminishes without movement
ST 7     | Mobile phone with a flash lamp                          | The sound keeps a constant sound level

The sounds were recorded through a KU 100 dummy head manufactured by Georg Neumann GmbH, connected to a laptop computer through an M-Audio FireWire 410 audio interface.

2.2 Experimental setup and the procedure

Fig.1 shows the experimental setup.
An observer wearing ATH-AD700 headphones (Audio-Technica) sits upright in a chair on casters and watches a 24-inch display set up directly in front of him or her. The observer is able to move freely with the chair, forward or backward along an indicated line, within a range of 50 cm to 400 cm from the display.

Fig.1 Experimental setup

First, observers are asked whether they can find any area or position where they perceive the sounds as coming from the corresponding visual images on the display. When they can find such an area, they are required to identify its range, that is, (1) the position nearest to the display, (2) the position farthest from the display, and (3) the most appropriate position, at which the sounds and the visual images are perceived to match each other naturally. When they cannot find any such area or position, they are required to answer where they perceive the sounds to come from. Next, without any visual images, they are required to answer where the sounds come from. When they perceive the sounds outside their heads, they are asked to report the perceived distance between the sounds and their heads, and the direction of the sounds. Ten students with normal vision and hearing participated in the experiment as observers.

2.3 Results

2.3.1 Effects of visual images

Fig.2 shows the number of observers who answered that they perceived out-of-head localization of the sounds with or without the visual images. For ST 3, ST 4, ST 5 and ST 6, no fewer than nine out of ten observers perceived out-of-head localization with visual images, whereas two to six observers perceived it without visual images.

Fig.2 The number of observers who perceived out-of-head localization of the sounds with or without the visual images

2.3.2 Results for each stimulus

Fig.3 consists of seven graphs showing the results of the experiment for each stimulus from ST 1 to ST 7. The abscissa indicates the observers and the ordinate indicates the distance from the display. The symbols used in the figures mark three quantities for each observer: (1) the minimum (nearest) distance from the display at which the observer perceives the sounds as coming from the visual images, (2) the maximum (farthest) distance from the display at which the observer perceives the sounds as coming from the visual images, and (3) the most appropriate position, at which the sounds and the visual images are perceived to match each other naturally.

Fig.3 Results of the experiment for each stimulus. The abscissa indicates the observers and the ordinate indicates the distance from the display. The symbols mark the minimum distance from the display at which an observer perceives the sounds as coming from the visual images, the most appropriate distance, and the maximum distance.

(1) ST 1: Half of the ten observers could perceive out-of-head localization of the sounds with the visual images, although no observer could perceive out-of-head localization of the sounds without any visual images. The averaged appropriate distance is about 150 cm to 200 cm from the display.

(2) ST 2: Half of the observers could perceive out-of-head localization of the sounds with the visual images. Two of them, observers G and I, however, could perceive out-of-head localization of the sounds only for ST 1 or ST 2.

(3) ST 3: All observers perceived out-of-head localization of the sounds, with an averaged appropriate distance of 100 to 150 cm from the display.

(4) ST 4: Nine observers perceived out-of-head localization of the sounds, with an averaged appropriate distance of 100 to 250 cm from the display. Observer D could not perceive out-of-head localization for ST 1, ST 2 and ST 4.

(5) ST 5, ST 6: All observers perceived out-of-head localization of the sounds, with an averaged appropriate distance of 100 to 200 cm from the display.

(6) ST 7: Only three observers could perceive out-of-head localization of the sounds with the visual images, and no observer could perceive it without the visual images. The sounds were produced at a fixed point on the body of the phone, and the only motion in the visual images is the flashing of the lamp attached to the body of the phone. Such a stimulus is considered unlikely to bring about out-of-head localization.

2.3.3 Positions or direction of localization of the sounds

For stimuli ST 3, ST 4, ST 5 and ST 6, almost all observers perceived out-of-head localization of the sounds when they were presented with the visual images, although fewer than half of the observers could perceive out-of-head localization without visual images. When the sounds were presented without visual images, the observers localized the sound images in the front, upper or rear area inside the head.
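
The per-stimulus counts stated in Section 2.3.2 can be tabulated as proportions of the ten observers. The short Python sketch below does only that arithmetic; the variable names are illustrative, and counts that the text does not state exactly (the "without visual images" condition for ST 2 to ST 6, which is given only as a range or not at all) are deliberately left as None rather than guessed.

```python
N_OBSERVERS = 10

# Observers reporting out-of-head localization WITH visual images,
# as stated in the per-stimulus results (Section 2.3.2).
with_visual = {"ST 1": 5, "ST 2": 5, "ST 3": 10, "ST 4": 9,
               "ST 5": 10, "ST 6": 10, "ST 7": 3}

# WITHOUT visual images: the text gives exact counts only for ST 1 and ST 7.
# None means "not stated exactly in the text".
without_visual = {"ST 1": 0, "ST 2": None, "ST 3": None, "ST 4": None,
                  "ST 5": None, "ST 6": None, "ST 7": 0}

def proportion(count, n=N_OBSERVERS):
    """Fraction of observers, or None when the count is not stated."""
    return None if count is None else count / n

summary = {st: (proportion(with_visual[st]), proportion(without_visual[st]))
           for st in with_visual}

for st, (p_with, p_without) in summary.items():
    shown = "not stated" if p_without is None else f"{p_without:.0%}"
    print(f"{st}: with visual {p_with:.0%}, without visual {shown}")
```
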
These results indicate that the visual images used in stimuli ST 3 to ST 6 are able to facilitate the perception of out-of-head localization.

2.3.4 Sizes of the visual images

The position of the movie camera with which the visual images were taken can be one of the important parameters influencing the perception of localization, so an additional experiment was carried out. Two camera positions were used, 1 m and 4 m from the object accompanied by the sounds. Fig.4 shows one of the results. The abscissa indicates the stimulus: ST 1m denotes the visual image taken 1 m from the object, in which a ping-pong ball bounces from right to left on a desk, while ST 4m denotes the visual image taken 4 m from the same object. The ordinate indicates the number of observers who selected each of the two stimuli on the basis of how naturally the sounds and the visual images matched. The result shows that ST 4m is more natural than ST 1m because the former image presents more visual information,
which helps observers identify the visual position at which the sounds are produced, resulting in out-of-head localization.

Fig.4 The number of observers who selected each of the two stimuli on the basis of how naturally the sounds and the visual images matched

3 Discussion and conclusion

It is clear that out-of-head localization is difficult to realize with auditory stimuli alone. Introducing an HRTF would be effective if it could closely match the listener's own HRTF, although precise measurement of HRTFs is clearly difficult. The results of this experiment imply that effective visual images facilitate out-of-head localization of sounds. As shown in Fig.3, however, some visual images have little effect on out-of-head localization. How visual images facilitate out-of-head localization remains to be solved. An audio-visual system that incorporates a generic HRTF, consisting of a binaural audio reproduction system and a visual system, may be one effective approach. Display size will also be an important parameter affecting out-of-head localization, although it was fixed at 24 inches in this experiment. The author is now planning to test the effect of display size, including 50 inches as well as 24 inches.

Acknowledgement

The author thanks Mr. Tomoyuki Fujii, Mr. Yasuaki Abe and Mr. Yuji Sato, who assisted the author in performing the experiments and gathering the experimental data. A part of this research was supported by the Hoso Bunka Foundation (HBF), 2009.

References

[1] K. Hamasaki, T. Nishiguchi, R. Okumura, Y. Nakayama and A. Ando, 22.2 Multichannel Sound System for Ultra High-Definition TV, SMPTE Technical Conference Publication, 2007
[2] M. Noisternig, T. Musil, A. Sontacchi and R. Hoeldrich, 3D Binaural Sound Reproduction using a Virtual Ambisonic Approach, International Symposium on Virtual Environments, Human-Computer Interfaces, and Measurement Systems, 27-29, 2003
[3] Y. Iwaya, Individualization of head-related transfer functions with tournament-style listening test: Listening with other's ears, Acoust. Sci. & Tech., 27, 340-343, 2006