3D Sound Simulation over Headphones


Lorenzo Picinali, Paris, 30th September 2008

Chapter for the Handbook of Research on Computational Art and Creative Informatics

Chapter title: 3D Sound Simulation over Headphones

Abstract

What is the real potential of computer science when applied to music? It is possible to synthesize a real guitar using physical modelling software, yet it is also possible virtually to create a guitar with 40 strings, each 100 metres long. The potential can thus be seen both in the simulation of that which in nature already exists, and in the creation of that which in nature cannot exist. After a brief introduction to spatial hearing and the binaural spatialization technique, passing from principles of psychoacoustics to digital signal processing, the reader will be taken on a voyage through multi-dimensional auditory worlds: first simulating what in nature already exists, starting from zero soundscape dimensions and arriving at three, then trying to advance the idea of a fourth auditory dimension, creating synthetically a four-dimensional soundscape.

1. Introduction

What is the real potential of computer science when applied to music? Using physical modelling synthesis techniques it is possible to simulate an acoustic guitar as accurately as possible: of course, this is useful in terms of the opportunities made available to musicians to use an instrument they cannot in reality play, and in terms of acoustical studies of the instrument itself. But is that it? Once created, the mathematical model of a guitar can be altered as far as the imagination can extend: an acoustic guitar made of gold, with 40 strings, each 100 metres in length, could virtually be created and played with a one-square-metre plectrum! Therefore, the potential can be seen both in the simulation of that which in nature already exists, and in the creation of that which in nature cannot exist.

Another kind of example will be discussed later in the chapter: instead of toying with the simulation of musical instruments, we shall try to create virtual acoustical environments through the simulation of three-dimensional (henceforth: 3D) soundscapes. The binaural spatialization technique will be used in order to achieve this goal; multi-dimensional soundscapes will be simulated not by placing real sound sources, such as loudspeakers, within the three dimensions, but by simulating the behaviour of our outer ear in terms of the directional modifications brought to the sound input into the hearing system.

The mechanisms of spatial hearing will be investigated and analyzed, and three localization cues will be characterized and simulated: the Interaural Level Differences (ILDs), the Interaural Time Differences (ITDs), and the Direction-Dependent Filtering (DDF). Within the simulation of a real environment, these parameters would all be coherent with the position of the sound source: for example, for a sound source placed at 60º of azimuth, the sound would reach first the right ear and then the left (ITDs). Furthermore, it would be more intense at the right ear than at the left (ILDs), and the sound would be filtered depending on the particular resonances of our outer hearing system for that specific sound source location.

It must, however, be asked what could happen if the three localization cues were incoherent with the real position of the sound source. Of course, this is impossible in nature, and equally so in a standard soundscape simulation where loudspeakers are placed in a 3D space. Still, achieving such incoherence is not impossible in a system based on headphones, where the signals sent to the hearing system are much more controllable, and the whole reproduction system is therefore much more flexible. In this case, the binaural spatialization technique is useful not only to simulate a real 3D soundscape, but also to create new soundscapes, i.e., environments that are impossible to find in the real world. This seems to be one of the amazing new options offered by computer science: while it could indeed be considered inessential to simulate a feature that already exists in nature, it is particularly interesting to create a feature that as yet has no existence in the real world.

To appreciate this new digital feature fully, it may help to think about a voyage into multiple dimensions; the results may appear similar to Abbott's graphic achievements (see Abbott, 1999) when he wrote Flatland (frequent reference will be made to this book later in the chapter). A monophonic diotic signal (the same at both ears) could be perceived as a point sound source located in the middle of the head: zero dimensions, or the point. By introducing intensity and content differences between the two channels (ILDs) and creating a dichotic signal (different at each of the ears), it is possible to obtain a standard headphone stereo signal, with multiple sound sources located along a line between the ears (always inside the head): one dimension, or the line. Through introducing time differences between the two channels (ITDs), it is possible to obtain the sensation of the sound coming from outside the head, with multiple sound sources located in a plane: two dimensions, or the square. Then, upon introducing a simulation of the DDF, with different frequency filtering for each virtual sound source, the perception reaches the third dimension, the cube. The auditory passage between these steps could be visualized as the graphic perception of a point that becomes a line, then a square and finally a cube.

What, then, about the fourth dimension? The gain of a dimension can mean the coexistence of concepts that in a lower dimension cannot be seen simultaneously. For example, when inspecting a square in a bi-dimensional world it will be possible to see just a line, or one or two angles at the same time, and not more (in a bi-dimensional space the viewed perception is mono-dimensional, just as in three-dimensional space the perception is bi-dimensional). Only when reaching the third dimension will it be possible to see all four angles of the square at the same time.

In a four-dimensional space, therefore, it may be possible to see, for example, the front and the back of a person at the same time. Yet how can this be rendered from an auditory perspective? The fourth dimension could be seen as the coexistence of entities that in a real listening situation could not exist at the same time: for instance, a sound signal with the ILDs of a sound source placed at 60º of azimuth and 0º of elevation, the ITDs of a sound source placed at 300º of azimuth and 0º of elevation, and the DDF of a sound source placed at 0º of azimuth and 90º of elevation. How on earth could all of this be created, if not with the help of computers?

2. Elements of spatial hearing

How can our hearing system determine the direction of provenance of a sound, and therefore the position of a sound source, in a 3D soundscape? In this section the mechanisms of spatial hearing will be described and analyzed, starting from a short overview of the external hearing system and continuing with a brief investigation of the three localization cues (the ILDs, the ITDs and the DDF) and of the binaural phenomena. References for the topics discussed here can be found in Blauert (1996), Moore (2003) and Yost (2000).

2.1 Some basic notions

Before beginning this overview of the binaural phenomena and of the psychophysiology of the spatial hearing system, a few terms will be defined, in order better to understand that which is to follow:

- Sound localization: the judgement on the specific location of a sound source.
- Sound lateralization: it is feasible, most of all while listening to sound through a pair of headphones, to be unable to localize sound sources outside our head, but to perceive the sound as coming from inside the head, with sound sources placed along an imaginary line that starts at one ear and crosses to the other. This phenomenon is known as sound lateralization.
- Localization cues: specific attributes of the sound event that are used by the hearing system in order to establish the position of a sound source in a 3D soundscape.
- Monaural: relating to or involving a sound stimulus presented to one ear only.
- Binaural: relating to or involving a sound stimulus presented to both ears simultaneously.
- Interaural: between the two ears.
- Diotic: relating to or involving a sound stimulus presented to both ears in exactly the same way.
- Dichotic: relating to or involving a sound stimulus presented to one ear differently from the sound stimulus presented to the other ear.

2.2 Spatial coordinates

In order correctly to localize a sound source in a 3D soundscape, a coordinate system needs to be established. Three planes need to be distinguished, each one with its origin placed at the centre of the head (see Figure 1):

- Horizontal plane: placed at the superior margins of the two ear canals and at the inferior part of the ocular cavity.
- Frontal or vertical plane: placed at an angle of 90º to the horizontal plane, it intersects with it at the superior margins of the two ear canals.
- Median plane: placed at an angle of 90º to both the horizontal and the frontal planes, it constitutes the axis of symmetry of the head.

Using this system as a reference, the position of a sound source can be unequivocally defined by the Azimuth (ϕ, localization angle on the horizontal plane, calculated proceeding anti-clockwise), the Elevation (δ, localization angle on the median or frontal plane, calculated proceeding upwards) and the Distance (r, the distance between the sound source and the centre of the listener's head). In Figure 1, three sound sources are placed as an example; their spherical coordinates are:

- A: ϕ = 0º, δ = 0º, r depending on the radius of the circles drawn
- B: ϕ = 345º, δ = 30º, r depending on the radius of the circles drawn
- C: ϕ = 270º, δ = 0º, r depending on the radius of the circles drawn

Figure 1: The spherical coordinate system (after Blauert, 1996)
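As a side note, this spherical convention is easy to translate into Cartesian coordinates, which is often needed when programming a spatialization engine. The short Python sketch below does so for the three example sources of Figure 1; the axis orientation (x front, y left, z up) is my own assumption, chosen to match the anti-clockwise azimuth defined above, not something fixed by the chapter.

```python
import math

def spherical_to_cartesian(azimuth_deg, elevation_deg, distance):
    """Convert the chapter's spherical coordinates to Cartesian ones.

    Assumed convention: x points to the front of the listener, y to the
    left (so azimuth increases anti-clockwise, as in Figure 1), z upwards;
    the origin is the centre of the head.
    """
    phi = math.radians(azimuth_deg)      # anti-clockwise, horizontal plane
    delta = math.radians(elevation_deg)  # upwards from the horizontal plane
    x = distance * math.cos(delta) * math.cos(phi)
    y = distance * math.cos(delta) * math.sin(phi)
    z = distance * math.sin(delta)
    return x, y, z

# The three example sources of Figure 1, with an arbitrary radius r = 1 m:
for name, az, el in [("A", 0, 0), ("B", 345, 30), ("C", 270, 0)]:
    print(name, spherical_to_cartesian(az, el, 1.0))
```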

2.3 Outer ear overview

To introduce the human auditory system: the outer ear is its external part. It is composed of the pinna (the visible part) and the auditory canal, or meatus. At first sight, the role of the pinna seems to be quite simple: to convey the sounds that reach the head into the ear canal. However, if its particular shape is inspected carefully, its functions may be guessed to be far more complicated. The pinna in fact also significantly modifies the incoming sound, depending upon the angle of incidence of the sound itself and thus on the position of the sound source. This modification is mainly related to frequency filtering, especially for high frequencies (above 3000 Hz).

After it has been conveyed and modified by the pinna, the sound travels down the ear canal and causes the eardrum, also known as the tympanic membrane, to vibrate. From this point, the vibrations are transmitted through the middle ear by the ossicles, three small bones (the malleus, incus and stapes) that work as impedance converters and mechanical amplification devices through a complicated system of levers, and then to the cochlea, the last part of the auditory system and part of the inner ear. The system as a whole does not, for our purposes, warrant scrutiny: the part of the peripheral auditory system involved in the mechanisms of sound modification linked to the source position, and therefore to the sound incidence angle, is in fact solely the external one, thus the outer ear.

2.4 The mechanisms of sound source localization

As defined in Section 2.1, sound localization is the judgement made regarding the specific location of a sound source, performed through particular mechanisms fulfilled by our auditory system. In order to accomplish these mechanisms, the auditory system can work on certain particular attributes of the signal input into the ear canal: these are called the localization cues, and they can further be distinguished between interaural differences and monaural attributes, as will be shown in the following sections.

2.4.1 The interaural differences

There are two kinds of interaural differences for the localization of sound sources at the left or at the right of our head:

- ILDs: Interaural Level Differences, the differences in terms of the pressure level of a sound stimulus between one ear and the other. They are generated by the presence of the head between the ears, which acts as an obstacle placed in the direct path between the sound source and the ear entrance (in this case, the ear opposite the position of the sound source).
- ITDs: Interaural Time Differences, the differences in terms of arrival times of a sound stimulus between one ear and the other. These are generated by the different paths that the sound wave needs to cover in order to travel from the sound source to each of the ears: when the sound source is not located in the median plane, the distances between it and each of the two ears will differ, thus the sound wave will take a longer or a shorter time to reach the beginning of each of the ear canals.

Both cues are effective for localizing a sound source placed on the left or on the right of our head; however, their importance varies according to the frequency bands covered by the sound source to be localized. The duplex theory (see Lord Rayleigh, 1907) is probably the most widely accepted hypothesis as to how the interaural mechanism works for sound source localization: the ILDs are more effective for the localization of high frequencies, while the ITDs are for low ones. To try to explain this: as is known, low frequencies (between 20 Hz and 500 Hz, with wavelengths ranging from 16 m to 64 cm) have a wavelength that is much larger than the diameter of our head (~17 cm), therefore they will not be scattered or absorbed by such a small obstacle. Thus, the ILDs turn out to be nearly irrelevant for low frequencies, and significantly larger for high frequencies: as an example, Figure 2 shows that when a sound source is located at 90º of azimuth, the ILDs are about 1-2 dB at 200 Hz, and about 20 dB at 6000 Hz. Thus, the ILDs can be seen as a frequency-dependent parameter.

Figure 2: The Interaural Level Differences as a function of azimuth, for several frequencies (after Feddersen, 1957). This diagram has been redrawn from Moore, 2003; for this reason it is not used as a reference, but only to give an idea of how the ILDs work.

Figure 3, below, shows how the ITDs vary with azimuth independently of the frequency of the stimulus: they are determined only by the speed of sound. The time taken by a sound wave to travel from the sound source to the two ears depends not upon the frequency, but upon other physical parameters such as air temperature and humidity. Thus, the ITDs can be seen as a frequency-independent parameter.

Figure 3: The Interaural Time Differences as a function of azimuth (after Feddersen, 1957). This diagram has been redrawn from Moore, 2003; for this reason it is not used as a reference, but only to give an idea of how the ITDs work.

However, the fact that the ITDs are measured in terms of microseconds (millionths of a second) creates some detection problems for frequencies whose period is comparable with the ITDs themselves. For example, if a 1000 Hz sound source is located at 60º of azimuth, the ITD would be 0.5 milliseconds, exactly half of the period of the 1000 Hz tone. In this case, it would be utterly impossible to establish the position of the sound source using only the ITDs as a determinant, because a half-period delay at one ear or at the other produces exactly the same phase relation between the two sound waves, which would therefore not be distinguishable except for the 0.5-millisecond difference in the onset of the oscillations. These problems are magnified for higher frequencies, with their smaller periods.

In this scenario, the basis of the duplex theory may be understood: for certain frequencies the ILDs seem to be the more reliable parameter for left-right localization, while for others the ITDs can be considered more effective. It has been calculated that for frequencies above 725 Hz (with periods shorter than 1.38 ms, when the ITDs would start to create problems) left-right localization is accomplished by considering mainly the ILDs, while for frequencies below 725 Hz (with wavelengths greater than 44 cm, when the ILDs would start to become irrelevant) the ITDs constitute the most important parameter.
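The 1000 Hz ambiguity just described can be made concrete with a small calculation. The sketch below uses the spherical-head (Woodworth) approximation of the ITD, which is my own assumption: the chapter quotes only measured values (Figure 3), but the approximation reproduces the 0.5 ms figure at 60º of azimuth reasonably well.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, at roughly 20 °C
HEAD_RADIUS = 0.0875     # m, about half the ~17 cm head diameter quoted above

def itd_woodworth(azimuth_deg):
    """Approximate ITD (seconds) for a distant source, spherical-head model."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

itd = itd_woodworth(60)      # source at 60º of azimuth
period = 1.0 / 1000.0        # period of a 1000 Hz tone

print(f"ITD at 60º: {itd * 1000:.2f} ms")                      # ~0.49 ms
print(f"Half period of 1000 Hz: {period / 2 * 1000:.2f} ms")   # 0.50 ms
# The ITD is about half the period of the tone: the ongoing phase at the
# two ears is ambiguous, and only the onset difference reveals the side.
```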

2.4.2 The Cone of Confusion

Thus far, it has been explained how it is possible to determine the provenance of a sound from the left or from the right on the horizontal plane, but the question remains: how is it possible to differentiate sound sources placed in front of or behind the listener, or above or below? If two sound sources are located at 60º and 120º of azimuth (two positions that are specular with respect to the frontal plane), the interaural differences in the signals coming from them will be exactly the same, even though the two sources have different real positions. This problem is called the Cone of Confusion, because plotting all of the positions of sound sources with the same interaural differences would generate the shape of a cone, with the head at the apex (see Figure 4), and it can be resolved only with the help of a third localization cue, the direction-dependent filtering.

Figure 4: The Cone of Confusion. For each of the sound sources located on the circle (which can be considered as the base of a cone), r1 and r2 are respectively equal, therefore the interaural differences are exactly the same (von Hornbostel, 1920, redrawn from Blauert, 1996)

2.4.3 The direction-dependent filtering and the Head Related Transfer Function

As said previously in Section 2.3, the pinna has two main functions: while the first, sound gathering, can readily be understood, the second, direction-dependent filtering, appears more complicated. The dimensions of the pinna are far too small, compared with the wavelengths of many audible frequencies, for it to function as a simple sound reflector; the dimensions of its cavities, instead, are comparable to λ/4 (where λ is the wavelength of a given frequency) for a large number of frequencies, and these cavities can easily become resonators for sound waves coming from specific directions. Therefore, inside the pinna the sound is modified by reflections, refractions, interferences, and resonances that are activated for specific frequencies and, most significantly, for the incidence angles of specific sound waves, hence the name Direction-Dependent Filtering (Batteau; Shaw, 1968).

In order to abstract and simplify the principle, an empty bottle serves as an example: when blowing with the mouth close to the neck of the bottle, it is possible to generate a resonance whose frequency is determined only by the volume of the air inside the bottle and the dimensions of the bottle neck, and not, for example, by the material of the bottle itself. However, in order to generate the resonance, the position of the mouth, the force of the air blown, and the inclination of the bottle need to be specifically selected (a choice that is usually achieved by trying different positions and speeds). What would happen if we had a special bottle, with more necks, more openings, and more cavities? There would then be many more resonances, and many more combinations of positions and blowing speeds to activate them. This is what happens if the pinna is considered as a complex resonator: multiple resonances can be activated depending on the incidence of the sound wave. All of the resonances generated by the reflections and refractions on the shoulders and on the torso of the listener need to be considered, too.

The sound input into the auditory canal is therefore modified through complex frequency filters that change their shapes depending on the position of the sound source. A couple of examples: for sound sources located above, there is usually a strong resonance around the 8000 Hz mark, while for the front-back detection of sound sources placed in the horizontal plane, the weight of resonances and absorptions at 3000 Hz and 5000 Hz is essential in order not to fall into the cone of confusion effect. The combination of these filters, together with the interaural differences, is called the Head Related Transfer Function (HRTF).

2.4.4 Individual and general attributes of the HRTF

While some of the HRTF parameters can be considered constant for all human hearing systems, others need to be considered individually, because they depend on idiosyncratic physiological differences between human beings, such as the shape of the pinna and the circumference of the head. When an HRTF is simulated, or binaural recordings are performed (see Sections 3.2 and 3.3), both individual and general attributes should be considered, thus a simulation should be performed individually for each subject. All of the individual HRTF attributes mentioned in this chapter will be simplified and approximated, and for the current purposes it will be assumed that every single human being has an identically shaped outer ear (and, equally, identical other directional filtering elements such as the head, shoulders, and torso). For more information about this topic, see Møller (1996) and Katz (1996).

2.4.5 The role of the head movements

Even if this is not directly relevant to what will be claimed in the following sections, it is essential to underline that head movements are extremely important to sound source localization, most of all for the front-back and up-down discriminations when the cone of confusion issue needs to be resolved. In fact, turning the head left-right or up-down causes important changes in the soundscape and in the relative positions of the sound sources, generating further information which is particularly relevant to correct sound source localization. As an example, a sound source located in the horizontal plane at 60º of azimuth could easily be confused with another located in the same plane at 120º of azimuth; on rotating the head to the right, if the sound source is really located at 60º of azimuth it will move towards the front (0º of azimuth); otherwise, if it is located at 120º of azimuth, it will move towards the right (90º of azimuth). Therefore, in critical situations, for example with narrowband stimuli (sound stimuli with a narrow frequency extension, on which the direction-dependent filtering can act only with difficulty) or in the case of a localization task in a particularly reverberant environment, when the localization cues cannot be precisely analyzed, the movements of the head are essential for proper sound source localization.

2.5 Sound source localization on the three planes

After this brief overview of the interaural differences and of direction-dependent filtering, it should now be understood how sound source localization is performed for a source placed anywhere in a three-dimensional soundscape. In order to simplify the problem, the mechanisms for the localization of sound sources placed in just one plane will now be addressed.

2.5.1 Sound source localization in the horizontal plane

In this specific case, the presence of dichotic stimuli is highly probable: in fact, only for sound sources located exactly at 0º and 180º of azimuth (and, of course, 0º of elevation, simply because they are in the horizontal plane) can a diotic stimulus be input into the hearing system. Thus, for left-right discrimination (between the 0º-180º and the 180º-360º half-planes) the ITDs and ILDs are used, while for front-back judgement (between the 270º-90º and the 90º-270º half-planes) the DDF carries major importance. Because all three localization cues can be used, and because a sound source is most often located here (taking speech as an example), the horizontal is the plane where sound source localization performances are highest.

2.5.2 Sound source localization in the median plane

While in the horizontal plane dichotic stimuli are far more common, a sound source located in the median plane will certainly reach the ears as a diotic stimulus: in fact, if the asymmetries of the head are ignored, the distances and the incidence angles between a sound source located in this plane and the two ears are always the same. Therefore, the only cue available to our hearing system is the DDF. For these reasons, the median plane is the one where our hearing system has most difficulty in terms of sound source localization performance.

2.5.3 Sound source localization in the vertical or frontal plane

The frontal plane can be considered as the vertical version of the horizontal plane: diotic stimuli are present only for sound sources located at 90º and 270º of elevation, whilst for all other positions a dichotic stimulus reaches the hearing system. Up-down discriminations are performed through analyzing the DDF. The localization accuracy in this plane is not as precise as in the horizontal plane, while certainly not as vague as in the median plane.

3. Simulation of the spatial hearing

A short yet comprehensive overview of the mechanisms of spatial hearing has been provided. How our hearing system localizes a sound source in a 3D soundscape has been examined, and the specific attributes of the sound input into our ears applied to achieve this have been noted. Phenomena that exist in nature have therefore been described: every sound coming from every position is filtered by our body, our head, and our outer ear, and thanks to these filters it can be localized by our hearing system. The goal now becomes how to simulate this using a computer.

3.1 Spatial hearing through headphones

What does sound spatialization mean? It could be considered a synæsthesia, because the concept of space mainly refers to the sense of sight, while the word sound is, of course, related to hearing; nevertheless, these terms may be associated, and such an association creates a new concept: the soundscape.

An attempt to define the concept of soundscape could start with a simple question: what is the difference between listening to the sounds that we hear every day, for example walking down the street, and listening to a CD-DA (Compact Disc Digital Audio) of the same sounds recorded, played from any stereo reproduction system? Independently of the origin of the stimuli, in everyday life we listen to sounds coming from sources located in a 3D space: we are in the middle of an immersive 3D soundscape, where for each sound we are more or less able to detect the position of its source. When we listen, instead, to a CD-DA, the sound is presented frontally. Using a standard stereo reproduction set-up (with the two loudspeakers placed at two of the corners of an equilateral triangle and the listener placed at the third), the sound that reaches our ears is not 3D, but mono-dimensional, in the sense that each sound source can be localized in one or the other loudspeaker, or on an imaginary line between the two. In fact, a sound that is played at an equal level from both loudspeakers will be localized exactly between the two, thanks to psycho-acoustic mechanisms that will not be discussed in this chapter (for more information, see Moore, 2003 and Blauert, 1996).

The main difference between the two listening situations has thus been defined: in the real one, a 3D soundscape is presented to our hearing system, while during a CD-DA playback the soundscape is mono-dimensional and frontal (the various interactions with the room where the CD-DA is played, which can generate reflections coming from all directions and therefore stimulate the perception of a more spacious soundscape, are not considered here).

Staying for a moment with the playback of recorded sound: adding, for example, two loudspeakers behind the listener could help in coming closer to the experience of a 3D soundscape. If the four loudspeakers are placed at the corners of a square, with the listener located exactly in the centre, the sounds can be spatialized within a plane, thus a bi-dimensional soundscape can be created. In fact, by changing the weights (the levels) and the sound contents of the signals sent to the four channels, sound sources can be virtually located within the square described by the loudspeakers (again, for more information see Moore, 2003 and Blauert, 1996). This is called the Quadraphonic reproduction system (Quad), and it was the starting point for more famous and recent surround systems such as Dolby Digital (5.1, 7.1, etc.) or THX Surround: even if these come closer to a proper 3D soundscape simulation, giving the impression of sound sources spatialized within a plane, they nevertheless lack one dimension. What if four other loudspeakers are added above, generating a cube with eight loudspeakers at the apexes and the listener placed exactly in the middle? Using this specific system, a third dimension (height) can be simulated.

Now for some examples:

- If a sound is played at the same level from two frontal loudspeakers, the virtual sound source will be located in the middle of the line between the two loudspeakers. Through introducing differences in level between the two loudspeakers, the sound source can be moved to any position along that line.
- If a sound is played at the same level from four loudspeakers placed at the corners of a square, with the listener located in the centre, the virtual sound source will be located in the middle of the square described by the four loudspeakers (the position of the listener). Through introducing differences in level among the four loudspeakers, the sound source can be moved to any position within that plane.
- If a sound is played at the same level from eight loudspeakers placed at the apexes of a cube, with the listener located in the centre, the virtual sound source will be located in the middle of the cube described by the eight loudspeakers (again, the position of the listener). Through introducing differences in level among the eight loudspeakers, the sound source can be moved to any position within that space. With multiple sounds played at different levels from the eight loudspeakers, a complex 3D soundscape can be generated.

Using this reproduction system with eight loudspeakers and, of course, a proper sound spatialization engine for the weighting of the respective signals in the eight channels, a 3D soundscape can be simulated (for more information on 3D spatialization techniques over loudspeaker arrays, see Gerzon, 1973 and Pulkki, 1997). However, is this the only way to generate a 3D soundscape artificially? We used eight loudspeakers, therefore eight channels, in order to be able virtually to locate a sound source in a 3D space, but don't we have only two ears each? In Section 2 of this chapter, the mechanisms of spatial hearing were described: sound sources can be located in the three dimensions using just two receivers. Thus, with sound reaching both ears through a simple pair of headphones, it could be possible to eliminate complex and expensive multichannel loudspeaker systems. Yet how can we simulate a three-dimensional soundscape using only two channels? An answer to this question will be found in the following sections.

Note from the author: it could be noticed that, in order to generate a bi-dimensional soundscape, it is not in fact essential to have four loudspeakers, as three placed at the corners of a triangle, with the listener in the centre, are sufficient; also, for a 3D soundscape simulation, four loudspeakers placed at the apexes of a tetrahedron with the listener in the middle would be enough. This is absolutely true, yet I have tried to make the examples as simple as possible, and using an even number of loudspeakers seemed to aid clarity.

3.2 Dummy head and in-ear microphones: binaural recordings

The first and easiest way to simulate a 3D soundscape through headphones is simply to perform a binaural recording using a dummy head microphone or a pair of in-ear microphones. A dummy head microphone can easily be made by taking a head mannequin with the dimensions of an average adult human head, with sufficiently precise pinna reproductions (in order to preserve the resonances, refractions and absorptions typical of a human HRTF), and placing two miniature omni-directional microphones at the entrances of the two auditory canals. The recordings made by placing this device in the middle of a 3D soundscape, then played back through a pair of headphones, will give the listener the impression of being exactly in the position of the dummy head, with sounds coming from every direction: left-right, front-back, and up-down. The same result can be obtained using so-called in-ear microphones, which are simply two miniature omni-directional microphones placed inside the auditory canals of a subject, positioned at the entrance of the canal itself.

The fact that the microphones should be placed at the exact entrance of the auditory canal, and not at the position of the eardrum, may need some explanation: as happens with the pinna, the auditory canal too has its own resonances, but different studies (see Hammershoi, 1995) have shown that these are not dependent on the angle of incidence of the input sound. Therefore, all of the localization cues are already present in the signal at the entrance of the ear canal, thus the microphone can be placed in that position.

A further observation needs to be made about the use of headphones for the reproduction of binaurally recorded sounds (and, as will be seen, even for the reproduction of binaurally synthesized signals): when a stereo sound is played back through two frontal loudspeakers, the signal coming from the left loudspeaker will reach both the left and the right ears, exactly as will the signal coming from the right loudspeaker. This phenomenon is called crosstalk and, in the case of binaural sound reproduction, it would generate many unwanted situations. In fact, when playing binaural sounds, it is really important that the signal of the left channel should reach only the left ear, and that of the right channel only the right ear. Thus, the use of headphones is essential in this specific case. There do exist systems for reproducing binaural sounds through stereo loudspeakers (transaural and crosstalk cancellation systems; as an example, see Tokuno, 1996), although they will not be discussed in this chapter.

The obvious problems linked to binaural recordings lie in the fact that the recorded 3D soundscape needs to be created in a real environment, using real sound sources or loudspeakers, and that the recorded scene cannot be modified after the recording. For these reasons, this cannot be considered a proper 3D sound simulation technique, but simply a 3D sound recording technique.

3.3 Impulse response and digital convolution

In order to simulate the directional mechanisms of the outer ear, the latter needs to be seen as a system, in the mathematical sense of the term. Given two families of signals, F1 and F2, a system is an apparatus that can transform each F1 signal into an F2 signal. It can be seen as a black box, the behaviour of which is described by the transform law S: F1 → F2. In environmental acoustics, a system is a room or a hall; in a recording studio, a system is an outboard effect; in an orchestra, a system is a musical instrument. As regards spatial hearing, a system is the ensemble of all those elements that participate in the modification of the incoming sound depending on its incidence angle: the shoulders, the torso, the head, and the pinna.

Without entering too deeply into the mathematical domain (if we did, we would have to face other definitions such as linearity and time-invariance; for more information on these topics, see Rabiner, 1975), it may be stated that a system can be unequivocally described by its response to a specific signal: the impulse. This specific signal, known also as the Dirac δ or δ(t), can be seen as a rectangle with an infinitesimal base and an infinite height: thinking about sound, it is an impulse shorter than a clap, like a simple click. A particular characteristic of this signal is that its frequency content is represented by a flat line running parallel to the frequency axis: it can therefore be gathered that the Dirac δ contains all the frequencies at the same amplitude.
If an impulse is reproduced within a system, the recording of the impulse itself, having passed through and been processed by the system, is called the Impulse Response, or IR, and it describes unequivocally the response of the system to all possible input signals. The system can then be simulated by performing a simple mathematical operation between its IR and any signal that needs to be filtered. This operation is called convolution, and in the digital domain it can be seen as a series of multiplications between the samples of the IR and those of the signal to be filtered. Here, it should be specified that digital convolution is not a straightforward sample-by-sample multiplication of the two signals (see Picinali, 2006). Therefore, knowing for example the IR of a music hall and convolving a musical signal with it, for example a piano sonata, the recording can be listened to as if it were played inside that music hall itself!

This simulation becomes somewhat more complicated when multi-dimensional soundscapes are considered: as was said above, the spatial hearing system modifies sounds differently according to the positions of their sources, thus the IR of this system needs to be recorded for all of the sound source positions that need to be simulated. Assuming a sphere around the head of a listener or of a dummy head, equidistantly spaced positions can be sampled on its surface. Reproducing an impulse at each of the sampled positions, and recording it through the two microphones of the dummy head placed in the middle of the sphere, a database of IRs is created. Such a database is called a Head Related Impulse Response (HRIR) database (for an example, see Algazi, 2001); it represents the behaviour of the spatial hearing system for each sound source position around the head at a given distance (in this specific case, the distance is given by the radius of the sphere on which the positions have been sampled).

Therefore, in order to spatialize a sound in a three-dimensional soundscape, the IR corresponding to the position of the sound source to be simulated (its azimuth and elevation angles) needs to be extracted from the HRIR database and convolved with the signal to be spatialized. The resulting processed stereo signal, listened to through headphones, will give the listener the impression of a sound coming from the desired position. While this simulation technique seems relatively easy to realize, there are several problems linked to the IR extraction (again, see Picinali, 2006) and to the fact that performing convolution in real time is a computationally heavy operation, possible only with fast and powerful computers.
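As an illustration of this last step, the Python sketch below convolves a mono signal with the left and right HRIRs of the desired position. The file names are hypothetical; a real HRIR database, such as the one described by Algazi (2001), would provide the two impulse responses.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Hypothetical input files: a mono sound to be spatialized, and the
# left/right HRIRs measured for the desired position (e.g. 60º azimuth,
# 0º elevation), taken from an HRIR database.
rate, dry = wavfile.read("piano_mono.wav")
_, hrir_left = wavfile.read("hrir_az060_el000_L.wav")
_, hrir_right = wavfile.read("hrir_az060_el000_R.wav")

dry = dry.astype(np.float64)

# One convolution per ear filters the dry signal with the directional
# response of the outer hearing system, yielding the binaural pair.
left = fftconvolve(dry, hrir_left.astype(np.float64))
right = fftconvolve(dry, hrir_right.astype(np.float64))

# Normalize and save as a stereo file to be listened to over headphones.
binaural = np.stack([left, right], axis=1)
binaural /= np.max(np.abs(binaural))
wavfile.write("piano_binaural.wav", rate, (binaural * 32767).astype(np.int16))
```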
3.4 Isolation and individual simulation of the three localization cues

Another approach to 3D soundscape simulation over headphones could be to dissociate the three localization cues and simulate them individually (similar approaches can be found in the literature discussing HRIR interpolation techniques for the simulation of moving sound sources; as an example see Hwang, 2006).

At first sight, the ILDs seem the easiest to simulate. When sounds are mixed on a standard mixing desk, a potentiometer, the panpot, is usually available to regulate the levels of the signals sent to the left or to the right output. It simply creates differences in level between the two channels, in order to localize the sound not in the centre, but in any position between the two speakers. However, as was seen in Section 2.4.1, the ILDs form a frequency-dependent parameter, thus they vary throughout the audible frequency range. For this reason, the ILDs cannot be simulated simply by reducing the level of the signal sent to the right or to the left channel; an equalization filter needs to be implemented in order to attenuate the high frequencies more than the low ones (the slope of the filter needs to change according to the sound source position to be simulated), trying to follow the ILD values given in the diagram shown in Figure 2.

An equalization filter can be implemented also for the simulation of the DDF, for both the left and the right channels: the coefficients of the filter (the values of enhancement or reduction for each frequency band, to be applied to the signal) need to change according to the position of the sound source to be simulated, in an attempt to follow the specific reflections, refractions, and resonances generated by the pinna and by the other directional filtering elements (head, shoulders, and torso).

The ITDs can easily be simulated using a simple delay: if the sound source is located in the right hemisphere, the delay is placed on the signal sent to the left ear, while if the source is placed in the left hemisphere, the right-ear signal is the one to be delayed.

Summing up these three individual simulations, calibrated specifically for the sound source position to be simulated, binaural spatialization can be performed for each desired sound. Yet two problems remain with this binaural simulation method:

- How is it possible to separate the ILDs from the DDF? Since each is frequency dependent, analysing the HRTF we would find the ILDs perfectly mixed with the DDF: we would be able to measure only the effect of the sum of the two, and not of each individually. In fact, referring to the psycho-acoustic properties of spatial hearing, there are no significant differences between these two parameters; they both modify the incoming signals, filtering them in frequency, thus they can be considered as non-divisible.
- The implementation of equalization filters for the simulation of the ILDs and DDF means that approximations need to be made in terms of the frequency response of the spatial hearing system. Even if studies have demonstrated that these can be considered non-influential in terms of spatialization accuracy (see Kistler, 1992), binaural spatialization algorithms based on simple equalization filters are far from providing a high-quality spatial sound perception.

Nevertheless, the flexibility of this approach makes it the most suitable for what will be performed in the following section: we thus approximate the ILDs as a frequency-independent parameter, and the DDF as the cue carrying all of the frequency content of the HRTF. A minimal sketch of this simplified approach is given below.
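The following Python sketch makes the simplification concrete: the ILDs are rendered as a frequency-independent gain (the approximation just stated), the ITDs as an integer sample delay, and the DDF is left out for brevity (a pair of direction-dependent equalization filters would take its place). The ILD and ITD values in the example are illustrative choices of mine, not measured data.

```python
import numpy as np

def simple_binaural_pan(mono, sample_rate, ild_db, itd_seconds):
    """Crude binaural panner: frequency-independent ILD plus ITD.

    Positive ild_db and itd_seconds move the source to the right.
    The DDF (direction-dependent equalization) is deliberately omitted.
    """
    mono = np.asarray(mono, dtype=np.float64)
    # ILD: attenuate the ear opposite the source.
    gain = 10.0 ** (-abs(ild_db) / 20.0)
    left = mono * (gain if ild_db > 0 else 1.0)
    right = mono * (gain if ild_db < 0 else 1.0)
    # ITD: delay the ear opposite the source by a whole number of samples.
    delay = int(round(abs(itd_seconds) * sample_rate))
    pad = np.zeros(delay)
    if itd_seconds > 0:       # source on the right: left ear lags
        left = np.concatenate([pad, left])
        right = np.concatenate([right, pad])
    else:                     # source on the left: right ear lags
        right = np.concatenate([pad, right])
        left = np.concatenate([left, pad])
    return np.stack([left, right], axis=1)

# Example: a 500 Hz tone pushed to the right with a 6 dB ILD and 0.5 ms ITD.
sr = 44100
t = np.arange(sr) / sr
stereo = simple_binaural_pan(np.sin(2 * np.pi * 500 * t), sr, 6.0, 0.0005)
```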

4. A multi-dimensional trip

In the book Flatland (see Abbott, 1999), the author imagines the trip of a square, coming from a bi-dimensional world called Flatland, into other worlds with different numbers of dimensions: from the zero-dimensional world, a point with no dimensions and with all the inhabitants placed exactly in the same position, to the mono-dimensional world, a long line where segments lie one beside the other, and finally to the three-dimensional world, called Spaceland. As regards figures, the book obviously focuses on the visual stimulus: each added dimension is described as a change in terms of view, passing from single points to lines, and then to bi-dimensional figures. No reference is made to the auditory stimulus. Of course, Abbott cannot be blamed for having ignored multi-dimensional hearing: this was probably not among his intentions, and it is difficult not to admit that sight is the most important sense for the human being in terms of environmental information gathering. Nevertheless, now that we have seen how to simulate three-dimensional soundscapes over headphones, an attempt at a multi-dimensional sound trip can indeed be made.

Zero dimensions: the point

What is a zero-dimensional soundscape, and how can it be created? Answering the first part of the question, it may be stated that a zero-dimensional soundscape exists when the sound has absolutely no attributes providing us with information about its location in space. From this, it is possible to assume that in a zero-dimensional soundscape the sound needs to be located in an undefined or, even better, neutral position: and what position could be more neutral than the middle of the head, the centre of our auditory space? Having no spatial attributes means that no localization cues can be simulated: no interaural intensity or time differences, and no direction-dependent filtering. It is a purely diotic stimulus, with the same amplitude and arrival time at the two ears, and with no frequency filtering at all. As a practical example, consider a sinusoidal signal (a sinusoid is a pure tone, containing only one frequency; it therefore cannot be frequency filtered, but merely varied in its amplitude) presented through headphones at the same intensity and phase at both ears. No interaural differences, no direction-dependent filtering simulation, only a single pure tone: these result in zero dimensions, or the point.

One dimension: the line

Starting from the zero-dimensional soundscape and proceeding with an additive approach, we could create a mono-dimensional soundscape simply by introducing a spatial attribute within the selected signal, therefore modifying it and making it partially localizable. The expression partially localizable has a precise meaning: visually speaking, and referring to Flatland, a mono-dimensional world is composed of elements located along a line, therefore localizable only in the left-right, front-back, or up-down dimension, but not in a combination of these. Passing to the auditory domain, a better representation of that line is given by the phenomenon of sound lateralization: in that case, the sound source is not localized within a three-dimensional soundscape, but lateralized inside our head along an imaginary line drawn from one ear to the other.

Starting from the point, a single diotic pure tone, the first dimension can be added by introducing Interaural Level Differences (ILDs) within that signal, transforming it into a dichotic stimulus. Two different pure tones could be created, each with different ILDs, therefore both lateralized along the line passing from the left ear to the right. An objection could be made that the first dimension could equally be created by introducing ITDs instead of ILDs; it must be noted, however, that historically the ILDs were the first localization cue to be simulated, and they remain the most frequently used in sound reproduction. Take again the example of the channel strip of a sound mixing console, with its panpot potentiometer: it moves the sounds from left to right simply by changing the amplitude of the left or the right channel, therefore simulating the ILDs.

Only interaural level differences, no time differences between the signals at the two ears, nor direction-dependent filtering simulation, merely two pure tones: these result in one dimension, or the line.

Two dimensions: the plane

Proceeding with the method proposed above, the second dimension can be added by introducing Interaural Time Differences (ITDs) within the stimuli sent to the two ears. At this point, further objections might come from readers: the introduction of the ITDs has nothing to do with the addition of a second dimension and, most of all, does not follow at all the visual notion of a line becoming a square. These objections are completely justified, yet... the important event, perceptually, after adding the ITDs is that the sounds begin to be localized outside the head, and are no longer lateralized between the ears. When we listen to a standard stereo signal from a CD-DA or MP3 player, the sensation we have is that the sound sources are lateralized within our head: this effect is called Inside the Head Locatedness (IHL; see Blauert, 1996, and Moore, 2003), and it is due to the fact that the only localization cue simulated in this specific case is the ILDs. Introducing differences in the arrival time of the signals at the two ears, the soundscape seems to expand outside our head, gaining a dimension which cannot precisely be defined as height or depth, but which obviously gives a more complex spatial characterization to the simulated soundscape. This added impression can be compared with the gain of a new dimension. Starting from the mono-dimensional soundscape, the two pure tones could then become more complex in terms of frequency content, for example by adding simple harmonic frequencies, in order to bring them closer to a real sound, and of course introducing both ILDs and ITDs.

Only interaural level and time differences, no direction-dependent filtering simulation, two slightly complex tones: these result in two dimensions, or the plane.

For those readers conversant with computer music and wishing to gain a clearer idea of the sensation described above, a simple experiment can be tried (a code sketch of this experiment is also given at the end of this section). Using audio editing software (a good, simple and cost-free one can easily be found online), load the same mono audio file into two different tracks, with the first track sent only to the left channel, and the second only to the right. Now boost the second track with a +6 dB gain (turning up the volume fader of that specific track to the +6 dB level), and listen to the sound through a pair of headphones: the sound source should clearly be lateralized to the right, remaining still inside the head. Next, delay the first track by 32 samples (this can be done using a VST plugin or by simply moving the audio file forward within the track), which at 44.1 kHz corresponds to a delay of approximately 0.7 ms. The sound source will still stay on the right side, but the sensation will be that the sound moves outside the head itself, passing from a simple sound lateralization sensation to a more complex, even if not precisely definable, sound localization one. It needs to be said that this effect is highly subjective, and can therefore differ according to the person listening to the spatialized audio. The only objective observation is that, through adding the ITDs, the sound gains in terms of spatial representation.

Three dimensions: space

The passage from a bi-dimensional to a three-dimensional soundscape should now be an easy task: after introducing a simulation of the last localization cue, the Direction-Dependent Filtering (DDF), the two slightly complex tones can be substituted by various recordings of real sounds, and be freely spatialized within the three dimensions. As learned from the preceding sections, this can easily be achieved by performing a convolution between the signals to be spatialized and the HRIRs corresponding to the positions where the different sound sources are to be simulated, performing what is called binaural spatialization.

Interaural level and time differences, direction-dependent filtering simulation, multiple recordings of real sounds spatialized in left-right, front-back and up-down positions: these result in three dimensions, or space.
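As promised above, the headphone experiment from the two-dimensions step can also be reproduced in code. The Python sketch below is a minimal version under the same parameters (+6 dB gain on the right channel, a 32-sample delay on the left); the input file name is hypothetical.

```python
import numpy as np
from scipy.io import wavfile

rate, mono = wavfile.read("any_mono_sound.wav")  # hypothetical mono file
mono = mono.astype(np.float64)

right = mono * 10 ** (6 / 20)                  # second track: +6 dB gain (ILDs)
left = np.concatenate([np.zeros(32), mono])    # first track: 32-sample delay (ITDs)
right = np.concatenate([right, np.zeros(32)])  # pad to equal length

stereo = np.stack([left, right], axis=1)
stereo /= np.max(np.abs(stereo))               # normalize to avoid clipping
wavfile.write("lateralization_test.wav", rate, (stereo * 32767).astype(np.int16))
# At 44.1 kHz, 32 samples is roughly 0.7 ms: listen over headphones with
# and without the delay to compare lateralization and localization.
```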


More information

BINAURAL RECORDING SYSTEM AND SOUND MAP OF MALAGA

BINAURAL RECORDING SYSTEM AND SOUND MAP OF MALAGA EUROPEAN SYMPOSIUM ON UNDERWATER BINAURAL RECORDING SYSTEM AND SOUND MAP OF MALAGA PACS: Rosas Pérez, Carmen; Luna Ramírez, Salvador Universidad de Málaga Campus de Teatinos, 29071 Málaga, España Tel:+34

More information

Reproduction of Surround Sound in Headphones

Reproduction of Surround Sound in Headphones Reproduction of Surround Sound in Headphones December 24 Group 96 Department of Acoustics Faculty of Engineering and Science Aalborg University Institute of Electronic Systems - Department of Acoustics

More information

The analysis of multi-channel sound reproduction algorithms using HRTF data

The analysis of multi-channel sound reproduction algorithms using HRTF data The analysis of multichannel sound reproduction algorithms using HRTF data B. Wiggins, I. PatersonStephens, P. Schillebeeckx Processing Applications Research Group University of Derby Derby, United Kingdom

More information

Computational Perception /785

Computational Perception /785 Computational Perception 15-485/785 Assignment 1 Sound Localization due: Thursday, Jan. 31 Introduction This assignment focuses on sound localization. You will develop Matlab programs that synthesize sounds

More information

Speech Compression. Application Scenarios

Speech Compression. Application Scenarios Speech Compression Application Scenarios Multimedia application Live conversation? Real-time network? Video telephony/conference Yes Yes Business conference with data sharing Yes Yes Distance learning

More information

THE TEMPORAL and spectral structure of a sound signal

THE TEMPORAL and spectral structure of a sound signal IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 1, JANUARY 2005 105 Localization of Virtual Sources in Multichannel Audio Reproduction Ville Pulkki and Toni Hirvonen Abstract The localization

More information

Final Exam Study Guide: Introduction to Computer Music Course Staff April 24, 2015

Final Exam Study Guide: Introduction to Computer Music Course Staff April 24, 2015 Final Exam Study Guide: 15-322 Introduction to Computer Music Course Staff April 24, 2015 This document is intended to help you identify and master the main concepts of 15-322, which is also what we intend

More information

Virtual Sound Source Positioning and Mixing in 5.1 Implementation on the Real-Time System Genesis

Virtual Sound Source Positioning and Mixing in 5.1 Implementation on the Real-Time System Genesis Virtual Sound Source Positioning and Mixing in 5 Implementation on the Real-Time System Genesis Jean-Marie Pernaux () Patrick Boussard () Jean-Marc Jot (3) () and () Steria/Digilog SA, Aix-en-Provence

More information

HRTF adaptation and pattern learning

HRTF adaptation and pattern learning HRTF adaptation and pattern learning FLORIAN KLEIN * AND STEPHAN WERNER Electronic Media Technology Lab, Institute for Media Technology, Technische Universität Ilmenau, D-98693 Ilmenau, Germany The human

More information

A3D Contiguous time-frequency energized sound-field: reflection-free listening space supports integration in audiology

A3D Contiguous time-frequency energized sound-field: reflection-free listening space supports integration in audiology A3D Contiguous time-frequency energized sound-field: reflection-free listening space supports integration in audiology Joe Hayes Chief Technology Officer Acoustic3D Holdings Ltd joe.hayes@acoustic3d.com

More information

Capturing 360 Audio Using an Equal Segment Microphone Array (ESMA)

Capturing 360 Audio Using an Equal Segment Microphone Array (ESMA) H. Lee, Capturing 360 Audio Using an Equal Segment Microphone Array (ESMA), J. Audio Eng. Soc., vol. 67, no. 1/2, pp. 13 26, (2019 January/February.). DOI: https://doi.org/10.17743/jaes.2018.0068 Capturing

More information

VIRTUAL ACOUSTICS: OPPORTUNITIES AND LIMITS OF SPATIAL SOUND REPRODUCTION

VIRTUAL ACOUSTICS: OPPORTUNITIES AND LIMITS OF SPATIAL SOUND REPRODUCTION ARCHIVES OF ACOUSTICS 33, 4, 413 422 (2008) VIRTUAL ACOUSTICS: OPPORTUNITIES AND LIMITS OF SPATIAL SOUND REPRODUCTION Michael VORLÄNDER RWTH Aachen University Institute of Technical Acoustics 52056 Aachen,

More information

The psychoacoustics of reverberation

The psychoacoustics of reverberation The psychoacoustics of reverberation Steven van de Par Steven.van.de.Par@uni-oldenburg.de July 19, 2016 Thanks to Julian Grosse and Andreas Häußler 2016 AES International Conference on Sound Field Control

More information

c 2014 Michael Friedman

c 2014 Michael Friedman c 2014 Michael Friedman CAPTURING SPATIAL AUDIO FROM ARBITRARY MICROPHONE ARRAYS FOR BINAURAL REPRODUCTION BY MICHAEL FRIEDMAN THESIS Submitted in partial fulfillment of the requirements for the degree

More information

Evaluation of a new stereophonic reproduction method with moving sweet spot using a binaural localization model

Evaluation of a new stereophonic reproduction method with moving sweet spot using a binaural localization model Evaluation of a new stereophonic reproduction method with moving sweet spot using a binaural localization model Sebastian Merchel and Stephan Groth Chair of Communication Acoustics, Dresden University

More information

Multichannel Audio Technologies. More on Surround Sound Microphone Techniques:

Multichannel Audio Technologies. More on Surround Sound Microphone Techniques: Multichannel Audio Technologies More on Surround Sound Microphone Techniques: In the last lecture we focused on recording for accurate stereophonic imaging using the LCR channels. Today, we look at the

More information

Binaural auralization based on spherical-harmonics beamforming

Binaural auralization based on spherical-harmonics beamforming Binaural auralization based on spherical-harmonics beamforming W. Song a, W. Ellermeier b and J. Hald a a Brüel & Kjær Sound & Vibration Measurement A/S, Skodsborgvej 7, DK-28 Nærum, Denmark b Institut

More information

A binaural auditory model and applications to spatial sound evaluation

A binaural auditory model and applications to spatial sound evaluation A binaural auditory model and applications to spatial sound evaluation Ma r k o Ta k a n e n 1, Ga ë ta n Lo r h o 2, a n d Mat t i Ka r ja l a i n e n 1 1 Helsinki University of Technology, Dept. of Signal

More information

Audio Engineering Society. Convention Paper. Presented at the 131st Convention 2011 October New York, NY, USA

Audio Engineering Society. Convention Paper. Presented at the 131st Convention 2011 October New York, NY, USA Audio Engineering Society Convention Paper Presented at the 131st Convention 2011 October 20 23 New York, NY, USA This Convention paper was selected based on a submitted abstract and 750-word precis that

More information

Audio Engineering Society. Convention Paper. Presented at the 124th Convention 2008 May Amsterdam, The Netherlands

Audio Engineering Society. Convention Paper. Presented at the 124th Convention 2008 May Amsterdam, The Netherlands Audio Engineering Society Convention Paper Presented at the 124th Convention 2008 May 17 20 Amsterdam, The Netherlands The papers at this Convention have been selected on the basis of a submitted abstract

More information

Surround: The Current Technological Situation. David Griesinger Lexicon 3 Oak Park Bedford, MA

Surround: The Current Technological Situation. David Griesinger Lexicon 3 Oak Park Bedford, MA Surround: The Current Technological Situation David Griesinger Lexicon 3 Oak Park Bedford, MA 01730 www.world.std.com/~griesngr There are many open questions 1. What is surround sound 2. Who will listen

More information

Acoustics II: Kurt Heutschi recording technique. stereo recording. microphone positioning. surround sound recordings.

Acoustics II: Kurt Heutschi recording technique. stereo recording. microphone positioning. surround sound recordings. demo Acoustics II: recording Kurt Heutschi 2013-01-18 demo Stereo recording: Patent Blumlein, 1931 demo in a real listening experience in a room, different contributions are perceived with directional

More information

DISTANCE CODING AND PERFORMANCE OF THE MARK 5 AND ST350 SOUNDFIELD MICROPHONES AND THEIR SUITABILITY FOR AMBISONIC REPRODUCTION

DISTANCE CODING AND PERFORMANCE OF THE MARK 5 AND ST350 SOUNDFIELD MICROPHONES AND THEIR SUITABILITY FOR AMBISONIC REPRODUCTION DISTANCE CODING AND PERFORMANCE OF THE MARK 5 AND ST350 SOUNDFIELD MICROPHONES AND THEIR SUITABILITY FOR AMBISONIC REPRODUCTION T Spenceley B Wiggins University of Derby, Derby, UK University of Derby,

More information

SOUND 1 -- ACOUSTICS 1

SOUND 1 -- ACOUSTICS 1 SOUND 1 -- ACOUSTICS 1 SOUND 1 ACOUSTICS AND PSYCHOACOUSTICS SOUND 1 -- ACOUSTICS 2 The Ear: SOUND 1 -- ACOUSTICS 3 The Ear: The ear is the organ of hearing. SOUND 1 -- ACOUSTICS 4 The Ear: The outer ear

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb 2008. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum,

More information

Fundamentals of Digital Audio *

Fundamentals of Digital Audio * Digital Media The material in this handout is excerpted from Digital Media Curriculum Primer a work written by Dr. Yue-Ling Wong (ylwong@wfu.edu), Department of Computer Science and Department of Art,

More information

The relation between perceived apparent source width and interaural cross-correlation in sound reproduction spaces with low reverberation

The relation between perceived apparent source width and interaural cross-correlation in sound reproduction spaces with low reverberation Downloaded from orbit.dtu.dk on: Feb 05, 2018 The relation between perceived apparent source width and interaural cross-correlation in sound reproduction spaces with low reverberation Käsbach, Johannes;

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 1, 21 http://acousticalsociety.org/ ICA 21 Montreal Montreal, Canada 2 - June 21 Psychological and Physiological Acoustics Session appb: Binaural Hearing (Poster

More information

Envelopment and Small Room Acoustics

Envelopment and Small Room Acoustics Envelopment and Small Room Acoustics David Griesinger Lexicon 3 Oak Park Bedford, MA 01730 Copyright 9/21/00 by David Griesinger Preview of results Loudness isn t everything! At least two additional perceptions:

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb 2009. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence

More information

3D Sound System with Horizontally Arranged Loudspeakers

3D Sound System with Horizontally Arranged Loudspeakers 3D Sound System with Horizontally Arranged Loudspeakers Keita Tanno A DISSERTATION SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AND ENGINEERING

More information

Upper hemisphere sound localization using head-related transfer functions in the median plane and interaural differences

Upper hemisphere sound localization using head-related transfer functions in the median plane and interaural differences Acoust. Sci. & Tech. 24, 5 (23) PAPER Upper hemisphere sound localization using head-related transfer functions in the median plane and interaural differences Masayuki Morimoto 1;, Kazuhiro Iida 2;y and

More information

6-channel recording/reproduction system for 3-dimensional auralization of sound fields

6-channel recording/reproduction system for 3-dimensional auralization of sound fields Acoust. Sci. & Tech. 23, 2 (2002) TECHNICAL REPORT 6-channel recording/reproduction system for 3-dimensional auralization of sound fields Sakae Yokoyama 1;*, Kanako Ueno 2;{, Shinichi Sakamoto 2;{ and

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Psychological and Physiological Acoustics Session 2aPPa: Binaural Hearing

More information

The Spatial Soundscape. James L. Barbour Swinburne University of Technology, Melbourne, Australia

The Spatial Soundscape. James L. Barbour Swinburne University of Technology, Melbourne, Australia The Spatial Soundscape 1 James L. Barbour Swinburne University of Technology, Melbourne, Australia jbarbour@swin.edu.au Abstract While many people have sought to capture and document sounds for posterity,

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 A MODEL OF THE HEAD-RELATED TRANSFER FUNCTION BASED ON SPECTRAL CUES

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 A MODEL OF THE HEAD-RELATED TRANSFER FUNCTION BASED ON SPECTRAL CUES 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, -7 SEPTEMBER 007 A MODEL OF THE HEAD-RELATED TRANSFER FUNCTION BASED ON SPECTRAL CUES PACS: 43.66.Qp, 43.66.Pn, 43.66Ba Iida, Kazuhiro 1 ; Itoh, Motokuni

More information

ROOM AND CONCERT HALL ACOUSTICS MEASUREMENTS USING ARRAYS OF CAMERAS AND MICROPHONES

ROOM AND CONCERT HALL ACOUSTICS MEASUREMENTS USING ARRAYS OF CAMERAS AND MICROPHONES ROOM AND CONCERT HALL ACOUSTICS The perception of sound by human listeners in a listening space, such as a room or a concert hall is a complicated function of the type of source sound (speech, oration,

More information

APPLICATIONS OF A DIGITAL AUDIO-SIGNAL PROCESSOR IN T.V. SETS

APPLICATIONS OF A DIGITAL AUDIO-SIGNAL PROCESSOR IN T.V. SETS Philips J. Res. 39, 94-102, 1984 R 1084 APPLICATIONS OF A DIGITAL AUDIO-SIGNAL PROCESSOR IN T.V. SETS by W. J. W. KITZEN and P. M. BOERS Philips Research Laboratories, 5600 JA Eindhoven, The Netherlands

More information

Waves Nx VIRTUAL REALITY AUDIO

Waves Nx VIRTUAL REALITY AUDIO Waves Nx VIRTUAL REALITY AUDIO WAVES VIRTUAL REALITY AUDIO THE FUTURE OF AUDIO REPRODUCTION AND CREATION Today s entertainment is on a mission to recreate the real world. Just as VR makes us feel like

More information

Principles of Musical Acoustics

Principles of Musical Acoustics William M. Hartmann Principles of Musical Acoustics ^Spr inger Contents 1 Sound, Music, and Science 1 1.1 The Source 2 1.2 Transmission 3 1.3 Receiver 3 2 Vibrations 1 9 2.1 Mass and Spring 9 2.1.1 Definitions

More information

3D sound image control by individualized parametric head-related transfer functions

3D sound image control by individualized parametric head-related transfer functions D sound image control by individualized parametric head-related transfer functions Kazuhiro IIDA 1 and Yohji ISHII 1 Chiba Institute of Technology 2-17-1 Tsudanuma, Narashino, Chiba 275-001 JAPAN ABSTRACT

More information

Predicting localization accuracy for stereophonic downmixes in Wave Field Synthesis

Predicting localization accuracy for stereophonic downmixes in Wave Field Synthesis Predicting localization accuracy for stereophonic downmixes in Wave Field Synthesis Hagen Wierstorf Assessment of IP-based Applications, T-Labs, Technische Universität Berlin, Berlin, Germany. Sascha Spors

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 MODELING SPECTRAL AND TEMPORAL MASKING IN THE HUMAN AUDITORY SYSTEM PACS: 43.66.Ba, 43.66.Dc Dau, Torsten; Jepsen, Morten L.; Ewert,

More information

3D Audio Systems through Stereo Loudspeakers

3D Audio Systems through Stereo Loudspeakers Diploma Thesis Telecommunications & Media University of Applied Sciences St. Pölten 3D Audio Systems through Stereo Loudspeakers Completed under supervision of Hannes Raffaseder Completed by Miguel David

More information

Measuring impulse responses containing complete spatial information ABSTRACT

Measuring impulse responses containing complete spatial information ABSTRACT Measuring impulse responses containing complete spatial information Angelo Farina, Paolo Martignon, Andrea Capra, Simone Fontana University of Parma, Industrial Eng. Dept., via delle Scienze 181/A, 43100

More information

A virtual headphone based on wave field synthesis

A virtual headphone based on wave field synthesis Acoustics 8 Paris A virtual headphone based on wave field synthesis K. Laumann a,b, G. Theile a and H. Fastl b a Institut für Rundfunktechnik GmbH, Floriansmühlstraße 6, 8939 München, Germany b AG Technische

More information

On binaural spatialization and the use of GPGPU for audio processing

On binaural spatialization and the use of GPGPU for audio processing Marshall University Marshall Digital Scholar Weisberg Division of Computer Science Faculty Research Weisberg Division of Computer Science 2012 On binaural spatialization and the use of GPGPU for audio

More information

Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction

Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction Improving room acoustics at low frequencies with multiple loudspeakers and time based room correction S.B. Nielsen a and A. Celestinos b a Aalborg University, Fredrik Bajers Vej 7 B, 9220 Aalborg Ø, Denmark

More information

THE PERCEPTION OF ALL-PASS COMPONENTS IN TRANSFER FUNCTIONS

THE PERCEPTION OF ALL-PASS COMPONENTS IN TRANSFER FUNCTIONS PACS Reference: 43.66.Pn THE PERCEPTION OF ALL-PASS COMPONENTS IN TRANSFER FUNCTIONS Pauli Minnaar; Jan Plogsties; Søren Krarup Olesen; Flemming Christensen; Henrik Møller Department of Acoustics Aalborg

More information

Personalized 3D sound rendering for content creation, delivery, and presentation

Personalized 3D sound rendering for content creation, delivery, and presentation Personalized 3D sound rendering for content creation, delivery, and presentation Federico Avanzini 1, Luca Mion 2, Simone Spagnol 1 1 Dep. of Information Engineering, University of Padova, Italy; 2 TasLab

More information

MEASURING DIRECTIVITIES OF NATURAL SOUND SOURCES WITH A SPHERICAL MICROPHONE ARRAY

MEASURING DIRECTIVITIES OF NATURAL SOUND SOURCES WITH A SPHERICAL MICROPHONE ARRAY AMBISONICS SYMPOSIUM 2009 June 25-27, Graz MEASURING DIRECTIVITIES OF NATURAL SOUND SOURCES WITH A SPHERICAL MICROPHONE ARRAY Martin Pollow, Gottfried Behler, Bruno Masiero Institute of Technical Acoustics,

More information

Convention e-brief 400

Convention e-brief 400 Audio Engineering Society Convention e-brief 400 Presented at the 143 rd Convention 017 October 18 1, New York, NY, USA This Engineering Brief was selected on the basis of a submitted synopsis. The author

More information

AUDITORY ILLUSIONS & LAB REPORT FORM

AUDITORY ILLUSIONS & LAB REPORT FORM 01/02 Illusions - 1 AUDITORY ILLUSIONS & LAB REPORT FORM NAME: DATE: PARTNER(S): The objective of this experiment is: To understand concepts such as beats, localization, masking, and musical effects. APPARATUS:

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Architectural Acoustics Session 1pAAa: Advanced Analysis of Room Acoustics:

More information

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner.

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner. Perception of pitch AUDL4007: 11 Feb 2010. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum, 2005 Chapter 7 1 Definitions

More information

From acoustic simulation to virtual auditory displays

From acoustic simulation to virtual auditory displays PROCEEDINGS of the 22 nd International Congress on Acoustics Plenary Lecture: Paper ICA2016-481 From acoustic simulation to virtual auditory displays Michael Vorländer Institute of Technical Acoustics,

More information

29th TONMEISTERTAGUNG VDT INTERNATIONAL CONVENTION, November 2016

29th TONMEISTERTAGUNG VDT INTERNATIONAL CONVENTION, November 2016 Measurement and Visualization of Room Impulse Responses with Spherical Microphone Arrays (Messung und Visualisierung von Raumimpulsantworten mit kugelförmigen Mikrofonarrays) Michael Kerscher 1, Benjamin

More information

ORIENTATION IN SIMPLE VIRTUAL AUDITORY SPACE CREATED WITH MEASURED HRTF

ORIENTATION IN SIMPLE VIRTUAL AUDITORY SPACE CREATED WITH MEASURED HRTF ORIENTATION IN SIMPLE VIRTUAL AUDITORY SPACE CREATED WITH MEASURED HRTF F. Rund, D. Štorek, O. Glaser, M. Barda Faculty of Electrical Engineering Czech Technical University in Prague, Prague, Czech Republic

More information

Convention Paper 9870 Presented at the 143 rd Convention 2017 October 18 21, New York, NY, USA

Convention Paper 9870 Presented at the 143 rd Convention 2017 October 18 21, New York, NY, USA Audio Engineering Society Convention Paper 987 Presented at the 143 rd Convention 217 October 18 21, New York, NY, USA This convention paper was selected based on a submitted abstract and 7-word precis

More information

THE DEVELOPMENT OF A DESIGN TOOL FOR 5-SPEAKER SURROUND SOUND DECODERS

THE DEVELOPMENT OF A DESIGN TOOL FOR 5-SPEAKER SURROUND SOUND DECODERS THE DEVELOPMENT OF A DESIGN TOOL FOR 5-SPEAKER SURROUND SOUND DECODERS by John David Moore A thesis submitted to the University of Huddersfield in partial fulfilment of the requirements for the degree

More information

Aalborg Universitet. Binaural Technique Hammershøi, Dorte; Møller, Henrik. Published in: Communication Acoustics. Publication date: 2005

Aalborg Universitet. Binaural Technique Hammershøi, Dorte; Møller, Henrik. Published in: Communication Acoustics. Publication date: 2005 Aalborg Universitet Binaural Technique Hammershøi, Dorte; Møller, Henrik Published in: Communication Acoustics Publication date: 25 Link to publication from Aalborg University Citation for published version

More information

Sound localization Sound localization in audio-based games for visually impaired children

Sound localization Sound localization in audio-based games for visually impaired children Sound localization Sound localization in audio-based games for visually impaired children R. Duba B.W. Kootte Delft University of Technology SOUND LOCALIZATION SOUND LOCALIZATION IN AUDIO-BASED GAMES

More information

Study on method of estimating direct arrival using monaural modulation sp. Author(s)Ando, Masaru; Morikawa, Daisuke; Uno

Study on method of estimating direct arrival using monaural modulation sp. Author(s)Ando, Masaru; Morikawa, Daisuke; Uno JAIST Reposi https://dspace.j Title Study on method of estimating direct arrival using monaural modulation sp Author(s)Ando, Masaru; Morikawa, Daisuke; Uno Citation Journal of Signal Processing, 18(4):

More information

Master MVA Analyse des signaux Audiofréquences Audio Signal Analysis, Indexing and Transformation

Master MVA Analyse des signaux Audiofréquences Audio Signal Analysis, Indexing and Transformation Master MVA Analyse des signaux Audiofréquences Audio Signal Analysis, Indexing and Transformation Lecture on 3D sound rendering Gaël RICHARD February 2018 «Licence de droits d'usage" http://formation.enst.fr/licences/pedago_sans.html

More information

Comparison of binaural microphones for externalization of sounds

Comparison of binaural microphones for externalization of sounds Downloaded from orbit.dtu.dk on: Jul 08, 2018 Comparison of binaural microphones for externalization of sounds Cubick, Jens; Sánchez Rodríguez, C.; Song, Wookeun; MacDonald, Ewen Published in: Proceedings

More information

TEAK Sound and Music

TEAK Sound and Music Sound and Music 2 Instructor Preparation Guide Important Terms Wave A wave is a disturbance or vibration that travels through space. The waves move through the air, or another material, until a sensor

More information

Combining Subjective and Objective Assessment of Loudspeaker Distortion Marian Liebig Wolfgang Klippel

Combining Subjective and Objective Assessment of Loudspeaker Distortion Marian Liebig Wolfgang Klippel Combining Subjective and Objective Assessment of Loudspeaker Distortion Marian Liebig (m.liebig@klippel.de) Wolfgang Klippel (wklippel@klippel.de) Abstract To reproduce an artist s performance, the loudspeakers

More information

MULTICHANNEL CONTROL OF SPATIAL EXTENT THROUGH SINUSOIDAL PARTIAL MODULATION (SPM)

MULTICHANNEL CONTROL OF SPATIAL EXTENT THROUGH SINUSOIDAL PARTIAL MODULATION (SPM) MULTICHANNEL CONTROL OF SPATIAL EXTENT THROUGH SINUSOIDAL PARTIAL MODULATION (SPM) Andrés Cabrera Media Arts and Technology University of California Santa Barbara, USA andres@mat.ucsb.edu Gary Kendall

More information

From Binaural Technology to Virtual Reality

From Binaural Technology to Virtual Reality From Binaural Technology to Virtual Reality Jens Blauert, D-Bochum Prominent Prominent Features of of Binaural Binaural Hearing Hearing - Localization Formation of positions of the auditory events (azimuth,

More information

Robotic Spatial Sound Localization and Its 3-D Sound Human Interface

Robotic Spatial Sound Localization and Its 3-D Sound Human Interface Robotic Spatial Sound Localization and Its 3-D Sound Human Interface Jie Huang, Katsunori Kume, Akira Saji, Masahiro Nishihashi, Teppei Watanabe and William L. Martens The University of Aizu Aizu-Wakamatsu,

More information

A triangulation method for determining the perceptual center of the head for auditory stimuli

A triangulation method for determining the perceptual center of the head for auditory stimuli A triangulation method for determining the perceptual center of the head for auditory stimuli PACS REFERENCE: 43.66.Qp Brungart, Douglas 1 ; Neelon, Michael 2 ; Kordik, Alexander 3 ; Simpson, Brian 4 1

More information

Psycho-acoustics (Sound characteristics, Masking, and Loudness)

Psycho-acoustics (Sound characteristics, Masking, and Loudness) Psycho-acoustics (Sound characteristics, Masking, and Loudness) Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University Mar. 20, 2008 Pure tones Mathematics of the pure

More information

Sound Processing Technologies for Realistic Sensations in Teleworking

Sound Processing Technologies for Realistic Sensations in Teleworking Sound Processing Technologies for Realistic Sensations in Teleworking Takashi Yazu Makoto Morito In an office environment we usually acquire a large amount of information without any particular effort

More information

Perception of room size and the ability of self localization in a virtual environment. Loudspeaker experiment

Perception of room size and the ability of self localization in a virtual environment. Loudspeaker experiment Perception of room size and the ability of self localization in a virtual environment. Loudspeaker experiment Marko Horvat University of Zagreb Faculty of Electrical Engineering and Computing, Zagreb,

More information

Introducing Twirling720 VR Audio Recorder

Introducing Twirling720 VR Audio Recorder Introducing Twirling720 VR Audio Recorder The Twirling720 VR Audio Recording system works with ambisonics, a multichannel audio recording technique that lets you capture 360 of sound at one single point.

More information