Monaural and binaural processing of fluctuating sounds in the auditory system


Eric R. Thompson

September 23, 2005

MSc Thesis
Acoustic Technology, Ørsted DTU
Technical University of Denmark

Supervisor: Torsten Dau

Abstract

Two models of the effective signal processing in the human auditory system have been developed recently. One model, from Dau et al. (1997), can predict human listeners' performance with static and dynamic monaural signals. The other, from Breebaart et al. (2001a), is a binaural model based on the same peripheral processing as the Dau model, but was developed mainly for static signals. Because of their success and common roots, these models were selected as the basis for developing a consolidated model that can effectively process static and dynamic, monaural and binaural signals in reverberant environments. Such a model could eventually serve as the basis for real-time speech recognition and auditory object identification systems. In this project, the first steps toward this goal were made with measurements of the ability of the auditory system to detect monaural, diotic and dichotic envelope fluctuations. The measurements were performed using artificial probe signals in a series of headphone experiments with pure-tone and 3, 30 and 300 Hz wide noise carriers centered at 5 kHz, aimed at measuring thresholds for the detection and discrimination of amplitude modulation. The amplitude modulation was presented monaurally, diotically and interaurally in antiphase, with diotic and interaurally uncorrelated carriers. The pure-tone results showed a very different shape than the binaural temporal modulation transfer function (TMTF) reported in the literature. This suggests a new paradigm for characterizing the envelope processing capabilities of the binaural system. The intrinsic fluctuations of both diotic and uncorrelated carriers were seen to create masking in the binaural domain, similar in some respects to that reported in previous studies of the monaural domain.
Preliminary tests indicated that, without further development for dynamic signals, the Breebaart model would not be able to predict the relative thresholds measured with narrowband carriers. Further simulations were recommended to test the dynamic capabilities of the model in order to find its strengths and weaknesses. Once the model can successfully predict thresholds similar to those measured, it will be well equipped to process both monaural and binaural, static and dynamic signals in anechoic conditions. The next steps will be to gather data and to continue development of the model in more complex, reverberant environments.

Contents

1 Introduction
2 Background
   2.1 Signal envelopes
      2.1.1 The envelope of speech
      2.1.2 Intrinsic envelope fluctuations
      2.1.3 Imposed envelopes and amplitude modulation
   2.2 Amplitude modulation and the monaural system
      2.2.1 Detection of amplitude modulation
      2.2.2 Detection of AM with narrowband noise carriers
   2.3 Model of monaural AM detection
   2.4 Binaural processing
      2.4.1 Interaural time and level differences
      2.4.2 Binaural masking level difference
      2.4.3 Dynamic interaural parameters
   2.5 Binaural models
      2.5.1 Jeffress model
      2.5.2 Equalization-Cancellation model
      2.5.3 Breebaart model
3 Experimental Methods
   3.1 Procedure
   3.2 Test Subjects
   3.3 Apparatus and sessions
   3.4 Stimuli
4 Results and Discussion
5 Toward a consolidated model
6 Overall summary and conclusions

List of Figures

1 Envelope and fine structure of a 3 Hz wide signal
2 Spectrogram of a speech sample, "Frequency selectivity"
3 Modulation spectrum for two minutes of spoken discourse from a single speaker
4 Beats created by adding a 40 Hz and a 44 Hz sine wave
5 Intrinsic envelope fluctuations for a 5 Hz wide noise
6 Theoretical power spectrum and envelope power spectrum of Gaussian bandpass noise
7 Typical envelope imposed on a tone by a music synthesizer
8 Sinusoidally amplitude modulated (SAM) pure-tone example
9 Approximate regions of amplitude modulation perception
10 Inner hair cell receptor potential response to pure-tone stimuli
11 Temporal modulation transfer function with pure-tone carriers
12 TMTF measured with pure-tone and narrowband noise carriers
13 Block diagram of the modulation processing model from Dau et al. (1997)
14 Transfer functions of the modulation filters used for processing monaural signals in the monaural modulation detection model
15 Monaural modulation phase discrimination probability as a function of modulation frequency
16 Theoretical interaural time difference (ITD) vs. azimuth
17 Measured interaural level differences (ILD) vs. azimuth
18 Illustration of binaural masking level differences
19 Monaural and binaural TMTFs and masked TMTFs from Grantham and Bacon (1991)
20 Instantaneous ILD vs. envelope amplitude for homophasic and antiphasic modulation with a narrowband carrier
21 Basic concept of the Jeffress model
22 Block diagram of the Equalization-Cancellation (EC) model
23 Block diagram of the Breebaart model
24 Array of EI-elements from the binaural model
25 Detail of the EI-elements from the binaural model
26 Amplitude modulation detection and discrimination thresholds with 5 kHz pure-tone carriers
27 Amplitude modulation detection and discrimination thresholds with 3 Hz wide carriers, f_c = 5 kHz
28 Amplitude modulation detection and discrimination thresholds with 30 Hz wide carriers, f_c = 5 kHz

29 Amplitude modulation detection and discrimination thresholds with 300 Hz wide carriers, f_c = 5 kHz
30 Interaural AM detection and discrimination thresholds with pure-tone and 3 Hz wide noise carriers
31 Binaural modulation masking caused by intrinsic fluctuations of diotic narrowband carriers
32 Monaural modulation masking caused by intrinsic fluctuations of narrowband carriers
33 Binaural modulation masking caused by intrinsic interaural fluctuations from uncorrelated narrowband carriers
34 Model predicted thresholds for discriminating interaurally antiphasic amplitude modulation
35 Average EI activity vs. time for low-frequency uncorrelated noise carriers
36 Average EI activity vs. time for correlated narrowband and pure-tone carriers
37 Possible implementation of the consolidated model

List of Acronyms

AC    Alternating Current
ADSR  Attack, Decay, Sustain, Release
AM    Amplitude Modulation
BMLD  Binaural Masking Level Difference
DC    Direct Current
EC    Equalization-Cancellation
EEG   Electroencephalography
EI    Excitation-Inhibition
fMRI  Functional Magnetic Resonance Imaging
FFT   Fast Fourier Transform
IFC   Interval, Forced Choice
ILD   Interaural Level Difference
IPD   Interaural Phase Difference
ITD   Interaural Time Difference (or Delay)
SAM   Sinusoidal Amplitude Modulation
SNR   Signal-to-Noise Ratio
TMTF  Temporal Modulation Transfer Function

1 Introduction

The world around us is constantly changing, and the acoustic environment is no exception. All real sounds change in some manner if given enough time. Even steady-state sounds will eventually be turned off, or the sound-producing mechanism will degrade so that the spectral characteristics of the sound change over time. There may also be inherent properties of the sound that cause continuous change in level or frequency. These changes can be described by, for example, their speed, magnitude, or periodicity. Since our ears move through this constantly changing soundscape, it is of great interest to know how audible those changes are, as part of a deeper understanding of the human auditory system.

One of the most important tasks of the auditory system is the processing of speech, often in the presence of background noise. Speech signals are characterized by variations in level, with many pauses between words, and by variations in frequency, with low-frequency vowels and high-frequency fricative consonants, as well as the small fluctuations in pitch that make speech more colorful. A normal-hearing human listener can make sense of these fluctuations, even in complex, reverberant acoustic environments, and understand the speech. Hearing-impaired listeners and computer-based speech recognition systems often have difficulties understanding speech under such conditions.

In many ways, the auditory system is still a black box into which sounds flow through two input channels (ears) and from which come many sensations and perceptions. The exact components of the system and their functions cannot be directly viewed or measured because of ethical and other restrictions.
There are non-invasive methods of measuring the response of the brain to audio signals, such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), which can reveal general locations of activity and the far-field electrical response, respectively, but these are limited in spatial resolution and cannot pinpoint precisely which cell responds to a signal in what way. Therefore, these tools should be used in conjunction with well-designed psychoacoustic experiments in order to reverse-engineer the auditory system and understand how it is supposed to work. With this knowledge, it should be possible to identify more causes and symptoms of damage and defects in the auditory system, and to design better systems for repairing lost hearing ability or for more effectively utilizing the residual hearing ability in an impaired system. In addition, knowledge of which changes in a sound are audible can help improve the sound quality of, e.g., loudspeakers or audio processing algorithms by pointing out which artifacts are audible and which are not, as has already been done in the

development of the MPEG Audio Layer-3 (MP3) audio compression algorithm (Fraunhofer IIS, 1998). As wireless and battery technologies improve, hearing aid manufacturers are looking at designing binaural hearing aids and need to know which interaural fluctuations are most important for the formation and localization of sound sources and for speech intelligibility in real acoustic environments. A computer model that can simulate these aspects of a human listener's abilities would be a powerful tool in the development cycle, saving time and expense by reducing the need for extensive listening tests.

In this project, many experiments were performed to gather data to better understand how the auditory system processes changes in the overall sound pressure level of a sound when those changes are only in one ear (monaural), the same in both ears (diotic, homophasic), or opposite in the two ears (dichotic, antiphasic). To this end, artificial signals were generated using pure-tone and narrowband high-frequency carriers with an imposed sinusoidal amplitude modulation. The response of a listener to these signals should provide insight into the temporal resolution and frequency selectivity of the auditory subsystems that process these amplitude fluctuations. This insight could be used to enhance existing models and to work toward a model that can parse a complex acoustic signal into auditory objects, much like a human listener might do.

Two existing models were selected as a starting point to develop a new model that can simulate the envelope processing abilities of a human listener in some tasks in anechoic environments. The first is a monaural model from Dau et al. (1997) that includes envelope processing capabilities. The second, from Breebaart et al. (2001a), simulates binaural processing but was developed mainly for processing static signals.
The goal of this project is to combine the dynamic signal processing of the Dau model with the binaural processing of the Breebaart model to develop a comprehensive model that can simulate human performance in detection tasks with static and dynamic, monaural and binaural signals.

2 Background

2.1 Signal envelopes

Any sound signal can be described in terms of its fine structure and envelope, where the fine structure contains the rapid fluctuations in the sound pressure level and the envelope is a smooth, slowly varying curve tangent to the peaks of the fine structure. Figure 1 shows an example of a signal with the fine

Figure 1: Envelope (heavy line) and fine structure (fine line) of a 1 s long sample of a 3 Hz wide noise signal centered at 40 Hz. The envelope was extracted using the Hilbert transform.

structure drawn with a fine line and its envelope drawn with a heavy line. The fluctuations seen in the envelope of this signal are actually the result of interference between frequency components of the signal. The envelope of the signal shown in Figure 1 was calculated using the Hilbert transform. This transform convolves a real signal x(t) with the filter 1/(πt) to calculate the imaginary part of the analytic signal x_a(t), as defined in Equation 1 (from Proakis and Manolakis, 1995), where * denotes convolution.

x_a(t) = x(t) + j [1/(πt)] * x(t)    (1)

The envelope of the signal x(t) is then equal to the magnitude of the analytic signal x_a(t), and is therefore always positive and real-valued.

2.1.1 The envelope of speech

As described in the introduction, speech signals have large variations in level over time. There are pauses between words and a rise and fall during syllables. Figure 2 shows a frequency analysis vs. time for a sample of speech from Rosen and Fourcin (1986). In this plot, the shade indicates the amount of energy at that frequency at that instant in time. Darker shades indicate

Figure 2: Top panel: wideband spectrogram of a speech sample, "Frequency selectivity", shown in the bottom panel (from Rosen and Fourcin, 1986). In the spectrogram, the shade indicates the energy level for that frequency at that moment in time. Darker shades represent higher energy levels; white indicates low energy or pauses.

Figure 3: Modulation spectrum for two minutes of spoken discourse from a single speaker (from Greenberg et al., 1996). A peak is seen at about 4-5 Hz, which corresponds to a typical syllable duration of about 200-250 ms.
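The envelope extraction of Equation 1 is straightforward to reproduce numerically. The sketch below (an illustration in Python, not code from the thesis) uses `scipy.signal.hilbert`, which returns the analytic signal directly; a modulated 40 Hz tone stands in for the fluctuating signal of Figure 1:

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000                      # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)  # 1 s of signal

# A 40 Hz tone with a slow 4 Hz amplitude fluctuation.
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 40 * t)

# hilbert() returns the analytic signal x_a(t) = x(t) + j*H{x(t)} of
# Equation 1; the envelope is its magnitude.
envelope = np.abs(hilbert(x))

# The envelope is real and non-negative, and here it recovers the
# imposed fluctuation 1 + 0.5*sin(2*pi*4*t).
assert np.all(envelope >= 0)
```

Because the magnitude of the analytic signal discards the carrier phase, the same code works for noise carriers such as the one in Figure 1.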

Figure 4: Beats created by adding a 40 Hz and a 44 Hz sine wave. The time-domain signal (fine line) is shown in the left panel with its Hilbert envelope (heavy line). The top-right panel shows the magnitude of the same signal in the frequency domain. The lower-right panel shows the magnitude of the FFT of the envelope of the signal.

more energy and white indicates little or no energy (i.e., a pause). The fricative phonemes (e.g., /s/ and /f/) have a lot of high-frequency energy, and the vowels (e.g., /i/, /@/ and /E/) have more low-frequency energy. An analysis of the speech envelope can be done on the total signal or by frequency band, as is shown in Figure 3 for a different speech sample from Greenberg et al. (1996). The modulation spectrum for the 1-2 kHz frequency band is shown for two minutes of running speech from a single speaker. A maximum can be seen at about 4-5 Hz, which corresponds to a typical syllable duration of about 200-250 ms.

2.1.2 Intrinsic envelope fluctuations

When two pure-tone signals with similar frequencies are added together, the result is a signal whose envelope varies in amplitude at a frequency equal to the difference between the frequencies of the pure tones. This effect, often referred to as beats, is shown in the left panel of Figure 4. This signal was created by adding a 40 Hz and a 44 Hz sine wave, as can be seen from the

Figure 5: Intrinsic envelope fluctuations for a 5 Hz wide noise. The time-domain signal (fine line) is shown in the left panel with its Hilbert envelope (heavy line). The top-right panel shows the magnitude of the same signal in the frequency domain. The lower-right panel shows the magnitude of the FFT of the envelope of the signal.

frequency-domain representation of the signal in the top-right panel of the figure, calculated using the fast Fourier transform (FFT). The envelope was extracted from the signal using the Hilbert transform and can be seen to have the shape of a sinusoid with the frequency of the beats, 4 Hz (i.e., 44 − 40 Hz). Since an envelope is always positive-valued, the result here is the absolute value of the sinusoid. Taking the absolute value creates a non-linearity at the zero crossings that produces harmonics, which can be seen along with the fundamental beat frequency in the envelope (or modulation) frequency domain, calculated as an FFT of the envelope and shown in the lower-right panel of Figure 4.

If many sequential frequency components are added together with random phase, the result is a narrowband noise, as shown in Figure 5. This signal was created by bandpass filtering a Gaussian noise between 38 and 43 Hz. Each component of the signal creates interference with each other component, and each of these beats contributes envelope energy at the frequency equal to the difference between the components' frequencies. In the example shown in Figure 5, there are six frequency components with a 1 Hz spacing.
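The bookkeeping of beating component pairs can be checked directly. This short sketch (a hypothetical illustration, not thesis code) counts the difference frequencies among six components spaced 1 Hz apart, as in the 38-43 Hz noise of Figure 5:

```python
import itertools
from collections import Counter

# Six frequency components with 1 Hz spacing, as in the noise of Figure 5.
components = [38, 39, 40, 41, 42, 43]  # Hz

# Every pair of components beats at its difference frequency, contributing
# envelope energy there.
beats = Counter(abs(f2 - f1) for f1, f2 in itertools.combinations(components, 2))

print(dict(sorted(beats.items())))  # {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
```

No pair differs by more than the 5 Hz bandwidth, which is why the intrinsic envelope energy is confined below the bandwidth of the noise.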

Figure 6: Theoretical power spectrum (left panel) and envelope power spectrum (right panel) of Gaussian bandpass noise (from Dau et al., 1999); for details see also Lawson and Uhlenbeck (1950).

This means that there are five pairs of components with 1 Hz spacing, four pairs with 2 Hz spacing, and so on, up to one pair with 5 Hz spacing. There are no component pairs with a frequency difference greater than the bandwidth of the noise, in this case 5 Hz, so, in theory, there should not be any intrinsic envelope energy at frequencies above the bandwidth (Lawson and Uhlenbeck, 1950).

The theoretical representation of the intrinsic fluctuations in narrowband noises is shown in Figure 6. Given a narrowband noise with bandwidth Δf and power density ρ, the envelope power density forms a triangle between the origin, πρ/4 along the power-density axis, and the bandwidth of the noise along the frequency axis, plus a DC component reflecting the non-zero mean of the envelope. The shape of the envelope power density function will be used later in discussions of amplitude modulation detection with narrowband noise carriers.

2.1.3 Imposed envelopes and amplitude modulation

A signal can also have an envelope imposed on it. In the simplest case, switching a sound source on and off again imposes a rectangular envelope on the sound. In music synthesizers, the envelope of a sound played when a key is pressed is often described in terms of its Attack time, Decay time, Sustain level and Release time (ADSR) (Kientzle, 1998), where the attack time describes how quickly the sound reaches its maximum

Figure 7: Typical envelope imposed on a tone by a music synthesizer (from Kientzle, 1998). The control parameters are the attack and decay times, the sustain level and the release time.

Figure 8: Example of a sinusoidally amplitude modulated (SAM) pure tone (f_c = 40 Hz; f_m = 4 Hz). The time-domain signal is shown in the left panel with its Hilbert envelope. The top-right panel shows the magnitude of the same signal in the frequency domain. The lower-right panel shows the magnitude of the FFT of the envelope of the signal.

amplitude when the key is depressed, the decay time is how long it takes the sound to settle into its long-term (sustain) level after reaching the maximum, and the release time is how quickly the volume returns to zero after the key is released (see Figure 7). One question that an accurate model of the auditory system could answer is how well a human listener can actually perceive the initial peak in the envelope, so that the developers of a synthesizer might shape the envelope to enhance features of the sound based on their audibility.

Another common way of imposing an envelope on a signal is by adding amplitude modulation (AM). With AM, the amplitude of a carrier c(t) is changed in time in proportion to the magnitude of a modulator m(t), as given by Equation 2.

[1 + m(t)] c(t)    (2)

The carrier can be any signal, for example a pure tone, and the modulator can also be any signal with a lower frequency than the carrier. If a sinusoidal amplitude modulator (SAM) with amplitude, or modulation depth, m and frequency f_m is used with a pure-tone carrier of frequency f_c, as in Equation 3 and the left panel of Figure 8, it can be shown using trigonometric identities that there are three resultant frequency components, at f_c, f_c − f_m and f_c + f_m. These three components can also be seen in the frequency spectrum of the 40 Hz sinusoid with 4 Hz SAM shown in the top-right panel of Figure 8.

[1 + m sin(2πf_m t)] sin(2πf_c t) = sin(2πf_c t) + (m/2) cos[2π(f_c − f_m)t] − (m/2) cos[2π(f_c + f_m)t]    (3)

These sidebands can be used as detection cues for the presence of amplitude modulation in listening tests. The power of a periodic signal can be calculated from the sum of the squares of its frequency components. If a carrier has power P, then when a sinusoidal amplitude modulation is imposed on it, its power becomes P(1 + m^2/2). This can be shown easily for the example signal from Equation 3.
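The three components of Equation 3 can be verified numerically. In this sketch (with assumed parameters f_c = 40 Hz, f_m = 4 Hz and m = 0.5, matching the example of Figure 8), a 1 s window makes the FFT bins fall on integer frequencies:

```python
import numpy as np

fs, dur = 1000, 1.0
t = np.arange(0, dur, 1 / fs)
fc, fm, m = 40.0, 4.0, 0.5

# SAM tone as in Equation 3.
s = (1 + m * np.sin(2 * np.pi * fm * t)) * np.sin(2 * np.pi * fc * t)

# One-sided amplitude spectrum; bin k corresponds to k Hz here.
spec = np.abs(np.fft.rfft(s)) / (len(s) / 2)

# Carrier at f_c with amplitude 1, sidebands at f_c +/- f_m with amplitude m/2.
print(spec[40], spec[36], spec[44])
```

With `m = 0.5` the sidebands have one quarter of the carrier amplitude, and no other bin carries energy, exactly as Equation 3 predicts.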
Given a carrier signal x(t) = sin(2πf_c t) with power P and an imposed sinusoidal amplitude modulation with modulation depth m, the power of the resultant signal, P_AM, will be:

P_AM = P + (m/2)^2 P + (m/2)^2 P = P(1 + m^2/2)    (4)

This change in signal intensity could also be used as a cue for detection, so amplitude-modulated signals should be scaled appropriately in order for the dynamic AM cue to be the most salient cue.
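Equation 4 can likewise be checked by computing the mean-square power directly; the same assumed example parameters are redefined here so the sketch is self-contained:

```python
import numpy as np

fs, dur = 1000, 1.0
t = np.arange(0, dur, 1 / fs)
fc, fm, m = 40.0, 4.0, 0.5

carrier = np.sin(2 * np.pi * fc * t)
sam = (1 + m * np.sin(2 * np.pi * fm * t)) * carrier

P = np.mean(carrier ** 2)   # carrier power (0.5 for a unit-amplitude sinusoid)
P_am = np.mean(sam ** 2)    # power of the modulated signal

print(P_am / P)             # -> 1 + m**2 / 2, as in Equation 4

# To remove the overall intensity cue, the modulated signal can be scaled
# by 1/sqrt(1 + m**2/2) so that its power matches the carrier's.
sam_eq = sam / np.sqrt(1 + m ** 2 / 2)
```

The last line shows the kind of level normalization the text calls for, so that only the dynamic AM cue remains salient.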

Figure 9: Approximate regions of amplitude modulation perception for carrier and modulation frequencies (adapted from Joris et al., 2004). Below about 20 Hz, fluctuations in level are perceived (hatched region). This percept makes a smooth transition to a percept of roughness above about 15 Hz (region between solid lines). Sidebands may be resolved for modulation frequencies above the upper solid line.

2.2 Amplitude modulation and the monaural system

The cochlea is the transducer in the inner ear that converts mechanical vibrations into neural impulses. It contains the basilar membrane, which vibrates with incident sound and serves as a frequency-to-place filter, where low frequencies activate the apical end of the membrane and high frequencies activate the basal end. It is often modeled as a sort of filterbank, dividing the audible frequency range into bands, often called critical bands (Zwicker, 1961). These filters are assumed not to be fixed to specific center frequencies, but rather can be moved to optimally improve signal-to-noise ratios (SNR). Among other effects, the auditory filters influence the perception of amplitude-modulated signals. If the sidebands of a modulated signal fall within the same auditory filter as the carrier, then the sound may be perceived as a single auditory object. At very low modulation frequencies, the fluctuations of the envelope can be followed and are perceived as changes in loudness. As the modulation frequency increases, the loudness of the sound is not perceived to change, but a

Figure 10: Inner hair cell receptor potential response to pure-tone stimuli at the frequencies indicated in Hz to the right of each trace (adapted from Palmer and Russell, 1986). The upper scale bar corresponds to the traces from 300 to 900 Hz, and the lower scale bar to those from 1-5 kHz. Data measured in a guinea pig.

quality of roughness is perceived instead. These regions of perceived loudness fluctuations and roughness are depicted in Figure 9. If the modulation frequency is further increased, so that the sidebands can be resolved in adjacent auditory filters, then the sidebands will be perceived as independent tones.

The inner hair cells ride on the basilar membrane, and their inner electrical potential varies in response to the membrane motion. The inner potential of the hair cells is the driver for action potentials in neurons of the auditory nerve. For low frequencies, the inner hair cell potential follows, or is phase-locked to, the fine structure of the driving frequency (see Figure 10). However, as the frequency increases, the DC component of the inner hair cell response increases and the AC component decreases, until for very high frequencies the receptor potential is essentially just following the envelope of the stimulus. Therefore, high-frequency signals, in which no fine-structure information is coded, are often used in tests of envelope processing in the auditory system.
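The loss of phase locking can be mimicked with a toy hair-cell stage: half-wave rectification followed by a lowpass filter. The 1 kHz cut-off below matches the hair-cell lowpass of the model described in Section 2.3; the first-order Butterworth realization and the sampling rate are assumptions of this sketch, not a fit to the guinea-pig data:

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 32000  # Hz (assumed sampling rate)

def haircell_stage(x, fs, f_cut=1000.0):
    """Toy inner-hair-cell stage: half-wave rectification followed by a
    first-order Butterworth lowpass at f_cut (filter order is an
    assumption of this sketch)."""
    b, a = butter(1, f_cut / (fs / 2))
    return lfilter(b, a, np.maximum(x, 0.0))

t = np.arange(0, 0.2, 1 / fs)
low = haircell_stage(np.sin(2 * np.pi * 100 * t), fs)    # fine structure survives
high = haircell_stage(np.sin(2 * np.pi * 5000 * t), fs)  # mostly DC remains

# AC-to-DC ratio in the steady state: large at 100 Hz, small at 5 kHz.
def ac_dc(y):
    y = y[len(y) // 2:]
    return np.std(y) / np.mean(y)

print(ac_dc(low), ac_dc(high))
```

The output reproduces the trend of Figure 10: the 100 Hz response keeps a strong AC component, while the 5 kHz response is dominated by its DC (envelope-following) component.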

Figure 11: Temporal modulation transfer functions with pure-tone carriers (from Kohlrausch et al., 2000). The data were measured for carriers at 1, 3, 5, 8 and 10 kHz. The TMTF curves show a lowpass shape for carriers above 1 kHz, with a sharp increase in sensitivity when the modulation sidebands can be resolved.

2.2.1 Detection of amplitude modulation

Wideband noise carriers are also frequently used when measuring modulation detection thresholds, or the temporal modulation transfer function (TMTF), in order to eliminate the possible spectral cues resulting from the sidebands. In Viemeister (1979), TMTFs were measured using wideband Gaussian noise carriers. The results showed great sensitivity at low modulation frequencies and a decrease in sensitivity of about 3 dB/oct. above about 64 Hz. This led to a model of envelope processing that simply used a lowpass filter with a cut-off frequency of 64 Hz. In Kohlrausch et al. (2000), TMTFs were measured using sinusoidal carriers with frequencies at and above 1 kHz (see Figure 11). The shapes of the curves were similar to those found by Viemeister for carrier frequencies over 1 kHz, although the cut-off frequencies were higher, typically around 130 Hz. In addition, Kohlrausch and colleagues measured thresholds in regions where the sidebands could be resolved and found dramatic increases in sensitivity with the presence of these spectral cues. The frequency at which this increase occurs rises with increasing carrier frequency, which shows that the bandwidth of the auditory filters also increases with increasing center frequency. With the 1 kHz carrier, the sidebands were resolved below the 130 Hz cut-off frequency, so the lowpass shape was not seen.
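Viemeister's lowpass description can be sketched as a single filter acting on the envelope. Only the 64 Hz cut-off comes from the text; the first-order realization is an assumption of this sketch:

```python
import numpy as np

F_CUT = 64.0  # Hz, envelope lowpass cut-off from Viemeister (1979)

def envelope_gain_db(f_m):
    """Attenuation (dB) that a first-order lowpass at F_CUT applies to
    sinusoidal modulation at frequency f_m (filter order is assumed)."""
    return 20.0 * np.log10(1.0 / np.sqrt(1.0 + (f_m / F_CUT) ** 2))

for f_m in [4, 16, 64, 128, 256]:
    print(f"{f_m:4d} Hz: {envelope_gain_db(f_m):6.1f} dB")
```

In such a model, the more the imposed modulation is attenuated, the larger the modulation depth needed at threshold, which reproduces the measured high-frequency roll-off; as discussed next, however, a lowpass alone cannot account for the narrowband-carrier results.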

Figure 12: TMTF measured with pure-tone and narrowband noise carriers with bandwidths of 3, 31 and 314 Hz, centered at 5 kHz (adapted from Moore, 2003; data from Dau et al., 1997). Also shown are the theoretical envelope spectra of the 3, 31 and 314 Hz wide noise bands.

2.2.2 Detection of AM with narrowband noise carriers

The shape of the TMTF looks quite different when narrowband noises are used as carriers instead of broadband noise or pure tones, as can be seen in Figure 12 (see also Fleischer, 1982; Dau et al., 1997). The TMTFs for narrowband noises with bandwidths of 3, 31 and 314 Hz are shown together with the TMTF measured with a pure-tone carrier (data from Dau et al., 1997). The theoretical envelope spectra of the carriers, as discussed above (see also Figure 6), are also plotted, only now with logarithmic axes. From these results, it can be seen that the intrinsic fluctuations of the carrier mask SAM detection at frequencies that are in or close to the range of the carrier fluctuations. The fact that detection immediately outside the range of intrinsic fluctuations is not exactly the same as with the sinusoidal carrier indicates that the frequency resolution of envelope processing is limited. A lowpass filter model would still predict the same TMTF shape as measured with broadband noise; therefore, a model with a modulation filterbank was proposed in Dau et al. (1997). Joris et al. (2004) also reported that there are neurons that respond in proportion to the degree of modulation and show a bandpass characteristic with respect to modulation frequency. This adds a physiological case in support of the modulation filterbank.
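The layout of that filterbank (detailed in the next section: constant 5 Hz bandwidth below 10 Hz, constant Q = 2 from 10 Hz to 1000 Hz) can be sketched as below. The spacing rule above 10 Hz, with adjacent filters meeting at their band edges so that center frequencies grow by a factor (2Q+1)/(2Q−1) = 5/3, is an assumption of this sketch:

```python
Q = 2.0
ratio = (2 * Q + 1) / (2 * Q - 1)   # 5/3 per step (assumed spacing rule)

centers = [0.0, 5.0, 10.0]          # 2.5 Hz lowpass plus two 5 Hz wide filters
bandwidths = [2.5, 5.0, 5.0]

f = 10.0
while f * ratio <= 1000.0:
    f *= ratio
    centers.append(round(f, 1))
    bandwidths.append(round(f / Q, 1))

for fc_mod, bw in zip(centers, bandwidths):
    print(f"center {fc_mod:6.1f} Hz, bandwidth {bw:5.1f} Hz")
```

The listing makes the key property explicit: above 10 Hz the filter bandwidth grows in proportion to the center frequency, so a flat carrier envelope spectrum feeds progressively more intrinsic noise into higher modulation filters.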

Figure 13: Block diagram of the modulation processing model from Dau et al. (1997). The model includes basilar membrane filtering in the form of a gammatone filterbank and inner hair cell processing with half-wave rectification and a lowpass filter. Adaptation loops add a compressive non-linearity. Then a modulation filterbank is applied to each peripheral channel, internal noise is added to limit the resolution of the system, and an optimal detector generates the decision.

2.3 Model of monaural AM detection

The model proposed in Dau et al. (1997) was based largely on the peripheral processing from Dau et al. (1996). Figure 13 shows a block diagram of the model. The first stage simulates the bandpass characteristics of the basilar membrane with the gammatone filterbank from Patterson et al. (1988). The output of each gammatone filter is then half-wave rectified, to simulate the property of the inner hair cells of firing only when displaced in one direction, and lowpass filtered with a cut-off frequency of 1 kHz, to simulate the loss of phase locking at higher frequencies. A series of five adaptation loops with different time constants is applied next to provide a compressive non-linearity while maintaining sensitivity to fast temporal variations (see Dau et al., 1996, for details). This adaptation step

Figure 14: Transfer functions of the modulation filters used for processing monaural signals in the monaural modulation detection model from Dau et al. (1997). Below 10 Hz, the filters have a constant bandwidth of 5 Hz, starting with a lowpass filter with a cut-off frequency of 2.5 Hz. Above 10 Hz, the filters have a constant Q-value of 2.

also enables the simulation of forward masking results. Then a modulation filterbank is applied to each channel. Finally, an internal noise is added to limit the resolution of the system and an optimal detector generates the decision.

The modulation filterbank is shown in more detail in Figure 14. Below 10 Hz, the filters have a constant bandwidth of 5 Hz, starting with a lowpass filter with a cut-off frequency of 2.5 Hz. From 10 Hz to 1000 Hz, the filters have a constant Q-value of 2. Listening tests indicated that human listeners could detect an inversion of modulation phase with better than chance probability for frequencies below approximately 10 Hz (see Figure 15) (Dau, 1996). Therefore, the modulation phase is preserved below 10 Hz, and only the magnitude of the envelope is maintained for the higher modulation frequencies.

With the modulation filterbank, this model is able to predict human listeners' modulation detection thresholds with narrowband noise carriers. The energy of the intrinsic fluctuations of the 3 Hz wide carrier (see Figure 12) falls mostly within the first filter, greatly increasing the modulation depth required to detect modulation below 2.5 Hz. The subsequent filters also pass some of the modulation energy from this carrier, but the amount of energy decreases, and the threshold slowly decreases toward the pure-tone carrier threshold with increasing filter center frequency. With the 30 Hz wide carrier, the envelope energy appears fairly flat on a logarithmic scale until

Figure 15: Monaural modulation phase discrimination probability as a function of modulation frequency, for a 5 kHz pure-tone carrier with full modulation (m = 1). Data from three test subjects (dashed and dotted lines) and chance probability (solid line) (from Dau, 1996).

almost 20 Hz. The modulation detection threshold is also very flat in the same range and then slowly rolls off. The intrinsic modulation energy of the 300 Hz wide carrier is very flat over the whole frequency range of interest. Since the bandwidth of the modulation filters increases with increasing filter center frequency, the total amount of energy passed through each filter also increases with filter center frequency when the envelope frequency spectrum is flat. Assuming that a constant signal-to-noise ratio is required at the output of a filter for detection, this model also predicts a lowpass shape similar to Viemeister's model for a wideband noise carrier: given a flat carrier envelope spectrum and the logarithmic scaling of the filters, each doubling of filter center frequency results in a doubling of the carrier envelope power (noise) passed by the filter, and a subsequent 3 dB/oct. increase in the threshold required for signal modulation detection.

2.4 Binaural processing

Up to this point, all discussion of the auditory system has dealt with processes requiring only one input, i.e., monaural processing. The complete auditory system is, however, a binaural system, meaning that there are two ear input channels. The main functions of the binaural system are to identify

Figure 16: Theoretical interaural time difference (ITD) as a function of azimuth. From Moore (2003), adapted from Feddersen et al. (1957). Calculated from the difference in distance traveled from the sound source to each of the ears.

and localize sound sources and to improve the signal-to-noise ratio (SNR) by selectively tuning in on one particular sound source. Again, the exact mechanisms are unknown, but the physics of the acoustic waves is fairly well understood, and there are several models that can simulate some aspects of binaural hearing.

2.4.1 Interaural time and level differences

The most basic physics behind sound localization can be explained with a simple model of a spherical head with two receivers (ears) at opposite ends of an axis, placed in an anechoic sound field. When a sound is incident from directly in front of this head, the signals at the two ears should be identical. However, if the sound source moves to the right, as seen from the head, then the sound has to travel a greater distance to reach the left ear than the right ear. Since sound travels at a finite rate, this difference in distance creates an interaural time difference (ITD), which can be derived based on assumptions about the shape and size of the head. Figure 16 shows the theoretical ITD as a function of the angle from directly ahead, or azimuth. The figure shows the increase in ITD with azimuth until 90° and the subsequent decrease until 180°. It also shows how there can be front-to-back confusions with cues based solely on ITD, since angles symmetric about 90° azimuth produce the same ITD. The ITD can be converted to an interaural phase difference (IPD)

Figure 17: Measured interaural level differences (ILD) as a function of azimuth for pure tones from 200 to 6000 Hz. These curves show almost no measurable ILD for low frequencies and up to almost 20 dB ILD for high frequencies. Average of five subjects. From Moore (2003), adapted from Feddersen et al. (1957).

for sinusoidal tones or for frequency components of a complex sound. This cue is only unambiguous for phase differences between ±π/2. For a low-frequency sinusoid at 250 Hz, an ITD of 0.4 ms is equivalent to an IPD of π/5. However, for a 5 kHz sinusoid, the same ITD of 0.4 ms is equivalent to an IPD of 4π (two full cycles), and it is impossible for the auditory system to know whether the two signals are in phase or n cycles out of phase.

In addition to the ITD, there is another basic physical difference between the sounds that reach the two ears from an angle other than straight ahead. The physical presence of the head between the source and the ear aimed away from it attenuates the level of the sound, creating the so-called head shadow effect and an interaural level difference (ILD). The measured ILDs for various frequencies are shown in Figure 17 (from Feddersen et al., 1957). In these curves, it is obvious that the head shadow effect is strongest at high frequencies, with almost 20 dB ILD for large azimuth angles, and creates almost no difference at low frequencies.
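The ITD-to-IPD conversion and its high-frequency ambiguity can be sketched in a few lines; the following is only an illustration of the arithmetic above (the function names are mine, not part of any model discussed here):

```python
import math

def ipd_from_itd(itd_s, freq_hz):
    """Interaural phase difference (radians) implied by an ITD at one frequency."""
    return 2.0 * math.pi * freq_hz * itd_s

def observable_ipd(ipd):
    """Wrap a phase difference into (-pi, pi]: whole cycles cannot be observed."""
    return math.atan2(math.sin(ipd), math.cos(ipd))

# The two cases from the text, both with a 0.4 ms ITD:
ipd_250 = ipd_from_itd(0.0004, 250.0)   # pi/5, well inside the unambiguous range
ipd_5k = ipd_from_itd(0.0004, 5000.0)   # 4*pi, i.e. two full cycles
# After wrapping, the 5 kHz tone pair looks perfectly in phase, so the
# fine-structure phase cue is useless at this frequency.
```

With a dense grid of frequencies, the same two functions reproduce the general rule that an ITD only yields an unambiguous phase cue while the implied IPD stays small relative to a cycle.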

A human listener is extremely sensitive to ITDs, with difference limens as small as 10 µs (Klumpp and Eady, 1956) for low-frequency tones and noise. In that study, it was possible to measure ITD thresholds for pure tones with frequencies up to 1300 Hz, but the listeners were unable to detect ITDs for any higher-frequency pure tones. This can be explained by the loss of fine structure coding for high-frequency sounds (see Figure 10). If only the envelope of the signal is preserved in the peripheral processing, then any timing differences in the fine structure would not be seen at higher processing levels in the brain. It is interesting to note that the listeners could detect ITDs with narrowband noises that consisted exclusively of frequencies above the range of pure-tone ITD detectability (in one case, Hz). This indicates that the listeners were able to compare the timing of the intrinsic fluctuations of the envelope in the detection task. In the high-frequency range, Mills (1960) measured static ILD detection thresholds of about db using pure tones.

When listening through headphones, a diotic sound is typically perceived to be located in the center of the head on a line connecting the two ears. If an ITD or ILD is then artificially imposed on the signal, the signal is perceived to move toward either the leading ear or the ear with the greater sound pressure level, respectively. Together, the ITD and ILD resulting from the size and shape of the head and the location of the ears provide cues that can be used by a human listener or a computer to localize a sound source. Additional cues are provided to the human listener by reflections and resonances from the outer ear, or pinna, which are not yet fully understood and are beyond the scope of this project.

2.4.2 Binaural masking level difference

In addition to providing information about the location of a sound source, the binaural system can actually improve our ability to hear a signal in background noise.
This system is used, for example, in the cocktail party scenario, where a listener's task is to understand a conversation in the presence of multiple other sound sources (e.g., a stereo system and other conversations) that are usually incident from random angles. In the laboratory, this effect has been demonstrated using white noise maskers (N) and pure-tone signals (S), where the task is to detect the tone in the noise (see illustration in Figure 18). If the noise and signal are presented monaurally (NmSm; subscript m for monaural), then a certain detection threshold is measured. If the same noise is added to the opposite ear (N0Sm; subscript 0 for zero interaural phase shift), it becomes easier to detect the signal again. If the signal is then added to the second ear (N0S0), it is again more difficult to detect the

Figure 18: Illustration of binaural masking level differences (BMLD). The N and S descriptors are for the noise and signal, respectively. The subscripts indicate the method of presentation: m for monaural, 0 for homophasic and π for antiphasic. Smiling faces indicate better detectability relative to the monaural case (from Moore, 2003).

signal. However, if the phase of the signal is inverted (N0Sπ; subscript π for the interaural phase shift), the signal is again easily detectable. It is assumed that the binaural system somehow suppresses the information that is the same in the two ears (in this case N0), leaving the information that is different (Sπ). Since the binaural system reduces the effective masking of the noise, the effect is called the binaural masking level difference (BMLD). Numerous models have been developed that can simulate this behavior in simple environments.

2.4.3 Dynamic interaural parameters

As with the monaural system, it is relevant to measure the temporal acuity of the binaural system using dynamic signals. This can be done by varying the interaural parameters in time at a given rate of change. Grantham and colleagues measured the ability of human listeners to detect sinusoidally varying interaural time and intensity differences (Grantham and Wightman, 1978; Grantham, 1984). They concluded that the binaural system is sluggish in that only modulation rates in the range of 1-5 Hz are perceived to have motion. Higher interaural modulation rates can still be discriminated from diotic noise, even above 100 Hz, but instead of having a perceived motion, the sounds have an increased perceived width.

In Grantham and Bacon (1991), the TMTF and modulation masking curves in the binaural domain were measured. They used wideband Gaussian noise carriers, and the task was to detect an applied sinusoidal signal modulation with modulation frequencies from Hz in monaural (Sm) and binaural (Sπ) presentations.[1] Their monaural results were very similar to those reported in Viemeister (1979), and the binaural TMTF showed no significant difference from the monaural case except at 512 Hz, where the binaural case had an advantage of approximately 5 dB over the monaural case, as seen in the mean data (see left panel of Figure 19; note inverted ordinate).

Figure 19: Left panel: Monaural and binaural TMTFs from Grantham and Bacon (1991). Right panel: Monaural and binaural masked TMTFs measured in the presence of a second noise with a diotic 16 Hz sinusoidal amplitude modulation. Monaural thresholds (Sm) and binaural thresholds (Sπ) are shown with separate symbols. Note inverted ordinate.

When measuring modulation masking, a wideband Gaussian noise with a 16 Hz amplitude modulation was added to the signals from the previous experiment, and the task was to detect the applied signal modulation as a function of signal modulation frequency. The presence of the masking 16 Hz modulation made it more difficult to detect the signal modulation, particularly for signal modulation frequencies close to the masker frequency, thereby creating the dip in the masked TMTF curves. In these masked conditions, binaural detection had an advantage at the lowest frequencies and at 512 Hz, where the thresholds were very similar to those measured in the unmasked condition, and around the masker frequency (16 Hz), where the threshold was about the same as that measured at 512 Hz, as seen in the mean data.

[1] Note that the signals in the dynamic interaural parameter experiments are the modulated parameters (ITD or ILD) themselves, which can be presented monaurally (Sm), interaurally in phase (S0) or interaurally in antiphase (Sπ). This differs from the BMLD experiments, where the signals were pure tones added to a noise masker, as shown in Figure 18.

Figure 20: Instantaneous ILD vs. normalized envelope amplitude (max. of antiphasic signal amplitude = 1) for homophasic (red) and antiphasic (blue) modulation with a diotic 3 Hz wide carrier. fm = 4 Hz and m = 10 dB. The projections of these two signals in the ILD-time and amplitude-time planes are also shown, drawn with fine lines.

The cues for detection of homophasic (or monaural) and antiphasic amplitude modulation are very different. The homophasic modulations cause changes in perceived loudness or roughness, depending on the frequency of modulation. On the other hand, the antiphasic modulations cause a perceived motion or an increase in perceived source width without a change in loudness, again depending on the frequency of modulation. Figure 20 shows the instantaneous ILD and envelope amplitude for a diotic 3 Hz wide noise carrier with an imposed homophasic (red) and antiphasic (blue) 4 Hz amplitude modulation with m = 10 dB. For a low-frequency amplitude modulation, a change in ILD with time can be perceived as a change in localization, and a change in level can be perceived as a change in loudness. These changes can be seen more clearly in the projections in the ILD-time and amplitude-time planes, drawn with fine lines in the figure. The plot in Figure 20 shows that the homophasic modulation causes a

change in the level of the sound without a change in ILD, so this sound would be localized in the center of the head, while the antiphasic modulation causes a change in lateralization with little change in perceived level. Since the carrier was diotic and only 3 Hz wide, it had intrinsic fluctuations in level but not in ILD. These changes in level make it much more difficult to detect diotic amplitude modulation (see section 2.2.2), but have no effect on ILD. The level of the carrier does fluctuate in time, so a moment of maximum ILD resulting from the imposed antiphasic modulation may coincide with a low carrier level, making it harder to hear; thus, the shape of the diotic carrier could influence the detection of antiphasic modulation. However, it should be much more effective to use an antiphasic amplitude modulation masker or uncorrelated narrowband carriers, which have intrinsic fluctuations in ILD, in measurements of masking of modulation detection in the binaural system.

2.5 Binaural models

The task of the binaural models is to simulate human performance in the binaural experiments described above. In this section, some of the historical models that have provided a foundation for today's models and one state-of-the-art model are described.

2.5.1 Jeffress model

Almost all current models of binaural hearing are in some way based on the Jeffress coincidence detector model (Jeffress, 1948). The basic concept is that there are arrays of neurons that receive inputs from both ears via delay lines that serve to delay one signal relative to the other. Each neuron has a characteristic interaural delay and serves as a coincidence detector, so that the neuron has a maximum output if its two inputs arrive simultaneously. A conceptual drawing of the Jeffress model is shown in Figure 21.
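The coincidence-detector idea can be sketched as a discrete correlator: each candidate internal delay plays the role of one delay-line "neuron", and the neuron with the largest output indicates the ITD. This is only a toy illustration of the concept under my own choice of signal and parameters, not the physiological model:

```python
import math

def jeffress_itd(left, right, fs, max_lag):
    """Pick the internal delay whose 'neuron' responds most strongly, i.e.,
    the lag (in samples) that maximizes the correlation of the ear signals."""
    best_lag, best_score = 0, -float("inf")
    for lag in range(-max_lag, max_lag + 1):
        # coincidence count for the neuron with internal delay `lag`
        score = sum(left[i] * right[i + lag]
                    for i in range(max_lag, len(left) - max_lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs

fs = 48000
true_itd = 10 / fs  # impose a delay of 10 samples (~0.21 ms) on the right ear
t = [i / fs for i in range(2400)]
left = [math.sin(2.0 * math.pi * 500.0 * x) for x in t]
right = [math.sin(2.0 * math.pi * 500.0 * (x - true_itd)) for x in t]
estimated_itd = jeffress_itd(left, right, fs, max_lag=20)
```

With a dense set of internal delays, this is exactly a cross-correlation of the two ear signals, which is why the Jeffress array is so often described in those terms.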
This model can also be thought of as performing a sort of cross-correlation of the left and right signals.

2.5.2 Equalization-Cancellation model

A new model was proposed in Durlach (1963) that should account for ITDs, ILDs and the BMLD effect. This model is called the equalization-cancellation (EC) model (see block diagram in Figure 22). The equalization stage applies an optimal interaural delay and interaural gain so that the two signals become as similar as possible. However, there is jitter added to the delay and noise added

Figure 21: Basic concept of the Jeffress (1948) model showing different-length delay lines with Δt spacing between neurons that create an array for detecting coincidence for a range of ITDs.

Figure 22: Block diagram of the Equalization-Cancellation (EC) model (Durlach, 1963). The signals from the two ears are first equalized in time and level (E mechanism), with internal noise and jitter to limit resolution, and then subtracted (C mechanism). The decision device makes use of the input channel with the best signal-to-noise ratio (SNR) to make its decision.

to the gain in order to limit the resolution of the system. The cancellation stage then simply takes the difference of the two signals, removing what is diotic after equalization and leaving the dichotic part of the signal. If an N0Sπ signal from the BMLD experiments is fed into the EC model, the optimal imposed ITD and ILD in the equalization stage would both be zero. In the cancellation stage, the diotic noise (N0) would cancel out, leaving the antiphasic signal and thereby making the signal detection easier. Finally, a detection mechanism determines which of the two monaural channels and the EC channel has the best SNR and generates the decision.

2.5.3 Breebaart model

One of the latest models of binaural processing was proposed in Breebaart et al. (2001a) (see block diagram in Figure 23). This model makes use of the peripheral processing from the Dau et al. (1996) monaural model (see description in section 2.3). After the adaptation loops, the two monaural channels are fed through an array of excitation-inhibition (EI) elements. Each of the EI elements has a characteristic ITD (Δτ) and ILD (Δα), as shown in Figure 24. The EI name stems from a type of neuron that receives excitatory input from the ipsilateral side and inhibitory input from the contralateral side, thereby effectively canceling out identical inputs. The output from each EI element is passed through a sliding integrator, implemented as a double-sided exponential with time constants of 30 ms (see upper-right panel of Figure 25), which smooths the output signals and accounts for binaural sluggishness. A logarithmic compression is then applied to each channel (see lower-right panel of Figure 25). This compression function is essentially linear for low input levels and compressive for higher input levels. As with the Dau models, an internal noise is added to limit the resolution of the system, and an optimal detector is used over all monaural and binaural channels to generate the decision.
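The E and C steps can be illustrated with a deliberately simplified N0Sπ example. All levels, the random seed and the internal-noise magnitude below are arbitrary choices of mine; the real EC model optimizes the equalization parameters and uses calibrated internal noise and jitter:

```python
import math
import random

random.seed(1)
fs, n = 16000, 16000
# N0Spi stimulus: a diotic noise masker plus a faint 500 Hz tone
# presented in antiphase across the two ears
tone = [0.05 * math.sin(2.0 * math.pi * 500.0 * i / fs) for i in range(n)]
masker = [random.gauss(0.0, 1.0) for _ in range(n)]
internal_l = [random.gauss(0.0, 0.01) for _ in range(n)]  # internal noise
internal_r = [random.gauss(0.0, 0.01) for _ in range(n)]

left = [m + s + e for m, s, e in zip(masker, tone, internal_l)]
right = [m - s + e for m, s, e in zip(masker, tone, internal_r)]

# E step: for N0Spi the optimal interaural delay and gain are zero/unity,
# so equalization changes nothing in this configuration.
# C step: subtract the ears; the diotic masker cancels, the antiphasic
# tone adds up, and only the internal noise remains as masker.
ec_channel = [l - r for l, r in zip(left, right)]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

snr_monaural_db = 20.0 * math.log10(rms(tone) / rms(masker))
doubled_tone = [2.0 * s for s in tone]
residual = [a - b for a, b in zip(internal_l, internal_r)]
snr_ec_db = 20.0 * math.log10(rms(doubled_tone) / rms(residual))
bmld_db = snr_ec_db - snr_monaural_db  # large: the masker has been cancelled
```

Feeding an N0S0 configuration through the same C step instead cancels the tone along with the masker, in line with the ranking of conditions in the BMLD experiments above.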
This model was quite successful in simulating a human listener's performance in a range of binaural listening tests based on spectral and temporal cues (see Breebaart et al., 2001b,c, for details). Among the tests of the model, Breebaart and colleagues tried to simulate the dynamic ITD and ILD experiments from Grantham and Wightman (1978) and Grantham (1984). The model was unable to simulate human listeners' performance in the dynamic ITD tests, but had fairly good results with the dynamic ILD experiments. This will be described in more detail in the Modeling section. Since the model was very successful with static binaural signals and had some success with dynamic interaural parameters, and because it was based on the same peripheral processing as the monaural modulation processing


More information

PERFORMANCE COMPARISON BETWEEN STEREAUSIS AND INCOHERENT WIDEBAND MUSIC FOR LOCALIZATION OF GROUND VEHICLES ABSTRACT

PERFORMANCE COMPARISON BETWEEN STEREAUSIS AND INCOHERENT WIDEBAND MUSIC FOR LOCALIZATION OF GROUND VEHICLES ABSTRACT Approved for public release; distribution is unlimited. PERFORMANCE COMPARISON BETWEEN STEREAUSIS AND INCOHERENT WIDEBAND MUSIC FOR LOCALIZATION OF GROUND VEHICLES September 1999 Tien Pham U.S. Army Research

More information

Laboratory Assignment 5 Amplitude Modulation

Laboratory Assignment 5 Amplitude Modulation Laboratory Assignment 5 Amplitude Modulation PURPOSE In this assignment, you will explore the use of digital computers for the analysis, design, synthesis, and simulation of an amplitude modulation (AM)

More information

Technical University of Denmark

Technical University of Denmark Technical University of Denmark Masking 1 st semester project Ørsted DTU Acoustic Technology fall 2007 Group 6 Troels Schmidt Lindgreen 073081 Kristoffer Ahrens Dickow 071324 Reynir Hilmisson 060162 Instructor

More information

Study on method of estimating direct arrival using monaural modulation sp. Author(s)Ando, Masaru; Morikawa, Daisuke; Uno

Study on method of estimating direct arrival using monaural modulation sp. Author(s)Ando, Masaru; Morikawa, Daisuke; Uno JAIST Reposi https://dspace.j Title Study on method of estimating direct arrival using monaural modulation sp Author(s)Ando, Masaru; Morikawa, Daisuke; Uno Citation Journal of Signal Processing, 18(4):

More information

Psychoacoustic Cues in Room Size Perception

Psychoacoustic Cues in Room Size Perception Audio Engineering Society Convention Paper Presented at the 116th Convention 2004 May 8 11 Berlin, Germany 6084 This convention paper has been reproduced from the author s advance manuscript, without editing,

More information

Signals, Sound, and Sensation

Signals, Sound, and Sensation Signals, Sound, and Sensation William M. Hartmann Department of Physics and Astronomy Michigan State University East Lansing, Michigan Л1Р Contents Preface xv Chapter 1: Pure Tones 1 Mathematics of the

More information

THE PERCEPTION OF ALL-PASS COMPONENTS IN TRANSFER FUNCTIONS

THE PERCEPTION OF ALL-PASS COMPONENTS IN TRANSFER FUNCTIONS PACS Reference: 43.66.Pn THE PERCEPTION OF ALL-PASS COMPONENTS IN TRANSFER FUNCTIONS Pauli Minnaar; Jan Plogsties; Søren Krarup Olesen; Flemming Christensen; Henrik Møller Department of Acoustics Aalborg

More information

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals 16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract

More information

Computational Perception /785

Computational Perception /785 Computational Perception 15-485/785 Assignment 1 Sound Localization due: Thursday, Jan. 31 Introduction This assignment focuses on sound localization. You will develop Matlab programs that synthesize sounds

More information

A binaural auditory model and applications to spatial sound evaluation

A binaural auditory model and applications to spatial sound evaluation A binaural auditory model and applications to spatial sound evaluation Ma r k o Ta k a n e n 1, Ga ë ta n Lo r h o 2, a n d Mat t i Ka r ja l a i n e n 1 1 Helsinki University of Technology, Dept. of Signal

More information

Signals and Systems Lecture 9 Communication Systems Frequency-Division Multiplexing and Frequency Modulation (FM)

Signals and Systems Lecture 9 Communication Systems Frequency-Division Multiplexing and Frequency Modulation (FM) Signals and Systems Lecture 9 Communication Systems Frequency-Division Multiplexing and Frequency Modulation (FM) April 11, 2008 Today s Topics 1. Frequency-division multiplexing 2. Frequency modulation

More information

Preface A detailed knowledge of the processes involved in hearing is an essential prerequisite for numerous medical and technical applications, such a

Preface A detailed knowledge of the processes involved in hearing is an essential prerequisite for numerous medical and technical applications, such a Modeling auditory processing of amplitude modulation Torsten Dau Preface A detailed knowledge of the processes involved in hearing is an essential prerequisite for numerous medical and technical applications,

More information

What is Sound? Part II

What is Sound? Part II What is Sound? Part II Timbre & Noise 1 Prayouandi (2010) - OneOhtrix Point Never PSYCHOACOUSTICS ACOUSTICS LOUDNESS AMPLITUDE PITCH FREQUENCY QUALITY TIMBRE 2 Timbre / Quality everything that is not frequency

More information

Binaural Mechanisms that Emphasize Consistent Interaural Timing Information over Frequency

Binaural Mechanisms that Emphasize Consistent Interaural Timing Information over Frequency Binaural Mechanisms that Emphasize Consistent Interaural Timing Information over Frequency Richard M. Stern 1 and Constantine Trahiotis 2 1 Department of Electrical and Computer Engineering and Biomedical

More information

Imagine the cochlea unrolled

Imagine the cochlea unrolled 2 2 1 1 1 1 1 Cochlea & Auditory Nerve: obligatory stages of auditory processing Think of the auditory periphery as a processor of signals 2 2 1 1 1 1 1 Imagine the cochlea unrolled Basilar membrane motion

More information

Phase and Feedback in the Nonlinear Brain. Malcolm Slaney (IBM and Stanford) Hiroko Shiraiwa-Terasawa (Stanford) Regaip Sen (Stanford)

Phase and Feedback in the Nonlinear Brain. Malcolm Slaney (IBM and Stanford) Hiroko Shiraiwa-Terasawa (Stanford) Regaip Sen (Stanford) Phase and Feedback in the Nonlinear Brain Malcolm Slaney (IBM and Stanford) Hiroko Shiraiwa-Terasawa (Stanford) Regaip Sen (Stanford) Auditory processing pre-cosyne workshop March 23, 2004 Simplistic Models

More information

AUDL Final exam page 1/7 Please answer all of the following questions.

AUDL Final exam page 1/7 Please answer all of the following questions. AUDL 11 28 Final exam page 1/7 Please answer all of the following questions. 1) Consider 8 harmonics of a sawtooth wave which has a fundamental period of 1 ms and a fundamental component with a level of

More information

Since the advent of the sine wave oscillator

Since the advent of the sine wave oscillator Advanced Distortion Analysis Methods Discover modern test equipment that has the memory and post-processing capability to analyze complex signals and ascertain real-world performance. By Dan Foley European

More information

Modeling auditory processing of amplitude modulation II. Spectral and temporal integration Dau, T.; Kollmeier, B.; Kohlrausch, A.G.

Modeling auditory processing of amplitude modulation II. Spectral and temporal integration Dau, T.; Kollmeier, B.; Kohlrausch, A.G. Modeling auditory processing of amplitude modulation II. Spectral and temporal integration Dau, T.; Kollmeier, B.; Kohlrausch, A.G. Published in: Journal of the Acoustical Society of America DOI: 10.1121/1.420345

More information

BIOLOGICALLY INSPIRED BINAURAL ANALOGUE SIGNAL PROCESSING

BIOLOGICALLY INSPIRED BINAURAL ANALOGUE SIGNAL PROCESSING Brain Inspired Cognitive Systems August 29 September 1, 2004 University of Stirling, Scotland, UK BIOLOGICALLY INSPIRED BINAURAL ANALOGUE SIGNAL PROCESSING Natasha Chia and Steve Collins University of

More information

Exploiting envelope fluctuations to achieve robust extraction and intelligent integration of binaural cues

Exploiting envelope fluctuations to achieve robust extraction and intelligent integration of binaural cues The Technology of Binaural Listening & Understanding: Paper ICA216-445 Exploiting envelope fluctuations to achieve robust extraction and intelligent integration of binaural cues G. Christopher Stecker

More information

Auditory Based Feature Vectors for Speech Recognition Systems

Auditory Based Feature Vectors for Speech Recognition Systems Auditory Based Feature Vectors for Speech Recognition Systems Dr. Waleed H. Abdulla Electrical & Computer Engineering Department The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz] 1 Outlines

More information

Measurement of the binaural auditory filter using a detection task

Measurement of the binaural auditory filter using a detection task Measurement of the binaural auditory filter using a detection task Andrew J. Kolarik and John F. Culling School of Psychology, Cardiff University, Tower Building, Park Place, Cardiff CF1 3AT, United Kingdom

More information

Outline. Communications Engineering 1

Outline. Communications Engineering 1 Outline Introduction Signal, random variable, random process and spectra Analog modulation Analog to digital conversion Digital transmission through baseband channels Signal space representation Optimal

More information

Results of Egan and Hake using a single sinusoidal masker [reprinted with permission from J. Acoust. Soc. Am. 22, 622 (1950)].

Results of Egan and Hake using a single sinusoidal masker [reprinted with permission from J. Acoust. Soc. Am. 22, 622 (1950)]. XVI. SIGNAL DETECTION BY HUMAN OBSERVERS Prof. J. A. Swets Prof. D. M. Green Linda E. Branneman P. D. Donahue Susan T. Sewall A. MASKING WITH TWO CONTINUOUS TONES One of the earliest studies in the modern

More information

Distortion products and the perceived pitch of harmonic complex tones

Distortion products and the perceived pitch of harmonic complex tones Distortion products and the perceived pitch of harmonic complex tones D. Pressnitzer and R.D. Patterson Centre for the Neural Basis of Hearing, Dept. of Physiology, Downing street, Cambridge CB2 3EG, U.K.

More information

Introduction to cochlear implants Philipos C. Loizou Figure Captions

Introduction to cochlear implants Philipos C. Loizou Figure Captions http://www.utdallas.edu/~loizou/cimplants/tutorial/ Introduction to cochlear implants Philipos C. Loizou Figure Captions Figure 1. The top panel shows the time waveform of a 30-msec segment of the vowel

More information

Musical Acoustics, C. Bertulani. Musical Acoustics. Lecture 14 Timbre / Tone quality II

Musical Acoustics, C. Bertulani. Musical Acoustics. Lecture 14 Timbre / Tone quality II 1 Musical Acoustics Lecture 14 Timbre / Tone quality II Odd vs Even Harmonics and Symmetry Sines are Anti-symmetric about mid-point If you mirror around the middle you get the same shape but upside down

More information

AUDITORY ILLUSIONS & LAB REPORT FORM

AUDITORY ILLUSIONS & LAB REPORT FORM 01/02 Illusions - 1 AUDITORY ILLUSIONS & LAB REPORT FORM NAME: DATE: PARTNER(S): The objective of this experiment is: To understand concepts such as beats, localization, masking, and musical effects. APPARATUS:

More information

Audio Engineering Society. Convention Paper. Presented at the 124th Convention 2008 May Amsterdam, The Netherlands

Audio Engineering Society. Convention Paper. Presented at the 124th Convention 2008 May Amsterdam, The Netherlands Audio Engineering Society Convention Paper Presented at the 124th Convention 2008 May 17 20 Amsterdam, The Netherlands The papers at this Convention have been selected on the basis of a submitted abstract

More information

MUS 302 ENGINEERING SECTION

MUS 302 ENGINEERING SECTION MUS 302 ENGINEERING SECTION Wiley Ross: Recording Studio Coordinator Email =>ross@email.arizona.edu Twitter=> https://twitter.com/ssor Web page => http://www.arts.arizona.edu/studio Youtube Channel=>http://www.youtube.com/user/wileyross

More information

Shift of ITD tuning is observed with different methods of prediction.

Shift of ITD tuning is observed with different methods of prediction. Supplementary Figure 1 Shift of ITD tuning is observed with different methods of prediction. (a) ritdfs and preditdfs corresponding to a positive and negative binaural beat (resp. ipsi/contra stimulus

More information

A102 Signals and Systems for Hearing and Speech: Final exam answers

A102 Signals and Systems for Hearing and Speech: Final exam answers A12 Signals and Systems for Hearing and Speech: Final exam answers 1) Take two sinusoids of 4 khz, both with a phase of. One has a peak level of.8 Pa while the other has a peak level of. Pa. Draw the spectrum

More information

Neural Processing of Amplitude-Modulated Sounds: Joris, Schreiner and Rees, Physiol. Rev. 2004

Neural Processing of Amplitude-Modulated Sounds: Joris, Schreiner and Rees, Physiol. Rev. 2004 Neural Processing of Amplitude-Modulated Sounds: Joris, Schreiner and Rees, Physiol. Rev. 2004 Richard Turner (turner@gatsby.ucl.ac.uk) Gatsby Computational Neuroscience Unit, 02/03/2006 As neuroscientists

More information

SGN Audio and Speech Processing

SGN Audio and Speech Processing Introduction 1 Course goals Introduction 2 SGN 14006 Audio and Speech Processing Lectures, Fall 2014 Anssi Klapuri Tampere University of Technology! Learn basics of audio signal processing Basic operations

More information

Acoustics, signals & systems for audiology. Week 4. Signals through Systems

Acoustics, signals & systems for audiology. Week 4. Signals through Systems Acoustics, signals & systems for audiology Week 4 Signals through Systems Crucial ideas Any signal can be constructed as a sum of sine waves In a linear time-invariant (LTI) system, the response to a sinusoid

More information

Instruction Manual for Concept Simulators. Signals and Systems. M. J. Roberts

Instruction Manual for Concept Simulators. Signals and Systems. M. J. Roberts Instruction Manual for Concept Simulators that accompany the book Signals and Systems by M. J. Roberts March 2004 - All Rights Reserved Table of Contents I. Loading and Running the Simulators II. Continuous-Time

More information

Different Approaches of Spectral Subtraction Method for Speech Enhancement

Different Approaches of Spectral Subtraction Method for Speech Enhancement ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches

More information

Using the Gammachirp Filter for Auditory Analysis of Speech

Using the Gammachirp Filter for Auditory Analysis of Speech Using the Gammachirp Filter for Auditory Analysis of Speech 18.327: Wavelets and Filterbanks Alex Park malex@sls.lcs.mit.edu May 14, 2003 Abstract Modern automatic speech recognition (ASR) systems typically

More information

Estimating critical bandwidths of temporal sensitivity to low-frequency amplitude modulation

Estimating critical bandwidths of temporal sensitivity to low-frequency amplitude modulation Estimating critical bandwidths of temporal sensitivity to low-frequency amplitude modulation Allison I. Shim a) and Bruce G. Berg Department of Cognitive Sciences, University of California, Irvine, Irvine,

More information

Lateralisation of multiple sound sources by the auditory system

Lateralisation of multiple sound sources by the auditory system Modeling of Binaural Discrimination of multiple Sound Sources: A Contribution to the Development of a Cocktail-Party-Processor 4 H.SLATKY (Lehrstuhl für allgemeine Elektrotechnik und Akustik, Ruhr-Universität

More information

Final Exam Study Guide: Introduction to Computer Music Course Staff April 24, 2015

Final Exam Study Guide: Introduction to Computer Music Course Staff April 24, 2015 Final Exam Study Guide: 15-322 Introduction to Computer Music Course Staff April 24, 2015 This document is intended to help you identify and master the main concepts of 15-322, which is also what we intend

More information

Sound is the human ear s perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time.

Sound is the human ear s perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time. 2. Physical sound 2.1 What is sound? Sound is the human ear s perceived effect of pressure changes in the ambient air. Sound can be modeled as a function of time. Figure 2.1: A 0.56-second audio clip of

More information

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o

More information

Recurrent Timing Neural Networks for Joint F0-Localisation Estimation

Recurrent Timing Neural Networks for Joint F0-Localisation Estimation Recurrent Timing Neural Networks for Joint F0-Localisation Estimation Stuart N. Wrigley and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 211 Portobello Street, Sheffield

More information

Auditory Localization

Auditory Localization Auditory Localization CMPT 468: Sound Localization Tamara Smyth, tamaras@cs.sfu.ca School of Computing Science, Simon Fraser University November 15, 2013 Auditory locatlization is the human perception

More information

Experiments in two-tone interference

Experiments in two-tone interference Experiments in two-tone interference Using zero-based encoding An alternative look at combination tones and the critical band John K. Bates Time/Space Systems Functions of the experimental system: Variable

More information

An introduction to physics of Sound

An introduction to physics of Sound An introduction to physics of Sound Outlines Acoustics and psycho-acoustics Sound? Wave and waves types Cycle Basic parameters of sound wave period Amplitude Wavelength Frequency Outlines Phase Types of

More information

EXPERIMENTAL INVESTIGATION INTO THE OPTIMAL USE OF DITHER

EXPERIMENTAL INVESTIGATION INTO THE OPTIMAL USE OF DITHER EXPERIMENTAL INVESTIGATION INTO THE OPTIMAL USE OF DITHER PACS: 43.60.Cg Preben Kvist 1, Karsten Bo Rasmussen 2, Torben Poulsen 1 1 Acoustic Technology, Ørsted DTU, Technical University of Denmark DK-2800

More information

IMPLEMENTATION AND APPLICATION OF A BINAURAL HEARING MODEL TO THE OBJECTIVE EVALUATION OF SPATIAL IMPRESSION

IMPLEMENTATION AND APPLICATION OF A BINAURAL HEARING MODEL TO THE OBJECTIVE EVALUATION OF SPATIAL IMPRESSION IMPLEMENTATION AND APPLICATION OF A BINAURAL HEARING MODEL TO THE OBJECTIVE EVALUATION OF SPATIAL IMPRESSION RUSSELL MASON Institute of Sound Recording, University of Surrey, Guildford, UK r.mason@surrey.ac.uk

More information

Monaural and Binaural Speech Separation

Monaural and Binaural Speech Separation Monaural and Binaural Speech Separation DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction CASA approach to sound separation Ideal binary mask as

More information

Signals & Systems for Speech & Hearing. Week 6. Practical spectral analysis. Bandpass filters & filterbanks. Try this out on an old friend

Signals & Systems for Speech & Hearing. Week 6. Practical spectral analysis. Bandpass filters & filterbanks. Try this out on an old friend Signals & Systems for Speech & Hearing Week 6 Bandpass filters & filterbanks Practical spectral analysis Most analogue signals of interest are not easily mathematically specified so applying a Fourier

More information

Modeling binaural signal detection

Modeling binaural signal detection Modeling binaural signal detection Breebaart, D.J. DOI: 1.61/IR546322 Published: 1/1/21 Document Version Publisher s PDF, also known as Version of Record (includes final page, issue and volume numbers)

More information

Loudspeaker Distortion Measurement and Perception Part 2: Irregular distortion caused by defects

Loudspeaker Distortion Measurement and Perception Part 2: Irregular distortion caused by defects Loudspeaker Distortion Measurement and Perception Part 2: Irregular distortion caused by defects Wolfgang Klippel, Klippel GmbH, wklippel@klippel.de Robert Werner, Klippel GmbH, r.werner@klippel.de ABSTRACT

More information

Biomedical Signals. Signals and Images in Medicine Dr Nabeel Anwar

Biomedical Signals. Signals and Images in Medicine Dr Nabeel Anwar Biomedical Signals Signals and Images in Medicine Dr Nabeel Anwar Noise Removal: Time Domain Techniques 1. Synchronized Averaging (covered in lecture 1) 2. Moving Average Filters (today s topic) 3. Derivative

More information