Improving Speech Intelligibility in Fluctuating Background Interference


Improving Speech Intelligibility in Fluctuating Background Interference

by Laura A. D'Aquila

S.B., Massachusetts Institute of Technology (2015), Electrical Engineering and Computer Science, Mathematics

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

June 2016

© Massachusetts Institute of Technology. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2016

Certified by: Dr. Charlotte M. Reed, Senior Research Scientist, Research Laboratory of Electronics, May 20, 2016

Certified by: Professor Louis D. Braida, Henry Ellis Warren Professor of Electrical Engineering and Health Sciences and Technology, May 20, 2016

Accepted by: Dr. Christopher J. Terman, Chairman, Master of Engineering Thesis Committee

Improving Speech Intelligibility in Fluctuating Background Interference

by Laura A. D'Aquila

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2016, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science.

ABSTRACT

The masking release (MR; i.e., better speech recognition in fluctuating compared to continuous noise backgrounds) that is evident for normal-hearing (NH) listeners is generally reduced or absent in hearing-impaired (HI) listeners. In this study, a signal-processing technique was developed to improve MR in HI listeners and offer insight into the mechanisms influencing the size of MR. This technique compares short-term and long-term estimates of energy, increases the level of short-term segments whose energy is below the average energy, and normalizes the overall energy of the processed signal to be equivalent to that of the original long-term estimate. In consonant-identification tests, HI listeners achieved similar scores for processed and unprocessed stimuli in quiet and in continuous-noise backgrounds, while superior performance was obtained for the processed speech in some of the fluctuating background noises. Thus, the energy-normalized signals led to larger values of MR compared to that obtained with unprocessed signals.

ACKNOWLEDGMENTS

This research was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Award Number R01 DC. I would like to extend a big thank you to my advisors, Dr. Charlotte M. Reed and Professor Louis D. Braida. From when I first began doing research in their lab as a sophomore to now as I wrap up my M.Eng. thesis, they have always made themselves available and offered much guidance, instruction, and support. Their analysis and ideas for moving forward were crucial to the success of this project. Their kindness made me look forward to coming into lab every day. I am also extremely grateful for the RA funding that they provided me with as I worked on the project.

Additionally, I would like to heartily thank Dr. Joseph G. Desloge, the signal processing mastermind of the project. During the spring of my senior year, his help was critical as I coded the different components of this project. Despite having since taken a new job on the West Coast, he still kindly spoke with me weekly on the phone throughout the year to discuss my project and offer his very valuable insight, ideas, and feedback. I could not have asked for a better group of mentors than Dr. Reed, Professor Braida, and Dr. Desloge. As part of the Sensory Communication Group, the three of them performed much of the previous work that led to this project, and this project would also not have been possible without their continued involvement.

I would lastly like to thank my family, who have provided me with countless opportunities throughout my life without which I would not be where I am today. I am very grateful for the love and confidence that they have had in me throughout it all and for their shaping me into the person I am. It is comforting to know that I can always turn to them no matter what happens.

I. BACKGROUND

Many hearing-impaired (HI) listeners with sensorineural hearing loss who are able to understand speech in quiet environments without much difficulty encounter more problems in noisy situations, such as in a cafeteria or at a social gathering. Indeed, it has been shown that these listeners require a higher speech-to-noise ratio (SNR) to achieve a given level of performance than do normal-hearing (NH) listeners (Festen and Plomp, 1990). This is the case regardless of whether the noise is temporally fluctuating, such as interfering voices in the background, or is steady-state, such as a motor. Festen and Plomp (1990) measured the SNR required for 50%-correct sentence reception in different types of background interference. Whereas HI listeners required a similar SNR regardless of the type of interfering noise, NH listeners performed better (i.e., required a lower SNR) in temporally fluctuating interference than in steady-state interference. Listeners who perform better with fluctuating interference are said to experience a release from masking. This release from masking occurs when listeners are able to perceive audible glimpses of the target speech during dips in the fluctuating noise (Cooke, 2006), and it aids in the ability to converse normally in the noisy social situations mentioned above. One possible explanation of reduced release from masking in HI listeners is based on the effects of reduced audibility in HI listeners, who are less likely to be able to receive the target speech in the noise gaps (Desloge et al., 2010).

Léger et al. (2015) looked at release from masking in greater depth, particularly with respect to consonant recognition with different types of speech processing. The processing allowed for the examination of the roles played by the signal's slowly varying component, known as its envelope (ENV), and rapidly varying component, known as its temporal fine structure (TFS), on release from masking. The consonant

speech stimuli were processed using the Hilbert Transform to convey ENV cues, TFS cues, or ENV cues recovered from TFS speech. Consonant identification was measured in the presence of steady-state and 10-Hz square-wave interrupted speech-shaped noise. The percent-correct scores were used to calculate masking release (MR) in percentage points, defined as the difference in scores in interrupted noise and in continuous noise at a given SNR. The results showed that HI listeners generally experienced MR for TFS and recovered-ENV speech but not for unprocessed or ENV speech. The study concluded that the increase in MR may be related to the way the TFS processing interacts with the interrupted noise signal, rather than to the presence of TFS itself. Under certain circumstances, the removal of amplitude-envelope variation in TFS speech may amplify the higher-SNR glimpses of the speech signal during gaps in a fluctuating noise.

Reed et al. (2016) further investigated the conclusions of Léger et al. (2015) regarding the role of reduced amplitude variation in MR. The study tested an infinite peak-clipped (IPC) speech condition, which used the sign of each sample point of the input signal to convert positive samples to +1, convert negative samples to -1, and leave zero samples unchanged. This processing thus also removed much of the amplitude variation. Speech intelligibility in noise and MR were compared for TFS, IPC, and unprocessed speech for HI listeners. Outcomes for TFS and IPC speech were very similar, leading to the conclusion that the removal of amplitude variation can indeed lead to MR. Because both the TFS and IPC speech contained fine-structure cues, however, it was still possible that TFS was responsible for the observed MR. Another condition was created in which both TFS and amplitude-variation cues were eliminated by passing an ENV signal through the TFS processing stage. Greater MR was observed for this condition than for the original ENV speech, thus lending support to the hypothesis that reduced amplitude variation can lead to improved MR in HI listeners. This MR arose as the less-intense portions of the speech stimulus,

which occurred in the noise gaps, became more audible to HI listeners when the amplitude was normalized to remove variation. These studies proved promising in understanding a potential way to improve MR in HI listeners; however, the improvement in MR was mainly due to a decreased performance in continuous noise rather than an increased performance in fluctuating noise.

To address these issues, Desloge et al. (2016) developed a signal-processing technique designed to achieve similar reductions in signal amplitude variation without suffering a loss in intelligibility in continuous background noise. Using non-real-time processing over the broadband signal, the technique compared short-term and long-term estimates of energy, increased the level of short-term segments whose energy was below the average energy, and normalized the overall energy of the processed signal to be equivalent to that of the original long-term estimate. In consonant-identification tests, HI listeners achieved similar scores for processed and unprocessed stimuli in quiet and in continuous-noise backgrounds, while superior performance was obtained for the processed speech in fluctuating background noises. Thus, the energy-normalized signals led to larger values of MR compared to that obtained with unprocessed signals. The work described in this paper builds upon Desloge et al. (2016) by implementing and evaluating a real-time, multi-band version of the signal-processing algorithm in a broader range of noises.

II. GOALS

This study investigates a novel signal processing technique, called energy equalization (EEQ), for the reduction of amplitude variation, which Reed et al. (2016) had concluded could contribute to MR in HI listeners. EEQ processing normalizes the fluctuating short-term signal energy to be equal to the long-term average signal energy. This technique is thus another way of

removing the rapid amplitude variation that occurs in speech. The goal is for this signal processing to improve the performance of HI listeners in fluctuating background noise without leading to a drop in performance in continuous background noise. This change in performance would thus result in greater MR for HI listeners.

Energy equalization is applicable in the area of hearing aid and cochlear implant processing, and it could potentially also be used to benefit NH listeners and even machine listening systems that use automatic speech recognition. Other potential applications of EEQ processing include cell-phone or teleconferencing systems where an individual is speaking in a noisy environment and speech recognition in interfering backgrounds. Thus, wherever speech reception is needed in noise, energy equalization could be used.

The short-term signal energy for speech varies at a syllabic rate as intervals fluctuate between being more intense (usually during vowels), less intense (usually during consonants), and silent. Meanwhile, the long-term signal energy remains relatively constant and reflects the overall loudness at which a speaker is talking. These overall properties of speech persist even when background noise is added to the signal. The quiet portions of the speech signal are the most troublesome ones for HI listeners and lead to reduced speech comprehension. Energy equalization is a way of combating this difficulty by amplifying the quieter parts of the signal (that may be present during gaps in the background noise) relative to the louder parts of the signal (that occur when background noise is fully present). This technique makes speech content present during the dips in background noise more audible and hence useful for speech comprehension.

III. SIGNAL PROCESSING ALGORITHM

The EEQ processing seeks to reduce short-term amplitude fluctuations of a speech-plus-noise (S+N) stimulus while operating blindly and without introducing excessive distortion. The following is a general description of the steps the EEQ processing performs in real time on an S+N signal x(t):

Form running short-term and long-term moving averages of the signal energy, E_short(t) and E_long(t): E_short(t) = AVG_short[x²(t)] and E_long(t) = AVG_long[x²(t)], where AVG is a moving-average operator that utilizes specified short and long time constants to provide an estimate of the signal's energy. In this implementation, the AVG operators are single-pole infinite impulse response (IIR) low-pass filters applied to the instantaneous signal energy, x²(t), with time constants of 5 ms and 200 ms for the short and long averages, respectively. The magnitude and phase of the square root of the ratio of the frequency response of AVG_long to the frequency response of AVG_short are shown in Figure 1, which is useful in understanding the scale factor computed in the next step of the processing.

Determine the scale factor, SC(t): SC(t) = √(E_long(t) / E_short(t)), where care is taken to avoid dividing by zero during quiet intervals. To prevent over-amplification of the noise floor, SC(t) had an upper limit of 20 dB.

To prevent attenuation of stronger signal components, SC(t) had a lower limit of 0 dB.

Apply the scale factor to the original signal: y(t) = SC(t)x(t).

Form the output z(t) by normalizing y(t) to have the same energy as x(t): z(t) = K(t)y(t), where K(t) is chosen such that AVG_long[z²(t)] = AVG_long[x²(t)].

The processing described above can be applied either to a broadband signal or independently to bandpass-filtered components. The current implementation operated on both the broadband signal (EEQ1) and a signal divided into four contiguous frequency bands (EEQ4). These conditions are described in more detail in Section IV-D. Figure 2 depicts block diagrams of the EEQ1 (Figure 2A) and EEQ4 (Figure 2B) algorithms. The EEQ algorithm that was implemented follows the outline of the steps described above to process x[n], a sampled version of the original signal x(t). The original signal is first multiplied by SC[n], as shown in Figure 2A, and the resulting EEQ signal is then multiplied by K[n] to ensure that the long-term energy of the EEQ signal is equal to the long-term energy of the original signal at every sample point. SC[n] is restricted to lie in the range of 0 dB to 20 dB. Appendix I describes a modification to the computation of the scale factor that could be used without this lower limit in place.
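The single-band processing just described can be sketched in a few lines of Python (using NumPy and SciPy). The sketch below is illustrative only: the function and variable names are not taken from the thesis code, and the one-pole smoothers, the 0-20 dB gain limits, and the final long-term energy normalization simply follow the description above. The multi-band EEQ4 condition applies the same function independently to each bandpass-filtered component before summing across bands.

import numpy as np
from scipy.signal import lfilter

def one_pole_avg(energy, tau, fs):
    # Single-pole IIR low-pass filter applied to the instantaneous energy x^2(t).
    a = np.exp(-1.0 / (tau * fs))
    return lfilter([1.0 - a], [1.0, -a], energy)

def eeq1(x, fs, tau_short=0.005, tau_long=0.200, max_gain_db=20.0):
    # Short-term and long-term running estimates of the signal energy.
    eps = 1e-12  # guards against division by zero during quiet intervals
    e_short = one_pole_avg(x ** 2, tau_short, fs)
    e_long = one_pole_avg(x ** 2, tau_long, fs)
    # Scale factor SC(t), limited to the range 0 dB to 20 dB.
    sc = np.sqrt(e_long / np.maximum(e_short, eps))
    sc = np.clip(sc, 1.0, 10.0 ** (max_gain_db / 20.0))
    y = sc * x
    # Normalize so that the long-term energy of the output equals that of the input.
    k = np.sqrt(e_long / np.maximum(one_pole_avg(y ** 2, tau_long, fs), eps))
    return k * y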

IV. METHODS

The experimental protocol for testing human subjects was approved by the internal review board of the Massachusetts Institute of Technology. All testing was conducted in compliance with regulations and ethical guidelines on experimentation with human subjects. All listeners provided informed consent and were paid for their participation in the experiments.

A. Participants

Six male and three female HI listeners with bilateral, symmetric, mild-to-severe sensorineural hearing loss participated in the experiment. They were all native speakers of American English and ranged in age from 20 to 69 years with an average age of 36.7 years. Six of the listeners were younger (33 years or less) and three were older (58-69 years). Five of the listeners had sloping high-frequency losses (HI-1, HI-2, HI-4, HI-5, and HI-7), three had relatively flat losses (HI-6, HI-8, and HI-9), and one had a cookie-bite loss (HI-3). Seven of the listeners (all but HI-1 and HI-3) were regular users of bilateral hearing aids. The five-frequency (0.25, 0.5, 1, 2, and 4 kHz) audiometric pure-tone average (PTA) ranged from 27 dB HL to 75 dB HL across listeners with an average of 45.3 dB HL. The test ear, age, and five-frequency PTA for each HI listener are listed in Table 1 along with the speech levels and SNRs employed in the experiment. The pure-tone thresholds of the HI listeners in dB SPL are shown in Figure 3. The pure-tone threshold measurements were obtained with Sennheiser HD580 headphones for 500-ms stimuli in a three-alternative forced-choice adaptive procedure which estimates the threshold level required for 70.7%-correct detection (see Léger et al., 2015).

Four NH listeners (defined as having pure-tone thresholds of 15 dB HL or better at the octave frequencies between 250 and 8000 Hz) also participated in the study. They were native speakers of American English, included three males and one female, and ranged in age from 19 to 54 years, with an average age of 30.0 years. A test ear was selected for each listener (2 left ear

and 2 right ear). The mean adaptive thresholds across test ears of the NH listeners are provided in the first panel of Figure 3.

B. Speech Stimuli

The speech materials were Vowel-Consonant-Vowel (VCV) stimuli, with C=/p t k b d g f s ʃ v z dʒ m n r l/ and V=/a/, taken from the corpus of Shannon et al. (1999). The set used for testing consisted of 64 VCV tokens (one utterance of each of the 16 disyllables by two male and two female speakers). The mean VCV duration was 945 ms with a range of 688 to 1339 ms across the 64 VCVs in the test set. The recordings were digitized with 16-bit precision at a sampling rate of 32 kHz and filtered to a bandwidth of Hz for presentation.

C. Interference Conditions

Noises from two broad categories of maskers were added to the speech stimuli prior to processing for presentation. Four background interference conditions were derived from speech-shaped noise but did not come from actual speech samples. Three additional background interference conditions, referred to as vocoded modulated noises, were derived from actual speech samples. The RMS level of each of the noises except for the baseline condition was adjusted to be equal to that of the continuous noise, whose level was set as described in Section IV-F. The maskers used in the study are shown in Figure 4 and are summarized below.

Maskers derived from randomly-generated speech-shaped noise (spectrogram shown in Figure 5) but not coming from actual speech samples. This paper refers to these as non-speech-derived noises:
o Baseline Noise (BAS): Continuous speech-shaped noise at 30 dB SPL.
o Continuous Noise (CON): Additional continuous noise added to BAS.

o Square-Wave Interrupted Noise (SQW): 10-Hz square-wave interrupted noise with a 50% duty cycle, added to BAS.
o Sinusoidal Amplitude Modulation Noise (SAM): 10-Hz sinusoidally amplitude-modulated noise added to BAS.

Maskers derived from actual speech samples (referred to as vocoded modulated noise). These maskers were designed to exhibit fluctuations characteristic of speech without the informational masking component. This paper refers to these as speech-derived noises:
o 1-Speaker Vocoded Modulated Noise (VC-1)
o 2-Speaker Vocoded Modulated Noise (VC-2)
o 4-Speaker Vocoded Modulated Noise (VC-4)

Appendix II describes the steps used to generate the vocoded modulated noises.

D. Speech Conditions

Listeners were presented with S+N signals with three different kinds of processing applied:

Unprocessed Condition (UNP): The S+N signals were presented as described above with no further processing beyond per-subject NAL-RP (Dillon, 2001) amplification.

1-band Energy Equalized Condition (EEQ1): EEQ processing was applied to the broadband S+N signal over the range of Hz. As described in Section III, the EEQ processing compared short-term and long-term estimates of S+N signal energy, increased the level of short-term segments whose energy was below the average signal energy, and normalized the overall energy of the processed signal to be equivalent to that of the original long-term estimate (see Figure 2A).

4-band Energy Equalized Condition (EEQ4): The same technique as in the EEQ1 condition was applied independently to 4 logarithmically-equal bands of the S+N signal in the range of Hz. In doing so, sixth-order (36 dB/octave) bandpass filters divided the input signals into bands with frequency ranges of Hz, Hz, Hz, and Hz, respectively, and the EEQ1 processing was applied independently to each band prior to reconstructing the signal by summing across bands (see Figure 2B).

E. Speech and Noise Signals

Figure 6 shows the waveform of one of the VCV tokens used in the experiment, APA, for UNP speech in BAS noise. The vowel components, which have more energy than the consonant component, constitute the two higher-energy sections of the speech that surround the weaker consonant component in the center. These sections of the speech are annotated at the top of the figure. Figure 7 shows the waveforms of this token in the different speech and noise conditions (Figure 7A for the UNP condition, Figure 7B for the EEQ1 condition, and Figure 7C for the recombined EEQ4 condition). In every type of interference except for BAS, the SNR is set to -4 dB. The left panels show the S+N waveforms, and the right panels show the distribution of the amplitude of the S+N signal in dB. These amplitude distributions were generated by sampling points of the S+N signal and, based on their amplitudes in dB, placing them into bins of width 1 dB in the range of -10 dB to 85 dB. The RMS value of the signal in dB is shown by the blue vertical line, and the median amplitude is shown by the green vertical line.

The gaps of the noise in the plots of the S+N waveforms make evident the reduction in short-term amplitude fluctuations by the EEQ processing. For example, a comparison of the S+N waveforms in SQW between UNP and either EEQ1 or EEQ4 shows that the lower-energy speech

components that are present during the gaps in the fluctuating interference are greater in energy in the EEQ-processed signals. The reduction in amplitude variation is also seen in the amplitude distributions in the right panels. The low-energy tails of the amplitude distributions in the UNP condition are reduced or absent in the EEQ1 and EEQ4 conditions. As a result, the median amplitudes (given by the green vertical lines) in the EEQ1 and EEQ4 conditions are shifted to the right, despite the RMS values (given by the blue vertical lines) remaining constant between UNP and EEQ (as a result of the final normalization step in the EEQ processing that sets the long-term energy of the output equal to the long-term energy of the input at every sample point). These effects are analyzed in more detail in Section VI-A of the paper.

Figure 8 shows the EEQ4 waveforms and amplitude distributions in the CON (Figure 8A) and SQW (Figure 8B) conditions on a band-by-band basis. As in Figure 7, an SNR of -4 dB is used. The boundaries of the four bands, which are logarithmically spaced, are Hz (Band 1), Hz (Band 2), Hz (Band 3), and Hz (Band 4). Band 2 has the largest RMS value, followed by, in decreasing order, Band 3, Band 4, and Band 1. The EEQ1 processing was applied independently in each band.

F. Test Procedure

Experiments were controlled by a desktop PC using MATLAB™ software. The digitized speech-plus-noise stimuli were played through a 24-bit PCI sound card (E-MU 0404 by Creative Professional) and then passed through a programmable attenuator (Tucker-Davis PA4) and a headphone buffer (Tucker-Davis HB6) before being presented monaurally to the listener in a soundproof booth via a pair of headphones (Sennheiser HD580). A monitor, keyboard, and mouse located within the soundproof booth allowed the listener to interact with the control PC.

Consonant identification was tested using a one-interval, 16-alternative, forced-choice procedure without correct-answer feedback. On each trial of a 64-trial run, one of the 64 tokens from the test set was selected randomly without replacement. Depending on the noise condition, a randomly selected noise segment equal in duration to that of the speech token was scaled to achieve the desired SNR and then added to the speech token. The resulting stimulus was either presented unprocessed (for the UNP conditions) or processed according to EEQ1 or EEQ4 before being presented to the listener for identification. The listener's task was to identify the medial consonant of the VCV token that had been presented by selecting a response (using a computer mouse) from a 4×4 visual array of orthographic representations associated with the consonant stimuli. No time limit was imposed on the listeners' responses. Each run typically lasted 3-5 minutes depending on the listener's response times. Chance performance was 6.25%-correct.

Experiment 1. NH listeners were tested using a speech level of 60 dB SPL. The SNR was set to -10 dB (selected to yield roughly 50%-correct performance for UNP speech in CON noise) for all noise conditions (except for BAS). For the HI listeners, linear-gain amplification was applied to the speech-plus-noise stimuli using the NAL-RP formula (Dillon, 2001). Each HI listener selected a comfortable speech level when listening to UNP speech in the BAS condition. For these listeners, the SNR was selected to yield roughly 50%-correct performance for UNP speech in CON noise. The speech levels and SNRs for each HI listener are listed in Table 1. The noise levels in dB are the differences between the speech levels and the SNRs. The three speech conditions were tested in the order of UNP first, followed by EEQ1 and EEQ4 in a random order. The seven noise conditions were tested in order of BAS first, followed by a randomized order of the remaining six noises (CON, SQW, SAM, VC-1, VC-2, and VC-4).

Five 64-trial runs were presented for each of the 21 conditions (3 speech types × 7 noises). The first run was considered as practice and discarded. The final four test runs were used to calculate the percent-correct score in each condition.

Experiment 2. Four of the HI listeners (HI-2, HI-4, HI-5, and HI-7) were tested at two additional values of SNR after completing Experiment 1. As shown in Table 2, one SNR was 4 dB lower than that employed in Experiment 1 and the other was 4 dB higher. This testing was conducted with UNP and EEQ1 speech in six types of noise: CON, SQW, SAM, VC-1, VC-2, and VC-4. The test order for UNP and EEQ1 speech was selected randomly for each listener. For each speech type, the two additional values of SNR were presented in random order. Within each SNR, the test order of the six types of noises was selected at random. Five 64-trial runs were presented at each condition using the tokens from the test set. The first run was discarded as practice and the final four runs were used to calculate the percent-correct score on each of the 24 additional conditions (2 speech types × 6 noises × 2 SNRs). Other than the SNR, all other experimental parameters remained the same as in Experiment 1.

G. Data Analysis

For each condition, percent-correct scores were averaged over the final 4 runs (consisting of 4 × 64 = 256 trials). Analysis of Variance (ANOVA) tests were performed on rationalized arcsine unit (RAU; Studebaker, 1985) scores to examine the effects of speech type and noise condition on these percent-correct scores. MR in percentage points was calculated as the difference between scores in fluctuating noise and in continuous noise: MR = Score in Fluctuating Noise - Score in Continuous Noise.

Additionally, as was done by Léger et al. (2015), a normalized measure of masking release (NMR) was calculated as the quotient of MR and the difference between scores in quiet and in continuous noise:

NMR = (Score in Fluctuating Noise - Score in Continuous Noise) / (Score in Quiet - Score in Continuous Noise).

NMR thus represents the fraction of baseline performance lost in continuous noise that can be recovered in interrupted noise. Listeners who perform just as well in fluctuating noise as in quiet have an NMR of 1, and listeners who do not perform any better in fluctuating noise than in continuous noise have an NMR of 0. The metric is useful for comparing performance among HI listeners whose scores in quiet are different. By using baseline performance as a reference, NMR emphasizes the differences in performance with interrupted and continuous noise as opposed to the differences due to factors such as the severity of the hearing loss of the listener or the distorting effects of the processing on the speech itself. The MR and NMR calculations in SQW and SAM noises used CON noise as the continuous noise, and the MR and NMR calculations in VC-1 and VC-2 noises used VC-4 noise as the continuous noise. These NMR formulas are listed here:

NMR_SQW = (SQW Score - CON Score) / (BAS Score - CON Score)
NMR_SAM = (SAM Score - CON Score) / (BAS Score - CON Score)
NMR_VC-1 = (VC-1 Score - VC-4 Score) / (BAS Score - VC-4 Score)
NMR_VC-2 = (VC-2 Score - VC-4 Score) / (BAS Score - VC-4 Score)
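As a concrete illustration with hypothetical numbers (not taken from the present results), a listener scoring 90% in BAS, 50% in CON, and 74% in SQW would have MR = 74 - 50 = 24 percentage points and NMR_SQW = (74 - 50) / (90 - 50) = 0.60; that is, 60% of the performance lost in continuous noise relative to the BAS baseline is recovered in the interrupted noise.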

V. RESULTS

A. Experiment 1

The scores from Experiment 1 are reported in Appendix III-A and Appendix III-B and are summarized in Figure 9, Figure 10, and Figure 11. Appendix III-A provides the scores for each NH listener in each of the seven noise conditions for UNP, EEQ1, and EEQ4 speech, and Appendix III-B provides this same information for each HI listener. In Figure 9, the scores are plotted to highlight the differences in the average scores of the NH and HI listeners across conditions. In Figure 10 and Figure 11, the scores are plotted to highlight the differences of speech types within each noise for the NH and HI listeners (Figure 10 for the average NH results and the average HI results and Figure 11 for the average NH results and the individual HI results).

First, consider average NH and HI performance, as shown in Figure 9. As expected, the performance for both groups was greatest in the BAS condition. Performance was lowest in CON (and was approximately 50%-correct by design of the experiment) and VC-4 (which was derived from samples of enough speakers to behave similarly to continuous noise). Performance was intermediate for the remaining noises. Other than in CON noise, scores were greater for NH than for HI listeners across noise conditions for all three speech types. The differences between the two groups were relatively small in the BAS condition (where the average differences between NH and HI listeners were 5.3% in UNP, 7.8% in EEQ1, and 12.2% in EEQ4), showing that the two groups diverge the most in fluctuating noise conditions where NH listeners were able to listen in the gaps, unlike HI listeners. In fact, across the five noises other than BAS and CON, NH scores were on average 17.9, 15.9, and 17.1 percentage points higher than HI scores for the UNP, EEQ1, and EEQ4 conditions, respectively. HI listeners exhibited slightly more

variability in their results than did NH listeners: the mean standard deviations across listeners (computed as the average of the standard deviations in each of the seven noises¹) in percentage points were 3.59 for UNP, 3.23 for EEQ1, and 4.38 for EEQ4 for NH listeners and 4.67 for UNP, 4.86 for EEQ1, and 4.59 for EEQ4 for HI listeners.

Next, consider NH and HI performance across the different speech types, as is shown in Figures 10 and 11. Both figures show the mean scores for the NH listeners. Figure 10 shows the mean scores for the HI listeners, whereas Figure 11 shows the scores for the individual HI listeners. Note that the data depicted here are the same as that shown in Figure 9 and are replotted to highlight differences in speech types within a given noise. In general, both NH and HI listeners scored best in UNP, followed by EEQ1, followed by EEQ4. Averaged across the different listeners and noise types, the NH scores were 78.7% in UNP, 75.5% in EEQ1, and 73.4% in EEQ4, and the HI scores were 65.3% in UNP, 63.0% in EEQ1, and 59.1% in EEQ4. By noise type, the scores of NH listeners generally followed the pattern of CON = VC-4 < VC-2 < VC-1 < SAM < SQW < BAS, and those of HI listeners generally followed the pattern of VC-4 < CON = VC-2 < VC-1 = SAM < SQW < BAS. Averaged across the different listeners and speech types, the NH scores were 98.3% in BAS, 52.1% in CON, 92.4% in SQW, 86.2% in SAM, 81.1% in VC-1, 68.6% in VC-2, and 52.4% in VC-4, and the HI scores were 89.8% in BAS, 51.5% in CON, 72.1% in SQW, 64.5% in SAM, 61.7% in VC-1, 52.1% in VC-2, and 45.7% in VC-4.

EEQ1 processing was effective in improving the scores of HI and NH listeners in SQW noise: the average NH listener and eight of the nine individual HI listeners (all but HI-3)

¹ Note that here, the standard deviation for a given noise and processing condition is calculated as √((1/n) Σ_{i=1}^{n} σ_i²), where σ_i² is the variance of the four recorded runs on listener i and n is the number of listeners.

scored higher with EEQ1 than with UNP in SQW noise. EEQ1 processing also yielded improved performance for SAM noise in six of the nine HI listeners (all but HI-3, HI-7, and HI-9). For EEQ4 processing, no improvements over UNP were seen in SQW noise for NH listeners; however, all but one HI listener (HI-3) showed an improvement. For EEQ4 in SAM noise, there was no evidence for improvements over UNP for either NH or HI listeners. For all remaining noise conditions, for both HI and NH, scores were highest with UNP and lowest with EEQ4, with EEQ1 in between.

The results obtained on each individual NH and HI listener were analyzed using a two-way ANOVA with main factors of speech type and noise condition. The ANOVAs were conducted at the significance level of 0.01 on the RAU of the 84 percent-correct scores obtained on each listener (3 speech types × 7 noises × 4 repetitions) and are reported in Table 3. All but one of the NH listeners (NH-1) and all of the HI listeners had a significant effect of speech type, and all of the NH and HI listeners had a significant effect of noise type. One of the NH listeners (NH-2) and all but three of the HI listeners (HI-2, HI-3, and HI-7) had a significant speech by noise interaction. Post-hoc Tukey-Kramer comparisons at the significance level of 0.05 were conducted for cases of significant main factor effects, and the results are listed in Table 4. By speech type, most listeners had UNP = EEQ1 > EEQ4 (NH-2, HI-2, HI-4, HI-8, HI-9) or UNP > EEQ1 = EEQ4 (NH-3, NH-4, HI-1, HI-5, HI-7). The exceptions were HI-3, who had UNP > EEQ1 > EEQ4, and HI-6, who had EEQ1 > UNP > EEQ4. By noise type, BAS, SQW, SAM, and VC-1 were greater than VC-2, VC-4, and CON. Most listeners had BAS > SQW > SAM > VC-1 (NH-2, NH-3, NH-4, HI-1, and HI-3) or BAS > SQW > SAM = VC-1 (NH-1, HI-2, HI-4, HI-5, HI-6, HI-8, and HI-9). The exception was HI-7, who had BAS > SQW = SAM = VC-1. All NH listeners had VC-2 >

VC-4 = CON, and the order of VC-2, VC-4, and CON in HI listeners varied with each listener, with five of the nine HI listeners (HI-2, HI-4, HI-5, HI-7, and HI-8) having no significant differences among the three conditions. As discussed in the preceding paragraph, the significant speech by noise interaction present in many of the HI listeners is largely due to improved performance with EEQ1 processing relative to UNP in the SQW and SAM conditions but not in the other noises.

The NMR data calculated from the scores of Experiment 1 are reported in Appendix III-C and Appendix III-D and are summarized in Figure 12. Appendix III-C provides the NMR for each NH listener in the SQW, SAM, VC-1, and VC-2 noise conditions for UNP, EEQ1, and EEQ4 speech, and Appendix III-D provides this same information for each HI listener. In Figure 12, the NMR results are plotted for the average NH listener and the individual HI listeners to highlight the differences of speech types within each noise. As shown in Figure 12, for the HI listeners, NMR was generally similar in EEQ1 and EEQ4 speech in the various noises and was greater in EEQ1 and EEQ4 than in UNP speech. Averaged over the HI listeners and the noise types, these NMR values were in UNP, in EEQ1, and in EEQ4. NMR for HI listeners by noise type was generally greatest in SQW interference, smallest in VC-2 interference, and between the two and equivalent in SAM and VC-1 interference. As such, NMR was generally greater in the non-speech-derived noises than in the speech-derived noises. Averaged over the HI listeners and speech types, these NMR values were in SQW, in SAM, in VC-1, and in VC-2. EEQ processing yielded the largest improvement in NMR for HI listeners in the SQW conditions. This improvement decreased in the SAM condition and disappeared in the VC-1 and VC-2 conditions. Averaged across HI listeners, NMR values for UNP, EEQ1, and EEQ4, respectively, were 0.320, 0.639,

and in SQW; 0.227, 0.400, and in SAM; 0.391, 0.376, and in VC-1; and 0.126, 0.125, and in VC-2. NH listeners generally achieved greater NMR than did the HI listeners with little effect of speech type. Averaged across speech type for NH listeners, NMR decreases in the order of SQW, SAM, VC-1, and VC-2. Averaged across NH listeners, NMR for UNP, EEQ1, and EEQ4, respectively, were 0.861, 0.907, and in SQW; 0.792, 0.735, and in SAM; 0.673, 0.600, and in VC-1; and 0.356, 0.351, and in VC-2. Both within and across listeners, HI listeners exhibited greater variability in their results than did NH listeners.

B. Experiment 2

The scores of Experiment 2 are reported in Appendix IV-A and are summarized in Figure 13. Appendix IV-A provides the scores for each HI listener in the non-BAS noise conditions for UNP and EEQ1 speech at each of the three SNRs that were tested. Figure 13A plots the results in non-speech-derived noises (except BAS) as a function of SNR and fits sigmoidal functions to the data, and Figure 13B does the same for the speech-derived noises. The sigmoidal fits to the psychometric functions in Figure 13 assumed a lower bound corresponding to chance performance on the consonant-identification task (6.25%-correct) and an upper bound corresponding to a given listener's score on the BAS condition for UNP or EEQ. The fitting process found the slope and midpoint values of a logistic function that minimized the error between the fit and the data points. The results of the fits are summarized in Table 5 in terms of their midpoints (SNR in dB yielding a 50%-correct score) and slopes around the midpoint (in percentage points per dB).
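A minimal sketch of this type of constrained logistic fit is given below (Python with NumPy/SciPy; the function names, the least-squares criterion, and the example data are illustrative assumptions, since the thesis does not specify the exact fitting routine). The chance floor and the listener's BAS score are held fixed, and only the midpoint and slope of the logistic are estimated.

import numpy as np
from scipy.optimize import curve_fit

def fit_psychometric(snr_db, scores, upper, lower=6.25):
    # Two-parameter logistic between a fixed chance floor and a fixed BAS ceiling.
    def logistic(snr, mid, k):
        return lower + (upper - lower) / (1.0 + np.exp(-k * (snr - mid)))
    (mid, k), _ = curve_fit(logistic, snr_db, scores, p0=[np.median(snr_db), 0.5])
    # SNR at which the fitted curve crosses 50%-correct, and the slope (%/dB) at that point.
    snr50 = mid - np.log((upper - lower) / (50.0 - lower) - 1.0) / k
    slope50 = k * (50.0 - lower) * (upper - 50.0) / (upper - lower)
    return snr50, slope50

# Hypothetical example: percent-correct scores at three SNRs for one listener and noise type.
snr50, slope50 = fit_psychometric(np.array([-14.0, -10.0, -6.0]),
                                  np.array([30.0, 52.0, 71.0]), upper=90.0)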

For the CON noise conditions, the midpoints and slopes were similar for UNP and EEQ1 signals for each of the HI listeners. In CON, averaged across listeners, midpoints were -3.9 dB and -2.8 dB for UNP and EEQ1, respectively, and slopes were 5.2 percent per dB and 4.3 percent per dB, respectively. In the two non-speech-derived fluctuating background noises, the midpoints were lower for EEQ1 than for UNP for each of the HI listeners. Averaged across listeners and for UNP and EEQ1, respectively, midpoints were dB and dB in SQW (a difference of 84.9 dB) and -7.3 dB and dB in SAM (a difference of 8.6 dB).² For the speech-derived noise conditions, the midpoints and slopes were similar for UNP and EEQ1 signals for each of the HI listeners. Averaged across listeners and for UNP and EEQ1, respectively, midpoints were -9.0 dB and dB in VC-1 (a difference of 2.0 dB), -5.8 dB and -4.5 dB in VC-2 (a difference of -1.3 dB), and -3.4 dB and -2.4 dB in VC-4 (a difference of -1.0 dB). Slopes were similar for both types of processing, where they were ordered as SQW < SAM < CON for the non-speech-derived noises and VC-1 < VC-2 < VC-4 for the speech-derived noises.

In Figure 14, MR in percentage points is plotted as a function of SNR for SQW, SAM, VC-1, and VC-2. Here, MR was computed as the difference in sigmoid fits between fluctuating and continuous noises. Note that this metric differs from NMR in that it is not normalized by the difference between scores in quiet and in continuous noise (i.e., MR is the numerator in the NMR quotient). Similarly to what was done in the NMR calculations, MR was computed with CON as the continuous noise when SQW and SAM were the fluctuating noises and with VC-4 as the continuous noise when VC-1 and VC-2 were the fluctuating noises. In SQW interference, these plots indicate greater MR with EEQ1 than with UNP for all subjects across the SNRs. For SAM interference, MR with EEQ1 generally exceeded that with UNP, although the increase was generally smaller than in SQW interference. The trend was similar with VC-1 interference,

² It should be noted that the midpoint of HI-5 ( dB) was highly deviant relative to the remaining 3 HI listeners (whose midpoints ranged from to dB). The SQW midpoint average falls to dB if HI-5 is eliminated, leading to a difference of 17.6 dB with UNP.

although the increase in MR with EEQ1 was smaller than that observed for SAM interference. With VC-2, there was no clear trend showing greater MR for either EEQ1 or UNP. These observations were generally consistent with the NMR findings discussed in the next paragraph.

NMR was calculated from the scores of Experiment 2 and is reported in Appendix IV-B and summarized in Figure 15. Appendix IV-B provides the calculated NMR data for each HI listener in each of the seven noise conditions for UNP and EEQ1 speech at each of the three SNRs that were tested. In Figure 15A, NMR for EEQ1 is plotted as a function of NMR for UNP for the individual HI listeners in SQW and SAM noise at the various SNRs, and in Figure 15B, this same information is plotted for VC-1 and VC-2 noise. In Figure 15A, every NMR data point lies above the 45-degree reference line, showing a strong tendency in HI listeners for larger NMR with EEQ1 processing in non-speech-derived noises at all SNRs tested. Additionally, NMR was greater with SQW interference than with SAM interference. In SQW noise, NMR averaged across subjects at the low, mid, and high SNRs was 0.431, 0.314, and , respectively, for UNP and 0.765, 0.657, and 0.564, respectively, for EEQ1. These same numbers in SAM noise were 0.284, 0.210, and 0.136, respectively, for UNP and 0.505, 0.501, and 0.467, respectively, for EEQ1.

As shown in Figure 15B, there was less of a difference in NMR for UNP and EEQ1 for the speech-derived noises than seen in Figure 15A for the non-speech-derived noises. However, more data points were above the reference line with VC-1 than with VC-2. Additionally, NMR with both UNP and EEQ1 was greater with VC-1 interference than with VC-2 interference. In VC-1 noise, NMR averaged across subjects for UNP was at the low SNR, at the mid SNR, and at the high SNR. These numbers for EEQ1 were at the low SNR, at

the mid SNR, and at the high SNR. These same numbers in VC-2 noise were 0.223, 0.117, and 0.019, respectively, for UNP and 0.177, 0.119, and 0.247, respectively, for EEQ1.

VI. DISCUSSION

This section discusses the results of the experiments in greater detail and analyzes potential explanations for the outcomes. Section VI-A begins by examining the effects that the EEQ processing has on the amplitude variability of the waveforms. In Section VI-B, the EEQ effect on NMR is explored. Models are introduced in Section VI-C that attempt to predict the performance benefit gained with the EEQ processing. In Section VI-D, EEQ1 is compared to EEQ4 in an attempt to understand differences in performance. Finally, in Section VI-E, different types of background interference are subjected to a glimpse analysis to explain the different effects of the EEQ processing with the speech-derived versus non-speech-derived noises.

A. Effect of EEQ on Signal Amplitude

The waveform and amplitude distribution plots of the various S+N signals in Figures 7A, 7B, and 7C are now examined in more detail to assess how EEQ achieves its goal of equalizing the energy of an S+N signal. The amplitude distribution plots depict RMS values with blue vertical lines and amplitude medians with green vertical lines. Median amplitudes were plotted because of their resilience to outliers as compared to the means. As shown in the figures, the RMS values are constant between UNP and EEQ1 and between UNP and EEQ4 within each type of interference. This is because the final step of the EEQ processing normalizes the output signal at every sample point to have a long-term energy equal to that of the input signal. Note also that the RMS value, which is determined by the levels of the speech and the noise, is equal in all types of interference except BAS. This is because, in the figure, the SNR is -4 dB in all non-BAS

conditions. However, despite the RMS values being the same within a type of interference, the median amplitudes are not the same. The median amplitude is greater with EEQ1 and EEQ4 than with UNP. For the VCV token depicted in the figure, the differences in median amplitudes in dB between EEQ1 and UNP are 2.10 for BAS, 0.42 for CON, 1.78 for SQW, 2.05 for SAM, 1.36 for VC-1, 1.11 for VC-2, and 0.80 for VC-4. Note that except in CON and VC-4, these values are smaller than the differences in mean amplitudes between EEQ1 and UNP, which are 4.98 for BAS, 0.39 for CON, 4.25 for SQW, 2.82 for SAM, 2.97 for VC-1, 1.79 for VC-2, and 0.65 for VC-4. That the differences in mean amplitudes are greater than the differences in median amplitudes highlights the fact that the UNP histograms contain tails of low-energy components that are not present with EEQ. The rightward shift of the amplitude distribution with the EEQ processing occurs because the lower-energy speech components which are present during the gaps in the noise are amplified by the processing. The movement of the tail of the amplitude distribution towards the center of the histogram corresponds to the reduction in amplitude variation in the processed speech. The waveform and amplitude distributions of EEQ1 and EEQ4 look approximately the same when examining the broadband signals. In dB, the absolute values of the differences in mean amplitudes between EEQ1 and EEQ4 are 0.21 for BAS, 0.27 for CON, 0.32 for SQW, 0.33 for SAM, 0.18 for VC-1, 0.73 for VC-2, and 0.57 for VC-4. Further discussion of the differences between EEQ1 and EEQ4 is found in Section VI-D below.

B. Effect of EEQ on NMR

As stated above, the goal of this research was to increase NMR in HI listeners by increasing performance in fluctuating interference while maintaining performance in the baseline and continuous noise conditions. For HI listeners, EEQ1 processing yielded improved performance in SQW and SAM noises (average scores increased by 7.2 and 1.6 percentage

points, respectively) but not for the speech-derived noises. For HI listeners, EEQ1 processing resulted on average in 2.4 and 6.3 percentage point reductions in performance for BAS and CON noises, respectively. As such, for HI listeners, NMR was greater in SQW and SAM noises with EEQ1 compared to UNP. This was brought about both by an increase in performance in fluctuating noise and by a greater decrease in performance in CON noise than in BAS. For HI listeners with EEQ4 processing, NMR also increased, but compared to UNP there was a bigger performance drop in all noise conditions except SQW, which had a slight performance increase. Meanwhile, for NH listeners, EEQ1 processing yielded a slight improvement in performance in SQW noise (the average score increased by 1.4 percentage points) but not in the remaining noises. The overall effect on NMR was minimal for both EEQ1 and EEQ4.

The benefits of EEQ processing for HI listeners in SQW interference are evident through the NMR results, which are shown in Figure 12 for Experiment 1 and in Figure 15 for Experiment 2. Figure 16 re-plots the results from Figure 12 to show NMR as a function of the five-frequency PTA hearing loss of each of the nine HI listeners. In the figure, NMR decreases as a function of PTA with UNP speech, which demonstrates the increasing effects of reduced audibility in the SQW noise gaps with severity of hearing loss. However, with EEQ1 and EEQ4 processing, NMR is much more constant relative to PTA, which highlights the benefits provided to HI listeners by making the speech component of the signal more audible in the SQW noise gaps. Additionally, as shown in Figure 15, the increase in NMR with EEQ1 relative to UNP for SQW and SAM holds at various SNRs: with UNP in these types of interference, NMR becomes close to zero or even negative at the high SNRs, whereas with EEQ1, NMR is always positive.

C. Modelling Psychometric Functions

Two analyses were performed to explore the percent-correct performance shown in Figure 13 for each speech type and noise as a function of SNR and to attempt to account for the changes in performance, especially the performance boost in SQW noise.

1) Local Changes in SNR

The first analysis investigated whether the performance improvement in fluctuating noises with EEQ processing can be explained solely by changes to the SNR. Specifically, for low-to-moderate SNRs and fluctuating noise, EEQ tends to amplify the higher-SNR stimulus segments present in the gaps when noise energy is low relative to the lower-SNR stimulus segments when the noise energy is high. By doing this, EEQ changes the effective SNR of the stimulus, and so it is possible that the observed increase in NMR, which depends on the observer, might be explained simply by an increase in SNR, which is independent of the observer. The first analysis addressed this question by estimating the change in SNR as a result of EEQ processing and looking at scores as a function of this changed SNR.

Although EEQ processing is nonlinear, the scale factor is applied linearly to the speech and to the noise. Thus, it is possible to determine its effect on the speech and noise components of the signal separately for a particular stimulus at a particular input SNR, thus allowing computation of the post-processing SNR for that input. The output SNR for a particular input sample (consisting of specific speech and noise samples s(t) and n(t) and a known input SNR, SNR_UNP) may be calculated as follows: (1) Compute the EEQ scale factor SC(t) applied to an input of x(t) = s(t) + n(t). (2) The EEQ output signal is given as y(t) = x(t)·SC(t) = y_s(t) + y_n(t), where y_s(t) = s(t)·SC(t) and

y_n(t) = n(t)·SC(t). (3) The post-processed SNR (in dB) for this combination of s(t), n(t), and SNR_UNP is SNR_EEQ = 10 log10( mean[y_s²(t)] / mean[y_n²(t)] ), where mean[y_s²(t)] and mean[y_n²(t)] are the mean values of y_s²(t) and y_n²(t), respectively.

Each of the 64 speech tokens used in the experiments was examined with six noise types (CON, SQW, SAM, VC-1, VC-2, and VC-4) and values of SNR_UNP ranging from -40 to +40 dB. For every combination of speech token, noise type, and SNR_UNP, 10 noise samples n(t) of length equal to s(t) were randomly generated. The above procedure was used to calculate SNR_EEQ1 as a function of SNR_UNP and noise type averaged across each of the 10 noise samples combined with each of the 64 speech tokens. These averages were used to formulate a pre-to-post-processing SNR mapping function SNR_EEQ1 = F(SNR_UNP, noise type), shown in Figure 17, where a diagonal reference line is included to show the case of SNR_UNP = SNR_EEQ1. When SNR_UNP is negative, EEQ1 processing provides an SNR boost by raising the level of the signal present in the dips in the noise. Interestingly, when SNR_UNP is positive, EEQ1 processing actually lowers the SNR because fluctuations in the signal, as opposed to the noise, drive the equalization. The CON, SQW, SAM, VC-1, VC-2, and VC-4 curves cross the reference line at SNR_UNP values of -5.8, 4.8, 3.1, 2.5, -0.6, and -2.3 dB, respectively.

Using the pre-to-post-processing SNR mapping function, the psychometric functions in Figure 13 were replotted. Scores for UNP were plotted versus SNR_UNP and scores for EEQ1 were plotted versus SNR_EEQ1. These plots can be seen in Figure 18A for the non-speech-derived noises and in Figure 18B for the speech-derived noises. If the performance boost with EEQ1 processing could be explained solely by the change in SNR, the scores for UNP and EEQ1 in a given noise type should be the same at a given SNR.
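The per-token post-processing SNR computation described in steps (1)-(3) can be sketched as follows (Python/NumPy/SciPy; the smoothing constants and gain limits mirror the EEQ1 sketch in Section III, and all names are illustrative rather than taken from the thesis code). The key point is that the same scale factor, computed blindly from the speech-plus-noise mixture, is applied separately to the speech and noise components.

import numpy as np
from scipy.signal import lfilter

def smooth_energy(x, tau, fs):
    # One-pole IIR low-pass of the instantaneous energy x^2(t), as in the Section III sketch.
    a = np.exp(-1.0 / (tau * fs))
    return lfilter([1.0 - a], [1.0, -a], x ** 2)

def post_processing_snr(s, n, fs, tau_short=0.005, tau_long=0.200, max_gain_db=20.0):
    # Scale factor SC(t) computed from the mixture x(t) = s(t) + n(t).
    x = s + n
    sc = np.sqrt(smooth_energy(x, tau_long, fs) /
                 np.maximum(smooth_energy(x, tau_short, fs), 1e-12))
    sc = np.clip(sc, 1.0, 10.0 ** (max_gain_db / 20.0))
    # Apply the same scale factor to speech and noise separately, then form SNR_EEQ in dB.
    y_s, y_n = sc * s, sc * n
    return 10.0 * np.log10(np.mean(y_s ** 2) / np.mean(y_n ** 2))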

For the non-speech-derived noises, this prediction fits well for the data of HI-2 and HI-7, especially in the SQW and SAM conditions, and for HI-5 in the SAM condition. For the speech-derived noises, this prediction fits well for the data of HI-2 and HI-7 in the VC-1, VC-2, and VC-4 conditions, for HI-4 in the VC-1 condition, and for HI-5 in the VC-2 condition. Other than these cases, the modelling of performance based on the SNR mapping function is less effective.

It should be noted that the local SNR analysis was computed using an entire VCV token, which is dominated by the vowel components in both duration and level. It is assumed that for the consonant portion alone, the SNR_EEQ1 vs. SNR_UNP curves cross the diagonal reference line at more positive SNRs than are shown in Figure 17 for the whole VCV token. This is because the low-energy consonant component is often the beneficiary of the short-term amplification provided by the EEQ processing algorithm. As such, the EEQ processing does not have a negative impact on local consonant SNR until more positive SNRs, at which point the speech is dominant enough that a slight decrease in SNR would not hurt performance.

2) Crest Factor

The second analysis explored whether the performance improvements with the EEQ processing can be explained by the changes in amplitude variation of an S+N signal. The crest factor, defined as the peak amplitude of a waveform x divided by its RMS value, gives a sense of the amplitude variation of the signal. In dB, the crest factor is given as: Crest Factor = 20 log10( x_peak / x_rms ). Because EEQ processing reduces amplitude variation, it is expected that the processing would reduce the crest factor as the maximum value of the signal moves closer to the RMS value of the signal.
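A one-function sketch of this crest-factor metric is given below (Python/NumPy; the names are illustrative, and the optional percentile argument anticipates the percentile-based variant mentioned at the end of this subsection).

import numpy as np

def crest_factor_db(x, percentile=100.0):
    # Peak (or a given percentile of |x|) relative to the RMS value, in dB.
    peak = np.percentile(np.abs(x), percentile)
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(peak / rms)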

In a manner similar to what was done for the SNR analysis above, each of the 64 speech tokens used in the experiments was examined with various noise types (CON, SQW, SAM, VC-1, VC-2, and VC-4), processing conditions (UNP and EEQ1), and values of SNR_UNP ranging from -40 to +40 dB. For every combination of speech token, processing, and SNR_UNP, 10 noise samples n(t) of length equal to s(t) were randomly generated. The average maximum value and the average RMS value across these 10 S+N samples were then recorded. Using these two average values, an average crest factor in dB was calculated for each of the 64 test syllables at each noise type, processing type, and value of SNR_UNP. Finally, the 64 crest factors (in dB) calculated for each condition were averaged to formulate a function Crest Factor = F(SNR_UNP, noise type, processing type). This function is shown in Figure 19A for the non-speech-derived noises and in Figure 19B for the speech-derived noises, where it can be seen that the EEQ1 crest-factor curves lie below the corresponding UNP curves of the same noise type. This effect is consistent with the reduced amplitude variability (and therefore a decrease in the ratio of the maximum value to the RMS value) in the EEQ1-processed signal. Note that the crest factor for speech in the speech-derived noises is more variable than that in the non-speech-derived noises, as shown by the jagged curves across the SNRs in Figure 19B as compared to Figure 19A. This behavior comes from the greater variability in the speech-derived noises in general.

In a manner similar to what was done with the SNR analysis described above, the psychometric functions in Figure 13 were plotted on a crest-factor scale. Scores for UNP were plotted versus the pre-processing crest factor and scores for EEQ1 were plotted versus the post-processing crest factor. These plots can be seen in Figure 20. The percent-correct curves still do not lie on top of each other, indicating that crest factor by itself also cannot be used to explain the performance benefits with the EEQ1 processing. In fact, because pure noise has a lower crest factor than pure speech (as seen by the crest-factor curves for CON being lower at negative SNRs than at positive SNRs), one would expect processing whose performance benefits can be

explained solely by crest-factor changes to result in signals that have higher crest factors than UNP signals. However, as stated above, EEQ1 processing lowers the crest factor, and thus the crest factor cannot be used to explain the psychometric data. This analysis was also performed by using different percentiles of signal amplitude in the numerator of the crest-factor formula (for example, by using the 95th or 99th percentile rather than the maximum value), but this mapping did not fit the data well either.

D. EEQ1 vs EEQ4 Processing

EEQ1 proved to be more effective than EEQ4 processing for HI listeners; with the average HI data, the mean difference between EEQ1 and EEQ4 scores across the seven noises was 4.0 percentage points. It had been hypothesized that processing frequency bands independently could provide benefit, particularly with speech-derived noises, for HI listeners with non-uniform losses. However, by applying different scale factors to different frequency bands, such independent-band processing may have interfered with the spectral shape, resulting in decreased effectiveness. To see if this might be the case, outputs of the three processing schemes were examined in each of the four bands used for EEQ4.

Figure 21 compares RMS values and median amplitudes for UNP, EEQ1, and EEQ4 within each of the four bands used in the EEQ4 processing as a function of SNR. As seen in Figures 21A, 21B, and 21C, the RMS values for the three different kinds of processing do not differ much on a band-by-band basis. This is because the EEQ processing normalizes the long-term RMS value of the output signal to be equal to that of the input signal. However, an obvious difference can be seen between the median values of the UNP and both EEQ processing schemes, as shown in Figures 21D, 21E, and 21F. For UNP, the median amplitudes have a generally linear decrease with an increase in SNR, whereas the slopes of the EEQ1 and EEQ4 functions level off at around

Figure 21 compares RMS values and median amplitudes for UNP, EEQ1, and EEQ4 within each of the four EEQ4 bands as a function of SNR. As seen in Figures 21A, 21B, and 21C, the RMS values for the three kinds of processing do not differ much on a band-by-band basis. This is because the EEQ processing normalizes the RMS of the output signal to equal that of the input signal. However, an obvious difference can be seen between the median values of UNP and both EEQ processing schemes, as shown in Figures 21D, 21E, and 21F. For UNP, the median amplitudes decrease roughly linearly with increasing SNR, whereas the slopes of the EEQ1 and EEQ4 functions level off at around 0 dB. This is consistent with the EEQ processing amplifying the low-energy speech components. The shapes of these functions are generally similar for EEQ1 and EEQ4. However, at the lower SNRs, bands 1 and 4, and to a lesser extent bands 2 and 3, show greater overlap with EEQ4 than with EEQ1. At the higher SNRs, bands 1 and 4 and bands 2 and 3 show greater separation for EEQ4 than for EEQ1. It is possible that these differences in spectral shape led to the overall 4.0-percentage-point decrease in performance with EEQ4 relative to EEQ1. It is also possible that other metrics besides RMS values and median amplitudes would reveal a larger difference in spectral shape between the two processing schemes, or that the additional processing involved in the multi-band scheme introduced additional distortions to the signal, which led to the observed decreases in performance with EEQ4 compared to EEQ1.

E. Glimpse Analysis of Vocoded and Non-Vocoded Noises

The EEQ processing scheme proved to be more effective, both in terms of improving scores and NMR, with the non-speech-derived noises than with the speech-derived noises. An analysis of the differences in the occurrence of noise glimpses between these two categories of noises was conducted to explore why this may be the case. This analysis was similar to one done by Cooke (2006), who examined glimpse percentages and counts for a number of competing background speakers. Cooke's analysis defined a glimpse as a connected region of the spectro-temporal excitation pattern in which the energy of the speech token exceeded that of the background by at least 3 dB in each time-frequency element. Unlike Cooke's analysis, the current analysis looked at the noise alone and examined its envelope rather than its spectro-temporal pattern. The analysis used here defines a noise glimpse as a section of the noise where the envelope drops more than 3 dB below the RMS value of the noise for at least 10 ms. Gaps present at the immediate start or end of the noise were not counted because their durations might be truncated.

The threshold of 3 dB below the RMS value was chosen to prevent steady-state noise, which has many small fluctuations in the vicinity of its RMS value, from being classified as having a significant portion of the signal spent in a glimpse. The minimum duration of 10 ms was chosen based on a study by He et al. (1999), which showed that the threshold for detecting a gap in a longer-duration noise (400 ms) is roughly 5 ms, independent of the location of the gap within the noise or whether the gap location is randomly selected on each trial. Therefore, the analysis described here chose a minimum duration of 10 ms, twice as long as the threshold at which gaps can be reliably detected.

Figure 22 depicts the waveforms and envelopes of the VC-1 (Figure 22A), VC-2 (Figure 22B), and VC-4 (Figure 22C) noises. The envelope was computed by passing the absolute value of the signal's Hilbert transform through 16 logarithmically spaced low-pass filters in the range of 80 Hz to 8020 Hz with cutoffs of 64 Hz. The red lines represent the RMS values, and the green lines represent 3 dB below the RMS values. An interval is considered to be a noise gap if the envelope (shown in blue) drops below the green threshold line for at least 10 ms. As shown in the figures, as the number of speakers in the speech-derived noises increases, the envelope hovers closer to the RMS value.
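A simplified version of this glimpse detection can be sketched as follows (a Python sketch; it uses a single broadband Hilbert envelope low-passed at 64 Hz rather than the 16-band computation described above, and the sampling rate and test signal are placeholders):

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

FS = 25000  # assumed sampling rate (Hz)

def noise_glimpses(noise, fs=FS, drop_db=3.0, min_dur=0.010):
    """Return (start, end) times of noise glimpses: stretches where the envelope stays
    more than drop_db below the overall RMS for at least min_dur seconds. Gaps touching
    the start or end of the signal are ignored, as in the analysis described above."""
    env = np.abs(hilbert(noise))
    sos = butter(4, 64.0, btype="lowpass", fs=fs, output="sos")
    env = sosfiltfilt(sos, env)                     # smoothed broadband envelope
    thresh = np.sqrt(np.mean(noise ** 2)) * 10.0 ** (-drop_db / 20.0)
    below = env < thresh
    # find runs of consecutive samples below the threshold
    edges = np.diff(below.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if below[0]:
        starts = np.r_[0, starts]
    if below[-1]:
        ends = np.r_[ends, len(below)]
    glimpses = []
    for s, e in zip(starts, ends):
        if s == 0 or e == len(below):               # skip gaps at the very start or end
            continue
        if (e - s) / fs >= min_dur:
            glimpses.append((s / fs, e / fs))
    return glimpses

# Stand-in noise: 10-Hz sinusoidally amplitude-modulated Gaussian noise, 1.29 s long.
rng = np.random.default_rng(2)
t = np.arange(int(1.29 * FS)) / FS
sam_like = rng.standard_normal(t.size) * (1.0 + np.sin(2 * np.pi * 10.0 * t))
print(noise_glimpses(sam_like)[:3])
```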

The analysis considered six of the noises used in the experiment (eliminating only the BAS noise). It also considered speech-vocoded modulated noises derived from more than four speaker samples: VC-8, VC-16, VC-32, VC-64, VC-128, VC-256, and VC-512 were examined as well. Five hundred samples of each noise type were generated with a duration equal to that of an arbitrarily chosen VCV token, 1.29 seconds. Note that these additional noise types were generated from multiple samples of the same eight speakers used to generate VC-1, VC-2, and VC-4. Half of these samples came from combinations of female speakers and half from combinations of male speakers.

For each noise sample, the occurrences of glimpses according to the above definition were determined. This information was then used to calculate the percentage of the overall noise duration occupied by glimpses, the number of glimpses per second, and the average glimpse length. For the first two quantities, the averages over the 500 generated noise samples are shown in Figures 23 and 24: Figure 23 depicts the percentage-of-glimpses information, and Figure 24 depicts the measured glimpses-per-second information. Cooke's paper contains plots of these same quantities that are similar in shape to the results obtained with this study's slightly different metric of a noise glimpse. Figure 25 shows a histogram of the final quantity, the average glimpse duration in each of the 500 noise samples.

As shown in Figure 23, the average fraction of time spent in a glimpse decreased as the number of speakers in the vocoded modulated noise increased, approaching zero, the value in CON noise. SQW and SAM had values of and 0.419, respectively. Note that had the opening and closing gaps been counted and had the RMS value been used as the threshold instead of 3 dB below the RMS value, these numbers would have been closer to 0.5 by the symmetry of the noises. VC-1, VC-2, and VC-4, the three speech-derived noises used in the experiment, had fractions of 0.423, 0.336, and 0.257, respectively. VC-1 was therefore very similar to SQW and SAM in terms of the fraction of time spent in a gap, whereas VC-4 had more gaps than CON by the current metric. The variability of this measure was greater for the speech-derived noises than for the non-speech-derived noises and decreased as the number of speakers in the speech-derived noises increased. Standard deviations within the 500 samples were for VC-1, for VC-2, and for VC-4, whereas these values were for CON, for SQW, and for SAM.
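Given a list of detected glimpse intervals, the three summary quantities can be computed as follows (a small sketch; the interval values are for illustration only):

```python
import numpy as np

def glimpse_stats(glimpses, total_dur):
    """Summary statistics for a list of (start, end) glimpse intervals in seconds:
    fraction of the noise spent in glimpses, glimpses per second, and mean glimpse length."""
    durs = np.array([e - s for s, e in glimpses]) if glimpses else np.array([])
    frac = durs.sum() / total_dur
    rate = len(durs) / total_dur
    mean_dur = durs.mean() if len(durs) else 0.0
    return frac, rate, mean_dur

# Hypothetical intervals for a 1.29-s noise sample (values for illustration only).
frac, rate, mean_dur = glimpse_stats([(0.10, 0.15), (0.40, 0.48), (0.90, 1.02)], 1.29)
print(f"fraction {frac:.3f}, {rate:.2f} glimpses/s, mean duration {mean_dur * 1000:.0f} ms")
```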

As shown in Figure 24, the average number of glimpses per second increased from one speaker to two speakers and then decreased as the number of speakers in the vocoded noises increased further, approaching zero, the value in CON noise. SQW and SAM had glimpses-per-second values of 9.20 and 9.22, respectively, and VC-1, VC-2, and VC-4 had values of 2.57, 3.46, and 3.43, respectively. Thus, all of the speech-derived noises had fewer glimpses per second than the non-speech-derived noises. The variability in the number of glimpses per second was greater for the speech-derived noises than for the non-speech-derived noises, and it first increased and then decreased as the number of speakers in the speech-derived noises increased. VC-1, VC-2, and VC-4 had standard deviations of 1.14, 1.33, and 1.34, respectively, across the 500 samples, whereas these values were 0.000, 0.318, and for CON, SQW, and SAM, respectively.

To generate Figure 25, the 500 average glimpse durations (the average glimpse duration in each of the 500 noise samples generated for a given noise) were placed into bins of length 10 ms. As shown in the figure, the average glimpse duration varies considerably between samples of the same type of speech-derived noise, especially for the noises tested in the experiments (VC-1, VC-2, and VC-4); the histograms for these noises span many bins. Meanwhile, for the non-speech-derived noises (CON, SQW, and SAM), there is very little variability in the average glimpse duration between noise samples: the histograms for these noises occupy a single bin (for CON, the bin from 0 to 0.01 seconds, and for SQW and SAM, the bin from 0.04 to 0.05 seconds). Almost every sample of VC-1, VC-2, and VC-4 falls into a bin of greater duration than that for SQW and SAM.

Figures 23, 24, and 25 offer insight into why the EEQ processing performed better with the non-speech-derived noises than with the speech-derived noises. Although VC-1, SQW, and SAM have similar amounts of total time spent in glimpses, these times are distributed over a greater number of glimpses in SQW and SAM. With VC-2 and VC-4, there is both less total time spent in glimpses and a smaller total number of glimpses than with SQW and SAM. Additionally, the variability is much greater in the speech-derived noises than in the non-speech-derived noises. The EEQ processing performs best with short, frequent glimpses, as these give it the best opportunity to amplify speech exposed during the gaps in the noise by normalizing with the ratio of the long- and short-term energies. With VC-1, there are fewer, longer glimpses, so the listener gets only a few samples of the speech stimulus rather than small glimpses throughout. During the longer glimpses, the running long-term average would be reduced, leading to smaller changes in gain in these sections. With fewer and longer glimpses (and therefore fewer and longer non-glimpses as well), it is also possible that the entire low-energy consonant portion of the speech stimulus would be covered by noise. Thus, the EEQ processing may have less opportunity to operate effectively on the portion of the speech where HI listeners require the most amplification and could instead end up amplifying noise during these parts. Finally, the predictability of the non-speech-derived noises (as evidenced by the low standard deviations in percentage of glimpses and number of glimpses) may make it easier for HI listeners to benefit from EEQ processing with those noises. With the speech-derived noises, the standard deviations are high, and each sample of noise is quite different from the others. This unpredictable pattern makes it harder for HI listeners to benefit from EEQ processing in the speech-derived noises.

To examine the role of glimpsing in performance across the different types of noise, Figure 26 plots the mean NH and HI scores for UNP and EEQ1 as a function of the average fraction of the noise spent in a glimpse.

For both speech types and both groups of listeners, scores increased with an increase in the fraction of glimpses once this measure exceeded approximately . Below this fraction, scores were roughly constant at the level observed in CON. For both NH and HI listeners, the UNP curve lies above the EEQ1 curve at the smaller fractions of glimpses. However, as the fraction of glimpses increases, the difference between the curves becomes smaller and even reverses at the highest fractions of glimpses. These trends are consistent with the idea that the EEQ processing is most effective when many glimpses are available throughout the duration of the noise signal.

Another explanation for why EEQ processing yields less of an NMR improvement in speech-derived noises for HI listeners lies in the fact that many HI listeners begin with a greater NMR in VC-1 and VC-2 than in SQW and SAM. As shown in Figure 12, the HI listeners with the most severe hearing losses (HI-6, HI-7, HI-8, and HI-9) have almost no NMR in the UNP condition in SQW but do have a non-zero NMR in VC-1. In fact, in the UNP condition, the NMR is much more constant in VC-1 interference as the listener's PTA increases than is the case in SQW. For UNP, this implies that in VC-1 noise HI listeners were able to recover more of the performance lost in VC-4 noise than they were able to recover, in SQW, of the performance lost in CON. Thus, there is less room for NMR improvement with EEQ processing in the speech-derived noises, and it is less surprising that the increase is not as large as for the non-speech-derived noises.

VII. CONCLUSIONS

EEQ processing was effective in improving NMR for HI listeners in SQW and SAM interference. The EEQ effect on NMR was less apparent in VC-1 and VC-2 interference. These observations held across various SNRs.

NMR improvements for EEQ resulted primarily from increased performance in fluctuating noise, especially in SQW interference, although there was also a smaller decrease in performance in BAS and CON for EEQ.

EEQ processing is more effective when the fluctuating noise has regular and frequent gaps, as in SQW and SAM. VC-1 and VC-2 have gaps that are more variable in length, which limits the effectiveness of the EEQ processing in using the short and long windows to normalize energy.

EEQ1 processing was more effective than EEQ4 processing. EEQ4 may have altered the spectral shape, resulting in decreased effectiveness.

NMR decreased with increasing hearing loss for UNP speech but was roughly independent of degree of loss for EEQ speech. This resulted in a large increase in NMR for the HI listeners with the most severe hearing losses.

Although EEQ processing increases the local SNR and decreases the amplitude variation of a noisy speech signal, neither of these effects provided a complete explanation of behavioral performance with EEQ signals over a wide range of SNRs.

VIII. FUTURE WORK

This study arose out of the desire to understand the factors that influence NMR in HI listeners and to explore a signal-processing technique to improve NMR. Future work relates to these long-term goals. The work reported here evaluated the effectiveness of the EEQ processing scheme in a consonant-identification task. Future studies will explore how the EEQ processing scheme fares with other speech materials, specifically vowels and sentences.

A model to predict the differences in performance exhibited by HI listeners with UNP and EEQ processing in the different types of background noise, building on the SNR and crest-factor analyses described in this thesis, will be investigated further. The EEQ processing scheme will also continue to be analyzed for potential improvements in a broader range of fluctuating noises, most notably noises with irregular gap lengths. Additionally, the factors causing NMR to be greater in the speech-derived noises than in the non-speech-derived noises with UNP will be investigated. Ways to keep the EEQ4 processing from producing as much spectral alteration will also be explored, potentially by restricting the scale factors applied to adjacent bands to lie within a fixed range of one another. Additional signal-processing techniques will also be examined for improving NMR in HI listeners. These techniques will perhaps build on what was learned from the EEQ processing, and together they could lead to an increased understanding of the mechanisms contributing to masking release in both NH and HI listeners.

References

Cooke, M. (2006). "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am. 119.

Desloge, J. G., Reed, C. M., Braida, L. D., Perez, Z. D., and D'Aquila, L. A. (2016). "Technique to improve speech intelligibility in fluctuating background noise by normalizing signal energy over time." Manuscript in preparation.

Desloge, J. G., Reed, C. M., Braida, L. D., Perez, Z. D., and Delhorne, L. A. (2010). "Speech reception by listeners with real and simulated hearing impairment: Effects of continuous and interrupted noise," J. Acoust. Soc. Am. 128.

Dillon, H. (2001). Hearing Aids (Thieme, New York).

Festen, J. M., and Plomp, R. (1990). "Effects of fluctuating noise and interfering speech on the speech reception threshold for impaired and normal hearing," J. Acoust. Soc. Am. 88.

He, N., Horwitz, A. R., Dubno, J. R., and Mills, J. H. (1999). "Psychometric functions for gap detection in noise measured from young and aged subjects," J. Acoust. Soc. Am. 106.

Léger, A. C., Reed, C. M., Desloge, J. G., Swaminathan, J., and Braida, L. D. (2015). "Consonant identification in noise using Hilbert-transform temporal fine-structure speech and recovered-envelope speech for listeners with normal and impaired hearing," J. Acoust. Soc. Am. 136.

Phatak, S., and Grant, K. W. (2014). "Phoneme recognition in vocoded maskers by normal-hearing and aided hearing-impaired listeners," J. Acoust. Soc. Am. 136.

Reed, C. M., Desloge, J. G., Braida, L. D., Léger, A. C., and Perez, Z. D. (2016). "Level variations in speech: Effect on masking release in hearing-impaired listeners." Under review for J. Acoust. Soc. Am.

Shannon, R. V., Jensvold, A., Padilla, M., Robert, M. E., and Wang, X. (1999). "Consonant recordings for speech testing," J. Acoust. Soc. Am. 106, L.

Studebaker, G. A. (1985). "A rationalized arcsine transform," J. Speech Lang. Hear. Res. 28.

Figure 1: The magnitude and phase of the square root of the ratio of the frequency response of AVGlong to the frequency response of AVGshort. AVGshort and AVGlong are the moving-average operators used by the EEQ processing in the computation of the running short- and long-term energies, respectively, of the signal. They are single-pole IIR low-pass filters applied to the instantaneous signal energy with time constants of 5 ms and 200 ms for the short and long averages, respectively.
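The quantity plotted in Figure 1 can be reproduced approximately as follows (a sketch; the sampling rate and the exact single-pole smoother coefficients are assumptions):

```python
import numpy as np
from scipy.signal import freqz

FS = 25000  # assumed sampling rate (Hz)

def one_pole_lowpass(tau, fs=FS):
    """Coefficients of a single-pole IIR low-pass smoother y[n] = (1-a) x[n] + a y[n-1]
    with time constant tau seconds (a common choice; the exact coefficients used for
    AVGshort and AVGlong are an assumption of this sketch)."""
    a = np.exp(-1.0 / (tau * fs))
    return [1.0 - a], [1.0, -a]

w, h_short = freqz(*one_pole_lowpass(0.005), worN=2048, fs=FS)
_, h_long = freqz(*one_pole_lowpass(0.200), worN=2048, fs=FS)
ratio = np.sqrt(h_long / h_short)            # square root of the response ratio (Figure 1)
mag_db = 20.0 * np.log10(np.abs(ratio))
phase_deg = np.degrees(np.angle(ratio))
print(mag_db[:5], phase_deg[:5])
```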

Figure 2: Block diagrams of the EEQ processing algorithm used in this implementation. Figure 2A outlines the steps of the EEQ1 processing. Eshort[n] and Elong[n] are computed with single-pole IIR filters applied to the instantaneous signal energy with time constants of 5 ms and 200 ms, respectively, and the scale factor SC[n] is restricted to lie in the range of 0 dB to 20 dB. Figure 2B shows the EEQ1 processing applied independently in each of the four frequency bands to yield EEQ4.

Figure 2A:

Figure 2B:
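A minimal single-band sketch of the processing described in this caption is given below, assuming the gain is the square root of the ratio of the long-term to the short-term energy (consistent with Figure 1) and that the overall output RMS is restored to that of the input; the sampling rate and the exact gain rule are assumptions of the sketch:

```python
import numpy as np
from scipy.signal import lfilter

FS = 25000  # assumed sampling rate (Hz)

def running_energy(x, tau, fs=FS):
    """Single-pole IIR smoothing of the instantaneous energy x[n]**2 with time constant tau s."""
    a = np.exp(-1.0 / (tau * fs))
    return lfilter([1.0 - a], [1.0, -a], x ** 2)

def eeq1(x, fs=FS, max_gain_db=20.0, eps=1e-12):
    """Single-band EEQ sketch: boost short-term segments whose energy falls below the
    running long-term energy, then restore the original overall RMS."""
    e_short = running_energy(x, 0.005, fs)
    e_long = running_energy(x, 0.200, fs)
    gain = np.sqrt((e_long + eps) / (e_short + eps))          # assumed gain rule
    gain = np.clip(gain, 1.0, 10.0 ** (max_gain_db / 20.0))   # restrict SC[n] to 0..20 dB
    y = x * gain
    y *= np.sqrt(np.mean(x ** 2) / (np.mean(y ** 2) + eps))   # normalize overall energy
    return y

rng = np.random.default_rng(3)
processed = eeq1(rng.standard_normal(FS))    # stand-in for a noisy speech token
```

Per Figure 2B, EEQ4 would apply the same operation independently within each of four logarithmically equal bands between 80 Hz and 8020 Hz and sum the band outputs.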

Table 1: Test ear, age, and 5-frequency pure-tone average (PTA) for each HI listener. The final two columns provide the comfortable speech presentation level chosen by each listener with NAL amplification and the SNR used in testing all speech conditions. The SNR was chosen to yield 50% correct in continuous noise.

Listener   Test Ear   Age   5-Frequency PTA (dB HL)   Speech Level (dB SPL)   SNR (dB)
HI-1       R
HI-2       R
HI-3       L
HI-4       L
HI-5       L
HI-6       L
HI-7       L
HI-8       R
HI-9       L

Figure 3: Pure-tone detection thresholds in dB SPL measured for 500-ms tones in a three-alternative forced-choice adaptive procedure. A green line representing the average thresholds of the test ears of the NH listeners is shown in the upper left box, and the thresholds for the HI listeners are shown in the remaining boxes. For the HI listeners, thresholds are shown for the right ear (red circles) and left ear (blue x's), with the points of the test ear connected using a solid line and the points of the non-test ear connected using a dashed line.

Figure 4: Waveforms of the seven background interference conditions used in testing. To make it easier to see the effects of the modulation, the same underlying noise sample was used to generate all four non-speech-derived noises in this figure.

Figure 5: The spectrogram of the randomly generated speech-shaped noise used to create the BAS, CON, SQW, and SAM interference conditions. The speech-shaped noise had a total duration of 30 seconds, and a random segment of the speech-shaped noise (of the desired interference duration) was chosen every time a sample of BAS, CON, SQW, or SAM was generated.

Figure 6: Waveform of the VCV token APA for UNP speech in the BAS noise condition. The high-energy vowel components and the low-energy consonant component are annotated at the top of the figure.

Figure 7: Waveforms and amplitude distribution plots for the VCV token APA presented with the seven different kinds of background interference (BAS, CON, SQW, SAM, VC-1, VC-2, and VC-4) with UNP (Figure 7A), EEQ1 (Figure 7B), and EEQ4 (Figure 7C) processing. The speech is presented at a level of 65 dB SPL, and the noise (other than BAS) is presented at a level of 69 dB SPL, leading to an SNR of -4 dB. The blue line in the amplitude distribution plot represents the RMS value. The green line is the median of the amplitude distribution.

Figure 7A:

Figure 7B:

Figure 7C:

Figure 8: Waveforms and amplitude distribution plots for the VCV token APA presented with CON (Figure 8A) and SQW (Figure 8B) interference with EEQ4 processing. Each of the four rows in the figure corresponds to a different logarithmically equal band in the range of 80 Hz to 8020 Hz. The speech is presented at a level of 65 dB SPL, and the noise is presented at a level of 69 dB SPL, leading to an SNR of -4 dB. The blue line in the amplitude distribution plot represents the RMS value. The green line is the median of the amplitude distribution.

Figure 8A:

Figure 8B:

Table 2: The SNRs employed in Experiment 2 for each of the four HI listeners tested. The Mid SNR was equivalent to the one used in Experiment 1. The Low SNR was 4 dB lower than the Mid SNR, and the High SNR was 4 dB higher than the Mid SNR.

Listener   Low SNR (dB)   Mid SNR (dB)   High SNR (dB)

Figure 9: The mean percent-correct scores achieved by the four NH (green bars) and nine HI listeners (gold bars) in Experiment 1. The scores were measured with UNP (top panel), EEQ1 (middle panel), and EEQ4 (bottom panel) processing in the BAS, CON, SQW, SAM, VC-1, VC-2, and VC-4 background interference conditions. The error bars associated with each bar are +/- 1 standard deviation.

Figure 10: The mean percent-correct scores achieved by the four NH (upper panel) and nine HI listeners (lower panel) in Experiment 1. The scores were measured with UNP (purple bars), EEQ1 (orange bars), and EEQ4 (green bars) processing in the BAS, CON, SQW, SAM, VC-1, VC-2, and VC-4 background interference conditions.

Figure 11: The mean percent-correct scores achieved by the four NH listeners (upper left panel) and the individual percent-correct scores achieved by the nine HI listeners (other nine panels) in Experiment 1. The scores were measured with UNP (purple bars), EEQ1 (orange bars), and EEQ4 (green bars) processing in the BAS, CON, SQW, SAM, VC-1, VC-2, and VC-4 background interference conditions. The error bars associated with each bar are +/- 1 standard deviation.

Table 3: Analysis of variance results conducted on the rationalized arcsine units of the percent-correct scores for each of the four NH and nine HI listeners. The F-statistic (with degrees of freedom) and probability are shown for each listener for speech type (F(2, 63) and p), noise type (F(6, 63) and p), and the speech-by-noise interaction (F(12, 63) and p). Probabilities below the 0.01 significance level are bolded and annotated with an asterisk.
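For reference, a common statement of the rationalized arcsine transform of Studebaker (1985), which maps X correct responses out of N trials to rationalized arcsine units, is sketched below (the constants should be checked against the cited paper):

```python
import numpy as np

def rau(correct, total):
    """Rationalized arcsine units (after Studebaker, 1985) for `correct` out of `total` trials:
    theta = asin(sqrt(X/(N+1))) + asin(sqrt((X+1)/(N+1))); RAU = (146/pi) * theta - 23."""
    theta = (np.arcsin(np.sqrt(correct / (total + 1.0)))
             + np.arcsin(np.sqrt((correct + 1.0) / (total + 1.0))))
    return 146.0 / np.pi * theta - 23.0

print(rau(32, 64))   # 50% correct out of 64 trials maps to roughly 50 RAU
```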

Table 4: Tukey-Kramer post-hoc multiple-comparison results among the four NH and nine HI listeners using a significance level of The ordering is shown for each listener by speech type (1 for UNP, 2 for EEQ1, and 3 for EEQ4), noise type (1 for BAS, 2 for CON, 3 for SQW, 4 for SAM, 5 for VC-1, 6 for VC-2, and 7 for VC-4), and speech-by-noise interaction. When two conditions are not significantly different, they are listed in decreasing order of observed mean value. Note that there were some cases where conditions X and Y and conditions Y and Z were not significantly different, but conditions X and Z were significantly different. In these cases, Y was listed in the table as not significantly different from whichever of X and Z was closer to it in mean value.

Listener   Speech Type   Noise Type
NH-1       1 = 2 = 3     1 > 3 > 4 = 5 > 6 > 7 = 2
NH-2       1 = 2 > 3     1 > 3 > 4 > 5 > 6 > 7 = 2
NH-3       1 > 2 = 3     1 > 3 > 4 > 5 > 6 > 7 = 2
NH-4       1 > 2 = 3     1 > 3 > 4 > 5 > 6 > 7 = 2
HI-1       1 > 2 = 3     1 > 3 > 4 > 5 > 6 > 2 = 7
HI-2       1 = 2 > 3     1 > 3 > 4 = 5 > 6 = 7 = 2
HI-3       1 > 2 > 3     1 > 3 > 4 > 5 > 2 > 6 > 7
HI-4       1 = 2 > 3     1 > 3 > 4 = 5 > 6 = 2 = 7
HI-5       1 > 2 = 3     1 > 3 > 4 = 5 > 6 = 2 = 7
HI-6       2 > 1 > 3     1 > 3 > 4 = 5 = 2 > 6 = 7
HI-7       1 > 2 = 3     1 > 3 = 4 = 5 > 6 = 2 = 7
HI-8       1 = 2 > 3     1 > 3 > 4 = 5 > 2 = 6 = 7
HI-9       1 = 2 > 3     1 > 3 > 4 = 2 = 5 > 6 = 7

Figure 12: The mean NMR achieved by the NH listeners (first group of bars) and the individual NMR for each of the HI listeners (remaining nine groups of bars) with UNP (purple bars), EEQ1 (orange bars), and EEQ4 (green bars) processing. The NMR for the SQW (upper left panel) and SAM (lower left panel) noises was calculated relative to the CON condition, whereas the NMR for the VC-1 (upper right panel) and VC-2 (lower right panel) noises was calculated relative to the VC-4 noise condition.


MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,

More information

ON WAVEFORM SELECTION IN A TIME VARYING SONAR ENVIRONMENT

ON WAVEFORM SELECTION IN A TIME VARYING SONAR ENVIRONMENT ON WAVEFORM SELECTION IN A TIME VARYING SONAR ENVIRONMENT Ashley I. Larsson 1* and Chris Gillard 1 (1) Maritime Operations Division, Defence Science and Technology Organisation, Edinburgh, Australia Abstract

More information

Modulation analysis in ArtemiS SUITE 1

Modulation analysis in ArtemiS SUITE 1 02/18 in ArtemiS SUITE 1 of ArtemiS SUITE delivers the envelope spectra of partial bands of an analyzed signal. This allows to determine the frequency, strength and change over time of amplitude modulations

More information

Data Transmission (II)

Data Transmission (II) Agenda Lecture (02) Data Transmission (II) Analog and digital signals Analog and Digital transmission Transmission impairments Channel capacity Shannon formulas Dr. Ahmed ElShafee 1 Dr. Ahmed ElShafee,

More information