ALEXANDROS TSILFIDIS, DIPL. ELECTRICAL & COMPUTER ENGINEER


ANALYSIS AND DIGITAL SIGNAL PROCESSING METHODS FOR THE ENHANCEMENT OF SPEECH AND MUSIC SIGNALS IN REVERBERANT SPACES

DOCTORAL DISSERTATION

ALEXANDROS TSILFIDIS, DIPL. ELECTRICAL & COMPUTER ENGINEER

UNIVERSITY OF PATRAS, DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING

DISSERTATION NUMBER 273, PATRAS - JUNE 2011


Signal Processing Methods for Enhancing Speech and Music Signals in Reverberant Environments

Alexandros Tsilfidis
Department of Electrical and Computer Engineering, University of Patras

PhD Dissertation


To my parents, Agni and Vassilis.


Acknowledgements

The work presented in this thesis was conducted in the Audio and Acoustic Technology Group, Wire Communications Laboratory, Electrical and Computer Engineering Department, University of Patras. Apart from the thesis supervisor, Professor J. Mourjopoulos, the dissertation committee consisted of Professor N. Fakotakis and Associate Professor E. Dermatas. I would like to thank them for their useful comments and suggestions that have improved the presentation and quality of this work. The examining committee also consisted of Professor V. Anastassopoulos (Physics Department), Professor K. Berberidis (Department of Computer Engineering and Informatics), Associate Professor D. Skarlatos (Department of Mechanical Engineering and Aeronautics) and Assistant Professor D. A. Toumpakaris (Electrical and Computer Engineering Department). I would like to thank them all for their interest, their valuable suggestions and their participation in the examination of the thesis.

For all these years, Professor John Mourjopoulos was more than a thesis supervisor for me. His vision for science and music has had a decisive influence on my life choices. His support and guidance were invaluable, and he often sacrificed his private time in order to be there for me. I will always be inspired by his scientific passion and professional dignity.

I would like to acknowledge Dr. D. Tsoukalas for his help in the early stages of this work, Dr. J. Buchholz and A. Westermann for our collaboration on the binaural aspects of dereverberation and Dr. I.

Mporas for our collaboration on the Automatic Speech Recognition field.

I would like to particularly thank my colleague and friend Elias Kokkinis for the motivating discussions on (every crazy aspect of) audio and music signal processing. Moreover, the moving speaker dereverberation method would never have been developed without our sleepless working nights in the lab. I am also grateful to my colleague and friend Eleftheria Georganti for her significant help and contribution, especially on the binaural extension of dereverberation, but also for joining me in the anti-stress jogging sessions!

I would like to express my gratitude to my friends and colleagues from the AudioGroup, Dr. Panos Hatziantoniou, Dr. Nicolas Tatlas, Dr. Thomas Zarouchas, Fotis Kontomichos, Thodoris Altanis, Babis Papadakos and also Dimitris Sofos, for their support and for creating a nice working environment.

At this point I would like to thank Stelios Georgakakos (Mentor Training Inc.) for giving me the chance to work in the private sector while pursuing my thesis. His confidence and understanding were critical in order to finish this work.

I would also like to express my gratitude to my sister Eleana for her care and encouragement. Moreover, I would like to thank my friend Dr. Giorgos Papageorgiou for his support, for the spell/grammar check of my first English papers and for his help with the mathematical notation.

During my undergraduate years, I learned the practical aspects of music signal processing while working as a musician and sound engineer together with Stavros Gasparatos. Since then Stavros keeps me up to date on the new perspectives of the audio industry through (never-ending but) fruitful discussions. But more importantly, he is always there when I need him and I am really grateful for that.

I owe my greatest thanks to Renia Gasparatou for her emotional support, love and encouragement. But most of all I thank her for making me happy every single day for all these years.

Finally, I would like to thank my parents Agni and Vassilis. Without their (emotional and financial) support throughout my education years and beyond, this dissertation would never have been accomplished.


Abstract

This thesis presents novel signal processing algorithms for speech and music dereverberation. The proposed algorithms focus on blind single-channel suppression of late reverberation; however, binaural and semi-blind methods have also been introduced. Late reverberation is a particularly harmful distortion, since it significantly decreases the perceived quality of reverberant signals and also degrades the performance of Automatic Speech Recognition (ASR) systems and other speech and music processing algorithms. Hence, the proposed dereverberation methods can either be used as standalone enhancement techniques or implemented as preprocessing schemes prior to ASR or other applied systems.

Spectral subtraction is traditionally employed for the suppression of late reverberation. A study of existing spectral subtraction dereverberation methods revealed that they disproportionately degrade low-level signal parts and signal transients, and that they often introduce musical noise artifacts. Hence, a unified framework for joint compensation of the above degradations has been proposed, based on two signal-dependent relaxation criteria and a perceptually-motivated non-linear filtering stage. The proposed framework can be easily implemented in existing spectral subtraction dereverberation techniques, significantly improving their performance.

A novel technique for moving speaker dereverberation has also been introduced. This method assumes that late reverberation is stationary across different room positions and takes advantage of a single measurement of room properties (either via a measured Room Impulse Response or via a more practical method based on a recorded handclap).
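To make the spectral subtraction idea concrete, here is a minimal single-channel sketch in the spirit of the Lebart-type estimators discussed later: the late-reverberation power spectrum is modeled as an exponentially decayed copy of the reverberant power spectrum from a few frames back, and a floored subtraction gain is applied per STFT bin. All parameter values (frame size, 50 ms early/late boundary, gain floor) are illustrative assumptions, not the exact settings used in this thesis.

```python
import numpy as np

def suppress_late_reverb(y, fs, rt60, frame=512, hop=256, delay_ms=50, floor=0.1):
    """Sketch of spectral-subtraction late-reverberation suppression."""
    win = np.hanning(frame)                        # analysis window (COLA at 50% overlap)
    delta = 3.0 * np.log(10) / rt60                # amplitude decay rate implied by RT60
    d = max(1, round(delay_ms * 1e-3 * fs / hop))  # frames to the early/late boundary
    att = np.exp(-2.0 * delta * d * hop / fs)      # power attenuation over that delay
    n_frames = 1 + (len(y) - frame) // hop
    Y = np.array([np.fft.rfft(win * y[j*hop:j*hop+frame]) for j in range(n_frames)])
    P = np.abs(Y) ** 2
    out = np.zeros(len(y))
    for j in range(n_frames):
        # late-reverberation PSD estimate: decayed PSD of a past reverberant frame
        R = att * P[j - d] if j >= d else np.zeros_like(P[j])
        G = np.maximum(1.0 - R / np.maximum(P[j], 1e-12), floor)  # floored gain
        out[j*hop:j*hop+frame] += np.fft.irfft(G * Y[j], frame)   # overlap-add
    return out
```

The fixed gain floor is the crudest possible guard against musical noise; the signal-dependent relaxation criteria and perceptual filtering stage proposed in this work replace exactly this kind of fixed constraint.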

The results show that the technique achieves significant reverberation reduction, while remaining robust to source-receiver position changes.

A novel binaural extension of single-channel late reverberation suppression techniques has also been developed. The proposed implementation relies on bilateral gain adaptation and, apart from reducing reverberation, it also preserves the binaural localization cues.

Furthermore, a blind dereverberation method based on perceptual reverberation modeling has been developed. This technique employs a computational auditory masking model and locates the signal regions where late reverberation is audible, i.e. where it is unmasked from the clean signal components. Following a selective signal processing approach, only such signal regions are further processed through sub-band gain filtering. The above technique has been evaluated for both speech and music signals and for a wide range of reverberation conditions. In all cases it was found to minimize the processing artifacts and to produce perceptually superior clean signal estimates compared to any other tested technique. Moreover, extensive ASR tests have shown that it significantly improves recognition performance, especially in highly reverberant environments.
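The bilateral gain adaptation underlying the binaural extension can be sketched in a few lines: single-channel spectral gains are first computed independently for the left and right ears, then combined into one common gain (the nomenclature lists minimum, average and maximum strategies) that is applied to both channels, so that interaural level and time differences are left intact. The function name and example values below are hypothetical.

```python
import numpy as np

def bilateral_gain(G_left, G_right, strategy="avg"):
    """Combine per-ear spectral subtraction gains into one bilateral gain.
    Applying the SAME gain to both channels preserves ILD/ITD cues."""
    if strategy == "min":
        return np.minimum(G_left, G_right)   # mingain: strongest suppression
    if strategy == "max":
        return np.maximum(G_left, G_right)   # maxgain: most conservative
    return 0.5 * (G_left + G_right)          # avggain: compromise

# Hypothetical per-channel gains for one STFT frame (four frequency bins):
Gl = np.array([0.2, 0.8, 1.0, 0.5])
Gr = np.array([0.6, 0.4, 1.0, 0.9])
G = bilateral_gain(Gl, Gr, "avg")  # G then multiplies BOTH the left and right spectra
```

A per-channel (unilateral) gain would suppress more reverberation, but independently altering the two ear signals distorts the interaural cues that the binaural extension is designed to preserve.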


Contents

Contents
List of Figures

1 Introduction
   Reverberation
   Scope and motivation
      Speech degradation from reverberation
      Music degradation from reverberation
   Objectives
   Outline and main contributions

2 Fundamentals
   Introduction
   Room acoustics prerequisites
      The room impulse response
      RIR measurement
      Early/late reflections
      Reverberation time
      RT blind estimation
      Other important room acoustics parameters
   Auditory models, masking and reverberation perception
      Masking
      The critical band
      The computational auditory masking model (CAMM)

      The reverberation masking index (RMI)
      Binaural hearing

3 Dereverberation: Literature Summary
   Introduction
   Suppression of early reflections / decoloration
      Inverse filtering
      Cepstral techniques
      LP residual enhancement
   Late reverberation suppression
      Temporal envelope filtering
      Spectral subtraction
      Dereverberation method proposed by Lebart et al.
      Dereverberation method proposed by Wu and Wang
      Dereverberation method proposed by Furuya and Kataoka
   Binaural techniques
   Multi-channel dereverberation

4 Generalized Framework for Improved Spectral Subtraction Dereverberation
   Introduction
   Dealing with musical noise
   Audible reverberation suppression
   Dealing with overestimation errors
   Signal-dependent constraints
      Power relaxation criterion
      Normalized cross-correlation criterion
   Generalized framework
   Tests and results
      Evaluation of the audible reverberation suppression stage
      Evaluation of the proposed unified framework
   Conclusion

5 Late Reverberation Suppression at Multiple Speaker Positions
   Introduction
   Theoretical formulation
      Definition of the early/late reflections boundary
      Gain magnitude regularization
   The handclap approximation
   Tests and results
      Results for measured RIRs
      Results for recorded handclaps
   Conclusion

6 Binaural Late Reverberation Suppression
   Introduction
   Proposed binaural dereverberation processing
   Tests and results
   Conclusion

7 Blind Dereverberation Based on Perceptual Reverberation Modeling
   Introduction
   Method overview
   Method description
   Tests and results
   Method implementation
   Tests and results for music signals
      Spectrogram evaluation
      Segmental signal-to-reverberation ratio evaluation
      Segmental noise-to-mask ratio evaluation
   Tests and results for speech signals
      Cepstral distance evaluation
      Perceptual evaluation of speech quality
   Subjective performance evaluation
   Discussion

   Conclusion

8 Automatic Speech Recognition Performance Improvement
   The effect of room acoustic parameters on automatic speech recognition performance
   Test methodology
   Reverberant speech ASR performance
   Adjusting the dereverberation parameters for ASR
   Comparison of late reverberation suppression techniques
      The effect of RT and source-receiver distance
      Correlation with D50 and C50
   Overall recognition improvement
   Conclusion

9 Conclusions and Future Work
   Conclusions
   Future work

Appendix A: Computational Cost of the Proposed Algorithms
Appendix B: Related Publications
References

List of Figures

- 2.1 Time domain (a) and frequency domain (b) representation of a RIR measured at the UoP Conference Centre main hall
- Different parts of a RIR
- Log-square representation of a RIR
- Illustrative example: RT estimation from a measured RIR
- Absolute threshold of hearing over frequency
- Illustrative diagram of the different masking types
- Masking curve for a narrow-band noise masker with a bandwidth of 160 Hz, centered at 410 Hz. After Egan and Hake [40]
- Backward and forward masking. After Eliot [41]
- Signal flow for the first stage of the CAMM model
- The Decision Device of the CAMM model utilized for the RMI derivation
- Illustrative figure: geometric estimate of the ITD
- Illustrative example of the masking threshold calculation. For a given signal block the figure illustrates the signal magnitude (blue line), the tonal and non-tonal components (circles), the absolute threshold (black solid line) and the derived masking threshold (black dashed line)
- Signal flow of the proposed unified framework
- Spectrograms of speech signals: (a) anechoic speech, (b) reverberant speech, (c) late reverberation suppression using the FK method, (d) late reverberation suppression using the FK method together with the perceptually-motivated non-linear filtering stage

- 4.4 Time domain representation of speech signals: (a) clean speech, (b) inverse-filtered reverberant speech signal, (c) late reverberation suppression by FK, (d) late reverberation suppression by Modified FK, (e) late reverberation suppression by LB, (f) late reverberation suppression by Modified LB
- Average NMR (dB) for varying source-receiver distances and for (a) RT = 0.24 s, (b) RT = 0.38 s, (c) RT = 0.58 s
- Average LPC Cepstrum Distance difference between the modified and the unmodified versions of the algorithms for varying RT conditions: (a) signal onsets, power relaxation threshold set at 3 dB, (b) signal onsets, threshold at 10 dB, (c) steady state, threshold at 3 dB, (d) steady state, threshold at 10 dB, (e) signal tails, threshold at 3 dB, (f) signal tails, threshold at 10 dB
- Typical time domain representations for the case of prolonged phonemes: (a) clean speech, (b) inverse-filtered reverberant speech signal, (c) in black: late reverberation suppression by FK, in gray: by Modified FK, (d) in black: late reverberation suppression by LB, in gray: by Modified LB
- Signal flow of the proposed method
- Illustration of the measurement setup for the RIRs used in Chapter 5
- SRR improvement (in dB) for different cases in (a) Room 1 and (b) Room 2
- Illustration of the measurement setup for the RIR and handclaps
- Mean SRR difference of the reverberant and the estimated clean signals for (a) Room 3, (b) Room 4
- Mean NMR difference between the estimated clean and the corresponding reverberant signals for (a) Room 3, (b) Room 4

- 6.1 SRR difference for the three tested methods (LB, WW, FK) for (a) Stairway Hall, (b) Cafeteria (DSB: Delay and Sum Beamformer; maxgain: maximum bilateral gain adaptation; avggain: average bilateral gain adaptation; mingain: minimum bilateral gain adaptation)
- Block diagram overview of the proposed method
- Typical representations for a percussion music sample in a sub-band (k = 10) of (a) the reverberant signal (in light grey) and the clean signal (in black), (b) the reverberant signal, the corresponding RMI and its local extrema, (c) signal regions of the reverberant signal that will remain intact (in light grey) and signal regions that will be processed (in dark grey), (d) the derived sub-band gain function
- Typical spectrograms: (a) clean signal, (b) equalized reverberant signal, (c) clean signal obtained with the proposed method utilizing the Modified FK rough estimation, (d) clean signal obtained with the proposed method utilizing the Modified LB rough estimation
- SRR difference between the estimated clean signal and the equalized reverberant signal for different methods, RT conditions and sample types: (a) percussion, (b) guitar, (c) speech, (d) cello
- NMR difference between the estimated clean signal and the equalized reverberant signal for different methods, RT conditions and sample types: (a) percussion, (b) guitar, (c) speech, (d) cello
- Cepstral Distance difference between the estimated clean speech and the equalized reverberant speech for different RT conditions and for (i) the proposed method utilizing the Modified LB method as preliminary estimation, (ii) the proposed method utilizing the Modified FK method as preliminary estimation, (iii) the LB method, (iv) the FK method, (v) the WW method, (vi) the Modified LB method, (vii) the Modified FK method

- 7.7 PESQ difference between the estimated clean speech and the equalized reverberant speech for different RT conditions and for (i) the proposed method utilizing the Modified LB method as preliminary estimation, (ii) the proposed method utilizing the Modified FK method as preliminary estimation, (iii) the LB method, (iv) the FK method, (v) the WW method, (vi) the Modified LB method, (vii) the Modified FK method
- Mean subjective ratings: (a) across different RT scenarios and (b) across different sample types
- Mean subjective ratings for (a) the equalized reverberant signal, (b) the clean signals obtained with the Modified LB, (c) the clean signals obtained with the proposed method utilizing the Modified FK rough estimation and (d) the clean signals obtained with the proposed method utilizing the Modified LB rough estimation
- Energy Decay Curve (dB) of the RIRs for all tested distances: (a) Room 1, (b) Room 2, (c) Room 3, (d) Room 4, (e) Room 5, (f) Room 6
- Phone Recognition Rates over various room acoustic parameters: (a) D50, (b) C50, (c) C80, (d) Ts, (e) EDT, (f) RT
- The derived gain functions in a single sub-band ( Hz) for (a) γ2 = 0.05 and γ1 = 0.5, 0.3, 0.1 and (b) γ1 = 1 and γ2 = 0.1, 0.3, 0.5. The absolute reverberant signal (in black) and the rough clean signal estimation (in grey) are also shown
- Phone Recognition Rates (%) for source-receiver distances 1.5, 2, 3, 4 m and for (a) Room 1, (b) Room 2, (c) Room 3, (d) Room 4, (e) Room 5, (f) Room 6. Results are shown for (i) the reverberant signal, (ii) Tsilfidis & Mourjopoulos dereverberation using γ1 = 0.1 and γ2 = 0.5 (TM1), (iii) Tsilfidis & Mourjopoulos dereverberation using γ1 = 0.3 and γ2 = 0.05 (TM2), (iv) Lebart et al. (LB) dereverberation, (v) Furuya & Kataoka (FK) dereverberation, (vi) Wu & Wang (WW) dereverberation
- Phone Recognition results (%) over D50

- 8.6 Phone Recognition results (%) over C50
- Relation between C50 (left y-axis) and D50 (right y-axis) with the overall Phone Recognition Improvement (%)
- CPU time for the LB, WW and FK algorithms
- CPU time for the original and modified versions of the LB and FK algorithms
- CPU time for the moving speaker dereverberation algorithm
- CPU time for the dereverberation based on perceptual reverberation modeling


Nomenclature

Symbols and Variables

γ1, γ2: constants that control the suppression rate in the function G_k(n)
G_k(n): gain function in the sub-band k (see Chapter 7)
ω: frequency bin
Φ_jl: normalized cross-correlation between the frames j and l
c: speed of sound
Cf_k: crest factor of the sub-band k
D_k(n), D̂_k(n): RMI and RMI estimate for the sub-band k
G(ω, j): single-channel spectral subtraction gain
G_k(i): value of the gain function G_k(n) at each local RMI maximum
g_k(i): value of the gain function G_k(n) at each local RMI minimum
G_l(ω, j), G_r(ω, j): left and right channel binaural spectral subtraction gains
H(n): the Heaviside step function
h_r(n): RIR
i: index of consecutive pairs of RMI extrema
j: time frame
M_k(i), m_k(i): time intervals where each local RMI maximum and the succeeding local RMI minimum occur, respectively
n: discrete time index

P_Yj: power of the reverberant frame j
R(ω, j): short-time magnitude spectrum of late reverberation (estimate)
s(n): anechoic signal
s_d(n): direct signal
S_e(ω, j): short-time magnitude spectrum of the clean signal (estimate)
t: continuous time index
T_Se(ω, j): masking threshold of S_e(ω, j)
Y(ω, j): short-time magnitude spectrum of the reverberant signal
y(n): reverberant signal

ACR: Absolute Category Rating
ANOVA: Analysis Of Variance
ANS: Audible Noise Suppression
ARS: Audible Reverberation Suppression
ASR: Automatic Speech Recognition
avggain: Average Gain Binaural Adaptation Strategy
BRIR: Binaural Room Impulse Response
C50: Clarity (over the first 50 ms of the RIR)
C80: Clarity (over the first 80 ms of the RIR)
CAMM: Computational Auditory Masking Model
CB: Critical Band
CD: Cepstrum Distance
D50: Definition (over the first 50 ms of the RIR)
DD: Decision Device
DSB: Delay and Sum Beamformer
EDC: Energy Decay Curve

EDT: Early Decay Time
ERB: Equivalent Rectangular Bandwidth
FK: the dereverberation method proposed by Furuya and Kataoka [51]
GMR: Gain Magnitude Regularization
HMM: Hidden Markov Model
i.i.d.: independent identically distributed
ILD: Interaural Level Difference
IR: Impulse Response
ITD: Interaural Time Difference
LB: the dereverberation method proposed by Lebart et al. [107]
LP: Linear Prediction
LPC CD: Linear Prediction Coding Cepstrum Distance
LTI: Linear and Time-Invariant
maxgain: Maximum Gain Binaural Adaptation Strategy
mingain: Minimum Gain Binaural Adaptation Strategy
MLS: Maximum Length Sequence
MOS: Mean Opinion Score
MUSHRA: Multiple Stimuli with Hidden Reference and Anchor
NMR: Noise to Mask Ratio
PESQ: Perceptual Evaluation of Speech Quality
PRI: Phone Recognition Improvement
PRR: Phone Recognition Rate
PSD: Power Spectral Density
RIR: Room Impulse Response
RMI: Reverberation Masking Index

RMS: Root Mean Square
RT: Reverberation Time
RTF: Room Transfer Function
SDC: Signal Dependent Compression
SIMO: Single Input Multiple Output
SISO: Single Input Single Output
SNR: Signal to Noise Ratio
SRR: Signal to Reverberation Ratio
STFT: Short Time Fourier Transform
TM: the dereverberation method proposed by Tsilfidis and Mourjopoulos [189]
Ts: Centre Time
WW: the dereverberation method proposed by Wu and Wang [216]

Chapter 1

Introduction

"The minds of men are mirrors to one another, not only because they reflect each other's emotions, but also because those rays of passions, sentiments and opinions may be often reverberated, and may decay away by insensible degrees."
David Hume, A Treatise of Human Nature

1.1 Reverberation

When a sound is emitted in a closed space, the reflections and diffractions from the surrounding surfaces distort the original source and reverberation becomes an important element of the auditory event. Hence, the qualitative and quantitative aspects of the received signal are affected from both a physical (objective) and a perceptual (subjective) perspective. Therefore, reverberation is the topic of extensive interdisciplinary research: the propagation of sound waves in closed spaces is investigated as a subtopic of physics, namely physical acoustics. In cognitive sciences, reverberation is studied in the context of spatial awareness, speech perception and music cognition. In the arts, reverberation plays an important role

primarily in music but also in theatre and cinema. In the engineering framework, reverberation is examined in the fields of architecture, audio and speech signal processing, and communications.

Reverberation has influenced the evolution of human culture since prehistoric times. Waller [207] suggests that Palaeolithic rock art was related to the reverberant nature of the prehistoric caves. In ancient times, ritual spaces and temples were acoustically designed in order to impress the visitors via their unique reverberant characteristics [204]. In ancient Greek mythology, Echo was a beautiful nymph famous for her delightful singing and music performance. However, the goddess Hera punished her by taking away her voice and only letting her repeat what everybody else said. When Echo died, her voice remained and spread all over the earth. Similar goddesses are also present in other ancient traditions [112].

In the Republic [149], Plato uses the cave allegory to challenge the validity of our sense data. When he comes to the auditory experience he writes: "And suppose further that the prison had an echo which came from the other side, would they not be sure to fancy when one of the passers-by spoke that the voice which they heard came from the passing shadow?". In this quote, Plato refers to the actual problem of speaker identification in reverberant environments, as it seems that the degradation of speech due to reverberation distortion was already known.
Furthermore, in another quote of the same Socratic dialogue the perceptual impact of reverberation is discussed: "When they meet together, and the world sits down at an assembly, or in a court of law, or a theatre, or a camp, or in any other popular resort, and there is a great uproar, and they praise some things which are being said or done, and blame other things, equally exaggerating both, shouting and clapping their hands, and the echo of the rocks and the place in which they are assembled redoubles the sound of the praise or blame; at such a time will not a young man's heart, as they say, leap within him?".

Reverberation is a natural phenomenon, present in our everyday lives, and according to Blesser [18]: "Reverberation is the story of humans and their culture." From a perceptual point of view, reverberation is directly linked to the bigger topic of spatial awareness. Depending on the application context, reverberation may be considered a pleasant sound quality or a harmful sound distortion. In music acoustics, the reverberation of concert halls influences the character of

music performance or even the musical style of composition and orchestration [160], while religious ceremonies usually take place in highly reverberant enclosures [18]. In addition, there is a great research effort on the topic of artificial reverberation (e.g. [88, 124, 163]). Artificial reverberators, i.e. the hardware or software implementations used to add reverberation to dry recordings, are widely used in music performance and recording but also in cinema, theatre, video games, virtual reality systems, etc.

On the other hand, there are many applications where reverberation is considered an unwanted distortion deteriorating the quality of acoustic signals. Reducing or completely removing reverberation from audio and speech signals has been a challenging research issue for at least four decades (e.g. [49, 119]). A literature summary of dereverberation methods is given in Chapter 3. Most recent dereverberation techniques have been developed specifically for speech signals, since reverberation (and essentially late reverberation) is known to reduce speech quality and intelligibility and to deteriorate the performance of Automatic Speech Recognition (ASR) systems (see Section 1.2.1). Furthermore, dereverberation of music signals can be useful in many audio applications, such as post-processing of recorded signals, in active music listening systems, as well as in music signal classification, automatic music transcription, analysis and melody detection (see Section 1.2.2).

Despite the efforts of the research community, the dereverberation problem is far from being solved. From a mathematical point of view, blind dereverberation is a blind deconvolution problem. However, blind deconvolution is only possible when the involved signals are irreducible, i.e. when they cannot be exactly expressed as the convolution of two or more component signals.
In speech and audio applications this is almost never true and the direct application of blind deconvolution techniques is ambiguous (see Section 3.5). In addition, the room is usually modeled as a Linear and Time-Invariant (LTI) filter. However, the filter order can reach many thousands of taps and, for even a subtle change in the source-receiver position, the filter changes, so the model becomes time-variant. Moreover, the sound velocity is practically non-uniform and a 0.5 °C random thermal change can displace the arrival time of a single echo by about 2 ms. Due to this uncertainty, the high-frequency reverberation components cannot be accurately measured [18, 161].
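The LTI room model just described can be illustrated with a short sketch: a synthetic RIR following Polack's statistical model (exponentially decaying Gaussian noise, an assumption made here for illustration; the thesis itself relies on measured responses), convolved with a dry signal to yield a reverberant one, and the reverberation time recovered from the RIR via Schroeder's classic backward integration of the squared response. The -5 to -35 dB fitting range is a common convention, not necessarily the one used later in this work.

```python
import numpy as np

def synth_rir(fs, rt60, length_s=0.8):
    """Synthetic RIR (Polack's model): Gaussian noise shaped by an exponential
    envelope whose energy decays by 60 dB in rt60 seconds."""
    t = np.arange(int(fs * length_s)) / fs
    envelope = np.exp(-(3.0 * np.log(10) / rt60) * t)
    rng = np.random.default_rng(1)
    return rng.standard_normal(t.size) * envelope

def estimate_rt60(h, fs):
    """Schroeder backward integration: the energy decay curve (EDC) is the
    reversed cumulative sum of h^2; RT is read off a line fitted to the
    -5..-35 dB portion of the EDC, extrapolated to -60 dB."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0])
    i5, i35 = np.argmax(edc_db <= -5.0), np.argmax(edc_db <= -35.0)
    t = np.arange(h.size) / fs
    slope, _ = np.polyfit(t[i5:i35], edc_db[i5:i35], 1)  # decay in dB per second
    return -60.0 / slope

h = synth_rir(16000, rt60=0.5)
# room-as-LTI-filter model: reverberant signal = dry signal convolved with the RIR
reverberant = np.convolve(np.random.default_rng(2).standard_normal(16000), h)
```

Such statistical RIR models underpin blind estimators of late reverberation precisely because, as noted above, a deterministic high-frequency description of the room response is unattainable.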

Obviously, a deterministic representation of reverberation is not feasible and approximate models of reverberation are used, such as the image method [3], ray tracing techniques [206] and finite element methods [7]. Note also that there is a huge variety of listening spaces, ranging from small offices and listening rooms to large auditoriums and cathedrals, each one presenting different reverberant acoustic characteristics [160].

Reverberation is associated with perceptual attributes related to everyday hearing, approaching the higher cognitive level. Such auditory processes are known to interfere with other senses, especially with vision [118]. Stein and Meredith [172] state that the superior colliculus, i.e. the part of the brain that contains the visual map, also integrates a corresponding auditory map. This indicates that even if a complete blind removal of reverberation from a sound signal were possible, it would probably not result in perfect perceptual dereverberation. It is obvious, though, that reverberation is more than a typical additive acoustic noise and that dereverberation is a largely complicated problem.

In the context of this thesis, novel acoustic signal processing methods for suppressing reverberation from both speech and music signals were developed. Despite the complexity of the topic, the proposed algorithms are applicable to many engineering problems, dealing with a wide range of reverberation conditions. Moreover, the proposed techniques significantly reduce the audible processing artifacts, addressing the dereverberation problem in a meaningful manner from a perceptual point of view.

1.2 Scope and motivation

In order to further explain the motivation for this thesis, the harmful aspects of reverberation on speech and music signals are discussed below.

1.2.1 Speech degradation from reverberation

In general, reverberation is considered to reduce speech intelligibility.
However, the early reflections can prove beneficial, as they increase the perceived level of the direct sound [171, 208]. On the other hand, late reverberation may degrade the human speech recognition performance [21, 104, 113] and the effect of room acoustics becomes more pronounced in noisy environments [23].

The loss of speech intelligibility is fundamentally due to the masking of phonemes [20] (as a result of the decaying reverberant tails) and becomes more important for non-native and hearing-impaired listeners [132, 178]. In order to predict speech intelligibility in closed spaces, room parameters such as the %Alcons and the RASTI were developed [106, 148]. Speech quality is also degraded by reverberation. According to Allen [2], the main factors affecting the quality of speech are the Reverberation Time (RT) and the spectral deviation of the impulse response. However, the perceptual dimensions of the reverberation degradation are far more complicated [26].

Speech recognition by humans in reverberant environments is robust when compared to the performance of Automatic Speech Recognition systems [96, 110]. Although such systems have already been well developed and even embedded in commercial applications, in real-life cases room reverberation severely degrades their performance, making the use of such systems less effective. Several methods have been developed in order to overcome this problem. A first class of compensating techniques is implemented in the ASR front-end and aims to make the speech features robust to reverberation (e.g. [36, 75, 142, 168, 214]); a second class of techniques is employed in the ASR back-end and intends to properly adapt the acoustic model (e.g. [59, 60, 156]). The third approach is to implement a preprocessing step attempting to suppress reverberation before the derivation of the speech feature vectors (see Chapters 3 and 8).

Room reverberation significantly alters the clean signal's statistics [143]. Hence, reverberation (and especially late reverberation) is harmful for practically every speech application involving machine learning, such as automatic speaker recognition [144] and emotion recognition [167].
Furthermore, it degrades the performance of speech detection [222], speech separation [100], pitch tracking [153] and speech segregation [159] algorithms. In all the above cases a dereverberation preprocessing step may improve the performance of the algorithms.

1.2.2 Music degradation from reverberation

The quality of reproduced music and audio signals in closed spaces is mainly affected by the loudspeaker and the room responses [126], and commercial room correction systems that try to compensate for these degradations have already been developed [137]. However, the fundamental limitation of such systems is that they can only deal with the coloration produced by the early reflections [72, 91], as it is impossible to establish a causal inverse filter for every source-receiver position in a room [125]. A short review of room inverse-filtering techniques trying to overcome the above limitation is given in Chapter 3.

In addition, music is often recorded in venues that are not acoustically treated, and reducing the reverberation effect is a necessary step before further processing. In such cases, sound engineers deal with the coloration produced by the early reflections by utilizing (digital) equalizers. For the late reverberation, in the absence of a specialized tool, sound engineers are often obliged to manually suppress the reverberant tails or to use tools that were not directly developed to handle such problems (e.g. noise gates).

As in the case of speech, reverberation severely changes the music signal's statistics [55, 56]. Hence, reverberation (and mainly late reverberation) causes a decrease of performance in music signal classification, automatic music transcription, analysis and melody detection, source separation, etc. [6, 8, 100, 180, 212, 217].

1.3 Objectives

From the past history it appears that developing methods for reducing or completely removing reverberation from speech and music signals poses a great scientific challenge. At the same time, it is of great practical importance, since such techniques can prove beneficial for many engineering applications.
Therefore, the main focus of this thesis was to develop novel dereverberation algorithms, and emphasis was given to blind single-channel dereverberation, which from the historical perspective is the most challenging dereverberation task. From the preceding discussion it became apparent that late reverberation results: (i) in the reduction of speech quality and intelligibility, (ii) in the degradation of music quality, (iii) in a decrease of ASR systems' performance and (iv) in a performance decrease of speech and music processing algorithms. Obviously, late reverberation is particularly harmful for both speech and music signals, and the methods presented in this thesis compensate exactly for this type of degradation. Most recent dereverberation algorithms have been developed specifically for speech signals. However, such techniques are not always appropriate for processing broadband audio signals (e.g. music), as in principle music dereverberation is more challenging (see the related discussion in Chapter 7). The scope of this thesis was to develop novel techniques appropriate for the dereverberation of both speech and music. The diversity of the problem implies that a different technique may be optimal for each dereverberation scenario. Hence, the methods developed in this thesis were employed in the spectral, time and sub-band domains. In addition, given that the acoustic principles in small and large enclosures are different [181], the proposed dereverberation methods were designed to be applicable in a wide range of acoustic scenarios, ranging from small offices to large auditoria. As discussed in Section 3.5, many researchers evaluate their algorithms in unrealistic simulated environments. Here, however, the objective was to address real-life problems, employing in situ measurements of real room responses. Note also that dereverberation often introduces annoying processing artifacts, whereas the techniques proposed here produce higher quality dereverberated signals, significantly reducing artifacts and processing distortions. Last but not least, the goal of this thesis was to propose perceptually compliant methods that achieve subjectively superior results.
In a recent review of a speech processing textbook [34] it is stated: "There is also a philosophical problem associated with the development of any noise- or reverberation-reduction algorithm and not just the ones presented in this book. The approaches are chosen because they are mathematically tractable and not because they are perceptually relevant. [...] So there is a further challenge to the editors of this book and others working in the field of speech processing: How to integrate models of auditory perception into the problem definition and optimization procedures." One of the main objectives of this thesis was to employ such auditory modeling, in order

to develop dereverberation algorithms that significantly enhance the reverberant signals without compromising their perceptual quality.

1.4 Outline and main contributions

In this section, an outline of the thesis is given, also pointing to the main contributions of the study. The related publications are also mentioned. Chapter 2 presents fundamental concepts of room acoustics, auditory modeling and binaural hearing. This background information is essential to understand the proposed signal processing algorithms. Furthermore, in Chapter 3 a literature survey on dereverberation is presented, mainly focusing on late reverberation suppression techniques.

In Chapter 4 a generalized framework for improving spectral subtraction dereverberation is presented: At first, spectral subtraction dereverberation algorithms were implemented and their performance was investigated. These methods assume that late reverberation is an additive noise that can be estimated based on RIR modeling, or on reverberant signal statistics. A comparative study showed that such techniques often overestimate late reverberation, introducing musical noise artifacts and degrading low-level components and signal transients. [186, 187] In order to compensate for the musical noise, a perceptually-motivated nonlinear filtering approach was employed. The original technique was proposed by Tsoukalas et al. [199, 200] for denoising applications. In this thesis it is shown that, despite the time-varying nature of the reverberation interference, the use of this perceptually-motivated method significantly improves the resulting signals. [191, 192] Moreover, in order to preserve the signal's transients from overestimation errors, two novel signal-dependent constraints were proposed. The Power Relaxation Criterion identifies and preserves the signal's onsets, whilst the Normalized Cross-Correlation Relaxation Criterion identifies and preserves

subsequent correlated frames appearing due to sustained notes (in music) or prolonged phonemes (in speech). [186, 187] The above modifications provide a novel generalized framework potentially improving most existing late reverberation suppression methods. These criteria were implemented in the STFT domain and can be easily incorporated in existing spectral subtraction algorithms. Objective and subjective results indicate that the proposed framework achieves significant late reverberation suppression with fewer transient and spectral distortions. [186, 187] Therefore, the publications related to Chapter 4 are [186, 187, 191, 192].

In teleconference and hands-free applications, the speaker is usually moving and efficient single-channel dereverberation is of great interest. Chapter 5 presents a novel spectral subtraction method for late reverberation suppression at multiple speaker positions: Assuming that the late reverberant part of a Room Impulse Response is stationary at different room positions, a new spectral subtraction dereverberation approach has been proposed, utilizing a measured RIR and the excitation signal derived from the Linear Prediction (LP) analysis of the reverberant signal. [197] A novel step for adjusting the suppression rate and eliminating the musical noise has been introduced. The proposed low-complexity implementation is based on Gain Magnitude Regularization (GMR). Low Signal to Reverberation Ratio (SRR) signal regions are identified and the suppression is dynamically constrained in order to prevent overestimation errors. [101, 194, 195, 196, 197] The above technique has been extended in a semi-blind framework. It has been shown that a recorded handclap can be used in order to extract an approximation of the Power Spectral Density of late reverberation.
The method achieves sufficient robustness with respect to different source/receiver arrangements within the room, and the results showed significant speech enhancement independent of the reference room. [101, 194]
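The gain-regularized spectral subtraction idea that recurs in Chapters 4 and 5 can be sketched in a few lines. The following is an illustrative, generic implementation, not the thesis algorithm: the late-reverberation power spectrum is assumed to be already estimated, and the frame length, overestimation factor and gain floor are arbitrary placeholder values.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction_dereverb(y, fs, late_psd, alpha=1.0, gain_floor=0.1):
    """Suppress late reverberation by spectral subtraction (generic sketch).

    y          : reverberant signal
    late_psd   : estimate of the late-reverberation power spectrum, one value
                 per STFT frequency bin (its estimation is the hard part in
                 practice and is assumed given here)
    alpha      : overestimation factor
    gain_floor : lower bound on the spectral gain; flooring the gain
                 magnitude limits overestimation errors and musical noise
    """
    f, t, Y = stft(y, fs, nperseg=512)
    power = np.abs(Y) ** 2
    # Fraction of the observed power attributed to the dry signal
    gain = 1.0 - alpha * late_psd[:, None] / np.maximum(power, 1e-12)
    gain = np.clip(gain, gain_floor, 1.0)
    _, x_hat = istft(gain * Y, fs, nperseg=512)
    return x_hat
```

The gain floor plays the role that the more elaborate, signal-dependent regularization plays in the proposed methods: it prevents the subtraction from driving low-SRR regions to zero.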

Hence, the above work led to the following publications: [101, 194, 197].

The development of binaural dereverberation algorithms is important in the context of digital hearing aids, binaural telephony and immersive telecommunications. In Chapter 6 binaural extensions and performance evaluation of dereverberation methods are presented: A binaural extension of spectral subtraction dereverberation methods has been developed. The proposed implementation, apart from suppressing reverberation without introducing processing artifacts, also preserves the signal's binaural localization cues. [57, 195, 196] This generalized approach is based on the adaptation of the spectral gains derived by bilateral processing, and three possible gain adaptation strategies were investigated. The gain adaptation schemes were evaluated in three state-of-the-art spectral subtraction dereverberation algorithms. [195, 196] The most prominent binaural extensions were revealed through objective measures for several experimental conditions, indicating the optimum dereverberation approach for each reverberation scenario. [195, 196] Therefore, the publications related to Chapter 6 are [57, 195, 196].

Chapter 7 presents a blind, single-channel late reverberation suppression method based on perceptual reverberation modeling: A spectral preprocessor was used in order to achieve a first rough estimation of the clean signal. For this, any spectral subtraction algorithm can be employed; however, incorporating the modifications presented in Chapter 4 was found beneficial. [188, 189] A time-frequency auditory model has been incorporated in the dereverberation process. This model quantifies the reverberation distortion throughout the signal's evolution and locates the signal regions where late reverberation is audible, i.e. where it is unmasked from the clean signal. [189] The proposed method has employed a selective signal-processing approach, where only the signal components that are badly contaminated from late

reverberation are processed through hybrid sub-band filtering. The above filtering is adaptively adjusted based on indicators of the severity of the reverberation degradation. This novel approach resulted in substantial reverberation reduction with fewer processing artifacts than any other tested technique. [189, 190] Psychoacoustic research has pointed out that the temporal envelope of a signal should be viewed as a real signal within the auditory system [211]. Despite the fact that it is well known that temporal processing can be used to suppress late reverberation [129], recent late reverberation suppression methods are employed entirely in the spectral domain. The proposed technique suppresses reverberation involving temporal processing in perceptually-significant sub-bands. [188, 189, 190] Most dereverberation techniques are designed either for speech or for music signals. However, the proposed method achieves superior results for both speech and music signals. [189, 190] Hence, Chapter 7 is related to the following publications: [188, 189, 190].

Chapter 8 focuses on the Automatic Speech Recognition performance improvement achieved from dereverberation preprocessing algorithms: A Room Impulse Response database covering a wide range of acoustic scenarios has been constructed, and phone recognition results in these reverberant environments were obtained. The results indicate that position-dependent room acoustics parameters, such as the clarity C50 and the definition D50, are more appropriate than the widely used RT when evaluating the ASR performance degradation in closed spaces. [198] The dereverberation method described in Chapter 7 was fine-tuned for ASR applications and the optimal method parameters were found. [198] An extensive study on the effect of the dereverberation preprocessing on ASR performance improvement was made. The perceptually-motivated approach proposed in Chapter 7 [189] was found to achieve superior results

than any other tested technique, especially in highly reverberant environments. [198] It has been shown that the established acoustic parameters correlate well with the performance of ASR systems and that they can possibly be used as predictors of the performance improvement. [198] Hence, the related publication is [198].

Finally, conclusions and motivation for future work are given in Chapter 9. In Appendix A an informal evaluation of the computational cost of the proposed algorithms is made, while Appendix B presents a list of the author's publications.

Chapter 2

Fundamentals

2.1 Introduction

In this Chapter, fundamental concepts of room acoustics, auditory modeling and binaural hearing are discussed. In Section 2.2, the Room Impulse Response (RIR) concept is detailed and the RIR measurement procedure is explained. In addition, the Reverberation Time (RT) and other room acoustic parameters are defined. Section 2.3 includes a synopsis of the auditory masking theory, a description of a computational time-frequency auditory masking model and an outline of the reverberation perceptual modeling framework used in this work. Finally, in Section 2.4, background information on binaural hearing is given.

2.2 Room acoustics prerequisites

The room impulse response

In signal processing theory, the Impulse Response (IR) of a given system is the system's output when it is excited with an impulse input. The impulse is theoretically modeled as a Dirac delta function in the continuous-time domain, or a Kronecker delta function in the discrete-time domain. The Dirac delta function

is defined as:

$$\delta(t) = \begin{cases} +\infty, & t = 0 \\ 0, & t \neq 0 \end{cases} \quad (2.1)$$

and

$$\int_{-\infty}^{+\infty} \delta(t)\, dt = 1 \quad (2.2)$$

where t denotes the continuous time index [140]. Note that from a mathematical point of view the Dirac delta is a distribution, even if it is usually manipulated as a function. Respectively, in the discrete-time domain the Kronecker delta is defined as:

$$\delta[n] = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases} \quad (2.3)$$

where n denotes the discrete time index. If a system is Linear and Time-Invariant (LTI), it can be completely described by its IR, and the time-domain system output is derived as the convolution of the system's input with its IR [139]. In the field of Acoustics, the sound waves emitted from a source in a closed space reflect and diffract on the surrounding surfaces. These reflections result in a superposition of delayed and filtered copies of the original source at the receiver's position, and the room is roughly assumed to be an LTI system. Thus, the RIR has great practical use, as it contains all room acoustics information for a given source-receiver position. When a sound signal s(t) is emitted in a reverberant room with a RIR h_r(t), the reverberant signal y(t) is obtained as:

$$y(t) = \int_0^t s(\tau)\, h_r(t-\tau)\, d\tau, \quad 0 \leq t < L \quad (2.4)$$

where L is the length of the RIR. However, in practical implementations where the sound source is moving, the RIR is time-variant and the above equation becomes:

$$y(t, \theta) = \int_0^t s(\tau)\, h_r(t-\tau, \theta)\, d\tau, \quad 0 \leq t < L \quad (2.5)$$

where θ denotes the dependence on the room position. A RIR measured in a concert hall is shown in Fig. 2.1, in (a) the time and (b)

Figure 2.1: Time domain (a) and frequency domain (b) representation of a RIR measured at the UoP Conference Centre main hall.

the spectral domain. The frequency response of a RIR is typically called the Room Transfer Function (RTF). The spectral degradation produced by the reflections (and essentially by the early reflections) is shown in the RTF presented in Fig. 2.1 (b). Note that the frequency resolution of the human ear is limited and the fine spectral details are not perceptually important. Hence, acousticians are usually interested in the smoothed RTF characteristics (see Fig. 2.1 (b)) [70].

RIR Measurement

The ideal delta function is a mathematical tool and apparently it cannot be reproduced in real life. Hence, in order to measure RIRs in closed spaces, gunshots, balloon pops, handclaps or other impulsive sounds have been historically used. Nevertheless, for an accurate RIR measurement, standardized excitation signals must be used. The ISO standard [84] proposes the use of a Maximum Length Sequence (MLS) noise signal and gives the following directives:

- The sound source should be as close to omni-directional as possible and the microphone should be omni-directional.
- Several measurements should be averaged to compensate for noise errors.
- The microphone should be placed at least 1/4 of the longest wavelength away from the closest reflective surface.
- The minimum source-receiver distance d should be at least:

$$d = 2\sqrt{\frac{V}{c\, RT}} \quad \text{in meters} \quad (2.6)$$

where V is the room volume, c is the speed of sound and RT is the expected reverberation time.

Another widely-used excitation signal is the sine sweep; it has sometimes been found more appropriate for precise measurements than the MLS noise [46, 47].

Early and late reflections

In room acoustics, the RIR is usually decomposed in three parts, as shown in Fig. 2.2: (i) the direct part, (ii) the early reflections and (iii) the late reverberation.
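Once a RIR is available, the convolution of Eq. (2.4) can be applied directly in the discrete-time domain to generate a reverberant signal from a dry one. A minimal sketch follows; the sampling rate is illustrative and a synthetic exponentially decaying noise tail stands in for a measured RIR.

```python
import numpy as np

def reverberate(s, h_r):
    """Discrete-time counterpart of Eq. (2.4): convolve the dry
    signal s with the room impulse response h_r."""
    return np.convolve(s, h_r)

# Synthetic stand-in for a measured RIR: exponentially decaying noise.
# An amplitude factor of exp(-6.9) ~ 10^-3 (i.e. -60 dB) is reached at
# t = 0.5 s, so the tail mimics an RT of roughly 0.5 s.
fs = 16000
rng = np.random.default_rng(1)
t = np.arange(int(0.5 * fs)) / fs
h_r = rng.standard_normal(t.size) * np.exp(-6.9 * t / 0.5)

s = rng.standard_normal(fs)   # 1 s of dry "signal"
y = reverberate(s, h_r)       # length: len(s) + len(h_r) - 1
```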

In Fig. 2.3 the log-squared representation of a RIR is shown. The direct sound is the free-field contribution, i.e. the sound arriving without distortions. The initial delay before the arrival of the first peak of the RIR depicts the source-receiver distance. After the direct sound, the early reflections arrive. The early reflections are considered relatively sparse and span a short time interval. From a perceptual point of view, the early reflections mainly affect the signal's timbre and are perceived as coloration [9, 76]. The last part of the RIR is called late reverberation and it is responsible for the signal's reverberant tails [76, 99]. Late reverberation arises in the diffuse field, where the RIR presents stochastic characteristics and produces a noise-like effect [63]. Usually, late reverberation is considered detrimental for speech reproduction, as it degrades the intelligibility and reduces the performance of ASR systems. At the same time, late reverberation contributes to the spatial awareness and gives the sense of continuity to the sound [181]. In small-room acoustics, where the surrounding surfaces are close to the receiver, the early reflections are dominant, while late reverberation prevails in large enclosures.

Reverberation time

The RT60 (or RT for simplicity) is the most widely used room acoustic parameter, and it is defined as the time required for the sound level in a room to decrease by 60 dB after the sound source has stopped emitting sound. It is calculated from the Energy Decay Curve (EDC) of the RIR, which depicts the energy remaining in a room after the time interval t from the excitation termination [104]. The EDC is commonly referred to as the Schroeder integral [164, 165] and it is defined as the integral of the squared RIR h_r(t):

$$EDC(t) = \int_t^{\infty} h_r^2(\tau)\, d\tau \quad (2.7)$$

When the environmental noise level is high during the RIR measurement, the RT value may be overestimated.
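The Schroeder integral of Eq. (2.7) maps directly to a backward cumulative sum of the squared RIR. The sketch below also extracts the RT by fitting a line to the decay curve; the −5 dB to −35 dB fitting range follows common practice for noisy measurements, and the function names are illustrative.

```python
import numpy as np

def schroeder_edc_db(h_r):
    """Energy Decay Curve of Eq. (2.7), normalized and expressed in dB."""
    edc = np.cumsum(h_r[::-1] ** 2)[::-1]      # backward integration
    return 10.0 * np.log10(edc / edc[0])

def estimate_rt60(h_r, fs, lo=-5.0, hi=-35.0):
    """Extrapolate RT60 from a least-squares regression line fitted to the
    decay curve between lo and hi dB."""
    edc_db = schroeder_edc_db(h_r)
    t = np.arange(edc_db.size) / fs
    mask = (edc_db <= lo) & (edc_db >= hi)
    slope, intercept = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second
    return -60.0 / slope                        # time to fall by 60 dB
```

For an ideal exponential decay the fitted line recovers the nominal RT almost exactly; for measured RIRs, the fitting range keeps the regression away from both the direct sound and the noise floor.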
In order to compensate for such errors, the RT is usually extrapolated from the least-squares regression line taken from the -5 dB to -35 dB part of the decay curve. An illustrative example of the procedure is

Figure 2.2: Different parts of a RIR.

Figure 2.3: Log-squared representation of a RIR.

Figure 2.4: Illustrative example: RT estimation from a measured RIR (RT60 = 1.3 s).

given in Fig. 2.4. The RT of a given room is derived as the average of several RT measurements at different source-receiver positions, as specified in [84]. In architectural acoustics, it is of great interest to predict the RT from the room characteristics. In the late 1890s [161], Sabine developed an equation linking the RT with the room volume V, the total absorption of the surrounding surfaces Sa and the sound velocity c [104]:

$$RT_{60} = \frac{4 \ln 10^6}{c} \cdot \frac{V}{Sa} \approx 0.161\ \mathrm{s\,m^{-1}} \cdot \frac{V}{Sa} \quad \text{in s} \quad (2.8)$$

Note that the above equation does not take into account the fine geometrical details and assumes a stochastic and ergodic system [161].

RT blind estimation

When a RIR measurement is not feasible and the room characteristics are unknown, a blind RT estimation has to be performed. This can be useful in occupied rooms, where the recording of an excitation signal may be disturbing for the occupants. Furthermore, in speech processing systems only the reverberant signals are typically available [116]. The main principle of most blind RT estimation

algorithms (e.g. [93, 155]) is to detect the envelope of the signal and calculate an approximation of the relevant decay curves. Then, the estimated RT is derived through Schroeder's backward integration [164].

Other important room acoustics parameters

The RT is the most widely used room acoustic parameter, as it depicts the general properties of a reverberant enclosure. However, many other room acoustics parameters are available. The critical distance is defined as the source-receiver distance where the direct and the reverberant sound energy are equal, and it is derived as [104]:

$$d_{crit} \approx \frac{1}{4}\sqrt{\frac{0.161\, V}{\pi\, RT}} \approx 0.057\sqrt{\frac{V}{RT}} \quad \text{in m} \quad (2.9)$$

Another important parameter is the Schroeder Frequency [166], which is derived as:

$$f_{Schroeder} \approx 2000\sqrt{\frac{RT}{V}} \quad \text{in Hz} \quad (2.10)$$

In low frequencies the distinct room modes are the dominant characteristic of the room's behaviour, while in higher frequencies the general statistical properties of the room are important. The crossover frequency between those two frequency ranges is the f_Schroeder, where the room modes start grouping so closely together that they are no longer perceived as distinct resonant peaks. Both the d_crit and the f_Schroeder are not always relevant in small room acoustics [181]. The Early Decay Time (EDT) is derived following the same principle as the RT. The EDT is an alternative measure that is known to correlate better with the perceived reverberation effect. It is calculated in a similar fashion to the RT, from the 0 dB to -10 dB part of the decay curve [104]. The Definition D50 (Deutlichkeit) has been associated with the ASR performance [169], [134] and it is defined as the energy ratio of the first 50 ms of the

RIR and the total RIR energy:

$$D50 = \frac{\int_0^{50\,\mathrm{ms}} h_r^2(\tau)\, d\tau}{\int_0^{\infty} h_r^2(\tau)\, d\tau} \cdot 100\% \quad (2.11)$$

The D50 is considered a good indicator of the subjective speech intelligibility, provided that the background noise is not prominent, and higher D50 values denote superior signal acoustical quality [104]. The Clarity C50 parameter can also evaluate whether the signal qualities are well perceived in a room, and it appears that it may be an appropriate speech intelligibility measure [22]. The C50 is defined as the energy ratio of the early and late reflections:

$$C50 = 10 \log_{10} \frac{\int_0^{50\,\mathrm{ms}} h_r^2(\tau)\, d\tau}{\int_{50\,\mathrm{ms}}^{\infty} h_r^2(\tau)\, d\tau} \quad \text{in dB} \quad (2.12)$$

Higher clarity values are achieved when the direct sound is dominant, and denote better acoustical quality. By combining Eq. 2.11 and 2.12, the relation between definition and clarity can be deduced:

$$C50 = 10 \log_{10} \left( \frac{D50}{100 - D50} \right) \quad \text{in dB} \quad (2.13)$$

Note that in bigger halls, the clarity is often calculated over the first 80 ms (C80) [160]. Finally, the centre time (Ts) represents the centre of gravity of the squared

Figure 2.5: Absolute threshold of hearing over frequency.

RIR:

$$Ts = \frac{\int_0^{\infty} \tau\, h_r^2(\tau)\, d\tau}{\int_0^{\infty} h_r^2(\tau)\, d\tau} \quad \text{in ms} \quad (2.14)$$

and lower Ts values denote better acoustical quality [104]. All the above parameters are widely used to describe the perceptual attributes of concert halls [160, 181]. Note, however, that even if the above parameters are valid descriptors of room acoustic properties, a thorough statistical analysis of the RIR may extract additional acoustical information [55, 56, 89, 151].
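The energy-ratio parameters of Eqs. (2.11), (2.12) and (2.14) are straightforward to compute from a discrete RIR; a small sketch follows (the function name and the 50 ms split argument are illustrative).

```python
import numpy as np

def clarity_parameters(h_r, fs, split_ms=50.0):
    """Compute D50 (%), C50 (dB) and centre time Ts (ms) from a RIR,
    following Eqs. (2.11), (2.12) and (2.14)."""
    e = h_r ** 2
    k = int(round(split_ms * 1e-3 * fs))     # sample index of the 50 ms split
    early, late, total = e[:k].sum(), e[k:].sum(), e.sum()
    d50 = 100.0 * early / total
    c50 = 10.0 * np.log10(early / late)
    t = np.arange(e.size) / fs
    ts = 1e3 * (t * e).sum() / total         # centre of gravity, in ms
    return d50, c50, ts
```

Eq. (2.13) serves as a built-in consistency check: for any RIR, 10·log10(D50/(100−D50)) must equal the computed C50.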

Figure 2.6: Illustrative diagram of the different masking types.

2.3 Auditory models, masking and reverberation perception

Masking

The lowest intensity at which a sound can be detected in total silence defines the absolute threshold A_TH. For sounds of more than 500 ms, the absolute threshold exclusively depends on the sound frequency (see Fig. 2.5). This dependency can be analytically approximated by [221]:

$$A_{TH} \approx 3.64\, \omega^{-0.8} - 6.5\, e^{-0.6\, (\omega - 3.3)^2} + 10^{-3}\, \omega^4 \quad \text{dB} \quad (2.15)$$

where ω is the frequency in kHz. For durations less than 200 ms, the intensity required for a sound to be detected increases as the duration of the sound diminishes. According to the American Standards Association, masking has been defined as "the process by which the threshold for audibility for one sound is raised by the presence of another (masking) sound" and also as "the amount by which the threshold of audibility of a sound is raised by the presence of another (masking) sound" [121]. Hence, the amount of masking in dB is defined as the difference between the masking and the absolute thresholds. There are three different types of masking (see Fig. 2.6):

Figure 2.7: Masking curve for a narrow-band noise masker with a bandwidth of 160 Hz, centered at 410 Hz. After Egan and Hake [40].

1. Simultaneous or frequency masking: In this case the masker and the test signal are presented simultaneously to the listener. Usually, narrow- or wide-band noise and pure tones are used as test signals in masking experiments. The duration of the test sounds changes the masking curves dramatically, and impulsive sounds seem to produce less masking [184, 185]. In Fig. 2.7 the masking curve for a continuous narrow-band noise masker centered at 410 Hz is shown [40]. It must be noted that the masking curve is not symmetrical around the central frequency of the masking noise, and that the masking effect begins at 100 Hz, which is lower than the first frequency component of the masker (i.e. 330 Hz). In addition, the masking increases linearly after that frequency [121].

2. Forward masking or postmasking: In this case the masker precedes the test signal. Then, as shown in Fig. 2.8, the masking threshold decreases exponentially until it reaches the absolute threshold of hearing [41]. The masking level depends on several parameters, such as the time delay and the spectral differences between the masker and the test signal, and also the

Figure 2.8: Backward and forward masking. After Eliot [41].

level, the duration and the frequency content of both the masker and the test signal [48, 121].

3. Backward masking or premasking: The test signal is presented to the listener before the masker signal. As shown in Fig. 2.8, a test signal can be masked even if it is presented 20 ms before the masker. Experienced subjects show less or no backward masking [121].

There are two possible physiological mechanisms that explain simultaneous masking. According to the first explanation, frequency masking is due to the swamping of the neural activity caused by the signal. If the masker produces significant activity in the part of the ear that would normally respond to the signal, the activity added by the signal cannot be detected. Another possible mechanism is that the masker suppresses the activity that would be caused by the test signal when presented alone [121]. The interpretation of temporal masking is not obvious. Forward masking can be explained by the movement of the basilar membrane, i.e. the inner-ear structural element responsible for the frequency analysis of the perceived sounds. A strong excitation of the membrane by a certain frequency may reduce the response of the basilar membrane in neighboring frequency regions. Despite numerous research efforts, the physiological mechanisms responsible for backward masking are still not clear. Potential causes

may be (a) the overlap at the cochlea (the test sound reaches a cochlea that is not rested) and (b) the overlap at the central brain level (the signal is not yet processed at the time of the intervention of the masker) [30, 209].

The critical band

The Critical Band (CB) concept goes back to Fletcher [50], who assumed that the auditory system operates as a bank of parallel filters (these filters are now called auditory filters). Thus, when a pure tone is masked by white noise, only a small portion of the noise is involved in the masking process, and the width of a critical band is defined as the ratio between the intensity threshold of the test signal and the spectral density of the masking noise. This definition was later amended. It appeared that such filtering also occurs in other situations, e.g. the loudness of a given broadband noise signal remains constant as long as its bandwidth does not exceed a certain limit. Feldtkeller and Zwicker suggested the use of this bandwidth limit as a definition of the critical band [48]. By arranging 24 critical bands between 20 and 16,000 Hz, the Bark scale is defined. The Barks are perceptually relevant frequency intervals and they can be related to the corresponding Hz through the following analytical expression [182]:

$$\text{Bark number} \approx 13 \arctan(0.76\, \omega) + 3.5 \arctan\left(\left(\frac{\omega}{7.5}\right)^2\right) \quad (2.16)$$

where ω is in kHz. The Bark scale takes into account the tonotopic resolution, but ignores the temporal analysis of the human auditory system. Hence, Moore and Glasberg [122] suggested using the Equivalent Rectangular Bandwidth (ERB) filter. A formula that relates the ERB number to the corresponding Hz is [123]:

$$\text{ERB number} \approx 21.4 \log_{10}(4.37\, \omega + 1) \quad (2.17)$$

where ω is again in kHz.
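The analytic approximations of Eqs. (2.15)-(2.17) translate directly into code; the sketch below follows the conventions of the text (frequency ω in kHz, function names illustrative).

```python
import math

def absolute_threshold_db(w):
    """Approximate absolute threshold of hearing (Eq. 2.15), w in kHz."""
    return 3.64 * w ** -0.8 - 6.5 * math.exp(-0.6 * (w - 3.3) ** 2) + 1e-3 * w ** 4

def bark_number(w):
    """Critical-band (Bark) number (Eq. 2.16), w in kHz."""
    return 13.0 * math.atan(0.76 * w) + 3.5 * math.atan((w / 7.5) ** 2)

def erb_number(w):
    """ERB number (Eq. 2.17), w in kHz."""
    return 21.4 * math.log10(4.37 * w + 1.0)
```

As a sanity check, bark_number(16.0) evaluates to approximately 24, consistent with the 24 critical bands arranged between 20 Hz and 16 kHz.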

Figure 2.9: Signal flow for the first stage of the CAMM model.

The computational auditory masking model (CAMM)

The Computational Auditory Masking Model (CAMM) is a time-frequency auditory masking model introduced by Buchholz and Mourjopoulos [28, 29]. It assumes two inputs: a masker, and a masker plus test signal. The CAMM comprises two stages: (i) a preprocessing stage, which extracts the corresponding auditory internal representations from the two inputs, and (ii) a Decision Device (DD) stage, which determines whether the test signal is audible or not, based on the previously derived internal representations. These internal representations illustrate a transformation from the physical to the psychophysical (internal) domain [10]. The CAMM is mainly based on psychoacoustic masking data, although it is also related to some physiological aspects of the human auditory system. As it is essentially related to the frequency analysis performed in the basilar membrane, the input signals are fed into an auditory filterbank, and hence any subsequent analysis and processing is performed in the temporal sub-band domain. Then, full-wave rectification and low-pass filtering are performed, related to the mechanical-to-neural transduction realized by the inner hair cells. The CAMM assumes that a dynamic compression of the amplitude of each input signal is performed; hence a Signal Dependent Compression (SDC) module is also implemented to describe such signal-adaptation effects. After this stage, a temporal integrator is used (a 1st-order lowpass filter with a cut-off frequency of 4 Hz) to compensate for the signal duration dependency of the simultaneous masking threshold, as determined by psychoacoustic results. Fig. 2.9 illustrates the above steps.
Then, the resulting internal representations of the input signals are fed into

58 2. FUNDAMENTALS,-.'&-"*/ 0'1&'('-.".2)-/ )3/0'4'&5'&"-./ 627-"*,-.'&-"*/ 0'1&'('-.".2)-/ )3/8*'"-/627-"*/!! "!"# $%&'(%)*+ 0!, Figure 2.10: The Decision Device of the CAMM model utilized for the RMI derivation addtogetherwithasetofstaticthresholds. Whentheinternal representation difference is below the corresponding threshold the test signal is considered to be masked; otherwise it is considered audible The reverberation masking index (RMI) Zarouchas and Mourjopoulos [220] usedtheanalysismadebythecamminorder to propose a reverberation perceptual modeling framework. For this, the reverberant and the clean signal were fed into the model and their internal representations (ỹ k (n) and s k (n) respectively)werepassedthroughthedd.theirexact (per sample) difference was calculated and compared to a set of staticthresholds T k (n) [220] inordertoderivethereverberationmaskingindex(rmi)d k (n) for each sub band (see Fig. 2.10): D k (n) =max( ỹ k (n) s k (n),t k (n)) (2.18) The derived RMI represents an estimate of the perceived alterations mainly due to late reverberation which is acting as masking noise on the original clean signal. The CAMM is considered as more suitable than corresponding frequencydomain block-based masking models (e.g. [87]) to analyze the largely temporal late reverberation effects [220]. The RMI has been found to account for room acoustics characteristics, the value of RMI increases with RT. But being signaldependent it was also found to vary along the time signal s evolution and as expected to increase during the decay of the reverberation tail [220]. 28

Figure 2.11: Illustrative figure: geometric estimate of the ITD.

2.4 Binaural hearing

The term binaural hearing is used to describe the ability of the human auditory system to benefit from the fact that it has two ears. It contributes to auditory localization, detection and recognition [15]. When a sound wave is emitted from a single source, the sound arrives with a relative delay at the two ears, due to the different lengths of the acoustical paths. This delay represents the Interaural Time Difference (ITD) [179]. The ITD depends on the source-receiver position and on the shadowing of the human head. The maximum ITD occurs when the source is on the interaural axis (see Fig. 2.11), and given that the width of the head is approximately 18 cm and the speed of sound 340 m/s, the ITD may reach a value of 529 µs. When the head shadowing is also taken into account, the measured ITD can reach a value of 800 µs. The Interaural Level Difference (ILD), i.e. the sound intensity difference at the two ears, is also important in the binaural hearing context [16, 173]. The ILD is more important in the high-frequency range, where the wavelength of the emitted sound is short compared to the dimensions of the human head [16, 157]. Hence, in principle the ITDs are the dominant cues in the low frequencies, while the ILDs are more important in the high frequencies. However, it has been shown that the ITDs can also be important in the high-frequency range, as they can be perceived through envelope fluctuations [16].
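The 529 µs figure follows directly from the geometric estimate (path difference of one head width, ignoring head shadowing):

```python
head_width_m = 0.18      # approximate ear-to-ear distance
speed_of_sound = 340.0   # m/s

# Time for sound to traverse the extra path, in microseconds
max_itd_us = head_width_m / speed_of_sound * 1e6
print(round(max_itd_us))  # → 529
```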

The binaural cues have been applied in multi-channel spatial rendering, in virtual acoustics and in multichannel audio coding [16, 44, 45]. In addition, it is very important to preserve these binaural cues in hearing aid signal processing [64, 86, 115]; otherwise the ability for sound localization is degraded, and hearing-impaired people present worse localization skills when they wear their hearing aids [202].
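As a rough illustration of the cues discussed above, the ITD can be estimated as the cross-correlation lag that best aligns the two ear signals and the ILD as an RMS level ratio in dB. This is a broadband sketch with hypothetical names; practical binaural systems estimate both cues per frequency band:

```python
import math

def estimate_ild_db(left, right):
    """ILD as the RMS level ratio between the ear signals, in dB."""
    rms = lambda x: math.sqrt(sum(v * v for v in x) / len(x))
    return 20 * math.log10(rms(left) / rms(right))

def estimate_itd_samples(left, right, max_lag):
    """ITD as the cross-correlation lag (in samples) maximizing alignment.
    Positive lag means the right-ear signal is delayed relative to the left."""
    def xcorr(lag):
        return sum(l * r for l, r in zip(left[max(0, -lag):],
                                         right[max(0, lag):]))
    return max(range(-max_lag, max_lag + 1), key=xcorr)
```

With a sampling rate f_s, the lag divided by f_s gives the ITD in seconds.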

Chapter 3

Dereverberation: Literature Summary

3.1 Introduction

Since the early works of Flanagan and Lummis [49] and Mitchell and Berkley [119], many blind or non-blind dereverberation techniques have been developed, utilizing single or multiple input channels. In most of these research efforts, room reverberation was regarded as the combination of early and late reflections, and most dereverberation techniques handle the early and late reverberant signal components separately. In this Chapter, a summary of the existing literature on dereverberation is presented.

3.2 Suppression of early reflections / decoloration

Inverse filtering

Inverse filtering of the RIR [125, 126, 127, 128, 135] is mainly used to minimize the coloration effect produced by the early reflections. In theory, an ideal RIR inversion will completely remove the effect of reverberation (both early and late reflections). However, the RIR is known to have non-minimum-phase characteristics [135] and the non-causal nature of the inverse filter may introduce significant

artifacts. In addition, exact measurements of the RIR must be available for the specific source/receiver room position, even though RIRs are known to present common features at different room positions (e.g. [65]). The above limitations can be avoided by compensating exclusively for the broad spectral coloration effect. For this, many techniques have been proposed based on least-squares [130, 135], frequency warping [67, 68, 162], complex smoothing [71, 72], Kautz filters [90, 141] and spatial clustering [13]. Many of them are already incorporated in commercial room correction systems. However, results from subjective tests show that some of these techniques do not always achieve the desired perceptual effect [73, 136].

Cepstral techniques

In 1975, Stockham tried to restore old Caruso recordings through cepstral blind deconvolution [138, 175]. The technique was based on homomorphic signal processing, exploiting the fact that deconvolution may be represented as a subtraction in the log frequency domain. Alternative techniques based on the same principle were later proposed in [11, 130, 147].

LP residual enhancement

Using the source-filter production model, speech can be represented as a convolutive mixture of the Linear Prediction (LP) coefficients and the LP residual [39]. The fundamental assumption of the LP residual dereverberation techniques is that the excitation signal is distorted by the room reflections while the LP coefficients are not significantly affected by reverberation. Hence, these techniques enhance the LP residual and recover the speech using the reverberant LP coefficients, e.g. [58, 61, 102, 218].

3.3 Late reverberation suppression

Temporal envelope filtering

A class of techniques aiming to compensate mostly for late reverberation is based on temporal envelope filtering (e.g. [5]). They are mainly motivated by the

concept of the Modulation Index [79], which is reduced when the late reverberation tails fill the low-energy regions of a signal (e.g. [105]). Mourjopoulos and Hammond [129] showed that dereverberation of speech can be achieved by envelope deconvolution in frequency sub-bands. Furthermore, the temporal envelope filtering principle has been found to be advantageous when used in conjunction with other techniques such as LP residual enhancement [218] and spectral subtraction [102].

Spectral subtraction

Spectral enhancement techniques have also been developed in order to suppress reverberation, starting with a multi-microphone reverberation reduction method proposed by Flanagan and Lummis [49], which was later extended by Allen et al. [4]. Spectral subtraction was originally proposed for denoising applications [12, 19, 39, 43, 114]. The technique is implemented in the STFT domain and its main principle is to subtract an estimate of the noise power spectrum from the power spectrum of the noisy signal. Usually, a speech activity detector is involved in order to update the estimate of the noise characteristics during the non-speech frames. Musical noise is the most common processing artifact introduced by spectral subtraction. It is generated when spectral bins of the noisy signal are strongly attenuated because they are close to or below the estimated noise spectrum. As a result, the residual noise contains annoying pure-tone components at random frequencies. Most spectral subtraction methods try to accurately estimate the noise spectra and avoid or reduce the musical noise [31, 114, 200]. Generally speaking, reverberation is a convolutive distortion; however, late reverberation can be considered as an additive degradation with noise-like characteristics. Hence, in the dereverberation context spectral subtraction has been adapted for the suppression of late reverberation.
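The shared principle can be condensed into a one-line magnitude subtraction with half-wave rectification; the flooring step is exactly where over-subtraction, and hence musical noise, originates. A hedged sketch (hypothetical names; the magnitude spectra are assumed given):

```python
def spectral_subtract(Y_mag, R_mag, floor=0.0):
    """Magnitude spectral subtraction over the bins of one STFT frame.
    Negative results of the subtraction (over-subtraction, the source of
    musical noise) are floored at a fraction `floor` of the input magnitude."""
    return [max(y - r, floor * y) for y, r in zip(Y_mag, R_mag)]
```

The enhanced magnitudes are then recombined with the noisy (or reverberant) phase and transformed back via overlap-add, as described for the individual methods below.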
The basic principle of single-channel spectral subtraction dereverberation [51, 107, 216] is to estimate the short-time spectrum of the clean signal S_e(ω,j) by subtracting an estimate of the short-time spectrum of late reverberation R(ω,j) from the short-time spectrum

of the reverberant signal Y(ω,j):

S_e(ω,j) = Y(ω,j) − R(ω,j)   (3.1)

where ω and j are the frequency bin and time index respectively. Following an alternative formulation, the estimate of the short-time spectrum of the clean signal can be derived by applying appropriate weighting gains G(ω,j) to the short-time spectrum of the reverberant signal, i.e.:

S_e(ω,j) = G(ω,j) Y(ω,j)   (3.2)

where

G(ω,j) = ( Y(ω,j) − R(ω,j) ) / Y(ω,j)   (3.3)

Therefore, the dereverberation problem reduces to an estimation of the late reverberation short-time spectrum.

Dereverberation method proposed by Lebart et al.

For the estimation of the late reverberation short-time spectrum, Lebart et al. [107] proposed a method (LB) based on RIR modeling, where the RIR h_r(n) is modeled as a discrete non-stationary stochastic process [150]:

h_r(n) = b(n) exp( −(3 ln 10 / RT_60) n )  for n ≥ 0,
h_r(n) = 0  for n < 0.   (3.4)

where b(n) is a zero-mean stationary Gaussian noise. The above modeling is valid when the direct energy of the RIR is smaller than the energy of the reflections [62]. The short-time spectral magnitude of the reverberation is estimated as:

R(ω,j) = Y(ω,j) / ( SNR_pri(ω,j) + 1 )   (3.5)

where SNR_pri(ω,j) is the a priori Signal to Noise Ratio that can be approximated by a moving average of the a posteriori Signal to Noise Ratio SNR_post(ω,j)

in each frame:

SNR_pri(ω,j) = β SNR_pri(ω,j−1) + (1 − β) max( 0, SNR_post(ω,j) )   (3.6)

where β is a constant taking values close to 1. Thus, S_e(ω,j) is estimated by subtraction and combined with the phase of the reverberant signal, so that the dereverberated signal in the time domain is finally obtained through overlap-add.

Dereverberation method proposed by Wu and Wang

The method proposed by Wu and Wang [216] (WW) is motivated by the observation that the smearing effect of late reflections produces a smoothing of the signal spectrum in the time domain. Hence, the late reverberation power spectrum is considered a smoothed and shifted version of the power spectrum of the reverberant speech:

|R(ω,j)|² = γ w(j − ρ) ∗ |Y(ω,j)|²   (3.7)

where ρ is a frame delay and γ a scaling factor. The term w(j) represents an asymmetrical smoothing function given by the Rayleigh distribution:

w(j) = ( (j + α) / α² ) exp( −(j + α)² / (2α²) )  if j ≥ −α,
w(j) = 0  otherwise   (3.8)

where α represents a constant number of frames. The phase of the reverberant speech is combined with the estimated clean signal's spectrum and overlap-add is used to extract the time-domain estimate.

Dereverberation method proposed by Furuya and Kataoka

Alternatively, Furuya and Kataoka [51] proposed a method (FK) where the short-time power spectrum of late reverberation in each frame is estimated as the sum of filtered versions of the previous frames of the reverberant signal's short-time power spectrum:

|R(ω,j)|² = Σ_{l=1}^{K} |a_l(ω,j)|² |Y(ω,j−l)|²   (3.9)

where K is the number of frames that corresponds to an estimate of the RT_60 and a_l(ω,j) are the coefficients of late reverberation. The FK method assumes that an inverse filtering step, reducing the spectral degradation produced by the early reflections, precedes the spectral subtraction. Hence, the short-time power spectrum of the reverberant signal is considered to roughly approximate the short-time power spectrum of the anechoic signal. The coefficients of late reverberation are derived from:

a_l(ω,j) = E{ Y(ω,j) Y*(ω,j−l) } / E{ |Y(ω,j−l)|² }   (3.10)

Then an estimate of the clean signal in the time domain can be derived through overlap-add from the short-time spectrum of the dereverberated signal S_e(ω,j):

S_e(ω,j) = sqrt( ( |Y(ω,j)|² − |R(ω,j)|² ) / |Y(ω,j)|² ) Y(ω,j)   (3.11)

3.4 Binaural techniques

Dereverberation is important for binaural applications in the context of digital hearing aids, binaural telephony, hands-free devices and immersive audio applications (e.g. [64, 115, 213]). However, adapting single- or multichannel techniques for binaural processing is not trivial. Apart from the challenging task of reducing reverberation without introducing audible artifacts, binaural dereverberation methods should also at least preserve the Interaural Time Difference (ITD) and Interaural Level Difference (ILD) cues, as it has been shown that bilateral signal processing affects source localization [64]. Lee et al. [108] presented a semi-blind method where they estimate a dereverberation filter from a pre-trained whitening filter and a whitened signal. In addition, Jeub et al. [86] proposed a two-stage dereverberation algorithm that explicitly preserves binaural cues. The coloration is reduced with a dual-channel Wiener filter [85] while late reverberation is suppressed through spectral subtraction, employing a binaural version of the LB technique [107, 115]. Note that despite the importance of binaural dereverberation, few studies have been published in the existing literature.

3.5 Multi-channel dereverberation

Multi-channel dereverberation is considered an easier task than single-channel dereverberation, since the spatial diversity of the received signals can be exploited. A first set of multi-channel techniques is based on beamforming [203]. Such techniques exploit the directivity properties of microphone arrays and require some a priori knowledge of the array configuration. For a given system, the improvement depends on the microphone arrangement and the source-receiver positions but is independent of the room RT [54]. In simple implementations, the beamforming microphone arrays may present fixed directivity characteristics (fixed beamforming techniques); however, adaptive beamforming setups where the processing parameters are adjusted to the environment also exist. Most beamforming algorithms assume that the noise and the source signal are statistically independent. This assumption does not hold for reverberation, which is a convolutive distortion. Therefore, the performance of such algorithms is poor in the dereverberation context [14, 63]. Complete reverberation reduction may theoretically be achieved by applying blind deconvolution [77]. However, in order to perform blind deconvolution the signal and the RIR must be irreducible, i.e. they cannot be expressed as the convolution of two other signals [103].
LTI systems are usually reducible and hence, in principle, blind deconvolution cannot be applied. However, in some cases global irreducibility may be roughly assumed by exploiting the time diversity in Single Input Single Output (SISO) systems and the spatial diversity in Single Input Multiple Output (SIMO) systems [63]. Furthermore, in blind deconvolution scenarios, most methods assume that the clean signal is independent identically

distributed (i.i.d.). This hypothesis does not hold for most sound signals and such techniques cannot be directly applied to dereverberation. Thus, single- or multi-channel blind deconvolution implementations usually involve very low channel orders, and the number of reflections in the tested RIRs is unrealistically low (e.g. [42, 52, 77, 78]). On the other hand, Miyoshi et al. [120] have shown that in non-blind multichannel systems perfect inverse filtering can be achieved when the captured RIRs do not share any common zeros. Multi-channel blind deconvolution methods for speech based on LP analysis have been developed, based on the principle that when the input of a system is white it can be equalized through multichannel LP. For speech dereverberation, the reverberant speech signal is pre-whitened in order to estimate a dereverberation filter. Then the above filter is applied to the reverberant signal (e.g. [38, 51, 97, 183]).
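As a concrete recap of the spectral subtraction family reviewed in Section 3.3, the FK estimate of Eqs. 3.9-3.11 can be sketched per frequency bin as follows (hypothetical names; the expectations of Eq. 3.10 are replaced by plain averages over the available frames):

```python
import math

def fk_late_reverb_power(Y, K):
    """FK sketch for one frequency bin. Y: list of complex STFT values per
    frame. Returns the per-frame late reverberation power estimate (Eq. 3.9),
    with frame-lag coefficients a_l estimated as frame averages (Eq. 3.10)."""
    J = len(Y)
    a = []
    for l in range(1, K + 1):
        num = sum(Y[j] * Y[j - l].conjugate() for j in range(l, J))
        den = sum(abs(Y[j - l]) ** 2 for j in range(l, J))
        a.append(num / den)
    return [sum(abs(a[l - 1]) ** 2 * abs(Y[j - l]) ** 2
                for l in range(1, K + 1) if j - l >= 0)
            for j in range(J)]

def fk_gain(Y2, R2):
    """Eq. 3.11 gain: sqrt(max(|Y|^2 - |R|^2, 0) / |Y|^2), applied to Y."""
    return math.sqrt(max(Y2 - R2, 0.0) / Y2)
```

The flooring at zero inside fk_gain plays the same half-wave rectification role discussed for denoising-style spectral subtraction.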

Chapter 4

Generalized Framework For Improved Spectral Subtraction Dereverberation

4.1 Introduction

Spectral subtraction was originally developed to confront denoising problems. Even though late reverberation is often regarded as additive noise, in practical implementations significant differences between reverberation and noise arise, leading to suboptimal suppression of the reverberation and/or processing artifacts. In this Chapter, novel enhancements of the traditional spectral subtraction methods are presented, taking into account the unique nature of the reverberation distortion. As mentioned in Section 3.3.2, when the noise spectrum is overestimated spectral subtraction may produce musical noise artifacts. In order to compensate for such artifacts several methods have been proposed. Tsoukalas et al. [199, 200] have developed a well-established technique based on Audible Noise Suppression (ANS). The method involves a perceptually-motivated spectral estimator that utilizes an approximate evaluation of the clean signal's masking threshold. Therefore, the audible noise components are located and suppressed through non-linear filtering. The original implementation has been proposed

for denoising applications; however, here (in Section 4.2) the technique is implemented in the dereverberation context. The method detects and suppresses the audible reverberation components, resulting in an alternative Audible Reverberation Suppression (ARS) implementation. In Section 4.5 it is shown that despite the time-varying nature of the reverberation interference, the use of this perceptually-motivated technique significantly improves the resulting signals. Late reverberation affects different signal components in different ways, and signal tails are in all cases more degraded than other signal parts. Spectral subtraction methods are applied in a similar fashion to the complete reverberant signal, and therefore they can disproportionally degrade low-level components and/or signal transients. In order to address such problems, in Section 4.3 two novel relaxation criteria are introduced that take into account the signal-dependent effect of late reverberation. The proposed criteria are general enough to be easily incorporated into any spectral subtraction dereverberation method and can preserve the signal's transients from over-subtraction. In addition, they can be combined with the perceptually-motivated non-linear filtering in a unified framework and significantly improve the performance of conventional late reverberation suppression algorithms [189]. The proposed framework is beneficial for speech but also for broadband music signals. Hence, the results (see Section 4.5) show that these enhancements improve the robustness and performance of the reference methods, preserving temporal and spectral signal fidelity and reducing detrimental effects and artifacts. The novel aspects presented in this Chapter are briefly described below:

An investigation of the existing spectral subtraction techniques reveals the most prominent overestimation artifacts, e.g. the musical noise and the degradation of low-level signal components and signal transients [186, 187].

A perceptually-motivated non-linear filtering denoising technique is implemented in the dereverberation context and, despite the time-varying nature of the reverberation interference, the resulting signals are significantly improved [191, 192].

A novel Power Relaxation Criterion identifying and preserving the signal's onsets is proposed [186, 187].

A novel Normalized Cross-Correlation Relaxation Criterion identifying and preserving subsequent correlated frames appearing from overestimation errors is introduced [186, 187].

The above signal-dependent constraints along with the non-linear filtering are combined in a novel generalized framework potentially improving most existing late reverberation suppression methods [186, 187].

4.2 Dealing with musical noise

It is well established that spectral subtraction methods are sensitive to musical noise due to the over-subtraction of some spectral components. In order to minimize the effect of the residual musical noise, a perceptually-motivated non-linear filtering technique is adopted [199, 200]. The above method is based on the analysis and implementation of a well-known auditory mechanism, frequency masking (see Section 2.3.1), and its basic principle is to suppress only the spectral components that contribute to audible noise.

Audible reverberation suppression

The technique assumes that a rough clean spectrum estimate S_e(ω,j) is available (e.g. see Section 3.3.2). This is used to derive an estimate of the corresponding auditory masking threshold T_bSe(j) in each critical band (CB) (see Section 2.3.2) according to the method proposed by Johnston [87]. Note that the above method is also incorporated in the MPEG standard [1, 24]. At first, the total power spectrum is derived in each CB:

Q_b(j) = Σ_{ω=ω_lb}^{ω_hb} S_e(ω,j),  0 ≤ b ≤ B − 1   (4.1)

where ω_lb and ω_hb represent the lower and upper spectral bins of CB b and B the total number of CBs. Then Q_b(j) is convolved with the spreading function

of the basilar membrane Sp:

C_b(j) = Σ_{m=1}^{B} Sp(b − m + 25) Q_m(j),  0 ≤ b ≤ B − 1   (4.2)

The spreading function has been derived from empirical psychoacoustic results and provides information on frequency masking in the bark domain. The geometric and arithmetic means of the signal's power spectrum (G(j) and A(j) respectively) are calculated in order to determine the noise-like or tone-like nature of the signal, using the spectral flatness measure (SFM):

SFM = 10 log10( G(j) / A(j) ), in dB   (4.3)

The tonality of the signal is then derived as follows:

tonality(j) = min( SFM / (−60), 1 )   (4.4)

where −60 dB is the SFM value of a pure tone. Hence, when tonality(j) = 1 the signal in the reference CB is considered a pure tone and when tonality(j) = 0 it is considered white noise. In order to take into account the noise- or tone-like nature of the signal an offset O_b(j) is introduced:

O_b(j) = tonality(j) (14.5 + b) + (1 − tonality(j)) 5.5,  0 ≤ b ≤ B − 1   (4.5)

and the corresponding critical band masking threshold can be calculated as:

T_bSe(j) = 10^( log10( C_b(j) ) − O_b(j)/10 ),  0 ≤ b ≤ B − 1   (4.6)

Finally, the absolute auditory threshold is taken into account and the masking threshold in the CB is the maximum of the absolute threshold of hearing and T_bSe(j). The masking threshold in each frame T_Se(ω,j) is simply interpolated from the CB thresholds T_bSe(j). The procedure is illustrated in Fig. 4.1. After the calculation of the masking threshold, an evaluation parameter b(ω,j)

Figure 4.1: Illustrative example of the masking threshold calculation. For a given signal block the figure illustrates: the signal magnitude (blue line), the tonal and non-tonal components (circles), the absolute threshold (black solid line) and the derived masking threshold (black dashed line)

is introduced [200]:

b(ω,j) = ( Y(ω,j)/S_e(ω,j) − 1 )^{1/ν(ω,j)} Y(ω,j)  if S_e(ω,j) ≥ T_Se(ω,j),
b(ω,j) = ( Y(ω,j)/T_Se(ω,j) − 1 )^{1/ν(ω,j)} Y(ω,j)  if S_e(ω,j) < T_Se(ω,j).   (4.7)

where ν(ω,j) is a factor that controls the suppression rate. Since b(ω,j) should not be influenced by random spectrum fluctuations, the maximum b(ω,j) for each critical band is utilized. The final clean spectrum estimate S_e(ω,j) is obtained by implementing the following non-linear law:

S_e(ω,j) = ( Y(ω,j)^{ν(ω,j)} / ( Y(ω,j)^{ν(ω,j)} + b(ω,j)^{ν(ω,j)} ) ) Y(ω,j)   (4.8)

Note that Eq. 4.8 has the theoretical form of a perceptually-derived Wiener filter [114].
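The ARS chain of Eqs. 4.3-4.8 can be sketched for a single spectral component as follows. This is an illustrative reconstruction under the equation forms given above (hypothetical names; ν is taken constant, and the per-critical-band maximization of b(ω,j) is omitted):

```python
import math

def tonality(geo_mean, arith_mean):
    """Eqs. 4.3-4.4: SFM in dB relative to the -60 dB pure-tone reference."""
    sfm_db = 10 * math.log10(geo_mean / arith_mean)
    return min(sfm_db / -60.0, 1.0)

def masking_offset_db(tonality_val, b):
    """Eq. 4.5: offset interpolating between tone- and noise-masking."""
    return tonality_val * (14.5 + b) + (1.0 - tonality_val) * 5.5

def ans_suppress(Y_mag, S_est, T_mask, nu=1.0):
    """Eqs. 4.7-4.8 sketch: suppress |Y| toward max(S_est, T_mask)."""
    ref = S_est if S_est >= T_mask else T_mask
    if Y_mag <= ref:
        return Y_mag  # component already at or below the audibility reference
    b = (Y_mag / ref - 1.0) ** (1.0 / nu) * Y_mag
    gain = Y_mag ** nu / (Y_mag ** nu + b ** nu)
    return gain * Y_mag
```

Note that with ν = 1 the non-linear law drives the component exactly to the audibility reference, which is consistent with its Wiener-filter interpretation.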

4.3 Dealing with overestimation errors

Signal-dependent constraints

When processing the whole signal in order to remove the late reverberant tail, it is likely that the late reverberation will be overestimated. Here, two relaxation criteria constraining such problems in a signal-dependent fashion are proposed. For this, the signal's power and cross-correlation values are computed in each frame and used to identify the regions where conventional algorithms commonly produce overestimation errors. The power relaxation criterion is designed to compensate for overestimation errors at the signal onsets, while the normalized cross-correlation relaxation criterion compensates for processing artifacts introduced in the signal steady states [187].

Power relaxation criterion

The overestimation of late reverberation will affect subsequent signal components, especially if the temporal envelope presents sharp onsets. These parts are usually less affected by late reverberation given that such high-energy signal frames locally present better than average Signal to Reverberant Ratio [102]. Such errors will result in a flattened temporal envelope and transients for the estimated clean signal. Note that from a perceptual point of view these signal onsets are important for the localization of sounds in closed spaces [111]; hence they should not be degraded by the dereverberation process. In order to address the problem and to allow spectral subtraction to process mainly the late reverberation tail, the Power Relaxation Criterion is introduced. The short-time power of each frame P_Yj is calculated as:

P_Yj = E{ |Y(ω,j)|² }   (4.9)

Then, a comparison between the power of the present frame P_Yj and the power of the previous frame P_Yj−1 is made. When their difference exceeds a pre-determined threshold of A dB, the short-time spectral magnitude estimate is reduced by

a fixed per-frame relaxation factor r_p:

R′(ω,j) = R(ω,j) / r_p  if P_Yj − P_Yj−1 ≥ A dB,
R′(ω,j) = R(ω,j)  if P_Yj − P_Yj−1 < A dB   (4.10)

The above criterion has proven useful for speech but also for music signals [186, 187, 189]. The crest factor of a signal Cf(n) is defined as the peak value of the signal relative to its Root Mean Square (RMS) value [154, 193]:

Cf(n) = max( |x(n)| ) / RMS( x(n) )   (4.11)

By convention the crest factor is always a positive number; it evaluates the peakiness of a sound signal [154], and higher crest factor values designate sharper transients. Speech and music signals normally display intervals of peak levels and low-level sections. The Cf(n) of a sinusoid is 3.01 dB, natural speech usually has a crest factor of about 12 dB, whilst music signals may have an even higher crest factor [32]. Hence, due to the peaky nature of such signals the power relaxation criterion is beneficial for both speech and music.

Normalized cross-correlation criterion

This criterion relies on the normalized cross-correlation Φ_jl, which is a measure of similarity between successive frames of the signal, i.e.:

Φ_jl = E{ |Y(ω,j)| |Y(ω,j−l)| } / sqrt( E{ |Y(ω,j)|² } E{ |Y(ω,j−l)|² } )   (4.12)

where j denotes the present frame and l the past frame of interest. When Φ_jl → 1 the frames of the signal are strongly correlated. Despite the fact that reverberation is known to increase the correlation of clean frames, extreme values of Φ_jl are most likely to appear mainly due to strong components of the clean signal, frequently appearing as a result of prolonged phonemes for speech or of sustained notes for music. Under such conditions the signal is in a steady state, the effect of late reverberation is practically less significant, and

additional relaxation of the late reverberation estimate is performed. Hence, the estimated short-time spectral magnitude of the reverberation is directly constrained according to the normalized cross-correlation between the current and the previous frame Φ_j1:

R″(ω,j) = R′(ω,j) (1 − Φ_j1)  if Φ_j1 ≃ 1,
R″(ω,j) = R′(ω,j)  otherwise   (4.13)

It should be noted that the duration of long vowels often surpasses 160 ms [152]. Given the typical frame durations of spectral subtraction methods, the duration of such phonemes exceeds the frame duration of the algorithm and therefore successive frames are considerably similar. For the case of music, such extremely correlated frames may appear more frequently, as a result of sustained notes. The results for speech signals (presented in Section 4.5) indicate that such frames present values of Φ_jl higher than 0.95.

4.4 Generalized framework

The proposed relaxation criteria together with the ARS stage can be combined in a unified framework that improves conventional spectral subtraction approaches. The signal flow of the combined implementation is presented in Fig. 4.2. At first, the two relaxation criteria adjust the subtraction according to the nature of each frame. The frames that do not contain a significant part of late reverberation are identified and for these frames the subtraction is relaxed. Therefore, the spectro-temporal content of the resulting signals is preserved from overestimation degradations. Then the masking threshold of this preliminary clean signal estimate is calculated and the perceptually-motivated non-linear filtering stage is employed. Therefore, only the audible reverberation components are suppressed and the musical noise artifacts are reduced. Finally, the clean dereverberated speech is obtained through overlap-add.
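The two relaxation criteria and the frame similarity measure (Eqs. 4.10, 4.12 and 4.13) reduce to a few lines per frame. A sketch with hypothetical names, using the parameter values quoted later in Section 4.5 as defaults:

```python
import math

def frame_correlation(Yj, Yl):
    """Eq. 4.12: normalized cross-correlation of two magnitude frames."""
    num = sum(a * b for a, b in zip(Yj, Yl))
    den = math.sqrt(sum(a * a for a in Yj) * sum(b * b for b in Yl))
    return num / den

def power_relaxation(R_mag, p_cur_db, p_prev_db, A_db=10.0, r_p=10.0):
    """Eq. 4.10 sketch: on a sharp onset (frame power jump >= A dB) the
    reverberation estimate is divided by the relaxation factor r_p."""
    return R_mag / r_p if p_cur_db - p_prev_db >= A_db else R_mag

def xcorr_relaxation(R_mag, phi, threshold=0.95):
    """Eq. 4.13 sketch: strongly correlated successive frames (steady
    states) further shrink the estimate by the factor (1 - phi)."""
    return R_mag * (1.0 - phi) if phi > threshold else R_mag
```

Both criteria only ever reduce the reverberation estimate, so they can be chained in front of any of the subtraction rules of Chapter 3 without altering its behavior on ordinary frames.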

Figure 4.2: Signal flow of the proposed unified framework

Figure 4.3: Spectrograms of speech signals: (a) anechoic speech, (b) reverberant speech, (c) late reverberation suppression using the FK method, (d) late reverberation suppression using the FK method together with the perceptually-motivated non-linear filtering stage

4.5 Tests and results

Evaluation of the audible reverberation suppression stage

At first, the perceptually-motivated non-linear filtering stage is evaluated using the FK late reverberation suppression algorithm as a reference (see Section 3.3.2). Typical spectrograms of (a) anechoic speech, (b) reverberant speech (RT_60 = 1 sec), (c) late reverberation suppression using the FK method and (d) late reverberation suppression using the FK method together with the ARS stage are shown in Fig. 4.3. By comparing (a) and (b) it is clear that the reverberation has a smearing effect on the harmonic structure of the clean speech and fills the silence

gaps between the phonemes. As seen in Fig. 4.3 (c) and (d), both approaches seem to reconstruct the silent parts and to retrieve an approximate envelope of the anechoic signal. However, an overestimation of the reverberation power leads both methods to over-subtract some useful speech spectra. Still, by carefully examining (c) and (d) it is clear that the implementation of the perceptually-motivated non-linear filtering stage reduces the artifacts due to overestimation. The elongated spectral lines in the high frequencies in Fig. 4.3 (c), which represent the musical noise, are removed in Fig. 4.3 (d). Artificial impulse responses simulating four rooms with different RT (i.e. RT = 0.5, 1, 1.5 and 3 sec) were modified in order to retain only their late reverberant tail. The time boundary between early and late reflections was set at 2√V ms, where V is the volume of the room in cubic meters [89]. Reverberant speech signals were obtained by convolving male and female anechoic speech phrases with these impulse responses. The resolution of the signals was 16 bit at a sampling frequency of 44.1 kHz. The window length of the STFT was set at 8192 samples, corresponding to 0.186 sec, and the frame overlap was 50%. Different values for the parameter ν(ω,j) of Eqs. 4.7 and 4.8 were tested. Optimal results were obtained when ν(ω,j) was treated as a constant and set to 1. This finding is in line with the original work [200]. The Noise to Mask Ratio (NMR) measure was used to evaluate the performance of the proposed approach. The objective NMR criterion takes into account some properties of the human auditory system and here registers the audible (non-masked) reverberation components; lower NMR values denote better speech quality.
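For a single frame, an NMR-style figure can be computed as the band-averaged dB ratio between the mean squared spectral error and the masking threshold. A sketch with hypothetical names (the full measure, formalized in Eq. 4.14 below, additionally averages over all frames):

```python
import math

def nmr_db(S, S_d, T_b, bands):
    """Single-frame NMR sketch. S, S_d: spectra of the processed and direct
    signals; bands: list of (lo, hi) bin index pairs per critical band;
    T_b: per-band masking thresholds of the remaining reverberation."""
    total = 0.0
    for (lo, hi), t in zip(bands, T_b):
        c_b = hi - lo + 1  # number of spectral components in the band
        err = sum((S[k] - S_d[k]) ** 2 for k in range(lo, hi + 1)) / c_b
        total += 10 * math.log10(err / t)
    return total / len(bands)
```

A value of 0 dB means the residual error sits exactly at the masking threshold; negative values indicate inaudible (masked) residual reverberation.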
The NMR is calculated as [200]:

NMR = ( 10 / (N B) ) Σ_{j=0}^{N−1} Σ_{b=0}^{B−1} log10( (1/C_b) Σ_{ω=ω_lb}^{ω_hb} ( S(ω,j) − S_d(ω,j) )² / T_b(j) )   (4.14)

where N is the number of frames, B is the number of critical bands, C_b is the number of spectral components of critical band b, S_d(ω,j) is the spectrum of the direct signal, S(ω,j) is the spectrum of the reverberant or the enhanced signal, and ω and j are the frequency bin and the time frame respectively. The lower

Experimental Conditions                      NMR (dB)
RT_60 = 0.5 sec   Reverberant Speech          2.85
                  FK dereverberation          0.68
                  FK dereverberation + ARS    0.26
RT_60 = 1 sec     Reverberant Speech          7.75
                  FK dereverberation          1.58
                  FK dereverberation + ARS    0.62
RT_60 = 1.5 sec   Reverberant Speech
                  FK dereverberation          1.83
                  FK dereverberation + ARS    0.76
RT_60 = 3 sec     Reverberant Speech
                  FK dereverberation          3.39
                  FK dereverberation + ARS    1.18

Table 4.1: NMR results for several test cases

and upper frequency bins of critical band b are denoted as ω_lb and ω_hb, while T_b(j) is the auditory masking threshold of the (remaining) reverberation in this critical band [51, 200]. The results of the NMR for (i) reverberant speech, (ii) dereverberation using the FK method and (iii) dereverberation using the FK method together with the ARS stage are shown in Table 4.1. The provided values are the average of measurements for two male and two female speakers; smaller NMR values indicate lower distortion. For RT_60 = 0.5 sec, the NMR of both methods is higher than that of the reverberant speech, indicating some detrimental effect due to processing for short reverberant decays. In all the other cases, an improvement of the NMR is achieved. The implementation of the proposed perceptually-motivated non-linear filtering stage proves beneficial in all three cases, and the improvement becomes more significant for higher RT values. In order to further investigate the performance improvement, a subjective test has been conducted. One of the main purposes of speech dereverberation is to improve the quality of speech. Hence, in order to evaluate and compare the performance of dereverberation algorithms it is suitable to utilize the Absolute Category Rating (ACR) method [81]. The ACR categories are 1: Bad, 2: Poor, 3: Fair, 4: Good and 5: Excellent. Ten subjects, both male and female, followed the

Experimental Conditions                      MOS    95% Conf. Int.
RT_60 = 0.5 sec   FK dereverberation
                  FK dereverberation + ARS
RT_60 = 1 sec     FK dereverberation
                  FK dereverberation + ARS
RT_60 = 1.5 sec   FK dereverberation
                  FK dereverberation + ARS
RT_60 = 3 sec     FK dereverberation
                  FK dereverberation + ARS

Table 4.2: MOS results for several test cases

test twice for all the different experimental conditions. The results are presented in Table 4.2. It is clear that the performance deteriorates for higher values of reverberation time. However, in all cases the implementation of the proposed ARS step provides superior speech quality compared to the FK method alone.

Evaluation of the proposed unified framework

In order to assess the overall performance improvement, the proposed generalized framework (relaxation criteria and non-linear filtering) has been incorporated into the LB and FK late reverberation suppression algorithms (see Section 3.3.2). Hence, the original FK and LB algorithms are compared to the corresponding Modified FK and LB algorithms. A dataset consisting of 9 anechoic sentences (5 male and 4 female speakers) is taken from the IEEE Harvard corpus [114] as 44.1 kHz, 16-bit data, in order to evaluate the performance of the proposed generalized framework. The corresponding reverberant signals are derived after convolution with real impulse responses obtained from measurements in the varechoic chamber at Bell Labs [69]. Three room configurations were chosen, corresponding to RT values of approximately 0.24, 0.38 and 0.58 sec. For each configuration, 5 on-axis measured impulse responses were taken for source-receiver distances of 1.5, 2, 3, 4 and 5 m. In order to be in line with the original implementation of the FK algorithm, the reverberant signals produced by the convolution were first inverse filtered by the 1/3-octave smoothed minimum-phase version of the impulse response in order to equalize the dominant spectral coloration, which is mainly due to early

reflections. The frame length of the analysis is set at 8192 samples (corresponding to 186 ms) with a 50% overlap. The parameter ν(ω,j) in Eqs. 4.7 and 4.8 is set to 1. The normalized cross-correlation criterion is applied for Φ_jl > 0.95, while the power relaxation criterion is applied for A = 10 dB and r_p = 10. Fig. 4.4 shows typical time-domain segments of (a) the anechoic speech signal, (b) the inverse filtered reverberant signal (RT = 0.58 sec, source-receiver distance = 2 m), (c) late reverberation suppression by FK, (d) late reverberation suppression by the proposed Modified FK, (e) late reverberation suppression by LB, (f) late reverberation suppression by the proposed Modified LB. By comparing (a) and (b), the effect of the late reverberation tail can be observed, which is retained after spectral coloration inverse filtering. The FK method (Fig. 4.4 (c)) can achieve a limited amount of late reverberant tail suppression. On the contrary, the proposed Modified FK (Fig. 4.4 (d)) successfully suppresses most of the late reverberant components, as the relaxation criteria introduced in this work explicitly detect the dominating late reverberation signal components. Similar observations apply when comparing Fig. 4.4 (f) and (e). As can be observed in Fig. 4.4 (e), LB achieves some suppression of late reverberation; however, the envelope pattern of the anechoic signal has not been retained. Again, by examining Fig. 4.4 (e) and (f), it is shown that the proposed criteria have allowed clearer identification of the signal's onset envelope. To assess the relative improvement, the NMR measure has also been evaluated. The average NMR for all tested distances is shown in Fig. 4.5 for (a) RT = 0.24 sec, (b) RT = 0.38 sec and (c) RT = 0.58 sec. In all cases it is shown that in terms of NMR assessment FK achieves better results than LB. This finding is consistent with the results presented in [51]. In Fig.
4.5 (a), the low NMR value for the inverse filtered reverberant signal implies that the late reverberation has little perceptual effect, especially for short source-receiver distances, given this low RT setting. Furthermore, all methods introduce some artifacts resulting in higher NMR values, although the proposed modifications of both the LB and FK methods achieved slightly better NMR values in all cases. It is clear that the results for this very low reverberation value illustrate the processing artifacts generated by such methods, as expected from similar tests in the literature (e.g. [51]).

Figure 4.4: Time domain representation of speech signals (a) clean speech, (b) inverse filtered reverberant speech signal, (c) late reverberation suppression by FK, (d) late reverberation suppression by Modified FK, (e) late reverberation suppression by LB, (f) late reverberation suppression by Modified LB

Figure 4.5: Average NMR (dB) for varying source-receiver distances and for (a) RT=0.24 sec, (b) RT=0.38 sec, (c) RT=0.58 sec

By examining Fig. 4.5 (b), it is shown that LB establishes average NMR values approximately equal to those of the inverse filtered reverberant speech signal. On the contrary, the proposed Modified LB method results in significantly improved reverberation suppression. Comparing FK to the proposed Modified FK, again, a further decrease in the average NMR is observed for all distances. These improvements become more significant for longer source-receiver distances, where the NMR for the inverse filtered reverberant signals also indicates reduced quality. In Fig. 4.5 (c), it is shown that all methods achieve improved NMR values compared to the reverberant signal. Again, the proposed modified methods achieved significantly better anechoic estimations compared to the corresponding reference algorithms. The average NMR improvement achieved by the modified methods for all tested distances was 2.44 dB for the LB and 5.04 dB for the FK. These NMR results were found to be significant from a perceptual point of view. Note as an example that in [51] an NMR improvement of 5 dB was found to result in a 0.5 to 1 point improvement in a 5-scale Mean Opinion Score experiment.

A second experiment has been conducted in order to evaluate the relative improvement achieved by the introduction of the proposed criteria, on the specific signal degradation cases for which the criteria have been designed. By calculating the slope of the envelope of the clean speech signals, segments representing (i) signal onsets, (ii) steady signal parts (typically due to prolonged phonemes) and (iii) signal tails have been extracted. Furthermore, the performance is tested for different values of r_p and A. The evaluation is made using the average Linear Prediction Coding (LPC) Cepstrum Distance (CD) criterion [97]. The cepstrum distance of the first 20 LPC coefficients between the dereverberated signals and the corresponding clean signals is calculated for segments of 1024 samples (23.2 ms). This distance indicates the error between the estimated signals and the corresponding clean signals.
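As an illustration, the LPC cepstrum distance used for this evaluation can be sketched as follows. This is a minimal NumPy implementation under common conventions (autocorrelation-method LPC via Levinson-Durbin, the standard cepstral recursion for an all-pole model, and the usual dB scaling of the cepstral distance); the exact windowing and averaging used in the thesis may differ.

```python
import numpy as np

def lpc(x, order):
    """LPC polynomial [1, a_1, ..., a_p] via Levinson-Durbin (autocorrelation method)."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        s = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                s += (k / n) * c[k] * a[n - k]
        c[n] = -s
    return c[1:]

def cepstral_distance_db(x, y, order=20):
    """LPC cepstrum distance (dB) between two equal-length signal frames."""
    cx = lpc_cepstrum(lpc(x, order), order)
    cy = lpc_cepstrum(lpc(y, order), order)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((cx - cy) ** 2))
```

Averaging this distance over 1024-sample frames of the dereverberated and clean signals yields per-segment CD values of the kind discussed above; the distance is zero for identical frames and grows with spectral-envelope mismatch.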
The difference between the LPC CD of the modified versions of the algorithms and the LPC CD of the original algorithms demonstrates the relative improvement obtained by the introduction of the proposed criteria. Note that more negative difference values indicate better clean speech estimation. Fig. 4.6 shows the average LPC CD difference for different RT conditions and for the case of (i) signal onsets (Fig. 4.6 (a, b)), (ii) steady state (Fig. 4.6 (c, d)) and (iii) signal tails (Fig. 4.6 (e, f)). An improved estimation of the clean

Figure 4.6: Average LPC Cepstrum Distance difference between the modified and the unmodified versions of the algorithms for varying RT conditions (a) case of signal onsets, power relaxation threshold set at 3 dB, (b) case of signal onsets, power relaxation threshold set at 10 dB, (c) case of steady state, power relaxation threshold set at 3 dB, (d) case of steady state, power relaxation threshold set at 10 dB, (e) case of signal tails, power relaxation threshold set at 3 dB, (f) case of signal tails, power relaxation threshold set at 10 dB

signal is observed in all tested conditions. For the case of signal onsets (Fig. 4.6 (a, b)) the modified algorithms achieve from 0.65 to 2.9 dB improvement in LPC CD when compared to the original reference methods. On average, better results are obtained for A = 3 dB, when the power relaxation criterion is more frequently activated. In almost all cases the performance is improved for r_p = 10. In Fig. 4.6 (c, d), it can be seen that the proposed modifications can also be advantageous for the estimation of the steady-state signal parts. It can be observed that a lower relaxation threshold (Fig. 4.6 (c)) should be combined with a softer relaxation (r_p = 2). However, a higher threshold can be used together with a higher relaxation factor (Fig. 4.6 (d)). Finally, in Fig. 4.6 (e, f) it can be observed that the modified versions of the late reverberation suppression algorithms provide superior estimations of the signal tails, and the relative improvement varied from 0.12 to 4.8 dB.

In order to evaluate explicitly the normalized cross-correlation criterion, a data set of 4 male and 3 female speakers has been generated exclusively from prolonged phonemes, recorded in semi-anechoic conditions, at 44.1 kHz with 16-bit resolution. The duration of the phoneme samples was deliberately long (2-3 seconds). At first, the data set was used to calculate the normalized cross-correlation Φ_jl between the frames of the prolonged phonemes; in all cases this was found to be higher than 0.95. Then, the corresponding reverberant signals were derived after convolution with the real impulse responses described above (source-receiver distance = 3 m). The normalized cross-correlation criterion was applied for Φ_jl = 0.95.
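To make the criterion concrete, a minimal sketch of the frame-to-frame normalized cross-correlation test is given below. It is an illustrative reading of the criterion, assuming Φ is computed between consecutive STFT magnitude frames; the exact frame pairing and spectral representation used in the thesis may differ.

```python
import numpy as np

def normalized_frame_correlation(mag_a, mag_b):
    """Normalized cross-correlation between two STFT magnitude frames."""
    den = np.linalg.norm(mag_a) * np.linalg.norm(mag_b)
    return float(np.dot(mag_a, mag_b) / den) if den > 0 else 0.0

def steady_state_flags(stft_mag, phi_th=0.95):
    """Flag frames whose spectrum strongly resembles the previous frame
    (e.g. prolonged phonemes), where the subtraction should be relaxed."""
    n_frames = stft_mag.shape[1]
    flags = np.zeros(n_frames, dtype=bool)
    for j in range(1, n_frames):
        phi = normalized_frame_correlation(stft_mag[:, j - 1], stft_mag[:, j])
        flags[j] = phi > phi_th
    return flags
```

A sustained phoneme yields near-identical consecutive magnitude frames (Φ close to 1), so such frames are flagged and protected from over-subtraction.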
Typical time domain representations of (a) the clean speech signal, (b) the inverse filtered reverberant signal (RT = 0.58 sec), (c) in black: late reverberation suppression by FK and in gray: late reverberation suppression by Modified FK, (d) in black: late reverberation suppression by LB and in gray: late reverberation suppression by Modified LB, are presented in Fig. 4.7. The degradation of the temporal envelope of the signal by the unmodified versions of both reference algorithms can be observed in Fig. 4.7 (c) and (d). On the contrary, the Modified FK and Modified LB methods (in gray), utilizing the normalized cross-correlation criterion, seem to preserve the signal's original temporal characteristics. Table 4.3 presents the average LPC CD difference between the modified and unmodified versions of the tested algorithms and, as can be observed, an improvement between

Figure 4.7: Typical time domain representations for the case of the prolonged phonemes (a) clean speech, (b) inverse filtered reverberant speech signal, (c) in black: late reverberation suppression by FK and in gray: late reverberation suppression by Modified FK, (d) in black: late reverberation suppression by LB and in gray: late reverberation suppression by Modified LB

1.03 and 4.65 dB can be achieved by the proposed criterion.

                     RT=0.24 sec   RT=0.38 sec   RT=0.58 sec
LPC CD Diff. (LB)       1.03 dB       1.55 dB       2.00 dB
LPC CD Diff. (FK)       4.05 dB       4.65 dB       4.03 dB

Table 4.3: Average LPC CD difference between the modified and unmodified versions of the tested algorithms for different RT conditions

4.6 Conclusion

This Chapter presented two signal-dependent criteria that can be easily implemented within state-of-the-art late reverberation suppression algorithms, in conjunction with perceptually motivated non-linear filtering. The proposed modifications provide a generalized framework for improving spectral subtraction dereverberation algorithms, suppressing reverberant tails more effectively whilst retaining signal onsets. The results confirm that the proposed framework achieves a better estimation of the reverberant power spectrum. Therefore a more accurate approximation of the anechoic signal is extracted. The clean signal's energy envelope is better recovered and the temporal defects caused by the oversubtraction in the spectral domain are reduced. The improvements were found to be consistent throughout a significant range of room reverberation conditions and source-receiver distances.


Chapter 5

Late Reverberation Suppression at Multiple Speaker Positions

5.1 Introduction

Inverse-filtering techniques commonly use measured RIRs in order to reduce the coloration effect produced by the early room reflections (see Section 3.2.1). However, very few late reverberation suppression methods take advantage of a measured RIR (e.g. [60]). In this Chapter, a method for suppressing late reverberation from a moving speaker utilizing a single RIR measurement is presented, assuming that the late part of the RIR approaches a wide-sense stationary process. Therefore, the spectral magnitude of late reverberation is estimated using the late part of the measured impulse response (the "reference response") and the excitation signal derived from the LP analysis of the reverberant signal. Then, spectral subtraction is used to derive a clean signal estimation. Contrary to Automatic Speech Recognition (ASR) oriented techniques (e.g. [60]), the present approach introduces a low-complexity implementation and does not require database training, either for the derivation of the early/late reflections boundary or for implementing the spectral subtraction. A final novel step, namely the Gain Magnitude Regularization (GMR) step, which prevents overestimation errors and reduces musical noise artifacts, is also introduced: low Signal to Reverberation Ratio (SRR) signal regions are identified [86] and the suppression is dynamically constrained in

order to avoid such processing distortions.

In a second step, a semi-blind version of the proposed approach was developed. In order to derive an approximation of the Power Spectral Density (PSD) of the late reflections, a recorded handclap is utilized. This is a flexible option when a RIR measurement is not feasible, since a handclap recording can provide a reasonable RIR approximation [155]. Compared to a properly measured RIR, handclaps usually have a more pronounced radiation directivity and contain less energy in the low frequencies [177], whilst presenting some spectral coloration, largely varying between such signals [158, 177]. However, for this dereverberation method, an acceptable approximation of the late reverberant PSD can be obtained. Moreover, the proposed approach provides additional compensation for estimation errors through the GMR step. The results show significant late reverberation suppression in various RT conditions and source-receiver distances, indicating that the proposed approach is appropriate for real-life applications. The novel aspects presented in this Chapter are briefly:

- A novel spectral subtraction dereverberation approach is proposed, exploiting the stationarity of the late RIR part. The proposed method employs a single RIR measurement and the excitation signal derived from the Linear Prediction (LP) analysis of the reverberant signal [197].

- A semi-blind version of the above dereverberation method is presented based on a recorded handclap, achieving sufficient robustness with respect to different source/receiver arrangements within the room [101, 194].

- A novel method compensating for overestimation errors is presented, where the suppression rate is dynamically constrained based on a Gain Magnitude Regularization (GMR) scheme
[101, 194, 195, 196, 197].

5.2 Theoretical formulation

Assuming a noise-free and stationary environment, the transfer function between a speaker and a microphone in a specific room position can be described by the corresponding impulse response h_r(n). Then the reverberant speech signal

captured by the microphone, y(n), is the convolution of the clean speech s(n) and the room impulse response:

y(n) = \sum_{m=0}^{L_r} h_r(m) s(n-m)    (5.1)

where L_r is the length of the RIR. Using the LP analysis, a speech signal is modelled as the convolution of an excitation signal u(n) and a speech production filter h_s(n), describing the formant structure determined by the glottal, the vocal tract, and the lip radiation filters:

s(n) = \sum_{m=0}^{L_s} h_s(m) u(n-m)    (5.2)

From Eqs. 5.1 and 5.2 the reverberant signal can be described as:

y(n) = \sum_{m=0}^{L_r} \sum_{l=0}^{L_s} h_r(m-l) h_s(l) u(n-m)    (5.3)

As mentioned, the impulse response of a reverberant room can be separated in two parts, that is the early reflections and the late reverberation:

h_r(n) = h_{early}(n) + h_{late}(n)    (5.4)

Therefore Eq. 5.3 can be written as:

y(n) = \sum_{m=0}^{L_b} \sum_{l=0}^{L_s} h_{early}(m-l) h_s(l) u(n-m) + \sum_{m=L_b+1}^{L_r} \sum_{l=0}^{L_s} h_{late}(m-l) h_s(l) u(n-m)    (5.5)

where L_b is the length of h_{early}(n). Usually the length of the speech production filter can be assumed shorter than the length of h_{early}(n) (i.e. L_s < L_b) [214].
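The decomposition in Eqs. 5.1 and 5.4 can be verified numerically. The sketch below builds a synthetic response (a direct path plus an exponentially decaying noise tail is only a crude stand-in for a measured RIR; all values are illustrative) and checks that convolving with h_r equals the sum of the early and late convolutions:

```python
import numpy as np

fs = 16000
rng = np.random.default_rng(0)

# Synthetic RIR: direct path plus an exponentially decaying noise tail
# (illustrative stand-in for a measured response)
L_r = fs // 2
h_r = rng.standard_normal(L_r) * np.exp(-6.9 * np.arange(L_r) / L_r)
h_r[0] = 1.0

# Early/late split at L_b; a fixed 80 ms boundary is used here for simplicity
L_b = int(0.08 * fs)
h_early = np.where(np.arange(L_r) < L_b, h_r, 0.0)
h_late = h_r - h_early

s = rng.standard_normal(fs)                                  # stand-in for clean speech s(n)
y = np.convolve(s, h_r)                                      # Eq. 5.1
y_split = np.convolve(s, h_early) + np.convolve(s, h_late)   # Eqs. 5.4/5.5
assert np.allclose(y, y_split)                               # convolution is linear in h_r
```

The identity holds because convolution is linear in the impulse response, which is exactly what allows the early and late contributions to be treated separately in the following derivation.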

Hence Eq. 5.5 is written as:

y(n) = \sum_{m=0}^{L_b} \sum_{l=0}^{L_s} h_{early}(m-l) h_s(l) u(n-m) + \sum_{m=L_b+1}^{L_r} h_{late}(m) u(n-m)    (5.6)

Consider now a setup with a fixed microphone and a moving source, which represents a common hands-free communications scenario. The source has an initial position ρ_0 with a corresponding RIR h_r^0(n). Assume that h_r^0(n) is known and, as in Eq. 5.4, it can be expressed as:

h_r^0(n) = h_{early}^0(n) + h_{late}^0(n)    (5.7)

In general, when the source moves to another position ρ_i, a different RIR defines the corresponding acoustical path:

h_r^i(n) = h_{early}^i(n) + h_{late}^i(n)    (5.8)

The early part of a RIR changes significantly with even small changes in the source-microphone position. On the other hand, during the late reverberation part, the energy is statistically equal in all regions of the room [18]. Hence, it can be assumed that the PSD of h_{late}^i(n) is approximately the same for all i and equal to the PSD of h_{late}^0(n):

|H_{late}^i(\omega, j)|^2 = |H_{late}^0(\omega, j)|^2 \quad \forall i    (5.9)

where H_{late}^0(\omega, j) is the STFT of the late part of the measured impulse response. Based on the above assumption and the LP analysis, the principle of spectral subtraction can be used for the suppression of late reverberation from the speech signal of a moving source. A fairly accurate estimation of the late reverberation power spectrum R_{late}(\omega, j) can be obtained in the STFT domain as:

R_{late}(\omega, j) = |H_{late}^0(\omega, j)|^2 |U^i(\omega, j)|^2    (5.10)

with U^i(\omega, j) being the STFT of the LP residual of the reverberant signal.
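The estimation in Eq. 5.10 can be sketched per STFT frame as follows. The LP residual is obtained by inverse-filtering the reverberant signal with its own LPC polynomial. Since this excerpt does not fully specify how the frame index of H^0_late is handled, the sketch simply aggregates the late-response PSD over its frames (an assumption made here for illustration), and all parameter values are illustrative.

```python
import numpy as np
from scipy.signal import stft, lfilter

def lp_residual(y, order=13):
    """LP residual u(n) = A(z) y(n), with A(z) from Levinson-Durbin."""
    n = len(y)
    r = np.correlate(y, y, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return lfilter(a, [1.0], y)

def late_reverb_psd(y, h0_late, fs, nperseg=1024):
    """Estimate R_late(w, j) = |H0_late(w)|^2 |U(w, j)|^2 (cf. Eq. 5.10);
    the late-RIR PSD is collapsed over its frames (simplifying assumption)."""
    _, _, U = stft(lp_residual(y), fs=fs, nperseg=nperseg)
    _, _, H = stft(h0_late, fs=fs, nperseg=nperseg)
    h2 = np.sum(np.abs(H) ** 2, axis=1, keepdims=True)
    return h2 * np.abs(U) ** 2
```

The returned array has one row per frequency bin and one column per frame of the reverberant signal, ready for the spectral subtraction that follows.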

Hence, an estimation of the clean signal's power spectrum can be derived:

|\hat{S}^i(\omega, j)|^2 = \frac{|Y^i(\omega, j)|^2 - R_{late}(\omega, j)}{|Y^i(\omega, j)|^2} |Y^i(\omega, j)|^2 = G(\omega, j) |Y^i(\omega, j)|^2    (5.11)

where |\hat{S}^i(\omega, j)|^2 and |Y^i(\omega, j)|^2 are the PSD estimations of the clean and the reverberant signals in position ρ_i respectively, and G(\omega, j) is the derived gain magnitude function.

5.3 Definition of the early/late reflections boundary

For the sake of simplicity, the boundary between the early reflections and the late reverberation of a RIR is often defined as a fixed time interval [18] or in relation to the volume of the room [37, 89]. However, the precise definition of the early/late reflections boundary is a challenging and open research issue [76, 201]. Here, the method proposed in [37, 174] is used. The measured RIR h_r^0(n) is partitioned in non-overlapping frames of length L_κ and for each frame h^0(j) the normalized kurtosis is calculated as follows:

Kurt[h^0(j)] = \frac{E[(h^0(j) - \mu)^4]}{\sigma^4} - 3    (5.12)

The boundary is defined as L_b = j_{min} L_κ, where L_κ is the frame size and j_{min} is given by:

j_{min} = \arg\min_j (Kurt[h^0(j)])    (5.13)

5.4 Gain magnitude regularization

In order to compensate for overestimation errors, a novel low-complexity approach based on Gain Magnitude Regularization (GMR) is introduced. High-SRR spectral regions, such as signal steady states, are less affected by late reverberation [187, 218] and thus an overestimation of the late reverberation is less likely to

happen. On the other hand, artifacts are expected in low-SRR regions; hence a low-SRR detector is used [86] and the GMR technique is introduced in order to constrain only the low-gain parts. The proposed technique can be beneficial when compared to moving-averaging approaches, as it does not affect the high-gain bins. The PSD estimation of the clean signal is derived as in Eqs. 5.14 and 5.15, where θ is the threshold for applying the gain constraints, r is the regularization ratio, ζ is the power ratio between the enhanced and the reference signal, ζ_th is the threshold of the low-SRR detector and Ω is the frame size:

|\hat{S}^i(\omega, j)|^2 = \begin{cases} \left( \theta + \frac{G(\omega, j) - \theta}{r} \right) |Y^i(\omega, j)|^2 & \text{when } \zeta < \zeta_{th} \text{ and } G(\omega, j) < \theta \\ G(\omega, j) |Y^i(\omega, j)|^2 & \text{otherwise} \end{cases}    (5.14)

\zeta = \frac{\sum_{\omega=1}^{\Omega} G(\omega, j) |Y^i(\omega, j)|^2}{\sum_{\omega=1}^{\Omega} |Y^i(\omega, j)|^2}    (5.15)

5.5 The handclap approximation

In this section, it is further assumed that the PSD of the late reverberation in position ρ_0 can be approximated by the PSD of the late part of a handclap recording, |C_{late}^0(\omega, j)|^2, in the same position, i.e.:

|H_{late}^0(\omega, j)|^2 \approx |C_{late}^0(\omega, j)|^2    (5.16)

Hence, following Eqs. 5.10, 5.11 and 5.16, an estimation of the direct signal's power spectrum can be derived:

|\hat{S}^i(\omega, j)|^2 = |Y^i(\omega, j)|^2 - |C_{late}^0(\omega, j)|^2 |U^i(\omega, j)|^2    (5.17)

Previous work has shown that a recorded clap may differ from a measured RIR in (i) the low-frequency range, and (ii) the details of its spectrum, presenting

some sort of spectral coloration [158, 177]. Late reverberation arises by definition in the diffuse field and its spectrum is approximately white [18]. It is reasonable to assume that the same applies for the late RIR part due to a handclap. In addition, speech signals do not contain significant energy in the low-frequency range. Hence, the above differences can be considered insignificant in the context of late reverberation affecting speech signals, and the PSD of the late reflections can be efficiently approximated by the PSD of the late part of a handclap. These assumptions have been verified through higher-order statistics [55, 56] that compare the properties of the PSD of handclaps and RIRs. The analysis is detailed in [194] and it has proven that (a) the statistical properties of the late parts of the RIRs and handclaps present similarities and (b) the PSD of the late part of a handclap recorded in a certain room position can be used as an approximation of the PSD of the late part of the RIR in another position. Note however that the running kurtosis approach described in Section 5.3 sometimes fails to provide a robust mixing time estimation, due to the noisier nature of the handclap recordings. Hence, when the proposed method is employed with a handclap recording, the normalized kurtosis technique is initially implemented and, if the derived t_mix value is within a reasonable range (e.g. 50 ms ≤ t_mix ≤ 500 ms [76]), then the derived value is used; in any other case a static threshold of 80 ms is applied. The proposed signal flow is shown in Fig. 5.1.

5.6 Tests and results

Results for measured RIRs

For the evaluation of the proposed method, sixteen phrases uttered by both male and female speakers of the TIMIT database were convolved with real RIRs measured in (a) a lecture hall (Room 1) and (b) a large auditorium (Room 2) of a conference center.
The measurement setup, as well as the room acoustical properties, are shown in Fig. 5.2. The performance of the proposed method was examined both with and without the GMR step. The speech signals and the RIRs were sampled at 16 kHz with 16-bit precision and the LP analysis order was 13. The frame size was 1024 samples with a 25% overlap, and the thresholds θ

Figure 5.1: Signal flow of the proposed method

Room        RT60 (sec)    Volume (m³)
1 (grey)
2 (white)

Figure 5.2: Illustration of the measurement setup for the RIRs used in this Section

and ζ_th were set at 0.4, and the value of the regularization ratio r was 6.

Figure 5.3: SRR improvement (in dB) for different cases in (a) Room 1 and (b) Room 2

Figure 5.3 shows the averaged segmental Signal to Reverberation Ratio (SRR) improvement for (a) Room 1 and (b) Room 2. For each room, the method was evaluated at three different positions. For each position the method was also tested three times, each time assuming a RIR measured at each of the three test positions. An improvement in terms of SRR was noticed in all tested cases, with a relatively consistent performance regardless of the reference RIR. In both rooms the improvement was greater for the more distant room positions (position C in both rooms), where the reverberant signals contained more late reverberation. The use

of the GMR step resulted in a small reduction of the achieved late reverberation suppression in all positions.

Room 1
Position    h_A    h_A+GMR    h_B    h_B+GMR    h_C    h_C+GMR
A
B
C

Room 2
Position    h_A    h_A+GMR    h_B    h_B+GMR    h_C    h_C+GMR
A
B
C

Table 5.1: PESQ improvement for various cases

Table 5.1 presents the improvement achieved by the proposed method in terms of the Perceptual Speech Quality measure (PESQ) [82], when compared to the reverberant signals. The PESQ measure [82, 114] implements a perceptual model in order to assess the quality of the tested speech signal and rate it according to the five-grade Mean Opinion Score (MOS) scale. It has been found to correlate with subjective listening quality tests and to perform reliably across a wide range of speech coding and network transmission conditions [114]. Again, an improvement is shown for all tested cases. Moreover, the application of the GMR resulted in significantly improved results for both rooms. Apparently the approximate nature of the late reverberation spectrum estimation presented here may produce artifacts, and the GMR step is important in order to perceptually enhance the estimated clean speech signal. It is interesting to note that the main assumption used here, i.e. the stationarity of the late reverberation, is largely supported by the presented results, as the method performs consistently regardless of the reference RIR used.
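The GMR step (Eqs. 5.14-5.15) amounts to a simple per-frame gain compression. A minimal sketch, following one plausible reading of the piecewise rule (gains below θ are pulled toward θ by the ratio r whenever the frame power ratio ζ falls below ζ_th):

```python
import numpy as np

def gmr_gain(G, Y2, theta=0.4, zeta_th=0.4, r=6.0):
    """Gain Magnitude Regularization for one STFT frame.

    G  : spectral subtraction gains G(w, j) for this frame
    Y2 : reverberant power spectrum |Y(w, j)|^2 for this frame
    """
    zeta = float(np.sum(G * Y2) / np.sum(Y2))  # Eq. 5.15: frame power ratio
    G_out = np.asarray(G, dtype=float).copy()
    if zeta < zeta_th:                         # low-SRR frame detected
        low = G_out < theta                    # constrain only the low-gain bins
        G_out[low] = theta + (G_out[low] - theta) / r
    return G_out
```

With θ = ζ_th = 0.4 and r = 6 as above, a uniform gain of 0.1 in a low-SRR frame is raised to 0.35, while frames with high gains pass through unchanged; this is what limits musical-noise artifacts without touching the high-SRR regions.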

Figure 5.4: Illustration of the measurement setup for the RIR and handclaps

Results for recorded handclaps

The same TIMIT database excerpts as before have been used, together with real RIRs measured in two different rooms: (a) a listening room (Room 3) and (b) a lecture hall (Room 4). In addition, handclaps in different positions in these rooms were also recorded. The room acoustical properties and the setup of the measurements are presented in Fig. 5.4. Squares denote the positions where handclaps were recorded, whereas circles symbolize the positions where the impulse responses were measured. The dereverberation processing was applied for a frame size of 1024 samples with a 25% overlap, the thresholds θ and ζ_th were set at 0.4, the value of the regularization ratio r was 6 and the LP analysis order was 13. The mean SRR difference between the reverberant and the estimated clean signals is shown in Fig. 5.5. Two source-receiver positions are tested in each room. Note that positive SRR differences denote the absolute improvement when compared to the reverberant signals. In Room 3 the proposed method is evaluated using as a reference a RIR and a clap recorded at position A (h_A and c_A respectively) and a RIR and a clap recorded at position B (h_B and c_B respectively). In Room 4 the method is evaluated utilizing two RIRs measured in positions C and D (h_C and h_D respectively) and two claps recorded in positions E and F (c_E and c_F respectively). In all cases an improvement in terms of SRR was observed, this improvement being greater for the larger room (Room 4). In

Figure 5.5: Mean SRR difference of the reverberant and the estimated clean signals for (a) Room 3, (b) Room 4

Figure 5.6: Mean NMR difference between the estimated clean and the corresponding reverberant signals for (a) Room 3, (b) Room 4

addition, it seems that the use of the actual RIRs as reference reduces the late reverberation suppression in all positions.

The Noise to Mask Ratio (NMR) measure is used to assess the quality of the clean signal estimations [51, 187]. The mean NMR difference between the estimated clean and the corresponding reverberant signals for the same experimental conditions as above is shown in Fig. 5.6. The recorded claps achieved significant NMR improvement in all tested cases; however, slightly better results are obtained when the actual RIRs were used as a reference. This indicates that the greater SRR values achieved from the reference claps (see Fig. 5.5) may be to some extent due to an overestimation of the late reverberant PSD. However, the substantial overall improvement both in NMR and SRR shows that the proposed approach achieves suppression of late reverberation and improves the quality of the produced signals, which was also confirmed by several informal listening tests performed by the authors.

5.7 Conclusion

The proposed technique extracts the room late reflection characteristics based either on a single RIR measurement or on a recorded handclap. Then, an efficient spectral subtraction approach is adopted in order to suppress late reverberation at multiple speaker positions. A Gain Magnitude Regularization technique is also proposed in order to compensate for any overestimation errors and to reduce musical noise artifacts. By using either a single RIR measurement or a single handclap recording, it appears that sufficient robustness can be achieved with respect to different source/receiver arrangements within the room, and the results show significant speech enhancement independent of the reference room.

Chapter 6

Binaural Late Reverberation Suppression

6.1 Introduction

As mentioned before, single-channel spectral subtraction algorithms are commonly used to suppress late reverberation. A binaural extension of such methods, apart from suppressing reverberation without introducing processing artifacts, should also preserve the signal's binaural localization cues (see Section 3.4). In this Chapter, efficient techniques to adapt single-channel spectral subtraction dereverberation algorithms to a binaural context are examined and evaluated. Hence, binaural extensions based on the Delay and Sum Beamformer (as proposed in [86]) are implemented into three state-of-the-art spectral subtraction methods described in Section 3.3.2. In addition, a generalized approach based on the adaptation of the spectral gains derived by bilateral processing is presented and three possible gain adaptation strategies are investigated. Briefly, the novel aspects introduced in this Chapter are:

- A novel binaural extension of single-channel spectral subtraction dereverberation methods, preserving the signal's binaural localization cues, is presented.

- Three bilateral gain adaptation schemes are evaluated in three state-of-the-art spectral subtraction dereverberation algorithms and the most prominent

binaural extensions were revealed through objective measures for several experimental conditions, indicating the optimum dereverberation approach for each reverberation scenario.

The publications related to Chapter 6 are [57, 195, 196].

6.2 Proposed binaural dereverberation processing

An effective approach for extending the LB dereverberation method to a binaural context is to derive a reference signal using a Delay and Sum Beamformer (DSB) [86], where the time delays are estimated utilizing a method based on the generalized cross-correlation with phase transform, as proposed in [98]. The reference signal is calculated as the average of the time-aligned left and right reverberant signals. Using the reference, appropriate weighting gains are derived following the procedure described in Section 3.3.2, and identical processing is applied to both the left and right channels. In this Chapter, the DSB approach is also implemented for both the WW and FK methods, in order to evaluate the efficiency of different late reverberation estimation techniques in a binaural scenario.

Furthermore, an alternate binaural adaptation scheme is presented. In binaural applications, the time delay between the left and right channels of the speech signal is limited by the width of the human head. Therefore, it can be assumed shorter than the length of a typical analysis window used in spectral subtraction techniques, and hence the time alignment stage is omitted. Then, each algorithm is implemented independently for the left and right ear channel signals, resulting in the corresponding weighting gains G_l(ω,j) and G_r(ω,j) (see Section 3.3.2). These gains are combined and different adaptation strategies are investigated for each algorithm:

(i) The final gain is derived as the maximum of the left and right channel weighting gains:

G(\omega, j) = \max(G_l(\omega, j), G_r(\omega, j))    (6.1)

This approach (maxgain) achieves moderate late reverberation suppression, but

it is also less likely to produce overestimation artifacts.

(ii) The final gain is derived as the average of the left and right channel weighting gains:

G(\omega, j) = \frac{G_l(\omega, j) + G_r(\omega, j)}{2}    (6.2)

This gain adaptation strategy (avggain) compensates equally for the contribution of the left and right channels.

(iii) The final gain is derived as the minimum of the left and right channel weighting gains:

G(\omega, j) = \min(G_l(\omega, j), G_r(\omega, j))    (6.3)

The above adaptation technique (mingain) results in maximum reverberation attenuation, but the final estimation may be susceptible to overestimation artifacts. In all the above cases, the GMR as described in Section 5.4 is utilized. This technique is applied in order to reduce artifacts related to overestimation errors.

Parameter              LB    WW    FK
Total Frame Length
Zero padding
Frame Overlap

Table 6.1: Parameter values for the employed methods

6.3 Tests and results

Eight anechoic phrases uttered by both male and female speakers of the TIMIT database were convolved with real Binaural RIRs (BRIRs). Four BRIRs measured in a Stairway Hall (RT60 = 0.69 sec) at a source-receiver distance of 3 m and azimuth angles of 0, 30, 60 and 90 degrees were chosen from the Aachen database [86]. In addition, three BRIRs measured in a Cafeteria (RT60 = 1.29 sec) at source-receiver distances of 1.18, 1 and 1.62 m and azimuth angles of approximately 30, 0 and 90 degrees were chosen from the Oldenburg database [92]. The speech signals and the BRIRs were sampled at 16 kHz with a 16-bit resolution and the authors made informal tests to select optimal values for the analysis parameters (see Table 6.1). The θ and ζ_th values of the GMR step were set at 0.15, the regularization ratio

Figure 6.1: SRR difference for the three tested methods (LB, WW, FK) for (a) Stairway Hall, (b) Cafeteria (DSB: Delay and Sum Beamformer, maxgain: maximum bilateral gain adaptation, avggain: average bilateral gain adaptation, mingain: minimum bilateral gain adaptation).

r was 4 and the RT_60 was calculated from the impulse responses. All parameter values that are not detailed here were set according to the values proposed by the authors of the original works. In addition, for the FK and LB techniques, two additional relaxation criteria were imposed [187], as they were previously found by the authors to have advantageous effects on the performance. The WW and FK methods assume that an inverse-filtering stage precedes the spectral subtraction implementation. Here, however, the implementation of a 1/3 octave minimum-phase inverse filtering was not found to notably alter the relative improvement achieved by the tested methods. Therefore, a generalized case where the spectral subtraction is applied directly to the reverberant signals is presented. The average segmental Signal to Reverberation Ratio (SRR) differences when compared to the corresponding reverberant signals for (a) Stairway Hall and (b) Cafeteria are presented in Fig. 6.1. The SRR measure evaluates the suppression intensity and is the equivalent of the SNR when reverberation is considered as additive noise. For the case of the Stairway Hall, all binaural extension strategies for all three methods achieve a significant SRR improvement. As expected, the mingain technique achieved substantial reverberation suppression and therefore resulted in a greater SRR. On the other hand, less reverberation was suppressed

utilizing the maxgain technique. The DSB and the avggain adaptation techniques produce similar results, as in principle they both take the left and right channels equally into account. The FK method seems to suppress more reverberation than the other two tested methods. The SRR differences presented in Fig. 6.1(b) for the larger enclosure (Cafeteria) are significantly smaller. In such rooms, dereverberation becomes a very challenging problem and most algorithms introduce artifacts due to late reverberation overestimation errors. Hence, it can be seen that both the LB and WW approaches achieve a small SRR improvement, while the FK method reduces the SRR in all cases. A further evaluation of the produced signals is made through the Perceptual Evaluation of Speech Quality (PESQ) variation [82], compared to the reverberant signals. The results are presented in Table 6.2 (bold values denote optimum performance). For the case of the Stairway Hall, the largest PESQ improvement is achieved utilizing the WW method with the mingain adaptation technique. The same technique also seems to be the optimal choice when used in conjunction with the LB method. It can be assumed that in a scenario where bilateral late reverberation estimations are successful, this technique presents superior performance. However, it is not beneficial when used with the FK method, where the bilateral processing probably resulted in inferior estimates. The FK method produces better results when used with the avggain technique. In general, the WW method shows a significant PESQ increment for all tested adaptation techniques. For the Cafeteria, the LB method produces a relatively stable PESQ improvement independent of the employed binaural extension. On the other hand, better results are derived with the WW method for all binaural adaptation techniques, the best results being achieved with the avggain approach.
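The DSB reference-signal construction and the bilateral gain adaptation rules of Eqs. 6.1-6.3 can be sketched as follows. This is a minimal Python illustration, not the thesis implementation: the GCC-PHAT delay estimator, the head-width delay limit and all function names are assumptions made for this example.

```python
import numpy as np

def gcc_phat_delay(x_l, x_r, fs, max_tau=0.001):
    """Estimate the inter-channel delay (samples) via generalized
    cross-correlation with phase transform (GCC-PHAT)."""
    n = len(x_l) + len(x_r)
    X_l, X_r = np.fft.rfft(x_l, n=n), np.fft.rfft(x_r, n=n)
    cross = X_l * np.conj(X_r)
    cross /= np.abs(cross) + 1e-12                 # PHAT whitening
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(max_tau * fs)                  # head-width limit
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift  # negative: right lags

def dsb_reference(x_l, x_r, fs):
    """Delay-and-sum reference: time-align the channels and average."""
    lag = gcc_phat_delay(x_l, x_r, fs)
    return 0.5 * (x_l + np.roll(x_r, lag))

def combine_gains(G_l, G_r, mode):
    """Bilateral adaptation of the left/right weighting gains."""
    if mode == "maxgain":      # Eq. 6.1: least suppression, fewest artifacts
        return np.maximum(G_l, G_r)
    if mode == "avggain":      # Eq. 6.2: equal contribution of both ears
        return 0.5 * (G_l + G_r)
    if mode == "mingain":      # Eq. 6.3: strongest suppression
        return np.minimum(G_l, G_r)
    raise ValueError(mode)
```

The combined gain would then be applied identically to both ear signals, in line with the identical left/right processing described above.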
The FK method seems to produce processing artifacts regardless of the employed binaural adaptation scheme and decreases the PESQ values in every case. Finally, note that the DSB implementation has the advantage of lower computational complexity, as it involves single-channel processing for the estimation of the weighting gain functions. On the other hand, the proposed gain adaptation techniques involve bilateral processing, but do not necessitate the initial time delay estimation. Spectral-subtraction dereverberation techniques were often proposed in order to improve speech intelligibility in teleconference setups and to compensate for

Table 6.2: PESQ improvement for various cases

Stairway Hall
Method   DSB   maxgain   avggain   mingain
LB
WW
FK

Cafeteria
Method   DSB   maxgain   avggain   mingain
LB
WW
FK

the ASR deterioration. In such cases, it is usual to assume speech reproduction in acoustically treated rooms, and the proposed dereverberation methods achieve better results. However, in real-life scenarios (e.g. binaural dereverberation for hearing aids) no acoustically optimized enclosures can be assumed. In such cases perceptually-motivated algorithms may be more appropriate (e.g. see Chapter 7). However, the binaural extension of such algorithms is challenging, as it should take into account many aspects of the binaural hearing mechanism.

6.4 Conclusion

Different binaural implementation strategies for single-channel spectral subtraction dereverberation algorithms were presented. The performance of a previously proposed approach based on a Delay and Sum Beamformer was compared with three new schemes adapting the gains derived from bilateral processing. All techniques were implemented for three state-of-the-art spectral subtraction dereverberation methods. The results show that the best performance in low reverberation environments is achieved when using the mingain or the avggain technique, while in strongly reverberant conditions the maxgain or the avggain implementations achieve better results. Finally, the method proposed by Wu and Wang [216] was found to be more robust for extension into a binaural context.

Chapter 7

Blind Dereverberation Based On Perceptual Reverberation Modeling

7.1 Introduction

Most recent dereverberation techniques have been developed specifically for speech signals, since reverberation (and essentially late reverberation) is known to reduce speech intelligibility and deteriorate the performance of ASR systems [63]. However, such dereverberation techniques developed for speech are not always appropriate for processing broadband audio signals (e.g. music), since these have a broader frequency range and a sharper transient structure, and also since the typical statistical assumptions made for speech are not always valid for music (e.g. stationarity for frames shorter than 20 ms [39]). Furthermore, music is often reproduced in large auditoria and thus longer RT values are usually involved. In addition, for most speech applications, a deterioration of the signal's quality might in principle be acceptable after dereverberation if an increase in ASR performance can be achieved. On the contrary, in realistic sound engineering scenarios, the quality of the produced dereverberated signal should not be compromised. In this Chapter, a novel unified approach for blind, single-channel late reverberation suppression which is appropriate for both speech and music signals is

presented. The proposed technique involves signal processing both in the spectral and in the sub-band domains and incorporates a time-frequency auditory masking model in order to derive a perceptually compliant clean signal estimation. This model is used to identify signal regions where late reverberation is perceptually prominent, as opposed to signal parts where reverberation is masked and hence not audible. A hybrid sub-band gain filtering is performed, aiming to reduce reverberation mainly on those critically degraded signal components. The technique is evaluated through objective and subjective tests, and the results show that it achieves significant suppression of late reverberation by producing high-quality clean signal estimations. Briefly, the novel aspects presented in this Chapter are:

- A time-frequency auditory model is incorporated in the dereverberation process. This model quantifies the reverberation distortion throughout the signal's evolution and locates the signal regions where late reverberation is audible, i.e. it is unmasked from the clean signal components. [188, 189]

- This novel method employs a selective signal-processing approach, where only the signal components that are badly contaminated by late reverberation are processed through hybrid sub-band filtering. [188, 189]

- The above filtering is adaptively adjusted based on indicators of the severity of the reverberation degradation. This novel approach results in substantial reverberation reduction with fewer processing artifacts than any other tested technique. [188, 189, 190]

- The proposed technique suppresses reverberation involving temporal processing in perceptually-significant sub-bands, in line with psychoacoustic research findings indicating that the temporal envelope of a signal should be viewed as a real signal within the auditory system [211].
- The proposed method is appropriate for both speech and music signals. [188, 189, 190, 198]

7.1.1 Method overview

As a starting point for the proposed method, modified versions of two state-of-the-art spectral subtraction algorithms (see Sections , and Chapter 3) are used in order to provide rough estimations of the clean signal. In the next step, after the employment of the CAMM (see Section 2.3.3), these rough estimations are used to derive an approximation of the RMI time-frequency map, which provides a perceptual measure of reverberation distortion throughout the signal's evolution. From this map, the perceptually important reverberant signal regions are located and suppressed through sub-band envelope filtering, the gain functions for each sub-band being calculated through analytical expressions. For the derivation of these filtering gain functions, this approach takes into account a rough estimate of the RT, the Crest Factor of each sub-band and the RMI values for the signal area of interest, which are all used as indicators of the severity of the reverberation degradation. Hence, the main processing steps of the method are the following (see Fig. 7.1):

(a) a blind approximation of the RT is made through a hybrid approach [191] combining the techniques described in earlier works [93, 155].

(b) a rough blind estimation of the clean signal s_e(n) is derived, in order to evaluate an RMI approximation. To this end, any traditional technique may be employed; however, here the generalized framework described in Chapter 4 is employed.

(c) an estimation of the RMI D_k(n) in each sub-band is derived, which is utilized to identify the signal regions that contain a perceptually-significant amount of late reverberation (k being the sub-band number). For this, the time domain rough clean signal estimations s_e(n) together with the reverberant signal y(n) are used as inputs to the CAMM.
(d) a temporal envelope filtering stage is implemented, where the previously defined signal regions are further processed via a novel sub-band technique, in order to obtain the final clean signal estimation s_f(n).

Figure 7.1: Block diagram overview of the proposed method

7.2 Method description

After the estimations of the RT_60 and the rough clean signal, the reverberant signal is analyzed in sub-bands and the crest factor Cf_k in each sub-band is calculated. The crest factor is defined as the peak value of any signal relative to its RMS value (see Section 4.3). Note that for real-time implementations, an estimation of both the crest factor and the RT can be derived from short signal segments. The RMI in each sub-band is employed as an indicator of reverberation distortion, so that signal regions that contain perceptually significant components of late reverberation are located. In Fig. 7.2 (a), the typical representations of a reverberant and an anechoic signal in a single sub-band are shown. The corresponding RMI function D_k(n) and its local extrema are presented in Fig. 7.2 (b). It can be observed that reverberant signal tails can be identified by the D_k(n). Signal regions between a local RMI maximum and a subsequent local minimum correspond to signal offsets typically containing perceptually-detectable late reverberation energy. Based on this observation, a novel selective signal processing approach is adopted, which affects only such signal regions while leaving the other signal components intact. This is realized by deriving appropriate envelope gain functions in each sub-band, as is shown in Fig. 7.2 (c) and (d). In this way, artifacts are less likely to be introduced on signal portions that do not contain significant late reverberation.
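As an illustration, this selective per-band processing can be sketched as follows. This is a minimal Python sketch, not the thesis implementation: it assumes the starting gain at each RMI maximum takes the form G_k(i) = (1/RT_60)((1 - Cf_k)/(1 + Cf_k) + γ_1), clipped to [0, 1], and decays exponentially towards g_k(i) = G_k(i) - γ_2(max(D_k) + RMS(D_k)) over each maximum-to-minimum span (cf. Eqs. 7.1-7.3 below); all function names are illustrative.

```python
import numpy as np

def rmi_regions(D_k):
    """Pair each local maximum of the RMI D_k(n) with the following
    local minimum; these spans are the signal offsets to be processed."""
    sign = np.sign(np.diff(D_k))
    turns = np.where(np.diff(sign) != 0)[0] + 1   # slope sign changes
    maxima = [t for t in turns if D_k[t] > D_k[t - 1]]
    minima = [t for t in turns if D_k[t] < D_k[t - 1]]
    regions = []
    for M in maxima:
        later = [m for m in minima if m > M]
        if later:
            regions.append((M, later[0]))         # (M_k(i), m_k(i))
    return regions

def subband_gain(D_k, rt60, crest, gamma1=0.5, gamma2=0.4):
    """Envelope gain G_k(n): unity outside the detected regions,
    exponential decay from G_k(i) towards g_k(i) inside them."""
    G = np.ones(len(D_k))
    for M, m in rmi_regions(D_k):
        # assumed form of Eq. 7.2: lower gain for higher RT60 / crest
        Gi = np.clip((1.0 / rt60) * ((1.0 - crest) / (1.0 + crest) + gamma1),
                     1e-3, 1.0)
        rms = np.sqrt(np.mean(D_k[M:m + 1] ** 2))  # quadratic mean of D_k
        gi = np.clip(Gi - gamma2 * (np.max(D_k) + rms), 1e-3, 1.0)
        t = np.arange(m - M + 1) / (m - M)
        G[M:m + 1] = Gi * (gi / Gi) ** t           # exponential decay
    return G
```

A moving-average smoothing of G_k(n) would then follow, as described in the text, to avoid audible temporal discontinuities.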
To express this formally, having identified the signal regions that must be further processed, a gain function G_k(n) is introduced that will apply only in

Figure 7.2: Typical representations for a percussion music sample in a sub-band (k = 10) of (a) the reverberant signal (in light grey) and the clean signal (in black), (b) the reverberant signal, the corresponding RMI and its local extrema, (c) signal regions of the reverberant signal that will remain intact (in light grey) and signal regions of the reverberant signal that will be processed (in dark grey), (d) the derived sub-band gain function

these sub-band signal components, through a novel heuristic approach. Let L_k be the total number of local D_k(n) extrema in a sub-band, and M_k(i) and m_k(i) the time indices where each local maximum and the succeeding local minimum occur, respectively. We define m_k(0) = 0 and i to designate consecutive pairs of local extrema. By assuming that silence precedes and follows each sound segment, the first and the last local extremum of each D_k(n) will be a local minimum (e.g. Fig. 7.2). For i = 0, ..., (L_k - 1)/2:

G_k(n) = G_k(i) (g_k(i) / G_k(i))^((n - M_k(i)) / (m_k(i) - M_k(i)))   when M_k(i) <= n <= m_k(i)
G_k(n) = H(n - m_k(i)) - H(n - M_k(i+1))                               otherwise        (7.1)

where G_k(i) is the value of the gain function at each local maximum:

G_k(i) = (1 / RT_60) ((1 - Cf_k) / (1 + Cf_k) + γ_1)    (7.2)

and g_k(i) is the value of the gain function at the consecutive local minimum:

g_k(i) = G_k(i) - γ_2 ( max(D_k(n)) + sqrt( (1 / (m_k(i) - M_k(i))) Σ_{n=M_k(i)}^{m_k(i)} D_k²(n) ) )    (7.3)

and H is the Heaviside step function:

H(n) = 1 when n >= 0, and H(n) = 0 otherwise    (7.4)

A simple moving average smoothing is also performed in order to avoid temporal discontinuities that would result in audible distortion. Naturally, higher RT_60 values denote increased degradation from late reverberation and result in lower G_k(i) values. On the other hand, the crest factor Cf_k quantifies the depth of modulation of the temporal envelope and reflects the waveform peakiness [154]. A low crest factor value indicates a flatter temporal envelope, while a higher value

points to a waveform with strong peaks and intermediate low-level signal regions, which are most likely to be afflicted by late reverberation. Hence, the temporal envelope of high crest factor signals is more likely to be significantly degraded, while low crest factor signals will be less affected by late reverberation. This assumption is used in Eq. 7.2 in order to predict the severity of the degradation produced by reverberation. Clearly, the gain function G_k(n) decays exponentially between G_k(i) and g_k(i) during signal regions that are identified to contain perceptually detectable reverberation energy and remains 1 otherwise; hence G_k(n), G_k(i) and g_k(i) are bounded between 0 and 1. Note that lower values of G_k(i) result in sharp reverberation attenuation, while high values lead to a smoother envelope function. On the other hand, the value of g_k(i), which affects the end point of each decay, controls the intensity of the attenuation across time. Therefore, each decaying function starts at a local maximum of the RMI and decreases inversely proportionally to the previously evaluated RT_60 and Cf_k values. The ending point of the decaying function depends on the D_k(n) value for the signal component of interest, it being noted from Eq. 7.3 that g_k(i) decreases according to the quadratic mean of the D_k(n). This represents a statistical measure of the magnitude of the RMI's variation, which has been found to be relatively insensitive to RMI estimation errors. The variables γ_1 and γ_2 in Eqs. 7.2 and 7.3 are constants that control the rate of late reverberation suppression. Note that γ_1 is proportional to G_k(n), higher γ_1 values resulting in smoother reverberation reduction. On the other hand, a high γ_2 value leads to a decreased g_k(n) and consequently to more drastic suppression.

7.3 Tests and results

Method implementation

For the tests, a signal dataset was generated, consisting of anechoic music and speech signals sampled at 44.1 kHz with a resolution of 16 bits [66].
The test samples were excerpts of: (i) percussion (bongos), (ii) acoustic guitar, (iii) cello, (iv) male and female speech. In addition, the impulse responses of five rooms, ranging from a small dressing room to a big auditorium, were measured for a

Table 7.1: Properties of the measured rooms

Room                    RT_60 (s)   Volume (m³)   Cr. Dist. (m)
Dressing Room (D)
Room (I12)
Room (I10)
Small Auditorium (I4)
Auditorium (I1)

source-receiver distance of 1 m. The properties and names of the measured rooms are presented in Table 7.1. The corresponding reverberant audio and speech signals were produced by convolution. As the proposed approach does not intend to compensate for the spectral coloration which is mainly generated by the RIR's early reflections, these reverberant signals were equalized by inverse-filtering via the 1/3 octave smoothed minimum-phase version of the impulse responses. Thereafter such signals will be called equalized reverberant signals [91]. The FK method assumes speech frames that have been decolorated in advance through blind multichannel deconvolution. Since the modified version of the method is used here to perform late reverberation suppression in broadband music signals as well, the imposed relaxation criteria improve the estimation, but the process may still introduce artifacts. However, the proposed technique does not necessitate an exact preliminary estimation of the clean signal and can produce adequate results even when the chosen method fails to provide an accurate estimate. As mentioned before, numerous late reverberation suppression methods have been proposed for speech but very few for broadband music signals. In order to thoroughly evaluate the proposed technique, separate assessment tests have been conducted for the cases of (i) music and (ii) speech signals. The processed music signals were assessed in terms of spectrogram improvement, Signal to Reverberation Ratio (SRR) and Noise to Mask Ratio (NMR), while the speech signals were assessed using the Cepstral Distance and the Perceptual Evaluation of Speech Quality (PESQ) measure.
For the music signal tests, the reference techniques used for comparison were the LB and the Modified LB (both appropriate for broadband music). For the speech signal tests, apart from the above techniques, the FK [51],

the Modified FK [187] and the WW method were also used. Finally, a subjective evaluation both for music and speech signals was performed by conducting a modified MUltiple Stimulus with Hidden Reference and Anchor (MUSHRA) test. Informal tests were conducted in order to derive the optimal values for the parameters employed by the proposed method. Note that the proposed approach targets broadband signals with a significantly higher sampling rate (44.1 kHz) compared to most other comparable methods, and was also tested with real measured room impulse responses, so that the parameter values proposed by the authors of the original works were not always useful. The spectral subtraction methods employed here have been implemented for a frame size of 4096 samples (corresponding to 0.09 s), except for the LB approach, where a frame size of 1024 samples (corresponding to 0.02 s) was used since it was found to give better results; these methods have been implemented for a 50% frame overlap. The threshold for applying the power relaxation criterion was A = 10 dB, while the relaxation factor was r_p = 10; the normalized cross-correlation criterion was applied for Φ > 0.95 (see Eqs. and 4.13). The values for all other method parameters were chosen as indicated in the original references. A hybrid filterbank of 41 non-uniform sub-bands with near-perfect reconstruction properties was utilized in order to establish transparent 16-bit audio processing [220]. The CAMM parameters were those originally proposed by Buchholz and Mourjopoulos in [28, 29]. The user-defined variables γ_1 and γ_2 can be adjusted in order to counterbalance possible RMI estimation errors for different signal cases. Although typical values for γ_1 range between 0 and 1.5 and for γ_2 between 0.1 and 0.5, here these were set to 0.5 and 0.4, respectively.

Tests and results for music signals

Spectrogram evaluation

In Fig.
7.3, typical spectrograms are shown of: (a) the anechoic percussion sample (bongos), (b) the equalized reverberant signal (RT_60 = 1.47 s), (c) the clean signal obtained with the proposed method utilizing the Modified FK rough estimation and (d) the clean signal obtained with the proposed method utilizing the Modified LB rough estimation. By comparing Fig. 7.3 (a) and Fig. 7.3 (b), the smearing

effect of the late reverberation can be seen, the silence parts between the percussion beats of the anechoic sample being largely dominated by late reverberant tails. In Fig. 7.3 (c) and Fig. 7.3 (d) it can be seen that both variants of the proposed method substantially reduce late reverberation. Furthermore, there are no significant degradations in the recovered clean signals when compared to the actual anechoic signal.

Segmental signal to reverberation ratio evaluation

The segmental SRR [51, 102] is equivalent to the Signal to Noise Ratio (SNR) when reverberation is considered as noise. Thus, the SRR of the m-th frame is defined as:

SRR(m) = 10 log_10 [ Σ_{n=mR}^{mR+N-1} s_d²(n) / Σ_{n=mR}^{mR+N-1} (s_d(n) - s(n))² ]    (7.5)

where s_d(n) is the direct signal (produced by the convolution of the anechoic signal and the direct part of the impulse response), s(n) is the reverberant or the enhanced signal, N is the frame length in samples and R is the frame rate in samples. The mean SRR is derived by averaging SRR(m) over all non-silence frames. The mean SRR difference between the estimated clean signal and the reverberant signal is calculated as:

ΔSRR = SRR_estimate - SRR_reverberant    (7.6)

Note that positive ΔSRR differences denote the absolute improvement achieved by the proposed method when compared to the reverberant signal. Fig. 7.4 presents the segmental SRR difference between the estimated clean signal and the equalized reverberant signal. It can be observed that the proposed method has achieved significant reverberation reduction and has produced better results than the reference LB and the Modified LB for the vast majority of tested cases. The improvement was up to 6 dB (for the acoustic guitar signal in room I10), the method being unsuccessful

Figure 7.3: Typical spectrograms: (a) clean signal, (b) equalized reverberant signal, (c) clean signal obtained with the proposed method utilizing the Modified FK rough estimation, (d) clean signal obtained with the proposed method utilizing the Modified LB rough estimation

Figure 7.4: SRR difference between the estimated clean signal and the equalized reverberant signal for different methods, RT conditions and sample types: (a) percussion, (b) guitar, (c) speech, (d) cello
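The segmental SRR of Eq. 7.5 reduces to a frame-wise log energy ratio; a minimal Python sketch follows (the function name, frame settings and silence threshold are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

def segmental_srr(s_direct, s_test, frame=1024, hop=512, eps=1e-12):
    """Mean segmental SRR in dB between the direct-path signal and a
    reverberant or enhanced signal, averaged over non-silent frames."""
    srrs = []
    for start in range(0, len(s_direct) - frame + 1, hop):
        d = s_direct[start:start + frame]
        e = s_test[start:start + frame]
        p_sig = np.sum(d ** 2)
        if p_sig < eps:                       # skip silent frames
            continue
        p_err = np.sum((d - e) ** 2) + eps    # reverberation as "noise"
        srrs.append(10.0 * np.log10(p_sig / p_err))
    return float(np.mean(srrs))
```

The ΔSRR of Eq. 7.6 is then simply the difference between the values this function returns for the enhanced and for the reverberant signal.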

only for the cello signal in room I1. In general, the reverberation suppression of sounds with rich spectral content and complex harmonic structure (i.e. the cello) was found to be more challenging, although the proposed method achieved reverberation reduction even under these conditions for the great majority of the tested RT scenarios. Generally, in terms of SRR improvement, the use of the Modified LB as a rough spectral estimator appears to produce better performance for the proposed method.

Segmental noise to mask ratio evaluation

The NMR difference between the estimated clean signal and the reverberant signal is derived as:

ΔNMR = NMR_estimate - NMR_reverberant    (7.7)

In Fig. 7.5 the NMR difference between the estimated clean signal and the equalized reverberant signal is presented. Again, the proposed technique has achieved a significant NMR improvement for almost all tested cases. The NMR improvement was up to 11 dB for the case of the male speaker and the cello (RT_60 = 0.7 s), though a decrease of about 2 dB was observed, again for the case of the cello sample in room I1 (RT_60 = 1.47 s). The results obtained with the Modified LB rough spectral estimator were in general better in terms of NMR improvement; however, in specific cases, the Modified FK proved more efficient. On the other hand, the LB and especially the Modified LB seem to produce notably better results in terms of NMR for the percussion and the speech samples.

Tests and results for speech signals

Cepstral distance evaluation

A frequently used objective measure to evaluate speech quality is the Cepstral Distance (CD) [114, 133], calculated as:

CD = (10 / ln(10)) sqrt( 2 Σ_{k=1}^{p} (c_k - ĉ_k)² )    (7.8)

Figure 7.5: NMR difference between the estimated clean signal and the equalized reverberant signal for different methods, RT conditions and sample types: (a) percussion, (b) guitar, (c) speech, (d) cello

Figure 7.6: Cepstral Distance difference between the estimated clean speech and the equalized reverberant speech for different RT conditions and for (i) the proposed method utilizing the Modified LB Method as preliminary estimation, (ii) the proposed method utilizing the Modified FK Method as preliminary estimation, (iii) the LB Method, (iv) the FK Method, (v) the WW Method, (vi) the Modified LB Method, (vii) the Modified FK Method

where c_k and ĉ_k are the cepstral coefficients of the speech signal under evaluation and the clean signal, respectively, and p is the order of the LP analysis. Here the calculation has been made using the first 20 cepstral coefficients. The CD difference between the estimated clean speech and the reverberant speech is derived as:

ΔCD = CD_estimate - CD_reverberant    (7.9)

In Fig. 7.6, comparisons of the CD differences for different methods and RT conditions are presented, noting that negative differences indicate an improvement in terms of CD. In all tested cases, the proposed technique has achieved a CD improvement and performed better than the other tested methods. It can also be observed that every other technique except for the LB (for RT = 1 and RT = 1.47 s) and the WW method (for RT = 1 s) fails to reduce the Cepstral Distance when compared to the equalized reverberant signal.

Perceptual evaluation of speech quality

The PESQ difference (see Section 5.6.1) between the estimated clean speech and the reverberant speech is calculated as:

ΔPESQ = PESQ_estimate - PESQ_reverberant    (7.10)

Note that positive ΔPESQ values denote enhanced speech quality. Here, in order to employ the PESQ measure, all signals have been downsampled to 16 kHz. In Fig. 7.7 it can be observed that all methods except the FK and the Modified FK achieved an improvement in terms of PESQ. However, the proposed technique has achieved considerably better results for all tested scenarios. Furthermore, the present study indicates that the use of the Modified LB as a rough estimator appears to be advantageous in all cases, although even when using the Modified FK estimator, the proposed approach still significantly enhances the reverberant signal.
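The CD of Eq. 7.8 is a direct computation on the cepstral coefficient vectors; a minimal sketch is given below (illustrative only; in practice the coefficients would come from an LP analysis of each speech frame):

```python
import numpy as np

def cepstral_distance(c_ref, c_test):
    """Cepstral distance in dB between two vectors holding the first p
    cepstral coefficients (k = 1..p, c_0 excluded) of the clean and the
    evaluated speech frame (Eq. 7.8)."""
    diff = np.asarray(c_ref, dtype=float) - np.asarray(c_test, dtype=float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))
```

Identical coefficient vectors give a distance of 0; the ΔCD of Eq. 7.9 then compares the distances obtained for the estimated and the reverberant speech against the clean reference.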

Figure 7.7: PESQ difference between the estimated clean speech and the equalized reverberant speech for different RT conditions and for (i) the proposed method utilizing the Modified LB Method as preliminary estimation, (ii) the proposed method utilizing the Modified FK Method as preliminary estimation, (iii) the LB Method, (iv) the FK Method, (v) the WW Method, (vi) the Modified LB Method, (vii) the Modified FK Method

Subjective performance evaluation

To subjectively assess the performance of the proposed method, a version of the MUSHRA test [83, 136] was conducted, where the listeners were asked to compare each test signal to a reference and rate their similarity. For each of the three tested RT cases (RT_60 = 0.39, 1, 1.47 s), the experimental stimuli were (a) the equalized reverberant signal, (b) the clean signal obtained with the Modified LB, (c) the clean signal obtained with the proposed method using the Modified LB estimator, (d) the clean signal obtained with the proposed method using the Modified FK estimator. The reference signal was the corresponding anechoic signal, while the hidden anchor was a low-pass filtered version of the reverberant signal (as defined by the standard). The listeners rated the resemblance of each tested signal to the anechoic reference on a 0-100 scale. A score of 100 indicates that the subject could not hear any difference between the test signal and the reference; conversely, a score of 0 indicates that the subject could hear large differences. The five test stimuli types (percussion, guitar, cello, male and female speech) were presented to the listeners through headphones (Ultrasone S-Logic). The subjects were able to switch between the different stimuli with their mouse, via a computer interface. A training session preceded the formal experiment and the subjects were allowed to complete the test at their own pace, without any interruptions from the experimenter. The experiments were conducted with 23 male and female (self-reported) normal-hearing experienced listeners. Fig. 7.8 (a) shows the listeners' subjective ratings, averaged across stimuli and as a function of RT, the error bars representing the 95% confidence interval.
For the lower RT case (RT_60 = 0.39 s), the subjective ratings showed that the equalized reverberant signals were perceptually closer to the corresponding anechoic signals than the estimated clean signals obtained with the Modified LB. This is probably due to the fact that the unnatural sound and artifacts of some processed signals are perceptually more irritating than the actual late reverberation for this case of low reverberance. However, the two versions of the proposed approach have achieved significantly better results than the reference Modified LB. For all other tested RT scenarios (RT_60 = 1, 1.47 s), the estimated clean signals through the Modified LB and the corresponding reverber-

Figure 7.8: Mean subjective ratings: (a) across different RT scenarios and (b) across different sample types

ant signals were equally rated in terms of their resemblance to the anechoic references. In both cases, a significant improvement was found for the clean estimations obtained with the proposed approach. The collected data were subjected to an analysis of variance (ANOVA) to reveal whether the differences presented above are statistically significant, and a significant interaction between the tested method and RT (F(6, 1320) = 2.51, p < 0.05) was noticed. Note also that the ANOVA revealed a significant main effect for the tested methods (F(3, 1320) = 82.97, p < 0.001), samples (F(4, 1320) = 58.58, p < 0.001) and RT conditions (F(2, 1320) = 16.55, p < 0.001). In Fig. 7.8 (b) the listeners' subjective ratings as a function of different sample types, averaged across RT, are shown. Again, an ANOVA has shown a significant interaction (F(12, 1320) = 1.89, p < 0.05) between the tested method and the sample type. A comparison between the different ratings in Fig. 7.8 (b) reveals that sounds with richer spectral content and complicated harmonic structure were rated substantially better in all cases, and it appears that it is easier to identify reverberation when a less complex sound is involved. Nevertheless, it is significant that the proposed technique was found to produce clean signal estimations that were perceptually closer to the original anechoic signals for all sample types. The top ratings were observed for the cello and the guitar samples, and better results than the Modified LB were achieved for all cases. In Fig. 7.9 the listeners' mean subjective ratings, averaged both across stimuli and RT, are presented. From this, it appears that the results obtained by the proposed method (using both rough clean estimate options) prove to be perceptually closer to the anechoic signals than the corresponding reverberant signals or the clean estimations obtained by the Modified LB.
7.4 Discussion

Here, some aspects of the performance and the relative advantages of the proposed method will be further analysed. Firstly, some contradicting findings regarding the objective and subjective evaluation of the cello signal will be discussed (Figures 7.5 and 7.8(b)). Note that in the former case the method was found to perform worse than the Modified LB, whereas in the latter

[Figure 7.9: Mean subjective ratings for (a) the equalized reverberant signal, (b) the clean signals obtained with the Modified LB, (c) the clean signals obtained with the proposed method utilizing the Modified FK rough estimation and (d) the clean signals obtained with the proposed method utilizing the Modified LB rough estimation]

case was rated better than the Modified LB. Here we should consider that reverberation affects signals to a varying degree, largely depending on their relative amplitude modulation spectra [79], and consequently signal-dependent variations in the dereverberation performance must be expected (see Figures 7.4, 7.5 and 7.8(b)). For steady-state signals such as the cello, reverberant components are not easily detected and removed, especially during overlapping notes, in contrast to transient signals, where detection of late reverberation is more feasible. However, such a performance deficiency appears to be perceptually less noticeable. It should also be noted that the block-based frequency-domain masking model employed for the NMR (Fig. 7.5) appears to be less appropriate for dealing with inter-block reverberation artifacts than the CAMM employed by the proposed technique. Hence, for such cases this evaluation approach may be less adequate to predict the method's subjective performance ratings, taking also into account the well-documented diversity between objective quality measures and the listener-perceived effects due to reverberation [26]. Therefore, most objective measures appear to evaluate the general quality of the produced signals rather than the dereverberation procedure itself [210]. Significantly, the proposed approach achieved top ratings on the MUSHRA test, noting though that all methods achieved relatively low ratings, especially for the percussion and the speech signals. These results indicate that some reverberation remains after processing and that some processing artifacts are introduced. In fact, such relatively low subjective quality ratings are not uncommon for blind dereverberation processing [210] and also for blind audio source separation [205], which are both complex and largely open research problems.
Nevertheless, the higher scores for the proposed method indicate that it suppresses late reverberation and introduces fewer audible processing artifacts than previous methods. In a simplified view, the present method acts like an intelligent multiband compressor aiming to suppress the audible parts of the reverberation tails. For the worst performing cases, this can introduce a mild gated-reverb effect that may be judged as artificial. On the other hand, the extreme distortions and coloration that appear in other dereverberation approaches are largely avoided, resulting in enhanced signal quality and better subjective test scores. Note also that, in most STFT-based approaches, several parameter values

such as the length of the frame window or the early/late reverberation boundary may have a strong impact on the final result. Here, a largely temporal-domain analysis and processing is employed over perceptually significant sub-bands, so that the method can address some inter-frame effects due to late reverberation. Furthermore, most other late reverberation suppression methods have been developed for speech, being fine-tuned for sampling frequencies of 8 and/or 16 kHz, whilst the present work is realized for broader-bandwidth 44.1 kHz signals.

7.5 Conclusion

In this Chapter, a new blind method for suppressing late reverberation from speech and audio signals has been introduced. The technique employs a perceptual model to identify the perceived alterations due to reverberation in sub-band signal regions. Then, appropriate envelope gain functions are derived via a novel heuristic approach and temporal sub-band envelope filtering is performed in order to suppress late reverberation and derive the final clean signal estimation. The performance of the proposed approach was evaluated for different speech and music signals as well as for measured rooms of varying RT, and the derived estimations indicate that the proposed technique achieves substantial reverberation reduction without introducing significant artifacts. This finding was supported by the results of the objective tests, which demonstrated a significant improvement in terms of segmental Signal to Reverberation Ratio and Noise to Mask Ratio for audio signals, as well as Cepstral Distance and PESQ for speech signals. A subjective evaluation test was also conducted to indicate the perceptual similarity of the processed signals to the corresponding anechoic signals. In all tested cases, the proposed technique was found to improve the estimates when compared to the reference method and also to subjectively achieve considerable late reverberation suppression.
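As a loose illustration of the temporal sub-band envelope gain filtering summarized above, the following sketch applies a per-sample gain in a single band. Both the envelope tracker and the gain rule here are invented stand-ins for illustration only, not the CAMM-derived gain functions of the actual method:

```python
from collections import deque

def moving_average(x, win):
    """Crude amplitude envelope: moving average of the rectified signal."""
    out, acc, buf = [], 0.0, deque()
    for v in x:
        buf.append(abs(v))
        acc += abs(v)
        if len(buf) > win:
            acc -= buf.popleft()
        out.append(acc / len(buf))
    return out

def envelope_gain_filter(x, win=8, floor=0.2):
    """Attenuate samples whose local envelope has decayed relative to a recent peak,
    a hypothetical stand-in for suppressing audible reverberation tails."""
    env = moving_average(x, win)
    peak, y = 0.0, []
    for v, e in zip(x, env):
        peak = max(e, peak * 0.99)              # slowly decaying peak tracker
        g = max(floor, e / peak) if peak > 0 else 1.0
        y.append(g * v)                          # per-sample gain filtering
    return y
```

Applied to a burst followed by a low-level tail, the gain leaves the onset untouched and attenuates the decaying portion, which is the qualitative behaviour the actual sub-band gains aim for.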
The proposed technique aims at suppressing late reverberation, which is known to adversely affect not only listener perception but also ASR system performance [142]. In the next Chapter, a proper evaluation of the ASR improvement achieved by the proposed method will be presented.


Chapter 8
Automatic Speech Recognition Performance Improvement

It has been well established that the early RIR reflections are usually beneficial for both Automatic and Human Speech Recognition [169], while late reverberation is the principal cause of ASR degradation [146]. The effect of late reflections is dominant in large rooms, especially for distant-talking applications. When only one microphone is available to capture the speech signal, so that beamforming or other multichannel techniques are not applicable, single-channel suppression of late reverberation must be used to improve the ASR performance. For a review of single-channel late reverberation suppression techniques see Chapter 3. It is useful to note that although reverberation degrades speech quality and intelligibility, speech recognition by humans in reverberant environments is robust when compared to the performance of ASR systems [110]. In complex acoustic environments human listeners are able to distinguish the meaningful sounds from the unwanted noise [25] and to moderately compensate for the overlap-masking effect caused by the reverberant tails [142]. Although there is significant research effort to employ auditory models in ASR systems (e.g. [33, 35, 74, 176, 215]), only few methods (e.g. [36, 95, 142]) have explicitly incorporated perceptual reverberation modeling. Such a method has been tested here for ASR applications. Considering now the variety of possible room environments, room acoustic properties are broadly described by the RT; however, it has been shown that the

RT is a general statistical parameter, insufficient to describe the acoustic diversity between different source-receiver positions within a given room [160]. The dependence of the ASR performance on the RT and the source-receiver distance is well established [146], but the correlation to other room acoustic parameters has not been extensively studied. In a recent paper [169], Sehr et al. found a linear correlation of the definition D50 with word accuracy results obtained with modified real RIRs. Furthermore, these authors indicate that a fine-tuning of dereverberation methods can be achieved by exploring the variability of the ASR performance in different room acoustics scenarios. Hence, in the first part of this Chapter, a study of the relationship between room acoustics and phone ASR is presented, based on a database of measured RIRs from six enclosures of different properties. The scope of the rest of the Chapter is threefold: (i) to extend and fine-tune for speech recognition the dereverberation method presented in Chapter 7, (ii) to investigate the recognition performance improvement achieved by the above method and compare the results to those obtained from the LB, WW and FK spectral subtraction dereverberation techniques and (iii) to examine the correlation of the established acoustic parameters to the performance of such ASR systems. In brief, the novel aspects presented in this Chapter are:

- The ASR results are correlated with position-dependent room acoustics parameters such as the clarity C50 and the definition D50 [198].
- It is shown that such parameters are more appropriate than the widely used RT when evaluating the ASR performance degradation in closed spaces [198].
- The dereverberation method described in Chapter 7 is fine-tuned for ASR applications and the optimal method parameters are shown.
- The proposed method achieves superior results to any other tested dereverberation approach, especially in adverse room acoustic conditions [198].
- It is shown that the established acoustic parameters correlate well with the performance of ASR systems and that they can possibly be used as predictors of the performance improvement [198].

[Table 8.1: Acoustical Properties of the Measured Rooms. Columns: Room, RT60 (s), Volume (m³), Room Type (three variable-chamber configurations, a Lecture Hall, a Small Auditorium and an Auditorium)]

8.1 The effect of room acoustic parameters on automatic speech recognition performance

8.1.1 Test methodology

Initially, it is necessary to investigate the correlation of room acoustic parameters to phone ASR performance. Apart from the RT, the Early Decay Time (EDT), the Definition (D50), the Clarity (C50 and C80) and the Centre Time (Ts) parameters were also extracted from the measured RIRs, based on the definitions given in Chapter 2. For this, a database of measured RIRs with RT values ranging from 0.24 to 1.47 s was created (see Table 8.1). The RIRs with the lower RT values (0.24, 0.38 and 0.58 s) were obtained from measurements at the Bell Labs varechoic chamber [69], each one corresponding to a different room configuration. The remaining RIRs were measured in a Lecture Hall, a small and a big Auditorium at the University of Patras Conference Centre. The RT values of the selected rooms are representative of a wide range of typical acoustic scenarios, and taking into account that ASR deteriorates for distant-talking applications, four different RIRs were recorded in each room, at source-receiver distances of 1.5, 2, 3 and 4 m. The diversity of the chosen acoustic conditions is illustrated via the Energy Decay Curves (EDCs) for all room and distance conditions in Fig. 8.1.

[Figure 8.1: Energy Decay Curve (dB) of the RIRs for all tested distances (1.5, 2, 3 and 4 m): (a) Room 1, (b) Room 2, (c) Room 3, (d) Room 4, (e) Room 5, (f) Room 6]

For all of the above cases, speech signals derived from the TIMIT database [53] were convolved with the measured RIRs to produce the corresponding reverberant signals. The performance of the dereverberation methods was evaluated by considering them as a speech preprocessing stage and by comparing Phone Recognition Rates (PRRs) for the reverberant and the processed signals. The PRR metric does not depend on a language model or grammar, and hence it can be considered a more appropriate choice for evaluating dereverberation processing algorithms [142]. Furthermore, the TIMIT database, consisting of microphone-quality recordings of 630 American-English speakers with a sampling frequency of 16 kHz and 16-bit resolution, has been widely used in the past for the task of phone recognition (e.g. [27, 94, 131, 145, 170]), making it a convenient choice for direct comparison to other methods. Here, the standard train/test subset division of the database has been adopted, i.e. the train subset was utilized for the training of the HMM phone models and the performance was measured on both the training and test subsets, whilst the dialect (SA) sentences were excluded from this evaluation, since their word sequence is common for all speakers in both train and test subsets. The TIMIT transcription includes phonetic annotation that has been manually checked. The annotation time-marks correspond to a narrow-phonetic set of 61 American-English phones; here, the established 48-phone set for American English was employed, as proposed in [109]. Successive occurrences of the same phone were merged into one single occurrence, as in [27, 145]. Furthermore, a standard speech preprocessing and parametrization protocol was followed: the speech signals were frame-blocked with a Hamming window of 20 ms length and a frame step of 10 ms, and a first-order FIR filter with pre-emphasis factor equal to 0.97 was applied. For every speech frame, the first 12 Mel frequency cepstral coefficients [219] and the 0-th cepstral coefficient were computed.
Additionally, cepstral mean normalization was applied to the computed coefficients, and the corresponding delta and double-delta coefficients were appended to the thirteen static cepstral coefficients, resulting in a 39-dimensional feature vector. For each phone, one three-state context-independent hidden Markov model (HMM) was trained.
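The delta/double-delta appending just described can be sketched with the standard regression formula (the HTK-style variant; the window half-width N = 2 and edge-frame replication are assumptions, as the text does not state them):

```python
def deltas(frames, N=2):
    """Regression deltas: d[t] = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2)."""
    T, dim = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for k in range(dim):
            num = 0.0
            for n in range(1, N + 1):
                # Frames beyond the edges are replicated, a common convention.
                num += n * (frames[min(t + n, T - 1)][k] - frames[max(t - n, 0)][k])
            row.append(num / denom)
        out.append(row)
    return out

def make_feature_vectors(static):
    """13 static cepstra per frame -> 39-dimensional vectors (static + delta + double-delta)."""
    d = deltas(static)
    dd = deltas(d)
    return [s + x + y for s, x, y in zip(static, d, dd)]
```

For a linearly ramping coefficient the interior deltas equal the slope, and for a constant signal they vanish, which is a quick sanity check on the formula.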

[Table 8.2: Correlation coefficients for the different room acoustic parameters (C50, C80, EDT, D50, RT, Ts)]

[Figure 8.2: Phone Recognition Rates over various room acoustic parameters: (a) D50, (b) C50, (c) C80, (d) Ts, (e) EDT, (f) RT; PRR values and fitted curves]

8.1.2 Reverberant speech ASR performance

Fig. 8.2 presents the PRR degradation over (a) D50, (b) C50, (c) C80, (d) Ts, (e) EDT and (f) RT, as well as the corresponding first-order polynomial data fits, whereas Table 8.2 gives the linear correlation coefficients for the above parameters. It can be seen that the clarity calculated over the first 50 ms of the RIRs (C50) is strongly correlated with the PRR results, and the same applies to a lesser degree for the C80, D50 and EDT parameters. On the other hand, the widely used RT and also the Ts parameters present lower correlation values. It must be noted that the RT is of fundamental value in room acoustics, but contrary to the other tested parameters it does not indicate the acoustic differences between different positions in the same room, being in principle a spatially-averaged property of the room absorption [160]. Hence, when evaluating the ASR performance in a specific room location the above position-dependent parameters may be more appropriate.

8.2 Adjusting the dereverberation parameters for ASR

In Section 7.2 the perceptually-motivated dereverberation gain functions were derived. The parameters γ1 and γ2 in Equations 7.2 and 7.3 are constants used to fine-tune the method, here for ASR improvement. The parameter γ1 controls the suppression rate in the signal area of interest, while the parameter γ2 directly controls the decaying rate of the gain function. Choosing a smaller γ1 results in bigger Gk(n) values and consequently in a more drastic reverberation suppression in each sub-band. On the other hand, the bigger the value of γ2, the steeper the decay of the gain function Gk(n) between each set of local extrema. The above are illustrated in Fig. 8.3, where the gain functions for typical γ1 and γ2 values are shown, together with the absolute value of the reverberant and the corresponding clean signal in a sub-band.
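The qualitative roles of γ1 and γ2 can be mimicked with a toy gain curve. This is a hypothetical stand-in, not Equations 7.2 and 7.3 from the thesis; it only reproduces the two trends just described: a smaller γ1 gives larger gain values (more drastic suppression), and a larger γ2 gives a steeper decay.

```python
import math

def toy_gain(n, gamma1, gamma2):
    """Hypothetical sub-band suppression gain: depth scaled by 1/gamma1,
    exponential decay at rate gamma2 (illustrative only)."""
    return (1.0 / gamma1) * math.exp(-gamma2 * n)

# Smaller gamma1 gives larger gain values (more drastic suppression) at every sample:
g_drastic  = [toy_gain(n, 0.1, 0.05) for n in range(100)]
g_moderate = [toy_gain(n, 0.5, 0.05) for n in range(100)]

# Larger gamma2 gives a steeper decay between local extrema:
g_steep  = [toy_gain(n, 1.0, 0.5) for n in range(100)]
g_gentle = [toy_gain(n, 1.0, 0.1) for n in range(100)]
```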

[Figure 8.3: The derived gain functions in a single sub-band ( Hz) for (a) γ2 = 0.05 and γ1 = 0.5, 0.3, 0.1 and (b) γ1 = 1 and γ2 = 0.1, 0.3, 0.5. The absolute reverberant signal (in black) and the rough clean signal estimation (in grey) are also shown.]

[Figure 8.4: Phone Recognition Rates (%) for source-receiver distances 1.5, 2, 3, 4 m and for (a) Room 1, (b) Room 2, (c) Room 3, (d) Room 4, (e) Room 5, (f) Room 6. Results are shown for (i) the reverberant signal, (ii) Tsilfidis & Mourjopoulos dereverberation using γ1 = 0.1 and γ2 = 0.5 (TM1), (iii) Tsilfidis & Mourjopoulos dereverberation using γ1 = 0.3 and γ2 = 0.05 (TM2), (iv) Lebart et al. (LB) dereverberation, (v) Furuya & Kataoka (FK) dereverberation, (vi) Wu & Wang (WW) dereverberation]

Comparison of late reverberation suppression techniques

The ASR performance for the proposed speech preprocessing dereverberation method described in Chapter 7 (Tsilfidis and Mourjopoulos [189], termed here TM) was also compared to preprocessing based on three spectral subtraction late reverberation suppression techniques (LB, WW and FK), described in Chapter 3.

The effect of RT and source-receiver distance

The TM method was tested for values γ1 = 0.1, 0.3, 0.5, 1, 1.5 and γ2 = 0.05, 0.1, 0.3, 0.5, 0.9, resulting in 25 test cases for each experimental condition. From the analysis of the results, it was found that the greatest recognition performance improvement was obtained for γ1 = 0.1 and γ2 = 0.5, and for γ1 = 0.3 and γ2 = 0.05. The first pair of constants achieves a more drastic suppression (TM1), while the second one (TM2) results in a moderate reverberation reduction in the signal areas of interest. Fig. 8.4 shows the PRR (%) for source-receiver distances of 1.5, 2, 3 and 4 m in the six tested rooms, noting that the baseline performance was 61.98%. Fig. 8.4(a) presents the results for the lowest RT (RT = 0.24 s, varechoic chamber) case; despite the low RT value, the reverberant PRR performance appears to be reduced. Furthermore, all tested dereverberation methods seem to further reduce the recognition rate apart from the TM method (TM2), which does not affect the ASR performance of the original reverberant signals. It is therefore evident that for such dry rooms, where late reverberation is minimal, the processing artifacts introduced by the late reverberation suppression methods can be more harmful to ASR performance than the reverberation itself. The results for the moderate RT (RT = 0.38 s, varechoic chamber) case are shown in Fig. 8.4(b). Again, all methods seem to deteriorate the PRR results except the proposed TM2 technique, which slightly improves the recognition rates. Note that for both cases in Fig. 8.4(a) and (b), the reverberant PRR slightly improves when increasing the source-receiver distance from 1.5 to 2 m, possibly since in such enclosures the effect of the early reflections can be beneficial to speech recognition. However, for the cases of rooms/auditoria with higher RT values shown in Fig.
8.4(c), (d), (e) and (f), the TM method achieves significant improvements in recognition rates when compared to those of all other tested methods. In Room 3, TM1 and TM2 seem to achieve comparable results, but in Rooms 4, 5 and 6 the drastic version of the algorithm (TM1) achieves significantly better results. The WW and LB dereverberation techniques also seem to be beneficial for recognition in the larger rooms (Rooms 4, 5 and 6), while the FK method fails

[Figure 8.5: Phone Recognition results (%) over D50, for the reverberant signals (Rev) and the TM1 and TM2 methods, with fitted curves]

to improve the recognition results except for the larger source-receiver distances in Rooms 5 and 6. When comparing the recognition results for Rooms 4 and 5 (i.e. Fig. 8.4(c) and (d)) it can easily be observed that there is a large variation in PRR performance, even though the corresponding RT values are rather similar. This verifies the findings of Section 8.1, indicating that the RT value is not sufficient to indicate the potential ASR performance within a room, so that the other acoustic parameters can be more appropriate performance indicators [169].

Correlation with D50 and C50

In order to further investigate the performance of the presented method and following the findings in Section 8.1, the clarity C50 and the definition D50 parameters were considered as the more appropriate performance metrics. The C50 has shown the best correlation with the PRR results (see Section 8.1), while the D50 is chosen for comparison reasons (e.g. [169]). In Figures 8.5 and

[Figure 8.6: Phone Recognition results (%) over C50, for the reverberant signals (Rev) and the TM1 and TM2 methods, with fitted curves]

8.6, the PRR results with respect to the tested RIR D50 and C50 values are shown for the reverberant signals and the TM1 and TM2 dereverberation methods. For these tested cases, the corresponding second-order polynomial fitted curves are also shown. Hence, from Fig. 8.5 the dependency of the PRR on D50 can be described by the relationship:

PRR = α·D50² + β·D50 + γ (8.1)

while in Fig. 8.6, the dependency of the PRR on C50 is:

PRR = δ·C50² + ε·C50 + ζ (8.2)

The values of the polynomial coefficients for the curves of Fig. 8.5 and Fig. 8.6 are given in Tables 8.3 and 8.4 respectively. Clearly the above findings require further support from a larger number of RIRs with varying D50 and C50 values. However, these initial results from real RIR measurements are in line with the findings of [169] and show that these
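The second-order regressions of Equations 8.1 and 8.2 are ordinary least-squares quadratic fits. A minimal sketch using only the standard library is given below; the data points are invented for illustration, while the thesis's actual coefficients appear in Tables 8.3 and 8.4:

```python
def quadratic_fit(x, y):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    # Design-matrix columns [x^2, x, 1]; build A^T A and A^T y.
    cols = [[xi * xi for xi in x], list(x), [1.0] * len(x)]
    M = [[sum(ci * cj for ci, cj in zip(cols[i], cols[j])) for j in range(3)]
         for i in range(3)]
    v = [sum(ci * yi for ci, yi in zip(cols[i], y)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 3):
                M[r][c] -= f * M[i][c]
            v[r] -= f * v[i]
    coeffs = [0.0, 0.0, 0.0]
    for i in range(2, -1, -1):
        coeffs[i] = (v[i] - sum(M[i][j] * coeffs[j] for j in range(i + 1, 3))) / M[i][i]
    return coeffs  # (a, b, c)
```

Fitting PRR against C50 or D50 samples then reduces to one call per condition, e.g. `a, b, c = quadratic_fit(c50_values, prr_values)`.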

[Table 8.3: Polynomial coefficients (α, β, γ) for the PRR over D50 regression curves, for Rev, TM1 and TM2]

[Table 8.4: Polynomial coefficients (δ, ε, ζ) for the PRR over C50 regression curves, for Rev, TM1 and TM2]

Overall recognition improvement

The overall ASR improvement is represented here via the % Phone Recognition Improvement (PRI) for all tested cases. The PRI is defined as:

PRI = (PRR_derev − PRR_rev) / PRR_rev (8.3)

where PRR_derev is the phone recognition rate of the dereverberation method and PRR_rev the phone recognition rate of the corresponding reverberant database. Fig. 8.7 shows the relation between C50 (left y-axis) and D50 (right y-axis) and the overall Phone Recognition Improvement (%) (x-axis); the second-order polynomial fits are also shown. Clearly, the best PRI results for the proposed method are achieved for the lower C50 and D50 values that correspond to high RT values and large source-receiver distances, the PRI approaching a maximum of 32% for the case of C50 in the range of −10 dB and D50 in the range of 25%. Note that these values were derived in Room 6 for a source-receiver distance of 4 m.
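Equation 8.3 is a relative improvement; expressed in percent it reads as follows (the example rates are invented for illustration, not taken from the thesis results):

```python
def phone_recognition_improvement(prr_derev, prr_rev):
    """% Phone Recognition Improvement of Eq. 8.3: relative PRR gain
    of the dereverberated signals over the reverberant baseline."""
    return 100.0 * (prr_derev - prr_rev) / prr_rev

# e.g. a (made-up) reverberant rate of 25 % raised to 33 % gives a PRI of 32 %
pri = phone_recognition_improvement(33.0, 25.0)
```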

[Figure 8.7: Relation between C50 (left y-axis) and D50 (right y-axis) and the overall Phone Recognition Improvement (%), with fitted curves]

8.3 Conclusion

In this chapter, it has been shown that position-dependent room acoustic parameters such as the C50 and D50 are more appropriate than the widely used RT when evaluating (i) the ASR performance degradation in closed spaces and (ii) the potential improvement that can be achieved via dereverberation preprocessing. This finding is supported by phone recognition results in a wide range of room acoustics scenarios. Furthermore, the technique described in Chapter 7 has been employed and fine-tuned with respect to optimum ASR performance. The proposed ASR evaluation framework, which complies with room acoustics principles, is used to compare the recognition performance improvement achieved by the above and other late reverberation suppression methods. The perceptually-motivated TM method has achieved superior results compared to any other tested technique, especially in highly reverberant environments.

Chapter 9
Conclusions And Future Work

9.1 Conclusions

In this thesis, a number of novel signal processing methods for enhancing reverberant speech and music signals have been developed. In Chapter 4 a generalized framework for improving single-channel spectral subtraction algorithms has been presented. The proposed implementation consists of a relaxation stage, including two signal-dependent constraints, and a non-linear filtering stage, including a perceptually-derived Wiener-type filter. Objective and subjective results have shown that these modifications significantly improve the tested algorithms by reducing the musical noise artifacts and by preserving the signal's transients from over-subtraction. In Chapter 5 a novel spectral subtraction technique for moving speaker dereverberation has been introduced. This approach utilizes a single RIR measurement and assumes that late reverberation is stationary in different room positions. Hence, the measured RIR along with the excitation signal derived from the LP analysis of the reverberant speech are used to estimate the late reverberant PSD at multiple speaker positions. A semi-blind implementation of this method substitutes the measured RIR with a recorded handclap. It has been shown that the late reverberant PSD can be efficiently estimated from the handclap and that the proposed technique is robust to source-receiver position changes, achieving significant reverberation reduction in all tested reverberation conditions. In Chapter 6 a binaural extension of spectral subtraction dereverberation algorithms has been presented. The proposed implementation includes three bilateral gain adaptation strategies and a Gain Magnitude Regularization step that controls the suppression rate and prevents overestimation errors. The proposed extensions have been implemented in three state-of-the-art spectral subtraction dereverberation techniques and have been tested for several reverberation scenarios. Objective measures revealed the optimum implementation for each case. Chapter 7 has presented an advanced blind dereverberation algorithm that employs a Computational Auditory Masking Model (CAMM) in order to identify the signal regions where late reverberation is audible. This selective signal processing algorithm applies gain filtering in perceptually-significant sub-bands. The gain functions are adjusted according to estimators of the severity of the reverberation degradation. The proposed technique has been extensively evaluated for a vast range of reverberation conditions and has proven superior to any other tested technique for dereverberating speech and music signals. Moreover, as shown in Chapter 8, the proposed method is beneficial when used as a preprocessing step prior to Automatic Speech Recognition. The method was fine-tuned in the ASR context and it has been demonstrated that it achieves more than 30% improvement of the recognition rate, especially in highly reverberant environments. Furthermore, the room acoustic analysis has revealed that position-dependent room acoustic parameters are more appropriate than the widely-used RT to describe the ASR degradation in reverberant enclosures and that such parameters can be used to predict the ASR performance improvement achieved via dereverberation preprocessing.
9.2 Future work

This thesis proposed a broad range of novel dereverberation techniques and created new potential for future research:

- The late reverberation suppression algorithms presented here can be combined with well-established decoloration techniques in order to also compensate for the early reflections, and hence for the complete reverberation effect. For example, the algorithms that assume a RIR measurement (see Chapter 5) can be combined with the well-established inverse-filtering technique proposed in [71].
- The proposed methods can be implemented in conjunction with noise reduction algorithms for joint suppression of noise and reverberation, providing a complete compensation framework for the acoustic interferences. A first step in this direction has been made in [101].
- The scope of this thesis was to develop dereverberation algorithms that produce artifact-free anechoic estimations. Even if the proposed methods were designed to be efficient, they were not optimized at the source code level (for an informal comparison of the computational cost of the algorithms see Appendix A). Hence, efficient, real-time implementations can be made in order to properly evaluate the speed of the algorithms.
- In Chapter 8 the ASR performance improvement achieved by dereverberation preprocessing has been investigated. Phone recognition analysis has shown that the perceptually-motivated approach of Chapter 7 achieves superior recognition results to any other tested method. The phone recognition tests were appropriate for this evaluation as they do not depend on a language model or grammar. In the future, the proposed algorithm can be incorporated in full ASR implementations, where methods that compensate for reverberation in the ASR front-end and back-end can also be included. Hence, a complete ASR framework robust to room acoustic interferences can be constructed.
- As discussed in the Introduction, dereverberation preprocessing can be useful in many engineering applications such as speaker recognition, source separation, signal classification etc. Therefore, the algorithms presented in this work can be implemented and evaluated as a preprocessing step for such methods.
- The evaluation of the binaural spectral subtraction schemes in Chapter 6 has shown that in real-life scenarios (e.g. binaural dereverberation for hearing aids) the assumption of acoustically treated enclosures is not valid

and perceptually-motivated approaches can be beneficial. Hence, it would be of interest to establish a binaural version of the perceptually-motivated algorithm presented in Chapter 7. However, this is not straightforward, as it requires the development of a binaural auditory masking model appropriate for engineering applications. For this, the Audio and Acoustic Technology group is already collaborating with other research institutes of the AABBA consortium [17].
- Finally, such enhancement algorithms can be incorporated in multimodal applications (e.g. immersive teleconferencing [80]), in order to enhance the auditory experience of such systems.

Appendix A: Computational Cost Of The Proposed Algorithms

The scope of this thesis was to develop dereverberation algorithms that produce artifact-free anechoic estimations. Even if the proposed methods were designed to be efficient, they were not optimized at the source code level. However, an informal evaluation of the computational cost of the proposed algorithms is made here. The measurements were performed on an Apple Macbook with an Intel Core Duo (2 GHz) processor running Mac OS X v (32 bit). All algorithms were implemented in Matlab [117] and the computational cost was measured using the cputime function, which returns the total CPU time (in seconds) used by the application. For the tests, a 5 s reverberant speech sample (RT = 1 s) was used, with a sampling frequency of Hz and 16-bit precision. The average CPU time of ten repetitions of each dereverberation algorithm was calculated. In a first step, the CPU time of the three state-of-the-art spectral subtraction algorithms LB, WW and FK was calculated. Note that the chosen analysis parameters critically alter the CPU time performance. In order to directly compare

[Table 1: Spectral subtraction analysis parameters. Columns: Method, Frame Length, Zero padding, Frame Overlap, for the LB, WW and FK algorithms]

154 . APPENDIX A: COMPUTATIONAL COST OF THE PROPOSED ALGORITHMS Figure 1: CPU time for the LB, WW and FK algorithms the computational cost, the algorithms were implemented for aframelengthof 2048 samples (including 1024 samples of zero padding) and a 50 %overlap. In this case the LB technique needed 1 s to produce the clean estimation compared to 1.8 s for the WW and 5 s for the FK algorithm. However, these analysis parameters do not produce the optimal clean signal estimations for all methods. Hence, the tests were repeated using the parameters shown in Table 1 that were found to produce the best clean signal estimations for the 44.1 khz sample rate [195]. The measured CPU time is shown in Fig. 1 and the LB algorithm seems to be significantly more efficient than the other two. Fig. 2 presents the measured CPU time for the original and modified versions of the LB and FK algorithms (see Chapter 4). It can be seen that the implementation of the proposed relaxation criteria is not computationally expensive especially for the LB algorithm. However, the perceptually-motivated non-linear filtering stage adds significant computational load. As described in Chapter 4, this step involves a masking threshold calculation that enhances the subjective quality of the produced clean signal estimations and this perceptually-motivated stage inevitably decreases the efficiency of the dereverberation algorithms. The cpu time of the processing blocks of the proposed moving speaker dereverberation algorithm is presented in Fig. 3 (see Chapter 5). The frame length was 4096 samples (including a 1024 samples zero padding) and the overlap was 50 %. Here, the LP analysis is shown to consume a significant amount of cpu time, however in practical implementations this can be easily optimized. On 124

155 Figure 2: CPU time for the original and modified versions of the LB and FK algorithms the other hand, the GMR step significantly reduces the overestimation artifacts (see Chapters 5 and 6) beingthesametimeremarkablyefficient. Overall, the computational cost of the moving speaker dereverberation algorithm seems to be low. Finally, the CPU time for the dereverberation algorithm based on perceptual reverberation modeling as presented in Chapter 7 is shown in Fig. 4. Asexpected, the employment of the time-frequency masking model as well as thefilterbank analysis significantly increases the computational load. However, these steps are necessary in order to achieve subjectively superior clean signal estimations as explained in Chapters 7 and
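The frame analysis conventions quoted above (e.g. 1024 signal samples zero-padded to a 2048-sample analysis frame with 50% overlap) follow standard STFT practice. The following is a minimal illustrative sketch in Python/NumPy, not taken from the thesis implementation; function and parameter names are chosen here for clarity:

```python
import numpy as np

def frame_signal(x, frame_len=1024, zero_pad=1024, overlap=0.5):
    """Split a signal into overlapping frames and zero-pad each one,
    e.g. 1024 signal samples padded to a 2048-sample analysis frame
    with 50% overlap (a hop of 512 samples)."""
    hop = int(frame_len * (1.0 - overlap))
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.zeros((n_frames, frame_len + zero_pad))
    for i in range(n_frames):
        # Copy the signal samples; the tail of each row stays zero (padding).
        frames[i, :frame_len] = x[i * hop : i * hop + frame_len]
    return frames

# A 4096-sample signal yields 7 padded frames of 2048 samples each.
frames = frame_signal(np.arange(4096, dtype=float))
```

The larger frames of the moving speaker algorithm (4096 samples including 1024 samples of padding) correspond to calling the same routine with `frame_len=3072, zero_pad=1024`.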

Figure 3: CPU time for the moving speaker dereverberation algorithm

Figure 4: CPU time for the dereverberation based on perceptual reverberation modeling
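The measurement protocol used throughout this appendix (total CPU time per run via Matlab's cputime, averaged over ten repetitions) can be mirrored outside Matlab. A minimal Python sketch follows, using `time.process_time` as a stand-in for cputime; the workload below is a placeholder, not one of the thesis algorithms:

```python
import time

def average_cpu_time(func, repetitions=10):
    """Average CPU time (in seconds) consumed by `func` over several runs,
    following the protocol used here: ten repetitions per algorithm."""
    total = 0.0
    for _ in range(repetitions):
        start = time.process_time()  # process CPU time, analogous to Matlab's cputime
        func()
        total += time.process_time() - start
    return total / repetitions

# Placeholder workload standing in for a single dereverberation pass.
mean_cpu = average_cpu_time(lambda: sum(i * i for i in range(100_000)))
```

Averaging over repetitions smooths out scheduling jitter, and measuring CPU time rather than wall-clock time makes the comparison less sensitive to background processes.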

Appendix B: Related Publications

Journal papers

[1] A. Tsilfidis, I. Mporas, J. Mourjopoulos, and N. Fakotakis. Automatic speech recognition performance in different room acoustic environments with and without dereverberation preprocessing. Submitted for publication.
[2] A. Tsilfidis and J. Mourjopoulos. Blind single-channel suppression of late reverberation based on perceptual reverberation modeling. Journal of the Acoustical Society of America, 129(3), 2011.
[3] A. Tsilfidis and J. Mourjopoulos. Signal-dependent constraints for perceptually motivated suppression of late reverberation. Signal Processing, 90.

International Conference Papers ([1-4] refereed proceedings, [5] invited paper, [6-10] abstract/precis refereed proceedings)

[1] A. Tsilfidis, E. Georganti, E. Kokkinis, and J. Mourjopoulos. Speech dereverberation based on a recorded handclap. In Digital Signal Processing Conference (DSP), Corfu, Greece.
[2] A. Tsilfidis, E. Georganti, and J. Mourjopoulos. Binaural extension and performance of single-channel spectral subtraction dereverberation algorithms. In Proc. of the IEEE ICASSP, Prague, Czech Republic.

[3] A. Tsilfidis and J. Mourjopoulos. Perceptually-motivated selective suppression of late reverberation. In Proc. of the 17th Digital Signal Processing Conference (DSP), Santorini, Greece, 2009.
[4] A. Tsilfidis, J. Mourjopoulos, and D. Tsoukalas. Blind estimation and suppression of late reverberation utilizing auditory masking. In Proc. of the Hands Free Speech Communication and Microphone Arrays (HSCMA), Trento, Italy, 2008.
[5] A. Tsilfidis, E. Georganti, and J. Mourjopoulos. A binaural framework for spectral subtraction dereverberation. In Forum Acusticum (invited paper), Aalborg, Denmark.
[6] A. Tsilfidis, K. E. Kokkinis, and J. Mourjopoulos. Suppression of late reverberation at multiple speaker positions utilizing a single impulse response measurement. In Forum Acusticum, Aalborg, Denmark.
[7] A. Tsilfidis and J. Mourjopoulos. Blind single-channel dereverberation for music post-processing. In Proc. of the 130th Convention of the Audio Engineering Society, London, UK.
[8] E. Kokkinis, A. Tsilfidis, E. Georganti, and J. Mourjopoulos. Joint noise and reverberation suppression for speech applications. In Proc. of the 130th Convention of the Audio Engineering Society, London, UK.
[9] E. Georganti, A. Tsilfidis, and J. Mourjopoulos. Statistical analysis of binaural room impulse responses. In Proc. of the 130th Convention of the Audio Engineering Society, London, UK.
[10] A. Tsilfidis, C. Papadakos, and J. Mourjopoulos. Hierarchical Perceptual Mixing. In Proc. of the 126th Convention of the Audio Engineering Society, Munich, Germany.

Greek Conference Papers

[1] A. Tsilfidis and J. Mourjopoulos. Blind Dereverberation for speech and music signals. In Proc. of the Hellenic Institute of Acoustics 2010 conference (in Greek), Athens, Greece.
[2] A. Tsilfidis. Auditory thresholds of Gaussian Shaped Sinusoidal Tones. In Proc. of the Hellenic Institute of Acoustics 2008 conference (in Greek), Xanthi, Greece.
[3] A. Tsilfidis, J. Mourjopoulos, and D. Tsoukalas. Method for estimation and suppression of reverberation using psychoacoustic criteria. In Proc. of the Hellenic Institute of Acoustics 2008 conference (in Greek), Xanthi, Greece.


References

[1] ISO/IEC I S. Coding of Moving Pictures and Associated Audio for Digital Storage Media up to 1.5 Mbit/s, Part 3: Audio.
[2] J B Allen. Effects of small room reverberation on subjective preference. Journal of the Acoustical Society of America, 71:S5.
[3] J B Allen and D A Berkley. Image method for efficiently simulating small room acoustics. Journal of the Acoustical Society of America, 65(4).
[4] J B Allen, D Berkley, and J Blauert. Multimicrophone signal-processing technique to remove room reverberation from speech signals. Journal of the Acoustical Society of America, 62.
[5] C Avendano and H Hermansky. Study on the dereverberation of speech based on temporal processing. In Proc. of the ICSLP.
[6] M Barthet and M Sandler. On the effect of reverberation on musical instrument automatic recognition. In Proc. of the 128th Convention of the Audio Engineering Society, May.
[7] K J Bathe. Finite Element Procedures. Prentice Hall.
[8] J W Beauchamp, R C Maher, and R Brown. Detection of musical pitch from recorded solo performances. In Proc. of the 94th Convention of the Audio Engineering Society.

[9] S Bech. Timbral aspects of reproduced sound in small rooms. I. Journal of the Acoustical Society of America, 97.
[10] J Beerends and J Stemerdink. A perceptual audio quality measure based on a psychoacoustic sound representation. Journal of the Audio Engineering Society, 40(12).
[11] D Bees, M Blostein, and P Kabal. Reverberant speech enhancement using cepstral processing. In Proc. of the IEEE ICASSP, volume 2.
[12] M Berouti, R Schwartz, and J Makhoul. Enhancement of speech corrupted by acoustic noise. In Proc. of the IEEE ICASSP, volume 4.
[13] S Bharitkar, P Hilmes, and C Kyriakakis. Robustness of spatial average equalization: A statistical reverberation model approach. Journal of the Acoustical Society of America, 116(6).
[14] J Bitzer, K U Simmer, and K.-D. Kammeyer. Theoretical noise reduction limits of the generalized sidelobe canceller (GSC) for speech enhancement. In Proc. of the IEEE ICASSP, volume 5.
[15] J Blauert. Spatial Hearing. MIT Press.
[16] J Blauert. Communication Acoustics. Springer.
[17] J Blauert, J Braasch, J Buchholz, S Colburn, U Jekosch, A Kohlrausch, J Mourjopoulos, V Pulkki, and A Raake. Aural assessment by means of binaural algorithms - the AABBA project. Technical report.
[18] B Blesser. An interdisciplinary synthesis of reverberation viewpoints. Journal of the Audio Engineering Society, 49(10).
[19] S Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2), April.

[20] R H Bolt and A D MacDonald. Theory of speech masking by reverberation. Journal of the Acoustical Society of America, 21(6).
[21] J S Bradley. Predictors of speech intelligibility in rooms. Journal of the Acoustical Society of America, 80(3).
[22] J S Bradley, R Reich, and S G Norcross. A just noticeable difference in C50 for speech. Applied Acoustics, 58(2):99-108.
[23] J S Bradley, R D Reich, and S G Norcross. On the combined effects of signal-to-noise ratio and room acoustics on speech intelligibility. Journal of the Acoustical Society of America, 106(4).
[24] K Brandenburg and G Stoll. ISO/MPEG-1 Audio: A Generic Standard for Coding of High-Quality Digital Audio. Journal of the Audio Engineering Society, 42(10).
[25] A S Bregman. Auditory Scene Analysis. MIT Press.
[26] M Bruggen. Coloration and binaural decoloration in natural environments. Acta Acustica united with Acustica, 87.
[27] F Brugnara, D Falavigna, and M Omologo. Automatic segmentation and labeling of speech based on Hidden Markov Models. Speech Communication, 12(4).
[28] J Buchholz and J Mourjopoulos. A computational auditory masking model based on signal-dependent compression. I. Model description and performance analysis. Acta Acustica united with Acustica, 90.
[29] J Buchholz and J Mourjopoulos. A computational auditory masking model based on signal-dependent compression. II. Model simulations and analytical approximations. Acta Acustica united with Acustica, 90.
[30] G Canevet. Elements de Psychoacoustique (in French), page 110. CNRS, France.

[31] O Cappe. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Transactions on Speech and Audio Processing, 2(2).
[32] M Chassin. Music and hearing aids. The Hearing Journal, 56:36, 38, 40-41.
[33] Y.-H. B. Chiu, B Raj, and R M Stern. Learning-based auditory encoding for robust speech recognition. In Proc. of IEEE ICASSP, March.
[34] I Cohen, J Benesty, S Gannot, and J M Kates. Speech processing in modern communication: Challenges and perspectives. Journal of the Acoustical Society of America, 129(1).
[35] J R Cohen. Application of an auditory model to speech recognition. Journal of the Acoustical Society of America, 85(6).
[36] M Cooke, P Green, L Josifovski, and A Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 34(3).
[37] G Defrance and J.-D. Polack. Measuring the mixing time in auditoria. In Proc. of the Acoustics 08, Paris, France.
[38] M Delcroix, T Hikichi, and M Miyoshi. Precise Dereverberation Using Multichannel Linear Prediction. IEEE Transactions on Audio, Speech and Language Processing, 15(2).
[39] J Deller, J Hansen, and J Proakis. Discrete-Time Processing of Speech Signals. Wiley-IEEE Press.
[40] J P Egan and H W Hake. On the Masking Pattern of a Simple Auditory Stimulus. Journal of the Acoustical Society of America, 22(5).
[41] L L Eliot. Backward and Forward Masking. International Journal of Audiology, 10(2):65-76, 1971.

[42] K Eneman and M Moonen. Multimicrophone Speech Dereverberation: Experimental Validation. EURASIP Journal on Audio, Speech and Music Processing, pages 1-20.
[43] Y Ephraim and D Malah. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech and Signal Processing, 32(6), December.
[44] C Faller. Parametric multichannel audio coding: synthesis of coherence cues. IEEE Transactions on Audio, Speech, and Language Processing, 14(1).
[45] C Faller and F Baumgarte. Binaural cue coding - Part II: Schemes and applications. IEEE Transactions on Speech and Audio Processing, 11(6).
[46] A Farina. Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique. In Proc. of the 108th Convention of the Audio Engineering Society, February.
[47] A Farina. Advancements in Impulse Response Measurements by Sine Sweeps. In Proc. of the 122nd Convention of the Audio Engineering Society, May.
[48] H Fastl and E Zwicker. Psychoacoustics: Facts and Models, volume 22. Springer-Verlag.
[49] J L Flanagan and R C Lummis. Signal processing to reduce multipath distortion in small rooms. Journal of the Acoustical Society of America, 47.
[50] H Fletcher. Auditory Patterns. Reviews of Modern Physics, 12(1):47-65.
[51] K Furuya and A Kataoka. Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction. IEEE Transactions on Audio, Speech and Language Processing, 15.
[52] S Gannot and M Moonen. Subspace Methods for Multimicrophone Speech Dereverberation. EURASIP Journal on Advances in Signal Processing.
[53] J S Garofolo. Getting Started with the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database. National Institute of Standards and Technology (NIST), Gaithersburg, MD.
[54] N D Gaubitch and P A Naylor. Analysis of the Dereverberation Performance of Microphone Arrays. In Proc. of the IWAENC.
[55] E Georganti, J Mourjopoulos, and F Jacobsen. Analysis of room transfer function and reverberant signal statistics. In Proc. of the Acoustics 08 Conference, Paris, France, 2008.
[56] E Georganti, T Zarouchas, and J Mourjopoulos. Reverberation Analysis via Response and Signal Statistics. In Proc. of the 128th Convention of the Audio Engineering Society, London, UK.
[57] E Georganti, A Tsilfidis, and J Mourjopoulos. Statistical analysis of binaural room impulse responses. In Proc. of the 130th Convention of the Audio Engineering Society, London, UK.
[58] B W Gillespie, H S Malvar, and D A F Florencio. Speech dereverberation via maximum-kurtosis subband adaptive filtering, volume 6. IEEE.
[59] D Giuliani, M Matassoni, M Omologo, and P Svaizer. Training of HMM with Filtered Speech Material for Hands-Free Recognition. In Proc. of the IEEE ICASSP.
[60] R Gomez and T Kawahara. Robust Speech Recognition Based on Dereverberation Parameter Optimization Using Acoustic Model Likelihood. IEEE Transactions on Audio, Speech, and Language Processing, 18.

[61] S Griebel and M Brandstein. Wavelet Transform Extrema Clustering for Multi-Channel Speech Dereverberation. In Proc. of the IEEE IWAENC, pages 27-30.
[62] E Habets. Single- and multi-microphone speech dereverberation using spectral enhancement. PhD thesis, Technische Univ. Eindhoven.
[63] E Habets, S Gannot, I Cohen, and P Sommen. Joint dereverberation and residual echo suppression of speech signals in noisy environments. IEEE Transactions on Audio, Speech and Language Processing, 16(8).
[64] V Hamacher, J Chalupper, J Eggers, E Fischer, U Kornagel, H Puder, and U Rass. Signal Processing in High-End Hearing Aids: State of the Art, Challenges, and Future Trends. EURASIP Journal on Applied Signal Processing.
[65] Y Haneda, S Makino, and Y Kaneda. Common acoustical pole and zero modeling of Room Transfer Functions. IEEE Transactions on Speech and Audio Processing, 2(2).
[66] V Hansen and G Munch. Making recordings for simulation tests in the Archimedes Project. Journal of the Audio Engineering Society, 39(10).
[67] A Harma and U K Laine. A comparison of warped and conventional linear predictive coding. IEEE Transactions on Speech and Audio Processing, 9(5).
[68] A Harma, M Karjalainen, L Savioja, V Valimaki, U Laine, and J Huopaniemi. Frequency-warped signal processing for audio applications. Journal of the Audio Engineering Society, 48.
[69] A Harma, T Lokki, and V Pulkki. Drawing Quality Maps of the Sweet Spot and Its Surroundings in Multichannel Reproduction and Coding. In Proc. of the 21st Conference of the Audio Engineering Society: Architectural Acoustics and Sound Reinforcement.

[70] W M Hartmann. Signals, Sound, and Sensation. Springer.
[71] P Hatziantoniou and J Mourjopoulos. Generalized fractional-octave smoothing of audio and acoustic responses. Journal of the Audio Engineering Society, 48.
[72] P Hatziantoniou and J Mourjopoulos. Errors in real-time room acoustics dereverberation. Journal of the Audio Engineering Society, 52.
[73] P Hatziantoniou, J Mourjopoulos, and J Worley. Subjective assessments of real-time room dereverberation and loudspeaker equalization. In Proc. of the 118th Convention of the Audio Engineering Society.
[74] H Hermansky. Should recognizers have ears? Speech Communication, 25:3-27.
[75] H Hermansky and N Morgan. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4).
[76] T Hidaka, Y Yamada, and T Nakagawa. A new definition of boundary point between early reflections and late reverberation in room impulse responses. Journal of the Acoustical Society of America, 122(1).
[77] J Hopgood. Nonstationary Signal Processing with Application to Reverberation Cancellation in Acoustic Environments. PhD thesis, University of Cambridge.
[78] J Hopgood, C Evers, and J Bell. Bayesian single channel blind speech dereverberation using Monte Carlo methods. Journal of the Acoustical Society of America, 123(5):3586.
[79] T Houtgast and H J M Steeneken. A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria. Journal of the Acoustical Society of America, 77(3).

[80] Y Huang, J Chen, and J Benesty. Immersive audio schemes. IEEE Signal Processing Magazine, 28(1):20-32.
[81] Absolute category rating (ACR) method. International Telecommunications Union, Geneva, Switzerland.
[82] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. International Telecommunications Union, Geneva, Switzerland.
[83] Method for the subjective assessment of intermediate audio quality. International Telecommunications Union, Geneva, Switzerland.
[84] ISO 3382. Acoustics - measurement of the reverberation time of rooms with reference to other acoustical parameters.
[85] M Jeub and P Vary. Binaural dereverberation based on a dual-channel Wiener filter with optimized noise field coherence. In Proc. of the IEEE ICASSP.
[86] M Jeub, M Schafer, T Esch, and P Vary. Model-Based Dereverberation Preserving Binaural Cues. IEEE Transactions on Audio, Speech, and Language Processing, 18.
[87] J D Johnston. Transform coding of audio signals using perceptual noise criteria. IEEE Journal on Selected Areas in Communications, 6(2).
[88] J-M Jot. An analysis/synthesis approach to real-time artificial reverberation, volume 2. IEEE.
[89] J-M Jot, L Cerveau, and O Warusfel. Analysis and Synthesis of Room Reverberation Based on a Statistical Time-Frequency Model. In Proc. of the 123rd Convention of the Audio Engineering Society.

[90] M Karjalainen and T Paatero. Equalization of loudspeaker and room responses using Kautz filters: direct least squares design. EURASIP Journal on Applied Signal Processing.
[91] M Karjalainen, T Paatero, J Mourjopoulos, and P Hatziantoniou. About room response equalization and dereverberation. In Proc. of IEEE WASPAA, 2005.
[92] H Kayser, S D Ewert, J Anemuller, T Rohdenburg, V Hohmann, and B Kollmeier. Database of Multichannel In-Ear and Behind-the-Ear Head-Related and Binaural Room Impulse Responses. EURASIP Journal on Applied Signal Processing, 2009:1-10.
[93] P Kendrick, F Li, and T Cox. Blind estimation of reverberation parameters for non-diffuse rooms. Acta Acustica united with Acustica, 93.
[94] J Keshet, S Shalev-Shwartz, Y Singer, and D Chazan. A Large Margin Algorithm for Speech-to-Phoneme and Music-to-Score Alignment. IEEE Transactions on Audio, Speech, and Language Processing, 15(8).
[95] B E Kingsbury. Perceptually inspired signal processing strategies for robust speech recognition in reverberant environments. PhD thesis, University of California, Berkeley.
[96] B E Kingsbury and N Morgan. Recognizing reverberant speech with RASTA-PLP. In Proc. of the IEEE ICASSP.
[97] K Kinoshita, M Delcroix, T Nakatani, and M Miyoshi. Suppression of Late Reverberation Effect on Speech Signal Using Long-Term Multiple-step Linear Prediction. IEEE Transactions on Audio, Speech, and Language Processing, 17(4).
[98] C Knapp and G Carter. The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-24.

[99] A Koening, J Allen, D Berkley, and T Curtis. Determination of masking level differences in a reverberant environment. Journal of the Acoustical Society of America, 61.
[100] E Kokkinis and J Mourjopoulos. Unmixing Acoustic Sources in Real Reverberant Environments for Close-Microphone Applications. Journal of the Audio Engineering Society, 58(11):1-10, November.
[101] E Kokkinis, A Tsilfidis, E Georganti, and J Mourjopoulos. Joint noise and reverberation suppression for speech applications. In Proc. of the 130th Convention of the Audio Engineering Society, London, UK, 2011.
[102] P Krishnamoorthy and S R Mahadeva Prasanna. Reverberant speech enhancement by temporal and spectral processing. IEEE Transactions on Audio, Speech and Language Processing, 17.
[103] D Kundur and D Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43-64.
[104] H Kuttruff. Room Acoustics. Applied Science Publishers Ltd, London, 2nd edition.
[105] T Langhans and H Strube. Speech enhancement by nonlinear multiband envelope filtering. In Proc. of the IEEE ICASSP.
[106] R Leavitt and C Flexer. Speech degradation as measured by the rapid speech transmission index (RASTI). Ear and Hearing, 12(2).
[107] K Lebart, J Boucher, and P Denbigh. A new method based on spectral subtraction for speech dereverberation. Acta Acustica united with Acustica, 87.
[108] J.-H. Lee, S.-H. Oh, and S.-Y. Lee. Binaural semi-blind dereverberation of noisy convoluted speech signals. Neurocomputing, 72.

[109] K F Lee and H W Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Audio, Speech, and Language Processing, 37(11).
[110] R P Lippmann. Speech recognition by machines and humans. Speech Communication, 22(1):1-15, July 1997.
[111] R Y Litovsky, H S Colburn, W A Yost, and S J Guzman. The precedence effect. Journal of the Acoustical Society of America, 106(4).
[112] C S Littleton. Gods, Goddesses, and Mythology, volume 1. Marshall Cavendish Corporation.
[113] J P A Lochner and J F Burger. The influence of reflections on auditorium acoustics. Journal of Sound and Vibration, 1(4).
[114] P Loizou. Speech Enhancement: Theory and Practice. CRC Press, 1st edition.
[115] H W Löllmann and P Vary. Low delay noise reduction and dereverberation for hearing aids. EURASIP Journal on Advances in Signal Processing, 2009:1-9.
[116] H W Löllmann, E Yilmaz, M Jeub, and P Vary. An Improved Algorithm for Blind Reverberation Time Estimation. In Proc. of the IEEE IWAENC, pages 1-4.
[117] MATLAB, version 7.9 (R2009b). The MathWorks Inc., Natick, Massachusetts.
[118] H McGurk and J MacDonald. Hearing lips and seeing voices. Nature.
[119] O M M Mitchell and D A Berkley. Reduction of long time reverberation by a center clipping process. Journal of the Acoustical Society of America, 47:84.

[120] M Miyoshi and Y Kaneda. Inverse filtering of room acoustics. IEEE Transactions on Acoustics, Speech and Signal Processing, 36.
[121] B C J Moore. An Introduction to the Psychology of Hearing. Academic Press, 4th edition.
[122] B C J Moore and B R Glasberg. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74(3).
[123] B C J Moore and B R Glasberg. Frequency discrimination of complex tones with overlapping and non-overlapping harmonics. Journal of the Acoustical Society of America, 87(5).
[124] J A Moorer. About this reverberation business. Computer Music Journal, 3(2):13-28.
[125] J Mourjopoulos. On the variation and invertibility of room impulse response functions. Journal of Sound and Vibration, 102.
[126] J Mourjopoulos. Digital equalization methods for audio systems. In Proc. of the 84th Convention of the Audio Engineering Society.
[127] J Mourjopoulos. Digital equalization of room acoustics. In Proc. of the 92nd Convention of the Audio Engineering Society.
[128] J Mourjopoulos. Digital equalization of room acoustics. Journal of the Audio Engineering Society, 42.
[129] J Mourjopoulos and J K Hammond. Modelling and enhancement of reverberant speech using an envelope convolution method. In Proc. of the IEEE ICASSP.
[130] J Mourjopoulos, P Clarkson, and J Hammond. A comparative study of least-squares and homomorphic techniques for the inversion of mixed-phase signals. In Proc. of the IEEE ICASSP.

[131] I Mporas, T Ganchev, and N Fakotakis. A hybrid architecture for automatic segmentation of speech waveforms. In Proc. of the IEEE ICASSP, April.
[132] A K Nabelek and D Mason. Effect of noise and reverberation on binaural and monaural word identification by subjects with various audiograms. Journal of Speech and Hearing Research, 24(3).
[133] T Nakatani, B Juang, T Yoshioka, K Kinoshita, M Delcroix, and M Miyoshi. Speech dereverberation based on maximum-likelihood estimation with time-varying Gaussian source model. IEEE Transactions on Audio, Speech and Language Processing, 16.
[134] T Nakatani, T Yoshioka, K Kinoshita, M Miyoshi, and B-H Juang. Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Transactions on Audio, Speech and Language Processing, 18(7), September.
[135] S Neely and J B Allen. Invertibility of room impulse response. Journal of the Acoustical Society of America, 66.
[136] S Norcross, G Soulodre, and M Lavoie. Subjective investigations of inverse filtering. Journal of the Audio Engineering Society, 52.
[137] S Olive, J Jackson, A Devantier, and D Hunt. The subjective and objective evaluation of room correction products. In Proc. of the 127th Convention of the Audio Engineering Society.
[138] A V Oppenheim. Applications of Digital Signal Processing. Prentice-Hall.
[139] A V Oppenheim, R W Schafer, and J R Buck. Discrete-Time Signal Processing. Prentice Hall.
[140] S J Orfanidis. Introduction to Signal Processing. Prentice Hall.

[141] T Paatero and M Karjalainen. Kautz Filters and Generalized Frequency Resolution: Theory and Audio Applications. Journal of the Audio Engineering Society, 51(1/2):27-44.
[142] K Palomaki, G Brown, and J Barker. Techniques for handling convolutional distortion with missing data automatic speech recognition. Speech Communication, 43, 2004.
[143] Y Pan and A Waibel. The effects of room acoustics on MFCC speech parameters. In Proc. of the ICSLP.
[144] I Peer, B Rafaely, and Y Zigel. Reverberation matching for speaker recognition. In Proc. of the IEEE ICASSP.
[145] B L Pellom and J H L Hansen. Automatic segmentation of speech recorded in unknown noisy channel characteristics. Speech Communication, 20:97-116.
[146] R Petrick, K Lohde, M Wolff, and R Hoffmann. The harming part of room acoustics in automatic speech recognition. In Proc. of the INTERSPEECH, August.
[147] A P Petropulu and S Subramaniam. Cepstrum-based deconvolution for speech dereverberation. In Proc. of the IEEE ICASSP, pages I/9-I/12.
[148] V M A Peutz. Articulation loss of consonants as a criterion for speech transmission in a room. Journal of the Audio Engineering Society, 19(11).
[149] Plato. Republic. The Internet Classics Archive, 380 B.C.
[150] J D Polack. La transmission de l'énergie sonore dans les salles. PhD thesis, Université du Maine.
[151] J D Polack. Playing billiards in the concert hall: The mathematical foundations of geometrical room acoustics. Applied Acoustics, 38(2-4).
[152] L C W Pols, X Wang, and L F M ten Bosch. Modelling of phone duration (using the TIMIT database) and its potential benefit for ASR. Speech Communication, 19(2).

176 REFERENCES [151] J D Polack. Playing billiards in the concert hall: The mathematical foundations of geometrical room acoustics. Applied Acoustics, 38(2-4): , [152] L C W Pols, X Wang, and L F M ten Bosch. Modelling of phone duration (using the TIMIT database) and its potential benefit for ASR. Speech Communication, 19(2): , [153] S R M Prasanna and B Yegnanarayana. Extraction of pitch in adverse conditions. Proc. of the IEEE ICASSP, 1: , [154] J P Preece and R H Wilson. Detection, loudness, and discrimination of five-component tonal complexes differing in crest factor. Journal of the Acoustical Society of America, 84(1): , , 86 [155] R Ratnam, D Jones, B Wheeler, W O Brien, C Lansing, and S Feng. Blind estimation of reverberation time. Journal of the Acoustical Society of America, 114(5): , , 62, 83 [156] C K Raut, T Nishimoto, and S Sagayama. Model Adaptation for Long Convolutional Distortion by Maximum Likelihood Based StateFiltering Approach. In Proc. of the IEEE ICASSP, [157] Lord Rayleigh. On our perception of sound direction. Philosophical Magazine, 13: , [158] Bruno H Repp. The sound of two hands clapping: An exploratory study. Journal of the Acoustical Society of America, 81(4): , , 67 [159] N Roman and D Wang. Pitch-based monaural segregation ofreverberant speech. Journal of the Acoustical Society of America, 120(1):458, [160] T D Rossing, editor. Springer Handbook of Acoustics. Springer,2007.3, 4, 21, 22, 106, 111 [161] W C Sabine. Collected Papers on Acoustics. Peninsula Publishing, Los Altos, ,

[162] L Savioja and V Valimaki. Multiwarping for enhancing the frequency accuracy of digital waveguide mesh simulations. IEEE Signal Processing Letters, 8(5).
[163] M R Schroeder. Natural Sounding Artificial Reverberation. Journal of the Audio Engineering Society, 10(3).
[164] M R Schroeder. New Method of Measuring Reverberation Time. Journal of the Acoustical Society of America, 37(6).
[165] M R Schroeder. Integrated-impulse method measuring sound decay without using impulses. Journal of the Acoustical Society of America, 66(2).
[166] M R Schroeder. The Schroeder frequency revisited. Journal of the Acoustical Society of America, 99(5).
[167] B Schuller. Affective speaker state analysis in the presence of reverberation. International Journal of Speech Technology, pages 1-11, 2011.
[168] A Sehr and W Kellermann. Towards Robust Distant-Talking Automatic Speech Recognition in Reverberant Environments. In E Hänsler and G Schmidt, editors, Speech and Audio Processing in Adverse Environments, Signals and Communication Technology. Springer Berlin Heidelberg.
[169] A Sehr, E Habets, R Maas, and W Kellermann. Towards a better understanding of the effect of reverberation on speech recognition performance. In Proc. of the IEEE IWAENC.
[170] A Sethy and S Narayanan. Refined speech segmentation for concatenative speech synthesis. In Proc. of the ICSLP.
[171] G A Soulodre, N Popplewell, and J S Bradley. Combined effects of early reflections and background noise on speech intelligibility. Journal of Sound and Vibration, 135.

178 REFERENCES [172] B E Stein and M A Meredith. The merging of the senses. MITPress, [173] A Steinhauser. The theory of binaural audition. Philosophical Magazine, 7: , [174] R Stewart and M Sandler. Statistical Measures of Early Reflections of Room Impulse Responses. In Proc. of the 10th International Conference on Digital Audio Effects (DAFx-07), Bordeaux,France, [175] T.G. Stockham, T M Cannon, and R B Ingebretsen. Blind deconvolution through digital signal processing. Proceedings of the IEEE, 63(4): , [176] B Strope and A Alwan. A model of dynamic auditory perception and its application to robust word recognition. IEEE Transactions on Speech and Audio Processing, pages , [177] D Sumarac-Pavlovic, M Mijic, and H Kurtovic. A simple impulse sound source for measurements in room acoustics. Applied Acoustics, 69(4): , , 67 [178] Y Takata and A K Nabelek. English consonant recognition innoiseandin reverberation by japanese and american listeners. Journal of the Acoustical Society of America, 88(2): , [179] S P Thompson. On binaural audition. Philosophical Magazine, 4: , [180] H. Thornburg, R. J. Leistikow, and J. Berger. Melody Extraction and Musical Onset Detection via Probabilistic Models of Framewise STFT Peak Data. IEEE Transactions on Audio, Speech, and Language Processing, 15 (4): , [181] F E Toole. Loudspeakers and Rooms for Sound Reproduction- A Scientific Review. Journal of the Audio Engineering Society, 54(6): ,2006.7, 17, 20,

179 REFERENCES [182] H Traunmüller. Analytical expressions for the tonotopicsensory scale. Journal of the Acoustical Society of America, 88(1):97 100, [183] M Triki and D T M Slock. Delay and Predict Equalization for Blind Speech Dereverberation. In Proc. of the IEEE ICASSP, volume5, [184] A Tsilfidis. Masquage Sonore Temps-Frequence (in French). Master s thesis, Universite de la Mediteranee Aix-Marseille II, [185] A Tsilfidis. Auditory thresholds of Gaussian Shaped Sinusoidal Tones. In Proc. of the Hellenic Institute of Acoustics 2008 conference (ingreek), Xanthi, Greece, [186] A Tsilfidis and J Mourjopoulos. Perceptually-motivated selective suppression of late reverberation. In Proc. of the 17th Digital Signal Processing Conference (DSP), Santorini,Greece,2009.8, 9, 40, 41, 45 [187] A Tsilfidis and J Mourjopoulos. Signal-dependent constraints for perceptually motivated suppression of late reverberation. Signal Processing, 90: , , 9, 40, 41, 44, 45, 65, 74, 78, 89 [188] A Tsilfidis and J Mourjopoulos. Blind Dereverberation for speech and music signals. In Proc. of the Hellenic Institute of Acoustics 2010 conference (in Greek), Athens,Greece, , 11, 82 [189] A. Tsilfidis and J Mourjopoulos. Blind single-channel suppression of late reverberation based on perceptual reverberation modeling. Journal of the Acoustical Society of America, 129(3): ,2011.xxiv, 10, 11, 40, 45, 82, 113 [190] A Tsilfidis and J. Mourjopoulos. Blind single-channel dereverberation for music post-processing. In Proc. of the 130th Convention of the Audio Engineering Society, London,UK, , 82 [191] A Tsilfidis, J Mourjopoulos, and D Tsoukalas. Blind estimation and suppression of late reverberation utilizing auditory masking,. In Proc. of the Hands Free Speech Communication and Microphone Arrays (HSCMA), Trento, Italy, , 9, 40,

180 REFERENCES [192] A Tsilfidis, J Mourjopoulos, and D Tsoukalas. Method for estimationand suppression of reverberation using psychoacoustic criteria. In Proc. of the Hellenic Institute of Acoustics 2008 conference (in Greek), Xanthi,Greece, , 9, 40 [193] A Tsilfidis, C. Papadakos, and J. Mourjopoulos. Hierarchical Perceptual Mixing. In Proc. of the 126th Convention of the Audio Engineering Society, Munich, Germany, [194] A Tsilfidis, E Georganti, E Kokkinis, and J Mourjopoulos. Speech dereverberation based on a recorded handclap. In Digital Signal Processing Conference (DSP), Corfu,Greece,2011.9, 10, 62, 67 [195] A Tsilfidis, E Georganti, and J Mourjopoulos. A binaural frameworkfor spectral subtraction dereverberation. In Forum Acusticum (invited paper), Aalborg, Denmark, , 10, 62, 76, 124 [196] A Tsilfidis, E Georganti, and J Mourjopoulos. Binaural extension and performance of single-channel spectral subtraction dereverberation algorithms. In Proc. of the IEEE ICASSP, Prague,CzechRepublic,2011.9, 10, 62, 76 [197] A Tsilfidis, K E Kokkinis, and J Mourjopoulos. Suppression of late reverberation at multiple speaker positions utilizing a single impulse response measurement. In Forum Acusticum, Aalborg, Denmark, , 10, 62 [198] A Tsilfidis, I Mporas, J Mourjopoulos, and N Fakotakis. Automatic speech recognition performance in different room acoustic environments with and without dereverberation preprocessing. submitted for publication, , 12, 82, 106 [199] D Tsoukalas, M Paraskevas, and J Mourjopoulos. Speech enhancement using psychoacoustic criteria. Proc. of the IEEE ICASSP, 2: , , 39, 41 [200] D Tsoukalas, J Mourjopoulos, and G Kokkinakis. Speech enhancement based on audible noise suppression. IEEE Transactions on Speech and Audio Processing, 5: ,1997.8, 33, 39, 41, 43, 49,

181 REFERENCES [201] J Usher. An improved method to determine the onset timings of reflections in an acoustic impulse response. Journal of the Acoustical Society of America, 127(4):EL172 EL177, [202] T Van Den Bogaert, T J Klasen, M Moonen, L Van Deun, and J Wouters. Horizontal localization with bilateral hearing aids: Without is better than with. Journal of the Acoustical Society of America, 119(1): , [203] H L Van Trees. Optimum array processing, volume4. Wiley-Interscience New York, NY, USA, [204] S Vassilantonopoulos and J Mourjopoulos. Virtual Acoustic Reconstruction of Ritual and Public Spaces of Ancient Greece. Acta Acustica united with Acustica, 87(5): , [205] E Vincent, M Jafari, and M Plumbley. Preliminary guidelines for subjective evaluation of audio source separation algorithms. In Proc. of the ICA Research Network International Workshop, [206] M Vorländer. Simulation of the transient and steady-state sound propagation in rooms using a new combined ray-tracing/image-source algorithm. Journal of the Acoustical Society of America, 86(1): , [207] S J Waller. Sound and Rock Art. Nature, 363(6429):501, [208] A J Watkins and N J Holt. Effects of a complex reflection on vowel indentification. Acustica, 86: , [209] D L Weber and D M Green. Suppression effects in backward and forward masking. Journal of the Acoustical Society of America, 65(5): , [210] J Wen, N Gaubitch, E Habets, T Myatt, and P Naylor. Evaluation of speech dereverberation algorithms using the MARDY database. In Proc. of the IEEE IWAENC,

182 REFERENCES [211] L Wiegrebe and R D Patterson. Quantifying the distortion products generated by amplitude-modulated noise. Journal of the Acoustical Society of America, 106(5): , , 82 [212] T Wilmering, G Fazekas, and M Sandler. The effects of reverberation on onset detection tasks. Proc. of the the 128th of the Audio Engineering Society, [213] T Wittkop and V Hohmann. Strategy-selective noise reduction for binaural digital hearing aids. Speech Communication, 39: , [214] M Wölfel. Enhanced Speech Features by Single-Channel Joint Compensation of Noise and Reverberation. IEEE Transactions on Audio, Speech, and Language Processing, 17(2): , , 63 [215] J Woojay and B H Juang. Speech Analysis in a Model of the Central Auditory System. IEEE Transactions on Audio, Speech, and Language Processing, 15(6): , [216] M Wu and D Wang. A two-stage algorithm for one-microphone reverberant speech enhancement. IEEE Transactions on Audio, Speech and Language Processing, 14: , xxiv, 33, 35, 80 [217] N Yasuraoka, T Yoshioka, T Nakatani, A Nakamura, and H Okuno. Music dereverberation using harmonic structure source model and Wiener filter. In Proc. IEEE ICASSP, [218] B Yegnanarayana and P S Murthy. Enhancement of reverberant speech using LP residual signal. IEEE Transactions on Audio, Speech and Language Processing, 8(3): , , 33, 65 [219] S Young, G Evermann, M Gales, T Hain, D Kershaw, G Moore, JOdell, DOllason,DPovey,VValtchev,andPWoodland. The HTK Book (for HTK Version 3.3). Cambridge University Engineering Department,

183 REFERENCES [220] T Zarouchas and J Mourjopoulos. Modeling perceptual effects of reverberation on stereophonic sound reproduction in rooms. Journal of the Acoustical Society of America, 126(1): , , 89 [221] U Zölzer. Digital Audio Signal Processing. JohnWileyandsons, [222] Q Zou, X Zou, M Zhang, and Z Lin. A robust speech detection algorithmin amicrophonearrayteleconferencingsystem.inproc. of the IEEE ICASSP, pages ,
The enhanced spectrum \(S_e(\omega,j)\) is obtained by subtracting an estimate of the reverberant spectrum \(R(\omega,j)\) from the observed spectrum \(Y(\omega,j)\), where \(\omega\) denotes the frequency bin and \(j\) the frame index:

\[ S_e(\omega,j) = Y(\omega,j) - R(\omega,j) \]

Equivalently, the subtraction can be expressed as a time-varying gain \(G(\omega,j)\) applied to the observed spectrum:

\[ S_e(\omega,j) = G(\omega,j)\, Y(\omega,j), \qquad G(\omega,j) = \frac{|Y(\omega,j)| - |R(\omega,j)|}{|Y(\omega,j)|} \]

Following Polack's statistical model, the room impulse response \(h_r(n)\) is modelled as zero-mean noise \(b(n)\) modulated by an exponential decay set by the reverberation time \(RT_{60}\):

\[ h_r(n) = \begin{cases} b(n)\, e^{-\frac{3 \ln 10}{RT_{60}}\, n}, & n \ge 0, \\ 0, & n < 0. \end{cases} \]

Under this model, the reverberant magnitude spectrum can be estimated from the a priori signal-to-reverberation ratio:

\[ |R(\omega,j)| = \frac{|Y(\omega,j)|}{\sqrt{SNR_{pri}(\omega,j) + 1}} \]
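As a concrete illustration, the gain-based magnitude subtraction above can be sketched in a few lines. This is a minimal sketch, not code from the thesis; the STFT values and the spectral floor are arbitrary toy numbers.

```python
import numpy as np

def spectral_subtraction_gain(Y, R, floor=0.05):
    """Gain-based magnitude spectral subtraction for one STFT frame.

    Y : complex observed spectrum per bin; R : estimated reverberant
    magnitudes.  The gain is floored at a small positive value to
    avoid negative magnitudes (the usual musical-noise trade-off).
    """
    Y_mag = np.abs(Y)
    R_mag = np.abs(R)
    # G = (|Y| - |R|) / |Y|, floored
    G = np.maximum((Y_mag - R_mag) / np.maximum(Y_mag, 1e-12), floor)
    return G * Y  # S_e = G * Y keeps the observed phase

# Toy frame: bin 1 is dominated by reverberation, bin 2 by direct sound
Y = np.array([1.0 + 0.0j, 0.2 + 0.0j, 0.5j])
R = np.array([0.3, 0.25, 0.1])
S_e = spectral_subtraction_gain(Y, R)
```

Note that bin 1 would go negative under plain subtraction and is caught by the floor, while the phase of the purely imaginary bin 2 is preserved.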

The a priori ratio \(SNR_{pri}(\omega,j)\) is derived from the a posteriori ratio \(SNR_{post}(\omega,j)\) through the decision-directed recursion

\[ SNR_{pri}(\omega,j) = \beta\, SNR_{pri}(\omega,j-1) + (1-\beta)\, \max\big(0,\, SNR_{post}(\omega,j) - 1\big) \]

where \(\beta\) is a smoothing constant. Alternatively, the late reverberant power can be estimated by weighting the observed power spectrum \(|Y(\omega,j)|^2\) with a Rayleigh-shaped window \(w(j)\), scaled by \(\gamma\) and shifted by \(\rho\) frames:

\[ |R(\omega,j)|^2 = \gamma\, w(j-\rho)\, |Y(\omega,j)|^2, \qquad w(j) = \begin{cases} \dfrac{j+\alpha}{\alpha^2}\, e^{-\frac{(j+\alpha)^2}{2\alpha^2}}, & j \ge -\alpha, \\ 0, & j < -\alpha, \end{cases} \]

with \(\alpha\) controlling the spread of the window. A further alternative models reverberation as spectral smearing over the \(K\) previous frames:

\[ |R(\omega,j)|^2 = \sum_{l=1}^{K} |a_l(\omega,j)|^2\, |Y(\omega,j-l)|^2 \]
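The decision-directed recursion and the Rayleigh-shaped window can be sketched as follows. This is an illustrative sketch, not thesis code; \(\beta\), \(\alpha\) and the toy SNR track are assumed values.

```python
import numpy as np

def decision_directed(snr_post, beta=0.98):
    """Decision-directed smoothing of the a priori SNR for one bin.

    snr_post : per-frame a posteriori SNR values.
    Returns the smoothed a priori SNR track, started from zero.
    """
    snr_pri = np.zeros_like(snr_post, dtype=float)
    prev = 0.0
    for j, post in enumerate(snr_post):
        prev = beta * prev + (1.0 - beta) * max(0.0, post - 1.0)
        snr_pri[j] = prev
    return snr_pri

def rayleigh_window(j, alpha):
    """Rayleigh-shaped frame window w(j), zero for j < -alpha."""
    j = np.asarray(j, dtype=float)
    w = (j + alpha) / alpha**2 * np.exp(-((j + alpha) ** 2) / (2 * alpha**2))
    return np.where(j >= -alpha, w, 0.0)

snr_pri = decision_directed(np.array([4.0, 4.0, 4.0]), beta=0.5)
w = rayleigh_window(np.array([-10.0, 0.0]), alpha=5.0)
```

The recursion trades responsiveness against musical noise through \(\beta\): values close to 1 give a smoother but slower-reacting a priori SNR.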

The number of frames \(K\) is chosen according to the reverberation time \(RT_{60}\), and the smearing coefficients are estimated as

\[ a_l(\omega,j) = E\!\left\{ \frac{Y(\omega,j)\, Y^{*}(\omega,j-l)}{|Y(\omega,j-l)|^{2}} \right\} \]

The enhanced spectrum then follows from power subtraction, retaining the observed phase:

\[ S_e(\omega,j) = \sqrt{|Y(\omega,j)|^{2} - |R(\omega,j)|^{2}}\; \frac{Y(\omega,j)}{|Y(\omega,j)|} \]

and a masking threshold \(T_{S_e}(\omega,j)\) is computed from this estimate.
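The smearing-based reverberation estimate can be sketched numerically; averaging across the available frames stands in for the expectation, and the random STFT matrix is toy data, not thesis material.

```python
import numpy as np

rng = np.random.default_rng(0)

def smearing_reverb_power(Y, K):
    """Estimate per-bin reverberant power from the K previous frames.

    Y : complex STFT matrix of shape (bins, frames).
    a_l is approximated by averaging Y(w,j) Y*(w,j-l) / |Y(w,j-l)|^2
    over frames, as a stand-in for the expectation operator.
    """
    n_bins, n_frames = Y.shape
    R2 = np.zeros((n_bins, n_frames))
    for l in range(1, K + 1):
        num = Y[:, l:] * np.conj(Y[:, :-l])
        den = np.abs(Y[:, :-l]) ** 2 + 1e-12
        a_l = np.mean(num / den, axis=1, keepdims=True)  # shape (bins, 1)
        # accumulate |a_l|^2 |Y(w, j-l)|^2 into frames j >= l
        R2[:, l:] += np.abs(a_l) ** 2 * np.abs(Y[:, :-l]) ** 2
    return R2

Y = rng.standard_normal((4, 16)) + 1j * rng.standard_normal((4, 16))
R2 = smearing_reverb_power(Y, K=3)
```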

A perceptually motivated gain \(b(\omega,j)\) is then formed with exponent \(1/\nu(\omega,j)\), switching between the subtraction estimate \(S_e(\omega,j)\) and the masking threshold \(T_{S_e}(\omega,j)\) according to whether \(S_e(\omega,j)\) exceeds \(T_{S_e}(\omega,j)\); the parameter \(\nu(\omega,j)\) controls the depth of the suppression. The final estimate \(S'_e(\omega,j)\) results from applying \(b(\omega,j)\) to the observed spectrum \(Y(\omega,j)\). In addition, the frame power \(P_{Y_j} = E\{|Y(\omega,j)|^{2}\}\) is tracked, and the reverberation estimate is rescaled by a factor \(r_p\), i.e. \(R'(\omega,j) = r_p\, R(\omega,j)\), whenever the power ratio \(P_{Y_j}/P_{Y_{j-1}}\) falls below a threshold \(A\).

A normalized inter-frame correlation is also employed:

\[ \Phi_{jl} = \frac{E\{ Y(\omega,j)\, Y^{*}(\omega,j-l) \}}{\sqrt{E\{|Y(\omega,j)|^{2}\}\, E\{|Y(\omega,j-l)|^{2}\}}} \]

and the reverberation estimate is scaled by \((1 - \Phi_{j1})\) when the correlation criterion on \(\Phi_{j1}\) is met, giving \(R'(\omega,j) = (1 - \Phi_{j1})\, R(\omega,j)\), and is left unchanged otherwise.

Four reverberation conditions were considered: \(RT_{60} = 0.5\) s, \(RT_{60} = 1\) s, \(RT_{60} = 1.5\) s and \(RT_{60} = 3\) s.


The reverberant signal \(y(n)\) is the convolution of the anechoic signal \(s(n)\) with the room impulse response \(h_r(n)\) of length \(L_r\):

\[ y(n) = \sum_{m=0}^{L_r} h_r(m)\, s(n-m) \]

The anechoic signal is in turn modelled as an excitation \(u(n)\) filtered by the production filter \(h_s(n)\) of length \(L_s\):

\[ s(n) = \sum_{m=0}^{L_s} h_s(m)\, u(n-m) \]

Combining the two convolutions gives

\[ y(n) = \sum_{m=0}^{L_r + L_s} \sum_{l=0}^{L_s} h_r(m-l)\, h_s(l)\, u(n-m) \]

Splitting the room impulse response into early and late parts, \(h_r(n) = h_{early}(n) + h_{late}(n)\), with \(L_b\) the sample index separating \(h_{early}(n)\) from the late reverberation, yields

\[ y(n) = \sum_{m=0}^{L_b + L_s} \sum_{l=0}^{L_s} h_{early}(m-l)\, h_s(l)\, u(n-m) + \sum_{m=L_b+1}^{L_r + L_s} \sum_{l=0}^{L_s} h_{late}(m-l)\, h_s(l)\, u(n-m) \]

Since \(L_s < L_b\), the second term can be approximated by applying \(h_{late}(n)\) directly to the excitation:

\[ y(n) = \sum_{m=0}^{L_b + L_s} \sum_{l=0}^{L_s} h_{early}(m-l)\, h_s(l)\, u(n-m) + \sum_{m=L_b+1}^{L_r} h_{late}(m)\, u(n-m) \]
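The early/late decomposition is linear, so convolving with the two parts separately reproduces the full reverberant output exactly. A minimal sketch, with a synthetic exponentially decaying response standing in for a measured RIR and a 50 ms boundary as an assumed (common but adjustable) convention:

```python
import numpy as np

def split_rir(h, fs, boundary_ms=50.0):
    """Split a room impulse response into early and late parts.

    The boundary in milliseconds corresponds to the sample index L_b
    in the text; samples before L_b go to h_early, the rest to h_late.
    """
    L_b = int(round(boundary_ms * 1e-3 * fs))
    h_early = np.zeros_like(h)
    h_late = np.zeros_like(h)
    h_early[:L_b] = h[:L_b]
    h_late[L_b:] = h[L_b:]
    return h_early, h_late

fs = 8000
rng = np.random.default_rng(1)
# Synthetic RIR: noise with an exponential decay envelope
h = rng.standard_normal(fs // 2) * np.exp(-np.arange(fs // 2) / 2000.0)
s = rng.standard_normal(fs)

h_early, h_late = split_rir(h, fs)
# Linearity: the split parts reproduce the full convolution
y_full = np.convolve(s, h)
y_split = np.convolve(s, h_early) + np.convolve(s, h_late)
```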

For a reference position \(\rho_0\) the measured response is \(h^0_r(n) = h^0_{early}(n) + h^0_{late}(n)\), while for any other position \(\rho_i\), \(h^i_r(n) = h^i_{early}(n) + h^i_{late}(n)\). The key assumption is that the late part is position-independent:

\[ |H^i_{late}(\omega,j)|^{2} = |H^0_{late}(\omega,j)|^{2} \quad \text{for all } i \]

so that \(|H^0_{late}(\omega,j)|^{2}\), measured once, provides the late reverberation spectrum estimate \(R_{late}(\omega,j)\):

\[ R_{late}(\omega,j) = |H^0_{late}(\omega,j)|^{2}\, |U^i(\omega,j)|^{2} \]

where \(U^i(\omega,j)\) is the excitation spectrum at position \(\rho_i\). The clean spectrum estimate then follows as a gain applied to the observation:

\[ |\hat S^i(\omega,j)|^{2} = \frac{|Y^i(\omega,j)|^{2} - R_{late}(\omega,j)}{|Y^i(\omega,j)|^{2}}\, |Y^i(\omega,j)|^{2} = G(\omega,j)\, |Y^i(\omega,j)|^{2} \]

The gain is regularized through a threshold \(\theta\), a floor factor \(r\), and a frame criterion \(\zeta\) with threshold \(\zeta_{th}\), computed over the \(\Omega\) frequency bins:

\[ \zeta = \frac{\sum_{\omega=1}^{\Omega} G(\omega,j)\, |Y^i(\omega,j)|^{2}}{\sum_{\omega=1}^{\Omega} |Y^i(\omega,j)|^{2}} \]

When \(\zeta < \zeta_{th}\) and \(G(\omega,j) < \theta\), the suppression is limited, giving \(|\hat S^i(\omega,j)|^{2} = r\, G(\omega,j)\, |Y^i(\omega,j)|^{2}\). In the auditory-filterbank implementation, the late response of the reference position \(\rho_0\) is represented by \(|C^0_{late}(\kappa,\omega)|^{2}\), and the subtraction becomes

\[ |\hat S^i(\kappa,\omega)|^{2} = |Y^i(\kappa,\omega)|^{2} - |C^0_{late}(\kappa,\omega)|^{2}\, |U^i(\kappa,\omega)|^{2} \]
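One plausible reading of the frame-level regularization can be sketched as follows; \(\theta\), \(\zeta_{th}\) and \(r\) here are illustrative values, not the thesis settings, and the flooring rule is an assumption where the source text is ambiguous.

```python
import numpy as np

def floored_gain(G, Y2, theta=0.1, zeta_th=0.3, r=2.0):
    """Frame-level regularization of a spectral subtraction gain.

    G : per-bin gain for one frame; Y2 : per-bin |Y|^2.
    zeta measures how much frame energy the raw gain retains; when it
    falls below zeta_th, the per-bin gain is floored at r * theta to
    limit over-suppression.
    """
    zeta = np.sum(G * Y2) / np.sum(Y2)
    if zeta < zeta_th:
        return np.maximum(G, r * theta)
    return G

Y2 = np.array([1.0, 1.0, 1.0])
G_reg = floored_gain(np.array([0.05, 0.1, 0.2]), Y2)   # zeta ~= 0.12 -> floored
G_hi = floored_gain(np.array([0.5, 0.6, 0.7]), Y2)     # zeta = 0.6 -> unchanged
```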

[Figure: block diagram of the handclap-based dereverberation scheme — reverberant speech and clap recording, LPC residual analysis, estimation of the late RIR, estimation of late reverberation, clean spectrum estimation, and gain magnitude regularization.]

[Table: values of the parameters \(\theta\), \(\zeta_{th}\) and \(r\) used in the evaluation.]

[Figure: SRR difference (dB) at room positions A–D, (a) for response/clap pairs \(h_A, c_A\) and \(h_B, c_B\), (b) for \(h_C, c_E\) and \(h_D, c_F\).]


Given the left- and right-channel gains \(G_l(\omega,j)\) and \(G_r(\omega,j)\), a common binaural gain can be formed in three ways:

\[ G(\omega,j) = \max\big(G_l(\omega,j),\, G_r(\omega,j)\big) \]

\[ G(\omega,j) = \frac{G_l(\omega,j) + G_r(\omega,j)}{2} \]

\[ G(\omega,j) = \min\big(G_l(\omega,j),\, G_r(\omega,j)\big) \]

The evaluation used binaural responses with \(RT_{60} = 0.69\) s, a source distance of 3 m and azimuth angles of 0, 30, 60 and 90 degrees.
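Applying one common gain to both ear signals preserves the interaural cues; the three combination rules only differ in how aggressive the suppression is. A minimal sketch (toy gain values, not thesis data):

```python
import numpy as np

def combine_binaural_gains(G_l, G_r, mode="avg"):
    """Combine left/right spectral subtraction gains into a common gain.

    'max' is the most conservative (least suppression), 'min' the most
    aggressive, 'avg' a compromise between the two.
    """
    if mode == "max":
        return np.maximum(G_l, G_r)
    if mode == "min":
        return np.minimum(G_l, G_r)
    return 0.5 * (G_l + G_r)

G_l = np.array([0.2, 0.8])
G_r = np.array([0.6, 0.4])
```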

[Figure: SRR difference (dB) for the DSB, maxgain, avggain and mingain binaural schemes applied to the LB, WW and FK methods, panels (a) and (b).]


[Figure: block diagram of the dereverberation and evaluation framework across the tested \(RT_{60}\) conditions.]

In each auditory band \(k\) with centre frequency \(Cf_k\), the band signal \(D_k(n)\) of length \(L_k\) is analysed and the positions of its local maxima \(M_k(i)\) and local minima \(m_k(i)\) are located, with \(m_k(0) = 0\). Between a minimum \(m_k(i)\) and the following maximum \(M_k(i)\), the gain \(G_k(n)\) rises from a segment value \(g_k(i)\) towards \(G_k(i)\), with unit-step terms \(H(n - m_k(i))\) and \(H(n - M_k(i+1))\) delimiting the segments. The segment gain \(G_k(i)\) depends on the centre frequency \(Cf_k\), the reverberation time \(RT_{60}\) and a constant \(\gamma_1\), while \(g_k(i)\) additionally scales \(G_k(i)\) by a constant \(\gamma_2\) and the local dynamics of \(D_k(n)\) between \(m_k(i)\) and \(M_k(i)\).

[Figure: (a) equalized reverberant and clean signals, (b) RMI of the equalized reverberant signal with local maxima and minima marked, (c) regions to remain intact and regions to be processed, (d) the resulting gain function over time.]

Here \(H(n)\) denotes the unit step function:

\[ H(n) = \begin{cases} 1, & n \ge 0, \\ 0, & \text{otherwise.} \end{cases} \]


A generalized framework for binaural spectral subtraction dereverberation

Alexandros Tsilfidis, Eleftheria Georganti, John Mourjopoulos
Audio and Acoustic Technology Group, Department of Electrical and Computer Engineering, University of Patras

More information

The psychoacoustics of reverberation

The psychoacoustics of reverberation The psychoacoustics of reverberation Steven van de Par Steven.van.de.Par@uni-oldenburg.de July 19, 2016 Thanks to Julian Grosse and Andreas Häußler 2016 AES International Conference on Sound Field Control

More information

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction Human performance Reverberation

More information

Single-Microphone Speech Dereverberation based on Multiple-Step Linear Predictive Inverse Filtering and Spectral Subtraction

Single-Microphone Speech Dereverberation based on Multiple-Step Linear Predictive Inverse Filtering and Spectral Subtraction Single-Microphone Speech Dereverberation based on Multiple-Step Linear Predictive Inverse Filtering and Spectral Subtraction Ali Baghaki A Thesis in The Department of Electrical and Computer Engineering

More information

Estimation of Reverberation Time from Binaural Signals Without Using Controlled Excitation

Estimation of Reverberation Time from Binaural Signals Without Using Controlled Excitation Estimation of Reverberation Time from Binaural Signals Without Using Controlled Excitation Sampo Vesa Master s Thesis presentation on 22nd of September, 24 21st September 24 HUT / Laboratory of Acoustics

More information

Auditory System For a Mobile Robot

Auditory System For a Mobile Robot Auditory System For a Mobile Robot PhD Thesis Jean-Marc Valin Department of Electrical Engineering and Computer Engineering Université de Sherbrooke, Québec, Canada Jean-Marc.Valin@USherbrooke.ca Motivations

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Perceptual Distortion Maps for Room Reverberation

Perceptual Distortion Maps for Room Reverberation Perceptual Distortion Maps for oom everberation Thomas Zarouchas 1 John Mourjopoulos 1 1 Audio and Acoustic Technology Group Wire Communications aboratory Electrical Engineering and Computer Engineering

More information

Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W.

Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W. Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W. Published in: IEEE Transactions on Audio, Speech, and Language

More information

Different Approaches of Spectral Subtraction Method for Speech Enhancement

Different Approaches of Spectral Subtraction Method for Speech Enhancement ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches

More information

Recent Advances in Acoustic Signal Extraction and Dereverberation

Recent Advances in Acoustic Signal Extraction and Dereverberation Recent Advances in Acoustic Signal Extraction and Dereverberation Emanuël Habets Erlangen Colloquium 2016 Scenario Spatial Filtering Estimated Desired Signal Undesired sound components: Sensor noise Competing

More information

Binaural Hearing. Reading: Yost Ch. 12

Binaural Hearing. Reading: Yost Ch. 12 Binaural Hearing Reading: Yost Ch. 12 Binaural Advantages Sounds in our environment are usually complex, and occur either simultaneously or close together in time. Studies have shown that the ability to

More information

REAL-TIME BROADBAND NOISE REDUCTION

REAL-TIME BROADBAND NOISE REDUCTION REAL-TIME BROADBAND NOISE REDUCTION Robert Hoeldrich and Markus Lorber Institute of Electronic Music Graz Jakoministrasse 3-5, A-8010 Graz, Austria email: robert.hoeldrich@mhsg.ac.at Abstract A real-time

More information

A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL

A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL 9th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, -7 SEPTEMBER 7 A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL PACS: PACS:. Pn Nicolas Le Goff ; Armin Kohlrausch ; Jeroen

More information

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC

More information

An analysis of blind signal separation for real time application

An analysis of blind signal separation for real time application University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2006 An analysis of blind signal separation for real time application

More information

Chapter 4 SPEECH ENHANCEMENT

Chapter 4 SPEECH ENHANCEMENT 44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or

More information

Speech Enhancement Based on Audible Noise Suppression

Speech Enhancement Based on Audible Noise Suppression IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 6, NOVEMBER 1997 497 Speech Enhancement Based on Audible Noise Suppression Dionysis E. Tsoukalas, John N. Mourjopoulos, Member, IEEE, and George

More information

Speech Enhancement Based On Noise Reduction

Speech Enhancement Based On Noise Reduction Speech Enhancement Based On Noise Reduction Kundan Kumar Singh Electrical Engineering Department University Of Rochester ksingh11@z.rochester.edu ABSTRACT This paper addresses the problem of signal distortion

More information

NOISE ESTIMATION IN A SINGLE CHANNEL

NOISE ESTIMATION IN A SINGLE CHANNEL SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina

More information

Analysis of room transfer function and reverberant signal statistics

Analysis of room transfer function and reverberant signal statistics Analysis of room transfer function and reverberant signal statistics E. Georganti a, J. Mourjopoulos b and F. Jacobsen a a Acoustic Technology Department, Technical University of Denmark, Ørsted Plads,

More information

Blind Dereverberation of Single-Channel Speech Signals Using an ICA-Based Generative Model

Blind Dereverberation of Single-Channel Speech Signals Using an ICA-Based Generative Model Blind Dereverberation of Single-Channel Speech Signals Using an ICA-Based Generative Model Jong-Hwan Lee 1, Sang-Hoon Oh 2, and Soo-Young Lee 3 1 Brain Science Research Center and Department of Electrial

More information

Audio Imputation Using the Non-negative Hidden Markov Model

Audio Imputation Using the Non-negative Hidden Markov Model Audio Imputation Using the Non-negative Hidden Markov Model Jinyu Han 1,, Gautham J. Mysore 2, and Bryan Pardo 1 1 EECS Department, Northwestern University 2 Advanced Technology Labs, Adobe Systems Inc.

More information

Enhanced Waveform Interpolative Coding at 4 kbps

Enhanced Waveform Interpolative Coding at 4 kbps Enhanced Waveform Interpolative Coding at 4 kbps Oded Gottesman, and Allen Gersho Signal Compression Lab. University of California, Santa Barbara E-mail: [oded, gersho]@scl.ece.ucsb.edu Signal Compression

More information

Mel Spectrum Analysis of Speech Recognition using Single Microphone

Mel Spectrum Analysis of Speech Recognition using Single Microphone International Journal of Engineering Research in Electronics and Communication Mel Spectrum Analysis of Speech Recognition using Single Microphone [1] Lakshmi S.A, [2] Cholavendan M [1] PG Scholar, Sree

More information

Improving Meetings with Microphone Array Algorithms. Ivan Tashev Microsoft Research

Improving Meetings with Microphone Array Algorithms. Ivan Tashev Microsoft Research Improving Meetings with Microphone Array Algorithms Ivan Tashev Microsoft Research Why microphone arrays? They ensure better sound quality: less noises and reverberation Provide speaker position using

More information

Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya 2, B. Yamuna 2, H. Divya 2, B. Shiva Kumar 2, B.

Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya 2, B. Yamuna 2, H. Divya 2, B. Shiva Kumar 2, B. www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 4 Issue 4 April 2015, Page No. 11143-11147 Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya

More information

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals

speech signal S(n). This involves a transformation of S(n) into another signal or a set of signals 16 3. SPEECH ANALYSIS 3.1 INTRODUCTION TO SPEECH ANALYSIS Many speech processing [22] applications exploits speech production and perception to accomplish speech analysis. By speech analysis we extract

More information

From Binaural Technology to Virtual Reality

From Binaural Technology to Virtual Reality From Binaural Technology to Virtual Reality Jens Blauert, D-Bochum Prominent Prominent Features of of Binaural Binaural Hearing Hearing - Localization Formation of positions of the auditory events (azimuth,

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Architectural Acoustics Session 2aAAa: Adapting, Enhancing, and Fictionalizing

More information

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2

Signal Processing for Speech Applications - Part 2-1. Signal Processing For Speech Applications - Part 2 Signal Processing for Speech Applications - Part 2-1 Signal Processing For Speech Applications - Part 2 May 14, 2013 Signal Processing for Speech Applications - Part 2-2 References Huang et al., Chapter

More information

Signals, Sound, and Sensation

Signals, Sound, and Sensation Signals, Sound, and Sensation William M. Hartmann Department of Physics and Astronomy Michigan State University East Lansing, Michigan Л1Р Contents Preface xv Chapter 1: Pure Tones 1 Mathematics of the

More information

Introduction. 1.1 Surround sound

Introduction. 1.1 Surround sound Introduction 1 This chapter introduces the project. First a brief description of surround sound is presented. A problem statement is defined which leads to the goal of the project. Finally the scope of

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES

AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-), Verona, Italy, December 7-9,2 AN AUDITORILY MOTIVATED ANALYSIS METHOD FOR ROOM IMPULSE RESPONSES Tapio Lokki Telecommunications

More information

Spectral Methods for Single and Multi Channel Speech Enhancement in Multi Source Environment

Spectral Methods for Single and Multi Channel Speech Enhancement in Multi Source Environment Spectral Methods for Single and Multi Channel Speech Enhancement in Multi Source Environment A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY by KARAN

More information

Psycho-acoustics (Sound characteristics, Masking, and Loudness)

Psycho-acoustics (Sound characteristics, Masking, and Loudness) Psycho-acoustics (Sound characteristics, Masking, and Loudness) Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University Mar. 20, 2008 Pure tones Mathematics of the pure

More information

Adaptive Filters Application of Linear Prediction

Adaptive Filters Application of Linear Prediction Adaptive Filters Application of Linear Prediction Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Technology Digital Signal Processing

More information

Chapter IV THEORY OF CELP CODING

Chapter IV THEORY OF CELP CODING Chapter IV THEORY OF CELP CODING CHAPTER IV THEORY OF CELP CODING 4.1 Introduction Wavefonn coders fail to produce high quality speech at bit rate lower than 16 kbps. Source coders, such as LPC vocoders,

More information

Microphone Array Power Ratio for Speech Quality Assessment in Noisy Reverberant Environments 1

Microphone Array Power Ratio for Speech Quality Assessment in Noisy Reverberant Environments 1 for Speech Quality Assessment in Noisy Reverberant Environments 1 Prof. Israel Cohen Department of Electrical Engineering Technion - Israel Institute of Technology Technion City, Haifa 3200003, Israel

More information

Evaluation of Audio Compression Artifacts M. Herrera Martinez

Evaluation of Audio Compression Artifacts M. Herrera Martinez Evaluation of Audio Compression Artifacts M. Herrera Martinez This paper deals with subjective evaluation of audio-coding systems. From this evaluation, it is found that, depending on the type of signal

More information

SGN Audio and Speech Processing

SGN Audio and Speech Processing Introduction 1 Course goals Introduction 2 SGN 14006 Audio and Speech Processing Lectures, Fall 2014 Anssi Klapuri Tampere University of Technology! Learn basics of audio signal processing Basic operations

More information

29th TONMEISTERTAGUNG VDT INTERNATIONAL CONVENTION, November 2016

29th TONMEISTERTAGUNG VDT INTERNATIONAL CONVENTION, November 2016 Measurement and Visualization of Room Impulse Responses with Spherical Microphone Arrays (Messung und Visualisierung von Raumimpulsantworten mit kugelförmigen Mikrofonarrays) Michael Kerscher 1, Benjamin

More information

Blind signal processing methods for microphone leakage suppression in multichannel audio applications

Blind signal processing methods for microphone leakage suppression in multichannel audio applications Blind signal processing methods for microphone leakage suppression in multichannel audio applications Elias K. Kokkinis University of Patras Department of Electrical and Computer Engineering PhD Dissertation

More information

Robust Voice Activity Detection Based on Discrete Wavelet. Transform

Robust Voice Activity Detection Based on Discrete Wavelet. Transform Robust Voice Activity Detection Based on Discrete Wavelet Transform Kun-Ching Wang Department of Information Technology & Communication Shin Chien University kunching@mail.kh.usc.edu.tw Abstract This paper

More information

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,

More information

Single Channel Speaker Segregation using Sinusoidal Residual Modeling

Single Channel Speaker Segregation using Sinusoidal Residual Modeling NCC 2009, January 16-18, IIT Guwahati 294 Single Channel Speaker Segregation using Sinusoidal Residual Modeling Rajesh M Hegde and A. Srinivas Dept. of Electrical Engineering Indian Institute of Technology

More information

ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking

ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 122 126 International Conference on Information and Communication Technologies (ICICT 2014) Unsupervised Speech

More information

Convention e-brief 310

Convention e-brief 310 Audio Engineering Society Convention e-brief 310 Presented at the 142nd Convention 2017 May 20 23 Berlin, Germany This Engineering Brief was selected on the basis of a submitted synopsis. The author is

More information

Auditory modelling for speech processing in the perceptual domain

Auditory modelling for speech processing in the perceptual domain ANZIAM J. 45 (E) ppc964 C980, 2004 C964 Auditory modelling for speech processing in the perceptual domain L. Lin E. Ambikairajah W. H. Holmes (Received 8 August 2003; revised 28 January 2004) Abstract

More information

Class Overview. tracking mixing mastering encoding. Figure 1: Audio Production Process

Class Overview. tracking mixing mastering encoding. Figure 1: Audio Production Process MUS424: Signal Processing Techniques for Digital Audio Effects Handout #2 Jonathan Abel, David Berners April 3, 2017 Class Overview Introduction There are typically four steps in producing a CD or movie

More information

Drum Transcription Based on Independent Subspace Analysis

Drum Transcription Based on Independent Subspace Analysis Report for EE 391 Special Studies and Reports for Electrical Engineering Drum Transcription Based on Independent Subspace Analysis Yinyi Guo Center for Computer Research in Music and Acoustics, Stanford,

More information

Enhancement of Speech Communication Technology Performance Using Adaptive-Control Factor Based Spectral Subtraction Method

Enhancement of Speech Communication Technology Performance Using Adaptive-Control Factor Based Spectral Subtraction Method Enhancement of Speech Communication Technology Performance Using Adaptive-Control Factor Based Spectral Subtraction Method Paper Isiaka A. Alimi a,b and Michael O. Kolawole a a Electrical and Electronics

More information

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter. Gupteswar Sahu, D. Arun Kumar, M. Bala Krishna and Jami Venkata Suman, Assistant Professor, Department of ECE,

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes, Department of Electrical and Electronic Engineering, Exhibition Road, Imperial College London,

Sound Source Localization using HRTF database. ICCAS, June, KINTEX, Gyeonggi-Do, Korea. Sungmok Hwang, Youngjin Park and Younsik Park, Center for Noise and Vibration Control, Dept. of Mech. Eng., KAIST,

Psychoacoustic Cues in Room Size Perception. Audio Engineering Society Convention Paper 6084, presented at the 116th Convention, 2004 May 8-11, Berlin, Germany. This convention paper has been reproduced from the author's advance manuscript, without editing,

Live multi-track audio recording. Joao Luiz Azevedo de Carvalho. EE522 Project, Spring 2007, University of Southern California. Abstract: In live multi-track audio recording, each microphone perceives sound

Intensity Discrimination and Binaural Interaction. 2nd semester project, DTU Electrical Engineering, Acoustic Technology, Technical University of Denmark, Spring semester 2008, Group 5, Troels Schmidt Lindgreen

INTERNATIONAL TELECOMMUNICATION UNION. ITU-T P.835, Telecommunication Standardization Sector of ITU (11/2003), Series P: Telephone Transmission Quality, Telephone Installations, Local Line Networks. Methods

Principles of Musical Acoustics. William M. Hartmann. Springer. Contents: 1 Sound, Music, and Science (1.1 The Source; 1.2 Transmission; 1.3 Receiver); 2 Vibrations (2.1 Mass and Spring; 2.1.1 Definitions

IDIAP Research Report, June, published in Interspeech 2008. Spectral Noise Shaping: Improvements in Speech/Audio Codec Based on Linear Prediction in Spectral Domain. Sriram Ganapathy, Petr Motlicek, Hynek Hermansky, Harinath

Speech Synthesis using Mel-Cepstral Coefficient Feature. Lu Wang. Senior Thesis in Electrical Engineering, University of Illinois at Urbana-Champaign. Advisor: Professor Mark Hasegawa-Johnson. May 2018. Abstract

ROOM IMPULSE RESPONSE SHORTENING BY CHANNEL SHORTENING CONCEPTS. Markus Kallinger and Alfred Mertins, University of Oldenburg, Institute of Physics, Signal Processing Group, D-26111 Oldenburg, Germany. {markus.kallinger,

Subband Analysis of Time Delay Estimation in STFT Domain. S. Wang, D. Sen and W. Lu, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, Australia. sh.wang@student.unsw.edu.au,

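The time delay estimation entry above works subband-by-subband in the STFT domain; the broadband baseline such methods are usually compared against is the generalized cross-correlation with phase transform (GCC-PHAT). A minimal sketch of that baseline, assuming simple integer delays (illustrative, not code from the cited paper):

```python
import numpy as np

def gcc_phat_delay(x, y):
    """Estimate the delay of y relative to x, in samples, via GCC-PHAT.

    The cross-spectrum is whitened so that only phase information
    remains, which sharpens the correlation peak under reverberation.
    """
    n = len(x) + len(y)                        # zero-pad against circular wrap
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = Y * np.conj(X)
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting
    cc = np.fft.irfft(cross, n)
    shift = int(np.argmax(cc))
    # Map peak positions past the midpoint to negative delays.
    return shift - n if shift > n // 2 else shift
```

For a two-microphone pair, the estimated delay divided by the sample rate gives the inter-channel time difference used for direction-of-arrival estimates.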
Speech Enhancement using Wiener filtering. S. Chirtmay and M. Tahernezhadi, Department of Electrical Engineering, Northern Illinois University, DeKalb, IL 60115. Abstract: The problem of reducing the disturbing

Auditory Based Feature Vectors for Speech Recognition Systems. Dr. Waleed H. Abdulla, Electrical & Computer Engineering Department, The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz]. Outlines

Robust Speech Recognition Based on Binaural Auditory Processing. INTERSPEECH 2017, August 20-24, 2017, Stockholm, Sweden. Anjali Menon, Chanwoo Kim, Richard M. Stern. Department of Electrical and Computer

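On the Wiener filtering entry above: in frequency-domain speech enhancement the Wiener filter reduces to a per-bin gain G = SNR/(1 + SNR) applied to the noisy spectrum. A tiny sketch of that gain rule; the power-subtraction SNR estimate and the floor value are illustrative choices, not taken from the cited work:

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, gain_floor=0.05):
    """Per-bin Wiener gain G = SNR / (1 + SNR).

    The a priori SNR is approximated by power subtraction, and the
    gain is floored to limit musical-noise artifacts.
    """
    snr = np.maximum(noisy_psd - noise_psd, 0.0) / np.maximum(noise_psd, 1e-12)
    return np.maximum(snr / (1.0 + snr), gain_floor)
```

The gain multiplies each bin of the noisy STFT frame: bins dominated by noise are attenuated toward the floor, while high-SNR bins pass almost unchanged.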
Michael Brandstein, Darren Ward (Eds.). Microphone Arrays: Signal Processing Techniques and Applications. With 149 Figures. Springer. Contents: Part I, Speech Enhancement; 1 Constant Directivity Beamforming, Darren

Perception of pitch. AUDL4007: 11 Feb 2010. A. Faulkner. See Moore, BCJ, Introduction to the Psychology of Hearing, Chapter 5, or Plack, CJ, The Sense of Hearing, Lawrence Erlbaum, 2005, Chapter 7. Definitions

RASTA-PLP SPEECH ANALYSIS. Hynek Hermansky, Nelson Morgan, Aruna Bayya, Phil Kohn. TR-91-069, December 1991. Abstract: Most speech parameter estimation techniques are easily influenced by the frequency response

Monaural and Binaural Speech Separation. DeLiang Wang, Perception & Neurodynamics Lab, The Ohio State University. Outline of presentation: Introduction; CASA approach to sound separation; Ideal binary mask as

On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering. Nikolaos Dionelis, https://www.commsp.ee.ic.ac.uk/~sap/people-nikolaos-dionelis/, nikolaos.dionelis11@imperial.ac.uk,

Detection, Interpolation and Cancellation Algorithms for GSM burst Removal for Forensic Audio. Joerg Bitzer and Jan Rademacher (Paper Nr. 21). Abstract: One increasing problem for

Distance Estimation and Localization of Sound Sources in Reverberant Conditions using Deep Neural Networks. Mariam Yiwere and Eun Joo Rhee. Department of Computer Engineering, Hanbat National University,

Overview of Code Excited Linear Predictive Coder. Minal Mulye (PG Student) and Sonal Jagtap (Assistant Professor), Department of E&TC, Smt. Kashibai Navale College of Engg, Pune, India. Abstract: Advances

Perception of pitch. BSc Audiology/MSc SHS Psychoacoustics, wk 4: 7 Feb 2008. A. Faulkner. See Moore, BCJ, Introduction to the Psychology of Hearing, Chapter 5, or Plack, CJ, The Sense of Hearing, Lawrence Erlbaum,

Multiple Sound Sources Localization Using Energetic Analysis Method. Vol. 3, No. 4, December. Hasan Khaddour, Jiří Schimmel, Department of Telecommunications, FEEC, Brno University of Technology, Purkyňova

Envelopment and Small Room Acoustics. David Griesinger, Lexicon, 3 Oak Park, Bedford, MA 01730. Copyright 9/21/00 by David Griesinger. Preview of results: Loudness isn't everything! At least two additional perceptions:

Reducing comb filtering on different musical instruments using time delay estimation. Alice Clifford and Josh Reiss, Queen Mary, University of London. alice.clifford@eecs.qmul.ac.uk. Abstract: Comb filtering

The Human Auditory System. [Figure labels: medial geniculate nucleus, primary auditory cortex, inferior colliculus, superior olivary complex, cochlea.] Prominent Features of Binaural Hearing: Localization; Formation of positions

INFLUENCE OF FREQUENCY DISTRIBUTION ON INTENSITY FLUCTUATIONS OF NOISE. Pierre Hanna, SCRIME, LaBRI, Université de Bordeaux 1, F-33405 Talence Cedex, France. hanna@labri.u-bordeaux.fr. Myriam Desainte-Catherine

Sound Synthesis Methods. Matti Vihola, mvihola@cs.tut.fi, 23rd August 2001. Objectives: The objective of sound synthesis is to create sounds that are musically interesting, preferably realistic (sounds like

Practical Applications of the Wavelet Analysis. M. Bigi, M. Jacchia, D. Ponteggia. ALMA International Europe (Frankfurt). Summary: Impulse and Frequency Response; Classical Time and Frequency Analysis

Publication III. Hirvonen, T., Segregation of Two Simultaneously Arriving Narrowband Noise Signals as a Function of Spatial and Frequency Separation, in Proceedings of th International Conference on. © 2005 Toni Hirvonen.

Advances in Direction-of-Arrival Estimation. Sathish Chandran, Editor. Artech House, Boston, London, artechhouse.com. Contents: Preface; Acknowledgments; Overview; Chapter 1, Antenna Arrays for Direction-of-Arrival

SOUND QUALITY EVALUATION OF FAN NOISE BASED ON HEARING-RELATED PARAMETERS. Roland Sottek, Klaus Genuit, HEAD acoustics GmbH, Ebertstr. 30a, 52134 Herzogenrath, Germany. Summary: Sound quality evaluation of

Robust Speech Recognition Based on Binaural Auditory Processing. Anjali Menon, Chanwoo Kim, Richard M. Stern. Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh,

SUPERVISED SIGNAL PROCESSING FOR SEPARATION AND INDEPENDENT GAIN CONTROL OF DIFFERENT PERCUSSION INSTRUMENTS USING A LIMITED NUMBER OF MICROPHONES. S. F. Minhas, A. Barton, P. Gaydecki, School of Electrical and

State of art and Challenges in Improving Speech Intelligibility in Hearing Impaired People. Stefan Launer, Phonak AG, Stäfa, CH. Lyon, January 2011.

Perceptual Speech Enhancement Using Multi_band Spectral Attenuation Filter. Sana Alaya, Novlène Zoghlami and Zied Lachiri, Signal, Image and Information Technology Laboratory, National Engineering School

Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta, SHBT 03. Research Advisor: Thomas F. Quatieri. Speech and Hearing Biosciences and Technology. Summary: Studied

SINGLE CHANNEL REVERBERATION SUPPRESSION BASED ON SPARSE LINEAR PREDICTION. Nicolás López, Yves Grenier, Gaël Richard, Ivan Bourmeyster. Arkamys, rue Pouchet, Paris, France; Institut Mines-Télécom,

SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS. 17th European Signal Processing Conference (EUSIPCO 2009), Glasgow, Scotland, August 24-28, 2009. Jürgen Freudenberger, Sebastian Stenzel, Benjamin Venditti

A Digital Signal Processor for Musicians and Audiophiles. Published on Monday, 09 February 2009, 09:54. The main focus of hearing aid research and development has been on the use of hearing aids to improve

Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis. Mohini Avatade & S. L. Sahare, Electronics & Telecommunication Department, Cummins

Digital Signal Processing, Fourth Edition. John G. Proakis, Department of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts; Dimitris G. Manolakis, MIT Lincoln Laboratory, Lexington,

Making Music with Tabla Loops. Executive Summary: What are Tabla Loops; Tabla Introduction; How Tabla Loops can be used to make good music; Steps to making good music (I. Getting the good rhythm; II. Loading

SGN 14006 Audio and Speech Processing. Introduction. Course goals: learn basics of audio signal processing; basic operations and their underlying ideas and principles; give basic skills although

EE482: Digital Signal Processing Applications. Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu. Spring 2014, TTh 14:30-15:45, CBC C222. Lecture 12: Speech Signal Processing, 14/03/25. http://www.ee.unlv.edu/~b1morris/ee482/