Single Channel Speech Enhancement in Severe Noise Conditions

This thesis is presented for the degree of Doctor of Philosophy in the School of Electrical, Electronic and Computer Engineering, The University of Western Australia.

By Dariush Farrokhi (Bachelor of Electronic Engineering, Master of Computer Engineering (Electronic) by research)

Date: 30/09/2010 (revised 14/09/2011)

Declaration

I declare that this thesis is my own account of my research and contains as its main content work that has not previously been presented at any tertiary educational institution.

Dariush Farrokhi
Perth, Western Australia

Abstract

Single Channel Non-Stationary Noise Speech Enhancement (SCNSNSE) algorithms can be used in many applications, including enhancement of pre-recorded speech, hearing aid devices, speech recognition and telecommunication equipment. Many organizations, such as medical, aviation and local or federal police services, are interested in having access to algorithms that can improve noisy speech signals.

A combined set of existing and new algorithms was uniquely put together to produce a SCNSNSE system architecture. This novel SCNSNSE architecture produces improved speech enhancement at low SNR (below 0 dB). It contains novel algorithms at the pre- and post-processing stages and enhances speech that is contaminated with highly non-stationary noise.

The SCNSNSE architecture consists of three major layers. Layer one consists of the spectrum estimation (pre-processing) of the SCNSNSE system. At this layer the concept of narrow-band variation reduction is explained. The novel Controlled Forward Moving Average (CFMA) algorithm greatly improves the reduction of narrow-band variation, and is strategically placed in the SCNSNSE architecture to provide a better outcome. At the spectrum estimation layer a combination of existing algorithms cascaded with a new algorithm is applied, as described below:

1. The Discrete Prolate Spheroidal Sequence (DPSS) multi-taper algorithm.
2. The Controlled Forward Moving Average (CFMA) algorithm.
3. Stein's Unbiased Risk Estimator (SURE) wavelet thresholding.

The second layer consists of the noise estimation (post-processing) algorithms. During post-processing the concepts of wide-band variation estimation, Frequency Threshold Mapping (FTM) and Multi-Channel Threshold Mapping (MCTM) are introduced, and the noise estimation algorithms for the SCNSNSE system are improved. At this layer a distinctive FTM algorithm is introduced that increases the accuracy of the noise estimation. This original noise estimation algorithm adapts to the rapid changes of the noisy speech signal in each sub-band and hence reduces the under-estimation issue that most SCNSNSE systems suffer from, improving the efficiency of the noise estimation algorithms.

The final layer (speech enhancement) of the SCNSNSE architecture consists of speech enhancement algorithms. This layer subtracts the estimated noise signal from the original noisy speech and adds back the original phase information.

Both qualitative and quantitative experiments have shown that these unique algorithms as a whole offer a more advanced system that removes non-stationary noise more intelligently than previously suggested in the literature.

Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgement
List of Publications

Chapter 1: Introduction
    1.1 Overview
    1.2 Human speech production system
    1.3 Human voice perception system
    1.4 Application
    1.5 Conventional speech enhancement systems
    1.6 Some of the new trends in speech enhancement systems
    1.7 Aims and objectives (Scope)
    1.8 Contribution of thesis
    1.9 Thesis organization

Chapter 2: Current Research on Speech Enhancement
    2.1 Overview
    2.2 Current speech enhancement algorithms
        SE based on subtractive-type algorithms
        Suppression algorithm
        Generalized spectral subtraction
        Signal preconditioning to remove musical noise
        Human perception of speech
    2.3 Computational auditory analysis
        Background segregation and Harmonic-Temporal Clustering (HTC)
    2.4 Estimating the background noises
        Spectral minima tracking in sub-bands
        Optimal smoothing and minimum statistics
    2.5 Noise estimation with rapid adaptation techniques
    2.6 Evolution of current single channel NSNSE techniques on the NE and noise reduction

Chapter 3: Speech Enhancement Fundamental Algorithms
    3.1 Overview
    3.2 Windowing technique
    3.3 DFT and IDFT technique
    3.4 Autocorrelation coefficient and power spectrum
    Spectral subtraction algorithm
    Objective measure based on Signal to Noise Ratio (SNR)
    Objective measure based on the PESQ test
    The AR technique for evaluating spectrum estimation
    Synopsis

Chapter 4: A Novel SCNSNSE System with Controlled Forward March Averaging (CFMA) Algorithm
    Proposed speech enhancement system
    Overview
    Review of spectrum estimation (pre-processing layer)
    The DPSS multi-taper spectrum algorithm
    Wavelet thresholding
    A new algorithm to improve the spectrum estimation layer
    Controlled Forward March Averaging (CFMA) to smooth the power spectrum
    Noise estimation layer
    Speech absent noise estimation
    Speech present noise estimation
    Post-processing or enhancement layer
    Synopsis

Chapter 5: The Second Innovative Improvement to the SCNSNSE Architecture
    Improving the noise estimation layer using multi-channel threshold mapping
    Speech absent noise calculation
    Speech present noise calculation
    Synopsis

Chapter 6: Performance Evaluation of the Proposed SCNSNSE System
    Overview
    Evaluation strategy
    The CFMA and the 9-channel FTM algorithm performance evaluations
    Spectrum estimation of clean AR data
    Global SNR, segmental SNR and PESQ evaluations of noisy speech
    Synopsis

Chapter 7: Conclusion and Future Work
    Recommendations and future work

References

List of Figures

Figure 1: Human hearing anatomy [1].
Figure 2: Audible range of the human ear [1].
Figure 3: Simple spectral subtraction block diagram.
Figure 4: Graphical illustration of the AR4 creation used in the SCNSNSE system.
Figure 5: Average spectrum of a few non-stationary noise types [54].
Figure 6: The SCNSNSE system and the algorithm applied in each building block.
Figure 7: Enhanced SCNSNSE system including the CFMA algorithm.
Figure 8: Average power spectrum of crowd noise, more commonly known as babble noise.
Figure 9: Average power spectrum of factory floor noise.
Figure 10: Enhanced SCNSNSE system.
Figure 11: An example of the shaped babble noise.
Figure 12: Comparison of the power spectrum of a white-noise-excited AR4 process estimated by the direct Hamming window method (top panel); the DPSS MT method (N = 2048, L = 5) (second panel); the DPSS MT (N = 2048, L = 5) and the CFMA method with j = 2 in (37) (third panel); and the DPSS, the CFMA (j = 2 in (37)) and segmental SURE wavelet applied all together (bottom panel).
Figure 13: Comparison of the power spectrum of a white-noise-excited AR4 process estimated by the direct Hamming window method (top panel); the DPSS MT method (N = 2048, L = 5) (second panel); the DPSS MT (N = 2048, L = 5) and the CFMA method with j = 15 in (37) (third panel); and the DPSS, the CFMA (j = 15 in (37)) and segmental SURE wavelet applied all together (bottom panel).
Figure 14: Graphical representation of comparative results, in terms of absolute PESQ output values.
Figure 15: Graphical representation of comparative results, in terms of absolute GSNR output values.
Figure 16: Graphical representation of comparative results, in terms of absolute SSNR output values.
Figure 17: Graphical representation of comparative results, in terms of PESQ gains.
Figure 18: Graphical representation of comparative results, in terms of GSNR gains.
Figure 19: Graphical representation of comparative results, in terms of SSNR gains.
Figure 20: Graphical representation of the PESQ performance of different algorithms, in terms of % improvement over the SS algorithm.
Figure 21: Graphical representation of the GSNR performance of different algorithms, in terms of % improvement over the SS algorithm.
Figure 22: Graphical representation of the SSNR performance of different algorithms, in terms of % improvement over the SS algorithm.
Figure 23: Graphical representation of the comparative performance in terms of the absolute PESQ values.
Figure 24: Graphical representation of the comparative performance in terms of the absolute global SNR values.
Figure 25: Graphical representation of the comparative performance in terms of the absolute segmental SNR values.
Figure 26: Graphical representation of comparative improvement in terms of the PESQ gains.
Figure 27: Graphical representation of comparative improvement in terms of the GSNR gains.
Figure 28: Graphical representation of comparative improvement in terms of the SSNR gains.
Figure 29: Graphical representation of comparative improvement in terms of the % PESQ improvement over the SS algorithm.
Figure 30: Graphical representation of comparative global SNR gain improvement, in percentage terms, over the SS algorithm.
Figure 31: Graphical representation of comparative segmental SNR gain improvement, in percentage terms, over the SS algorithm.

List of Tables

Table 1: Mean square error test results for the CFMA algorithm.
Table 2: Comparative performance of different algorithms, in terms of absolute PESQ output values.
Table 3: Comparative performance of different algorithms, in terms of absolute GSNR output values.
Table 4: Comparative performance of different algorithms, in terms of absolute segmental SNR outputs.
Table 5: Comparative performance of different algorithms, in terms of the PESQ gain.
Table 6: Comparative performance of different algorithms, in terms of the global SNR gain.
Table 7: Comparative performance of different algorithms, in terms of the segmental SNR gain.
Table 8: Comparative PESQ performance of different algorithms, in terms of % improvement over the SS algorithm.
Table 9: Comparative GSNR performance of different algorithms, in terms of % improvement over the SS algorithm.
Table 10: Comparative SSNR performance of different algorithms, in terms of percentage improvement over the SS algorithm.
Table 11: Comparative performance of different algorithms, in terms of the absolute PESQ output values.
Table 12: Comparative performance of different algorithms, in terms of the absolute global SNR values.
Table 13: Comparative performance of different algorithms, in terms of the absolute segmental SNR values.
Table 14: Comparative performance of different algorithms, in terms of the PESQ gains.
Table 15: Comparative performance of different algorithms, in terms of the global SNR gain.
Table 16: Comparative performance of different algorithms, in terms of the segmental SNR gain.
Table 17: Comparative performance of different algorithms, in terms of the % PESQ improvement over the SS algorithm.
Table 18: Comparative global SNR gain of different algorithms, in terms of % improvement over the SS algorithm.
Table 19: Comparative segmental SNR gain of different algorithms, in terms of % improvement over the SS algorithm.

Acknowledgement

There were a few people who patiently helped and guided me on my research path, and I would like to thank them for their idealism and respect. At the university, these people are my supervisor Dr Roberto Togneri and Professor Anthony Zaknich. I would also like to thank Dr Seow Yong Low from the Australian Telecommunication Research Institute (ATRI) for extra information and opinions on this field. At home, I would like to thank my wife for patiently filling in for me with regard to house duties and children's responsibilities. At work, I would like to thank my employer and colleagues who supported me and helped me on this journey.

List of Publications

1. Farrokhi D., Togneri R., Zaknich A., "Speech Enhancement of Non-stationary Noise Based on Controlled Forward Moving Average," in Proc. IEEE Int. Symposium on Communication and Information Technology, October.
2. Farrokhi D., Togneri R., Zaknich A., "Single Channel Speech Enhancement using a 9 Dimensional Noise Estimation Algorithm and Controlled Forward March Averaging," in Proc. International Conference on Signal Processing, vol. 1, October.

Chapter 1

Introduction

1. Introduction

1.1 Overview

To provide a good speech enhancement system, it is essential to know how the human vocal and hearing systems work. The human speech production system is reasonably well understood and will be discussed briefly in this section; the human voice perception system, however, is not sufficiently understood. In the next few sections each part of the human vocal and hearing systems is discussed separately, and current speech enhancement research is then introduced in the following chapter.

1.2 Human speech production system

Human speech production is complex. Speech information travels through air as a series of longitudinal waves. The speaker's mouth excites adjacent air molecules after a sequence of coordinated movements of the human vocal system. The dynamics and production of human sound, known as phonetics, are well understood, although the process of speech perception is not fully known. The human vocal mechanism is made up of the lungs, trachea (windpipe), larynx, pharyngeal cavity (throat), oral cavity (mouth), nasal cavity, velum (soft palate), tongue, jaw, teeth and lips [1]. The following integrated systems produce speech:

1. The lungs act as the energy source and blow air into the trachea.
2. As the airflow passes upwards toward the mouth, the larynx introduces a periodic excitation to the system to provide the voiced sounds. The lungs, trachea and larynx form the main acoustic filter that creates the speech waveform.
3. Finally, the articulators, which are made up of the lips, tongue, jaw, teeth and velum, provide the last changes to the acoustic waveform.

The human speech system generates speech that can be classified into voiced, unvoiced, mixed, plosive, whisper and silence. A particular sound type can also be produced by an aggregation of the above sounds. The linguistic term phoneme is often used to describe any particular speech sound; as an example, American English has 42 different phonemes. Each person's vocal system varies from others', and the frequency at which the vocal cords vibrate is called the fundamental frequency or pitch of the speech. This pitch frequency depends on the length and shape of the vocal cords, and is usually between 50 Hz and 600 Hz in humans.

1.3 Human voice perception system

The human voice perception system is not fully understood; however, the following facts are known. The anatomy of human hearing is shown in Figure 1 [1].

Figure 1: Human hearing anatomy [1].

Sounds consist of waves which oscillate at different frequencies. Frequency is one of the properties of sound or noise: a sound with a high frequency is said to be high-pitched and a sound with a low frequency is low-pitched. As depicted in Figure 2, the frequency range that the human hearing system can perceive is between 20 Hz and 20 kHz.

The human hearing system requires more audio pressure to perceive low-frequency speech than speech in the high-frequency range. Human hearing perception has a peak response around 1000 to 4000 Hz and a relatively low response at lower frequencies. In other words, even at the same pressure level a sound within 1000 to 4000 Hz will appear louder to the human ear than a sound with a lower frequency of 50 Hz [1].

Figure 2: Audible range of the human ear [1].

The hearing threshold is the weakest sound pressure that the human ear can detect. Since the ear response depends on the sound's frequency content, the threshold of hearing differs for sounds of different frequencies. The lowest curve (darker shape) in Figure 2 gives the hearing threshold for various frequencies [1].

These characteristics, as simple as they seem, have a great impact on applications of the speech enhancement architecture. In the following chapters reference is made to these characteristics, and it is explained how research has adopted these attributes to build better speech enhancement systems.

Although we know something of the mechanics and physiology of our outer, middle and inner hearing systems, we do not fully understand how human voice perception works when there is a lot of background noise, for example in the case of babble noise. This concept is becoming more important as the human population increases and mobile phones are used in many unconventional public places such as bus stations, train stations and airports.

1.4 Application

Speech enhancement applies to many facets of life, spanning from the communication industry to international security and human-computer interaction. Speech enhancement algorithms are widely studied by researchers to improve the quality of noisy speech in telecommunication and voice recognition systems. The following is a list of other possible application areas for Speech Enhancement (SE) systems:

- Communication devices such as half/full duplex voice and roaming mobiles.
- Hearing aid equipment.
- Spy hearing equipment or eavesdropping devices.
- Wireless sensor networks.
- Speech recognition devices.
- Recorded speech.
- Federal and local police investigations on recorded speech.

As will be shown in detail in the following sections, SE algorithms are important to improving the quality of speech communications in society today. In particular, when SE of non-stationary noise matures it will impact many facets of technology in human society. This is true not only for Multi-Channel Speech Enhancement (MCSE) but also for Single Channel Speech Enhancement (SCSE) systems. Currently the most effective speech enhancement for any single channel recorded voice material is via a single channel speech enhancement system.

1.5 Conventional speech enhancement systems

Most conventional single channel speech enhancement systems remove only stationary noise satisfactorily. When the noise is non-stationary, most conventional single channel SE systems fail.

Conventional systems have also used a Voice Activity Detector (VAD) to distinguish between speech and non-speech segments in order to estimate the noise. The VAD approach has proven not to be flexible and was therefore not utilized for robust speech enhancement in this system [2].

1.6 Some of the new trends in speech enhancement systems

Although current speech enhancement systems encompass a wide spectrum of approaches, most of them share certain common characteristics. The following is a summary of these characteristics:

1. It is assumed that short time segments of speech signals are stationary.
2. Most current speech enhancement is carried out in the frequency domain.
3. It is assumed that the additive noise does not affect the speech phase information (strictly speaking this is not true, but it is common practice to make this assumption).
4. Most Single Channel Speech Enhancement (SCSE) systems have a dedicated preconditioning algorithm to smooth the signal.
5. Most speech processing systems use a dedicated filtering or subtraction algorithm to remove estimated noise from the noisy speech signal.
6. The new state-of-the-art SE systems do not use a Voice Activity Detector but a soft-threshold method to detect speech presence.

The details of current SCSE and MCSE systems will be discussed in the subsequent sections. Non-stationary single channel SE is a challenging and difficult problem, and as will be shown, this research has provided an original SE architecture with several original algorithms to enhance speech data.

1.7 Aims and objectives (Scope)

Speech enhancement under various non-stationary noise environments is a complex task and is therefore divided into specialized areas of study. Two major specialized areas of study are:

- Multi-channel or microphone array speech enhancement, and
- Single channel speech enhancement.

Each of the above areas addresses speech enhancement for stationary and non-stationary noisy environments. The scope of this study has been restricted to research on non-stationary noise within the Single Channel Speech Enhancement system. Multi-Channel Speech Enhancement, such as microphone arrays, has been excluded from the scope of this research. Having said this, it should also be mentioned that all the algorithms that apply to SCSE could also apply to MCSE.

The main non-stationary noises under investigation in our study are the babble and factory floor noise types. These noise types are more characteristic of real-world non-stationary noise than the usual white noise.

1.8 Contribution of thesis

In this thesis a new architecture for the Single Channel Non-Stationary Noise Speech Enhancement (SCNSNSE) system is proposed. This architecture is defined and explained in chapters 4 and 5. It is designed to enhance noisy speech where the speech is dominated by noise and the Signal to Noise Ratio (SNR) is below 0 dB. These enhancements were addressed in two different layers of the SCNSNSE architecture with novel algorithms. The two key innovations of this thesis are described as follows.

The first contribution in the proposed architecture deals with narrow-band frequency variations. The algorithm that smooths the narrow-band frequency variations is called Controlled Forward March Averaging (CFMA). The CFMA algorithm (Farrokhi [4]) works in conjunction with the Discrete Prolate Spheroidal Sequence (DPSS) multi-taper algorithm (Hu [3]) and Stein's Unbiased Risk Estimator (SURE) wavelet thresholding from Donoho [5]. These concepts are explained in chapter 4.

The second novel algorithm addresses the wide-band frequency variations. This algorithm is called the Frequency Threshold Mapping (FTM) or Multi-Channel Threshold Mapping (MCTM) algorithm [6], and is defined and elaborated on in chapter 5. The FTM increased the accuracy of the noise estimation since it mapped the characteristics of the noise more accurately, and therefore noise was removed with a greater degree of accuracy.

These two novel algorithms (the CFMA and the FTM) increased the segmental and global signal to noise ratios as well as the intelligibility of the output speech. Furthermore, the proposed algorithms are robustly evaluated in the most severe noise conditions, using a uniquely shaped noise profile to render highly volatile non-stationary impulse noise interference and by considering performance in the 0 dB and lower SNR ranges.

1.9 Thesis organization

This introduction is followed by a general discussion of current speech enhancement research in Chapter 2. Chapter 3 describes the fundamental algorithms applied in speech enhancement systems; this discussion leads to the presentation of the first SCNSNSE architecture. Chapter 4 introduces the SCNSNSE architecture; in this chapter the CFMA algorithm is explained within the single channel non-stationary noise speech enhancement system architecture. Chapter 5 provides insight into the FTM and MCTM algorithms. Chapter 6 provides a comprehensive discussion and interpretation of the performance evaluation of the proposed algorithms. Chapter 7 is dedicated to conclusions, summary and future work prospects.

Chapter 2

Current Research on Speech Enhancement

2 Current Research on Speech Enhancement

2.1 Overview

In the previous chapter the importance of SE in non-stationary noisy environments was emphasized. This chapter follows up with a more detailed look at current research in the field. In the past decade great progress has been made in Speech Recognition (SR) and SE systems. However, these successes have mostly been in stationary noisy environments (most being statistical methods which assume Gaussian distributions for the noise). Most SE and SR systems fail when the non-stationary noise power changes rapidly. A good SE algorithm should have the following characteristics:

1. It should be able to enhance speech regardless of the type of background noise (stationary or non-stationary).
2. It should cause minimum distortion to the speech information.
3. It should not introduce any noise, e.g. musical noise.
4. The algorithm should be simple to implement, and
5. it should have a low CPU cost to perform the calculations and produce results.

Removal of non-stationary noise from noisy speech requires much more improvement, particularly in the vicinity of 0 dB SNR and below. Currently only a few researchers have produced non-stationary SE systems [7][8][9][10][11][12], and these SE systems have not been proven robust under different non-stationary noise environments. Non-Stationary Noisy Speech Enhancement (NSNSE) systems have shown some challenging problems compared to Stationary Noisy SE (SNSE) systems.

An example of a non-stationary speech enhancement algorithm is given by A. Mouchtaris [14]; however, his proposed algorithms assume clean speech is available through training, so the system has prior knowledge of the clean speech. The problem with these techniques is that they only apply to certain applications and

exclude many others that cannot train the system. Even those applications that can train the system under a controlled environment will produce annoyance for the user, and training takes valuable time and is therefore less attractive. Moreover, there are situations where training data is not even available; in this category are applications such as speech enhancement of pre-recorded speech data and espionage equipment. The research undertaken here is based on the assumption that no knowledge of clean speech exists.

For Non-Stationary Noise Speech Enhancement (NSNSE) with no prior knowledge of clean speech, there are two major speech enhancement architectures currently being researched:

1. The Multiple Channel Speech Enhancement (MCSE) system, also called the Microphone Array Speech Enhancement (MASE) system [14][15][16].
2. The Single Channel Speech Enhancement (SCSE) system, also called the Single Microphone Speech Enhancement (SMSE) system [14][17].

The primary goal of both systems is to estimate the clean speech by forming accurate estimates of the non-stationary noise and then enhancing the noisy speech signal based on the estimated noise information. During the SE process no processing noise or distortion to the original speech should be allowed, or at least it should be minimized.

In the case of the MCSE system, multiple input microphones are used to discriminate between the speech and background noise. The microphones are strategically positioned in the device so that separation of background noise and speech is achieved by utilizing the spatial dimension. The delays created by the speed of sound in the spatial dimension are then used by the system to enhance the speech signal. Techniques such as post-filter algorithms are used to distinguish between noise sources and speech sources. Microphone array post-filters have demonstrated their ability to greatly reduce noise at the output of a beamformer [13]. However, some of the current techniques only consider a single source of interest, most of the time assuming stationary background noise. Current research in this field proposes a microphone array post-filter that enhances the signals produced by the separation of simultaneous sources using common source separation algorithms. This method is based on a loudness-domain

optimal spectral estimator and on the assumption that the noise can be described as the sum of a stationary component and a transient component that is due to leakage between the channels of the initial source separation algorithm [15].

In the case of SCSE the problem has a different perspective and is much more challenging, as the speech and noise sources are already mixed and collected from a single source. This means the noise has to be estimated from an already mixed noisy speech signal. The problem is exacerbated when applied to non-stationary noisy speech at low SNR. Progress in devising better algorithms for SCNSNSE has been limited. For these reasons the SCSE research field in a non-stationary noisy environment has been chosen as the subject of study.

Almost all SCSE systems employ either a subtractive-type algorithm, such as segmental spectral subtraction, or a Wiener filter algorithm. Most SCSE systems do not consider the phase of the noise signal, since the human ear is largely insensitive to phase. The phase information of the noisy speech signal is usually separated at the beginning of the process and added to the enhanced signal at the output of the system. The remaining parts of current SCSE systems can be summarized as follows:

1. A spectral analysis to pre-condition the noisy speech signal,
2. a background noise Power Spectral Density (PSD) estimation algorithm and,
3. a spectral gain computation method that also includes a spectral subtraction algorithm.

There are quite a few research groups around the globe actively working on SCSE algorithms. The research can be categorized into the following classes based on their different focus:

- Removing or reducing introduced processing noise (i.e. musical or harmonic noise) in a spectral analysis/synthesis stage of the SE system, known as the pre-conditioning and smoothing algorithm [3][4][18].
- Computational Auditory Scene Analysis (CASA), a new method to mimic the human auditory system.
- Estimating the background noise in order to enhance the original noisy signal [19][20][21][22].

- Estimating the background noise in order to more accurately estimate the original noise [23][48][7].

In the following sections these categories are discussed in more detail and their proposed behavior and characteristics are expanded upon.

2.2 Current speech enhancement algorithms

The following are the four dominant research ideas in the SE field:

1. SE based on subtractive-type algorithms, discussed in section 2.2.1.
2. A general SE algorithm using preconditioning of the speech signal to remove introduced noise, as proposed by Y. Hu [3]; one example of such introduced noise is the so-called musical noise. This is discussed in the subsection on signal preconditioning below.
3. A SE algorithm based on the masking property of the human auditory system, using high frequency regions to estimate the noise, as proposed by N. Virag [17]. This is discussed later in this chapter.
4. A SE algorithm based on the masking property of the human auditory system, utilizing low frequency regions and a subtractive-type algorithm [11]. This is also discussed later in this chapter.

2.2.1 SE based on subtractive-type algorithms

The basic model of a signal corrupted by additive stationary background noise is:

$$y(m) = s(m) + n(m) \qquad (1)$$

where s(m) is the speech signal and n(m) denotes the noise signal. It is assumed that speech and noise are uncorrelated. Conversion to the frequency domain and any further processing is predominantly done one frame at a time; these frames are typically 256 or 512 samples each. In the frequency

domain, the noise spectrum magnitude is estimated during the speech pauses [17]. The Power Spectral Subtraction (PSS) is then expressed as:

$$|\hat{S}(\omega)|^2 = \begin{cases} |Y(\omega)|^2 - |\hat{N}(\omega)|^2, & \text{if } |Y(\omega)|^2 > |\hat{N}(\omega)|^2 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $|\hat{N}(\omega)|^2$ denotes the noise power spectrum estimate. During this process the phase of the noisy speech is not modified. Once the subtraction is computed in the spectral domain with (2), the enhanced speech signal is obtained as:

$$\hat{s}(n) = \mathrm{IFFT}\left[\, |\hat{S}(\omega)| \, e^{j \arg Y(\omega)} \right] \qquad (3)$$

Suppression algorithm

The suppression algorithm suppresses the noise with a time-varying gain function, similar to the subtraction algorithm. This process resembles filtering, where the noisy speech is de-noised by filtering algorithms. As indicated previously, suppression algorithms are very similar to subtractive-type algorithms. In fact, noise suppression becomes a multiplication of the short-time spectral magnitude (frame length) of the noisy speech by a gain function defined as:

$$\hat{S}(\omega) = G(\omega) \, Y(\omega), \quad \text{where } 0 \le G(\omega) \le 1 \qquad (4)$$

The filter for the PSS corresponding to (4) is given by:

$$G(\omega) = \sqrt{1 - \frac{|\hat{N}(\omega)|^2}{|Y(\omega)|^2}} \qquad (5)$$

The gain is set to $G(\omega) = 0$ if $|\hat{N}(\omega)|^2 > |Y(\omega)|^2$ to ensure that the gain is always a real number. The value of the gain function changes between the speech-dominated and noise-dominated parts of the noisy speech to ensure maximum enhancement. Sections or frames containing only speech are unmodified ($G(\omega) = 1$), while sections containing only noise are suppressed ($G(\omega) = 0$) [17]. A minimal sketch of this frame-wise subtraction and its equivalent gain follows.
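The following is a small numpy sketch of the power spectral subtraction of eqs. (2)-(5), assuming a pre-windowed frame and a noise PSD estimate obtained during speech pauses; frame handling and the noise estimator itself sit outside the sketch.

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_psd):
    """Enhance one windowed frame by power spectral subtraction, eqs. (2)-(3)."""
    Y = np.fft.rfft(noisy_frame)                    # noisy spectrum Y(w)
    S2 = np.abs(Y) ** 2 - noise_psd                 # |Y|^2 - |N_hat|^2, eq. (2)
    S2 = np.maximum(S2, 0.0)                        # the "0 otherwise" branch of eq. (2)
    S_hat = np.sqrt(S2) * np.exp(1j * np.angle(Y))  # reattach the noisy phase, eq. (3)
    return np.fft.irfft(S_hat, n=len(noisy_frame))

def pss_gain(Y_power, noise_psd):
    """Equivalent suppression gain of eqs. (4)-(5), clipped so it stays real."""
    return np.sqrt(np.maximum(1.0 - noise_psd / np.maximum(Y_power, 1e-12), 0.0))
```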

Between these two extreme cases, the gain function takes a value depending on the a posteriori SNR, defined by:

$$\mathrm{SNR}_{\mathrm{post}}(\omega) = \frac{|Y(\omega)|^2}{|\hat{N}(\omega)|^2} \qquad (6)$$

Each section of the noisy speech attracts a different gain function corresponding to a given subtraction rule. To fine-tune the subtraction rule, a suppression curve can also be used. This representation provides the attenuation as a function of the SNR, which can easily be measured on the noisy speech. The attenuation ranges from maximal attenuation down to 0 dB (no processing). Each subtraction rule can be derived from various criteria [24]. Parametric algorithms provide more flexibility in the variation of the suppression curves.

Generalized spectral subtraction

A combination of the proposals by Lim et al. [24] and Berouti et al. [25] leads to a new and flexible gain function [17] defined by:

$$G(\omega) = G[\mathrm{SNR}_{\mathrm{post}}(\omega)] = \begin{cases} \left( 1 - \alpha \left[ \dfrac{|\hat{N}(\omega)|}{|Y(\omega)|} \right]^{\gamma_1} \right)^{\gamma_2}, & \text{if } \left[ \dfrac{|\hat{N}(\omega)|}{|Y(\omega)|} \right]^{\gamma_1} < \dfrac{1}{\alpha + \beta} \\[2ex] \left( \beta \left[ \dfrac{|\hat{N}(\omega)|}{|Y(\omega)|} \right]^{\gamma_1} \right)^{\gamma_2}, & \text{otherwise} \end{cases} \qquad (7)$$

This is a very flexible subtractive-type algorithm. It allows for a best fit between the following results:

1. noise reduction,
2. residual noise and,
3. speech distortion (i.e. over-subtraction).

By manipulating the parameters α (α > 1), β (0 ≤ β ≤ 1) and γ = γ₁ = 1/γ₂, the level of spectral subtraction applied by the algorithm is modified. With increasing values of the factor α (α > 1), over-subtraction occurs, which means the short-time spectrum is attenuated more than it needs to be. This causes increased audible distortion. With increasing values of the parameter β (0 ≤ β ≤ 1), spectral flooring occurs, which raises the background noise level while reducing the residual noise. The exponent γ = γ₁ = 1/γ₂ affects the sharpness of the transition in the gain from G(ω) = 1 (the spectral component is not modified) to G(ω) = 0 (the spectral component is suppressed). At low SNR, calculating optimal values for the subtraction parameters α, β and γ is a very challenging task, as speech distortion and residual noise must be minimized at the same time. A small numerical sketch of the gain in (7) follows.
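As a concrete illustration, here is a minimal numpy sketch of the gain in (7); the parameter values are arbitrary examples, not values recommended by the cited works.

```python
import numpy as np

def generalized_ss_gain(Y_mag, N_mag, alpha=2.0, beta=0.1, gamma1=2.0):
    """Generalized spectral subtraction gain of eq. (7)."""
    gamma2 = 1.0 / gamma1
    ratio = (N_mag / np.maximum(Y_mag, 1e-12)) ** gamma1   # [N_hat/Y]^gamma1
    over = np.maximum(1.0 - alpha * ratio, 0.0) ** gamma2  # over-subtraction branch
    floor = (beta * ratio) ** gamma2                       # spectral-floor branch
    return np.where(ratio < 1.0 / (alpha + beta), over, floor)
```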

Signal preconditioning to remove musical noise

Signal preconditioning, also called smoothing, has been utilized in SE systems to reduce residual noise such as musical noise. It is also applied to increase the accuracy of noise estimation. Most speech enhancement algorithms improve speech quality; however, they suffer from an annoying phenomenon called musical noise [25][26]. This phenomenon is produced by randomly spaced spectral peaks that appear in each frame and occur at random frequencies. One cause of the randomly spaced peaks is the inaccurate, large-variance estimates of the noise spectra, typically computed using periodogram-type methods.

The works of [17], [18], [25], [26], [27] and [28] detail a sample of the many methods that have been proposed to reduce musical noise. Berouti et al. [25] proposed over-subtracting the noise spectrum by a constant which depends on the segmental SNR. The over-subtraction approach produced less musical noise, but at the expense of introducing speech distortion. Berouti et al. [25] also suggested spectral flooring of any negative spectral estimates. Utilizing psychoacoustic models, Tsoukalas et al. [27] and Virag [17] focused on masking the musical noise. The minimum mean square error (MMSE) short-time

spectral amplitude (STSA) estimator suggested by Ephraim and Malah in [28] was revealed by Cappé [18] to remove the musical noise by way of a different mechanism. The MMSE estimator applies a spectral gain which is a function of two parameters: the a priori signal-to-noise ratio and the a posteriori signal-to-noise ratio. Cappé noted that in the low-SNR regions where musical noise frequently dominates, the estimate of the a priori SNR proposed in [18] corresponds to a highly smoothed version of the a posteriori SNR over successive short-time frames; as a consequence, its variance is much smaller. Similar deductions were documented by Vary [29], who examined the theoretical limits of spectral-magnitude estimation. Vary described that in a stationary environment the estimate of the power of the noisy speech signal in a speech-absent frame is not equal to the estimate of the true power of the noise signal obtained during silence frames, but fluctuates near the noise estimate. Consequently, the a priori SNR estimate fluctuates, and musical noise is created. Clearly, an accurate estimate of the a priori SNR is critical for eliminating musical noise, as an alternative to smoothing the a priori SNR estimate as in [28].

Human perception of speech

As stated in the introduction chapter, human perception is limited to certain frequencies and levels, and the human hearing system depends on these frequencies and levels in order to perceive sound. For example, at low frequencies near 50 Hz, the pressure required for the human ear to perceive a sound is much higher than at higher frequencies (see chapter 1 for more details). These known facts have been used by researchers to estimate and remove noise. In the next few paragraphs these methods, and the way researchers achieved their objectives in removing different noise types, are discussed.

A fundamental principle of telephony is that human speech information mostly exists between 50 Hz and 3.5 kHz. For the purpose of tracking non-stationary noise, researchers such as Yamauchi et al. [30] used high-frequency regions of a noisy speech signal to estimate the noise. Yamauchi used high-frequency regions above 10 kHz for the averaging operation, and a flat noise spectrum was evaluated and applied to

spectral subtraction as the noise spectrum estimate. Others, such as Yamshita [11], used a similar method but focused on the low-frequency region below 50 Hz.

Subtractive-type algorithms are the central piece of most SE systems. They attempt to estimate the short-time spectral magnitude of speech by subtracting the noise estimate from the noisy speech. The phase of the noisy speech is not processed, based on the assumption that phase distortion is not perceived by the human ear. Short-time spectral magnitude estimation is a basic technique for many speech enhancement algorithms. In addition to the basic approach of spectral magnitude subtraction [24], other variations have also been developed [26]. Subtractive-type algorithms constitute a traditional approach for removing stationary background noise in single channel systems. Generally this type of algorithm has been chosen for its simplicity of implementation; it also offers high flexibility in terms of subtraction parameters.

As stated previously, musical noise is an annoying noise that is introduced during the SE process. It consists of tones at different frequencies with variable and varying amplitudes. Several solutions have been proposed to reduce this effect. These can be grouped as:

- magnitude averaging [24],
- over-subtraction of noise and introduction of a spectral floor [25],
- soft-decision noise suppression filtering [31],
- optimal MMSE estimation of the short-time spectral amplitude [28],
- nonlinear spectral subtraction [32], and
- introduction of morphological-based spectral constraints [33].

In most of these algorithms, the difficult task is to develop a solution which suppresses noise without decreasing intelligibility and without introducing speech distortion and residual noise. Tsoukalas [34] proposed to render this residual noise perceptually white by exploiting the knowledge that human perception does not register noisy signals that are either below a certain threshold or above a certain frequency. Methods were developed in this direction by modeling several aspects of the enhancement mechanism

present in the auditory system [34][35][36]. However, few existing enhancement algorithms take such auditory properties into account, and still fewer have worked at very low SNRs.

Johnston [37] proposed to incorporate a human hearing model that is already widely used in wideband audio coding. This model is based on the masking characteristics of the auditory system. It is related to the concept of critical band analysis, which is a central analysis mechanism in the inner ear. The masking properties are modeled by calculating a noise masking threshold; a human listener tolerates additive noise as long as it remains below this threshold. In the speech enhancement process, the subtraction parameters are adapted based on this noise masking threshold. This provides a balance between the amount of noise reduction, the speech distortion and the level of residual noise in a perceptual sense. As will be described in the subsequent chapters, the proposal offered in this thesis reduces this annoying musical noise to a manageable level via one of the proposed novel algorithms.

2.3 Computational auditory analysis

Another method of SE that is currently gaining momentum is Auditory Scene Analysis (ASA). The renewed interest in this area can be attributed to Bregman: since Bregman [38] published his key work in the area, more attention has been given to this subject. Scientists' attention was drawn to the fact that the human auditory perception system has a significant ability to actively recognize the external environment. Listeners are able to focus on a specific target sound without difficulty, even in very difficult situations where many speakers are talking at the same time. Currently published ASA work is underpinned by a few groups studying SCSE through background segregation and Harmonic-Temporal Clustering (HTC).

2.3.1 Background segregation and Harmonic-Temporal Clustering (HTC)

Bregman [38] has shown through experiments the psychological evidence that the auditory system segregates acoustic signals. These segregated acoustic signals are separated into spectrogram-like pieces, called auditory elements, which are further grouped into auditory streams according to a few grouping cues. Recent efforts have produced a framework towards the reproduction of this ability of the auditory system in computers, called Computational Auditory Scene Analysis (CASA) [19][20][21][22].

At this preliminary stage of CASA research the main focus has been to develop a source separation method based upon the grouping cues suggested by Bregman. The other efforts in this field are to extract useful features (for example, the fundamental frequency F0), or to restore the target signal of interest by performing the segregation and grouping processes through a computational algorithm. Given multiple schemes utilizing the grouping cues (see [22] for a list of references), almost all of the grouping processes are implemented in two steps:

1. Step one: at each discrete time point, use the frequency direction to extract immediate features in the grouping process.
2. Step two: reduce the errors in extracting these features by applying post-processing algorithms such as the hidden Markov model (HMM) or Kalman filtering.

Some research scientists have suggested performing these analyses in the frequency and time domains simultaneously to increase efficiency. In fact, in contrast to the previous strategy, a few research scientists have formulated a unified estimation framework, Harmonic-Temporal Clustering (HTC), for the two-dimensional structure of time-frequency power spectra [22]. This emerging research field has the potential to transform the way speech enhancement systems are designed in the near future.

2.4 Estimating the background noises

The noise estimator algorithm has a major impact on the overall quality of a speech enhancement system. For example, if the noise is under-estimated, unnatural residual noise will be perceived; if the noise is over-estimated, speech sounds will be muffled and intelligibility will be lost. Any single channel SE system requires an accurate estimate of the noise spectrum. This is usually done by detecting speech pauses in order to evaluate segments of pure noise. In practical situations this is a difficult task, especially if the background noise is not stationary or the signal-to-noise ratio (SNR) is low.

Conventional gain functions, such as the spectral subtraction discussed earlier, depend only on the measured signal level of the current frame and the estimated noise level. These methods cause musical tones which degrade the quality of the audio signal. Some methods are known to escape the dilemma of speech pause detection and are able to estimate the noise characteristics from just a past segment of noisy speech [39]. Almost all of them apply a gain function to modify the spectral amplitude, while the phase is normally left unchanged. However, the core component of all SCSE systems is the ability to accurately estimate the noise. Many successful techniques have been advanced over the past years. Below are a few successful and relevant algorithms:

1. The Spectral Minima Tracking (SMT) algorithm [28].
2. The Optimal Smoothing and Minimum Statistics (OSMS) algorithm [23][42].
3. The Noise Estimation (NE) algorithm that compares past spectral magnitudes in each sub-band (first-order recursive system) [40].
4. The Histogram Threshold (HT) algorithm, which compares past spectra against a threshold in each sub-band (first-order recursive system) [40].
5. The NE algorithm that applies a ratio of the noisy speech to a local minimum threshold of the noisy speech (second-order recursive system) [48].

Each of the NE methods highlighted above has some advantages and some disadvantages. These qualities will be explored in the following sections, and subsequently the NE algorithm that was uniquely developed and tested in our proposed SE system will be described.

Spectral minima tracking in sub-bands

Spectral minima tracking in sub-bands was introduced by Ephraim and Malah [28]. Subsequently, many others proposed variations in parameter estimation. This method proved successful in estimating the noise using spectral attenuation functions. Two methods were tried [40], one of which proved superior to the other:

- comparing past weighted spectral amplitudes in each sub-band, and
- comparing histograms of past spectral values in each sub-band.

Noise estimation by comparing past weighted spectral amplitudes in each sub-band

Hirsch and Ehrlicher [40] examined this method. They offered a linear first-order recursive equation to estimate the noise. This equation calculates the weighted sum of past spectral magnitude values X_i in each sub-band i:

$$N_i(k) = (1 - \alpha) \, X_i(k) + \alpha \, N_i(k-1) \qquad (8)$$

where X_i(k) denotes the spectral magnitude at time k in sub-band i, and N_i(k) is an estimate of the noise magnitude. Hirsch and Ehrlicher compared their work against algorithms that take the average of past spectral power values as an estimate of the noise power in the individual sub-band, a method called Continuous Spectral Subtraction (CSS) [41]. Hirsch introduced an adaptive threshold into the CSS system and assumed a Rayleigh distribution for the magnitude values X_i in segments of pure noise. It was found that considerably higher values occur at the onset of speech. Consequently, the term α·N_i(k−1), with an adaptive weighting parameter α, is introduced to prevent updating in speech-dominant regions. In this algorithm the noise is estimated by first finding when speech is dominant, and the threshold α is then adjusted (e.g. α = 1) to ensure that no update from the current speech occurs.

At the onset of speech in the noisy speech signal, the variance of the noisy speech spectrum changes rapidly and differs in each frequency sub-band. As discussed later, this led to research into new algorithms that make the noise estimation adapt faster and more accurately to increasing power spectrum fluctuations. The SE architecture presented in this thesis uses this fact and provides an integrated multi-dimensional noise estimation (NE) algorithm that improves the SE results. A minimal sketch of the recursive update in (8) follows.
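The sketch assumes magnitude spectra arranged as a (frames × sub-bands) array; the over-estimation factor beta gating the update is an illustrative assumption modelled on the description above, not Hirsch and Ehrlicher's exact rule.

```python
import numpy as np

def recursive_noise_estimate(X, alpha=0.9, beta=1.5):
    """First-order recursive noise estimate, eq. (8), frozen where speech dominates."""
    N = np.empty_like(X)
    N[0] = X[0]                              # initialise from the first frame
    for k in range(1, X.shape[0]):
        quiet = X[k] < beta * N[k - 1]       # update only where speech is unlikely
        N[k] = np.where(quiet,
                        (1 - alpha) * X[k] + alpha * N[k - 1],  # eq. (8)
                        N[k - 1])            # no update in speech-dominant bands
    return N
```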

A NE algorithm based on comparison of histograms

Hirsch [40] offered a second approach based on histograms of past spectral values in each sub-band. The threshold mentioned in the previous section is used to evaluate histograms of past values which are below the specified threshold. This separation of the data at the threshold point can be interpreted as a rough separation of the distributions of the noise (Rayleigh distribution) and speech spectra, as the speech signal usually takes much higher values. Hirsch's method used 400 ms of past data to determine the distribution over 40 bins. Hirsch proposed that the noise level be estimated at the maximum of the distribution in each sub-band. The estimated values for the noise magnitude are smoothed over time to eliminate rarely occurring spikes. This led to a more accurate estimate of the noise spectrum, as opposed to his previous method. A minimal sketch of this histogram estimate follows.
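This sketch covers a single sub-band, using the 40 bins mentioned above; buffering roughly 400 ms of past frame magnitudes is assumed to happen outside the function, and the subsequent temporal smoothing is omitted.

```python
import numpy as np

def histogram_noise_estimate(past_mags, num_bins=40):
    """Noise magnitude as the mode of the histogram of past sub-band magnitudes."""
    counts, edges = np.histogram(past_mags, bins=num_bins)
    peak = int(np.argmax(counts))                 # maximum of the distribution
    return 0.5 * (edges[peak] + edges[peak + 1])  # centre of the peak bin
```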

Optimal smoothing and minimum statistics

The minimum statistics method came about based on two observations: the speech and the disturbing noise are statistically independent, and the power of a noisy speech signal often decays to the power level of the disturbing noise. Based on these observations it is possible to derive an estimate of the noise power spectral density by tracking the minimum of the noisy signal's Power Spectral Density (PSD). Martin [42] proposed speech enhancement based on minimum statistics, which was later modified by Doblinger [23]. It was shown that, in contrast to other methods, the minimum statistics algorithm does not use any explicit threshold to distinguish between speech activity and speech pause or noise-level data. This method is more strongly associated with soft-decision methods than with the conventional Voice Activity Detection (VAD) methods discussed in the previous sections. In common with soft-decision methods, this method also updates the estimated noise power spectral density during speech activity, which was new in this field.

Subsequently, Martin [2] introduced a power spectral density smoothing algorithm that utilizes a first-order recursive system with a time- and frequency-dependent smoothing parameter. Martin optimized the smoothing parameter for tracking non-stationary signals by minimizing a conditional mean square error criterion:

$$P(\lambda, k) = \alpha(\lambda, k) \, P(\lambda - 1, k) + \big(1 - \alpha(\lambda, k)\big) \, |Y(\lambda, k)|^2 \qquad (9)$$

where α is the smoothing parameter and P denotes the power spectral density of the noisy speech. For the derivation of the Mean Square Error (MSE) equations, refer to Martin's paper [2]. This method takes slightly more than the window duration to update the noise spectrum when the noise floor increases abruptly [2].

Estimating noise with conventional gain functions like the spectral subtraction [26] method depends only on the measured signal level of the current frame and the a priori estimated noise level. These methods cause musical tones which degrade the quality of the audio signal. A better solution was proposed by Ephraim and Malah [28]: they use the decision-directed method to estimate an a priori SNR, and the gain function minimizes the mean-square error of the log-spectra based on a Gaussian statistical model. This type of estimation proved more efficient in reducing the musical residual noise phenomenon [18]. In recent years the speech presence probability has been used to further improve the performance of these algorithms [12][44].

In the present study the speech presence probability is applied and extended to adopt a more accurate detection and mapping of the noise estimate in each sub-band. This novel proposed noise estimation algorithm is adaptable to different noise types. It applies an adaptive threshold algorithm in conjunction with three smoothing algorithms to reduce the annoying musical noise. A sketch of the recursive smoothing in (9) combined with minimum tracking follows.
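The sketch below combines the first-order smoothing of (9), with a fixed α instead of Martin's time- and frequency-dependent one, with a sliding-window minimum search; the bias compensation of the full minimum statistics algorithm is omitted, and the window length is an illustrative assumption.

```python
import numpy as np

def minimum_statistics_noise_psd(Y_power, alpha=0.85, search_len=96):
    """Smooth |Y|^2 per eq. (9), then track the minimum over a sliding window.

    Y_power: (num_frames, num_bins) noisy-speech power spectra.
    """
    P = np.empty_like(Y_power)
    P[0] = Y_power[0]
    for lam in range(1, Y_power.shape[0]):
        P[lam] = alpha * P[lam - 1] + (1 - alpha) * Y_power[lam]   # eq. (9)
    noise_psd = np.empty_like(P)
    for lam in range(P.shape[0]):
        lo = max(0, lam - search_len + 1)
        noise_psd[lam] = P[lo:lam + 1].min(axis=0)  # noisy PSD decays to the noise floor
    return noise_psd
```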

2.5 Noise estimation with rapid adaptation techniques

Martin's minimum statistics approach uses a smoothed power estimate of the noisy signal which is sensitive to high variations [2]; its variance is about twice as large as that of a conventional noise estimator. In addition, when the minimum search window was too short, Martin's method occasionally attenuated low-energy phonemes [46] and the intelligibility of the speech was lost. Doblinger presented a computationally more efficient minimum tracking scheme in [23]. Its main drawback was the very slow update rate of the noise estimate when the noise energy level increased suddenly [47]; it also had a tendency to cancel the signal.

Israel Cohen [46] introduced a minima-controlled recursive averaging (MCRA) approach for noise estimation. The noise estimate was obtained by averaging past spectral power values, using a smoothing parameter that is adjusted by the signal presence probability in sub-bands. It was shown that the presence of speech in a given frame of a sub-band can be determined by the ratio between the local energy of the noisy speech and its minimum within a specified time window; the ratio is compared to a threshold value, where a smaller ratio indicates absence of speech. Subsequently, temporal smoothing is used to reduce fluctuations between speech and non-speech segments, thereby taking advantage of the strong correlation of speech presence in neighboring frames. The resultant noise estimator was computationally efficient, robust with respect to the input SNR, and distinguished by its ability to follow abrupt changes in the noise spectrum more quickly.

Subsequently, Cohen and Berdugo [46] proposed an algorithm which tracked the noise-only regions by finding the ratio of the noisy speech to the local minimum over a fixed time period. This method suffered from lags when the noise spectrum increased abruptly. Consequently, Rangachari [7] proposed a noise estimation method which adapts to highly non-stationary noisy environments. This method used three frequency sub-bands for the signal processing. Each sub-band used a ratio and a set threshold in a soft-decision technique for detecting the presence or absence of speech data; based on these ratios, two different algorithms were then applied to estimate the noise. Rangachari's [7] algorithm addressed to some extent how rapid adaptation to the noise spectrum could be achieved by a second-order equation. However, this method was not efficiently applied to different types of noise, as it utilized one general threshold map. With better threshold mapping and an increased number of sub-bands, a more advanced NE can be acquired. These advanced steps are applied in the present innovative algorithm and will be discussed in Chapter 5. A sketch of an MCRA-style update step follows.
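To make the ratio-based soft decision concrete, here is one frame of an MCRA-style update: a simplified sketch in the spirit of Cohen's description, not his exact algorithm, with illustrative threshold and smoothing constants.

```python
import numpy as np

def mcra_noise_step(Y_pow, noise_prev, p_prev, local_min,
                    delta=5.0, alpha_d=0.95, alpha_p=0.2):
    """One frame of minima-controlled recursive averaging (per-bin arrays)."""
    ratio = Y_pow / np.maximum(local_min, 1e-12)       # local energy vs. its minimum
    indicator = (ratio > delta).astype(float)          # large ratio -> speech likely
    p = alpha_p * p_prev + (1 - alpha_p) * indicator   # smoothed presence probability
    alpha_s = alpha_d + (1 - alpha_d) * p              # presence-adjusted smoothing
    noise = alpha_s * noise_prev + (1 - alpha_s) * Y_pow  # update freezes as p -> 1
    return noise, p
```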

2.6 Evolution of current single channel NSNSE techniques on the NE and noise reduction

Removal of non-stationary noise from noisy speech has much room for improvement, particularly at low SNR levels. Non-Stationary Noise Speech Enhancement (NSNSE) is challenging and more difficult than Stationary Noisy SE (SNSE). Rangachari [7] and others have established algorithms that can adapt to sudden changes in the noise signal, based on using an adaptive recursive algorithm to update the noise statistics. However, current single channel NSNSE systems do not perform well under different types of background noise such as babble, machine gun and factory floor noise.

A single channel NSNSE system should not distort the speech information or reduce the intelligibility of the speech in the process. It should also not introduce processing noise such as harmonic noise. The NSNSE system should be simple to implement and efficient in producing the result. It should calculate the noise from the noisy speech as much as possible, since training usually takes a lot of valuable time and requires access to clean data. This means any NSNSE system should not assume clean speech is available, since some systems have access to neither the data nor the time to train.

Furthermore, some single channel NSNSE systems suffer from processing noise (introduced noise) and residual noise. The current understanding is that processing noise can mostly be eliminated by improved spectrum estimation of the noisy speech, while residual noise in single channel NSNSE can be minimized by correct noise estimation techniques.

Challenges in the single channel NSNSE system require an approach that considers the system solution as a whole. This holistic analysis is important in order to propose a successful solution. The present study of the single channel NSNSE system as a whole identified two main improvements. These enhancements have been the main focus of this research and are

documented in Chapters 4 and 5. In the next chapter the key speech enhancement algorithms on which this research is based are discussed.

Chapter 3

Speech Enhancement Fundamental Algorithms

3 Speech Enhancement Fundamental Algorithms

3.1 Overview

Speech enhancement systems repeatedly utilize a set of fundamental algorithms that have proven to be the key building blocks for many current SE systems. In this chapter these fundamental algorithms, as used in the Single Channel Non-Stationary Noise Speech Enhancement (SCNSNSE) system, are described, and the evaluation methodologies are also clarified. Following is a list of these fundamental algorithms that will be discussed briefly:

1. Windowing procedure, in particular the Hamming window technique.
2. DFT and IDFT technique.
3. Autocorrelation coefficient and power spectrum.
4. Segmental based Spectral Subtraction algorithm.
5. Spectral Subtraction.
6. Subjective and objective measures.
7. The Auto Regressive technique for evaluating spectrum estimation.

3.2 Windowing technique

Equation (10) describes the relationship between noisy speech and clean speech. Noise and speech in (10) are assumed to be uncorrelated:

y(m) = s(m) + n(m), \quad m = 0, \dots, \lambda-1 \qquad (10)

In (10) y is the noisy speech, s denotes the clean speech and n is the noise in the time domain. Before the noisy speech is converted to the frequency domain it is multiplied frame by frame by a window function. Windowing is used extensively in signal processing, particularly in conjunction with the Discrete Fourier Transform (DFT).

Currently, continuous signals are transformed with reasonable accuracy to the frequency domain by dividing them into smaller frames, given an adequate overlap length (usually half of the window size). This use of short-time frequency analysis allows a reasonable capture of both the spectral and temporal characteristics of the signal. The Hamming and Hann window functions are the most commonly used functions in the communication industry. The Hamming window is formulated as shown below:

a_k(m) = 0.54 - 0.46 \cos\left(\frac{2\pi m}{\lambda - 1}\right), \quad m = 0, \dots, \lambda-1 \qquad (11)

where k represents the frame window number. The Hann window is slightly different and is defined by:

a_k(m) = 0.5\left(1 - \cos\left(\frac{2\pi m}{\lambda - 1}\right)\right), \quad m = 0, \dots, \lambda-1 \qquad (12)

The window size (λ) is usually 256 or 512, with half the window size as the overlap (i.e. for a window size of 256, use 128 samples as overlap). The a_k denotes the k-th data window used for the spectral estimate calculation. It has been established that there are no significant changes in the outcome when either the Hann or the Hamming window is used on the signal. In this research the Hamming window was selected and applied to the SCNSNSE system.

3.3 DFT and IDFT technique

A series of λ (the frame length) real numbers representing the time samples (s_0, ..., s_{λ-1}) can be transformed to a series of λ complex numbers (S_0, ..., S_{λ-1}) by the Discrete Fourier Transform (DFT) according to:

S_m = \sum_{n=0}^{\lambda-1} s_n e^{-j\frac{2\pi}{\lambda} m n}, \quad m = 0, \dots, \lambda-1 \qquad (13)

where e^{-j\frac{2\pi}{\lambda}} is a primitive λ-th root of unity. The Inverse Discrete Fourier Transform (IDFT) is calculated by:

s_n = \frac{1}{\lambda} \sum_{m=0}^{\lambda-1} S_m e^{j\frac{2\pi}{\lambda} m n}, \quad n = 0, \dots, \lambda-1 \qquad (14)

3.4 Autocorrelation coefficient and power spectrum

The level of correlation is measured by the correlation coefficient, defined as the strength and direction of the linear relationship between two random variables. When two variables depart from independence, the correlation is measured by the correlation coefficient:

\rho_{xy} = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{E((X - \mu_x)(Y - \mu_y))}{\sigma_x \sigma_y} \qquad (15)

where E is the expected value operator and cov denotes covariance. The cross-correlation is more practical, as it allows correlation between different processes to be measured. The cross-correlation is defined by:

R_{xy}(m) = E\{x_{n+m}\, y_n^*\} = E\{x_n\, y_{n-m}^*\} \qquad (16)

where x_n and y_n are stationary random processes, m is the lag index, -∞ < n < +∞ and E{.} is the expectation. The cross-correlation must be estimated, since in practice only the initial samples are available. Equation (17) is such a cross-correlation estimate:

\hat{R}_{xy}(m) = \begin{cases} \sum_{n=0}^{N-m-1} x_{n+m}\, y_n^*, & m \ge 0 \\ \hat{R}_{xy}^*(-m), & m < 0 \end{cases} \qquad (17)
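To make (17) concrete, the following Python sketch (an illustrative implementation written for this discussion, not code from the thesis) computes the estimate for non-negative lags from the N available samples:

```python
import numpy as np

def cross_correlation_estimate(x, y, max_lag):
    """Cross-correlation estimate of (17) for lags 0..max_lag.

    Only N initial samples of x and y are available, so the sum at
    lag m runs over the N - m overlapping sample pairs.
    """
    N = min(len(x), len(y))
    r = np.zeros(max_lag + 1, dtype=complex)
    for m in range(max_lag + 1):
        # R_xy(m) = sum_{n=0}^{N-m-1} x[n+m] * conj(y[n])
        r[m] = np.sum(x[m:N] * np.conj(y[:N - m]))
    return r
```

Negative lags follow from the symmetry relation in the second line of (17).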

The power spectrum of a quasi-stationary process x_n is mathematically related to the estimated correlation by the discrete-time Fourier transform. In terms of normalized frequency, this is given by:

P_{xx}(\omega) = \sum_{m=-\infty}^{\infty} \hat{R}_{xx}(m)\, e^{-j\omega m} \qquad (18)

where m is the lag index and ω = 2πf/f_s, with f_s the sampling frequency. The average power over the given sampling frequency is defined by:

\hat{R}_{xx}(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} P_{xx}(\omega)\, d\omega = \frac{1}{f_s}\int_{-f_s/2}^{f_s/2} P_{xx}(f)\, df \qquad (19)

3.5 Spectral subtraction algorithm

A basic way to enhance speech is to subtract the estimated noise frame from the noisy speech frame. This is usually carried out in the frequency domain. The Spectral Subtraction (SS) algorithm applied in the proposed system is a simple SS [25] with bias adaptation, as described in the previous section. Figure 3 shows the block diagram of a typical SS system.

Figure 3: Simple spectral subtraction block diagram (the noisy speech signal is windowed and FFT-transformed; the noise magnitude estimate is subtracted from the noisy magnitude, the noisy phase is restored at frequency synthesis, and the IFFT with overlap-and-integrate yields the enhanced speech signal).
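The diagram maps directly onto a few lines of code. The following Python sketch (a minimal illustration, not the thesis implementation; the noise magnitude estimate is assumed to be supplied by a separate noise estimator) shows the per-frame core of Figure 3:

```python
import numpy as np

def spectral_subtraction_frame(noisy_frame, noise_mag):
    """Enhance one Hamming-windowed frame by magnitude spectral subtraction.

    noise_mag: estimated noise magnitude spectrum, with the same length
    as the rfft of the frame (len(noisy_frame) // 2 + 1 bins).
    """
    spectrum = np.fft.rfft(noisy_frame)
    mag = np.abs(spectrum)
    phase = np.angle(spectrum)                        # noisy phase kept as-is
    enhanced_mag = np.maximum(mag - noise_mag, 0.0)   # half-wave rectify
    return np.fft.irfft(enhanced_mag * np.exp(1j * phase),
                        n=len(noisy_frame))
```

A full system applies this to every overlapping windowed frame and reconstructs the output by overlap-and-add, as in the diagram.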

3.6 Objective measure based on Signal to Noise Ratio (SNR)

A well-established classical method to compare and quantify the enhanced speech is the Signal to Noise Ratio (SNR). The SNR measure has been applied in signal processing for many years and represents a quantitative value of the noise improvement after processing. First the input SNR value is calculated; a similar calculation is then carried out at the output and compared with the input SNR to gauge the reduction in noise. The net gain is the difference between the output and input SNR, as defined by (22). The target objective test range proposed in this system varies between 0, -2.5, -5, and -7.5 dB SNR. Equation (20) defines the output Segmental SNR (SSNR) [50]:

\mathrm{SSNR} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=N_m}^{N_m+N-1} x(n)^2}{\sum_{n=N_m}^{N_m+N-1} (x(n)-\hat{x}(n))^2} \qquad (20)

where x(n) is the original (clean) signal, \hat{x}(n) is the enhanced signal, N is the frame length (typically chosen to span around 15 msec), N_m is the starting sample index of frame m and M is the number of frames in the signal. The SNR calculation can either be a Global SNR (GSNR) or a Segmental SNR (SSNR). In the GSNR the complete signal (1.4 seconds or so) is taken into account to calculate the input and output SNR values. The SSNR, on the other hand, calculates the input and output SNR for each frame, which can be as short as 15 msec. Equation (21) calculates the GSNR in dB:

\mathrm{GSNR} = 10 \log_{10} \frac{\sum_{n=0}^{N-1} x(n)^2}{\sum_{n=0}^{N-1} (x(n)-\hat{x}(n))^2} \qquad (21)

where x(n) is the original (clean) signal, \hat{x}(n) is the enhanced signal, and N is the speech length. Thus the global and segmental gains are calculated based on (22):

G = 10 \log_{10} \frac{\sum_{n} x(n)^2}{\sum_{n} (x(n)-\hat{x}(n))^2} - 10 \log_{10} \frac{\sum_{n} x(n)^2}{\sum_{n} (x(n)-y(n))^2} \qquad (22)

where x(n) is the original (clean) signal, y(n) is the noisy speech signal, \hat{x}(n) is the enhanced signal, and N is the speech length; the same subtraction of output and input values applies to the segmental SNR.
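As an illustration of (20)-(22), the following Python sketch (illustrative only; the 256-sample frame length is an assumption, corresponding to 16 msec at a 16 kHz sampling rate) computes the global and segmental SNRs and the net gains:

```python
import numpy as np

def global_snr_db(clean, estimate):
    """Global SNR of (21) in dB: clean power over residual power."""
    residual = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

def segmental_snr_db(clean, estimate, frame_len=256):
    """Segmental SNR of (20) in dB: average of the per-frame SNRs."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = estimate[start:start + frame_len]
        snrs.append(10.0 * np.log10(np.sum(c ** 2)
                                    / np.sum((c - e) ** 2)))
    return float(np.mean(snrs))

# Net gains of (22): output SNR (enhanced) minus input SNR (noisy):
#   g_gsnr = global_snr_db(clean, enhanced) - global_snr_db(clean, noisy)
#   g_ssnr = segmental_snr_db(clean, enhanced) - segmental_snr_db(clean, noisy)
```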

3.7 Objective measure based on the PESQ test

Most objective measurements are only partially related to speech quality, and subjective tests complement them when evaluating the quality of the results. Subjective listening tests are the conventional way to evaluate the outcome of speech enhancement systems; recently, however, the PESQ algorithm has replaced this manual and tedious process with an automated quality measurement. In this work the PESQ algorithm is applied as a measure of the quality of the speech. The standard PESQ (ITU-T P.862), PESQ MOS and PESQ LQ (P.862.1) [51] are currently recognized and used as measurements of voice quality in most communication systems. PESQ is an automated way to test the quality of the enhanced speech without using actual people to examine speech quality.

3.8 The AR technique for evaluating spectrum estimation

Statistics and signal processing apply a predictor model for random data called the Auto Regressive (AR) model. The AR model can be applied to measure and validate the de-noising and smoothing of a signal. The AR algorithm is utilized here to illustrate the efficacy of the proposed spectrum estimation algorithms on a quasi-speech signal modeled by a low-order AR process. There are many AR algorithm formulations. These include the Burg, covariance, modified covariance and Yule-Walker AR model parameter estimators [52][53]. A general notation for the AR(n) model, which represents an AR process of order n, is defined below:

x_t = \sum_{i=1}^{n} \varphi_i x_{t-i} + \varepsilon_t \qquad (23)

where \varphi_1, \dots, \varphi_n are the model parameters and \varepsilon_t is the noise. Thus the AR model can also be viewed as the output of an all-pole Infinite Impulse Response (IIR) filter whose input is noise. The AR model provides random data that can be used as input to test novel algorithms. The AR order varies between applications and depends on the subject under examination. The AR algorithm adopted and applied to our SCNSNSE system is of 4th order (AR(4) or AR4). The AR4 model was used since the combination of the noise and the polynomial signal creates suitable wide and narrow band noise frequencies for examination. The AR4 is suitable since it produces both narrow and wide band frequency structure without creating an overly complex environment. Figure 4 illustrates an example of the AR4 algorithm applied to our SCNSNSE system and the output results. This method will be discussed in detail in the following chapter.

Figure 4: Graphical illustration of the AR4 data creation used in the SCNSNSE system.
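A minimal sketch of how such AR(4) test data can be generated is given below (Python; the example coefficients are the well-known AR(4) test process of Percival and Walden [55], used here only as a plausible stand-in, since the thesis does not list the coefficients it used):

```python
import numpy as np
from scipy.signal import lfilter

def generate_ar4(n_samples, ar_coeffs, noise_std=1.0, seed=0):
    """Generate an AR(4) test sequence by filtering white Gaussian noise.

    ar_coeffs: the four AR parameters (phi_1..phi_4) of (23); the
    all-pole filter denominator is [1, -phi_1, ..., -phi_4].
    """
    rng = np.random.default_rng(seed)
    excitation = noise_std * rng.standard_normal(n_samples)
    a = np.concatenate(([1.0], -np.asarray(ar_coeffs)))
    return lfilter([1.0], a, excitation)

# Example coefficients (Percival-Walden AR(4); illustrative values only):
x = generate_ar4(2048, ar_coeffs=[2.7607, -3.8106, 2.6535, -0.9238])
```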

3.9 Synopsis

Over many years, researchers in the field of speech enhancement have developed a set of fundamental algorithms and measures as building blocks. These algorithms and measures were discussed in this chapter; they are utilized in one form or another in almost the entire speech enhancement process, including the SCNSNSE system architecture. In the next two chapters, these fundamental algorithms are applied and discussed as used in the SCNSNSE system.

Chapter 4

A Novel SCNSNSE System with the Controlled Forward Moving Average (CFMA) Algorithm

4 Proposed speech enhancement system

4.1 Overview

Speech enhancement (SE) systems are generally divided into two categories with respect to noise estimation. One category of SE systems utilizes an array of microphones to estimate the noise; the other uses a single microphone, or single channel, to achieve the same result. In the Single Channel SE (SCSE) the noise is estimated from the noisy speech, while in a microphone array system the noise is calculated from directional microphones. Here we focus on a system that must estimate the noise from a single channel noisy speech signal (i.e. the SCSE). Within SCSE research there are two branches: one addresses the removal of stationary noises and the other focuses on the removal of non-stationary noises. This thesis is about the SCNSNSE system for non-stationary noises.

Let us describe how the structure of the SCNSNSE system presented here is derived. To construct a well-defined system one must first describe the fundamental problems and then explore their answers. Some of these fundamental problems are:

- How can non-stationary noise that is mixed with the desired clean speech be separated, given that the only data available is the noisy speech (i.e. pre-recorded noisy speech or a single channel speech enhancement system)?
- What is the nature of non-stationary noises? How do they behave?
- What are the characteristics of the non-stationary noisy speech mixtures?

Noises are unwanted signals created by the environment or the system itself. All noises have a signal strength or amplitude that varies with frequency. Figure 5 shows the average spectrum of a few common noises. Our study revealed two main types of variation, with different characteristics across each noise type:

1. Short term variance, which is called narrowband noise variance.
2. Long term variance, which is called wideband noise variance.

Narrowband (short term) noise variance: we define this as changes in the power of the noise spectrum that fluctuate within tens of Hz. All non-stationary noise types exhibit

such local, rapid fluctuations in the spectrum, mainly due to estimation errors. Wideband (long term) noise variance is defined as changes in the power of the noise spectrum that are only recognizable across thousands of Hz (kHz). All noise types (with the exception of white noise) exhibit such longer term changes in the spectrum due to the specific characterization of the noise behavior. Note that white noise, being spectrally flat, displays no such variation across the spectrum of interest. Megumi Umeno [54] discussed the characteristic power spectra of different types of noises and presented them as shown in Figure 5. Each noise has certain power and frequency characteristics.

Figure 5: Average spectrum of a few non-stationary noise types [54].

Identifying noise behavior is the primary step in producing a SCNSNSE system. Knowledge of these two noise variances has enabled us to structure a system to enhance the noisy speech accordingly.

Figure 6 depicts the original SCNSNSE system developed in our research. It shows how the different algorithms, such as the Multi-Taper Method (MTM) with DPSS tapers and the Wavelet Transform, are interconnected through a purposeful structure to address the noise behavior and enhance the noisy speech accordingly.

Figure 6: The SCNSNSE system and the algorithm applied in each building block. (Block diagram: the noisy speech y(m) passes through a digital band-pass filter (0.5 pass-band ripple) providing three frequency bands, 0-1 kHz, 1-3 kHz and 3 kHz-Fs/2; the MTM (DPSS) algorithm and SURE wavelet thresholding (Universal or heuristic) produce Z(ω) = log Y(ω) + η(ω), i.e. z(j,k) = y(j,k) + n(j,k), while the phase information is preserved. Layer two compares the low, middle and high band ratios against their thresholds to select the speech-present or speech-absent noise estimation algorithm, followed by a bias adjuster and noise estimation data constructor producing N(ω). Layer three performs spectral subtraction, sentence reconstruction and the IFFT to yield the enhanced speech z(m).)

In Figure 6 the symbol y denotes the noisy speech in the time domain, N(ω) denotes the noise spectrum, \hat{Y}(ω) denotes the estimated enhanced speech spectrum and Y(ω) is the spectrum of the noisy speech. Other symbols will be introduced throughout this chapter as each algorithm is defined in the appropriate section. The

structure of this SCNSNSE system is divided into three layers:

1. Spectrum estimation layer (layer one).
2. Noise estimation layer (layer two).
3. Post processing or enhancement layer (layer three).

At the spectrum estimation layer, the proposed algorithm we describe addresses the narrowband variance of the noise and its reduction. At layers 2 and 3 the long term variance of the noise is addressed by applying a noise estimation algorithm, and we will propose an extension of this approach in the subsequent chapter. In the next few sections these layers are described in detail.

4.2 Review of spectrum estimation (pre-processing layer)

The spectrum estimation layer includes any algorithm that prepares the noisy speech signal for noise estimation. As part of the spectrum estimation layer, the short term variance, or small variation, of the spectrum is reduced. The algorithms in this layer prepare the noisy speech for more efficient noise estimation, which happens at the next layer of the system. A few of the algorithms applied in this layer are listed below:

- The digital filtering algorithm.
- The Discrete Prolate Spheroidal Sequence (DPSS) multi-taper spectrum algorithm.
- Stein's Unbiased Risk Estimator (SURE) wavelet thresholding, to smooth the power spectrum near zero.

Another phenomenon that is well known in speech research is called harmonic noise or musical noise. If applied correctly, the spectrum estimation algorithms at layer one will reduce harmonic noise considerably. A digital low pass filter with a cut-off frequency of around 3.6 kHz is applied in most SE systems to eliminate the unrelated part of the noisy speech signal. It is not our intention to explore the setup configuration of this Digital Low Pass Filter (DLPF), as it is readily available from various software tools and literature devoted to the subject.

4.2.1 The DPSS multi-taper spectrum algorithm

Spectrum estimation using a Hamming window is the most widely used method. This technique reduces the bias but not the variance of the estimated spectrum. In order to reduce the short term variance of the noise it is important to find algorithms that can reduce not only the bias but also the variance of the estimated spectrum. The DPSS approach is a short time spectral amplitude estimation technique that reduces both the variance and the bias of the noisy speech spectrum estimate. It estimates the spectrum from a combination of multiple orthogonal windows (or "tapers"). The multi-taper method (MTM) applies a mixture of modified periodograms to estimate the Power Spectral Density (PSD), as defined by Percival, D.B. [55]. These periodograms are computed using a sequence of orthogonal tapers (usually in the frequency domain). The DPSS algorithm is similar to Welch's method of modified periodograms. Let

\hat{Y}^{mt}(\omega) = \frac{1}{L} \sum_{k=0}^{L-1} \hat{Y}_k(\omega) \qquad (24)

with

\hat{Y}_k(\omega) = \left| \sum_{m=0}^{\lambda-1} a_k(m)\, y(m)\, e^{-j\omega m} \right|^2 \qquad (25)

where λ is the data length (typically a vector length of 500), L is the number of tapers, and a_k is the k-th data taper used for the spectral estimate. The a_k(m) represent the coefficients of the k-th taper sequence; these tapers are optimal band pass filters. In simple terms, the L tapers are applied to each data frame and the resulting spectra are averaged across the frame.
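A compact sketch of the estimator in (24)-(25) is shown below (Python with SciPy's DPSS tapers; the taper count and time-bandwidth product are illustrative choices, not values prescribed by the thesis):

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(frame, n_tapers=5, nw=3.0):
    """Multi-taper PSD estimate in the spirit of (24)-(25).

    Averages the periodograms of the frame weighted by L orthogonal
    DPSS (Slepian) tapers.
    """
    lam = len(frame)
    tapers = dpss(lam, nw, n_tapers)           # shape (L, lambda)
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return spectra.mean(axis=0)                # average over the L tapers
```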

Hu, Y. and Loizou, P. [3] showed that, since the ratio of the estimated multi-taper power spectrum \hat{Y}^{mt}(\omega) to the true power spectrum Y(\omega) conforms to a chi-square distribution, the tapers can be chosen so that the estimation noise η(ω) is approximately Gaussian with zero mean and known variance. Specifically, Hu and Loizou showed that:

\log \hat{Y}^{mt}(\omega) + c = \log Y(\omega) + \eta(\omega) \qquad (26)

which indicates that the log multi-taper power spectrum plus a constant is equivalent to the true log power spectrum, log Y(ω), plus a zero-mean Gaussian noise term η(ω). This is particularly important as it is ideal for applying the wavelet de-noising technique [3]. For the subsequent analysis we rewrite (26) as:

Z(\omega) = \log Y(\omega) + \eta(\omega) \qquad (27)

4.2.2 Wavelet thresholding

There are many documents describing the wavelet algorithm. The main purpose of using the wavelet algorithm here is to flatten to zero all the noise data points whose values fall below a set threshold. The central focus and effort in wavelet de-noising is determining an appropriate threshold level. Two thresholding techniques were explored in our research:

- Universal.
- Stein's Unbiased Risk Estimator (SURE) [56].

Let {z_{j,k}} be the wavelet coefficients of Z(ω) in (27), let {y_{j,k}} be the wavelet coefficients of log Y(ω), and let {n_{j,k}} be the wavelet coefficients of η(ω). Then, by the linearity of the discrete wavelet transform, equation (27) is transformed to

z_{j,k} = y_{j,k} + n_{j,k} \qquad (28)

where the subscript j indicates the j-th scale and the subscript k indicates the k-th wavelet coefficient. Since the noise η(ω) is nearly Gaussian, the hard and soft thresholding techniques proposed by Donoho in [5] can be used for noise reduction. Hard thresholding with threshold T is defined as:

\delta_h(z, T) = \begin{cases} z, & \text{if } |z| \ge T \\ 0, & \text{otherwise} \end{cases} \qquad (29)

and the soft thresholding function is defined as:

\delta_s(z, T) = \begin{cases} z - T, & \text{if } z \ge T \\ 0, & \text{if } |z| < T \\ z + T, & \text{if } z \le -T \end{cases} \qquad (30)

In (30), T for Universal thresholding is defined as:

T = \hat{\sigma} \sqrt{2 \log \lambda} \qquad (31)

where λ denotes the frame size (window length) and the noise deviation is defined as:

\hat{\sigma} = \frac{\mathrm{MAD}}{0.6745} \qquad (32)

where MAD represents the median absolute deviation estimated at scale j. This shows that the threshold only depends on the noise variance. The outcome of this process is good when the noise data is uncorrelated with the speech data. The wavelet coefficient z_{j,k} is thresholded at each level j using either soft, \delta_s, or hard, \delta_h, thresholding:

\hat{z}_{j,k} = \begin{cases} \delta(z_{j,k}, T), & \text{if } 1 \le j \le q_0 \\ z_{j,k}, & \text{if } j > q_0 \end{cases} \qquad (33)

where q_0 denotes the coarse resolution level. When noise dominates the speech data, the universal threshold method performs better. When the underlying signal dominates the speech data, the SURE method performs better. This observation led to the heuristic SURE method, which selects either the Universal threshold or the SURE threshold according to a test of significance of the presence of the signal. The decision on the choice of threshold is based on comparing (34) with (35):

s_d^2 = \frac{1}{\lambda} \sum_{i=1}^{\lambda} z_i^2 - \hat{\sigma}^2 \qquad (34)

and a threshold equation:

T_d = \frac{\hat{\sigma}^2 (\log_2 \lambda)^{1.5}}{\sqrt{\lambda}} \qquad (35)

where z is the input data and λ is the frame length. The threshold T used in the heuristic SURE method is therefore computed as:

T = \begin{cases} \hat{\sigma}\sqrt{2\log\lambda}, & \text{if } s_d^2 \le T_d \\ T_{SURE}, & \text{if } s_d^2 > T_d \end{cases} \qquad (36)

where T_{SURE} is the threshold obtained according to [56][3].

4.3 A new algorithm to improve the spectrum estimation layer

One step to improve the spectrum estimation of the SCNSNSE sub-system is to smooth the low-band variation of the noisy speech signal. To do this we look at the signal after completion of the signal conditioning shown in (25) in section 4.2.1.

4.3.1 Controlled Forward Moving Average (CFMA) to smooth the power spectrum

Given the completion of the signal conditioning in the application of (25), an improved low-variance spectrum can be defined as:

\tilde{Y}(\omega)_k = \frac{1}{2j+1} \sum_{i=k-j}^{k+j} \hat{Y}^{mt}(\omega)_i \qquad (37)

where i = k-j, k-(j-1), ..., k+j, k is the frame number and \hat{Y}^{mt}(\omega)_i is the estimated multi-taper spectrum calculated at frame i using the DPSS equation defined by (25). The value of j is determined heuristically and can be optimized depending on the noise characteristics; it may change from one type of noise to another. The focus of this study has been on non-stationary noise. In this study, applying j = 2 to 4 produced good results for most non-stationary noises, whereas increasing this value deteriorated the wide band frequency variation significantly. This method of processing is non-causal and off-line, and is not suitable for real-time filtering.
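The CFMA of (37) amounts to a short moving average across neighboring frames of the spectrogram. A minimal sketch follows (illustrative; the handling of the first and last frames is an implementation choice the thesis does not specify):

```python
import numpy as np

def cfma_smooth(spectra, j=2):
    """Controlled Forward Moving Average of (37).

    spectra: array of shape (n_frames, n_bins), the per-frame multi-taper
    spectra. Each output frame k is the average of frames k-j..k+j, which
    makes the method non-causal (it needs future frames), as noted in the
    text. Edge frames are averaged over the frames that exist.
    """
    n_frames = spectra.shape[0]
    smoothed = np.empty_like(spectra)
    for k in range(n_frames):
        lo, hi = max(0, k - j), min(n_frames, k + j + 1)
        smoothed[k] = spectra[lo:hi].mean(axis=0)
    return smoothed
```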

Even though the variance of the frequency spectrum is reduced after the CFMA, more can be done to improve the low band variation in the vicinity of the zero signal level. Hu's application of the chi-square distribution still holds in this case, since the CFMA algorithm is an averaging manipulation of the data after the multi-taper spectrum [3]. Thus we have:

\log \tilde{Y}(\omega) + c = \log Y(\omega) + \eta(\omega) \qquad (38)

The next step to smooth the signal in the vicinity of the zero level is the application of the wavelet thresholding technique. The updated system diagram is illustrated in Figure 7. The CFMA algorithm is inserted in the spectrum estimation sub-system; it takes the output of the DPSS multi-taper algorithm and produces the required smoothed output for the next layer of processing by the SURE wavelet algorithm.

Figure 7: Enhanced SCNSNSE system including the CFMA algorithm. (The block diagram matches Figure 6, with the CFMA algorithm inserted between the MTM (DPSS) algorithm and the SURE wavelet thresholding in the spectrum estimation layer.)

Therefore the new proposed SCNSNSE system [2] now consists of the following three processing layers. These three layers are:

1. Low-variance spectrum estimation and noise removal, applying the DPSS MT and CFMA techniques and segmental SURE wavelet thresholding.
2. A segmental noise estimation algorithm for speech-present and speech-absent data using three frequency sub-bands.
3. Spectral subtraction and the IFFT algorithm.

4.4 Noise estimation layer

Single channel segmental noise estimation, particularly for non-stationary noise, is a state of the art technology. The major drawback of most noise estimation algorithms is that they are either slow in tracking sudden increases of noise power or they overestimate the noise energy, resulting in speech distortion. In recent research there have been a few advancements in this field. Amongst these are the following works:

- Martin's minimum statistics [2].
- Minima controlled recursive averaging by Cohen and Berdugo [46].
- Minimum statistics frequency bins by Doblinger [23].
- Adaptation to sudden changes in noise power by Rangachari [7].

Of the methods mentioned above, only Doblinger [23] and Rangachari [7] provided better noise power estimation. Of these two, Doblinger's method [23] overestimated the noise, while Rangachari's [7] method adapted better to sudden changes while estimating the noise power spectrum. Rangachari [7] has shown how the noise estimation can be calculated to adapt to a highly non-stationary environment, and hence estimate the noise information more accurately. However, this process can be improved. The two papers by Doblinger [23] and Rangachari [7] have inspired our SCNSNSE noise estimation algorithms. The process consists of two parts: part one estimates the noise spectrum in speech-absent frames; part two estimates the noise spectrum in speech-present frames. Let:

y(m) = s(m) + n(m) \qquad (39)

where y(m) is the noisy speech signal, s(m) is the clean signal and n(m) is the additive noise. The smoothed power spectrum of the noisy speech signal, P(λ, k), can be estimated using the first-order recursive equation:

P(\lambda, k) = \eta P(\lambda-1, k) + (1-\eta) |Y(\lambda, k)|^2 \qquad (40)

where |Y(λ, k)|^2 is an estimate of the short-time power spectrum of y(m) obtained by wavelet thresholding the multi-taper spectrum of y(m) [3], η is a smoothing constant, λ is the frame index and k is the frequency bin index. Since the noisy speech power spectrum in speech-absent frames is equal to the power spectrum of the noise, we can update the estimate of the noise spectrum by tracking the speech-absent frames (with a software speech presence detector). To do that, we compute the ratio of the energy of the noisy speech power spectrum in three different frequency bands (low: 0-1 kHz, middle: 1-3 kHz, high: 3 kHz and above) to the energy of the corresponding frequency band in the previous noise estimate. The following three ratios are computed:

\xi_L(\lambda) = \frac{\sum_{k=1}^{LF} P(\lambda, k)}{\sum_{k=1}^{LF} N(\lambda-1, k)} \qquad (41)

\xi_M(\lambda) = \frac{\sum_{k=LF+1}^{MF} P(\lambda, k)}{\sum_{k=LF+1}^{MF} N(\lambda-1, k)} \qquad (42)

\xi_H(\lambda) = \frac{\sum_{k=MF+1}^{F_s/2} P(\lambda, k)}{\sum_{k=MF+1}^{F_s/2} N(\lambda-1, k)} \qquad (43)

where N(λ, k) is the estimate of the noise power spectrum at frame λ, and LF, MF and F_s correspond to the frequency bins of 1 kHz, 3 kHz and the sampling frequency respectively.
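The ratios (41)-(43) reduce to sub-band energy sums over the current smoothed spectrum and the previous noise estimate. The following Python sketch (illustrative; band edges are passed in as bin indices) computes them, and generalizes directly to the nine bands used in Chapter 5:

```python
import numpy as np

def band_energy_ratios(P, N_prev, band_edges):
    """Sub-band energy ratios of (41)-(43).

    P: smoothed noisy-speech power spectrum of the current frame, (40).
    N_prev: noise power-spectrum estimate from the previous frame.
    band_edges: bin indices delimiting the bands, e.g. [0, LF, MF, n_bins]
    for the three bands here; nine bands just need more edges.
    """
    return np.array([P[lo:hi].sum() / N_prev[lo:hi].sum()
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

# Speech-absent test: the frame is declared speech-absent when every
# ratio falls below the threshold sigma (see the update in (44) below).
# speech_absent = np.all(band_energy_ratios(P, N_prev, edges) < sigma)
```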

4.4.1 Speech absent noise estimation

If the above three ratios ξ_L(λ), ξ_M(λ) and ξ_H(λ) are all smaller than a threshold σ, the frame is concluded to be a speech-absent frame and the noise estimate is updated according to:

N(\lambda, k) = \varepsilon N(\lambda-1, k) + (1-\varepsilon) |Y(\lambda, k)|^2 \qquad (44)

where ε is a smoothing constant.

4.4.2 Speech present noise estimation

The proposed algorithm used for speech-present segments is based on first finding the minimum of the noisy speech spectrum, then using that minimum to determine the signal presence probability in sub-bands. The signal presence probability is used to determine a frequency-dependent smoothing parameter which replaces the fixed smoothing constant ε in (44). The local minimum of the noisy speech is computed by averaging past spectral values with a look-ahead factor, as defined in [2]:

if P_min(λ-1, k) < P(λ, k) then
    P_min(λ, k) = γ P_min(λ-1, k) + ((1-γ)/(1-β)) [P(λ, k) - β P(λ-1, k)]
else
    P_min(λ, k) = P(λ, k)
(45)

where P_min(λ, k) denotes the local minimum of the noisy speech power spectrum and γ and β are constants determined heuristically. Let

S_r(\lambda, k) = \frac{P(\lambda, k)}{P_{min}(\lambda, k)} \qquad (46)

denote the ratio between the energy of the noisy speech and its local minimum. This ratio is compared against a frequency-dependent threshold, and if it is found to be larger than that threshold, the corresponding frequency is considered to contain speech. Using the above ratio S_r(λ, k), the new frequency-dependent smoothing constant can be estimated as follows:

\alpha_s(\lambda, k) = \begin{cases} a_{sa}, & \text{if } S_r(\lambda, k) < \delta(k) \\ a_{sp}, & \text{otherwise} \end{cases} \qquad (47)

where a_{sa} and a_{sp} are smoothing constants (a_{sp} > a_{sa}) corresponding to speech absence and speech presence respectively, and δ(k) is a frequency-dependent threshold given by:

\delta(k) = \begin{cases} 1.3, & 0 < k \le LF \\ \delta_M, & LF < k \le MF \\ \delta_H, & MF < k \le F_s/2 \end{cases} \qquad (48)

where δ_M and δ_H are the corresponding fixed thresholds for the middle and high bands. Finally, after computing the frequency-dependent smoothing factor α_s(λ, k), the noise spectrum estimate is updated according to:

N(\lambda, k) = \alpha_s(\lambda, k) N(\lambda-1, k) + (1-\alpha_s(\lambda, k)) |Y(\lambda, k)|^2 \qquad (49)

To summarize this section: if the ratios defined in (41), (42) and (43) indicate that the current frame is a speech-absent frame, (44) is applied to update the noise spectrum. Otherwise, (45)-(49) are used to update the noise spectrum.
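Putting (45)-(49) together, one frame of the speech-present update can be sketched as follows (Python; all constants shown are illustrative placeholders, since the thesis only states that γ and β are set heuristically and that a_sp > a_sa):

```python
import numpy as np

def update_noise_speech_present(P, P_prev, Pmin_prev, N_prev, Y_pow,
                                delta, gamma=0.998, beta=0.8,
                                a_sa=0.85, a_sp=0.98):
    """One frame of the speech-present noise update, (45)-(49).

    All inputs are per-bin arrays; delta is the threshold map delta(k).
    """
    # (45): track the local minimum of the smoothed power spectrum.
    Pmin = np.where(Pmin_prev < P,
                    gamma * Pmin_prev
                    + ((1 - gamma) / (1 - beta)) * (P - beta * P_prev),
                    P)
    # (46): ratio of noisy-speech energy to its local minimum.
    Sr = P / Pmin
    # (47): frequency-dependent smoothing constant from the threshold map.
    a_s = np.where(Sr < delta, a_sa, a_sp)
    # (49): recursive noise-spectrum update.
    N = a_s * N_prev + (1 - a_s) * Y_pow
    return N, Pmin
```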

4.5 Post processing or enhancement layer

As seen from Figures 6 and 7, after completion of the segmental noise estimation, the estimated segmental noise is subtracted from the original signal segment in the frequency domain. A simple spectral subtraction is applied in this system to produce the outcome. All of the noisy speech segments are enhanced with this method. Once all segments are enhanced, a new enhanced sentence is reconstructed. This enhanced reconstructed sentence in the frequency domain, together with the original preserved phase information, is fed through the inverse FFT to produce the enhanced speech in the time domain.

4.6 Synopsis

In this chapter the concepts of short term and long term noise variation were introduced. The architecture for a SCNSNSE system was described, and the implementation and flow diagram for the system were defined. The benefit of reducing the low band frequency variation was also discussed. A new smoothing algorithm, the CFMA, was introduced. The CFMA is a low-pass filter across the spectral values that provides added smoothing; it is a moving average filter, but applied to spectral values across neighboring frames rather than across time samples. The CFMA provides an extra level of smoothing of the highly volatile low band frequency variation and prepares the noisy speech for further enhancement. The CFMA algorithm will be shown to be effective in increasing the SNR and preparing the noisy data so that the intelligibility of the enhanced speech is preserved.

Chapter 5

The Second Innovative Improvement to the SCNSNSE Architecture

5 The Second Innovative Improvement to the SCNSNSE Architecture

Our research into the SCNSNSE field identified two areas of improvement. One has been the pre-conditioning of the SE algorithms [3]; the other the post processing algorithms [7], where the noise is estimated from the noisy speech. Our study has been a holistic approach to single channel non-stationary noise speech enhancement. This means the complete SE system has been studied as a whole and the components that are essential for the SCNSNSE system have been identified. One of the identified spectrum estimation areas requiring improvement has been the removal of low-band frequency variation, as discussed in Chapter 4. The other identified area has been the detection of the wideband frequency variation at the post-processing layer. The focus of this study was to create and advance a SCNSNSE system so that it performs well when applied to non-stationary noises such as babble, machine gun and factory floor noises. As explained in the previous chapter, Hu's study [3] used a multi-taper algorithm to pre-condition the noisy speech (not specific to the SCNSNSE system) before applying wavelet thresholding. Hu's research [3] did not address non-stationary noisy speech where the input SNR level is low. Rangachari's [7] work studied noise estimation techniques to improve the noise estimate where there is a rapid change in the non-stationary noise environment. Rangachari's study [7] did not address the pre-conditioning of the noisy speech. This study has focused on these two areas for possible improvement. The SCNSNSE system introduced here combines a novel pre-conditioning algorithm with a novel noise estimation algorithm. In this system the spectrum estimation is critical to pre-condition the noisy speech before noise removal at the post-processing layer.

5.1 Improving the noise estimation layer using multi-channel threshold mapping

Noise estimation in the non-stationary environment is an art and still has a long way to go to be perfect. Currently, most noise estimation algorithms are either slow in tracking

sudden increases of noise power or they overestimate the noise energy, resulting in speech distortion. Cohen and Berdugo [48] tracked the noise-only regions by finding the ratio of the noisy speech to the local minimum over a fixed period of seconds. This method created lags of twice that period when the noise spectrum increased abruptly. Cohen and Berdugo [48], and later Rangachari [7], showed how the noise estimation can be calculated to adapt to a highly non-stationary environment. Rangachari's [7] work, in comparison with Martin's [2], Cohen's [48] and Doblinger's [23], showed that the noise estimation can be improved to adapt to sudden changes in noise power, and hence that the estimates of the noise information are more accurate. The approach we took in proposing a noise estimation technique was to comprehend the fundamental issues of noise estimation. Our rationale has been to understand the behavior of different non-stationary noises first, before proposing a solution. After studying noise characteristics such as babble, machine gun and factory floor noises, it was discovered that each noise has a different long-time average spectrum. Refer to Figure 5, which provides the long-time average spectrum of these non-stationary noises. The fact that the behavior of each non-stationary noise is considerably different led us to a new calculation of the localized minima ratio and a new Frequency Threshold Mapping (FTM). This subject was briefly discussed in section 4.1, and now we elaborate by comparing the average spectra of two typical non-stationary noises: babble noise in Figure 8 and factory floor noise in Figure 9. It can clearly be observed that the wide band frequency trends are substantially different. This clearly warrants a different FTM vector for each type of noise.

Figure 8: Average power spectrum of crowd noise, more commonly known as babble noise.

Figure 9: Average power spectrum of factory floor noise.

This study led us to better results by introducing the new localized minima ratio and the new FTM techniques with an increased number of sub-bands. In this system 9 frequency sub-bands have been chosen for non-stationary noises such as babble noise. The number of bands selected depends on the wide band frequency variation of the average power spectrum of the noise type. If the target is the telephone communication bandwidth range (which applies to our study) and the aim is to remove babble type noise, then 8-10 bands is a good assumption for achieving a reasonable result without a large CPU processing cost. In order to give extra flexibility, and perhaps better resolution than 0.5 kHz per band, we chose 9 bands. This resolution allows for more accurate threshold mapping applied to non-stationary noise such as babble type noise. The number of bands will depend on how rapidly the average power spectrum changes across the wide band frequency range of a particular noise type. For example, compare the average power spectrum slopes of Figure 8 and Figure 9. The average power spectrum slope of babble

noise is almost twice that of factory floor noise with respect to the wide band frequency variation. If the rate of the slope changes rapidly across the wide band frequency range from 0 to 4 kHz, then the bandwidth size, the number of frequency bands and the threshold detection level parameter are adjusted accordingly. This means creating a suitable FTM algorithm in order to obtain an effective noise estimation result. The proposed algorithm used for speech-present segments is based on first finding the minimum of the noisy speech spectrum, and using that minimum to determine the signal presence probability in sub-bands. Local minima ratios in the NE algorithms are generally derived for two conditions: the first is to estimate the noise spectrum in speech-absent frames; the second is to estimate the noise spectrum where speech is present. Since the noisy speech power spectrum in speech-absent frames is equal to the power spectrum of the noise, the noise spectrum can be estimated directly by tracking the speech-absent frames.

5.1.1 Speech absent noise calculation

As described above, to improve the noise estimation technique we increased the number of frequency bands, using a linear frequency separation bandwidth of 500 Hz. We compute the ratio of the energy of the noisy speech power spectrum in 9 different frequency bands (0-0.5, 0.5-1, 1-1.5, 1.5-2, 2-2.5, 2.5-3, 3-3.5, 3.5-4 kHz, and 4 kHz and above) to the energy of the corresponding frequency band in the previous noise estimate, and calculate the 9 ratios defined below:

\xi_L(\lambda) = \frac{\sum_{k=1}^{LF} P(\lambda, k)}{\sum_{k=1}^{LF} N(\lambda-1, k)} \qquad (50)

\xi_{M1}(\lambda) = \frac{\sum_{k=LF+1}^{MF1} P(\lambda, k)}{\sum_{k=LF+1}^{MF1} N(\lambda-1, k)} \qquad (51)

\xi_{M2}(\lambda) = \frac{\sum_{k=MF1+1}^{MF2} P(\lambda, k)}{\sum_{k=MF1+1}^{MF2} N(\lambda-1, k)} \qquad (52)

\xi_{M3}(\lambda) = \frac{\sum_{k=MF2+1}^{MF3} P(\lambda, k)}{\sum_{k=MF2+1}^{MF3} N(\lambda-1, k)} \qquad (53)

\xi_{M4}(\lambda) = \frac{\sum_{k=MF3+1}^{MF4} P(\lambda, k)}{\sum_{k=MF3+1}^{MF4} N(\lambda-1, k)} \qquad (54)

\xi_{M5}(\lambda) = \frac{\sum_{k=MF4+1}^{MF5} P(\lambda, k)}{\sum_{k=MF4+1}^{MF5} N(\lambda-1, k)} \qquad (55)

\xi_{M6}(\lambda) = \frac{\sum_{k=MF5+1}^{MF6} P(\lambda, k)}{\sum_{k=MF5+1}^{MF6} N(\lambda-1, k)} \qquad (56)

\xi_{M7}(\lambda) = \frac{\sum_{k=MF6+1}^{MF7} P(\lambda, k)}{\sum_{k=MF6+1}^{MF7} N(\lambda-1, k)} \qquad (57)

\xi_H(\lambda) = \frac{\sum_{k=MF7+1}^{F_s/2} P(\lambda, k)}{\sum_{k=MF7+1}^{F_s/2} N(\lambda-1, k)} \qquad (58)

where P(λ, k) is the smoothed power spectrum from (40), N(λ, k) is the estimate of the noise power spectrum at frame λ, and LF, MF1, MF2, MF3, MF4, MF5, MF6, MF7 and F_s correspond to the frequency bins of 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5 and 4.0 kHz and the sampling frequency respectively. We conclude that if the above 9 ratios ξ_L(λ), ξ_M1(λ), ξ_M2(λ), ξ_M3(λ), ξ_M4(λ), ξ_M5(λ), ξ_M6(λ), ξ_M7(λ) and ξ_H(λ) are all smaller than a threshold σ, then the frame is a speech-absent frame and the noise estimate is updated accordingly, as defined by:

N(\lambda, k) = \varepsilon N(\lambda-1, k) + (1-\varepsilon) |Y(\lambda, k)|^2 \qquad (59)

where ε is a smoothing constant. The noise threshold for (50) to (58) is set to σ = 1.03, and ε in (59) is a fixed smoothing constant.

5.1.2 Speech present noise calculation

The proposed algorithm used for speech-present segments is based on first finding the minimum of the noisy speech spectrum in each sub-band, and using that minimum to determine the signal presence probability in that sub-band. The signal presence probability is used to determine a frequency-dependent smoothing parameter which replaces the fixed smoothing constant ε in (59). Let

S_r(\lambda, k) = \frac{P(\lambda, k)}{P_{min}(\lambda, k)} \qquad (60)

denote the ratio between the energy of the noisy speech and its local minimum, where P_{min}(λ, k) is defined in (45). This ratio is compared against a frequency-dependent threshold, and if it is found to be larger than that threshold, the corresponding frequency is considered to contain speech. Using the above ratio S_r(λ, k), the new frequency-dependent smoothing constant is calculated as follows:

\alpha_s(\lambda, k) = \begin{cases} a_{sa}, & \text{if } S_r(\lambda, k) < \delta(k) \\ a_{sp}, & \text{otherwise} \end{cases} \qquad (61)

where a_{sa} and a_{sp} are smoothing constants (a_{sp} > a_{sa}) corresponding to speech absence and speech presence respectively. Our novel proposal is a more sophisticated formulation of the frequency-dependent threshold δ(k) when compared to (48), and is given by:

\delta(k) = \begin{cases} 0.61, & 0 < k \le LF \\ 0.99, & LF < k \le MF1 \\ \delta_3, & MF1 < k \le MF2 \\ \delta_4, & MF2 < k \le MF3 \\ \delta_5, & MF3 < k \le MF4 \\ \delta_6, & MF4 < k \le MF5 \\ \delta_7, & MF5 < k \le MF6 \\ \delta_8, & MF6 < k \le MF7 \\ \delta_9, & MF7 < k \le F_s/2 \end{cases} \qquad (62)

where the remaining per-band constants δ_3 to δ_9 are tuned to the noise profile as described below. As can be seen from (62), we deviate from the previous proposal suggested by Rangachari in [7]. Rangachari suggested only three threshold mapping values for 3 frequency channels; we improved on this by increasing them to 9 channels. When speech is present in any of the 9 frequency bands, the speech-present algorithm is applied to estimate the noise. The threshold map given by (62) is initially estimated for the babble type noise characteristics and then fine-tuned through a trial and error process. This procedure can be repeated for other noise profiles (e.g. factory noise). The relationship to consider between Figure 8 and (62) is the slope of the wide band frequency variation of the average power spectrum of the babble noise. The rate of change of the wide band frequency variation influences the threshold set for the speech absent or present level and should be set accordingly in each frequency band. If the wide band frequency variation changes rapidly within a band, then there is justification for increasing the number of frequency bands to accommodate a better threshold setting. Comparing Figure 8 and Figure 9, it can be seen how the babble noise average power drops dramatically within the 1 kHz range, as opposed to the factory floor noise average power spectrum. This fact provides some justification for setting the threshold low for better cropping of the speech-present frames in each of the frequency bands. That is why these threshold values have been decreased from the 1.3 set in the conventional algorithm shown by (48), and set to 0.61 and 0.99 in the 0-0.5 kHz and 0.5-1 kHz frequency bands respectively. Applying the same rationale, the other 7 channel thresholds are set accordingly. Two major changes are apparent in this technique; a sketch of the resulting threshold map follows the list:

1. Increasing the number of frequency bands (e.g. splitting the single channel below 1 kHz into two channels, 0-0.5 kHz and 0.5-1 kHz).
2. Setting an appropriate threshold for each band.
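In code, the threshold map of (62) is simply a per-bin lookup vector built from per-band constants. The sketch below (Python; only the first two thresholds, 0.61 and 0.99, come from the text, and the remaining values are hypothetical placeholders to be tuned per noise profile) constructs it for a 512-point frame at 16 kHz:

```python
import numpy as np

def build_threshold_map(n_bins, fs, band_thresholds):
    """Frequency Threshold Map (FTM) of (62) as a per-bin lookup vector.

    band_thresholds: one threshold per band (eight 500 Hz bands from
    0 to 4 kHz, plus the band above 4 kHz).
    """
    delta = np.empty(n_bins)
    freqs = np.arange(n_bins) * (fs / 2) / (n_bins - 1)  # rfft bin freqs
    edges = np.array([0.0, 500, 1000, 1500, 2000, 2500,
                      3000, 3500, 4000, fs / 2])
    for b, thr in enumerate(band_thresholds):
        mask = (freqs > edges[b]) & (freqs <= edges[b + 1])
        delta[mask] = thr
    delta[0] = band_thresholds[0]   # include the DC bin in the low band
    return delta

# Example: 0.61 and 0.99 from the text; the other values are placeholders.
delta = build_threshold_map(257, 16000,
                            [0.61, 0.99, 1.1, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4])
```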

The key step in this process is fine tuning the output results through a trial and error feedback process. This was achieved by adjusting the threshold values in (62), fine tuning the bias adjustment value and analyzing the average gain at the output for each band. Finally, after computing the frequency-dependent smoothing factor α_s(λ, k), the noise spectrum estimate is updated according to:

N(\lambda, k) = \alpha_s(\lambda, k) N(\lambda-1, k) + (1-\alpha_s(\lambda, k)) |Y(\lambda, k)|^2 \qquad (63)

The complete system architecture is presented in Figure 10.

Figure 10: Enhanced SCNSNSE system. (The block diagram matches Figure 7, with the digital filter now providing 9 frequency bands: 0-0.5, 0.5-1, 1-1.5, 1.5-2, 2-2.5, 2.5-3, 3-3.5, 3.5-4 and 4.0-Fs/2 kHz; layer two compares the ratio of each of the 9 sub-bands against its threshold to select the speech-present or speech-absent noise estimation algorithm.)

5.2 Synopsis

In this chapter the benefit of reducing high band frequency variation was discussed. The wide band frequency variation and the FTM algorithms are the components that make up the Multi-Channel Threshold Mapping (MCTM) architecture. Discussion of the FTM algorithm led to the concept of tuning to the wide band frequency variation of different noise types. Consequently, to improve the noise estimation algorithm, the concept of fine-tuning to the wide band frequency variation was put forward. The MCTM system increased the accuracy of the wide band mapping for noise estimation. We proposed a set of new algorithms that adapt better to the rises and falls of non-stationary noises and hence estimate the noise spectrum more accurately in a highly noisy environment. The results of the tests documented in the next chapter show how the amplitude and quality of the noisy speech in a SCNSNSE system can be improved in a highly volatile noisy environment (0 dB and below) using this architecture.

Chapter 6

Performance Evaluation of the Proposed SCNSNSE System

6 Performance Evaluation of the Proposed SCNSNSE System

6.1 Overview

The validity of most technical achievements is demonstrated by producing verifiable results. No result is proven valid unless it has exploited known standard methodologies and standard data. In SE there is a need for a standard set of speech data; the DARPA TIMIT sentence speech data is one such collection. DARPA TIMIT has been available for many years and is widely utilized by the research community. We have adopted the DARPA TIMIT CD-ROM (CD 1-1.1) speech sentences as our input speech data. The well-known NOISEX database has been adopted as the additive noise data. Furthermore, these noise data were shaped before being mixed and added as input to the system, as shown in Figure 11 for the case of shaped babble noise. The shaping deliberately emphasizes a noise profile which is highly time variant and non-stationary. For this purpose some parts of the noise signal were flattened to zeros (i.e. on for 1500 samples and off for 500 samples, repeated) to create an abrupt noise signal. The on/off non-stationary noise is used to create extreme abrupt situations to test the responsiveness of the algorithms under test. The NOISEX database includes white noise, Volvo car noise, pink noise, machine gun noise and non-stationary babble noise. These noise types are distinctly different in their characteristics. Of these, the babble noise is particularly important as it is very close to a real world problem such as a crowded bus or train station.
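The on/off shaping described above is straightforward to reproduce. A minimal Python sketch (the 1500/500 sample pattern is taken from the text) is:

```python
import numpy as np

def shape_noise_on_off(noise, on_len=1500, off_len=500):
    """Impose the abrupt on/off profile described in the text: the noise
    is kept for on_len samples, zeroed for off_len samples, and the
    pattern repeats, producing a highly non-stationary noise signal."""
    shaped = noise.copy()
    period = on_len + off_len
    for start in range(on_len, len(noise), period):
        shaped[start:start + off_len] = 0.0
    return shaped
```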

Figure 11: An example of the shaped babble noise.

Some of the current standard evaluation tools and methodologies applied in SE research are:

1. The Auto Regressive (AR) algorithm, to examine the smoothness and accuracy of the spectrum estimation layer.
2. Analysis of the enhancement as measured by standard objective measures (i.e. the Global and Segmental SNRs).
3. Analysis of the enhancement as measured by perceptual measures (i.e. PESQ).

6.2 Evaluation strategy

The proposed evaluation strategy for our investigation is summarized below:

1. Test the SCNSNSE system initially with one type of non-stationary noise (i.e. babble noise) at low SNR levels (below zero dB).
2. Apply and focus on the improvement for one type of noise before applying the system to other types of non-stationary noise.

3. Apply 0, -2.5, -5, and -7.5 dB SNR as the base levels for all testing. These severe SNR conditions were chosen to demonstrate the importance of the enhanced smoothing provided by the proposed CFMA and FTM in dealing with such conditions.
4. Apply at least 9 mixed sentences of male and female speakers from DARPA TIMIT (sampling rate of 16 kHz) for each new algorithm developed.
5. Investigate the Auto Regressive (AR) input to assess spectrum estimation accuracy and smoothness.
6. Investigate the PESQ and the global and segmental SNR as the core evaluation methodologies.
7. Apply a base enhancement method for comparison. The base method chosen was the well-known simple Spectral Subtraction (SS) algorithm. It must be noted that the SS algorithm used in this context utilizes the initial frames of silence data as the noise estimate.

To compare the existing methods with the state of the art methods developed here, we test and compare the different techniques as follows:

- Test a SCNSNSE system with the simple SS algorithm.
- Test a SCNSNSE system including the 3-channel method of Hu [3] (the conventional method).
- Test a SCNSNSE system including the 3-channel method of Hu [3] (the conventional method) with the proposed CFMA algorithm as an extra smoothing layer.
- Test the proposed 9-channel SCNSNSE system with the CFMA algorithm.

6.3 The CFMA and the 9-channel FTM algorithms: performance evaluations

6.3.1 Spectrum estimation of clean AR data

The AR algorithms can be applied as a measure to gauge the reduction in narrow and wide band noise variation. The result is usually presented graphically to show the smoothness of the signal. Consider Figure 12 and Figure 13, which plot the spectra from the DPSS MT, the CFMA, and the segmental SURE algorithms against the AR4 process using Gaussian noise as the excitation. As can be seen from the graphs in Figure 12 and

Figure 13, the test signal spectrum is greatly smoothed when passed through the CFMA algorithm (bottom panels of the figures). The choice of frame length (the j value) in the CFMA algorithm described in (37) is vital for an optimal result. If the frame length is too big, vital wide band information is lost (see Figure 13). However, if the frame length (the j value in (37)) is too small, the narrow band variation is not greatly removed. One of the best results for the CFMA algorithm was achieved when j in (37) was set to 2. The result shown in the spectral graphical representation of Figure 12 is also presented as Mean Square Error (MSE) measures between the true and estimated spectra. The MSE of 20 randomly generated white noise sequences with different deviations was averaged, then recalculated after each algorithm was applied, and the results are tabulated in Table 1 below.

AR4 spectrum estimation algorithm | Averaged MSE over 20 randomly generated sequences | Percentage error reduction over the original DFT
The DFT using a Hamming window | |
DPSS multi-taper algorithm | | %
The DPSS and CFMA algorithms | | %
The DPSS, CFMA and segmental SURE wavelet algorithms | | %

Table 1: Mean square error test results for the CFMA algorithm.

The narrow band frequency variation reduced considerably after each successive algorithm was applied, as shown in Table 1. The CFMA algorithm is an example of how a spectrum estimation smoothing algorithm is able to remove low band spectrum noise variation without compromising the wide band spectrum data. The smoothing algorithms, the DPSS MT, the CFMA and the segmental SURE wavelet, applied in sequence, improve the noise estimation techniques.

Figure 12: Comparison of the power spectrum of a white-noise-excited AR4 process estimated by the direct Hamming window method (top panel); the DPSS MT method (N = 2048, L = 5) (second panel); the DPSS MT (N = 2048, L = 5) and the CFMA method with j = 2 in (37) (third panel); and the DPSS, the CFMA (j = 2 in (37)) and segmental SURE wavelet applied together (bottom panel).

Figure 13: Comparison of the power spectrum of a white-noise-excited AR4 process estimated by the direct Hamming window method (top panel); the DPSS MT method (N = 2048, L = 5) (second panel); the DPSS MT (N = 2048, L = 5) and the CFMA method with j = 15 in (37) (third panel); and the DPSS, the CFMA (j = 15 in (37)) and segmental SURE wavelet applied together (bottom panel).

6.3.2 Global SNR, segmental SNR and PESQ evaluations of noisy speech

The global SNR defined in (22) and the segmental SNR defined in (20) are based on the implementations described by Quackenbush S. R. [57] and Papamichalis P. E. [58]. The perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment, was carried out based on the International Telecommunication Union recommendation ITU-T P.862 [51].


Audio Restoration Based on DSP Tools

Audio Restoration Based on DSP Tools Audio Restoration Based on DSP Tools EECS 451 Final Project Report Nan Wu School of Electrical Engineering and Computer Science University of Michigan Ann Arbor, MI, United States wunan@umich.edu Abstract

More information

High-speed Noise Cancellation with Microphone Array

High-speed Noise Cancellation with Microphone Array Noise Cancellation a Posteriori Probability, Maximum Criteria Independent Component Analysis High-speed Noise Cancellation with Microphone Array We propose the use of a microphone array based on independent

More information

Enhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients

Enhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds

More information

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner.

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner. Perception of pitch AUDL4007: 11 Feb 2010. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum, 2005 Chapter 7 1 Definitions

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb 2009. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb 2008. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum,

More information

Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement

Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement 1 Zeeshan Hashmi Khateeb, 2 Gopalaiah 1,2 Department of Instrumentation

More information

Wavelet Speech Enhancement based on the Teager Energy Operator

Wavelet Speech Enhancement based on the Teager Energy Operator Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose

More information

SOUND SOURCE RECOGNITION AND MODELING

SOUND SOURCE RECOGNITION AND MODELING SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa

Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa Spring 2008 Introduction Problem Formulation Possible Solutions Proposed Algorithm Experimental Results Conclusions

More information

Digitally controlled Active Noise Reduction with integrated Speech Communication

Digitally controlled Active Noise Reduction with integrated Speech Communication Digitally controlled Active Noise Reduction with integrated Speech Communication Herman J.M. Steeneken and Jan Verhave TNO Human Factors, Soesterberg, The Netherlands herman@steeneken.com ABSTRACT Active

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 2, Issue 11, November 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Review of

More information

COMP 546, Winter 2017 lecture 20 - sound 2

COMP 546, Winter 2017 lecture 20 - sound 2 Today we will examine two types of sounds that are of great interest: music and speech. We will see how a frequency domain analysis is fundamental to both. Musical sounds Let s begin by briefly considering

More information

Speech Synthesis using Mel-Cepstral Coefficient Feature

Speech Synthesis using Mel-Cepstral Coefficient Feature Speech Synthesis using Mel-Cepstral Coefficient Feature By Lu Wang Senior Thesis in Electrical Engineering University of Illinois at Urbana-Champaign Advisor: Professor Mark Hasegawa-Johnson May 2018 Abstract

More information

Auditory modelling for speech processing in the perceptual domain

Auditory modelling for speech processing in the perceptual domain ANZIAM J. 45 (E) ppc964 C980, 2004 C964 Auditory modelling for speech processing in the perceptual domain L. Lin E. Ambikairajah W. H. Holmes (Received 8 August 2003; revised 28 January 2004) Abstract

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 Speech and telephone speech Based on a voice production model Parametric representation

More information

Enhancement of Speech in Noisy Conditions

Enhancement of Speech in Noisy Conditions Enhancement of Speech in Noisy Conditions Anuprita P Pawar 1, Asst.Prof.Kirtimalini.B.Choudhari 2 PG Student, Dept. of Electronics and Telecommunication, AISSMS C.O.E., Pune University, India 1 Assistant

More information

Measuring the complexity of sound

Measuring the complexity of sound PRAMANA c Indian Academy of Sciences Vol. 77, No. 5 journal of November 2011 physics pp. 811 816 Measuring the complexity of sound NANDINI CHATTERJEE SINGH National Brain Research Centre, NH-8, Nainwal

More information

Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta

Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied

More information

Speech Enhancement Techniques using Wiener Filter and Subspace Filter

Speech Enhancement Techniques using Wiener Filter and Subspace Filter IJSTE - International Journal of Science Technology & Engineering Volume 3 Issue 05 November 2016 ISSN (online): 2349-784X Speech Enhancement Techniques using Wiener Filter and Subspace Filter Ankeeta

More information

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction Human performance Reverberation

More information

SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS

SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS 17th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS Jürgen Freudenberger, Sebastian Stenzel, Benjamin Venditti

More information

INFLUENCE OF FREQUENCY DISTRIBUTION ON INTENSITY FLUCTUATIONS OF NOISE

INFLUENCE OF FREQUENCY DISTRIBUTION ON INTENSITY FLUCTUATIONS OF NOISE INFLUENCE OF FREQUENCY DISTRIBUTION ON INTENSITY FLUCTUATIONS OF NOISE Pierre HANNA SCRIME - LaBRI Université de Bordeaux 1 F-33405 Talence Cedex, France hanna@labriu-bordeauxfr Myriam DESAINTE-CATHERINE

More information

Robust Voice Activity Detection Based on Discrete Wavelet. Transform

Robust Voice Activity Detection Based on Discrete Wavelet. Transform Robust Voice Activity Detection Based on Discrete Wavelet Transform Kun-Ching Wang Department of Information Technology & Communication Shin Chien University kunching@mail.kh.usc.edu.tw Abstract This paper

More information

A Digital Signal Processor for Musicians and Audiophiles Published on Monday, 09 February :54

A Digital Signal Processor for Musicians and Audiophiles Published on Monday, 09 February :54 A Digital Signal Processor for Musicians and Audiophiles Published on Monday, 09 February 2009 09:54 The main focus of hearing aid research and development has been on the use of hearing aids to improve

More information

Mikko Myllymäki and Tuomas Virtanen

Mikko Myllymäki and Tuomas Virtanen NON-STATIONARY NOISE MODEL COMPENSATION IN VOICE ACTIVITY DETECTION Mikko Myllymäki and Tuomas Virtanen Department of Signal Processing, Tampere University of Technology Korkeakoulunkatu 1, 3370, Tampere,

More information

Speech Enhancement Based On Noise Reduction

Speech Enhancement Based On Noise Reduction Speech Enhancement Based On Noise Reduction Kundan Kumar Singh Electrical Engineering Department University Of Rochester ksingh11@z.rochester.edu ABSTRACT This paper addresses the problem of signal distortion

More information

RECENTLY, there has been an increasing interest in noisy

RECENTLY, there has been an increasing interest in noisy IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 9, SEPTEMBER 2005 535 Warped Discrete Cosine Transform-Based Noisy Speech Enhancement Joon-Hyuk Chang, Member, IEEE Abstract In

More information

Speech Synthesis; Pitch Detection and Vocoders

Speech Synthesis; Pitch Detection and Vocoders Speech Synthesis; Pitch Detection and Vocoders Tai-Shih Chi ( 冀泰石 ) Department of Communication Engineering National Chiao Tung University May. 29, 2008 Speech Synthesis Basic components of the text-to-speech

More information

Robust Low-Resource Sound Localization in Correlated Noise

Robust Low-Resource Sound Localization in Correlated Noise INTERSPEECH 2014 Robust Low-Resource Sound Localization in Correlated Noise Lorin Netsch, Jacek Stachurski Texas Instruments, Inc. netsch@ti.com, jacek@ti.com Abstract In this paper we address the problem

More information

III. Publication III. c 2005 Toni Hirvonen.

III. Publication III. c 2005 Toni Hirvonen. III Publication III Hirvonen, T., Segregation of Two Simultaneously Arriving Narrowband Noise Signals as a Function of Spatial and Frequency Separation, in Proceedings of th International Conference on

More information

SPEECH ENHANCEMENT WITH SIGNAL SUBSPACE FILTER BASED ON PERCEPTUAL POST FILTERING

SPEECH ENHANCEMENT WITH SIGNAL SUBSPACE FILTER BASED ON PERCEPTUAL POST FILTERING SPEECH ENHANCEMENT WITH SIGNAL SUBSPACE FILTER BASED ON PERCEPTUAL POST FILTERING K.Ramalakshmi Assistant Professor, Dept of CSE Sri Ramakrishna Institute of Technology, Coimbatore R.N.Devendra Kumar Assistant

More information

The psychoacoustics of reverberation

The psychoacoustics of reverberation The psychoacoustics of reverberation Steven van de Par Steven.van.de.Par@uni-oldenburg.de July 19, 2016 Thanks to Julian Grosse and Andreas Häußler 2016 AES International Conference on Sound Field Control

More information

Available online at ScienceDirect. Procedia Computer Science 54 (2015 )

Available online at   ScienceDirect. Procedia Computer Science 54 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 54 (2015 ) 574 584 Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015) Speech Enhancement

More information

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Author Shannon, Ben, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium

More information

Keywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding.

Keywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding. Volume 5, Issue 2, February 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Speech Enhancement

More information

Modified Kalman Filter-based Approach in Comparison with Traditional Speech Enhancement Algorithms from Adverse Noisy Environments

Modified Kalman Filter-based Approach in Comparison with Traditional Speech Enhancement Algorithms from Adverse Noisy Environments Modified Kalman Filter-based Approach in Comparison with Traditional Speech Enhancement Algorithms from Adverse Noisy Environments G. Ramesh Babu 1 Department of E.C.E, Sri Sivani College of Engg., Chilakapalem,

More information

Single Channel Speaker Segregation using Sinusoidal Residual Modeling

Single Channel Speaker Segregation using Sinusoidal Residual Modeling NCC 2009, January 16-18, IIT Guwahati 294 Single Channel Speaker Segregation using Sinusoidal Residual Modeling Rajesh M Hegde and A. Srinivas Dept. of Electrical Engineering Indian Institute of Technology

More information

Noise Reduction: An Instructional Example

Noise Reduction: An Instructional Example Noise Reduction: An Instructional Example VOCAL Technologies LTD July 1st, 2012 Abstract A discussion on general structure of noise reduction algorithms along with an illustrative example are contained

More information

Speech Signal Analysis

Speech Signal Analysis Speech Signal Analysis Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 2&3 14,18 January 216 ASR Lectures 2&3 Speech Signal Analysis 1 Overview Speech Signal Analysis for

More information

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping Structure of Speech Physical acoustics Time-domain representation Frequency domain representation Sound shaping Speech acoustics Source-Filter Theory Speech Source characteristics Speech Filter characteristics

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

Quality Estimation of Alaryngeal Speech

Quality Estimation of Alaryngeal Speech Quality Estimation of Alaryngeal Speech R.Dhivya #, Judith Justin *2, M.Arnika #3 #PG Scholars, Department of Biomedical Instrumentation Engineering, Avinashilingam University Coimbatore, India dhivyaramasamy2@gmail.com

More information

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012

Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 Preeti Rao 2 nd CompMusicWorkshop, Istanbul 2012 o Music signal characteristics o Perceptual attributes and acoustic properties o Signal representations for pitch detection o STFT o Sinusoidal model o

More information

CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS

CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS 66 CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS 4.1 INTRODUCTION New frontiers of speech technology are demanding increased levels of performance in many areas. In the advent of Wireless Communications

More information

Speech Enhancement Based on Audible Noise Suppression

Speech Enhancement Based on Audible Noise Suppression IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 6, NOVEMBER 1997 497 Speech Enhancement Based on Audible Noise Suppression Dionysis E. Tsoukalas, John N. Mourjopoulos, Member, IEEE, and George

More information

Estimation of Non-stationary Noise Power Spectrum using DWT

Estimation of Non-stationary Noise Power Spectrum using DWT Estimation of Non-stationary Noise Power Spectrum using DWT Haripriya.R.P. Department of Electronics & Communication Engineering Mar Baselios College of Engineering & Technology, Kerala, India Lani Rachel

More information

Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution

Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution PAGE 433 Accurate Delay Measurement of Coded Speech Signals with Subsample Resolution Wenliang Lu, D. Sen, and Shuai Wang School of Electrical Engineering & Telecommunications University of New South Wales,

More information

DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS

DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS John Yong Jia Chen (Department of Electrical Engineering, San José State University, San José, California,

More information

Modulation Domain Spectral Subtraction for Speech Enhancement

Modulation Domain Spectral Subtraction for Speech Enhancement Modulation Domain Spectral Subtraction for Speech Enhancement Author Paliwal, Kuldip, Schwerin, Belinda, Wojcicki, Kamil Published 9 Conference Title Proceedings of Interspeech 9 Copyright Statement 9

More information

Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation

Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation Dominant Voiced Speech Segregation Using Onset Offset Detection and IBM Based Segmentation Shibani.H 1, Lekshmi M S 2 M. Tech Student, Ilahia college of Engineering and Technology, Muvattupuzha, Kerala,

More information

Implementation of SYMLET Wavelets to Removal of Gaussian Additive Noise from Speech Signal

Implementation of SYMLET Wavelets to Removal of Gaussian Additive Noise from Speech Signal Implementation of SYMLET Wavelets to Removal of Gaussian Additive Noise from Speech Signal Abstract: MAHESH S. CHAVAN, * NIKOS MASTORAKIS, MANJUSHA N. CHAVAN, *** M.S. GAIKWAD Department of Electronics

More information

Speech Enhancement in Noisy Environment using Kalman Filter

Speech Enhancement in Noisy Environment using Kalman Filter Speech Enhancement in Noisy Environment using Kalman Filter Erukonda Sravya 1, Rakesh Ranjan 2, Nitish J. Wadne 3 1, 2 Assistant professor, Dept. of ECE, CMR Engineering College, Hyderabad (India) 3 PG

More information

A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL

A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL 9th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, -7 SEPTEMBER 7 A CLOSER LOOK AT THE REPRESENTATION OF INTERAURAL DIFFERENCES IN A BINAURAL MODEL PACS: PACS:. Pn Nicolas Le Goff ; Armin Kohlrausch ; Jeroen

More information

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

8.3 Basic Parameters for Audio

8.3 Basic Parameters for Audio 8.3 Basic Parameters for Audio Analysis Physical audio signal: simple one-dimensional amplitude = loudness frequency = pitch Psycho-acoustic features: complex A real-life tone arises from a complex superposition

More information

Audio Imputation Using the Non-negative Hidden Markov Model

Audio Imputation Using the Non-negative Hidden Markov Model Audio Imputation Using the Non-negative Hidden Markov Model Jinyu Han 1,, Gautham J. Mysore 2, and Bryan Pardo 1 1 EECS Department, Northwestern University 2 Advanced Technology Labs, Adobe Systems Inc.

More information

Wavelet Based Adaptive Speech Enhancement

Wavelet Based Adaptive Speech Enhancement Wavelet Based Adaptive Speech Enhancement By Essa Jafer Essa B.Eng, MSc. Eng A thesis submitted for the degree of Master of Engineering Department of Electronic and Computer Engineering University of Limerick

More information

Audio Signal Compression using DCT and LPC Techniques

Audio Signal Compression using DCT and LPC Techniques Audio Signal Compression using DCT and LPC Techniques P. Sandhya Rani#1, D.Nanaji#2, V.Ramesh#3,K.V.S. Kiran#4 #Student, Department of ECE, Lendi Institute Of Engineering And Technology, Vizianagaram,

More information

Analysis of the SNR Estimator for Speech Enhancement Using a Cascaded Linear Model

Analysis of the SNR Estimator for Speech Enhancement Using a Cascaded Linear Model Analysis of the SNR Estimator for Speech Enhancement Using a Cascaded Linear Model Harjeet Kaur Ph.D Research Scholar I.K.Gujral Punjab Technical University Jalandhar, Punjab, India Rajneesh Talwar Principal,Professor

More information

Voice Activity Detection

Voice Activity Detection Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class

More information

Phase estimation in speech enhancement unimportant, important, or impossible?

Phase estimation in speech enhancement unimportant, important, or impossible? IEEE 7-th Convention of Electrical and Electronics Engineers in Israel Phase estimation in speech enhancement unimportant, important, or impossible? Timo Gerkmann, Martin Krawczyk, and Robert Rehr Speech

More information

Automotive three-microphone voice activity detector and noise-canceller

Automotive three-microphone voice activity detector and noise-canceller Res. Lett. Inf. Math. Sci., 005, Vol. 7, pp 47-55 47 Available online at http://iims.massey.ac.nz/research/letters/ Automotive three-microphone voice activity detector and noise-canceller Z. QI and T.J.MOIR

More information

Detection, Interpolation and Cancellation Algorithms for GSM burst Removal for Forensic Audio

Detection, Interpolation and Cancellation Algorithms for GSM burst Removal for Forensic Audio >Bitzer and Rademacher (Paper Nr. 21)< 1 Detection, Interpolation and Cancellation Algorithms for GSM burst Removal for Forensic Audio Joerg Bitzer and Jan Rademacher Abstract One increasing problem for

More information

Recent Advances in Acoustic Signal Extraction and Dereverberation

Recent Advances in Acoustic Signal Extraction and Dereverberation Recent Advances in Acoustic Signal Extraction and Dereverberation Emanuël Habets Erlangen Colloquium 2016 Scenario Spatial Filtering Estimated Desired Signal Undesired sound components: Sensor noise Competing

More information

Chapter IV THEORY OF CELP CODING

Chapter IV THEORY OF CELP CODING Chapter IV THEORY OF CELP CODING CHAPTER IV THEORY OF CELP CODING 4.1 Introduction Wavefonn coders fail to produce high quality speech at bit rate lower than 16 kbps. Source coders, such as LPC vocoders,

More information

Introduction of Audio and Music

Introduction of Audio and Music 1 Introduction of Audio and Music Wei-Ta Chu 2009/12/3 Outline 2 Introduction of Audio Signals Introduction of Music 3 Introduction of Audio Signals Wei-Ta Chu 2009/12/3 Li and Drew, Fundamentals of Multimedia,

More information

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Project Proposal Avner Halevy Department of Mathematics University of Maryland, College Park ahalevy at math.umd.edu

More information

Removal of Line Noise Component from EEG Signal

Removal of Line Noise Component from EEG Signal 1 Removal of Line Noise Component from EEG Signal Removal of Line Noise Component from EEG Signal When carrying out time-frequency analysis, if one is interested in analysing frequencies above 30Hz (i.e.

More information

Binaural Hearing. Reading: Yost Ch. 12

Binaural Hearing. Reading: Yost Ch. 12 Binaural Hearing Reading: Yost Ch. 12 Binaural Advantages Sounds in our environment are usually complex, and occur either simultaneously or close together in time. Studies have shown that the ability to

More information

Available online at ScienceDirect. Procedia Computer Science 89 (2016 )

Available online at   ScienceDirect. Procedia Computer Science 89 (2016 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 89 (2016 ) 666 676 Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016) Comparison of Speech

More information

X. SPEECH ANALYSIS. Prof. M. Halle G. W. Hughes H. J. Jacobsen A. I. Engel F. Poza A. VOWEL IDENTIFIER

X. SPEECH ANALYSIS. Prof. M. Halle G. W. Hughes H. J. Jacobsen A. I. Engel F. Poza A. VOWEL IDENTIFIER X. SPEECH ANALYSIS Prof. M. Halle G. W. Hughes H. J. Jacobsen A. I. Engel F. Poza A. VOWEL IDENTIFIER Most vowel identifiers constructed in the past were designed on the principle of "pattern matching";

More information

Testing of Objective Audio Quality Assessment Models on Archive Recordings Artifacts

Testing of Objective Audio Quality Assessment Models on Archive Recordings Artifacts POSTER 25, PRAGUE MAY 4 Testing of Objective Audio Quality Assessment Models on Archive Recordings Artifacts Bc. Martin Zalabák Department of Radioelectronics, Czech Technical University in Prague, Technická

More information