Automatic Speech Recognition Adaptation for Various Noise Levels


Automatic Speech Recognition Adaptation for Various Noise Levels

by Azhar Sabah Abdulaziz

Bachelor of Science in Computer Engineering, College of Engineering, University of Mosul, 2002
Master of Science in Communication and Signal Processing, College of Agricultural, Science and Engineering, Electrical and Electronics Engineering, Newcastle University, 2009

A dissertation submitted to the College of Engineering, Department of Electrical and Computer Engineering at Florida Institute of Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering

Melbourne, Florida
May 2018

© Copyright 2018 Azhar Sabah Abdulaziz. All Rights Reserved.
The author grants permission to make single copies.

We the undersigned committee hereby recommend that the attached document be accepted as fulfilling in part the requirements for the degree of Doctor of Philosophy in Computer Engineering.

Automatic Speech Recognition Adaptation for Various Noise Levels, a dissertation by Azhar Sabah Abdulaziz

Veton Z. Këpuska, Ph.D., Professor, Electrical and Computer Engineering, Dissertation Advisor
Samuel Kozaitis, Ph.D., Professor, Department Head, Electrical and Computer Engineering
Josko Zec, Ph.D., Associate Professor, Electrical and Computer Engineering
Nezamoddin Nezamoddini-Kachouie, Ph.D., Assistant Professor, Mathematical Sciences

Abstract

TITLE: Automatic Speech Recognition Adaptation for Various Noise Levels
AUTHOR: Azhar Sabah Abdulaziz
MAJOR ADVISOR: Veton Z. Këpuska, Ph.D.

Automatic speech recognition (ASR) is a set of complicated algorithms that convert the intended spoken utterance into a textual form. Acoustic features, which are extracted from the speech signal, are matched against a trained network of linguistic and acoustic models. ASR performance degrades significantly when the ambient noise differs from that of the training data. Many approaches have been introduced to address this problem, with various degrees of complexity and improvement. The general approaches to this issue fall into three categories: enhancing the features, training a general acoustic model, and transforming the models to match the noisy features.

The acoustic noise is added to the training speech data after they are collected, for two reasons: firstly, because the data are usually recorded in a specific environment, and secondly, to control the environments during the training and testing phases. The speech and noise signals are usually combined in the electrical domain using straightforward linear addition. Although this procedure is commonly used, it is

investigated in depth in this research. It has been proven that linear addition is no more than an approximation of the real acoustic combination, and that it is valid if the speech and noise are non-coherent signals.

The adaptive model switching (AMS) solution is proposed, in which the ASR measures the noise level and then picks the model expected to produce the fewest errors. This solution is a trade-off between model generalization and model transformation, so that both the error and the speed costs are kept as low as possible. The short time of silence (STS) estimator, which is a signal-to-noise ratio (SNR) level detector, was designed specifically for the proposed system. The proposed AMS approach is a general recipe that could be applied to any other ASR system, although it was tested on a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) recognizer. The AMS ASR outperformed model generalization and multiple-decoder maximum-score voting in both accuracy and decoding speed. The average relative error rate reduction was around 34.11%, with a relative decoding speed improvement of about 37.79%, both compared to the baseline ASR.

Contents

List of Figures
List of Tables
Abbreviations
Acknowledgments
Dedication

1 Introduction
    Automatic Speech Recognition
    ASR in Noisy Environment
    Outline

2 Review of Speech Recognition Technologies
    Introduction
    The Motivations for the ASR Design
    The Automatic Speech Recognition Design
        Feature Extraction
        Decoder
        Knowledge Base
    Knowledge Sources Integration
    Decoder Methodologies
        Hidden Markov Model
        Neural Networks
    Towards Robust ASR
        Feature Enhancement
        Transformation
        Generalization
        Multiple Decoder Combination

3 Acoustic Noise Simulation
    Introduction
    The SPL and the Total SPL
    Total Microphone Voltage For N Acoustic Sources
    Experimental Results
        Monotonic Acoustic Signal Test
        Real Speech Audio Test
    The Divergence between the Two Approaches
    Conclusion

4 Signal-to-Noise Ratio Estimation
    Introduction
    Previous SNR Estimation Approaches
        The Statistical Closed-Form Approach
        The Audio Power Histogram Approach
        The Supervised Method Approach
    The Proposed SNR Estimation Approach
        The Short-Time of Silence SNR Estimator
        The Noise-Speech Power Offset
    Experimental Results
    Discussion

5 Experimental Tools and Environments
    Introduction
    CMU Sphinx ASR Toolkit
    Feature Extraction
    GMM-HMM Acoustic Model Training
    Sphinxtrain Tool Procedure
    The Data

6 Acoustic Noise and ASR Performance
    Introduction
    Performance Metrics
    Analyzing the Noise Effect on Performance
        Noise Type Effect
        Noise Level Effect
    Conclusion

7 Adaptive Model Switching (AMS) Speech Recognition
    Introduction
    System Design
    Adaptive Model Switching ASR Training
        Acoustic Models Training
        STS-SNR Training
    AMS Decoding Algorithm
    Silence Samples Update
    Conclusion

8 Results and Discussion
    Introduction
    The Word Error Rate (WER)
    White Noise Experiment
    Door Slam Noise Experiment
    Babble Noise Experiment
    WER Relative Improvement Comparison
    The Recognition Speed
    Conclusion

9 Conclusions

Appendices

A Noisy TIMIT Corpus
    A.1 TIMIT Corpus
    A.2 Noisy TIMIT Data Structure
    A.3 Noise Types and Ranges

List of Figures

1.1 General schematic diagram for the LVCSR system
Human speech production/perception process
A simple speech production/perception model
The LVCSR general modules functions
PLP and MFCC analysis comparison
Knowledge sources hierarchy for ASR
ASR bottom-up approach for knowledge integration
Top-down approach for knowledge integration
Blackboard knowledge sources integration in ASR
Two-state simple Markov model
HMM word model and sub-phoneme
Deep neural network (DNN) architecture
Conventional MFCC, RASTA-PLP and PNCC
PNCC vs MFCC WER
Joint training framework example
Monotonic acoustic signal synchronization
Monotonic signal addition experiment
Synchronization pulse train
3.4 The divergence analysis scenario
Factors affecting vector-linear divergence
The closed-form SNR response for frame-based noise
The closed-form SNR response for fixed 10 dB noise
The NIST-STNR approach
Comparison of WADA and NIST SNR algorithms
Regression, WADA and NIST SNR error
The Short-Time-Silence estimator (STS-SNR) algorithm
The effect of STS step
Finding the offset for STS-SNR
MAE results for NIST, WADA and STS using the NOIZEUS corpus
MAE of NIST, WADA and STS estimators using the TIMIT corpus
Mean estimation of NIST, WADA and STS using the NOIZEUS corpus
Comparing different SNR estimators for different noise types
The speech features used in experiments
A simple 3-state HMM
ASR WER degradation in a noisy environment
Different noise types speech on AWGN AM
Baseline ASR noise level test
The proposed AMS speech recognition
STS-SNR Estimator Block Diagram
Multiple decoder maximum MAP decoder
The AMS WER performance of TIMIT on AWGN
8.3 The AMS WER performance of AN4 on AWGN
The AMS WER performance of PDAmWSJ on AWGN
The AMS WER performance of TIMIT on door slam noise
The AMS WER performance of AN4 on door slam noise
The AMS WER performance of PDAmWSJ on door slam noise
Babble noise TIMIT test
Babble noise AN4 test
Babble noise PDAmWSJ test
The average decoding speed performance of TIMIT
The average decoding speed performance for AN4
The average decoding performance for PDAmWSJ
A.1 Noisy TIMIT Corpus directory structure
A.2 The spectrum of different colors of noise

List of Tables

3.1 The MD distance between real and simulated combinations
Sound types with their associated SPL
MAE improvements of STS
Average processing time for SNR estimators
Speech corpora used in experiments
Relative WER% improvement comparison
The relative decoding speed improvement comparison
A.1 Dialect regions distribution of TIMIT test set
A.2 Sentence categories in TIMIT

List of Abbreviations

AM: Acoustic Model
AMS: Adaptive model switching
AMG: Acoustic model generalization
ANN: Artificial neural network
ASR: Automatic Speech Recognition
AWGN: Additive white Gaussian noise
CD: Context dependent
CI: Context independent
CMN: Cepstral mean normalization
CMU: Carnegie Mellon University
DCT: Discrete cosine transform
DFT: Discrete Fourier transform
DNN: Deep neural network
EBF: Excitation-based feature
EM: Expectation maximization
FFT: Fast Fourier transform
FSM: Finite state machine
HMM: Hidden Markov Model
HPF: High pass filter
IDCT: Inverse discrete cosine transform
IFFT: Inverse Fast Fourier transform
KB: The Knowledge Base
KL: Kullback-Leibler
LM: Language Model
LPC: Linear predictive coefficient
LVCSR: Large Vocabulary Continuous Speech Recognition
MAE: Mean absolute error
MAP: Maximum a posteriori
maxmap: Maximum of the maximum a posteriori
MBR: Minimum Bayes risk
MFCC: Mel-Frequency Cepstral Coefficient
NIST: National Institute of Standards and Technology
NIST-STNR: NIST signal-to-noise ratio system
NLP: Natural language processing
OOV: Out of vocabulary
PDF: Probability density function
PLP: Perceptual Linear Prediction
PSD: Power spectral density
PNCC: Power-normalized cepstral coefficient
RASTA-PLP: Relative spectral perceptual linear prediction
RI: Relative improvement
RNN: Recurrent neural network
RMS: Root mean square
ROVER: Recognizer output voting error reduction
RTF: Real-time factor
SNR: Signal-to-noise ratio
SPL: Sound pressure level
SSE: Sum of square error
STS: Short time of silence
STPS: Spectro-temporal power spectrum
TDNN: Time delay neural network
VAD: Voice activity detection (or detector)
WER: Word error rate
WADA: Waveform amplitude distribution analysis

Acknowledgements

I wish to thank, first and foremost, my advisor Dr. Veton Z. Këpuska, with whom I enjoyed working on such a difficult topic.

I would like to thank the staff of the Florida Tech radio station WFIT for providing us with a suitable acoustic room for some of the experiments. The WFIT general manager, Terri Wright, and the broadcast technician, Mr. Sonny Wolfgram, offered us everything we needed in a very welcoming manner.

I also want to thank the Linguistic Data Consortium (LDC), at the University of Pennsylvania, for publishing the Noisy TIMIT corpus. Cooperating with the LDC was very impressive, and communicating with their staff was a great experience.

Last but not least, I would like to thank my family: my parents, my wife, and my brothers and sisters, for supporting me spiritually throughout the writing of this thesis and my life in general.

Dedication

To my wife, for being patient and helpful throughout the writing of this dissertation.

To my parents, brothers and sisters. They suffered the pain of being in the middle of a war zone while I was abroad.

Chapter 1

Introduction

1.1 Automatic Speech Recognition

Automatic Speech Recognition (ASR) is a set of algorithms and techniques that enable a machine to recognize and comprehend the human spoken utterance. ASR introduces a new human-machine communication paradigm, which improves machine productivity. There are plenty of applications that use ASR, such as short text dictation, voice search, personal digital assistants, gaming, living-room interaction systems and in-vehicle infotainment systems, to name a few [1].

Automatic speech recognition is considered a highly complicated task due to its interdisciplinary nature. To design an ASR algorithm, researchers have to consider signal processing, acoustic physics, pattern recognition, information theory, linguistics, programming, physiology and psychology. Therefore, most researchers tend to apply a monolithic approach to individual problems [2].

Many dimensions of the problem affect the ASR design complexity, such as isolated or connected words and continuous speech, dependent and independent speaker, small and large vocabulary, etc. [3]. Generally, for Large Vocabulary Continuous Speech Recognition (LVCSR), the ASR system includes feature extraction, a knowledge base and a decoder, as shown in figure 1.1.

Figure 1.1: The schematic diagram of the Large Vocabulary Continuous Speech Recognition (LVCSR) system.

Feature extraction is the first step; it converts the speech signal to certain features that can be classified by the decoder. Typically, the knowledge base (KB) consists of the linguistic information, which is the language model (LM) and lexicon, and the acoustic model (AM). The decoder uses a complex search algorithm to match the incoming acoustic features to the most likely sequence of words that is hypothesized to have been uttered by the speaker. The decoder uses the trained acoustic and language models, with the help of the pronunciation arrangement in the lexicon, to build a complicated graph. Eventually, using the decoder's graph, the acoustic features are decoded into a textual form through a highly sophisticated process known as search and decoding.

The search and decoding paradigm implies that the human utterance, or the intention of what the speaker wants a listener to know, is somehow encoded in the speech signal. The aim of the ASR is then to decode the speaker's intention from that speech signal. It is easier to consider that there are certain patterns for each utterance in the speech signal. Therefore, for some applications a simple pattern recognition is enough, while state machines or artificial networks are needed for more complicated ones.

1.2 ASR in Noisy Environment

The ASR's performance is affected to a large degree by the environmental mismatch between the training and testing audio sets. While a client-server ASR can cope with the problem by having big and diverse training data, embedded systems are less immune to ambient noise variations due to data size limitations. In general, the ASR performance is heavily degraded when the speech signal is distorted by unexpected acoustic noise.

To improve ASR performance in a noisy environment, the designer should either empower the feature extraction, transform the acoustic models or train more general ones. Acoustic model transformation aims to find a relationship between the noisy features and the trained models. Model generalization, on the other hand, trains the models for any possible noisy features by including more variability of the speech features in the training process. In many cases, even by increasing the variability of the training data, this mismatch problem will not be completely solved [4].

The methodological approach considered in this study mixes the model transformation and generalization approaches. The main argument in this dissertation is that if the ASR can adaptively switch the model parameters, based on initial knowledge

of the noise level, the performance will be better than the baseline system. The proposed ASR will outperform the typical approach if the knowledge base contains multiple acoustic models that are trained with different levels of noise.

In order to train the models with various noisy features, the training audio data have to be corrupted by acoustic noise. The corpus, which is a transcribed set of audio files, is usually collected in a certain, fixed environment. Therefore, the acoustic noise is typically added after data collection by linear addition of the speech and noise electrical samples. This procedure is used almost universally in acoustic research. In this dissertation, representing the acoustic signal combination in the electrical domain is investigated in depth to verify whether linear sample addition is the correct approach. The study starts by deriving the mathematical relationships of the audio combination in the acoustic domain and the microphone transformation properties.

The noise level estimator is also investigated intensively in this study so that a robust approach can be adopted. For some ASR applications, especially live speech recognition, processing time is a vital factor that cannot be ignored. Therefore, a specific noise level estimator, namely the short time of silence (STS) estimator, was designed specifically for the proposed ASR system.
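To make the switching idea concrete, the following is a minimal sketch (not the dissertation's implementation, which is developed in Chapter 7) of how a decoder could pick, from a pool of acoustic models trained at different noise levels, the one whose training SNR is closest to the estimated SNR of the incoming audio. The model names and the dictionary layout are hypothetical placeholders.

```python
# Illustrative sketch only: select the acoustic model whose training SNR
# is closest to the estimated SNR of the input audio.

def select_acoustic_model(estimated_snr_db, models_by_snr):
    """Return the model trained at the SNR level closest to the estimate.

    models_by_snr: dict mapping a training SNR (dB) to a loaded acoustic model.
    """
    closest_snr = min(models_by_snr, key=lambda snr: abs(snr - estimated_snr_db))
    return models_by_snr[closest_snr]

# Example: three hypothetical models trained on clean, 20 dB and 10 dB noisy data.
models_by_snr = {40: "am_clean", 20: "am_20dB", 10: "am_10dB"}
print(select_acoustic_model(13.5, models_by_snr))   # -> "am_10dB"
```

In the proposed system, the STS estimator described later would supply the SNR estimate used for this selection.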

1.3 Outline

This dissertation begins in chapter 2 by introducing some theoretical background about automatic speech recognition. Chapter 3 is dedicated to finding the relationship between the acoustic signal combination and its equivalent in the electrical domain. In chapter 4, the SNR estimator for audio signals is discussed, and a new approach is introduced and compared with other approaches. The data and tools used in the ASR experiments are discussed in chapter 5, with more details about the feature properties and the acoustic model training methodology. Chapter 6 presents the performance analysis of the baseline ASR on different noise levels and types. The proposed noise-robust approach is introduced in Chapter 7, while the test results are in Chapter 8. Finally, the conclusions and future work are discussed in Chapter 9.

Chapter 2

Review of Speech Recognition Technologies

2.1 Introduction

Automatic Speech Recognition (ASR) is the conversion of the speech signal into text that represents the spoken words accurately and efficiently. This process should work adequately for the expected variations of the capturing devices, speaker accent and the acoustic environment. The ultimate goal of automatic speech recognizers is to approach human recognition efficiency [5].

This chapter is an overview of automatic speech recognition systems and how noise robustness has been tackled. It begins with the theoretical motivations for the ASR design. Later in this chapter, specifically in section 2.6, the literature on ASR robustness in noisy environments is reviewed.

Figure 2.1: Schematic diagram of the human speech production/perception process [2].

2.2 The Motivations for the ASR Design

It is believed that the human being is the best speech recognizer ever known. Therefore, researchers in ASR have tried to mimic the human speech perception model. It has commonly been assumed that inter-human speech is a communication process that requires the generation and perception of complex acoustic signals.

Figure 2.1 is a schematic diagram of the human speech generation and recognition process. It shows that the purpose of speech is to deliver a message, which is the speaker's thoughts and feelings, to another person. The speaker starts a mental process to formulate a specific message. Then, he/she will use a language code by selecting consistent words and putting them in a valid syntax. That language depends on the social and intellectual circumstances of both the speaker and the listener. The nervous system will then control some of the body muscles to articulate the speech signal. A series of neuro-muscular commands have to control

the lungs, velum, jaw, tongue and lips to produce a set of phonemes that make up the words. This process finally produces the acoustic signal, which is a continuous disturbance of air pressure, so that the listener's hearing system can process it [2].

When the acoustic wave propagates to the recipient, the speech recognition process begins. The listener reverses the speech generation process, which starts by capturing that acoustic wave via the hearing system. It is believed that the inner ear provides spectral analysis of the incoming acoustic wave, while a neural transduction extracts features from that spectrum. Later on, with a mechanism that is not well understood, neural activities along the auditory nerve are converted into a language code in the brain, and finally, message comprehension is achieved [2].

A general overview of human speech generation and recognition suggests that it is an encoding/decoding process of a complex signal. Speech communication seems to encode the speaker's intentions into an acoustic signal, while the recipient has to decode it back to the corresponding linguistic form [6]. Figure 2.2 is a simple version of figure 2.1 that emphasizes the encoding/decoding mechanism. The linguistically meaningful utterance W is encoded into an acoustic signal representation s(t). The speech recognizer, either a human being or a machine, has to analyze the speech signal s(t) in both the temporal and spectral domains to extract distinct observation features O. The decoder in the speech recognizer then matches those observations to the most likely word sequence Ŵ [5].

Figure 2.2: A simple speech production/perception model [5].

Figure 2.3: The function of each module in the LVCSR general design.

The acoustic processor extracts the most relevant features from the speech signal. The decoder's operations are much more complicated algorithms that have to match the acoustic evidence against an artificial network that was already built with the help of the knowledge base (KB).

2.3 The Automatic Speech Recognition Design

There are different approaches used to design speech recognition systems. The differences are related to the requirements and the complexity of the application. For instance, isolated, noncontinuous ASR is different from connected-word and continuous speech recognition. Likewise, the design differs when the recognizer is meant to be speaker dependent or independent, or to use a small or large vocabulary [3]. For Large Vocabulary Continuous Speech Recognition (LVCSR), the general approach is depicted in figure 2.3.

For the LVCSR system in figure 2.3, a frame-by-frame observation feature set O = {o_1, o_2, o_3, ..., o_T} is extracted from the input speech s(t). The decoder scores those features against the already trained acoustic and language models. With the help of the knowledge base, it builds sets of possible combinations of phonemes (or even sub-phonemes) that form words, using the lexicon information, and of words that form sentences or utterances, using the language model data. The decoder searches within those possibilities and selects the highest-scoring candidate Ŵ as the best guess for the spoken utterance W [5].

The feature extraction in LVCSR is also known as the front-end, while the rest of the system is called the back-end. The following subsections discuss each component in figure 2.3 in some detail.

2.3.1 Feature Extraction

Feature extraction, also known as the front-end of the Automatic Speech Recognition (ASR) system, prepares a new version of the speech signal. This new form should help the back-end to classify phonemes, and then discriminate words and sentences. Considerable effort has been devoted to extracting features that help automatic speech recognition [7]. The Perceptual Linear Prediction (PLP) and the Mel-Frequency Cepstral Coefficients (MFCC) are considered the most successful and common feature extraction algorithms [8].

Figure 2.4 shows how the speech signal is analyzed to extract static PLP and MFCC features. The broken arrows link the similar analysis stages of PLP and MFCC. Both feature types go through short-time power spectral analysis by applying a Fourier transform to speech frames. Then, both MFCC and PLP employ a critical-band analysis to mimic the human hearing system. In this stage, the power spectral density

(PSD) is modified based on the human hearing sensitivity [9]. Equal-loudness and pre-emphasis operations are used by PLP and MFCC extraction, respectively, to compensate for the human hearing sensitivity. The pre-emphasis stage in MFCC is merely a first-order high pass filter (HPF) that is applied to the speech signal. Mathematically it is [8]:

x'(t) = x(t) − α x(t−1)    (2.1)

where x(t) is the speech sample at time t, x'(t) is the filtered signal and α is a constant that should be less than 1, typically around 0.97.

Figure 2.4: PLP and MFCC analysis comparison [9].

On the other hand, PLP analysis equalizes the loudness in the frequency domain by multiplying each power spectral coefficient by a specific weight using the following

formula [9]:

E(ω) = [(ω² + 56.8×10⁶) ω⁴] / [(ω² + 6.3×10⁶)² (ω² + 0.38×10⁹)]    (2.2)

where E(ω) is the weight to be multiplied by each PSD coefficient P(ω). The equal-loudness and pre-emphasis analyses are similar, although the latter is widely used as it can be applied with less expense, before the short-time spectral analysis [8].

The main difference between MFCC and PLP is the use of auto-regressive (AR) modeling by the latter. In PLP analysis, the Bark-scaled PSD coefficients are used to solve the Yule-Walker equations to obtain an M-order auto-regressive model, or all-pole model. When the AR model order is carefully selected, it approximates the high-energy concentrations in the PSD of the speech signal and smooths out the less relevant harmonics. In other words, the AR model basically estimates the formant frequencies of the speech signal. The reason for choosing a Bark-scaled version of the PSD is to match the human hearing sensitivity [9][10].

The typical feature vector extracted by the front end of the ASR consists of a certain number of coefficients accompanied by their first- and second-order derivatives. The sequential vectors, which are computed from continuous speech, are fed to the decoder to match their patterns against a trained system [5].

2.3.2 Decoder

The decoder is considered to be the heart of the automatic speech recognizer; in it, the incoming feature patterns are related to meaningful words. It has commonly been assumed that automatic speech recognition is a statistical decision problem. The Bayes maximum a posteriori probability (MAP) decision formula has been specifically used to solve the problem [2] [5].

The basic function of the decoder is to find a word sequence Ŵ (within a task language L) that maximizes the a posteriori probability P(W|O) [5]. This a posteriori probability means the probability of a word sequence W given the currently observed feature vector O. The decoder has to propose many suggestions for W for each observation O and pick the most probable one, Ŵ, using some scoring technique. The optimization for that process is to find the following maximum a posteriori (MAP) estimate [5]:

Ŵ = argmax_{W ∈ L} P(W|O)    (2.3)

According to Bayes' rule, the a posteriori probability P(W|O) can be expanded as follows [11]:

P(W|O) P(O) = P(O|W) P(W)
P(W|O) = P(O|W) P(W) / P(O)    (2.4)

Hence, the MAP equation in 2.3 can be re-written as follows:

Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O)    (2.5)

where the incoming observed feature vector O is scored against a trained acoustic model to obtain P(O|W), while P(W) is the language model score. The denominator probability, P(O), is the probability of having a certain observation, and it is independent of the word sequence being optimized [5]. Besides, P(O) is expected to have a uniform probability density function (PDF), as each different feature vector pattern has an equal probability of being observed. Therefore, it is treated as a constant weight and is disregarded in all subsequent formulas.
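As a toy illustration of equation (2.5) with P(O) dropped, the snippet below scores two hypothetical word sequences by adding their acoustic and language-model log scores and keeps the argmax. The numbers are invented; a real decoder searches an enormous hypothesis space rather than a two-entry list.

```python
import math

# Toy MAP decision: score each hypothesis W by log P(O|W) + log P(W).
hypotheses = {
    # W                 : (acoustic log-likelihood, LM log-probability)
    "recognize speech":   (-1050.2, math.log(1e-4)),
    "wreck a nice beach": (-1048.9, math.log(1e-7)),
}

best_w = max(hypotheses, key=lambda w: hypotheses[w][0] + hypotheses[w][1])
print(best_w)  # "recognize speech": the LM term outweighs the small acoustic gap
```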

The MAP rule in equation 2.5 indicates that the a posteriori probability can be estimated from the likelihood P(O|W) and the a priori probability of the word sequence, P(W). Both of them can be estimated by using properly trained acoustic and language models, AM and LM respectively. Hence, the optimization process in 2.3 is feasible in practice. The language model (LM) and the AM are then used in the search and decoding module to give the best guess Ŵ using the following 3-step process [5]:

Ŵ = argmax_{W ∈ L} P(O|W) P(W)    (2.6)

In step 1, the likelihood P(O|W) is calculated by scoring the current observation against the acoustic model AM. Step 2 computes the a priori probability P(W) for each suggested word sequence using the trained LM. In step 3, the decoder searches within all valid words and/or sentences in L (the argmax operation) to find the maximum-likelihood utterance [5].

When it comes to the uncertainty of making decisions, the optimum solution is the algorithm that minimizes the loss, in our case the one that produces the minimum WER. One such solution is the statistical decision rule known as minimum Bayes risk (MBR). The MBR assumes that, for a large number of expected outcomes from a decision rule δ(.), the risk is the average of the losses due to that rule [12]. In ASR, the loss function represents erroneously assigning a wrong hypothesis Ŵ to the reference word sequence W, written l(Ŵ, W). Although the MBR seems to be more attractive than the conventional maximum a posteriori (MAP) rule for general ASR, the latter is still the dominant estimation algorithm in speech recognition. The reason is that the MBR has a higher computational expense compared to the MAP algorithm [13]. However, as more powerful machines

emerged in the last few decades, an attempt to use MBR in the ASR system was initiated, limited to specific recognition tasks [14].

Figure 2.5: Knowledge sources hierarchy for automatic speech recognition/understanding.

2.3.3 Knowledge Base

The decoder's steps 1 and 2 are more efficient in LVCSR with the help of different sources of knowledge. The acoustic model (AM), lexical, semantic, syntactic and even pragmatic knowledge can improve the ASR [2]. Embedding one or more of those knowledge sources depends on the task that the ASR is required to implement. For instance, basic modern LVCSR systems use AM, lexical and syntactic knowledge sources only, where the latter is commonly referred to as the Language Model (LM) [15]. Figure 2.5 shows the knowledge sources hierarchy and how going from speech recognition to understanding requires additional knowledge.

The Acoustic Model (AM) is trained knowledge that is used for classifying the observed features O to find the possible utterance Ŵ. It is specifically used to estimate the likelihood P(O|W), which is the first step of the search and decoding procedure as stated in formula 2.6. Lexical knowledge (or simply the lexicon) is

the set of rules of phonetic composition for each word in the target language L. The syntactic information, or the Language Model (LM), describes the possible combinations of grammatically correct words in the language L. In other words, it should give the probability of having an utterance W, or P(W), which is required at step 2 of the decoding in formula 2.6 [5]. With the help of these three sources of knowledge, a basic ASR system is able to convert speech to text even for large-vocabulary recognition tasks.

When it comes to speech understanding, the semantic information for the language L is necessary. It helps the recognizer check whether decoded phrases are consistent for the task by having specific meanings. However, some decoded phrases may still hold up the higher application layer due to their ambiguity. Therefore, the pragmatic knowledge source has to be considered in the system. The pragmatic knowledge, with inference capability, removes the ambiguity of the decoded phrases by knowing the meaning of certain words in a specific context [2].

2.4 Knowledge Sources Integration

The chosen strategy for integrating the knowledge base components affects the detailed architecture of the ASR. It can be achieved in many different ways depending on the task requirements and the software/hardware capabilities. However, the top-down, bottom-up and the alternative blackboard approaches are the most common [2].

Figure 2.6 illustrates the bottom-up ASR system design. The lowest-level processes precede the highest-level ones in a sequential manner, limiting the search scope as much as possible at each level. Each phase of processing delivers

specific information to the next stage based on the incoming utterance and the already trained information in the knowledge source. Using this topology, voiced/unvoiced and silence detection, along with the feature extraction, helps the utterance segmentation process. The sound classification and phonetic rules, the lexicon and the LM enable the higher levels to suggest the most probable utterance.

Figure 2.6: Bottom-up approach for knowledge integration in an ASR system [2].

In contrast, the top-down design flow starts with the language model and lexical information to suggest hypothetical utterances that are then matched against each given speech signal pattern, as shown in figure 2.7.

Figure 2.7: Top-down approach for knowledge integration in an ASR system [2].

The bottom-up and top-down approaches are common in the information processing domain. However, when Lesser et al. introduced the blackboard approach in 1975, which is described in [16], they shed light on a pioneering concept. They assumed that knowledge sources are independent, parallel and asynchronous processes that communicate with each other in a hypothesis-test schema. Each knowledge source (KS) process is driven by the occurrence of patterns on the global blackboard that match templates coming from multiple KSs. In this approach, the scores of different asynchronous KSs are integrated and propagated to all levels to produce the next-level hypothesis [2][17]. Figure 2.8 depicts the schematic diagram of the blackboard knowledge integration approach within an ASR system.

2.5 Decoder Methodologies

For speech recognizers, pattern classification could be as simple as template matching. From the early 1950s, ASR system design was guided by the acoustic-phonetic components of speech. For instance, matching the vowel formant frequencies was successfully used for digit and isolated-word recognition. In the 1960s, a statistical

ASR was developed that could recognize 4 vowels and 9 consonants by collecting statistical information on phoneme sequences. By the end of the 1970s, Linear Predictive Coding (LPC) had been introduced; it simplified vocal tract estimation and paved the way for many researchers to adopt pattern recognition technology in ASR [17].

Figure 2.8: Blackboard knowledge sources integration in an ASR system design [2].

The heart of the ASR system, the decoder, has to be designed to score the incoming patterns against the trained knowledge sources, regardless of how they are represented. Eventually, the decoder selects the hypothesized utterance candidate that has the highest score, however that score is defined. Although the essence of the theoretical aspects of the decoder was discussed in section 2.3.2, its implementation methodologies need to be addressed, which is the purpose of the following subsections.

2.5.1 Hidden Markov Model

Different ASR realizations and technologies for various goals and tasks were developed, especially with the rapid development of statistical analysis in the 1980s. The most notable milestone in that regard was the adoption of Hidden Markov Model (HMM) frameworks. It is believed that the introduction of the HMM to ASR shifted its design methodology from simple pattern recognition to a more organized statistical modeling framework [17].

The HMM is a probabilistic finite state machine (FSM) that is based on a Markov chain. It is at the core of many successful pattern recognition tasks, such as computer vision, as well as natural language processing (NLP) algorithms like shallow parsing, speech recognition and translation tasks [18]. An HMM machine can be defined as M = ⟨Q, Σ, A, B, π⟩, where [5][18]:

- Q is the set of N states, Q = {q_1, q_2, ..., q_N}; at time t, the state Q_t is active, with Q_t ∈ Q. There are two special start and end states that are not associated with acoustic observations, q_0 and q_e respectively.

- Σ is the finite set of symbols. It can represent words, sub-words, phonemes or sub-phonemes, depending on the ASR knowledge integration methodology.

- A is the set of state-transition probabilities, where a_ij is the probability of a transition from state q_i to state q_j:

a_ij = P(Q_{t+1} = q_j | Q_t = q_i),   1 ≤ i, j ≤ N    (2.7)

with the following conditions:

a_ij ≥ 0,   for all i, j    (2.8a)

Σ_{j=1}^{N} a_ij = 1,   for all i    (2.8b)

- B is the state-based symbol emission probability. At each time t, the acoustic observation pattern O_t has a probability b_i(O_t) of being matched with state q_i at that time. Mathematically:

b_i(O_t) = P(O_t | q_i),   1 ≤ i ≤ N    (2.9)

- π is the set of initial state probabilities, where π_i is the probability of starting in state q_i.

Figure 2.9: Two-state simple Markov model.

When it comes to automatic speech recognition systems, the observations are speech signal features, also known as acoustic evidence. For GMM-HMM framework decoders, the emission probabilities B are estimated by projecting those features onto trained GMM models. The transition probability matrices A are fixed for each acoustic model (AM), and they are calculated during the training phase. The AM consists of a set of Markov models for each word that exists in the language model (LM) [15].

For example, the Markov machine in figure 2.9 could be used to represent the word "be"; in that case it would be defined as follows:

M = ⟨{q_1, q_2}, {/b/, /i:/}, A, B, {1, 0}⟩    (2.10)

where Q = {q_1, q_2}, Σ = {/b/, /i:/}, A is a 2×2 transition matrix, B = {b_1, b_2} is the state emission probability vector and π = {1, 0}. The corresponding phonemes for states q_1 and q_2 are /b/ and /i:/ respectively. It is obvious that this machine cannot start from state q_2, as its initial probability is zero. It means that the word "eb" is not accepted and will not be considered. When the elements of the observation vector O_t arrive, this machine swings between states q_1 and q_2 until it reaches its final observation. A string of symbols is the outcome of this machine, which depends on the incoming pattern of the observation vector O_t.

Let a_21 = 0 for the model in figure 2.9, which means the machine cannot go from state q_2 to q_1. If the feature vector pattern O_t = {O_1, O_2, O_3, O_4} drives the machine in time to go through the sequence {q_1, q_2, q_2, q_e}, then the output string will be "bee". This example shows the power of HMMs: a simple model can handle different accents, or maybe different words, as it can also decode other utterances like {"be", "bee", "beee", ...}.

Typically, each sub-word unit (phoneme) is decomposed into a 3-state model, representing the begin, middle and end variations of the statistical parameters. For the previous example, the two states would be elaborated into a 6-state HMM as shown in figure 2.10. A word HMM is composed by concatenating multiple sub-word (phoneme) models, each of which is in turn a concatenation of sub-phoneme models [5].
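The following is a small Viterbi-style sketch of how the two-state model of figure 2.9 could be decoded against a four-frame observation sequence. The transition matrix, initial probabilities and per-frame emission scores are invented for illustration, and the non-emitting start and end states are left out.

```python
import numpy as np

# Viterbi sketch for a two-state HMM (word "be": /b/ then /i:/).
A = np.array([[0.6, 0.4],      # a_11, a_12
              [0.0, 1.0]])     # a_21 = 0: once in /i:/ the model cannot go back
pi = np.array([1.0, 0.0])      # the word cannot start with /i:/

# emissions[t, i] = b_i(O_t): likelihood of frame t under state i (hypothetical)
emissions = np.array([[0.9, 0.1],
                      [0.7, 0.3],
                      [0.2, 0.8],
                      [0.1, 0.9]])

T, N = emissions.shape
delta = np.zeros((T, N))             # best path score ending in state i at time t
psi = np.zeros((T, N), dtype=int)    # back-pointers
delta[0] = pi * emissions[0]
for t in range(1, T):
    for j in range(N):
        scores = delta[t - 1] * A[:, j]
        psi[t, j] = np.argmax(scores)
        delta[t, j] = scores[psi[t, j]] * emissions[t, j]

# Back-trace the most likely state sequence
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t, path[-1]]))
path.reverse()
print(path)   # [0, 0, 1, 1]  ->  /b/ /b/ /i:/ /i:/
```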

Figure 2.10: Altering the HMM model for the word "be" from phoneme to sub-phoneme modeling: (a) phoneme HMM for the word "be"; (b) HMM with a 3-state (begin, middle, end) model for each phoneme.

The GMM-HMM framework succeeded in modeling the variation of the statistical properties of the speech feature vector over time. The GMM probability density functions and the HMM transition probabilities are learned during the training phase. The Gaussian mixtures model the statistical characteristics of the feature vector within each state i, which gives the emission probability b_i(O_t). The probability density function (PDF) for each state is [5]:

b_i(O_t) = Σ_{k=1}^{K} c_ik N(O_t; µ_ik, U_ik)    (2.11)

where K is the number of Gaussian mixture components, c_ik is the weight of the k-th mixture component in state i, and N is the Gaussian (normal) distribution with mean vector µ_ik and covariance matrix U_ik for state i. A set of speaker-independent, orthographically labeled audio is used to estimate the mean and variance statistical parameters. The transition probabilities are also learned during this phase, by designing word models of concatenated sub-phoneme HMMs using the lexical information in the knowledge base [5] [15].
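A minimal sketch of equation (2.11) is shown below: the emission likelihood of one feature frame is a weighted sum of Gaussian densities. The dimensions, weights, means and covariances are arbitrary placeholders, not trained model parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Emission probability of equation (2.11): b_i(O_t) = sum_k c_ik * N(O_t; mu_ik, U_ik)
def gmm_emission(o_t, weights, means, covs):
    """Likelihood of one feature vector under a K-component Gaussian mixture."""
    return sum(c * multivariate_normal.pdf(o_t, mean=mu, cov=U)
               for c, mu, U in zip(weights, means, covs))

rng = np.random.default_rng(0)
dim, K = 13, 2                          # e.g. 13 static coefficients, 2 mixtures
weights = np.array([0.6, 0.4])          # c_ik, must sum to 1
means = rng.normal(size=(K, dim))       # mu_ik
covs = [np.eye(dim), 2.0 * np.eye(dim)] # U_ik (diagonal in practice)

o_t = rng.normal(size=dim)              # one observed feature frame
print(gmm_emission(o_t, weights, means, covs))
```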

Hidden Markov models have become a standard solution for dealing with speech temporal dynamics in modern LVCSR systems. They are usually supported by GMMs to calculate a fitness score for each state of the HMM given the input acoustic features. As the GMM-HMM became a successful standard ASR tool, significant advances have been introduced in its design to improve its recognition accuracy [19] [20].

2.5.2 Neural Networks

In the late 1980s, the artificial neural network (ANN) approach was re-introduced after it had failed to give reasonable ASR accuracy in the 1950s. The advent of the multi-layer ANN and its corresponding error back-propagation training paradigm made ANNs feasible for ASR applications. The multi-layer perceptron form of the ANN drew attention to adopting ANNs in ASR design, not because it mimics neural processing but because it can approximate any function with high precision [17].

Neural network technology is known for its efficiency in static pattern recognition, while speech patterns are naturally dynamic. Therefore, modifications are necessary to enable the ANN structure to work for speech recognition. Some of the suggested modifications are simple, such as time delay neural networks (TDNN). In this modification, the input to each perceptron is extended to include many time-variant input frames. Other modifications can be more complicated and lead to altering the ANN's basic structure [2].

The state-of-the-art approach for using ANNs in speech recognition is the deep neural network (DNN), which is a multi-layer perceptron structure with many hidden layers, as shown in figure 2.11 [21]. Each layer l of the DNN has N neurons, where N is the acoustic feature vector dimension. Except for the initial layer 0 and

the output layer L, the activation vector v_l is [21]:

v_l = f(z_l) = f(W_l v_{l−1} + b_l),   for 0 < l < L    (2.12)

where z_l is the excitation vector, W_l is the weight matrix, v_{l−1} is the activation vector from the previous layer and b_l is the bias (error) vector. The activation function f(z) is applied to the excitation vector element-wise. This function can be either a sigmoid σ(z), a hyperbolic tangent tanh(z) or a rectified linear unit ReLU(z) function, as defined in the following formulas [21]:

σ(z) = 1 / (1 + e^(−z))    (2.13a)

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))    (2.13b)

ReLU(z) = max(0, z)    (2.13c)

There are two different tasks for which the DNN output activation vector is calculated: regression and classification. For a regression task, the output layer activation is simply linear:

v_L = z_L = W_L v_{L−1} + b_L    (2.14)

For a multi-class classification task, the output layer L is used to calculate class probabilities. Each output neuron represents a class i ∈ {1, 2, ..., C}. The value of the i-th output neuron is the probability of being in state i given the observation O, or P(i|O). It represents the probability that the observed acoustic feature vector belongs to class i. To form a valid multi-dimensional PDF, the output vector v_L has to satisfy

the conditions v_L^i ≥ 0 and Σ_{i=1}^{C} v_L^i = 1. The following softmax function is used to maintain these two conditions [21]:

v_L^i = P(i|O) = softmax_i(z_L) = e^(z_L^i) / Σ_{j=1}^{C} e^(z_L^j)    (2.15)

where z_L^i is the i-th element of the excitation vector z_L.

Figure 2.11: Deep neural network (DNN) architecture example [21].

Many advances have been made using ANNs with one or two nonlinear hidden layers in hybrid ANN-HMM systems. Modern ASR systems employ neural networks for the acoustic model instead of GMMs. However, the ANN-based acoustic models have more parameters than the GMM ones, which increases the computational expense. Besides, there are technical issues in deploying a pure ANN, or even a hybrid ANN-HMM system, due to the complexity of the ANN learning procedure [17, 19, 22].
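The snippet below sketches equations (2.12)-(2.15) as a plain NumPy forward pass: sigmoid hidden layers followed by a softmax output layer whose values behave like class posteriors P(i|O). The layer sizes and weights are random placeholders, not a trained acoustic model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

def dnn_forward(o, weights, biases):
    v = o
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers, equation (2.12)
        v = sigmoid(W @ v + b)
    z_L = weights[-1] @ v + biases[-1]            # output-layer excitation
    return softmax(z_L)                           # equation (2.15)

rng = np.random.default_rng(1)
sizes = [39, 64, 64, 10]    # feature dim, two hidden layers, 10 output classes
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

posteriors = dnn_forward(rng.normal(size=39), weights, biases)
print(posteriors.sum())     # 1.0: a valid discrete distribution over classes
```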

2.6 Towards Robust ASR

Improving the noise robustness of ASR systems has been approached from different angles. It can be achieved by feature enhancement, transformation or generalization [4]. The following sections discuss different approaches to improving speech recognition robustness.

2.6.1 Feature Enhancement

According to Hönig et al., the optimal performance of the ASR for speech with varying background noise can be achieved when the properties of both MFCC and PLP are combined. Therefore, the researchers proposed a Revised Perceptual Linear Prediction (RPLP) in which the MFCC is modified to contain the PLP features' attributes. They added the intensity-to-loudness conversion and linear prediction steps to the MFCC base system. On the other hand, they found that increasing the Mel filter bank to contain 257 filters, rather than the 24 used in conventional MFCC, improved the word error rate (WER) by 4.8% relative over both MFCC and PLP systems for the HUB4 speech database [8].

In [23], researchers proposed adding excitation-based features to the vocal tract-based ones, like the MFCC and the PLP. They first assessed phoneme class separability before trying any experiments on speech recognition. Ten Excitation-Based Features (EBF) were separately added to the 13-dimension conventional PLP and tested for phoneme separation. Finally, they added all ten EBFs to the PLP-13 and a Mel filter bank of 26 banks (FBANK-26), with their first, second and third dynamics. Both types of features, with the new flavor of EBF, were trained using a Deep Neural Network (DNN) of four hidden layers with 1000 nodes each. That DNN was designed

to increase classification accuracy before building the decision tree of the phonemes using a GMM-HMM system. Results from this work show a variety of relative WER improvements, between 1.6% and 4.8%, when EBF statistics are combined with PLP or FBANK features [23].

To compensate for environmental noise degradation, spectral coefficient manipulations were imposed on the common MFCC and PLP feature extraction algorithms. Relative spectral PLP (RASTA-PLP), described in [24], and the power-normalized cepstral coefficient (PNCC), described in [25], were introduced to achieve that goal. RASTA-PLP reduces noise by compression, filtering and expansion of the PLP critical-band output. PNCC, on the other hand, normalizes the power over temporal masking windows and was assumed to improve the ASR's immunity to environmental noise. A comparison of RASTA-PLP and PNCC with the traditional MFCC algorithm is shown in figure 2.12.

The PNCC algorithm was proposed as being better and less susceptible to noise than the traditional MFCC. However, current ASR tools still use MFCC with different tunable options to overcome environmental noise issues; for example, CMU Sphinx has a noise removal option for MFCC [15]. When PNCC is compared with MFCC for an LVCSR task, it shows a small improvement in the word error rate (WER). Further experiments in this study show no significant word error rate improvements. Testing speech at different noise levels shows that the WER was lower for PNCC at low SNR, while it was higher at high SNR, compared to MFCC. Figure 2.13 depicts the WER comparison for two flavors of MFCC, with and without noise removal, and PNCC using the Noisy TIMIT corpus.

PLP and MFCC have become the most successful feature extraction methods over the years. As recent ASR tools introduce more tunable options, MFCC feature extraction is still used efficiently even for noise robustness tasks [8, 27].

Figure 2.12: Conventional MFCC, RASTA-PLP and PNCC feature extraction algorithms [26].

Figure 2.13: The word error rate (WER) using the CMU Sphinx ASR tool, where PNCC, MFCC and NR MFCC features are used (WER% versus SNR in dB). NR MFCC denotes MFCC with the noise removal option activated in the CMU Sphinx ASR tool.

Recently, more sophisticated feature extraction approaches have been investigated by many researchers. For example, a spectro-temporal power spectrum (STPS) was introduced to enhance speech recognition in noisy environments. It is simply a 2-D speech spectrum filter followed by encoding of the spectro-temporal information around each frequency component. The STPS combined with the ETSI advanced front-end (AFE) feature extraction algorithm appears to be better than MFCC by around 32% on average [28].

Another recent speech feature enhancement was motivated by the physiological binaural model in mammals. Researchers in [29] use a 6-channel microphone array signal to produce spectro-temporal Gabor features to estimate the speaker position and hence remove the background noise. It is stated that the 2-D Gabor features imitate the cortical neurons of the primary auditory cortex of mammals. The average overall improvement over MFCC was around 25.9% [29].

2.6.2 Transformation

From the early days of ASR systems, building a suitable AM that would suit noisy environments has been tackled. Environmental noise leads to a mismatch between the trained acoustic model and the noisy test acoustic evidence. Interestingly, when cepstral mean subtraction (CMS) was used, the mismatch problem in [30] was already reduced without doing any transformation. Moreover, the feature-space and model-space transformations yielded little improvement in ASR accuracy. Hence, CMS (later known as cepstral mean normalization, CMN) became a standard procedure in modern ASR systems such as CMU Sphinx4 [31].

Recently, Nguyen et al. proposed a spectro-temporal (ST) feature adaptation approach using a Kullback-Leibler (KL) divergence objective function, in addition to the EM-based method, to estimate the transformation function parameters. They generalize the ST feature transform using the following formula [32]:

y_t = Σ_{τ=−L}^{L} A_τ x_{t+τ} + b = W x̄_t    (2.16)

where y_t and x_t are the output and input t-th feature frames respectively, x̄_t is the stacked context of the frames x_{t−L}, ..., x_{t+L}, A_τ is the sequence of transform matrices over the 2L+1 frames and b is the bias vector. As the spectro-temporal (ST) matrices are invoked, a huge number of parameters has to be estimated and practical issues emerge. In order to solve this problem, the researchers suggested feature-space maximum likelihood linear regression (fMLLR) followed by a temporal filter to force some irrelevant matrix parameters to zero. Other solutions could be statistical interpolation and smoothing. Although they tested the approach using a reverberant speech corpus, they claimed that it could reduce other environmental distortion effects. The best performance using clean GMM-HMM acoustic

models was reported when near-field and far-field microphone data were combined with interpolation, utterance adaptation and fMLLR, giving 58.9% WER (a 3% relative improvement over the baseline ASR) [32].

2.6.3 Generalization

The joint training framework proposed in [33] is a good example of robust ASR in noisy environments. The researchers built a speech separation DNN from the log power spectrogram of the noisy speech. From this network, they estimated the ideal ratio mask (IRM), which is the ratio of the speech energy to the sum of the speech and noise energies. This mask is then used in the test phase by multiplying it with the speech spectrogram. The overall ASR system is based on a hybrid DNN-HMM approach, and the results were promising. The outline of AM building is shown in figure 2.14, which shows how two DNNs are jointly trained to get the final AM. The reported relative WER improvement was 2.7% using MEL, PNCC, multi-resolution cochleagram (MRCG) and Fset features together. The Fset itself is a combination of RASTA-PLP, amplitude modulation spectrogram (AMS), narrow-band and wide-band MFCC features. In total, they used 7 different feature vectors, plus their first and second derivatives, to make a total of 4026 dimensions as an input feature vector [33].

It is clear that jointly trained acoustic models, which have seen many different patterns of the speech signal, improve ASR accuracy in a noisy environment. However, it is impractical to implement such a highly expensive feature extraction in live ASR tasks.
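As a small sketch of the ideal ratio mask (IRM) mentioned above, the snippet below computes the per-bin ratio of speech power to speech-plus-noise power and applies it to a noisy power spectrogram. The random arrays stand in for real STFT magnitudes; the actual system in [33] estimates the mask with a DNN rather than from oracle speech and noise.

```python
import numpy as np

rng = np.random.default_rng(3)
speech_power = rng.random((257, 100))     # |STFT|^2 of clean speech (freq x frames)
noise_power = rng.random((257, 100))      # |STFT|^2 of noise

irm = speech_power / (speech_power + noise_power + 1e-12)
noisy_power = speech_power + noise_power  # additive-noise assumption
masked_power = irm * noisy_power          # enhanced spectrogram fed to the front-end

print(irm.min() >= 0.0, irm.max() <= 1.0) # the mask is bounded in [0, 1]
```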

Figure 2.14: The joint training framework proposed by [33].

2.6.4 Multiple Decoder Combination

Another possible way to improve ASR performance in noisy environments is to use multiple decoders and combine their outputs. For example, the recognizer output voting error reduction approach, known as ROVER, combines the final text hypothesis from each decoder in a single word transition network (WTN) [34]. Based on different voting approaches, it selects the words that have the highest score [34]. ROVER relies upon a single best hypothesis from each recognizer, whereas the correct one might never appear in any of them [35].

Recently, Moritz et al. used six microphones to separate the target speaker from the background noise. The applied algorithm starts by detecting the failed microphones and enhancing the speech using two consecutive techniques: non-negative matrix factorization (NMF) and the time-varying minimum variance distortionless response (MVDR). Then amplitude modulation of the filter-bank output is applied, followed by a speaker adaptation algorithm with the help of trained GMM-HMM models. After using a recurrent neural network (RNN) to generate the LM, the output is combined with other ASRs' hypotheses to feed another training machine based on minimum Bayes risk (MBR) over all given hypotheses. The relative WER improvement is reported to be 0.16% on the 3rd CHiME challenge data [36].
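A toy sketch of ROVER-style voting is given below: for each slot of the aligned word transition network, the word proposed by the most decoders is kept. Real ROVER also performs the alignment itself and can weight votes by confidence scores; the aligned hypotheses here are invented for illustration.

```python
from collections import Counter

aligned_hypotheses = [
    ["the", "cat", "sat",  "on", "the", "mat"],   # decoder 1
    ["the", "cat", "sat",  "in", "the", "mat"],   # decoder 2
    ["a",   "cat", "said", "on", "the", "mat"],   # decoder 3
]

# Majority vote per aligned slot
voted = [Counter(words).most_common(1)[0][0] for words in zip(*aligned_hypotheses)]
print(" ".join(voted))   # "the cat sat on the mat"
```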

Chapter 3

Acoustic Noise Simulation

3.1 Introduction

When it comes to combining the electrical samples of multiple acoustic sources, linear addition using the + operation is commonly used. For various noise levels and types, finding a sufficient and proper noisy audio corpus becomes harder. In many different cases, almost all researchers have used linear combination to simulate what is known as acoustically additive noise in the electrical domain [37-42]. That is, if a single microphone separately captures two acoustic sources x and y, it produces electrical samples X_s and Y_s respectively. If both sources are propagating simultaneously to the same microphone, the overall combined signal Z_s in the electrical domain is assumed to be:

Z_s = X_s + Y_s    (3.1)

According to this assumption, the electrical samples of the clean audio are added to the noise samples in the electrical domain. Nevertheless, adding electrical samples

of multiple and concurrent acoustic sources is more complicated than a simple linear addition operation. In fact, this scenario has to consider two non-linear processes that affect the linearity assumption. First, there is a non-linear relationship between the sound pressure level (SPL) and the microphone output voltage, as will be shown later in section 3.3 [43]. Second, the total SPL already has a non-linear relationship with every individual SPL that forms it [44]. Regardless of those facts, the linear addition of acoustic signals is commonly considered to be valid. The following sections elaborate the real acoustic combination transformation from the acoustic to the electrical domain. The aim in this chapter is to question the validity of the linear acoustic combination, to determine whether it is the best way to simulate acoustic noise, and why.

The chapter is organized as follows: section 3.2 elaborates the relationship between the total and individual SPLs in the acoustic domain. In section 3.3, a derivation of the acoustic addition formula in the electrical domain is developed. The conducted experiments are reported in section 3.4. The theoretical difference between the linear and the new combination formula is analyzed in section 3.5.

3.2 The SPL and the Total SPL

The theoretical relationship between the individual sound pressure levels in dB and their total combined SPL is derived gradually below. The sound pressure level (SPL) is a decibel (dB) ratio that will be used frequently in this work. The decibel (dB) is defined as a ratio of two powers:

dB = 10 log10(Power / Reference Power)    (3.2)

The acoustic power P_a is proportional to the square of the sound pressure p:

P_a \propto p^2, \qquad P_a = k\,p^2    (3.3)

where k is a constant. On the other hand, the agreed reference acoustic pressure p_o used in the Reference Power is 0.0002 dyne/cm^2 (0.0002 microbar); the international equivalent is 20 micronewtons per square meter (20 uN/m^2), i.e. 20 micropascal (20 uPa), which is the sound pressure that approximates the threshold of hearing [45]. Hence, with the pressure ratio p/p_o, the sound pressure level in dB (SPL) is:

SPL = 10 \log_{10}\left(\frac{k p^2}{k p_o^2}\right) = 10 \log_{10}\left(\frac{p}{p_o}\right)^2    (3.4)

For N sound sources located in the same environment at the same distance from the receiving end, the individual SPLs in dB are:

SPL_1 = 10 \log_{10}(p_1/p_o)^2
SPL_2 = 10 \log_{10}(p_2/p_o)^2
\vdots    (3.5)
SPL_N = 10 \log_{10}(p_N/p_o)^2

For more than one acoustic sound source, the total sound pressure is not a simple + addition; it is more complicated, because the sound pressure level in dB is a logarithmic quantity. According to [44], the total SPL at the receiving end is ten times the log of the sum of the power ratios (not the pressures) of the individual sources.

Therefore, the total SPL is:

SPL_T = 10 \log_{10}\left( \sum_{i=1}^{N} Power_i \right)    (3.6)

where Power_i is:

Power_i = \left( \frac{p_i}{p_o} \right)^2    (3.7)

Hence:

SPL_T = 10 \log_{10}\left[ (p_1/p_o)^2 + (p_2/p_o)^2 + (p_3/p_o)^2 + \dots + (p_N/p_o)^2 \right]    (3.8)

SPL_T = 10 \log_{10}\left[ \sum_{i=1}^{N} \left( \frac{p_i}{p_o} \right)^2 \right]    (3.9)

The pressure ratio L_i = (p_i/p_o)^2 is not the same as SPL_i: the former is unit-less while the latter is a dB ratio. Therefore:

SPL_T = 10 \log_{10}\left[ \sum_{i=1}^{N} L_i \right]    (3.10)

The final relationship states that the total level SPL_T is ten times the logarithm of the sum of the squared pressure ratios of the individual components. Therefore, the individual sound pressures are not linearly added, and this non-linear combination is what reaches the microphone, which in turn converts it non-linearly into electrical samples, as will be shown in the following sections.
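For concreteness, Eq. (3.10) can be evaluated directly, as in the following minimal Python sketch (illustrative only):

```python
import math

def total_spl(spl_values_db):
    """Combine individual sound pressure levels (dB) into the total SPL, Eq. (3.10).

    Each SPL_i is converted back to a power ratio L_i = 10**(SPL_i/10),
    the ratios are summed, and the sum is converted back to dB.
    """
    power_ratios = [10.0 ** (spl / 10.0) for spl in spl_values_db]
    return 10.0 * math.log10(sum(power_ratios))

# Two equally loud sources combine to +3 dB, not to twice the SPL:
print(total_spl([60.0, 60.0]))  # ~63.01 dB, not 120 dB
```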

3.3 Total Microphone Voltage for N Acoustic Sources

The acoustic signal is collected by a microphone, which converts the sound pressure level (SPL) into an electrical signal. The microphone voltage E does not have a linear relationship with the acoustic pressure; the output voltage of a microphone is related to the input SPL as follows [43]:

E = 10^{\frac{S_v + SPL - 94}{20}}    (3.11)

where E is the output voltage of the microphone in volts and S_v is the open-circuit microphone sensitivity in dB relative to 1 V/Pa (1 volt per pascal of sound pressure, referenced to a 94 dB SPL acoustic input), expressed in dBV and measured with a 1 kHz sine wave [46].

As shown in section 3.2, the total sound pressure level SPL_T is obtained by adding the power ratios, not the individual SPLs. This means that SPL_T for N acoustic signals is [44]:

SPL_T = 10 \log_{10} \sum_{i=1}^{N} \left( \frac{p_i}{p_o} \right)^2    (3.12)

The voltage E_i that the microphone produces for the i-th acoustic pressure is:

E_i = 10^{\frac{S_v + 10\log_{10}(p_i/p_o)^2 - 94}{20}}    (3.13)

From Eq. (3.13), the squared normalized pressure can be expressed as:

\left( \frac{p_i}{p_o} \right)^2 = 10^{\frac{20\log_{10}E_i - S_v + 94}{10}}    (3.14)

By substituting Eq. (3.14) into Eq. (3.12), a new relationship between SPL_T and the microphone voltage E_i of each acoustic signal is found:

SPL_T = 10 \log_{10} \sum_{i=1}^{N} 10^{\frac{20\log_{10}E_i - S_v + 94}{10}}    (3.15)

This equation can be simplified mathematically to extract a clearer relationship. The first step is to divide by 10, invert the logarithm on both sides and take out the common factor 10^{(94 - S_v)/10} in Eq. (3.15), to obtain:

10^{\frac{SPL_T}{10}} = 10^{\frac{94 - S_v}{10}} \left[ \sum_{i=1}^{N} 10^{\frac{20\log_{10}E_i}{10}} \right]    (3.16)

Taking \log_{10} of both sides:

\frac{SPL_T}{10} = \frac{94 - S_v}{10} + \log_{10}\left( \sum_{i=1}^{N} 10^{2\log_{10}E_i} \right)    (3.17)

Multiplying by 10 gives:

SPL_T = 94 - S_v + 10 \log_{10}\left( \sum_{i=1}^{N} 10^{\log_{10}E_i^2} \right)    (3.18)

Note that 10^{\log_{10}E_i^2} is simply E_i^2; therefore Eq. (3.18) can be further simplified and written as:

SPL_T = 94 - S_v + 10 \log_{10}\left( \sum_{i=1}^{N} E_i^2 \right)    (3.19)

From Eq. (3.11), SPL_T can be related to the total microphone voltage E as follows:

SPL_T = 20 \log_{10} E - S_v + 94    (3.20)

Since Eqs. (3.19) and (3.20) are both expressions for SPL_T, they can be equated:

20 \log_{10} E - S_v + 94 = 94 - S_v + 10 \log_{10}\left( \sum_{i=1}^{N} E_i^2 \right)    (3.21)

Adding (S_v - 94) to both sides:

20 \log_{10} E = 10 \log_{10}\left( \sum_{i=1}^{N} E_i^2 \right)    (3.22)

For simplicity, let the left-hand side be written as 10\log_{10}E_T^2; the balanced equation is then:

10 \log_{10} E_T^2 = 10 \log_{10}\left( \sum_{i=1}^{N} E_i^2 \right)    (3.23)

Finally, dividing by 10 and raising 10 to the power of both sides gives the final relationship:

E_T^2 = \sum_{i=1}^{N} E_i^2    (3.24)

Equation (3.24) represents the actual relationship between the total voltage E_T, obtained when the sound sources are captured simultaneously, and the microphone voltages obtained when each sound is recorded individually with the same microphone. According to the new combination formula in Eq. (3.24), when there are only two simultaneous acoustic sources, the total microphone samples are:

E_T^2 = E_1^2 + E_2^2    (3.25)

E_T = \sqrt{E_1^2 + E_2^2}    (3.26)

In this case, the individual voltages of the two acoustic sources add as if they were perpendicular vectors. Therefore, the new simulation formula in Eq. (3.24) will be referred to as vector addition, as opposed to the commonly used linear addition in Eq. (3.1). The validity of both formulas is tested experimentally in the next section in order to estimate the error between the real addition and each of them.

3.4 Experimental Results

The experiments were conducted in a sound-proofed conference room at the WFIT radio station at the Florida Institute of Technology, using the two physical arrangements depicted in Fig. 3.4. First, each acoustic signal x and y was recorded in a separate session by the same microphone using the arrangement in Fig. 3.4a. Those samples were used to mimic the acoustic addition with both the linear and the vector approaches. The second step was to record x and y in a single session, played simultaneously on different loudspeakers, using the arrangement in Fig. 3.4b. The loudspeakers in this configuration are aligned and separated by 0.75 m, while the microphone is placed 1 m away from each loudspeaker. The latter signal represents the electrical-domain version of the true acoustic addition.

3.4.1 Monotonic Acoustic Signal Test

An equal-loudness, in-phase cosine acoustic signal was generated digitally for both separate signals, assuming x = y (or E_1 = E_2):

E_1 = E_2 = A\cos(\omega t)    (3.27)

where A = 1 and \omega = 2000\pi rad/sec, i.e. f = 1 kHz. It is important to set the frequency to 1 kHz because the microphone sensitivity S_v is measured at this frequency [46]. According to the common simulation criterion, the linear addition, the accumulated audio signal is:

E_l = E_1 + E_2 = 2\cos(\omega t)    (3.28)

However, in the proposed vector approach, the total microphone voltage reflects the accumulated sound pressure level (SPL) of the acoustic signals. Hence, the vector-addition representation of the total acoustic signal in the electrical domain is:

E_v = \sqrt{E_1^2 + E_2^2} = \sqrt{2}\cos(\omega t)    (3.29)

First, the samples of each sinusoidal acoustic signal, E_1 and E_2, were recorded in two separate sessions. The two electrical representation methods, the linear and the vector addition, were then simulated from those real samples to estimate E_l and E_v respectively. Finally, the microphone captured the acoustic samples when both sounds were played simultaneously on separate loudspeakers to measure E_T, the real acoustic accumulation in the electrical domain.
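The synthetic part of this comparison can be reproduced with a short Python sketch; only Eqs. (3.28) and (3.29) are evaluated here, the recorded sessions and the measured E_T are not reproduced, and the final line anticipates the relative-error measure defined later in Eq. (3.35).

```python
import numpy as np

fs, f = 16000, 1000.0                      # sampling rate and 1 kHz test tone
t = np.arange(fs) / fs
e1 = np.cos(2 * np.pi * f * t)             # Eq. (3.27) with A = 1
e2 = np.cos(2 * np.pi * f * t)             # identical, in-phase tone

e_linear = e1 + e2                         # Eq. (3.28): linear addition, amplitude 2
e_vector = np.sqrt(2) * e1                 # Eq. (3.29): for equal in-phase tones,
                                           # sqrt(E1^2 + E2^2) reduces to sqrt(2)*E1

# Divergence between the two simulations, as in Eq. (3.35):
re = 100.0 * np.sum((e_linear - e_vector) ** 2) / np.sum(e_vector ** 2)
print(round(re, 2))                        # ~17.16 %
```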

Figure 3.1: Synchronization: the undesired silence and the loudspeaker transient distortion captured during the recording sessions are ignored by taking the signal to start after Peak_1.

Throughout the experiments, the recording sessions showed that the first samples do not form an obvious sinusoidal signal. This type of distortion usually happens because the loudspeaker output level does not follow the input signal instantaneously [47]. Since those undesired or distorted samples would affect synchronization, and hence the accuracy of the error measurement, they were not considered. The correct signal is assumed to start after the first peak, Peak_1, which is the first stable peak of the sinusoidal signal; this peak occurs after the transient state ends and is identified by visual inspection, as in Fig. 3.1. A few samples of the results are plotted in Fig. 3.2, which shows that the proposed vector representation is more accurate than the common linear addition. The objective error measurement between the two approaches and the real acoustic addition in the electrical domain confirms the validity of the proposed approach.

Figure 3.2: The real combination of the two sinusoidal signals, E_T (blue dots), compared with the samples of the two combination methods, E_l and E_v, for the linear and vector representations respectively. The simulation samples (red boxes) in Fig. 3.2a are the common linear addition E_l = E_1 + E_2, while those in Fig. 3.2b are the proposed vector addition E_v. (Panels: (a) Linear Addition, (b) Vector Addition; amplitude versus time in seconds.)

The relative error RE between the linearly simulated total voltage E_l and the actual samples E_T is around 15.82%, which is close to the theoretical value of 17.16%, while the RE between the proposed vector-addition simulation E_v and the real accumulated samples E_T of the simultaneous acoustic sources is 0.5%.

3.4.2 Real Speech Audio Test

The same procedure as in the monotonic experiment was repeated, except that the acoustic signals x and y are real speech: both are the same 15 minutes of TIMIT test speech. Since the alignment used for the monotonic signals is not suitable here, an extra synchronization train of pulses was used, and a 15-minute audio segment from the TIMIT corpus was formed as shown in Fig. 3.3. The linear and vector approaches were compared to the real recording E_T obtained when both audio signals were played simultaneously; in other words, the two representation approaches are compared to the acoustically added samples.

Due to the randomness of speech signals, the most appropriate way to measure the difference is to extract their features. This kind of approach is sometimes referred to as fingerprint extraction; a matching algorithm can then be used to decide how similar those features are [48]. The selected audio fingerprints are the feature vectors commonly used in automatic speech recognition: the Mel-Frequency Cepstral Coefficients (MFCC) and the Perceptual Linear Prediction (PLP) coefficients [27]. The Mahalanobis distance is used to assess the similarity between the fingerprints of the two addition simulations, the linear E_l and the vector E_v, and the real audio samples of the acoustic combination E_T.

Figure 3.3: The 15 minutes of TIMIT speech are preceded and followed by a pulse train used for synchronization and alignment after recording.

The Mahalanobis distance is a statistical measurement of how far a point x_i lies from a given distribution, and it is formulated as follows [49]:

MD_i = \sqrt{(x_i - \mu)\, C^{-1} (x_i - \mu)^T}    (3.30)

where \mu is the mean of the distribution and C^{-1} is the inverse of the covariance matrix of the real combined samples E_T, while x_i are the feature samples of the linear and vector representations, E_l and E_v respectively. Table 3.1 shows how distant each of those representation approaches is from the real addition of the acoustic signals. In Table 3.1, the real multi-source acoustic combination in the electrical domain, i.e. the total voltage samples E_T, is treated as a statistical distribution (a class), and the table reports the average Mahalanobis distance of each simulation approach from that distribution.
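A minimal sketch of this comparison is shown below; the MFCC/PLP extraction itself is not shown, and the feature arrays are placeholders.

```python
import numpy as np

def avg_mahalanobis(query_feats, reference_feats):
    """Average Mahalanobis distance (Eq. 3.30) of query feature vectors
    from the distribution of the reference (real recording) features."""
    mu = reference_feats.mean(axis=0)
    cov = np.cov(reference_feats, rowvar=False)
    cov_inv = np.linalg.pinv(cov)                     # pseudo-inverse for numerical stability
    diffs = query_feats - mu
    d2 = np.einsum('ij,jk,ik->i', diffs, cov_inv, diffs)
    return float(np.mean(np.sqrt(np.maximum(d2, 0.0))))

# feats_real, feats_linear, feats_vector: (num_frames, num_coeffs) arrays (placeholders)
# print(avg_mahalanobis(feats_linear, feats_real), avg_mahalanobis(feats_vector, feats_real))
```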

Table 3.1: The average Mahalanobis distance test for 15 minutes of TIMIT speech, measuring the distance between the fingerprints of the real acoustic addition and each simulation approach (linear and vector), for MFCC and PLP-12 features and their average. The relative error between the linear and vector approaches is 1.21%.

In general, the vector-addition approach is slightly closer to the real voltage representation of the acoustic combination than the common + approach. The relative error between the two representations is 1.21%, indicating that the two simulation approaches are not substantially different from each other; equivalently, the proposed vector-addition method is closer to reality than the commonly used linear addition by 1.21%.

3.5 The Divergence between the Two Approaches

In order to estimate the microphone voltage in the presence of multiple acoustic sources, the proposed approach transforms the total sound pressure from the acoustic to the electrical domain, whereas the common linear addition simply adds the individual voltages of each acoustic source. Assuming that the proposed vector addition is more accurate than the linear one, at least theoretically, the next step is to understand the significance of using it instead of the linear addition. In this section, the theoretical difference between the linear and the proposed vector addition methods is analyzed. To do so, let two 1 kHz sinusoidal acoustic signals, x and y, propagate towards two identical microphones.

Figure 3.4: The linear-vector divergence analysis scenario. (a) Recording separately propagated acoustic signals (x yields E_1, y yields E_2, summed electrically to E_l). (b) Recording concurrent acoustic signals (x and y captured together as E_v). All microphones are identical and the loudspeaker-to-microphone distances are the same.

Those microphones produce the voltages E_1 and E_2 for the two sound signals respectively, as shown in Fig. 3.4a. Later, the same acoustic signals are captured simultaneously by a single microphone, as depicted in Fig. 3.4b. The distances between the sound sources and the microphones are equal in both situations in Fig. 3.4. There are thus two different voltages that represent the same simultaneous acoustic sources x and y in the voltage domain: Fig. 3.4a shows the commonly used linear-addition simulation that ends with E_l, while Fig. 3.4b depicts what really happens when two concurrent acoustic signals are combined and collected by a single microphone, which leads to E_v as the total voltage representation.

According to the common linear approach used so far to represent the acoustic addition in the electrical domain, the total voltage E_l is:

E_l = E_1 + E_2    (3.31)

From Eq. (3.11), the vector and linear additions, E_v and E_l respectively, are related to the sound pressure levels (SPLs) as follows:

E_v = K \cdot 10^{SPL_T/20}    (3.32)

E_l = K \cdot 10^{SPL_1/20} + K \cdot 10^{SPL_2/20}    (3.33)

where K is a function of the microphone sensitivity S_v:

K = 10^{\frac{S_v - 94}{20}}    (3.34)

Both Eq. (3.32) and Eq. (3.33) are used throughout this analysis, assuming that only two sinusoidal acoustic sources are captured by the microphones. For an objective evaluation of the difference between the two total voltage representations E_v and E_l, the following relative error is used [50]:

RE\% = \frac{\sum_{N}(E_l - E_v)^2}{\sum_{N} E_v^2} \times 100\%    (3.35)

where N is the total number of samples. The relative error measures the difference between the two approaches for any applicable factor that may affect the acoustic accumulation/recording process.
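The following sketch evaluates Eq. (3.35) at the amplitude level for two in-phase tones, using Eqs. (3.32)-(3.34). Under that assumption it reproduces the fixed 17.16% error for equal loudness, independently of S_v as in Fig. 3.5a, and the decay toward zero as the SPL difference grows as in Fig. 3.5c; the phase-shift behaviour of Fig. 3.5b is not modelled here.

```python
import numpy as np

def re_percent(spl1_db, spl2_db, sv_db=-40.0):
    """Relative error of Eq. (3.35) between the linear (3.33) and vector (3.32)
    amplitudes for two in-phase tones; sv_db is an arbitrary sensitivity."""
    k = 10.0 ** ((sv_db - 94.0) / 20.0)                           # Eq. (3.34)
    e_lin = k * (10 ** (spl1_db / 20) + 10 ** (spl2_db / 20))     # Eq. (3.33)
    spl_t = 10 * np.log10(10 ** (spl1_db / 10) + 10 ** (spl2_db / 10))
    e_vec = k * 10 ** (spl_t / 20)                                # Eq. (3.32)
    return 100.0 * (e_lin - e_vec) ** 2 / e_vec ** 2              # Eq. (3.35) on amplitudes

print(re_percent(60, 60))   # ~17.16 % for equal loudness, regardless of S_v
print(re_percent(60, 40))   # <1 %: the error shrinks as the SPL difference grows
```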

The analysis uses a sinusoidal signal at 1 kHz, because it is the reference signal for the microphone sensitivity [46]. The loudspeaker-microphone distance is not considered as a factor, since it is assumed fixed and equal in all cases; the relative error is therefore measured with respect to variations of the microphone sensitivity and of the phase and loudness difference between the two acoustic signals.

The appearance of the sensitivity-dependent parameter K in both Eqs. (3.32) and (3.33) suggests that the relative error RE of Eq. (3.35) should be affected by the microphone sensitivity S_v. Surprisingly, the relative error between the linear and the vector representations stays fixed at around 17.16% for various sensitivities, as shown in Fig. 3.5a.

The phase shift between the two concurrent signals, on the other hand, has a great impact on the divergence between the two formulas for sinusoidal acoustic signals, as shown in Fig. 3.5b. For equal-loudness monotonic sound signals there is a 100% relative error at a 180 degree phase shift, i.e. when recording two simultaneous out-of-phase sinusoidal acoustic signals. The minimum error between the linear and vector representations is 17.16%, which occurs when the two acoustic sources are in-phase sinusoidal signals.

The two addition simulations are almost equal to each other when the captured acoustic signals have different loudnesses: the error between the two formulas converges to zero as the sound pressure level difference between the two acoustic signals increases, as shown in Fig. 3.5c. The relative error of 17.16% is the maximum difference between the linear and the proposed vector addition, and it occurs when the two in-phase acoustic signals have equal loudness. By inspecting Fig. 3.5c, it can be seen that the linear addition is valid for non-coherent signals; for instance, for noise and clean speech there is no significant difference between the linear and the real vector addition.

Figure 3.5: The relative error between the linear and vector acoustic additions with respect to the microphone sensitivity, the phase shift and the loudness difference for two concurrent sinusoidal acoustic signals. (Panels: (a) RE vs microphone sensitivity in dBV, flat at RE = 17.16%; (b) RE vs phase shift from 0 to 2*pi; (c) RE vs SPL difference in dB.)

3.6 Conclusion

It has been shown in this chapter that the voltage samples of individual acoustic signals combine as if they were perpendicular vectors. The proposed vector-addition formula is derived from the combination of simultaneous sound pressures in the acoustic domain together with the acoustic-to-electrical transformation formula. While the common linear approach adds the individually captured voltages of each acoustic source, the proposed method transforms the total SPL into the electrical domain; the main difference lies in the concept of transformation and in the separation between the acoustic and the electrical domains.

The vector-addition approach matches the voltage representation of the monotonic acoustic combination better than the traditional linear addition. The divergence analysis in section 3.5 shows that the common linear combination is an approximation of the real vector-addition criterion, and that the relative error between the physical acoustic combination and the linear one is very low for non-coherent signals. Since noise and speech are statistically independent, the linear addition is valid and will be used throughout the rest of this work.
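In the remaining chapters, noisy training and test material is produced by this linear addition. A minimal sketch of the mixing step is shown below; scaling the noise to hit a target SNR is an assumed, typical procedure rather than one prescribed in this chapter.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Linearly add noise to clean speech (Eq. 3.1), scaled to a target SNR in dB."""
    noise = np.resize(noise, clean.shape)            # repeat/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: a 10 dB SNR white-noise version of a (placeholder) clean waveform
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = add_noise_at_snr(clean, rng.standard_normal(16000), snr_db=10.0)
```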

Chapter 4

Signal-to-Noise Ratio Estimation

4.1 Introduction

Signal-to-noise ratio (SNR) estimation algorithms have been investigated deeply over the last decade for different reasons: they can be used to improve speech enhancement, detection and recognition algorithms in various ways [37]. These estimators try to make full use of the information that can be extracted from the signal samples, gradually refining the estimate until the signal reaches its final sample. Many SNR estimation approaches rely either on a pre-specified weighting factor or on prior assumptions about some parameters of the signal model [41].

The uncertainty in finding the SNR of random signals, such as speech, comes from the non-deterministic and variable nature of both speech and noise. When these two different stochastic processes are added, producing what is defined as noisy speech, the problem becomes even more complicated. The SNR of a spoken utterance is a ratio between the powers of two random signals, the speech and the noise, and this variability and randomness make the SNR more difficult to estimate [39].

Although live speech processing is becoming more important these days, SNR estimation is still mostly developed as an off-line option. It is true that the SNR is irrelevant to many speech applications, due to its nature as a non-steady, unit-less ratio. However, for live audio applications that require continuous noise-level estimation, efficient on-line SNR estimators should be investigated; for such live audio processing, the SNR estimator should not have to wait for the last sample of the audio before giving results.

This chapter discusses the possibility of an SNR estimator that can be combined with other processes in this way. The aim is to prepare a live SNR estimator that can be plugged into any ASR system to improve its response under environmental noise variations. When there is a set of acoustic models (AMs) trained for different noise levels, the ASR has to decide which AM to use by estimating the SNR of the incoming audio. If the system has to wait for a speaker to finish his or her utterance before the SNR estimator can decide, it will be far from a live recognizer. Therefore, the Short-Time of Silence (STS) SNR estimator introduced in this chapter estimates the SNR from silence samples only. Before discussing the new estimator, the next section reviews some examples of existing algorithms; the proposed approach is then introduced in section 4.3, simulation results are discussed in section 4.4, and section 4.5 provides further analysis of the results.

4.2 Previous SNR Estimation Approaches

The existing SNR estimation algorithms usually consider all samples of the audio signal in their computations.

Such estimators are not convenient for live applications, since an accurate estimate is not ready until all audio samples have been considered. In speech enhancement this process can take the whole duration of the speech signal; there, the aim is to find the SNR as precisely as possible even if it takes the whole utterance. Sometimes it is enough to know whether the signal has a high or low SNR in order to apply further operations to the speech signal for better recognition accuracy, as in [50-52]. The following subsections describe in some detail the approaches that paved the way for this research; these algorithms were examined as candidates to be plugged into the ASR so that they can classify audio signals based on their SNR estimates.

4.2.1 The Statistical Closed-Form Approach

The analytical solution proposed by Moazzeni et al. in [41] is a clear example of deriving closed mathematical forms to estimate the speech SNR. In this approach, the noisy audio signal is modeled with a single closed form derived from the Generalized Normal-Laplace (GNL) distribution, from which the signal and noise powers can be estimated. The estimates of the signal and noise powers, \hat{\sigma}_s^2 in Eq. (4.1) and \hat{\sigma}_v^2 in Eq. (4.2), are closed-form functions of the sample moments \hat{M}_2 and \hat{M}_4 through the constant c of Eq. (4.3); the exact expressions are given in [41].

Here \hat{M}_p is the p-th sample moment of the distribution, calculated as:

\hat{M}_p = \frac{1}{N} \sum_{k=1}^{N} \big(y(k)\big)^p    (4.4)

The additive colored noise in [41] is specific in the sense that it is updated for every frame before being added to the clean frames of the signal. The authors designed a non-stationary colored noise v(k), obtained by passing zero-mean Gaussian random samples w(k) through the following second-order autoregressive filter [41]:

v(k) = \sin(0.002k)\, v(k-1) - 0.4\, v(k-2) + w(k)    (4.5)

They verified their estimator by adding this noise to the speech signal and tracking the noise power and the SNR of the distorted speech. A 4-second speech signal, sampled at 8 kHz, was split into frames of 320 samples each. They show that, over ten simulation runs, the average normalized mean square error of the estimator was 0.31 at the fourth frame for a 10 dB SNR [41]. In this case the noise of equation (4.5) and the chosen frame size make the speech signal more stationary with respect to the noise [41]; therefore, using this approach under different assumptions might not give the expected results, as shown in figure 4.2.

Equations (4.1) and (4.2) were tested in this research to predict the SNR when the speech signal is distorted by quasi-stationary 10 dB white noise. The average detected SNR was far from the real value of 10 dB, as shown in figure 4.2. The main reason for this behavior is the contradiction of the underlying assumptions.
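For reference, the frame-varying colored noise of Eq. (4.5) can be generated as in the sketch below; the negative sign on the 0.4 coefficient follows the reconstruction above and should be checked against [41].

```python
import numpy as np

def gnl_test_noise(num_samples, seed=0):
    """Non-stationary colored noise of Eq. (4.5):
    v(k) = sin(0.002 k) * v(k-1) - 0.4 * v(k-2) + w(k), with w(k) ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(num_samples)
    v = np.zeros(num_samples)
    for k in range(num_samples):
        v[k] = np.sin(0.002 * k) * (v[k - 1] if k >= 1 else 0.0) \
               - 0.4 * (v[k - 2] if k >= 2 else 0.0) + w[k]
    return v

# 4 s at 8 kHz, split into 320-sample frames as in [41]
frames = gnl_test_noise(4 * 8000).reshape(-1, 320)
```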

Figure 4.1: The statistical closed-form estimator for frame-based additive noise [41].

Figure 4.2: The replica of the closed-form SNR estimator for stationary additive white noise at 10 dB.

Figure 4.3: The NIST-STNR estimation approach using the RMS power histogram [54].

The statistical closed-form algorithm works if the noise is non-stationary in the sense that it changes in every frame. In this thesis, however, the additive noise is not restricted to change on a frame basis; in fact, the main assumption here is that the noise changes more slowly than the frame rate. Therefore the statistical closed-form approach, as suggested in [41], is beyond the interests of this study.

4.2.2 The Audio Power Histogram Approach

Histogram analysis of the power of an audio signal has been used successfully in many SNR estimators. The National Institute of Standards and Technology (NIST) developed its own estimator, known as NIST-STNR [53]. It uses the entire speech signal, splits it into 20 ms frames with 50% overlap, and builds a histogram of the root-mean-square (RMS) power of those frames. The resulting power histogram is then analyzed so that the point at 15% of the total area from the left represents the noise, while the point at 85% of the area from the left represents the signal [54]. The NIST algorithm thus tries to separate the high and low powers of the audio signal using the power probability density function (PDF), supposing that the noise

and signal PDF boundaries lie within the selected regions with high probability. This strategy assumes that the speech signal has a higher power than the noise, which is a valid assumption for ASR applications: the human hearing system, and hence an ASR, cannot decode an utterance correctly if the noise and the speech have equal power.

While the power PDF is the key to the NIST estimator, the Waveform Amplitude Distribution Analysis (WADA-SNR) algorithm uses the amplitude PDF instead [37]. This approach is based on the assumption that the clean speech amplitude approximately follows the symmetric Gamma distribution:

f_x(x \mid \alpha_x, \beta_x) = \frac{\beta_x}{2\Gamma(\alpha_x)} (\beta_x |x|)^{\alpha_x - 1} \exp(-\beta_x |x|)    (4.6)

where x is the amplitude, and \alpha_x and \beta_x are the shaping and rate parameters respectively. While \beta_x serves as a normalization factor in Eq. (4.6), a fixed value of \alpha_x = 0.4 makes this equation a good description of the amplitude PDF of clean speech [37]. According to Kim and Stern in [37], when white Gaussian noise is added to the clean speech there is a unique parameter G_z that determines the SNR. This parameter is defined by a double-integral formula that is a function of the clean-signal and noise amplitudes, \alpha_x and the SNR (see [37]). Since \alpha_x is constant, G_z versus SNR can be calculated off-line and stored in a lookup table. The estimated parameter \hat{G}_z is then computed from the data as [37]:

\hat{G}_z = \ln\big(E[\,|x + v|\,]\big) - E\big[\ln(|x + v|)\big]    (4.7)

where x and v are the clean signal and noise amplitudes respectively.
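A sketch of the statistic in Eq. (4.7), computed directly from noisy samples z = x + v, is given below; mapping the resulting estimate to an SNR value still requires the precomputed G_z-versus-SNR lookup table from [37], which is not reproduced here.

```python
import numpy as np

def wada_gz(noisy_samples, eps=1e-12):
    """Estimate G_z of Eq. (4.7) from noisy waveform samples z = x + v:
    ln(E[|z|]) - E[ln(|z|)]. The SNR is then read from the offline
    G_z-versus-SNR table of Kim and Stern [37] (not included here)."""
    z = np.abs(np.asarray(noisy_samples, dtype=float)) + eps
    return float(np.log(np.mean(z)) - np.mean(np.log(z)))
```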

Figure 4.4: Comparison of the average SNR estimates of the WADA and NIST algorithms on the DARPA-RM database [37]. (Panels: (a) results with WADA-SNR, (b) results with NIST-STNR.)

Although the computation of the G_z estimate is efficient, this approach requires processing the entire signal. It is worth noting that, although the WADA-SNR approach assumes Gaussian noise, the empirical results show that it outperforms NIST-STNR even for other types of noise, such as background speech and music, as shown in figure 4.4 [37].

4.2.3 The Supervised Method Approach

In an attempt to improve SNR estimators, training was introduced by Papadopoulos et al. in [39]. Two cases are investigated: a supervised and an unsupervised one. In the supervised case the noise type is assumed to be known a priori, while in the unsupervised case the noise is identified by classification. The authors did not use a Voice Activity Detection (VAD) step; instead, they compare the energies within the ongoing utterance to decide the signal part X, which is the speech plus the noise, and the noise part N.

Figure 4.5: MAE comparison of the Regression, WADA and NIST SNR estimators [39]. (Panels: (a) MAE comparison for white noise, (b) MAE comparison for pink noise.)

Regression models are trained for each kind of noise using four types of features: long-term energy, signal variability, pitch and voicing probability. These four features are used to minimize the least-squares estimation error, where the estimated SNR is [39]:

E_{a,b}^{c,d} = 10 \log_{10} \frac{P(X_{c}^{d}) - P(X_{a}^{b})}{P(X_{a}^{b})}    (4.8)

where a, b, c and d are the percentile values of energy concentration and P(X) is the power of the audio signal. For the unsupervised case, classification is used first to decide the type of noise and then to choose the corresponding pre-trained regression model. The authors compared their mean absolute error (MAE) with both WADA-SNR and NIST-STNR, as shown in figure 4.5, where the supervised algorithm is denoted by Regression. The MAE comparison among the three SNR estimators in figure 4.5 shows the overall improvement offered by the regression approach. The authors further compared the three SNR estimators and showed that regression works better even for other types

of noise, such as babble noise (several people talking at the same time as the target speaker), machine-gun noise and car interior noise [39].

From equation (4.8), the regression model appears to track the power histogram of the audio signal; it is similar to the NIST estimator except for the training part, and that addition makes the regression algorithm in [39] more powerful than the other techniques. However, it still requires prior knowledge of the noise type, even in the unsupervised case, since a model has to be trained for each noise anyway. The authors therefore plan to make the algorithm more general for different noise types and to develop a feature that can capture noise characteristics to improve classification [39].

4.3 The Proposed SNR Estimation Approach

As discussed in the previous section, existing SNR estimators cannot produce their results until all the audio samples have been processed. This strategy can be disruptive for real-time, live audio applications such as automatic speech recognition (ASR). In this study, a procedure that can estimate the SNR in advance is proposed, so that the processing resources are efficiently given to the other processes that may require knowledge of the SNR. The proposed procedure is named the short-time of silence SNR estimator, or STS-SNR for short.

4.3.1 The Short-Time of Silence SNR Estimator

The intention is to design an efficient SNR estimator that does not consume too much of the processing resources that the ASR needs. The idea is to make this estimator the first phase, and then let the other parts of the recognizer work with less delay. It is important to

note that natural noise can be non-stationary in the sense that its power may change dynamically. However, stationarity is a relative matter that depends on the periods of interest. For example, when an ASR system uses overlapped frames of a few tens of milliseconds and the noise changes once a minute, that noise behaves as a stationary random process. Furthermore, if the ASR system can update the noise information periodically, slowly changing noise characteristics do not make the noise process non-stationary for its purposes.

According to the proposed short-time of silence approach, the best time slot for collecting information about the noise is at the beginning of the recording or during a silence interval. In either of these situations the ASR captures a signal that is not speech; in fact, it represents the noise that will be added to the speech samples. For live speech, the speaker will not be able to begin the intended utterance within the first 30 ms, so this period is a good opportunity to examine the noise behavior. Besides, since the noise is not expected to change appreciably within the update period, the 30 ms duration can be considered a quasi-stationary window of it [2]. Figure 4.6 shows the processing sequence of the proposed short-time of silence SNR estimator (STS-SNR) when only the first silence is used. The STS-SNR estimate is obtained using the following steps:

1. Take the first 30 ms of audio directly after the microphone is turned on; this segment is denoted the noise frame, N_Frame.

2. Subtract the mean of the N_Frame.

3. Estimate the power spectral density (PSD) of the N_Frame using a 512-point Fast Fourier Transform (FFT), taking only the 0-8 kHz band into consideration.

Figure 4.6: The Short-Time-Silence (STS-SNR) estimator algorithm.

   The PSD of the 30 ms noise frame is:

   N_{PSD} = |N_{Frame}(\omega)|^2    (4.9)

   where N_{Frame}(\omega) is the spectrum of the N_Frame audio signal.

4. Reform the PSD by taking the absolute difference from its flipped version to produce a white-like PSD:

   N_{Reformed} = |N_{PSD} - N_{PSD}^{T}|    (4.10)

   where N_{PSD}^{T} is the flipped version of the noise PSD N_{PSD}.

5. Take the average of the first and last quarters of the 8 kHz band of N_{Reformed}; the average of those quarters is taken as the estimated noise power spectral density \hat{N}_{PSD}.

6. The SNR in dB is:

   SNR_{dB} = offset - 10 \log_{10}\big(\hat{N}_{PSD}\big)    (4.11)

7. To update SNR_{dB}, repeat steps 2 to 6 on 30 ms of samples from any available silence (SIL) once t reaches T, where T is the desired update period.

Figure 4.7: The effect of STS step 4, the absolute flipped self-subtraction. (Panels: (a) PSD before reforming, (b) PSD after reforming.)

Since non-white noise is also expected, N_{PSD} is reformed in step 4 to produce a white-like PSD. This step eliminates the effect of some colored noise that might appear, while preserving a white-noise PSD. In figure 4.7, N_{Reformed} is shown for speech sampled at 16 kHz with additive blue noise at 10 dB SNR. Step 5 is designed to estimate the maximum noise contribution by averaging the reformed spectrum from step 4. The resulting \hat{N}_{PSD} changes linearly with the SNR of the speech utterance; however, there is a nearly fixed offset between the estimator mean and the real expected SNR value.
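A minimal Python sketch of steps 1 to 6 is given below; the input normalization, the FFT scaling and the exact band-edge indexing are assumptions, and the offset of 23 dB is the value arrived at in section 4.3.2.

```python
import numpy as np

def sts_snr(audio, sample_rate=16000, offset_db=23.0, nfft=512):
    """Short-Time-Silence SNR estimate (steps 1-6), from the first 30 ms only.

    Assumes `audio` is a float waveform (e.g. normalized to [-1, 1]) whose
    first 30 ms contain no speech.
    """
    n_frame = np.asarray(audio[: int(0.030 * sample_rate)], dtype=float)
    n_frame = n_frame - n_frame.mean()                    # step 2
    psd = np.abs(np.fft.rfft(n_frame, nfft)) ** 2         # step 3, Eq. (4.9), 0-8 kHz band
    reformed = np.abs(psd - psd[::-1])                    # step 4, Eq. (4.10)
    q = len(reformed) // 4
    n_hat = np.mean(np.concatenate([reformed[:q], reformed[-q:]]))  # step 5
    return offset_db - 10.0 * np.log10(n_hat + 1e-12)     # step 6, Eq. (4.11)
```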

It is possible to update the noise power parameters long after the microphone has been turned on in live ASR applications, without the need for extra VAD processing. Some recent ASR systems provide enough information about silence, which represents non-speech events. The proposed algorithm, referred to as the Short-Time Silence SNR estimator (STS-SNR), can easily track the silence (SIL) information to update the noise power parameters. For instance, the Sphinx-4 ASR can detect SIL and provide timing information for it, since SIL is treated as a word representing a non-speech event and is stored in the filler dictionary [15]. It is not necessary to track every SIL segment and take 30 ms from it to update the SNR; instead, the 30 ms can be taken from the SIL nearest to the predefined update period T. For instance, if T is 1 minute, the 30 ms is taken from the next available SIL after each minute.

4.3.2 The Noise-Speech Power Offset

Experiments on noisy speech at different SNRs showed that the offset in equation (4.11) does not vary with the SNR. A value of 23 was found to give minimum estimation errors for different kinds and levels of noise; this value was determined by the minimum sum of squared errors (SSE) and the minimum mean absolute error (MAE), as depicted in figure 4.8, in experiments on the TIMIT test data and on NOIZEUS [55], which show that the offset in equation (4.11) should be 23 dB.

It is common for a speaker to try to utter the speech with a power greater than that of the ambient noise. Since that noise is already captured during the first 30 ms or during a SIL period, the rest of the speech is expected to have higher power; experiments show that the spoken signal is higher than the N_Frame signal by around 23 dB. Hence, the N_Frame signal power (which is in fact noise) can serve as a suitable estimator of the SNR of that utterance.

Figure 4.8: Testing different offsets for the N_Frame of NOIZEUS noisy speech at 15 dB SNR. (Panels: (a) SSE for different offsets, (b) MAE for different offsets.)

Experiments with the STS-SNR confirm that, on average, the offset in equation (4.11) does not depend on the SNR of the captured speech. Experiments on TIMIT with additive white, blue and pink noise, as well as on the NOIZEUS data with real additive noise, confirm this linearity, and the value of 23 again gives the minimum SSE and MAE, as depicted in figure 4.8.

The offset in equation (4.11) represents the minimum power gap between moderate speech and faint noise that allows the speech to be differentiated from the noise by the speaker and the audience. In order to comply with the time constraints, the signal-noise relationship has to be treated from a different point of view: the novel approach in this research is to collect only the noise samples and relate their power to a psychoacoustic property of human speaking behavior. Table 4.1 shows categories of sounds, in the Noise column, with some important examples. According to Gordon J. King in his

book The Audio Handbook, there is a relative difference of about 20 dB between moderate speech and faint sounds, as shown in Table 4.1 [45]. It is clear from Table 4.1 that the pressure difference between the faint and the moderate categories ranges from 0 to 40 dB, so between the loudest faint noise and moderate speech there is a difference of around 20 dB.

The empirical results in this research support this known differentiation of moderate speech from non-speech noise by their relative sound pressure levels (SPLs). It is common for a speaker to raise his vocal power to compete with the ambient noise energy, at least in order to hear himself, and by doing so to let the nearby audience hear him. According to Gordon J. King [45], Faint noise starts at 20 dB SPL for a whisper and goes up to 40 dB SPL for public-library noise, while moderate speech lies in the pressure region he calls Moderate noise, between 40 and 60 dB SPL. Any increase in the faint noise will make a speaker who aims to speak moderately increase his utterance power by roughly the same amount as the additional noise. Based on that, the constant offset of 23 in equation (4.11) appears reasonable and close to theory.

4.4 Experimental Results

The STS-SNR estimator was evaluated using DARPA TIMIT [56] and the noisy speech corpus NOIZEUS described in [55]. Figure 4.9 shows the MAE results of the three SNR estimators on NOIZEUS noisy speech, which has real additive acoustic noise collected from airport, street, restaurant, station, car, exhibition and babble environments. Each noise category is added to the clean speech signal to form 5, 10 and 15 dB SNR versions.

Table 4.1: Sound pressure level (SPL) examples from King in [45], listing noise categories (Deafening, Very loud, Loud, Moderate, Faint, Very faint, down to the threshold of hearing) together with their SPL in dB, relative energy, pressure in microbars and typical examples such as: jet aircraft at 150 m (500 ft), inside a boiler-making factory, pop music group, motor horn at 5 m (16 ft), inside a tube train, busy street, workshop, small car at 7.5 m (24 ft), noisy office, inside a small car, large shop, radio set at full volume, normal conversation, urban house, quiet office, rural house, public library, quiet conversation, rustle of paper, whisper, quiet church, still night in the country, and sound-proof room.

Figure 4.9: MAE results for the NIST, WADA and STS estimators on the NOIZEUS corpus.

Since the available NOIZEUS noisy speech corpus only covers 5, 10 and 15 dB SNR, WADA-SNR has the lowest MAE on it. The proposed STS-SNR estimator is better than NIST-STNR and close to the WADA algorithm: in this test, WADA-SNR is relatively better than the proposed STS approach by 20.58%, while the latter is relatively better than NIST-STNR by 41.56%.

Figure 4.10 depicts an MAE comparison of WADA-SNR, NIST-STNR and the proposed STS-SNR algorithm. It shows that all three estimators have nearly the same response to the white, pink and blue types of artificial noise added to the TIMIT test speech. On the other hand, comparing the means of the three estimators shows that the proposed estimator has the better response: figure 4.11 shows that the proposed STS-SNR is, on average, closer to the ideal case, although its total MAE is higher than WADA-SNR's, as the MAE comparisons show.

Figure 4.10: MAE of the NIST, WADA and STS estimators on the TIMIT corpus. (Panels: (a) white-noise MAE results, (b) pink-noise MAE results, (c) blue-noise MAE results.)

Figure 4.11: Comparison of the NIST, WADA and STS estimator results on the NOIZEUS corpus.

For the other noise types, the STS-SNR also performs well in terms of the average estimated SNR. Figure 4.12 shows the NIST, WADA and STS average estimates on the TIMIT corpus when white, blue and pink additive noise are tested separately.

4.5 Discussion

The statistical model used in WADA-SNR assumes that the clean speech amplitude follows a Gamma distribution with a constant shaping factor \alpha_x of 0.4. Hence, WADA-SNR had lower average errors than both NIST and the proposed STS algorithm when the real SNR is below 20 dB. According to the authors of [37], the WADA estimated factor G_z is independent of \alpha_x only if the actual SNR is less than 20 dB; otherwise the value of 0.4 for \alpha_x is no longer the optimum choice, as the results show. In general, the proposed algorithm beats the other two above that level and gives a consistent MAE up to a real SNR of 40 dB. Overall, the STS-SNR estimator responds better than the other tested algorithms.

Figure 4.12: Comparison of the NIST, WADA and STS SNR estimates on TIMIT for different noise types. (Panels: (a) white noise, (b) pink noise, (c) blue noise.)

Table 4.2: The percentage MAE relative improvement of the STS-SNR for the results in figure 4.10.

  Noise Type    I_SW      I_SN
  White         19.23%    23.07%
  Pink          18.97%    25.55%
  Blue          17.69%    21.26%

The average MAE relative improvement of the STS estimator over WADA-SNR and over NIST-STNR for the range 5 to 50 dB SNR is calculated using the following equations:

I_{SW} = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{MAE_W(i) - MAE_S(i)}{MAE_W(i) + MAE_S(i)} \right) \times 100\%    (4.12)

I_{SN} = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{MAE_N(i) - MAE_S(i)}{MAE_N(i) + MAE_S(i)} \right) \times 100\%    (4.13)

where M is the number of simulation runs, I_{SW} is the MAE relative improvement of the STS over WADA-SNR, MAE_W is the MAE of the WADA-SNR estimator, I_{SN} is the MAE relative improvement of the STS over NIST-STNR, MAE_N is the MAE of the NIST-STNR estimator, and MAE_S is the MAE of the proposed STS approach.

Although WADA-SNR proves better than the proposed STS-SNR algorithm for SNRs below 20 dB, the latter shows an overall improvement when the test is run over higher noise levels; Table 4.2 shows the percentage relative improvement in MAE when the SNR ranges from 5 to 50 dB.

Regarding computational complexity, the STS-SNR is the least expensive of these SNR estimators. Since the proposed STS-SNR uses only a 30 ms

N_Frame window of the SIL signal, it is expected to perform faster than the NIST and WADA SNR approaches. The processing speed of the three estimators was measured by the following ratio:

Processing\ Speed = \frac{Processing\ Time}{Audio\ Duration}    (4.14)

In fact, the N_Frame window is taken only at the beginning of each audio test file, since the audio length does not exceed 6 seconds, which matches the proposed strategy for short recordings. Table 4.3 compares the three SNR estimators: NIST, WADA and the proposed STS approach.

Table 4.3: Average processing speed (processing time divided by audio duration) for the NIST-STNR, WADA-SNR and STS-SNR estimators.

The processing speed in Table 4.3 is averaged over the 1680 speech audio files of the TIMIT test corpus. The WADA-SNR estimator works more efficiently than the NIST algorithm: although the WADA estimator is based on extensive integration, it obtains its strong performance by using an off-line pre-calculated lookup table to reduce the computational expense [37]. However, as noted in Table 4.3, the proposed STS-SNR estimator improves on the WADA-SNR by about 11.86% on average. This improvement comes from the fact that the STS-SNR uses only 30 ms of the audio signal and does not need to process all the incoming samples; the 512-point Fourier transform is the most computationally intensive phase of the algorithm.

Chapter 5

Experimental Tools and Environments

5.1 Introduction

This chapter discusses the basic settings of the tools and the data used in the experiments. Section 5.2 briefly introduces a list of open-source ASR tools. Section 5.3 describes in some detail the specifications of the Mel-Frequency Cepstral Coefficients (MFCC) used. Section 5.4 introduces GMM-HMM acoustic model training and describes the Baum-Welch algorithm, while section 5.5 summarizes the CMU Sphinxtrain training procedure. Finally, the specifications of the speech data used in this dissertation are listed in section 5.6.

5.2 CMU Sphinx ASR Toolkit

There are plenty of open-source ASR systems, such as the RWTH Aachen Automatic Speech Recognition System (RASR) [57], the Segmental Conditional Random Field Toolkit for Speech Recognition (SCARF) [58], Improved ATROS (iATROS) [59], SRI International's Decipher [60] and IDIAP's Juicer [61]. They are openly available on the Internet, either as fully automated systems or as toolkits. However, the CMU Sphinx, Kaldi and HTK toolkits have fewer restrictions and are widely used [62]. The Carnegie Mellon University (CMU) Sphinx-4 recognizer is an HMM-based speech recognition system written in Java [63], while Kaldi is an ANN-HMM system written in C++ [64]. Although Kaldi outperformed both the CMU Sphinx and HTK toolkits, it has the highest computational cost among them. According to Gaida et al. in [62], the CMU Sphinx-4 toolkit gave reasonable results with less complexity; therefore, CMU Sphinx-4 is used to build and test the proposed noise-robust ASR throughout this study.

There are three tools in the CMU Sphinx toolkit: sphinxtrain, pocketsphinx and Sphinx-4. While the first two are written in C/C++, the third is written entirely in Java. The sphinxtrain tool is used to prepare the GMM-HMM acoustic models for the ASR; it uses some functions from the pocketsphinx library throughout the training procedure.

5.3 Feature Extraction

As discussed earlier in chapter 2, the Mel-Frequency Cepstral Coefficient (MFCC) representation was chosen for its efficiency. The MFCC can be tuned by many

parameters that can be altered based on the project requirements. As far as this research is concerned, the MFCC features are extracted from the speech signal x(t) as shown in figure 5.1; the features were extracted at 100 frames per second with an input sample rate of 16 kHz.

Figure 5.1: The MFCC front end used in the experiments: pre-emphasis high-pass filter, 25 Mel filter banks, log and DCT/IDCT producing 12 cepstra plus one energy element, and first and second derivatives, giving 39 features from x(t).

First, the speech signal is processed by a pre-emphasis high-pass filter with the following transfer function [9]:

H(z) = 1 - \alpha z^{-1}    (5.1)

where \alpha is the pre-emphasis coefficient. Applying the 25 Mel filter banks, a discrete cosine transform (DCT), a logarithm and an inverse cosine transform (IDCT) yields what is known as the (real) cepstrum of the signal. The Mel filter banks convert the spectrum from the linear scale to the non-linear Mel scale, which resembles the frequency warping of human hearing [9]. The first and second derivatives, known as delta and delta-delta respectively, are derived from the 13 cepstrum components to form what are known as the dynamic features.
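One way to approximate this front end is with the python_speech_features package, as in the sketch below; the pre-emphasis coefficient of 0.97, the 25 ms window and the other settings shown are common defaults rather than values taken from this chapter.

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

def extract_39d_features(wav_path):
    """13 MFCCs (12 cepstra + energy) at 100 frames/s, plus delta and
    delta-delta, giving a 39-dimensional feature vector per frame."""
    sample_rate, signal = wav.read(wav_path)          # expects 16 kHz audio
    static = mfcc(signal, samplerate=sample_rate,
                  winlen=0.025, winstep=0.01,         # 100 frames per second
                  numcep=13, nfilt=25, nfft=512,
                  preemph=0.97, appendEnergy=True)
    d1 = delta(static, 2)                             # first derivative (delta)
    d2 = delta(d1, 2)                                 # second derivative (delta-delta)
    return np.hstack([static, d1, d2])                # shape: (num_frames, 39)
```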

The 13 cepstrum coefficients consist of the first 12 cepstra and the energy component of each frame; with the dynamic features, this yields a 39-dimensional MFCC feature vector. The general mathematical form of the MFCC features depicted in figure 5.1 is [7]:

MFCC_i = \sum_{k=1}^{25} X_k \cos\left[ \frac{i\,(k - \tfrac{1}{2})\,\pi}{25} \right], \quad i = 1, 2, \dots, M    (5.2)

where M is the number of cepstral coefficients (12 in this case), k indexes the Mel filter banks and X_k is the log-energy of the k-th filter output. The energy coefficient is simply:

E_s = \sum_{n=1}^{N} x^2(n)    (5.3)

where N is the number of signal samples and x(n) is the output of the raised-cosine window applied to the signal x(t). The next step is to use these features as inputs to train the acoustic models.

5.4 GMM-HMM Acoustic Model Training

The GMM-HMM acoustic models are designed by initially associating each linguistic symbol (phoneme or sub-phoneme) with a state in the model; it is most common to use three states for each symbol v_k. For instance, the simple 3-state model in figure 5.2 is defined as:

\lambda = (A, B, \pi)    (5.4)

where A = \{a_{ij}\} is the set of transition probabilities from state i to state j in the HMM, the vector B gives the partial observation (emission) probabilities of the states, and \pi is the set of initial state probabilities. The transition matrix A for the HMM in figure 5.2 is:

A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}    (5.5)

and the vector B is:

B = \begin{bmatrix} b_{s1} = P(o_t \mid s_1) \\ b_{s2} = P(o_t \mid s_2) \\ b_{s3} = P(o_t \mid s_3) \end{bmatrix}    (5.6)

The Gaussian mixture model (GMM) is the best way to represent the vector B. For a given acoustic observation, the probability of producing the symbol v_k is described within its associated state by N mixtures of Gaussian PDFs. Hence, the emission probability of a state for the feature vector o_t at time t is [65]:

P(o_t) = \sum_{i=1}^{N} \frac{c_i}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(o_t - \mu_i)^2}{2\sigma_i^2} \right]    (5.7)

where N is the number of Gaussian mixtures, c_i is the weight of the i-th mixture with \sum_{i=1}^{N} c_i = 1, and \mu_i and \sigma_i are the mean and variance of the i-th mixture of the GMM. The probability P(o_t) = b_i(o_t) indicates how likely it is that the input feature o_t at time t produces the symbol v_k, the predefined symbol for state i.

There is no analytic approach for estimating the best HMM parameters for given training acoustic data O = \{o_t\}_{t=1}^{T}. The forward-backward iterative re-estimation, also known as the Baum-Welch algorithm, is used to obtain a better estimate of the HMM parameters.

Figure 5.2: A simple 3-state Markov model, where a_{ij} is the transition probability from state i to state j and b_{si} is the probability of observing the symbol v_k given the input feature vector o_t at time t, i.e. P(o_t).

The Baum-Welch algorithm is a repetitive re-estimation procedure, very similar to the expectation-maximization (EM) algorithm, and can be described as follows [65]:

1. Initialization: choose an initial estimate \lambda.
2. Expectation step: compute the auxiliary function Q(\lambda, \bar{\lambda}) based on \lambda.
3. Maximization step: compute \bar{\lambda} so as to maximize the auxiliary Q-function.
4. Iteration: set \lambda = \bar{\lambda} and repeat from step 2.

Here Q is defined as:

Q(\lambda, \bar{\lambda}) = \sum_{S} \frac{P(O, S \mid \lambda)}{P(O \mid \lambda)} \log P(O, S \mid \bar{\lambda})    (5.8)

where:

P(O, S \mid \lambda) = \prod_{t=1}^{T} a_{s_{t-1} s_t} P(o_t)    (5.9)

and the sum runs over the state sequences S of the HMM, while P(O \mid \lambda) is the overall probability of the observations O given the model \lambda.
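Evaluating the Q-function requires the per-state emission likelihoods b_i(o_t) of Eq. (5.7). A minimal sketch for a single feature dimension is given below; the multivariate, diagonal-covariance case used in practice is a straightforward extension.

```python
import numpy as np

def gmm_emission(o_t, weights, means, variances):
    """Emission probability P(o_t) of Eq. (5.7) for one HMM state:
    a mixture of N univariate Gaussians whose weights sum to 1."""
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    var = np.asarray(variances, dtype=float)
    gauss = np.exp(-((o_t - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return float(np.sum(w * gauss))

# Example: a 3-mixture state evaluated at o_t = 0.2
print(gmm_emission(0.2, weights=[0.5, 0.3, 0.2],
                   means=[0.0, 1.0, -1.0], variances=[1.0, 0.5, 2.0]))
```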

Assuming that \lambda is the initial HMM model and the input speech features O = \{o_t\} are observed, a better model \bar{\lambda} can be obtained by re-calculating the new HMM parameters \bar{A} and \bar{B} as follows [66]:

\bar{a}_{ij} = \frac{\sum_{t=1}^{T} P(O, s_{t-1} = i, s_t = j \mid \lambda)}{\sum_{t=1}^{T} P(O, s_{t-1} = i \mid \lambda)}    (5.10)

and

\bar{b}_j(k) = \frac{\sum_{t=1}^{T} P(O, s_t = j \mid \lambda)\,\delta(o_t, v_k)}{\sum_{t=1}^{T} P(O, s_t = j \mid \lambda)}    (5.11)

Equation (5.10) is basically the ratio between the expected number of transitions from state i to state j and the expected number of transitions out of state i; in equation (5.11), the numerator is the expected number of times of being in state s_j and observing the symbol v_k. The initial state probabilities \pi_i are usually distributed uniformly over the pre-designed initial states. The parameter updates can be summarized as follows:

\bar{\pi}_i = expected number of times of starting from state s_i at time t = 1

\bar{a}_{ij} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

\bar{b}_j(k) = (expected number of times in state s_j observing symbol v_k) / (expected number of times in state s_j)

The HMM parameters are improved by maximizing the overall likelihood P(O \mid \lambda) of the model \lambda for the observations O. With a sufficient amount of training data, the update process always leads to a better model after each

iteration. The iterations stop when the average likelihood difference between the previous and the current iteration reaches a predefined convergence ratio [15].

5.5 Sphinxtrain Tool Procedure

The language models (LM) and the lexicon information (dictionaries) were extracted from each corpus using the CMU Sphinx on-line LM tool [67], while the acoustic models were trained on the audio part of the corpora with the help of the sphinxtrain tool. The general acoustic model (AM) training procedure has the following main stages [15]:

1- Context-Independent (CI) Training: Each phoneme in the dictionary is given three HMM states with unique identification numbers. Flat initialization sets the means, variances and mixture weights to the same pooled values for all feature vectors; the flat-initialized models are then re-estimated with the Baum-Welch forward-backward algorithm.

2- Context-Dependent (CD) Untied Model: HMMs are trained for all context-dependent tri-phones seen in the training corpus, with the dictionary and the LM helping to detect them. This step estimates the parameters of each phoneme according to its phonetic context. The Baum-Welch algorithm is applied to re-estimate the GMM-HMM parameters, using the output of the previous (CI) stage as initial values.

3- Build Decision Trees: The phonemes are classified according to their common acoustic properties. The classification is done by building trees of questions about

whether a set of phonemes is acoustically similar, so that the next stage can pool the data from all of these similar states and train one global state, called a senone. The classes are predefined manually or automatically by sphinxtrain using the CI models. After the decision trees are built, they are pruned so that their leaves equal the number of tied states (senones) to be trained; the number of senones is set to 200 for all training procedures in this thesis.

4- Context-Dependent Tied States: All the possible tri-phone sets from the CD-untied stage and the senones from the build-tree stage are re-trained with the Baum-Welch algorithm, starting from a single Gaussian mixture and growing to a maximum of 10 mixtures.

The acoustic model (AM) produced by the training contains a model definition file, which is a table of the identification numbers of all the trained 3-state Markov chains. Each state is a senone, representing phonemes that share similar acoustic properties. Besides the model definition, the GMM parameters (means, variances and mixture weights) are stored in separate files; these parameters define the emission PDFs of each HMM state, while the transition matrices of the model are stored in a transition-matrices file. The global feature-extraction settings (such as the feature type) are listed in the feature parameters file.

Table 5.1: Basic information about the corpora used in the experiments. The audio-length columns show the total duration of the baseline audio files for each corpus. (Columns: corpus name, training and testing audio duration in hours, dictionary length in words, number of filler words, and number of phonemes; rows: TIMIT, PDAmWSJ and AN4.)

5.6 The Data

Several speech corpora were used to compare the proposed systems with the standard speech recognition system. Among them are the well-known TIMIT corpus, which is described in [56] and published in [68], and the alphanumeric corpus (AN4), which is available at [69]. In addition, the PDAmWSJ corpus was derived from the PDA speech database published in [70]. Its acoustic data were recorded through multiple microphones via a PDA-like mock-up; only the utterances from channel 5 that contain sentences from the Wall Street Journal (WSJ) were used. Table 5.1 lists those corpora and some of their specifications.

Chapter 6
Acoustic Noise and ASR Performance

6.1 Introduction

Many factors affect the accuracy of an ASR system, such as speaker accents, microphone variability, unexpected out-of-vocabulary (OOV) input and variations in the ambient noise. These factors cause serious degradation of ASR performance, because the recognizer is trained for specific circumstances that depend on the characteristics of the training corpus. The specific objective of this study is to consider the effect of noise variations on the ASR. The performance degradation caused by unexpected changes in the environmental acoustic noise is analyzed in this chapter. Understanding the pattern of ASR performance across noise types and levels reduces the complexity of the solution. The next section introduces the common metrics used to measure ASR performance, namely the word error rate (WER) and the decoding speed. The detailed analysis in section 6.3 focuses on finding useful patterns in ASR performance for noisy speech.

6.2 Performance Metrics

Automatic speech recognition quality is measured by the word error rate (WER) and the recognition speed [35]. The WER is measured by comparing the decoder hypotheses with the reference utterances. When the ASR is fed with speech audio files, it produces a hypothesis utterance in textual form for each file. The hypothesis is then compared with the human transcription of the same audio file, called the reference utterance. The word error rate is usually calculated as a percentage using the following formula [71]:

WER = \frac{Substitutions + Deletions + Insertions}{Substitutions + Deletions + Correct} \times 100\%   (6.1)

The denominator of equation 6.1 is the total number of reference (test) words, and the three error types in the numerator are:

Insertion: a word that the decoder produces in the hypothesis but that does not exist in the corresponding reference.

Deletion: a word that exists in the reference utterance but is missing from the corresponding hypothesis.

Substitution: a word in the reference utterance that was replaced by a different word in the hypothesis.
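The WER of equation 6.1 is computed in practice by aligning the hypothesis with the reference through dynamic programming. The sketch below is only an illustration of that computation, not the scoring tool used in the experiments; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N = number of reference words,
    computed from a Levenshtein alignment of the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,             # substitution or exact match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn the light on", "turn light on now"))  # 50.0
```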

The second metric used to measure performance is the recognition speed. It is usually measured by the real-time factor (RTF), expressed in times real-time (xRT), using the following formula:

RTF = \frac{Decoding\ Time}{Audio\ Time}   (6.2)

This unit-less ratio measures how long the recognizer takes to process one second of speech; the lower the value, the better. The RTF is important because it indicates the expected delay in real-time applications such as live speech recognition. When the RTF is 1 or less, the recognizer produces the textual transcript of live speech with no apparent delay.

6.3 Analyzing the Noise Effect on Performance

The ASR system suffers accuracy degradation when the environmental noise differs from that of the audio corpus used to train the acoustic model (AM). Testing different corpora with various noisy data shows that ASR performance degrades gradually as the noise level of the input signal increases, as shown in figure 6.1. To design a noise-immune ASR, many types and levels of acoustic noise have to be considered. There is therefore a trade-off between the amount of data required to train the ASR and the recognition efficiency. For different environments, the performance pattern is examined carefully to find the best compromise for that trade-off; in other words, we want to use as little data as possible without decreasing ASR performance significantly.

Figure 6.1: ASR word error rate degradation at various noise levels for the AN4, PDAmDigits and PDAmWSJ corpora. The higher the SNR in dB, the lower the noise level; at SNR = 50 dB the WER is close to that of the baseline clean audio input, as the signal carries less additive noise.

For this analysis, the experiments used four types of noise and SNR levels from 5 to 50 dB, in 5 dB steps for each type. The experimental audio corpora amounted to around hours of training data and 57.2 hours of testing data. In section 6.3.1 the effect of the noise type on the ASR is tested, and section 6.3.2 argues how to reduce the number of noise levels used in ASR training while still covering a wide range of input noise.

6.3.1 Noise Type Effect

There are many noise types that speech may suffer from, such as babble noise, additive white Gaussian noise (AWGN) and different colors of non-white acoustic noise. According to the experiments conducted throughout this research, there is no need to train the corpus with all noise types. When the AM is trained using AWGN, colored-noisy input speech degrades the ASR decoder by an amount that is almost equal to, or even less than, that caused by AWGN itself.

Figure 6.2: The WER for speech with different noise types when an AWGN acoustic model is used in the ASR. This model is trained using 20 dB SNR audio data. The babble noise is a random sample of crowd speech.

Figure 6.2 shows the recognition word error rate (WER) when the acoustic model is trained on AWGN audio at 20 dB SNR. The ASR was tested with input speech corrupted by different noise types: AWGN, pink, blue and crowd babble noise. The ASR performance is clearly better (lower WER) for noise types other than AWGN, except for blue noise at 25 dB SNR. When the AMs are trained using AWGN, the ASR handles other noise types even better than white noise itself. This phenomenon simplifies the mismatch-gap reduction for the tested data, so building AMs with several AWGN contamination levels is enough to handle any other noise type.

6.3.2 Noise Level Effect

As the aim is to build an ASR that works efficiently at various noise levels, the recognizer has to be trained using multiple degrees of noise contamination. The next question is therefore how many noise levels should be used to train the AMs without degrading performance significantly. Although there is in theory an infinite number of levels, the empirical results of this research show that they can be reduced to a small set. Referring to the performance of the different noise-level AMs versus the input noise level n in figure 6.3, it is easy to conclude the following:

1. For audio signals with n ≤ 10 dB, the highest accuracy is obtained with AM10. So if the acoustic model A1 is trained using audio with SNR = 10 dB, it covers this sub-range.

2. When 10 < n ≤ 15 dB, the highest performance is at AM15, which is denoted A2, the second model.

3. For 15 < n < 25 dB, the best performance is guaranteed when the acoustic model is AM20. This model is the third one in the knowledge base, denoted A3.

4. When the input audio has n > 20 dB, the reference baseline AM produces fewer errors than the other models. This means the AM trained on the baseline corpus, AM Reference, is used as the fourth model A4 to cover the noise-level sub-range from SNR = 20 dB upward (toward clean audio).

5. Selecting the acoustic model AM5, which is trained with 5 dB SNR audio, is not helpful, as it does not give a lower WER than AM10.

Figure 6.3: Baseline ASR word error rate degradation for the different acoustic models (AM Reference, AM5, AM10, AM15 and AM20) at various levels of input noise.

6.4 Conclusion

Based on the experimental results, when the expected input audio SNR varies from 5 to 50 dB for any noise type, it is enough to train acoustic models on the baseline audio corpus and on three AWGN-corrupted versions of it. These assumptions greatly reduce the AM generalization complexity when dealing with the degradation of ASR performance as the input speech noise level varies. The proposed noise-robust ASR, introduced in the next chapter, relies completely on the assumptions verified empirically in this chapter.

Chapter 7
Adaptive Model Switching (AMS) Speech Recognition

7.1 Introduction

Conventionally, an ASR has a knowledge base containing a single acoustic model (AM), a language model (LM) and a lexicon. The objective of noise-robust ASR is to keep the WER as low as possible when the environmental noise varies. To achieve this goal, the knowledge base of the proposed ASR contains multiple AMs trained for different noise levels. The suggested decoding procedure has three stages: estimating the noise level, selecting the AM that maximizes the expected performance, and decoding with maximum a posteriori (MAP) using the selected model. As the noise level may change, the robust ASR can adaptively switch the acoustic model. The proposed methodology is called the adaptive model switching algorithm.

Figure 7.1: The proposed Adaptive Model Switching (AMS) automatic speech recognition system, where s(t) is 30 ms of silence samples.

7.2 System Design

Unlike the conventional ASR architecture, the AMS ASR has more than one AM in its knowledge base, and the decoder picks one of them depending on the detected input noise level. As concluded in chapter 6, a minimum WER can be obtained by training acoustic models on a limited number of noise levels. The proposed noise-robust recognizer adaptively switches the acoustic model based on an estimate of the noise level of the input audio. Figure 7.1 shows the proposed system, called the Adaptive Model Switching (AMS) ASR.

7.3 Adaptive Model Switching ASR Training

Two main parts of the AMS ASR need to be trained: the multiple AMs and the noise level estimator (STS-SNR). The following sub-sections describe the training procedure for each of them.

7.3.1 Acoustic Models Training

The training noise levels are carefully selected so that they minimize ASR performance degradation. Each AM is trained on different acoustic evidence, derived from the same reference speech corpus doped with different noise levels. As shown in chapter 6, the AMs are trained for the following categories:

C1: 10 dB AWGN, used to build the acoustic model A1.
C2: 15 dB AWGN, used to build the acoustic model A2.
C3: 20 dB AWGN, used to build the acoustic model A3.
C4: no added noise, i.e. the audio corpus as is, used to build the acoustic model A4.

The training procedure consists of the following steps:

1. From the target corpus, extract the linguistic information: the language model and the dictionary (also known as the lexicon).

2. Use the raw audio of the target corpus to train the A4 acoustic model. This model is considered the baseline clean version.

3. From each audio segment s in the baseline training corpus, estimate the noisy speech s̃ for the k-th dB SNR level as follows (a short sketch of this doping step is given after the list).

Estimate the signal variance:

\sigma_s^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2   (7.1)

where N is the number of samples in the audio segment s.

Estimate the noise variance:

\sigma_n^2 = \frac{\sigma_s^2}{10^{k/10}}   (7.2)

Calculate random noise samples at k dB SNR as follows:

R_k = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\!\left(-\frac{r^2}{2\sigma_n^2}\right)   (7.3)

where r is a vector of random numbers of length N.

Estimate the noisy speech s̃ by adding the noise samples to the baseline audio:

\tilde{s} = R_k + s   (7.4)

4. Repeat step 3 for all values of k, where k = 10, 15, 20 dB SNR, to build the A1, A2 and A3 acoustic models.

With four different AMs, the designed AMS recognizer selectively uses one at a time depending on the input noise level. The selected model is expected to reduce the performance degradation in the noisy environment.
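The doping step above (equations 7.1, 7.2 and 7.4) can be sketched in a few lines. This is an illustrative NumPy version, not the exact script used to build the corpora: it draws zero-mean Gaussian samples and scales them to the noise variance of equation 7.2, which achieves the same target SNR.

```python
import numpy as np

def add_awgn(speech, snr_db):
    """Corrupt a speech segment with white Gaussian noise at a target SNR.

    speech : 1-D float array of audio samples (the baseline segment s).
    snr_db : desired signal-to-noise ratio k in dB.
    """
    signal_var = np.mean(speech ** 2)                  # eq. 7.1
    noise_var = signal_var / (10.0 ** (snr_db / 10))   # eq. 7.2
    noise = np.sqrt(noise_var) * np.random.randn(len(speech))
    return speech + noise                              # eq. 7.4

# Example: build the 10, 15 and 20 dB training versions of one segment.
# noisy = {k: add_awgn(segment, k) for k in (10, 15, 20)}
```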

Figure 7.2: The Short-Time-Silence (STS-SNR) estimator operation sequence, where the silence samples s(t) are 30 ms of silence or non-speech signal.

7.3.2 STS-SNR Training

Allowing the ASR decoder to pick a specific AM with minimal risk reduces the undesired effects of changes in the environmental noise level. The SNR level of the input speech signal can be detected using various types of estimators. The proposed short-time-silence estimator (STS-SNR), shown in figure 7.2, is used in this thesis (it is discussed in detail in chapter 4 and published in [72]). The STS-SNR estimates the background noise level using the following equation:

\hat{B} = 10 \log_{10}\!\left(\frac{1}{M/4}\sum_{j=1}^{M/4-1} P_j P_j^T\right)   (7.5)

where P = |S(f)|^2 is the power spectral density (PSD) of 30 ms of silence samples, P^T is its transpose and M is the number of fast Fourier transform (FFT) samples. The estimated SNR level k̂ in dB for the speech audio is then:

\hat{k} = \text{Offset} - \hat{B}   (7.6)

According to the discussion in section 4.3.2, people usually speak with a power around 23 dB higher than the background noise [72]. However, testing this algorithm on different corpora shows that the speech power is generally higher than that; in other words, the speech-to-background offset is larger than 23 dB for some corpora. Therefore, the STS-SNR estimator is trained to find the optimum offset value, which provides a more stable noise estimate for the ASR. The following training procedure is used:

1. Use the training audio files from the noisy corpus obtained in section 7.3.1.

2. From each training set in the i-th dB SNR level corpus:

(a) For offset values j from 20 to 60, do the following:

i. Estimate the background noise B̂_i using equation 7.5.

ii. Estimate the SNR level in dB for the current offset value:

\hat{n}_j = \text{Offset}_j - \hat{B}_i

iii. Calculate the mean absolute error between the real and the estimated SNR:

\text{MAE}_j = \frac{1}{N}\sum_{i=1}^{N} \left|\hat{n}_j - n_i\right|

(b) Find the offset value j_opt that gives the minimum MAE:

j_{opt} = \arg\min_j \text{MAE}_j

3. Repeat for the SNR level categories 0 ≤ i ≤ 3.

After estimating the optimum offset value, the training audio for each noise level i is fed to the STS-SNR estimator to estimate the mean μ_i and variance σ_i^2 at the i-th noise level, giving the inference model {θ_i}:

\theta_i = \mathcal{N}(\mu_i, \sigma_i^2)   (7.7)

where μ_i and σ_i^2 are the sample mean and variance of each noise category C_i respectively and i = {0, 1, 2, 3}. The resulting mean and variance at each noise level are the parameters used in the decoding phase.
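A compact sketch of the estimator and its offset calibration, as reconstructed from equations 7.5 and 7.6 and the procedure above, is given below. The function names, the FFT size and the use of the lowest quarter of the PSD bins are illustrative assumptions, not the published STS-SNR implementation.

```python
import numpy as np

def estimate_background(silence, n_fft=512):
    """Eq. 7.5 (as reconstructed): background level from ~30 ms of non-speech samples."""
    spectrum = np.fft.rfft(silence, n=n_fft)
    psd = np.abs(spectrum) ** 2                      # P = |S(f)|^2
    low_band = psd[: n_fft // 4]                     # first quarter of the bins
    return 10.0 * np.log10(np.mean(low_band ** 2))   # B_hat in dB

def estimate_snr(silence, offset):
    """Eq. 7.6: SNR estimate from the background level and a trained offset."""
    return offset - estimate_background(silence)

def train_offset(silence_frames, true_snrs, offsets=range(20, 61)):
    """Search offsets from 20 to 60 dB and keep the one minimising the mean
    absolute error between estimated and known SNR levels."""
    b_hats = np.array([estimate_background(s) for s in silence_frames])
    true_snrs = np.asarray(true_snrs, dtype=float)
    mae = [np.mean(np.abs((off - b_hats) - true_snrs)) for off in offsets]
    return list(offsets)[int(np.argmin(mae))]
```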

7.4 AMS Decoding Algorithm

During decoding, the STS-SNR continuously estimates the expected noise level of the input speech signal. The AM selector uses the statistical inference parameters θ_i to pick the optimum AM, and MAP decoding is then applied with the parameters of the selected acoustic model A_i. The AMS decoding procedure is as follows:

1. Feed the silence samples to the STS-SNR module to estimate the input SNR level n̂.

2. Estimate the best model type using the decision rule in formula 7.8.

3. Use the MAP decoder in 7.11 to obtain the final best hypothesis with the model A_i.

4. Update the silence samples.

The AM selector picks the model A_i that maximizes the probability of the estimated input noise n̂ under the trained STS-SNR inference model θ_i:

A_i = \arg\max_{1 \le i \le 4} f(\hat{n}; \theta_i)   (7.8)

where f(n̂; θ_i) is the probability of the estimated noise level n̂ belonging to class i:

f(\hat{n}; \theta_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(\hat{n}-\mu_i)^2}{2\sigma_i^2}\right), \quad 1 \le i \le 4   (7.9)

Equation 7.9 measures the probability that the input speech is corrupted by the i-th noise level. The best match between the acoustic model A_i and the input speech noise level SNR = n̂ dB occurs when the latter is most probable under the C_i noise-level category.
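The decision rule of equations 7.8 and 7.9 amounts to a Gaussian classification of the estimated SNR. A minimal sketch, with illustrative category statistics rather than values measured in this work, is shown below.

```python
import numpy as np

def fit_inference_model(snr_estimates_per_category):
    """Eq. 7.7: theta_i = N(mu_i, sigma_i^2) for each noise category C_i,
    fitted from STS-SNR estimates on that category's training audio."""
    return [(np.mean(x), np.var(x)) for x in snr_estimates_per_category]

def select_model(n_hat, theta):
    """Eqs. 7.8-7.9: pick the acoustic model whose noise category makes the
    estimated SNR n_hat most probable under its Gaussian."""
    def log_likelihood(mu, var):
        return -0.5 * np.log(2 * np.pi * var) - (n_hat - mu) ** 2 / (2 * var)
    scores = [log_likelihood(mu, var) for mu, var in theta]
    return int(np.argmax(scores))        # index i of the selected model A_i

# Example with illustrative statistics for the four categories
# (10 dB, 15 dB, 20 dB and clean):
theta = [(10.2, 4.0), (15.1, 4.5), (19.8, 5.0), (31.0, 25.0)]
print(select_model(18.5, theta))         # -> 2, i.e. the 20 dB model A_3
```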

The acoustic models for every corpus are context-independent continuous models with 3 states per HMM and no skip states. The language model is a 3-gram statistical model, which consists of a list of the probabilities of any word w_n occurring after the sequence w_{n-2} w_{n-1}. This probability is simply obtained by counting the occurrences of the word w_n after different preceding word sequences. In general, the N-gram probability of each word w_n is calculated using the following formula [73]:

P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}   (7.10)

where C(x) counts the number of occurrences of x. The language model and acoustic model information are used together to produce the best guess Ŵ through the following 3-step process:

\hat{W} = \underbrace{\arg\max_{W \in L}}_{\text{Step 3}} \; \underbrace{P(O \mid W)}_{\text{Step 1}} \, \underbrace{P(W)}_{\text{Step 2}}   (7.11)

In step 1, the likelihood P(O | W) is calculated by scoring the current observation against the acoustic model. Step 2 computes the prior probability P(W) of each suggested word sequence using the LM tables produced by equation 7.10. In step 3, the decoder searches over all valid words and sentences in L to find the maximum-likelihood utterance [5].
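Equation 7.10 is a plain relative-frequency estimate. The following sketch shows the counting for the trigram case; it omits the smoothing and back-off that a real LM toolkit such as the CMU on-line LM tool would apply, and the example sentences are invented.

```python
from collections import Counter

def train_trigram(sentences):
    """Relative-frequency trigram estimates, eq. 7.10:
    P(w_n | w_{n-2} w_{n-1}) = C(w_{n-2} w_{n-1} w_n) / C(w_{n-2} w_{n-1})."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for a, b, c in zip(words, words[1:], words[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1

    def prob(word, history):
        h = tuple(history)
        return tri[(h[0], h[1], word)] / bi[h] if bi[h] else 0.0
    return prob

p = train_trigram(["turn the light on", "turn the radio on"])
print(p("light", ("turn", "the")))   # 0.5
```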

7.5 Silence Samples Update

The STS-SNR estimator assumes that 30 ms of non-speech samples s(t) are enough to detect the noise level once someone speaks in that environment. The silence is taken to be the first samples captured directly after the microphone is turned on. If the noise level is expected to change during the recognition session, silence samples detected by the ASR are used to update the estimated noise level. Modern ASRs usually provide information that can be used directly to detect the silence samples. For example, Sphinx4 tracks non-speech events, including silences, which are embedded in the filler dictionary and labeled as SIL. Conveniently, the recognizer also gives timing information for filler words, so no extra voice activity detector is required to update the estimated noise level [15]. Therefore, 30 ms of samples can be extracted from the recognizer itself by using the nearest SIL segment. Based on a chosen update rate, the SIL sample information is fed to the STS-SNR whenever an update is needed.

7.6 Conclusion

Whereas a typical ASR has a single AM, the proposed AMS-ASR uses four acoustic models trained on the four noise categories described earlier. In the proposed AMS-ASR, the SNR is estimated first to decide which of the four AMs will give the minimum WER under the current conditions. The decoder then uses the maximum a posteriori rule in 7.11 to find the most probable sequence of words by matching the input signal against the selected acoustic model. The AM selector module dynamically alters the AM based on the current input noise conditions. This system tries to keep the WER as low as possible even when the environmental noise varies continuously.

Chapter 8
Results and Discussion

8.1 Introduction

The AMS system is tested on different corpora with AWGN, door slam noise plus AWGN, and babble noise, in which many people are talking at the same time. The proposed AMS ASR was tested and its performance compared with the baseline recognition system. Other possible approaches were also tested in the same environment, specifically acoustic model generalization (AMG) and multiple-decoder maximum MAP voting (maxmap). The AMG is an ASR system with a single acoustic model trained on different noise levels; in this case, the training corpus is divided equally into four sets, each corrupted by a specific level of noise. The maxmap system, on the other hand, has the same four acoustic models as the proposed AMS system; however, it does not detect the expected noise level and instead uses all four models at the same time.

Figure 8.1: Multiple-decoder maximum MAP (maxmap) architecture: four parallel decoders produce hypotheses Hyp1-Hyp4 from the same features, and the maxmap stage selects the best hypothesis.

In the maxmap approach there are four parallel decoders, so four hypotheses emerge. The winning hypothesis of the maxmap ASR is the one with the maximum MAP score, as shown in figure 8.1.
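The voting rule of figure 8.1 reduces to keeping the highest-scoring hypothesis. The sketch below is only a schematic of that comparison baseline; the decode callables stand in for whatever recognizer produces a (hypothesis, MAP score) pair per acoustic model.

```python
def maxmap_decode(features, decoders):
    """Run every decoder on the same features and keep the hypothesis
    with the highest MAP score (figure 8.1)."""
    results = [decode(features) for decode in decoders]   # [(hyp, score), ...]
    best_hyp, best_score = max(results, key=lambda r: r[1])
    return best_hyp

# decoders would be four functions, one per acoustic model A_1..A_4,
# each returning (hypothesis_text, map_score).
```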

8.2 The Word Error Rate (WER)

The adaptive model switching (AMS) approach was tested at various noise levels and with various noise types. In all cases the acoustic models were trained only with AWGN, as suggested in chapter 6. The proposed AMS system was compared with the AMG and maxmap paradigms as other possible solutions to the problem. In the following subsections, white noise, door slam noise and crowd babble noise are tested on the three ASR approaches to compare their performance.

Figure 8.2: WER (%) of the proposed AMS, AMG and maxmap approaches for TIMIT at different noise levels of AWGN.

8.2.1 White Noise Experiment

The WER results in figures 8.2, 8.3 and 8.4 show the baseline, AMS, AMG and maxmap performance at various noise levels for the TIMIT, AN4 and PDAmWSJ corpora respectively. All approaches had an advantage over the baseline ASR; however, the proposed AMS system was better than the AMG and maxmap systems, especially at low SNR. In fact, the proposed AMS performed almost as if there were no noise, even in those high-noise situations.

Figure 8.3: WER (%) of the proposed AMS, AMG and maxmap approaches for AN4 at different noise levels of AWGN.

Figure 8.4: WER (%) of the proposed AMS, AMG and maxmap approaches for PDAmWSJ at different noise levels of AWGN.

Figure 8.5: WER (%) of the proposed AMS, AMG and maxmap approaches for TIMIT at different noise levels of AWGN plus door slam noise.

8.2.2 Door Slam Noise Experiment

The aim of this experiment was to test the STS-SNR estimator in an unexpected environment that was not used in the training process. Recorded door slam audio samples were added at the beginning of each audio segment in the test part of the corpus, together with AWGN. The acoustic models of the AMS, as well as the STS-SNR noise level estimator, were trained with AWGN only. The intention was to check how much performance degradation the STS-SNR module can cause in the proposed AMS, since the noise level estimator may estimate the noise level incorrectly and lead the AM selector to wrong decisions. Figures 8.5, 8.6 and 8.7 show the WER comparison of the AMG, maxmap and proposed AMS ASR when AWGN is combined with door slam noise.

Figure 8.6: WER (%) of the proposed AMS, AMG and maxmap approaches for AN4 at different noise levels of AWGN plus door slam noise.

Figure 8.7: WER (%) of the proposed AMS, AMG and maxmap approaches for PDAmWSJ at different noise levels of AWGN plus door slam noise.

For the AN4 corpus, all the tested systems failed to keep the WER down even at high SNR. Figure 8.6 shows that the WER% actually increases slightly as the noise level decreases. The reason for this unexpected behavior is that the AN4 audio files contain short utterances compared to the duration of the door slam noise; even though the door slam is added at the beginning of the utterance, its effect extends into the speech part. In general, the AMS still performs better even under such unexpected conditions.

8.2.3 Babble Noise Experiment

Another experiment was designed to test the proposed AMS system when the background noise is the babbling of many people. Babble noise occupies the same bandwidth as the speech signal, so it cannot be filtered out by the front-end of the ASR. More advanced ways to remove such noise exist, for example using microphone arrays and applying algorithms such as angle-of-arrival (AOA) detection to suppress unwanted background speech. The strategy proposed here, however, is simply to use white-noise acoustic models to recognize speech embedded in a babble-noise background. The fundamental idea is to reduce the mismatch error of the ASR by picking the AM whose noise level is closest to that of the input speech. Although the acoustic models were not trained for babble noise, picking the AWGN model of the matching noise level reduces the error rate significantly. In this experiment, crowd babble noise is tested on acoustic models trained with clean, 10, 15 and 20 dB AWGN speech data. The proposed AMS was again compared with the AMG and maxmap ASR systems.

Figure 8.8: WER comparison of AMS, AMG and maxmap for the TIMIT corpus with crowd babble noise. All acoustic models are trained using AWGN speech signals.

Figure 8.8 depicts the performance of the AMS, AMG and maxmap systems for TIMIT with crowd babble noise. The AMS approach clearly outperformed the other two ASR systems and showed an improvement at all noise levels. For the AN4 corpus in figure 8.9, the AMG ASR seems to perform better than the other two systems at low SNR; however, it performs worse at high SNR (low noise) compared with maxmap and the proposed AMS. At 10 dB SNR, both maxmap and AMG outperform the proposed AMS ASR. Two factors affect the AMS performance here: the STS-SNR estimator and the acoustic model parameters. As mentioned before, the audio segments in the AN4 corpus are shorter than those of TIMIT and PDAmWSJ, and the silence period at the beginning of each segment is shorter as well. Hence, the probability that the STS-SNR misclassifies the input noise level increases for AN4.

Figure 8.9: WER comparison of AMS, AMG and maxmap for the AN4 corpus with babble noise input. All acoustic models are trained using AWGN speech signals.

Figure 8.10: WER comparison of AMS, AMG and maxmap for the PDAmWSJ corpus with crowd babble noise. All acoustic models are trained using AWGN speech signals.

Table 8.1: Relative WER% improvement of the AMG, maxmap and proposed AMS ASR systems (columns: AMG, maxmap, AMS; rows: white, door slam and babble noise for each of TIMIT, AN4 and PDAmWSJ, plus the overall average).

Figure 8.10, which depicts the experimental results for the PDAmWSJ corpus, shows that the proposed AMS performs better than the tested AMG and maxmap approaches. As with TIMIT, the utterance durations are sufficient and there are enough silence samples at the beginning of each segment.

8.3 WER Relative Improvement Comparison

The acoustic model generalization (AMG), the maximum-MAP multiple-decoder voting (maxmap) and the proposed adaptive model switching (AMS) behaved differently, as shown in the previous section. Table 8.1 summarizes the overall relative WER improvement of each tested ASR methodology. The relative improvement (RI) is measured by comparing the average WER of each tested ASR, WER_T, with the baseline ASR error rate, WER_B, as follows:

RI = \frac{WER_B - WER_T}{WER_B} \times 100\%   (8.1)

For every noise type, the proposed AMS outperformed the AMG and maxmap approaches. The highest improvement was obtained for AWGN, because the acoustic models were trained with that noise type. Interestingly, the proposed AMS also improved on the baseline ASR for crowd babble and door slam noise, although those noise types were not introduced in training. The experimental results confirm the conclusion of chapter 6 about the usefulness of AWGN models even for other types of noise. Crowd babble noise is highly non-stationary and occupies the same bandwidth the ASR uses to decode speech; in this case the feature extraction step, with its many filters and power averaging, cannot exclude the unwanted signal. Therefore all tested systems, including the AMS approach, showed only a small WER improvement with crowd babble noise.
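Equation 8.1 is straightforward to evaluate; a small sketch with made-up WER values is shown below.

```python
def relative_improvement(wer_baseline, wer_tested):
    """Eq. 8.1: relative WER improvement of a tested system over the baseline,
    with the WERs averaged over the tested noise levels."""
    avg_b = sum(wer_baseline) / len(wer_baseline)
    avg_t = sum(wer_tested) / len(wer_tested)
    return 100.0 * (avg_b - avg_t) / avg_b

print(relative_improvement([60.0, 40.0, 20.0], [30.0, 25.0, 15.0]))  # ~41.7
```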

134 0.4 xrt SNR db Baseline AMS AMG maxmap Figure 8.11: The average decoding speed in times real-time (xrt) for AMG, maxmap and the proposed AMS approaches with TIMIT corpus. Comparing the real-time factor for the three corpora shows that maxmap approach performed slower than the AMG and the proposed AMS. There is a relative speed gain of using the proposed AMS recognition system over the baseline one in the noisy environment. After conducting multiple speed test experiments, it was discovered that the HMM decoder will reach the final state faster when the AM was trained using the similar environmental situations. Therefore, the lower the SNR, the slower the baseline ASR, whereas the proposed AMS speed was almost the same at different noise levels. Generally, the proposed AMS showed to be the fastest recognizer among the three tested recognizers. Interestingly, the AMG decoder had a comparable processing speed to AMS ASR. For PDAmWSJ corpus speed test results in figure 8.13, the AMG and the proposed AMS had almost the same processing time for each noise level. 116

135 xrt SNR db Baseline AMS AMG maxmap Figure 8.12: The average decoding speed in times real-time (xrt) for the AMG, maxmap and the proposed AMS approaches with AN4 corpus. 117

136 4 3 xrt SNR db Baseline AMS AMG maxmap Figure 8.13: The average decoding speed in times real-time (xrt) for the AMG, maxmap and the proposed AMS approaches with PDAmWSJ corpus. On the other hand, maxmap ASR reaches its final hypothesis slower than the baseline ASR system. The speed loss of the maxmap ASR is due to having four decoders that have to work simultaneously for each acoustic evidence. Table 8.2 list the percentage of the average relative improvement of each tested approach using TIMIT, AN4 and PDAmWSJ corpora. The test was measured by the pocketsphinx decoder working on 2.6 GHz Intel R Core TM i7 quad core processor. 8.5 Conclusion Training the acoustic models using audio speech that was corrupted by white noise for certain levels, with the ability to switch the models adaptively, significantly improved ASR performance. 118

Table 8.2: The relative improvement (%) of the average decoding speed for the AMG, maxmap and proposed AMS on each corpus (rows: TIMIT, AN4, PDAmWSJ and the average; columns: AMG, maxmap, AMS).

If an acoustic model is trained at a certain level of white noise, it decodes audio corrupted by other types of noise with fewer errors. White noise has a wide-band spectrum, while other types of noise have narrower bandwidths, and the acoustic features used in the ASR only consider the speech-band spectra, as discussed in chapter 5. Therefore, AWGN models can be used as general-case models to decrease the training effort. The main feature of the proposed AMS ASR is that it estimates the acoustic noise level and decides which AM is most suitable for that environment. Both the WER and the decoding speed were improved by this scheme. The proposed AMS outperformed the other tested approaches, model generalization (AMG) and multiple-decoder voting (maxmap). The maxmap approach achieved accuracy comparable to the proposed AMS, but it was slower in many cases. On the other hand, the AMG was as fast as the AMS in most cases but less accurate than the proposed approach.

Chapter 9
Conclusions

In this study, several fundamental issues were investigated and tested extensively. Automatic speech recognition systems have improved greatly over the last decade, especially client-server commercial ASR systems. With sufficient acoustic data comprising various types and levels of noise, the accuracy of the ASR increases dramatically. However, local ASR systems are highly affected by noise variations due to the lack of training data with specific noise levels and types. In order to train the ASR on different noise levels and types, the noise signal is added after data collection so that the levels and types of the tested noise can be controlled. The acoustic signals are usually combined in the digital domain with a linear addition. In this dissertation, the linear acoustic combination, algebraic addition using the + operator, was first verified, in an effort to determine the best way to simulate the real combination of the acoustic signal and noise. A new formula, the vector combination addition, was derived to describe the transformation of the acoustic combination from the acoustic space to the electrical domain. It was found that the linear + addition, compared with vector addition, is an approximation of the acoustic combination captured by a single microphone.

Detecting the noise level before the speech recognition decoding process is valuable: a proper estimate of the input acoustic noise level gives the decoder a chance to prepare suitable parameters for each noise situation. Moderate speaking power was found to be at least 23 dB higher than the ambient noise; early acoustic studies showed that recognizable speech should be around 20 dB above the faint-speech level [45]. The proposed Short-Time of Silence (STS) SNR estimator was based on that fact, and it proved to be more accurate and faster than the WADA and NIST algorithms.

It was also observed that training the acoustic models of the ASR with white noise at selected levels, with the ability to switch the models adaptively, significantly improves ASR performance. White noise occupies a wide-band spectrum, while other types of noise have narrower bandwidths; therefore, AWGN models can be used as general noisy models to decrease the training effort. The ASR adaptation to noise variations uses the STS-SNR estimator to determine the noise level and then alter the decoder's parameters to reduce the error. The adaptive model switching (AMS) system was introduced as an efficient approach to tackle the performance degradation of the ASR under varying noisy environments. It was more efficient than using a general model or a multiple-decoder voting scheme in both error rate and speed. The proposed AMS approach is a general recipe that could be applied to any other ASR system, although it was tested on the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) paradigm. Future work would therefore test the AMS approach on other ASR types, such as ANN-HMM systems.

Appendices

Appendix A
Noisy TIMIT Corpus

Noisy TIMIT was developed throughout this research; it is a new version of the well-known TIMIT corpus. The new corpus was published by the Linguistic Data Consortium (LDC) at the University of Pennsylvania in 2017 [38]. As the experiments conducted in Chapter 5 required more types and levels of noisy speech, the audio in the TIMIT corpus was reproduced. According to what was proven in Chapter 3, the linear combination is an approximate representation, in the electrical domain, of the superposition of acoustic signals. Therefore, the linear addition was used to simulate the acoustic combination of the speech and noise.

A.1 TIMIT Corpus

Texas Instruments and Massachusetts Institute of Technology (TIMIT) is a speech corpus designed for acoustic-phonetic knowledge acquisition and for development and evaluation purposes.

Table A.1: Dialect region distribution of the TIMIT test set (columns: dialect region, number of male speakers, number of female speakers, and total).

Table A.2: Sentence categories in TIMIT.
Sentence Type   #Sentences   #Speakers   Total   #Sentences/Speaker
Dialect (SA)    2            630         1260    2
Compact (SX)    450          7           3150    5
Diverse (SI)    1890         1           1890    3
Total           2342                     6300    10

TIMIT is a US English continuous-speech corpus that is phonetically balanced. It also covers a large number of the most common accents in the United States. The corpus contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States [74]. The phonetically-diverse sentences were chosen from existing text sources, the Brown Corpus [75] and the Playwrights Dialog [76], to add diversity in sentence types and phonetic contexts. The selection paradigm maximized the variety of phonetic contexts found in the texts. Each speaker read 3 of these sentences, with each sentence being read by only a single speaker. Table A.2 summarizes the TIMIT sentence categories.

Figure A.1: Noisy TIMIT corpus directory structure.

A.2 Noisy TIMIT Data Structure

Figure A.1 shows the main structure of the Noisy TIMIT directory. In each subdirectory of figure A.1, the dialect region (dr) and sentence-type file names are identical to those of the TIMIT corpus. The original TIMIT structure is repeated for each noise level and type: there are six types of noise, and for each noise category there are 10 different noise levels. The phonetic description files are not provided, while the prompts are provided in separate files. The hierarchical file arrangement is otherwise identical to the original TIMIT corpus. Starting from <Noisy_TIMIT>/<Noise_Type>/wav<Type>xdB>/, the speech flac files are organized according to the following hierarchy [74]:

<Usage>/<Dialect>/<Sex><SpeakerID>/<SentenceID>.<FileType>

where:

Usage : train or test

Dialect : drx, where x is the dialect region code in table A.1.

Sex : m or f

SpeakerID : <INITIALS><DIGIT>, where INITIALS is the speaker's three initials and DIGIT is 0-9, used to differentiate speakers with identical initials.

SentenceID : <TEXT_TYPE><SENTENCE_NUMBER>, where TEXT_TYPE is sa, si or sx (see table A.2 for the sentence text types) and SENTENCE_NUMBER is the sentence number.

FileType : flac

For example, the file

Noisy_TIMIT/RedNoise/wavRed15dB/train/dr1/fcjf0/sa1.flac

is audio with additive red noise, from the training set, dialect region 1, a female speaker with speaker-id cjf0, sentence text sa1, stored as a speech waveform compressed with flac. The text associated with this audio is found in the transcription file NoisyTIMIT Documentation/timit train.transcription, in a line that ends with (SA1). Note that this line is repeated in the file because its associated audio occurs multiple times in the training set. The test and train transcription files are the textual representation of the audio directories and are arranged in the order of occurrence.
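The naming scheme above can be unpacked programmatically. The sketch below follows the hierarchy described in this section; the function name and the returned field names are illustrative.

```python
import re

def parse_noisy_timit_path(path):
    """Unpack <Noise_Type>/wav<Type>xdB>/<Usage>/<Dialect>/<Sex><SpeakerID>/
    <SentenceID>.flac into its metadata fields."""
    parts = path.strip("/").split("/")
    noise_type, level_dir, usage, dialect, speaker, filename = parts[-6:]
    snr_db = re.search(r"(\d+)dB", level_dir).group(1)
    return {
        "noise_type": noise_type,
        "snr_db": int(snr_db),
        "usage": usage,                          # train or test
        "dialect_region": dialect,               # dr1 .. dr8
        "sex": speaker[0],                       # m or f
        "speaker_id": speaker[1:],               # e.g. cjf0
        "sentence_id": filename.split(".")[0],   # e.g. sa1
    }

info = parse_noisy_timit_path(
    "Noisy_TIMIT/RedNoise/wavRed15dB/train/dr1/fcjf0/sa1.flac")
print(info["snr_db"], info["sex"], info["sentence_id"])   # 15 f sa1
```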

A.3 Noise Types and Ranges

Additive white, pink, blue, red, violet and babble noise were selected for inclusion in the Noisy TIMIT corpus. Except for the babble noise, the background noise signals were artificially generated using the MATLAB programming environment. The noise level variation is measured by the signal-to-noise ratio (SNR) in decibels (dB); it varies in 5 dB steps and ranges from 5 to 50 dB. The white, pink, blue, red and violet noise types were generated artificially in MATLAB. The function randn(), which generates normally distributed random numbers, is used to generate the white noise samples. If S(f) is the power spectral density (PSD) of the white noise, the colored-noise PSDs are:

A. Pink noise PSD: S_p = S(f)/f
B. Blue noise PSD: S_b = S(f) \cdot f
C. Red noise PSD: S_r = S(f)/f^2
D. Violet noise PSD: S_v = S(f) \cdot f^2

where f is the frequency index, such that 0 ≤ f ≤ NFFT and NFFT is the total number of frequency samples. Figure A.2 shows the single-sided spectrum of each noise color. The derived colored-noise samples are designed to have unit standard deviation and zero mean. After calculating the signal power of the clean TIMIT audio, the noise signal power is scaled to obtain the desired signal-to-noise ratio (SNR). The noisy audio is then formed by adding the noise to the baseline audio:

x(t) = s(t) + n(t)   (A.1)

where s(t) is the original TIMIT audio signal and n(t) is the white or colored noise signal.
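One way to realize the PSD shapes listed above is to shape white Gaussian noise in the frequency domain, using the square root of each PSD law as the amplitude weighting. The NumPy sketch below is an illustration of that idea, not the MATLAB code used to build the corpus.

```python
import numpy as np

def colored_noise(n_samples, color="pink"):
    """Shape white Gaussian noise so its PSD follows the 1/f, f, 1/f^2 or f^2
    laws listed above, then normalise to zero mean and unit variance
    (amplitude shaping is the square root of the PSD law)."""
    exponents = {"white": 0.0, "pink": -0.5, "blue": 0.5, "red": -1.0, "violet": 1.0}
    white = np.random.randn(n_samples)
    spectrum = np.fft.rfft(white)
    f = np.fft.rfftfreq(n_samples)
    f[0] = f[1]                                 # avoid division by zero at DC
    spectrum *= f ** exponents[color]
    noise = np.fft.irfft(spectrum, n=n_samples)
    return (noise - noise.mean()) / noise.std()

# Scale to a target SNR against a clean TIMIT segment s, as in equation A.1:
# sigma_n = np.sqrt(np.mean(s**2) / 10**(snr_db / 10))
# x = s + sigma_n * colored_noise(len(s))
```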

Figure A.2: Single-sided spectra of the different noise colors (white, pink, blue, red and violet), where the y-axis is the spectrum amplitude |X(f)| and the x-axis is the frequency f in Hertz.


Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Peter J. Murphy and Olatunji O. Akande, Department of Electronic and Computer Engineering University

More information

SIGNAL PROCESSING FOR ROBUST SPEECH RECOGNITION MOTIVATED BY AUDITORY PROCESSING CHANWOO KIM

SIGNAL PROCESSING FOR ROBUST SPEECH RECOGNITION MOTIVATED BY AUDITORY PROCESSING CHANWOO KIM SIGNAL PROCESSING FOR ROBUST SPEECH RECOGNITION MOTIVATED BY AUDITORY PROCESSING CHANWOO KIM MAY 21 ABSTRACT Although automatic speech recognition systems have dramatically improved in recent decades,

More information

A Spectral Conversion Approach to Single- Channel Speech Enhancement

A Spectral Conversion Approach to Single- Channel Speech Enhancement University of Pennsylvania ScholarlyCommons Departmental Papers (ESE) Department of Electrical & Systems Engineering May 2007 A Spectral Conversion Approach to Single- Channel Speech Enhancement Athanasios

More information

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events

Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events INTERSPEECH 2013 Joint recognition and direction-of-arrival estimation of simultaneous meetingroom acoustic events Rupayan Chakraborty and Climent Nadeu TALP Research Centre, Department of Signal Theory

More information

VQ Source Models: Perceptual & Phase Issues

VQ Source Models: Perceptual & Phase Issues VQ Source Models: Perceptual & Phase Issues Dan Ellis & Ron Weiss Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA {dpwe,ronw}@ee.columbia.edu

More information

Artificial Neural Networks. Artificial Intelligence Santa Clara, 2016

Artificial Neural Networks. Artificial Intelligence Santa Clara, 2016 Artificial Neural Networks Artificial Intelligence Santa Clara, 2016 Simulate the functioning of the brain Can simulate actual neurons: Computational neuroscience Can introduce simplified neurons: Neural

More information

RECENTLY, there has been an increasing interest in noisy

RECENTLY, there has been an increasing interest in noisy IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 9, SEPTEMBER 2005 535 Warped Discrete Cosine Transform-Based Noisy Speech Enhancement Joon-Hyuk Chang, Member, IEEE Abstract In

More information

Comparison of Spectral Analysis Methods for Automatic Speech Recognition

Comparison of Spectral Analysis Methods for Automatic Speech Recognition INTERSPEECH 2013 Comparison of Spectral Analysis Methods for Automatic Speech Recognition Venkata Neelima Parinam, Chandra Vootkuri, Stephen A. Zahorian Department of Electrical and Computer Engineering

More information

Learning New Articulator Trajectories for a Speech Production Model using Artificial Neural Networks

Learning New Articulator Trajectories for a Speech Production Model using Artificial Neural Networks Learning New Articulator Trajectories for a Speech Production Model using Artificial Neural Networks C. S. Blackburn and S. J. Young Cambridge University Engineering Department (CUED), England email: csb@eng.cam.ac.uk

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this

More information

Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis

Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis Mohini Avatade & S.L. Sahare Electronics & Telecommunication Department, Cummins

More information

Applications of Music Processing

Applications of Music Processing Lecture Music Processing Applications of Music Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Singing Voice Detection Important pre-requisite

More information

EC 6501 DIGITAL COMMUNICATION UNIT - II PART A

EC 6501 DIGITAL COMMUNICATION UNIT - II PART A EC 6501 DIGITAL COMMUNICATION 1.What is the need of prediction filtering? UNIT - II PART A [N/D-16] Prediction filtering is used mostly in audio signal processing and speech processing for representing

More information

Spectral Noise Tracking for Improved Nonstationary Noise Robust ASR

Spectral Noise Tracking for Improved Nonstationary Noise Robust ASR 11. ITG Fachtagung Sprachkommunikation Spectral Noise Tracking for Improved Nonstationary Noise Robust ASR Aleksej Chinaev, Marc Puels, Reinhold Haeb-Umbach Department of Communications Engineering University

More information

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Noha KORANY 1 Alexandria University, Egypt ABSTRACT The paper applies spectral analysis to

More information

Voiced/nonvoiced detection based on robustness of voiced epochs

Voiced/nonvoiced detection based on robustness of voiced epochs Voiced/nonvoiced detection based on robustness of voiced epochs by N. Dhananjaya, B.Yegnanarayana in IEEE Signal Processing Letters, 17, 3 : 273-276 Report No: IIIT/TR/2010/50 Centre for Language Technologies

More information

SPEECH AND SPECTRAL ANALYSIS

SPEECH AND SPECTRAL ANALYSIS SPEECH AND SPECTRAL ANALYSIS 1 Sound waves: production in general: acoustic interference vibration (carried by some propagation medium) variations in air pressure speech: actions of the articulatory organs

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

Isolated Digit Recognition Using MFCC AND DTW

Isolated Digit Recognition Using MFCC AND DTW MarutiLimkar a, RamaRao b & VidyaSagvekar c a Terna collegeof Engineering, Department of Electronics Engineering, Mumbai University, India b Vidyalankar Institute of Technology, Department ofelectronics

More information

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN Yu Wang and Mike Brookes Department of Electrical and Electronic Engineering, Exhibition Road, Imperial College London,

More information

Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives

Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives Learning to Unlearn and Relearn Speech Signal Processing using Neural Networks: current and future perspectives Mathew Magimai Doss Collaborators: Vinayak Abrol, Selen Hande Kabil, Hannah Muckenhirn, Dimitri

More information

Robustness (cont.); End-to-end systems

Robustness (cont.); End-to-end systems Robustness (cont.); End-to-end systems Steve Renals Automatic Speech Recognition ASR Lecture 18 27 March 2017 ASR Lecture 18 Robustness (cont.); End-to-end systems 1 Robust Speech Recognition ASR Lecture

More information

Optimal Adaptive Filtering Technique for Tamil Speech Enhancement

Optimal Adaptive Filtering Technique for Tamil Speech Enhancement Optimal Adaptive Filtering Technique for Tamil Speech Enhancement Vimala.C Project Fellow, Department of Computer Science Avinashilingam Institute for Home Science and Higher Education and Women Coimbatore,

More information

Chapter IV THEORY OF CELP CODING

Chapter IV THEORY OF CELP CODING Chapter IV THEORY OF CELP CODING CHAPTER IV THEORY OF CELP CODING 4.1 Introduction Wavefonn coders fail to produce high quality speech at bit rate lower than 16 kbps. Source coders, such as LPC vocoders,

More information

Audio Imputation Using the Non-negative Hidden Markov Model

Audio Imputation Using the Non-negative Hidden Markov Model Audio Imputation Using the Non-negative Hidden Markov Model Jinyu Han 1,, Gautham J. Mysore 2, and Bryan Pardo 1 1 EECS Department, Northwestern University 2 Advanced Technology Labs, Adobe Systems Inc.

More information

11/13/18. Introduction to RNNs for NLP. About Me. Overview SHANG GAO

11/13/18. Introduction to RNNs for NLP. About Me. Overview SHANG GAO Introduction to RNNs for NLP SHANG GAO About Me PhD student in the Data Science and Engineering program Took Deep Learning last year Work in the Biomedical Sciences, Engineering, and Computing group at

More information

24 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 1, JANUARY /$ IEEE

24 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 1, JANUARY /$ IEEE 24 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 1, JANUARY 2009 Speech Enhancement, Gain, and Noise Spectrum Adaptation Using Approximate Bayesian Estimation Jiucang Hao, Hagai

More information

Nonuniform multi level crossing for signal reconstruction

Nonuniform multi level crossing for signal reconstruction 6 Nonuniform multi level crossing for signal reconstruction 6.1 Introduction In recent years, there has been considerable interest in level crossing algorithms for sampling continuous time signals. Driven

More information

POSSIBLY the most noticeable difference when performing

POSSIBLY the most noticeable difference when performing IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 7, SEPTEMBER 2007 2011 Acoustic Beamforming for Speaker Diarization of Meetings Xavier Anguera, Associate Member, IEEE, Chuck Wooters,

More information

Audio Signal Compression using DCT and LPC Techniques

Audio Signal Compression using DCT and LPC Techniques Audio Signal Compression using DCT and LPC Techniques P. Sandhya Rani#1, D.Nanaji#2, V.Ramesh#3,K.V.S. Kiran#4 #Student, Department of ECE, Lendi Institute Of Engineering And Technology, Vizianagaram,

More information

Robust speech recognition using temporal masking and thresholding algorithm

Robust speech recognition using temporal masking and thresholding algorithm Robust speech recognition using temporal masking and thresholding algorithm Chanwoo Kim 1, Kean K. Chin 1, Michiel Bacchiani 1, Richard M. Stern 2 Google, Mountain View CA 9443 USA 1 Carnegie Mellon University,

More information

MFCC AND GMM BASED TAMIL LANGUAGE SPEAKER IDENTIFICATION SYSTEM

MFCC AND GMM BASED TAMIL LANGUAGE SPEAKER IDENTIFICATION SYSTEM www.advancejournals.org Open Access Scientific Publisher MFCC AND GMM BASED TAMIL LANGUAGE SPEAKER IDENTIFICATION SYSTEM ABSTRACT- P. Santhiya 1, T. Jayasankar 1 1 AUT (BIT campus), Tiruchirappalli, India

More information

Audio Restoration Based on DSP Tools

Audio Restoration Based on DSP Tools Audio Restoration Based on DSP Tools EECS 451 Final Project Report Nan Wu School of Electrical Engineering and Computer Science University of Michigan Ann Arbor, MI, United States wunan@umich.edu Abstract

More information

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques 81 Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Noboru Hayasaka 1, Non-member ABSTRACT

More information

Voice Recognition Technology Using Neural Networks

Voice Recognition Technology Using Neural Networks Journal of New Technology and Materials JNTM Vol. 05, N 01 (2015)27-31 OEB Univ. Publish. Co. Voice Recognition Technology Using Neural Networks Abdelouahab Zaatri 1, Norelhouda Azzizi 2 and Fouad Lazhar

More information

L19: Prosodic modification of speech

L19: Prosodic modification of speech L19: Prosodic modification of speech Time-domain pitch synchronous overlap add (TD-PSOLA) Linear-prediction PSOLA Frequency-domain PSOLA Sinusoidal models Harmonic + noise models STRAIGHT This lecture

More information

A Two-step Technique for MRI Audio Enhancement Using Dictionary Learning and Wavelet Packet Analysis

A Two-step Technique for MRI Audio Enhancement Using Dictionary Learning and Wavelet Packet Analysis A Two-step Technique for MRI Audio Enhancement Using Dictionary Learning and Wavelet Packet Analysis Colin Vaz, Vikram Ramanarayanan, and Shrikanth Narayanan USC SAIL Lab INTERSPEECH Articulatory Data

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 9: Brief Introduction to Neural Networks Instructor: Preethi Jyothi Feb 2, 2017 Final Project Landscape Tabla bol transcription Music Genre Classification Audio

More information

BLIND SIGNAL PARAMETER ESTIMATION FOR THE RAPID RADIO FRAMEWORK

BLIND SIGNAL PARAMETER ESTIMATION FOR THE RAPID RADIO FRAMEWORK BLIND SIGNAL PARAMETER ESTIMATION FOR THE RAPID RADIO FRAMEWORK Adolfo Recio, Jorge Surís, and Peter Athanas {recio; jasuris; athanas}@vt.edu Virginia Tech Bradley Department of Electrical and Computer

More information

Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition

Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition Power Function-Based Power Distribution Normalization Algorithm for Robust Speech Recognition Chanwoo Kim 1 and Richard M. Stern Department of Electrical and Computer Engineering and Language Technologies

More information

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Author Shannon, Ben, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium

More information

Texas Hold em Inference Bot Proposal. By: Brian Mihok & Michael Terry Date Due: Monday, April 11, 2005

Texas Hold em Inference Bot Proposal. By: Brian Mihok & Michael Terry Date Due: Monday, April 11, 2005 Texas Hold em Inference Bot Proposal By: Brian Mihok & Michael Terry Date Due: Monday, April 11, 2005 1 Introduction One of the key goals in Artificial Intelligence is to create cognitive systems that

More information

Roberto Togneri (Signal Processing and Recognition Lab)

Roberto Togneri (Signal Processing and Recognition Lab) Signal Processing and Machine Learning for Power Quality Disturbance Detection and Classification Roberto Togneri (Signal Processing and Recognition Lab) Power Quality (PQ) disturbances are broadly classified

More information

Advanced audio analysis. Martin Gasser

Advanced audio analysis. Martin Gasser Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high

More information

Audio Augmentation for Speech Recognition

Audio Augmentation for Speech Recognition Audio Augmentation for Speech Recognition Tom Ko 1, Vijayaditya Peddinti 2, Daniel Povey 2,3, Sanjeev Khudanpur 2,3 1 Huawei Noah s Ark Research Lab, Hong Kong, China 2 Center for Language and Speech Processing

More information

Discriminative Training for Automatic Speech Recognition

Discriminative Training for Automatic Speech Recognition Discriminative Training for Automatic Speech Recognition 22 nd April 2013 Advanced Signal Processing Seminar Article Heigold, G.; Ney, H.; Schluter, R.; Wiesler, S. Signal Processing Magazine, IEEE, vol.29,

More information

SIMULATION VOICE RECOGNITION SYSTEM FOR CONTROLING ROBOTIC APPLICATIONS

SIMULATION VOICE RECOGNITION SYSTEM FOR CONTROLING ROBOTIC APPLICATIONS SIMULATION VOICE RECOGNITION SYSTEM FOR CONTROLING ROBOTIC APPLICATIONS 1 WAHYU KUSUMA R., 2 PRINCE BRAVE GUHYAPATI V 1 Computer Laboratory Staff., Department of Information Systems, Gunadarma University,

More information

6. FUNDAMENTALS OF CHANNEL CODER

6. FUNDAMENTALS OF CHANNEL CODER 82 6. FUNDAMENTALS OF CHANNEL CODER 6.1 INTRODUCTION The digital information can be transmitted over the channel using different signaling schemes. The type of the signal scheme chosen mainly depends on

More information