Robust Algorithms For Speech Reconstruction On Mobile Devices


Robust Algorithms For Speech Reconstruction On Mobile Devices

XU SHAO

A Thesis presented for the degree of Doctor of Philosophy

Speech Group
School of Computing Sciences
University of East Anglia
England

February, 2005

Dedicated to MY FAMILY

Robust Algorithms for Speech Reconstruction on Mobile Devices

Xu Shao

Submitted for the degree of Doctor of Philosophy, 2005

Abstract

This thesis is concerned with reconstructing an intelligible time-domain speech signal from speech recognition features, such as Mel-frequency cepstral coefficients (MFCCs), in a distributed speech recognition (DSR) environment. The initial reconstruction methods in this thesis require, in addition to MFCC vectors, fundamental frequency and voicing information. In the later parts of the thesis these parameters are predicted from the MFCC vectors. Speech reconstruction is achieved by first estimating a spectral envelope from the MFCC vectors. This is combined with excitation information from the fundamental frequency and voicing, which enables both a source-filter and a sinusoidal model of speech to be investigated and compared.

Analysis of the sinusoidal model shows that both clean spectral envelope estimates and robust fundamental frequency estimates are necessary for clean speech reconstruction in noisy environments. Inclusion of spectral subtraction is shown to provide the clean spectral envelope estimates. A comparison of fundamental frequency estimation methods shows the most robust to be obtained from an auditory model. This leads to the proposal of an integrated front-end which replaces the mel-filterbank by the auditory filterbank for MFCC extraction and thereby reduces computation. Speech reconstruction tests reveal that robust fundamental frequency estimation and spectral subtraction lead to intelligible and relatively noise-free speech.

Evidence in this work has shown that correlation exists between the fundamental frequency and the MFCC vectors. This leads to the proposal of predicting the fundamental frequency and voicing of a frame of speech from its MFCC representation. An initial method uses a Gaussian mixture model (GMM) to model the joint density of fundamental frequency and MFCCs. A second method combines the GMMs into a hidden Markov model framework to give a more localized modeling of the joint density. Experimental results on both speaker-dependent and speaker-independent tasks show accurate prediction of the fundamental frequency and voicing from the MFCC vectors, which leads to intelligible speech reconstruction.

Declaration

The work in this thesis is based on research carried out at the Speech Group, the School of Computing Sciences, the University of East Anglia, England. No part of this thesis has been submitted elsewhere for any other degree or qualification and it is all my own work unless referenced to the contrary in the text.

Copyright © 2005 by XU SHAO. The copyright of this thesis rests with the author. No quotations from it should be published without the author's prior written consent and information derived from it should be acknowledged.

Acknowledgments

I am deeply indebted to my supervisor, Dr. Ben Milner, for his support, guidance and encouragement throughout my graduate research and studies. His scientific integrity, humane idealism and good cheer have earned him my respect. His writing abilities have also helped me make this thesis a great deal better. It would have been much more difficult for me to achieve this without his help. Working and interacting with him over the last three and a half years was truly an invaluable learning experience. My sincere thanks go to Prof. Stephen Cox, the speech group leader, whom I have always admired for his extensive knowledge and outgoing personality. I am very grateful to him for his suggestions during my research. I would like to thank former and current members of our thriving speech group, Alastair, Barry, Ian, Ibrahim, Kris, Jonathan, Judy, Mark and Qiang, for their help. The days we spent discussing, joking and laughing in the Wolfson lab S01.07 will not be forgotten. I must acknowledge the School of Computing Sciences for funding my research. In particular, the help of Professor Vic Rayward-Smith, who was the Dean of the School of Computing Sciences, must be recognised.

Finally, I would like to thank my parents, Yumin Shao and Lan Wang, for providing me with such a loving environment and for encouraging me during my study in England. I also wish to thank my loving wife, Wenlian Xue, for her support; she has been with me throughout my time studying in England. They deserve to share the success of this thesis.

Contents

Abstract
Declaration
Acknowledgments

1 Introduction
  1.1 Motivations of this Thesis
    1.1.1 Traditional codec based architecture
    1.1.2 Codec-derived architecture
    1.1.3 Distributed speech recognition based architecture
  1.2 Overview of this Thesis

2 Background Information
  2.1 Speech Production Models
    2.1.1 The mechanism of speech production
    2.1.2 Source-filter model
    2.1.3 Sinusoidal model
  2.2 Mel-Frequency Cepstral Coefficients Extraction
    2.2.1 Pre-emphasis filter
    2.2.2 Hamming window and Fourier transform
    2.2.3 Mel-filterbank
    2.2.4 Log and DCT
  2.3 Auditory Model
  2.4 Clustering Algorithms
    2.4.1 K-means
    2.4.2 EM algorithm
  2.5 Maximum A Posteriori Estimation
  2.6 Summary

3 Speech Reconstruction from MFCCs and Fundamental Period
  Introduction
  Estimation of Spectral Envelope from MFCCs
  Fundamental Frequency Estimation
  SIFT fundamental frequency estimation
  Comb function based fundamental frequency estimation
  Fundamental frequency smoothing algorithm
  Evaluation of fundamental frequency
  Experimental Results
  Speech Reconstruction Using the Source-filter Model
  Speech Reconstruction Using the Sinusoidal Model
  Parameter estimation from spectral envelope and fundamental frequency
  Reconstruction of a single frame of speech
  Post Processing
  Reconstruction of multiple frames
  Fundamental frequency smoothing between frames
  Experimental Result
  Summary

4 Speech Reconstruction from Noisy MFCCs
  Introduction
  Effect of noise on spectral envelope
  Effect of noise on fundamental frequency estimation
  Proposal for clean speech reconstruction
  Noise Compensation for Spectral Envelope Estimation
  Spectral subtraction
  Robust Fundamental Frequency Estimation
  Auditory filterbank
  Teager energy operator
  Autocorrelation for output of the auditory model
  Channel selection
  Pseudo-periodic histogram
  Voiced/unvoiced classification
  Experimental Results
  Evaluation of auditory model based fundamental frequency estimation
  Experiments for speech reconstruction
  Summary

5 An Integrated Front-End for Speech Recognition and Reconstruction
  Introduction
  Integrated Front-End
  Review of auditory model
  Feature extraction from auditory model for speech recognition
  Robust fundamental frequency estimation
  Experimental Results
  Fundamental frequency evaluation
  Recognition performance
  Speech reconstruction quality
  Summary

6 Fundamental Frequency Prediction from MFCCs
  Introduction
  Fundamental Frequency Prediction
  GMM-based fundamental frequency prediction
  HMM-GMM based fundamental frequency prediction
  Voiced/Unvoiced Classification
  Voicing classification using prior probability
  Voicing classification using posterior probability
  Voicing classification for GMM-only method
  6.4 Experimental Results
  Experimental results for digit models
  Experimental results for free speech
  Summary

7 Conclusions and Future Work
  Review of this work
  Conclusions
  Future work

A Conditional Distributions of Multivariate Normal Distribution
B Speech Quality Evaluation
  B.1 Histogram of raw scores
  B.2 Evaluation of data

Bibliography

List of Figures

1.1 Architectures for speech recognition and reconstruction over mobile networks
2.1 The human vocal system
2.2 Voiced and unvoiced sound
2.3 Lossless tubes in speech production model
2.4 Formants in frequency spectrum
2.5 Wideband/narrowband spectrogram
2.6 Source-filter model for speech production
2.7 An application of source-filter model
2.8 Sinusoidal representation of speech signal
2.9 An application of sinusoidal representation
2.10 Mel-frequency cepstral coefficients extraction
2.11 Frequency response of pre-emphasis filter
2.12 Hamming window and its frequency response
2.13 The relation between linear frequency and Mel frequency
2.14 Mel-scale filterbank
2.15 Basis functions in the DCT
2.16 Work flow for the K-means clustering algorithm
2.17 A demonstration for K-means clustering algorithm
3.1 Magnitude spectrum of fundamental frequency, spectral envelope and speech
3.2 MFCC vector to spectral envelope
3.3 Estimation of spectral envelope from linear mel-filterbank vectors
3.4 Comparison of estimated spectral envelope and magnitude spectrum
3.5 Center frequency in triangular filterbank
3.6 The SIFT fundamental frequency estimation method
3.7 An example of the SIFT fundamental frequency estimation method
3.8 Autocorrelation diagram of excitation signal
3.9 Magnitude spectrum of voiced/unvoiced speech signal
3.10 A set of comb functions from 50Hz to 400Hz
3.11 Fundamental frequency estimation using comb function
3.12 Three comb functions used as fundamental frequency candidates
3.13 Comb function based fundamental frequency estimates without smoothing
3.14 Comb function based fundamental frequency estimates with smoothing
3.15 Distribution of percentage fundamental frequency errors
3.16 Comparison of the performance of fundamental frequency estimation methods
3.17 Block diagram of speech reconstruction using the source-filter model
3.18 MFCC-derived vocal tract frequency response
3.19 Speech reconstruction using the sinusoidal model
3.20 Comparison of estimated peaks and peaks of the magnitude spectrum
3.21 Linear phase model
3.22 Overlap-and-add method
3.23 Synthesised signal with and without overlap-and-add algorithm
3.24 The narrowband spectrogram of original and reconstructed speech signal
3.25 Harmonic confusion and fundamental frequency smoothing
3.26 Initial and improved reconstructed spectrogram
3.27 Comparison of reconstructed speech signals using the sinusoidal and the source-filter model
4.1 Effect of noise on spectral envelope
4.2 Effect of noise on fundamental frequency estimation
4.3 Robust speech reconstruction from noisy MFCCs and fundamental frequency
4.4 Comparison of spectral subtraction
4.5 Robust fundamental frequency estimation from noisy speech using an auditory model
4.6 Impulse response and frequency response of a gammatone filter
4.7 An energy envelope extracted using the Teager energy operator
4.8 Autocorrelation of all channels from a clean speech frame
4.9 Channel selection from the autocorrelogram
4.10 Pseudo-periodic histogram
4.11 Fundamental frequency evaluation for different fundamental frequency estimators
4.12 Spectrogram of clean speech and noisy speech (SNR 10dB) of an utterance
4.13 Spectrogram of reconstructed speech using robust fundamental frequency estimates with and without spectral subtraction
4.14 Reconstructed speech using reference fundamental frequency from the laryngograph and spectral subtraction
5.1 Integrated front-end and back-end systems
5.2 Frequency response of triangular mel-scale filterbank and 23-channel auditory model filterbank
5.3 Evaluation of fundamental frequency for different auditory model configurations
5.4 Comparison of speech recognition accuracy
5.5 Comparison of speech reconstruction results
6.1 DSR-based speech reconstruction from MFCCs with fundamental frequency prediction
6.2 Workflow diagram of GMM-based fundamental frequency prediction
6.3 Modeling of the joint MFCC and fundamental frequency feature space
6.4 Workflow of HMM-GMM based fundamental frequency prediction
6.5 Voicing probability for eleven digit models
6.6 Voicing probability for phoneme models
6.7 Histogram of prior voicing probability for digit models
6.8 Histogram of prior voicing probabilities for phoneme models
6.9 An example of predicting fundamental frequency from MFCCs for a digit string
6.10 Comparison of reconstructed speech for a connected digit string
6.11 An example of predicting fundamental frequency from MFCCs for a phonetically rich utterance
6.12 Comparison of reconstructed speech for a phonetically rich utterance
B.1 ACR for GSM
B.2 ACR for CELP
B.3 ACR for Sinusoidal Model
B.4 ACR for Source-filter Model

List of Tables

3.1 Investigation of MFCC extractions
Center frequency and bandwidth of each triangular filter in Mel-frequency cepstral coefficients extraction
Fundamental frequency evaluation for the SIFT and comb function based fundamental frequency estimation methods
Mean opinion score for reconstructed speech
Classification accuracy and percentage fundamental frequency error for male speech on the ETSI Aurora connected digit database
Classification accuracy and percentage fundamental frequency error for female speech on the ETSI Aurora connected digit database
Classification accuracy and percentage fundamental frequency error for an unconstrained monophone model on free speech from a single female speaker
B.1 Statistical details for reconstructed speech using different methods
B.2 Comparison of T between each speech codec

Chapter 1

Introduction

Contents
  1.1 Motivations of this Thesis
  1.2 Overview of this Thesis

Preface

This chapter forms the introduction to the thesis. A motivation is first given which outlines the need for reconstructing a time-domain speech signal from MFCC vectors and fundamental frequency. This is put in the context of architectures for mobile-based communication systems. An overview of the remaining chapters of the thesis is then given.

1.1 Motivations of this Thesis

Mobile devices, such as mobile phones and personal digital assistants (PDAs), have become widely used in daily life in recent years. This has resulted in a substantial increase in the number of automated speech-based services becoming available, such as v-commerce, auto-booking and operator assistance. The success of these automated services relies on their ability to perform robust speech recognition from mobile devices.

Considerable progress has been made in the Automated Speech Recognition (ASR) community in the past decade. A typical ASR system is composed of a front-end and a back-end. In the front-end, redundant information is removed from the time-domain signal and a set of compact, discriminative features is extracted through signal processing methods. These features are then sent to the back-end for decoding into a series of phonemes or words according to the particular grammar restrictions [30]. The speech recognition accuracy for clean speech data, such as connected digit strings, can reach over 98% [73]. However, when an ASR system is deployed in a mobile environment, such as a mobile phone network based on low bit-rate speech codecs, the recognition accuracy is seriously reduced due to compression effects [51] [59] [70] [71].

To perform speech recognition over mobile networks, three different architectures can be considered, as shown in figure 1.1. These three architectures are reviewed in the next sections.

1.1.1 Traditional codec based architecture

Figure 1.1-a shows a traditional codec based architecture for a speech recognition system. On the receiver side, the speech recognition features are extracted from the decoded speech signal and then sent to the speech recogniser. The reconstructed speech quality is good, with a mean opinion score (MOS) of around 4.0 [54], but several studies have shown that the speech recognition performance of this architecture is worse than that of a normal ASR system. Huerta [71] reports that cepstral coefficients computed from GSM decoded speech [53] are significantly different to the corresponding coefficients calculated from the original speech waveform.

Figure 1.1: Architectures for speech recognition and reconstruction over mobile networks

Euler [51] shows that the speech recognition accuracy is reduced from 98.5% for 64 kbps PCM encoded speech to 96.0% when the same speech is decoded from a 4.8 kbps CELP codec. Digalakis [70] demonstrates that decoding speech from a GSM codec operating at 9.6 kbps gave a word error rate of 14.5%, in comparison to 12.7% using the G.711 (64 kbps PCM) and G.721 (32 kbps ADPCM) codecs. These studies indicate that although the reconstructed speech from low bit-rate codecs is adequate for human listeners, distortions from speech coding and compression/decompression are still apparent in the reconstructed speech and therefore distort the speech recognition features [84] [91] [97]. The situation deteriorates further in the presence of background noise and packet loss during transmission [105].

1.1.2 Codec-derived architecture

To reduce this distortion, the feature vectors can be transformed directly from the codec parameters by exploiting the similarity between the encoding part of parametric speech coders and the front-end processing stage of a speech recogniser, as shown in figure 1.1-b. Tucker [78] derived speech recognition features (Mel-frequency cepstral coefficients) [13] directly from the coefficients used in a 2.4 kbps LPC codec. The codec-derived MFCCs give a recognition error rate of 13%, in comparison to 5.7% for the original MFCCs. However, the recognition error rate can be reduced to 7.6% if models trained on the codec-derived features are used. A similar observation has been reported [88], in that the recognition performance of features derived from the codec vector is slightly worse than that of the conventionally derived cepstrum. Raj [86] showed that the word error rate of speech recognition features derived from decoded speech is higher than that of features derived from the parametric encoders, by comparing speech recognition performance for a 2.4 kbps LPC FS-1015 codec, a 4.8 kbps CELP FS-1016 codec and a 9.6 kbps GSM codec. The results of these investigations show that speech recognition performance can be improved by deriving the recognition features directly from parametric encoders.

These studies indicate that, in the codec-derived architecture of figure 1.1-b, the distortion in the decoded speech found in the traditional codec-based architecture of figure 1.1-a is partly reduced, but the distortion introduced at the encoding stage cannot be eliminated and this results in a decline in speech recognition performance. Testing using models matched to the codec partly solves the problem, but it is impractical to have separate sets of speech models for every codec.

1.1.3 Distributed speech recognition based architecture

One proposed solution to this problem is the Distributed Speech Recognition (DSR) architecture, as proposed by the European Telecommunications Standards Institute (ETSI) Aurora standard [82]. With DSR, the front-end is moved to the terminal device, as shown in figure 1.1-c, and speech recognition features, such as MFCC vectors, are extracted directly from the speech signal and transmitted over the network to the remote back-end for recognition. The speech recognition accuracy can be improved in this scheme because the codec distortion present in the codec-based architectures of figures 1.1-a and 1.1-b is eliminated. However, the front-end technique, such as MFCC extraction, is designed for speech recognition and not for speech reconstruction purposes. MFCC vectors comprise mainly vocal tract information, with the excitation information that is necessary for speech reconstruction being lost through the coarseness of the mel-filterbank and the truncation of higher order coefficients. This means that at the receiver, given only a set of MFCC vectors, it does not seem possible to obtain an acoustic speech signal.

The limited bandwidth of the channels over which DSR systems typically operate means that it is not possible to transmit simultaneously both feature vectors for speech recognition and codec vectors for speech reconstruction. For example, an early drawback of the DSR standard [82], which has a restricted bit-rate of 4800 bps, was that there was insufficient information present at the receiver to reconstruct the speech waveform for playback purposes. This is considered very important by the ETSI Aurora standard, in particular for automated transaction services where, in cases of dispute, it is a legal requirement to be able to listen to the audio. In this case, it is necessary to reconstruct speech from the recognition features.

If the speech signal can be reconstructed from the recognition features, this also provides the possibility of compressing speech by storing only the recognition features, which can give high speech recognition performance as well as good quality reconstructed speech. Therefore, further revisions of the DSR standard [100] now include a voicing decision and a fundamental frequency value, which enable intelligible reconstruction of speech from the MFCC vectors [79] [92]. This occupies an additional transmission bandwidth of 800 bps, increasing the total bandwidth to 5600 bps.

The main task of this work is to investigate methods of reconstructing a time-domain speech signal from speech recognition features, such as MFCCs, together with additional excitation information. Once this has been achieved, the thesis also introduces methods to predict the excitation information from the MFCC vectors and therefore enable the speech signal to be reconstructed from the MFCC vectors alone.

1.2 Overview of this Thesis

The remainder of this thesis is arranged as follows. Chapter two introduces background information which includes the speech processing methods, statistical methods and mathematical models of speech used in this thesis. Chapter three focuses on adapting speech production models to enable a speech signal to be reconstructed from MFCC vectors and excitation information in a noise-free environment. Two speech production models, the source-filter and sinusoidal models, are evaluated. The necessary parameters for these models are estimated from spectral envelope estimates, which are derived from the MFCC vectors, and excitation information which has been extracted from the clean speech signal.

Formal listening tests show that an intelligible speech signal can be reconstructed from the MFCC vectors and excitation information using both speech production models in noise-free environments.

Chapter four addresses clean speech reconstruction from noise-contaminated MFCCs and robust fundamental frequency estimates. When the speech reconstruction method introduced in chapter three is implemented in a real mobile environment, the background noise where the mobile device is being used will contaminate the MFCC vectors and lead to erroneous fundamental frequency estimates. The quality of the resulting reconstructed speech will be reduced because of the noisy MFCCs and distorted fundamental frequency estimates. It is therefore necessary to estimate a clean MFCC-derived spectral envelope and to obtain robust excitation information for clean speech reconstruction. By simulating the human auditory system [1] [15], which is able to discriminate between different sound sources in noisy environments, a computational auditory model [19] is applied to split the speech signal into a number of frequency channels. By assuming that noise exists only in some of these frequency channels, the selected noise-free channels can be used to provide robust fundamental frequency estimates [65] [93]. Provided the noise is relatively stationary, its additive nature allows spectral subtraction [11] to be applied to obtain clean spectral envelope estimates from the noisy MFCCs.

Chapter five makes a further analysis of the output of the auditory model and of MFCC extraction, and shows that the frequency responses of the auditory filterbank and the mel-filterbank are quite similar. This leads to the proposal of a new speech feature extraction scheme which provides features for both speech recognition and speech reconstruction.

This allows a reduction of the computation in the front-end.

Chapter six identifies correlations which exist between the fundamental frequency and the MFCC vectors. This leads to the proposal of a novel method of predicting fundamental frequency from MFCC vectors, which removes the need to transmit fundamental frequency and voicing information [102] [103]. This is achieved first using a Gaussian mixture model (GMM), an unsupervised training-testing scheme, to predict the fundamental frequency from a number of Gaussian clusters which are trained on the entire feature space. However, the GMM method does not consider the temporal information of the fundamental frequency contour in either the training or testing stages. Therefore, a hidden Markov model (HMM) is incorporated to include the temporal correlation in the fundamental frequency contour and an HMM-GMM framework is developed. This framework uses a set of hidden Markov models (HMMs) to link together a set of state-dependent GMMs, which enables a more localised modeling of the joint density of MFCCs and fundamental frequency. Experimental results show that the HMM-GMM can successfully predict fundamental frequencies from MFCC vectors, and thus speech can be reconstructed solely from the MFCC vectors.

A summary and suggestions for future work are given in chapter seven.

Chapter 2

Background Information

Contents
  2.1 Speech production models
  2.2 MFCC extraction
  2.3 Auditory model
  2.4 Clustering algorithms
  2.5 MAP estimation
  2.6 Summary

Preface

The aim of this chapter is to supply basic knowledge about the speech signal processing and mathematical methods used in this thesis. First, two speech production models, the source-filter model and the sinusoidal model, together with related speech signal processing methods, are described in section 2.1. Next, the extraction of Mel-frequency cepstral coefficients (MFCCs) is reviewed; this speech feature is referenced throughout all chapters. The human auditory system is outlined in section 2.3 and is subsequently used for reliable fundamental frequency estimation in chapters 3 and 4. The clustering algorithms which are applied in chapter 6 for fundamental frequency prediction are presented in section 2.4. The maximum a posteriori (MAP) method is employed together with the clustering techniques to estimate parameters in chapter 6, and its derivation is demonstrated in section 2.5. A summary is given in section 2.6.

2.1 Speech Production Models

This section describes the mechanism of human speech production, followed by two mathematical models used to reproduce human speech signals. Related speech signal processing methods are also introduced.

2.1.1 The mechanism of speech production

The human vocal system is composed of two parts: the vocal cords and the vocal tract, which includes the oral, pharynx and nasal cavities, as shown in figure 2.1. The lungs provide the energy for speech by passing air through the vocal cords, or glottis, which forms a sound source or excitation signal. Different speech sounds are produced as the excitation signal passes through the time-varying vocal tract.

Figure 2.1: The human vocal system

Speech sounds can be broadly divided into two distinct classes according to the state of the vocal cords, namely voiced sounds and unvoiced sounds. Voiced sounds, such as the phoneme /er/ in the word bird or the phoneme /ae/ in the word cat, are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxed oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract [48].

In contrast, unvoiced sounds, such as the phoneme /sh/ in the word sheep or the phoneme /t/ in the word state, are generated without any vibration of the vocal cords; instead, a constriction is formed at some point in the vocal tract and air is forced through it at a high enough velocity to produce turbulence. This creates a wideband noise source to excite the vocal tract [48].

Figure 2.2: Voiced and unvoiced sound

Figure 2.2 shows the waveform of a female speech utterance "six", which comprises the phoneme sequence /s/-/ih/-/k/-/s/. The middle part, which produces the phoneme /ih/, can be regarded as a voiced sound, and its quasi-periodic pulses are shown in the zoomed plot in the upper right of figure 2.2. A noise-like unvoiced sound, such as the first phoneme /s/, normally has low energy and random amplitude, as shown in the expanded plot in the lower right.

Different sounds are produced by changing both the form of the excitation signal and the shape of the vocal tract in figure 2.1. Changes in the vocal tract arise mainly from movement of the tongue, jaw, lips and velum. This can be modeled as a series of lossless tubes [76] which have non-uniform cross-sectional area, as shown in figure 2.3.

Figure 2.3: Lossless tubes in speech production model

Figure 2.3 shows the lossless tubes used for speech production modeling. The excitation signal propagates down a series of connected tubes whose cross-section varies along their length, giving different resonant frequencies. This has the effect of emphasising different frequency components in the excitation signal over time to produce different sounds. The resonant frequencies of the vocal tract tubes in the frequency domain are called formants, as shown in figure 2.4.

Figure 2.4: Formants in frequency spectrum

Figure 2.4 shows a spectral envelope extracted from the phoneme /ay/. The horizontal axis represents frequency from 0 to 4000 Hz and the vertical axis shows the magnitude of the spectral envelope. There are four formants in this spectral envelope, labeled F1 to F4. These formants reflect the shape of the vocal tract and can be characterised by their frequencies, bandwidths and amplitudes. Different speech sounds have different formant positions, which give rise to different spectral envelopes.

For example, the positions of the four formants in figure 2.4 characterise the vocal tract for the phoneme /ay/.

The time-varying vocal tract and vocal cords lead to time-varying spectral characteristics which can be shown in a time-frequency plane known as the spectrogram [10]. This is calculated by splitting a time-domain signal into a number of overlapping frames, windowing each frame and then applying the Fourier transform. The horizontal axis represents time and the vertical axis represents frequency. The darkness or colour represents the energy at a particular frequency and time. For example, figure 2.5 shows a spectrogram of an utterance spoken by a male speaker.

Figure 2.5: (a) Wideband (upper) and (b) narrowband (lower) spectrogram

According to the length of the window used in the Fourier transform, two kinds of spectrogram, wideband and narrowband, can be produced. The wideband spectrogram displays good temporal resolution but poor frequency resolution. On the other hand, the narrowband spectrogram shows good frequency resolution but poor temporal resolution [10]. Figure 2.5-a shows a wideband spectrogram which consists of a few broad peaks corresponding to the formant frequencies at a particular time. The spectrogram clearly displays the variation of the formant frequencies with time.

Figure 2.5-b shows a narrowband spectrogram where the frequency dimension clearly illustrates the pitch, or fundamental frequency, and its harmonic structure varying over time. Unvoiced regions are distinguished by a lack of periodicity in the frequency dimension. This narrowband spectrogram is used to display the fundamental frequency and harmonic structure of reconstructed speech.
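As an illustration of the framing, windowing and Fourier analysis used to compute such spectrograms, the short sketch below produces a narrowband magnitude spectrogram. The sampling rate, frame length, hop size and the synthetic test signal are illustrative assumptions made for this sketch and are not taken from the experiments in this thesis.

```python
import numpy as np

def spectrogram(x, frame_len=512, hop=128, n_fft=512):
    """Narrowband magnitude spectrogram: overlapping frames, Hamming window, FFT."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_len] * window    # window one frame
        spec[:, m] = np.abs(np.fft.rfft(frame, n_fft))       # magnitude spectrum of the frame
    return spec   # rows: frequency bins, columns: time frames

# Example: one second of a crude 150 Hz pulse-like signal sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sign(np.sin(2 * np.pi * 150 * t)) * np.hanning(fs)
S = spectrogram(x)
print(S.shape)   # (frequency bins, frames)
```

With a long window (here 64 ms) the harmonics of the fundamental frequency are resolved, giving the narrowband behaviour described above; a shorter window would give the wideband behaviour instead.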

2.1.2 Source-filter model

The physical model of speech production described in the previous section can be modeled mathematically by the well known source-filter model. This model is widely used in speech processing and is found in speech coding standards such as G.723.1 [61] and GSM [54]. The model encodes the excitation from the human vocal cords and models the vocal tract as an all-pole time-varying filter. A simplified block diagram of the source-filter model is shown in figure 2.6.

Figure 2.6: Source-filter model for speech production

A simple model of the source for voiced sounds is a train of impulses which has the same period as the fundamental period. The source for unvoiced sounds can be considered as wideband noise. These two kinds of excitation are controlled by a voiced/unvoiced decision. The synthesised speech signal is formed as the convolution of the excitation (source) and the vocal tract filter. The mathematical description of this model is given by equation 2.1,

    s(n) = \sum_{k=1}^{P} a_k s(n-k) + G u(n)    (2.1)

where s(n) is the n-th synthesised speech sample, a_k is the k-th coefficient of the vocal tract filter, P is the order of the vocal tract filter, G is the gain term and u(n) is the excitation term. P indicates the number of poles in the vocal tract filter. One formant can be represented by a pair of poles, so P is double the maximum number of formants which can be modeled.

The task now remains of how to compute the vocal tract filter coefficients. This can be achieved by considering the similarity between the source-filter model and the linear predictive equation which predicts the value of sample s(n) from the P previous samples,

    \hat{s}(n) = \sum_{k=1}^{P} \alpha_k s(n-k)    (2.2)

where \alpha_k is the k-th predictor coefficient. The prediction error, e(n), is defined as

    e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{P} \alpha_k s(n-k)    (2.3)

Re-arranging equation 2.3 gives the speech sample as

    s(n) = \sum_{k=1}^{P} \alpha_k s(n-k) + e(n)    (2.4)

Comparing equation 2.1 and equation 2.4 shows that if \alpha_k = a_k and e(n) = G u(n), then the vocal tract filter coefficients, a, can be obtained from the predictor coefficients, \alpha. These are computed by minimising the mean-squared prediction error,

    E = \frac{1}{N} \sum_{n=1}^{N} e^2(n)    (2.5)

Equation 2.5 can be expanded by substituting equation 2.3, giving

    E = \frac{1}{N} \sum_{n=1}^{N} \left[ s(n) - \sum_{k=1}^{P} \alpha_k s(n-k) \right]^2    (2.6)

The values of \alpha_k can be derived by setting \partial E / \partial \alpha_j = 0, j = 1, 2, ..., P, to obtain the equations

    \sum_n s(n-j) s(n) = \sum_{k=1}^{P} \hat{\alpha}_k \sum_n s(n-j) s(n-k),    1 \le j \le P    (2.7)

The variable \phi(j, k) can be defined as

    \phi(j, k) = \sum_n s(n-j) s(n-k)    (2.8)

Assuming that the signal s(n) is identically zero outside the interval 0 \le n \le N-1, because the speech signal has been framed and windowed, \phi(j, k) is identical to the short-time autocorrelation function evaluated at (j - k), defined as

    r(|j - k|) = \phi(j, k)    (2.9)

Therefore equation 2.7 can be represented as

    r(j) = \sum_{k=1}^{P} a_k r(|j - k|),    1 \le j \le P    (2.10)

where r(j) is the j-th autocorrelation coefficient computed from the N samples in the frame, defined as

    r(j) = \sum_{n=0}^{N-1-j} s(n) s(n+j)    (2.11)

Representing equation 2.10 in matrix form gives

    a = R^{-1} p    (2.12)

where R is the matrix of autocorrelation coefficients and p is the vector of autocorrelation coefficients r(1), ..., r(P). The matrix R has a Toeplitz structure which allows the filter coefficients to be extracted using the Levinson-Durbin recursive procedure [10]. Therefore, as equation 2.11 indicates, the predictor coefficients, \alpha, can be computed from a window of speech samples. Using these in the source-filter model allows an acoustic signal to be reconstructed from the predictor coefficients, \alpha, and either a train of impulses at the fundamental period or wideband noise, using equation 2.1.
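A minimal sketch of this analysis-synthesis procedure is given below: the autocorrelation coefficients of equation 2.11 are computed from a windowed frame, the predictor coefficients are obtained with the Levinson-Durbin recursion, and the frame is resynthesised from an impulse-train excitation as in equation 2.1. The filter order, frame length, synthetic test frame and the choice of gain (matched to the prediction-error energy) are illustrative assumptions, not the settings used later in this thesis.

```python
import numpy as np
from scipy.signal import lfilter

def autocorr(frame, P):
    # r(j) = sum_n s(n) s(n+j), j = 0..P  (equation 2.11)
    return np.array([np.dot(frame[:len(frame) - j], frame[j:]) for j in range(P + 1)])

def levinson_durbin(r):
    """Solve the normal equations (2.10); returns A(z) = [1, a1, ..., aP].
    Note: the a_k of equation 2.1 are the negatives of a[1:]."""
    P = len(r) - 1
    a = np.zeros(P + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, P + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)                              # residual (prediction error) energy
    return a, err

# Analysis of one Hamming-windowed frame (assumed 8 kHz, 25 ms, roughly 180 Hz pitch)
fs, P = 8000, 12
n = np.arange(int(0.025 * fs))
rng = np.random.default_rng(0)
frame = np.hamming(len(n)) * (np.sin(2 * np.pi * 180 * n / fs) + 0.01 * rng.normal(size=len(n)))
r = autocorr(frame, P)
a, err = levinson_durbin(r)

# Synthesis: impulse train at the fundamental period filtered by the all-pole filter G/A(z)
period = int(fs / 180)
excitation = np.zeros(len(n))
excitation[::period] = 1.0
G = np.sqrt(max(err, 1e-12))
synth = lfilter([G], a, excitation)   # voiced branch of the source-filter model
```

For an unvoiced frame the impulse train would simply be replaced by wideband noise, as described by the voiced/unvoiced switch above.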

An example of a voiced frame which has been reconstructed from predictor coefficients and a train of impulses is shown in figure 2.7.

Figure 2.7: An application of source-filter model: a) original speech signal; b) vocal tract filter derived from predictor coefficients; c) impulse train at fundamental period; d) reconstructed speech signal.

Figure 2.7-a shows the original speech signal. Figures 2.7-b and 2.7-c present the frequency response of the vocal tract filter derived from the predictor coefficients and a train of impulses derived from the fundamental period (here the fundamental period is 5.5 ms), respectively. Figure 2.7-d displays the speech signal reconstructed from figures 2.7-b and 2.7-c using the source-filter model. Comparing figures 2.7-a and 2.7-d indicates that the reconstructed speech signal is close to the original speech signal. However, the reconstructed speech sounds somewhat artificial due to the simplified excitation signal. Many low bit-rate speech coders, such as code excited linear prediction (CELP), mixed excitation linear prediction (MELP), residual excited linear prediction (RELP) and regular pulse excited linear prediction (RPELP), are based on this model [52] [54] [56]. These codec schemes all utilise the vocal tract filter, but vary in the way that they encode the excitation signal.

2.1.3 Sinusoidal model

An alternative model of speech production is the sinusoidal model [21], which models the speech signal as a summation of a number of sinusoidal components,

    \hat{s}(n) = \sum_{l=1}^{L(m)} A_l \cos(2\pi f_l n + \theta_l)    (2.13)

where \hat{s}(n) is the n-th reconstructed sample, A_l, f_l and \theta_l are the amplitude, frequency and phase of the l-th sinusoidal component, and L(m) is the number of sinusoidal components, which varies from frame to frame (m is the frame index). This equation indicates that the speech can be reconstructed if the parameters of all the sinusoidal components are known.

Unfortunately, the numerous parameters of this model restrict its application. The most comprehensive study of the sinusoidal modeling of speech, which also included successful low bit-rate representations, was presented in a series of papers by McAulay and Quatieri [22] [24] [31] [38] [58]. The main contribution of the work of McAulay and Quatieri lies in the analysis of a minimally parameterised sinusoidal model and also in the development of algorithms for tracking the sinusoidal parameters from frame to frame. First, they showed that one peak in the frequency domain corresponds to one sinusoidal component in the time domain, and that three important parameters, frequency, amplitude and phase offset, can be extracted from each frequency component. This leads to a significant reduction in the number of parameters necessary for the sinusoidal model. Secondly, a series of algorithms is employed to match the parameters from frame to frame; the concepts of birth and death were introduced to allow sinusoidal components to be matched from frame to frame. The analysis stage of the sinusoidal representation, followed by the synthesis stage, is shown in figure 2.8.

Figure 2.8: (a) Analysis and (b) synthesis stage of sinusoidal representation

During the analysis stage, a framed speech signal is transformed to the frequency domain through a short-time Fourier transform.

The amplitude, frequency and phase offset of each sinusoidal component are extracted from the complex frequency domain in each frame. These parameters can then be used in the synthesis stage. A Hamming window whose width is 2.5 times the average fundamental period is adequate and ensures that the sine waves can be identified [21]. At the synthesis stage, a number of sinusoidal waves are generated using the frequencies and phase offsets, and each is then multiplied by its amplitude. The synthesised signal is obtained by summing these sinusoidal waves, as shown in figure 2.8-b, where the circle with a cross indicates a multiplication. All of these parameters are recorded during the analysis stage.
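A minimal sketch of this analysis-synthesis procedure is shown below: peaks are picked from the magnitude spectrum of a windowed frame, their amplitudes, frequencies and phases are recorded, and the frame is resynthesised with equation 2.13. The simple local-maximum peak-picking rule, the amplitude scaling and the synthetic test frame are illustrative assumptions and not the exact procedure used by McAulay and Quatieri.

```python
import numpy as np

def analyse_frame(frame, fs):
    """Pick spectral peaks and return (amplitude, frequency, phase) triples."""
    N = len(frame)
    spectrum = np.fft.rfft(frame * np.hamming(N))
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    freqs = np.fft.rfftfreq(N, 1.0 / fs)
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]   # local maximum
             and mag[k] > 0.01 * mag.max()]                   # ignore very small peaks
    # crude amplitude estimate for a Hamming-windowed one-sided spectrum
    return [(2.0 * mag[k] / np.sum(np.hamming(N)), freqs[k], phase[k]) for k in peaks]

def synthesise_frame(params, N, fs):
    """Equation 2.13: sum of sinusoids with the recorded amplitude, frequency and phase."""
    n = np.arange(N)
    s = np.zeros(N)
    for A, f, theta in params:
        s += A * np.cos(2 * np.pi * f * n / fs + theta)
    return s

fs, N = 8000, 128                                   # a 16 ms frame, as in figure 2.9
n = np.arange(N)
frame = sum(a * np.cos(2 * np.pi * f * n / fs)      # synthetic voiced-like frame
            for a, f in [(1.0, 187.5), (0.6, 375.0), (0.3, 562.5)])
params = analyse_frame(frame, fs)
recon = synthesise_frame(params, N, fs)
```

In a full system the birth/death tracking mentioned above would additionally match these peak parameters from one frame to the next.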

To illustrate these analysis-synthesis stages, a voiced frame of speech is analysed and then synthesised using the sinusoidal model, as shown in figure 2.9.

Figure 2.9: An application of sinusoidal representation: a) original speech signal; b) magnitude spectrum of original speech signal; c) twenty-seven peaks extracted from magnitude spectrum (parameters of these peaks are used for reconstruction); d) reconstructed speech signal

Figure 2.9-a shows a voiced frame which lasts 16 ms (128 samples). Figure 2.9-b shows the magnitude spectrum of the speech signal in figure 2.9-a. Figure 2.9-c shows the peaks from the sinusoidal model representation of the magnitude spectrum. The parameters, amplitude, frequency and phase, of each peak or sinusoidal component are recorded and passed to the synthesis stage. Figure 2.9-d is the speech signal reconstructed from the parameters in figure 2.9-c according to equation 2.13. The similarity between figures 2.9-a and 2.9-d shows that the sinusoidal model is a highly precise model for reconstructing speech signals. The sinusoidal model has been applied in several applications, such as time-scale modification [22] and speech coding [58]. High quality speech reconstruction can be achieved using sinusoids with amplitudes, frequencies and phases corresponding to the peaks of the magnitude spectrum.

2.2 Mel-Frequency Cepstral Coefficients Extraction

This section reviews the extraction of Mel-frequency cepstral coefficients (MFCCs) [13], which have proved to be one of the features giving the best speech recognition performance. This feature is a perceptual extension of traditional cepstral coefficients, which take the inverse Fourier transform of the log spectral magnitude of the speech signal,

    c(n) = IDFT(\log |X(f)|)    (2.14)

where c(n) denotes the n-th cepstral coefficient, |X(f)| represents the spectral magnitude and IDFT(·) denotes the inverse discrete Fourier transform. As is well known, a speech signal can be regarded as the convolution of the excitation signal and the vocal tract filter in the time domain, or equivalently the multiplication of the excitation component and the vocal tract component in the frequency domain.

This can be separated into the sum of the excitation term and the vocal tract term by taking the logarithm of the spectral magnitude. Since the excitation term has a higher variation in the frequency domain than the vocal tract term, it is located in a different part of the cepstrum. The vocal tract term in the cepstral domain is obtained by taking the inverse Fourier transform; this process is also called deconvolution [76]. The lower cepstral coefficients represent the vocal tract information which is used for speech recognition.

MFCC extraction inserts a mel-filterbank between the spectral magnitude and the logarithm to simulate the human perceptual ability, which has different resolutions in different frequency bands. A simplified block diagram of MFCC extraction, which conforms to the Aurora standard proposed by ETSI [82] [100], is shown in figure 2.10.

Figure 2.10: Mel-frequency cepstral coefficients extraction

The remainder of the section considers each processing block in turn, with reference to the Aurora standard [82] [100].

2.2.1 Pre-emphasis filter

The input speech signal is first passed through a high-pass filter with the difference equation shown in equation 2.15,

    y(n) = x(n) - \alpha x(n-1)    (2.15)

where y(n) represents the n-th output sample of the filter, x(n) and x(n-1) denote the n-th and (n-1)-th input samples respectively, and \alpha is the pre-emphasis coefficient, which is close to 1 (in the ETSI standard \alpha = 0.97) [82] [100]. The transfer function H(z) of this high-pass filter is

    H(z) = 1 - \alpha z^{-1}    (2.16)

and its frequency response is shown in figure 2.11.

Figure 2.11: Frequency response of pre-emphasis filter

This high-pass filter introduces a high-frequency tilt on the spectral magnitude to compensate for the -6 dB/octave high-frequency roll-off in the spectrum of voiced speech [48].

2.2.2 Hamming window and Fourier transform

A Hamming window is applied to extract a frame of speech and also to reduce the side lobes [7] in the frequency domain by tapering the signal smoothly to zero at the frame boundaries,

    y'(n) = y(n) w(n) = y(n) \left[ 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right) \right],    0 \le n \le N-1    (2.17)

where y(n) and y'(n) represent the n-th input and output samples of this block respectively, w(n) is the n-th coefficient of the Hamming window and N denotes the window length.
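As a brief illustration of these first two blocks, the sketch below applies the pre-emphasis filter of equation 2.15 and the Hamming window of equation 2.17 to one frame of samples. The 8 kHz sampling rate, 200-sample frame and synthetic input are illustrative assumptions for the sketch only.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y(n) = x(n) - alpha * x(n-1), equation 2.15
    return np.append(x[0], x[1:] - alpha * x[:-1])

def hamming_window(N):
    # w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), equation 2.17
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

fs, N = 8000, 200                                    # one 25 ms frame at 8 kHz
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 200 * t)                  # stand-in for a frame of speech
windowed = pre_emphasis(frame) * hamming_window(N)   # y'(n) of equation 2.17
magnitude = np.abs(np.fft.rfft(windowed, 256))       # |Y(f)|, used by the following blocks
```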

Figure 2.12 shows the shape of the Hamming window in the time domain and its frequency response.

Figure 2.12: Hamming window and its frequency response

The Fourier transform is applied to the windowed signal, y'(n), to obtain its frequency representation,

    Y(f) = DFT[y'(n)]    (2.18)

where DFT(·) denotes the discrete Fourier transform, and y'(n) and Y(f) represent the signal in the time domain and the frequency domain respectively. The magnitude spectrum |Y(f)|, in which the phase term is discarded, is passed through the mel-filterbank described in the next subsection.

2.2.3 Mel-filterbank

The mel-filterbank used in the ETSI Aurora standard [82] [100] is composed of 23 triangular bandpass filters which have equal bandwidths on the mel scale [13]. The mel-scale frequency is obtained from the linear frequency as

    mel(f) = 2595 \log_{10} \left( 1 + \frac{f}{700} \right)    (2.19)

Figure 2.13 shows the relation between linear frequency and mel-scale frequency. The horizontal axis represents the linear frequency and the vertical axis represents the mel-scale frequency. The mel scale clearly has a higher resolution at low frequencies than at high frequencies, simulating the human perceptual ability. This arises from the non-linear spacing of hair cells along the basilar membrane of the human ear.

Figure 2.13: The relation between linear frequency and Mel frequency

The filterbank comprises a number of filters of equal bandwidth in the mel frequency domain; these filters are half-overlapped and a triangular weighting is applied to each filter. The resulting frequency response of the mel-filterbank in the linear frequency domain is shown in figure 2.14.

Figure 2.14: Mel-scale filterbank

The resolution of the magnitude spectrum is reduced by applying the mel-scale filterbank, because the passband of the mel-filterbank becomes wider as frequency increases, as shown in figure 2.14 and expressed in equation 2.20,

    Y(k) = \sum_{f=M(k)}^{M(k+1)} a(f) |Y(f)|    (2.20)

where M(k) and M(k+1) are the start and end frequency points of the k-th triangular filter (1 \le k \le K, where K is the number of filterbank channels) and a(f) denotes the triangular weighting factor.
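The sketch below builds a small triangular filterbank of the kind described by equations 2.19 and 2.20 and applies it to a magnitude spectrum. The FFT size, frequency range and the way the channel edges are mapped to FFT bins are illustrative assumptions; the ETSI standard defines its own 23 channel centre frequencies.

```python
import numpy as np

def mel(f):                        # equation 2.19
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=23, n_fft=256, fs=8000, f_low=64.0, f_high=4000.0):
    """K triangular filters, equally spaced and half-overlapped on the mel scale."""
    mel_points = np.linspace(mel(f_low), mel(f_high), K + 2)       # K filters need K+2 edges
    bin_edges = np.floor((n_fft + 1) * inv_mel(mel_points) / fs).astype(int)
    fbank = np.zeros((K, n_fft // 2 + 1))
    for k in range(1, K + 1):
        left, centre, right = bin_edges[k - 1], bin_edges[k], bin_edges[k + 1]
        for f in range(left, centre):
            fbank[k - 1, f] = (f - left) / max(centre - left, 1)    # rising slope a(f)
        for f in range(centre, right):
            fbank[k - 1, f] = (right - f) / max(right - centre, 1)  # falling slope a(f)
    return fbank

# Y(k) = sum_f a(f) |Y(f)|, equation 2.20
fbank = mel_filterbank()
magnitude = np.abs(np.fft.rfft(np.random.randn(256)))   # stand-in magnitude spectrum
filterbank_energies = fbank @ magnitude                  # one value per mel channel
```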

2.2.4 Log and DCT

As described at the beginning of this section, the logarithm combined with an inverse Fourier transform is used to deconvolve the excitation component and the vocal tract component. During MFCC extraction, a logarithm is applied to the mel-filterbank outputs, followed by a discrete cosine transform (DCT) [60], which can be derived from the DFT, for the same purpose. The transform from the log filterbank domain to the cepstral domain is shown in equation 2.21,

    c(n) = \sum_{k=1}^{23} \log[Y(k)] \cos\left[ \frac{\pi n}{23} (k - 0.5) \right],    0 \le n \le 22    (2.21)

where c(n) denotes the n-th cepstral coefficient and Y(k) represents the k-th component of the mel-filterbank output. This equation can also be written in matrix form, C = SY, where C = {c_0, ..., c_22} denotes the cepstral vector, Y = {Y(1), ..., Y(23)} is the log mel-filterbank output and each row of the matrix S is a cosine basis function of equation 2.21. It can be shown that S is an orthonormal transformation matrix [33], which indicates that the basis functions of this transform are orthonormal, or independent of each other, as shown in figure 2.15. Figure 2.15 presents the first sixteen of the twenty-three basis functions of equation 2.21. The figure shows that the frequency of successive basis functions increases. The output of the DCT, c(n), can be thought of as the projection of the log-filterbank vector, log[Y(k)], onto these orthonormal basis functions, and therefore the cepstral coefficients are independent of each other. These independent coefficients lead to a simplified model in the back-end, in which a diagonal covariance matrix replaces the full covariance matrix otherwise needed in the specification of the hidden Markov models.

Figure 2.15: Basis functions in the DCT

The low-order cepstral coefficients contain the information about the vocal tract and so are retained. In the ETSI Aurora standard [82] [100], the first 13 coefficients are used. With this extraction, one short-term frame, which typically comprises 200 samples, is reduced to a 13-dimensional vector. This is a static feature; in general, temporal derivatives are also computed and augmented onto the static feature [73].
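Continuing the sketches above, the final two blocks can be written as a log operation followed by the DCT of equation 2.21, keeping the first 13 coefficients. The small floor added before the logarithm is an implementation detail assumed here to avoid taking the log of zero, not part of the equations above.

```python
import numpy as np

def mfcc_from_filterbank(filterbank_energies, n_coeffs=13):
    """Log and DCT (equation 2.21); returns the first n_coeffs cepstral coefficients."""
    K = len(filterbank_energies)                      # 23 channels in the ETSI front-end
    log_energies = np.log(np.maximum(filterbank_energies, 1e-10))
    n = np.arange(n_coeffs)[:, None]                  # cepstral index 0..n_coeffs-1
    k = np.arange(1, K + 1)[None, :]                  # filterbank channel index 1..K
    S = np.cos(np.pi * n * (k - 0.5) / K)             # cosine basis functions (rows of S)
    return S @ log_energies                           # c(n) = sum_k log[Y(k)] cos(.)

# Example: 23 filterbank energies reduced to a 13-dimensional static MFCC vector
c = mfcc_from_filterbank(np.abs(np.random.randn(23)) + 1.0)
print(c.shape)   # (13,)
```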

2.3 Auditory Model

The current understanding of the human auditory periphery is well documented [15] [26] [68]. A very brief review is given here as the basis for further applications in this thesis.

Acoustic waves propagate from a sound source and are collected by the external ear. These waves pass through the middle ear which, through a mechanical arrangement, can amplify the signal by up to 185 times [26]. The signal then passes into the basilar membrane of the inner ear. The vibrations along the basilar membrane modulate the release of neurotransmitters associated with the hair cells, and the signal is passed to the brain via the auditory nerve. In this transduction process, information about the mechanical vibrations of sound is transformed into electrical signals which are processed by more central neural regions [74].

The basilar membrane, located in the middle of the cochlea, functions as a Fourier transformer because of its non-uniform stiffness, which gives different frequency responses along its length. The stiffness of the basilar membrane reduces along its length, and therefore a low frequency wave can travel further than a high frequency wave, since the less stiff membrane responds to lower frequency waves [1]. This indicates that different positions along the basilar membrane select different frequency components from an acoustic wave. In other words, an acoustic wave can be decomposed into a number of frequency components which stimulate different hair cells for further neural action. The details of the neural representation of sound are documented comprehensively in the literature and will not be discussed here [15] [26].

By understanding how the human auditory periphery works, it is possible to analyse acoustic signals using mathematical models which reproduce the auditory mechanism. One of these computational auditory models, Lyon's auditory model [16] [19] [28], applies a number of cascaded filters which have non-linearly spaced passbands. Each of these filters is followed by half-wave rectification and automatic gain control to generate the cochleagram, which approximately represents the neural firing rates produced by the hair cells in the cochlea.

This has been used for fundamental frequency extraction and speech enhancement [39] [69]. An alternative computational auditory model, proposed by Seneff [23] [29], is the Joint Synchrony/Mean-Rate model which, like Lyon's model, tries to extract the essential features that simulate the behaviour of the cochlea in response to sound pressure waves. A stage common to both computational auditory models is the use of a number of bandpass filters which simulate the centre frequencies and bandwidths of the basilar membrane. One of the main differences between the two models is that Seneff's model has two output components. One output, the Generalised Synchrony Detector (GSD), implements the known phase-locking property of nerve fibres and therefore enhances spectral peaks due to vocal tract resonances. The other output, the Envelope Detector (ED), extracts the very rapidly changing dynamic nature of speech, which is more important in characterising transient sounds.

The main benefit obtained from the computational auditory model in this thesis is the decomposition of a sound wave into a number of frequency bands which conform to the characteristics of the human auditory system. This decomposition allows frequency regions of the speech which are less affected by noise to be selected. This is utilised in chapters 4 and 5 for robust fundamental frequency estimation.
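A minimal sketch of the kind of multi-channel decomposition described above is given below: a small bank of bandpass filters with non-linearly spaced centre frequencies, each followed by half-wave rectification to give a crude channel output. The Butterworth filters, the mel-spaced centre frequencies, the bandwidths and the eight-channel setting are assumptions made purely for illustration; they are not the cascade filters, automatic gain control or other stages of the actual Lyon or Seneff models.

```python
import numpy as np
from scipy.signal import butter, lfilter

def centre_frequencies(n_channels=8, f_low=100.0, f_high=3000.0):
    """Non-linearly (mel) spaced centre frequencies, denser at low frequencies."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    return inv_mel(np.linspace(mel(f_low), mel(f_high), n_channels))

def filterbank_channels(x, fs=8000, n_channels=8, bw_ratio=0.3):
    """Split x into bandpass channels and half-wave rectify each one."""
    channels = []
    for fc in centre_frequencies(n_channels):
        low, high = fc * (1 - bw_ratio / 2), fc * (1 + bw_ratio / 2)
        b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="band")
        y = lfilter(b, a, x)
        channels.append(np.maximum(y, 0.0))          # half-wave rectification
    return np.array(channels)                        # shape: (n_channels, len(x))

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
cochleagram_like = filterbank_channels(x, fs)
```

The useful property for this thesis is visible even in this crude sketch: a narrowband noise source corrupts only some of the channels, so the remaining channels can still expose the periodicity of the speech.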

2.4 Clustering Algorithms

Human brains are good at finding regularities in data. One way of expressing regularity is to put into a group a set of objects that are similar to each other; this is called clustering [95]. Two clustering algorithms, K-means and Expectation-Maximisation (EM), are briefly described in the following sections. These clustering techniques will be used in chapter 6.

2.4.1 K-means

The K-means clustering algorithm [12] is a technique for finding a set of centres which accurately reflect the distribution of the data points. It is a heuristic algorithm in which a number of cluster centres are initialised and then repeatedly re-estimated, as shown in figure 2.16.

Figure 2.16: Work flow for the K-means clustering algorithm

Figure 2.16 shows the work flow of the K-means clustering algorithm. First of all, K cluster centres are randomly created. The algorithm then divides the data points x into K disjoint subsets S_k, each containing N_k data points. A global distance J is then computed between each point and the centre of its cluster,

    J = \sum_{k=1}^{K} \sum_{m \in S_k} \| x_m - \mu_k \|^2    (2.22)

where \mu_k denotes the k-th cluster centre vector and x_m denotes the data points which belong to the k-th subset S_k.

The current global distance is compared with the global distance from the previous iteration: if J is stable then the iteration stops; otherwise a new mean vector is computed for each subset S_k, as shown in equation 2.23,

    \mu_k = \frac{1}{N_k} \sum_{m \in S_k} x_m    (2.23)

This process is repeated until there is no further reduction in J, or until the reduction in J falls below a predefined threshold. This clustering technique is demonstrated in figure 2.17.

Figure 2.17: A demonstration for K-means clustering algorithm

Figure 2.17-a presents forty 2-dimensional points, shown as blue dots in a two-dimensional plane. Four centres are randomly chosen and plotted as red crosses. Figures 2.17-b to 2.17-f illustrate five iterations of the K-means clustering algorithm described above. The global distance J of equation 2.22 is shown in the title of each figure, and the circles shown in each of these five figures are the centre points associated with the dots of the same colour. From figures 2.17-b to 2.17-f, it can be observed that the global distance, J, is reduced after three iterations.
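The sketch below implements the assignment and update steps of equations 2.22 and 2.23 for a small set of random two-dimensional points. The stopping threshold, iteration limit and the random initialisation of the centres are illustrative choices; as noted in the following paragraph, a method such as LBG can be used for the initialisation instead.

```python
import numpy as np

def kmeans(x, K, n_iter=20, tol=1e-6, seed=0):
    """Basic K-means: assign points to the nearest centre, then recompute the centres."""
    rng = np.random.default_rng(seed)
    centres = x[rng.choice(len(x), K, replace=False)]         # random initial centres
    J_prev = np.inf
    for _ in range(n_iter):
        d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)                         # nearest-centre assignment
        J = np.sum(np.min(d, axis=1) ** 2)                    # global distance, equation 2.22
        if J_prev - J < tol:
            break
        J_prev = J
        for k in range(K):                                    # mean update, equation 2.23
            if np.any(labels == k):
                centres[k] = x[labels == k].mean(axis=0)
    return centres, labels, J

x = np.random.default_rng(1).normal(size=(40, 2))             # forty 2-dimensional points
centres, labels, J = kmeans(x, K=4)
```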

The K-means algorithm is an effective clustering technique but suffers from the problem of choosing the initial centre points. This problem can be solved with the Linde-Buzo-Gray (LBG) method [14], which sets the initial centre points by successively splitting from a global mean vector rather than by randomly choosing centre vectors.

2.4.2 EM algorithm

The EM algorithm [64] is one of the most widely used algorithms in statistics for solving the maximum likelihood problem. It can follow K-means to refine the clusters further within a probabilistic framework. Assume a probability density function p(x|Θ) that is governed by a set of parameters Θ, where, for example, p(·) can be a Gaussian distribution and Θ is the set of parameters comprising the prior probability, mean vector and covariance matrix of each Gaussian. Assume also that an observed data set X = {x_1, ..., x_N} is generated from the unknown model Θ. This can be described by equation 2.24,

    p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta)    (2.24)

p(X|Θ) is known as the likelihood of the parameters, Θ, for the given data set, X. The EM algorithm is a method of finding the Θ that maximises the likelihood function p(X|Θ), and can be represented as equation 2.25,

    \Theta^{*} = \arg\max_{\Theta} p(X|\Theta)    (2.25)

In practice, log[p(X|Θ)] is maximised instead because it is analytically easier. To solve the likelihood problem using the EM algorithm, it is necessary to introduce the concept of incomplete data [9].

As before, assume that the data X is observed, generated by some distribution, and called the incomplete data. It can then be assumed that a complete data set Z = (X, Y) exists, with joint probability density function

    p(Z|\Theta) = p(X, Y|\Theta) = p(Y|X, \Theta) p(X|\Theta)    (2.26)

Instead of equation 2.24, the complete-data likelihood p(X, Y|Θ) is used to solve the likelihood problem. The EM algorithm first finds the expected value of the complete-data log-likelihood, log[p(X, Y|Θ)], with respect to the unknown data Y, given the observed data X and the current parameter estimates Θ^(i-1) at iteration i-1. This can be defined as

    Q(\Theta, \Theta^{(i-1)}) = E[\log p(X, Y|\Theta) | X, \Theta^{(i-1)}]    (2.27)

where E[·] is the expectation operator, Θ^(i-1) are the current parameter estimates used to evaluate the expectation and Θ are the new parameters that will be optimised to increase Q. The evaluation of this expectation is called the E-step of the algorithm. The second step of the EM algorithm is to maximise the expectation obtained in the first step, defined as equation 2.28,

    \Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)})    (2.28)

This step is the M-step. Each iteration of the EM process is guaranteed to increase the log-likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [9] [55]. The expectation and maximisation steps can be repeated until convergence, or until a set number of iterations has been completed.

Assume the following probabilistic model,

    p(x|\Theta) = \sum_{j=1}^{M} \alpha_j p_j(x|\theta_j)    (2.29)

where the parameters are $\Theta = \{\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M\}$ with $\sum_{j=1}^{M} \alpha_j = 1$, and each $p_j$ is a density function parameterised by $\theta_j$. In other words, there are $M$ component densities mixed together with $M$ mixing coefficients $\alpha_j$, where $j$ denotes the $j$-th component. If $\Theta^g = \{\alpha_1^g, \ldots, \alpha_M^g, \theta_1^g, \ldots, \theta_M^g\}$ are appropriate parameters for the likelihood $p(X|\Theta^g)$, it can be derived that

$$Q(\Theta, \Theta^g) = E[\log p(X, Y|\Theta) \mid X, \Theta^g] = \sum_{j=1}^{M} \sum_{m=1}^{N} \log\left(\alpha_j\, p_j(X_m|\theta_j)\right) p(j|X_m, \Theta^g)$$
$$= \sum_{j=1}^{M} \sum_{m=1}^{N} \log(\alpha_j)\, p(j|X_m, \Theta^g) + \sum_{j=1}^{M} \sum_{m=1}^{N} \log\left(p_j(X_m|\theta_j)\right) p(j|X_m, \Theta^g) \qquad (2.30)$$

where $m$ denotes the $m$-th vector, $x_m \in X$. As assumed before, $p(\cdot)$ can be a $d$-dimensional Gaussian component distribution with mean vector $\mu$ and covariance matrix $\Sigma$, i.e. $\theta = (\mu, \Sigma)$, which is then defined as equation 2.31,

$$p_j(x|\mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)\right) \qquad (2.31)$$

Replacing the term $\theta$ in equation 2.30 with equation 2.31, and then maximising with respect to $\alpha_j$, $\mu_j$ and $\Sigma_j$ by taking the partial derivatives of equation 2.30 and setting them to zero, the following equations can be derived,

$$\mu_j^{new} = \frac{\sum_m P^{old}(\theta_j|x_m)\, x_m}{\sum_n P^{old}(\theta_j|x_n)} \qquad (2.32)$$

$$(\Sigma_j^{new})^2 = \frac{1}{d} \cdot \frac{\sum_m P^{old}(\theta_j|x_m)\, \|x_m - \mu_j^{new}\|^2}{\sum_n P^{old}(\theta_j|x_n)} \qquad (2.33)$$

$$\alpha_j^{new} = \frac{1}{N} \sum_m P^{old}(\theta_j|x_m) \qquad (2.34)$$

where the superscripts new and old represent the parameters after and before each iteration, and $j$ and $m$ denote the $j$-th component density and the $m$-th vector as

defined in equations 2.29 and 2.30. $P(\theta_j|x)$ is the posterior probability and can be expressed using Bayes' theorem [4] in the form,

$$P(\theta_j|x) = \frac{p(x|\theta_j)\,\alpha_j}{p(x)} = \frac{p(x|\theta_j)\,\alpha_j}{\sum_{j=1}^{M} p(x|\theta_j)\,\alpha_j} \qquad (2.35)$$

where $p(x|\theta_j)$ is the likelihood of the $j$-th component of the mixture as defined in equation 2.31, and $\alpha_j$ is the prior probability of the data point having been generated from component $j$ of the mixture. The $\alpha_j$ must satisfy the constraints,

$$\sum_{j=1}^{M} \alpha_j = 1 \quad \text{and} \quad 0 \le \alpha_j \le 1 \qquad (2.36)$$

Summarising, equations 2.32, 2.33 and 2.34 estimate the new parameters in terms of the old parameters and the observed data $X$; these equations perform both the expectation step and the maximisation step simultaneously. In practice, they are applied repeatedly until $Q(\Theta, \Theta^g)$ converges to a stable state [9] [55] [64].
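The E- and M-steps above can be summarised in the following Python sketch (an illustration only, not the implementation used in this work); a spherical per-component variance is assumed, matching the form of equation 2.33, and the means are initialised by random selection rather than by K-means.

```python
import numpy as np

def gmm_em(x, M, n_iter=50, seed=0):
    """EM for a spherical-covariance GMM, following equations 2.31-2.35."""
    N, d = x.shape
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(N, M, replace=False)].astype(float)   # initial means
    var = np.full(M, x.var())                                # per-component spherical variance
    alpha = np.full(M, 1.0 / M)                              # mixing coefficients
    for _ in range(n_iter):
        # E-step: posterior P(theta_j | x_m) from Bayes' theorem (equation 2.35)
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # (N, M)
        log_p = -0.5 * (sq / var + d * np.log(2 * np.pi * var))         # log Gaussian
        log_w = log_p + np.log(alpha)
        log_w -= log_w.max(axis=1, keepdims=True)                       # numerical stability
        post = np.exp(log_w)
        post /= post.sum(axis=1, keepdims=True)                         # P(theta_j | x_m)
        # M-step: equations 2.32-2.34
        Nj = post.sum(axis=0)                                           # effective counts
        mu = (post.T @ x) / Nj[:, None]                                 # equation 2.32
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (post * sq).sum(axis=0) / (d * Nj)                        # equation 2.33
        alpha = Nj / N                                                  # equation 2.34
    return alpha, mu, var
```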

2.5 Maximum A Posteriori Estimation

The aim of this section is to introduce maximum a posteriori (MAP) estimation as a method for estimating unobserved components from a given probability model and the observed components. Define a probability density function $p(X|\Theta)$ governed by the set of parameters $\Theta$. Assuming $p(\cdot)$ is a Gaussian distribution, $\Theta$ is the set of parameters comprising the prior probability, $\alpha$, mean vector, $\mu$, and covariance matrix, $\Sigma$. $X$ can be defined as a $d$-dimensional vector which comprises two parts, observed components denoted $X_o$ and unobserved components denoted $X_u$, as shown in equation 2.37,

$$X = [X_u, X_o] \qquad (2.37)$$

The dimensionality of the observed components is $d_o$ while the dimensionality of the unobserved components is $d_u$; $d_o$ and $d_u$ are constrained by the following equation,

$$d_o + d_u = d \qquad (2.38)$$

The aim is to maximise the likelihood of the unobserved components, $X_u$, given the observed part, $X_o$, and the set of parameters, $\Theta$, as shown in equation 2.39 [44] [83],

$$\hat{X}_u = \arg\max_{X_u} \{p(X_u|X_o, \Theta)\} \qquad (2.39)$$

MAP estimation simplifies to a linear regression when the distribution of the data set is Gaussian [83]. As defined before, $p(X|\Theta)$ is a Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$ and can be written as $p(X|\mu, \Sigma)$. Therefore the distributions of $X_o$ and $X_u$, $p(X_o|\mu, \Sigma)$ and $p(X_u|\mu, \Sigma)$, are also Gaussian [40]. If $\mu_o$ and $\mu_u$ represent the mean vectors, and $\Sigma_{oo}$ and $\Sigma_{uu}$ the covariance matrices, of $p(X_o|\mu, \Sigma)$ and $p(X_u|\mu, \Sigma)$ respectively, then equations 2.40 and 2.41 can be written as,

$$\mu = [\mu_o, \mu_u] \qquad (2.40)$$

$$\Sigma = \begin{bmatrix} \Sigma_{oo} & \Sigma_{ou} \\ \Sigma_{uo} & \Sigma_{uu} \end{bmatrix} \qquad (2.41)$$

where $\Sigma_{ou}$ is the cross-covariance matrix between $X_o$ and $X_u$, and $\Sigma_{ou} = \Sigma_{uo}^T$. The conditional probability of the unobserved components given the observed components and model parameters, $p(X_u|X_o, \Theta)$, can be obtained from Appendix A,

$$p(X_u|X_o, \Theta) = p(X_u|X_o, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\Big\{-0.5\big(X_u - \mu_u - \Sigma_{uo}\Sigma_{oo}^{-1}(X_o - \mu_o)\big)^T \big(\Sigma_{uu} - \Sigma_{uo}\Sigma_{oo}^{-1}\Sigma_{ou}\big)^{-1} \big(X_u - \mu_u - \Sigma_{uo}\Sigma_{oo}^{-1}(X_o - \mu_o)\big)\Big\} \qquad (2.42)$$

To maximise equation 2.42, the exponent is minimised, which gives $X_u - \mu_u - \Sigma_{uo}\Sigma_{oo}^{-1}(X_o - \mu_o) = 0$. Equation 2.39 can therefore be written as,

$$\hat{X}_u = \arg\max_{X_u} \{p(X_u|X_o, \Theta)\} = \mu_u + \Sigma_{uo}\Sigma_{oo}^{-1}(X_o - \mu_o) \qquad (2.43)$$

This equation can be applied to estimate the unknown components $X_u$ of a vector $X = [X_u, X_o]$ from the known components $X_o$. In this thesis, this equation will be used in chapter 6 to predict a single unobserved value (the fundamental frequency) from an observed vector (the MFCC vector) for given model parameters. In this case it can be interpreted as multiple linear regression (MLR). Assume an unobserved component, $f$, is a linear combination of the vector components, $[x_1, x_2, \ldots, x_p]$, denoted as,

$$f = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \qquad (2.44)$$

where $\beta = [\beta_0, \beta_1, \ldots, \beta_p]$ is a coefficient vector and $\epsilon$ is an error term. For a set of $n$ observations, equation 2.44 can be re-written as,

$$f_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} + \epsilon_1$$
$$f_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} + \epsilon_2$$
$$\vdots$$
$$f_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np} + \epsilon_n \qquad (2.45)$$

Equation 2.45 can be written compactly as,

$$f = \beta X + \epsilon \qquad (2.46)$$

According to the least squares estimate of $\beta$ [106],

$$\hat{\beta} = (X^T X)^{-1} X^T f \qquad (2.47)$$

Assuming $X$ and $f$ have zero means, then $X^T X = n\Sigma_{XX}$ and $X^T f = n\Sigma_{Xf}$, and the estimate of $f$ can be expressed as,

$$\hat{f} = \hat{\beta} X = (\Sigma_{Xf})^T (\Sigma_{XX})^{-1} X \qquad (2.48)$$

If $X$ and $f$ have means $\mu_X$ and $\mu_f$, equation 2.48 becomes,

$$\hat{f} = \mu_f + (\Sigma_{Xf})^T (\Sigma_{XX})^{-1} (X - \mu_X) \qquad (2.49)$$

It can be seen that equation 2.49 has the same form as equation 2.43. This indicates that MAP estimation has the same effect as multiple linear regression (MLR) in this case.
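To make equations 2.43 and 2.49 concrete, the following Python sketch (illustrative only) computes the MAP estimate of the unobserved components of a Gaussian-distributed vector from its observed components; the ordering $X = [X_u, X_o]$, the synthetic data and the function name are assumptions for this example.

```python
import numpy as np

def map_estimate(x_o, mu, cov, d_u):
    """MAP estimate of the first d_u (unobserved) components given the rest.

    mu and cov are the Gaussian parameters of X = [X_u, X_o] (equations 2.40-2.41);
    the estimate is mu_u + Sigma_uo Sigma_oo^{-1} (x_o - mu_o), i.e. equation 2.43.
    """
    mu_u, mu_o = mu[:d_u], mu[d_u:]
    s_uo = cov[:d_u, d_u:]
    s_oo = cov[d_u:, d_u:]
    return mu_u + s_uo @ np.linalg.solve(s_oo, x_o - mu_o)

# Illustration on synthetic data: predict a scalar f from a vector x, as in chapter 6.
rng = np.random.default_rng(1)
x = rng.standard_normal((1000, 3))
f = 2.0 * x[:, 0] - 1.0 * x[:, 2] + 0.1 * rng.standard_normal(1000)
z = np.column_stack([f, x])                  # X = [X_u, X_o] with X_u = f
mu, cov = z.mean(axis=0), np.cov(z, rowvar=False)
f_hat = map_estimate(x[0], mu, cov, d_u=1)   # should be close to f[0]
```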

2.6 Summary

This chapter has introduced the speech processing models and statistical methods which will be used in the remaining chapters. Two speech production models, the source-filter and the sinusoidal models, were described for producing speech signals from model-related parameters. The details of the ETSI Aurora standard for MFCC extraction have been illustrated and will be referenced throughout the remaining chapters. The human auditory system has been briefly described to give the basis for robust fundamental frequency estimation. Clustering algorithms and the MAP estimation method have also been introduced and will subsequently be used for predicting fundamental frequency from MFCC vectors.

Chapter 3

Speech Reconstruction from MFCCs and Fundamental Period

Contents
3.1 Introduction
3.2 Estimation of spectral envelope from MFCCs
3.3 Fundamental frequency estimation
3.4 Speech reconstruction using the source-filter model
3.5 Speech reconstruction using the sinusoidal model
3.6 Post processing
3.7 Experimental results
3.8 Summary

Preface

The aim of this chapter is to reconstruct a speech signal from a set of MFCC vectors and the fundamental frequency. A method to estimate the spectral envelope from MFCCs is introduced, followed by fundamental frequency extraction methods. Two fundamental frequency estimation methods, a comb function based method and the simplified inverse filter tracking (SIFT) method, are implemented and compared using an evaluation algorithm. Two approaches, the source-filter and sinusoidal models, are developed for speech reconstruction using the estimated spectral envelope and fundamental frequency estimates. Formal listening tests indicate that the reconstructed speech signal from the sinusoidal model is preferable to that from the source-filter model.

3.1 Introduction

The previous chapter introduced two speech production models, the source-filter and the sinusoidal models. The source-filter model produces a speech signal as the output of a convolution between a source, such as an impulse train separated by the fundamental period or wideband noise, and a vocal tract filter. The vocal tract filter models the resonant frequencies of the vocal tract, which correspond to formants that can be specified in terms of frequency, amplitude and bandwidth. The time-varying vocal tract can therefore produce a stream of phonemes to form speech. The sinusoidal model represents the speech signal in the frequency domain. A magnitude spectrum can be decomposed into two parts: a fast-changing component, the fundamental frequency or pitch frequency (represented as $F_0$, $f_0$ or $\omega_0$ in this work) together with its harmonic components, and a slowly-changing component, the spectral envelope. The spectral envelope is related to the speech phoneme, or vocal tract, and has the same properties as the vocal tract filter. Multiplication in the frequency domain has a similar effect to convolution in the time domain [10]. The speech signal in the frequency domain is therefore represented as the multiplication of the fundamental frequency, together with its harmonic frequencies, and the spectral envelope, as shown in figure 3.1. Figure 3.1-a shows the fundamental frequency and its harmonics in the frequency domain. Figure 3.1-b is the spectral envelope for the English phoneme /ay/. This envelope reflects the shape of the vocal tract. Figure 3.1-c shows the speech magnitude spectrum of the same phoneme as in figure 3.1-b. Clearly, figure 3.1-c can be produced by a multiplication of figure 3.1-a and figure 3.1-b.

Figure 3.1: Magnitude spectrum of a) fundamental and harmonic frequencies, b) spectral envelope and c) a frame of speech signal

Figure 3.1 also shows that two components, the fundamental frequency and the spectral envelope, are necessary for speech production. The next section shows how to obtain the spectral envelope from the MFCC vectors, while section 3.3 shows how to estimate the fundamental frequency from a frame of speech. Sections 3.4 and 3.5 then describe methods for reconstructing speech from the spectral envelope estimates and fundamental frequency using first a source-filter model and second a sinusoidal model. Post processing is introduced in section 3.6 to produce more natural sounding speech. Section 3.7 presents the experimental results and a summary is given in the last section.

3.2 Estimation of Spectral Envelope from MFCCs

As described in section 2.2, MFCC extraction is designed to extract vocal tract information from the speech signal and to maximise the discrimination between different speech sounds. However, it is not only vocal tract information that is present

in the speech signal, but also excitation information, which is discarded during MFCC extraction. Table 3.1 details the information lost during MFCC extraction.

Table 3.1: Investigation of MFCC extraction

| Process | Action/effect | Invertible | Lost information |
| Pre-emphasis | introduce spectral tilt | Yes | None |
| Hamming window | reduce discontinuity at frame boundary | Yes | None |
| Fourier transform | transform to frequency domain | Yes | None |
| Magnitude | represent spectrum in terms of magnitude | No | phase component |
| Mel-scale filterbank | extract spectral envelope | No | harmonic structure |
| Logarithm | compress | Yes | None |
| Discrete cosine transform | transform to cepstral domain | Yes | None |
| Truncation | reduce dimensionality of feature vector | No | fine spectral envelope structure |

Table 3.1 shows that the following information is lost during MFCC extraction:

1. The phase component is lost after the magnitude operation.

2. The details of the harmonic frequencies are smoothed when the magnitude spectrum passes through the mel-filterbank.

3. The finer spectral structure is discarded when the 23-dimensional cepstral coefficients are truncated to 13-dimensional cepstral coefficients.

This lost information means that the MFCCs cannot be inverted directly back to the original speech signal. However, the MFCC vector retains the speech phoneme information, which is closely related to the vocal tract or spectral

envelope. It is therefore possible to estimate the spectral envelope from the MFCCs. A simplified block diagram to achieve this is shown in figure 3.2.

Figure 3.2: MFCC vector to spectral envelope

Zero-padding the truncated MFCC vector, $c_y$, to the dimensionality of the filterbank ($K = 23$) allows an inverse discrete cosine transform (IDCT) to be taken. This results in a smoothed estimate of the log filterbank vector, $\log(\hat{y}_k)$, shown as equation 3.1,

$$\log(\hat{y}_k) = \sum_{n=0}^{K-1} v(n)\, c_y(n) \cos\left(\frac{(2k+1)\,n\pi}{2K}\right), \qquad v(n) = \begin{cases} \frac{1}{K} & n = 0 \\ \frac{2}{K} & 0 < n < K \end{cases} \qquad (3.1)$$

where $\log(\hat{y}_k)$ denotes the $k$-th log mel-filterbank channel and $c_y(n)$ represents the $n$-th cepstral coefficient. The area under the triangular filters used in the mel-filterbank analysis increases at higher frequencies. The effect of this is to impose a high frequency tilt on the resulting mel-filterbank channels, which distorts the estimated magnitude spectrum. Two methods could be used to eliminate this tilt. One is to scale the mel-filterbank outputs, $\hat{Y}(k)$, by the area of the corresponding triangular mel-filter, $\omega_k$, in the frequency domain. An alternative is to transform the vector of filterbank channel areas, $\omega = \{\omega_1, \ldots, \omega_K\}$, into log space, $l_\omega$, shown as equation 3.2,

$$l_\omega = \log(\omega) \qquad (3.2)$$

Then $l_\omega$ can be subtracted from $\log(\hat{y}_k)$ to remove the spectral tilt in the log filterbank domain. A similar method can also be used to eliminate the high frequency tilt, $l_p$, introduced by the pre-emphasis stage of the MFCC extraction. As discussed in section 2.2, the difference equation 2.15 is introduced at the pre-emphasis stage to simulate the frequency response of the human ear and results in a tilted frequency response. This frequency response is passed through the mel-filterbank to obtain its mel-filterbank representation, and applying a logarithm to this representation gives the frequency tilt coefficients $l_p$. In this work, the subtraction is implemented as shown in equation 3.3,

$$\log(\hat{y}'_k) = \log(\hat{y}_k) - l_\omega - l_p \qquad (3.3)$$

An estimate of the linear mel-filterbank vector can then be obtained by applying the exponential operator directly. The next stage is to estimate the $D$-dimensional magnitude spectrum; in this work $D = 256$ spectral bins are computed. This requires $D$ linearly spaced magnitude estimates to be computed from the $K$ mel-filterbank channels. As $D \gg K$, some form of interpolation is necessary. The method used in this work is cubic spline interpolation [85], which obtains an estimate of the spectral envelope from the linear mel-filterbank vector. A cubic polynomial has the form shown in equation 3.4,

$$p(t) = c_0 + c_1 t + c_2 t^2 + c_3 t^3 \qquad (3.4)$$

where $c = \{c_0, \ldots, c_3\}$ are the coefficients of the cubic polynomial. Four conditions, or four given samples, are needed to determine a unique set of coefficients, $c$; once these coefficients are calculated, estimated values between the samples can be computed from the cubic polynomial.
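The complete chain from a truncated MFCC vector to a spectral envelope estimate (zero-padding, IDCT, tilt removal, exponentiation and cubic spline interpolation) can be sketched in Python as follows; the function signature, the filter centre frequencies and the use of SciPy's CubicSpline (which extrapolates beyond the outermost filter centres) are assumptions of this illustration rather than part of the Aurora standard.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def mfcc_to_envelope(c_y, centre_hz, l_w, l_p, K=23, D=256, fs=8000):
    """Estimate a D-bin spectral envelope from a truncated MFCC vector.

    c_y      : truncated MFCC vector (e.g. 13-D)
    centre_hz: centre frequency of each mel filter (K increasing values)
    l_w, l_p : log filter-area and log pre-emphasis tilt terms (equations 3.2-3.3)
    """
    c = np.zeros(K)
    c[:len(c_y)] = c_y                                   # zero-pad to the filterbank size
    n = np.arange(K)
    v = np.where(n == 0, 1.0 / K, 2.0 / K)               # IDCT weights of equation 3.1
    k = np.arange(K)[:, None]
    log_y = (v * c * np.cos((2 * k + 1) * n * np.pi / (2 * K))).sum(axis=1)
    log_y = log_y - l_w - l_p                            # remove filter-area and pre-emphasis tilt
    y = np.exp(log_y)                                    # linear mel-filterbank estimate
    freqs = np.linspace(0, fs / 2, D)                    # D linearly spaced spectral bins
    return CubicSpline(centre_hz, y)(freqs)              # cubic spline interpolation
```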

An example of a spectral envelope estimated using cubic spline interpolation from a linear mel-filterbank vector is shown in figure 3.3.

Figure 3.3: Cubic spline interpolation for estimating the spectral envelope from a linear mel-filterbank vector

The blue crosses in figure 3.3 are the samples of a linear mel-filterbank vector. The red solid line is the estimated spectral envelope created using cubic spline interpolation. A comparison of this spectral envelope with the original magnitude spectrum is shown in figure 3.4.

Figure 3.4: Comparison of estimated spectral envelope (red line) and magnitude spectrum (blue line)

Figure 3.4 compares the restored spectral envelope (red line) and the original

magnitude spectrum (blue line). Most of the formants can be recovered from the MFCC vector, and part of the fine structure in the very low frequency bands (< 500Hz) can also be restored. This can be observed in that the first two peaks of the spectral envelope estimate are very close to the first two peaks of the original magnitude spectrum, while the fine structure in the middle/high frequency bands is lost. This is similar to the frequency response of the basilar membrane in the human cochlea, which is more sensitive to low frequency bands than to high frequency bands, and can be attributed to the configuration of the mel-filterbank shown in figure 3.5 and table 3.2.

Figure 3.5: Center frequency of each triangular filter in the filterbank

Table 3.2: Center frequency and bandwidth of each triangular filter in Mel-frequency cepstral coefficient extraction
| Filter No | CF (Hz) | BW (Hz) |

Figure 3.5 and table 3.2 show the center frequencies and bandwidths of the mel-

filterbank. It can be observed that the average difference between the center frequencies of the first three filters is 67Hz, while the average difference between the center frequencies of the last three filters is 306Hz. Comparing figures 3.3 and 3.4, the distribution of mel-filterbank channels shown in figure 3.5 explains the detail retained at lower frequencies in the estimated spectral envelope and the smoothing of higher frequencies seen in figure 3.4. This also provides a clue for predicting the fundamental frequency from the MFCC-derived spectral envelope, which will be covered in chapter 6. An estimate of the spectral envelope, $\hat{X}(f)$, has now been obtained and will be used for parameter estimation in both the source-filter and the sinusoidal models.

3.3 Fundamental Frequency Estimation

As discussed in section 3.1, the fundamental frequency is the other component necessary for speech production. Fundamental frequency estimation is an essential requirement in systems for pitch-synchronous analysis, speech analysis/synthesis and speech coding. It has been reported that the fundamental frequency can improve the performance of a speech recognition system for a tonal language [80] and of a speaker identification system [77]. However, accurate and reliable measurement of the fundamental period of a speech signal from the acoustic pressure waveform alone is often exceedingly difficult, for several reasons [8]:

1. The glottal excitation waveform is not a perfect train of periodic pulses. This excitation results from the vibration of the vocal cords, and several factors, such as the air pressure from the lungs and the force from the muscles, can affect this vibration. These factors make the vibration a quasi-periodic train of pulses.

2. The fundamental period is influenced by the interaction between the vocal tract and the glottal excitation. The time-varying vocal tract can affect the vibration frequency of the vocal cords, making it difficult to measure the fundamental period.

3. In measuring the fundamental period it is difficult to define the exact beginning and end of each fundamental period during voiced speech segments. The vibration of the vocal cords cannot suddenly start or stop, and it is therefore difficult to find the exact boundary between voiced and unvoiced frames.

4. It is difficult to distinguish between unvoiced speech and low-level voiced speech. In low-level voiced speech, the amplitude of the vibration of the vocal cords is sometimes not high enough to be detected.

5. Additional complications occur when speech has been transmitted through the telephone system, which has cut-off frequencies of 300Hz and 3300Hz. A fundamental frequency below 300Hz is therefore removed and can only be estimated from its harmonic components.

As a result of these numerous difficulties, a wide variety of fundamental frequency extraction methods [3] [5] [8] [42] [90] have been developed. In this work, two fundamental frequency extraction methods, simplified inverse filter tracking (SIFT) [5] and the comb function method [90], are investigated. The results from these two methods are compared and the more effective one is chosen for extracting fundamental frequency information from clean speech signals.

SIFT fundamental frequency estimation

The SIFT fundamental frequency estimation method extracts excitation information from the speech signal using an inverse linear predictive filter followed by an autocorrelation analysis, shown as a block diagram in figure 3.6.

Figure 3.6: The SIFT fundamental frequency estimation method

The speech signal is first framed and then low-pass filtered with a cut-off frequency of 800Hz to simplify the linear predictive analysis. A fourth order linear predictor is applied to calculate the vocal tract filter coefficients, as described in section 2.1.2, according to equation 2.1 (restated as equation 3.5 in this section),

$$s(n) = \sum_{k=1}^{P} a_k s(n-k) + G u(n) \qquad (3.5)$$

All the parameters in this equation are the same as defined in equation 2.1, and the system transfer function $H(z)$ is derived by taking the Z-transform of equation 3.5 followed by rearrangement,

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{P} a_k z^{-k}} \qquad (3.6)$$

The excitation signal can be obtained by applying the inverse vocal tract filter to the low-pass filtered signal,

$$U(z) = \frac{S(z)}{H(z)} = \frac{S(z)\left(1 - \sum_{k=1}^{P} a_k z^{-k}\right)}{G} \qquad (3.7)$$

Figure 3.7 presents an example of an inverse filtered speech signal which will be used for the fundamental frequency estimation.

Figure 3.7: The SIFT fundamental frequency estimation method: a) original speech signal; b) low-pass filtered speech signal; c) LPC-derived spectral envelope; d) excitation signal

Figure 3.7-a shows a frame of voiced speech. Figure 3.7-b presents the output of the low-pass filter with a cut-off frequency of 800Hz, and figure 3.7-c is the spectral envelope extracted from the signal in figure 3.7-b using LPC. Figure 3.7-d shows the resulting excitation signal, taken as the output of the inverse vocal tract filter. Comparing figures 3.7-a and 3.7-d, the time between the main peaks, which defines the fundamental period, is almost the same. However, in the excitation signal (figure 3.7-d) the main peaks are more prominent, as the vocal tract component has been removed by the inverse filtering. The normalised autocorrelation function is then computed from the excitation signal as,

$$R(\tau) = \frac{1}{\sqrt{E_x E_y}} \sum_{n=-N/2}^{N/2-1} x(n)\, x(n+\tau), \qquad \tau = -N, \ldots, N \qquad (3.8)$$

where $E_x = \sum_{n=-N/2}^{N/2-1} x^2(n)$, $E_y = \sum_{n=-N/2}^{N/2-1} x^2(n+\tau)$, $x(n)$ denotes the $n$-th excitation sample and $N$ is the length of the frame.
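A simplified Python sketch of the autocorrelation and peak-picking stage is given below (illustrative only); it normalises by R(0) rather than by the frame-dependent term of equation 3.8, and the search range and sampling frequency are assumptions.

```python
import numpy as np

def autocorr_f0(x, fs=8000, f_min=50.0, f_max=500.0, v_thresh=0.4):
    """Normalised autocorrelation pitch estimate for one (inverse-filtered) frame.

    Returns (f0_hz, voiced); a simplified version of the SIFT decision stage
    using the 0.4*R(0) voicing threshold quoted in the text.
    """
    x = np.asarray(x, float) - np.mean(x)
    N = len(x)
    r = np.correlate(x, x, mode='full')[N - 1:]          # r[tau] for tau >= 0
    r = r / (r[0] + 1e-12)                               # normalise so that R(0) = 1
    lag_min = int(fs / f_max)                            # shortest allowed period
    lag_max = min(int(fs / f_min), N - 1)                # longest allowed period
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])    # strongest peak in the period range
    voiced = r[lag] > v_thresh
    return (fs / lag if voiced else 0.0), voiced
```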

The autocorrelation function $R(\tau)$ possesses symmetry about $\tau = 0$, where it attains its maximum value ($R(0) = 1$); this is shown in figure 3.8-a (plotting only $\tau \in [0, N-1]$).

Figure 3.8: Autocorrelation diagram of the excitation signal in a) a voiced frame and b) an unvoiced frame

Figure 3.8-a shows the normalised autocorrelation of the excitation signal in figure 3.7-d. The time between the two vertical lines is the fundamental period (about 5.5ms in this case, which corresponds to 181.8Hz). A frame of unvoiced speech is processed using the same procedure as for the voiced frame, and the normalised autocorrelation diagram of its excitation signal is shown in figure 3.8-b. There is no periodic information in this autocorrelation diagram, since the excitation of unvoiced speech is composed of wideband noise. Therefore, the autocorrelation can not only determine the fundamental period for a voiced frame but can also classify frames as voiced or unvoiced. To determine the voicing classification associated with a frame, the strongest peak, $R(\tau_{max})$, is picked from the range of possible fundamental periods (2-20ms) and

compared with $R(0)$. If $R(\tau_{max})$ is greater than a predefined threshold value ($0.4R(0)$ from [48]), then the frame is classified as a voiced frame. Once the voiced/unvoiced decisions and fundamental frequency estimates have been computed for all frames of the utterance, a fundamental frequency smoothing method is applied based on the properties of the fundamental frequency; this is covered in a later subsection.

Comb function based fundamental frequency estimation

The comb function based fundamental frequency estimation method [17] [90] works mainly in the frequency domain and exploits the harmonic structure of the magnitude spectrum of speech. Figure 3.9 illustrates the spectral differences between a typical voiced and unvoiced frame of speech.

Figure 3.9: Magnitude spectrum of a) voiced speech signal; b) unvoiced speech signal

Figure 3.9-a shows that the peaks are located at the fundamental frequency and its harmonic frequencies in a voiced magnitude spectrum, while the peaks in figure 3.9-b

appear randomly in an unvoiced magnitude spectrum. To estimate the fundamental frequency, a comb function generator creates a number of comb functions (fundamental frequency candidates) which have the same mark-to-space ratio but different periods, defined as equation 3.9,

$$\mathrm{comb}_{f_c}(f) = \begin{cases} 0.0 & kf_c > f > (kf_c - \zeta f_c), \quad k = \{1, 2, \ldots, \mathrm{floor}(4000/f_c)\} \\ 1.0 & \text{otherwise} \end{cases} \qquad (3.9)$$

where $f_c$ denotes the fundamental frequency candidate and $\zeta$ is the mark-to-space ratio (here $\zeta = 0.4$). Some of these comb functions are illustrated in figure 3.10.

Figure 3.10: A set of comb functions (fundamental frequency candidates) from 50Hz to 400Hz

Figure 3.10 shows a number of comb functions which are used to measure the normalised harmonic structure in the frequency domain. The candidate frequencies of these comb functions vary from 50Hz (left) to 400Hz (right), and part of the fine structure of the left-hand figure is expanded in the middle figure. One of the comb functions will capture most of the peaks if the frame is voiced, whereas for an unvoiced frame each comb function will capture roughly equal numbers of peaks. A simplified block diagram of the comb function approach to fundamental frequency estimation is shown in figure 3.11.

Figure 3.11: Fundamental frequency estimation using comb functions

The speech signal is framed and windowed in a manner identical to the first two stages of the MFCC extraction [82]. A 512 point short-time Fourier transform is then applied to each frame and the magnitude spectrum is mapped to a linear frequency scale (0-4000Hz in this work). Although the resolution of the STFT is coarse (4000/256 ≈ 15Hz), the fundamental frequency can be computed from higher order harmonic components, which provide a better resolution. For example, if the fundamental frequency is computed from the 20th harmonic component with a resolution of 15Hz, the resolution of the fundamental frequency improves to 15/20 = 0.75Hz. All peaks in the magnitude spectrum are extracted and normalised according to equation 3.10,

$$p'(f) = p^2(f) \Big/ \sum_f p^2(f) \qquad (3.10)$$

where $p(f)$ is a peak on the linear frequency scale. A score, $\rho(f_c)$, for each fundamental frequency candidate, $f_c$, is computed from the normalised peaks and the corresponding comb function, $\mathrm{comb}_{f_c}(f)$, as shown in equation 3.11,

$$\rho(f_c) = \sum_f \left[p'(f) \cdot \mathrm{comb}_{f_c}(f)\right] \qquad (3.11)$$

If $\rho_{max}(f_c)$ is lower than a predefined threshold (0.80 from [90]), the frame is classified as unvoiced. Otherwise, the frame is classified as voiced,

where the fundamental frequency, $f_0$, is obtained from equation 3.12,

$$f_0 = \arg\max_{f_c} \left[\rho(f_c)\right] \qquad (3.12)$$

Figure 3.12 illustrates the procedure of fundamental frequency estimation using the comb function method.

Figure 3.12: Three comb functions whose fundamental frequency candidates are a) 150Hz; b) 250Hz; c) 400Hz; d) magnitude spectrum of a voiced frame; e) normalised peaks and f) fundamental frequency selection

Three comb functions, forming three fundamental frequency candidates, are shown in figures 3.12-a, 3.12-b and 3.12-c. The candidate frequencies are 150Hz, 250Hz and 400Hz, giving $\mathrm{comb}_{150}(f)$, $\mathrm{comb}_{250}(f)$ and $\mathrm{comb}_{400}(f)$. Figure 3.12-d presents the magnitude spectrum of a voiced frame, and the normalised peaks, $p'(f)$, of this voiced frame are shown in figure 3.12-e. These peaks are measured by a set of comb functions such as those in figures 3.12-a, 3.12-b and 3.12-c, resulting in the contour, $\rho(f_c)$, shown in figure 3.12-f. The largest value, $\rho_{max}$, marked by the red cross at 172Hz, is selected as the fundamental frequency for this frame.
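The following Python sketch illustrates the comb-function search (it is not the thesis implementation); for simplicity the comb is approximated by a window of width ζf_c centred on each harmonic rather than the exact definition of equation 3.9, and the brute-force 1 Hz search makes no use of the efficient peak-driven method described in the next paragraph.

```python
import numpy as np

def comb_f0(frame, fs=8000, f_lo=50, f_hi=400, zeta=0.4, v_thresh=0.8):
    """Comb-function pitch estimate for one frame (simplified, brute-force search)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 512))[:256]
    freqs = np.arange(256) * (fs / 512.0)
    # Normalised peak weights p'(f) of equation 3.10 (peaks taken as local maxima)
    peaks = np.zeros_like(spec)
    idx = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:]))[0] + 1
    peaks[idx] = spec[idx] ** 2
    peaks /= peaks.sum() + 1e-12
    best_rho, best_fc = 0.0, 0.0
    for fc in np.arange(f_lo, f_hi + 1, 1.0):            # candidate fundamental frequencies
        k = np.round(freqs / fc)                         # nearest harmonic index
        comb = (np.abs(freqs - k * fc) <= 0.5 * zeta * fc) & (k >= 1)
        rho = peaks[comb].sum()                          # harmonic score, cf. equation 3.11
        if rho > best_rho:
            best_rho, best_fc = rho, fc
    voiced = best_rho > v_thresh
    return (best_fc if voiced else 0.0), voiced
```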

The comb function based fundamental frequency estimation method is clearly computationally expensive. For each fundamental frequency candidate there are 4000 multiply-and-add operations; if the search range is from 50Hz to 350Hz with 1Hz resolution, there are at least 1.2 million multiply-and-add operations for each frame. An efficient method of searching for the fundamental frequency has been proposed by Chazan [90]. In this technique, the peaks in the magnitude spectrum are extracted, normalised (as in equation 3.10) and sorted. These peaks can be regarded as multiples of the fundamental frequency; in other words, a number of pitch candidates can be derived from each peak. For example, if a peak appears at 700Hz in the magnitude spectrum, the fundamental frequency could be in the vicinity of 350Hz, 233Hz, 175Hz, etc. Each of these candidates is assigned a gain equal to the magnitude of the peak. This is iterated over all peaks, and the fundamental frequency candidate with the maximum sum of gains is chosen as the fundamental frequency. A threshold combined with a partial unity function is employed during the iteration to reduce the valid fundamental frequency search regions and therefore speed up the process. This method shows that the higher order harmonics of the fundamental frequency play a role in resolving pitch ambiguities, such as fundamental frequency doubling or halving errors, and improve the resolution of the fundamental frequency estimate.

Fundamental frequency smoothing algorithm

As discussed in the last two subsections, the voicing classification and fundamental frequency are estimated frame by frame. During the fundamental frequency

estimation, all frames are assumed to be independent of each other. This assumption is unrealistic and results in the example fundamental frequency contour, extracted from the connected digit string one-three-nine-oh, shown in figure 3.13.

Figure 3.13: Comb function based fundamental frequency estimates without smoothing

The spectrogram of the same utterance is shown in the background of figure 3.13. The blue line overlaid on the spectrogram is the fundamental frequency contour extracted using the comb function based fundamental frequency estimation method described in the previous subsection. The figure shows an irregular contour, with very short runs of voiced and unvoiced frames appearing in the fundamental frequency contour. As a result the contour mismatches the spectrogram, causing voicing classification errors and large fundamental frequency errors. Considering the speech production mechanism, the fundamental frequency contour produced by the vocal cords has the following attributes:

1. The fundamental frequency cannot vary greatly from one frame to the next, due to the physical speech production mechanism.

2. The vocal cords cannot vibrate for a very short time and then stop.

3. The vocal cords cannot stop vibrating for a very short time and then vibrate again.

For the first attribute, a five-point median smoothing is applied to reduce large variations in the fundamental frequency estimates. For the second and third attributes, very short voiced runs lasting fewer than five frames (50ms) are treated as unvoiced frames, and very short unvoiced runs lasting fewer than five frames (50ms) are treated as voiced frames. Linear interpolation is used to calculate the fundamental frequency values for those frames that need to be corrected. This fundamental frequency correction method results in the contour shown in figure 3.14.

Figure 3.14: Comb function based fundamental frequency estimates with smoothing

The spectrogram in the background of figure 3.14 is the same as in figure 3.13. The blue line is the fundamental frequency contour after the smoothing algorithm discussed above. Compared with figure 3.13, the fundamental frequency contour after smoothing is considerably more accurate than the contour without the smoothing algorithm.
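A possible Python sketch of this smoothing procedure is shown below (an illustration under the assumptions of 10 ms frames and F0 = 0 marking unvoiced frames; the exact rules used in this work may differ in detail).

```python
import numpy as np
from scipy.signal import medfilt

def smooth_f0(f0, min_run=5):
    """Smooth an utterance-level F0 contour (0 = unvoiced) using the three rules above."""
    f0 = np.asarray(f0, dtype=float).copy()
    voiced = f0 > 0
    # Rules 2/3: flip voiced/unvoiced runs shorter than min_run frames (~50 ms)
    start = 0
    for i in range(1, len(f0) + 1):
        if i == len(f0) or voiced[i] != voiced[start]:
            if i - start < min_run:
                voiced[start:i] = not voiced[start]
            start = i
    # Fill frames that became voiced by linear interpolation between voiced neighbours
    idx = np.where(f0 > 0)[0]
    if len(idx) > 1:
        f0[voiced] = np.interp(np.where(voiced)[0], idx, f0[idx])
    f0[~voiced] = 0.0
    # Rule 1: five-point median smoothing of the voiced values
    if voiced.sum() >= 5:
        f0[voiced] = medfilt(f0[voiced], kernel_size=5)
    return f0
```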

Evaluation of fundamental frequency

One of the most difficult problems in comparing and evaluating the performance of fundamental frequency estimators is choosing a meaningful objective performance criterion. There are many characteristics of fundamental frequency estimation algorithms which influence the choice of a set of performance criteria [8], including the following:

1. Accuracy in estimating the fundamental frequency.

2. Accuracy in making a voicing classification.

3. Robustness of the measurements, for example to different transmission conditions and speakers.

4. Speed of operation.

5. Complexity of the algorithm.

6. Suitability and cost of hardware implementation.

Depending on the system requirements, different weights must be given to each of the above factors to decide which fundamental frequency estimation algorithm is best for the system. In this thesis, the first three factors are given priority when selecting the fundamental frequency estimation algorithm. Before defining these measures it is useful to examine the types of error made in fundamental frequency extraction:

1. A voiced frame being classified as an unvoiced frame;

2. An unvoiced frame being classified as a voiced frame;

3. A correct classification of a voiced frame but a wrong fundamental frequency value.

The first two kinds of error are straightforward. To illustrate the third kind of error, figure 3.15 shows a histogram of the percentage fundamental frequency error, taken across 75 phonetically rich sentences, using the comb function based fundamental frequency estimation. The percentage fundamental frequency error is defined as equation 3.13,

$$e_m\% = \frac{|f_0(m) - \hat{f}_0(m)|}{f_0(m)} \times 100\% \qquad (3.13)$$

where $\hat{f}_0(m)$ and $f_0(m)$ represent the estimated and reference fundamental frequencies of the $m$-th frame respectively. In this experiment the reference fundamental frequency was provided by a hand-checked laryngograph signal. For clarity, figure 3.15 also shows an expanded view of the lower portion of the histogram on the right-hand side.

Figure 3.15: Distribution of percentage fundamental frequency errors

The majority of fundamental frequency estimates are very close to the reference fundamental frequency and appear to follow a Gaussian distribution. The vertical lines show the range of fundamental frequency estimation errors that are within +/-20% of the reference fundamental frequency; statistically, over 97% of the fundamental frequency estimates lie within this range. However, a number of

errors are concentrated around the -50% and +100% points (less than 3% of all fundamental frequency estimates). These correspond to fundamental frequency halving and doubling errors, which are common mistakes made in fundamental frequency estimation. If these large fundamental frequency errors were included in the overall accuracy measurement, they would skew the results. To avoid this, these large percentage fundamental frequency errors are classified as voicing classification errors. The voicing classification error, $E_c$, can therefore be defined as equation 3.14 [99],

$$E_c = \frac{N_{V/U} + N_{U/V} + N_{>20\%}}{N_{total}} \times 100\% \qquad (3.14)$$

where $N_{V/U}$ is the number of voiced frames classified as unvoiced, $N_{U/V}$ is the number of unvoiced frames classified as voiced, $N_{>20\%}$ is the number of frames whose fundamental frequency error is greater than 20%, and $N_{total}$ is the total number of frames. For those frames correctly classified as voiced, the percentage fundamental frequency error, $E_p$, can be defined as equation 3.15,

$$E_p = \frac{1}{N} \sum_{m=1}^{N} \frac{|\hat{f}_0(m) - f_0(m)|}{f_0(m)} \times 100\% \qquad (3.15)$$

where $\hat{f}_0(m)$ and $f_0(m)$ are the $m$-th estimated and reference fundamental frequencies respectively, and $N$ is the total number of voiced frames which are correctly classified with a fundamental frequency percentage error of less than 20%.
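The two measures can be computed as in the following Python sketch (illustrative; it assumes F0 = 0 marks an unvoiced frame in both the reference and estimated contours).

```python
import numpy as np

def f0_errors(f0_ref, f0_est, tol=0.20):
    """Classification error E_c (equation 3.14) and gross-error-free E_p (equation 3.15)."""
    f0_ref, f0_est = np.asarray(f0_ref, float), np.asarray(f0_est, float)
    v_ref, v_est = f0_ref > 0, f0_est > 0
    n_vu = np.sum(v_ref & ~v_est)                 # voiced classified as unvoiced
    n_uv = np.sum(~v_ref & v_est)                 # unvoiced classified as voiced
    both = v_ref & v_est
    rel_err = np.zeros(len(f0_ref))
    rel_err[both] = np.abs(f0_est[both] - f0_ref[both]) / f0_ref[both]
    n_gross = np.sum(both & (rel_err > tol))      # >20% errors counted as classification errors
    e_c = 100.0 * (n_vu + n_uv + n_gross) / len(f0_ref)
    fine = both & (rel_err <= tol)
    e_p = 100.0 * rel_err[fine].mean() if fine.any() else 0.0
    return e_c, e_p
```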

Equations 3.14 and 3.15 will be used as the two criteria to measure the performance of the fundamental frequency estimation methods, namely the SIFT method, the comb function method, the auditory model based method (section 4.3) and the HMM-GMM method (section 6.2), in this and later chapters.

Experimental Results

This section measures the accuracy of the fundamental frequency estimation and voicing classification using a subset of the ETSI Aurora connected digits database. A dataset composed of 501 noise-free utterances (12,643 frames in total) is taken from the database. The reference fundamental frequency is generated using the Speech Filing System software [72], followed by manual correction where necessary. The two fundamental frequency estimation methods, SIFT and the comb function method, are used to estimate the fundamental frequency contour for each utterance in this dataset. The two fundamental frequency measures, percentage voicing classification error and percentage fundamental frequency error, are shown in table 3.3. For the voicing classification error, the table also shows the three components which make up the classification error in equation 3.14.

Table 3.3: Fundamental frequency evaluation for the SIFT and comb function based fundamental frequency estimation methods

| Method | $E_p$ | $E_c$ | $N_{U/V}/N_{total}$ | $N_{V/U}/N_{total}$ | $N_{>20\%}/N_{total}$ |
| SIFT | 2.08% | 10.05% | 1.83% | 4.91% | 3.31% |
| COMB | 2.42% | 3.73% | 1.74% | 0.67% | 1.32% |

Table 3.3 shows that the voicing classification from the comb function method is much more accurate than that from the SIFT method. The percentage fundamental frequency error, $E_p$, from the comb function method is slightly higher than that from the SIFT method. However, the $N_{>20\%}/N_{total}$ component from the comb function method is much lower than that from the SIFT method. This means that large fundamental frequency errors are more likely to occur in the fundamental frequency contour from the SIFT

method. An example of the fundamental frequency contours obtained using the SIFT and comb function methods for the utterance four-seven-seven is shown in figure 3.16.

Figure 3.16: Comparing the reference fundamental frequency with the fundamental frequency estimate from the (a) SIFT method; (b) comb function method; and (c) energy contour

Figure 3.16-a shows the result from the SIFT method (blue line) and the reference fundamental frequency (red line). Figure 3.16-b shows the result from the comb function method (blue line) and the reference fundamental frequency (red line). Figure 3.16-c is the energy contour taken from the speech signal at the same frame rate as in figures 3.16-a and 3.16-b. Comparing figure 3.16-a with figure 3.16-b indicates that the fundamental frequency contour from the comb function method is much closer to the reference fundamental frequency than that from the SIFT method. Many more classification errors occur with the SIFT method than with the comb function method. Some classification errors also occur at the

beginning and end of speech in the fundamental frequency contour from the comb function method. These classification errors occur at around the -25dB energy level, which means that they may introduce less detectable noise because of their low energy. Taking all of these measures into consideration, the comb function based method performs better than the SIFT method, and the comb function based method is therefore selected for further experiments.

3.4 Speech Reconstruction Using the Source-filter Model

As discussed in section 2.1.2, a speech signal can be reconstructed from the linear predictive coefficients and either a train of periodic impulses separated by the fundamental period or wideband noise, as shown in equation 3.16,

$$s(n) = \sum_{k=1}^{P} a_k s(n-k) + G u(n) \qquad (3.16)$$

As derived in section 2.1.2, the vocal tract filter coefficients, $a_k$, can be obtained by computing the prediction coefficients, $\alpha_k$. The implementation of the source-filter model for speech reconstruction from MFCC vectors and fundamental frequencies is shown in figure 3.17.

Figure 3.17: Speech reconstruction from MFCC vectors and fundamental frequencies using the source-filter model

Both the MFCC vectors and the fundamental frequency information are extracted on the terminal device and transmitted to the server. On the server side, the spectral envelope is first estimated from the MFCC vector using the cubic spline interpolation technique discussed in section 3.2, and the prediction coefficients are then computed from the MFCC-derived spectral envelope. As described in section 2.1.2, the prediction coefficients, $\alpha$, can be computed from the autocorrelation coefficients, as shown in equation 3.17,

$$\sum_{k=1}^{P} \alpha_k\, r(|j-k|) = r(j), \qquad 1 \le j \le P \qquad (3.17)$$

where $r(j)$ is the $j$-th autocorrelation coefficient, normally obtained from the speech signal as,

$$r(j) = \sum_{n=0}^{N-1-j} s(n)\, s(n+j) \qquad (3.18)$$

where $s(n)$ is the $n$-th speech sample and $N$ denotes the length of the frame. However, in the DSR framework the speech signal is not available on the server side. Instead, the smoothed spectral envelope estimated from the MFCC vectors can be squared and reflected to give a power spectrum estimate. Using the Wiener-Khintchine theorem [18], an estimate of the autocorrelation coefficients can then be obtained through an inverse Fourier transform, as shown in equation 3.19,

$$\hat{r}(j) = \frac{1}{N} \sum_{f=0}^{N-1} |\hat{X}(f)|^2 e^{\frac{i 2\pi j f}{N}} \qquad (3.19)$$

where $\hat{r}(j)$ is the estimate of the $j$-th autocorrelation coefficient, $i$ is the complex factor ($i^2 = -1$) and $|\hat{X}(f)|^2$ is the power spectrum estimate derived from the MFCC vector. Applying these estimates to equation 3.17 gives a set of MFCC-derived filter coefficients.
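The following Python sketch illustrates this server-side step (an illustration only); the envelope length, the predictor order and the use of a direct linear solve instead of the Levinson-Durbin recursion are assumptions of this example.

```python
import numpy as np

def envelope_to_lpc(env, order=12):
    """Estimate prediction coefficients from an MFCC-derived spectral envelope.

    env is the D-bin magnitude envelope (0 .. fs/2); it is squared and reflected to
    form a power spectrum, inverse-Fourier-transformed to autocorrelation coefficients
    (equation 3.19) and then solved for alpha via the normal equations (equation 3.17).
    """
    power = np.asarray(env, float) ** 2
    full = np.concatenate([power, power[-2:0:-1]])         # reflect to a full spectrum
    r = np.fft.ifft(full).real                             # Wiener-Khintchine estimate of r(j)
    R = np.array([[r[abs(j - k)] for k in range(order)] for j in range(order)])
    alpha = np.linalg.solve(R, r[1:order + 1])             # solve equation 3.17
    return alpha
```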

The effectiveness of the inversion is illustrated in figure 3.18, which shows an LPC-derived magnitude spectrum (dotted line). This is compared to the LPC-derived magnitude spectrum estimated from both a non-truncated (23-D) MFCC vector (solid line) and a truncated (13-D) MFCC vector (dashed line).

Figure 3.18: MFCC-derived vocal tract frequency response

Both the non-truncated and truncated MFCC vectors give close estimates of the true LPC magnitude spectrum. However, the truncated MFCC vector is unable to resolve the high frequency peak into two separate formants. The excitation signal, $u(n)$, shown in equation 3.16, is generated according to the voicing classification and the estimated fundamental frequency. For voiced speech the excitation is a series of impulses separated by the fundamental period, and for unvoiced speech the excitation is wideband noise [48]. This method produces a single frame of speech; section 3.6 will introduce the overlap-and-add method used to reconstruct the entire utterance.

3.5 Speech Reconstruction Using the Sinusoidal Model

As described in section 2.1.3, the sinusoidal model can be used as an alternative model for speech production, shown as equation 3.20,

$$\hat{s}(n) = \sum_{l=1}^{L(m)} A_l \cos(2\pi f_l n + \theta_l) \qquad (3.20)$$

where $\hat{s}(n)$ denotes the $n$-th sample of the reconstructed speech signal; $A_l$, $f_l$ and $\theta_l$ represent the amplitude, frequency and phase offset of the $l$-th sinusoidal component respectively, and $L(m)$ is the number of sinusoidal components in the $m$-th frame. A block diagram of the application of this model to speech reconstruction from the MFCC vector and fundamental frequency information is shown in figure 3.19.

Figure 3.19: Speech reconstruction using the sinusoidal model

In the sinusoidal model the parameters comprise the amplitude, frequency and phase of each sinusoid, and methods to estimate these parameters are introduced in the next subsection. Speech reconstruction of a single frame is described in the second part of this section.

Parameter estimation from spectral envelope and fundamental frequency

One of the most important contributions of McAulay's work [21] is the proof that each peak in the frequency domain corresponds to one sinusoidal component in the time

domain. This proof significantly reduces the necessary size of the parameter set. The peaks in the magnitude spectrum of a voiced frame are quasi-periodic (the harmonic assumption), which provides a method for estimating the harmonic components from the fundamental frequency estimate, $\hat{f}_0$, shown as equation 3.21,

$$\hat{f}_l = l \hat{f}_0, \qquad l = \{1, 2, \ldots, \mathrm{floor}(4000/\hat{f}_0)\} \qquad (3.21)$$

The amplitude, $\hat{A}_l$, of each sinusoidal component is obtained by sampling the spectral envelope, $\hat{X}(f)$, derived from the MFCC vector, at each harmonic frequency, shown as equation 3.22,

$$\hat{A}_l = \hat{X}(l \hat{f}_0) \qquad (3.22)$$

An example of the estimated frequencies, $\hat{f}_l$, and amplitudes, $\hat{A}_l$, is shown in figure 3.20, taken from the English phoneme /ay/.

Figure 3.20: Comparison of the estimated peaks and the peaks of the magnitude spectrum

Figure 3.20 illustrates the difference between the estimated peaks and the peaks of the magnitude spectrum. The green vertical line denotes the fundamental frequency position, $\hat{f}_0$, and its harmonic components, $(l\hat{f}_0, \hat{A}_l)$, are drawn as green crosses on the estimated spectral envelope (red line). The blue line is the actual magnitude spectrum of the frame. It can be observed that the estimated peaks are very close

to the actual peaks of the real magnitude spectrum in the low/middle frequency bands. However, relatively large errors exist in the estimated peaks in the high frequency bands because of the smoothing effect of the mel-filterbank and the truncation during MFCC extraction. The phase offset, $\hat{\theta}_l$, of each component is composed of two parts [58]: the phase offset from the excitation, $\hat{\varphi}_l$, and that from the vocal tract, $\hat{\Phi}_l$, shown as equation 3.23,

$$\hat{\theta}_l = \hat{\varphi}_l + \hat{\Phi}_l \qquad (3.23)$$

The phase offset from the excitation is the product of frequency and time, shown as equation 3.24,

$$\hat{\varphi}_l = 2\pi f_l t \qquad (3.24)$$

where $\hat{\varphi}_l$ represents the phase offset at the fundamental or harmonic frequency denoted by $f_l$. Based on the harmonic assumption shown in equation 3.21, equation 3.24 can be expressed as,

$$\hat{\varphi}_l = 2\pi l f_0 t \qquad (3.25)$$

This linear phase model is demonstrated in figure 3.21 as a simplified spectrogram.

Figure 3.21: Linear phase model

The fundamental frequency and its harmonics are shown in three successive

frames, $m-1$, $m$ and $m+1$. Each horizontal line, in a different colour, represents the fundamental frequency or one of its harmonic frequencies. The phase offset at the fundamental frequency of each frame is the phase offset of the previous frame plus the product of the frame length, $T$, and the fundamental frequency, $\hat{f}_0$, of the previous frame, shown as equation 3.26,

$$\hat{\varphi}_0(m) = \hat{\varphi}_0(m-1) + 2\pi \hat{f}_0(m-1)\, T \qquad (3.26)$$

The phase offset of each harmonic component is the product of the harmonic index and the phase offset at the fundamental frequency, shown as equation 3.27,

$$\hat{\varphi}_l(m) = l\, \hat{\varphi}_0(m) \qquad (3.27)$$

The phase component from the vocal tract can be estimated by assuming a minimum phase system. A Hilbert transform [7] can be used to obtain the minimum phase delay, $\hat{\Phi}_l(f)$, from the magnitude spectrum, $|\hat{X}(f)|$, shown as equations 3.28 and 3.29 [58],

$$c(n) = \frac{2}{N} \sum_{f=1}^{N/2} \log|\hat{X}(f)| \cos(2\pi n f) \qquad (3.28)$$

$$\hat{\Phi}_l(f) = -2 \sum_{n=1}^{D} c(n) \sin(2\pi n f) \qquad (3.29)$$

where $\log(|\hat{X}(f)|)$ and $\hat{\Phi}_l(f)$ form the Hilbert transform pair, $c(n)$ is the $n$-th cepstral coefficient of the magnitude spectrum $|\hat{X}(f)|$, $D$ is the dimensionality of the cepstral coefficients and $N$ denotes the number of frequency bins (here $N/2 = 256$). In practice, based on listening to the reconstructed results, the phase offset from the excitation is much more important than that from the vocal tract filter for producing good quality reconstructed speech.

Reconstruction of a single frame of speech

After the amplitude, frequency and phase of each sinusoid have been calculated and a voicing classification made, the synthesis equation can be modified from equation 2.13 to equation 3.30 according to the harmonic assumption,

$$\hat{s}(n) = \sum_{l=1}^{L(m)} \hat{A}_l \cos\left[2\pi l \hat{f}_0 n + \hat{\theta}_l + \theta_r\right] \qquad (3.30)$$

$L(m)$ can be determined from the fundamental frequency, shown as equation 3.31,

$$L(m) = \mathrm{floor}\left[4000/f_0(m)\right] \qquad (3.31)$$

where $f_0(m)$ is the fundamental frequency of the $m$-th frame, measured in Hz. $\theta_r$ is a random phase term related to the degree of voicing occupancy and is determined as,

$$\theta_r = [1 - \rho(m)] \cdot 2\pi \cdot [\mathrm{rand}(1) - 0.5] \qquad (3.32)$$

where $\rho(m)$ is the voicing classification of the $m$-th frame, ranging from 0 to 1, as defined during the fundamental frequency extraction in section 3.3, and $\mathrm{rand}(\cdot)$ denotes a random value ranging from 0 to 1.
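A minimal Python sketch of single-frame harmonic synthesis is given below (illustrative only); it samples the envelope at the harmonics (equation 3.22), uses the accumulated excitation phase of equations 3.26-3.27 and the random phase of equation 3.32, but omits the minimum-phase vocal tract term and the sub-frame interpolation introduced in section 3.6. The frame length, sampling rate and the harmonic spacing used for unvoiced frames are assumptions.

```python
import numpy as np

def synth_frame(envelope, f0, voicing, n_samples=160, fs=8000, phi0=0.0, seed=None):
    """Synthesise one frame with the harmonic sinusoidal model (equation 3.30).

    envelope : D-bin spectral envelope covering 0 .. fs/2
    f0       : fundamental frequency estimate in Hz (ignored when voicing is 0)
    voicing  : rho(m) in [0, 1]; 0 gives purely random phase (unvoiced)
    phi0     : accumulated excitation phase at the fundamental (equation 3.26)
    """
    rng = np.random.default_rng(seed)
    D = len(envelope)
    f0 = f0 if voicing > 0 else 100.0                # arbitrary spacing for unvoiced frames
    L = int(np.floor((fs / 2) / f0))                 # number of harmonics (equation 3.31)
    n = np.arange(n_samples)
    s = np.zeros(n_samples)
    for l in range(1, L + 1):
        bin_idx = min(int(round(l * f0 / (fs / 2) * (D - 1))), D - 1)
        amp = envelope[bin_idx]                      # amplitude from the envelope (equation 3.22)
        theta = l * phi0                             # harmonic excitation phase (equation 3.27)
        theta_r = (1.0 - voicing) * 2 * np.pi * (rng.uniform() - 0.5)   # random phase (eq. 3.32)
        s += amp * np.cos(2 * np.pi * l * f0 * n / fs + theta + theta_r)
    phi0_next = phi0 + 2 * np.pi * f0 * (n_samples / fs)                # update phase (eq. 3.26)
    return s, phi0_next
```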

3.6 Post Processing

The previous two sections introduced two algorithms for reconstructing a single frame of speech from a given spectral envelope and fundamental frequency estimate. This section first describes an algorithm to reconstruct a whole utterance from consecutive frames. Secondly, a method to improve speech quality by reducing the fundamental frequency shift between successive frames is described in the second part of this section.

Reconstruction of multiple frames

After each frame of speech has been synthesised, shifts in parameters such as the fundamental frequency occur at the frame boundaries and introduce artificial noise. It is therefore necessary to merge the signal smoothly between frames. An overlap-and-add algorithm is implemented for this purpose, shown in figure 3.22.

Figure 3.22: Overlap-and-add method

The rectangles in different colours represent different frames and the vertical line in each rectangle is the center line of the frame. In this method, each frame is extended into the previous and next frames by half a frame. A triangular weighting is applied to each extended frame and, in the last stage, the weighted signals are added together. Figure 3.23 compares the reconstruction with and without the overlap-and-add algorithm. The dashed magenta line shows the triangular weighting of the overlap-and-add window. A signal shift appears at the frame boundary (green line) in figure 3.23-a because the parameters can be quite different in successive frames; this shift has been smoothed using overlap-and-add, as shown in figure 3.23-b.

Figure 3.23: Synthesised signal (a) without and (b) with the overlap-and-add algorithm
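A compact Python sketch of the triangular overlap-and-add merge is shown below (assuming 50% overlap, i.e. frames of length 2×hop centred hop samples apart).

```python
import numpy as np

def overlap_add(frames, hop):
    """Merge synthesised frames with triangular weighting, as in figure 3.22."""
    frame_len = len(frames[0])
    win = np.bartlett(frame_len)                       # triangular weighting window
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for m, frame in enumerate(frames):
        out[m * hop:m * hop + frame_len] += win * frame
    return out
```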

Fundamental frequency smoothing between frames

A comparison of the spectrogram of the reconstructed speech signal with that of the original signal is shown in figure 3.24. The reconstructed speech is obtained using 23-D MFCCs and fundamental frequency information obtained from the comb function algorithm. The utterance, a connected digit string spoken by a male speaker, comes from the Aurora TI database [73].

Figure 3.24: The narrowband spectrogram of (a) the original and (b) the reconstructed speech signal

Clearly, formant information and some harmonic structure are retained in the reconstructed speech. Unfortunately, the middle/high harmonic frequency components of the reconstructed speech are not continuous, and a buzzing effect can be heard because of this confusion. This results from the fundamental frequency shifting between successive frames, shown in the simplified spectrogram in figure 3.25-a.

Figure 3.25: (a) Harmonic confusion and (b) fundamental frequency smoothing

The vertical lines represent the frame boundaries and the horizontal lines, in the same colour, represent the same harmonic components in different frames. Figure 3.25-a shows that a small difference in fundamental frequency between consecutive frames causes a large frequency shift in the middle or high frequency bands, because the higher harmonic frequency components are obtained by multiplying the fundamental frequency by the harmonic index, as shown in equation 3.21. For example, the $(l+1)$-th component in frame $m-1$ can be closer to the $l$-th component than to the $(l+1)$-th component in frame $m$. To avoid this confusion, one improvement to the algorithm is to split each frame into a number of subframes, shown in figure 3.25-b and equation 3.33 (in this work, one frame is divided into four subframes). Linear interpolation is used to obtain the fundamental frequency in the subframes from the fundamental frequencies of consecutive frames,

$$\bar{f}_0(m, b) = f_0(m) + \left[f_0(m+1) - f_0(m)\right](b-1)/4, \qquad b = 1, 2, 3, 4 \qquad (3.33)$$

where $\bar{f}_0$ represents the fundamental frequency of a subframe, $f_0$ the fundamental frequency of a frame, $m$ the frame index and $b$ the subframe index.
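Equation 3.33 can be applied to a whole contour as in the following sketch (illustrative; the last frame is simply repeated at the end, and unvoiced frames marked with F0 = 0 would need separate handling).

```python
import numpy as np

def subframe_f0(f0, n_sub=4):
    """Linearly interpolate a frame-level F0 contour onto sub-frames (equation 3.33)."""
    f0 = np.asarray(f0, dtype=float)
    f0_next = np.append(f0[1:], f0[-1])                # repeat the last frame at the end
    b = np.arange(n_sub)                               # corresponds to b - 1 in equation 3.33
    return (f0[:, None] + (f0_next - f0)[:, None] * b / n_sub).ravel()
```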

Clearly, the harmonic components in the middle and high frequency bands in figure 3.25-b are better defined than those in figure 3.25-a. An improved reconstructed speech signal is demonstrated in figure 3.26-b.

Figure 3.26: (a) Initial and (b) improved reconstructed spectrogram

Comparing figures 3.26 and 3.24, the harmonic frequency structure obtained with the fundamental frequency smoothing algorithm is clearer than that from the initial model, and the result after fundamental frequency smoothing is closer to the original speech than that without the smoothing algorithm. Listening tests show that the artificial effects of the reconstruction method are much smaller with fundamental frequency smoothing than without it.

3.7 Experimental Results

To measure the quality of the speech reconstructed from the source-filter and sinusoidal models, a series of listening tests conforming to the ITU-T recommendations for

subjective quality assessment [62] have been performed [101]. For comparison, the tests include speech encoded by two standard codecs, GSM (9.6kbps) [53] and CELP (FS-1016, 4.8kbps) [57]. For each codec under evaluation, fifty-two utterances were created from short sentences spoken by a UK English female speaker. Ten listeners were employed to rate the speech quality of these sentences from the different codecs. The sentences were played out in random order, with each listener rating the speech quality using an absolute category rating (ACR) ranging from 1 (bad) to 5 (excellent). Table 3.4 shows the mean opinion score (MOS) [63] computed across all listeners for the four codecs tested (more details are presented in Appendix B).

Table 3.4: Mean opinion scores for reconstructed speech

| Codec | MOS |
| GSM (9.6kbps) | 3.85 |
| CELP (4.8kbps) | 3.37 |
| MFCC-SourceFilter (4.8kbps) | 1.40 |
| MFCC-Sinusoidal (4.8kbps) | 2.53 |

The MFCC-based speech reconstruction methods give the lowest speech quality, with the sinusoidal model performing the better of the two by 1.1 MOS points. This low quality is to be expected given the limited information from which the speech is reconstructed, although in all cases the speech remained intelligible. It is interesting to observe that reconstruction from MFCCs using the sinusoidal model is reasonably close to the quality attained by the CELP codec. Further comparisons of the reconstructed speech are illustrated in figure 3.27, which compares the original speech spectrogram of the utterance "The best way to learn is to solve extra problems." with those reconstructed from MFCCs using both the sinusoidal and the source-filter models. Both reconstructed speech signals exhibit formant structure close to that of the original

speech signal.

Figure 3.27: Narrowband spectrograms of a) the original speech (top) and speech reconstructed from MFCCs and fundamental frequency using b) the sinusoidal model (middle) and c) the source-filter model (bottom)

However, in both cases the formants are not as well defined as those of the original speech. This can be attributed to the smoothing imposed by the mel-filterbank and the truncation of the MFCC vector from 23 to 13 dimensions. Both reconstructed speech signals have harmonic structures close to that of the original speech signal in the low frequency bands. In the high frequency bands, however, the harmonic structure of the speech reconstructed with the sinusoidal model is closer to the original speech than that from the source-filter model. Voicing and fundamental frequency estimation is generally accurate in the reconstructed speech, although a voicing error can be observed at the end of the word "best" at around 0.5s, where unvoiced speech in figure 3.27-a is mistaken for voiced by the fundamental frequency estimator. This causes the phoneme /t/ to be lost at the end of the word. Based on this spectrogram analysis and the MOS tests, the sinusoidal model was selected for further speech reconstruction experiments.

3.8 Summary

This chapter has implemented two methods for reconstructing a speech signal from MFCC vectors and fundamental frequency information based on the source-filter and the sinusoidal models respectively. Related algorithms, such as two fundamental frequency estimation methods and model improvement, are also described. Both speech reconstruction schemes produce intelligible speech from the MFCC vectors and the fundamental frequency. The speech reconstruction scheme using the sinusoidal model produces higher quality speech than that using the source-filter model and therefore forms the basis for speech reconstruction in the remaining chapters.

Chapter 4

Speech Reconstruction from Noisy MFCCs

Contents
4.1 Introduction
4.2 Noise compensation
4.3 Robust pitch estimation
4.4 Experimental results
4.5 Summary

Preface

This chapter extends the techniques from the previous chapter to reconstruct a clean speech signal from MFCCs which have been contaminated by noise. To achieve this it is necessary to obtain a clean spectral envelope estimate and a reliable fundamental frequency estimate. The clean spectral envelope estimate is obtained by applying spectral subtraction to mel filterbank vectors and an auditory model is employed for reliable fundamental frequency estimation. Experimental results show that fundamental frequency estimation from the auditory model is much more reliable than that from the traditional fundamental frequency estimation methods. Spectrograms and listening tests indicate that a clean speech signal can be successfully reconstructed from the noisy MFCCs. Fundamental frequency estimation errors and voicing classification errors are heard as artificial sounding bursts in the reconstructed speech signal while incorrect estimates of the spectral envelope introduce noise into the reconstructed speech.

4.1 Introduction

In the previous chapter, a speech signal was reconstructed from a stream of clean MFCC vectors and excitation information (fundamental frequency and voicing classification), both of which were extracted from a clean speech signal. However, when this scheme is deployed in a noisy environment, noise, defined as an unwanted signal that interferes with the communication or measurement of another signal [85], is added to the clean speech signal. This can be expressed as equation 4.1,

y(n) = x(n) + e(n)    (4.1)

where x(n), e(n) and y(n) denote the n-th clean speech sample, noise sample and observed noisy sample respectively in the time domain. This can be transformed to the frequency domain using the Fourier transform, which is a linear operation [10], so equation 4.1 can be represented in the frequency domain as,

Y(f) = X(f) + E(f)    (4.2)

where X(f), E(f) and Y(f) represent the complex spectra of the clean signal, noise and observed noisy signal respectively, and f denotes the linear frequency scale. When MFCC vectors (detailed in section 2.2) and fundamental frequency (detailed in section 3.3) are extracted from the noisy signal, y(n) or Y(f), both the MFCC vector and the fundamental frequency are distorted, as shown in the following subsections.
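As a concrete illustration of equations 4.1 and 4.2, the sketch below (Python with NumPy, not part of the thesis) scales a noise signal so that it can be added to clean speech at a chosen SNR and checks that the additivity carries over to the complex spectra through the linearity of the Fourier transform; the synthetic signals and the 5dB target are arbitrary stand-ins.

```python
import numpy as np

def mix_at_snr(x, e, snr_db):
    """Scale noise e and add it to clean speech x to give the target SNR in dB."""
    e = e[:len(x)]
    gain = np.sqrt(np.sum(x**2) / (np.sum(e**2) * 10**(snr_db / 10.0)))
    return x + gain * e, gain * e

# Synthetic stand-ins for a clean speech frame and a wideband noise recording.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)   # 1 s of a 200 Hz tone at 8 kHz
e = rng.normal(size=8000)
y, e_scaled = mix_at_snr(x, e, snr_db=5)               # equation 4.1 at an SNR of 5 dB

# The Fourier transform is linear, so Y(f) = X(f) + E(f) as in equation 4.2.
Y, X, E = np.fft.rfft(y), np.fft.rfft(x), np.fft.rfft(e_scaled)
assert np.allclose(Y, X + E)
```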

4.1.1 Effect of noise on spectral envelope

When MFCC vectors are extracted from the magnitude spectrum of the observed noisy signal, Y(f), the noise magnitude spectrum, E(f), can affect all or some of the mel-filterbank channels according to the bandwidth of the noise, and therefore distorts the MFCC vectors to varying degrees in accordance with the signal to noise ratio (SNR), as shown in figure 4.1.

Figure 4.1: One frame (16ms duration) of a speech signal showing the time domain and corresponding spectral envelope in the frequency domain of a) clean speech signal; b) noise and c) observed signal

Figure 4.1-a shows a 16ms frame of clean speech in both the time domain (left column) and the frequency domain (right column). Figure 4.1-b presents a contaminating whistle noise which has a frequency of around 2400Hz. The left part of figure 4.1-c is the resulting noisy speech signal, which is the summation of the clean signal in figure 4.1-a and the noise in figure 4.1-b at an SNR of 5dB.

As can be observed in the right-hand part of figure 4.1-c, the envelope of the clean signal is distorted by the noise spectrum. This noisy spectral envelope leads to distortion during speech reconstruction.

4.1.2 Effect of noise on fundamental frequency estimation

The noise also affects the results of fundamental frequency extraction, for both voicing classification and fundamental frequency estimation. Figure 4.2 demonstrates the effect of noise on the estimation of fundamental frequency.

Figure 4.2: Magnitude spectrum of a) clean speech signal, b) noise, c) observed noisy signal and d) selecting fundamental frequency from clean and noisy magnitude spectrum using the comb function method

Figure 4.2-a shows the magnitude spectrum of the clean speech signal shown in figure 4.1-a. Figure 4.2-b presents the magnitude spectrum of the whistle noise shown in figure 4.1-b. Figure 4.2-c is the resulting noisy magnitude spectrum obtained from the summation of the clean signal and the noise at an SNR of 5dB.

The green crosses in figures 4.2-a and 4.2-c mark peaks in the magnitude spectrum which are used for fundamental frequency estimation based on the comb function (fundamental frequency candidates). The blue and magenta lines in figure 4.2-d are the values of the peaks captured by each fundamental frequency candidate for the clean and the noisy speech signal respectively (detailed in section 3.3 and figure 2.15). The blue cross in this figure is the fundamental frequency estimate chosen from the clean speech signal, which is 201Hz, while the magenta cross represents the fundamental frequency estimate chosen from the noisy speech signal, which is 52Hz. This demonstrates that the presence of noise can result in erroneous fundamental frequency estimates.

4.1.3 Proposal for clean speech reconstruction

Both the distorted spectral envelope and fundamental frequency estimation errors reduce the quality of the reconstructed speech. Two requirements for clean speech reconstruction are accurate excitation information and a noise-free spectral envelope. To obtain these from a noisy speech signal, a modified speech reconstruction system is proposed in figure 4.3 to overcome these distortions.

Figure 4.3: Robust speech reconstruction from noisy MFCCs and fundamental frequency

There are two differences compared with the original speech reconstruction scheme shown in figure 3.19 in section 3.5. One is to utilise robust fundamental frequency estimation to obtain a precise fundamental frequency contour from the noisy speech signal, rather than using the comb function method which is more sensitive to noise.

The other difference is the division of the MFCC-to-spectral-envelope stage in figure 3.19 into three sub-stages. First, MFCCs are converted back to the mel-filterbank domain as before. Noise compensation is then applied at this stage to obtain an estimate of the clean mel-filterbank vector, followed by spectral envelope restoration from the mel-filterbank. All other stages are the same as those in the previous chapter. The remainder of this chapter is arranged as follows. The noise compensation algorithm used to provide a clean spectral envelope estimate is introduced in section 4.2, followed by an auditory model based robust fundamental frequency estimation method in section 4.3. Experimental results are presented in section 4.4 and a summary is given in the final section.

4.2 Noise Compensation for Spectral Envelope Estimation

To achieve clean speech reconstruction it is necessary to obtain an estimate of the clean spectral envelope from noisy MFCC vectors. Many techniques can be used to achieve this. This section introduces spectral subtraction to provide a clean spectral envelope estimate.

4.2.1 Spectral subtraction

Spectral subtraction [11] [67] is a method for restoring the power spectrum or the magnitude spectrum of a signal observed in additive noise. This is achieved through subtraction of an estimate of the noise spectrum from the noisy signal spectrum, as shown in equation 4.3,

|\hat{X}(f)|^b = |Y(f)|^b - \alpha\,|\hat{E}(f)|^b    (4.3)

where |\hat{X}(f)|^b is an estimate of the original spectrum, |X(f)|^b, obtained from the noisy spectrum, |Y(f)|^b, and |\hat{E}(f)|^b is the estimate of the noise spectrum, typically obtained from time-averaged noise spectra. For power spectral subtraction b = 2, and for magnitude spectral subtraction b = 1, which is the form implemented in this section. The parameter α, named the over-subtraction factor [67], controls the amount of noise subtracted from the noisy signal. Obviously, if |Y(f)| < α|\hat{E}(f)|, spectral subtraction results in a negative estimate for \hat{X}(f). This is more likely to occur in cases of low signal-to-noise ratio (SNR). To avoid these negative magnitude spectrum estimates, a noise floor or maximum attenuation factor β is introduced. Equation 4.3 is now modified as,

\hat{X}(f) = \begin{cases} |Y(f)| - \alpha\,|\hat{E}(f)| & \text{if } |Y(f)| - \alpha\,|\hat{E}(f)| > \beta\,|Y(f)| \\ \beta\,|Y(f)| & \text{otherwise} \end{cases}    (4.4)

where Y(f), \hat{E}(f) and \hat{X}(f) are the same as in equation 4.3, and the variables α and β are the over-subtraction factor and maximum attenuation of the filter respectively. Spectral subtraction is known to suffer from processing distortions which occur when spectral magnitudes reach the spectral floor. This results in certain frequencies being turned on and off and causes the so-called musical noise. This may be attributed to local variations of the speech and noise magnitude spectra. In this section, such processing distortions are reduced by implementing the subtraction in the mel-filterbank domain, rather than the spectral magnitude domain, as shown in equation 4.5,

\hat{X}(k) = \begin{cases} Y(k) - \alpha\,\hat{E}(k) & \text{if } Y(k) - \alpha\,\hat{E}(k) > \beta\,Y(k) \\ \beta\,Y(k) & \text{otherwise} \end{cases}    (4.5)

Y(k) = \sum_{f=M(k)}^{M(k+1)} \omega(f)\,|Y(f)|, \qquad \hat{E}(k) = \sum_{f=M(k)}^{M(k+1)} \omega(f)\,|\hat{E}(f)|

where \hat{X}(k) is the clean mel-filterbank estimate, given by subtracting the mel-filterbank estimate of the noise, \hat{E}(k), from the noisy filterbank, Y(k); ω(f) is the triangular filterbank weight; M(k) and M(k+1) are the start and end frequency points of the k-th triangular filter (1 ≤ k ≤ K); and K is the number of filterbank channels. The averaging of the magnitude spectrum performed by the triangular windows of the mel-filterbank means that the channel estimates are less likely to reach floor values which would introduce distortion, as shown in figure 4.4. This figure illustrates spectral subtraction in the filterbank domain (left column) and magnitude spectrum domain (right column) respectively. Figure 4.4-a presents the mel-filterbank and magnitude spectrum of a clean speech signal. Figure 4.4-c shows the same signal but contaminated by a wideband noise, shown in figure 4.4-b, at an SNR of 5dB. Figure 4.4-d demonstrates the result of subtracting the noise from the noisy signal of figure 4.4-c to give a clean speech estimate. The red dots in figure 4.4-d indicate the negative values which may distort the spectral envelope estimation. This figure shows that the noise spectrum in the mel-filterbank domain, shown in the left column of figure 4.4-b, is smoother than that in the magnitude spectrum, shown in the right column of figure 4.4-b.

Figure 4.4: Spectral subtraction in the mel-filterbank and magnitude domains for a) clean signal; b) noise; c) noisy signal; d) clean estimate

This variation in the noise magnitude spectrum can lead to poor estimates of the clean speech spectrum. However, the more stable representation in each mel-filterbank channel (due to the spectral averaging) is less likely to introduce distortion. This can be seen from the structure of the mel-filterbank spectral estimate shown in the left column of figure 4.4-d, which is much closer to the structure of the original clean mel-filterbank shown in the left column of figure 4.4-a than the corresponding magnitude spectrum estimate is to the original. As is well known, spectral subtraction is effective against broadband noise but less so against tones. When spectral subtraction is implemented in the mel-filterbank domain, the energy of a tone is averaged within its frequency channel, so performing the subtraction does not lose tonal components. The advantage of the spectral subtraction implemented in this section therefore comes from the lower variability of the mel-filterbank channels compared with that of the individual frequency bins.
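A minimal sketch of this mel-filterbank domain subtraction (equation 4.5) is given below in Python with NumPy; the noise estimate, the over-subtraction factor α and the attenuation floor β are left as inputs, and the example values in the usage comments are assumptions rather than settings taken from this thesis.

```python
import numpy as np

def mel_domain_spectral_subtraction(Y_mel, E_mel, alpha=1.0, beta=0.1):
    """Spectral subtraction in the mel-filterbank domain (equation 4.5).

    Y_mel : noisy mel-filterbank vectors, shape (frames, K)
    E_mel : estimate of the noise mel-filterbank vector, shape (K,)
    alpha : over-subtraction factor
    beta  : maximum attenuation (noise floor) factor
    """
    subtracted = Y_mel - alpha * E_mel      # Y(k) - alpha * E_hat(k)
    floor = beta * Y_mel                    # beta * Y(k)
    return np.where(subtracted > floor, subtracted, floor)

# Usage sketch: estimate the noise from leading frames assumed to contain no speech.
# Y_mel = ...                                   # (frames, 23) noisy mel-filterbank values
# E_mel = Y_mel[:10].mean(axis=0)               # time-averaged noise estimate
# X_mel_hat = mel_domain_spectral_subtraction(Y_mel, E_mel)
```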

4.3 Robust Fundamental Frequency Estimation

To accurately reconstruct the speech signal it is also vital to have a reliable estimate of the fundamental frequency (for voiced sounds) and of the voicing classification. The previous discussions in section 3.3 employed a comb function to determine the fundamental frequency estimate from the magnitude spectrum of a speech signal. This delivers good fundamental frequency estimates for clean speech but is less accurate when estimating from a noise contaminated speech signal, as shown in figure 4.2. To improve fundamental frequency estimation a robust estimation algorithm is developed based on an auditory model, following the work of Brown and Cooke [50], Rouat [65] and Wu [93]. The auditory model [27] is used to decompose a noisy signal into a number of frequency channels. Assuming that the noise contaminates only some of the frequency bands or channels, fundamental frequency information can then be measured from those channels which are less damaged by noise. A simplified block diagram of the robust fundamental frequency estimation is shown in figure 4.5.

Figure 4.5: Robust fundamental frequency estimation from noisy speech using an auditory model

This figure shows the procedure for extracting reliable fundamental frequency estimates from a noisy speech signal. The detail of each block is explained in the following subsections.

4.3.1 Auditory filterbank

As introduced in section 2.3, the basilar membrane, which is located within the cochlea, decomposes a speech signal into a number of frequency bands due to its non-uniform stiffness. A computational auditory model splits a speech signal into a number of frequency channels by implementing a set of filters which have a similar frequency response to that of the basilar membrane and are called the auditory filters or the auditory filterbank [50]. These auditory filters are implemented using 4th order gammatone filters [27] which have an impulse response given by

g_t(n) = n^{p-1} \exp(-2\pi b n) \cos(2\pi f_c n + \varphi)    (4.6)

where p is the filter order (here p = 4), φ is the phase, f_c is the centre frequency and b is a function of the centre frequency f_c which controls the bandwidth of the filter through the exponential term. The definition of b is given as,

b = 1.018\,\mathrm{ERB}(f_c) = 1.018\,(24.7 + 0.108 f_c)    (4.7)

where ERB(f_c), the equivalent rectangular bandwidth, is defined by Glasberg and Moore [36] to reproduce the human auditory perceptual ability of high resolution (narrow bandwidth) in low frequency bands and low resolution (wide bandwidth) in high frequency bands. The impulse response and frequency response of a gammatone filter with a centre frequency of 1000Hz are illustrated in figure 4.6. Figure 4.6-a shows that the impulse response of the gammatone filter lasts 32ms and figure 4.6-b presents its frequency response.

Figure 4.6: (a) Impulse response and (b) frequency response of a gammatone filter

In total, 128 auditory filters are implemented in this work to split the speech signal into 128 frequency channels with varying centre frequencies and bandwidths. The spacing of these bandpass filters is also derived from the ERB scale [46] [75], where the minimum and maximum stop frequencies of the bandpass filters are 64Hz and 4000Hz respectively in accordance with the Aurora standard [73], giving

f_c(k) = 4228.8\,\exp(-2.67\,k/K) - 228.8    (4.8)

where f_c(k) is the centre frequency of the k-th frequency channel and K is the total number of channels. The output of the auditory filterbank can be divided into two parts for further processing, according to the centre frequency of the channels. The channels which have centre frequencies above 800Hz are classified as middle/high frequency channels and the remainder are low frequency channels [50]. The outputs from the middle/high frequency channels can be considered as amplitude and phase modulated cosine waves in which the rate of change of the amplitude is much lower than the centre (carrier) frequency f_c [45], and the Teager energy operator is therefore applicable to these channels for demodulation.
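The sketch below (Python with NumPy, not from the thesis) constructs a bank of gammatone impulse responses along the lines of equations 4.6 to 4.8; the sampling rate, impulse response length and direct-convolution filtering are illustrative assumptions, and the constants in the centre frequency formula follow the reconstruction of equation 4.8 above.

```python
import numpy as np

def gammatone_filterbank(K=128, fs=8000, duration=0.032, p=4, phi=0.0):
    """Return a (K, L) array of 4th-order gammatone impulse responses and their centre frequencies."""
    t = np.arange(int(duration * fs)) / fs                  # time axis in seconds
    k = np.arange(1, K + 1)
    fc = 4228.8 * np.exp(-2.67 * k / K) - 228.8             # ERB-spaced centre frequencies (equation 4.8)
    filters = []
    for f in fc:
        b = 1.018 * (24.7 + 0.108 * f)                      # bandwidth term of equation 4.7
        g = t**(p - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t + phi)
        filters.append(g / np.max(np.abs(g)))               # crude amplitude normalisation
    return np.array(filters), fc

def decompose(signal, filters):
    """Split a speech signal into one time-domain waveform per auditory channel."""
    return np.array([np.convolve(signal, g, mode='same') for g in filters])
```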

4.3.2 Teager energy operator

The Teager energy operator (TEO) [37] [45] [47] extracts the energy of a signal by tracking both the envelope of an amplitude-modulated (AM) signal and the instantaneous frequency of a frequency-modulated (FM) signal [65]. In this work, since the FM term is relatively constant in the middle and high frequency channels, only the AM term is considered. This can be implemented using the nonlinear difference equation

\Gamma_k(n) = x_k(n)^2 - x_k(n+1)\,x_k(n-1)    (4.9)

where Γ_k(n) is the output envelope and x_k(n-1), x_k(n) and x_k(n+1) are successive signal samples in a middle/high frequency channel. Figure 4.7 demonstrates an example of demodulation of the output of a high frequency channel. Figure 4.7-a shows one frame of a clean speech signal and figure 4.7-b presents one of the outputs of the high frequency channels extracted from figure 4.7-a using the auditory model. The centre frequency of this channel is Hz with a bandwidth of Hz. The magnitude spectrum of this channel is shown in figure 4.7-c. The normalised output of the Teager energy operator is plotted as a red line together with the auditory output of this channel in figure 4.7-d. Clearly, the energy envelope demodulated by the TEO is related to the fundamental period and it can therefore be used for fundamental frequency extraction, as illustrated in the next section.
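A direct implementation of the operator in equation 4.9 is straightforward; the sketch below (Python with NumPy) applies it to every middle/high frequency channel of the filterbank output. The unit-maximum normalisation and the 800Hz split follow the description above, while everything else is an illustrative assumption.

```python
import numpy as np

def teager_energy(x):
    """Teager energy operator of equation 4.9 for one channel x[n]."""
    gamma = np.empty_like(x)
    gamma[1:-1] = x[1:-1]**2 - x[2:] * x[:-2]   # x(n)^2 - x(n+1) * x(n-1)
    gamma[0], gamma[-1] = gamma[1], gamma[-2]   # copy neighbours at the end points
    return gamma

def demodulate_channels(channels, fc, split_hz=800.0):
    """Return TEO envelopes for middle/high frequency channels (centre frequency
    above split_hz), normalised to unit maximum; low channels pass through unchanged."""
    out = []
    for x, f in zip(channels, fc):
        if f > split_hz:
            env = teager_energy(x)
            out.append(env / (np.max(np.abs(env)) + 1e-12))
        else:
            out.append(x)
    return np.array(out)
```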

Figure 4.7: An energy envelope extracted using the Teager energy operator

4.3.3 Autocorrelation of the output of the auditory model

Autocorrelation is a useful mathematical method for highlighting the periodicity in a signal. In this section, it is applied to the outputs of the low frequency channels and to the energy envelopes demodulated from the middle/high frequency channels as discussed in the previous subsection. Practically, a set of normalised autocorrelation contours, R_k^{16}(\tau), is computed from these outputs at varying time lags, τ, for a framed signal lasting 16ms in the k-th channel,

R_k^{16}(\tau) = \frac{1}{\sqrt{E_{x_k} E_{y_k}}} \sum_{n=-N/2}^{N/2-1} x_k(n)\,x_k(n+\tau), \qquad \tau = 13, \ldots, 2N    (4.10)

where

E_{x_k} = \sum_{n=-N/2}^{N/2-1} x_k^2(n), \qquad E_{y_k} = \sum_{n=-N/2}^{N/2-1} x_k^2(n+\tau),

N is the length of the frame (128 samples in this case),

and x_k(\cdot) is the output of the k-th channel, taken either from the auditory model or from the Teager energy operator. The set of normalised autocorrelation coefficients, R_k^{16}(\tau), called the autocorrelogram [39] [50], is plotted for every channel of a clean voiced frame in figure 4.8.

Figure 4.8: Autocorrelation of all channels from a clean speech frame

Channels 1 to 64 represent the autocorrelation of the energy envelopes demodulated from the middle/high frequency channels, from 4000Hz down to 800Hz, using the Teager energy operator. Channels 64 to 128 show the autocorrelation of the auditory outputs of the low frequency channels, from 800Hz down to 64Hz. The fundamental period can be defined as the time duration between two successive resonant peaks of all outputs, as shown in figure 4.8, which is around 5ms (200Hz) in the example illustrated. The remaining stages in the auditory model based fundamental frequency estimation now focus on estimating the fundamental frequency from the autocorrelogram.
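A sketch of the normalised autocorrelation of equation 4.10 is given below (Python with NumPy). The frame length of 128 samples and the lag range of 13 to 2N follow the values quoted above; centring the analysis frame on a chosen sample, and requiring the channel signal to extend far enough past it, are implementation assumptions.

```python
import numpy as np

def normalised_autocorrelation(x_k, centre, N=128, min_lag=13, max_lag=256):
    """Normalised autocorrelation R_k(tau) of equation 4.10 for one channel.

    x_k    : channel output (auditory output or TEO envelope); it is assumed to extend
             at least N/2 + max_lag + N samples beyond `centre`
    centre : sample index on which the 16 ms analysis frame is centred
    Returns an array whose element 0 corresponds to tau = min_lag.
    """
    start = centre - N // 2
    x = x_k[start:start + N]
    R = np.empty(max_lag - min_lag + 1)
    for i, tau in enumerate(range(min_lag, max_lag + 1)):
        y = x_k[start + tau:start + tau + N]
        R[i] = np.sum(x * y) / (np.sqrt(np.sum(x * x) * np.sum(y * y)) + 1e-12)
    return R
```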

4.3.4 Channel selection

As assumed earlier in this chapter, noise corrupts only some of the frequency channels and therefore the fundamental frequency can be estimated from the remaining clean channels. Two methods are employed to identify clean and noisy channels, one for the low frequency channels and one for the middle/high frequency channels. For the low frequency channels, if a channel is clean its normalised autocorrelation appears as a periodic wave with all peaks at the same level [93]. As noise is added to the channel, the maximum peak at non-zero lag in the autocorrelation becomes smaller than the value at zero lag (here R_k^{16}(0) = 1). When this non-zero maximum value is lower than a pre-defined threshold (taken from [93]), the channel is defined as a noise damaged channel, as shown in the left column of figure 4.9. For the middle/high frequency channels, a second autocorrelation is calculated within a larger window (32ms) [65], R_k^{32}(\tau). If the channel is clean, the two autocorrelation contours have a similar shape, in particular similar peak positions. A simple method is given by Rouat [65] to measure the difference between the positions of the maximum values of the two autocorrelation measures, R_k^{16}(\tau) and R_k^{32}(\tau). First, the time lags at which R_k^{16}(\tau) and R_k^{32}(\tau) attain their maximum values are defined respectively as,

T_k^{16} = \arg\max_{\tau}\,[R_k^{16}(\tau)]    (4.11)

T_k^{32} = \arg\max_{\tau}\,[R_k^{32}(\tau)]    (4.12)

If the difference between the two time lags, |T_k^{16} - T_k^{32}|, is under the threshold (a lag of 2, from [93]), the channel is clean. Alternatively, if it is greater than the threshold, the channel is considered noisy, as shown in the right column of figure 4.9.

Figure 4.9: The autocorrelation of a) a clean low frequency channel; b) a clean high frequency channel; c) a noisy low frequency channel and d) a noisy high frequency channel

Figure 4.9-a shows the autocorrelation of an auditory output channel with a centre frequency of Hz and a bandwidth of Hz from a clean speech signal. The blue cross indicates the non-zero maximum normalised autocorrelation value, 0.987, which is above the threshold, so this channel is kept for further calculation. Figure 4.9-c presents the autocorrelation of the same auditory output as figure 4.9-a but with the addition of wideband noise at an SNR of 5dB. The blue cross marks the non-zero maximum normalised autocorrelation value, 0.920, which is below the threshold; this channel is therefore defined as a noisy channel and removed from further processing. Figure 4.9-b demonstrates the autocorrelation of the Teager energy operator output for an auditory channel whose centre frequency is Hz with a bandwidth of 411.2Hz. The blue line indicates the autocorrelation of a 16ms frame while the red line shows the autocorrelation of a 32ms frame.

The blue cross and the red cross indicate the non-zero maximum values for the 16ms and 32ms frames respectively. In this case both maximum points are at a time lag of 39, so this channel is retained for fundamental frequency estimation. Figure 4.9-d illustrates the autocorrelation of the same signal as in figure 4.9-b but with additional wideband noise at an SNR of 5dB. The autocorrelation lag of the blue cross is 56 while the autocorrelation lag of the red cross is 169. Clearly |T_k^{16} - T_k^{32}| = 113 ≫ 2, which indicates that the channel is noisy and it is removed from fundamental frequency estimation. Based on these methods, all clean frequency channels are selected from both the low frequency and the middle/high frequency channels for fundamental frequency estimation.

4.3.5 Pseudo-periodic histogram

Autocorrelation values from channels identified as noisy are discarded, while those from clean channels are summed together at the pseudo-periodic histogram (PPH) stage [65], given by

\mathrm{PPH}(\tau) = \frac{1}{K} \sum_{k \in \mathrm{selected}} R_k^{16}(\tau)    (4.13)

This produces a waveform which varies at the fundamental period, as shown in figure 4.10-a. This figure shows the result of equation 4.13 and that the time period between two successive main peaks is the fundamental period. However, some minor peaks in this waveform introduce errors in estimating the fundamental period and therefore a low pass filter is employed to remove these minor peaks, as shown in figure 4.10-b.
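The sketch below (Python with NumPy) combines the channel selection rules of equations 4.11 and 4.12 with the histogram of equation 4.13. The low frequency autocorrelation threshold is left as a parameter because its value is not reproduced in the text above; the 800Hz split and the lag threshold of 2 follow the description.

```python
import numpy as np

def select_channels(R16, R32, fc, split_hz=800.0, low_threshold=0.95, lag_threshold=2):
    """Return indices of channels judged to be clean.

    R16, R32      : arrays of shape (K, n_lags), R_k^16(tau) and R_k^32(tau) on a shared lag axis
    fc            : centre frequency of each channel
    low_threshold : assumed placeholder for the low frequency threshold taken from [93]
    """
    selected = []
    for k in range(len(fc)):
        if fc[k] <= split_hz:
            # Low frequency rule: the non-zero-lag maximum must stay close to R(0) = 1.
            if np.max(R16[k]) >= low_threshold:
                selected.append(k)
        else:
            # Middle/high rule: the 16 ms and 32 ms peaks must occur at similar lags
            # (equations 4.11 and 4.12).
            if abs(int(np.argmax(R16[k])) - int(np.argmax(R32[k]))) <= lag_threshold:
                selected.append(k)
    return selected

def pseudo_periodic_histogram(R16, selected, K):
    """Pseudo-periodic histogram of equation 4.13."""
    return np.sum(R16[selected], axis=0) / K
```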

Figure 4.10: Pseudo-periodic histogram (a) before the low pass filter (upper) and (b) after the low pass filter (lower)

A set of comb functions, or fundamental period candidates, is then applied to the filtered PPH of figure 4.10-b to identify the fundamental period. The period of the comb function which captures the most peaks in the PPH is defined as the fundamental period. These procedures are very similar to those used in section 3.3.2 to identify the fundamental frequency from the magnitude spectrum. The fundamental frequency contour smoothing algorithm described in section 3.3 is also applied to the fundamental frequency estimates to correct large variations in the estimates or very short durations of voiced/unvoiced frames.

4.3.6 Voiced/unvoiced classification

The descriptions in the sections above relate to fundamental frequency estimation under the assumption that the frame is voiced. It is therefore first necessary to classify speech frames as voiced or unvoiced from the PPH. In Rouat's work [65], where the total number of auditory frequency channels is 20, the PPH is the summation of all low frequency channels and the selected middle/high frequency channels.

A measurement, ρ, is then introduced over the first half of PPH(τ) (13 < τ < N) which measures the degree of unvoicing by gauging the dissimilarity between PPH(τ) and PPH(τ + t_max), as shown in equation 4.14,

\rho = \frac{\max(\mathrm{diffPPH}(\tau)) - \min(\mathrm{diffPPH}(\tau))}{\max(\mathrm{PPH}(\tau)) - \min(\mathrm{PPH}(\tau))} \qquad \text{for } \tau = 13, \ldots, N    (4.14)

where

\mathrm{diffPPH}(\tau) = \mathrm{PPH}(\tau) - \mathrm{PPH}(\tau + t_{\max}), \qquad t_{\max} = \arg\max_{\tau = 13, \ldots, N} \mathrm{PPH}(\tau)

If the PPH is a purely periodic curve varying at the fundamental period t_max, diffPPH will be a straight line of zeros. If the PPH is not purely periodic, diffPPH will be a curve of lower amplitude than the PPH. More noise, and hence a larger difference between the minimum and maximum values of diffPPH, results in a larger value of ρ. When ρ is greater than a pre-defined value (0.6 from [65]) the frame is defined as unvoiced or noise/silence. In this work, since the PPH is the summation of only those channels selected as clean, the number of selected frequency channels can itself be regarded as a measure for voicing classification. Experimental results indicate that 35% of the total number of channels can be used as a threshold to distinguish between voiced and unvoiced frames. In this case, if the number of selected clean frequency channels is greater than 45, the frame is classified as voiced; otherwise it is classified as an unvoiced or non-speech frame.
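The sketch below (Python with NumPy) implements the unvoicing measure of equation 4.14 together with the channel-count test used in this work; the 0.6 and 35% thresholds are the values quoted above, and the indexing of the PPH lag axis is an assumption.

```python
import numpy as np

def unvoicing_measure(pph):
    """Rouat's unvoicing measure rho of equation 4.14.

    pph : PPH(tau) sampled over lags tau = 13 ... 2N (element 0 corresponds to tau = 13).
    """
    half = len(pph) // 2
    first_half = pph[:half]                       # roughly tau = 13 ... N
    t_max = int(np.argmax(first_half))            # offset of the main PPH peak
    diff = first_half - pph[t_max:t_max + half]   # PPH(tau) - PPH(tau + t_max)
    return (diff.max() - diff.min()) / (first_half.max() - first_half.min() + 1e-12)

def is_voiced_rouat(rho, threshold=0.6):
    """Rouat's rule [65]: frames with rho above 0.6 are unvoiced or noise/silence."""
    return rho <= threshold

def is_voiced_count(selected_channels, K=128):
    """Rule used in this work: voiced if more than 35% of the K channels
    (more than 45 of 128) were selected as clean."""
    return len(selected_channels) > 0.35 * K
```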

4.4 Experimental Results

To analyse the effectiveness of the clean speech reconstruction scheme a set of speech utterances based on the Messiah sentences has been used. These are sampled at 8kHz and are accompanied by accurate fundamental frequency contours taken from a laryngograph at the recording sessions. To simulate noisy speech, a number of wideband noises have been extracted from the ETSI Aurora database [73] and added to the Messiah sentences artificially at varying signal to noise ratios (SNRs). Two experiments are presented in this section. First, the fundamental frequency and voicing classification estimated from the auditory model are evaluated and, secondly, the quality of speech reconstructed from a noisy speech signal is demonstrated.

4.4.1 Evaluation of auditory model based fundamental frequency estimation

The aim of this subsection is to compare the results from the 128-channel auditory model based fundamental frequency estimator with those from the comb function based fundamental frequency estimator (detailed in section 3.3.2) and from the SIFT fundamental frequency estimator (detailed in section 3.3.1). To implement this experiment a number of wideband noises have been extracted from the Aurora database and added to the Messiah sentences at varying SNRs from 0dB to 30dB. Following the discussion in section 3.3.4, two measures, the percentage voicing classification error, E_c, and the percentage fundamental frequency error, E_p, are used for evaluation.

Classification errors include three components: the number of voiced frames classified as unvoiced, the number of unvoiced frames classified as voiced, and the number of voiced frames which have the correct voicing classification but whose percentage fundamental frequency error exceeds 20%, as defined in equation 3.14. The percentage fundamental frequency error is measured as described in section 3.3.4. The SIFT and comb function based fundamental frequency estimation methods are included for comparison. Figures 4.11-a and 4.11-b show the comparative results for the percentage voicing classification error and the percentage fundamental frequency error respectively.

Figure 4.11: (a) Percentage classification error and (b) percentage fundamental frequency error for different fundamental frequency estimators

Figure 4.11-b illustrates that the fundamental frequency estimates from the SIFT method are the most accurate for voiced frames down to an SNR of 10dB but deteriorate at SNRs below this. The comb function method gives the best voicing classification at SNRs above 20dB but the most inaccurate fundamental frequency estimates of the three methods. However, the voicing classification error from the SIFT estimator is the worst of the three methods, as shown in figure 4.11-a. The 128-channel auditory model based method gives close to the best performance for clean speech and is significantly more accurate for noisy speech.

In fact, the percentage fundamental frequency error from the auditory model based method increases by only 0.3% from clean speech to an SNR of 0dB, which indicates that the auditory model based estimator is the least sensitive to noise and provides robust fundamental frequency estimates. The voicing classification errors in figure 4.11-a are relatively large while the fundamental frequency accuracy errors in figure 4.11-b are small. This can be attributed to the fact that fundamental frequency errors greater than 20% are counted as classification errors (see equation 3.14).

4.4.2 Experiments for speech reconstruction

To illustrate the quality of the reconstructed speech, some of the utterances from the previous experiment are used for speech reconstruction. Figure 4.12-a shows the spectrogram of the sentence "Look out of the window and see if it's raining." spoken by a female speaker. Figure 4.12-b shows the same sentence but contaminated by wideband noise at an SNR of 10dB. From this noisy speech a fundamental frequency contour is extracted using the robust fundamental frequency estimation technique described in section 4.3, and a set of clean spectral envelopes is restored from the stream of noisy MFCC vectors as described in section 4.2. It is from these two sets of parameters that clean speech reconstruction takes place. The overall quality of the reconstructed speech depends on the accuracy of the fundamental frequency estimation and the effectiveness of the spectral subtraction in removing the noise. Figure 4.13-a shows the spectrogram of the speech reconstructed using the robust fundamental frequency estimates (detailed in section 4.3) and the noise contaminated MFCC vectors.

Figure 4.12: Spectrogram of a) clean speech and b) noisy speech (10dB SNR) for the utterance "Look out of the window and see if it's raining"

No spectral subtraction has been employed at this stage. This figure demonstrates that the robust fundamental frequency estimates have enabled the resulting fundamental frequency harmonics to be correctly positioned in comparison with the original signal in figure 4.12-a. Figure 4.13-b shows the same signal but reconstructed from mel-filterbank vectors which have had an estimate of the noise subtracted from them. The inversion of the MFCC vectors to a spectral envelope has produced a good reproduction of the original spectral envelope of the speech. The spectrogram in figure 4.13-b clearly shows that spectral subtraction has removed the wideband noise present in figure 4.12-b. As a comparison, figure 4.14 shows the spectrogram of the speech reconstructed using the reference fundamental frequency (taken from the laryngograph signal) and spectrally subtracted estimates of the clean speech mel-filterbank vectors.

Figure 4.13: Spectrogram of reconstructed speech using a) robust fundamental frequency estimates; b) robust fundamental frequency estimates and spectral subtraction

Figure 4.14: Reconstructed speech using the reference fundamental frequency from the laryngograph and spectral subtraction

Again the spectrogram shows a relatively clean speech signal. It is interesting to observe how similar the fundamental frequency harmonics derived from the robust fundamental frequency estimates are to those from the reference fundamental frequencies of the laryngograph, which demonstrates the effectiveness of the robust fundamental frequency estimates achieved by the auditory model.

These spectrograms demonstrate that both robust fundamental frequency estimation and noise removal from the spectral envelope are necessary for clean speech reconstruction. Errors in fundamental frequency estimation manifest themselves as artificial sounding bursts in the reconstructed speech signal. Incorrect estimates of the spectral envelope are perceived as part of the contaminating noise remaining in the reconstructed speech. A downloadable result is available at a169838/.

4.5 Summary

This chapter has demonstrated that it is possible to reconstruct a clean speech signal from a series of noisy MFCC vectors. To achieve this both a robust fundamental frequency estimate and an estimate of the clean speech spectral envelope are necessary. The additive nature of background noise in the frequency domain allows the technique of spectral subtraction to produce an estimate of the clean speech spectral envelope. Results have shown that performing spectral subtraction in the mel-filterbank domain incurs less musical noise than in the magnitude spectral domain and therefore enables a sufficiently good estimate of the clean spectral envelope to be derived for clean speech reconstruction when the background noise is relatively stable. A robust fundamental frequency estimation method, based on an auditory model which simulates human perception, together with postprocessing, has been shown to give robust estimates of the fundamental frequency which are close to those obtained from a laryngograph across a range of SNRs. It can be concluded that a relatively noise-free speech signal can be reconstructed from noisy MFCC vectors along with robust fundamental frequency estimates.

Chapter 5

An Integrated Front-End for Speech Recognition and Reconstruction

Contents
5.1 Introduction
5.2 Integrated front-end
5.3 Experimental results
5.4 Summary

Preface

This chapter proposes an integrated front-end for both speech recognition and speech reconstruction applications. Speech is first decomposed into a number of frequency channels using an auditory model. The output of this auditory model is then used to extract both robust fundamental frequency estimates and MFCC vectors. Initial tests for fundamental frequency estimation used a 128 channel auditory model, but results show that this can be reduced significantly to between 23 and 32 channels. Speech recognition results show that the auditory model based cepstral coefficients give very similar performance to the conventional MFCCs. Spectrograms and listening tests also reveal that speech reconstructed from the auditory based cepstral coefficients and fundamental frequency has similar quality to that reconstructed from the conventional MFCCs and fundamental frequency.

5.1 Introduction

Chapter 3 proposed two schemes for reconstructing speech from a stream of MFCC vectors and fundamental frequency. These were based on either a sinusoidal model [79] [89] or a source-filter model [92] of speech production. An extension of this work [98] also considered the reconstruction of clean speech from noise contaminated MFCC vectors and robust fundamental frequency estimates, which was discussed in the previous chapter. In these systems, the MFCC vectors and fundamental frequency are extracted using separate speech processors. For example, in [79] a comb function is used for fundamental frequency extraction and in [98] a 128-channel auditory model [27] provided the robust fundamental frequency estimates. Some procedures in these fundamental frequency estimators are common to the MFCC extraction, such as framing, Hamming windowing and the FFT. The aim of this chapter is to integrate the initial stages of the MFCC extraction and fundamental frequency estimation into a single front-end processor which extracts features for both speech recognition and speech reconstruction applications. For both the fundamental frequency estimation and the MFCC extraction, the speech signal is decomposed into a number of discrete frequency channels by either an auditory model or a mel filterbank. It is therefore reasonable to combine these into a single system, which forms the basis of the integrated front-end described in section 5.2. A detailed evaluation of the fundamental frequency extraction component is given in subsection 5.3.1. Speech recognition and speech reconstruction results are presented in subsections 5.3.2 and 5.3.3 respectively. A summary is given in the last section.

5.2 Integrated Front-End

This section describes the proposed integrated speech front-end and associated back-end, which are illustrated in figure 5.1.

Figure 5.1: Integrated front-end and back-end systems

The front-end comprises three main parts: the auditory model, MFCC extraction and fundamental frequency estimation. Three features, MFCC vectors, fundamental frequency and energy, are output across the communication channel.

At the remote back-end the MFCC vectors and fundamental frequency estimates are employed for speech reconstruction. For speech recognition the MFCC vectors and energy are utilised together with their temporal derivatives. Decomposition of the input speech signal into frequency channels is performed by the auditory model. These discrete frequency channel signals are then employed by both the MFCC extraction and the fundamental frequency estimation. The original fundamental frequency estimation system proposed in [93] was based on a 128 channel auditory model. However, most MFCC extraction methods use significantly fewer channels (e.g. 23 for the Aurora standard [82] [100]). One of the aims of this section is to examine the effect of varying the size of the auditory filterbank to produce a compromise that gives both robust fundamental frequency estimates and MFCCs which result in accurate speech recognition. Each of the three components of the integrated front-end is now discussed in the remainder of this section.

5.2.1 Review of the auditory model

The auditory model with which the speech is decomposed into frequency bands was proposed in [27]. Auditory models were introduced in section 2.3 and have been successfully used for robust fundamental frequency estimation [65] [93] as shown in section 4.3, and they therefore form the first stage of this integrated front-end. Decomposition of the speech signal into a number of frequency bands is achieved using a series of non-linearly spaced and overlapping bandpass filters (detailed in section 4.3.1). In the original auditory model a set of 128 channels was used. This gives sufficient frequency detail for the subsequent fundamental frequency estimation to select clean channels for reliable fundamental frequency extraction, as shown in section 4.3.

However, for the MFCC extraction the ETSI Aurora standard [100] defines just 23 channels. Work in later sections examines the effect of reducing the number of channels in the auditory model from 128 to 23 in terms of the resulting speech recognition performance and fundamental frequency estimation accuracy.

5.2.2 Feature extraction from the auditory model for speech recognition

The output of the auditory model takes the form of a series of time domain samples from each of the K bandpass filters. In conventional MFCC feature extraction a windowing function captures a short time frame of speech, typically 25ms [100]. From this a Fourier transform determines the magnitude spectrum, which is then quantised in frequency using a mel-spaced filterbank as shown in section 2.2. The linear mel filterbank is then transformed to a log mel filterbank, followed by a DCT to obtain cepstral features. Truncation is used to reduce the dimensionality of the feature space from 23 to 13 dimensions. These MFCC vectors have proved to be one of the best speech features for speech recognition [13] [73]. This is mainly attributed to the mel filterbank, which simulates human perceptual ability by giving more sensitivity to low frequencies than to high frequencies. This is illustrated in figure 5.2, which compares the frequency response of the triangular mel filterbank with that of a 23-channel auditory model based filterbank.

Figure 5.2: Frequency response of a) triangular mel scale filterbank and b) 23-channel auditory model filterbank

The figure shows that the frequency response of the 23-channel auditory model filterbank is very similar to that of the 23-channel triangular mel filterbank in terms of the positions of the filters in frequency, their increasing bandwidths and their spectral shape. To generate a filterbank vector from the outputs of the auditory model a mean amplitude (MA) filter is employed. This outputs the root mean square amplitude, c_k, of each bandpass filter, k, at 10ms intervals from a 25ms buffer of time-domain samples, as shown in equation 5.1,

c_k = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x_k(n)^2}    (5.1)

where x_k(n) is the n-th time-domain sample from the k-th channel or bandpass filter in the 25ms buffer and N is the buffer length (N = 200 samples at the 8kHz sampling frequency). This is consistent with the frame width and rate used in the Aurora standard. The final three stages of the MFCC extraction are the logarithm, the discrete cosine transform and truncation. These are identical to the last three stages of the conventional MFCC extraction, as shown in section 2.2.
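A sketch of this AM-MFCC extraction is given below (Python with NumPy and SciPy), reusing an auditory filterbank decomposition such as the one sketched in section 4.3.1. The frame length, hop size and truncation to 13 coefficients follow the values above, while the DCT convention and the small log floor are implementation assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def am_mfcc(channels, frame_len=200, hop=80, n_cepstra=13):
    """AM-MFCCs: mean amplitude filter (equation 5.1), log, DCT and truncation.

    channels  : (K, n_samples) array of auditory filterbank outputs
    frame_len : 25 ms buffer at 8 kHz; hop : 10 ms frame rate
    """
    K, n_samples = channels.shape
    features = []
    for start in range(0, n_samples - frame_len + 1, hop):
        frame = channels[:, start:start + frame_len]
        c = np.sqrt(np.mean(frame**2, axis=1))                          # RMS amplitude per channel
        log_fb = np.log(c + 1e-12)                                      # log filterbank
        features.append(dct(log_fb, type=2, norm='ortho')[:n_cepstra])  # DCT + truncation
    return np.array(features)
```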

It should be noted that the positioning of the auditory filters is close to, but not exactly, mel-scaled. Therefore the features extracted by this system are not strictly MFCCs. However, for the purpose of this work they are referred to as auditory model based MFCCs (AM-MFCCs).

5.2.3 Robust fundamental frequency estimation

Auditory models have been demonstrated to be one of the most reliable methods for accurately estimating the fundamental frequency from a noisy speech signal, as presented in detail in section 4.3. In the next section the effect on fundamental frequency estimation of varying the number of channels in the auditory model filterbank is examined.

5.3 Experimental Results

This section first examines the effect of reducing the number of channels in the auditory model in terms of fundamental frequency estimation accuracy. In particular the number of channels is reduced from 128 to 23 to be comparable with the number of channels used in conventional MFCC extraction. Secondly, the recognition accuracy of the auditory model based MFCC vectors (AM-MFCCs) is measured and, thirdly, the resulting speech quality after reconstruction using the integrated front-end is demonstrated.

5.3.1 Fundamental frequency evaluation

The aim of this section is to examine the effect of reducing the number of channels in the auditory model in terms of fundamental frequency estimation accuracy. The robust fundamental frequency estimation method in this work comes from Rouat's [65] and Wu's [93] work.

Rouat employed twenty channels in the auditory model and Wu extended Rouat's work, using 128 channels to track the fundamental frequency in noisy speech [96]. A 128-channel auditory model is first implemented and then the number of channels is reduced from 128 to 23 to be comparable with the number of filterbank channels used in conventional MFCC extraction (detailed in section 2.2). The test data set used in these experiments is composed of 246 utterances from a set of Messiah sentences, which gives a total of 130,000 frames. To observe the effect of noise on fundamental frequency estimation, examples of wideband noise from the Aurora database have been artificially added to the speech at a range of SNRs from 30dB to 0dB. The reference fundamental frequency measurements come from a laryngograph signal which has been manually corrected where necessary. Two performance measures, the percentage voicing classification error, E_c, and the percentage fundamental frequency error, E_p, described in section 3.3.4, are evaluated. Figure 5.3-a shows the voicing classification error, E_c, for the 128, 64, 32 and 23 channel auditory models across a range of noise levels. Similarly, figure 5.3-b illustrates the percentage fundamental frequency error, E_p, for the different numbers of channels and noise levels. The results indicate that, as expected, errors for both voicing classification and fundamental frequency measurement increase as the SNR decreases. Voicing classification measurements from the 128, 64 and 32 channel auditory models give almost identical performance. Reducing the number of channels to 23 causes a slight reduction in voiced/unvoiced classification accuracy at low SNRs.

Figure 5.3: (a) Percentage classification error and (b) percentage fundamental frequency error for different auditory model configurations

Fundamental frequency measurements from the 128, 64, 32 and 23 channel auditory models indicate that the percentage fundamental frequency error is very similar for all auditory model configurations. It should be noted that errors greater than 20% are reported as classification errors.

5.3.2 Recognition performance

This section compares the speech recognition accuracy of the auditory model based MFCCs (AM-MFCCs) for varying numbers of channels with that obtained by the conventional MFCCs. Recognition accuracy has been evaluated on the Aurora TI digits database, which comprises 28,000 digit strings for testing and 8,440 for training. The digits are modeled using 16-state, 3-mode, diagonal covariance matrix HMMs, trained from the 8,440 digit strings.

The training data set covers a range of noises from clean to an SNR of 0dB (as outlined in the Aurora test specification [73]). Three feature vector configurations have been tested: the conventional MFCC vectors [100], MFCCs extracted from a 23-channel auditory model and MFCCs extracted from a 32-channel auditory model. In each case the final speech vector comprised static MFCCs 1 to 12 and log energy together with velocity and acceleration derivatives. Figure 5.4 shows the recognition accuracy for the three configurations for both clean and noisy speech.

Figure 5.4: Comparison of speech recognition accuracy

For clean speech, the recognition rate from the AM-MFCCs is slightly higher than that of the conventional MFCCs: 98.72% compared with 98.57%. At lower SNRs the performance of the AM-MFCCs falls slightly below that of the conventional MFCCs. For example, at an SNR of 0dB the MFCCs derived from the 23-channel auditory model attain 59.03% while the conventional MFCCs attain 60.69%. Changing from a 23-channel auditory filterbank to a 32-channel filterbank had a negligible effect on accuracy. This result indicates that the proposed features based on the 23-channel auditory filterbank give comparable recognition performance to the conventional MFCCs.

Therefore it is possible to employ a single speech signal processor in the front-end to provide both speech recognition features and fundamental frequency and voicing estimates.

5.3.3 Speech reconstruction quality

To examine the quality of the reconstructed speech a set of Messiah sentences has been used. These are sampled at 8kHz and have been contaminated by wideband noise from the Aurora database. Speech is reconstructed using a sinusoidal model of speech, with the MFCC vectors being inverted to the filterbank domain and then interpolated to provide an estimate of the speech spectral envelope (detailed in section 3.2). The fundamental frequency estimate is used to provide the finer harmonic detail. Spectral subtraction has also been applied to provide clean speech spectral estimates from the noise contaminated MFCCs [98] (detailed in sections 4.2 and 4.3). Figure 5.5-a shows the spectrogram of the sentence "Look out of the window and see if it's raining" spoken by a female speaker and contaminated by wideband noise at an SNR of 10dB. Figure 5.5-b illustrates the spectrogram of speech reconstructed from conventional MFCC vectors. Figures 5.5-c and 5.5-d show spectrograms of speech reconstructed from 23 and 32 channel auditory based MFCCs respectively. Comparing figure 5.5-a with figures 5.5-b, 5.5-c and 5.5-d indicates that most of the noise has been removed from the noisy speech signal. The spectrograms also show that the formants, fundamental frequency and its harmonics are very similar to those of the original speech. Comparing figures 5.5-c and 5.5-d with 5.5-b shows that there is little difference between speech reconstructed from conventional MFCC vectors and that from AM-MFCC vectors. Comparing figure 5.5-c with 5.5-d shows that reducing the number of channels in the auditory model from 32 to 23 has very little effect.

Figure 5.5: Comparison of speech reconstruction results

Comparing the pattern of fundamental frequency harmonics in figures 5.5-c and 5.5-d with that in 5.5-b shows that the harmonics from the auditory model are clearer than those from the conventional MFCCs. This is attributed to the fact that the channel centre frequencies of the auditory model filterbank are more closely spaced at low frequencies than those of the mel filterbank.

5.4 Summary

This chapter has proposed an integrated speech front-end capable of generating features for both speech recognition and speech reconstruction. Evaluation of fundamental frequency estimation has shown that good performance can be obtained using significantly fewer filterbank channels than the original auditory model used. In combination with this, speech recognition tests have shown that AM-MFCCs attain performance almost identical to conventional MFCCs, and that using either a 23-channel or a 32-channel filterbank has little effect on performance. In addition, speech reconstruction from the AM-MFCCs gives very similar speech quality to that attained using conventional MFCCs. These results show that a single front-end, based on an auditory model using either 23 or 32 channels, is feasible for generating features for both speech recognition and speech reconstruction purposes.

Chapter 6

Fundamental Frequency Prediction from MFCCs

Contents
6.1 Introduction
6.2 Fundamental frequency prediction
6.3 Voiced/Unvoiced classification
6.4 Experimental results
6.5 Summary

Preface

The aim of this chapter is to predict voicing classification and fundamental frequency from MFCC vectors and therefore to provide a means of reconstructing a speech signal solely from a stream of MFCC vectors. To achieve this two maximum a posteriori methods are employed. The first method enables fundamental frequency prediction by modeling the joint density of MFCCs and fundamental frequency using a Gaussian mixture model (GMM). The second scheme uses a set of hidden Markov models to link together a set of state-dependent GMMs, which enables a more localised modeling of the joint density. Experimental results show that accurate voicing classification and fundamental frequency prediction are attained when compared to hand-corrected fundamental frequency values. The use of the predicted fundamental frequency for speech reconstruction is shown to give very similar speech quality to that obtained using the reference fundamental frequency.

6.1 Introduction

Chapter 3 showed that the fundamental frequency and the spectral envelope derived from MFCC vectors are two necessary components for speech reconstruction, and several schemes [79] [89] [91] [98] have been proposed. These schemes require modification to the feature extraction on the terminal device such that the fundamental frequency is extracted in addition to the MFCC vector. They also need additional bandwidth in which to transmit the fundamental frequency and voicing component. Such a system is included in the latest version of the ETSI Aurora standard [100] and is based upon the sinusoidal model [21], which delivers reasonable quality, intelligible speech. The two excitation components, the voicing flag and the fundamental frequency, occupy 800bps of the bandwidth. The aim of this chapter is to predict the voicing and fundamental frequency associated with a frame of speech from its MFCC representation. In a DSR environment this will enable speech to be reconstructed solely from the stream of MFCC vectors and therefore avoids the need for modification to the feature extraction stage and for increased transmission bandwidth. Such a technique will also allow an audio speech signal to be reconstructed from MFCC-parameterised utterances that have no time-domain signal associated with them. Figure 6.1 illustrates the general operation of the proposed system in the context of a DSR system. Fundamental frequency prediction from MFCCs is motivated by several studies which have indicated that class-dependent correlation exists between the spectral envelope, or formants, and the fundamental frequency [20] [25]. In particular it was observed that the first formant tended to increase in response to an increase in the fundamental frequency.

Figure 6.1: DSR-based speech reconstruction from MFCCs with fundamental frequency prediction

Knowledge of this correlation has been exploited in speech recognition by a phoneme-based normalisation of spectral features by the fundamental frequency [43]. This reduced spectral variations in vowel sounds that are introduced as a result of different fundamental frequency values and led to an overall reduction in inter-vowel confusions. Further work has also reported both improved phoneme and isolated digit accuracy by exploiting this correlation to adapt the observation probabilities within an HMM-based speech recogniser according to the associated fundamental frequency [87]. The fundamental frequency correlation has also been utilised in concatenative text-to-speech (TTS) synthesis systems to adjust the spectral envelope of speech units in response to large differences between the measured and target fundamental frequency contours. Adjustments to the spectral envelope have been computed through both codebook mappings [66] and a GMM [81]. Listening tests indicate that the more realistic speech units, in terms of fundamental frequency and spectral envelope correlation, lead to higher quality synthesised speech. A voice conversion application has also utilised this correlation to determine the most appropriate fundamental frequency for a target frame of speech from the converted spectral envelope [94].

The remainder of this chapter is arranged as follows. Section 6.2 exploits the correlation between fundamental frequency and spectral envelope and introduces two methods of predicting the fundamental frequency from MFCC vectors. The first uses a GMM to model the joint density of fundamental frequency and MFCCs while the second extends this to also model the temporal correlation of the fundamental frequency through a set of combined HMM-GMMs. Voicing classification methods are developed in section 6.3 to determine whether an MFCC vector represents voiced speech (in which case fundamental frequency prediction is employed) or unvoiced speech, using first a prior voicing probability and then a posterior probability. Section 6.4 evaluates the accuracy of these fundamental frequency prediction and voicing classification methods on both connected digit strings and sentences of phonetically rich speech. Spectrograms of speech reconstructed using the predicted fundamental frequency and MFCC vectors are also shown and compared with speech reconstructed using the reference fundamental frequency and with the original speech. Finally, a conclusion and summary are given in section 6.5.

6.2 Fundamental Frequency Prediction

This section proposes two methods for predicting the fundamental frequency associated with a frame of speech from its MFCC vector representation. The idea behind both methods is to model the joint density of the MFCCs and the fundamental frequency of a frame of speech in order to enable a statistical prediction of the fundamental frequency. Previous studies have indicated that correlation does exist between the spectral envelope and the fundamental frequency, although not enough to formulate a generic relation. Instead this work proposes two methods which make a localised, class-dependent prediction of the fundamental frequency of a frame of speech from its MFCC representation.

The first method is based on the unsupervised creation of a GMM, while the second uses a supervised approach through a combined HMM-GMM.

6.2.1 GMM-based fundamental frequency prediction

The GMM-based fundamental frequency prediction method comprises two stages: training and testing. A number of GMMs, which are utilised to predict fundamental frequency during the testing stage, are created at the training stage using a set of training utterances. A simplified block diagram of this method is shown in figure 6.2.

Figure 6.2: Workflow diagram of GMM-based fundamental frequency prediction

Training stage

During the training stage, in order to model the joint density of the MFCC vector, x, and fundamental frequency, f, an augmented feature vector, y, is defined,

y = [x, f]   (6.1)

From a set of training data utterances the augmented feature vector is extracted, with the MFCC component comprising static coefficients 0 to 12 (as in the ETSI Aurora standard [100]). The fundamental frequency is computed using a comb function [90] applied to the original frame of time-domain speech samples and is subsequently manually corrected where necessary. To signify unvoiced or non-speech frames, the fundamental frequency is set to zero. From the training set of augmented vectors, two pools of vectors are created - those representing voiced frames and those representing unvoiced frames - according to the fundamental frequency component, f_i, of the i-th frame. From the entire set of feature vectors, Z, these can be defined as

Ω = {y_i ∈ Z : f_i ≠ 0}   (6.2)

Ψ = {y_i ∈ Z : f_i = 0}   (6.3)

Fundamental frequency prediction only considers the set of voiced vectors, Ω, although for voicing classification (detailed in section 6.3) the unvoiced set will also be used. The unsupervised clustering is implemented using the expectation-maximisation (EM) algorithm [55] to produce a GMM which comprises a set of K clusters that localise the correlation between the fundamental frequency and the MFCCs in the joint feature vector space [44],

p(y) = Σ_{k=1}^{K} α_k N(y; μ_k^y, Σ_k^y),   y ∈ Ω   (6.4)

Each of the K clusters is represented by a Gaussian probability density function (PDF) with prior probability, α_k, mean vector, μ_k^y, and covariance matrix, Σ_k^y, where

μ_k^y = [μ_k^x, μ_k^f]^T   and   Σ_k^y = [[Σ_k^xx, Σ_k^xf], [Σ_k^fx, Σ_k^ff]]   (6.5)

Σ_k^xx denotes the 13 × 13 covariance matrix of the MFCC vector, x, in the k-th cluster; Σ_k^xf and Σ_k^fx denote the 13 × 1 and 1 × 13 cross-covariance matrices between the MFCC vector, x, and the fundamental frequency, f, where Σ_k^xf = (Σ_k^fx)^T; and Σ_k^ff denotes the variance of the fundamental frequency. The parameters of the K clusters are passed to the testing stage together with the voicing classification parameters, which will be covered in a later section of this chapter.

Testing stage

This set of K clusters, or classes, enables a prediction to be made of the fundamental frequency of the i-th frame of speech, f̂_i, from the MFCC vector representation of that frame, x_i. The prediction can be made either from the cluster closest, in some sense, to the input MFCC vector or from a weighted contribution from all K clusters. The closest cluster, k*, to the input MFCC vector, x_i, is given by

k* = arg max_k { p(x_i | c_k^x) α_k }   (6.6)

where p(x_i | c_k^x) is the marginal probability density [4] of the MFCC vector for the k-th cluster, with prior probability α_k. From equation 6.5, p(x_i | c_k^x) can be written as

p(x_i | c_k^x) = (2π)^{-d/2} |Σ_k^xx|^{-1/2} exp{ -1/2 (x_i - μ_k^x)^T (Σ_k^xx)^{-1} (x_i - μ_k^x) }   (6.7)

where d is the dimensionality of the MFCC vector, x, and μ_k^x and Σ_k^xx are the mean vector and covariance matrix defined in equation 6.5. Using the joint density of the fundamental frequency and MFCC vector from the selected cluster, a maximum a-posteriori (MAP) prediction of the fundamental frequency can be made (as derived in Appendix A),

f̂_i = μ_{k*}^f + Σ_{k*}^fx (Σ_{k*}^xx)^{-1} (x_i - μ_{k*}^x)   (6.8)

To avoid making a hard decision in terms of identifying the cluster from which to predict the fundamental frequency, an alternative is to combine the MAP fundamental frequency predictions from all K clusters in the GMM according to the posterior probability, h_k(x_i), of the MFCC vector, x_i, belonging to the k-th cluster [94],

f̂_i = Σ_{k=1}^{K} h_k(x_i) [ μ_k^f + Σ_k^fx (Σ_k^xx)^{-1} (x_i - μ_k^x) ]   (6.9)

The posterior probability, h_k(x_i), of an MFCC vector, x_i, belonging to the k-th cluster is given by

h_k(x_i) = α_k p(x_i | c_k^x) / Σ_{k'=1}^{K} α_{k'} p(x_i | c_{k'}^x)   (6.10)

where α_k and p(x_i | c_k^x) are as defined for equation 6.6.
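To make the preceding testing-stage equations concrete, the following is a minimal Python sketch of the GMM-based MAP prediction of equations 6.6-6.10. It is illustrative only, not the implementation used in this thesis: the function name predict_f0_gmm and the argument layout are assumptions, and the GMM parameters are taken to have been estimated already (for example by EM clustering of the voiced augmented vectors).

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_f0_gmm(x, alphas, mu_x, mu_f, Sxx, Sfx, use_all_clusters=True):
    """Predict F0 from one 13-D MFCC vector x using a GMM over y = [x, f].

    alphas : (K,)        cluster priors alpha_k
    mu_x   : (K, 13)     MFCC means mu_k^x
    mu_f   : (K,)        F0 means mu_k^f
    Sxx    : (K, 13, 13) MFCC covariances Sigma_k^xx
    Sfx    : (K, 13)     cross-covariances Sigma_k^fx
    """
    K = len(alphas)
    # Marginal likelihood p(x | c_k^x) of the MFCC vector, equation 6.7
    lik = np.array([multivariate_normal.pdf(x, mean=mu_x[k], cov=Sxx[k])
                    for k in range(K)])
    # Cluster-wise MAP predictions mu_k^f + Sigma_k^fx (Sigma_k^xx)^-1 (x - mu_k^x), eq. 6.8
    f_map = np.array([mu_f[k] + Sfx[k] @ np.linalg.solve(Sxx[k], x - mu_x[k])
                      for k in range(K)])
    if use_all_clusters:
        # Posterior-weighted combination over all K clusters, equations 6.9-6.10
        h = alphas * lik
        h = h / h.sum()
        return float(h @ f_map)
    # Hard decision from the closest cluster, equation 6.6
    return float(f_map[np.argmax(alphas * lik)])
```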

6.2.2 HMM-GMM based fundamental frequency prediction

The unsupervised training used to create the GMM does not fully exploit the class-dependent correlation between the MFCC vector and fundamental frequency, nor does it satisfactorily model the temporal correlation which exists in the fundamental frequency contour. No account is taken during the EM training of whether feature vectors occur adjacently when deciding upon cluster allocation. Similarly, in testing no account is taken of the previous frame's fundamental frequency when determining the current fundamental frequency value. However, changes in fundamental frequency reflect changes in the vibration frequency of the vocal cords, which cannot vary very rapidly, and this behaviour can be modelled as temporal correlation.

To model this inherent correlation within the feature vector stream, and therefore to select a more appropriate class, or sub-space, from which to predict the fundamental frequency, a combined HMM-GMM method is proposed. This utilises a set of HMMs which have associated with them a series of state-dependent GMMs. Given a stream of MFCC vectors which has been decoded into a model and state sequence, the HMMs provide a more localised region, through the state-dependent GMMs, from which to predict the fundamental frequency.

Figure 6.3: Modelling of the joint MFCC and fundamental frequency feature space using: a) GMM clustering, b) a series of GMMs, each located within a state of a set of HMMs

Figure 6.3-a illustrates the joint MFCC and fundamental frequency feature space, which is populated by a set of clusters forming a single GMM. As discussed in section 6.2.1, prediction is made from either the closest cluster to the MFCC vector or from a combination of all clusters. Figure 6.3-b shows the same feature space but now modelled by a set of HMMs, each containing state-dependent GMMs. The solid line illustrates the trajectory of a stream of MFCC vectors passing through the states (indicated by the circles) of three models, λ_1, λ_2 and λ_3.

With each of the states of these models a GMM provides a localised prediction of fundamental frequency. To highlight the differences from the GMM-based method, a modified version of the block diagram of figure 6.2 is shown in figure 6.4.

Figure 6.4: Workflow of HMM-GMM based fundamental frequency prediction

Training stage

Training begins with the creation of a set of HMM-based speech models, Λ = {λ_1, λ_2, ..., λ_W}. These models are trained on the MFCC component, x, of the augmented vector, y, using standard Baum-Welch training [30] [32] [49] to produce a set of single-mode, diagonal covariance matrix HMMs, in accordance with ETSI Aurora guidelines [73]. The set of training data utterances is then re-aligned to the speech models using Viterbi decoding [32] to find the model and state allocation for each feature vector. Therefore, for a single training utterance which comprises N MFCC vectors, X = [x_1, x_2, ..., x_N], an associated model allocation, o = [o_1, o_2, ..., o_N], and state allocation, q = [q_1, q_2, ..., q_N], are computed. These indicate the state, q_i, and model, o_i, to which the i-th MFCC vector, x_i, is allocated, where o_i ∈ {1, 2, ..., W} and q_i ∈ {1, ..., S_{o_i}}, with W indicating the number of models and S_{o_i} the number of states in model o_i. The augmented feature vector, y_i, is calculated from the i-th MFCC vector, x_i, and fundamental frequency, f_i, according to equation 6.1. Voiced vectors belonging to each state, s, of each model, w, are then pooled together to form state and model dependent subsets of feature vectors, Ω_{s,w}, from the overall set of feature vectors, Z, made up of all training data utterances,

Ω_{s,w} = {y_i ∈ Z : f_i ≠ 0, q_i = s, o_i = w}   1 ≤ s ≤ S_w, 1 ≤ w ≤ W   (6.11)

Similarly, subsets of unvoiced vectors belonging to each state and model can be created, Ψ_{s,w},

Ψ_{s,w} = {y_i ∈ Z : f_i = 0, q_i = s, o_i = w}   1 ≤ s ≤ S_w, 1 ≤ w ≤ W   (6.12)

EM clustering is applied to each subset of voiced augmented vectors to create a series of model and state-dependent GMMs which are represented by mean vectors, μ_{k,s,w}^y, covariance matrices, Σ_{k,s,w}^y, and prior probabilities, α_{k,s,w}, corresponding to the k-th cluster of state s of speech model w. Some states have very few voiced vectors associated with them, and this forms the basis of the voicing classification which is discussed in section 6.3. At this stage the joint feature vector space is modelled by a series of GMMs which are linked together by the states of a set of HMMs and provide localised regions from which fundamental frequency can be predicted.
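A possible realisation of this training stage is sketched below, assuming a forced alignment (a model and state index per frame) is already available from Viterbi decoding. The helper name train_state_gmms is hypothetical, and scikit-learn's GaussianMixture is used here simply as a stand-in for the EM clustering; the thesis does not prescribe a particular toolkit.

```python
import numpy as np
from collections import defaultdict
from sklearn.mixture import GaussianMixture

def train_state_gmms(Y, f0, models, states, n_clusters=3):
    """Pool voiced augmented vectors per (model, state) and fit a GMM to each.

    Y      : (N, 14) augmented vectors [MFCC(13), f0]
    f0     : (N,)    reference F0, 0 for unvoiced/non-speech frames
    models : (N,)    model index o_i from the Viterbi alignment
    states : (N,)    state index q_i from the Viterbi alignment
    """
    voiced_pool = defaultdict(list)    # Omega_{s,w}, equation 6.11
    unvoiced_count = defaultdict(int)  # |Psi_{s,w}|, equation 6.12 (counts only)
    for y, f, w, s in zip(Y, f0, models, states):
        if f > 0:
            voiced_pool[(w, s)].append(y)
        else:
            unvoiced_count[(w, s)] += 1

    gmms, voicing_prior = {}, {}
    for key, vecs in voiced_pool.items():
        vecs = np.asarray(vecs)
        k = min(n_clusters, len(vecs))  # guard against states with few voiced vectors
        gmms[key] = GaussianMixture(n_components=k, covariance_type='full').fit(vecs)
        # Prior voicing probability of the state (used later, in section 6.3)
        voicing_prior[key] = len(vecs) / (len(vecs) + unvoiced_count[key])
    return gmms, voicing_prior
```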

Testing stage

Prediction of the fundamental frequency, for voiced frames, is made from the MFCC vectors in a speech utterance, X = [x_1, x_2, ..., x_N], by first determining their model allocation, o = [o_1, o_2, ..., o_N], and state allocation, q = [q_1, q_2, ..., q_N], from the set of HMMs using Viterbi decoding. For each MFCC vector, x_i, in the utterance this provides the model, o_i, and state, q_i, to which it is allocated. This information localises the region from which fundamental frequency is to be predicted to that of the particular GMM associated with state q_i of model o_i. The model and state dependent MAP prediction of the fundamental frequency, f̂_i, associated with MFCC vector x_i is then computed as

f̂_i = Σ_{k=1}^{K} h_{k,q_i,o_i}(x_i) ( μ_{k,q_i,o_i}^f + Σ_{k,q_i,o_i}^fx (Σ_{k,q_i,o_i}^xx)^{-1} (x_i - μ_{k,q_i,o_i}^x) )   (6.13)

where h_{k,q_i,o_i}(x_i) is calculated in the same way as equation 6.10, with p(x_i | c_k^x) made specific to state q_i of model o_i, as shown in equation 6.14,

h_{k,q_i,o_i}(x_i) = α_{k,q_i,o_i} p(x_i | c_{k,q_i,o_i}^x) / Σ_{k'=1}^{K} α_{k',q_i,o_i} p(x_i | c_{k',q_i,o_i}^x)   (6.14)

Similarly, α_{k,q_i,o_i} is the prior probability of the k-th cluster for model allocation o_i and state allocation q_i, and p(x_i | c_{k,q_i,o_i}^x) is the marginal distribution of the MFCC vector for the k-th cluster with model allocation o_i and state allocation q_i, defined in equation 6.15,

p(x_i | c_{k,q_i,o_i}^x) = (2π)^{-d/2} |Σ_{k,q_i,o_i}^xx|^{-1/2} exp{ -1/2 (x_i - μ_{k,q_i,o_i}^x)^T (Σ_{k,q_i,o_i}^xx)^{-1} (x_i - μ_{k,q_i,o_i}^x) }   (6.15)
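The testing stage can then be expressed as a thin wrapper around the global-GMM predictor sketched earlier: the only difference from equations 6.8-6.10 is that the GMM parameters are indexed by the Viterbi model and state allocation (o_i, q_i), as in equations 6.13-6.15. The viterbi_align callable below is a placeholder for the HMM decoder, not an actual API.

```python
import numpy as np

def predict_f0_hmm_gmm(X, viterbi_align, state_gmm_params):
    """Posterior-weighted MAP F0 prediction per frame, equation 6.13.

    X                : (N, 13) MFCC vectors of one utterance
    viterbi_align    : callable returning (models o, states q) for X -- placeholder
    state_gmm_params : dict mapping (w, s) -> (alphas, mu_x, mu_f, Sxx, Sfx)
    """
    o, q = viterbi_align(X)              # model/state allocation for each frame
    f0_hat = np.zeros(len(X))
    for i, x in enumerate(X):
        params = state_gmm_params[(o[i], q[i])]
        # Same MAP predictor as the global GMM, restricted to the GMM of
        # state q_i of model o_i (equations 6.14-6.15).
        f0_hat[i] = predict_f0_gmm(x, *params)  # defined in the earlier sketch
    return f0_hat
```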

6.3 Voiced/Unvoiced Classification

This section introduces two methods to determine whether an MFCC vector represents a voiced frame of speech. Accurate identification of voiced frames is important as this information is used to decide whether a prediction of fundamental frequency is subsequently made. The discussion of these two techniques is based upon the combined HMM-GMM system of section 6.2.2 and relies on the model and state sequence of the stream of MFCC vectors determined by Viterbi decoding [2] [6]. The latter technique is adapted to the GMM-only system at the end of this section.

6.3.1 Voicing classification using prior probability

The first method of voicing classification is based on the computation of a prior voicing probability for each state of each HMM. The prior voicing probability, υ_{s,w}, of state s of model w is calculated as the proportion of the vectors allocated to it which are voiced,

υ_{s,w} = n(Ω_{s,w}) / (n(Ω_{s,w}) + n(Ψ_{s,w}))   1 ≤ s ≤ S_w, 1 ≤ w ≤ W   (6.16)

where n(Ω_{s,w}) is the number of elements in the set Ω_{s,w}. To illustrate how the prior voicing probability changes across models and states, figure 6.5 shows the prior voicing probability of the 16 emitting states of each of the eleven digit models

Λ = {one, two, ..., zero, oh}   (6.17)

These figures demonstrate the voicing occupancy of the 16 states of the eleven digit models. Considering the digit six (which comprises the phonemes /s/ /ih/ /k/ /s/), the first few and last few states contain relatively few voiced vectors, which corresponds to the unvoiced phonemes /s/ and /k/ /s/. The central states of the model are associated with the vowel /ih/ and comprise nearly all voiced vectors, which gives a high prior voicing probability.

Figure 6.5: Voicing probability for eleven digit models

The state occupancy for the model three (/th/ /r/ /iy/) shows similar behaviour. Its initial states have low prior voicing probabilities, due to the relatively few voiced vectors associated with the unvoiced phoneme /th/, while the later states are dominated by voiced vectors from the voiced phonemes /r/ and /iy/. It is interesting to observe that the first state in both models has a broadly mid-valued prior voicing probability. This can be attributed to the state being on the transition from one model to the next, meaning it has relatively unstable voicing characteristics. In fact, for most models the initial state has a similar prior voicing probability and also relatively few vectors allocated to it during training compared with the remaining states.

Figure 6.6 demonstrates the prior voicing probability of the three emitting states of the 39 phoneme models.

Figure 6.6: Voicing probability for phoneme models

In some phonemes, such as /w/, /m/ and /ae/, almost all feature vectors are voiced frames, which gives a high prior probability. However,

in some phonemes, such as /k/, /s/ and /t/, the prior voicing probability is very low, which means that these phonemes are produced without vibration of the vocal cords. Most of the phonemes in this category are unvoiced sounds. It is very interesting to observe that the voicing probabilities of the first and third emitting states of phoneme /f/ and phoneme /s/ are much higher than the voicing probability of the second state. This may be attributed to the co-articulation effect. For example, the English word statistics can be transcribed as the phoneme string /s/-/t/-/ey/-/t/-/ih/-/s/-/t/-/ih/-/k/-/s/, in which the three instances of phoneme /s/ combine with different phonemes to form four bi-phones, such as /s/-/t/, /ih/-/s/ and /k/-/s/. In the bi-phone /ih/-/s/, the first state of /s/ is connected to the last state of the phoneme /ih/, which has a high prior voicing probability. The vocal cords are vibrating while the phoneme /ih/ is pronounced.

After the phoneme /ih/ has been pronounced, the vocal cords cannot stop vibrating suddenly, and therefore some voiced frames are carried into the first state of phoneme /s/. In the other bi-phone, /k/-/s/, the first state of /s/ is connected to the last state of the phoneme /k/, which has a low voicing probability, and in this case the first state of phoneme /s/ comprises unvoiced frames. The second state of phoneme /s/ is quite stable because the vibration of the vocal cords has almost stopped before this state is entered.

The voicing associated with an input MFCC vector, x_i, can now be determined from the prior voicing probability of the state, q_i, of model, o_i, to which it is aligned during Viterbi decoding,

voicing_i = voiced    if υ_{q_i,o_i} > θ
voicing_i = unvoiced  if υ_{q_i,o_i} ≤ θ   (6.18)

The threshold, θ, has been determined experimentally through analysis of the resulting speech reconstruction quality, with a suitable value found to be θ = 0.2. This is deliberately set low so that errors are more likely to come from unvoiced frames being classified as voiced. As the energy of unvoiced frames is usually low, such voicing errors make little perceptible sound. Conversely, if more errors were made by classifying voiced frames as unvoiced, their higher energy would cause more noticeable noise-like errors which degrade the quality of the reconstructed speech.
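As a minimal illustration of equations 6.16 and 6.18, the prior-probability classifier reduces to a table lookup followed by a threshold. The names below are illustrative, and the per-state voicing priors are assumed to have been computed during training (as in the training-stage sketch earlier).

```python
def classify_voicing_prior(models, states, voicing_prior, theta=0.2):
    """Threshold the prior voicing probability of the aligned state, equation 6.18.

    models, states : per-frame Viterbi allocation (o_i, q_i)
    voicing_prior  : dict mapping (w, s) -> prior voicing probability, equation 6.16
    theta          : deliberately low threshold so errors favour unvoiced -> voiced
    """
    return [voicing_prior.get((w, s), 0.0) > theta for w, s in zip(models, states)]
```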

6.3.2 Voicing classification using posterior probability

An analysis of the prior voicing probabilities for both the digit and phoneme sets of HMMs revealed that some states are strongly voiced. For example, the voicing occupancies of the 8th and 9th states of digit model one are 0.94 and 0.93 respectively, while the voicing occupancy of the 5th state of digit model two is very low, which indicates a strongly unvoiced state. However, for some other states the distinction is not so clear; the voicing probability of the third state of phoneme /f/, for example, falls between these extremes. Figure 6.7 shows a histogram of the prior voicing probabilities across all states of the eleven digit models.

Figure 6.7: Histogram of prior voicing probability for digit models

Figure 6.7 shows that 55 of the 176 states have a prior voicing probability over 0.9, indicating that most of the vectors located in these states are voiced frames, while 37 of the 176 states have a prior voicing probability under 0.2, which allows vectors associated with these states to be considered unvoiced. However, the status of the remaining states, whose voicing occupancy probability lies between 0.2 and 0.9, is not so clear, and these may be subject to voicing classification errors. Similarly, figure 6.8 shows the prior voicing probability histogram for all states of the phoneme models.

Figure 6.8: Histogram of prior voicing probabilities for phoneme models

The histogram shown in figure 6.8 illustrates very similar behaviour to that of figure 6.7, in that some states are strongly voiced: for example, 81 states have a prior voicing probability over 0.9. Few states have very low voicing probabilities; 16 states have a voicing probability under 0.2 and these can be declared unvoiced states. The remaining states, in which the prior voicing probability is between 0.2 and 0.9, are not easy to determine as voiced or unvoiced and may incur voicing classification errors. The difference between figures 6.7 and 6.8 also indicates that there are more strongly voiced states in the phoneme models than in the digit models. The voicing classification using a prior probability described in the previous section therefore introduces classification errors for both the phoneme and digit models.

To improve the accuracy of voicing classification, this section discusses an alternative to the threshold-based voicing decision in which the decision is made using posterior probabilities of voicing. For the HMM-GMM fundamental frequency prediction, described in section 6.2.2, a GMM with K clusters has been created in each state, s, of each model, w, from the set of voiced vectors allocated to that state, Ω_{s,w}. An additional, (K + 1)-th, cluster is now created from the set of vectors allocated to that state which are labelled as unvoiced, Ψ_{s,w}. This (K + 1)-th cluster is defined by μ_{K+1}, Σ_{K+1} and α_{K+1} = 1.0. The probability of an input MFCC vector, x_i, being allocated to any one of the K + 1 clusters is now given by

p(c_{k,q_i,o_i} | x_i) = p(x_i | c_{k,q_i,o_i}) P(c_{k,q_i,o_i}) / Σ_{k'=1}^{K+1} p(x_i | c_{k',q_i,o_i}) P(c_{k',q_i,o_i})   1 ≤ k ≤ K + 1   (6.19)

where p(x_i | c_{k,q_i,o_i}) is the marginal distribution of the MFCC vector for the k-th cluster in state q_i of model o_i.

The prior probability, P(c_{k,q_i,o_i}), of each cluster is given by

P(c_{k,q_i,o_i}) = α_{k,q_i,o_i} υ_{q_i,o_i}   1 ≤ k ≤ K
P(c_{k,q_i,o_i}) = u_{q_i,o_i}                k = K + 1   (6.20)

where α_{k,q_i,o_i} is as defined in equation 6.6 and u_{s,w} is given by

u_{s,w} = n(Ψ_{s,w}) / (n(Ω_{s,w}) + n(Ψ_{s,w})) = 1 - υ_{s,w}   (6.21)

The decision as to whether the MFCC vector is voiced or unvoiced can then be made,

voicing_i = voiced    if Σ_{k=1}^{K} p(c_{k,q_i,o_i} | x_i) > p(c_{K+1,q_i,o_i} | x_i)
voicing_i = unvoiced  if Σ_{k=1}^{K} p(c_{k,q_i,o_i} | x_i) ≤ p(c_{K+1,q_i,o_i} | x_i)   (6.22)

6.3.3 Voicing classification for GMM-only method

The method of voicing classification in the previous subsection can also be applied to the GMM-only scheme described in section 6.2.1. In this case classification is no longer made from the state-specific GMMs but instead uses the single GMM which models the entire feature space. The global voicing occupancy can be calculated as

υ_g = n(Ω) / (n(Ω) + n(Ψ))   (6.23)

where Ω and Ψ are defined in equations 6.2 and 6.3 respectively, and the global unvoicing occupancy can be defined as u_g = 1 - υ_g. Similarly, the (K + 1)-th cluster, c_{K+1}, is trained on the global set of unvoiced vectors, Ψ, in addition to the K voiced clusters, c_1, ..., c_K, of equation 6.4. The posterior probability of a given vector being located in any of the K + 1 clusters can then be defined as

p(c_k | x_i) = p(x_i | c_k) P(c_k) / Σ_{k'=1}^{K+1} p(x_i | c_{k'}) P(c_{k'})   1 ≤ k ≤ K + 1   (6.24)

where the prior probability, P(c_k), of each cluster is given by

P(c_k) = α_k υ_g   1 ≤ k ≤ K
P(c_k) = u_g       k = K + 1   (6.25)

The voicing decision for the GMM-based method can then be made according to

voicing_i = voiced    if Σ_{k=1}^{K} p(c_k | x_i) > p(c_{K+1} | x_i)
voicing_i = unvoiced  if Σ_{k=1}^{K} p(c_k | x_i) ≤ p(c_{K+1} | x_i)   (6.26)
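A sketch of the posterior-probability decision of equations 6.19-6.22 for a single frame is given below. It assumes the K voiced clusters and the additional (K + 1)-th unvoiced cluster of the aligned state have already been trained; the function name and argument layout are illustrative, and the GMM-only variant of equations 6.23-6.26 follows by replacing the state-specific parameters and voicing prior with the global ones.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_voicing_posterior(x, alphas, mu_x, Sxx, mu_uv, S_uv, v_prior):
    """Posterior voicing decision for one MFCC vector, equations 6.19-6.22.

    alphas, mu_x, Sxx : parameters of the K voiced clusters of the aligned state
    mu_uv, S_uv       : mean/covariance of the (K+1)-th, unvoiced cluster
    v_prior           : prior voicing probability of the state, equation 6.16
    """
    # Unnormalised posteriors p(x|c_k) P(c_k) for the K voiced clusters (eq. 6.20)
    voiced = np.array([multivariate_normal.pdf(x, mean=mu_x[k], cov=Sxx[k])
                       * alphas[k] * v_prior for k in range(len(alphas))])
    # ... and for the single unvoiced cluster, with prior 1 - v_prior (eq. 6.21)
    unvoiced = multivariate_normal.pdf(x, mean=mu_uv, cov=S_uv) * (1.0 - v_prior)
    # The normalisation term of eq. 6.19 cancels in the comparison of eq. 6.22
    return voiced.sum() > unvoiced
```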

6.4 Experimental Results

The aim of these experiments is, first, to measure the accuracy of fundamental frequency prediction and voicing classification using the techniques discussed in sections 6.2 and 6.3. Secondly, the quality of speech reconstructed using the predicted fundamental frequency is compared with that reconstructed using the reference fundamental frequency, within the framework of the sinusoidal model of speech of section 3.5. Two datasets, the Aurora connected digit database and a set of phonetically rich utterances (the Messiah sentences), are applied to the proposed methods in the following subsections.

6.4.1 Experimental results for digit models

This section measures the accuracy of fundamental frequency prediction and voicing classification using a subset of the ETSI Aurora connected digits database. A set of 1266 utterances has been used for training, comprising 633 male utterances and 633 female utterances taken from noise-free speech. Each set is made up from 50 male speakers and 50 female speakers. A separate set of 1000 utterances, spoken by different talkers from those used in training, is used for testing and comprises 501 male utterances and 499 female utterances.

Each set uses 50 male and 50 female speakers. In accordance with the ETSI Aurora standard, 13-D static MFCC vectors are extracted from the speech at a rate of 100 vectors per second. The reference fundamental frequency associated with each MFCC vector is estimated from the time-domain signal using a comb function (detailed in section 3.3) with subsequent manual correction.

The fundamental frequency prediction methods are evaluated both on their classification of MFCC vectors as voiced or unvoiced and on the percentage fundamental frequency prediction error for voiced frames; both measures are defined in section 3.3.4 (the classification error in equation 3.14). Tables 6.1 and 6.2 show the percentage classification error, E_c, and percentage fundamental frequency prediction error, E_p, for male and female speech using gender-dependent models. In each table, results are presented first using the prior voicing probability method of determining voiced frames (section 6.3.1) and then for the posterior probability method (section 6.3.2). Results are shown for the two GMM methods, which use either the closest cluster to the input MFCC vector, as in equation 6.8, or the posterior-weighted MAP prediction of equation 6.9. In both cases it was found, from En Najjary's work [94] and from preliminary experiments, that using K = 64 clusters gave the best performance; using more clusters in the GMM increases the likelihood of employing parameters from the wrong clusters. Results for HMM-based prediction are shown using from 1 to 5 clusters within each state, with the posterior-weighted MAP prediction of fundamental frequency given in equation 6.13.
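Chapter 3's exact error definitions are not reproduced here, so the following sketch only encodes the convention described in the surrounding text: a frame counts towards the classification error E_c if its voicing decision is wrong or if the predicted fundamental frequency deviates from the reference by more than 20%, while E_p averages the relative error over the remaining voiced frames. Treat it as an assumed reading of equation 3.14 and its companion measure, not as the thesis' own scoring code.

```python
import numpy as np

def f0_error_rates(f0_ref, f0_pred, gross_threshold=0.20):
    """Classification error E_c and percentage F0 error E_p (assumed definitions)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_pred = np.asarray(f0_pred, dtype=float)
    voiced_ref, voiced_pred = f0_ref > 0, f0_pred > 0
    both = voiced_ref & voiced_pred
    # Relative F0 error on frames both methods call voiced
    rel_err = np.zeros_like(f0_ref)
    rel_err[both] = np.abs(f0_pred[both] - f0_ref[both]) / f0_ref[both]
    # Wrong voicing decision, or gross F0 error (> 20%), counts towards E_c
    wrong = (voiced_ref != voiced_pred) | (both & (rel_err > gross_threshold))
    E_c = wrong.mean()
    ok = both & ~wrong
    E_p = rel_err[ok].mean() if ok.any() else 0.0
    return E_c, E_p
```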

Table 6.1: Classification accuracy and percentage fundamental frequency error for male speech on the ETSI Aurora connected digit database

                    Prior voicing prob.            Posterior voicing prob.
                    Classification    F0 error     Classification    F0 error
                    error, E_c        E_p          error, E_c        E_p
GMM-closest         22.5%             9.1%         22.0%             9.2%
GMM-posteriori      22.5%             9.0%         22.0%             9.1%
HMM-1 cluster       17.7%             7.8%         13.4%             7.8%
HMM-2 clusters      16.5%             7.3%         12.1%             7.2%
HMM-3 clusters      15.9%             6.8%         11.7%             6.7%
HMM-4 clusters      16.0%             6.8%         11.7%             6.7%
HMM-5 clusters      16.1%             6.8%         11.9%             6.8%

Table 6.2: Classification accuracy and percentage fundamental frequency error for female speech on the ETSI Aurora connected digit database

                    Prior voicing prob.            Posterior voicing prob.
                    Classification    F0 error     Classification    F0 error
                    error, E_c        E_p          error, E_c        E_p
GMM-closest         19.7%             5.6%         13.4%             5.7%
GMM-posteriori      18.5%             5.9%         12.2%             5.9%
HMM-1 cluster       15.1%             7.0%         11.7%             7.0%
HMM-2 clusters      14.8%             6.1%         11.3%             6.1%
HMM-3 clusters      14.6%             5.6%         11.2%             5.6%
HMM-4 clusters      14.6%             5.5%         11.2%             5.5%
HMM-5 clusters      14.7%             5.4%         11.5%             5.4%

Comparing the performance differences between using the prior voicing probability and the posterior voicing probability, for determining voicing classification, shows that the second method consistently outperforms the first method. On average, classification errors are reduced by about 3.5% using the posterior voicing probability, while errors made in fundamental frequency prediction remain unchanged. Examining the performance of GMM-based prediction for the male speech reveals a slight improvement in fundamental frequency prediction accuracy when taking the posterior weighted prediction from all clusters (equation 6.9) over using only the closest cluster (equation 6.8). For female speech the classification error is

158 6.4. Experimental Results 141 significantly better when using the weighted prediction although fundamental frequency prediction error is slightly worse. Investigation into the reason for this drop in prediction accuracy revealed that the fundamental frequency prediction error was actually worse when using the closest cluster, as some of the errors were greater than 20% which meant they were labeled as classification errors, E c (equation 3.14) rather than being included in the fundamental frequency error measurement, E p. The HMM-GMM prediction gives considerably more accurate frame classification and lower fundamental frequency error in comparison to the GMM. This can be attributed to the better localisation for fundamental frequency prediction that the HMM gives. Increasing the number of clusters in each state of the HMM enables more detailed modeling of the joint distribution of MFCCs and fundamental frequency and this results in a general reduction of frame classification error to a minimum of 11.7% for male speech and 11.2% for female speech. The percentage of fundamental frequency prediction error also reduces as the number of clusters increases to a minimum of 6.7% for male speech and 5.4% for female speech. For large numbers of clusters the amount of training data from which to estimate the cluster statistics is reduced which leads to a slight decrease in accuracy. Using more training data should allow larger numbers of clusters to be reliably created and lead to further reductions in error. It is interesting to note that the accuracy of the speech recogniser was 97% which means that 3% of digits were aligned to incorrect models from which voicing and fundamental frequency were predicted. Analysis of predicted values also revealed that the significant majority of classification errors arise from incorrect voicing decisions

159 6.4. Experimental Results 142 which occur in low energy regions at the start or end of speech. Figure 6.9 compares the predicted fundamental frequency contour made from the five cluster HMM-GMM (blue line) with the reference fundamental frequency (red line) for the digit sequence nine-six-oh (comprising phonemes /n//ay//n/ /s//ih//k//s/ /ow/). Figure 6.9: a) waveform of connected digit string ; b) predicted fundamental frequency contour (blue line) and reference fundamental frequency contour (red line) Figure 6.9-a shows the waveform of the connected digit string and figure 6.9-b compares the predicted fundamental frequency contour shown as blue line with the reference fundamental frequency contour shown as the red line. The corresponding phoneme strings are labeled on this figure as well. Comparing the two fundamental frequency contours shows that the classification of frames as voiced is effective and follows closely the voicing associated with the reference fundamental frequency. For example, accurate voicing classification can be observed for the digit six, where the central /ih/ phoneme is correctly classified as voiced in contrast to the unvoiced phonemes /s/ and /k/-/s/ at the start and end

160 6.4. Experimental Results 143 of the digit. For voiced frames the predicted fundamental frequency tracks closely the reference fundamental frequency although some fluctuations can be observed. For example the predicted fundamental frequency overshoots at the beginning of the digit nine as a result of the voicing classification error and then stabilises further into the digit. Most voicing classification errors occur at the start and end of digits in areas of speech which have relatively low energy regions. This is illustrated at the start of the digit nine (frame 22) where unvoiced frames are labeled as voiced and at the end of the digit (frame 50) where voiced frames are labeled as unvoiced. As will be discussed later, the effect of these errors in reconstructed speech is generally not severe due to their lower energy making the voicing errors less audible. The motivation behind fundamental frequency prediction in this work is to enable an acoustic speech signal to be reconstructed from a stream of MFCC vectors without the need for an additional fundamental frequency component. To illustrate the effectiveness of this approach, figure 6.10-a shows the narrowband spectrogram of the original speech utterance nine-six-oh - as used in figure 6.9. Figure 6.10-b shows the spectrogram of the speech signal reconstructed from MFCC vectors and the reference fundamental frequency using the sinusoidal model described in section 6.2. Figure 6.10-c shows the spectrogram of the speech signal reconstructed solely from MFCC vectors with the fundamental frequency and voicing predicted using the 5 cluster HMMs. Comparing figures 6.10-a and 6.10-b shows that formant peaks become broader as a result of the spectral smoothing which the mel-filterbank analysis and truncation

161 6.4. Experimental Results 144 Figure 6.10: Comparison of narrowband spectrogram for a) original speech signal, Reconstructed speech using b)the reference fundamental frequency and c) the predicted fundamental frequency of DCT coefficients impart on the magnitude spectrum in the MFCC extraction process. Considering now the result of fundamental frequency prediction, only slight differences are observed between figures 6.10-b and 6.10-c which arise from voicing classification errors and fundamental frequency prediction errors. It is interesting to note that the voicing classification errors observed in figure 6.9 have little effect in the reconstructed speech as they are associated with very low energy regions of the speech. This can be observed at the start of phoneme /n/ at the beginning of the digit nine where the low energy makes the extra frames labeled as voiced almost unnoticeable. A similar effect occurs at the end of phoneme /ow/ as the energy of the digit oh falls away. Listening to a number of reconstructed speech utterances and observing their spectrograms confirmed that errors made in voicing classification and fundamental frequency prediction led to only minor perceptible differences between speech re-

constructed from the reference fundamental frequency and the predicted fundamental frequency. Unvoiced frames which were incorrectly classified as voiced led to almost inaudible errors in the reconstructed speech, as indicated by the spectrogram analysis, due to their relatively low energy. However, voiced frames incorrectly labelled as unvoiced could be heard in the reconstructed speech as short-duration noise-like sounds, because an unvoiced sound is then generated from a high-energy frame. The effect of fundamental frequency estimation errors on the reconstructed speech was generally less severe than that of voicing classification errors in terms of speech quality. Fundamental frequency estimation errors that were small and relatively constant made very little perceptible difference to the reconstructed speech. However, when a severe step change in the predicted fundamental frequency arose, this could be heard as the introduction of artificial-sounding noise at that time instant. In practice these effects can be reduced through post-processing, and it can be observed that speech reconstructed from over-smoothed fundamental frequency estimates introduces less artificial noise.

6.4.2 Experimental results for free speech

In this experiment, a set of 903 phonetically rich sentences, the Messiah sentences (for example "Chocolate and roses never fail as a romantic gift."), has been used to train the GMM and the HMM-GMMs. A further set of 246 phonetically rich sentences, comprising a total of 130,000 vectors, has been used for testing. All of these 1,149 utterances come from the same female speaker. Each sentence is approximately 5 seconds in duration, and from these 13-D MFCC vectors have been extracted at a rate of 100 vectors per second in accordance with the ETSI Aurora standard.

The reference fundamental frequency associated with each MFCC vector has been obtained from a laryngograph signal with subsequent manual correction. The fundamental frequency prediction methods are evaluated both on their classification of MFCC vectors as voiced or unvoiced and on the percentage fundamental frequency prediction error for voiced frames; the measures are defined in section 3.3.4 (the classification error in equation 3.14).

The structure of table 6.3 is similar to that of tables 6.1 and 6.2: it shows the percentage fundamental frequency prediction error, E_p, and percentage classification error, E_c, using both the prior voicing probability method of section 6.3.1 and the posterior voicing probability method of section 6.3.2. Results are shown for the two GMM methods - the closest cluster to the input MFCC vector, as in equation 6.8, and the posterior-weighted MAP prediction of equation 6.9. The number of clusters, K, is 64, the same as in tables 6.1 and 6.2. Results for HMM-based prediction are shown using from 1 to 16 clusters within each state, with the posterior-weighted MAP prediction of the fundamental frequency given in equation 6.13.

Table 6.3: Classification accuracy and percentage fundamental frequency error for unconstrained monophone models on free speech from a single female speaker

                    Prior voicing prob.            Posterior voicing prob.
                    Classification    F0 error     Classification    F0 error
                    error, E_c        E_p          error, E_c        E_p
GMM-closest         19.05%            4.78%        18.74%            4.75%
GMM-posteriori      18.29%            4.91%        18.01%            4.87%
HMM-1 cluster       9.66%             5.94%        7.93%             5.95%
HMM-2 clusters      9.52%             5.35%        7.75%             5.36%
HMM-4 clusters      9.44%             4.89%        7.54%             4.89%
HMM-8 clusters      9.36%             4.40%        7.36%             4.40%
HMM-16 clusters     9.46%             4.02%        7.35%             4.04%

The results show that the posterior probability based voicing classification outperforms the prior probability based method. On average, classification errors are reduced by about 1.9% using the posterior voicing probability, while errors made in fundamental frequency prediction remain almost unchanged. A slight reduction in voicing error is also observed as the number of clusters in the HMM-GMM system is increased.

The fundamental frequency error of the GMM-based prediction appears slightly better than that of the HMM-GMM based prediction method. However, analysing the prediction errors made by the GMM showed that a significant number (> 3%) were greater than 20%, meaning that they are treated as voicing errors. Taking this into account shows the GMM to give the least accurate fundamental frequency prediction. The HMM-GMM prediction gives significantly more accurate voicing classification than the GMM-based prediction, in accordance with the results on the Aurora dataset. Increasing the number of clusters in each state of the HMM enables more detailed modelling of the joint distribution, and this leads to a reduction in voicing classification error of 0.44% on average. In terms of fundamental frequency percentage error, increasing the number of clusters from 1 to 16 reduces the error from 5.95% to 4.04%.

Both table 6.2 and table 6.3 present results from female speakers, with table 6.2 showing performance on the speaker-independent dataset and table 6.3 on the speaker-dependent dataset. It can be concluded that both fundamental frequency measures are better on the speaker-dependent dataset than on the speaker-independent dataset, and this can be attributed to the fact that the fundamental frequency information is closely related to the speaker's use of their

165 6.4. Experimental Results 148 vocal organs. To illustrate the accuracy of the 16 cluster HMM-GMM system, figure 6.11 compares the predicted fundamental frequency contour with the reference fundamental frequency contour for the sentence Look out of the window and see if it s raining. Figure 6.11: a) Waveform of the speech utterance; b) predicted and reference fundamental frequency contours Figure 6.11-a shows the waveform of the speech utterance and figure 6.11-b presents the reference fundamental frequency shown as the red line and the predicted fundamental frequency shown as blue line. Comparing this figure indicates that the predicted fundamental frequency follow closely the reference fundamental frequency throughout the sentence, although it has more variation than the reference fundamental frequency. Some of this variability has been removed by the median filter although applying further filtering causes too much detail to be lost. Generally, classification of the vectors as voiced or unvoiced follows closely the reference voicing classification. Analysis of voicing errors from the test set has indicated that most occur in relatively low energy regions of speech

166 6.4. Experimental Results 149 which frequently occur at the start and end of words. A typical voicing error of this type can be observed around time 4.25s where voiced frames are incorrectly identified as unvoiced. The purpose of fundamental frequency prediction has been to enable an acoustic speech signal to be reconstructed from a stream of MFCCs. To illustrate the effectiveness of this figure 6.12-a shows the narrowband spectrogram of the original speech utterance Look out of the window and see if it s raining as used in figure Figure 6.12-b shows the spectrogram of the speech signal reconstructed from MFCCs and reference fundamental frequency using the sinusoidal model. Figure 6.12-c shows the spectrogram of the speech signal reconstructed solely from MFCCs with fundamental frequency predicted using the 16 cluster HMM-GMM. Figure 6.12: Comparison of narrowband spectrogram of a) original speech, reconstructed speech from b) reference fundamental frequency and c) predicted fundamental frequency Comparing figure 6.12-a and 6.12-b shows the spectral smoothing which the MFCC extraction has introduced as a result of the mel-filterbank and truncation of

167 6.5. Summary 150 DCT coefficients. Considering now the result of fundamental frequency prediction, only slight differences are observed between figure 6.12-b and 6.12-c and these arise from fundamental frequency prediction errors. The effect of the incorrect classification of voiced frames as unvoiced at the end of the word raining can be seen in figure 6.12-b, and can be heard as a burst of white noise. 6.5 Summary This chapter has shown that it is possible to predict the voicing and fundamental frequency of a frame of speech from its MFCC representation. Using this information it has been possible to reconstruct an intelligible speech signal solely from a sequence of MFCC vectors. A fundamental frequency prediction method using a GMM to model the joint density of MFCCs and fundamental frequency was introduced in section 6.2 and gave reasonably accurate voicing classification and fundamental frequency prediction accuracy. This was extended in section 6.3 with a set of combined HMM-GMMs which used the HMMs to localize the region from which fundamental frequency is predicted through a series of state-dependent GMMs. This led to significant improvements in both voicing classification and fundamental frequency prediction accuracy over the GMM-only system. For speech reconstruction the use of the predicted voicing classification and fundamental frequency gave similar speech quality to that obtained using the reference fundamental frequency information. The fundamental frequency estimation and reconstruction systems have been evaluated on both speaker-independent speech and speaker-dependent speech. For a speaker-independent speech set, the vocabulary has been restricted to a connected digit task while the vocabulary is unconstrained for a speaker-dependent speech set.

The experimental results show that the speech utterances reconstructed from MFCC vectors and fundamental frequency estimates are intelligible under both test conditions. Further investigation is needed to test the performance of these systems in a noisy environment. It is likely that more voicing classification errors would be introduced and that fundamental frequency estimation would be less accurate, due to the sensitivity of MFCC vectors to noise; a correspondingly improved model, adapted to a noisy environment, might be considered for robust fundamental frequency estimation. This HMM-GMM framework has been extended to other applications, such as predicting formants from MFCC vectors for noise compensation [104] [107].

169 Chapter 7 Conclusions and Future Work Contents 7.1 Review of this thesis 7.2 Conclusions 7.3 Future work Preface This thesis has addressed the problem of reconstructing a speech signal from speech recognition features (MFCCs) and fundamental frequency in a distributed speech recognition (DSR) framework. In particular, the combined hidden Markov model and Gaussian mixture model (HMM-GMM) is proposed to predict the fundamental frequencies from MFCC vectors and therefore it is possible to reconstruct speech solely from MFCC vectors. The review of this thesis is made in next section followed by the conclusions and future work is suggested in the last section. 152

170 7.1. Review of this work Review of this work Chapter two introduces two speech production models, the source-filter and the sinusoidal models, and these models are fully implemented for the reconstruction of speech from MFCCs and fundamental frequencies in the third chapter. Both models can be employed for speech reconstruction using the MFCC-derived spectral envelope and the fundamental frequency. In the source-filter model, linear predictive coefficients (LPC) are estimated from the spectral envelope and then used to generate the speech signal according to the fundamental frequency. In the alternative, sinusoidal model, the set of parameters which are necessary for signal reconstruction have been identified as sinusoidal frequencies, amplitudes and phases. These have been extracted from the spectral envelope and the fundamental frequency based on the harmonic assumption. Post processing methods such as fundamental frequency shift reduction and the overlap-and-add algorithms are also employed to improve the reconstructed speech quality. Fundamental frequency and spectral envelope are two necessary components for speech reconstruction using either the source-filter model or the sinusoidal model. Initially, two fundamental frequency estimation methods, the SIFT method and the comb function method, have been applied and compared. Inversion of MFCC vectors to the mel filterbank, followed by cubic interpolation, has given a smoothed estimate of spectral envelope. Experimental results show that the comb function based fundamental frequency estimation performs better than the SIFT method in terms of both voicing classification and fundamental frequency estimates. The quality of the reconstructed

171 7.1. Review of this work 154 speech using the sinusoidal model is better than that using the source-filter model. Due to these results, the comb function based fundamental frequency estimation method and the sinusoidal model based speech reconstruction scheme were chosen for further experiments. Chapters four and five focus on clean speech reconstruction in noisy environments. Because both MFCC extraction and fundamental frequency estimation methods are sensitive to noise, it is necessary to obtain a clean spectral envelope estimate and a reliable fundamental frequency estimate from a noisy signal. An auditory model is briefly described in chapter two and is applied to decompose an acoustic signal into a number of frequency channels in chapters four and five. Chapter four provides a method to identify the noisy channels and to then reliably estimate fundamental frequency from the clean channels. Spectral subtraction is applied in the filterbank domain to provide a clean spectral envelope. This enables a clean speech signal to be synthesised using the robust fundamental frequency and noise-free spectral envelope. Chapter five analysed the similarity between the frequency responses of the Mel filterbank and the auditory filterbank. Their similarity leads to the proposal of an auditory model based feature extraction. Experimental results show that this auditory model based feature extraction can provide speech recognition features which have similar performance to the MFCCs and is also able to estimate reliable fundamental frequencies from a noisy signal for speech reconstruction. Previous studies and evidence collected in this thesis indicate that correlations appear between fundamental frequency and spectral envelope. This correlation gives

us the idea of predicting the fundamental frequency from the MFCC vectors and thereby removing the need to transmit the voicing classification and fundamental frequency in the DSR framework. This has been achieved using two statistical models which are derived in chapter six. A Gaussian mixture model (GMM) is introduced and trained in an unsupervised manner using the EM algorithm. The fundamental frequency can then be predicted from these Gaussians using maximum a posteriori (MAP) estimation. Unfortunately, this method does not take into account the temporal information which exists in the fundamental frequency contour between successive frames, and therefore a hidden Markov model (HMM) is introduced to overcome this shortcoming. An HMM-based speech recogniser is employed to build a state-dependent, local feature-space GMM, and the fundamental frequency can then be predicted from the HMM-guided state feature space. Two voicing classification methods, using prior and posterior probabilities, are also developed and compared within the HMM-GMM framework. Experimental results indicate that the speech reconstructed from the predicted fundamental frequencies and MFCC vectors is similar to that reconstructed from the reference fundamental frequency and MFCC vectors. This HMM-GMM estimation framework has been extended to other applications, such as the estimation of formants from MFCC vectors for noise compensation.

7.2 Conclusions

The following conclusions can be made based on the review of this thesis: Speech can be successfully reconstructed from MFCC vectors and excitation information which comprises the fundamental frequency and a voicing classi-

173 7.3. Future work 156 fication. The computational auditory model can be employed to estimate a robust fundamental frequency in a noisy environment and also to provide speech recognition features, (AM-MFCCs), which are comparable to the traditional MFCC vectors. Fundamental frequency can be predicted from the MFCC vectors using the proposed HMM-GMM framework. An acoustic speech signal can be created from the MFCC vectors alone, using them to provide the spectral envelope information and to predict the voicing classification and fundamental frequency. 7.3 Future work This work is concerned with speech reconstruction from the MFCC vector and fundamental frequency. All the parameters for the sinusoidal or source-filter models are estimated from the estimated spectral envelopes. Obviously, an accurate estimated spectral envelope is very important to reconstruct high quality speech. Current techniques can calculate the spectral envelope accurately in the low and middle frequency bands. However, some differences between the estimated spectral envelope and spectral magnitude appear in high frequency bands and these differences are derived from smoothing and truncation effects during the MFCC extraction. Further analysis could be investigated and post processing method such that formant tightening might be introduced to reduce those effects so as to estimate the spectral envelope from the MFCC vector more accurately.

174 7.3. Future work 157 The auditory model is a powerful tool to analyse acoustic signals and it has been applied to estimate robust fundamental frequencies from noisy speech signals. The techniques used in robust fundamental frequency estimation have successfully identified clean frequency channels and this might provide a clue for the extraction of robust speech recognition features to improve speech recognition accuracy. Noise compensation for noisy channels might be studied as an alternative method to spectral subtraction. One of the most important contributions of this thesis is that an extendable HMM-GMM framework has successfully been built to predict the fundamental frequencies and voicing classification from MFCC vectors. Currently, this framework has been applied to predict fundamental frequency from a clean speech signal. The performance of this framework in a noisy environment needs to be verified and multiple training might be introduced. The system parameters, such as the number of clusters, need to be optimised. Two acoustic models (digit models and monophone models) are utilised within this framework. It can be anticipated that the performance will be improved if the co-articulation is considered such as bi-phone models or tri-phone models and a language model might be given more constraints for improving state identifications. However, applying bi-phone or tri-phone models could introduce a sparse training problem. Fortunately, most of the achievements in acoustic modeling, such as tied state, broad classifications, can be included to the HMM-GMM framework.

Appendix A

Conditional Distributions of the Multivariate Normal Distribution

Definition

Assume X ∈ R^p follows a p-dimensional multivariate normal distribution with mean vector μ and covariance matrix Σ. This can be denoted as X ~ N_p(μ, Σ). If X is partitioned as

X = [X_1, X_2]^T   with   μ = [μ_1, μ_2]^T   and   Σ = [[Σ_11, Σ_12], [Σ_21, Σ_22]]

where X_1 is q-dimensional (q < p) and X_2 is (p - q)-dimensional, then the conditional distribution of X_2 given X_1 is

(X_2 | X_1) ~ N_{p-q}(μ_2 + Σ_21 Σ_11^{-1} (X_1 - μ_1), Σ_22 - Σ_21 Σ_11^{-1} Σ_12)

Proof

According to the linear combination property of the multivariate normal distribution, if X ~ N_p(μ, Σ) then

BX + b ~ N_r(Bμ + b, B Σ B^T)   [106]

where B is an r × p matrix and b is an r-dimensional vector. Now take

B = [[I_{q×q}, 0_{q×(p-q)}], [-Σ_21 Σ_11^{-1}, I_{(p-q)×(p-q)}]]   (A.1)

where I_{q×q} is a q × q identity matrix, I_{(p-q)×(p-q)} is a (p - q) × (p - q) identity matrix and 0_{q×(p-q)} is a q × (p - q) zero matrix. Then we have

B(X - μ) = [X_1 - μ_1, X_2 - μ_2 - Σ_21 Σ_11^{-1} (X_1 - μ_1)]^T   (A.2)

The covariance matrix of B(X - μ) is

Cov(B(X - μ)) = B Σ B^T = [[Σ_11, 0], [0, Σ_22 - Σ_21 Σ_11^{-1} Σ_12]]   (A.3)

Because the off-diagonal blocks in equation A.3 are zero, the two components in equation A.2 are independent of each other [41]. From this independence it follows that, for a given component (X_1 - μ_1), the distribution of (X_2 - μ_2 - Σ_21 Σ_11^{-1} (X_1 - μ_1)) can be written as

(X_2 - μ_2 - Σ_21 Σ_11^{-1} (X_1 - μ_1)) ~ N_{p-q}(0, Σ_22 - Σ_21 Σ_11^{-1} Σ_12)   (A.4)

Equation A.4 can be rearranged as

(X_2 | X_1) ~ N_{p-q}(μ_2 + Σ_21 Σ_11^{-1} (X_1 - μ_1), Σ_22 - Σ_21 Σ_11^{-1} Σ_12)   (A.5)
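The result in equation A.5 can be checked numerically with a small Monte Carlo experiment: sample the joint Gaussian, keep the samples whose X_1 falls close to a chosen value, and compare the empirical mean and variance of X_2 with the closed-form conditional moments. The dimensions and parameter values below are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small joint Gaussian over (X1, X2), with X1 two-dimensional and X2 scalar
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[1.0, 0.3, 0.4],
                  [0.3, 0.8, 0.2],
                  [0.4, 0.2, 1.5]])
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

# Closed-form conditional mean and variance of X2 given X1 = x1 (equation A.5)
x1 = np.array([1.2, -0.1])
cond_mean = (mu[2] + S21 @ np.linalg.solve(S11, x1 - mu[:2])).item()
cond_var = (S22 - S21 @ np.linalg.solve(S11, S12)).item()

# Monte Carlo check: sample the joint density and keep samples with X1 near x1
samples = rng.multivariate_normal(mu, Sigma, size=500_000)
near = np.linalg.norm(samples[:, :2] - x1, axis=1) < 0.1
print(cond_mean, samples[near, 2].mean())  # the two values should agree closely
print(cond_var, samples[near, 2].var())
```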

Appendix B

Speech Quality Evaluation

This appendix presents the histograms of raw scores for the speech quality evaluation; an analysis of the data is presented in the second part of the appendix. Fifty-two utterances, which are simple, meaningful, short sentences, were randomly chosen from the Messiah sentences for the evaluation of different codec methods. The original samples are in linearly encoded PCM format with 16 bits per sample, sampled at 16 kHz with big-endian bit ordering. These utterances were then encoded and decoded using four speech codecs - GSM, CELP, the sinusoidal model and the source-filter model - for evaluation. Ten subjects aged from 19 to 25 were employed to rate the speech quality of each sentence encoded/decoded by the different codecs using the absolute category rating (ACR) scale ranging from 1 (bad) to 5 (excellent). None had participated in any similar tests in the previous 12 months. The ACR test environment was set up in the listening room at UEA, and the experiments were implemented [101] according to the requirements of the International Telecommunication Union (ITU) recommendation [62].
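For reference, aggregating the ACR ratings into mean opinion scores and running the two-sample t-test described in section B.2 can be done along the following lines. The array layout and function name are assumptions, and scipy's ttest_ind is used as a generic two-sample t-test rather than the exact procedure followed in the listening-test analysis.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_codecs(scores_a, scores_b):
    """MOS per codec and a two-sample t-test on the per-utterance means.

    scores_a, scores_b : arrays of shape (n_utterances, n_listeners) containing
                         ACR ratings from 1 (bad) to 5 (excellent)
    """
    mos_a = np.mean(scores_a, axis=1)   # mean opinion score per utterance
    mos_b = np.mean(scores_b, axis=1)
    t, p = ttest_ind(mos_a, mos_b)      # assumes roughly equal variances
    return mos_a.mean(), mos_b.mean(), t, p
```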

B.1 Histogram of raw scores

Each figure in this section presents the histogram of raw scores of the Mean Opinion Score (MOS) for each speech codec obtained using the ACR test.

Figure B.1: Histogram of absolute category rating (ACR) for GSM codec

Figure B.2: Histogram of absolute category rating (ACR) for CELP codec

Figure B.3: Histogram of absolute category rating (ACR) for reconstructed speech from MFCC and fundamental frequency using the sinusoidal model

Figure B.4: Histogram of absolute category rating (ACR) for reconstructed speech from MFCC and fundamental frequency using the source-filter model

B.2 Evaluation of data

In order to analyse the data obtained in the previous section, a two-sample t-test has been chosen to compare the performance of each codec with that of every other codec. The t-test is a method for testing hypotheses about the mean of a normal distribution when the variance is unknown [34]. The central limit theorem states that the sampling distribution of the mean is asymptotically normal as the number of samples used to estimate the mean is increased. In these experiments, the mean was estimated from 52 samples, so we can assume with a high degree of confidence that the distribution of the sample


More information

Linguistic Phonetics. Spectral Analysis

Linguistic Phonetics. Spectral Analysis 24.963 Linguistic Phonetics Spectral Analysis 4 4 Frequency (Hz) 1 Reading for next week: Liljencrants & Lindblom 1972. Assignment: Lip-rounding assignment, due 1/15. 2 Spectral analysis techniques There

More information

Digital Speech Processing and Coding

Digital Speech Processing and Coding ENEE408G Spring 2006 Lecture-2 Digital Speech Processing and Coding Spring 06 Instructor: Shihab Shamma Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/class/enee408g/

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 Speech and telephone speech Based on a voice production model Parametric representation

More information

Cepstrum alanysis of speech signals

Cepstrum alanysis of speech signals Cepstrum alanysis of speech signals ELEC-E5520 Speech and language processing methods Spring 2016 Mikko Kurimo 1 /48 Contents Literature and other material Idea and history of cepstrum Cepstrum and LP

More information

Advanced audio analysis. Martin Gasser

Advanced audio analysis. Martin Gasser Advanced audio analysis Martin Gasser Motivation Which methods are common in MIR research? How can we parameterize audio signals? Interesting dimensions of audio: Spectral/ time/melody structure, high

More information

Audio Signal Compression using DCT and LPC Techniques

Audio Signal Compression using DCT and LPC Techniques Audio Signal Compression using DCT and LPC Techniques P. Sandhya Rani#1, D.Nanaji#2, V.Ramesh#3,K.V.S. Kiran#4 #Student, Department of ECE, Lendi Institute Of Engineering And Technology, Vizianagaram,

More information

Chapter IV THEORY OF CELP CODING

Chapter IV THEORY OF CELP CODING Chapter IV THEORY OF CELP CODING CHAPTER IV THEORY OF CELP CODING 4.1 Introduction Wavefonn coders fail to produce high quality speech at bit rate lower than 16 kbps. Source coders, such as LPC vocoders,

More information

COMP 546, Winter 2017 lecture 20 - sound 2

COMP 546, Winter 2017 lecture 20 - sound 2 Today we will examine two types of sounds that are of great interest: music and speech. We will see how a frequency domain analysis is fundamental to both. Musical sounds Let s begin by briefly considering

More information

Adaptive Filters Application of Linear Prediction

Adaptive Filters Application of Linear Prediction Adaptive Filters Application of Linear Prediction Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Technology Digital Signal Processing

More information

Analysis/synthesis coding

Analysis/synthesis coding TSBK06 speech coding p.1/32 Analysis/synthesis coding Many speech coders are based on a principle called analysis/synthesis coding. Instead of coding a waveform, as is normally done in general audio coders

More information

Enhanced Waveform Interpolative Coding at 4 kbps

Enhanced Waveform Interpolative Coding at 4 kbps Enhanced Waveform Interpolative Coding at 4 kbps Oded Gottesman, and Allen Gersho Signal Compression Lab. University of California, Santa Barbara E-mail: [oded, gersho]@scl.ece.ucsb.edu Signal Compression

More information

Robust Voice Activity Detection Based on Discrete Wavelet. Transform

Robust Voice Activity Detection Based on Discrete Wavelet. Transform Robust Voice Activity Detection Based on Discrete Wavelet Transform Kun-Ching Wang Department of Information Technology & Communication Shin Chien University kunching@mail.kh.usc.edu.tw Abstract This paper

More information

Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta

Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied

More information

SOUND SOURCE RECOGNITION AND MODELING

SOUND SOURCE RECOGNITION AND MODELING SOUND SOURCE RECOGNITION AND MODELING CASA seminar, summer 2000 Antti Eronen antti.eronen@tut.fi Contents: Basics of human sound source recognition Timbre Voice recognition Recognition of environmental

More information

Lab 8. ANALYSIS OF COMPLEX SOUNDS AND SPEECH ANALYSIS Amplitude, loudness, and decibels

Lab 8. ANALYSIS OF COMPLEX SOUNDS AND SPEECH ANALYSIS Amplitude, loudness, and decibels Lab 8. ANALYSIS OF COMPLEX SOUNDS AND SPEECH ANALYSIS Amplitude, loudness, and decibels A complex sound with particular frequency can be analyzed and quantified by its Fourier spectrum: the relative amplitudes

More information

INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006

INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006 1. Resonators and Filters INTRODUCTION TO ACOUSTIC PHONETICS 2 Hilary Term, week 6 22 February 2006 Different vibrating objects are tuned to specific frequencies; these frequencies at which a particular

More information

Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt

Pattern Recognition. Part 6: Bandwidth Extension. Gerhard Schmidt Pattern Recognition Part 6: Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory

More information

Wideband Speech Coding & Its Application

Wideband Speech Coding & Its Application Wideband Speech Coding & Its Application Apeksha B. landge. M.E. [student] Aditya Engineering College Beed Prof. Amir Lodhi. Guide & HOD, Aditya Engineering College Beed ABSTRACT: Increasing the bandwidth

More information

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015

International Journal of Modern Trends in Engineering and Research   e-issn No.: , Date: 2-4 July, 2015 International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 Analysis of Speech Signal Using Graphic User Interface Solly Joy 1, Savitha

More information

The Channel Vocoder (analyzer):

The Channel Vocoder (analyzer): Vocoders 1 The Channel Vocoder (analyzer): The channel vocoder employs a bank of bandpass filters, Each having a bandwidth between 100 Hz and 300 Hz. Typically, 16-20 linear phase FIR filter are used.

More information

Comparison of CELP speech coder with a wavelet method

Comparison of CELP speech coder with a wavelet method University of Kentucky UKnowledge University of Kentucky Master's Theses Graduate School 2006 Comparison of CELP speech coder with a wavelet method Sriram Nagaswamy University of Kentucky, sriramn@gmail.com

More information

Speech Coding using Linear Prediction

Speech Coding using Linear Prediction Speech Coding using Linear Prediction Jesper Kjær Nielsen Aalborg University and Bang & Olufsen jkn@es.aau.dk September 10, 2015 1 Background Speech is generated when air is pushed from the lungs through

More information

Auditory modelling for speech processing in the perceptual domain

Auditory modelling for speech processing in the perceptual domain ANZIAM J. 45 (E) ppc964 C980, 2004 C964 Auditory modelling for speech processing in the perceptual domain L. Lin E. Ambikairajah W. H. Holmes (Received 8 August 2003; revised 28 January 2004) Abstract

More information

Isolated Digit Recognition Using MFCC AND DTW

Isolated Digit Recognition Using MFCC AND DTW MarutiLimkar a, RamaRao b & VidyaSagvekar c a Terna collegeof Engineering, Department of Electronics Engineering, Mumbai University, India b Vidyalankar Institute of Technology, Department ofelectronics

More information

EE482: Digital Signal Processing Applications

EE482: Digital Signal Processing Applications Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 14 Quiz 04 Review 14/04/07 http://www.ee.unlv.edu/~b1morris/ee482/

More information

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches Performance study of Text-independent Speaker identification system using & I for Telephone and Microphone Speeches Ruchi Chaudhary, National Technical Research Organization Abstract: A state-of-the-art

More information

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Author Shannon, Ben, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium

More information

Auditory Based Feature Vectors for Speech Recognition Systems

Auditory Based Feature Vectors for Speech Recognition Systems Auditory Based Feature Vectors for Speech Recognition Systems Dr. Waleed H. Abdulla Electrical & Computer Engineering Department The University of Auckland, New Zealand [w.abdulla@auckland.ac.nz] 1 Outlines

More information

Digital Signal Processing

Digital Signal Processing COMP ENG 4TL4: Digital Signal Processing Notes for Lecture #27 Tuesday, November 11, 23 6. SPECTRAL ANALYSIS AND ESTIMATION 6.1 Introduction to Spectral Analysis and Estimation The discrete-time Fourier

More information

Introducing COVAREP: A collaborative voice analysis repository for speech technologies

Introducing COVAREP: A collaborative voice analysis repository for speech technologies Introducing COVAREP: A collaborative voice analysis repository for speech technologies John Kane Wednesday November 27th, 2013 SIGMEDIA-group TCD COVAREP - Open-source speech processing repository 1 Introduction

More information

EE 225D LECTURE ON MEDIUM AND HIGH RATE CODING. University of California Berkeley

EE 225D LECTURE ON MEDIUM AND HIGH RATE CODING. University of California Berkeley University of California Berkeley College of Engineering Department of Electrical Engineering and Computer Sciences Professors : N.Morgan / B.Gold EE225D Spring,1999 Medium & High Rate Coding Lecture 26

More information

Digital Signal Processing

Digital Signal Processing Digital Signal Processing Fourth Edition John G. Proakis Department of Electrical and Computer Engineering Northeastern University Boston, Massachusetts Dimitris G. Manolakis MIT Lincoln Laboratory Lexington,

More information

APPLICATIONS OF DSP OBJECTIVES

APPLICATIONS OF DSP OBJECTIVES APPLICATIONS OF DSP OBJECTIVES This lecture will discuss the following: Introduce analog and digital waveform coding Introduce Pulse Coded Modulation Consider speech-coding principles Introduce the channel

More information

Voice Excited Lpc for Speech Compression by V/Uv Classification

Voice Excited Lpc for Speech Compression by V/Uv Classification IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 3, Ver. II (May. -Jun. 2016), PP 65-69 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Voice Excited Lpc for Speech

More information

CS 188: Artificial Intelligence Spring Speech in an Hour

CS 188: Artificial Intelligence Spring Speech in an Hour CS 188: Artificial Intelligence Spring 2006 Lecture 19: Speech Recognition 3/23/2006 Dan Klein UC Berkeley Many slides from Dan Jurafsky Speech in an Hour Speech input is an acoustic wave form s p ee ch

More information

Vocoder (LPC) Analysis by Variation of Input Parameters and Signals

Vocoder (LPC) Analysis by Variation of Input Parameters and Signals ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Vocoder (LPC) Analysis by Variation of Input Parameters and Signals Abstract Gupta Rajani, Mehta Alok K. and Tiwari Vebhav Truba College of

More information

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications

Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Brochure More information from http://www.researchandmarkets.com/reports/569388/ Multimedia Signal Processing: Theory and Applications in Speech, Music and Communications Description: Multimedia Signal

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter

Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter Speech Enhancement in Presence of Noise using Spectral Subtraction and Wiener Filter 1 Gupteswar Sahu, 2 D. Arun Kumar, 3 M. Bala Krishna and 4 Jami Venkata Suman Assistant Professor, Department of ECE,

More information

COMPRESSIVE SAMPLING OF SPEECH SIGNALS. Mona Hussein Ramadan. BS, Sebha University, Submitted to the Graduate Faculty of

COMPRESSIVE SAMPLING OF SPEECH SIGNALS. Mona Hussein Ramadan. BS, Sebha University, Submitted to the Graduate Faculty of COMPRESSIVE SAMPLING OF SPEECH SIGNALS by Mona Hussein Ramadan BS, Sebha University, 25 Submitted to the Graduate Faculty of Swanson School of Engineering in partial fulfillment of the requirements for

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing

Project 0: Part 2 A second hands-on lab on Speech Processing Frequency-domain processing Project : Part 2 A second hands-on lab on Speech Processing Frequency-domain processing February 24, 217 During this lab, you will have a first contact on frequency domain analysis of speech signals. You

More information

CHAPTER 2 FIR ARCHITECTURE FOR THE FILTER BANK OF SPEECH PROCESSOR

CHAPTER 2 FIR ARCHITECTURE FOR THE FILTER BANK OF SPEECH PROCESSOR 22 CHAPTER 2 FIR ARCHITECTURE FOR THE FILTER BANK OF SPEECH PROCESSOR 2.1 INTRODUCTION A CI is a device that can provide a sense of sound to people who are deaf or profoundly hearing-impaired. Filters

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 5: 12 Feb 2009. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence

More information

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping

Structure of Speech. Physical acoustics Time-domain representation Frequency domain representation Sound shaping Structure of Speech Physical acoustics Time-domain representation Frequency domain representation Sound shaping Speech acoustics Source-Filter Theory Speech Source characteristics Speech Filter characteristics

More information

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques

Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques 81 Isolated Word Recognition Based on Combination of Multiple Noise-Robust Techniques Noboru Hayasaka 1, Non-member ABSTRACT

More information

Announcements. Today. Speech and Language. State Path Trellis. HMMs: MLE Queries. Introduction to Artificial Intelligence. V22.

Announcements. Today. Speech and Language. State Path Trellis. HMMs: MLE Queries. Introduction to Artificial Intelligence. V22. Introduction to Artificial Intelligence Announcements V22.0472-001 Fall 2009 Lecture 19: Speech Recognition & Viterbi Decoding Rob Fergus Dept of Computer Science, Courant Institute, NYU Slides from John

More information

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) Proceedings of the 2 nd International Conference on Current Trends in Engineering and Management ICCTEM -214 ISSN

More information

International Journal of Engineering and Techniques - Volume 1 Issue 6, Nov Dec 2015

International Journal of Engineering and Techniques - Volume 1 Issue 6, Nov Dec 2015 RESEARCH ARTICLE OPEN ACCESS A Comparative Study on Feature Extraction Technique for Isolated Word Speech Recognition Easwari.N 1, Ponmuthuramalingam.P 2 1,2 (PG & Research Department of Computer Science,

More information

SPEECH AND SPECTRAL ANALYSIS

SPEECH AND SPECTRAL ANALYSIS SPEECH AND SPECTRAL ANALYSIS 1 Sound waves: production in general: acoustic interference vibration (carried by some propagation medium) variations in air pressure speech: actions of the articulatory organs

More information

Speech Enhancement using Wiener filtering

Speech Enhancement using Wiener filtering Speech Enhancement using Wiener filtering S. Chirtmay and M. Tahernezhadi Department of Electrical Engineering Northern Illinois University DeKalb, IL 60115 ABSTRACT The problem of reducing the disturbing

More information

Single Channel Speaker Segregation using Sinusoidal Residual Modeling

Single Channel Speaker Segregation using Sinusoidal Residual Modeling NCC 2009, January 16-18, IIT Guwahati 294 Single Channel Speaker Segregation using Sinusoidal Residual Modeling Rajesh M Hegde and A. Srinivas Dept. of Electrical Engineering Indian Institute of Technology

More information

techniques are means of reducing the bandwidth needed to represent the human voice. In mobile

techniques are means of reducing the bandwidth needed to represent the human voice. In mobile 8 2. LITERATURE SURVEY The available radio spectrum for the wireless radio communication is very limited hence to accommodate maximum number of users the speech is compressed. The speech compression techniques

More information

Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a

Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a R E S E A R C H R E P O R T I D I A P Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a IDIAP RR 7-7 January 8 submitted for publication a IDIAP Research Institute,

More information

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner.

Perception of pitch. Definitions. Why is pitch important? BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb A. Faulkner. Perception of pitch BSc Audiology/MSc SHS Psychoacoustics wk 4: 7 Feb 2008. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum,

More information

Lecture 6: Speech modeling and synthesis

Lecture 6: Speech modeling and synthesis EE E682: Speech & Audio Processing & Recognition Lecture 6: Speech modeling and synthesis 1 2 3 4 5 Modeling speech signals Spectral and cepstral models Linear Predictive models (LPC) Other signal models

More information

GLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES

GLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES Clemson University TigerPrints All Dissertations Dissertations 5-2012 GLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES Yiqiao Chen Clemson University, rls_lms@yahoo.com

More information

DERIVATION OF TRAPS IN AUDITORY DOMAIN

DERIVATION OF TRAPS IN AUDITORY DOMAIN DERIVATION OF TRAPS IN AUDITORY DOMAIN Petr Motlíček, Doctoral Degree Programme (4) Dept. of Computer Graphics and Multimedia, FIT, BUT E-mail: motlicek@fit.vutbr.cz Supervised by: Dr. Jan Černocký, Prof.

More information

Adaptive Filters Linear Prediction

Adaptive Filters Linear Prediction Adaptive Filters Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Institute of Electrical and Information Engineering Digital Signal Processing and System Theory Slide 1 Contents

More information

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner.

Perception of pitch. Importance of pitch: 2. mother hemp horse. scold. Definitions. Why is pitch important? AUDL4007: 11 Feb A. Faulkner. Perception of pitch AUDL4007: 11 Feb 2010. A. Faulkner. See Moore, BCJ Introduction to the Psychology of Hearing, Chapter 5. Or Plack CJ The Sense of Hearing Lawrence Erlbaum, 2005 Chapter 7 1 Definitions

More information

Speech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065

Speech Processing. Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 Speech Processing Undergraduate course code: LASC10061 Postgraduate course code: LASC11065 All course materials and handouts are the same for both versions. Differences: credits (20 for UG, 10 for PG);

More information

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter

Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Reduction of Musical Residual Noise Using Harmonic- Adapted-Median Filter Ching-Ta Lu, Kun-Fu Tseng 2, Chih-Tsung Chen 2 Department of Information Communication, Asia University, Taichung, Taiwan, ROC

More information

Voice Activity Detection

Voice Activity Detection Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class

More information

Lecture 5: Speech modeling. The speech signal

Lecture 5: Speech modeling. The speech signal EE E68: Speech & Audio Processing & Recognition Lecture 5: Speech modeling 1 3 4 5 Modeling speech signals Spectral and cepstral models Linear Predictive models (LPC) Other signal models Speech synthesis

More information

Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP

Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP Speech Coding Technique And Analysis Of Speech Codec Using CS-ACELP Monika S.Yadav Vidarbha Institute of Technology Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur, India monika.yadav@rediffmail.com

More information

Psychology of Language

Psychology of Language PSYCH 150 / LIN 155 UCI COGNITIVE SCIENCES syn lab Psychology of Language Prof. Jon Sprouse 01.10.13: The Mental Representation of Speech Sounds 1 A logical organization For clarity s sake, we ll organize

More information

NOVEL PITCH DETECTION ALGORITHM WITH APPLICATION TO SPEECH CODING

NOVEL PITCH DETECTION ALGORITHM WITH APPLICATION TO SPEECH CODING NOVEL PITCH DETECTION ALGORITHM WITH APPLICATION TO SPEECH CODING A Thesis Submitted to the Graduate Faculty of the University of New Orleans in partial fulfillment of the requirements for the degree of

More information

RECENTLY, there has been an increasing interest in noisy

RECENTLY, there has been an increasing interest in noisy IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 9, SEPTEMBER 2005 535 Warped Discrete Cosine Transform-Based Noisy Speech Enhancement Joon-Hyuk Chang, Member, IEEE Abstract In

More information

Applications of Music Processing

Applications of Music Processing Lecture Music Processing Applications of Music Processing Christian Dittmar International Audio Laboratories Erlangen christian.dittmar@audiolabs-erlangen.de Singing Voice Detection Important pre-requisite

More information

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues

Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues Effects of Reverberation on Pitch, Onset/Offset, and Binaural Cues DeLiang Wang Perception & Neurodynamics Lab The Ohio State University Outline of presentation Introduction Human performance Reverberation

More information

Speech Enhancement Based On Noise Reduction

Speech Enhancement Based On Noise Reduction Speech Enhancement Based On Noise Reduction Kundan Kumar Singh Electrical Engineering Department University Of Rochester ksingh11@z.rochester.edu ABSTRACT This paper addresses the problem of signal distortion

More information

An Improved Voice Activity Detection Based on Deep Belief Networks

An Improved Voice Activity Detection Based on Deep Belief Networks e-issn 2455 1392 Volume 2 Issue 4, April 2016 pp. 676-683 Scientific Journal Impact Factor : 3.468 http://www.ijcter.com An Improved Voice Activity Detection Based on Deep Belief Networks Shabeeba T. K.

More information

Fundamental Frequency Detection

Fundamental Frequency Detection Fundamental Frequency Detection Jan Černocký, Valentina Hubeika {cernocky ihubeika}@fit.vutbr.cz DCGM FIT BUT Brno Fundamental Frequency Detection Jan Černocký, Valentina Hubeika, DCGM FIT BUT Brno 1/37

More information

Time-Frequency Distributions for Automatic Speech Recognition

Time-Frequency Distributions for Automatic Speech Recognition 196 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 3, MARCH 2001 Time-Frequency Distributions for Automatic Speech Recognition Alexandros Potamianos, Member, IEEE, and Petros Maragos, Fellow,

More information

Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation

Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Quantification of glottal and voiced speech harmonicsto-noise ratios using cepstral-based estimation Peter J. Murphy and Olatunji O. Akande, Department of Electronic and Computer Engineering University

More information

Performance analysis of voice activity detection algorithm for robust speech recognition system under different noisy environment

Performance analysis of voice activity detection algorithm for robust speech recognition system under different noisy environment BABU et al: VOICE ACTIVITY DETECTION ALGORITHM FOR ROBUST SPEECH RECOGNITION SYSTEM Journal of Scientific & Industrial Research Vol. 69, July 2010, pp. 515-522 515 Performance analysis of voice activity

More information

DECOMPOSITION OF SPEECH INTO VOICED AND UNVOICED COMPONENTS BASED ON A KALMAN FILTERBANK

DECOMPOSITION OF SPEECH INTO VOICED AND UNVOICED COMPONENTS BASED ON A KALMAN FILTERBANK DECOMPOSITIO OF SPEECH ITO VOICED AD UVOICED COMPOETS BASED O A KALMA FILTERBAK Mark Thomson, Simon Boland, Michael Smithers 3, Mike Wu & Julien Epps Motorola Labs, Botany, SW 09 Cross Avaya R & D, orth

More information

Pitch Period of Speech Signals Preface, Determination and Transformation

Pitch Period of Speech Signals Preface, Determination and Transformation Pitch Period of Speech Signals Preface, Determination and Transformation Mohammad Hossein Saeidinezhad 1, Bahareh Karamsichani 2, Ehsan Movahedi 3 1 Islamic Azad university, Najafabad Branch, Saidinezhad@yahoo.com

More information

A Parametric Model for Spectral Sound Synthesis of Musical Sounds

A Parametric Model for Spectral Sound Synthesis of Musical Sounds A Parametric Model for Spectral Sound Synthesis of Musical Sounds Cornelia Kreutzer University of Limerick ECE Department Limerick, Ireland cornelia.kreutzer@ul.ie Jacqueline Walker University of Limerick

More information

L19: Prosodic modification of speech

L19: Prosodic modification of speech L19: Prosodic modification of speech Time-domain pitch synchronous overlap add (TD-PSOLA) Linear-prediction PSOLA Frequency-domain PSOLA Sinusoidal models Harmonic + noise models STRAIGHT This lecture

More information

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN Yu Wang and Mike Brookes Department of Electrical and Electronic Engineering, Exhibition Road, Imperial College London,

More information

High-speed Noise Cancellation with Microphone Array

High-speed Noise Cancellation with Microphone Array Noise Cancellation a Posteriori Probability, Maximum Criteria Independent Component Analysis High-speed Noise Cancellation with Microphone Array We propose the use of a microphone array based on independent

More information

Bandwidth Extension for Speech Enhancement

Bandwidth Extension for Speech Enhancement Bandwidth Extension for Speech Enhancement F. Mustiere, M. Bouchard, M. Bolic University of Ottawa Tuesday, May 4 th 2010 CCECE 2010: Signal and Multimedia Processing 1 2 3 4 Current Topic 1 2 3 4 Context

More information

Cellular systems & GSM Wireless Systems, a.a. 2014/2015

Cellular systems & GSM Wireless Systems, a.a. 2014/2015 Cellular systems & GSM Wireless Systems, a.a. 2014/2015 Un. of Rome La Sapienza Chiara Petrioli Department of Computer Science University of Rome Sapienza Italy 2 Voice Coding 3 Speech signals Voice coding:

More information

Improving Sound Quality by Bandwidth Extension

Improving Sound Quality by Bandwidth Extension International Journal of Scientific & Engineering Research, Volume 3, Issue 9, September-212 Improving Sound Quality by Bandwidth Extension M. Pradeepa, M.Tech, Assistant Professor Abstract - In recent

More information

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,

More information