STRUCTURE-BASED SPEECH CLASSIFICATION USING NON-LINEAR EMBEDDING TECHNIQUES. A Thesis Proposal Submitted to the Temple University Graduate Board
STRUCTURE-BASED SPEECH CLASSIFICATION USING NON-LINEAR EMBEDDING TECHNIQUES

A Thesis Proposal Submitted to the Temple University Graduate Board in Partial Fulfillment of the Requirements for the Degree Master of Science in Engineering

By Uchechukwu Ofoegbu
May, 2004

Dr. Robert Yantorno, Thesis Advisor
Dr. Saroj K. Biswas, Director of Graduate Studies, College of Engineering, Committee Member
Dr. Musoke H. Sendaula, Graduate Director, Electrical & Computer Engineering, Committee Member
ABSTRACT

Usable speech refers to those portions of corrupted speech from which a reasonable number of distinguishing features of the speaker can still be determined. It has previously been shown that using only the voiced segments of speech improves usable speech detection, and also that unvoiced speech does not contribute significant information about the speaker(s) for speaker identification. Therefore, voiced portions of co-channel speech are usually detected with a voiced/unvoiced classifier and extracted for use in usable speech extraction systems. The process of human speech production is complex, nonlinear and nonstationary; its most precise description can only be realized in terms of nonlinear fluid dynamics. Traditionally, though, it has been described using linear techniques such as the source-filter model and spectral analysis. These techniques work very well for many aspects of speech analysis, but they are inherently limited in their ability to describe the true dynamics of speech production. In this research, a non-linear speech classification approach is proposed, which classifies speech based on features extracted after processing the input signal via an embedding technique known as Takens' method of delays. Unvoiced speech and unusable speech are similar in structure, as the former is noise-like in nature, while the latter contains a significant amount of interference. Likewise, the structure of voiced speech is comparable to that of usable speech. Based on this, the proposed technique attempts to classify speech both as voiced or unvoiced and as usable or unusable, using different features extracted from the embedded signals. Preliminary experiments have shown that this technique is capable of correctly detecting 96% of voiced speech (with 1% false alarms) and 90% of unvoiced speech (with 4% false alarms) in a noise-free environment.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF EQUATIONS
LIST OF FIGURES

CHAPTERS
1. INTRODUCTION
   1.1. Motivation
   1.2. Nonlinear Embedding
   1.3. Problem Statement and Research Goals
   1.4. Scope of Research
   1.5. Disclaimer
   1.6. Organization of Thesis Proposal
2. BACKGROUND
   2.1. Literature Review
      2.1.1. Voiced and Unvoiced Speech
      2.1.2. Usable and Unusable Speech
   2.2. Traditional Voiced/Unvoiced Detection Measures
      2.2.1. Energy and Zero-Crossings
      2.2.2. 1st-Order Reflection Coefficients and Residual Energy
   2.3. Usable/Unusable Detection Measures
      2.3.1. SAPVR
      2.3.2. APPC
3. CURVATURE MEASURE
   3.1. Introduction to Curvature
   3.2. Preliminary Research: Voiced/Unvoiced Classification
      3.2.1. Noise and Filtering
      3.2.2. Experiments and Results
      3.2.3. End-Point Detection
      3.2.4. Result Comparisons
   3.3. Proposed Research: Voiced/Unvoiced/Silence Classification
4. NODAL DENSITY MEASURE
   4.1. Introduction to Nodal Density
   4.2. Preliminary Research: Usable/Unusable Classification
      4.2.1. Preliminary Experiments and Results
      4.2.2. Discussion
   4.3. Proposed Research: Voiced/Unvoiced Classification
5. DIFFERENCE-MEAN COMPARISON MEASURE
   5.1. Introduction to Difference-Mean Comparison
   5.2. Experiments and Results
6. SUMMARY
BIBLIOGRAPHY
LIST OF EQUATIONS

1.1 Vector-Valued Trajectory Formed by Takens' Method of Delays
2.1 Target-to-Interferer Ratio (TIR)
2.2 Energy
2.3 First-Order Reflection Coefficient
2.4 Denominator of the First-Order Reflection Coefficient
2.5 Numerator of the First-Order Reflection Coefficient
3.1 TNB Matrix (Serret-Frenet Theorem)
3.2 Curvature
3.3 Curvature Estimation
3.4 Elemental Arc Length of the Discrete Embedding Curve
Moving Average Filter
N×N 1st-Order Difference
LIST OF FIGURES

2.1 Periodic Nature of Voiced Speech versus Aperiodic Nature of Unvoiced Speech
2.2 Voiced/Unvoiced Detection Using Energy and Zero-Crossings
2.3 Signal Generated by a Speech Utterance, Its Zero-Crossing Rate and Its Energy
2.4 1st-Order Reflection Coefficient and Residual Energy Plotted Against the Corresponding Speech Utterance
2.5 SAPVR-Based Usable/Unusable Speech Separation Process
2.6 Sampled Signal, FFT Magnitude and Spectral Autocorrelation of Single-Speaker and Co-Channel Speech
2.7 Single-Speaker Voiced Speech and Its Adjacent Pitch Period Amplitude Comparison
2.8 Co-Channel Voiced Speech and Its Adjacent Pitch Period Amplitude Comparison
3.1 TNB Frame Classification
3.2 Embedded Voiced and Unvoiced Speech Frames
3.3 Curvature and Energy Plotted Against the Corresponding Speech Utterance
3.4 Curvature Distribution for Clean Speech
3.5 Curvature Distribution for Clean Speech and Speech + Pink Noise at 15dB SNR
3.6 Curvature Distribution for Clean Speech and Speech + White Noise at 15dB SNR
3.7 Curvature Distribution for Clean Speech and Speech + Pink Noise at 15dB SNR After Filtering
3.8 Curvature Distribution for Clean Speech and Speech + White Noise at 15dB SNR After Filtering
3.9 Curvature-Based Decision Process
3.10 ROC for Different Noise States + Voiced Speech
3.11 ROC for Different Noise States + Unvoiced Speech
3.12 Curvature-Based Decisions for Clean Speech
3.13 Curvature-Based Decisions for Corresponding Speech + Added Pink Noise at 15dB SNR
3.14 Curvature-Based Decisions for Corresponding Speech + Added White Noise at 15dB SNR
3.15 Speech Data, Actual Classification, Curvature-Based Classification, and Difference Between Actual and Curvature-Based Classifications
3.16 Comparisons of Hits Minus False Alarms for Voiced Speech
3.17 Comparisons of Hits Minus False Alarms for Unvoiced Speech
3.18 Embedded Voiced, Unvoiced and Background Speech Frames with Added Pink Noise at 15dB SNR
3.19 Embedded Voiced, Unvoiced and Background Speech Frames with Added White Noise at 15dB SNR
3.20 Embedded Voiced, Unvoiced and Background Speech Frames with Added Pink Noise at 15dB SNR After Filtering
3.21 Embedded Voiced, Unvoiced and Background Speech Frames with Added White Noise at 15dB SNR After Filtering
4.1 Embedded Frame for Co-Channel Speech of 30dB TIR
4.2 Embedded Frame for Co-Channel Speech of 10dB TIR
4.3 Embedded Frame for Co-Channel (Usable) Speech of 30dB TIR, Gridded to Show Nodes Spanned
4.4 Embedded Frame for Co-Channel (Unusable) Speech of 10dB TIR, Gridded to Show Nodes Spanned
4.5 Nodes Spanned by Embedded Frame for Co-Channel (Usable) Speech of 30dB TIR
4.6 Nodes Spanned by Embedded Frame for Co-Channel (Unusable) Speech of 10dB TIR
4.7 ROC Curve for Usable Speech Detection Using the Nodal Density Approach
4.8 Embedded Voiced Speech, Gridded to Show Nodes Spanned
4.9 Embedded Unvoiced Speech, Gridded to Show Nodes Spanned
4.10 Nodes Spanned by Embedded Voiced Speech
4.11 Nodes Spanned by Embedded Unvoiced Speech
5.1 Difference-Mean Comparison Distribution for Clean Speech
5.2 Difference-Mean Comparison Distribution for Clean Speech and Speech Plus Pink Noise at 15dB SNR
5.3 Difference-Mean Comparison Distribution for Clean Speech and Speech Plus White Noise at 15dB SNR
5.4 Classifier Characteristic Curves for Varying Difference-Mean Comparison Values for Clean Voiced and Unvoiced Speech
5.5 Classifier Characteristic Curves for Varying Difference-Mean Comparison Values for Voiced and Unvoiced Speech Plus Pink Noise at 15dB SNR
5.6 Classifier Characteristic Curves for Varying Difference-Mean Comparison Values for Voiced and Unvoiced Speech Plus White Noise at 15dB SNR
5.7 Hits Minus False Alarms for Voiced Speech
5.8 Hits Minus False Alarms for Unvoiced Speech
CHAPTER 1: INTRODUCTION

1.1. Motivation

Speech signals can be corrupted by two types of interference: background noise or another speaker's speech. The performance of speaker identification systems is known to be adversely affected by the presence of such interference. Various techniques exist for the reduction or elimination of noise distortions in signals (including speech); however, due to the non-stationary properties of speech, complete removal of speech interference has remained a challenge for the speech processing community. Speech interference occurs when two or more speakers are speaking simultaneously over the same channel without a significant difference in their overall energy. This research focuses on two speakers speaking through the same channel at the same time; the resulting speech is commonly termed co-channel speech. Even when the energies of the target and interfering speakers are approximately equal overall, certain portions still exist in co-channel speech in which the energy of one speaker is greater than that of the other. These portions are termed usable, while the remaining portions are termed unusable. The use of only usable portions of speech has been shown to improve the performance of speaker identification systems (Lovekin et al., 2001a), (Iyer et al., 2004). A Target (energy) to Interferer (energy) Ratio (TIR) magnitude of 20dB is considered a suitable threshold for usable/unusable speech classification.
Previous research (Lovekin et al., 2001a) has shown the inability of unvoiced speech to contribute the necessary information about the speaker for speaker identification, due to its noise-like structure; therefore, voiced portions of speech are extracted, using voiced/unvoiced classifiers, for use in usable speech extraction systems. Much research has been performed on the categorization of speech segments as voiced or unvoiced, which has led to the development of traditional voiced/unvoiced classifiers such as Energy and Zero-Crossings (Atal and Rabiner, 1976) and First-Order Reflection Coefficients and Residual Energy (Childers, 2000). These techniques are restricted in their capability to take into consideration the non-linear characteristics of the signals in question, and will therefore omit vital acoustic features, thereby reducing the accuracy of the speech classifier. Usable speech classification techniques have also been introduced which use linear-based approaches such as the Spectral Autocorrelation Peak-to-Valley Ratio (Krishnamachari et al., 2001) and Adjacent Pitch Period Comparison (Lovekin et al., 2001b), along with others (Iyer et al., 2004), (Krishnamachari et al., 2000), (Kizhanatham et al., 2002), (Smolenski et al., 2002), (Sundaram et al., 2003), (Yantorno, 1998). As mentioned, these methods do not take into account the nonlinear features of the signal, thereby ignoring valuable characteristics that could lead to more precise distinctions between heavily and slightly distorted speech signals. Due to the inability of linear-based speech classification systems to account for nonlinear features in speech production, the necessity arises to develop a non-linear-based
method, hence the non-linear embedding technique, which is discussed in the next section. Unvoiced speech and unusable speech are similar in structure, as the former is noise-like in nature, while the latter contains a significant amount of interference. Likewise, the structure of voiced speech is comparable to that of usable speech. Based on this, the proposed technique attempts to classify speech both as voiced or unvoiced and as usable or unusable.

1.2. Non-Linear Embedding

In this section, Takens' method of delays (Takens, 1981), a technique widely used in the analysis of chaotic signals, especially in bio-engineering, is discussed, as well as its application to speech classification. Voiced speech is generated by a comparatively low-dimensional nonlinear dynamical system (Kubin, 1995). It is not viable to directly observe the degrees of freedom of the state variables of this system. Consequently, the problem arises as to how to recover and depict the underlying low-dimensional dynamics from the one-dimensional observable speech signal. In other words, how can the apparently one-dimensional signal obtained from speech be reconstructed to illustrate the actual dynamics of the speech production system? One of the most popular representations of the chaotic nature of signals can be attained via Takens' embedding theorem, which states that it is possible to reconstruct a state space representation topologically equivalent to the original state space of a system from a single observable dimension. The nonlinear dynamic progression of
speech can be observed as a vector which travels along a phase (or state) space trajectory, where the coordinates of the point are the degrees of freedom of the system. The procedure for implementing Takens' theorem is as follows: First, the time series x(t), which in our case is the speech signal, is accumulated in an array, {x(t)} (usually, the speech signal is given as a vector, and, therefore, the accumulation is already performed). A lag, or time delay, d, and an embedding dimension, m, are then used to form the vector-valued trajectory

    V(t) = [v1(t), v2(t), v3(t), ..., vm(t)]    (1.1)

where
    v1(t) = x(t)
    v2(t) = x(t-d)
    v3(t) = x(t-2d)
    ...
    vm(t) = x(t-(m-1)d)

Takens has shown that, provided the embedding dimension, m, is greater than twice the original dimension of the time series, x(t), {V(t)} will be an embedding of {x(t)}, and, in theory, the dynamics of V(t) possess the same qualitative characteristics as those of x(t), regardless of the lag, d. Due to the non-stationary nature of speech production, the embedding procedure is applied to short consecutive segments. Based on the knowledge that the generation of voiced speech constitutes a low-dimensional system as compared with the higher-dimensional nature of the unvoiced speech generation system (Kubin, 1995), an embedding dimension, m, can be chosen to be greater than twice the original dimension of
the speech signal, and yet small enough to clearly distinguish between voiced and unvoiced speech. A dimension of 3, for instance, meets the requirement m > 2n (where n = 1, the original dimension of speech), and is also sufficient to construct well-structured trajectories of the voiced and usable speech signals. However, since unvoiced speech is of a much higher dimension, the structure generated will be chaotic and highly random in nature. Therefore, choosing m = 3 will result in an unambiguous distinction between voiced and unvoiced speech. The delay constant, d, should be large enough for the reconstructed trajectory to be, on average, maximally open in state space, but small enough to preserve the time resolution of the signal. A constant value of d = 12 was found to provide good discrimination between structured (voiced) and unstructured (unvoiced) speech (Terez, 2002). The presence of a significant amount of interfering speech in a voiced speech signal will adversely affect the structure of the signal, giving it a more unvoiced-like structure; hence the use of the embedding technique as a viable candidate for usable speech classification.

1.3. Problem Statement and Research Goals

Performing speaker identification on speech that has been corrupted by interfering speech at a small (less than 15dB) Target-to-Interferer Ratio leads to degradation of system performance. However, since there exist portions of co-channel speech with relatively
high (above 20dB) TIR (i.e., usable speech), the low-TIR portions can be removed in order to minimize the effect of the interfering speech. The idea of usable speech is novel; therefore, a new technique is presented here which can analyze co-channel speech in ways that currently existing methods cannot. Given the low information content of unvoiced speech for speaker identification, the separation of voiced and unvoiced speech is necessary in order to process only those speech segments that are appropriate for the speaker identification system. A novel voiced/unvoiced classification technique, based on non-linear modeling of speech signals, is presented here.

1.4. Scope of Research

In various multiple-way communication systems, co-channel speech is frequently encountered, leading to significant distortion in the output of the system; hence the need for an effective usable speech classification system. One possible application of a usable speech extraction system is the identification of the target pilot's speech among various aircraft pilots speaking over the same channel, at the same time, and with about the same overall signal energy. In usable speech detection systems, unvoiced speech segments are usually detected and removed (based on their unimportance to the system) using voiced/unvoiced classifiers.
Beyond their use in usable speech extraction systems, voiced/unvoiced classifiers are also applied in various acoustic speech processing techniques such as speech recognition and speaker recognition.

1.5. Disclaimer

It must be noted that all speech data used in this research were obtained from the TIMIT database, which is widely used by researchers in the speech processing field. Recordings were made in a very controlled environment, using professional recording equipment, resulting in high-quality speech. Therefore, although the addition of various types and levels of noise to the input speech signal has been investigated, the performance of the voiced/unvoiced classifier and usable speech extraction system presented in this research may be degraded with lower-quality signals.

1.6. Organization of Thesis Proposal

In this thesis proposal, the classification of speech signals into structured (voiced, usable) and unstructured (unvoiced, unusable) categories is investigated. Fundamental descriptions of co-channel speech, voiced/unvoiced speech and non-linear embedding are presented in the current chapter. Chapter 2 covers reviews of voiced/unvoiced and usable/unusable speech classification. In Chapter 3, the curvature measure is introduced, as well as its application to voiced/unvoiced classification; some preliminary experiments and the results obtained are
presented. The application of this measure to voiced/unvoiced/background classification is also introduced. The nodal density measure is introduced in Chapter 4; its application to usable speech detection is discussed, and its implementation for voiced/unvoiced classification is proposed. In Chapter 5, the difference-mean comparison measure is introduced, along with its application to voiced/unvoiced classification. Chapter 6, the summary, concludes this proposal and discusses possible future work, which includes fusing the introduced features to obtain one optimal voiced/unvoiced classifier.
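Before proceeding, the delay-embedding construction of Eq. (1.1) is illustrated concretely. The sketch below (Python/NumPy; the function name and test signal are illustrative, not part of the proposal) uses the values m = 3 and d = 12 discussed above:

```python
import numpy as np

def delay_embed(x, m=3, d=12):
    """Embed a 1-D signal x into m dimensions using the method of delays
    with lag d.  Row t of the result is [x(t), x(t-d), ..., x(t-(m-1)d)],
    following Eq. (1.1)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * d            # number of complete delay vectors
    if n <= 0:
        raise ValueError("signal too short for this (m, d) pair")
    # column k holds x(t - k*d); t runs from (m-1)*d to the end of the signal
    return np.column_stack([x[(m - 1 - k) * d : (m - 1 - k) * d + n]
                            for k in range(m)])

# Example: a 128-sample frame embedded with m = 3, d = 12
frame = np.sin(2 * np.pi * np.arange(128) / 32)   # stand-in for a voiced frame
V = delay_embed(frame, m=3, d=12)
print(V.shape)   # (104, 3): 128 - (3-1)*12 = 104 delay vectors
```

Each row of V is one point of the reconstructed trajectory; the frame-by-frame measures developed in Chapters 3 to 5 operate on such trajectories.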
CHAPTER 2: BACKGROUND

2.1. Literature Review

In this section, the concepts of voiced/unvoiced and usable/unusable speech are discussed in detail.

2.1.1. Voiced and Unvoiced Speech

Voiced speech is produced by a pulsed air flow caused by the vibration of the vocal cords. The resulting signal can be described as a quasi-periodic waveform with high energy and high adjacent-sample correlation. On the other hand, unvoiced speech, which is produced by turbulent air flow resulting from constrictions in the vocal tract, is characterized by a random, aperiodic waveform with low energy and low correlation. Figure 2.1 below illustrates the difference between voiced and unvoiced speech signals.

[Figure: voiced speech (left) and unvoiced speech (right), amplitude versus sample number]
Figure 2.1: Illustration of the periodic nature of voiced speech (left panel) versus the aperiodic nature of unvoiced speech (right panel).

Note in Figure 2.1 the periodic structure of the voiced frame, as opposed to the random structure of the unvoiced frame. Observe, also, the difference in the maximum amplitude of the two frames: the maximum amplitude of the voiced frame is roughly an order of magnitude greater than that of the unvoiced frame, indicating that voiced speech is much higher in energy than unvoiced speech. Accurately classifying speech signals as voiced or unvoiced is essential in speech analysis tasks such as speaker recognition/identification, speech recognition, speech synthesis and speaker count. As discussed earlier, many features exist in speech signals for distinguishing between voiced and unvoiced portions; some of these features have been investigated previously and will be discussed in subsequent sections of this proposal.

2.1.2. Usable and Unusable Speech

The concept of usable speech derives from the fact that not all portions of speech corrupted by co-channel interference are unusable for speech processing. In this research, usability of speech is defined with respect to the Target-to-Interferer Ratio (Yantorno, 1999), (Smolenski, 2004).
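This energy-ratio definition is straightforward to compute from a pair of frames. A minimal sketch (Python/NumPy; function and variable names are illustrative):

```python
import numpy as np

def tir_db(target, interferer):
    """Target-to-Interferer Ratio: 10*log10(Et/Ei), where Et and Ei are the
    frame energies (sums of squared samples) of the target and interferer."""
    et = float(np.sum(np.asarray(target, dtype=float) ** 2))
    ei = float(np.sum(np.asarray(interferer, dtype=float) ** 2))
    return 10.0 * np.log10(et / ei)

# A target frame at ten times the interferer's amplitude has 100 times
# its energy, i.e. a TIR of about 20 dB.
target = np.ones(160)
interferer = 0.1 * np.ones(160)
print(tir_db(target, interferer))  # approximately 20.0
```

In practice the separate target and interferer signals are not available for co-channel speech, which is why the measures reviewed below attempt to estimate usability without knowledge of the true TIR.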
The ratio of target energy to interferer energy in decibels (dB) is referred to as the Target-to-Interferer Ratio (TIR), which is expressed as

    TIR = 10 log10(Et / Ei) dB    (2.1)

where Et is the energy of the target speech and Ei is the energy of the interfering speech. Experiments have shown that co-channel speech segments with TIR values of 20dB or greater are only minimally corrupted, and can therefore be used effectively in speaker identification (Yantorno, 1999). Attempts have been made to develop usable speech measures having high correlation with TIR, such that, even without knowledge of its TIR, an input speech frame can be classified as usable or unusable. The portions identified as usable can then be extracted for use in speaker identification and other speech processing systems. Some of the prior usable/unusable speech classification methods are discussed in subsequent sections, and a novel approach to usable/unusable speech detection is introduced in this research.

2.2. Traditional Voiced/Unvoiced Detection Measures

2.2.1. Energy and Zero-Crossings (E/ZC)

The energy and zero-crossings approach (Atal and Rabiner, 1976) is one of the traditional voiced/unvoiced speech classification techniques. The energy technique is based on the difference in amplitude (and therefore energy) between voiced and unvoiced speech; in the previous section, it was demonstrated that voiced speech has much higher energy than unvoiced speech. The zero-crossings approach, which involves
counting the number of times the signal crosses the x-axis, is based on the knowledge that unvoiced speech signals, being more noise-like in nature, oscillate much faster than voiced speech signals. Therefore, the zero-crossing rates of voiced signals should be lower than those of unvoiced signals. The procedure for the energy and zero-crossings method, shown in Figure 2.2 below, is as follows: First, the input speech signal is passed through a highpass filter to remove any dc component that might be present. The output of the highpass filter is then separated into frames of about 128 samples. The number of zero-crossings is computed for each frame, as well as the energy of the speech frame, which is obtained from the equation

    E = Σ x(n)²    (2.2)

where x(n) is the speech signal. Voiced/unvoiced classification is then performed based on these parameters.
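The two measurements in this procedure reduce to a few lines of code. The sketch below (Python/NumPy; the synthetic frames and names are illustrative) contrasts a voiced-like frame with a noise-like unvoiced frame:

```python
import numpy as np

def frame_energy(x):
    """Short-time energy, Eq. (2.2): the sum of squared samples."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def zero_crossings(x):
    """Number of times the signal crosses the x-axis (sign changes)."""
    x = np.asarray(x, dtype=float)
    return int(np.count_nonzero(np.diff(np.signbit(x).astype(int))))

n = np.arange(128)
voiced_like = np.sin(2 * np.pi * n / 64)          # slow, high-amplitude oscillation
rng = np.random.default_rng(0)
unvoiced_like = 0.05 * rng.standard_normal(128)   # fast, low-amplitude noise

# Voiced speech: high energy, few crossings; unvoiced speech: the opposite.
print(frame_energy(voiced_like) > frame_energy(unvoiced_like))      # True
print(zero_crossings(voiced_like) < zero_crossings(unvoiced_like))  # True
```

A real classifier would compare both values against thresholds (or, as in Figure 2.2, a minimum-distance rule) rather than against each other.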
[Figure: block diagram — input speech signal s(n), highpass filter, sampling block x(n), zero-crossings and energy measurements, minimum-distance computation and selection, voiced/unvoiced decision]

Figure 2.2: Voiced/unvoiced detection using energy and zero-crossings.

Figure 2.3 below shows the energy and zero-crossings for a speech segment consisting of both voiced and unvoiced speech, computed on a sample-by-sample basis. The file is a recording of the phrase "will serve", and the samples between 4000 and 8000 represent the unvoiced sound, /s/, in the word "serve". In the figure, the high zero-crossing rate of unvoiced speech is readily observed, along with the high energy of the voiced speech signals.
Figure 2.3: Speech utterance, "will serve" (top panel), its zero-crossing rate (middle panel) and its energy (bottom panel).

2.2.2. First-Order Reflection Coefficient/Residual Energy (FR/RE)

Voiced/unvoiced classification has also been developed using the first-order reflection coefficient and the residual energy of the speech signal (Childers, 2000). The reflection coefficient, obtained by modeling the vocal tract as a concatenation of tubes, determines the amount of volume-velocity reflection found at the intersection of two tubes. Due to its high energy, voiced speech possesses a high amount of volume-velocity
as compared with unvoiced speech. Significant information in speech is usually contained in the first coefficient, hence the use of the first-order reflection coefficient, r1, which can be expressed as

    r1 = Rss(1) / Rss(0)    (2.3)

where

    Rss(0) = (1/N) Σ (n = 1 to N) s(n)s(n)      (2.4)

    Rss(1) = (1/N) Σ (n = 1 to N-1) s(n)s(n+1)  (2.5)

N is the number of samples in the analysis frame and s(n) are the speech samples. The residual energy is the energy of the signal after inverse filtering with the LPC (Linear Predictive Coding) coefficients. The chaotic nature of an unvoiced speech signal results in a low residual energy as compared with that of a voiced speech signal. Figure 2.4 below shows the first-order reflection coefficient and the residual energy of a given speech signal. The green line on the top panel is the threshold for voiced/unvoiced classification.
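Equations (2.3) to (2.5) can be computed directly. In the sketch below (Python/NumPy; illustrative names and test signals), a strongly periodic frame yields r1 close to 1, while a noise-like frame yields r1 close to 0:

```python
import numpy as np

def first_reflection_coeff(s):
    """First-order reflection coefficient, Eqs. (2.3)-(2.5):
    r1 = Rss(1)/Rss(0), the lag-1 autocorrelation normalised by energy."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    rss0 = np.dot(s, s) / n            # Eq. (2.4)
    rss1 = np.dot(s[:-1], s[1:]) / n   # Eq. (2.5)
    return rss1 / rss0                 # Eq. (2.3)

n = np.arange(256)
voiced_like = np.sin(2 * np.pi * n / 64)
rng = np.random.default_rng(0)
unvoiced_like = rng.standard_normal(256)
print(first_reflection_coeff(voiced_like))    # close to 1: high adjacent-sample correlation
print(first_reflection_coeff(unvoiced_like))  # close to 0: noise-like frame
```

The residual-energy computation is omitted here, since it requires an LPC analysis; the reflection coefficient alone already separates the two frame types in this toy example.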
Figure 2.4: First-order reflection coefficient (top panel) and residual energy (bottom panel) plotted (in blue) against the corresponding speech utterance (black: voiced; red: unvoiced).

2.3. Traditional Usable/Unusable Speech Detection Measures

2.3.1. Spectral Autocorrelation Peak-to-Valley Ratio (SAPVR)

The SAPVR measure (Krishnamachari et al., 2001) was the first usable speech detection technique to be introduced. In this method, the ratio of peaks to valleys of the spectral autocorrelation of the input speech signal is computed. Voiced, single-speaker speech (or
co-channel speech with high TIR) is highly structured and possesses a well-defined harmonic structure in the frequency domain, as opposed to the random structure of multi-speaker speech. The spectral autocorrelation of usable co-channel speech therefore exhibits well-defined peaks and valleys, and hence a high peak-to-valley ratio, as compared with unusable speech. The SAPVR usable/unusable speech classification process (shown in Figure 2.5 below) is as follows: A 32-point Hamming window is used to sample the input speech signal. The FFT of the windowed samples is computed. Autocorrelation is then performed on the FFT magnitude. The peaks and valleys of the resulting autocorrelation are determined using a peak-picking algorithm. The ratio of peak to valley is computed and compared with a threshold chosen to distinguish between usable and unusable frames. Finally, frames above the threshold are considered usable and extracted for applications such as speaker identification.

[Figure: block diagram — speech signal, Hamming window, FFT, autocorrelation, peak-picking algorithm, usable/unusable decision]

Figure 2.5: SAPVR-based usable/unusable speech classification process.
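The steps above can be sketched as follows (Python/NumPy). The peak-picking here is a deliberately crude stand-in for the algorithm used in the original work, and all names and parameters are illustrative:

```python
import numpy as np

def sapvr(frame, nfft=128):
    """Sketch of SAPVR: window the frame, autocorrelate the FFT magnitude,
    and return the ratio of the largest off-zero-lag peak to the smallest
    valley among the lower lags."""
    x = np.asarray(frame, dtype=float)
    mag = np.abs(np.fft.rfft(x * np.hamming(len(x)), nfft))
    ac = np.correlate(mag, mag, mode="full")[len(mag) - 1:]   # lags >= 0
    ac = ac[: len(mag) // 2] / ac[0]          # keep lower lags, normalise
    interior = ac[1:-1]
    is_peak = (interior > ac[:-2]) & (interior > ac[2:])      # local maxima
    if not is_peak.any():
        return 1.0
    return float(interior[is_peak].max() / ac[1:].min())

n = np.arange(128)
# Harmonic (usable-like) frame: fundamental plus three harmonics
harmonic = sum(np.sin(2 * np.pi * k * n / 16) / k for k in range(1, 5))
rng = np.random.default_rng(0)
noise_like = rng.standard_normal(128)         # stands in for unusable speech
print(sapvr(harmonic) > sapvr(noise_like))    # True: comb spectrum -> strong peaks
```

The comb-like magnitude spectrum of the harmonic frame produces pronounced autocorrelation peaks at multiples of the fundamental's bin spacing, which is exactly the structure the measure exploits.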
Figure 2.6 below shows a frame of speech with its associated FFT magnitude and spectral autocorrelation. The speech signals were sampled and windowed with a 32-point Hamming window, 50% overlap and 128-point zero-padding.

[Figure: speech signal (top panels), FFT magnitude (middle panels) and spectral autocorrelation (bottom panels) for single-speaker (left) and co-channel (right) male speech]

Figure 2.6: Speech signal (top panel), FFT magnitude (middle panel) and spectral autocorrelation (bottom panel) of single-speaker (left) and co-channel (right) speech.

From the figure above, it is evident that the peaks of the spectral autocorrelation of single-speaker speech (bottom left panel) are relatively high compared with those of co-channel speech of low TIR (bottom right panel). This measure was capable of correctly identifying 73% of usable frames (defined based on TIR value) with about 25% false alarms (Krishnamachari et al., 2001).
2.3.2. Adjacent Pitch Period Comparison (APPC)

Voiced speech is known to be periodic in nature; therefore, its adjacent pitch periods are similar in shape. However, the presence of interfering voiced speech creates dissimilarity between adjacent pitch periods of co-channel speech. The APPC measure (Lovekin et al., 2001b) takes advantage of this difference between the adjacent pitch periods of single-speaker and co-channel speech in the development of a usable/unusable speech detection system. The concept of this measure is the comparison of sample-by-sample variations between adjacent pitch periods of the speech signal. With single-speaker voiced speech, a comparison of adjacent pitch periods will yield minimal sample-by-sample variations, and an accurate pitch period length can easily be obtained. However, in the presence of interfering speech, adjacent pitch period comparison results in large variations, and the estimate of the pitch period length may also be inaccurate. Ironically, this inaccurate pitch period estimation for co-channel speech leads to an increase in correct usable/unusable speech detection: the more inaccurate the selected pitch period lengths, the greater the dissimilarity between the pitch periods and, hence, the larger the variations. The APPC process is as follows: The length, N, of each reference pitch period is computed as the distance between the zero-lag point and the next-highest peak of the autocorrelation over the next 10ms. The adjacent pitch period is then taken as samples N+1 to 2N+1.
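A simplified version of this comparison can be sketched as follows (Python/NumPy). The pitch estimator here is a bare-bones autocorrelation peak search, and the lag bounds are illustrative assumptions, not parameters from the original paper:

```python
import numpy as np

def appc_distance(frame, min_lag=20, max_lag=160):
    """Sketch of the APPC idea: estimate the reference pitch period length N
    from the highest non-zero-lag autocorrelation peak, then return the mean
    absolute sample-by-sample difference between the reference pitch period
    and the adjacent one (small -> usable-like, large -> unusable-like)."""
    x = np.asarray(frame, dtype=float)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    n = min_lag + int(np.argmax(ac[min_lag:max_lag]))  # pitch period length N
    ref, adj = x[:n], x[n:2 * n]                       # adjacent pitch periods
    return float(np.mean(np.abs(ref - adj)))

t = np.arange(400)
single = np.sin(2 * np.pi * t / 80)                    # one periodic "speaker"
cochannel = single + 0.9 * np.sin(2 * np.pi * t / 63)  # add an interferer
print(appc_distance(single))       # near 0: adjacent periods match
print(appc_distance(cochannel) > appc_distance(single))  # True
```

With the interferer present, no single lag aligns both components, so whatever period length the estimator picks, the adjacent periods disagree, which is exactly the effect the measure exploits.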
It should be noted that, in this method, changes in length from one pitch period to its neighboring pitch period are ignored. Figures 2.7 and 2.8 below show the amplitude comparisons of adjacent pitch periods for single-speaker and co-channel speech signals, respectively.

Figure 2.7: Single-speaker voiced speech (top panel) and its adjacent pitch period comparison (bottom panel).

Figure 2.8: Co-channel voiced speech (top panel) and its adjacent pitch period comparison (bottom panel).
This measure was able to correctly identify 75% of usable frames (defined based on TIR value), with about 25% false alarms (Lovekin et al., 2001b).
CHAPTER 3: CURVATURE MEASURE

3.1. Introduction to Curvature

In an attempt to obtain a mathematical quantification of the difference between embedded voiced and unvoiced signals, the curvature measure (Smolenski, 2004) was developed using the Serret-Frenet theorem (Rahman & Mulolani, 2001). This theorem states that any 3-dimensional space curve can be completely characterized by the following matrix equation:

    d/ds [T N B]ᵀ = [[0, κ, 0], [-κ, 0, τ], [0, -τ, 0]] [T N B]ᵀ    (3.1)

where κ = curvature, τ = torsion, and T, N and B are the axes shown in Figure 3.1 below; the derivatives are with respect to s, the arc length of the curve.

Figure 3.1: TNB frame classification.
The curvature, which is the quantity considered in this research, is defined as the rate of rotation of the tangent at a point, P, as P moves along a given trajectory. In other words, the curvature measures the angle formed by any three points on the trajectory, and can also be considered the reciprocal of the radius of curvature. Curvature can be expressed by the following equation:

    κ = lim (Δs → 0) Δθ / Δs    (3.2)

where θ is the angle between the tangents to the curve (and φ, the corresponding quantity for the torsion, is the angle between the binormals to the curve). However, the space curve formed by the state space embedding procedure is actually a sampled version of the original phase space trajectory; therefore, the curvature, as well as the other variables in the equation, must be approximated from the discrete embedding curve. The discrete curvature estimate is given by:

    Kn = cos⁻¹( (An · An+1) / (|An| |An+1|) )    (3.3)

where

    An = [xn − xn−1, yn − yn−1, zn − zn−1]    (3.4)

is the elemental arc length of the discrete embedding curve, and x, y and z are the coordinates of the embedded signal.
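Equations (3.3) and (3.4) translate directly into code. The sketch below (Python/NumPy; illustrative names and test trajectories) computes Kn over the rows of an embedded trajectory and checks two limiting cases:

```python
import numpy as np

def discrete_curvature(V):
    """Discrete curvature, Eqs. (3.3)-(3.4): K_n is the angle between
    consecutive elemental arcs A_n of the embedded trajectory, where the
    rows of V are the embedded points (x_n, y_n, z_n)."""
    A = np.diff(np.asarray(V, dtype=float), axis=0)        # A_n, Eq. (3.4)
    a, b = A[:-1], A[1:]
    cos_k = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1)
                                     * np.linalg.norm(b, axis=1))
    # clip guards against rounding pushing |cos| marginally above 1
    return np.arccos(np.clip(cos_k, -1.0, 1.0))            # K_n, Eq. (3.3)

# A straight-line trajectory never turns, so every K_n is 0 ...
line = np.column_stack([np.arange(10.0)] * 3)
print(discrete_curvature(line).max())  # 0.0
# ... while a jagged, noise-like trajectory turns sharply at every point.
rng = np.random.default_rng(0)
jagged = rng.standard_normal((10, 3))
print(discrete_curvature(jagged).mean() > 1.0)  # True: large turning angles
```

Applied to embedded speech frames, the smooth loops of a voiced trajectory behave like the first case, and the random walk of an unvoiced trajectory like the second.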
3.2. Preliminary Research: Voiced/Unvoiced Classification

Figure 3.2 below shows the embedded signals of a voiced and an unvoiced speech frame, each consisting of 128 sample points. Note the difference between the structure of the embedded voiced speech and that of the unvoiced speech. It is evident that the angle between any three points on the trajectory will be much greater for the embedded voiced signal than for the embedded unvoiced signal.

Figure 3.2: Embedded voiced (left panel) and embedded unvoiced (right panel) speech frames.

Figure 3.3 below shows the sample-by-sample curvature values (black) plotted against the corresponding speech segment (blue), which consists of voiced and unvoiced speech. The negative of the speech signal energy, computed using the traditional energy measure discussed earlier, is also plotted (in red) against the speech signal for comparison purposes. Note the correlation between energy and curvature in the utterance.
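The state-space embedding that produces such trajectories can be sketched as follows; the dimension m = 3 and delay d = 12 are illustrative defaults, taken from the values quoted for the experiments in Chapter 4:

```python
def takens_embed(signal, m=3, d=12):
    """Takens' method of delays: map a scalar series x(n) to the
    m-dimensional points (x(n), x(n+d), ..., x(n+(m-1)d))."""
    n_points = len(signal) - (m - 1) * d
    return [tuple(signal[i + j * d] for j in range(m))
            for i in range(n_points)]
```

For a 128-sample frame with m = 3 and d = 12, this yields 128 − 24 = 104 embedded 3-D points, whose coordinates are the x, y, and z used in Eq. (3.4).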
Figure 3.3: Curvature (black) and energy (red) plotted against the speech utterance (blue).

Usually, voiced/unvoiced classification is performed on a frame-by-frame basis, due to the difficulty of assigning a class to a single sample. Obtaining a common result for each speech frame eliminates single-sample detection errors made by the curvature algorithm. However, frame-by-frame classification of speech can lead to an over-approximation of some short segments. Moreover, voiced and unvoiced start- and end-points could be inaccurately detected due to the averaging of the decision value. In this research, this problem is addressed by segmenting the speech signals into relatively small frames (about 15ms) before processing. Figure 3.4 below shows a histogram of curvature for labeled voiced and unvoiced speech signals obtained from the TIMIT database. The blue bars represent the voiced distribution, while the red bars represent the unvoiced distribution. Note the separation between the two distributions.
Figure 3.4: Curvature distribution for clean speech, voiced blue, unvoiced red.

Noise and Filtering

It must be noted that the data used in Figure 3.4 (in the previous section) were obtained for clean speech. However, due to the presence of noise in most speech communication channels, a robust measure is required. Pink noise, a type of noise that flickers throughout the signal, is sometimes found in speech. This category of noise is often referred to as 1/f noise because its power spectrum P(f), as a function of frequency, can be expressed as P(f) = 1/f^a, where a is very close to 1. The curvature distribution of speech corrupted by pink noise at 15dB SNR shows that pink noise has minimal effect on the accuracy of the curvature measure. This is illustrated in Figure 3.5 below.
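As an illustration of this noise model, a 1/f spectrum can be approximated by summing random-phase sinusoids whose power falls off as 1/f (amplitude as 1/√f). This is a hedged sketch for experimentation, not the noise source used in the thesis experiments; the component count and seed are arbitrary choices:

```python
import math
import random

def pink_noise(n, n_components=100, seed=0):
    """Approximate 1/f ('pink') noise over n samples by summing
    random-phase sinusoids with amplitude ~ 1/sqrt(f), i.e. power ~ 1/f."""
    rng = random.Random(seed)
    freqs = range(1, n_components + 1)  # f = cycles per n samples
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in freqs]
    return [sum(math.sin(2.0 * math.pi * f * t / n + p) / math.sqrt(f)
                for f, p in zip(freqs, phases))
            for t in range(n)]
```

The generated sequence can then be scaled and added to a clean utterance to obtain a desired SNR.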
Figure 3.5: Curvature distribution for clean speech (left panel) and speech + pink noise at 15dB SNR (right panel), voiced blue, unvoiced red.

On the other hand, white noise, the most common type of noise found in speech signals, has an adverse effect on the curvature measure; this is illustrated in Figure 3.6 below.

Figure 3.6: Curvature distribution for clean speech (left panel) and speech + white noise at 15dB SNR (right panel), voiced blue, unvoiced red.

From Figure 3.6, it is observed that, with the presence of white noise in the speech signal, the curvature pdf for voiced speech shifts to the left of the curve; in other words,
the discriminative power of the measure decreases. This can be explained by the chaotic nature of white noise, which introduces disorganization into the well-defined structure of the embedded voiced signal, thereby decreasing the curvature values of the voiced samples, i.e., making voiced speech unvoiced-like. In order to minimize the effect of noise on the speech signals, a 10th order (11-point) moving average filter is used as a pre-processing block for the input speech signal. A moving average filter has been chosen because it is very easy to implement and yet optimal for the simple task of reducing chaotic noise signals while maintaining a relatively sharp impulse response. The expression for an M-point moving average filter is given by:

y[n] = \frac{1}{M} \sum_{k=0}^{M-1} x[n-k]    (3.5)

where x and y are the input and output of the filter, respectively. Since the significant information in speech signals is found in the low-frequency components of the signal, the moving average filter, which is a lowpass filter, minimizes the effects of noise on the signal while retaining the information needed for voiced/unvoiced classification. The curvature voiced/unvoiced distributions for clean speech and speech with pink and white noise at 15dB SNR after filtering are given in Figures 3.7 and 3.8 below. It can be observed from these figures that the performance of the curvature measure is not degraded by filtering clean speech or speech with pink noise. Note, however, that, in the case of white noise, filtering causes the voiced distribution to shift towards the right, making it more like the distribution for clean speech. Therefore, the moving
average filter is very effective in reducing the effect of noise on the performance of the curvature measure.

Figure 3.7: Curvature distribution for clean speech (left panel) and speech + pink noise at 15dB SNR (right panel) after filtering, voiced blue, unvoiced red.

Figure 3.8: Curvature distribution for clean speech (left panel) and speech + white noise at 15dB SNR (right panel) after filtering, voiced blue, unvoiced red.
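The M-point moving average of Eq. (3.5) amounts to a short FIR lowpass filter. A minimal sketch of the 11-point (10th order) causal form, assuming zero samples before the start of the signal, is:

```python
def moving_average(x, M=11):
    """Causal M-point moving average, y[n] = (1/M) * sum_{k=0}^{M-1} x[n-k].
    Samples before the start of the signal are treated as zero."""
    return [sum(x[max(0, n - M + 1):n + 1]) / M for n in range(len(x))]
```

For a constant input, the output settles to the input value after M − 1 samples; the reduced gain on the first few samples reflects the zero initial conditions.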
Experiments and Results

All speech data used in the following experiments were obtained from the TIMIT database (13 female files and 12 male files). Curvature-based voiced/unvoiced decisions were made using the following procedure:

- The speech signal is filtered using a 10th order moving average filter.
- The output of the filter is then segmented into frames of 128 samples each.
- Takens' embedding technique is then applied to each frame.
- The curvature values of each embedded frame are then computed and averaged to produce a voiced or unvoiced decision for that frame, based on a threshold of 2.3.

The block diagram for the voiced/unvoiced classification process is given in Figure 3.9 below.

Figure 3.9: Curvature-based decision process (speech signal → moving average filter → framing → nonlinear embedding → curvature algorithm → voiced/unvoiced detection).

In choosing a threshold, one has two options: an optimal (and therefore different) threshold for each noise condition, or a single optimal threshold for all conditions; however, since prior knowledge of the noise state cannot (as yet) be determined, one optimum threshold has been chosen for all three noise states. Figures 3.10 and 3.11 below show the ROC curves for the voiced and unvoiced hits and false alarms for all three noise states.
Figure 3.10: ROC curves for different noise states + voiced speech (clean, 15dB pink, 15dB white).

Figure 3.11: ROC curves for different noise states + unvoiced speech (clean, 15dB pink, 15dB white).

It is observed from the above figures that it is possible to achieve a minimum of 95% hits with 5% false alarms for each of the noise states; however, these values are only
attainable if the optimum threshold for each individual noise state is used. Choosing one threshold that produces the best overall results for all three noise states together is more practical, even though it leads to a reduction in the maximum accuracy for each individual noise state. A threshold of 2.3 was found to yield the best overall result for all three noise states. Frames whose curvature values fell below 2.3 were considered unvoiced, and frames whose curvature values were above 2.3 were considered voiced. Figures 3.12 to 3.14 below show curvature-based voiced/unvoiced decision values (voiced: 1, unvoiced: 0) plotted against color-coded speech data with different speech classes. The data are coded as follows: voiced, weak voiced, unvoiced, transition, and silence. It must be noted, however, that in this research only voiced/unvoiced classification is performed, and all other voicing states are ignored.

Figure 3.12: Curvature-based decisions for clean speech.
Figure 3.13: Curvature-based decisions for the corresponding speech + added pink noise at 15dB SNR.

Figure 3.14: Curvature-based decisions for the corresponding speech + added white noise at 15dB SNR.
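The frame-level decision procedure listed above can be sketched end-to-end as follows. This is an illustrative reimplementation, not the thesis code, and it makes one stated assumption: Eq. (3.3) as printed measures the turning angle between consecutive arc vectors, so the sketch averages the supplementary vertex angle (π minus the turning angle) so that smooth, voiced-like trajectories score above the 2.3 threshold, matching the decision rule:

```python
import math

def classify_frame(frame, threshold=2.3, m=3, d=12):
    """Embed one frame via Takens' method of delays, average a per-sample
    angle measure, and threshold it (voiced if above)."""
    # Method of delays: m-dimensional points with delay d
    pts = [tuple(frame[i + j * d] for j in range(m))
           for i in range(len(frame) - (m - 1) * d)]
    # Elemental arc vectors between consecutive embedded points
    arcs = [tuple(b - a for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])]
    angles = []
    for a, b in zip(arcs, arcs[1:]):
        na = math.sqrt(sum(v * v for v in a))
        nb = math.sqrt(sum(v * v for v in b))
        if na > 0.0 and nb > 0.0:
            c = sum(x * y for x, y in zip(a, b)) / (na * nb)
            turning = math.acos(max(-1.0, min(1.0, c)))
            # Vertex angle: close to pi for a locally smooth trajectory
            angles.append(math.pi - turning)
    mean_angle = sum(angles) / len(angles) if angles else 0.0
    return ("voiced" if mean_angle > threshold else "unvoiced"), mean_angle
```

A periodic (voiced-like) frame traces a smooth closed trajectory and scores near π, while a chaotic (unvoiced-like) frame scores well below the threshold.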
End-Point Detection

Because the curvature measure cannot detect speech states other than voiced and unvoiced, an undecided region was created in an attempt to detect voiced and unvoiced end-points as accurately as possible. For the undecided band, two thresholds were chosen: one slightly above the original threshold, for voiced speech, and the other slightly below it, for unvoiced speech. The advantages of having such a band are improved accuracy in end-point detection and a reduction in false alarms for voiced and unvoiced detections. However, some actual voiced and unvoiced frames could fall within the undecided region, resulting in a reduction in hits and an increase in misses. Figure 3.15 illustrates the end-point detection accuracy of the curvature measure for clean speech.

Figure 3.15: Speech data (top panel), ground truth (2nd panel), curvature-based classification (3rd panel), and difference between ground truth and curvature-based classifications (4th panel).
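The two-threshold undecided band can be sketched as a three-way decision matching the 1 / 0 / −1 coding in Figure 3.15. The band edges used here (2.1 and 2.5 around the 2.3 threshold) are illustrative assumptions, since the proposal does not state the exact values:

```python
def classify_with_band(mean_curvature, low=2.1, high=2.5):
    """Three-way voiced/undecided/unvoiced decision.
    low/high are assumed band edges around the 2.3 threshold."""
    if mean_curvature >= high:
        return 1    # voiced
    if mean_curvature <= low:
        return -1   # unvoiced
    return 0        # undecided ("don't care")
```

Frames scoring inside the band are deferred rather than forced into a class, which is what improves end-point accuracy at the cost of some missed detections.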
The don't care regions in the actual classification are all speech classes other than voiced or unvoiced, while those in the curvature-based classification are the undecided regions. It should be noted that the don't care regions in both cases are almost the same. Therefore, if accurate end-point detection is desired, the use of an undecided band could be effective.

Result Comparisons

Figures 3.16 and 3.17 below show the comparison of the performance of the curvature measure with the traditional voiced/unvoiced classifiers presented in the preceding chapters. These results were obtained by subtracting the average false alarms from the average hits, using 25 different speech files from the TIMIT database.

Figure 3.16: Comparisons of hits minus false alarms for voiced speech (FR/RE, E/ZC, and curvature; clean, 15dB pink, 15dB white).
Figure 3.17: Comparisons of hits minus false alarms for unvoiced speech (FR/RE, E/ZC, and curvature; clean, 15dB pink, 15dB white).

It is observed in Figures 3.16 and 3.17 that the curvature measure is comparable to the traditional measures in a noiseless environment. However, in the presence of white noise, the curvature measure performs better than the FR/RE measure and is comparable to the E/ZC measure. Furthermore, with pink noise interference, the curvature measure is decidedly better than either traditional measure for voiced/unvoiced classification.

3.3. Proposed Research: Voiced/Unvoiced/Background Classification

Separation of unvoiced speech and background noise has been a challenge in speech classification due to their similarity in structure. In this proposal, some preliminary experiments explore possible differences between the structures of embedded unvoiced speech and background noise in order to extend the classification to voiced/unvoiced/background. Figures 3.18 and 3.19 below show the embedded signals
of voiced, unvoiced and background frames, each consisting of 128 sample points, with added pink noise at 15dB SNR and white noise at 15dB SNR, respectively.

Figure 3.18: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) frames with added pink noise at 15dB SNR.

Figure 3.19: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) frames with added white noise at 15dB SNR.

It is evident from the above figures that, although the embedded background (or noise) is chaotic in nature, some differentiation does exist between the structures of embedded unvoiced speech and background, and this differentiation can also be measured using the curvature
algorithm. With white noise, however, the difference between the structures of embedded unvoiced speech and background is not clear; therefore, as in the case of voiced/unvoiced classification, a 10th order moving average filter was used to pre-process the speech before applying the embedding technique. Figures 3.20 and 3.21 below show the embedded signals of voiced, unvoiced and background speech with added pink noise at 15dB SNR and white noise at 15dB SNR, respectively, after filtering.

Figure 3.20: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) speech frames with added pink noise at 15dB SNR after filtering.

Figure 3.21: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) speech frames with added white noise at 15dB SNR after filtering.
It is readily observed that filtering increases the differentiation between unvoiced speech and background, with both added pink noise and added white noise.
CHAPTER 4: NODAL DENSITY MEASURE

4.1. Introduction to Nodal Density

Another distinguishing feature between embedded voiced and unvoiced signals, observable from Figure 3.2 in the previous chapter, is the density of the signals. The embedded voiced signal appears to be much less dense than the unvoiced signal; however, the presence of an appreciable amount of interfering speech in voiced signals will introduce significant distortion in their structured pattern, thereby increasing the apparent density. Figures 4.1 and 4.2 below show 256-sample-point frames of unusable and usable voiced speech, respectively, embedded using Takens' method of delays with m = 3 and d = 12. The co-channel data were obtained by combining two different frames of speech from different speakers, scaling them to obtain the desired target-to-interferer ratio (TIR), and then extracting the voiced portions using one of the traditional voiced/unvoiced classifiers.

Figure 4.1: Embedded data for co-channel speech at 3dB TIR.
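The co-channel mixing step can be sketched as follows: the interferer is scaled so that the ratio of target power to interferer power in the mixture equals the desired TIR in dB. This is a minimal illustration of the scaling described above, not the thesis code, and it assumes equal-length frames:

```python
import math

def mix_at_tir(target, interferer, tir_db):
    """Scale the interferer so that 10*log10(P_target / P_interferer)
    equals tir_db, then add the two signals sample by sample."""
    p_t = sum(x * x for x in target) / len(target)
    p_i = sum(x * x for x in interferer) / len(interferer)
    # Desired interferer power: P_target / 10^(TIR/10)
    g = math.sqrt(p_t / (p_i * 10.0 ** (tir_db / 10.0)))
    return [t + g * i for t, i in zip(target, interferer)]
```

After mixing, subtracting the target from the mixture recovers the scaled interferer, so the achieved TIR can be checked directly.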
More informationROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION. Frank Kurth, Alessia Cornaggia-Urrigshardt and Sebastian Urrigshardt
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) ROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION Frank Kurth, Alessia Cornaggia-Urrigshardt
More informationINTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)
INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) Proceedings of the 2 nd International Conference on Current Trends in Engineering and Management ICCTEM -214 ISSN
More informationEnhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients
ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds
More informationAudio Signal Compression using DCT and LPC Techniques
Audio Signal Compression using DCT and LPC Techniques P. Sandhya Rani#1, D.Nanaji#2, V.Ramesh#3,K.V.S. Kiran#4 #Student, Department of ECE, Lendi Institute Of Engineering And Technology, Vizianagaram,
More informationScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 122 126 International Conference on Information and Communication Technologies (ICICT 2014) Unsupervised Speech
More informationEnhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis
Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis Mohini Avatade & S.L. Sahare Electronics & Telecommunication Department, Cummins
More informationLab 8. Signal Analysis Using Matlab Simulink
E E 2 7 5 Lab June 30, 2006 Lab 8. Signal Analysis Using Matlab Simulink Introduction The Matlab Simulink software allows you to model digital signals, examine power spectra of digital signals, represent
More informationVHF Radar Target Detection in the Presence of Clutter *
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 6, No 1 Sofia 2006 VHF Radar Target Detection in the Presence of Clutter * Boriana Vassileva Institute for Parallel Processing,
More informationEE228 Applications of Course Concepts. DePiero
EE228 Applications of Course Concepts DePiero Purpose Describe applications of concepts in EE228. Applications may help students recall and synthesize concepts. Also discuss: Some advanced concepts Highlight
More informationVisual Interpretation of Hand Gestures as a Practical Interface Modality
Visual Interpretation of Hand Gestures as a Practical Interface Modality Frederik C. M. Kjeldsen Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate
More informationChapter 2 Direct-Sequence Systems
Chapter 2 Direct-Sequence Systems A spread-spectrum signal is one with an extra modulation that expands the signal bandwidth greatly beyond what is required by the underlying coded-data modulation. Spread-spectrum
More informationSINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase and Reassigned Spectrum
SINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase Reassigned Spectrum Geoffroy Peeters, Xavier Rodet Ircam - Centre Georges-Pompidou Analysis/Synthesis Team, 1, pl. Igor
More informationAutomatic Transcription of Monophonic Audio to MIDI
Automatic Transcription of Monophonic Audio to MIDI Jiří Vass 1 and Hadas Ofir 2 1 Czech Technical University in Prague, Faculty of Electrical Engineering Department of Measurement vassj@fel.cvut.cz 2
More informationChapter 2 Channel Equalization
Chapter 2 Channel Equalization 2.1 Introduction In wireless communication systems signal experiences distortion due to fading [17]. As signal propagates, it follows multiple paths between transmitter and
More informationVoice Excited Lpc for Speech Compression by V/Uv Classification
IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 3, Ver. II (May. -Jun. 2016), PP 65-69 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Voice Excited Lpc for Speech
More informationCHAPTER 6 INTRODUCTION TO SYSTEM IDENTIFICATION
CHAPTER 6 INTRODUCTION TO SYSTEM IDENTIFICATION Broadly speaking, system identification is the art and science of using measurements obtained from a system to characterize the system. The characterization
More informationFourier Methods of Spectral Estimation
Department of Electrical Engineering IIT Madras Outline Definition of Power Spectrum Deterministic signal example Power Spectrum of a Random Process The Periodogram Estimator The Averaged Periodogram Blackman-Tukey
More informationAspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta
Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied
More informationSystem analysis and signal processing
System analysis and signal processing with emphasis on the use of MATLAB PHILIP DENBIGH University of Sussex ADDISON-WESLEY Harlow, England Reading, Massachusetts Menlow Park, California New York Don Mills,
More informationDigital Speech Processing and Coding
ENEE408G Spring 2006 Lecture-2 Digital Speech Processing and Coding Spring 06 Instructor: Shihab Shamma Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/class/enee408g/
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 BACKGROUND The increased use of non-linear loads and the occurrence of fault on the power system have resulted in deterioration in the quality of power supplied to the customers.
More informationNOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or
NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying
More information(i) Understanding of the characteristics of linear-phase finite impulse response (FIR) filters
FIR Filter Design Chapter Intended Learning Outcomes: (i) Understanding of the characteristics of linear-phase finite impulse response (FIR) filters (ii) Ability to design linear-phase FIR filters according
More informationPR No. 119 DIGITAL SIGNAL PROCESSING XVIII. Academic Research Staff. Prof. Alan V. Oppenheim Prof. James H. McClellan.
XVIII. DIGITAL SIGNAL PROCESSING Academic Research Staff Prof. Alan V. Oppenheim Prof. James H. McClellan Graduate Students Bir Bhanu Gary E. Kopec Thomas F. Quatieri, Jr. Patrick W. Bosshart Jae S. Lim
More informationImproved Detection by Peak Shape Recognition Using Artificial Neural Networks
Improved Detection by Peak Shape Recognition Using Artificial Neural Networks Stefan Wunsch, Johannes Fink, Friedrich K. Jondral Communications Engineering Lab, Karlsruhe Institute of Technology Stefan.Wunsch@student.kit.edu,
More informationEC 6501 DIGITAL COMMUNICATION UNIT - II PART A
EC 6501 DIGITAL COMMUNICATION 1.What is the need of prediction filtering? UNIT - II PART A [N/D-16] Prediction filtering is used mostly in audio signal processing and speech processing for representing
More informationA Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal
International Journal of ISSN 0974-2107 Systems and Technologies IJST Vol.3, No.1, pp 11-16 KLEF 2010 A Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal Gaurav Lohiya 1,
More informationA Survey and Evaluation of Voice Activity Detection Algorithms
A Survey and Evaluation of Voice Activity Detection Algorithms Seshashyama Sameeraj Meduri (ssme09@student.bth.se, 861003-7577) Rufus Ananth (anru09@student.bth.se, 861129-5018) Examiner: Dr. Sven Johansson
More informationFinite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms. Armein Z. R. Langi
International Journal on Electrical Engineering and Informatics - Volume 3, Number 2, 211 Finite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms Armein Z. R. Langi ITB Research
More informationJitter Analysis Techniques Using an Agilent Infiniium Oscilloscope
Jitter Analysis Techniques Using an Agilent Infiniium Oscilloscope Product Note Table of Contents Introduction........................ 1 Jitter Fundamentals................. 1 Jitter Measurement Techniques......
More informationDigital Processing of Continuous-Time Signals
Chapter 4 Digital Processing of Continuous-Time Signals 清大電機系林嘉文 cwlin@ee.nthu.edu.tw 03-5731152 Original PowerPoint slides prepared by S. K. Mitra 4-1-1 Digital Processing of Continuous-Time Signals Digital
More informationDESIGN OF GLOBAL SAW RFID TAG DEVICES C. S. Hartmann, P. Brown, and J. Bellamy RF SAW, Inc., 900 Alpha Drive Ste 400, Richardson, TX, U.S.A.
DESIGN OF GLOBAL SAW RFID TAG DEVICES C. S. Hartmann, P. Brown, and J. Bellamy RF SAW, Inc., 900 Alpha Drive Ste 400, Richardson, TX, U.S.A., 75081 Abstract - The Global SAW Tag [1] is projected to be
More informationReading: Johnson Ch , Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday.
L105/205 Phonetics Scarborough Handout 7 10/18/05 Reading: Johnson Ch.2.3.3-2.3.6, Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday Spectral Analysis 1. There are
More informationCHAPTER 6 SIGNAL PROCESSING TECHNIQUES TO IMPROVE PRECISION OF SPECTRAL FIT ALGORITHM
CHAPTER 6 SIGNAL PROCESSING TECHNIQUES TO IMPROVE PRECISION OF SPECTRAL FIT ALGORITHM After developing the Spectral Fit algorithm, many different signal processing techniques were investigated with the
More informationA Parametric Model for Spectral Sound Synthesis of Musical Sounds
A Parametric Model for Spectral Sound Synthesis of Musical Sounds Cornelia Kreutzer University of Limerick ECE Department Limerick, Ireland cornelia.kreutzer@ul.ie Jacqueline Walker University of Limerick
More informationBIT SYNCHRONIZERS FOR PSK AND THEIR DIGITAL IMPLEMENTATION
BIT SYNCHRONIZERS FOR PSK AND THEIR DIGITAL IMPLEMENTATION Jack K. Holmes Holmes Associates, Inc. 1338 Comstock Avenue Los Angeles, California 90024 ABSTRACT Bit synchronizers play an important role in
More information