STRUCTURE-BASED SPEECH CLASSIFICATION USING NON-LINEAR EMBEDDING TECHNIQUES. A Thesis Proposal Submitted to the Temple University Graduate Board
STRUCTURE-BASED SPEECH CLASSIFICATION USING NON-LINEAR EMBEDDING TECHNIQUES

A Thesis Proposal Submitted to the Temple University Graduate Board in Partial Fulfillment of the Requirements for the Degree Master of Science in Engineering

By Uchechukwu Ofoegbu
May, 2004

Dr. Robert Yantorno, Thesis Advisor
Dr. Saroj K. Biswas, Director of Graduate Studies, College of Engineering, Committee Member
Dr. Musoke H. Sendaula, Graduate Director, Electrical & Computer Engineering, Committee Member
ABSTRACT

Usable speech refers to those portions of corrupted speech from which a reasonable number of distinguishing features of the speaker can still be determined. It has previously been shown that using only the voiced segments of speech improves usable speech detection, and also that unvoiced speech does not contribute significant information about the speaker(s) for speaker identification. Therefore, voiced portions of co-channel speech are usually detected with a voiced/unvoiced classifier and extracted for use in usable speech extraction systems. The process of human speech production is complex, nonlinear and nonstationary; its most precise description can only be realized in terms of nonlinear fluid dynamics. Traditionally, though, it has been described using linear techniques such as the source-filter model and spectral analysis. These techniques work very well for many aspects of speech analysis, but they are inherently limited in their ability to describe the true dynamics of speech production. In this research, a non-linear speech classification approach is proposed, which classifies speech based on features extracted after processing the input signal via an embedding technique known as Takens' method of delays. Unvoiced speech and unusable speech are similar in structure, as the former is noise-like in nature, while the latter contains a significant amount of interference. Likewise, the structure of voiced speech is comparable to that of usable speech. Based on this, the proposed technique attempts to classify speech both as voiced or unvoiced and as usable or unusable, using different features extracted from the embedded signals. Preliminary experiments have shown that this technique is capable of correctly detecting 96% of voiced speech (with 1% false alarms) and 90% of unvoiced speech (with 4% false alarms) in a noise-free environment.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF EQUATIONS
LIST OF FIGURES

CHAPTERS
1. INTRODUCTION
   1.1. Motivation
   1.2. Nonlinear Embedding
   1.3. Problem Statement and Research Goals
   1.4. Scope of Research
   1.5. Disclaimer
   1.6. Organization of Thesis Proposal
2. BACKGROUND
   2.1. Literature Review
      2.1.1. Voiced and Unvoiced Speech
      2.1.2. Usable and Unusable Speech
   2.2. Traditional Voiced/Unvoiced Detection Measures
      2.2.1. Energy and Zero-Crossings
      2.2.2. 1st-Order Reflection Coefficients and Residual Energy
   2.3. Usable/Unusable Detection Measures
      2.3.1. SAPVR
      2.3.2. APPC
3. CURVATURE MEASURE
   3.1. Introduction to Curvature
   3.2. Preliminary Research: Voiced/Unvoiced Classification
      3.2.1. Noise and Filtering
      3.2.2. Experiments and Results
      3.2.3. End-Point Detection
      3.2.4. Result Comparisons
   3.3. Proposed Research: Voiced/Unvoiced/Silence Classification
4. NODAL DENSITY MEASURE
   4.1. Introduction to Nodal Density
   4.2. Preliminary Research: Usable/Unusable Classification
      4.2.1. Preliminary Experiments and Results
      4.2.2. Discussion
   4.3. Proposed Research: Voiced/Unvoiced Classification
5. DIFFERENCE-MEAN COMPARISON MEASURE
   5.1. Introduction to Difference-Mean Comparison
   5.2. Experiments and Results
6. SUMMARY
BIBLIOGRAPHY
LIST OF EQUATIONS

1.1 Vector-Valued Trajectory Formed by Takens' Method of Delays
2.1 Target-to-Interferer Ratio (TIR)
2.2 Energy
2.3 First-Order Reflection Coefficient
2.4 Denominator of the First-Order Reflection Coefficient
2.5 Numerator of the First-Order Reflection Coefficient
3.1 TNB Matrix (Serret-Frenet Theorem)
3.2 Curvature
3.3 Curvature Estimation
3.4 Elemental Arc Length of the Discrete Embedding Curve
Moving Average Filter
N×N 1st-Order Difference
LIST OF FIGURES

2.1 Periodic Nature of Voiced Speech versus Aperiodic Nature of Unvoiced Speech
2.2 Voiced/Unvoiced Detection Using Energy and Zero-Crossings
2.3 Signal Generated by a Speech Utterance, Its Zero-Crossing Rate and Its Energy
2.4 1st-Order Reflection Coefficient and Residual Energy Plotted Against the Corresponding Speech Utterance
2.5 SAPVR-Based Usable/Unusable Speech Separation Process
2.6 Sampled Signal, FFT Magnitude and Spectral Autocorrelation of Single-Speaker and Co-Channel Speech
2.7 Single-Speaker Voiced Speech and Its Adjacent Pitch Period Amplitude Comparison
2.8 Co-Channel Voiced Speech and Its Adjacent Pitch Period Amplitude Comparison
3.1 TNB Frame Classification
3.2 Embedded Voiced and Unvoiced Speech Frames
3.3 Curvature and Energy Plotted Against the Corresponding Speech Utterance
3.4 Curvature Distribution for Clean Speech
3.5 Curvature Distribution for Clean Speech and Speech + Pink Noise at 15dB SNR
3.6 Curvature Distribution for Clean Speech and Speech + White Noise at 15dB SNR
3.7 Curvature Distribution for Clean Speech and Speech + Pink Noise at 15dB SNR After Filtering
3.8 Curvature Distribution for Clean Speech and Speech + White Noise at 15dB SNR After Filtering
3.9 Curvature-Based Decision Process
3.10 ROC for Different Noise States + Voiced Speech
3.11 ROC for Different Noise States + Unvoiced Speech
3.12 Curvature-Based Decisions for Clean Speech
3.13 Curvature-Based Decisions for Corresponding Speech + Added Pink Noise at 15dB SNR
3.14 Curvature-Based Decisions for Corresponding Speech + Added White Noise at 15dB SNR
3.15 Speech Data, Actual Classification, Curvature-Based Classification, and Difference Between Actual and Curvature-Based Classifications
3.16 Comparisons of Hits Minus False Alarms for Voiced Speech
3.17 Comparisons of Hits Minus False Alarms for Unvoiced Speech
3.18 Embedded Voiced, Unvoiced and Background Speech Frames with Added Pink Noise at 15dB SNR
3.19 Embedded Voiced, Unvoiced and Background Speech Frames with Added White Noise at 15dB SNR
3.20 Embedded Voiced, Unvoiced and Background Speech Frames with Added Pink Noise at 15dB SNR After Filtering
3.21 Embedded Voiced, Unvoiced and Background Speech Frames with Added White Noise at 15dB SNR After Filtering
4.1 Embedded Frame for Co-Channel Speech of 30dB TIR
4.2 Embedded Frame for Co-Channel Speech of 10dB TIR
4.3 Embedded Frame for Co-Channel (Usable) Speech of 30dB TIR, Gridded to Show Nodes Spanned
4.4 Embedded Frame for Co-Channel (Unusable) Speech of 10dB TIR, Gridded to Show Nodes Spanned
4.5 Nodes Spanned by Embedded Frame for Co-Channel (Usable) Speech of 30dB TIR
4.6 Nodes Spanned by Embedded Frame for Co-Channel (Unusable) Speech of 10dB TIR
4.7 ROC Curve for Usable Speech Detection Using the Nodal Density Approach
4.8 Embedded Voiced Speech, Gridded to Show Nodes Spanned
4.9 Embedded Unvoiced Speech, Gridded to Show Nodes Spanned
4.10 Nodes Spanned by Embedded Voiced Speech
4.11 Nodes Spanned by Embedded Unvoiced Speech
5.1 Difference-Mean Comparison Distribution for Clean Speech
5.2 Difference-Mean Comparison Distribution for Clean Speech and Speech Plus Pink Noise at 15dB SNR
5.3 Difference-Mean Comparison Distribution for Clean Speech and Speech Plus White Noise at 15dB SNR
5.4 Classifier Characteristic Curves for Varying Difference-Mean Comparison Values for Clean Voiced and Unvoiced Speech
5.5 Classifier Characteristic Curves for Varying Difference-Mean Comparison Values for Voiced and Unvoiced Speech Plus Pink Noise at 15dB SNR
5.6 Classifier Characteristic Curves for Varying Difference-Mean Comparison Values for Voiced and Unvoiced Speech Plus White Noise at 15dB SNR
5.7 Hits Minus False Alarms for Voiced Speech
5.8 Hits Minus False Alarms for Unvoiced Speech
CHAPTER 1: INTRODUCTION

1.1. Motivation

Speech signals can be corrupted by two types of interference: background noise or another speaker's speech. The performance of speaker identification systems is known to be adversely affected by the presence of such interference. Various techniques exist for the reduction or elimination of noise distortions in signals (including speech); however, due to the non-stationary properties of speech, complete removal of speech interference has remained a challenge for the speech processing community. Speech interference occurs when two or more speakers are speaking simultaneously over the same channel without a significant difference in their overall energy. This research focuses on two speakers speaking through the same channel at the same time; the resulting speech is commonly termed co-channel speech. Even when the energies of the target and interfering speakers are approximately equal overall, certain portions still exist in co-channel speech in which the energy of one speaker is greater than that of the other. These portions are termed usable, while the remaining portions are termed unusable. The use of only usable portions of speech has been shown to improve the performance of speaker identification systems (Lovekin et al., 2001a), (Iyer et al., 2004). A Target (energy) to Interferer (energy) Ratio (TIR) magnitude of 20dB is considered a suitable threshold for usable/unusable speech classification.
Previous research (Lovekin et al., 2001a) has shown the inability of unvoiced speech to contribute the necessary information about the speaker for speaker identification, due to its noise-like structure; therefore, voiced portions of speech are extracted, using voiced/unvoiced classifiers, for use in usable speech extraction systems. Much research has been performed on the categorization of speech segments as voiced or unvoiced, which has led to the development of traditional voiced/unvoiced classifiers such as Energy and Zero-Crossings (Atal and Rabiner, 1976) and First-Order Reflection Coefficients and Residual Energy (Childers, 2000). These techniques are restricted in their capability to take into consideration the non-linear characteristics of the signals in question, and will therefore omit vital acoustic features, thereby reducing the accuracy of the speech classifier. Usable speech classification techniques have also been introduced which use linear-based approaches such as the Spectral Autocorrelation Peak-to-Valley Ratio (Krishnamachari et al., 2001) and Adjacent Pitch Period Comparison (Lovekin et al., 2001b), along with others (Iyer et al., 2004), (Krishnamachari et al., 2000), (Kizhanatham et al., 2002), (Smolenski et al., 2002), (Sundaram et al., 2003), (Yantorno, 1998). As mentioned, these methods do not take into account the nonlinear features of the signal, thereby ignoring valuable characteristics that could lead to more precise distinctions between heavily and slightly distorted speech signals. Due to the inability of linear-based speech classification systems to account for nonlinear features in speech production, the necessity arises to develop a non-linear-based
method, hence the non-linear embedding technique, which is discussed in the next section. Unvoiced speech and unusable speech are similar in structure, as the former is noise-like in nature, while the latter contains a significant amount of interference. Likewise, the structure of voiced speech is comparable to that of usable speech. Based on this, the proposed technique attempts to classify speech both as voiced or unvoiced and as usable or unusable.

1.2. Non-Linear Embedding

In this section, Takens' method of delays (Takens, 1981), a technique widely used in the analysis of chaotic signals, especially in bio-engineering, is discussed, as well as its application to speech classification. Voiced speech is generated by a comparatively low-dimensional nonlinear dynamical system (Kubin, 1995). It is not viable to directly observe the degrees of freedom of the state variables of this system. Consequently, the problem arises as to how to recover and depict the underlying low-dimensional dynamics from the one-dimensional observable speech signal. In other words, how can the apparently one-dimensional signal obtained from speech be reconstructed to illustrate the actual dynamics of the speech production system? One of the most popular representations of the chaotic nature of signals can be attained via Takens' embedding theorem, which states that it is possible to reconstruct a state space representation topologically equivalent to the original state space of a system from a single observable dimension. The nonlinear dynamic progression of
speech can be observed as a vector which travels along a phase (or state) space trajectory, where the coordinates of the point are the degrees of freedom of the system. The procedure for implementing Takens' theorem is as follows: First, the time series x(t), which in our case is the speech signal, is accumulated in an array, {x(t)} (usually, the speech signal is given as a vector, and, therefore, the accumulation is already performed). A lag, or time delay, d, and an embedding dimension, m, are then used to form the vector-valued trajectory

    V(t) = [v1(t), v2(t), v3(t), ..., vm(t)]    (1.1)

where
    v1(t) = x(t)
    v2(t) = x(t-d)
    v3(t) = x(t-2d)
    ...
    vm(t) = x(t-(m-1)d)

Takens has shown that, provided the embedding dimension, m, is greater than twice the original dimension of the time series, x(t), {V(t)} will be an embedding of {x(t)}, and, in theory, the dynamics of V(t) possess the same qualitative characteristics as those of x(t), regardless of the lag, d. Due to the non-stationary nature of speech production, the embedding procedure is applied to short consecutive segments. Based on the knowledge that the generation of voiced speech constitutes a low-dimensional system as compared with the higher-dimensional nature of the unvoiced speech generation system (Kubin, 1995), an embedding dimension, m, can be chosen to be greater than twice the original dimension of
the speech signal, and yet small enough to clearly distinguish between voiced and unvoiced speech. A dimension of 3, for instance, meets the requirement m > 2n (where n = 1, the original dimension of speech), and is also sufficient to construct well-structured trajectories of the voiced and usable speech signals. However, since unvoiced speech is of a much higher dimension, the structure generated will be chaotic and highly random in nature. Therefore, choosing m = 3 will result in an unambiguous distinction between voiced and unvoiced speech. The delay constant, d, should be large enough for the reconstructed trajectory to be, on average, maximally open in state space, but small enough to preserve the time resolution of the signal. A constant value of d = 12 was found to provide good discrimination between structured (voiced) and unstructured (unvoiced) speech (Terez, 2002). The presence of a significant amount of interfering speech in a voiced speech signal will adversely affect the structure of the signal, giving it a more unvoiced-like structure; hence the use of the embedding technique as a viable candidate for usable speech classification.

1.3. Problem Statement and Research Goals

Performing speaker identification on speech that has been corrupted by interfering speech at a small (less than 15dB) Target-to-Interferer Ratio leads to degradation of system performance. However, since there exist portions of co-channel speech with relatively
high (above 20dB) TIR (i.e., usable speech), the low-TIR portions can be removed in order to minimize the effect of the interfering speech. The idea of usable speech is novel; therefore, a new technique is presented here which can analyze co-channel speech in ways that currently existing methods cannot. Given the low information content of unvoiced speech for speaker identification, the separation of voiced and unvoiced speech is necessary in order to process only those speech segments that are appropriate for the speaker identification system. A novel voiced/unvoiced classification technique, based on non-linear modeling of speech signals, is presented here.

1.4. Scope of Research

In various multiple-way communication systems, co-channel speech is frequently encountered, leading to significant distortion in the output of the system; hence the need for an effective usable speech classification system. One possible application of a usable speech extraction system is the identification of the target pilot's speech among various aircraft pilots speaking over the same channel, at the same time, and with about the same overall signal energy. In usable speech detection systems, unvoiced speech segments are usually detected and removed (based on their unimportance to the system) using voiced/unvoiced classifiers.
Beyond their use in usable speech extraction systems, voiced/unvoiced classifiers are also applied in various acoustic speech processing techniques such as speech recognition and speaker recognition.

1.5. Disclaimer

It must be noted that all speech data used in this research were obtained from the TIMIT database, which is widely used by researchers in the speech processing field. Recordings were made in a very controlled environment, using professional recording equipment, resulting in high-quality speech. Therefore, although the addition of various types and levels of noise to the input speech signal has been investigated, the performance of the voiced/unvoiced classifier and usable speech extraction system presented in this research may be degraded with lower-quality signals.

1.6. Organization of Thesis Proposal

In this thesis proposal, the classification of speech signals into structured (voiced, usable) and unstructured (unvoiced, unusable) categories is investigated. Fundamental descriptions of co-channel speech, voiced/unvoiced speech and non-linear embedding are presented in the current chapter. Chapter 2 covers reviews of voiced/unvoiced and usable/unusable speech classification. In Chapter 3, the curvature measure is introduced, as well as its application to voiced/unvoiced classification; some preliminary experiments and the results obtained are
presented. The application of this measure to voiced/unvoiced/background classification is also introduced. The nodal density measure is introduced in Chapter 4; its application to usable speech detection is discussed, and its implementation for voiced/unvoiced classification is proposed. In Chapter 5, the difference-mean comparison measure is introduced, along with its application to voiced/unvoiced classification. Chapter 6, the summary, concludes this proposal and discusses possible future work, which includes fusing the introduced features to obtain one optimal voiced/unvoiced classifier.
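Before proceeding, the delay-embedding construction of Eq. (1.1) is illustrated concretely. The sketch below (Python/NumPy; the function name and test signal are illustrative, not part of the proposal) uses the values m = 3 and d = 12 discussed above:

```python
import numpy as np

def delay_embed(x, m=3, d=12):
    """Embed a 1-D signal x into m dimensions using the method of delays
    with lag d.  Row t of the result is [x(t), x(t-d), ..., x(t-(m-1)d)],
    following Eq. (1.1)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * d            # number of complete delay vectors
    if n <= 0:
        raise ValueError("signal too short for this (m, d) pair")
    # column k holds x(t - k*d); t runs from (m-1)*d to the end of the signal
    return np.column_stack([x[(m - 1 - k) * d : (m - 1 - k) * d + n]
                            for k in range(m)])

# Example: a 128-sample frame embedded with m = 3, d = 12
frame = np.sin(2 * np.pi * np.arange(128) / 32)   # stand-in for a voiced frame
V = delay_embed(frame, m=3, d=12)
print(V.shape)   # (104, 3): 128 - (3-1)*12 = 104 delay vectors
```

Each row of V is one point of the reconstructed trajectory; the frame-by-frame measures developed in Chapters 3 to 5 operate on such trajectories.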
CHAPTER 2: BACKGROUND

2.1. Literature Review

In this section, the concepts of voiced/unvoiced and usable/unusable speech are discussed in detail.

2.1.1. Voiced and Unvoiced Speech

Voiced speech is produced by a pulsed air flow caused by the vibration of the vocal cords. The resulting signal can be described as a quasi-periodic waveform with high energy and high adjacent-sample correlation. On the other hand, unvoiced speech, which is produced by turbulent air flow resulting from constrictions in the vocal tract, is characterized by a random, aperiodic waveform with low energy and low correlation. Figure 2.1 below illustrates the difference between voiced and unvoiced speech signals.

[Figure: voiced speech (left) and unvoiced speech (right), amplitude versus sample number]
Figure 2.1: Illustration of the periodic nature of voiced speech (left panel) versus the aperiodic nature of unvoiced speech (right panel).

Note in Figure 2.1 the periodic structure of the voiced frame, as opposed to the random structure of the unvoiced frame. Observe, also, the difference in the maximum amplitude of the two frames: the maximum amplitude of the voiced frame is roughly an order of magnitude greater than that of the unvoiced frame, indicating that voiced speech is much higher in energy than unvoiced speech. Accurately classifying speech signals as voiced or unvoiced is essential in speech analysis tasks such as speaker recognition/identification, speech recognition, speech synthesis and speaker count. As discussed earlier, many features exist in speech signals for distinguishing between voiced and unvoiced portions; some of these features have been investigated previously and will be discussed in subsequent sections of this proposal.

2.1.2. Usable and Unusable Speech

The concept of usable speech derives from the fact that not all portions of speech corrupted by co-channel interference are unusable for speech processing. In this research, usability of speech is defined with respect to the Target-to-Interferer Ratio (Yantorno, 1999), (Smolenski, 2004).
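This energy-ratio definition is straightforward to compute from a pair of frames. A minimal sketch (Python/NumPy; function and variable names are illustrative):

```python
import numpy as np

def tir_db(target, interferer):
    """Target-to-Interferer Ratio: 10*log10(Et/Ei), where Et and Ei are the
    frame energies (sums of squared samples) of the target and interferer."""
    et = float(np.sum(np.asarray(target, dtype=float) ** 2))
    ei = float(np.sum(np.asarray(interferer, dtype=float) ** 2))
    return 10.0 * np.log10(et / ei)

# A target frame at ten times the interferer's amplitude has 100 times
# its energy, i.e. a TIR of about 20 dB.
target = np.ones(160)
interferer = 0.1 * np.ones(160)
print(tir_db(target, interferer))  # approximately 20.0
```

In practice the separate target and interferer signals are not available for co-channel speech, which is why the measures reviewed below attempt to estimate usability without knowledge of the true TIR.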
The ratio of target energy to interferer energy in decibels (dB) is referred to as the Target-to-Interferer Ratio (TIR), which is expressed as

    TIR = 10 log10(Et / Ei) dB    (2.1)

where Et is the energy of the target speech and Ei is the energy of the interfering speech. Experiments have shown that co-channel speech segments with TIR values of 20dB or greater are only minimally corrupted, and can therefore be used effectively in speaker identification (Yantorno, 1999). Attempts have been made to develop usable speech measures having high correlation with TIR, such that, even without knowledge of its TIR, an input speech frame can be classified as usable or unusable. The portions identified as usable can then be extracted for use in speaker identification and other speech processing systems. Some of the prior usable/unusable speech classification methods are discussed in subsequent sections, and a novel approach to usable/unusable speech detection is introduced in this research.

2.2. Traditional Voiced/Unvoiced Detection Measures

2.2.1. Energy and Zero-Crossings (E/ZC)

The energy and zero-crossings approach (Atal and Rabiner, 1976) is one of the traditional voiced/unvoiced speech classification techniques. The energy technique is based on the difference in amplitude (and therefore energy) between voiced and unvoiced speech; in the previous section, it was demonstrated that voiced speech has much higher energy than unvoiced speech. The zero-crossings approach, which involves
counting the number of times the signal crosses the x-axis, is based on the knowledge that unvoiced speech signals, being more noise-like in nature, oscillate much faster than voiced speech signals. Therefore, the zero-crossing rates of voiced signals should be lower than those of unvoiced signals. The procedure for the energy and zero-crossings method, shown in Figure 2.2 below, is as follows: First, the input speech signal is passed through a highpass filter to remove any dc component that might be present. The output of the highpass filter is then separated into frames of about 128 samples. The number of zero-crossings is computed for each frame, as well as the energy of the speech frame, which is obtained from the equation

    E = Σ x(n)²    (2.2)

where x(n) is the speech signal. Voiced/unvoiced classification is then performed based on these parameters.
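The two measurements in this procedure reduce to a few lines of code. The sketch below (Python/NumPy; the synthetic frames and names are illustrative) contrasts a voiced-like frame with a noise-like unvoiced frame:

```python
import numpy as np

def frame_energy(x):
    """Short-time energy, Eq. (2.2): the sum of squared samples."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def zero_crossings(x):
    """Number of times the signal crosses the x-axis (sign changes)."""
    x = np.asarray(x, dtype=float)
    return int(np.count_nonzero(np.diff(np.signbit(x).astype(int))))

n = np.arange(128)
voiced_like = np.sin(2 * np.pi * n / 64)          # slow, high-amplitude oscillation
rng = np.random.default_rng(0)
unvoiced_like = 0.05 * rng.standard_normal(128)   # fast, low-amplitude noise

# Voiced speech: high energy, few crossings; unvoiced speech: the opposite.
print(frame_energy(voiced_like) > frame_energy(unvoiced_like))      # True
print(zero_crossings(voiced_like) < zero_crossings(unvoiced_like))  # True
```

A real classifier would compare both values against thresholds (or, as in Figure 2.2, a minimum-distance rule) rather than against each other.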
[Figure: block diagram — input speech signal s(n), highpass filter, sampling block x(n), zero-crossings and energy measurements, minimum-distance computation and selection, voiced/unvoiced decision]

Figure 2.2: Voiced/unvoiced detection using energy and zero-crossings.

Figure 2.3 below shows the energy and zero-crossings for a speech segment consisting of both voiced and unvoiced speech, computed on a sample-by-sample basis. The file is a recording of the phrase "will serve", and the samples between 4000 and 8000 represent the unvoiced sound, /s/, in the word "serve". In the figure, the high zero-crossing rate of unvoiced speech is readily observed, along with the high energy of the voiced speech signals.
Figure 2.3: Speech utterance, "will serve" (top panel), its zero-crossing rate (middle panel) and its energy (bottom panel).

2.2.2. First-Order Reflection Coefficient/Residual Energy (FR/RE)

Voiced/unvoiced classification has also been developed using the first-order reflection coefficient and the residual energy of the speech signal (Childers, 2000). The reflection coefficient, obtained by modeling the vocal tract as a concatenation of tubes, determines the amount of volume-velocity reflection found at the intersection of two tubes. Due to its high energy, voiced speech possesses a high amount of volume-velocity
as compared with unvoiced speech. Significant information in speech is usually contained in the first coefficient, hence the use of the first-order reflection coefficient, r1, which can be expressed as

    r1 = Rss(1) / Rss(0)    (2.3)

where

    Rss(0) = (1/N) Σ (n = 1 to N) s(n)s(n)      (2.4)

    Rss(1) = (1/N) Σ (n = 1 to N-1) s(n)s(n+1)  (2.5)

N is the number of samples in the analysis frame and s(n) are the speech samples. The residual energy is the energy of the signal after inverse filtering with the LPC (Linear Predictive Coding) coefficients. The chaotic nature of an unvoiced speech signal results in a low residual energy as compared with that of a voiced speech signal. Figure 2.4 below shows the first-order reflection coefficient and the residual energy of a given speech signal. The green line on the top panel is the threshold for voiced/unvoiced classification.
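Equations (2.3) to (2.5) can be computed directly. In the sketch below (Python/NumPy; illustrative names and test signals), a strongly periodic frame yields r1 close to 1, while a noise-like frame yields r1 close to 0:

```python
import numpy as np

def first_reflection_coeff(s):
    """First-order reflection coefficient, Eqs. (2.3)-(2.5):
    r1 = Rss(1)/Rss(0), the lag-1 autocorrelation normalised by energy."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    rss0 = np.dot(s, s) / n            # Eq. (2.4)
    rss1 = np.dot(s[:-1], s[1:]) / n   # Eq. (2.5)
    return rss1 / rss0                 # Eq. (2.3)

n = np.arange(256)
voiced_like = np.sin(2 * np.pi * n / 64)
rng = np.random.default_rng(0)
unvoiced_like = rng.standard_normal(256)
print(first_reflection_coeff(voiced_like))    # close to 1: high adjacent-sample correlation
print(first_reflection_coeff(unvoiced_like))  # close to 0: noise-like frame
```

The residual-energy computation is omitted here, since it requires an LPC analysis; the reflection coefficient alone already separates the two frame types in this toy example.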
Figure 2.4: First-order reflection coefficient (top panel) and residual energy (bottom panel) plotted (in blue) against the corresponding speech utterance (black: voiced; red: unvoiced).

2.3. Traditional Usable/Unusable Speech Detection Measures

2.3.1. Spectral Autocorrelation Peak-to-Valley Ratio (SAPVR)

The SAPVR measure (Krishnamachari et al., 2001) was the first usable speech detection technique to be introduced. In this method, the ratio of peaks to valleys of the spectral autocorrelation of the input speech signal is computed. Voiced, single-speaker speech (or
co-channel speech with high TIR) is highly structured and possesses a well-defined harmonic structure in the frequency domain, as opposed to the random structure of multi-speaker speech. The spectral autocorrelation of usable co-channel speech therefore exhibits well-defined peaks and valleys, and hence a high peak-to-valley ratio, as compared with unusable speech. The SAPVR usable/unusable speech classification process (shown in Figure 2.5 below) is as follows: A 32-point Hamming window is used to sample the input speech signal. The FFT of the windowed samples is computed. Autocorrelation is then performed on the FFT magnitude. The peaks and valleys of the resulting autocorrelation are determined using a peak-picking algorithm. The ratio of peak to valley is computed and compared with a threshold chosen to distinguish between usable and unusable frames. Finally, frames above the threshold are considered usable and extracted for applications such as speaker identification.

[Figure: block diagram — speech signal, Hamming window, FFT, autocorrelation, peak-picking algorithm, usable/unusable decision]

Figure 2.5: SAPVR-based usable/unusable speech classification process.
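The steps above can be sketched as follows (Python/NumPy). The peak-picking here is a deliberately crude stand-in for the algorithm used in the original work, and all names and parameters are illustrative:

```python
import numpy as np

def sapvr(frame, nfft=128):
    """Sketch of SAPVR: window the frame, autocorrelate the FFT magnitude,
    and return the ratio of the largest off-zero-lag peak to the smallest
    valley among the lower lags."""
    x = np.asarray(frame, dtype=float)
    mag = np.abs(np.fft.rfft(x * np.hamming(len(x)), nfft))
    ac = np.correlate(mag, mag, mode="full")[len(mag) - 1:]   # lags >= 0
    ac = ac[: len(mag) // 2] / ac[0]          # keep lower lags, normalise
    interior = ac[1:-1]
    is_peak = (interior > ac[:-2]) & (interior > ac[2:])      # local maxima
    if not is_peak.any():
        return 1.0
    return float(interior[is_peak].max() / ac[1:].min())

n = np.arange(128)
# Harmonic (usable-like) frame: fundamental plus three harmonics
harmonic = sum(np.sin(2 * np.pi * k * n / 16) / k for k in range(1, 5))
rng = np.random.default_rng(0)
noise_like = rng.standard_normal(128)         # stands in for unusable speech
print(sapvr(harmonic) > sapvr(noise_like))    # True: comb spectrum -> strong peaks
```

The comb-like magnitude spectrum of the harmonic frame produces pronounced autocorrelation peaks at multiples of the fundamental's bin spacing, which is exactly the structure the measure exploits.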
Figure 2.6 below shows a frame of speech with its associated FFT magnitude and spectral autocorrelation. The speech signals were sampled and windowed with a 32-point Hamming window, 50% overlap and 128-point zero-padding.

[Figure: speech signal (top panels), FFT magnitude (middle panels) and spectral autocorrelation (bottom panels) for single-speaker (left) and co-channel (right) male speech]

Figure 2.6: Speech signal (top panel), FFT magnitude (middle panel) and spectral autocorrelation (bottom panel) of single-speaker (left) and co-channel (right) speech.

From the figure above, it is evident that the peaks of the spectral autocorrelation of single-speaker speech (bottom left panel) are relatively high compared with those of co-channel speech of low TIR (bottom right panel). This measure was capable of correctly identifying 73% of usable frames (defined based on TIR value) with about 25% false alarms (Krishnamachari et al., 2001).
2.3.2. Adjacent Pitch Period Comparison (APPC)

Voiced speech is known to be periodic in nature; therefore, its adjacent pitch periods are similar in shape. However, the presence of interfering voiced speech creates dissimilarity between adjacent pitch periods of co-channel speech. The APPC measure (Lovekin et al., 2001b) takes advantage of this difference between the adjacent pitch periods of single-speaker and co-channel speech in the development of a usable/unusable speech detection system. The concept of this measure is the comparison of sample-by-sample variations between adjacent pitch periods of the speech signal. With single-speaker voiced speech, a comparison of adjacent pitch periods will yield minimal sample-by-sample variations, and an accurate pitch period length can easily be obtained. However, in the presence of interfering speech, adjacent pitch period comparison results in large variations, and the estimate of the pitch period length may also be inaccurate. Ironically, this inaccurate pitch period estimation for co-channel speech leads to an increase in correct usable/unusable speech detection: the more inaccurate the selected pitch period lengths, the greater the dissimilarity between the pitch periods and, hence, the larger the variations. The APPC process is as follows: The length, N, of each reference pitch period is computed as the distance between the zero-lag point and the next-highest peak of the autocorrelation over the next 10ms. The adjacent pitch period is then taken as samples N+1 to 2N+1.
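A simplified version of this comparison can be sketched as follows (Python/NumPy). The pitch estimator here is a bare-bones autocorrelation peak search, and the lag bounds are illustrative assumptions, not parameters from the original paper:

```python
import numpy as np

def appc_distance(frame, min_lag=20, max_lag=160):
    """Sketch of the APPC idea: estimate the reference pitch period length N
    from the highest non-zero-lag autocorrelation peak, then return the mean
    absolute sample-by-sample difference between the reference pitch period
    and the adjacent one (small -> usable-like, large -> unusable-like)."""
    x = np.asarray(frame, dtype=float)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    n = min_lag + int(np.argmax(ac[min_lag:max_lag]))  # pitch period length N
    ref, adj = x[:n], x[n:2 * n]                       # adjacent pitch periods
    return float(np.mean(np.abs(ref - adj)))

t = np.arange(400)
single = np.sin(2 * np.pi * t / 80)                    # one periodic "speaker"
cochannel = single + 0.9 * np.sin(2 * np.pi * t / 63)  # add an interferer
print(appc_distance(single))       # near 0: adjacent periods match
print(appc_distance(cochannel) > appc_distance(single))  # True
```

With the interferer present, no single lag aligns both components, so whatever period length the estimator picks, the adjacent periods disagree, which is exactly the effect the measure exploits.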
It should be noted that, in this method, changes in length from one pitch period to its neighboring pitch period are ignored. Figures 2.7 and 2.8 below show the amplitude comparisons of adjacent pitch periods for single-speaker and co-channel speech signals, respectively.

Figure 2.7: Single-speaker voiced speech (top panel) and its adjacent pitch period comparison (bottom panel).

Figure 2.8: Co-channel voiced speech (top panel) and its adjacent pitch period comparison (bottom panel).
This measure was able to correctly identify 75% of usable frames (defined based on TIR value), with about 25% false alarms (Lovekin et al., 2001b).
CHAPTER 3: CURVATURE MEASURE

3.1. Introduction to Curvature

In an attempt to obtain a mathematical quantification of the difference between embedded voiced and unvoiced signals, the curvature measure (Smolenski, 2004) was developed using the Serret-Frenet theorem (Rahman & Mulolani, 2001). This theorem states that any 3-dimensional space curve can be completely characterized by the following matrix equation:

    d/ds [T N B]ᵀ = [[0, κ, 0], [-κ, 0, τ], [0, -τ, 0]] [T N B]ᵀ    (3.1)

where κ = curvature, τ = torsion, and T, N and B are the axes shown in Figure 3.1 below; the derivatives are with respect to s, the arc length of the curve.

Figure 3.1: TNB frame classification.
The curvature, which is the quantity considered in this research, is defined as the rate of rotation of the tangent at a point, P, as P moves along a given trajectory. In other words, the curvature measures the angle formed by any three points on the trajectory, and can also be considered the reciprocal of the radius of curvature. Curvature can be expressed by the following equation:

    κ = lim (Δs → 0) Δθ / Δs    (3.2)

where θ is the angle between the tangents to the curve (and φ, the corresponding quantity for the torsion, is the angle between the binormals to the curve). However, the space curve formed by the state space embedding procedure is actually a sampled version of the original phase space trajectory; therefore, the curvature, as well as the other variables in the equation, must be approximated from the discrete embedding curve. The discrete curvature estimate is given by:

    Kn = cos⁻¹( (An · An+1) / (|An| |An+1|) )    (3.3)

where

    An = [xn − xn−1, yn − yn−1, zn − zn−1]    (3.4)

is the elemental arc length of the discrete embedding curve, and x, y and z are the coordinates of the embedded signal.
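Equations (3.3) and (3.4) translate directly into code. The sketch below (Python/NumPy; illustrative names and test trajectories) computes Kn over the rows of an embedded trajectory and checks two limiting cases:

```python
import numpy as np

def discrete_curvature(V):
    """Discrete curvature, Eqs. (3.3)-(3.4): K_n is the angle between
    consecutive elemental arcs A_n of the embedded trajectory, where the
    rows of V are the embedded points (x_n, y_n, z_n)."""
    A = np.diff(np.asarray(V, dtype=float), axis=0)        # A_n, Eq. (3.4)
    a, b = A[:-1], A[1:]
    cos_k = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1)
                                     * np.linalg.norm(b, axis=1))
    # clip guards against rounding pushing |cos| marginally above 1
    return np.arccos(np.clip(cos_k, -1.0, 1.0))            # K_n, Eq. (3.3)

# A straight-line trajectory never turns, so every K_n is 0 ...
line = np.column_stack([np.arange(10.0)] * 3)
print(discrete_curvature(line).max())  # 0.0
# ... while a jagged, noise-like trajectory turns sharply at every point.
rng = np.random.default_rng(0)
jagged = rng.standard_normal((10, 3))
print(discrete_curvature(jagged).mean() > 1.0)  # True: large turning angles
```

Applied to embedded speech frames, the smooth loops of a voiced trajectory behave like the first case, and the random walk of an unvoiced trajectory like the second.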
3.2. Preliminary Research: Voiced/Unvoiced Classification

Figure 3.2 below shows the embedded signals of a voiced and an unvoiced speech frame, each consisting of 128 sample points. Note the difference between the structure of the embedded voiced speech and that of the unvoiced speech. It is evident that the angle between any three points on the trajectory will be much greater for the embedded voiced signal than for the embedded unvoiced signal.

Figure 3.2: Embedded voiced (left panel) and embedded unvoiced (right panel) speech frames.

Figure 3.3 below shows the sample-by-sample curvature values (black) plotted against the corresponding speech segment (blue), which consists of voiced and unvoiced speech. The negative of the speech signal energy, computed using the traditional energy measure discussed earlier, is also plotted (in red) against the speech signal for comparison purposes. Note the correlation between energy and curvature in the utterance.
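The state-space embedding that produces such trajectories can be sketched as follows; the dimension m = 3 and delay d = 12 are illustrative defaults, taken from the values quoted for the experiments in Chapter 4:

```python
def takens_embed(signal, m=3, d=12):
    """Takens' method of delays: map a scalar series x(n) to the
    m-dimensional points (x(n), x(n+d), ..., x(n+(m-1)d))."""
    n_points = len(signal) - (m - 1) * d
    return [tuple(signal[i + j * d] for j in range(m))
            for i in range(n_points)]
```

For a 128-sample frame with m = 3 and d = 12, this yields 128 − 24 = 104 embedded 3-D points, whose coordinates are the x, y, and z used in Eq. (3.4).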
Figure 3.3: Curvature (black) and energy (red) plotted against the speech utterance (blue).

Usually, voiced/unvoiced classification is performed on a frame-by-frame basis, due to the difficulty of assigning a class to a single sample. Obtaining a common result for each speech frame eliminates single-sample detection errors made by the curvature algorithm. However, frame-by-frame classification of speech can lead to an over-approximation of some short segments. Moreover, voiced and unvoiced start- and end-points could be inaccurately detected due to the averaging of the decision value. In this research, this problem is addressed by segmenting the speech signals into relatively small frames (about 15ms) before processing. Figure 3.4 below shows a histogram of curvature for labeled voiced and unvoiced speech signals obtained from the TIMIT database. The blue bars represent the voiced distribution, while the red bars represent the unvoiced distribution. Note the separation between the two distributions.
Figure 3.4: Curvature distribution for clean speech, voiced blue, unvoiced red.

Noise and Filtering

It must be noted that the data used in Figure 3.4 (in the previous section) were obtained for clean speech. However, due to the presence of noise in most speech communication channels, a robust measure is required. Pink noise, a type of noise that flickers throughout the signal, is sometimes found in speech. This category of noise is often referred to as 1/f noise because its power spectrum P(f), as a function of frequency, can be expressed as P(f) = 1/f^a, where a is very close to 1. The curvature distribution of speech corrupted by pink noise at 15dB SNR shows that pink noise has minimal effect on the accuracy of the curvature measure. This is illustrated in Figure 3.5 below.
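As an illustration of this noise model, a 1/f spectrum can be approximated by summing random-phase sinusoids whose power falls off as 1/f (amplitude as 1/√f). This is a hedged sketch for experimentation, not the noise source used in the thesis experiments; the component count and seed are arbitrary choices:

```python
import math
import random

def pink_noise(n, n_components=100, seed=0):
    """Approximate 1/f ('pink') noise over n samples by summing
    random-phase sinusoids with amplitude ~ 1/sqrt(f), i.e. power ~ 1/f."""
    rng = random.Random(seed)
    freqs = range(1, n_components + 1)  # f = cycles per n samples
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in freqs]
    return [sum(math.sin(2.0 * math.pi * f * t / n + p) / math.sqrt(f)
                for f, p in zip(freqs, phases))
            for t in range(n)]
```

The generated sequence can then be scaled and added to a clean utterance to obtain a desired SNR.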
Figure 3.5: Curvature distribution for clean speech (left panel) and speech + pink noise at 15dB SNR (right panel), voiced blue, unvoiced red.

On the other hand, white noise, the most common type of noise found in speech signals, has an adverse effect on the curvature measure; this is illustrated in Figure 3.6 below.

Figure 3.6: Curvature distribution for clean speech (left panel) and speech + white noise at 15dB SNR (right panel), voiced blue, unvoiced red.

From Figure 3.6, it is observed that, with the presence of white noise in the speech signal, the curvature pdf for voiced speech shifts to the left of the curve; in other words,
the discriminative power of the measure decreases. This can be explained by the chaotic nature of white noise, which introduces disorganization into the well-defined structure of the embedded voiced signal, thereby decreasing the curvature values of the voiced samples, i.e., making voiced speech unvoiced-like. In order to minimize the effect of noise on the speech signals, a 10th order (11-point) moving average filter is used as a pre-processing block for the input speech signal. A moving average filter has been chosen because it is very easy to implement and yet optimal for the simple task of reducing chaotic noise signals while maintaining a relatively sharp impulse response. The expression for an M-point moving average filter is given by:

y[n] = \frac{1}{M} \sum_{k=0}^{M-1} x[n-k]    (3.5)

where x and y are the input and output of the filter, respectively. Since the significant information in speech signals is found in the low-frequency components of the signal, the moving average filter, which is a lowpass filter, minimizes the effects of noise on the signal while retaining the information needed for voiced/unvoiced classification. The curvature voiced/unvoiced distributions for clean speech and speech with pink and white noise at 15dB SNR after filtering are given in Figures 3.7 and 3.8 below. It can be observed from these figures that the performance of the curvature measure is not degraded by filtering clean speech or speech with pink noise. Note, however, that, in the case of white noise, filtering causes the voiced distribution to shift towards the right, making it more like the distribution for clean speech. Therefore, the moving
average filter is very effective in reducing the effect of noise on the performance of the curvature measure.

Figure 3.7: Curvature distribution for clean speech (left panel) and speech + pink noise at 15dB SNR (right panel) after filtering, voiced blue, unvoiced red.

Figure 3.8: Curvature distribution for clean speech (left panel) and speech + white noise at 15dB SNR (right panel) after filtering, voiced blue, unvoiced red.
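The M-point moving average of Eq. (3.5) amounts to a short FIR lowpass filter. A minimal sketch of the 11-point (10th order) causal form, assuming zero samples before the start of the signal, is:

```python
def moving_average(x, M=11):
    """Causal M-point moving average, y[n] = (1/M) * sum_{k=0}^{M-1} x[n-k].
    Samples before the start of the signal are treated as zero."""
    return [sum(x[max(0, n - M + 1):n + 1]) / M for n in range(len(x))]
```

For a constant input, the output settles to the input value after M − 1 samples; the reduced gain on the first few samples reflects the zero initial conditions.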
Experiments and Results

All speech data used in the following experiments were obtained from the TIMIT database (13 female files and 12 male files). Curvature-based voiced/unvoiced decisions were made using the following procedure:

- The speech signal is filtered using a 10th order moving average filter.
- The output of the filter is then segmented into frames of 128 samples each.
- Takens' embedding technique is then applied to each frame.
- The curvature values of each embedded frame are then computed and averaged to produce a voiced or unvoiced decision for that frame, based on a threshold of 2.3.

The block diagram for the voiced/unvoiced classification process is given in Figure 3.9 below.

Figure 3.9: Curvature-based decision process (speech signal → moving average filter → framing → nonlinear embedding → curvature algorithm → voiced/unvoiced detection).

In choosing a threshold, one has two options: an optimal (and therefore different) threshold for each noise condition, or a single optimal threshold for all conditions; however, since prior knowledge of the noise state cannot (as yet) be determined, one optimum threshold has been chosen for all three noise states. Figures 3.10 and 3.11 below show the ROC curves for the voiced and unvoiced hits and false alarms for all three noise states.
Figure 3.10: ROC curves for different noise states + voiced speech (clean, 15dB pink, 15dB white).

Figure 3.11: ROC curves for different noise states + unvoiced speech (clean, 15dB pink, 15dB white).

It is observed from the above figures that it is possible to achieve a minimum of 95% hits with 5% false alarms for each of the noise states; however, these values are only
attainable if the optimum threshold for each individual noise state is used. Choosing one threshold that produces the best overall results for all three noise states together is more practical, even though it leads to a reduction in the maximum accuracy for each individual noise state. A threshold of 2.3 was found to yield the best overall result for all three noise states. Frames whose curvature values fell below 2.3 were considered unvoiced, and frames whose curvature values were above 2.3 were considered voiced. Figures 3.12 to 3.14 below show curvature-based voiced/unvoiced decision values (voiced: 1, unvoiced: 0) plotted against color-coded speech data with different speech classes. The data are coded as follows: voiced, weak voiced, unvoiced, transition, and silence. It must be noted, however, that in this research only voiced/unvoiced classification is performed, and all other voicing states are ignored.

Figure 3.12: Curvature-based decisions for clean speech.
Figure 3.13: Curvature-based decisions for the corresponding speech + added pink noise at 15dB SNR.

Figure 3.14: Curvature-based decisions for the corresponding speech + added white noise at 15dB SNR.
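The frame-level decision procedure listed above can be sketched end-to-end as follows. This is an illustrative reimplementation, not the thesis code, and it makes one stated assumption: Eq. (3.3) as printed measures the turning angle between consecutive arc vectors, so the sketch averages the supplementary vertex angle (π minus the turning angle) so that smooth, voiced-like trajectories score above the 2.3 threshold, matching the decision rule:

```python
import math

def classify_frame(frame, threshold=2.3, m=3, d=12):
    """Embed one frame via Takens' method of delays, average a per-sample
    angle measure, and threshold it (voiced if above)."""
    # Method of delays: m-dimensional points with delay d
    pts = [tuple(frame[i + j * d] for j in range(m))
           for i in range(len(frame) - (m - 1) * d)]
    # Elemental arc vectors between consecutive embedded points
    arcs = [tuple(b - a for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])]
    angles = []
    for a, b in zip(arcs, arcs[1:]):
        na = math.sqrt(sum(v * v for v in a))
        nb = math.sqrt(sum(v * v for v in b))
        if na > 0.0 and nb > 0.0:
            c = sum(x * y for x, y in zip(a, b)) / (na * nb)
            turning = math.acos(max(-1.0, min(1.0, c)))
            # Vertex angle: close to pi for a locally smooth trajectory
            angles.append(math.pi - turning)
    mean_angle = sum(angles) / len(angles) if angles else 0.0
    return ("voiced" if mean_angle > threshold else "unvoiced"), mean_angle
```

A periodic (voiced-like) frame traces a smooth closed trajectory and scores near π, while a chaotic (unvoiced-like) frame scores well below the threshold.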
End-Point Detection

Because the curvature measure cannot detect speech states other than voiced and unvoiced, an undecided region was created in an attempt to detect voiced and unvoiced end-points as accurately as possible. For the undecided band, two thresholds were chosen: one slightly above the original threshold, for voiced speech, and the other slightly below it, for unvoiced speech. The advantages of having such a band are improved accuracy in end-point detection and a reduction in false alarms for voiced and unvoiced detections. However, some actual voiced and unvoiced frames could fall within the undecided region, resulting in a reduction in hits and an increase in misses. Figure 3.15 illustrates the end-point detection accuracy of the curvature measure for clean speech.

Figure 3.15: Speech data (top panel), ground truth (2nd panel), curvature-based classification (3rd panel), and difference between ground truth and curvature-based classifications (4th panel).
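The two-threshold undecided band can be sketched as a three-way decision matching the 1 / 0 / −1 coding in Figure 3.15. The band edges used here (2.1 and 2.5 around the 2.3 threshold) are illustrative assumptions, since the proposal does not state the exact values:

```python
def classify_with_band(mean_curvature, low=2.1, high=2.5):
    """Three-way voiced/undecided/unvoiced decision.
    low/high are assumed band edges around the 2.3 threshold."""
    if mean_curvature >= high:
        return 1    # voiced
    if mean_curvature <= low:
        return -1   # unvoiced
    return 0        # undecided ("don't care")
```

Frames scoring inside the band are deferred rather than forced into a class, which is what improves end-point accuracy at the cost of some missed detections.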
The don't care regions in the actual classification are all speech classes other than voiced or unvoiced, while those in the curvature-based classification are the undecided regions. It should be noted that the don't care regions in both cases are almost the same. Therefore, if accurate end-point detection is desired, the use of an undecided band could be effective.

Result Comparisons

Figures 3.16 and 3.17 below show the comparison of the performance of the curvature measure with the traditional voiced/unvoiced classifiers presented in the preceding chapters. These results were obtained by subtracting the average false alarms from the average hits, using 25 different speech files from the TIMIT database.

Figure 3.16: Comparisons of hits minus false alarms for voiced speech (FR/RE, E/ZC, and curvature; clean, 15dB pink, 15dB white).
Figure 3.17: Comparisons of hits minus false alarms for unvoiced speech (FR/RE, E/ZC, and curvature; clean, 15dB pink, 15dB white).

It is observed in Figures 3.16 and 3.17 that the curvature measure is comparable to the traditional measures in a noiseless environment. However, in the presence of white noise, the curvature measure performs better than the FR/RE measure and is comparable to the E/ZC measure. Furthermore, with pink noise interference, the curvature measure is decidedly better than either traditional measure for voiced/unvoiced classification.

3.3. Proposed Research: Voiced/Unvoiced/Background Classification

Separation of unvoiced speech and background noise has been a challenge in speech classification due to their similarity in structure. In this proposal, some preliminary experiments explore possible differences between the structures of embedded unvoiced speech and background noise in order to extend the classification to voiced/unvoiced/background. Figures 3.18 and 3.19 below show the embedded signals
of voiced, unvoiced and background frames, each consisting of 128 sample points, with added pink noise at 15dB SNR and white noise at 15dB SNR, respectively.

Figure 3.18: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) frames with added pink noise at 15dB SNR.

Figure 3.19: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) frames with added white noise at 15dB SNR.

It is evident from the above figures that, although the embedded background (or noise) is chaotic in nature, some differentiation does exist between the structures of embedded unvoiced speech and background, and this differentiation can also be measured using the curvature
algorithm. With white noise, however, the difference between the structures of embedded unvoiced speech and background is not clear; therefore, as in the case of voiced/unvoiced classification, a 10th order moving average filter was used to pre-process the speech before applying the embedding technique. Figures 3.20 and 3.21 below show the embedded signals of voiced, unvoiced and background speech with added pink noise at 15dB SNR and white noise at 15dB SNR, respectively, after filtering.

Figure 3.20: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) speech frames with added pink noise at 15dB SNR after filtering.

Figure 3.21: Embedded voiced (left panel), unvoiced (middle panel) and background (right panel) speech frames with added white noise at 15dB SNR after filtering.
It is readily observed that filtering increases the differentiation between unvoiced speech and background, with both added pink noise and added white noise.
CHAPTER 4: NODAL DENSITY MEASURE

4.1. Introduction to Nodal Density

Another distinguishing feature between embedded voiced and unvoiced signals, observable from Figure 3.2 in the previous chapter, is the density of the signals. The embedded voiced signal appears to be much less dense than the unvoiced signal; however, the presence of an appreciable amount of interfering speech in voiced signals will introduce significant distortion in their structured pattern, thereby increasing the apparent density. Figures 4.1 and 4.2 below show 256-sample-point frames of unusable and usable voiced speech, respectively, embedded using Takens' method of delays with m = 3 and d = 12. The co-channel data were obtained by combining two different frames of speech from different speakers, scaling them to obtain the desired target-to-interferer ratio (TIR), and then extracting the voiced portions using one of the traditional voiced/unvoiced classifiers.

Figure 4.1: Embedded data for co-channel speech at 3dB TIR.
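The co-channel mixing step can be sketched as follows: the interferer is scaled so that the ratio of target power to interferer power in the mixture equals the desired TIR in dB. This is a minimal illustration of the scaling described above, not the thesis code, and it assumes equal-length frames:

```python
import math

def mix_at_tir(target, interferer, tir_db):
    """Scale the interferer so that 10*log10(P_target / P_interferer)
    equals tir_db, then add the two signals sample by sample."""
    p_t = sum(x * x for x in target) / len(target)
    p_i = sum(x * x for x in interferer) / len(interferer)
    # Desired interferer power: P_target / 10^(TIR/10)
    g = math.sqrt(p_t / (p_i * 10.0 ** (tir_db / 10.0)))
    return [t + g * i for t, i in zip(target, interferer)]
```

After mixing, subtracting the target from the mixture recovers the scaled interferer, so the achieved TIR can be checked directly.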
More informationROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION. Frank Kurth, Alessia Cornaggia-Urrigshardt and Sebastian Urrigshardt
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) ROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION Frank Kurth, Alessia Cornaggia-Urrigshardt
More informationINTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)
INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) Proceedings of the 2 nd International Conference on Current Trends in Engineering and Management ICCTEM -214 ISSN
More informationEnhancement of Speech Signal by Adaptation of Scales and Thresholds of Bionic Wavelet Transform Coefficients
ISSN (Print) : 232 3765 An ISO 3297: 27 Certified Organization Vol. 3, Special Issue 3, April 214 Paiyanoor-63 14, Tamil Nadu, India Enhancement of Speech Signal by Adaptation of Scales and Thresholds
More informationAudio Signal Compression using DCT and LPC Techniques
Audio Signal Compression using DCT and LPC Techniques P. Sandhya Rani#1, D.Nanaji#2, V.Ramesh#3,K.V.S. Kiran#4 #Student, Department of ECE, Lendi Institute Of Engineering And Technology, Vizianagaram,
More informationScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 122 126 International Conference on Information and Communication Technologies (ICICT 2014) Unsupervised Speech
More informationEnhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis
Enhancement of Speech Signal Based on Improved Minima Controlled Recursive Averaging and Independent Component Analysis Mohini Avatade & S.L. Sahare Electronics & Telecommunication Department, Cummins
More informationLab 8. Signal Analysis Using Matlab Simulink
E E 2 7 5 Lab June 30, 2006 Lab 8. Signal Analysis Using Matlab Simulink Introduction The Matlab Simulink software allows you to model digital signals, examine power spectra of digital signals, represent
More informationVHF Radar Target Detection in the Presence of Clutter *
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 6, No 1 Sofia 2006 VHF Radar Target Detection in the Presence of Clutter * Boriana Vassileva Institute for Parallel Processing,
More informationEE228 Applications of Course Concepts. DePiero
EE228 Applications of Course Concepts DePiero Purpose Describe applications of concepts in EE228. Applications may help students recall and synthesize concepts. Also discuss: Some advanced concepts Highlight
More informationVisual Interpretation of Hand Gestures as a Practical Interface Modality
Visual Interpretation of Hand Gestures as a Practical Interface Modality Frederik C. M. Kjeldsen Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate
More informationChapter 2 Direct-Sequence Systems
Chapter 2 Direct-Sequence Systems A spread-spectrum signal is one with an extra modulation that expands the signal bandwidth greatly beyond what is required by the underlying coded-data modulation. Spread-spectrum
More informationSINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase and Reassigned Spectrum
SINOLA: A New Analysis/Synthesis Method using Spectrum Peak Shape Distortion, Phase Reassigned Spectrum Geoffroy Peeters, Xavier Rodet Ircam - Centre Georges-Pompidou Analysis/Synthesis Team, 1, pl. Igor
More informationAutomatic Transcription of Monophonic Audio to MIDI
Automatic Transcription of Monophonic Audio to MIDI Jiří Vass 1 and Hadas Ofir 2 1 Czech Technical University in Prague, Faculty of Electrical Engineering Department of Measurement vassj@fel.cvut.cz 2
More informationChapter 2 Channel Equalization
Chapter 2 Channel Equalization 2.1 Introduction In wireless communication systems signal experiences distortion due to fading [17]. As signal propagates, it follows multiple paths between transmitter and
More informationVoice Excited Lpc for Speech Compression by V/Uv Classification
IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 3, Ver. II (May. -Jun. 2016), PP 65-69 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Voice Excited Lpc for Speech
More informationCHAPTER 6 INTRODUCTION TO SYSTEM IDENTIFICATION
CHAPTER 6 INTRODUCTION TO SYSTEM IDENTIFICATION Broadly speaking, system identification is the art and science of using measurements obtained from a system to characterize the system. The characterization
More informationFourier Methods of Spectral Estimation
Department of Electrical Engineering IIT Madras Outline Definition of Power Spectrum Deterministic signal example Power Spectrum of a Random Process The Periodogram Estimator The Averaged Periodogram Blackman-Tukey
More informationAspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification. Daryush Mehta
Aspiration Noise during Phonation: Synthesis, Analysis, and Pitch-Scale Modification Daryush Mehta SHBT 03 Research Advisor: Thomas F. Quatieri Speech and Hearing Biosciences and Technology 1 Summary Studied
More informationSystem analysis and signal processing
System analysis and signal processing with emphasis on the use of MATLAB PHILIP DENBIGH University of Sussex ADDISON-WESLEY Harlow, England Reading, Massachusetts Menlow Park, California New York Don Mills,
More informationDigital Speech Processing and Coding
ENEE408G Spring 2006 Lecture-2 Digital Speech Processing and Coding Spring 06 Instructor: Shihab Shamma Electrical & Computer Engineering University of Maryland, College Park http://www.ece.umd.edu/class/enee408g/
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 BACKGROUND The increased use of non-linear loads and the occurrence of fault on the power system have resulted in deterioration in the quality of power supplied to the customers.
More informationNOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or
NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying
More information(i) Understanding of the characteristics of linear-phase finite impulse response (FIR) filters
FIR Filter Design Chapter Intended Learning Outcomes: (i) Understanding of the characteristics of linear-phase finite impulse response (FIR) filters (ii) Ability to design linear-phase FIR filters according
More informationPR No. 119 DIGITAL SIGNAL PROCESSING XVIII. Academic Research Staff. Prof. Alan V. Oppenheim Prof. James H. McClellan.
XVIII. DIGITAL SIGNAL PROCESSING Academic Research Staff Prof. Alan V. Oppenheim Prof. James H. McClellan Graduate Students Bir Bhanu Gary E. Kopec Thomas F. Quatieri, Jr. Patrick W. Bosshart Jae S. Lim
More informationImproved Detection by Peak Shape Recognition Using Artificial Neural Networks
Improved Detection by Peak Shape Recognition Using Artificial Neural Networks Stefan Wunsch, Johannes Fink, Friedrich K. Jondral Communications Engineering Lab, Karlsruhe Institute of Technology Stefan.Wunsch@student.kit.edu,
More informationEC 6501 DIGITAL COMMUNICATION UNIT - II PART A
EC 6501 DIGITAL COMMUNICATION 1.What is the need of prediction filtering? UNIT - II PART A [N/D-16] Prediction filtering is used mostly in audio signal processing and speech processing for representing
More informationA Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal
International Journal of ISSN 0974-2107 Systems and Technologies IJST Vol.3, No.1, pp 11-16 KLEF 2010 A Novel Technique or Blind Bandwidth Estimation of the Radio Communication Signal Gaurav Lohiya 1,
More informationA Survey and Evaluation of Voice Activity Detection Algorithms
A Survey and Evaluation of Voice Activity Detection Algorithms Seshashyama Sameeraj Meduri (ssme09@student.bth.se, 861003-7577) Rufus Ananth (anru09@student.bth.se, 861129-5018) Examiner: Dr. Sven Johansson
More informationFinite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms. Armein Z. R. Langi
International Journal on Electrical Engineering and Informatics - Volume 3, Number 2, 211 Finite Word Length Effects on Two Integer Discrete Wavelet Transform Algorithms Armein Z. R. Langi ITB Research
More informationJitter Analysis Techniques Using an Agilent Infiniium Oscilloscope
Jitter Analysis Techniques Using an Agilent Infiniium Oscilloscope Product Note Table of Contents Introduction........................ 1 Jitter Fundamentals................. 1 Jitter Measurement Techniques......
More informationDigital Processing of Continuous-Time Signals
Chapter 4 Digital Processing of Continuous-Time Signals 清大電機系林嘉文 cwlin@ee.nthu.edu.tw 03-5731152 Original PowerPoint slides prepared by S. K. Mitra 4-1-1 Digital Processing of Continuous-Time Signals Digital
More informationDESIGN OF GLOBAL SAW RFID TAG DEVICES C. S. Hartmann, P. Brown, and J. Bellamy RF SAW, Inc., 900 Alpha Drive Ste 400, Richardson, TX, U.S.A.
DESIGN OF GLOBAL SAW RFID TAG DEVICES C. S. Hartmann, P. Brown, and J. Bellamy RF SAW, Inc., 900 Alpha Drive Ste 400, Richardson, TX, U.S.A., 75081 Abstract - The Global SAW Tag [1] is projected to be
More informationReading: Johnson Ch , Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday.
L105/205 Phonetics Scarborough Handout 7 10/18/05 Reading: Johnson Ch.2.3.3-2.3.6, Ch.5.5 (today); Liljencrants & Lindblom; Stevens (Tues) reminder: no class on Thursday Spectral Analysis 1. There are
More informationCHAPTER 6 SIGNAL PROCESSING TECHNIQUES TO IMPROVE PRECISION OF SPECTRAL FIT ALGORITHM
CHAPTER 6 SIGNAL PROCESSING TECHNIQUES TO IMPROVE PRECISION OF SPECTRAL FIT ALGORITHM After developing the Spectral Fit algorithm, many different signal processing techniques were investigated with the
More informationA Parametric Model for Spectral Sound Synthesis of Musical Sounds
A Parametric Model for Spectral Sound Synthesis of Musical Sounds Cornelia Kreutzer University of Limerick ECE Department Limerick, Ireland cornelia.kreutzer@ul.ie Jacqueline Walker University of Limerick
More informationBIT SYNCHRONIZERS FOR PSK AND THEIR DIGITAL IMPLEMENTATION
BIT SYNCHRONIZERS FOR PSK AND THEIR DIGITAL IMPLEMENTATION Jack K. Holmes Holmes Associates, Inc. 1338 Comstock Avenue Los Angeles, California 90024 ABSTRACT Bit synchronizers play an important role in
More information