3. SPEECH ANALYSIS

3.1 INTRODUCTION TO SPEECH ANALYSIS

Many speech processing applications [22] exploit speech production and perception to accomplish speech analysis. By speech analysis we extract a few properties or features from the speech signal S(n). This involves a transformation of S(n) into another signal, a set of signals or a set of parameters, with the objective of simplifying the speech signal and removing the redundancy present in it. In speech analysis we extract the features directly pertinent to a given application while suppressing redundant aspects of the speech. The original signal may approach optimality from the point of view of human perception, but it contains much repetitive data when processed by a computer. Eliminating such redundancy aids accuracy in computer applications and makes phonetic interpretation simpler [21]. For speech storage or recognition, eliminating the redundant and irrelevant aspects of the speech waveform simplifies data manipulation. An efficient representation for speech recognition would be a set of parameters that yields similar values for the same phonemes uttered by different speakers. For speech synthesis, continuity of the parameter values in time is important for reconstructing a smooth speech signal; the synthesized speech must be a close replica of the original.

Speech analysis can be done either in the time domain or in the frequency domain, and is performed to obtain a more useful representation of the speech signal in terms of parameters that contain the relevant information in an efficient format. The speech analyzer periodically examines a limited time range of speech, called a window. The choice of duration and shape of the window reflects a compromise between time and frequency resolution: accurate time resolution is useful for segmenting the speech signal and determining periods in voiced speech, while good frequency resolution helps to identify different sounds. The former requires relatively little calculation but is limited to simple speech measures such as energy and periodicity, while spectral analysis requires more computational effort but characterizes sounds more usefully.

Speech signal information can be partitioned into parameters and features. Speech parameters can be obtained by simple mathematical rules but have relatively low information content, whereas speech features require more complex computation and yield a more compact representation of the speech. Many speech analyzers extract only parameters, to avoid controversial decisions. The standard and widely used speech analysis model is the linear predictive analyzer, in which both parameters and features of the speech signal are extracted.

Speech signals are basically partitioned into voiced and unvoiced segments [23]. A voiced segment is quasi-periodic, with a period equal to the pitch period, and has high energy content. The unvoiced part of the speech resembles random noise, with no periodicity. Parts of the speech that are neither voiced nor unvoiced are called transition segments [25]. The speech can be analysed by either the

- Time domain method
- Frequency domain method

3.2 SHORT-TIME SPEECH ANALYSIS

The speech signal is dynamic, with voiced and unvoiced segments. The variation in the speech signal is due to vocal cord vibration and changes in vocal tract shape. Non-periodic variations are not under the control of the speaker, whereas the voiced segments are directly under the speaker's control. Speech analysis is used to extract parameters related to the periodic portions of speech. It usually assumes that the properties of the speech signal change slowly with time, allowing the examination of a short time window of speech to extract parameters presumed to remain fixed for the duration of the window.

Most of the techniques yield parameters averaged over the course of the time window. To model dynamic parameters we must divide the signal into successive windows or frames, so that the parameters are calculated often enough to follow the relevant changes in the signal.

3.2.1 Windowing

Windowing determines the portion of the speech signal that is to be processed, by zeroing out the signal outside the region of interest [25]. It is the multiplication of a speech signal S(n) by a window W(n), which yields a set of speech samples X(n) weighted by the shape of the window. The window W(n) may in principle have infinite duration, but most practical windows have finite length to simplify computation. Many speech applications require averaging, to yield parameters that represent the slowly varying aspects of vocal tract movement. The amount of smoothing desired leads to a trade-off in the choice of window size among three factors: if W(n) is short, the speech properties of interest change little within the window; if W(n) is long, all the desired parameters can be calculated, and the longer window also averages out random noise; with a medium-sized window, the analysis of S(n) is repeated periodically. The effect of the window changes with its shape and size. W(n) rarely changes except at the edges. The simplest window has a rectangular shape, represented as r(n):

$$w(n) = r(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3.1)$$
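As a rough illustration of this framing-and-windowing step, the sketch below splits a signal into overlapping frames and weights each frame by the rectangular window of equation 3.1 or by a Hamming window. The sampling rate, frame length, hop size and test tone are illustrative assumptions, not values from the text.

```python
# Minimal sketch of framing and windowing (eq. 3.1); all sizes below
# are illustrative assumptions.
import numpy as np

def frame_signal(s, frame_len, hop):
    """Split signal s into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(s) - frame_len) // hop
    return np.stack([s[i*hop : i*hop + frame_len] for i in range(n_frames)])

fs = 8000                                  # assumed sampling rate (Hz)
t = np.arange(fs) / fs
s = np.sin(2*np.pi*200*t)                  # toy "voiced" tone at 200 Hz

frames = frame_signal(s, frame_len=240, hop=80)  # 30 ms frames, 10 ms hop
rect = np.ones(240)                        # rectangular window r(n), eq. 3.1
hamm = np.hamming(240)                     # Hamming window for comparison

x_rect = frames * rect                     # X(n) = S(n) W(n), rectangular
x_hamm = frames * hamm                     # X(n) = S(n) W(n), Hamming
```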

The rectangular window gives equal weight to all samples and limits the analysis range to N consecutive samples. Depending on the application, the size and shape of the window can be varied [21]. The rectangular window has high frequency resolution (a narrow main lobe), but suffers from high spectral leakage produced by its large side lobes; it is therefore noisy and not preferred for speech analysis. The Hamming window, by contrast, has much lower side lobes but a wider main lobe, and hence lower frequency resolution. These drawbacks can be mitigated by a hybrid approach: we get good temporal resolution by using a short window and good frequency resolution by using a longer window [25]. If the window is small, the short-time energy changes very rapidly and maximum bandwidth is needed. If the window is large, the short-time energy is averaged over a long time and no longer reflects the changing properties of the speech signal. Hence a suitable window size must be chosen to represent the harmonic structure accurately.

3.3 TIME DOMAIN PARAMETERS

Time domain analysis is simple to implement. It transforms a speech signal into a set of parameter signals that vary much more slowly in time than the original signal, which allows more efficient storage or manipulation of the relevant speech parameters. To capture the relevant aspects of speech we require several such parameters, which can be obtained by sampling these signals at a lower rate. The short-time processing techniques, in both the time and frequency domains, produce parameter signals of the form

$$Q(n) = \sum_{m=-\infty}^{\infty} T[S(m)]\, W(n-m) \qquad (3.2)$$

The speech signal S(n) undergoes a transformation T and is weighted by the window W(n) to yield Q(n). Q(n) thus corresponds to a convolution of T[S(n)] with W(n); it is a smoothed version of T[S(n)], and its bandwidth matches that of W(n). When Q(n) in equation 3.2 corresponds to short-time energy, the measure emphasizes high amplitudes. Such measures help segment speech into smaller phonetic units called phonemes. The amplitude of Q(n) varies greatly between voiced and unvoiced signals, and slightly between phonemes; these variations help us detect the boundaries between speech and pauses. The widely used time domain techniques are

- Short-time averaging zero-crossing rate
- Short-time autocorrelation

3.3.1 Short-Time Averaging Zero-crossing Rate (ZCR)

A spectral measure of the speech signal normally requires a Fourier or other frequency transformation. A simple measure called the zero-crossing rate provides adequate spectral information at low cost. In a signal S(n), a zero crossing occurs when the waveform crosses the time axis, i.e. changes algebraic sign. A sinusoidal signal has two zero crossings per period, so for discrete-time signals the fundamental frequency can be estimated from the zero-crossing rate per sample as

$$F_0 = \frac{ZCR \cdot f_s}{2} \qquad (3.3)$$

where f_s is the sampling frequency. The ZCR helps in deciding between voiced and unvoiced speech, as most of the energy of a voiced signal is at low frequencies, whereas broadband noise excites mostly the higher frequencies.
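The sketch below computes two such short-time measures per frame: the short-time energy (equation 3.2 with T[s(m)] = s(m)^2 and a rectangular window) and the zero-crossing rate of equation 3.3. The sampling rate, frame length and test signals are assumptions made for illustration.

```python
# Sketch of two short-time measures in the form of eq. 3.2: short-time
# energy and zero-crossing rate per frame. Sizes are assumptions.
import numpy as np

def short_time_energy(frames):
    # T[s(m)] = s(m)^2, summed over a rectangular window
    return np.sum(frames**2, axis=1)

def zero_crossing_rate(frames):
    # fraction of adjacent sample pairs whose algebraic sign differs
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# A voiced-like frame (low-frequency tone) vs an unvoiced-like frame (noise):
fs = 8000
t = np.arange(240) / fs
voiced = np.sin(2*np.pi*150*t)
unvoiced = 0.1 * np.random.randn(240)

frames = np.stack([voiced, unvoiced])
print(short_time_energy(frames))   # voiced frame has much higher energy
print(zero_crossing_rate(frames))  # noise frame has much higher ZCR

# Eq. 3.3: recover a sinusoid's frequency from its per-sample ZCR
zcr = zero_crossing_rate(frames)[0]
print(zcr * fs / 2)                # approximately 150 Hz
```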

The ZCR is highly sensitive to noise introduced by the A/D converter, and hence requires a set of filters to suppress it.

3.3.2 Short-Time Autocorrelation

The Fourier transform of a speech signal S(n) provides both spectral magnitude and phase. The time signal r(k) obtained as the inverse Fourier transform of the energy spectrum is called the autocorrelation function of S(n). The function r(k) preserves information such as the harmonics, periodicity and amplitude of S(n), but discards the phase of S(n), as the phase carries less information than the spectral magnitude. The cross-correlation of two signals s(n) and y(n) can be written as

$$r_{sy}(k) = \sum_{m} s(m)\, y(m+k) \qquad (3.4)$$

Equation 3.4 yields the autocorrelation function when both inputs are fed with the same signal, i.e. y(n) = s(n).
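A minimal sketch of equation 3.4 with y(n) = s(n) follows; the peak of the autocorrelation away from lag 0 reveals the period of a voiced-like frame. The sampling rate, frame length and toy tone are assumptions.

```python
# Sketch of the short-time autocorrelation of eq. 3.4 with y(n) = s(n).
import numpy as np

def autocorrelation(s, max_lag):
    return np.array([np.sum(s[:len(s)-k] * s[k:]) for k in range(max_lag)])

fs = 8000
t = np.arange(400) / fs
s = np.sin(2*np.pi*100*t)          # 100 Hz tone -> period of 80 samples

r = autocorrelation(s, max_lag=200)
peak = np.argmax(r[40:]) + 40      # skip the small lags around r(0)
print(peak)                        # ~80 samples, i.e. fs/peak = 100 Hz
```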

3.4 FREQUENCY-DOMAIN PARAMETERS

Frequency domain analysis provides useful parameters for speech processing. The speech signal is more easily analysed spectrally than in the time domain: the basic speech production model, a periodic or non-periodic excitation driving the vocal tract filter, corresponds well to separate models for the excitation and for the vocal tract. Moreover, the human ear pays more attention to the spectral aspects of speech than to its phase or timing. Hence spectral analysis is used to extract most of the parameters of the speech signal. The main methods for extracting frequency domain parameters are

- Filter bank analysis
- Short-time Fourier transform analysis

3.4.1 Filter Bank Analysis

Filter bank analysis is the most inexpensive method of spectral analysis; it is performed using a set of bandpass filters. Filter bank techniques are more flexible than DFT analysis, since the bandwidths can be varied to follow the resolving power of the ear. The method is simple and can be used for many applications requiring a small set of parameters describing the distribution of energy in the spectral envelope.

3.4.2 Short-Time Fourier Transform Analysis

The short-time Fourier transform is the traditional spectral technique. It represents speech in terms of amplitude and phase as functions of frequency; the Fourier transform of speech is the product of the transforms of the glottal excitation and the vocal tract response. The short-time Fourier transform of a signal S(n) is defined as

$$S_n(e^{j\omega}) = \sum_{m} S(m)\, e^{-j\omega m}\, W(n-m) \qquad (3.5)$$

If W(n) acts as a low-pass filter, then $S_n(e^{j\omega})$ describes the amplitude and phase of S(n) within a bandwidth equivalent to that of the window but centered at $\omega$ rad. Repeating this calculation at different frequencies yields a two-dimensional representation of the input speech. One of the major speech analysis tools based on it is the spectrogram: the sound spectrograph provides a three-dimensional representation of a short speech utterance. Wideband spectrograms display individual pitch periods as vertical striations, corresponding to the large speech amplitude each time the vocal cords vibrate; voicing can easily be detected visually by the presence of these periodically spaced striations. Narrowband spectrograms display the separate harmonics instead of pitch periods and have poorer time resolution [21].
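As one possible realization of equation 3.5, the sketch below uses scipy.signal.stft on a toy harmonic signal; the short window mimics wideband analysis (good time resolution) and the long window mimics narrowband analysis (resolving individual harmonics). Window choice and segment lengths are assumptions.

```python
# Sketch of the short-time Fourier transform of eq. 3.5 via scipy;
# segment lengths are illustrative assumptions.
import numpy as np
from scipy.signal import stft

fs = 8000
t = np.arange(2*fs) / fs
s = np.sin(2*np.pi*120*t) + 0.5*np.sin(2*np.pi*240*t)  # toy harmonic signal

# Wideband-style analysis: short window, good time resolution
f_wb, t_wb, S_wb = stft(s, fs=fs, window='hamming', nperseg=64)

# Narrowband-style analysis: long window, resolves individual harmonics
f_nb, t_nb, S_nb = stft(s, fs=fs, window='hamming', nperseg=512)

print(S_wb.shape, S_nb.shape)   # (frequency bins, frames) for each analysis
```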

3.5 LINEAR PREDICTIVE CODING (LPC)

Linear predictive coding is a popular alternative to the short-time Fourier transform and a very important speech analysis technique. In LPC analysis, the short-time correlation between speech samples is modeled and removed by an efficient low-order filter [25]. LPC has been used to estimate the spectral harmonics, the vocal tract transfer function, and the frequencies and bandwidths of spectral poles and zeros. LPC estimates each speech sample as a linear combination of its previous samples; the more previous samples are considered, the more accurate the model. An important drawback of this general analysis is its high computational complexity, which can be overcome by assuming that the speech comes from an all-pole filter. LPC provides an analysis-synthesis system for speech signals. Let H(z) be the steady-state system function representing the combined spectral contributions of the glottal flow, the vocal tract and the radiation from the lips [25]:

$$H(z) = \frac{\hat{S}(z)}{U(z)} \qquad (3.6)$$

where $\hat{S}(z)$ is the transform of the synthesized speech signal $\hat{s}(n)$ produced by the spectral shaping filter, and U(z) is the input excitation to the spectral shaping filter H(z). The synthesized speech can be written as

$$\hat{s}(n) = \sum_{k=1}^{p} a_k\, \hat{s}(n-k) + G \sum_{l=0}^{q} b_l\, u(n-l) \qquad (3.7)$$

where p is the number of poles and q the number of zeros in equation 3.7. H(z) can then be written as

$$H(z) = G\, \frac{\sum_{l=0}^{q} b_l\, z^{-l}}{1 - \sum_{k=1}^{p} a_k\, z^{-k}} \qquad (3.8)$$

In equation 3.8, G is a gain factor for the input excitation. If the order of the denominator is sufficiently high, H(z) can be approximated by an all-pole model:

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k\, z^{-k}} = \frac{G}{A(z)} \qquad (3.9)$$

If the speech signal S(n) is filtered by the inverse or predictor filter A(z), we obtain a residual signal e(n):

$$e(n) = S(n) - \sum_{k=1}^{p} a_k\, S(n-k) \qquad (3.10)$$

The two methods used to obtain the LPC coefficients are

- The least-squares autocorrelation method
- The least-squares covariance method

3.5.1 The Least-Squares Autocorrelation Method

In this method the values of $a_k$ are chosen to minimize the mean energy of the error signal over a frame of speech data, and either S(n) or e(n) is windowed to limit the extent of the speech under analysis. The speech signal is multiplied by a window W(n) to give a finite-duration signal x(n), and S(n) is assumed to be stationary within each window:

$$x(n) = W(n)\, S(n) \qquad (3.11)$$

The LPC coefficients then describe a smoothed average of the signal. Let E be the error energy,

$$E = \sum_{n} e^2(n) = \sum_{n} \Big[ x(n) - \sum_{k=1}^{p} a_k\, x(n-k) \Big]^2 \qquad (3.12)$$

where e(n) is the residual signal corresponding to the windowed signal x(n). The values of $a_k$ that minimize E are then found.
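A minimal sketch of this autocorrelation method follows: the $a_k$ minimizing E in equation 3.12 satisfy normal equations built from the autocorrelation of the windowed frame x(n), solved here with the Levinson-Durbin recursion. The predictor order, frame size and test signal are assumptions for illustration.

```python
# Sketch of the least-squares autocorrelation method (Section 3.5.1),
# solved with Levinson-Durbin; order and frame size are assumptions.
import numpy as np
from scipy.signal import lfilter

def lpc_autocorrelation(x, order):
    # autocorrelation r(0)..r(order) of the windowed frame
    r = np.array([np.sum(x[:len(x)-k] * x[k:]) for k in range(order+1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        k = (r[i+1] - np.dot(a[:i], r[i:0:-1])) / err
        a[:i+1] = np.append(a[:i] - k*a[:i][::-1], k)
        err *= (1 - k*k)
    return a, err     # predictor coefficients a_k and residual energy

fs = 8000
t = np.arange(240) / fs
s = np.sin(2*np.pi*150*t) + 0.3*np.sin(2*np.pi*450*t)
x = np.hamming(240) * s                     # eq. 3.11: windowed frame

a, err = lpc_autocorrelation(x, order=10)

# Residual of eq. 3.10: e(n) = x(n) - sum_k a_k x(n-k), i.e. filter by A(z)
e = lfilter(np.concatenate(([1.0], -a)), [1.0], x)
```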

This conventional least-squares method simplifies the computation but ignores certain information about the speech signal: it introduces distortion into the spectral estimate, since the windowing corresponds to convolving the speech spectrum with the frequency response of the window.

3.5.2 The Least-Squares Covariance Method

In this method the error signal e(n) is windowed instead of the input speech signal S(n). The autocorrelation and covariance methods thus differ in their windowing: the autocorrelation method uses windowed speech samples, whereas the covariance method applies no window to the speech samples. The covariance method introduces no spectral distortion, but requires knowledge of speech samples outside the analysis frame.

3.6 PITCH PREDICTION AND DETECTION

LPC analysis removes the correlations between adjacent and neighboring samples of the speech. After LPC analysis there are still considerable variations in the spectrum: during the voiced regions of speech, the residual signal retains long-term correlations. Hence a second stage of prediction is required to remove the periodic structure of the residual signal. The main objective of this second stage is to spectrally flatten the residual, and it is called the pitch prediction stage. The long-term predictor (LTP) can be interpreted as the filter

$$\frac{1}{P(z)} = \frac{1}{1 - \sum_{j=1}^{l} b_j\, z^{-(T+j)}} \qquad (3.13)$$

where T is the pitch period and the $b_j$ are the pitch gains. The pitch predictor exploits the correlation between speech samples that are one or more pitch periods apart; hence it is called a long-term predictor [25]. In the synthesis model, the long-term predictor is usually placed before the short-term predictor. Pitch analysis is performed on a block of N samples taken from a window longer than the analysis frame length L, since the pitch period may range from about 16 to 160 samples. The fundamental frequency, or pitch, of a signal plays a vital role in speech applications; in voiced speech it corresponds to the rate of vibration of the vocal cords. Most low-rate speech coders require accurate pitch estimation for good reconstructed speech, and medium-rate coders use the pitch to reduce the transmission rate while preserving high quality. The pitch can be determined either from the periodicity in the time domain or from the regularly spaced harmonics in the frequency domain.
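A minimal sketch of time-domain pitch estimation in the spirit of this section follows: the pitch period T is taken as the lag of the autocorrelation peak within the 16 to 160 sample range quoted above. The 8 kHz sampling rate and the toy voiced frame are assumptions.

```python
# Sketch of time-domain pitch estimation: pick the autocorrelation peak
# inside a plausible pitch-period range (16-160 samples, per the text).
import numpy as np

def estimate_pitch_period(frame, min_lag=16, max_lag=160):
    r = np.array([np.sum(frame[:len(frame)-k] * frame[k:])
                  for k in range(max_lag + 1)])
    return min_lag + np.argmax(r[min_lag:])

fs = 8000
t = np.arange(400) / fs
voiced = np.sin(2*np.pi*100*t) + 0.2*np.sin(2*np.pi*200*t)

T = estimate_pitch_period(voiced)
print(T, fs / T)    # period ~80 samples -> fundamental ~100 Hz
```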

3.7 SUMMARY

This chapter introduced speech analysis and briefly described the different speech analysis techniques in the time and frequency domains.