Dual-Microphone Voice Activity Detection Technique Based on Two-Step Power Level Difference Ratio


1 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE Dual-Microphone Voice Activity Detection Technique Based on Two-Step Power Level Difference Ratio Jae-Hun Choi and Joon-Hyuk Chang, Senior Member, IEEE Abstract In this paper, we propose a novel dual-microphone voice activity detection (VAD) technique based on the two-step power level difference (PLD) ratio. This technique basically exploits the PLD between the primary microphone and the secondary microphone in a mobile device when the distance between the microphones and the sound source is relatively short. Based on the PLD, we propose the use of the PLD ratio (PLDR) instead of the original PLD to take advantage of the relative difference between the PLD of speech and the PLD of noise. Indeed, the PLDR is obtained by estimating the ratio of the PLD between the input signals and the PLD between the two channel noises during periods without speech. The proposed technique offers a two-step algorithm using the PLDRs including long-term PLDR (LT-PLDR), which characterizes long-term evolution and short-term PLDR (ST-PLDR), which characterizes short-time variation during the first step. LT-PLDR-based and ST-PLDR-based VAD decision are performed using the maximum a posteriori (MAP) probability derived from the model-trust algorithm and combined at the second step to reach a superior VAD decision for both long-term and short-term situations. Extensive experimental results show that the proposed dual-microphone VAD technique outperforms the conventional two-channel VAD method as well as most standardized VAD algorithms. Index Terms Dual-microphone, power level difference ratio, two-step, voice activity detection. I. INTRODUCTION VOICE activity detection (VAD) has become an essential component of speech enhancement and speech recognition systems. Many approaches have focused on single microphone-based algorithms using linear predictive coding (LPC) parameters [1], energy levels, formant shape [2], the zero-crossing rate (ZCR) [3], the cepstral feature [4], periodicity measures [5], the spectral difference [6], and a statistical model-based approach [7]. Among these approaches, statistical model-based methods have been widely used due to their impressive performance as well as their efficient implementation [7]. Specifically, the statistical distributions of both Manuscript received June 21, 2013; revised November 12, 2013; accepted March 17, Date of publication April 22, 2014; date of current version May 09, This work was supported by an NRF grant funded by the Korean Government (MEST) ( 2012R1A2A2A ). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Wai-Yip Geoffrey Chan. The authors are with the School of Electronic Engineering, Hanyang University, Seoul , Korea ( jchang@hanyang.ac.kr). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TASLP noisy speech and noise are assumed to follow parametric models, including Gaussian [7], [8], Laplacian [9], generalized Gaussian [9], and generalized gamma [10] as a candidate for a better model of the distribution of speech and noise. Based on the assumed statistical model, the likelihood ratio test (LRT) is established based on a set of hypotheses. 
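To make the statistical model-based decision concrete, the following is a minimal sketch of a Gaussian-model likelihood ratio test of the kind cited above (in the spirit of [7]). The smoothing constant, the decision threshold, and all function and variable names are illustrative assumptions rather than values taken from any particular implementation.

```python
import numpy as np

def gaussian_lrt_vad(noisy_psd, noise_psd, prev_xi=None, alpha=0.98, eta=0.15):
    """Frame-level Gaussian likelihood-ratio VAD sketch (Sohn-style, see [7]).

    noisy_psd : per-bin noisy periodogram |Y(k)|^2 of the current frame
    noise_psd : per-bin estimate of the noise power spectrum
    prev_xi   : a priori SNR from the previous frame (simplified
                decision-directed style smoothing; None on the first frame)
    Returns (speech_flag, xi) so xi can be fed back on the next call.
    """
    gamma = noisy_psd / np.maximum(noise_psd, 1e-12)      # a posteriori SNR
    ml_xi = np.maximum(gamma - 1.0, 0.0)                  # instantaneous estimate
    xi = ml_xi if prev_xi is None else alpha * prev_xi + (1.0 - alpha) * ml_xi

    # Per-bin log likelihood ratio under complex Gaussian models for H1 vs H0
    log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)

    # Geometric mean of the likelihood ratios over frequency bins
    decision_stat = np.mean(log_lr)
    return decision_stat > eta, xi
```

The per-frame statistic is then compared with a tuned threshold to choose between the speech-absence and speech-presence hypotheses, which is the decision rule the LRT-based detectors above rely on.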
This algorithm has been further improved by incorporating a conditional maximum aposteriori(map) estimator, which is conditioned not only on the data of the current frame, but also the voice activity decision of the previous frame [11]. One of the key issues in the VAD problem is the performance of noise power estimation. Many previous studies have investigated noise power estimation. One simple method is to average the noisy signal over non-speech areas. For example, Kim and Chang [12] incorporated a soft decision scheme into power spectrum estimation. However, soft-decision-based noise power estimation has difficulties in estimating background noise with non-stationary characteristics. A more recent noise estimation technique is minimum statistics (MS), which obtains a noise estimate from the minimum values of a smoothed power estimate of a noisy signal within a finite window [13]. The MS scheme is impaired by sudden rising and falling noise contours that are a result of picking the minimal value within a sliding window. Also, the minima controlled recursive averaging (MCRA) approach is known to be a successful noise power estimation technique due to its robustness to the type and intensity of ambient noise [14]. This approach estimates noise by recursively averaging past spectral power values, which are adjusted by the speech presence probability in subbands. In addition, the relevant noise estimation techniques based on Monte-Carlo method [15], linear dynamical system method [16], particle filtering with switching dynamical system [17], and switching Kalman filtering [18] have been reported to handle the time-varying noises. But, the performance improvement is restricted due to the usage of the single microphone. The use of multiple microphones is beneficial for VAD since it provides relevant spatial characteristics, while single-channel VAD cannot precisely discriminate the target noise from the noisy speech under the highly nonstationary condition. For example, a beamforming technique can be considered relevant since it incorporates both spatial and spectral information efficiently [19] [27]. However, the use of a microphone array requires aprioriknowledge of the direction of arrival (DOA) through several microphones, which is not realistic in mobile device systems such as smart phones. Also, the configuration of two microphones is preferred in the contemporary smart phones IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See for more information.

2 1070 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE 2014 because of the realization issues in terms of size, power consumption, and computational complexity while the most existing methods focus on more than four microphones [23] [27]. Accordingly, we consider a dual-microphone system based on a trade-off between complexity and performance. One of the two microphone VAD techniques is based on a decision measure referred to DOA homogeneity, which uses the entropy of the DOA estimates [27]. But, this method is sensitive to the estimation error of DOA. Several dual microphone systems avoiding the problem of DOA estimation have been widely used in speech enhancement as well as VAD [28] [34]. In particular, cross power spectral density (CPSD) has been used for exploiting spatial benefits, since noise components can be considered to be mutually uncorrelated [31], [32]. This algorithm works well if the distance between adjacent microphones is not too short, because the main idea behind this algorithm is that the speech signals of the two channels are significantly correlated, whereas noises are uncorrelated. However, the cross-correlation based method has a drawback in terms of noise estimation since it cannot accurately estimate abruptly changing noises due to the large smoothing coefficients required to compute the power spectral density (PSD) of the cross-correlation term. This observation holds for the two microphone VAD method of the magnitude squared coherence (MSC) [30]. In addition, the methods proposed in [33] [36] utilize the difference in the power of the signals received at the two microphones. These techniques rely on the fact that the speech signals emitted from the source (i.e., mouth) have different power levels at the microphones, while the power levels of the noise signals are almost identical. This assumption is valid if the distances between the source and two microphones are distinct. It should be noted that mobile devices such as contemporary smart phone systems have similar structures. Indeed, the algorithm based on the power level difference (PLD) is useful with highly non-stationary noises. However, the performance of the technique is sensitive to the noise level and noise types if the PLD is used in the criterion of the gain for speech enhancement as in [33]. In particular, Jeub et al. [35] derived the normalized difference of the PSD of the noisy speech to update the noise PSD. However, the assumption that the normalized difference of the PSD will be close to zero as the input power level is almost identical is not valid in practice due to many factors such as reverberation, microphones mismatch, and azimuth angles from the noise source at each microphone. This paper proposes a novel two-microphone VAD technique based on a two-step PLD ratio (denoted by PLDR). Based on the PLD derived between the primary microphone and the secondary microphone in each frame, we offer two kinds of PLDR that can efficiently characterize the evolution of speech over short-term and long-term time frames. Indeed, we consider the PLDR instead of the PLD to achieve robust performance because a relative comparison between the PLD of speech and the PLD of noise estimated during noise periods which is not sensitive to the noise level or type is performed. For this reason, we first propose a long-term PLDR (LT-PLDR) using a large long-term smoothing parameter for calculating the PLDR. While our approach is based on the PLD proposed by Yousefian et al. 
[33], [34], we offer the PLDR that can produce robust and superior performance under various noise environments. Specifically, the PLDR is definedbytheratioofthepldof the input signals and the PLD of noise estimated during speech inactivity. In order to compute the PLD of noise efficiently, we apply the minima controlled recursive averaging (MCRA) approach [14] to the estimation for the PLD of noise. Based on thelt-pldr,wedeviseanefficient framework to derive the a posteriori probability based on a parametric way employing the model-trust minimization algorithm in classifying the speech presence or absence regions. With the aposterioriprobability corresponding to the LT-PLDR, the interim VAD decision at the first step is performed by choosing the hypothesis with the maximum probability according to the maximum a posteriori (MAP) criterion, which provides a rough VAD to minimize cases of missing speech. On the other hand, a short-term PLDR (denoted by ST-PLDR) using a low smoothing parameter is derived from each frame, which establishes an appropriate parameter for detecting short non-speech intervals while having a high false-alarm rate. In addition, the PLD of noise for calculating the ST-PLDR is estimated by utilizing the speech presence probability derived at the first step, which eventually allows for the PLD of noise in estimating the ST-PLDR to be updated quickly. In a similar manner with the probability derived from the LT-PLDR at the first step, the probability for the ST-PLDR is obtained and the VAD decision is separately performed according to the MAP criterion. At the second step, we construct the final VAD decision rule, in which the interim VAD result determined by the LT-PLDR is modified by the VAD result of the ST-PLDR only when the VAD result provided by the LT-PLDR is speech presence. This creates a robust way to track speech evolution while keeping the missing error rate below an acceptable level and minimizing the false-alarm rate error below a tolerable level. Extensive objective evaluation of the proposed VAD technique is performed under various acoustic conditions in terms of noises, azimuths, and distances between the source and the microphones. We show that the proposed VAD parameter derived from the two-microphone PLDR is superior, particularly for non-stationary noises, and is robust with respect to the input SNRs and various acoustical circumstances. This paper is organized as follows. In the next section, we review the PLD, which is a basic parameter in our framework. In Section III, we describe the design of the VAD algorithm based on the proposed two-step technique. Extensive evaluation of the proposed algorithm is discussed in Section IV and conclusions are presented in Section V. II. REVIEW OF PLD In this section, we begin with a theoretical description of the PLD function and review the notion of the function for the VAD task. For this, we first need to define the acoustic experimental environment in brief. Two microphones are installed in a smart-phone mock-up on a dummy head, as shown in Fig. 1. Since we chose the smart phone as the main platform in this research, the distance between the primary microphone (close to the speaker s mouth) and the secondary microphone (distant to the speaker s mouth) was set to 12 cm. To simulate mobile environments, various distances and azimuths between the dummy

3 CHOI AND CHANG: DUAL-MICROPHONE VAD TECHNIQUE BASED ON TWO-STEP PLDR 1071 head and the noise source are considered. Detailed specification will be given in Section IV. Based on these conditions, we first define the noisy signal received at the microphones by, where denotes the microphone index and is the sample index such that (1) where is the convolution operator, is the main source signal, is the impulse response associated with the microphones, is the noise-free reverberant speech, and is the noise component at the each microphone, respectively [33], [34]. The above equations could be changed frame-by-frame into a frequency domain by taking the discrete Fourier transform (DFT) which length is bigger than the frame size as follows: where is the frequency-bin index of and is the frame index. Letting denote the Fourier transform of the acoustic transfer function between the primary microphone and secondary microphone and is equal to, the above Fourier transform of signal data model can be written as (2) (3) (4) Assuming that the speech and noise are independent, the signal power of each microphone is given by (5) (6) where,,,, and denote the power spectrum of,,,, and respectively. Since the distance between the primary and secondary microphone is distinct but short in a near field, as shown in Fig. 1, the signal power received at the primary microphone close to the mouth shows a stronger signal compared to the signal power of the secondary microphone, while the level of the noise signal at each microphone is almost identical [33], [34]. Based on this, we firstly define the difference of the signal power between the primary microphone and the secondary microphone as so that is derived for the primary and secondary microphone such that where.if can be neglected due to the assumption of a diffuse noise field [33], (6) after taking the absolute operation results in following: (7) (8) Fig. 1. Our acoustical architecture using smart phone with the dual-microphones located at the dummy head. Note that is directly used in the gain computation of the Wiener filter-based noise suppressing algorithm [33], [34] such that (9) where is an over-subtraction parameter and used in controlling the level of noise reduction and can be estimated by using the cross power spectral density (CPSD) of the input and noise signals in the two channels in [33]. The above conventional technique proposed by Yousefian et al. [33], [34] was used to directly apply the PLD value to calculate the spectral weighting gain for speech enhancement based on the assumption that is negligible for diffuse noise. However, in practice, the levels of the noise signals at the primary microphone and the secondary microphone cannot be identical as assumed in the real situation illustrated in Fig. 1. Thus, this assumption cannot be directly used in the VAD task which we focus on. As an analogous example, the premise of the coherence-based technique is that the noise signals in the two channels are uncorrelated, which is not valid in reality. In this regard, works similar to those in [31] have suggested modifying to the coherence filter to address this problem. III. PROPOSED DUAL-MICROPHONE VOICE ACTIVITY DETECTION BASED ON TWO-STEP PLD SCHEME A. Basic idea of PLD As stated in the previous section, the premise we choose in this paper is that the PLD has a larger value during speech activity than the PLD during speech inactivity regardless of the noise type; diffuse noise or coherence noise. 
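As a concrete illustration of the PLD defined above and of the conventional PLD-driven gain of [33], [34], the following sketch computes the recursively smoothed per-bin power spectra of the two channels, takes their absolute difference, and forms a Wiener-type gain with an over-subtraction factor. The function names, the smoothing constant, and the exact form of the gain are assumptions made for illustration only; the original papers estimate the over-subtraction term from the cross power spectral densities of the two channels.

```python
import numpy as np

def power_level_difference(Y1, Y2, prev_psd1, prev_psd2, beta=0.9):
    """Per-bin PLD between the primary (Y1) and secondary (Y2) channel DFTs.

    Y1, Y2      : complex DFT coefficients of the current frame
    prev_psd1/2 : recursively smoothed power spectra from the previous frame
    beta        : smoothing constant (illustrative value)
    The absolute value keeps the PLD non-negative when the two input powers
    are nearly identical, i.e., in noise-only bins.
    """
    psd1 = beta * prev_psd1 + (1.0 - beta) * np.abs(Y1) ** 2
    psd2 = beta * prev_psd2 + (1.0 - beta) * np.abs(Y2) ** 2
    pld = np.abs(psd1 - psd2)
    return pld, psd1, psd2

def pld_wiener_gain(pld, noise_psd1, mu=2.0, g_min=0.05):
    """Spectral gain in the spirit of the conventional PLD method [33], [34].

    mu is an over-subtraction factor controlling the amount of noise reduction;
    noise_psd1 is the noise power estimate at the primary microphone.
    """
    gain = pld / np.maximum(pld + mu * noise_psd1, 1e-12)
    return np.maximum(gain, g_min)
```

The proposed method keeps the PLD itself but, as described next, compares it with the PLD observed during noise-only periods rather than feeding the raw difference into a gain.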
To handle this premise in our algorithm, we first define the PLDR as the ratio between the observed PLD and the PLD estimated during noise periods. For clarity, Fig. 2 shows the overall block diagram of the proposed two-step PLDR-based algorithm. Assuming that a noise is added to a noise-free speech signal

4 1072 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE 2014 Fig. 2. Overall block diagram of the proposed two-step PLDR-based technique., the representation of the frequency domain of the noisy signal of (1) is rewritten as: the observed PLD, we adopt the recursive averaging technique given by (10) where,, and are the DFT coefficients at the th frequency bin for the th frame of noisy speech, noise, and clean speech, respectively. Given two binary hypotheses, and, which indicate speech presence and absence, it is derived that (11) (12) By taking into account the independence assumption of speech and noise, we obtain the power spectral density (PSD) for the primary and secondary microphone as follows: (13) (14) We first derive the PLD between two microphones by taking the absolute operator for ensuring robust performances in actual situations such that (15) where the PLD becomes almost zero value if powers of two input signal are almost identical. B. LT-PLDR We then derive the PLDR between the observed PLD of the current frame and the PLD estimated at the noise regions. For (16) where 1 is a smoothing parameter, which thus characterizes efficiently the long-term evolution of the speech signal and is not sensitive to the type and intensity of environmental noise as will be explained in Section IV. Accordingly, is called the long-term PLD (LT-PLD) since it extends over a relatively long time periods. Once the LT-PLD is obtained, the LT-PLDR is computed as shown in Test Phase of Fig. 2 according to (17) where implies the ratio of and. Here, is the PLD of the noise estimated during the noise only periods. The estimation of is then performed using the MCRA approach known as the simple but computationally efficient noise power estimation technique used in the speech enhancement field [14]. In a similar manner as in the MCRA technique for estimating the variance of noise, the estimate of is given as follows: (18) 1 If we choose 0.9 for, more than latest 20 frames can dominantly affect the value of

5 CHOI AND CHANG: DUAL-MICROPHONE VAD TECHNIQUE BASED ON TWO-STEP PLDR 1073 where means the PLD definedin(15)and is a time-varying parameter. The time-varying smoothing parameter is adjusted by the speech presence probability (SPP) and is estimated by (19) where is a constant value, which is set to 0.95 based on [14]. In the conventional MCRA approach, the SPP in each subband is determined by comparing the ratio between the local energy of noisy speech and the minimum value within a finite window length with a probability threshold. Based on [14], the SPP at the each frequency bin is calculated by (20) where is the smoothing parameter in order to consider the strong correlation of speech presence over successive frames and represents the indicator function of the speech presence or speech absence at the each subband on the current frame. In order to compute, is calculated by (21) where is the local minimum of the observed input PLD and is picked within the finite window of the consecutive frames as proposed in the minimum-search procedure of [14]. Based on this, indication function to classify rough speech presence or absence regions is determined such that if speech presence if speech absence (22) where is the threshold. Subsequently, at the current frame is computed as the mean of the individual over each subband [7], [9] as shown in of Fig. 2 which constructs the interim VAD on the th frame at the firststepbythe following: (23) Using on each frame, we employ a parametric way to derive the probability of the VAD decision at the first step, which corresponds to in Fig. 2. For this, we derive the a posteriori probability as a probabilistic output for the LT-PLDR in each frame using the sigmoid fitting approach as proposed in the former studies [37], [38]. Note that is the hypothesis based on which indicates the speech presence or absence of the first step. Specifically, the LT-PLDR in each frame is transformed into the probability through the logistic regression model using a slope parameter andanoffset as follows: (24) where is the LT-PLDR in the th frame as explained in (23). For the computation of the reliable probability, the principal parameters and are given by using the discriminative training in separate a way to minimize the negative log likelihood of the data as shown in of Fig. 2, which is the cross-entropy error function obtained by (25) where as the target probability for the class ( and ) is given by manual labeling of every frame in the training process. Indeed, isassumed1ifthe th frame is speech presence and is assumed 0, otherwise. Based on this, we adopt the model-trust minimization algorithm in estimation of the parameters as proposed in [38] since the parameters and are estimated in terms of minimization problem as in (25) and are chosen to minimize a bound on the test misclassification rate, which can produce sparse kernel machine [39]. Based on obtained using the model-trust minimization algorithm, the interim decision rule for the VAD at the first step (denoted by ) could be represented as shownintheblockas of Fig. 2 using the maximum a posteriori (MAP) criterion as given by: (26) where is an experimentally chosen constant. As an example, and the corresponding probability are shown in Fig. 3 along with the manual labeling of each frame. As can be seen in Fig. 3, offers a robust VAD performance under non-stationary noise conditions especially for minimizing missing speech. 
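The first step can be summarized in a short sketch that mirrors (16)-(26): long-term recursive smoothing of the PLD, an MCRA-style noise-PLD estimate controlled by a minimum-tracking speech presence probability, averaging of the per-bin LT-PLDR over frequency, a logistic (sigmoid) mapping to an a posteriori probability, and a MAP-style threshold decision. This is a minimal sketch under stated assumptions: the class and variable names, the minimum-search window length, the sigmoid parameters (which stand in for the discriminatively trained values of (25)), and the decision threshold are illustrative, not the trained values of the paper.

```python
import numpy as np
from collections import deque

class LongTermPLDR:
    """First-step LT-PLDR sketch with illustrative names and constants."""

    def __init__(self, n_bins, beta_lt=0.9, alpha_d=0.95, alpha_p=0.2,
                 delta=2.0, win_len=150, a=-4.0, b=2.0, eta=0.5):
        self.lt_pld = np.zeros(n_bins)           # smoothed LT-PLD, cf. (16)
        self.noise_pld = np.full(n_bins, 1e-6)   # PLD of noise, cf. (18)
        self.spp = np.zeros(n_bins)              # speech presence prob., cf. (20)
        self.minima = deque(maxlen=win_len)      # window for local-minimum search
        self.beta_lt, self.alpha_d, self.alpha_p = beta_lt, alpha_d, alpha_p
        self.delta, self.a, self.b, self.eta = delta, a, b, eta

    def step(self, pld):
        pld = np.asarray(pld, dtype=float)

        # (16): long-term recursive smoothing of the observed PLD
        self.lt_pld = self.beta_lt * self.lt_pld + (1.0 - self.beta_lt) * pld

        # (21)-(22): ratio of the observed PLD to its local minimum gives a
        # rough per-bin speech/non-speech indicator
        self.minima.append(pld.copy())
        local_min = np.minimum.reduce(list(self.minima))
        indicator = (pld / np.maximum(local_min, 1e-12)) > self.delta

        # (20): recursively smoothed speech presence probability per bin
        self.spp = self.alpha_p * self.spp + (1.0 - self.alpha_p) * indicator

        # (18)-(19): MCRA-style noise-PLD update with a time-varying coefficient
        alpha_t = self.alpha_d + (1.0 - self.alpha_d) * self.spp
        self.noise_pld = alpha_t * self.noise_pld + (1.0 - alpha_t) * pld

        # (17), (23): per-bin LT-PLDR averaged over frequency for this frame
        lt_pldr = float(np.mean(self.lt_pld / np.maximum(self.noise_pld, 1e-12)))

        # (24): logistic mapping to an a posteriori probability
        prob = 1.0 / (1.0 + np.exp(self.a * lt_pldr + self.b))

        # (26): MAP-style decision against an assumed threshold
        return prob > self.eta, prob, lt_pldr
```

Because of the large smoothing constant, the resulting probability evolves slowly, which is exactly the long-term behavior illustrated in Fig. 3.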
However, it can also be seen that the LT-PLDR tends to falsely detect the short pause regions between words and syllables as speech, since it uses a high smoothing parameter as designated in (16).

C. ST-PLDR

While the use of a large smoothing parameter reduces the fluctuation in estimating the signal power and thus keeps the missing error rate low, it eventually results in a large false-alarm rate in the short pause regions. Due to this drawback of the LT-PLDR for short pauses, we propose a technique to derive an ST-PLDR using a low smoothing parameter, which is better suited to characterizing short-time variations in speech. As in the case of the LT-PLD, the short-time smoothed PLD between the two microphones is derived using a low smoothing parameter such that (27), where the short-term PLD (ST-PLD) is relatively adequate for a short duration of time. In the following, the ST-PLDR is derived using the ST-PLD at each frequency bin such that (28)

6 1074 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE 2014 Fig. 3. An example of derived at the babble noisy signal (approx. 10 db SNR). Noise is located in front of the dummy head with a distance of 5 m. where denotes the estimated PLD of noise, which is estimated during noise only periods. The main objective behind the VAD decision based on the ST-PLDR is to reduce the false alarm rate in the short-pause regions and therefore is to modify the VAD decision result derived from the LT-PLDR. To allow the ST-PLDR to be suited for short-time variation of speech with the dynamic attributes, is separately estimated as follows: each frequency bin as in (28), the ST-PLDR for the VAD on the th frame is averaged by the following: (32) Insimilarmannerproposedin(24),theST-PLDR on the th frame is converted into the aposteriorispp of the hypothesis using the logistic regression model as shown in the block of of Fig. 2 as follows: with Here, is calculated by (29) (30) is the time-varying smoothing parameter and (33) where the principal parameters and for fitting into probability values between 0 and 1 are similarly estimated as in (25). Based on, the VAD decision with the decision threshold experimentally chosen as 1.5 for the short pause regions at the first step can be expressed by (34) (31) where is the SPP computed in (24). Note that since shows good performance in detecting speech presence regions, we use in calculating in (29). Specifically, in order to avoid the speech component of the observed PLD in updating the noise PLD in (30), is multiplied to. Compared to the constant value to compute the time-vary smoothing parameter in updating, is chosen as 0.3. This is because rapidly varies for quick update of. With the ST-PLDR derived in Summarizing the above procedures, and its corresponding probability have a significant advantage for detecting short-pauses between words due to adoption of and, as illustrated in Fig. 4. However, it can be seen that tends to fluctuate highly during long pauses in speech, which results in more false-alarm cases. D. Second step for VAD decision Based on these observations of the proposed two parameters, we perform the final VAD decision using both the LT-PLDR and ST-PLDR at the second step. Indeed, while minimizing cases of missing speech by avoiding miss-classifications of speech as noise, we attempt to reduce false classification of

7 CHOI AND CHANG: DUAL-MICROPHONE VAD TECHNIQUE BASED ON TWO-STEP PLDR 1075 Fig. 4. An example of derived at the babble noisy signal (approx. 10 db SNR). Noise is located in front of the dummy head with a distance of 5 m. noise as speech with the help of.todothis,wepresent a technique to combine the VAD decisions derived using the two PLDRs in the th frame to form the final decision at the second step as shown in of Fig. 2: if if (35) where at the second step is changed as in the case of at the first step and is same as if. An example of the result of the proposed method in conjunction with the manual label and speech waveform is shown in Fig. 5. This result indicates that the proposed VAD technique avoids the aforementioned problems and works well in adverse noise environments. As shown in Fig. 2, the final decision of VAD is performed during the second step by combining the interim VAD decision results of and derived at the first step for further performance improvement. IV. EXPERIMENTS This section describes the performance evaluation of the proposed two-step PLDR technique. In order to assess the proposed method, we carried out acoustic experiments at different distances and azimuth angles between the speech source and the noise source under various noise environments. For an objective comparison, the proposed algorithm was compared with a number of standardized VAD algorithms, including the European Telecommunications Standards Institute adaptive multirate (ETSI AMR) VAD option 2 [40] as well as the conventional two-microphone VAD techniques based on the MSC [30] and the dual-channel normalized PLD [35]. In addition, the wellknown, state-of-the-art multiple-statistical model-based VAD (MSM) was included in the performance comparison [9]. Fig. 5. Proposed result of at the babble noisy signal (approx. 10 db SNR). Noise is located in front of the dummy head with a distance of 5 m. (a) Waveform of the primary microphone signal (b) Waveform of the secondary microphone signal (c) Corresponding manual VAD (0: noise, 1: speech) (d) VAD result of the proposed two-step PLDR. A. Experimental Setup For the objective evaluation, we investigated the speech hit rates and non-speech hit rates of each algorithm for both speech and non-speech, where and are defined as the ratio of correct speech and non-speech decisions to the hand-labeling speech frames and non-speech frames, respectively. In order to simulate various noise environments and practical noisy conditions, noisy signals were recorded at two microphones in an office with a room size of m.the distance between the primary microphone and the secondary microphone was set to 12 cm, which follows up the configuration of a contemporary smart phone with two microphones. For the test, noisy sentences were recorded at various distances

8 1076 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE 2014 Fig. 6. The geographical placement of the sound source about the dummy head. of1m,3m,5m,and7mandatazimuthanglesof,, and between the speech source at the dummy head and the noise source. For easy comprehension, the overall geographical placement of the sound and noise sources is illustrated in Fig. 6. To simulate noisy conditions, five different noise sources such as babble, destroyer engine, factory, HF-channel, and white from the NOISEX-92 database [41] as well as office generated from actual recording were used. Among them, babble and office can be categorized into non-stationary noise [42]. The total samples were composed of 846 s long speech data originally recorded by four males and four females sampled at 8 khz for the application of narrow-band speech communication scenario. In order to evaluate and for speech and non-speech, we made reference decisions on the clean speech signal by labeling it manually at every 10 ms. The proportion of the hand-marked active speech frame was 58.2%, which was consisted of 44.5% voiced sounds and 13.4% unvoiced sounds. Note that settings on implementation of the algorithm such as sampling frequency and framesize can be easily changed since the proposed VAD decision rule is similar to the literature with the DFT baseline [7], [9]. For real-time processing, the windowed signal by a trapezoidal window of length 13 ms was transformed to 128 fast Fourier transform (FFT) coefficients after zero-padding. Thus, the frame size was 10 ms and the frame shift was 3 ms, respectively. The window length for the local minima search was set to 1.5 s. For evaluating the training phase through the model-trust algorithm and validating the model, we used 10-fold cross-validation at which total data was partitioned 10 equally sized segments [43]. Using these segment sets, 10-fold cross-validation was performed in the noise data we did not use during the training. Based on the results from 10-fold cross-validation experiment, the parameters and for the probability derived from the LT-PLDR were estimated in the discriminative training phase employing the model-trust Fig. 7. Comparison of the VAD performance under babble noise environment with approx. 11 db SNR. Noise is located in front of the dummy head with a distance of 5 m. (a) Waveform of the primary microphone signal (b) Corresponding manual VAD (0: noise, 1: speech) (c) VAD result of AMR option2 (d) VAD result of the MSM-based VAD (e) VAD result of two-channel MSC (f) VAD result of the proposed two-step PLDR. minimization algorithm. Also, the coefficients and in order to transform the ST-PLDR into the probability were obtained in a similar training manner. Note that the mean values of,,,and were,,,and, respectively. Also, standard deviation values were,,,and for,,,and, which looks very small and thus implies the robust applicability of our algorithm. B. Experimental Results Next, we evaluated the performance of the proposed approach compared to the aforementioned well-known VAD techniques. For convenience, we showed the resulting average accuracy after 10 cross-validation. First, in order to take advantage of the two-step approach, we examined the detection performance compared to option 2 of the AMR VAD, the multiple-statistical model-based VAD, the dual-microphone MSC-based VAD algorithm, and the dual-channel PLD-based algorithm. Figs. 
7 and 8 illustrate the detection results for the babble and office noises, respectively, where the noise source was located in front of the dummy head at a distance of 5 m (corresponding to SNR dB). From the figures, it can be seen that the proposed two-step PLDR-based method has superior performance during both long noise periods and short pause periods, while the conventional methods are inferior in detecting speech. In particular, the proposed method shows outstanding detection capability in both onset and offset regions, which are known to be difficult to detect, especially in non-stationary noise conditions. To determine the detection accuracy in terms of speech and non-speech in a situation in which the noise source was placed at, we conducted the VAD experiment with six noise

9 CHOI AND CHANG: DUAL-MICROPHONE VAD TECHNIQUE BASED ON TWO-STEP PLDR 1077 TABLE I COMPARISON OF VOICE ACTIVITY IN TERMS OF SPEECH HIT RATES AND NON-SPEECH HIT RATES AMONG THE METHODS OF CONVENTIONAL MSM-BASED VAD [9], AMR VAD OPTION2 [40],TWO-CHANNEL MSC [30], DUAL-CHANNEL PLD [35], AND PROPOSED TWO-STEP PLDR (LOCATION OF NOISE SOURCE ABOUT THE DUMMY HEAD: ). NOTE THAT WINNING RESULTS IN TERMS OF AND ARE HIGHLIGHTED Fig. 8. Comparison of the VAD performance under office noise environment with approx. 10 db SNR. Noise is located in front of the dummy head with a distance of 5 m. (a) Waveform of the primary microphone signal (b) Corresponding manual VAD (0: noise, 1: speech) (c) VAD result of AMR option2 (d) VAD result of the MSM-based VAD (e) VAD result of two-channel MSC (f) VAD result of the proposed two-step PLDR. types and at different distances as listed in the previous subsection, and the results are summarized in Table I. We confirmed that the proposed two-step PLDR approach is superior to the conventional single- and two-microphone-based VAD techniques for all tested conditions. Note that the proposed method shows outstanding improvement in performance in terms of the probability of the detection for speech, especially for nonstationary noises such as babble and office noises. In particular, it is evident that the proposed algorithm outperformed the conventional dual-microphone MSC-based VAD technique [30], which implies that the proposed two-step PLDR technique is likely to address the issue raised by the MSC-based VAD as stated in Introduction Section. However, since many multiple microphone-based algorithms such as beamforming require the aprioriknowledge of the directivity of the noise source, which is difficult to estimate and often not possible in practice, it is difficult to ensure robust performance. Therefore, robust performance in dealing with a variety of noise signals must be consistently demonstrated. To test the robust performance of the proposed algorithm in terms of the directivity of the noise source, we carried out experiments by varying the azimuth angles between the speech source and the noise source. We examined the detection performance of the proposed method at an azimuth angle of and the results are given in Table II. The proposed method outperformed all other algorithms in six noise conditions. It is evident from the results that the proposed two-step technique is effective in enhancing detection performance, no matter where noise is located. As this tendency was observed at the azimuth, it can be seen that the proposed technique exhibits robust performance even when there is a difference in power level between two microphones and the assumption of a diffused noise field is not met and the set parameters are not sensitive to the direction of noise. What remains is to test the performance at the azimuth, the noise source was placed in a position opposite that used in the case and located back with 5 m from the dummy head (SNR 6 db). Representative results for babble and office noises are plotted in Figs. 9 and 10, respectively, and it can be seen that two-step PLDR technique in detecting both noise-only periods and short pause regions in nonstationary noise conditions is superior. The performances in terms of the probability of the detection for the speech and non-speech for various azimuth angles are given in Table III, which shows the outstanding performance. 
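As a reference for how entries of this kind are obtained, the following small helper computes the speech and non-speech hit rates of a frame-level decision sequence against a hand-labeled reference, following the definitions given in Section IV-A; the function and argument names are assumptions for illustration.

```python
import numpy as np

def hit_rates(vad_decisions, reference_labels):
    """Speech / non-speech hit rates against a hand-labeled reference.

    vad_decisions, reference_labels : per-frame boolean arrays (True = speech).
    Returns (speech_hit_rate, nonspeech_hit_rate): the fraction of reference
    speech frames detected as speech, and the fraction of reference non-speech
    frames detected as non-speech.
    """
    vad = np.asarray(vad_decisions, dtype=bool)
    ref = np.asarray(reference_labels, dtype=bool)
    speech_hr = np.mean(vad[ref]) if ref.any() else float("nan")
    nonspeech_hr = np.mean(~vad[~ref]) if (~ref).any() else float("nan")
    return speech_hr, nonspeech_hr
```

Applied to 10 ms frame decisions of the kind described in Section IV-A, such a helper would yield the two percentages of the kind reported in the tables.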
In addition, Table IV, which summarizes the performances over the entire set of environments, demonstrates that the proposed algorithm is superior to the conventional VAD methods in almost all conditions. This observation confirms that the proposed two-step PLDR-based VAD technique is not sensitive to the location of the noise source. On the other hand, the sensitivity to the two smoothing parameters can be an important factor in the performance of the proposed method. We experimentally chose 0.9 and 0.3 to ensure robust performance over diverse acoustic environments; these values yielded the best performance regardless of the type and intensity of noise.

C. Application to Speech Enhancement

As can be seen from the various experimental results in the previous subsection, the proposed two-step PLDR-based VAD approach shows robust performance under highly varying noise environments. In order to demonstrate the effectiveness of

10 1078 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE 2014 TABLE II COMPARISON OF VOICE ACTIVITY IN TERMS OF SPEECH HIT RATES AND NON-SPEECH HIT RATES AMONG THE METHODS OF CONVENTIONAL MSM-BASED VAD[9],AMRVADOPTION2 [40], TWO-CHANNEL MSC [30], DUAL-CHANNEL PLD [35], AND PROPOSED TWO-STEP PLDR (LOCATION OF NOISE SOURCE ABOUT THE DUMMY HEAD: ). NOTE THAT WINNING RESULTS IN TERMS OF AND ARE HIGHLIGHTED Fig. 9. Comparison of the VAD performance under babble noise environment withapprox.6dbsnr.noiseislocatedat about the dummy head with a distance of 5 m. (a) Waveform of the primary microphone signal (b) Corresponding manual VAD (0: noise, 1: speech) (c) VAD result of AMR option2 (d) VAD result of the MSM-based VAD (e) VAD result of two-channel MSC (f) VAD result of the proposed two-step PLDR. the proposed method, we investigated the overall speech quality when the proposed two-step PLDR-based VAD technique is incorporated in estimating the noise power in the speech enhancement system. As a target platform for speech enhancement system, we employed the state-of-the-art speech enhancement algorithm based on MCRA noise estimation incorporating second-order conditional maximum a posteriori (CMAP) criterion [11]. For verifying the performance by taking advantage of the proposed two-step PLDR-based VAD in updating the noise power at the conventional second-order CMAP-based algorithm, we used the composite measures proposed by Hu and Loizou [44] for the objective evaluation of speech quality. The composite measure, which is known to show the significantly high correlation with the mean opinion score (MOS) of subjective speech quality perceived by the listeners, is defined as a combination of representative objective evaluation measures as following: (36) Fig. 10. Comparison of the VAD performance under office noise environment with approx. 6 db SNR. Noise is located at about the dummy head with adistanceof5m.(a) Waveform of the primary microphone signal (b) Corresponding manual VAD (0: noise, 1: speech) (c) VAD result of AMR option2 (d) VAD result of the MSM-based VAD (e) VAD result of two-channel MSC (f) VAD result of the proposed two-step PLDR. where,,and mean the values obtained by the perceptual evaluation of speech quality (PESQ), the log-likelihood ratio (LLR), the weighted-slope spectral distance (WSS), respectively. Table V summarizes the averaged results of speech quality in terms of PESQ, LLR, WSS, and under various noise conditions and azimuth angles. As can be seen in Table V, we can confirm that the proposed two-step PLDR-based VAD consistently improves the performance of speech enhancement in terms of the speech quality by incorporating the proposed VAD approach in updating the noise power estimation within the conventional speech enhancement baseline. In order to verify the improvement of the performance, we compared the speech spectrograms between the output signal processed by the conventional second-order CMAP-based algorithm and the output signal enhanced by the second-order CMAP incorporating the two-step PLDR-based VAD algorithm. Fig. 11 shows performance comparison in terms of speech spectrograms of which the speech sentence

11 CHOI AND CHANG: DUAL-MICROPHONE VAD TECHNIQUE BASED ON TWO-STEP PLDR 1079 TABLE III COMPARISON OF VOICE ACTIVITY IN TERMS OF SPEECH HIT RATES AND NON-SPEECH HIT RATES AMONG THE METHODS OF CONVENTIONAL MSM-BASED VAD [9], AMR VAD OPTION2 [40],TWO-CHANNEL MSC [30], DUAL-CHANNEL PLD [35], AND PROPOSED TWO-STEP PLDR (LOCATION OF NOISE SOURCE ABOUT THE DUMMY HEAD: ). NOTE THAT WINNING RESULTS IN TERMS OF AND ARE HIGHLIGHTED TABLE IV SUMMARY OF VAD IN TERMS OF OF SPEECH AND OF NON-SPEECH BY AVERAGING AND OVER GIVEN AZIMUTH ANGLES AND DISTANCES. NOTE THAT WINNING RESULTS IN TERMS OF AND ARE HIGHLIGHTED TABLE V RESULTS IN TERMS OF SPEECH QUALITY PESQ, LLR, WSS, AND AVERAGED FOR VARIOUS NOISE CONDITIONS AND AZIMUTH ANGLES, OBTAINED USING CONVENTIONAL SECOND-ORDER CMAP [11] AND SECOND-ORDER CMAP INCORPORATING TWO-STEP PLDR-BASED VAD (WITH 95% CONFIDENCE INTERVAL) was corrupted with babble noise located front with 1 m from the dummy head (i.e., SNR db). As can be seen in the figure, the proposed two step PLDR-based VAD clearly contributes on the performance improvement of the second-order CMAP-based speech enhancement technique under the nonstationary noise environment. D. Computational Complexity and Discussion Computational complexity is considered to be one of the crucial factors in designing systems for mobile devices, since complexity increases power consumption. While the proposed method is superior to conventional single- or dual-microphone-based algorithms, computational complexity should be assessed for the purpose of practical implementation. In order to evaluate the additional computational burden, we compared the computational complexity of the proposed two-step algorithm with those of the MSM-based VAD, the dual-microphone MSC-based VAD technique, and the dual-channel normalized PLD-based VAD algorithm. For a fair comparison, the computation steps were divided into main VAD and feature extraction part. Actually, a single FFT routine in the external feature can be ignored since the FFT can be reused in the forthcoming noise suppression module [12]. A brief summary of the computational cost required by each algorithm in terms of million instructions per second (MIPS), which is based on the TXS320C55X [45], [46], is presented

12 1080 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 6, JUNE 2014 Fig. 11. Comparison of speech spectrograms (babble noise, SNR db). Noise was located at about the dummy head with a distance of 1 m. (a) Speech spectrogram of the clean speech signal (b) Speech spectrogram of the noisy speech signal (c) Speech spectrogram of the output signal processed by the conventional second-order CMAP [11] (d) Speech spectrogram of the output signal processed by the second-order CMAP incorporating the two-step PLDR-based VAD. TABLE VI COMPARISON OF COMPUTATIONAL COMPLEXITY IN TERMS OF MIPS PER FRAME ( ms) in Table VI. Among the existing methods, the MSM-based VAD has the highest computational load, followed by the dual-microphone MSC-based VAD technique. The computational burden of the main part of the VAD decision claimed by the multiple-statistical models is larger than those of both the MSC-based VAD and the proposed technique. In an aspect of the computation of the external feature module, the proposed two-step and MSC-based method and the normalized PLD-based VAD are twice bigger than the VAD technique based on the MSM since their techniques require dual-microphones and thus the forward FFT routine is implemented twice. However, considering the main module in the VAD, the proposed two-step scheme calls for a lower computational burden compared to the MSM-based VAD and the MSC-based scheme. This can be attributed to the fact that the MSM-based VAD [9] estimates the noise power spectrum of the noisy signal and SNR parameters such as the aposteriorisnr, apriorisnr, and the likelihood ratio according to multiple-statistical models, while the proposed method simply utilizes the PLDR between the input signals and the noise signals between the primary microphone and the secondary microphone. In this regard, the proposed two-step PLDR approach could be considered to be a simple but effective algorithm since it could be efficiently implemented without significant additional computation cost. V. CONCLUSIONS In this paper, we proposed a novel dual-microphone two-step PLDR technique based on the relative comparison between the PLD of input signals and the PLD of noise estimated during periods without speech. The proposed two-step PLDR approach is composed of two main parts in which the LT-PLDR and ST-PLDR are used to characterize the long-term evolution and the short-term variation, respectively, at the first step and are incorporated into a combined decision rule for VAD at the second step. In order to minimize the missing cases of speech in the first step, the LT-PLDR is derived using a large smoothing parameter for calculating the PLD and is changed to the aposteriori probability for VAD. The ST-PLDR is obtained using a small smoothing parameter in order to detect short pause intervals and to decrease the false-alarm rate and results in the a posteriori SNR for VAD. At the second step, based on the decision by the aposterioriprobability derived from the LT-PLDR and the ST-PLDR, the final VAD decision rule is established and the VAD result from the LT-PLDR is modified by the decision from the ST-PLDR when the periods of speech presence are detected by the LT-PLDR. This two-step framework allows for the proposed method to provide reliable VAD performances under various acoustical conditions incorporating nonstationary noise environments. 
Through extensive experiments under various noise environments, the proposed dual-microphone VAD technique was found to significantly improve the performance of the VAD compared to conventional standardized VAD algorithms and representative single- and dual-microphone VAD algorithms. The proposed method showed outstanding results in terms of performance improvement in nonstationary noise conditions, including babble and office noises. Furthermore, through simulation of the computational complexity, we confirmed that the proposed two-step VAD algorithms can be implemented simply and efficiently in the smart phone system with little additional computational cost. REFERENCES [1] L.R.RabinerandM.R.Sambur, Voiced-unvoiced-slience detection using Itakura LPC distance measure, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 1977, pp [2] J. D. Hoyt and H. Wechsler, Detection of human speech in structured noise, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,May 1994, pp [3] J. C. Junqua, B. Reaves, and B. Mark, A study of endpoint detection algorithms in adverse conditions: Incidence on a DTW and HMM recognize, in Proc. Eurospeech, Sep. 1991, pp [4] J.A.HaighandJ.S.Mason, Robustvoiceactivitydetectionusing cepstral feature, in Proc. IEEE TELCON, Oct. 1993, pp [5] R. Tucker, Voice activity detection using a periodicity measure, in Proc. Inst. Electr. Eng, Aug. 1992, vol. 139, pp [6] ITU-T, A slience compression scheme for G.729 optimized for terminals conforming to recommendation V.70, ITU-T Rec. G. 729, Annex B, [7] J. Sohn, N. S. Kim, and W. Sung, A statistical model-based voice activity detection, IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1 3, Jan [8] Y. D. Cho, K. Al-Naimi, and A. Kondoz, Improved voice activity detection based on a smoothed statistical likelihood ratio, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2001, pp [9] J.-H. Chang, N. S. Kim, and S. K. Mitra, Voice activity detection based on multiple statistical models, IEEE Trans. Signal Process., vol. 54, no. 6, pp , Jun

13 CHOI AND CHANG: DUAL-MICROPHONE VAD TECHNIQUE BASED ON TWO-STEP PLDR 1081 [10] J. W. Shin, J.-H. Chang, and N. S. Kim, Voice activity detection based on a family of parametric distributions, Pattern Recognition Lett., vol. 28, no. 11, pp , Aug [11] J.-M. Kum and J.-H. Chang, Speech enhancement based on minima controlled recursive averaging incorporating second-order conditional MAP criterion, IEEE Signal Process. Lett., vol. 16, no. 7, pp , Jul [12] N. S. Kim and J.-H. Chang, Spectral enhancement based on global soft decision, IEEE Signal Process. Lett., vol. 7, no. 5, pp , May [13] R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp , Jul [14] I. Cohen and B. Berdugo, Noise estimation by minima controlled recursive averaging for robust speech enhancement, IEEE Signal Process. Lett., vol. 9, no. 1, pp , Jan [15] K. Yao and S. Nakamura, Sequential noise compensation by sequential Monte Carlo method, in Proc. Neural Inf. Process. Syst., Dec. 2001, pp [16] B. Raj, R. Singh, and R. Stern, On tracking noise with linear dynamical system models, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp [17] M. Fujimoto and S. Nakamura, Sequential non-stationary noise tracking using particle filtering with switching dynamical system, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2006, pp [18] M. Fujimoto and K. Ishizuka, Noise robust voice activity detection based on switching Kalman filtering, in Proc. Eurospeech, Aug. 2007, pp [19] L. J. Griffiths and C. W. Jim, An alternative approach to linearly constrained adaptive beamformer, IEEE Trans. Antennas Propag., vol. AP-30, no. 1, pp , Jan [20] D. R. Campbell and P. W. Shields, Speech enhancement using subband adaptive Griffiths-Jim signal processing, Speech Commun., vol. 39, pp , Jan [21] B. D. Van and K. M. Buckley, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Mag., vol. 5, pp. 4 24, Apr [22] H. Krim and M. Viberg, Two decades of array signal processing research, IEEE Signal Process. Mag., vol. 13, no. 4, pp , Jul [23] J. Chen and W. Ser, Speech detection using microphone array, Electron. Lett., vol. 36, no. 3, pp , Jan [24] Y. Hioka and N. Hamada, Voice activity detection with array signal processing in the wavelet domain, in Proc. 6th Eur. Signal Process. Conf., Sep. 2002, vol. I, pp [25] I. Potamitis, Estimation of speech presence probability in the field of microphone array, IEEE Signal Process. Lett., vol. 11, no. 12, pp , Dec [26] T. Pirinen and A. Visa, Signal independent wideband activity detection features for microphone arrays, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2006, pp [27] J. E. Rubio, K. Ishizuka, H. Sawada, S. Araki, T. Nakatani, and M. Fujimoto, Two-microphone voice activity detection based on the homogeneity of the direction of arrival estimates, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2007, pp [28] J. B. Allen, D. A. Berkley, and J. Blauert, Multi-microphone signal processing technique to remove room reverberation from speech signals, J. Acoust Soc. Amer., vol. 62, no. 4, pp , Oct [29] R. Le Bouquin-Jeannès and G. Faucon, Using the coherence function for noise reduction, Inst. Electr. Eng. Proc.-I Commun., Speech, Vis., vol. 139, no. 3, pp , Jun [30] R. Le Bouquin-Jeannès and G. Faucon, Study of a voice activity detector and its influence on a noise reduction system, Speech Commun., vol. 16, pp , Apr [31] R. 
Le Bouquin-Jeannès, A. A. Azirani, and G. Faucon, Enhancement of speech degraded by coherent and incoherent noise using a crossspectral estimator, IEEE Trans. Speech Audio Process., vol. 5, no. 5, pp , Sep [32] C. Nelke, C. Beaugeant, and P. Vary, Dual microphone noise psd estimation for mobile phones in hands-free position exploiting the coherence and speech presence probability, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2013, pp [33] N. Yousefian, M. Rahmani, and A. Akbari, Power level difference as a criterion for speech enhancement, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2009, pp [34] N. Yousefian, A. Akbari, and M. Rahmani, Using power level difference for near field dual-microphone speech enhancement, Appl. Acoust., vol. 70, pp , Dec [35] M. Jeub, C. Herglotz, C. Nelke, C. Beaugeant, and P. Vary, Noise reduction for dual-microphone mobile phones exploiting power level differences, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2012, pp [36] Z.-H. Fu, F. Fan, and J.-D. Huang, Dual-microphone noise reduction for mobile phone application, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process, May 2013, pp [37] J.-H. Chang, Q.-H. Jo, D. K. Kim, and N. S. Kim, Global soft decision employing support vector machine for speech enhancement, IEEE Signal Process. Lett., vol. 16, no. 1, pp , Jan [38] J. Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, in Advances in Large Margin Classifiers. Cambridge, MA, USA: MIT Press, [39]G.Wahba,X.Lin,F.Gao,D.Xiang,R.Klein,andB.Klein,The Bias-Variance Trade-Off and The Randomized GACV, M.Kearns,S. Solla, and D. Cohn, Eds. Cambridge, MA, USA: MIT Press, 1999, vol. 11, pp , Advances in Neural Information Processing Systems, Proceedings of the 1998 Conference. [40] Voice activity detector (VAD) for adaptive multi-rate (AMR) speech traffic channels, ETSI EN v7.1.1, ETSI, Dec [41] A. Varga and H. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems, Speech Commun., vol. 12, no. 3, pp , Jul [42] N. Krishnamurthy and J. Hansen, Babble noise: Modeling, analysis, and applications, IEEE Audio, Speech, Lang. Process., vol.17,no.7, pp , Sep [43] G. McLachlan, K.-A. Do, and C. Ambroise, Analyzing Microarray Gene Expression Data. New York, NY, USA: Wiley, [44] Y. Hu and P. Loizou, Evaluation of objective quality measures for speech enhancement, IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp , Jan [45] TI, TMS320C55x DSP library programmer s reference,. Dallas, TX, USA, [46] J.-H. Choi and J.-H. Chang, On using acoustic environment classification for statistical model-based speech enhancement, Speech Commun., vol. 54, no. 3, pp , Mar Jae-Hun Choi was born in Seoul, Korea, in He received the B.S. and M.S. degrees in electronic engineering from Inha University, Incheon, Korea in 2007 and 2010, respectively. Since 2011, he has been pursuing the Ph.D. degree at the department of electronics computer engineering, Hanyang University, Seoul, Korea. His research interests include speech enhancement, voice activity detection, machine learning applied to speech signal processing. Joon-Hyuk Chang received the B.S. degree in electronics engineering from Kyungpook National University, Daegu, Korea in 1998 and the M.S. and Ph.D. degrees in electrical engineering from Seoul National University, Korea, in 2000 and 2004, respectively. 
From March 2000 to April 2005, he was with Netdus Corp., Seoul, as a chief engineer. From May 2004 to April 2005, he was with the University of California, Santa Barbara, in a postdoctoral position to work on adaptive signal processing and audio coding. In May 2005, he joined Korea Institute of Science and Technology, Seoul, as a Research Scientist to work on speech recognition. From August 2005 to February 2011, he was an assistant professor in the school of Electronic Engineering at Inha University, Incheon, Korea. Currently, he is an associate professor in the School of Electronic Engineering at Hanyang University, Seoul, Korea. His research interests are in speech coding, speech enhancement, speech recognition, audio coding, and adaptive signal processing. He is a senior member of IEEE. He is a winner of IEEE/IEEK IT young engineer of the year He is serving as Editor-in-chief of the Signal Processing Society Journal of the IEEK.


More information

ROBUST echo cancellation requires a method for adjusting

ROBUST echo cancellation requires a method for adjusting 1030 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007 On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk Jean-Marc Valin, Member,

More information

Recent Advances in Acoustic Signal Extraction and Dereverberation

Recent Advances in Acoustic Signal Extraction and Dereverberation Recent Advances in Acoustic Signal Extraction and Dereverberation Emanuël Habets Erlangen Colloquium 2016 Scenario Spatial Filtering Estimated Desired Signal Undesired sound components: Sensor noise Competing

More information

MMSE STSA Based Techniques for Single channel Speech Enhancement Application Simit Shah 1, Roma Patel 2

MMSE STSA Based Techniques for Single channel Speech Enhancement Application Simit Shah 1, Roma Patel 2 MMSE STSA Based Techniques for Single channel Speech Enhancement Application Simit Shah 1, Roma Patel 2 1 Electronics and Communication Department, Parul institute of engineering and technology, Vadodara,

More information

THE problem of acoustic echo cancellation (AEC) was

THE problem of acoustic echo cancellation (AEC) was IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 6, NOVEMBER 2005 1231 Acoustic Echo Cancellation and Doubletalk Detection Using Estimated Loudspeaker Impulse Responses Per Åhgren Abstract

More information

Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a

Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a R E S E A R C H R E P O R T I D I A P Effective post-processing for single-channel frequency-domain speech enhancement Weifeng Li a IDIAP RR 7-7 January 8 submitted for publication a IDIAP Research Institute,

More information

Different Approaches of Spectral Subtraction Method for Speech Enhancement

Different Approaches of Spectral Subtraction Method for Speech Enhancement ISSN 2249 5460 Available online at www.internationalejournals.com International ejournals International Journal of Mathematical Sciences, Technology and Humanities 95 (2013 1056 1062 Different Approaches

More information

Emanuël A. P. Habets, Jacob Benesty, and Patrick A. Naylor. Presented by Amir Kiperwas

Emanuël A. P. Habets, Jacob Benesty, and Patrick A. Naylor. Presented by Amir Kiperwas Emanuël A. P. Habets, Jacob Benesty, and Patrick A. Naylor Presented by Amir Kiperwas 1 M-element microphone array One desired source One undesired source Ambient noise field Signals: Broadband Mutually

More information

SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS

SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS 17th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 SPECTRAL COMBINING FOR MICROPHONE DIVERSITY SYSTEMS Jürgen Freudenberger, Sebastian Stenzel, Benjamin Venditti

More information

Voice Activity Detection for Speech Enhancement Applications

Voice Activity Detection for Speech Enhancement Applications Voice Activity Detection for Speech Enhancement Applications E. Verteletskaya, K. Sakhnov Abstract This paper describes a study of noise-robust voice activity detection (VAD) utilizing the periodicity

More information

Multiple Sound Sources Localization Using Energetic Analysis Method

Multiple Sound Sources Localization Using Energetic Analysis Method VOL.3, NO.4, DECEMBER 1 Multiple Sound Sources Localization Using Energetic Analysis Method Hasan Khaddour, Jiří Schimmel Department of Telecommunications FEEC, Brno University of Technology Purkyňova

More information

Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa

Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa Students: Avihay Barazany Royi Levy Supervisor: Kuti Avargel In Association with: Zoran, Haifa Spring 2008 Introduction Problem Formulation Possible Solutions Proposed Algorithm Experimental Results Conclusions

More information

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition

Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Spectral estimation using higher-lag autocorrelation coefficients with applications to speech recognition Author Shannon, Ben, Paliwal, Kuldip Published 25 Conference Title The 8th International Symposium

More information

ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE

ROBUST PITCH TRACKING USING LINEAR REGRESSION OF THE PHASE - @ Ramon E Prieto et al Robust Pitch Tracking ROUST PITCH TRACKIN USIN LINEAR RERESSION OF THE PHASE Ramon E Prieto, Sora Kim 2 Electrical Engineering Department, Stanford University, rprieto@stanfordedu

More information

Speech Enhancement for Nonstationary Noise Environments

Speech Enhancement for Nonstationary Noise Environments Signal & Image Processing : An International Journal (SIPIJ) Vol., No.4, December Speech Enhancement for Nonstationary Noise Environments Sandhya Hawaldar and Manasi Dixit Department of Electronics, KIT

More information

IN REVERBERANT and noisy environments, multi-channel

IN REVERBERANT and noisy environments, multi-channel 684 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 6, NOVEMBER 2003 Analysis of Two-Channel Generalized Sidelobe Canceller (GSC) With Post-Filtering Israel Cohen, Senior Member, IEEE Abstract

More information

Robust Low-Resource Sound Localization in Correlated Noise

Robust Low-Resource Sound Localization in Correlated Noise INTERSPEECH 2014 Robust Low-Resource Sound Localization in Correlated Noise Lorin Netsch, Jacek Stachurski Texas Instruments, Inc. netsch@ti.com, jacek@ti.com Abstract In this paper we address the problem

More information

CHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS

CHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS 46 CHAPTER 3 SPEECH ENHANCEMENT ALGORITHMS 3.1 INTRODUCTION Personal communication of today is impaired by nearly ubiquitous noise. Speech communication becomes difficult under these conditions; speech

More information

IMPROVED SPEECH QUALITY FOR VMR - WB SPEECH CODING USING EFFICIENT NOISE ESTIMATION ALGORITHM

IMPROVED SPEECH QUALITY FOR VMR - WB SPEECH CODING USING EFFICIENT NOISE ESTIMATION ALGORITHM IMPROVED SPEECH QUALITY FOR VMR - WB SPEECH CODING USING EFFICIENT NOISE ESTIMATION ALGORITHM Mr. M. Mathivanan Associate Professor/ECE Selvam College of Technology Namakkal, Tamilnadu, India Dr. S.Chenthur

More information

Speech Enhancement Using a Mixture-Maximum Model

Speech Enhancement Using a Mixture-Maximum Model IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 6, SEPTEMBER 2002 341 Speech Enhancement Using a Mixture-Maximum Model David Burshtein, Senior Member, IEEE, and Sharon Gannot, Member, IEEE

More information

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction

Speech Enhancement Using Spectral Flatness Measure Based Spectral Subtraction IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 7, Issue, Ver. I (Mar. - Apr. 7), PP 4-46 e-issn: 9 4, p-issn No. : 9 497 www.iosrjournals.org Speech Enhancement Using Spectral Flatness Measure

More information

Speaker and Noise Independent Voice Activity Detection

Speaker and Noise Independent Voice Activity Detection Speaker and Noise Independent Voice Activity Detection François G. Germain, Dennis L. Sun,2, Gautham J. Mysore 3 Center for Computer Research in Music and Acoustics, Stanford University, CA 9435 2 Department

More information

Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya 2, B. Yamuna 2, H. Divya 2, B. Shiva Kumar 2, B.

Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya 2, B. Yamuna 2, H. Divya 2, B. Shiva Kumar 2, B. www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 4 Issue 4 April 2015, Page No. 11143-11147 Speech Enhancement Using Beamforming Dr. G. Ramesh Babu 1, D. Lavanya

More information

MULTICHANNEL systems are often used for

MULTICHANNEL systems are often used for IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 52, NO. 5, MAY 2004 1149 Multichannel Post-Filtering in Nonstationary Noise Environments Israel Cohen, Senior Member, IEEE Abstract In this paper, we present

More information

Real time noise-speech discrimination in time domain for speech recognition application

Real time noise-speech discrimination in time domain for speech recognition application University of Malaya From the SelectedWorks of Mokhtar Norrima January 4, 2011 Real time noise-speech discrimination in time domain for speech recognition application Norrima Mokhtar, University of Malaya

More information

Noise Estimation based on Standard Deviation and Sigmoid Function Using a Posteriori Signal to Noise Ratio in Nonstationary Noisy Environments

Noise Estimation based on Standard Deviation and Sigmoid Function Using a Posteriori Signal to Noise Ratio in Nonstationary Noisy Environments 88 International Journal of Control, Automation, and Systems, vol. 6, no. 6, pp. 88-87, December 008 Noise Estimation based on Standard Deviation and Sigmoid Function Using a Posteriori Signal to Noise

More information

Speech Enhancement using Wiener filtering

Speech Enhancement using Wiener filtering Speech Enhancement using Wiener filtering S. Chirtmay and M. Tahernezhadi Department of Electrical Engineering Northern Illinois University DeKalb, IL 60115 ABSTRACT The problem of reducing the disturbing

More information

Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging

Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging 466 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 11, NO. 5, SEPTEMBER 2003 Noise Spectrum Estimation in Adverse Environments: Improved Minima Controlled Recursive Averaging Israel Cohen Abstract

More information

Determination of instants of significant excitation in speech using Hilbert envelope and group delay function

Determination of instants of significant excitation in speech using Hilbert envelope and group delay function Determination of instants of significant excitation in speech using Hilbert envelope and group delay function by K. Sreenivasa Rao, S. R. M. Prasanna, B.Yegnanarayana in IEEE Signal Processing Letters,

More information

I D I A P. On Factorizing Spectral Dynamics for Robust Speech Recognition R E S E A R C H R E P O R T. Iain McCowan a Hemant Misra a,b

I D I A P. On Factorizing Spectral Dynamics for Robust Speech Recognition R E S E A R C H R E P O R T. Iain McCowan a Hemant Misra a,b R E S E A R C H R E P O R T I D I A P On Factorizing Spectral Dynamics for Robust Speech Recognition a Vivek Tyagi Hervé Bourlard a,b IDIAP RR 3-33 June 23 Iain McCowan a Hemant Misra a,b to appear in

More information

Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming

Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming Speech and Audio Processing Recognition and Audio Effects Part 3: Beamforming Gerhard Schmidt Christian-Albrechts-Universität zu Kiel Faculty of Engineering Electrical Engineering and Information Engineering

More information

Calibration of Microphone Arrays for Improved Speech Recognition

Calibration of Microphone Arrays for Improved Speech Recognition MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Calibration of Microphone Arrays for Improved Speech Recognition Michael L. Seltzer, Bhiksha Raj TR-2001-43 December 2001 Abstract We present

More information

CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS

CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS 66 CHAPTER 4 VOICE ACTIVITY DETECTION ALGORITHMS 4.1 INTRODUCTION New frontiers of speech technology are demanding increased levels of performance in many areas. In the advent of Wireless Communications

More information

AS DIGITAL speech communication devices, such as

AS DIGITAL speech communication devices, such as IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 4, MAY 2012 1383 Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay Timo Gerkmann, Member, IEEE,

More information

NOISE ESTIMATION IN A SINGLE CHANNEL

NOISE ESTIMATION IN A SINGLE CHANNEL SPEECH ENHANCEMENT FOR CROSS-TALK INTERFERENCE by Levent M. Arslan and John H.L. Hansen Robust Speech Processing Laboratory Department of Electrical Engineering Box 99 Duke University Durham, North Carolina

More information

Informed Spatial Filtering for Sound Extraction Using Distributed Microphone Arrays

Informed Spatial Filtering for Sound Extraction Using Distributed Microphone Arrays IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 7, JULY 2014 1195 Informed Spatial Filtering for Sound Extraction Using Distributed Microphone Arrays Maja Taseska, Student

More information

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS

MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS MODIFIED DCT BASED SPEECH ENHANCEMENT IN VEHICULAR ENVIRONMENTS 1 S.PRASANNA VENKATESH, 2 NITIN NARAYAN, 3 K.SAILESH BHARATHWAAJ, 4 M.P.ACTLIN JEEVA, 5 P.VIJAYALAKSHMI 1,2,3,4,5 SSN College of Engineering,

More information

arxiv: v1 [cs.sd] 4 Dec 2018

arxiv: v1 [cs.sd] 4 Dec 2018 LOCALIZATION AND TRACKING OF AN ACOUSTIC SOURCE USING A DIAGONAL UNLOADING BEAMFORMING AND A KALMAN FILTER Daniele Salvati, Carlo Drioli, Gian Luca Foresti Department of Mathematics, Computer Science and

More information

HUMAN speech is frequently encountered in several

HUMAN speech is frequently encountered in several 1948 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 7, SEPTEMBER 2012 Enhancement of Single-Channel Periodic Signals in the Time-Domain Jesper Rindom Jensen, Student Member,

More information

Can binary masks improve intelligibility?

Can binary masks improve intelligibility? Can binary masks improve intelligibility? Mike Brookes (Imperial College London) & Mark Huckvale (University College London) Apparently so... 2 How does it work? 3 Time-frequency grid of local SNR + +

More information

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm

Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Speech Enhancement Based On Spectral Subtraction For Speech Recognition System With Dpcm A.T. Rajamanickam, N.P.Subiramaniyam, A.Balamurugan*,

More information

Speech Enhancement Based On Noise Reduction

Speech Enhancement Based On Noise Reduction Speech Enhancement Based On Noise Reduction Kundan Kumar Singh Electrical Engineering Department University Of Rochester ksingh11@z.rochester.edu ABSTRACT This paper addresses the problem of signal distortion

More information

AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS

AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS AN ANALYSIS OF SPEECH RECOGNITION PERFORMANCE BASED UPON NETWORK LAYERS AND TRANSFER FUNCTIONS Kuldeep Kumar 1, R. K. Aggarwal 1 and Ankita Jain 2 1 Department of Computer Engineering, National Institute

More information

Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics

Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics 504 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001 Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics Rainer Martin, Senior Member, IEEE

More information

ARTICLE IN PRESS. Signal Processing

ARTICLE IN PRESS. Signal Processing Signal Processing 9 (2) 737 74 Contents lists available at ScienceDirect Signal Processing journal homepage: www.elsevier.com/locate/sigpro Fast communication Double-talk detection based on soft decision

More information

Carrier Frequency Offset Estimation in WCDMA Systems Using a Modified FFT-Based Algorithm

Carrier Frequency Offset Estimation in WCDMA Systems Using a Modified FFT-Based Algorithm Carrier Frequency Offset Estimation in WCDMA Systems Using a Modified FFT-Based Algorithm Seare H. Rezenom and Anthony D. Broadhurst, Member, IEEE Abstract-- Wideband Code Division Multiple Access (WCDMA)

More information

Voiced/nonvoiced detection based on robustness of voiced epochs

Voiced/nonvoiced detection based on robustness of voiced epochs Voiced/nonvoiced detection based on robustness of voiced epochs by N. Dhananjaya, B.Yegnanarayana in IEEE Signal Processing Letters, 17, 3 : 273-276 Report No: IIIT/TR/2010/50 Centre for Language Technologies

More information

Speech Signal Enhancement Techniques

Speech Signal Enhancement Techniques Speech Signal Enhancement Techniques Chouki Zegar 1, Abdelhakim Dahimene 2 1,2 Institute of Electrical and Electronic Engineering, University of Boumerdes, Algeria inelectr@yahoo.fr, dahimenehakim@yahoo.fr

More information

A Survey and Evaluation of Voice Activity Detection Algorithms

A Survey and Evaluation of Voice Activity Detection Algorithms A Survey and Evaluation of Voice Activity Detection Algorithms Seshashyama Sameeraj Meduri (ssme09@student.bth.se, 861003-7577) Rufus Ananth (anru09@student.bth.se, 861129-5018) Examiner: Dr. Sven Johansson

More information

A multi-class method for detecting audio events in news broadcasts

A multi-class method for detecting audio events in news broadcasts A multi-class method for detecting audio events in news broadcasts Sergios Petridis, Theodoros Giannakopoulos, and Stavros Perantonis Computational Intelligence Laboratory, Institute of Informatics and

More information

High-speed Noise Cancellation with Microphone Array

High-speed Noise Cancellation with Microphone Array Noise Cancellation a Posteriori Probability, Maximum Criteria Independent Component Analysis High-speed Noise Cancellation with Microphone Array We propose the use of a microphone array based on independent

More information

Wavelet Speech Enhancement based on the Teager Energy Operator

Wavelet Speech Enhancement based on the Teager Energy Operator Wavelet Speech Enhancement based on the Teager Energy Operator Mohammed Bahoura and Jean Rouat ERMETIS, DSA, Université du Québec à Chicoutimi, Chicoutimi, Québec, G7H 2B1, Canada. Abstract We propose

More information

Voice Activity Detection Using Spectral Entropy. in Bark-Scale Wavelet Domain

Voice Activity Detection Using Spectral Entropy. in Bark-Scale Wavelet Domain Voice Activity Detection Using Spectral Entropy in Bark-Scale Wavelet Domain 王坤卿 Kun-ching Wang, 侯圳嶺 Tzuen-lin Hou 實踐大學資訊科技與通訊學系 Department of Information Technology & Communication Shin Chien University

More information

Chapter 4 SPEECH ENHANCEMENT

Chapter 4 SPEECH ENHANCEMENT 44 Chapter 4 SPEECH ENHANCEMENT 4.1 INTRODUCTION: Enhancement is defined as improvement in the value or Quality of something. Speech enhancement is defined as the improvement in intelligibility and/or

More information

Probability of Error Calculation of OFDM Systems With Frequency Offset

Probability of Error Calculation of OFDM Systems With Frequency Offset 1884 IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 49, NO. 11, NOVEMBER 2001 Probability of Error Calculation of OFDM Systems With Frequency Offset K. Sathananthan and C. Tellambura Abstract Orthogonal frequency-division

More information

546 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 4, MAY /$ IEEE

546 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 4, MAY /$ IEEE 546 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL 17, NO 4, MAY 2009 Relative Transfer Function Identification Using Convolutive Transfer Function Approximation Ronen Talmon, Israel

More information

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches

Performance study of Text-independent Speaker identification system using MFCC & IMFCC for Telephone and Microphone Speeches Performance study of Text-independent Speaker identification system using & I for Telephone and Microphone Speeches Ruchi Chaudhary, National Technical Research Organization Abstract: A state-of-the-art

More information

NOISE POWER SPECTRAL DENSITY MATRIX ESTIMATION BASED ON MODIFIED IMCRA. Qipeng Gong, Benoit Champagne and Peter Kabal

NOISE POWER SPECTRAL DENSITY MATRIX ESTIMATION BASED ON MODIFIED IMCRA. Qipeng Gong, Benoit Champagne and Peter Kabal NOISE POWER SPECTRAL DENSITY MATRIX ESTIMATION BASED ON MODIFIED IMCRA Qipeng Gong, Benoit Champagne and Peter Kabal Department of Electrical & Computer Engineering, McGill University 3480 University St.,

More information

1856 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 7, SEPTEMBER /$ IEEE

1856 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 7, SEPTEMBER /$ IEEE 1856 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 7, SEPTEMBER 2010 Sequential Organization of Speech in Reverberant Environments by Integrating Monaural Grouping and Binaural

More information

RASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991

RASTA-PLP SPEECH ANALYSIS. Aruna Bayya. Phil Kohn y TR December 1991 RASTA-PLP SPEECH ANALYSIS Hynek Hermansky Nelson Morgan y Aruna Bayya Phil Kohn y TR-91-069 December 1991 Abstract Most speech parameter estimation techniques are easily inuenced by the frequency response

More information

Speech Endpoint Detection Based on Sub-band Energy and Harmonic Structure of Voice

Speech Endpoint Detection Based on Sub-band Energy and Harmonic Structure of Voice Speech Endpoint Detection Based on Sub-band Energy and Harmonic Structure of Voice Yanmeng Guo, Qiang Fu, and Yonghong Yan ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences Beijing

More information

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech

Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech INTERSPEECH 5 Synchronous Overlap and Add of Spectra for Enhancement of Excitation in Artificial Bandwidth Extension of Speech M. A. Tuğtekin Turan and Engin Erzin Multimedia, Vision and Graphics Laboratory,

More information

Single channel noise reduction

Single channel noise reduction Single channel noise reduction Basics and processing used for ETSI STF 94 ETSI Workshop on Speech and Noise in Wideband Communication Claude Marro France Telecom ETSI 007. All rights reserved Outline Scope

More information

Signal segmentation and waveform characterization. Biosignal processing, S Autumn 2012

Signal segmentation and waveform characterization. Biosignal processing, S Autumn 2012 Signal segmentation and waveform characterization Biosignal processing, 5173S Autumn 01 Short-time analysis of signals Signal statistics may vary in time: nonstationary how to compute signal characterizations?

More information

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech

Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Speech Enhancement: Reduction of Additive Noise in the Digital Processing of Speech Project Proposal Avner Halevy Department of Mathematics University of Maryland, College Park ahalevy at math.umd.edu

More information

Voice Activity Detection

Voice Activity Detection Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class

More information

Non-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License

Non-intrusive intelligibility prediction for Mandarin speech in noise. Creative Commons: Attribution 3.0 Hong Kong License Title Non-intrusive intelligibility prediction for Mandarin speech in noise Author(s) Chen, F; Guan, T Citation The 213 IEEE Region 1 Conference (TENCON 213), Xi'an, China, 22-25 October 213. In Conference

More information

COM 12 C 288 E October 2011 English only Original: English

COM 12 C 288 E October 2011 English only Original: English Question(s): 9/12 Source: Title: INTERNATIONAL TELECOMMUNICATION UNION TELECOMMUNICATION STANDARDIZATION SECTOR STUDY PERIOD 2009-2012 Audience STUDY GROUP 12 CONTRIBUTION 288 P.ONRA Contribution Additional

More information

Performance analysis of voice activity detection algorithm for robust speech recognition system under different noisy environment

Performance analysis of voice activity detection algorithm for robust speech recognition system under different noisy environment BABU et al: VOICE ACTIVITY DETECTION ALGORITHM FOR ROBUST SPEECH RECOGNITION SYSTEM Journal of Scientific & Industrial Research Vol. 69, July 2010, pp. 515-522 515 Performance analysis of voice activity

More information

Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement

Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement Frequency Domain Analysis for Noise Suppression Using Spectral Processing Methods for Degraded Speech Signal in Speech Enhancement 1 Zeeshan Hashmi Khateeb, 2 Gopalaiah 1,2 Department of Instrumentation

More information

SPEECH ENHANCEMENT USING SPARSE CODE SHRINKAGE AND GLOBAL SOFT DECISION. Changkyu Choi, Seungho Choi, and Sang-Ryong Kim

SPEECH ENHANCEMENT USING SPARSE CODE SHRINKAGE AND GLOBAL SOFT DECISION. Changkyu Choi, Seungho Choi, and Sang-Ryong Kim SPEECH ENHANCEMENT USING SPARSE CODE SHRINKAGE AND GLOBAL SOFT DECISION Changkyu Choi, Seungho Choi, and Sang-Ryong Kim Human & Computer Interaction Laboratory Samsung Advanced Institute of Technology

More information

Keywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding.

Keywords Decomposition; Reconstruction; SNR; Speech signal; Super soft Thresholding. Volume 5, Issue 2, February 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Speech Enhancement

More information

ANUMBER of estimators of the signal magnitude spectrum

ANUMBER of estimators of the signal magnitude spectrum IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011 1123 Estimators of the Magnitude-Squared Spectrum and Methods for Incorporating SNR Uncertainty Yang Lu and Philipos

More information

ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking

ScienceDirect. Unsupervised Speech Segregation Using Pitch Information and Time Frequency Masking Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 122 126 International Conference on Information and Communication Technologies (ICICT 2014) Unsupervised Speech

More information

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise

Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Classification of ships using autocorrelation technique for feature extraction of the underwater acoustic noise Noha KORANY 1 Alexandria University, Egypt ABSTRACT The paper applies spectral analysis to

More information

RESEARCH ON METHODS FOR ANALYZING AND PROCESSING SIGNALS USED BY INTERCEPTION SYSTEMS WITH SPECIAL APPLICATIONS

RESEARCH ON METHODS FOR ANALYZING AND PROCESSING SIGNALS USED BY INTERCEPTION SYSTEMS WITH SPECIAL APPLICATIONS Abstract of Doctorate Thesis RESEARCH ON METHODS FOR ANALYZING AND PROCESSING SIGNALS USED BY INTERCEPTION SYSTEMS WITH SPECIAL APPLICATIONS PhD Coordinator: Prof. Dr. Eng. Radu MUNTEANU Author: Radu MITRAN

More information

Adaptive Speech Enhancement Using Partial Differential Equations and Back Propagation Neural Networks

Adaptive Speech Enhancement Using Partial Differential Equations and Back Propagation Neural Networks Australian Journal of Basic and Applied Sciences, 4(7): 2093-2098, 2010 ISSN 1991-8178 Adaptive Speech Enhancement Using Partial Differential Equations and Back Propagation Neural Networks 1 Mojtaba Bandarabadi,

More information

Local Oscillators Phase Noise Cancellation Methods

Local Oscillators Phase Noise Cancellation Methods IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) e-issn: 2278-2834, p- ISSN: 2278-8735. Volume 5, Issue 1 (Jan. - Feb. 2013), PP 19-24 Local Oscillators Phase Noise Cancellation Methods

More information

SNR Estimation in Nakagami-m Fading With Diversity Combining and Its Application to Turbo Decoding

SNR Estimation in Nakagami-m Fading With Diversity Combining and Its Application to Turbo Decoding IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 50, NO. 11, NOVEMBER 2002 1719 SNR Estimation in Nakagami-m Fading With Diversity Combining Its Application to Turbo Decoding A. Ramesh, A. Chockalingam, Laurence

More information

On a Classification of Voiced/Unvoiced by using SNR for Speech Recognition

On a Classification of Voiced/Unvoiced by using SNR for Speech Recognition International Conference on Advanced Computer Science and Electronics Information (ICACSEI 03) On a Classification of Voiced/Unvoiced by using SNR for Speech Recognition Jongkuk Kim, Hernsoo Hahn Department

More information

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes

SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN. Yu Wang and Mike Brookes SPEECH ENHANCEMENT USING A ROBUST KALMAN FILTER POST-PROCESSOR IN THE MODULATION DOMAIN Yu Wang and Mike Brookes Department of Electrical and Electronic Engineering, Exhibition Road, Imperial College London,

More information

WIND SPEED ESTIMATION AND WIND-INDUCED NOISE REDUCTION USING A 2-CHANNEL SMALL MICROPHONE ARRAY

WIND SPEED ESTIMATION AND WIND-INDUCED NOISE REDUCTION USING A 2-CHANNEL SMALL MICROPHONE ARRAY INTER-NOISE 216 WIND SPEED ESTIMATION AND WIND-INDUCED NOISE REDUCTION USING A 2-CHANNEL SMALL MICROPHONE ARRAY Shumpei SAKAI 1 ; Tetsuro MURAKAMI 2 ; Naoto SAKATA 3 ; Hirohumi NAKAJIMA 4 ; Kazuhiro NAKADAI

More information

A Soft-Limiting Receiver Structure for Time-Hopping UWB in Multiple Access Interference

A Soft-Limiting Receiver Structure for Time-Hopping UWB in Multiple Access Interference 2006 IEEE Ninth International Symposium on Spread Spectrum Techniques and Applications A Soft-Limiting Receiver Structure for Time-Hopping UWB in Multiple Access Interference Norman C. Beaulieu, Fellow,

More information

Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W.

Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W. Joint dereverberation and residual echo suppression of speech signals in noisy environments Habets, E.A.P.; Gannot, S.; Cohen, I.; Sommen, P.C.W. Published in: IEEE Transactions on Audio, Speech, and Language

More information

Online Monaural Speech Enhancement Based on Periodicity Analysis and A Priori SNR Estimation

Online Monaural Speech Enhancement Based on Periodicity Analysis and A Priori SNR Estimation 1 Online Monaural Speech Enhancement Based on Periodicity Analysis and A Priori SNR Estimation Zhangli Chen* and Volker Hohmann Abstract This paper describes an online algorithm for enhancing monaural

More information

Modulation Spectrum Power-law Expansion for Robust Speech Recognition

Modulation Spectrum Power-law Expansion for Robust Speech Recognition Modulation Spectrum Power-law Expansion for Robust Speech Recognition Hao-Teng Fan, Zi-Hao Ye and Jeih-weih Hung Department of Electrical Engineering, National Chi Nan University, Nantou, Taiwan E-mail:

More information

THE EFFECT of multipath fading in wireless systems can

THE EFFECT of multipath fading in wireless systems can IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 47, NO. 1, FEBRUARY 1998 119 The Diversity Gain of Transmit Diversity in Wireless Systems with Rayleigh Fading Jack H. Winters, Fellow, IEEE Abstract In

More information

DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS

DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS DESIGN AND IMPLEMENTATION OF AN ALGORITHM FOR MODULATION IDENTIFICATION OF ANALOG AND DIGITAL SIGNALS John Yong Jia Chen (Department of Electrical Engineering, San José State University, San José, California,

More information

Title. Author(s)Sugiyama, Akihiko; Kato, Masanori; Serizawa, Masahir. Issue Date Doc URL. Type. Note. File Information

Title. Author(s)Sugiyama, Akihiko; Kato, Masanori; Serizawa, Masahir. Issue Date Doc URL. Type. Note. File Information Title A Low-Distortion Noise Canceller with an SNR-Modifie Author(s)Sugiyama, Akihiko; Kato, Masanori; Serizawa, Masahir Proceedings : APSIPA ASC 9 : Asia-Pacific Signal Citationand Conference: -5 Issue

More information