DNN-based Causal Voice Activity Detector

Ivan J. Tashev
Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
ivantash@microsoft.com

Seyedmahdad Mirsamadi
University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080, USA
mirsamadi@utdallas.edu

Abstract: Voice Activity Detectors (VAD) are important components in audio processing algorithms. In general, VADs are two-way classifiers, flagging the audio frames where we have voice activity. Most of them are based on the signal energy and build statistical models of the noise background and the speech signal. In the process of derivation we are limited to simplified statistical models, and this limits the accuracy of the classification. Using more precise, but also more complex, statistical models makes the analytical derivation of the solution practically impossible. In this paper, we propose using a deep neural network (DNN) to learn the relationship between the noisy speech features and the correct VAD decision. In most cases we need a causal algorithm, i.e. one working in real time and using only current and past audio samples. This is why we use audio segments that consist only of the current and previous audio frames, thus making real-time implementations possible. The proposed algorithm and DNN structure outperform the classic, statistical-model-based VAD for both seen and unseen noises.

Index Terms: voice activity detection, deep neural networks, speech statistical model, noise statistical model.

I. INTRODUCTION

Voice Activity Detectors (VAD) are algorithms for detecting the presence of a speech signal in a mixture of speech and noise. They are part of noise suppressors, double talk detectors, codecs, and automatic gain control blocks, to mention a few. The VAD output can vary from a simple binary decision (yes/no), to a soft decision (probability of speech presence in the current audio frame), to the probability of speech presence in each frequency bin of each audio frame. The commonly used VAD algorithms are based on the assumption of quasi-stationary noise, i.e. the noise spectrum changes much more slowly than the speech signal.
A classic VAD algorithm works in real time and makes its decisions based on the current and previous samples, i.e. it is causal. Most of these algorithms work in the frequency domain for better integration in the audio processing chain and provide an estimate for each frequency bin separately. One of the approaches frequently used as a baseline VAD algorithm is standardized as ITU-T Recommendation G.729 Annex B [1]. An improved and generalized VAD is described in [2], where the authors create a soft-decision VAD assuming a Gaussian distribution of the noise and speech signals. In [3] a simple HMM is added to create a hangover scheme and to finalize the decision using the timing of the state switching. This algorithm can be generalized and optimized for better performance as described in [4].

Most VAD algorithms assume a Gaussian distribution of the noise and speech signals. It is well known that while the distribution of noise amplitudes in the time domain is well modelled by the Gaussian distribution, the distribution of the amplitudes of the speech signal has higher kurtosis than the Gaussian distribution. Gazor and Zhang [5] published a study of the speech signal distribution in the time domain; later, in [6], this study was extended with models of the Probability Density Functions (PDF) of the speech signal magnitudes in the frequency domain. Several attempts have been published to utilize the non-Gaussianity of the speech signal for better noise suppression rules [7], [8], [9], or for a better VAD [10], [11]. In most cases it is very difficult to find an analytical form of the suppression rules, or of the speech presence probability, and the proposed solutions are either approximate or computationally expensive.

Statistical audio signal processing also assumes that the frequency bins in one audio frame are statistically independent, which allows processing these bins individually. The same assumption is made for consecutive audio frames, which allows processing the audio signal frame by frame.
In reality there are noise signals that change faster than the speech signal (clapping, clanks, etc.), consecutive audio frames are highly correlated, and the frequency bins in the same frame contain information that can be utilized by processing them together. Still, the assumptions above lead to working VAD algorithms, which serve well in nearly every audio processing system.

In this paper we propose an algorithm for a causal VAD based on deep neural networks (DNN). The DNN is trained on segments of several consecutive audio frames, with all frequency bins together, to utilize the correlation between the frames and bins. We do not assume any prior distribution of the noise and speech signals and expect the DNN to learn the dependency between the input features and the VAD decision. In Section II, we formulate the problem and present the statistical model-based VAD. Sections III and IV describe the proposed neural network structure and the evaluation dataset. In Section V, we describe the experimental results, and we conclude in Section VI.

[Fig. 1. Noisy signal in time domain with SNR=1 dB. Axes: amplitude versus time in seconds; legend: Signal, VAD frame/2.]

[Fig. 2. Noisy signal in frequency domain with SNR=1 dB. Axes: frequency in Hz versus time in seconds; magnitudes in dB.]

II. PROBLEM DEFINITION

A. Modelling

We have a limited discrete signal in the time domain, $x(lT)$, where $l \in [0, L-1]$, $x_{\min} \le x(lT) \le x_{\max}$ for all $l$, and $T$ is the sampling period. An example of such a signal is shown in Figure 1. This signal is a mixture of two limited, discrete, and uncorrelated signals, $x(lT) = n(lT) + s(lT)$: noise $n(lT)$ and speech $s(lT)$, respectively. After framing, windowing, and converting to the frequency domain we have $X_k^{(n)} = N_k^{(n)} + S_k^{(n)}$, where $k \in [0, K-1]$ is the frequency bin, $K$ is the number of frequency bins, $n \in [0, N-1]$ is the frame number, and $N$ is the number of audio frames. The same can be written in matrix form, $\mathbf{X} = \mathbf{N} + \mathbf{S}$, where all are $K \times N$ complex matrices representing the spectra of the signal, the noise, and the speech components. This representation is visualized in Figure 2, where the magnitudes are in decibel scale. In each frame and/or frequency bin we consider two hypotheses:

$H_0$: speech is absent, $\mathbf{X} = \mathbf{N}$
$H_1$: speech is present, $\mathbf{X} = \mathbf{N} + \mathbf{S}$.

The goal of the VAD algorithm is to produce the presence probability $P_k^{(n)}(H_1)$ for each frequency bin, and $P^{(n)}(H_1)$ for each frame (a column in the matrices above). An example of the expected VAD decision per frame is shown in Figure 1.

B. Voice Activity Detectors

Let us assume that the noise and speech signals are zero mean and fully characterized by their respective variances $\sigma_n^2$ and $\sigma_s^2$, and that we have prior knowledge of the PDFs of these two signals, $p_n(a \mid \sigma_n^2)$ and $p_s(a \mid \sigma_s^2)$ respectively. The PDF of a mix of two uncorrelated signals is the convolution of the PDFs of the two signals:

$$p_x(a \mid \sigma_n^2, \sigma_s^2) = p_n(a \mid \sigma_n^2) * p_s(a \mid \sigma_s^2). \tag{1}$$

Note that this equation has an analytical solution only for a small number of distribution pairs; in most cases it has to be solved numerically.
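As a worked instance of equation (1), the Gaussian pair is one of the few cases with a closed form. Assuming both PDFs are zero-mean Gaussians in the amplitude $a$ (the assumption used by the baseline VAD), the convolution is again a zero-mean Gaussian whose variance is the sum of the two variances:

```latex
\begin{align*}
p_n(a \mid \sigma_n^2) &= \frac{1}{\sqrt{2\pi\sigma_n^2}}
    \exp\!\left(-\frac{a^2}{2\sigma_n^2}\right), \qquad
p_s(a \mid \sigma_s^2) = \frac{1}{\sqrt{2\pi\sigma_s^2}}
    \exp\!\left(-\frac{a^2}{2\sigma_s^2}\right), \\
p_x(a \mid \sigma_n^2, \sigma_s^2) &= (p_n * p_s)(a)
    = \frac{1}{\sqrt{2\pi(\sigma_n^2+\sigma_s^2)}}
    \exp\!\left(-\frac{a^2}{2(\sigma_n^2+\sigma_s^2)}\right).
\end{align*}
```

For heavier-tailed speech models this closure property is lost, which is why the general case usually needs numerical evaluation.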
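The probability $P(H_1 \mid a)$ that a signal with amplitude $a$ contains speech is derived by directly applying the Bayesian rule:

$$P(H_1 \mid a) = \frac{p(a \mid H_1)\,P(H_1)}{p(a \mid H_1)\,P(H_1) + p(a \mid H_0)\,P(H_0)}. \tag{2}$$

Here $P(H_1)$ and $P(H_0) = 1 - P(H_1)$ are the prior probabilities for speech and noise presence, respectively. After dividing by $p(a \mid H_0)\,P(H_0)$ we have:

$$P(H_1 \mid a) = \frac{\varepsilon\Lambda}{1 + \varepsilon\Lambda}, \tag{3}$$

where $\varepsilon = P(H_1)/P(H_0)$, and $\Lambda$ is the likelihood ratio:

$$\Lambda = \frac{p_x(a \mid \sigma_n^2, \sigma_s^2)}{p_n(a \mid \sigma_n^2)}. \tag{4}$$

The ratio $\varepsilon$ of the prior probabilities for speech and noise can be assumed constant and known. Then, if we can estimate the noise and speech variances, we can estimate the speech presence probability in each frame and/or frequency bin. Under the assumption of a zero-mean Gaussian distribution of both speech and noise signals, [3] provides an analytical solution of (4) for the likelihood of speech signal presence in frequency bin $k$ of audio frame $n$:

$$\Lambda_k = \frac{1}{1 + \xi_k} \exp\!\left(\frac{\gamma_k\,\xi_k}{1 + \xi_k}\right), \tag{5}$$

where $\xi_k = \lambda_S/\lambda_N$ and $\gamma_k = |X_k|^2/\lambda_N$ are the prior and posterior SNRs respectively, $\lambda_S = \sigma_s^2$ and $\lambda_N = \sigma_n^2$. The decision-directed approach [12] is used for estimation of the prior SNR:

$$\xi_k^{(n)} = \alpha\,\frac{\lambda_S^{(n-1)}}{\lambda_N^{(n-1)}} + (1 - \alpha)\,\max\!\left[0,\ \gamma_k^{(n)} - 1\right]. \tag{6}$$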
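Equations (3), (5), and (6) chain together into a per-bin computation. The following is a minimal sketch of that chain for a single frequency bin. The function names, the default `alpha`, and the example values are illustrative assumptions, not taken from the paper.

```python
import math


def likelihood_ratio(gamma, xi):
    """Gaussian-model likelihood ratio of equation (5).

    gamma is the posterior SNR, xi the prior SNR (both linear, not dB).
    """
    return math.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)


def prior_snr_decision_directed(lambda_s_prev, lambda_n_prev, gamma, alpha=0.98):
    """Decision-directed prior SNR estimate of equation (6).

    alpha is a smoothing constant; 0.98 is an assumed default inside the
    range quoted in the text.
    """
    return alpha * lambda_s_prev / lambda_n_prev + (1.0 - alpha) * max(0.0, gamma - 1.0)


def speech_presence_probability(lr, eps=1.0):
    """Speech presence probability of equation (3); eps = P(H1) / P(H0)."""
    return eps * lr / (1.0 + eps * lr)


# Illustrative values only: noise variance 1.0, observed bin power 4.0,
# previous speech variance estimate 2.0.
gamma = 4.0 / 1.0
xi = prior_snr_decision_directed(lambda_s_prev=2.0, lambda_n_prev=1.0, gamma=gamma)
lr = likelihood_ratio(gamma, xi)
p_speech = speech_presence_probability(lr)
```

A practical implementation would likely clamp `gamma` or work in the log domain, since the exponential in equation (5) overflows for very high posterior SNRs.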

Note that this approach partially utilises the fact that consecutive speech frames are correlated. Here $\alpha$ is a constant, typically in the range 0.95 to 0.98. In [3] smoothing of the estimated likelihood using a first-order HMM is also proposed. After the derivation, the smoothed likelihood for speech presence in the current frequency bin is estimated as:

$$\bar{\Lambda}_k^{(n)} = \frac{a_{01} + a_{11}\,\bar{\Lambda}_k^{(n-1)}}{a_{00} + a_{10}\,\bar{\Lambda}_k^{(n-1)}}\;\Lambda_k^{(n)}, \tag{7}$$

where $a_{01}$ and $a_{10}$ are the prior probabilities for changing the state. Then the probability of speech presence in the current frequency bin is:

$$P_k^{(n)}\!\left(H_1 \mid X_k^{(n)}\right) = \frac{\bar{\Lambda}_k^{(n)}}{1 + \bar{\Lambda}_k^{(n)}}. \tag{8}$$

Note that $\varepsilon$ from equation (3) conveniently cancels out.

To combine the likelihoods from all frequency bins into the likelihood of speech signal presence in the entire frame we can use the geometric mean or the arithmetic mean. The geometric mean assumes that the speech signal has energy in all frequency bins, i.e. it reflects the fact that speech is a wideband signal; but speech is also a sparse signal, and the absence of speech in several frequency bins will drive the likelihood very low. On the other hand, the arithmetic mean will give a high likelihood even if we have high energy in only a few frequency bins, which is definitely not speech. In [10] the authors propose using a weighted sum to combine the likelihoods from the frequency bins:

$$\Lambda^{(n)} = \beta \exp\!\left(\frac{1}{k_{end} - k_{beg}} \sum_{k=k_{beg}}^{k_{end}} \log \Lambda_k^{(n)}\right) + (1 - \beta)\,\frac{1}{k_{end} - k_{beg}} \sum_{k=k_{beg}}^{k_{end}} \Lambda_k^{(n)}. \tag{9}$$

Here the parameter $\beta$ is adjusted to achieve the best accuracy. Also note the implicit bandpass filtering obtained by processing only the frequency bins between $k_{beg}$ and $k_{end}$. We can apply likelihood smoothing in the same way as in equation (7) by introducing $b_{01}$ and $b_{10}$, which are the prior probabilities for switching the state at frame level. We can then compute the speech presence probability $P^{(n)}$ using equation (8). The binary flag $V^{(n)}$ for speech presence (1) or absence (0) can be obtained by comparing the likelihood $\Lambda^{(n)}$ or the speech presence probability $P^{(n)}$ with a fixed threshold $\eta$:

$$V^{(n)} = \begin{cases} 1 & \text{if } P^{(n)}(H_1) > \eta \\ 0 & \text{if } P^{(n)}(H_1) \le \eta. \end{cases} \tag{10}$$

For practical purposes a small hysteresis is added to prevent ringing of the flag when the probability is close to the threshold.
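The smoothing, band combination, and thresholding steps of equations (7), (9), and (10) can be sketched as below. This is an illustration under stated assumptions: the transition probabilities, `beta`, `eta`, and the hysteresis width `delta` are hypothetical placeholder values, and the band means are normalised by the number of bins in the band, which differs slightly from the $1/(k_{end} - k_{beg})$ factor written in equation (9).

```python
import math


def smooth_likelihood(lr, lr_prev_smoothed, a01=0.1, a10=0.1):
    """First-order HMM smoothing of equation (7).

    a01 and a10 are the state-change priors (assumed values here);
    a00 and a11 follow as their complements.
    """
    a00 = 1.0 - a01
    a11 = 1.0 - a10
    return (a01 + a11 * lr_prev_smoothed) / (a00 + a10 * lr_prev_smoothed) * lr


def combine_bins(lrs, k_beg, k_end, beta=0.5):
    """Weighted sum of geometric and arithmetic means over bins k_beg..k_end.

    All likelihoods in the band must be positive because of the logarithm.
    """
    band = lrs[k_beg:k_end + 1]
    count = len(band)  # number of bins in the band (assumption, see lead-in)
    geometric = math.exp(sum(math.log(v) for v in band) / count)
    arithmetic = sum(band) / count
    return beta * geometric + (1.0 - beta) * arithmetic


def flag_with_hysteresis(p, prev_flag, eta=0.5, delta=0.05):
    """Binary decision of equation (10) with a small hysteresis band.

    The threshold is lowered while the flag is on and raised while it is
    off, so probabilities hovering near eta do not toggle the flag.
    """
    threshold = eta - delta if prev_flag else eta + delta
    return 1 if p > threshold else 0


# Illustrative use on made-up per-bin likelihoods.
frame_lr = combine_bins([0.5, 2.0, 8.0, 2.0, 0.5], k_beg=1, k_end=3)
frame_prob = frame_lr / (1.0 + frame_lr)  # equation (8) applied at frame level
flag = flag_with_hysteresis(frame_prob, prev_flag=0)
```

The hysteresis here is one simple way to realise the behaviour described in the text; the paper does not specify its exact form.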
At the end of processing of each frame we can update the noise model:

$$\lambda_N^{(n)} = \lambda_N^{(n-1)} + P\!\left(H_0 \mid X_k^{(n)}\right) \frac{T}{\tau_p} \left(\left|X_k^{(n)}\right|^2 - \lambda_N^{(n-1)}\right), \tag{11}$$

where $P(H_0 \mid X_k^{(n)}) = 1 - P(H_1 \mid X_k^{(n)})$ is the speech absence probability, $T$ is the frame shift time, and $\tau_p$ is the time constant for updating the model. The introduced VAD parameters (time constants, prior probabilities, etc.) can be optimized for a given dataset using the approach described in [4].

III. DEEP LEARNING APPROACH

The challenges for VAD increase with the proliferation of mobile devices and infotainment systems in cars. In both cases the noise levels are higher and the SNRs are lower. Far-field sound capture also adds higher reverberation, compared to close-talking microphones in smartphone devices. The consumer of the enhanced speech shifts from telecommunications to speech recognition. While taking a phone call, people try to find a quieter place, simply because there is a limit to how much power we can put into the headphone before starting to harm the user's hearing. In the case of a speech-enabled dialog system for a mobile device, the user speaks and the system typically responds by showing the results on the screen. This lifts the limit on how noisy the conditions can be in which the system should work: the user will be happy if the system can understand when asked in a normal tone in a noisy stadium.

In general, the speech presence probability is a function of the magnitudes of the frequency bins in the current and several previous audio frames. The question is whether a neural network can learn that function without assumptions about the statistical distribution of the speech and noise signals, without explicit handling of the temporal and spectral contexts, and with the added capability of distinguishing between speech and fast-varying non-stationary noises. In this paper, we propose to use a fully connected deep neural network (DNN) as shown in Figure 3. The input features are the magnitudes of all frequency bins in the current and several previous audio frames, forming the current segment.
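The causal segment construction described above can be sketched as follows. The seven-frame context matches the segment size reported in the experimental section. The function name and the choice to repeat the first frame where history is missing are assumptions made for illustration.

```python
def causal_segments(frames, context=7):
    """Build one input vector per frame from the current and past frames only.

    frames is a list of per-frame magnitude lists, each of length K.
    Segment n concatenates frames n-context+1 .. n. Where fewer than
    `context` past frames exist, the first frame is repeated (an assumed
    padding choice), so no future frame is ever used.
    """
    segments = []
    for n in range(len(frames)):
        segment = []
        for m in range(n - context + 1, n + 1):
            segment.extend(frames[max(m, 0)])
        segments.append(segment)
    return segments


# Tiny example: 5 frames with 2 bins each and a 3-frame context.
toy_frames = [[float(n), float(n)] for n in range(5)]
toy_segments = causal_segments(toy_frames, context=3)
```

With 256 bins per frame and a seven-frame context, each segment has 7 x 256 = 1792 magnitudes, which matches the input feature vector size given later in the paper.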
The output is the probabilities of speech presence in each frequency bin and the probability of speech presence in the entire frame. The performance of the DNN will be evaluated both against seen noise (i.e. this type of noise is present in the training set) and unseen noise (this type of noise has not been presented in the training set).

[Fig. 3. Proposed DNN-based VAD.]

IV. DATASET AND EVALUATION

A. Dataset

A multi-condition training corpus with different noise types, signal-to-noise ratios (SNRs), and reverberant properties was created based on the TIMIT training set [13]. We used a collection of 100 different noise signals from [14], which includes a variety of noise types (crowd noise, traffic and car noise, etc.). We also used a set of 6 different room impulse responses (RIRs) recorded at multiple distances from 1 to 4 meters in a room with a reverberation time $T_{60}$ of approximately 300 ms.

The training corpus was created as follows. Speech and noise sound pressure levels (SPL) in the room were assumed to be normally distributed with means $\mu_s = 60$ dBA SPL and $\mu_n = 55$ dBA SPL, and standard deviations $\sigma_s = 8$ dB and $\sigma_n = 1$ dB respectively. An utterance is randomly selected from the TIMIT training set and scaled to a level that is randomly selected according to the assumed distribution of speech levels. Similarly, a randomly selected signal from the noise dataset is scaled to a level chosen from the noise power distribution. Correction for the Lombard effect is performed on the speech signal level. The scaled speech signal is convolved with a randomly selected RIR, and the scaled noise is added to the result. This noisy signal is then synchronized with the clean speech signal to remove the delay introduced by the RIR. Such temporal alignment of the noisy and clean reference signals is necessary so that the subsequent framing and feature extraction steps produce feature pairs that correspond to the same section of the speech signal. The final SNRs were limited to [-5, 30] dB. This procedure is used to create a dataset of clean/noisy pairs for training.

In a similar fashion, we generated two different test datasets based on the TIMIT test set, each containing 2 utterances. The first test dataset uses noise signals from the noise corpus used to generate the training dataset, and the second uses a completely disjoint set of noise samples from the NOISEX-92 corpus [15]. We call these the seen and unseen test datasets, respectively.

TABLE I. VAD CLASSIFICATION ERRORS

Dataset             Per bin   Per frame
Baseline, average   .46328    .68949
Development         .3268     .24669
Test                .31418    .41633
Test, unseen        .3256     .4461

B. Labeling and Evaluation

The ground truth is binary (speech signal present or absent) and was obtained by running a simple threshold-based VAD on the clean speech utterances.
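A threshold-based labeling of clean speech of the kind described here could look like the sketch below. The paper does not give the thresholds or the exact per-frame rule, so the mean-power test for frames, the function name, and all numeric values are hypothetical.

```python
def label_clean_speech(clean_mags, bin_threshold, frame_threshold):
    """Derive binary ground-truth labels from clean-speech magnitudes.

    clean_mags is a list of frames, each a list of K bin magnitudes.
    A bin is labeled 1 when its magnitude exceeds bin_threshold; a frame
    is labeled 1 when its mean power exceeds frame_threshold. Both rules
    are assumptions made for illustration.
    """
    bin_labels = [
        [1 if m > bin_threshold else 0 for m in frame] for frame in clean_mags
    ]
    frame_labels = [
        1 if sum(m * m for m in frame) / len(frame) > frame_threshold else 0
        for frame in clean_mags
    ]
    return bin_labels, frame_labels


# Three toy frames with two bins each: silence, one active bin, full speech.
toy_clean = [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
toy_bin_labels, toy_frame_labels = label_clean_speech(
    toy_clean, bin_threshold=0.5, frame_threshold=0.4
)
```

Because the labels come from the clean signal, they stay valid at any SNR of the corresponding noisy signal, provided the two signals are time-aligned as described in the dataset section.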
TIMIT utterances are recorded with very high quality, and a simple comparison with a given threshold provides a flag for the presence or absence of a speech signal for both per-bin and per-frame labels. The evaluation criterion is the root-mean-squared (RMS) error between the VAD output and the ground truth obtained above:

$$E_f = \sqrt{\frac{1}{N} \sum_{n} \left(P^{(n)}(H_1) - G^{(n)}(H_1)\right)^2}. \tag{12}$$

Here $G^{(n)}(H_1)$ is the ground truth.

V. EXPERIMENTAL RESULTS

All of the voice and noise files were converted to a 16 kHz sampling rate. To convert from absolute sound pressure levels to the signal at the output of the ADC, the clipping level of the microphone was assumed to be 120 dB SPL, which is typical for most of the widely used MEMS microphones. We have generated 4 files for training, 2 files for testing, and 2 files for testing with unseen noises. The total duration of the dataset was 2 hours. The frame size was 512 samples, weighted with a Hann window before converting to the frequency domain. This results in 256 frequency bins for each audio frame. The frame shift was 256 samples (50%). The overlap-and-add procedure was used to synthesize the signal back to the time domain as described in [16]. Each segment consisted of seven frames, which means an input feature vector of 1792 magnitudes. The neural network had four hidden layers of 512 nodes each. The output layer had 257 neurons: one for the speech presence probability of the entire frame and 256 for the individual frequency bins. For training and evaluation we used the CNTK toolkit [17].

The VAD classification errors according to equation (12) are shown in Table I. Note that this is the RMS of the error, compared with a binary classifier. This means that a well-working VAD with an output probability of 0.1 when speech is not present and 0.9 when speech is present will have an error of 0.2. The baseline is the VAD described in Section II. As expected, the results against the test dataset degrade. There is also a noticeable degradation, but less than expected, in the results against the test dataset with unseen noise. The classification error as a function of the SNR is shown in Figure 4. The trends are consistent with the numerical results.

[Fig. 4. VAD error per frame.]

VI. CONCLUSIONS

In this paper, we proposed using a deep neural network to overcome shortcomings in the models used by the statistical VAD. We achieved a substantial reduction of the classification error for both seen and unseen noises. The reduction in performance against unseen noises was less than expected. As a reasonable next step, we consider experimenting with different neural networks, for example an RNN with LSTM for preserving the state and reducing the size of the input vector.

REFERENCES

[1] Recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications, 1997.
[2] J. Sohn and W. Sung, "A voice activity detection employing soft decision based noise spectrum adaptation," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998.
[3] J. Sohn, N. Kim, and W. Sung, "A statistical model based voice activity detector," IEEE Signal Processing Letters, vol. 6, pp. 1-3, 1999.
[4] Ivan Tashev, Andrew Lovitt, and Alex Acero, "Unified framework for single channel speech enhancement," in Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 2009.
[5] S. Gazor and W. Zhang, "Speech probability distribution," IEEE Signal Processing Letters, vol. 10, pp. 204-207, 2003.
[6] Ivan Tashev and Alex Acero, "Statistical modeling of the speech signal," in Proceedings of International Workshop on Acoustic, Echo, and Noise Control (IWAENC), 2010.
[7] R. Martin, "Speech enhancement using MMSE short time spectral estimation with Gamma distributed speech priors," in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002.
[8] T. Lotter, Speech Enhancement, chapter "Single- and Multi-Microphone Spectral Amplitude Estimation Using a Super-Gaussian Speech Model," pp. 67-95, Springer-Verlag, 2005.
[9] Ivan J. Tashev and Malcolm Slaney, "Data driven suppression rule for speech enhancement," in Information Theory and Applications Workshop, University of California in San Diego, 2013.
[10] Ivan Tashev, Andrew Lovitt, and Alex Acero, "Dual stage probabilistic voice activity detector," in NOISE-CON 2010 and 159th Meeting of the Acoustical Society of America, 2010.
[11] Ivan J. Tashev, "Offline voice activity detector using speech supergaussianity," in Information Theory and Applications Workshop, University of California in San Diego, 2015.
[12] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[13] John S. Garofolo et al., "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[14] G. Hu, "100 nonspeech environmental sounds," http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html, 2004.
[15] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.
[16] Ivan J. Tashev, Sound Capture and Processing: Practical Approaches, Wiley, July 2009.
[17] Dong Yu et al., "An introduction to computational networks and the computational network toolkit," Tech. Rep., Microsoft MSR-TR-2014-112, 2014.