Voice Activity Detection


Voice Activity Detection. Speech Processing. Tom Bäckström, Aalto University, October 2015.

Introduction. Voice activity detection (VAD), also called speech activity detection or speech detection, refers to a class of methods which detect whether a sound signal contains speech or not. A closely related and partly overlapping task is speech presence probability (SPP) estimation. Instead of a binary present/not-present decision, SPP gives the probability that the signal contains speech. A VAD can be derived from an SPP estimator by setting a threshold probability above which the signal is considered to contain speech.

Introduction. Voice activity detection is used as a pre-processing step for almost all other speech processing methods. In speech coding, it determines when speech transmission can be switched off, to reduce the amount of transmitted data. In speech recognition, it finds the parts of the signal that should be fed to the recognition engine; since recognition is a computationally complex operation, ignoring non-speech parts saves CPU power. In speech enhancement, where we want to reduce or remove noise from a speech signal, we can estimate the noise characteristics from the non-speech parts (learn/adapt) and remove noise from the speech parts (apply). VAD is thus used mostly as a resource-saving operation.

Low-noise VAD, trivial case. To introduce the basic vocabulary and methodology, consider a speaker speaking in an otherwise silent environment: when there is no speech, there is silence, so any signal activity indicates voice activity. The processing chain is: input signal → signal activity detection → thresholding → VAD decision. Signal activity can be measured, for example, by estimating the signal energy per frame, which gives the energy thresholding algorithm.
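The energy thresholding algorithm can be sketched in a few lines. This is an illustrative sketch only; the frame length, hop size, and threshold value are assumptions, not values from the lecture.

```python
import numpy as np

def frame_energies_db(signal, frame_len=400, hop=200):
    """Compute the framewise energy in dB of a 1-D signal."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # Small constant avoids log(0) on all-zero (silent) frames.
    energies = np.array([np.sum(f ** 2) + 1e-12 for f in frames])
    return 10.0 * np.log10(energies)

def energy_vad(signal, threshold_db=-40.0, frame_len=400, hop=200):
    """Label each frame as speech (True) when its energy exceeds the threshold."""
    return frame_energies_db(signal, frame_len, hop) > threshold_db
```

On a signal with a silent first half, the early frames fall below the threshold and the active frames exceed it.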

[Figure: Low-noise VAD, trivial case. Three panels over 0–2 s: the input speech signal (amplitude), the framewise energy (dB), and the resulting VAD decision (speech/non-speech).]

Low-noise VAD, trivial case. Clearly, energy thresholding works for speech in silent conditions: low-energy frames are correctly labeled as non-speech, and speech parts are likewise correctly labeled. It is, however, not trivial to choose an appropriate threshold level. A low threshold would ensure that all speech frames are correctly labeled, but we might then also label frames containing other sounds, such as breathing or background noise, as speech. A high threshold would ensure that all detected speech frames truly are speech frames, but then we could miss offsets (sounds which are trailing off), since they often have low energy. What strategy should we use to choose a threshold? What is the correct label for something like breathing noise? How do we actually measure the performance of a VAD?

VAD objective and performance measurement. The objective of a VAD implementation depends heavily on the application. In speech coding, our actual objective is to reduce bitrate without decreasing quality. We want to make sure that no speech frames are classified as background noise, because that would reduce quality, so we make a conservative estimate. In keyword spotting (think Siri or "OK Google"), we want to detect the start of a particular combination of words. The VAD's task is to avoid running a computationally expensive keyword spotting algorithm all the time. Missing one keyword is not so bad (the user would just try again), but if the VAD is too sensitive the application drains the battery, so we want to be sure that only keywords are spotted. In speech enhancement, we want to find non-speech regions in which to estimate the noise characteristics, so that we can remove anything which looks like noise. We want to be sure that there is no speech in the noise estimate, otherwise we would end up removing some speech and not only noise. What about speech recognition? What would the objective be there?

VAD objective and performance measurement. We need a set of performance measures which reflect these different objectives. Performance is often described by how often frames which do contain speech are labeled as speech or non-speech, and how often non-speech is labeled as speech or non-speech:
Speech input — labeled speech: true positive; labeled non-speech: false negative.
Non-speech input — labeled speech: false positive; labeled non-speech: true negative.
For speech coding, we want to keep the number of false negatives low; false positives are only of secondary importance. For keyword spotting, we want to keep the number of false positives low; false negatives are of secondary importance.
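The four outcome counts above can be computed directly from framewise labels. A minimal sketch, assuming boolean arrays with True meaning speech:

```python
import numpy as np

def vad_confusion(reference, decision):
    """Return (TP, FN, FP, TN) counts for framewise VAD decisions."""
    reference = np.asarray(reference, dtype=bool)
    decision = np.asarray(decision, dtype=bool)
    tp = int(np.sum(reference & decision))    # speech labeled speech
    fn = int(np.sum(reference & ~decision))   # speech labeled non-speech
    fp = int(np.sum(~reference & decision))   # non-speech labeled speech
    tn = int(np.sum(~reference & ~decision))  # non-speech labeled non-speech
    return tp, fn, fp, tn
```

An application then weighs these counts according to its objective, e.g. speech coding penalizes FN, keyword spotting penalizes FP.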

[Figure: Performance in noise with a −3 dB threshold. Panels over 0–2 s: the clean speech, noise, and noisy speech signals (SNR 0 dB); the framewise energy of the clean and noisy signals with their thresholds; and the VAD decisions for the clean and noisy signals, marking true/false positives and negatives.]

[Figure: Performance in noise with a −4 dB threshold. Same panels as above: signals (SNR 0 dB), framewise energies with thresholds, and VAD decisions for the clean and noisy signals, marking true/false positives and negatives.]

Post-processing. We already saw that speech coding wants to avoid false negatives (speech frames labeled as non-speech). Can we identify typical situations where false negatives occur? Offsets (where a phonation ends) often have low energy and are easily misclassified as non-speech. Stops have a silence in the middle of an utterance and are likewise easily misclassified. We should therefore be careful at the ends of phonations. We can use a hangover time, such that after a speech segment we keep the speech label for a while, until we are sure that speech has ended. For onsets (starts of phonemes) we usually want to be very sensitive. This gives a hysteresis rule: if any of the last K frames was identified as speech, the current frame is labeled as speech; otherwise non-speech.
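The hysteresis rule above can be sketched as a small post-processing function; K is an assumed tuning parameter:

```python
import numpy as np

def apply_hangover(raw_decision, k=5):
    """Hysteresis: frame n is speech if any of the raw decisions n-k+1..n was speech."""
    raw = np.asarray(raw_decision, dtype=bool)
    out = raw.copy()
    for n in range(len(raw)):
        start = max(0, n - k + 1)
        out[n] = raw[start:n + 1].any()  # look back over the last k frames
    return out
```

This extends every detected speech segment by up to k−1 frames, which protects low-energy offsets at the cost of a few extra false positives.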

[Figure: Post-processing with hangover. Panels over 0–2 s: the speech, noise, and noisy speech signals (SNR 0 dB); the framewise energies with thresholds; and the VAD decisions for the clean signal, the noisy signal, and the noisy signal with hangover, marking true/false positives and negatives.]

VAD for noisy speech Clean speech (absolutely no background noise) is very rare if not impossible to achieve. Real-life speech recordings practically always have varying amounts of background noise. Performance of energy thresholding decreases rapidly when the SNR drops. For example, weak offsets easily disappear in noise. We need more advanced VAD methods for noisy speech. We need to identify characteristics which differentiate between speech and noise. Measures for such characteristics are known as features.

Features. In VAD, features measure some property of the signal which gives an indication of whether the signal is speech or non-speech. Signal energy is naturally a useful feature, since the energy of speech varies a lot. Voiced sounds generally have energy mainly at low frequencies, so estimators of spectral tilt are often useful. For example, the zero-crossing rate (crossings per time unit) is high for high-frequency signals (noise) and low for low-frequency signals (voiced speech), so it can be used as a feature. Similarly, the lag-1 autocorrelation is high (close to +1) for low-frequency signals and low (close to −1) for high-frequency signals. Speech sounds can be efficiently modelled by linear prediction: if the prediction error is small, the signal is likely speech; if the prediction error is large, it is probably non-speech.
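Two of the spectral-tilt features mentioned above are easy to compute per frame; a minimal sketch:

```python
import numpy as np

def zero_crossings(frame):
    """Number of sign changes in the frame (high for noise-like signals)."""
    signs = np.sign(frame)
    return int(np.sum(signs[:-1] != signs[1:]))

def lag1_autocorr(frame):
    """Normalized lag-1 autocorrelation r1/r0, near +1 for low-frequency signals."""
    r0 = np.dot(frame, frame)
    r1 = np.dot(frame[:-1], frame[1:])
    return r1 / r0 if r0 > 0 else 0.0
```

A slowly varying (low-frequency) frame gives few zero crossings and r1/r0 near +1; a frame alternating sign every sample gives the maximum crossing count and r1/r0 near −1.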

Features. Voiced speech has, by definition, a prominent pitch. If we can identify a prominent pitch in the range 80 Hz ... 450 Hz, the signal is likely voiced speech. Speech information is described effectively by the spectral envelope. MFCCs can be used as a description of envelope information and are thus a useful set of features. Linear prediction parameters (especially the prediction residual) also describe envelope information and can likewise be used as a feature set. Speech features vary rapidly and frequently. By looking at the rate of change Δ_k = f_{k+1} − f_k of another feature f_k, we obtain information about the rate of change of the signal (an estimate of the derivative). Likewise, we can look at the second difference ΔΔ_k = Δ_{k+1} − Δ_k (an estimate of the second derivative). These first- and second-order differences can be used as features; they are known as Δ- and ΔΔ-features.
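The Δ- and ΔΔ-features are plain finite differences of a framewise feature track; a sketch (real systems often use a smoothed regression over several frames instead of a two-point difference):

```python
import numpy as np

def delta(track):
    """First difference: delta[k] = track[k+1] - track[k]."""
    track = np.asarray(track, dtype=float)
    return track[1:] - track[:-1]

def delta_delta(track):
    """Second difference of the feature track (difference of the differences)."""
    return delta(delta(track))
```

Applied to, say, a framewise energy track, these features highlight transients such as onsets and offsets.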

[Figure: Feature tracks over roughly 180 frames of a speech signal: the signal energy, the normalized correlation r_1/r_0, the fundamental frequency F_0, and the cepstral peak size C_max/C_0.]

[Figure: The corresponding Δ-features over the same frames: Δ-energy, Δ-correlation r_1/r_0, Δ-fundamental frequency F_0, and Δ-cepstral peak size C_max/C_0.]

Classifier. We have collected a set of indicators of speech, the features; the next step is to merge the information from these features into a decision between speech and non-speech. The processing chain is: input signal → analyse features 1…n → classifier → VAD decision. Classification is a generic problem with plenty of solutions, such as decision trees (low complexity, requires manual tuning), linear classifiers (relatively low complexity, trained from data), and advanced methods such as neural networks or Gaussian mixture models (high complexity, high accuracy, trained from data).

Classifier: decision trees. Make a sequence of binary decisions to decide whether the signal is speech or non-speech. For example, the tree might ask in sequence: Is the energy low? Is the signal mainly low-frequency? Is a pitch present? Was the previous frame speech? Each answer steers toward a final "speech" or "non-speech" decision.

Classifier: decision trees. Decision trees are very simple to implement, but hard-coded and not very flexible. Noise in one feature can cause us to follow the wrong path, so one noisy feature can break the whole decision tree. Each decision must be manually tuned, which is a lot of work, especially when the tree is large, and the structure and development become very complex as the number of features increases. Decision trees are suitable for low-complexity systems and low-noise scenarios where accuracy requirements are not high.

Linear classifier. Instead of manually tuned binary decisions, can we use observed data to make a statistical estimate? Using training data would automate the tuning of the model, and accuracy can be improved by adding more data. By replacing binary decisions with continuous ones, tendencies across several features can jointly improve accuracy. Linear classifiers make the decision as a weighted sum of the features. Let ξ_k be the features; the decision variable is then η = Σ_k ω_k ξ_k, where the ω_k are scalar weights. The objective is to find weights ω_k such that η = −1 for non-speech and η = +1 for speech.

[Diagram: Linear classifier. Input signal → analyse features 1…n → multiply by weights w_1…w_n → sum Σ → thresholding → VAD decision.]

Linear classifier. We then need a method for choosing optimal weights ω_k. The first step is to define an objective function which we can minimize; a good starting point is the classification error. If η is the desired class for a frame and our classifier gives η̂, then the classification error is ν² = (η − η̂)². By minimizing the classification error, we can determine the optimal parameters ω_k. Let x_k be the vector of all features for frame k, and X = [x_0, x_1, …] a matrix with the features of all frames as columns. The classifier output for a single frame is then η̂_k = x_k^T w, and for all frames the vector ŷ = X^T w, where w is the vector of weights ω_k. With y the vector of desired classes, the sum of classification errors over all frames is the norm ‖y − ŷ‖².

Linear classifier, a bit of math. The minimum of the classification error ‖y − ŷ‖² can be found by setting the partial derivative to zero: 0 = ∂/∂w ‖y − ŷ‖² = ∂/∂w ‖y − X^T w‖² = ∂/∂w (y^T y + w^T X X^T w − 2 w^T X y) = 2 X X^T w − 2 X y. The solution is given by the Moore–Penrose pseudo-inverse, w = (X X^T)^{-1} X y =: X^† y. Note: this is a very common mathematical approach for solving problems in speech processing, so it is much more important and broadly applicable than VAD alone.
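The least-squares solution above translates directly into code. A sketch, with features as the columns of a d × N matrix X and targets y in {−1, +1}:

```python
import numpy as np

def train_linear_vad(X, y):
    """Least-squares weights: solve (X X^T) w = X y, i.e. w = X^† y."""
    return np.linalg.solve(X @ X.T, X @ y)

def classify(X, w, threshold=0.0):
    """Apply the classifier eta = X^T w and threshold it to speech/non-speech."""
    return X.T @ w > threshold
```

On separable toy data the trained weights reproduce the target labels exactly; in practice one would use `np.linalg.lstsq` for robustness when X X^T is near-singular.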

Linear classifier: pre-whitening (advanced topic). If the ranges of values of different features are very different, we run into problems: a feature with large magnitude will overrun weaker ones, even if the large one is uninformative. The range (mean and variance) of the features therefore needs to be normalized; correlations between features are also undesirable. The first step is removal of the mean, x̄ = x − E[x], where E[x] ≈ (1/N) Σ_k x_k and N is the number of frames. The covariance of the features is then C = E[x̄ x̄^T] ≈ (1/N) X X^T, where X now contains the zero-mean features. With the eigenvalue decomposition C = V^T D V, we can define the pre-whitening transform A = D^{−1/2} V and x′ = A x̄. The covariance of the transformed vector is E[x′ (x′)^T] = A E[x̄ x̄^T] A^T = A C A^T = D^{−1/2} V V^T D V V^T D^{−1/2} = I. That is, x′ has uncorrelated components with equal variance and zero mean.
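The whitening steps above can be sketched as follows (note: `np.linalg.eigh` returns the decomposition as C = V diag(d) V^T with eigenvectors in the columns of V, which matches the slide's convention up to a transpose):

```python
import numpy as np

def prewhiten(X):
    """Whiten features. X is d x N, one feature vector per column.
    Returns the whitened features, the transform A, and the mean."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                          # remove the mean
    C = (Xc @ Xc.T) / X.shape[1]           # covariance estimate
    d, V = np.linalg.eigh(C)               # C = V diag(d) V^T
    A = np.diag(1.0 / np.sqrt(d)) @ V.T    # pre-whitening transform
    return A @ Xc, A, mean
```

For a sketch we assume all eigenvalues are strictly positive; a robust implementation would floor or discard near-zero eigenvalues before inverting them.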

[Figure: The twelve feature tracks and the target output over 0–2 s after normalization of mean and variance, i.e. (features − mean)/standard deviation.]

Linear classifier: pre-whitening (advanced topic). What you need to know is that pre-whitening is a pre-processing step applied before training w. We thus train the classifier on the modified vectors x′ = A(x − E[x]) to obtain the weights w. The classifier with pre-whitening is ν = w^T x′ = w^T A(x − E[x]) = ŵ^T (x − E[x]), where ŵ = A^T w. In other words, the pre-whitening can be folded into the weights, so no additional complexity is introduced beyond removal of the mean (which is trivial).

[Figure: The whitened feature tracks and the target output over 0–2 s.]

[Figure: Post-processing of a linear-classifier VAD on noisy speech over 0–2 s: the target labels, the raw classifier output, the thresholded output, and the output with hangover.]

Linear classifier. Linear classifiers are only slightly more complex than decision trees, but much more accurate. The main complexity of a VAD lies in feature extraction anyway, so the difference in complexity between decision trees and linear classifiers is negligible. The main advantages of linear classifiers over decision trees are: (unbiased) we can use real data to train the model, so we can be certain that it corresponds to reality (no bias due to manual tuning); (robust) whereas noise in one feature can break a decision tree, linear classifiers merge information from all features, reducing the effect of noise.

Advanced classifiers. There is a large range of better and more complex classifiers in the general field of machine learning. Linear discriminant analysis (LDA) splits the feature space using hyperplanes. Gaussian mixture models (GMM) model the feature space as a sum of Gaussians. Neural networks (NN) are similar to linear classifiers but add non-linear mappings and several layers of sums. Others include k-nearest neighbors (kNN), support vector machines (SVM), random forest classifiers, etc. These methods are in general more effective, but training and application are more complex. Try a simple approach first and see if it is good enough.

Speech Presence Probability. The output of the classifier is a continuous number, which is thresholded to obtain a decision. The continuous output contains a lot of information about the signal which is lost in thresholding: a high value means we are quite certain that the signal is speech, while a value near the threshold is relatively uncertain. We can therefore use the classifier output as an estimate of the probability that the signal is speech, i.e. as an estimator of speech presence probability. Subsequent applications can use this information as input to improve performance.
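One simple way to turn the continuous classifier output into a probability (this mapping is an assumption, not from the lecture) is a logistic squashing; the slope would in practice be calibrated on labeled data:

```python
import numpy as np

def speech_presence_probability(classifier_output, slope=1.0):
    """Map a real-valued classifier score to a probability in (0, 1)."""
    scores = np.asarray(classifier_output, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * scores))
```

A score of 0 (at the threshold) maps to probability 0.5, while strongly positive or negative scores map close to 1 or 0, matching the intuition that values near the threshold are uncertain.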

[Diagram: The same weighted-sum structure used two ways. Top: input signal → analyse features 1…n → weighted sum Σ → thresholding → VAD decision. Bottom: input signal → analyse features 1…n → weighted sum Σ → speech presence probability (no thresholding).]

[Figure: Speech presence probability as the classifier output before thresholding, for a linear-classifier VAD on noisy speech over 0–2 s: the target labels, the raw output, the thresholded output, and the output with hangover.]

Noise types. As noted before, VAD is trivial in noise-free scenarios. In practice, typical background noise types include office noise, car noise, and cafeteria (babble) noise. Clearly the problem is easier when the noise has a very different character from the speech signal: speech varies quickly, so stationary noises are easy; speech is dominated by low frequencies, so high-frequency noises are easy. The classic worst case is a competing (undesired) speaker, that is, someone else speaking in the background (babble noise). That case is difficult even for a human listener, so it truly is a very hard problem.

Conclusions. Voice activity detection refers to a class of methods which attempt to determine whether a signal is speech or non-speech. In a noise-free scenario the task is trivial, but such a scenario is not realistic. The basic structure of the algorithms is: 1. Calculate a set of features from the signal, designed to capture properties which differentiate speech from non-speech. 2. Merge the information from the features in a classifier, which returns the likelihood that the signal is speech. 3. Threshold the classifier output to decide whether the signal is speech or not. VADs are used as a low-complexity pre-processing method, to save resources (e.g. computation or bitrate) in the main task.