Can binary masks improve intelligibility?


Can binary masks improve intelligibility?
Mike Brookes (Imperial College London) & Mark Huckvale (University College London)

Apparently so...

How does it work?

[Figure: time-frequency grid of local SNR]

- e_s = speech energy, e_n = noise energy, w() = frequency weighting
- F() is some monotonic function
- Intelligibility index is increased if attenuation is applied in each cell where e_n > e_s
- i.e. where local SNR < 0 dB
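The masking rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `binary_mask` and `apply_mask`, the nested-list layout (one row per time frame, one column per frequency band) and the toy energies are all assumptions made for the example.

```python
def binary_mask(speech_energy, noise_energy):
    """Return 1 where speech energy exceeds noise energy (local SNR > 0 dB), else 0."""
    return [
        [1 if e_s > e_n else 0 for e_s, e_n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]


def apply_mask(mixture_energy, mask):
    """Attenuate (here: zero) every cell of the mixture that the mask rejects."""
    return [
        [m * e for m, e in zip(m_row, e_row)]
        for m_row, e_row in zip(mask, mixture_energy)
    ]


# Hypothetical toy grid: 2 time frames x 3 frequency bands.
speech = [[4.0, 0.5, 2.0], [0.1, 3.0, 0.2]]
noise = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mixture = [[s + n for s, n in zip(s_row, n_row)]
           for s_row, n_row in zip(speech, noise)]

mask = binary_mask(speech, noise)
masked = apply_mask(mixture, mask)
```

Computing the mask this way needs the speech and noise energies separately, which is why a classifier is used to estimate it from the noisy signal alone.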

Use of classifier to estimate binary mask

Replication

Similarities
- IEEE sentences as training and testing materials
- Single male talker
- Babble and speech-shaped noise @ -5 dB SNR
- Signals at 12,000 samples/sec
- Acoustic features based on modulation spectrum (code provided by Kim)
- Feature vector incorporates time & frequency deltas
- SNR thresholds for constructing target mask on training data
- GMM classifier design, using full covariance
- Four GMMs to classify feature vectors, based on division of training vectors into groups by SNR

Differences
- We used a different, British English, talker
- We used babble from NOISEX ROM

Thanks to: Toby Davies

Classifier performance (@ -5 dB SNR)

SNR > Cells %  | Speech-shaped noise   | Babble noise
               | Hits | False alarms   | Hits | False alarms
Kim et al      | 88.3 | 9.5            | 87.  | 4.5
Ours           | 55.2 | 5.             | 5.6  | 5.
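The hit and false-alarm percentages in the table compare an estimated mask against the target mask cell by cell. A minimal sketch of that bookkeeping follows; the function name `hit_false_alarm_rates` and the toy masks are assumptions for illustration, and the definitions used (hits over target-1 cells, false alarms over target-0 cells) are the usual ones rather than something stated on the slide.

```python
def hit_false_alarm_rates(target, estimated):
    """Return (hit %, false-alarm %) of an estimated binary mask against a target mask."""
    hits = misses = false_alarms = correct_rejections = 0
    for t_row, e_row in zip(target, estimated):
        for t, e in zip(t_row, e_row):
            if t and e:
                hits += 1                 # speech-dominated cell correctly kept
            elif t:
                misses += 1               # speech-dominated cell wrongly removed
            elif e:
                false_alarms += 1         # noise-dominated cell wrongly kept
            else:
                correct_rejections += 1   # noise-dominated cell correctly removed
    n_target = hits + misses
    n_reject = false_alarms + correct_rejections
    hit_rate = 100.0 * hits / n_target if n_target else 0.0
    fa_rate = 100.0 * false_alarms / n_reject if n_reject else 0.0
    return hit_rate, fa_rate


# Hypothetical 2 x 4 masks.
target_mask = [[1, 1, 0, 0], [1, 0, 0, 1]]
estimated_mask = [[1, 0, 0, 1], [1, 0, 0, 1]]
hit_rate, fa_rate = hit_false_alarm_rates(target_mask, estimated_mask)
```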

Intelligibility performance (@ -5 dB SNR)

Words %    | Speech-shaped noise       | Babble noise
           | No proc. | Proc. | Ideal  | No proc. | Proc. | Ideal
Kim et al  | 45       | 87    | 92     | 9        | 85    | 92
Ours       | 49       | 2     | 77     | 54       | 5     | 85

[Figure: Binary Mask Enhancement, LTASS noise at -5 dB. Panels: Recognised Mask, Ideal Mask]

[Figure: Binary Mask Enhancement, Babble noise at -5 dB. Panels: Recognised Mask, Ideal Mask]

What is going on?

- There are a number of arbitrary parameter settings in Kim et al (2009)
  - Sampling rate, window size, number of channels
  - Down-sampling of modulation spectrum
  - SNR thresholds for binary mask choice
- These may have become optimised for the particular data set they used
- Overall performance may be very sensitive to small changes in system design
- We need to investigate and understand details of the algorithm... over to Mike

What is the perfect binary mask?

- Original idea [Wang 2005]: select Time-Frequency (TF) cells with $S(t,f) - N(t,f) > L$, where S and N are speech and noise power spectral densities in dB and L is a threshold ("Local Criterion")
- Motivation: masking. Exclude TF cells with poor SNR since they give little information and may mask adjacent frequency bands
- However: if we plot intelligibility versus L for different SNR levels, the results do not match this theory
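The Local Criterion can be written as a one-line test on the local SNR in dB. The sketch below assumes strictly positive power values stored as nested lists (one row per frame); the function name `local_criterion_mask` and the toy values are hypothetical.

```python
import math


def local_criterion_mask(speech_power, noise_power, threshold_db=0.0):
    """Keep a TF cell when its local SNR, S(t,f) - N(t,f) in dB, exceeds L."""
    return [
        [1 if 10.0 * math.log10(s / n) > threshold_db else 0
         for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_power, noise_power)
    ]


# Hypothetical single frame with local SNRs of +10, 0 and -10 dB.
speech_power = [[10.0, 1.0, 0.1]]
noise_power = [[1.0, 1.0, 1.0]]

mask_l0 = local_criterion_mask(speech_power, noise_power, threshold_db=0.0)
mask_lm6 = local_criterion_mask(speech_power, noise_power, threshold_db=-6.0)
mask_lm12 = local_criterion_mask(speech_power, noise_power, threshold_db=-12.0)
```

Lowering L keeps more cells, so a very negative L passes the noisy signal almost unchanged, while a very positive L removes nearly everything.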

Intelligibility of Binary Masked Speech

- L= : OK @ > dB SNR
- L= 6: OK @ > 2 dB SNR
- SNR= 6 dB: OK @ 9 dB < L < -5 dB

Two independent sources of information [Kjems et al 2010]:
1. Noisy speech signal: SNR > & (L - SNR) <
2. Noise-vocoded signal: 3 dB < (L - SNR) < dB

The benefit of binary masking comes entirely from component 2 [Kjems et al, EUSIPCO-2010]

Noise-Vocoded component

- Define "Relative Criterion": $R = L - \mathrm{SNR} = L - (S(f) - N(f))$, where S(f) and N(f) are the long-term speech and noise spectra in dB
- Mask becomes: $S(t,f) - N(t,f) > R + S(f) - N(f)$
- Eliminate noise dependency by taking $N(t,f) = N(f)$, which gives $S(t,f) - S(f) > R$

[Block diagram: Target Binary Mask. Blocks: Clean Speech, TF analysis, Active Level, LTASS, dB, + R, Threshold, LTASS Noise, TF analysis, Mask, TF synth]
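The final mask condition depends only on the clean speech: each cell is compared with the long-term level of its own frequency band, offset by R. A minimal sketch follows, assuming the long-term spectrum is a plain time average of power per band (the slide's diagram also mentions an active-level stage, which is not modelled here). The function names and toy data are hypothetical.

```python
import math


def to_db(power):
    """Convert a strictly positive power value to dB."""
    return 10.0 * math.log10(power)


def target_binary_mask(speech_power, relative_criterion_db):
    """Keep cell (t, f) when S(t,f) - S(f) > R, all quantities in dB.

    speech_power[t][f] holds clean-speech power; S(f) is approximated by the
    time average of power in band f (an assumption of this sketch).
    """
    n_frames = len(speech_power)
    n_bands = len(speech_power[0])
    long_term_db = [
        to_db(sum(frame[f] for frame in speech_power) / n_frames)
        for f in range(n_bands)
    ]
    return [
        [1 if to_db(frame[f]) - long_term_db[f] > relative_criterion_db else 0
         for f in range(n_bands)]
        for frame in speech_power
    ]


# Hypothetical clean-speech power: 3 frames x 2 bands.
clean_power = [[1.0, 100.0], [100.0, 1.0], [1.0, 1.0]]

tbm_r0 = target_binary_mask(clean_power, relative_criterion_db=0.0)
tbm_low = target_binary_mask(clean_power, relative_criterion_db=-20.0)
tbm_high = target_binary_mask(clean_power, relative_criterion_db=10.0)
```

Because no noise term appears, the same mask can be applied to any noise with the long-term speech spectrum, which is what makes the result a noise-vocoded signal.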

Unimodal Psychometric Function

Modelling
- Product of two logistic curves
- Fixed guess/lapse rates
- 4 free parameters
- Modify to remove interaction between low and high slopes
  - No change if low and high slopes are equal
  - Negligible change if slopes are widely separated
  - Estimation is easier and more stable
- Use width @ 5% as a single figure of merit
  - Not always ideal

[Figure: fitted psychometric functions for UTBM on LTASS noise at two FFT lengths]

Psychometric Function Evaluation

- Digit triples: male + female
- Forced choice experiment
- Bayesian estimation of pdf of 4-D parameter vector
  - Update pdf after each trial
  - Select next R to give greatest expected entropy reduction
- Very quick convergence (e.g. 6 trials)

[Figure: UTBM on LTASS noise. Fitted psychometric function and posterior pdf of the parameters after a number of trials. Axes: normalized semi-width (dB), peak position (dB), ln up slope (ln prob/dB), ln down slope (ln prob/dB)]
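A unimodal psychometric function built as the product of two logistic curves can be sketched as below. This shows only the plain product form with fixed guess and lapse rates; the slide's modification that removes the interaction between the two slopes is not reproduced, and the parameter names and numeric values are assumptions chosen for illustration.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def unimodal_psychometric(r_db, low_threshold_db, high_threshold_db,
                          up_slope, down_slope,
                          guess_rate=0.0, lapse_rate=0.0):
    """Probability of a correct response as a function of relative criterion R.

    The four free parameters are the two thresholds and the two slopes;
    guess_rate and lapse_rate are held fixed.
    """
    rising = logistic(up_slope * (r_db - low_threshold_db))
    falling = logistic(-down_slope * (r_db - high_threshold_db))
    return guess_rate + (1.0 - guess_rate - lapse_rate) * rising * falling


# Hypothetical parameter values, not taken from the slides.
params = dict(low_threshold_db=-25.0, high_threshold_db=15.0,
              up_slope=1.0, down_slope=1.0,
              guess_rate=0.001, lapse_rate=0.02)

p_mid = unimodal_psychometric(-5.0, **params)    # well inside the plateau
p_far = unimodal_psychometric(-60.0, **params)   # far below the low threshold
p_edge = unimodal_psychometric(-25.0, **params)  # at the low threshold
```

The distance between the points where the rising and falling curves cross a fixed probability gives a single width figure of the kind used on the slide.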

Effect of FFT length

TF analysis/synthesis
- Hamming window of length T
- Freq resolution ~1.8/T
- Modulation bandwidth ~0.9/T

Observations
- @ T=40 ms, R can vary by 40 dB
- @ T=160 ms performance worse: too much smoothing in modulation domain?
- @ T=20 ms performance worse: cannot resolve formants?
- @ T=10 ms performance still OK

[Figure: psychometric functions for UTBM on LTASS noise, ov=4, one panel per FFT length]

T (ms) | f_res (Hz) | f_mod (Hz)
20     | 90         | < 45
10     | 180        | < 90
40     | 45         | < 22.5
160    | 11         | < 5.6

Non-uniform frequency resolution

- FFT length kept at 50 ms: f_res = 36 Hz, f_mod < 18 Hz
- Change mask resolution
  - Estimate mask in erb domain, .5, and 2 erb resolution

Observations
- Even at a resolution of .5 erb, the intelligibility is noticeably worse [surprising]
- Substantial degradation at erb resolution

[Figure: psychometric functions for UTBM on LTASS noise, fft=50 ms, one panel per mask resolution df (erb)]
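The trade-off between window length and resolution can be checked numerically. The sketch assumes the rules of thumb f_res ≈ 1.8/T and f_mod ≈ 0.9/T for a Hamming window of length T; they agree with the slide's listed pairs, for example 45 Hz and 22.5 Hz for a 40 ms window. The function name `hamming_resolution` is hypothetical.

```python
def hamming_resolution(window_length_s):
    """Return (frequency resolution, modulation bandwidth) in Hz for a Hamming
    window of the given length in seconds, using the 1.8/T and 0.9/T rules of thumb."""
    return 1.8 / window_length_s, 0.9 / window_length_s


# Tabulate a few window lengths in milliseconds.
table = {t_ms: hamming_resolution(t_ms / 1000.0) for t_ms in (10, 20, 40, 160)}
f_res_40, f_mod_40 = table[40]
```

Doubling T halves both figures, so a longer window resolves harmonics and formants better but smooths away faster modulations.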

Modulation Domain Model

- Intelligibility determined by accuracy of modulation domain spectrum [Taal et al, ICASSP 2010]
- Encompasses both regions of the graph within one concept
- Measure by correlation coefficient between clean and masked speech in 4ms window for each frequency bin
- Maximize by comparing with low pass filtered version of spectrogram

[Block diagram: Clean Speech, TF analysis, LP filter, dB, dB + R, Threshold]

Time Correlation based mask

- LP filter operates on power spectrum in time domain
  - Hamming window impulse response
  - LP cutoff = 0.9/T_LP
- Correlation coeff between clean and masked is max when R=

Observations
- Poor intelligibility compared to previous for short T_LP
- Very noisy: mask tries to match noise when no speech energy
  - Use noise floor threshold

[Figure: psychometric functions for band-pass TBM on LTASS noise, fft=4*5 ms, one panel per T_LP]

- T_LP =6 ms, F_mod > Hz
- T_LP = 400 ms, F_mod > 2.3 Hz
- T_LP = 100 ms, F_mod > 9 Hz
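One way to read the time-correlation mask is: smooth each band's power over time with a Hamming-shaped low-pass filter, then keep the cells whose level exceeds the smoothed level by more than R dB. The sketch below follows that reading for a single frequency band. It is an illustration under assumptions (kernel length given in frames, edge frames renormalised, no noise-floor threshold), and every name in it is hypothetical.

```python
import math


def hamming_kernel(length):
    """Hamming window of the given length, normalised to unit sum."""
    if length == 1:
        return [1.0]
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
              for n in range(length)]
    total = sum(window)
    return [w / total for w in window]


def lowpass_over_time(power, kernel):
    """Smooth one band's power track; edges use only the taps that overlap the data."""
    half = len(kernel) // 2
    smoothed = []
    for t in range(len(power)):
        acc = norm = 0.0
        for j, k in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(power):
                acc += k * power[idx]
                norm += k
        smoothed.append(acc / norm)
    return smoothed


def correlation_mask(power, kernel_length, relative_criterion_db=0.0):
    """Keep frames whose level exceeds the low-pass-filtered level by more than R dB."""
    smoothed = lowpass_over_time(power, hamming_kernel(kernel_length))
    return [
        1 if 10.0 * math.log10(p) - 10.0 * math.log10(s) > relative_criterion_db else 0
        for p, s in zip(power, smoothed)
    ]


# Hypothetical power track for one band: two short bursts on a flat background.
band_power = [1.0, 1.0, 10.0, 1.0, 1.0, 10.0, 1.0, 1.0]
band_mask = correlation_mask(band_power, kernel_length=5)
```

Subtracting a smoothed version in dB acts as a high-pass filter on the band's level, so the mask follows the modulations rather than the absolute level. In silent stretches it would follow small fluctuations instead, which is the behaviour the noise-floor threshold on the slide is meant to suppress.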

Time-Freq Correlation based Mask

- Seems reasonable to try matching modulation in both time and frequency
  - Apply LP filter in both directions
  - Fix T_LP = 800 ms, giving mod domain HP at 1.1 Hz
  - Vary filter width in frequency direction

Observations
- Makes rather little difference
- F_LP =2Hz gives some benefit

[Figure: psychometric functions for band-pass TBM on LTASS noise, fft=4*5 ms, lp=800 ms, one panel per F_LP]

Summary

- Intelligibility benefits arise from the noise vocoded component of the masked speech
- Rapid estimation of unimodal psychometric functions is possible
- Intelligibility of noise vocoded speech
  - Relative criterion can vary by ~40 dB without loss of intelligibility
  - FFT length can vary between 20 and 160 ms without loss of intelligibility
  - Uniform frequency resolution is better than non-uniform (erb)
- Maximizing correlation in modulation domain is equivalent to HP filtering the spectrogram (when R=)
  - Nice idea but little benefit
  - Seems logical to extend it to freq axis but gives small improvement

Can Binary Masks Improve Intelligibility?

- Replication of Kim et al (2009) shows mask enhancement is not straightforward to achieve
- Binary mask has two effects
  - Preserve speech information in noisy signal when SNR is good enough
  - Encode speech information in vocoded noise when SNR is poor
- The former is just like any enhancement algorithm
- The latter relies on a pattern recognition system
  - Which may perform badly at low SNR, just when it would be most useful