Can binary masks improve intelligibility?

Mike Brookes (Imperial College London) & Mark Huckvale (University College London)

Apparently so...

How does it work?

[Figure: time-frequency grid of local SNR]

- e_s = speech energy, e_n = noise energy, w() = frequency weighting
- F() is some monotonic function
- The index is increased if attenuation is applied in each cell where e_n > e_s
- i.e. where local SNR < 0 dB

Use of classifier to estimate binary mask

Replication

Similarities
- IEEE sentences as training and testing materials
- Single male talker
- Babble and speech-shaped noise @ -5 dB SNR
- Signals at 2, samples/sec
- Acoustic features based on modulation spectrum (code provided by Kim)
- Feature vector incorporates time & frequency deltas
- SNR thresholds for constructing target mask on training data
- GMM classifier design, using full covariance
- Four GMMs to classify feature vectors, based on division of training vectors into groups by SNR

Differences
- We used a different, British English, talker
- We used babble from NOISEX ROM

Thanks to: Toby Davies

Classifier performance (@ -5 dB SNR)

SNR > Cells %   Speech-shaped noise      Babble noise
                Hits    False alarms     Hits    False alarms
Kim et al       88.3    9.5              87.     4.5
Ours            55.2    5.               5.6     5.
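The hit and false-alarm percentages in the table can be computed by comparing an estimated mask with the ideal mask cell by cell. The sketch below assumes the conventional definitions (hits counted over ideal-mask 1-cells, false alarms over ideal-mask 0-cells); the slides do not spell these out, and the function name and toy masks are illustrative only.

```python
def hit_false_alarm_rates(ideal_mask, estimated_mask):
    """Return (hit %, false-alarm %) for an estimated binary mask.

    Both masks are lists of rows holding 0/1 values, with identical shape.
    Hit rate: share of ideal-mask 1-cells that the estimate also marks 1.
    False-alarm rate: share of ideal-mask 0-cells that the estimate marks 1.
    (Assumed definitions; not taken from the slides.)
    """
    hits = misses = false_alarms = correct_rejections = 0
    for ideal_row, est_row in zip(ideal_mask, estimated_mask):
        for ideal, est in zip(ideal_row, est_row):
            if ideal and est:
                hits += 1
            elif ideal and not est:
                misses += 1
            elif est:
                false_alarms += 1
            else:
                correct_rejections += 1
    speech_cells = hits + misses
    noise_cells = false_alarms + correct_rejections
    hit_rate = 100.0 * hits / speech_cells if speech_cells else 0.0
    fa_rate = 100.0 * false_alarms / noise_cells if noise_cells else 0.0
    return hit_rate, fa_rate


# Toy 2 x 4 masks (rows = frames, columns = frequency channels).
ideal = [[1, 1, 0, 0],
         [1, 0, 0, 1]]
estimated = [[1, 0, 0, 1],
             [1, 0, 0, 1]]
print(hit_false_alarm_rates(ideal, estimated))
```

A classifier can only help if its hit rate is well above its false-alarm rate, which is why the two figures are reported together.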

Intelligibility performance (@ -5 dB SNR)

Words %      Speech-shaped noise          Babble noise
             No proc.  Proc.  Ideal       No proc.  Proc.  Ideal
Kim et al    45        87     92          9         85     92
Ours         49        2      77          54        5      85

Binary Mask Enhancement: LTASS, -5 dB
[Figure: recognised mask and ideal mask]

Binary Mask Enhancement: Babble, -5 dB
[Figure: recognised mask and ideal mask]

What is going on?

- There are a number of arbitrary parameter settings in Kim et al (2009)
  - Sampling rate, window size, number of channels
  - Down-sampling of modulation spectrum
  - SNR thresholds for binary mask choice
- These may have become optimised for the particular data set they used
- Overall performance may be very sensitive to small changes in system design
- We need to investigate and understand the details of the algorithm... over to Mike

What is the perfect binary mask?

Original idea [Wang 2005]: select Time-Frequency (TF) cells with

$S(t,f) - N(t,f) > L$

where S and N are the speech and noise power spectral densities in dB and L is a threshold ("Local Criterion").

Motivation: masking
- Exclude TF cells with poor SNR, since they give little information and may mask adjacent frequency bands

However
- If we plot intelligibility versus L for different SNR levels, the results do not match this theory
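The local-criterion rule above can be sketched in a few lines. This is a minimal illustration, assuming the speech and noise powers are already available per TF cell as linear power values; the function names and the toy grids are hypothetical and not taken from the slides.

```python
import math

POWER_FLOOR = 1e-12  # assumed floor to avoid log10(0)


def power_db(power):
    """Convert a linear power value to dB, with a small floor."""
    return 10.0 * math.log10(max(power, POWER_FLOOR))


def local_criterion_mask(speech_power, noise_power, criterion_db=0.0):
    """Binary mask with 1 where S(t,f) - N(t,f) > L (S, N in dB), else 0.

    speech_power, noise_power: lists of frames, each a list of per-bin powers.
    criterion_db: the local criterion L in dB.
    """
    return [
        [1 if power_db(s) - power_db(n) > criterion_db else 0
         for s, n in zip(speech_frame, noise_frame)]
        for speech_frame, noise_frame in zip(speech_power, noise_power)
    ]


# Toy 2 frames x 2 bins; local SNRs are about +6, 0, -3 and +10 dB.
speech = [[4.0, 1.0], [0.5, 10.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
print(local_criterion_mask(speech, noise, criterion_db=0.0))
print(local_criterion_mask(speech, noise, criterion_db=8.0))
```

Raising L makes the mask sparser and lowering it lets more noise-dominated cells through, which is the trade-off the intelligibility-versus-L plots explore.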

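Intelligibility of Binary Masked Speech

[Figure: intelligibility versus local criterion L at several SNRs, annotated with the ranges of L and SNR over which intelligibility remains OK]

Two independent sources of information [Kjems et al 2010]:
1. Noisy speech signal: available when the SNR is high enough and (L - SNR) is low enough
2. Noise-vocoded signal: available when (L - SNR) lies within a bounded range

The benefit of binary masking comes entirely from component 2 [Kjems et al, EUSIPCO-2010]

Noise-Vocoded component

Define the "Relative Criterion":

$R = L - \mathrm{SNR} = L - (\bar{S}(f) - \bar{N}(f))$

where $\bar{S}(f)$ and $\bar{N}(f)$ are the long-term speech and noise spectra in dB.

Mask becomes:

$S(t,f) - N(t,f) > R + \bar{S}(f) - \bar{N}(f)$

Eliminate noise dependency by taking $N(t,f) = \bar{N}(f)$:

$S(t,f) - \bar{S}(f) > R$

Target Binary Mask
[Block diagram. Blocks: Clean Speech, TF analysis, Active Level, LTASS, dB, dB + R, Threshold, LTASS Noise, TF analysis, Mask, TF synth]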
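Because the target binary mask depends only on the clean speech and its long-term average spectrum, it can be computed without reference to the noise. The sketch below assumes the long-term spectrum is the plain time average of the power in each frequency bin; the active-level normalisation shown in the block diagram is omitted, and all names are illustrative.

```python
import math

POWER_FLOOR = 1e-12  # assumed floor to avoid log10(0)


def target_binary_mask(speech_power, relative_criterion_db):
    """Mask with 1 where S(t,f) - S_bar(f) > R, with both spectra in dB.

    speech_power: list of frames, each a list of per-bin linear powers.
    S_bar(f) is taken here as the time-averaged power in bin f (assumption).
    """
    num_frames = len(speech_power)
    num_bins = len(speech_power[0])
    ltass_db = []
    for f in range(num_bins):
        mean_power = sum(frame[f] for frame in speech_power) / num_frames
        ltass_db.append(10.0 * math.log10(max(mean_power, POWER_FLOOR)))
    mask = []
    for frame in speech_power:
        row = []
        for f in range(num_bins):
            cell_db = 10.0 * math.log10(max(frame[f], POWER_FLOOR))
            row.append(1 if cell_db - ltass_db[f] > relative_criterion_db else 0)
        mask.append(row)
    return mask


# Toy clean-speech power: 3 frames x 2 bins.
# Bin 0 has one strong frame; bin 1 is constant, so it never exceeds its mean.
clean = [[10.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
print(target_binary_mask(clean, relative_criterion_db=0.0))
print(target_binary_mask(clean, relative_criterion_db=-10.0))
```

In the slides the resulting mask gates LTASS noise rather than the noisy speech, which is what makes the output noise-vocoded.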

Unimodal Psychometric Function

Modelling
- Product of two logistic curves
- Fixed guess/lapse rates
- 4 free parameters
- Modify to remove interaction between low and high slopes
  - No change if low and high slopes are equal
  - Negligible change if slopes are widely separated
  - Estimation is easier and more stable
- Use width @ 5% as a single figure of merit
  - Not always ideal

[Figures: psychometric functions for UTBM on LTASS noise at two FFT lengths]

Psychometric Function Evaluation
- Digit triples: male + female
- Forced choice experiment
- Bayesian estimation of pdf of 4-D parameter vector
  - Update pdf after each trial
  - Select next R to give greatest expected entropy reduction
- Very quick convergence (e.g. 6 trials)

[Figures: UTBM on LTASS noise; estimated psychometric function after a trial, and the parameter pdf plotted over normalized semi-width (dB), peak position (dB), ln up slope (ln prob/dB) and ln down slope (ln prob/dB)]
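A unimodal psychometric function built as the product of two logistic curves can be sketched as follows. This is the plain product form only: the modification that removes the interaction between the two slopes is not given in the slides and is not reproduced here. The parametrisation (two edge positions and two slopes), the guess and lapse rates, and all numeric values are hypothetical.

```python
import math


def logistic(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def unimodal_psychometric(r_db, lower_db, upper_db, up_slope, down_slope,
                          guess=0.1, lapse=0.02):
    """Probability of a correct response at relative criterion r_db.

    Rising logistic centred on lower_db times falling logistic centred on
    upper_db, scaled between a fixed guess rate and (1 - lapse).
    The four free parameters are lower_db, upper_db, up_slope, down_slope.
    """
    rising = logistic(up_slope * (r_db - lower_db))
    falling = logistic(-down_slope * (r_db - upper_db))
    return guess + (1.0 - guess - lapse) * rising * falling


# Hypothetical parameters: intelligible roughly between -25 dB and +15 dB.
params = dict(lower_db=-25.0, upper_db=15.0, up_slope=0.5, down_slope=0.5)
for r in (-60, -25, -5, 15, 60):
    print(r, round(unimodal_psychometric(r, **params), 3))
```

The curve sits at the guess rate for very low and very high R and approaches 1 - lapse in between, so the distance between the two edges serves as a width-style figure of merit.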

Effect of FFT length

TF analysis/synthesis
- Hamming window of length T
- Freq resolution ~ 1.8/T
- Modulation bandwidth ~ 0.9/T

Observations
- @ T=40 ms, R can vary by 40 dB
- @ T=160 ms performance worse: too much smoothing in modulation domain?
- @ T=2 ms performance worse: cannot resolve formants?
- @ T=10 ms performance still OK

[Figures: psychometric functions for UTBM on LTASS noise, ov=4, one panel per FFT length]
- T=2 ms: f_res = 900 Hz, f_mod < 450 Hz
- T=10 ms: f_res = 180 Hz, f_mod < 90 Hz
- T=40 ms: f_res = 45 Hz, f_mod < 22.5 Hz
- T=160 ms: f_res = 11 Hz, f_mod < 5.6 Hz

Non-uniform frequency resolution
- FFT length kept at 50 ms: f_res = 36 Hz, f_mod < 18 Hz
- Change mask resolution
  - Estimate mask in erb domain: 0.5, 1 and 2 erb resolution

Observations
- Even at a resolution of 0.5 erb, the intelligibility is noticeably worse [surprising]
- Substantial degradation at 1 erb resolution

[Figures: psychometric functions for UTBM on LTASS noise, fft=50 ms, at four mask resolutions including 0.5, 1.0 and 2.0 erb]
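The resolution figures quoted for each FFT length follow directly from the two rules of thumb for a Hamming window. The short sketch below assumes the constants are 1.8 and 0.9 with T in seconds; the function name is illustrative.

```python
def hamming_resolution(window_ms):
    """Return (frequency resolution, modulation bandwidth) in Hz.

    Uses the rules of thumb f_res ~ 1.8 / T and f_mod ~ 0.9 / T for a
    Hamming window of length T seconds (constants assumed from the slides).
    """
    seconds = window_ms / 1000.0
    return 1.8 / seconds, 0.9 / seconds


for window_ms in (10, 40, 50, 160):
    f_res, f_mod = hamming_resolution(window_ms)
    print(f"T = {window_ms} ms: f_res = {f_res:.1f} Hz, f_mod < {f_mod:.1f} Hz")
```

Lengthening the window sharpens the frequency resolution but narrows the range of modulation frequencies the mask can follow, which is the trade-off behind the observations for long and short T.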

Modulation Domain Model
- Intelligibility determined by accuracy of modulation domain spectrum [Taal et al, ICASSP 2010]
- Encompasses both regions of the graph within one concept
- Measure by correlation coefficient between clean and masked speech in 4ms window for each frequency bin
- Maximize by comparing with low pass filtered version of spectrogram

[Block diagram. Blocks: Clean Speech, TF analysis, LP filter, dB, dB + R, Threshold]

Time Correlation based mask
- LP filter operates on power spectrum in time domain
  - Hamming window impulse response
  - LP cutoff = 0.9/T_LP
- Correlation coeff between clean and masked is max when R=0

Observations
- Poor intelligibility compared to previous for short T_LP
- Very noisy: mask tries to match noise when no speech energy
  - Use noise floor threshold

[Figures: psychometric functions for band-pass TBM on LTASS noise, fft=4*5 ms, for three LP filter lengths T_LP, including T_LP = 400 ms (F_mod > 2.3 Hz) and T_LP = 100 ms (F_mod > 9 Hz)]
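The time-correlation mask compares each TF cell with a low-pass filtered copy of the same frequency bin, so it keeps cells that stand out above their local average. The sketch below is one possible reading of that scheme: the window length in frames, the edge handling (renormalising over the samples available) and all names are assumptions, and the noise-floor threshold mentioned in the observations is left out.

```python
import math

POWER_FLOOR = 1e-12  # assumed floor to avoid log10(0)


def hamming(length):
    """Hamming window coefficients; length should be odd."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def smooth_over_time(track, window):
    """Zero-phase smoothing of one bin's power envelope with the window."""
    half = len(window) // 2
    smoothed = []
    for t in range(len(track)):
        acc = 0.0
        weight = 0.0
        for k, w in enumerate(window):
            idx = t + k - half
            if 0 <= idx < len(track):  # renormalise at the edges (assumption)
                acc += w * track[idx]
                weight += w
        smoothed.append(acc / weight)
    return smoothed


def time_correlation_mask(speech_power, window_frames, relative_criterion_db=0.0):
    """Mask with 1 where S(t,f) in dB exceeds its smoothed version by more than R."""
    window = hamming(window_frames)
    num_bins = len(speech_power[0])
    mask = [[0] * num_bins for _ in speech_power]
    for f in range(num_bins):
        track = [frame[f] for frame in speech_power]
        smoothed = smooth_over_time(track, window)
        for t, (s, lp) in enumerate(zip(track, smoothed)):
            s_db = 10.0 * math.log10(max(s, POWER_FLOOR))
            lp_db = 10.0 * math.log10(max(lp, POWER_FLOOR))
            if s_db - lp_db > relative_criterion_db:
                mask[t][f] = 1
    return mask


# One frequency bin with a single energy burst in the middle frame.
burst = [[1.0], [1.0], [100.0], [1.0], [1.0]]
print(time_correlation_mask(burst, window_frames=3, relative_criterion_db=0.5))
```

Subtracting the smoothed envelope in dB acts as a high-pass filter in the modulation domain, which is why each T_LP in the figures corresponds to a lower limit on F_mod.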

Time-Freq Correlation based Mask
- Seems reasonable to try matching modulation in both time and frequency
- Apply LP filter in both directions
  - Fix T_LP = 800 ms, giving mod domain HP at 1.1 Hz
  - Vary filter width in frequency direction

Observations
- Makes rather little difference
- F_LP =2Hz gives some benefit

[Figures: psychometric functions for band-pass TBM on LTASS noise, fft=4*5 ms, T_LP = 800 ms (F_mod > 1.1 Hz), for four values of F_LP]

Summary
- Intelligibility benefits arise from the noise vocoded component of the masked speech
- Rapid estimation of unimodal psychometric functions is possible
- Intelligibility of noise vocoded speech
  - Relative criterion can vary by ~40 dB without loss of intelligibility
  - FFT length can vary between 2 and 6 ms without loss of intelligibility
  - Uniform frequency resolution is better than non-uniform (erb)
- Maximizing correlation in modulation domain is equivalent to HP filtering the spectrogram (when R=0)
  - Nice idea but little benefit
  - Seems logical to extend it to freq axis but gives small improvement

Can Binary Masks Improve Intelligibility?
- Replication of Kim et al (2009) shows mask enhancement is not straightforward to achieve
- Binary mask has two effects
  - Preserve speech information in noisy signal when SNR good enough
  - Encode speech information in vocoded noise when SNR poor
- Former just like any enhancement algorithm
- Latter relies on pattern recognition system
  - Which may perform badly at low SNR, just when it would be most useful