The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments


1 The Munich 2011 CHiME Challenge Contribution: BLSTM-NMF Speech Enhancement and Recognition for Reverberated Multisource Environments
Felix Weninger, Jürgen Geiger, Martin Wöllmer, Björn Schuller, Gerhard Rigoll
Institute for Human-Machine Communication, Technische Universität München
September 1st, 2011, CHiME


3 Outline
Motivation
Our ASR Architectures:
  Speech Enhancement by Convolutive NMF
  BLSTM Speech Recognition
  Single- and Multi-Stream Recognisers
Development Results
Our Final Challenge Result
Outlook

4 ASR in Noisy Conditions
Noisy speech → Feature extractor → MFCC → HMM → Transcription

5 Solution 1: Front-End Enhancement
Noisy speech → Source separation → Enhanced speech → Feature extractor → MFCC → HMM → Transcription
+ Increases SNR
- Imperfect: noise suppression vs. information loss

6 Solution 2: Robust Back-Ends
Multi-condition training, MAP adaptation
Noisy speech → Feature extractor → MFCC → HMM → Transcription

7 Solution 2: Robust Back-Ends
Noisy speech → Feature extractor → MFCC → Multi-stream HMM → Transcription
BLSTM-RNN → Word prediction → Multi-stream HMM

8 Proposed ASR Architecture
Noisy speech → NMF → Enhanced speech → Feature extractor → MFCC → Multi-stream HMM → Transcription
BLSTM → Word prediction → Multi-stream HMM

9 Speech Enhancement: Convolutive NMF
Assumption of additive noise: the observed magnitude spectrogram is modelled as the sum of a speech and a noise spectrogram, each a convolution of bases with non-negative activations.
P = [value missing]; [value missing] ms frame size, 16 ms shift = 256 ms
Dictionaries ('bases') of speech and noise are computed from training data.

10 Convolutive Signal Model
Modelling of true speech spectrogram: (equation not preserved in this transcript)
Modelling of true noise spectrogram: (equation not preserved in this transcript)
R^(s) = R^(n) = 51 (102 NMF components in total)
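The two model equations did not survive on this slide. As a hedged reconstruction, the standard convolutive NMF formulation would read as follows. The symbols W (bases), the frame index p and the column-shift operator are assumptions taken from the general NMF literature, not from the slides; only H, Λ, P and R appear in the source.

```latex
% Assumed notation: W^{(\cdot)}(p) \in \mathbb{R}_+^{F \times R^{(\cdot)}} is the p-th frame of the bases,
% H^{(\cdot)} \in \mathbb{R}_+^{R^{(\cdot)} \times T} are the activations, and
% \overset{p\rightarrow}{H} shifts the columns of H by p positions to the right (zero fill).
\Lambda^{(s)} = \sum_{p=0}^{P-1} W^{(s)}(p)\, \overset{p\rightarrow}{H^{(s)}},
\qquad
\Lambda^{(n)} = \sum_{p=0}^{P-1} W^{(n)}(p)\, \overset{p\rightarrow}{H^{(n)}}
```

Under the additive-noise assumption, the observed magnitude spectrogram is then approximated by Λ^(s) + Λ^(n), which is the quantity that appears in the KL divergence on the next slide.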

11 Speech Enhancement: Convolutive NMF
Matrix formulation: determine H^(s), H^(n) by multiplicative updates that minimize the KL divergence d(V, Λ^(s) + Λ^(n)).
Estimate the speech spectrogram by soft masking.
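The procedure on this slide can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the openBliSSART implementation: the bases `W_s` and `W_n` are taken as given and fixed, the update is the textbook multiplicative rule for KL divergence applied to the convolutive model, and the soft mask is assumed to be the ratio Λ^(s) / (Λ^(s) + Λ^(n)). All function and variable names are hypothetical.

```python
import numpy as np

def shift_right(X, p):
    """Shift columns of X by p positions to the right, filling with zeros."""
    if p == 0:
        return X.copy()
    out = np.zeros_like(X)
    out[:, p:] = X[:, :-p]
    return out

def shift_left(X, p):
    """Shift columns of X by p positions to the left, filling with zeros."""
    if p == 0:
        return X.copy()
    out = np.zeros_like(X)
    out[:, :-p] = X[:, p:]
    return out

def reconstruct(W, H):
    """Convolutive model: sum over p of W[p] @ (H shifted right by p). W has shape (P, F, R)."""
    return sum(W[p] @ shift_right(H, p) for p in range(W.shape[0]))

def kl_divergence(V, L, eps=1e-12):
    """Generalised KL divergence d(V, L) summed over all time-frequency bins."""
    return float(np.sum(V * np.log((V + eps) / (L + eps)) - V + L))

def enhance(V, W_s, W_n, n_iter=50, eps=1e-12, seed=0):
    """Estimate activations for fixed speech/noise bases, then soft-mask V."""
    rng = np.random.default_rng(seed)
    W = np.concatenate([W_s, W_n], axis=2)      # joint dictionary, shape (P, F, R_s + R_n)
    P, _, R = W.shape
    H = rng.random((R, V.shape[1])) + 0.1       # strictly positive initial activations
    ones = np.ones_like(V)
    # Denominator of the update; constant because the bases are fixed.
    denom = sum(W[p].T @ shift_left(ones, p) for p in range(P)) + eps
    history = []
    for _ in range(n_iter):
        L = reconstruct(W, H) + eps
        history.append(kl_divergence(V, L))
        ratio = V / L
        numer = sum(W[p].T @ shift_left(ratio, p) for p in range(P))
        H *= numer / denom                      # multiplicative update keeps H non-negative
    R_s = W_s.shape[2]
    L_s = reconstruct(W_s, H[:R_s])
    L_n = reconstruct(W_n, H[R_s:])
    mask = L_s / (L_s + L_n + eps)              # assumed soft mask, values in [0, 1]
    return mask * V, H, history
```

Because the bases stay fixed, the model is linear in the activations, so only H is updated and the KL divergence recorded in `history` should not increase from one iteration to the next.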

12 Convolutive Speech and Noise Bases
Speaker-dependent speech bases: convolutive NMF on the training set for each speaker k and word w, from which the speech base is built.
General noise base: sub-sample the training noise, then build the base by convolutive NMF.

13 Back-End: Multi-stream Tandem BLSTM-HMM

14 Context Modelling in Neural Networks
MLP (feature frame stacking), RNN (persistent memory), LSTM-RNN, BLSTM-RNN (bidirectional context)
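For contrast with the recurrent models, feature frame stacking for an MLP can be sketched as follows. This is only an illustration of the general idea; the function name, the symmetric context window and the edge padding are assumptions, not details from the slides.

```python
import numpy as np

def stack_frames(features, context=2):
    """Concatenate each frame with its `context` left and right neighbours.

    features: array of shape (T, D); returns an array of shape (T, (2 * context + 1) * D).
    The first and last frames are repeated at the edges (an assumed padding choice).
    """
    T, _ = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
```

The window size is fixed by hand here, whereas a (B)LSTM network learns how much context to use during training.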

15 Word Predictions by BLSTM-RNNs
Bidirectionally context-sensitive prediction
Amount of context is learned automatically during training
Superior to (R)NN feature frame stacking [Woellmer, 2011]

16 BLSTM Training and Classification
Dimension:
  39 input units (one per feature)
  3 hidden layers per direction (78 / 150 / 51 LSTM units)
  51 output units (one per word)
Training:
  Frame-wise word targets by forced alignment
  Early stopping strategy (use best network on development set)
Classification:
  Input: (NMF-enhanced) speech
  Output: index of the output unit with the highest activation
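The two data-handling steps named on this slide, frame-wise targets from an alignment and the arg-max decision, can be sketched as below. The alignment format `(start_frame, end_frame, word_index)` and the marker `-1` for frames outside any word are hypothetical choices for illustration; the slides do not specify them.

```python
import numpy as np

def framewise_targets(segments, n_frames, unlabelled=-1):
    """Expand aligned word segments into one target index per frame.

    segments: iterable of (start_frame, end_frame, word_index), end exclusive (assumed format).
    """
    targets = np.full(n_frames, unlabelled, dtype=int)
    for start, end, word in segments:
        targets[start:end] = word
    return targets

def classify(activations):
    """Return, for each frame, the index of the output unit with the highest activation.

    activations: array of shape (T, n_words), e.g. n_words = 51 in the system described here.
    """
    return np.argmax(activations, axis=1)
```

The resulting index sequence is the discrete word prediction b_t that the multi-stream HMM consumes as its second stream.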

17 Multi-Stream Hidden Markov Modelling
GMM (M = 7 mixtures) for MFCCs x_t
CPT (conditional probability table) for the discrete BLSTM word prediction b_t
Mitigate BLSTM misclassifications by Viterbi decoding
HMM emission probability in state s_t: (equation not preserved in this transcript)
MFCC stream weight a = 1.3 (tuned on the development set)
Superior to GMM feature fusion [Woellmer, 2011]
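The emission equation is missing from this slide. A common multi-stream form raises each stream likelihood to its stream weight, which in the log domain becomes a weighted sum. The sketch below follows that form under explicit assumptions: diagonal covariances, and a weight for the BLSTM stream that defaults to 2 - a. Only a = 1.3, the GMM for x_t and the CPT for b_t come from the source; everything else, including the function names, is hypothetical.

```python
import numpy as np

def log_gmm(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = x - means
    log_norm = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                       + np.sum(diff ** 2 / variances, axis=1))
    comp = np.log(weights) + log_norm
    m = comp.max()                               # log-sum-exp for numerical stability
    return float(m + np.log(np.sum(np.exp(comp - m))))

def log_emission(x, b, weights, means, variances, cpt_row, a=1.3, w_b=None):
    """Weighted two-stream log emission score for one HMM state.

    cpt_row[b] is the state's probability of observing BLSTM prediction b.
    w_b defaults to 2 - a, which is an assumption, not a value from the slides.
    """
    if w_b is None:
        w_b = 2.0 - a
    return a * log_gmm(x, weights, means, variances) + w_b * float(np.log(cpt_row[b]))
```

Viterbi decoding over such scores lets the HMM override an occasional wrong BLSTM prediction when the MFCC stream and the state sequence disagree with it.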

18 Results [Development Set]
CHiME baseline: Noisy speech → Feature extractor → MFCC → HMM → Keywords
Table columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)

19 Results [Development Set]
With MAP speaker adaptation: Noisy speech → Feature extractor → MFCC → HMM (+ MAP) → Keywords
Table columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)

20 Results [Development Set]
With MAP and multi-condition training: Noisy speech → Feature extractor → MFCC → HMM (+ MAP + MCT) → Keywords
Table columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)
Noise-free training set overlaid with CHiME training noise
Select random segments to provide various SNRs
Include noise in MAP
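The overlay step used for multi-condition training can be sketched as follows. This is a minimal illustration assuming single-channel waveforms as NumPy arrays and a noise recording at least as long as the utterance; the function name and the returned SNR estimate are additions for illustration only.

```python
import numpy as np

def overlay_random_noise(clean, noise, rng):
    """Add a randomly chosen noise segment to a clean utterance without rescaling.

    Returns the noisy signal and the resulting SNR in dB. Because the segment is
    picked at random and left unscaled, the SNR varies from utterance to utterance.
    """
    start = rng.integers(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    noisy = clean + segment
    noise_energy = max(float(np.sum(segment ** 2)), 1e-12)  # guard against silent segments
    snr_db = 10.0 * np.log10(float(np.sum(clean ** 2)) / noise_energy)
    return noisy, snr_db
```

Calling this once per training utterance with `rng = np.random.default_rng(seed)` yields a noisy copy of the training set that covers a range of SNRs.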

21 Results [Development Set]
Multi-stream HMM recogniser: Noisy speech → Feature extractor → MFCC → MS-HMM (+ MAP + MCT) → Keywords
BLSTM (MCT) → Word prediction → MS-HMM
Table columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)

22 What about Speech Enhancement?

23 Results [Development Set]
Baseline recogniser: Noisy speech → NMF → Enhanced speech → Feature extractor → MFCC → HMM → Keywords
Table rows: w/o NMF, w/ NMF; columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)

24 Results [Development Set]
With MAP+MCT: Noisy speech → NMF → Enhanced speech → Feature extractor → MFCC → HMM (+ MAP + MCT) → Keywords
Table rows: w/o NMF, w/ NMF; columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)

25 Results [Development Set]
Multi-stream recogniser: Noisy speech → NMF → Enhanced speech → Feature extractor → MFCC → MS-HMM (+ MAP + MCT) → Keywords
BLSTM (+ MCT) → Word prediction → MS-HMM
Table rows: w/o NMF, w/ NMF; columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)

26 Noise-Adaptive Speech Enhancement
[Figure labels: noise dictionary, context noise, utterance]

27 Noise-Adaptive Speech Enhancement
[Figure labels: noise dictionary, context noise, T, utterance]
Replace T dictionary entries with
  a) minimum KL divergence
  b) maximum KL divergence
d(context noise, dictionary)
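One possible reading of this replacement rule is sketched below. The slide leaves several details open, so the sketch makes loud assumptions: dictionary entries are treated as single spectra (columns) rather than convolutive bases, the context noise is summarised by the mean of T entries derived from it, and divergences are computed between normalised spectra. The function names are hypothetical and this is not presented as the authors' procedure.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two spectra after normalising each to sum to one."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def adapt_dictionary(dictionary, context_entries, criterion="min"):
    """Replace T entries of a general noise dictionary with context-derived entries.

    dictionary: (F, R) general noise spectra; context_entries: (F, T) spectra obtained
    from the noise context around the utterance (how they are obtained is not covered here).
    criterion: "min" replaces the entries closest to the context noise, "max" the farthest.
    """
    T = context_entries.shape[1]
    reference = context_entries.mean(axis=1)     # assumed summary of the context noise
    d = np.array([kl(reference, dictionary[:, r]) for r in range(dictionary.shape[1])])
    order = np.argsort(d)
    chosen = order[:T] if criterion == "min" else order[-T:]
    adapted = dictionary.copy()
    adapted[:, chosen] = context_entries
    return adapted, chosen
```

With `criterion="min"` the dictionary keeps its diversity and swaps out entries that the context already resembles; with `criterion="max"` it discards the entries least like the current noise.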

28 Noise-Adaptive Speech Enhancement: Results [Development Set]
[Figure: keyword accuracy [%] against SNR [dB], including the average (avg), for the MAP+MCT recogniser. Legend: max, T=10; min, T=10; max, T=5; min, T=5; non-adaptive]

29 TUM Challenge Results [Test Set]
Multi-stream HMM recogniser, MCT + MAP
Table rows: w/o NMF, w/ NMF, w/ ANMF; columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)
[value missing] % accuracy in full realism
87.9% using oracle for VAD

30 Conclusions
Reduction of keyword error rate: 44.1% (baseline) → 15.6% (single-stream) → 12.7% (multi-stream)
Front-end enhancement and refined back-ends are complementary approaches to ASR robustness
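The relative improvements implied by these three error rates can be worked out directly. The percentages printed below are derived here from the slide's numbers and are not stated in the slides themselves.

```python
def relative_reduction(before, after):
    """Relative error-rate reduction in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before

baseline, single_stream, multi_stream = 44.1, 15.6, 12.7  # keyword error rates [%] from the slide

print(f"baseline -> single-stream: {relative_reduction(baseline, single_stream):.1f}%")
print(f"baseline -> multi-stream:  {relative_reduction(baseline, multi_stream):.1f}%")
print(f"single -> multi-stream:    {relative_reduction(single_stream, multi_stream):.1f}%")
```

This gives roughly 64.6%, 71.2% and 18.6% respectively, so the multi-stream back-end removes close to a fifth of the errors that remain after the single-stream system.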

31 Outlook
Speaker-dependent BLSTM: first results on test (non-adaptive NMF)
Table columns: -6 dB, -3 dB, 0 dB, 3 dB, 6 dB, 9 dB, Mean (values not preserved in this transcript)
Pure BLSTM modelling
Multi-stream modelling of (sparse) NMF activations
NMF dictionary optimization

32 Do it Yourself!
cNMF enhancement: openBliSSART [Weninger, 2011]
Feature extraction: openSMILE [Eyben, 2010]
Multi-stream HMM: HTK
BLSTM: implemented using RNNLIB by Alex Graves

33 Thank you.
