Discriminative Training for Automatic Speech Recognition

1 Discriminative Training for Automatic Speech Recognition
22nd April 2013
Advanced Signal Processing Seminar

2 Article
Heigold, G.; Ney, H.; Schluter, R.; Wiesler, S., Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 58-69, Nov.
Covered Topics
- Statistical Speech Recognition
- Discriminative Training Criteria
- Parameter Models
- Optimisation
- Implementation
- Experimental Results
- Summary and Outlook

3 4 Components of an LVCSR System from [6]

4 Features
Usable with discriminative training
- Short-term power spectrum
- Mel frequency cepstral coefficients (MFCC)
- Perceptual linear prediction (PLP)
- Further enhancement methods (LDA, ...)
Other approaches
- Feature extraction through neural networks

5 Problem
Given: sequence of feature vectors $x_1^T$
Find: sequence of words/phonemes $w_1^N$
Statistical Model
$$[w_1^N]_{\mathrm{opt}} = \operatorname*{argmax}_{w_1^N} \; p(w_1^N \mid x_1^T)$$

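6 Problem
Statistical Model cont.
$$
\begin{aligned}
[w_1^N]_{\mathrm{opt}} &= \operatorname*{argmax}_{w_1^N} \; p(w_1^N \mid x_1^T) \\
&= \operatorname*{argmax}_{w_1^N} \; \frac{p(x_1^T, w_1^N)}{\underbrace{p(x_1^T)}_{\text{indep. of } w_1^N}} \\
&= \operatorname*{argmax}_{w_1^N} \; p(x_1^T, w_1^N) \\
&= \operatorname*{argmax}_{w_1^N} \; p(x_1^T \mid w_1^N)\, p(w_1^N)
\end{aligned}
$$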

7 Problem
Statistical Model cont.
$$[w_1^N]_{\mathrm{opt}} = \operatorname*{argmax}_{w_1^N} \; \underbrace{p(x_1^T \mid w_1^N)}_{\text{Acoustic Model}} \; \underbrace{p(w_1^N)}_{\text{Language Model}}$$
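The decision rule above can be sketched in a few lines of Python, working in the log domain so that the product of acoustic and language model scores becomes a sum. The candidate list, the field names, and the scores are hypothetical and only illustrate the argmax; a real decoder searches a far larger hypothesis space.

```python
def decode(candidates):
    """Return the word sequence maximizing log p(x|w) + log p(w)."""
    best = max(candidates, key=lambda c: c["log_acoustic"] + c["log_lm"])
    return best["words"]

# Hypothetical scores for one utterance (illustrative numbers only).
candidates = [
    {"words": ("recognize", "speech"), "log_acoustic": -120.0, "log_lm": -4.0},
    {"words": ("wreck", "a", "nice", "beach"), "log_acoustic": -118.5, "log_lm": -9.0},
]
best = decode(candidates)
```

In this toy case the second hypothesis has the better acoustic score, but the language model tips the decision towards the first one.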

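8 Standard Acoustic Model
Gaussian Hidden Markov Model
$$
\begin{aligned}
p(x_1^T \mid w_1^N) &= \sum_{s_1^T \in \mathrm{HMM}(w_1^N)} p(x_1^T, s_1^T) \\
&= \sum_{s_1^T \in \mathrm{HMM}(w_1^N)} \prod_{t=1}^{T} p(x_t \mid s_t)\, p(s_t \mid s_{t-1}) \\
&= \sum_{s_1^T \in \mathrm{HMM}(w_1^N)} \prod_{t=1}^{T} p(s_t \mid s_{t-1}) \sum_{l=1}^{L_{s_t}} c_{s_t,l}\, \mathcal{N}(x_t \mid \mu_{s_t,l}, \Sigma_{s_t,l})
\end{aligned}
$$
Parameter set for the GHMM: transition probabilities $p(s_t \mid s_{t-1})$, mixture weights $c_{s,l}$, means $\mu_{s,l}$, and covariances $\Sigma_{s,l}$ (with $l = 1, \dots, L_s$), all collected in $\Lambda$.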
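The sum over all state sequences in the GHMM likelihood is usually evaluated with the forward algorithm rather than by enumeration. The following is a minimal sketch under simplifying assumptions that are not from the slides: one-dimensional features, a dense transition matrix given in the log domain, and `log_init` standing in for $p(s_1 \mid s_0)$. All function and argument names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_emission(x, mixture):
    """log sum_l c_l N(x | mu_l, var_l); mixture is a list of (c, mu, var)."""
    return logsumexp([math.log(c) + log_gauss(x, mu, var) for c, mu, var in mixture])

def log_likelihood(xs, log_init, log_trans, mixtures):
    """Forward algorithm: log of the sum over all state sequences."""
    n = len(mixtures)
    alpha = [log_init[s] + log_emission(xs[0], mixtures[s]) for s in range(n)]
    for x in xs[1:]:
        alpha = [
            logsumexp([alpha[r] + log_trans[r][s] for r in range(n)])
            + log_emission(x, mixtures[s])
            for s in range(n)
        ]
    return logsumexp(alpha)
```

Forbidden transitions can be encoded as `-math.inf` entries in `log_trans`; the cost is linear in the number of frames instead of exponential.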

9 Maximum Likelihood
Standard Approach
Find the most probable parameters for the model given the training data set $(x_1^T, w_1^N)$.
Training Criterion
$$F(\Lambda) = \log p_\Lambda(x_1^T, w_1^N)$$

10 Maximum Mutual Information (MMI)
Motivation
- Directly maximizes the posterior probability
- Also takes all competing sentences $\tilde{w}_1^N$ into account
Training Criterion
$$F(\Lambda) = \log p_\Lambda(w_1^N \mid x_1^T) = \log \frac{p_\Lambda(x_1^T, w_1^N)}{\sum_{\tilde{w}_1^N} p_\Lambda(x_1^T, \tilde{w}_1^N)}$$
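For a single utterance, the MMI criterion can be evaluated from the joint log-scores of the reference and its competitors. The sketch below assumes the competing sentences are approximated by a finite hypothesis list (for instance an N-best list) that includes the reference; the function names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mmi_criterion(log_joint, ref_index):
    """log p(w|x): reference log-score minus the log-sum over all hypotheses.

    log_joint[i] holds log p(x, w_i); ref_index marks the spoken sentence.
    """
    return log_joint[ref_index] - logsumexp(log_joint)
```

The value is at most 0 and approaches 0 as the reference dominates every competitor, which is exactly what the training pushes for.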

11 Maximum Mutual Information (MMI) MMI vs. ML from [1]

12 Minimum Classification Error (MCE)
Motivation
- Assumes that GMMs are not the true distribution
- Aims to minimize the classification error (WER)
- Derived by minimizing the expected loss
Training Criterion
$$F(\Lambda) = \sigma_\beta\!\left(\log \frac{p_\Lambda(x_1^T, w_1^N)}{\sum_{\tilde{w}_1^N \neq w_1^N} p_\Lambda(x_1^T, \tilde{w}_1^N)}\right)$$
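The MCE criterion smooths the 0/1 decision "is the reference ahead of all competitors" with a sigmoid. The sketch below assumes that $\sigma_\beta$ is the logistic function with slope $\beta > 0$ and that the competitors come from a finite hypothesis list; both are assumptions, and the names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sigmoid(z, beta=1.0):
    """Logistic function with slope beta, evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-beta * z))
    e = math.exp(beta * z)
    return e / (1.0 + e)

def mce_criterion(log_joint, ref_index, beta=1.0):
    """Sigmoid of the log-ratio between the reference and all competitors."""
    competitors = [v for i, v in enumerate(log_joint) if i != ref_index]
    return sigmoid(log_joint[ref_index] - logsumexp(competitors), beta)
```

Unlike MMI, the reference is excluded from the denominator, so the value crosses 0.5 exactly when the reference overtakes the combined competitors.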

13 Minimum Phone Error (MPE)
Motivation
- Similar motivation as for MCE
- Aims to minimize the phone error rate (Levenshtein distance)
- Hypotheses are weighted by their phone accuracy $A(\tilde{w}_1^N, w_1^N)$
Training Criterion
$$F(\Lambda) = \sum_{\tilde{w}_1^N} p_\Lambda(\tilde{w}_1^N \mid x_1^T)\, A(\tilde{w}_1^N, w_1^N)$$
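The MPE criterion is an expected accuracy under the model posterior. In the sketch below the accuracy is taken to be the number of reference phones minus the Levenshtein distance, and the hypothesis space is a finite list; both are simplifying assumptions for illustration, and all names are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def levenshtein(a, b):
    """Edit distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def phone_accuracy(hyp, ref):
    """Assumed accuracy measure: reference length minus edit distance."""
    return len(ref) - levenshtein(hyp, ref)

def mpe_criterion(hyps, log_joint, ref):
    """Posterior-weighted sum of phone accuracies over the hypothesis list."""
    log_norm = logsumexp(log_joint)
    return sum(
        math.exp(lj - log_norm) * phone_accuracy(h, ref)
        for h, lj in zip(hyps, log_joint)
    )
```

Hypotheses that differ from the reference in only one phone still contribute most of their accuracy, which is what distinguishes this criterion from the sentence-level MMI and MCE criteria.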

14 I-smoothing
Overfitting Problem
- Discriminative training criteria are prone to overfitting
- Especially critical when little training data is available
- Remedy: introduce a prior for each Gaussian based on the ML statistics
- Essential for MCE/MPE

15 Margin Term
Focus on the decision boundary
- Generalization is achieved through the training samples closest to the decision boundary (margin, cf. SVM)
- Focus more on these samples by adding a term to the training criterion: $\exp(-\rho\, A(\tilde{w}_1^N, w_1^N))$, with margin scale $\rho$
- Applied to MMI: equivalent to boosted MMI (BMMI)
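A margin term of this kind can be sketched on top of MMI by reweighting every hypothesis with its accuracy before normalizing. The sketch assumes the term has the form $\exp(-\rho A)$ with a margin scale $\rho \ge 0$ (an assumption about the exact form), applies it to a finite hypothesis list, and uses hypothetical names throughout.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def margin_mmi(log_joint, accuracies, ref_index, rho=1.0):
    """MMI with each hypothesis score shifted by -rho * accuracy.

    Hypotheses with low accuracy are penalized less than the reference,
    so they look stronger and the reference must win by a margin.
    """
    boosted = [lj - rho * a for lj, a in zip(log_joint, accuracies)]
    return boosted[ref_index] - logsumexp(boosted)
```

With `rho=0` this reduces to plain MMI; increasing `rho` lowers the criterion whenever an erroneous competitor is close to the reference.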

16 Language Model Scale
HMM
$$p(x_1^T \mid w_1^N) = \sum_{s_1^T \in \mathrm{HMM}(w_1^N)} p(x_1^T, s_1^T)^{1/\kappa}$$
Language Model
$$p(w_1^N) = p(w_1^N)^{1}$$
$\kappa$: language model scale

17 Feature Transform
Features containing more information
Each training criterion can also be applied in the feature space, for instance:
$$y_t = x_t + M h_t$$

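18 Comparison
Unified Training Criterion
$$F(\Lambda) = f\!\left(\frac{\sum_{\tilde{w}_1^N} p_\Lambda(x_1^T, \tilde{w}_1^N)\, A(\tilde{w}_1^N, w_1^N)}{\sum_{\tilde{w}_1^N} p_\Lambda(x_1^T, \tilde{w}_1^N)\, B(\tilde{w}_1^N, w_1^N)}\right)$$
MMI setup
$$f(x) = \log(x), \qquad A(\tilde{w}_1^N, w_1^N) = \delta(\tilde{w}_1^N, w_1^N), \qquad B(\tilde{w}_1^N, w_1^N) = 1$$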
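The unified criterion lends itself to a single function that takes $f$, $A$ and $B$ as arguments. The sketch below evaluates it over a finite hypothesis list (an assumption) and instantiates the MMI setup given on the slide; the function names are hypothetical.

```python
import math

def unified_criterion(hyps, log_joint, ref, f, A, B):
    """f(sum_w p(x,w) A(w,ref) / sum_w p(x,w) B(w,ref)) over a hypothesis list."""
    m = max(log_joint)  # common shift; cancels in the ratio, avoids underflow
    num = sum(math.exp(lj - m) * A(h, ref) for h, lj in zip(hyps, log_joint))
    den = sum(math.exp(lj - m) * B(h, ref) for h, lj in zip(hyps, log_joint))
    return f(num / den)

def mmi(hyps, log_joint, ref):
    """MMI setup: f = log, A = Kronecker delta, B = 1."""
    return unified_criterion(
        hyps, log_joint, ref,
        f=math.log,
        A=lambda h, r: 1.0 if h == r else 0.0,
        B=lambda h, r: 1.0,
    )
```

Swapping in the identity for `f` and a phone accuracy for `A` turns the same function into the expected-accuracy form of the MPE criterion.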

19 Comparison Binary Classification from [1]

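20 Extended Baum-Welch
Strong-sense auxiliary function
$$G(\lambda, \lambda') - G(\lambda', \lambda') \le F(\lambda) - F(\lambda')$$
Principle of the Expectation Maximization algorithm
Weak-sense auxiliary function
$$\left.\frac{\partial}{\partial \lambda} G(\lambda, \lambda')\right|_{\lambda = \lambda'} = \left.\frac{\partial}{\partial \lambda} F(\lambda)\right|_{\lambda = \lambda'}$$
Principle of the Extended Baum-Welch algorithm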

21 Rprop
Properties
- Only the sign of the partial derivatives is needed
- Separate step size for each parameter
- Simple heuristic
- Good alternative to EBW
- Needs roughly the same number of iterations as EBW to converge when started with a conservative initial step size
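The properties listed above can be made concrete with a small Rprop loop that maximizes a criterion. This is a sketch of one common variant (the step is skipped after a sign change); the growth and shrink factors 1.2 and 0.5 are the values conventionally used with Rprop, not values taken from the slides, and the function and parameter names are hypothetical.

```python
def rprop_maximize(grad, params, steps=200, init_step=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=1.0):
    """Maximize a function using only the signs of its partial derivatives."""
    params = list(params)
    step = [init_step] * len(params)   # separate step size per parameter
    prev = [0.0] * len(params)         # gradient from the previous iteration
    for _ in range(steps):
        g = list(grad(params))
        for i in range(len(params)):
            if g[i] * prev[i] > 0:     # same sign: accelerate
                step[i] = min(step[i] * eta_plus, step_max)
            elif g[i] * prev[i] < 0:   # sign change: overshoot, slow down
                step[i] = max(step[i] * eta_minus, step_min)
                g[i] = 0.0             # skip this update
            if g[i] > 0:
                params[i] += step[i]   # ascend: move along the gradient sign
            elif g[i] < 0:
                params[i] -= step[i]
            prev[i] = g[i]
    return params

# Toy criterion F(a, b) = -(a - 3)^2 - (b + 1)^2 with its maximum at (3, -1).
a, b = rprop_maximize(lambda p: [-2 * (p[0] - 3), -2 * (p[1] + 1)], [0.0, 0.0])
```

Because the gradient magnitude is ignored, badly scaled parameters are handled without tuning a global learning rate.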

22 Experimental Results EBW vs. Rprop from [1]

23 Experimental Results Training Criteria from [1]

24 Experimental Results Margin Term from [1]

25 References
[1] Heigold, G.; Ney, H.; Schluter, R.; Wiesler, S., Discriminative Training for Automatic Speech Recognition: Modeling, Criteria, Optimization, Implementation, and Performance, Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 58-69, Nov.
[2] Povey, D.; Woodland, P.C.; Gales, M.J.F., Discriminative MAP for acoustic model adaptation, Acoustics, Speech, and Signal Processing, Proceedings (ICASSP 03), IEEE International Conference on, vol. 1, pp. I-312 to I-315, 6-10 April 2003
[3] Povey, D.; Woodland, P.C., Minimum Phone Error and I-smoothing for improved discriminative training, Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1, pp. I-105 to I-108, May 2002
[4] Biing-Hwang Juang; Wu Hou; Chin-Hui Lee, Minimum classification error rate methods for speech recognition, Speech and Audio Processing, IEEE Transactions on, vol. 5, no. 3, pp. 257-265, May 1997
[5] Bahl, L.; Brown, P.; De Souza, P.V.; Mercer, R., Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP 86, vol. 11, pp. 49-52, Apr 1986
[6] Saon, G.; Jen-Tzung Chien, Large-Vocabulary Continuous Speech Recognition Systems: A Look at Some Recent Advances, Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 18-33, Nov.
[7] Hermansky, H., Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America 87 (1990)
[8] Hermansky, H.; Ellis, D.P.W.; Sharma, S., Tandem connectionist feature extraction for conventional HMM systems, Acoustics, Speech, and Signal Processing, ICASSP 00, Proceedings IEEE International Conference on, vol. 3, pp. 1635-1638, 2000
