EXPLORING MULTIDIMENSIONAL LSTMS FOR LARGE VOCABULARY ASR


Jinyu Li, Abdelrahman Mohamed, Geoffrey Zweig, and Yifan Gong
Microsoft Corporation, One Microsoft Way, Redmond, WA
{jinyli, asamir, gzweig,

ABSTRACT

Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks. A key aspect of these models is the use of time recurrence, combined with a gating architecture that allows them to track the long-term dynamics of speech. Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM) that performs 1-D recurrence over the frequency axis and then performs 1-D recurrence over the time axis. In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM. The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to a time LSTM (T-LSTM). The joint time-frequency modeling better normalizes the features for the upper layer T-LSTMs. Evaluated on a 375-hour short message dictation task, the proposed TF-LSTM obtained a 3.4% relative WER reduction over the best T-LSTM. The invariance property achieved by joint time-frequency analysis is demonstrated on a mismatched test set, where the TF-LSTM achieves a 4.2% relative WER reduction over the best T-LSTM.

Index Terms: LSTM, RNN, time and frequency, multidimensional

1. INTRODUCTION

Recently, significant progress has been made in automatic speech recognition (ASR) thanks to the application of deep neural networks (DNNs) [1][2][3][4][5][6]. DNNs, however, only consider information in a fixed-length sliding window of frames and thus cannot exploit long-range correlations in the signal. Recurrent neural networks (RNNs), on the other hand, can encode sequence history in their internal state, and thus have the potential to predict phonemes based on all the speech features observed up to the current frame.
Unfortunately, simple RNNs, depending on the largest eigenvalue of the state-update matrix, may have gradients which either increase or decrease exponentially over time. Thus, the basic RNN is difficult to train, and in practice can only model short-range effects. Long short-term memory (LSTM) RNNs [7][8] were developed to overcome these problems. LSTM-RNNs use input, output and forget gates to achieve a network that can maintain state and propagate gradients in a stable fashion over long spans of time. These networks have been shown to outperform DNNs on a variety of ASR tasks [9][10][11][12][13][14]. All previously proposed LSTMs use a recurrence along the time axis to model the temporal patterns of speech signals, and we call them T-LSTMs in this paper.

In common practice, log-filter-bank features are often used as the input to the neural-network-based acoustic model [15]. In standard systems, the log-filter-bank features are independent of one another, i.e., switching the positions of two filter-banks won't affect the performance of the DNN or LSTM. However, this is not the case when a human reads a spectrogram: a human relies on patterns that evolve in both time and frequency to predict phonemes. Switching the positions of two filter-banks will destroy the frequency-wise patterns. Meanwhile, switching the positions of two frames will destroy the time-wise patterns. Inspired by the way people read spectrograms, we recently proposed the frequency LSTM (F-LSTM) in [16], which performs recurrence along the frequency axis to summarize the frequency-evolving patterns as the feature for the upper level T-LSTMs. All the LSTM operations in [16] are one-dimensional, either along the frequency axis or the time axis. However, both time-wise and frequency-wise patterns are important to human spectrogram reading. Hence, it may be better to extract features with both patterns.
Further, the concept of multidimensional processing has proved very successful in handwriting recognition tasks [17][18] and computer vision tasks [19], and it outperformed the traditional handwriting systems that use convolutional neural networks (CNNs) [20][21] as the feature extractor.

The main contribution of this paper is the proposal to use a multidimensional LSTM to model both time and frequency dynamics for speech recognition. We further propose a method for doing this joint time-frequency analysis in a highly efficient way. We term the proposed method the time-frequency LSTM or TF-LSTM. Evaluated on a 375-hour Microsoft short message dictation (SMD) task, the TF-LSTM consistently outperformed the F-LSTM and obtained a 3.4% relative word error rate (WER) reduction from the T-LSTM on the SMD test set, and a 4.2% relative WER reduction on a mismatched test set.

The rest of the paper is organized as follows. In Section 2, we briefly introduce LSTMs, and then we present the proposed time-frequency LSTM in Section 3. We differentiate the proposed method from the convolutional LSTM DNN (CLDNN) [14] and the multi-dimensional RNN [17][18] in Section 4. Experimental evaluation of the algorithm is provided in Section 5. We summarize our study and draw conclusions in Section 6.

2. THE LSTM-RNN

An RNN is fundamentally different from the feed-forward DNN in that the RNN does not operate on a fixed window of frames; instead, it maintains a hidden state vector, which is recursively updated after seeing each time frame. This allows RNNs to be resilient to arbitrary input warping along the recurrence dimension, leading to better generalization abilities. Stacking multiple layers of RNNs allows the network to discover relationships between frames on progressively higher levels of abstraction. During learning, the simple RNN suffers from the vanishing/exploding gradient problem [22]. This problem is well handled in the LSTM-RNNs through the use of the following four components:


Memory units: these store the temporal state of the network;
Input gates: these modulate the input activations into the cells;
Output gates: these modulate the output activations of the cells;
Forget gates: these adaptively reset the cell's memory.

Taken together as in Figure 1 below, these four components are termed an LSTM cell.

Figure 1: Architecture of LSTM-RNNs with one recurrent layer. $Z^{-1}$ is a time-delay node.

Figure 1 depicts the architecture of an LSTM-RNN with one recurrent layer. In LSTM-RNNs, in addition to the past hidden-layer output $h_{t-1}^l$, the past memory activation $c_{t-1}^l$ is also an input to the LSTM cell. This model can be described as:

$$i_t^l = \sigma(W_{xi}^l x_t^l + W_{hi}^l h_{t-1}^l + W_{ci}^l c_{t-1}^l + b_i^l) \tag{1}$$
$$f_t^l = \sigma(W_{xf}^l x_t^l + W_{hf}^l h_{t-1}^l + W_{cf}^l c_{t-1}^l + b_f^l) \tag{2}$$
$$c_t^l = f_t^l \odot c_{t-1}^l + i_t^l \odot \tanh(W_{xc}^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l) \tag{3}$$
$$o_t^l = \sigma(W_{xo}^l x_t^l + W_{ho}^l h_{t-1}^l + W_{co}^l c_t^l + b_o^l) \tag{4}$$
$$h_t^l = o_t^l \odot \tanh(c_t^l) \tag{5}$$

where $i_t^l$, $o_t^l$, $f_t^l$, and $c_t^l$ denote the activation vectors of the input gate, output gate, forget gate, and memory cell at the $l$-th layer and time $t$, respectively. $h_t^l$ is the output of the LSTM cells at layer $l$ and time $t$. $W$ terms denote different weight matrices. For example, $W_{xi}^l$ is the weight matrix from the cell input to the input gate at the $l$-th layer. $b$ terms are the bias terms (e.g., $b_i^l$ is the bias of the input gate at layer $l$). $\odot$ denotes element-wise multiplication.

In [11], an LSTM with an additional projection layer prior to the output was proposed to reduce the computational complexity of the LSTM. A projection layer is applied to $h_t^l$ as $r_t^l = W_{hr}^l h_t^l$, and then $h_{t-1}^l$ in Eqs. (1)-(4) is replaced by $r_{t-1}^l$. In this study, we adopt this structure for T-LSTM modeling.

3. JOINT TIME-FREQUENCY ANALYSIS VIA MULTIDIMENSIONAL LSTM

In this section, we propose a time-frequency LSTM (TF-LSTM) as shown in Figure 2. In contrast to the frequency LSTM (F-LSTM) in our previous work [16], which scans the frequency bands so that frequency-evolving information is summarized by the output of the F-LSTM, the new method scans both the time and frequency axes jointly to perform the time-frequency analysis.

Figure 2: An example of a time-frequency LSTM-RNN which scans both the time and frequency axes at the bottom layer using the TF-LSTM, and then scans the time axis at the upper layers using T-LSTMs. Note that the outputs of all TF-LSTM cells are fed into the upper layer T-LSTM.

The formulation of the TF-LSTM is as follows.

$$i_{k,t}^l = \sigma(W_{xi}^l x_{k,t}^l + W_{hi1}^l h_{k,t-1}^l + W_{hi2}^l h_{k-1,t}^l + W_{ci}^l c_{k,t-1}^l + b_i^l) \tag{6}$$
$$f_{k,t}^l = \sigma(W_{xf}^l x_{k,t}^l + W_{hf1}^l h_{k,t-1}^l + W_{hf2}^l h_{k-1,t}^l + W_{cf}^l c_{k,t-1}^l + b_f^l) \tag{7}$$
$$c_{k,t}^l = f_{k,t}^l \odot c_{k,t-1}^l + i_{k,t}^l \odot \tanh(W_{xc}^l x_{k,t}^l + W_{hc1}^l h_{k,t-1}^l + W_{hc2}^l h_{k-1,t}^l + b_c^l) \tag{8}$$
$$o_{k,t}^l = \sigma(W_{xo}^l x_{k,t}^l + W_{ho1}^l h_{k,t-1}^l + W_{ho2}^l h_{k-1,t}^l + W_{co}^l c_{k,t}^l + b_o^l) \tag{9}$$
$$h_{k,t}^l = o_{k,t}^l \odot \tanh(c_{k,t}^l) \tag{10}$$

In this formulation, every gate now has three indices: layer $l$, frequency band $k$, and time $t$. For example, $f_{k,t}^l$ denotes the activation vector of the forget gate at layer $l$, frequency band $k$, and time $t$. Different from Eqs. (1)-(4), now we have both the time-delay input $h_{k,t-1}^l$ and the frequency-delay input $h_{k-1,t}^l$. The $W_{h\cdot1}^l$ and $W_{h\cdot2}^l$ matrices denote the weight matrices connecting $h_{k,t-1}^l$ and $h_{k-1,t}^l$, respectively. The structure of a TF-LSTM cell is plotted in Figure 3, where $\varphi$ denotes the tanh function.

Figure 3: A TF-LSTM cell at frequency band $k$ and time $t$.
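As a rough illustration of Eqs. (6)-(10), the cell update at one frequency chunk and one time step could be sketched in NumPy as below. This is only a sketch under stated assumptions, not the authors' CNTK implementation: the function and parameter names (`init_params`, `tf_lstm_step`, the dictionary keys) are hypothetical, the peephole weights are assumed to be diagonal (stored as vectors), the memory state is assumed to be delayed along time, and the random initialisation is purely illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, cell_dim, rng):
    """Random weights for one TF-LSTM cell (illustrative initialisation only)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))

    p = {}
    for g in "ifco":  # input gate, forget gate, cell input, output gate
        p["W_x" + g] = mat(cell_dim, input_dim)        # from the chunk input
        p["W_h" + g + "1"] = mat(cell_dim, cell_dim)   # time-delay path
        p["W_h" + g + "2"] = mat(cell_dim, cell_dim)   # frequency-delay path
        p["b_" + g] = np.zeros(cell_dim)
    for g in "ifo":  # peephole weights, assumed diagonal here
        p["w_c" + g] = rng.normal(scale=0.1, size=cell_dim)
    return p


def tf_lstm_step(x, h_time, h_freq, c_prev, p):
    """One TF-LSTM update at frequency chunk k and time t.

    x      : input chunk x_{k,t}
    h_time : h_{k,t-1}, output of the same chunk at the previous time step
    h_freq : h_{k-1,t}, output of the previous chunk at the current time step
    c_prev : c_{k,t-1}, previous memory state (time-delayed by assumption)
    """
    i = sigmoid(p["W_xi"] @ x + p["W_hi1"] @ h_time + p["W_hi2"] @ h_freq
                + p["w_ci"] * c_prev + p["b_i"])                     # Eq. (6)
    f = sigmoid(p["W_xf"] @ x + p["W_hf1"] @ h_time + p["W_hf2"] @ h_freq
                + p["w_cf"] * c_prev + p["b_f"])                     # Eq. (7)
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc1"] @ h_time
                                 + p["W_hc2"] @ h_freq + p["b_c"])   # Eq. (8)
    o = sigmoid(p["W_xo"] @ x + p["W_ho1"] @ h_time + p["W_ho2"] @ h_freq
                + p["w_co"] * c + p["b_o"])                          # Eq. (9)
    h = o * np.tanh(c)                                               # Eq. (10)
    return h, c


# Usage: one update for an 8-dimensional chunk and a 24-dimensional cell.
rng = np.random.default_rng(0)
params = init_params(input_dim=8, cell_dim=24, rng=rng)
h, c = tf_lstm_step(rng.normal(size=8), np.zeros(24), np.zeros(24),
                    np.zeros(24), params)
```

Setting every `W_h*2` matrix to zero removes the frequency-delay path, so the same function then behaves like a time-only cell applied to a single chunk.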

The proposed TF-LSTM in Eqs. (6)-(10) is a general case of the T-LSTM or F-LSTM. When all the frequency bands are concatenated together as a single unit, the frequency index $k$ and all the items associated with $W_{h\cdot2}^l$ are removed. Then the TF-LSTM reduces to the T-LSTM of Eqs. (1)-(5). In contrast, if all the items associated with $W_{h\cdot1}^l$ are removed, the TF-LSTM reduces to an F-LSTM, which can be viewed as removing the connections to $h_{k,t-1}^l$ in Figure 3.

The detailed TF-LSTM processing is described as follows.

At each time step, divide the $N$ log-filter-banks at the current time into $M$ overlapped chunks, shifting by $C$ log-filter-banks between adjacent chunks. They are denoted as $x_{k,t}^l$, $k = 1, \dots, M$.

Using the hidden activations at each frequency chunk from the previous time step $h_{k,t-1}^l$, the hidden activations at each time step from the previous frequency chunk $h_{k-1,t}^l$, and the input at the current frequency chunk and time step $x_{k,t}^l$, go through Eqs. (6)-(10) to generate the output $h_{k,t}^l$, $k = 1, \dots, M$. Note that because we use log-filter-banks as the input, the time-frequency analysis is in the first layer, and $l$ is set to 1 in Eqs. (6)-(10).

Merge $h_{k,t}^l$, $k = 1, \dots, M$, into a super-vector $h_t^l$, which can be considered as a trajectory of time-frequency patterns. Then use $h_t^l$ as the input to the upper layer T-LSTM.

It is also worthwhile to investigate the stacking of multiple TF-LSTM layers. This can be easily done by replacing $x_{k,t}^l$ with the hidden activations from the previous layer $h_{k,t}^{l-1}$ in Eqs. (6)-(9). Again, the output of the last TF-LSTM layer is merged into a super-vector as the input to the upper layer T-LSTM. A sample of two stacked TF-LSTM layers is shown in Figure 4.

Figure 4: An example of stacked TF-LSTM layers.

4. RELATION TO PRIOR WORK

In this section, we first discuss the difference between our proposed TF-LSTM and the convolutional LSTM DNN (CLDNN) [14], which combines CNNs, LSTMs, and DNNs together. The CLDNN first uses a CNN to reduce the spectral variation, and then the output of the CNN layer is fed into a multi-layer LSTM to learn the temporal patterns. Finally, the output of the last LSTM layer is fed into several fully connected DNN layers for the purpose of classification.

The key difference between the TF-LSTM and the CLDNN is that the TF-LSTM uses joint time-frequency recurrence, whereas the CLDNN uses a sliding convolutional window for pattern detection. While the sliding window achieves some local invariance, it is not the same as a joint two-dimensional recurrent network which scans the whole time and frequency axes. The two approaches both aim to achieve invariance to input distortions, but the pattern detectors in the CNN maintain a constant dimensionality, while the TF-LSTM can perform a general time-frequency warping.

The proposed method is similar to the multidimensional LSTM [17][18], which is used for handwriting recognition. The multidimensional LSTM has been used in [23] on a very small phone recognition task, TIMIT [24], using connectionist temporal classification (CTC) [25] as the training criterion. However, there is no accuracy comparison with the T-LSTM in [23]. In contrast, we will show the advantage of our proposed TF-LSTM over the T-LSTM with the cross-entropy training criterion on a large scale speech recognition task in the next section. Although using similar concepts, the proposed TF-LSTM has a different formulation from the multidimensional LSTM in [17][18]. The proposed TF-LSTM has only a single memory unit and a single forget gate, while the multidimensional LSTM in [17][18] has multiple forget gates, each handling the information of one dimension. Thus we achieve a significant reduction in complexity.

We are currently building a strong CLDNN baseline to compare with, and it will be reported in the future. We will also implement the multidimensional LSTM with multiple forget gates [17][18] and compare it with our proposed method.

5. EXPERIMENTS AND DISCUSSIONS

The proposed methods are evaluated on a Microsoft Windows Phone short message dictation task. The transcribed training data contain 375 hours of US-English audio. The test set is from the same Windows Phone task, and has 25k words. This large test set guarantees the significance of the reported improvement. The 87-dimensional feature used in the DNN and T-LSTM experiments consists of the 29-dimensional static log-filter-bank outputs and their first- and second-order derivatives [26]. For the F-LSTM and TF-LSTM experiments, we only use the static log-filter-banks as the feature. All models evaluated in this study use 5976 tied-triphone states (senones), determined by a baseline CD-GMM-HMM system, and were trained to minimize the frame-level cross-entropy criterion. All experiments were conducted using the Computational Network Toolkit (CNTK) [27], which allows us to build and evaluate various network structures efficiently without deriving and implementing complicated training algorithms.

To build the baseline DNN, we augment the 87-dimensional feature vectors with 5 frames of context on either side (5-1-5). The DNN has 5 hidden layers, each with 2048 sigmoid units. The baseline T-LSTM is modeled after that in [11]. Each T-LSTM layer has 1024 hidden units, and the output size of each T-LSTM layer is reduced to 512 using a linear projection layer. There is no frame stacking, and the output HMM state label is delayed by 5 frames as in [11]. When training the T-LSTM, the backpropagation through time (BPTT) [28] step is 20. We use a 4-layer T-LSTM as our baseline.

This baseline has 15.35% WER. It outperforms the baseline DNN with a 10.39% relative WER reduction. This setup is better than the model with three or five T-LSTM layers, as shown in Table 1. There is a 4.3% relative WER reduction when adding one layer to go from the 3-layer T-LSTM to the 4-layer T-LSTM. However, a 5-layer T-LSTM does not outperform a 4-layer T-LSTM.

Table 1: WER and model size comparison of DNN and T-LSTM. M denotes million in the column of number of parameters. Models compared: DNN, 3-layer T-LSTM, 4-layer T-LSTM, 5-layer T-LSTM.

In Table 2, we compare the performance of the F-LSTM and TF-LSTM models. The F-LSTM model uses a single LSTM to scan the log-filter-banks, while the TF-LSTM uses a single LSTM to scan both the time axis and the log-filter-banks. The generated time-frequency evolving summary or frequency evolving summary is then passed into 3 or 4 layers of T-LSTMs. At each time step, the 29 log-filter-bank channels are divided into 22 overlapped chunks, each containing 8 log-filter-banks, which means the frequency shift is 1 log-filter-bank. This log-filter-bank grouping strategy follows our previous experience with CNNs [29]. Then these 22 chunks are fed into the F-LSTM. The input to the TF-LSTM cells includes not only the previous frequency chunks but also the output of this TF-LSTM cell in the previous time frame. Both the F-LSTM and TF-LSTM have 24 memory cells, introducing a small computational cost. The upper layer T-LSTMs have the same structure as the baseline T-LSTMs, with 1024 hidden units in each layer, and the output size is reduced to 512 using a projection.

All the setups in Table 2 outperform the baseline 4-layer T-LSTM. With a 3-layer T-LSTM on top of it, the F-LSTM and TF-LSTM perform almost the same. However, with a 4-layer T-LSTM on top of it, the TF-LSTM is much better than the F-LSTM, and gets 14.83% WER, a 3.4% relative WER reduction from the baseline 4-layer T-LSTM (15.35% WER). The joint time-frequency modeling provides a better feature for the upper layer T-LSTMs to consume. As shown in Table 1, simply increasing the number of layers from 4 to 5 doesn't give any gain.
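The chunking and joint scan described above can be sketched as follows, using the stated sizes (29 log-filter-banks, chunks of 8, a shift of 1, and 24 memory cells). This is a minimal sketch under assumptions, not the system used in the experiments: `chunk_filterbanks`, `tf_scan`, and `toy_step` are hypothetical names, `toy_step` is a deliberately simplified stand-in for the gated cell of Eqs. (6)-(10), and zero initial states at the first frame and below the first chunk are an assumption.

```python
import numpy as np


def chunk_filterbanks(frame, chunk_size=8, shift=1):
    """Split one frame of N log-filter-banks into M overlapped chunks."""
    n = frame.shape[0]
    starts = range(0, n - chunk_size + 1, shift)
    return np.stack([frame[s:s + chunk_size] for s in starts])  # (M, chunk_size)


def toy_step(x, h_time, h_freq, c_prev, w):
    """Simplified stand-in for the TF-LSTM cell; it has no gates."""
    c = 0.5 * c_prev + 0.5 * np.tanh(w["x"] @ x + w["t"] @ h_time + w["f"] @ h_freq)
    return np.tanh(c), c


def tf_scan(frames, step, w, cell_dim=24, chunk_size=8, shift=1):
    """Scan time (outer loop) and frequency (inner loop); return super-vectors."""
    num_frames, n = frames.shape
    m = (n - chunk_size) // shift + 1
    h_prev = np.zeros((m, cell_dim))  # h_{k,t-1} for every chunk k (assumed zero at t=0)
    c_prev = np.zeros((m, cell_dim))  # c_{k,t-1} for every chunk k
    outputs = np.zeros((num_frames, m * cell_dim))
    for t in range(num_frames):
        chunks = chunk_filterbanks(frames[t], chunk_size, shift)
        h_freq = np.zeros(cell_dim)   # no chunk below the first one (assumed zero)
        h_now = np.zeros((m, cell_dim))
        c_now = np.zeros((m, cell_dim))
        for k in range(m):
            h_now[k], c_now[k] = step(chunks[k], h_prev[k], h_freq, c_prev[k], w)
            h_freq = h_now[k]         # becomes h_{k-1,t} for the next chunk
        outputs[t] = h_now.reshape(-1)  # merge all chunk outputs into one super-vector
        h_prev, c_prev = h_now, c_now
    return outputs


# Usage: 5 frames of 29 log-filter-banks give 22 chunks per frame and a
# 22 * 24 = 528-dimensional super-vector per frame for the upper T-LSTM.
rng = np.random.default_rng(0)
weights = {"x": rng.normal(scale=0.1, size=(24, 8)),
           "t": rng.normal(scale=0.1, size=(24, 24)),
           "f": rng.normal(scale=0.1, size=(24, 24))}
super_vectors = tf_scan(rng.normal(size=(5, 29)), toy_step, weights)
```

The inner loop over chunks is inherently sequential because each chunk depends on the one below it, while the dependency on the previous frame only needs the stored outputs of that frame.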
Table 2: Comparison of F-LSTM and TF-LSTM. Models compared: F-LSTM + 3-layer T-LSTM, F-LSTM + 4-layer T-LSTM, TF-LSTM + 3-layer T-LSTM, TF-LSTM + 4-layer T-LSTM.

We further investigate the performance of stacked F-LSTM and TF-LSTM in Table 3. To have the same number of layers as the TF-LSTM + 4-layer T-LSTM setup in Table 2, we tried to use either a 2-layer F-LSTM or a 2-layer TF-LSTM, followed by a 3-layer T-LSTM. Again, the setup using the TF-LSTM outperformed the setup with the F-LSTM. However, neither outperformed the TF-LSTM + 4-layer T-LSTM setup. Note that going from the TF-LSTM + 3-layer T-LSTM setup in Table 2 to the 2-layer F-LSTM + 3-layer T-LSTM setup in Table 3 only introduces 0.1M additional parameters, and this brings very slight improvement. This is because the TF-LSTM itself has a very small number of parameters, as the cell size is only 24. In the future, we can use a 2-layer TF-LSTM followed by a 4-layer T-LSTM to get some further gains.

Table 3: The stacking of F-LSTM and TF-LSTM. Models compared: 2-layer F-LSTM + 3-layer T-LSTM, 2-layer TF-LSTM + 3-layer T-LSTM.

In a final set of experiments, we evaluated the invariance properties of the TF-LSTM model by testing the models trained with Windows Phone data on the Aurora 4 [30] test sets. Two clean evaluation sets (A and C) are recorded with the Sennheiser microphone and the secondary microphone, respectively. The remaining two groups (B and D) are recorded with the two types of microphone respectively, and 6 types of noise are added with randomly chosen SNRs between 5 and 15 dB for each of the microphone types. Therefore, these test sets have totally mismatched acoustic environments from the Windows Phone training set. We used the baseline 4-layer T-LSTM model in Table 1 and the TF-LSTM model in Table 2 for the evaluation. The language model is a bigram provided by Aurora 4. As shown in Table 4, the TF-LSTM performs much better than the T-LSTM in all test conditions, and reduced the average WER from 7.46% to 5.0%, a 4.2% relative WER reduction. This confirms the robustness [31] of the joint time-frequency analysis of the TF-LSTM.

Table 4: The WER comparison of T-LSTM and TF-LSTM models on the mismatched Aurora 4 test sets. Models are trained with Windows Phone short message dictation data. Columns: A, B, C, D, Avg. Models compared: 4-layer T-LSTM, TF-LSTM + 4-layer T-LSTM.

6. CONCLUSIONS

In this paper, we have presented a two-dimensional TF-LSTM architecture that scans both the time and frequency axes to model the evolving patterns of the spectrogram. The TF-LSTM uses an LSTM to perform a joint time-frequency recurrence that summarizes spectro-temporal patterns. The summarized patterns are then fed into upper level T-LSTMs. The proposed TF-LSTM obtained a 3.4% relative WER reduction over the traditional T-LSTM on a 375-hour short message dictation task. We further investigated the effectiveness of stacking multiple TF-LSTM layers, and found that the additional accuracy gain is marginal. This indicates that a one-layer TF-LSTM is good enough to extract the patterns relevant to speech recognition. When evaluated with a totally mismatched Aurora 4 test set, the TF-LSTM demonstrates much better resistance to the distortion, giving a 4.2% relative WER reduction over a T-LSTM.

7. REFERENCES

[1] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, 2011.
[2] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, "An application of pretrained deep neural networks to large vocabulary conversational speech recognition," in Proc. Interspeech, 2012.
[3] T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, and A.-R. Mohamed, "Making deep belief networks effective for large vocabulary continuous speech recognition," in Proc. ASRU, 2011.
[4] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs," in Proc. ICASSP, 2011.
[5] A. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio Speech and Language Process., vol. 20, no. 1, pp. 14-22, Jan.
[6] L. Deng, J. Li, J.-T. Huang, et al., "Recent advances in deep learning for speech research at Microsoft," in Proc. ICASSP, 2013.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, 1997.
[8] A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10.
[9] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP, 2013.
[10] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. ASRU, 2013.
[11] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. Interspeech, 2014.
[12] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence discriminative distributed training of long short-term memory recurrent neural networks," in Proc. Interspeech, 2014.
[13] X. Li and X. Wu, "Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition," in Proc. ICASSP, 2015.
[14] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. ICASSP, 2015.
[15] A. Mohamed, G. Hinton, and G. Penn, "Understanding how deep belief networks perform acoustic modeling," in Proc. ICASSP, 2012.
[16] J. Li, A. Mohamed, G. Zweig, and Y. Gong, "LSTM time and frequency recurrence for automatic speech recognition," in Proc. ASRU, 2015.
[17] A. Graves, S. Fernández, and J. Schmidhuber, "Multi-dimensional recurrent neural networks," in ICANN.
[18] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," Advances in Neural Information Processing Systems.
[19] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki, "Scene labeling with LSTM recurrent neural networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[20] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, 2013.
[21] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, 2014.
[22] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994.
[23] A. Graves, "Practical variational inference for neural networks," in Advances in Neural Information Processing Systems, 2011.
[24] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus," U.S. Dept. of Commerce, NIST, Gaithersburg, MD, February 1993.
[25] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, ACM.
[26] J. Li, D. Yu, J.-T. Huang, and Y. Gong, "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in Proc. IEEE Spoken Language Technology Workshop, pp. 131-136, 2012.
[27] D. Yu, A. Eversole, M. Seltzer, et al., "An introduction to computational networks and the computational network toolkit," Microsoft Technical Report MSR-TR-2014-112, 2014.
[28] H. Jaeger, "Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach," GMD Report 159, GMD German National Research Institute for Computer Science.
[29] J.-T. Huang, J. Li, and Y. Gong, "An analysis of convolutional neural networks for speech recognition," in Proc. ICASSP, 2015.
[30] N. Parihar and J. Picone, "Aurora working group: DSR front end LVCSR evaluation AU/384/02," Tech. Rep., Institute for Signal and Information Processing, Mississippi State Univ.
[31] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, Robust Automatic Speech Recognition: A Bridge to Practical Applications, Elsevier Press, 2015.
