
Overcoming Challenges in Fixed Point Training of Deep Convolutional Networks

arXiv:1607.02241v1 [cs.LG] 8 Jul 2016

Darryl D. Lin (DARRYL.DLIN@GMAIL.COM)
Qualcomm Research, San Diego, CA 92121 USA

Sachin S. Talathi (TALATHI@GMAIL.COM)
Qualcomm Research, San Diego, CA 92121 USA

Accepted for the 33rd International Conference on Machine Learning - Workshop on On-Device Intelligence. Copyright 2016 by the author(s).

Abstract

It is known that training deep neural networks, in particular deep convolutional networks, with aggressively reduced numerical precision is challenging. The stochastic gradient descent algorithm becomes unstable in the presence of noisy gradient updates resulting from arithmetic with limited numeric precision. One of the well-accepted solutions facilitating the training of low precision fixed point networks is stochastic rounding. However, to the best of our knowledge, the source of the instability in training neural networks with noisy gradient updates has not been well investigated. This work is an attempt to draw a theoretical connection between low numerical precision and training algorithm stability. In doing so, we will also propose and verify through experiments methods that are able to improve the training performance of deep convolutional networks in fixed point.

1. Introduction

Deep convolutional networks (DCNs) have demonstrated state-of-the-art performance in many machine learning tasks such as image classification (Krizhevsky et al., 2012) and speech recognition (Deng et al., 2013). However, the complexity and the size of DCNs have limited their use in mobile applications and embedded systems. One reason is related to the hit on performance (in terms of accuracy on a given machine learning task) that these networks take when they are deployed with data representations using reduced numeric precision. A potential avenue to alleviate this problem is to fine-tune pre-trained floating point DCNs using data representations with reduced numeric precision. However, the training algorithms have a strong tendency to diverge when the precision of network parameters and features is too low (Han et al., 2015; Courbariaux et al., 2014).

More recently, several works have touched upon the issue of training deep networks with low numerical precision (Gupta et al., 2015; Lin et al., 2015; Gysel et al., 2016). In all of these works stochastic rounding has been the key to improving the convergence properties of the training algorithm, which in turn has enabled training of deep networks with relatively small bit-widths. However, to the best of our knowledge, there is a limited understanding from a theoretical point of view as to why low precision networks lead to training difficulties.

In this paper, we attempt to offer a theoretical insight into the root cause of the numerical instability when training DCNs with limited numeric precision representations. In doing so, we will also propose a few solutions to combat such instability in order to improve the training outcome. These proposals are not meant to replace stochastic rounding. Rather, they are complementary techniques. To clearly demonstrate the effectiveness of our proposed solutions, we will not perform stochastic rounding in the experiments. We intend to combine stochastic rounding and our proposed solutions in future works.

This work will focus on fine-tuning a pre-trained floating point DCN in fixed point. While most of the analysis applies also to the case of training a fixed point network from scratch, some discussions may be applicable to the fixed point fine-tuning scenario alone.

2. Low Precision and Back-Propagation

In this section, we will investigate the origin of instability in the network training phase when low precision weights and activations are used. The outcome of this effort will shed light on possible avenues to alleviate the problem.

2.1. Effective Activation Function

The computation of activations in the forward pass of a deep network can be written as:

$$a_i^{(l)} = \sum_j w_{i,j}^{(l)} \, g\big(a_j^{(l-1)}\big), \tag{1}$$

where $a_i^{(l)}$ denotes the i-th activation in the l-th layer, $w_{i,j}^{(l)}$ represents the (i, j)-th weight value in the l-th layer, and $g(\cdot)$ is the activation function. Note that here we assume both the activations and weights are full precision values.

Now consider the case where only the weights are low precision fixed point values. From the forward pass perspective, (1) still holds. However, when we introduce low precision activations into the equation, (1) is no longer an accurate description of how the activations propagate. To see this, we may consider the evaluation of $a_i^{(l)}$ in fixed point representation as in Figure 1.

Figure 1. Evaluation of activation as quantization. (Step 1: Multiplication; Step 2: Addition; Step 3: Rounding.)

In Figure 1, three operations are depicted:

Step 1: Compute $w_{i,j}^{(l)} \, g(a_j^{(l-1)})$. Assuming both $w_{i,j}^{(l)}$ and $g(a_j^{(l-1)})$ are 8-bit fixed point values, the product is a 16-bit value.

Step 2: Compute $\sum_j w_{i,j}^{(l)} \, g(a_j^{(l-1)})$. The size of the accumulator is larger than 16 bits to prevent overflow.

Step 3: The outcome of $\sum_j w_{i,j}^{(l)} \, g(a_j^{(l-1)})$ is rounded and truncated to produce an 8-bit activation value.

Step 3 is a quantization step that reduces the precision of the value calculated based on (1), in keeping with the desired fixed point precision of layer l. In essence, assuming ReLU, the effective activation function experienced by the features in the network is as shown in Figure 2(b), rather than 2(a).

Figure 2. The presumed (a) $g(\cdot)$ and actual (b) $g_q(\cdot)$ ReLU function in low precision networks.

2.2. Gradient Mismatch

In back-propagation, denoting the cost function as C, an important equation that dictates how the error signal, $\delta_i^{(l)} = \partial C / \partial a_i^{(l)}$, propagates down the network is expressed as follows:

$$\delta_i^{(l)} = g'\big(a_i^{(l)}\big) \sum_j w_{j,i}^{(l+1)} \, \delta_j^{(l+1)}. \tag{2}$$

The value of $\delta_i^{(l)}$ indicates the direction in which $a_i^{(l)}$ should move in order to improve the cost function. Playing a crucial role in (2) is the derivative of the activation function, $g'(a_i^{(l)})$. In a software environment that implements SGD, the original activation function in the form of Figure 2(a) is assumed. However, as explained in Section 2.1, the effective activation function in a fixed point network is a non-differentiable function as described in Figure 2(b). This disagreement between the presumed and the actual activation function is the origin of what we call the gradient mismatch problem.

When the bit-widths of the weights and activations are large, the gradient of the original activation function offers a good approximation to that of the quantized activation function. However, the mismatch will start to impact the stability of SGD when the bit-widths become too small (that is, when the quantization step sizes become too large).

The gradient mismatch problem is also exacerbated as the error signal propagates deeper down the network, because every time the presumed $g'(a_i^{(l)})$ is used, additional errors are introduced in the gradient computation. Since the gradients w.r.t. the weights are directly based on the gradients w.r.t. the activations,

$$\frac{\partial C}{\partial w_{i,j}^{(l)}} = g\big(a_j^{(l-1)}\big) \, \delta_i^{(l)}, \tag{3}$$

the weight updates become increasingly inaccurate as the error propagates into lower layers of the network. Hence training networks in fixed point is much more challenging in deeper networks than in shallower networks.

2.3. Potential Solutions

Having understood the source of the issue, we will propose a few methods to help overcome the challenges of training or fine-tuning a fixed point network. The obvious approach of replacing the perceived activation function with the effective activation function that takes quantization into account is not viable because the effective activation function is not differentiable. However, some alternatives may help improve convergence during model training by avoiding the gradient mismatch problem.
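To make the gradient mismatch of Section 2.2 concrete, the following is a minimal sketch that compares the presumed ReLU with one possible quantized ReLU. The function names, the unsigned fixed point format (`bit_width`, `frac_bits`) and the round-to-nearest rule are illustrative assumptions, not the exact number format used in the experiments.

```python
import math


def relu(a):
    """Presumed activation g(a): full precision ReLU."""
    return max(a, 0.0)


def quantized_relu(a, bit_width=4, frac_bits=2):
    """Effective activation g_q(a) under an assumed unsigned fixed point format.

    The ReLU output is rounded to the nearest multiple of 2**-frac_bits and
    saturated at the largest code representable with bit_width bits.
    """
    step = 2.0 ** -frac_bits
    max_code = 2 ** bit_width - 1
    code = math.floor(relu(a) / step + 0.5)  # round to nearest, ties upward
    return min(code, max_code) * step        # saturate at the top of the range


def local_slope(f, a, eps=0.01):
    """Central finite-difference slope of f around a."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)


if __name__ == "__main__":
    # Back-propagation presumes a slope of 1 for every positive input, while
    # the quantized function is flat between steps and jumps at each step.
    for a in (0.30, 0.375):
        print(a, local_slope(relu, a), local_slope(quantized_relu, a))
```

With a step size of 0.25, the quantized function is locally flat around 0.30 and jumps at 0.375, whereas the presumed derivative is 1 at both points. Increasing `frac_bits` shrinks the step, which brings the two functions closer together, in line with the observation that the mismatch matters mainly at small bit-widths.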

2.3.1. PROPOSAL 1: LOW PRECISION WEIGHTS AND FULL PRECISION ACTIVATIONS

Recognizing that the main obstacle of training in fixed point is the low precision activations, we may train a network with the desired precision for the weights, while keeping the activations in floating point or with relatively high precision. After training, the network can be adapted to run with lower precision activations.

2.3.2. PROPOSAL 2: FINE-TUNING TOP LAYER(S) ONLY

As the analysis in Section 2 shows, when the activation precision is low, weight updates of top layers are more reliable than those of lower layers, because the gradient mismatch builds up from the top of the network to the bottom. Therefore, while it may not be possible to fine-tune the entire network, it may be possible to fine-tune only the top layers without incurring convergence issues.

2.3.3. PROPOSAL 3: BOTTOM-TO-TOP ITERATIVE FINE-TUNING

The bottom-to-top iterative fine-tuning scheme is a training algorithm designed to avoid gradient mismatch. At the same time, it allows the entire network to be fine-tuned. For example, consider a network with 4 layers. Table 1 offers an illustration of how fine-tuning is divided into phases where one layer is fine-tuned in each phase.

Table 1. Example showing the phases of iterative fine-tuning

            Phase 1           Phase 2           Phase 3
            Acts    Wgts      Acts    Wgts      Acts    Wgts
Layer4      Float   -         Float   -         Float   update
Layer3      Float   -         Float   update    FxPt    -
Layer2      Float   update    FxPt    -         FxPt    -
Layer1      FxPt    -         FxPt    -         FxPt    -

Each phase of fine-tuning, consisting of one or multiple epochs, updates the weights of one of the layers (weights can follow the desired fixed point format without special treatment). As shown in Table 1, Phase 1 fine-tunes the weights of Layer2. After Phase 1 is complete, Phase 2 fine-tunes the weights of Layer3 while keeping the weights of all other layers static. Then Phase 3 fine-tunes Layer4 in a similar manner. Note that Layer1 weights are quantized but never fine-tuned.

Also of importance is how the number format of activations changes over the phases.
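The phase structure of Table 1 can be written down for a network of any depth. The sketch below is only an illustration of that schedule: the function name `iterative_finetune_schedule` and the dictionary layout are hypothetical, and the actual weight updates inside each phase are left out.

```python
def iterative_finetune_schedule(num_layers):
    """Return one entry per phase, mirroring Table 1 for num_layers layers.

    In phase p (1-indexed), the weights of layer p + 1 are updated, the
    activations of layers 1..p are in fixed point ("FxPt"), and the
    activations of all layers above remain in floating point ("Float").
    """
    phases = []
    for phase in range(1, num_layers):
        activations = {
            layer: "FxPt" if layer <= phase else "Float"
            for layer in range(1, num_layers + 1)
        }
        phases.append(
            {"phase": phase, "update": phase + 1, "activations": activations}
        )
    return phases


if __name__ == "__main__":
    for entry in iterative_finetune_schedule(4):
        print(entry)
```

For `num_layers=4` this yields three phases that update Layer2, Layer3 and Layer4 in turn. In every phase, the updated layer and all layers above it still have floating point activations.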
Initially, during Phase 1, only the bottom layer (Layer1) activations are in fixed point, but in Phase 2, both Layer1 and Layer2 activations are in fixed point. In the last phase of fine-tuning, only the output of the final layer remains in floating point. All other activations have been turned into fixed point. The gradual turning on of fixed point activations is designed to prevent gradient mismatch completely. Careful inspection of the algorithm shows that, whenever the weights of a particular layer are updated, the gradients are always back-propagated from layers with only floating point activations.

3. Experiments

In this section, we examine the effectiveness of the proposed solutions based on a deep convolutional network we developed for the ImageNet classification task¹. The network has 12 convolutional layers and 5 fully-connected layers. We choose this network for the experiments because, as we have shown in a network designed for CIFAR-10 classification (Lin et al., 2016), fine-tuning a relatively shallow fixed point network does not pose convergence challenges even when the bit-widths are small.

Table 2. ImageNet classification Top-5 error rate (%): No fine-tuning

Activation      Weight Bit-width
Bit-width       4       8       16      Float
4               98.6    33.4    32.9    32.7
8               97.1    19.3    18.0    18.2
16              96.6    15.0    14.3    14.4
Float           96.6    14.1    13.9    13.8

The baseline for the experiment is the DCN that is quantized based on the algorithm presented in Lin et al. (2016) without fine-tuning. The Top-5 error rates of these networks, for different weight and activation bit-width combinations, are listed in Table 2.

Note that for all the fixed point experiments in this paper, the output activations of the final fully-connected layer are always set to a bit-width of 16. We do not try to reduce the precision of this quantity because the subsequent softmax layer is rather sensitive to low precision inputs and it is an insignificant overhead to the network overall.

To further improve the accuracy beyond Table 2, we perform fine-tuning on these networks subject to the corresponding fixed point bit-width constraints of the weights and activations.
Table 3 shows that, while fine-tuning improves some scenarios (for example, 16-bit activations and 4-bit weights), it fails to converge for most of the settings where the activations are in fixed point. This interesting observation validates the analysis in Section 2 showing that the stability problem is due to the low precision of activations, not weights. We note that for these and all the subsequent fine-tuning experiments, we did not perform any hyperparameter optimization of the training parameters, and it is quite possible to identify a set of training hyperparameters for which the quantized network may train successfully.

¹ Proprietary Information, Qualcomm Inc.

Table 3. ImageNet classification Top-5 error rate (%): vanilla fine-tuning ("n/a" = fails to converge)

Activation      Weight Bit-width
Bit-width       4       8       16      Float
4               n/a     n/a     n/a     n/a
8               n/a     19.3    n/a     n/a
16              21.0    n/a     n/a     n/a
Float           [row not recoverable from the source]

3.1. Proposal 1

Table 4. ImageNet classification Top-5 error rate (%): Use fixed point activations in networks trained with floating point activations (Proposal 1)

Activation      Weight Bit-width
Bit-width       4       8       16      Float
4               45.6    32.0    31.3    32.7
8               25.1    16.8    16.8    18.2
16              22.5    13.9    13.8    14.4

The networks on the last row of Table 3 are already trained with the desired weight precision. We can directly use them to run with different activation precision. Table 4 lists the resulting Top-5 error rates of this approach. It is seen that we can achieve fairly good classification accuracy for different activation bit-widths.

3.2. Proposal 2

Using the networks on the last row of Table 3 as the baseline, we can continue to fine-tune only the weights of the top few layers. It is possible to fine-tune the top layers because the effect of gradient mismatch accumulates toward the lower layers of the network, but the impact on the top layers is relatively small.

Table 5. ImageNet classification Top-5 error rate (%): Fine-tune the top fully-connected layer (Proposal 2)

Activation      Weight Bit-width
Bit-width       4       8       16      Float
4               37.1    23.8    23.3    23.5
8               22.8    15.6    15.7    16.2
16              21.2    13.7    13.5    13.7

Table 5 demonstrates the results of fine-tuning only the top fully-connected layer in the network. It is seen that fine-tuning the top layer offers a small boost in accuracy compared to the networks in Table 4.

3.3. Proposal 3

Again using the network on the last row of Table 3 as the fine-tuning baseline, we iteratively fine-tune the network from the bottom to the top, one layer at a time, according to the algorithm prescribed in Table 1. This procedure ensures that each layer has accurate gradient information when the weights are updated.

Table 6. ImageNet classification Top-5 error rate (%): Iterative fine-tuning from bottom layer to top layer (Proposal 3)

Activation      Weight Bit-width
Bit-width       4       8       16      Float
4               25.3    18.4    18.3    18.2
8               19.3    15.2    14.1    14.1
16              18.8    13.2    13.2    13.5

As seen in Table 6, this approach provides a significant performance boost compared to the previous solutions. Even a network with 4-bit weights and 4-bit activations is able to achieve a Top-5 error rate of 25.3%. Some of the entries in the table have better accuracy than the floating point baseline. This may be attributed to the regularization effect of the added quantization noise during training (Lin et al., 2015).

4. Conclusion

In this paper, we studied the effect of low numerical precision of weights and activations on the accuracy of gradient computation during back-propagation. Our analysis showed that low precision weights are benign, but low precision activations have a detrimental impact on the computed gradients. The errors in gradient computation accumulate during back-propagation and may slow and even prevent the successful convergence of gradient descent when the network is sufficiently deep. We proposed a few solutions to combat this problem and demonstrated through experiments their effectiveness on the ImageNet classification task. We plan to combine stochastic rounding and our proposed solutions in future works.

References

Courbariaux, M., Bengio, Y., and David, J. Low precision arithmetic for deep learning. arXiv:1412.7024, 2014.

Deng, L., Hinton, G. E., and Kingsbury, B. New types of deep neural network learning for speech recognition and related applications: an overview. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599-8603, 2013.

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. arXiv:1502.02551, 2015.

Gysel, P., Motamedi, M., and Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv:1604.03168, 2016.

Han, S., Mao, H., and Dally, W. J. A deep neural network compression pipeline: Pruning, quantization, Huffman encoding. arXiv:1510.00149, 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Lin, D. D., Talathi, S. S., and Annapureddy, V. S. Fixed point quantization of deep convolutional networks. In ICML, 2016.

Lin, Z., Courbariaux, M., Memisevic, R., and Bengio, Y. Neural networks with few multiplications. arXiv:1510.03009, 2015.
