Learning Ensembles of Convolutional Neural Networks

Liran Chen, The University of Chicago

Faculty Mentor: Greg Shakhnarovich, Toyota Technological Institute at Chicago

1 Introduction

Convolutional Neural Networks (CNNs) have demonstrated impressive performance in image classification. When fitting complex models with nonconvex objectives, the resulting model depends on the stochastic learning procedure, i.e., the final network trained with gradient descent depends on factors such as the order of the data in each epoch, the initialization, the learning rates, etc. Ensemble learning is a method for generating multiple versions of a predictor network and using them to get an aggregated prediction. Given a learning set Ω consisting of data {(y_n, x_n), n = 1, ..., N}, where y is the class label and x is the input feature, we train a predictor ϕ(x, Ω). With different initializations, we obtain a series of predictors {ϕ_k}. Our objective is to use the {ϕ_k} to get a better predictor, ϕ_A.

In the last few years, several papers have shown that ensemble methods can deliver outstanding reductions in test error. Most notably, (Krizhevsky et al., 2012) showed that on the ImageNet 2012 classification benchmark, their ensemble of 5 convnets achieved a top-1 error rate of 38.1%, compared to the top-1 error rate of 40.7% given by a single model. In addition, (Zeiler & Fergus, 2013) showed that with an ensemble of 6 convnets, they reduced the top-1 error from 40.5% to 36.0%. In 1994, Breiman introduced the concept of bagging, which helped us gain some understanding of why ensembles of classification and regression trees work when they are trained on random samples from the whole dataset (Breiman, 1996). However, there is still no clear understanding of
why the ensemble of CNNs performs so well, what the relation is between the number of models in the ensemble and the amount of error reduced, or whether there are better ensemble methods than averaging the predictions.

2 Experiments

2.1 The Data Set

The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits. This dataset is commonly used for training various image processing systems. The database contains 60,000 training images and 10,000 testing images. The digits have been size-normalized and centered in a fixed-size image.

2.2 The Architecture

The architecture of our network contains three learned layers: two convolutional layers, H1 and H2, and one fully-connected layer, H3. The output of the fully-connected layer is fed to a Softmax layer which produces a vector of length 10. Each element of the vector represents the probability that the input belongs to a certain class (a digit from 0 to 9). We construct our CNN in a similar way as (LeCun et al., 1989) did.

The first convolutional layer, H1, is fed a 28×28 normalized input image. This layer consists of 12 feature maps of size 14×14, designated H1.1, H1.2, ..., H1.12. Each pixel in each feature map of H1 takes input from a 5×5 receptive field on the input plane. In H1, pixels that are next to each other have receptive fields two pixels apart. Within a given feature map, all receptive fields share the same set of 5×5 weights, while the pixels of another feature map share a different set of 5×5 weights. Thus, in total, there are 12 sets of 5×5 weights creating the 12 feature maps of H1. Before H1 is fed to H2, we apply a nonlinear ReLU transformation to all pixels in all maps.

From H1 to H2, a similar process occurs: convolution followed by a nonlinear transformation. H2 consists of 12 feature maps, each with units arranged in a 7×7 plane, designated H2.1, H2.2, ..., H2.12.
For H2.1, each pixel takes a 5×5 receptive field from each feature map of H1: the fields on the first input map share one set of 5×5 weights, the fields on another input map share a different set of 5×5 weights, and so on. Thus, to obtain H2.1, we need a set of weights sized 5×5×12.
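The map sizes quoted above follow from standard convolution arithmetic. As a quick check, here is a small Python sketch; the helper conv_out is hypothetical (not from the paper), and the padding value of 2 is an assumption, since the text does not state it:

```python
# Sketch of the feature-map size arithmetic for H1 and H2. The helper
# conv_out implements the standard formula
#   out = floor((in + 2*pad - kernel) / stride) + 1
# The padding of 2 is an assumption that makes the stated 14x14 and 7x7
# map sizes come out exactly.

def conv_out(size, kernel, stride, pad):
    """Output side length of a square convolution."""
    return (size + 2 * pad - kernel) // stride + 1

# H1: 28x28 input, 5x5 receptive fields, neighboring fields two pixels
# apart (stride 2) -> 14x14 maps.
h1 = conv_out(28, kernel=5, stride=2, pad=2)   # 14

# H2: the same scheme applied to the 14x14 maps of H1 -> 7x7 maps.
h2 = conv_out(h1, kernel=5, stride=2, pad=2)   # 7

# Weight counts: 12 shared 5x5 kernels for H1; one 5x5 kernel per input
# map (5x5x12 weights) for each H2 map.
print(h1, h2, 12 * 5 * 5, 5 * 5 * 12)   # 14 7 300 300
```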
In total, H2 is created from 12 sets of 5×5×12 weights. The output of H2 is 12 feature maps of size 7×7 (a 7×7×12 volume), which is then fed to H3. H3 is a fully-connected layer consisting of 30 units, each produced from the dot product of H2 with one of 30 sets of weights, each sized 7×7×12, followed by a nonlinear ReLU transformation. Before being fed to the Softmax layer, a bias term is added to H3. Thus, the output of H3 is 30 numbers plus a bias. With backpropagation through the whole architecture, a test error rate of 3% is achieved with a single model. Since we want to investigate the effect of ensembling, we deliberately weaken the CNN by fixing H1 and H2 and letting learning occur only in the fully-connected layer and the Softmax layer.

2.3 Training for Different Numbers of Epochs

In our experiment, we train 30 CNNs independently, and each CNN is trained for 1 to 20 epochs, designated ϕ_1.1, ϕ_1.2, ..., ϕ_1.20, ϕ_2.1, ϕ_2.2, ..., ϕ_2.20, ..., ϕ_30.1, ϕ_30.2, ..., ϕ_30.20. The testing error, the vertical axis of Figure 1, is plotted against the number of training epochs, the horizontal axis. Lower testing error indicates better performance. The red line in the middle is the testing error averaged over all CNNs trained for a given number of epochs. Figure 1 shows no clear indication that increasing the number of training epochs leads to better performance. Note that with a different architecture and dataset, this observation may vary.

2.4 Averaging Predictions

From the previous setting, we have 30 groups of trained CNNs, designated ϕ_1.1, ϕ_1.2, ..., ϕ_1.20, ϕ_2.1, ϕ_2.2, ..., ϕ_2.20, ..., ϕ_30.1, ϕ_30.2, ..., ϕ_30.20. Since the output is numerical, an obvious ensemble procedure is to average the predictions of the predictors trained for the same number of epochs. We designate ϕ̄_N.e as the ensemble of N predictors, each trained for e epochs. In Figure 2, the testing error is plotted against N, the number of CNNs averaged. Figure 2 shows that as N increases, the testing error decreases, indicating that ensembling via averaging the predictions contributes to better performance.
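The averaging procedure just described can be sketched as follows, assuming each trained CNN outputs a length-10 probability vector; the outputs below are random stand-ins for the 30 trained networks, not the actual models from the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the outputs of N independently trained CNNs on one test
# image: each row is a length-10 probability vector (a Softmax output).
# These are random placeholders, not the networks from the experiment.
N = 30
outputs = rng.dirichlet(np.ones(10), size=N)   # shape (N, 10)

# Ensemble by averaging the predicted probabilities over the N models,
# then classifying by the largest averaged probability.
avg_pred = outputs.mean(axis=0)                # shape (10,)
predicted_class = int(np.argmax(avg_pred))

assert np.isclose(avg_pred.sum(), 1.0)         # still a probability vector
print(predicted_class)
```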
However, we notice that as more and more predictors are averaged, the rate of reduction of the testing error goes down and the lines eventually flatten; thus it is unlikely that the testing error could be reduced to zero by averaging an infinite number of CNNs.
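This flattening is what one would expect if the predictors behave roughly independently: averaging N of them shrinks the variance part of the error like 1/N, while the bias part remains. A toy simulation (with Gaussian stand-in predictors, not the trained CNNs) illustrates the effect:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy illustration of the flattening curves: K independent noisy
# predictors of a scalar target y. Averaging the first n of them shrinks
# the variance term of the squared error roughly like 1/n, so the gains
# diminish as n grows.
y = 1.0
K, trials = 30, 20000
preds = y + rng.normal(0.0, 0.5, size=(trials, K))  # iid predictor outputs

def mse_of_average(n):
    """Mean squared error when the first n predictors are averaged."""
    avg = preds[:, :n].mean(axis=1)
    return float(np.mean((avg - y) ** 2))

errs = [mse_of_average(n) for n in (1, 2, 5, 10, 30)]
# The error drops quickly at first, then flattens: the marginal gain
# per additional model tends to zero.
assert errs[0] > errs[1] > errs[-1]
print([round(e, 4) for e in errs])
```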
Figure 1: Testing Error of CNNs Trained for Different Numbers of Epochs

Figure 2: Testing Error of CNNs with Averaged Predictions

2.5 Fixing the Total Number of Training Epochs

We now know that for our architecture and dataset, increasing N, the number of models averaged, helps obtain better performance, while
increasing e, the learning duration, does not. Since increasing either N or e leads to higher cost, it is natural to think about the tradeoff between N and e. If we fix Ne, then we fix the total cost of training. In Figure 3, the testing error is plotted against different combinations of N and e. The blue line shows that when Ne = 30, the testing error is reduced as N increases. The red line is the testing error of CNNs with N = 1 and e equal to that of the blue line. Using the red line as a control group, Figure 3 shows that the predictors gain more and more accuracy as the number of models involved in the ensemble increases. Since the models are trained independently, if we spread the training across different machines, we can effectively reduce the training time.

Figure 3: Testing Error of Ensembles of CNNs with a Fixed Total Training Cost

2.6 Creating a New Softmax Layer

Here we experiment with a new way of ensembling instead of simply averaging the predictions numerically. As discussed in the previous section, the output of a CNN is a vector of length 10. With a fixed training duration, i.e., e held constant, we collect the outputs of 30 independently-trained CNNs and stack them as a new feature map. This map is then fed to a new Softmax layer which outputs a vector of length 10. The elements of the new output still
represent the probability that the original input belongs to a certain class. In Figure 4, the testing error is plotted against e, ranging from 1 to 20. The blue line plots the testing error against epochs under the new way of ensembling, while the red line plots the testing error against epochs without ensembling. Using the red line as a control group, Figure 4 shows that with ensembling by stacking the outputs of independently-trained models and training a new Softmax layer, the predictor performs better.

Figure 4: Testing Error of the Ensemble Created with a New Softmax Layer

3 Why Ensemble Works

For a single network drawn from the distribution P(Net), the expected test error is

e = E_ϕ E_{x,y} (y − ϕ(x))²   (1)

The aggregate predictor is

ϕ_A(x) = E_{ϕ∼P(Net)} (ϕ(x))   (2)
The expected aggregate error is

e_A = E_{x,y} (y − ϕ_A(x))²   (3)

The empirical mean of the predictors is

ϕ̄(x) = (1/N) Σ_i ϕ_i(x)   (4)

The expected error of the empirical mean of the predictors is

ē = E_ϕ E_{x,y} (y − (1/N) Σ_i ϕ_i(x))²   (5)

To prove e ≥ e_A, we write

e = E_ϕ E_{x,y} (y − ϕ(x))²
  = E_ϕ E_{x,y} (y² − 2yϕ(x) + ϕ²(x))
  = E_{x,y} E_ϕ (y² − 2yϕ(x) + ϕ²(x))
  = E_{x,y} (y² − 2y E_ϕ(ϕ(x)) + E_ϕ(ϕ²(x)))
  ≥ E_{x,y} (y² − 2y ϕ_A(x) + E²_ϕ(ϕ(x)))
  = E_{x,y} (y² − 2y ϕ_A(x) + ϕ²_A(x))
  = E_{x,y} (y − ϕ_A(x))² = e_A   (6)

where the inequality uses E_ϕ(ϕ²(x)) ≥ E²_ϕ(ϕ(x)). Thus the expected error of the aggregate predictor is never larger than the expected error of a single predictor, so theoretically we always gain from ensembling in expectation.

We can decompose (1) into bias and variance as follows:

e = E_ϕ E_{x,y} (y − ϕ(x))²
  = E_ϕ E_{x,y} (y − ϕ_A(x) + ϕ_A(x) − ϕ(x))²
  = E_{x,y} (y − ϕ_A(x))² + 2 E_{x,y} [(y − ϕ_A(x)) E_ϕ(ϕ_A(x) − ϕ(x))] + E_ϕ E_{x,y} (ϕ(x) − ϕ_A(x))²
  = E_{x,y} (y − ϕ_A(x))² + E_ϕ E_{x,y} (ϕ(x) − ϕ_A(x))²   (7)

where the cross term vanishes because E_ϕ(ϕ_A(x) − ϕ(x)) = 0. By (3), the aggregate error e_A is exactly the first (bias) term of (7):

e_A = E_{x,y} (y − ϕ_A(x))²   (8)
Thus,

e − e_A = E_ϕ E_{x,y} (ϕ(x) − ϕ_A(x))²
        = E_{x,y} (E_ϕ(ϕ²(x)) − E²_ϕ(ϕ(x)))
        = E_{x,y} (Var_ϕ(ϕ(x))) ≥ 0   (9)

However, in our experiment we can only observe the empirical mean (4) instead of the expectation (2). To quantify the ensemble learning effect, we decompose

ē = E_ϕ E_{x,y} (y − (1/N) Σ_i ϕ_i(x))²
  = E_ϕ E_{x,y} (y − ϕ_A(x) + ϕ_A(x) − (1/N) Σ_i ϕ_i(x))²
  = E_{x,y} (y − ϕ_A(x))² + 2 E_{x,y} [(y − ϕ_A(x)) E_ϕ(ϕ_A(x) − (1/N) Σ_i ϕ_i(x))] + E_ϕ E_{x,y} ((1/N) Σ_i ϕ_i(x) − ϕ_A(x))²
  = E_{x,y} (y − ϕ_A(x))² + E_ϕ E_{x,y} ((1/N) Σ_i ϕ_i(x) − ϕ_A(x))²   (10)

where the cross term again vanishes because E_ϕ((1/N) Σ_i ϕ_i(x)) = ϕ_A(x). In (10), the second term can be decomposed as

E_ϕ E_{x,y} ((1/N) Σ_i ϕ_i(x) − ϕ_A(x))²
  = E_{x,y} E_ϕ (((1/N) Σ_i ϕ_i(x))² − 2 ϕ_A(x) (1/N) Σ_i ϕ_i(x) + ϕ²_A(x))
  = E_{x,y} (E_ϕ((1/N) Σ_i ϕ_i(x))² − ϕ²_A(x))   (11)

and, since the ϕ_i are independent,

E_{x,y} E_ϕ ((1/N) Σ_i ϕ_i(x))²
  = E_{x,y} E_ϕ ((1/N²) (Σ_i ϕ²_i(x) + Σ_{i≠j} ϕ_i(x) ϕ_j(x)))
  = E_{x,y} ((1/N) E_ϕ(ϕ²(x)) + ((N−1)/N) E²_ϕ(ϕ(x)))   (12)
Thus, combining (11) and (12),

ē − e_A = E_{x,y} ((1/N) E_ϕ(ϕ²(x)) + ((N−1)/N) E²_ϕ(ϕ(x)) − E²_ϕ(ϕ(x)))
        = E_{x,y} ((1/N)(E_ϕ(ϕ²(x)) − E²_ϕ(ϕ(x))))
        = (1/N) E_{x,y} (Var_ϕ(ϕ(x))) ≥ 0   (13)

From (13) we know that the more predictors are involved in the ensemble, the smaller the error ē. However, as N goes to infinity, the marginal amount of error reduced decreases to 0. This agrees with the behavior of the error observed in Figure 2.

4 Conclusion

Ensembling is a powerful procedure which improves on single-network performance: it reduces the variance portion in the bias-variance decomposition of the prediction error. Our project has experimented with different ensemble methods, all of which contribute to substantial error reduction. In addition, the tradeoff between the number of models and their training duration has been investigated, and we show that ensemble learning may lead to accuracy gains along with a reduction in training time.

5 References

Breiman, L. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014. arXiv:1311.2901, 2013.