Learning Ensembles of Convolutional Neural Networks


Liran Chen
The University of Chicago

Faculty Mentor: Greg Shakhnarovich
Toyota Technological Institute at Chicago

1 Introduction

Convolutional Neural Networks (CNNs) have demonstrated impressive performance in image classification. When fitting complex models with nonconvex objectives to train the network, the resulting model depends on the stochastic learning procedure, i.e., the final network trained with gradient descent depends on factors such as the order of the data in each epoch, the initialization, the learning rates, etc. Ensemble learning is a method for generating multiple versions of a predictor network and using them to obtain an aggregated prediction. Given a learning set Ω consisting of data {(y_n, x_n), n = 1, ..., N}, where y is the class label and x is the input feature, we train a predictor ϕ(x, Ω). With different initializations, we obtain a series of predictors {ϕ_k}. Our objective is to use the {ϕ_k} to obtain a better predictor, ϕ_A.

In the last few years, several papers have shown that ensemble methods can deliver outstanding reductions in test error. Most notably, (Krizhevsky et al., 2012) showed that on the ImageNet 2012 classification benchmark, their ensemble of 5 convnets achieved a top-1 error rate of 38.1%, compared to the top-1 error rate of 40.7% given by a single model. In addition, (Zeiler & Fergus, 2013) showed that with an ensemble of 6 convnets, they reduced the top-1 error from 40.5% to 36.0%. In 1994, Breiman introduced the concept of bagging, which helped us gain some understanding of why ensembles of classification and regression trees work when they are trained on random samples drawn from the whole dataset (Breiman, 1996).

However, there is still no clear understanding of why the ensemble of CNNs performs so well, of the relation between the number of models in the ensemble and the amount of error reduced, or of ensemble methods other than averaging the predictions.

2 Experiments

2.1 The Data Set

The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits. This dataset is commonly used for training various image processing systems. The database contains 60,000 training images and 10,000 testing images. The digits have been size-normalized and centered in a fixed-size image.

2.2 The Architecture

The architecture of our network contains three learned layers: two convolutional layers, H1 and H2, and one fully-connected layer, H3. The output of the fully-connected layer is fed to a Softmax layer which produces a vector of length 10. Each element of the vector represents the probability that the input belongs to a certain class (the digits 0 to 9). We construct our CNN in a similar way as (LeCun et al., 1989) did.

The first convolutional layer, H1, is fed a 28 × 28 normalized input image. This layer consists of 12 feature maps of size 14 × 14, designated H1.1, H1.2, ..., H1.12. Each pixel in each feature map of H1 takes input from a 5 × 5 receptive field on the input plane. In H1, pixels that are next to each other have receptive fields two pixels apart. For pixels in a given feature map, all receptive fields share the same set of 5 × 5 weights; pixels in another feature map share a different set of 5 × 5 weights. Thus, in total, there are 12 sets of 5 × 5 weights creating the 12 feature maps of H1. Before the result is fed to H2, we apply a nonlinear ReLU transformation to all pixels in all maps.

From H1 to H2, a similar process occurs: convolution followed by a nonlinear transformation. H2 consists of 12 feature maps, designated H2.1, H2.2, ..., H2.12, each containing units arranged in a 7 × 7 plane. For H2.1, each pixel takes a 5 × 5 receptive field from every feature map of H1; the connections to one input map share a set of 5 × 5 weights, the connections to another input map share a different set of 5 × 5 weights, and so on. Thus, to obtain H2.1, we need a set of weights of size 5 × 5 × 12. In total, H2 is created from 12 sets of 5 × 5 × 12 weights. The output of H2 is 12 feature maps of size 7 × 7, which are then fed to H3. H3 is a fully-connected layer consisting of 30 units, each produced by the dot product of H2 with one of 30 sets of weights of size 7 × 7 × 12, followed by a nonlinear ReLU transformation. Before being fed to the Softmax layer, a bias term is added to H3, so the output of H3 is 30 numbers plus a bias. With backpropagation through the whole architecture, a test error rate of 3% is achieved with a single model. Since we want to investigate the effect of ensembling, we deliberately weaken the CNN by fixing H1 and H2 and letting learning occur only in the fully-connected layer and the Softmax layer.
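The layer sizes above correspond to a small network of stride-2, 5 × 5 convolutions. A minimal PyTorch sketch of such an architecture is given below; the module and function names are our own, and details such as the padding are assumptions chosen to reproduce the 28 → 14 → 7 map sizes, not taken from the original code.

```python
# A minimal sketch (assumed details: padding, framework) of the
# two-convolutional-layer network described above, in PyTorch.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # H1: 12 feature maps of size 14x14, 5x5 receptive fields, stride 2.
        self.h1 = nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2)
        # H2: 12 feature maps of size 7x7, 5x5x12 receptive fields, stride 2.
        self.h2 = nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2)
        # H3: fully-connected layer with 30 units on the flattened 7x7x12 maps.
        self.h3 = nn.Linear(12 * 7 * 7, 30)
        # Final layer producing the 10 class scores (the softmax itself is
        # applied by the cross-entropy loss during training).
        self.out = nn.Linear(30, 10)

    def forward(self, x):                 # x: (batch, 1, 28, 28)
        x = torch.relu(self.h1(x))
        x = torch.relu(self.h2(x))
        x = torch.relu(self.h3(x.flatten(1)))
        return self.out(x)

def freeze_conv_layers(net: SmallCNN) -> None:
    """Fix H1 and H2 so that only H3 and the Softmax layer are learned,
    as in the experiments described above."""
    for layer in (net.h1, net.h2):
        for p in layer.parameters():
            p.requires_grad = False
```

With this setup, 30 copies of the network can be trained independently from different random initializations.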

2.3 Training on Different Epochs

In our experiment, we train 30 CNNs independently, and each CNN is trained for 1 to 20 epochs; the resulting models are designated ϕ_{1.1}, ϕ_{1.2}, ..., ϕ_{1.20}, ϕ_{2.1}, ϕ_{2.2}, ..., ϕ_{2.20}, ..., ϕ_{30.1}, ϕ_{30.2}, ..., ϕ_{30.20}. The testing error (the vertical axis of Figure 1) is plotted against the number of training epochs (the horizontal axis). Lower testing error indicates better performance. The red line in the middle is the testing error averaged over all CNNs trained with a given number of epochs. Figure 1 shows no clear indication that increasing the number of training epochs leads to better performance. Note that with a different architecture and dataset, this observation may vary.

2.4 Averaging Predictions

From the previous setting, we have trained 30 groups of CNNs, designated ϕ_{1.1}, ϕ_{1.2}, ..., ϕ_{1.20}, ϕ_{2.1}, ϕ_{2.2}, ..., ϕ_{2.20}, ..., ϕ_{30.1}, ϕ_{30.2}, ..., ϕ_{30.20}. Since the output is numerical, an obvious ensemble procedure is to average the predictions over the predictors trained with the same number of epochs. Designate ϕ̄_{N.e} as the ensemble of N predictors, each trained with e epochs. In Figure 2, the testing error is plotted against N, the number of CNNs averaged. Figure 2 shows that as N increases, the testing error decreases, indicating that ensembling via averaging the predictions contributes to better performance. However, we notice that as more and more predictors are averaged, the rate at which the testing error decreases goes down and eventually the lines go flat; thus it is unlikely that by averaging an infinite number of CNNs the testing error can be reduced to zero.
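As an illustration of the averaging procedure, the sketch below averages the class-probability vectors of several independently trained copies of the network and reports the test error of the resulting ensemble. The helper name, the test-loader interface, and the use of softmax outputs rather than raw logits are our assumptions.

```python
# Sketch of ensembling by averaging predictions (assumes the SmallCNN sketch
# above and an MNIST test loader `test_loader` yielding (images, labels),
# with all models already in eval() mode).
import torch

@torch.no_grad()
def ensemble_error(models, test_loader) -> float:
    """Test error of the predictor obtained by averaging the softmax
    outputs of `models` over the test set."""
    wrong, total = 0, 0
    for images, labels in test_loader:
        # Average the length-10 probability vectors over the ensemble.
        probs = torch.stack(
            [torch.softmax(m(images), dim=1) for m in models]
        ).mean(dim=0)
        wrong += (probs.argmax(dim=1) != labels).sum().item()
        total += labels.numel()
    return wrong / total

# Example: error of the ensemble of the first N models trained for e epochs,
# where `models_e` is a hypothetical list of such models.
# errors = [ensemble_error(models_e[:N], test_loader) for N in range(1, 31)]
```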

Figure 1: Testing Error of CNNs Trained with Different Epochs

Figure 2: Testing Error of CNNs with Averaged Predictions

2.5 Fixing the Total Number of Training Epochs

Now we are aware that, for our architecture and dataset, increasing N, the number of models averaged, helps obtain better performance, while increasing e, the learning duration, does not. Since increasing either N or e leads to higher costs, it is natural to think about the tradeoff between N and e. If we fix N · e, then we fix the total cost of training. In Figure 3, the testing error is plotted against different combinations of N and e. The blue line shows that when N · e = 30, the testing error is reduced as N increases. The red line is the test error of single CNNs (N = 1) trained with the same total number of epochs. Using the red line as a control group, Figure 3 shows that the predictors gain more and more accuracy as the number of models involved in the ensemble increases. Since the models are trained independently, if we spread the training over different machines, we can effectively reduce the training time.

Figure 3: Testing Error of the Ensemble of CNNs with a Fixed Amount of Training Cost

2.6 Creating a New Softmax Layer

Here we experiment with a new way of ensembling instead of simply averaging the predictions numerically. As discussed in the previous section, the output of a CNN is a vector of length 10. With a fixed training duration, i.e., with e held constant, we collect the outputs of 30 independently trained CNNs and stack them into a new feature map. This map is then fed to a new Softmax layer which outputs a vector of length 10. The elements of the new output still represent the probability that the original input belongs to a certain class.
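A minimal sketch of this second ensembling scheme is given below: the 30 length-10 output vectors are concatenated into a single feature vector, and a new linear-plus-softmax layer is trained on top of the frozen base networks. The class name and training details are our assumptions, not taken from the original code.

```python
# Sketch of the stacked ensemble: a new Softmax layer trained on the
# concatenated outputs of 30 frozen, independently trained CNNs.
import torch
import torch.nn as nn

class StackedSoftmaxEnsemble(nn.Module):
    def __init__(self, base_models):
        super().__init__()
        self.base_models = nn.ModuleList(base_models)
        for p in self.base_models.parameters():
            p.requires_grad = False        # only the new layer is learned
        # New layer mapping the stacked 30 x 10 outputs to 10 class scores.
        self.new_softmax = nn.Linear(10 * len(base_models), 10)

    def forward(self, x):
        with torch.no_grad():
            stacked = torch.cat([m(x) for m in self.base_models], dim=1)
        return self.new_softmax(stacked)   # softmax applied by the loss
```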

In Figure 4, the testing error is plotted against e, ranging from 1 to 20. The blue line plots the testing error against epochs under the new ensembling scheme, while the red line plots the testing error against epochs without ensembling. Using the red line as a control group, Figure 4 shows that by stacking the outputs of independently trained models and creating a new Softmax layer, the ensemble predictor performs better.

Figure 4: Testing Error of Ensemble by Creating a New Softmax Layer

3 Why Ensemble Works

For a single network drawn from the distribution P(Net), the expected test error is

    e = E_{\varphi} E_{x,y} (y - \varphi(x))^2                                    (1)

The aggregate predictor is

    \varphi_A(x) = E_{\varphi \sim P(\mathrm{Net})} [\varphi(x)]                  (2)

The expected aggregate error is

    e_A = E_{x,y} (y - \varphi_A(x))^2                                            (3)

The empirical mean of N predictors is

    \bar{\varphi}(x) = \frac{1}{N} \sum_{i=1}^{N} \varphi_i(x)                    (4)

The expected error of the mean of the predictors is

    \bar{e} = E_{\varphi} E_{x,y} \Big( y - \frac{1}{N} \sum_{i=1}^{N} \varphi_i(x) \Big)^2      (5)

To prove e \ge e_A, we have

    e = E_{\varphi} E_{x,y} (y - \varphi(x))^2
      = E_{\varphi} E_{x,y} \big( y^2 - 2 y \varphi(x) + \varphi^2(x) \big)
      = E_{x,y} E_{\varphi} \big( y^2 - 2 y \varphi(x) + \varphi^2(x) \big)
      = E_{x,y} \big( y^2 - 2 y \varphi_A(x) + E_{\varphi}(\varphi^2(x)) \big)
      \ge E_{x,y} \big( y^2 - 2 y \varphi_A(x) + E_{\varphi}^2(\varphi(x)) \big)
      = E_{x,y} \big( y^2 - 2 y \varphi_A(x) + \varphi_A^2(x) \big)
      = E_{x,y} (y - \varphi_A(x))^2 = e_A                                        (6)

Thus the expected error of the aggregate predictor is never larger than the expected error of a single predictor, so theoretically we always gain from ensembling in expectation. We can decompose (1) into bias and variance as follows:

    e = E_{\varphi} E_{x,y} (y - \varphi(x))^2
      = E_{\varphi} E_{x,y} \big( y - E_{x,y}\varphi_A(x) + E_{x,y}\varphi_A(x) - \varphi(x) \big)^2
      = E_{\varphi} E_{x,y} (y - E_{x,y}\varphi_A(x))^2
        + 2 E_{\varphi} E_{x,y} (y - E_{x,y}\varphi_A(x)) (E_{x,y}\varphi_A(x) - \varphi(x))
        + E_{\varphi} E_{x,y} (\varphi(x) - E_{x,y}\varphi_A(x))^2
      = E_{x,y} (y - E_{x,y}\varphi_A(x))^2 + E_{\varphi} E_{x,y} (\varphi(x) - E_{x,y}\varphi_A(x))^2       (7)

Similarly, (3) can be decomposed as follows:

    e_A = E_{x,y} (y - \varphi_A(x))^2
        = E_{x,y} \big( y - E_{x,y}\varphi_A(x) + E_{x,y}\varphi_A(x) - \varphi_A(x) \big)^2
        = E_{x,y} (y - E_{x,y}\varphi_A(x))^2 + E_{x,y} (\varphi_A(x) - E_{x,y}\varphi_A(x))^2               (8)

Thus,

    e - e_A = E_{\varphi} E_{x,y} (\varphi(x) - E_{x,y}\varphi_A(x))^2 - E_{x,y} (\varphi_A(x) - E_{x,y}\varphi_A(x))^2
            = E_{x,y} \big( E_{\varphi}(\varphi^2(x)) - E_{\varphi}^2(\varphi(x)) \big)
            = E_{x,y} \big( \mathrm{Var}_{\varphi}(\varphi(x)) \big) \ge 0                                    (9)

However, in our experiment we can only observe the empirical mean (4) rather than the expectation (2). To obtain the ensemble learning effect, we decompose

    \bar{e} = E_{\varphi} E_{x,y} \Big( y - \frac{1}{N} \sum_i \varphi_i(x) \Big)^2
            = E_{\varphi} E_{x,y} \Big( y - E_{x,y}\varphi_A(x) + E_{x,y}\varphi_A(x) - \frac{1}{N} \sum_i \varphi_i(x) \Big)^2
            = E_{\varphi} E_{x,y} (y - E_{x,y}\varphi_A(x))^2
              + 2 E_{\varphi} E_{x,y} (y - E_{x,y}\varphi_A(x)) \Big( E_{x,y}\varphi_A(x) - \frac{1}{N} \sum_i \varphi_i(x) \Big)
              + E_{\varphi} E_{x,y} \Big( \frac{1}{N} \sum_i \varphi_i(x) - E_{x,y}\varphi_A(x) \Big)^2
            = E_{x,y} (y - E_{x,y}\varphi_A(x))^2 + E_{\varphi} E_{x,y} \Big( \frac{1}{N} \sum_i \varphi_i(x) - E_{x,y}\varphi_A(x) \Big)^2      (10)

In (10), the second term can be decomposed as

    E_{\varphi} E_{x,y} \Big( \frac{1}{N} \sum_i \varphi_i(x) - E_{x,y}\varphi_A(x) \Big)^2
        = E_{x,y} E_{\varphi} \Big( E_{x,y}^2\varphi_A(x) - 2 E_{x,y}\varphi_A(x) \, \frac{1}{N} \sum_i \varphi_i(x) + \Big( \frac{1}{N} \sum_i \varphi_i(x) \Big)^2 \Big)
        = E_{x,y} \Big( E_{x,y}^2\varphi_A(x) - 2 E_{x,y}\varphi_A(x) \, \varphi_A(x) + E_{\varphi} \Big( \frac{1}{N} \sum_i \varphi_i(x) \Big)^2 \Big)                       (11)

with

    E_{x,y} E_{\varphi} \Big( \frac{1}{N} \sum_i \varphi_i(x) \Big)^2
        = E_{x,y} E_{\varphi} \Big( \frac{1}{N^2} \Big( \sum_i \varphi_i^2(x) + \sum_{i \neq j} \varphi_i(x) \varphi_j(x) \Big) \Big)
        = E_{x,y} \Big( \frac{1}{N^2} \Big( N E_{\varphi}(\varphi^2(x)) + \sum_{i \neq j} E_{\varphi}(\varphi_i(x) \varphi_j(x)) \Big) \Big)                                  (12)

Thus,

    e - \bar{e} = E_{x,y} \Big( E_{\varphi}(\varphi^2(x)) - \frac{1}{N^2} \Big( N E_{\varphi}(\varphi^2(x)) + \sum_{i \neq j} E_{\varphi}(\varphi_i(x) \varphi_j(x)) \Big) \Big)
                = E_{x,y} \Big( \frac{N-1}{N} \big( E_{\varphi}(\varphi^2(x)) - E_{\varphi}^2(\varphi(x)) \big) \Big)
                = \frac{N-1}{N} E_{x,y} \big( \mathrm{Var}_{\varphi}(\varphi(x)) \big) \ge 0                                                                                  (13)

where the second equality uses the fact that the predictors are trained independently, so E_{\varphi}(\varphi_i(x)\varphi_j(x)) = E_{\varphi}^2(\varphi(x)) for i \neq j.

From (13) we know that the more predictors are involved in the ensemble, the smaller the error. However, as N goes to infinity, the marginal amount of error reduced decreases to 0. This agrees with the behavior of the error we observed in Figure 2.
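Identity (13) is easy to check numerically. The sketch below is a small Monte-Carlo simulation, under assumed toy distributions of our own choosing, comparing the mean squared error of a single noisy predictor with that of an average of N independent predictors; the observed gap should be close to (N-1)/N times the predictor variance.

```python
# Monte-Carlo check of equation (13) on a toy problem:
# targets y, predictors phi_i(x) = y + independent "training" noise.
import numpy as np

rng = np.random.default_rng(0)
num_trials, N, var_phi = 100_000, 10, 0.5     # ensemble size and predictor variance

y = rng.normal(size=num_trials)               # targets (noise-free signal for simplicity)
# Each column is one predictor's output: the signal plus an independent error.
phis = y[:, None] + rng.normal(scale=np.sqrt(var_phi), size=(num_trials, N))

e_single = np.mean((y - phis[:, 0]) ** 2)     # error of one predictor
e_bar = np.mean((y - phis.mean(axis=1)) ** 2) # error of the N-model average

print(f"single: {e_single:.3f}, ensemble: {e_bar:.3f}, "
      f"gap: {e_single - e_bar:.3f}, (N-1)/N * Var: {(N - 1) / N * var_phi:.3f}")
```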

4 Conclusion

Ensembling is a powerful procedure which improves on single-network performance. It reduces the variance portion in the bias-variance decomposition of the prediction error. Our project has experimented with different ensemble methods that all tend to contribute to dramatic error reduction. In addition, the tradeoff between the number of models and their training duration has been investigated, and we show that ensemble learning may lead to accuracy gains along with a reduction in training time.

5 References

Breiman, L. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. ECCV 2014; arXiv:1311.2901, 2013.