arxiv: v1 [cs.lg] 8 Jul 2016


Overcoming Challenges in Fixed Point Training of Deep Convolutional Networks

arXiv:1607.02241v1 [cs.LG] 8 Jul 2016

Darryl D. Lin, Qualcomm Research, San Diego, CA 92121 USA (DARRYL.DLIN@GMAIL.COM)
Sachin S. Talathi, Qualcomm Research, San Diego, CA 92121 USA (TALATHI@GMAIL.COM)

Accepted for the 33rd International Conference on Machine Learning - Workshop on On-Device Intelligence. Copyright 2016 by the author(s).

Abstract

It is known that training deep neural networks, in particular deep convolutional networks, with aggressively reduced numerical precision is challenging. The stochastic gradient descent algorithm becomes unstable in the presence of noisy gradient updates resulting from arithmetic with limited numeric precision. One of the well-accepted solutions facilitating the training of low precision fixed point networks is stochastic rounding. However, to the best of our knowledge, the source of the instability in training neural networks with noisy gradient updates has not been well investigated. This work is an attempt to draw a theoretical connection between low numerical precision and training algorithm stability. In doing so, we also propose, and verify through experiments, methods that improve the training performance of deep convolutional networks in fixed point.

1. Introduction

Deep convolutional networks (DCNs) have demonstrated state-of-the-art performance in many machine learning tasks such as image classification (Krizhevsky et al., 2012) and speech recognition (Deng et al., 2013). However, the complexity and the size of DCNs have limited their use in mobile applications and embedded systems. One reason is the hit on performance (in terms of accuracy on a given machine learning task) that these networks take when they are deployed with data representations using reduced numeric precision. A potential avenue to alleviate this problem is to fine-tune pre-trained floating point DCNs using data representations with reduced numeric precision. However, the training algorithms have a strong tendency to diverge when the precision of network parameters and features is too low (Han et al., 2015; Courbariaux et al., 2014).

More recently, several works have touched upon the issue of training deep networks with low numerical precision (Gupta et al., 2015; Lin et al., 2015; Gysel et al., 2016). In all of these works, stochastic rounding has been the key to improving the convergence properties of the training algorithm, which in turn has enabled training of deep networks with relatively small bit-widths. However, to the best of our knowledge, there is limited understanding from a theoretical point of view as to why low precision networks lead to training difficulties.

In this paper, we attempt to offer a theoretical insight into the root cause of the numerical instability when training DCNs with limited numeric precision representations. In doing so, we also propose a few solutions to combat such instability in order to improve the training outcome. These proposals are not meant to replace stochastic rounding. Rather, they are complementary techniques. To clearly demonstrate the effectiveness of our proposed solutions, we do not perform stochastic rounding in the experiments. We intend to combine stochastic rounding and our proposed solutions in future work.

This work focuses on fine-tuning a pre-trained floating point DCN in fixed point. While most of the analysis also applies to training a fixed point network from scratch, some discussions may be applicable to the fixed point fine-tuning scenario alone.

2. Low Precision and Back-Propagation

In this section, we investigate the origin of instability in the network training phase when low precision weights and activations are used. The outcome of this effort will shed light on possible avenues to alleviate the problem.

2.1. Effective Activation Function

The computation of activations in the forward pass of a deep network can be written as:

$$a_i^{(l)} = \sum_j w_{i,j}^{(l)} \, g\big(a_j^{(l-1)}\big), \tag{1}$$

where $a_i^{(l)}$ denotes the i-th activation in the l-th layer, $w_{i,j}^{(l)}$ represents the (i, j)-th weight value in the l-th layer, and $g(\cdot)$ is the activation function. Note that here we assume both the activations and weights are full precision values.

Now consider the case where only the weights are low precision fixed point values. From the forward pass perspective, (1) still holds. However, when we introduce low precision activations into the equation, (1) is no longer an accurate description of how the activations propagate. To see this, we may consider the evaluation of $a_i^{(l)}$ in fixed point representation as in Figure 1.

[Figure 1. Evaluation of activation as quantization. Panels: Step 1: Multiplication; Step 2: Addition; Step 3: Rounding.]

In Figure 1, three operations are depicted:

Step 1: Compute $w_{i,j}^{(l)} \, g(a_j^{(l-1)})$. Assuming both $w_{i,j}^{(l)}$ and $g(a_j^{(l-1)})$ are 8-bit fixed point values, the product is a 16-bit value.

Step 2: Compute $\sum_j w_{i,j}^{(l)} \, g(a_j^{(l-1)})$. The accumulator is wider than 16 bits to prevent overflow.

Step 3: The outcome of $\sum_j w_{i,j}^{(l)} \, g(a_j^{(l-1)})$ is rounded and truncated to produce an 8-bit activation value.

Step 3 is a quantization step that reduces the precision of the value calculated based on (1), in keeping with the desired fixed point precision of layer l. In essence, assuming ReLU, the effective activation function experienced by the features in the network is as shown in Figure 2(b), rather than 2(a).

[Figure 2. The presumed and actual ReLU function in low precision networks. (a) $g(\cdot)$; (b) $g_q(\cdot)$.]

2.2. Gradient Mismatch

In back-propagation, denoting the cost function as C, an important equation that dictates how the error signal, $\partial C / \partial a_i^{(l)}$, propagates down the network is expressed as follows:

$$\frac{\partial C}{\partial a_i^{(l)}} = g'\big(a_i^{(l)}\big) \sum_j w_{j,i}^{(l+1)} \, \frac{\partial C}{\partial a_j^{(l+1)}}. \tag{2}$$

The value of $\partial C / \partial a_i^{(l)}$ indicates the direction in which $a_i^{(l)}$ should move in order to improve the cost function. Playing a crucial role in (2) is the derivative of the activation function, $g'(a_i^{(l)})$. In a software environment that implements SGD, the original activation function in the form of Figure 2(a) is assumed. However, as explained in Section 2.1, the effective activation function in a fixed point network is a non-differentiable function as described in Figure 2(b). This disagreement between the presumed and the actual activation function is the origin of what we call the gradient mismatch problem.

When the bit-widths of the weights and activations are large, the gradient of the original activation function offers a good approximation to that of the quantized activation function. However, the mismatch starts to affect the stability of SGD when the bit-widths become too small (that is, when the quantization step sizes become too large).

The gradient mismatch problem also worsens as the error signal propagates deeper down the network, because every time the presumed $g'(a_i^{(l)})$ is used, additional errors are introduced in the gradient computation. Since the gradients with respect to the weights are directly based on the gradients with respect to the activations,

$$\frac{\partial C}{\partial w_{i,j}^{(l)}} = g\big(a_j^{(l-1)}\big) \, \frac{\partial C}{\partial a_i^{(l)}}, \tag{3}$$

the weight updates become increasingly inaccurate as the error propagates into lower layers of the network. Hence training networks in fixed point is much more challenging in deeper networks than in shallower networks.

2.3. Potential Solutions

Having understood the source of the issue, we propose a few methods to help overcome the challenges of training or fine-tuning a fixed point network. The obvious approach of replacing the perceived activation function with the effective activation function that takes quantization into account is not viable because the effective activation function is not differentiable. However, some alternatives may help improve convergence during model training by avoiding the gradient mismatch problem.
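As an illustration of Sections 2.1 and 2.2, the following minimal Python sketch mimics the three steps of Figure 1 and compares the derivative presumed by SGD software with the local slope of the quantized ReLU. The function names (`to_fixed`, `effective_relu`, `local_slope`, `fixed_point_activation`), the round-half-up rule, the saturation behaviour, and the bit-widths used are assumptions made for this example; they are not taken from the paper's implementation.

```python
import math


def to_fixed(x, frac_bits, total_bits):
    """Round x to the nearest multiple of 2**-frac_bits (half rounds up)
    and saturate to the signed range of a total_bits-wide integer code.
    Both choices are assumptions of this sketch."""
    step = 2.0 ** -frac_bits
    code = math.floor(x / step + 0.5)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    code = max(lo, min(hi, code))
    return code * step


def relu(a):
    return a if a > 0.0 else 0.0


def relu_grad(a):
    """Derivative presumed by the training software (Figure 2(a))."""
    return 1.0 if a > 0.0 else 0.0


def effective_relu(a, frac_bits, total_bits):
    """ReLU followed by quantization: the staircase of Figure 2(b)."""
    return to_fixed(relu(a), frac_bits, total_bits)


def local_slope(f, a, eps=1e-3):
    """Central finite difference, used here only to probe the staircase."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)


def fixed_point_activation(weights, inputs, frac_bits, total_bits):
    """Steps 1-3 of Figure 1 for one output unit."""
    acc = 0.0  # a Python float stands in for the wide accumulator
    for w, x in zip(weights, inputs):
        # Step 1: product of two low precision values
        # Step 2: accumulate without intermediate rounding
        acc += to_fixed(w, frac_bits, total_bits) * to_fixed(x, frac_bits, total_bits)
    # Step 3: round the sum back to the activation precision
    return to_fixed(acc, frac_bits, total_bits)


if __name__ == "__main__":
    a = 0.30
    for frac_bits, total_bits in [(2, 4), (12, 16)]:
        slope = local_slope(lambda v: effective_relu(v, frac_bits, total_bits), a)
        print(frac_bits, total_bits, relu_grad(a), slope)
```

With 2 fractional bits, the input 0.30 sits on a flat step of the quantized ReLU, so its local slope is 0 while the presumed derivative is 1. With 12 fractional bits the two nearly agree, which matches the observation that the mismatch matters mainly at small bit-widths.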

2.3.1. PROPOSAL 1: LOW PRECISION WEIGHTS AND FULL PRECISION ACTIVATIONS

Recognizing that the main obstacle to training in fixed point is the low precision activations, we may train a network with the desired precision for the weights, while keeping the activations in floating point or at relatively high precision. After training, the network can be adapted to run with lower precision activations.

2.3.2. PROPOSAL 2: FINE-TUNING TOP LAYER(S) ONLY

As the analysis in Section 2 shows, when the activation precision is low, weight updates of top layers are more reliable than those of lower layers, because the gradient mismatch builds up from the top of the network to the bottom. Therefore, while it may not be possible to fine-tune the entire network, it may be possible to fine-tune only the top layers without incurring convergence issues.

2.3.3. PROPOSAL 3: BOTTOM-TO-TOP ITERATIVE FINE-TUNING

The bottom-to-top iterative fine-tuning scheme is a training algorithm designed to avoid gradient mismatch. At the same time, it allows the entire network to be fine-tuned. For example, consider a network with 4 layers. Table 1 illustrates how fine-tuning is divided into phases, where one layer is fine-tuned in each phase.

Table 1. Example showing the phases of iterative fine-tuning

            Phase 1           Phase 2           Phase 3
            Acts    Wgts      Acts    Wgts      Acts    Wgts
Layer4      Float   -         Float   -         Float   update
Layer3      Float   -         Float   update    FxPt    -
Layer2      Float   update    FxPt    -         FxPt    -
Layer1      FxPt    -         FxPt    -         FxPt    -

Each phase of fine-tuning, consisting of one or more epochs, updates the weights of one of the layers (weights can follow the desired fixed point format without special treatment). As shown in Table 1, Phase 1 fine-tunes the weights of Layer2. After Phase 1 is complete, Phase 2 fine-tunes the weights of Layer3 while keeping the weights of all other layers static. Then Phase 3 fine-tunes Layer4 in a similar manner. Note that Layer1 weights are quantized but never fine-tuned.

Also of importance is how the number format of the activations changes over the phases.
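The phase structure of Table 1 can be sketched for a network of any depth. The short Python example below is only an illustration: the function names `iterative_schedule` and `updates_see_only_float` and the tuple layout are hypothetical, and the code generates the schedule only, not the training itself.

```python
def iterative_schedule(num_layers):
    """Return one list of (layer, activation_format, weight_action) rows per
    phase, ordered from the top layer down, following the pattern of Table 1.
    Phase p updates layer p + 1 and uses fixed point activations for layers 1..p."""
    phases = []
    for phase in range(1, num_layers):
        target = phase + 1  # the single layer whose weights are updated
        rows = []
        for layer in range(num_layers, 0, -1):
            acts = "FxPt" if layer <= phase else "Float"
            wgts = "update" if layer == target else "-"
            rows.append((layer, acts, wgts))
        phases.append(rows)
    return phases


def updates_see_only_float(rows):
    """Check that the updated layer and every layer above it keep floating
    point activations, so gradients reaching the updated weights never pass
    through a quantized activation."""
    target = next(layer for layer, _, wgts in rows if wgts == "update")
    return all(acts == "Float" for layer, acts, _ in rows if layer >= target)


if __name__ == "__main__":
    for number, rows in enumerate(iterative_schedule(4), start=1):
        print("Phase", number)
        for layer, acts, wgts in rows:
            print("  Layer%d  %-5s  %s" % (layer, acts, wgts))
```

For `num_layers = 4` the three generated phases match the columns of Table 1, and `updates_see_only_float` holds for every phase.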
Initially, during Phase 1, only the bottom layer (Layer1) activations are in fixed point, but in Phase 2, both Layer1 and Layer2 activations are in fixed point. In the last phase of fine-tuning, only the output of the final layer remains floating point. All other activations have been turned into fixed point. The gradual turning on of fixed point activations is designed to prevent gradient mismatch completely. Careful inspection of the algorithm shows that, whenever the weights of a particular layer are updated, the gradients are always back-propagated from layers with only floating point activations.

3. Experiments

In this section, we examine the effectiveness of the proposed solutions based on a deep convolutional network we developed for the ImageNet classification task¹. The network has 12 convolutional layers and 5 fully-connected layers. We chose this network for the experiments because, as we have shown with a network designed for CIFAR-10 classification (Lin et al., 2016), fine-tuning a relatively shallow fixed point network does not pose convergence challenges even when the bit-widths are small.

Table 2. ImageNet classification Top-5 error rate (%): no fine-tuning

Activation      Weight bit-width
bit-width       4       8       16      Float
4               98.6    33.4    32.9    32.7
8               97.1    19.3    18.0    18.2
16              96.6    15.0    14.3    14.4
Float           96.6    14.1    13.9    13.8

The baseline for the experiment is the DCN quantized with the algorithm presented in Lin et al. (2016), without fine-tuning. The Top-5 error rates of these networks, for different weight and activation bit-width combinations, are listed in Table 2. Note that for all the fixed point experiments in this paper, the output activations of the final fully-connected layer are always set to a bit-width of 16. We do not try to reduce the precision of this quantity because the subsequent softmax layer is rather sensitive to low precision inputs, and it is an insignificant overhead to the network overall.

To further improve the accuracy beyond Table 2, we perform fine-tuning on these networks subject to the corresponding fixed point bit-width constraints of the weights and activations.
Table 3 shows that, while fine-tuning improves some scenarios (for example, 16-bit activations and 4-bit weights), it fails to converge for most of the settings where the activations are in fixed point. This interesting observation validates the analysis in Section 2 showing that the stability problem is due to the low precision of activations, not weights. We note that for these and all the subsequent fine-tuning experiments, we did not perform any hyperparameter optimization of the training parameters, and it is quite possible that a set of training hyperparameters exists for which the quantized network trains successfully.

¹ Proprietary Information, Qualcomm Inc.

Table 3. ImageNet classification Top-5 error rate (%): vanilla fine-tuning ("n/a" = fails to converge)

Activation      Weight bit-width
bit-width       4       8       16      Float
4               n/a     n/a     n/a     n/a
8               n/a     19.3    n/a     n/a
16              21.0    n/a     n/a     n/a
Float           [values missing in the source text]

3.1. Proposal 1

The networks on the last row of Table 3 are already trained with the desired weight precision. We can directly run them with different activation precisions. Table 4 lists the Top-5 error rates of this approach. We can achieve fairly good classification accuracy for different activation bit-widths.

Table 4. ImageNet classification Top-5 error rate (%): use fixed point activations in networks trained with floating point activations (Proposal 1)

Activation      Weight bit-width
bit-width       4       8       16      Float
4               45.6    32.0    31.3    32.7
8               25.1    16.8    16.8    18.2
16              22.5    13.9    13.8    14.4

3.2. Proposal 2

Using the networks on the last row of Table 3 as the baseline, we can continue to fine-tune only the weights of the top few layers. It is possible to fine-tune the top layers because the effect of gradient mismatch accumulates toward the lower layers of the network, while the impact on the top layers is relatively small.

Table 5 shows the results of fine-tuning only the top fully-connected layer in the network. Fine-tuning the top layer offers a small boost in accuracy compared to the networks in Table 4.

Table 5. ImageNet classification Top-5 error rate (%): fine-tune the top fully-connected layer (Proposal 2)

Activation      Weight bit-width
bit-width       4       8       16      Float
4               37.1    23.8    23.3    23.5
8               22.8    15.6    15.7    16.2
16              21.2    13.7    13.5    13.7

3.3. Proposal 3

Again using the networks on the last row of Table 3 as the fine-tuning baseline, we iteratively fine-tune the network from the bottom to the top, one layer at a time, according to the algorithm prescribed in Table 1. This procedure ensures that each layer has accurate gradient information when its weights are updated.

Table 6. ImageNet classification Top-5 error rate (%): iterative fine-tuning from bottom layer to top layer (Proposal 3)

Activation      Weight bit-width
bit-width       4       8       16      Float
4               25.3    18.4    18.3    18.2
8               19.3    15.2    14.1    14.1
16              18.8    13.2    13.2    13.5

As seen in Table 6, this approach provides a significant performance boost compared to the previous solutions. Even a network with 4-bit weights and 4-bit activations is able to achieve a Top-5 error rate of 25.3%. Some of the entries in the table have better accuracy than the floating point baseline. This may be attributed to the regularization effect of the quantization noise added during training (Lin et al., 2015).

4. Conclusion

In this paper, we studied the effect of low numerical precision of weights and activations on the accuracy of gradient computation during back-propagation. Our analysis showed that low precision weights are benign, but low precision activations have a detrimental impact on the computed gradients. The errors in gradient computation accumulate during back-propagation and may slow or even prevent the successful convergence of gradient descent when the network is sufficiently deep. We proposed a few solutions to combat this problem and demonstrated through experiments their effectiveness on the ImageNet classification task. We plan to combine stochastic rounding and our proposed solutions in future work.

References

Courbariaux, M., Bengio, Y., and David, J. Low precision arithmetic for deep learning. arXiv:1412.7024, 2014.

Deng, L., Hinton, G. E., and Kingsbury, B. New types of deep neural network learning for speech recognition and related applications: an overview. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599-8603, 2013.

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. arXiv:1502.02551, 2015.

Gysel, P., Motamedi, M., and Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv:1604.03168, 2016.

Han, S., Mao, H., and Dally, W. J. A deep neural network compression pipeline: Pruning, quantization, Huffman encoding. arXiv:1510.00149, 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Lin, D. D., Talathi, S. S., and Annapureddy, V. S. Fixed point quantization of deep convolutional networks. In ICML, 2016.

Lin, Z., Courbariaux, M., Memisevic, R., and Bengio, Y. Neural networks with few multiplications. arXiv:1510.03009, 2015.