
A Quantitative Comparison of Different MLP Activation Functions in Classification

Emad A. M. Andrews Shenouda
Department of Computer Science, University of Toronto, Toronto, ON, Canada
emad@cs.toronto.edu

Abstract. The multilayer perceptron (MLP) has proven very successful in many applications, including classification. The activation function is the source of the MLP's power, and careful selection of the activation function has a huge impact on network performance. This paper gives a quantitative comparison of the four most commonly used activation functions, including the Gaussian RBF network, over ten different real datasets. Results show that the sigmoid activation function substantially outperforms the other activation functions. Also, by using only the needed number of hidden units in the MLP, we improved its convergence time to be competitive with the RBF networks most of the time.

1 Introduction

The introduction of the back-propagation algorithm (BP) is a landmark in neural networks (NN) [5]. The earliest description of BP was presented by Werbos in his PhD dissertation in 1974 [12], but it did not gain much publicity until it was independently rediscovered by Le Cun, Parker, Hinton and Williams [7]. The MLP is perhaps the most famous implementation of BP. It is used very successfully in many applications in various domains such as prediction, function approximation and classification. For classification, the MLP is considered a super-regression machine that can draw complicated decision borders between nonlinearly separable patterns [5].

The nonlinearity power of the MLP is due to the fact that all its neurons use a nonlinear activation function to calculate their outputs. An MLP with one hidden layer can form single convex decision regions, while adding more hidden layers can form arbitrary disjoint decision regions. In [6], Huang et al. showed that single hidden layer feedforward neural networks (SLFNs) with some unbounded activation functions can also form disjoint decision regions with arbitrary shapes. If a linear activation function is used, the whole MLP becomes a simple linear regression machine. Beyond determining the decision borders, the value of the activation function also determines the total signal strength the neuron will produce and receive. In turn, that affects almost all aspects of solving the problem at hand, such as the quality of the network's initial state, the speed of convergence and the efficiency of the synaptic weight updates. As a result, a careful selection of the activation function has a huge impact on MLP classification performance. In theory, BP is universal in this matter: any activation function can be used as long as it has a first derivative.
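To make the role of the activation function concrete, the following is a minimal NumPy sketch of a one-hidden-layer MLP forward pass. It is not the paper's code: the weight shapes, the random initialization and the choice of sigmoid here are illustrative assumptions only.

```python
import numpy as np

def sigmoid(n):
    """Asymmetric logistic activation: maps a net input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-n))

def mlp_forward(x, W_hidden, b_hidden, W_out, b_out, activation=sigmoid):
    """One-hidden-layer MLP forward pass.

    Every hidden unit applies the (nonlinear) activation to its net input;
    with a linear activation the whole network collapses to linear regression.
    """
    net_hidden = x @ W_hidden + b_hidden      # net input n_i of each hidden unit
    o_hidden = activation(net_hidden)         # unit outputs o_i
    net_out = o_hidden @ W_out + b_out
    return activation(net_out)                # output scores in (0, 1)

# Tiny usage example with random (illustrative) weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                   # one 8-dimensional input pattern
W_h, b_h = rng.normal(size=(8, 4)), np.zeros(4)
W_o, b_o = rng.normal(size=(4, 2)), np.zeros(2)
print(mlp_forward(x, W_h, b_h, W_o, b_o))
```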

Activation functions can be categorized into three basic families [15]: linear, e.g. step-like functions; logistic, e.g. sigmoidal functions; and radial basis functions (RBF), e.g. the Gaussian. In spite of the importance of the activation function as an integral part of any feedforward network, it has not been well investigated in the NN literature. Although Michie et al. showed that MLP and RBF networks are computationally equivalent, they could not interpret the discrepancy in classification performance: the MLP outperformed the RBF network but needed an order of magnitude more training time [2], [15]. In this paper, we compare the performance of the three families of functions with respect to classification over 10 different real datasets, using both batch and online learning. With careful MLP pruning, our results show that the MLP with sigmoid activation function is superior for classification, with training time less than or competitive with that of RBF networks.

1.1 Related Work

The mathematical and theoretical foundation of various MLP activation functions, including RBF, can be found in Duch and Jankowski [15]. In [14], the same authors provided a comprehensive survey of different activation functions. Others have compared the performance of different activation functions in the MLP. In [3], the authors provided a visual comparison of the speed of convergence of different linear and sigmoid functions; however, the comparison did not provide much information, since a single randomly generated, easy dataset was used.

Comparing the performance of MLP and RBF networks has attracted the attention of more researchers. While the RBF network outperformed the MLP for voice recognition in [10], other authors showed that the MLP is superior to the RBF network in fault tolerance and resource allocation applications [8], [9]. Various attempts exist to combine MLP and RBF networks into a single hybrid network that is less complex and performs better. In [1], the authors presented a new MLP-RBF architecture that outperformed both of the individual networks in a channel equalization application. In [4], the author provided a unified framework for both MLPs and RBF networks via conic section transfer functions.

For classification, the most comprehensive source of comparison is the StatLog report [2]. The report presents a performance comparison between different NN and statistical algorithms for classification over different datasets. Both MLP and RBF networks did well most of the time. The results show a huge discrepancy between RBF and MLP in terms of cross-validation error: the MLP outperformed the RBF network most of the time, and the RBF network sometimes did not report valid results. The only problem with the MLP's performance in this report is its training time, which was an order of magnitude greater than the RBF networks' training time.

In [13], Wang and Huang systematically studied both the sigmoidal and RBF activation functions in protein classification and reached the same conclusion as we did. Their work is significant because it uses Extreme Learning Machines (ELM) instead of BP. Unlike BP, which works only with differentiable activation functions, ELM can work with any nonlinear activation function; also, ELM does not need any control parameters. According to Wang and Huang, ELM achieved higher classification accuracy with up to four orders of magnitude less training time compared to BP.
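The ELM idea referenced above can be summarized in a few lines: the hidden-layer weights are chosen at random and only the output weights are solved for by least squares. The sketch below is a generic illustration, not Wang and Huang's implementation; the array shapes, the use of tanh and the pseudo-inverse solution are assumptions.

```python
import numpy as np

def elm_train(X, T, n_hidden=50, rng=np.random.default_rng(0)):
    """Extreme Learning Machine (sketch): random hidden layer + least-squares output.

    X: (n_samples, n_features) inputs; T: (n_samples, n_classes) one-hot targets.
    Any nonlinear activation works here, because no gradient is ever taken.
    """
    W = rng.normal(size=(X.shape[1], n_hidden))   # random hidden weights, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                        # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T                  # output weights by pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Illustrative usage on random data with toy one-hot labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
T = np.eye(2)[(X[:, 0] > 0).astype(int)]
W, b, beta = elm_train(X, T)
print(elm_predict(X[:3], W, b, beta))
```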

In addition to confirming that the MLP with sigmoid activation function is superior for classification applications, our contribution in this paper is to prune the MLP size significantly so that its training time becomes less than the time needed for the RBF network most of the time. We also show the relation between the problem dimensionality and number of classes on one side and the performance of the activation function on the other. We carefully observed the network initial state obtained with each activation function and related it to the network's ability to generalize.

2 Activation Functions in Comparison

In the following, $n_i$ denotes the net input to hidden unit $i$, and $o_i$ denotes the unit output produced by the activation function.

From the linear family we used the linear activation function, with constants $a$ and $b$:

$o_i = a n_i + b$, with derivative $a$.  (1)

From the logistic family, we used the two most commonly used functions. The sigmoid function (the asymmetric logistic function [5]):

$o_i = \dfrac{1}{1 + e^{-n_i}}$, with derivative $o_i (1 - o_i)$.  (2)

The hyperbolic tangent function (the symmetric logistic function [5]):

$o_i = \dfrac{e^{n_i} - e^{-n_i}}{e^{n_i} + e^{-n_i}} = \dfrac{1 - e^{-2 n_i}}{1 + e^{-2 n_i}}$, with derivative $(1 - o_i)(1 + o_i)$.  (3)

These two functions are the most widely used in MLP implementations because their derivatives are easy to compute and can be expressed directly in terms of the unit output. Also, their curve shape contains a linear region, which makes them the most suitable functions for regularizing the network using weight decay [11].

From the RBF family, we used the Gaussian activation function:

$o_i = \exp\left(-\dfrac{(n_i - m_i)^2}{2\sigma^2}\right)$, with derivative $-o_i \dfrac{(n_i - m_i)}{\sigma^2}$,  (4)

where $m_i$ and $\sigma$ are the center and the width, respectively. The reader is directed to [16] for further net-input calculations and activation functions.
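For reference, the four activation functions and their derivatives in Eqs. (1)-(4) can be written directly in code. The following NumPy sketch is illustrative only; the function names and the choice of a = 1, b = 0 for the linear case are assumptions, not part of the paper.

```python
import numpy as np

# Linear family (Eq. 1), with a = 1, b = 0 chosen for illustration.
def linear(n, a=1.0, b=0.0):
    return a * n + b

def linear_deriv(n, a=1.0, b=0.0):
    return np.full_like(n, a)

# Logistic family: sigmoid (Eq. 2) and hyperbolic tangent (Eq. 3).
def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def sigmoid_deriv(n):
    o = sigmoid(n)
    return o * (1.0 - o)               # derivative expressed via the output

def tanh(n):
    return np.tanh(n)

def tanh_deriv(n):
    o = np.tanh(n)
    return (1.0 - o) * (1.0 + o)       # = 1 - o**2

# RBF family: Gaussian (Eq. 4), with center m and width sigma.
def gaussian(n, m=0.0, sigma=1.0):
    return np.exp(-0.5 * ((n - m) / sigma) ** 2)

def gaussian_deriv(n, m=0.0, sigma=1.0):
    return -gaussian(n, m, sigma) * (n - m) / sigma ** 2

# Quick check over a range of net inputs.
n = np.linspace(-3, 3, 7)
for f, df in [(linear, linear_deriv), (sigmoid, sigmoid_deriv),
              (tanh, tanh_deriv), (gaussian, gaussian_deriv)]:
    print(f.__name__, np.round(f(n), 3), np.round(df(n), 3))
```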

3 Datasets Selection

We tried to cover all possible dataset criteria. Table 1 summarizes the datasets used.

Table 1. Dataset characteristics

 #   Dataset        Cases  Dimension  Classes  % continuous features  % nominal features  Missing values
 1   Optdigit        5620      64       10            100                    0              No
 2   B_C_W            699       9        2            100                    0              Yes
 3   Wpbc             198      30        2            100                    0              Yes
 4   Wdbc             569      30        2            100                    0              No
 5   Dermatology      366      34        6             97                    3              Yes
 6   Car             1728       6        4              0                  100              No
 7   Monks            196       6        2              0                  100              No
 8   Pima_diabetes    768       8        2            100                    0              No
 9   Sonar            208      60        2            100                    0              No
10   Votes            435      16        2              0                  100              Yes

4 Number of Hidden Units and Centers of RBF

To reduce MLP training time, all networks used in the experiments have one hidden layer, which is the case by definition in RBF networks. For the MLP, network pruning is recommended by both Zurada and Haykin [17], [5]. Instead of pruning only the weights, we started with a number of hidden units equal to twice the number of input features. Then, as long as the cross-validation (CV) error was not affected, we kept dividing this number by 2; a sketch of this procedure is given below. By doing this, the number of hidden units in all the MLPs used was in the range between 4 and 5. From the four methods proposed in [5] to adapt the RBF centers, we chose the one closest to the MLP implementation, where the number of hidden units is fixed and the centers are adapted via unsupervised learning. The width of all functions is kept constant at 1. Cross validation is our main method for regularization and stopping.
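The halving procedure described above is simple enough to state in code. The sketch below is an illustrative outline only: `train_and_cv_error` is a hypothetical stand-in for training an MLP of a given size and returning its cross-validation error, and the tolerance used to decide that the CV error is "not affected" is an assumption.

```python
def prune_hidden_units(n_features, train_and_cv_error, tol=0.005):
    """Halve the hidden-layer size while the CV error stays (almost) unchanged.

    n_features:          number of input features of the dataset
    train_and_cv_error:  callable(n_hidden) -> cross-validation error (hypothetical)
    tol:                 allowed CV-error increase before we stop shrinking (assumed)
    """
    n_hidden = 2 * n_features                 # start with twice the input features
    best_error = train_and_cv_error(n_hidden)
    while n_hidden > 2:
        candidate = n_hidden // 2             # keep dividing the size by 2
        error = train_and_cv_error(candidate)
        if error > best_error + tol:          # CV error affected: keep current size
            break
        n_hidden, best_error = candidate, error
    return n_hidden

# Illustrative usage with a fake evaluation function that prefers >= 8 hidden units.
if __name__ == "__main__":
    fake_cv = lambda h: 0.10 if h >= 8 else 0.25
    print(prune_hidden_units(64, fake_cv))    # -> 8 in this toy example
```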

5 Results

Results for the different activation functions using batch learning are tabulated in Table 2; the analogous results for online learning are in Table 3. In both tables, A is the network initial state (starting MSE) after the first 50 epochs, B is the best MSE achieved during training, C is the test MSE, and D is the training time in seconds. The values in columns A, B and C are in units of $10^{-3}$; bold figures in the original tables mark the best test MSE.

Table 2. Results for batch learning (each dataset row lists A, B, C, D for the Sigmoid, Tanh, Linear and Gaussian (RBF) networks, in that order)

DS 1:  83.3 7.7 11.6 35536 177 11. 35.1 1395 39 39 330 346 33 8.3 91. 55461
DS 2:  186 4.1 1783 93.4.8 16.5 6 136 136 68.3 64 540 87.6 17.9 1909
DS 3:  167 108 14 186 478 385 100 63 63 458 60 613 455 716 187
DS 4:  179 11. 17. 103 13 10.6 90.4 169 57 45 45 84 470 90.5 11.3 559
DS 5:  88.8 3.9 3.8 61 57.7 6.8 17.4 50 30 30 383 158 98 5.5 18.7 40594
DS 6:  38.1 4.3 94.5 350 0.8 16.7 189 318 76.9 7.6 57 3 16 63. 307 650
DS 7:  05 0 0 139 810 799 817 194 809 809 808 10 811 804 901 339
DS 8:  178 11 13 867 514 416 493 164 50 50 50 60 698 57 507 1395
DS 9:  184 95.5 168 8 193.9 1454 18 808 808 81 54 801 369 1133 367
DS 10: 149 135 166 116 53 164 676 18 533 51 583 78 59 510 550 160

Table 3. Results for online learning (same column layout as Table 2)

DS 1:  7.5 1 7.11 1330 1 13.1 3.9 1490 348 348 354 90 74 49.5 56.6 10071
DS 2:  4..739 3.8 1446 55.6 0 3. 734.4 160 160 69 90 101 90.3 9.8 360
DS 3:  107 0 88 600 46 0 169 833 656 656 558 86 655 489 1098 80
DS 4:  15.5 1 49.6 130 71.6 58.1 14.9 65 79 79 56 75 110 65.3 40.6 07
DS 5:  4.6 0 6.39 858 5. 0 17 77 36 36 377 146 38.9 14.6 0 35781
DS 6:  11.4 3.7 11.8 1389 4.3 19.8 50.3 100 6.6 6.6 174 943 38. 30.5 106 600
DS 7:  04 179 0 454 895 894 963 300 894 894 984 650 1366 1361 1459 340
DS 8:  14 84.9 167 733 670 95 750 734 703 703 499 390 581 544 535 597
DS 9:  10.7 0 164 50 349 1.1 447 310 38 38 671 4 0 4.4 1038 97
DS 10: 133 41.1 143 88 55 357 935 316 69 69 58 10 537 513 55 8

6 Results Analysis

We analyze the results with respect to MSE, training time and network initial state, in both online and batch learning. Unlike the StatLog report [2], we did not find a huge discrepancy in performance between the MLP and the RBF network. In spite of the expected poor performance of the linear function compared to the other functions, it receives special attention in the network initial state section.

6.1 Test MSE

From the above tables and from Figures 1 and 2, we can observe that the sigmoidal activation function achieved the best classification MSE in 9 of the 10 datasets. On dataset 4, the RBF network obtained the best MSE in both online and batch learning, but it is still close to the MLP's MSE. It is also natural for the tanh MSE to be closest to the sigmoid MSE most of the time: both functions belong to the logistic family and are geometrically equivalent except for their minimum values, i.e., the lower saturation region.

For the sigmoid, tanh and RBF functions, it is observable that:

1- Except for dataset 4 in online learning, the MSE for the three functions tends to increase or decrease together for each dataset.
2- The maximum value of the MSE for each dataset is the same or very close in both online and batch learning.
3- The sigmoid function shows the least variation in its MSE between online and batch learning. The difference between the average online test MSE and the average batch test MSE over the 10 datasets was 0.0149, 0.0771 and 0.068 for the sigmoid, tanh and RBF, respectively (the computation is sketched below). This means that dataset characteristics such as dimensionality, feature type and number of classes do not have a remarkable effect on the sigmoid function's ability to generalize.
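The comparison in point 3 is, under one plausible reading, a gap between the average online and average batch test MSE. The snippet below shows that computation on made-up arrays; the numbers are placeholders, not the paper's values.

```python
import numpy as np

# Hypothetical per-dataset test MSE for one activation function, trained online
# and in batch mode; placeholder values, not the paper's data.
online_mse = np.array([0.012, 0.004, 0.088, 0.050, 0.006, 0.012, 0.200, 0.167, 0.164, 0.143])
batch_mse  = np.array([0.012, 0.017, 0.140, 0.017, 0.004, 0.095, 0.200, 0.130, 0.168, 0.166])

# Gap between the average online test MSE and the average batch test MSE.
gap = abs(online_mse.mean() - batch_mse.mean())
print(f"online-vs-batch average test-MSE gap: {gap:.4f}")
```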

The results also show no clear or direct relation between the dataset dimensionality and the network's ability to generalize for the three activation functions. For example, the three activation functions produced a lower MSE on dataset 1 than on dataset 6, whose dimensionality is less than one tenth as large, while the situation is reversed for datasets 2 and 9.

Fig. 1. Test MSE, online learning.
Fig. 2. Test MSE, batch learning.

6.2 Training Time

The MLP with sigmoid activation not only achieved the best MSE, but also needed the least training time in either online or batch learning for every dataset. For example, on dataset 3 the sigmoid function consumed slightly more time than the RBF function using online learning, but it consumed less time on the same dataset in batch learning; the situation is reversed for dataset 4. This means that, by choosing the right training method and only the needed number of hidden units, we can reduce the MLP training time to be less than or competitive with the RBF networks' training time. Although the difference is minimal, the tanh function tended to consume less time than the sigmoid function in many cases; this small difference should not give the tanh preference over the sigmoid function, because the sigmoid MSE is the best most of the time. Excluding the first and fourth datasets, the batch learning training time is, on average, less than the online learning training time.

Fig. 3. Training time, online learning.
Fig. 4. Training time, batch learning.

For the sigmoid, tanh and RBF functions we can observe the following:

1- Most of the time, in both online and batch learning, the training time for the three functions tends to increase or decrease together for each dataset.
2- The sigmoid and tanh functions tend to show less variance than the RBF networks in the time needed for each dataset. This means that the dimensionality of the dataset does not have a remarkable effect on the MLP training time as long as there are enough examples.
3- On the other hand, the results show a strong relation between the number of classes and the time needed for training in both MLP and RBF networks. This is observable by inspecting datasets 1 and 9: they have dimensionalities of 64 and 60 and numbers of classes of 10 and 2, respectively. Although their dimensionalities are close and both consist entirely of continuous features, the time needed for the first dataset is an order of magnitude more than the time needed for the 9th dataset. The same relation holds for nominal data types, as in datasets 6 and 7.

6.3 The Network Initial State

By network initial state we mean the MSE after 50 epochs of training. This section is very important, as it carries, to some extent, the explanation of the previous analysis. It also clarifies the behavior of the linear activation function.

Concerning the linear activation function, observing columns A and B in Tables 2 and 3 makes it clear that the final MSE after training is identical to the initial network state in 18 experiments out of 20, or 90% of the time; in the remaining 10% the difference is almost negligible. This happens because the linear activation function turns the whole MLP into an ordinary linear regression machine, which does not tend to learn more from further iterations over the same data. However, it might improve its MSE when more examples are introduced.

Fig. 5. Network initial state after 50 epochs, online learning.
Fig. 6. Network initial state after 50 epochs, batch learning.

Our results concerning this function agree with the results in [3], where the authors stated that they experienced the same behavior from the step-like function after 10 epochs. That is why it would be misleading to compare the linear activation function's training time with that of the other functions: it reaches its maximum generalization ability very quickly but always produces the highest MSE.

From the above tables and Figures 5 and 6, it is observable that the sigmoid activation function tends to produce the best network state after the first 50 epochs. The next best initial state was for the tanh, while the RBF and linear activation functions come last. The most important observation here is the analogy between Figures 5 and 1 and between Figures 6 and 2: the MSE values obtained closely mirror each other for each function on each dataset. That means that whenever a function reaches a good initial state, it generalizes well, and vice versa. So observing the network initial state is an essential task: it shows how fast the MSE drops and what value it will reach, which might indicate that the network parameters need some modification or that the number of hidden units is not enough. Sometimes simply re-randomizing the weights and neuron thresholds will put the network in a better initial state. Observing the network initial state was also very useful while pruning the MLP hidden units to decrease the time needed for training; a small sketch of this monitoring idea follows.
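A minimal sketch of that monitoring-and-restart idea, under the assumption of a hypothetical `train_epochs(weights, n_epochs)` routine that returns the training MSE and an arbitrary acceptance threshold; neither is specified in the paper.

```python
import numpy as np

def good_initial_state(init_weights, train_epochs, threshold=0.15,
                       max_restarts=5, rng=np.random.default_rng(0)):
    """Restart from fresh random weights until the MSE after 50 epochs is acceptable
    (or the restart budget is exhausted), keeping the best state seen.

    init_weights:  callable(rng) -> fresh random weights/thresholds (hypothetical)
    train_epochs:  callable(weights, n_epochs) -> MSE after training (hypothetical)
    threshold:     assumed MSE level that counts as a 'good' initial state
    """
    best_weights, best_mse = None, np.inf
    for _ in range(max_restarts):
        weights = init_weights(rng)
        mse = train_epochs(weights, 50)       # network initial state after 50 epochs
        if mse < best_mse:
            best_weights, best_mse = weights, mse
        if mse <= threshold:                  # good initial state: keep these weights
            break
    return best_weights, best_mse

# Illustrative usage with toy stand-ins for the two callables.
if __name__ == "__main__":
    fake_init = lambda rng: rng.normal(size=10)
    fake_train = lambda w, n_epochs: float(np.abs(w).mean())   # pretend MSE
    w, mse = good_initial_state(fake_init, fake_train)
    print(f"initial-state MSE kept: {mse:.3f}")
```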

7 Concluding Remarks and Future Work

We showed that the sigmoidal activation function is the best for classification applications in terms of both time and MSE. We showed a relation between the dataset dimensionality and number of classes and the time needed for training. The results also indicated that the network initial state highly impacts the test MSE and, consequently, the overall ability of the network to generalize.

Still, more research can and should be done concerning MLP activation functions. It would be very interesting to use the EM algorithm to adapt the centers of RBF networks. Also, the performance of RBF networks with a huge fixed number of centers versus ones with few but adapted centers is still an open issue. Haykin showed that they should perform the same, but he also showed that this contradicts the NETtalk experiment, where using unsupervised learning to adapt the centers of the RBF network always resulted in poorer generalization than the MLP, whereas the opposite held when supervised learning was used.

Acknowledgement

The author would like to thank Prof. Anthony J. Bonner for his valuable advice on the final preparation of this manuscript.

References

1. Lu, B., Evans, B.L.: Channel Equalization by Feedforward Neural Networks. In: IEEE Int. Symposium on Circuits and Systems, Vol. 5. Orlando, FL (1999) 587-590
2. Michie, D., Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification. Ellis Horwood, London (1994)
3. Piekniewski, F., Rybicki, L.: Visual Comparison of Performance for Different Activation Functions in MLP Networks. In: IJCNN 2004 & FUZZ-IEEE, Vol. 4. Budapest (2004) 947-953

4. Dorffner, G.: A Unified Framework for MLPs and RBFNs: Introducing Conic Section Function Networks. Cybernetics and Systems 25(4) (1994) 511-554
5. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey (1999)
6. Huang, G., Chen, Y., Babri, H.A.: Classification Ability of Single Hidden Layer Feedforward Neural Networks. IEEE Transactions on Neural Networks 11(3) (2000) 799-801
7. Le Cun, Y., Touretzky, D., Hinton, G., Sejnowski, T.: A Theoretical Framework for Back-Propagation. In: The Connectionist Models Summer School (1988) 21-28
8. Li, Y., Pont, M.J., Jones, N.B.: A Comparison of the Performance of Radial Basis Function and Multi-layer Perceptron Networks in Condition Monitoring and Fault Diagnosis. In: The International Conference on Condition Monitoring. Swansea (1999) 577-592
9. Arahal, M.R., Camacho, E.F.: Application of the RAN Algorithm to the Problem of Short Term Load Forecasting. Technical Report, University of Sevilla, Sevilla (1996)
10. Finan, R.A., Sapeluk, A.T., Damper, R.I.: Comparison of Multilayer and Radial Basis Function Neural Networks for Text-Dependent Speaker Recognition. In: IEEE Int. Conf. on Neural Networks, Vol. 4. Washington DC (1996) 1992-1997
11. Karkkainen, T.: MLP in Layer-Wise Form with Applications to Weight Decay. Neural Computation 14(6) (2002) 1451-1480
12. Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Doctoral Thesis, Applied Mathematics, Harvard University, Boston (1974)
13. Wang, D., Huang, G.: Protein Sequence Classification Using Extreme Learning Machine. In: IJCNN 2005, Vol. 3. Montréal (2005) 1406-1411
14. Duch, W., Jankowski, N.: Survey of Neural Transfer Functions. Neural Computing Surveys 2 (1999) 163-212
15. Duch, W., Jankowski, N.: Transfer Functions: Hidden Possibilities for Better Neural Networks. In: 9th European Symposium on Artificial Neural Networks. Bruges (2001) 81-94
16. Hu, Y., Hwang, J.: Handbook of Neural Network Signal Processing. 3rd edn. CRC Press, Florida (2002)
17. Zurada, J.M.: Introduction to Artificial Neural Systems. PWS Publishing, Boston (1999)