STRUCTURE ANALYSIS OF NEURAL NETWORKS


STRUCTURE ANALYSIS OF NEURAL NETWORKS

DING SHENQIANG

NATIONAL UNIVERSITY OF SINGAPORE
2004


STRUCTURE ANALYSIS OF NEURAL NETWORKS

DING SHENQIANG
(B. Eng., University of Science and Technology of China)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2004

Acknowledgements

I would like to express my most sincere appreciation to my supervisor, Dr. Xiang Cheng, for his good guidance, support and encouragement. His stimulating advice has helped me overcome obstacles on my research path. I am also grateful to the Center for Intelligent Control (CIC), as well as the Control and Simulation Lab, Department of Electrical and Computer Engineering, National University of Singapore, which provided the facilities to conduct the research work. I also wish to acknowledge the National University of Singapore (NUS) for the financial support provided throughout my research work. Thanks to many of my friends in the Control and Simulation Lab, who have contributed in various ways to my research and life here in the past two years. Finally, special thanks to my wife Sun Yu, for her love and patience.

Table of Contents

Acknowledgements
Table of Contents
Summary
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Artificial Neural Networks
  1.2 Statement of the Structure Analysis Problem of Neural Networks
  1.3 Thesis Outline
Chapter 2 Architecture Selection of Multi-layer Perceptron
  2.1 Introduction
  2.2 Geometrical Interpretation of MLP
  2.3 Selection of Number of Hidden Neurons for Three-layered MLP
  2.4 Advantage Offered by Four-layered MLP
  2.5 Conclusions
Chapter 3 Overfitting Problem of MLP
  3.1 Overfitting Problem Overview
  3.2 Comparative Study of Available Methods
    Model Selection
    Early Stopping
    Regularization Methods

  3.3 Conclusions
Chapter 4 From Multilayer Perceptron to Radial Basis Function Network
  4.1 Introduction to Radial Basis Function Network
  4.2 Two-stage Training of Radial Basis Function Networks
  4.3 One-stage Supervised Training of Radial Basis Function Networks
  4.4 Difference Comparing to Multilayer Perceptron
  4.5 MLP with Additional Second Order Inputs
  4.6 Comparative Study
    Experimental Setup
    Experimental Results
  4.7 Conclusions
Chapter 5 Conclusions and Future Works
  5.1 Conclusions
  5.2 Future Works
References
List of Publications

Summary

This work seeks to conduct structure analysis of artificial neural networks, especially feedforward neural networks such as multilayer perceptrons (MLP) and radial basis function networks (RBFN). First of all, a brief introduction to artificial neural networks is given; the background and the necessity of the structure analysis problem are also stated. Then a geometrical interpretation of the multilayer perceptron based on the geometrical meaning of the weights of a single hidden neuron is presented. This interpretation is first suggested for the case when the activation function of the hidden neuron is a piecewise-linear function and is then extended naturally to the case of sigmoid activation functions. Following this, a general guideline for selecting the number of hidden neurons for three-layered (with one hidden layer) MLP is proposed based upon the geometrical interpretation. The effectiveness of this guideline is illustrated by a couple of simulation examples. Subsequently, the attention is shifted to the controversial issue of whether the four-layered (with two hidden layers) MLP is superior to the three-layered MLP. With the aid of the geometrical interpretation and also through careful examination of the various contradictory results reported in the literature, it is demonstrated that in many cases the four-layered MLP is slightly more efficient than the three-layered MLP in terms of the minimal number of parameters required for approximating the target function, and for a certain class of problems the four-layered MLP outperforms the three-layered MLP significantly.

After that, the overfitting problem of MLP is examined; a comparative study is carried out on various alleviating methods, and the reasons behind these methods are reviewed based on the geometrical interpretation. In particular, the popular regularization methods are studied in detail. Not only can the reason why regularization methods are effective in alleviating over-fitting be simply explained by the geometrical interpretation, but a potential problem with regularization is also predicted and verified. Afterward, another popular feedforward neural network, the radial basis function network, is visited. A special additional input, which is the sum of the squares of the other inputs, is added to the standard multilayer perceptron, so that the multilayer perceptron works similarly to the radial basis function network with localized response. Specifically, we will show that a three-layered multilayer perceptron with exponential activation functions and this kind of additional input is naturally a generalized radial basis function network, and such a multilayer perceptron can be trained using the well-developed training strategies of multilayer perceptrons. Then a comparative study is conducted between multilayer perceptrons, multilayer perceptrons with additional inputs, and radial basis function networks trained by various methods. Finally, a conclusion of the whole thesis is presented and the direction of future research is also pointed out.

List of Figures

Fig. 1.1. A nonlinear model of a single neuron ANN
Fig. 1.2. General structure of a feedforward ANN
Fig. 1.3. A simple one-dimensional function approximation problem
Fig. 2.1. Piecewise linear activation function
Fig. 2.2. Weighted piecewise linear function
Fig. 2.3. Overlapping of basic building-blocks
Fig. 2.4. Sigmoid activation function
Fig. 2.5. Two-dimensional building-block
Fig. 2.6. A simple one-dimensional example
Fig. 2.7. The noisy one-dimensional example
Fig. 2.8. A complicated one-dimensional example
Fig. 2.9. Approximation of Gaussian function
Fig. 2.10. Piecewise planes needed to construct the basic shape
Fig. 2.11. A more complicated two-dimensional example
Fig. 2.12. Flowchart of the simplified EPNET
Fig. 2.13. Approximating hill and valley with a four-layered MLP
Fig. 2.14. Approximating hill and valley with a 2-5-1 network
Fig. 2.15. The modified hill and valley example
Fig. 2.16. The outputs of the neurons in the second hidden layer of the four-layered MLP

Fig. 2.17. The outputs of the neurons in the second hidden layer with different initialization
Fig. 3.1. Example of over-fitting problem
Fig. 3.2. Early stopping for overcoming over-fitting problem
Fig. 3.3. Bayesian regularization for overcoming over-fitting problem
Fig. 3.4. A simple example where Bayesian regularization fails
Fig. 4.1. Three-layered structure of radial basis function network
Fig. 4.2. Structure of CBP network
Fig. 4.3. Approximation with 50 hidden neurons
Fig. 4.4. The approximation error of ECBP network
Fig. 4.5. The two-spiral problem
Fig. 4.6. Approximation results for the two-spiral problem
Fig. 4.7. The training and test data of the simple linear separation problem
Fig. 4.8. The classification problem of two Gaussian distributions
Fig. 4.9. The classification results for the two Gaussian distributions

List of Tables

Table 2.1. Significantly different performances of the two MLPs
Table 2.2. Performance comparison of EPNET with different initial populations
Table 4.1. The minimum number of hidden neurons needed for simulation
Table 4.2. The minimum number of hidden neurons needed for simulation
Table 4.3. The minimum number of hidden neurons needed for simulation
Table 4.4. The minimum number of hidden neurons needed for simulation
Table 4.5. The minimum number of hidden neurons needed for simulation
Table 4.6. The minimum number of hidden neurons needed for simulation

Chapter 1 Introduction

1.1 Artificial Neural Networks

Artificial neural networks (usually shortened to "neural networks") were originally motivated by biological neural networks such as the brain and the human nervous system. The first artificial neural network, the perceptron, was developed by Rosenblatt (1959) from the biological neuron model of McCulloch and Pitts (1943). Despite originating from the biological system, artificial neural networks are widely used as problem-solving algorithms rather than as accurate representations of the human nervous system (Ripley 1994). However, artificial neural networks still emulate biological neural networks in the following main aspects:

1. Each basic unit of the artificial neural network is a simplified version of the biological neuron.
2. Each basic unit is connected to a massive network in parallel.
3. Each basic unit has an activation function.
4. Learning of the network is done by adjusting the connections (weights) between the basic units.

There is still no formal definition of artificial neural networks; one recent definition was given by Haykin (1999): "A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: 1. Knowledge is acquired by the network from its environment through a learning process. 2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge."

Fig. 1.1 gives a mathematical model of the simplest artificial neural network with only one basic unit. Three essential elements are noted: the connection weights, the summation operator and the activation function. Another term, the bias, adjusts the value of the summation. We may describe the model with equation (1.1), where x_1, x_2, ..., x_m are the input signals; w_1, w_2, ..., w_m are the connection weights; b is the bias; and \varphi(\cdot) is the activation function. When outside signals are fed to the neural network, the inputs first go through the connection weights, which yields the weighted inputs; then the summation operator takes effect; and finally the weighted sum of the inputs and the bias is sent to the activation function to give the final output.

y_{out} = \varphi\Big(\sum_{i=1}^{m} w_i x_i + b\Big) \quad (1.1)

Fig. 1.1. A nonlinear model of a single neuron ANN (inputs x_1, ..., x_m, weights w_1, ..., w_m, bias b, summation, activation function \varphi(\cdot), output y_out).

Normally, the artificial neural network contains many of this kind of basic unit, distributed in different layers. A more general structure of an artificial neural network is provided in Fig. 1.2.

Fig. 1.2. General structure of a feedforward ANN (inputs, hidden layers with biases, output layer).
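As a concrete illustration of equation (1.1), the short MATLAB sketch below (not part of the original thesis; the numerical values are arbitrary examples) computes the output of a single neuron for one input vector, using the logistic sigmoid as the activation function \varphi(\cdot).

% Forward pass of a single artificial neuron, equation (1.1)
x   = [0.5; -1.2; 2.0];           % input signals x_1 ... x_m
w   = [0.8;  0.3; -0.5];          % connection weights w_1 ... w_m
b   = 0.1;                        % bias
phi = @(v) 1 ./ (1 + exp(-v));    % logistic sigmoid activation function
v     = w' * x + b;               % weighted sum of the inputs plus the bias
y_out = phi(v)                    % final output of the neuron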

Please note that some neural networks do have reverse (feedback) signal flow, such as recurrent neural networks. In this thesis, feedforward neural networks such as the multilayer perceptron and radial basis function networks are studied.

1.2 Statement of the Structure Analysis Problem of Neural Networks

Although neural networks are used widely and successfully in many application areas, how to select the structure of a specified neural network is still a very essential problem. For example, if the multilayer perceptron network is chosen, the practitioner still faces many decisions about the structure of the multilayer perceptron to be used. Given the function approximation example shown in Figure 1.3, for instance, how many hidden layers should be used and how many neurons should be chosen for each hidden layer?

Fig. 1.3. A simple one-dimensional function approximation problem.

Unfortunately, there is no foolproof recipe at the present time, and the designer has to make seemingly arbitrary choices regarding the number of hidden layers and neurons. The common practice is just to regard the multilayer perceptron as a sort of magic

black box and to choose a sufficiently large number of neurons such that it can solve the practical problem at hand. Designing and training a neural network and making it work seem more of an art than a science. Without a deep understanding of the design parameters, some people still feel uneasy about using the multilayer perceptron, even though neural networks have already proven to be very effective in a wide spectrum of applications, in particular function approximation and pattern recognition problems.

Traditionally, the main focus regarding the architecture selection of MLP has been centered upon the growing and pruning techniques (Mozer and Smolensky 1989; Karnin 1990; LeCun et al. 1990; Weigend et al. 1991; Hassibi et al. 1992; Reed 1993; Hush 1997). In network growing techniques, we often start with a small network to solve the problem at hand and add additional neurons or layers only if the current network is unable to meet the criterion. Network pruning, in contrast, chooses a network larger than necessary at first and then removes the redundant part. More effort has been put on the pruning techniques in the literature; the pruning techniques mainly include sensitivity calculation methods and regularization methods (Reed 1993). The sensitivity calculation methods usually estimate the sensitivity of each neuron or connection and delete those with less sensitivity or less importance (Mozer and Smolensky 1989; Karnin 1990; Reed 1993). The regularization methods incorporate an additional term in the standard error cost function. This additional penalty term is a complexity penalty, which is usually a function of the weights (Plaut et al. 1986; Chauvin 1989; Ji et al. 1990; Weigend et al. 1991; Nowlan et al. 1992; Moody and Rögnvaldsson 1997). One attractive advantage of the regularization methods is that the training and pruning are done simultaneously, which will lead to a more optimal

solution. However, for the pruning algorithms, when to stop the pruning procedure or how to choose the regularization parameter is still a problem.

Recently, much attention has also been drawn to applying evolutionary algorithms to evolve both the parameters and architectures of artificial neural networks (Alpaydin 1994; Jasic and Poh 1995; Sarkar and Yegnanarayana 1997; Castillo 2000). Such hybrid algorithms are commonly referred to in the literature as evolutionary artificial neural networks (EANN); for a detailed survey see (Yao 1999). One essential feature of EANN is the combination of two distinct forms of adaptation, i.e., learning and evolution, which makes the hybrid systems adapt to the environment more efficiently. However, one major drawback of EANN is that its adaptation speed is usually very slow due to its nature of population-based random search.

In all the approaches discussed above, a priori information regarding the geometrical shape of the target function is generally not exploited to aid the architecture design of the multilayer perceptron. Thus how to simplify the task of architecture selection using this geometrical information is a very interesting and challenging problem.

The overfitting problem of neural networks is also essential, because in most cases what we focus on is how the neural network acts on unseen inputs, which is called generalization performance. Normally, we take for granted that it is the size of the neural network that dominates the generalization performance. However, Bartlett (1997) stated that the size of the weights is more important than the size of the network for generalization performance. Therefore a deeper insight into how the structure of neural networks influences the generalization performance is desirable.

The radial basis function network is another very popular feedforward neural network. There is normally only one hidden layer in its structure, so choosing the number of hidden layers is not a problem for the radial basis function network, but it still faces the problem of deciding the number of hidden neurons. Moreover, the radial basis function network has the additional problem of deciding the locations and the spreads of the basis functions. There are various methods to determine the locations and spreads of the basis functions, which are usually separated from the calculation of the output weights. One-stage supervised training algorithms that decide all the parameters simultaneously are also available. However, the supervised training of radial basis function networks is immature compared with the well-developed training algorithms for multilayer perceptrons. Thus, a comparative study of these available methods is also attractive.

The local responses of multilayer perceptrons with a certain class of additional inputs or normalized inputs are reported in the literature (Casasent 1992; Maruyama et al. 1992; Sarajedini and Hecht-Nielsen 1992; Ridella et al. 1997). The connection between these differently structured multilayer perceptrons and radial basis function networks is also a very interesting problem. A multilayer perceptron with an additional second order input, which is the sum of the squares of the other inputs, can approximate a radial basis function arbitrarily well. At the same time, another question arises: can such a multilayer perceptron represent a radial basis function network exactly? If the answer is positive, how does it perform compared with the standard multilayer perceptron and radial basis function networks?

1.3 Thesis Outline

This thesis consists of five chapters. Chapter 2 presents a geometrical interpretation of the multilayer perceptron based on the geometrical meaning of the weights of a single hidden neuron, discusses the selection of the hidden neurons in three-layered multilayer perceptrons, and analyzes the advantages offered by four-layered multilayer perceptrons. Chapter 3 gives an overview of the overfitting problem; a comparative study is carried out on various methods for alleviating this problem, and the reasons behind these methods are reviewed based upon the geometrical interpretation in Chapter 2. In Chapter 4, another popular feedforward neural network, the radial basis function network, is visited. A special additional input, which is the sum of the squares of the other inputs, is added to the standard multilayer perceptron, so that the multilayer perceptron works similarly to the radial basis function network with localized response. Specifically, we will show that a three-layered multilayer perceptron with exponential activation functions and this kind of additional input is naturally a generalized radial basis function network, which can be trained with the well-developed training strategies of multilayer perceptrons. Then a comparative study is conducted between multilayer perceptrons, multilayer perceptrons with additional inputs, and radial basis function networks trained by various methods. Chapter 5 concludes the whole thesis and points out the direction of future research.

Chapter 2 Architecture Selection of Multi-layer Perceptron

2.1 Introduction

As mentioned in the previous chapter, every practitioner of the multilayer perceptron (MLP) faces the same architecture selection problem: how many hidden layers to use and how many neurons to choose for each hidden layer? The common practice is still based on a trial and error method, choosing the number of neurons manually until the network can solve the practical problem at hand. Traditionally, the main focus regarding the architecture selection of MLP has been centered upon the growing and pruning techniques (LeCun et al. 1990; Weigend et al. 1991; Hassibi et al. 1992; Hush 1997). Recently, evolutionary artificial neural networks (EANN) (Alpaydin 1994; Jasic and Poh 1995; Sarkar and Yegnanarayana 1997; Yao 1999; Castillo 2000) have emerged as alternative methods for the architecture selection problem. However, the adaptation speed of EANN is usually very slow due to its nature of population-based random search.

In previous approaches, a priori information regarding the geometrical shape of the target function is generally not exploited to aid the architecture design of the MLP. In contrast to them, it will be demonstrated in this chapter that it is exactly this geometrical information that can simplify the task of architecture selection significantly. We wish

to suggest some general guidelines for selecting the architecture of the MLP, i.e., the number of hidden layers as well as the number of hidden neurons, provided that the basic geometrical shape of the target function is known in advance, or can be perceived from the training data. These guidelines will be based upon the geometrical interpretation of the weights, the biases, and the number of hidden neurons and layers, which will be given in the next section of this chapter. It will be shown that the architecture designed from these guidelines is usually very close to the minimal architecture needed for approximating the target function satisfactorily, and in many cases it is the minimal architecture itself. As we know, searching for a minimal or near-minimal structure of the MLP for a given target function is very critical, not only for the obvious reason that the least amount of computation would be required by the minimally structured MLP, but also for a much deeper reason: the minimally structured MLP would provide the best generalization in most cases. It is well known that neural networks can easily fall into the trap of over-fitting, and supplying a minimal structure is the best medicine to alleviate this problem.

In the next section, the geometrical interpretation of the MLP will be presented. This interpretation will first be suggested for the case when the activation function of the hidden neuron is a piecewise-linear function and then extended naturally to the case of sigmoid activation functions. Following this, a general guideline for selecting the number of hidden neurons for three-layered (with one hidden layer) MLP will be proposed based upon the geometrical interpretation. The effectiveness of this guideline will be illustrated by a couple of simulation examples. Finally we will turn our

attention to the controversial issue of whether the four-layered (with two hidden layers) MLP is superior to the three-layered MLP. With the aid of the geometrical interpretation and also through careful examination of the various contradictory results reported in the literature, it will be demonstrated that in many cases the four-layered MLP is slightly more efficient than the three-layered MLP in terms of the minimal number of parameters required for approximating the target function, and for a certain class of problems the four-layered MLP outperforms the three-layered MLP significantly.

2.2 Geometrical Interpretation of MLP

Consider a three-layered 1-N-1 MLP, with one input neuron, N hidden neurons and one output neuron. The activation function for the hidden neurons is the piecewise-linear function described by

\varphi(v) = \begin{cases} 1, & v \ge 0.5 \\ v + 0.5, & -0.5 < v < 0.5 \\ 0, & v \le -0.5 \end{cases} \quad (2.1)

and plotted in Figure 2.1.

Fig. 2.1. Piecewise linear activation function.

Let the weights connecting the input neuron to the hidden neurons be denoted as w_i^{(1)} (i = 1, ..., N), the weights connecting the hidden neurons to the output neuron be w_i^{(2)}, the biases for the hidden neurons be b_i^{(1)}, and the bias for the output neuron be b^{(2)}. The activation function in the output neuron is the identity function, such that the output y of the MLP with the input x feeding into the network is

y(x) = \sum_{i=1}^{N} w_i^{(2)} \varphi\big(w_i^{(1)} x + b_i^{(1)}\big) + b^{(2)} \quad (2.2)

It is evident that y(x) is just a superposition of N piecewise-linear functions plus the bias. From (2.1) we know that each piecewise-linear function in (2.2) is described by

w_i^{(2)} \varphi\big(w_i^{(1)} x + b_i^{(1)}\big) = \begin{cases} w_i^{(2)}, & w_i^{(1)} x + b_i^{(1)} \ge 0.5 \\ w_i^{(2)}\big(w_i^{(1)} x + b_i^{(1)} + 0.5\big), & -0.5 < w_i^{(1)} x + b_i^{(1)} < 0.5 \\ 0, & w_i^{(1)} x + b_i^{(1)} \le -0.5 \end{cases} \quad (2.3)
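A minimal MATLAB sketch of equations (2.1) and (2.2) is given below (illustrative only; the weight and bias values are arbitrary). It evaluates a 1-N-1 MLP with the piecewise-linear activation, making the superposition of N building-blocks explicit.

% Piecewise-linear activation (2.1) and 1-N-1 MLP output (2.2)
phi = @(v) min(max(v + 0.5, 0), 1);            % equation (2.1)
w1  = [2; -1; 0.5];                            % input-to-hidden weights w_i^(1)
b1  = [0.1; 0.3; -0.2];                        % hidden biases b_i^(1)
w2  = [1.5; -0.8; 0.6];                        % hidden-to-output weights w_i^(2)
b2  = 0.05;                                    % output bias b^(2)
x   = linspace(-1, 1, 201);                    % input samples
y   = w2' * phi(w1 * x + b1 * ones(size(x))) + b2;   % equation (2.2)
plot(x, y)                                     % superposition of three building-blocks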

In the case of w_i^{(1)} > 0, we have

w_i^{(2)} \varphi\big(w_i^{(1)} x + b_i^{(1)}\big) = \begin{cases} w_i^{(2)}, & x \ge \dfrac{0.5 - b_i^{(1)}}{w_i^{(1)}} \\ w_i^{(2)}\big(w_i^{(1)} x + b_i^{(1)} + 0.5\big), & \dfrac{-0.5 - b_i^{(1)}}{w_i^{(1)}} < x < \dfrac{0.5 - b_i^{(1)}}{w_i^{(1)}} \\ 0, & x \le \dfrac{-0.5 - b_i^{(1)}}{w_i^{(1)}} \end{cases} \quad (2.4)

The graph of this weighted piecewise linear function is plotted in Figure 2.2.

Fig. 2.2. Weighted piecewise linear function (the middle segment has width 1/w_i^{(1)}, height w_i^{(2)}, slope w_i^{(1)} w_i^{(2)}, starting point ((-0.5 - b_i^{(1)})/w_i^{(1)}, 0) and ending point ((0.5 - b_i^{(1)})/w_i^{(1)}, w_i^{(2)})).

This piecewise linear function has the same geometrical shape as that of (2.1), comprising two pieces of flat lines at the two ends and one piece of line segment in the

middle. Any finite piece of line segment can be completely specified by its width (span in the horizontal axis), height (span in the vertical axis), and position (starting point, center, or ending point). It is obvious from equation (2.4) and Figure 2.2 that the width of the middle line segment is 1/w_i^{(1)}, the height is w_i^{(2)}, the slope is therefore w_i^{(1)} w_i^{(2)}, and the starting point is ((-0.5 - b_i^{(1)})/w_i^{(1)}, 0). Once this middle line segment is specified, the whole piecewise line is then completely determined. From the above discussion it is natural to suggest the following geometrical interpretation for the three-layered MLP with piecewise-linear activation functions.

1) The number of hidden neurons corresponds to the number of piecewise lines that are available for approximating the target function. These piecewise lines act as the basic building-blocks for constructing functions.

2) The weights connecting the input neuron to the hidden neurons completely determine the widths of the middle line segments of those basic building-blocks. By adjusting these weights, the widths of the basic elements can be changed to arbitrary values.

3) The weights connecting the hidden neurons to the output neuron completely decide the heights of the middle line segments of those basic building-blocks. The heights can be modified to any values by adjusting these weights. Note that a negative height implies a negative slope of the middle line segment of the basic building-block.

4) The biases in the hidden neurons govern the positions of the middle line segments of those basic building-blocks. By adjusting the values of these biases, the positions of the building-blocks can be located arbitrarily.

5) The bias in the output neuron provides an offset term to the final value of the function.

Using the fact that the widths, the heights, and the positions of the middle line segments of the basic building-blocks can be adjusted arbitrarily, we are ready to state and prove Theorem 2.1 as follows.

Theorem 2.1: Let f(x) be any piecewise linear function defined on any finite domain -\infty < a \le x \le b < \infty. There exists at least one three-layered MLP, denoted as NN(x), with piecewise linear activation functions for the hidden neurons that can represent f(x) exactly, i.e., NN(x) = f(x) for all x \in [a, b].

The proof of Theorem 2.1 is quite straightforward, by directly constructing one MLP that achieves the objective.

Proof: Let f(x) be any piecewise linear function consisting of an arbitrary number N of line segments. Each line segment is completely determined by its two end points. Denote the end points of the i-th line segment as (x_{i-1}, f(x_{i-1})) and (x_i, f(x_i)), where x_0 = a and x_N = b. The width and height of the i-th line segment are then x_i - x_{i-1} and f(x_i) - f(x_{i-1}) respectively.

Let us construct the three-layered MLP as follows. Let the number of hidden neurons be N, the same as the number of piecewise lines in f(x). Each hidden neuron will then provide one piecewise line, whose width, height, and starting

point can be arbitrarily adjusted by the weights and biases. One natural way of choosing the weights and biases is to make the middle line segment provided by the i-th neuron match the i-th line segment in f(x). The parameters of the MLP can therefore be calculated as follows.

To match the width, we set

w_i^{(1)} = \frac{1}{x_i - x_{i-1}}, \quad i = 1, \ldots, N \quad (2.5)

To match the height, we set

w_i^{(2)} = f(x_i) - f(x_{i-1}), \quad i = 1, \ldots, N \quad (2.6)

To match the position, we set

\frac{-0.5 - b_i^{(1)}}{w_i^{(1)}} = x_{i-1}, \quad i = 1, \ldots, N \quad (2.7)

To match the final value of f(x), we need to provide the offset term as

b^{(2)} = f(x_0) = f(a) \quad (2.8)

The parameters of the three-layered MLP are completely determined by equations (2.5) to (2.8) (a small numerical sketch of this construction is given after Comment 2.4). Because of the special property of the activation function that the lines are all flat (with zero slope) except the middle segment, the contribution to the slope of the line segment in the interval [x_{i-1}, x_i] comes only from the middle line segment provided by the i-th neuron. From (2.5) and (2.6), it is obvious that the slope of each line segment of the MLP matches that of f(x). All we need to show now is that the output

value of the MLP at the starting point of each line segment matches that of f(x); then the proof will be complete.

At the initial point x = x_0, all the contributions from the hidden neurons are zero, and the output value of the MLP is just the bias b^{(2)},

NN(x_0) = b^{(2)} \quad (2.9)

At the point x = x_1, which is the end point of the line segment provided by the first neuron, the output value of the first neuron is w_1^{(2)} while the output values of all other neurons are zero; therefore we have

NN(x_1) = w_1^{(2)} + b^{(2)} \quad (2.10)

A similar argument leads to

NN(x_i) = w_1^{(2)} + \cdots + w_i^{(2)} + b^{(2)}, \quad i = 1, 2, \ldots, N \quad (2.11)

From equations (2.6) and (2.8), it follows immediately that

NN(x_i) = f(x_i), \quad i = 0, 1, \ldots, N \quad (2.12)

This completes the proof of Theorem 2.1.

Comment 2.1: The weights and biases constructed by equations (2.5) to (2.8) are just one set of parameters that can make the MLP represent the given target function. There are other possible sets of parameters that can achieve the same objective. For instance, for purposes of simplicity we let w_i^{(1)} > 0 in all our discussions so far. Without this

constraint, the sign of the slope of the piecewise line is determined by w_i^{(1)} w_i^{(2)}, and consequently there are many other combinations of the building-blocks that can construct the same piecewise linear function exactly. This implies that the global minimum may not be unique in many cases.

Comment 2.2: In the proof given, N hidden neurons are used to approximate the function consisting of N piecewise line segments, and the domains of the middle line segments of the basic building-blocks do not overlap with each other. If some domains of the middle line segments overlap, then it is possible for a 1-N-1 MLP to approximate functions comprising more than N piecewise line segments. But then the slopes around these overlapping regions are related, and cannot be arbitrary. A couple of such examples are plotted in Figure 2.3, where the solid line is the combination of two basic building-blocks, which are plotted with dash-dotted and dashed lines respectively.

Fig. 2.3. Overlapping of basic building-blocks.

Comment 2.3: Since any bounded continuous function can be approximated arbitrarily closely by a piecewise linear function, Theorem 2.1 simply implies that any bounded

continuous function can be approximated arbitrarily closely by an MLP, which is the well-known universal approximation property of the MLP proven in (Hornik et al. 1989; Cybenko 1989; Funahashi 1989). Although the proof is given only for the case of piecewise-linear activation functions, the geometrical nature of the proof presented in this chapter makes this nice property of the MLP much more transparent than other approaches.

Comment 2.4: The geometrical shape of the sigmoid activation function is very similar to that of the piecewise-linear activation function, except that the neighborhoods of the two end points are smoothed out, as shown in Figure 2.4. Therefore the previous geometrical interpretation of the MLP applies very closely to the case when sigmoid activation functions are used. Further, since the sigmoid function smoothes out the non-smooth end points, the MLP with sigmoid activation functions is more efficient at approximating smooth functions.
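The construction of Theorem 2.1 can also be checked numerically. The MATLAB sketch below (not from the thesis; the breakpoints are arbitrary example values) computes the 1-N-1 MLP parameters from equations (2.5) to (2.8) for a given piecewise linear target and evaluates the network with the activation function (2.1); by equation (2.12) its output passes exactly through every breakpoint.

% Construct a 1-N-1 MLP that reproduces a piecewise linear function exactly,
% following equations (2.5)-(2.8). xk are the breakpoints and fk = f(xk).
xk = [-1 -0.4 0.3 1];               % breakpoints x_0 ... x_N (example values)
fk = [ 0  0.8 -0.2 0.5];            % target values f(x_0) ... f(x_N)
N  = length(xk) - 1;                % number of line segments = hidden neurons
w1 = 1 ./ diff(xk);                 % equation (2.5): match the widths
w2 = diff(fk);                      % equation (2.6): match the heights
b1 = -0.5 - w1 .* xk(1:N);          % equation (2.7): match the positions
b2 = fk(1);                         % equation (2.8): offset term f(a)

phi = @(v) min(max(v + 0.5, 0), 1); % activation function (2.1)
NN  = @(x) w2 * phi(w1' * x + b1' * ones(size(x))) + b2;

x = linspace(xk(1), xk(end), 400);
plot(x, NN(x), xk, fk, 'o')         % the output passes through all breakpoints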

Fig. 2.4. Sigmoid activation function.

Comment 2.5: When the input space is high dimensional, each hidden neuron provides a piecewise hyperplane as the basic building-block, consisting of two flat hyperplanes and one piece of hyperplane in the middle. The position and width of the middle hyperplane can be adjusted by the weights connecting the input layer to the hidden layer and the biases in the hidden layer, while the height can be altered by the weights connecting the hidden layer to the output layer. A two-dimensional example of such a building-block is shown in Figure 2.5, where sigmoid activation functions are used.

Fig. 2.5. Two-dimensional building-block.

2.3 Selection of Number of Hidden Neurons for Three-layered MLP

Based upon the previous discussion regarding the geometrical meaning of the number of hidden neurons, the weights and the biases, we suggest a simple guideline for choosing the number of hidden neurons for the three-layered MLP as follows.

Guideline One: Estimate the minimal number of line segments (or hyperplanes in high dimensional cases) that can construct the basic geometrical shape of the target function, and use this number as the first trial for the number of hidden neurons of the three-layered MLP.

We have tested this guideline with extensive simulation studies. In all of the cases studied, this minimal number of line segments is either very close to the minimal number of hidden neurons needed for satisfactory performance, or is the minimal

number itself. Some of the simulation examples will be discussed below to illuminate the effectiveness of this guideline.

All the simulations have been conducted using the Neural Network Toolbox of MATLAB. The activation function for the hidden neurons is the hyperbolic tangent function (called tansig in MATLAB), and that for the output neurons is the identity function (called purelin in MATLAB) in most cases. Batch training is adopted and the Levenberg-Marquardt algorithm (Marquardt 1963; Moré 1977) (called trainlm in MATLAB) is used as the training algorithm. The Nguyen-Widrow method (Nguyen and Widrow 1990) is utilized to initialize the weights of each layer of the MLPs. (A minimal code sketch of this setup is given below, after Simulation 2.2.)

Comment 2.6: The selection of the activation function and training algorithm is another interesting issue, which has been investigated in other papers (Hush and Salas 1988; Mennon et al. 1996; Amir 1998). We will not delve into this issue here; we chose tansig and trainlm simply by trial and error.

Simulation 2.1: The target function is chosen as

f(x) = x^3 + 0.3x^2 - 0.4x, \quad x \in [-1, 1] \quad (2.13)

The training set consists of 21 points, which are chosen by uniformly partitioning the domain [-1, 1] with a grid size of 0.1, and the test set comprises 100 points uniformly randomly sampled from the same domain. Following Guideline One, the least number of line segments needed to construct the basic geometrical shape of f(x) is obviously three, therefore a 1-3-1 MLP is tried first. It turns out that 1-3-1 is indeed the minimal sized MLP to approximate f(x) satisfactorily. After only a small number of epochs, the mean square error (MSE) of the training set decreases to 2.09×10^{-6}, and the test error (MSE) is also of the order of 10^{-6}. The

result is shown in Figure 2.6, where the dotted line is the target function and the dash-dotted line is the output of the MLP; the two almost coincide with each other exactly.

Fig. 2.6. A simple one-dimensional example.

Comment 2.7: It is obvious that such a good approximation result cannot be achieved using three pieces of pure line segments. The smoothing property of the sigmoid function plays an important role in smoothing out the edges.

Simulation 2.2: Assume the samples of the target function in Simulation 2.1 are corrupted by noise uniformly distributed in [-0.05, 0.05]. Both the minimal 1-3-1 MLP and a much larger MLP are used to learn the same set of training data, and the test set contains 100 points uniformly randomly selected in [-1, 1]. The results are shown in Table 2.1 and are plotted in Figure 2.7.
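A minimal sketch of the experimental setup described above is given below, applied to the target function of Simulation 2.1. It is illustrative only and assumes the newff/train/sim interface of the MATLAB Neural Network Toolbox of that era; the epoch limit is an arbitrary choice.

% 1-3-1 MLP with tansig hidden neurons, purelin output, trained by trainlm
x = -1:0.1:1;                                   % 21 training points on [-1, 1]
t = x.^3 + 0.3*x.^2 - 0.4*x;                    % target function (2.13)
net = newff(minmax(x), [3 1], {'tansig','purelin'}, 'trainlm');
net.trainParam.epochs = 100;                    % batch Levenberg-Marquardt training
net = train(net, x, t);                         % Nguyen-Widrow initialization is the newff default
xtest = 2*rand(1, 100) - 1;                     % 100 random test points in [-1, 1]
ytest = sim(net, xtest);
mse_test = mean((ytest - (xtest.^3 + 0.3*xtest.^2 - 0.4*xtest)).^2)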

Fig. 2.7. The noisy one-dimensional example: (a) approximation by the smaller MLP; (b) approximation by the larger MLP.

Table 2.1. Significantly different performances of the two MLPs. (Columns: MLP architecture, epochs, training error (MSE), test error (MSE).)

Comment 2.8: The purpose of this simulation example is to show the necessity of searching for the minimal architecture. It is evident that the minimal MLP has the best generalization capability, approximating the ideal target function closely even though the training data are corrupted. In contrast, the larger MLP falls badly into the trap of over-fitting within only a few epochs.

Simulation 2.3: We intend to approximate a more complicated function as follows,

y = 0.5\sin(2\pi x) - 0.1\cos(4\pi x) + \frac{x^3}{x + 2}, \quad -1 \le x \le 1.6 \quad (2.14)

The training set contains 131 points, which are chosen by uniformly dividing the domain [-1, 1.6] with a grid size of 0.02. The test set includes 200 points randomly selected within the same domain. It is observed that at least nine line segments are needed to construct the basic shape of the target function, and hence a 1-9-1 MLP is decided to be the first trial. After training, the mean square training error and test error both become very small, and the bound of the test error is 0.01.

Fig. 2.8. A complicated one-dimensional example.

Comment 2.9: Smaller sized MLPs are also tested on this problem. They are able to provide good approximations except in the small neighborhood around x = 0, where the error bound is bigger than 0.01 (but smaller than 0.04). The reader is referred back to Comment 2.2 for understanding the possibility that the minimal number of hidden neurons (building-blocks) may be smaller than the number of line segments for a given target function. In this example, if we consider an approximation with error bound of 0.04 as satisfactory, then the minimal structure would be slightly smaller than 1-9-1.

Simulation 2.4: We move on to consider a simple two-dimensional example, a Gaussian function described by

f(x, y) = \frac{5}{2\pi}\exp\!\Big(-\frac{x^2 + y^2}{2}\Big), \quad x, y \in [-4, 4] \quad (2.15)

The training set comprises 289 points, which are chosen by uniformly partitioning the domain x, y \in [-4, 4] with a grid size of 0.5. The test set contains 1000 points randomly sampled from the same domain. It is apparent that at least 3 piecewise planes are needed to construct the basic geometrical shape of the Gaussian function: a hill surrounded by a flat plane. Therefore, from our guideline a 2-3-1 MLP is first tried to approximate this function. After 1000 epochs, the training error (MSE) decreases to a small value, and the test error (MSE) is comparable. The result is reasonably good as shown in Figure 2.9, if we consider the error bound of about 0.07 to be acceptable.

Comment 2.10: It is worth noting that the activation function used for the output neuron in Simulation 2.4 is not the identity function, but the logistic function (called logsig in MATLAB). Since the sigmoid function has the property of flattening things outside of its focused domain, it is possible to approximate a function within a certain region while keeping other areas flat, which is very suitable for this type of Gaussian hill problem. Without this flattening property, it would be difficult to improve the approximation at one point without worsening other parts. That is why the size of the three-layered MLP has to be increased to around 2-20-1 to achieve a similar error bound if the identity activation function is used in the output neuron.

Fig. 2.9. Approximation of the Gaussian function: (a) training data; (b) output of the neural network; (c) approximation error.

Simulation 2.5: We consider a more complicated two-dimensional example as follows,

f(x, y) = 0.1x^2 - 0.05y^2 + \sin(0.16\,x^2 y^2), \quad x, y \in [-4.5, 4.5] \quad (2.16)

The training set is composed of 100 points, obtained by uniformly partitioning the domain x, y \in [-4.5, 4.5] with a grid size of 1.0. The test set contains 1000 points randomly chosen from the same domain. In order to apply our guideline, we have to estimate the least number of piecewise planes needed to construct the basic shape of this target function. It appears that at least three pieces of planes are needed to construct the valley in the middle, six pieces of planes to approximate the downhills outside the valley, and an additional four pieces of planes to approximate the little uphills at the four corners, as shown in Figure 2.10. The total number of piecewise planes is then estimated to be 13, hence a 2-13-1 MLP is first tried to approximate this function. After 5000 epochs, the training error (MSE) and the test error (MSE) both decrease to small values. The approximation result is quite good, with an error bound of 0.15, as shown in Figure 2.11.

Fig. 2.10. Piecewise planes needed to construct the basic shape.

Fig. 2.11. A more complicated two-dimensional example: (a) output of the neural network; (b) approximation error.

It is observed that the local minimum problem is quite severe for this simulation example. Approximately only one out of ten trials with different initial weights may achieve such an error bound. To alleviate this local minimum problem, as well as to further decrease the error bound of the test set, evolutionary artificial neural networks (EANNs) are applied to this example. One of the popular EANN systems, EPNET (Yao and Liu 1997; Riessen et al. 1997), is adopted to solve the approximation problem for the function (2.16) with the same training set and test set mentioned before. Here, EPNET is simplified by removing the connection removal and addition operators, due to the fact that only fully-connected three-layered MLPs are used. The flowchart is given in Figure 2.12, which is a simplified version of the flowchart in (Riessen et al. 1997). The reader is referred to (Yao and Liu 1997; Riessen et al. 1997) for a detailed description of the EPNET algorithm. The following comments are in order to explain some of the blocks in the flowchart:

a) MBP training refers to training with the Levenberg-Marquardt algorithm (trainlm).

b) MRS refers to the modified random search algorithm, and the reader is referred to (Solis and Wets 1981) for further details.

c) Selection is done by randomly choosing one individual out of the population with probabilities associated with the performance ranks, where higher probabilities are assigned to the individuals with worse performances (a small illustrative sketch of this selection step is given after Figure 2.12). This is in

order to improve the performance of the whole population rather than improving a single MLP, as suggested in (Yao and Liu 1997; Riessen et al. 1997).

d) "Successful" means the validation error bound has been reduced substantially, for instance, by at least 10% in our simulations. The validation set contains 1000 random samples uniformly distributed in the domain [-4.5, 4.5] × [-4.5, 4.5].

e) The performance goal is set as 0.1 for the validation error bound. Once the goal is met, the evolutionary process will stop, and the best candidate (with the lowest error bound) will be selected to approximate the target function.

The size of the population is 10, and the initialization of the population can be done in different ways. Since 2-13-1 has already been estimated by Guideline One to be a good candidate for the structure of the MLP, it is natural to initialize the population with the same structure of 2-13-1. It is shown in Table 2.2 that after only 69 generations one of the MLPs achieves the performance goal of an error bound of 0.1. If the population for the first generation is chosen without this guideline, for instance, initialized with 2-5-1 MLPs, or 2-20-1 MLPs, or a set of differently structured MLPs in which the numbers of hidden neurons are randomly selected in the range of [5, 30], as suggested in (Yao and Liu 1997), the convergence speed is usually much slower, as shown in Table 2.2.

Table 2.2. Performance comparison of EPNET with different initial populations. (Columns: structures of the initial population, generations needed to meet the goal, error bound of the best network, structure of the best network; the initial populations compared are 2-13-1, 2-5-1, 2-20-1 and mixed structures.)

Fig. 2.12. Flowchart of the simplified EPNET. (The flow comprises population initialization, partial MBP training for 200 epochs, rank-based selection by validation error bound, mutation of a single parent by further partial MBP training for 200 epochs or partial MRS training for 200 steps, hidden node deletion with partial MBP training for 400 epochs or hidden node addition with partial MBP training for 200 epochs, replacement of the worst individual, and termination once the performance goal is met.)
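The rank-based selection of step (c) can be sketched as follows (illustrative only; the thesis does not specify the exact mapping from rank to probability, so the linear weighting below is an assumption, and the error-bound values are made up).

% Rank-based selection: individuals with worse validation error bounds
% receive higher selection probability.
errBounds = [0.31 0.18 0.25 0.52 0.12 0.40 0.22 0.35 0.28 0.19]; % example values
[sortedErr, order] = sort(errBounds);     % ascending: best individual first
rank = zeros(size(errBounds));
rank(order) = 1:length(errBounds);        % rank 1 = best, rank 10 = worst
prob = rank / sum(rank);                  % assumed linear rank weighting
idx  = find(rand <= cumsum(prob), 1)      % index of the selected parent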

Comment 2.11: The number of generations needed to achieve the performance goal and the structure of the best candidate may differ between experiments, and the results reported in Table 2.2 are from one set of experiments out of five. It is interesting to note that the final structure of the best candidate usually converges to a narrow range regardless of the structures of the initial population, which is indeed not far from our initial estimate of 2-13-1. Therefore, it is not surprising that EPNET with an initial population of 2-13-1 MLPs always converges faster than the other approaches, although the number of generations needed varies between different sets of simulation studies.

Comment 2.12: It also has to be stressed that the performance goal of a 0.1 error bound can hardly be achieved by training a 2-15-1 or similarly sized MLP solely with standard BP or modified BP, due to the local minimum problem. The combination of evolutionary algorithms and neural networks (EANN) indeed proves to be more efficient, as seen from our simulation studies, and our proposed guideline can be used to generate the initial population of the EANNs, which can speed up the evolution process significantly.

Comment 2.13: It is noticed that the difficulty of estimating the least number of hyperplane pieces needed to construct the basic geometrical shape of the target function increases with the complexity of the target function. In particular, when the dimension is much higher than two, as in many cases of pattern recognition problems, it is almost impossible to determine the basic geometrical shape of the target function. Hence Guideline One can hardly be applied to very high dimensional problems unless a priori information regarding the geometrical shapes of the target functions is known by

other means. Either pruning and growing techniques (LeCun et al. 1990; Weigend et al. 1991; Hassibi et al. 1992; Hush 1997) or EANNs (Alpaydin 1994; Jasic and Poh 1995; Sarkar and Yegnanarayana 1997; Yao and Liu 1997; Yao 1999; Riessen et al. 1997; Castillo 2000) are then recommended to deal with such problems, where the geometrical information is hardly known.

2.4 Advantage Offered by Four-layered MLP

Whether adding another hidden layer to the three-layered MLP is more effective remains a controversial issue in the literature. While some published results (Chester 1990; Sontag 1992; Tamura and Tateishi 1997) suggest that the four-layered MLP is superior to the three-layered MLP from various points of view, other results (Villiers and Barnard 1992) claim that four-layered networks are more prone to fall into bad local minima, but that three- and four-layered MLPs perform similarly in all other respects. In this section, we will try to clarify the issues raised in the literature, and provide a few guidelines regarding the choice of one or two hidden layers by applying the geometrical interpretations of section 2.2.

One simple interpretation of the four-layered MLP is just to regard it as a linear combination of multiple three-layered MLPs, by observing that the final output of the four-layered MLP is nothing but a linear combination of the outputs of the hidden neurons in the second hidden layer, which themselves are simply the outputs of three-layered MLPs. Thus, the task of approximating a target function is essentially decomposed into tasks of approximating sub-functions with these three-layered MLPs. Since all of them share the same hidden neurons but with different output neurons,

these three-layered MLPs share the same weights connecting the input layer to the first hidden layer, but have different weights connecting the first hidden layer to their output neurons (the neurons in the second hidden layer of the four-layered MLP). According to the geometrical interpretation discussed before, it is apparent that the corresponding basic building-blocks of these three-layered MLPs share the same widths and positions, but have different heights and slopes.

One obvious advantage gained by decomposing the target function into several sub-functions is that the total number of parameters of the four-layered MLP may be smaller than that of the three-layered MLP, because the number of hidden neurons in the first hidden layer can be decreased substantially if the target function is decomposed into sub-functions with simpler geometrical shapes, which therefore need fewer building-blocks to construct.

Simulation 2.6: Consider the approximation problem in Simulation 2.3; the training set and the test set remain the same as those of Simulation 2.3. Several four-layered MLPs are tested and it is found that a four-layered MLP with fewer parameters can achieve performance similar to that of the 1-9-1 MLP consisting of 28 parameters. After 447 epochs, the training error (MSE) reaches 2.03×10^{-5}, the test error (MSE) is of the same order, and the error bound of the test set is correspondingly small. Due to the local minimum problem, it is hard to get a good result in only one trial, and the success rate is about one out of twenty, which is much less than the 90% success rate of the 1-9-1 MLP.

Simulation 2.7: We also revisit the two-dimensional problem in Simulation 2.5 with the same training data set and test data. A four-layered MLP is found that approximates the function satisfactorily. The total number of parameters of this four-layered MLP is

43, while the total number of parameters for the former 2-13-1 network is 53. After training, the training error (MSE) and the test error (MSE) decrease to small values, and the test error bound is comparable to that of the three-layered network.

From the above two simulation examples, it is clear that the four-layered MLP is more efficient than the three-layered MLP in terms of the number of parameters needed to achieve similar performance. However, the difference between the numbers of parameters is usually not very large, and the three-layered MLP may be more appealing considering the fact that the four-layered MLP may be more prone to local minima traps because of its more complicated structure, as pointed out in (Villiers and Barnard 1992). But there are certain situations in which the four-layered MLP is distinctly better than the three-layered MLP, as shown below.

Simulation 2.8: Consider an example (Sarle 2002) made of a Gaussian hill and a Gaussian valley as follows,

f(x, y) = 3\exp\big(-(x-2)^2 - (y-2)^2\big) - 4\exp\big(-(x+2)^2 - y^2\big), \quad x, y \in [-4, 4] \quad (2.17)

The training set consists of 1681 points, which are sampled by uniformly partitioning the domain x, y \in [-4, 4] with a grid size of 0.2. The test set comprises 1000 points randomly chosen from the same domain. A small four-layered network is used to approximate it quite well, as shown in Figure 2.13. The training error (MSE) is reduced to the order of 10^{-5} within a small number of epochs, the test error (MSE) is of the same order, and the error bound is small. However, if a three-layered MLP is used, then the minimal size has to be around 2-30-1 to achieve similar performance. The total number of parameters of the four-layered network is only 25, while that of the three-layered one is 121, which is much higher. Why does the four-layered MLP outperform the three-layered MLP so dramatically for this problem? Before we reveal the answer to this question, let us consider another related hill and valley example.

Fig. 2.13. Approximating hill and valley with a four-layered MLP: (a) output of the neural network; (b) approximation error.

Simulation 2.9: It is still a hill and valley problem, as described below and shown in Figure 2.14,

f(x, y) = \begin{cases} 0.6\exp\big(-(x-2)^2 - (y-2)^2\big) - 0.8\exp\big(-(x+2)^2 - y^2\big) - 1, & x \in [0, 4],\ y \in [-4, 4] \\ 0.6\exp\big(-(x-2)^2 - (y-2)^2\big) - 0.8\exp\big(-(x+2)^2 - y^2\big) + 1, & x \in [-4, 0),\ y \in [-4, 4] \end{cases} \quad (2.18)

The training set consists of 6561 points, which are chosen by uniformly partitioning the domain x, y \in [-4, 4] with a grid size of 0.1. The test set is composed of 500 points randomly chosen from the same domain. At first glance of the geometrical shape of this function, it appears more complicated than the previous example because of the jump between the planes, and a larger sized MLP would be expected to be needed to approximate it satisfactorily. However, a stunningly simple 2-5-1 three-layered MLP, with the hyperbolic tangent function as the activation function for the output neuron, can approximate it astonishingly well, with training and test errors (MSE) of the order of 10^{-5} after only 200 epochs. And the test error bound is even less than 0.03, as shown in Figure 2.14.

Fig. 2.14. Approximating hill and valley by a 2-5-1 network: (a) output of the neural network; (b) approximation error.

After careful analysis of these two examples, it is finally realized that the essential difference between them is the location of the flat areas. The flat regions in Simulation 2.8 lie in the middle, while those in Simulation 2.9 are located at the top as well as at the bottom. It was noticed previously in the Gaussian function example (Simulation 2.4) that the sigmoid function has the nice property of flattening things outside its focused domain, but the flat levels must be located either at the top or at the bottom, dictated by its geometrical shape. Therefore it is much easier to approximate the function in Simulation 2.9 with a three-layered MLP than the function in Simulation 2.8. To verify this explanation, we increase the height of the hill as well as the depth of the valley in Simulation 2.9 such that they are higher or lower than the two flat planes; then it becomes very difficult to approximate with a three-layered MLP, as shown in the following simulation.

Simulation 2.10: We slightly change the approximation problem in Simulation 2.9 as follows,

f(x, y) = \begin{cases} 2.3\exp\big(-(x-2)^2 - (y-2)^2\big) - 2.4\exp\big(-(x+2)^2 - y^2\big) - 1, & x \in [0, 4],\ y \in [-4, 4] \\ 2.3\exp\big(-(x-2)^2 - (y-2)^2\big) - 2.4\exp\big(-(x+2)^2 - y^2\big) + 1, & x \in [-4, 0),\ y \in [-4, 4] \end{cases} \quad (2.19)

The difference between this example and Simulation 2.9 is that the two flat planes are no longer present at the top or the bottom. The sampling points of the training set and test set are the same as those in Simulation 2.9. The number of hidden neurons has to be increased from 5 to around 35 for the three-layered MLP, while a simple four-layered MLP can approximate it quite well. After 1000 epochs, the training error (MSE) goes to the order of 10^{-5}, and the MSE and error bound of the test set are of the order of 10^{-5} and 0.06 respectively. The result is plotted in Figure 2.15.

Fig. 2.15. The modified hill and valley example: (a) output of the neural network; (b) approximation error.

From the above discussion, the reason why a simple four-layered MLP can approximate the hill and valley very well should also be clear now. As we discussed before, the four-layered MLP has the capability of decomposing the task of approximating one target function into tasks of approximating sub-functions. If a target function with flat regions in the middle, as in the cases of Simulations 2.8 and 2.10, can be decomposed into a linear combination of sub-functions with flat areas at the top or at the bottom, then this target function can be approximated satisfactorily by a four-layered MLP, because each of the sub-functions can now be well approximated by a three-layered MLP. To validate this explanation, the outputs of the hidden neurons in the second hidden layer of the four-layered network in Simulation 2.8 are plotted in Figure 2.16; interestingly, they are each in the shape of a hill with flat areas around it. It is apparent that these two sub-functions, which are constructed by three-layered MLPs, can easily combine into a shape consisting of a hill and a valley by subtraction.

Comment 2.14: The way of decomposing the target function by the four-layered MLP is not unique and largely depends upon the initialization of the weights. For instance, the shapes of the outputs of the hidden neurons are totally different from those of Figure 2.16, as shown in Figure 2.17, when different initial weights are used. However, both of them share the common feature that the flat areas are all located at the bottom, which can be easily approximated by three-layered MLPs.

Fig. 2.16. The outputs of the neurons in the second hidden layer of the four-layered MLP: (a) output of the first hidden neuron; (b) output of the second hidden neuron.

Fig. 2.17. The outputs of the neurons in the second hidden layer with different initialization: (a) output of the first hidden neuron; (b) output of the second hidden neuron.
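The decomposition illustrated in Figures 2.16 and 2.17 can be reproduced structurally with the short sketch below (illustrative only; the weights are random rather than trained). The output of a four-layered 2-H1-H2-1 MLP is exactly a weighted sum of the outputs of H2 three-layered sub-networks that share the first hidden layer.

% Decomposing a four-layered 2-H1-H2-1 MLP into three-layered sub-networks.
H1 = 4; H2 = 2;                          % sizes of the two hidden layers
W1 = randn(H1, 2); b1 = randn(H1, 1);    % input -> first hidden layer
W2 = randn(H2, H1); b2 = randn(H2, 1);   % first -> second hidden layer
w3 = randn(1, H2);  b3 = randn;          % second hidden layer -> output

x  = [0.5; -1.0];                        % one two-dimensional input
h1 = tanh(W1 * x + b1);                  % shared building-blocks
z  = tanh(W2 * h1 + b2);                 % z(j) = output of the j-th three-layered
                                         % sub-network (cf. Figures 2.16 and 2.17)
y_full = w3 * z + b3;                    % four-layered MLP output
y_sum  = sum(w3 .* z') + b3;             % identical: weighted sum of sub-networks
disp([y_full, y_sum])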

In summary, we have the following two guidelines regarding the choice of one or two hidden layers.

Guideline Two: A four-layered MLP may be considered for the purpose of decreasing the total number of parameters. However, it may at the same time increase the risk of falling into local minima.

Guideline Three: If there are flat surfaces located in the middle of the graph of the target function, then a four-layered MLP should be used instead of a three-layered MLP.

Comment 2.15: The Gaussian hill and valley example is the best known example (Sarle 2002) showing the advantage of using two hidden layers over using one hidden layer. However, very little explanation has been provided, except that Chester suggested an interpretation in (Chester 1990), which was not well founded.

Comment 2.16: Sontag (1992) proved that a certain class of inverse problems in general can be solved by functions computable by four-layered MLPs, but not by functions computable by three-layered MLPs. However, the precise meaning of "computable" defined in (Sontag 1992) is exact representation, not approximation. Therefore his result does not imply the existence of functions that can be approximated only by four-layered MLPs but not by three-layered MLPs, which is still consistent with the universal approximation theorem.

2.5 Conclusions

A geometrical interpretation of MLPs is suggested in this chapter, on the basis of the special geometrical shape of the activation function. Basically, the hidden layer of the three-layered MLP provides the basic building-blocks, with shapes very close to piecewise lines (or piecewise hyperplanes in high dimensional cases). The widths, heights and positions of these building-blocks can be arbitrarily adjusted by the weights and biases. The four-layered MLP is interpreted simply as a linear combination of multiple three-layered MLPs that have the same hidden neurons but different output neurons. The number of neurons in the second hidden layer is then the number of these three-layered MLPs, which construct corresponding sub-functions that combine into an approximation of the target function. Based upon this interpretation, three guidelines for selecting the architecture of the MLP are then proposed. It is demonstrated by various simulation studies that these guidelines are very effective for searching for the minimal structure of the MLP, which is very critical in many application problems.

The suggested geometrical interpretation is not only useful for guiding the design of the MLP, but also sheds light on some of the beautiful but somewhat mystic properties of the MLP. For instance, the universal approximation property can now be readily understood from the viewpoint of piecewise linear approximation, as proven in Theorem 2.1. It also does not escape our notice that this geometrical interpretation may help to illuminate the advantage of the MLP over other conventional linear regression methods, shown by Barron (1992; 1993), namely that the MLP may be free of the curse of dimensionality, since the number of neurons of the MLP needed for

60 Chapter Archtecture Selecton of Multlayer Perceptron 49 approxmatng a target functon depends only upon the basc geometrcal shape of the target functon, not on the dmenson of the nput space. Whle the geometrcal nterpretaton s stll vald wth the dmenson of the nput space ncreasng, the gudelnes can be hardly appled to hgh dmensonal problems because the basc geometrcal shapes of hgh dmensonal target functons are very dffcult to determne. Consequently, how to extract the basc geometrcal shape of a hgh dmensonal target functon from the avalable tranng data would be a very nterestng and challengng problem.

Chapter 3 Overfitting Problem of MLP

3.1 Overfitting Problem Overview

The multilayer perceptron (MLP) has already proven to be very effective in a wide spectrum of applications, in particular function approximation and pattern recognition problems. Like other nonlinear estimation methods, the MLP also suffers from over-fitting. The best way to solve the over-fitting problem is to provide a sufficiently large pool of training data. But in most practical problems the amount of training data is limited, and hence other methods such as model selection, early stopping, weight decay and Bayesian regularization are more feasible when a fixed amount of training data is given. Model selection mainly focuses on the size of the neural network, i.e. the number of weights, while most other approaches are related, directly or indirectly, to the size of the weights. These are the two aspects of the complexity of the network. Therefore it is of great interest to gain deeper insight into the roles of the size of the network and the size of the weights in the context of the over-fitting problem.

Based on the geometrical interpretation presented in Chapter 2, how the number and the size of the weights influence the over-fitting problem will be clearly explained. Various approaches for dealing with the over-fitting problem are examined from the point of view of the new geometrical interpretation. In particular, the popular regularization training algorithms are studied in detail. Not only can the reason why regularization methods are very efficient in overcoming over-fitting be simply explained by the geometrical interpretation, but a potential problem with regularization is also predicted and demonstrated.

Applying the geometrical interpretation, a brief overview of over-fitting and some popular approaches to improve generalization will be discussed in this chapter. An example of the over-fitting problem (Caruana et al. 2000) is illustrated in Figure 3.1, which is a function approximation with a three-layered (one hidden layer) MLP. The training dataset is created by

$y = \begin{cases} \cos(2x) + v, & 0 \le x < \pi \\ \cos(3(x - \pi)) + v, & \pi \le x \le 2\pi \end{cases}$    (3.5)

and the noise $v$ is uniformly distributed within [-0.5, 0.5]. The MLP is trained with the Levenberg-Marquardt algorithm using the Neural Network toolbox of MATLAB. With 4 hidden neurons, the approximation is fairly good. When the number of hidden neurons increases, significant over-fitting and poor generalization are observed. The output of the MLP fits the training data perfectly when the number of hidden neurons reaches 100, but the interpolation between the training points is extremely poor.
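The qualitative behaviour can be reproduced without the MATLAB toolbox; the sketch below is an illustrative Python fragment using scikit-learn's MLPRegressor as a stand-in for the Levenberg-Marquardt setup of the thesis, with the number of training points (60) and the equation as reconstructed above being assumptions of mine. Fitting the same noisy data with 4 and then 100 tanh hidden units typically shows the gap between training fit and interpolation quality described in the text.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2 * np.pi, 60).reshape(-1, 1)        # training inputs (count assumed)
clean = np.where(x.ravel() < np.pi,
                 np.cos(2 * x.ravel()),
                 np.cos(3 * (x.ravel() - np.pi)))
y = clean + rng.uniform(-0.5, 0.5, size=clean.shape)      # noisy targets, as in eq. (3.5)

dense = np.linspace(0.0, 2 * np.pi, 400).reshape(-1, 1)   # dense grid for checking interpolation
clean_dense = np.where(dense.ravel() < np.pi,
                       np.cos(2 * dense.ravel()),
                       np.cos(3 * (dense.ravel() - np.pi)))

for hidden in (4, 100):
    net = MLPRegressor(hidden_layer_sizes=(hidden,), activation='tanh',
                       solver='lbfgs', alpha=0.0, max_iter=10000,
                       random_state=0).fit(x, y)
    gen_err = np.mean((net.predict(dense) - clean_dense) ** 2)
    print(f"{hidden:3d} hidden neurons: mean squared error against the clean target = {gen_err:.3f}")
```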

Figure 3.1: Example of the over-fitting problem.

From the above example, it is obvious that the degree of over-fitting increases with the size of the neural network. However, Bartlett (1997) made the surprising observation that, for valid generalization, the size of the weights is more important than the size of the network, which hardly appears true at first glance. With the aid of the geometrical interpretation, this astonishing observation can be plainly explained as follows. Since the slope of each building-block is roughly proportional to $w_i^{(1)} w_i^{(2)}$, the smaller the weights, the gentler the slope of each building-block and hence the smoother the shape of the overall function. In fact, most of the prevalent methods for preventing over-fitting are concerned either with the size of the network or with the size of the weights, and they will be examined from this new perspective of the geometrical interpretation.
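As a quick numerical check of this proportionality (a minimal sketch of mine, not taken from the thesis; the function names and values are illustrative), the slope of a single sigmoidal building-block $w^{(2)}\tanh(w^{(1)}x + b)$ at its centre is exactly $w^{(1)} w^{(2)}$, so shrinking either weight flattens the block:

```python
import numpy as np

def building_block(x, w1, w2, b):
    # one hidden neuron followed by one output weight: w2 * tanh(w1*x + b)
    return w2 * np.tanh(w1 * x + b)

def centre_slope(w1, w2, b, eps=1e-6):
    # numerical derivative at the centre of the block, x0 = -b/w1,
    # where tanh' = 1, so the analytic slope is w1 * w2
    x0 = -b / w1
    return (building_block(x0 + eps, w1, w2, b) -
            building_block(x0 - eps, w1, w2, b)) / (2 * eps)

for w1, w2 in [(10.0, 2.0), (1.0, 2.0), (0.1, 2.0)]:
    print(w1 * w2, centre_slope(w1, w2, b=0.5))   # analytic vs numerical slope
```

Halving either weight halves the steepest slope the block can contribute, which is exactly the smoothing effect exploited by the methods reviewed below.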

3.2 Comparative Study of Available Methods

3.2.1 Model Selection

This approach focuses on the size of the network. Generally, a simple network will give good generalization performance. Normally, the model selection procedure is based on cross-validation to choose the optimal size using either pruning or growing techniques, which is usually time-consuming. Instead, based upon the geometrical interpretation, some much simpler guidelines have already been proposed in Chapter 2. Following the guidelines, obviously 4 hidden neurons are needed to approximate the function given in the example, and indeed this network gives very good generalization, as seen from Figure 3.1. We have tested this guideline with extensive simulation studies. In all of the cases studied, the estimated number of hidden neurons is either very close to the minimal number of hidden neurons needed for satisfactory performance, or, in many cases, is the minimal number itself, as shown in Chapter 2.

3.2.2 Early Stopping

Early stopping is another popular method for overcoming the over-fitting problem during training (Sarle 1995). The main idea is to stop training when the validation error goes up. Figure 3.2 shows the results of using the early stopping method on the former example, in which no significant over-fitting is observed even when the number of hidden units reaches 100. To apply early stopping successfully, it is critical to choose very small random initial values for the weights (chosen randomly within [-0.1, 0.1] in this example) and to use a slow learning rate, which essentially prevents the weights from evolving into large values. Confining the size of the weights to be small is also a good remedy to alleviate the over-fitting problem, as discussed before.

Figure 3.2: Early stopping for overcoming the over-fitting problem.

3.2.3 Regularization Methods

Conventionally, training minimizes the cost function $F = E_D$, where $E_D$ is the summation of the squared errors. Regularization methods add a penalty term to the cost function. Usually the penalty term is a function of the weights, called the complexity penalty. The cost function then becomes

$F = E_D + \lambda E_w$,

where $E_w$ is the complexity penalty and $\lambda$ is called the regularization parameter.
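The penalized cost is easy to make concrete for the simplest penalty, weight decay, where $E_w$ is the sum of squared parameters. The fragment below is an illustrative NumPy sketch (function and variable names are my own, not from the thesis or any toolbox); the gradient contributed by the penalty is simply $2\lambda w$ added to whatever gradient the data term produces.

```python
import numpy as np

def penalized_cost(errors, weights, lam):
    """F = E_D + lambda * E_w with a weight-decay penalty E_w = sum(w^2)."""
    E_D = np.sum(errors ** 2)                    # data term: summed squared errors
    E_w = sum(np.sum(w ** 2) for w in weights)   # complexity penalty over all weight arrays
    return E_D + lam * E_w

def decay_gradient(weights, lam):
    """Extra gradient contributed by the penalty: d(lambda*E_w)/dw = 2*lambda*w."""
    return [2.0 * lam * w for w in weights]

# toy usage: two weight matrices of a small network
rng = np.random.default_rng(0)
W = [rng.normal(size=(1, 4)), rng.normal(size=(4, 1))]
print(penalized_cost(errors=np.array([0.1, -0.3]), weights=W, lam=0.01))
```

Whether the biases should be included in $E_w$ at all is exactly the point questioned in the following paragraphs.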

Weight decay (Plaut et al. 1986) is the simplest of the regularization approaches, where $E_w$ is the summation of all the squared parameters, including both weights and biases; weight elimination (Weigend et al. 1991) is essentially a normalized version of weight decay. Both of them work effectively in some applications, but they do not work well all the time, because they ignore the difference between the weights and the biases, as well as the interaction between the weights in different layers. For instance, from the geometrical interpretation, the biases are only related to the positions of the basic building-blocks, not their shapes, and hence should not be included in the penalty term. A more recent regularization method proposed by Moody and Rögnvaldsson (1997) works much better than standard weight decay and weight elimination. In their approach, for the case of a one-dimensional mapping, the complexity penalty $E_w$ for the first-order local smoothing regularizer can be reduced to

$E_w = \sum_{i=1}^{N} \left(w_i^{(2)} w_i^{(1)}\right)^2$,

which, from the point of view of the geometrical interpretation, is actually minimizing the slopes of the basic building-blocks. Therefore its superior performance can be simply attributed to its capability of distinguishing the different roles played by the weights and the biases.

The choice of the regularization parameter $\lambda$ also affects the generalization performance significantly. MacKay's Bayesian approach (MacKay 1992a; MacKay 1992b) to choosing the regularization parameters is the most popular one. Using Bayesian regularization, the MLP may achieve a good generalization result for the former example, where it failed previously without regularization, as shown in Figure 3.3. It is worth noting that Bayesian regularization may break down if the number of data pairs N is small relative to the number of free parameters k, as pointed out by MacKay. But the reason, and how large N/k must be for reliable approximation, is still an open question (MacKay 1992b). Furthermore, this breakdown may also depend upon the initialization of the parameters, as observed in our simulation studies.

The regularization methods limit the size of the weights, which in turn restricts the slopes of the building-blocks to be small and hence results in a smooth approximation. However, the strength of this approach is also its weakness. Based upon the geometrical interpretation, the MLP may have difficulty in approximating functions with significant high-frequency components, because the slopes of the building-blocks are confined to be small. To verify this prediction, a simulation example is constructed as follows. A training dataset containing 41 points is created according to the function $y = \sin(2\pi x) + 0.2\sin(10\pi x)$. An MLP with 12 hidden neurons (following the previous model selection guideline) is used to approximate this function, and the initial weights are chosen randomly within [-1, 1]. The results with and without Bayesian regularization are shown in Figure 3.4, where an unexpectedly smooth solution can be seen when Bayesian regularization is used. Very interestingly, Bayesian regularization indeed acts as a low-pass filter and fails to capture the high-frequency component. Fortunately, most high-frequency signals in reality result from noise, and Bayesian regularization may give the desired approximation by effectively filtering out the noise. But if the high-frequency signals are useful signals instead of noise, then the regularization approach may not be the right choice, and the model selection method may be more appropriate.
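The low-pass effect can be seen without any special toolbox by fitting the same two-frequency target with a weak and then a strong penalty. The snippet below is an illustrative sketch using scikit-learn's MLPRegressor, which is not the Levenberg-Marquardt/Bayesian setup used in the thesis; the penalty strength is its alpha weight-decay parameter, and the input range [0, 1] is an assumption of mine. It typically reproduces the qualitative picture: the heavily penalized network tracks only the slow $\sin(2\pi x)$ component and leaves a residual dominated by the fast term.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# 41 training points from the two-frequency target used in the text
x = np.linspace(0.0, 1.0, 41).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.2 * np.sin(10 * np.pi * x).ravel()

for alpha in (1e-5, 1.0):   # weak vs strong weight-decay penalty
    net = MLPRegressor(hidden_layer_sizes=(12,), activation='tanh',
                       solver='lbfgs', alpha=alpha, max_iter=5000,
                       random_state=0).fit(x, y)
    resid = y - net.predict(x)
    print(f"alpha={alpha}: RMS training residual = {np.sqrt(np.mean(resid ** 2)):.3f}")
```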

Figure 3.3: Bayesian regularization for overcoming the over-fitting problem. (a) Without Bayesian regularization. (b) With Bayesian regularization.

Figure 3.4: A simple example where Bayesian regularization fails. (a) With Bayesian regularization. (b) Without Bayesian regularization.

3.3 Conclusions

Over-fitting is a critical issue for neural network applications. In order to gain deeper insight into the roles of the size of the network and the size of the weights, the geometrical interpretation of Chapter 2 is revisited. Based upon this interpretation, the size of the weights directly decides the shape of the basic building-blocks: the smaller the weights, the smoother the building-blocks. The reason behind Bartlett's well-known observation that, for valid generalization, the size of the weights is more important than the size of the network, is now crystal clear from the viewpoint of this geometrical interpretation. Various methods of preventing over-fitting are reviewed from this new perspective, and all of them can be elegantly explained by the suggested geometrical interpretation. A simple guideline for model selection is also suggested and applied successfully to the given example. Regularization has emerged as the most popular approach to overcoming over-fitting, since no specific techniques are needed to select an optimal architecture and the available data can be fully used. However, a potential problem with the regularization method, namely that it may fail to capture the high-frequency characteristics of the function, is illuminated by the geometrical interpretation.

Chapter 4 From Multilayer Perceptron to Radial Basis Function Network

4.1 Introduction to Radial Basis Function Network

The Radial Basis Function Network (RBFN) is another popular feedforward neural network that is widely used in classification, regression and function approximation problems. The main difference from the MLP is that the activations of the hidden neurons of an RBFN depend on the distance of an input vector from a prototype vector, whereas the MLP calculates the inner product of the input vector and the weight vector. Normally, radial basis function networks have three layers with different roles. The input layer (sensors) connects the network to the environment. The hidden layer performs the key nonlinear transformation from the input space to the high-dimensional hidden space of the network. The output layer gives a weighted linear combination of the hidden neuron activations. The structure of the radial basis function network is shown in Figure 4.1. The k-th output of the network is

$f_k(X) = \sum_{i=1}^{M} w_{ki}\, h_i(X)$,    (4.1)

where $X$ is the input vector, $h_i(X)$ is the i-th basis function and $w_{ki}$ is the weight from the i-th basis function. The basis functions are normally multivariate Gaussian functions:

$h_i(X) = \exp\!\left(-\frac{\|X - \mu_i\|^2}{2\sigma_i^2}\right)$,    (4.2)

where $\mu_i$ is the center of the prototype vector, $\sigma_i$ is the spread of the Gaussian function and $\|X - \mu_i\|^2$ is the squared Euclidean distance between the input vector and the prototype vector.

Figure 4.1: Three-layered structure of the radial basis function network (inputs, basis functions with biases, output layer).

A very important and interesting property is that the RBFN is naturally related to regularization networks and to some statistical concepts, especially in classification. Compared with multilayer perceptron networks, these links allow radial basis function networks to be trained by different and fast training methods (such as clustering and EM methods). The training of an RBFN is usually separated into two stages.
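A direct transcription of equations (4.1)-(4.2) into code may help fix the notation; the following NumPy sketch (the names are mine, not from the thesis) evaluates the Gaussian hidden activations and the linear readout for a batch of inputs.

```python
import numpy as np

def rbf_activations(X, centers, sigmas):
    """h_i(X) = exp(-||X - mu_i||^2 / (2*sigma_i^2)) for every input row and every center."""
    # squared Euclidean distances, shape (n_samples, n_centers)
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigmas ** 2))

def rbfn_output(X, centers, sigmas, W):
    """f_k(X) = sum_i w_ki * h_i(X); W has shape (n_centers, n_outputs)."""
    return rbf_activations(X, centers, sigmas) @ W

# toy usage: 2-D inputs, 3 centers, 1 output
X = np.array([[0.0, 0.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sigmas = np.array([0.5, 0.5, 0.5])
W = np.array([[1.0], [-1.0], [0.5]])
print(rbfn_output(X, centers, sigmas, W))
```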

Two-stage Training of Radial Basis Function Networks

Although simultaneous adjustment of all the parameters of the RBFN is also possible, in practice the estimation of the parameters is often separated into two stages:

1. Determine the centers $\mu_i$ and the relative spreads $\sigma_i$.
2. Estimate the output weights based on the previously determined centers and spreads.

Both stages can be solved quickly using batch-mode methods. Although this kind of separation may lead to a sub-optimal solution compared with simultaneous training of the whole network, the difference in final performance is not that large. Actually, in many situations it can even provide better solutions, considering the finite training data and computational resources. In the first stage, only part of the training information is used: the centers and spreads can be determined without the target (label) information, so the learning is unsupervised at first. Once the centers and spreads are set, supervised learning is conducted to calculate the output weights.

Random Selection of Centers

The most convenient and fastest way is to choose fixed parameters for the basis functions. The locations of the centers may simply be chosen randomly from the training data set, sometimes even as the whole training data set. This is considered a sensible approach, provided the training data are distributed in a representative manner for the problem at hand (Lowe 1989). Specifically, a radial basis function centered at $\mu_i$ is defined as:

$h_i(X) = \exp\!\left(-\frac{n}{d_{\max}^2}\,\|X - \mu_i\|^2\right)$,   $i = 1, 2, \ldots, n$,    (4.3)

where $n$ is the number of centers and $d_{\max}$ is the largest distance between the chosen center vectors. The spread (standard deviation) of all the Gaussian basis functions is then

$\sigma = \frac{d_{\max}}{\sqrt{2n}}$,    (4.4)

so that each individual radial basis function is neither too steep nor too flat. A small spread can lead to less smooth functions. Another empirical method of choosing the spreads is to set the spread to 1.5 to 2 times the average distance to the L nearest neighbours (Ghosh and Nag 2000).

Once the locations of the centers and the spreads are determined, the network can be treated as a single-layer network with linear output neurons, so the least-squares solution can be applied to obtain the weights:

$W = H^{+} D$,    (4.5)

where $D$ is the target vector of the training set and $H^{+}$ is the pseudo-inverse of the basis function matrix $H$. This kind of random selection of centers seems somewhat rough, but it is often used because such an ad hoc procedure is very fast (Bishop 1995), and it actually works satisfactorily in many practical problems.

Clustering Algorithms

A more suitable approach is to choose the centers using clustering algorithms, which separate the given training points into subsets; the locations of the centers can then be obtained by calculating the geometric mean of the points in each subset. There are many such clustering algorithms. Among them, self-organized learning, or the K-means clustering algorithm (MacQueen 1967; Duda and Hart 1973; Moody and Darken 1989; Kohonen 1990), is widely used. The K-means algorithm partitions the training data points into K subsets $S_i$ by minimizing the within-cluster criterion

$J = \sum_{i=1}^{K} \sum_{n \in S_i} \|X_n - \mu_i\|^2$,    (4.6)

where $\mu_i$ is the center of the i-th subset and $N_i$ is the number of points in the i-th subset:

$\mu_i = \frac{1}{N_i} \sum_{n \in S_i} X_n$.    (4.7)

The partition of the data set is normally random at first. Then the centers of the subsets are calculated using equation (4.7). After that, each data point is reassigned to the nearest calculated center. This procedure is iterated until there is no further change in the partition. Although the above clustering procedure is a batch one, sequential clustering is also available (Haykin 1999). The same spread determination and linear least-squares solution for the output weights can be applied once the locations of the centers are settled.

One-stage Supervised Training of Radial Basis Function Networks

The radial basis function network is a specific feedforward neural network, and it can also be trained in a way similar to the multilayer perceptron. The first step is again to define the cost function, which is usually the sum-squared error

$E = \frac{1}{2}\sum_{j=1}^{N}\sum_{k=1}^{O} e_k^2(j)$, where $e_k(j) = y_k(j) - d_k(j)$.    (4.9)

Hence, the error gradients for the linear output weights and the bias are:

$\frac{\partial E(n)}{\partial w_{ki}(n)} = \sum_{j=1}^{N} e_{kj}(n)\,\exp\!\left(-\frac{\|X_j - \mu_i(n)\|^2}{2\sigma_i^2(n)}\right)$    (4.10)

$\frac{\partial E(n)}{\partial b_k(n)} = \sum_{j=1}^{N} e_{kj}(n)$    (4.11)

The error gradients for the locations and spreads of the centers are:

$\frac{\partial E(n)}{\partial \mu_i(n)} = \sum_{j=1}^{N}\sum_{k=1}^{O} e_{kj}(n)\, w_{ki}\,\exp\!\left(-\frac{\|X_j - \mu_i(n)\|^2}{2\sigma_i^2(n)}\right)\frac{X_j - \mu_i(n)}{\sigma_i^2(n)}$    (4.12)

$\frac{\partial E(n)}{\partial \sigma_i(n)} = \sum_{j=1}^{N}\sum_{k=1}^{O} e_{kj}(n)\, w_{ki}\,\exp\!\left(-\frac{\|X_j - \mu_i(n)\|^2}{2\sigma_i^2(n)}\right)\frac{\|X_j - \mu_i(n)\|^2}{\sigma_i^3(n)}$    (4.13)

where $e_{kj}(n)$ is the error signal of the k-th output neuron with respect to the j-th training point at time n. The gradient learning actually has an effect similar to clustering (Poggio and Girosi 1990). Once the error gradients are available, these parameters can easily be updated with a set of learning rates $\eta_l$ for the different parameter groups. As in the supervised training of the multilayer perceptron, the choice of the learning rate is a problem: when the learning rate is too small, convergence is very slow; when it is too large, the learning procedure may be unstable. To alleviate the influence of the choice of learning rate, an adaptive learning rate with momentum is adopted in the later simulation studies. The NETtalk experiment (Wettschereck and Dietterich 1992) indicated that the generalization performance of supervised-trained RBFNs is better than that of two-stage-trained ones. However, supervised training is computationally expensive compared with two-stage training.
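The update rules (4.10)-(4.13) translate almost line for line into code. The sketch below is a minimal NumPy implementation of mine for a single-output network, with a plain fixed learning rate instead of the adaptive-rate-with-momentum scheme used in the thesis; it performs one full-batch gradient step on the centers, spreads, output weights and bias.

```python
import numpy as np

def rbf_design(X, mu, sigma):
    d2 = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # (N, M) squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2)), d2

def supervised_step(X, d, mu, sigma, w, b, eta=0.01):
    """One batch gradient-descent step for a single-output Gaussian RBFN."""
    H, d2 = rbf_design(X, mu, sigma)           # hidden activations, shape (N, M)
    e = H @ w + b - d                          # errors e_j = y_j - d_j, shape (N,)
    grad_w = H.T @ e                           # eq. (4.10)
    grad_b = np.sum(e)                         # eq. (4.11)
    # eq. (4.12): sum_j e_j * w_i * H_ji * (X_j - mu_i) / sigma_i^2
    grad_mu = ((w[None, :] * H * e[:, None])[:, :, None] *
               (X[:, None, :] - mu[None, :, :]) / (sigma ** 2)[None, :, None]).sum(axis=0)
    # eq. (4.13): sum_j e_j * w_i * H_ji * ||X_j - mu_i||^2 / sigma_i^3
    grad_sigma = (w[None, :] * H * e[:, None] * d2 / sigma ** 3).sum(axis=0)
    return mu - eta * grad_mu, sigma - eta * grad_sigma, w - eta * grad_w, b - eta * grad_b

# toy usage on random 2-D data
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 2)); d = np.sin(X[:, 0])
mu, sigma, w, b = X[:5].copy(), np.ones(5), np.zeros(5), 0.0
mu, sigma, w, b = supervised_step(X, d, mu, sigma, w, b)
```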

Difference Compared with the Multilayer Perceptron

The radial basis function network and the multilayer perceptron are both widely used, since they are both universal approximators. However, there are important differences between these two kinds of neural networks:

1. The hidden units of the MLP compute weighted linear summations of the inputs, whereas the hidden units of the RBFN calculate the distance between the input vector and the prototype vectors (i.e. the centers).
2. The response of the RBFN is localized, and the network can be adjusted locally with new inputs.
3. The MLP can have a complex structure with many layers, whereas the RBFN normally has only one hidden layer.
4. The parameters of the MLP are usually adjusted simultaneously, whereas the training of the RBFN is mostly separated into two stages.

4.2 MLP with Additional Second Order Inputs

Although there are major differences between the multilayer perceptron and radial basis function networks, they are connected to each other. Maruyama, Girosi and Poggio have reported that, for normalized inputs, a multilayer perceptron network can always simulate a Gaussian radial basis function network (Maruyama et al. 1992). Wilensky and Manukian (1992) also proposed the Projection Neural Network, in which two different transformations from the N-dimensional input to an (N+1)-dimensional transformed input space were introduced and resulted in localized responses; all dimensions have to be recalculated in both transformations.
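One simple member of this family of input transformations, appending the sum of squares of the original coordinates as one extra input (the variant taken up in the following paragraphs), can be written in a few lines. This is an illustrative sketch with my own function name, not code from any of the cited works.

```python
import numpy as np

def augment_with_square(X):
    """Map an (n_samples, d) input matrix to (n_samples, d+1) by appending sum_i x_i^2."""
    return np.hstack([X, np.sum(X ** 2, axis=1, keepdims=True)])

X = np.array([[0.0, 0.0], [1.0, 2.0], [-2.0, 0.0]])
print(augment_with_square(X))
# A standard MLP fed with the augmented input can form localized, circular
# decision regions, because a hyperplane in the augmented space corresponds
# to a hypersphere in the original input space.
```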

Wilamowski and Jaeger (1996; 1997) also proposed a simple transformation of the input patterns onto a hypersphere in an augmented space, and the efficiency of this method was verified experimentally. Omohundro (1989) mentioned that an MLP with an additional input equal to the sum of the squares of the other inputs may have localized responses like RBFNs. This kind of additional input increases the input dimension by one, which actually transforms the input onto a hyperbolic surface. Casasent networks (Casasent 1992; Sarajedini 1992) are practical realizations of this concept, allowing either MLPs or RBFNs, or combinations of the two. A more recent report is given by Ridella et al. (1997): the proposed circular backpropagation (CBP) network is also an MLP with an additional input equal to the sum of the squares of the original inputs. The structure of the CBP network is given in Figure 4.2.

Fig. 4.2. Structure of the CBP network: inputs $x_1, \ldots, x_n$ plus the additional input $x_1^2 + \cdots + x_n^2$, hidden layer with biases, and the output layer.

It is clear from Figure 4.2 that the CBP network becomes a standard MLP if the weights connected to the additional input are set to zero. The three-layered CBP network (with only one hidden layer) can also approximate an RBFN with the same number of hidden neurons. At the unit level, the CBP model can be described in the following form:

$h(x, w) = b + \sum_{i=1}^{d} w_i x_i + w_{d+1}\sum_{i=1}^{d} x_i^2$.    (4.14)

It is possible to obtain another form by simple algebraic transformations:

$h(x) = g\left(\|x - c\|^2 + \theta\right)$,    (4.15)

where $g = w_{d+1}$ decides the spread of the Gaussian-like function, $c_i = -\frac{w_i}{2 w_{d+1}}$ are the centers of the Gaussian-like function, and $\theta = \frac{1}{w_{d+1}}\left(b - \sum_{i=1}^{d}\frac{w_i^2}{4 w_{d+1}}\right)$ works like a bias for the Gaussian-like function. The activation function of the CBP is sigmoidal:

$\varphi(h) = \frac{1}{1 + e^{-h}}$.    (4.16)

Let $h' = g\|x - c\|^2$ and multiply the activation by an arbitrary constant $k$, which can be taken from the output weights:

$k\,\varphi(h) = \frac{k\, e^{g\theta}\, e^{h'}}{1 + e^{h' + g\theta}}$.    (4.17)

We can choose $k$ and $g\theta$ such that $\frac{k\, e^{g\theta}}{1 + e^{h' + g\theta}}$ is arbitrarily close to 1 (Ridella et al. 1997), and then the CBP network can approximate the corresponding RBFN if the remaining parts of the output weights are identical to those of the RBFN. Similarly, a three-layered CBP network with the hyperbolic tangent activation function can also approximate a corresponding RBFN with the same number of hidden neurons. If the exponential activation function $\varphi(h) = e^{h}$ is selected in the CBP network, then this exponential CBP (ECBP) network is naturally an RBFN, since the output of the exponential neuron becomes:

$\varphi(h) = e^{g(\|x - c\|^2 + \theta)} = e^{g\|x - c\|^2}\, e^{g\theta}$.    (4.18)

The term $e^{g\theta}$ can easily be absorbed into the output weights. In RBFNs the spreads of the Gaussian functions are always positive, but in the ECBP network the spread $g = w_{d+1}$ can be negative, so the ECBP network is actually a generalized version of the RBFN.

Comparative Study

To illustrate the effectiveness of the CBP network and the proposed ECBP network, a number of simulation studies are carried out. Various training methods for RBFNs are also examined, and the performances are compared together with multilayer perceptron networks.

Simulation Setup

All of the simulation studies are conducted in MATLAB, based on the NETLAB toolbox and the Neural Network toolbox of MATLAB. All the networks have only one hidden layer, and the activation functions of the output neurons are all identity (linear) functions (called purelin in MATLAB). The following are the detailed settings of the networks to be compared:

1. MLP: The Levenberg-Marquardt algorithm (Marquardt 1963; Moré 1977) is used; the activation functions in the hidden layer are all hyperbolic tangent functions (called tansig), and the Nguyen-Widrow method (Nguyen and Widrow 1990) is used to initialize the weights of each layer of the MLP.

2. CBP: The same settings as those of the MLP are adopted.
3. ECBP: The same settings as those of the MLP are used, except that the activation function in the hidden layer is the exponential function.
4. RBFN-1: The centers are randomly chosen from the training samples following the method described in the section on two-stage training, and the least-squares method is used to calculate the output weights of the network.
5. RBFN-2: The k-means clustering method is adopted here, and the spread is the same as that of RBFN-1.
6. S-RBFN: A one-stage supervised learning algorithm with momentum and an adaptive learning rate is used to train the RBFN.

The maximum number of iterations for the supervised networks (MLP, CBP, ECBP and S-RBFN) is fixed at 10000, and the maximum number of iterations for the clustering algorithm is 1000, unless specified otherwise. Each result in the tables is the best one from ten trials. Runs that did not achieve the performance goal within the specified number of epochs are marked "Failed".

Simulation Results

Simulation 4.1: Consider the approximation problem of Simulation 3.1 with the same training and test data sets. Table 4.1 gives the minimum number of hidden neurons needed to reach the test goal.

TABLE 4.1 The minimum number of hidden neurons needed for Simulation 4.1 (MLP / CBP / ECBP / RBFN-1 / RBFN-2 / S-RBFN)

Comment 4.1: From Table 4.1 we can see that fewer hidden neurons are needed with supervised training. We also note that for the CBP and ECBP networks a network with only 2 hidden neurons is sufficient for a satisfactory approximation, presumably because of the additional input weights incorporated in the CBP and ECBP networks. The total number of free parameters of the MLP with 3 hidden neurons is 10, while those of the CBP and ECBP with 2 hidden neurons are already 9. The clustering method does not work very efficiently compared with the S-RBFN, which adjusts the centers, spreads and weights at the same time.

Simulation 4.2: The noisy approximation problem of Simulation 3.2 is revisited here, and the result is shown in Figure 4.3. With the minimum structure, all the networks generalize well. But when the number of hidden neurons is increased to 50, the RBFNs (except for the clustering method, since the number of centers would exceed the number of training data) and the ECBP network still give good generalization performance, whereas the CBP and the MLP trained with the Levenberg-Marquardt algorithm result in over-fitting.

Fig. 4.3. Approximation with 50 hidden neurons. (a) Approximation by ECBP. (b) Approximation by CBP.

Comment 4.2: The difference in generalization performance between CBP and ECBP comes from the difference in the activation functions. The hyperbolic tangent function is more efficient than the exponential function and is hence more prone to the over-fitting problem (Caruana et al. 2000).

Simulation 4.3: The more complicated one-dimensional example of Simulation 3.3 is reconsidered here. The training and test sets remain the same. Table 4.2 gives the minimum number of hidden neurons needed to reach the error bound.

TABLE 4.2 The minimum number of hidden neurons needed for Simulation 4.3 (MLP / CBP / ECBP / RBFN-1 / RBFN-2 / S-RBFN: Failed)

Comment 4.3: Again, CBP and ECBP give the best approximation results in terms of the number of hidden neurons, and even in terms of the total number of free parameters: the minimum number of parameters of the MLP is 28, while that of CBP and ECBP is 25. None of the RBFN networks gives good approximation performance, which is related to the nearly linear part in the middle of the function to be approximated, since RBFN networks are very inefficient at approximating linear or constant functions. The failure of the S-RBFN may be attributed either to the slow convergence rate of the learning algorithm or to the local-minima problem.

Simulation 4.4: Consider the two-dimensional example of Simulation 3.5 with the same training and test sets. The minimum numbers of hidden neurons needed to reach the error bound 0.15 are given in Table 4.3.

TABLE 4.3 The minimum number of hidden neurons needed for Simulation 4.4 (MLP / CBP / ECBP / RBFN-1: Failed / RBFN-2: Failed / S-RBFN: Failed)

Comment 4.4: The total number of parameters of the minimum structures of CBP and ECBP is 25, and that of the MLP is 40, so CBP and ECBP are very efficient in terms of the number of parameters. Moreover, the CBP can even reach an error bound of less than 0.04, which is hard even for EANNs (see Simulation 2.5 in Chapter 2). All of the RBFNs failed because of the severe over-fitting problem in this particular case.

Simulation 4.5: We revisit the Gaussian hill and valley problem of Simulation 3.3. The training and test sets are still the same. The clustering method does not work well, since the training data partition the input space uniformly. The target error bound set for the approximation is 0.05, which is quite strict. If the centers of the RBFN are chosen randomly from the training data, about 100 centers are needed to achieve the approximation goal. If the training is continued beyond the targeted error bound, after 5000 epochs the approximation error bound can be reduced further for both the CBP networks and supervised training. For the ECBP network, as shown in Figure 4.4, the error bound can be reduced even further after only 63 epochs. Since the training data are uniformly distributed, the performance of RBFN-1 and RBFN-2 is much worse than that of S-RBFN.

TABLE 4.4 The minimum number of hidden neurons needed for Simulation 4.5 (MLP / CBP / ECBP / RBFN-1 / RBFN-2 / S-RBFN)

Fig. 4.4: The approximation error of the ECBP network.

Comment 4.5: If an additional hidden layer is added to the MLP, a simpler network can be found that meets the required error bound, but it is still much more complex than the structures of CBP, ECBP and the supervised RBFN. The input weights of the trained ECBP network are [ ... ; ... ]. Since the centers of the equivalent RBFN should be $c_i = -w_i/(2 w_{d+1})$ and the spread $g = w_{d+1}$, the centers are (2, 2) and (-2, 0) and the spreads are both 1, which are exactly the centers and spreads of the original Gaussian hill and valley. This fact inspires the idea that a partially trained ECBP network may be used to initialize the centers of RBFNs, since this kind of supervised training has an effect similar to the clustering procedure. Because this supervised training appears to be more efficient when the training data do not naturally appear in groups or clusters, we can expect this method of initializing RBFNs to be efficient in such cases, and we can also expect the total training time to be shorter than that of S-RBFN.

Simulation 4.6: We consider the classical two-spiral problem (Lang 1989; Fahlman 1989); the training set is shown in Figure 4.5. Table 4.5 gives the minimum number of hidden neurons needed by each network to achieve zero misclassification.

Fig. 4.5. The two-spiral problem.
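For reference, training sets of this kind are usually generated with the classical CMU two-spirals recipe; the sketch below follows my recollection of that recipe (97 points per spiral, with the radius and angle schedules shown), so the exact constants should be treated as assumptions rather than the thesis's own data.

```python
import numpy as np

def two_spirals(points_per_spiral=97):
    """Generate the classical two-spiral classification set (labels +1 / -1)."""
    i = np.arange(points_per_spiral)
    phi = i * np.pi / 16.0                       # angle schedule
    r = 6.5 * (104 - i) / 104.0                  # radius shrinks towards the centre
    x1 = np.column_stack([r * np.sin(phi), r * np.cos(phi)])   # first spiral
    x2 = -x1                                                    # second spiral, rotated by 180 degrees
    X = np.vstack([x1, x2])
    y = np.hstack([np.ones(points_per_spiral), -np.ones(points_per_spiral)])
    return X, y

X, y = two_spirals()
print(X.shape, y.shape)   # (194, 2) (194,) -- 97 points per class, as in Comment 4.6
```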

TABLE 4.5 The minimum number of hidden neurons needed for Simulation 4.6 (MLP / CBP / ECBP / RBFN-1 / RBFN-2 / S-RBFN: Failed)

Comment 4.6: Ridella et al. (1997) mentioned that an RBFN with 4 hidden neurons can solve the two-spiral problem with considerable optimization effort, but no details are given. In the above simulation studies no specific optimization effort is added, so most of the RBFNs failed to give the correct classification. Theoretically, an RBFN with randomly selected centers can solve this problem with 97 (half of the total training points) hidden neurons, but it is almost impossible to select all 97 centers from one class. Hence, about 130 centers are needed to classify the two classes correctly. In Figure 4.6, very interestingly, all the decision boundaries are smooth except that of the MLP. That is because the decision boundary of a single neuron in the MLP is linear and global, whereas the others are circular and localized.

Fig. 4.6. Approximation results for the two-spiral problem. (a) MLP with 9 HN. (b) CBP with 8 HN. (c) ECBP with 4 HN. (d) RBFN-1 with 130 HN. (e) RBFN-2 with 98 HN.

Simulation 4.7: Here we turn to a simple linear separation problem, where the line x = 1 separates the input space into two classes. The training data comprise 1000 points randomly selected from the input space, and the test data set consists of 676 points sampled by partitioning the input space with a uniform grid. The training and test data sets are shown in Figure 4.7. Table 4.6 shows the minimum number of hidden neurons needed by the different networks to achieve zero misclassification.

Fig. 4.7. The training and test data of the simple linear separation problem. (a) Training data. (b) Test data.

TABLE 4.6 The minimum number of hidden neurons needed for Simulation 4.7 (MLP / CBP / ECBP / RBFN-1 / RBFN-2 / S-RBFN)

Comment 4.7: We can see that the MLP works best, since the decision boundary of the MLP is naturally linear, which is very suitable for this problem. The CBP network also works efficiently, as it can represent an MLP exactly; however, since its structure is more complex than that of the standard MLP, the performance of the CBP network is not as good as that of the MLP. Theoretically, the ECBP network can also be reduced to an MLP if the weights connecting to the additional input are set to zero. But the exponential transfer function is not as efficient as the hyperbolic tangent function, and the ECBP is closer to the RBFNs, so more hidden neurons are needed for the ECBP. The RBFNs


More information

Priority based Dynamic Multiple Robot Path Planning

Priority based Dynamic Multiple Robot Path Planning 2nd Internatonal Conference on Autonomous obots and Agents Prorty based Dynamc Multple obot Path Plannng Abstract Taxong Zheng Department of Automaton Chongqng Unversty of Post and Telecommuncaton, Chna

More information

Comparison of Two Measurement Devices I. Fundamental Ideas.

Comparison of Two Measurement Devices I. Fundamental Ideas. Comparson of Two Measurement Devces I. Fundamental Ideas. ASQ-RS Qualty Conference March 16, 005 Joseph G. Voelkel, COE, RIT Bruce Sskowsk Rechert, Inc. Topcs The Problem, Eample, Mathematcal Model One

More information

Optimization of Ancillary Services for System Security: Sequential vs. Simultaneous LMP calculation

Optimization of Ancillary Services for System Security: Sequential vs. Simultaneous LMP calculation Optmzaton of Ancllary Servces for System Securty: Sequental vs. Smultaneous LM calculaton Sddhartha Kumar Khatan, Yuan L, Student Member, IEEE, and Chen-Chng. Lu, Fellow, IEEE Abstract-- A lnear optmzaton

More information

Research Article Indoor Localisation Based on GSM Signals: Multistorey Building Study

Research Article Indoor Localisation Based on GSM Signals: Multistorey Building Study Moble Informaton Systems Volume 26, Artcle ID 279576, 7 pages http://dx.do.org/.55/26/279576 Research Artcle Indoor Localsaton Based on GSM Sgnals: Multstorey Buldng Study RafaB Górak, Marcn Luckner, MchaB

More information

Nonlinear Complex Channel Equalization Using A Radial Basis Function Neural Network

Nonlinear Complex Channel Equalization Using A Radial Basis Function Neural Network Nonlnear Complex Channel Equalzaton Usng A Radal Bass Functon Neural Network Mclau Ncolae, Corna Botoca, Georgeta Budura Unversty Poltehnca of Tmşoara cornab@etc.utt.ro Abstract: The problem of equalzaton

More information

MTBF PREDICTION REPORT

MTBF PREDICTION REPORT MTBF PREDICTION REPORT PRODUCT NAME: BLE112-A-V2 Issued date: 01-23-2015 Rev:1.0 Copyrght@2015 Bluegga Technologes. All rghts reserved. 1 MTBF PREDICTION REPORT... 1 PRODUCT NAME: BLE112-A-V2... 1 1.0

More information

Discussion on How to Express a Regional GPS Solution in the ITRF

Discussion on How to Express a Regional GPS Solution in the ITRF 162 Dscusson on How to Express a Regonal GPS Soluton n the ITRF Z. ALTAMIMI 1 Abstract The usefulness of the densfcaton of the Internatonal Terrestral Reference Frame (ITRF) s to facltate ts access as

More information

Application of Intelligent Voltage Control System to Korean Power Systems

Application of Intelligent Voltage Control System to Korean Power Systems Applcaton of Intellgent Voltage Control System to Korean Power Systems WonKun Yu a,1 and HeungJae Lee b, *,2 a Department of Power System, Seol Unversty, South Korea. b Department of Power System, Kwangwoon

More information

Network Reconfiguration in Distribution Systems Using a Modified TS Algorithm

Network Reconfiguration in Distribution Systems Using a Modified TS Algorithm Network Reconfguraton n Dstrbuton Systems Usng a Modfed TS Algorthm ZHANG DONG,FU ZHENGCAI,ZHANG LIUCHUN,SONG ZHENGQIANG School of Electroncs, Informaton and Electrcal Engneerng Shangha Jaotong Unversty

More information

An Adaptive Over-current Protection Scheme for MV Distribution Networks Including DG

An Adaptive Over-current Protection Scheme for MV Distribution Networks Including DG An Adaptve Over-current Protecton Scheme for MV Dstrbuton Networks Includng DG S.A.M. Javadan Islamc Azad Unversty s.a.m.javadan@gmal.com M.-R. Haghfam Tarbat Modares Unversty haghfam@modares.ac.r P. Barazandeh

More information

Review: Our Approach 2. CSC310 Information Theory

Review: Our Approach 2. CSC310 Information Theory CSC30 Informaton Theory Sam Rowes Lecture 3: Provng the Kraft-McMllan Inequaltes September 8, 6 Revew: Our Approach The study of both compresson and transmsson requres that we abstract data and messages

More information

Diversion of Constant Crossover Rate DE\BBO to Variable Crossover Rate DE\BBO\L

Diversion of Constant Crossover Rate DE\BBO to Variable Crossover Rate DE\BBO\L , pp. 207-220 http://dx.do.org/10.14257/jht.2016.9.1.18 Dverson of Constant Crossover Rate DE\BBO to Varable Crossover Rate DE\BBO\L Ekta 1, Mandeep Kaur 2 1 Department of Computer Scence, GNDU, RC, Jalandhar

More information

Novel Artificial Neural Networks For Remote-Sensing Data Classification

Novel Artificial Neural Networks For Remote-Sensing Data Classification ovel Artfcal eural etwors For Remote-Sensng Data Classfcaton Xaol Tao * and Howard E. chel ξ Unversty of assachusetts Dartmouth, Dartmouth A 0747 ABSTRACT Ths paper dscusses two novel artfcal neural networ

More information

Breast Cancer Detection using Recursive Least Square and Modified Radial Basis Functional Neural Network

Breast Cancer Detection using Recursive Least Square and Modified Radial Basis Functional Neural Network Breast Cancer Detecton usng Recursve Least Square and Modfed Radal Bass Functonal Neural Network M.R.Senapat a, P.K.Routray b,p.k.dask b,a Department of computer scence and Engneerng Gandh Engneerng College

More information

Estimation of Solar Radiations Incident on a Photovoltaic Solar Module using Neural Networks

Estimation of Solar Radiations Incident on a Photovoltaic Solar Module using Neural Networks XXVI. ASR '2001 Semnar, Instruments and Control, Ostrava, Aprl 26-27, 2001 Paper 14 Estmaton of Solar Radatons Incdent on a Photovoltac Solar Module usng Neural Networks ELMINIR, K. Hamdy 1, ALAM JAN,

More information

Tile Values of Information in Some Nonzero Sum Games

Tile Values of Information in Some Nonzero Sum Games lnt. ournal of Game Theory, Vot. 6, ssue 4, page 221-229. Physca- Verlag, Venna. Tle Values of Informaton n Some Nonzero Sum Games By P. Levne, Pars I ), and ZP, Ponssard, Pars 2 ) Abstract: The paper

More information

Phasor Representation of Sinusoidal Signals

Phasor Representation of Sinusoidal Signals Phasor Representaton of Snusodal Sgnals COSC 44: Dgtal Communcatons Instructor: Dr. Amr Asf Department of Computer Scence and Engneerng York Unversty Handout # 6: Bandpass odulaton Usng Euler dentty e

More information

MASTER TIMING AND TOF MODULE-

MASTER TIMING AND TOF MODULE- MASTER TMNG AND TOF MODULE- G. Mazaher Stanford Lnear Accelerator Center, Stanford Unversty, Stanford, CA 9409 USA SLAC-PUB-66 November 99 (/E) Abstract n conjuncton wth the development of a Beam Sze Montor

More information

Performance Study of OFDMA vs. OFDM/SDMA

Performance Study of OFDMA vs. OFDM/SDMA Performance Study of OFDA vs. OFD/SDA Zhua Guo and Wenwu Zhu crosoft Research, Asa 3F, Beng Sgma Center, No. 49, Zhchun Road adan Dstrct, Beng 00080, P. R. Chna {zhguo, wwzhu}@mcrosoft.com Abstract: In

More information

Mooring Cost Sensitivity Study Based on Cost-Optimum Mooring Design

Mooring Cost Sensitivity Study Based on Cost-Optimum Mooring Design Proceedngs of Conference 8 Korean Socety of Ocean Engneers May 9-3, Cheju, Korea Moorng Cost Senstvty Study Based on Cost-Optmum Moorng Desgn SAM SANGSOO RYU, CASPAR HEYL AND ARUN DUGGAL Research & Development,

More information

Partial Discharge Pattern Recognition of Cast Resin Current Transformers Using Radial Basis Function Neural Network

Partial Discharge Pattern Recognition of Cast Resin Current Transformers Using Radial Basis Function Neural Network J Electr Eng Technol Vol. 9, No. 1: 293-300, 2014 http://dx.do.org/10.5370/jeet.2014.9.1.293 ISSN(Prnt) 1975-0102 ISSN(Onlne) 2093-7423 Partal Dscharge Pattern Recognton of Cast Resn Current Transformers

More information

A Simple Satellite Exclusion Algorithm for Advanced RAIM

A Simple Satellite Exclusion Algorithm for Advanced RAIM A Smple Satellte Excluson Algorthm for Advanced RAIM Juan Blanch, Todd Walter, Per Enge Stanford Unversty ABSTRACT Advanced Recever Autonomous Integrty Montorng s a concept that extends RAIM to mult-constellaton

More information

Equity trend prediction with neural networks

Equity trend prediction with neural networks Res. Lett. Inf. Math. Sc., 2004, Vol. 6, pp 15-29 15 Avalable onlne at http://ms.massey.ac.nz/research/letters/ Equty trend predcton wth neural networks R.HALLIDAY Insttute of Informaton & Mathematcal

More information

Comparison of Gradient descent method, Kalman Filtering and decoupled Kalman in training Neural Networks used for fingerprint-based positioning

Comparison of Gradient descent method, Kalman Filtering and decoupled Kalman in training Neural Networks used for fingerprint-based positioning Comparson of Gradent descent method, Kalman lterng and decoupled Kalman n tranng Neural Networs used for fngerprnt-based postonng Claude Mbusa Taenga, Koteswara Rao Anne, K Kyamaya, Jean Chamberlan Chedou

More information

New Applied Methods For Optimum GPS Satellite Selection

New Applied Methods For Optimum GPS Satellite Selection New Appled Methods For Optmum GPS Satellte Selecton Hamed Azam, Student Member, IEEE Department of Electrcal Engneerng Iran Unversty of Scence &echnology ehran, Iran hamed_azam@eee.org Mlad Azarbad Department

More information

Figure 1. DC-DC Boost Converter

Figure 1. DC-DC Boost Converter EE46, Power Electroncs, DC-DC Boost Converter Verson Oct. 3, 11 Overvew Boost converters make t possble to effcently convert a DC voltage from a lower level to a hgher level. Theory of Operaton Relaton

More information

BP Neural Network based on PSO Algorithm for Temperature Characteristics of Gas Nanosensor

BP Neural Network based on PSO Algorithm for Temperature Characteristics of Gas Nanosensor 2318 JOURNAL OF COMPUTERS, VOL. 7, NO. 9, SEPTEMBER 2012 BP Neural Network based on PSO Algorthm for Temperature Characterstcs of Gas Nanosensor Weguo Zhao Center of Educaton Technology, Hebe Unversty

More information

PERFORMANCE COMPARISON OF THREE ALGORITHMS FOR TWO-CHANNEL SINEWAVE PARAMETER ESTIMATION: SEVEN PARAMETER SINE FIT, ELLIPSE FIT, SPECTRAL SINC FIT

PERFORMANCE COMPARISON OF THREE ALGORITHMS FOR TWO-CHANNEL SINEWAVE PARAMETER ESTIMATION: SEVEN PARAMETER SINE FIT, ELLIPSE FIT, SPECTRAL SINC FIT XIX IMEKO World Congress Fundamental and ppled Metrology September 6, 009, Lsbon, Portugal PERFORMNCE COMPRISON OF THREE LGORITHMS FOR TWO-CHNNEL SINEWVE PRMETER ESTIMTION: SEVEN PRMETER SINE FIT, ELLIPSE

More information

Artificial Intelligence Techniques Applications for Power Disturbances Classification

Artificial Intelligence Techniques Applications for Power Disturbances Classification Internatonal Journal of Electrcal and Computer Engneerng 3:5 28 Artfcal Intellgence Technques Applcatons for Power Dsturbances Classfcaton K.Manmala, Dr.K.Selv and R.Ahla Abstract Artfcal Intellgence (AI)

More information