Image Compression Using Cascaded Neural Networks


University of New Orleans
ScholarWorks@UNO
University of New Orleans Theses and Dissertations

Image Compression Using Cascaded Neural Networks

Chigozie Obiegbu
University of New Orleans

Follow this and additional works at:

Recommended Citation
Obiegbu, Chigozie, "Image Compression Using Cascaded Neural Networks" (2003). University of New Orleans Theses and Dissertations.

This Thesis is brought to you for free and open access by the Dissertations and Theses at ScholarWorks@UNO. It has been accepted for inclusion in University of New Orleans Theses and Dissertations by an authorized administrator of ScholarWorks@UNO. The author is solely responsible for ensuring compliance with copyright. For more information, please contact scholarworks@uno.edu.

IMAGE COMPRESSION USING CASCADED NEURAL NETWORKS

A Thesis

Submitted to the Graduate Faculty of the
University of New Orleans
in partial fulfillment of the
requirements for the degree of

Master of Science
in
The Department of Electrical Engineering

by
Chigozie Obiegbu
B.Eng., Federal University of Technology Owerri, 1997

August 2003

ACKNOWLEDGMENTS

The completion of this thesis has involved an enormous amount of help from a number of people. First and foremost of these is Dr. Dimitrios Charalampidis, my thesis advisor, for providing the original ideas, suggestions, and motivation for the work. He freely bestowed his time, guidance, brilliance, and wisdom considerably beyond the call of duty, and was a model of professorial responsibility, professionalism, and commitment. What I have learned from him cannot be quantified. Special thanks to the other members of my thesis committee, Dr. Juliette Ioup and Dr. Terry Riemer, for enriching my experience in academia by sharing with me their intellectual curiosity, professional insight and integrity, and personal warmth and understanding through their courses. I would especially like to thank Dr. Juliette Ioup for carefully reading through the entire draft of my thesis and offering several helpful editorial comments. If anyone has had to be patient with me in the course of writing this thesis, it is my special friend, Melissa Bias. Her companionship has helped me to put forth my full effort, and to maintain my sanity. Sincere appreciation goes to Vijay Kura, a fellow graduate student and a good friend, for lending some of his exquisite programming skills at the beginning stages. And last, but most of all, to my parents, James and Mmachukwu Obiegbu, I owe everything; they sustain me in all that I do and it is to them that this work is dedicated with love; and in loving memory of my dearest cousin, Uchenna Ebeledike.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1. Introduction

CHAPTER 2. Techniques for Image Compression
2.1 Vector Quantization
2.2 Predictive Coding
2.3 Transform Coding
2.3.1 Discrete Cosine Transform

CHAPTER 3. Artificial Neural Network Technology - an Overview
3.1 Basic Principles of Learning in Neural Networks
Type of Connection between Neurons
Connection between Input and Output Data
Input and Transfer Functions
Type of Learning
Other Parameters for NN Architecture Design
3.2 Backpropagation Network
3.3 Radial-Basis Function Network
3.4 General Regression Network
Modular Network
Probabilistic Network
Learning Vector Quantization Network
The Cascade Architecture Neural Network

CHAPTER 4. Image Compression Using Neural Networks
Single-structure NN image compression implementation
Pre-processing
Training
Simulation
Post-processing
Parallel-structure NN image compression implementation
Proposed Cascade Architecture
Encoding process
Decoding process

CHAPTER 5. Results
Comparisons in terms of PSNR
Comparisons in terms of Computational Complexity

CHAPTER 6. Discussion and Conclusions

REFERENCES
VITA
APPENDIX

LIST OF TABLES

Neural Network Architectures
Comparison between single-structure and cascade architectures
PSNR values for the JPEG algorithm

LIST OF FIGURES

1.0 Block diagram of JPEG compression
Zigzag sequence for binary encoding
Graph of hyperbolic tangent function
Multiple-input neuron
Architecture of Backpropagation Neural Network
Architecture of RBFN
Architecture of GRNN
Architecture of Modular Neural Network
Architecture of Probabilistic Neural Network
Cascade Correlation Network
Image Compression Block Diagram
Single-structure neural network image compression/decompression scheme
(a) Encoding scheme for proposed cascade architecture
(b) Decoding scheme for proposed cascade architecture
PSNR vs. CR for reconstructed Lena image
PSNR vs. CR for reconstructed Peppers image
PSNR vs. CR for reconstructed Baboon image
(a) Original Lena image
(b) Reconstructed Lena image at 8:1 CR using single-structure NN
(c) Reconstructed Lena image at 8:1 CR using cascade method
5.4 (d) Reconstructed Lena image at 8:1 CR using JPEG

ABSTRACT

Images are forming an increasingly large part of modern communications, bringing the need for efficient and effective compression. Many techniques developed for this purpose include transform coding, vector quantization, and neural networks. In this thesis, a new neural network method is used to achieve image compression. This work extends the use of 2-layer neural networks to a combination of cascaded networks with one node in the hidden layer. A redistribution of the gray levels in the training phase is implemented in a random fashion to make the minimization of the mean square error applicable to a broad range of images. The computational complexity of this approach is analyzed in terms of the overall number of weights and the overall convergence. Image quality is measured objectively, using peak signal-to-noise ratio, and subjectively, using perception. The effects of different image contents and compression ratios are assessed. Results show the performance superiority of cascaded neural networks compared to that of fixed-architecture training paradigms, especially at high compression ratios. The proposed new method is implemented in MATLAB. The results obtained, such as compression ratio and computing time of the compressed images, are presented.

CHAPTER 1
Introduction

Computer images are extremely data intensive and hence require large amounts of memory for storage. As a result, the transmission of still images from one machine to another can be very time consuming. For this reason still image compression is the subject of an intense worldwide research effort [1]-[8]. By using data compression techniques, it is possible to remove some of the redundant information contained in images, requiring less storage space and less time to transmit. The objective of digital image compression techniques is the minimization of the number of bits required to represent an image, while maintaining an acceptable image quality. Another issue in image compression and decompression is the processing speed, especially in real-time applications. It is often desirable to be able to carry out compression and decompression in real time without reducing image quality.

Numerous lossy image compression techniques have been developed in the past years. The transform-based coding techniques have proved to be the most effective in obtaining large compression ratios while retaining good visual quality. In low-noise environments, where the bit error rate is less than $10^{-6}$, the JPEG [3]-[4] picture compression algorithm, which employs cosine transforms, has been found to obtain excellent results in many digital image compression applications. However, an increasing number of applications are required for use in high-noise environments [12]-[14], e.g.,

the transmission of a compressed picture over a mobile or cordless telephone. In these applications, the bit error rates due to noise on the transmission channel may be as high as $2 \times 10^{-2}$ [12], [14]. At these error rates, JPEG and similar compression algorithms, which rely on entropy coding, are not suitable [12]. Recently, neural networks have proved to be useful in image compression because of their parallel architecture and flexibility [15]-[20]. They do not use entropy coding and are therefore intrinsically robust, making them an attractive choice for high-noise environments. Of course a price must be paid for this robustness. In this case, the price is a reduction in decompressed image quality for the same compression efficiency. However, as shall be seen in Chapter 4, the reduction in image quality is not excessive.

Although parallel-structure neural networks are robust, they suffer from several drawbacks. The main drawbacks are: 1) high computational complexity; 2) moderate reconstructed picture quality; and 3) variable bit rate. In this thesis the first two drawbacks are tackled in a proposed neural network (NN) method that uses a cascade of feedforward networks with one node at the hidden layer. The third drawback is not directly tackled, to avoid increasing the computational complexity of the system. However, provision is made for the rare occasions when the compression efficiency is much lower than expected. Compared to the current parallel-structure NNs, the proposed NN has a lower computational complexity, faster convergence, and higher compression efficiency.

The chapters of this thesis are organized as follows: Chapter 2 briefly reviews three of the major approaches to image compression, namely predictive coding, transform coding, and vector quantization. In particular, the JPEG technique is described. Chapter 3 discusses NN technology, including various architectures and algorithms. These architectures include Backpropagation, Radial-Basis Function, General Regression, Modular, Probabilistic, and Learning Vector Quantization. The single-structure NN consisting of two layers and the parallel-structure NN are embedded in the discussion. In Chapter 4, the single-structure NN as an image compression tool is implemented, and the proposed NN method using a cascade of feedforward networks is presented. A background on this approach is given, followed by its application to image compression. Chapter 5 presents experimental results. The error metrics used in computing the results are described. Graphical and pictorial results are fully illustrated. Performance comparisons are made with JPEG and single-structure/parallel-structure NNs. Issues relating to a comprehensive evaluation of the image compression techniques presented herein are discussed in Chapter 6. The practicality and usefulness of the new method are discussed, and the thesis concludes by suggesting certain modifications to the new NN method to achieve even better results. Specific remarks on the development of the code used in the new NN method assist in highlighting the usefulness of the completed work.

CHAPTER 2
Techniques for Image Compression

Image compression methods are categorized as lossless and lossy. Lossless methods preserve all original information without changing it. Lossy methods discard some information to attain a better compression ratio. Lossless compression has been found to be adequate when low compression ratios are acceptable. Substantially higher compression ratios can only be achieved with lossy compression schemes, which will be the main focus of this thesis. Due to the extensive breadth of this field, it is impossible to list all of the currently available image compression techniques. However, existing research falls into one of three major categories: vector quantization, predictive coding, or transform coding.

2.1 Vector Quantization

Quantization refers to the process of approximating a continuous set of values in the image data with a finite (preferably small) set of values. The input to a quantizer is the original data, and the output is always one among a finite number of levels. The quantizer is a function whose set of output values is discrete and usually finite. The concept of quantizing data can be extended from scalar or one-dimensional data to vector data of arbitrary dimension.
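The scalar case can be illustrated with the short sketch below (an illustrative MATLAB fragment only, not code from this thesis; the step size and data are assumptions). A uniform quantizer maps each continuous value to the nearest of a finite set of output levels, and only the level index needs to be stored.

    % Illustrative uniform scalar quantizer with a finite set of output levels.
    x    = 255 * rand(1, 10);        % continuous-valued samples in [0, 255] (assumed)
    step = 32;                       % quantization step size (assumption)

    idx   = round(x / step);         % integer level index actually stored or transmitted
    x_hat = idx * step;              % reconstructed (approximated) values
    % Only the indices need to be coded; the reconstruction error is at most step/2.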

Vector Quantization (VQ) [9]-[11] uses a codebook containing pixel patterns with a corresponding index for each of them. The main idea of VQ in image coding is then to represent arrays of pixels by an index into the codebook. In this way, compression is achieved since the size of the index is usually a small fraction of that of the block of pixels. The codebook is used to quantize the incoming vectors and is analogous to the quantization levels in a scalar quantizer. A good codebook can reduce the overall distortion of the reconstructed image, and it largely determines the performance of the VQ process.

At the encoder, the incoming image is partitioned into blocks of sub-images. These blocks have dimensions equal to those of the codebook entries, so comparisons between them can be made easily. An input vector $X_n$ consisting of a block of pixels is quantized by encoding $X_n$ into a binary index $i_n$, which points to an entry in the codebook. This $i_n$ then serves as an index for the output reproduction vector or codeword. The standard approach to calculating the codebook is by way of the Linde, Buzo, and Gray (LBG) algorithm [9]. Finally, the concatenation of all $i_n$ represents the compressed image. However, the choice of index $i_n$ will be different when using different mapping rules. These mapping rules often depend on minimizing a predefined distortion measure $d(X_n, Y_n)$, where $X_n$ and $Y_n$ are the vectors from the image and from the codebook, respectively. A reliable assessment criterion based on the properties of the human visual system has not yet been defined; hence Euclidean metrics are adopted as a default distortion measure. Besides the LBG algorithm, other iterative approaches relate to neural network models and are based on Kohonen Self-Organizing Maps (SOMs) [21].
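To make the encoding step concrete, the following sketch (illustrative MATLAB only, not the thesis implementation; the block size, codebook size, and variable names are assumptions) quantizes each image block to the index of its nearest codeword under the Euclidean distance mentioned above.

    % Illustrative VQ encoder: nearest-codeword search for each image block.
    img      = rand(256, 256);        % placeholder image (assumed)
    B        = 4;                     % block side length (assumption)
    K        = 64;                    % codebook size (assumption)
    codebook = rand(K, B*B);          % each row is one codeword Y_n (assumed, e.g. from LBG)

    [rows, cols] = size(img);
    indices = zeros(rows/B, cols/B);  % one index i_n per block
    for r = 1:B:rows
        for c = 1:B:cols
            blk = img(r:r+B-1, c:c+B-1);
            x   = reshape(blk, 1, B*B);                      % input vector X_n
            d   = sum((codebook - repmat(x, K, 1)).^2, 2);   % squared Euclidean distances
            [dmin, best] = min(d);                           % nearest codeword
            indices((r-1)/B + 1, (c-1)/B + 1) = best;
        end
    end
    % The concatenation of the indices (log2(K) bits each) forms the compressed image.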

The main advantages of VQ are the simplicity of its idea and the possible efficient implementation of the decoder. Moreover, VQ is theoretically an efficient method for image compression, and superior performance will be gained for large vectors. However, in order to use large vectors, VQ encoding becomes complex and requires many computational resources (e.g., memory, computations per pixel) in order to efficiently construct and search a codebook. For instance, while the LBG algorithm converges to a local minimum, it is not guaranteed to reach the global minimum. In addition, the algorithm is very sensitive to the initial codebook. Furthermore, the algorithm is slow since it requires, on each iteration, an exhaustive search through the entire codebook. More research on reducing this complexity must be done in order to make VQ a practical image compression method with superior quality.

2.2 Predictive Coding

Predictive image coding algorithms [5] are used primarily to exploit the correlation between adjacent pixels. They predict the value of a given pixel based on the values of the surrounding pixels. Due to the correlation property among adjacent pixels in an image, the use of a predictor can reduce the amount of information bits required to represent the image. This can be accomplished through the use of predictive coding or differential pulse-code modulation (DPCM). The predictor uses past samples $x(n-1), x(n-2), \ldots, x(n-p)$, or in the case of images, neighboring pixels, to calculate an estimate, $\hat{x}(n)$, of the current sample. It is the difference between the true value and the estimate, namely $e(n) = x(n) - \hat{x}(n)$, which is used for storage or transmission. As the accuracy of the predictor increases, the variance of

the difference decreases, resulting in higher predictive gain and therefore a higher compression ratio.

The problem of course is how to design the predictor. One approach is to use a statistical model of the data to derive a function which relates the values of the neighboring pixels to that of the current one in an optimal manner. An autoregressive (AR) model is one such model which has been successfully applied to images. For a pth-order causal AR process, the nth value $x(n)$ is related to the previous $p$ values in the following manner:

$$x(n) = \sum_{j=1}^{p} \omega_j \, x(n-j) + \varepsilon_n, \qquad (2.1)$$

where $\{\omega_j\}$ is a set of AR coefficients, and $\{\varepsilon_n\}$ is a set of zero-mean independent and identically distributed random variables. In this case, the predicted value is a linear sum of neighboring samples (pixels), as shown by

$$\hat{x}(n) = \sum_{j=1}^{p} \omega_j \, x(n-j). \qquad (2.2)$$

Equation (2.2) is the basis of linear predictive coding. To minimize the mean squared error $E[(\hat{x} - x)^2]$, the following relationship must be satisfied:

$$R\,w = d, \qquad (2.3)$$

where $[R]_{ij} = E[x(i)\,x(j)]$ is the $ij$th element of the autocovariance matrix $R$ and $d_j = E[\hat{x}(n)\,x(j)]$ is the $j$th element of the cross-covariance vector $d$. Knowing $R$ and $d$, the unknown coefficient vector $w$ can be computed, and the AR model (i.e., the predictor) is thereby determined.
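As an illustration of equations (2.1)-(2.3), the sketch below (illustrative MATLAB; the signal, predictor order, and coefficient values are assumptions, not the thesis code) forms the linear prediction of each sample from its $p$ predecessors and keeps only the prediction error for coding.

    % Illustrative DPCM-style linear prediction on a 1-D signal.
    x = cumsum(randn(1, 1000));   % placeholder correlated signal (assumed)
    p = 3;                        % predictor order (assumption)
    w = [0.7 0.2 0.1];            % predictor coefficients (assumed; in practice solved from R*w = d)

    xhat = zeros(size(x));
    e    = zeros(size(x));
    for n = p+1:length(x)
        xhat(n) = 0;
        for j = 1:p
            xhat(n) = xhat(n) + w(j) * x(n-j);   % equation (2.2)
        end
        e(n) = x(n) - xhat(n);                   % prediction error to be stored or transmitted
    end
    % A good predictor makes var(e) much smaller than var(x), so e needs fewer bits.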

2.3 Transform Coding

Another approach to image compression is the use of transformations that operate on an image to produce a set of coefficients [5]. A subset of these coefficients is chosen and quantized for transmission across a channel or for storage. The goal of this technique is to choose a transformation for which such a subset of coefficients is adequate to reconstruct an image with minimum discernible distortion.

A simple and powerful class of transform coding techniques is linear block transform coding. An image is subdivided into non-overlapping blocks of $n \times n$ pixels, which can be considered as $N$-dimensional vectors $x$ with $N = n \times n$. A linear transformation, which can be written as an $M \times N$-dimensional matrix $W$ with $M \leq N$, is performed on each block, with the $M$ rows of $W$, $w_i$, being the basis vectors of the transformation. The resulting $M$-dimensional coefficient vector $y$ is calculated as

$$y = W x. \qquad (2.4)$$

If the basis vectors $w_i$ are orthogonal, that is,

$$w_i^T w_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \qquad (2.5)$$

then the inverse transformation is given by the transpose of the forward transformation matrix, resulting in the reconstructed vector:

$$\hat{x} = W^T y. \qquad (2.6)$$

The optimal linear transformation for minimizing the mean squared error is the Karhunen-Loeve transformation (KLT) [18]. The transformation matrix $W$ consists of $M$ rows of the eigenvectors corresponding to the $M$ largest eigenvalues of the sample autocovariance matrix

$$\Sigma = E[x\,x^T]. \qquad (2.7)$$

The KLT also produces uncorrelated coefficients and therefore results in the most efficient coding of the data, since the redundancy due to the high degree of correlation between neighboring pixels is removed. The KLT is related to principal component analysis (PCA) [23], since the basis vectors are also the $M$ principal components of the data. Because the KLT is an orthogonal transformation, its inverse is simply its transpose.

A number of practical difficulties exist when trying to implement the above approach. The calculation of the estimate of the covariance of an image may be unwieldy and may require a large amount of memory. In addition, the solution for the eigenvectors and eigenvalues is computationally intensive. Finally, the calculation of the forward and inverse transforms is of order $O(MN)$ for each image block. Due to these difficulties, fixed-basis transforms such as the discrete cosine transform (DCT) [22], which can be computed in order $O(N \log N)$, are typically used when implementing block transform schemes.

2.3.1 Discrete Cosine Transform

The currently accepted standard for lossy still image compression was developed by the Joint Photographic Experts Group (JPEG) [3], [4], which adopted the linear block transform coding approach for its standard, using the DCT as the transformation [22]. The JPEG specification defines a minimal subset of the standard, called baseline JPEG, which all JPEG-aware applications are required to support. This baseline uses an encoding scheme based on the DCT to achieve compression.
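The block-transform mechanics of equations (2.4)-(2.7), of which the DCT used by JPEG is a fixed-basis instance, can be sketched as follows (illustrative MATLAB only; the block size, the number of retained coefficients, and the use of a KLT basis estimated from the image itself are assumptions for illustration, not the thesis method).

    % Illustrative KLT-style block transform coder.
    img = rand(256, 256);                    % placeholder image (assumed)
    B   = 8;  N = B*B;  M = 16;              % keep M of the N coefficients (assumption)

    X = [];                                  % collect non-overlapping blocks as N-dim columns
    for r = 1:B:size(img,1)
        for c = 1:B:size(img,2)
            X = [X, reshape(img(r:r+B-1, c:c+B-1), N, 1)];
        end
    end

    Sigma     = (X * X') / size(X, 2);       % sample autocovariance, eq. (2.7)
    [V, D]    = eig(Sigma);
    [vals, ix] = sort(diag(D), 'descend');   % order eigenvectors by eigenvalue
    W         = V(:, ix(1:M))';              % M x N transform matrix, rows = basis vectors

    Y    = W * X;                            % forward transform, eq. (2.4)
    Xhat = W' * Y;                           % reconstruction, eq. (2.6)
    mse  = mean((X(:) - Xhat(:)).^2);        % distortion from discarding N-M coefficients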

Figure 1 describes the baseline JPEG process. The compression scheme is divided into the following stages:

1. Apply a DCT to blocks of pixels, thus removing redundant image data.
2. Quantize each block of DCT coefficients using weighting functions optimized for the human eye.
3. Encode the resulting coefficients (image data) using a Huffman variable word-length algorithm to remove redundancies in the coefficients.

The image is first subdivided into 8 x 8 blocks of pixels. As each 8 x 8 block or sub-image is encountered, its 64 pixels are level shifted by subtracting the quantity $2^{n-1}$, where $2^n$ is the maximum number of gray levels.

Figure 1.0: Block diagram of JPEG compression (8 x 8 pixel block -> DCT -> Quantizer -> Encoder).

The 2-D discrete cosine transform of the block is then computed. The DCT helps separate the image into parts (or spectral sub-bands) of differing importance with respect to the image's visual quality. The DCT is similar to the discrete Fourier transform: it transforms a signal or image from the spatial domain to the spatial frequency domain. With an input image $A$, the output image $B$ is

$$B(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{i=0}^{N_1 - 1} \sum_{j=0}^{N_2 - 1} A(i,j)\, \cos\!\left[\frac{\pi u (2i+1)}{2 N_1}\right] \cos\!\left[\frac{\pi v (2j+1)}{2 N_2}\right], \qquad (2.8)$$

where $C(u), C(v) = 1/\sqrt{2}$ for $u, v = 0$ and 1 otherwise.

The input image is $N_2$ pixels wide by $N_1$ pixels high; $A(i,j)$ is the intensity of the pixel in row $i$ and column $j$; $B(u,v)$ is the DCT coefficient in row $u$ and column $v$ of the DCT matrix. The DCT input is an 8 by 8 array of integers. This array contains each pixel's gray-scale level; 8-bit pixels have levels from 0 to 255. The output array of DCT coefficients contains integers; these can range from -1024 to 1023. For most images, much of the signal energy lies at low frequencies; these appear in the upper left corner of the DCT. The lower right values represent higher frequencies, and are often small - small enough to be neglected with little visible distortion.

A quantizer rounds off the DCT coefficients according to a quantization matrix. This matrix is the 8 by 8 matrix of step sizes (sometimes called quantums), one element for each DCT coefficient. It is usually symmetric. Step sizes will be small in the upper left (low frequencies) and large in the lower right (high frequencies); a step size of 1 is the most precise. The quantizer divides the DCT coefficient by its corresponding quantum and then rounds to the nearest integer. Large quantums drive small coefficients down to zero. The result: many high-frequency coefficients become zero, and therefore easier to code. The low-frequency coefficients undergo only minor adjustment. This step causes the lossy nature of JPEG, but allows for large compression ratios.

After quantization, it is not unusual for more than half of the DCT coefficients to equal zero. JPEG incorporates run-length coding to take advantage of this. For each non-zero DCT coefficient, JPEG records the number of zeros that preceded the number, the number of bits needed to represent the number's amplitude, and the amplitude itself. To consolidate the runs of zeros, JPEG processes DCT coefficients in the zigzag pattern shown in Figure 2.
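The following sketch implements equation (2.8) for a single 8 x 8 block and the rounding step of the quantizer (illustrative MATLAB; the flat quantization matrix is an assumption made for brevity, whereas JPEG uses perceptually weighted step sizes).

    % Illustrative 8x8 DCT (eq. 2.8) followed by quantization; not a full JPEG codec.
    A  = round(rand(8,8)*255) - 128;   % level-shifted 8-bit block (assumed data)
    N1 = 8;  N2 = 8;
    B  = zeros(N1, N2);
    for u = 0:N1-1
        for v = 0:N2-1
            Cu = 1; Cv = 1;
            if u == 0, Cu = 1/sqrt(2); end
            if v == 0, Cv = 1/sqrt(2); end
            s = 0;
            for i = 0:N1-1
                for j = 0:N2-1
                    s = s + A(i+1, j+1) * cos(pi*u*(2*i+1)/(2*N1)) * cos(pi*v*(2*j+1)/(2*N2));
                end
            end
            B(u+1, v+1) = 0.25 * Cu * Cv * s;
        end
    end

    Q  = 16 * ones(8, 8);              % quantization step sizes (assumed flat)
    Bq = round(B ./ Q);                % large steps drive small coefficients to zero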

The number of previous zeros and the bits needed for the current number's amplitude form a pair. Each pair has its own code word, assigned through a variable-length code (for example Huffman, Shannon-Fano, or arithmetic coding). JPEG outputs the code word of the pair, and then the codeword for the coefficient's amplitude (also from a variable-length code). After each block, JPEG writes a unique end-of-block sequence to the output stream and moves to the next block. When finished with all blocks, JPEG writes the end-of-file marker. At this point, the JPEG data stream is ready to be transmitted across a communications channel or encapsulated inside an image file format.

JPEG is not always an ideal compression solution. There are several reasons:

- It does not fit every compression need; images containing large areas of a single color do not compress very well.
- It can be rather slow when it is implemented only in software.

- It is hard to implement.
- It is not supported by very many file formats.

Recently, novel approaches have been introduced based on pyramidal structures [24], wavelet transforms [25], and fractal transforms [26]. These and some other new techniques [27] inspired by the representation of visual information in the brain can achieve high compression ratios with good visual quality, but are nevertheless computationally intensive. With this brief review of conventional image compression techniques at hand, various types of neural networks and their architectures will now be reviewed, and their role as an image compression tool considered.

CHAPTER 3
Artificial Neural Network Technology - an Overview

Artificial Neural Networks are software or hardware systems that try to simulate the functionality of the human brain. From the beginning of their presence in science, Neural Networks (NNs) have been investigated with two different scientific approaches. First, the biological aspect explores NNs as simplified simulations of the human brain and uses them to test hypotheses about human brain functioning. The second approach treats NNs as technological systems for complex information processing. This thesis is focused on the second approach, by which NNs are evaluated according to their efficiency in dealing with complex problems, especially in the areas of association, classification and prediction, but specifically in the area of image processing. The reasons why NNs often outperform classical statistical methods lie in their abilities to analyze incomplete, noisy data, to deal with problems that have no clear-cut solution, and to learn from historical data. Because of those advantages, they have shown remarkable success in areas such as image transmission over high-noise environments. NNs, however, do have disadvantages. One such disadvantage is the lack of tests of statistical significance of NN models and estimated parameters [32], [35]. Furthermore, there are no established paradigms for deciding which architecture is best for certain problems and data types. This problem is partly investigated in this thesis. Despite those disadvantages, many research results show that neural networks can solve almost all problems more

efficiently than traditional modeling and statistical methods. It has been proven mathematically (using the Stone-Weierstrass, Hahn-Banach and other theorems and corollaries [36]) that two-layer neural networks having arbitrary squashing transfer functions are capable of approximating any nonlinear function.

3.1 Basic Principles of Learning in Neural Networks

NNs consist of one or more layers or groups of processing elements called neurons. The term neuron denotes a basic unit of a neural network model intended for data processing. Neurons are connected into a network in such a way that the output of each neuron represents the input for one or more other neurons. The connection between neurons can be either one-directional or bi-directional, and according to its intensity the connection can be either excitatory or inhibitory. Neurons are grouped into layers. There are two main types of layers: hidden and output layers. Some authors refer to the inputs as another layer, but this will not be the case in this thesis. The hidden layer receives the input data. Here, the information is processed and sent to the output-layer neurons, where the network output is compared to the desired output and the network error is computed. The error information then flows backward through the network and the values of the connection weights between the neurons are adjusted using the error term. The process is repeated in the network for the number of iterations necessary to achieve the output closest to the desired (actual) output. Finally, the network output is presented to the user.

Neural network learning is basically the process by which the system arrives at the values of the connection weights between neurons. The connection weight is the strength of the connection between two neurons. If, for example, neuron $j$ is connected to neuron $i$,

$w_{ij}$ denotes the connection weight from neuron $j$ to neuron $i$ ($w_{ji}$ is the weight of the reverse connection from neuron $i$ to neuron $j$). If neuron $i$ is connected to neurons called $1, 2, \ldots, n$, their weights are stored in the variables $w_{i1}, w_{i2}, \ldots, w_{in}$. A neuron receives as many inputs as there are input connections to that neuron, and produces a single output to other neurons according to a transfer function.

The process of neural network design consists of four phases:
1. arranging neurons in various layers,
2. determining the type of connections between neurons (inter-layer and intra-layer connections),
3. determining the way a neuron receives input and produces output, and
4. determining the learning rule for adjusting the connection weights.

The result of NN design is the NN architecture. According to the above design processes, the criteria used to distinguish NN architectures are as follows: number of layers, type of connection between neurons, connection between input and output data, input and transfer functions, type of learning, certainty of firing, temporal characteristics, and learning time.

Type of Connection between Neurons

Connections in the network can be realized between two layers (inter-layer connections) and between neurons in one layer (intra-layer connections) [28]. Inter-layer connections can be classified as:

- fully connected: each neuron in the first layer is connected to each neuron in the second layer;
- partially connected: each neuron in the first layer need not be connected to every neuron in the second layer;
- feedforward: the connection between neurons is one-directional; neurons in the first layer send their output to the neurons in the second layer, but they do not receive any feedback;
- bidirectional: there is feedback when the neurons from the second layer send their output back to the neurons in the first layer;
- hierarchical: neurons in one layer are connected only to the neurons of the next neighboring layer;
- resonance: a two-directional connection where neurons continue to send information between layers until a certain condition is satisfied.

Examples of some well-known NN architectures with inter-layer connections are:

- Perceptron (developed by Frank Rosenblatt, 1957): the first NN; two-layered, fully connected;
- ADALINE (developed by Bernard Widrow and Marcian E. Hoff, 1962): two-layered, fully connected;
- Backpropagation (developed by Paul Werbos, 1974, extended by Rumelhart, Hinton, and Williams, 1986): the first NN with one or more hidden layers; the connection between hidden layers is hierarchical;
- ART (Adaptive Resonance Theory) (designed by Stephen Grossberg, 1976): resonance connection, three-layered network,

- Feedforward Counterpropagation (designed by Robert Hecht-Nielsen, 1987): structure similar to the Backpropagation network, three-layered, but non-hierarchical.

There are also connections between neurons in one layer. Connections between neurons in one layer (intra-layer connections) can be:

a) Recurrent: neurons in one layer are fully or partially connected. The connection is realized in such a way that neurons communicate their outputs to each other after they receive their inputs from another layer. The communication continues until the neurons reach a stable condition. When the stable condition is reached, the neurons are allowed to send their output to the next layer.

b) On-center/off-surround: in this connection a neuron in one layer has an excitatory connection toward itself and toward the neighboring neurons, but an inhibitory connection toward the other neurons in the layer.

Some of the intra-layer networks with recurrent connections are:

- Hopfield's network (designed by John Hopfield, 1982): two-layered, fully connected; the neurons of the output layer are mutually connected with a recurrent intra-layer connection;
- Recurrent Backpropagation network (designed by David Rumelhart, Geoffrey Hinton, and Ronald Williams, 1986): recurrent intra-layer connection, but one-layered, where part of the neurons receive inputs, and the other part is fully connected with a recurrent intra-layer connection;

and some of the networks with on-center/off-surround connections are:

- ART1, ART2, ART3 (designed by Stephen Grossberg, 1960): resonance on-center/off-surround connection,

- Kohonen's self-organizing network (created by Teuvo Kohonen, 1982),
- Counterpropagation networks,
- Competitive learning networks.

Details on the above architectures are discussed later in the text.

Connection between Input and Output Data

NNs can also be distinguished according to the connection between input and output, which can be:
1) autoassociative: the input vector is the same as the output vector (common in pattern recognition problems, where the objective is to obtain the same data at the output as at the input),
2) heteroassociative: the output vector differs from the input vector.

Autoassociative networks [17], [36], [37] are used in pattern recognition, signal processing, noise filtering and similar problems that aim to recognize the patterns of the input data.

Input and Transfer Functions

In order to understand the main types of NN architectures that will be explored below, the basic principles of NN functioning will be described through the equations for the input and output of neurons, transfer functions, and learning rules.

Input (Summation) Functions

When a neuron receives input from the previous layer, the value of its input is computed according to an input function, usually called a "summation" function. The

simplest summation function for the neuron $i$ is determined by multiplying the output sent by the neuron $j$ to the neuron $i$ (denoted as $\mathrm{output}_j$) by the connection weight between neurons $i$ and $j$, and then summing those products over all $j$ neurons connected to neuron $i$, as given by:

$$\mathrm{input}_i = \sum_{j=1}^{n} \left( w_{ij} \cdot \mathrm{output}_j \right), \qquad (3.1)$$

where $n$ is the number of neurons in the layer that sends the output received by neuron $i$. In other words, the input of a neuron is the sum of all weighted outputs that arrive at that neuron. Besides this standard network input, there are two additional specific types of inputs in a network: external input and bias. For the former, a neuron receives input from the external environment. For the latter, a bias value is used for neuron activation control in some networks. Input values can be normalized to an interval (usually [0,1] or [-1,1]) to avoid the extreme influence of high-valued inputs. Therefore, normalization is recommended in most neural networks (it is obligatory in Kohonen's network) [16], [21]. Details about the data normalization used in this thesis will be explained later in the text.

Output (Transfer) Functions

After receiving the input according to the summation function presented in formula (3.1), the output of a neuron is computed and sent to the other neurons it is connected to (usually the next-layer neurons). The output of a neuron is computed according to a transfer function, which may be a linear or a nonlinear function of its input. A particular transfer function is chosen to satisfy some specification of the problem

that the neuron is attempting to solve. Several of the most frequently used transfer functions are the step function, signum function, sigmoid function, hyperbolic-tangent function, linear function, and threshold linear function. The output of each transfer function is computed according to a set formula. Only two of the above-mentioned transfer functions are used in this thesis, namely the hyperbolic-tangent and the linear functions. The former has the form:

$$\mathrm{output}_i = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}, \qquad (3.2)$$

where $u = g \cdot \mathrm{input}_i$ and $g = 1/T$ is the gain of the function, $T$ being the threshold. The gain determines the skewness of the function around 0. The function takes continuous values in the interval [-1,1]. The hyperbolic-tangent function is commonly used in multilayer networks that are trained using the backpropagation algorithm, in part because this function is differentiable. Its graph is shown in the figure below.

Figure 3.1: Graph of the hyperbolic tangent function
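A minimal sketch of a multiple-input neuron that combines the summation function (3.1) with the hyperbolic-tangent transfer function (3.2) is shown below (illustrative MATLAB; the weights, inputs, and gain are assumptions, not values from the thesis).

    % Illustrative single neuron: weighted sum followed by a tanh transfer (eqs. 3.1-3.2).
    outputs_prev = [0.2; -0.5; 0.9];   % outputs of the previous layer (assumed)
    w            = [0.4; -0.1; 0.7];   % connection weights into this neuron (assumed)
    g            = 1.0;                % gain = 1/T (assumed threshold T = 1)

    input_i  = sum(w .* outputs_prev);                       % summation function, eq. (3.1)
    u        = g * input_i;
    output_i = (exp(u) - exp(-u)) / (exp(u) + exp(-u));      % hyperbolic tangent, eq. (3.2)
    % output_i lies in [-1, 1]; MATLAB's built-in tanh(u) gives the same value.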

Because of its ability to map values into positive as well as negative regions, this function is used throughout the MATLAB implementation in this thesis. A linear function has the form:

$$\mathrm{output}_i = g \cdot \mathrm{input}_i. \qquad (3.3)$$

It should be pointed out that MATLAB names the hyperbolic-tangent function, which has the same shape, tansig, and the linear function purelin. Figure 3.2 below depicts the overall picture of the input, transfer function, and output of a typical multiple-input neuron. The choice of the appropriate transfer function is made in the network design phase, still allowing changes to the threshold value (T) and gain (g). The best transfer function is usually obtained by experimenting on a particular problem.

Figure 3.2: Multiple-input neuron

Type of Learning

"Learning" is the process of calculating the weights among neurons in a network [29]. NNs can be designed with supervised or unsupervised learning. In supervised learning, the network is presented with a set of input and desired output patterns. The resulting (actual) outputs are compared with the desired outputs and their differences are used to adjust the network weights. In unsupervised learning, the network is presented

with only input patterns. Without desired responses, the network has no knowledge about whether or not its resulting outputs are correct. As a result, the network has to self-organize (cluster) the data into similar classes by adjusting its weights so that the clustering improves. This type of learning is commonly used for pattern recognition and clustering problems. Kohonen's self-organizing network is based on unsupervised learning.

Every NN goes through three operative phases:
1) learning (training) phase: the network learns on the training sample, and the weights are adjusted in order to minimize the objective function (for example the RMS or root mean square error),
2) testing phase: the network is tested on the testing sample while the weights are fixed,
3) operative (recall) phase: the NN is applied to new cases with unknown results (the weights are also fixed).

Learning Rules

A learning rule is the formula that is used in a NN to adjust the connection weights among neurons. Among the various learning rules developed so far, four are most commonly used: the Delta rule, the Generalized Delta rule, the Delta-Bar-Delta and Extended Delta-Bar-Delta rules, and Kohonen's rule.

1) Delta rule

The Delta rule is also well known as the Widrow/Hoff rule [29], or the rule of least mean squares, because it aims to minimize the objective function by determining the weight

values. The aim is to minimize the sum of squared errors, where the error is defined as the difference between the computed and the desired output of a neuron for the given input data. The Delta rule equation is:

$$\Delta w_{ij} = \eta \cdot y_{cj} \cdot e_i, \qquad (3.4)$$

where $\Delta w_{ij}$ is the adjustment of the connection weight from neuron $j$ to neuron $i$, computed by:

$$\Delta w_{ij} = w_{ij}^{\mathrm{new}} - w_{ij}^{\mathrm{old}}, \qquad (3.5)$$

$y_{cj}$ is the output value computed in neuron $j$; $e_i$ is the raw error computed by:

$$e_i = y_c - y_d, \qquad (3.6)$$

$\eta$ is the learning coefficient, and $y_d$ is the desired (actual) output that is used to compute the error. The raw error in formula (3.6) is very rarely backpropagated; more often other error forms are used. In a classical Backpropagation NN, the error is backpropagated through the network using the gradient descent algorithm described in the section on the Backpropagation network. The gradient component of the global error $E$ backpropagated into a connection $k$ is:

$$\delta_k = \frac{\partial E}{\partial w_k}, \qquad (3.7)$$

which enables localization in the sense that each particular connection in the network is adjusted. Since the Delta rule (or its variations) is commonly used in supervised networks, it is necessary to mention the main problem that can occur in backpropagating the error, i.e., local minima. The local minima problem occurs when the minimum error of the function is found only for a local area and learning is stopped without reaching the

global minimum. Since this problem is mainly apparent in the Backpropagation algorithm, it will be discussed in detail later in the text together with suggested solutions.

2) Generalized Delta rule

The Generalized Delta rule is obtained by adding a derivative term for the neuron's input to the Delta rule equation, such that the weight adjustment is computed according to the formula:

$$\Delta w_{ij} = \eta \cdot y_{cj} \cdot e_i \cdot f'(I_i), \qquad (3.8)$$

where $f'(I_i)$ is the derivative of the transfer function with respect to the input $I_i$ of neuron $i$. This rule is appropriate for use with nonlinear transfer functions.

3) Delta-Bar-Delta and Extended Delta-Bar-Delta rules

As can be seen from the previous section, the learning coefficient is an important parameter for the speed and efficiency of NN learning, and it is typically determined as a single learning rate for all connections in the network. The Delta-Bar-Delta (DBD) learning rule was developed in 1988 by Jacobs [30] in order to improve the convergence speed of the classical Delta rule. It is a heuristic approach to localizing the learning coefficient in such a way that each connection in the network has its own learning rate. Those rates change continuously as the learning progresses. Dynamic weight adjustment in the DBD rule is done according to Saridis' heuristic approach. The learning rate of a connection in the network is increased if the sign of the weight change for that connection stays the same for a number of time steps (or over a region of relatively low curvature). On the other hand, when the sign of the weight change reverses for a certain number of time steps, the rate for that connection is decreased. Thus, the Delta rule equation (3.4) is modified so that the learning rate is different for each connection $k$:

$$\Delta w_{ij(k)} = \eta_k \cdot y_{cj} \cdot e_i. \qquad (3.9)$$

Weight increments are conducted linearly, while decrements are conducted geometrically. Despite its advantages over the classical Delta rule, Delta-Bar-Delta has some limitations, such as the lack of a momentum term in the learning equation and large jumps that can skip important regions of the error surface due to the linear increments of the learning rates. This cannot be prevented by the slow geometric decrements. In order to overcome these shortcomings, the Extended Delta-Bar-Delta (EDBD) rule, proposed by Minai and Williams [30], introduces a momentum term $\alpha_k$, which also varies with time. The momentum term is used to prevent the network weights from saturation (see details in Section 3.2.1), and the EDBD rule enables local dynamic adjustment of this parameter, such that the learning equation becomes:

$$\Delta w_{ij(k)}^{t} = \eta_k \cdot y_{cj} \cdot e_i + \alpha_k \cdot \Delta w_{ij(k)}^{t-1}, \qquad (3.10)$$

where $\alpha_k$ is the momentum of connection $k$ in the network and $t$ is the time point at which the weights of connection $k$ are adjusted. Both the learning rates and the momentum term are adjusted exponentially, not linearly or geometrically as in DBD. The magnitudes of the exponential functions are the weighted gradient components $\delta_k$ (equation 3.7), which makes for a larger increase in areas of small error curvature and a smaller one in areas of large curvature, thereby preventing the big jumps present in the DBD rule. The above learning rules use the desired (real) output to compute the error; thus they learn in a supervised manner. If the desired output is not known, one of the unsupervised learning rules should be used, such as the following Kohonen's rule.

4) Kohonen's rule

Since Kohonen's network does not learn from known outputs, the weights are adjusted using the input to the neuron $i$:

$$\Delta w_{ij} = \eta \cdot \left( \mathrm{extinput}_i - w_{ij} \right), \qquad (3.11)$$

where $\mathrm{extinput}_i$ is the input that neuron $i$ receives from the external environment. Kohonen's rule is used in Kohonen's self-organizing network. Details concerning the learning equations are given in a later section.

Other Parameters for NN Architecture Design

According to the number of layers, NN architectures can be one-layered (with the output layer only) or multi-layered (with one or more hidden layers in addition). The number of necessary hidden layers should be determined experimentally. It is to be expected that more hidden layers should be used for approximating a very complex nonlinear function, although it has been proven that two-layered NNs can approximate any nonlinear function, as mentioned earlier.

NNs can be divided into:
a) deterministic networks: when a neuron reaches a certain activation level, it sends impulses to other neurons (it "fires"),
b) stochastic networks: firing is not certain and is performed according to a probability distribution (for example, the Boltzmann machine).

Table 1. Neural Network Architectures

Two-layered architectures:
- Perceptron: fully connected inter-layer connections; supervised learning (Delta-type rule).
- ADALINE/MADALINE: fully connected inter-layer connections; supervised learning (least-mean-squares rule).
- Kohonen's: fully connected inter-layer connections, on-center/off-surround intra-layer connections; unsupervised learning, $\Delta w_{ij} = \eta \cdot (\mathrm{extinput}_i - w_{ij})$.
- Hopfield's: fully connected inter-layer connections, recurrent cross-bar intra-layer connections; unsupervised learning, specified through an energy function $E$.

Multi-layered architectures:
- Backpropagation: hierarchical inter-layer connections, recurrent intra-layer connections; supervised learning, $\Delta w_k = \eta \, y \, \varepsilon_k$.
- Recurrent Backpropagation: fully connected inter-layer connections, recurrent cross-bar intra-layer connections.
- Radial-Basis: fully connected inter-layer connections; unsupervised phase followed by supervised learning.
- Probabilistic: fully connected, non-hierarchical inter-layer connections; unsupervised phase followed by supervised learning.
- Learning Vector Quantization: fully connected inter-layer connections, on-center/off-surround intra-layer connections; unsupervised phase followed by supervised learning.
- Counterpropagation: fully connected, non-hierarchical inter-layer connections, on-center/off-surround intra-layer connections; supervised learning, $\Delta w_{ij} = \eta \cdot (\mathrm{extinput}_i - w_{ij})$ between the 1st and 2nd layers.
- ART networks: resonance, fully connected inter-layer connections, recurrent cross-bar intra-layer connections; unsupervised learning.

NNs can also be classified as:
a) static networks (which receive their inputs in one pass),
b) dynamic networks (which receive inputs over time intervals; they are also called spatiotemporal networks).

NN learning can be:
a) batch learning: the network learns only in the learning phase; in the other phases the weights are fixed,
b) on-line learning: the network also adjusts its weights in the recall phase.

Table 1 [31] above shows a brief overview of well-known NN architectures according to the above parameters for architecture design. The text that follows presents a detailed description of various NN architectures.

3.2 Backpropagation Network

Backpropagation (BP) [15], [19], [37] is a multi-layer neural network using sigmoidal activation functions. Originally developed by Paul Werbos in 1974 and extended by Rumelhart, Hinton, and Williams in 1986, this was the first network with more than one hidden layer. Its role was primarily to solve the "credit assignment" problem posed by the Perceptron network, which is the problem of assigning the adjustments of parameters or connection weights. The suggested solution was to localize the error by computing it at the output layer and backpropagating it to each hidden layer, adjusting the connection weights until the input layer is reached. The classical Backpropagation algorithm involves error optimization using a deterministic gradient descent algorithm, which will be described in detail. However, recent research includes some other deterministic (second-order) methods [32], [33] for error optimization, such as conjugate gradient and the Levenberg-Marquardt algorithm, that try to overcome the main disadvantage of the steepest descent method, i.e., the danger

of local minima. In this thesis's implementation, second-order methods will not be used, but some parameter adjustments to avoid the main shortcomings of the classical Backpropagation algorithm will be implemented.

Architecture of the network

The network is made up of an input layer, at least one hidden layer, and an output layer. Nodes in each layer are fully connected to those in the layers above and below. Each connection is associated with a synaptic weight. The typical backpropagation architecture is presented in Figure 3.3 (for clarity, only one hidden layer is shown).

Figure 3.3: Architecture of a Backpropagation Neural Network (input, hidden layer, output layer)

Data flow through the network can be briefly described in a few steps:
1) from the input to the hidden layer: the input layer loads data from the input vector X and sends them to the first hidden layer,
2) in the hidden layer: units in the hidden layer receive the weighted input and transfer it to the next hidden layer or to the output layer using one of the transfer functions,

3) as information propagates through the network, all the summed inputs and output states are computed in each processing unit,
4) in the output layer: for each processing unit, the scaled local error is computed and used to determine the weight increment or decrement,
5) backpropagation from the output back to the hidden layers: the scaled local error and the weight increments or decrements are computed for each layer backwards, starting from the output layer and ending at the first hidden layer, and the weights are updated.

Computation in the network

When the input layer sends data to the first hidden layer, each hidden unit in the hidden layer receives weighted input from the input layer (the initial weights are set randomly) according to the formula [15]:

$$I_j^{[s]} = \sum_i w_{ij}^{[s]} \, x_i^{[s-1]}, \qquad (3.11)$$

where $I_j^{[s]}$ is the input to neuron $j$ in layer $s$, $w_{ij}^{[s]}$ is the connection weight from neuron $i$ to neuron $j$ in layer $s$, and $x_i^{[s-1]}$ is the output of neuron $i$ in layer $s-1$. Units in the hidden layer transfer these inputs according to the formula:

$$x_j^{[s]} = f\!\left( \sum_i w_{ij}^{[s]} \, x_i^{[s-1]} \right) = f\!\left( I_j^{[s]} \right), \qquad (3.12)$$

where $x_j^{[s]}$ is the output of neuron $j$ in layer $s$, and $f$ is the transfer function (sigmoid, hyperbolic tangent, or any other function). If there is more than one hidden layer, the above transfer function is used through all hidden layers until the output layer is

reached. At the output layer, the network output is compared to the desired (real) output, and the global error $E$ is determined as:

$$E = \frac{1}{2} \sum_k \left( d_k - x_k \right)^2, \qquad (3.13)$$

where $d_k$ is the desired (real) output, $x_k$ is the output of the network, and $k$ is the index of the output component, i.e., it runs over the output units. Each output unit has its own local error $e_k$ whose raw form is $(d_k - x_k)$, but what is backpropagated through the network is the scaled error in the form of a gradient component:

$$e_k = -\frac{\partial E}{\partial I_k} = -\frac{\partial E}{\partial x_k}\,\frac{\partial x_k}{\partial I_k} = \left( d_k - x_k \right) f'(I_k). \qquad (3.14)$$

The objective of the Backpropagation learning process is to minimize the above global error by backpropagating it into the connections through the network backwards until the input layer is reached. By modifying the weights, each connection in the network is corrected in order to achieve a smaller global error. The process of incrementing or decrementing the weights (learning) is done by using the gradient descent rule:

$$\Delta w_{ij}^{[s]} = -\eta \, \frac{\partial E}{\partial w_{ij}^{[s]}}, \qquad (3.15)$$

where $\eta$ is the learning coefficient. To compute the partial derivatives in the above equation we can use (3.14), which gives:

$$\frac{\partial E}{\partial w_{ij}^{[s]}} = \frac{\partial E}{\partial I_j^{[s]}} \, \frac{\partial I_j^{[s]}}{\partial w_{ij}^{[s]}} = -e_j^{[s]} \, x_i^{[s-1]}. \qquad (3.16)$$

When the above result is included in formula (3.15), the weight adjustment is

$$\Delta w_{ij}^{[s]} = \eta \, e_j^{[s]} \, x_i^{[s-1]}, \qquad (3.17)$$

which leads to the main problem of setting the appropriate learning rate. There are two mutually conflicting guidelines for determining $\eta$. The first guideline is to keep $\eta$ low because it determines the area in which the error surface is locally linear. If the network aims to predict high curvatures, that area should be very small. However, a very low learning coefficient means very slow learning. In order to resolve this conflict, the previous delta weights at time $(t-1)$ are added in equation (3.17), so that the current weight adjustment is:

$$\Delta w_{ij}^{[s]}(t) = \eta \, e_j^{[s]} \, x_i^{[s-1]} + \alpha \, \Delta w_{ij}^{[s]}(t-1), \qquad (3.18)$$

where $\alpha$ is the momentum term, which makes learning faster when the learning coefficient is low. Learning can also be accelerated if the weights are adjusted not for each training vector but cumulatively, where the number of training vectors after which the weights are adjusted is called the epoch. An epoch that is not very large can improve the convergence speed, but a large epoch makes the computation of the error more complex and therefore decreases its benefit.

Another problem that can occur in Backpropagation is that some processing units will stop learning if their incoming weights become large. In such a case the summation values become large and the weights are saturated (outputs take the value 0 or 1), driving the derivative to zero and hence the scaled error to zero. Such saturation can be prevented by adding a small bias value (or F' offset) to the derivative of the sigmoid transfer function.
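To tie equations (3.11)-(3.18) together, the sketch below performs one forward and backward pass for a small two-layer network with a hyperbolic-tangent hidden layer and a linear output (illustrative MATLAB; the sizes, data, learning rate, and momentum are assumptions, not the configuration used in this thesis).

    % Illustrative single backpropagation iteration (tanh hidden layer, linear output).
    x  = [0.3; -0.7];   d = 0.5;                    % one training pair (assumed)
    W1 = 0.1*randn(4, 2);   W2 = 0.1*randn(1, 4);   % weights (biases omitted for brevity)
    eta = 0.05;   alpha = 0.9;                      % learning rate and momentum (assumed)
    dW1_prev = zeros(size(W1));   dW2_prev = zeros(size(W2));

    % Forward pass, eqs. (3.11)-(3.12)
    I1 = W1 * x;     h = tanh(I1);                  % hidden-layer inputs and outputs
    I2 = W2 * h;     y = I2;                        % linear output layer

    % Backward pass: scaled local errors, eqs. (3.13)-(3.14)
    e2 = (d - y) * 1;                               % derivative of the linear transfer is 1
    e1 = (W2' * e2) .* (1 - h.^2);                  % derivative of tanh is 1 - h.^2

    % Weight updates with momentum, eqs. (3.17)-(3.18)
    dW2 = eta * e2 * h' + alpha * dW2_prev;
    dW1 = eta * e1 * x' + alpha * dW1_prev;
    W2  = W2 + dW2;   W1 = W1 + dW1;
    dW2_prev = dW2;   dW1_prev = dW1;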

Improvements of the Standard Backpropagation Network

Learning Rules

Since one of the main disadvantages of Backpropagation is its slow learning, much effort has been expended in improving the learning rules and other parameters. Some of the achievements are [19], [38]:

- Delta-Bar-Delta (DBD) rule: a learning rule that uses past values of the gradient to find the local curvature of the error and allocates a different learning coefficient to each connection in the network;
- Extended Delta-Bar-Delta (EDBD) rule: besides using a different learning rate for each connection, it uses a different momentum term for each connection (the equations are described in Section 3.1.4);
- QuickProp and MaxProp: learning rules that use quadratic estimation heuristics to determine the direction and step size of the weight changes;
- Resilient Backpropagation (Rprop): a local adaptive learning scheme that eliminates the harmful effect of having a small slope at the extreme ends of the sigmoid "squashing" transfer functions.

Because of its advantages in dynamic and local adjustment of the learning rates to the topology of the error function, Rprop is used in the implementation and will be discussed further below. The Backpropagation neural network also allows the use of different error functions, such as quadratic and cubic, but these will not be discussed here since they are not included in the implementation.

Resilient Backpropagation (Rprop)

Multilayer networks typically use sigmoid transfer functions in the hidden layers. Sigmoid functions are characterized by the fact that their slope must approach zero as the input gets large. This causes a problem when using steepest descent to train a multilayer network with sigmoid functions, since the gradient can have a very small magnitude and therefore cause small changes in the weights and biases, even though the weights and biases are far from their optimal values. The purpose of the Rprop [38] training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update. The size of the weight change is determined by a separate update value. The update value for each weight and bias is increased by a factor delt_inc whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations. The update value is decreased by a factor delt_dec whenever the derivative with respect to that weight changes sign from the previous iteration. If the derivative is zero, then the update value remains the same. Whenever the weights are oscillating, the weight change will be reduced. If the weight continues to change in the same direction for several iterations, then the magnitude of the weight change will be increased.
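The sign-based rule just described can be sketched for a single weight as follows (illustrative MATLAB; the values of delt_inc, delt_dec, the step limits, and the gradients are assumptions made for illustration, not values taken from this thesis).

    % Illustrative Rprop-style update for one weight: only the sign of the gradient is used.
    delt_inc = 1.2;   delt_dec = 0.5;    % increase/decrease factors (assumed)
    step_max = 50;    step_min = 1e-6;   % bounds on the update value (assumed)

    grad      = -0.03;    % dE/dw at this iteration (assumed)
    grad_prev = -0.01;    % dE/dw at the previous iteration (assumed)
    step      = 0.07;     % current update value for this weight (assumed)
    w         = 0.4;

    if grad * grad_prev > 0
        step = min(step * delt_inc, step_max);   % same sign: grow the step
    elseif grad * grad_prev < 0
        step = max(step * delt_dec, step_min);   % sign change (oscillation): shrink the step
    end                                          % zero product: step unchanged
    w = w - sign(grad) * step;                   % move against the gradient by the step size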

Dealing with Local Minima and Overtraining

Two of probably the most famous problems with Backpropagation are local minima and overtraining. Because of the way the error is backpropagated through the network (gradient descent optimization), learning can get stuck in a local minimum and minimize the error only locally. There are a number of solutions to this problem. Some of them are deterministic and use second-order equations to compute the error, while others are stochastic and rely on random numbers rather than on equations. One of the stochastic methods for avoiding local minima is simulated annealing.

Overtraining is a universal problem for all types of NN algorithms. It occurs when the network learns the training sample perfectly but is not able to generalize to the test sample. The still unanswered question of how long the network should be trained can be approached in the following ways [33]:

- cross validation: using a validation sample to determine when to stop learning. Training continues as long as the error on the validation sample improves; when it does not improve, training stops. Such an iterative procedure is usually called the "save best" procedure, which alternately trains and tests the network until the performance of the network does not improve for n iterations. After the best network is selected, it is tested on a new test sample to determine its generalization ability (since this method is used in our experiments, it is described in detail in a later section);
- adding bias and random error in parameter estimates, jackknifing, bootstrapping, and others.

Input Parameters to Build the Network

1) number of input, hidden, and output layer units

The number of hidden units can be statically set to a fixed number or dynamically optimized during the learning phase of the NN. In this implementation, one node is used and another node is added if the goal is not met (this is discussed in detail in Section 4).

2) learning coefficients

Learning coefficients can be set:
- statically and globally for the whole network, in such a way that the coefficients do not change during the learning process,
- statically and locally, by setting a different learning rate for each hidden layer or connection,
- dynamically and globally, by changing the global learning rate as the learning process improves,
- dynamically and locally, by assigning a different learning rate to each connection in the network and changing them during the learning process.

3) Bias (F' Offset)

As explained in the previous section, this parameter prevents the network from saturating the weights.

4) learning rule

This is the procedure for modifying the weights and biases of the network.

5) transfer function

The choice of a transfer function is made according to its ability to map into positive as well as negative regions. This transfer function is often used in image compression.

6) bipolar inputs

Input values are scaled between -1 and 1. Because positive and negative values in the input variables are desired, this option is used in the thesis implementation.

7) MinMax table

Inputs to the network are preprocessed using the so-called MinMax table created from the training data. Such a table consists of the minimum $m_i$ ($i = 1, \ldots, n$) and maximum $M_i$ ($i = 1, \ldots, n$) values for each of the $n$ variables in the network, where $n$ is the sum of the number of input variables I and the number of desired output variables D. Those values, together with the network range parameters (specified in the I/O set of parameters), are used to scale each input and output variable according to the formula:

$$s_i = \frac{(R - r)\, x_i + (M_i\, r - m_i\, R)}{M_i - m_i}, \qquad (3.19)$$

where $s_i$ is the scaled new value of the variable, $R$ is the upper limit of the network range for inputs (or outputs), and $r$ is the lower limit of the network range for inputs (or outputs). Such a scaling process is necessary because of the output range of the transfer functions used in the networks. For example, the hyperbolic-tangent function has the output range [-1,1], and therefore the inputs to and outputs of the network need to be mapped into the same range. Upon completion of the learning process, the output values of the network are rescaled, so that the original real values are presented to the user.

8) epoch

An epoch is the presentation of a set of training (input and/or target) vectors to a network and the calculation of new weights and biases. Training vectors can be presented one at a time or all together in a batch.
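Equation (3.19) can be checked with the short sketch below (illustrative MATLAB; the variable limits and the network range [-1, 1] are assumptions). The formula maps the training minimum $m_i$ to $r$ and the maximum $M_i$ to $R$.

    % Illustrative MinMax scaling of one variable into the network range, eq. (3.19).
    m_i = 0;    M_i = 255;    % min and max of the variable in the training data (assumed)
    r   = -1;   R   = 1;      % lower and upper limits of the network range (assumed)

    x      = 128;                                            % raw input value
    s      = ((R - r)*x + (M_i*r - m_i*R)) / (M_i - m_i);    % scaled value, about 0.004 here
    x_back = (s*(M_i - m_i) - (M_i*r - m_i*R)) / (R - r);    % rescaling back to the original range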

3.3 Radial-Basis Function Network

A Radial-Basis Function network (RBFN), proposed by M.J.D. Powell [34], is a general-purpose network which can be used in the same situations as a Backpropagation network, for prediction as well as for classification problems. Since it uses radially symmetric and radially bounded transfer functions in its hidden layer, it is a general form of the probabilistic and general regression networks. It overcomes some disadvantages of Backpropagation, such as slow training time and the local minima problem, but requires more computation in the recall phase in order to perform function approximation or classification.

Computation in the Network

Any network using radially symmetric hidden units belongs to the class of Radial-Basis Function networks. A pattern of hidden units is radially symmetric [30] if it:
(a) has a "center", i.e., an input vector stored in the weight vector between the input and the hidden layer,
(b) has a distance measure which determines the distance of each input vector from the center,
(c) has a transfer function which maps the output of the distance function.

Such a general definition also includes General Regression, Probabilistic, Counterpropagation, and other similar networks. The most common distance measure used is the Euclidean distance, while a Gaussian is the usual transfer function (or kernel) in the hidden layer. The output of this hidden layer is the same for all inputs within a fixed radial distance from the center, i.e., for the inputs that are radially symmetric. The performance of an RBFN "depends on the number and position of the radial-basis functions,

49 ther shape, and the method used for determnng the assocatve weght matrx W [35]. Some exstng strateges for tranng RBFNs can be classfed as follows: 1) RBFNs wth a fxed number of centers selected randomly from the tranng data, 2) RBFNs wth unsupervsed procedures for selectng a fxed number of Radal-Bass Functon centers, 3) RBFNs wth supervsed procedures for selectng a fxed number of Radal-Bass Functon centers. 40 The above strateges all have the same dsadvantage: the number of centers must be determned n advance. To overcome ths shortcomng, several authors suggested algorthms, such as the growng cell structure (GCS) proposed by Frtzke, dstrbuton of radal-bass functons wth space-fllng curves proposed by Whtehead and Choate, dynamc decay adjustment (DDA) algorthm proposed by Berthold and Damond, and mergng two prototypes at each adaptaton cycle. All the above algorthms nvolve ether cascade or prunng prncples [39]. The focus below wll be on the RBFN algorthm proposed by Moody and Darken [30], whch uses Eucldean dstance and a Gaussan transfer functon n the hdden layer. The nput to the hdden unts s computed accordng to the formula [37]: I k = X c k = N = 1 ( X c ) 2, (3.20) k where c s the center. The output s computed usng a Gaussan transfer functon: f ( x) 2 I k 2 σ ( x c ) = e, = k ϕ (3.21)

50 where the center c s determned by a clusterng algorthm and by the nearest neghbor technque. 41 Archtecture of the Network The RBF learnng algorthm can be brefly descrbed as follows: - tranng starts n the hdden layer wth an unsupervsed learnng algorthm n order to determne the center, - tranng contnues n the output layer wth a supervsed learnng algorthm n order to compute the error, - smultaneous applcaton of a supervsed learnng algorthm to the hdden and output layers to fne-tune the network. A common RBFN archtecture s shown n the fgure below.... Input... Hdden layer (pattern unts) Summaton Output layer Fgure 3.4: Archtecture of RBFN
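As a concrete illustration of the pattern-unit computation just described, the following MATLAB sketch evaluates the hidden layer of an RBFN for one input vector, assuming Euclidean distances and Gaussian kernels as in equations (3.20) and (3.21) with a single global width sigma; the names are illustrative and not taken from any toolbox:

% X is a 1xN input vector, C is a KxN matrix of stored centers, sigma is the
% Gaussian width. phi(k) is the output of the k-th pattern (hidden) unit.
function phi = rbf_hidden(X, C, sigma)
    K = size(C, 1);
    phi = zeros(K, 1);
    for k = 1:K
        d2 = sum((X - C(k,:)).^2);          % squared Euclidean distance to center k
        phi(k) = exp(-d2 / (2*sigma^2));    % Gaussian transfer function
    end
end

In a full implementation each Gaussian would receive its own radius (for example from the 2-nearest-neighbor rule described below); a single sigma is used here only to keep the sketch short.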

51 42 Learnng through the archtecture can be descrbed n the followng steps: 1) from the nput to the hdden layer: Clusterng phase. In ths phase the ncomng weghts to the prototype layer learn to become the centers of clusters of nput vectors usng a dynamc algorthm. 2) n the hdden layer: The rad of the Gaussan functons at the cluster centers are computed usng a 2-nearest neghbor technque. The radus of a gven Gaussan s set to the average dstance to the two nearest cluster centers. 3) n the output layer: Error s computed at the output layer usng one of the learnng rules. It s also possble to nclude one addtonal hdden layer to mprove learnng. Applcaton of the Network Karayanns and Wegun [35] gve a bref overvew of the prevous usage of a Radal-Bass network that starts wth Broomhead and Lowe year who frst mplemented ths network and showed how t models nonlnear relatonshps. The ablty of a RBFN wth one hdden layer to approxmate any nonlnear functon s proved by Park and Sandberg. Then Mchell showed how ths network could produce an nterpolatng surface, whch passes through all the pars of the tranng set. Advantages of a RBFN can be brefly summarzed as follows: - fast tranng, - better decson boundares than Backpropagaton when used for classfcaton and decson problems,

52 - hdden unt can be nterpreted as a densty functon for the nput vectors and thus measures the probablty that a new vector s a member of the same dstrbuton as others n the nput space. Dsadvantages: - despte fast learnng, t can be slower than Backpropagaton n the recall phase, - snce the ntal learnng phase of a Radal-Bass Functon network s the unsupervsed clusterng phase, some dscrmnatory nformaton could be lost n ths phase, - t s dffcult to determne the optmal number of prototype unts [35]. The authors who propose several ways to overcome ths dsadvantage: a Growng Radal-Bass (GRBF) network that starts wth a small number of prototypes at each growng cycle and grows n the tranng process by splttng of the prototypes n each cycle. They also suggest two crtera to determne whch prototype to splt, and test dfferent hybrd learnng schemes for ncorporatng exstng learnng schemes nto RBFN, such as unsupervsed learnng for clusterng, learnng vector quantzaton, and lnear neural networks, wth very satsfactory results. The authors also propose a supervsed learnng scheme based on mnmzaton of the localzed class-condtonal varance. 43 Input Parameters to Buld the Network The RBFN uses the same nput parameters as the Backpropagaton network 3.4 General Regresson Network Accordng to Specht [40], the General Regresson Neural Network (GRNN) s a generalzed form of the Probablstc network, whch s prmarly desgned for

53 classfcaton problems. GRNN can be used for system modelng and predcton, wth the specal ablty to deal wth sparse and nonstatonary data. Its dsadvantages, such as memory ntensveness and tme-ntensveness n the recall phase, are not lmtng factors for today's fast computers. 44 Computaton n the Network GRNN s desgned to perform a nonlnear regresson analyss. If f ( x, z) s the probablty densty functon of the vector random varable x (nput vector) and ts scalar random varable z (measurement), then the computaton n GRNN conssts of calculatng the condtonal mean E( z x) of the output vector, gven by [37]: zf ( x, z) dz E ( z x) =. (3.22) f ( x, z) dz The jont probablty densty functon (pdf) f ( x, z) s requred to compute the above condtonal mean. GRNN approxmates the pdf functon from the tranng vectors usng Parzen wndow estmaton, a nonparametrc technque that approxmates a densty functon by constructng t out of many smple parametrc pdfs [40]. Parzen wndows are Gaussan wth a constant dagonal covarance matrx: f p ( x z) = (2πσ P D ( z x ) σ 2σ + e e, 2 ( N 1) / 2 ) P = 1 ˆ (3.23)

54 where P s the number of sample ponts x, N s the dmenson of the vector of sample ponts x, σ s a smoothng constant, and D s the Eucldean dstance between x and x computed by: 45 = 1 2 ( x x ), D = x x = (3.24) where N s the number of nput unts to the network, σ s the wdth parameter whch satsfes the followng asymptotc behavor as the number of Parzen wndows P becomes large: lm N P ( Pσ P) = and (3.25) lm N P ( Pσ P) = 0 or when σ S =,0 E < 1, ( E/ N ) P (3.26) where S s the scale and, N s the number of nput unts. When estmated pdfs are nserted nto equaton (3.22), the followng formula for computng each component z j s obtaned z j ( x) = P = 1 P z = 1 j e e 2 D 2 2σ 2 D 2 2σ. (3.27) Snce computaton of Parzen the estmaton s tme consumng when the sample s large, a clusterng procedure s often ncorporated n GRNN. Accordng to ths procedure, for any gven sample x, nstead of computng a new Gaussan kernel 2 D 2 2σ e at center x for each pont, the dstance of that sample to the closest center of a prevously establshed

55 kernel s found, and the old closest kernel s reused. Such an approach transforms the equaton (3.27) for z j, nto: 46 P 2 D 2 2σ Ae = 1 zˆ j =, j = 2 P = 1 B e D 2 2 σ 1,..., M, (3.28) where A A ( k) = A ( k 1) + z j and B B ( k) = B ( k 1) + 1. (3.29) Archtecture of the Network The network conssts of the nput layer, the pattern layer and the output layer (see Fgure 3.5). There s also an addtonal summaton/dvson layer whose functon wll be explaned later. The process of network learnng s conducted as follows: 1) from the nput layer to the pattern layer: tranng vector X s dstrbuted from the nput layer to the pattern layer, and the connecton weghts from the nput layer to the k th unt n the pattern layer store the center X of the th k Gaussan kernel. 2) n the pattern layer: The summaton functon for the th k pattern unt computes the Eucldean dstance D k between the nput vector and the stored center X and transforms t through the exponental functon 2 D 2 2σ e. Then B coeffcents are set as connecton weghts from the pattern layer to the frst unt n the summaton/dvson layer, and A coeffcents are set as the weghts to the remanng unts n the summaton/dvson layer.

56 47... x 1 x 2 x 3 x n Input Pattern layer B (k) A 1 (k) A (k) A m (k)... Summaton and dvson... z 1 z j z m Output layer Fgure 3.5: Archtecture of GRNN 3) n the summaton/dvson layer: the summaton functon of ths layer (whch s the standard weghted sum functon) computes the denomnator of equaton (3.27) for the frst unt ( j ), then the numerator for each next unt ( j + 1). To compute the output ˆ ( x), the summaton of the numerator s dvded by the summaton of denomnator and such output s forwarded to the output layer (note that the frst unt of the summaton layer does not generate the output). 4) n the output layer: output layer receves nputs from the summaton/dvson layer, outputs the estmated condtonal means, and computes the error on the bass of the real output from the envronment. z j
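As an illustration of equation (3.27) in its unclustered form (no A and B coefficients), the estimate can be computed directly from the stored training pairs; the following MATLAB sketch uses illustrative names only:

% Xtrain is PxN (stored input vectors), Ztrain is PxM (stored outputs),
% x is a 1xN query vector, sigma is the smoothing constant of the Parzen kernels.
function zhat = grnn_predict(x, Xtrain, Ztrain, sigma)
    D2 = sum((Xtrain - repmat(x, size(Xtrain,1), 1)).^2, 2);  % squared distances D_i^2
    w  = exp(-D2 / (2*sigma^2));                              % Gaussian kernel weights
    zhat = (w' * Ztrain) / sum(w);                            % kernel-weighted average, eq. (3.27)
end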

57 48 Applcaton of the Network Because of ts generalty, GRNN can be used n varous problems such as predcton, plant process modelng and control, general mappng problems, or for other problems where nonlnear relatonshps exst among nputs and output [37]. One of the man advantages of GRNN s the ablty to deal wth nonstatonary data (tme seres data whose statstcal propertes change over tme). Ths ablty s obtaned by modfyng the computaton of the B and A coeffcents n equaton (3.27) such that a tme constant s ntroduced n terms of number of tranng vectors, as an ndcaton of how fast the tme seres changes ts characterstcs. It can be concluded from the above descrpton of GRNN that t s especally adaptable to (a) nonstatonary and sparse data and (b) statonary but nosy data. Input Parameters to Buld the Network 1) number of nput, pattern and output layer unts. 2) summaton functon n the pattern unt (Eucldean, Cty Block, or Projecton). 3) τ - tme constant τ s defned n terms of tranng vectors, and should be adjusted accordng to the degree of nonstatonarty present n the data. A smaller tme constant wll cause the network to forget the prevous cases faster. 4) θ - reset factor Ths factor s dvded by the number of pattern unts, and used as the comparng value for resettng the B coeffcents.

58 49 5) radus of nfluence Ths s a clusterng mechansm for determnng the lmt of the Eucldean dstance by whch an nput vector wll be assgned to a cluster. The nput vector wll be assgned to a cluster f the cluster center s the nearest center to the nput vector, or f the cluster center s closer than the radus of nfluence. If the nput vector does not satsfy the above condtons, a new center s computed for the vector. 6) sgma scale (S) and sgma exponent (E) S and E values are used n computng the Parzen wndow wdth n formula (3.26) Modular Network Proposed by Jacobs, Jordan, Nowlan and Hnton (1991), ths network s a system of many separate networks (usually Backpropagaton). Each of them learns to handle a subset of the complete set of tranng cases. It s therefore able to mprove the performance of Backpropagaton when the tranng set can be naturally dvded nto subsets that correspond to dstnct subtasks. Computaton n the Network Ths network conssts of several networks called "local experts" connected by a gatng network that allocates each case to one of the local experts. The output of the local expert s compared to the actual output, and the weghts are changed locally only for that expert and for the gatng network. In that way the gatng network "encourages" a partcular local expert to specalze n smlar cases. Other experts specalze n other cases. Decsons of the gatng network are made stochastcally. Whle a prevous paper

59 suggested that the fnal output of the whole system s a lnear combnaton of the outputs of the local experts, Jacobs et al. [41] use a stochastc selector and compute the error accordng to the formula: 50 c c c c c c E = d o = p d o (3.30) 2, where c o s the output vector of expert n case c, c p s the proportonal contrbuton of expert on the combned output vector, and c d s the desred output vector n case c. In such a process each local expert produces the whole output, and the goal of one local expert s not drectly affected by the weghts of the other local experts. Although some ndrect couplng can occur f the gatng network alters the responsbltes from one local expert to another, stll the sgn of the local expert error remans unnfluenced. The number of local experts n the network s determned n advance, based on the assumpton of the number of subsets or local regons n the nput space of the sample. Each local expert s a feedforward network and all experts have the same number of nput and output unts. Local experts as well as the gatng network receve the same nput. Of course, ther output dffers. Output of the gatng network s the probablty [41]: ( x ) j e p j =, (3.31) ( x ) e where x j s the total weghted nput receved by output unt j of the gatng network, and p j s the probablty that the swtch wll select the output from local expert j. Ths output s normalzed to sum to 1. The output of the local experts y s then corrected by the probablty (3.31), and the fnal output of the network s

60 51 y = N = 1 y p. (3.32) Unlke Backpropagaton where the objectve functon s to mnmze a global error functon E, a Modular network tres to maxmze the followng objectve functon J : T N ( d y ) ( d y ) 2 J = ln. pe (3.33) = 1 The error that s backpropagated for the th k local expert s J I k and for the gatng J network, G k where I k s the nput to the th k local expert output node and G k s the nput to the gatng network output node. Accordng to the above learnng process, f an expert gves a smaller less error than the weghted average of the errors of all the experts, ts responsblty for that case wll be ncreased, and vce versa. The error s backpropagated and the weghts are updated accordng to the chosen learnng rule. Archtecture of the Network The fgure below represents the archtecture of the Modular neural network. For clarty reasons, the archtecture n fgure 3.6 conssts of two local experts marked as LE1and LE2. Each local expert has only one output neuron. The gatng network and local experts have the same number of nput neurons, but the number of output neurons n the gatng network s the number of local experts,.e., two n our example presented n the fgure.

61 52... Input vector LE1 LE2 Gatng network p 1 p 2 Gate layer - Stochastc selector Output Fgure 3.6: Archtecture of Modular Neural Network Learnng s conducted as follows: 1) from the nput layer to the local experts and to the gatng network: The same tranng vector X s dstrbuted from the nput layer to each local expert and to the gatng network. Each expert s a feedforward (usually Backpropagaton) network. Output of the local experts depends on the feedforward archtecture ncorporated n the expert. Output of the gatng network s computed as descrbed above. 2) n the gate layer: The gatng network sends output to an ntermedate layer (called a gate) where the probabltes sent by the gatng network are used to correct the local expert outputs.

62 3) n the output layer: fnal output to the user s the output of the local expert wth the hghest probablty. The error s computed accordng to the formula 3.30 and backpropagated to the local experts and to the gatng network. 53 Applcaton of the Network A Modular network can be appled n most cases where Backpropagaton s used, especally n problems wth dfferent regons n the nput space. One of the llustratons of such problems s the absolute value functon: x f x 0 y =, (3.34) x f x < 0 where output y s computed by a dfferent functon for dfferent subsets of x. Jacobs et al. [41] appled a Modular network on a speaker ndependent, four-class vowel dscrmnaton problem and compared the performance of a Modular network usng 4 and 8 local experts wth three-layered Backpropagaton networks. The comparson showed that the performance of tested networks s the same, although the Modular network reaches the error crteron sgnfcantly faster than Backpropagaton. A Modular network s also a way to make compettve learnng assocatve, n a sense that a local expert whose output vector specfes the mean of a multdmensonal Gaussan dstrbuton replaces each hdden unt n the compettve network. A Modular approach also enables the usage of more complex archtectures when each local expert s desgned as one Modular network.
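As a small illustration of the gating computation in equations (3.31) and (3.32), the following MATLAB sketch combines the outputs of the local experts; the names are illustrative only:

% expert_out is an NxM matrix whose i-th row is the output of local expert i;
% gate_in is the Nx1 vector of total weighted inputs x_j to the gating output units.
function [y, p] = combine_experts(expert_out, gate_in)
    e = exp(gate_in);
    p = e / sum(e);            % equation (3.31): gating probabilities, summing to 1
    y = p' * expert_out;       % equation (3.32): probability-weighted combination
end

A stochastic selector, as used by Jacobs et al., would instead sample one expert according to p rather than forming the weighted sum.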

63 54 Input Parameters to Buld the Network 1) number of nput, hdden and output layer unts for local experts (LE) Each local expert has the same ntal structure snce t s not known n advance whch part of the nput space wll be allocated to each LE. 2) number of hdden and output unts n gatng network Hdden unts n the gatng network can be determned heurstcally or optmzed n the learnng phase. The number of output unts n the gatng network determnes the number of local experts (LEs) n the network. 3) learnng coeffcent 4) learnng rule The EDBD learnng rule s used as descrbed n the Backpropagaton secton. 5) other parameters descrbed n the Backpropagaton secton. 3.6 Probablstc Network A Probablstc neural network (PNN), one of the stochastc-based networks, s bult on a decades old statstcal algorthm developed by Mester [36], although Donald Specht proposed the complete neural network algorthm n It uses nonparametrc estmaton methods for classfcaton and therefore does not suffer from local mnma problems, as do feedforward networks. Accordng to Kartalopoulos [34], Probablstc NN s able to approxmate the optmum boundares between categores; therefore t can be used as a classfer, wth the assumpton that the tranng data are a true representatve sample. The Archtecture of the Probablstc neural network s bult upon Bayes'

64 classfer usng the Parzen wndow estmator to estmate the probablty dstrbutons of the class samples [37]. 55 Computaton n the Network In general, the classfcaton problem can be stated as samplng the m-component multvarate random vector X = x,..., x ], where the samples are ndexed by [ m k, k = 1,..., K [33]. The probablty that a sample wll be drawn from a populaton k s h and the cost of msclassfyng the sample wll be c. The populaton from whch the, k k tranng samples are taken s known, and the am s to create an algorthm for classfyng unknown samples wth the expected msclassfcaton cost less than or equal to any other. Such algorthm s Bayes optmal. If the probablty densty functons for all populatons k are known, then the Bayes optmal decson rule s: classfy X nto populaton f h c f ( X ) > h c f ( X ) j,. (3.35) j j Snce the probablty densty functons are usually not known n practce, t s often assumed that they are members of a normal dstrbuton. The tranng set s then used to estmate the parameters of the dstrbuton. However, t s more approprate to use a nonparametrc estmaton method such as Parzen wndows, also used n GRNN. In order to classfy unknown samples, most common classfers separate the unknown from each known member of the tranng set usng the Eucldean or other dstance. The unknown member s then classfed nto the populaton of ts nearest neghbor. The Parzen wndows technque goes one step further n a way that t takes nto account more dstant neghbors. Parzen's technque estmates a "sphere-of-nfluence" functon for separatng an unknown pont from the known tranng sample pont. Such a functon has a hgher value

65 f the dstance s close and converges to zero f the dstance becomes large. Takng the sum of ths functon for all known tranng set members and classfyng the unknown pont nto the populaton wth the largest sum s the man dea of the probablstc algorthm. Parzen's estmated densty functon s 56 1 g( x) = nσ n 1 =0 x W σ x, (3.36) where n s the sample sze, σ s the scalng parameter that controls the wdth of the area of nfluence of the dstance, and W s the weghtng functon. Snce a Gaussan s usually used for weghtng, the probablty densty functon takes the form: g( x) = nσ p 1 (2 π) x x n 1 2 2σ e. p/ 2 = 0 2 (3.37) Although the value of σ s an mportant smoothng parameter n the Probablstc network snce t affects the estmaton error, there s no mathematcal way of determnng t. A too small value of σ gves the same effect as the nearest neghbor technque, and too large does not gve clear separaton of classes so classfcaton cannot be made. A large value gves a flat curved surface, whle a small value results n narrow peaks. Archtecture of the Network The Probablstc neural network conssts of the nput layer, the pattern layer, the summaton layer, and the output layer. A smplfed archtecture s shown n the fgure below.

66 57... x 1 x 2 x n Input Pattern layer B (k) A (k) Summaton Output layer Fgure 3.7: Archtecture of Probablstc Neural Network Learnng n PNN s not an teratve process. Only one pass through the tranng sample s needed for the network to learn. The flow of nformaton through the layers s the followng: 1) from the nput layer to the pattern layer: Tranng vector X s dstrbuted from the nput layer to the pattern layer, 2) n the pattern layer: The pattern layer conssts of K unts, one for each tranng vector X. Input n the pattern layer unt j s computed accordng to the formula: I j T = x w ) ( x w ).) (3.38) ( The pattern layer unts are not fully but selectvely connected to the summaton unts, dependng on the classes they represent. Output from the pattern unt j s performed accordng to the actvaton functon:

67 58 k ( x w j ) = 1 2 f ( x, w ) = 2σ j e. (3.39) 2 3) n the summaton layer: The number of unts n the summaton layer s equal to the number of classes. Each summaton unt receves nput from the pattern unt of the same class. Output of the summaton unts s the estmaton of the class probablty densty functon accordng to formula (3.39). 4) n the output layer: Each unt n the output layer receves nputs from each summaton unt and produces a bnary output sgnal, whch s a product of the summaton unt s output and the weght coeffcent. Applcaton of the Network The probablstc network s exclusvely desgned for classfcaton problems. Although n some cases t can be adjusted for autoassocaton, t s prmarly a classfer. Successful usage of ths algorthm s observed n vector cardogram nterpretaton, radar/target dentfcaton, hull-to-emtter correlaton on radar hts, for example. It s also a very fast tranng algorthm compared to feedforward algorthms, especally n the case when t s optmzed n an extensve jackknfng process. Another advantage of ths network s that ts results can be nterpreted as Bayesan posteror probabltes, thus sutable for confdence estmates. Dsadvantages of Probablstc network are n the necessty of havng a representatve tranng set and n memory requrements, snce the whole tranng set s processed whle classfyng each unknown case.
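A minimal MATLAB sketch of the classification computation described above is given below, using the competitive output mode (one pattern unit per stored training vector, summed per class); the names are illustrative only:

% W is KxN (stored training vectors), labels is Kx1 (their class indices),
% x is a 1xN unknown vector, sigma is the Parzen smoothing parameter.
function cls = pnn_classify(x, W, labels, sigma)
    D2 = sum((W - repmat(x, size(W,1), 1)).^2, 2);   % distances to stored patterns
    g  = exp(-D2 / (2*sigma^2));                     % pattern-unit activations
    classes = unique(labels);
    score = zeros(numel(classes), 1);
    for c = 1:numel(classes)
        score(c) = sum(g(labels == classes(c)));     % summation-layer output per class
    end
    [~, idx] = max(score);                           % competitive output: winning class
    cls = classes(idx);
end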

68 Snce t takes the sum of the densty functon, the Probablstc network s especally sutable for problems wth sparse data and data outlers, snce outlers wll have no strong effect on decsons. 59 Input Parameters to Buld the Network In order to buld a PNN, the followng parameters should be determned: 1) number of nput, pattern, and output unts Pattern unts are set to the number of tranng vectors accordng to the computaton procedure descrbed before. If the tranng sample s very large, t s recommended to use a smaller number of pattern unts, and then to set the radus of nfluence to a postve value. The number of output unts s set to the number of classes. 2) summaton functon n the pattern unt: Eucldean, Cty Block, or Projecton 3) radus of nfluence Ths parameter s used for a clusterng procedure n the same way as n a GRNN wth the am to make learnng faster by fndng the closest already computed Parzen wndow for a new tranng vector. 4) output mode for the output layer Possble output modes are probablstc, compettve, and normalzed. The probablstc mode drectly outputs the values produced by the densty functon. The compettve output produces 1 for the wnnng unt and 0 for the others. The output values n the normalzed mode are all postve and normalzed such that they sum to 1.

69 60 5) sgma scale (S) and sgma exponent (E) In order to fnd the correct asymptotc behavor for the Parzen wndow wdth when the pattern unts become very large, NeuralWare [30] mplements an exponental decay of. Parameters S and E are used to compute accordng to the formula (3.26). 3.7 Learnng Vector Quantzaton Network Vector quantzaton (VQ) s one of the compresson methods usually used when large quanttes of data must be dvded nto a number of classes n order to ncrease the processng effcency. As was prevously dscussed n secton 2.1, t s the process of mappng nput vectors x of dmenson n nto a fnte number of classes, represented by a codeword or a prototype vector w j ( j = 1,..., m), where m < n. Mappng s performed usng one of the nearest neghbor technques such as Eucldean dstance or a cost functon [37]. VQ s a method of unsupervsed learnng, but has a specal supervsed form called Learnng Vector Quantzaton (LVQ), orgnally proposed by Teuvo Kohonen. Such supervsed versons of VQ wll be dscussed below. Computaton n the Network Learnng n LVQ a takes place n the hdden Kohonen layer. Ths network dffers from a VQ n ts usage of known (desred, true) output classfcatons t for each nput vector x. If C (x) s a class of x computed by the network and t s the true class, then the learnng n LVQ s performed as follows [37]: 1) After recevng an nput vector, a Kohonen layer fnds the approprate class n an unsupervsed way, fndng the dstance d :

70 61 d = w x = mn w x. (3.40) c 2) Computed class C (x) s compared to the true class t and the weghts are adjusted n the followng way: w c w c = α( x w ) = γ ( x w ) w = 0 c. c c f C( x) = t f C( x) t (3.41) In other words, f the computed class s equal to the true class, then the weght vector s shfted toward the nput vector. It s shfted away from the nput vector f the computed and true classes do not match. Basc LVQ suffers from one mportant shortcomng: t can happen that the ntal unt for learnng s chosen far from the true class. Then ths unt wll be shfted toward the true class, whle the others wll do nothng. To avod such stuaton, a penalzng factor s ntroduced n the form of a dstance bas proportonal to the dfference between the wnnng frequency of a unt and the average unt wnnng frequency (proposed by DeSeno, 1998). Ths verson of LVQ s called LVQ1 (wth conscence). In order to mprove learnng n LVQ, other varants are developed by Teuvo Kohonen [21], called LVQ2, LVQ2.1, LVQ3, Extended LVQ and others. Learnng starts wth the basc LVQ where Kohonen unts compute the Eucldean dstance accordng to the formula (3.40). The weghts of the wnnng unt are adjusted accordng to (3.41). Then LVQ1 takes control by addng a bas b to the dstance d of the unts n the correct class: d = d + b. (3.42)

71 Usng based dstances, LVQ1 computes the n-class wnner, but also fnds the global wnner usng unbased dstances. Weghts are adjusted such that the n-class wnner s moved toward the tranng vector (correct class) accordng to the equatons: 62 w c w c = α( x w = β ( x w c c ) ) f f the n class the n class wnner s equal tothe global wnner, wnner s not equal tothe global wnner. (3.43) If the global wnner s not n the correct class, t s shfted away from t accordng to the formula: w = γ x w ). (3.44) c ( c Then the new based dstance s computed such that the estmated maxmum dstance d max s corrected usng the wnnng frequency p for that unt and a constant η (conscence factor), whch ncreases wth learnng: b = η d (1 Np ), (3.45) max where N s the number of unts n the Kohonen layer per class. The ntal value for 1/N. It s updated accordng to: p s p p = (1 ϕ) p = (1 ϕ) p f + ϕ s not the n class wnner, or f s the n class wnner. (3.46) Control s then taken by LVQ2 n order to refne that s soluton s found by LVQ1. It s focused on the cases where the wnnng unt s n the wrong class, but the second best unt s n the correct class. Thus, the weghts n LVQ2 are adjusted f one of the followng occurs: f computed and true classes do not match, or the closest prototype weght vector s the correct class, or f the nput vector s close to the bndng hyperplane separatng the two closest prototype weght vectors [37]. If one of these three stuatons occurs when LVQ2 takes control, then the wnnng unt (whch s n the wrong class) s

72 moved away from the correct class, whle the second best unt s moved closer to the nput vector accordng to formulas: 63 w 1 w 2 = w α( x w ) 1 = w 2 1 α( x w 2 ) f x s ( w 1 + w near to 2). 2 (3.47) Archtecture of the Network The network conssts of three layers: the nput, the Kohonen and the output layer. The summarzed learnng process through the layers s: 1) from the nput layer to the Kohonen layer: nputs are transmtted from the nput layer to the Kohonen layer through full connecton. 2) n the Kohonen layer: the Kohonen layer learns usng LVQ technque to compute dstances, then LVQ1 to adjust wnnng dstances by a penalzng factor, and last LVQ2 to mprove learnng n fndng the second best unts that are n correct classes. 3) n the output layer: the computed class C (x) s compared to the correct class t, and weghts n the Kohonen layer are adjusted accordng to the LVQ1 and LVQ2 equatons. Applcaton of the Network LVQ s applcable to any type of classfcaton problem where the output classes are known. Its wde usage s noted n speech recognton, mage processng, or other problems where data compresson s needed and the probablty dstrbuton of the pattern sample s not known [37]. In cases where correct classes are not known, unsupervsed networks such as Kohonen's Self-organzng maps (SOM) or regularzaton clusterng technques are recommended.
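As a concrete illustration of the basic LVQ step in equations (3.40) and (3.41), without the conscience mechanism or the LVQ2 refinement, one update can be sketched in MATLAB as follows; the names are illustrative only:

% W is KxN (prototype vectors), wclass is Kx1 (their classes), x is a 1xN
% training vector with true class t, alpha and gamma are learning rates.
function W = lvq_update(W, wclass, x, t, alpha, gamma)
    D2 = sum((W - repmat(x, size(W,1), 1)).^2, 2);   % squared distances, eq. (3.40)
    [~, c] = min(D2);                                % winning (closest) prototype
    if wclass(c) == t
        W(c,:) = W(c,:) + alpha*(x - W(c,:));        % correct class: move toward x
    else
        W(c,:) = W(c,:) - gamma*(x - W(c,:));        % wrong class: move away from x
    end
end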

73 64 Input Parameters to Buld the Network The followng nput parameters must be set n order to buld the LVQ network: 1) number of nput, Kohonen, and output unts: There are no strct rules for choosng the number of Kohonen unts. The number should be dvsble by the number of output, unts.e., classes. It can also be set as a percentage of the number of tranng vectors. 2) number of learnng teratons n the LVQ1 and LVQ2 phases: The number of teratons n the LVQ1 phase s oblgatory, whle the LVQ2 phase s optonal. The number of teratons can be determned heurstcally or t can be set from the tranng fle. 3) ntal learnng rate for LVQ1 and LVQ2 Ths parameter s used for ntalzng rates that are reduced over each learnng phase. 4) LVQ2 wdth parameter Ths s the parameter whch determnes the wdth of the hyperplane corrdor n LVQ2 learnng. 5) In-class wnner always learns Ths opton enables LVQ1 learnng wth conscence. 6) conscence factor Ths s the constant that s used to compute the bas n the LVQ1 learnng. 7) frequency estmaton

74 65 Ths s the p parameter that s also used to compute bas for penalzng wnnng unts. It can also be set from the tranng sample as the nverse of the number of tranng vectors. 3.8 The Cascade Archtecture Neural Network To solve real-world problems wth neural networks, the use of hghly structured networks of a rather large sze s usually requred. A practcal ssue that arses n ths context s that of mnmzng the sze of the network and yet mantanng good performance. A neural network wth mnmum sze s less lkely to learn the dosyncrases or nose n the tranng data, and may thus generalze better to new data. Ths desgn objectve may be acheved n one of the two ways: Network growng, whch starts off wth a small multplayer perceptron, small for accomplshng the task at hand, and then add a new layer of hdden neurons only when the desgn specfcaton s not met. Network prunng, whch starts of wth a large multplayer perceptron wth an adequate performance for the problem at hand, and then prune t by weakenng or elmnatng certan synaptc weghts n a selectve and orderly fashon. The cascade-correlaton learnng archtecture [39] s an example of the networkgrowng approach. The rest of ths secton presents a detaled descrpton of the cascadecorrelaton archtecture, ts algorthm and mathematcal background, and ts applcaton to mage processng.

75 66 Cascade-Correlaton Cascade-Correlaton (CC) s an algorthm developed by Scott Fahlman [39]. There are two types of CC- Pruned CC and Recurrent CC. Both wll be brefly dscussed but the results obtaned n the mplementaton are from usng the standard algorthm, albet wth a few modfcatons. Strctly speakng, CC s a knd of meta-algorthm, n whch other algorthms such as back propagaton (BP) are embedded. The procedure begns wth a mnmal network that has some nputs and one or more output nodes as ndcated by nput/output consderatons, but no hdden nodes. The LMS algorthm, for example, may be used to tran the network. The hdden neurons are added to the network one by one, thereby obtanng a multlayer structure. Each new hdden neuron receves a synaptc connecton from each of the nput nodes and also from each preexstng hdden neuron. When a new hdden neuron s added, the synaptc weghts on the nput sde of that neuron are frozen; only the synaptc weghts on the output sde are traned repeatedly. Expandng on the above descrpton, learnng n CC takes place as repeatng twophase steps. The frst nvolves the embedded standard learnng algorthm, whch n our case s Backpropagaton (BP). Durng ths phase the deal actvty as specfed n the tranng pattern s compared wth the actual actvty and the weghts of all tranable connectons adjusted to brng these nto correspondence. Ths s repeated untl learnng ceases or a predefned number of cycles have been exceeded. The second phase nvolves the creaton of a pool of 'canddate' unts. Each canddate unt s connected wth all nput unts and all exstng hdden unts. It s ths whch leads to the cascadng archtecture, as each new unt s connected to all precedng

76 unts. There are no connectons from these canddate unts to the output unts. The lnks leadng to each canddate unt are traned wth the selected standard learnng algorthm (BP) to maxmze the correlaton between the resdual error of the network and the actvaton of the canddate unts. Tranng s stopped f the correlaton ceases to mprove or a predefned number of cycles s exceeded. The fnal step of the second phase s the ncluson, as a hdden unt, of the canddate unt whose correlaton was hghest. Ths nvolves freezng all ncomng weghts (no further modfcatons wll be made) and creatng randomly ntalzed connectons from the selected unt to the output unts. Ths new hdden unt represents, as a consequence of ts frozen nput connectons, a permanent feature detector. The weghts from ths new unt and the output unts wll undergo tranng. Because the outgong connectons of ths new unt are subject to modfcaton ts relevance to the fnal behavor of the traned network s not fxed. These two phases are repeated untl ether the tranng pattern has been learned to a predefned level of acceptance or a preset maxmum number of hdden unts has been added, whchever occurs frst. Fgure 3.8 llustrates the ncluson of the frst two hdden unts nto a network undergong tranng. The same steps are taken no matter how many hdden unts are already ncluded n a gven network. 67 Mathematcal Background The tranng of the output unts tres to mnmze the sum-squared error E : E = ( d p y ), p 1 2 k p, k 2 k, (3.48)

77 68 where d p, k s the desred (real) output and p k y, s the observed output of the output unt k for a pattern p. The error E s mnmzed by gradent decent usng e = d y ) f ( net ), (3.49) p, k ( p, k p, k p k E / w, k = e p, k I, p, (3.50) p where f p s the dervatve of an actvaton functon of the output unt k and I, s the p value of an nput unt or a hdden unt for a pattern p. w, k denomnates the connecton between an nput or hdden unt and an output unt k. Fgure 3.8: A neural net traned wth cascade-correlaton after 2 hdden unts have been added. The vertcal lnes show unt nputs and horzontal lnes show unt output. Square connectons ndcate frozen lnks. Cross connectons ndcate tranable lnks.

78 After the tranng phase the canddate unts are adapted, so that the correlaton C 69 between the value y p, k of a canddate unt and the resdual error p k e, of an output unt becomes maxmal. Fahlman wth gves the correlaton: where C = = = ( y p, k yk )( e p, k ek ) ( y ) p, kep, k ek k p p k k p p y p, k ( e e ) p, k k, y p, k yk s the average actvaton of a canddate unt and (3.51) e k s the average error of an output unt over all patterns p. The maxmzaton of C proceeds by gradent ascent usng δ p C w = = k σ p k δ ( e e ) p I p, k p,. k f, p (3.52) where σ k s the sgn of the correlaton between the canddate unt's output and the resdual error at output k. In summary the standard CC algorthm s realzed n the followng way: 1. Start wth a mnmal network consstng only of an nput and an output layer, both fully connected. 2. Tran all the connectons endng at an output unt wth a usual learnng algorthm untl the error of the net no longer decreases.

79 70 3. Generate the so-called canddate unts. Every canddate unt s connected wth all nput unts and wth all exstng hdden unts. Between the pool of canddate unts and the output unts there are no weghts. 4. Try to maxmze the correlaton between the actvaton of the canddate unts and the resdual error of the net by tranng all the lnks leadng to a canddate unt. Learnng takes place wth an ordnary learnng algorthm. The tranng s stopped when the correlaton scores no longer mproves. 5. Choose the canddate unt wth the maxmum correlaton; freeze ts ncomng weghts and add t to the net. To change the canddate unt nto a hdden unt, generate lnks between the selected unt and all the output unts. Snce the weghts leadng to the new hdden unt are frozen, a new permanent feature detector s obtaned. Loop back to step 2. The above fve steps are repeated untl the overall error of the net falls below the gven tolerance value. The two varants of the standard CC are brefly dscussed below: Pruned Cascade Correlaton In the standard Cascade Correlaton the Network s fully connected. Pruned Cascade Correlaton [39] s a varant of the standard algorthm, whch removes unnecessary lnks by applyng selecton crtera or a holdout set. Ths n turn decreases the tranng tme requred by the network. Examples of the selecton crtera are: Schwarz's Bayesan crteron (SBC) Akakes nformaton crteron (AIC)

80 71 Conservatve mean square error of predcton (CMSEP) The SBC, the default crteron, s more conservatve compared to the AIC. Thus, prunng va the SBC wll produce smaller networks than prunng va the AIC. Both SBC and AIC are selecton crtera for lnear models, whereas the CMSEP does not rely on any statstcal theory, but happens to work pretty well n an applcaton. These selecton crtera for lnear model can sometmes drectly be appled to nonlnear models, f the sample sze s large. Recurrent Cascade Correlaton Recurrent Cascade-Correlaton (RCC) s a recurrent verson of Cascade- Correlaton and can be used to tran recurrent neural networks (RNN) [40]. Recurrent Networks are used to represent tme mplctly rather than explctly. One of the most commonly known archtectures of RNNs s the Elman model [42], whch assumes that the network operates n dscrete tme steps. In the Elman model the outputs of the network's hdden unts at a tme t are fed back for use as addtonal network nputs at tme t+1. Context unts are used n the Elman model to store the output of the hdden unts. To use the standard CC wth the Elman model (and recurrent models n general), only one change s needed to the algorthm. The hdden unts' values are no longer fed back to all other hdden unts. Instead, every hdden unt has lnks to all nput unts and also has one self-recurrent lnk. Ths self-recurrent lnk s traned along wth the canddate unts other nput weghts to maxmze the correlaton when the canddate unt s added to the actve network as a hdden unt. The recurrent lnk s frozen along wth all other lnks.
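Returning to the standard algorithm, the candidate-scoring step of equation (3.51) can be sketched as follows; this is an illustration only, with made-up names, and assumes the candidate activations and residual errors have already been collected over the training patterns:

% V is PxQ (activation of Q candidate units over P patterns), E is PxK
% (residual error at the K output units). C(q) is the correlation score of
% candidate q; the candidate with the largest score becomes the new hidden unit.
function [best, C] = score_candidates(V, E)
    Vc = V - repmat(mean(V,1), size(V,1), 1);   % activations minus their means
    Ec = E - repmat(mean(E,1), size(E,1), 1);   % errors minus their means
    C  = sum(abs(Vc' * Ec), 2);                 % sum over outputs of |covariance|, eq. (3.51)
    [~, best] = max(C);
end

In the full algorithm the incoming weights of each candidate are first trained by gradient ascent on this correlation (equation (3.52)) before the best candidate is frozen and installed.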

81 72 CHAPTER 4 Image Compresson Usng Neural Networks After dscusson of some of the varous types of neural networks and ther archtectures, ths chapter consders ther role as an mage compresson tool. Frst the mage compresson problem s descrbed, then the mplementaton of the sngle and parallel structure NNs wth regard to mage compresson s presented, after whch the new NN method s presented n detal. Its mplementaton s fully presented and ts performance evaluated. Evaluaton metrcs for comparng the compresson schemes are also descrbed. The Image Compresson Problem k A stll mage I s descrbed by a functon f : Z Z { 01,,...,2 1}, where Z s the set of natural numbers, and k s the maxmum number of bts to be used to represent the gray level of each pxel. In other words, f s a mappng from dscrete spatal coordnates (x,y) to gray level values. Thus, M N k bts are requred to store an M N dgtal mage. As was mentoned n Chapter 1, the am of lossy mage compresson s to develop a scheme to encode the orgnal mage I nto the fewest number of bts such that the mage I reconstructed from ths reduced representaton through the decodng process s as smlar to the orgnal mage as possble,.e., the

problem is to design a compress and a decompress block so that Î ≈ I and |I_c| << |I|, where |·| denotes the size in bits. The above scenario is depicted in Figure 4.0.

Figure 4.0: Image Compression Block Diagram (original image I → COMPRESS → compressed image I_c → DECOMPRESS → reconstructed image Î)

Evaluation metrics
In order to compare the quality achieved by lossy compression schemes, a metric is needed to quantify the quality of the reconstruction. The metric used to compare the neural network and JPEG image compression schemes is the peak signal-to-noise ratio (PSNR). Assuming that the original and reconstructed images are represented by functions f(x,y) and g(x,y) of the pixel plane position (x,y), respectively, the PSNR is defined by:

PSNR = 10 log10 [ (2^k - 1)^2 / MSE ], (4.1)

where k is the number of bits representing a pixel, and the mean square error (MSE) is:

MSE = (1/MN) Σ_{y=1}^{N} Σ_{x=1}^{M} [ I(x,y) - Î(x,y) ]^2. (4.2)

The MSE is the cumulative squared error between the compressed and the original image, whereas the PSNR is a measure of the peak error. This metric does not adequately evaluate the visual quality of the decompressed image, but it is very simple to compute and relates to usual signal metrics when the original image I is viewed as the signal and

83 Î s vewed as the same sgnal corrupted by some nose representng the deformaton due to lossy compresson Sngle-structure NN mage compresson mplementaton Dfferent approaches have been mplemented for the sngle-structure NN. In ths mplementaton, the backpropagaton algorthm s employed and Rprop s used to speed the tranng process. All the algorthms descrbed n ths chapter are mplemented n MATLAB Pre-processng The mage s splt nto J blocks {B j, j = 1, 2,, J} of sze M x M pxels. The pxel values n each block are rearranged to form a M 2 length pattern C j = {C 1,j, C 2,j,..., C M 2,j }, where j = 1, 2,, J (C,j s the th element of the j th pattern). It s common practce to normalze the nput patterns usng the transformaton P j = f(c j ) = (C j m Cj )/σ Cj where m Cj and σ Cj are, respectvely, the average and standard devaton of C j. The reason for usng normalzed pxel values s because NNs operate more effcently when the nput s lmted to a range of [0,1]. Patterns P j are used n the tranng phase of the NN as both nputs and outputs Tranng The network s traned to mnmze the mean square error between output and nput values, thus maxmzng the peak sgnal-to-nose rato (PSNR), wth the provso

84 that the mage PSNR s measured for quantzed values n [0,255] whle the NN learnng uses the correspondng real-valued network parameters. The NN conssts of the nput and two layers: the hdden and output layers wth number of nodes M 2, H, and M 2, respectvely. Refer to Secton for a complete descrpton of NN archtectures. The NN acts as a coder/decoder. The coder conssts of the nput-to-hdden layer weghts v,k { = 1,,M 2 and k = 1,,H}, and the decoder conssts of the hdden-to-output layer weghts w k,m, {k = 1 H and m = 1,, M 2 }. In the defntons above, v,k represents the weght from the th nput node to the k th hdden node, and w k,m represents the weght from the k th hdden node to the m th output node. Compresson s acheved due to the transformaton of patterns P j, through the v,k weghts, by settng the number of hdden nodes H smaller than the nput pattern length M 2 (H<M 2 ). Actually, even though H<M 2, no real compresson has occurred because unlke 75 the M 2 orgnal nputs whch are 8-bt pxel values, the outputs of the hdden layer are real-valued (between 1 and 1), whch requres an astronomc number of bts to transmt. True mage compresson occurs when the hdden layer outputs are quantzed before transmsson. Thus the hdden layer neuron values are quantzed to 8 bts to obtan an mage that truly corresponds to a gven compresson rato. The encodng/decodng processes are shown n Fgure 4.1 below: The network s traned wth the reslent backpropagaton algorthm (Rprop). As was mentoned n secton 3.2, ths allows for faster tranng. The tranng parameters, namely, the learnng rate, epochs, and mnmum gradent are set at 0.001, 1000, and respectvely. The hyperbolc tangent and lner transfer functons are used for the hdden and output layers respectvely. The bas and layer weghts are ntalzed.

85 76 Fgure 4.1: Sngle-structure neural network mage compresson/decompresson scheme Smulaton After tranng, the network s smulated wth the nput and target matrces. Ths s done by multplyng the hdden layer weght matrx w k,m by the output of the preprocessor, addng the bas, and then applyng the output layer transfer functon to the result (Equaton 3.1.2). Ths result becomes the output from the hdden layer, whch otherwse could not have been drectly obtaned by usng the Matlab sm functon. The same process s repeated to obtan the output from the output layer, wth the nput to t beng the output from the hdden layer. It must be ponted out that the weghts and bases beng referred to are the ntal choces Post-processng The next step s to reconstruct the smulated mage. The codng product assocated to the j th pattern P j s the hdden layer output O j ={O 1,j, O 2,j,..., O H,j }. The set {O j,m Cj,σ Cj } together wth weghts w k,m s suffcent for reconstructng an approxmaton C^j of the orgnal pattern C j n the decodng phase. Consderng the overhead {m Cj,σ Cj } due to the normalzaton functon f, the compresson rato (CR) acheved s M 2 :(H+2).
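A minimal MATLAB sketch of the pre-processing and evaluation steps described above is given below (each function would be saved in its own file); it assumes an 8-bit grayscale image whose dimensions are multiples of the block size M, and a block with zero standard deviation would need separate handling:

% Split image I into MxM blocks and form normalized patterns P_j, keeping the
% per-block mean and standard deviation as the overhead needed for decoding.
function [P, m, s] = image_to_patterns(I, M)
    I = double(I);
    [rows, cols] = size(I);
    P = []; m = []; s = [];
    for r = 1:M:rows
        for c = 1:M:cols
            blk = I(r:r+M-1, c:c+M-1);
            Cj = blk(:)';                          % M^2-length pattern C_j
            m(end+1,1) = mean(Cj);                 % overhead m_Cj
            s(end+1,1) = std(Cj);                  % overhead sigma_Cj
            P(end+1,:) = (Cj - m(end)) / s(end);   % normalized pattern P_j
        end
    end
end

% Peak signal-to-noise ratio of a reconstruction, equations (4.1) and (4.2).
function p = psnr_db(I, Ihat, k)
    mse = mean((double(I(:)) - double(Ihat(:))).^2);   % equation (4.2)
    p   = 10*log10((2^k - 1)^2 / mse);                 % equation (4.1)
end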

86 4.2 Parallel-structure NN mage compresson mplementaton 77 The parallel-structure NN can be vewed as a sngle-structure NN but wth multple hdden layers. In that lght, t s mplemented n much the same way as the sngle-structure NN. The detals presented n the prevous secton wll thus apply here as well, albet wth a dfferent mode of presentaton. Four networks {NET k, k = 1,2,3,4} wth dfferent number of hdden nodes (4, 8, 12, and 16) are used. Each NN s traned smlarly to the NN n secton 4.1. The goal s to acheve CR = 8:1. The codng procedure conssts of two phases. In the frst phase, each pattern s assocated wth a NET k. As wth all NNs, the larger the number of hdden nodes to whch a pattern s assgned, the smaller the assocated error e j 2 = (P^ j - P j ) (P^ j - P j ) T between the orgnal P j and estmated patterns P^j. On the other hand, a large number of hdden nodes results n low CR. Intally, patterns are assgned to NNs so that the CR s as close as possble to the predefned value 8:1. The second phase s an teratve procedure. At each successve teraton, the goal s to reduce the total error E 2 = e j 2 wthout changng the CR. Let de j 2,k be the error reducton caused due to reassgnng pattern P j from NET k. to NET k+1. Then the goal s acheved by reassgnng a par of patterns. Pattern P j1 s reassgned from NET k1 to NET k1 +1 f the reassgnment causes a maxmum error decrease de 2 j 1,k 1 = m j a k x(de j 2,k ), and pattern P j2 s reassgned from NET k2 to NET k2-1 f ths results n a mnmum error ncrease de 2 2 j 2,k 2 = m j k n(de j,k-1 ). Iteratons contnue as long as the error E 2 decreases. Ths parallel archtecture has the advantage of provdng better mage qualty for a gven CR than the methods descrbed n chapter 3 n terms of error E 2, ncludng both sngle and parallel structures presented n Carrato et al. [43]. Furthermore, codng s

87 faster than prevous parallel archtectures. Nevertheless, tranng s stll sgnfcantly slow due to the use of multple networks. Moreover, the total number of weghts s large. Thus, tranng cannot be part of the codng process; otherwse, t cannot be performed n realtme. Thus, the compresson qualty depends on the tranng data and ther smlarty to the test mages Proposed Cascade Archtecture Ths secton ntroduces an mage compresson method based on a cascade of adaptve unts that provdes better mage qualty than prevous NN-based technques. It borrows some of the dea of Cascade Correlaton n that hdden unts are added to the network one at a tme. Each unt s equvalent to a feed-forward NN wth a sngle node at the hdden layer. The proposed method exhbts a fast tranng phase that s ncluded n the encodng process. Thus, t acheves ndependence of the tranng data, whch makes t approprate for general mage compresson applcatons. The mplementaton of the new archtecture s descrbed next Encodng process The nputs C j = [C 1,j, C 2,j, C M 2,j ] are defned as n secton The normalzaton functon s P j = f(c j ) = (C j m Cj ) (4.3) Patterns P j are used as nputs/outputs to tran the frst unt. The output of the frst unt s hdden node correspondng to the j th pattern s denoted as O j,k, k = 1. The frst unt s estmated weght vector s denoted as

88 79 W k = [W 1,k, W 2,k, W M 2,k ], k = 1, (4.4) and s equvalent to the weght vector connectng the hdden node to the output layer. Element W,k s the th weght assocated to the k th unt. It s shown below that the weght vector connectng the nput to the hdden node s not requred n the mplementaton. The output O j,1 and the weght vector W 1 are requred n the decodng process. The frst unt s error patterns are denoted as e j,1 = [e 1,j,1, e 2,j,1, e M 2,j,1 ], (4.5) They are defned as the dfference between the orgnal P j and estmated patterns P^j at the output of the frst unt, after tranng s complete. The element e,j,k s the th element of the j th error pattern at unt k. If the set of error patterns at the output of unt k s defned as R k then only a subset of these error patterns s used as nput/output to tran the next unt. It should be ponted out that ths s where ths new method dffers from most NN-based mage compresson methods. Instead of tranng wth the entre patterns t makes sense to tran wth only patterns that do not meet a threshold. Ths enables faster tranng tme and s convergence. The subset R k conssts of S error patterns e s j,k, j = 1,, S, whose square s sum s larger than a specfed threshold Q: R k = {e s j,k R k, e s j,k e s j,k T > Q}. Agan, the second unt s hdden node outputs, denoted as O j,2, and the weght vector W 2 are stored to be used n the decodng process. Smlarly, the second unt s error patterns, e 2,j, are defned as the dfference between e s j,1 and the estmated error patterns e^s j,1 at the second unt s output. Agan, only a subset of the new error patterns e j,2 s used as nput/output to tran the thrd unt. The procedure of addng/tranng unts s repeated for as long as the CR does not exceed a specfc target. There s an addtonal overhead per block ndcatng how many unts

encode that block. It is important to note that since only a subset of error patterns trains each unit, the number of outputs O_j,k per unit k is variable. This allows assignment of image-blocks with larger estimation error to more units, while keeping the same CR. The threshold Q is based on the first unit's error patterns e_j,1 and the CR. More specifically,

Q = a [ (1/M^2) Σ_{i=1}^{M^2} std_j(e_i,j,1) ] CR, (4.6)

where std_j(·) denotes the standard deviation taken over the patterns j. The justification for the threshold definition in equation (4.6) is as follows. A small threshold indicates an expectation for a small coding error. First, the threshold Q is proportional to the desired CR. A small CR promises a small coding error; therefore, the threshold can be set low. Second, the threshold is proportional to the average standard deviation (ASD) of the error patterns between the original and the encoded images (using only the first unit). The ASD (the content of the square bracket in equation (4.6)) is a similarity measure between the error patterns. A small ASD indicates that the error patterns may be relatively similar throughout the image-blocks. As a result, the additional adaptive units are expected to be able to produce a small coding error, thus the threshold can be set low. Finally, parameter a is a constant which was fixed and equal to 1.2 in all implementations. The advantage of the cascade technique over existing parallel NN architectures is that the total number of weights is significantly smaller and that the number of hidden node outputs is variable. Furthermore, the training time requirements are low due to the units' low computational complexity. The algorithm converges in 4-5 iterations by repeatedly applying the simple set of equations:

O_j,k = W_k T_j,k^T / (W_k W_k^T), (4.7)

W_k = Σ_j O_j,k T_j,k / Σ_j O_j,k^2, (4.8)

where T_j,k is equal to P_j if k = 1, and to e^s_j,k-1 otherwise. Equation (4.7) gives the unit's optimum hidden layer outputs O_j,k given the unit's weights. This is the result of minimizing the sum of square errors χ^2 between input and output patterns,

χ^2_k = Σ_j (T_j,k - T^_j,k)(T_j,k - T^_j,k)^T, (4.9)

where T^_j,k = O_j,k W_k is the unit's estimate of T_j,k, with respect to O_j,k. Similarly, equation (4.8) gives the optimum unit weights given the hidden layer outputs O_j,k; this is the result of minimizing χ^2 with respect to W_k. The conditions from which equations (4.7) and (4.8) are derived are:

∂χ^2_k / ∂O_j,k = 0, and ∂χ^2_k / ∂W_k = 0. (4.10)

Since only the weights W_k emanating from the hidden layer and the hidden layer outputs O_j,k are needed in the decoding process, it is imperative to obtain the optimum set {W_k, O_j,k} for each unit. The algorithm based on equations (4.7) and (4.8) directly attempts to find optimum values for both W_k and O_j,k.

Decoding process
Each block is decoded using the set {O_j,k, W_k}, considering all units k used to encode that block. For instance, the first unit produces an estimate of the patterns, P^_j = O_j,1 W_1, and the second unit an estimate of the first unit's error patterns, e^s_j,1 = O_j,2 W_2. The decoded block is obtained by summing all of those estimates and the block

average m_Cj. Figures 4.2(a) and 4.2(b) present the proposed encoding and decoding schemes.

Figure 4.2(a): Encoding scheme for the proposed cascade architecture (each forward unit codes its input patterns, and a pattern-selection stage passes only the error patterns exceeding the threshold Q on to the next unit).

Figure 4.2(b): Decoding scheme for the proposed cascade architecture (the estimates produced by the forward units from {W_k, O_j,k} and the block average m_Cj are summed to form the reconstructed block C^_j).

The image compression results for the methods described above are presented next.
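A minimal MATLAB sketch of training one adaptive unit by alternating equations (4.7) and (4.8) is given below; the names and the fixed iteration count are illustrative, and the surrounding bookkeeping (threshold Q, pattern selection, and storage of {W_k, O_j,k} for decoding) is omitted:

% T is a JxM^2 matrix whose rows are the unit's input/target patterns T_{j,k}
% (the normalized patterns P_j for the first unit, selected error patterns
% afterwards). The unit has a single hidden node: one output O(j) per pattern
% and one weight vector W from the hidden node to the output layer.
function [W, O] = train_cascade_unit(T, iters)
    W = randn(1, size(T,2));              % random initial weight vector
    for it = 1:iters                      % 4-5 iterations suffice, as noted above
        O = (T * W') / (W * W');          % equation (4.7): optimal outputs given W
        W = (O' * T) / sum(O.^2);         % equation (4.8): optimal weights given O
    end
end

In the full encoder, the unit's error patterns T - O*W are then computed, only those whose squared sum exceeds Q are passed on to train the next unit, and the pairs {W_k, O_j,k} are stored for the decoder.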

CHAPTER 5 Results

The image compression techniques presented in chapter 4 are tested on three different test images. This chapter presents and compares the results obtained from these tests for the single-structure, parallel-structure, and proposed cascade architectures. These results are also compared with those of JPEG compression. The comparisons are made in terms of Peak Signal to Noise Ratio (PSNR) and computational complexity. Graphical and visual comparisons are also presented.

5.1 Comparisons in terms of PSNR
Table 2 presents the results obtained by compressing and decompressing three different test images (Lena, Baboon, and Peppers) with the single-structure NN and the proposed cascade method. The single-structure NN is both trained and tested on the same image to avoid dependence of the results on training data. Although this is impractical for real applications due to high training time requirements, it is useful for comparison purposes. Table 2 shows that the proposed cascade architecture gives higher PSNR for all tested images and for three different compression ratios (4:1, 8:1, 16:1). The parallel-structure architecture resulted in PSNR = 33.8 dB (for CR = 8:1) for Peppers when trained with Lena, and 33.0 dB when trained with Baboon. The PSNR for the proposed technique is 35.2 dB for the same CR. The dependence of the parallel-structure NN on the training data now becomes apparent. Similarly, when image

93 Lena was compressed by tranng the parallel archtecture wth Peppers the PSNR was 34.2 db whle for the cascade archtecture t was 35.9 db. 84 CR Sngle- Structure Lena Baboon Peppers Cascade Sngle- Structure Cascade Sngle- Structure Cascade 16: : : Table 2: Comparson between sngle structure and cascade archtectures n terms of PSNR (db). Qualty Factor CR JPEG Lena Baboon Peppers : : : Table 3: PSNR (db) for JPEG algorthm. Table 3 gves the PSNRs obtaned usng the JPEG algorthm. The qualty factor determnes the degree of compresson and thus the compresson rate. Strctly speakng, t s not possble to make a drect comparson between JPEG and the proposed cascade method or, n fact, between JPEG and sngle/parallel-structure NNs. The reason for ths s has already been descrbed n secton but wll be re-terated here: In JPEG, all quantzed coeffcents, whch are transmtted, are entropy coded usng a sophstcated dfferental pulse code modulaton and run-length Huffman code. Wthout the entropy codng, JPEG gves substantally lower PSNRs or, equvalently, a much lower CR for the

94 same PSNR [34]. Nevertheless, t can be clearly seen that for the Lena mage the proposed method, wth no entropy codng, gves PSNRs that are only between 3 and 4 db below the correspondng PSNRs obtaned wth JPEG. These results are further presented graphcally n the next fgures. Fgure 5.1 plots the PSNR values versus the CR for the reconstructed Lena mage usng the sngle-structure NN algorthm, the proposed cascaded method, and standard JPEG. As s clearly seen, the proposed method outperforms the sngle-structure NN. It s always better by 3dB. It seems, however, that JPEG compresson performs better than the proposed method, whch resulted n 32.8 db, 36.7 db, and 39.0 db at compresson ratos 16:1, 8:1, and 4:1, respectvely. JPEG resulted n 34.3 db, 39.5 db, and 42.9 db at the same compresson ratos. 85 Fgure 5.1: PSNR values for the reconstructed Lena mage at dfferent compresson ratos.
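As a rough illustration, the Lena figures quoted in the preceding paragraph can be plotted against the compression ratio with a few lines of MATLAB; the single-structure curve is omitted here because its values appear only in Table 2:

cr      = [16 8 4];            % compression ratios 16:1, 8:1 and 4:1
cascade = [32.8 36.7 39.0];    % PSNR (dB) of the proposed cascade method
jpegdb  = [34.3 39.5 42.9];    % PSNR (dB) of JPEG
plot(cr, cascade, '-o', cr, jpegdb, '-s');
set(gca, 'XDir', 'reverse');   % lower CR (higher quality) toward the right
xlabel('Compression ratio'); ylabel('PSNR (dB)');
legend('Proposed cascade', 'JPEG'); title('Lena');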

In similar fashion, Figures 5.2 and 5.3 plot the PSNR values for the reconstructed Peppers and Baboon images, respectively.

Figure 5.2: PSNR values for the reconstructed Peppers image at different compression ratios.

The JPEG approach yields a PSNR that is, on average, about 5 dB higher than that of the other two methods. However, the proposed cascaded approach does have certain advantages over the JPEG approach:

1) The decompressed NN output can be recompressed further by up to 30% with a lossless compression scheme, with minimal loss in quality. The corresponding JPEG recompression rate is much smaller (less than 5%). A minimal sketch of such a measurement is given below, after Figure 5.3.

2) The coding/decoding time required for the cascade method is also much smaller. Once the off-line training is complete, compressing and decompressing is significantly faster with the cascaded method; for instance, at 16:1 CR it takes less than 2 seconds to compute, while at 4:1 CR the computing time is less than 1 second.

3) The parallel processing capability of NNs makes them superior to JPEG in terms of hardware implementation.

Figure 5.3: PSNR values for the reconstructed Baboon image at different compression ratios.
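To illustrate how the recompression figures in advantage 1 could be measured, the following sketch applies a generic lossless coder (zlib, standing in for whatever lossless scheme is actually used) on top of a byte stream and reports the additional size reduction. The commented usage names are placeholders, and the 30% and 5% figures quoted above are the reported results of this thesis, not something this sketch reproduces.

import io
import zlib
from PIL import Image

def lossless_recompression_gain(byte_stream):
    """Fraction by which a byte stream shrinks further under a generic
    lossless coder (here zlib at maximum compression level)."""
    recompressed = zlib.compress(byte_stream, 9)
    return 1.0 - len(recompressed) / len(byte_stream)

# Hypothetical usage: compare the NN decoder output with a JPEG bitstream.
# nn_bytes = nn_reconstructed_image.tobytes()   # e.g. a uint8 array from the NN decoder
# jpeg_buf = io.BytesIO()
# Image.open("lena.png").convert("L").save(jpeg_buf, "JPEG", quality=75)
# print(lossless_recompression_gain(nn_bytes),
#       lossless_recompression_gain(jpeg_buf.getvalue()))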

An example presenting a visual comparison of the compression results is shown in Figure 5.4. Figure 5.4(a) presents the original Lena image, while Figures 5.4(b), 5.4(c), and 5.4(d) present the image compressed at CR 8:1 using, respectively, the single-structure NN architecture, the proposed algorithm, and JPEG. Although compression has visibly affected all of the compressed images, the proposed algorithm gives the best visual result, especially around lines and edges. Notice that although the JPEG algorithm is better in terms of PSNR, its performance is visually poor, especially at high CRs, as is clearly seen when Figures 5.4(c) and 5.4(d) are compared.

Figure 5.4: (a) Original Lena image and reconstructed Lena image at 8:1 compression ratio using the (b) single-structure NN, (c) proposed cascaded method, and (d) JPEG algorithms.
