BICS, 29 Aug–1 Sept

Supervised Information-Theoretic Competitive Learning by Cost-Sensitive Information Maximization

Ryotaro Kamimura
Information Science Laboratory, Tokai University, 1117 Kitakaname, Hiratsuka, Kanagawa 259-1292, Japan
ryo@cc.u-tokai.ac.jp

Abstract — In this paper, we propose a new supervised learning method in which information is controlled by an associated cost in an intermediate layer, while in an output layer errors between targets and outputs are minimized. In the intermediate layer, competition is realized by maximizing mutual information between input patterns and competitive units with Gaussian functions. The process of information maximization is controlled by changing the cost associated with information. Thus, we can flexibly control information maximization and obtain internal representations appropriate to a given problem. The new method can be considered a hybrid model similar to the counter-propagation model, in which a competitive layer is combined with an output layer. In addition, it can be considered a new approach to radial-basis function networks, in which the centers of classes are determined by information maximization. We applied our method to an artificial data problem and to the prediction of long-term interest rates and yen rates. In all cases, experimental results showed that the cost can flexibly change internal representations, and the cost-sensitive method gave better performance than the conventional methods.

Keywords: information maximization, Gaussian, cost, error minimization, hybrid model, competitive layer, output layer

I. INTRODUCTION

In this paper, we propose a new supervised learning method in which information is controlled by an associated cost in the intermediate layer, while in the second layer errors between targets and outputs are minimized. In the intermediate layer, units or neurons compete with each other by maximizing mutual information. The method can be considered a new type of hybrid system, or a new approach to radial-basis function networks.
The new method can contribute to neural computing from five perspectives: (1) it is a new type of information-theoretic competitive learning; (2) the activation function is Gaussian; (3) the process of information maximization is controlled by a cost; (4) the new model is a hybrid model in which information maximization and error minimization are combined; and (5) the method can be considered a new approach to radial-basis function networks, in which information maximization is used to determine the centers of the radial-basis functions.

First, our method is based upon a new type of information-theoretic competitive learning. We have so far proposed a new type of competitive learning based upon information-theoretic approaches [1], [2], [3], [4]. In these approaches, competitive processes are realized by maximizing mutual information between input patterns and competitive units. When information is maximized, just one unit is turned on, while all the others are off. Thus, we can realize competitive processes by maximizing mutual information. In addition, in maximizing mutual information, the entropy of competitive units must be maximized. This means that all competitive units must be used equally on average. This entropy maximization can realize equiprobabilistic competitive units without the special techniques that have so far been proposed in conventional competitive learning [5], [6], [7], [8], [9], [10], [11].

Second, in this new approach we use Gaussian activation functions to produce competitive unit outputs. When we first introduced information-theoretic competitive learning, we used the sigmoidal activation function $1/(1+\exp(-u))$, where $u$ is the net input to a competitive unit [1], [3], [4], [12], [13], [14]. When we try to increase information, the sigmoidal approaches produce strongly negative connection weights, which inhibit many competitive units, except some specific competitive units. Thus, it is relatively easy to increase information content.
However, because strongly negative connection weights are produced almost independently of the input patterns, the final representations are not necessarily faithful to the input patterns. Thus, we tried to use the inverse Euclidean distance between input patterns and connection weights [15]. Though this method produced faithful representations, it was sometimes very slow in learning. In particular, as a problem becomes more complex, networks with the inverse-Euclidean-distance activation function show difficulty in increasing information to a sufficiently high level. At this point, we replace the Euclidean distance function with the Gaussian function, because we can easily increase information by decreasing the Gaussian width.

Third, the process of information maximization is controlled by a cost associated with information. We have observed that information maximization is achieved at the expense of similarity to the input patterns. As information is increased, connection weights tend to move away from the input patterns. We should say that information maximization exaggerates some parts of the input patterns. This property is useful for obtaining some important features in input patterns. However, it sometimes happened that the obtained features did not faithfully represent the input patterns. Thus, we introduce a cost that is defined as the difference between
input patterns and connection weights. Then, by controlling the cost, we can control the internal representations obtained by information maximization.

Fourth, our new model is a hybrid model in which information maximization and error minimization are combined with each other. There have been many attempts to model supervised learning based upon competitive learning. For example, Rumelhart and Zipser [16] tried to include teacher information in competitive learning. They called this method the correlated teacher learning method, in which teacher information is included in the input patterns. However, one of the main shortcomings of this method is that we sometimes need an overwhelmingly large number of correlated teachers to supervise learning. On the other hand, Hecht-Nielsen tried to combine competitive learning directly with error minimization procedures [17], [18], [19] in what are called counter-propagation networks. However, the error minimization procedures are realized by gradient descent, and usually a large number of competitive units are needed. In our method, we can use the pseudo-inverse matrix operation to produce outputs, and learning is much faster than with counter-propagation.

Fifth, the new method can also be considered a radial-basis function network approach in which the centers of the radial-basis functions are determined by maximizing information content. The radial-basis function approach has been applied to many problems, such as function approximation and speech recognition, because of its rapid convergence and generality [20], [21], [22]. In this paper, we use Gaussian functions, and we can consider this computational method a new approach to radial-basis function networks. One of the problems of this approach is that it is difficult to determine the centers of the radial-basis functions. The centers have been determined by unsupervised learning methods such as K-means, competitive learning, and vector quantization [23], [24], [21], [22].
Thus, our method can be considered a new approach to the radial-basis function network, in which information maximization is used to determine the centers of the radial-basis functions.

II. INFORMATION ACQUISITION

A. General Cost-Sensitive Information Maximization

We suppose that information on the outer environment can be obtained only at the expense of a cost associated with the acquisition process. For example, if we want to obtain some information on an object, we should illuminate it by using some energy, which corresponds to the cost of information acquisition. Though information can surely be obtained only with an associated cost, there have been no attempts to take this cost into account in information-theoretic approaches to neural computing. As naturally inferred, one of the most favorable situations is one in which much information is obtained at relatively small cost. Thus, our problem is to maximize information while at the same time minimizing the associated cost. We define this concept by the equation

$$I = -\sum_{j=1}^{M} p(j)\log p(j) + \sum_{s=1}^{S}\sum_{j=1}^{M} p(s)\,p(j\mid s)\log p(j\mid s) - C, \qquad (1)$$

where $p(j)$, $p(s)$ and $p(j\mid s)$ denote the probability of firing of the $j$th unit, the probability of the $s$th input pattern and the conditional probability of the $j$th unit given the $s$th input pattern, respectively, and $C$ denotes the associated cost. This equation means that information is maximized and, at the same time, the associated cost is minimized.

Fig. 1. A network architecture to control information content: input units $x_k^s$, a competitive layer with weights $w_{jk}$ (information maximization), a normalization layer producing $p(j\mid s)$, and an output layer with weights $W_{ij}$, outputs $O_i$ and targets $T_i$ (error minimization).

B. Competition by Information Maximization

Let us present update rules to maximize information content. As shown in Figure 1, a network is composed of input units $x_k^s$ and competitive units $v_j$. The $j$th competitive unit receives a net input from the input units, and an output from the $j$th competitive unit can be computed by
$$v_j^s = \exp\!\left(-\frac{\sum_{k=1}^{L}(x_k^s - w_{jk})^2}{2\sigma^2}\right), \qquad (2)$$

where $L$ is the number of input units, $w_{jk}$ denotes a connection from the $k$th input unit to the $j$th competitive unit, and $\sigma$ controls the width of the Gaussian function. The output is increased as the connection weights come closer to the input patterns. The conditional probability $p(j\mid s)$ is computed by

$$p(j\mid s) = \frac{v_j^s}{\sum_{m=1}^{M} v_m^s}, \qquad (3)$$

where $M$ denotes the number of competitive units. Since input patterns are supposed to be given uniformly to the network, the probability of the $j$th competitive unit is computed by

$$p(j) = \frac{1}{S}\sum_{s=1}^{S} p(j\mid s), \qquad (4)$$

where $S$ is the number of input patterns. Information $I$ is computed by

$$I = -\sum_{j=1}^{M} p(j)\log p(j) + \frac{1}{S}\sum_{s=1}^{S}\sum_{j=1}^{M} p(j\mid s)\log p(j\mid s). \qquad (5)$$
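Eqs. (2)–(5) can be computed directly from a batch of patterns. The following is a minimal NumPy sketch (the function name, the `(S, L)` pattern layout and the `(M, L)` weight layout are our assumptions, not notation from the paper):

```python
import numpy as np

def competitive_information(X, W, sigma):
    """Mutual information between input patterns and competitive units,
    following Eqs. (2)-(5): Gaussian outputs, normalized to conditional
    probabilities p(j|s), then averaged to p(j).  Sketch only.
    X: (S, L) input patterns; W: (M, L) competitive weights."""
    # Eq. (2): Gaussian output v_j^s for every pattern/unit pair
    sq_dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # (S, M)
    v = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Eq. (3): conditional probabilities p(j|s), rows sum to one
    p_js = v / v.sum(axis=1, keepdims=True)
    # Eq. (4): firing probabilities p(j), averaged over patterns
    p_j = p_js.mean(axis=0)
    # Eq. (5): unit entropy minus conditional entropy
    info = -(p_j * np.log(p_j)).sum() \
           + (p_js * np.log(p_js)).mean(axis=0).sum()
    return info, p_js, p_j
```

When each weight vector sits exactly on a distinct pattern and `sigma` is small, `p_js` is nearly one-hot and the information approaches its maximum, `log M`, as the text describes.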
To maximize mutual information, the entropy of competitive units must be maximized, and at the same time the conditional entropy must be minimized. When the conditional entropy is minimized, each competitive unit responds to a specific input pattern. On the other hand, when the entropy is maximized, all competitive units are equally activated on average.

C. Cost-Sensitive Information Maximization

In this paper, a cost is considered to be one representing the difference between input patterns and connection weights. Thus, the cost function is defined by

$$C = \frac{1}{S}\sum_{s=1}^{S}\sum_{j=1}^{M} p(j\mid s)\sum_{k=1}^{L}(x_k^s - w_{jk})^2. \qquad (6)$$

Thus, we must maximize the following function:

$$I = -\sum_{j=1}^{M} p(j)\log p(j) + \frac{1}{S}\sum_{s=1}^{S}\sum_{j=1}^{M} p(j\mid s)\log p(j\mid s) - \frac{1}{S}\sum_{s=1}^{S}\sum_{j=1}^{M} p(j\mid s)\sum_{k=1}^{L}(x_k^s - w_{jk})^2. \qquad (7)$$

As information becomes larger, specific pairs of input patterns and competitive units become strongly correlated. Differentiating information with respect to the input-competitive connections $w_{jk}$, we have

$$\Delta w_{jk} = -\alpha \sum_{s=1}^{S}\left(\log p(j) - \sum_{m=1}^{M} p(m\mid s)\log p(m)\right) Q_{jk}^s + \beta \sum_{s=1}^{S}\left(\log p(j\mid s) - \sum_{m=1}^{M} p(m\mid s)\log p(m\mid s)\right) Q_{jk}^s + \gamma \sum_{s=1}^{S} p(j\mid s)\,(x_k^s - w_{jk}), \qquad (8)$$

where $\alpha$, $\beta$ and $\gamma$ are the learning parameters, and

$$Q_{jk}^s = \frac{(x_k^s - w_{jk})\,p(j\mid s)}{\sigma^2}. \qquad (9)$$

D. Error Minimization

In the output layer, errors between targets and outputs are minimized. The outputs from the output layer are computed by

$$O_i^s = \sum_{j=1}^{M} W_{ij}\,p(j\mid s), \qquad (10)$$

where $W_{ij}$ denotes a connection weight from the $j$th competitive unit to the $i$th output unit. The errors between targets and outputs can be computed by

$$E = \frac{1}{2}\sum_{s=1}^{S}\sum_{i=1}^{N}\left(T_i^s - O_i^s\right)^2, \qquad (11)$$

where $T_i^s$ denotes the target for output unit $O_i^s$ and $N$ is the number of output units.
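The two stages can be sketched in NumPy: one cost-sensitive weight update following the structure of Eqs. (8)–(9), and the pseudo-inverse solution of the output layer of Section II-D. This is our reading of the update rule (the sign conventions and the constant in $Q_{jk}^s$ are assumptions), not the author's code:

```python
import numpy as np

def update_weights(X, W, sigma, alpha, beta, gamma):
    """One batch update of competitive weights, a sketch of Eqs. (8)-(9).
    alpha, beta, gamma are the learning parameters of the text."""
    diff = X[:, None, :] - W[None, :, :]                      # (S, M, L): x_k^s - w_jk
    v = np.exp(-(diff ** 2).sum(axis=2) / (2 * sigma ** 2))   # Eq. (2)
    p_js = v / v.sum(axis=1, keepdims=True)                   # Eq. (3)
    p_j = p_js.mean(axis=0)                                   # Eq. (4)
    Q = diff * p_js[:, :, None] / sigma ** 2                  # Eq. (9)
    # entropy term: log p(j) - sum_m p(m|s) log p(m)
    ent = np.log(p_j)[None, :] - (p_js * np.log(p_j)[None, :]).sum(axis=1, keepdims=True)
    # conditional-entropy term: log p(j|s) - sum_m p(m|s) log p(m|s)
    cond = np.log(p_js) - (p_js * np.log(p_js)).sum(axis=1, keepdims=True)
    dW = (-alpha * ent[:, :, None] * Q
          + beta * cond[:, :, None] * Q
          + gamma * p_js[:, :, None] * diff).sum(axis=0)      # Eq. (8)
    return W + dW

def solve_output_layer(X, W, sigma, T):
    """Least-squares output weights via the pseudo-inverse, W_out = P+ T,
    where P holds the normalized competitive activations p(j|s)."""
    diff = X[:, None, :] - W[None, :, :]
    v = np.exp(-(diff ** 2).sum(axis=2) / (2 * sigma ** 2))
    P = v / v.sum(axis=1, keepdims=True)                      # (S, M)
    return np.linalg.pinv(P) @ T                              # (M, N)
```

The output layer is solved in one shot rather than by gradient descent, which is the speed advantage over counter-propagation claimed in the introduction.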
This linear equation is directly solved by using the pseudo-inverse of the matrix of competitive unit outputs. Following standard matrix notation, we suppose that $W$ and $T$ denote the matrices of connection weights and targets, and $P^{\dagger}$ denotes the pseudo-inverse of the matrix of competitive unit activations. Then, we can obtain the final connection weights by $W = P^{\dagger}T$.

III. EXPERIMENT NO. 1: ARTIFICIAL DATA PROBLEM

In this section, we try to show how the cost changes the final representations obtained by information maximization. The first example is a classification problem in which six patterns must be classified into three classes, as shown on the right-hand side of Figure 2.

Fig. 2. Information and cost as functions of the number of epochs, and connection weights: (a) γ = 0 (pure information maximization), (b) γ = .5, (c) γ = .5 and (d) γ = .5.

Figure 2(a) shows the final results obtained by pure information maximization. As information is increased, the associated cost is also increased. Though the final connection weights classify the input patterns into three classes, the weights are away from the input patterns. When γ is increased to .5, the cost is slightly decreased, and the final connection weights are also slightly closer to the input patterns. When γ is further
increased to .5, the cost is clearly decreased, and the final connection weights are much closer to the input patterns. Finally, when γ is .5, information is immediately increased to almost its maximum point, and then fluctuates in the later stages of learning. The cost is decreased significantly to a smaller point, and the connection weights are located in the middle of each class. These results seem to show that cost-sensitive information maximization is much better than pure information maximization. However, the next example will show more explicitly the utility of the cost as well as of information maximization.

The second example concerns artificial data composed of common and distinctive features. We can see the five input patterns given to the network in Figure 4(e). As shown in the figure, the patterns are composed of distinctive features (horizontal lines) and common features (vertical lines). Figure 3(a) shows information and cost as functions of the number of epochs. Information is rapidly increased to a maximum point within a small number of epochs, but the associated cost decreases more slowly than in the other cases. As the parameter γ is increased (Figure 3(b) to Figure 3(e)), information is increased more slowly, but the cost is decreased more rapidly. Figure 4 shows the connection weights obtained by changing the parameter γ. When the parameter γ is zero, distinctive features can be obtained (Figure 4(a)). As the parameter is increased, the networks tend to capture the input patterns themselves. Thus, by changing the parameter γ, the properties of the obtained features can be flexibly controlled.

IV. EXPERIMENT NO. 2: LONG-TERM INTEREST RATE AND YEN RATE PREDICTION

In this problem, we predict Japanese long-term interest rates and yen-to-dollar rates between 1990 and 2001. Figure 5 shows a network architecture for the prediction. The number of input units is six, representing the previous six months' rates, and the number of output units is one, representing the rate at the current step.
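The input/target construction just described can be reproduced with a simple sliding window; the function name and the toy monthly series below are ours, not the paper's data:

```python
import numpy as np

def make_windows(series, n_inputs=6):
    """Build (X, T) pairs: each row of X holds n_inputs consecutive
    monthly rates, and T holds the rate for the following month."""
    X = np.array([series[t:t + n_inputs]
                  for t in range(len(series) - n_inputs)])
    T = np.array(series[n_inputs:])
    return X, T

rates = [1.0, 1.1, 1.3, 1.2, 1.4, 1.5, 1.7, 1.6]  # toy monthly series
X, T = make_windows(rates)
# X.shape == (2, 6); the first target is rates[6]
```

Each row of `X` then feeds the six input units of Figure 5, and the corresponding entry of `T` is the target of the single output unit.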
The numbers of input and competitive units were experimentally determined so as to maximize prediction performance. For example, even if we took into account more than the previous six months' rates, no improvement could be seen. We also reduced the number of training patterns as much as possible. Through extensive experiments, we found the minimum number of input patterns required to estimate the rates by our method. Even if we increased the number of training patterns, we could not obtain better performance. On the other hand, if we decreased the number of training patterns below this minimum, performance degraded significantly.

Figure 6(a) shows the original long-term interest rates during 1990–2001. When the parameter γ is .5, the networks could predict the long-term interest rates quite well, as shown in Figure 6(b). When the parameter γ is decreased to zero (Figure 6(c)), that is, when pure information maximization is used, some fluctuations can be seen. However, we can say that the networks still predict the long-term interest rates well. On the other hand, with two different kinds of radial-basis function networks (Figure 6(d) and (e)), the networks failed to predict the rates. (The parameters α and β were fixed in all experiments.)

Fig. 3. Information and cost as functions of the number of epochs for five different values of the parameter γ: (a) γ = 0 and (b)–(e) four increasing values of γ.

Figure 7(a) and (b) show the relations between targets and actual outputs when the parameter γ is .5 and zero. The regression lines are close to the line on which outputs become equal to targets. However, a slightly better regression line is obtained by the cost-sensitive method. Figure 7(c) and (d) show the regression lines by the exact and the incremental radial-basis function networks. These regression lines are far from the target line.

Figure 8(a) shows the original yen rates during 1990–2001. When the parameter γ is .5, the networks could predict the yen rates quite well, as shown in Figure 8(b).
When the parameter γ is decreased to zero (Figure 8(c)), the networks still predict the yen rates well, though with some fluctuations. On the other hand, with the radial-basis function networks (Figure 8(d) and (e)), the networks failed to predict the rates. Figure 9(a) and (b) show the relations between targets and actual outputs when the parameter γ is .5 and zero. Though both networks can predict the targets well, the network with the cost-sensitive method shows
slightly better performance. Figure 9(c) shows the regression line by the exact radial-basis function network. This regression line is far from the target line. Figure 9(d) shows the regression line by the incremental radial-basis function network. This regression line is close to the target line, but the variance of the outputs is larger.

Fig. 4. Connection weights for five different parameter values: (a) γ = 0 and (b)–(e) four increasing values of γ.

Fig. 5. A network for predicting long-term interest rates and yen rates: the previous six months' rates $x_1,\dots,x_N$ ($N = 6$) feed the competitive layer with weights $w_{jk}$; the normalization layer produces $p(j\mid s)$; and the output unit with weights $W_j$ produces the prediction $O$ with target $T = x_{N+1}$.

Fig. 6. Original long-term interest rates (a); estimated rates by the new method with γ = .5 (b) and γ = 0 (c); by the conventional radial-basis network (exact method, σ = .5) (d); and by the conventional radial-basis method with an incremental method (e).

V. CONCLUSION

In this paper, we have tried to control information-theoretic competitive learning by introducing an associated cost. Our network has a competitive layer, a normalization layer and an output layer. In the competitive layer, information is increased to realize competitive processes. In the normalization layer, competitive unit outputs are normalized to produce probabilities. Then, in the output layer, errors between targets and outputs are minimized by using the least squares method. In the paradigm of competitive learning, this is a hybrid model in which unsupervised and supervised learning are combined with each other, which is close to the counter-propagation network. The difference is that in our method competitive unit
outputs are computed by using the Gaussian function, and in the output layer the least squares method is used. Thus, in the paradigm of radial-basis function networks, this is a new approach in which the centers of classes are determined for the radial-basis functions. We have applied competitive learning with Gaussian functions to an artificial data problem and to the prediction of long-term interest rates and yen rates. In these problems, we have shown that the new method can predict future rates with a small number of past data. Finally, though some problems remain unsolved, I think that the approach outlined here is a step toward a new information-theoretic approach to neurocomputing.

Fig. 7. Relations between targets and outputs by four methods: (a) γ = .5, (b) γ = 0, (c) the exact and (d) the incremental radial-basis function network.

REFERENCES

[1] R. Kamimura, T. Kamimura, and T. R. Shultz, "Information theoretic competitive learning and linguistic rule acquisition," Transactions of the Japanese Society for Artificial Intelligence, vol. 16, no. 2, pp. 287–298, 2001.
[2] R. Kamimura, T. Kamimura, and O. Uchida, "Flexible feature discovery and structural information," Connection Science, vol. 13, no. 4, pp. 323–347, 2001.
[3] R. Kamimura, T. Kamimura, and H. Takeuchi, "Greedy information acquisition algorithm: A new information theoretic approach to dynamic information acquisition in neural networks," Connection Science, vol. 14, no. 2, pp. 137–162, 2002.
[4] R. Kamimura, "Progressive feature extraction by greedy network-growing algorithm," Complex Systems, vol. 14, no. 2, pp. 127–153, 2003.
[5] D. E. Rumelhart and J. L. McClelland, "On learning the past tenses of English verbs," in Parallel Distributed Processing (D. E. Rumelhart, G. E. Hinton, and R. J. Williams, eds.), vol. 2, pp. 216–271, Cambridge: MIT Press, 1986.
[6] S. Grossberg, "Competitive learning: from interactive activation to adaptive resonance," Cognitive Science, vol. 11, pp. 23–63, 1987.
[7] D.
DeSieno, "Adding a conscience to competitive learning," in Proceedings of the IEEE International Conference on Neural Networks (San Diego), pp. 117–124, IEEE, 1988.

Fig. 8. Original yen rates (a); estimated rates by the new method with γ = .5 (b) and γ = 0 (c); by the conventional radial-basis network (exact method) (d); and by the conventional incremental method (e).

[8] S. C. Ahalt, A. K. Krishnamurthy, P. Chen, and D. E. Melton, "Competitive learning algorithms for vector quantization," Neural Networks, vol. 3, pp. 277–290, 1990.
[9] L. Xu, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural Networks, vol. 4, no. 4, pp. 636–649, 1993.
[10] A. Luk and S. Lien, "Properties of the generalized lotto-type competitive learning," in Proceedings of the International Conference on Neural Information Processing (San Mateo, CA), pp. 1180–1185, Morgan Kaufmann Publishers, 2000.
[11] M. M. V. Hulle, "The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals," Neural Computation, vol. 9, no. 3, pp. 595–606, 1997.
[12] R. Kamimura and S. Nakanishi, "Improving generalization performance by information minimization," IEICE Transactions on Information and Systems, vol. E78-D, no. 2, pp. 163–173, 1995.
[13] R. Kamimura and S. Nakanishi, "Hidden information maximization for feature detection and rule discovery," Network, vol. 6, pp. 577–602, 1995.
[14] R. Kamimura, "Minimizing α-information for generalization and interpretation," Algorithmica, vol. 22, pp. 173–197, 1998.
[15] R. Kamimura, "Information-theoretic competitive learning with inverse Euclidean distance," to appear in Neural Processing Letters, 2003.
[16] D. E. Rumelhart and D. Zipser, "Feature discovery by competitive learning," Cognitive Science, vol. 9, pp. 75–112, 1985.
[17] R. Hecht-Nielsen, "Counterpropagation networks," Applied Optics, vol. 26, pp. 4979–4984, 1987.
Fig. 9. Relations between targets and outputs by four methods: (a) γ = .5, (b) γ = 0, (c) the exact and (d) the incremental radial-basis function network.

[18] R. Hecht-Nielsen, "Applications of counterpropagation networks," Neural Networks, vol. 1, no. 2, pp. 131–139, 1988.
[19] M. Novic and J. Zupan, "Investigation of infrared spectra-structure correlation using Kohonen and counterpropagation neural networks," Journal of Chemical Information and Computer Sciences, vol. 35, pp. 454–466, May–June 1995.
[20] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, no. 2, pp. 281–294, 1989.
[21] D. F. Specht, "A general regression neural network," IEEE Transactions on Neural Networks, vol. 2, pp. 568–576, 1991.
[22] P. Burrascano, "Learning vector quantization for the probabilistic neural network," IEEE Transactions on Neural Networks, vol. 2, pp. 458–461, 1991.
[23] D. F. Specht, "A general regression neural network," IEEE Transactions on Neural Networks, vol. 2, pp. 568–576, Nov. 1991.
[24] D. Lowe, "Radial basis function networks," in The Handbook of Brain Theory and Neural Networks (M. A. Arbib, ed.), pp. 779–783, Cambridge, Massachusetts: MIT Press, 1995.