TEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS

Thong B. Trinh, Anwer S. Bashi, Nikhil Deshpande
Department of Electrical Engineering
University of New Orleans
New Orleans, LA 70148
Tel: (504) 280-7383  Fax: (504) 280-3950
E-Mail: tbtrinh@uno.edu, asbee@uno.edu, n9dee@uno.edu

Abstract

Reinforcement learning, in general, has not been fully successful at solving complex real-world problems that can be described by nonlinear functions. However, temporal difference learning is a type of reinforcement learning algorithm that has been researched and applied to various prediction problems with promising results. This paper discusses the application of temporal difference learning to the training of a neural network that plays a scaled-down version of the board game Chinese Chess. Preliminary results show that this technique is capable of producing the desired behavior. In test cases where only minimal elements of the game are presented, the network responds favorably. When more complexity is introduced, the network does not function as well, but it still generally produces reasonable results. These results indicate that temporal difference learning has the potential to solve real-world problems of equal or greater complexity. Continuing research in the application of neural networks to complex strategic games will most likely lead to more responsive and accurate systems in the future.

Introduction

Neural networks have been widely implemented for solving various problems, mostly classification, pattern recognition, and function approximation. However, they have not been researched as extensively in the science of games. This shortcoming may be due to two reasons: games may be considered trivial subjects not worthy of serious research, and neural networks have not yet evolved into a precise science. Thus, complex problems such as those presented by strategic games cannot simply be fed into a generic neural network with the expectation of reasonable results.
There are many areas of neural networks that have yet to be discovered and explored before they can sufficiently handle, or even solve, complex problems. Artificial intelligence (AI) today is just that -- artificial -- and only capable of solving a fraction of what genuine (human) intelligence can solve. Therein lies the justification for designing neural networks to play games and solve puzzles. Though they may not seem to yield any obvious applications in other fields, through this research new learning methods may be developed, and new insights gained, that can be applied to similar real-life problems.
Chinese Chess

Not much research has been done, or at least published, in the field of strategic learning in games. The only games that have been successfully implemented via neural networks are Checkers and Backgammon [1]. In fact, the neural network engine used is able to learn the game from zero knowledge merely by self-play, learning from the observed consequences of its moves. Chess and other strategic board games have been deemed too complex to be fully implemented via neural networks, and progress in that field has been hindered as a result. Some have even discounted neural networks entirely as ever being able to learn such a game well enough to challenge human master-level players. However, the engine that has proven successful in a board game, namely Backgammon, uses a learning algorithm called temporal difference. This method seems versatile enough to hold promise for other strategic games in general, and it is the learning method used in this project to train a neural network to play the Asian variant of the popular European game of Chess -- Chinese Chess.

Game Description

Chinese Chess is played in many Asian countries and is very similar to Chess, having many similar pieces with similar moves. A brief description of the game is necessary for understanding the scope of this project. The actual game contains other pieces that this report does not address, and the rules governing the game are far more detailed than this project covers. Developing a neural network to play the full game of Chinese Chess would obviously require many more years of research. Thus, only a scaled-down version of the game is implemented in order to make the problem more manageable: only half of the actual playing field is used, with a few selected playing pieces. The game consists of the battleground, or playing field, represented by square boxes as in Chess (see Figure 1).

Figure 1. Chinese Chess Board and Pieces
However, the pieces are positioned on the intersections, not within the boxes. The object of this two-player game is to capture the opponent's Chief (or King) by gradually penetrating the opponent's defenses and eliminating the opponent's pieces. The two opposing sides initially have the same playing pieces but in different colors, usually red and blue. The pieces relevant to this project are the Chief, who is confined to one-block movements within a central square area called the Palace, and his resources: soldiers, cannons, horses, and chariots. Soldiers are restricted to one-block forward progressions. Chariots are free to slide in vertical or horizontal directions with no limit on distance. Cannons move the same way as chariots, but can only eliminate an opponent's piece by jumping over another piece, as in Checkers. Horses move in L-shaped patterns, as knights do in Chess.

Game Design

To implement this game, the developer must consider which parameters of the game are necessary as inputs to the network. First, the board must be represented. Positions of each piece on the board must also be represented. Finally, the system must consider the possible moves of each player's pieces, evaluate them, and decide on a best move. Neural networks can be designed to perform most of these operations. The board layout and positions of the playing pieces are easily and distinctly represented by numbers, as is done in some chess board representations [2]. For example, all positions on the board are represented by a matrix of the same dimension as the playing field. Unique numbers (weighing the value of a certain piece) are assigned to each distinct playing piece, and opposing sides are differentiated by the sign of the value [3]. For example, +5 may represent the Chief on one side, whereas -5 designates the Chief on the other side.
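As an illustration of this signed-matrix representation (a sketch, not the authors' code; the board dimensions, piece values, and helper names here are assumptions), the encoding might look like:

```python
import numpy as np

# Hypothetical piece codes for the scaled-down game: the sign encodes the
# side (+ for red, - for blue), the magnitude encodes the piece's weighted value.
PIECE_VALUES = {"soldier": 1, "cannon": 2, "horse": 3, "chariot": 4, "chief": 5}

def empty_board(rows=5, cols=4):
    """A matrix of zeros with the same dimensions as the (scaled-down) field."""
    return np.zeros((rows, cols), dtype=int)

def place(board, piece, side, row, col):
    """Place a piece on an intersection; red is positive, blue is negative."""
    sign = 1 if side == "red" else -1
    board[row, col] = sign * PIECE_VALUES[piece]
    return board

board = empty_board()
place(board, "chief", "red", 0, 1)    # stored as +5
place(board, "chief", "blue", 4, 2)   # stored as -5
place(board, "horse", "red", 2, 0)    # stored as +3
```

A move is then simply an updated copy of this matrix with one entry relocated, which matches the paper's description of moves as sequential board layouts.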
In order to consider the possible moves of each playing piece on the game board, an extensive search must be applied. As neural networks have not been specifically designed to perform searches, this part of the project requires computer software to generate the possible moves for each piece. The details of how a computer program performs such searches are beyond the scope of this paper, but the concept is basically a tree that branches out to indicate the possible sequential positions of each movable piece [3]. Thus, the moves in this game are updated sequential board layouts, represented by a matrix, and serve as inputs to the network. The neural network's purpose in this project is to evaluate the possible moves and determine the best move, the one that would contribute most to ultimately winning the game.

Learning Algorithm

The inputs to the network are the positions of the playing pieces. Pieces are weighted by value and assigned a number, as shown in Table 1.

Table 1. Weighted Values of Playing Pieces

    Piece      Red Side   Blue Side
    Soldier       +1          -1
    Cannon        +2          -2
    Horse         +3          -3
    Chariot       +4          -4
    Chief         +5          -5

The output of the network is a single value indicating the evaluation of each possible move. For each possible move, the network is given an updated matrix board representation indicating the new position of a piece. With this information, the network attempts to evaluate the possible moves (generated by a computer search program) and suggest the best one.

Learning is achieved via temporal difference, a method similar to backpropagation. Unsupervised learning algorithms allow a neural network to discover the underlying structure of a complex problem or function that would otherwise be indecipherable [4]. Temporal difference learning is a reinforcement learning method for increasing the accuracy of predictions by recursively learning a prediction function via feedback [5]. Weights are updated from the difference between temporally successive predictions rather than the difference between the predicted and actual outputs, as is the case for most supervised learning algorithms [6]. Random events that appear to have no conceivable pattern can often be described approximately by some function, and such a network may learn to discover this complex pattern or function. In this paper, temporal difference learning attempts to assign credit to each move by taking the difference between temporally successive predictions of the network. The usual supervised backpropagation approach feeds back the error calculated from the difference between the current network output and the desired target. Temporal difference incorporates a slight modification: it takes the difference between the two most recent successive predictions of the network and feeds that back to adjust the weights toward the desired output. The total error can be thought of as a sum of changes in the predictions.
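The credit-assignment scheme just described can be sketched for a simple linear predictor standing in for the paper's network (a minimal sketch, assuming a linear evaluation P_t = w . x_t and a learning rate of 0.1; the function name and feature encoding are illustrative, not the authors' implementation):

```python
import numpy as np

def td_update(w, positions, outcome, alpha=0.1):
    """One TD(1) pass over a single game.

    positions: list of feature vectors x_t, one per board position.  For a
    linear predictor P_t = w . x_t the gradient of P_t is simply x_t.  The
    prediction after the final move is replaced by the actual outcome z, so
    every weight change uses only temporally successive predictions.
    """
    preds = [float(w @ x) for x in positions] + [float(outcome)]
    delta = np.zeros_like(w)
    grad_sum = np.zeros_like(w)   # running sum of gradients of past predictions
    for t in range(len(positions)):
        grad_sum += positions[t]
        delta += alpha * (preds[t + 1] - preds[t]) * grad_sum
    return w + delta              # weights updated by the sum of the changes
```

Repeating such a pass over many recorded games drives the intermediate predictions toward the eventually observed outcomes, which is exactly the behavior the training procedure below relies on.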
Thus, weights are updated according to the following equation:

    w_final = w_initial + sum_{t=1}^{n} Δw_t

where w is the weight vector, updated by the sum of the weight changes Δw_t over the n observations at times t = 1, ..., n. The weight change is traditionally computed by

    Δw_t = α (z - P_t) ∇_w P_t

where α is the learning rate and ∇_w P_t is the vector of partial derivatives of the prediction P_t at time t with respect to the weights w. The term (z - P_t) represents the error, the difference between the actual outcome z and the predicted outcome. In temporal difference, however, this equation is modified so that the error is redefined as a sum of differences between temporally successive predictions,

    z - P_t = sum_{i=t}^{n} (P_{i+1} - P_i)    (with P_{n+1} defined to be z)

and the weight-change equation becomes

    Δw_t = α (P_{t+1} - P_t) sum_{k=1}^{t} ∇_w P_k

in which the change in weights depends on the most recent pair of successive predictions and on the sum of the gradients of all past predictions [4].

The transfer function of the neural network is chosen to be a hyperbolic tangent sigmoid, so the output is a single value ranging from -1 to +1. Negative values indicate that the blue side has won, is winning, or is making good moves; positive values indicate the same for the red side.

Implementation and Testing

The above learning algorithm is programmed into a neural network. Inputs to the network are possible moves (represented by changes in the game's board matrix), and outputs indicate what the network predicts to be the "value" of each move, good or bad (i.e., whether it would contribute to the ultimate goal of reaching +1 or -1 for a game victory). Several cases (move sequences from different games) are presented to train the neural network from zero knowledge. To begin training at a rudimentary level, only a few pieces are played in these game sequences. Initial random weights are adjusted by the algorithm so that the pattern of predicted outcomes of each move matches the final outcome of the game (the ultimate network output). After the network has been trained and the weights adjusted, test cases are fed into the network. When the network is fed the training set, its predicted output correctly matches the actual outcome of the game. When new game sequences are input, the neural network is able to correctly predict the outcome of the game most of the time. These preliminary results indicate that the network is able to learn the game and is capable of evaluating each move's contribution to the final outcome. Following are examples of tests performed on the neural network.
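Before turning to the test examples, the evaluation-and-selection step described in this section can be sketched as follows (random weights stand in for a trained network, and the 4x4 board size and function names are assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random weights as a stand-in for a trained evaluator; the single tanh
# output lies in (-1, +1): positive favors red, negative favors blue.
W = rng.normal(scale=0.1, size=16)   # one weight per intersection of a 4x4 board

def evaluate(board):
    """Score one candidate board (a 4x4 matrix of signed piece values)."""
    return float(np.tanh(W @ board.ravel()))

def best_move(candidates, side="red"):
    """Pick the candidate board whose evaluation is most favorable to `side`."""
    scores = [evaluate(b) for b in candidates]
    pick = max if side == "red" else min
    i = scores.index(pick(scores))
    return i, scores[i]
```

Each candidate board comes from the move-generating search program; the network itself only ranks them, which is all the test cases below require of it.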
In the end-game case involving only the horse and chief on a 4x4 playing field, the neural-network-controlled horse could check the opponent's chief by either of two moves. For the two different scenarios, the neural network indicated that they are equally good moves, since they ultimately lead to the same result. Good moves are indicated by output values increasing toward positive unity for the neural-network-controlled side; good moves for the opponent are indicated by increasingly negative output values. The network was also tested for accuracy. When the neural-network-controlled horse placed the opponent's chief in check but did not follow through by eliminating the chief on the next move, the neural network detected this as a weak move, and the output value correspondingly decreased. This confirmed that the
network was correctly responding to the positions of the players' pieces. To further test and confirm proper behavior, the neural network was forced to evaluate bad moves: the neural-network-controlled pieces were presented to the network as moving into positions that favor the opponent. The network again yielded correct results, since the output values became increasingly negative (favoring the opponent) as the bad moves were made. Finally, the network was asked to evaluate a complete move sequence of the scaled-down version of the game. Since this involved a larger playing field and more pieces than the earlier test cases, the network did not function consistently, although the results indicate that it generally functions correctly. In some cases where the opposing side made a good move, the network deemed it a good move for the other side. In most other cases, however, the network correctly evaluated the moves as favoring one side or the other through the shifting output values. Of course, "good" moves are subjective to a human player, so they can understandably be subjective to the neural network as well. The network could be trained with more game sequences, perhaps thousands more, to improve its accuracy in evaluating each move as it pertains to the end-game result. An advantage of temporal difference learning is that it is capable of learning by self-play, since prediction errors are fed back and the weights adjusted accordingly to match the actual outcome.

Conclusion

A neural network trained with an effective learning algorithm such as the temporal difference method has been shown to be capable of playing complex games. The temporal difference algorithm used to train the network in this project shows promise for further development in this area. In the most basic setup, the network responds almost flawlessly.
With more complexity introduced, the network may sometimes incorrectly evaluate the value of a particular move, but with additional training, the network's outputs are expected to improve. The results of this paper support the use of the temporal difference method as a potentially robust way for neural networks to solve complex problems. This learning algorithm is a promising starting point for similar developments in other fields involving comparable levels of complexity, and the method may be refined and built upon with further research.

References

[1] S. Thrun. Learning to Play the Game of Chess. Advances in Neural Information Processing Systems, 7:1069-1076, 1995.
[2] M. Schmidt. Temporal Difference Learning and Chess. Technical Report, Aarhus University, Computer Science Department, June 20, 1994.
[3] M. Schmidt. Neural Networks and Chess. Thesis, Aarhus University, Computer Science Department, July 19, 1993.
[4] C.L. Isbell. Explorations of the Practical Issues of Learning Prediction Control Tasks Using Temporal Difference Learning Methods. Master's Thesis, Massachusetts Institute of Technology, December 1992.
[5] G. Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
[6] R.S. Sutton. On Step-Size and Bias in Temporal-Difference Learning. Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, 1994.