TEMPORAL DIFFERENCE LEARNING IN CHINESE CHESS


Thong B. Trinh, Anwer S. Bashi, Nikhil Deshpande
Department of Electrical Engineering, University of New Orleans, New Orleans, LA 70148
Tel: (504) 280-7383  Fax: (504) 280-3950
E-Mail: tbtrinh@uno.edu, asbee@uno.edu, n9dee@uno.edu

Abstract

Reinforcement learning, in general, has not been fully successful at solving complex real-world problems that can be described by nonlinear functions. However, temporal difference learning is a type of reinforcement learning algorithm that has been researched and applied to various prediction problems with promising results. This paper discusses the application of temporal difference learning to the training of a neural network that plays a scaled-down version of the board game Chinese Chess. Preliminary results show that the technique is capable of producing the desired behavior. In test cases where only minimal elements of the game are present, the network responds favorably; when more complexity is introduced, the network does not perform as well, but still generally produces reasonable results. These results indicate that temporal difference learning has the potential to address real-world problems of equal or greater complexity. Continuing research in the application of neural networks to complex strategic games will most likely lead to more responsive and accurate systems in the future.

Introduction

Neural networks have been widely used to solve various problems, mostly in classification, pattern recognition, and function approximation. However, they have not been researched as extensively in the science of games. This shortcoming may be due to two reasons: games may be considered trivial subjects not worthy of serious research, and neural networks have not yet evolved into a precise science. Thus, complex problems such as those presented by strategic games cannot simply be fed into a generic neural network with the expectation of reasonable results. Many areas of neural networks have yet to be discovered and explored before such complex problems can be handled adequately, let alone solved. Artificial intelligence (AI) today is just that -- artificial -- and is capable of solving only a fraction of what genuine (human) intelligence can solve. Therein lies the justification for designing neural networks to play games and solve puzzles. Though such work may not seem to yield obvious applications in other fields, new learning methods developed through this research may be applied to similar real-life problems, and new insights may be gained.

Chinese Chess

Not much research has been done, or at least published, in the field of strategic learning in games. The only games that have been successfully implemented via neural networks are Checkers and Backgammon [1]. In fact, the neural network engine used is able to learn the game from zero knowledge merely by self-play and by learning from observed consequences. Chess and other strategic board games have been deemed too complex to be fully implemented via neural networks, and so progress in that field has been hindered. Some have even discounted neural networks as ever being able to learn such games well enough to challenge human master-level players. However, the engine that has proven successful at a board game, namely Backgammon, is built on a learning algorithm called temporal difference. This method seems versatile enough that it may hold promise for strategic games in general, and it is the learning method used in this project to train a neural network to play the Asian variant of the popular European game of Chess -- Chinese Chess.

Game description

Chinese Chess is played in many Asian countries and is very similar to Chess, having many similar pieces with similar moves. A brief description of the game is necessary for understanding the scope of this project. The actual game contains other pieces that this report does not address, and the rules governing the game are far more detailed than this project will cover. Developing a neural network to play the full game of Chinese Chess would obviously require many more years of research. Thus, only a scaled-down version of the game is implemented in order to make the problem more manageable: only half of the actual playing field is used, with a few selected playing pieces. The game consists of the battleground, or playing field, represented by square boxes as in Chess (see Figure 1).

Figure 1. Chinese Chess Board and Pieces

However, the pieces are positioned on the intersections, not within the boxes. The object of this two-player game is to capture the opponent's Chief (or King) by gradually penetrating the opponent's defenses and eliminating his pieces. The two opposing sides initially have the same playing pieces but in different colors, usually red and blue. The pieces relevant to this project are the Chief, who is confined to one-block movements within a central square area called the Palace, and his resources: soldiers, cannons, horses and chariots. Soldiers are restricted to one-block forward progressions. Chariots are free to slide in vertical or horizontal directions with no limit on distance. Cannons move the same way as chariots, but can only eliminate an opponent's piece by jumping over another piece, as in Checkers. Horses move in L-shaped patterns, as knights do in Chess.
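To illustrate how these movement rules could be encoded, the following is a minimal sketch; the coordinate system, the small 4x4 half-board (matching the end-game tests described later), the assumed palace location, and all names are illustrative assumptions rather than the authors' implementation.

# Illustrative encoding of the movement rules described above (not the
# authors' implementation). Coordinates are (row, col) intersections on a
# small half-board; the board size and palace squares are assumptions.
ROWS, COLS = 4, 4
PALACE = {(r, c) for r in (0, 1) for c in (1, 2)}   # assumed palace squares

STEP_OFFSETS = {
    "chief":   [(1, 0), (-1, 0), (0, 1), (0, -1)],           # one block, inside the palace
    "soldier": [(1, 0)],                                      # one block forward (for red)
    "horse":   [(2, 1), (2, -1), (-2, 1), (-2, -1),
                (1, 2), (1, -2), (-1, 2), (-1, -2)],          # L-shaped moves
}
# Chariots slide any distance along a rank or file; cannons slide the same
# way but may only capture by jumping over exactly one intervening piece.

def step_moves(piece, r, c):
    """Destinations reachable by a step-moving piece, ignoring other pieces."""
    moves = []
    for dr, dc in STEP_OFFSETS[piece]:
        nr, nc = r + dr, c + dc
        if not (0 <= nr < ROWS and 0 <= nc < COLS):
            continue
        if piece == "chief" and (nr, nc) not in PALACE:
            continue                                          # the Chief stays in the palace
        moves.append((nr, nc))
    return moves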

Game Design

To implement this game, the developer must consider which parameters of the game are necessary as inputs to the network. First, the board must be represented. Positions of each piece on the board must also be represented. Finally, the system must consider the possible moves of each player's pieces, evaluate them, and decide on a best move. Neural networks can be designed to perform most of these operations. The board layout and positions of playing pieces are easily and distinctly represented by numbers, as is done in some chess board representations [2]. For example, all positions on the board are represented by a matrix of the same dimensions as the playing field. Unique numbers (weighting the value of a certain piece) are assigned to each distinct playing piece, and opposing sides are differentiated by the sign of the value [3]. The number +5 may represent the Chief on one side, whereas -5 designates the Chief on the other. In order to consider the possible moves of each playing piece on the game board, an extensive search must be applied. As neural networks have not been specifically designed to perform searches, this part of the project requires conventional software to generate the possible moves for each piece. The details of how a computer program performs such searches are beyond the scope of this paper, but the concept is basically a tree that branches out, indicating the possible sequential positions of each movable piece [3]. Thus, the moves in this game are updated board layouts, each represented by a matrix, and these serve as inputs to the network. The neural network's purpose in this project is to evaluate the possible moves and determine the best one, the one that would contribute most to ultimately winning the game.
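To make the representation concrete, the following is a minimal sketch (in Python with NumPy) of a signed-matrix board and the generation of successor boards for one sliding piece; the 4x4 field size, the specific starting squares, and the function names are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the signed-matrix board representation and search-based
# move generation described above. Board size, piece placement, and names
# are illustrative assumptions.
import numpy as np

EMPTY, SOLDIER, CANNON, HORSE, CHARIOT, CHIEF = 0, 1, 2, 3, 4, 5

def initial_board(rows=4, cols=4):
    """One side's pieces are positive, the other's negative; empty points are 0."""
    board = np.zeros((rows, cols), dtype=int)
    board[0, 1] = +CHIEF      # red Chief (arbitrary illustrative square)
    board[3, 2] = -CHIEF      # blue Chief
    board[2, 0] = +CHARIOT    # a red chariot, for illustration
    return board

def chariot_moves(board, r, c):
    """Slide the chariot at (r, c) along ranks and files until blocked.

    Returns the successor board matrices that would be fed to the network.
    """
    side = np.sign(board[r, c])
    successors = []
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        while 0 <= nr < board.shape[0] and 0 <= nc < board.shape[1]:
            if np.sign(board[nr, nc]) == side:      # blocked by a friendly piece
                break
            nxt = board.copy()
            nxt[r, c], nxt[nr, nc] = 0, board[r, c]
            successors.append(nxt)
            if board[nr, nc] != 0:                  # a capture ends the slide
                break
            nr, nc = nr + dr, nc + dc
    return successors

Each successor matrix returned by chariot_moves corresponds to one candidate move and would be flattened and presented to the network for evaluation.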

Learning Algorithm

The inputs to the network are the positions of the playing pieces. Pieces are weighted by value and assigned a number, as shown in Table 1.

Table 1. Weighted Value of Playing Pieces

    Playing Piece    Red Side    Blue Side
    Soldier             +1          -1
    Cannon              +2          -2
    Horse               +3          -3
    Chariot             +4          -4
    Chief               +5          -5

The output of the network is a single value indicating the evaluation of each possible move. For each possible move, the network is fed an updated matrix board representation indicating the new position of a piece. With this information, the network attempts to evaluate the possible moves (generated by a computer search program) and suggest the best one. Learning is achieved via a method similar to backpropagation: temporal difference. Unsupervised learning algorithms allow a neural network to discover the underlying structure of a complex problem or function that would otherwise be indecipherable [4]. Temporal difference learning is an unsupervised reinforcement learning method for increasing the accuracy of predictions by recursively learning a prediction function via feedback [5]. Weights are updated from the difference between temporally successive predictions rather than the difference between the predicted and actual outputs, as is the case for most supervised learning algorithms [6]. Random events that appear to have no conceivable pattern can often be described approximately by some function, and an unsupervised neural network may learn to discover such a complex pattern or function. In this paper, temporal difference learning attempts to assign credit to each move by taking the difference between temporally successive predictions of the network. The usual supervised backpropagation approach feeds back the error calculated from the difference between the current network output and the desired target. Temporal difference incorporates a slight modification: it takes the difference between the two most recent successive predictions of the network and feeds that back to adjust the weights so as to better match the desired output. The error can be thought of as a sum of changes in the predictions. Thus, the weights are updated according to

    w_{final} = w_{initial} + \sum_{t=1}^{n} \Delta w_t

where w is the weight vector, updated by the sum of the weight changes \Delta w_t over the n observations (time steps) of a game. The weight change is traditionally computed by

    \Delta w_t = \alpha (z - P_t) \nabla_w P_t

where \alpha is the learning rate and \nabla_w P_t is the gradient, i.e. the vector of partial derivatives of the prediction P_t at time t with respect to the weights w. The term (z - P_t) represents the error, the difference between the predicted outcome and the actual outcome z. In temporal difference learning, this equation is modified so that the error is expressed as a sum of differences between temporally successive predictions,

    z - P_t = \sum_{i=t}^{n} (P_{i+1} - P_i), with P_{n+1} defined to be z,

and the weight-change equation becomes

    \Delta w_t = \alpha (P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k

in which the change in weights depends on the most recent pair of successive predictions and on the sum of the gradients of all previous predictions [4].
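As a minimal sketch of this update rule, assume a single layer of weights w feeding a hyperbolic tangent output (the transfer function described in the next paragraph); the function names, the hand-computed gradient, and the learning rate are assumptions made for illustration, not the authors' code.

# Minimal sketch of the temporal difference weight update above, for a
# single weight vector feeding a tanh output unit. Names, the feature
# encoding, and hyperparameters are illustrative assumptions.
import numpy as np

def predict(w, x):
    """Network prediction P_t in (-1, +1) for a flattened board vector x."""
    return np.tanh(w @ x)

def grad(w, x):
    """Gradient of the prediction with respect to the weights w."""
    return (1.0 - np.tanh(w @ x) ** 2) * x   # d tanh(w.x)/dw = (1 - tanh^2) * x

def td_update(w, boards, z, alpha=0.05):
    """Apply the TD update over one game.

    boards : list of flattened board vectors x_1 .. x_n (one per move)
    z      : final outcome of the game, +1 (red win) or -1 (blue win)
    """
    grad_sum = np.zeros_like(w)               # running sum of past gradients
    for t in range(len(boards)):
        grad_sum += grad(w, boards[t])
        p_t = predict(w, boards[t])
        # P_{n+1} is defined to be the actual outcome z
        p_next = z if t == len(boards) - 1 else predict(w, boards[t + 1])
        w = w + alpha * (p_next - p_t) * grad_sum
    return w

Repeated application of this update over recorded or self-played game sequences drives the move-by-move predictions toward the eventual game outcome, which is the behavior exploited in the testing described below.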

The transfer function of the neural network is chosen to be a hyperbolic tangent sigmoid, so the output is a single value ranging from -1 to +1. Negative values indicate that the blue side has won, is winning, or is making good moves; positive values indicate the same for the red side.

Implementation and Testing

The learning algorithm above is programmed into a neural network. Inputs to the network are possible moves (represented by changes in the game's board matrix), and the output indicates what the network predicts to be the "value" of each move, good or bad (i.e., whether it would contribute to the ultimate goal of reaching +1 or -1 for a game victory). Several cases (move sequences from different games) are presented to train the neural network from zero knowledge. To begin training at a rudimentary level, only a few pieces are used in these game sequences. Initial random weights are adjusted by the algorithm so that the pattern of predicted outcomes for each move matches the final outcome of the game (the ultimate network output). After the network has been trained and the weights adjusted, test cases are fed into the network. When the network is fed the training set, its predicted output correctly matches the actual outcome of the game. When new game sequences are input, the network is able to correctly predict the outcome of the game most of the time. These preliminary results indicate that the network is able to learn the game and is capable of evaluating each move's contribution to the final outcome of the game.
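The evaluation-and-selection step described above might be sketched as follows, assuming the same hypothetical tanh evaluator as in the earlier examples and the convention that red prefers the most positive output and blue the most negative.

# Illustrative move selection: feed each candidate successor board to the
# network and keep the one it rates best for the side to move. The tanh
# evaluator matches the earlier sketch; names are assumptions.
import numpy as np

def predict(w, x):
    """Single tanh output in (-1, +1); same form as the earlier sketch."""
    return np.tanh(w @ x)

def best_move(w, successor_boards, red_to_move=True):
    """Return (board, score) for the successor the network rates best."""
    scores = [predict(w, b.flatten()) for b in successor_boards]
    idx = int(np.argmax(scores)) if red_to_move else int(np.argmin(scores))
    return successor_boards[idx], scores[idx]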

The following are examples of tests performed on the neural network. In an end-game case involving only a horse and chief on a 4x4 playing field, the neural-network-controlled horse could check the opponent's chief with either of two moves. For these two scenarios, the neural network indicated that the moves are equally good, since they ultimately lead to the same result. Good moves are indicated by output values increasing towards positive unity for the neural-network-controlled side; good moves for the opponent are indicated by increasingly negative output values. The network was also tested for accuracy: when the neural-network-controlled horse placed the opponent's chief in check but did not follow through by eliminating the Chief on the next move, the network detected this as a weak move and the output value correspondingly decreased. This confirmed that the network was correctly responding to the positions of the players' pieces.

To further test and confirm proper behavior of the network, it was forced to evaluate bad moves: the neural-network-controlled pieces were presented to the network as moving into positions that would favor the opponent. The network again yielded correct results, since the output values became increasingly negative (favoring the opponent) as the bad moves were made.

Finally, the network was asked to evaluate a complete sequence of moves of the scaled-down version of the game. Since this involved a larger playing field and more pieces than the test cases, the network did not function consistently, although the results indicate that it generally functions correctly. In some cases where the opposing side made a good move, the network deemed it a good move for the other side. In most other cases, however, the network correctly evaluated the moves as favoring one side or the other through the shifting output values. Of course, what constitutes a good move is somewhat subjective, for an artificial intelligence such as this neural network as much as for a human. The network could be trained with more game sequences, perhaps thousands more, to improve its accuracy in evaluating each move's contribution to the end-game result. A further advantage of temporal difference learning is that it is capable of learning by self-play, since prediction errors are fed back and the weights are adjusted to match the actual outcome.

Conclusion

A neural network trained with an effective learning algorithm such as the temporal difference method has been shown to be capable of playing complex games. The temporal difference algorithm used to train the network in this project shows promise for further development in this area. In the most basic setup, the network responds almost flawlessly. With more complexity introduced, the network sometimes evaluates a particular move incorrectly, but with additional training the network's outputs should improve. The results of this paper support the use of the temporal difference method as a potentially robust way for neural networks to solve complex problems. This learning algorithm is a promising starting point for similar developments in other fields involving comparable levels of complexity, and the method may be refined and built upon with further research.

References

[1] S. Thrun. Learning to Play the Game of Chess. Advances in Neural Information Processing Systems, 7:1069-1076, 1995.

[2] M. Schmidt. Temporal Difference Learning and Chess. Technical Report, Aarhus University, Computer Science Department, June 20, 1994.

[3] M. Schmidt. Neural Networks and Chess. Thesis, Aarhus University, Computer Science Department, July 19, 1993.

[4] C.L. Isbell. Explorations of the Practical Issues of Learning Prediction Control Tasks Using Temporal Difference Learning Methods. Master's Thesis, Massachusetts Institute of Technology, December 1992.

[5] G. Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

[6] R.S. Sutton. On Step-Size and Bias in Temporal-Difference Learning. Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pp. 91-96, 1994.