TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play

NOTE (Communicated by Richard Sutton)

Gerald Tesauro
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA

Neural Computation 6, 215-219 (1994). © 1994 Massachusetts Institute of Technology.

TD-Gammon is a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results, based on the TD(λ) reinforcement learning algorithm (Sutton 1988). Despite starting from random initial weights (and hence a random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e., given only a raw description of the board state), the network learns to play at a strong intermediate level. Furthermore, when a set of hand-crafted features is added to the network's input representation, the result is a truly staggering level of performance: the latest version of TD-Gammon is now estimated to play at a strong master level that is extremely close to the world's best human players.

Reinforcement learning is a fascinating and challenging alternative to the more standard approach of training neural networks by supervised learning. Instead of training on a teacher signal indicating the correct output for every input, reinforcement learning provides less information to work with: the learner is given only a reward or reinforcement signal indicating the quality of its output. In many cases the reward is also delayed, that is, given only at the end of a long sequence of inputs and outputs. In contrast to the numerous practical successes of supervised learning, there have been relatively few successful applications of reinforcement learning to complex real-world problems.

This paper presents a case study in which the TD(λ) reinforcement learning algorithm (Sutton 1988) was applied to training a multilayer neural network on a complex task: learning strategies for the game of backgammon. This is an attractive test problem due to its considerable complexity and stochastic nature. It is also possible to make a detailed comparison of TD learning with the alternative approach of supervised training on human expert examples; this was the approach used in the development of Neurogammon, a program that convincingly won the backgammon championship at the 1989 International Computer Olympiad (Tesauro 1989).

Details of the TD backgammon learning system are described elsewhere (Tesauro 1992). In brief, the network observes a sequence of board positions x1, x2, ..., xf leading to a final reward signal z determined by the outcome of the game. (These games were played without doubling, so the network did not learn anything about doubling strategy.) The sequences of positions were generated using the network's predictions as an evaluation function: the move selected at each time step was the move that maximized the network's estimate of the expected outcome. Thus the network learned based on the outcome of self-play. This procedure of letting the network learn from its own play was used even at the very start of learning, when the network's initial weights, and hence its initial strategy, were random. From an a priori point of view, this methodology appeared unlikely to produce any sensible learning, because a random strategy is exceedingly bad, and because the games end up taking an incredibly long time: with random play on both sides, games often last several hundred or even several thousand time steps. In contrast, games in normal human play usually last on the order of 50-60 time steps.
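To make the update just described concrete, the following is a minimal, self-contained sketch of a TD(λ) weight update over one self-play game, using a simple linear value function in place of TD-Gammon's multilayer network; the feature vectors, learning rate alpha, and trace parameter lam are illustrative placeholders, not values taken from the paper.

# Sketch of a TD(lambda) update over one game, assuming a linear value
# function V(x) = w.x in place of TD-Gammon's multilayer network. Positions
# are abstract feature vectors; alpha and lam are illustrative only.
import numpy as np

def td_lambda_game_update(w, positions, z, alpha=0.1, lam=0.7):
    """Update weights w in place from one game's positions and final reward z."""
    e = np.zeros_like(w)        # eligibility trace: decayed sum of past gradients
    y_prev = None
    for x in positions:
        y = float(w @ x)        # V(x_t): predicted expected outcome of the game
        if y_prev is not None:
            w += alpha * (y - y_prev) * e   # weight change driven by the TD error
        e = lam * e + x         # for a linear V, the gradient of V w.r.t. w is x
        y_prev = y
    w += alpha * (z - y_prev) * e           # final step uses the actual outcome z
    return w

# Toy usage: a "game" of random feature vectors ending in a win (z = 1).
rng = np.random.default_rng(0)
w = np.zeros(24)
game = [rng.random(24) for _ in range(55)]  # roughly the 50-60 steps of a human game
w = td_lambda_game_update(w, game, z=1.0)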

Preliminary experiments used an input representation scheme that encoded only the raw board information (the number of white or black checkers at each location) and did not utilize any additional precomputed features relevant to good play, such as the strength of a blockade or the probability of being hit. These experiments were completely knowledge-free, in that no initial knowledge about how to play good backgammon was built in. In subsequent experiments, a set of hand-crafted features was added to the representation, resulting in higher overall performance. This feature set was the same set that was included in Neurogammon.

The rather surprising result, after tens of thousands of training games, was that a significant amount of learning actually took place, even in the zero initial knowledge experiments. These networks achieved a strong intermediate level of play approximately equal to that of Neurogammon. The networks with hand-crafted features have greatly surpassed Neurogammon and all other previous computer programs, and have continued to improve with more and more games of training experience. The best of these networks is now estimated to play at a strong master level that is extremely close to equaling world-class human play. This has been demonstrated in numerous tests of TD-Gammon in play against several world-class human grandmasters, including Bill Robertie and Paul Magriel, both noted authors and highly respected former World Champions.

For the tests against humans, a heuristic doubling algorithm was added to the program. It took TD-Gammon's equity estimates as input and applied classical formulas developed in the 1970s (Zadeh and Kobliska 1977) to determine proper doubling actions.
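As one illustration of how an estimated winning probability can drive such cube decisions, the sketch below uses the textbook dead-cube, no-gammon simplification. It is a hedged stand-in for the general idea only, not the Zadeh-Kobliska formulas or TD-Gammon's actual doubling heuristic, neither of which is specified in this paper.

# Illustrative cube-decision rule driven by an estimated winning probability.
# This is the textbook "dead cube, no gammons" simplification, NOT the
# Zadeh-Kobliska formulas or TD-Gammon's actual doubling heuristic.

def should_take(p_win: float) -> bool:
    # Passing costs 1 point; taking at stake 2 has expectation 4*p_win - 2.
    # Taking is no worse than passing when 4*p_win - 2 >= -1, i.e. p_win >= 0.25.
    return p_win >= 0.25

def should_double(p_win: float) -> bool:
    # In the dead-cube simplification, doubling never hurts once you are the
    # favorite: the opponent either takes (doubling your positive expectation)
    # or passes (conceding 1 point). Real doubling windows are narrower.
    return p_win >= 0.5

# Example: with an estimated 72% chance of winning, double; the opponent,
# seeing 28% for themselves, should take.
print(should_double(0.72), should_take(1 - 0.72))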

Results of testing are summarized in Table 1.

Table 1: Results of testing TD-Gammon in play against world-class human opponents.*

Program         Training games   Opponents                       Results
TD-Gammon 1.0   300,000          Robertie, Davis, Magriel        -13 pts / 51 games (-0.25 ppg)
TD-Gammon 2.0   800,000          Goulding, Woolsey, Snellings,   -7 pts / 38 games (-0.18 ppg)
                                 Russell, Sylvester
TD-Gammon 2.1   1,500,000        Robertie                        -1 pt / 40 games (-0.02 ppg)

*Version 1.0 used 1-ply search for move selection; versions 2.0 and 2.1 used 2-ply search. Version 2.0 had 40 hidden units; versions 1.0 and 2.1 had 80 hidden units.

TD-Gammon 1.0, which had a total training experience of 300,000 games, lost a total of 13 points in 51 games against Robertie, Magriel, and Malcolm Davis, the 11th highest rated player in the world in 1991. TD-Gammon 2.0, which had 800,000 training games of experience and was publicly exhibited at the 1992 World Cup of Backgammon tournament, had a net loss of 7 points in 38 exhibition games against top players Kent Goulding, Kit Woolsey, Wilcox Snellings, former World Cup Champion Joe Sylvester, and former World Champion Joe Russell. The latest version of the program, version 2.1, had 1.5 million games of training experience and achieved near-parity to Bill Robertie in a recent 40-game test session: after trailing the entire session, Robertie managed to eke out a narrow one-point victory by the score of 40 to 39.

According to an article by Bill Robertie published in Inside Backgammon magazine (Robertie 1992), TD-Gammon's level of play is significantly better than that of any previous computer program. Robertie estimates that TD-Gammon 1.0 would lose on average in the range of 0.2 to 0.25 points per game against world-class human play. (This is consistent with the results of the 51-game sample.) This would be about equivalent to a decent advanced level of human play in local and regional open-division tournaments. In contrast, most commercial programs play at a weak intermediate level that loses well over one point per game against world-class humans. The best previous commercial program scored -0.66 points per game on this scale. The best previous program of any sort was Hans Berliner's BKG program, which in its only public appearance in 1979 won a short match against the World Champion at that time (Berliner 1980). BKG was about equivalent to a very strong intermediate or weak advanced player and would have scored in the range of -0.3 to -0.4 points per game.

Based on the latest 40-game sample, Robertie's overall assessment is that TD-Gammon 2.1 now plays at a strong master level that is extremely close to equaling the world's best human players. In fact, due to the program's steadiness (it never gets tired or careless, as even the best of humans inevitably do), he thinks it would actually be the favorite against any human player in a long money-game session or in a grueling tournament format such as the World Cup competition. The only thing that prevents TD-Gammon from genuinely equaling world-class human play is that it still makes minor, practically inconsequential technical errors in its endgame play. One would expect these technical errors to cost the program on the order of 0.05 points per game against top humans. Robertie thinks that there are probably only two or three dozen players in the entire world who, at the top of their game, could expect to hold their own or have an advantage over the program. This means that TD-Gammon is now probably as good at backgammon as the grandmaster chess machine Deep Thought is at chess. Interestingly enough, it is only in the last 5-10 years that human play has become good enough to rival TD-Gammon's current playing ability. If TD-Gammon had been developed 10 years ago, Robertie says, it would have easily been the best player in the world at that time. Even 5 years ago, there would have been only two or three players who could equal it.

The self-teaching reinforcement learning approach used in the development of TD-Gammon has greatly surpassed the supervised learning approach of Neurogammon, and has achieved a level of play considerably beyond any possible prior expectations. It has also demonstrated favorable empirical behavior of TD(λ), such as good scaling behavior, despite the lack of theoretical guarantees. Prospects for further improvement of TD-Gammon seem promising. Based on the observed scaling, training larger and larger networks with correspondingly more experience would probably result in even higher levels of performance. Additional improvements could come from modifications of the training procedure or the input representation scheme. Some combination of these factors could easily result in a version of TD-Gammon that would be the uncontested world's best backgammon player. However, instead of merely pushing TD-Gammon to higher and higher levels of play, it now seems more worthwhile to extract the principles underlying the success of this application of TD learning, and to determine what kinds of other applications may also produce similar successes. Other possible applications might include financial trading strategies, military battlefield strategies, and control tasks such as robot motor control, navigation, and path planning.

At this point we are still largely ignorant as to why TD-Gammon is able to learn so well. One plausible conjecture is that the stochastic nature of the task is critical to the success of TD learning. One possibly very important effect of the stochastic dice rolls in backgammon is that, during learning, they enforce a certain minimum amount of exploration of the state space. By stochastically forcing the system into regions of state space that the current evaluation function tries to avoid, it is possible that improved evaluations and new strategies can be discovered.

References

Berliner, H. 1980. Computer backgammon. Sci. Am. 243(1), 64-72.
Robertie, B. 1992. Carbon versus silicon: matching wits with TD-Gammon. Inside Backgammon 2(2), 14-22.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3, 9-44.
Tesauro, G. 1989. Neurogammon wins Computer Olympiad. Neural Comp. 1, 321-323.
Tesauro, G. 1992. Practical issues in temporal difference learning. Machine Learn. 8, 257-277.
Zadeh, N., and Kobliska, G. 1977. On optimal doubling in backgammon. Manage. Sci. 23, 853-858.

Received April 19, 1993; accepted May 25, 1993.