
TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 AlphaZero 1

AlphaGo Fan (October 2015) AlphaGo Defeats Fan Hui, European Go Champion. 2

AlphaGo Lee (March 2016) 3

AlphaGo Zero vs. AlphaGo Lee (April 2017) AlphaGo Lee: trained on both human games and self-play; trained for months; run on many machines, with 48 TPUs for the Lee Sedol match. AlphaGo Zero: trained on self-play only; trained for 3 days; run on one machine with 4 TPUs. Defeated AlphaGo Lee under match conditions 100 to 0. 4

AlphaZero Defeats Stockfish in Chess (December 2017) AlphaGo Zero was a fundamental algorithmic advance for general RL. The general RL algorithm of AlphaZero is essentially the same as that of AlphaGo Zero. 5

Some Algorithmic Concepts Rollout position evaluation (Bruegmann, 1993); Monte Carlo Tree Search (MCTS) (Bruegmann, 1993); the Upper Confidence Bound (UCB) bandit algorithm (Lai and Robbins, 1985); Upper Confidence Tree search (UCT) (Kocsis and Szepesvari, 2006). 6

Rollouts and MCTS (1993) To estimate the value of a position (who is ahead and by how much) run a cheap stochastic policy to generate a sequence of moves (a rollout) and see who wins. Take an average value of many rollouts. Do a selective tree search using rollout averages for position evaluation. 7
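
A minimal sketch of rollout position evaluation. The game interface used here (copy, is_terminal, legal_moves, play, outcome, player_to_move) is a hypothetical assumption, not something defined on the slides.

import random

def random_rollout_policy(pos):
    # Cheapest possible stochastic policy: a uniformly random legal move.
    return random.choice(pos.legal_moves())

def rollout_value(position, rollout_policy=random_rollout_policy, max_moves=400):
    # Play the game out with the cheap policy and report the outcome
    # from the point of view of the player to move at the start.
    player = position.player_to_move()
    pos = position.copy()
    moves = 0
    while not pos.is_terminal() and moves < max_moves:
        pos = pos.play(rollout_policy(pos))
        moves += 1
    return pos.outcome(player)      # e.g. +1 win, 0 draw, -1 loss

def estimate_value(position, num_rollouts=100):
    # Average many independent rollouts to estimate who is ahead and by how much.
    return sum(rollout_value(position) for _ in range(num_rollouts)) / num_rollouts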

(One Armed) Bandit Problems Consider a set of choices (different slot machines). Each choice gets a stochastic reward. We can select a choice and get a reward as often as we like. We would like to determine which choice is best and also to get reward as quickly as possible. 8

The UCB algorithm (1995 Version) For each choice (bandit) a, construct a confidence interval for its average reward: µ = µ̂ ± 2σ/√n. Always select the choice with the largest upper confidence bound, argmax_a µ̂(a) + U(N(a)), where N(a) is the number of times a has been tried and U(N(a)) is the width of its confidence interval. 9
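
A minimal sketch of UCB selection. The slide's interval is µ̂ ± 2σ/√n; the sketch below uses the common log-based bonus √(ln N / n) as the width U(N(a)), which is an assumption of this sketch rather than the exact form on the slide.

import math

def ucb_select(counts, means, c=2.0):
    # counts[a]: times arm a was pulled; means[a]: its average reward so far.
    total = sum(counts)
    best_arm, best_score = None, -math.inf
    for a, (n, mu) in enumerate(zip(counts, means)):
        if n == 0:
            return a                      # try every arm at least once
        score = mu + c * math.sqrt(math.log(total) / n)   # mean + U(N(a))
        if score > best_score:
            best_arm, best_score = a, score
    return best_arm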

The UCT algorithm (2006) Build a search tree by running simulations. Each simulation uses the UCB rule to select a child of each node until a leaf is reached. The leaf is then expanded and a value is computed for the leaf. This value is backed up through the tree, adding the value to, and incrementing the count of, each ancestor node. 10
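
A minimal sketch of one UCT simulation. The Node class and the expand and evaluate functions (e.g. a rollout, or later a value network) are hypothetical; sign flipping between the two players is omitted to keep the sketch short.

import math

class Node:
    def __init__(self, state):
        self.state = state
        self.children = []      # child Nodes, filled in by expand()
        self.visits = 0
        self.total_value = 0.0

def uct_child(node, c=1.4):
    # The UCB rule applied at a tree node: exploit the mean value,
    # explore rarely visited children.
    def score(child):
        if child.visits == 0:
            return math.inf
        mean = child.total_value / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children, key=score)

def simulate(root, expand, evaluate):
    # Descend with the UCB rule, expand the leaf, evaluate it, and back the
    # value up by adding it to (and incrementing the count of) every ancestor.
    path, node = [root], root
    while node.children:
        node = uct_child(node)
        path.append(node)
    expand(node)                    # attach the leaf's children
    value = evaluate(node.state)    # rollout value, or a value network
    for n in path:
        n.visits += 1
        n.total_value += value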

AlphaGo AlphaGo trained: a fast rollout policy, an imitation policy network, a self-play policy network, and a value network trained to predict self-play rollout values. 11

AlphaGo Competition play is done with UCT search using the four components just mentioned. No tree search is used in training. 12

AlphaGo Policy and Value Networks [Silver et al.] The layers use 5×5 filters with ReLU on 256 channels. 13

Fast Rollout Policy Softmax of a linear combination of (hand-designed) pattern features. It achieves an accuracy of 24.2%, using just 2 µs to select an action, rather than 3 ms for the policy network. 14

Imitation Policy Learning A 13-layer policy network trained on 30 million positions from the KGS Go Server. 15

Self-Play Policy Run the policy network against a version of itself to get an (expensive) rollout a_1, b_1, a_2, b_2, ..., a_N, b_N with value z. No tree search is used here. Θ_π += z ∇_Θπ ln π(a_t | s_t; Θ_π). This is just REINFORCE. 16
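
A minimal PyTorch sketch of this REINFORCE update; the policy network, tensor shapes, and batching are assumptions, not details from the slides. Here states holds the positions seen by the policy in one self-play game, actions the moves it played, and z the final outcome of that game.

import torch

def reinforce_update(policy_net, optimizer, states, actions, z):
    # Gradient ascent on z * sum_t log pi(a_t | s_t; Theta_pi).
    optimizer.zero_grad()
    logits = policy_net(states)                        # (T, num_moves)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # (T,)
    loss = -(z * chosen).sum()      # minimizing the negative realizes the += update
    loss.backward()
    optimizer.step()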

Regression Training of Value Function Using self-play of the final RL policy we generate a database of 30 million pairs (s, z), where s is a board position and z ∈ {−1, 1} is an outcome, and each pair is from a different game. We then train a value network by regression: Θ* = argmin_Θ E_(s,z) (V(s, Θ) − z)². 17
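
A minimal PyTorch sketch of one regression step on a minibatch of (s, z) pairs; the value network and tensor shapes are assumptions.

import torch

def value_regression_step(value_net, optimizer, states, outcomes):
    # Minimize E (V(s; Theta) - z)^2 with z in {-1, +1}.
    optimizer.zero_grad()
    v = value_net(states).squeeze(-1)          # (B,)
    loss = ((v - outcomes) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()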

Monte Carlo Tree Search (MCTS) Competition play is then done with UCT search using the four predictors described above. A simulation descends the tree using argmax_a Q(s, a) + c P(s, a) √N(s) / (1 + N(a)), where P(s, a) is the imitation-learned action probability, N(s) is the visit count of s, and N(a) is the visit count of child a. 18
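
A minimal sketch of this selection rule. The node structure (a children dict from moves to child nodes, a prior dict holding P(s, a), and per-child visit and value statistics) is a hypothetical assumption.

import math

def puct_select(node, c=1.0):
    # argmax_a  Q(s, a) + c * P(s, a) * sqrt(N(s)) / (1 + N(a))
    n_s = sum(child.visits for child in node.children.values())
    def score(item):
        move, child = item
        q = child.total_value / child.visits if child.visits else 0.0
        return q + c * node.prior[move] * math.sqrt(n_s) / (1 + child.visits)
    best_move, _ = max(node.children.items(), key=score)
    return best_move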

Monte Carlo Tree Search (MCTS) When a leaf is expanded it is assigned value (1 − λ)V(s) + λz, where V(s) is from the self-play-learned value network and z is the value of a rollout from s using the fast rollout policy. Once the search is deemed complete, the most traversed edge from the root is selected as the move. 19

AlphaGo Zero The self-play training is based on UCT tree search rather than rollouts. No rollouts are ever used, just UCT trees under the learned policy and value networks. No database of human games is ever used, just self-play. The networks are replaced with a ResNet. A single dual-head network is used for both policy and value. 20

Training Time 4.9 million games of self-play; 0.4 s of thinking time per move; about 8 years of thinking time in training. Training took just under 3 days, i.e., about 1000-fold parallelism. 21

Elo Learning Curve 22

Learning Curve for Predicting Human Moves 23

Ablation Study for Resnet and Dual-Head 24

Learning from Tree Search UCT tree search is used to generate a complete self-play game. Each self-play game has a final outcome z and generates data (s, π, z) for each position s in the game, where π is the final move probability distribution at that position and z is the final value of the game. This data is collected in a replay buffer. 25
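
A minimal sketch of the data-generation loop. The run_uct function (returning the search's visit-count distribution over moves), pick_move, and the game-state methods are hypothetical, and the sign handling assumes outcomes are reported from the first player's point of view.

def self_play_game(initial_state, run_uct, pick_move):
    # Record (state, search distribution) at every position, then attach the
    # final outcome z to each position once the game is over.
    history, state = [], initial_state
    while not state.is_terminal():
        pi = run_uct(state)                 # move distribution from visit counts
        history.append((state, pi))
        state = state.play(pick_move(pi))
    z = state.outcome()                     # +1 / -1 from the first player's view
    return [(s, pi, z if s.player_to_move() == 0 else -z) for (s, pi) in history]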

Learning from Tree Search Learning is done from this replay buffer using the following objective on a single dual-head network: Φ* = argmin_Φ E_{(s,π,z)∼Replay, a∼π} [ (v_Φ(s) − z)² − λ₁ log Q_Φ(a | s) + λ₂ ‖Φ‖² ]. 26
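
A minimal PyTorch sketch of this objective on a single dual-head network; the policy term is written as the cross-entropy between the search distribution π and the network's move distribution, which matches the expectation over a ∼ π in the formula above. The network interface and tensor shapes are assumptions.

import torch

def dual_head_loss(net, states, target_pis, outcomes, lam1=1.0, lam2=1e-4):
    # value MSE  +  lam1 * policy cross-entropy  +  lam2 * L2 penalty
    policy_logits, values = net(states)                # (B, moves), (B, 1)
    value_loss = ((values.squeeze(-1) - outcomes) ** 2).mean()
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    policy_loss = -(target_pis * log_probs).sum(dim=-1).mean()   # E_{a~pi} -log Q(a|s)
    l2 = sum((p ** 2).sum() for p in net.parameters())
    return value_loss + lam1 * policy_loss + lam2 * l2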

Exploration Exploration is maintained by selecting moves in proportion to visit count for the first 30 moves, rather than always taking the maximum visit count. After 30 moves the maximum-count move is selected. Throughout the game, noise is injected into the root move probabilities for each move selection. 27
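
A minimal sketch of both mechanisms. The Dirichlet parameters follow the values reported for AlphaGo Zero and are an assumption of this sketch, not something stated on the slide.

import numpy as np

def select_move(visit_counts, move_number, tau_moves=30):
    # First 30 moves: sample in proportion to visit count (exploration).
    # Afterwards: play the most visited move.
    counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < tau_moves:
        return int(np.random.choice(len(counts), p=counts / counts.sum()))
    return int(counts.argmax())

def noisy_root_priors(priors, epsilon=0.25, alpha=0.03):
    # Mix Dirichlet noise into the root move priors before each search.
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1 - epsilon) * np.asarray(priors) + epsilon * noise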

Increasing Blocks and Training Increasing the number of ResNet blocks from 20 to 40 and the number of training days from 3 to 40 gives an Elo rating over 5000. 28

Final Elo Ratings 29

AlphaZero Chess and Shogi Essentially the same algorithm, with the input image and output images modified to represent the game position and move options respectively. Minimal representations are used, with no hand-coded features. Three days of training. Tournaments played on a single machine with 4 TPUs. 30

AlphaZero vs. Stockfish Playing white, AlphaZero won 25/50 and lost none; playing black, it won 3/50 and lost none. AlphaZero evaluates 70 thousand positions per second; Stockfish evaluates 80 million positions per second. 31

Checkers is a Draw In 2007 Jonathan Schaeffer at the University of Alberta showed that checkers is a draw. Using alpha-beta and end-game dynamic programming, Schaeffer computed drawing strategies for each player. This was listed by Science Magazine as one of the top 10 breakthroughs of 2007. Is chess also a draw? 32

Grand Unification AlphaZero unifies chess and Go algorithms. This unification of intuition (Go) and calculation (chess) is surprising. The unification grew out of Go algorithms. But are the algorithmic insights of chess algorithms really irrelevant? 33

Chess Background The first min-max computer chess program was described by Claude Shannon in 1950. Alpha-beta pruning was invented independently by various people, including John McCarthy, around 1956-1960. Alpha-beta was the cornerstone of all chess algorithms until AlphaZero. 34

Alpha-Beta Pruning

def MaxValue(s, alpha, beta):
    # Value of a max node, searching only lines that can affect the result.
    # Static evaluation at leaf positions is omitted, as on the slide.
    value = alpha
    for s2 in s.children():
        value = max(value, MinValue(s2, value, beta))
        if value >= beta:
            break        # beta cutoff: the min player will avoid this line
    return value

def MinValue(s, alpha, beta):
    value = beta
    for s2 in s.children():
        value = min(value, MaxValue(s2, alpha, value))
        if value <= alpha:
            break        # alpha cutoff: the max player will avoid this line
    return value

Conspiracy Numbers Conspiracy Numbers for Min-Max Search, McAllester, 1988. Consider a partially expanded game tree where each leaf is labeled with a static value. Each node s has a min-max value V(s) determined by the leaf values. For any N, define an upper confidence U(s, N) to be the greatest value that can be achieved for s by changing N leaf nodes. We define N(s, U) to be the least N such that U(s, N) ≥ U. 36
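
A minimal sketch of the recursion behind N(s, U), assuming a hypothetical node interface (is_leaf, value, children, is_max). The key observation is that a max node reaches U as soon as any single child reaches U, while a min node reaches U only when every child does.

def conspiracy_number(node, U):
    # N(s, U): least number of leaf values that must change for V(s) to reach U.
    if node.is_leaf():
        return 0 if node.value >= U else 1
    child_numbers = [conspiracy_number(child, U) for child in node.children]
    if node.is_max:
        return min(child_numbers)    # raise any single child to U
    return sum(child_numbers)        # every child must be raised to U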

Conspiracy Algorithm Define an upper-confidence leaf for s and U to be any leaf that occurs in a set of N(s, U) leaves whose values can change V(s) to U. Algorithm: fix a hyperparameter N. Repeatedly expand an upper-confidence leaf for the root s with value U(s, N) and a lower-confidence leaf for s with value L(s, N). 37

Simulation To find an upper-confidence leaf for the root and value U: At a max node pick the child minimizing N(s, U). At a min node select any child s with V (s) < U. 38

Refinement Let the static evaluator associate leaf nodes with values U(s, N) and L(s, N) 39

END