Game-playing: DeepBlue and AlphaGo

Brief history of game-playing frontiers 1990s: Othello world champions refuse to play computers. 1994: Chinook defeats the Checkers world champion. 1997: DeepBlue defeats world champion Garry Kasparov. 2016: AlphaGo defeats world champion Lee Sedol. Today, we're going to talk about DeepBlue and AlphaGo.

DeepBlue In 1997, DeepBlue beat world champion Garry Kasparov at chess.

DeepBlue In 1997, DeepBlue beat world champion Garry Kasparov at chess. How? Minimax, alpha-beta pruning, and an evaluation function. Sound familiar?

First, some review. Let's play a two-player game. Start with n = 5, and alternate turns. On every turn, a player can either set n = n - 1 or n = floor(n/2). The first player to set n = 0 wins! How can we model this?
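As a rough sketch of how this toy game can be modeled (the function names below are my own, not from the slides), the state is just the current value of n and the two moves are "subtract" and "divide":

```python
# Hypothetical model of the review game: the state is the current n,
# and each move produces a successor state.

def successors(n):
    """The two legal successor states from n: subtract one, or halve (rounding down)."""
    return [n - 1, n // 2]

def is_terminal(n):
    """The game ends when n reaches 0; the player who moved there wins."""
    return n == 0

print(successors(5))  # [4, 2]
```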

Game trees [Diagram, built up over several slides: the full game tree from n = 5, with each node branching on "subtract" (n - 1) and "divide" (floor(n/2)). Leaves where we are the one to reach n = 0 are labeled +100; leaves where the opponent reaches it first are labeled -100.]

Game trees So what are the best moves I can play? Problem: We also don't know what the opponent will play.

Expectimax We want to maximize our own utility. If it's my turn, then: take the action a that maximizes the utility of the resulting state. We don't know what the enemy will do, so let's guess! Weight each possible opponent action a by the probability that the opponent will take action a from state s, multiplied by the utility of the next state.
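Written out as equations (my notation, not the slides'), the expectimax value of a state looks roughly like this:

```latex
V(s) =
\begin{cases}
\text{utility}(s) & \text{if } s \text{ is terminal,}\\[4pt]
\max_{a} V(\mathrm{result}(s,a)) & \text{if it is our turn,}\\[4pt]
\sum_{a} P(a \mid s)\, V(\mathrm{result}(s,a)) & \text{if it is the opponent's turn.}
\end{cases}
```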

Expectimax Let's say we don't know our enemy's policy at all. Maybe it's random! [Over several slides, the tree from n = 5 is filled in bottom-up with expected values, assuming the opponent picks each move with equal probability: the "subtract to 4" branch comes out to +0, the "divide to 2" branch comes out to +100, so the root value is +100.]

Expectimax What if our enemy isn't random?

Minimax We know we want to maximize our utility: take the action a that maximizes the utility of the resulting state. Let's assume the enemy is adversarial, i.e. wants to minimize our utility: the enemy takes the action a that minimizes the utility of the resulting state.
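A minimal recursive minimax sketch for the toy n-game above (helper structure and names are mine, not DeepBlue's), with utilities of +100 for our win and -100 for our loss:

```python
def minimax(n, our_turn):
    """Minimax value of state n from our point of view (+100 we win, -100 we lose)."""
    if n == 0:
        # Whoever just moved to 0 has won. If it is now "our turn", the opponent won.
        return -100 if our_turn else +100
    values = [minimax(child, not our_turn) for child in (n - 1, n // 2)]
    return max(values) if our_turn else min(values)

print(minimax(5, our_turn=True))  # +100: from n = 5, dividing to 2 wins with best play
```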

Minimax [Over several slides, the same tree is filled in with minimax values, now assuming the opponent picks the move that is worst for us: the "subtract to 4" branch is worth -100, the "divide to 2" branch is worth +100, so the root value is +100 and our best opening move is to divide.]

Minimax DeepBlue did not use vanilla minimax. What's wrong? Game trees are huge!!! Can we do better? Idea: Prune the search space!

Alpha-Beta pruning From a max node (our perspective): if we know the utility of action a is really high, we shouldn't have to evaluate other actions that we know will not be as good. The inverse is true from a min node (the adversary's perspective). Alpha: a lower bound on the value that a max node may ultimately be assigned (v >= α). Beta: an upper bound on the value that a min node may ultimately be assigned (v <= β).
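As a sketch (again on the toy n-game, not DeepBlue's actual search), minimax with alpha-beta pruning might look like this; it returns the same values as plain minimax but skips branches that cannot change the answer:

```python
import math

def alphabeta(n, our_turn, alpha=-math.inf, beta=math.inf):
    """Minimax value of n with alpha-beta pruning; same result as minimax(), fewer nodes."""
    if n == 0:
        return -100 if our_turn else +100
    if our_turn:                      # max node: we raise alpha
        value = -math.inf
        for child in (n - 1, n // 2):
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:         # the min player above already has a better option
                break                 # prune the remaining children
        return value
    else:                             # min node: the opponent lowers beta
        value = math.inf
        for child in (n - 1, n // 2):
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

print(alphabeta(5, our_turn=True))    # +100, matching plain minimax
```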

Alpha-Beta pruning [Worked example over several slides: a max root with three min children. The first child's leaves are 6, 9, 8, so its value is 6; since the root is a max node, it will always end up with a value of at least 6 no matter what the other children hold (α = 6, V >= 6). The second child's first leaf is 4, so as a min node its value will be at most 4 (β = 4, V <= 4); since 4 < 6, this child is guaranteed to have no effect on the root's value, and its remaining leaves are pruned. The third child's leaves come in as 9, then 7, then 2, so its β drops from 9 to 7 to 2.] Search order matters! We could have pruned this branch sooner if we had seen the node with value 2 first.

Alpha-Beta pruning Pretty cool, but DeepBlue did not just use minimax + alpha-beta pruning. What's wrong? Game trees are too deep!!! Can we do better? Idea: Instead of playing out the entire game, let's guess how well we're doing after d moves.

Evaluation functions Suppose we have finite computing resources and can't afford to compute this entire tree. Let's stop our search at some fixed depth d. How do we know the utility of these new leaf nodes (to propagate up the game tree)? Guess! (Use a heuristic.) From the current game state, how likely am I to win? [Diagram: the cut-off leaves get estimated values such as ~70, ~30, ~10, ~20.]

Evaluation functions Connect-4: How many open connect-3s do I have? How many open connect-2s do I have? Chess (DeepBlue): material, position, king safety, and tempo. Material: How many pieces do I have left, and what are they worth? Position: How many empty/safe squares can I attack? King safety: How in danger of attack is my king? Tempo: Have I been making progress recently? DeepBlue: minimax tree + alpha-beta pruning to a depth of ~13; beyond that depth, it used the evaluation function to estimate utility.
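Putting the pieces together, depth-limited minimax with an evaluation function at the cutoff might be sketched like this (a generic illustration of the idea, not DeepBlue's implementation; `evaluate`, `successors`, `is_terminal`, and `utility` are assumed helpers):

```python
def depth_limited_minimax(state, depth, our_turn, evaluate, successors, is_terminal, utility):
    """Minimax that stops at a fixed depth and falls back on a heuristic evaluation."""
    if is_terminal(state):
        return utility(state)
    if depth == 0:
        return evaluate(state)        # heuristic guess: how likely am I to win from here?
    values = [depth_limited_minimax(s, depth - 1, not our_turn,
                                    evaluate, successors, is_terminal, utility)
              for s in successors(state)]
    return max(values) if our_turn else min(values)
```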

Go Why wasn't DeepBlue's algorithm good for Go? Go is way harder than chess: ~300 possible actions from every board position (vs. ~30 in chess), ~150 moves per game (vs. ~70 in chess), and ~10^761 possible games in total (vs. ~10^120 for chess). There are only about 10^80 atoms in the universe!

AlphaGo's Approach Monte Carlo Tree Search. A value network as the evaluation function: what's the expected utility of this board state? A policy network as the selection function: which moves are more likely to be played from this state? Both networks were fed data from many expert games.

Monte Carlo Tree Search I have limited resources, so instead of finding the optimal policy for every game state, let's find an approximate policy for the most common game states.

Monte Carlo Tree Search: the core loop 1. Choose a game path to learn more about (a good selection policy explores common game paths more often, while also exploring unknown states). 2. Add an MCTS node to our search tree. 3. Play a game out randomly: did we win? (Instead of doing a full playout, some MCTS variants use an evaluation function.) 4. Propagate the result up through the path. A sketch of this loop is shown below.
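Here is a minimal, generic sketch of that loop (UCT-style selection; the `Node` class and helper names such as `successors`, `is_terminal`, and `random_playout` are my own placeholders, and perspective flips between players are glossed over for brevity; this is not AlphaGo's implementation):

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # successor state -> child Node
        self.wins, self.visits = 0, 0

def mcts_iteration(root):
    # 1. Selection: descend while fully expanded, balancing win rate and exploration (UCT).
    node = root
    while node.children and len(node.children) == len(set(successors(node.state))):
        node = max(node.children.values(),
                   key=lambda c: c.wins / c.visits
                                 + math.sqrt(2 * math.log(node.visits) / c.visits))
    # 2. Expansion: add one new child node for an untried successor state.
    if not is_terminal(node.state):
        untried = [s for s in set(successors(node.state)) if s not in node.children]
        child = Node(random.choice(untried), parent=node)
        node.children[child.state] = child
        node = child
    # 3. Simulation: play the rest of the game randomly (returns 1 if we won, 0 otherwise).
    outcome = random_playout(node.state)
    # 4. Backpropagation: push the result up the selected path.
    while node is not None:
        node.visits += 1
        node.wins += outcome
        node = node.parent
```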

AlphaGo's Monte Carlo Tree Search Uses policy prediction to guess which actions are more likely to be taken, and value prediction as an evaluation function instead of performing a full playout. These predictions are trained using convolutional neural networks.

Convolutional Neural Networks How does training work? Take an affine function of the input (with weights), then pass this output through a nonlinear function: the activation function.
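In symbols (my notation), one such layer computes an affine map followed by a nonlinearity, for example:

```latex
h = \sigma(Wx + b), \qquad \text{e.g. } \sigma(z) = \max(0, z) \ \text{(ReLU)}
```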

Convolutional Neural Networks How do you train a classifier from these features?

Convolutional Neural Networks What are they doing mechanically? Finding local features in a picture, and prioritizing the features that help predict the outcome of interest. Value network -> predict rewards. Policy network -> predict next moves.

Policy Network Given a 19x19 Go board, output a probability distribution over all legal moves. Trained on data from 30 million positions, plus data from self-play. 13 layers!

Value Network Given a 19x19 Go board, output a value: how likely am I to win? Trained on the same games as the policy network.

MCTS in AlphaGo Selection: we choose which path to learn more about by selecting actions with maximum Q + u(P). Q is informed by the value network; u(P) is an exploration bonus based on the policy network's probability for this action.
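Roughly, following the AlphaGo paper (exact constants omitted), the selection rule picks the action maximizing Q plus a bonus that is large for moves the policy network rates highly but that have few visits N so far:

```latex
a^{*} = \arg\max_{a} \bigl( Q(s,a) + u(s,a) \bigr), \qquad u(s,a) \propto \frac{P(s,a)}{1 + N(s,a)}
```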

MCTS in AlphaGo Expansion: to choose a node to expand, sample from the policy network's probability distribution over moves.

MCTS in AlphaGo Evaluation: the leaf heuristic is either a value estimate from the value network, or the result r of a fast rollout, i.e. a simulated game.
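For reference, in the AlphaGo paper these two signals are blended with a mixing parameter λ rather than used strictly one-or-the-other (a hedged sketch of that formula):

```latex
V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L
```

Here v_θ(s_L) is the value network's estimate at leaf s_L and z_L is the outcome of the fast rollout.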

MCTS in AlphaGo Backpropagation: the Q values along the entire selected path are updated based on the evaluation result.

It's not perfect AlphaGo's only loss against Lee: at White's move 78, Lee played an unexpected move that AlphaGo had failed to explore in MCTS. Two possible reasons: the policy network hadn't been trained for long enough, or selection chooses common game paths too aggressively and doesn't explore enough.

AlphaGo We just designed AlphaGo! Almost

Computational Power 1202 CPUs! 176 GPUs! Specialized hardware was used against Lee Sedol.

Summary AlphaGo applied advanced versions of techniques in this class! Elo ratings: Lee Sedol 3517, AlphaGo (2016) ~3594, Ke Jie (world champion) 3616.