A Bandit Approach for Tree Search


An Example in Computer-Go
Department of Statistics, University of Michigan
March 27th, 2008

Outline
1. Bandit Problem: the K-Armed Bandit; UCB Algorithms for the K-Armed Bandit Problem
2. Classical Tree Search: the UCT Algorithm
3. A Quick Introduction to the Game of Go and Computer-Go

Bandit Problem

K-Armed Bandit
- A machine with K arms.
- Playing an arm yields a random reward.
- The mean reward of each arm is fixed but unknown.
- The rewards at each round are independent of one another.
- Objective: maximize the total reward over n rounds.

Example run:

Round  Arm    Reward
1      Left   X_{1,1}
2      Right  X_{2,1}
3      Right  X_{2,2}
4      Right  X_{2,3}
5      Right  X_{2,4}
6      Left   X_{1,2}
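The bandit setting above can be sketched as a tiny simulator. This is a minimal illustration, not from the talk: the arm means, the Bernoulli rewards, and the class name are all assumptions.

```python
import random

class Bandit:
    """A K-armed bandit: each arm has a fixed but unknown mean reward."""
    def __init__(self, means, seed=0):
        self.means = means              # hidden from the player
        self.rng = random.Random(seed)

    def play(self, k):
        """Pull arm k (0-based); return a Bernoulli reward in {0, 1}."""
        return 1 if self.rng.random() < self.means[k] else 0

bandit = Bandit([0.4, 0.7])             # Left = arm 0, Right = arm 1
# Replay the rounds of the table: Left, Right, Right, Right, Right, Left.
rewards = [bandit.play(k) for k in (0, 1, 1, 1, 1, 0)]
total = sum(rewards)                    # total reward after 6 rounds
```

The player only ever sees the rewards, never the means; the whole problem is deciding future pulls from those observations.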

Formalization of the K-Armed Bandit Problem
- A machine with K arms.
- The i-th play of arm k brings reward X_{k,i}.
- X_{k,1}, X_{k,2}, ... are i.i.d. with E[X_{k,i}] = µ_k; µ_k is unknown; X_{k,i} ∈ [0, a].
- X_{k1,i1} and X_{k2,i2} are independent when k1 ≠ k2.
- T_k(n): the number of plays of arm k up to round n.
- Objective: maximize the reward over n rounds, with the strategy at each round based only on the rewards of previous plays.

Exploitation-Exploration Dilemma
- Exploitation: play arms with high average rewards, ensuring promising rewards in the future: k̂ ∈ argmax_k X̄_{k,n−1}, where X̄_{k,n} = (1 / T_k(n)) Σ_{i=1}^{T_k(n)} X_{k,i}.
- Exploration: play arms with few plays so far, in order to get more information: arms k such that T_k(n−1) is small.

Formalization of the K-Armed Bandit Problem (cont'd)
- T_k(n): the number of plays of arm k up to round n.
- k* := argmax_k µ_k, µ* := max_k µ_k, ∆_k := µ* − µ_k.

Definition of the regret up to round n:

    R_n := Σ_{i=1}^{n} X_{k*,i} − Σ_{k=1}^{K} Σ_{i=1}^{T_k(n)} X_{k,i}

where the first sum is the reward of always playing the optimal arm and the second is the reward of the actual plays.

Objective: minimize

    E[R_n] = n µ* − Σ_k µ_k E[T_k(n)] = Σ_{k ≠ k*} ∆_k E[T_k(n)]
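The regret identity E[R_n] = Σ_{k≠k*} ∆_k E[T_k(n)] can be checked numerically. The arm means and play counts below are hypothetical, chosen only to make the arithmetic visible.

```python
# Hypothetical means mu_k and expected play counts E[T_k(n)] for n = 1000.
mus = [0.4, 0.7, 0.5]
plays = [30, 950, 20]                        # must sum to n

mu_star = max(mus)                           # mu* = 0.7
deltas = [mu_star - mu for mu in mus]        # gaps Delta_k = mu* - mu_k

# E[R_n] = n*mu* - sum_k mu_k*E[T_k(n)]  =  sum_k Delta_k*E[T_k(n)]
n = sum(plays)
regret_a = n * mu_star - sum(m * t for m, t in zip(mus, plays))
regret_b = sum(d * t for d, t in zip(deltas, plays))
# Both forms give 0.3*30 + 0.0*950 + 0.2*20 = 13.0
```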

UCB1 (Upper Confidence Bound) Algorithm (Auer et al., 2002)
Suppose X_k ∈ [0, b].
- First: play each arm once.
- After: at round n, play the arm k that maximizes

    X̄_{k,n} + b √(2 log n / T_k(n−1))

Remarks
- Every arm is ultimately played infinitely many times: if arm k were played only finitely many times, then b √(2 log n / T_k(n−1)) → ∞ as n → ∞.
- For any suboptimal arm k ≠ k*, T_k(n) = O(log n).

Regret Bound for UCB1

    E[R_n] ≤ Σ_{k ≠ k*} (256 V_k / ∆_k + 8 / ∆_k) log n + O(1)

where V_k denotes the variance of arm k.

UCB-V Algorithm (Audibert et al., 2007)
A refined version of UCB1 using empirical variances.
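A minimal UCB1 sketch following the rule above: play each arm once, then maximize the empirical mean plus b·√(2 log n / T_k). The Bernoulli arms and their means are illustrative, not from the talk.

```python
import math
import random

def ucb1(pull, K, n_rounds, b=1.0):
    """Run UCB1 (Auer et al., 2002); return total reward and play counts."""
    counts = [0] * K                     # T_k: plays of arm k so far
    sums = [0.0] * K                     # cumulative reward of arm k
    total = 0.0
    for n in range(1, n_rounds + 1):
        if n <= K:                       # first: play each arm once
            k = n - 1
        else:                            # after: maximize the UCB index
            k = max(range(K),
                    key=lambda j: sums[j] / counts[j]
                                  + b * math.sqrt(2 * math.log(n) / counts[j]))
        r = pull(k)
        counts[k] += 1
        sums[k] += r
        total += r
    return total, counts

rng = random.Random(0)
means = [0.3, 0.6]                       # hidden Bernoulli means (illustrative)
total, counts = ucb1(lambda k: float(rng.random() < means[k]), K=2, n_rounds=2000)
# counts[0] (the suboptimal arm) stays O(log n) while counts[1] grows linearly.
```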

Classical Tree Search and the UCT Algorithm

Tree Search Setting
- Structure of the tree: one root, internal nodes, leaves.
- The value is known only at the leaves.
- The value of any node is a function of the values of its child nodes (max tree / minimax tree).
- The value of any node can be computed iteratively.
- A problem arises when the iterative search cannot be completed.

Instead of exhaustively searching each branch to get the exact value of each node, we estimate the values:
- Model each node as a bandit machine.
- At each node, decide between exploration and exploitation with a bandit algorithm.
- The more a node is exploited, the more precise its estimated value becomes.

Exploration and Exploitation at Each Node
- Exploitation: descend to a node with a promising value.
- Exploration: descend to a node to get more information.


UCT (UCB for Trees) Algorithm (Kocsis and Szepesvári, 2006)
- Start from the root.
- Loop until arriving at a leaf: choose a child node according to UCB and descend.
- Get the value of the leaf.
- Update all visited nodes with that value.
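The loop above can be sketched on a toy max tree whose leaf values are known, standing in for game outcomes. All names and the example tree are illustrative.

```python
import math
import random

class Node:
    def __init__(self, children=None, value=0.0):
        self.children = children or []   # empty list: this node is a leaf
        self.value = value               # leaf value in [0, 1]
        self.visits = 0
        self.total = 0.0

def uct_iteration(root, rng):
    """One UCT iteration: descend by UCB, read the leaf, update the path."""
    path, node = [root], root
    while node.children:                 # loop until arriving at a leaf
        unvisited = [c for c in node.children if c.visits == 0]
        if unvisited:                    # try every child once first
            node = rng.choice(unvisited)
        else:                            # UCB choice among the children
            node = max(node.children,
                       key=lambda c: c.total / c.visits
                                     + math.sqrt(math.log(node.visits) / c.visits))
        path.append(node)
    reward = node.value                  # the value is known only at the leaf
    for n in path:                       # update all visited nodes
        n.visits += 1
        n.total += reward
    return reward

rng = random.Random(0)
root = Node(children=[Node(value=0.2), Node(value=0.9), Node(value=0.5)])
for _ in range(200):
    uct_iteration(root, rng)
best = max(root.children, key=lambda c: c.visits)   # most-visited child
```

After 200 iterations the visits concentrate on the best child (value 0.9), while the others are still sampled at a logarithmic rate.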

Formalization of UCT
- The visit count of node i: n_i.
- The value of leaf j: X̄_{j,n_j} = (1/n_j) Σ_{k=1}^{n_j} X_{j,k}.
- The set of child nodes of node i: C(i).
- The set of leaves of the branch starting at node i: L(i).
- The value of each node (an estimate of its true value): X̄_{i,n_i} = (1/n_i) Σ_{j ∈ L(i)} n_j X̄_{j,n_j}.
- UCB at node i: play ĵ ∈ argmax_{j ∈ C(i)} [ X̄_{j,n_j} + √(log n_i / n_j) ].
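The node value X̄_{i,n_i} is the visit-weighted average of the leaf values below node i, i.e. the plain mean of all rewards propagated through it. A small numeric check (the leaf statistics are made up):

```python
# Leaf statistics below node i: pairs (visits n_j, mean value Xbar_j).
leaves = [(4, 0.75), (1, 0.0), (5, 0.6)]

n_i = sum(n for n, _ in leaves)                      # node i was visited 10 times
node_value = sum(n * x for n, x in leaves) / n_i     # (3.0 + 0.0 + 3.0) / 10
# node_value == 0.6: the same mean the 10 raw rewards would give directly
```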

Remarks
- The value of each node estimates, and converges to, its true value.
- Under a smoothness assumption, a fast convergence rate is expected.
- No cuts (no pruning): every node will ultimately be visited.
- The tree is explored in an asymmetric way.
- The order of exploration is always the key point.

A Quick Introduction to the Game of Go and Computer-Go

A Quick Introduction to the Game of Go
- Go board (goban): 19×19 (pedagogical board: 9×9).
- Black and White play alternately; Black starts the game.
- Adjacent stones of one color form a string. Liberties are the empty intersections next to the string.
- Stones do not move; they are only added to and removed from the board. A string is removed iff its number of liberties is 0.
- Score: territory (number of occupied or surrounded intersections).
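The string-and-liberty rule is just a flood fill over same-colored stones. A sketch on a tiny board ('.' empty, 'B' black, 'W' white; this encoding is an assumption for illustration, not a standard):

```python
def liberties(board, r, c):
    """Count the liberties of the string containing the stone at (r, c)."""
    color = board[r][c]
    rows, cols = len(board), len(board[0])
    string, libs, stack = {(r, c)}, set(), [(r, c)]
    while stack:                          # flood fill over the string
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols:
                if board[ny][nx] == '.':
                    libs.add((ny, nx))    # adjacent empty point: a liberty
                elif board[ny][nx] == color and (ny, nx) not in string:
                    string.add((ny, nx))
                    stack.append((ny, nx))
    return len(libs)

board = ["....",
         ".BB.",
         ".BW.",
         "...."]
black_libs = liberties(board, 1, 1)   # the 3-stone black string
white_libs = liberties(board, 2, 2)   # the lone white stone
# A string is removed from the board exactly when its liberty count reaches 0.
```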

History of Computer-Go
- Beginnings of Computer-Go: 1970s.
- Classical methods: expert-knowledge-based evaluation functions, minimax tree search.
- Comparison with chess: Deep Blue won against Kasparov in 1997, while the strongest Go programs remained around 10 kyu until 2006 (good amateur players could win with a 9-stone handicap).
- 2006: UCT introduced in Computer-Go. Today the best programs are around 1 dan on 19×19.

Difficulties in Computer-Go
- Techniques developed for Computer-Chess do not work for Computer-Go.
- Huge branching factor: about 200 (chess: about 40); large depth: > 300.
- Number of legal positions (J. Tromp and G. Farnebäck, 2006): about 2.0 × 10^170 on 19×19 and 1.0 × 10^38 on 9×9.
- A good evaluation function is difficult to build.

Minimax tree structure. Given the huge tree size (depth > 300), what to do with the nodes that are rarely visited?

Memory Management (R. Coulom, 2006)
Idea: there is no need to save the nodes that are rarely visited.
- The tree (in memory) starts with only the root node.
- In each simulation (path), save the first node not yet in the tree.
- The rest of the path is not saved and is chosen randomly.
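A sketch of this memory policy: the in-memory tree holds only the root at first, and each simulation stores just the first node along its path that was not yet in the tree. The depth, branching factor, and dict representation are illustrative assumptions.

```python
import random

def simulate(tree, rng, depth=5, branching=3):
    """One simulation: descend by random moves, saving at most one new node."""
    key, added = (), False
    tree[key]["visits"] += 1
    for _ in range(depth):
        key = key + (rng.randrange(branching),)   # next move on the path
        if key not in tree and not added:
            tree[key] = {"visits": 0}             # save the first unseen node
            added = True
        if key in tree:
            tree[key]["visits"] += 1
        # nodes after the first new one are played out but never saved

rng = random.Random(0)
tree = {(): {"visits": 0}}                        # the tree starts at the root
for _ in range(100):
    simulate(tree, rng)
# After 100 simulations the tree holds at most 101 nodes, however deep the game.
```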

Monte-Carlo evaluation function for each node (B. Brügmann, 1993): each simulation returns a score of 0 or 1. A random move order works better than a fixed order. How can we improve it (find an intelligent order)?
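Brügmann's Monte-Carlo evaluation averages random playouts that each end in a score of 0 or 1. The playout below is a stand-in (a coin flip with a hidden win probability), purely illustrative.

```python
import random

def monte_carlo_value(playout, n_sims, rng):
    """Estimate a position's value as the mean score of n random playouts."""
    return sum(playout(rng) for _ in range(n_sims)) / n_sims

# Stand-in playout: a random continuation won with (hidden) probability 0.55.
def playout(rng):
    return 1 if rng.random() < 0.55 else 0

rng = random.Random(0)
value = monte_carlo_value(playout, 10_000, rng)
# value approximates the true winning probability 0.55 as n_sims grows.
```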

One trend of Computer-Go since 2006, initiated by Crazy Stone (R. Coulom, 2006) and MoGo (Y. Wang and S. Gelly, 2007). MoGo won the gold medal in 19×19 Go at the Computer Olympiad 2007.

Why Is It Efficient Compared to Alpha-Beta?
- Alpha-Beta never reconsiders a cut: this is dangerous with random rewards and no accurate evaluation function.
- Estimation instead of exact computation.
- Efficient tree exploration:
  - breadth-first search;
  - move ordering efficiently managed (for frequently visited nodes);
  - asymmetric growth.
- Anytime: the search can be stopped at any moment.

Remark
UCT is far from the final solution. The tree-search algorithms of the current best programs differ from the one presented today.

Improvement of Random Plays by Patterns
(Examples of patterns were shown on the slides.)

How to improve the quality of random simulations? Using hand-made patterns (Y. Wang and S. Gelly, 2007).
How to learn patterns off-line? Collecting and ranking patterns (R. Coulom, 2007).
How to get information on-line? Learning from previous simulations (S. Gelly and D. Silver, 2007). The order of search is important!
How to find an evaluation function of high quality?

Other Approaches
Parallelization: MPI MoGo.

References on the Bandit Problem
- P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2/3), pages 235-256, 2002.
- J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning Bandit Algorithms in Stochastic Environments. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, ALT, volume 4754 of Lecture Notes in Computer Science, pages 150-165. Springer, 2007.
- L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine Learning (ECML), pages 282-293, 2006.
- P.-A. Coquelin and R. Munos. Bandit Algorithms for Tree Search. 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.

References on Computer-Go
- Y. Wang and S. Gelly. Modification of UCT with Patterns for Monte-Carlo Go. Computational Intelligence and Games, pages 157-182, 2007.
- S. Gelly and D. Silver. Combining Online and Offline Knowledge in UCT. ICML 2007, Corvallis, Oregon, USA, pages 273-280.
- R. Coulom. Computing Elo Ratings of Move Patterns in the Game of Go. In H. Jaap van den Herik, Jos W. H. M. Uiterwijk, Mark Winands, and Maarten Schadd, editors, Computer Games Workshop, Amsterdam, The Netherlands, 2007.

Acknowledgment
Thanks to Rémi Munos, Sylvain Gelly, and Rémi Coulom.

Thank you!