A Bandit Approach for Tree Search

1 A Bandit Approach for Tree Search: An Example in Computer-Go. Department of Statistics, University of Michigan, March 27th, 2008

2 Outline
- 1 Bandit Problem: K-Armed Bandit; UCB Algorithms for the K-Armed Bandit Problem
- 2 Classical Tree Search: UCT Algorithm
- 3 Quick Introduction to the Go Game and Computer-Go

3 Bandit Problem (K-Armed Bandit; UCB Algorithms for the K-Armed Bandit Problem)

4-9 K-Armed Bandit
- A machine with K arms.
- Playing each arm leads to a random reward.
- The mean reward of each arm is fixed but unknown.
- The rewards at each round are independent of the others.
- Objective: maximize the total wins over n rounds.

Example of six rounds (the observed reward is shown under the arm that was played):

Round | Left | Right | Reward
  1   |  1   |       | $X_{1,1}$
  2   |      |  4    | $X_{2,1}$
  3   |      |  3    | $X_{2,2}$
  4   |      |  4    | $X_{2,3}$
  5   |      |  5    | $X_{2,4}$
  6   |  1   |       | $X_{1,2}$

10-11 Formalization of the K-Armed Bandit Problem
- A machine with K arms.
- The i-th play of arm k brings reward $X_{k,i}$.
- $X_{k,1}, X_{k,2}, \dots$ are i.i.d. with $\mathbb{E}X_{k,i} = \mu_k$; $\mu_k$ is unknown; $X_{k,i} \in [0, a]$.
- $X_{k_1,i_1}$ and $X_{k_2,i_2}$ are independent when $k_1 \neq k_2$.
- $T_k(n)$: the number of plays of arm k up to round n.
- Objective: maximize the reward over n rounds, the choice at each round n being based only on the rewards of previous plays.

Exploitation-Exploration Dilemma
- Exploitation: play arms with high average rewards, ensuring promising rewards in the future: $\hat{k} \in \arg\max_k \bar{X}_{k,n-1}$, where $\bar{X}_{k,n} = \frac{1}{T_k(n)} \sum_{i=1}^{T_k(n)} X_{k,i}$.
- Exploration: play arms with few plays, in order to get more information: a k such that $T_k(n-1)$ is small.
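
The two pure strategies of the dilemma can be written directly in terms of the play counts and empirical means defined above; a minimal sketch (illustrative code, the function names are not from the talk):

```python
# T[k] is T_k(n-1), the number of plays of arm k so far; sums[k] is the
# running reward sum of arm k, so sums[k] / T[k] is the empirical mean
# X_bar_{k, n-1} used on the slide.

def exploit(T, sums):
    # Play the arm with the highest empirical mean reward
    # (assumes every arm has already been played at least once).
    return max(range(len(T)), key=lambda k: sums[k] / T[k])

def explore(T, sums):
    # Play the arm we know least about, i.e. the least-played one.
    return min(range(len(T)), key=lambda k: T[k])
```

UCB (next slides) combines the two criteria into a single score instead of choosing between them.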

12 Formalization of the K-Armed Bandit Problem (Cont'd)
- $T_k(n)$: the number of plays of arm k up to round n.
- $k^* := \arg\max_k \mu_k$, $\mu^* := \max_k \mu_k$, $\Delta_k := \mu^* - \mu_k$.
- Definition of the regret up to round n:
$$R_n := \underbrace{\sum_{i=1}^{n} X_{k^*,i}}_{\text{reward of optimal plays}} - \underbrace{\sum_{k=1}^{K} \sum_{i=1}^{T_k(n)} X_{k,i}}_{\text{reward of actual plays}}$$
- Objective: minimize $\mathbb{E}R_n = n\mu^* - \sum_k \mu_k \, \mathbb{E}T_k(n) = \sum_{k \neq k^*} \Delta_k \, \mathbb{E}T_k(n)$.

13-14 UCB1 (Upper Confidence Bound) Algorithm (Auer et al.)
Suppose $X_k \in [0, b]$.
- First: play each arm once.
- After: at round n, play the arm k maximizing
$$\bar{X}_{k,n} + b\sqrt{\frac{2\log n}{T_k(n-1)}}.$$

Remarks
- Every arm is ultimately played infinitely many times: otherwise, if arm k were played only finitely many times, then $b\sqrt{\frac{2\log n}{T_k(n-1)}} \to \infty$ as $n \to \infty$.
- At round n, any arm $k \neq k^*$ satisfies $T_k(n) = O(\log n)$.

15-16 Regret Bound for UCB1
$$\mathbb{E}R_n \leq \sum_{k \neq k^*} \left( \frac{256\, V_k}{\Delta_k} + 8\,\Delta_k \right) \log n + O(1),$$
where $V_k$ denotes the variance of arm k.

UCB-V Algorithm (Audibert et al.): a refined version using empirical variances.
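
A minimal sketch of the UCB1 rule stated above, assuming Bernoulli rewards in [0, 1] (so b = 1); the function and variable names are illustrative, not from the talk:

```python
import math
import random

def ucb1(arm_means, n_rounds, b=1.0):
    """Play each arm once, then at round n play the arm maximizing
    X_bar_k + b * sqrt(2 * log(n) / T_k(n-1)), as on the slide above.
    Rewards are drawn Bernoulli(mu_k) here, one simple choice within [0, b]."""
    K = len(arm_means)
    T = [0] * K            # T_k: number of plays of arm k so far
    S = [0.0] * K          # running reward sums, so S[k] / T[k] is X_bar_k

    def pull(k):
        return 1.0 if random.random() < arm_means[k] else 0.0

    for k in range(K):                              # initialization: each arm once
        T[k] += 1
        S[k] += pull(k)
    for n in range(K + 1, n_rounds + 1):
        k = max(range(K),
                key=lambda i: S[i] / T[i] + b * math.sqrt(2 * math.log(n) / T[i]))
        T[k] += 1
        S[k] += pull(k)
    return T

# Example: with means [0.4, 0.6], almost all plays should go to the second arm,
# the first one receiving only O(log n) plays.
# print(ucb1([0.4, 0.6], n_rounds=10000))
```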

17 Classical Tree Search and the UCT Algorithm

18-19 Tree Search Settings
- Structure of the tree: one root, internal nodes, leaves.
- The value is only known at the leaves.
- The value of any node is a function of the values of its child-nodes (max tree / minimax tree).
- The value of any node can be computed in an iterative way.
- The problem arises when the iterative search cannot be completed.
(Figures: an example of a minimax tree; the case when the search cannot be completed.)
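
To make the bottom-up value computation concrete, here is a generic minimax sketch on a tiny hand-built tree (illustrative code, not from the talk):

```python
# Minimal minimax sketch: leaves carry known values, internal nodes alternate
# between maximizing and minimizing over their children.

def minimax(node, maximizing=True):
    if not node.get("children"):              # leaf: value is known
        return node["value"]
    child_values = [minimax(c, not maximizing) for c in node["children"]]
    return max(child_values) if maximizing else min(child_values)

# Tiny example: the root's value is determined entirely by the leaf values.
tree = {"children": [
    {"children": [{"value": 3}, {"value": 5}]},   # min node -> 3
    {"children": [{"value": 2}, {"value": 9}]},   # min node -> 2
]}
# print(minimax(tree))  # -> 3
```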

20 Instead of exhaustively searching each branch to get the exact value of each node, we estimate the values:
- Model each node as a bandit machine.
- At each node, decide between exploration and exploitation with a bandit algorithm.
- The more a node is exploited, the more precise its (estimated) value becomes.

Exploration and Exploitation at Each Node
- Exploitation: descend to a node with a promising value.
- Exploration: descend to a node to get more information.


25 UCT (UCB for Trees) Algorithm (Kocsis and Szepesvári, 2006)
- Start from the root.
- Loop until arriving at a leaf: choose a child-node according to UCB and descend.
- Get the value of the leaf.
- Update all visited nodes with that value.
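
A compact sketch of one such simulation, assuming a simple node structure with visit counts and value sums (illustrative code; the exploration constant is omitted, as in the formula on the next slide):

```python
import math

class Node:
    def __init__(self, children=None, leaf_value=None):
        self.children = children or []        # empty for a leaf
        self.leaf_value = leaf_value          # callable returning a sampled value at a leaf
        self.visits = 0                       # n_i
        self.total = 0.0                      # running sum, so total / visits is X_bar_i

def uct_iteration(root):
    """One simulation: descend by UCB until a leaf, read its value,
    then update every node on the visited path with that value."""
    path, node = [root], root
    while node.children:                      # descend until arriving at a leaf
        n_i = max(node.visits, 1)
        node = max(node.children,
                   key=lambda c: float("inf") if c.visits == 0 else
                   c.total / c.visits + math.sqrt(math.log(n_i) / c.visits))
        path.append(node)
    value = node.leaf_value()                 # e.g. a Monte-Carlo playout result
    for n in path:                            # back up the value to all visited nodes
        n.visits += 1
        n.total += value
    return value
```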

26-27 Formalization of UCT
- The number of visits of node i: $n_i$.
- The value of leaf j: $\bar{X}_{j,n_j} = \frac{1}{n_j} \sum_{k=1}^{n_j} X_{j,k}$.
- The set of child-nodes of node i: $C(i)$.
- The set of leaves of the branch starting at node i: $L(i)$.
- The value of each node (an estimate of its true value): $\bar{X}_{i,n_i} = \frac{1}{n_i} \sum_{j \in L(i)} n_j \bar{X}_{j,n_j}$.
- UCB at node i: play $\hat{j}$ such that $\hat{j} \in \arg\max_{j \in C(i)} \left\{ \bar{X}_{j,n_j} + \sqrt{\frac{\log n_i}{n_j}} \right\}$.

28 Remarks
- The value of each node estimates and converges to its true value.
- Under a smoothness assumption, a fast convergence rate is expected.
- No cut (pruning): every node will ultimately be visited.
- The tree is explored in an asymmetric way.
- The order of exploration is always the key point.

29 Quick Introduction to the Go Game and Computer-Go

30 A Quick Introduction to the Game of Go

31 A Quick Introduction to the Go Game
- Go board (goban): 19 x 19 (pedagogical Go board: 9 x 9).
- Black and White play alternately; Black starts the game.
- Adjacent stones of the same color form a string. Liberties are the empty intersections next to the string.
- Stones do not move; they are only added to and removed from the board. A string is removed iff its number of liberties is 0.
- Score: territory (number of occupied or surrounded intersections).
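
The capture rule is easy to state in code: a string is the connected group of same-colored stones, and it is removed exactly when it has no liberties. A small flood-fill sketch, with the board encoded as a dictionary from intersections to colors (an assumed representation, not from the talk):

```python
# `board` maps (row, col) -> 'B' or 'W'; empty intersections are absent.

def neighbors(p, size=9):
    r, c = p
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < size and 0 <= c + dc < size]

def string_and_liberties(board, p, size=9):
    """Return the string containing stone p and its liberties (flood fill)."""
    color, string, liberties, stack = board[p], {p}, set(), [p]
    while stack:
        for q in neighbors(stack.pop(), size):
            if q not in board:
                liberties.add(q)              # empty neighbor: a liberty
            elif board[q] == color and q not in string:
                string.add(q)
                stack.append(q)
    return string, liberties                  # the string is captured iff liberties is empty
```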

32 History of Computer-Go
- Beginning of Computer-Go: 1970s.
- Classical methods: expert-knowledge-based evaluation function; minimax tree search.
- Comparison with chess. Chess: Deep Blue won against Kasparov, 1997. Go: the strongest programs were about 10 kyu until 2006 (amateurs of good level can win with a 9-stone handicap).
- 2006: UCT introduced in Computer-Go. Today the best programs are around 1 dan.

33 Difficulties in Computer-Go
- Techniques developed for Computer-Chess do not work for Computer-Go.
- Huge branching factor: about 200 (chess: about 40); large depth: > 300.
- Number of legal positions (J. Tromp and G. Farnebäck, 2006): about $10^{170}$ on 19 x 19, about $10^{38}$ on 9 x 9.
- A good evaluation function is difficult to build.

34-35 Minimax tree structure (figure). Given the huge tree size (depth > 300), what to do with the nodes that are rarely visited?

36 Memory Management (R. Coulom, 2006)
- Idea: no need to save the nodes that are rarely visited.
- The tree (in memory) starts with only the root node.
- In each simulation (path), save the first node not yet in the tree.
- The rest of the path is not saved and is chosen randomly.
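
A sketch of this memory rule, assuming a hypothetical game interface (legal_moves, play, playout) and hashable states; the in-tree move selection is simplified to a random choice here, where a real program would use UCB:

```python
import random

def run_simulation(tree, root_state, legal_moves, play, playout):
    """Descend through states already stored in `tree`, add only the first
    state met that is not yet stored, then finish the game with an unstored
    random playout (one new node saved per simulation)."""
    state, path = root_state, [root_state]
    while state in tree and legal_moves(state):
        state = play(state, random.choice(legal_moves(state)))
        path.append(state)
    tree.setdefault(state, {"visits": 0, "wins": 0.0})   # save one new node
    result = playout(state)                              # rest of the path is not saved
    for s in path:                                       # update stored nodes on the path
        if s in tree:
            tree[s]["visits"] += 1
            tree[s]["wins"] += result
    return result
```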

37-38 Memory Management (R. Coulom, 2006) (cont'd)
- Monte-Carlo evaluation function for each node (B. Brügmann, 1993): score (0 or 1).
- A random move order works better than a fixed order. How to improve it (to find an intelligent order)?
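
A minimal sketch of such a Monte-Carlo evaluation, again with the game mechanics (legal_moves, play, is_over, black_wins) left as an assumed interface:

```python
import random

def monte_carlo_value(state, legal_moves, play, is_over, black_wins, n_playouts=100):
    """Estimate a position's value as the fraction of random playouts won,
    each playout scored 0 or 1 as on the slide."""
    wins = 0
    for _ in range(n_playouts):
        s = state
        while not is_over(s):
            s = play(s, random.choice(legal_moves(s)))   # uniformly random move order
        wins += 1 if black_wins(s) else 0                # score of one playout: 0 or 1
    return wins / n_playouts                             # estimated winning probability
```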

39 One trend of Computer-Go since 2006, motivated by Crazy Stone (R. Coulom, 2006) and MoGo (Y. Wang and S. Gelly, 2007). MoGo won the gold medal in Go at the Computer Olympiad 2007.

40-41 Why is it efficient compared to Alpha-Beta?
- Alpha-Beta never reconsiders a cut: dangerous with random rewards and no accurate evaluation function.
- Estimation vs. computation.
- Efficient tree exploration: breadth-first search; move ordering efficiently managed (for often-visited nodes); asymmetric growth.
- Anytime.

Remark: UCT is far from the final solution. The tree search algorithms currently used in the best programs differ from the one presented today.

42-43 Improvement of Random Plays by Patterns (examples of patterns shown as figures).

44-48
- How to improve the quality of the random simulations? Using hand-made patterns (Y. Wang and S. Gelly, 2007).
- How to learn patterns off-line? Collecting and ranking patterns (R. Coulom, 2007).
- How to get information on-line? Learning from previous simulations (S. Gelly and D. Silver, 2007). The order of search is important!
- How to find an evaluation function of high quality?
- Other approaches: parallelization (MPI MoGo).

49 References on the Bandit Problem
- P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2/3):235-256, 2002.
- J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning Bandit Algorithms in Stochastic Environments. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, ALT, volume 4754 of Lecture Notes in Computer Science. Springer, 2007.
- L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning (ECML), 2006.
- P.-A. Coquelin and R. Munos. Bandit Algorithms for Tree Search. 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.

50 References on Computer-Go
- Y. Wang and S. Gelly. Modification of UCT with Patterns for Monte-Carlo Go. IEEE Symposium on Computational Intelligence and Games, 2007.
- S. Gelly and D. Silver. Combining Online and Offline Knowledge in UCT. ICML 2007, Corvallis, Oregon, USA.
- R. Coulom. Computing Elo Ratings of Move Patterns in the Game of Go. In H. Jaap van den Herik, Jos W. H. M. Uiterwijk, Mark Winands, and Maarten Schadd, editors, Computer Games Workshop, Amsterdam, The Netherlands, 2007.

51 Acknowledgment
Thanks to Rémi Munos, Sylvain Gelly and Rémi Coulom.

52 Thank you!
