Learning in 3-Player Kuhn Poker

University of Manchester

Learning in 3-Player Kuhn Poker

Author: Yifei Wang
3rd Year Project Final Report
Supervisor: Dr. Jonathan Shapiro
April 25, 2015

Abstract

This report describes how an ɛ-Nash Equilibrium of 3-Player Kuhn Poker, a simple three-player hidden-information game, was found. Temporal Difference Learning and the Policy Hill-Climbing algorithm were first applied to sample games: a 2-player perfect-information game, a 3-player perfect-information game and a 2-player hidden-information game. These tests showed that Policy Hill-Climbing is the better of the two and can be applied to a 3-player hidden-information game. When this algorithm was applied to 3-Player Kuhn Poker, an ɛ-Nash Equilibrium was obtained.

Contents

1 Introduction
  1.1 Overview of Project
  1.2 Overview of Kuhn Poker and 3-Player Kuhn Poker
2 Game Theory
  2.1 Nash Equilibrium
  2.2 ɛ-Nash Equilibrium
  2.3 Other Important Concepts in Game
  2.4 Game Tree
  2.5 Strategy
3 Reinforcement Learning in Games
  3.1 Reinforcement Learning
  3.2 Temporal-Difference Learning
  3.3 Policy Hill-Climbing Algorithm
  3.4 WoLF Policy Hill-Climbing Algorithm
4 Algorithm Test
  4.1 Tic-Tac-Toe
  4.2 Gunmen Dilemma
  4.3 2-Player Kuhn Poker
5 Design and Implementation
  5.1 Structure
  5.2 State
  5.3 AI
  5.4 Game
6 Result and Evaluation
  6.1 Result
  6.2 Evaluation
    6.2.1 Mathematical Calculation
    6.2.2 Best Response Comparison
7 Conclusion
  7.1 Achievement
  7.2 Future Works

List of Figures

1  Game Tree for (2-Player) Kuhn Poker
2  Game Tree for 3-Player Kuhn Poker
3  One Possible Trace for Tic-Tac-Toe
4  Tic-Tac-Toe board
5  Test Result for Tic-Tac-Toe using TD-learning
6  Game Tree for Gunmen Dilemma
7  The convergence of the strategy of player 1 when using TD-learning in Kuhn Poker
8  The convergence of the strategies generated when using PHC in Kuhn Poker
9  Game Tree for 3-Player Kuhn Poker with Labels
10 System diagram of 3-Player Kuhn Poker
11 The convergence of the expected payoff for 3 players in 3-Player Kuhn Poker

List of Tables

1  Pseudo code of Temporal Difference Learning
2  Pseudo code of Policy Hill-Climbing
3  Pseudo code of WoLF Policy Hill-Climbing
4  Test result of Tic-Tac-Toe using TD-learning
5  Test result of Gunmen Dilemma (Policy when all alive)
6  Test result of Gunmen Dilemma (Simulation results)
7  Strategies in Kuhn Poker
8  Strategies generated by TD-learning in Kuhn Poker
9  Strategies generated by WoLF PHC in Kuhn Poker
10 Information in Decision Nodes of 3-Player Kuhn Poker
11 Information in Decision Nodes of 3-Player Kuhn Poker (Simplified)
12 Pseudo code of the game process of 3-Player Kuhn Poker
13 Pseudo code of the learning process of 3-Player Kuhn Poker
14 Strategies generated by WoLF PHC in 3-Player Kuhn Poker
15 Formula table for the probability to reach each ending node
16 Expected payoff for each player
17 Equilibrium of 3-Player Kuhn Poker
18 Expected Payoff and Best Response Payoff

1 Introduction

1.1 Overview of Project

The aim of this project is to use machine learning to develop AIs that play 3-Player Kuhn Poker well; more precisely, AIs whose strategies lead the game to a Nash Equilibrium. However, since machine learning does not give an exact result, finding an ɛ-Nash Equilibrium with ɛ < 0.01 is a suitable goal for this project. Developing AIs for games can also provide useful information, such as how to maximise a player's payoff in gambling and whether a game is fair or banker-favoured.

1.2 Overview of Kuhn Poker and 3-Player Kuhn Poker

Kuhn Poker is a simplified poker game introduced by Harold W. Kuhn. [7] It is often used to test learning algorithms because it has a very simple set of rules:

- Each player antes 1 chip before the cards are dealt.
- Each player is dealt one card from a deck of 3 ranked cards.
- If there is no outstanding bet, a player can check or bet 1 chip.
- If there is an outstanding bet, a player can fold or call.
- The betting round ends when there is an outstanding bet and both players have decided whether to join it, or when both players have checked.
- If there was a bet, the betting players show their cards and the one holding the highest card wins all chips in the pot. (Thus if one player folds, the other betting player is treated as holding the highest card and takes the pot.)
- If there was no bet, both players show their cards and the one holding the highest card wins all chips.

Like the poker games that are usually played, Kuhn Poker requires players to bet, check, fold and call, and the players who bet compare their cards at the end. However, to simplify the rules, each player holds only 1 card and the comparison is based purely on the rank of that card, so players do not have to apply complex rules to evaluate their hands. Only one round of betting is allowed and a player cannot raise, which shortens the game. 3-Player Kuhn Poker applies similar rules:

- Each player antes 1 chip before the cards are dealt.
- Each player is dealt one card from a deck of 4 ranked cards.
- If there is no outstanding bet, a player can check or bet 1 chip.
- If there is an outstanding bet, a player can fold or call.
- The betting round ends when there is an outstanding bet and the next player to act has already bet, or when there is no outstanding bet and the next player to act has already checked.
- If there was a bet, the betting players show their cards and the one holding the highest card wins all chips in the pot.
- If there was no bet, all players show their cards and the one holding the highest card wins all chips.

For convenience, in this report 2-Player Kuhn Poker uses a deck containing J, Q and K with K > Q > J, and 3-Player Kuhn Poker uses a deck containing J, Q, K and A with A > K > Q > J.

Both 2-Player and 3-Player Kuhn Poker are only models; nobody would play them in the real world. They are commonly used in game theory to test algorithms on hidden-information games, because their rules are so simple. Any theory or algorithm that applies to hidden-information games in general should also apply to 2-Player or 3-Player Kuhn Poker, so applying to Kuhn Poker is a prerequisite for a theory or algorithm that is meant to apply to all hidden-information games. 3-Player Kuhn Poker was also one of the games in the Annual Computer Poker Competition [1] in 2014 and 2015, which encourages researchers to solve it.

2-Player Kuhn Poker has already been solved by Harold W. Kuhn: it has a Nash Equilibrium strategy for both players. [7] For 3-Player Kuhn Poker, the best attempt so far is by Risk and Szafron, who obtained strategies that lead to an ɛ-Nash Equilibrium with a very small ɛ. [9] Besides Kuhn Poker, Leduc Hold'em and Texas Hold'em are also hidden-information games that can be used for testing. They are much more complex than Kuhn Poker, and an algorithm that performs well on Kuhn Poker may still perform poorly on them. [9]

2 Game Theory

2.1 Nash Equilibrium

A game in game theory has 7 main elements: players, actions, information, strategies, payoffs, outcomes and equilibria. [8] The players are the individuals who make decisions, and their goal is to make their payoff as large as possible. They therefore adjust their strategies to increase their expected payoff, and eventually the players drive the game to a situation in which no player can improve his payoff by changing his strategy. This situation is called a Nash Equilibrium. In other words, a Nash Equilibrium is a strategy set (s_1, s_2, s_3, ..., s_k) for k players such that no player has an incentive to deviate from his strategy given that the other players do not deviate. [8] It can also be written mathematically:

    \forall i: \pi_i(s_i, s_{-i}) \ge \pi_i(s_i', s_{-i}) \quad \forall s_i'   [8]

where \pi_i is the payoff of player i, s_i is the strategy followed by player i, and s_{-i} are the strategies followed by his opponents.

2.2 ɛ-Nash Equilibrium

In some games it is not possible to find strategies that lead to an exact Nash Equilibrium, but we can find strategies that lead the game close to one, in the sense that no player can improve his payoff by more than ɛ by changing his strategy. This is called an ɛ-Nash Equilibrium. Written mathematically:

    \forall i: \pi_i(s_i, s_{-i}) + \epsilon \ge \pi_i(s_i', s_{-i}) \quad \forall s_i'

Obviously, when ɛ = 0 this is a Nash Equilibrium. The aim of this project is to find an ɛ-Nash Equilibrium for 3-Player Kuhn Poker with ɛ < 0.01.

2.3 Other Important Concepts in Game

A zero-sum game is a game in which the sum of the payoffs of all players is zero whatever strategies they choose. A game that is not zero-sum is called a non-zero-sum game. [8]

A hidden-information game is a game in which at least one player does not know all the information in the game. A game in which all information is known by every player is a perfect-information game. The information set of a player in a hidden-information game contains all the possible states that the player cannot distinguish by direct observation.

Nature is a non-player actor who takes actions with specified probabilities at specified points. [8] In a card game, nature is usually the deck, which performs the dealing process with a uniform distribution.

With the rules described in 1.2, 3-Player Kuhn Poker is a zero-sum hidden-information game: the sum of the payoffs of the 3 players is always 0, and players do not know the cards the others hold.

2.4 Game Tree

A game tree is a tree that contains all possible strategy combinations in a game. A node in the tree is a point at which either a player or nature takes an action. A node with no successors is an ending node, where the game ends and the payoffs are given to all players.

Figure 1: Game Tree for (2-Player) Kuhn Poker

In the game trees of 2-Player and 3-Player Kuhn Poker, the branches for the different card-dealing combinations share the same structure, so the nodes in the same position of all branches in which a player holds the same card form one information set of that player.

Figure 2: Game Tree for 3-Player Kuhn Poker

We can obtain a lot of information from the game tree. For example, from the game tree of 3-Player Kuhn Poker we can infer that each player has at most 2 decision nodes in a single round, and that there are 13 different ending nodes for each card combination. Since there are 24 different card combinations, there are 13 × 24 = 312 different ending nodes in 3-Player Kuhn Poker. This is a large number compared to the 2-player game, which has only 5 × 6 = 30.

A game tree is a common tool for representing a game and makes the process of the game clear. When the game tree is combined with the payoffs at each ending node, it becomes the extensive-form representation of the game. Most games described in this report are shown with their game trees.

2.5 Strategy

A strategy is a rule that tells the player which action should be chosen at each instant of the game. [8] In some solved games it is reasonable for a player always to take the same action in the same situation. For example, in Tic-Tac-Toe, player 1 can always choose the centre at the beginning and he will not lose. This kind of strategy is called a pure strategy.

However, in some hidden-information games, such as poker games, if a player always takes the same action, his opponent may infer which card he is holding. A player therefore needs to bluff by choosing his action according to a probability distribution, so that even in the same state he may choose different actions and the opponent cannot infer the card he holds from the actions performed. This kind of strategy is called a mixed strategy. And, of course, to play 3-Player Kuhn Poker a mixed strategy is required.

3 Reinforcement Learning in Games

3.1 Reinforcement Learning

Reinforcement learning is an area of machine learning that is often used in game theory research. In supervised learning, the machine learns from examples given by a supervisor, so the performance depends on the quality of those examples, and supervised learning usually performs badly on interactive problems. In reinforcement learning, the machine adjusts its actions based on the environment and on how it performed before in the same situation, so it is more effective for interactive problems such as card games. [10]

There are 4 main elements in reinforcement learning: a policy, a reward function, a value function and a model of the environment.

A policy defines the learning agent's way of behaving at a given time. [10] For a game, a policy is the probability distribution over the different actions at a given decision node, and all these policies for one player together form that player's strategy.

A reward function defines the goal in a reinforcement learning problem. [10] For a game, it gives the payoff of a player for the current state and the actions that player has taken.

A value function specifies what is good in the long run. "The value of a state is the total amount of the reward an agent can expect to accumulate over the future, starting from that state." [10] Generally speaking, the value of a state shows how much payoff the player can expect to receive eventually if he visits that state.

A model of the environment is something that mimics the behaviour of the environment; for example, given a state and an action, the model should predict the resultant next state and next reward. [10] For a game, that is the game rule: from the rules we can easily predict the next state and the reward if there is one. However, the model is an optional element of a reinforcement learning algorithm; not every method needs a model to tell it how the game works, meaning that some methods just use experience to learn the best way to solve the problem.

When reinforcement learning is applied to a game, a large number of rounds are played, and the policies of the players at all decision nodes are then updated by adjusting the value function to make more accurate predictions. Each player thereby tries to obtain the largest payoff at each node, and a Nash Equilibrium can be reached.

3.2 Temporal-Difference Learning

Temporal-Difference Learning (TD-learning) is a common algorithm in reinforcement learning. It takes ideas from both Monte Carlo methods and dynamic programming. [10] TD-learning does not need a model; it adjusts the value function from the final payoff it receives. The value V(s) of state s is updated by the following formula:

    V(s) := V(s) + α[r + γV(s') − V(s)]

where s' is the next state, r is the payoff (usually 0 unless it is the final state), α ∈ [0, 1] is the learning rate, and γ ∈ [0, 1] is the discount factor, which says how important the value of the next state is. The pseudo code of the learning phase is shown below:

    for all states s do
        V(s) := 0
    end for
    repeat
        do self-play with the policy derived from the state values,
        and record all visited states S and the reward r
        for all s in S do
            V(s) := V(s) + α[r + γV(s') − V(s)]
        end for
    until enough rounds of self-play

Table 1: Pseudo code of Temporal Difference Learning

A naive way to derive the policy is always to choose the next state with the highest value among all possible next states; however, this results in a pure strategy. A more appropriate approach is to use the Boltzmann distribution:

    π(s) = e^{V(s)/τ} / Σ_{i=1}^{n_s} e^{V(s_i)/τ}

where π(s) is the probability of choosing s as the next state, n_s is the number of possible next states, and τ is a constant. The machine then chooses a high-value state with high probability, but still has a chance to choose a low-value state. Even with this distribution, however, there is no guarantee that the output will be a mixed strategy. A minimal sketch of the update and the Boltzmann policy is given below.
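To make the update rule and the Boltzmann policy concrete, the following Python sketch shows one way they could be implemented. It is only an illustration of the formulas above, not the code used in this project; the function names and the dictionary-based value table are assumptions.

```python
import math
import random
from collections import defaultdict

def td_update(V, episode, reward, alpha=0.1, gamma=1.0):
    """Apply V(s) := V(s) + alpha * [r + gamma * V(s') - V(s)] along one
    episode (a list of visited states). Only the last transition carries
    the reward; intermediate rewards are 0, as in Table 1."""
    for i, s in enumerate(episode):
        if i + 1 < len(episode):
            target = 0.0 + gamma * V[episode[i + 1]]
        else:
            target = reward
        V[s] += alpha * (target - V[s])

def boltzmann_choice(V, candidates, tau=1.0):
    """Choose a next state with probability proportional to e^{V(s)/tau}."""
    weights = [math.exp(V[s] / tau) for s in candidates]
    return random.choices(candidates, weights=weights)[0]

V = defaultdict(float)   # value table, V(s) = 0 initially
```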

3.3 Policy Hill-Climbing Algorithm

The Policy Hill-Climbing algorithm (PHC) is an extension of Q-learning introduced by Michael Bowling and Manuela Veloso. [3] In order to obtain a mixed strategy, PHC does not derive the policy from the values of the possible next states. Instead, PHC attaches values to actions, so a value in PHC is the expected payoff of taking that action, and each action also carries a probability; the probabilities of all possible actions in one state sum to 1. PHC therefore outputs a mixed strategy. The pseudo code of PHC is shown below:

    Let α and δ be learning rates.
    for all states s and actions a do
        Q(s, a) := 0
        π(s, a) := 1/|A_i|
    end for
    repeat
        a. From state s select action a with probability π(s, a), with some exploration
        b. Observe reward r and next state s', then
               Q(s, a) := (1 − α)Q(s, a) + α(r + γ max_{a'} Q(s', a'))
        c. Update π(s, a) and constrain it to a legal probability distribution:
               π(s, a) := π(s, a) + δ              if a = argmax_{a'} Q(s, a')
               π(s, a) := π(s, a) − δ/(|A_i| − 1)   otherwise
    until enough rounds of self-play

Table 2: Pseudo code of Policy Hill-Climbing

From the pseudo code we can see that PHC updates its values in a similar way to TD-learning, and that at each step it moves its policy a little further towards the action with the largest value. A sketch of such an agent is given below.
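The following Python sketch shows what an agent following Table 2 might look like. It is a minimal illustration rather than the project's implementation; the class name, the ɛ-greedy exploration and the dictionary-based tables are assumptions.

```python
import random
from collections import defaultdict

class PHCAgent:
    """Policy Hill-Climbing as in Table 2, with tabular Q values and policy."""

    def __init__(self, actions, alpha=0.1, delta=0.01, gamma=1.0):
        self.actions = actions
        self.alpha, self.delta, self.gamma = alpha, delta, gamma
        self.Q = defaultdict(float)                          # Q(s, a)
        self.pi = defaultdict(lambda: 1.0 / len(actions))    # pi(s, a)

    def act(self, s, epsilon=0.05):
        # step a: sample from the mixed policy, with epsilon exploration
        if random.random() < epsilon:
            return random.choice(self.actions)
        weights = [self.pi[(s, a)] for a in self.actions]
        return random.choices(self.actions, weights=weights)[0]

    def update(self, s, a, r, next_state_values=()):
        # step b: Q-learning style value update
        best_next = max(next_state_values) if next_state_values else 0.0
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + \
                         self.alpha * (r + self.gamma * best_next)
        # step c: hill-climb towards the greedy action, then renormalise
        greedy = max(self.actions, key=lambda b: self.Q[(s, b)])
        for b in self.actions:
            step = self.delta if b == greedy else -self.delta / (len(self.actions) - 1)
            self.pi[(s, b)] = min(1.0, max(0.0, self.pi[(s, b)] + step))
        total = sum(self.pi[(s, b)] for b in self.actions)
        for b in self.actions:
            self.pi[(s, b)] /= total
```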

3.4 WoLF Policy Hill-Climbing Algorithm

In testing, plain PHC does not perform well: the policy needs a very long time to converge. Bowling and Veloso therefore introduced an improved version, the WoLF Policy Hill-Climbing algorithm. [3] The WoLF (Win-or-Learn-Fast) principle is simple: the agent should learn quickly while losing and slowly while winning. It is not suitable to use the game rules to decide whether an agent is losing, because many games are unfair and some players may be at a disadvantage. WoLF Policy Hill-Climbing therefore maintains an average policy, and if an agent gets more payoff than it would by playing the average policy, the agent is considered to be winning, and vice versa. The pseudo code of WoLF PHC is shown below:

    Let α and δ_l > δ_w be learning rates.
    for all states s and actions a do
        Q(s, a) := 0
        π(s, a) := 1/|A_i|
        C(s) := 0
    end for
    repeat
        a. From state s select action a with probability π(s, a), with some exploration
        b. Observe reward r and next state s', then
               Q(s, a) := (1 − α)Q(s, a) + α(r + γ max_{a'} Q(s', a'))
        c. Update the estimate of the average policy π̄:
               C(s) := C(s) + 1
               for all a' in A_i: π̄(s, a') := π̄(s, a') + (1/C(s)) (π(s, a') − π̄(s, a'))
        d. Update π(s, a) and constrain it to a legal probability distribution:
               π(s, a) := π(s, a) + δ              if a = argmax_{a'} Q(s, a')
               π(s, a) := π(s, a) − δ/(|A_i| − 1)   otherwise
           where
               δ = δ_w   if Σ_a π(s, a)Q(s, a) > Σ_a π̄(s, a)Q(s, a)
               δ = δ_l   otherwise
    until enough rounds of self-play

Table 3: Pseudo code of WoLF Policy Hill-Climbing
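Only steps c and d differ from plain PHC, so a WoLF variant of the earlier sketch needs just the average-policy bookkeeping and the choice of step size. The method below is an illustrative extension of the hypothetical PHCAgent class above (assuming its __init__ also creates self.C = defaultdict(int) and self.avg_pi = defaultdict(lambda: 1.0 / len(actions))).

```python
def wolf_delta(self, s, delta_w=0.01, delta_l=0.04):
    """Return the step size for state s according to the WoLF principle:
    learn slowly (delta_w) while 'winning', i.e. while the current policy
    scores better than the average policy under the current Q values."""
    # step c: update the visit count and the running average policy
    self.C[s] += 1
    for a in self.actions:
        self.avg_pi[(s, a)] += (self.pi[(s, a)] - self.avg_pi[(s, a)]) / self.C[s]
    current = sum(self.pi[(s, a)] * self.Q[(s, a)] for a in self.actions)
    average = sum(self.avg_pi[(s, a)] * self.Q[(s, a)] for a in self.actions)
    return delta_w if current > average else delta_l
```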

4 Algorithm Test

In order to choose a suitable algorithm, some tests should first be done on other games. The 3 tests here show whether an algorithm can be applied, and performs well, on a 2-player perfect-information game, a 3-player perfect-information game and a 2-player hidden-information game. Ideally there would also be a test on another 3-player hidden-information game, the type of game that 3-Player Kuhn Poker is, but it is hard to find a suitable one, since 3-Player Kuhn Poker is itself one of the simplest 3-player hidden-information games.

4.1 Tic-Tac-Toe

Tic-Tac-Toe (or Noughts and Crosses) is a simple 2-player game. The two players draw O and X in a 9-cell board in turn, and the one who puts 3 of his symbols in a line wins.

Figure 3: One Possible Trace for Tic-Tac-Toe

Tic-Tac-Toe has already been solved, so the test here is to check whether the algorithm outputs the optimal strategy, which is to occupy the centre first and then a corner for player 1, and to occupy a corner first and then block player 1 for player 2. If both players play the optimal strategies, the game is a draw, and that is the Nash Equilibrium of Tic-Tac-Toe.

When using TD-learning, the first thing to decide is how many states there are in this game. There are 9 cells on the board, and each of them can be empty, occupied by player 1 or occupied by player 2, so there are at most 3^9 = 19683 states. Some of these states are not reachable, such as all cells being occupied by player 1, but for convenience of looking up a specific state among all states, and because this amount is still acceptable, all of them are used. It also matters whose turn it is next, but this can be inferred from the numbers of O and X on the board, so the state does not need to include it. The cells of the board are numbered as shown in Figure 4.

Figure 4: Tic-Tac-Toe board

A sketch of this state encoding is given below. The test result for TD-learning is shown after it, supposing each player always chooses the cell with the highest probability.
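As an illustration of indexing the 3^9 board states, a board can be encoded as a base-3 integer, one digit per cell (0 = empty, 1 = player 1, 2 = player 2). This is only a sketch of the counting argument above, not the project's code; the cell ordering follows the numbering of Figure 4 and is otherwise an assumption.

```python
def board_to_index(board):
    """Encode a board (list of 9 cells with values 0/1/2, in the order of
    the cell numbering in Figure 4) as an integer in [0, 3^9 - 1]."""
    index = 0
    for cell in board:
        index = index * 3 + cell
    return index

def index_to_board(index):
    """Inverse mapping: recover the 9 cell values from the index."""
    cells = []
    for _ in range(9):
        cells.append(index % 3)
        index //= 3
    return list(reversed(cells))

# Example: player 1 has taken the centre (cell 5), everything else is empty.
board = [0, 0, 0, 0, 1, 0, 0, 0, 0]
assert index_to_board(board_to_index(board)) == board
```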

[Table 4: Test Result of Tic-Tac-Toe using TD-learning — for each cell, the probability of choosing it at the 1st to 4th steps for player 1 and player 2; the numeric entries are omitted here.]

From the test data we can see that player 1 almost always chooses cell 5, the centre of the board, at the first step. If player 1 does so, player 2 chooses cell 7 or cell 9, the corners of the board, with high probability. Player 1 then chooses another corner, giving him 2 of his symbols in one line, and player 2 has only one choice, to block it. Both of them then keep blocking each other until the 4th step. The following figure shows the whole flow of the game if both players always choose the move with the highest probability.

Figure 5: Test Result for Tic-Tac-Toe using TD-learning

As shown, they finally draw, which is the Nash Equilibrium of this game. This means that using TD-learning on Tic-Tac-Toe makes both players play the optimal strategies and finally leads to the Nash Equilibrium.

4.2 Gunmen Dilemma

The Gunmen Dilemma (or truel) is a duel between three players (say A, B and C). A truel can have many different rules; the rule used in this section is a sequential (fixed-order) rule with some extra rules [6]:

- The players fire one at a time in a fixed, repeating sequence, such as A, B, C, A, B, C, A, B, ...

- If it is a dead player's turn to shoot, it is skipped.
- Players may shoot at others or shoot into the air.
- Each player has infinite bullets.
- The hit rates of A, B and C are 1/3, 1/2 and 1 respectively.
- To avoid an infinite game, C, the one who never misses, is not allowed to shoot into the air.
- The player who survives to the end is the winner.

Note: this game is not as violent as it is described.

Figure 6: Game Tree for Gunmen Dilemma

The game tree above shows that this game is fairly complex and that many states can turn into one another. The reason the Gunmen Dilemma is chosen as the 3-player perfect-information test game is that it has nature, which decides whether a player's shot hits or not. So, besides checking that an algorithm works on a 3-player game, it also checks whether the algorithm works on a game with nature. In this game, A has the highest probability of surviving if he uses the optimal strategy, which is to shoot into the air first. [12] A simple proof of the optimal strategy is as follows [2]:

a. If A shoots at B and hits, then C can simply shoot at A and win the game.
b. If A shoots at C and hits, then it becomes a duel between A and B with B shooting first.
c. So if A decides to shoot at someone, shooting at C is the better choice.
d. Shooting into the air and shooting but missing both lead to the same situation: all 3 players are still alive and B shoots next. If B decides to shoot at someone, A would not be his first choice, since A has a low hit rate, by the same reasoning as in c.
   (a) If B shoots at C and hits, it becomes a duel between A and B with A shooting first.
   (b) If B misses or shoots into the air, it is C's turn. By the same reasoning as in c., C shoots at B and hits, and it then becomes a duel between A and C with A shooting first.
e. Comparing the results of b. and d., A should choose d., since in a duel the first one to shoot has a huge advantage.

So the test for this game is to see whether an algorithm produces the optimal strategy for A (i.e. shooting into the air first) and whether A then has the highest probability of winning. Just as for Tic-Tac-Toe, we first need to determine the number of states and what a state should contain. From the game tree it can be seen that there are 9 states in total; listing who is alive and who shoots next is one possible way to specify them. The result of applying TD-learning is shown below:

    Target   Probability for A   Probability for B   Probability for C
    A        N/A                 …                   0.00
    B        0.00                N/A                 1.00
    C        0.00                …                   N/A
    air      1.00                …                   N/A

Table 5: Test result of Gunmen Dilemma (policy when all three are alive)

We can see that, with this strategy, A always shoots into the air (the probability rounds to 1.00 at 2 significant figures).
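The simulation reported in the next table can be reproduced with a short Monte Carlo sketch like the one below. It assumes the hit rates given above and the learned policy from Table 5 (A fires into the air while all three are alive; every other shooter aims at the strongest living opponent); it is only an illustration, not the project's simulator.

```python
import random

HIT = {"A": 1 / 3, "B": 1 / 2, "C": 1.0}

def choose_target(shooter, alive):
    """Policy from Table 5: A fires into the air while all three are alive;
    otherwise everyone aims at the best shot still standing."""
    others = [p for p in alive if p != shooter]
    if shooter == "A" and len(alive) == 3:
        return None                                # shoot into the air
    return max(others, key=lambda p: HIT[p])       # aim at the strongest opponent

def play_truel():
    alive = ["A", "B", "C"]
    while len(alive) > 1:
        for shooter in ["A", "B", "C"]:            # fixed firing order
            if shooter not in alive or len(alive) == 1:
                continue
            target = choose_target(shooter, alive)
            if target is not None and random.random() < HIT[shooter]:
                alive.remove(target)
    return alive[0]

wins = {"A": 0, "B": 0, "C": 0}
rounds = 100_000
for _ in range(rounds):
    wins[play_truel()] += 1
print({p: wins[p] / rounds for p in wins})   # A should win roughly 0.417 of the time
```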

If the strategies produced by TD-learning are used to simulate a large number of rounds, the statistics of the winners are as shown below:

[Table 6: Test result of Gunmen Dilemma (simulation results) — winning counts for A, B and C; the numeric entries are omitted here.]

Toral and Amengual give formulas for the winning probability of each player given their hit rates. [12] With the hit rates of this example (i.e. 1/3, 1/2 and 1), the winning probability of A is 0.417, and together with the corresponding values for B and C this is quite close to the data in Table 6. These results mean that TD-learning can be applied to the Gunmen Dilemma to obtain an optimal strategy, so TD-learning can be used on a 3-player perfect-information game with nature.

4.3 2-Player Kuhn Poker

The rules of 2-Player Kuhn Poker were described in 1.2; it is a 2-player hidden-information game. Since Kuhn has determined the equilibrium strategies for both players [7] [5], the test here is to see whether an algorithm can generate a mixed strategy for a Nash Equilibrium in a 2-player hidden-information game, and whether that strategy converges.

              Bet with K   Call with K   Bet with Q   Call with Q   Bet with J   Call with J
    Player 1  γ            1             0            β             α            0
    Player 2  1            1             η            0             ξ            0

Table 7: Strategies in Kuhn Poker

Kuhn suggests that the strategy of player 1 can be represented by 3 parameters and that of player 2 by 2 parameters, as shown above. The equilibrium strategy for player 1 is (α, β, γ) = (γ/3, (1 + γ)/3, γ), and the equilibrium for player 2 is (ξ, η) = (1/3, 1/3).

Fig. 1 shows that there are 8 different states for each card-dealing combination. Since there are 6 different combinations, the number of states in 2-Player Kuhn Poker is 8 × 6 = 48. However, because it is a hidden-information game, a player does not know the exact state he is in, so the states used in TD-learning differ from the real states: they are the information sets of each player, and thus there are 8 × 3 = 24 states per player. A state should contain the card held by this player, whether each player is in the bet or not, and who is the next one to make a decision.

When using TD-learning, the strategies generated are shown below:

[Table 8: Strategies generated by TD-learning in Kuhn Poker — the same six columns as Table 7; the numeric entries are omitted here.]

This is very far from the Nash Equilibrium strategy, which means TD-learning cannot produce a proper mixed strategy for a hidden-information game. The line chart below shows the convergence of the strategy of player 1 during the training process.

Figure 7: The convergence of the strategy of player 1 when using TD-learning in Kuhn Poker

The chart shows that the strategy does not converge at all; almost every probability keeps oscillating during the learning process. This is further evidence that TD-learning is not suitable for a hidden-information game. Because PHC is an algorithm designed to generate mixed strategies, the following results were generated by PHC (WoLF PHC was used for faster convergence).

[Table 9: Strategies generated by WoLF PHC in Kuhn Poker — the same six columns as Table 7; the numeric entries are omitted here.]

The strategies generated here are very close to the Nash Equilibrium strategy stated above. The following line charts show the convergence of the strategies.

(a) Player 1   (b) Player 2

Figure 8: The convergence of the strategies generated when using PHC in Kuhn Poker

As shown, α (bet with J for player 1), β (call with Q for player 1), γ (bet with K for player 1), ξ (bet with J for player 2) and η (bet with Q for player 2) clearly converge. So PHC can be used on a 2-player hidden-information game to generate a Nash Equilibrium mixed strategy, and PHC is therefore the algorithm used in the 3-Player Kuhn Poker implementation.

5 Design and Implementation

5.1 Structure

To implement the game, at least 3 classes are required: a State class, an AI class and a Game class. The State class is used to create State objects and to handle simple problems based on a State, such as determining the winner of the game and the final payoff of each player. The AI class is used to create an AI for each player and contains the methods for updating values and policies and for moving based on the policy. The Game class is the main class, dealing with the main process of the game and the training process.

5.2 State

As with Tic-Tac-Toe, the Gunmen Dilemma and 2-Player Kuhn Poker, the first thing to do in the implementation is to decide how many states should be considered for each player and what information each state should contain in order to identify it. So we return to the game tree of 3-Player Kuhn Poker and label each decision node as follows:

Figure 9: Game Tree for 3-Player Kuhn Poker with Labels

Nodes I to IV are the decision nodes of player 1, V to VIII those of player 2, and IX to XII those of player 3.

All the information about each player in this game consists of the card he holds and whether he has bet, checked, folded or called before. Since the cards held decide the branch of the game tree, and all 12 nodes above lie in the same branch, the card held must be part of the state. For each branch, the table below shows the other information at each decision node.

            Player 1               Player 2               Player 3
    Node    Bet Check Call Fold    Bet Check Call Fold    Bet Check Call Fold
    I       F   F     F    F       F   F     F    F       F   F     F    F
    II      F   T     F    F       F   T     F    F       T   F     F    F
    III     F   T     F    F       T   F     F    F       F   F     T    F
    IV      F   T     F    F       T   F     F    F       F   F     F    T
    V       F   T     F    F       F   F     F    F       F   F     F    F
    VI      F   T     T    F       F   T     F    F       T   F     F    F
    VII     F   T     F    T       F   T     F    F       T   F     F    F
    VIII    T   F     F    F       F   F     F    F       F   F     F    F
    IX      F   T     F    F       F   T     F    F       F   F     F    F
    X       F   T     F    F       T   F     F    F       F   F     F    F
    XI      T   F     F    F       T   F     F    F       F   F     F    F
    XII     T   F     F    F       F   T     F    F       F   F     F    F

Table 10: Information in Decision Nodes of 3-Player Kuhn Poker

As shown, each state has a different combination of the 12 Boolean variables, so these 12 Boolean variables can be used to identify the state. However, unlike TD-learning, in PHC each player does not care about all the states in the game; he only considers the decision nodes he can visit. So in PHC it is enough for a state object to contain the information needed to identify the states a player can meet, in this case to distinguish I to IV, V to VIII and IX to XII, and the state does not need all 12 Boolean variables. To simplify the state, many of the Boolean variables can be merged: the method used here is to keep only one Boolean per player, indicating whether he is in a bet or not (i.e. has bet or called before). The resulting information at each decision node in one branch is shown in Table 11 below. As it shows, this is enough to identify the states each player can meet, because each of his 4 decision nodes has a different combination of the 3 Boolean variables. So a state object in 3-Player Kuhn Poker contains the cards held by the 3 players and whether each player is in a bet or not, as sketched below.
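As an illustration of this simplified state, the following Python sketch shows one possible State object and the view of it that a single player is given (his own card plus the three in-bet flags). The class and field names are assumptions, not the project's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """Full game state: one card per player plus an 'in a bet' flag per player."""
    cards: tuple    # e.g. ("Q", "A", "J") for players 1, 2 and 3
    in_bet: tuple   # e.g. (False, True, True)

    def view(self, player):
        """Information set seen by `player`: his own card plus the flags
        (the other players' cards are hidden information)."""
        return (self.cards[player], self.in_bet)

# Example: the deal J, K, Q with nobody in a bet yet.
s = State(cards=("J", "K", "Q"), in_bet=(False, False, False))
print(s.view(0))   # ('J', (False, False, False)) -> node I for player 1
```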

    Node    Player 1 in Bet   Player 2 in Bet   Player 3 in Bet
    I       F                 F                 F
    II      F                 F                 T
    III     F                 T                 T
    IV      F                 T                 F
    V       F                 F                 F
    VI      T                 F                 T
    VII     F                 F                 T
    VIII    T                 F                 F
    IX      F                 F                 F
    X       F                 T                 F
    XI      T                 T                 F
    XII     T                 F                 F

Table 11: Information in Decision Nodes of 3-Player Kuhn Poker (Simplified)

5.3 AI

Each player in the game has an AI object, which stores that player's strategy. It therefore contains all the possible states this player can meet, together with the policy at each state; for PHC, the number of visits and the values and probabilities of all possible actions are also stored. One stored record, for example, begins with "false false false", meaning that player 1 holds the lowest card (i.e. the Jack), does not know what player 2 and player 3 are holding, and that none of the players is in a bet yet. The record also stores how many times this state has been visited and, for each of player 1's 2 actions in this state (bet and check), the action's value, its current probability and its average probability; in this example the probability of betting is 0.000 and the probability of checking is 1.000, with the average probabilities the same.

5.4 Game

The Game class is the main class of the program; it contains the main process of the game and the training process. Table 12 below gives the pseudo code of the game process. The learning phase is similar to the game process except that the states visited and the actions taken must be stored; its pseudo code is given in Table 13.

    Initialise the state of the game.
    Shuffle the deck.
    Each player draws one card from the deck.
    while the game is not finished do
        Based on the strategy of the current player, make him bet or not.
        Set the next player as the current player.
    end while

Table 12: Pseudo code of the game process of 3-Player Kuhn Poker

    Create 6 arrays to store the states and actions of the 3 players.
    Initialise the state of the game.
    Shuffle the deck.
    Each player draws one card from the deck.
    while the game is not finished do
        Add the current player's view of the current state to his state array.
        Based on the strategy of the current player, make him bet or not.
        Add that action to the current player's action array.
        Set the next player as the current player.
    end while
    Based on the stored states and actions, update the strategy of each player.

Table 13: Pseudo code of the learning process of 3-Player Kuhn Poker

The overall system diagram of the whole program is shown below (the name of the game is used as the name of the Game class, i.e. Kuhn):

Figure 10: System diagram of 3-Player Kuhn Poker
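The learning process of Table 13 could be sketched in Python roughly as follows. The helpers finished() and settle(), the action names and the agent interface are hypothetical; the sketch only illustrates how the per-player state and action histories feed the strategy update.

```python
import random

CARDS = ["J", "Q", "K", "A"]

def training_round(agents, finished, settle):
    """One round of self-play following Table 13. `agents` are three PHC-style
    agents; `finished(in_bet, decisions)` and `settle(hands, in_bet)` are
    hypothetical rule helpers (game-over test and showdown payoffs)."""
    deck = CARDS[:]
    random.shuffle(deck)
    hands = [deck.pop() for _ in range(3)]
    in_bet = [False, False, False]           # one "in a bet" flag per player
    histories = [[] for _ in range(3)]       # (state, action) pairs per player

    current, decisions = 0, 0
    while not finished(in_bet, decisions):
        view = (hands[current], tuple(in_bet))    # the player's information set
        action = agents[current].act(view)        # "bet"/"call" or "check"/"fold"
        histories[current].append((view, action))
        if action in ("bet", "call"):
            in_bet[current] = True
        current = (current + 1) % 3
        decisions += 1

    payoffs = settle(hands, in_bet)           # payoff of each player at showdown
    for i, agent in enumerate(agents):
        for view, action in histories[i]:
            agent.update(view, action, payoffs[i])
    return payoffs
```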

6 Result and Evaluation

6.1 Result

The following table shows the strategy generated by WoLF PHC (the 1st node for players 1, 2 and 3 is node I, V and IX respectively in Figure 9, the 2nd node is node II, VI and X, and so on):

[Table 14: Strategies generated by WoLF PHC in 3-Player Kuhn Poker — for each player, the probability of betting/calling at his 1st to 4th nodes when holding J, Q, K or A; the numeric entries are omitted here.]

Based on these strategies, we can calculate the expected payoff of each player in a single round of the game with the following formula:

    p_i = Σ_{c_1} Σ_{c_2} Σ_{c_3} Σ_{N=1}^{13} P(c_1, c_2, c_3, N) · p_i(c_1, c_2, c_3, N)

where p_i is the expected payoff of player i, p_i(c_1, c_2, c_3, N) is the payoff of player i at ending node N when player 1 holds c_1, player 2 holds c_2 and player 3 holds c_3, and P(c_1, c_2, c_3, N) is the probability of reaching ending node N with these holdings, which is calculated using Table 15 below. In that table, P(N, c_i) is the probability that player i bets or calls at node N while holding card c_i, which can be found in Table 14. The expected payoff calculated for each player is shown in Table 16.

    Ending Node   Formula
    1             (1 − P(I, c_1)) · (1 − P(V, c_2)) · (1 − P(IX, c_3))
    2             (1 − P(I, c_1)) · (1 − P(V, c_2)) · P(IX, c_3) · P(II, c_1) · P(VI, c_2)
    3             (1 − P(I, c_1)) · (1 − P(V, c_2)) · P(IX, c_3) · P(II, c_1) · (1 − P(VI, c_2))
    4             (1 − P(I, c_1)) · (1 − P(V, c_2)) · P(IX, c_3) · (1 − P(II, c_1)) · P(VII, c_2)
    5             (1 − P(I, c_1)) · (1 − P(V, c_2)) · P(IX, c_3) · (1 − P(II, c_1)) · (1 − P(VII, c_2))
    6             (1 − P(I, c_1)) · P(V, c_2) · P(X, c_3) · P(III, c_1)
    7             (1 − P(I, c_1)) · P(V, c_2) · P(X, c_3) · (1 − P(III, c_1))
    8             (1 − P(I, c_1)) · P(V, c_2) · (1 − P(X, c_3)) · P(IV, c_1)
    9             (1 − P(I, c_1)) · P(V, c_2) · (1 − P(X, c_3)) · (1 − P(IV, c_1))
    10            P(I, c_1) · P(VIII, c_2) · P(XI, c_3)
    11            P(I, c_1) · P(VIII, c_2) · (1 − P(XI, c_3))
    12            P(I, c_1) · (1 − P(VIII, c_2)) · P(XII, c_3)
    13            P(I, c_1) · (1 − P(VIII, c_2)) · (1 − P(XII, c_3))

Table 15: Formula table for the probability to reach each ending node

[Table 16: Expected payoff for each player — the numeric entries are omitted here; only player 3's expected payoff is positive.]

Figure 11: The convergence of the expected payoff for 3 players in 3-Player Kuhn Poker

For convergence, it is not a good idea to plot all 48 strategy variables on one line chart, so the expected payoffs of the 3 players are plotted instead. If the strategies converge, the expected payoffs should also converge, and if the expected payoffs converge there is a high probability that the strategies converge as well. Figure 11 shows the convergence of the expected payoffs.

6.2 Evaluation

Looking at the expected payoffs, we can see that player 3 has an advantage in this game: he is the only player with a positive expected payoff. This is sensible in 3-Player Kuhn Poker, because player 3 is the last to make a decision and may therefore learn more from the actions of the other players than player 1 and player 2 can. However, unlike Tic-Tac-Toe, the Gunmen Dilemma and 2-Player Kuhn Poker, the equilibrium or optimal strategy for 3-Player Kuhn Poker is not known in closed form, so it is hard to check whether this strategy is a Nash Equilibrium strategy. The following sections suggest different ways of verifying it.

6.2.1 Mathematical Calculation

One way to check a strategy is simply to calculate it. As said above, we can calculate the expected payoff of each player if we know the policy of every player at every decision node. If all these policy parameters are treated as variables, then reaching a Nash Equilibrium is an optimisation problem, maximising the expected payoff of each player. Szafron, Gibson and Sturtevant have calculated Nash Equilibria of 3-Player Kuhn Poker by a purely mathematical method. [11] They suggest that one family of Nash Equilibrium strategies must obey the rules in Table 17 below, in which β = max(b_11, b_21). Substituting the strategy of Table 14 into Table 17, we find that, except for b_23, c_41 (which is 1 rather than 0) and c_33 (which falls below the required bound of 0.181), all other variables satisfy the rules in Table 17. So it is not exactly a Nash Equilibrium.

                                    Player 1          Player 2          Player 3
    Bet/Call at 1st Node with J     a_11 = 0          b_11 (see [11])   c_11 (see [11])
    Bet/Call at 1st Node with Q     a_21 (see [11])   b_21 (see [11])   c_21 (see [11])
    Bet/Call at 1st Node with K     a_31 = 0          b_31 = 0          c_31 = 0
    Bet/Call at 1st Node with A     a_41 = 0          b_41 (see [11])   c_41 = 0
    Bet/Call at 2nd Node with J     a_12 = 0          b_14 = 0          c_12 = 0
    Bet/Call at 2nd Node with Q     a_22 = 0          b_24 = 0          c_22 = 0
    Bet/Call at 2nd Node with K     a_32 = 0          b_34 = 0          c_32 = 0
    Bet/Call at 2nd Node with A     a_42 = 1          b_44 = 1          c_42 = 1
    Bet/Call at 3rd Node with J     a_14 = 0          b_13 = 0          c_14 = 0
    Bet/Call at 3rd Node with Q     a_24 = 0          b_23 (see [11])   c_24 = 0
    Bet/Call at 3rd Node with K     a_34 = 0          b_33 (see [11])   c_34 (see [11])
    Bet/Call at 3rd Node with A     a_44 = 1          b_43 = 1          c_44 = 1
    Bet/Call at 4th Node with J     a_13 = 0          b_12 = 0          c_13 = 0
    Bet/Call at 4th Node with Q     a_23 = 0          b_22 = 0          c_23 = 0
    Bet/Call at 4th Node with K     a_33 (see [11])   b_32 (see [11])   c_33 (see [11])
    Bet/Call at 4th Node with A     a_43 = 1          b_42 = 1          c_43 = 1

Table 17: Equilibrium of 3-Player Kuhn Poker. Here b_11 and b_21 are free parameters and β = max(b_11, b_21); the entries marked "see [11]" must satisfy constraints, expressed in terms of b_11 and b_21, that are given in full in [11].

6.2.2 Best Response Comparison

As said in the previous section, the strategy generated is not a Nash Equilibrium. However, since it is quite close to the equilibrium point, is it an ɛ-Nash Equilibrium, the aim of this project? And if so, what is the value of ɛ? Risk and Szafron suggest a way to check whether a strategy is an ɛ-Nash Equilibrium and what ɛ is. [9] The idea is to calculate a best response strategy against the strategies of the opponents. A best response strategy, as its name suggests, is the strategy that maximises this player's payoff against the strategies of the opponents. Suppose that applying the best response strategy would increase the payoff of this player by δ while the other players fix their strategies, and let δ_1, δ_2 and δ_3 be the increases for players 1, 2 and 3 respectively. Then the original strategies of the 3 players form an ɛ-Nash Equilibrium with ɛ = max(δ_1, δ_2, δ_3).

One possible way to calculate the best response strategy is to fix all 32 variables of the other 2 players and treat the problem as an optimisation over the remaining 16 variables. This would give the best response payoff exactly, but it would also take a long time to compute. Another method is to check all the pure strategies.

Since there is a large number of variables in 3-Player Kuhn Poker, the pure strategies cover a large part of all the possible payoffs, and the pure strategy that increases the expected payoff the most can be regarded as an approximate best response strategy. The following table shows each player's expected payoff and his expected payoff when he plays the best response strategy while the others fix their strategies:

[Table 18: Expected Payoff and Best Response Payoff — expected payoff, best response payoff and their difference for each player; the numeric entries are omitted here.]

As shown, the largest difference between expected payoff and best response payoff is 0.003. So we can say that these strategies lead to an ɛ-Nash Equilibrium with ɛ = 0.003, which is less than the upper limit stated in 1.1.
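A sketch of the pure-strategy best-response check is given below. It assumes a function expected_payoff(profile, player) that evaluates a full strategy profile analytically (for example via Table 15), and it enumerates the 2^16 pure settings of one player's 16 bet/call parameters from Table 14. This is an illustration of the method, not the project's code.

```python
from itertools import product

def approx_best_response_gain(player, profile, expected_payoff):
    """Return delta_i: how much `player` could gain by switching to his best
    pure strategy while the other two players keep the strategies in
    `profile` (a mapping player index -> 16 bet/call probabilities)."""
    base = expected_payoff(profile, player)
    best = base
    for bits in product([0.0, 1.0], repeat=16):   # all 2^16 pure strategies
        candidate = dict(profile)
        candidate[player] = bits
        best = max(best, expected_payoff(candidate, player))
    return best - base

# epsilon for the whole profile is then the largest of the three gains:
# epsilon = max(approx_best_response_gain(i, profile, expected_payoff) for i in range(3))
```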

7 Conclusion

7.1 Achievement

The main achievement of this project is that an ɛ-Nash Equilibrium of 3-Player Kuhn Poker with ɛ = 0.003 has been found. This is close to the best previous attempt on 3-Player Kuhn Poker, by Risk and Szafron, who obtained an ɛ-Nash Equilibrium with a smaller ɛ. [9] Besides that, the optimal strategies of Tic-Tac-Toe, the Gunmen Dilemma, 2-Player Kuhn Poker and other games were also found by applying Temporal Difference Learning and the Policy Hill-Climbing algorithm to them.

7.2 Future Works

In the Policy Hill-Climbing algorithm there are many factors that might influence the result, including the discount factor in Table 3 and the exploration factor of the ɛ-greedy strategy. The discount factor used to obtain the result in 6.1 is 1, meaning that all payoffs count fully, and a fixed exploration factor was used. Some work could be done to find the relationship between these factors and the result. Besides that, work could also be done on Counterfactual Regret Minimization, the algorithm used by the University of Alberta [4], the winner of 3-Player Kuhn Poker at the Annual Computer Poker Competition in 2014. [1]

References

[1] Annual Computer Poker Competition. http://www.computerpokercompetition.org/.

[2] Three way duel problem (Gunmen Dilemma). three-way-duel/.

[3] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, volume 17. Lawrence Erlbaum Associates Ltd, 2001.

[4] Richard Gibson. Regret Minimization in Games and the Development of Champion Multiplayer Computer Poker-Playing Agents. PhD thesis, University of Alberta, 2014.

[5] Bret Hoehn, Finnegan Southey, Robert C. Holte, and Valeriy Bulitko. Effective short-term opponent exploitation in simplified poker.

[6] D. Marc Kilgour and Steven J. Brams. The truel. Mathematics Magazine, 1997.

[7] Harold W. Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 1950.

[8] E. Rasmusen. Games and Information: An Introduction to Game Theory. Blackwell.

[9] Nick Abou Risk and Duane Szafron. Using counterfactual regret minimization to create competitive multiplayer poker agents. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[10] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.

[11] Duane Szafron, Richard Gibson, and Nathan Sturtevant. A parameterized family of equilibrium profiles for three-player Kuhn poker. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[12] Raúl Toral and Pau Amengual. Distribution of winners in truel games. arXiv preprint cond-mat.


More information

Creating a New Angry Birds Competition Track

Creating a New Angry Birds Competition Track Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference Creating a New Angry Birds Competition Track Rohan Verma, Xiaoyu Ge, Jochen Renz Research School

More information

Probabilistic State Translation in Extensive Games with Large Action Sets

Probabilistic State Translation in Extensive Games with Large Action Sets Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Probabilistic State Translation in Extensive Games with Large Action Sets David Schnizlein Michael Bowling

More information

BLUFF WITH AI. CS297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University. In Partial Fulfillment

BLUFF WITH AI. CS297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University. In Partial Fulfillment BLUFF WITH AI CS297 Report Presented to Dr. Chris Pollett Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements for the Class CS 297 By Tina Philip May 2017

More information

Exploitability and Game Theory Optimal Play in Poker

Exploitability and Game Theory Optimal Play in Poker Boletín de Matemáticas 0(0) 1 11 (2018) 1 Exploitability and Game Theory Optimal Play in Poker Jen (Jingyu) Li 1,a Abstract. When first learning to play poker, players are told to avoid betting outside

More information

CS 188 Fall Introduction to Artificial Intelligence Midterm 1

CS 188 Fall Introduction to Artificial Intelligence Midterm 1 CS 188 Fall 2018 Introduction to Artificial Intelligence Midterm 1 You have 120 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam. Do

More information

Learning to play Dominoes

Learning to play Dominoes Learning to play Dominoes Ivan de Jesus P. Pinto 1, Mateus R. Pereira 1, Luciano Reis Coutinho 1 1 Departamento de Informática Universidade Federal do Maranhão São Luís,MA Brazil navi1921@gmail.com, mateus.rp.slz@gmail.com,

More information

Multiagent Systems: Intro to Game Theory. CS 486/686: Introduction to Artificial Intelligence

Multiagent Systems: Intro to Game Theory. CS 486/686: Introduction to Artificial Intelligence Multiagent Systems: Intro to Game Theory CS 486/686: Introduction to Artificial Intelligence 1 1 Introduction So far almost everything we have looked at has been in a single-agent setting Today - Multiagent

More information

Game Playing. Philipp Koehn. 29 September 2015

Game Playing. Philipp Koehn. 29 September 2015 Game Playing Philipp Koehn 29 September 2015 Outline 1 Games Perfect play minimax decisions α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information 2 games

More information

BetaPoker: Reinforcement Learning for Heads-Up Limit Poker Albert Tung, Eric Xu, and Jeffrey Zhang

BetaPoker: Reinforcement Learning for Heads-Up Limit Poker Albert Tung, Eric Xu, and Jeffrey Zhang Introduction BetaPoker: Reinforcement Learning for Heads-Up Limit Poker Albert Tung, Eric Xu, and Jeffrey Zhang Texas Hold em Poker is considered the most popular variation of poker that is played widely

More information

CSCI 4150 Introduction to Artificial Intelligence, Fall 2004 Assignment 7 (135 points), out Monday November 22, due Thursday December 9

CSCI 4150 Introduction to Artificial Intelligence, Fall 2004 Assignment 7 (135 points), out Monday November 22, due Thursday December 9 CSCI 4150 Introduction to Artificial Intelligence, Fall 2004 Assignment 7 (135 points), out Monday November 22, due Thursday December 9 Learning to play blackjack In this assignment, you will implement

More information

ADVERSARIAL SEARCH. Chapter 5

ADVERSARIAL SEARCH. Chapter 5 ADVERSARIAL SEARCH Chapter 5... every game of skill is susceptible of being played by an automaton. from Charles Babbage, The Life of a Philosopher, 1832. Outline Games Perfect play minimax decisions α

More information

CS 1571 Introduction to AI Lecture 12. Adversarial search. CS 1571 Intro to AI. Announcements

CS 1571 Introduction to AI Lecture 12. Adversarial search. CS 1571 Intro to AI. Announcements CS 171 Introduction to AI Lecture 1 Adversarial search Milos Hauskrecht milos@cs.pitt.edu 39 Sennott Square Announcements Homework assignment is out Programming and experiments Simulated annealing + Genetic

More information

Ar#ficial)Intelligence!!

Ar#ficial)Intelligence!! Introduc*on! Ar#ficial)Intelligence!! Roman Barták Department of Theoretical Computer Science and Mathematical Logic So far we assumed a single-agent environment, but what if there are more agents and

More information

ECON 282 Final Practice Problems

ECON 282 Final Practice Problems ECON 282 Final Practice Problems S. Lu Multiple Choice Questions Note: The presence of these practice questions does not imply that there will be any multiple choice questions on the final exam. 1. How

More information

Foundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel

Foundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel Foundations of AI 6. Adversarial Search Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard & Bernhard Nebel Contents Game Theory Board Games Minimax Search Alpha-Beta Search

More information

Games and Adversarial Search

Games and Adversarial Search 1 Games and Adversarial Search BBM 405 Fundamentals of Artificial Intelligence Pinar Duygulu Hacettepe University Slides are mostly adapted from AIMA, MIT Open Courseware and Svetlana Lazebnik (UIUC) Spring

More information

Using Fictitious Play to Find Pseudo-Optimal Solutions for Full-Scale Poker

Using Fictitious Play to Find Pseudo-Optimal Solutions for Full-Scale Poker Using Fictitious Play to Find Pseudo-Optimal Solutions for Full-Scale Poker William Dudziak Department of Computer Science, University of Akron Akron, Ohio 44325-4003 Abstract A pseudo-optimal solution

More information

arxiv: v1 [cs.gt] 23 May 2018

arxiv: v1 [cs.gt] 23 May 2018 On self-play computation of equilibrium in poker Mikhail Goykhman Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem, 91904, Israel E-mail: michael.goykhman@mail.huji.ac.il arxiv:1805.09282v1

More information

Game Tree Search. Generalizing Search Problems. Two-person Zero-Sum Games. Generalizing Search Problems. CSC384: Intro to Artificial Intelligence

Game Tree Search. Generalizing Search Problems. Two-person Zero-Sum Games. Generalizing Search Problems. CSC384: Intro to Artificial Intelligence CSC384: Intro to Artificial Intelligence Game Tree Search Chapter 6.1, 6.2, 6.3, 6.6 cover some of the material we cover here. Section 6.6 has an interesting overview of State-of-the-Art game playing programs.

More information

Opponent Modelling by Expectation-Maximisation and Sequence Prediction in Simplified Poker

Opponent Modelling by Expectation-Maximisation and Sequence Prediction in Simplified Poker IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES 1 Opponent Modelling by Expectation-Maximisation and Sequence Prediction in Simplified Poker Richard Mealing and Jonathan L. Shapiro Abstract

More information

Multiagent Systems: Intro to Game Theory. CS 486/686: Introduction to Artificial Intelligence

Multiagent Systems: Intro to Game Theory. CS 486/686: Introduction to Artificial Intelligence Multiagent Systems: Intro to Game Theory CS 486/686: Introduction to Artificial Intelligence 1 Introduction So far almost everything we have looked at has been in a single-agent setting Today - Multiagent

More information

Endgame Solving in Large Imperfect-Information Games

Endgame Solving in Large Imperfect-Information Games Endgame Solving in Large Imperfect-Information Games Sam Ganzfried and Tuomas Sandholm Computer Science Department Carnegie Mellon University {sganzfri, sandholm}@cs.cmu.edu ABSTRACT The leading approach

More information

Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.)

Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.) Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.) Eric B. Laber February 12, 2008 Eric B. Laber () Introduction to Neuro-Dynamic Programming (Or,

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence selman@cs.cornell.edu Module: Adversarial Search R&N: Chapter 5 1 Outline Adversarial Search Optimal decisions Minimax α-β pruning Case study: Deep Blue

More information

Data Biased Robust Counter Strategies

Data Biased Robust Counter Strategies Data Biased Robust Counter Strategies Michael Johanson johanson@cs.ualberta.ca Department of Computing Science University of Alberta Edmonton, Alberta, Canada Michael Bowling bowling@cs.ualberta.ca Department

More information

CS 771 Artificial Intelligence. Adversarial Search

CS 771 Artificial Intelligence. Adversarial Search CS 771 Artificial Intelligence Adversarial Search Typical assumptions Two agents whose actions alternate Utility values for each agent are the opposite of the other This creates the adversarial situation

More information

CS 2710 Foundations of AI. Lecture 9. Adversarial search. CS 2710 Foundations of AI. Game search

CS 2710 Foundations of AI. Lecture 9. Adversarial search. CS 2710 Foundations of AI. Game search CS 2710 Foundations of AI Lecture 9 Adversarial search Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square CS 2710 Foundations of AI Game search Game-playing programs developed by AI researchers since

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificial Intelligence Spring 2007 Lecture 7: CSP-II and Adversarial Search 2/6/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or

More information

CS 188: Artificial Intelligence

CS 188: Artificial Intelligence CS 188: Artificial Intelligence Adversarial Search Prof. Scott Niekum The University of Texas at Austin [These slides are based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.

More information

CS 5522: Artificial Intelligence II

CS 5522: Artificial Intelligence II CS 5522: Artificial Intelligence II Adversarial Search Instructor: Alan Ritter Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley. All materials available at http://ai.berkeley.edu.]

More information

Advanced Microeconomics: Game Theory

Advanced Microeconomics: Game Theory Advanced Microeconomics: Game Theory P. v. Mouche Wageningen University 2018 Outline 1 Motivation 2 Games in strategic form 3 Games in extensive form What is game theory? Traditional game theory deals

More information

Game Playing: Adversarial Search. Chapter 5

Game Playing: Adversarial Search. Chapter 5 Game Playing: Adversarial Search Chapter 5 Outline Games Perfect play minimax search α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information Games vs. Search

More information

Local Search. Hill Climbing. Hill Climbing Diagram. Simulated Annealing. Simulated Annealing. Introduction to Artificial Intelligence

Local Search. Hill Climbing. Hill Climbing Diagram. Simulated Annealing. Simulated Annealing. Introduction to Artificial Intelligence Introduction to Artificial Intelligence V22.0472-001 Fall 2009 Lecture 6: Adversarial Search Local Search Queue-based algorithms keep fallback options (backtracking) Local search: improve what you have

More information

Announcements. Homework 1. Project 1. Due tonight at 11:59pm. Due Friday 2/8 at 4:00pm. Electronic HW1 Written HW1

Announcements. Homework 1. Project 1. Due tonight at 11:59pm. Due Friday 2/8 at 4:00pm. Electronic HW1 Written HW1 Announcements Homework 1 Due tonight at 11:59pm Project 1 Electronic HW1 Written HW1 Due Friday 2/8 at 4:00pm CS 188: Artificial Intelligence Adversarial Search and Game Trees Instructors: Sergey Levine

More information

Best Response to Tight and Loose Opponents in the Borel and von Neumann Poker Models

Best Response to Tight and Loose Opponents in the Borel and von Neumann Poker Models Best Response to Tight and Loose Opponents in the Borel and von Neumann Poker Models Casey Warmbrand May 3, 006 Abstract This paper will present two famous poker models, developed be Borel and von Neumann.

More information

Learning from Hints: AI for Playing Threes

Learning from Hints: AI for Playing Threes Learning from Hints: AI for Playing Threes Hao Sheng (haosheng), Chen Guo (cguo2) December 17, 2016 1 Introduction The highly addictive stochastic puzzle game Threes by Sirvo LLC. is Apple Game of the

More information

ECON 312: Games and Strategy 1. Industrial Organization Games and Strategy

ECON 312: Games and Strategy 1. Industrial Organization Games and Strategy ECON 312: Games and Strategy 1 Industrial Organization Games and Strategy A Game is a stylized model that depicts situation of strategic behavior, where the payoff for one agent depends on its own actions

More information

On Range of Skill. Thomas Dueholm Hansen and Peter Bro Miltersen and Troels Bjerre Sørensen Department of Computer Science University of Aarhus

On Range of Skill. Thomas Dueholm Hansen and Peter Bro Miltersen and Troels Bjerre Sørensen Department of Computer Science University of Aarhus On Range of Skill Thomas Dueholm Hansen and Peter Bro Miltersen and Troels Bjerre Sørensen Department of Computer Science University of Aarhus Abstract At AAAI 07, Zinkevich, Bowling and Burch introduced

More information

Adversarial Search. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 9 Feb 2012

Adversarial Search. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 9 Feb 2012 1 Hal Daumé III (me@hal3.name) Adversarial Search Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 9 Feb 2012 Many slides courtesy of Dan

More information

CS 380: ARTIFICIAL INTELLIGENCE ADVERSARIAL SEARCH. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE ADVERSARIAL SEARCH. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE ADVERSARIAL SEARCH Santiago Ontañón so367@drexel.edu Recall: Problem Solving Idea: represent the problem we want to solve as: State space Actions Goal check Cost function

More information

Endgame Solving in Large Imperfect-Information Games

Endgame Solving in Large Imperfect-Information Games Endgame Solving in Large Imperfect-Information Games Sam Ganzfried and Tuomas Sandholm Computer Science Department Carnegie Mellon University {sganzfri, sandholm}@cs.cmu.edu Abstract The leading approach

More information

Training a Back-Propagation Network with Temporal Difference Learning and a database for the board game Pente

Training a Back-Propagation Network with Temporal Difference Learning and a database for the board game Pente Training a Back-Propagation Network with Temporal Difference Learning and a database for the board game Pente Valentijn Muijrers 3275183 Valentijn.Muijrers@phil.uu.nl Supervisor: Gerard Vreeswijk 7,5 ECTS

More information

Game theory and AI: a unified approach to poker games

Game theory and AI: a unified approach to poker games Game theory and AI: a unified approach to poker games Thesis for graduation as Master of Artificial Intelligence University of Amsterdam Frans Oliehoek 2 September 2005 Abstract This thesis focuses on

More information

Announcements. CS 188: Artificial Intelligence Fall Local Search. Hill Climbing. Simulated Annealing. Hill Climbing Diagram

Announcements. CS 188: Artificial Intelligence Fall Local Search. Hill Climbing. Simulated Annealing. Hill Climbing Diagram CS 188: Artificial Intelligence Fall 2008 Lecture 6: Adversarial Search 9/16/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 Announcements Project

More information

To play the game player has to place a bet on the ANTE bet (initial bet). Optionally player can also place a BONUS bet.

To play the game player has to place a bet on the ANTE bet (initial bet). Optionally player can also place a BONUS bet. ABOUT THE GAME OBJECTIVE OF THE GAME Casino Hold'em, also known as Caribbean Hold em Poker, was created in the year 2000 by Stephen Au- Yeung and is now being played in casinos worldwide. Live Casino Hold'em

More information

Documentation and Discussion

Documentation and Discussion 1 of 9 11/7/2007 1:21 AM ASSIGNMENT 2 SUBJECT CODE: CS 6300 SUBJECT: ARTIFICIAL INTELLIGENCE LEENA KORA EMAIL:leenak@cs.utah.edu Unid: u0527667 TEEKO GAME IMPLEMENTATION Documentation and Discussion 1.

More information

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask Set 4: Game-Playing ICS 271 Fall 2017 Kalev Kask Overview Computer programs that play 2-player games game-playing as search with the complication of an opponent General principles of game-playing and search

More information

Algorithms for Data Structures: Search for Games. Phillip Smith 27/11/13

Algorithms for Data Structures: Search for Games. Phillip Smith 27/11/13 Algorithms for Data Structures: Search for Games Phillip Smith 27/11/13 Search for Games Following this lecture you should be able to: Understand the search process in games How an AI decides on the best

More information

Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility

Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility Summary Overview of Topics in Econ 30200b: Decision theory: strong and weak domination by randomized strategies, domination theorem, expected utility theorem (consistent decisions under uncertainty should

More information

Programming an Othello AI Michael An (man4), Evan Liang (liange)

Programming an Othello AI Michael An (man4), Evan Liang (liange) Programming an Othello AI Michael An (man4), Evan Liang (liange) 1 Introduction Othello is a two player board game played on an 8 8 grid. Players take turns placing stones with their assigned color (black

More information

Pengju

Pengju Introduction to AI Chapter05 Adversarial Search: Game Playing Pengju Ren@IAIR Outline Types of Games Formulation of games Perfect-Information Games Minimax and Negamax search α-β Pruning Pruning more Imperfect

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Adversarial Search Instructors: David Suter and Qince Li Course Delivered @ Harbin Institute of Technology [Many slides adapted from those created by Dan Klein and Pieter Abbeel

More information

Game Playing State-of-the-Art

Game Playing State-of-the-Art Adversarial Search [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.] Game Playing State-of-the-Art

More information

Adversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5

Adversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Adversarial Search CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Outline Game

More information

Game Playing State-of-the-Art CSE 473: Artificial Intelligence Fall Deterministic Games. Zero-Sum Games 10/13/17. Adversarial Search

Game Playing State-of-the-Art CSE 473: Artificial Intelligence Fall Deterministic Games. Zero-Sum Games 10/13/17. Adversarial Search CSE 473: Artificial Intelligence Fall 2017 Adversarial Search Mini, pruning, Expecti Dieter Fox Based on slides adapted Luke Zettlemoyer, Dan Klein, Pieter Abbeel, Dan Weld, Stuart Russell or Andrew Moore

More information

CS 380: ARTIFICIAL INTELLIGENCE

CS 380: ARTIFICIAL INTELLIGENCE CS 380: ARTIFICIAL INTELLIGENCE ADVERSARIAL SEARCH 10/23/2013 Santiago Ontañón santi@cs.drexel.edu https://www.cs.drexel.edu/~santi/teaching/2013/cs380/intro.html Recall: Problem Solving Idea: represent

More information

Statistical Analysis of Nuel Tournaments Department of Statistics University of California, Berkeley

Statistical Analysis of Nuel Tournaments Department of Statistics University of California, Berkeley Statistical Analysis of Nuel Tournaments Department of Statistics University of California, Berkeley MoonSoo Choi Department of Industrial Engineering & Operations Research Under Guidance of Professor.

More information