Heads-up Limit Texas Hold'em Poker Agent

Nattapoom Asavareongchai and Pin Pin Tea-mangkornpan

CS221 Final Project Report

Abstract: Our project aims to create an agent that is able to play heads-up limit Texas Hold'em poker against a human opponent. As an artificial intelligence bot, the agent will learn its opponent's betting behavior and exploit it in its game-play model to gain the most reward.

I. INTRODUCTION

Poker is one of the world's most popular sports, and the most popular card game in the world. It offers excitement and action, demands great skill from an expert player, and contains an element of luck. Its strategic challenges and psychological elements contribute greatly to its popularity, making poker a very social, human game. The object of poker is very simple: to win the money in the center of the table, called the pot, which contains the sum of the bets made by the participants of that game round. Players bet in the belief that they have the best hand, or in the hope that they can make a better hand fold, i.e. leave the game, giving up the pot to them.

Poker poses general challenges that make it an effective platform for artificial intelligence research. First, it is a non-deterministic game with stochastic outcomes. With opponents' hands hidden, imperfect knowledge causes typical search algorithms to fail. The agent needs to perform risk management to handle betting strategies and their consequences, and also to identify patterns in the opponent's play in order to exploit them. Advanced poker agents may also deal with deception (bluffing), and with unreliable information when taking into account the opponent's deceptive plays. There are many variants of poker with different numbers of players and levels of complexity, such as Omaha, Seven-card stud, Texas hold'em, Five-card draw, etc.
Since Texas hold'em is the most widely played form of poker, we decided to build an agent that plays a simplified version of heads-up limit Texas hold'em poker, which has two players and fixed bet amounts.

II. TASK DEFINITION & INFRASTRUCTURE

Texas hold'em poker may involve two or more players. Because of the complicated nature of the problem with more than two players, we limit our problem to a game between our agent and a single opponent (a one-on-one game). The game runs like a normal Texas hold'em game, with the pre-flop, flop, turn, and river stages in which players may bet, check, call, or fold.

Another challenge that comes with poker is the range of bets allowed during the game. Since bets in general are continuous and arbitrary (they may vary from $1 to over $1,000, for example), we decided to model our problem by fixing the bets to a distinct set of values. For example, we only allow $10 or $20 bets and no other values, with a maximum of $20 per stage. This reduces the complexity of our agent's model and algorithm. With this in mind, we set the big blind at $10 and the small blind at $5. Note that in our game play, the big blind and small blind alternate every round: the positional advantage of the big blind in a real game may affect the overall outcome, so to make things fair we alternate positions in our setup as well.

Our agent plays the game over multiple rounds against the human opponent, trying to win the largest amount of money possible. To do so, the agent models the game as a Markov Decision Process and tries to learn the opponent's behavior using a reflexive machine learning algorithm. The learning algorithm consists simply of a linear regression model with a feature extractor. This is explained further in the approach section (Section V). One big characteristic of poker is the ability of players to bluff and gain advantages in certain rounds.
Unfortunately, due to the complicated and non-deterministic nature of bluffing, we do not take bluffing into account in our problem model, in order to keep it a reasonable problem to tackle. However, bluffing may be implicitly incorporated into our learning model of the opponent while the agent is playing.

Since we incorporated opponent learning, we had to gather training data in order to train our agent. To gather data, we created our agent code and ran the agent many times against custom-created opponents. The opponents have certain fixed behaviors according to the aggressiveness we set them to. The data are then recorded and learned by the agent. Further data collection is done during the actual game play of the agent against a human opponent: the agent gathers the opponent's betting behavior on the fly while it is playing and updates its learning model constantly. More details are given in Section V.

III. RELATED WORKS

There are two previous CS 221 projects that implemented an agent to play poker. The first, by Berro et al., implemented the agent using an MDP [2]. They used Q-learning with a poker feature extractor and the epsilon-greedy algorithm to learn an optimal policy. While learning the policy, they ran Monte Carlo simulations against fixed-policy opponents to update the feature vector. However, the authors noted that their agent performed worse than average human players because it did not explore enough states and adopted a very risk-averse playing style. This is similar to our use of an MDP, and we noticed the same problem in our method: for the MDP to perform well, we need to trade off efficiency.

The second project, by Abadi and Takapoui, created an automated player using probabilistic graphical models to concurrently explore the state space and exploit acquired knowledge of the opponent [1]. Since both sides were agents, there was no comparison with human players. Instead, they defined a variety of agents and compared their performance by evaluating each agent's ability to estimate the opponent's latent feature vectors.

Among other related works, Yakovenko et al. implemented a self-trained poker system using a convolutional-network-based learning model to learn the patterns in three different types of poker games: video poker, Texas hold'em, and 2-7 triple draw [5]. Their representation of the poker games in matrix form, to make them processable by the convolutional network, is worth noting. They encode each card as a 4×13 sparse binary matrix in accordance with the 4 suits and 13 ranks of poker cards. The matrix is zero-padded to 17×17 to help with the convolution and max-pooling computations. For five-card games, they also add the sum of the 5 layers as another 17×17 layer to capture whole-hand information.
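As an illustration of this encoding (our own sketch, not the Poker-CNN authors' code), the following Python builds the 4×13 one-hot card layer, zero-pads it to 17×17, and sums card layers into a whole-hand layer. The helper names and the two-character card strings (e.g. "As" for the ace of spades) are our assumptions:

```python
# Sketch of the Poker-CNN card encoding described above (our reconstruction):
# each card becomes a 4x13 binary matrix indexed by (suit, rank),
# zero-padded to 17x17 for the convolution and max-pooling layers.

SUITS = "cdhs"              # clubs, diamonds, hearts, spades
RANKS = "23456789TJQKA"

def encode_card(card, size=17):
    """Return a size x size matrix with a single 1 at (suit, rank)."""
    rank, suit = card[0], card[1]        # e.g. "As" = ace of spades
    m = [[0] * size for _ in range(size)]
    m[SUITS.index(suit)][RANKS.index(rank)] = 1
    return m

def encode_hand(cards, size=17):
    """Element-wise sum of the card layers: the whole-hand layer."""
    layers = [encode_card(c, size) for c in cards]
    return [[sum(layer[i][j] for layer in layers) for j in range(size)]
            for i in range(size)]
```

In the summed layer, a pair shows up as two 1s in the same column and a flush as five 1s in the same row, which is exactly the property the authors exploit.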
This encoding strategy has several advantages, the most interesting of which is that the full-hand representation makes it easy to model common poker patterns, such as a pair (two cards of the same rank, which lie in the same column) or a flush (five cards of the same suit, which lie in the same row), without game-specific card sorting or suit isomorphisms (e.g. AsKd is essentially the same as KhAc). For multi-round games, they keep track of context information, such as the pot size and the bets made so far, by adding layers with different encodings for each feature. The poker tensor is thereby extended to encode game-state context that is not measured in cards.

In 2015, Zinkevich et al., from the Computer Poker Research Group at the University of Alberta, claimed to have weakly solved the heads-up limit Texas hold'em game using CFR+, a variant of the CFR (counterfactual regret minimization) algorithm, an iterative method for approximating a Nash equilibrium of an extensive-form game through repeated self-play between two regret-minimizing algorithms [3]. However, we did not pursue this approach, in the interest of keeping within the scope of this class.

IV. RULES

We use slightly simplified rules of heads-up limit Texas hold'em. The game is played between two players and consists of 4 phases: pre-flop, flop, turn, and river. We follow the limit Texas hold'em rule where bets during the pre-flop and flop are equal to the big blind, and equal to twice the big blind in the turn and river phases.

Blinds: At the beginning of the game each player starts with $0 reward. We start with the agent as the big blind and the human player as the small blind. The blinds alternate every round.

Pre-flop: Both players are dealt their hole cards face down (in the game we only show the player his/her own cards). Unlike the normal version, the small blind can only fold or call. Then the big blind decides to either check or bet.

Flop: The three flop cards are dealt on the table.
Unlike regular heads-up Texas hold'em, where the big blind acts first after the flop, our game requires the small blind to act first. A player has three valid actions: fold, check, or bet.

Turn: The fourth card is dealt onto the table. The round is similar to the flop round: the small blind acts first, choosing one of the three actions (fold, check, or bet).

River: The fifth and last card is dealt onto the table. This round has the same procedure as the turn round. If both players decide to play through to the end (every bet is called, or both players check), we move into the showdown phase, where each player's cards are evaluated for their hand strength. The hand strengths are then compared, and the winner, the player with the better hand, gains the whole pot.

V. APPROACH

We model the problem as a Markov Decision Process (MDP) and use a reflexive learning algorithm to learn the opponent's betting behavior. To evaluate card strengths and ranks we use an external poker evaluator: a Python library called Deuces. This library lets us efficiently calculate 5-card hand strength as an integer score for comparison. It also provides card rank and suit descriptions and a function to print cards in a visually appealing manner for our interface.

A. MDP model

Our Markov Decision Process (MDP) model is broken into smaller MDP models, one for each phase. Each MDP explores only the states within two phases ahead. We have the pre-flop MDP, flop MDP, turn MDP, and river MDP, as shown in Figure 1. We do this because if we combined all these MDPs, the state space would be too large to search efficiently: there would be 3.16 × 10^17 states to go through in a heads-up limit Texas hold'em game. In our preliminary implementation, each round took about 10 minutes to run.
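The search behind each per-phase MDP can be sketched as follows. This is a minimal, hypothetical illustration in the spirit of our model, not the project code: the agent maximizes over its own actions, chance nodes average uniformly over the possible card deals, and an evaluation function substitutes for the reward once the depth limit is reached. All function names here are our own:

```python
# Minimal sketch (our own illustration) of depth-limited search for one
# per-phase MDP: maximize over agent actions, average uniformly over
# chance (card-deal) outcomes, and fall back to an evaluation function
# at the depth cutoff.

def expectimax(state, depth, actions, successors, evaluate, is_end, utility):
    """Value of `state` under a depth-limited expectimax MDP.

    actions(state)   -> iterable of agent actions
    successors(s, a) -> list of equally likely next states
    evaluate(state)  -> evaluation-function value at the depth cutoff
    is_end(state)    -> True at showdown or fold
    utility(state)   -> +pot if won, -pot if lost
    """
    if is_end(state):
        return utility(state)
    if depth == 0:
        return evaluate(state)           # depth limit reached: use Eval(s)
    best = float("-inf")
    for a in actions(state):
        succ = successors(state, a)
        # uniform probability over possible dealt-card combinations
        value = sum(expectimax(s, depth - 1, actions, successors,
                               evaluate, is_end, utility)
                    for s in succ) / len(succ)
        best = max(best, value)
    return best
```

On a toy game with integer states, one can check that the recursion prefers branches leading to winning terminal states and falls back to the evaluation value once the depth budget is spent.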
Our pre-flop and flop MDPs are both depth-limited MDP models, so that the agent does not have to search through the entire state space. We therefore use an evaluation function as the reward value for these MDPs:

Eval(s) = 1 / (current hand score) + w · φ(x)

where φ(x) is the input feature vector of our learning model and w is the weight vector. Our learning model is described in more detail in the next section. The current hand score is the score evaluated on the cards in the agent's hand together with the cards currently dealt on the table. For each MDP, the model parameters are described in Table I.

TABLE I
MDP model parameters.

State: (current hand cards, tuple of table cards, agent pot value, IsEnd)
Start state: (hand cards, table cards, current pot value, 0)
Actions: {fold, bet, check}
Rewards: -(pot value) if the agent folds; the evaluation function value if the depth-limited MDP reaches its depth limit; at IsEnd, +(pot value) if won and -(pot value) if lost; 0 otherwise
Transition probability: uniform over all possible dealt-card combinations

B. Reflexive Learning Algorithm

In order to make our agent more robust to the opponent's varying style of play, we incorporate a learning algorithm that learns how the opponent bets and tries to predict what the opponent's hand cards are. We use a simple linear regression model that takes in the following features:

Table card rank: the ranks of the cards open on the table. This feature varies in size depending on which phase the game is in. During the pre-flop phase, this feature is not present. The flop phase has 3 rank values for the 3 table cards, the turn phase has 4 rank values for the 4 cards, and the river phase has 5 rank values for the 5 cards on the table. Each rank is represented by an integer defined by the Deuces package.

Table card suit: the suits of the cards open on the table.
The different suits are represented by integers according to the Deuces package. This feature also varies in size in the same manner as the table card rank feature.

Opponent bet sequence: the values the opponent has bet in the past and present rounds. For example, if we are currently in the flop phase, this feature includes the opponent's bet values during the pre-flop phase and the flop phase.

Fig. 1. Win/lose depth-limited MDPs and allowed actions.

With these features, the model outputs an estimate of the possible score of the cards in the opponent's hand. One problem we encountered was that we had no data on the opponent beforehand with which to run linear regression and create model weights for prediction. It is impossible to create a model in advance for each opponent the agent faces, and we would not have enough data if we only started building the model after playing against the opponent for the first time. To tackle this problem, we created two custom opponents for the agent to play against and collect data from beforehand. These two opponents are a conservative opponent that bets only if it has at least a pair, and an aggressive opponent that bets if it has at least a decent high card (it plays most of the time and usually bets instead of checking).
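The reflexive learner above can be sketched as follows. This is a hypothetical illustration, not our actual implementation: the features concatenate the table-card ranks, table-card suits, and the opponent's bet sequence; a linear model predicts the opponent's hand score; and an SGD step on a squared loss updates the weights. The fixed-length zero-padding and all names are our assumptions:

```python
# Hypothetical sketch of the opponent model: linear regression over the
# table-card ranks, table-card suits, and opponent bet sequence, trained
# by SGD on a squared loss. Zero-padding to a fixed length is assumed
# here so that one weight vector covers every phase.

MAX_LEN = 14  # up to 5 ranks + 5 suits + 4 bets (assumed layout)

def extract_features(table_ranks, table_suits, bet_sequence):
    """phi(x): concatenate ranks, suits, and bets, zero-padded to MAX_LEN."""
    phi = list(table_ranks) + list(table_suits) + list(bet_sequence)
    return phi + [0.0] * (MAX_LEN - len(phi))

def predict(w, phi):
    """Estimated opponent hand score: w . phi."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def sgd_step(w, phi, y, eta=0.001):
    """One update w <- w - eta * grad_w Loss, with Loss = (w.phi - y)^2."""
    residual = predict(w, phi) - y
    return [wi - eta * 2 * residual * xi for wi, xi in zip(w, phi)]
```

After each showdown, the agent observes the opponent's true hand score y and applies one such step, which is how the weight vector can keep adapting during live play.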
With these two custom opponents, we played our agent against them and collected the required data. We then ran stochastic gradient descent (SGD) on a squared loss function to create a weight vector for each of the opponents. The SGD update is:

w ← w − η ∇_w Loss(x, y, w)

where w = weight vector, x = input features, and y = opponent hand score. The parameters used in SGD are: number of iterations = 20, ε = 1 × 10^−10, η = 0.001.

In the agent's game-play interface, we require the user/opponent to choose whether they are an aggressive or a conservative player, and the agent uses the weights learnt from the corresponding custom opponent as its initial learning parameters. Then, using the data collected while playing the actual opponent, it updates this weight vector after every showdown. The updates also handle the case where a player deliberately chooses the wrong type to try to take advantage of the agent, e.g. an aggressive player who tells the agent that he or she is conservative.

C. Agent Interface

We tested our agent individually against our scripted random and oracle opponents through a Python script. However, to let our agent play against any human opponent, and to test its strength against human opponents, we created an interface to play the game. The interface is simple, as shown in Figure 2.

TABLE II
Average winnings per game of the agent versus different types of opponents.

Agent                   Random    Human    Oracle
Naive                     9.35    -4.00     -7.25
Clever, aggressive       10.15     1.96    -22.45
Clever, conservative      7.95    -2.00    -16.20

VI. RESULTS & ANALYSIS

Our agent has three modes: naive, clever aggressive, and clever conservative. The naive agent only uses the depth-limited MDPs to compute its actions and does not perform opponent learning. The clever agent learns the opponent according to the opponent type that the user specified at the beginning of the game, aggressive or conservative. We tested our agent with three types of opponent: random, human, and oracle.
The random opponent chooses its actions at each phase uniformly at random. The human opponent is an actual human playing against the agent through our interface; our human players are amateur poker players. The oracle opponent is one that knows all hidden cards: the oracle folds immediately at pre-flop if it knows that it has the worse hand, and bets through to the end otherwise. For each type of agent, we ran 100 games against the automated players (random and oracle) and 50 games against human players. We ran fewer games against human players because our agent can take up to 2 minutes to compute each phase, which put us under a time constraint. Table II shows the average winnings per game of our agent. Both the agent and the opponent start with a $0 balance.

Our agent performs better than the random player, and is roughly comparable with human players. Since our big blind is $10, we can say that our agent beats the random player by approximately one big blind per game. As expected, it lost against the oracle player; however, note that it lost at worst by only approximately twice the big blind, which is quite satisfactory given the simplicity of the model. It is also worth noting that the naive agent performs better against the oracle than either of the clever agents. We hypothesize that because the oracle has a mixed aggressive-conservative behavior (excessively aggressive if it knows it is winning, but excessively conservative otherwise), the naive agent is better off ignoring the opponent's playing style than trying to lean to either the aggressive or the conservative side.

Fig. 2. Poker game interface. The game is played in Terminal and the input is through the keyboard. Only the table cards and the human player's hole cards are displayed. The agent's hole cards are only displayed if the agent decides to play until showdown.

We have some interesting insights from the human players' experiences.
The clever aggressive agent folds less often, even when the supposedly aggressive player bets (i.e. it calls bluffs). The clever conservative agent often bets to scare off the supposedly conservative player, and is more likely to fold if the opponent bets.
VII. CONCLUSION & FUTURE WORKS

Our model shows that depth-limited Markov Decision Processes with reflexive opponent learning can be used to model a heads-up limit Texas hold'em poker game. Our agent's performance is one big blind better than a random player, on par with human players, and in the worst case about two big blinds worse than an oracle player. However, the agent currently takes quite a long time to compute: its maximum thinking time is 2 minutes (run on a 2 GHz Intel Core i7), whereas Cepheus (the agent that weakly solved the game using CFR+, developed by the Computer Poker Research Group, University of Alberta) takes less than one minute.

In future work, we plan to apply other algorithms, such as neural networks, to model the game in order to reduce the runtime, as well as to take into account more complex features of the game such as raising and bluffing. Neural networks could also serve as a good learning model for improving our evaluation function, and better evaluation functions would allow more state pruning in our MDP models.

REFERENCES

[1] H. Abadi and R. Takapoui, "Automated Heads-up Poker Player," December 12, 2014. Available: https://web.stanford.edu/class/cs221/restricted/projects/takapoui/final.pdf
[2] T. Berro, J. Benjamin, and C. Zanoci, "A poker AI agent." Available: http://web.stanford.edu/class/cs221/restricted/projects/bgalliga/final.pdf
[3] M. Zinkevich, M. Johnson, M. Bowling, and C. Piccione, "Heads-up limit hold'em poker is solved," Science 347(6218), pp. 145-149, January 8, 2015.
[4] P. McCurley, "An Artificial Intelligence Agent for Texas Hold'em Poker." Available: http://poker-ai.org/archive/pokerai.org/public/aith.pdf
[5] N. Yakovenko, L. Cao, C. Raffel, and J. Fan, "Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games Using Convolutional Networks," arXiv:1509.06731, September 22, 2015.