BetaPoker: Reinforcement Learning for Heads-Up Limit Poker
Albert Tung, Eric Xu, and Jeffrey Zhang


Introduction

Texas Hold'em Poker is the most popular variation of poker, played widely online and in casinos. As an imperfect-information game, it remains challenging to play well: for most of the game the opponent's cards are unknown, and the future board cards are unpredictable. Even in Heads-Up Limit poker, a variant with fixed betting and only two players, building an agent is difficult: it must account for the large number of possible states, model the opponent accurately without cheating (i.e. without seeing his cards), and understand risk taking.

The goal of this project is to create an AI that approximates a strong heads-up (one opponent versus another) fixed-limit Texas Hold'em strategy using techniques taught in class. We define a strong strategy as one that models the opponent and the game well enough to make a profit from other players.

Rules of the Game

The version of the game we use to test our bots is Texas Hold'em Heads-Up Limit Poker with a reverse-blinds format. In heads-up limit poker, the two players each begin with two hole cards dealt from a standard 52-card deck. In reverse blinds, the second player is designated as the dealer, who posts the small blind, while the first player posts the big blind. The second player acts first on the pre-flop and the first player acts first on the remaining stages. The blinds are fixed amounts of money; for our purposes the small blind is $5 and the big blind is $10.

The game consists of multiple "rounds" or hands, and each hand has four stages: the pre-flop, the flop, the turn, and the river. At the flop, three cards are placed on the board; at the turn and the river, one additional card is placed on the board at each stage. At each stage, each player can choose to raise (increase the bet), call (match the opponent's bet or continue the game), or fold (withdraw from the game). Although betting and checking also exist, we simplify the terminology to allow for an easier explanation. The actions for a stage continue until the maximum number of raises for the stage has been reached, both players have called, or one player has folded. For our game, the maximum number of raises allowed per stage is 4, with the exception of 3 for the pre-flop. A player wins the money bet and placed in the pot if the opposing player folds, or if the game reaches the river stage with both players having called, at which point a showdown occurs. At the showdown, the player with the best five-card hand formed from the two hole cards and the five board cards takes the entire pot, or the pot is split in case of a tie. These parameters are summarized in the sketch below.
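For concreteness, the fixed parameters of this variant can be written down as a small configuration block. The constant names below are ours and purely illustrative; they are not part of the match infrastructure.

```python
# Fixed parameters of the heads-up limit variant described above.
SMALL_BLIND = 5                      # posted by the dealer in the reverse-blinds format
BIG_BLIND = 10
STAGES = ("preflop", "flop", "turn", "river")
MAX_RAISES = {"preflop": 3, "flop": 4, "turn": 4, "river": 4}   # raise cap per stage
BOARD_CARDS = {"preflop": 0, "flop": 3, "turn": 4, "river": 5}  # board cards dealt by each stage
```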
Previous Work

Many different approaches to building bots for heads-up limit poker have been taken over the years. In 2004, Billings et al. implemented a version of Expectimax search, dubbed Miximax and Miximix, that computed expected values at nodes using probabilities over the opponent's possible hands. They chose a mixed strategy based on the weighted sums of each action, under the assumption that all chance events occur uniformly at random [Billings et al., 2004].

In 2007, CMU researchers Gilpin and Sandholm used new approximation methods that provided a heuristic for evaluating the state space of the board, together with simulation to estimate the rewards at the end of the game. Their approach could efficiently take a large portion of the game tree into account when computing the best possible action [Gilpin and Sandholm, 2007].

The most important development in recent years, however, has been counterfactual regret minimization (CFR), used extensively by researchers at the University of Alberta. In 2015, the group claimed to have solved heads-up limit poker with Cepheus, a bot that used a variant of CFR (dubbed CFR+) to compute a Nash equilibrium of the game [Bowling et al., 2015]. Most of this prior work did not use Q-learning, instead exploring the game tree with game-theoretic methods, so these approaches are complementary to ours.

Infrastructure

To get two poker bots to play against each other, we used an existing open-source infrastructure found online. It provided the following abilities: playing heads-up limit matches between two poker players built from its example template, examining the game state, and logging match information for inspection. We revised the existing code to play games with random hand deals, construct our own custom states, and create a separate player for each approach. The majority of the code for online Q-learning was written in C, while the offline Q-learning (which generated a state-to-policy mapping) was written in Python.

Baseline and Oracle

Our baseline is a bot that randomly chooses to raise, check, or fold among its valid actions. The random bot folds 6% of the time (when folding is valid) and splits the remaining probability between raising and checking (when those are valid); if one or more actions is invalid, we normalize the probabilities accordingly.

Our oracle is a pure CFR (counterfactual regret minimization) bot, the basis for many of today's top poker AIs including Cepheus; we use the open-source implementation open-pure-cfr. We trained the pure CFR bot for an hour, over 289 million iterations. Although pure CFR may not be as challenging an opponent as Cepheus, its open-source code is one of the few poker bots that can interface with our infrastructure. Furthermore, it beat our baseline consistently over the roughly 20,000 hands we played between them.

Model and Algorithm

Modeling as a Markov Decision Process

Our approach models the game as a Markov Decision Process (MDP) in which each game position is a state. We then use Q-learning to learn which action to choose in each state. The state of the game consists of the following information:

T - betting stage, i.e. T ∈ {pre-flop, flop, turn, river}
P - player whose turn it is (0 for small blind, 1 for big blind)
S - sequence of checks, calls, and raises that have occurred in the current game
H - the two cards we hold
B - the cards on the board

At each turn, the player can choose among the actions fold, check/call, and raise. Suppose it is our turn and we are at state s. Taking one of the three possible actions from s leads to a hidden probability distribution that determines the new state s'. This distribution accounts for the opponent's subsequent move as well as the next card(s) that appear on the board. Since we do not know how the opponent chooses his actions, we do not know this probability distribution over the states we could end up in. This motivates learning the transition probabilities and rewards with reinforcement learning, and we chose the Q-learning algorithm. A minimal sketch of the state representation appears below.
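The sketch below shows one way such a state could be represented. The field names are ours and cards are assumed to be written as strings such as "Ah"; this is illustrative only, not the infrastructure's actual data structure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class GameState:
    """One decision point in the MDP described above (illustrative sketch)."""
    stage: str                 # T: "preflop", "flop", "turn", or "river"
    player: int                # P: 0 for small blind, 1 for big blind
    history: Tuple[str, ...]   # S: checks, calls, and raises so far this game
    hole: Tuple[str, str]      # H: our two hole cards, e.g. ("Ah", "Kd")
    board: Tuple[str, ...]     # B: the 0, 3, 4, or 5 cards on the board

ACTIONS = ("fold", "call", "raise")   # "call" also covers checking
```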

Q-learning

For every state s we want to estimate the value of being in that state; denote this value $V(s)$. For each state-action pair $(s, a)$ we learn the expected reward of taking action $a$ from state $s$; denote this estimate $Q(s, a)$. Then

$V(s) = \max_{a \in \mathrm{Actions}(s)} Q(s, a).$

We learn $Q(s, a)$ by first extracting features from the state-action pair. Let $\phi(s, a)$ be the feature vector. We approximate $Q(s, a)$ as a linear combination of these features, $Q(s, a) = w \cdot \phi(s, a)$, where $w$ is the weight vector. On the other hand, if taking action $a$ from state $s$ leads to state $s'$, then $Q(s, a) \approx r + \gamma V(s')$, where $r$ is the reward and $\gamma$ is the discount factor; we take $\gamma = 1$ for this game. Our goal is therefore to minimize

$\sum_{(s, a, r, s')} \big( Q(s, a) - (r + V(s')) \big)^2,$

i.e. to learn the weights $w$ that minimize this quantity. Taking the gradient with respect to $w$, we find that for every tuple $(s, a, r, s')$ we can update $w$ as

$w \leftarrow w - \eta \big[ Q(s, a) - (r + V(s')) \big] \phi(s, a),$

where $\eta$ is the step size. For this game we let $\eta$ decrease over time, setting it to the reciprocal of the number of occurrences of the feature vector $\phi(s, a)$.

To choose an action at state $s$, we can take the action with the greatest estimated reward, $\arg\max_{a \in \mathrm{Actions}(s)} Q(s, a)$. However, as we will see later, we can also choose the action with an epsilon-greedy approach: with probability $\epsilon$ we pick an action uniformly at random, and otherwise we take the action with the greatest estimated reward. This lets us explore more of the state space.

Data Mining for Offline Q-learning

To train the bot in our initial approaches, we fed the Q-learning algorithm data in the form (state, action, reward, new state). The data was obtained from past games of the Annual Computer Poker Competition (ACPC). For every action each player made, we created a state object storing the current state of the game (betting phase, number of raises in the current phase, number of times the opponent raised, and hand strength); to determine hand strength, we used the external library Deuces to evaluate the poker hand. To produce the $(s, a, r, s')$ tuples, we looped through every state in a game and found the immediately following state in which the same player is to act. All states have reward 0 except when the action is fold or both players show down; in those two cases the reward is the net earning from the pot. After processing all of the data provided by the ACPC, we ended up with approximately 161 million $(s, a, r, s')$ tuples, which we fed to the Q-learning algorithm to compute the optimal policy for each state.
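The training loop that consumes these tuples can be sketched as follows. This is an illustrative reimplementation under our own names (train_offline, feature_extractor), not the exact C/Python code we ran; terminal tuples are assumed to carry next_state = None.

```python
import collections

def train_offline(samples, feature_extractor, actions=("fold", "call", "raise"), gamma=1.0):
    """Fit Q(s, a) = w . phi(s, a) from (state, action, reward, next_state) tuples.

    `feature_extractor(state, action)` returns a dict mapping feature -> value.
    Terminal transitions are assumed to have next_state = None (no future value).
    Illustrative sketch only."""
    w = collections.defaultdict(float)        # weight vector, one entry per feature
    counts = collections.defaultdict(int)     # occurrences of each feature vector

    def q(state, action):
        return sum(w[f] * x for f, x in feature_extractor(state, action).items())

    def v(state):
        return 0.0 if state is None else max(q(state, a) for a in actions)

    for s, a, r, s_next in samples:
        phi = feature_extractor(s, a)
        counts[frozenset(phi.items())] += 1
        eta = 1.0 / counts[frozenset(phi.items())]   # step size: 1 / #occurrences of phi(s, a)
        residual = q(s, a) - (r + gamma * v(s_next))
        for f, value in phi.items():                 # gradient step on the squared error
            w[f] -= eta * residual * value
    return dict(w)
```

The fixed policy used by the offline bots is then simply the argmax over actions of the learned Q-values in each state, written out to a text file.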

Approaches, Results and Analysis

Throughout the project we took multiple approaches to improving and enhancing our poker bot. Our goal evolved from minimizing our losses against the oracle to maximizing the amount of money we could win from it. For each approach, we ran at least 250,000 rounds (hands) against both our oracle and our baseline.

Approach 1: Offline Q-learning with Hand Categorization

For the first approach, we selected the following features from the state:

T - betting stage, i.e. T ∈ {pre-flop, flop, turn, river}
X - total raises in this stage
Y - total opponent raises
H - categorization of our hand

The hand categorization is a string describing the best hand we currently have among our two hole cards and the cards on the board. Valid categorization strings are high card, pair, two pair, triple, straight, flush, full house, four of a kind, and straight flush. Assuming there can be at most 4 raises in a stage and that the opponent raises at most 20 times, our feature space contains roughly 2880 states. Computing the features takes roughly linear time, since we only need to run through the cards once to compute the hand categorization. We then bundled these characteristics into one feature (T, X, Y, H). An example feature vector φ(s, a) is {(1, 1, 2, pair, call) : 1}: we are at the flop stage, there has been one raise in this stage, the opponent has raised twice in the game, our best hand is a pair, and we take the action call (a sketch of this extractor appears after the results below).

With this feature extractor, we ran Q-learning on the ACPC data set to compute the values of Q(s, a). We then ran through all states, obtained the optimal action for each, and printed them to a text file. Our poker bot reads this file and plays the game by taking the action specified in the file. If a specific state does not exist in our policy, the poker bot calls/checks.

Results for Approach 1 (see Appendix A.i for plots)

Approach 1 vs Baseline and Oracle
Our Earnings vs. Baseline: -$4,080
Our Earnings vs. Oracle: -$5,617,610

This first approach showed that simply classifying the best hand we currently hold leads to large losses against the oracle, while neither consistently beating nor losing to the random player. Upon examination, we found that for many states our policy was too aggressive: we would continue to raise even from bad positions. For example, holding 3-5 offsuit as our hole cards, we kept raising when a normal, conservative player would have checked or folded. We attribute this to the fact that a simple classification does not accurately capture our hand's potential.
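A sketch of this extractor and of the call/check fallback follows. The function names are ours, and hand_category is assumed to be precomputed (e.g. with Deuces) as one of the strings listed above.

```python
def phi_approach1(stage, raises_this_stage, opponent_raises, hand_category, action):
    """Approach 1 feature vector: a single indicator keyed by (T, X, Y, H, action),
    e.g. {(1, 1, 2, "pair", "call"): 1} for the flop, one raise this stage,
    two opponent raises, a pair as our best made hand, and the action call."""
    return {(stage, raises_this_stage, opponent_raises, hand_category, action): 1}


def act_from_policy(policy, state_key):
    """Play the policy computed offline; states missing from the policy
    default to call/check, as described above."""
    return policy.get(state_key, "call")
```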

Approach 2: Offline Q-learning with a Hand Potential Heuristic

Rather than considering only the best hand we currently hold, we can consider how good our hand could become as the game evolves. This led to a different feature extractor with the following features:

T - betting stage, i.e. T ∈ {pre-flop, flop, turn, river}
P - player whose turn it is (0 for small blind, 1 for big blind)
X - total raises in this stage
H - representation of hand potential

If the stage is pre-flop, H is the two ranks of the cards in our hand followed by a character indicating whether the cards share a suit (no indicator if the hand is a pair); e.g. 97s corresponds to 9-7 suited, 88 corresponds to pocket 8s, and AKo corresponds to Ace-King offsuit. Otherwise, H is a tuple (x1, x2, x3, x4), where:

x1 is the number of board cards whose rank matches at least one of the ranks in our hand.
x2 is the highest power of a matched card in our hand, or of any card in our hand if there is no matched card; the power of a card is defined as the number of board cards whose rank is less than or equal to that card's rank.
x3 categorizes a flush draw: 1 if exactly one card in our hand matches suit with 3 cards on the board, 2 if the player has a flush and there is no flush on the board, 3 if the player has suited hole cards and there are 2 board cards of that suit, and 0 otherwise.
x4 is 1 if there is a straight draw, 2 if there is a straight, and 0 otherwise.

We can bound the hand potential heuristic by x1 ≤ 5, x2 ≤ 5, x3 ≤ 4, and x4 ≤ 3, so our overall feature space contains roughly 9600 states. The heuristic can be computed in linear time by sweeping through the cards once. As in the first approach, we grouped these characteristics into one feature (T, P, X, H), ran Q-learning on the same ACPC data set to compute Q(s, a), and extracted the optimal policy. The bot again plays according to the fixed policy, and if a game state does not exist in our policy, the bot calls/checks.

Results for Approach 2 (see Appendix A.ii for plots)

Approach 2 vs Baseline and Oracle
Our Earnings vs. Baseline: $1,951,400
Our Earnings vs. Oracle: -$2,621,670

With the larger state space we were better able to categorize the potential of our hand, taking into account the rank of our cards and the possible future draws. The feature space is now larger (9600 states compared to 2880), and with this additional information we consistently beat the random bot and mitigated our losses to the oracle compared to the first approach. However, when playing against the CFR bot we found many instances in which the state we reached had no policy assigned to it.

Approach 3: Generalizing the Hand Potential Heuristic

From the previous approach, we found that approximately 500 states were not covered by the optimal policy. This may be a limitation of our training data: some states that arose in play were not present in the training data, so we performed a sub-optimal action instead. As a result, we created a feature extractor that generalizes across states. Taking an action a and the previous features (T, P, X, H) with H = (x1, x2, x3, x4), we build the feature vector

{(T, a, P) : 1, (T, a, X) : 1, (T, a, x1) : 1, (T, a, x2) : 1, (T, a, x3) : 1, (T, a, x4) : 1}.

Essentially, for a given stage and action, each characteristic (player, total raises, and each component of the hand potential) becomes its own feature. This allows us to compute an optimal policy for every state. We again followed the offline Q-learning approach, first training on the ACPC data set to compute the optimal policy for all states and then playing according to the fixed policy. A sketch of this generalized extractor is given below.
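The only liberty taken in the sketch is tagging each component ("P", "X", "x1", ...) inside the key so that, for example, P = 1 and X = 1 remain distinct features; the function name is ours.

```python
def phi_generalized(stage, player, raises_this_stage, hand_potential, action):
    """Approach 3 feature vector: for a given (stage, action), emit one indicator
    per characteristic instead of bundling everything into a single key.
    `hand_potential` is the tuple (x1, x2, x3, x4) described above."""
    x1, x2, x3, x4 = hand_potential
    T, a = stage, action
    return {(T, a, "P", player): 1,
            (T, a, "X", raises_this_stage): 1,
            (T, a, "x1", x1): 1, (T, a, "x2", x2): 1,
            (T, a, "x3", x3): 1, (T, a, "x4", x4): 1}
```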

Results for Approach 3 (see Appendix A.iii for plots)

Approach 3 vs Baseline and Oracle
Our Earnings vs. Baseline: $1,493,040
Our Earnings vs. Oracle: -$206,975

With this feature extractor we still consistently beat our baseline, but more importantly we reduced our losses to the oracle. By writing a feature extractor that generalizes to unseen states, we are able to compute a better policy for the states that were missing in our previous approach. However, we still performed poorly against the CFR bot; we believe this was caused by the limitations of offline training, which cannot actively model the strategy of the current opponent.

Approach 4: Online Q-learning with the Generalized Hand Potential Heuristic

Since different poker players play differently, it would be better for the poker bot to counter and adapt to the opponent's strategy. We therefore implemented an online Q-learning bot that learns the rewards for each state while playing the opponent. For each game, we store the features (using the same feature extractor as in the third approach) for every state we were in, along with the action we took in that state. At the end of each game, we obtain the reward for that game and incorporate the feedback from the (s, a, r, s') tuples into the weight vector. However, we are now playing the opponent from scratch, so in the first few thousand games we cannot yet determine the optimal action at each state. We therefore use epsilon-greedy exploration: initially we set ε = 0.1, so with probability 0.1 we choose an action uniformly at random and with probability 0.9 we choose the action with the maximum expected reward. This encourages exploration in the beginning. As the bot plays more games, we decrease epsilon by 10^-9 per game, gradually shifting from exploration to exploitation (sketched after the results below).

Results for Approach 4 (see Appendix A.iv for plots)

Approach 4 vs Baseline and Oracle
Our Earnings vs. Baseline: $815,995
Our Earnings vs. Oracle: $916,895

With online Q-learning, we actively learn from the games we play and earn similar rewards against both our baseline and our oracle. We appear to learn the optimal weights within the first fifty to seventy thousand hands played with the epsilon-greedy approach, since that is when our average reward per thousand hands becomes positive; after that, we exploit the opponent's strategies.
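The action selection used by the online learner can be sketched as follows; q_value stands for the learned linear approximation w · φ(s, a), and the names are ours.

```python
import random

def choose_action(q_value, state, valid_actions, epsilon):
    """Epsilon-greedy selection: with probability epsilon pick a valid action
    uniformly at random (exploration), otherwise pick the action with the
    highest estimated Q-value (exploitation). Illustrative sketch."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    return max(valid_actions, key=lambda a: q_value(state, a))

# After every hand, epsilon is decayed by a small constant so the bot shifts
# from exploration toward exploitation over hundreds of thousands of hands:
#     epsilon = max(0.0, epsilon - 1e-9)
```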

Approach 5: Online Q-learning with the TwoPlusTwo Evaluator

In this approach we replace the hand potential heuristic with the TwoPlusTwo evaluator, which lets us move away from an imperfect evaluation of our hand's potential. The TwoPlusTwo evaluator is a look-up table that ranks any seven-card hand in constant time, returning the hand's rank relative to all other hands. For the flop, turn, and river, we can therefore compute the probability that we win over all possible hands the opponent could have and all possible remaining board cards. For the pre-flop stage, we cannot brute-force over all possible configurations, so we used another pre-computed look-up table with approximate win probabilities [Teofilo et al., 2013]. We also added back the count of opponent raises as a feature, because it gives a rough estimate of how good the opponent thinks his or her hand is, and a measure of his or her aggressiveness. Our feature selection consists of the following:

T - betting stage, i.e. T ∈ {pre-flop, flop, turn, river}
P - player whose turn it is (0 for small blind, 1 for big blind)
X - total raises in this stage
Y - total opponent raises
H - probability of winning over all possible opponent hands and remaining board cards

Bucketing the probability to the nearest integer, our feature space now contains roughly 64000 states. Computing the features is also more expensive: in the flop stage we must loop over every possible pair of remaining board cards as well as every possible pair of opponent cards to compute our win probability. Our feature extractor then generalizes by separating the opponent-raise count from the win probability: if we perform action a in a state with features (T, P, X, Y, H), the feature vector is

{(T, P, X, a, Y) : 1, (T, P, X, a, H) : 1}.

We then ran the same online Q-learning algorithm with the same epsilon-greedy approach as in the previous section.

Results for Approach 5 (see Appendix A.v for plots)

Approach 5 vs Baseline and Oracle
Our Earnings vs. Baseline: $1,599,390
Our Earnings vs. Oracle: $1,691,505

The earnings with the TwoPlusTwo evaluator almost double those of the original online Q-learning algorithm. The win probability is a better estimate of a hand's value than the hand potential heuristic, and we are now also modeling the opponent through the number of times he or she has raised, which is another indicator of how we fare against the opponent's hand. The drawback is that we now store a larger feature space (64000 states vs. 9600) and feature computation is far slower than the roughly 50 operations per turn required previously.

Approach 6: Online Q-learning with the TwoPlusTwo Evaluator and the Chen Formula

While studying the results of Approach 5, we found that the bot occasionally makes sub-optimal moves during the pre-flop stage by immediately folding decent cards (such as Jack-8). To fix this problem, we added a separate heuristic for the pre-flop cards, the Chen formula, which approximates the relative value of all pocket hands (sketched after the results below). Keeping the rest of the features the same, we then ran the online Q-learning algorithm with epsilon-greedy exploration as in the previous approach [Teofilo et al., 2013].

Results for Approach 6 (see Appendix A.vi for plots)

Approach 6 vs Baseline and Oracle
Our Earnings vs. Baseline: $1,788,160
Our Earnings vs. Oracle: $1,961,315

With a better pre-flop feature, our bot has a better understanding of the strength of its hand before the flop and is less likely to fold good hands. As a result, we avoid the earlier mistakes and improve our earnings by not conceding blinds to the opponent, without changing the feature space or time complexity of the previous approach.
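For reference, the Chen formula as it is commonly stated looks roughly like the sketch below; details such as the final rounding vary between descriptions, so treat this as an approximation rather than the exact scoring we used.

```python
# Chen-formula base scores for the high card: A=10, K=8, Q=7, J=6; lower ranks score rank / 2.
CHEN_HIGH_CARD = {14: 10.0, 13: 8.0, 12: 7.0, 11: 6.0}

def chen_score(rank1, rank2, suited):
    """Approximate pre-flop hand value; ranks are 2..14 with 14 = Ace."""
    hi, lo = max(rank1, rank2), min(rank1, rank2)
    score = CHEN_HIGH_CARD.get(hi, hi / 2.0)
    if hi == lo:                                   # pocket pair: double, minimum 5
        return max(5.0, 2 * score)
    if suited:                                     # suited cards get a bonus
        score += 2
    gap = hi - lo - 1                              # 0 for connectors
    score -= {0: 0, 1: 1, 2: 2, 3: 4}.get(gap, 5)  # penalty grows with the gap
    if gap <= 1 and hi < 12:                       # straight potential below a Queen
        score += 1
    return score
```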

Experimenting with Non-linear Features

All of the state features in the first six approaches are binary or linear, but we also ran tests with non-linear features. In particular, we created a quadratic feature equal to the sum of the squares of the number of opponent raises in each betting stage, which lets the opponent's raising have a bigger impact on the estimated reward of each state. Unfortunately, this feature extractor paired with online Q-learning did not succeed, losing approximately $5,000 per thousand hands. Reflecting on the results, we found that the player often folded immediately in the pre-flop stage. This suggests we weighted the opponent's raises too heavily, causing our bot to be bluffed out by the opponent's actions.

Approach 7: Online Q-learning with Hidden States

In the previous approaches we can determine exactly which state we are in, because all the information that defines a state is available to us during the game. One shortcoming of these approaches is that storing all possible information leads to a large state space, which in turn causes a much slower learning rate. In particular, we could not afford to store the history of the opponent's bets during the round and instead had to store only the total number of raises in the current phase. Knowing the opponent's bets mainly serves to give a better estimate of what hand he actually has. This motivates an approach with hidden states that store all of the information in the game, i.e. which hand each player has. In other words, if we knew the opponent's cards we would have another reinforcement learning problem (note that it is not optimal simply to bet based on expectation; in fact, that serves as a lower bound on the reward we could obtain). We define a state in this reinforcement learning problem as the tuple (T, P, X, (B1, B, B2)), where:

T - betting stage, i.e. T ∈ {pre-flop, flop, turn, river}
P - player whose turn it is (0 for small blind, 1 for big blind)
X - total raises in this stage
B1 - the bucketed probability that our hand wins, assuming all possible pairs of cards the opponent could have are equally likely (the probability we think we will win)
B - the bucketed probability that our hand wins against the opponent's actual hand (the true probability that we will win)
B2 - the bucketed probability that the opponent's hand wins, assuming all possible pairs of cards we could have are equally likely (the probability the opponent thinks he will win)

Total reward: $4,447,175 (Figure 1: result of 250,000 hands with known opponent cards). The algorithm converges extremely quickly, in fewer than 1,000 hands, and maintains a steady gain of about $18,000 per 1,000 hands (180 big blinds per 100 hands, or 180bb/100). Of course, we do not actually know the opponent's cards, so the states are hidden. Thus 180bb/100 is the best we could do (if we somehow guessed the opponent's exact cards every time), and the performance of this algorithm depends on how well we can estimate the true state we are in.

Our approach uses a hidden Markov model for this estimation. Let x_1, x_2, ..., x_n be the actions the opponent took, where x_i is the action on his i-th turn. Suppose we know the emission probability distribution p(x | s). Then on the n-th turn we can compute p(s | x_1, x_2, ..., x_n), the probability that the true state is s, using an algorithm like the forward-backward algorithm (a single forward step is sketched below).
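One forward step of that estimation might look like the following sketch, with the belief, transition, and emission models kept as plain dictionaries; all names are ours and this is not the infrastructure's API.

```python
def forward_step(belief, transition, emission, observed_action):
    """One forward-algorithm step over the hidden states: propagate the belief
    through the transition model (e.g. a new board card changing what each
    possible opponent holding is worth), weight by how likely the observed
    opponent action is under each state, and renormalize."""
    predicted = {}
    for s, p in belief.items():
        for s_next, p_t in transition.get(s, {s: 1.0}).items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * p_t
    posterior = {s: p * emission.get((s, observed_action), 1e-6)
                 for s, p in predicted.items()}
    total = sum(posterior.values())
    return {s: p / total for s, p in posterior.items()} if total > 0 else predicted
```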

Q-learning gives us a distribution p(a | s), the probability distribution from which we choose an action in a state. Combining this with p(s | x_1, x_2, ..., x_n) and marginalizing out s, we obtain p(a | x_1, x_2, ..., x_n), the distribution from which we choose an action given the opponent's actions. This procedure is described in [Ivanov et al., 2000].

The question is how to find the emission probability distribution p(x | s). We make the assumption that at the end of each hand both players reveal their cards. We can then simply count how many times the opponent took action x from state s and obtain a maximum likelihood estimate of p(x | s) directly. Without this assumption (in reality, most versions of heads-up limit do not have any such showing), we could instead use an EM algorithm that uses the rewards to estimate the states in hindsight, also described in [Ivanov et al., 2000]. We did not attempt this variant, but we believe it would perform well. Either way, we end up with a distribution p(a | x_1, x_2, ..., x_n), from which we sample an action and report the reward back to our Q-learning algorithm.

We need to store the Q-table, which requires O(SA) space, where S is the number of states and A the number of actions. In our case there are only 3 actions, and the number of states is a constant factor times B^3, where B is the number of buckets we choose to use for Q-learning. We also need a table of emission probabilities, with one entry per state-action pair, so this also requires O(SA) = O(B^3) space. Each time a new board card appears, we consider all possible pairs of cards the opponent could have and update the transition probabilities of our HMM, p(s_{i+1} | s_i). Storing these transition probabilities naively requires O(S^2) space. Normally this would be O(B^6) memory, but the dimension corresponding to the bucketed probability that we think we win does not affect the transition, so we can reduce the memory to O((B^2)^2) = O(B^4). This dominates the space complexity, so the total space complexity is O(B^4).

For the time complexity, we not only have to populate this transition table, which takes O(B^4) time (and can actually be precomputed), but we also have to compute the probability that we win and the probability that the opponent thinks he wins. We use Monte Carlo simulation for this: running X simulations costs O(X) for every pair of cards the opponent could hold, and there are O(D^2) such pairs with D = 52. The total time complexity is therefore O(B^4 + N X D^2), where N is the number of rounds we play. We ran into some trouble implementing this algorithm, so its results have been omitted.
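Putting the estimation pieces of this section together, the marginalization and the count-based emission estimate can be sketched as follows (illustrative only; names are ours).

```python
def action_distribution(belief, policy):
    """Marginalize out the hidden state:
    p(a | x_1..x_n) = sum_s p(s | x_1..x_n) * p(a | s).
    `belief` maps state -> probability; `policy` maps state -> {action: prob}."""
    dist = {}
    for s, p_s in belief.items():
        for a, p_a in policy[s].items():
            dist[a] = dist.get(a, 0.0) + p_s * p_a
    return dist


def ml_emissions(showdown_counts):
    """Maximum-likelihood estimate of p(x | s) from hands where the opponent's
    cards (and hence the true state s) were revealed: count how often action x
    was taken from state s and normalize."""
    totals = {}
    for (s, _x), c in showdown_counts.items():
        totals[s] = totals.get(s, 0) + c
    return {(s, x): c / totals[s] for (s, x), c in showdown_counts.items()}
```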

Future Work

There is always room to explore different state spaces and feature extractors for the standard online Q-learning algorithms described in this paper. Opponent-modeling features beyond the number of times the opponent raised may allow us to measure when the opponent is playing aggressively or conservatively. Different non-linear functions may also help if opponents dynamically change their strategy in an attempt to adapt.

For the hidden-state Q-learning algorithm, there is a lot of work to be done in finding the best weight to assign to the opponent's actions when guessing what hand he has. A lower weight means we treat the opponent's hand as closer to uniformly distributed (which works well against aggressive players), while a higher weight means the opponent's hand is treated as a more direct reflection of his actions (which works well against conservative players). This weight can be adjusted through the weight of the Laplace prior in our maximum likelihood estimate of the emission probabilities (see the example below). Also, as mentioned, an EM algorithm can be used for games in which players do not show their cards at the end of every hand.

We also look forward to testing our bot against other oracles. Our current oracle, pure-cfr, is a GTO (game-theoretically optimal) style bot and thus does not always play like most humans or other bots. Furthermore, pure-cfr, like many other open-source bots, is a fixed-distribution bot: it has a fixed probability distribution over the actions it chooses in a particular state and does not incorporate information from outside the immediate current round. In particular, it does not adapt to opponents over time, which explains why it loses so badly to many of the approaches we tried. For example, our standard online Q-learning bot was able to win at a rate of 80bb/100 (80 big blinds per 100 hands), which is better than what we would win if the opponent simply folded every hand (75bb/100).
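As a concrete instance of this smoothing knob, an add-$\alpha$ (Laplace-smoothed) estimate of the emission probabilities would take a form like

$\hat{p}(x \mid s) = \dfrac{\mathrm{count}(s, x) + \alpha}{\mathrm{count}(s) + \alpha \, |X|},$

where $|X|$ is the number of actions: a large $\alpha$ keeps the opponent model close to uniform, while a small $\alpha$ trusts the observed actions more ($\alpha$ is our notation and is not fixed by anything above).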

Appendix A

(i) Figure 2: Result of 250,000 hands for Approach 1 played against the baseline (left) and the oracle (right).
(ii) Figure 3: Result of 250,000 hands for Approach 2 played against the baseline (left) and the oracle (right).
(iii) Figure 4: Result of 250,000 hands for Approach 3 played against the baseline (left) and the oracle (right).

(iv) Figure 5: Result of 250,000 hands for Approach 4 played against the baseline (left) and the oracle (right).
(v) Figure 6: Result of 250,000 hands for Approach 5 played against the baseline (left) and the oracle (right).
(vi) Figure 7: Result of 250,000 hands for Approach 6 played against the baseline (left) and the oracle (right).

Appendix B: References

D. Billings, M. Bowling, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Game tree search with adaptation in stochastic imperfect information games. In Proceedings of the 4th International Conference on Computers and Games, pages 21-34, 2004.

M. Bowling, N. Burch, M. Johanson, and O. Tammelin. Heads-up limit hold'em poker is solved. Preprint, 2015.

A. Gilpin and T. Sandholm. Better automated abstraction techniques for imperfect information games, with application to Texas hold'em poker. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, 2007.

Y. Ivanov, B. Blumberg, and A. Pentland. EM for perceptual coding and reinforcement learning tasks. Preprint, 2000.

L. F. Teófilo, L. P. Reis, and H. Lopes Cardoso. Estimating the odds for Texas hold'em poker agents. In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology, 2013.


More information

Chapter 3 Learning in Two-Player Matrix Games

Chapter 3 Learning in Two-Player Matrix Games Chapter 3 Learning in Two-Player Matrix Games 3.1 Matrix Games In this chapter, we will examine the two-player stage game or the matrix game problem. Now, we have two players each learning how to play

More information

An Empirical Evaluation of Policy Rollout for Clue

An Empirical Evaluation of Policy Rollout for Clue An Empirical Evaluation of Policy Rollout for Clue Eric Marshall Oregon State University M.S. Final Project marshaer@oregonstate.edu Adviser: Professor Alan Fern Abstract We model the popular board game

More information

A. Rules of blackjack, representations, and playing blackjack

A. Rules of blackjack, representations, and playing blackjack CSCI 4150 Introduction to Artificial Intelligence, Fall 2005 Assignment 7 (140 points), out Monday November 21, due Thursday December 8 Learning to play blackjack In this assignment, you will implement

More information

A Rule-Based Learning Poker Player

A Rule-Based Learning Poker Player CSCI 4150 Introduction to Artificial Intelligence, Fall 2000 Assignment 6 (135 points), out Tuesday October 31; see document for due dates A Rule-Based Learning Poker Player For this assignment, teams

More information

CS 188 Fall Introduction to Artificial Intelligence Midterm 1

CS 188 Fall Introduction to Artificial Intelligence Midterm 1 CS 188 Fall 2018 Introduction to Artificial Intelligence Midterm 1 You have 120 minutes. The time will be projected at the front of the room. You may not leave during the last 10 minutes of the exam. Do

More information

Accelerating Best Response Calculation in Large Extensive Games

Accelerating Best Response Calculation in Large Extensive Games Accelerating Best Response Calculation in Large Extensive Games Michael Johanson johanson@ualberta.ca Department of Computing Science University of Alberta Edmonton, Alberta, Canada Michael Bowling bowling@ualberta.ca

More information

Creating a Poker Playing Program Using Evolutionary Computation

Creating a Poker Playing Program Using Evolutionary Computation Creating a Poker Playing Program Using Evolutionary Computation Simon Olsen and Rob LeGrand, Ph.D. Abstract Artificial intelligence is a rapidly expanding technology. We are surrounded by technology that

More information

Game Theory and Randomized Algorithms

Game Theory and Randomized Algorithms Game Theory and Randomized Algorithms Guy Aridor Game theory is a set of tools that allow us to understand how decisionmakers interact with each other. It has practical applications in economics, international

More information

Using Counterfactual Regret Minimization to Create Competitive Multiplayer Poker Agents

Using Counterfactual Regret Minimization to Create Competitive Multiplayer Poker Agents Using Counterfactual Regret Minimization to Create Competitive Multiplayer Poker Agents Nick Abou Risk University of Alberta Department of Computing Science Edmonton, AB 780-492-5468 abourisk@cs.ualberta.ca

More information

Achieving Desirable Gameplay Objectives by Niched Evolution of Game Parameters

Achieving Desirable Gameplay Objectives by Niched Evolution of Game Parameters Achieving Desirable Gameplay Objectives by Niched Evolution of Game Parameters Scott Watson, Andrew Vardy, Wolfgang Banzhaf Department of Computer Science Memorial University of Newfoundland St John s.

More information

Poker AI: Algorithms for Creating Game-Theoretic Strategies for Large Incomplete-Information Games Tuomas Sandholm

Poker AI: Algorithms for Creating Game-Theoretic Strategies for Large Incomplete-Information Games Tuomas Sandholm Poker AI: Algorithms for Creating Game-Theoretic Strategies for Large Incomplete-Information Games Tuomas Sandholm Professor Carnegie Mellon University Computer Science Department Machine Learning Department

More information

An Introduction to Poker Opponent Modeling

An Introduction to Poker Opponent Modeling An Introduction to Poker Opponent Modeling Peter Chapman Brielin Brown University of Virginia 1 March 2011 It is not my aim to surprise or shock you-but the simplest way I can summarize is to say that

More information

Creating a New Angry Birds Competition Track

Creating a New Angry Birds Competition Track Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference Creating a New Angry Birds Competition Track Rohan Verma, Xiaoyu Ge, Jochen Renz Research School

More information

Refining Subgames in Large Imperfect Information Games

Refining Subgames in Large Imperfect Information Games Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Refining Subgames in Large Imperfect Information Games Matej Moravcik, Martin Schmid, Karel Ha, Milan Hladik Charles University

More information

SUPPOSE that we are planning to send a convoy through

SUPPOSE that we are planning to send a convoy through IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL. 40, NO. 3, JUNE 2010 623 The Environment Value of an Opponent Model Brett J. Borghetti Abstract We develop an upper bound for

More information

Strategy Purification

Strategy Purification Strategy Purification Sam Ganzfried, Tuomas Sandholm, and Kevin Waugh Computer Science Department Carnegie Mellon University {sganzfri, sandholm, waugh}@cs.cmu.edu Abstract There has been significant recent

More information

Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization

Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, and Michael Bowling University of Alberta Edmonton,

More information

Hierarchical Abstraction, Distributed Equilibrium Computation, and Post-Processing, with Application to a Champion No-Limit Texas Hold em Agent

Hierarchical Abstraction, Distributed Equilibrium Computation, and Post-Processing, with Application to a Champion No-Limit Texas Hold em Agent Hierarchical Abstraction, Distributed Equilibrium Computation, and Post-Processing, with Application to a Champion No-Limit Texas Hold em Agent Noam Brown, Sam Ganzfried, and Tuomas Sandholm Computer Science

More information

Ar#ficial)Intelligence!!

Ar#ficial)Intelligence!! Introduc*on! Ar#ficial)Intelligence!! Roman Barták Department of Theoretical Computer Science and Mathematical Logic So far we assumed a single-agent environment, but what if there are more agents and

More information

Computer Poker Research at LIACC

Computer Poker Research at LIACC Computer Poker Research at LIACC Luís Filipe Teófilo, Luís Paulo Reis, Henrique Lopes Cardoso, Dinis Félix, Rui Sêca, João Ferreira, Pedro Mendes, Nuno Cruz, Vitor Pereira, Nuno Passos LIACC Artificial

More information

CS188 Spring 2011 Written 2: Minimax, Expectimax, MDPs

CS188 Spring 2011 Written 2: Minimax, Expectimax, MDPs Last name: First name: SID: Class account login: Collaborators: CS188 Spring 2011 Written 2: Minimax, Expectimax, MDPs Due: Monday 2/28 at 5:29pm either in lecture or in 283 Soda Drop Box (no slip days).

More information

Intelligent Gaming Techniques for Poker: An Imperfect Information Game

Intelligent Gaming Techniques for Poker: An Imperfect Information Game Intelligent Gaming Techniques for Poker: An Imperfect Information Game Samisa Abeysinghe and Ajantha S. Atukorale University of Colombo School of Computing, 35, Reid Avenue, Colombo 07, Sri Lanka Tel:

More information