Heads-up Limit Texas Hold'em Poker Agent

Nattapoom Asavareongchai and Pin Pin Tea-mangkornpan
CS221 Final Project Report

Abstract
Our project aims to create an agent that can play heads-up limit Texas Hold'em poker against a human opponent. The agent learns its opponent's betting behavior and exploits it within its game-play model to maximize its reward.

I. INTRODUCTION

Poker is one of the world's most popular games and the most popular card game in the world. It offers excitement and action, demands great skill from an expert player, and contains an element of luck. Its strategic challenges and psychological elements contribute greatly to its popularity, making poker a very social, human game. The object of poker is simple: to win the money in the center of the table, called the pot, which contains the sum of the bets made by the participants of that round. Players bet on the belief that they have the best hand, or in the hope that they can make a better hand fold, i.e. leave the game, giving up the pot to them.

Poker poses challenges that make it an effective platform for artificial intelligence research. First, it is a non-deterministic game with stochastic outcomes. With the opponent's hand hidden, imperfect knowledge causes typical search algorithms to fail. The agent needs to perform risk management to handle betting strategies and their consequences, and to identify patterns in the opponent's play in order to exploit them. An advanced poker agent may also deal with deception (bluffing) and with unreliable information when taking the opponent's deceptive plays into account.

There are many variants of poker with different numbers of players and levels of complexity, such as Omaha, Seven-card stud, Texas hold'em, and Five-card draw. Since Texas hold'em is the most widely played form of poker, we decided to build an agent that plays a simplified version of heads-up limit Texas hold'em, which is played by two players with a fixed bet size.

II. TASK DEFINITION & INFRASTRUCTURE

Texas hold'em poker may involve two or more players. Because the problem becomes much more complicated with more than two players, we limit it to a game between our agent and a single opponent (a one-on-one game). The game runs like a normal Texas hold'em game, with pre-flop, flop, turn, and river stages in which each player can bet, check, call, or fold.

Another challenge in poker is the range of bets allowed during the game. Since bets can be arbitrary, varying from $1 to over $1,000 for example, we model our problem by fixing the bets to a small set of values: we only allow $10 or $20 bets and no other amounts, with a maximum of $20 per stage. This reduces the complexity of our agent model and algorithm. With this in mind, we set the big blind at $10 and the small blind at $5. Note that in our game the big blind and small blind alternate every round; because the advantage of being the big blind in a real game may affect the overall outcome, alternating the blinds keeps the setup fair. Our agent plays multiple rounds against the human opponent and tries to win the largest amount of money possible.
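For concreteness, the simplified betting structure above can be summarized as a handful of constants (the values mirror the text; the names are ours and purely illustrative, not taken from the project's code):

    # Simplified limit-betting structure described above (illustrative constants).
    SMALL_BLIND = 5          # dollars
    BIG_BLIND = 10           # dollars
    ALLOWED_BETS = (10, 20)  # the only bet sizes permitted
    MAX_BET_PER_STAGE = 20   # betting in each stage is capped at $20
    # The big and small blinds alternate between the agent and its opponent every round.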
To do this, the agent models the game as a Markov Decision Process and tries to learn the opponent's behavior using a reflexive machine learning algorithm. The learning algorithm consists of a simple linear regression model with a feature extractor, explained further in the model and algorithm section (Section V).

One notable characteristic of poker is the ability of players to bluff and gain an advantage in certain rounds. Due to the complicated, non-deterministic nature of bluffing, we do not model bluffing explicitly, which keeps the problem tractable. However, bluffing may still be captured implicitly by our learning model of the opponent while the agent is playing.

Since we incorporate opponent learning, we had to gather training data to train our agent. To do so, we wrote our agent code and ran the agent many times against custom opponents with fixed behaviors of varying aggressiveness. The resulting data are recorded and learned by the agent. Further data are collected during actual game play against a human opponent: the agent gathers the opponent's betting behavior on the fly and updates its learning model constantly. More details are given in Section V.

III. RELATED WORKS

There are two previous CS 221 projects that implemented a poker-playing agent.

The first, by Berro et al., implemented the agent using an MDP [2]. They used Q-learning with a poker feature extractor and the epsilon-greedy algorithm to learn an optimal policy. While learning the policy, they ran Monte Carlo simulations against fixed-policy opponents to update the feature vector. However, the authors noted that their agent performed worse than average human players because it did not explore enough states and adopted a very risk-averse playing style. Our approach is similar in its use of an MDP, and we observed the same issue: for the MDP to perform well, efficiency must be traded off.

The second project, by Abadi and Takapoui, created an automated player using probabilistic graphical models to concurrently explore the state space and exploit knowledge acquired about the opponent [1]. Since both sides were agents, there was no comparison with human players; instead, they defined a variety of agents and compared their performance by evaluating each agent's ability to estimate the opponent's latent feature vectors.

Among other related works, Yakovenko et al. implemented a self-trained poker system using a convolutional-neural-network-based learning model to learn patterns in three different types of poker games: video poker, Texas hold'em, and 2-7 triple draw [5]. Their matrix representation of poker games, designed to be processable by the convolutional network, is worth noting. They encode each card as a 4 × 13 sparse binary matrix corresponding to the 4 suits and 13 ranks of playing cards; the matrix is zero-padded to 17 × 17 to help with the convolution and max-pooling computations. For five-card games, they also add the sum of the 5 card layers as an additional 17 × 17 layer to capture whole-hand information. This encoding has several advantages, the most interesting being that the full-hand representation makes it easy to model common poker patterns, such as a pair (two cards of the same rank, which share a column) or a flush (five cards of the same suit, which share a row), without game-specific card sorting or suit isomorphisms (e.g. AsKd is essentially the same as KhAc). For multi-round games, they keep track of context information, such as the pot size and the bets made so far, by adding layers with different encodings for each feature; the poker tensor is also extended to encode game-state context that is not measured in cards.

In 2015, Bowling et al. from the Computer Poker Research Group at the University of Alberta reported that heads-up limit Texas hold'em had been weakly solved using CFR+, a variant of the CFR (counterfactual regret minimization) algorithm, an iterative method for approximating a Nash equilibrium of an extensive-form game through repeated self-play between two regret-minimizing algorithms [3]. We did not pursue this approach, given the scope of this class.

IV. RULES

We use slightly simplified rules of heads-up limit Texas hold'em. The game is played between two players and consists of four phases: pre-flop, flop, turn, and river. We follow the limit Texas hold'em convention in which bets during the pre-flop and flop are equal to the big blind, and equal to twice the big blind in the turn and river phases.

Blinds: At the beginning of the game each player starts with a $0 balance. We start with the agent as the big blind and the human player as the small blind. The blinds alternate every round.
Pre-flop: Both players are dealt their hole cards face down (in the game we only show the player his/her own cards). Unlike the normal version, the small blind can only fold or call. The big blind then decides to either check or bet.

Flop: The three flop cards are dealt on the table. Unlike regular heads-up Texas hold'em, where the big blind acts first, our version requires the small blind to act first. A player has three valid actions: fold, check, or bet.

Turn: The fourth card is dealt onto the table. This round is similar to the flop round: the small blind acts first by choosing one of the three actions (fold, check, or bet).

River: The fifth and last card is dealt onto the table. This round follows the same procedure as the turn round. If both players decide to play (a bet is called, or both players check), we move into the showdown phase, where each player's cards are evaluated for hand strength. The hand strengths are compared, and the player with the better hand wins the whole pot.

V. APPROACH

We model the problem as a Markov Decision Process (MDP) and use a reflexive learning algorithm to learn the opponent's betting behavior. To evaluate card strengths and ranks, we use an external poker evaluator, the Python library Deuces. This library allows us to efficiently compute a 5-card hand strength as an integer score for comparison. It also provides card rank and suit descriptions and a function to print cards in a visually appealing way for our interface.

A. MDP model

Our Markov Decision Process (MDP) model is broken into smaller MDP models, one per phase; each MDP explores only the states within two phases ahead. We have the pre-flop MDP, flop MDP, turn MDP, and river MDP, as shown in Figure 1. If we combined all of these MDPs, the state space would be too large to search efficiently: there are about 3.16 × 10^17 states in a heads-up limit Texas hold'em game, and in our preliminary implementation each round took about 10 minutes to run.
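As a reference for how the Deuces evaluator mentioned above is typically queried, here is a minimal sketch (our illustration, not the project's code; it assumes the deuces package is installed):

    from deuces import Card, Evaluator

    evaluator = Evaluator()
    board = [Card.new('Ah'), Card.new('Kd'), Card.new('Jc')]  # community cards
    hand = [Card.new('Qs'), Card.new('Th')]                   # agent's hole cards

    score = evaluator.evaluate(board, hand)        # integer rank of the best 5-card hand
    hand_class = evaluator.get_rank_class(score)   # e.g. straight, flush, pair
    print(score, evaluator.class_to_string(hand_class))

Note that Deuces assigns lower scores to stronger hands (1 is a royal flush, 7462 the worst high card), so any evaluation built on top of it must account for that orientation.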

Our pre-flop and flop MDPs are both depth-limited MDP models, so that they do not have to search the entire state space. We therefore use an evaluation function as the reward value for these MDPs:

Eval(s) = 1 / (current hand score) + w · φ(x)

where φ(x) is the input feature vector of our learning model and w is the weight vector; the learning model is described in more detail in the next section. The current hand score is the score of the cards in the agent's hand together with the cards currently dealt on the table.

Fig. 1. The depth-limited pre-flop, flop, turn, and river MDPs and their allowed actions.

For each of the MDP models, the parameters are described in Table I.

TABLE I. MDP model parameters (same structure for the pre-flop, flop, turn, and river MDPs).
State: (current hand cards, tuple of table cards, agent pot value, IsEnd)
Start state: (hand cards, table cards, current pot value, 0)
Actions: {fold, bet, check}
Rewards: -(pot value) if the agent folds; the evaluation function if the MDP is depth-limited; 0 if the MDP runs to the end; at IsEnd, +(pot value) if the hand is won and -(pot value) if it is lost
Transition probability: uniform over the possible dealt card combinations

B. Reflexive Learning Algorithm

To make our agent more robust to opponents' varying styles of play, we incorporate a learning algorithm that learns how the opponent bets and tries to predict the opponent's hand cards. We use a simple linear regression model that takes the following features:

- Table card rank: the ranks of the cards open on the table, represented as integers defined by the Deuces package. This feature varies in size depending on the phase of the game: it is absent during the pre-flop phase, contains 3 rank values in the flop phase, 4 in the turn phase, and 5 in the river phase.
- Table card suit: the suits of the cards open on the table, represented as integers according to the Deuces package. This feature varies in size in the same way as the table card rank feature.
- Opponent bet sequence: the bet values the opponent has made in past and present rounds. For example, if we are currently in the flop phase, this feature includes the opponent's bet values during the pre-flop and flop phases.

With these features, the model outputs an estimate of the score of the cards in the opponent's hand. One problem we encountered was that we have no data on an opponent beforehand with which to fit the regression weights. We cannot build a model in advance for every opponent the agent might face, and we would not have enough data if we only started building a model after meeting an opponent for the first time. To tackle this, we created two custom opponents for the agent to play against and collect data from beforehand: a conservative opponent that bets only if it has at least a pair, and an aggressive opponent that bets as long as it has at least a decent high card (it plays most of the time and usually bets instead of checking).
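To make our reading of Table I concrete, the sketch below restates the per-phase reward in code (the function and argument names are ours and purely illustrative, not the authors' implementation):

    def reward(action, pot, is_end, won, at_depth_limit, eval_value):
        """Reward of one per-phase MDP, as described in Table I (illustrative only)."""
        if action == 'fold':
            return -pot                  # folding forfeits the current pot
        if at_depth_limit:
            return eval_value            # Eval(s) stands in for deeper search
        if is_end:
            return pot if won else -pot  # showdown outcome at IsEnd
        return 0.0                       # intermediate, non-terminal states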

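The following sketch illustrates one way the feature vector φ(x) of Section V-B and a squared-loss SGD step could look in code, using the Deuces card helpers (again our illustration with hypothetical helper names, not the authors' implementation):

    import numpy as np
    from deuces import Card

    def phi(table_cards, opponent_bets):
        """Feature vector: table card ranks, table card suits, opponent bet sequence."""
        ranks = [Card.get_rank_int(c) for c in table_cards]
        suits = [Card.get_suit_int(c) for c in table_cards]
        return np.array(ranks + suits + list(opponent_bets), dtype=float)

    def predict(w, x):
        """Estimated score of the opponent's hidden hand."""
        return float(np.dot(w, x))

    def sgd_step(w, x, y, eta=0.001):
        """One stochastic gradient step on the squared loss (w . x - y)^2."""
        gradient = 2.0 * (predict(w, x) - y) * x
        return w - eta * gradient

Because the number of table cards grows from the flop to the river, the feature dimension changes by phase, so in practice a separate weight vector per phase (or zero-padding) would presumably be needed.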
With these two custom opponents, we played our agent against them and collected the required data. We then ran stochastic gradient descent (SGD) on a squared loss to create a weight vector for each opponent, using the update w ← w - η ∇w Loss(x, y, w), where w is the weight vector, x is the input feature vector, y is the opponent's hand score, and Loss(x, y, w) = (w · φ(x) - y)^2 is the squared loss. We used 20 iterations, ε = 1 × 10^-10, and η = 0.001.

In the game-play interface, we ask the user/opponent to state whether they are an aggressive or a conservative player, and the agent uses the weights learned from the corresponding custom opponent as its initial parameters. Then, using the data collected while playing the actual opponent, it updates this weight vector after every showdown. The update also handles the case where a player deliberately chooses the wrong type to exploit the agent, e.g. an aggressive player who tells the agent that he or she is conservative.

C. Agent Interface

We tested our agent against our scripted random and oracle opponents individually through a Python script. To let our agent play against any human opponent, and to test its strength against humans, we created an interface for the game. The interface is simple, as shown in Figure 2. The game is played in the Terminal and input is given through the keyboard. Only the table cards and the human player's hole cards are displayed; the agent's hole cards are shown only if the hand reaches showdown.

Fig. 2. Poker game interface.

VI. RESULTS & ANALYSIS

Our agent has three modes: naive, clever aggressive, and clever conservative. The naive agent uses only the depth-limited MDPs to compute its actions and does not perform opponent learning. The clever agents learn the opponent according to the opponent type that the user specifies at the beginning of the game, aggressive or conservative.

We tested our agent against three types of opponent: random, human, and oracle. The random opponent chooses its action in each phase uniformly at random. The human opponents are actual humans (amateur poker players) playing against the agent through our interface. The oracle opponent knows all hidden cards; it folds immediately at pre-flop if it knows it has the worse hand, and bets through to the end otherwise.

For each type of agent, we ran 100 games against the automated players (random and oracle) and 50 games against human players. We ran fewer games against humans because our agent can take up to 2 minutes to compute its move in each phase, which placed us under a time constraint. Table II shows the average winnings per game of our agent; both the agent and the opponent start with a $0 balance.

TABLE II. Average winnings per game (in dollars) of the agent versus different types of opponents.

Agent \ Opponent        Random    Human    Oracle
Naive                     9.35    -4.00     -7.25
Clever, aggressive       10.15     1.96    -22.45
Clever, conservative      7.95    -2.00    -16.20

Our agent performs better than the random player and is roughly comparable with human players. Since our big blind is $10, the agent beats the random player by approximately one big blind per game. As expected, it loses against the oracle player, but at worst by only about twice the big blind per game, which is quite satisfactory given the simplicity of the model. It is also worth noting that the naive agent performs better against the oracle than either of the clever agents. We hypothesize that because the oracle has a mixed aggressive-conservative behavior (excessively aggressive when it knows it is winning and excessively conservative otherwise), the naive agent is better off ignoring the opponent's playing style than leaning toward either the aggressive or the conservative side.

We also gathered some interesting insights from the human players' experiences. The clever aggressive agent folds less often, even when the supposedly aggressive player bets (i.e. bluffs). The clever conservative agent often bets to scare off the supposedly conservative player, and is more likely to fold if the opponent bets.

VII. CONCLUSION & FUTURE WORKS

Our model shows that depth-limited Markov Decision Processes with reflexive opponent learning can be used to model a heads-up limit Texas hold'em poker game. Our agent wins about one big blind per game against a random player, plays roughly on par with human players, and in the worst case loses about two big blinds per game to an oracle player. However, the agent currently takes a long time to act: its maximum thinking time is 2 minutes (on a 2 GHz Intel Core i7), whereas Cepheus (the agent that weakly solved the game using CFR+, developed by the Computer Poker Research Group at the University of Alberta) takes less than one minute. In future work, we plan to apply other algorithms, such as neural networks, to model the game and reduce the runtime, and to take into account more complex features of the game such as raising and bluffing. Neural networks could also improve our evaluation function, and better evaluation functions would allow more state pruning in our MDP models.

REFERENCES

[1] H. Abadi and R. Takapoui, "Automated Heads-up Poker Player," Dec. 12, 2014. Available: https://web.stanford.edu/class/cs221/restricted/projects/takapoui/final.pdf
[2] T. Berro, J. Benjamin, and C. Zanoci, "A Poker AI Agent." Available: http://web.stanford.edu/class/cs221/restricted/projects/bgalliga/final.pdf
[3] M. Bowling, N. Burch, M. Johanson, and O. Tammelin, "Heads-up limit hold'em poker is solved," Science 347(6218), pp. 145-149, Jan. 2015.
[4] P. McCurley, "An Artificial Intelligence Agent for Texas Hold'em Poker." Available: http://poker-ai.org/archive/pokerai.org/public/aith.pdf
[5] N. Yakovenko, L. Cao, C. Raffel, and J. Fan, "Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games Using Convolutional Networks," arXiv:1509.06731, Sep. 22, 2015.