Poker Opponent Modeling


Michel Salim and Paul Rohwer
Computer Science Department
Indiana University

Abstract

Utilizing resources and research from the University of Alberta Poker Research Group, we are investigating opponent modeling improvements. Currently, our simple poker bot plays online against instantiations of PokiBots, the poker machine created by the University of Alberta research group. After some decision-rule building, our poker bot is competitive. Our next step is to build upon this research and investigate opponent modeling by experimenting with frequency distributions and multi-agent case-based reasoning. We feel this can provide both a good long-term and a good short-term measure of opponent play. The multi-agent case base will track the current temperature of a player: whether an opponent plays wildly after a demoralizing loss, or too confidently after a thrilling victory. This knowledge will be used to either contract or expand the hand strength threshold used by our poker bot.

Opponent Modeling Motivation

A century ago, Mark Twain bemoaned the decline in the art of lying (Twain 1882). To lie, he felt, is a great virtue and a lubricant to social discourse, and he objected to the sad state into which the eternal art had fallen. Today, artificial intelligence and computer science neglect this great virtue. There are few who pursue both the art of lying and computer science. The domain is difficult, since computers are most often engineered to help people, not to lie to them, though people in general pay good money for others to lie for them: lawyers and politicians, for example, enjoy healthy reputations and good benefits from practicing the art of lying. Artificial intelligence has yet to explore and challenge these expert human liars. This paper explores machine learning through opponent modeling in a game where lying is critical to success. In poker, to bluff is to lie about one's hand strength.
Given a weak hand, bluffing signals the opposite to our opponents. The condition for success, the critical determinant for deciding to bluff, is correctly understanding the opponents: the expectation is that an opponent will fold after one bluffs. For a poker player, a bluff decision therefore requires a correct estimation of opponent fold probability, the likelihood an opponent will fold when faced with our challenge, which in poker is either a bet or a raise.

(With much appreciation for David Leake, Steven Bogaerts, and the University of Alberta Poker Research Group. Copyright 2005, Michel Salim and Paul Rohwer. All rights reserved.)

Opponent modeling targets accurate predictions of future opponent actions. For poker, opponent modeling is difficult: it is a game of imperfect information, chance, and incomplete knowledge. Contrast this with other games targeted by machine learning research. Chess has a game state known to each player; there is no risk or chance, since playing the best move is always the best action. Other games where chance is present, like backgammon, still retain perfect information. And games that retain both chance and imperfect information typically include just one opponent; such games, like degenerate one-against-one poker games and RoShamBo (the kids' game of rock, paper, scissors), do not have the additional complexity of play against multiple opponents. These difficulties have led researchers to conclude that opponent modeling in poker appears to have many of the characteristics of the most difficult problems in machine learning: noise, uncertainty, an unbounded number of dimensions to explore, and a need to quickly learn and generalize from a relatively small number of heterogeneous training examples. "Heterogeneous" applies because when a player folds, quitting the hand early, their cards, the missing link of poker's imperfect nature, are not revealed to other players. What is the gain from studying opponent modeling?
Human poker players are good at understanding their opponents. The best human players are frequently able to form an accurate model from single data points. And while the best poker programs have successfully improved with opponent modeling, their developers conclude that there are numerous opportunities for improvement and that, for a poker program to defeat the best human players, opponent modeling is critical. In computer poker game-playing research, the University of Alberta Poker Research Group leads a sparse field of researchers. They have developed an excellent poker-playing machine called Poki, building on their prior work developing a world champion checkers program, Chinook. Poki and the University of Alberta research group are focused on adaptive artificial intelligence, and key to Poki's success thus far is adjustment to new information. Yet the deluge of information causes Poki to adjust to opponents more slowly. For example, in heads-up trials with

an online poker legend, their PokiBot successfully outplayed the human over 3500 hands. But then the human changed course, refocusing after modeling the PokiBot. He changed his playing mode from overly aggressive to cagey passiveness, outplaying PokiBot over the next 3500 hands. In the group's paper, The Challenge of Poker, they conclude that the build-up of inertia after thousands of observed data points can be detrimental if the player changes mood. Past success may have been due to the static, fixed playing styles of opponents. They also conclude that it is difficult to track good players who consistently alter playing style over relatively brief periods. This adjustment inertia helps explain why the human expert proved superior to Poki. Our prescription for Poki's adaptation inertia is to vary playing style, pursuing the emotion of the table by tracking the ebb and flow of the game. We track opponents with both a long-term and a short-term picture: keeping long-term measures of frequency and building short-term models for adaptation. For us, a black-box neural net does not provide a simple enough understanding; by using a case base and a case-based reasoning framework, we will understand the influences on our poker bot. The combination of long-term opponent characteristics and short-term opponent changes will target a table temperature. Is a player who is usually the picture of passivity, folding early and often, suddenly playing like a maniac and betting aggressively over the short term? If we recognize this historical difference and alter our expectations quickly, then our poker bot should loosen its hand strength threshold and let the maniacal opponent lose more!

Texas Hold em Poker Background

Our opponent modeling experiments use a popular form of poker called Texas Hold em. This is a multi-person game that shares a community of cards.
Each player combines those community cards with two cards dealt face down at the beginning, attempting to develop the best hand as play proceeds through four stages of betting, bluffing, deceit and folding. To bet is to put money into the pot, which is given to the winner: the player who last remains or who has the best hand, the best combination of community cards and two private cards. If more than one player remains after the last stage, then all cards are revealed. Suddenly, the game exhibits perfect information and each player discovers the two private cards, termed hole cards. This discovery stage is the crucial point for machines to learn about opponents. Until the last stage, commonly termed the showdown, players attempt to infer their opponents' hand strength by forming an accurate opponent model and applying it to the current game context. The game context changes as community cards are revealed at each stage. A general Texas Hold em sequence uses four stages, termed: pre-flop, flop, turn and river. After hole cards, the two private cards, are dealt to each player, the pre-flop commences with a round of betting. Each player chooses from a set of actions: (1) fold, quitting the hand and the game; (2) check, remaining but declining the opportunity to bet; (3) call, putting enough money into the pot to remain in the game (this occurs in the pre-flop and after an earlier player bets); (4) bet, putting an amount into the pot that all other players must match or raise; and (5) raise, increasing the amount others must pay to remain after another player bets. The set of actions can be simplified into an action triple: fold, check/call, and bet/raise. Poki uses this simplification to calculate a probability action triple, (p(fold), p(check/call), p(bet/raise)), choosing from it an action to take. After the pre-flop stage, three community cards are revealed. This is called the flop, and another round of betting ensues.
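The probability-triple selection just described can be sketched as follows. This is a minimal illustration in Python; the function name and the sample triple are ours, not code from the Alberta group:

```python
import random

def choose_action(triple, rng=random.random):
    """Pick an action from a (p_fold, p_call, p_raise) probability triple.

    The three probabilities must sum to 1; a uniform random number is
    used to index into the cumulative distribution, so the bot mixes
    its actions rather than playing deterministically.
    """
    p_fold, p_call, _ = triple
    r = rng()
    if r < p_fold:
        return "fold"
    if r < p_fold + p_call:
        return "check/call"
    return "bet/raise"

# Example: a triple that mostly calls, sometimes raises, rarely folds.
action = choose_action((0.1, 0.6, 0.3))
```

Passing the random source as a parameter makes the mixed strategy easy to test and reseed.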
At the next two stages, the turn and the river, one community card each is revealed, bringing the total number of community cards to five. If players remain after the betting round at the river, then a showdown occurs: hole cards are revealed and the best hand wins the pot. In the event of a tie, the pot is split evenly amongst the winners.

Dimensions of expert poker play

The University of Alberta Poker Research Group has delineated the dimensions of expert play, the factors that must be utilized to defeat expert players. First is an assessment of hand quality. This is a combination of game context: the hole cards, the community cards, the number of active players, the potential for hand improvement, and the betting position of the player. The potential for hand improvement is a hand strength factor, but more importantly provides a calculation of how the hand may improve. Bluffing is the ability to win given a weak hand: a hand that would lose in a perfect information world can still win if all other players fold. This means bluffing depends on the probability an opponent will fold, quitting the game, if challenged with a bet or a raise. Unpredictability is a measure of play variance, but more importantly is a measure of how well a player can be modeled. And finally, opponent modeling determines an accurate prediction of opponent actions. Opponent modeling is crucial for bluff and fold decisions.

Previous approaches

The University of Alberta Poker Research Group has explored four methods of prediction: expert systems, statistics, neural nets, and decision trees (Davidson et al. 2000), (Billings et al. 2002).

Expert Systems

The simplest way to predict an opponent action is to use a set of rules to make a rational choice for opponents. This unadaptive method is called generic opponent modeling; the research group found the expert system method not to be effective over the long term, but a useful starting point given no information about an opponent.
Statistics

Using past actions to predict future actions is the statistical approach to opponent modeling. A collection of simple statistics, a confusion matrix, and a history table of conditional action probabilities have all been explored by the University of Alberta Poker Research Group. The difficulty is choosing, from the wealth of context information, that which

should be tracked and utilized. If too much information is kept, then the time constraints of opponent modeling become too onerous; if too little information is tracked, then important learning is thrown away.

Neural Networks

Poki uses neural networks as a general system for opponent modeling. Neural networks have been successfully applied to backgammon programs, so the research group chose them because they typically provide reasonable accuracy in noisy domains. The Poki neural net is a multi-layer perceptron trained on contextual data collected from online games against human opponents. There are eighteen inputs from the game context, a hidden layer of four neurons, and an output layer of three nodes corresponding to fold, call, and raise probabilities. To understand the neural network output, the researchers observe graphical displays of the relative connection strengths to determine which input parameters have the strongest effect on the output. Their observations conclude that relatively few contextual inputs are important. After recognizing the important factors, the researchers used the information to define a relatively small set of context equivalence classes that significantly improved their previous statistical modeling attempts. The liabilities of neural networks for the research group were their black-box nature, and the fact that the network's output represented only the most likely action from the probability triple, not the probability triple itself.

Decision Trees

The research group also used Paul Utgoff's Incremental Decision Tree Induction software to test decision trees on action prediction, using tree pruning to decrease noise and over-fitting. The research concluded that decision trees with pruning provided more accurate predictions of opponent actions. Decision trees were able to output the probability action triple used by Poki, with the added benefit of better human understanding given the tree representation.
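The topology described, eighteen inputs, four hidden neurons, three outputs, can be sketched as a forward pass. The random weights below are placeholders rather than Poki's trained parameters, and the tanh activation and softmax normalization are our assumptions for a runnable illustration:

```python
import math
import random

def mlp_predict(inputs, w_hidden, w_out):
    """Forward pass of a small multi-layer perceptron: 18 game-context
    inputs, a hidden layer of 4 tanh neurons, and 3 outputs normalized
    by softmax into fold/call/raise probabilities."""
    hidden = [math.tanh(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    raw = [sum(w * h for w, h in zip(ws, hidden)) for ws in w_out]
    exps = [math.exp(r) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]  # p(fold), p(call), p(raise)

# Placeholder random weights standing in for the trained network.
random.seed(0)
w_hidden = [[random.uniform(-1, 1) for _ in range(18)] for _ in range(4)]
w_out = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
probs = mlp_predict([0.5] * 18, w_hidden, w_out)
```

Note that the softmax here yields the full probability triple; as the text observes, Poki's network instead reported only the most likely action.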
The research group notes that decision trees achieved accuracies slightly below the neural networks.

Resources used

The University of Alberta Poker Research Group publishes protocols and source code for quickly building poker machines that interface with online poker games. The source code allows developers to quickly develop bots that play online through the research group's game servers. The source code is provided essentially as is, with no opponent modeling included and little documentation.

IUCBRF

The second resource used is Indiana University's Case-Based Reasoning Framework. IUCBRF (Bogaerts & Leake 2005) is a framework for rapid and modular case-based reasoning system development. We use the framework to store and retrieve short-term cases. Case-based reasoning utilizes stored cases to remember past solutions to previous problems. It is based on two tenets. First, CBR assumes a regular world, such that similar problems have similar solutions. For our poker domain, we assume that an opponent will behave as in the past; this is the basis for tracking context statistics. Second, CBR assumes that types of problems tend to recur. For poker, this is true: the same game is played many times and similar problems do recur.

Poker Domain Knowledge

The authors are poker novices. To gain sufficient poker background, we turned to books by David Sklansky: The Theory of Poker (1989) and Hold em for Advanced Players (1999). Sklansky explains basic poker principles like the importance of betting position and pot odds, while the advanced book defines heuristic strategies for bluffing, raising and deducing the table temperature. We also began regular play online, experimenting with bluffing strategies and gaining sufficient domain knowledge to add functionality to our poker bot. As an early development process, we used a series of decision rules, calibrating thresholds until we no longer lost regularly.
The Frequency Distribution

We created a serializable object bound to each username seen on the PokiPoker game server. This object holds long-term measurements of the game context, updated regularly as the game is played. After observing 100 hands, a short-term case is formed from the aggregation of those one hundred observation frequencies.

Making a move

Probability Triple

Given the uncertainties in a game of poker - our opponents' hands are unknown, the board cards to come are unknown - it stands to reason that we do not want to put ourselves at a disadvantage by moving deterministically in a given situation. We thus utilize what the University of Alberta termed probability triples: a tuple of three doubles that sum to one, standing for the probabilities of folding, checking/calling and betting/raising respectively. Given a tuple, a random number between 0 and 1 is generated and used to index into the tuple to make a decision.

Fold percentage calculation

The probability of folding should be at its maximum when hand strength is low, and should decrease to a minimum as hand strength increases. There is probably a tipping point at which a human player judges his hand to be playable, and we capture this by using a tanh function to calculate the probability:

    foldRatio = tanh((center - EHS) * scaleFactorX) / scaleFactorY + deltaY

The y scale factor and delta together compress and shift the tanh curve to fit within 0 and 1, and the x scale factor

determines how sharp the drop in fold percentage is at the tipping point, center, which we take to be the middle point between the calling threshold and the pot odds for raising. The center in the formula above is thus the average of the minimum strength at which we allow ourselves to play and the pot odds for raising. When our estimated strength is greater than the pot odds for raising, we should always play - in fact, we should contemplate raising - since the expected return over the long run is positive; thus the pot odds for raising is a good upper limit of hand strength, at or above which we should never fold our hand. The lower limit at which we should always fold is trickier to calculate. We currently hardcode a minimum number to make sure we don't play very weak hands, though this number is more a guesstimate than anything else. Provided other players have moved before us, though, we can use their predicted winning percentage given their last action (e.g. betting at the turn) to serve as the lower limit for our folding decision function. We discount this percentage by that opponent's observed variance in hand strength, to reflect our uncertainty over the opponent's trustworthiness, and use the maximum of this number among all our active opponents to determine whether we should fold or not.

Raise calculation

A raise ratio is calculated to determine the proportion of the hands we play that should be a bet/raise instead of a check/call. We currently scale the raise ratio linearly, starting to raise when our estimated strength is better than the pot odds for raising and peaking at the perfect estimated hand strength of 1.0, at which point we always raise and never call. It is more than likely that a formula similar to the one used for calculating folding probability would improve performance here - in fact, such a formula has been written but not yet calibrated for use.
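The fold and raise calculations above can be sketched as follows. The scale and shift constants here are illustrative choices that map tanh's -1..1 range into 0..1, not our calibrated values:

```python
import math

def fold_ratio(ehs, center, scale_x=8.0, scale_y=2.0, delta_y=0.5):
    """Fold probability from estimated hand strength (EHS).

    tanh((center - EHS) * scale_x) ranges over roughly -1..1; dividing
    by scale_y = 2 and shifting by delta_y = 0.5 compresses the curve
    into 0..1.  scale_x sets how sharp the drop is at the tipping
    point, center.
    """
    return math.tanh((center - ehs) * scale_x) / scale_y + delta_y

def raise_ratio(ehs, raise_pot_odds):
    """Linear raise ratio: zero until estimated strength beats the pot
    odds for raising, rising to 1.0 at perfect hand strength, where we
    always raise and never call."""
    if ehs <= raise_pot_odds:
        return 0.0
    return (ehs - raise_pot_odds) / (1.0 - raise_pot_odds)
```

At ehs == center the fold ratio is exactly 0.5; a hand well above center folds almost never, and one well below folds almost always.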
Given that the formula's output is determined by adjustable x and y limits, and the slope is also adjustable, we could potentially vary our style of play automatically depending on how good our winning probability for raising is.

Bluffing

As Sklansky theorized, bluffing - leading opponents to think one has a better hand in the hope of making them fold - should be attempted whenever the pot odds one gets by raising are favorable in relation to the probability that all other opponents fold. Thus we bluff when

    raisePotOdds < product over 0 <= i < numPlayers of bluffChance_i

where

    raisePotOdds = raiseCost / (potSize + raiseCost)

and

    bluffChance_i = 1 - showdownActions_i[lastAction_i] / allActions_i[lastAction_i]

The probability that an opponent might fold, given his/her last action, is the complement of the probability that he/she might reach showdown. Sklansky recommends bluffing as often as possible, but in our experience, even bluffing half as often as our formula recommends results in opponents calling our bluffs. The formula above predicts how often an opponent would fold in general; unfortunately, against opponents that also do their own opponent modeling, it simply results in a loss of credibility: the probability that a bet or raise by us leads to a win drops precipitously. Conversely, when we do not bluff, we have a good winning percentage for our bets and raises, but our earning rate is lower than the opponent bots'. We are being too predictable in this case: while we win often, we win small amounts, not enough to cover losses from the showdowns that we lose. Thus, bluffing should be done judiciously. We should only consider it if we have a good reputation for winning at that round; i.e., above a certain threshold of winning percentage we should bluff more, and below it we should bluff less. This should correct for the opponents' perception of us.
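The bluff condition can be coded directly from the formulas above; the data layout for opponents' action counts is our assumption:

```python
def raise_pot_odds(raise_cost, pot_size):
    """Cost of a bluff raise relative to what it stands to win."""
    return raise_cost / (pot_size + raise_cost)

def bluff_chance(showdown_actions, all_actions, last_action):
    """P(opponent folds | last action): the complement of how often this
    opponent has reached showdown after taking that action."""
    return 1.0 - showdown_actions[last_action] / all_actions[last_action]

def bluff_is_favorable(raise_cost, pot_size, opponents):
    """Bluff when the raising pot odds fall below the probability that
    every active opponent folds (the product over all opponents).

    `opponents` is a list of (showdown_actions, all_actions, last_action)
    tuples, where the two dicts count actions seen at showdown and
    overall.
    """
    p_all_fold = 1.0
    for showdown, seen, last in opponents:
        p_all_fold *= bluff_chance(showdown, seen, last)
    return raise_pot_odds(raise_cost, pot_size) < p_all_fold

# An opponent who bet 10 times but reached showdown only once folds
# an estimated 90% of the time when challenged.
timid = ({"bet": 1}, {"bet": 10}, "bet")
```

With a pot of 10 and a raise cost of 2, the pot odds are 2/12, well below 0.9, so the bluff is judged favorable against this opponent.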
We take our winning percentage to be a weighted average of the winning percentage in the past 100 hands and the winning percentage in the current (fewer than 100) cluster of hands, weighted as follows:

    winPct = (oldPct * (clusterSize - handsPlayed) + curPct * handsPlayed) / clusterSize

More weight is thus given to our most recent actions as the number of hands in the current cluster approaches the maximum; conversely, when that number is low, more weight is given to the previous cluster of hands. Thus if we are bluffing more than we should, this will be reflected in a lowered winPct score and will restrain our bluffing. The aggregate winning percentage is fed into a tanh-based function, which is currently calibrated to return zero for winning percentages below 0.4, and to output a maximum of 0.8 at winning percentages of 0.9 and above. A decision whether to bluff, provided the cost of raising is judged advantageous relative to the probability estimate that all our other opponents will fold, is made by generating a random number and comparing it to the tanh function's output; if the random number is lower, we are allowed to bluff. Combined with an automatic adjustment method for our raising probability to keep our performance high, we estimate that bluffing - and a related problem, adjusting for opponents' bluffs, which we currently do only rudimentarily by tracking their variance at showdown - should improve our bot's strength substantially.

Overview

Case-base utilization

Since poker is a game of imperfect information, we only have complete knowledge of an opponent's hand when that
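The weighting scheme and the bluff gate above can be sketched as follows. The tanh centering and slope constants are our illustrative calibration; the text specifies only the 0.4 floor and 0.8 cap:

```python
import math

def win_pct(old_pct, cur_pct, hands_played, cluster_size=100):
    """Weighted average of the previous cluster's winning percentage and
    the current partial cluster's: the current cluster gains weight as
    hands_played approaches cluster_size."""
    return (old_pct * (cluster_size - hands_played)
            + cur_pct * hands_played) / cluster_size

def bluff_allowance(pct):
    """tanh-shaped gate on bluffing: zero below a 0.4 winning percentage,
    rising toward a 0.8 cap near 0.9 and above.  A bluff is permitted
    when a uniform random draw falls below this value."""
    if pct <= 0.4:
        return 0.0
    t = math.tanh((pct - 0.65) * 6.0)  # centered between 0.4 and 0.9
    return min(0.8, 0.4 * (t + 1.0))
```

Over-bluffing lowers the recent winning percentage, which shrinks the allowance and so throttles further bluffs, the self-correcting behavior described above.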

Figure 1: Prediction errors for IssPoker's strength
Figure 2: Prediction errors for Pokibrat-T1's strength

opponent reaches showdown. From our observations, this happens roughly 10% of the time. A case in our repository is thus taken to be a given player's statistics for 200 hands, with the first 100 hands being used to match a query and the last 100 hands being the predicted play. Every 100 hands, we thus query the CBR for a prediction of the opponent's future play; during the adaptation phase, the difference between the retrieved problem and its solution is mapped on top of the opponent's last observed 100 hands, and this forms our prediction for the next 100.

Indexing

The most important statistics we are trying to predict are an opponent's average hand strength and the variance of that hand strength. It thus follows that the difference between any two cases is simply the sum of the hand strength and variance differences. For each feature, we currently take the average of the relative errors of the first problem relative to the second and of the second problem relative to the first.

Figure 3: Prediction errors for IssPoker's variance
Figure 4: Prediction errors for Pokibrat-T1's variance

Adaptation

We do not currently predict all features: apart from average strength, variance, earnings and showdown hands, the statistics we return as a solution from the CBR are simply those passed in the query, i.e. we assume that those numbers have not changed. Based on what we observe, this might not be a good idea, as is elaborated below.

Performance

As can be seen from the figures, our indexing and adaptation currently leave much to be desired: their predictive performance is actually worse than predictions made from the long-term averages. Some opponent players are more
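The indexing distance just described, the sum over the two features of the symmetric relative error, can be sketched as follows (the case layout and names are ours):

```python
def feature_distance(a, b):
    """Symmetric relative error: the average of |a - b| measured relative
    to the first value and relative to the second.  Assumes nonzero
    feature values."""
    diff = abs(a - b)
    if diff == 0.0:
        return 0.0
    return (diff / abs(a) + diff / abs(b)) / 2.0

def case_distance(case_a, case_b):
    """Distance between two 100-hand cases: the sum of the relative-error
    distances for mean hand strength and hand-strength variance."""
    return (feature_distance(case_a["strength"], case_b["strength"])
            + feature_distance(case_a["variance"], case_b["variance"]))
```

The nearest stored case under this distance supplies the problem-solution pair used in the adaptation step.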

amenable to our CBR predictions than others, but the figures included are typical.

Conclusions

The surprising observation we made is that a player's recent history does not necessarily serve as the best predictor of future playing style. Several factors could partially account for this: our cluster size might not be fine-grained enough (a player may change playing style more often than once every 100 hands), or conversely the cluster size may be too fine-grained, so that variations in the cards dealt are not cancelled out within 100 hands. The former is quite unlikely: the fact that long-term statistics serve as a good predictor indicates that, if anything, a player's playing style does not deviate much from long-term averages, and tends to return to them. Making the cluster size larger, on the other hand, would at best make our CBR predictions an approximation of the long-term statistics, which is not ideal - if the statistics work, why bother with a case base? It is more likely that we need to observe the way players behave under certain circumstances - after a long winning or losing streak, after unexpected losses in showdowns, after their bluffs have been called, etc. This would seem to suggest maintaining one global case base for all players combined, with each case representing a certain style of playing, instead of the current multiple case bases of individual histories.

Future improvements

There are many things that could be improved in the current implementation of our bot. As discussed earlier, automatically adjusting our play to keep our winning percentages (and thus our credibility, so we can bluff other players) high is certainly very desirable. Better opponent modeling could include several items from previous research, such as the confusion matrix, to estimate a given player's hand. We are of the opinion that the confusion matrix approach could be much improved by taking into account the cards we have observed the player holding at showdown - a preference bias by that player for certain types of cards would be very important knowledge to have. In general, our aim is to model an expert human player's decision-making process, which is so far an unsolved problem, even in perfect information games such as chess. With that in mind, we should take heart from our current situation - the chess-playing computers of the 70s certainly were weaker relative to their human competition than we are right now! Given our aim of providing an understandable model, it is likely that our future modeling of computer opponents will continue to preclude the use of neural networks. On the other hand, the process of varying our own style of play would most likely benefit from such a treatment, given that we have statistical measures of our playing strength and perfect information about our own hands; but the relationship between each different playing decision and the final outcome - overall playing strength and earning rate - is unclear.

References

Billings, D.; Davidson, A.; Schaeffer, J.; and Szafron, D. 2002. The challenge of poker. Artificial Intelligence Journal 134(1-2).
Bogaerts, S., and Leake, D. 2005. IUCBRF: A framework for rapid and modular case-based reasoning system development. Technical Report 608, Indiana University.
Davidson, A.; Billings, D.; Schaeffer, J.; and Szafron, D. 2000. Improved opponent modeling in poker. Proceedings of the 2000 International Conference on Artificial Intelligence.
Sklansky, D., and Malmuth, M. 1999. Hold em Poker for Advanced Players. Two Plus Two Publishing.
Sklansky, D. 1989. The Theory of Poker. Two Plus Two Publishing, fourth edition.
Twain, M. 1882. On the decay of the art of lying.


More information

CS221 Final Project Report Learn to Play Texas hold em

CS221 Final Project Report Learn to Play Texas hold em CS221 Final Project Report Learn to Play Texas hold em Yixin Tang(yixint), Ruoyu Wang(rwang28), Chang Yue(changyue) 1 Introduction Texas hold em, one of the most popular poker games in casinos, is a variation

More information

Adversarial Search Aka Games

Adversarial Search Aka Games Adversarial Search Aka Games Chapter 5 Some material adopted from notes by Charles R. Dyer, U of Wisconsin-Madison Overview Game playing State of the art and resources Framework Game trees Minimax Alpha-beta

More information

POKER AGENTS LD Miller & Adam Eck April 14 & 19, 2011

POKER AGENTS LD Miller & Adam Eck April 14 & 19, 2011 POKER AGENTS LD Miller & Adam Eck April 14 & 19, 2011 Motivation Classic environment properties of MAS Stochastic behavior (agents and environment) Incomplete information Uncertainty Application Examples

More information

Fictitious Play applied on a simplified poker game

Fictitious Play applied on a simplified poker game Fictitious Play applied on a simplified poker game Ioannis Papadopoulos June 26, 2015 Abstract This paper investigates the application of fictitious play on a simplified 2-player poker game with the goal

More information

From: AAAI-99 Proceedings. Copyright 1999, AAAI (www.aaai.org). All rights reserved. Using Probabilistic Knowledge and Simulation to Play Poker

From: AAAI-99 Proceedings. Copyright 1999, AAAI (www.aaai.org). All rights reserved. Using Probabilistic Knowledge and Simulation to Play Poker From: AAAI-99 Proceedings. Copyright 1999, AAAI (www.aaai.org). All rights reserved. Using Probabilistic Knowledge and Simulation to Play Poker Darse Billings, Lourdes Peña, Jonathan Schaeffer, Duane Szafron

More information

Reinforcement Learning Applied to a Game of Deceit

Reinforcement Learning Applied to a Game of Deceit Reinforcement Learning Applied to a Game of Deceit Theory and Reinforcement Learning Hana Lee leehana@stanford.edu December 15, 2017 Figure 1: Skull and flower tiles from the game of Skull. 1 Introduction

More information

Opponent Modeling in Poker

Opponent Modeling in Poker Opponent Modeling in Poker Darse Billings, Denis Papp, Jonathan Schaeffer, Duane Szafron Department of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2H1 {darse, dpapp, jonathan,

More information

Texas hold em Poker AI implementation:

Texas hold em Poker AI implementation: Texas hold em Poker AI implementation: Ander Guerrero Digipen Institute of technology Europe-Bilbao Virgen del Puerto 34, Edificio A 48508 Zierbena, Bizkaia ander.guerrero@digipen.edu This article describes

More information

Expectation and Thin Value in No-limit Hold em: Profit comes with Variance by Brian Space, Ph.D

Expectation and Thin Value in No-limit Hold em: Profit comes with Variance by Brian Space, Ph.D Expectation and Thin Value in No-limit Hold em: Profit comes with Variance by Brian Space, Ph.D People get confused in a number of ways about betting thinly for value in NLHE cash games. It is simplest

More information

ATHABASCA UNIVERSITY CAN TEST DRIVEN DEVELOPMENT IMPROVE POKER ROBOT PERFORMANCE? EDWARD SAN PEDRO. An essay submitted in partial fulfillment

ATHABASCA UNIVERSITY CAN TEST DRIVEN DEVELOPMENT IMPROVE POKER ROBOT PERFORMANCE? EDWARD SAN PEDRO. An essay submitted in partial fulfillment ATHABASCA UNIVERSITY CAN TEST DRIVEN DEVELOPMENT IMPROVE POKER ROBOT PERFORMANCE? BY EDWARD SAN PEDRO An essay submitted in partial fulfillment Of the requirements for the degree of MASTER OF SCIENCE in

More information

Chapter 6. Doing the Maths. Premises and Assumptions

Chapter 6. Doing the Maths. Premises and Assumptions Chapter 6 Doing the Maths Premises and Assumptions In my experience maths is a subject that invokes strong passions in people. A great many people love maths and find it intriguing and a great many people

More information

BLUFF WITH AI. CS297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University. In Partial Fulfillment

BLUFF WITH AI. CS297 Report. Presented to. Dr. Chris Pollett. Department of Computer Science. San Jose State University. In Partial Fulfillment BLUFF WITH AI CS297 Report Presented to Dr. Chris Pollett Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements for the Class CS 297 By Tina Philip May 2017

More information

Simple Poker Game Design, Simulation, and Probability

Simple Poker Game Design, Simulation, and Probability Simple Poker Game Design, Simulation, and Probability Nanxiang Wang Foothill High School Pleasanton, CA 94588 nanxiang.wang309@gmail.com Mason Chen Stanford Online High School Stanford, CA, 94301, USA

More information

What now? What earth-shattering truth are you about to utter? Sophocles

What now? What earth-shattering truth are you about to utter? Sophocles Chapter 4 Game Sessions What now? What earth-shattering truth are you about to utter? Sophocles Here are complete hand histories and commentary from three heads-up matches and a couple of six-handed sessions.

More information

Outline. Game Playing. Game Problems. Game Problems. Types of games Playing a perfect game. Playing an imperfect game

Outline. Game Playing. Game Problems. Game Problems. Types of games Playing a perfect game. Playing an imperfect game Outline Game Playing ECE457 Applied Artificial Intelligence Fall 2007 Lecture #5 Types of games Playing a perfect game Minimax search Alpha-beta pruning Playing an imperfect game Real-time Imperfect information

More information

Texas Hold em Poker Basic Rules & Strategy

Texas Hold em Poker Basic Rules & Strategy Texas Hold em Poker Basic Rules & Strategy www.queensix.com.au Introduction No previous poker experience or knowledge is necessary to attend and enjoy a QueenSix poker event. However, if you are new to

More information

4. Games and search. Lecture Artificial Intelligence (4ov / 8op)

4. Games and search. Lecture Artificial Intelligence (4ov / 8op) 4. Games and search 4.1 Search problems State space search find a (shortest) path from the initial state to the goal state. Constraint satisfaction find a value assignment to a set of variables so that

More information

COMP219: COMP219: Artificial Intelligence Artificial Intelligence Dr. Annabel Latham Lecture 12: Game Playing Overview Games and Search

COMP219: COMP219: Artificial Intelligence Artificial Intelligence Dr. Annabel Latham Lecture 12: Game Playing Overview Games and Search COMP19: Artificial Intelligence COMP19: Artificial Intelligence Dr. Annabel Latham Room.05 Ashton Building Department of Computer Science University of Liverpool Lecture 1: Game Playing 1 Overview Last

More information

Learning to Play Strong Poker

Learning to Play Strong Poker Learning to Play Strong Poker Jonathan Schaeffer, Darse Billings, Lourdes Peña, Duane Szafron Department of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2H1 {jonathan, darse, pena,

More information

Strategy Evaluation in Extensive Games with Importance Sampling

Strategy Evaluation in Extensive Games with Importance Sampling Michael Bowling BOWLING@CS.UALBERTA.CA Michael Johanson JOHANSON@CS.UALBERTA.CA Neil Burch BURCH@CS.UALBERTA.CA Duane Szafron DUANE@CS.UALBERTA.CA Department of Computing Science, University of Alberta,

More information

The first topic I would like to explore is probabilistic reasoning with Bayesian

The first topic I would like to explore is probabilistic reasoning with Bayesian Michael Terry 16.412J/6.834J 2/16/05 Problem Set 1 A. Topics of Fascination The first topic I would like to explore is probabilistic reasoning with Bayesian nets. I see that reasoning under situations

More information

BetaPoker: Reinforcement Learning for Heads-Up Limit Poker Albert Tung, Eric Xu, and Jeffrey Zhang

BetaPoker: Reinforcement Learning for Heads-Up Limit Poker Albert Tung, Eric Xu, and Jeffrey Zhang Introduction BetaPoker: Reinforcement Learning for Heads-Up Limit Poker Albert Tung, Eric Xu, and Jeffrey Zhang Texas Hold em Poker is considered the most popular variation of poker that is played widely

More information

COMP219: Artificial Intelligence. Lecture 13: Game Playing

COMP219: Artificial Intelligence. Lecture 13: Game Playing CMP219: Artificial Intelligence Lecture 13: Game Playing 1 verview Last time Search with partial/no observations Belief states Incremental belief state search Determinism vs non-determinism Today We will

More information

Virtual Global Search: Application to 9x9 Go

Virtual Global Search: Application to 9x9 Go Virtual Global Search: Application to 9x9 Go Tristan Cazenave LIASD Dept. Informatique Université Paris 8, 93526, Saint-Denis, France cazenave@ai.univ-paris8.fr Abstract. Monte-Carlo simulations can be

More information

Game Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game?

Game Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game? CSC384: Introduction to Artificial Intelligence Generalizing Search Problem Game Tree Search Chapter 5.1, 5.2, 5.3, 5.6 cover some of the material we cover here. Section 5.6 has an interesting overview

More information

Perfect Bayesian Equilibrium

Perfect Bayesian Equilibrium Perfect Bayesian Equilibrium When players move sequentially and have private information, some of the Bayesian Nash equilibria may involve strategies that are not sequentially rational. The problem is

More information

Game Mechanics Minesweeper is a game in which the player must correctly deduce the positions of

Game Mechanics Minesweeper is a game in which the player must correctly deduce the positions of Table of Contents Game Mechanics...2 Game Play...3 Game Strategy...4 Truth...4 Contrapositive... 5 Exhaustion...6 Burnout...8 Game Difficulty... 10 Experiment One... 12 Experiment Two...14 Experiment Three...16

More information

An Exploitative Monte-Carlo Poker Agent

An Exploitative Monte-Carlo Poker Agent An Exploitative Monte-Carlo Poker Agent Technical Report TUD KE 2009-2 Immanuel Schweizer, Kamill Panitzek, Sang-Hyeun Park, Johannes Fürnkranz Knowledge Engineering Group, Technische Universität Darmstadt

More information

Reflections on the First Man vs. Machine No-Limit Texas Hold 'em Competition

Reflections on the First Man vs. Machine No-Limit Texas Hold 'em Competition Reflections on the First Man vs. Machine No-Limit Texas Hold 'em Competition Sam Ganzfried Assistant Professor, Computer Science, Florida International University, Miami FL PhD, Computer Science Department,

More information

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask

Set 4: Game-Playing. ICS 271 Fall 2017 Kalev Kask Set 4: Game-Playing ICS 271 Fall 2017 Kalev Kask Overview Computer programs that play 2-player games game-playing as search with the complication of an opponent General principles of game-playing and search

More information

Learning a Value Analysis Tool For Agent Evaluation

Learning a Value Analysis Tool For Agent Evaluation Learning a Value Analysis Tool For Agent Evaluation Martha White Michael Bowling Department of Computer Science University of Alberta International Joint Conference on Artificial Intelligence, 2009 Motivation:

More information

A Brief Introduction to Game Theory

A Brief Introduction to Game Theory A Brief Introduction to Game Theory Jesse Crawford Department of Mathematics Tarleton State University November 20, 2014 (Tarleton State University) Brief Intro to Game Theory November 20, 2014 1 / 36

More information

Sport, Trading and Poker

Sport, Trading and Poker COACHING FX Sport, Trading and Poker Where High Performance Meet Winning at professional sport Profitable financial trading Successful poker... What is the connection? The answer is that there are many

More information

Creating a New Angry Birds Competition Track

Creating a New Angry Birds Competition Track Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference Creating a New Angry Birds Competition Track Rohan Verma, Xiaoyu Ge, Jochen Renz Research School

More information

TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play

TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play NOTE Communicated by Richard Sutton TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play Gerald Tesauro IBM Thomas 1. Watson Research Center, I? 0. Box 704, Yorktozon Heights, NY 10598

More information

AI Approaches to Ultimate Tic-Tac-Toe

AI Approaches to Ultimate Tic-Tac-Toe AI Approaches to Ultimate Tic-Tac-Toe Eytan Lifshitz CS Department Hebrew University of Jerusalem, Israel David Tsurel CS Department Hebrew University of Jerusalem, Israel I. INTRODUCTION This report is

More information

Etiquette. Understanding. Poker. Terminology. Facts. Playing DO S & DON TS TELLS VARIANTS PLAYER TERMS HAND TERMS ADVANCED TERMS AND INFO

Etiquette. Understanding. Poker. Terminology. Facts. Playing DO S & DON TS TELLS VARIANTS PLAYER TERMS HAND TERMS ADVANCED TERMS AND INFO TABLE OF CONTENTS Etiquette DO S & DON TS Understanding TELLS Page 4 Page 5 Poker VARIANTS Page 9 Terminology PLAYER TERMS HAND TERMS ADVANCED TERMS Facts AND INFO Page 13 Page 19 Page 21 Playing CERTAIN

More information

Reinforcement Learning in Games Autonomous Learning Systems Seminar

Reinforcement Learning in Games Autonomous Learning Systems Seminar Reinforcement Learning in Games Autonomous Learning Systems Seminar Matthias Zöllner Intelligent Autonomous Systems TU-Darmstadt zoellner@rbg.informatik.tu-darmstadt.de Betreuer: Gerhard Neumann Abstract

More information

Texas Hold em Poker Rules

Texas Hold em Poker Rules Texas Hold em Poker Rules This is a short guide for beginners on playing the popular poker variant No Limit Texas Hold em. We will look at the following: 1. The betting options 2. The positions 3. The

More information

Can Opponent Models Aid Poker Player Evolution?

Can Opponent Models Aid Poker Player Evolution? Can Opponent Models Aid Poker Player Evolution? R.J.S.Baker, Member, IEEE, P.I.Cowling, Member, IEEE, T.W.G.Randall, Member, IEEE, and P.Jiang, Member, IEEE, Abstract We investigate the impact of Bayesian

More information

Poker Rules Friday Night Poker Club

Poker Rules Friday Night Poker Club Poker Rules Friday Night Poker Club Last edited: 2 April 2004 General Rules... 2 Basic Terms... 2 Basic Game Mechanics... 2 Order of Hands... 3 The Three Basic Games... 4 Five Card Draw... 4 Seven Card

More information

ARTIFICIAL INTELLIGENCE (CS 370D)

ARTIFICIAL INTELLIGENCE (CS 370D) Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-5) ADVERSARIAL SEARCH ADVERSARIAL SEARCH Optimal decisions Min algorithm α-β pruning Imperfect,

More information

The Odds Calculators: Partial simulations vs. compact formulas By Catalin Barboianu

The Odds Calculators: Partial simulations vs. compact formulas By Catalin Barboianu The Odds Calculators: Partial simulations vs. compact formulas By Catalin Barboianu As result of the expanded interest in gambling in past decades, specific math tools are being promulgated to support

More information

Today. Types of Game. Games and Search 1/18/2010. COMP210: Artificial Intelligence. Lecture 10. Game playing

Today. Types of Game. Games and Search 1/18/2010. COMP210: Artificial Intelligence. Lecture 10. Game playing COMP10: Artificial Intelligence Lecture 10. Game playing Trevor Bench-Capon Room 15, Ashton Building Today We will look at how search can be applied to playing games Types of Games Perfect play minimax

More information

Speeding-Up Poker Game Abstraction Computation: Average Rank Strength

Speeding-Up Poker Game Abstraction Computation: Average Rank Strength Computer Poker and Imperfect Information: Papers from the AAAI 2013 Workshop Speeding-Up Poker Game Abstraction Computation: Average Rank Strength Luís Filipe Teófilo, Luís Paulo Reis, Henrique Lopes Cardoso

More information

Game Playing. Philipp Koehn. 29 September 2015

Game Playing. Philipp Koehn. 29 September 2015 Game Playing Philipp Koehn 29 September 2015 Outline 1 Games Perfect play minimax decisions α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information 2 games

More information

Game-Playing & Adversarial Search

Game-Playing & Adversarial Search Game-Playing & Adversarial Search This lecture topic: Game-Playing & Adversarial Search (two lectures) Chapter 5.1-5.5 Next lecture topic: Constraint Satisfaction Problems (two lectures) Chapter 6.1-6.4,

More information

CPS331 Lecture: Search in Games last revised 2/16/10

CPS331 Lecture: Search in Games last revised 2/16/10 CPS331 Lecture: Search in Games last revised 2/16/10 Objectives: 1. To introduce mini-max search 2. To introduce the use of static evaluation functions 3. To introduce alpha-beta pruning Materials: 1.

More information

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence Adversarial Search CS 486/686: Introduction to Artificial Intelligence 1 Introduction So far we have only been concerned with a single agent Today, we introduce an adversary! 2 Outline Games Minimax search

More information

Advanced Plays, Tricks and Moves

Advanced Plays, Tricks and Moves Chapter Five Advanced Plays, Tricks and Moves One of the key aspects to successful shorthanded play is to avoid playing in predictable and exploitable patterns. In this chapter, I will discuss a few measures

More information

Adversarial Search (Game Playing)

Adversarial Search (Game Playing) Artificial Intelligence Adversarial Search (Game Playing) Chapter 5 Adapted from materials by Tim Finin, Marie desjardins, and Charles R. Dyer Outline Game playing State of the art and resources Framework

More information

Variance Decomposition and Replication In Scrabble: When You Can Blame Your Tiles?

Variance Decomposition and Replication In Scrabble: When You Can Blame Your Tiles? Variance Decomposition and Replication In Scrabble: When You Can Blame Your Tiles? Andrew C. Thomas December 7, 2017 arxiv:1107.2456v1 [stat.ap] 13 Jul 2011 Abstract In the game of Scrabble, letter tiles

More information

Artificial Intelligence. Minimax and alpha-beta pruning

Artificial Intelligence. Minimax and alpha-beta pruning Artificial Intelligence Minimax and alpha-beta pruning In which we examine the problems that arise when we try to plan ahead to get the best result in a world that includes a hostile agent (other agent

More information

Module 3. Problem Solving using Search- (Two agent) Version 2 CSE IIT, Kharagpur

Module 3. Problem Solving using Search- (Two agent) Version 2 CSE IIT, Kharagpur Module 3 Problem Solving using Search- (Two agent) 3.1 Instructional Objective The students should understand the formulation of multi-agent search and in detail two-agent search. Students should b familiar

More information

Data Biased Robust Counter Strategies

Data Biased Robust Counter Strategies Data Biased Robust Counter Strategies Michael Johanson johanson@cs.ualberta.ca Department of Computing Science University of Alberta Edmonton, Alberta, Canada Michael Bowling bowling@cs.ualberta.ca Department

More information

Game theory and AI: a unified approach to poker games

Game theory and AI: a unified approach to poker games Game theory and AI: a unified approach to poker games Thesis for graduation as Master of Artificial Intelligence University of Amsterdam Frans Oliehoek 2 September 2005 Abstract This thesis focuses on

More information

Experiments on Alternatives to Minimax

Experiments on Alternatives to Minimax Experiments on Alternatives to Minimax Dana Nau University of Maryland Paul Purdom Indiana University April 23, 1993 Chun-Hung Tzeng Ball State University Abstract In the field of Artificial Intelligence,

More information

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46.

46.1 Introduction. Foundations of Artificial Intelligence Introduction MCTS in AlphaGo Neural Networks. 46. Foundations of Artificial Intelligence May 30, 2016 46. AlphaGo and Outlook Foundations of Artificial Intelligence 46. AlphaGo and Outlook Thomas Keller Universität Basel May 30, 2016 46.1 Introduction

More information

An Adaptive Intelligence For Heads-Up No-Limit Texas Hold em

An Adaptive Intelligence For Heads-Up No-Limit Texas Hold em An Adaptive Intelligence For Heads-Up No-Limit Texas Hold em Etan Green December 13, 013 Skill in poker requires aptitude at a single task: placing an optimal bet conditional on the game state and the

More information

CMS.608 / CMS.864 Game Design Spring 2008

CMS.608 / CMS.864 Game Design Spring 2008 MIT OpenCourseWare http://ocw.mit.edu CMS.608 / CMS.864 Game Design Spring 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Developing a Variant of

More information

Automatic Public State Space Abstraction in Imperfect Information Games

Automatic Public State Space Abstraction in Imperfect Information Games Computer Poker and Imperfect Information: Papers from the 2015 AAAI Workshop Automatic Public State Space Abstraction in Imperfect Information Games Martin Schmid, Matej Moravcik, Milan Hladik Charles

More information

TABLE OF CONTENTS TEXAS HOLD EM... 1 OMAHA... 2 PINEAPPLE HOLD EM... 2 BETTING...2 SEVEN CARD STUD... 3

TABLE OF CONTENTS TEXAS HOLD EM... 1 OMAHA... 2 PINEAPPLE HOLD EM... 2 BETTING...2 SEVEN CARD STUD... 3 POKER GAMING GUIDE TABLE OF CONTENTS TEXAS HOLD EM... 1 OMAHA... 2 PINEAPPLE HOLD EM... 2 BETTING...2 SEVEN CARD STUD... 3 TEXAS HOLD EM 1. A flat disk called the Button shall be used to indicate an imaginary

More information

POKER. May 31, June 2 & 9, 2016

POKER. May 31, June 2 & 9, 2016 POKER Brought to you by: May 31, June 2 & 9, 2016 TEAM ROSTER (3 members) Your co-ed team will consist of 3 players, either 2 male and 1 female, or 2 female and 1 male. All players must sign the roster

More information

APPLICATIONS OF NO-LIMIT HOLD'EM BY MATTHEW JANDA DOWNLOAD EBOOK : APPLICATIONS OF NO-LIMIT HOLD'EM BY MATTHEW JANDA PDF

APPLICATIONS OF NO-LIMIT HOLD'EM BY MATTHEW JANDA DOWNLOAD EBOOK : APPLICATIONS OF NO-LIMIT HOLD'EM BY MATTHEW JANDA PDF Read Online and Download Ebook APPLICATIONS OF NO-LIMIT HOLD'EM BY MATTHEW JANDA DOWNLOAD EBOOK : APPLICATIONS OF NO-LIMIT HOLD'EM BY MATTHEW JANDA PDF Click link bellow and free register to download ebook:

More information

Learning to Play Love Letter with Deep Reinforcement Learning

Learning to Play Love Letter with Deep Reinforcement Learning Learning to Play Love Letter with Deep Reinforcement Learning Madeleine D. Dawson* MIT mdd@mit.edu Robert X. Liang* MIT xbliang@mit.edu Alexander M. Turner* MIT turneram@mit.edu Abstract Recent advancements

More information

Optimal Unbiased Estimators for Evaluating Agent Performance

Optimal Unbiased Estimators for Evaluating Agent Performance Optimal Unbiased Estimators for Evaluating Agent Performance Martin Zinkevich and Michael Bowling and Nolan Bard and Morgan Kan and Darse Billings Department of Computing Science University of Alberta

More information

Bobby Baldwin, Poker Legend

Bobby Baldwin, Poker Legend Dominic Dietiker c Draft date January 5, 2007 ii You cannot survive (in poker) without that intangible quality we call heart. Poker is a character builder especially the bad times. The mark of a top player

More information

ELKS TOWER CASINO and LOUNGE TEXAS HOLD'EM POKER

ELKS TOWER CASINO and LOUNGE TEXAS HOLD'EM POKER ELKS TOWER CASINO and LOUNGE TEXAS HOLD'EM POKER DESCRIPTION HOLD'EM is played using a standard 52-card deck. The object is to make the best high hand among competing players using the traditional ranking

More information

NAVAL POSTGRADUATE SCHOOL THESIS

NAVAL POSTGRADUATE SCHOOL THESIS NAVAL POSTGRADUATE SCHOOL MONTEREY, CALIFORNIA THESIS LEARNING ADVERSARY MODELING FROM GAMES by Paul Avellino September 2007 Thesis Advisor: Second Reader: Craig H. Martell Kevin M. Squire Approved for

More information

CPS 570: Artificial Intelligence Two-player, zero-sum, perfect-information Games

CPS 570: Artificial Intelligence Two-player, zero-sum, perfect-information Games CPS 57: Artificial Intelligence Two-player, zero-sum, perfect-information Games Instructor: Vincent Conitzer Game playing Rich tradition of creating game-playing programs in AI Many similarities to search

More information

Programming Project 1: Pacman (Due )

Programming Project 1: Pacman (Due ) Programming Project 1: Pacman (Due 8.2.18) Registration to the exams 521495A: Artificial Intelligence Adversarial Search (Min-Max) Lectured by Abdenour Hadid Adjunct Professor, CMVS, University of Oulu

More information

Table Games Rules. MargaritavilleBossierCity.com FIN CITY GAMBLING PROBLEM? CALL

Table Games Rules. MargaritavilleBossierCity.com FIN CITY GAMBLING PROBLEM? CALL Table Games Rules MargaritavilleBossierCity.com 1 855 FIN CITY facebook.com/margaritavillebossiercity twitter.com/mville_bc GAMBLING PROBLEM? CALL 800-522-4700. Blackjack Hands down, Blackjack is the most

More information

Evolving Opponent Models for Texas Hold Em

Evolving Opponent Models for Texas Hold Em Evolving Opponent Models for Texas Hold Em Alan J. Lockett and Risto Miikkulainen Abstract Opponent models allow software agents to assess a multi-agent environment more accurately and therefore improve

More information

Game Tree Search. Generalizing Search Problems. Two-person Zero-Sum Games. Generalizing Search Problems. CSC384: Intro to Artificial Intelligence

Game Tree Search. Generalizing Search Problems. Two-person Zero-Sum Games. Generalizing Search Problems. CSC384: Intro to Artificial Intelligence CSC384: Intro to Artificial Intelligence Game Tree Search Chapter 6.1, 6.2, 6.3, 6.6 cover some of the material we cover here. Section 6.6 has an interesting overview of State-of-the-Art game playing programs.

More information

arxiv: v1 [cs.gt] 23 May 2018

arxiv: v1 [cs.gt] 23 May 2018 On self-play computation of equilibrium in poker Mikhail Goykhman Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem, 91904, Israel E-mail: michael.goykhman@mail.huji.ac.il arxiv:1805.09282v1

More information

Derive Poker Winning Probability by Statistical JAVA Simulation

Derive Poker Winning Probability by Statistical JAVA Simulation Proceedings of the 2 nd European Conference on Industrial Engineering and Operations Management (IEOM) Paris, France, July 26-27, 2018 Derive Poker Winning Probability by Statistical JAVA Simulation Mason

More information

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence

Adversarial Search. CS 486/686: Introduction to Artificial Intelligence Adversarial Search CS 486/686: Introduction to Artificial Intelligence 1 AccessAbility Services Volunteer Notetaker Required Interested? Complete an online application using your WATIAM: https://york.accessiblelearning.com/uwaterloo/

More information

CS510 \ Lecture Ariel Stolerman

CS510 \ Lecture Ariel Stolerman CS510 \ Lecture04 2012-10-15 1 Ariel Stolerman Administration Assignment 2: just a programming assignment. Midterm: posted by next week (5), will cover: o Lectures o Readings A midterm review sheet will

More information

Adversarial Search and Game Theory. CS 510 Lecture 5 October 26, 2017

Adversarial Search and Game Theory. CS 510 Lecture 5 October 26, 2017 Adversarial Search and Game Theory CS 510 Lecture 5 October 26, 2017 Reminders Proposals due today Midterm next week past midterms online Midterm online BBLearn Available Thurs-Sun, ~2 hours Overview Game

More information

Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage

Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage Comparison of Monte Carlo Tree Search Methods in the Imperfect Information Card Game Cribbage Richard Kelly and David Churchill Computer Science Faculty of Science Memorial University {richard.kelly, dchurchill}@mun.ca

More information