CS221 Project Final Report: Learning to Play Bridge
Conrad Grobler (conradg) and Jean-Paul Schmetz (jschmetz)
Autumn 2016

1 Introduction

We investigated the use of machine learning in bridge playing. Bridge presents an interesting challenge as it combines adversarial game play with uncertainty about which cards are in which hands. It differs from many other games of chance in that the randomness happens only once, during the deal at the start of a hand; in other games, such as backgammon, the randomness occurs at every move.

We investigated two different machine learning approaches to bridge playing. The first is based on policy gradient reinforcement learning and tries to learn the rules of bridge from scratch by playing many random games. The second uses supervised learning and learns from games played by oracle players. We present the results of these investigations and a survey of the literature on computer bridge playing, and specifically on the use of neural networks for it.

2 Bridge Game and Setup

A bridge game consists of two distinct phases: an auction phase, where players bid to define the contract and the trump suit, and a playing phase, where the players compete to win tricks. Each trick consists of 4 cards being played, one from each player's hand. A game is played between two teams (North/South and East/West). Play continues until all the cards from all the hands have been played. The winner of a trick plays first on the next trick and the other players follow clockwise. On the very first trick, the player to the left of the declarer (the winner of the auction) plays the first card. As soon as the first card is played, the cards held by the declarer's partner are placed on the table, forming the so-called dummy hand, which is visible to all players. Play from the dummy hand is also controlled by the declarer; the declarer's partner takes no further part in the rest of the game.

The focus of our investigation was on the playing phase, but valuable information about the distribution of the different hands can be communicated during the bidding phase. We implemented a simplified bidding simulation to extract some of this distributional information, but this does not form part of the competitive play between agents: it is done as a separate preprocessing step on each board, and the results are provided to the agents during the initialisation of each hand.

Each player is represented as an agent. The agent is given the current state of the game, which consists of the cards in each hand, the cards played in the trick so far and the identity of the next player. The agent must then select a legal card to play from the cards in its hand, and in doing so try to win as many tricks as possible.
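The interaction between an agent and the rest of the system can be summarised by a small interface. The sketch below is illustrative only: the class and method names (Agent, GameState, choose_card, observe_trick) are hypothetical and do not necessarily match our code, and it anticipates a few details (the trump suit and the trick-history callback) that are described in the next section.

from dataclasses import dataclass
from typing import List, Optional, Tuple

Card = Tuple[int, str]  # (rank 0-12, suit), e.g. (11, 'H') for the king of hearts

@dataclass
class GameState:
    """Observable state handed to an agent when it must play."""
    hand: List[Card]          # cards remaining in the agent's own hand
    dummy: List[Card]         # cards remaining in the dummy (visible to all)
    trick_so_far: List[Card]  # cards already played in the current trick
    next_player: int          # seat index (0-3) of the player to act

class Agent:
    """Minimal agent contract: given the observable state, return a legal card."""
    def __init__(self, trump_suit: Optional[str], bid_info: dict):
        self.trump_suit = trump_suit  # determined by the bidding preprocessing step
        self.bid_info = bid_info      # distributional information from the auction

    def choose_card(self, state: GameState) -> Card:
        raise NotImplementedError

    def observe_trick(self, trick: List[Card]) -> None:
        """Called at the end of each trick with the 4 cards played, so the
        agent can maintain its own playing history."""
        pass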

3 Infrastructure

The training is handled differently for the two agents. For the RL agent we built a gym (compatible with OpenAI Gym [1]) in which it can train over many hands against another agent. The initial training is done against a random player to explore as many different hands as possible. The supervised learning agent uses the oracle result as the correct training data: it either tries to learn to predict the optimal number of tricks that can be won by a specific action from a specific state, or to learn the best action directly from a given state. The supervised training system follows the play of optimal oracle players to explore the game tree; at a specific state it uses that state as the input and information from the oracle as the desired output. The trainer periodically saves the trained models to disk so that they can be used by the agent during play.

The playing capability of each agent type is evaluated by a controller that manages the game play. The controller takes as input the number of cards to deal per hand (in case smaller hands are required for debugging), the number of boards to play and the agent type to use for each player. Our standard evaluation process is to play a team made up of the agents being tested against a team made up of oracle players. They play 10 000 boards and the controller keeps track of the number of tricks each team wins. The average number of tricks won by the team of agents against the oracle team is used as the benchmark for performance.

The controller performs the bidding process as a preprocessing step on every game. The bidding is implemented as a simplified version of the basic convention used by Bridge Base Online agents [2]. This provides initial information on the distribution of the hands and is also used to determine the trump suit and the starting player. The agent is instantiated with some initial data on each board: distributional information from the bidding process and the trump suit. When it is the turn of a particular player, the controller passes the agent the current state of the game (cards remaining in the player's hand, cards remaining in dummy, cards already played in this trick) and the agent returns the action to take, i.e. the next card to play. The controller uses this to calculate the next state. The agent is responsible for maintaining its own internal state. At the end of each trick the controller calls each agent and passes the 4 cards that were played during the trick so that the agent can track the playing history. The controller also renders a visualisation of each trick played when it is run in verbose mode.

3.1 Oracle and baseline

The oracle is implemented as a double-dummy solver. The oracle is able to cheat by looking at all the cards; it can then compute the optimal play through search, as there is no uncertainty left. We used a very efficient open-source C++ implementation of a double-dummy solver [3]. We also implemented two different double-dummy solvers in Python for testing and debugging purposes, but these are only usable on small hands (6-7 tricks or fewer) because of their speed. The first is a modified minimax implementation with αβ-pruning. The second is based on a zero-window search approach suggested by Chang [4].

The baseline is implemented as a simple greedy agent (i.e. it will try to win the current trick as cheaply as possible), but we added a few additional rules that try to match some of the maxims taught to new bridge players (e.g. second hand low, third hand high). The baseline does not keep track of any internal state, apart from the trump suit, and looks only at the current trick.
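As an illustration, a minimal sketch of the greedy core of the baseline is given below; the real baseline layers a few maxim-based rules on top of this, and the card representation and helper functions here are ours, chosen for brevity.

# A card is a (rank, suit) pair, e.g. (12, 'S') for the ace of spades.

def legal_cards(hand, trick_so_far):
    """A card is legal if it follows the suit that was led, when possible."""
    if not trick_so_far:
        return list(hand)
    led_suit = trick_so_far[0][1]
    following = [c for c in hand if c[1] == led_suit]
    return following or list(hand)

def wins_trick(card, trick_so_far, trump_suit):
    """Would playing `card` now beat every card already on the table?"""
    def power(c):
        rank, suit = c
        if suit == trump_suit:
            return (2, rank)                 # trumps beat everything else
        if trick_so_far and suit == trick_so_far[0][1]:
            return (1, rank)                 # following the led suit
        return (0, rank)                     # discards never win
    return all(power(card) > power(c) for c in trick_so_far)

def baseline_choice(hand, trick_so_far, trump_suit):
    """Greedy core of the baseline: win the trick as cheaply as possible,
    otherwise throw the lowest legal card."""
    legal = legal_cards(hand, trick_so_far)
    winners = sorted((c for c in legal if wins_trick(c, trick_so_far, trump_suit)),
                     key=lambda c: c[0])
    return winners[0] if winners else min(legal, key=lambda c: c[0])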
3.2 Feature extraction

The feature extractor used by the supervised learning agent takes the current game state (the cards left in each hand, the cards played so far in the trick, the index of the next player and the trump suit) and creates a feature vector that can be used by the learning algorithms. The feature extractor represents each hand, and each card that has been played, using a batch of 65 values. The first 52 features of a 65-feature batch represent the 52 cards in a deck. Because trumps are special, there are 13 separate values for the trump suit. When playing no-trumps these values are all 0. When a trump contract is played, the cards of the trump suit are mapped into these 13 special values and the values corresponding to the original suit are all zero: 0-12: clubs if clubs are not trumps, 13-25: diamonds if diamonds are not trumps, 26-38: hearts if hearts are not trumps, 39-51: spades if spades are not trumps, 52-64: trumps. Cards are represented in order of increasing rank.

The first batch of 65 values represents the next hand to play; each feature is 1 if the card is in the hand and 0 if not. The next three batches of 65 values represent the other hands in playing order. If it is a visible hand (the current player's hand or dummy), 1 indicates that a card is in the hand and 0 that it is not. For a non-visible hand the value of each feature is the probability that the card is in that specific hand; at the moment the feature extractor assumes a uniform distribution. The next three batches of 65 features represent the cards on the table (played so far in the trick), following the same convention for trumps as above. If a card was played, it is represented as a 1. All other values are 0.
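A minimal sketch of this 65-value batch encoding, assuming a card is a (rank, suit) pair with ranks running from 0 (the two) to 12 (the ace); the function names are ours.

import numpy as np

SUIT_OFFSET = {'C': 0, 'D': 13, 'H': 26, 'S': 39}
TRUMP_OFFSET = 52  # the 13 extra slots reserved for the trump suit

def card_index(card, trump_suit):
    """Map a (rank, suit) card, rank 0-12 in increasing order, to its slot
    in a 65-value batch. Trump cards go into slots 52-64."""
    rank, suit = card
    if suit == trump_suit:
        return TRUMP_OFFSET + rank
    return SUIT_OFFSET[suit] + rank

def encode_batch(cards, trump_suit, value=1.0):
    """Encode a set of cards (a hand, or cards on the table) as 65 values.
    For hidden hands, `value` would be the probability a card is there."""
    batch = np.zeros(65)
    for card in cards:
        batch[card_index(card, trump_suit)] = value
    return batch

# Example: the king of hearts (rank 11) lands in slot 37 in a no-trump or
# non-heart contract, but in slot 63 when hearts are trumps.
assert card_index((11, 'H'), 'S') == 37
assert card_index((11, 'H'), 'H') == 63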

We also experimented with including the full history of play in the feature vector, with up to 48 additional batches of 65 features each. Each group of 4 batches represented the cards played in one completed trick, providing the full history of the board so far. This did not seem to improve the performance of the agent and quadrupled the size of the trained models, so we are not currently including this information in the feature vector.

We also implemented optional hand normalisation to reduce the required training time. This takes advantage of the equivalence of many different hands, especially when only a few tricks remain: the winner of a trick in a specific suit depends only on the relative ranks of the cards, not their absolute ranks. Consider the example where each player has only one card left (say hearts) with West leading. The following play sequences are conceptually equivalent: K Q J 10; 10 7 3 2; A 6 4 3; and 5 4 3 2. The hand normalisation removes all cards that have already been played from the state and reduces the ranks of the remaining cards to fill the gaps; in all of the examples above it would have produced the last sequence, 5 4 3 2. This makes different states appear more similar to the training algorithm, and it reduced the number of training examples required to achieve the same performance by a factor of approximately 5.

We also implemented an option that allows the supervised learning algorithm to cheat by looking at the opponents' hands, to support additional analysis. This provides some insight into which errors are caused by the uncertainty and which are caused by the general approach.

The policy gradient agent was tried with 3 different sets of features. The first is exactly the same as for the supervised learning agent. The second is a simplified version in which the game state is represented by 5 rows of 52 cards (collapsed into a 260-value vector): the first row represents the cards held by the player, and the other rows represent the state of the dummy hand, the table, the played cards and the suits (incl. trumps).
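To make the hand normalisation described above concrete, here is a small sketch of the underlying relabelling for a single suit; the function name and representation are ours, and the real implementation also rewrites the rest of the state accordingly.

def normalise_suit(remaining_ranks):
    """Map the ranks of the cards still in play in one suit onto 0, 1, 2, ...
    while preserving their relative order, so that equivalent end positions
    (e.g. K Q J 10 spread over four players, or 5 4 3 2) collapse to the
    same state. `remaining_ranks` are the ranks (0-12) of all unplayed cards."""
    order = sorted(set(remaining_ranks))
    relabel = {rank: i for i, rank in enumerate(order)}
    return {rank: relabel[rank] for rank in remaining_ranks}

# Example: only K Q J 10 of hearts are left (ranks 11 10 9 8); they are
# relabelled to 3 2 1 0, the same state as if the players held 5 4 3 2.
print(normalise_suit([11, 10, 9, 8]))   # {11: 3, 10: 2, 9: 1, 8: 0}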
We also implemented a simpler game ("minibridge"), which is essentially a two-player version of bridge. This was used to test the policy gradient learning agent, as it only required a few hundred thousand games to train effectively. The third version of our feature extractor represented the state of minibridge in 5 rows of 52 cards.

4 Approach

4.1 Policy Gradient Agent

The policy gradient agent tries to learn an appropriate policy for playing bridge from scratch by playing many random hands against other agents. The initial training was done against a random player to allow it to see as many different hands and states as possible. The agent receives a large negative reward if it tries to perform an illegal action. If it performs a legal action that does not lead to a trick being won, it receives a small positive reward. If the legal action leads to winning a trick, it receives a larger positive reward.
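A minimal sketch of this reward scheme; the magnitudes shown here are illustrative only, not the values used in our gym.

def reward_for(is_legal: bool, wins_trick: bool,
               illegal_penalty=-10.0, legal_reward=0.1, trick_bonus=1.0):
    """Reward for a single action, given whether the chosen card was legal
    and whether it won the trick. The magnitudes are illustrative."""
    if not is_legal:
        return illegal_penalty
    return legal_reward + (trick_bonus if wins_trick else 0.0)

# Examples: an illegal card, a legal losing card, and a trick-winning card.
print(reward_for(False, False), reward_for(True, False), reward_for(True, True))
# -10.0 0.1 1.1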

The policy gradient agent is implemented based on the explanation of deep reinforcement learning by Andrej Karpathy [5]. Unlike Pong, the difficulty is to learn the probability of playing one of 52 cards instead of a binary decision (UP/DOWN). We decided to learn a policy for each card. The learner plays one of the 52 cards at random, based on their normalised probabilities, and collects the rewards until the game is finished (an episode is a 13-trick game). It then discounts the rewards and back-propagates the episode to the neural networks of the cards that were played in the episode (back-propagating to all cards proved to be harmful). Doing so ensures that the learner explores a large set of possibilities while still learning at a reasonable, albeit slow, rate. It also means that the learning requires more than a million games before the player consistently learns to play by the rules. The agent is a collection of 52 neural networks, each with one hidden layer of relatively small size (we tried between 5 and 50 units), a ReLU nonlinearity, and a single output value squeezed between 0 and 1 by a sigmoid function.

In order to test the learning agent, we created a simplified version of bridge ("minibridge") in which 2 players each receive 13 cards chosen randomly from a deck of 52 and essentially play bridge: the first player (chosen randomly) plays a card and the other player must follow suit (if possible). The winner of a trick is the player with the highest card in the suit that was led, and the winner starts the next trick. This game is close enough to bridge to test whether the policy learning agent would be able to learn a more complicated game. The gym that exercises the player is patient and allows the player to try many cards until a legal one is played; each illegal card tried, however, receives a relatively large negative reward. Stopping the round at every illegal move proved to be inefficient. Once the learner proved able to play minibridge (see results below), we deployed a large number of these learners on the more complete sets of features representing the real game of bridge. The learning time was expected to be much longer than for minibridge.

Except for the structure of the features (which to a human is a representation of multiple rows of cards) and the structure of 52 cards, the learning agent has no concept of bridge. It is just trying to learn the optimal decision out of 52 possible actions based on a sparse vector of observations. This is important, as we really wanted the learner to learn from scratch. In fact it seems that it could learn any card game which involves putting a card on the table at every turn.
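A condensed sketch of the per-card policy and its update, in the spirit of [5]; the class name, layer size, learning rate and discount factor are illustrative, and the real learner wires 52 such networks together and normalises their outputs before sampling a card.

import numpy as np

class CardPolicy:
    """One tiny network per card: observation -> probability of playing that card."""
    def __init__(self, n_features, n_hidden=20, lr=1e-3):
        self.W1 = np.random.randn(n_hidden, n_features) / np.sqrt(n_features)
        self.W2 = np.random.randn(n_hidden) / np.sqrt(n_hidden)
        self.lr = lr

    def forward(self, x):
        h = np.maximum(0.0, self.W1 @ x)            # ReLU hidden layer
        p = 1.0 / (1.0 + np.exp(-(self.W2 @ h)))    # sigmoid output in (0, 1)
        return p, h

    def update(self, x, h, p, advantage):
        """REINFORCE-style update for one step in which this card was played."""
        dlogit = advantage * (1.0 - p)              # d log p / d logit for the taken action
        dW2 = dlogit * h
        dh = dlogit * self.W2
        dh[h <= 0] = 0.0                            # back-propagate through the ReLU
        self.W1 += self.lr * np.outer(dh, x)
        self.W2 += self.lr * dW2

def discount(rewards, gamma=0.99):
    """Discounted returns over one 13-trick episode."""
    out, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

During training the agent samples a card from the normalised outputs of the 52 networks, records (x, h, p) for the card it actually played, and after the episode calls update once per step with the corresponding discounted return (minus a baseline) as the advantage; only the networks of cards that were played are touched.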
4.2 Supervised Learning Agent

The main supervised learning algorithm uses a multilayer perceptron regressor (specifically sklearn.neural_network.MLPRegressor [6], [7]) to learn to predict the number of tricks that can be won from a specific state. Specifically, it uses 3 hidden layers with 400, 300 and 200 nodes; these numbers seemed to provide the most robust performance over a number of training attempts. It uses a sigmoid activation function for the hidden layers and a linear model for the output layer. Conceptually it tries to use the hidden layers to learn useful features and then feed these into the final layer to perform linear regression.

One thing we noticed early on is that the learning was struggling to cope with the different roles related to the position in a trick. The approach when leading the first card of a new trick is very different from the approach when playing the last card of the trick. To help with this we created 4 neural networks for each agent; training and playing use the neural network corresponding to the number of cards already played in the trick. This gave a significant boost in the performance of the agent.

The training is done on many random hands. For a specific hand the trainer randomly explores the search tree, guided by the play of the oracle players. From a specific state it uses the double-dummy solver (the oracle) to get the actual number of tricks that could be won with optimal play from each of the possible next states. It aggregates the training data in batches of 500 000 samples; when a batch is ready, it performs additional training on the model with this batch.
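A sketch of this setup with scikit-learn, using one regressor per trick position; partial_fit is shown as one possible way to perform the batch-wise additional training, and the surrounding names are ours.

from sklearn.neural_network import MLPRegressor

# One regressor per position in the trick (leading, second, third, fourth hand).
models = [
    MLPRegressor(hidden_layer_sizes=(400, 300, 200), activation='logistic')
    for _ in range(4)
]

def train_on_batch(position, features, trick_targets):
    """Incrementally fit the model for one trick position on a batch of
    (feature vector, oracle trick count) pairs."""
    models[position].partial_fit(features, trick_targets)

def predict_tricks(position, candidate_features):
    """Predicted number of tricks for each candidate next state; the agent
    plays the card whose resulting state has the highest prediction."""
    return models[position].predict(candidate_features)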

During play, the agent uses the learnt model to estimate the number of tricks for each of the possible next states that result from the possible actions at that point. It then chooses the action with the highest score.

Another approach we tried was to learn the best action directly, rather than deducing it from a predicted number of tricks. For this we used a multilayer perceptron classifier (specifically sklearn.neural_network.MLPClassifier [6], [7]). The activation and the number and sizes of the hidden layers match those used in the MLPRegressor implementation. The output was an array of 52 values, each one corresponding to a card. The generation of training data followed the same process as above, except that it used the optimal cards calculated by the oracle as the training targets, rather than the number of tricks. During play the agent plays the card from the list of legal cards with the highest probability predicted by the classifier.

We also implemented a hybrid agent option. This allows the supervised learning agent to switch over to using search to find the optimal card to play once the hand becomes tractable (in our experiments, when there were 6 or fewer tricks left). The search process enumerates all possible states that could exist given the unknown hands. The agent keeps track of the minimum and maximum possible cards in each suit in each hand as the play unfolds, and evaluates each of the enumerated states against this information. If a state satisfies the constraints, the agent uses the oracle to determine the optimal card to play from that state. It then plays the card that was returned as the optimal card most often over these test states.
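A condensed sketch of the hybrid agent's endgame step; enumerate_consistent_deals and oracle_best_card stand in for the deal enumeration and the double-dummy solver call and are hypothetical names.

from collections import Counter

def hybrid_choice(visible_state, constraints,
                  enumerate_consistent_deals, oracle_best_card):
    """Once few enough tricks remain, enumerate every full deal consistent
    with what has been observed, ask the double-dummy oracle for the best
    card in each, and play the card that is optimal most often."""
    votes = Counter()
    for deal in enumerate_consistent_deals(visible_state, constraints):
        votes[oracle_best_card(deal)] += 1
    best_card, _ = votes.most_common(1)[0]
    return best_card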
5 Results

5.1 Reinforcement learning

The minibridge agent is able to learn a policy that allows it to score around 200 points against a random opponent (playing 1000 games in a row). This is better than the expected average of 180, which means that the agent learned not only to play correctly but also to be good at winning the game. Both learner and player agents (with a learned policy) can be found in a CodaLab worksheet [8]. The training usually requires about 1 million episodes (an episode is defined as a 13-trick round played to the end, including all the illegal moves tried by the learner).

The real bridge agent takes many more episodes to learn to play (and playing one episode also takes longer). After half a million episodes, the learning agent still has a running average reward (over 100 episodes) of about -2000 per episode. This essentially means that the agent is still trying roughly 150 illegal moves per episode (this is not a problem, as each illegal move decreases the probability of that move by a tiny fraction every time). This compares to minibridge, where the agent reaches a -2000 average running reward relatively quickly and then climbs to a plateau of about +160 (it is always exploring and therefore collecting negative rewards while learning) before often diverging again (see Error Analysis below). After 5 days of learning and almost 2,000,000 episodes, the learned policy reaches a slightly negative running reward (in the range of -100 to -500) while still learning (i.e. still collecting negative points). The player (which always plays the most probable action in the learned policy) wins consistently against the random player (by nearly the same margin as the baseline player) and makes few mistakes (defined as playing an illegal card).

Both learner and player for the more fully featured bridge can be found in a CodaLab worksheet [9]. We were unable to fully train an agent using the full feature set (which would allow the player to use the knowledge of the order of all the moves in the episode). It does seem possible, but would require a much longer training time (most likely measured in weeks).

5.2 Supervised Learning Agent

Below is a summary of the average results from running the supervised training experiments. More details, and two example runs of the experiments, can be found in our CodaLab worksheet [10].

The results were produced by running the controller over 10 000 games, each with 13 tricks. In all cases the agents under test played North and South and oracle players played East and West.

Agent        Average tricks won
Random       3.8
Baseline     4.5
Normalised   4.0
Direct       3.8
Cheating     4.3
Hybrid       4.7
Oracle       6.5

The results show that the baseline performs surprisingly well. The default agent using a normalised hand performed better than random but did not match the baseline. The cheating agent (with uncertainty removed) performed better than the default agent, but not by as big a margin as we expected. The best performing agent (apart from the oracle) was the hybrid agent, which switched over to search once the hand became tractable.

As part of testing we found a curious phenomenon: the training process was surprisingly volatile. When we trained a model with exactly the same parameters, including the same number of training examples, the performance of the agent sometimes fluctuated wildly. The only difference between runs was the random hands that were generated; the number of hands, the number of training examples and all other parameters were the same. We initially thought it was randomness in the evaluation process, but the differences remained over repeated assessments of the same trained model. For example, different training runs of the default normalised agent usually produced results ranging from 3.9 to 4.2; most of the test results were close to 4.1 and a few were worse. One striking example performed extremely well, with a result of 4.8. At the time we believed it was due to hyperparameter tuning, but repeated further attempts with the same parameters never reproduced that result. It seems that the training is very sensitive to the details of the training samples.

Another interesting observation was that performance seemed to vary with the number of training samples in a cyclical manner. As more training samples were added the performance would increase up to a point, then start decreasing, and at some later point start increasing again. This is likely due to hands that look very similar but have very different outcomes confusing the learning process; given the random nature of the training data, the process is bound to encounter such hands at some point. As even more samples are added, they probably drown out the problems caused by the conflicting information, until new conflicting information is found.

6 Error Analysis

6.1 Reinforcement Learning

On minibridge, it is clear that the player has really learned the game (i.e. the learner makes very few mistakes). It still has moments where it plays an illegal card, but these are relatively rare. The hidden feature vector looks like an indicator of the playability of each card. The puzzling thing is that the predicted probability of playing is always extremely low, regardless of whether the card is legal or illegal; it turns out that the very low probability of playing a legal card is simply higher than the very low probability of playing an illegal card. In some observations the comparison was between probabilities on the order of 1e-100 and 1e-70. This is of course a problem, as the training will eventually underflow even for the right action. On the more realistic game of bridge, the learner plays more illegal moves, but consistently beats a random player. More details and a few sample runs can be found on the CodaLab page [9].
6.2 Supervised Learning

For the supervised learning agent we compared the choices of the agent with those of the oracle in the same circumstances over a number of hands, to understand where the agent diverged from optimal play. For the default supervised agent we looked at the predicted number of tricks, the actual optimal number of tricks and how this impacted the card that was chosen. In general it was quite good at predicting the number of tricks that could be won from a specific state.

If the purpose had purely been to train a neural network to predict the number of tricks that a double-dummy solver could make, the results would have been pretty good. The problem is that it did not match the oracle at crucial points in the play.

The most significant problem we found was that the agent never seemed to learn the "third hand plays high" maxim. Whenever the agent played the third hand in a trick it almost always played low. It was trying to protect its high cards for the future, but that allowed the player in the fourth hand to win cheap tricks. The whole point behind the third-hand-high maxim is to force high cards from the opponents and so make the cards in the partner's hand good. In spite of training a separate network for each playing position and running many attempts, it never learnt a good strategy for the third hand. The playing results look much more in line with expectations for the other positions.

Another, more subtle, problem was that the network was not good at differentiating between states with only minor differences. When two cards were of similar value but had significant implications later in the game (e.g. one allowing an entry into partner's hand and the other not), the agent was not able to predict the difference in outcomes and quite often chose the wrong one to play. It never seemed to learn the value of strategies like protecting entries into the partner's hand or protecting long suits until no trumps are left in the opponents' hands. In the same way it did not learn to appreciate the value of trying to knock entries out of the opponents' hands.

For the direct-prediction agent we compared the predicted probability associated with each card to the optimal card choices suggested by the oracle. The network seemed to be quite good at predicting which card would have been good to play. The problem is that the best card to play was almost never in the hand of the agent, or was not legal to play. The agent chose the legal card with the highest predicted probability, but quite often this was a very low value. In general this agent did not seem to perform better than a random player. It would likely require a significantly different design in terms of feature extraction and network architecture to be able to predict this correctly while conforming to the rules of bridge.

7 Survey of the Literature

Using Monte Carlo simulations combined with an efficient double-dummy solver to play bridge was first proposed by Ginsberg [11]. The basic idea is to simulate many random deals matching the currently known distributional constraints; on each simulated deal an efficient double-dummy solver is used to determine the optimal play, and the best card in expectation over all the simulations is then chosen. This approach has been refined since then, but still seems to form the basis of the best automated bridge playing agents [12]. Our hybrid agent uses this approach once fewer than 7 tricks remain, except that we perform an exhaustive search of possible deals rather than a Monte Carlo simulation. The significant boost in performance this provides to the hybrid agent highlights the power of this approach.

Machine learning has been used in a number of papers to handle the bridge bidding process. The most interesting of these is the use of deep reinforcement learning for automated bidding [13]. As we focused on playing rather than bidding, this was not directly applicable to our work.
It does, however, present an interesting approach and outperformed the state of the art in terms of bidding the optimal contracts. We could not find any research that tried to use reinforcement learning on the actual playing phase of bridge.

Mossakowski and Mańdziuk used feedforward neural networks to predict the number of tricks found by a double-dummy solver [14] and to estimate hand strength [15]. Their approach to predicting the number of tricks is very similar to our approach for the supervised learning agent: they also used a multi-layer feedforward network. Their feature extractor was quite different, using two numbers to represent each card in a hand (the rank and the suit) rather than indicator features. The other major difference is that they only used their model to estimate the number of tricks at the start of a game, not at each point during the game to guide play. They mentioned plans to investigate the use of their model in game play in future research, but as far as we could find, that research has not been published.

8 Conclusion

Bridge represents a difficult challenge for machine learning. For reinforcement learning, just learning the rules of the game was a huge challenge: the rules for which cards are legal to play in a specific situation are complicated, and learning to choose a legal card from an arbitrary state proved difficult. Using domain knowledge to encode more information about each card into the feature vector could help with the learning, but this is not quite the same as learning from scratch based purely on the raw information about the cards.

The supervised learning agent performed worse than anticipated. Analysing the cards it chose to play, compared to those suggested by the oracle, revealed a number of weaknesses in the approach. The main issue seemed to be the inability to learn a successful strategy when playing third in a trick. Another significant issue was that the attributes that are typically useful in neural networks (such as resilience to minor changes in the input) are actually counter-productive for bridge playing. A single change in the hand, such as whether a minor card is a 3 or a 4, or whether there is one trump or none left in the opponents' hands, can in some cases lead to huge swings in the possible number of tricks. Even with three hidden layers, the networks were unable to learn these subtleties. Perhaps a more complex architecture, such as a deep CNN that exploits the structure of a bridge hand, or encoding more domain knowledge into the feature vectors, could produce better results.

The most surprising result is probably how well the baseline player performed, even though it is based on a very simple strategy; it required a lot of effort to surpass it. In our opinion this shows how simple rules of thumb (maxims) can contain very compact representations of important strategies in game play, and it shows the value of domain knowledge when approaching a problem. This could be part of the reason that new players can learn bridge and become relatively competitive quite quickly, while real expert-level play takes many years or decades to develop.

The current state of the art in computer bridge playing is to use Monte Carlo methods to explore the game tree and then use an efficient double-dummy solver to determine the best card for each of the sampled states; the card that is optimal in most of them is then played. This is very similar to the search part of our hybrid agent. The difficulty is that the largest number of possible states exists at the start of the hand, which is also the time when the least information can be deduced from the cards played so far. This is probably where the largest margin for improvement lies for future research. Perhaps combining machine learning techniques with canned opening plays could improve performance during the first trick or two. After that, the most value from machine learning could probably come from trying to detect (or even employ) deceptive plays, or from learning signalling systems to obtain more distributional information earlier in the game so that the Monte Carlo process can be better focused.

References

[1] G. Brockman et al., "OpenAI Gym." https://gym.openai.com/.
[2] Bridge Base Online, "GIB system notes." https://www.bridgebase.com/doc/gib_system_notes.php.
[3] B. Haglund, "DDS double dummy solver." https://github.com/dds-bridge/dds.
[4] M.-S. Chang, "Building a fast double-dummy bridge solver," tech. rep., New York University, 1996.
[5] A. Karpathy, "Deep reinforcement learning: Pong from pixels." http://karpathy.github.io/2016/05/31/rl/.
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[7] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, "API design for machine learning software: experiences from the scikit-learn project," in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108-122, 2013.
[8] C. Grobler and J.-P. Schmetz, "Learning to play bridge - minibridge." https://worksheets.codalab.org/worksheets/0x949eb558dc9c438a985d8d9d7f72b004/.
[9] C. Grobler and J.-P. Schmetz, "Learning to play bridge - policy gradient learner." https://worksheets.codalab.org/worksheets/0x8750e9638941479b86989ee156ddcfdd/.
[10] C. Grobler and J.-P. Schmetz, "Learning to play bridge - supervised agent." https://worksheets.codalab.org/worksheets/0x95601a915c4c4f2dadc394dbc18594ef/.
[11] M. L. Ginsberg, "GIB: Steps toward an expert-level bridge-playing program," tech. rep., University of Oregon, 1998.
[12] P. M. Bethe, "The state of automated bridge play," tech. rep., New York University, 2010.
[13] C.-K. Yeh and H.-T. Lin, "Automatic bridge bidding using deep reinforcement learning," ArXiv e-prints, July 2016.
[14] K. Mossakowski and J. Mańdziuk, "Artificial neural networks for solving double dummy bridge problems," pp. 915-921. Berlin, Heidelberg: Springer, 2004.
[15] K. Mossakowski and J. Mańdziuk, "Neural networks and the estimation of hands strength in contract bridge," pp. 1189-1198. Berlin, Heidelberg: Springer, 2006.