Learning to Play Love Letter with Deep Reinforcement Learning

Madeleine D. Dawson* (MIT, mdd@mit.edu), Robert X. Liang* (MIT, xbliang@mit.edu), Alexander M. Turner* (MIT, turneram@mit.edu)

Abstract

Recent advancements in artificial intelligence have shown dramatic success in playing games. In particular, the application of deep learning algorithms to reinforcement learning has enabled substantially better performance. Most notably, DeepMind's AlphaGo was able to defeat the world's best human Go players. [1] These algorithms found strategies with minimal knowledge of the domain by learning to play from experience. The majority of recent advancements in reinforcement learning have been in single-agent environments or in two-player environments with perfect information. [1] In this paper, we investigate the performance of a similar approach on Love Letter, a multiplayer incomplete-information card game. We use a modified version of Neural Fictitious Self-Play [2] to train competitive agents in this environment, avoiding the need for Nash equilibrium computations. We additionally investigate the feasibility of learning not only strategies but also the game rules themselves, and propose a method of initial training using random play to aid this process.

* The authors contributed equally to this work.

1 Introduction

In the past few years, neural networks have become very popular due to their breakthrough success in complex, non-linear tasks like image classification and speech recognition. This success is due in large part to the use of graphics processing units (GPUs), which allow the training of much larger models than previously possible. [3] The technique is known as deep learning because of the use of multiple layers that generate increasingly abstract features of the input.

Neural networks have also been used for playing games, with substantial success. Artificial intelligence has achieved expert-level play in Backgammon, Chess, and Go. [2] In a widely publicized series, DeepMind's AlphaGo was able to defeat the world's best human players of Go. [1] Additionally, OpenAI has trained bots to play Atari games and to perform motor control. [4] These algorithms learned how to perform these tasks with minimal knowledge of the domain of interest by learning to play from experience. The fundamental algorithms used are standard reinforcement learning algorithms, but by adapting them to take advantage of deep learning, they have solved many classes of learning tasks. In Deep Reinforcement Learning from Self-Play in Imperfect-Information Games, Heinrich and Silver use competing neural networks to approach a Nash equilibrium through self-play of Leduc poker. [2]

In this paper, we investigate the performance of these algorithms on Love Letter, a multiplayer incomplete-information card game. This is a challenging domain because the majority of recent advancements in reinforcement learning have been in single-agent environments or in two-player environments with perfect information. [2] Previous approaches to multiplayer incomplete-information games have focused on computing the Nash equilibrium strategy. [2] We show that it is still possible to train agents that are competitive in this environment. Furthermore, we investigate the ability of agents not only to learn strategies without domain knowledge, but even to learn the game rules themselves. We propose a novel method for initially training agents using a random player. This initialization not only encourages the action-value network to learn which moves are illegal but also may allow for more efficient development of agents, especially for games with complex rules. We show that our player can implicitly learn the rules of Love Letter using this method.

2 The Love Letter Game Environment

In Love Letter, players use a deck of cards numbered 1 to 8 and take turns until either the deck runs out or there is only one player left. [5] To begin, one secret card is removed from the deck and placed face-down. Each player is dealt one card, also face-down. Players may observe their own hand but not others', making this an incomplete-information game. On a player's turn, they draw a card and then choose to play one of the two cards in their hand, triggering the chosen card's effect. This effect is determined by the card's number. After the effect is processed, play proceeds to the next player. The game continues until the deck runs out or only one player remains. The cards have the following numbers, names, and effects:

1. (Guard) The active player chooses another player and guesses a type of card other than a 1 (Guard). If the chosen player has the guessed card in their hand, the chosen player is eliminated.
2. (Priest) The active player chooses another player, and the chosen player privately shows their hand to the active player.
3. (Baron) The active player chooses another player. The two players privately compare their hands. The player with the lower-numbered card is eliminated and reveals their hand publicly. If there is a tie, no player is eliminated and no cards are revealed.
4. (Handmaid) Until the currently active player's next turn, that player cannot be targeted by any card's effect.
5. (Prince) The active player chooses any player and forces them to discard their hand. If the discarded hand is the 8 (Princess), the chosen player is immediately eliminated. If the chosen player is not eliminated, they draw a new card immediately. If no cards are left in the deck, the chosen player picks up the secret card (set aside at the beginning of the game) and the game ends.
6. (King) The active player chooses another player. The active and chosen players switch their hands without revealing their cards to other players.
7. (Countess) The active player must discard this card if the other card in their hand is a 5 (Prince) or 6 (King).
8. (Princess) If a player discards this card for any reason, they are eliminated.

We used Python [6] to write our game engine and players, relying extensively on the NumPy [7] library. We used Keras [8] to build the action-value network and TensorFlow [9] to build the average-policy network. Our random player, used for initialization and testing, chooses a random legal move and can optionally choose a random illegal move with some small probability. Our trained AI agent was allowed to return any action at each point, meaning it could choose to discard a card it didn't possess. In cases where an agent performed an illegal action, it was immediately eliminated.

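The game engine and players are not published with this report, so the following is only a minimal sketch of how the random baseline and the illegal-move rule described above might look; all names (ALL_ACTIONS, legal_actions, and so on) are illustrative rather than the authors' actual implementation.

```python
import random

# Illustrative sketch only: the paper's engine is not published, so these
# names (ALL_ACTIONS, legal_actions, p_illegal) are hypothetical.

ALL_ACTIONS = list(range(15))   # e.g., a flat enumeration of card/target/guess choices

def random_action(legal_actions, p_illegal=0.1):
    """Random baseline policy: usually a uniform random legal move, but with
    probability p_illegal a uniform random illegal move, so that the learning
    agent observes (and is penalized for) rule violations."""
    if random.random() < p_illegal:
        illegal = [a for a in ALL_ACTIONS if a not in legal_actions]
        if illegal:
            return random.choice(illegal)
    return random.choice(legal_actions)

# In the engine, an illegal action eliminates the acting player immediately,
# which produces the -1 reward described in Section 3.3.
```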
3 Methods

3.1 Q-learning

Q-learning is a model-free form of reinforcement learning in which a network is trained on action-value pairs. [10] A state s_t and an action a_t are given as input and a value is predicted; typically this value is the expected reward for executing a_t in state s_t. If R is the reward function and λ is the discount factor weighting future rewards relative to current rewards, Q-learning aims to learn the following function, i.e. the expected total of discounted rewards:

$$ Q(s_t, a_t) = \mathbb{E}\left[ R_t + \lambda R_{t+1} + \lambda^2 R_{t+2} + \cdots \right] $$

The network thus supports a greedy policy in which the action with the highest expected reward is taken at every step. Traditional Q-learning uses the following update step:

$$ Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \lambda \max_{a'} Q(s_{t+1}, a') \right) $$

Here, Q(s, a) is the prediction of total discounted reward, r_t is the reward observed for taking action a_t in state s_t, α is the learning rate, and max_{a'} Q(s_{t+1}, a') is the estimate of discounted future reward.

Deep Q-learning is a modified form of Q-learning that can be used for high-dimensional states instead of hand-crafting features. [11] Rather than storing an estimated action value for every possible state, a deep neural network is used to predict values. This allows value estimates to generalize across states, even to states that have not been encountered before.

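To make the update rule above concrete, here is a minimal tabular Q-learning sketch. It is not the deep variant used in this work (which replaces the table with a neural network, Section 3.3); the environment interface (reset/step/actions) is an assumption in the style of common RL toolkits.

```python
from collections import defaultdict
import random

# Minimal tabular illustration of the update rule above. The env interface
# (reset() -> state; step(a) -> (state, reward, done); actions(s)) is assumed.

def q_learning(env, episodes=1000, alpha=0.1, lam=0.95, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value estimate
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + lam * max_a' Q(s',a'))
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in env.actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + lam * best_next)
            s = s_next
    return Q
```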
3.2 Neural Fictitious Self-Play

Fictitious play is a method of learning to play games by choosing the best response to opponents' previous strategies. [2] Fictitious self-play extends this concept to an agent playing against itself. Each agent stores all state transitions, actions, and rewards earned in one memory; in another memory, it stores the agent's own previous actions. We approach Love Letter using a modified form of Neural Fictitious Self-Play [2] in which neural networks are also used to approximate the agent's historical probability of taking each action in each state.

3.3 Training

The training procedure is outlined in Algorithm 1 and is adapted from Heinrich and Silver's paper. [2] The rewards used in training were simply +1 on victory and -1 on loss or illegal action.

3.3.1 Initial Training Using Random Play

We modify the original Neural Fictitious Self-Play procedure by adding initial training using a random-play policy. This random-play policy plays any of the legal moves available to it with uniform probability; a small fraction of the time (10%), it is also allowed to play an illegal move. This helps the action-value network learn which moves are legal, as this knowledge is not provided a priori. If an agent plays an illegal move, it simply loses automatically.

Algorithm 1: Training Procedure

Initialize the game and the array of AI agents, each running TrainAgent(); training first uses random play, then resets M_RL and switches to self-play.

Function TrainAgent()
    Initialize memories M_RL and M_SL
    Initialize average-policy network Π
    Initialize action-value network Q
    Initialize target action-value network Q_T from Q's weights
    for each game do
        Set agent's policy using SetPolicy()
        while game is not over do
            Observe state s_t at current time t
            Perform action a_t in the game, according to the policy
            Observe reward r_{t+1} and following state s_{t+1}
            Store (s_t, a_t, r_{t+1}, s_{t+1}) in M_RL
            if policy is ε-greedy then
                Store (s_t, a_t) in M_SL
            end if
            Train Π using SGD on cross-entropy loss over M_SL data
            Train Q using SGD on mean squared error loss, using target network Q_T, over M_RL data
            every k steps do
                Update Q_T's weights to Q's
            end every
        end while
    end for
end function

Function SetPolicy()
    if in initial random training then
        Set policy to the random player
    else
        Set policy to ε-greedy with probability η; otherwise, set it to Π
    end if
end function

While this random player did require an understanding of the game rules, the game framework does not require players to choose a legal move. Our goal was for the trained player to infer the rules of Love Letter through this random initialization.

3.3.2 Neural Network Architectures

For the action-value (Q) network, two fully-connected hidden layers of 256 units with ReLU activation were used. The output layer was sized according to the number of possible actions, with no activation function applied to the output.

For the average-policy network, two fully-connected hidden layers of 512 units with ReLU activation were used. The output layer was again sized according to the number of possible actions, with softmax activation.

We experimented briefly with differently sized architectures, using half or double the number of units or an additional layer. We settled on these sizes because larger architectures trained too slowly for our purposes and smaller architectures did not achieve equally good performance.

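A sketch of the two architectures is given below. Both are shown with the Keras API for brevity (we used Keras for the action-value network and TensorFlow for the average-policy network); STATE_DIM and NUM_ACTIONS are placeholders, since the exact state encoding and action-space size are not restated here.

```python
from tensorflow import keras
from tensorflow.keras import layers

STATE_DIM = 64     # placeholder: size of the encoded game state
NUM_ACTIONS = 15   # placeholder: size of the flat action space

def build_q_network():
    """Action-value network: 2 x 256 ReLU hidden layers, linear output per action."""
    return keras.Sequential([
        keras.Input(shape=(STATE_DIM,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(NUM_ACTIONS),              # no activation: raw value estimates
    ])

def build_average_policy_network():
    """Average-policy network: 2 x 512 ReLU hidden layers, softmax over actions."""
    return keras.Sequential([
        keras.Input(shape=(STATE_DIM,)),
        layers.Dense(512, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(NUM_ACTIONS, activation="softmax"),
    ])

q_net = build_q_network()
q_net.compile(optimizer="sgd", loss="mse")
pi_net = build_average_policy_network()
pi_net.compile(optimizer="sgd", loss="categorical_crossentropy")
```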
3.3.3 Target Q Network

The naive approach to training the deep Q network is to minimize the mean squared error between the predicted total discounted reward, Q(s_t, a_t), and the observed reward plus the predicted future discounted reward under a greedy policy, r_t + λ max_{a'} Q(s_{t+1}, a'). This, however, resulted in unstable predictions, with some weights in the Q network increasing or decreasing without bound. To train stably, we kept a target Q network, Q_T, initialized as a copy of Q. Q was then trained on the mean squared error between Q(s_t, a_t) and r_t + λ max_{a'} Q_T(s_{t+1}, a'), i.e. the observed reward plus the target network's prediction of future discounted reward under a greedy policy. The weights of Q_T were periodically set to those of Q; we updated them every 10 game steps. This procedure damped the oscillations, resulting in stable training.

3.3.4 Memories and Batching

Two memory stores were used to hold past experience: M_RL and M_SL. M_RL is used to train the action-value network and consists of past observed transitions and rewards; it is a circular buffer of size 2000. M_SL is used to train the average-policy network and consists of all states and actions taken by the agent when following the ε-greedy policy, described below. The action-value network was trained using a batch size of 32 and the average-policy network using a batch size of 300. For each training step, that number of random memories was sampled from the corresponding memory to form the batch.

3.3.5 Policies

After the initial training using the random-play policy described earlier, at the start of each game the agent being trained chooses between two policies: ε-greedy or Π (average policy). The agent chooses ε-greedy with some small probability η; we used η = 0.1.

An ε-greedy policy usually picks the action that Q predicts will maximize total discounted reward: given a state as input, Q outputs a predicted total discounted reward for each possible action, and the chosen action corresponds to the argmax of this output. With probability ε, however, an agent following an ε-greedy policy acts randomly. We used an exponentially decaying ε that begins at 0.1 and is multiplied by 0.995 every training step; the decay stops when ε reaches the minimum allowed value of 0.01.

The agent chooses Π (average policy) with probability 1 − η. In this case, the agent acts according to its average previous behavior in this state, as estimated by the average-policy network: given a state as input, the network outputs an estimated historical probability of taking each action, the softmax layer normalizes these probabilities, and an action is sampled according to them. After training, evaluation is performed solely using Π.

The selected action specifies not only the card chosen but also (if relevant) the target of the card and (if the Guard is chosen) the card guessed.
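The Q-network training step of Sections 3.3.3-3.3.4 might look roughly as follows; this is a sketch rather than our exact code. The buffer size (2000), batch size (32), and sync interval (every 10 game steps) follow the text, the networks are those sketched above, and the discount factor value is an assumption.

```python
import random
from collections import deque
import numpy as np

# Illustrative sketch of the replay-buffer + target-network update.
# q_net and q_target are Keras models as sketched above.

replay_buffer = deque(maxlen=2000)   # M_RL: circular buffer of (s, a, r, s_next, done)
LAMBDA = 0.95                        # discount factor (assumed value, not stated in the text)

def train_q_step(q_net, q_target, batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])

    q_values = q_net.predict(states, verbose=0)
    next_q = q_target.predict(next_states, verbose=0)    # bootstrap from the frozen target net

    for i, (s, a, r, s_next, done) in enumerate(batch):
        target = r if done else r + LAMBDA * np.max(next_q[i])
        q_values[i][a] = target                           # only the taken action's value changes

    q_net.train_on_batch(states, q_values)                # MSE between prediction and target

def sync_target(q_net, q_target):
    """Copy Q's weights into Q_T; called every 10 game steps."""
    q_target.set_weights(q_net.get_weights())
```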

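Similarly, the policy selection of Section 3.3.5 can be sketched as below; η and the ε schedule follow the text, while the state encoding and action count remain placeholders.

```python
import numpy as np

# Illustrative sketch of policy selection (Section 3.3.5). q_net and pi_net
# are the networks sketched above; `state` is the encoded observation vector.

ETA = 0.1        # probability of using the epsilon-greedy policy for a game
epsilon = 0.1    # decays by x0.995 per training step, floored at 0.01

def decay_epsilon():
    global epsilon
    epsilon = max(0.01, epsilon * 0.995)

def choose_policy(rng=np.random):
    return "eps_greedy" if rng.random() < ETA else "average"

def act(policy, state, q_net, pi_net, num_actions, rng=np.random):
    if policy == "eps_greedy":
        if rng.random() < epsilon:
            return rng.randint(num_actions)                 # explore uniformly
        q = q_net.predict(state[None, :], verbose=0)[0]
        return int(np.argmax(q))                            # greedy w.r.t. Q
    probs = pi_net.predict(state[None, :], verbose=0)[0]
    return int(rng.choice(num_actions, p=probs))            # sample from average policy
```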
4 Results

4.1 Game length

The typical game length was investigated as a function of the number of games played, both for the initial training using random play and for training using self-play. The network was trained using 6000 games for each training stage. The change in game length over time is shown in Figure 1.

Figure 1: Variation of typical game length with additional training. The vertical axis represents the game length, smoothed using an exponential moving average (α = 0.99). The horizontal axis represents the number of games used to train the agent. The agent is first trained using a random-play policy and game length is constant. Then, as it is trained using self-play, game lengths are initially much shorter, but rapidly improve to a roughly constant value.

During initial random-play training, the average game length remained roughly constant. This is expected, as the policy is unchanged. Once training proceeds to self-play, gameplay is initially almost random (with illegal moves allowed), because the average-policy network, which is used the majority of the time, has not yet started training. The average game length is thus very short, as many invalid moves are played, causing the training agent to rapidly lose. However, performance rapidly improves as the average-policy network is trained: it learns from the saved ε-greedy policy actions, which are based on the predictions of the action-value network already trained on random play. As the number of memorized state transitions increases and the average-policy network is trained further, there is a very rapid increase in game length as the network learns not to play invalid moves. After a few thousand games, this reaches a roughly constant state.

4.2 Illegal moves

To further understand the rate at which the agent learns not to perform illegal moves, we investigated how this rate varies with additional training. The network was first trained using 5000 games of random play. After each hundred rounds of self-play training, the agent was evaluated by playing 1000 rounds against itself (without the actions being saved or the agent being trained). The number of illegal moves the agent made was recorded for each 100 games of training, up to 5000 games. The results are shown in Figure 2.

Figure 2: The rate of illegal moves made by the trained agent with varying amounts of self-play training. In all cases, the agent was first trained using 5000 rounds of random play. The horizontal axis is the number of rounds of self-play training performed (in 100s). The vertical axis is the number of illegal moves observed per 1000 games played. The rate of decrease is initially very high and then slows down. However, even after 5000 rounds of self-play training, the agent continues to improve.

Initially the number of illegal moves was very high; it rapidly decreased over the first 1000 games played. After this, the rate of improvement slowed, but the rate of illegal actions continued to fall even up to 5000 games, albeit somewhat noisily. This implies that additional training may be able to reduce this rate further. After 5000 rounds of training, the trained player lost only 5% of games by making an illegal move. Note that, as the game lengths become substantially longer after additional training, the rate of illegal moves is much lower than 5% of moves.

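The evaluation protocol above can be summarized as the following sketch; all names (train_round, play_game, the agent object) are placeholders for the real implementation.

```python
# Illustrative sketch of the evaluation protocol in Section 4.2.

def evaluate_illegal_rate(agent, play_game, n_eval_games=1000):
    """Play evaluation games with learning and memory writes disabled, and
    count how many games end with the agent making an illegal move."""
    illegal_losses = 0
    for _ in range(n_eval_games):
        outcome = play_game(agent, learn=False)      # frozen self-play game
        if outcome == "illegal_move":
            illegal_losses += 1
    return illegal_losses

def training_curve(agent, train_round, play_game, total_rounds=5000, eval_every=100):
    curve = []
    for r in range(1, total_rounds + 1):
        train_round(agent)                           # one round of self-play training
        if r % eval_every == 0:
            curve.append((r, evaluate_illegal_rate(agent, play_game)))
    return curve
```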
4.3 Win rate vs. random player

We tested our trained player against a player that makes uniformly random valid moves. Before the trained player has learned not to make illegal actions, its win rate is very low: it almost instantly loses every time by making an illegal action. As the player trains and learns not to break the rules, the win rate quickly rises and then levels out at around 65% after 3000 rounds of training. This trend is shown in Figure 3.

Figure 3: The win rate of the trained agent against a random agent constrained to legal moves. In all cases, the agent was first trained using 5000 rounds of random play. The horizontal axis is the number of rounds of self-play training performed (in 100s). The vertical axis is the win rate of the trained player observed over 1000 games. Initially, the trained agent lost the vast majority of games due to a high number of illegal actions. With additional training, it learns to play mostly valid moves and the win rate rises to a roughly constant value of around 65%.

There is a large degree of inherent randomness in Love Letter games, which means that no player will be able to win all of the time. Additionally, the number of valid actions in Love Letter is relatively constrained (due to the two cards present in the active player's hand), and good strategies involve substantial random play. Nevertheless, it appears likely that an optimal player would win more than two-thirds of matches against a random player. It seems possible we have reached this limit, as constraining the agent in both training and evaluation to only play valid moves results in a similar win rate of 65%. Because of this randomness, Love Letter in the two-player setting requires winning 7 rounds to win a match. With a 65% win rate per round, our trained player would win roughly 87% of matches against the random player.

It is also possible that the levelling out seen in Figure 3 is not due to the agent reaching a fundamental limit, but due to the average-policy network fitting well to the M_SL memory. With additional training, the action-value network could learn more complicated relationships between the state and the expected total discounted reward, resulting in more complicated strategies.

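The 87% figure can be checked with a short negative-binomial calculation, under the simplifying assumption that rounds are independent and each is won with probability 0.65:

```python
from math import comb

def match_win_probability(p_round=0.65, target=7):
    """Probability of reaching `target` round wins before the opponent does,
    assuming independent rounds won with probability p_round."""
    q = 1.0 - p_round
    # Sum over the number of rounds the opponent wins before we reach `target`.
    return sum(comb(target - 1 + k, k) * (p_round ** target) * (q ** k)
               for k in range(target))

print(round(match_win_probability(), 3))   # ~0.87, matching the figure quoted above
```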
5 Contributions

This marks the first time that self-play with Q-learning has been applied to the incomplete-information game Love Letter. We demonstrate the applicability of Neural Fictitious Self-Play to this game. Further, we investigate the feasibility of learning not only a strategy but the game rules themselves through this process, without a priori knowledge, and propose initial random-play training to encourage the network to learn which moves are legal.

Additionally, we have provided a new initialization algorithm for training reinforcement learning agents. By using records from a random player, our agent converged on a winning strategy much more quickly than if it had learned from scratch. Furthermore, we did not need logs of high-level human play to train our agent, a method that was used to train the original AlphaGo player.

Each member of the team contributed a roughly equal amount to the project. In the initial implementation, Madeleine Dawson wrote the game infrastructure, Robert Liang wrote the action-value network, and Alexander Turner wrote the average-policy network. All authors then contributed to the integration of these modules. When investigating the research questions, each student took on one question: win rate vs. random player, illegal moves, and game length, respectively. Finally, in writing the milestones and the report, each student wrote their own section and all students contributed approximately equally to the other sections.

6 Future Work

Future work in this domain includes further experimentation with initialization. It is possible that a player who learns from human strategy is better initialized than one who learns from random actions. For example, a canonical Love Letter play is to play a Guard against another player who used the King to swap hands on the previous turn.

Another area for further work is to apply our initialization method, combined with Neural Fictitious Self-Play, to other two-player turn-based strategy games. For example, in Texas Hold'em, does this reinforcement learning method teach itself Nash-equilibrium strategies such as bluffing?

7 Conclusions

First, we have demonstrated a new method of initializing neural networks in reinforcement learning. This method uses history generated by a random player to prepare agents for standard Neural Fictitious Self-Play. Secondly, using this method we have developed an AI agent for Love Letter which beats a player that plays random legal moves the majority of the time. The agent was developed using a very simple reward function, with no records of high-level play, and was trained without any a priori knowledge of the game rules. Finally, our implementation provides an environment in which future research into Love Letter and other imperfect-information games can easily be developed.

References

[1] Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, pp. 484-489. doi:10.1038/nature16961.
[2] Heinrich, J. and Silver, D. (2016). Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. arXiv:1603.01121v2.
[3] Raina, R., et al. (2009). Large-scale deep unsupervised learning using graphics processors. In Proc. 26th Annual International Conference on Machine Learning (ICML '09), ACM Press, New York, pp. 1-8. doi:10.1145/1553374.1553486.
[4] OpenAI. (2017). openai/atari-py GitHub repository. https://github.com/openai/atari-py (accessed December 11, 2017).
[5] Alderac Entertainment Group. (2012). Love Letter rules. https://www.alderac.com/tempest/files/2012/09/Love_Letter_Rules_Final.pdf (accessed December 11, 2017).
[6] Python Software Foundation. Python Language Reference, version 3.6. Software available from http://www.python.org.
[7] Van der Walt, S., et al. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13, pp. 22-30. doi:10.1109/MCSE.2011.37.
[8] Chollet, F. (2015). keras GitHub repository. https://github.com/fchollet/keras (accessed December 11, 2017).
[9] Abadi, M., et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
[10] Watkins, C. and Dayan, P. (1992). Q-Learning. Machine Learning, pp. 279-292.
[11] Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, pp. 529-533. doi:10.1038/nature14236.