Searching Optimal Movements in Multi-Player Games with Imperfect Information

Kenshi Yoshimura, Teruhisa Hochin, Hiroki Nomiya
Department of Information Science, Kyoto Institute of Technology, Kyoto, Japan

Abstract: This paper proposes a method for searching for optimal movements in multi-player games with imperfect information, with the aim of implementing a Mahjong player that exceeds top human players. The proposed method uses tabu search and does not use game records. Tabu search is applied to two states: discarding a tile, and using a tile that another player has discarded. Experiments were carried out to evaluate the proposed method. In the experiment evaluating the rate of concordance of the usage of tiles, the maximum rate of concordance reached 83%. This means that effective winning hands could be found from the initial states. In the experiment of playing games against benchmark players, the proposed method performed better than the benchmark players. These results indicate that tabu search may be able to search for optimal solutions effectively.

Keywords: meta-heuristics; multi-player games with imperfect information; Mahjong; tabu search

I. INTRODUCTION

Recently, players whose abilities equal those of top human players have been implemented for two-player, zero-sum, logical perfect information games such as Shogi and Reversi. In contrast, few such players have been implemented for multi-player games with imperfect information such as Poker [1] and Mahjong [2]-[8]. The reasons include that the opponents' hands are not public and that the number of possible states is too large to search exhaustively.

This paper treats Mahjong. Mahjong is a four-player, zero-sum game with imperfect information, and therefore a multi-player game, that is, a game with more than two players. In such games, unlike in two-player games, the movement that yields the maximum profit is not always the best one.
Because of these properties, it is hard to decide the best movement in Mahjong. Although some research on Mahjong has been conducted, a player whose ability equals that of top human players has not yet been implemented.

This paper proposes a method for searching for optimal movements in multi-player games with imperfect information, with the aim of implementing a Mahjong player that exceeds top human players. The proposed method uses tabu search [9] to decide the best movement. It does not use game records, because the method is expected to be applicable to other games with imperfect information.

The remainder of the paper is organized as follows. Section 2 describes Mahjong. Section 3 surveys related work. Section 4 describes tabu search. Section 5 proposes a search method for optimal movements in Mahjong. Section 6 experimentally evaluates the proposed method. Section 7 discusses the evaluation results. Finally, Section 8 concludes the paper.

II. MAHJONG

Mahjong [10] is a four-player, zero-sum game with imperfect information. It uses 136 tiles. There are three types of suited tiles, each numbered from one to nine, and two types of honor tiles: winds (east, south, west, and north) and dragons (red, green, and white). Each tile has four identical copies.

Every player can take the following actions:
draw: taking one tile in the player's turn.
discard: throwing away one tile in the player's turn.
naki: using a tile that another player has discarded.

Using these actions, a player scores by arriving at a winning hand. The winning actions are as follows:
ron: winning by picking up a discarded tile.
tsumo: winning by drawing the tile.

A winning hand consists of four melds (specific patterns of three tiles) and the eyes (a pair of identical tiles). The melds are as follows:
pung: a set of three identical tiles.
kong: a set of four identical tiles.
chow: a set of three suited tiles in sequence.

The Mahjong terminology used in this paper is as follows:
tenpai: the state in which a hand is one tile short of a winning hand.
shanten: the number of tiles needed to reach tenpai.
fold: giving up on winning and avoiding discards that are effective tiles for the opponents.
naki: using a tile that another player has discarded in order to form a meld.
effective tile: a tile that decreases shanten.
drawn game: a round that finishes with no player having won. Players in tenpai receive points from players not in tenpai.
flush: a winning hand consisting of one type of suited tiles, or of one type of suited tiles and honor tiles.
richi: a declaration promising that every tile the player draws will be discarded immediately unless it completes a winning hand. It can be declared when the player has not done naki (except for a kong made only of the player's own tiles) and has enough points for the declaration.
value tiles: melds consisting of honor tiles.
all simples with naki: winning with all simples after naki. All simples is a special pattern consisting only of tiles numbered from 2 to 8.
dora: a tile that increases the score. Dora is not part of the special patterns. Red fives are treated the same as dora.
dealer: the player who discards first.
pre-meld: a set of tiles that is one tile short of a meld.
isolation tile: a tile that belongs to neither a meld, the eyes, nor a pre-meld.
a complete game: a game in which every player acts as the dealer twice. The game consists of the east round and the south round.
burst: finishing the game when a player's score falls below 0.
double ron: ron by two players at the same time.
repeated event: the dealer remaining the dealer when the dealer wins, or when the dealer's hand is tenpai in a drawn game.

Copyright 2016 IEEE. ICIS 2016, June 26-29, 2016, Okayama, Japan

III. RELATED WORK

A. Methods Using Game Records

Mizukami et al. [2] extended a one-player Mahjong player with naki and folding, obtaining a player better than the average human player. Moreover, a prediction model was created from an opponent model consisting of tenpai, finishing tiles, and winning score. By combining this model, the one-player Mahjong player [2], and Monte Carlo simulation, a player with performance similar to that of intermediate players was implemented [3]. The one-player Mahjong player still needs improvement, because it does naki without having settled on a special pattern and discards tiles inefficiently.

Miki et al. [4] proposed a method of learning to rank moves in Mahjong using a Support Vector Machine (SVM) with a kernel method. In an experiment using expert records, the rate of concordance of the usage of tiles was 51%. This means that a measurable degree of classification is possible based on a simple tree structure. A remaining problem is that important tiles can be missed because hands are represented by simple structures such as the eyes and chows.
Miki et al. [5] proposed a method using Singular Value Decomposition-based Alternating Structure Optimization (SVD-ASO), a semi-supervised learning method. This method expands the feature space using auxiliary classification problems, for example, subproblems of the primal problem. The evaluation function is learned from game records. When the feature space is expanded using predictions, the rate of concordance is similar to that of [4]. The result suggests that extracting features as predictions of auxiliary classification problems may be effective.

Wagatsuma et al. [6] estimated the risk of discarding each tile by using Support Vector Regression (SVR). The rate of concordance for the tile of largest risk is 13.4% on average, and that for the tile of smallest risk is 43.4% on average. This estimation is more effective than discarding tiles randomly.

B. Methods Not Using Game Records

Miki et al. [5] proposed a player using Monte Carlo tree search. Without using records of top players, this player performs better than a player that only discards randomly from the set of tiles that can decrease shanten. However, it tends to head for tenpai rather than for winning, because it cannot win in simulations of naki states.

Takahashi et al. [7] proposed a method that uses a genetic algorithm to search for a winning hand similar to the initial hand. In this method, a deeper and wider search generates a better solution. However, the execution time is five minutes at a depth of ten, so the method is difficult to use in practice. Takahashi et al. [7] also proposed a method using the Monte Carlo method. Using both shanten and the number of effective tiles as the evaluation value gives a higher probability of discarding the right tile than using shanten alone. A player whose ability equals that of top human players, however, could not be implemented.

Komatsu et al. [8] proposed the normal Monte Carlo method and a Monte Carlo method that presumes k tiles randomly, with which many play-outs are simulated. The normal method is inefficient for Mahjong because it does not arrive at a winning hand in most cases. The latter method arrives at a winning hand in most cases, so the reward can be obtained effectively.

IV. TABU SEARCH

A. Outline

Tabu search [9] is a meta-heuristic for solving combinatorial optimization problems, devised by Glover. The traveling salesman problem, graph coloring, and scheduling problems are examples of combinatorial optimization problems. The algorithm examines several neighbors of the current state and moves to the best one. At the same time, the move is recorded in a set called the tabu list. By prohibiting the moves in the tabu list, the search can follow a path without looping. Tabu search can keep a solution from converging to a local optimum because, no matter how bad the available moves are, every move not in the tabu list can still be used.

B. Notation

A combinatorial optimization problem is represented as Eq. 1.

$$\text{Minimize } c(x) : x \in X. \tag{1}$$

The objective function $c(x)$ may be linear or nonlinear, and $x$ is a discrete value and a member of $X$. To solve this problem, a move $s$ that leads from $x$ to another solution must be defined. A move $s$ is a mapping defined on a subset $X(s)$ of $X$, as shown in Eq. 2.

$$s : X(s) \to X. \tag{2}$$

For $s \in S$ and a solution $x$, the set $S(x)$ of moves applicable to $x$ is defined as Eq. 3.

$$S(x) = \{ s \in S : x \in X(s) \}. \tag{3}$$

The set $S(x)$ is referred to as a neighborhood function.
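The notation above can be made concrete with a small toy problem. The following is a minimal sketch, not the authors' implementation: solutions x are bit tuples, each move flips one bit (so every move is its own inverse), and the objective, the weights, the target, and all function names are assumptions chosen only for illustration. The tabu list is kept as a bounded queue, so the oldest prohibited move is released automatically.

```python
from collections import deque

WEIGHTS = (3, 5, 7, 2)   # hypothetical item weights for the toy objective
TARGET = 10              # hypothetical target sum


def cost(x):
    """Objective c(x): distance between the selected weight sum and TARGET."""
    return abs(sum(w for w, bit in zip(WEIGHTS, x) if bit) - TARGET)


def apply_move(x, j):
    """Move s_j: flip bit j of x. Every flip is its own inverse."""
    return x[:j] + (1 - x[j],) + x[j + 1:]


def neighborhood(x):
    """S(x): the moves applicable to x (here, every bit position)."""
    return range(len(x))


def tabu_search(x0, tenure=7, max_iters=50):
    x = best = x0
    tabu = deque(maxlen=tenure)          # T: inverses of the last `tenure` moves
    for _ in range(max_iters):
        allowed = [j for j in neighborhood(x) if j not in tabu]   # S(x) - T
        if not allowed:
            break
        # Take the best admissible move, even if it worsens the cost.
        j = min(allowed, key=lambda m: cost(apply_move(x, m)))
        x = apply_move(x, j)
        if cost(x) < cost(best):
            best = x
        tabu.append(j)                   # the oldest entry drops out automatically
    return best


best = tabu_search((0, 0, 0, 0))
print(best, cost(best))
```

Because the best admissible move is always taken, the search keeps moving after it reaches a solution of cost 0, while `best` retains the best solution seen so far.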

C. Algorithm

The algorithm of tabu search consists of the following four steps.

1) Select an initial $x \in X$ and let $x^* := x$. Set the iteration counter $k = 0$ and begin with $T$ empty.
2) If $S(x) - T$ is empty, go to Step 4. Otherwise, set $k := k + 1$ and select $s_k \in S(x) - T$ such that $s_k(x) = \mathrm{OPTIMUM}(s(x) : s \in S(x) - T)$.
3) Let $x := s_k(x)$. If $c(x) < c(x^*)$, where $x^*$ denotes the best solution currently found, let $x^* := x$.
4) If a chosen number of iterations has elapsed either in total or since $x^*$ was last improved, or if $S(x) - T = \emptyset$ upon reaching this step directly from Step 2, stop. Otherwise, update $T$ and return to Step 2.

Here $s(x)$ is the solution obtained by applying a move $s$ to $x$. OPTIMUM and the tabu list $T$ remain to be defined. Using $s_k(x)$, OPTIMUM is defined as Eq. 4.

$$c(s_k(x)) = \mathrm{Minimum}(c(s(x)) : s \in S(x) - T). \tag{4}$$

The update of $T$ is defined as Eq. 5,

$$T := T - s^{-1}_{k-t} + s^{-1}_{k} \tag{5}$$

where $s^{-1}$ is the inverse of the move $s$; i.e., $s^{-1}(s(x)) = x$. The variable $t$ is the number of moves in the tabu list. When $k \le t$, the reference to $s^{-1}_{k-t}$ is disregarded. From Eq. 5, $T$ is represented as Eq. 6.

$$T = \{ s^{-1} : s = s_h \text{ for } h > k - t \} \tag{6}$$

The effective number of moves in a tabu list has been obtained experimentally. It ranges from 5 to 12, with 7 working especially well.

V. PROPOSED METHOD

A. Outline

In this paper, tabu search is applied to the states of discarding tiles and of naki in Mahjong. Tabu search is assumed to be effective in Mahjong because many states can be searched in simulation and because it can prevent a solution from converging to a local optimum. Moreover, a tile that has been discarded is not needed when searching for a winning hand, since that tile is not included in the winning hand. Therefore, the search for a winning hand does not need to restart from the previous state. For these reasons, a winning hand is expected to be searched for effectively by using tabu search. In the proposed method, the value $x$ in Eq. 1 is a hand of the player, and tabu search finds the best combination of tiles among all combinations using four evaluation criteria.

B. Evaluation Function

The evaluation function consists of four evaluation criteria: shanten, the number of effective tiles, the degree of similarity, and the score. They are applied in order of priority. After all evaluation criteria have been applied, the evaluation value of the current solution is compared with that of the best solution. The best solution is updated if the current solution is better on some criterion and equal on all criteria of higher priority.

In this paper, the degree of similarity is defined as how similar the winning hand is to the initial hand. It is obtained by comparing the winning hand with the initial hand and multiplying together the probabilities that the changed tiles are drawn. The probability $P(i, x)$ that player $i$ draws a tile $x$ is represented as Eq. 7, where $N(i, x)$ is the number of tiles $x$ that player $i$ has and $T(x)$ is the number of tiles $x$ visible to all players.

$$P(i, x) = \frac{4 - N(i, x) - T(x)}{136 - \sum_{x'} N(i, x') - \sum_{x'} T(x')} \tag{7}$$

If no tile is changed, the degree of similarity takes its maximum value, 1.

C. The Method Applying Tabu Search to Mahjong

To apply tabu search to Mahjong, the number of iterations, the neighborhood values, the number of moves in the tabu list, and the way of deciding the discard must be determined.

If the player can still draw more than six tiles, the number of iterations is six. Otherwise, it equals the number of tiles the player can draw. The value six is used because, when a winning hand is searched for from the initial hand, many changes are not needed except for arriving at a peculiar hand (e.g., a flush). Moreover, the number of executions can be increased by reducing the number of iterations.

A neighborhood value should be similar to the previous value.
For example, in the traveling salesman problem, a neighborhood value is a city next to the current city. In Mahjong, neighborhood values can be obtained by exchanging one tile for another. After the generated neighborhood values have been evaluated, the solution moves to the best one. The current hand and the best neighborhood value are recorded in the tabu list, and moves to these hands are prohibited. As the current hand approaches a winning hand, the number of effective tiles decreases. If only a few neighborhood values are generated, the probability of finding a winning hand by search becomes low. So, the number of neighborhood values generated must be large in order to arrive at a winning hand.

The number of moves in the tabu list is set to six because all moves are recorded in every turn, and because the effective size of a tabu list is said to range from 5 to 12, with 7 being the best [9].

To decide the discard, two evaluation values are defined for every tile of the initial hand: the number of initial choices and the average score. The number of initial choices of a tile is how often that tile is discarded in the first move. The average score of a tile is the sum of the scores of all winning hands found divided by the number of initial choices of that tile in the first move. Using these values, the optimal discard is chosen from the tiles that are present in the initial hand but not in the optimal hand.

The algorithm is as follows.

1) Let the initial hand be the best hand. Set the iteration counter k = 0 and set the tabu list T to empty.
2) Set k = k + 1, select a tile in the current hand, draw a possible tile from the yama (the wall), exchange those tiles, and evaluate the new hand.
3) Set the new hand as the current hand, and update the best hand if the current hand is better than the best hand.
4) If the chosen number of iterations has elapsed in total, or the execution time has run out, stop. Otherwise, update T and return to Step 2. The current hand and the previous hand are recorded in T.

D. Configuration of the State of Discarding Tiles

The state of discarding a tile is the state in which the hand consists of fourteen tiles. The search time is set to four seconds. The number of neighborhood values generated is set to ten in order to improve the probability of arriving at a winning hand while keeping the number of executions as large as possible.

The evaluation criteria are, in descending order of priority, shanten, the number of effective tiles, the degree of similarity, and the score, because arriving at a winning hand has the highest priority. The discard criteria are, in descending order of priority, the number of initial choices and the average score, for the same reason. If two optimal solutions are found, one of the tiles is discarded at random.

E. Configuration of the State of Naki

The state of naki is the state in which the hand consists of thirteen tiles and another player discards a tile. The search begins with this hand as the initial hand.
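The priority ordering of the evaluation criteria (Section V-B) and the discard rule based on the number of initial choices and the average score (Sections V-C and V-D), which apply to both the discard state and the naki state, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the tile labels, and the representation of a search run as a (first discarded tile, score of the winning hand found) pair are all hypothetical.

```python
import random
from collections import defaultdict


def evaluation_key(shanten, effective_tiles, similarity, score):
    """Sort key for a hand; a smaller key means a better hand.

    Priority: lower shanten, then more effective tiles, then higher
    degree of similarity, then higher score.
    """
    return (shanten, -effective_tiles, -similarity, -score)


def choose_discard(runs, rng=random):
    """Pick the tile to discard from the results of repeated searches.

    `runs` is a list of (first_discarded_tile, score) pairs, one per search
    (a hypothetical representation). Tiles are ranked by the number of
    initial choices, then by the average score; ties are broken at random.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for tile, score in runs:
        counts[tile] += 1
        totals[tile] += score
    ranking = {t: (counts[t], totals[t] / counts[t]) for t in counts}
    best_value = max(ranking.values())
    candidates = [t for t, v in ranking.items() if v == best_value]
    return rng.choice(candidates)


# Hypothetical search results: tile labels and scores are made up.
runs = [("1m", 1000), ("1m", 2000), ("9p", 8000), ("E", 1000), ("E", 1000)]
discard = choose_discard(runs)
print(discard)
```

In this example "1m" and "E" are both chosen twice, and "1m" is preferred because its average score is higher; "9p" has the highest average score but is chosen only once, so it ranks below both.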
The search time is set to at most two seconds, because there are four types of meld composition and each search is given 0.5 seconds. The number of neighborhood values generated is five, because the search time is shorter than in the state of discarding tiles and because fewer states need to be searched, since melds obtained by naki cannot be discarded.

The evaluation criteria are, in descending order of priority, shanten, the number of effective tiles, the degree of similarity, and the score, because arriving at a winning hand has the highest priority. Naki is conducted if the score of the optimal solution is not zero; this check is needed because several special patterns are not possible after naki, so that a winning hand may have no score. The discard criteria are, in descending order of priority, the number of initial choices and the average score, because arriving at a winning hand has the highest priority. If naki is conducted, the tile to discard is decided in the same way as in the state of discarding a tile.

VI. EXPERIMENT

A. Configuration

The basic rules are those of the Internet Mahjong server Tenhou [11]. The following changes are made:
All simples with naki is possible.
No red fives.
No repeated event.
Only the east round is played.
No burst.
No double ron.

Neither the proposed method nor the benchmark players in this paper take score differences into account in their search, so repeated events, a complete game, and so on are assumed to be unnecessary. The experimental environment is shown in TABLE I.

TABLE I. EXPERIMENTAL ENVIRONMENT
Item             Value
CPU              Dual-Core 3.20GHz
Memory           2GB
Language used    C++

B. Experimental Methods

1) Rate of concordance of discarding tiles: To evaluate the proposed method, the rate at which its discards agree with the discards in game records is examined. Records of top players in Tenhou are used. These players are in the top 0.8% of players. A human player adapts the best movement to each situation, whereas the proposed method does not.
Therefore, the targets of this experiment are limited to the states up to the discarding of six tiles in the first round. Sixty states satisfying these conditions are randomly sampled. In this experiment, the input is the chosen state, and the output is the ranking of the tiles to discard. Because tabu search selects tiles randomly, the ranking is not always the same. Therefore, the operation is conducted fifty times. If the right discard is ranked first in the majority of the runs, the state counts as concordant at the first place. If the right discard is ranked first or second in the majority of the runs, the state counts as concordant at the second place, and so on.

2) Playing a game with benchmark players: The performance of the proposed method is evaluated by playing games against benchmark players. The players consist of two players of the proposed method and two benchmark players. Players of the same kind sit facing each other. For each set of tiles, a player of the proposed method and a benchmark player each act as the dealer once. The evaluation criteria are the average rank and the rate of arriving at a winning hand.

a) Player of the proposed method: Two players of the proposed method are used. One only discards tiles based on the proposed method and is denoted as M_A. The other both discards tiles and does naki based on the proposed method and is denoted as M_B. These players may declare richi if possible.

b) Benchmark player: Two benchmark players are made. One only discards a tile chosen randomly from the set of tiles that can decrease shanten, and is denoted as B_A. The other discards tiles based on the proposed method and does naki to decrease shanten after doing naki of value tiles, and is denoted as B_B. These players may declare richi if possible.

C. Experimental Results

1) Rate of concordance of discarding tiles: The result is shown in Fig. 1. The maximum rank is the sixth place.

Fig. 1. Rate of concordance of discarding tiles

2) Playing a game with benchmark players:

a) Game of M_A and B_A: The result is shown in TABLE II. Welch's t-test applied to the average rank shows a significant difference at the 5% significance level. From TABLE II, it can be seen that M_A is stronger than B_A.

TABLE II. M_A AND B_A (columns: player, average rank, rate of arriving at a winning hand; rows: M_A, B_A, number of matches)

b) Game of M_A and M_B: The result is shown in TABLE III. Welch's t-test applied to the average rank shows a significant difference at the 5% significance level. From TABLE III, it can be seen that M_B is stronger than M_A.

TABLE III. M_A AND M_B (columns: player, average rank, rate of arriving at a winning hand; rows: M_A, M_B, number of matches)

c) Game of M_B and B_B: The result is shown in TABLE IV. Welch's t-test applied to the average rank shows a significant difference at the 5% significance level. From TABLE IV, it can be seen that M_B is stronger than B_B.

TABLE IV. M_B AND B_B (columns: player, average rank, rate of arriving at a winning hand; rows: M_B, B_B, number of matches)

VII. DISCUSSION

A. Rate of Concordance of Discarding Tiles

Fig. 1 shows that the maximum rate of concordance is 83%. This means that many of the winning hands found are effective from the beginning, because the winning hand found by the player tends not to include the right tile, that is, the tile discarded in the record. On the other hand, the rate of concordance at the first place is 43%.
That is, in many cases the right tile cannot be chosen even though it is included in the candidates. The reasons are as follows.

One reason is that the priority of an isolation tile is low. In the sampled records, top players sometimes discard a tile of a bad pre-meld while keeping an isolation tile. The proposed method moves from the initial hand to the best hand generated by changing one tile randomly. Such a transition, discarding a tile of a pre-meld while keeping an isolation tile, is unlikely, because making a pre-meld by changing one tile of another pre-meld does not bring the hand closer to a winning hand. In contrast, in a state where the sum of melds and pre-melds is less than five, such a transition is likely, because shanten can be decreased by changing an isolation tile to make a pre-meld.

Another reason is that several isolation tiles are present. For the reason described above, isolation tiles tend to be the best candidates in the proposed method. The best candidate therefore varies from search to search, so the right tile often fails to be ranked first in the majority of the runs.

Yet another reason is that top players discard tiles while aiming at a specific target. In some states, a player who has a special pattern in mind discards tiles that are unnecessary for that pattern, even if this moves the hand away from a winning hand. For example, when aiming at a flush, the tiles of the other two suits are unnecessary and are discarded first. In the proposed method, the degree of similarity has higher priority than the score, so such discards are presumably not reproduced.

As an improvement, giving weights to pre-melds, isolation tiles, and so on can be considered. For example, a tile numbered five is more likely to connect with other suited tiles than a tile numbered one. So, bad selections might be avoided by making the tile numbered one easier to select for discarding.

B. Playing a Game with Benchmark Players

1) Game of M_A and B_A: TABLE II shows that M_A is stronger than B_A.
B_A is a player that only discards a tile chosen randomly from the set of tiles that can decrease shanten. M_A is considered to discard tiles more effectively than B_A. M_A tends to discard tiles so that shanten decreases and the remaining tiles become more effective. Consequently, its average rank and rate of arriving at a winning hand are better than those of B_A.

2) Game of M_A and M_B: TABLE III shows that M_B is stronger than M_A. M_B is M_A with naki. A winning hand sometimes has no score when naki is done, so in some states it is better not to do naki. Nevertheless, naki in the proposed method is effective, because the average rank and the rate of arriving at a winning hand of M_B are better than those of the player that only discards tiles with the proposed method.

3) Game of M_B and B_B: TABLE IV shows that M_B is stronger than B_B. M_B does naki without having settled on a special pattern and still arrives at a winning hand. In the proposed method, naki is conducted if the score of the optimal solution is not zero. It is therefore considered that a winning hand can be reached reasonably well even though M_B does naki without having settled on a special pattern.

VIII. CONCLUSION

Recently, players whose abilities equal those of top human players have been implemented for two-player, zero-sum, logical perfect information games. In contrast, few such players have been implemented for multi-player games with imperfect information, including Mahjong.

This paper proposed a method for searching for optimal movements in multi-player games with imperfect information, with the aim of implementing Mahjong players that exceed top human players. The proposed method uses tabu search without using game records. Tabu search is applied to the states of discarding tiles and of naki.

Experiments were carried out to evaluate the proposed method. In the experiment evaluating the rate of concordance of the usage of tiles, the maximum rate of concordance reached 83%. This means that effective winning hands could be found from the initial states. In the experiment of playing games against benchmark players, the proposed method performed better than the benchmark players. These results indicate that tabu search may be able to search for optimal solutions effectively in Mahjong. The proposed method is also considered to be effective in similar games such as Poker, where players discard some cards and draw some cards in order to win.

The maximum rate of concordance is 83%, while the rate of concordance at the first place is 43%. Adding a method that takes weights into account may improve the concordance at the first place. This is left for future work.
The player created in this paper does not fold. In Mahjong, when a player cannot reach a winning hand, it is necessary to discard tiles so that the other players cannot win. Making the player able to fold is also future work. Comparing the proposed method with other optimization methods, including hill climbing, and conducting experiments using only the proposed players are also future work, as are playing games against human players and evaluating the resulting performance.

REFERENCES

[1] N. A. Risk and D. Szafron, Using counterfactual regret minimization to create competitive multiplayer poker agents, Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, Volume 1, International Foundation for Autonomous Agents and Multiagent Systems.
[2] N. Mizukami, R. Nakahari, A. Ura, M. Miwa, Y. Tsuruoka, T. Chikayama, Adapting One-Player Mahjong Players to Four-Player Mahjong by Recognizing Folding Situations, Proceedings of the 18th Game Programming Workshop, pp. 1-7 (in Japanese).
[3] N. Mizukami, Y. Tsuruoka, Building computer mahjong players by modeling opponent players using game records and a Monte Carlo method, Proceedings of the 19th Game Programming Workshop (in Japanese).
[4] A. Miki, M. Miwa, T. Chikayama, Learning to rank moves in mahjong using SVM with tree kernels, Proceedings of the 13th Game Programming Workshop (in Japanese).
[5] A. Miki, T. Chikayama, Research on decision making in multi-player games with imperfect information, Master's thesis, The University of Tokyo (in Japanese).
[6] A. Wagatsuma, M. Harada, H. Morita, K. Komiya, Y. Kotani, Estimating Risk of Discarded Tiles in Mahjong Using SVR, IPSJ SIG Technical Reports, GI, Vol. 2014, No. 12, pp. 1-3 (in Japanese).
[7] S. Takahashi, J. Rokui, Research on The Strategy in Mahjong, Proceedings of the 11th FIT (in Japanese).
[8] K. Tomoki, N. Kazuyuki, S. Ayumi, Effective algorithm for decision making on hand-composing game, IPSJ SIG Technical Reports, GI, Vol. 2012, No. 8, pp. 1-8 (in Japanese).
[9] F. Glover, Tabu Search - Part I, ORSA Journal on Computing, Vol. 1, No. 3.
[10] European Mahjong Association.
[11] S. Tsunoda, Tenhou.
