Game Playing for a Variant of Mancala Board Game (Pallanguzhi)


Varsha Sankar (SUNet ID: svarsha)

1. INTRODUCTION

Game playing is currently a very active area in the field of Artificial Intelligence. Pallanguzhi, a South Indian variant of the Mancala board game, has not previously been attempted by an AI agent. It is a two-player strategy game, and also a non-zero-sum game, which makes it an interesting target for AI techniques. In this project, I implemented different algorithmic strategies for the AI agent and analyzed their performance against each other.

2. BACKGROUND

Pallanguzhi is an ancient and traditional game of South India, similar to the more popular Mancala board games, which are said to have originated in Africa. The board consists of 14 bins, 7 on each side, with each side belonging to one player. The centre bin on each side is reserved and acts like a bank, whose ownership the players can win during the game. Initially, all the bins except the centre ones are filled with a fixed number of seeds (or stones). The player who has collected the maximum number of seeds at the end of the game is the winner. Images of sample boards can be seen below.

Fig 1. Empty board. Fig 2. Board with bins filled with seeds.

2.1. GAME RULES

When a player's turn comes up:

1. He has to choose one of his bins (excluding the banks) which is non-empty.
2. He removes all the seeds from that bin.
3. He drops these seeds one by one into the consecutive bins in clockwise order, including the banks and the opposite player's bins.
4. When he runs out of seeds, he picks up the seeds from the next bin and continues (Step 3) until his turn ends.
5. The end of a turn is marked by one of the following:
   - The last dropped seed ends in a bin just before one of the banks.

   - The last dropped seed ends in a bin such that the next bin is empty. The player then gets the seeds in the bin next to the empty bin and in the bin opposite to it.
   - A special case arises when the bin next to the empty bin (of the previous case) is a bank. In this case, the player is said to have gained ownership of that bank. He cannot take the seeds in the bank until the end of the game, and both players can own the same bank(s), in which case its seeds are split between them at the end of the game.

2.2. SAMPLE GAME MOVE

To begin with, the game board looks like the figure below (here using 6 seeds per bin initially; it could be any number, though it is usually 6 or 12). Let the lower row correspond to Player 1 and the upper row to Player 2. It is Player 1's turn, and he has to choose one of his non-empty bins (excluding the bank) and play according to the rules above.

             Bin 1  Bin 2  Bin 3  Bank 2  Bin 5  Bin 6  Bin 7
    Player 2   6      6      6      0       6      6      6
    Player 1   6      6      6      0       6      6      6
             Bin 7  Bin 6  Bin 5  Bank 1  Bin 3  Bin 2  Bin 1

Say Player 1 chooses Bin 1. After removing the 6 seeds from Bin 1 and dropping them into the consecutive bins, the board looks like this:

             Bin 1  Bin 2  Bin 3  Bank 2  Bin 5  Bin 6  Bin 7
    Player 2   6      6      6      0       6      6      6
    Player 1   7      7      7      1       7      7      0
             Bin 7  Bin 6  Bin 5  Bank 1  Bin 3  Bin 2  Bin 1

The last seed was dropped into Bin 7 on his side, and this is not one of the end-of-turn states, so he continues by removing the seeds from Bin 1 of Player 2:

             Bin 1  Bin 2  Bin 3  Bank 2  Bin 5  Bin 6  Bin 7
    Player 2   0      7      7      1       7      7      7
    Player 1   7      7      7      1       7      7      0
             Bin 7  Bin 6  Bin 5  Bank 1  Bin 3  Bin 2  Bin 1

The last seed was dropped into Bin 7 of Player 2, and the next bin (Bin 1 of Player 1) is empty, so this marks the end of Player 1's turn. He gets the seeds from his Bin 2 and from the bin opposite it, Player 2's Bin 6, which are added to his score at the end of this first turn. At the end of the first move, the board looks as shown below:

             Bin 1  Bin 2  Bin 3  Bank 2  Bin 5  Bin 6  Bin 7
    Player 2   0      7      7      1       7      0      7
    Player 1   7      7      7      1       7      0      0
             Bin 7  Bin 6  Bin 5  Bank 1  Bin 3  Bin 2  Bin 1

The score of Player 1 at the end of this turn is 14. The valid moves for Player 2 now are Bin 2, 3, 5, or 7.
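To make the turn mechanics concrete, here is a minimal sketch of the sowing loop these rules describe. It is illustrative only, not the report's engine code: the flat 14-element board layout, the play_turn helper, and the omission of bank-ownership bookkeeping are all assumptions.

    # Bins laid out clockwise as a flat list of 14 counts:
    # indices 0-6 are Player 1's side (3 = Bank 1), 7-13 are Player 2's (10 = Bank 2).
    BANKS = (3, 10)

    def play_turn(bins, start):
        """Sow from `start` until an end-of-turn condition; return the seeds captured.

        `bins` is modified in place; bank-ownership bookkeeping is omitted.
        """
        pos, in_hand = start, bins[start]
        bins[start] = 0
        while True:
            while in_hand:                  # drop the seeds one by one, clockwise
                pos = (pos + 1) % 14
                bins[pos] += 1
                in_hand -= 1
            nxt = (pos + 1) % 14
            if nxt in BANKS:                # last seed fell just before a bank
                return 0
            if bins[nxt] == 0:              # next bin is empty: capture and stop
                after = (nxt + 1) % 14      # the bin next to the empty bin...
                opposite = 13 - after       # ...and the bin opposite it
                captured = bins[after] + bins[opposite]
                bins[after] = bins[opposite] = 0
                return captured
            pos, in_hand = nxt, bins[nxt]   # otherwise pick up and keep sowing
            bins[nxt] = 0

On the sample position above, play_turn([6, 6, 6, 0, 6, 6, 6, 6, 6, 6, 0, 6, 6, 6], 0) returns 14, matching Player 1's first-turn score.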

3. TASK DEFINITION

For this project, I first developed the game engine for Pallanguzhi and then defined the baseline approach and the oracle. The baseline is a greedy agent which tries to maximize its reward on the current turn without considering the future. The oracle is an expert human player who is assumed to know the opponent's actions. The task is to implement different agents (Minimax, Expectimax, and Minimax with alpha-beta pruning) and analyze their performance, in terms of win rate and win margin, when pitted against each other and against random agents, using different evaluation functions that correspond to different heuristics (strategies), and starting as either the first or the second player.

4. RELATED WORK

There has been no attempt to solve this particular variant, Pallanguzhi. However, there have been many AI agents for general Mancala. Though the boards look the same, the two games differ: a single turn in Pallanguzhi goes on and on, wrapping around the board, until one of the end-of-turn states is reached. Also, the concept of the bank in Pallanguzhi adds a reward at the end of the game, which can completely change the outcome compared with considering only immediate rewards. [1] deals with solving Mancala using heuristics with minimax and alpha-beta pruning. Though the game is similar, most of those heuristics are not meaningful in our scenario, so I have tried other, more appropriate heuristics.

5. INFRASTRUCTURE

Building the game engine: since Pallanguzhi is an old and dying game, no game engine existed, so I developed one from scratch by incorporating the game rules into it. After building the game engine, I tested whether it worked correctly by playing it in Human vs. Human mode and then in Human vs. Computer mode, where the computer agent was my baseline greedy agent. After verifying the correctness of the game engine, I went on to implement the other agents.
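Such testing can be driven by a loop of the following kind, which pits any two agents against each other. This is a hypothetical sketch, not the report's code: it assumes the state helpers (start_state, actions, successor, is_end) sketched in Section 6 below and an agent interface choose(state) that returns an action.

    def play_game(agent_a, agent_b):
        """Alternate full turns until the side to move has no legal action."""
        state = start_state()
        agents = {-1: agent_a, 1: agent_b}   # -1 = Player 1, 1 = Player 2
        while not is_end(state):
            action = agents[state.player_id].choose(state)
            state = successor(state, action)
        return state.score                   # bank seeds are added or split at game end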

6. APPROACH

The game was first modeled as a state-space search. Then different agents were implemented. The agents were made to play against each other, and their performance was analyzed based on win rates, average number of moves to game over, average winning score, and win margin.

6.1. GAME MODEL

Pallanguzhi, as can be seen, is an adversarial game. Thus it was modeled as a state-space search, which consists of a start state, a set of legal actions that can be taken in a given state, the successor state given an action, and the possible end states. Each transition from a state to a successor state is associated with a score update. The state model is as follows.

State s = (playerID, boardConfig, score, bank)
- playerID: -1 indicates Player 1, 1 indicates Player 2.
- boardConfig: list of 14 values, the number of seeds in each bin.
- score: dictionary holding the cumulative scores.
- bank: dictionary holding the banks owned by each player.

s_start = (-1, [N N N 0 N N N N N N 0 N N N], {-1: 0, 1: 0}, {-1: [], 1: []})
Actions(s) = indices of the non-empty bins that the player can choose.
Succ(s, a) = the modified state after the player in state s chooses to start with the bin marked by a.
IsEnd(s) = true if s is an end state: playerID = -1 and boardConfig[0:3] and boardConfig[4:7] are all 0, or playerID = 1 and boardConfig[7:10] and boardConfig[11:14] are all 0.
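A direct Python rendering of this model might look as follows. The names mirror the report's notation; the concrete representation (a dataclass, list-valued banks) and the reuse of play_turn from the Section 2.1 sketch are assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        player_id: int                   # -1 = Player 1, 1 = Player 2
        board: list                      # 14 seed counts; banks at indices 3 and 10
        score: dict = field(default_factory=lambda: {-1: 0, 1: 0})
        banks: dict = field(default_factory=lambda: {-1: [], 1: []})

    def start_state(seeds_per_bin=6):
        board = [seeds_per_bin] * 14
        board[3] = board[10] = 0         # the centre bins (banks) start empty
        return State(player_id=-1, board=board)

    def actions(state):
        """Indices of the mover's non-empty bins, excluding the banks."""
        lo, hi = (0, 7) if state.player_id == -1 else (7, 14)
        return [i for i in range(lo, hi) if i not in (3, 10) and state.board[i] > 0]

    def successor(state, action):
        """The state after the mover plays one full turn starting from `action`."""
        board = list(state.board)
        gained = play_turn(board, action)        # sowing sketch from Section 2.1
        score = dict(state.score)
        score[state.player_id] += gained
        banks = {p: list(b) for p, b in state.banks.items()}
        return State(-state.player_id, board, score, banks)

    def is_end(state):
        """End state: the player to move has no seeds outside the banks."""
        return not actions(state)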

6.2. BASELINE AND ORACLE

Baseline, greedy approach: the agent tries to maximize its score (i.e., the number of seeds collected, or ownership of a bank) at the end of one full turn, without looking at the actions the opponent would take in the future. The opponent was a human player. When two different actions lead to the same increase in score, the agent chooses between them randomly. The win rate of the greedy agent was 20% when starting as the second player.

Oracle, experienced human player: the experienced human player is assumed to know the opponent's action at every state and thus obtains a win rate of 100% when starting as either the first or the second player. The gap between the baseline and the oracle is significant enough to justify applying other techniques and analyzing the problem.

6.3. AGENTS

The agents implemented include a minimax agent, an expectimax agent, a faster minimax agent using alpha-beta pruning, and a modified minimax agent. All work as depth-limited searches and were tested with different evaluation functions.

1. Minimax agent: tries to maximize its utility, approximated by the evaluation function, assuming that the opponent tries to minimize the agent's utility.
2. Expectimax agent: tries to maximize its expected utility, approximated by the evaluation function, assuming that the opponent chooses uniformly at random among its valid actions.
3. Alpha-beta agent: alpha-beta pruning was used to reduce the search space of the game tree and hence speed up minimax search.
4. Modified minimax agent: in this game each player in reality tries to maximize its own score (utility), so I implemented a modified version of minimax in which the opponent tries to maximize its own utility instead of minimizing the agent's utility.

6.4. EVALUATION FUNCTIONS AND HEURISTICS

When I tried performing a complete search of the game tree for the different agents, the recursion depth was exceeded, so the search had to be depth-limited. This requires good evaluation functions to approximate future rewards.

1. Eval1: returns the current score of the player, as a measure of how good it is to be in that state.
   Eval1(s) = score[player]
2. Eval2: at a given state, this also takes into account information about the banks. Since those points cannot decrease later, the current number of seeds in the banks owned by the player is added to the current score. This measures how close the agent is to winning: if the current score plus the seeds in the bank is almost half the total number of seeds, the agent will win with very high probability.
   Eval2(s) = score[player] + numSeeds(banks[player])
3. Eval3: evaluates a state in terms of the further maximum possible increase in score, considering all possible actions in that state. If an action leads to ownership of a bank, that action is chosen, and the current score is updated to include the increase in score or the seeds in the bank; this value is returned.
   Eval3(s) = score[player] + increase_in_score[player] + numSeeds(banks[player])
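Under the state model above, these three functions might be sketched as follows; the signatures and the best_immediate_gain helper are assumptions, and play_turn and actions come from the earlier sketches.

    def num_banked(state, player):
        """Seeds sitting in the banks this player owns."""
        return sum(state.board[b] for b in state.banks[player])

    def eval1(state, player):
        return state.score[player]                   # current score only

    def eval2(state, player):
        # Banked seeds can never be taken away, so they count toward the score.
        return state.score[player] + num_banked(state, player)

    def best_immediate_gain(state, player):
        # Hypothetical helper: the largest one-turn capture over all legal
        # actions, simulated on a copy of the board.
        if state.player_id != player:
            return 0
        return max((play_turn(list(state.board), a) for a in actions(state)), default=0)

    def eval3(state, player):
        return (state.score[player]
                + best_immediate_gain(state, player)
                + num_banked(state, player))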

While these evaluation functions only say how good a given state is, in actual play the strategy depends on the actions taken from a state, so evaluating the state-action pair may be a better approximation of future rewards. The heuristics below are used to evaluate the state-action pair, after evaluating the next state with one of the above evaluation functions:

1. Heuristic1: how far is the chosen bin (action) from the opponent's side of the board? This adds a penalty based on the relative position of the chosen bin with respect to the opponent's side. It favors distributing seeds within one's own bins and thus retaining seeds that may add to one's score at the end.
2. Heuristic2: how many seeds are in the chosen bin? The fewer seeds in the chosen bin, the fewer seeds are distributed into the opponent's bins, so this limits the number of seeds that could later go to the opponent. The function therefore adds a larger penalty when the chosen bin holds more seeds.

7. EXPERIMENTS AND RESULTS

Since the game involves a lot of counting, human players who play strategically tend to look only for the current action that maximizes the immediate score, so a reflex agent models a human player well. Running 1000 trials of the game between the random and reflex agents yields the following statistics.

TABLE I
First player: Random agent; second player: Reflex agent

                                    Random Agent   Reflex Agent
    Win Rate %                           -              -
    Avg. score while winning             -              -
    Avg. win margin                      -              -
    Avg. no. of rounds to game over: 9

TABLE II
First player: Reflex agent; second player: Random agent

                                    Random Agent   Reflex Agent
    Win Rate %                           -              90
    Avg. score while winning             -              -
    Avg. win margin                      -              -
    Avg. no. of rounds to game over: 8

The next experiment was to make the minimax and expectimax agents play against the random and reflex agents and against themselves, and to look at their win rates when starting as the first player and as the second player. Since the minimax agent becomes slow even at a depth of 3, alpha-beta pruning was used to reduce the search space and speed up minimax.
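For reference, a generic depth-limited minimax with alpha-beta pruning over the state model above (a sketch, not the project's exact code):

    def alphabeta(state, depth, agent, evaluate,
                  alpha=float("-inf"), beta=float("inf")):
        """Depth-limited minimax value of `state` for `agent`, with pruning."""
        if is_end(state) or depth == 0:
            return evaluate(state, agent)
        maximizing = (state.player_id == agent)
        value = float("-inf") if maximizing else float("inf")
        for a in actions(state):
            v = alphabeta(successor(state, a), depth - 1, agent, evaluate, alpha, beta)
            if maximizing:
                value = max(value, v)
                alpha = max(alpha, value)
            else:
                value = min(value, v)   # the opponent minimizes the agent's utility
                beta = min(beta, value)
            if beta <= alpha:
                break                   # the remaining actions cannot matter
        return value

The modified minimax agent would replace the minimizing branch by letting the opponent pick the child that maximizes its own evaluation, and the expectimax agent would average the child values instead of minimizing.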

Minimax, expectimax, and modified minimax agents were made to play with the different evaluation functions and heuristics, and their performance was analyzed.

TABLE III
Second player: Minimax, depth = 3

                   Eval1    Eval2    Eval3
    Win Rate %     59.67      -        -
    Loss Rate %      -        -        -
    Tie Rate %       -        -        -

Table III shows the win, loss, and tie rates for the minimax agent when it plays against a random agent, starting as the second player. The columns correspond to the different evaluation functions, used without any of the other heuristics. Similar statistics for the expectimax agent are shown in Table IV.

TABLE IV
Second player: Expectimax, depth = 3

                   Eval1    Eval2    Eval3
    Win Rate %       -        -        -
    Loss Rate %      -        -        -
    Tie Rate %       -        -        -

Similar statistics for modified minimax are shown in Table V.

TABLE V
Second player: Modified Minimax, depth = 3

                   Eval1    Eval2    Eval3
    Win Rate %      79        -        -
    Loss Rate %      -        -        -
    Tie Rate %       -        -        -

The effectiveness of the other heuristics, which evaluate a given (state, action) pair, was observed by making the modified minimax and expectimax agents use them when playing against a random agent. Tables VI and VII show the win, loss, and tie rates for modified minimax and expectimax, respectively, under the different heuristics.

TABLE VI
Second player: Modified Minimax, depth = 3, using Eval1

                   Heuristic 1    Heuristic 2
    Win Rate %         81              -
    Loss Rate %         -              -
    Tie Rate %          -              -

TABLE VII
Second player: Expectimax, depth = 3, using Eval1

                   Heuristic 1    Heuristic 2
    Win Rate %          -              -
    Loss Rate %         -              -
    Tie Rate %          -              -

All of these statistics were computed by repeatedly running 100 trials. Another observation was that minimax, expectimax, and modified minimax all achieve a 100% win rate when they start as the first player against a random agent. Since modified minimax models the game best, it was made to play against Random, Reflex, Expectimax, and itself; the win rates can be seen in the plot below.

[Plot: win rates of Modified Minimax, playing second, against Random, Reflex, Expectimax, and itself]

Modified minimax starting as the second player achieves a high win rate of 81% against the random agent. When the first player is a reflex agent (which models a normal human player), modified minimax achieves a 100% win rate. When the first player is expectimax, it still performs well, obtaining a 25.67% win rate compared with 0% for the other agents. But when it plays against itself, the first player always wins.

ALTERNATIVE APPROACH: Instead of heuristics crafted from domain knowledge, reinforcement learning could be used. Machine learning can estimate, from many trials of the game, the value associated with each (state, action) pair, i.e., the Q-value. Then, each time the AI agent plays, it evaluates all possible actions at each state by having the model predict the value of the (state, action) pair at hand, and decides accordingly. I implemented this approach with the AI player as the second player, a randomized policy for the opponent, and an epsilon-greedy policy for the AI player.

After running several trials of the game, a large number of (state, action) pairs was generated, and the Q-value was approximated by averaging the expected utility over the different episodes. A neural network was then trained on this data. The inputs to the neural network were features extracted from a given (state, action) pair: the number of seeds in each bin, an indicator of whether each of the centre bins is in the player's bank, and the action, expressed as how far the chosen bin is from the opponent. The output of the neural network is the predicted Q-value. However, this yielded weaker results: with the learned values, modified minimax (which captures the essence of the game) lost 85% of its games to the random agent. A good direction for future work would be to select a model that fits the data well and hence acts as a good evaluation function.
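A hedged sketch of this pipeline: Monte Carlo averaging of returns per (state, action) feature vector, a small scikit-learn regressor standing in for the neural network, and an epsilon-greedy policy over the predicted Q-values. The feature choices follow the report's description, while the helper names and the use of MLPRegressor are illustrative assumptions.

    import random
    from collections import defaultdict
    from sklearn.neural_network import MLPRegressor   # stand-in for the network

    def features(state, action):
        """Seed counts, bank-ownership indicators, and the chosen bin's
        distance from the opponent's side of the board."""
        owned = [int(b in state.banks[state.player_id]) for b in (3, 10)]
        dist = 6 - action if action < 7 else 13 - action
        return tuple(state.board) + tuple(owned) + (dist,)

    def monte_carlo_q(episodes):
        """Average the return observed after each feature vector, where
        `episodes` is a list of [(feature_vector, return), ...] trajectories."""
        total, count = defaultdict(float), defaultdict(int)
        for episode in episodes:
            for phi, g in episode:
                total[phi] += g
                count[phi] += 1
        return {phi: total[phi] / count[phi] for phi in total}

    def train_q_model(q_table):
        X = [list(phi) for phi in q_table]
        y = list(q_table.values())
        return MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

    def epsilon_greedy(state, model, eps=0.1):
        acts = actions(state)
        if random.random() < eps:
            return random.choice(acts)               # explore
        return max(acts, key=lambda a: model.predict([list(features(state, a))])[0])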

8. ANALYSIS

8.1. Is starting the game as the first player advantageous?

From Table II it can be seen that even a simple agent like the reflex agent achieves a win rate of 90% against a random agent. It was also observed that every other AI agent (minimax, expectimax, and modified minimax) achieves an almost 100% win rate when it starts as the first player. So I have limited the analysis to the performance of the AI agents when they start as the second player. If an AI agent is made to play against itself, the one that starts first wins; another direction for future work would be to analyze this behavior and find a remedy for it.

8.2. Performance of the agents

Fixing the evaluation function to Eval1 and the depth to 3, the win rates of the minimax, expectimax, and modified minimax agents against a random agent are 59.67, and 79, respectively. Since minimax takes the action that maximizes its utility under the assumption that the opponent is trying to minimize that utility, it does not capture this game well: in practice, the opponent generally tries to maximize its own utility rather than minimize the agent's. This idea is embodied in the modified minimax agent, which achieves the highest win rate. The expectimax agent, which assumes a random opponent policy, performs almost equally well when playing against a random agent.

8.3. Performance of the evaluation functions

From Tables III and IV it can be seen that the performance of the minimax and expectimax agents changes only by small amounts when different evaluation functions are used. These functions evaluate only a given state (i.e., how good it is to be in that state), without taking the actions into account. For a depth of 3 (above which the search becomes too slow), both minimax and expectimax obtain their highest win rate with Eval3, followed by Eval2 and Eval1. But modified minimax obtains its highest win rate with Eval1, followed by Eval2 and Eval3, as seen in Table V. This may be because Eval2 and Eval3 use information about the current banks and add all the seeds in a bank owned by the agent to its score, whereas the opponent, maximizing its own utility, may later acquire the same bank, and the value returned does not account for this. Since Eval1 also requires the least processing compared with Eval2 and Eval3, modified minimax with Eval1 at depth 3 is a good choice for the agent.

8.4. Performance of the other heuristics

Tables VI and VII show the win rates for modified minimax and expectimax when Heuristic1 and Heuristic2 are used along with the current score (Eval1). The win rates show that Heuristic1 performs better than Heuristic2 for both agents: choosing a bin away from the opponent's side improves performance more than choosing a bin with fewer seeds. This is because if the chosen bin holds a large number of seeds, they can be distributed back into one's own bins, retaining a portion of them. When the two heuristics are combined additively, they perform worse, since the combination prefers a bin that is close to the opponent but holds few seeds. Running the heuristics on the current (state, action) pair together with the other evaluation functions (Eval2 and Eval3) on the next state also lowers the win rates. So the best performance among these options is obtained when Heuristic1 is used with Eval1 in the modified minimax agent.

9. CONCLUSION

By running each of the agents (random, reflex, minimax, expectimax, and modified minimax) against the others with different evaluation functions and heuristics, it can be concluded that modified minimax with a search depth of 3, using Eval1 and Heuristic1, achieves the highest win rate, 81%, when playing as the second player against a random agent. When it starts the game as the first player, it wins 100% of the time regardless of the opponent's policy. Expectimax and minimax, though they reach win rates close to 80%, do not achieve a 100% win rate as the first player when the opponent is modified minimax, and they have a 0% win rate when they start as the second player against one of the other AI agents. The statistics for expectimax and minimax against all agents have not been tabulated, as they are similar and show poor performance against agents other than random and reflex. Through this project I was able to implement different agents, experiment with different heuristics, analyze their performance, and conclude which performs well. This work could be continued to determine whether the win rate can be improved when the agent starts as the second player against another AI agent, or whether it is simply in the nature of this game to favor an optimal first player.

10. REFERENCES

[1] Chris Gifford, James Bley, Dayo Ajayi, and Zach Thompson, "Searching and Game Playing: An Artificial Intelligence Approach to Mancala."
[2] Matthew Bardeen, "Coevolution of Mancala Players," COGS, University of Sussex.
[3] Stanford University, CS221, Multi-agent Pacman Assignment (2016).
