Learning from Hints: AI for Playing Threes


Hao Sheng (haosheng), Chen Guo (cguo2)
December 17

1 Introduction

The highly addictive stochastic puzzle game Threes, by Sirvo LLC, was Apple's Game of the Year for 2014 and is the ancestor of number-merging games like 2048. However, Threes has more complicated rules and suffers less from random effects, so it demands more planning and strategy. Another big difference is that most sliding games are time-homogeneous in how new tiles arrive: although there is some randomness after each action, the distribution of incoming blocks/numbers does not depend on the action or the current state. This classical property makes the problem easy to tackle with reinforcement learning, but it makes the lessons hard to transfer back to real-world problems. The game we chose to work on, Threes, is instead adaptive as the game goes on (see the next section). It changes the transition distribution but also provides hints about it, so a player gets a rough sense of what the next step will look like. Ideally, not only the current action but also the long-term policy should shift as the distribution of future tiles changes. We model the problem as an MDP and compare the Expectimax algorithm, Monte Carlo Tree Search, and Q-learning.

2 Literature Review

A lot of theoretical and practical research has been done on the similar game 2048, which is more popular because of its simpler rules. Algorithms employed for 2048 include minimax search with alpha-beta pruning and expectimax optimization, as well as other hard-coded solutions. In [4], temporal difference learning is also used to compute the value function as an intermediate step towards learning decision making. However, with more randomness to cope with, it seems more difficult for Q-learning to outperform expectimax, despite expectimax's glaring flaws: a gigantic branching factor and the need for a human-programmed heuristic to evaluate any given board/state.

Monte Carlo Tree Search (MCTS) is known for achieving successful results in finite-action, finite-length, perfect-information games. Although there is not much work on using MCTS to play sliding games like Threes or 2048, the mechanics of these games are appropriate for it. In [5], the method is used to play Settlers of Catan. The authors made good use of domain knowledge, not only by assigning weights to actions, but also by giving virtual wins to preferred states and virtual losses to bad states.

Deep reinforcement learning has recently played an increasingly important role in Q-learning. With enough data, a neural network can help the agent learn well from only the raw information of the game, instead of handcrafted features. For example, [6] operates directly on RGB images. For our game, deep reinforcement learning might be helpful for evaluating the interactions of different tiles on the board.

3 Set-up

Figure 1: The game Threes

3.1 Rules of Threes

The rules of Threes can be described as follows (see ?? for the mobile application):

1. Merging Rules: A 1 and a 2 can be merged to get a 3. Two identical tiles with a number of 3 or larger can be merged.

2. Scoring Rules: The total score is the sum of the scores of the individual tiles, where a tile with number $k$ scores $3^{\log_2(k/3)+1}$.

3. New Tile: The new tile is drawn from a shuffled deck of 12 cards: 4 ones, 4 twos and 4 threes. The deck is regenerated once it is used up. When the largest tile on the board satisfies $m > 96$, with probability 1/24 the new tile is not drawn from the deck; it is instead randomly generated in the range $[6, \ldots, m/4]$. The new tile only appears in one of the columns/rows moved in the last step.

3.2 Modeling as MDP

We can model this problem as a Markov Decision Process. We introduce some notation here for convenience. $D = [D_1, D_2, D_3]$ is the current deck, where $D[k]$ is the number of $k$s in the deck. The tiles on the board form a sequence $B$, where the tile at position $[i, j]$ is mapped to $B[4(i-1)+j]$. The hint for the next tile is $H$.

To make the game more intuitive, we simplify the hint mechanism of the original game. In the original game, when the card is not drawn from the deck, the hint is given as a list of three possible cards. Considering that this rarely happens, we simplify the hint to be deterministic. Because the board after an action can vary a lot and is hard to describe in words due to the game mechanism, we define $\mathrm{Succ}(b, a, h)$ as the set of possible boards after taking action $a$ on $b$ with new tile $h$, where every board in the set is equally likely. In a practical implementation it is not difficult to compute $\mathrm{Succ}(b, a, h)$, whose size is at most 4. Now we have:

State: $s = (b, d, h)$

Action(s) $= \{a \in \{\text{Up}, \text{Down}, \text{Left}, \text{Right}\} \mid a \text{ is a valid action}\}$

Transition probability:

$$
T(s, a, s') =
\begin{cases}
\dfrac{\mathbf{1}[\max\{b\} > 96]}{24 \, |\mathrm{Succ}(b, a, h)| \, \left(\log_2(\max\{b\}/3) - 2\right)}
  & \text{for } b' \in \mathrm{Succ}(b, a, h),\ d' = [d_i - \mathbf{1}[h = i]]_{i=1,2,3},\ h' \in [6, \ldots, \max\{b\}/4] \\[2ex]
\left(1 - \dfrac{\mathbf{1}[\max\{b\} > 96]}{24}\right) \dfrac{1}{|\mathrm{Succ}(b, a, h)|} \, \dfrac{d'[h']}{|d'|}
  & \text{for } b' \in \mathrm{Succ}(b, a, h),\ d' = [d_i - \mathbf{1}[h = i]]_{i=1,2,3},\ h' \in \{1, 2, 3\} \\[2ex]
0 & \text{otherwise}
\end{cases}
$$

Here $|d'|$ is the number of cards left in the deck, and $\log_2(\max\{b\}/3) - 2$ is the number of tile values in the range $[6, \ldots, \max\{b\}/4]$.

Reward:

$$
R(s, a, s') = \sum_{t \in b'} 3^{\log_2(t/3)+1} - \sum_{t \in b} 3^{\log_2(t/3)+1}
$$
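The scoring rule, the reward and the distribution of the next hint defined above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, empty cells are encoded as 0, tiles 1 and 2 are assumed to score nothing, and the deck is assumed to be refilled to [4, 4, 4] as soon as it is empty. The factor 1/|Succ(b, a, h)| for the tile position is left out because it depends on the sliding mechanics.

```python
from fractions import Fraction


def tile_score(k):
    """Score of one tile with number k: 3 ** (log2(k / 3) + 1).

    Assumption: empty cells (0) and the tiles 1 and 2 score nothing.
    """
    if k < 3:
        return 0
    rank = (k // 3).bit_length() - 1  # exact log2(k / 3) when k = 3 * 2**n
    return 3 ** (rank + 1)


def board_score(board):
    """Total score of a board given as a flat sequence of 16 tile numbers."""
    return sum(tile_score(t) for t in board)


def reward(board, next_board):
    """R(s, a, s'): score of the next board minus score of the current board."""
    return board_score(next_board) - board_score(board)


def next_hint_distribution(deck, max_tile):
    """Distribution of the next hint h' as a dict {tile: probability}.

    deck is [d1, d2, d3], the counts of 1s, 2s and 3s left after the
    current hint has been removed. Assumption: an empty deck is refilled.
    """
    if sum(deck) == 0:
        deck = [4, 4, 4]
    p_bonus = Fraction(1, 24) if max_tile > 96 else Fraction(0)
    dist = {}
    if p_bonus:
        # Bonus values 6, 12, ..., max_tile / 4, each equally likely.
        bonus = []
        value = 6
        while value <= max_tile // 4:
            bonus.append(value)
            value *= 2
        for value in bonus:
            dist[value] = p_bonus / len(bonus)
    total = sum(deck)
    for i, count in enumerate(deck):
        if count:
            dist[i + 1] = (1 - p_bonus) * Fraction(count, total)
    return dist


if __name__ == "__main__":
    print(reward([3, 3] + [0] * 14, [6] + [0] * 15))
    print(next_hint_distribution([4, 4, 4], 192))
```

Using exact fractions makes it easy to check that the probabilities in each case sum to one, which is a quick sanity test for the transition model.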

The MDP described above might look a little complicated, but it is much easier to implement in a program, since the sliding and merging mechanism directly gives all the possible successor states.

4 Approaches

4.1 Expectimax Algorithm

We first used the Expectimax algorithm to solve this problem. The evaluation function is the key to the algorithm's performance. By simply using the per-step reward $R(s, a, s')$ as the evaluation, we obtain a benchmark. Notice that the reward has nothing to do with the random components of $s'$, such as the new hint, so this baseline is actually a greedy algorithm. We also tried a heuristic evaluation function:

Value(s) = Score(s) merges merges 48 monotonicity + corner large + 50 (maxrank maxrank ) empty2 filled t t b

where merges is the number of potentially mergeable tile pairs in the next step, 1-2-merges is the number of mergeable one-two pairs, corner indicates whether the maximum tile is in a corner, monotonicity is a special measure of the monotonicity of rows and columns, maxrank is $\log_2(\text{maxtile}/3) + 3$, large indicates whether maxrank is larger than 7, empty is the number of empty cells, and filled is the number of non-empty cells.

4.2 Monte Carlo Tree Search

The idea of using Monte Carlo Tree Search came from the poster session. Given a starting state, the method keeps track of a certain number of successor states reached by different actions. The number of valid actions in Threes is small, which makes it easier to keep track of more states. Running Monte Carlo simulations from these successor states keeps updating the values of the states on record. In practice, we keep track of different numbers of nodes and run experiments on each setting. When selecting nodes during the recursive tree search, we use the Upper Confidence Bound (UCB) formula

$$
v_i + C \sqrt{\frac{\log N}{n_i}}
$$

where $v_i$ is the estimated value of the node, $n_i$ is the number of times the node has been visited, and $N$ is the number of times its parent has been visited.

This selection rule finds a balance between exploration and exploitation.
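The UCB selection step can be sketched as below. This is an illustrative sketch rather than the authors' code: the names `ucb_score` and `select_child`, the tuple layout of a child, and the default value of the constant C are all assumptions. Unvisited nodes are given an infinite score so that each child is tried at least once, which is a common convention and not something the report specifies.

```python
import math


def ucb_score(value, visits, parent_visits, c=1.4):
    """UCB score v_i + C * sqrt(log(N) / n_i) of one child node."""
    if visits == 0:
        # Assumed convention: always try an unvisited child first.
        return math.inf
    return value + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=1.4):
    """Return the action of the child with the highest UCB score.

    children is a list of (action, estimated_value, visit_count) tuples.
    """
    best = max(children, key=lambda ch: ucb_score(ch[1], ch[2], parent_visits, c))
    return best[0]


if __name__ == "__main__":
    children = [("Up", 10.0, 50), ("Left", 9.0, 2)]
    # A small C favours the best estimate, a large C the rarely visited node.
    print(select_child(children, parent_visits=52, c=0.0))
    print(select_child(children, parent_visits=52, c=5.0))
```

Because game scores in Threes can reach the thousands, C would have to be scaled to the range of the value estimates, or the values normalized, for the exploration term to have any effect.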

4.3 Q-Learning

We tried deep Q-learning to capture both the information on the board and the future value of the potential positions of the hint tile. The structure of our model is shown in Figure 2.
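For reference, the standard Q-learning update with a linear model over hand-written features can be sketched as follows. This is a generic textbook sketch, not the network of Figure 2: the feature names, the learning rate, the discount factor and the function names are all hypothetical.

```python
def q_value(weights, features):
    """Linear estimate Q(s, a) = sum_i w_i * phi_i(s, a)."""
    return sum(weights.get(name, 0.0) * x for name, x in features.items())


def q_update(weights, features, reward, next_features_by_action,
             lr=0.01, gamma=1.0):
    """One Q-learning step; updates weights in place and returns the TD error.

    features holds phi(s, a) for the action that was taken, and
    next_features_by_action maps every valid action a' in s' to phi(s', a').
    An empty mapping means s' is terminal.
    """
    best_next = max(
        (q_value(weights, f) for f in next_features_by_action.values()),
        default=0.0,
    )
    td_error = reward + gamma * best_next - q_value(weights, features)
    for name, x in features.items():
        weights[name] = weights.get(name, 0.0) + lr * td_error * x
    return td_error


if __name__ == "__main__":
    weights = {}
    # Hypothetical feature: number of empty cells after the move.
    error = q_update(weights, {"empty": 2.0}, reward=3.0,
                     next_features_by_action={})
    print(error, weights)
```

A deep variant replaces the linear estimate with a neural network but keeps the same temporal-difference target.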

Figure 2: Deep Q-Learning

5 Experiments

We ran the following experiments:

Expectimax algorithm with 2,000 examples for each of: the baseline with d = 1 and d = 2, and the evaluation function with d = 1 and d = 2.
MCTS with different numbers of nodes on record: 5, 10, 20, 30, 40.
Q-learning with a linear model, using direct information and a heuristic feature mapper. However, we have not yet had the time to train the neural network model.

The basic statistics for these experiments are in Table 1.

For the Expectimax algorithm, the histograms and the log-scale ranking plot are in Figure 3. From the results, we can see that the evaluation function helps a lot in terms of performance, while simply increasing the depth from 1 to 2 helps only a little. Also, since the running time increases a lot at depth 2, we probably do not want to increase the depth further.

For the MCTS algorithm, we ran experiments with different numbers of nodes on record. The results are shown in Figure 4.

                     Mean  Min  Max  Time Per Run
Baseline (d = 1)     , s
Baseline (d = 2)     ,239 1s
Expectimax (d = 1)   , s
Expectimax (d = 2)   ,682 17s
MCTS (n = 5)         2, ,401 5s
MCTS (n = 10)        4, ,068 10, s
MCTS (n = 20)        5, ,395 22, s
MCTS (n = 30)        5, ,310 20, s
MCTS (n = 40)        3, ,795 1m21s
Q-Learning           , s

Table 1: Basic Statistics of Experiments

Figure 3: Results of Experiments for Expectimax Algorithm

To compare the different methods, we can refer to Figure 5. MCTS is much more stable and deals with randomness well. However, Q-learning does not perform well given our limited training time.

(a) Histogram for MCTS Algorithm (b) Ranking for MCTS Algorithm

Figure 4: Results of Experiments for MCTS Algorithm

6 Summary and Discussion

Threes is a game with few actions and many states. Compared with similar games, it gives the player more information and demands more strategy. We modeled the game as an MDP and then tried different approaches.

Figure 5: Results of Experiments for Different Algorithms

The simple Expectimax algorithm is short-sighted, and it is hard to transfer our domain knowledge into evaluation functions. Simply increasing the search depth provides limited improvement.

MCTS performs well, since it fits the mechanism of the game better and finds a balance between randomness and strategy. It chooses the action that copes best with future randomness. The result is better than an intermediate human player.

Q-learning takes more time to train the feature parameters. So far, we have not had the time to train the model thoroughly because of the huge number of possible states. The current performance is just mediocre, 50% better than the baseline.

References

[1] Jason Lewis, Alec Powell & Jai Sajnani, 2048: An AI Agent.
[2] Hadi Pouransari & Saman Ghili, AI algorithms for the game 2048, CS221 project.
[3] Prithvi Ramakrishnan, Solving 2048, CS221 project.
[4] Szubert, M., & Jaśkowski, W. (2014, August). Temporal difference learning of N-tuple networks for the game 2048. In 2014 IEEE Conference on Computational Intelligence and Games (pp. 1-8). IEEE.
[5] Szita, I., Chaslot, G., & Spronck, P. (2009, May). Monte-Carlo tree search in Settlers of Catan. In Advances in Computer Games (pp. ). Springer Berlin Heidelberg.
[6] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:
