An AI for Slither.io


Jackie Yang (jackiey)

Introduction

Game playing is a very interesting topic in Artificial Intelligence today. Most recently emerging game-playing AIs target turn-based games, like the very popular AlphaGo (Silver et al. 2016), or games with only one or a few players, like a Breakout AI (Mnih et al. 2015). However, there are many other interesting games that do not fall into these categories, such as StarCraft. They offer real-time game play with a potentially massive number of players participating at once, which makes them much harder to solve. In this project, I focus on a small, easily modeled game that has all of these under-explored features.

Slither.io is a massively multiplayer browser game developed by Steve Howse ("Slither.io on the App Store"). The player controls a snake by changing its moving direction and eats food, represented by colored dots on the map, to grow larger, which is the goal of the game. When one player's snake runs into another player's snake, it dies and is converted into food dots. This creates intense competition between players, who try to make other players' snakes run into their bodies.

Task Definition

For this project, the objective is to create an AI that controls the snake (shown at the center of Figure 1), keeps eating, and avoids danger. Specifically, the input is all the food around the snake avatar (the colored dots around the user's snake in Figure 1) and all the danger (nearby snakes) around it, while the output is the direction in which the snake should go. The metric of success is the average maximum length reached by the AI-controlled snake.

Baseline and future improvement

The baseline for this project is a snake that wanders around in random directions. The oracle is straightforward: become the longest player in the arena. There is a real-time leaderboard in the game that lists the 10 longest players currently in the arena.
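
To make the baseline and the task interface concrete, here is a minimal sketch of the random-wander controller described above; the State fields and names are illustrative assumptions, not the actual data the browser script provides.

    import math
    import random
    from dataclasses import dataclass, field

    @dataclass
    class State:
        # Illustrative observation for one turn (field names are assumptions)
        foods: list = field(default_factory=list)    # [(x, y), ...] nearby food dots
        dangers: list = field(default_factory=list)  # [(x, y), ...] points on nearby snakes
        length: float = 10.0                         # current snake length

    def baseline_policy(state: State) -> float:
        # Baseline: ignore the state and steer in a uniformly random direction.
        # Returns the heading angle, in radians, that the snake should move toward.
        return random.uniform(0.0, 2.0 * math.pi)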

Figure 1: Slither.io

Related Work

I found some related work on building a Slither.io AI. Slither.io-bot is a very popular implementation of a Slither.io AI. It uses a rule-based policy to choose an action for the snake. The rules can be divided into two parts: food finding and collision avoidance. Food finding simply locates the nearest food and heads toward it. For collision avoidance, once the bot detects any other snake entering a circle defined by a static parameter, it turns the other way. Currently, this rule-based AI is nowhere near human players. The major reason is that it can easily be killed by other snakes that surround it with their bodies: as long as the surrounding circle is larger than the rule-based AI's threshold, the AI will not react at all. My solution is to use a machine learning algorithm to dynamically tune that parameter, so the bot avoids this situation without becoming too timid. This training could be done with a reinforcement learning algorithm such as Q-learning, and I go for a simple model such as a linear classifier so the AI runs fast.
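
As a rough sketch of how such a rule-based controller works (the geometry below is a simplified illustration under my own assumptions, not the actual Slither.io-bot code):

    import math

    def angle_to(src, dst):
        # Heading angle from src to dst, in radians
        return math.atan2(dst[1] - src[1], dst[0] - src[0])

    def rule_based_policy(head, foods, dangers, radius_avoid_size=150.0):
        # head:    (x, y) of our snake's head
        # foods:   list of (x, y) food dots
        # dangers: list of (x, y) points on other snakes' bodies
        # radius_avoid_size: the static avoidance radius (the parameter tuned later)
        if dangers:
            nearest = min(dangers, key=lambda p: math.dist(head, p))
            if math.dist(head, nearest) < radius_avoid_size:
                # Collision avoidance: steer directly away from the nearest danger
                return angle_to(nearest, head)
        if foods:
            # Food finding: head toward the nearest food dot
            target = min(foods, key=lambda p: math.dist(head, p))
            return angle_to(head, target)
        return 0.0  # nothing nearby: keep going (arbitrarily, to the right)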

Infrastructure

The infrastructure running this AI consists of a browser, a browser automation script, and the AI itself, all of which are already built. The input data looks like this:

    collisionpoints = '0' : { 'xx' : 27557.36809910213, 'yy' : 30773.149294393803, 'snake' : ...

which lists all the obstacles, along with the length of the current snake. The output is simply the set of parameters for the rule-based AI.

Approach (new)

Working on this project throughout the semester, I gradually realized that this is a rather difficult problem to tackle. I tried quite a few approaches. I will first describe the general infrastructure shared by all of these designs and then describe each of the AIs I tried.

General

The general approach I planned for this problem is Q-learning with a neural network, as in (Mnih et al. 2015). The Q-learning update is

    \hat{Q}_{opt}(s, a) \leftarrow (1 - \eta) \underbrace{\hat{Q}_{opt}(s, a)}_{prediction} + \eta \underbrace{\bigl( r + \gamma \hat{V}_{opt}(s') \bigr)}_{target}

where

    \hat{V}_{opt}(s') = \max_{a' \in Actions(s')} \hat{Q}_{opt}(s', a')

Because the state space of s is much larger than in most tabular Q-learning settings, I replace the table Q_{opt}(s, a) with a neural network. \hat{Q}_{opt}(s, a) is then updated with stochastic gradient descent: train \hat{Q}_{opt} on the input (s, a) with the regression target (1 - \eta) \hat{Q}_{opt}(s, a) + \eta (r + \gamma \hat{V}_{opt}(s')), with \hat{V}_{opt}(s') defined as above.
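
A minimal sketch of this update with a neural-network function approximator, written with PyTorch; the network size, learning rate, and the 16 discrete steering actions are assumptions for illustration rather than the exact model used in the project.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        # Small MLP approximating Q_opt(s, a) over a fixed set of discrete actions
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, s):
            return self.net(s)

    def q_update(model, optimizer, s, a, r, s_next, eta=0.5, gamma=0.95):
        # One SGD step toward the smoothed target
        # (1 - eta) * Q(s, a) + eta * (r + gamma * V(s')) described above.
        with torch.no_grad():
            v_next = model(s_next).max()                               # V_opt(s') = max_a' Q_opt(s', a')
            target = (1 - eta) * model(s)[a] + eta * (r + gamma * v_next)
        pred = model(s)[a]                                             # prediction Q_opt(s, a)
        loss = (pred - target) ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Usage sketch (dimensions are assumptions):
    # model = QNetwork(state_dim=32, n_actions=16)
    # optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)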

However, there are still several difficulties to solve:

1. The competing-with-humans nature of this game makes game play really slow; how can training be sped up?
2. How can a continuous action be produced from the discrete policy that Q-learning provides?
3. How should a proper feature vector be selected?

The first problem is shared by all of the AIs I designed. The human part of this game makes game play really slow and significantly slows down training. For example, if we consider 0.5 seconds of game time as one turn for the AI, then we can generate at most 172800 = 60 × 60 × 24 / 0.5 samples in a day. I solve this problem by letting the AI play several slither.io games simultaneously with a shared predictor. All of these agents evaluate situations and choose actions using a single predictor, and they all feed their experience back to that same predictor to improve the policy (more precisely, the function approximation of Q) collaboratively.

Another problem is that the training process itself is slow, and because of the online, multiplayer nature of slither.io, we cannot pause the game while the predictor is training. I solve this by building two identical models: the game uses one of them to predict the best action while the other receives feedback in a separate thread. After each epoch, the program swaps the two models atomically to avoid hazards between threads and then copies the new parameters to the old model asynchronously, so that the latest-trained model produces the best actions while the just-swapped-out model accepts new feedback. With this parallel method, the bottleneck of human players can be overcome without building a self-play system, which could be inaccurate and would not represent how human players actually play this game.
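
A sketch of this double-buffered predictor, assuming the two models are PyTorch modules as in the sketch above; the threading details here are illustrative, not the project's exact implementation.

    import threading

    class DoubleBufferedPredictor:
        # Keep two identical models: one serves actions to the game loop,
        # the other is trained on incoming feedback in a background thread.
        def __init__(self, model_factory):
            self.serving = model_factory()   # picks actions for the live games
            self.training = model_factory()  # updated with new feedback
            self._lock = threading.Lock()

        def predict(self, state):
            with self._lock:
                return self.serving(state)

        def train_step(self, update_fn, transition):
            # update_fn(model, s, a, r, s_next) wraps the Q-learning step and its optimizer
            update_fn(self.training, *transition)

        def swap(self):
            # At the end of an epoch, atomically swap the two models, then sync the
            # freshly trained parameters into the model that will now be trained.
            with self._lock:
                self.serving, self.training = self.training, self.serving
            self.training.load_state_dict(self.serving.state_dict())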

For the remaining problems, I designed several AIs to tackle them:

Rule-based AI with Q-learning parameter tuning

This idea emerged while I was thinking of ways to solve the second problem: how to play a continuous game when Q-learning is turn-based. I figured that I could use a rule-based controller for continuous control and Q-learning to tune the parameters of that rule-based AI. So I built an AI in which a rule-based component handles short-term tactics and a reinforcement-learning component gives high-level strategic guidance. Namely, I adopt the same rule-based AI described in Related Work and extract two of its parameters as the high-level instruction: radiusavoidsize and fastresponsesize, which determine how close danger must be before the snake starts avoiding it and how close it must be before the snake makes an emergency maneuver to avoid imminent danger. Notably, I divide game play into 2-second turns for the Q-learning AI.

To design a proper feature vector, I focus on representing the game state as well as possible so the AI can decide what those parameters should be. I select the most important factors for that decision: the distances and directions of the 10 nearest dangers. As the snake grows larger, it makes sense to avoid danger at a greater range, since a large snake is harder to turn around. So I also include the snake's length as an input, both as a feature for the neural-network predictor and as a quantity used to compute the reward.

Q-learning AI with hand-extracted features and neural network function approximation

After a few days of training and exploring, I found the improvement was not that prominent, and it was hard to distinguish real improvement from the many uncertainties in each game. I chose to build an AI from the ground up using Q-learning. To make the AI as responsive as possible, I reduced its turn time to 0.5 seconds. With the experience from the previous AI and another ten-odd days of training, I came up with the following feature vector. The new feature vector not only takes food locations into consideration, but also greatly improves interpretability. I realized that in the previous implementation, the function approximator would have had a hard time discovering the relationship between the group of danger directions and their distances. In the new design, I divide the map into 16 directions around the snake and build an array of the distance to the nearest threat in each of those 16 directions. To further help the AI, I build a matrix describing the relationship between those 16 directions: whether the nearest danger in each pair of directions belongs to the same snake. In this way, the AI has an idea of where the danger is and where each snake is. During training, I also noticed that the AI sometimes tries to kill other snakes, but for lack of information it sometimes goes for their tails instead of their heads. So I added another 16-element boolean vector indicating whether the danger in each direction is a snake head. For the food vector, I further divide each of the 16 directions into 4 regions by distance, giving a 64-element vector of the amount of food at different directions and distance ranges.
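
A sketch of how this feature vector could be assembled. The sizes follow the description above (16 danger distances, a 16x16 same-snake matrix, 16 head flags, and a 16x4 food histogram); the input format, helper names, and the clipping distance are assumptions.

    import math
    import numpy as np

    N_DIRS = 16          # directions around the snake head
    N_FOOD_RINGS = 4     # distance rings for the food histogram
    MAX_DIST = 2000.0    # assumed clipping distance for normalization

    def direction_bin(head, point):
        angle = math.atan2(point[1] - head[1], point[0] - head[0]) % (2 * math.pi)
        return int(angle / (2 * math.pi) * N_DIRS) % N_DIRS

    def build_features(head, dangers, foods):
        # dangers: list of (x, y, snake_id, is_head); foods: list of (x, y)
        dist = np.full(N_DIRS, MAX_DIST)
        nearest_snake = np.full(N_DIRS, -1)
        is_head = np.zeros(N_DIRS)
        # 16-element vector: distance to the nearest danger in each direction
        for x, y, snake_id, head_flag in dangers:
            b = direction_bin(head, (x, y))
            d = math.dist(head, (x, y))
            if d < dist[b]:
                dist[b], nearest_snake[b], is_head[b] = d, snake_id, float(head_flag)
        # 16x16 matrix: is the nearest danger in directions i and j the same snake?
        same = np.zeros((N_DIRS, N_DIRS))
        for i in range(N_DIRS):
            for j in range(N_DIRS):
                if nearest_snake[i] >= 0 and nearest_snake[i] == nearest_snake[j]:
                    same[i, j] = 1.0
        # 16x4 histogram: food counts per direction and distance ring
        food_hist = np.zeros((N_DIRS, N_FOOD_RINGS))
        for x, y in foods:
            b = direction_bin(head, (x, y))
            ring = min(int(math.dist(head, (x, y)) / MAX_DIST * N_FOOD_RINGS), N_FOOD_RINGS - 1)
            food_hist[b, ring] += 1
        return np.concatenate([dist / MAX_DIST, same.ravel(), is_head, food_hist.ravel()])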

Q-learning AI with raw image input and convolutional neural networks

Although I tried hard to hand-pick the feature vectors in the previous designs, the AI was still not as well informed as a human player. For example, the rendered game screen shows whether the player is in a snake-crowded area or not; this information is very useful to human players when deciding whether to rush wildly or play cautiously. To tackle this problem, I followed the method of the Breakout AI paper (Mnih et al. 2015) and fed the raw images directly to the Q-learning algorithm. As in that paper, I use a convolutional neural network to take advantage of the 2-D, image-shaped data, and I stack 4 consecutive frames as the input to the learning algorithm to give it a better sense of velocity.
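
A sketch of a convolutional Q-network for this method, following the general architecture of Mnih et al. (2015); the exact layer sizes, the 84x84 input resolution, and the 16 steering actions are assumptions rather than the project's exact network.

    import torch
    import torch.nn as nn

    class ConvQNetwork(nn.Module):
        # DQN-style network: input is 4 stacked grayscale frames,
        # output is one Q-value per discrete steering direction.
        def __init__(self, n_actions=16):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),
            )

        def forward(self, frames):
            # frames: (batch, 4, 84, 84) stack of the last four screen captures
            return self.head(self.conv(frames / 255.0))

    # Q-values for a single stacked observation
    q_values = ConvQNetwork()(torch.zeros(1, 4, 84, 84))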

Results

The results are shown below, using the snake's length and the number of turns as metrics:

                    Baseline   Rule-based AI   Method 1   Method 2   Method 3
    Length average     13.49          948.18     979.98      39.03      13.92
    Length stdev        8.98         1303.60     879.30      55.10       5.22
    Turns average      59.09          300.30     703.39      55.40      63.56
    Turns stdev       109.34          343.31     741.14      73.05      89.08

Note that the Rule-based AI and Method 1 use longer turns of 2 seconds; all other turns are 0.5 seconds. Method 1 did improve on the rule-based AI, but it is quite surprising that Methods 2 and 3 performed poorly. I assume this is due to the complexity of the game. A Q-learning AI is very unlikely to discover that rushing to a cluster of food is a good idea, because going into a cluster of food usually gets a poorly-performing AI killed, which yields a very high penalty. Meanwhile, a well-trained Q-learning AI is very timid: I observed that the trained AIs in Methods 2 and 3 often rush to a corner of the map and stay there. They never get the chance to move into a crowded area and fight for food with other snakes, simply because of the low probability of random exploration.

Discussion

During training, I found two figures very intriguing. Figure 2 shows the loss as a function of training epoch, while Figure 3 shows the length of the snake at death, smoothed with a 100-element sliding window, during the Method 3 training.

Figure 2: Loss and training epochs

Figure 3: Length of each death

These figures show that although the loss keeps decreasing, the average score does not keep increasing. I believe this supports my point above: the snake gains a better awareness of the situation and learns that crowded places are dangerous, but it never gets enough chances to try different eating tactics and simply gives up on all the crowded places. I think this explains the poor performance of both Method 2 and Method 3.

I can think of a few preliminary ideas to address this problem in the future:

1. Learn from human players: Let the AI watch human players play for a couple of rounds. Hopefully, the AI would pick up the organized tactics of human players and recognize that crowded areas are quite profitable.

2. Learn from opponents: Override the webpage so it renders the scene not only from the snake the bot controls but also from the opponents' snakes. Show the situations and movements of the other players and train the function approximation of Q on them. Hopefully, the bot can learn from others' experience.

3. Learn from itself: Build a private slither.io server populated only with bots. Since they all have very bad tactics, crowded areas might not be as dangerous, and the AI might be able to try out more tactics instead of hiding.

References

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. 2015. "Human-Level Control Through Deep Reinforcement Learning." Nature 518 (7540): 529-33.

Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, et al. 2016. "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature 529 (7587): 484-89.

"Slither.io on the App Store." https://itunes.apple.com/us/app/slither.io/id1091944550?mt=8.