CSE 258 Winter 2017 Assignment 2: Skill Rating Prediction on an Online Video Game

Runze Xu
University of California San Diego
rux012@eng.ucsd.edu

ABSTRACT

In competitive online video game communities, it is common to find players complaining that their skill rating is lower than their actual skill level. This raises an interesting question: can a player's skill rating (SR) be predicted from his/her in-game statistics? This report presents the results of a skill rating prediction task on a large-scale dataset from the popular online game Overwatch. The results show that only moderate performance can be achieved with the extracted feature vector, indicating that more representative features and a model with temporal structure are required for this task.

1 INTRODUCTION

Overwatch is a team-based multiplayer first-person shooter developed and published by Blizzard Entertainment. It has a competitive mode which runs in seasons, and players are assigned a skill rating (SR) according to their performance in matches during the season. One myth in the game community is that you cannot break out once you fall into the low-level SR groups; in other words, that the skill rating a player receives does not mainly depend on his/her performance. On online forums, some players even post their skill ratings together with their match performances to support this idea. If this myth were true, we could conclude that the developer has designed a bad ranking system which does not properly reflect a player's skill, and that a new rating system is needed to improve the quality of the game's competitive mode. One way to examine this question is to design a prediction system which estimates the SR of a player based purely on performance. Given a player's statistics, if we can precisely predict the rating, then we can say that the SR system is a reasonable representation of skill. The rest of this report describes a simple approach to this problem.

2 RELATED WORKS

Since Overwatch was only released in May 2016, there is not much in-depth study of it. After the end of each competitive season, Blizzard Entertainment releases a short blog post about the statistics of that season, but it only gives a rough distribution, which is not very helpful for this assignment. However, there is some literature on prediction tasks using in-game data. Tobias Mahlmann et al. predicted player behavior in Tomb Raider: Underworld using various algorithms including linear regression, SVM and MLP [1]. Ruck Thawonmas et al. detected bots in MMORPGs using action-frequency features and an SVM [2]. Kyong Jin Shim et al. predicted players' future performance based on their past performance [3]. Olivier Delalleau et al. proposed a balanced matchmaking strategy aimed at replacing an old skill rating system [4]. Generally speaking, much of the literature on sports ranking prediction is also related to this assignment. However, we found that most work in this field concentrates on predicting future outcomes given past matches, which requires a dataset with temporal structure. Due to the limitations of the collected dataset, we cannot exploit much from these papers.

3 DATASET

Blizzard Entertainment does not provide an official dataset or APIs for obtaining player statistics, so all data samples were collected by crawling the website Overwatch Tracker.
Overwatch Tracker provides a leaderboard containing the ranking statistics of all players who have been searched through the website. The dataset was extracted on February 27th, just after the end of season 3 of competitive mode, so the rankings in the dataset represent the overall standings of players well. The dataset contains 80577 valid samples. Each sample has the player's current skill rating, as well as various performance data such as eliminations, healing and death rate. According to Wikipedia, by January 2017 Overwatch had more than 25 million registered players, so the dataset we obtained is a small subset of the whole player community. Because Overwatch Tracker is mainly used by English-speaking users, players from only three regions were collected: around 2000 samples are from the Korean server, 30000 from the European server and 48000 from the United States.

Figure 1 shows the number of players at each skill rating level. The overall distribution looks Gaussian, with peaks at around ratings 0, 2000, 2500, 3000, 3500 and 4000. It is very unlikely to get a skill rating of 0 unless one does it on purpose, so the peak at 0 may be caused by errors on the original website, or by players who have not finished enough matches for their ratings to be decided. Apart from that, the game has a reward system which gives players a special bonus when they reach certain rating levels, e.g. 2000, 2500 and 3000. Once they reach these tiers, some players may be less motivated to keep playing competitive mode, since it would be hard to reach the next tier. This might explain the peaks at the other rating levels.

Figure 1: Skill Rating Distribution
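As a concrete illustration, the distribution in Figure 1 can be reproduced with a short sketch; the CSV file name, the skill_rating column and the 0-5000 SR scale are assumptions about the scraped data, not its actual schema:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("overwatch_tracker_season3.csv")  # hypothetical file name

    # Players per skill rating level, binned into 100-point buckets (cf. Figure 1).
    df["skill_rating"].plot.hist(bins=range(0, 5001, 100))
    plt.xlabel("Skill Rating")
    plt.ylabel("Number of Players")
    plt.show()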

After removing the abnormal samples around 0 rating, the size of the dataset becomes 78960 samples. The mean and median of the ratings are 2573 and 2535, respectively. According to Blizzard Entertainment, the median maximum skill rating of season 3 was a little above 2300, so our scraped dataset is slightly biased towards players with better skills. As a result, a predictor trained on it may perform worse on low-level players when applied to the complete player population.

In competitive mode, each player has to pick a unique hero, and heroes can be roughly classified into four types: tank, offensive, defensive and healer. Although most gamers play various heroes, they usually have one particular hero whom they play the most. We can group players by their favorite hero and observe the distribution of skill rating. Figure 2 shows the normalized distributions of players who play Ana and Junkrat the most. As expected, both distributions are Gaussian shaped, but players who favor Junkrat have a lower average rating than players who favor Ana.

Figure 2: Skill Rating Distribution of Two Heroes

Figure 3 shows the average skill rating of all 23 heroes. The mean ratings lie in the range of 2000 to 2800. Some heroes have a low average SR because they are designed to be less powerful than others, e.g. Junkrat. Other heroes are must-picks in every match, e.g. Lucio, so both skillful and novice players play them, resulting in a low average rating. This figure tells us that the kind of hero a player prefers may be a good feature for estimating skill rating.

Figure 3: Average Rating of Different Heroes

An intuitive way to estimate the rating is to use the win/loss ratio: just as in most sports, a team with a higher winning rate usually ranks higher as well. Figure 4 plots skill rating against win/loss ratio. Surprisingly, we find no strong connection between SR and winning rate, although players with a very low winning rate (around 20%) tend to have a low SR, and only players with a winning rate near 100% tend to have a skill rating above 3000. This could be caused by the matchmaking system, which tries to pair people with players of similar skill level; a player may have a high winning rate simply because he/she has defeated many novices. In our experiment, a linear regression model fitted on winning rate alone achieved an MAE of 496.09, only a limited improvement over the trivial baseline model. So it is necessary to use more features for this task.

Figure 4: Skill Rating against Win/Loss Ratio
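The single-feature experiment above is easy to reproduce. A minimal sketch, assuming the same hypothetical DataFrame as before with win_ratio and skill_rating columns:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    X = df[["win_ratio"]].values
    y = df["skill_rating"].values

    # Fit SR on winning rate alone; the report obtains an MAE of about 496.
    model = LinearRegression().fit(X, y)
    print(mean_absolute_error(y, model.predict(X)))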

4 TASK

The task of this assignment is to predict a player's skill rating given his/her statistics. We randomly picked 10% of the samples as the testing set; the remaining 90% were used for training and validation. To evaluate the performance of the predictors, mean absolute error (MAE) was used, and the global mean rating served as a simple baseline against our methods.

In competitive mode of Overwatch, a player who loses a match drops about 30 skill rating points, and vice versa; winning or losing streaks change the rating by larger amounts. So it is quite common for a player's SR to fluctuate. In this assignment, we aim to predict the rating with an MAE lower than 100.

5 METHODOLOGY

5.1 Feature Extraction

Each data sample in the original dataset contains 364 features which describe both the overall performance and the specifics of each played hero. Some features are duplicated as a by-product of scraping. In addition, because the dataset was collected from a website which itself obtained its data by scraping, there are many errors and missing features. So the first step is to clean the dataset and remove obvious outliers. As mentioned previously, samples with ratings around 0 were considered outliers and removed.

Because not every player masters all the heroes, it is quite common that only a few heroes appear in a player's statistics; in this situation, we set the features of all other heroes to zero. Another problem is that some heroes are recorded as played for a very short time during the whole season, in which case the statistics may not represent the player's capability with those heroes. Considering that a typical Overwatch competitive match lasts about 15 minutes, if a hero was played for less than 10 minutes, the corresponding features were set to zero.

Some statistics are cumulative over the season, e.g. Objective Time, Deaths and Defensive Assists. Because the total number of played matches differs between players, a skillful player who has played fewer matches may have lower values than a weaker player who has played more, so these features were divided by the number of matches.

As mentioned before, the kind of hero a player prefers may be a good feature for estimating skill rating. However, a player may have several preferences at once, so we counted the total time spent on each hero and represent the hero preference as a 23-dimensional vector, in which each dimension is the fraction of playtime the player spent on a specific hero.

There is a popular opinion in the game community that players in Korea have a higher skill level than those in the US or European regions. By computing the average rating of the three regions, we found that players on the Korean server do indeed have relatively higher ratings, so we also encode the region as a one-hot vector and include it among the features.

After this processing, the feature vector input to the predictor has 284 dimensions. One problem is that the dimensions do not share the same dynamic range: for example, the dimension representing Damage per Game may range from 0 to 20000, while the dimension for region is merely 0 or 1. Training with a gradient descent algorithm then converges slowly or gets stuck in a local minimum, and with regularization some weights would be too small to affect the final result. To solve this problem, before feeding in the feature vector we manually normalize each dimension using the mean and variance of the training set, so that every dimension has zero mean and unit variance.
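To make the pipeline above concrete, here is a minimal sketch of the per-player feature construction (a sketch only: the player dict layout, field names and hero list are illustrative assumptions, not the actual scraped schema):

    import numpy as np

    HEROES = ["ana", "junkrat", "lucio"]   # ...the real list has all 23 heroes
    REGIONS = ["us", "eu", "kr"]

    def extract_features(player):
        n_matches = max(player["matches_played"], 1)

        # Cumulative season statistics divided by the number of matches.
        per_match = [player[s] / n_matches
                     for s in ("objective_time", "deaths", "defensive_assists")]

        # Hero preference: fraction of total playtime spent on each hero,
        # after zeroing heroes played for less than 10 minutes.
        minutes = np.array([player["playtime"].get(h, 0.0) for h in HEROES])
        minutes[minutes < 10.0] = 0.0
        preference = minutes / minutes.sum() if minutes.sum() > 0 else minutes

        # One-hot region encoding.
        region = np.eye(len(REGIONS))[REGIONS.index(player["region"])]

        return np.concatenate([per_match, preference, region])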
5.2 Regression Algorithm

For the prediction we tested three approaches: linear regression with l2 regularization (Ridge regression), a multi-layer neural network (MLP) and a random forest regressor. Support vector regression (SVR) with an RBF kernel was also tried, but due to the relatively large number of training samples the training time became too long, so SVR was not used in the final experiment. Ridge regression was selected for its fast training speed and small number of hyper-parameters. Since the relationship between the features and the target may not be linear, an MLP was used to estimate this non-linear relationship; however, an MLP requires tuning more hyper-parameters, so its results may not be optimal. To reduce overfitting, l2 regularization terms were added to every layer of the MLP. Random forest was chosen because it automatically selects the important features and requires less hyper-parameter tuning.

6 RESULTS AND ANALYSIS

As mentioned earlier, 10% of the original data samples were used as the testing set. The remaining samples were split 8:1, and the second part served as the validation set for hyper-parameter tuning. The tuned parameters are as follows:

1. Ridge Regression: l2 regularization λ = 0.1.
2. Random Forest: number of estimators: 100; maximum tree depth: 30; maximum number of features: feature dimension / 3.
3. MLP: number of hidden layers: 1; l2 regularization λ = 1; hidden layer dimension: 100.

In the experiments, both Ridge regression and the random forest were trained with the scikit-learn package [5], while the MLP was trained with the Adam optimizer in the Keras library [6]. Figure 5 shows the training process; the objective function is mean squared error rather than MAE. The validation loss converged after about 200 epochs while the training loss continued to decrease.

Figure 5: Loss against Epochs

Adam [7] is an adaptive optimization algorithm which is more efficient than vanilla gradient descent; another reason to choose it is that it does not require manually tuning the learning rate. The update rule of the Adam optimizer is:

\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\tag{1}
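Equation (1) is compact enough to write out directly. Below is a minimal numpy sketch of one Adam step with the default constants from Kingma and Ba [7]; it is an illustration of the update rule, not the Keras internals:

    import numpy as np

    def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update of parameters theta given gradient g (Eq. 1)."""
        m = beta1 * m + (1 - beta1) * g         # first-moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)            # bias correction; t starts at 1
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v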

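For reference, here is a minimal sketch of the training setup with the tuned hyper-parameters listed above. X_train and y_train denote the extracted feature matrix and SR targets (construction not shown); the ReLU activation and the Keras 2-style layer API are assumptions, as the report does not state them:

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge
    from sklearn.ensemble import RandomForestRegressor
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.regularizers import l2

    # Zero-mean, unit-variance normalization fitted on the training set only.
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)

    # Ridge regression with l2 regularization strength 0.1.
    ridge = Ridge(alpha=0.1).fit(X_train_s, y_train)

    # Random forest: 100 trees, depth <= 30, feature_dim / 3 features per split.
    forest = RandomForestRegressor(n_estimators=100, max_depth=30,
                                   max_features=X_train.shape[1] // 3)
    forest.fit(X_train_s, y_train)

    # MLP: one hidden layer of 100 units, l2 regularization on every layer, MSE loss.
    mlp = Sequential([
        Dense(100, activation="relu", kernel_regularizer=l2(1.0),
              input_shape=(X_train_s.shape[1],)),
        Dense(1, kernel_regularizer=l2(1.0)),
    ])
    mlp.compile(optimizer="adam", loss="mse")
    # validation_split of 1/9 reproduces the 8:1 train/validation split.
    mlp.fit(X_train_s, y_train, epochs=200, validation_split=1 / 9)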
Algorithm            Training MAE    Testing MAE
Global Average       550.83          544.2
Random Forest        105.46          285.68
Ridge Regression     251.63          257.82
MLP                  233.9           242.57

Table 1: Mean Absolute Error

Table 1 shows the results of applying the different algorithms to the prediction task. As expected, the baseline model performs poorly on both the training and testing sets, since it simply predicts the global average rating. The random forest regressor achieves good performance on the training set but a far worse MAE on the testing set, indicating that it overfitted the training data. To reduce the overfitting, we used the validation set to tune hyper-parameters such as the maximum number of features and the maximum tree depth, but we found it hard to get similar MAEs on the training and validation sets without underfitting. Ridge regression, on the other hand, performs similarly on the training and testing sets, with a testing MAE lower than that of the random forest. By sorting the learned weights, we find that the features with the three largest positive weights are Damage per Game, Eliminations per Game and Healing per Game, which matches the basic goal of the game: eliminating enemies while avoiding casualties. Of the three algorithms, the MLP performs best on the testing set, which shows that there is a non-linear relationship between the features and the rating.

Even though we improved the testing MAE from 544 to 242, there is still a huge gap to the target MAE, so we cannot precisely predict a player's skill rating from the features we extracted. There are several potential reasons for this result.

1. More representative features are needed. Many of the extracted features relate to eliminations or damage dealt. However, some factors that indicate whether a player is skillful are not captured, for example the ability to eliminate the most important target first, or the ability to switch heroes according to the opponents. In high-level matches, these abilities are crucial in deciding the outcome.

2. No temporal structure in the data. Since all features are whole-season player statistics, we can only extract averages or maxima. This would work well if players' skills remained constant over the season, but it is common for a player to improve while playing, and some top-level players may have low average feature values because they performed badly at the beginning. It would therefore be more reasonable to estimate a player's current skill rating from his/her past performance in each match. Models with temporal structure, such as a Hidden Markov Model or a Recurrent Neural Network, could be useful here if per-match data were available.

3. Biased dataset. As mentioned before, the scraped dataset is biased towards high-level players, because skillful players care more about their ranking and are more likely to look up their position, while novice players are just having fun. As a result, the predictor may learn a poor representation of players with low skill ratings. To counter this problem, we would need to collect more data samples.
According to the experimental results, if we estimate that a player has a skill rating of 2750, it is very likely that his/her true rating lies between about 2500 and 3000. As mentioned earlier, the game divides skill ratings into tiers, each spanning about 500 rating points, so it may be possible to estimate players' tiers with higher accuracy than their exact ratings. This is left to future work.

Figure 6: Tiers Breakdown
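A minimal sketch of this tier-level evaluation, assuming uniform 500-point tiers (the exact tier boundaries are an assumption) and predictions y_pred against true ratings y_test from the experiments above:

    import numpy as np

    def to_tier(sr, width=500):
        # Map a skill rating to a tier index, e.g. 2500-2999 -> tier 5.
        return np.floor_divide(sr, width)

    # Fraction of players whose tier is predicted exactly.
    tier_accuracy = np.mean(to_tier(y_pred) == to_tier(y_test))
    print(tier_accuracy)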

REFERENCES

[1] Mahlmann, Tobias, et al. "Predicting player behavior in Tomb Raider: Underworld." Computational Intelligence and Games (CIG), 2010 IEEE Symposium on. IEEE, 2010.
[2] Thawonmas, Ruck, Yoshitaka Kashifuji, and Kuan-Ta Chen. "Detection of MMORPG bots based on behavior analysis." Proceedings of the 2008 International Conference on Advances in Computer Entertainment Technology. ACM, 2008.
[3] Shim, Kyong Jin, Richa Sharan, and Jaideep Srivastava. "Player performance prediction in massively multiplayer online role-playing games (MMORPGs)." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2010.
[4] Delalleau, Olivier, et al. "Beyond skill rating: Advanced matchmaking in Ghost Recon Online." IEEE Transactions on Computational Intelligence and AI in Games 4.3 (2012): 167-177.
[5] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." JMLR 12 (2011): 2825-2830.
[6] Chollet, François. Keras. GitHub, 2015. https://github.com/fchollet/keras
[7] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).