Linköping University

Marine Rush: Teaching an agent StarCraft 2 through reinforcement learning

Erik Kindberg

TABLE OF CONTENTS

1. Introduction
   StarCraft
      What is StarCraft 2
   PySC2
   Reinforcement Learning
      Q-learning
   The Game
      The Bot
      The Map
2. Implementation
   PySC2 Package
   QLearningTable
      choose_action
      learn
      check_state_exist
   SparseAgent
      transformdistance
      transformlocation
      splitaction
      step
3. Results
4. Reflections
5. Reference List
6. Appendices
   Appendix 1: Python Code

ABSTRACT

This report concerns the implementation of a reinforcement-learning-trained bot in the game StarCraft II. The bot has access to only a small subset of all possible actions, to limit the action space, and it improves through Q-learning. It faced the easiest built-in game AI and was fairly successful; eventually it was able to win more than it lost.

1. INTRODUCTION

The purpose of this project is to implement a bot in the strategy game StarCraft II. The term bot will be used interchangeably with intelligent agent in this report; I prefer the term bot because it is the term commonly used within the game. The bot will be tested against another, scripted bot and will improve through reinforcement learning techniques. The bot is programmed in Python 3 and implemented through the PySC2 environment.

STARCRAFT

WHAT IS STARCRAFT 2

StarCraft 2 is a PC strategy game developed by Blizzard Entertainment (Blizzard Entertainment, 2018). The game is played between two or more agents, human or computer controlled. The purpose is to destroy your opponents by building a base, recruiting soldiers, and deploying them more effectively than your opponents do. A common distinction is made between the micro game and the macro game. The micro game can be likened to tactics and concerns the placement and manoeuvring of your soldiers. The macro game can be likened to strategy and concerns which buildings you build in your base and which types of soldiers you recruit. StarCraft II provides a difficult challenge for intelligent agents, as described in the original DeepMind blog post:

"Even StarCraft's action space presents a challenge with a choice of more than 300 basic actions that can be taken. Contrast this with Atari games, which only have about 10 (e.g. up, down, left, right etc). On top of this, actions in StarCraft are hierarchical, can be modified and augmented, with many of them requiring a point on the screen. Even assuming a small screen size of 84x84 there are roughly 100 million possible actions available." (DeepMind, 2018)

PYSC2

PySC2 is an environment for AI research within the game StarCraft 2. It is a collaboration between Blizzard Entertainment and DeepMind, a Google-owned company focusing on AI research (Vinyals, et al., 2018). StarCraft II was chosen for AI research for a few reasons.

Firstly, there is the above-mentioned action-space complexity. Another challenge is that the length of a game varies a lot, meaning that actions may not pay off for a long time, or at all. A further challenge is that the map is only partly observable, which requires a combination of memory and planning. A benefit of using StarCraft 2 is the large player base, ensuring that a large quantity of replay data is available, as well as many talented human opponents for AI (DeepMind, 2018).

REINFORCEMENT LEARNING

Reinforcement learning is an area of machine learning inspired by behaviourism (Reinforcement Learning, 2018). The general idea is to define a goal, encourage behaviour that leads the bot closer to the goal (good behaviour) through some kind of reward, and discourage bad behaviour on the opposite end. In complex domains supervised learning might not be possible, since it requires accurate and consistent evaluations, which might not be available, especially if the states are sequential and dependent on each other. If actions cannot be evaluated individually, it is harder to label them accurately, making supervised training tricky (Russell & Norvig, 2010). Reinforcement learning instead places an agent in an environment where it does not know what the possible actions will result in or how the environment works. The agent figures out optimal policies through positive and negative feedback (Russell & Norvig, 2010).

Q-LEARNING

The variant of reinforcement learning used to implement the bot is called Q-learning. Q-learning is an active learning method. In active learning, an agent needs to decide on an action to take, compared to passive learning where the policy is fixed: in state S it always performs action A. Q-learning stores action utilities rather than state utilities, representing the expected utility of each action instead of storing a utility function over outcomes for each state. Q-learning is also a model-free method, which means that it does not need an explicit model of its environment; more specifically, the agent does not need a transition model. The agent searches for optimal policies by predicting the value function of a policy, without a model of the environment (Russell & Norvig, 2010). In simple terms, this is the Q-learning algorithm that is implemented in the bot (Zhou, 2018):

- Initialize the state s.
- Choose an action a in state s.
- Take action a, observe the reward r and the next state s'.
- Update the learning table entry for action a in state s by adding the reward and the TD error.
- Repeat until the last state has been reached; in the case of StarCraft II, until a match has ended.
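Written out, the update in the fourth step takes the standard tabular Q-learning form (a textbook formulation, consistent with the learn method in the appendix; here alpha is the learning rate and gamma the reward decay discussed in the implementation section):

    Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]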

THE GAME

THE BOT

Because of the complexity of StarCraft II, the bot has certain hard-coded limitations. The bot will only play as one of the three available races, the Terrans. The Terrans have fourteen different buildings and sixteen different types of soldiers available. The bot is limited to two types of buildings, barracks and supply depots, and to one type of soldier, the marine. Barracks are used to build marines; supply depots raise the number of marines that can be recruited, and at least one supply depot must have been built before barracks can be built. The marine is the most basic kind of soldier available to the Terrans. Both players start with a command center and a certain number of SCV units, which gather resources and construct buildings. The command center has functions that are not used by the bot. The opponent is set to the easiest difficulty.

THE MAP

The bot plays on Simple64, the simplest map available in PySC2. It is a 64x64 map consisting of two symmetrical plateaus, where the starting bases are located, and a valley in between.

All units and buildings have a vision range, and the bot sees only what its units and buildings see; fog covers the rest of the map. The bot's units and buildings are represented by green squares on the map, and the opponent's buildings and units are represented as blue squares.

Figure 2. The map.

2. IMPLEMENTATION

The implementation consists of the PySC2 environment and the bot. The agent class is constructed from a tutorial by Steven Brown (2018) and the Q-learning algorithm is taken from a tutorial by Morvan Zhou (2018). This is enough code to produce a functioning bot, and it is contained within one file, sparse_agent.py. The bot contains two classes, QLearningTable and SparseAgent. First, a number of variables are declared that correspond to the game. For example,

    _NO_OP = actions.FUNCTIONS.no_op.id

tells the bot to do nothing, which can seem strange, but there are cases where doing nothing is beneficial, such as when waiting for a marine to be recruited. Outside of the classes is a for-loop. This loop reduces the bot's model of the minimap from a 64x64 grid to a coarse 2x2 grid of four attack squares. The bot uses the game's minimap to issue attack orders, and such a coarse grid greatly reduces the number of possible actions. The marines' attack range is large enough that the entire map can still be covered from these four squares; the loop is shown below.
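The loop is short; this is a condensed version, taken from the appendix with the PySC2 identifier casing restored:

    # Keep one minimap coordinate per 32x32 block and register it as an attack action,
    # giving four possible attack targets in total.
    for mm_x in range(0, 64):
        for mm_y in range(0, 64):
            if (mm_x + 1) % 32 == 0 and (mm_y + 1) % 32 == 0:
                smart_actions.append(ACTION_ATTACK + '_' + str(mm_x - 16) + '_' + str(mm_y - 16))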

Figure 3. Mini-map (Brown, 2018).

PYSC2 PACKAGE

The environment connects the bot to the game through the Python package PySC2. A complete description of this package would require a report of its own; I encourage anyone who wants to know more to visit the PySC2 GitHub page.

QLEARNINGTABLE

In the __init__ method, certain parameters for the Q-learning algorithm are set.

learning_rate (0.01): This parameter states how strongly the bot lets new information override what it has already learned (Wikimedia Foundation, Inc, 2018).

e_greedy (0.9): This parameter decides how greedy the algorithm will be. Set to 1, the algorithm always chooses the action it currently thinks is optimal, but you want it to explore other options sometimes in order for it to discover the globally optimal policy. With a greedy factor of 0.9, the bot performs a random action one tenth of the time (Wikimedia Foundation, Inc, 2018).

reward_decay (0.9): Decides the importance of future rewards. A reward decay of 0 means that the bot focuses only on short-term gains, and a reward decay of 1 means that it is very focused on long-term gains. The bot's reward decay is fairly high at 0.9, which is appropriate because many of the actions taken in StarCraft II do not pay off until many steps later (Wikimedia Foundation, Inc, 2018).

CHOOSE_ACTION

This method tells the bot which action to choose. First it checks that the current state exists in the table; it then randomly generates a number between 0 and 1. If the number is below the greedy factor, the action with the highest score in the learning table is chosen. If the number is equal to or above the greedy factor, a random action is chosen (a condensed version of the method is shown after the description of the class).

LEARN

The core of the QLearningTable class is the learn method. This is where the bot learns and the learning table is updated. The method takes four parameters: the current state, the action taken, the reward, and the next state. It first checks that the current and next states exist, using the method check_state_exist. It then checks whether the next state is the terminal state. If not, the target is the reward plus the reward decay times the highest value among all actions in the next state:

    if s_ != 'terminal':
        q_target = r + self.gamma * self.q_table.ix[s_, :].max()

If we are in the last state, only the reward matters for the target:

    else:
        q_target = r  # next state is terminal

Then the state-action entry in the learning table is updated by adding the learning rate times the TD error.

CHECK_STATE_EXIST

This method checks whether a state exists in the learning table. If not, the state is added to the table.
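Condensed from the appendix, with the extraction-mangled identifiers restored, the epsilon-greedy selection looks like this:

    def choose_action(self, observation):
        self.check_state_exist(observation)

        if np.random.uniform() < self.epsilon:
            # Exploit: pick the highest-valued action for this state,
            # shuffling first so that ties are broken at random.
            state_action = self.q_table.ix[observation, :]
            state_action = state_action.reindex(np.random.permutation(state_action.index))
            action = state_action.idxmax()
        else:
            # Explore: pick a random action.
            action = np.random.choice(self.actions)

        return action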

SPARSE_AGENT

TRANSFORMDISTANCE

This method takes the coordinates of the bot's base and computes other positions on the map relative to that base. This method, together with transformlocation, converts map coordinates so that the bot can act as if its base were always in the same corner, regardless of the random start location, which lowers the amount of computation needed: the bot does not actually need to take the start location into account when making decisions.

TRANSFORMLOCATION

This method converts absolute x and y coordinates on the map, rather than distances relative to the bot's base, mirroring them when the base is not in the reference corner.

SPLITACTION

This is a support method; it extracts the required information from smart actions that encode several pieces of information, such as the attack coordinates. The two transform helpers are sketched below.
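A minimal sketch of the two helpers, restored from the appendix (the constant 64 reflects the 64x64 map size):

    def transformdistance(self, x, x_distance, y, y_distance):
        # Offsets are added when the base is top-left and subtracted otherwise,
        # so "build 15 to the right of the command centre" means the same thing
        # from either start location.
        if not self.base_top_left:
            return [x - x_distance, y - y_distance]

        return [x + x_distance, y + y_distance]

    def transformlocation(self, x, y):
        # Mirror absolute minimap coordinates through the map centre
        # when the base is in the bottom-right corner.
        if not self.base_top_left:
            return [64 - x, 64 - y]

        return [x, y]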

STEP

This is the major part of the agent class. Every game is called an episode. The agent acts in steps, and this method is run once for every step. First, the method checks whether this is the last step of an episode. If so, the class updates the Q-learning table and saves it to an external file in the pickle format. Saving the table to a file comes in handy because playing many games is time-consuming, and it makes it possible to stop and later continue from where the bot left off. In the last step of an episode the bot also applies the reward. The learning table is updated with the reward; the rewards are sparse: plus one for a win, zero for a draw and minus one for a loss.

The bot then checks whether it is the first step of an episode. In the first step of a game, the position of the bot's base is established, and the position of the opponent's base is inferred from it, since there are only two possible start locations. The method then keeps track of the number of friendly buildings on the map.

The smart actions are multiple actions condensed into one. These are actions that are closely connected; for example, the attack action consists of selecting units, sending them towards the coordinates to be attacked, and doing nothing. The actions are condensed to make the action space less complex: we give the bot hints about which actions go together, and every smart action contains three actions in total. The main actions the bot is capable of are:

- Do nothing: do nothing for three steps.
- Build supply depot: select an SCV, build a supply depot, and send the SCV back to gathering resources.
- Build barracks: like the command above, but the SCV builds barracks rather than a supply depot.
- Build marine: select all built barracks and fill the recruitment queue with marines.
- Attack (x, y): attack a given coordinate on the map.

We then check which step of a smart action we are currently on. If it is the first step, we set up the state so that it includes the counts of the bot's buildings and the number of marines the bot has. Then we divide the map into four quadrants and mark a quadrant as hot if enemy units can be observed in it. Again, if the base is at the bottom-right of the map, we invert the quadrants so that the perspective stays constant (this fragment is shown below). If this is not the first decision of the episode, we call the learn method of the QLearningTable class with a reward of zero, since the bot has not yet won or lost. Because this is only done on the first step of each smart action, the table is only updated every third game step during an episode. We then choose a smart action with the choose_action method of QLearningTable. The method then extracts the needed information for the first step of the smart action using splitaction and executes it. On the second and third steps of each smart action, the bot simply executes the relevant action. The bot remembers which step it is on with a counter and resets the counter on the third step of each action.
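The quadrant marking is compact enough to show directly; this is the corresponding fragment of the step method, restored from the appendix:

    current_state = np.zeros(8)
    current_state[0] = cc_count
    current_state[1] = supply_depot_count
    current_state[2] = barracks_count
    current_state[3] = obs.observation['player'][_ARMY_SUPPLY]

    # Mark each of the four minimap quadrants as "hot" if enemy units are visible in it.
    hot_squares = np.zeros(4)
    enemy_y, enemy_x = (obs.observation['minimap'][_PLAYER_RELATIVE] == _PLAYER_HOSTILE).nonzero()
    for i in range(0, len(enemy_y)):
        y = int(math.ceil((enemy_y[i] + 1) / 32))
        x = int(math.ceil((enemy_x[i] + 1) / 32))
        hot_squares[((y - 1) * 2) + (x - 1)] = 1

    # Mirror the quadrants when the base is in the bottom-right corner,
    # so the state is always expressed from the same perspective.
    if not self.base_top_left:
        hot_squares = hot_squares[::-1]

    for i in range(0, 4):
        current_state[i + 4] = hot_squares[i]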

3. RESULTS

After 2477 matches against the scripted opponent, the bot's win rate surpassed its loss rate. The bot had the following performance curve:

Figure 4. Win/draw/loss ratio (win %, loss % and draw % over the course of training).

4. REFLECTIONS

StarCraft II provides a very interesting challenge for AI because of its complexity. With only two available buildings and one unit, the bot had a small scope, and this was enough to perform decently against the easiest opponent. I suspect that it will need access to more buildings and units to be successful against the more difficult opponents, but this will increase action-space complexity in ways that are hard to predict. Corners can be cut, for example with smart actions such as those implemented in the bot described in this report.

One improvement that would drastically improve performance would be the capability for the bot to accept surrenders; currently it is unable to do so. From my own observations of the bot playing, I noticed that the opponent would often offer to surrender, which it does once certain scripted conditions are met, when it believes it can no longer win.

Because of a PySC2 limitation, the bot is unable to accept the surrender, and numerous times the game's conditions for a draw were met, or the opponent came back, resulting in either a draw or a loss for the bot where a human player could have won by pressing one button.

Another issue that was considered was the reward structure: with sparse rewards the bot generally takes more matches to learn. The author of the tutorial used to create the bot did consider alternative structures; the game has a built-in score which depends on a few things, such as enemy units destroyed (Brown, 2018). Using the score as a performance measure led to some strange behaviours, such as the bot placing its soldiers outside the enemy base and waiting for enemy soldiers to appear so that it could destroy them, further increasing the score. Winning really is the purest performance measure.

I also considered the choice of algorithm, and specifically looked at SARSA, which is similar to Q-learning with a few differences that become important when choosing between them. Russell & Norvig (2010) note that the distinction matters when the overall policy is affected by another agent, which is the case here, since the bot is playing against an opponent: the Q-learning algorithm backs up the best Q-value of the state that is reached, whereas the SARSA algorithm backs up the Q-value of the action that was actually taken (the two update targets are written out at the end of this section). The algorithm was not my own choice, since I followed a tutorial, but the choice of Q-learning is supported by the literature.

I dived head first into the StarCraft II AI world because it looked interesting, and I very soon realized that I had only been scratching the surface. There is a whole lot more to discover, and I believe that the bot's performance can be improved both by a deeper understanding of reinforcement learning and by a deeper understanding of the game and all of its intricacies.
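For reference, the two update targets can be written out in standard textbook notation (my summary, not taken from the report's code); they differ only in how the next state is evaluated:

    Q-learning target:  r + \gamma \max_{a'} Q(s', a')
    SARSA target:       r + \gamma \, Q(s', a')

Q-learning bootstraps from the greedy action in s', while SARSA bootstraps from the action a' that the policy actually takes next.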

5. REFERENCE LIST

Blizzard Entertainment. (9 January 2018). StarCraft 2. Retrieved from StarCraft 2:

Brown, S. (10 January 2018). Build a Sparse Reward PySC2 Agent. Retrieved from Medium:

DeepMind. (9 January 2018). DeepMind and Blizzard open StarCraft II as an AI research environment. Retrieved from DeepMind:

Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Boston: Pearson Education.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., ... Tsing, R. (9 January 2018). PySC2 - StarCraft II Learning Environment. Retrieved from GitHub:

Wikimedia Foundation, Inc. (10 January 2018). Q-learning. Retrieved from Wikipedia:

Wikimedia Foundation, Inc. (9 January 2018). Reinforcement Learning. Retrieved from Wikipedia:

Zhou, M. (9 January 2018). Reinforcement Learning Methods and Tutorials. Retrieved from GitHub:

6. APPENDICES

APPENDIX 1: PYTHON CODE

# Importing packages
import random
import math
import os.path

import numpy as np
import pandas as pd

from pysc2.agents import base_agent
from pysc2.lib import actions
from pysc2.lib import features

# Declaring variables for actions and unit IDs
_NO_OP = actions.FUNCTIONS.no_op.id
_SELECT_POINT = actions.FUNCTIONS.select_point.id
_BUILD_SUPPLY_DEPOT = actions.FUNCTIONS.Build_SupplyDepot_screen.id
_BUILD_BARRACKS = actions.FUNCTIONS.Build_Barracks_screen.id
_TRAIN_MARINE = actions.FUNCTIONS.Train_Marine_quick.id
_SELECT_ARMY = actions.FUNCTIONS.select_army.id
_ATTACK_MINIMAP = actions.FUNCTIONS.Attack_minimap.id
_HARVEST_GATHER = actions.FUNCTIONS.Harvest_Gather_screen.id

_PLAYER_RELATIVE = features.SCREEN_FEATURES.player_relative.index
_UNIT_TYPE = features.SCREEN_FEATURES.unit_type.index
_PLAYER_ID = features.SCREEN_FEATURES.player_id.index

_PLAYER_SELF = 1
_PLAYER_HOSTILE = 4
_ARMY_SUPPLY = 5

_TERRAN_COMMANDCENTER = 18
_TERRAN_SCV = 45
_TERRAN_SUPPLY_DEPOT = 19
_TERRAN_BARRACKS = 21
_NEUTRAL_MINERAL_FIELD = 341

_NOT_QUEUED = [0]
_QUEUED = [1]
_SELECT_ALL = [2]

DATA_FILE = 'sparse_agent_data'

ACTION_DO_NOTHING = 'donothing'
ACTION_BUILD_SUPPLY_DEPOT = 'buildsupplydepot'
ACTION_BUILD_BARRACKS = 'buildbarracks'
ACTION_BUILD_MARINE = 'buildmarine'
ACTION_ATTACK = 'attack'

# Smart actions are multiple actions in one command
smart_actions = [
    ACTION_DO_NOTHING,
    ACTION_BUILD_SUPPLY_DEPOT,
    ACTION_BUILD_BARRACKS,
    ACTION_BUILD_MARINE,
]

# This loop reduces the 64x64 mini-map to four attack squares (a 2x2 grid)
for mm_x in range(0, 64):
    for mm_y in range(0, 64):
        if (mm_x + 1) % 32 == 0 and (mm_y + 1) % 32 == 0:
            smart_actions.append(ACTION_ATTACK + '_' + str(mm_x - 16) + '_' + str(mm_y - 16))


# Stolen from
class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def choose_action(self, observation):
        self.check_state_exist(observation)

        if np.random.uniform() < self.epsilon:
            # choose best action
            state_action = self.q_table.ix[observation, :]

            # some actions have the same value
            state_action = state_action.reindex(np.random.permutation(state_action.index))

            action = state_action.idxmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)

        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        self.check_state_exist(s)

        q_predict = self.q_table.ix[s, a]

        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, :].max()
        else:
            q_target = r  # next state is terminal

        # update
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to q table
            self.q_table = self.q_table.append(
                pd.Series([0] * len(self.actions), index=self.q_table.columns, name=state))


class SparseAgent(base_agent.BaseAgent):
    def __init__(self):
        super(SparseAgent, self).__init__()

        self.qlearn = QLearningTable(actions=list(range(len(smart_actions))))

        self.previous_action = None
        self.previous_state = None

        self.cc_y = None
        self.cc_x = None

        self.move_number = 0

        if os.path.isfile(DATA_FILE + '.gz'):
            self.qlearn.q_table = pd.read_pickle(DATA_FILE + '.gz', compression='gzip')

    def transformdistance(self, x, x_distance, y, y_distance):
        if not self.base_top_left:
            return [x - x_distance, y - y_distance]

        return [x + x_distance, y + y_distance]

    def transformlocation(self, x, y):
        if not self.base_top_left:
            return [64 - x, 64 - y]

        return [x, y]

    def splitaction(self, action_id):
        smart_action = smart_actions[action_id]

        x = 0
        y = 0
        if '_' in smart_action:
            smart_action, x, y = smart_action.split('_')

        return (smart_action, x, y)

    def step(self, obs):
        super(SparseAgent, self).step(obs)

        if obs.last():
            reward = obs.reward

            with open(DATA_FILE + '_rewards.txt', 'a') as myfile:
                myfile.write(str(reward) + "\n")

            self.qlearn.learn(str(self.previous_state), self.previous_action, reward, 'terminal')

            self.qlearn.q_table.to_csv(DATA_FILE + '.csv')
            self.qlearn.q_table.to_pickle(DATA_FILE + '.gz', 'gzip')

            self.previous_action = None
            self.previous_state = None

            self.move_number = 0

            return actions.FunctionCall(_NO_OP, [])

        unit_type = obs.observation['screen'][_UNIT_TYPE]

        if obs.first():
            player_y, player_x = (obs.observation['minimap'][_PLAYER_RELATIVE] == _PLAYER_SELF).nonzero()
            self.base_top_left = 1 if player_y.any() and player_y.mean() <= 31 else 0

            self.cc_y, self.cc_x = (unit_type == _TERRAN_COMMANDCENTER).nonzero()

        cc_y, cc_x = (unit_type == _TERRAN_COMMANDCENTER).nonzero()
        cc_count = 1 if cc_y.any() else 0

        depot_y, depot_x = (unit_type == _TERRAN_SUPPLY_DEPOT).nonzero()
        supply_depot_count = int(round(len(depot_y) / 69))

        barracks_y, barracks_x = (unit_type == _TERRAN_BARRACKS).nonzero()
        barracks_count = int(round(len(barracks_y) / 137))

        if self.move_number == 0:
            self.move_number += 1

            current_state = np.zeros(8)
            current_state[0] = cc_count
            current_state[1] = supply_depot_count
            current_state[2] = barracks_count
            current_state[3] = obs.observation['player'][_ARMY_SUPPLY]

            hot_squares = np.zeros(4)
            enemy_y, enemy_x = (obs.observation['minimap'][_PLAYER_RELATIVE] == _PLAYER_HOSTILE).nonzero()
            for i in range(0, len(enemy_y)):
                y = int(math.ceil((enemy_y[i] + 1) / 32))
                x = int(math.ceil((enemy_x[i] + 1) / 32))

                hot_squares[((y - 1) * 2) + (x - 1)] = 1

            if not self.base_top_left:
                hot_squares = hot_squares[::-1]

            for i in range(0, 4):
                current_state[i + 4] = hot_squares[i]

            if self.previous_action is not None:
                self.qlearn.learn(str(self.previous_state), self.previous_action, 0, str(current_state))

            rl_action = self.qlearn.choose_action(str(current_state))

            self.previous_state = current_state
            self.previous_action = rl_action

            smart_action, x, y = self.splitaction(self.previous_action)

            if smart_action == ACTION_BUILD_BARRACKS or smart_action == ACTION_BUILD_SUPPLY_DEPOT:
                unit_y, unit_x = (unit_type == _TERRAN_SCV).nonzero()

                if unit_y.any():
                    i = random.randint(0, len(unit_y) - 1)
                    target = [unit_x[i], unit_y[i]]

                    return actions.FunctionCall(_SELECT_POINT, [_NOT_QUEUED, target])

            elif smart_action == ACTION_BUILD_MARINE:
                if barracks_y.any():
                    i = random.randint(0, len(barracks_y) - 1)
                    target = [barracks_x[i], barracks_y[i]]

                    return actions.FunctionCall(_SELECT_POINT, [_SELECT_ALL, target])

            elif smart_action == ACTION_ATTACK:
                if _SELECT_ARMY in obs.observation['available_actions']:
                    return actions.FunctionCall(_SELECT_ARMY, [_NOT_QUEUED])

        elif self.move_number == 1:
            self.move_number += 1

            smart_action, x, y = self.splitaction(self.previous_action)

            if smart_action == ACTION_BUILD_SUPPLY_DEPOT:
                if supply_depot_count < 2 and _BUILD_SUPPLY_DEPOT in obs.observation['available_actions']:
                    if self.cc_y.any():
                        if supply_depot_count == 0:
                            target = self.transformdistance(round(self.cc_x.mean()), -35, round(self.cc_y.mean()), 0)
                        elif supply_depot_count == 1:
                            target = self.transformdistance(round(self.cc_x.mean()), -25, round(self.cc_y.mean()), -25)

                        return actions.FunctionCall(_BUILD_SUPPLY_DEPOT, [_NOT_QUEUED, target])

            elif smart_action == ACTION_BUILD_BARRACKS:
                if barracks_count < 2 and _BUILD_BARRACKS in obs.observation['available_actions']:
                    if self.cc_y.any():
                        if barracks_count == 0:
                            target = self.transformdistance(round(self.cc_x.mean()), 15, round(self.cc_y.mean()), -9)
                        elif barracks_count == 1:
                            target = self.transformdistance(round(self.cc_x.mean()), 15, round(self.cc_y.mean()), 12)

                        return actions.FunctionCall(_BUILD_BARRACKS, [_NOT_QUEUED, target])

            elif smart_action == ACTION_BUILD_MARINE:
                if _TRAIN_MARINE in obs.observation['available_actions']:
                    return actions.FunctionCall(_TRAIN_MARINE, [_QUEUED])

            elif smart_action == ACTION_ATTACK:
                do_it = True

                if len(obs.observation['single_select']) > 0 and obs.observation['single_select'][0][0] == _TERRAN_SCV:
                    do_it = False

                if len(obs.observation['multi_select']) > 0 and obs.observation['multi_select'][0][0] == _TERRAN_SCV:
                    do_it = False

                if do_it and _ATTACK_MINIMAP in obs.observation["available_actions"]:
                    x_offset = random.randint(-1, 1)
                    y_offset = random.randint(-1, 1)

                    return actions.FunctionCall(_ATTACK_MINIMAP, [_NOT_QUEUED, self.transformlocation(int(x) + (x_offset * 8), int(y) + (y_offset * 8))])

        elif self.move_number == 2:
            self.move_number = 0

            smart_action, x, y = self.splitaction(self.previous_action)

            if smart_action == ACTION_BUILD_BARRACKS or smart_action == ACTION_BUILD_SUPPLY_DEPOT:
                if _HARVEST_GATHER in obs.observation['available_actions']:
                    unit_y, unit_x = (unit_type == _NEUTRAL_MINERAL_FIELD).nonzero()

                    if unit_y.any():
                        i = random.randint(0, len(unit_y) - 1)

                        m_x = unit_x[i]
                        m_y = unit_y[i]

                        target = [int(m_x), int(m_y)]

                        return actions.FunctionCall(_HARVEST_GATHER, [_QUEUED, target])

        return actions.FunctionCall(_NO_OP, [])
