COOPERATIVE STRATEGY BASED ON ADAPTIVE Q-LEARNING FOR ROBOT SOCCER SYSTEMS
Soft Computing
Alfonso Martínez del Hoyo Canterla
Table of contents
1. Introduction
2. Cooperative strategy design
   2.1 Strategy selection design
   2.2 Role assignment
   2.3 Behaviors design
3. Simulation
4. Conclusion
1. Introduction

The objective of the paper is to develop a self-learning cooperative strategy for robot soccer systems: the robots learn from successes and failures to improve their performance gradually. A robot soccer game is a suitable platform for implementing a multiagent system (MAS).

Many methods have been proposed for cooperation in MAS: genetic algorithms, neural networks, reinforcement learning. The paper uses reinforcement learning: the agent selects the action that yields the highest reward (it maximizes a numerical signal). Learning is implemented with Q-learning, a temporal-difference (TD) method.

The simplest TD update is

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

and the Q-learning update is

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Learning appears in two places in the system: in the strategy selection and in the sidekick behavior. A minimal implementation sketch of the update follows.
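As a concrete reference, here is a minimal tabular Q-learning step in Python. The state/action encoding and the hyperparameter values (alpha, gamma) are illustrative assumptions, not values taken from the paper.

    from collections import defaultdict

    ALPHA = 0.1   # learning rate (assumed value)
    GAMMA = 0.9   # discount factor (assumed value)

    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def q_update(state, action, reward, next_state, actions):
        """One Q-learning step:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(Q[(next_state, a)] for a in actions)
        td_error = reward + GAMMA * best_next - Q[(state, action)]
        Q[(state, action)] += ALPHA * td_error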
2. Cooperative strategy design

There are three aspects to the strategy design:
- Strategy selection
- Role assignment
- Behaviors of each role individually

First, the strategy selection observes the state of the environment and, after some time, observes the total reward received to judge whether the decision was good. Then the role arbiter chooses an attacker, the defenders and the sidekicks. Finally, each robot executes its individual task, and a reward is given to it individually according to its performance. A runnable skeleton of this loop is sketched below.

(figure: the complete cooperative strategy architecture)
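A minimal, runnable skeleton of the three-layer loop just described. Every function body here is a trivial stub standing in for the real decision logic of sections 2.1 to 2.3; all names and thresholds are hypothetical.

    def observe_state(env):
        return (env["ball_x"],)          # stub: state taken from the environment

    def select_strategy(state):
        # Stub: the real selector is the fuzzy/Q-learning system of section 2.1.
        return "offensive" if state[0] > 0 else "defensive"

    def arbitrate_roles(robots, strategy):
        # Stub: the real arbiter (section 2.2) picks attacker/defenders/sidekicks.
        return {r: ("attacker" if i == 0 else "defender")
                for i, r in enumerate(robots)}

    def control_step(robots, env):
        state = observe_state(env)             # 1. observe the environment
        strategy = select_strategy(state)      # 2. choose the team strategy
        roles = arbitrate_roles(robots, strategy)  # 3. assign roles
        return roles                           # 4. each robot then executes its role

    print(control_step(["r1", "r2", "r3"], {"ball_x": 0.5}))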
2.1 Strategy selection design

The strategy selection determines the number of robots in each role. It uses two pieces of information: the current state of the environment and the reward from past actions.

The environment information is fuzzified with a set of membership functions (shown in a figure omitted here). The input dimension is 3 x 4, so 12 inference rules are needed. A fuzzy value for the state (S) is obtained using Mamdani's minimum implication rule, sketched below.
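A small sketch of Mamdani-style inference, assuming triangular membership functions; the paper's actual membership functions and rule antecedents are in the omitted figure, so the fuzzy sets below are placeholders.

    def tri(x, a, b, c):
        """Triangular membership function rising from a, peaking at b, falling to c."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    def rule_strength(ball_x, score_diff):
        # Minimum implication: a rule fires with the min of its antecedent degrees.
        near_our_goal = tri(ball_x, -1.0, -0.8, -0.3)   # assumed fuzzy set
        losing = tri(score_diff, -5, -2, 0)             # assumed fuzzy set
        return min(near_our_goal, losing)               # Mamdani min rule

    print(rule_strength(ball_x=-0.7, score_diff=-1))    # -> 0.5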
The reward is then calculated (the defining equation is omitted in the source). Given the state and the reward, the system selects one of the three possible actions; a sketch of a common selection rule follows.
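The paper selects the action with the highest expected reward; the sketch below assumes a standard epsilon-greedy selection over three actions, which is a common default rather than the paper's confirmed exploration scheme.

    import random
    from collections import defaultdict

    ACTIONS = [0, 1, 2]       # the three strategy actions (labels assumed)
    EPSILON = 0.1             # exploration rate (assumed value)
    Q = defaultdict(float)    # Q[(state, action)] -> value

    def select_action(state):
        # Explore occasionally; otherwise pick the highest-valued action.
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])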
2.2 Role assignment

The main purpose is to find the proper robot to be the attacker, and to make the others defenders or sidekicks.

Finding the attacker: select the robot closest to the ball that is facing the opponent goal.
If there is no robot behind the ball, select the most suitable robot according to its distance to the ball, the direction of the ball's velocity and the possible obstacles (select the robot that maximizes Attacker_value).

Finding the defenders: select the robots closest to our goal.

Finding the sidekicks: select the remaining robots. A sketch of the whole assignment follows.
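A sketch of the assignment, where the Attacker_value weighting and the fixed two-defender split are illustrative assumptions standing in for the paper's omitted formula (the actual defender count varies from 1 to 3, chosen by the strategy).

    import math

    def attacker_value(robot_pos, ball_pos, ball_vel, n_obstacles,
                       w_dist=1.0, w_vel=0.5, w_obs=0.8):
        # Combine distance to the ball, ball velocity direction and obstacles.
        dist = math.dist(robot_pos, ball_pos)
        to_robot = (robot_pos[0] - ball_pos[0], robot_pos[1] - ball_pos[1])
        norm = math.hypot(*to_robot) or 1.0
        # Positive when the ball is moving toward this robot.
        approach = (ball_vel[0] * to_robot[0] + ball_vel[1] * to_robot[1]) / norm
        return -w_dist * dist + w_vel * approach - w_obs * n_obstacles

    def assign_roles(robots, ball_pos, ball_vel, own_goal):
        # robots: dict name -> (x, y) position
        att = max(robots, key=lambda r: attacker_value(robots[r], ball_pos, ball_vel, 0))
        rest = sorted((r for r in robots if r != att),
                      key=lambda r: math.dist(robots[r], own_goal))
        defenders, sidekicks = rest[:2], rest[2:]   # two-defender split assumed
        return att, defenders, sidekicks

    robots = {"r1": (0, 0), "r2": (1, 1), "r3": (2, 0), "r4": (3, 1)}
    print(assign_roles(robots, ball_pos=(1, 0), ball_vel=(0.2, 0), own_goal=(-3, 0)))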
2.3 Behaviors design

Common behavior (obstacle avoidance): it is modeled with charges. If a robot wants to go to the ball, all the other robots act as positrons (repulsive charges) and the ball as a negatron (attractive charge). If the robot just wants to reach a position, the ball is also a positron and the target position is the negatron. Repulsion acts within an avoidance range D. A potential-field sketch follows.
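A potential-field sketch of this behavior: repulsive "positrons" push the robot away within range D, the attractive "negatron" pulls it toward the target. The gains and the value of D are assumed, not the paper's.

    import math

    D = 0.5        # avoidance range (assumed value)
    K_ATT = 1.0    # attraction gain (assumed value)
    K_REP = 0.3    # repulsion gain (assumed value)

    def force(robot, target, obstacles):
        # Attraction toward the target ("negatron").
        fx = K_ATT * (target[0] - robot[0])
        fy = K_ATT * (target[1] - robot[1])
        # Repulsion from each obstacle ("positron"), only inside range D.
        for ox, oy in obstacles:
            dx, dy = robot[0] - ox, robot[1] - oy
            d = math.hypot(dx, dy)
            if 0 < d < D:
                fx += K_REP * dx / d**2
                fy += K_REP * dy / d**2
        return fx, fy

    print(force((0, 0), (2, 0), obstacles=[(0.3, 0.1)]))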
Attacker behavior: if the ball is close to the opponent goal, the robot tries to shoot; otherwise it passes the ball. The attacker must be behind the ball, so if it cannot kick the ball directly it uses the obstacle-avoidance behavior to get there.

Defender behavior: the number of defenders varies from 1 to 3. They are assigned to different positions (there are two defensive zones). A defender tries to block the ball when the opponent shoots, so it finds a suitable location according to the velocity of the ball.

Sidekick behavior: this role has learning ability. There can be 1 to 3 sidekicks, and their objective is to find good positions.
- States: the robot considers the angle of the attacker, the angle of the closest opponent to the ball, and whether the attacker is closer to the ball than that opponent.
- Actions: the robot chooses a position, selecting the best distance and angle with respect to the ball from a predefined set of candidate positions.
- Rewards: the sidekicks aim for positions from which they can become the attacker, or which keep the ball away from our goal. The reward is positive if they are often promoted to attacker, and negative if the ball gets close to our goal. It is obtained with a fuzzy system (rules shown in a figure omitted here). The procedure to find the final consequence is the same as before, but now the consequence in the rules is crisp, so computing the maximum directly yields a crisp value.
There is an extra reward of +25 if we kick the ball into the opponent goal, and −25 if the opponent scores against us. The final reward is the sum of the fuzzy reward and the extra reward, as sketched below.
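A sketch of how the two reward components could combine. fuzzy_reward is a stand-in for the omitted fuzzy system, and its magnitudes are assumptions; only the +/-25 goal bonus and the final summation come from the source.

    GOAL_BONUS = 25

    def fuzzy_reward(became_attacker, ball_near_our_goal):
        # Stand-in for the fuzzy system: magnitudes are assumed.
        r = 0.0
        if became_attacker:
            r += 10.0        # being promoted to attacker is rewarded
        if ball_near_our_goal:
            r -= 10.0        # letting the ball near our goal is penalized
        return r

    def sidekick_reward(became_attacker, ball_near_our_goal,
                        we_scored, opponent_scored):
        r = fuzzy_reward(became_attacker, ball_near_our_goal)
        if we_scored:
            r += GOAL_BONUS
        if opponent_scored:
            r -= GOAL_BONUS
        return r

    print(sidekick_reward(True, False, we_scored=True, opponent_scored=False))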
3. Simulation

The robot team played 100 matches against a team using a non-learning strategy. All matches were run in simulation.

(figure: goals scored by our team)
(figure: goals scored by the opponent)
(figure: difference of goals)

(table: summary of the results)
4. Conclusion

The main objective of the fuzzy system is to evaluate the rewards and the states. It gives flexibility: the membership functions can be changed to modify the way the robots cooperate and coordinate with each other, without large modifications to the strategy architecture. The learning itself is carried out by the Q-learning algorithm. The cooperative strategy is effective: the results show a rising tendency, i.e. the team's performance improves over the course of the matches.