Modular Q-learning based multi-agent cooperation for robot soccer


Robotics and Autonomous Systems 35 (2001)

Modular Q-learning based multi-agent cooperation for robot soccer

Kui-Hong Park, Yong-Jae Kim, Jong-Hwan Kim

Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Kusong-dong, Yusong-gu, Taejon-shi, South Korea

Received 8 August 2000; received in revised form 12 February 2001
Communicated by F.C.A. Groen

Abstract

In a multi-agent system, action selection is important for the cooperation and coordination among agents. As the environment is dynamic and complex, modular Q-learning, one of the reinforcement learning schemes, is employed to assign a proper action to each agent in the multi-agent system. The architecture of modular Q-learning consists of learning modules and a mediator module. The mediator module selects a proper action for the agent based on the Q-values obtained from the learning modules. To obtain better performance, the mediator module also considers state information, along with the Q-value, in the action selection process. A uni-vector field is used for robot navigation. In the robot soccer environment, the effectiveness and applicability of modular Q-learning and the uni-vector field method are verified by real experiments using five micro-robots. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Multi-agent system; Robot soccer system; Reinforcement learning; Modular Q-learning; Action selection

1. Introduction

Multi-agent systems are expected to perform tasks that are complex and difficult, which requires cooperation and coordination among the agents [3,9]. Developing a multi-agent system amounts to finding a method for implementing an intelligent system composed of multiple agents, each with independent motion control, that cooperate with one another.
* Corresponding author. E-mail addresses: khpark@vivaldi.kaist.ac.kr (K.-H. Park), johkim@vivaldi.kaist.ac.kr (J.-H. Kim).

Multi-agent systems are more flexible and fault tolerant, as several simple robot agents are easier to handle and cheaper to build than a single powerful robot that can carry out different tasks [7]. From the standpoint of multi-agent systems, robot soccer is a good example of a real-world problem that can be moderately well modeled. The soccer game differs from other multi-agent systems in that the robots of one team have to cooperate while competing with the opponent team. The cooperative and competitive strategies used play a major role in a robot soccer system [10]. The related research issues are quite wide, and are associated with hardware configuration, software implementation, agent/robot communication, sensor fusion and learning, to mention a few. The action of the robot is usually selected by considering some conditions in the robot soccer

environment [12]. However, it is not possible to describe all the situations of a robot soccer game by condition statements. Moreover, as the environment under consideration is dynamic and complex, reinforcement learning should be employed for the selection of the proper action. In reality, it is very difficult to obtain a model of the robot soccer game; with reinforcement learning, the agent learns its own actions instead [1,2]. Reinforcement learning is the problem faced by an agent that learns its behavior through trial-and-error interactions with a dynamic environment [5,6]. The agent only knows the possible states and actions, not the transition probabilities or the reward structure [11]. Among the reinforcement learning methods, Q-learning can be used here as it is applicable where no model of the environment is available [8,16]. In this paper, modular Q-learning is applied to improve the performance of the team playing in the NaroSot (Nano-Robot World Cup Soccer Tournament) category of FIRA (Federation of International Robot Soccer Association), where five robots of size 4 cm × 4 cm × 5.5 cm form a team. Modular Q-learning is one of the reinforcement learning schemes, in which the mediator module selects the proper action of a robot based on the Q-value obtained from each learning module. When selecting the proper action, state information, such as the distance between the ball and the robot and the angle between the robot heading and the desired heading, is also considered in the mediator module along with the Q-value to improve the learning performance. The concept of coupled agents is proposed to resolve conflicts between robots when the ball is located in a boundary region. A uni-vector field method is used for the navigation of the robot.
In the robot soccer environment, the effectiveness and applicability of modular Q-learning and the uni-vector field method are verified by real experiments using the five micro-robots of team Y2K2, runner-up at the FIRA Robot World Cup Brazil '99. Section 2 describes the robot soccer system, the structure of the robot, the uni-vector field for navigation and basic actions, and robot soccer strategies. This is followed by modular Q-learning and its implementation for robot soccer in Section 3. The experimental results are presented in Section 4 and concluding remarks are given in Section 5.

2. Robot soccer system

2.1. NaroSot robot soccer system

The micro-robot soccer system, which comprises robots, an overhead vision system and a host computer, is used as a practical test bed for developing multi-agent systems and multiple robot systems. The complexity of the robot soccer system comes from the cooperation with the home team robots, the competition with the opponent team robots, and the fast and precise control of each robot while tracking the ball, which is the passive constituent of the dynamic environment. We now describe the NaroSot (Nano-Robot World Cup Soccer Tournament) system, one of the categories of the FIRA games. In the NaroSot category, each team has five robots of size 4 cm × 4 cm × 5.5 cm. The pitch is 150 cm × 90 cm in size and a ping-pong ball is used. Fig. 1 shows NaroSot robots and a ping-pong ball in the playground. Due to the size limitation, encoders are not used and only vision information is used as feedback; hence, precise and fast robot control is difficult. The host computer receives the vision signals and uses them to compute the strategy routine and the command velocities, which are then sent to the robots. The strategy routine selects a proper action for each robot considering the game situation.
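The vision-strategy-command pipeline just described can be sketched as a single sampling cycle. `get_vision_data`, `strategy`, and the returned wheel commands are hypothetical stand-ins, since the paper gives no API:

```python
from dataclasses import dataclass

@dataclass
class Posture:
    x: float      # cm, global coordinates on the 150 cm x 90 cm pitch
    y: float      # cm
    theta: float  # rad, heading angle

def get_vision_data():
    """Stand-in for the overhead vision system: ball position and the
    five robot postures (normally extracted from camera frames)."""
    ball = (75.0, 45.0)
    robots = [Posture(10.0 + 15.0 * i, 45.0, 0.0) for i in range(5)]
    return ball, robots

def strategy(ball, robots):
    """Stand-in strategy routine: assign each robot a wheel-velocity
    command (V_L, V_R). The real routine runs the role allocation and
    action selection described in Sections 3 and 4."""
    return [(5.0, 5.0) for _ in robots]

def control_cycle():
    """One sampling period of the host computer: vision in, one
    wheel-velocity command per robot out (sent over the RF link)."""
    ball, robots = get_vision_data()
    return strategy(ball, robots)

commands = control_cycle()  # one (V_L, V_R) pair per robot
```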
Fig. 1. The NaroSot robots.

The robot receives the velocity data sent from the host computer through the RF (radio frequency) link and controls its motor velocities using the command data. The developed robots have two centrally aligned wheels, which are easy to control. The width D between the two wheels of the robot is 3.5 cm and the radius R of each wheel is 1.0 cm. Each robot is composed

of four parts: a micro-controller part, an RF communication module, two DC motors with motor driving chips, and a power supply unit. The micro-controller PIC16C73A is used for processing the command data and for computing the motor control using two PWM signals. The RF module is used for communication between the host computer and the robot. The motors have a 6:1 gear ratio and no encoders. Rechargeable 9.6 V cells are used as the power supply, and a regulator supplies the logic power.

Fig. 2. Robot modeling.

Two-wheeled mobile robots are considered under the assumptions of non-slipping and pure rolling [4]. The kinematics can be derived using Fig. 2, where $X$, $Y$ are the global coordinates. The posture $P$ and position $p_c$ of the robot are defined as

$$P = \begin{bmatrix} x_c \\ y_c \\ \theta_c \end{bmatrix}, \qquad p_c = \begin{bmatrix} x_c \\ y_c \end{bmatrix}, \tag{1}$$

where $(x_c, y_c)$ is the position of the robot center and $\theta_c$ the heading angle of the robot with respect to the global coordinates. The velocity vector $S$ is defined as

$$S = \begin{bmatrix} \upsilon \\ \omega \end{bmatrix} = \begin{bmatrix} \dfrac{V_R + V_L}{2} \\[4pt] \dfrac{V_R - V_L}{D} \end{bmatrix} = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ -\tfrac{1}{D} & \tfrac{1}{D} \end{bmatrix} \begin{bmatrix} V_L \\ V_R \end{bmatrix}, \tag{2}$$

where $\upsilon$ is the translational velocity of the robot center, $\omega$ the rotational velocity about the robot center, $V_L$ the left wheel velocity and $V_R$ the right wheel velocity. The translational and rotational velocities are thus obtained from the two wheel velocities. The velocity vector $S$ and the posture vector $P$ are related through the robot kinematics as

$$\dot P = \begin{bmatrix} \dot x_c \\ \dot y_c \\ \dot\theta_c \end{bmatrix} = \begin{bmatrix} \cos\theta_c & 0 \\ \sin\theta_c & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \upsilon \\ \omega \end{bmatrix} = \begin{bmatrix} \cos\theta_c & 0 \\ \sin\theta_c & 0 \\ 0 & 1 \end{bmatrix} S. \tag{3}$$

2.2. Uni-vector field navigation

Fig. 3 shows the proposed uni-vector field, where each tiny circle with a short dash attached denotes a robot position, the attached straight line representing its heading direction [13]. A slightly bigger version of the same symbol is used in the figure to represent the initial position of each of the five robots.
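The kinematic relations (1)-(3) above can be checked with a short Euler-integration sketch; the 18 ms step is the sampling period reported in Section 4, and the numeric wheel speeds are purely illustrative:

```python
import math

D = 3.5  # cm, distance between the two wheels (Section 2.1)

def wheel_to_body(v_l, v_r):
    """Eq. (2): left/right wheel velocities -> translational velocity v
    of the robot center and rotational velocity w about it."""
    v = (v_r + v_l) / 2.0
    w = (v_r - v_l) / D
    return v, w

def integrate_pose(x, y, theta, v, w, dt):
    """Eq. (3) advanced by one Euler step:
    x' = v cos(theta), y' = v sin(theta), theta' = w."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + w * dt)

# Equal wheel speeds give pure translation along the current heading.
v, w = wheel_to_body(5.0, 5.0)
x, y, th = integrate_pose(0.0, 0.0, 0.0, v, w, 0.018)  # one 18 ms period
```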
A vector field at position $p = (x, y)$ is denoted $F(p)$ or $F(x, y)$. It is assumed that the magnitude of the vector field is unity and is the same at all points [14]. The vector field at robot position $p$ is generated by

$$\angle F(p) = \angle\overrightarrow{pg} - n\phi \quad \text{with} \quad \phi = \angle\overrightarrow{pr} - \angle\overrightarrow{pg}, \tag{4}$$

where $n$ is a positive constant. The larger $n$ is, the smaller $\angle F(p)$ is at the same robot position. Thus, if $n$ increases, the uni-vector field spreads out over a larger area, making the path the robot traverses in reaching its goal longer. The shape of the field and the turning motion of the robots change according to the parameter $n$ and the length of the line $\overline{gr}$. The proposed uni-vector field method is based on (4), through which the vector field at all points can be obtained. In Fig. 3, $g$ represents the target position of the robots. A dummy point $r$ is used for deriving the vector field and is selected heuristically close to the goal point $g$; in practical applications, the point $g$ will be the position of the ball. The following relationships are used to reduce the error in angle between the robot and the field vector:

$$\omega = K_P \theta_e + K_D \dot\theta_e, \qquad \theta_e = \angle F(p) - \theta_c, \qquad \dot\theta_e = \frac{d\theta_e}{dt}, \tag{5}$$

where $F(p)$ is the vector field at position $p$ with unit magnitude, $\theta_e$ the error in angle between the robot

Fig. 3. Uni-vector field method.

heading and the field vector direction, $\dot\theta_e$ the derivative of $\theta_e$, $K_P$ the proportional feedback gain, and $K_D$ the derivative feedback gain. The translational velocity $\upsilon$ is constant. If $\upsilon = 0$, the robot's heading angle turns towards the direction of $F(p)$ without any change in position. As indicated by (5), the robot motion is controlled through its right and left wheel velocities, which are functions of time:

$$V_R = V_C + K_P \theta_e + K_D \dot\theta_e, \qquad V_L = V_C - K_P \theta_e - K_D \dot\theta_e, \tag{6}$$

where $V_C$ is the constant robot center velocity. The robot's vector field is oriented towards the target position, and the associated robot motion is as shown in Fig. 3.

3. Implementation of modular Q-learning

3.1. Modular Q-learning

Q-learning is a reinforcement learning algorithm that does not need a model of the environment and can be used on-line. Q-learning algorithms store the expected reinforcement value associated with each state-action pair, usually in a look-up table. However, in applying Q-learning to a multi-agent system there is a serious difficulty: the dimension of the state space grows exponentially with the number of agents. For example, consider two agents engaged in a joint task, where a joint task implies two agents working together to find an optimal way to kick the ball. If a single agent needs $10^3$ states for learning, then in the joint task the total number of states grows to $10^6$. As actions are needed in every state, multi-agent learning also needs more memory space. Such an application of Q-learning to multi-agent learning problems thus results in an explosion of the state and memory space. To overcome this problem, modular Q-learning is employed.
Fig. 4 shows the architecture of modular Q-learning [15]. The architecture consists of learning modules, whose number equals the number of agents involved in the task, and a mediator module. Each agent carries out Q-learning in the environment through its learning module. Within a learning module, learning concentrates on a single agent; the learning of the other agents is not considered. To achieve the global goal, a mediator module is needed to arbitrate the results of the learning modules. The mediator module makes the final decision and selects the most suitable action based on the Q-value

Fig. 4. Modular Q-learning architecture.

received from each learning module. In [15], the mediator module makes this selection by considering the highest Q-value received from the learning modules; this selection method is called the greatest mass merging strategy. However, in a real experimental environment, convergence to the optimal Q-value within a finite number of iterations is often not possible. It is therefore desirable to select the most suitable action by considering an appropriate function calculated from the Q-value and the state information. In this paper, the following function is used to make the final decision in the mediator module:

$$\arg\max_a f(Q_i(s_i, a), \theta_i, d_i), \tag{7}$$

where $a$ is the action of the agent and $i$ the index of the learning module. The Q-value $Q_i$ is obtained from the learning module, and $\theta_i$ is calculated as $(90 - \theta_e)$. If Robot 1 and Robot 2 form a coupled agent, then with $d_1$ the distance between Robot 1 and the ball and $d_2$ the distance between Robot 2 and the ball, $d_i$ for Robot 1 is computed as $d_i = d_2 - d_1$. Note that $\theta_i$ and $d_i$ are considered in the mediator module when selecting the final action.

In a robot soccer game, robots play the roles of attackers, defenders and goalie. In the robot soccer system implemented in the NaroSot category, there are two attackers, two defenders and a goalie. All attackers and defenders have only two actions: shoot, or follow the uni-vector field. In the uni-vector field, the target point is the position of the ball. The action selection layer, as a coordinator, selects the shoot action when the robot is in a good position to do so. Under normal conditions, robots follow the uni-vector field. A robot following the uni-vector field selects the shoot action when its longitudinal position is within the boundary of the ball.
In the shoot action, the target to which the robot kicks the ball is the center of the opponent goal area. The velocity of a robot in the shoot action is faster than its velocity while following the uni-vector field. The goalie has its own actions within the goal area for defending the goal.

The role allocation layer, at a higher level, selects the role of a robot according to the situation. The implemented robot soccer system uses a relatively fixed role allocation scheme with a (1 goalie, 2 defenders, 2 attackers) formation strategy. The term 'relatively fixed' is used because the zones of the robots, though fixed in most cases, are changed for a short interval in some specific situations. In the zone defense scheme, a conflict can arise among the home robots near the boundary regions; this is classified as a non-blocked situation. A blocked situation, on the other hand, is when a home robot is blocked by opponent robots. With respect to the ball position, there are three cases in the non-blocked situation. Fig. 5 shows the three boundary regions corresponding to the three cases in the non-blocked situation.

3.2. Implementation

Modular Q-learning is employed in the robot soccer system to improve cooperation among the robots of the team so as to carry out zone defense strategies.

Fig. 5. The three learning regions in the non-blocked situation.
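The two-action scheme for attackers and defenders can be sketched as follows. The paper states the shoot condition only qualitatively ("longitudinal position within the boundary of the ball"), so the alignment test and its tolerance here are assumptions:

```python
def select_action(robot_y, ball_y, ball_width):
    """Sketch of the action-selection layer for attackers and defenders:
    follow the uni-vector field toward the ball by default, and switch to
    the (faster) shoot action when the robot is longitudinally lined up
    with the ball. The one-ball-width tolerance is an assumption."""
    aligned = abs(robot_y - ball_y) < ball_width
    return "shoot" if aligned else "follow_field"

print(select_action(45.2, 45.0, 4.0))  # -> shoot
print(select_action(60.0, 45.0, 4.0))  # -> follow_field
```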

Fig. 6. The coupled agent.

To apply modular Q-learning to the robot soccer system, the concept of coupled agents is introduced, as shown in Fig. 6. When the ball is within the boundary region of two robots, both robots will be in a position to kick the ball, which may lead to a collision between them. To solve this problem, a coupled agent composed of these two robots is formed. For example, if the ball is located in Region 1, Attacker 1 and Defender 1 are considered as a coupled agent. The mediator module assigns an action to each robot of a coupled agent based on (7); the action is either to kick the ball or to maintain the current position.

For learning, the initial Q-values are randomly drawn from the range [0, 0.02]. The learning rate α is set to 0.9 for fast convergence during the learning process. The discount factor γ is set to 0.3, a relatively low value, to reduce the possible noise effect, which arises because γ multiplies the maximum Q-value of the next state. The reason for choosing a low discount factor is that in real experiments the robot can kick the ball unexpectedly; in such cases the Q-value is updated as if a reward were earned, which is not desirable for precise learning. As compensation, γ is therefore set to a low value.

3.2.1. Non-blocked situation

First, consider Region 1, where a state in the learning module of an individual agent consists of five components:

1. The robot location: two levels (either Area 1 or Area 3 is occupied by the robot).
2. The difference in distance d: four levels.
3. The angle error between the robot heading direction and the ball: three levels.
4. A binary flag: 1 if $R_x > B_x$ and $(B_y - \tfrac{4}{3}B_w) < R_y < (B_y + \tfrac{4}{3}B_w)$, 0 otherwise.
5. A binary flag: 1 if the other robot of the coupled agent is to kick the ball, 0 otherwise.
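The five components above multiply out to 2 × 4 × 3 × 2 × 2 = 96 states per learning module. A minimal sketch of such a module, using the learning parameters quoted above; the update rule is the standard tabular Q-learning form, which the paper does not spell out, and the example state is arbitrary:

```python
import random

ALPHA, GAMMA = 0.9, 0.3   # learning rate and discount factor (Section 3.2)
LEVELS = (2, 4, 3, 2, 2)  # levels of the five state components
N_ACTIONS = 2             # kick the ball, or keep the current position

def encode_state(levels):
    """Mixed-radix encoding of the five component levels into one index."""
    idx = 0
    for level, base in zip(levels, LEVELS):
        assert 0 <= level < base
        idx = idx * base + level
    return idx

n_states = 1
for base in LEVELS:
    n_states *= base  # 96 states, as stated for Region 1

# Initial Q-values drawn uniformly from [0, 0.02], as in the paper.
Q = [[random.uniform(0.0, 0.02) for _ in range(N_ACTIONS)]
     for _ in range(n_states)]

def q_update(s, a, r, s_next):
    """Standard tabular Q-learning update, run inside each learning module."""
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])

s = encode_state((1, 2, 0, 1, 0))  # an arbitrary example state
q_update(s, 0, 1.0, encode_state((1, 2, 1, 1, 0)))
```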
With $d_1$ the distance between Robot 1 and the ball and $d_3$ the distance between Robot 3 and the ball, $d$ is computed as the difference $d = d_1 - d_3$; thus $d$ is in fact the negative of $d_i$ in Eq. (7). In the learning module of Robot 1, $R_x$ ($R_y$) is the X (Y) coordinate of the robot, $B_x$ ($B_y$) is the X (Y) coordinate of the ball, and $B_w$ is the width of the ball. In the learning module of Robot 3, $d = d_3 - d_1$. Fig. 7(a)-(e) shows each component of the state for learning in the non-blocked situation.

There are 96 states in each of the three cases of the non-blocked situation. For example, in Region 1, Robot 1 and Robot 3 form the coupled agent and each robot has a learning module with 96 states. The mediator module selects the final action of each robot of the coupled agent. In Region 2, the first component of the state is either Area 1 or Area 2, and the second component is $d = d_1 - d_2$ from the viewpoint of Robot 1 and $d = d_2 - d_1$ from the viewpoint of Robot 2, where $d_1$ is the distance between Robot 1 and the ball and $d_2$ the distance between Robot 2 and the ball. The states of Region 3 are similar to those of Region 1, as the two regions are symmetric.

Table 1 lists the actions which the coupled agent can select in Regions 1 and 2. For example, if Action 1 is selected in Region 1, Robot 1 will be Attacker 1 and Robot 3 will be Defender 1. When the ball moves to Region 3, Robot 2 becomes Attacker 2 and Robot 4 assumes defense, becoming Defender 2. The reward is assigned as

$$r = \frac{a}{t_1 + t_{\mathrm{const1}}} + \frac{b}{t_2 + t_{\mathrm{const2}}}, \tag{8}$$

Fig. 7. State components in Region 1.

Table 1
Actions of the coupled agent in the non-blocked situation

Region 1                          Region 2
Action  Robot 1     Robot 3       Action  Robot 1     Robot 2
1       Attacker 1  Defender 1    1       Attacker 1  Attacker 2
2       Attacker 1  Kicker        2       Attacker 1  Kicker
3       Kicker      Defender 1    3       Kicker      Attacker 2

where $t_1$ is the time taken by the kicking robot to kick the ball and $t_2$ the time taken by the ball to reach the boundary of any other robot situated outside the learning region; $a$, $b$, $t_{\mathrm{const1}}$ and $t_{\mathrm{const2}}$ are constants. $t_{\mathrm{const1}}$ and $t_{\mathrm{const2}}$ prevent the reward from growing without bound as $t_1$ and $t_2$ approach zero.

3.2.2. Blocked situation

The states and the reward in the blocked situation are similar to those of the non-blocked situation. In the zone defense scheme, there are two cases in which the attacker cannot execute its own role because it is blocked. Fig. 8 shows the two cases of the blocked situation. Consider Region 1, where a state in the learning module consists of four components:

1. The robot location: two levels (either Area 2 or Area 3).
2. A binary flag: 1 if $R_y < B_y$, 0 otherwise.
3. The blocking flag, the difference-in-distance level (four levels) and the angle-error level (three levels): 13 levels.
4. A binary flag: 1 if the other robot of the coupled agent is to kick the ball, 0 otherwise.

Here $R_y$ is the Y coordinate of the position of the blocked robot and $B_y$ the Y coordinate of the ball.

Fig. 8. Two cases of the blocked situation.
Fig. 9. State components in the blocked situation.
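The reward (8) and its blocked-situation counterpart, Eq. (9) of Section 3.2.2, can be written directly; the default constants are the values reported with the experiments in Section 4:

```python
def reward_non_blocked(t1, t2, a=12000.0, b=6.0, t_const1=18.0, t_const2=3.0):
    """Eq. (8): the reward is larger the faster the kicker reaches the ball
    (t1) and the faster the kicked ball reaches another robot's boundary
    (t2); t_const1 and t_const2 keep the reward finite as t1, t2 -> 0."""
    return a / (t1 + t_const1) + b / (t2 + t_const2)

def reward_blocked(t, a=20000.0, t_const1=18.0):
    """Eq. (9): single-term variant used in the blocked situation."""
    return a / (t + t_const1)

# A faster kick earns a larger reward in both situations.
assert reward_non_blocked(100.0, 10.0) > reward_non_blocked(500.0, 10.0)
assert reward_blocked(100.0) > reward_blocked(500.0)
```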

Table 2
Actions of the coupled agent in the blocked situation

Region 1                          Region 2
Action  Robot 2     Robot 3       Action  Robot 1     Robot 4
1       Attacker 2  Defender 1    1       Attacker 1  Defender 2
2       Attacker 2  Kicker        2       Attacker 1  Kicker
3       Kicker      Defender 1    3       Kicker      Defender 2

In Region 2, the first component of the state is either Area 1 or Area 4. Fig. 9(a)-(d) shows each component of the state for learning in the blocked situation. There are 104 states in each of the two cases of the blocked situation. Table 2 shows the actions which the coupled agent can take in Regions 1 and 2; in Region 2, the actions are similar to those of the coupled agent in Region 1. The reward is assigned as

$$r = \frac{a}{t + t_{\mathrm{const1}}}, \tag{9}$$

where $t$ is the time taken by the kicking robot to kick the ball, and $a$ and $t_{\mathrm{const1}}$ are constants.

4. Experimental results

The mediator module comes into play when both robots of a coupled agent tend to kick the ball. The action of the coupled agent is selected by considering the Q-values obtained from the learning modules and the state information; the angle error and the distance to the other robot of the coupled agent are used as the state information. In the selection equation (7) of the mediator module, $f(Q_i(s_i, a), \theta_i, d_i)$ is given by

$$f(Q_i(s_i, a), \theta_i, d_i) = \eta_{Q_i} Q_i(s_i, a) + \eta_{\theta_i} \theta_i + \eta_{d_i} d_i, \tag{10}$$

where $\eta_{Q_i}$, $\eta_{\theta_i}$ and $\eta_{d_i}$ are constant coefficients; in the experiments the values 0.5, 0.3 and 0.2, respectively, were used. The mediator module selects the final action of each robot of the coupled agent based on this modified Q-value and the state information. The sampling time used in the real robot soccer system is 18 ms.

4.1. Non-blocked situation

In Eq. (8), the reward constants $a = 12{,}000$, $b = 6$, $t_{\mathrm{const1}} = 18$ and $t_{\mathrm{const2}} = 3$ were used, where $a$ is chosen for the time interval the kicking robot takes to kick the ball, which is in the millisecond range.
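The mediator rule of Eqs. (7) and (10) with the experimental coefficients can be sketched as below. The paper does not specify how $\theta_i$ and $d_i$ are scaled relative to the Q-value, so the magnitudes in the example are purely illustrative:

```python
def mediator_value(q, theta, d, eta_q=0.5, eta_theta=0.3, eta_d=0.2):
    """Eq. (10): modified Q-value combining the learning module's Q-value
    with the state information (angle term theta_i, distance term d_i),
    using the coefficients 0.5, 0.3, 0.2 from the experiments."""
    return eta_q * q + eta_theta * theta + eta_d * d

def mediator_select(candidates):
    """Eq. (7): over the coupled agent's candidate (action, Q, theta_i, d_i)
    tuples, pick the action maximizing f."""
    return max(candidates, key=lambda c: mediator_value(c[1], c[2], c[3]))[0]

# Illustrative numbers only: the second candidate wins on the combined score.
best = mediator_select([("kick", 0.6, 10.0, -2.0), ("hold", 0.5, 20.0, 3.0)])
```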
$b$ was obtained by experiment and $t_{\mathrm{const2}}$ was determined heuristically. In Region 1 of the non-blocked situation, it took 280 trials to obtain a Q-value regarded as suboptimal; in Region 2, it took 210 trials. The Q-values of the third region were the same as those of the first region.

Fig. 10(a) shows the trajectories of the two robots when the ball is in Region 1. After the learning phase, Robot 3 took up the task of kicking the ball; it took 2.790 s (155 steps) to kick the ball. Had Robot 1 assumed this task, it would have taken 3.006 s (167 steps). Instead, Robot 1 took up position in the defense zone left vacant by Robot 3. In Fig. 10(a), the initial positions of the ball, Robot 1 and Robot 3 were (46.50 cm, cm), (86.05 cm, cm) and (22.21 cm, cm), respectively.

Fig. 10(b) shows the trajectories of the two robots when the ball is in Region 2. Robot 1 kicked the ball after learning; it took 1.692 s (94 steps) to kick the ball. If the other robot of the coupled agent had kicked the ball, it would have taken 1.854 s (103 steps). The initial positions of the ball, Robot 1 and Robot 2 were (97.56 cm, cm), (80.99 cm, cm) and (72.02 cm, cm), respectively.

4.2. Blocked situation

For the reward in the blocked situation, $a = 20{,}000$ and $t_{\mathrm{const1}} = 18$ were used. These values were determined similarly to the non-blocked situation.

Fig. 10. Non-blocked situation: (a) Robot 3 kicked the ball after the learning phase in Region 1; (b) Robot 1 kicked the ball after the learning phase in Region 2.

Three hundred trials were needed for convergence in the first region of the blocked situation. The Q-values of the second region were the same as those of the first region because the two regions are symmetric. Fig. 11 shows the trajectories of the two robots in the blocked situation in Region 1. Robot 2 assisted the blocked Robot 1, and it took s to kick the ball (had Robot 3 been the kicker, it would have taken s). The initial positions of the ball, Robot 2 and Robot 3 were (91.83 cm, cm), (70.78 cm, cm) and (21.74 cm, cm), respectively.

Fig. 11. Robot 2 assists Robot 1 after learning in the blocked situation.

4.3. Effect of the modified Q-value in the mediator module

In the above results, the mediator module did not use any of the state information to determine the action of the coupled agent. The effectiveness of the modified Q-value, with which the mediator module makes the final selection for the coupled agent, is brought out in the following real experiment. Fig. 12(a) shows the trajectories of the two robots in the non-blocked situation in Region 2. In this case the mediator module arbitrates the actions of the two robots using only the Q-values. As shown in Fig. 12(a), Robot 2 kicked the ball; it took 100 steps (1.800 s) to do so. Among the Q-values of the learning modules, the Q-value of the kick action was larger than that of the other actions in each robot, and the mediator module selects the final action with the larger Q-value. The initial positions of the ball, Robot 1 and Robot 2 were (95.26 cm, cm), (76.62 cm, cm) and (74.01 cm, cm), respectively.

Fig. 12(b) shows the same situation as Fig. 12(a), but now the mediator module considers both the Q-value received from each learning module and the state information described in (7). In this situation,

Fig. 12. Effect of the modified Q-value: (a) Robot 2 kicked the ball after learning, with only the Q-value; (b) Robot 1 kicked the ball after learning, with the Q-value and state information.

Robot 1 kicked the ball; it took 83 steps (1.494 s) for this action. Note that the time for Robot 1 to kick the ball (Fig. 12(b)) is shorter than the time taken by Robot 2 to do so (Fig. 12(a)).

4.4. Boundary region of four robots

As shown in Fig. 13, in the non-blocked situation it is possible for the ball to lie in the common region of the three regions. In this case the coupled agent includes four robots, and the problem is how to select the right robot for the kicking action. Each of the four robots has two Q-values, obtained in the three regions of the non-blocked situation: $Q_{11}$ and $Q_{13}$ are determined in Region 1, $Q_{21}$ and $Q_{22}$ in Region 2, and $Q_{32}$, $Q_{34}$, $Q_{43}$ and $Q_{44}$ in Regions 3 and 4 (Fig. 14). The Q-values of Regions 3 and 4 are the same as those of Regions 1 and 2. In Regions 1-3 of the non-blocked situation, only two robots need be chosen as the coupled agent for learning. In the situation now considered, four robots form a coupled agent and kicking has to be assigned to one of them. The mediator module arbitrates among the robots whose Q-value for the kick action is larger than their Q-value for retaining position. In such a situation, the average of the Q-values of each robot in the two regions considered becomes the deciding factor in the mediator module. Together with

Fig. 13. The ball is located in the boundary region.
Fig. 14. Q-values of the four robots.

Fig. 15. After the learning phase in the boundary region: (a) Robot 1 kicked the ball, with only the Q-value; (b) Robot 3 kicked the ball, with the Q-value and state information.

this average value, the state information in Eq. (7) is also considered in assigning the kick action. Note that in this situation, $d_i$ in Eq. (7) is computed as the sum of the distances between the $i$th robot and each of the remaining robots of the coupled agent.

Fig. 15(a) shows the trajectories of the four robots when the ball was in the boundary region of the four robots. Considering only the Q-value information received from the learning modules, the kick action was assigned to Robot 1, and it took 3.924 s (218 steps) to kick the ball. The initial positions of the ball, Robot 1, Robot 2, Robot 3 and Robot 4 were (47.86 cm, cm), (83.50 cm, cm), (88.35 cm, cm), (24.62 cm, cm) and (23.92 cm, cm), respectively. The 1, 2, 3, 4 and B in Fig. 15 denote the initial positions of Robots 1, 2, 3, 4 and the ball, respectively; these symbols are used in all other figures. In this initial position, the Q-value of Robot 1 as kicker is greater than its Q-value as

Fig. 16. Other two cases in the boundary region: (a) Attacker 2 kicked the ball; (b) Defender 2 kicked the ball.

Attacker 1. For Robot 3 also, the Q-value for the kick action is greater than its Q-value as Defender 1. For Robot 2, the Q-value for the kick action is smaller than its Q-value as Attacker 2, and for Robot 4, the Q-value as kicker is smaller than its Q-value as Defender 2. Between Robot 1 and Robot 3, Robot 1 has the higher Q-value as kicker. However, the kick assignment is now decided by the mediator module using the modified Q-value, which takes the state information into account; Robot 3, which has the greater modified Q-value, qualifies as the kicker. Fig. 15(b) shows the trajectories of the four robots when Robot 3 kicked the ball; it took 2.214 s (123 steps) to kick the ball. The times that Robot 2 and Robot 4 would have needed to kick the ball were also investigated; these situations are depicted in Fig. 16(a) and (b), respectively. It took 2.574 s (143 steps) for Robot 2 and 2.376 s (132 steps) for Robot 4 to kick the ball. Thus, the experiment confirmed that the robot assigned the kick action by the modified Q-value method takes the least time.

5. Conclusions

This paper proposed an action selection mechanism for the robots in a robot soccer game. The action selection problem of the zone defense scheme is divided into two situations: the non-blocked case and the case of a robot blocked by its opponent. The non-blocked case is a situation of conflict among the home robots near the boundary regions; the blocked case corresponds to a home robot being blocked by opponent robots. The modular Q-learning architecture was used to solve the action selection problem, specifically selecting the robot that needs the least time to kick the ball and assigning the kick to it. The concept of the coupled agent was used to resolve conflicts in action selection among robots.
A uni-vector field method was employed for the navigation of the robots. The mediator module selects the final action of the coupled agent by considering the Q-values received from the learning modules and the state information. The effectiveness of the scheme was demonstrated through real robot soccer experiments.

References

[1] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, Bradford Books/MIT Press, Cambridge, MA.
[2] C.J.C.H. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992).
[3] C.R. Kube, H. Zhang, Collective robotics: From social insects to robots, Adaptive Behavior 2 (2) (1993).
[4] G. Campion, G. Bastin, d'Andréa-Novel, Structural properties and classification of kinematic and dynamic models of wheeled mobile robots, IEEE Transactions on Robotics and Automation 12 (1) (1996).
[5] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: A survey, Journal of Artificial Intelligence Research 4 (1996).
[6] T.W. Sandholm, R.H. Crites, Multiagent reinforcement learning in the Iterated Prisoner's Dilemma, Biosystems 37 (1996).
[7] H.-S. Shim, H.-S. Kim, M.-J. Jung, I.-H. Choi, J.-H. Kim, J.-O. Kim, Designing distributed control architecture for cooperative multi-agent system and its real-time application to soccer robot, Robotics and Autonomous Systems 21 (2) (1997).
[8] P.V.C. Caironi, M. Dorigo, Training and delayed reinforcements in Q-learning agents, International Journal of Intelligent Systems 12 (10) (1997).
[9] L.E. Parker, ALLIANCE: An architecture for fault tolerant multirobot cooperation, IEEE Transactions on Robotics and Automation 14 (2) (1998).
[10] J.-H. Kim, H.-S. Shim, H.-S. Kim, M.-J. Jung, I.-H. Choi, K.-O. Kim, A cooperative multi-agent system and its real time application to robot soccer, in: Proceedings of the IEEE International Conference on Robotics and Automation, Minneapolis, MN, 1996.
[11] C.
Boutilier, Planning, learning and coordination in multiagent decision processes, in: Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, Netherlands, [12] S.H. Lee, J. Bautista, Motion control for micro-robots playing soccer games, in: Proceedings of the IEEE International Conference on Robotics and Automation, Leuven, Belgium, 1998, pp [13] J.-H. Kim, K.-C. Kim, D.-H. Kim, Y.-J. Kim, P. Vadakkepat, Path planning and role selection mechanism for soccer robots, in: Proceedings of the IEEE International Conference on Robotics and Automation, Leuven, Belgium, 1998, pp [14] Y.-J. Kim, D.-H. Kim, J.-H. Kim, Evolutionary programming-based vector field method for fast mobile robot navigation, in: Proceedings of the Second Asia Pacific Conference on Simulations, Evolutions and Learning, [15] N. Ono, K. Fukumoto, Multi-agent reinforcement learning: A modular approach, in: Proceedings of the Second International Conference on Multi-agent Systems, AAAI Press, 1996, pp [16] G.A. Rummery, Problem solving with reinforcement learning, Ph.D. Thesis, Cambridge University, Cambridge, UK, 1995.

Kui-Hong Park received his B.S. degree in Electrical Engineering from Hanyang University, Seoul, South Korea, in 1997, and his M.S. degree in Electrical Engineering from the Korea Advanced Institute of Science and Technology (KAIST), Taejon, South Korea, in 1998. He is currently working towards the Ph.D. degree in Electrical Engineering at KAIST. His main research interests include multi-agent systems and machine intelligence. Mr. Park received second-place awards both in the 1998 Nano-Robot World Cup Soccer Tournament (NaroSot) in France and in NaroSot 99 in Brazil.

Yong-Jae Kim received his B.S. and M.S. degrees in Electrical Engineering from the Korea Advanced Institute of Science and Technology, Taejon, South Korea, in 1996 and 1998, respectively. He is currently working towards the Ph.D. degree in Electrical Engineering at the same institute. His research interests include motion planning of mobile systems and machine intelligence. Mr. Kim is the recipient of the third-place award at MiroSot 97 and the first-place award at the Robot Soccer American Cup.

Jong-Hwan Kim received his B.S., M.S., and Ph.D. degrees in Electronics Engineering from Seoul National University, South Korea, in 1981, 1983, and 1987, respectively. Since 1988, he has been with the Department of Electrical Engineering at the Korea Advanced Institute of Science and Technology, where he is currently a Professor. He was a Visiting Scholar at Purdue University from September 1992 to August. His research interests are in the area of evolutionary multi-agent robotic systems. He is an Associate Editor of the IEEE Transactions on Evolutionary Computation and of the International Journal of Intelligent and Fuzzy Systems. He is one of the co-founders of the Asia Pacific Conference on Simulated Evolution and Learning.
He is the General Chair of the Congress on Evolutionary Computation. His name is included in the Barons 500 Leaders for the New Century as the founder of FIRA (Federation of International Robot Soccer Association) and of IROC for the Robot Olympiad; he is now serving FIRA and IROC as President. He was the Guest Editor of the special issue on MiroSot 96 of the journal Robotics and Autonomous Systems and of the special issue on Soccer Robotics of the Journal of Intelligent Automation and Soft Computing. Dr. Kim is the recipient of the 1988 Choongang Young Investigator Award from the Choongang Memorial Foundation, the LG YonAm Foundation Research Fellowship in 1992, the Korean Presidential Award in 1997, and the SeoAm Foundation Research Fellowship in 1999.

Multi-Agent Control Structure for a Vision Based Robot Soccer System

Multi-Agent Control Structure for a Vision Based Robot Soccer System Multi- Control Structure for a Vision Based Robot Soccer System Yangmin Li, Wai Ip Lei, and Xiaoshan Li Department of Electromechanical Engineering Faculty of Science and Technology University of Macau

More information

Simple Path Planning Algorithm for Two-Wheeled Differentially Driven (2WDD) Soccer Robots

Simple Path Planning Algorithm for Two-Wheeled Differentially Driven (2WDD) Soccer Robots Simple Path Planning Algorithm for Two-Wheeled Differentially Driven (2WDD) Soccer Robots Gregor Novak 1 and Martin Seyr 2 1 Vienna University of Technology, Vienna, Austria novak@bluetechnix.at 2 Institute

More information

Rapid Control Prototyping for Robot Soccer

Rapid Control Prototyping for Robot Soccer Proceedings of the 17th World Congress The International Federation of Automatic Control Rapid Control Prototyping for Robot Soccer Junwon Jang Soohee Han Hanjun Kim Choon Ki Ahn School of Electrical Engr.

More information

Available online at ScienceDirect. Procedia Computer Science 56 (2015 )

Available online at  ScienceDirect. Procedia Computer Science 56 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 56 (2015 ) 538 543 International Workshop on Communication for Humans, Agents, Robots, Machines and Sensors (HARMS 2015)

More information

Strategy for Collaboration in Robot Soccer

Strategy for Collaboration in Robot Soccer Strategy for Collaboration in Robot Soccer Sng H.L. 1, G. Sen Gupta 1 and C.H. Messom 2 1 Singapore Polytechnic, 500 Dover Road, Singapore {snghl, SenGupta }@sp.edu.sg 1 Massey University, Auckland, New

More information

A Posture Control for Two Wheeled Mobile Robots

A Posture Control for Two Wheeled Mobile Robots Transactions on Control, Automation and Systems Engineering Vol., No. 3, September, A Posture Control for Two Wheeled Mobile Robots Hyun-Sik Shim and Yoon-Gyeoung Sung Abstract In this paper, a posture

More information

Autonomous Stair Climbing Algorithm for a Small Four-Tracked Robot

Autonomous Stair Climbing Algorithm for a Small Four-Tracked Robot Autonomous Stair Climbing Algorithm for a Small Four-Tracked Robot Quy-Hung Vu, Byeong-Sang Kim, Jae-Bok Song Korea University 1 Anam-dong, Seongbuk-gu, Seoul, Korea vuquyhungbk@yahoo.com, lovidia@korea.ac.kr,

More information

Online Evolution for Cooperative Behavior in Group Robot Systems

Online Evolution for Cooperative Behavior in Group Robot Systems 282 International Dong-Wook Journal of Lee, Control, Sang-Wook Automation, Seo, and Systems, Kwee-Bo vol. Sim 6, no. 2, pp. 282-287, April 2008 Online Evolution for Cooperative Behavior in Group Robot

More information

Hierarchical Controller for Robotic Soccer

Hierarchical Controller for Robotic Soccer Hierarchical Controller for Robotic Soccer Byron Knoll Cognitive Systems 402 April 13, 2008 ABSTRACT RoboCup is an initiative aimed at advancing Artificial Intelligence (AI) and robotics research. This

More information

Motion Control of a Three Active Wheeled Mobile Robot and Collision-Free Human Following Navigation in Outdoor Environment

Motion Control of a Three Active Wheeled Mobile Robot and Collision-Free Human Following Navigation in Outdoor Environment Proceedings of the International MultiConference of Engineers and Computer Scientists 2016 Vol I,, March 16-18, 2016, Hong Kong Motion Control of a Three Active Wheeled Mobile Robot and Collision-Free

More information

Behavior generation for a mobile robot based on the adaptive fitness function

Behavior generation for a mobile robot based on the adaptive fitness function Robotics and Autonomous Systems 40 (2002) 69 77 Behavior generation for a mobile robot based on the adaptive fitness function Eiji Uchibe a,, Masakazu Yanase b, Minoru Asada c a Human Information Science

More information

COOPERATIVE STRATEGY BASED ON ADAPTIVE Q- LEARNING FOR ROBOT SOCCER SYSTEMS

COOPERATIVE STRATEGY BASED ON ADAPTIVE Q- LEARNING FOR ROBOT SOCCER SYSTEMS COOPERATIVE STRATEGY BASED ON ADAPTIVE Q- LEARNING FOR ROBOT SOCCER SYSTEMS Soft Computing Alfonso Martínez del Hoyo Canterla 1 Table of contents 1. Introduction... 3 2. Cooperative strategy design...

More information

Robo-Erectus Jr-2013 KidSize Team Description Paper.

Robo-Erectus Jr-2013 KidSize Team Description Paper. Robo-Erectus Jr-2013 KidSize Team Description Paper. Buck Sin Ng, Carlos A. Acosta Calderon and Changjiu Zhou. Advanced Robotics and Intelligent Control Centre, Singapore Polytechnic, 500 Dover Road, 139651,

More information

Randomized Motion Planning for Groups of Nonholonomic Robots

Randomized Motion Planning for Groups of Nonholonomic Robots Randomized Motion Planning for Groups of Nonholonomic Robots Christopher M Clark chrisc@sun-valleystanfordedu Stephen Rock rock@sun-valleystanfordedu Department of Aeronautics & Astronautics Stanford University

More information

Q Learning Behavior on Autonomous Navigation of Physical Robot

Q Learning Behavior on Autonomous Navigation of Physical Robot The 8th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI 211) Nov. 23-26, 211 in Songdo ConventiA, Incheon, Korea Q Learning Behavior on Autonomous Navigation of Physical Robot

More information

CMDragons 2009 Team Description

CMDragons 2009 Team Description CMDragons 2009 Team Description Stefan Zickler, Michael Licitra, Joydeep Biswas, and Manuela Veloso Carnegie Mellon University {szickler,mmv}@cs.cmu.edu {mlicitra,joydeep}@andrew.cmu.edu Abstract. In this

More information

Multi-robot Formation Control Based on Leader-follower Method

Multi-robot Formation Control Based on Leader-follower Method Journal of Computers Vol. 29 No. 2, 2018, pp. 233-240 doi:10.3966/199115992018042902022 Multi-robot Formation Control Based on Leader-follower Method Xibao Wu 1*, Wenbai Chen 1, Fangfang Ji 1, Jixing Ye

More information

Learning and Using Models of Kicking Motions for Legged Robots

Learning and Using Models of Kicking Motions for Legged Robots Learning and Using Models of Kicking Motions for Legged Robots Sonia Chernova and Manuela Veloso Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 {soniac, mmv}@cs.cmu.edu Abstract

More information

COMPACT FUZZY Q LEARNING FOR AUTONOMOUS MOBILE ROBOT NAVIGATION

COMPACT FUZZY Q LEARNING FOR AUTONOMOUS MOBILE ROBOT NAVIGATION COMPACT FUZZY Q LEARNING FOR AUTONOMOUS MOBILE ROBOT NAVIGATION Handy Wicaksono, Khairul Anam 2, Prihastono 3, Indra Adjie Sulistijono 4, Son Kuswadi 5 Department of Electrical Engineering, Petra Christian

More information

National University of Singapore

National University of Singapore National University of Singapore Department of Electrical and Computer Engineering EE4306 Distributed Autonomous obotic Systems 1. Objectives...1 2. Equipment...1 3. Preparation...1 4. Introduction...1

More information

Multi-Platform Soccer Robot Development System

Multi-Platform Soccer Robot Development System Multi-Platform Soccer Robot Development System Hui Wang, Han Wang, Chunmiao Wang, William Y. C. Soh Division of Control & Instrumentation, School of EEE Nanyang Technological University Nanyang Avenue,

More information

Robocup Electrical Team 2006 Description Paper

Robocup Electrical Team 2006 Description Paper Robocup Electrical Team 2006 Description Paper Name: Strive2006 (Shanghai University, P.R.China) Address: Box.3#,No.149,Yanchang load,shanghai, 200072 Email: wanmic@163.com Homepage: robot.ccshu.org Abstract:

More information

NAVIGATION OF MOBILE ROBOT USING THE PSO PARTICLE SWARM OPTIMIZATION

NAVIGATION OF MOBILE ROBOT USING THE PSO PARTICLE SWARM OPTIMIZATION Journal of Academic and Applied Studies (JAAS) Vol. 2(1) Jan 2012, pp. 32-38 Available online @ www.academians.org ISSN1925-931X NAVIGATION OF MOBILE ROBOT USING THE PSO PARTICLE SWARM OPTIMIZATION Sedigheh

More information

RoboCup. Presented by Shane Murphy April 24, 2003

RoboCup. Presented by Shane Murphy April 24, 2003 RoboCup Presented by Shane Murphy April 24, 2003 RoboCup: : Today and Tomorrow What we have learned Authors Minoru Asada (Osaka University, Japan), Hiroaki Kitano (Sony CS Labs, Japan), Itsuki Noda (Electrotechnical(

More information

Obstacle Avoidance Functions on Robot Mirosot in The Departement of Informatics of UPN Veteran Yogyakarta

Obstacle Avoidance Functions on Robot Mirosot in The Departement of Informatics of UPN Veteran Yogyakarta Proceeding International Conference on Electrical Engineering, Computer Science Informatics (EECSI 2015), Palembang, Indonesia, 19-20 August 2015 Obstacle Avoidance Functions on Robot Mirosot in Departement

More information

Decision Science Letters

Decision Science Letters Decision Science Letters 3 (2014) 121 130 Contents lists available at GrowingScience Decision Science Letters homepage: www.growingscience.com/dsl A new effective algorithm for on-line robot motion planning

More information

Fuzzy Logic for Behaviour Co-ordination and Multi-Agent Formation in RoboCup

Fuzzy Logic for Behaviour Co-ordination and Multi-Agent Formation in RoboCup Fuzzy Logic for Behaviour Co-ordination and Multi-Agent Formation in RoboCup Hakan Duman and Huosheng Hu Department of Computer Science University of Essex Wivenhoe Park, Colchester CO4 3SQ United Kingdom

More information

Team KMUTT: Team Description Paper

Team KMUTT: Team Description Paper Team KMUTT: Team Description Paper Thavida Maneewarn, Xye, Pasan Kulvanit, Sathit Wanitchaikit, Panuvat Sinsaranon, Kawroong Saktaweekulkit, Nattapong Kaewlek Djitt Laowattana King Mongkut s University

More information

ROBOTSOCCER. Peter Kopacek

ROBOTSOCCER. Peter Kopacek Proceedings of the 17th World Congress The International Federation of Automatic Control ROBOTSOCCER Peter Kopacek Intelligent Handling and Robotics (IHRT),Vienna University of Technology Favoritenstr.

More information

The Necessity of Average Rewards in Cooperative Multirobot Learning

The Necessity of Average Rewards in Cooperative Multirobot Learning Carnegie Mellon University Research Showcase @ CMU Institute for Software Research School of Computer Science 2002 The Necessity of Average Rewards in Cooperative Multirobot Learning Poj Tangamchit Carnegie

More information

S.P.Q.R. Legged Team Report from RoboCup 2003

S.P.Q.R. Legged Team Report from RoboCup 2003 S.P.Q.R. Legged Team Report from RoboCup 2003 L. Iocchi and D. Nardi Dipartimento di Informatica e Sistemistica Universitá di Roma La Sapienza Via Salaria 113-00198 Roma, Italy {iocchi,nardi}@dis.uniroma1.it,

More information

A GAME THEORETIC MODEL OF COOPERATION AND NON-COOPERATION FOR SOCCER PLAYING ROBOTS. M. BaderElDen, E. Badreddin, Y. Kotb, and J.

A GAME THEORETIC MODEL OF COOPERATION AND NON-COOPERATION FOR SOCCER PLAYING ROBOTS. M. BaderElDen, E. Badreddin, Y. Kotb, and J. A GAME THEORETIC MODEL OF COOPERATION AND NON-COOPERATION FOR SOCCER PLAYING ROBOTS M. BaderElDen, E. Badreddin, Y. Kotb, and J. Rüdiger Automation Laboratory, University of Mannheim, 68131 Mannheim, Germany.

More information

Cooperative Behavior Acquisition in A Multiple Mobile Robot Environment by Co-evolution

Cooperative Behavior Acquisition in A Multiple Mobile Robot Environment by Co-evolution Cooperative Behavior Acquisition in A Multiple Mobile Robot Environment by Co-evolution Eiji Uchibe, Masateru Nakamura, Minoru Asada Dept. of Adaptive Machine Systems, Graduate School of Eng., Osaka University,

More information

Obstacle Avoidance in Collective Robotic Search Using Particle Swarm Optimization

Obstacle Avoidance in Collective Robotic Search Using Particle Swarm Optimization Avoidance in Collective Robotic Search Using Particle Swarm Optimization Lisa L. Smith, Student Member, IEEE, Ganesh K. Venayagamoorthy, Senior Member, IEEE, Phillip G. Holloway Real-Time Power and Intelligent

More information

Estimation of Absolute Positioning of mobile robot using U-SAT

Estimation of Absolute Positioning of mobile robot using U-SAT Estimation of Absolute Positioning of mobile robot using U-SAT Su Yong Kim 1, SooHong Park 2 1 Graduate student, Department of Mechanical Engineering, Pusan National University, KumJung Ku, Pusan 609-735,

More information

Adaptive Action Selection without Explicit Communication for Multi-robot Box-pushing

Adaptive Action Selection without Explicit Communication for Multi-robot Box-pushing Adaptive Action Selection without Explicit Communication for Multi-robot Box-pushing Seiji Yamada Jun ya Saito CISS, IGSSE, Tokyo Institute of Technology 4259 Nagatsuta, Midori, Yokohama 226-8502, JAPAN

More information

Traffic Control for a Swarm of Robots: Avoiding Target Congestion

Traffic Control for a Swarm of Robots: Avoiding Target Congestion Traffic Control for a Swarm of Robots: Avoiding Target Congestion Leandro Soriano Marcolino and Luiz Chaimowicz Abstract One of the main problems in the navigation of robotic swarms is when several robots

More information

Swarm Intelligence W7: Application of Machine- Learning Techniques to Automatic Control Design and Optimization

Swarm Intelligence W7: Application of Machine- Learning Techniques to Automatic Control Design and Optimization Swarm Intelligence W7: Application of Machine- Learning Techniques to Automatic Control Design and Optimization Learning to avoid obstacles Outline Problem encoding using GA and ANN Floreano and Mondada

More information

FU-Fighters. The Soccer Robots of Freie Universität Berlin. Why RoboCup? What is RoboCup?

FU-Fighters. The Soccer Robots of Freie Universität Berlin. Why RoboCup? What is RoboCup? The Soccer Robots of Freie Universität Berlin We have been building autonomous mobile robots since 1998. Our team, composed of students and researchers from the Mathematics and Computer Science Department,

More information

A Lego-Based Soccer-Playing Robot Competition For Teaching Design

A Lego-Based Soccer-Playing Robot Competition For Teaching Design Session 2620 A Lego-Based Soccer-Playing Robot Competition For Teaching Design Ronald A. Lessard Norwich University Abstract Course Objectives in the ME382 Instrumentation Laboratory at Norwich University

More information

Traffic Control for a Swarm of Robots: Avoiding Group Conflicts

Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Leandro Soriano Marcolino and Luiz Chaimowicz Abstract A very common problem in the navigation of robotic swarms is when groups of robots

More information

Introduction to Robotics

Introduction to Robotics Jianwei Zhang zhang@informatik.uni-hamburg.de Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme 14. June 2013 J. Zhang 1 Robot Control

More information

Behaviour Patterns Evolution on Individual and Group Level. Stanislav Slušný, Roman Neruda, Petra Vidnerová. CIMMACS 07, December 14, Tenerife

Behaviour Patterns Evolution on Individual and Group Level. Stanislav Slušný, Roman Neruda, Petra Vidnerová. CIMMACS 07, December 14, Tenerife Behaviour Patterns Evolution on Individual and Group Level Stanislav Slušný, Roman Neruda, Petra Vidnerová Department of Theoretical Computer Science Institute of Computer Science Academy of Science of

More information

An Intuitional Method for Mobile Robot Path-planning in a Dynamic Environment

An Intuitional Method for Mobile Robot Path-planning in a Dynamic Environment An Intuitional Method for Mobile Robot Path-planning in a Dynamic Environment Ching-Chang Wong, Hung-Ren Lai, and Hui-Chieh Hou Department of Electrical Engineering, Tamkang University Tamshui, Taipei

More information

A Fuzzy-Based Approach for Partner Selection in Multi-Agent Systems

A Fuzzy-Based Approach for Partner Selection in Multi-Agent Systems University of Wollongong Research Online Faculty of Informatics - Papers Faculty of Informatics 07 A Fuzzy-Based Approach for Partner Selection in Multi-Agent Systems F. Ren University of Wollongong M.

More information

MCT Susanoo Logics 2014 Team Description

MCT Susanoo Logics 2014 Team Description MCT Susanoo Logics 2014 Team Description Satoshi Takata, Yuji Horie, Shota Aoki, Kazuhiro Fujiwara, Taihei Degawa Matsue College of Technology 14-4, Nishiikumacho, Matsue-shi, Shimane, 690-8518, Japan

More information

A New Analytical Representation to Robot Path Generation with Collision Avoidance through the Use of the Collision Map

A New Analytical Representation to Robot Path Generation with Collision Avoidance through the Use of the Collision Map International A New Journal Analytical of Representation Control, Automation, Robot and Path Systems, Generation vol. 4, no. with 1, Collision pp. 77-86, Avoidance February through 006 the Use of 77 A

More information

Evolving High-Dimensional, Adaptive Camera-Based Speed Sensors

Evolving High-Dimensional, Adaptive Camera-Based Speed Sensors In: M.H. Hamza (ed.), Proceedings of the 21st IASTED Conference on Applied Informatics, pp. 1278-128. Held February, 1-1, 2, Insbruck, Austria Evolving High-Dimensional, Adaptive Camera-Based Speed Sensors

More information

Learning Reactive Neurocontrollers using Simulated Annealing for Mobile Robots

Learning Reactive Neurocontrollers using Simulated Annealing for Mobile Robots Learning Reactive Neurocontrollers using Simulated Annealing for Mobile Robots Philippe Lucidarme, Alain Liégeois LIRMM, University Montpellier II, France, lucidarm@lirmm.fr Abstract This paper presents

More information

Hybrid LQG-Neural Controller for Inverted Pendulum System

Hybrid LQG-Neural Controller for Inverted Pendulum System Hybrid LQG-Neural Controller for Inverted Pendulum System E.S. Sazonov Department of Electrical and Computer Engineering Clarkson University Potsdam, NY 13699-570 USA P. Klinkhachorn and R. L. Klein Lane

More information

Learning Attentive-Depth Switching while Interacting with an Agent

Learning Attentive-Depth Switching while Interacting with an Agent Learning Attentive-Depth Switching while Interacting with an Agent Chyon Hae Kim, Hiroshi Tsujino, and Hiroyuki Nakahara Abstract This paper addresses a learning system design for a robot based on an extended

More information

A Novel Fuzzy Neural Network Based Distance Relaying Scheme

A Novel Fuzzy Neural Network Based Distance Relaying Scheme 902 IEEE TRANSACTIONS ON POWER DELIVERY, VOL. 15, NO. 3, JULY 2000 A Novel Fuzzy Neural Network Based Distance Relaying Scheme P. K. Dash, A. K. Pradhan, and G. Panda Abstract This paper presents a new

More information

An Improved Path Planning Method Based on Artificial Potential Field for a Mobile Robot

An Improved Path Planning Method Based on Artificial Potential Field for a Mobile Robot BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No Sofia 015 Print ISSN: 1311-970; Online ISSN: 1314-4081 DOI: 10.1515/cait-015-0037 An Improved Path Planning Method Based

More information

The Haptic Impendance Control through Virtual Environment Force Compensation

The Haptic Impendance Control through Virtual Environment Force Compensation The Haptic Impendance Control through Virtual Environment Force Compensation OCTAVIAN MELINTE Robotics and Mechatronics Department Institute of Solid Mechanicsof the Romanian Academy ROMANIA octavian.melinte@yahoo.com

More information

Optic Flow Based Skill Learning for A Humanoid to Trap, Approach to, and Pass a Ball

Optic Flow Based Skill Learning for A Humanoid to Trap, Approach to, and Pass a Ball Optic Flow Based Skill Learning for A Humanoid to Trap, Approach to, and Pass a Ball Masaki Ogino 1, Masaaki Kikuchi 1, Jun ichiro Ooga 1, Masahiro Aono 1 and Minoru Asada 1,2 1 Dept. of Adaptive Machine

More information

Trajectory Generation for a Mobile Robot by Reinforcement Learning

Trajectory Generation for a Mobile Robot by Reinforcement Learning 1 Trajectory Generation for a Mobile Robot by Reinforcement Learning Masaki Shimizu 1, Makoto Fujita 2, and Hiroyuki Miyamoto 3 1 Kyushu Institute of Technology, Kitakyushu, Japan shimizu-masaki@edu.brain.kyutech.ac.jp

More information

APPLICATION OF FUZZY BEHAVIOR COORDINATION AND Q LEARNING IN ROBOT NAVIGATION

APPLICATION OF FUZZY BEHAVIOR COORDINATION AND Q LEARNING IN ROBOT NAVIGATION APPLICATION OF FUZZY BEHAVIOR COORDINATION AND Q LEARNING IN ROBOT NAVIGATION Handy Wicaksono 1, Prihastono 2, Khairul Anam 3, Rusdhianto Effendi 4, Indra Adji Sulistijono 5, Son Kuswadi 6, Achmad Jazidie

More information

A Reconfigurable Guidance System

A Reconfigurable Guidance System Lecture tes for the Class: Unmanned Aircraft Design, Modeling and Control A Reconfigurable Guidance System Application to Unmanned Aerial Vehicles (UAVs) y b right aileron: a2 right elevator: e 2 rudder:

More information

Reinforcement Learning Approach to Generate Goal-directed Locomotion of a Snake-Like Robot with Screw-Drive Units

Reinforcement Learning Approach to Generate Goal-directed Locomotion of a Snake-Like Robot with Screw-Drive Units Reinforcement Learning Approach to Generate Goal-directed Locomotion of a Snake-Like Robot with Screw-Drive Units Sromona Chatterjee, Timo Nachstedt, Florentin Wörgötter, Minija Tamosiunaite, Poramate

More information

DEVELOPMENT OF A ROBOID COMPONENT FOR PLAYER/STAGE ROBOT SIMULATOR

DEVELOPMENT OF A ROBOID COMPONENT FOR PLAYER/STAGE ROBOT SIMULATOR Proceedings of IC-NIDC2009 DEVELOPMENT OF A ROBOID COMPONENT FOR PLAYER/STAGE ROBOT SIMULATOR Jun Won Lim 1, Sanghoon Lee 2,Il Hong Suh 1, and Kyung Jin Kim 3 1 Dept. Of Electronics and Computer Engineering,

More information

Micro Robot Hockey Simulator Game Engine Design

Micro Robot Hockey Simulator Game Engine Design Micro Robot Hockey Simulator Game Engine Design Wayne Y. Chen Experimental Robotics Laboratory School of Engineering Science Simon Fraser University, Burnaby, BC, Canada waynec@fas.sfu.ca Shahram Payandeh

More information

CS594, Section 30682:

CS594, Section 30682: CS594, Section 30682: Distributed Intelligence in Autonomous Robotics Spring 2003 Tuesday/Thursday 11:10 12:25 http://www.cs.utk.edu/~parker/courses/cs594-spring03 Instructor: Dr. Lynne E. Parker ½ TA:

More information

CMDragons 2006 Team Description

CMDragons 2006 Team Description CMDragons 2006 Team Description James Bruce, Stefan Zickler, Mike Licitra, and Manuela Veloso Carnegie Mellon University Pittsburgh, Pennsylvania, USA {jbruce,szickler,mlicitra,mmv}@cs.cmu.edu Abstract.

More information

APPLICATION OF FUZZY BEHAVIOR COORDINATION AND Q LEARNING IN ROBOT NAVIGATION

APPLICATION OF FUZZY BEHAVIOR COORDINATION AND Q LEARNING IN ROBOT NAVIGATION APPLICATION OF FUZZY BEHAVIOR COORDINATION AND Q LEARNING IN ROBOT NAVIGATION Handy Wicaksono 1,2, Prihastono 1,3, Khairul Anam 4, Rusdhianto Effendi 2, Indra Adji Sulistijono 5, Son Kuswadi 5, Achmad

More information

Particle Swarm Optimization-Based Consensus Achievement of a Decentralized Sensor Network

Particle Swarm Optimization-Based Consensus Achievement of a Decentralized Sensor Network , pp.162-166 http://dx.doi.org/10.14257/astl.2013.42.38 Particle Swarm Optimization-Based Consensus Achievement of a Decentralized Sensor Network Hyunseok Kim 1, Jinsul Kim 2 and Seongju Chang 1*, 1 Department

More information

Wheeled Mobile Robot Obstacle Avoidance Using Compass and Ultrasonic

Wheeled Mobile Robot Obstacle Avoidance Using Compass and Ultrasonic Universal Journal of Control and Automation 6(1): 13-18, 2018 DOI: 10.13189/ujca.2018.060102 http://www.hrpub.org Wheeled Mobile Robot Obstacle Avoidance Using Compass and Ultrasonic Yousef Moh. Abueejela

More information

Estimation and Control of Lateral Displacement of Electric Vehicle Using WPT Information

Estimation and Control of Lateral Displacement of Electric Vehicle Using WPT Information Estimation and Control of Lateral Displacement of Electric Vehicle Using WPT Information Pakorn Sukprasert Department of Electrical Engineering and Information Systems, The University of Tokyo Tokyo, Japan

More information

Behaviour-Based Control. IAR Lecture 5 Barbara Webb

Behaviour-Based Control. IAR Lecture 5 Barbara Webb Behaviour-Based Control IAR Lecture 5 Barbara Webb Traditional sense-plan-act approach suggests a vertical (serial) task decomposition Sensors Actuators perception modelling planning task execution motor

More information

Towards Quantification of the need to Cooperate between Robots

Towards Quantification of the need to Cooperate between Robots PERMIS 003 Towards Quantification of the need to Cooperate between Robots K. Madhava Krishna and Henry Hexmoor CSCE Dept., University of Arkansas Fayetteville AR 770 Abstract: Collaborative technologies

More information

IMPLEMENTATION OF NEURAL NETWORK IN ENERGY SAVING OF INDUCTION MOTOR DRIVES WITH INDIRECT VECTOR CONTROL

IMPLEMENTATION OF NEURAL NETWORK IN ENERGY SAVING OF INDUCTION MOTOR DRIVES WITH INDIRECT VECTOR CONTROL IMPLEMENTATION OF NEURAL NETWORK IN ENERGY SAVING OF INDUCTION MOTOR DRIVES WITH INDIRECT VECTOR CONTROL * A. K. Sharma, ** R. A. Gupta, and *** Laxmi Srivastava * Department of Electrical Engineering,

More information

SIMULTANEOUS OBSTACLE DETECTION FOR MOBILE ROBOTS AND ITS LOCALIZATION FOR AUTOMATIC BATTERY RECHARGING

SIMULTANEOUS OBSTACLE DETECTION FOR MOBILE ROBOTS AND ITS LOCALIZATION FOR AUTOMATIC BATTERY RECHARGING SIMULTANEOUS OBSTACLE DETECTION FOR MOBILE ROBOTS AND ITS LOCALIZATION FOR AUTOMATIC BATTERY RECHARGING *Sang-Il Gho*, Jong-Suk Choi*, *Ji-Yoon Yoo**, Mun-Sang Kim* *Department of Electrical Engineering

More information

Implementation of Proportional and Derivative Controller in a Ball and Beam System

Implementation of Proportional and Derivative Controller in a Ball and Beam System Implementation of Proportional and Derivative Controller in a Ball and Beam System Alexander F. Paggi and Tooran Emami United States Coast Guard Academy Abstract This paper presents a design of two cascade

More information

The Autonomous Performance Improvement of Mobile Robot using Type-2 Fuzzy Self-Tuning PID Controller

The Autonomous Performance Improvement of Mobile Robot using Type-2 Fuzzy Self-Tuning PID Controller , pp.182-187 http://dx.doi.org/10.14257/astl.2016.138.37 The Autonomous Performance Improvement of Mobile Robot using Type-2 Fuzzy Self-Tuning PID Controller Sang Hyuk Park 1, Ki Woo Kim 1, Won Hyuk Choi

More information

SRV02-Series Rotary Experiment # 3. Ball & Beam. Student Handout

SRV02-Series Rotary Experiment # 3. Ball & Beam. Student Handout SRV02-Series Rotary Experiment # 3 Ball & Beam Student Handout SRV02-Series Rotary Experiment # 3 Ball & Beam Student Handout 1. Objectives The objective in this experiment is to design a controller for

More information

Test Plan. Robot Soccer. ECEn Senior Project. Real Madrid. Daniel Gardner Warren Kemmerer Brandon Williams TJ Schramm Steven Deshazer

Test Plan. Robot Soccer. ECEn Senior Project. Real Madrid. Daniel Gardner Warren Kemmerer Brandon Williams TJ Schramm Steven Deshazer Test Plan Robot Soccer ECEn 490 - Senior Project Real Madrid Daniel Gardner Warren Kemmerer Brandon Williams TJ Schramm Steven Deshazer CONTENTS Introduction... 3 Skill Tests Determining Robot Position...

More information

PWM MOTOR DRIVE CIRCUIT WITH WIRELESS COMMUNICATION TO A MICROCOMPUTER FOR SMALL PLAYING SOCCER ROBOTS

PWM MOTOR DRIVE CIRCUIT WITH WIRELESS COMMUNICATION TO A MICROCOMPUTER FOR SMALL PLAYING SOCCER ROBOTS PWM MOTOR DRIVE CIRCUIT WITH WIRELESS COMMUNICATION TO A MICROCOMPUTER FOR SMALL PLAYING SOCCER ROBOTS EWALDO L. M. MEHL, ANDERSON C. ZANI, JACKSON KÜNTZE, VILSON R. MOGNON Departamento de Engenharia Elétrica,

More information

Sensor Data Fusion Using Kalman Filter

Sensor Data Fusion Using Kalman Filter Sensor Data Fusion Using Kalman Filter J.Z. Sasiade and P. Hartana Department of Mechanical & Aerospace Engineering arleton University 115 olonel By Drive Ottawa, Ontario, K1S 5B6, anada e-mail: jsas@ccs.carleton.ca

More information

Field Rangers Team Description Paper. Yusuf Pranggonoh, Buck Sin Ng, Tianwu Yang, Ai Ling Kwong, Pik Kong Yue, Changjiu Zhou. Advanced Robotics and Intelligent Control Centre (ARICC), Singapore Polytechnic.

Evolving CAM-Brain to control a mobile robot. Sung-Bae Cho and Geum-Beom Song, Department of Computer Science, Yonsei University. Applied Mathematics and Computation 111 (2000) 147-162. www.elsevier.nl/locate/amc.

Using a Fuzzy Logic Control System for an Xpilot Combat Agent. Andrew Hubley and Gary Parker, Department of Computer Science, Connecticut College, New London, CT. World Automation Congress 21, TSI Press.

Reinforcement Learning Simulations and Robotics.

Internet Control of Personal Robot between KAIST and UC Davis. Kuk-Hyun Han, Yong-Jae Kim, Jong-Hwan Kim (Department of Electrical Engineering and Computer Science, KAIST) and Steve Hsia.

Cyclic Genetic Algorithms for Evolving Multi-Loop Control Programs. Gary B. Parker, Connecticut College, USA, parker@conncoll.edu; Ivo I. Parashkevov, Connecticut College, USA, iipar@conncoll.edu; H. Joseph

Soccer Server: a simulator of RoboCup. Noda Itsuki, Electrotechnical Laboratory, 1-1-4 Umezono, Tsukuba, 305 Japan. noda@etl.go.jp.

A Neural Controller for On Board Tracking Platform. Octavian Grigore-Müler. Key words: airborne warning and control systems (AWACS), incremental motion controller, DC servomotors with low inertia.

Improving the Kicking Accuracy in a Soccer Robot. Ricardo Dias (ricardodias@ua.pt), Bernardo Cunha (mbc@det.ua.pt), João Silva (joao.m.silva@ua.pt), António J. R. Neves (an@ua.pt), José Luis Azevedo (jla@ua.pt), Nuno

An Autonomous Simulation Based System for Robotic Services in Partially Known Environments. Eva Cipi, PhD in Computer Engineering, University of Vlora, Albania.

KMUTT Kickers: Team Description Paper. Thavida Maneewarn, Xye, Korawit Kawinkhrue, Amnart Butsongka, Nattapong Kaewlek. King Mongkut's University of Technology Thonburi, Institute of Field Robotics (FIBO).

A Study of Optimal Spatial Partition Size and Field of View in Massively Multiplayer Online Game Server. Youngsik Kim, Department of Game and Multimedia Engineering, Korea Polytechnic University.

CAMBADA 2015: Team Description Paper. B. Cunha, A. J. R. Neves, P. Dias, J. L. Azevedo, N. Lau, R. Dias, F. Amaral, E. Pedrosa, A. Pereira, J. Silva, J. Cunha and A. Trifan.

Dr. Wenjie Dong. The University of Texas Rio Grande Valley, Department of Electrical Engineering. (956) 665-2200. Email: wenjie.dong@utrgv.edu. Education: PhD, University of California, Riverside, 2009.

Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function. Davis Ancona and Jake Weiner.

Learning and Using Models of Kicking Motions for Legged Robots. Sonia Chernova and Manuela Veloso, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213. {soniac, mmv}@cs.cmu.edu.

An Experimental Comparison of Path Planning Techniques for Teams of Mobile Robots. Maren Bennewitz and Wolfram Burgard, Department of Computer Science, University of Freiburg, Germany.

Robots in the Loop: Supporting an Incremental Simulation-based Design Process. Xiaolin Hu, Computer Science Department, Georgia State University, Atlanta, GA, USA. xhu@cs.gsu.edu.

Biologically Inspired Embodied Evolution of Survival. Stefan Elfwing, Eiji Uchibe, Kenji Doya, Henrik I. Christensen. Centre for Autonomous Systems, Numerical Analysis and Computer Science.

Fuzzy-Heuristic Robot Navigation in a Simulated Environment. S. K. Deshpande, M. Blumenstein and B. Verma, School of Information Technology, Griffith University-Gold Coast, PMB 50, GCMC, Bundall, QLD 9726.

Navigation of Transport Mobile Robot in Bionic Assembly System. Aleksandar Lazinica, Intelligent Manufacturing Systems, IFT, Karlsplatz 13/311, A-1040 Vienna. Tel: +43-1-58801-311141, Fax: +43-1-58801-31199.

Optimum Rate Allocation for Two-Class Services in CDMA Smart Antenna Systems. Il-Min Kim and Hyung-Myung Kim. IEEE Transactions on Communications, vol. 51, no. 5, May 2003, p. 810.

About Doppler-Fizeau effect on radiated noise from a rotating source in cavitation tunnel. Proceedings of the 22nd International Congress on Acoustics, Signal Processing in Acoustics (others): Paper ICA2016-111.