Learning to Acquire Whole-Body Humanoid Center of Mass Movements to Achieve Dynamic Tasks
Advanced Robotics 22 (2008). Full paper.

Takamitsu Matsubara a,b, Jun Morimoto b,c, Jun Nakanishi b,c, Sang-Ho Hyon b,c, Joshua G. Hale b,c and Gordon Cheng b,c

a Nara Institute of Science and Technology, Takayama-cho, Ikoma, Nara, Japan
b ATR Computational Neuroscience Laboratories, Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan
c ICORP Computational Brain Project, Japan Science and Technology Agency, Honcho, Kawaguchi, Saitama, Japan

Received 4 March 2008; accepted 19 March 2008

Abstract

This paper presents a novel approach for acquiring dynamic whole-body movements on humanoid robots, focused on learning a control policy for the center of mass (CoM). In our approach, we combine a model-based CoM controller with a model-free reinforcement learning (RL) method to acquire dynamic whole-body movements in humanoid robots. (i) To cope with high dimensionality, we use a model-based CoM controller as a basic controller that derives joint angular velocities from the desired CoM velocity. The balancing issue can also be considered in the controller. (ii) The RL method is used to acquire a controller that generates the desired CoM velocity based on the current state. To demonstrate the effectiveness of our approach, we apply it to a ball-punching task on a simulated humanoid robot model. The acquired whole-body punching movement was also demonstrated on Fujitsu's Hoap-2 humanoid robot.

Koninklijke Brill NV, Leiden and The Robotics Society of Japan, 2008

Keywords: Reinforcement learning, humanoid robot, whole-body movement, policy-gradient method

1. Introduction

Since their physical structure resembles humans, humanoid robots can be expected to help us with many tasks in our normal living environment, without specifically needing additional environmental customization.
Therefore, interest continues to grow in the development of humanoid robots and their control methods to achieve whole-body dynamic movements in these systems [1-4]. In particular, over the last

* To whom correspondence should be addressed. takam-m@is.naist.jp
decade, a number of methods for achieving various tasks on a humanoid robot have been explored, mainly to achieve biped walking and balancing [5-8]. Even though a number of real humanoid robots have demonstrated whole-body dynamic movements with these existing methods, it remains impossible to introduce humanoid robots into our living spaces to help us in our daily lives. This is in large part caused by their inability to adapt to new environments as easily as humans and animals, i.e., due to a lack of motor learning ability. One candidate solution for granting motor learning skills to humanoid robots is reinforcement learning (RL), a promising method because it requires no expert teachers or idealized desired behavior to improve skills. RL is a framework for improving the control rules of an agent, i.e., a robot, through iterative interaction with the environment based on a trial-and-error paradigm, without using an explicit model of the environment [9, 10]. However, with increasing dimensionality in state and action spaces, RL often requires not only a large number of iterations, but also large computational cost, especially for learning a complex control policy for motor learning. Although many researchers have attempted to apply RL methods to several robots in simulations and real hardware systems for acquiring desired movements, so far most of the robots to which learning has been successfully applied have only a small number of d.o.f., not as many as the d.o.f. typically offered by humanoid robots [11-15]. To the best of our knowledge, only one attempt has successfully learned the desired movements on a small humanoid robot, and that work focused on learning biped walking [16]. In this paper, we present a novel approach for acquiring dynamic whole-body movements on humanoid robots by focusing on learning a control policy for the center of mass (CoM).
The CoM is one of the most important features of humanoid robots because it approximately represents the whole-body motion of the robot. Moreover, as suggested by such experimental studies as Ref. [17], it can also be considered a control variable that humans use during functional tasks to overcome the curse of dimensionality. Due to its low dimensionality, learning a CoM movement for a given task might be simpler than directly learning all joint movements. Therefore, we propose combining a model-based CoM controller with a model-free RL approach. A drawback of the model-based CoM controller is that the method only considers highly approximated dynamics, comprised of the CoM and the zero moment point (ZMP), to design a CoM controller. This approximation can cause poor tracking performance for a given desired trajectory. Model-based approaches are also always affected by modeling errors. However, with a model-based CoM controller we can derive joint angular velocities from the desired CoM velocity and can also explicitly consider balancing. On the other hand, RL methods are applicable to improve the performance of controllers without using physical models and parameters. However, as described above, their drawback is that RL generally is not applicable to high-dimensional systems due to the curse of dimensionality [9]. Therefore, we cannot expect improvement of controllers for humanoid robots with
many d.o.f. within a realistic amount of time by naive application of RL. In our approach, we combine a model-based CoM controller and an RL method to acquire dynamic whole-body movements on humanoid robots. (i) To cope with high dimensionality, we use a model-based CoM controller as a basic CoM controller that derives joint angular velocities from the desired CoM velocity. The balancing issue can also be considered in the controller [5-8]. (ii) The RL method is used to acquire a controller that generates the desired CoM velocity based on the current state. While RL methods generally do not require the physical model and parameters of the robot, the learning system needs to be a Markov decision process for most standard approaches based on value functions, such as Q-learning. Since we only consider the CoM position and time as state variables, the learning system becomes a partially observable Markov decision process (POMDP). Therefore, we use a policy-gradient method that can be applied to POMDPs. We demonstrate that our proposed approach efficiently acquires appropriate policies for a ball-punching task on a numerically simulated humanoid robot model of Fujitsu's Hoap-2. The acquired whole-body punching movement is demonstrated on a real hardware system as well as in simulations. The paper is organized as follows. In Section 2, we briefly describe our approach for learning desired whole-body movements on a humanoid robot by focusing on its CoM. In Section 3, we briefly introduce the ZMP and its equation, and describe how the CoM can be controlled by manipulating the ZMP based on the ZMP equations. In Section 4, we present the policy-gradient method for learning an appropriate control policy for the desired full-body movements of a humanoid robot. In Section 5, we present a concrete example of the learning system in a ball-punching task with a humanoid robot.
In Section 6, we describe the results achieved by applying the proposed method in numerical simulations. In Section 7, we demonstrate the acquired whole-body punching movement on a real robot.

2. Learning a Desired Whole-Body Movement on a Humanoid Robot: Focused on Learning CoM Movements

In this section, we briefly describe our approach for learning desired whole-body movements on a humanoid robot. The approach focuses on learning a CoM movement suitable to achieve the task. Figure 1 shows a rough sketch of our proposed approach. x is a state variable carrying (partial) information about the robot, and a is the control output for learning. In this paper, we focus on learning the CoM movement, i.e., the control output is the desired velocity of the CoM, ṙ^CoM_ref, so that a = ṙ^CoM_ref. c(x) is a reward function that evaluates each control decision. π(x, a; w) is a control policy whose parameter w is learned to maximize the accumulated reward. q̇ denotes the desired joint angular velocities. As long as both the CoM and the ZMP are inside the support polygon during CoM control, the robot can be prevented from falling over [18]. This characteristic makes the
under-actuated robot system effectively behave like a fully actuated one, which simplifies the motor learning task and increases its tractability. Thus, our approach also contains such a ZMP manipulation method; i.e., policy-gradient-type reinforcement learning is applied to learn the CoM controller on top of such a ZMP manipulation method. The acquired controller is expected to implicitly account for the dynamics of the robot, e.g., friction and inertia, and for information about the task, neither of which is explicitly considered in the model-based CoM controller. A CoM Jacobian-based redundancy resolution technique is utilized to compute angular velocities for all joints to achieve a whole-body movement consistent with a desired CoM movement [7]. We use a manually tuned weighting matrix in the weighted pseudo-inverse computation to achieve desirable joint configurations and avoid joint limits. Thus, our learning system is composed of two components, introduced in the following two sections: (i) CoM control based on the ZMP and distribution of the CoM movement into joint space, and (ii) RL for the CoM movement.

Figure 1. Learning system to acquire desired whole-body movements on a humanoid robot. x is the state variable and a is the control output for learning. We focus on learning the CoM movement, i.e., the control output is the desired velocity of the CoM, ṙ^CoM_ref, so that a = ṙ^CoM_ref. c(x) is a reward function that evaluates each control decision. π(x, a; w) is the control policy whose parameter w is learned to maximize the accumulated reward. q̇ denotes the desired joint angular velocities.

3. CoM Controller Based on a ZMP Equation

This section describes a method for controlling the CoM based on ZMP manipulation [5, 7]. ZMP compensation control is a method that compensates the current ZMP toward an objective point [5].
The PID controller is used to calculate the objective ZMP based on an analogy between the inverted pendulum and the CoM-ZMP dynamics of a mass-concentrated humanoid robot model [7]. By integrating the two components, which are presented in Sections 3.1 and 3.2, the CoM can be controlled by manipulating the ZMP. The CoM Jacobian-based redundancy resolution technique, described in Section 3.3, is utilized to calculate movements in the full joint space consistent with the desired CoM movements.
3.1. ZMP Compensation Control

According to Nagasaka [5], assuming a mass-concentrated model, the relationship between the moment acting on the ZMP and the objective ZMP is given as:

n_ZMP = n_OZMP + (r_OZMP − r_ZMP) × f_CoM,  (1)
n_OZMP = (r_CoM − r_OZMP) × f_CoM,  (2)

where n_ZMP ∈ R^3 and n_OZMP ∈ R^3 are the ZMP and objective ZMP moments, respectively. r_ZMP ∈ R^3 and r_OZMP ∈ R^3 are the position vectors of the ZMP and the objective ZMP from the origin, respectively, and f_CoM ∈ R^3 is the force acting on the CoM. From the definition of the ZMP, which is a point such that the horizontal components of the moment acting at the point are zero, we can derive a control law that compensates the ZMP toward the objective ZMP by kinematically manipulating the CoM as follows:

Δr^CoM_{x,i+1} = K (r^ZMP_x − r^OZMP_x) + (r^CoM_{x,i} − r^CoM_{x,i−1}) + K Δr^CoM_{x,i},  (3)
Δr^CoM_{y,i+1} = K (r^ZMP_y − r^OZMP_y) + (r^CoM_{y,i} − r^CoM_{y,i−1}) + K Δr^CoM_{y,i},  (4)

where K = f^CoM_{z,i} Δt² / (r^CoM_{z,i} − r^OZMP_{z,i}) and Δt is a discrete time step. Δr is the deviation of position during Δt. The desired velocity of the CoM can be straightforwardly approximated as:

ṙ^CoM_x ≈ Δr^CoM_{x,i+1} / Δt,  (5)
ṙ^CoM_y ≈ Δr^CoM_{y,i+1} / Δt.  (6)

Under such control, the robot can be regarded as an inverted pendulum with its supporting point at the objective ZMP.

3.2. Calculating the Reference ZMP Based on the Inverted Pendulum Model

As mentioned above, since the horizontal components of the moment on the ZMP are zero, the mass-concentrated model of the humanoid robot can be regarded as an inverted pendulum. Based on this analogy, we apply a simple PID controller to control the CoM by manipulating the ZMP as described in Ref. [7]. The dynamics of the mass-concentrated model, approximately linearized around an equilibrium point, are given as:

r̈^CoM_x = ω² (r^CoM_x − r^ZMP_x),  (7)
r̈^CoM_y = ω² (r^CoM_y − r^ZMP_y),  (8)

where ω = √((r̈^CoM_z + g) / (r^CoM_z − r^ZMP_z)). The above dynamics equations represent the horizontal movement of the CoM.
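To make the linearized dynamics (7) concrete, the following short sketch integrates the x-component of the CoM under a fixed ZMP offset. This is only an illustration, not the authors' implementation: the CoM height, time step and ZMP offset are assumed values, and ω² is taken as g/z_CoM (flat ground, with r̈^CoM_z = 0).

```python
import math

# Linearized CoM-ZMP ("inverted pendulum") dynamics from (7):
#   r_ddot = omega**2 * (r_com - r_zmp)
# Illustrative constants (assumptions, not from the paper):
G = 9.81
COM_HEIGHT = 0.25          # constant CoM height [m], so omega**2 = g / z_com
omega2 = G / COM_HEIGHT
dt = 0.001                 # integration step [s]

def step(r_com, rdot_com, r_zmp):
    """One Euler step of the x-component of the linearized dynamics (7)."""
    rddot = omega2 * (r_com - r_zmp)
    return r_com + dt * rdot_com, rdot_com + dt * rddot

# Placing the ZMP at the CoM keeps the CoM acceleration zero; placing it
# behind the CoM accelerates the CoM forward.
r, rdot = 0.0, 0.0
for _ in range(1000):                  # 1 s with the ZMP 1 cm behind the CoM
    r, rdot = step(r, rdot, r - 0.01)
print(r > 0.0)  # True: the CoM drifted forward
```

Because a ZMP behind the CoM accelerates the CoM away from it, these dynamics are unstable on their own; the ZMP compensation of (3)-(6) and the PID controller of the next derivation exist precisely to stabilize them.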
Due to the symmetry of the x and y components, we can focus
on the x component in the following derivation without loss of generality. By differentiating (7) and ignoring the change in ω, i.e., assuming that r̈^CoM_z = 0 and that r^CoM_z is constant, the following equation can be derived:

d³r^CoM_x/dt³ = ω² (ṙ^CoM_x − ṙ^ZMP_x).  (9)

To control the CoM r^CoM_x with reference r^CoM_{x,ref} as a target, we can apply the following controller:

ṙ^ZMP_x(t) = K_P (ṙ^CoM_{x,ref} − ṙ^CoM_x) + K_I ∫ (ṙ^CoM_{x,ref} − ṙ^CoM_x) dt + K_D (r̈^CoM_{x,ref} − r̈^CoM_x),  (10)
ṙ^CoM_{x,ref} = K_C (r^CoM_{x,ref} − r^CoM_x).  (11)

K_P, K_I, K_D and K_C are gains. By the final-value theorem, it may be proven that r^CoM_x converges to r^CoM_{x,ref} with appropriate settings of the gains. By integrating the two components presented in this section, the CoM can be controlled by manipulating the ZMP. In the next section, we describe a CoM Jacobian-based redundancy resolution technique to achieve whole-body movement consistent with the desired CoM movement.

3.3. Distributing the CoM Movement into Joint Space

In this section, we present a CoM Jacobian-based redundancy resolution technique to achieve whole-body movement consistent with the desired CoM movement [7]. We also present the CoM controller used in our framework, which is based on the CoM Jacobian.

3.3.1. Distributing CoM Movements Through the CoM Jacobian

Sugihara et al. [7] utilized a calculation method for the CoM Jacobian of legged systems that was originally proposed by Boulic et al. [19]. The CoM Jacobian relates the CoM velocity to the angular velocities of all joints as:

ṙ^CoM = J_C(q) q̇,  (12)

where J_C(q) ∈ R^{3×n} is the CoM Jacobian and n is the number of d.o.f. of the robot. By using the CoM Jacobian and a weighted pseudo-inverse calculation, we can distribute the CoM velocity to the angular velocities of all the joints based on sum-squared minimization applied to all the joint angular velocities as follows:

q̇ = J^+_C ṙ^CoM + (I − J^+_C J_C) k,  (13)

where:

J^+_C = W^{−1} J^T_C (J_C W^{−1} J^T_C)^{−1},  (14)

W = diag{w_i} (i = 1, ..., n), and k ∈ R^n is an arbitrary vector. I ∈ R^{n×n} is an identity matrix. The above redundancy resolution technique with a weighting matrix determines whole-body motion consistent with the desired CoM movements.
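The weighted pseudo-inverse distribution of (13)-(14) can be sketched in a few lines of NumPy. The 8-d.o.f. Jacobian, weights and desired velocity below are toy assumptions, not the robot's actual kinematics:

```python
import numpy as np

def weighted_pinv(J, w):
    """Weighted pseudo-inverse from (14): J^+ = W^-1 J^T (J W^-1 J^T)^-1."""
    W_inv = np.diag(1.0 / np.asarray(w))
    return W_inv @ J.T @ np.linalg.inv(J @ W_inv @ J.T)

def distribute_com_velocity(J_C, rdot_com, w, k):
    """Joint velocities from (13): qdot = J_C^+ rdot + (I - J_C^+ J_C) k."""
    n = J_C.shape[1]
    J_plus = weighted_pinv(J_C, w)
    return J_plus @ rdot_com + (np.eye(n) - J_plus @ J_C) @ k

# Toy example with a random 3 x 8 CoM Jacobian (8 d.o.f. is an assumption).
rng = np.random.default_rng(0)
J_C = rng.standard_normal((3, 8))
rdot_ref = np.array([0.05, 0.0, 0.0])      # desired CoM velocity [m/s]
qdot = distribute_com_velocity(J_C, rdot_ref, w=np.ones(8), k=np.zeros(8))
print(np.allclose(J_C @ qdot, rdot_ref))   # True: the CoM velocity is met
```

With W = I this reduces to the ordinary Moore-Penrose pseudo-inverse; the null-space term (I − J^+_C J_C) k is what later carries the punching motion without disturbing the CoM.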
Figure 2. Definition of the variables.

3.3.2. CoM Jacobian-Based Redundancy Resolution in the Double-Support Case

We used the following mapping to control the CoM with all joints:

q̇ = J^+ ṙ + (I − J^+ J) k,  (15)

where ṙ = [ṙ_C − ṙ_rl, ṙ_C − ṙ_ll]^T ∈ R^6 and J(q) = [J_C(q) − J_rl(q), J_C(q) − J_ll(q)]^T ∈ R^{6×n}. k ∈ R^n is an arbitrary vector. r_C is the position vector of the CoM from the base-link defined on the waist, and r_ll and r_rl are the position vectors of the left and right feet from the base-link, respectively. ṙ and J(q) are the corresponding velocity vector and the Jacobian of each r defined above, respectively. The variables are defined in Fig. 2. The desired ṙ to control the CoM along the desired trajectory is given by (5), (6) and (10).

4. RL for CoM Movement

In this section, we present an RL method for the proposed learning framework. For learning the CoM movement, we use a policy-gradient method, a kind of RL method that maximizes the average reward with respect to the parameters of the action rule, known as the policy [11, 20, 21]. Compared with most standard value-function-based RL methods, such a method has particular features suited to robotic applications. First, the policy-gradient method is applicable to POMDPs [22]. Considering all possible states of the robot is almost impossible, because even with a complete set of sensors there will be a certain degree of noise. It is also possible to consider a partial set of states as input for an RL system. Second, the policy-gradient method is a stochastic gradient-descent scheme. The policy can, therefore, be improved with every update. In this section, we briefly describe a framework for RL with the policy-gradient method.

4.1. RL With a Policy-Gradient Method

Assuming a Markov decision process, the average reward, discounted cumulative reward and value functions are defined as:

η(θ) = lim_{T→∞} E[(1/T) Σ_{t=0}^{T} c(x_t)],  (16)
η_β(θ) = lim_{T→∞} E[(1/T) Σ_{t=0}^{T} β^t c(x_t)],  (17)
V^π_β(x) = E[Σ_{k=0}^{∞} β^k c_{t+k+1} | x_t = x],  (18)
Q^π_β(x, a) = E[Σ_{k=0}^{∞} β^k c_{t+k+1} | x_t = x, a_t = a],  (19)

where x ∈ S is the state and c(x): S → R is the immediate reward. η(θ) is the average reward and η_β(θ) is the discounted cumulative reward. V^π_β(x) and Q^π_β(x, a) are the state-value and action-value functions, respectively [9]. x is the state, a is the action and θ is the parameter of the stochastic policy. β is a discounting factor. The goal of RL is to maximize the average reward. If we calculate the gradient of η(θ) with respect to the policy parameters θ, we can search for a locally optimal policy in the policy parameter space by updating the parameters as θ ← θ + α ∇η(θ), where ∇η(θ) is the gradient of η(θ) with respect to θ. Various derivations and algorithms have been proposed to estimate the gradient based on sampling through interaction with the environment. According to Kimura and Kobayashi [23], the gradient is given by:

∇η = (1 − β) ∇η_β  (20)
    = (1 − β) ∫∫ d(x) π(a, x) [∇log d(x) + (1/(1 − β)) ∇log π(a, x)] Q^π_β(x, a) da dx  (21)
    = ∫∫ d(x) π(a, x) [(1 − β) ∇log d(x) + ∇log π(a, x)] {Q^π_β(x, a) − V^π_β(x)} da dx  (22)
    = lim_{T→∞, β→1} (1/T) Σ_{t=0}^{T} ∇log π(a_t, x_t) Σ_{s=t}^{T} β^{s−t} δ(x_s, a_s)
    = lim_{T→∞, β→1} (1/T) Σ_{t=0}^{T} δ(x_t, a_t) Σ_{s=0}^{t} β^{t−s} ∇log π(a_s, x_s).  (23)

Here, π(x, a; θ) = P(a | x; θ) is a stochastic policy that maps state x to action a stochastically, and ∇π(x, a; θ) denotes the gradient of π(x, a; θ) with respect to θ.
d(x) is the stationary distribution of x. δ(x, a) is the TD error, defined as δ(x_t, a_t) = c(x_t) + β ∫ p(x_{t+1} | x_t, a_t) V^π_β(x_{t+1}) dx_{t+1} − V^π_β(x_t). Equation (20) is presented in Ref. [24] as Theorem 1 and (21) is derived in Ref. [25]. The derivation of (22) is based on ∫ ∇π(x, a) V^π_β(x) da = 0. If we neglect V^π_β(x), the algorithm is identical to the GPOMDP algorithm developed in Ref. [21]. As pointed out in Ref. [21], the discounting factor β controls a bias-variance trade-off in the policy gradient estimated by sampling. In fact, we update the policy parameters based on the following rule: θ_{t+1} = θ_t + α D_t δ(x_t, a_t), where D is updated by D_t = β D_{t−1} + ∇log π(x_t, a_t). However, to derive the TD error δ(x_t, a_t), we need the state-value function V^π_β(x). In this paper, we simultaneously approximate it using the function approximator V̂^π_β(x; w) with parameter w and a simple TD learning method, Δw = α δ_t ∂V̂^π_β(x; w)/∂w. The TD error δ(x_t, a_t) is then approximately calculated as δ(x_t, a_t) = c(x_t) + β V̂^π_β(x_{t+1}) − V̂^π_β(x_t). Note that β should satisfy 0 ≤ β < 1 to prevent the state-value function from diverging.

5. Application to Learning of a Dynamic Task: Ball-Punching

In the previous sections, we presented our learning approach, which focuses on the CoM movement to achieve whole-body movement. We applied the proposed approach to learning a whole-body movement on a humanoid robot and selected a ball-punching task. The goal was to strengthen the ball punch through a learning process focused on the CoM movement. In this section, the details of the learning settings are described. We then present the numerical simulation and experimental results in a real environment in the next two sections.

5.1. Learning a CoM Movement for Whole-Body Dynamic Punching

In this paper, we focus on controlling the x-axis component of the CoM, i.e., the policy output is the target velocity of the x-axis component of the CoM, ṙ^CoM_{x,ref}.
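The actor-critic updates of Section 4 (the eligibility trace D_t, the TD error δ and TD learning of V̂^π_β) can be sketched on a toy one-dimensional problem. The linear dynamics, feature map, reward and step sizes below are assumptions for illustration only, not the ball-punching setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    """Feature vector of the state (a hypothetical 1-D feature map)."""
    return np.array([1.0, x])

theta = np.zeros(2)       # policy parameters: mean of the Gaussian policy
w = np.zeros(2)           # value-function parameters, V_hat(x) = w . phi(x)
sigma = 0.5               # fixed exploration noise of the Gaussian policy
beta, alpha = 0.9, 0.005  # discounting factor (0 <= beta < 1) and step size
D = np.zeros(2)           # eligibility trace: D_t = beta*D_{t-1} + grad log pi

x = 0.0
for t in range(5000):
    mu = theta @ phi(x)
    a = mu + sigma * rng.standard_normal()       # sample from Gaussian policy
    x_next = float(np.clip(0.5 * x + a, -5, 5))  # toy bounded linear dynamics
    c = -x_next ** 2                             # reward: keep the state near 0
    # TD error with the approximator: delta = c + beta*V_hat(x') - V_hat(x)
    delta = c + beta * (w @ phi(x_next)) - (w @ phi(x))
    w += alpha * delta * phi(x)                  # critic: TD(0) update of w
    # actor: grad log pi of a Gaussian policy is (a - mu)/sigma^2 * phi(x)
    D = beta * D + (a - mu) / sigma ** 2 * phi(x)
    theta += alpha * delta * D                   # theta_{t+1} = theta_t + alpha*D_t*delta_t
    x = x_next
print(np.all(np.isfinite(theta)))  # True: parameters remain finite
```

The same update loop, with the RBF features and Gaussian policy of Section 5.2, is what drives the learning of the CoM velocity command in the punching task.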
Thus, the action in the control policy for learning is defined as a = ṙ^CoM_{x,ref}. To simplify the task, we constrained the desired CoM to a one-dimensional movement. The policy output is distributed to the x- and y-axis components of the CoM as ṙ^CoM_x = sin(ψ) ṙ^CoM_{x,ref} and ṙ^CoM_y = cos(ψ) ṙ^CoM_{x,ref}, where ψ is the angle measured clockwise from the y-axis to the x-axis, and ψ = π/3, as depicted in Fig. 3. This setting can be considered to make sufficient use of the area of the support polygon because a diagonal line is longer than lines along the x- and y-axes. The state space was simply defined as x = (r^CoM_x, t). Note that the state of the dynamics of the humanoid robot to which the learning is applied is not such a low-dimensional variable, even though the inverted pendulum-based controller simplifies it, as explained in Section 3. However, the position of the CoM remains one of the most dominant variables, and the time t is also important for coordinating the timing of the pre-designed punching motion. Thus, with the above notion and the applicability
of the policy-gradient method to such partially observable cases [21], we simply designed the state space for the above learning.

Figure 3. One-dimensional CoM movement controlled by a policy is shown by the grey line; the solid lines are the feet and the dashed line is the support polygon.

5.2. Gaussian Policy and Function Approximator for the State-Value Function

We implemented the following Gaussian policy as a stochastic policy for controlling the CoM:

π(x, a; θ) = (1 / (√(2π) σ)) exp(−(a − μ(x; θ))² / (2σ²)),  (24)

where μ(x; θ) = θ^T φ(x). x is the state and a is the action. In this study, a radial basis function network is used as a model of the feedback controller. Since it is almost impossible to manually design all the network parameters, the policy-gradient method is useful for optimizing them. We located Gaussian basis functions φ(x) on a grid with even intervals in each dimension of the observation space, as in Refs [10, 15]. The function approximator for the state-value function is also modeled as V̂^π_β(x) = w^T φ(x). We allocated 100 (= 10 × 10) basis functions φ(x) in the state space (−1.0 < r^CoM_x < 0.0, 0.5 < t < 4.0) to represent the mean of the policy μ(x).

5.3. Reward Function

The purpose of the ball-punching task is to strengthen the punch as much as possible. We designed the reward function based on this objective as:

c = (t − t_b) v^T_b v_b,  (25)

because the velocity v_b of the punched ball is proportional to its momentum. The term associated with time t is incorporated in the reward function to avoid local-minimum motions, in which the robot falls forward and ignores the timing of the punch. t_b is a bias that distributes the reward between positive and negative values, which is set as
in this study. A negative reward of −5 is given when both feet leave the ground, to avoid acquiring a punching motion with jumping.

5.4. Punching Motion Projected onto the Null Space of the CoM Controller

A punching motion was straightforwardly implemented by tracking a target trajectory in task space. In this study, we achieved tracking control in the null space of the CoM controller by introducing the following vector as the arbitrary vector in (15):

k = Ĵ^+_ra (ṙ_ra − J_ra J^+ ṙ),  (26)

where J_ra ∈ R^{3×n} is the Jacobian relating the right-hand velocity in task space ṙ_ra to q̇ as ṙ_ra = J_ra q̇, and Ĵ_ra = J_ra (I − J^+ J). Introducing this vector yields target tracking with the right hand in the null space of the CoM controller [26].

6. Numerical Simulations

6.1. Settings and Results

We applied the proposed approach to the acquisition of a strong punching movement on Fujitsu's Hoap-2 humanoid robot (see Fig. 4) in numerical simulation. The ball was modeled as a simple point mass (0.1 kg), and the contact between the robot and the ball was simulated by a spring-damper model. A spring-damper model was also used to model the floor. The integration time step for the robot was 0.2 ms, and the time interval for learning was 50 ms. For the CoM and right-arm controllers, a weighting matrix suitable for this task must be set in (14) to appropriately achieve whole-body motion. To avoid using the d.o.f. of the right arm (which are used for the punching motion) for the CoM controller, the weights of the right-arm joints were set smaller (0.01) than those of the other joints (1.0) in the CoM controller described by (13). For the right-arm controller described by (26), to achieve a punching motion mainly using the right arm, we set the weights of the body joints larger (3.0) than the other joints (1.0).

Figure 4. Fujitsu humanoid robot Hoap-2 (21 d.o.f.): 6 d.o.f. for each leg, 4 d.o.f. for each arm and 1 d.o.f. for the waist. Total weight is about 7 kg, and height is about 0.4 m.

The target trajectory for the right-arm controller to achieve a punching motion was designed as r^ra_{x,ref} = p sin(2πf(t − t_a)) + q for t ≥ t_a, and we set the parameters by considering Hoap-2's physical model, so that the amplitude p = 0.03 m, the bias q = 0.21 m, the frequency f = 1.5 Hz and the bias t_a = 3.5 s. While 0 < t < 3.5 s, r^ra_{x,ref} is held constant.

Figure 5. Acquired reward at each episode. The learning curve was averaged over five experiments and smoothed by taking a 50-episode moving average.

Figure 6. Acquired control policy for the x-axis component of the CoM.

Figure 5 shows the reward at each episode obtained with the policy-gradient method. The curve shows that a locally optimal punching motion with maximal reward was acquired after around 2000 episodes. Figure 6 shows the acquired policy for controlling the x-axis component of the CoM, and Fig. 7 presents the whole-body punching motion acquired by the control policy. While keeping the CoM at the initial point, the punching motion produced a ball momentum of about kg·m/s. The acquired punching motion without any probabilistic factors produced an average ball momentum of about kg·m/s
(standard deviation was 0.005), which means the ball momentum generated by the learned policy was about 2.3 times larger than the initial performance. Note that the acquired control policy is not a simple trajectory. Figure 8 presents the x-axis CoM trajectories under the acquired control policy from various initial conditions. To achieve a strong punching motion, the x-axis CoM position must be about −0.02 m to guarantee that the right arm can kinematically reach the ball. When the robot hits the ball, the CoM also requires a high velocity for a strong punch. From various initial conditions, the acquired policy tends to move the CoM backward from the ball at the beginning. Then it accelerates and propels the CoM forward, achieving a high velocity when its position is about −0.02 m, coordinated with the pre-designed right-arm movement. Thus, the acquired control policy is a complex feedback controller for achieving a strong punch.

Figure 7. Acquired whole-body punching movement. Snapshots correspond to 0.0, 0.85, 1.40, 2.16 and 2.33 s, respectively. The grey bar on the foot denotes the ground reaction force.

Figure 8. CoM trajectories generated with the learned control policy from various initial CoM positions.

6.2. Robustness of Learning Against Modeling Error

As presented in the previous sections, our approach requires robot information such as the mass, length and CoM position of each link to calculate
the position of the CoM and its Jacobian. Even though having perfectly accurate parameters would be desirable, our approach can be robust to estimation errors in such parameters, because the control policy of the CoM is acquired through iterative interaction with the environment. To investigate this robustness, we applied the learning in simulations with the following settings: (i) the mass of the right arm's tip is over-estimated as double the true parameter, and (ii) the position of the body mass is biased by 0.01 m in the x-axis direction. In both cases, an appropriate control policy for the CoM was acquired, as in the normal settings. The resulting rewards with the policies acquired for (i) and (ii) through 2000 trials were 1.57 and 1.89, respectively, averaged over five experiments and smoothed by taking a 50-episode moving average. These results suggest robustness to modeling errors.

7. Experiments on a Real Hardware System

In this section, we implemented the proposed controller on Hoap-2, a real humanoid robot. We implemented the CoM trajectories generated in simulations with the acquired control policy for the CoM. To show the effectiveness of the learned punching motions, we set a toy car in front of the robot as a punching target. The distance the toy car is punched measures the effectiveness of the initial and learned punches. Figure 9 provides sequential snapshots of the car being hit. The upper and lower sequences are the initial and learned movements, respectively. The results suggest that the punching motion, i.e., the acquired cooperative whole-body movement, is effective even in a real environment.

Figure 9. Sequential snapshots of the punching motion with (a) the initial (car speed was 0.42 m/s) and (b) the learned (car speed was 0.71 m/s) control policies. Each picture corresponds to 0.0, 0.67, 1.67 and 2.0 s from the timing of impact.
From the car's movement after being punched, it is clear that the learned punch had a significantly larger impact on the car.
8. Conclusions

This paper presented an approach for acquiring dynamic whole-body movements on humanoid robots, focused on learning a control policy for the CoM to produce dynamic movements for achieving tasks. We applied the framework to the learning of a dynamic ball-punching motion on a Hoap-2 model in numerical simulations. As a result, we demonstrated that acquiring dynamic punching motions is possible through learning with our approach. We achieved the task with significantly fewer trials, considering the original complexity of the task and robot. The acquired cooperative whole-body punching movement was also demonstrated on a real hardware platform. As future work, we wish to explore on-line learning in a real environment, because the proposed framework is also suitable for such situations.

References

1. K. Hirai, M. Hirose, Y. Haikawa and T. Takenaka, The development of Honda humanoid robot, in: Proc. IEEE Int. Conf. on Robotics and Automation, Leuven, pp. (1998).
2. Y. Kuroki, T. Ishida, J. Yamaguchi, M. Ujita and T. Doi, A small biped entertainment robot, in: Proc. IEEE RAS Int. Conf. on Humanoid Robots, Tokyo, pp. (2001).
3. J. Morimoto, G. Endo, J. Nakanishi, S. Hyon, G. Cheng, D. Bentivegna and C. Atkeson, Modulation of simple sinusoidal patterns by a coupled oscillator model for biped walking, in: Proc. IEEE Int. Conf. on Robotics and Automation, Orlando, FL, pp. (2006).
4. S. Hyon and G. Cheng, Passivity-based whole-body motion control for humanoids: gravity compensation, balancing and walking, in: Proc. IEEE Int. Conf. on Intelligent Robots and Systems, Beijing, pp. (2006).
5. K. Nagasaka, The whole-body motion generation of humanoid robot using dynamics filter (in Japanese), PhD Thesis, University of Tokyo (2000).
6. S. Kagami, F. Kanehiro, Y. Tamiya, M. Inaba and H.
Inoue, Autobalancer: an online dynamic balance compensation scheme for humanoid robots, in: Algorithmic and Computational Robotics: New Directions, B. R. Donald, K. Lynch and D. Rus (Eds), pp A. K. Peters, Wellesley, MA (2001). 7. T. Sugihara and Y. Nakamura, Whole-body cooperative balancing of humanoid robot using COG Jacobian, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Lausanne, pp (2002). 8. S. Kajita, F. Kanehiro, K. Kaneko, K. Fujiwara, K. Harada, K. Yokoi and H. Hirukawa, Resolved momentum control: humanoid motion planning based on the linear and anguler momentum, in: Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Las Vegas, NV, pp (2003). 9. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998). 10. K. Doya, Reinforcement learning in continuous time and space, Neural Comput. 12, (2000). 11. H. Kimura, K. Miyazaki and S. Kobayashi, Reinforcement learning in POMDPs with function approximation, in: Proc. 14th Int. Conf. on Machine Learning, Nashville, TN, pp (1997).
16 1140 T. Matsubara et al. / Advanced Robotics 22 (2008) H. Kimura, T. Yamashita and S. Kobayashi, Reinforcement learning of walking behavior for a four-legged robot, in: Proc. IEEE Conf. on Decision and Control, Orlando, FL, pp (2001). 13. J. Morimoto and K. Doya, Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning, Robotics Autonomous Systems 36, (2001). 14. R. Tedrake, T. W. Zhang and H. S. Seung, Stochastic policy gradient reinforcement learning on a simple 3D biped, in: Proc. IEEE Int. Conf. on Intelligent Robots and Systems, Sendai, pp (2004). 15. T. Matsubara, J. Morimoto, J. Nakanishi, M. Sato and K. Doya, Learning sensory feedback to CPG for biped locomotion with policy gradient, in: Proc. IEEE Int. Conf. on Robotics and Automation, Barcelona, pp (2005). 16. G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi and G. Cheng, Learning CPG sensory feedback with policy gradient for biped locomotion for a full body humanoid, in: Proc. 12th Natl. Conf. on Artificial Intelligence, Pittsburgh, PA, pp (2005). 17. J. Scholz and G. Schoner, The uncontrolled manifold concept: identifying control variables for a functional task, Exp. Brain Res. 126, (1999). 18. M. Vukobratović and B. Borovac, Zero-moment point thirty five years of its life, Int. J. Humanoid Robotics 1, (2004). 19. R. Boulic, R. Mas and D. Thalmann, Inverse kinetics for center of mass position control and posture optimization, in: Proc. Eur. Workshop on Combined Real and Synthetic Image Processing for Broadcast and Video Production, Hamburg (1994). 20. R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learn. 8, (1992). 21. J. Baxter and P. L. Bartlett, Infinite-horizon policy-gradient estimation, J. Artif. Intell. Res. 15, (2001). 22. D. A. Aberdeen, Policy-gradient algorithms for partially observable Markov decision processes, PhD Thesis, Australian National University (2003). 23. H. Kimura and S. 
Kobayashi, An analysis of actor/critic algorithms using eligibility traces: reinforcement learning with imperfect value function, in: Proc. Int. Conf. on Machine Learning, Madison, WI, pp (1998). 24. J. Baxter and P. L. Bartlett, Direct gradient-based reinforcement learning: I. Gradient estimation algorithms, Technical Report, Australian National University (1999). 25. R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Adv. Neural Information Proc. Syst. 12, pp (2000). 26. T. Yoshikawa, Foundations of Robotics: Analysis and Control. MIT Press, Cambridge, MA (1990). About the Authors Takamitsu Matsubara received the BE in Electrical and Electronic Systems Engineering from Osaka Prefecture University, Japan, in 2003, the ME in Information Science from Nara Institute of Science and Technology, Nara, in 2005, and the PhD in Information Science from Nara Institute of Science and Technology, Nara, in From 2005 to 2007, he was a Research Fellow (DC1) of the Japan Society for the Promotion of Science. He is currently an Assistant Professor of Nara Institute of Science and Technology and Visiting Researcher at ATR Computational Neuroscience Laboratories, Kyoto. His research interests include
reinforcement learning, machine learning and robotics.

Jun Morimoto is a Senior Researcher at ATR Computational Neuroscience Laboratories and with the Computational Brain Project, ICORP, JST. He received the PhD in Information Science from Nara Institute of Science and Technology, Nara. He was a Research Assistant with the Kawato Dynamic Brain Project, ERATO, JST, beginning in 1999. From 2001 to 2002, he was a Postdoctoral Fellow at the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. He joined ATR in 2002 and subsequently joined JST, ICORP.

Jun Nakanishi received the BE and ME degrees, both in Mechanical Engineering, from Nagoya University, Nagoya, in 1995 and 1997, respectively, and the PhD degree in Engineering from Nagoya University. He also studied in the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor, MI. He was a Research Associate at the Department of Micro System Engineering, Nagoya University, from 2000 to 2001, and a Presidential Postdoctoral Fellow at the Computer Science Department, University of Southern California, Los Angeles, CA, beginning in 2001. He then joined ATR Human Information Science Laboratories, Kyoto. He is currently a Researcher at ATR Computational Neuroscience Laboratories and with the Computational Brain Project, ICORP, Japan Science and Technology Agency. His research interests include motor learning and control in robotic systems. He received the IEEE ICRA 2002 Best Paper Award.

Sang-Ho Hyon received the MS degree in Mechanical Engineering from Waseda University, in 1998, and the PhD degree in Control Engineering from the Tokyo Institute of Technology. He was a Research Associate and Assistant Professor at Tohoku University, where he developed various legged robots and their controllers, performing dynamic locomotion experiments such as jumping, running, walking and somersaulting.
He is currently a Researcher at ATR Computational Neuroscience Laboratories, Japan. From 2005 to 2007, he was a researcher at the JST International Cooperative Research Project, Computational Brain Project. He was a 1999 ICRA Best Paper Award Finalist. His primary research interests are legged locomotion, nonlinear oscillation and nonlinear control. He is a member of the RSJ and the IEEE Robotics and Automation Society.

Joshua G. Hale received the BA (Hons 1st) degree in Computation from the University of Oxford, in 1997, the MS (Dist.) degree in Computer Science from the University of Edinburgh, in 1998, the MA degree in Computation from the University of Oxford, in 2002, and the PhD degree, on biomimetic motion synthesis, from the University of Glasgow. He has worked as a Research Engineer in the Hardware Compilation Group at the University of Oxford and as a Research Assistant in the Computer Vision and Graphics Laboratory at the University of Glasgow, and is currently employed as a Researcher at the Humanoid Robotics and Computational Neuroscience Laboratory at ATR in Japan. His research interests include dynamic simulation, humanoid robotics and robot skill acquisition, computer graphics and three-dimensional modelling, and human motion production and perception.
Gordon Cheng received the BS and MS degrees in Computer Science from the University of Wollongong, Wollongong, NSW, and the PhD degree in Systems Engineering from the Department of Systems Engineering, Australian National University, Acton, ACT. His current research interests include humanoid robotics, cognitive systems, biomimetics of human vision, computational neuroscience of vision, action understanding, human-robot interaction, active vision, mobile robot navigation and object-oriented software construction. He is on the Editorial Board of the International Journal of Humanoid Robotics. He is a Senior Member of the IEEE Robotics and Automation Society and the IEEE Computer Society.
More informationIdentification of a Piecewise Controller of Lateral Human Standing Based on Returning Recursive-Least-Square Method
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) November -,. Tokyo, Japan Identification of a Piecewise Controller of Lateral Human Standing Based on Returning Recursive-Least-Square
More informationTraffic Control for a Swarm of Robots: Avoiding Group Conflicts
Traffic Control for a Swarm of Robots: Avoiding Group Conflicts Leandro Soriano Marcolino and Luiz Chaimowicz Abstract A very common problem in the navigation of robotic swarms is when groups of robots
More informationMulti-robot Formation Control Based on Leader-follower Method
Journal of Computers Vol. 29 No. 2, 2018, pp. 233-240 doi:10.3966/199115992018042902022 Multi-robot Formation Control Based on Leader-follower Method Xibao Wu 1*, Wenbai Chen 1, Fangfang Ji 1, Jixing Ye
More informationAGENT PLATFORM FOR ROBOT CONTROL IN REAL-TIME DYNAMIC ENVIRONMENTS. Nuno Sousa Eugénio Oliveira
AGENT PLATFORM FOR ROBOT CONTROL IN REAL-TIME DYNAMIC ENVIRONMENTS Nuno Sousa Eugénio Oliveira Faculdade de Egenharia da Universidade do Porto, Portugal Abstract: This paper describes a platform that enables
More informationSpeed Control of a Pneumatic Monopod using a Neural Network
Tech. Rep. IRIS-2-43 Institute for Robotics and Intelligent Systems, USC, 22 Speed Control of a Pneumatic Monopod using a Neural Network Kale Harbick and Gaurav S. Sukhatme! Robotic Embedded Systems Laboratory
More informationRobot Joint Angle Control Based on Self Resonance Cancellation Using Double Encoders
Robot Joint Angle Control Based on Self Resonance Cancellation Using Double Encoders Akiyuki Hasegawa, Hiroshi Fujimoto and Taro Takahashi 2 Abstract Research on the control using a load-side encoder for
More informationNao Devils Dortmund. Team Description for RoboCup Matthias Hofmann, Ingmar Schwarz, and Oliver Urbann
Nao Devils Dortmund Team Description for RoboCup 2014 Matthias Hofmann, Ingmar Schwarz, and Oliver Urbann Robotics Research Institute Section Information Technology TU Dortmund University 44221 Dortmund,
More informationECE 517: Reinforcement Learning in Artificial Intelligence
ECE 517: Reinforcement Learning in Artificial Intelligence Lecture 17: Case Studies and Gradient Policy October 29, 2015 Dr. Itamar Arel College of Engineering Department of Electrical Engineering and
More informationSimple Path Planning Algorithm for Two-Wheeled Differentially Driven (2WDD) Soccer Robots
Simple Path Planning Algorithm for Two-Wheeled Differentially Driven (2WDD) Soccer Robots Gregor Novak 1 and Martin Seyr 2 1 Vienna University of Technology, Vienna, Austria novak@bluetechnix.at 2 Institute
More informationKalman Filtering, Factor Graphs and Electrical Networks
Kalman Filtering, Factor Graphs and Electrical Networks Pascal O. Vontobel, Daniel Lippuner, and Hans-Andrea Loeliger ISI-ITET, ETH urich, CH-8092 urich, Switzerland. Abstract Factor graphs are graphical
More informationSteering a humanoid robot by its head
University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part B Faculty of Engineering and Information Sciences 2009 Steering a humanoid robot by its head Manish
More informationPerception. Read: AIMA Chapter 24 & Chapter HW#8 due today. Vision
11-25-2013 Perception Vision Read: AIMA Chapter 24 & Chapter 25.3 HW#8 due today visual aural haptic & tactile vestibular (balance: equilibrium, acceleration, and orientation wrt gravity) olfactory taste
More informationDynamic analysis and control of a Hybrid serial/cable driven robot for lower-limb rehabilitation
Dynamic analysis and control of a Hybrid serial/cable driven robot for lower-limb rehabilitation M. Ismail 1, S. Lahouar 2 and L. Romdhane 1,3 1 Mechanical Laboratory of Sousse (LMS), National Engineering
More informationModel-based Fall Detection and Fall Prevention for Humanoid Robots
Model-based Fall Detection and Fall Prevention for Humanoid Robots Thomas Muender 1, Thomas Röfer 1,2 1 Universität Bremen, Fachbereich 3 Mathematik und Informatik, Postfach 330 440, 28334 Bremen, Germany
More information