ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: Case Studies and Policy Gradient
October 29, 2015
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2015
Introduction
- We'll discuss several case studies of reinforcement learning
- These illustrate some of the trade-offs and issues that arise in real-world applications
- For example, we emphasize how domain knowledge is incorporated into the formulation and solution of the problem
- We also highlight the representation issues that are so often critical to successful applications
- Applications of RL are still far from routine and typically require as much art as science
- Making applications easier and more straightforward is one of the goals of current RL research
TD-Gammon (Tesauro, 1992, 1994, 1995, ...)
- One of the most impressive early applications of RL is Gerry Tesauro's (IBM) backgammon player
- TD-Gammon required little backgammon knowledge, yet learned to play extremely well, near the level of the world's strongest grandmasters
- The learning algorithm was a straightforward combination of the TD(λ) algorithm and nonlinear function approximation: a feedforward neural network trained by backpropagating TD errors
- There are probably more professional backgammon players than professional chess players
- Backgammon is in part a game of chance, and can be viewed as a large MDP
TD-Gammon (cont.)
- The game is played with 15 white and 15 black pieces on a board of 24 locations, called points
- Here's a typical position early in the game, seen from the perspective of the white player:
[Figure: a backgammon board showing an early-game position]
TD-Gammon (cont.)
- White has just rolled a 5 and a 2, so it can move one of its pieces 5 steps and one (possibly the same piece) 2 steps
- The objective is to advance all pieces to points 19-24, and then off the board
- Hitting: a lone opposing piece on a point can be hit and sent back
- 30 pieces and 24 locations imply an enormous number of configurations (the state set is ~10^20)
- Effective branching factor of ~400, considering that each dice roll can be played in ~20 ways
TD-Gammon - details
- Although the game is highly stochastic, a complete description of the game's state is available at all times
- The estimated value of any state was meant to predict the probability of winning starting from that state
- Reward: 0 at all times except when the game is won, when it is 1
- Episodic (game = episode), undiscounted
- Nonlinear form of TD(λ) using a feedforward neural network, with backpropagation of the TD error (sketched below)
- 4 input units for each point: unary encoding of the number of white pieces, plus other features
- Use of afterstates
- Learning during self-play, fully incrementally
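To make the training loop concrete, below is a minimal sketch of the TD(λ) update for a win-probability network, in the spirit of TD-Gammon. The Tesauro-style 198-unit input encoding, the layer sizes, and the hyperparameters alpha and lam are illustrative assumptions, not a reproduction of the actual system.

```python
import numpy as np

# Minimal TD(lambda) update for a win-probability network, in the spirit of
# TD-Gammon. Layer sizes, features, and hyperparameters are illustrative.
rng = np.random.default_rng(0)
N_IN, N_HID = 198, 40          # Tesauro-style board encoding size (assumed)
W1 = rng.normal(0, 0.1, (N_HID, N_IN))
W2 = rng.normal(0, 0.1, N_HID)
alpha, lam = 0.1, 0.7          # step size and trace decay (assumed values)

def forward(x):
    h = 1.0 / (1.0 + np.exp(-W1 @ x))      # hidden activations
    v = 1.0 / (1.0 + np.exp(-W2 @ h))      # estimated P(win)
    return h, v

def grads(x, h, v):
    # Gradients of the scalar output v with respect to W2 and W1.
    dv = v * (1 - v)
    g2 = dv * h
    g1 = np.outer(dv * W2 * h * (1 - h), x)
    return g1, g2

def play_episode(states, final_reward):
    """states: sequence of afterstate feature vectors; final_reward: 1 win, 0 loss."""
    global W1, W2
    e1, e2 = np.zeros_like(W1), np.zeros_like(W2)
    h, v = forward(states[0])
    for t in range(len(states)):
        g1, g2 = grads(states[t], h, v)
        e1 = lam * e1 + g1                 # accumulate eligibility traces
        e2 = lam * e2 + g2
        if t + 1 < len(states):
            h_next, v_next = forward(states[t + 1])
            delta = v_next - v             # reward is 0 on non-terminal steps
        else:
            delta = final_reward - v       # terminal TD error
        W1 += alpha * delta * e1           # TD(lambda) weight update
        W2 += alpha * delta * e2
        if t + 1 < len(states):
            h, v = h_next, v_next
```

Note how the eligibility traces let the single terminal reward propagate credit back to every position visited during the game.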
TD-Gammon Neural Network Employed
[Figure: the feedforward network used by TD-Gammon, mapping the board encoding to an estimated win probability]
Summary of TD-Gammon Results
- Two players (copies of the network) played against each other
- Each had no prior knowledge of the game; only the rules of the game were imposed
- Humans learn from machines: TD-Gammon learned to play certain opening positions differently than was the convention among the best human players
Rebuttal on TD-Gammon
- For an alternative view, see "Why did TD-Gammon Work?", Jordan Pollack and Alan Blair, NIPS (1997)
- Claim: it was the co-evolutionary training strategy, playing games against itself, which led to the success
  - No need to deal with exploration/exploitation
  - Any such approach would work with backgammon
- Low sensitivity of value to state: the success does not extend to other problems
  - e.g. in Tetris and maze-type problems, the exploration issue comes up
The Acrobot
- A robotic application of RL, roughly analogous to a gymnast swinging on a high bar
- The first joint (corresponding to the hands on the bar) cannot exert torque
- The second joint (corresponding to the gymnast bending at the waist) can
- This system has been widely studied by control engineers and machine learning researchers
The Acrobot (cont.)
- One objective for controlling the Acrobot is to swing the tip (the "feet") above the first joint, by an amount equal to one of the links, in minimum time
- In this task, the torque applied at the second joint is limited to three choices: positive torque of a fixed magnitude, negative torque of the same magnitude, or no torque
- A reward of -1 is given on all time steps until the goal is reached, which ends the episode; no discounting is used
- Thus, the optimal value of any state is minus the minimum time to reach the goal (an integer number of steps)
- Sutton (1996) addressed the Acrobot swing-up task in an on-line, model-free context (a sketch of the method follows below)
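Below is a minimal sketch of Sarsa(λ) with linear function approximation, the kind of method Sutton (1996) applied here. He used tile coding to build the features; phi below is an assumed black-box feature function, and env is a hypothetical environment with a step(a) -> (next_state, reward, done) interface.

```python
import numpy as np

# Sketch of Sarsa(lambda) with linear function approximation for the
# Acrobot swing-up task. phi, env, and all hyperparameters are assumptions.
def sarsa_lambda(env, phi, n_feats, n_actions, episodes=500,
                 alpha=0.1, lam=0.9, eps=0.0):
    w = np.zeros((n_actions, n_feats))        # one weight vector per action

    def q(s, a):
        return w[a] @ phi(s)

    def policy(s):
        if np.random.rand() < eps:            # optional epsilon-greedy
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        e = np.zeros_like(w)                  # eligibility traces
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)         # r = -1 per step until the goal
            delta = r - q(s, a)
            e[a] += phi(s)                    # accumulating traces
            if not done:
                a2 = policy(s2)
                delta += q(s2, a2)            # undiscounted (gamma = 1)
                s, a = s2, a2
            w += alpha * delta * e            # Sarsa(lambda) update
            e *= lam
    return w
```

With per-step rewards of -1, the zero-initialized weights are optimistic, so even the greedy (eps = 0) policy explores systematically.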
Acrobot Learning Curves for Sarsa(λ)
[Figure: Sarsa(λ) learning curves on the Acrobot task]
Typical Acrobot Learned Behavior
[Figure: a typical learned swing-up trajectory]
RL in Robotics
- Robot motor capabilities have been investigated using RL
  - Walking, grabbing and delivering (MIT Media Lab)
- RoboCup competitions (soccer games)
  - Sony AIBOs are commonly employed
- Maze-type problems
- Balancing on an unstable platform
- Multi-dimensional input streams
Policy Gradient Methods
- Assume that our policy, π, has a set of n real-valued parameters, θ = {θ_1, θ_2, θ_3, ..., θ_n}
- Running the policy with a particular θ results in a reward, r_θ
- Estimate the reward gradient, ∂r/∂θ_i, for each θ_i, and climb it:
    θ_i ← θ_i + α ∂r/∂θ_i
  where α is another learning rate (a sketch follows below)
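A minimal sketch of this update, assuming a hypothetical evaluate(theta) function that runs the policy with parameters theta and returns the resulting reward; the step sizes alpha and eps are illustrative:

```python
import numpy as np

# Gradient ascent on the reward with a central-difference gradient estimate.
# evaluate(theta) is an assumed black-box: run the policy, return the reward.
def fd_grad(evaluate, theta, eps=0.1):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        # dr/dtheta_i ~ (r(theta + eps*e_i) - r(theta - eps*e_i)) / (2*eps)
        grad[i] = (evaluate(up) - evaluate(down)) / (2 * eps)
    return grad

def hill_climb(evaluate, theta, alpha=0.05, iters=100):
    theta = np.asarray(theta, dtype=float)
    for _ in range(iters):
        theta += alpha * fd_grad(evaluate, theta)  # ascend the reward estimate
    return theta
```

In practice each evaluate call is a noisy rollout, so averaging several episodes per parameter setting is usually needed for a usable gradient estimate.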
Policy Gradient Methods (cont.)
- This amounts to hill-climbing in policy space, so it's subject to all the problems of hill-climbing
  - But we can also use tricks from search theory, like random starting points and momentum terms (see the variant below)
- This is a good approach if you have a parameterized policy
  - Let's assume we have a reasonable starting policy
- Typically faster than value-based methods
- Safe exploration, if you have a good policy
- Learns locally-best parameters for that policy
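One of the search tricks mentioned above, a momentum term, drops straight into the same sketch. This variant reuses fd_grad from the previous example; beta is an assumed momentum coefficient:

```python
import numpy as np

# Momentum variant of the previous sketch (reuses fd_grad from that block).
# Blending the new gradient estimate with the previous step smooths noisy
# rollout returns and helps carry the search across small plateaus.
def hill_climb_momentum(evaluate, theta, alpha=0.05, beta=0.9, iters=100):
    theta = np.asarray(theta, dtype=float)
    step = np.zeros_like(theta)
    for _ in range(iters):
        step = beta * step + alpha * fd_grad(evaluate, theta)  # momentum update
        theta += step
    return theta
```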
An Example: Learning to Walk
- RoboCup 4-legged league: walking quickly is a big advantage
- Historically, gaits were tuned manually
- The robots have a parameterized gait controller
  - 12 parameters controlling step length, height, etc.
- A robot walks across the soccer field and is timed
  - Reward is a function of the time taken
  - The robots know when to stop (distance measure)
An Example: Learning to Walk (cont.)
Basic idea (sketched in code below):
1. Pick an initial θ = {θ_1, θ_2, ..., θ_12}
2. Generate N test parameter settings by perturbing θ:
   θ^j = {θ_1 + δ_1, θ_2 + δ_2, ..., θ_12 + δ_12}, with each δ_i ∈ {-ε, 0, +ε}
3. Test each setting, and observe the rewards: θ^j → r_j
4. For each θ_i, calculate Avg_i^+, Avg_i^0, Avg_i^- (the average reward over the settings where δ_i was +ε, 0, or -ε, respectively) and set
   θ'_i = θ_i + d if Avg_i^+ is largest, θ_i if Avg_i^0 is largest, θ_i - d if Avg_i^- is largest
5. Set θ ← θ', and go to 2
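A sketch of this loop (after Kohl and Stone, 2004), again assuming a hypothetical evaluate(theta) that runs and times a walk with gait parameters theta and returns the reward; N, eps, d, and iters are illustrative values:

```python
import numpy as np

# Perturbation-based gait learning as described above. evaluate(theta) is an
# assumed black-box that walks the robot and returns the resulting reward.
def learn_gait(evaluate, theta, N=15, eps=0.05, d=0.1, iters=50):
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    for _ in range(iters):
        # 2. Generate N random test settings by perturbing each theta_i.
        deltas = np.random.choice([-eps, 0.0, eps], size=(N, n))
        # 3. Test each setting and record its reward.
        rewards = np.array([evaluate(theta + dlt) for dlt in deltas])
        # 4. For each dimension, average rewards by perturbation sign.
        for i in range(n):
            avg = {}
            for sign in (-eps, 0.0, eps):
                mask = deltas[:, i] == sign
                avg[sign] = rewards[mask].mean() if mask.any() else -np.inf
            best = max(avg, key=avg.get)
            if best != 0.0:
                theta[i] += d if best > 0 else -d  # step toward the better side
        # 5. Loop with the updated theta.
    return theta
```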
An Example: Learning to Walk (cont.)
[Figure: the initial gait vs. the final learned gait]
Value Function or Policy Gradient?
- When should I use policy gradient?
  - When there's a parameterized policy
  - When there's a high-dimensional state space
  - When we expect the gradient to be smooth
  - Typically on episodic tasks (e.g. AIBO walking)
- When should I use a value-based method?
  - When there is no parameterized policy
  - When we have no idea how to solve the problem (i.e. no known structure)
Summary
- RL is a powerful tool that can support a wide range of applications
- There is an art to defining the observations, states, rewards and actions
- Main goal: formulate as simple a representation as possible
- Policy gradient methods search directly in policy space