Planning & Reinforcement Learning


Planning & Reinforcement Learning
Slides borrowed from Sheila McIlraith, Kate Larson, and David Silver (CSC384, University of Toronto)

Why Planning
- E.g. if we have a robot, we want the robot to decide what to do and how to act to achieve our goals.

Planning vs. Search
- How to change the world to suit our needs.
- Critical issue: we need to reason about what the world will be like after doing a few actions. This aspect of planning is just like Search.

Autonomous Agents for Space Exploration
- Autonomous planning, scheduling, control
- NASA: JPL and Ames
- Remote Agent Experiment (RAX): Deep Space 1
- Mars Exploration Rover (MER)

Example
GOAL: Steven has coffee.
CURRENTLY: robot in mailroom, has no coffee, coffee not made, Steven in office, etc.
TO DO: goto lounge, make coffee, ...

Scheduling with Action Choices & Resource Requirements
- Problems in supply chain management
- HSTS (Hubble Space Telescope scheduler)
- Workflow management

Other Applications (cont.): Air Traffic Control
- Route aircraft between runways and terminals.
- Crafts must be kept safely separated. Safe distance depends on craft and mode of transport.
- Minimize taxi and wait time.

Applications
These applications require more than search. It is not sufficient to simply find a sequence of actions for transforming the world so as to achieve a goal state. These applications involve:
- dealing with uncertainty
- sensing the world, and planning to sense the world, so as to reduce uncertainty
- generating a plan that has high payoff or high expected payoff, rather than simply achieving a fixed goal
- running into problems when executing a plan and having to recover
- etc.

Character Animation
- Generate step-by-step character behaviour from a high-level spec

Plan-based Interfaces
- E.g. NLP to database interfaces
- Plan recognition, Activity Recognition

Planning
- Agent: single agent, or multi-agent
- State: complete or incomplete (logical/probabilistic); state of the world and/or the agent's state of knowledge
- Actions: world-altering and/or knowledge-altering (e.g. sensing); deterministic or non-deterministic (logical/stochastic)
- Goal Condition: satisfying or optimizing; final-state or temporally extended; optimizing for preference/cost/utility
- Reasoning: offline or online (fully observable, partially observable)
- Plans: partial order, sequential, conditional

Simplifying the Planning Problem
We simplify the planning problem as follows:
- Assume complete information about the initial state through the closed world assumption (CWA)
- Assume a finite domain of objects
- Assume action effects are restricted to making conjunctions of atomic formulae true or false. No conditional effects, etc.
- Assume action preconditions are restricted to conjunctions of ground atoms
- Perform Classical Planning. No incomplete or uncertain knowledge.

Classical Planning Assumptions
- Finite System: finitely many states, actions, events
- Fully Observable: the controller always knows the current state
- Deterministic: each action has only one outcome
- Static: changes only occur as a result of controller actions
- Attainment goals: a set of goal states S_g
- Sequential plans: a plan is a linearly ordered sequence of actions (a_1, ..., a_n)
- Implicit time: actions are instantaneous (have no duration)
- Off-line planning: the planner doesn't know the execution status

STRIPS Representation
- STRIPS (Stanford Research Institute Problem Solver)
- A way of representing actions with respect to a CW-KB: a closed world knowledge base representing the state of the world

Sequence of Worlds (figure)

STRIPS Actions
- STRIPS represents actions using 3 lists:
  - a list of preconditions
  - a list of action add effects
  - a list of action delete effects
- These lists contain variables so that we can represent a whole class of actions with one specification.
- Each ground instantiation of the variables yields a specific action.
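To make the three-list representation concrete, here is a minimal Python sketch (ours, not from the slides): a state is a frozenset of ground atoms under the CWA, and pickup shows how grounding the variable X yields a specific action. The tuple spelling of atoms is just an illustrative convention.

from dataclasses import dataclass

# An atom is a tuple such as ("ontable", "a") or ("on", "a", "b");
# a state is a frozenset of atoms (anything absent is false, per the CWA).

@dataclass(frozen=True)
class StripsAction:
    name: str
    pre: frozenset    # preconditions: atoms that must hold
    adds: frozenset   # add effects: atoms made true
    dels: frozenset   # delete effects: atoms made false

def pickup(x):
    # Ground instantiation of the pickup(X) operator for a block x.
    return StripsAction(
        name=f"pickup({x})",
        pre=frozenset({("handempty",), ("clear", x), ("ontable", x)}),
        adds=frozenset({("holding", x)}),
        dels=frozenset({("handempty",), ("clear", x), ("ontable", x)}),
    )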

STRIPS Actions: Example
(Figures on these slides show a robot hand manipulating blocks.)

pickup(X) is called a STRIPS operator:
  Pre:  {handempty, clear(X), ontable(X)}
  Adds: {holding(X)}
  Dels: {handempty, clear(X), ontable(X)}
pickup(a) (a particular instance) is called an action.

putdown(X)
  Pre:  {holding(X)}
  Adds: {clear(X), ontable(X), handempty}
  Dels: {holding(X)}

stack(X, Y)
  Pre:  {holding(X), clear(Y)}
  Adds: {on(X, Y), handempty, clear(X)}
  Dels: {holding(X), clear(Y)}

STRIPS has no Conditional Effects

stack(X, Y)
  Pre:  {holding(X), clear(Y)}
  Adds: {on(X, Y), handempty, clear(X)}
  Dels: {holding(X), clear(Y)}

- Blocks World assumption: the table has infinite space, so it is always clear.
- If we stack something on the table (Y = table), we cannot delete clear(table).
- But if Y is an ordinary block, we must delete clear(Y).

Since STRIPS has no conditional effects, we must sometimes utilize extra actions, one for each type of condition: we embed the condition in the precondition and then alter the effects accordingly.

putdown(X)  (the Y = table case)
  Pre:  {holding(X)}
  Adds: {ontable(X), handempty, clear(X)}
  Dels: {holding(X)}

stack(X, Y)  (the ordinary-block case)
  Pre:  {holding(X), clear(Y)}
  Adds: {on(X, Y), handempty, clear(X)}
  Dels: {holding(X), clear(Y)}

STRIPS Actions: Example

unstack(X, Y)
  Pre:  {clear(X), on(X, Y), handempty}
  Adds: {holding(X), clear(Y)}
  Dels: {clear(X), on(X, Y), handempty}
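Given operators in this form, executing an action is mechanical: check that the preconditions hold, then remove the delete list and add the add list. A small sketch, reusing the StripsAction and pickup helpers from the earlier snippet:

def applicable(state, action):
    # The action can execute when every precondition is in the state
    # (under the CWA, atoms not in the state are false).
    return action.pre <= state

def apply_action(state, action):
    # Progression: successor = (state - delete list) + add list.
    assert applicable(state, action)
    return (state - action.dels) | action.adds

s0 = frozenset({("handempty",), ("clear", "a"), ("ontable", "a")})
s1 = apply_action(s0, pickup("a"))   # -> frozenset({("holding", "a")})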

Planning as a Search Problem
Given:
- a CW-KB representing the initial state,
- a set of STRIPS operators that map a state to a new state,
- goal conditions (a conjunction of facts, or a formula),
the planning problem is to determine a sequence of actions that, when applied to the initial CW-KB, yields an updated CW-KB which satisfies the goal. This is the classical planning task.

Planning as Search
- This is a search problem in which our state space representation is a CW-KB.
- The initial CW-KB is the initial state.
- Actions are operators mapping a state to a new state.
- The goal is satisfied by any state that satisfies the goal. Typically the goal is a conjunction of primitive facts, so we just need to check whether all the facts in the goal are contained in the CW-KB.

Example: a blocks-world search tree expanding actions such as move(b,c), move(c,table), move(c,b), and move(a,b) (figure).
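Since classical planning reduces to search, a blind breadth-first forward planner fits in a few lines. The sketch below is our illustration, building on the applicable and apply_action helpers above; it takes a set of ground actions and returns a shortest action-name sequence reaching a state that contains every goal fact.

from collections import deque

def plan_bfs(init, goal, ground_actions):
    # Breadth-first forward search over CW-KB states.
    frontier = deque([(init, [])])
    seen = {init}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:              # goal is a conjunction of facts
            return path
        for act in ground_actions:
            if applicable(state, act):
                nxt = apply_action(state, act)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [act.name]))
    return None                        # no plan exists

Real planners replace this blind search with heuristics that exploit the locality of action effects, which is exactly the structural advantage the next slides point out.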

Problems
- The search tree is generally quite large: randomly reconfiguring 9 blocks takes thousands of CPU seconds.
- But the representation suggests some structure:
  - Each action only affects a small set of facts.
  - Actions depend on each other via their preconditions.
- Planning algorithms are designed to take advantage of the fact that the representation makes the locality of action changes explicit.

Planning Summary
- A model of the environment is known.
- The agent performs computations with its model (without external interaction).
- The agent improves its policy.
- Deliberation, reasoning, introspection, pondering, thought, search.

But what happens if the environment is unknown? How can we inform our agent of what actions to take?
- Assume: the environment is initially unknown.
- Consider using a reward function to guide the agent.
- If the agent doesn't know what actions to take:
  - try an action out
  - see what the reward is for taking that action
- This is Reinforcement Learning.

Reinforcement Learning
- Learning what to do, so as to maximize some reward signal.

Example: Tic Tac Toe
- State: board configuration
- Actions: next move
- Reward: 1 for win, -1 for loss, 0 for draw
- Problem: find π: S → A that maximizes reward

Example: Mobile Robot
- State: location of robot, people
- Actions: motion
- Reward: number of happy faces
- Problem: find π: S → A that maximizes reward

Example: Atari
- State: pixel location of game agents
- Actions: agent movement
- Reward: score
- Problem: find π: S → A that maximizes reward
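All three examples instantiate the same loop: observe the state, choose an action with π, receive a reward. A minimal sketch of that loop, assuming a hypothetical environment object whose reset and step methods we invent here for illustration:

def run_episode(env, policy):
    # Observe state, choose action with pi, collect reward, repeat.
    # Assumes a hypothetical environment with reset() -> state and
    # step(action) -> (next_state, reward, done).
    s = env.reset()
    rewards, done = [], False
    while not done:
        a = policy(s)                  # pi: S -> A
        s, r, done = env.step(a)
        rewards.append(r)
    return rewards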

Further examples: Autonomous Helicopter Flight; Quadruped Robot (video slides).

Reinforcement Learning
- Goal: learn to choose actions that maximize
  REWARD = r_0 + γ r_1 + γ² r_2 + ..., where 0 < γ < 1.

Reward
- A reward R_t is a scalar feedback signal.
- It indicates how well the agent is doing at step t.
- The agent's job is to maximize cumulative reward.

Reward hypothesis: All goals can be described by the maximization of expected cumulative reward.
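The REWARD formula is straightforward to compute for a finite episode; a small sketch (ours), which could be applied directly to the reward list returned by run_episode above:

def discounted_return(rewards, gamma=0.9):
    # REWARD = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite episode.
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# A win (reward 1) after three neutral steps, with gamma = 0.9:
# 0 + 0.9*0 + 0.81*0 + 0.729*1 = 0.729
print(discounted_return([0, 0, 0, 1]))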

Sequential Decision Making
- Goal: select actions to maximize total future reward.
- Actions may have long-term consequences.
- Reward may be delayed.
- It may be better to sacrifice immediate reward to gain more long-term reward (exploitation vs. exploration).

Exploration and Exploitation
- Reinforcement learning is like trial-and-error learning.
- The agent should discover a good policy
  - from its experiences of the environment (explore)
  - without losing too much of the reward along the way (exploit).
A sketch of one common way to balance the two follows this section.

Agent's Learning Task
- Execute actions in the world.
- Observe the results.
- Learn a policy π: S → A that maximizes reward from some initial state.

Fully Observable Environment
- Full observability: the agent directly observes the environment state.
- Agent state = environment state = information state.
- Formally, this is a Markov Decision Process (MDP).
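One standard way to balance "try an action out" against exploiting what is already known is ε-greedy selection over estimated action values. The slides don't prescribe an algorithm, so this sketch is illustrative, with Q a hypothetical table of value estimates keyed by (state, action):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore: try a random action out.
    # Otherwise exploit: take the action with the best estimate so far.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))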

Partially Observable Environment
- Partial observability: the agent indirectly observes the environment.
  - E.g. a robot with camera vision isn't told its absolute location.
  - A trading agent only observes current prices.
  - A poker-playing agent only observes public cards.
- Agent state ≠ environment state.
- Formally, this is a partially observable Markov Decision Process (POMDP).
- The agent must construct its own state representation S_t^a, e.g.:
  - Complete history: S_t^a = H_t
  - Beliefs of environment state: S_t^a = (P[S_t^e = s^1], ..., P[S_t^e = s^n])
  - Recurrent Neural Network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)

RL Agent
An RL agent may include one or more of these components:
- Policy: the agent's behaviour function
- Value function: how good is each state and/or action
- Model: the agent's representation of the environment

Maze Example
- States: agent's location
- Actions: N, E, S, W
- Rewards: -1 per time-step

Maze Example: Policy
Policy: the agent's behaviour
- A map from state to action
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
(In the figure, each arrow represents the policy π(s) for each state s.)
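The two policy types map directly onto simple data structures; a sketch in which the maze cells and probabilities are made up for illustration:

import random

# Deterministic policy: a table mapping each state to one action, a = pi(s).
det_policy = {(0, 0): "E", (0, 1): "E", (0, 2): "S"}

def act_deterministic(s):
    return det_policy[s]

# Stochastic policy: a distribution over actions per state, pi(a|s).
sto_policy = {(0, 0): {"E": 0.8, "S": 0.2}}

def act_stochastic(s):
    actions, probs = zip(*sto_policy[s].items())
    return random.choices(actions, weights=probs)[0]   # sample a ~ pi(.|s)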

Maze Example: Value Function
Value function:
- A prediction of future reward.
- Used to evaluate the goodness/badness of states.
  v_π(s) = E_π[R_{t+1} + γ R_{t+2} + ... | S_t = s]
(In the figure, numbers represent the value function v_π(s) of each state s.)

Maze Example: Model
Model:
- Predicts what the environment will do next.
- The agent may have an internal model of the environment, which determines
  - how actions change the state, and
  - how much reward should be given for each state.
- The model may be imperfect.

RL Agent
- Model-Based:
  - Policy and/or Value Function
  - Model
- Model-Free:
  - Policy and/or Value Function
  - No Model
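The slides define v_π(s) but not how to learn it. One classic model-free estimator is tabular TD(0), sketched here as an illustration; the step size α and the sample transition are our own choices, not from the slides:

from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One tabular TD(0) step: nudge V(s) toward the bootstrapped target
    # r + gamma * V(s'), a sampled version of
    # v_pi(s) = E[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s].
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = defaultdict(float)
# A hypothetical maze transition: step cost -1 moving from (0, 0) to (0, 1).
td0_update(V, (0, 0), -1.0, (0, 1))

Because TD(0) learns from sampled transitions alone, it needs no model of the environment, which is exactly the model-free case in the taxonomy above.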
