Policy Teaching Through Reward Function Learning
Haoqi Zhang, David Parkes, and Yiling Chen
School of Engineering and Applied Sciences, Harvard University
ACM EC 2009
Interested party and agent
A Web 2.0 site wants a user to...
An online retailer wants a customer to...
An ad network wants a publisher to...
Often, the agent does not behave as desired.
Idea: provide incentives. Effective incentives depend on the agent's preferences.
Policy Teaching
An agent performs a sequence of observable actions [MDP].
The interested party can associate limited rewards with states.
The party can interact with the agent multiple times, but cannot impose actions.
Goal: induce desired behavior quickly and at a low cost.
Policy Teaching is an example of Environment Design.
Mechanism Design vs. Environment Design
Mechanism Design:
  elicit preferences via direct queries
  center implements outcomes
  equilibrium analysis
Environment Design:
  infer preferences from behavior
  agents take actions, not the center
  agent is myopic to incentives
Understanding preferences
Direct preference elicitation is costly and intrusive.
Passive indirect elicitation is insufficient.
Use an active, indirect elicitation method [Zhang & Parkes, 2008].
This paper
Objective: induce a fixed target policy.
It is easier for the interested party to specify a policy than a utility function.
This is more tractable than value-based policy teaching [Zhang & Parkes, 2008].
Main results
Finding limited incentives to induce a pre-specified policy is in P.
With unknown rewards, a polynomial-time algorithm finds incentives that induce the desired policy after logarithmically many interactions.
A tractable slack-based heuristic, with empirical results in a simulated ad-network setting.
Extension to partial observations and partial target policies.
Game-theoretic analysis to handle strategic agents.
Markov Decision Process
Definition. An infinite-horizon MDP is a model M = {S, A, R, P, γ}:
  S is the finite set of states.
  A is the finite set of available actions.
  R : S → ℝ is the reward function.
  P : S × A × S → [0, 1] is the transition function.
  γ ∈ (0, 1) is the discount factor.
We assume bounded rewards: R(s) < R_max for all s ∈ S.
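To make the model concrete, here is a minimal sketch of the agent's side: value iteration computing an optimal policy for an MDP with state-based rewards, matching the definition above. The 3-state, 2-action instance and all numeric values are illustrative assumptions, not taken from the paper.

import numpy as np

def optimal_policy(R, P, gamma, tol=1e-8):
    """Value iteration for an MDP with state-based rewards.
    R: (n,) reward per state; P: (m, n, n) with P[a, s, s'] the
    probability of moving from s to s' under action a."""
    V = np.zeros(R.shape[0])
    while True:
        # Bellman backup: the reward is earned in the current state.
        Q = R[None, :] + gamma * P @ V        # shape (m, n)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)    # values and greedy policy
        V = V_new

# Illustrative instance: the agent drifts toward the high-reward state.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.1, 0.9]]])
R = np.array([0.0, 1.0, 2.0])
V, pi = optimal_policy(R, P, gamma=0.95)
print("optimal policy:", pi)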
Example: hyperlink design in an ad network
MDP model, cf. Immorlica et al., 2006.
[Figure: a publisher's website link structure modeled as an MDP.]
Policy Teaching with known rewards
The agent performs his optimal policy π.
The interested party can provide an admissible incentive Δ : S → ℝ (e.g., within budget and without punishment).
The agent then performs the policy that is optimal w.r.t. R + Δ.
Goal: provide a minimal admissible Δ that induces π_T.
Use Inverse Reinforcement Learning (IRL) [Ng and Russell, 2000]: the set of rewards consistent with a policy is given by linear constraints.
This yields a linear programming formulation.
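A minimal sketch of such a linear program, assuming the Ng & Russell characterization that π_T is optimal under R + Δ iff (P_{π_T} − P_a)(I − γ P_{π_T})^{-1}(R + Δ) ≥ 0 for every action a. The per-state budget D_max, the strictness margin eps, and the sum-of-incentives objective are illustrative assumptions about what "admissible" and "minimal" mean here.

import numpy as np
from scipy.optimize import linprog

def min_incentives(R, P, gamma, pi_T, D_max=5.0, eps=1e-6):
    """Smallest admissible Delta (0 <= Delta(s) <= D_max) that makes
    pi_T strictly optimal under R + Delta, or None if none exists."""
    m, n, _ = P.shape
    P_pi = P[pi_T, np.arange(n), :]              # transition rows under pi_T
    G = np.linalg.inv(np.eye(n) - gamma * P_pi)  # (I - gamma P_piT)^{-1}
    # One IRL row per (state, non-target action): row @ (R + Delta) >= eps.
    M = np.array([((P_pi - P[a]) @ G)[s]
                  for a in range(m) for s in range(n) if a != pi_T[s]])
    # linprog wants A_ub x <= b_ub, so rewrite M Delta >= eps - M R
    # as -M Delta <= M R - eps, and minimize the total incentive.
    res = linprog(c=np.ones(n), A_ub=-M, b_ub=M @ R - eps,
                  bounds=[(0.0, D_max)] * n)
    return res.x if res.success else None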
Policy Teaching with unknown rewards
Typically, the interested party won't know the agent's reward function.
Idea: provide incentives, observe behavior, and repeat.
An indirect approach
[Figure: the IRL space of rewards consistent with observed behavior. Each round, the party picks a candidate R̂, provides incentives computed as if R̂ were the true reward, and the agent's response adds IRL constraints that shrink the space toward rewards under which π_T is induced.]
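In code, one round-by-round reading of this loop (a sketch, not the paper's algorithm): pick_candidate, min_incentives, and observe are placeholders for the candidate-selection rule (e.g., the centroid on the later slides), the known-rewards LP sketched earlier, and the agent's observed best response.

import numpy as np

def elicit(pi_T, P, gamma, pick_candidate, min_incentives, observe,
           max_rounds=100):
    constraints = []                      # accumulated IRL halfspaces on R
    for _ in range(max_rounds):
        R_hat = pick_candidate(constraints)
        if R_hat is None:                 # IRL space exhausted: no admissible
            return None                   # incentives can induce pi_T
        delta = min_incentives(R_hat, P, gamma, pi_T)
        pi = observe(delta)               # agent best-responds to R + delta
        if np.array_equal(pi, pi_T):
            return delta                  # target policy induced
        # The observed policy is optimal for R + delta; record the IRL
        # constraints this implies about the unknown R, then repeat.
        constraints.append((pi, delta))
    return None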
Convergence result
Theorem. The elicitation method terminates after finitely many steps with admissible incentives that induce π_T, if such incentives exist.
Intuition: a pigeonhole argument on the number of hypercubes that fit in the IRL space.
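One way to make the pigeonhole intuition concrete (the grid size ε and the reward bound R_max are assumed notation, not quoted from the proof): if every unsuccessful round permanently eliminates at least one side-ε hypercube from the bounded IRL space [−R_max, R_max]^{|S|}, then the number of rounds T satisfies

T \le \left(\frac{2R_{\max}}{\epsilon}\right)^{|S|}.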
Choosing an objective function
Few elicitation rounds.
Tractable.
Centroid-based approach
Theorem (Grünbaum, 1960). Any halfspace containing the centroid of a convex set contains at least 1/e of its volume.
Pick the centroid of the IRL space as the candidate R̂ at every iteration.
Adding IRL constraints then eliminates at least 1/e of its volume.
Idea: maintain a volume around the true reward that is never eliminated.
Algorithm: use a relaxation of the IRL constraints for observations.
This yields a logarithmic bound on the number of elicitation rounds.
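To see where the logarithmic bound comes from, writing V_0 for the initial IRL-space volume and v for the protected volume around the true reward (both assumed notation): each round keeps at most a (1 − 1/e) fraction of the current volume while v is never eliminated, so

V_0\left(1 - \tfrac{1}{e}\right)^{T} \ge v
\quad\Longrightarrow\quad
T \le \frac{\ln(V_0/v)}{\ln\big(1/(1 - 1/e)\big)} = O\!\left(\log\frac{V_0}{v}\right).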
Computing the centroid
The centroid is #P-hard to compute [Rademacher, 2007].
It can be approximated via sampling in polynomial time [Bertsimas and Vempala, 2004].
Finding an approximate centroid in polynomial time yields the logarithmic convergence bound with arbitrarily high probability.
But: the algorithm is about O(|S|^6), and the bound is not representative of actual performance.
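A minimal sketch of the sampling idea, in the spirit of Bertsimas & Vempala (2004): approximate the centroid of the polytope {x : Ax ≤ b} of IRL-consistent rewards by averaging hit-and-run samples. The polytope is assumed bounded, x0 strictly interior, and the sample counts are illustrative.

import numpy as np

def hit_and_run_centroid(A, b, x0, n_samples=2000, burn=200, seed=0):
    """Average of hit-and-run samples from the bounded polytope Ax <= b,
    started from a strictly interior point x0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    total = np.zeros_like(x)
    for t in range(burn + n_samples):
        d = rng.standard_normal(x.shape)
        d /= np.linalg.norm(d)
        # The chord {x + s d} stays inside iff s (A d)_i <= (b - A x)_i for
        # each row i: positive rows cap s from above, negative from below.
        Ad, slack = A @ d, b - A @ x
        hi = np.min(slack[Ad > 1e-12] / Ad[Ad > 1e-12])
        lo = np.max(slack[Ad < -1e-12] / Ad[Ad < -1e-12])
        x = x + rng.uniform(lo, hi) * d
        if t >= burn:
            total += x
    return total / n_samples

# Example: centroid of the unit box [0, 1]^2 is approximately (0.5, 0.5).
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
print(hit_and_run_centroid(A, b, x0=[0.25, 0.25]))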
Two-sided slack maximization heuristic
[Figure: candidate R̂ and the target region within the IRL space.]
The idea is from Zhang & Parkes, 2008; here the formulation is a linear program.
It generates |S||A| new constraints at each round.
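One plausible reading of the slack idea, heavily hedged (the exact two-sided formulation in the paper may differ): instead of the centroid, pick the reward R̂ that maximizes the minimum slack β over the accumulated IRL rows, i.e., a deepest point of the current polytope, which is itself a small LP.

import numpy as np
from scipy.optimize import linprog

def max_slack_point(C, R_max):
    """maximize beta subject to C R_hat >= beta (per row) and
    |R_hat| <= R_max elementwise; variables z = (R_hat, beta)."""
    k, n = C.shape
    c = np.r_[np.zeros(n), -1.0]               # linprog minimizes -beta
    A_ub = np.hstack([-C, np.ones((k, 1))])    # beta - C R_hat <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(k),
                  bounds=[(-R_max, R_max)] * n + [(None, None)])
    return (res.x[:n], res.x[n]) if res.success else (None, None)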
Example: an ad-network setting
A publisher designs the link structure of a website to maximize utility.
An ad network provides incentives to influence the link design.
Results for the slack-based heuristic, on 20 to 100 web pages:
  Elicitation takes 8–12 rounds.
  The number of rounds stays roughly constant as the number of states increases.
Extension: partial observation and partial target policy
Partial observation: observe the agent's actions in only some of the states.
Partial target policy: care about the agent's policy only in certain states.
Goal: induce the desired partial policy in the observable states.
This can be formulated as a mixed integer program, and convergence still holds.
Handling forward-looking agents
Consider the interactions as an infinitely repeated game.
A strategic agent may misrepresent its preferences.
Consider a trigger strategy: provide maximal admissible incentives; if the agent does not perform π_T, provide no future incentives.
This approach is unsatisfying:
  It doesn't work for myopic agents.
  It raises commitment issues.
  It is hard to implement in applications.
Handling forward-looking agents: a solution
Idea: use the elicitation method until the desired policy is elicited or no possible rewards remain.
The agent may misrepresent, but a patient agent will want to keep receiving incentives, and will follow the desired policy as a best response.
Benefits:
  Works for both myopic and strategic agents.
  Can still get fast convergence.
The message
There are tractable methods for finding limited incentives that quickly induce desired sequential agent behavior when:
  Direct queries about agent preferences are unavailable.
  Agent behavior can be observed over time.
Many interesting open questions remain, both theoretical and practical.
Thank you
I would love to get your comments and suggestions!
hq@eecs.harvard.edu