Policy Teaching Through Reward Function Learning. Haoqi Zhang, David Parkes, and Yiling Chen

Policy Teaching Through Reward Function Learning
Haoqi Zhang, David Parkes, and Yiling Chen
School of Engineering and Applied Sciences, Harvard University
ACM EC 2009

Interested party and agent
A Web 2.0 site wants a user to... An online retailer wants a customer to... An ad-network wants a publisher to...
Often, the agent does not behave as desired.
Idea: provide incentives. Effective incentives depend on the agent's preferences.

Policy Teaching
An agent performs a sequence of observable actions [MDP].
The interested party can associate limited rewards with states.
The two can interact multiple times, but the interested party cannot impose actions.
Goal: induce the desired behavior quickly and at low cost.
Policy Teaching is an example of Environment Design.

Mechanism Design vs. Environment Design
Mechanism Design: elicit preferences via direct queries; center implements outcomes; equilibrium analysis.
Environment Design: infer preferences from behavior; agents take actions, not the center; agent is myopic to incentives.

Understanding preferences
Direct preference elicitation is costly and intrusive.
Passive indirect elicitation is insufficient.
Use an active, indirect elicitation method [Z. & Parkes, 2008].

This paper
Objective: induce a fixed target policy.
It is easier for the interested party to specify a policy than a utility function.
This is more tractable than value-based policy teaching [Z. & Parkes, 2008].

Main results
Finding limited incentives to induce a pre-specified policy is in P.
With unknown rewards, a polynomial-time algorithm finds incentives that induce the desired policy after logarithmically many interactions.
A tractable slack-based heuristic, with empirical results in a simulated ad-network setting.
Extensions to partial observations and partial target policies.
Game-theoretic analysis to handle strategic agents.

Markov Decision Process
Definition. An infinite-horizon MDP is a model M = {S, A, R, P, γ}:
S is the finite set of states.
A is the finite set of available actions.
R : S → ℝ is the reward function.
P : S × A × S → [0, 1] is the transition function.
γ ∈ (0, 1) is the discount factor.
We assume bounded rewards: R(s) < R_max for all s ∈ S.
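To make the model concrete, here is a minimal sketch of this MDP tuple in Python, together with the standard policy-evaluation identity V = (I − γ P_π)⁻¹ R that the IRL constraints later build on. The encoding and names are illustrative, not from the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MDP:
    """Infinite-horizon MDP M = {S, A, R, P, gamma} (illustrative encoding)."""
    R: np.ndarray    # state rewards, shape (|S|,)
    P: np.ndarray    # transitions, shape (|S|, |A|, |S|); P[s, a, s'] = Pr(s' | s, a)
    gamma: float     # discount factor in (0, 1)


def policy_value(mdp: MDP, pi: np.ndarray) -> np.ndarray:
    """Evaluate a deterministic policy pi (pi[s] = action taken in state s):
    V = (I - gamma * P_pi)^(-1) R, where P_pi[s, s'] = P[s, pi[s], s']."""
    n = mdp.R.shape[0]
    P_pi = mdp.P[np.arange(n), pi, :]   # row s: next-state distribution under pi
    return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, mdp.R)
```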

Example: hyperlink design in an ad-network
MDP model, cf. Immorlica et al., 2006. [Figure: a website's link structure modeled as an MDP.]

Policy Teaching with known rewards
The agent performs his optimal policy π.
The interested party can provide an admissible incentive function Δ : S → ℝ (e.g., within budget and without punishment).
The agent then performs an optimal policy w.r.t. R + Δ.
Goal: provide a minimal admissible Δ that induces π_T.
Use Inverse Reinforcement Learning (IRL) [Ng and Russell, 2000]: the set of rewards consistent with a policy is given by linear constraints.
This yields a linear programming formulation.
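The slide does not spell the program out; the following is a hedged reconstruction from the Ng and Russell (2000) IRL characterization, with an illustrative cost and admissibility set (the paper's exact formulation may differ, e.g., in requiring strict optimality slack or a different objective):

```latex
% IRL (Ng & Russell, 2000): a policy \pi is optimal for reward R iff
%   (P_{\pi} - P_a)(I - \gamma P_{\pi})^{-1} R \ge 0   componentwise, for all a \in A,
% where P_{\pi} is the transition matrix under \pi. Policy teaching then asks
% for an admissible incentive \Delta inducing \pi_T, at an illustrative cost:
\begin{align*}
\min_{\Delta}\quad & \textstyle\sum_{s \in S} \Delta(s) \\
\text{s.t.}\quad & (P_{\pi_T} - P_a)(I - \gamma P_{\pi_T})^{-1}(R + \Delta) \;\ge\; 0
  \qquad \forall a \in A, \\
& 0 \;\le\; \Delta(s) \;\le\; D_{\max} \qquad \forall s \in S
  \quad \text{(admissibility: no punishment, bounded budget).}
\end{align*}
```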

Policy Teaching with unknown rewards
Typically, the interested party won't know the agent's reward.
Idea: provide incentives, observe behavior, and repeat.

An indirect approach
[Figure, built up over several slides: the IRL space of rewards consistent with observed behavior, containing candidate rewards R and R_T; each round's observation adds IRL constraints (for the observed policy and the target π_T) that shrink the space.]
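As a hedged sketch of this loop (every function name and the constraint bookkeeping are illustrative placeholders, not the paper's interface): guess a reward in the current IRL space, compute incentives that would induce π_T if the guess were right, observe the induced policy, and cut the space with the resulting IRL constraints.

```python
def elicit_policy(pick_reward, incentives_for, post_and_observe, target_pi, max_rounds=50):
    """Indirect elicitation loop (illustrative sketch, not the paper's code).

    pick_reward(constraints)  -- a point in the current IRL space (e.g., its centroid),
                                 or None if the space is empty
    incentives_for(R_hat)     -- admissible incentives inducing target_pi if R == R_hat
                                 (the known-rewards LP from the earlier slide)
    post_and_observe(delta)   -- posts the incentives, returns the agent's observed policy
    """
    constraints = []                        # accumulated (delta, observed policy) cuts
    for _ in range(max_rounds):
        R_hat = pick_reward(constraints)    # current guess for the agent's reward
        if R_hat is None:                   # no reward can be induced admissibly
            return None
        delta = incentives_for(R_hat)
        pi = post_and_observe(delta)
        if pi == target_pi:                 # success: these incentives induce the target
            return delta
        constraints.append((delta, pi))     # R must be IRL-consistent with pi under R + delta
    return None
```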

Convergence result
Theorem. The elicitation method terminates after finitely many steps with admissible incentives that induce π_T, if such incentives exist.
Intuition: a pigeonhole argument on the number of hypercubes that fit in the IRL space.

Choosing an objective function
Desiderata: few elicitation rounds; tractable.

Centroid-based approach
Theorem (Grünbaum, 1960). Any halfspace containing the centroid of a convex set contains at least 1/e of its volume.
Pick the centroid of the IRL space as the guess for R at every iteration.
Adding IRL constraints then eliminates at least a 1/e fraction of the volume.
Idea: maintain a volume around the true reward that is never eliminated.
Algorithm: use a relaxation of the IRL constraints for observations.
This yields a logarithmic bound on the number of elicitation rounds.
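Putting the two facts together gives the logarithmic bound. As a hedged sketch (constants and the exact protected volume are illustrative, not from the paper): if each round cuts at least a 1/e fraction of the remaining IRL-space volume while a region of volume v > 0 around the true reward is never eliminated, then after k rounds

```latex
\begin{align*}
v \;\le\; \mathrm{vol}(K_k) \;\le\; \Big(1 - \tfrac{1}{e}\Big)^{k}\,\mathrm{vol}(K_0)
\quad\Longrightarrow\quad
k \;\le\; \frac{\ln\big(\mathrm{vol}(K_0)/v\big)}{\ln\big(e/(e-1)\big)}
\;=\; O\!\big(\log(\mathrm{vol}(K_0)/v)\big).
\end{align*}
```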

Computing the centroid
The centroid is #P-hard to compute [Rademacher, 2007].
Approximate it via sampling, in polynomial time [Bertsimas and Vempala, 2004].
Finding an approximate centroid in polynomial time preserves the logarithmic convergence bound with arbitrarily high probability.
But: the algorithm is roughly O(|S|^6), and the bound is not representative of actual performance.
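For concreteness, a minimal sketch of sampling-based centroid approximation over a polytope {x : Ax ≤ b} (the IRL space is an intersection of such halfspaces), using hit-and-run in the spirit of Bertsimas and Vempala (2004). This is an assumption-laden illustration, not the paper's algorithm; mixing-time and accuracy guarantees are ignored.

```python
import numpy as np


def hit_and_run_centroid(A, b, x0, n_samples=2000, burn_in=500, seed=0):
    """Approximate the centroid of {x : A x <= b} by averaging hit-and-run
    samples started from a strictly interior point x0 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for t in range(burn_in + n_samples):
        d = rng.normal(size=x.shape)
        d /= np.linalg.norm(d)              # uniformly random direction
        # The chord x + s*d stays feasible while s * (A d)_i <= (b - A x)_i for all i.
        Ad, slack = A @ d, b - A @ x
        lo = max((slack[Ad < 0] / Ad[Ad < 0]).max(initial=-np.inf), -1e9)
        hi = min((slack[Ad > 0] / Ad[Ad > 0]).min(initial=np.inf), 1e9)
        x = x + rng.uniform(lo, hi) * d     # uniform point on the chord
        if t >= burn_in:
            samples.append(x.copy())
    return np.mean(samples, axis=0)
```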

Two-sided slack maximization heuristic
[Figure: candidate rewards R and R_T in the IRL space, with the IRL region for π_T.]
Idea from Z. & Parkes, 2008; here the formulation is a linear program.
Generates |S||A| new constraints at each round.

Example: an ad-network setting
A publisher designs the link structure of a website to maximize its utility.
An ad-network provides incentives to influence the link design.
Results for the slack-based heuristic on 20 to 100 web pages: elicitation takes 8–12 rounds, and the number of rounds stays roughly constant as the number of states increases.

Extension: partial observation and partial target policy
Partial observation: observe the agent's actions in only some of the states.
Partial target policy: care about the agent's policy only in certain states.
Goal: induce the desired partial policy in the observable states.
Can be formulated as a mixed integer program, and convergence still holds.

Handling forward-looking agents
Consider the interactions as an infinitely repeated game; a strategic agent may misrepresent its preferences.
Consider a trigger strategy: provide maximal admissible incentives, and if the agent does not perform π_T, provide no future incentives.
This approach is unsatisfying:
it doesn't work for myopic agents;
there are commitment issues;
it is hard to implement in applications.

Handling forward-looking agents: a solution
Idea: run the elicitation method until the desired policy is elicited or no possible rewards remain.
The agent may misrepresent its preferences, but if it is patient it will want to keep receiving incentives, so following the desired policy is a best response.
Benefits: works for both myopic and strategic agents; can still achieve fast convergence.

The message
There are tractable methods for finding limited incentives that quickly induce desired sequential agent behavior when:
direct queries about agent preferences are unavailable, and
agent behavior can be observed over time.
Many interesting open questions remain, both theoretical and practical.

Thank you!
I would love to get your comments and suggestions: hq@eecs.harvard.edu