Policy Teaching Through Reward Function Learning. Haoqi Zhang, David Parkes, and Yiling Chen
Policy Teaching Through Reward Function Learning. Haoqi Zhang, David Parkes, and Yiling Chen. School of Engineering and Applied Sciences, Harvard University. ACM EC 2009.
Interested party and agent. A Web 2.0 site wants a user to... An online retailer wants a customer to... An ad-network wants a publisher to... Often, the agent does not behave as desired. Idea: provide incentives. Effective incentives depend on agent preferences.
Policy Teaching. An agent performs a sequence of observable actions [MDP]. The interested party can associate limited rewards with states. Can interact multiple times, but cannot impose actions. Goal: to induce desired behavior quickly and at a low cost. Policy Teaching is an example of Environment Design.
Mechanism Design vs. Environment Design. Mechanism design: elicit preferences via direct queries; the center implements outcomes; equilibrium analysis. Environment design: infer preferences from behavior; agents take actions, not the center; the agent is myopic to incentives.
Understanding preferences. Direct preference elicitation is costly and intrusive. Passive indirect elicitation is insufficient. Use an active, indirect elicitation method [Z. & Parkes, 2008].
This paper. Objective: induce a fixed target policy. It is easier for the interested party to specify a policy than a utility function, and this is more tractable than value-based policy teaching [Z. & Parkes, 2008].
Main results. Finding limited incentives to induce a pre-specified policy is in P. With unknown rewards, a polynomial-time algorithm finds incentives that induce the desired policy after logarithmically many interactions. A tractable slack-based heuristic, with empirical results in a simulated ad-network setting. Extension to partial observations and partial target policies. Game-theoretic analysis to handle strategic agents.
Markov Decision Process. Definition: An infinite-horizon MDP is a model M = {S, A, R, P, γ}, where S is the finite set of states; A is the finite set of available actions; R : S → R is the reward function; P : S × A × S → [0, 1] is the transition function; and γ ∈ (0, 1) is the discount factor. We assume bounded rewards: |R(s)| < R_max for all s ∈ S.
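The MDP model on this slide can be made concrete with a short value-iteration sketch. The two-state chain below is an invented toy, not an example from the talk; it only illustrates how an agent's optimal policy π is computed from M = {S, A, R, P, γ}.

```python
# Value iteration for the slide's MDP model M = {S, A, R, P, gamma}:
# R maps states to rewards, P[s][a] is a distribution over next states.

def value_iteration(S, A, R, P, gamma, tol=1e-10):
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: R[s] + gamma * max(sum(P[s][a][t] * V[t] for t in S)
                                       for a in A) for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

def greedy_policy(S, A, P, V):
    # The agent's optimal policy is greedy with respect to V.
    return {s: max(A, key=lambda a: sum(P[s][a][t] * V[t] for t in S))
            for s in S}

# Toy two-state chain (illustrative numbers only): the agent earns 1
# per step in state 0 and nothing in state 1, so it heads to 0 and stays.
S, A, gamma = [0, 1], ["stay", "go"], 0.9
P = {0: {"stay": {0: 1, 1: 0}, "go": {0: 0, 1: 1}},
     1: {"stay": {0: 0, 1: 1}, "go": {0: 1, 1: 0}}}
R = {0: 1.0, 1: 0.0}
V = value_iteration(S, A, R, P, gamma)
pi = greedy_policy(S, A, P, V)     # {0: 'stay', 1: 'go'}
```

Here V(0) = 1/(1 - γ) = 10 and V(1) = γ·V(0) = 9, so the greedy agent stays in state 0 once it gets there.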
Example: hyperlink design in an ad-network. MDP model, cf. Immorlica et al., 2006.
Policy Teaching with known rewards. The agent performs his optimal policy π. The interested party can provide an admissible incentive Δ : S → R (e.g., within budget and without punishment). The agent then performs his optimal policy w.r.t. R + Δ. Goal: provide a minimal admissible Δ to induce π_T. Use Inverse Reinforcement Learning (IRL) [Ng and Russell, 2000]: the set of rewards consistent with a policy is given by linear constraints. Linear programming formulation.
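The paper solves this known-rewards problem as a linear program over the IRL constraints. As a hedged illustration of the same optimality condition, the sketch below brute-forces the smallest incentive paid in one state of an invented two-state chain; the chain, the grid search, and all numbers are assumptions for illustration, not the paper's LP.

```python
# Hedged sketch of the known-rewards case on an invented two-state
# chain: find the smallest incentive delta, paid in state 1 only, that
# makes the target policy pi_T optimal. The paper does this with a
# linear program over IRL constraints; the grid search below merely
# illustrates the same optimality condition Q(s, pi_T(s)) >= Q(s, a).

S, A, gamma = [0, 1], ["stay", "go"], 0.9
P = {0: {"stay": {0: 1, 1: 0}, "go": {0: 0, 1: 1}},
     1: {"stay": {0: 0, 1: 1}, "go": {0: 1, 1: 0}}}
R = {0: 1.0, 1: 0.0}          # the agent prefers state 0
pi_T = {0: "go", 1: "stay"}   # the interested party wants state 1

def evaluate(pi, Rw):
    """Value of following policy pi forever under rewards Rw."""
    V = {s: 0.0 for s in S}
    for _ in range(2000):
        V = {s: Rw[s] + gamma * sum(P[s][pi[s]][t] * V[t] for t in S)
             for s in S}
    return V

def induces_target(delta):
    """IRL check: is pi_T optimal once delta is added to state 1?"""
    Rw = {0: R[0], 1: R[1] + delta}
    V = evaluate(pi_T, Rw)
    Q = lambda s, a: Rw[s] + gamma * sum(P[s][a][t] * V[t] for t in S)
    return all(Q(s, pi_T[s]) + 1e-6 >= Q(s, a) for s in S for a in A)

# Smallest delta on a 0.1 grid that induces pi_T (here: 1.0).
min_delta = next(k / 10.0 for k in range(51) if induces_target(k / 10.0))
```

On this toy chain the agent forgoes a reward of 1 per step by moving to state 1, so the minimal per-step incentive is exactly 1.0, which the search recovers.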
Policy Teaching with unknown rewards. Typically, the interested party won't know the agent's reward. Idea: provide incentives, observe behavior, and repeat.
An indirect approach. [Figure: the space of candidate rewards R consistent with observed behavior (the IRL space), successively cut down toward the region IRL(π_T) of rewards under which the target policy is optimal.]
Convergence result. Theorem: The elicitation method terminates after finitely many steps with admissible incentives that induce π_T, if such incentives exist. Intuition: a pigeonhole argument on the number of hypercubes that fit in the IRL space.
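A toy version of the elicit-observe-repeat loop behind this theorem can be sketched as follows. A finite hypothesis set over the agent's reward stands in for the continuous IRL space, and every observation of non-compliance eliminates the hypotheses that the offered incentive would have satisfied; the chain, hypothesis grid, and all numbers are invented for illustration.

```python
# Toy elicit-observe-repeat loop: the agent's true reward R(0) is
# unknown but lies in a finite hypothesis set (a stand-in for the
# continuous IRL space). Each round offers an incentive, observes the
# agent's policy, and eliminates every hypothesis inconsistent with it.

S, A, gamma = [0, 1], ["stay", "go"], 0.9
P = {0: {"stay": {0: 1, 1: 0}, "go": {0: 0, 1: 1}},
     1: {"stay": {0: 0, 1: 1}, "go": {0: 1, 1: 0}}}
pi_T = {0: "go", 1: "stay"}        # target: move to state 1 and stay
true_r0 = 2.0                      # hidden from the interested party
hypotheses = [0.5, 1.0, 1.5, 2.0]  # candidate values of R(0)

def best_response(r0, delta):
    """Agent's optimal policy when incentive delta is paid in state 1."""
    R = {0: r0, 1: delta}
    V = {s: 0.0 for s in S}
    for _ in range(2000):
        V = {s: R[s] + gamma * max(sum(P[s][a][t] * V[t] for t in S)
                                   for a in A) for s in S}
    return {s: max(A, key=lambda a: sum(P[s][a][t] * V[t] for t in S))
            for s in S}

rounds = 0
while True:
    rounds += 1
    guess = min(hypotheses)        # try the cheapest surviving hypothesis
    delta = guess + 0.01           # small margin to break indifference
    observed = best_response(true_r0, delta)
    if observed == pi_T:
        break                      # target induced; done
    # non-compliance rules out every hypothesis this delta satisfies
    hypotheses = [r for r in hypotheses if best_response(r, delta) != pi_T]
```

On this toy instance the loop complies in the fourth round, once only the true hypothesis survives; the paper's continuous analogue replaces the hypothesis list with volume cuts of the IRL polytope.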
Choosing an objective function. Want: few elicitation rounds; tractable.
Centroid-based approach. Theorem (Grünbaum, 1960): Any halfspace containing the centroid of a convex set contains at least 1/e of its volume. Pick the centroid of the IRL space for R at every iteration; adding IRL constraints then eliminates at least 1/e of its volume. Idea: maintain a volume around the true reward that is never eliminated. Algorithm: use a relaxation of the IRL constraints for observations. Obtain a logarithmic bound on the number of elicitation rounds.
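To make the centroid step concrete: the paper approximates the centroid by sampling (per Bertsimas and Vempala, hit-and-run sampling scales to high dimension). The sketch below instead uses plain rejection sampling, which is only workable in toy dimensions; the triangle, bounding box, and sample count are invented for illustration.

```python
import random

# Approximate the centroid of a polytope {x : a.x <= b for each (a, b)}
# by rejection sampling inside a bounding box and averaging accepted
# points. (The paper relies on polynomial-time sampling; rejection
# sampling is a low-dimensional stand-in, not the paper's algorithm.)

def approx_centroid(constraints, box, n=20000, seed=0):
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        x = [rng.uniform(lo, hi) for lo, hi in box]
        if all(sum(ai * xi for ai, xi in zip(a, x)) <= b
               for a, b in constraints):
            pts.append(x)
    return [sum(p[i] for p in pts) / len(pts) for i in range(len(box))]

# Toy region: the triangle x >= 0, y >= 0, x + y <= 1,
# whose true centroid is (1/3, 1/3).
tri = [((-1, 0), 0.0), ((0, -1), 0.0), ((1, 1), 1.0)]
cx, cy = approx_centroid(tri, [(0.0, 1.0), (0.0, 1.0)])
```

With 20,000 samples the estimate lands close to (1/3, 1/3); by Grünbaum's theorem, any halfspace cut through this point removes at least 1/e of the triangle's area.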
Computing the centroid. The centroid is #P-hard to compute [Rademacher, 2007]. Approximate via sampling in polynomial time [Bertsimas and Vempala, 2004]. Finding an approximate centroid in polynomial time yields the logarithmic convergence bound with arbitrarily high probability. But: the algorithm is about O(|S|^6), and the bound is not representative of actual performance.
Two-sided slack maximization heuristic. [Figure: current estimate, true reward, and target region in the IRL space.] Idea from Z. & Parkes, 2008. Here the formulation is a linear program; it generates |S||A| new constraints at each round.
Example: an ad-network setting. A publisher designs the link structure on a website to maximize utility; an ad-network provides incentives to influence the link design. Results for the slack-based heuristic, on 20 to 100 web pages: elicitation takes 8 to 12 rounds, and the number of rounds stays roughly constant as the number of states increases.
Extension: partial observation and partial target policy. Partial observation: observe agent actions in only some of the states. Partial target policy: care only about the agent's policy in certain states. Goal: induce the desired partial policy in the observable states. Can formulate as a mixed integer program and still get convergence.
Handling forward-looking agents. Consider the interactions as an infinitely repeated game; a strategic agent may misrepresent its preferences. Consider a trigger strategy: provide maximal admissible incentives, and if the agent does not perform π_T, provide no future incentives. This approach is unsatisfying: it doesn't work for myopic agents, it has commitment issues, and it is hard to implement in applications.
Handling forward-looking agents: a solution. Idea: use the elicitation method until the desired policy is elicited or no possible rewards remain. The agent may misrepresent, but a patient agent will want to keep receiving incentives, so it will follow the desired policy as a best response. Benefits: works for both myopic and strategic agents, and can still get fast convergence.
The message. There are tractable methods for finding limited incentives that quickly induce desired sequential agent behavior when direct queries about agent preferences are unavailable but agent behavior can be observed over time. Many interesting open questions remain, both theoretical and practical.
Thank you! I would love to get your comments and suggestions: hq@eecs.harvard.edu
More informationWireless Network Pricing Chapter 7: Network Externalities
Wireless Network Pricing Chapter 7: Network Externalities Jianwei Huang & Lin Gao Network Communications and Economics Lab (NCEL) Information Engineering Department The Chinese University of Hong Kong
More informationChapter 3 Learning in Two-Player Matrix Games
Chapter 3 Learning in Two-Player Matrix Games 3.1 Matrix Games In this chapter, we will examine the two-player stage game or the matrix game problem. Now, we have two players each learning how to play
More informationOptimizing Media Access Strategy for Competing Cognitive Radio Networks Y. Gwon, S. Dastangoo, H. T. Kung
Optimizing Media Access Strategy for Competing Cognitive Radio Networks Y. Gwon, S. Dastangoo, H. T. Kung December 12, 2013 Presented at IEEE GLOBECOM 2013, Atlanta, GA Outline Introduction Competing Cognitive
More informationECON 282 Final Practice Problems
ECON 282 Final Practice Problems S. Lu Multiple Choice Questions Note: The presence of these practice questions does not imply that there will be any multiple choice questions on the final exam. 1. How
More informationUMBC 671 Midterm Exam 19 October 2009
Name: 0 1 2 3 4 5 6 total 0 20 25 30 30 25 20 150 UMBC 671 Midterm Exam 19 October 2009 Write all of your answers on this exam, which is closed book and consists of six problems, summing to 160 points.
More informationOPPORTUNISTIC spectrum access (OSA), first envisioned
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 5, MAY 2008 2053 Joint Design and Separation Principle for Opportunistic Spectrum Access in the Presence of Sensing Errors Yunxia Chen, Student Member,
More informationFast Online Learning of Antijamming and Jamming Strategies
Fast Online Learning of Antijamming and Jamming Strategies Y. Gwon, S. Dastangoo, C. Fossa, H. T. Kung December 9, 2015 Presented at the 58 th IEEE Global Communications Conference, San Diego, CA This
More informationGame Theory and Economics of Contracts Lecture 4 Basics in Game Theory (2)
Game Theory and Economics of Contracts Lecture 4 Basics in Game Theory (2) Yu (Larry) Chen School of Economics, Nanjing University Fall 2015 Extensive Form Game I It uses game tree to represent the games.
More informationDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving Shai Shalev-Shwartz Mobileye IMVC dimension, March, 2016 S. Shalev-Shwartz is also affiliated with The Hebrew University Shai Shalev-Shwartz (MobilEye) DL for Autonomous
More informationOutline for today s lecture Informed Search Optimal informed search: A* (AIMA 3.5.2) Creating good heuristic functions Hill Climbing
Informed Search II Outline for today s lecture Informed Search Optimal informed search: A* (AIMA 3.5.2) Creating good heuristic functions Hill Climbing CIS 521 - Intro to AI - Fall 2017 2 Review: Greedy
More informationCommunication over a Time Correlated Channel with an Energy Harvesting Transmitter
Communication over a Time Correlated Channel with an Energy Harvesting Transmitter Mehdi Salehi Heydar Abad Faculty of Engineering and Natural Sciences Sabanci University, Istanbul, Turkey mehdis@sabanciuniv.edu
More informationIndex Terms Deterministic channel model, Gaussian interference channel, successive decoding, sum-rate maximization.
3798 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 58, NO 6, JUNE 2012 On the Maximum Achievable Sum-Rate With Successive Decoding in Interference Channels Yue Zhao, Member, IEEE, Chee Wei Tan, Member,
More informationFoundations of AI. 3. Solving Problems by Searching. Problem-Solving Agents, Formulating Problems, Search Strategies
Foundations of AI 3. Solving Problems by Searching Problem-Solving Agents, Formulating Problems, Search Strategies Luc De Raedt and Wolfram Burgard and Bernhard Nebel Contents Problem-Solving Agents Formulating
More informationName: Your EdX Login: SID: Name of person to left: Exam Room: Name of person to right: Primary TA:
UC Berkeley Computer Science CS188: Introduction to Artificial Intelligence Josh Hug and Adam Janin Midterm I, Fall 2016 This test has 8 questions worth a total of 100 points, to be completed in 110 minutes.
More informationThe Game-Theoretic Approach to Machine Learning and Adaptation
The Game-Theoretic Approach to Machine Learning and Adaptation Nicolò Cesa-Bianchi Università degli Studi di Milano Nicolò Cesa-Bianchi (Univ. di Milano) Game-Theoretic Approach 1 / 25 Machine Learning
More informationTUD Poker Challenge Reinforcement Learning with Imperfect Information
TUD Poker Challenge 2008 Reinforcement Learning with Imperfect Information Outline Reinforcement Learning Perfect Information Imperfect Information Lagging Anchor Algorithm Matrix Form Extensive Form Poker
More informationU strictly dominates D for player A, and L strictly dominates R for player B. This leaves (U, L) as a Strict Dominant Strategy Equilibrium.
Problem Set 3 (Game Theory) Do five of nine. 1. Games in Strategic Form Underline all best responses, then perform iterated deletion of strictly dominated strategies. In each case, do you get a unique
More informationOutline for this presentation. Introduction I -- background. Introduction I Background
Mining Spectrum Usage Data: A Large-Scale Spectrum Measurement Study Sixing Yin, Dawei Chen, Qian Zhang, Mingyan Liu, Shufang Li Outline for this presentation! Introduction! Methodology! Statistic and
More informationCS188 Spring 2011 Written 2: Minimax, Expectimax, MDPs
Last name: First name: SID: Class account login: Collaborators: CS188 Spring 2011 Written 2: Minimax, Expectimax, MDPs Due: Monday 2/28 at 5:29pm either in lecture or in 283 Soda Drop Box (no slip days).
More informationIEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 3, MARCH
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 3, MARCH 2010 1401 Decomposition Principles and Online Learning in Cross-Layer Optimization for Delay-Sensitive Applications Fangwen Fu, Student Member,
More informationAsynchronous Best-Reply Dynamics
Asynchronous Best-Reply Dynamics Noam Nisan 1, Michael Schapira 2, and Aviv Zohar 2 1 Google Tel-Aviv and The School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel. 2 The
More informationDeepMind Self-Learning Atari Agent
DeepMind Self-Learning Atari Agent Human-level control through deep reinforcement learning Nature Vol 518, Feb 26, 2015 The Deep Mind of Demis Hassabis Backchannel / Medium.com interview with David Levy
More informationACRUCIAL issue in the design of wireless sensor networks
4322 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 8, AUGUST 2010 Coalition Formation for Bearings-Only Localization in Sensor Networks A Cooperative Game Approach Omid Namvar Gharehshiran, Student
More informationConvergence in competitive games
Convergence in competitive games Vahab S. Mirrokni Computer Science and AI Lab. (CSAIL) and Math. Dept., MIT. This talk is based on joint works with A. Vetta and with A. Sidiropoulos, A. Vetta DIMACS Bounded
More informationTheory of Moves Learners: Towards Non-Myopic Equilibria
Theory of s Learners: Towards Non-Myopic Equilibria Arjita Ghosh Math & CS Department University of Tulsa garjita@yahoo.com Sandip Sen Math & CS Department University of Tulsa sandip@utulsa.edu ABSTRACT
More informationReinforcement Learning Simulations and Robotics
Reinforcement Learning Simulations and Robotics Models Partially observable noise in sensors Policy search methods rather than value functionbased approaches Isolate key parameters by choosing an appropriate
More informationScheduling. Radek Mařík. April 28, 2015 FEE CTU, K Radek Mařík Scheduling April 28, / 48
Scheduling Radek Mařík FEE CTU, K13132 April 28, 2015 Radek Mařík (marikr@fel.cvut.cz) Scheduling April 28, 2015 1 / 48 Outline 1 Introduction to Scheduling Methodology Overview 2 Classification of Scheduling
More informationGame Theory and Randomized Algorithms
Game Theory and Randomized Algorithms Guy Aridor Game theory is a set of tools that allow us to understand how decisionmakers interact with each other. It has practical applications in economics, international
More informationOptimal Coded Information Network Design and Management via Improved Characterizations of the Binary Entropy Function
Optimal Coded Information Network Design and Management via Improved Characterizations of the Binary Entropy Function John MacLaren Walsh & Steven Weber Department of Electrical and Computer Engineering
More informationLocalization (Position Estimation) Problem in WSN
Localization (Position Estimation) Problem in WSN [1] Convex Position Estimation in Wireless Sensor Networks by L. Doherty, K.S.J. Pister, and L.E. Ghaoui [2] Semidefinite Programming for Ad Hoc Wireless
More informationGuess the Mean. Joshua Hill. January 2, 2010
Guess the Mean Joshua Hill January, 010 Challenge: Provide a rational number in the interval [1, 100]. The winner will be the person whose guess is closest to /3rds of the mean of all the guesses. Answer:
More informationAI Plays Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng)
AI Plays 2048 Yun Nie (yunn), Wenqi Hou (wenqihou), Yicheng An (yicheng) Abstract The strategy game 2048 gained great popularity quickly. Although it is easy to play, people cannot win the game easily,
More informationIncentive design for social computing: Interdisciplinarity time!
Incentive design for social computing: Interdisciplinarity time! ARPITA GHOSH Cornell University June 2015 Incentive design for social computing: Interdisciplinarity time! 1 Hello (world) Question: Please
More informationChapter 2 Distributed Consensus Estimation of Wireless Sensor Networks
Chapter 2 Distributed Consensus Estimation of Wireless Sensor Networks Recently, consensus based distributed estimation has attracted considerable attention from various fields to estimate deterministic
More informationElements of Artificial Intelligence and Expert Systems
Elements of Artificial Intelligence and Expert Systems Master in Data Science for Economics, Business & Finance Nicola Basilico Dipartimento di Informatica Via Comelico 39/41-20135 Milano (MI) Ufficio
More informationMulti-user Space Time Scheduling for Wireless Systems with Multiple Antenna
Multi-user Space Time Scheduling for Wireless Systems with Multiple Antenna Vincent Lau Associate Prof., University of Hong Kong Senior Manager, ASTRI Agenda Bacground Lin Level vs System Level Performance
More informationOptimization of On-line Appointment Scheduling
Optimization of On-line Appointment Scheduling Brian Denton Edward P. Fitts Department of Industrial and Systems Engineering North Carolina State University Tsinghua University, Beijing, China May, 2012
More information