Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.)

1 Introduction to Neuro-Dynamic Programming (Or, how to count cards in blackjack and do other fun things too.)
Eric B. Laber
February 12, 2008

2 Framework / Introduction
Objectives:
- Define Neuro-Dynamic Programming (NDP)
- Understand how NDP is used by learning to cheat at blackjack
- Learn other (more noble) applications of NDP

3 Framework / What is NDP?
NDP is about sequential decision making:
- An agent (decision maker) is faced with a series of decisions
- Each decision results in a reward
- Each decision changes the environment
- Agent's objective: maximize accumulated reward over time

4-7 Framework / What is NDP?
S0 -> A1 -> R1 -> S1
- S0: initial state
- A1: first decision
- R1: first reward
- S1: second state

8 Framework / What is NDP?
Actions affect future states, so myopic decision making is NOT sufficient.
S0, decision A1:
  a11 (r1 = 1) -> S1, decision A2:
    a21: r2 = 50
    a22: r2 = 100
  a12 (r1 = 2) -> S1, decision A2:
    a21: r2 = -100
    a22: r2 = 0


13-17 Framework / What is NDP?
Solution: Go backwards! Carry the best second-stage reward back to each first action.
S0, decision A1:
  a11 (r1 = 1, backed-up value 1 + max(50, 100) = 101) -> S1, decision A2:
    a21: r2 = 50
    a22: r2 = 100
  a12 (r1 = 2, backed-up value 2 + max(-100, 0) = 2) -> S1, decision A2:
    a21: r2 = -100
    a22: r2 = 0
Backup diagrams put the DP in NDP.
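The "go backwards" step on this two-stage tree can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the dictionary layout and function names are hypothetical, and the lower-branch reward is taken to be r2 = -100, an assumption chosen because it is the value consistent with the backed-up total of 2 on the slide.

```python
# Two-stage decision tree. The lower-branch reward of -100 is an assumption
# (it is the value consistent with the backed-up total of 2).
TREE = {
    "a11": {"r1": 1, "next": {"a21": 50, "a22": 100}},
    "a12": {"r1": 2, "next": {"a21": -100, "a22": 0}},
}


def myopic_choice(tree):
    """Pick the first action with the largest immediate reward r1."""
    return max(tree, key=lambda a: tree[a]["r1"])


def backed_up_values(tree):
    """Work backwards: r1 plus the best achievable second-stage reward."""
    return {a: node["r1"] + max(node["next"].values()) for a, node in tree.items()}


values = backed_up_values(TREE)
best_first = max(values, key=values.get)

print("myopic first action:   ", myopic_choice(TREE))
print("backed-up values:      ", values)
print("best first action:     ", best_first)
```

Under these numbers the myopic rule picks a12 for its larger immediate reward, while the backed-up values favor a11.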

18 Framework / What is NDP?
Real sequential problems are more sophisticated:
- Systems are stochastic
- System dynamics are unknown:
  - Reward function is unknown
  - Transition probabilities between states are unknown
- Number of states and actions may be large or even infinite
- Must use data to estimate some (or all) of the above

19 Framework / What is NDP?
NDP is a method for approximating the backup diagram method for sequential decision problems with unknown system dynamics, large state or action spaces, or both.
- The term "Neuro" refers to the approximation of elements of the backup diagram (in computer science this is often done with neural networks)
- The term "Dynamic Programming" refers to solving the system with the approximated components using the backup diagram approach
- Often the two steps of approximation and evaluation are alternated repeatedly

20 Example: Cheating at Blackjack / Cheating at Blackjack
Example: Counting cards in Blackjack

21 Example: Cheating at Blackjack / Intro to Blackjack
Blackjack, also known as Twenty-one or Pontoon, is a popular casino game.
- Object is to obtain cards whose numerical sum is large without exceeding 21
- Player draws cards until he is satisfied with his total or it exceeds 21 (in which case he loses)
- Dealer draws cards according to a fixed policy: hit until total is 17 or higher
- Winner is the person with the highest numerical total less than or equal to 21
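The fixed dealer policy and the winning rule above can be written down directly. The following is a minimal sketch with hypothetical function names; two details are assumptions not stated on the slide: an ace (value 1) counts as 11 whenever that does not push the total over 21, and equal totals are scored as a tie.

```python
def hand_total(cards):
    """Total of a hand; cards are values 1-10, with 1 standing for an ace.

    Assumption: one ace counts as 11 when that does not exceed 21.
    """
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        total += 10
    return total


def dealer_play(cards, deck):
    """Fixed dealer policy from the slide: hit until the total is 17 or higher.

    `deck` is any iterator that yields the next card value.
    """
    cards = list(cards)
    while hand_total(cards) < 17:
        cards.append(next(deck))
    return cards


def outcome(player_total, dealer_total):
    """+1 if the player wins, 0 for a tie (assumed), -1 if the player loses."""
    if player_total > 21:
        return -1
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    return 0 if player_total == dealer_total else -1


# Dealer holds 10 + 6 = 16, so must hit; the next card is a 5.
final_hand = dealer_play([10, 6], iter([5, 9]))
print(final_hand, hand_total(final_hand))
```

Passing the deck as an iterator keeps the sketch deterministic for a fixed card order; a simulation would pass a shuffled shoe instead.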

22 Example: Cheating at Blackjack / Intro to Blackjack
Available information at time t:
- All cards used prior to time t
- Player's current total
- One of the dealer's cards

23 Example: Cheating at Blackjack / Intro to Blackjack
Beginning of a blackjack hand as a sequential decision problem:
State 1: card history (number of Aces, Twos, Threes, ..., Kings)
  Action: choose bet b in {minbet, maxbet}
  Reward: r1 = 0
State 2: card history (number of Aces, Twos, Threes, ..., Kings), player's hand, one dealer card
  Action: hit (take another card) or stand (take no more cards)
Notice we essentially need two strategies:
- One for deciding which bet to place
- One for deciding when to hit/stand
Should the strategies be independent?
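One way to make the two decision points concrete is to give each its own state type. The sketch below is only an illustration: the class names, the `deal_hand` helper, and the choice to store the card history as a tuple of per-rank counts are all hypothetical, not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass

RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")
BET_ACTIONS = ("minbet", "maxbet")   # actions available in the betting state
PLAY_ACTIONS = ("hit", "stand")      # actions available in the playing state


@dataclass(frozen=True)
class BetState:
    """State at the betting decision: only the card history is known."""
    seen: tuple  # count of each rank dealt so far, ordered as RANKS


@dataclass(frozen=True)
class PlayState:
    """State at a hit/stand decision: history, player's hand, one dealer card."""
    seen: tuple
    player_hand: tuple
    dealer_card: str


def deal_hand(bet_state, player_hand, dealer_card):
    """Move from the betting state to the playing state once cards are dealt."""
    counts = Counter(dict(zip(RANKS, bet_state.seen)))
    counts.update(player_hand)
    counts.update([dealer_card])
    seen = tuple(counts[r] for r in RANKS)
    return PlayState(seen, tuple(player_hand), dealer_card)


start = BetState(seen=(0,) * len(RANKS))
state = deal_hand(start, ["K", "7"], "A")
print(state)
```

Frozen dataclasses are hashable, so states like these can serve as dictionary keys when a strategy is stored as a table.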

24 Example: Cheating at Blackjack / Intro to Blackjack
Is the following hand a good one?
[Figure: example blackjack hand]

25 Example: Cheating at Blackjack / Intro to Blackjack
- The goodness of a particular hand depends on the strategy being employed
- Betting strategy depends on estimated goodness of next hand
- Formally, we define the goodness of a hand under a particular strategy as the expected total winnings from that hand and all future hands
- We must estimate betting and playing strategies simultaneously
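The definition of goodness can be written as a formula. The notation below is an assumption rather than the talk's own: V^π for the goodness under strategy π, R_t for the reward at step t (matching the S0, A1, R1, S1 indexing used earlier), and a finite horizon T marking the end of play.

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=1}^{T} R_t \;\middle|\; S_0 = s \right]
```

The expectation is taken over the cards dealt and over the actions chosen by π, which is why the goodness of the same hand changes when the strategy changes.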

26-27 Example: Cheating at Blackjack / NDP and Blackjack
Why solving blackjack directly is difficult:
1. No explicit model
2. Large number of states and actions
3. Variable number of decks (1, 2, 4, or 8)
What makes this a good NDP problem:
1. Easy to simulate blackjack
2. Important features of the game are easy to summarize
3. Can simultaneously solve for any number of decks

28 Example: Cheating at Blackjack / NDP and Blackjack
Features for blackjack:
- We could keep track of the total number of Aces, Twos, Threes, etc.
- Better to keep track of the percentage of Aces, Twos, Threes, etc. that have appeared (e.g., at time t we've seen 23% of all Aces)
- It is usually sufficient to keep track of less information
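The percentage feature is simple to compute from raw counts. This is a minimal sketch with a hypothetical function name, assuming standard 52-card decks (four cards of each rank per deck).

```python
from collections import Counter

RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")


def fraction_seen(seen, num_decks):
    """Fraction of each rank that has already appeared.

    Assumes standard decks: 4 cards of each rank per deck.
    """
    per_rank = 4 * num_decks
    return {rank: seen.get(rank, 0) / per_rank for rank in RANKS}


# Three aces and eight kings have appeared from a two-deck shoe.
features = fraction_seen(Counter({"A": 3, "K": 8}), num_decks=2)
print(features["A"], features["K"], features["5"])
```

Because the fractions always lie between 0 and 1, the same feature vector has the same meaning for 1, 2, 4, or 8 decks, which raw counts do not.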

29-30 Example: Cheating at Blackjack / NDP and Blackjack
NDP Algorithm:
For k = 1, 2, ...:
1. Choose a strategy π_k, which decides an action for EVERY possible state, so that it improves on the previous strategy π_{k-1}
2. Estimate the expected performance of π_k on every possible scenario using computer simulation
[Diagram: Improve and Evaluate alternate in a cycle]

31-33 Example: Cheating at Blackjack / NDP and Blackjack
Questions:
How to improve a strategy π_k?
- Suppose at time t we observe state s_t
  1. We estimate the performance of choosing action π_k(s_t) and following π_k afterward
  2. We also estimate the performance of choosing alternate actions when faced with s_t and following π_k afterward
- If improvement can be made at any state s_t, we can improve π_k by choosing the optimal action at s_t and leaving π_k unchanged at other states
How long to run the algorithm?
- We run the algorithm until no further improvements can be made
- Convergence to a near-optimal strategy is guaranteed
How to choose the starting policy?
- Any starting policy will do, but some choices will lead to faster convergence
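The improve/evaluate loop can be sketched on a heavily simplified game. Everything here is an assumption made for illustration, not the talk's implementation: cards come from an infinite deck, an ace always counts as 1, the state is only (player total, dealer card), ties score 0, bets are ignored, and the improvement step is applied to every state in one sweep rather than one state at a time.

```python
import random

# Assumed infinite deck: values 1-10, with four ranks worth 10.
CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
STATES = [(total, dealer) for total in range(4, 22) for dealer in range(1, 11)]
ACTIONS = ("hit", "stand")


def rollout(total, dealer_card, first_action, policy, rng):
    """Simulate one hand: take first_action now, then follow policy.

    Returns +1 (win), 0 (tie) or -1 (loss) for the player.
    """
    action = first_action
    while action == "hit":
        total += rng.choice(CARDS)
        if total > 21:
            return -1
        action = policy[(total, dealer_card)]
    dealer = dealer_card
    while dealer < 17:  # fixed dealer policy: hit until 17 or higher
        dealer += rng.choice(CARDS)
    if dealer > 21 or total > dealer:
        return 1
    return 0 if total == dealer else -1


def estimate(state, action, policy, rng, n=200):
    """Evaluate: average simulated outcome of `action` at `state`, then policy."""
    total, dealer_card = state
    return sum(rollout(total, dealer_card, action, policy, rng) for _ in range(n)) / n


def improve(policy, rng, n=200):
    """Improve: at each state, switch to the action with the best estimate."""
    new_policy = {}
    for state in policy:
        scores = {a: estimate(state, a, policy, rng, n) for a in ACTIONS}
        new_policy[state] = max(scores, key=scores.get)
    return new_policy


rng = random.Random(0)
policy = {state: "hit" for state in STATES}  # arbitrary starting policy
for k in range(3):
    policy = improve(policy, rng)

print({total: policy[(total, 10)] for total in range(12, 22)})
```

With simulation noise, the estimates for close calls can flip between sweeps, so a practical stopping rule would need a tolerance rather than waiting for exactly zero changes.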

34 Example: Cheating at Blackjack / NDP and Blackjack
The preceding algorithm produces a strategy π which is near optimal. However, using π requires memorizing every possible scenario!
Fortunately, NDP allows us to restrict ourselves to simpler strategies.
Linear strategies are currently popular, for example:
  Bet Large if: 2 * NumberAcesLeft + NumberFaceCardsLeft - NumberLowCardsLeft > 0
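A linear strategy of this kind is a one-line rule. The sketch below uses a hypothetical function name and assumes the low-card term enters with a minus sign, as is usual in card counting, where low cards remaining in the deck count against the player.

```python
def bet_large(aces_left, face_cards_left, low_cards_left):
    """Linear betting rule: bet large when the weighted count is positive.

    Assumption: the low-card term is subtracted.
    """
    score = 2 * aces_left + face_cards_left - low_cards_left
    return score > 0


# Deck rich in high cards: 2*4 + 16 - 20 = 4, so bet large.
print(bet_large(aces_left=4, face_cards_left=16, low_cards_left=20))
# Deck rich in low cards: 2*1 + 5 - 20 = -13, so bet small.
print(bet_large(aces_left=1, face_cards_left=5, low_cards_left=20))
```

Only three running counts are needed, which is what makes a rule like this usable at the table, in contrast to a strategy that stores an action for every state.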

35 Other NDP Applications / Other Applications
NDP is used in a large number of applications, including:
- Autonomous flight
- Tailored medical treatments for chronic illness
- Adaptive standardized tests (e.g., GRE)

36 End / Further Information
There exist several standard references for NDP:
- Dynamic Programming and Optimal Control by Bertsekas, Athena Scientific
- Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, Athena Scientific
- Reinforcement Learning by Sutton and Barto, MIT Press
