Reinforcement Learning for Penalty Avoiding Policy Making and its Extensions and an Application to the Othello Game
Kazuteru Miyazaki
National Institution for Academic Degrees, Ootsuka, Bunkyo-ku, Tokyo, Japan

Sougo Tsuboi
TOSHIBA, 1 Toshiba Komukai Saiwai, Kawasaki, Japan

Shigenobu Kobayashi
kobayasi@dis.titech.ac.jp
Tokyo Institute of Technology, 4259 Nagatsuta, Midori, Yokohama, Japan

ABSTRACT

The purpose of a reinforcement learning system is, in general, to learn an optimal policy. From the engineering point of view, however, it is useful and important to acquire not only optimal policies but also penalty avoiding policies. In this paper, we focus on the formation of penalty avoiding policies based on the Penalty Avoiding Rational Policy Making algorithm [1]. In applying the algorithm to large-scale problems, we are confronted with combinatorial explosion. To suppress the problem, especially the number of states, we introduce several ideas and heuristics. We implemented the proposed method as a learning system for an Othello game player. After learning, this player can always defeat the well-known Othello program KITTY [7].

Keywords: reinforcement learning, reward and penalty, penalty avoiding rational policy, the Othello game, KITTY

1. INTRODUCTION

Reinforcement learning (RL) is a kind of machine learning. It aims to adapt an agent to a given environment with rewards as its only clue. If we tell the agent what it should do (its purpose) and what it should not do (its restriction), it can learn how to satisfy them. In RL, how to design rewards is important. In most recent RL systems [5], a positive reward, called simply a reward, is given to the agent when it has achieved a purpose, and a negative one, called a penalty, is given when it has violated a restriction. However, if we set incorrect values for them, the agent will learn unexpected behavior.
For example, in a two-player game such as Othello, consider the case where a reward is given to the winner and a penalty is given to the loser. If we design incorrect values for them, the agent may lose the game even if a winning strategy exists. This is because the reward and the penalty are treated in the same dimension. It is therefore important to distinguish a reward (for achievement of a purpose) from a penalty (for violation of a restriction). We know the Penalty Avoiding Rational Policy Making algorithm [1] as a reinforcement learning system that makes this distinction between a reward and a penalty. Though it can suppress any penalty as stably as possible and can get a reward constantly, it has to memorize many state-action pairs, as Q-learning [6] and TD(λ) [4] do. In this paper, we discuss extensions of the Penalty Avoiding Rational Policy Making algorithm in the class of problems where we have some information about the target environment. We introduce several ideas and heuristics to suppress the combinatorial explosion in large-scale problems. Furthermore, we implemented the proposed method as a learning system for an Othello game player. Section 2 describes the problem, the method, notations, and the Penalty Avoiding Rational Policy Making algorithm. Section 3 describes extensions of the algorithm. Section 4 applies it to the Othello game. Section 5 concludes.

2. THE DOMAIN

2.1 Target Environments

Consider an agent in some unknown environment. At each time step, the agent gets information about the environment through its sensors and chooses an action. As a result of some sequence of actions, the agent gets a reward or a penalty from the environment. We assume that target environments are Markov Decision Processes (MDPs). A pair of a sensory input (a state) and an action is called a rule. We denote the rule "if x then a" as xa, where x is a state and a is an action.
Figure 1. An example of penalty rules (xa, ya) and a penalty state (y).

The function that maps states to actions is called a policy. We call a policy rational if and only if its expected reward per action is larger than zero. The function that maps a state (or a rule) to a reward (or a penalty) is a reward function. We call the sequence of rules used between the previous reward (or penalty) and the current one an episode. We call a subsequence of an episode a detour when the state of its first firing rule and the state of its last firing rule are the same, though the two rules themselves are different. A rule that does not exist on a detour in some episode is rational; otherwise, the rule is called irrational. We call a rule a penalty rule if and only if it has a penalty or it can transit to a penalty state, that is, a state in which there are no rules other than penalty or irrational rules. For example, in figure 1, xa and ya are penalty rules, and state y is a penalty state. We call a policy that cannot have any penalty rule a penalty avoiding policy. We assume that there is a deterministic rational policy among the penalty avoiding policies. For each sensory input, a deterministic policy always returns the same action, while a stochastic policy returns an action stochastically.

2.2 The Penalty Avoiding Rational Policy Making algorithm [1]

We know the Penalty Avoiding Rational Policy Making algorithm (PARP) [1] as a reinforcement learning system for the environments discussed in section 2.1. To avoid all penalties, PARP suppresses all penalty rules in the current rule set by the Penalty Rule Judgment algorithm (PRJ) in figure 2. After suppressing all penalty rules, it makes a rational policy by the Rational Policy Improvement algorithm [1]. Though PARP can learn a stochastic rational policy in the class where there is no deterministic rational policy among the penalty avoiding policies, we do not treat stochastic rational policies here.
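To make the detour notion above concrete, the following is a rough sketch in our own notation (the representation of episodes as lists of state-action pairs is our illustrative choice, not the authors' implementation): an episode is a list of fired rules, and a rule is rational if it fires outside every detour of some episode.

```python
# Illustrative sketch (hypothetical representation): rules are (state, action)
# pairs, an episode is the list of rules fired between two rewards/penalties.

def detours(episode):
    """Yield index pairs (i, j) forming a detour: the i-th and j-th fired
    rules share the same state although the rules themselves differ."""
    for i, (si, ai) in enumerate(episode):
        for j in range(i + 1, len(episode)):
            sj, aj = episode[j]
            if si == sj and (si, ai) != (sj, aj):
                yield (i, j)

def rational_rules(episodes):
    """A rule is rational if, in some episode, it fires outside every detour."""
    rational = set()
    for ep in episodes:
        covered = set()
        for i, j in detours(ep):
            covered.update(range(i, j))   # indices lying on the detour
        for k, rule in enumerate(ep):
            if k not in covered:
                rational.add(rule)
    return rational
```

Here the episode [xa, yb, xc, zd] contains the detour from index 0 back to state x at index 2, so xa and yb lie on a detour while xc and zd are rational with respect to this episode.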
Though PARP can always learn a deterministic rational policy in the class where one exists among the penalty avoiding policies, PRJ has to memorize all rules that have been experienced and all descendant states reached by those rules in order to find all penalty rules. In applying PRJ to large-scale problems, we are confronted with a combinatorial explosion of them.

procedure Penalty Rule Judgment
begin
  Set a mark on each rule that has been given a penalty directly.
  do
    Set a mark on each state in which there is no rational rule,
      or no rule that can avoid transiting to a marked state;
    Set a mark on each rule that can transit to a marked state;
  while (there is a new mark on some state)
end.

Figure 2. The Penalty Rule Judgment algorithm (PRJ) [1]. First, we set a mark on each rule that has been given a penalty directly. Second, we set a mark on each state in which there is no rational rule, or no rule that can avoid transiting to a marked state. Last, we set a mark on each rule that can transit to a marked state. We can regard a marked rule as a penalty rule. We can find all penalty rules in the current rule set by continuing this process until there is no new mark.

To suppress the problem, especially the number of states, we introduce several ideas and heuristics into PRJ.

3. EXTENSION OF THE PENALTY AVOIDING RATIONAL POLICY MAKING ALGORITHM

3.1 The Basic Idea

Though PRJ can find all penalty rules efficiently, it has to memorize all rules that have been experienced and all descendant states reached by those rules. In applying PRJ to large-scale problems, it is important to save memory and restrict exploration. In section 3.2, we discuss how to save memory. In general, there is no free lunch to realize it. In this paper, we propose saving memory by calculating state transitions, in the class of problems where we can know the reward function and the candidates for the descendant state of each state transition.
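One way to read the marking loop of figure 2 as code is sketched below; the rule/state tables (`succ`, `rules_in`, `rational`) are hypothetical stand-ins for whatever structures an implementation actually keeps, and a state is marked here when every rational rule in it is already marked.

```python
# Illustrative sketch of the PRJ marking loop (data structures are ours):
#   direct   : rules that received a penalty directly
#   succ     : rule -> set of states the rule can transit to
#   rules_in : state -> set of rules available in that state
#   rational : set of rules judged rational so far

def prj(direct, succ, rules_in, rational):
    marked_rules = set(direct)      # penalty rules found so far
    marked_states = set()           # penalty states found so far
    changed = True
    while changed:
        changed = False
        # Mark a state if no unmarked rational rule remains in it.
        for s, rules in rules_in.items():
            if s not in marked_states and all(
                    r in marked_rules or r not in rational for r in rules):
                marked_states.add(s)
                changed = True
        # Mark a rule if it can transit to a marked (penalty) state.
        for r, states in succ.items():
            if r not in marked_rules and states & marked_states:
                marked_rules.add(r)
                changed = True
    return marked_rules, marked_states
```

On the figure 1 example, with ya directly penalized and yb irrational, the loop marks y as a penalty state and xa as a penalty rule, while x stays unmarked because xb remains a safe rational rule.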
In section 3.3, we discuss how to restrict exploration. We propose an algorithm to explore the environment using knowledge.

3.2 How to Save Memory by Calculation of State Transitions

In this paper, we treat the class of problems where we can know the reward function and the candidates for the descendant state of each state transition. When the agent selects
an action a ∈ A_t in state s_t at time t, we can know the candidates for the state s_{t+1} at time t+1 and its immediate reward or penalty. This is a natural assumption in two-player games such as Othello, igo, shougi, backgammon, and so on. We show the extensions of PRJ for this situation. Before selecting an action, the agent finds all penalty rules in the current rule set by calculating all states that can be reached from the current state. After selecting an action, if the agent gets a new penalty, it tries to find new penalty rules again. We use a long term and a short term memory to realize this.

long term memory: If there are new penalty rules and states in short term memory, they are memorized in long term memory. They are held throughout learning.

short term memory: Short term memory memorizes all states and actions in the current episode. After calculating all states and rules that can be reached from the current state, they are memorized in short term memory. If those states exist in long term memory, new penalty rules are found by PRJ. If there are new penalty rules and states, they are memorized in long term memory. Short term memory is initialized at each episode.

Therefore, the agent can find all new penalty rules from penalty rules alone; it does not need to memorize the descendant states of state transitions during action selection. The state transition and reward functions given by the environment are not necessarily correct. The method is not confused by incomplete information, i.e., when some penalty or state that should exist is not given to the agent. However, it is confused by incredible information, i.e., when some penalty or state that should not exist is given to the agent.

3.3 How to Restrict Exploration by Knowledge

In applying PRJ to large-scale problems, we need many trials to spread a penalty rule. This is an especially serious problem with long episodes. We introduce how to design a semi-penalty, a broadened notion of a penalty, by knowledge.
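The interplay of the two memories of section 3.2 might look like the following greatly simplified sketch; the class and method names are ours, and only the promotion of newly discovered penalty rules to long term memory is shown.

```python
# Hypothetical, simplified sketch of the long/short term memory scheme.

class PenaltyMemory:
    def __init__(self):
        self.penalty_rules = set()    # long term: kept throughout learning
        self.penalty_states = set()   # long term: kept throughout learning
        self.short_term = []          # (state, action) pairs of this episode

    def start_episode(self):
        self.short_term.clear()       # short term memory is per-episode

    def observe(self, state, action, reachable_states):
        """Record the fired rule and check the states reachable from it;
        reachable states are computable in games like Othello."""
        self.short_term.append((state, action))
        # If a reachable state is already known as a penalty state,
        # the fired rule is a new penalty rule: promote it to long term.
        if reachable_states & self.penalty_states:
            self.penalty_rules.add((state, action))

    def on_penalty(self):
        """A new penalty arrived; here we only record the last fired rule
        as directly penalized (a full version would rerun PRJ)."""
        if self.short_term:
            self.penalty_rules.add(self.short_term[-1])
```

The point of the design is that long term memory holds only penalty rules and penalty states, so the agent need not store every descendant state it has ever computed.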
A semi-penalty means that an action or a state may cause a penalty. After finding penalty rules by PRJ, we use PRJ again to find semi-penalty rules. We call a rule a semi-penalty rule if and only if it has a penalty or a semi-penalty, or it can transit to a penalty or semi-penalty state, that is, a state in which there are no rules other than semi-penalty, penalty, or irrational rules. Since a semi-penalty does not always cause a penalty, it is possible that all states become semi-penalty states even if there is a penalty avoiding rational policy. This problem can be overcome by the action selector. Usually, we should select a rational rule that is neither a penalty rule nor a semi-penalty rule. If we cannot select any such rule in a semi-penalty state, we should select a rational rule that is not a penalty rule. However, if we define the semi-penalty incorrectly, we need more trials to find penalty rules than the original version of PRJ does, since exploration is biased.

4. APPLICATION TO THE OTHELLO GAME

4.1 The Basic Idea

We implemented the proposed method as a learning system for an Othello game player. We use KITTY by Igor Durdavic as the opponent player. It is nearly the strongest program among open source players. We use kitty.ios in KITTY's source code [8], which has an interface to the Internet Othello Server (IOS). We do not give KITTY a learning mechanism; therefore, KITTY's action selection probability is stable. The search depth is set to 4 (the minimum value) or 60 (the maximum value).

4.2 Construction of the Reinforcement Learning Player

Specification: We describe our RL player for the Othello game (see figure 3). It gets the state of the Othello board from IOS and can calculate the candidate actions from that state. It selects an action from them and returns it to IOS. If it cannot take any action, it returns the PASS action to IOS. If it loses the game, it gets a penalty from IOS. Furthermore, we have another experiment where it gets a penalty from IOS whenever it cannot win the game. We set the size of short term memory so that it is enough to store at least one-step state transitions.
It can calculate two or three steps of state transitions in the first stage and one step in the middle stage. Remark that there is no irrational rule in the Othello game.

Knowledge of the Othello Game: It is important to restrict exploration from the first to the middle stage, since the state space in the middle stage is huge. We use knowledge to realize this. We can use the following two types of knowledge. One is a KIFU database, which records the moves of previous famous games. The other is an evaluation function, which evaluates the state of the game.

i. KIFU database: We use NEC's KIFU database [9]. It contains about
100,000 games. We can get typical state transitions of the first stage from the KIFU database. It may help avoid wasteful exploration in the first stage.

Figure 3. The Experimental Environment: the RL player's learning system (long term memory, short term memory, KIFU database, and the action selector) exchanges sensory inputs, penalties, KITTY's evaluation values, and actions with KITTY through IOS.

ii. Evaluation Function: We use KITTY's evaluation function, whose value KITTY sends to IOS, as our RL player's evaluation function. KITTY returns a value to IOS as the evaluation value of a state. We define a semi-penalty state as a state whose evaluation value is larger than +1. If our RL player can always win as the first player (black), we can regard our method as better than KITTY, since the winner of a KITTY vs. KITTY game is always the second player (white).

How to Select an Action: We can use the following information in action selection: a penalty rule (or a penalty state), a semi-penalty rule (or a semi-penalty state), and the KIFU database. The priority of this information is: a penalty rule (state) > a semi-penalty rule (state) > the KIFU database. Based on this priority, we use the action selector in figure 4. Its basic strategy is to select the action whose number of transition states is the least among all actions. This helps restrict wasteful exploration.

Figure 4. The Action Selector: getting state s from IOS, it matches s against short term memory and the KIFU database, then branches on whether s is a penalty state, a semi-penalty state, or on the KIFU database; it correspondingly suppresses all penalty rules and selects an action by the basic strategy, suppresses all penalty and semi-penalty rules and selects an action by the basic strategy, or suppresses all penalty and semi-penalty rules and selects the most frequently used rule.

If the total number of black and white cells is larger than 54, our RL player calculates the game exactly to the end.
On the other hand, KITTY calculates it when the number is larger than 50, since it can use min-max search with its evaluation function.

4.3 Results

We show the results of the games in table 1, under the condition that KITTY does not use its library.

Table 1. The number of games needed to get a penalty avoiding rational policy: our RL player (black) vs. KITTY (white), for each penalty condition (lost / lost or even) and each search depth.

We can confirm the effectiveness of our method from this table. If KITTY does not use its library, it cannot choose among several actions; therefore, our RL player always wins after acquiring a penalty avoiding rational policy. If KITTY can use its library, it can choose among several actions; in this case, our RL player has to learn several penalty avoiding rational policies. The number of games to get a penalty avoiding rational policy is about 2000 in the case where KITTY uses its library and the depth is set to 4. In figure 5, we show a sample sequence of acquiring a penalty avoiding rational policy under the latter condition of table 1. A penalty avoiding rational policy is made of the set of all hatched states. We can use the KIFU database before 16 cells. If we use the original version of PRJ, the frontier of penalty rules is at 34 cells after 2000 games. On the other hand, in figure 5, we can use semi-penalty rules at 18 cells, from 26 to 32 cells, and beyond 36 cells after 949 games. It means that we can overcome the slow spread of penalty rules by semi-penalty rules.

Figure 5. A sample sequence of acquiring a penalty avoiding rational policy (game number vs. cell number). N(n): not a penalty or semi-penalty state; S: a semi-penalty state; P(p): a penalty state; lowercase n and p mean that our RL player calculates the game exactly to the end. The numbers before and after N(n) are the numbers of penalty rules and semi-penalty rules, respectively.

5. CONCLUSION

In this paper, we extended the Penalty Avoiding Rational Policy Making algorithm [1] to large-scale MDPs. We implemented our method as a learning system for an Othello game player. Our RL player can always defeat the well-known Othello program KITTY after learning. In future work, we will compare our method with a KITTY equipped with a learning mechanism. Furthermore, we will extend our method to Partially Observable Markov Decision Processes [2] and multiagent systems [3].

References

[1] Miyazaki, K. & Kobayashi, S.: Reinforcement Learning for Penalty Avoiding Policy Making, IEEE International Conference on Systems, Man and Cybernetics.
[2] Miyazaki, K. & Kobayashi, S.: On the Rationality of Profit Sharing in Partially Observable Markov Decision Processes, 5th International Conference on Information Systems Analysis and Synthesis (1999).
[3] Miyazaki, K., Arai, S. & Kobayashi, S.: Cranes Control Using Multi-agent Profit Sharing, 6th International Conference on Information Systems Analysis and Synthesis, Vol. IX (2000).
[4] Sutton, R. S.: Learning to Predict by the Method of Temporal Differences, Machine Learning, Vol. 3, pp. 9-44.
[5] Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press.
[6] Watkins, C. J. C. H. & Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 55-68.
[7] learn-game/systems/kitty.html
[8] ftp://ftp.nj.nec.com/pub/igord/othello/kitty/linux kitty.tgz
[9] ftp://ftp.nj.nec.com/pub/igord/othello/misc/database.zip
Proceedings of 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems September 28 - October 2, 2004, Sendai, Japan Flexible Cooperation between Human and Robot by interpreting Human
More informationAdversarial Search and Game Playing
Games Adversarial Search and Game Playing Russell and Norvig, 3 rd edition, Ch. 5 Games: multi-agent environment q What do other agents do and how do they affect our success? q Cooperative vs. competitive
More informationMove Evaluation Tree System
Move Evaluation Tree System Hiroto Yoshii hiroto-yoshii@mrj.biglobe.ne.jp Abstract This paper discloses a system that evaluates moves in Go. The system Move Evaluation Tree System (METS) introduces a tree
More informationa b c d e f g h 1 a b c d e f g h C A B B A C C X X C C X X C C A B B A C Diagram 1-2 Square names
Chapter Rules and notation Diagram - shows the standard notation for Othello. The columns are labeled a through h from left to right, and the rows are labeled through from top to bottom. In this book,
More informationUMBC 671 Midterm Exam 19 October 2009
Name: 0 1 2 3 4 5 6 total 0 20 25 30 30 25 20 150 UMBC 671 Midterm Exam 19 October 2009 Write all of your answers on this exam, which is closed book and consists of six problems, summing to 160 points.
More informationPlaying CHIP-8 Games with Reinforcement Learning
Playing CHIP-8 Games with Reinforcement Learning Niven Achenjang, Patrick DeMichele, Sam Rogers Stanford University Abstract We begin with some background in the history of CHIP-8 games and the use of
More informationCS221 Project Final Report Gomoku Game Agent
CS221 Project Final Report Gomoku Game Agent Qiao Tan qtan@stanford.edu Xiaoti Hu xiaotihu@stanford.edu 1 Introduction Gomoku, also know as five-in-a-row, is a strategy board game which is traditionally
More informationCMU-Q Lecture 20:
CMU-Q 15-381 Lecture 20: Game Theory I Teacher: Gianni A. Di Caro ICE-CREAM WARS http://youtu.be/jilgxenbk_8 2 GAME THEORY Game theory is the formal study of conflict and cooperation in (rational) multi-agent
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 4: Adversarial Search 10/12/2009 Luke Zettlemoyer Based on slides from Dan Klein Many slides over the course adapted from either Stuart Russell or Andrew
More informationCS 188: Artificial Intelligence Spring 2007
CS 188: Artificial Intelligence Spring 2007 Lecture 7: CSP-II and Adversarial Search 2/6/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or
More informationBootstrapping from Game Tree Search
Joel Veness David Silver Will Uther Alan Blair University of New South Wales NICTA University of Alberta December 9, 2009 Presentation Overview Introduction Overview Game Tree Search Evaluation Functions
More informationGame Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game?
CSC384: Introduction to Artificial Intelligence Generalizing Search Problem Game Tree Search Chapter 5.1, 5.2, 5.3, 5.6 cover some of the material we cover here. Section 5.6 has an interesting overview
More informationDIT411/TIN175, Artificial Intelligence. Peter Ljunglöf. 2 February, 2018
DIT411/TIN175, Artificial Intelligence Chapters 4 5: Non-classical and adversarial search CHAPTERS 4 5: NON-CLASSICAL AND ADVERSARIAL SEARCH DIT411/TIN175, Artificial Intelligence Peter Ljunglöf 2 February,
More informationReinforcement Learning and its Application to Othello
Reinforcement Learning and its Application to Othello Nees Jan van Eck, Michiel van Wezel Econometric Institute, Faculty of Economics, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, The
More informationHIT3002: Introduction to Artificial Intelligence
HIT3002: Introduction to Artificial Intelligence Intelligent Agents Outline Agents and environments. The vacuum-cleaner world The concept of rational behavior. Environments. Agent structure. Swinburne
More informationRobustness against Longer Memory Strategies in Evolutionary Games.
Robustness against Longer Memory Strategies in Evolutionary Games. Eizo Akiyama 1 Players as finite state automata In our daily life, we have to make our decisions with our restricted abilities (bounded
More informationAdversarial Search: Game Playing. Reading: Chapter
Adversarial Search: Game Playing Reading: Chapter 6.5-6.8 1 Games and AI Easy to represent, abstract, precise rules One of the first tasks undertaken by AI (since 1950) Better than humans in Othello and
More informationGame-playing AIs: Games and Adversarial Search I AIMA
Game-playing AIs: Games and Adversarial Search I AIMA 5.1-5.2 Games: Outline of Unit Part I: Games as Search Motivation Game-playing AI successes Game Trees Evaluation Functions Part II: Adversarial Search
More informationOutline. Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types
Intelligent Agents Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types Agents An agent is anything that can be viewed as
More informationAdversarial Search. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 9 Feb 2012
1 Hal Daumé III (me@hal3.name) Adversarial Search Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 9 Feb 2012 Many slides courtesy of Dan
More informationAnnouncements. CS 188: Artificial Intelligence Spring Game Playing State-of-the-Art. Overview. Game Playing. GamesCrafters
CS 188: Artificial Intelligence Spring 2011 Announcements W1 out and due Monday 4:59pm P2 out and due next week Friday 4:59pm Lecture 7: Mini and Alpha-Beta Search 2/9/2011 Pieter Abbeel UC Berkeley Many
More informationCS 188: Artificial Intelligence Spring Announcements
CS 188: Artificial Intelligence Spring 2011 Lecture 7: Minimax and Alpha-Beta Search 2/9/2011 Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein 1 Announcements W1 out and due Monday 4:59pm P2
More informationCS-E4800 Artificial Intelligence
CS-E4800 Artificial Intelligence Jussi Rintanen Department of Computer Science Aalto University March 9, 2017 Difficulties in Rational Collective Behavior Individual utility in conflict with collective
More informationROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT
ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT PATRICK HALUPTZOK, XU MIAO Abstract. In this paper the development of a robot controller for Robocode is discussed.
More informationGameplay. Topics in Game Development UNM Spring 2008 ECE 495/595; CS 491/591
Gameplay Topics in Game Development UNM Spring 2008 ECE 495/595; CS 491/591 What is Gameplay? Very general definition: It is what makes a game FUN And it is how players play a game. Taking one step back:
More informationContents. List of Figures
1 Contents 1 Introduction....................................... 3 1.1 Rules of the game............................... 3 1.2 Complexity of the game............................ 4 1.3 History of self-learning
More informationCITS3001. Algorithms, Agents and Artificial Intelligence. Semester 2, 2016 Tim French
CITS3001 Algorithms, Agents and Artificial Intelligence Semester 2, 2016 Tim French School of Computer Science & Software Eng. The University of Western Australia 8. Game-playing AIMA, Ch. 5 Objectives
More informationHierarchical Controller for Robotic Soccer
Hierarchical Controller for Robotic Soccer Byron Knoll Cognitive Systems 402 April 13, 2008 ABSTRACT RoboCup is an initiative aimed at advancing Artificial Intelligence (AI) and robotics research. This
More informationAdversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5
Adversarial Search CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Outline Game
More information4. Games and search. Lecture Artificial Intelligence (4ov / 8op)
4. Games and search 4.1 Search problems State space search find a (shortest) path from the initial state to the goal state. Constraint satisfaction find a value assignment to a set of variables so that
More informationMonte Carlo Tree Search and AlphaGo. Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar
Monte Carlo Tree Search and AlphaGo Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar Zero-Sum Games and AI A player s utility gain or loss is exactly balanced by the combined gain or loss of opponents:
More informationWhat is a Z-Code Almanac?
ZcodeSystem.com Presents Guide v.2.1. The Almanac Beta is updated in real time. All future updates are included in your membership What is a Z-Code Almanac? Today we are really excited to share our progress
More information2. The Extensive Form of a Game
2. The Extensive Form of a Game In the extensive form, games are sequential, interactive processes which moves from one position to another in response to the wills of the players or the whims of chance.
More informationUMBC CMSC 671 Midterm Exam 22 October 2012
Your name: 1 2 3 4 5 6 7 8 total 20 40 35 40 30 10 15 10 200 UMBC CMSC 671 Midterm Exam 22 October 2012 Write all of your answers on this exam, which is closed book and consists of six problems, summing
More informationSEARCHING is both a method of solving problems and
100 IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 3, NO. 2, JUNE 2011 Two-Stage Monte Carlo Tree Search for Connect6 Shi-Jim Yen, Member, IEEE, and Jung-Kuei Yang Abstract Recently,
More informationCS325 Artificial Intelligence Ch. 5, Games!
CS325 Artificial Intelligence Ch. 5, Games! Cengiz Günay, Emory Univ. vs. Spring 2013 Günay Ch. 5, Games! Spring 2013 1 / 19 AI in Games A lot of work is done on it. Why? Günay Ch. 5, Games! Spring 2013
More informationGames CSE 473. Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie!
Games CSE 473 Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie! Games in AI In AI, games usually refers to deteristic, turntaking, two-player, zero-sum games of perfect information Deteristic:
More informationInformatics 2D: Tutorial 1 (Solutions)
Informatics 2D: Tutorial 1 (Solutions) Agents, Environment, Search Week 2 1 Agents and Environments Consider the following agents: A robot vacuum cleaner which follows a pre-set route around a house and
More informationModule 3. Problem Solving using Search- (Two agent) Version 2 CSE IIT, Kharagpur
Module 3 Problem Solving using Search- (Two agent) 3.1 Instructional Objective The students should understand the formulation of multi-agent search and in detail two-agent search. Students should b familiar
More informationCSE-571 AI-based Mobile Robotics
CSE-571 AI-based Mobile Robotics Approximation of POMDPs: Active Localization Localization so far: passive integration of sensor information Active Sensing and Reinforcement Learning 19 m 26.5 m Active
More informationDynamic Programming in Real Life: A Two-Person Dice Game
Mathematical Methods in Operations Research 2005 Special issue in honor of Arie Hordijk Dynamic Programming in Real Life: A Two-Person Dice Game Henk Tijms 1, Jan van der Wal 2 1 Department of Econometrics,
More informationDetecticon: A Prototype Inquiry Dialog System
Detecticon: A Prototype Inquiry Dialog System Takuya Hiraoka and Shota Motoura and Kunihiko Sadamasa Abstract A prototype inquiry dialog system, dubbed Detecticon, demonstrates its ability to handle inquiry
More informationCandyCrush.ai: An AI Agent for Candy Crush
CandyCrush.ai: An AI Agent for Candy Crush Jiwoo Lee, Niranjan Balachandar, Karan Singhal December 16, 2016 1 Introduction Candy Crush, a mobile puzzle game, has become very popular in the past few years.
More informationDeveloping Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function
Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Davis Ancona and Jake Weiner Abstract In this report, we examine the plausibility of implementing a NEAT-based solution
More informationBLUFF WITH AI. A Project. Presented to. The Faculty of the Department of Computer Science. San Jose State University. In Partial Fulfillment
BLUFF WITH AI A Project Presented to The Faculty of the Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements for the Degree Master of Science By Tina Philip
More informationFoundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel
Foundations of AI 6. Adversarial Search Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard & Bernhard Nebel Contents Game Theory Board Games Minimax Search Alpha-Beta Search
More informationHow AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997)
How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) Alan Fern School of Electrical Engineering and Computer Science Oregon State University Deep Mind s vs. Lee Sedol (2016) Watson vs. Ken
More informationLearning to play Dominoes
Learning to play Dominoes Ivan de Jesus P. Pinto 1, Mateus R. Pereira 1, Luciano Reis Coutinho 1 1 Departamento de Informática Universidade Federal do Maranhão São Luís,MA Brazil navi1921@gmail.com, mateus.rp.slz@gmail.com,
More information2048: An Autonomous Solver
2048: An Autonomous Solver Final Project in Introduction to Artificial Intelligence ABSTRACT. Our goal in this project was to create an automatic solver for the wellknown game 2048 and to analyze how different
More information