Reinforcement Learning for Penalty Avoiding Policy Making and its Extensions and an Application to the Othello Game
Kazuteru Miyazaki
National Institution for Academic Degrees, Ootsuka, Bunkyo-ku, Tokyo, Japan

Sougo Tsuboi
TOSHIBA, 1 Toshiba Komukai Saiwai, Kawasaki, Japan

Shigenobu Kobayashi
kobayasi@dis.titech.ac.jp
Tokyo Institute of Technology, 4259 Nagatsuta, Midori, Yokohama, Japan

ABSTRACT

The purpose of a reinforcement learning system is, in general, to learn an optimal policy. From the engineering point of view, however, it is useful and important to acquire not only optimal policies but also penalty avoiding policies. In this paper, we focus on the formation of penalty avoiding policies based on the Penalty Avoiding Rational Policy Making algorithm [1]. In applying the algorithm to large-scale problems, we are confronted with combinatorial explosion. To suppress the problem, especially the number of states, we introduce several ideas and heuristics. We implemented the proposed method as a learning system for an Othello game player. After learning, this player can always defeat the well-known Othello program KITTY [7].

Keywords: reinforcement learning, reward and penalty, penalty avoiding rational policy, the Othello game, KITTY

1. INTRODUCTION

Reinforcement learning (RL) is a kind of machine learning. It aims to adapt an agent to a given environment with rewards as its only clue. If we tell the agent what it should do (its purpose) and what it should not do (its restriction), it can learn how to satisfy them. In RL, how to design rewards is important. In most recent RL systems [5], a positive reward, called simply a reward, is given to the agent when it has achieved a purpose, and a negative one, called a penalty, is given when it has violated a restriction. However, if we set incorrect values for them, the agent will learn unexpected behavior.
For example, in a two-player game such as Othello, consider the case where a reward is given to the winner and a penalty is given to the loser. If we design incorrect values for them, the agent may lose the game even if a winning strategy exists. This is because the reward and the penalty are treated in the same dimension. It is therefore important to distinguish a reward (for achievement of a purpose) from a penalty (for violation of a restriction). We know the Penalty Avoiding Rational Policy Making algorithm [1] as a reinforcement learning system that makes this distinction between a reward and a penalty. Though it can suppress any penalty as stably as possible and can get a reward constantly, it has to memorize many state-action pairs, as Q-learning [6] and TD(λ) [4] do. In this paper, we discuss extensions of the Penalty Avoiding Rational Policy Making algorithm in the class of problems where we have some information about the target environment. We introduce several ideas and heuristics to suppress the combinatorial explosion in large-scale problems. Furthermore, we implemented the proposed method as a learning system for an Othello game player. Section 2 describes the problem, the method, notations, and the Penalty Avoiding Rational Policy Making algorithm. Section 3 describes extensions of the algorithm. Section 4 applies it to the Othello game. Section 5 concludes.

2. THE DOMAIN

2.1 Target Environments

Consider an agent in some unknown environment. At each time step, the agent gets information about the environment through its sensors and chooses an action. As a result of some sequence of actions, the agent gets a reward or a penalty from the environment. We assume that target environments are Markov Decision Processes (MDPs). A pair of a sensory input (a state) and an action is called a rule. We denote the rule "if x then a" as xa, where x is a state and a is an action.
Figure 1. An example of penalty rules (xa, ya) and a penalty state (y).

The function that maps states to actions is called a policy. We call a policy rational if and only if its expected reward per action is larger than zero. The function that maps a state (or a rule) to a reward (or a penalty) is a reward function. We call the sequence of rules used between the previous reward (or penalty) and the current one an episode. We call a subsequence of an episode a detour when the state of its first firing rule and the state of its last firing rule are the same, though the two rules themselves are different. A rule that does not exist on a detour in some episode is rational; otherwise, the rule is called irrational. We call a rule a penalty rule if and only if it has a penalty or it can transit to a penalty state, that is, a state in which there are no rules other than penalty or irrational rules. For example, in figure 1, xa and ya are penalty rules, and state y is a penalty state. We call a policy that cannot have any penalty rule a penalty avoiding policy. We assume that there is a deterministic rational policy among the penalty avoiding policies. For each sensory input, a deterministic policy always returns the same action, while a stochastic policy returns an action stochastically.

2.2 The Penalty Avoiding Rational Policy Making algorithm [1]

We know the Penalty Avoiding Rational Policy Making algorithm (PARP) [1] as a reinforcement learning system for the environments discussed in section 2.1. To avoid all penalties, PARP suppresses all penalty rules in the current rule set by the Penalty Rule Judgment algorithm (PRJ) in figure 2. After suppressing all penalty rules, it makes a rational policy by the Rational Policy Improvement algorithm [1]. Though PARP can learn a stochastic rational policy in the class where there is no deterministic rational policy among the penalty avoiding policies, we do not treat stochastic rational policies here.
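To make the detour notion above concrete, the following is a rough sketch in our own notation (the representation of episodes as lists of state-action pairs is our illustrative choice, not the authors' implementation): an episode is a list of fired rules, and a rule is rational if it fires outside every detour of some episode.

```python
# Illustrative sketch (hypothetical representation): rules are (state, action)
# pairs, an episode is the list of rules fired between two rewards/penalties.

def detours(episode):
    """Yield index pairs (i, j) forming a detour: the i-th and j-th fired
    rules share the same state although the rules themselves differ."""
    for i, (si, ai) in enumerate(episode):
        for j in range(i + 1, len(episode)):
            sj, aj = episode[j]
            if si == sj and (si, ai) != (sj, aj):
                yield (i, j)

def rational_rules(episodes):
    """A rule is rational if, in some episode, it fires outside every detour."""
    rational = set()
    for ep in episodes:
        covered = set()
        for i, j in detours(ep):
            covered.update(range(i, j))   # indices lying on the detour
        for k, rule in enumerate(ep):
            if k not in covered:
                rational.add(rule)
    return rational
```

Here the episode [xa, yb, xc, zd] contains the detour from index 0 back to state x at index 2, so xa and yb lie on a detour while xc and zd are rational with respect to this episode.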
Though PARP can always learn a deterministic rational policy in the class where one exists among the penalty avoiding policies, PRJ has to memorize all rules that have been experienced and all descendant states reached by those rules in order to find all penalty rules. In applying PRJ to large-scale problems, we are confronted with a combinatorial explosion of them.

procedure Penalty Rule Judgment
begin
  Set a mark on each rule that has been given a penalty directly.
  do
    Set a mark on each state in which there is no rational rule,
      or no rule that can avoid transiting to a marked state;
    Set a mark on each rule that can transit to a marked state;
  while (there is a new mark on some state)
end.

Figure 2. The Penalty Rule Judgment algorithm (PRJ) [1]. First, we set a mark on each rule that has been given a penalty directly. Second, we set a mark on each state in which there is no rational rule, or no rule that can avoid transiting to a marked state. Last, we set a mark on each rule that can transit to a marked state. We can regard a marked rule as a penalty rule. We can find all penalty rules in the current rule set by continuing this process until there is no new mark.

To suppress the problem, especially the number of states, we introduce several ideas and heuristics into PRJ.

3. EXTENSION OF THE PENALTY AVOIDING RATIONAL POLICY MAKING ALGORITHM

3.1 The Basic Idea

Though PRJ can find all penalty rules efficiently, it has to memorize all rules that have been experienced and all descendant states reached by those rules. In applying PRJ to large-scale problems, it is important to save memory and restrict exploration. In section 3.2, we discuss how to save memory. In general, there is no free lunch to realize it. In this paper, we propose saving memory by calculating state transitions, in the class of problems where we can know the reward function and the candidates for the descendant state of each state transition.
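One way to read the marking loop of figure 2 as code is sketched below; the rule/state tables (`succ`, `rules_in`, `rational`) are hypothetical stand-ins for whatever structures an implementation actually keeps, and a state is marked here when every rational rule in it is already marked.

```python
# Illustrative sketch of the PRJ marking loop (data structures are ours):
#   direct   : rules that received a penalty directly
#   succ     : rule -> set of states the rule can transit to
#   rules_in : state -> set of rules available in that state
#   rational : set of rules judged rational so far

def prj(direct, succ, rules_in, rational):
    marked_rules = set(direct)      # penalty rules found so far
    marked_states = set()           # penalty states found so far
    changed = True
    while changed:
        changed = False
        # Mark a state if no unmarked rational rule remains in it.
        for s, rules in rules_in.items():
            if s not in marked_states and all(
                    r in marked_rules or r not in rational for r in rules):
                marked_states.add(s)
                changed = True
        # Mark a rule if it can transit to a marked (penalty) state.
        for r, states in succ.items():
            if r not in marked_rules and states & marked_states:
                marked_rules.add(r)
                changed = True
    return marked_rules, marked_states
```

On the figure 1 example, with ya directly penalized and yb irrational, the loop marks y as a penalty state and xa as a penalty rule, while x stays unmarked because xb remains a safe rational rule.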
In section 3.3, we discuss how to restrict exploration. We propose an algorithm to explore the environment using knowledge.

3.2 How to Save Memory by Calculation of State Transitions

In this paper, we treat the class of problems where we can know the reward function and the candidates for the descendant state of each state transition. When the agent selects
an action a ∈ A_t in state s_t at time t, we can know the candidates for the state s_{t+1} at time t+1 and its immediate reward or penalty. This is a natural assumption in two-player games such as Othello, igo, shougi, backgammon, and so on. We show the extensions of PRJ for this situation. Before selecting an action, the agent finds all penalty rules in the current rule set by calculating all states that can be reached from the current state. After selecting an action, if the agent gets a new penalty, it tries to find new penalty rules again. We use a long term and a short term memory to realize this.

long term memory: If there are new penalty rules and states in short term memory, they are memorized in long term memory. They are held throughout learning.

short term memory: Short term memory memorizes all states and actions in the current episode. After calculating all states and rules that can be reached from the current state, they are memorized in short term memory. If those states exist in long term memory, new penalty rules are found by PRJ. If there are new penalty rules and states, they are memorized in long term memory. Short term memory is initialized at each episode.

Therefore, the agent can find all new penalty rules from penalty rules alone; it does not need to memorize the descendant states of state transitions during action selection. The state transition and reward functions given by the environment are not necessarily correct. The method is not confused by incomplete information, i.e., when some penalty or state that should exist is not given to the agent. However, it is confused by incredible information, i.e., when some penalty or state that should not exist is given to the agent.

3.3 How to Restrict Exploration by Knowledge

In applying PRJ to large-scale problems, we need many trials to spread a penalty rule. This is an especially serious problem with long episodes. We introduce how to design a semi-penalty, a broadened notion of a penalty, by knowledge.
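The interplay of the two memories of section 3.2 might look like the following greatly simplified sketch; the class and method names are ours, and only the promotion of newly discovered penalty rules to long term memory is shown.

```python
# Hypothetical, simplified sketch of the long/short term memory scheme.

class PenaltyMemory:
    def __init__(self):
        self.penalty_rules = set()    # long term: kept throughout learning
        self.penalty_states = set()   # long term: kept throughout learning
        self.short_term = []          # (state, action) pairs of this episode

    def start_episode(self):
        self.short_term.clear()       # short term memory is per-episode

    def observe(self, state, action, reachable_states):
        """Record the fired rule and check the states reachable from it;
        reachable states are computable in games like Othello."""
        self.short_term.append((state, action))
        # If a reachable state is already known as a penalty state,
        # the fired rule is a new penalty rule: promote it to long term.
        if reachable_states & self.penalty_states:
            self.penalty_rules.add((state, action))

    def on_penalty(self):
        """A new penalty arrived; here we only record the last fired rule
        as directly penalized (a full version would rerun PRJ)."""
        if self.short_term:
            self.penalty_rules.add(self.short_term[-1])
```

The point of the design is that long term memory holds only penalty rules and penalty states, so the agent need not store every descendant state it has ever computed.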
A semi-penalty means that an action or a state may cause a penalty. After finding penalty rules by PRJ, we use PRJ again to find semi-penalty rules. We call a rule a semi-penalty rule if and only if it has a penalty or a semi-penalty, or it can transit to a penalty or semi-penalty state, that is, a state in which there are no rules other than semi-penalty, penalty, or irrational rules. Since a semi-penalty does not always cause a penalty, it is possible that all states become semi-penalty states even if there is a penalty avoiding rational policy. This problem can be overcome by the action selector. Usually, we should select a rational rule that is neither a penalty rule nor a semi-penalty rule. If we cannot select any such rule in a semi-penalty state, we should select a rational rule that is not a penalty rule. However, if we define the semi-penalty incorrectly, we need more trials to find penalty rules than the original version of PRJ does, since exploration is biased.

4. APPLICATION TO THE OTHELLO GAME

4.1 The Basic Idea

We implemented the proposed method as a learning system for an Othello game player. We use KITTY by Igor Durdavic as the opponent player. It is nearly the strongest program among open source players. We use kitty.ios in KITTY's source code [8], which has an interface to the Internet Othello Server (IOS). We do not give KITTY a learning mechanism; therefore, KITTY's action selection probability is stable. The search depth is set to 4 (the minimum value) or 60 (the maximum value).

4.2 Construction of the Reinforcement Learning Player

Specification: We describe our RL player for the Othello game (see figure 3). It gets the state of the Othello board from IOS and can calculate the candidate actions from that state. It selects an action from them and returns it to IOS. If it cannot take any action, it returns the PASS action to IOS. If it loses the game, it gets a penalty from IOS. Furthermore, we have another experiment where it gets a penalty from IOS whenever it cannot win the game. We set the size of short term memory so that it is enough to store at least one-step state transitions.
It can calculate two or three steps of state transitions in the first stage and one step in the middle stage. Remark that there is no irrational rule in the Othello game.

Knowledge of the Othello Game: It is important to restrict exploration from the first to the middle stage, since the state space in the middle stage is huge. We use knowledge to realize this. We can use the following two types of knowledge. One is a KIFU database, which records the moves of previous famous games. The other is an evaluation function, which evaluates the state of the game.

i. KIFU database: We use NEC's KIFU database [9]. It contains about
100,000 games. We can get typical state transitions of the first stage from the KIFU database. It may help avoid wasteful exploration in the first stage.

Figure 3. The Experimental Environment: the RL player's learning system (long term memory, short term memory, KIFU database, and the action selector) exchanges sensory inputs, penalties, KITTY's evaluation values, and actions with KITTY through IOS.

ii. Evaluation Function: We use KITTY's evaluation function, whose value KITTY sends to IOS, as our RL player's evaluation function. KITTY returns a value to IOS as the evaluation value of a state. We define a semi-penalty state as a state whose evaluation value is larger than +1. If our RL player can always win as the first player (black), we can regard our method as better than KITTY, since the winner of a KITTY vs. KITTY game is always the second player (white).

How to Select an Action: We can use the following information in action selection: a penalty rule (or a penalty state), a semi-penalty rule (or a semi-penalty state), and the KIFU database. The priority of this information is: a penalty rule (state) > a semi-penalty rule (state) > the KIFU database. Based on this priority, we use the action selector in figure 4. Its basic strategy is to select the action whose number of transition states is the least among all actions. This helps restrict wasteful exploration.

Figure 4. The Action Selector: getting state s from IOS, it matches s against short term memory and the KIFU database, then branches on whether s is a penalty state, a semi-penalty state, or on the KIFU database; it correspondingly suppresses all penalty rules and selects an action by the basic strategy, suppresses all penalty and semi-penalty rules and selects an action by the basic strategy, or suppresses all penalty and semi-penalty rules and selects the most frequently used rule.

If the total number of black and white cells is larger than 54, our RL player calculates the game exactly to the end.
On the other hand, KITTY calculates it when the number is larger than 50, since it can use min-max search with its evaluation function.

4.3 Results

We show the results of the games in table 1, under the condition that KITTY does not use its library.

Table 1. The number of games needed to get a penalty avoiding rational policy: our RL player (black) vs. KITTY (white), for each penalty condition (lost / lost or even) and each search depth.

We can confirm the effectiveness of our method from this table. If KITTY does not use its library, it cannot choose among several actions; therefore, our RL player always wins after acquiring a penalty avoiding rational policy. If KITTY can use its library, it can choose among several actions; in this case, our RL player has to learn several penalty avoiding rational policies. The number of games to get a penalty avoiding rational policy is about 2000 in the case where KITTY uses its library and the depth is set to 4. In figure 5, we show a sample sequence of acquiring a penalty avoiding rational policy under the latter condition of table 1. A penalty avoiding rational policy is made of the set of all hatched states. We can use the KIFU database before 16 cells. If we use the original version of PRJ, the frontier of penalty rules is at 34 cells after 2000 games. On the other hand, in figure 5, we can use semi-penalty rules at 18 cells, from 26 to 32 cells, and beyond 36 cells after 949 games. It means that we can overcome the slow spread of penalty rules by semi-penalty rules.

Figure 5. A sample sequence of acquiring a penalty avoiding rational policy (game number vs. cell number). N(n): not a penalty or semi-penalty state; S: a semi-penalty state; P(p): a penalty state; lowercase n and p mean that our RL player calculates the game exactly to the end. The numbers before and after N(n) are the numbers of penalty rules and semi-penalty rules, respectively.

5. CONCLUSION

In this paper, we extended the Penalty Avoiding Rational Policy Making algorithm [1] to large-scale MDPs. We implemented our method as a learning system for an Othello game player. Our RL player can always defeat the well-known Othello program KITTY after learning. In future work, we will compare our method with a KITTY equipped with a learning mechanism. Furthermore, we will extend our method to Partially Observable Markov Decision Processes [2] and multiagent systems [3].

References

[1] Miyazaki, K. & Kobayashi, S.: Reinforcement Learning for Penalty Avoiding Policy Making, IEEE International Conference on Systems, Man and Cybernetics.
[2] Miyazaki, K. & Kobayashi, S.: On the Rationality of Profit Sharing in Partially Observable Markov Decision Processes, 5th International Conference on Information Systems Analysis and Synthesis (1999).
[3] Miyazaki, K., Arai, S. & Kobayashi, S.: Cranes Control Using Multi-agent Profit Sharing, 6th International Conference on Information Systems Analysis and Synthesis, Vol. IX (2000).
[4] Sutton, R. S.: Learning to Predict by the Method of Temporal Differences, Machine Learning, Vol. 3, pp. 9-44.
[5] Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press.
[6] Watkins, C. J. C. H. & Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 55-68.
[7] learn-game/systems/kitty.html
[8] ftp://ftp.nj.nec.com/pub/igord/othello/kitty/linux kitty.tgz
[9] ftp://ftp.nj.nec.com/pub/igord/othello/misc/database.zip
Proceedings of 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems September 28 - October 2, 2004, Sendai, Japan Flexible Cooperation between Human and Robot by interpreting Human
More informationAdversarial Search and Game Playing
Games Adversarial Search and Game Playing Russell and Norvig, 3 rd edition, Ch. 5 Games: multi-agent environment q What do other agents do and how do they affect our success? q Cooperative vs. competitive
More informationMove Evaluation Tree System
Move Evaluation Tree System Hiroto Yoshii hiroto-yoshii@mrj.biglobe.ne.jp Abstract This paper discloses a system that evaluates moves in Go. The system Move Evaluation Tree System (METS) introduces a tree
More informationa b c d e f g h 1 a b c d e f g h C A B B A C C X X C C X X C C A B B A C Diagram 1-2 Square names
Chapter Rules and notation Diagram - shows the standard notation for Othello. The columns are labeled a through h from left to right, and the rows are labeled through from top to bottom. In this book,
More informationUMBC 671 Midterm Exam 19 October 2009
Name: 0 1 2 3 4 5 6 total 0 20 25 30 30 25 20 150 UMBC 671 Midterm Exam 19 October 2009 Write all of your answers on this exam, which is closed book and consists of six problems, summing to 160 points.
More informationPlaying CHIP-8 Games with Reinforcement Learning
Playing CHIP-8 Games with Reinforcement Learning Niven Achenjang, Patrick DeMichele, Sam Rogers Stanford University Abstract We begin with some background in the history of CHIP-8 games and the use of
More informationCS221 Project Final Report Gomoku Game Agent
CS221 Project Final Report Gomoku Game Agent Qiao Tan qtan@stanford.edu Xiaoti Hu xiaotihu@stanford.edu 1 Introduction Gomoku, also know as five-in-a-row, is a strategy board game which is traditionally
More informationCMU-Q Lecture 20:
CMU-Q 15-381 Lecture 20: Game Theory I Teacher: Gianni A. Di Caro ICE-CREAM WARS http://youtu.be/jilgxenbk_8 2 GAME THEORY Game theory is the formal study of conflict and cooperation in (rational) multi-agent
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 4: Adversarial Search 10/12/2009 Luke Zettlemoyer Based on slides from Dan Klein Many slides over the course adapted from either Stuart Russell or Andrew
More informationCS 188: Artificial Intelligence Spring 2007
CS 188: Artificial Intelligence Spring 2007 Lecture 7: CSP-II and Adversarial Search 2/6/2007 Srini Narayanan ICSI and UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell or
More informationBootstrapping from Game Tree Search
Joel Veness David Silver Will Uther Alan Blair University of New South Wales NICTA University of Alberta December 9, 2009 Presentation Overview Introduction Overview Game Tree Search Evaluation Functions
More informationGame Tree Search. CSC384: Introduction to Artificial Intelligence. Generalizing Search Problem. General Games. What makes something a game?
CSC384: Introduction to Artificial Intelligence Generalizing Search Problem Game Tree Search Chapter 5.1, 5.2, 5.3, 5.6 cover some of the material we cover here. Section 5.6 has an interesting overview
More informationDIT411/TIN175, Artificial Intelligence. Peter Ljunglöf. 2 February, 2018
DIT411/TIN175, Artificial Intelligence Chapters 4 5: Non-classical and adversarial search CHAPTERS 4 5: NON-CLASSICAL AND ADVERSARIAL SEARCH DIT411/TIN175, Artificial Intelligence Peter Ljunglöf 2 February,
More informationReinforcement Learning and its Application to Othello
Reinforcement Learning and its Application to Othello Nees Jan van Eck, Michiel van Wezel Econometric Institute, Faculty of Economics, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR, Rotterdam, The
More informationHIT3002: Introduction to Artificial Intelligence
HIT3002: Introduction to Artificial Intelligence Intelligent Agents Outline Agents and environments. The vacuum-cleaner world The concept of rational behavior. Environments. Agent structure. Swinburne
More informationRobustness against Longer Memory Strategies in Evolutionary Games.
Robustness against Longer Memory Strategies in Evolutionary Games. Eizo Akiyama 1 Players as finite state automata In our daily life, we have to make our decisions with our restricted abilities (bounded
More informationAdversarial Search: Game Playing. Reading: Chapter
Adversarial Search: Game Playing Reading: Chapter 6.5-6.8 1 Games and AI Easy to represent, abstract, precise rules One of the first tasks undertaken by AI (since 1950) Better than humans in Othello and
More informationGame-playing AIs: Games and Adversarial Search I AIMA
Game-playing AIs: Games and Adversarial Search I AIMA 5.1-5.2 Games: Outline of Unit Part I: Games as Search Motivation Game-playing AI successes Game Trees Evaluation Functions Part II: Adversarial Search
More informationOutline. Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types
Intelligent Agents Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types Agent types Agents An agent is anything that can be viewed as
More informationAdversarial Search. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 9 Feb 2012
1 Hal Daumé III (me@hal3.name) Adversarial Search Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 9 Feb 2012 Many slides courtesy of Dan
More informationAnnouncements. CS 188: Artificial Intelligence Spring Game Playing State-of-the-Art. Overview. Game Playing. GamesCrafters
CS 188: Artificial Intelligence Spring 2011 Announcements W1 out and due Monday 4:59pm P2 out and due next week Friday 4:59pm Lecture 7: Mini and Alpha-Beta Search 2/9/2011 Pieter Abbeel UC Berkeley Many
More informationCS 188: Artificial Intelligence Spring Announcements
CS 188: Artificial Intelligence Spring 2011 Lecture 7: Minimax and Alpha-Beta Search 2/9/2011 Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein 1 Announcements W1 out and due Monday 4:59pm P2
More informationCS-E4800 Artificial Intelligence
CS-E4800 Artificial Intelligence Jussi Rintanen Department of Computer Science Aalto University March 9, 2017 Difficulties in Rational Collective Behavior Individual utility in conflict with collective
More informationROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT
ROBOCODE PROJECT AIBOT - MARKOV MODEL DRIVEN AIMING COMBINED WITH Q LEARNING FOR MOVEMENT PATRICK HALUPTZOK, XU MIAO Abstract. In this paper the development of a robot controller for Robocode is discussed.
More informationGameplay. Topics in Game Development UNM Spring 2008 ECE 495/595; CS 491/591
Gameplay Topics in Game Development UNM Spring 2008 ECE 495/595; CS 491/591 What is Gameplay? Very general definition: It is what makes a game FUN And it is how players play a game. Taking one step back:
More informationContents. List of Figures
1 Contents 1 Introduction....................................... 3 1.1 Rules of the game............................... 3 1.2 Complexity of the game............................ 4 1.3 History of self-learning
More informationCITS3001. Algorithms, Agents and Artificial Intelligence. Semester 2, 2016 Tim French
CITS3001 Algorithms, Agents and Artificial Intelligence Semester 2, 2016 Tim French School of Computer Science & Software Eng. The University of Western Australia 8. Game-playing AIMA, Ch. 5 Objectives
More informationHierarchical Controller for Robotic Soccer
Hierarchical Controller for Robotic Soccer Byron Knoll Cognitive Systems 402 April 13, 2008 ABSTRACT RoboCup is an initiative aimed at advancing Artificial Intelligence (AI) and robotics research. This
More informationAdversarial Search. Soleymani. Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5
Adversarial Search CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 5 Outline Game
More information4. Games and search. Lecture Artificial Intelligence (4ov / 8op)
4. Games and search 4.1 Search problems State space search find a (shortest) path from the initial state to the goal state. Constraint satisfaction find a value assignment to a set of variables so that
More informationMonte Carlo Tree Search and AlphaGo. Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar
Monte Carlo Tree Search and AlphaGo Suraj Nair, Peter Kundzicz, Kevin An, Vansh Kumar Zero-Sum Games and AI A player s utility gain or loss is exactly balanced by the combined gain or loss of opponents:
More informationWhat is a Z-Code Almanac?
ZcodeSystem.com Presents Guide v.2.1. The Almanac Beta is updated in real time. All future updates are included in your membership What is a Z-Code Almanac? Today we are really excited to share our progress
More information2. The Extensive Form of a Game
2. The Extensive Form of a Game In the extensive form, games are sequential, interactive processes which moves from one position to another in response to the wills of the players or the whims of chance.
More informationUMBC CMSC 671 Midterm Exam 22 October 2012
Your name: 1 2 3 4 5 6 7 8 total 20 40 35 40 30 10 15 10 200 UMBC CMSC 671 Midterm Exam 22 October 2012 Write all of your answers on this exam, which is closed book and consists of six problems, summing
More informationSEARCHING is both a method of solving problems and
100 IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 3, NO. 2, JUNE 2011 Two-Stage Monte Carlo Tree Search for Connect6 Shi-Jim Yen, Member, IEEE, and Jung-Kuei Yang Abstract Recently,
More informationCS325 Artificial Intelligence Ch. 5, Games!
CS325 Artificial Intelligence Ch. 5, Games! Cengiz Günay, Emory Univ. vs. Spring 2013 Günay Ch. 5, Games! Spring 2013 1 / 19 AI in Games A lot of work is done on it. Why? Günay Ch. 5, Games! Spring 2013
More informationGames CSE 473. Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie!
Games CSE 473 Kasparov Vs. Deep Junior August 2, 2003 Match ends in a 3 / 3 tie! Games in AI In AI, games usually refers to deteristic, turntaking, two-player, zero-sum games of perfect information Deteristic:
More informationInformatics 2D: Tutorial 1 (Solutions)
Informatics 2D: Tutorial 1 (Solutions) Agents, Environment, Search Week 2 1 Agents and Environments Consider the following agents: A robot vacuum cleaner which follows a pre-set route around a house and
More informationModule 3. Problem Solving using Search- (Two agent) Version 2 CSE IIT, Kharagpur
Module 3 Problem Solving using Search- (Two agent) 3.1 Instructional Objective The students should understand the formulation of multi-agent search and in detail two-agent search. Students should b familiar
More informationCSE-571 AI-based Mobile Robotics
CSE-571 AI-based Mobile Robotics Approximation of POMDPs: Active Localization Localization so far: passive integration of sensor information Active Sensing and Reinforcement Learning 19 m 26.5 m Active
More informationDynamic Programming in Real Life: A Two-Person Dice Game
Mathematical Methods in Operations Research 2005 Special issue in honor of Arie Hordijk Dynamic Programming in Real Life: A Two-Person Dice Game Henk Tijms 1, Jan van der Wal 2 1 Department of Econometrics,
More informationDetecticon: A Prototype Inquiry Dialog System
Detecticon: A Prototype Inquiry Dialog System Takuya Hiraoka and Shota Motoura and Kunihiko Sadamasa Abstract A prototype inquiry dialog system, dubbed Detecticon, demonstrates its ability to handle inquiry
More informationCandyCrush.ai: An AI Agent for Candy Crush
CandyCrush.ai: An AI Agent for Candy Crush Jiwoo Lee, Niranjan Balachandar, Karan Singhal December 16, 2016 1 Introduction Candy Crush, a mobile puzzle game, has become very popular in the past few years.
More informationDeveloping Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function
Developing Frogger Player Intelligence Using NEAT and a Score Driven Fitness Function Davis Ancona and Jake Weiner Abstract In this report, we examine the plausibility of implementing a NEAT-based solution
More informationBLUFF WITH AI. A Project. Presented to. The Faculty of the Department of Computer Science. San Jose State University. In Partial Fulfillment
BLUFF WITH AI A Project Presented to The Faculty of the Department of Computer Science San Jose State University In Partial Fulfillment Of the Requirements for the Degree Master of Science By Tina Philip
More informationFoundations of AI. 6. Adversarial Search. Search Strategies for Games, Games with Chance, State of the Art. Wolfram Burgard & Bernhard Nebel
Foundations of AI 6. Adversarial Search Search Strategies for Games, Games with Chance, State of the Art Wolfram Burgard & Bernhard Nebel Contents Game Theory Board Games Minimax Search Alpha-Beta Search
More informationHow AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997)
How AI Won at Go and So What? Garry Kasparov vs. Deep Blue (1997) Alan Fern School of Electrical Engineering and Computer Science Oregon State University Deep Mind s vs. Lee Sedol (2016) Watson vs. Ken
More informationLearning to play Dominoes
Learning to play Dominoes Ivan de Jesus P. Pinto 1, Mateus R. Pereira 1, Luciano Reis Coutinho 1 1 Departamento de Informática Universidade Federal do Maranhão São Luís,MA Brazil navi1921@gmail.com, mateus.rp.slz@gmail.com,
More information2048: An Autonomous Solver
2048: An Autonomous Solver Final Project in Introduction to Artificial Intelligence ABSTRACT. Our goal in this project was to create an automatic solver for the wellknown game 2048 and to analyze how different
More information