
An Emergence of Game Strategy in Multiagent Systems

Peter LACKO
Slovak University of Technology
Faculty of Informatics and Information Technologies
Ilkovičova 3, 842 16 Bratislava, Slovakia
lacko@fiit.stuba.sk

Abstract. In this paper we study the emergence of game strategy in multiagent systems, comparing symbolic and subsymbolic approaches. The symbolic approach is represented by a backtracking algorithm with a specified search depth, whereas the subsymbolic approach is represented by feed-forward neural networks adapted by the reinforcement temporal-difference technique TD(λ). As a test game we use simplified checkers. Three different strategies were used. In the first, a single agent repeatedly plays games against a MinMax version of the backtracking search method. In the second, a population of single agents repeatedly plays a megatournament. The third is an evolutionary modification of the second. We demonstrate that all three approaches lead to a population of agents that play checkers very successfully against a backtracking algorithm with search depth 3.

1 Introduction

Applications of TD(λ) reinforcement learning [2, 3] to computational studies of the emergence of game strategies were initiated by Gerald Tesauro [4, 5] in 1992. He let a machine-learning program endowed with a feed-forward neural network play backgammon against itself, and observed that a neural network emerged which was able to play backgammon at a supreme champion level.

The purpose of this paper is to use the TD(λ) reinforcement learning method and evolutionary optimization to adapt feed-forward neural networks that serve as evaluators of the next possible positions created from a given position by the permitted moves. The neural network evaluates each position by a score from the open interval (0,1). The position with the largest score is selected as the forthcoming position, while the other moves are ignored.

Supervisor: prof. Ing. Vladimír Kvasnička, DrSc., Institute of Applied Informatics, Faculty of Informatics and Information Technologies STU in Bratislava

M. Bieliková (Ed.), IIT.SRC 2005, April 27, 2005, pp. 41-48.

The method is tested on a simplified game of checkers, in which the player whose piece first reaches any square on the opposite end of the game board wins. Three different experiments were carried out. The first uses a single neural network playing against a player simulated by a MinMax backtracking algorithm with search depth 3. The second uses a population of neural networks that repeatedly play a megatournament (each network against all the others); after each game, both participating neural networks are adapted by TD(λ) according to the result of the game. Finally, the third experiment uses evolutionary adaptation of the neural networks, i.e. reinforcement learning is replaced by random mutation and natural selection. In all three experiments, the emerged neural networks won about 60% of their games of simplified checkers against the MinMax algorithm with search depth 3.

2 Simplified game of checkers

The game of checkers is played on a square board with sixty-four smaller squares arranged in an 8×8 grid of alternating colors (like a chess board). In the starting position each player has 8 pieces (black and white, respectively) on the 8 squares of the same color closest to his edge of the board. Each player must make one move per turn. The pieces move one square, diagonally, forward, and only to a vacant square. A player captures an opponent's piece by jumping over it, diagonally, onto the adjacent vacant square. If a player can jump, he must. A player wins if one of his pieces reaches a square on the opponent's edge of the board, if he captures the last opponent's piece, or if he blocks all of the opponent's moves.
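For concreteness, the board mechanics just described might be sketched as follows. Everything here is our own illustrative assumption, not code from the paper: an 8×8 list-of-lists board with entries 1 (a piece of the player to move), -1 (an opponent's piece) and 0 (vacant), and the helper names.

```python
# A minimal sketch of the simplified checkers rules described above.
import copy

def invert(board):
    """The inverse position: the board as seen by the other player."""
    return [[-board[7 - r][7 - c] for c in range(8)] for r in range(8)]

def apply_move(board, src, dst, captured=None):
    b = copy.deepcopy(board)
    b[dst[0]][dst[1]], b[src[0]][src[1]] = b[src[0]][src[1]], 0
    if captured is not None:
        b[captured[0]][captured[1]] = 0        # remove the jumped piece
    return b

def moves(board):
    """Successor boards for the player to move (pieces +1, advancing
    toward row 7); mandatory jumps shadow plain steps."""
    steps, jumps = [], []
    for r in range(8):
        for c in range(8):
            if board[r][c] != 1:
                continue
            for dc in (-1, 1):                 # one square, diagonally forward
                r1, c1, r2, c2 = r + 1, c + dc, r + 2, c + 2 * dc
                if 0 <= r1 < 8 and 0 <= c1 < 8 and board[r1][c1] == 0:
                    steps.append(apply_move(board, (r, c), (r1, c1)))
                if (0 <= r2 < 8 and 0 <= c2 < 8
                        and board[r1][c1] == -1 and board[r2][c2] == 0):
                    jumps.append(apply_move(board, (r, c), (r2, c2), (r1, c1)))
    return jumps if jumps else steps           # "if a player can jump, he must"

def wins(board):
    """Far edge reached, all opponent pieces captured, or opponent blocked."""
    return (any(board[7][c] == 1 for c in range(8))
            or not any(-1 in row for row in board)
            or not moves(invert(board)))
```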

2.1 Formalization of the game

In this section we formalize the game of checkers in a way that applies to all symmetric two-player games (chess, go, backgammon, etc.). Let the current position of the game be described by a variable P; this position can be changed by the permitted actions (moves) constituting a set A(P). Using a move a ∈ A(P), the position P is transformed into a new position P′. An inverse position P̄ is obtained from a position P by switching the color of all black pieces to white and of all white pieces to black. We use a multiagent approach, and we presume that the game is played by two agents G₁ and G₂, endowed with cognitive devices by which they are able to evaluate the next positions.

Algorithm 1
1. The game is started by the first player, G ← G₁, from the starting position, P ← P_init.
2. The player G generates from the position P the set of permitted next positions A(P) = {P₁, P₂, ..., Pₙ}. Each position Pᵢ from this set is evaluated by a coefficient 0 < zᵢ < 1. The player selects as his next position the P′ ∈ A(P) with the maximum coefficient, P ← P′. If the position P satisfies the condition for victory, the player G wins and the game continues with step 4.
3. The other player takes his turn in the game, G ← G₂; the position is generated as the inverse of the current position, P ← P̄; the game continues with step 2.
4. End of the game.

The key role in the algorithm is played by the calculation of the coefficients z = z(P′) for the positions P′ ∈ A(P). These calculations can be done either by methods of classical artificial intelligence, based on a combination of depth-first search and various heuristics, or by soft-computing methods. We use the second approach, turning our attention to the modern approach of multiagent systems. It is based on the presumption that the behavior of an agent in his environment, and the actions he performs, are fully determined by his cognitive device, which exhibits a certain plasticity (i.e. it is capable of learning).
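Algorithm 1 then reduces to a short loop. In the sketch below, moves, invert and wins are the assumed helpers from the previous sketch, and z1, z2 are assumed evaluator callables (the agents' cognitive devices, e.g. a network composed with the position encoding of Section 3) that map a position to a score in the open interval (0,1).

```python
# A sketch of Algorithm 1: greedy selection of the best-scored successor,
# with the board inverted between turns.
def play_game(z1, z2, P_init):
    evaluators, P, player = (z1, z2), P_init, 0
    while True:
        z = evaluators[player]
        candidates = moves(P)                  # the set A(P)
        if not candidates:                     # blocked: the opponent wins
            return 1 - player
        P = max(candidates, key=z)             # step 2: maximum coefficient
        if wins(P):
            return player                      # step 4: end of the game
        P, player = invert(P), 1 - player      # step 3: other player's turn
```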

3 The structure of the cognitive device (neural network)

Before we specify the cognitive device of the agents, we have to introduce the numerical representation of positions. A position is represented by a 32-dimensional vector

$$\mathbf{x}(P) = (x_1, x_2, \ldots, x_{32}) \in \{0, 1, -1\}^{32} \tag{1a}$$

where the single entries specify the single squares of the position P:

$$x_i = \begin{cases} 0 & (\text{the } i\text{-th square is free}) \\ \pm 1 & (\text{our piece / the opponent's piece is on the } i\text{-th square}) \end{cases} \tag{1b}$$

The neural network used has the architecture of a feed-forward network with one layer of hidden neurons. The activities of the input neurons are determined by the numerical representation x(P) of the given position P, and the output activity evaluates the position x(P) (see Fig. 1).

Fig. 1. Feed-forward neural network with one layer of hidden neurons (32 input neurons, p hidden neurons, 1 output neuron). The input activities are the 32-dimensional vector x(P), which codes a position of the game; the output activity is the real number z(P) from the open interval (0,1), an evaluation of the input position.

The number of parameters of the neural network is 34p+1, where p is the number of hidden neurons.
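Under the encoding (1a)-(1b), the evaluator of Fig. 1 is a small two-layer network. The sketch below uses sigmoid units and NumPy, which are our assumptions (the paper does not state the activation function), and it reproduces the parameter count 34p+1: 32p input-to-hidden weights, p hidden biases, p hidden-to-output weights, and 1 output bias. The choice of the 32 dark squares in encode is likewise our reading of the 32-dimensional representation.

```python
import numpy as np

def encode(board):
    """Eq. (1a)-(1b): one entry per playable (dark) square."""
    return np.array([board[r][c] for r in range(8) for c in range(8)
                     if (r + c) % 2 == 1], dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_net(p, rng=np.random.default_rng(0)):
    """32 -> p -> 1 evaluator; 32p + p + p + 1 = 34p + 1 parameters."""
    return {"W": rng.normal(0.0, 0.1, (p, 32)),   # input-to-hidden weights
            "b": rng.normal(0.0, 0.1, p),         # hidden biases
            "v": rng.normal(0.0, 0.1, p),         # hidden-to-output weights
            "c": float(rng.normal(0.0, 0.1))}     # output bias

def evaluate(net, x):
    """x: 32-vector over {0, 1, -1}; returns z(P) in the open interval (0, 1)."""
    h = sigmoid(net["W"] @ x + net["b"])
    return sigmoid(net["v"] @ h + net["c"])
```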

4 Adaptation of the agent's cognitive device with the temporal-difference TD(λ) method with reward and punishment [2]

In this section we give the basic principles of the reinforcement learning method used, which in its temporal-difference TD(λ) version currently belongs among the effective algorithmic tools for the adaptation of cognitive devices in multiagent systems. The basic principles of reinforcement learning are the following:
- The agent observes the mapping of an input pattern to the output signal of his cognitive device (the output signal is often called an action or control signal).
- The agent evaluates the quality of the output signal on the basis of an external scalar signal, the reward.
- The aim of learning is an adaptation of the agent's cognitive device that modifies the output signals toward maximization of the external reward signals.

In many cases the reward signal is delayed; it arrives only at the end of a long sequence of actions and can be understood as an evaluation of the whole sequence, namely whether or not the sequence achieved the desired goal.

We now outline the construction of the TD(λ) method as a generalization of the standard gradient-descent learning of neural networks. Let us presume that we know the sequence of positions of the given agent (player) and their evaluation by a real number:

$$P_1, P_2, \ldots, P_m, z_{\mathrm{reward}} \tag{2}$$

where z_reward is an external evaluation of the sequence, corresponding to the fact that the last position P_m means that the given agent won or lost:

$$z_{\mathrm{reward}} = \begin{cases} 1 & (\text{sequence of positions won}) \\ -1 & (\text{sequence of positions lost}) \end{cases} \tag{3}$$

From the sequence (2) we create m couples of positions and their evaluations by z_reward, which are used as a training set for the following objective function:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{t=1}^{m}\left(z_{\mathrm{reward}} - G(\mathbf{x}_t; \mathbf{w})\right)^2 \tag{4}$$

We look for weight coefficients of the neural network (the cognitive device) that minimize this objective function. If we found weights for which the function is zero, then each position from the sequence (2) would be evaluated exactly by the number z_reward. The recurrent formula for the adaptation of the weight coefficients is

$$\mathbf{w} := \mathbf{w} - \alpha\frac{\partial E}{\partial \mathbf{w}} = \mathbf{w} + \Delta\mathbf{w} \tag{5}$$

$$\Delta\mathbf{w} = \alpha\sum_{t=1}^{m}\left(z_{\mathrm{reward}} - z_t\right)\frac{\partial z_t}{\partial \mathbf{w}} \tag{6}$$

where z_t = G(x(P_t); w) is the evaluation of the t-th position P_t by the neural network working as the cognitive device. Our goal is that all the positions from the sequence (2) be evaluated by the same number z_reward, which specifies whether the outcome of the game consisting of the sequence (2) was a win, draw, or loss for the given player. This approach can be generalized to a formula that creates the basis of the TD(λ) class of learning methods [2]:

$$\Delta\mathbf{w} = \sum_{t=1}^{m}\Delta\mathbf{w}_t \tag{7}$$

$$\Delta\mathbf{w}_t = \alpha\left(z_{t+1} - z_t\right)\sum_{k=1}^{t}\lambda^{t-k}\frac{\partial z_k}{\partial \mathbf{w}} \tag{8}$$

where the parameter 0 ≤ λ ≤ 1. Formulas (7) and (8) enable a recurrent calculation of the increment Δw. We introduce a new symbol e_t(λ), which can easily be calculated recurrently as follows:

$$e_t(\lambda) = \sum_{k=1}^{t}\lambda^{t-k}\frac{\partial z_k}{\partial \mathbf{w}}, \qquad e_{t+1}(\lambda) = \lambda e_t(\lambda) + \frac{\partial z_{t+1}}{\partial \mathbf{w}} \tag{9}$$

where e_1(λ) = ∂z_1/∂w. The single partial increments Δw_t are then determined by

$$\Delta\mathbf{w}_t = \alpha\left(z_{t+1} - z_t\right)e_t(\lambda) \tag{10}$$
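The recurrences (9) and (10) translate directly into an update loop over one finished game. The sketch below is a minimal reading of these formulas: it reuses evaluate from the previous sketch, assumes a hypothetical grad_z(net, x) returning ∂z/∂w for each weight array, and, following the sequence (2), lets the external signal z_reward play the role of the last target.

```python
def td_lambda_update(net, xs, z_reward, alpha=0.01, lam=0.9):
    """xs: encoded positions x(P_1)..x(P_m) of one finished game;
    grad_z(net, x) (assumed helper) returns dz/dw per weight array."""
    m = len(xs)
    e = grad_z(net, xs[0])                        # e_1(lambda) = dz_1/dw
    for t in range(m):
        z_t = evaluate(net, xs[t])
        # the target of the last step is the external reward signal
        z_next = evaluate(net, xs[t + 1]) if t + 1 < m else z_reward
        for k in net:                             # eq. (10)
            net[k] = net[k] + alpha * (z_next - z_t) * e[k]
        if t + 1 < m:                             # eq. (9): decay and accumulate
            g = grad_z(net, xs[t + 1])
            e = {k: lam * e[k] + g[k] for k in e}
```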

5 Results

To measure the success of the emergent game strategy, we used the MinMax algorithm [1]. In our implementation, we used the following heuristic evaluation function:

$$\mathrm{evaluation} = \sum_{i=1}^{l} y_n[i] - \sum_{i=1}^{m}\left(8 - y_s[i]\right) \tag{11}$$

If we denote the number of our pieces on the game board by l and the number of the opponent's pieces by m, then y_n[i] denotes the position of our i-th piece along the y axis (the axis toward the opponent's starting part of the game board), and y_s[i] denotes the equivalent value for the opponent's pieces. This evaluation ensures that the MinMax player tries to advance its pieces toward the opponent's part of the game board while trying to prevent the opponent's progress. The play of this player is quite offensive.

5.1 The result of supervised learning of a neural network

In our first, simplest approach, we considered only a 1-agent system. Our agent learns by playing against another agent whose decisions are based on backtracking search to a maximum depth of 3. The game is repeated with the players alternating who goes first. After each game, the agent whose cognitive device is represented by a neural network adapts it by the reinforcement learning method using the reward/punishment signal.

For training and testing we used a two-layer feed-forward network with 64 hidden neurons, the learning rate 0.01 and the coefficient λ = 0.9. The network learned after each game; it was rewarded with 1 if it won and 0 if it lost. After every 100 games, the ratio of won to lost games was plotted on the graph. In this trial, the network played against the MinMax algorithm searching to depth 3. The progress of learning is shown in Fig. 2. Evidently the network learned slowly, and even after 450,000 matches it achieved victory in only 65% of matches. Nevertheless, this is still an excellent result, since a network playing exactly as well as our MinMax algorithm searching to depth 3 would win only 50% of its matches.

Fig. 2. The progress of learning of a neural network playing against the MinMax algorithm with search depth 3.

5.2 The result of adaptation of a population of neural networks

In the second, more complicated case, we consider a multiagent system. Its agents repeatedly play a megatournament against each other, and after each game the neural networks of both agents are adapted by the reinforcement learning method. For this trial we used 20 neural networks. These networks played a megatournament against each other, and their level of development was measured by a tournament against the MinMax with search depth 3. The learning curve is shown in Fig. 3. The figure shows that even though the neural networks were not taught by TD(λ) learning against the MinMax algorithm, they did learn to play checkers. This means we did not need an expert with good knowledge of the game to teach the networks how to play.
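One megatournament of this experiment might look like the following round-robin sketch. Here play_recorded is a hypothetical variant of play_game above that also returns each player's encoded position sequence, and the 1/0 reward follows the convention of Section 5.1.

```python
from itertools import combinations

def megatournament(nets):
    """Every network plays every other one; both participants then
    adapt their weights by TD(lambda) from the finished game."""
    for i, j in combinations(range(len(nets)), 2):
        first_won, xs_i, xs_j = play_recorded(nets[i], nets[j])
        td_lambda_update(nets[i], xs_i, 1.0 if first_won else 0.0)
        td_lambda_update(nets[j], xs_j, 0.0 if first_won else 1.0)
```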

5.3 The result of evolutionary adaptation of a population of neural networks

In the third, most complex approach, we also used Darwinian evolution in the multiagent system: after the end of each megatournament the agents are assigned a fitness according to their success rate in the game, and they then reproduce quasi-randomly with a probability proportional to their fitness. In this case we use asexual reproduction: a copy of the parental agent is created, this copy is mutated with a certain probability, and it replaces some weak agent in the population. To assess fitness we used the MinMax with search depth 3. Figure 4 shows the learning curve as the average result of the agents in the population against the MinMax with search depth 3. The population consisted of 55 agents, from which in each epoch a subpopulation of 10 individuals was created. The subpopulation was generated by a quasi-random selection of agents from the population, which were mutated with probability 0.5. The mutation added to each weight of the neural network, with probability 0.001, a random number drawn from a logistic distribution with parameter 0.1. This subpopulation then replaced the weakest individuals in the population (one epoch is sketched after the figure captions below). Figure 4 shows that in this approach, too, a strategy of the game emerged, and the resulting neural networks were able to play at the same level as the MinMax algorithm with search depth 3.

Fig. 3. The progress of learning in a population of 20 neural networks trained by TD(λ). The curve shows the average ratio of wins to losses for the whole population against the MinMax with search depth 3.

Fig. 4. The progress of learning of a population of 55 neural networks adapted by the evolutionary algorithm. The curve shows the average percentage of wins for the whole population against the MinMax with search depth 3.
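With the constants quoted above, one evolutionary epoch might be sketched as follows. This is only our reading of the text: fitness (an agent's score against the MinMax of depth 3) is an assumed helper, and the exact selection and replacement mechanics are not specified in the paper.

```python
import copy
import numpy as np

rng = np.random.default_rng()

def mutate(net):
    """With probability 0.001 per weight, add logistic noise (scale 0.1)."""
    child = {}
    for k, w in net.items():
        w = np.asarray(w, dtype=float)
        mask = rng.random(w.shape) < 0.001
        child[k] = w + mask * rng.logistic(0.0, 0.1, w.shape)
    return child

def epoch(population):
    """One epoch for the population of 55 agents: pick 10 parents in
    proportion to fitness, mutate each with probability 0.5, and let
    the offspring replace the 10 weakest individuals."""
    scores = np.array([fitness(net) for net in population])
    probs = scores / scores.sum()                 # fitness-proportional selection
    parents = rng.choice(len(population), size=10, p=probs)
    offspring = [mutate(population[i]) if rng.random() < 0.5
                 else copy.deepcopy(population[i]) for i in parents]
    for idx, child in zip(np.argsort(scores)[:10], offspring):
        population[idx] = child                   # replace the weakest agents
```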

6 Conclusions

The purpose of this paper is a computational study of the emergence of game strategy for a simplified game of checkers. It was studied at three different levels. At the first, simplest level we studied a simple 1-agent system, where an agent (represented by a neural network) plays against another agent simulated by the MinMax backtracking method with a specified search depth of 3. At the second level, we used a genuine multiagent system, where the agents repeatedly play a megatournament, each agent against all the other agents. After each single game, both participating agents modify their neural networks by TD(λ) reinforcement learning. At the third level, Darwinian evolution is applied to all agents of a population (a multiagent system). As at the second level, the agents play a megatournament; its results are used for the fitness evaluation of the agents. The fitness values are used in the evolutionary approach in the reproduction process, where fitter agents reproduce with a higher probability than weaker ones. In the reproduction process the weight coefficients are softly randomly modified (mutated), and natural selection ensures that only better neural networks survive.

At all three levels we observed an emergence of a checkers strategy, and at the second and third levels we did not use any user-defined agents endowed with the ability to predict a correct strategy and therefore able to play the game perfectly. This is a very important moment in our computer simulations: in the multiagent systems used, the emergence of game strategy is spontaneous and not biased by predefined opponents that play the game perfectly. The neural networks are able to learn a strategy, which gives rise to agents capable of playing checkers at a very good level.

Acknowledgment: This work was supported by the Scientific Grant Agency of the Slovak Republic under grants #1/0062/03 and #1/1047/04.

References

1. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall, 1995.
2. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learning, Vol. 3 (1988), 9-44.
3. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
4. Tesauro, G.J.: Practical issues in temporal difference learning. Machine Learning, Vol. 8 (1992), 257-277.
5. Tesauro, G.J.: TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, Vol. 6, No. 2 (1994), 215-219.