A GGP Feature Learning Algorithm

Mesut Kirci, Nathan Sturtevant, Jonathan Schaeffer

This paper presents a learning algorithm for two-player, alternating-move GGP games. The Game Independent Feature Learning algorithm, GIFL, uses the differences in temporally related states to learn patterns that are correlated with winning or losing a GGP game. These patterns are then used to inform the search. GIFL is simple, robust, and improves the quality of play in the majority of games tested. GIFL has been successfully used in the GGP program Maligne.

1 Introduction

Building agents that can demonstrate intelligence over a wide variety of domains remains an elusive goal for the field of artificial intelligence. Even restricting the set of domains to a subset of game playing, arguably an area with simple scope, is quite challenging. The state of the art in this area is to develop game-specific solutions that provide insights but do not necessarily generalize to other games.

The annual General Game Playing (GGP) competitions have encouraged researchers to explore developing programs that can play a game given only the set of rules [1]. Playing a legal game is easy; having the system play these games at a high level of skill is challenging. A general game player accepts a game description as input at runtime, analyzes it, and then plays the game without human intervention. GGP programs have to play many classes of games, varying parameters such as the number of players (one or more), move constraints (alternating or simultaneous), and game space (such as board games and card games). Thus, GGP programs cannot use game-specific algorithms or knowledge.

The current state of the art in GGP algorithms, adopted by most of the leading competitors, is to use the UCT search algorithm [5]. The algorithm's appeal is its simplicity of implementation and, more importantly, that it achieves good performance with no domain knowledge (other than the game rules). There are two portions to the overall UCT search: a growing tree, held in memory, which stores the current value of various moves, and random sampling that proceeds from the leaves of the tree to terminal states in the game. Given enough samples, UCT will provide an educated guess as to the best move to play.

The advent of UCT represented an improvement in the playing abilities of the top GGP programs (from very weak to weak, as evidenced by play against humans). However, if one wants to achieve high performance, one of the goals of GGP research, some form of application-specific knowledge will have to be learned by the game-playing software. Domain-independent knowledge extraction is a very hard problem. There have been a few attempts to use machine learning in a GGP program, but none of them have had significant competitive success. The Clune Player and University of Texas programs [6], both frequent GGP competitors, use automatically extracted features to calculate the evaluation function. These are simple features like the number of pieces on the board and the number of legal moves. Sharma et al. use temporal difference learning to build a domain-independent knowledge base that is used to guide the UCT search [8]. This method has not been used by any competitive GGP program.

This paper presents the Game Independent Feature Learning (GIFL) algorithm. It learns features and uses them to guide the search in two-player, alternating-move games. Similar to the well-known history heuristic [7], GIFL uses state differences in 2-ply game trees to identify good and bad features. The learned features are used to guide the otherwise random move selection in the random sampling portion of the UCT search.
In GGP, the game rules are defined using the Game Description Language (GDL) [2]. A game state is defined as a set of predicates that are true (facts). Predicates present in a state are called state predicates; in the special case of a goal state they are referred to as terminal predicates. GIFL is given a line of play from the starting position to a terminal position, and then performs a retroactive analysis. The algorithm identifies states and actions which might be associated with winning or losing, and then attempts to extract general patterns (sets of predicates) that are necessary for an action to be useful. An offensive feature is a pattern that is correlated with success and suggests a move that can be used to direct a search towards a positive outcome. A defensive feature is a pattern that is correlated with failure avoidance and suggests a move (if legal in the current position) that may prevent a bad outcome. Each feature is assigned a value which measures the distance from the said outcome. These features could be used by many different algorithms, but we focus on the application to UCT here. During the random sampling portion of the UCT search, all available features are evaluated to see if they are applicable in the current state. Immediate wins and losses are taken or avoided. Otherwise, the feature (and associated move) to apply is chosen via a Boltzmann distribution over the value of each valid feature.

This paper reports experimental results for 15 games used in previous GGP competitions. GIFL-enhanced UCT search outperforms standard UCT in nine games, does not affect the results in three games, loses slightly in two games, and loses badly in one game. As well, GIFL was able to improve the performance of the GGP program Maligne in five of the games in the 2010 GGP Championship, where Maligne took second place. Although the results are strong, one must keep in mind that these are still early days for machine learning in the GGP framework; the performance of the programs is still at a relatively weak level of play. Some of the results in this paper have previously been published [3, 4].
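To fix the terminology, the following sketch shows one possible representation of a GDL state and a GIFL feature. The class and field names are illustrative assumptions for this article, not the data structures used in Maligne.

```python
# Hypothetical sketch: a GDL state as a set of ground predicates and a GIFL
# feature as (predicates, action, value). Names are assumptions, not the
# authors' code.
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Predicate = Tuple[str, ...]           # e.g. ("cell", "1", "1", "x")
Action = Tuple[str, ...]              # e.g. ("mark", "2", "2")
State = FrozenSet[Predicate]          # a GDL state: the set of true facts

@dataclass(frozen=True)
class GIFLFeature:
    predicates: FrozenSet[Predicate]  # pattern that must be present in a state
    action: Action                    # move suggested when the pattern matches
    value: float                      # 100 at the terminal level, discounted above
    offensive: bool                   # True: aims at a win; False: avoids a loss

def matches(feature: GIFLFeature, state: State, legal_moves) -> bool:
    """A feature applies if its predicates are a subset of the state and
    its suggested move is legal in that state."""
    return feature.predicates <= state and feature.action in legal_moves
```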

Figure 1: A 2-ply game tree at the end of the game sequence (states labelled (a)-(f)).

Figure 2: The general feature-learning process: the input is a state, an action, and a test; the output is a GIFL feature consisting of predicates, an action, and a value.

2 Feature Learning

The goal of GIFL is to learn generalized features which can be used to improve the quality of play. Given a move sequence ending in a terminal state, there are two stages to the learning process, which we describe in detail here. In the first stage, a 2-ply game tree that leads to a terminal state is built, and states are identified for learning. In the second stage, general features are extracted from those states to be used during game play. We demonstrate these stages using examples from the game tic-tac-toe.

2.1 Identifying States for Learning

GIFL identifies states for learning by performing random walks in a game until a terminal state is reached. Then, a 2-ply tree is built around the terminal state to analyze whether learning can occur. We demonstrate this in Figure 1, which shows a small portion of a tic-tac-toe tree. The left branch was randomly sampled and ended because the state labelled (d) is a terminal state and a win for the x player. We would like to generalize that in states similar to (b), the x player should move in the center to win the game: an offensive feature. Thus, the state (b) is sent to a function which extracts a general feature from the state. We describe a method to do this in Section 2.2. This tree also shows us, however, that the o player had a better move at state (a). If the o player plays in the center in states like (a), x will no longer have a winning move at state (c), as evidenced by the successors of (c). We call this a defensive feature. We similarly send state (a) to the function which builds a generalized feature from this state.

2.2 Learning Generalized Features

Figure 2 illustrates the feature-learning process which takes place once interesting states have been identified. The generalization process takes as input a GGP state, an action, and a functional test which must be preserved during the generalization process. A GIFL feature includes (1) predicates for identifying interesting states, (2) an action to take when the predicates from (1) are found in the current state, and (3) the relative value of the feature. The main task in learning GIFL features is to generalize from the full states found in the 2-ply trees built in the previous section to a small set of predicates which can possibly match many different states.

The generalization process occurs as follows. Predicates are removed from the input state one at a time. After removing each predicate, the action is applied, followed by the test. If removing a predicate makes the action invalid or the test false, then that predicate becomes part of the GIFL feature. If removing the predicate has no effect on the action or test, then it is not required and does not become part of the generalized GIFL feature.
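The predicate-removal loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: apply_action is a hypothetical helper that returns the successor state, or None if the action is illegal, and test is the caller-supplied condition to preserve (for example, "the successor is a terminal, winning state").

```python
from typing import Callable, FrozenSet, Optional, Tuple

Predicate = Tuple[str, ...]
Action = Tuple[str, ...]
State = FrozenSet[Predicate]

def generalize(state: State,
               action: Action,
               apply_action: Callable[[State, Action], Optional[State]],
               test: Callable[[State], bool]) -> FrozenSet[Predicate]:
    """Keep only the predicates whose removal breaks the action or the test.

    apply_action(state, action) is assumed to return the successor state, or
    None if the action is illegal in that state; test(successor) checks the
    property being preserved.
    """
    kept = set()
    for p in state:
        reduced = state - {p}                     # remove one predicate at a time
        successor = apply_action(reduced, action)
        if successor is None or not test(successor):
            kept.add(p)                           # removing p broke action/test: keep p
    return frozenset(kept)
```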
We illustrate two examples of this generalization. The first, in Figure 3, is an example of learning an offensive feature. The input state is state (b) from Figure 1. The action that was applied at that state to win the game was to mark the middle position on the board with an x. This is an appropriate move as long as applying it results in the successor state being a terminal or goal state.

Figure 3: Building GIFL features: only predicates which make the test true when applying the provided action are maintained in the GIFL feature. (Input: state (b), action (mark 2 2), test "is the successor terminal?"; learned feature: predicates (cell 1 1 x) and (cell 3 3 x), action (mark 2 2), value 100.)

To find the generalized predicates, all predicates in the state are removed one at a time, the action is applied, and the test is performed. The predicates in this state are {(cell 1 1 x), (cell 3 3 x), (cell 1 2 o), (cell 3 2 o)}. Removing (cell 1 2 o) or (cell 3 2 o) does not make (mark 2 2) illegal, and does not change whether the subsequent state is terminal. Thus, these predicates are not part of the GIFL feature. However, removing either (cell 1 1 x) or (cell 3 3 x) will cause the test to be false, as the subsequent state will no longer be terminal. Therefore, these are part of the generalized GIFL feature. Because this action leads directly to the goal, it is given a value of 100. This is an offensive feature, because it directs the player towards a winning move.

Figure 4: Learning a defensive feature higher in the 2-ply tree. (Input: state (a), action (mark 2 2), test "is the GIFL offensive feature invalid?"; learned feature: predicates (cell 1 1 x) and (cell 3 3 x), action (mark 2 2), value 100.)

An example of learning a defensive feature is illustrated in Figure 4. This state corresponds to the root of the 2-ply tree, Figure 1 (a). In this example, it has been discovered that (mark 2 2) prevents the x player from winning the game, and the state must be generalized. The test here is more complicated. Instead of testing for a win, we must test whether the previously learned offensive feature becomes invalid. The predicates (cell 1 1 x) and (cell 3 3 x) are already known to be required in the GIFL defensive feature, as they have been identified in the offensive feature. Removing the predicate (cell 3 2 o) from the state does not prevent the supplied action from successfully blocking the offensive feature, so it is unnecessary. The final defensive feature then states that if the x player has marks in opposite corners, the defending player should move in the middle between them. Because this prevents an immediate loss, it also gets a value of 100. Note that if the defending player had an action that erased one of the x player's marks, then this would also be learned as a defensive feature, because the offensive feature would then no longer be applicable in the subsequent state.

2.3 Extending GIFL Features Up the Tree

We have presented here a simple method for learning from a 2-ply tree at the terminal state of a game. Building the tree identifies possible states where offensive or defensive features could be learned, and generalized GIFL features are built from these states. Ideally, however, a learner would also learn elsewhere in the tree. This learning can be performed using the same procedure by looking at moves higher up in the random walk. Instead of looking for a move which leads to a goal, GIFL looks for a move which contributes to the offensive feature. A 2-ply tree is then built around this move, and similar learning takes place. A key difference is that the values of GIFL features found higher up in the move sequence are discounted the farther away they are from the leaf state. The value of the move is computed according to the formula

$100 \cdot V^{\mathrm{level}}$,

where $V$ is a constant between 0 and 1 and level is the level of the 2-ply tree where the feature was learned. Trees built from terminal states have level 0; trees built above the root of a terminal-level tree have level 1, and so on. We used V = 0.9 in all the results presented here. A more detailed presentation, with pseudo-code for the full feature-learning process, is available [3, 4].
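As a small numeric illustration of the discounting (assuming the formula takes the form 100 * V^level as reconstructed above), a hypothetical helper might look like this:

```python
DISCOUNT_V = 0.9  # the constant V used in all reported results

def feature_value(level: int, v: float = DISCOUNT_V) -> float:
    """Value assigned to a GIFL feature learned at a given 2-ply-tree level.

    Level 0 is the tree built at the terminal state, so its features get 100;
    features learned farther from the end of the game are discounted.
    """
    return 100.0 * v ** level

# e.g. feature_value(0) == 100.0, feature_value(1) == 90.0, feature_value(2) == 81.0
```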
2.4 Limitations and Further Generalization

What has been described so far is a simple approach to generalizing and building GIFL features. It is possible that some necessary predicates are not properly generalized using this method; there is likely a wide variety of methods for generalizing offensive and defensive features which have yet to be explored.

For instance, suppose that there are three stones placed vertically in the game of Connect4. If the stone in the middle is removed during generalization and a stone is then placed on top of that column, the game description dictates that a stone is placed on top of every stone that has an empty space above it. Therefore, the empty place in the middle is refilled even though it is not supposed to be. This will result in the stone in the middle not being a part of the offensive-feature predicates even though it should be. To solve this problem, GIFL uses another method to find the offensive-feature predicates: if a predicate which was removed during the generalization process is found back in the state after applying the sample action, then this predicate is added to the GIFL feature.

Another case which our implementation of GIFL does not handle is a goal which requires any two of three predicates to be satisfied. If all three predicates are in the state identified for learning, then none of them will be included as predicates in the GIFL feature, as it is only the combination of predicates that is necessary. While removing pairs of predicates would reveal such dependencies, higher-level logical analysis of a game could lead to stronger generalization rules to handle cases such as these.

3 Using Features

GIFL features are used to guide the random simulations in a UCT search. The program checks each state during a simulation to see whether or not a feature can be applied. A feature can be applied if the predicates of the feature are matched and if the move associated with the feature is legal. If there is a feature match, then the move associated with that feature is marked with the value from the GIFL feature. Moves with a value of 100 lead to immediate wins or losses and are taken immediately, with preference given to offensive features. After all of the applicable features are found, the program selects a move according to probabilities calculated with a Boltzmann distribution:

$$p(a) = \frac{e^{V(a)/\tau}}{\sum_{b=1}^{n} e^{V(b)/\tau}},$$

where there are n actions and V(a) is the value of action a.
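A minimal sketch of this selection rule is shown below. It assumes the caller has already mapped each legal move to the value of its best matching GIFL feature (0 if none applies) and has handled value-100 moves separately; the function name is hypothetical rather than part of Maligne.

```python
import math
import random
from typing import Dict, Hashable

TAU = 0.5  # temperature used in all reported results

def choose_rollout_move(move_values: Dict[Hashable, float],
                        tau: float = TAU) -> Hashable:
    """Sample a move for the random-simulation phase of UCT.

    move_values maps each legal move to the value of the best matching GIFL
    feature (0.0 for moves with no matching feature). Moves worth 100 are
    assumed to have been taken immediately by the caller; the remaining moves
    are sampled from a Boltzmann distribution p(a) proportional to exp(V(a)/tau).
    """
    moves = list(move_values)
    weights = [math.exp(move_values[m] / tau) for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]
```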

This will bias the random simulations towards known offensive and defensive moves, improving the quality of the simulations, and it provides a good exploration-exploitation balance for the move selection. Even though a higher-valued feature leads to a win in fewer moves, the outcome depends on the opponent's response; therefore, other possible moves are also explored. We used τ = 0.5 in all the results presented here.

There is one final step in the application of GIFL features. If both players use GIFL features equally during the random UCT simulations, they will improve the simulations for each player equally well, and performance may not improve. Instead, each player has an independent probability of using the GIFL features at each step, and the probability for the opponent is lower than for the learning player. This incorporates some form of opponent modeling, causing the learning player to attempt to exploit the non-learning player. In all results reported here the opponents were modeled as only being able to use the learned features 50% of the time. Pseudo-code for how to use the features in UCT search can be found in [3].

4 Experiments

The experiments were prepared using the game definitions in the Stanford GGP repository [2] and those used in the 2008 and 2009 GGP competitions. Some games from the 2008 competition have arbitrary names such as game1, game2, etc. All games are two-player, alternating-move, perfect-information games. In some games being the first or second player may be advantageous; the experiments are therefore conducted so that this does not affect the results. The player that uses the features to guide the random simulations is called the learning player, and it is compared against a UCT player with purely random simulations. Thus, the only difference between the learning player and the non-learning player is that the learning player uses the learned features to guide the random simulation phase of the UCT search.

We present four sets of results here. First, we look at the number of learned features and the performance of the learning player when playing against a non-learning player with the same number of UCT simulations. We then analyze the speedups and slowdowns that GIFL introduces before comparing performance with a fixed amount of time per move. The number of training runs is limited to 500 unless specified otherwise. The learning time varies between roughly 100 training runs per minute in breakthrough and 20 training runs per minute in checkers. The level of the 2-ply tree at which learning occurs is limited to 3. This reduces the number of features and the time spent in random simulations, as too many features increase the cost of feature matching.

4.1 Number of Learned Features

We begin by looking at the number of learned features for a variety of games. As two minutes is a common start clock, Table 1 shows what can be learned in this time frame. For most games a significant number of features can be learned during the (short) time limit. This is important, but perhaps less important than it seems, as learning can take place while UCT is already beginning to explore the game tree. This dual use of simulations reduces the effective overhead of GIFL learning. The total number of offensive features learned is always greater than the total number of defensive features, as a defensive feature can only be learned in response to an offensive feature.

4.2 Fixed UCT Simulations

Given that a significant number of features can be learned, we then measure the effectiveness of these features on play. Results are in Table 2.
The scores are the average score over pairs of games, one played as the first player and one as the second player. Thus, a score of 193-7 in game2 results from always winning and getting 100 points as player one, and averaging 93 points as player two. Games that are constant-sum have scores that add up to 200; chess and game5 are not constant-sum. Of the 15 games that were used, the learning player defeats the non-learning player in nine of the games. The knowledge does not significantly affect the results in three games. In two games, the learning player loses by a small margin. Using learned knowledge decreases the quality of play significantly in only one game, checkersbarrelnokings. In seven of the nine games for which the learning player has the advantage over the non-learning player, the results are statistically significant. Thus, in the GGP competition setting, where games start from the same initial state, the learning player is expected to beat a UCT player in these games with 95% confidence. This shows that GIFL features improve the performance of UCT search.

Starting from the same initial state could result in test games that were played identically. However, because the UCT search performs random simulations, the test games were not repetitions of the same game. In 12 of the 15 test domains all the games diverged within the first five moves, and in the other three domains up to 10 moves were needed. It should be noted that the games in which the learning does not affect the results are not very interesting: the first player always wins in pentago, all games are tied in game4, and all games end in fewer than 10 moves in quarto.

Use of GIFL seems to degrade performance in checkersbarrelnokings. Although this game is similar to the original checkers, at which the learning player has a clear advantage, the learning player loses badly in checkersbarrelnokings. There are a number of reasons for this. In checkersbarrelnokings, due to the lack of kings and to forced capture moves, the number of legal moves per step is low. Therefore, the non-learning player does less unnecessary exploration. For example, there are roughly 20 legal moves in a typical breakthrough state. When the learning player uses a GIFL feature, this gives an advantage over the non-learning player because the non-learning player explores all 20 moves. However, the average number of legal moves in checkersbarrelnokings is low, so the advantage gained from using GIFL features is also lower than in breakthrough. In addition, learning capture moves, which are the most important moves for winning the game, is not useful because they are forced moves. Therefore, GIFL can only make a difference by learning defensive feature moves. Learning defensive features is also harder in checkersbarrelnokings because escaping an imminent capture is often done by capturing an opponent piece (a forced move). GIFL learns features that help the player avoid getting captured on the next turn. However, GIFL features do not contain information about the opponent's other pieces. Thus, when a GIFL feature suggests how to avoid capture by one piece, it may inadvertently put the piece in position to be captured by another piece.

name                    source             features   offensive   defensive
game2                   2008 competition        135          75          60
pawn whopping           2009 competition        328         185         143
knightthrough           2008 competition        134          79          55
game1                   2008 competition       1869        1014         855
breakthrough            2007 competition        120          63          57
checkers                [2]                     895         458         437
connect4                [2]                    2187        1108        1079
chess                   [2]                      25          13          12
game5                   2008 competition       1357         686         671
pentago                 [2]                     679         405         274
game4                   2008 competition       1101         558         543
quarto                  [2]                    8527        5254        3273
game6                   2008 competition       1456         813         643
game3                   2008 competition       1555         812         743
checkersbarrelnokings   [2]                    5262        2830        2432

Table 1: Number of features learned by GIFL in two minutes.

name                    simulations   games   learning vs. UCT   point %
game2                          1000      20             193-7     97.5%
pawn whopping                  1000      20            190-10     95.0%
knightthrough                  1000      20            184-16     92.0%
game1                          1000      20            170-30     85.0%
breakthrough                   1000      20            165-35     82.5%
checkers                        150      20            156-44     78.0%
connect4                       1000     100            115-85     57.5%
chess                            25      40            102-84     54.8%
game5                          1000      40            111-94     54.1%
pentago                        1000     100           100-100     50.0%
game4                          1000      40           100-100     50.0%
quarto                         1000     100            98-102     49.0%
game6                          1000     100            96-104     48.0%
game3                          1000     100            91-109     45.5%
checkersbarrelnokings          1000     100            61-139     30.5%

Table 2: Effectiveness of using GIFL with a fixed number of simulations for each player.

4.3 Cost of GIFL

Running GIFL incurs cost overheads, which we measure in this section. The biggest overhead is that of matching GIFL features against each state to see if they are applicable. However, there is an offsetting benefit from using these features, as they decrease the length of the random UCT simulations. We measure both effects in Table 3, predicting the expected speedup or slowdown. The second column reports the ratio of the learner's random walk length to the length of a regular UCT random walk. In most games, using GIFL decreases the length of the random walk, sometimes significantly. However, in checkersbarrelnokings the length of the random walks increases significantly, a factor in the poor performance in that game. The third column reports the ratio of simulations performed by the learner to those of the regular UCT player. This measure already takes into account the shorter simulation length. Despite the shorter lengths, the learner performs fewer simulations in all but two games. Column four is the inverse of the product of columns two and three. This gives the factor by which GIFL slows down the GGP player. However, because the simulations are of shorter length, the effective cost is less, as shown in the last column (the inverse of column three).

                        sim. length     simulations     GIFL overhead   effective
game                    (learner/UCT)   (learner/UCT)   (UCT/learner)   overhead
game2                         51%             46%             4.3           2.2
pawn whopping                 82%             65%             1.9           1.5
knightthrough                 45%             93%             2.4           1.1
game1                         53%            104%             1.8           1.0
breakthrough                  61%             79%             2.1           1.3
checkers                      73%             36%             3.8           2.8
connect4                      95%             20%             5.3           5.0
chess                         99%             74%             1.4           1.4
game5                        101%             32%             3.1           3.1
pentago                       68%            156%             0.9           0.6
quarto                       100%             34%             2.9           2.9
game6                         97%             58%             1.8           1.7
game3                         52%             99%             1.9           1.0
checkersbarrelnokings        178%             38%             1.5           2.6
game4                        116%             47%             1.8           2.1

Table 3: Average simulation length, number of simulations, and the resulting GIFL overhead.

For example, for game2 the simulations are, on average, half the length with GIFL.
Because the cost of the GIFL analysis slows down the program by a factor of 4.3, the program runs 2.2 times slower than with regular UCT.

name                    learning vs. UCT   win %
game2                             140-30   70.0%
pawn whopping                     190-10   95.0%
knightthrough                     150-50   75.0%
game1                             160-40   80.0%
breakthrough                      170-30   85.0%
checkers                          110-90   55.0%
connect4                          90-110   45.0%
chess                             75-125   35.0%
game5                             40-160   15.0%
pentago                          100-100   50.0%
quarto                            80-120   50.0%
game6                             70-130   35.0%
game3                            100-100   50.0%
checkersbarrelnokings             30-170   15.0%
game4                            100-100   50.0%

Table 4: Effectiveness of using GIFL with 30 seconds per move.

4.4 Fixed Time

We complete our experiments in games with a fixed time limit, shown in Table 4. The games are still ordered from best to worst performance given a fixed number of UCT simulations. Although the numbers are not as favorable as before, there are still significant gains in many different games. These are not the best possible results using GIFL, as it has not been tuned for maximal performance. Most of the work of GIFL could be integrated into the inference engine, thereby reducing the overhead.

5 Conclusion

The learning algorithm learns GIFL features and uses them to guide random UCT simulations. The concepts are simple and domain independent, which is essential for GGP algorithms. Up until the 2008 GGP competition, learning algorithms had not been an essential part of a successful GGP program because domain-independent learning is a very hard problem. However, this paper presents a simple but effective method that shows very promising results in some of the games that are frequently used in GGP competitions.

The algorithm shows promising results in GGP, but the learning concepts depend heavily on the terminal conditions. If the goal conditions of a game are too specific, the features may not be encountered frequently, and GIFL may not be effective. For instance, the terminal conditions of chess have many variations depending on the position and the number and type of pieces. GIFL learns one of these variations at each step of the algorithm, and the occurrence of that specific terminal position during a simulation is necessary for the learned feature to be effective. However, most of the GGP games in which GIFL is successful have fewer distinct possible terminal conditions. In conclusion, the effectiveness of GIFL depends on how many variations the terminal conditions of a game can have.

In addition, the computational overhead of using features is an important area for future work. The primary focus of GIFL is the effectiveness of the features; therefore, time has not been spent developing more efficient methods of feature matching and feature pruning. Some of the learned features may not be effective and could be removed. We believe that significant performance gains are possible.

The algorithm has room for improvement. First, the features could be used as part of an evaluation function; a minimax approach could be tried with this evaluation function instead of the UCT search. Second, the algorithm can only learn features from a game sequence if the player that wins the game makes the last move. The learning algorithm cannot be applied to games in which the losing side makes the last move. Lose Checkers is an example of such a game: the players aim to lose all of their pieces instead of trying to capture the opponent's. This problem might be solved by changing the leaf of the 2-ply tree where the learning occurs. In addition, the frequency of features seen during the learning process could be included when the values for the feature moves are calculated; right now, all of the features have the same importance. The learning algorithm presented in this paper is relatively simple, yet we have shown it to be quite successful. There are certainly more complex approaches which could be even more successful.
We look forward to future competitions encouraging even more learning, with longer start clocks that would allow more learning to take place before a game begins.

References

[1] Michael Genesereth, Nathaniel Love, and Barney Pell. General game playing: Overview of the AAAI competition. AI Magazine, 26(2):62-72, 2005.
[2] Stanford Logic Group. http://logic.stanford.edu.
[3] Mesut Kirci. Feature learning using state differences. Master's thesis, Computing Science, University of Alberta, 2009.
[4] Mesut Kirci, Nathan Sturtevant, and Jonathan Schaeffer. Feature learning using state differences. In IJCAI Workshop on General Game Playing, 2009.
[5] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282-293, 2006.
[6] Gregory Kuhlmann and Peter Stone. Graph-based domain mapping for transfer learning in general games. In Proceedings of the 18th European Conference on Machine Learning, September 2007.
[7] Jonathan Schaeffer. The history heuristic and alpha-beta search enhancements in practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:1203-1212, 1989.
[8] Shiven Sharma, Ziad Kobti, and Scott Goodwin. Knowledge generation for improving simulations in UCT for general game playing. In AI 2008: Advances in Artificial Intelligence, pages 49-55. Springer-Verlag, 2008.

Contact

Mesut Kirci
Email: kirci@ualberta.ca

Nathan Sturtevant
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6M 2K9
Email: nathanst@cs.ualberta.ca

Jonathan Schaeffer
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6M 2K9
Email: jonathan@ualberta.ca

Mesut Kirci has an M.Sc. degree from the Department of Computing Science at the University of Alberta. He is currently working for TaleWorlds, a game-development company in Turkey.

Nathan Sturtevant is an Assistant Professor in the Department of Computer Science at the University of Denver; the work in this paper was completed while he was an adjunct assistant professor at the University of Alberta. Nathan's research focuses on heuristic search, with contributions in single-player and multi-player games. His work in single-player search was incorporated in the game Dragon Age, which has sold over one million copies.

Jonathan Schaeffer is a Professor of Computing Science at the University of Alberta. He is the iCORE Chair for High-Performance AI Systems. For over 30 years he has been using games and puzzles as experimental testbeds for his AI research. He is best known for developing Chinook, the first program to win a human world championship in any game.