By David Anderson, SZTAKI (Budapest, Hungary) / WPI D2009


In 1997, Deep Blue defeated Kasparov, and today an average workstation can beat the best chess players, so computer chess is no longer an interesting research problem. Go is much harder for computers to play: the branching factor is roughly 50 to 200, versus about 35 in chess; positional evaluation is inaccurate and expensive; the game cannot be scored until the end; and beginners can defeat the best Go programs.

Go is a two-player game of perfect information. Players take turns placing black and white stones on a grid; the board is 19x19 (13x13 or 9x9 for beginners). The object is to surround empty space as territory. Stones can be captured, but not moved. The winner is the player with the most points (territory plus captured stones).

[Go board image from http://ict.ewi.tudelft.nl/~gineke/]

Minimax/alpha-beta algorithms require huge trees, and the search depth cannot be cut off easily because positional evaluation is unreliable.

Monte Carlo methods are now more popular: simulate random games from the game tree and use the results to pick the best move. There are two areas of optimization: discovering good paths in the game tree, and making the random simulations more intelligent, since purely random games are usually bogus.

The search needs to balance exploration (discovering and simulating new paths) against exploitation (simulating the most promising path). The best method currently is UCT, given by Levente Kocsis and Csaba Szepesvári.

Say you have a slot machine that pays out with some unknown probability. You can estimate this probability through experimentation.

What if there are three slot machines, and each has a different probability?

You need to choose between experimenting (exploration) and getting the best reward (exploitation).

The UCB algorithm balances these two goals to minimize the loss of reward (regret).
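As a concrete illustration, here is a minimal sketch of the UCB1 rule for the slot-machine setting above; the payout probabilities and the exploration constant are made-up values for the example, not anything from the talk.

```python
import math
import random

# Hypothetical payout probabilities for three slot machines (unknown to the player).
TRUE_PAYOUTS = [0.3, 0.5, 0.6]
pulls = [0, 0, 0]          # how many times each machine has been tried
rewards = [0.0, 0.0, 0.0]  # total reward collected from each machine

def ucb1(i, total_pulls, c=math.sqrt(2)):
    """Average reward plus an exploration bonus that shrinks as machine i is tried more."""
    if pulls[i] == 0:
        return float("inf")  # make sure every machine is tried at least once
    return rewards[i] / pulls[i] + c * math.sqrt(math.log(total_pulls) / pulls[i])

for t in range(1, 10001):
    i = max(range(len(TRUE_PAYOUTS)), key=lambda m: ucb1(m, t))
    pulls[i] += 1
    rewards[i] += 1.0 if random.random() < TRUE_PAYOUTS[i] else 0.0

print(pulls)  # most pulls should end up on the best machine (index 2)
```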

UCT applies UCB to games like Go, deciding which move to explore next by treating it like the bandit problem.

UCT starts with a one-level tree of the legal board moves.

It picks the best move according to the UCB algorithm.

It runs a Monte Carlo simulation from that move and updates the node's win/loss record. This is one iteration of the UCT process.

If a node gets visited enough times, UCT starts looking at its child moves.

UCT dives deeper, each time picking the most interesting move.

Eventually, UCT has built a large tree of simulation information
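Putting the steps above together, this is a minimal sketch of one UCT iteration. The `GameState` interface (`legal_moves`, `play`, `is_terminal`, `random_playout`) and the expansion threshold are assumptions for illustration, and player-to-move bookkeeping is omitted.

```python
import math
import random

EXPAND_THRESHOLD = 10  # assumed number of visits before a node's children are added

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.wins = 0.0

    def ucb_score(self, c=math.sqrt(2)):
        if self.visits == 0:
            return float("inf")
        return (self.wins / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def uct_iteration(root):
    # 1. Selection: walk down the tree, always taking the child with the best UCB score.
    node = root
    while node.children:
        node = max(node.children, key=Node.ucb_score)
    # 2. Expansion: once a node has been visited enough, add its child moves.
    if node.visits >= EXPAND_THRESHOLD and not node.state.is_terminal():
        node.children = [Node(node.state.play(m), parent=node)
                         for m in node.state.legal_moves()]
        if node.children:
            node = random.choice(node.children)
    # 3. Simulation: play a random game to the end from this position.
    result = node.state.random_playout()  # 1.0 for a win, 0.0 for a loss
    # 4. Backpropagation: update win/visit counts on the path back to the root.
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```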

UCT is now in most major competitive programs. MoGo used UCT to defeat a professional player, using an 800-node grid and a 9-stone handicap. Much research is now focused on improving the intelligence of the simulations.

The playout policy decides which move to play next in a random game simulation. Too much stochasticity makes UCT less accurate and slower to converge to the correct move; too much determinism makes UCT less effective, defeating the purpose of Monte Carlo search and possibly introducing harmful selection bias.
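To make that stochasticity/determinism trade-off concrete, here is a hedged sketch of a softmax playout policy with a temperature knob. The softmax form and the `pattern_weight` callable are assumptions for illustration, not the exact policy used in any of these programs.

```python
import math
import random

def sample_move(candidate_moves, pattern_weight, temperature=1.0):
    """Sample a playout move with probability proportional to exp(weight / temperature).
    High temperature -> nearly uniform (very stochastic) playouts;
    low temperature -> the policy almost always plays the top-weighted move."""
    scores = [math.exp(pattern_weight(m) / temperature) for m in candidate_moves]
    total = sum(scores)
    threshold = random.uniform(0.0, total)
    running = 0.0
    for move, score in zip(candidate_moves, scores):
        running += score
        if threshold <= running:
            return move
    return candidate_moves[-1]  # guard against floating-point round-off
```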

Certain shapes in Go are good: the hane shown here is a strong attack on Black. Others are quite bad: Black's empty triangle is too dense and wasteful.

MoGo uses pattern knowledge with UCT: a hand-crafted database of interesting 3x3 patterns, which doubled the simulation win rate according to the authors. Can pattern knowledge be trained automatically via machine learning?
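For context on what "3x3 patterns" means in practice, here is a rough sketch of encoding the 3x3 neighbourhood of a candidate move as a lookup key. The board representation and the encoding are assumptions for illustration, not MoGo's actual format.

```python
def pattern_key(board, x, y):
    """Encode the 3x3 neighbourhood of (x, y) as a base-4 integer.
    board[row][col] is assumed to hold 0 = empty, 1 = black, 2 = white."""
    OFF_BOARD = 3
    key = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            col, row = x + dx, y + dy
            if 0 <= row < len(board) and 0 <= col < len(board[0]):
                cell = board[row][col]
            else:
                cell = OFF_BOARD  # the board edge acts as its own "color"
            key = key * 4 + cell
    return key

# A pattern database is then just a mapping from key to a weight,
# e.g. weights = {pattern_key(board, x, y): 1.5, ...}
```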

The paper Monte-Carlo Simulation Balancing (by David Silver and Gerald Tesauro) observes that policies accumulate error with each move. Strong policies minimize this per-move error, but not the whole-game error. The paper proposes algorithms for minimizing the whole-game error. The authors tested on 5x5 Go using 2x2 patterns and found that balancing was more effective than raw strength.

I implemented the pattern learning algorithms from Monte-Carlo Simulation Balancing, using 9x9 Go with 3x3 patterns:
Strength: Apprenticeship
Strength: Policy Gradient Reinforcement
Balance: Policy Gradient Simulation Balancing
Balance: Two-Step Simulation Balancing

Training used an amateur database of 9x9 games. Metrics worth mentioning: the simulation (playout) win rate against a purely random policy, the UCT win rate against purely random UCT, and the UCT win rate against GNU Go.

Apprenticeship is the simplest algorithm. It looks at every move of every game in the training set, giving high preference to chosen moves and low preference to unchosen moves. It strongly favored good patterns, but it overtrains and compensates poorly for error: the pattern values grow toward infinity.
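A rough sketch of an apprenticeship-style update as described above; the learning rate and the exact increments are assumptions, not the thesis implementation, but they show why the values can grow without bound.

```python
from collections import defaultdict

weights = defaultdict(float)  # pattern key -> preference value
ALPHA = 0.01                  # assumed learning rate

def apprenticeship_update(candidate_patterns, chosen_pattern):
    """Raise the preference of the pattern the human actually played and
    lower the preference of every other candidate pattern in that position.
    With no normalization, repeated updates push the values toward infinity."""
    for p in candidate_patterns:
        if p == chosen_pattern:
            weights[p] += ALPHA
        else:
            weights[p] -= ALPHA / max(1, len(candidate_patterns) - 1)
```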

[Chart: Apprenticeship vs Pure Random. Winrate (%) by game type: Playout, UCT vs libego, UCT vs GNU Go.]

Policy Gradient Reinforcement plays random games from positions in the training set. If a simulation matches the original game's result, the patterns it played get higher preference; otherwise, lower preference. Results were promising.
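A hedged sketch of that reinforcement idea; the interfaces and the learning rate are assumptions for illustration.

```python
from collections import defaultdict

weights = defaultdict(float)  # pattern key -> preference value

def reinforcement_update(patterns_played, simulation_black_won, game_black_won, alpha=0.01):
    """Nudge the patterns played in a random simulation upward when the simulation
    reaches the same result as the original training game, downward otherwise."""
    sign = 1.0 if simulation_black_won == game_black_won else -1.0
    for p in patterns_played:
        weights[p] += sign * alpha
```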

[Chart: Reinforcement vs Pure Random. Winrate (%) by game type: Playout, UCT vs libego, UCT vs GNU Go.]

For each training game, Simulation Balancing plays random games to estimate the win rate, plays more random games to determine which patterns win and lose, and adjusts pattern preferences based on the error between the actual game result and the observed win rate.
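A hedged sketch of that balancing update; the playout interface, the number of playouts, and the learning rate are assumptions, and the real algorithm (Silver and Tesauro's policy-gradient form) is more careful about how the gradient is accumulated.

```python
from collections import defaultdict

weights = defaultdict(float)  # pattern key -> preference value

def balancing_update(position, actual_result, run_playout, n_playouts=100, alpha=0.01):
    """Estimate the policy's win rate from a training position, then push pattern
    preferences so the estimate moves toward the actual game result.
    run_playout(position) is assumed to return (result, patterns_played)."""
    playouts = [run_playout(position) for _ in range(n_playouts)]
    est_winrate = sum(result for result, _ in playouts) / n_playouts
    error = actual_result - est_winrate  # positive: playouts are too pessimistic here
    for result, patterns in playouts:
        direction = 1.0 if result == 1.0 else -1.0
        for p in patterns:
            weights[p] += alpha * error * direction / n_playouts
```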

The resulting policy usually played strong local moves and seemed to learn a good pattern distribution, but it aggressively played useless moves hoping for an opponent mistake and gave poor consideration to the whole board.

[Chart: Simulation Balancing vs Pure Random. Winrate (%) by game type: Playout, UCT vs libego, UCT vs GNU Go.]

Two-Step Simulation Balancing picks random game states, computes a score estimate for every move at 2-ply depth, and updates pattern preferences based on these results, using the actual game result to compensate for error.

The game score is hard to estimate and usually inaccurate. It is also extremely expensive: 10 to 30 seconds to estimate a score. The game score doesn't change meaningfully for many moves, and the approach probably does not scale as the board size grows.

[Chart: Two-Step Balancing vs Pure Random. Winrate (%) by game type: Playout, UCT vs libego, UCT vs GNU Go.]

[Chart: Algorithm results. Winrate (%) of Pure Random, Apprenticeship, Reinforcement, Simulation Balancing, and Two-Step Balancing by game type: Playout, UCT vs libego, UCT vs GNU Go.]

Reinforcement was the strongest. All of the algorithms can produce very deterministic policies, and higher playout win rates usually meant a policy that was too deterministic and therefore performed poorly with UCT. Go may be too complex for these algorithms, and optimizing self-play doesn't guarantee good moves.

Levente Kocsis (SZTAKI), Professors Sárközy and Selkow

The algorithm generates a list of patterns, each with a weight/value. The policy looks at the open positions on the board, gets the pattern at each open position, and uses the weights as a probability distribution over moves.
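A minimal sketch of that playout policy. The `pattern_at` callable and the `weights` table are assumed interfaces, and sampling proportional to exp(weight) is an assumption about how the weights are turned into a distribution.

```python
import math
import random

def move_distribution(open_positions, pattern_at, weights):
    """Build a probability distribution over the open positions:
    pattern_at(pos) is assumed to return the pattern key at that position,
    and each position's probability is proportional to exp(its pattern's weight)."""
    scores = [math.exp(weights.get(pattern_at(pos), 0.0)) for pos in open_positions]
    total = sum(scores)
    return [(pos, s / total) for pos, s in zip(open_positions, scores)]

def choose_playout_move(open_positions, pattern_at, weights):
    """Sample one move from that distribution."""
    dist = move_distribution(open_positions, pattern_at, weights)
    return random.choices([pos for pos, _ in dist],
                          weights=[p for _, p in dist], k=1)[0]
```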