Adversarial Reasoning: Sampling-Based Search with the UCT algorithm. Joint work with Raghuram Ramanujan and Ashish Sabharwal


Upper Confidence Bounds for Trees (UCT)

The UCT algorithm (Kocsis and Szepesvári, 2006), based on the UCB1 multi-armed bandit algorithm (Auer et al., 2002), has changed the landscape of game-playing programs in recent years:
- First program capable of master-level play in 9x9 Go (Gelly and Silver, 2007)
- A UCT-based agent is a two-time winner of the AAAI General Game Playing Competition (Finnsson and Björnsson, 2008)
- Also successful in Kriegspiel (Ciancarini and Favini, 2009) and real-time strategy games (Balla and Fern, 2009)
- Key to the success of Google's AlphaGo

Understanding UCT

Despite its impressive track record, our current understanding of UCT is mostly anecdotal. Our work focuses on gaining better insights into UCT by studying its behavior in search spaces where comparisons to Minimax search are feasible.

The multi-armed bandit problem: find the best slot machine, while trying to make as much money as possible along the way.
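
The bandit setting can be sketched with the UCB1 rule the talk builds on. This is a minimal illustration, assuming Bernoulli payouts and a made-up three-machine casino; it is not code from the talk.

```python
import math
import random

def ucb1_play(arms, pulls=2000, c=math.sqrt(2)):
    """Play a multi-armed bandit with UCB1 (Auer et al., 2002).

    `arms` is a list of true payout probabilities (unknown to the player).
    Returns the empirical mean payout and visit count of each arm.
    """
    n = [0] * len(arms)        # times each arm was pulled
    q = [0.0] * len(arms)      # running average payout of each arm
    for t in range(1, pulls + 1):
        if t <= len(arms):
            i = t - 1          # pull each arm once before trusting the bounds
        else:
            # UCB score = exploitation term (q) + exploration bonus
            i = max(range(len(arms)),
                    key=lambda a: q[a] + c * math.sqrt(math.log(t) / n[a]))
        reward = 1.0 if random.random() < arms[i] else 0.0
        n[i] += 1
        q[i] += (reward - q[i]) / n[i]   # incremental mean update
    return q, n

random.seed(0)
q, n = ucb1_play([0.2, 0.5, 0.8])
# The best arm (p = 0.8) should receive the bulk of the pulls, while the
# others are sampled only often enough to rule them out.
```

This is exactly the trade-off UCT inherits: suboptimal arms are still sampled (exploration), but only logarithmically often.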

The UCT Algorithm

Every node in the search tree is treated like a multi-armed bandit: find the best slot machine while trying to make as much money as possible. The UCB score is computed for each child s' of the current node s, and the best-scoring child is chosen for expansion:

UCB(s') = Q(s') + c * sqrt(ln n(s) / n(s'))

- Exploitation term: Q(s') is the estimated utility of state s' (the average payout of that arm so far)
- Exploration term: n(s) is the number of visits to state s, and n(s') the number of visits to s'
- What would we want to prove? Theorem (UCB1): non-optimal arms are sampled at most O(log n) times!

The UCT Algorithm

Incrementally grow the game tree by creating lines of play from the top, and update the estimates of board values:
- Descend the tree by starting at the root node and repeatedly selecting the best-scoring child
- At the opponent's nodes, a symmetric lower confidence bound is minimized to pick the best move
- At a node with an unexplored child, a new child is created
- A random playout is performed to estimate the utility R (+1, 0, or -1) of this new state

The UCT Algorithm

An averaging backup is used to update the value estimates of all nodes on the path from the root to the new node:
- Visit count update: n(s) <- n(s) + 1
- State utility update: Q(s) <- Q(s) + (R - Q(s)) / n(s)

Each child node is then selected based on the UCB scheme.
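
The selection / expansion / playout / averaging-backup loop described on the last few slides can be sketched end to end. The game here is a toy take-1-or-2 Nim (our stand-in, not from the talk) so the sketch stays self-contained; all names are illustrative.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # (stones_left, player_to_move: +1 or -1)
        self.parent = parent
        self.children = {}      # move -> Node
        self.n = 0              # visit count n(s)
        self.q = 0.0            # estimated utility Q(s), from player +1's view

def moves(state):
    return [m for m in (1, 2) if m <= state[0]]

def step(state, move):
    return (state[0] - move, -state[1])

def playout(state):
    # Random playout to a terminal state; taking the last stone wins.
    while state[0] > 0:
        state = step(state, random.choice(moves(state)))
    # The player now to move has nothing left to take: they lost.
    return 1.0 if state[1] == -1 else -1.0

def uct_iteration(root, c=1.4):
    node = root
    # 1) Selection: player +1 maximizes the UCB score; player -1 minimizes
    #    the symmetric lower confidence bound (same formula, sign flipped).
    while node.state[0] > 0 and len(node.children) == len(moves(node.state)):
        parent = node
        node = max(parent.children.values(),
                   key=lambda ch: parent.state[1] * ch.q
                                  + c * math.sqrt(math.log(parent.n) / ch.n))
    # 2) Expansion: create one unexplored child (unless terminal).
    if node.state[0] > 0:
        m = random.choice([m for m in moves(node.state) if m not in node.children])
        node.children[m] = Node(step(node.state, m), parent=node)
        node = node.children[m]
    # 3) A random playout estimates the utility R of the new node.
    r = playout(node.state)
    # 4) Averaging backup along the path from the new node to the root.
    while node is not None:
        node.n += 1
        node.q += (r - node.q) / node.n
        node = node.parent

def uct_search(state, iters=3000):
    root = Node(state)
    for _ in range(iters):
        uct_iteration(root)
    return max(root.children, key=lambda m: state[1] * root.children[m].q)

random.seed(1)
best_move = uct_search((4, 1))  # from 4 stones, taking 1 leaves the opponent a lost position
```

Even with purely random playouts, the averaging backup separates the two moves here, which previews the talk's point that random playouts carry a weak but usable signal.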

UCT versus Minimax

- UCT tree: asymmetric; the best-performing method for Go
- Minimax tree: full-width up to some depth bound k; the best-performing method for, e.g., Chess

Search Spaces

Minimax performs poorly in Go due to the large branching factor and the lack of a good board evaluation function. To understand UCT better, we need a domain with another competitive method. We considered domains where Minimax search produces good play:
- Chess: a domain where Minimax yields world-class play but UCT performs poorly
- Mancala: a domain where both Minimax and UCT yield competent players out of the box
- Synthetic games (ongoing work)

Mancala

Both UCT and Minimax search produce good players. The board consists of pits and stores:
- A move consists of picking up all the stones from a pit and sowing (distributing) them in counter-clockwise fashion
- Stones placed in the stores are captured and taken out of circulation
- The game ends when one of the sides has no stones left
- The player with more captured stones at the end wins
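
The sowing rule can be sketched as follows. The board layout (six pits plus one store per side, Kalah-style) is an assumption for illustration, and capture-on-empty-pit rules are omitted.

```python
def sow(board, pit, player):
    """Sow the stones from `pit` counter-clockwise (simplified layout,
    assumed for illustration; extra capture rules are omitted).

    Indices 0-5:  player 0's pits     Index 6:  player 0's store
    Indices 7-12: player 1's pits     Index 13: player 1's store
    Stones dropped into a store stay there (captured, out of circulation);
    the opponent's store is skipped while sowing.
    """
    board = list(board)
    stones, board[pit] = board[pit], 0
    skip = 13 if player == 0 else 6   # the opponent's store
    pos = pit
    while stones > 0:
        pos = (pos + 1) % 14
        if pos == skip:
            continue                  # never sow into the opponent's store
        board[pos] += 1
        stones -= 1
    return board

start = [4] * 6 + [0] + [4] * 6 + [0]   # 4 stones per pit, empty stores
after = sow(start, 2, player=0)          # sow the 4 stones from pit 2
```

Sowing from pit 2 drops one stone each into pits 3, 4, 5 and one into player 0's store; the total stone count is conserved.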

UCT on Mancala

We examine the three key components of UCT:
I. Exploration vs. exploitation
II. Random playouts
III. Information backup: averaging versus minimax

The resulting insights lead to an improved UCT player. We then deploy the winning UCT agent in a novel partial-game setting to understand how it wins.

I) Full-Width versus Selective Search

We vary the value of the exploration constant c and measure the win rate of UCT against a standardized Minimax player, looking for the optimal setting of c.

Previous experience in game-tree search: full-width search plus selective extensions outperforms trying to be smart about node expansion. Lesson: UCT challenges the general notion that forward (i.e., unsafe) pruning is a bad idea in game-tree search. Not necessarily so! Smart exploration/exploitation-driven search can work.

II) Random Playouts (Samples)

- During search tree construction, UCT estimates the value of a newly added node using a random playout (resulting in +1, -1, or 0).
- Appeal: very general. There is no need to construct a board/state evaluation function.
- But random games look very different from actual games. Is there any real information in the result of random play?
- Answer: yes, but it is a very faint signal. Other aspects of UCT compensate.
- Also: how many playouts ("samples") per node? (A surprise.)

How many samples per leaf is optimal?

Given a fixed budget, should we examine many nodes, with each node evaluated less accurately, or fewer nodes more carefully? For example: 4000 nodes in the tree with 5 playouts per node, versus 2000 nodes with 10 playouts per node.

A single playout per leaf is optimal! It is better to quickly examine many nodes than to get more random-playout samples per node.

Lesson: there is a weak but useful signal in random game playouts, and a single sample is enough.

III) Utility Estimate Backup: Averaging versus Minimax

- UCT uses an averaging strategy to back up reward values. What about using a minimax backup strategy instead?
- For random-playout information (i.e., a very noisy reward signal), averaging is best.
- Confirmed on Mancala and with a synthetic game-tree model.

Aside: this is related to the infamous game-tree pathology: deeper minimax searches can lead to worse performance! (Nau 1982; Pearl 1983). The phenomenon is tied to very noisy heuristics, such as random-playout information. Minimax uses heuristic estimates only from the fringe of the explored tree, ignoring any heuristic information from internal nodes. UCT's averaging over all nodes (including internal ones) can compensate for pathology and near-pathology effects.

Improving over Random-Playout Information: Heuristic Evaluation

- Unlike in Go, we do have a good heuristic available for Mancala. Replacing playouts with heuristic evaluations gives UCT_H.
- The heuristic is much less noisy than a random playout, so we also consider a minimax backup. We call the result UCTMAX_H.

We allow both algorithms to build trees of the same size and compare their win rates. Lesson: when a state-evaluation heuristic is available, use a minimax backup in conjunction with UCT.
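
The difference between the two backup rules can be made concrete. The node structure and function names below are ours, not from the talk, and `backup_minimax` recomputes a whole subtree for clarity, whereas a real implementation would update values only along the visited path.

```python
class BNode:
    """Minimal tree node for illustrating the two backup rules."""
    def __init__(self, to_move=+1, heuristic=0.0, children=()):
        self.to_move = to_move        # +1: maximizing player, -1: minimizing
        self.children = list(children)
        self.n = 0                    # visit count (used by the averaging rule)
        self.q = heuristic            # at a leaf: heuristic value h(s)

def backup_average(path, reward):
    # Plain UCT: every node on the root-to-leaf path absorbs the (noisy)
    # playout reward into a running average.
    for node in path:
        node.n += 1
        node.q += (reward - node.q) / node.n

def backup_minimax(node):
    # UCTMAX_H-style: with a low-noise heuristic at the leaves, back values
    # up by minimax instead of averaging (selection still uses the UCB rule).
    if not node.children:
        return node.q
    vals = [backup_minimax(ch) for ch in node.children]
    node.q = max(vals) if node.to_move == +1 else min(vals)
    return node.q

# A max root with two min children, each holding heuristic leaves:
# minimax value = max(min(3, 5), min(2, 9)) = 3.
root = BNode(+1, children=[
    BNode(-1, children=[BNode(heuristic=3.0), BNode(heuristic=5.0)]),
    BNode(-1, children=[BNode(heuristic=2.0), BNode(heuristic=9.0)]),
])
value = backup_minimax(root)
```

With a noisy reward (random playouts) the averaging rule smooths errors out; with a reliable heuristic the minimax rule propagates the information undiluted, which is the talk's motivation for UCTMAX_H.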

Final Step: UCTMAX_H versus Minimax

- We study the win rate of UCTMAX_H against Minimax players searching to various depths. Both agents use the same state heuristic.
- We also consider a hybrid strategy: in regions with no search traps, play UCTMAX_H; in regions with search traps, play MM-16.

Mancala: UCTMAX_H vs. Minimax (with alpha-beta)

Win rate of UCTMAX_H against Minimax at look-ahead depth k:

                 6 Pits, 4 Stones/Pit      8 Pits, 6 Stones/Pit
  Depth          UCTMAX_H     Hybrid       UCTMAX_H     Hybrid
  k = 6            0.76        0.72          0.71        0.68
  k = 8            0.76        0.71          0.71        0.67
  k = 10           0.75        0.65          0.69        0.65
  k = 12           0.67        0.61          0.66        0.61

Both UCTMAX_H and the hybrid outperform Minimax. Note: exactly the same number of nodes is expanded by each side, so the difference is due entirely to the exploration/exploitation strategy of UCT.

UCT Recap

- First head-to-head comparison of UCT vs. Minimax, and the first competitive domain for such a comparison.
- Multi-armed-bandit-inspired game-tree exploration is effective.
- Random playouts (fully domain-independent) contain weak but useful information; the averaging backup of UCT is the right strategy for them.
- When domain information is available, use UCTMAX_H. It outperforms Minimax given exactly the same information and number of node expansions.
- But what about Chess?

Chess

What is it about the nature of the Chess search space that makes it so difficult for sampling-style algorithms such as UCT? We will look at the role of (shallow) search traps.

Search Traps

A state is "at risk" when the player to move has a move that leads into a search trap: a position from which the opponent can force a win within a small number of plies. (The slide shows a level-3 search trap for White.)
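
The trap notion can be made concrete on a toy game. The definition below (a move is a level-k trap if playing it hands the opponent a forced win within k plies) is our paraphrase of the talk's notion, demonstrated on take-1-or-2 Nim rather than Chess; all names are illustrative.

```python
def forced_win_within(stones, plies):
    """Can the player to move guarantee taking the last stone within
    `plies` plies? (Take-1-or-2 Nim, last stone wins: a tiny stand-in
    for Chess, since the trap definition only needs forced wins.)"""
    if plies <= 0 or stones == 0:
        return False
    for m in (1, 2):
        if m > stones:
            continue
        rest = stones - m
        if rest == 0:
            return True        # we take the last stone right now
        # Otherwise we win within the horizon only if EVERY opponent
        # reply leaves us another forced win (and never lets them finish).
        replies = [o for o in (1, 2) if o <= rest]
        if all(rest - o > 0 and forced_win_within(rest - o, plies - 2)
               for o in replies):
            return True
    return False

def is_level_k_trap(stones, move, k):
    """A move is a level-k search trap if, after playing it, the opponent
    has a forced win within k plies."""
    return forced_win_within(stones - move, k)
```

From 7 stones, taking 2 is a trap, but only a depth-3 search sees it: the opponent's win takes 3 plies, so a 1-ply look-ahead misses it. This mirrors why shallow traps are easy to detect and deeper ones are not.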

Search Traps in Chess

Measured on both random Chess positions and positions from grandmaster games: search traps in Chess are sprinkled throughout the search space.

Identifying Traps

Given the ubiquity of traps in Chess, how good is UCT at detecting them?
- Run UCT search on states at risk
- Examine the values to which the utilities of the children of the root node converge
- Compare these to the recommendations of a deep Minimax search (the "gold standard")

Identifying Traps

Utility assigned by UCT to the move it thinks is best, to the move deemed best by Minimax, and to the trap move (true utility of the trap move: -1):

                 UCT-best    Minimax-best    Trap move
  Level-1 trap    -0.083        -0.092         -0.250
  Level-3 trap    +0.020        +0.013         -0.012
  Level-5 trap    -0.056        -0.063         -0.066
  Level-7 trap    +0.011        +0.009         +0.004

For very shallow traps, UCT has little trouble distinguishing between good and bad moves. With deeper traps, good and bad moves start looking the same to UCT, and UCT begins to make fatal mistakes.

UCT Conclusions

- A promising alternative to minimax for adversarial search.
- Highly domain-independent.
- Hybrid strategies are needed when search traps are present.
- Currently exploring applications to generic reasoning using SAT and QBF.
- Also, applications in general optimization and constraint reasoning.
- The exploration-exploitation strategy has potential in any branching search setting!