TRIAL-BASED HEURISTIC TREE SEARCH FOR FINITE HORIZON MDPS. Thomas Keller and Malte Helmert Presented by: Ryan Berryhill


Outline: Motivation, Background, the THTS framework, THTS algorithms, Results.

Motivation. The work advances the state of the art in finite-horizon MDP search by combining Monte Carlo planning with heuristics. A finite-horizon MDP is a specialization of MDPs. The paper unifies existing techniques (Monte Carlo planning, heuristic search, dynamic programming) into a common framework: how can we talk about algorithms for this problem in a unified way? The focus is on anytime-optimal algorithms, which converge towards an optimal solution given enough time and give reasonably good results whenever they are stopped.

Background: finite-horizon MDPs. A finite-horizon MDP is an MDP with an added parameter H, the horizon: there are at most H transitions between the initial state and a terminal state. The horizon can be thought of as the agent's lifespan. The search tree contains decision nodes, where the agent makes a decision, and chance nodes, where the environment makes a decision. The planning problem is to maximize the reward obtained within H steps. The solution is a non-stationary policy: the agent may act differently if it has 5 seconds to live versus 50 years.
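To make the non-stationarity concrete, below is a minimal backward-induction sketch in Python; the representation of states, actions, transitions, and rewards is a hypothetical one chosen for illustration, not anything from the slides.

# Minimal backward-induction sketch for a finite-horizon MDP.
# Hypothetical layout: P[s][a] is a list of (probability, next_state) pairs,
# R[s][a] is the immediate reward, actions(s) lists the actions available in s.

def backward_induction(states, actions, P, R, H):
    """Compute an optimal non-stationary policy, indexed by steps remaining."""
    V = {s: 0.0 for s in states}           # value with 0 steps remaining
    policy = {}                            # policy[(h, s)] -> best action with h steps left
    for h in range(1, H + 1):
        V_next = {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions(s):
                q = R[s][a] + sum(p * V[s2] for p, s2 in P[s][a])
                if q > best_q:
                    best_a, best_q = a, q
            if best_a is None:             # terminal state: no actions available
                best_q = 0.0
            policy[(h, s)] = best_a
            V_next[s] = best_q
        V = V_next
    return policy

The same state can get a different best action for different values of h, which is exactly the "5 seconds versus 50 years" point above.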

Rollout-Based Monte Carlo Planning. Run a number of episodes starting from the current state; episodes are generated by sampling actions in each state visited, and the best action observed across all episodes is taken. The simplest approach samples actions uniformly. UCT instead treats each state as a multi-armed bandit when sampling and uses the UCB1 algorithm to solve the bandit problem. UCB1 takes the action $a$ that maximizes $\bar{X}_a + C\sqrt{\ln n / n_a}$, where $\bar{X}_a$ is the average reward observed for $a$, $n_a$ is the number of times $a$ has been tried, $n$ is the total number of visits to the state, and $C$ is an exploration constant.
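A minimal sketch of UCB1 action selection as UCT uses it; the node fields (children, visits, total_reward) and the value of C are placeholders of mine, not the paper's notation.

import math
import random

def ucb1_select(decision_node, C=1.41):
    """UCB1 action selection (sketch): among the children of a decision node,
    pick the one maximizing mean reward + C * sqrt(ln(parent visits) / child visits).
    Untried children are chosen first. Field names are illustrative placeholders."""
    untried = [c for c in decision_node.children if c.visits == 0]
    if untried:
        return random.choice(untried)
    return max(
        decision_node.children,
        key=lambda c: c.total_reward / c.visits
        + C * math.sqrt(math.log(decision_node.visits) / c.visits),
    )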

THTS Framework. The goal is to make algorithms fit into a single framework. An algorithm is specified by five ingredients: a heuristic function, a backup function, action selection, outcome selection, and a trial length.

Example. A grid with rewards; assume H = 5. The actions move in the four directions, with a 50% chance of ending up where you wanted to go and a 25% chance of going to each of the two neighboring cells (the slide illustrates this for going up, with outcome probabilities 0.5, 0.25, 0.25).
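For concreteness, the noisy movement could be sampled as below; this is my reading of the slide's dynamics (50% intended direction, 25% for each of the two perpendicular directions), and the clamping at the grid boundary is an additional assumption.

import random

# Assumed movement noise: 0.5 intended direction, 0.25 each perpendicular slip.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
SLIPS = {"U": ("L", "R"), "D": ("L", "R"), "L": ("U", "D"), "R": ("U", "D")}

def sample_successor(x, y, action, width, height):
    """Sample the next cell for a noisy move on a width x height grid."""
    r = random.random()
    if r < 0.5:
        direction = action
    elif r < 0.75:
        direction = SLIPS[action][0]
    else:
        direction = SLIPS[action][1]
    dx, dy = MOVES[direction]
    return min(max(x + dx, 0), width - 1), min(max(y + dy, 0), height - 1)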

Example (a sequence of figure-only slides stepping through a single trial on the grid and the tree it builds)

Example (tree diagram: the root decision node (1,1) with chance-node children for actions U, D, L, and R; the outcomes of one chance node, with probabilities 0.25, 0.5, and 0.25, lead to the decision nodes (0,0), (0,1), and (0,2), which in turn have chance-node children such as (0,0) U and (0,0) R)

Example. Say the trial took the shown actions and ended up at the cell marked O, collecting a reward of 2. We need to push this information back up the tree.

Example. Use standard techniques again, e.g. Monte Carlo backups, which average over all previous trials. In this example, the agent learns that the expected reward of going up in position (4,2) at time 5 is 2, that the expected reward of state (4,2) at time 5 is 2, that going right at (3,1) at time 4 is worth 2, and so on.
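A small sketch of what a Monte Carlo backup might look like after a trial, keeping a running average of the return-to-go at every node on the trial path; the node fields (visits, value) are illustrative placeholders, not the authors' implementation.

def monte_carlo_backup(trial_path, rewards):
    """Monte Carlo backup after one trial: trial_path[i] is the i-th node
    visited and rewards[i] the immediate reward collected when leaving it.
    Each node's value becomes the average return collected from that node
    to the end of the trial, over all trials that passed through it."""
    return_to_go = 0.0
    for node, reward in reversed(list(zip(trial_path, rewards))):
        return_to_go += reward                                   # return from this node onward
        node.visits += 1
        node.value += (return_to_go - node.value) / node.visits  # running mean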

Example (tree diagrams showing the backed-up values after successive trials: after the first trial every node on the trial path has value 2, and later trials update the running averages at each node)

THTS Framework. Now we can describe this simple THTS algorithm: heuristic: blind; backup: Monte Carlo; action selection: uniform random; outcome selection: Monte Carlo; trial length: run until a terminal state is hit. How do other algorithms vary? The framework makes it easy to describe any algorithm that fits into it, as sketched below.
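The sketch below shows one way the five ingredients of this simple algorithm could be plugged into a generic trial loop; the node interface used here (children, value, prob, is_terminal, expand) is a placeholder of mine, not the authors' implementation.

import random

def blind_heuristic(node):                        # heuristic: blind
    return 0.0

def uniform_action_selection(decision_node):      # action selection: uniform random
    return random.choice(decision_node.children)

def mc_outcome_selection(chance_node):            # outcome selection: Monte Carlo
    if not chance_node.children:
        chance_node.expand()                      # lazily create outcome nodes
    weights = [child.prob for child in chance_node.children]
    return random.choices(chance_node.children, weights=weights)[0]

def trial_finished(node):                         # trial length: until terminal is hit
    return node.is_terminal

def run_trial(root, heuristic=blind_heuristic,
              select_action=uniform_action_selection,
              select_outcome=mc_outcome_selection,
              finished=trial_finished,
              backup=None):
    """Walk down the tree until the trial-length criterion fires, initializing
    newly expanded nodes with the heuristic, then back values up the path."""
    path, node = [root], root
    while not finished(node):
        if not node.children:
            node.expand()                         # lazily create action nodes
            for child in node.children:
                child.value = heuristic(child)    # heuristic initialization
        node = select_action(node)                # decision node -> chance node
        path.append(node)
        node = select_outcome(node)               # chance node -> decision node
        path.append(node)
    if backup is not None:
        backup(path)                              # e.g. Monte Carlo backups along the path

Swapping any of the five arguments changes the algorithm without touching the trial loop, which is the point of the framework.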

UCT as a THTS algorithm: heuristic: any reasonable choice (even inadmissible); backup: Monte Carlo; action selection: UCB1; outcome selection: Monte Carlo; trial length: run until a terminal state is hit.

Max-UCT. Uses a different backup function, called Max-Monte-Carlo: the backup for a decision node is based on its best child rather than on the average over all trials, which is generally preferable. (The slide contrasts the Monte Carlo and Max-Monte-Carlo backups side by side.)
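A sketch of the Max-Monte-Carlo idea for decision nodes (chance nodes keep their Monte Carlo averages); the field names are illustrative placeholders, not the authors' code.

def max_monte_carlo_backup(decision_node):
    """Max-Monte-Carlo backup for a decision node (sketch): take the value of
    the best already-visited child instead of the average over all trials."""
    visited = [c for c in decision_node.children if c.visits > 0]
    if visited:
        decision_node.value = max(c.value for c in visited)
    decision_node.visits += 1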

Max-UCT (tree diagrams repeating the example with Max-Monte-Carlo backups, where each decision node takes the value of its best child rather than the average)

DP-UCT. Modifies the backup function for chance nodes to a Partial Bellman backup, a step towards the full Bellman backup used in dynamic programming, but without having to explicate every possible outcome. (The slide contrasts the Max-Monte-Carlo and Partial Bellman backups side by side.)
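A sketch of a Partial Bellman backup for a chance node: each outcome that has already been explicated is weighted by its transition probability, renormalized over the explicated outcomes, so unseen outcomes never need to be enumerated. The field names (prob, visits, value, immediate_reward) are placeholders of mine.

def partial_bellman_backup(chance_node):
    """Partial Bellman backup for a chance node (sketch)."""
    explicated = [c for c in chance_node.children if c.visits > 0]
    total_prob = sum(c.prob for c in explicated)
    if total_prob > 0:
        chance_node.value = chance_node.immediate_reward + sum(
            (c.prob / total_prob) * c.value for c in explicated   # renormalized weights
        )
    chance_node.visits += 1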

DP-UCT (tree diagrams repeating the example with Partial Bellman backups at the chance nodes)

UCT*. We don't want a complete policy, just the next decision, and uncertainty grows with distance from the root node, so things closer to the root are more important. UCT builds a tree skewed towards the more promising parts of the solution space because of the UCB1 action selection, but it does not consider depth as a deterrent. UCT* is DP-UCT with a changed trial length: a trial ends as soon as a new node is expanded, which encourages more exploration in the shallower parts of the tree.
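To make the changed trial length concrete, the stopping criterion from the earlier trial-loop sketch might look as follows for UCT*; the node fields are placeholders, not the authors' code.

def uct_star_trial_finished(node):
    """UCT* trial length (sketch): the trial ends as soon as it reaches a node
    that has never been visited before, or a terminal state. Plugging this in
    place of the run-until-terminal criterion turns DP-UCT into UCT*."""
    return node.is_terminal or node.visits == 0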

Results. IPPC 2011 benchmarks, with 10 problems per domain; results are averaged over 100 runs. The same inadmissible heuristic is used in all planners. The algorithms are compared against Prost, the winner of IPPC 2011.

Results - Summary

Domain      UCT   Max-UCT  DP-UCT  UCT*  Prost
ELEVATORS   0.93  0.97     0.97    0.97  0.93
SYSADMIN    0.66  0.71     0.65    1.00  0.82
RECON       0.99  0.88     0.89    0.88  0.99
GAME        0.88  0.90     0.89    0.98  0.93
TRAFFIC     0.84  0.86     0.87    0.99  0.93
CROSSING    0.85  0.96     0.96    0.98  0.82
SKILL       0.93  0.95     0.98    0.97  0.97
NAVIGATION  0.81  0.66     0.98    0.96  0.55
Total       0.86  0.86     0.90    0.97  0.87

Results: Anytime Performance (graphic borrowed from Thomas Keller's presentation at ICAPS 2013).

Conclusion. THTS is a unified framework for describing this type of algorithm; an algorithm must specify five elements. Three novel algorithms result: Max-UCT, which better handles highly destructive actions; DP-UCT, which gains the benefits of dynamic programming without the expense; and UCT*, which encourages more exploration close to the decision at hand.