VISUAL ANALOGIES BETWEEN ATARI GAMES FOR STUDYING TRANSFER LEARNING IN RL

Doron Sobol (1), Lior Wolf (1,2) & Yaniv Taigman (2)
(1) School of Computer Science, Tel-Aviv University
(2) Facebook AI Research

ABSTRACT

In this work, we ask the following question: can visual analogies, learned in an unsupervised way, be used to transfer knowledge between pairs of games, and even to play one game using an agent trained for another? We attempt to answer this research question by creating visual analogies between a pair of games: a source game and a target game. For example, given a video frame in the target game, we map it to an analogous state in the source game and then attempt to play using a trained policy learned for the source game. We demonstrate convincing visual mappings between four pairs of games (eight mappings). These mappings are used to evaluate three transfer learning approaches. The code and models are available at https://github.com/doronsobol/Visual_analogies_for_RL_transfer_Learning

1 INTRODUCTION

One of the most fascinating capabilities of humans is the ability to generalize between related but vastly different tasks. A surfer will be able to ride a snowboard after much less training than a beginner in board sports; a gamer experienced with adventure games will solve escape rooms long before the hour is up; and a veteran tennis player will often top the office's ping-pong league.

The goal of this work is to check whether a Reinforcement Learning (RL) agent can gain such an ability: an actor is trained and evaluated on a target task after learning a source task in the typical reinforcement learning setting. The actor is also provided with mappers that, given a frame in either game, generate the analogous frame in the other game. The bidirectional mappers between the video sequences are based on recent approaches to the task of finding visual analogies, in combination with an added regularization term. We evaluate our methods on two groups of games and are able to successfully learn the mappers between all same-group pairs.

Building on the existence of these mappers, we propose several Transfer Learning (TL) techniques for utilizing information from the source game when playing the target game. These methods include techniques such as data transfer and distillation. Unfortunately, none of these methods seems to be consistently helpful, with the possible exception of the first, which uses scenes that are visually adapted from the source game to the domain of the target game.

Despite the moderate success, we believe that our work presents value to the community in multiple ways. First, we are successful at the challenging video-conversion task, which could benefit future efforts. Second, we devise a few possible TL methods that almost work. Third, a critical view of the practical value of TL in the current RL landscape is seldom heard. Lastly, by sharing our results, code, and models, we hope to help others minimize wasted effort.

2 SETTINGS AND METHODS

To create visual analogies between a pair of games, we collect frames off-line. The actor that is used to play the game does not need to be an expert, and we do not imitate it. However, the states must be diverse enough; therefore, the actor is required to remain in the game for a while.
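To make this collection step concrete, the following is a minimal sketch (not the authors' code) of gathering off-line frames with a simple, non-expert actor. It assumes the classic OpenAI Gym Atari API; the random actor, environment name, episode count, and output directory are illustrative assumptions.

    # Sketch: collect diverse off-line frames from an Atari game (assumed setup, not the paper's code).
    import os
    import gym                      # classic Gym API: step() returns (obs, reward, done, info)
    from PIL import Image

    def collect_frames(env_name="PongNoFrameskip-v4", out_dir="frames/pong",
                       num_episodes=20, max_steps=5000):
        os.makedirs(out_dir, exist_ok=True)
        env = gym.make(env_name)
        saved = 0
        for ep in range(num_episodes):
            obs = env.reset()
            for t in range(max_steps):
                # A weak actor is enough: we only need diverse states, not expert play.
                obs, reward, done, info = env.step(env.action_space.sample())
                Image.fromarray(obs).save(os.path.join(out_dir, f"{ep:03d}_{t:05d}.png"))
                saved += 1
                if done:
                    break
        env.close()
        return saved

In practice, any actor that survives long enough to visit a diverse set of states can be substituted for the random policy above.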

Figure 1: The obtained attention maps for a frame from each of the five tested games (Pong, Tennis, Breakout, Assault, and Demon-Attack).

Figure 2: Samples of consecutive frames mapped from the source game (left) to the target game (right), for Pong as source with Breakout as target, and Breakout as source with Pong as target. See Appendix A for the other games.

These frames are used to learn, in an unsupervised way, a mapper $G : s \rightarrow t$ between the frames of the source game $s$ and the frames of the target game $t$. We also learn the mapper in the reverse direction, $G^{-1} : t \rightarrow s$. We assume that we have the ability to train an agent in the source game without any limitation on the number of training episodes. Our goal is to be as efficient as possible in the training of the target game.

2.1 LEARNING CROSS-DOMAIN VIDEO MAPPING

The unsupervised learning step requires prior processing of the data, which includes the following steps: (a) rotating the frames, if needed, so that the main axis of motion is horizontal; (b) applying an attention operator to the frame, by subtracting either the median pixel value at each location or the median pixel value of the entire image (depending on the game), and then applying a threshold to obtain a binary image; (c) applying a dilation filter to the image to enlarge the relevant objects; and (d) creating three channels by cloning the dilated image and applying two levels of blurring. The resulting images are shown in Fig. 1; a sketch of these steps is given at the end of this subsection.

To train the mapper functions $G$ and $G^{-1}$, we use the network architecture of UNIT GAN with a cycle-consistency loss (Liu et al., 2017). By itself, this method leads to mode collapse. To fix this, we add the gradient-penalty regularization term of improved WGAN (Gulrajani et al., 2017), adapted to the problem of cross-domain mapping:

$L_{GP} = \mathbb{E}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big]$,

where $\hat{x}$ is either $\hat{s} = \epsilon s + (1-\epsilon) G^{-1}(t)$ or $\hat{t} = \epsilon t + (1-\epsilon) G(s)$, $D$ is the GAN's discriminator, and $\epsilon \sim U[0, 1]$.
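A minimal sketch of preprocessing steps (a)-(d), assuming NumPy and OpenCV; the threshold, dilation kernel, and blur radii below are illustrative values rather than the paper's exact settings.

    # Sketch of the attention preprocessing of Sec. 2.1 (assumed parameter values).
    import cv2
    import numpy as np

    def preprocess(frames, rotate=False, per_pixel_median=True, thresh=20):
        """frames: uint8 array of shape (N, H, W) holding grayscale game frames."""
        if rotate:  # (a) make the main axis of motion horizontal
            frames = np.rot90(frames, k=1, axes=(1, 2))
        # (b) attention: subtract a median background and threshold to a binary image
        if per_pixel_median:
            background = np.median(frames, axis=0)                      # median per pixel location
        else:
            background = np.median(frames, axis=(1, 2), keepdims=True)  # median of the entire image
        attention = (np.abs(frames.astype(np.float32) - background) > thresh).astype(np.uint8) * 255
        kernel = np.ones((3, 3), np.uint8)
        out = []
        for a in attention:
            dilated = cv2.dilate(a, kernel, iterations=2)               # (c) enlarge the relevant objects
            blur1 = cv2.GaussianBlur(dilated, (5, 5), 0)                # (d) two levels of blurring...
            blur2 = cv2.GaussianBlur(dilated, (11, 11), 0)
            out.append(np.stack([dilated, blur1, blur2], axis=-1))      # ...stacked as three channels
        return np.stack(out)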

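The adapted gradient-penalty term $L_{GP}$ can also be written out directly in code. Below is a PyTorch sketch for the target-domain discriminator, using the interpolates $\hat{t} = \epsilon t + (1-\epsilon) G(s)$; it is a generic improved-WGAN penalty written for the cross-domain setting, not the authors' exact implementation.

    # Sketch of the cross-domain gradient penalty L_GP of Sec. 2.1 (PyTorch, assumed shapes).
    import torch

    def gradient_penalty(D, target_frames, mapped_source_frames):
        """D: target-domain discriminator; inputs are (N, C, H, W) image batches."""
        eps = torch.rand(target_frames.size(0), 1, 1, 1, device=target_frames.device)
        # Interpolate between real target frames t and mapped source frames G(s).
        x_hat = eps * target_frames + (1.0 - eps) * mapped_source_frames.detach()
        x_hat.requires_grad_(True)
        d_out = D(x_hat)
        grads = torch.autograd.grad(outputs=d_out, inputs=x_hat,
                                    grad_outputs=torch.ones_like(d_out),
                                    create_graph=True, retain_graph=True)[0]
        grads = grads.view(grads.size(0), -1)
        return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

    # The symmetric term for the source-domain discriminator uses s and G^{-1}(t):
    #   gradient_penalty(D_source, source_frames, G_inv(target_frames))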
2.2 TRANSFER LEARNING METHODS

Training the strategy π_t for the target game and the strategy π_s of the source game (when used) is done with the asynchronous advantage actor-critic (A3C) algorithm (Mnih et al., 2016). The network architecture consists of four convolutional layers followed by an LSTM layer and two fully connected layers for the predicted action and value. We tried various methods for transferring knowledge between games, including:

I. Data transfer for pretraining: We transform frames from the source game s using G. We train a policy π_t on these frames using the reward of the source game and a static mapping of actions in place of the regular source-game actions, and then fine-tune the resulting policy on the target game.

II. Continuous data transfer: Instead of pretraining π_t on G(s), we provide it with mixed samples from G(s) and t throughout the entire training process.

III. Distillation: Directly fine-tuning π_s failed, since it was trained on the source game. Fine-tuning π_s ∘ G^{-1} instead led to an overly complex network. We found it preferable to train a network with the same architecture as π_t to mimic π_s ∘ G^{-1} on unsupervised frames from the target game. We then continue to train this network using real data.

3 RESULTS

The experiments are conducted on five Atari games, split into two groups. The first group contains the games Breakout, Tennis, and Pong, in which the player controls a paddle and its goal is to hit the ball in order to achieve a certain objective. The second group contains Demon-Attack and Assault, in which the player controls a spaceship and needs to shoot the targets (similar to Space Invaders). We were not able to identify other potential pairs among the Atari games. The two groups give rise to four pairs of games, which yield eight transfer directions. Samples of the transferred frames obtained with the mapping method described in Sec. 2.1 can be found in Fig. 2 and in Appendix A. The mappings obtained seem to convey the semantics of the games.

We design a subjective rating scale for describing the level of success of the TL methods described in Sec. 2.2 on a given pair of games. A method is successful in transferring knowledge from the source game if reaching a certain level of performance requires fewer supervised training samples than the baseline method of vanilla training in the target domain. We distinguish between three levels of success: (1) upon convergence or upon reaching the maximum possible reward, the method that employs TL outperforms the baseline method; (2) the TL method achieves almost all levels of performance between random performance and the converged performance with fewer samples than the vanilla method; (3) the TL method achieves non-trivial levels of performance faster than the baseline method but then stops leading. We also use a star (*) to denote situations in which the TL method starts off, without any supervised samples from the target domain, at a level that is significantly better than random; this can happen with any level of success. Lastly, we use a dash (-) to indicate the lack of success.

Tab. 1 shows the level of success reached by the various methods, in comparison to the baseline method. While the scoring is subjective, the table suggests that data transfer for the purpose of pretraining is the only method that consistently outperforms the baseline. Appendix B contains the full training logs.

Table 1: The level of success (see text) reached by the various methods.

SOURCE        TARGET         DATA TRANSFER     CONTINUOUS       DISTILLATION
                             FOR PRETRAINING   DATA TRANSFER
BREAKOUT      PONG           *,2               -                -
PONG          BREAKOUT       *,2               2                -
TENNIS        PONG           3                 -                -
TENNIS        BREAKOUT       -                 -                -
BREAKOUT      TENNIS         1                 1                1
PONG          TENNIS         1                 1                1
ASSAULT       DEMON-ATTACK   1 or 2            -                -
DEMON-ATTACK  ASSAULT        2                 -                2

REFERENCES

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769-5779, 2017.

Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700-708, 2017.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

A VISUAL TRANSFER

Fig. 3 shows consecutive frames from the source games and their corresponding mappings in the target domain, obtained using the trained function G.

Figure 3: Samples of consecutive frames mapped from the source game (left) to the target game (right), for all eight directions: Pong as source with Breakout as target; Breakout as source with Pong as target; Tennis as source with Breakout as target; Breakout as source with Tennis as target; Tennis as source with Pong as target; Pong as source with Tennis as target; Demon-Attack as source with Assault as target; and Assault as source with Demon-Attack as target.

B TRAINING PROGRESS PER GAME

Fig. 4 shows the training graphs of the transfer learning methods. Each point on the graphs is an average of samples of the model from the last 100K training states. The plotted results are the average of three independent runs.

Figure 4: A comparison of the training logs for the various TL methods, for each of the eight transfer directions (Breakout to Pong, Tennis to Pong, Pong to Breakout, Tennis to Breakout, Pong to Tennis, Breakout to Tennis, Demon-Attack to Assault, and Assault to Demon-Attack). The x-axis is the number of training steps, and the y-axis is the reward. The plots are averaged over three independent runs. The blue line is the baseline, red is distillation, yellow is continuous data transfer, and green is data transfer for pretraining.