Game Architecture. 4/8/16: Multiprocessor Game Loops



Monolithic
- Dead simple to set up, but it can get messy.
- Flow of control can be complex.
- The top level may have too much knowledge of underlying systems (gross bubble-up effects, like UT Actor).
- Tough to maintain.

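Cooperative Tasks

    #include <vector>

    class Task {
    public:
        virtual ~Task() = default;
        virtual void Run() = 0;
    };

    class Renderer : public Task {
    public:
        void Run() override; // signature must match Task::Run to override it
    };

    class TaskManager {
    public:
        void AddTask(Task* task) { tasks_.push_back(task); }
        void RunTasks();
    private:
        std::vector<Task*> tasks_; // not owned; run in insertion order
    };

    void TaskManager::RunTasks() {
        for (Task* task : tasks_) task->Run();
    }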

Cooperative Tasks
- Flexible, but clarity suffers.
- There can be too much flexibility.
- It is difficult to discern what happens in what order by examining the code.

Pre-emptive

    void InputThread()      { while (1) input(); }
    void SimulationThread() { while (1) simulate(); }
    void RenderThread()     { while (1) render(); }
    void SoundThread()      { while (1) sound(); }

Pre-emptive
- Tough to get right.
- Complex interprocess communication.
- Deadlocks, race conditions.
- Questionable performance if used extensively.
- But increasingly parallel hardware makes this a major area of focus.
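The two hazards named on this slide can be sketched in a few lines of standard C++. This is an illustration only, not code from the lecture: `SharedCounter`, `count_with_threads`, `Account`, and `transfer` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// A counter shared between threads. Without the mutex, the threads'
// read-modify-write sequences could interleave and lose updates (a race).
struct SharedCounter {
    std::mutex lock;
    long value = 0;

    void increment() {
        std::lock_guard<std::mutex> guard(lock);
        ++value;
    }
};

// Runs `threads` threads that each increment one counter `per_thread` times.
long count_with_threads(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    SharedCounter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) counter.increment();
        });
    }
    for (std::thread& th : pool) th.join();
    return counter.value;
}

struct Account {
    std::mutex lock;
    int balance = 0;
};

// Deadlock hazard: if one thread locked `from` then `to` while another locked
// `to` then `from`, each could wait on the other forever. std::lock acquires
// both mutexes using a deadlock-avoidance algorithm.
void transfer(Account& from, Account& to, int amount) {
    assert(&from != &to); // locking the same mutex twice is undefined behavior
    std::lock(from.lock, to.lock);
    std::lock_guard<std::mutex> hold_from(from.lock, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.lock, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}
```

Calling `count_with_threads(4, 10000)` should return 40000 on every run. If the `lock_guard` in `increment` is removed, the program has a data race and the total may come up short.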

Multiprocessor Game Loops
- In 2004, the microprocessor industry hit a brick wall due to heat dissipation problems.
- The industry shifted its focus to multicore processors.
- Another painful shift (after all that graphics nonsense!): multithreaded program design is much harder than single-threaded design.
- By 2008, most studios had completed the gradual transition.

[Figure: Xbox 360 system block diagram.]
- CPU: three cores (Core0, Core1, Core2), each with L1D and L1I caches, and a 1 MB L2.
- Memory: 512 MB DRAM, with memory controllers MC0 and MC1.
- GPU: 3D core, 10 MB EDRAM, BIU/IO interface, video out.
- I/O chip: XMA decoder, SMC, DVD (SATA), HDD port (SATA), front controllers (2 USB), wireless controllers, MU ports (2 USB), rear panel USB, Ethernet, IR, audio out, flash, system control.
- Analog chip: video out.

Memory Caches
- A cache is just a bank of memory that the CPU can read from and write to much more quickly than main RAM.
- Cache memory typically uses the fastest (and most expensive) technology available.
- Cache memory is located as physically close as possible to the CPU core, typically on the same die.
- Cache memory is usually quite a bit smaller than main RAM.

Memory Caches
- A cache improves memory access performance by keeping local copies of the chunks of data that the program accesses most frequently.
- If the data requested by the CPU is already in the cache, it can be provided to the CPU very quickly, on the order of tens of cycles (a hit).
- If the data is not already present in the cache, it must be fetched into the cache from main RAM (a miss).
- Reading data from main RAM can take thousands of cycles, so the cost of a cache miss is very high indeed.

I$ and D$
- Both instructions and data are cached.
- The instruction cache (I$) is used to preload executable machine code before it runs.
- The data cache (D$) is used to speed up reading and writing of data to main RAM.
- The two caches are always physically distinct.

Multilevel Caches
- There is a fundamental trade-off between cache latency and hit rate.
- Larger caches mean higher hit rates, but larger caches cannot be located as close to the CPU, so they tend to be slower than smaller ones.
- Most game consoles employ at least two levels of cache.
- The CPU first tries to find the data it's looking for in the level 1 (L1) cache, which is small but has very low access latency.
- If the data isn't there, it tries the larger but higher-latency level 2 (L2) cache.
- Only if the data cannot be found in the L2 cache do we incur the full cost of a main memory access.
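The latency versus hit-rate trade-off can be made concrete with the usual average-access-time calculation. The sketch below is not from the lecture. The function name `average_access_cycles` is hypothetical, and the latencies and miss rates in the usage note are made-up illustrative values, not measurements of any console.

```cpp
#include <cassert>

// Average cycles per memory access for an L1 -> L2 -> main RAM hierarchy.
// Each miss rate is the fraction of accesses reaching that level that miss.
double average_access_cycles(double l1_hit_cycles, double l1_miss_rate,
                             double l2_hit_cycles, double l2_miss_rate,
                             double ram_cycles) {
    assert(l1_miss_rate >= 0.0 && l1_miss_rate <= 1.0);
    assert(l2_miss_rate >= 0.0 && l2_miss_rate <= 1.0);
    // An access that misses L1 pays the L2 latency, plus the RAM latency
    // if it also misses L2.
    const double l1_miss_penalty = l2_hit_cycles + l2_miss_rate * ram_cycles;
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty;
}
```

With assumed values of 4 cycles for L1, 30 for L2, 200 for RAM, a 10% L1 miss rate and a 20% L2 miss rate, the result is 4 + 0.1 × (30 + 0.2 × 200) = 11 cycles. Even a modest miss rate nearly triples the average cost relative to a pure L1 hit.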

Minimizing Misses
- The best way to avoid D$ misses is to organize your data in contiguous blocks that are as small as possible, and then access them sequentially.
- For the I$, keep your high-performance loops as small as possible in terms of code size, and avoid calling functions within your innermost loops.
- Keep the entire body of the loop in the cache the entire time the loop is running.
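One common way to apply the D$ advice is to split the field a hot loop reads into its own contiguous array. The sketch below is an illustration under assumptions, not code from the lecture: `ParticleAoS`, `ParticlesSoA`, and both sum functions are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Array-of-structures: each particle carries fields this loop never reads,
// so every cache line fetched holds only a little useful data.
struct ParticleAoS {
    float x, y, z;
    float color[4];
    float lifetime;
    int   flags;
};

float sum_x_aos(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles) sum += p.x;
    return sum;
}

// Structure-of-arrays: the x values are packed back to back, so a sequential
// walk uses every byte of each cache line it pulls in.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

float sum_x_soa(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float x : particles.x) sum += x;
    return sum;
}
```

Both functions compute the same result. On a platform with 4-byte floats and 64-byte cache lines, the second touches roughly one line per 16 values, while the first touches a line for every one or two particles. The actual speed difference depends on the hardware and should be measured.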

I$ Misses
- Keep high-performance code as small as possible, in terms of the number of machine language instructions.
- The compiler and linker take care of keeping our functions contiguous in memory.
- Avoid calling functions from within a performance-critical section of code.
- If you have to call a function, place it as close as possible to the calling function, preferably immediately before or after it, and never in a different translation (compilation) unit.
- Inlining? Inlining a small function can be a big performance boost. But too much inlining bloats the code, which can cause a performance-critical section to no longer fit within the cache.

Xbox 360
[Figure: the Xbox 360 system block diagram shown earlier, repeated: three CPU cores with L1D and L1I caches and a 1 MB L2, 512 MB DRAM, a GPU with 10 MB EDRAM, and the I/O and analog chips.]

PS3

PS4

PS4 hUMA: heterogeneous unified memory architecture

PS4 Cache Architecture
[Figure, built up over five slides: the PS4 memory hierarchy with access latencies.]
- CPU: eight cores (C0 to C7), in two clusters of four (C0 to C3 and C4 to C7).
- Registers: free.
- L1 I$ (32 KiB) and L1 D$ (32 KiB): 3 cycles.
- L2: 2 MiB in total, split into 1 MiB per cluster. The first slide labels L2 access as 30+ cycles; later slides give 26 cycles for the local cluster's L2 and 190 cycles for the other cluster's L2.
- Main RAM (8 GiB): 220+ cycles.
Tuesday, March 4, 14

PS4 Cache Architecture
[Figure: main RAM drawn as 64-byte (0x40) blocks at addresses 0x0000 to 0x0280 and 0x5000 to 0x5280, next to the cache.]
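The 0x40 spacing of the addresses in this figure corresponds to a 64-byte cache line. The sketch below shows how an address maps to a line, under one loudly stated assumption: a direct-mapped 32 KiB cache, chosen only because it is the simplest scheme to illustrate. The slides do not say how the PS4's caches map addresses, and all names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kLineSize  = 64;                     // bytes per line (0x40)
constexpr std::uint32_t kCacheSize = 32 * 1024;              // assumed 32 KiB cache
constexpr std::uint32_t kNumLines  = kCacheSize / kLineSize; // 512 lines

// Address of the first byte of the cache line containing `address`.
constexpr std::uint32_t line_base(std::uint32_t address) {
    return address & ~(kLineSize - 1);
}

// Slot in a direct-mapped cache (an assumption) that `address` maps to.
constexpr std::uint32_t line_index(std::uint32_t address) {
    return (address / kLineSize) % kNumLines;
}

// Two addresses conflict when they need the same slot for different lines.
constexpr bool conflicts(std::uint32_t a, std::uint32_t b) {
    return line_index(a) == line_index(b) && line_base(a) != line_base(b);
}

static_assert(line_base(0x503C) == 0x5000, "bytes 0x5000..0x503F share one line");
static_assert(conflicts(0x0000, 0x8000), "addresses 32 KiB apart evict each other");
```

Under this assumption, 0x5000 and 0x5004 are served by the same line fill, while 0x0000 and 0x8000 compete for slot 0 and evict each other every time they are accessed alternately.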


PS4 Optimization
- PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!

    U32 g_jobCount[6]; // one per core

PS4 Optimization
- PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!

    struct JobCount {
        U32 m_count;
        U8  m_padding[60];
    };
    JobCount g_jobCount[6]; // one per core
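Manual padding makes each counter 64 bytes, but each element only owns a whole cache line if the array itself starts on a 64-byte boundary. A possible alternative, sketched here with standard C++11 `alignas`, lets the compiler handle both size and alignment. The 64-byte line size is an assumption inferred from the 4 + 60 byte layout on the slide, and `kCacheLineSize` and `on_separate_lines` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64; // assumed from the 4 + 60 byte layout

// alignas rounds the struct's size up to a multiple of its alignment, so no
// manual padding array is needed and every element starts on a line boundary.
struct alignas(kCacheLineSize) JobCount {
    std::uint32_t m_count;
};

static_assert(sizeof(JobCount) == kCacheLineSize, "one JobCount per cache line");
static_assert(alignof(JobCount) == kCacheLineSize, "JobCount starts on a line boundary");

JobCount g_jobCount[6]; // one per core

// True if the two counters live in different cache lines.
bool on_separate_lines(const JobCount& a, const JobCount& b) {
    auto line = [](const JobCount& j) {
        return reinterpret_cast<std::uintptr_t>(&j) / kCacheLineSize;
    };
    return line(a) != line(b);
}
```

The `static_assert` lines turn the layout requirement into a compile-time check, so a later change to the struct cannot silently reintroduce line sharing between cores.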

PS4

Xbox One

Subtle Differences
- Memory type: The Xbox One uses DDR3 RAM, while the PS4 uses GDDR5, which gives the PS4 higher theoretical memory bandwidth.
- The Xbox One counteracts this to some degree by providing its GPU with a dedicated 32 MiB memory store, implemented as very high-speed eSRAM.

Subtle Differences
- Bus speeds: The buses in the Xbox One support higher-bandwidth data transfers than those of the PS4 (30 GB/sec vs. 20 GB/sec).
- GPU: The PS4's GPU is roughly equivalent to an AMD Radeon 7870, with 1152 parallel stream processors. The Xbox One's GPU is closer to an AMD Radeon 7790, supporting only 768 stream processors.
- The Xbox One's GPU runs at 853 MHz vs. 800 MHz for the PS4.

[Figure: fork/join on the main thread.]
Main Thread: Update Game Objects; fork into parallel Pose Blending tasks; join; Post Animation Game Object Update; fork into parallel Simulate / Integrate tasks; join; Ragdoll Physics etc.
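The fork/join pattern in this figure can be sketched with `std::thread`. This is a minimal illustration, not the lecture's implementation: `fork_join` and `integrate` are hypothetical names, and spawning fresh threads each frame is a simplification that a real engine would likely replace with a persistent pool.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Fork: split [0, count) into one contiguous chunk per worker and run `work`
// on each chunk in its own thread. Join: wait for every worker before returning.
void fork_join(std::size_t count, unsigned workers,
               const std::function<void(std::size_t, std::size_t)>& work) {
    assert(workers > 0);
    std::vector<std::thread> threads;
    const std::size_t chunk = (count + workers - 1) / workers; // ceiling division
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        if (begin >= count) break;
        const std::size_t end = (begin + chunk < count) ? begin + chunk : count;
        threads.emplace_back([&work, begin, end] { work(begin, end); });
    }
    for (std::thread& t : threads) t.join();
}

// Stand-in for the "Simulate / Integrate" step: advance positions by velocity * dt.
void integrate(std::vector<float>& position, const std::vector<float>& velocity,
               float dt, unsigned workers) {
    assert(position.size() == velocity.size());
    fork_join(position.size(), workers, [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) position[i] += velocity[i] * dt;
    });
}
```

Because each worker writes only to its own index range, no locking is needed. The join is the only synchronization point, which is what keeps this pattern easier to reason about than free-running threads.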

[Figure: one thread per engine subsystem; each entry lists that thread's work over a frame, in order.]
- Main Thread: HID; Update Game Objects; Kick off Animation; Post Animation Game Object Update; Kick Dynamics Sim; Ragdoll Physics; Finalize Animation; Finalize Collision; Other Processing (AI Planning, Audio Work, etc.); Kick Rendering (for next frame).
- Animation Thread: Sleep; Pose Blending; Sleep; Global Pose Calculation; Skin Matrix Palette Calculation; Sleep; Ragdoll Skinning.
- Dynamics Thread: Sleep; Simulate and Integrate; Sleep; Broad Phase Coll.; Narrow Phase Coll.; Resolve Constraints.
- Rendering Thread: Sleep; Visibility Determination; Sort; Submit Primitives; Wait for GPU; Full-Screen Effects; Wait for V-Blank; Swap Buffers.

[Figure: the frame's work split into jobs; the PPU kicks jobs that run on the SPUs.]
- PPU: HID; Update Game Objects; Kick Animation Jobs; Post Animation Game Object Update; Kick Dynamics Jobs; Ragdoll Physics; Finalize Animation; Finalize Collision; Other Processing (AI Planning, Audio Work, etc.); Kick Rendering (for next frame).
- SPU0: Visibility; Visibility; Sort; Sort; Visibility; Pose Blend; Sort; Pose Blend; Physics Sim; Submit Primitives; Global Pose; Submit Primitives; Global Pose; Collisions / Constraints; Matrix Palette; Ragdoll Skinning.
- SPU1: Visibility; Visibility; Visibility; Sort; Visibility; Pose Blend; Pose Blend; Pose Blend; Sort; Physics Simulation; Global Pose; Broad Phase; Narrow Phase; Narrow Phase; Matrix Palette; Ragdoll Skinning.
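The job model in this figure, where a main processor kicks small jobs and worker processors pick them up, can be approximated on an ordinary multicore CPU with a shared queue. The sketch below is a simplified analogue using standard C++ threads, not the PS3's actual SPU job API. `JobSystem`, `kick`, and `wait_all` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class JobSystem {
public:
    explicit JobSystem(unsigned workers) {
        assert(workers > 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker_loop(); });
    }

    ~JobSystem() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            shutting_down_ = true;
        }
        wake_workers_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    // "Kick" a job: any idle worker may pick it up.
    void kick(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        wake_workers_.notify_one();
    }

    // Block the caller until every kicked job has finished.
    void wait_all() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Sleep until there is work or the system is shutting down.
                wake_workers_.wait(lock, [this] { return shutting_down_ || !jobs_.empty(); });
                if (jobs_.empty()) return; // shutting down and nothing left to run
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // run outside the lock so other workers can dequeue
            {
                std::lock_guard<std::mutex> guard(mutex_);
                if (--pending_ == 0) all_done_.notify_all();
            }
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_workers_;
    std::condition_variable all_done_;
    std::queue<std::function<void()>> jobs_;
    unsigned pending_ = 0;
    bool shutting_down_ = false;
    std::vector<std::thread> threads_;
};
```

A frame would kick its animation jobs, do other main-thread work, and call `wait_all()` where the figure shows the finalize steps. A single mutex-protected queue is the simplest design and was chosen here for clarity; it can become a bottleneck when jobs are very small.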

Async Design

    while (true) { // main game loop
        // ...
        // Cast a ray to see if the player has line of sight
        // to the enemy.
        RayCastResult r = castray(playerpos, enemypos);
        // Now process the results...
        if (r.hitsomething() && isenemy(r.gethitobject())) {
            // Player can see the enemy.
            // ...
        }
        // ...
    }

Async Design

    while (true) { // main game loop
        // ...
        // Cast a ray to see if the player has line of sight
        // to the enemy.
        RayCastResult r;
        requestraycast(playerpos, enemypos, &r);
        // Do other unrelated work while we wait for the
        // other CPU to perform the ray cast for us.
        // ...
        // OK, we can't do any more useful work. Wait for the
        // results of our ray cast job. If the job is
        // complete, this function will return immediately.
        // Otherwise, the main thread will idle until the
        // results are ready...
        waitforraycastresults(&r);
        // Process results...
        if (r.hitsomething() && isenemy(r.gethitobject())) {
            // Player can see the enemy.
            // ...
        }
        // ...
    }

Async Design

    RayCastResult r;
    bool rayjobpending = false;
    while (true) { // main game loop
        // ...
        // Wait for the results of last frame's ray cast job.
        if (rayjobpending) {
            waitforraycastresults(&r);
            // Process results...
            if (r.hitsomething() && isenemy(r.gethitobject())) {
                // Player can see the enemy.
                // ...
            }
        }
        // Cast a new ray for next frame.
        rayjobpending = true;
        requestraycast(playerpos, enemypos, &r);
        // Do other work...
        // ...
    }
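The deferred-result pattern from this last slide maps fairly directly onto `std::async` and `std::future` in standard C++. The sketch below is one possible rendering, not the lecture's code. `Vec3`, `castRayBlocking`, `runFrames`, and the contents of `RayCastResult` are hypothetical stand-ins, and the ray cast itself is a placeholder that reports a hit whenever the two points differ.

```cpp
#include <cassert>
#include <future>

struct Vec3 { float x, y, z; };

struct RayCastResult {
    bool hit = false;
    int  objectId = -1;
    bool hitSomething() const { return hit; }
};

// Placeholder for a real collision query (an assumption for illustration):
// reports a hit on object 42 whenever the two points differ.
RayCastResult castRayBlocking(Vec3 from, Vec3 to) {
    RayCastResult r;
    r.hit = (from.x != to.x) || (from.y != to.y) || (from.z != to.z);
    r.objectId = r.hit ? 42 : -1;
    return r;
}

// Runs `frames` iterations of the loop. Each frame consumes the ray cast
// requested on the previous frame, then requests a new one.
// Returns how many frames processed a hit.
int runFrames(int frames, Vec3 playerPos, Vec3 enemyPos) {
    assert(frames >= 0);
    std::future<RayCastResult> pending; // empty until the first request
    int framesWithHit = 0;
    for (int frame = 0; frame < frames; ++frame) {
        if (pending.valid()) {
            // Blocks only if last frame's job has not finished yet.
            RayCastResult r = pending.get();
            if (r.hitSomething()) ++framesWithHit;
        }
        // Request next frame's ray cast; it runs on another thread.
        pending = std::async(std::launch::async, castRayBlocking, playerPos, enemyPos);
        // ... do other work for this frame ...
    }
    return framesWithHit;
}
```

The `valid()` check plays the role of the `rayjobpending` flag. The cost of this design is one frame of latency: the result used on frame N describes the positions from frame N-1, which is usually acceptable for a line-of-sight test.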