Game Architecture 4/8/16: Multiprocessor Game Loops
Monolithic
Dead simple to set up, but it can get messy
Flow of control can be complex
Top level may have too much knowledge of underlying systems (gross bubble-up effects like UT's Actor)
Tough to maintain
Cooperative Tasks

class Task
{
public:
    virtual ~Task() = default;
    virtual void Run(float dt) = 0;
};

class Renderer : public Task
{
public:
    virtual void Run(float dt);
};

class TaskManager
{
public:
    void AddTask(Task* task);
    void RunTasks(float dt);
};

void TaskManager::RunTasks(float dt)
{
    foreach (task)
        task->Run(dt);
}
Cooperative Tasks
Flexible, but clarity suffers
Can be too much flexibility
What happens in what order is difficult to discern by examining the code
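A corrected, compilable sketch of the task scheme above. The float dt parameter is added so every task shares one signature (the original slide's Task::Run() and Renderer::Run(float) did not match), and CountingTask is a hypothetical task used only to demonstrate the manager:

```cpp
#include <vector>

// Base class for cooperative tasks: each task performs one slice
// of work per frame and voluntarily returns control.
class Task
{
public:
    virtual ~Task() = default;
    virtual void Run(float dt) = 0;
};

class TaskManager
{
public:
    void AddTask(Task* task) { m_tasks.push_back(task); }

    // Tasks run in the order they were registered. That ordering is
    // implicit in the AddTask() call sites, which is exactly the
    // clarity problem the slides warn about.
    void RunTasks(float dt)
    {
        for (Task* task : m_tasks)
            task->Run(dt);
    }

private:
    std::vector<Task*> m_tasks;
};

// Hypothetical task used only to demonstrate the manager.
class CountingTask : public Task
{
public:
    int framesRun = 0;
    void Run(float) override { ++framesRun; }
};
```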
Pre-emptive

void InputThread()      { while (1) input(); }
void SimulationThread() { while (1) simulate(); }
void RenderThread()     { while (1) render(); }
void SoundThread()      { while (1) sound(); }
Pre-emptive
Tough to get right
Complex interprocess communication
Deadlocks, race conditions
Questionable performance if used extensively
But, increasingly parallel hardware makes this a major area for focus
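A minimal sketch of the pre-emptive model using C++11 threads. The atomic flag and counter stand in for the real interprocess communication the slide warns about; the counter increment is a placeholder for simulate():

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> g_quit{false};
std::atomic<int>  g_framesSimulated{0};

// The simulation subsystem runs freely in its own thread; the OS
// pre-empts it at will. All shared state must be synchronized,
// which is where deadlocks and race conditions creep in.
void SimulationThread()
{
    while (!g_quit.load())
        g_framesSimulated.fetch_add(1); // stands in for simulate()
}

// Run the simulation thread until it has completed at least
// `frames` iterations, then shut it down cleanly.
void RunSimulationFor(int frames)
{
    std::thread sim(SimulationThread);
    while (g_framesSimulated.load() < frames)
        std::this_thread::yield();
    g_quit.store(true);
    sim.join();
}
```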
Multiprocessor Game Loops
In 2004, the microprocessor industry hit a brick wall due to heat dissipation problems
Focus shifted to multicore processors
Another painful shift (after all that graphics nonsense!): multithreaded program design is much harder than single-threaded
By 2008, most studios had completed the transition
[Diagram: Xbox 360 system architecture. Three-core CPU, each core with its own L1 I$ and L1 D$, sharing a 1 MB L2; 512 MB DRAM behind two memory controllers (MC0, MC1); GPU with 3D core and 10 MB EDRAM; I/O chip connecting the XMA decoder, SMC, analog chip, DVD (SATA), HDD port (SATA), front controllers (2 USB), wireless controllers, MU ports (2 USB), rear panel USB, Ethernet, IR, audio out, FLASH, system control and video out.]
Memory Caches
A cache is just a bank of memory that the CPU can read from and write to much more quickly than main RAM
Cache memory typically utilizes the fastest (and most expensive) memory technology available
Cache memory is located as physically close as possible to the CPU core, typically on the same die
Cache memory is usually quite a bit smaller in size than main RAM
Memory Caches
Improves memory access performance by keeping local copies in the cache of the chunks of data the program accesses most frequently
If the data requested by the CPU is already in the cache, it can be provided very quickly, on the order of tens of cycles (a hit)
If the data is not already present in the cache, it must be fetched into the cache from main RAM (a miss)
Reading data from main RAM can take thousands of cycles, so the cost of a cache miss is very high indeed
I$ and D$
Both instructions and data are cached
The instruction cache (I$) is used to preload executable machine code before it runs
The data cache (D$) is used to speed up reading and writing of data in main RAM
The two caches are always physically distinct
Multilevel Caches
There is a fundamental trade-off between cache latency and hit rate
Larger caches mean higher hit rates, but larger caches cannot be located as close to the CPU, so they tend to be slower than smaller ones
Most game consoles employ at least two levels of cache
The CPU first tries to find the data it's looking for in the level 1 (L1) cache (small, but very low access latency)
If the data isn't there, it tries the larger but higher-latency level 2 (L2) cache
Only if the data cannot be found in the L2 cache do we incur the full cost of a main memory access
Minimizing Misses
The best way to avoid D$ misses is to organize your data in contiguous blocks that are as small as possible and then access them sequentially
For the I$, keep your high-performance loops as small as possible in terms of code size, and avoid calling functions within your innermost loops
The goal is to keep the entire body of the loop in the cache for the entire time the loop is running
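To make the D$ advice concrete, here is a sketch contrasting sequential traversal with a cache-hostile strided traversal. Both produce identical results; only the memory access pattern differs (the array size and stride are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Sequential access: consecutive floats share 64-byte cache lines,
// so each line is fetched once and fully used, and the hardware
// prefetcher can stream lines in ahead of the loop.
float SumSequential(const std::vector<float>& data)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i)
        sum += data[i];
    return sum;
}

// Strided access: the array is swept `stride` times. On data sets
// larger than the cache, lines fetched during one sweep are evicted
// before the next sweep needs them again.
float SumStrided(const std::vector<float>& data, std::size_t stride)
{
    float sum = 0.0f;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < data.size(); i += stride)
            sum += data[i];
    return sum;
}
```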
I$ Misses
Keep high-performance code as small as possible, in terms of number of machine language instructions
(The compiler and linker take care of keeping our functions contiguous in memory)
Avoid calling functions from within a performance-critical section of code
If you must call a function, place it as close as possible to the calling function, preferably immediately before or after it, and never in a different translation (compilation) unit
Inlining? Inlining a small function can be a big performance boost, but too much inlining bloats the code, which can cause a performance-critical section to no longer fit within the cache
360
[Diagram: Xbox 360 system architecture. Three-core CPU with per-core L1 I$/D$ and a shared 1 MB L2, 512 MB DRAM behind two memory controllers, GPU with 10 MB EDRAM, plus the I/O chip and peripherals.]
PS3
PS4
PS4 hUMA (heterogeneous Unified Memory Architecture)
PS4 Cache Architecture
[Diagram series: the eight CPU cores are grouped into two clusters (C0-C3 and C4-C7). Approximate access latencies shown: registers are effectively free; the per-core L1 I$ (32 KiB) and L1 D$ (32 KiB) cost ~3 cycles; each cluster's shared L2 (shown as 2 MiB in the overview, 1 MiB per cluster in the split view) costs ~26-30+ cycles; reaching a line held in the other cluster's L2 costs ~190 cycles; main RAM (8 GiB) costs 220+ cycles.]
Tuesday, March 4, 14
PS4 Cache Architecture
[Diagram, repeated across three slides: 64-byte-aligned main RAM addresses (0x5000 through 0x5280) mapping onto cache lines (0x0000 through 0x0280), illustrating how consecutive 64-byte blocks of RAM map into the cache.]
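The address-to-line mapping in the diagrams reduces to simple arithmetic. A sketch for a direct-mapped cache with 64-byte lines (the 16-line cache size is illustrative, not the PS4's actual cache organization):

```cpp
#include <cstdint>

constexpr std::uint32_t kLineSize = 64; // bytes per cache line
constexpr std::uint32_t kNumLines = 16; // illustrative: a 1 KiB cache

// In a direct-mapped cache, an address's line index is just
// (address / lineSize) modulo the line count. Addresses that differ
// by a multiple of kLineSize * kNumLines collide in the same line.
std::uint32_t CacheLineIndex(std::uint32_t addr)
{
    return (addr / kLineSize) % kNumLines;
}
```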
PS4 Optimization
PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!

U32 g_jobCount[6]; // one per core
PS4 Optimization
PS4-specific: avoid cross-cluster L2 cache line sharing (190 cycles versus 26 cycles)!

struct JobCount
{
    U32 m_count;
    U8  m_padding[60];
};

JobCount g_jobCount[6]; // one per core
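The same fix can be expressed with alignas instead of manual padding. A sketch, with U32 replaced by std::uint32_t and the 64-byte figure assuming the Jaguar's cache line size:

```cpp
#include <cstdint>

// Align each per-core counter to its own 64-byte cache line so two
// cores bumping adjacent counters never contend for the same line
// (false sharing). alignas pads the struct out to 64 bytes.
struct alignas(64) JobCount
{
    std::uint32_t m_count = 0;
};

JobCount g_jobCount[6]; // one per core
```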
PS4
Xbox One
Subtle Differences
Memory type
The Xbox One utilizes DDR3 RAM, while the PS4 uses GDDR5, which gives the PS4 higher theoretical memory bandwidth
The Xbox One counteracts this to some degree by providing its GPU with a dedicated 32 MiB memory store, implemented as very high-speed ESRAM
Subtle Differences
Bus speeds
The buses in the Xbox One support higher-bandwidth data transfers than those of the PS4 (30 GB/s vs. 20 GB/s)
GPU
The PS4's GPU is roughly equivalent to an AMD Radeon 7870, with 1152 parallel stream processors; the Xbox One's GPU is closer to an AMD Radeon 7790, supporting only 768 stream processors
The Xbox One's GPU runs at 853 MHz vs. 800 MHz for the PS4
Main Thread
[Diagram: the main thread runs Simulate/Integrate and Update Game Objects, forking Pose Blending work out across helper threads and joining before the Post Animation Game Object Update, Ragdoll Physics, etc. The fork/join cycle repeats each frame.]
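The fork/join pattern in the diagram can be sketched with std::thread. BlendRange is a hypothetical stand-in for the real pose-blending work:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for pose blending on a range of poses.
void BlendRange(std::vector<int>& poses, std::size_t lo, std::size_t hi)
{
    for (std::size_t i = lo; i < hi; ++i)
        poses[i] += 1;
}

// Fork: split the work across `workers` threads.
// Join: wait for all of them before the next pipeline stage runs.
void ForkJoinBlend(std::vector<int>& poses, unsigned workers)
{
    std::vector<std::thread> pool;
    std::size_t chunk = poses.size() / workers;
    for (unsigned w = 0; w < workers; ++w)
    {
        std::size_t lo = w * chunk;
        std::size_t hi = (w + 1 == workers) ? poses.size() : lo + chunk;
        pool.emplace_back(BlendRange, std::ref(poses), lo, hi); // fork
    }
    for (std::thread& t : pool)
        t.join();                                               // join
}
```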
[Diagram: one thread per major subsystem. Main thread: HID, Update Game Objects, Kick off Animation, Post Animation Game Object Update, Kick Dynamics Sim, Ragdoll Physics, Finalize Animation, Finalize Collision, Other Processing (AI Planning, Audio Work, etc.), Kick Rendering (for next frame). Animation thread: Pose Blending, Global Pose Calculation, Skin Matrix Palette Calculation, Ragdoll Skinning, sleeping between kicks. Dynamics thread: Simulate and Integrate, Broad Phase Collision, Narrow Phase Collision, Resolve Constraints. Rendering thread: Visibility Determination, Sort, Submit Primitives, Wait for GPU, Full-Screen Effects, Wait for V-Blank, Swap Buffers.]
[Diagram (PS3): the PPU runs the main loop (HID, Update Game Objects, Kick Animation Jobs, Post Animation Game Object Update, Kick Dynamics Jobs, Ragdoll Physics, Finalize Animation, Finalize Collision, Other Processing, Kick Rendering for next frame), while fine-grained jobs (Visibility, Sort, Pose Blend, Global Pose, Physics Simulation, Broad Phase / Narrow Phase Collision, Resolve Constraints, Matrix Palette, Ragdoll Skinning, Submit Primitives) are scheduled freely across SPU0 and SPU1.]
Async Design

while (true) // main game loop
{
    // ...

    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r = castRay(playerPos, enemyPos);

    // Now process the results...
    if (r.hitSomething() && isEnemy(r.getHitObject()))
    {
        // Player can see the enemy.
        // ...
    }

    // ...
}
Async Design

while (true) // main game loop
{
    // ...

    // Cast a ray to see if the player has line of sight
    // to the enemy.
    RayCastResult r;
    requestRayCast(playerPos, enemyPos, &r);

    // Do other unrelated work while we wait for the
    // other CPU to perform the ray cast for us.
    // ...

    // OK, we can't do any more useful work. Wait for the
    // results of our ray cast job. If the job is
    // complete, this function will return immediately.
    // Otherwise, the main thread will idle until the
    // results are ready...
    waitForRayCastResults(&r);

    // Process results...
    if (r.hitSomething() && isEnemy(r.getHitObject()))
    {
        // Player can see the enemy.
        // ...
    }

    // ...
}
Async Design

RayCastResult r;
bool rayJobPending = false;

while (true) // main game loop
{
    // ...

    // Wait for the results of last frame's ray cast job.
    if (rayJobPending)
    {
        waitForRayCastResults(&r);

        // Process results...
        if (r.hitSomething() && isEnemy(r.getHitObject()))
        {
            // Player can see the enemy.
            // ...
        }
    }

    // Cast a new ray for next frame.
    rayJobPending = true;
    requestRayCast(playerPos, enemyPos, &r);

    // Do other work...
    // ...
}
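The one-frame-latency pattern above can be sketched with std::future. CastRay and its trivial line-of-sight test are hypothetical placeholders for a real ray cast job running on another core:

```cpp
#include <future>

// Hypothetical ray cast: returns whether anything was hit.
bool CastRay(int playerPos, int enemyPos)
{
    return playerPos < enemyPos; // placeholder "line of sight" test
}

// Each frame consumes last frame's result, then kicks a new job.
// Returns how many frames saw a hit.
int RunFrames(int frames)
{
    int hits = 0;
    std::future<bool> pending; // last frame's job, if any
    for (int f = 0; f < frames; ++f)
    {
        // Wait for (and consume) the job kicked last frame.
        if (pending.valid() && pending.get())
            ++hits;

        // Kick a new ray cast; it may run on another core while
        // this thread does unrelated work for the rest of the frame.
        pending = std::async(std::launch::async, CastRay, 0, 1);
    }
    if (pending.valid())
        pending.get(); // drain the final job
    return hits;
}
```

Note the trade-off the slide illustrates: the result consumed on frame N was requested on frame N-1, so the data is one frame stale, in exchange for never stalling mid-frame.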