Creating the Right Environment for Machine Learning Codesign. Cliff Young, Google AI


Deep Learning has Reinvigorated Hardware
- GPUs: AlexNet, speech.
- TPUs: many Google applications, including AlphaGo, Translate, and WaveNet speech.
- Startups: both training and inference, with many different approaches. I'm looking forward to test-driving new systems.

Agenda
- Classic codesign versus codesign for domain-specific architectures.
- Codesign in Google's TPUs.
- Recommendations for enabling and supporting codesign.

Classic Codesign at the HW/SW Interface
- Definition: design spanning two fields for a common goal. The classic version is between architecture and compiler.
- The Instruction Set Architecture (ISA) serves as the interface/contract between the hardware and software levels.
- Example of pushing work back and forth across that interface: instruction scheduling. VLIW puts it in the compiler (static scheduling), out-of-order (OoO) hardware does it dynamically; the answer today is both (see the scheduling sketch below).
- Ultimately the ISA is a single thin layer between the hardware and software domains.
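
To make the static-scheduling half of that example concrete, here is a minimal sketch (not from the talk) of the greedy list scheduling a VLIW-style compiler performs: operations with latencies and dependences are packed into issue slots at compile time, whereas an OoO core discovers the same parallelism dynamically at run time. The operation names, latencies, and issue width below are invented for illustration.

    # Minimal greedy list scheduler: the compiler-side (static, VLIW-style) half
    # of the instruction-scheduling example. Ops, deps, and latencies are invented.

    def list_schedule(ops, deps, latency, issue_width=2):
        """Return {op: issue cycle} for a simple in-order, width-limited machine."""
        issued = {}                      # op -> cycle it was issued
        remaining = set(ops)
        cycle = 0
        while remaining:
            # Ops whose predecessors have all completed by this cycle.
            ready = [o for o in ops if o in remaining and
                     all(p in issued and issued[p] + latency[p] <= cycle
                         for p in deps.get(o, ()))]
            for o in ready[:issue_width]:        # fill up to issue_width slots
                issued[o] = cycle
                remaining.discard(o)
            cycle += 1
        return issued

    # Tiny dependence graph: two independent loads feed a multiply, then a store.
    ops = ["load_a", "load_b", "mul", "store"]
    deps = {"mul": {"load_a", "load_b"}, "store": {"mul"}}
    latency = {"load_a": 2, "load_b": 2, "mul": 3, "store": 1}
    print(list_schedule(ops, deps, latency))
    # -> {'load_a': 0, 'load_b': 0, 'mul': 2, 'store': 5}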

Codesign for Domain-Specific Architectures
[Conceptual diagram, not rigorous: a stack of layers spanning physics, hardware, ISA, compiler, numerics, library, algorithms, model, and application.]
- Now there are many different layers, with many different interfaces.
- TPUs are still digital (for now). Some startups are pushing into the physics (NVRAM, Flash, optical).
- Need to do codesign from physics to application: hard!

Codesign in TPUs (1): the Hardware Descriptions
TPUv1
- Large (for its time) systolic array: 256x256 MACs x 2 ops = 128K ops/cycle (worked out below).
- Reduced and mixed precision: quantized int8, int16, and int32.
TPUv2
- Keep the systolic array.
- Reduced precision for matrix multiplications in training: bfloat16.
- The system is a torus of chips: an array of systolic arrays.
A nice, crisp physical description, but we've missed where the complexity lurks.
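
A quick back-of-the-envelope check of the "128K ops/cycle" figure, and of peak throughput. The 700 MHz clock rate is taken from the published TPUv1 paper, not from this slide.

    # Peak throughput of a 256x256 systolic array, counting a multiply-accumulate
    # as 2 ops. The clock rate is TPUv1's published 700 MHz (an outside assumption).
    macs_per_cycle = 256 * 256            # 65,536 MACs
    ops_per_cycle = macs_per_cycle * 2    # 131,072 ops ("128K ops/cycle")
    clock_hz = 700e6
    peak_tops = ops_per_cycle * clock_hz / 1e12
    print(f"{ops_per_cycle} ops/cycle, ~{peak_tops:.0f} TOPS peak int8")
    # -> 131072 ops/cycle, ~92 TOPS peak int8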

Codesign in TPUs (2): the Implications
TPUv1
- Large systolic array: the system and code are dedicated to feeding the beast.
- The activation pipeline does pooling, elementwise operations, and sigmoids.
- Quantized 8-bit arithmetic: software, numerics, and probability-estimation issues (see the quantization sketch below).
TPUv2
- Still systolic arrays, but now with back propagation: XLA for code generation.
- Bfloat16 arithmetic: a codesign multiple-win (next slides).
- Torus of chips: great for SIMD-style, scalable data parallelism.
- Work in progress: the hardware is actually MIMD, so it can support model parallelism.
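
The "quantized 8-bit arithmetic" bullet hides much of the software and numerics work. Below is a minimal sketch of symmetric linear quantization of the kind inference accelerators commonly use; it is a generic illustration, not the TPU's actual quantization scheme.

    import numpy as np

    # Symmetric linear int8 quantization: a generic illustration of the
    # software/numerics work behind "quantized 8-bit arithmetic".
    def quantize_int8(x):
        scale = np.max(np.abs(x)) / 127.0          # map the observed range onto [-127, 127]
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    err = np.abs(w - dequantize(q, s)).max()
    print(f"max abs quantization error: {err:.4f}")  # small but nonzero: the numerics issue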

Codesign in TPUs (3): Floating-point Formats
- fp32 (single-precision IEEE floating point): 1 sign bit, 8 exponent bits, 23 mantissa (significand) bits; range ~1e-38 to ~3e38.
- fp16 (half-precision IEEE floating point): 1 sign bit, 5 exponent bits, 10 mantissa bits; range ~5.96e-8 to 65504.
- bfloat16 (Brain floating point): 1 sign bit, 8 exponent bits, 7 mantissa bits; range ~1e-38 to ~3e38 (bit-level demo below).
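
Because bfloat16 keeps float32's sign and 8-bit exponent and simply drops the low 16 mantissa bits, a float32 value can be converted by masking off those bits (real hardware typically rounds rather than truncates). This is a sketch of the bit-level relationship between the formats, not any particular hardware's conversion path.

    import struct

    # bfloat16 shares float32's sign and 8-bit exponent and keeps only the top
    # 7 mantissa bits, so truncating the low 16 bits of a float32 yields a
    # valid bfloat16 bit pattern (hardware usually rounds instead).
    def to_bfloat16_bits(x: float) -> int:
        bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
        return bits32 >> 16                      # the 16-bit bfloat16 pattern

    def bfloat16_to_float(bits16: int) -> float:
        return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

    for x in [3.14159265, 1e-38, 3e38]:
        y = bfloat16_to_float(to_bfloat16_bits(x))
        print(f"{x:.8g} -> {y:.8g}")  # same dynamic range as float32, ~2-3 decimal digits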

Codesign in TPUs (4): Bfloat16 as Good Codesign
- Hardware: a shorter mantissa cuts multiplier power and area, which scale roughly with the square of the mantissa width: float32: 23^2 = 529; float16: 10^2 = 100; bfloat16: 7^2 = 49 (relative costs worked out below).
- Software: same dynamic range on the number line and same Inf/NaN behavior as float32.
- Numerics: trains without loss scaling [Micikevicius 2017].
- System: bfloat16 works as an implementation technique inside the matrix multiplier. It can also be exposed to save memory capacity and bandwidth, with more work.
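
A quick check of the multiplier-cost argument, using the slide's rough model that multiplier area and power grow with the square of the mantissa width:

    # Rough relative multiplier cost ~ (mantissa bits)^2, as on the slide.
    for name, mant in [("float32", 23), ("float16", 10), ("bfloat16", 7)]:
        cost = mant ** 2
        print(f"{name}: {mant}^2 = {cost:4d}  (~{cost / 529:.0%} of float32)")
    # float32: 529 (100%), float16: 100 (~19%), bfloat16: 49 (~9%)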

Codesign in TPUs: Summary
Three big bets:
- Systolic-array matrix multiplication.
- Reduced-precision numerics, appropriate to inference or training.
- A torus of chips, for data/SIMD and model/MIMD parallelism.
Lots of implications follow from these bets at all levels of the stack. Is this enough, or can/should we be doing more?

Some Open Codesign Questions in Machine Learning
- What's the best architecture? Will the market be the final arbiter? At the end of Moore's Law, perhaps architectural efficiency matters more.
- Software may matter more than hardware: Multiflow's compiler was its most important artifact. Ease of use takes time: typically a decade for compilers to mature.
- What's the lower limit on numerics? Kolmogorov complexity.
- How much more is sparsity going to matter? Embeddings, attention, compute and memory savings. What else? Brains are sparse.
- When does batch=1 matter? Definitely for inference. For training?
- How can we use more weights but touch fewer of them? Mixture of Experts (gating sketch below).
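
A minimal sketch of the Mixture-of-Experts idea behind "more weights, but touch fewer of them": a gating network picks the top-k experts per input, so only a small fraction of the total parameters are read on any given step. All shapes and names here are illustrative, not those of any particular published model.

    import numpy as np

    # Mixture of Experts with top-k gating: many expert weight matrices exist,
    # but each input activates only k of them. Shapes are illustrative.
    rng = np.random.default_rng(0)
    d, num_experts, k = 16, 8, 2
    experts = rng.standard_normal((num_experts, d, d))   # all the weights
    gate_w = rng.standard_normal((d, num_experts))       # gating network

    def moe_forward(x):
        logits = x @ gate_w
        top = np.argsort(logits)[-k:]                    # indices of the k best experts
        gates = np.exp(logits[top] - logits[top].max())
        gates /= gates.sum()                             # softmax over the selected experts
        # Only k of num_experts weight matrices are read and multiplied.
        return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

    x = rng.standard_normal(d)
    y = moe_forward(x)
    print(y.shape, f"touched {k}/{num_experts} experts")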

Codesign for the Individual Contributor
- Be "T-shaped": deep in one core competency, and broad (but shallower) in many. (Cherry Murray)
- There are superb engineers who are very narrow and who are comfortable saying "that's not my problem." They can be an important part of the solution, but they're not going to lead the way in a codesign approach.
- For codesign we need people who are curious and who take ownership across domains (even when they aren't necessarily experts in that domain).

Codesign for Organizations
- Value and enable the connections and the connectors. Take time to have hallway conversations.
- Beware of Conway's Law: "Any organization that designs a system... will inevitably produce a design whose structure is a copy of the organization's communication structure." This is harder for big companies than startups (Dunbar's number), but being a startup is no guarantee that you won't fall prey.
- Consider interleaving/rototilling your people. Functional orgs and seating plans discourage codesign interactions.

Codesign for the Community: Sharing, Metrics, and Infrastructure
- Research ideas: a huge, rapid flow through arXiv and deep learning conferences.
- Common frameworks: TensorFlow and XLA are open-source projects.
- Benchmarking and measurement: MLPerf!

MLPerf (mlperf.org) in One Slide
- Goal: build "SPEC for machine learning." A consortium of companies and universities.
- Philosophy:
  - Agile development, because ML is changing rapidly.
  - Serve both the commercial and research communities.
  - Enforce replicability to ensure reliable results.
  - Use representative workloads, reflecting production use-cases.
  - Keep benchmarking effort affordable (so all can play).
- Launching v0.5 in October!

Crisis as Both Danger and Opportunity
- Danger: the end of Moore's Law, Dennard scaling, and standard CPU performance gains. The limits of CMOS are in sight: Intel's 10nm woes, GlobalFoundries' 7nm exit.
- Opportunity: the revolution in ML. Economic demand for ML accelerators; architectural and codesign experimentation and transformation. Can we use ML to design better accelerators?
- Irony: exponential demand for ML computation, arriving just at the end of Moore's Law. Efficiency is going to matter a lot.

Takeaways
- Codesign is fundamental to domain-specific architecture. TPUs made three big bets (so far), with system-wide consequences. Think hard about the software implications of your hardware choices.
- There are codesign problems whose solutions could transform ML systems, for example an algorithmic advance that plays well with hardware constraints: large-batch training instead of decreased learning rates, K-FAC for smarter SGD steps, 1-bit training, or a sparsity framework that enables novel memory and compute structures.
- To foster codesign, people, organization, and community all matter.