Creating the Right Environment for Machine Learning Codesign
Cliff Young, Google AI
Deep Learning has Reinvigorated Hardware
- GPUs: AlexNet, speech recognition.
- TPUs: many Google applications, including AlphaGo, Translate, and WaveNet speech synthesis.
- Startups: both training and inference, with many different approaches.
- I'm looking forward to test-driving new systems.
Agenda
- Classic codesign versus codesign for domain-specific architectures.
- Codesign in Google's TPUs.
- Recommendations for enabling and supporting codesign.
Classic Codesign at the HW/SW Interface
- Definition: design spanning two fields for a common goal.
- The classic version is between architecture and compiler, with the Instruction Set Architecture (ISA) as the interface/contract between levels.
- Example of pushing work back and forth across that interface: instruction scheduling. VLIW (static scheduling) versus OoO (dynamic scheduling); the answer today is both.
- Ultimately the ISA is a single thin layer between the hardware and software domains.
Codesign for Domain-Specific Architectures
- Now there are many different layers, with many different interfaces: physics, hardware, ISA, compiler, numerics, library, application, algorithms, model (a conceptual stack, not a rigorous diagram).
- TPUs are still digital (for now). Some startups are pushing into physics (NVRAM, Flash, optical).
- We need to do codesign from physics to application: hard!
Codesign in TPUs (1): The Hardware Descriptions
TPUv1:
- Large for its time: a 256x256 systolic array at 2 ops (multiply and add) per cell, i.e. 128K ops/cycle.
- Reduced and mixed precision: quantized int8, int16, and int32.
TPUv2:
- Keeps the systolic array.
- Reduced precision for matrix multiplications in training: bfloat16.
- The system is a torus of chips: an array of systolic arrays.
A nice, crisp physical description, but we've missed where the complexity lurks.
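The systolic-array dataflow above can be illustrated in a few lines of Python. This is a functional sketch, not a cycle-accurate hardware model: each PE (k, n) permanently holds one weight, activations stream in skewed so that row i reaches PE (k, n) at cycle i + k + n, and each PE performs at most one multiply-accumulate per cycle.

```python
def systolic_matmul(a, w):
    """Cycle-indexed sketch of a weight-stationary systolic array
    computing a @ w. Illustrative only; real TPU dataflow differs
    in many details."""
    M, K, N = len(a), len(w), len(w[0])
    out = [[0] * N for _ in range(M)]
    for cycle in range(M + K + N - 2):      # enough cycles to drain the array
        for k in range(K):
            for n in range(N):
                i = cycle - k - n           # which activation row is at PE (k, n)
                if 0 <= i < M:
                    out[i][n] += a[i][k] * w[k][n]
    return out
```

The skewed arrival schedule is what lets every PE stay busy without any PE ever needing more than its one resident weight and the values arriving from its neighbors.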
Codesign in TPUs (2): The Implications
TPUv1:
- Large systolic array: the system and code are dedicated to feeding the beast.
- An activation pipeline does pooling, elementwise operations, and sigmoids.
- Quantized 8-bit arithmetic: software, numerics, and probability-estimation issues.
TPUv2:
- Still systolic arrays, but now with backpropagation: XLA for code generation.
- Bfloat16 arithmetic: a codesign multiple-win (next slides).
- Torus of chips: great for SIMD-style, scalable data parallelism.
- WIP: the hardware is actually MIMD, so it can support model parallelism.
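The quantized 8-bit arithmetic mentioned above can be sketched with symmetric linear quantization, where the largest magnitude maps onto 127 and real values are recovered as q * scale. This is a sketch of the general technique only; TPUv1's exact quantization scheme may differ.

```python
def quantize_int8(xs):
    """Symmetric linear quantization of a list of floats to int8.
    Assumes at least one nonzero value. Illustrative sketch, not
    TPUv1's actual scheme."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate real values."""
    return [qi * scale for qi in q]
```

The "software and numerics issues" on the slide show up immediately in practice: choosing scales, handling saturation at the ±127/-128 clip points, and deciding where in the graph to quantize and dequantize.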
Codesign in TPUs (3): Floating-point Formats
- fp32, single-precision IEEE floating point: 1 sign bit, 8 exponent bits, 23 mantissa (significand) bits. Range: ~1e-38 to ~3e38.
- fp16, half-precision IEEE floating point: 1 sign bit, 5 exponent bits, 10 mantissa bits. Range: ~5.96e-8 to 65504.
- bfloat16, Brain Floating Point format: 1 sign bit, 8 exponent bits, 7 mantissa bits. Range: ~1e-38 to ~3e38.
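The bit layouts above can be explored directly: bfloat16 is just the top 16 bits of a float32 (sign, the same 8 exponent bits, and the top 7 mantissa bits). A minimal Python sketch, using truncation rather than the round-to-nearest-even that real hardware typically uses:

```python
import struct

def f32_bits(x):
    """Split a float32 into its (sign, exponent, mantissa) bit fields."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF

def to_bfloat16(x):
    """Truncate a float32 to bfloat16 by keeping only its top 16 bits.
    (Truncation keeps the sketch short; production conversions round.)"""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    (y,) = struct.unpack(">f", struct.pack(">I", b & 0xFFFF0000))
    return y
```

Because bfloat16 shares float32's exponent field, conversion never changes the exponent; only low-order mantissa bits are lost.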
Codesign in TPUs (4): Bfloat16 as Good Codesign
- Hardware: a shorter mantissa shrinks multiplier power and area. float32: 23^2 = 529; float16: 10^2 = 100; bfloat16: 7^2 = 49.
- Software: same dynamic range on the number line, same Inf/NaN behavior as float32.
- Numerics: trains without loss scaling [Micikevicius 2017].
- System: bfloat16 as an implementation technique inside the matrix multiplier. It can also be exposed to save memory capacity and bandwidth, with more work.
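The "same dynamic range" claim can be checked from the field widths on the previous slide. The helper below is illustrative (not from the talk): it computes the largest finite value of an IEEE-style binary format from its exponent/mantissa split, showing why fp16 needs loss scaling but bfloat16 does not.

```python
def max_finite(exp_bits, man_bits):
    """Largest finite value of an IEEE-style binary format with the
    given exponent/mantissa widths (bias = 2**(exp_bits-1) - 1).
    Illustrative helper, not from the original slides."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias    # all-ones exponent means Inf/NaN
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp

# fp16 (5, 10) tops out at 65504, so large gradients overflow without
# loss scaling; bfloat16 (8, 7) keeps float32's ~3.4e38 range while
# its mantissa multiplier shrinks from a 23x23 to a 7x7 array.
```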
Codesign in TPUs: Summary
Three big bets:
- Systolic-array matrix multiplication.
- Reduced-precision numerics, appropriate to inference or training.
- A torus of chips, for data/SIMD and model/MIMD parallelism.
There are lots of implications from these bets at all levels of the stack. Is this enough, or can/should we be doing more?
Some Open Codesign Questions in Machine Learning
- What's the best architecture? Will the market be the final arbiter? At the end of Moore's Law, perhaps architectural efficiency matters more. Software may matter more than hardware: Multiflow's compiler was its most important artifact. Ease of use takes time: typically a decade for compilers to mature.
- What's the lower limit on numerics? Kolmogorov complexity.
- How much more is sparsity going to matter? Embeddings, attention, compute and memory savings. What else? Brains are sparse.
- When does batch=1 matter? Definitely for inference. For training?
- How can we use more weights, but touch fewer of them? Mixture of Experts.
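The "more weights, touch fewer of them" idea can be sketched with top-1 gated routing: many expert weight vectors are stored, but each input is routed to a single expert, so only that expert's weights are read. All names below are illustrative, not from any specific Mixture-of-Experts paper.

```python
def moe_forward(x, experts, gate):
    """Top-1 mixture-of-experts sketch. `experts` is a list of weight
    vectors; `gate` is a list of gating weight vectors, one per expert.
    Only the chosen expert's weights are touched for this input."""
    # Gating: score each expert for this input, then pick the best one.
    scores = [sum(g * xi for g, xi in zip(gw, x)) for gw in gate]
    best = max(range(len(experts)), key=scores.__getitem__)
    # Expert computation: a single dot product with the winner's weights.
    y = sum(w * xi for w, xi in zip(experts[best], x))
    return best, y
```

Total parameter count scales with the number of experts, while per-input compute and memory traffic stay roughly constant, which is exactly the codesign appeal.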
Codesign for the Individual Contributor
- Be "T-shaped": deep in one core competency, and broad (but shallower) in many. [Cherry Murray]
- There are superb engineers who are very narrow, and who are comfortable saying "that's not my problem." They can be an important part of the solution, but they're not going to lead the way in a codesign approach.
- For codesign, we need people who are curious, and who take ownership across domains (even when they aren't necessarily experts in that domain).
Codesign for Organizations
- Value and enable the connections and the connectors. Take time to have hallway conversations.
- Beware of Conway's Law: "Any organization that designs a system ... will inevitably produce a design whose structure is a copy of the organization's communication structure."
- This is harder for big companies than startups (Dunbar's number), but being a startup is no guarantee that you won't fall prey.
- Consider interleaving/rototilling your people. Functional orgs and seating plans discourage codesign interactions.
Codesign for the Community: Sharing, Metrics, and Infrastructure
- Research ideas: a huge, rapid flow through arXiv and deep learning conferences.
- Common frameworks: TensorFlow and XLA are open-source projects.
- Benchmarking and measurement: MLPerf!
MLPerf (mlperf.org) in One Slide
- Goal: build the "SPEC for Machine Learning." A consortium of companies and universities.
- Philosophy:
  - Agile development, because ML is changing rapidly.
  - Serve both the commercial and research communities.
  - Enforce replicability to ensure reliable results.
  - Use representative workloads, reflecting production use-cases.
  - Keep benchmarking effort affordable (so all can play).
- Launching v0.5 in October!
Crisis as Both Danger and Opportunity
- Danger: the end of Moore's Law, Dennard scaling, and standard CPU performance gains. The limits of CMOS are in sight: Intel's 10nm woes, GlobalFoundries' 7nm exit.
- Opportunity: the revolution in ML. Economic demand for ML accelerators; architectural and codesign experimentation and transformation. Can we use ML to design better accelerators?
- Irony: exponential demand for ML computation, just at the end of Moore's Law. Efficiency is going to matter a lot.
Takeaways
- Codesign is fundamental to domain-specific architecture. TPUs made three big bets (so far), with system-wide consequences. Think hard about the software implications of your hardware choices.
- There are codesign problems whose solutions could transform ML systems. For example, an algorithmic advance that plays well with hardware constraints: large-batch training instead of decreased learning rates; K-FAC for smarter SGD steps; 1-bit training; a sparsity framework that enables novel memory and compute structures.
- To foster codesign: people, organization, and community matter.