Challenges in Transition

Challenges in Transition
Keynote talk at the International Workshop on Software Engineering Methods for Parallel and High Performance Applications (SEM4HPC 2016)
Kazuaki Ishizaki, IBM Research Tokyo, kiszk@acm.org

What is this talk about?
How do we make an HPC platform consumable for non-HPC people, specifically for machine learning (ML) and deep learning (DL)?
This talk is not a solid research proposal, but what I have been thinking about recently.

Takeaways
Users, applications, and hardware are always in transition, and programming is becoming hard.
Let us build an end-to-end runtime system for ML and DL:
- Leave each layer to the specialist to do its best.
- Each layer should know everything needed for optimization; no layer should be isolated.
How can we make state-of-the-art technologies consumable in such a system? Our research is here!

My History (mostly commercial, sometimes HPC)
1990-1992: Network hardware interface for a parallel computer
1992-1995: Static compiler for High Performance Fortran
1996-now: Just-in-time compiler for the IBM Developer Kit for Java
- 1996-2000: Benchmark and GUI applications
- 2000-2010: Web and enterprise applications
- 2012-: Analytics applications
2014-2015: GPUs
- Java language with GPUs
- Apache Spark (an in-memory data processing framework) with GPUs

Outline of this talk
- Review the transition in HPC
- What are the problems in this transition?
- How will we address these problems?

Performance Trend of TOP500
Great performance improvements for Linpack: the top system reaches 33.86 PFlops.
[Chart: MFLOPS over time for the #1 system and the average of the top 10. Source: TOP500]

Top #1 Systems in TOP500
Cray-1 (1975), CM-5 (1993), ASCI Red (1997), BlueGene/L (2004), RoadRunner (2008), Tianhe-1A (2010), Tianhe-2 (2013)
[Source: TOP500]

Top #1 Systems in TOP500
[Annotated timeline: vector processor; 1,024 processors; 7,264 processors; 65,536 processors; Cell processor; GPU; Xeon Phi]

Three Eras in HPC
- Vector Era (-1993): vector processors
- MPP Era (1990-): 1,024 to 65,536 processors (MPP: Massively Parallel Processing)
- Accelerator Era (2008-): Cell processor, GPU, Xeon Phi

Review of Each Era
- What applications were executed?
- Who wrote these applications?
- What research did we do?
- What was commodity hardware?

Vector Era (-1993)
Vector processors

Vector Era (-1993): how can we exploit a vector machine for specific applications?
- Hardware: a slow scalar processor with a vector facility
- Applications: weather, wind, fluid, and physics simulations
- Programmers: a limited number of programmers well educated in HPC ("Ninja" programmers)
- Research: automatic vectorization techniques; enhancement of vector hardware features (e.g. sparse array support)
- Commodity HW: slow scalar processors

MPP Era (1990-)
1,024 to 65,536 processors (MPP: Massively Parallel Processing)

MPP Era (1990-): how can we hide the latencies between nodes?
- Hardware: massive numbers of commodity processors with special network interfaces
- Applications: simulations for wider areas (e.g. chemical synthesis)
- Programmers: a limited number of programmers well educated in HPC
- Research: improvements to MPI implementations; parallelization and optimization of given applications by hand
- Commodity HW: fast scalar processors

Accelerator Era (2008-)
Cell processor, GPU, Xeon Phi

Innovations in System Software
CUDA and OpenCL make powerful computing resources accessible.
Timeline: MPI 1.0 (1994), CUDA (2006), OpenCL (2008)

Accelerator Era (2008-): how can we exploit GPUs in our applications?
- Hardware: massive numbers of commodity processors with hardware accelerators
- Applications: simulations for wider areas (e.g. chemical synthesis)
- Programmers: a limited number of programmers well educated in HPC
- Research: GPU-friendly rewriting of given applications by hand; GPU-oriented algorithms
- Commodity HW: desktop PCs with GPU cards

Innovations in Programming Environment
MapReduce makes parallel programming easy.
Timeline: MapReduce (2004), Hadoop (2007), Spark (2013)
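What makes the MapReduce model easy can be sketched with a word count written in plain Java streams (a minimal illustration of the map and reduce phases, not the Hadoop API; the WordCount class and its count method are hypothetical names chosen for this example). The programmer supplies only per-record and aggregation functions; the runtime handles partitioning and parallel execution.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    // Word count in the MapReduce style: a "map" phase splits lines into words,
    // and a "reduce" phase aggregates the per-word counts. The stream runtime,
    // like a MapReduce engine, handles partitioning and parallel execution.
    static Map<String, Long> count(String[] lines) {
        return Arrays.stream(lines).parallel()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))             // map
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduce
    }

    public static void main(String[] args) {
        System.out.println(count(new String[]{"to be or not to be"}));
    }
}
```

The same map and reduce functions would carry over to Hadoop or Spark, where the partitions live on different machines instead of different cores.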

Innovations in Infrastructure
The cloud makes a cluster of machines easily accessible.
Timeline: AWS EC2 (2006), CloudLayer (2009), GPU instances (2013)

Big Innovations in Applications
Machine learning and deep learning are big floating-point consumers.
Timeline: MPI 1.0 (1994); big data and machine learning (2011); deep learning (2011); deep learning with COTS (commodity off-the-shelf) HPC systems (2013)
[Source: The analytic store, Deep Learning with GPUs]

Accelerator Era 2.0 (2012-): HPC meets machine learning and deep learning with big data
- Hardware: massive numbers of commodity processors with hardware accelerators
- Applications: machine learning (ML) and deep learning (DL) with big data
- Programmers: data scientists who are not familiar with HPC
- Research: how can we use GPUs effectively? How accurate is a new ML/DL algorithm on big data?
- Commodity HW: a cluster of machines with GPUs in the cloud

Summary of Transition
- The majority of applications is changing: from simulations to machine learning (ML) and deep learning (DL) with big data.
- HPC hardware is becoming commodity: GPUs are available on desktops and in the cloud, and the cloud provides a cluster of GPUs as a commodity.
- Programmers are changing: from Ninja programmers to data scientists.

Outline of this talk
- Review the transition in HPC
- What are the problems in this transition?
- How will we address these problems?

Details in Applications
- Data is rapidly growing larger: 1000x from 2010 to 2015.
- The number of applications is rapidly growing:
- arxiv.org hosts many papers: it hit 1 million articles in 2014, and in 2015 it received 105,000 new submissions and served over 139 million downloads.
- github.com hosts many programs: in Q4 2014, there were more than 15M updates (pushes) to 2.2M repositories.
[Sources: IDC, Cisco, IBM; "The arXiv preprint server hits 1 million articles"; arXiv Update, January 2016; githut.info]

Details in Programming Languages
- Data scientists love high-level languages such as Python and R: they make programming easy and offer scientific computing operations and libraries (e.g. NumPy).
- These programs do not scale to a cluster of machines: users pre-filter to reduce the data to fit one machine, or spend much time rewriting the program for a cluster.
- It is not easy to write a program optimized for a target architecture.

Details in Infrastructure: the accelerators that matter
- Processing units: GPU, FPGA, ASIC (e.g. Tensor Processing Unit), ...
- Storage: non-volatile memory, phase-change memory, ...
- Communication: links between accelerators (e.g. NVLink), optical interconnects, ...

Problems in the Future
- Data will be too large to store in fast memory, and the memory hierarchy is becoming deep.
- Programming will be hard: hardware accelerators are hard to program, and new applications appear rapidly.
- Optimization and deployment will be hard: emerging hardware accelerators keep appearing.

Outline of this talk
- Review the transition in HPC
- What are the problems in this transition?
- How will we address these problems?

My Proposal: Build an End-to-End System
From an algorithm down to hardware, any information can be exchanged.
- Leave each layer to the specialist to do its best: easy to develop new algorithms, easy to exploit parallelism from the algorithm, easy to generate accelerator code. We should avoid complex tasks (e.g. analysis).
- Each layer should know everything: which parts of the algorithm are parallel, and what happens at the hardware. We should not make the layers isolated.
[Stack: algorithm; model; framework and libraries; programming language; system software; processing unit (CPU, accelerator); memory; communication]

Similar Research 1: building an end-to-end system
[Source: A New Look at the System, Algorithm and Theory Foundations of Distributed Machine Learning]

Similar Research 2: SystemML
An algorithm written in a subset of R is translated into an optimized Apache Spark program using information such as data and cluster characteristics.
[Source: Inside SystemML]

Our Recent Research: Exploit GPUs at a High Level
Compile a Java program for GPUs [PACT2015, http://ibm.com/java/jdk/]: a parallel stream loop, which expresses parallelism explicitly, can be offloaded to GPUs by our just-in-time compiler without any GPU-specific code.

    IntStream.range(0, N).parallel().forEach(i -> {
        b[i] = a[i] * 2.0;
    });

Our Recent Research: Exploit GPUs at a High Level
Apache Spark with GPUs [http://github.com/ibmsparkgpu]: drive GPU code from an Apache Spark program, transparently to the user.

    // rdd: a resilient distributed dataset, distributed over the nodes
    rdd = sc.parallelize(1 to 10000, 2) // node0: 1-5000, node1: 5001-10000
    rdd1 = rdd.map(i => i * 2)
    sum = rdd1.reduce((x, y) => (x + y))

[Diagram: the partitions [1:5000:1] and [5001:10000:1] of rdd are mapped by i * 2 into the partitions [2:10000:2] and [10002:20000:2] of rdd1, then reduced by x + y into sum. Image source: NVIDIA]

How Do We Create This Proposal?
Will we just pile up existing products?
[Stack: algorithm; model; framework and libraries; programming language; system software; processing unit (CPU, accelerator); memory; communication]

How Do We Create This Proposal?
Will we just pile up existing products? No, that would produce a naïve FAT stack.
[Diagram: the same stack, labeled "naïve FAT stack"]

How Do We Create This Proposal?
Will we just pile up existing products? No, that would produce a naïve FAT stack.
I like abstraction, but I do not like executing it as-is. Run it as an optimized THIN stack, with end-to-end optimizations before an execution, during an execution, and among executions.
Do not guess: each layer should know everything.
[Diagram: the naïve FAT stack is optimized into a THIN stack]

Our Research Challenges (1/2)
- Programming environment: an algorithm should be written declaratively without losing high-level information.
- Frameworks / libraries: resource scheduling; communication-avoiding algorithms; a loosely synchronized execution model; localization (e.g. tiling). Current ML/DL frameworks are not yet as optimized as HPC software stacks.
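As one concrete instance of localization, loop tiling restructures a loop nest so that each block of data is reused while it still resides in cache. A minimal Java sketch of a tiled matrix multiply, with an arbitrary illustrative tile size of 32 (the TiledMatMul class is our own example, not part of any framework mentioned here):

```java
public class TiledMatMul {
    static final int T = 32; // tile size; tuned to the cache size in practice

    // C += A * B, computed tile by tile so that each T x T block of A, B,
    // and C is reused while it is still in cache.
    static void multiply(double[][] a, double[][] b, double[][] c, int n) {
        for (int ii = 0; ii < n; ii += T)
            for (int kk = 0; kk < n; kk += T)
                for (int jj = 0; jj < n; jj += T)
                    for (int i = ii; i < Math.min(ii + T, n); i++)
                        for (int k = kk; k < Math.min(kk + T, n); k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < Math.min(jj + T, n); j++)
                                c[i][j] += aik * b[k][j];
                        }
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        double[][] c = new double[2][2];
        multiply(a, b, c, 2);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
        // prints 19.0 22.0 43.0 50.0
    }
}
```

A framework that knows the memory hierarchy of the target could pick the tile size itself; in an isolated stack, this kind of rewriting is left to the programmer.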

Our Research Challenges (2/2)
- Programming languages / system software: make hardware accelerators consumable without device-specific code; dynamic compilation or deployment for new hardware accelerators.
- Automatic tuning: deep learning may help with the too many tuning knobs in the system; appropriate feedback from hardware to programming.
- Debugging: reproducing a bug in some converged algorithms.

Recap: Takeaways
Users, applications, and hardware are always in transition, and programming is becoming hard.
Let us build an end-to-end runtime system for ML and DL:
- Leave each layer to the specialist to do its best.
- Each layer should know everything needed for optimization; no layer should be isolated.
How can we make state-of-the-art technologies consumable in the system? Our research is here!