What can POP do for you?

Mike Dewar, NAG Ltd
EU H2020 Center of Excellence (CoE)
1 October 2015 to 31 March 2018
Grant Agreement No 676553

Outline
Overview of codes investigated
Code audit & plan examples
Analysis of inefficiencies identified
Proof of concept projects
Summary

Customers by Country
[Bar chart: share of customers (0% to 40%) for United Kingdom, Germany, France, Spain, Rest of Europe, Netherlands, Sweden, Belgium. Series: All, SMEs.]

Programming Languages
[Bar chart: share of codes (0% to 45%) for Fortran, C++, Fortran_C/C++, Python_*, C, Others.]

Parallelisation Scheme
[Bar chart: share of codes (0% to 50%) for MPI, Hybrid MPI+OpenMP, OpenMP, Others, CUDA.]

Application Sectors
[Bar chart: share of customers (0% to 30%) for Chemistry, Engineering, Earth Science, CFD, Energy, Other, Machine Learning, Health. Series: All, SMEs.]

So Far
72 Audits and plans completed or reporting to customer
5 completed Proofs of Concept
Working on a further 36 studies and 8 Proofs of Concept
Close to 40% of Audits lead to a follow-up Performance Plan
Goal: 150 assessments

Code Audit & Plan Examples

OpenNN - Artelnics
Neural network open source application
C++ code with OpenMP parallelisation
www.artelnics.com
Key audit result: main issue is Computational Efficiency
The main limit on performance is unexpected variability in the number of times a parallel loop is executed when the number of threads is increased
Further work in a POP performance plan is investigating the unexpected extra computation

GS2 - Culham Centre for Fusion Energy
Turbulence in fusion plasma application
Fortran code with MPI parallelisation
Key audit result: main issue is Communication Efficiency
Serialisation in the point-to-point calls leading to waiting time
Use of non-blocking calls recommended

DROPS - RWTH Aachen
CFD tool for simulating two-phase flows
C++ code parallelised with Hybrid MPI + OpenMP
Complex due to heavy use of C++ templates
Key audit result: main issue is computational Load Balance
Resulted in waiting times in MPI collectives
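The three audits above each name one efficiency as the main issue. As a rough illustration of how such figures relate, the sketch below computes load balance and communication efficiency from per-process timings. The slides do not give the formulas; the definitions used here (load balance as average over maximum useful compute time, communication efficiency as maximum useful compute time over runtime) are an assumption based on the commonly published POP metrics, and the function name and timings are hypothetical.

```python
from statistics import mean


def pop_efficiencies(useful_compute, runtime):
    """Return (load_balance, communication_efficiency, parallel_efficiency).

    useful_compute: time each process spends in useful computation (seconds)
    runtime: wall-clock time of the parallel region (seconds)
    The definitions are assumed, not taken from the slides.
    """
    longest = max(useful_compute)
    # How evenly the work is spread: 1.0 means every process computes equally long.
    load_balance = mean(useful_compute) / longest
    # How much of the runtime the busiest process spends computing rather than
    # waiting or communicating.
    communication_efficiency = longest / runtime
    parallel_efficiency = load_balance * communication_efficiency
    return load_balance, communication_efficiency, parallel_efficiency


# Hypothetical timings for four ranks (illustrative only, not measured data).
lb, comm, par = pop_efficiencies([8.0, 6.0, 6.0, 4.0], runtime=10.0)
print(f"load balance {lb:.2f}, communication {comm:.2f}, parallel {par:.2f}")
```

With these made-up numbers the imbalance between the 8 s and 4 s ranks costs more than the communication overhead, which is the kind of situation the DROPS audit describes.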

Analysis of Inefficiencies

Leading cause of inefficiency
[Bar chart: share of codes (0% to 45%) whose leading cause of inefficiency is Load Balance, Computation issues, or Communication issues.]

Inefficiency by Parallelisation
[Bar chart: breakdown of inefficiency (0% to 100%) into Load Balance, Computation and Communication for MPI, OpenMP, and Hybrid MPI + OpenMP codes.]

Proof of concept projects

k-wave - Brno University of Technology
Toolbox for time domain acoustic and ultrasound simulations in complex and tissue-realistic media
C++ code parallelised with Hybrid MPI and OpenMP (+ CUDA)
Executed on Salomon Intel Xeon compute nodes
Key audit findings:
3D domain decomposition suffered from major load imbalance: exterior MPI processes with fewer grid cells took much longer than interior ones
OpenMP-parallelised FFTs were much less efficient for the grid sizes of exterior processes, requiring many more small and poorly balanced parallel loops
Using a periodic domain with identical halo zones for each MPI rank reduced overall runtime by a factor of 2
www.k-wave.org

k-wave - Brno University of Technology
[Figure: comparison timeline before (white) and after (lilac) balancing, showing exterior MPI ranks (0, 3) and interior MPI ranks (1, 2). MPI synchronization in red, OpenMP synchronization in cyan.]

sphfluids - Stuttgart Media University
Simulates fluids for computer graphics applications
C++ parallelised with OpenMP
Key audit results:
Several issues relating to the sequential computational performance
Located critical parts of the application with specific recommended improvements

sphfluids - Stuttgart Media University
Implemented by the code developers:
Review of overall code design from issues identified in POP audit
Inlining short functions
Reordering the particle processing order to reduce cache misses
Removal of unnecessary operations and costly inner loop definitions
Confirmed performance improvement of up to 5x to 6x, depending on scenario and pressure model used
Achieved thanks to insights provided by the POP experts and good information exchange during the work
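The slides do not say how the particle processing order was changed. One common approach in particle codes, shown here purely as an assumed illustration and not as the sphfluids developers' method, is to sort particles by the grid cell they occupy so that spatial neighbours are also processed close together. The helper names `cell_key` and `reorder_by_cell` and the sample positions are hypothetical.

```python
def cell_key(position, cell_size):
    """Map a 3D position to the integer index of the grid cell containing it."""
    return tuple(int(c // cell_size) for c in position)


def reorder_by_cell(positions, cell_size):
    """Return particle indices ordered so particles in the same cell are adjacent.

    sorted() is stable, so particles sharing a cell keep their original
    relative order.
    """
    return sorted(range(len(positions)),
                  key=lambda i: cell_key(positions[i], cell_size))


# Four hypothetical particles: 0 and 2 share one cell, 1 and 3 share another.
positions = [(0.9, 0.1, 0.0), (0.1, 0.2, 0.0), (0.8, 0.3, 0.0), (0.2, 0.1, 0.0)]
order = reorder_by_cell(positions, cell_size=0.5)
print(order)
```

In a compiled code the particle arrays themselves would be permuted according to this order, so that neighbour loops touch contiguous memory.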

EPW - University of Oxford
Electron-Phonon Wannier (EPW) materials science DFT code; part of the Quantum ESPRESSO suite
Fortran code parallelised with MPI
Audit of unreleased development version of code
Executed on ARCHER Cray XC30 (24 MPI ranks per node)
Key audit findings:
Poor load balance from excessive computation identified (addressed in separate POP Performance Plan)
Large variations in runtime, likely caused by IO
Final stage spends a great deal of time writing output to disk

EPW - University of Oxford
Original code had all MPI ranks writing the result to disk at the end
POP PoC modified this to have only one rank do output
On 480 MPI ranks, time taken to write results fell from over 7 hours to 56 seconds: 450-fold speed-up!
Combined with previous improvements, enabled EPW simulations to scale to previously impractical 1920 MPI ranks
86% global efficiency with 960 MPI ranks
epw.org.uk
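The quoted 450-fold figure can be checked directly from the two timings on the slide. Since the original write took "over 7 hours", the value below is a lower bound; the helper name `speedup` is only illustrative.

```python
def speedup(before_seconds, after_seconds):
    """Ratio of the old runtime to the new runtime."""
    return before_seconds / after_seconds


write_before = 7 * 3600  # "over 7 hours" in seconds, so a lower bound
write_after = 56         # seconds, with a single rank writing the output

print(speedup(write_before, write_after))  # 450.0
```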

Summary
POP seeks not only to describe the performance of an application, but also to identify the root causes of poor performance.
Better performance leads to both resource savings and improved science.
POP is a free service for people and organisations in the European Union.
Current funding secured until March 2018: apply now for the full range of services
https://pop-coe.eu

"POP analysis elegantly reveals in detail how our application's algorithm is running on HPC architectures. It is an extremely useful optimisation tool! Our POP contact was very knowledgeable and enthusiastic. An excellent service!"
Dr Joseph Parker, STFC UK