CP2K PERFORMANCE FROM CRAY XT3 TO XC30. Iain Bethune, Fiona Reid, Alfio Lazzaro


CP2K PERFORMANCE FROM CRAY XT3 TO XC30 Iain Bethune (ibethune@epcc.ed.ac.uk), Fiona Reid, Alfio Lazzaro

Outline: CP2K Overview (features, parallel algorithms); Cray HPC Systems (trends); Water Benchmarks 2005-2013; Comprehensive Benchmarking (XE6 vs XC30); CP2K with Accelerators

CP2K Overview CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a general framework for different methods such as e.g., density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials. From www.cp2k.org (2004!)

CP2K Overview. Many force models: classical, DFT (GPW), hybrid Hartree-Fock, LS-DFT, post-HF (MP2, RPA), and combinations (QM/MM, mixed). Simulation tools: MD (various ensembles), Monte Carlo, minimisation (GEO/CELL_OPT), properties (spectra, excitations). Open source: GPL, www.cp2k.org, ~1M lines of code, ~2 commits per day, ~10 core developers.


CP2K Overview. HECToR Phase 3 code usage (Nov 2011 - Mar 2014):

Rank | Code | Node hours | Fraction of total | Method
1 | VASP | 5,822,878 | 19.34% | DFT
2 | CP2K | 2,222,059 | 7.38% | DFT
3 | GROMACS | 1,594,218 | 5.29% | Classical
4 | DL_POLY | 1,359,751 | 4.52% | Classical
5 | CASTEP | 1,351,163 | 4.49% | DFT

CP2K usage: 1.6m notional cost (+ 2.4m on Phase 2)

CP2K Overview. QUICKSTEP DFT: the Gaussian and Plane Waves (GPW) method (VandeVondele et al., Comp. Phys. Comm., 2005). Advantages of the atom-centred (primary) basis: the density and KS matrices are sparse. Advantages of the plane-wave (auxiliary) basis: efficient computation of the Hartree potential and efficient mapping between the basis sets, so computation of the KS matrix is O(n log n). Orbital Transformation (OT) method (VandeVondele & Hutter, J. Chem. Phys., 2003): a replacement for traditional diagonalisation to orthogonalise the wavefunctions; still cubic scaling, but at ~10% of the cost.
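To see where the O(n log n) comes from: in GPW the density is also expanded in plane waves, where the Hartree energy has a simple closed form. The equation below is supplied for context in standard plane-wave notation (atomic units) and is not reproduced from the slide:

    E_H[n] = 2\pi\,\Omega \sum_{\mathbf{G}\neq 0} \frac{|\tilde{n}(\mathbf{G})|^{2}}{G^{2}},
    \qquad
    \tilde{n}(\mathbf{G}) = \frac{1}{\Omega}\int_{\Omega} n(\mathbf{r})\, e^{-i\mathbf{G}\cdot\mathbf{r}}\,\mathrm{d}\mathbf{r}

Here \Omega is the cell volume and \mathbf{G} are the reciprocal-lattice vectors; the coefficients \tilde{n}(\mathbf{G}) are obtained by FFT from the density collocated on the realspace grid, which is the O(n log n) step.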

CP2K Overview (A,G) distributed matrices (B,F) realspace multigrids (C,E) realspace data on planewave multigrids (D) planewave grids (I,VI) integration/ collocation of gaussian products (II,V) realspace-toplanewave transfer (III,IV) FFTs (planewave transfer)

CP2K Overview. Distributed realspace grids: overcome the memory bottleneck and reduce communication costs. Parallel load balancing: on a single grid level; by re-ordering multiple grid levels; finely balanced with replicated tasks. [Figure: level 1, fine grid, distributed; level 2, medium grid, distributed; level 3, coarse grid, replicated.]
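The single-level balancing step can be pictured as a greedy assignment of work to ranks. The following C sketch is illustrative only; the task counts and cost model are invented for this example and it is not CP2K's actual load-balancing code:

    #include <stdio.h>

    #define NTASKS 8
    #define NRANKS 3

    /* Greedy load balancing: assign each task (heaviest first) to the
     * currently least-loaded rank.  Costs would in practice be estimated,
     * e.g. from the number of grid points each task touches. */
    int main(void) {
        double cost[NTASKS] = {9.0, 7.5, 6.0, 5.5, 3.0, 2.5, 1.0, 0.5};
        double load[NRANKS] = {0.0};
        int owner[NTASKS];

        for (int t = 0; t < NTASKS; ++t) {
            int best = 0;
            for (int r = 1; r < NRANKS; ++r)
                if (load[r] < load[best]) best = r;
            owner[t] = best;
            load[best] += cost[t];
        }

        for (int r = 0; r < NRANKS; ++r)
            printf("rank %d load %.1f\n", r, load[r]);
        for (int t = 0; t < NTASKS; ++t)
            printf("task %d -> rank %d\n", t, owner[t]);
        return 0;
    }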

CP2K Overview. Fast Fourier Transforms: 1D or 2D decomposition; FFTW3 and CuFFT library interface; cache and re-use data (FFTW plans, Cartesian communicators). DBCSR: distributed sparse matrix multiplication based on Cannon's algorithm; the local multiplication is recursive and cache-oblivious; libsmm is used for the small block multiplications. [Figure: libsmm vs Libsci DGEMM performance; GFLOP/s of SMM (gfortran 4.6.2) and Libsci BLAS (11.0.04) for block sizes (M,N,K) up to 22,22,22.]
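The idea behind libsmm and the figure is that a multiply generated for one specific small (M,N,K) can be fully unrolled and vectorised by the compiler, and so beats a general DGEMM call at these sizes. A minimal C illustration of the idea; the routine name and data layout are chosen for this example and are not libsmm's actual interface:

    #include <stdio.h>

    /* Hypothetical fixed-size kernel in the spirit of libsmm:
     * C(5x5) += A(5x9) * B(9x5), column-major, with all dimensions
     * known at compile time so the compiler can unroll and vectorise. */
    static void smm_dnn_5_9_5(const double *a, const double *b, double *c) {
        for (int n = 0; n < 5; ++n)
            for (int k = 0; k < 9; ++k)
                for (int m = 0; m < 5; ++m)
                    c[m + 5 * n] += a[m + 5 * k] * b[k + 9 * n];
    }

    int main(void) {
        double a[5 * 9], b[9 * 5], c[5 * 5] = {0.0};
        for (int i = 0; i < 5 * 9; ++i) { a[i] = 1.0; b[i] = 2.0; }
        smm_dnn_5_9_5(a, b, c);
        printf("c[0] = %.1f (expect 18.0)\n", c[0]);
        return 0;
    }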

CP2K Overview. OpenMP: now in all key areas of CP2K (FFT, DBCSR, collocate/integrate, buffer packing), added incrementally over time. [Figure: time per MD step (seconds) vs number of cores on XT4 and XT6, MPI-only vs mixed MPI/OpenMP.]
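Buffer packing is a typical example of the loops that were threaded. A minimal, self-contained C sketch (not CP2K code) of packing part of a 3D grid into a contiguous buffer with an OpenMP parallel loop:

    #include <stdio.h>
    #include <stdlib.h>

    #define NX 64
    #define NY 64
    #define NZ 64

    /* Pack the yz-planes [x0, x1) of a 3D grid into a contiguous buffer,
     * sharing the outer loops among OpenMP threads. */
    static void pack_slab(const double *grid, double *buf, int x0, int x1) {
        #pragma omp parallel for collapse(2)
        for (int x = x0; x < x1; ++x)
            for (int y = 0; y < NY; ++y)
                for (int z = 0; z < NZ; ++z)
                    buf[((x - x0) * NY + y) * NZ + z] =
                        grid[(x * NY + y) * NZ + z];
    }

    int main(void) {
        double *grid = malloc(sizeof(double) * NX * NY * NZ);
        double *buf  = malloc(sizeof(double) * 8 * NY * NZ);
        for (int i = 0; i < NX * NY * NZ; ++i) grid[i] = (double)i;
        pack_slab(grid, buf, 16, 24);          /* pack planes 16..23 */
        printf("buf[0] = %.0f\n", buf[0]);     /* value of grid at (16,0,0) */
        free(grid); free(buf);
        return 0;
    }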

Cray HPC Systems

Name | Arch. | Processor | Clock (GHz) | Nodes | Cores/Node | Peak TFlop/s | GFlop/s/Node | Year
XT3 Stage 0 | XT3 | AMD Opteron 146 | 2.0 | 84 | 1 | 0.336 | 4.0 | 2005
XT3 Stage 1 | XT3 | AMD Opteron 152 | 2.6 | 1100 | 1 | 5.72 | 5.2 | 2006
Piz Palü | XT3 | AMD Opteron 185 Dual Core | 2.6 | 1664 | 2 | 17.31 | 10.4 | 2007
HECToR Phase 1 | XT4 | AMD Opteron 1220 Santa Ana Dual Core | 2.8 | 5664 | 2 | 63.44 | 11.2 | 2007
HECToR Phase 2a | XT4 | AMD Opteron 2356 Barcelona 4-Core | 2.3 | 5664 | 4 | 104.22 | 18.4 | 2009
Monte Rosa | XT5 | AMD Opteron 2431 Istanbul 6-Core | 2.4 | 1844 | 12 | 212.43 | 115.2 | 2009
HECToR Phase 2b | XT6 | AMD Opteron 6172 Magny-Cours 12-Core | 2.1 | 1856 | 24 | 374.17 | 201.6 | 2010
Piz Palü | XE6 | AMD Opteron 6272 Interlagos 16-Core | 2.1 | 1496 | 32 | 402.12 | 268.8 | 2011
HECToR Phase 3 | XE6 | AMD Opteron 6276 Interlagos 16-Core | 2.3 | 2816 | 32 | 829.03 | 294.4 | 2011
Tödi | XK7 | AMD Opteron 6272 Interlagos 16-Core + NVIDIA Tesla K20X | 2.1 | 272 | 16 (+14) | 392.90 | 1444.5 | 2012
Piz Daint | XC30 | Intel Xeon E5-2670 Sandy Bridge 8-Core + NVIDIA Tesla K20X | 2.6 | 5272 | 8 (+14) | 7788.90 | 1477.4 | 2013
ARCHER | XC30 | Intel Xeon E5-2697 v2 Ivy Bridge 12-Core | 2.7 | 3008 | 24 | 1559.35 | 518.4 | 2013

Water benchmarks. Born-Oppenheimer MD using Quickstep DFT; TZV2P basis set and 280 Ry planewave cut-off (typical production settings); LDA exchange-correlation functional; 32 up to 2048 water molecules. H2O-32: 96 atoms, 256 electrons, 9.9 Å cubic cell (a typical problem size in ~2005). H2O-2048: 6144 atoms, 49152 electrons, 39.5 Å cubic cell (large, even for 2014!).

Water benchmarks. [Figure: time per MD step (seconds) vs number of cores, for H2O-32 up to H2O-512 on XT3 Stage 0 (2005) and for H2O-32 up to H2O-2048 on XC30 ARCHER (2013).]

Water benchmarks. [Figure: time per MD step (seconds) vs number of cores on XT3 Stage 0 (2005), XT3 Stage 1 (2006), Piz Palü XT3 (2007), HECToR 2a XT4 (2007), Monte Rosa XT5 (2009), HECToR XT6 (2010), Piz Palü XE6 (2011) and ARCHER XC30 (2013).]

Comprehensive Benchmarking. The H2O-* benchmarks do not cover the range of features now available in CP2K: classical force fields, linear-scaling DFT, hybrid DFT (Hartree-Fock exchange), many-body correlation (MP2, RPA). Aimed at users: performance expectations for the move from HECToR Phase 3 to ARCHER. Presented at the 1st Annual CP2K Users Meeting (Jan 2014).

Comprehensive Benchmarking. [Figure: time (seconds) vs number of nodes on ARCHER and HECToR Phase 3, with the MPI/OpenMP mix (MPI-only up to 6 threads) marked at each point; annotated ratios between the two systems range from 1.91 to 2.23.]

Comprehensive Benchmarking. [Figure: performance comparison of the LiH-HFX benchmark; time (seconds) vs number of nodes on ARCHER and HECToR, with thread counts (2 to 8) marked; annotated ratios range from 2.30 to 2.60.]

Comprehensive Benchmarking. [Figure: performance comparison of the H2O-LS-DFT benchmark; time (seconds) vs number of nodes on ARCHER and HECToR, with thread counts (2 to 8) marked; annotated ratios range from 2.00 to 4.66.]

Comprehensive Benchmarking. [Figure: time (seconds) vs number of nodes on ARCHER and HECToR Phase 3, with thread counts marked; annotated ratios range from 1.49 to 2.20.]

CP2K with Accelerators. Heterogeneous systems are well established: #1, #2, #6 and #7 in the TOP500 use Intel Xeon Phi or NVIDIA K20X GPUs. XC30 and XK7 nodes are dual socket: either 2 x CPU, or CPU + GPU. CP2K was used during the initial validation tests of Piz Daint. CUDA GPU support for DBCSR; best performance obtained for LS-DFT calculations. Work by Zurich, Cray, NVIDIA and CSCS.

CP2K with Accelerators. Implementation details: libcusmm for the block-level multiplications (4x faster than cuBLAS); the CPU fills stacks of small-matrix multiplications; one GPU per MPI process, with the remaining cores utilised via OpenMP; asynchronous offload to the GPU via CUDA streams; asynchronous communication between nodes. Benchmarks: H2O-DFT-LS (6144 atoms, large blocks), TiO2 (9786 atoms, mixed block sizes), AMORPH (13846 atoms, small blocks).
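The asynchronous offload described above can be sketched in host C with the CUDA runtime API. This is an illustrative sketch only, not CP2K's DBCSR/libcusmm code; the batched multiplication kernel launch is indicated only by a comment, and buffer sizes are arbitrary:

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define STACK_BYTES (1 << 20)   /* size of one "stack" of block products */
    #define NSTACKS 4

    int main(void) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* Pinned host buffer so cudaMemcpyAsync can truly overlap. */
        char *h_buf, *d_buf;
        cudaMallocHost((void **)&h_buf, STACK_BYTES);
        cudaMalloc((void **)&d_buf, STACK_BYTES);

        for (int s = 0; s < NSTACKS; ++s) {
            /* 1. CPU threads assemble the next stack of small-matrix
                  products into h_buf (omitted here). */
            /* 2. Ship the stack to the GPU without blocking the host. */
            cudaMemcpyAsync(d_buf, h_buf, STACK_BYTES,
                            cudaMemcpyHostToDevice, stream);
            /* 3. Here the real code would launch a batched SMM kernel
                  (libcusmm) on the same stream, then copy results back. */
            cudaMemcpyAsync(h_buf, d_buf, STACK_BYTES,
                            cudaMemcpyDeviceToHost, stream);
            /* 4. Meanwhile the host keeps building the next stack and
                  progressing MPI communication. */
        }
        cudaStreamSynchronize(stream);   /* wait for all queued work */

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        cudaStreamDestroy(stream);
        printf("processed %d stacks\n", NSTACKS);
        return 0;
    }

A real implementation would double-buffer the host and device stacks over several streams so that transfers, kernels and CPU work genuinely overlap.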

CP2K with Accelerators. [Figure: time (seconds) vs number of nodes for AMORPH, H2O-DFT-LS and TiO2, comparing CPU-only runs (GPU idle) with CPU+GPU runs.]

CP2K with Accelerators. [Figure: speedup ratio of the CPU+GPU runs over the CPU-only runs for AMORPH, H2O-DFT-LS and TiO2 at 64 to 512 nodes; ratios range from 1.03 to 1.65.]

CP2K with Accelerators. [Figure: ratios for AMORPH, H2O-DFT-LS and TiO2 at 64 to 256 nodes; values range from 1.47 to 2.01.]

Summary & Outlook. CP2K performance has increased steadily year by year; hardware, software and algorithms are all important. Development has followed architectural trends: multi-core -> OpenMP, heterogeneous nodes -> CUDA, and work on a Xeon Phi port is ongoing. A collaborative development (co-design?) model: end-users, code authors, HPC centres and vendors, with funding from PASC, IPCC and the ARCHER eCSE programme.

Acknowledgements This work made use of the facilities of HECToR, the UK's national high-performance computing service, which is provided by UoE HPCx Ltd at the University of Edinburgh, Cray Inc and NAG Ltd, and funded by the Office of Science and Technology through EPSRC's High End Computing Programme. This work used the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk)

Acknowledgement We are grateful to CSCS for giving us access to and supporting our use of a wide range of HPC systems. The first two authors are supported by the Engineering and Physical Sciences Research Council CP2K-UK project (grant number EP/K038583/1)

Acknowledgements Special thanks to Prof. Jürg Hutter and Prof. Joost VandeVondele for historical benchmark data and access to compute time for benchmarking and code development. Thanks for your attention, and any questions?