CP2K PERFORMANCE FROM CRAY XT3 TO XC30
Iain Bethune (ibethune@epcc.ed.ac.uk), Fiona Reid, Alfio Lazzaro
Outline
CP2K Overview: features and parallel algorithms
Cray HPC Systems: trends
Water Benchmarks: 2005-2013
Comprehensive Benchmarking: XE6 vs. XC30
CP2K with Accelerators
CP2K Overview
"CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a general framework for different methods such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials."
From www.cp2k.org (2004!)
CP2K Overview
Many force models: classical, DFT (GPW), hybrid Hartree-Fock, LS-DFT, post-HF (MP2, RPA), combinations (QM/MM, mixed)
Simulation tools: MD (various ensembles), Monte Carlo, minimisation (GEO/CELL_OPT), properties (spectra, excitations, ...)
Open source: GPL, www.cp2k.org; ~1M lines of code, ~2 commits per day, ~10 core developers
CP2K Overview
HECToR Phase 3 code usage (Nov 2011 - Mar 2014):

Rank  Code      Node hours   Fraction of total  Method
1     VASP      5,822,878    19.34%             DFT
2     CP2K      2,222,059     7.38%             DFT
3     GROMACS   1,594,218     5.29%             Classical
4     DL_POLY   1,359,751     4.52%             Classical
5     CASTEP    1,351,163     4.49%             DFT

CP2K usage: 1.6M notional cost (plus 2.4M on Phase 2)
CP2K Overview
Quickstep DFT: the Gaussian and Plane Waves (GPW) method (VandeVondele et al., Comp. Phys. Comm., 2005)
Advantages of the atom-centred (primary) basis: density and KS matrices are sparse
Advantages of the plane-wave (auxiliary) basis: efficient computation of the Hartree potential; efficient mapping between the basis sets
-> Computation of the KS matrix is O(n log n)
Orbital Transformation (OT) method (VandeVondele & Hutter, J. Chem. Phys., 2003): replaces traditional diagonalisation for orthogonalising the wave functions; still cubic scaling, but at ~10% of the cost
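For context (this summary is not on the original slide, but follows the cited GPW papers): the density is built from the sparse density matrix P in the Gaussian (primary) basis, then transferred to the plane-wave (auxiliary) grid, where the Hartree energy is diagonal in G; here Ω is the cell volume and ñ(G) the Fourier transform of the density.

```latex
% Density collocation from the sparse density matrix (Gaussian basis):
n(\mathbf{r}) = \sum_{\mu\nu} P_{\mu\nu}\, \varphi_\mu(\mathbf{r})\, \varphi_\nu(\mathbf{r})
% Hartree energy, evaluated cheaply in the plane-wave basis via FFT:
E_\mathrm{H}[n] = 2\pi\Omega \sum_{\mathbf{G}\neq\mathbf{0}} \frac{|\tilde{n}(\mathbf{G})|^2}{G^2}
```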
CP2K Overview
GPW data flow (figure):
(A,G) distributed matrices
(B,F) realspace multigrids
(C,E) realspace data on planewave multigrids
(D) planewave grids
(I,VI) integration/collocation of Gaussian products
(II,V) realspace-to-planewave transfer
(III,IV) FFTs (planewave transfer)
CP2K Overview
Distributed realspace grids:
Overcome the memory bottleneck and reduce communication costs
Parallel load balancing: on a single grid level; re-ordering multiple grid levels; fine balancing with replicated tasks (see the sketch below)
(Figure: Level 1, fine grid, distributed; Level 2, medium grid, distributed; Level 3, coarse grid, replicated)
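As a rough illustration of the load-balancing idea, the toy sketch below assigns tasks with estimated costs to the currently least-loaded rank. The function and its parameters are hypothetical and greatly simplified compared with CP2K's actual load balancer.

```c
/* Illustrative greedy load balancing of grid tasks across ranks --
 * a toy sketch of the idea, not CP2K's algorithm or data layout.
 * Each task has an estimated cost and is assigned to the currently
 * least-loaded rank, spreading expensive collocation work evenly. */
void balance_tasks(const double *cost, int ntasks,
                   int *owner, double *load, int nranks)
{
    for (int r = 0; r < nranks; r++)
        load[r] = 0.0;
    for (int t = 0; t < ntasks; t++) {
        int best = 0;
        for (int r = 1; r < nranks; r++)   /* find least-loaded rank */
            if (load[r] < load[best]) best = r;
        owner[t] = best;                   /* assign task to that rank */
        load[best] += cost[t];
    }
}
```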
CP2K Overview
Fast Fourier Transforms: 1D or 2D decomposition; interfaces to the FFTW3 and CUFFT libraries; cache and re-use data (FFTW plans, Cartesian communicators)
DBCSR: distributed sparse matrix-matrix multiplication based on Cannon's algorithm (see the sketch below); local multiplication is recursive and cache-oblivious; libsmm for small block multiplications
(Figure 5: Libsmm vs. Libsci DGEMM performance -- GFLOP/s of SMM (gfortran 4.6.2) and Libsci BLAS (11.0.04) for block sizes M,N,K up to 22,22,22)
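A minimal sketch of the Cannon communication pattern that DBCSR builds on, written for dense local blocks for clarity; DBCSR itself operates on sparse blocked matrices and dispatches small multiplications to libsmm. The function names and parameters are illustrative, and the 2D communicator is assumed to come from MPI_Cart_create with both dimensions periodic.

```c
/* Cannon's algorithm for C = A*B on a q x q process grid (sketch).
 * 'grid' must be a periodic 2D Cartesian communicator; n is the
 * local block dimension. */
#include <mpi.h>
#include <string.h>

static void local_multiply(int n, const double *A, const double *B, double *C)
{
    /* C += A*B on n x n local blocks (in DBCSR this step is recursive,
     * cache-oblivious, and uses libsmm for small blocks). */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

void cannon(MPI_Comm grid, int q, int n, double *A, double *B, double *C)
{
    int coords[2], rank, src, dst;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    memset(C, 0, (size_t)n * n * sizeof(double));

    /* Initial skew: shift row i of A left by i, column j of B up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, n*n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, n*n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    for (int step = 0; step < q; step++) {
        local_multiply(n, A, B, C);
        /* Shift A one step left and B one step up after each multiply. */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A, n*n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B, n*n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    }
}
```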
CP2K Overview
OpenMP: now in all key areas of CP2K (FFT, DBCSR, collocate/integrate, buffer packing); added incrementally over time (a minimal hybrid setup is sketched below)
(Figure: time per MD step vs. number of cores, 2 to 100,000, for XT4 and XT6, MPI-only vs. mixed MPI/OpenMP)
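A minimal hybrid MPI + OpenMP setup of the kind the slide refers to; this is a generic sketch, not CP2K source code.

```c
/* Hybrid MPI + OpenMP skeleton: each rank spawns a team of threads
 * that share the rank's memory, so fewer ranks per node means fewer
 * messages and less replicated data. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp master
        printf("rank %d running %d threads\n", rank, omp_get_num_threads());
        /* ... threaded work: FFT buffers, DBCSR stacks, etc. ... */
    }

    MPI_Finalize();
    return 0;
}
```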
Cray HPC Systems

Name             Arch.  Processor                                                   Clock (GHz)  Nodes  Cores/Node  Peak TFlop/s  GFlop/s/Node  Year
XT3 Stage 0      XT3    AMD Opteron 146                                             2.0            84   1              0.336         4.0        2005
XT3 Stage 1      XT3    AMD Opteron 152                                             2.6          1100   1              5.72          5.2        2006
Piz Palü         XT3    AMD Opteron 185 Dual Core                                   2.6          1664   2             17.31         10.4        2007
HECToR Phase 1   XT4    AMD Opteron 1220 Santa Ana Dual Core                        2.8          5664   2             63.44         11.2        2007
HECToR Phase 2a  XT4    AMD Opteron 2356 Barcelona 4-Core                           2.3          5664   4            104.22         18.4        2009
Monte Rosa       XT5    AMD Opteron 2431 Istanbul 6-Core                            2.4          1844  12            212.43        115.2        2009
HECToR Phase 2b  XT6    AMD Opteron 6172 Magny-Cours 12-Core                        2.1          1856  24            374.17        201.6        2010
Piz Palü         XE6    AMD Opteron 6272 Interlagos 16-Core                         2.1          1496  32            402.12        268.8        2011
HECToR Phase 3   XE6    AMD Opteron 6276 Interlagos 16-Core                         2.3          2816  32            829.03        294.4        2011
Tödi             XK7    AMD Opteron 6272 Interlagos 16-Core + NVIDIA Tesla K20X     2.1           272  16 (+14)      392.90       1444.5        2012
Piz Daint        XC30   Intel Xeon E5-2670 Sandy Bridge 8-Core + NVIDIA Tesla K20X  2.6          5272   8 (+14)     7788.90       1477.4        2013
ARCHER           XC30   Intel Xeon E5-2697 v2 Ivy Bridge 12-Core                    2.7          3008  24           1559.35        518.4        2013
Water benchmarks
Born-Oppenheimer MD using Quickstep DFT
TZV2P basis set, 280 Ry planewave cut-off (typical production settings), LDA exchange-correlation functional
32 up to 2048 water molecules
H2O-32: 96 atoms, 256 electrons, (9.9 Å)³ cell -- a typical problem size in ~2005
H2O-2048: 6144 atoms, 49152 electrons, (39.5 Å)³ cell -- large, even for 2014!
Water benchmarks
(Figure: time per MD step vs. number of cores, 1 to 10,000, on XT3 Stage 0 (2005) and XC30 ARCHER (2013); system sizes H2O-32 up to H2O-512 on the XT3, and H2O-32 up to H2O-2048 on the XC30)
Water benchmarks
(Figure: time per MD step vs. number of cores, 1 to 10,000, across the systems: XT3 Stage 0 (2005), XT3 Stage 1 (2006), Piz Palü XT3 (2007), HECToR 2a XT4 (2007), Monte Rosa XT5 (2009), HECToR XT6 (2010), Piz Palü XE6 (2011), ARCHER XC30 (2013))
Comprehensive Benchmarking
The H2O-* benchmarks do not cover the range of features now available in CP2K: classical force fields, linear-scaling DFT, hybrid DFT (Hartree-Fock exchange), many-body correlation (MP2, RPA)
Aimed at users: performance expectations for HECToR Phase 3 -> ARCHER
Presented at the 1st Annual CP2K Users Meeting (Jan 2014)
Comprehensive Benchmarking
(Figure: time vs. number of nodes, 1 to 100, on ARCHER and HECToR Phase 3; point labels give the best MPI/OpenMP mix (MPI-only up to 6 threads per rank) and ARCHER-over-HECToR speedups of ~1.91-2.23x)
Comprehensive Benchmarking
(Figure: performance comparison of the LiH-HFX benchmark -- time vs. number of nodes, 10 to 10,000, on ARCHER and HECToR; labels give the best thread count per rank (2-8 threads) and ARCHER speedups of ~2.30-2.60x)
Comprehensive Benchmarking
(Figure: performance comparison of the H2O-LS-DFT benchmark -- time vs. number of nodes, 10 to 10,000, on ARCHER and HECToR; labels give the best thread count per rank (2-8 threads) and ARCHER speedups of ~2.00-4.66x)
Comprehensive Benchmarking
(Figure: time vs. number of nodes, 10 to 10,000, on ARCHER and HECToR Phase 3; labels give the best MPI/OpenMP mix (MPI-only up to 8 threads per rank) and ARCHER speedups of ~1.49-2.20x)
CP2K with Accelerators
Heterogeneous systems are well established: #1, 2, 6 and 7 in the TOP500 use Intel Xeon Phi or NVIDIA K20X GPUs
XC30 & XK7 are dual socket: 2 x CPU, or CPU + GPU
CP2K was used during the initial validation tests of Piz Daint
CUDA GPU support for DBCSR; best performance obtained for LS-DFT calculations
Work by Zurich, Cray, NVIDIA & CSCS
CP2K with Accelerators
Implementation details (a stream-offload sketch follows below):
libcusmm for block-level multiplications (~4x faster than cuBLAS)
The CPU assembles stacks of small matrix multiplications
One GPU per MPI process; remaining cores utilised with OpenMP
Asynchronous offload to the GPU via CUDA streams
Asynchronous communication between nodes
Benchmarks:
H2O-DFT-LS (6144 atoms, large blocks)
TiO2 (9786 atoms, mixed block sizes)
AMORPH (13846 atoms, small blocks)
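A minimal sketch of asynchronous stream-based offload, the pattern described above; the kernel and buffer names are stand-ins (not DBCSR's actual libcusmm interface), and true copy/compute overlap additionally requires pinned host buffers (e.g. via cudaHostAlloc).

```cuda
/* Stream-based asynchronous offload (CUDA C sketch, not CP2K code). */
#include <cuda_runtime.h>

__global__ void process_stack(const double *a, const double *b,
                              double *c, int n)
{
    /* Stand-in for a batched small-matrix-multiply kernel. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] += a[i] * b[i];
}

void offload_stack(const double *h_a, const double *h_b, double *h_c,
                   double *d_a, double *d_b, double *d_c,
                   int n, cudaStream_t stream)
{
    size_t bytes = (size_t)n * sizeof(double);
    /* Copies and kernel are all enqueued on one stream: the CPU returns
     * immediately and can assemble the next stack (or run OpenMP work)
     * while the GPU processes this one. */
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream);
    process_stack<<<(n + 255) / 256, 256, 0, stream>>>(d_a, d_b, d_c, n);
    cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream);
    /* Caller overlaps other work, then waits:
     * cudaStreamSynchronize(stream). */
}
```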
CP2K with Accelerators
(Figure: time vs. number of nodes, 50 to 500, for AMORPH, H2O-DFT-LS and TiO2, comparing CPU-only (GPU idle) with CPU+GPU runs)
CP2K with Accelerators
(Figure: speedup ratio of CPU+GPU over CPU-only for AMORPH, H2O-DFT-LS and TiO2 at 64, 128, 256 and 512 nodes; ratios range from 1.03 to 1.65)
CP2K with Accelerators
(Figure: speedup ratios for AMORPH, H2O-DFT-LS and TiO2 at 64, 128 and 256 nodes; ratios range from 1.47 to 2.01)
Summary & Outlook
CP2K performance has increased steadily year on year: hardware, software and algorithms are all important
Development has followed architectural trends: multi-core -> OpenMP; heterogeneous nodes -> CUDA; work on a Xeon Phi port is ongoing
Collaborative development (co-design?) model: end-users, code authors, HPC centres, vendors
Funding from PASC, IPCC, ARCHER eCSE
Acknowledgements
This work made use of the facilities of HECToR, the UK's national high-performance computing service, which is provided by UoE HPCx Ltd at the University of Edinburgh, Cray Inc. and NAG Ltd, and funded by the Office of Science and Technology through EPSRC's High End Computing Programme.
This work used the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk).
Acknowledgements
We are grateful to CSCS for giving us access to, and supporting our use of, a wide range of HPC systems.
The first two authors are supported by the Engineering and Physical Sciences Research Council CP2K-UK project (grant number EP/K038583/1).
Acknowledgements
Special thanks to Prof. Jürg Hutter and Prof. Joost VandeVondele for historical benchmark data and access to compute time for benchmarking and code development.
Thank you for your attention. Any questions?