HPC Saudi 2018, March 13, 2018
Scientific Computing Activities in KAUST
Jysoo Lee, Facilities Director, Research Computing Core Labs
King Abdullah University of Science and Technology
Supercomputing Services
- Computing Cycle
- Consulting
- Education & Training
- Community Building
- Expertise
- Infrastructure
Infrastructure
Top500 List (June 2015); #20 as of Mar 2018
http://www.top500.org
Shaheen II Supercomputer
- Compute: 6,174 nodes, 197,568 cores; Intel Haswell processors, 2 CPU sockets per node with 16 cores per CPU @ 2.3 GHz; 128 GB of memory per node, over 790 TB total memory
- Speed: 7.2 PetaFLOPS peak performance; 5.53 PetaFLOPS sustained LINPACK, ranked 15th in the latest Top500 list
- Network: Cray Aries interconnect with Dragonfly topology; 57% of the maximum global bandwidth between the 18 groups of two cabinets
- Storage: Sonexion 2000 Lustre appliance; 17.6 PetaBytes of usable storage; over 500 GB/s bandwidth
- Burst Buffer: DataWarp; Intel solid-state drive (SSD) fast data cache; over 1.5 TB/s bandwidth
- Archive: Tiered Adaptive Storage (TAS); hierarchical storage with a 200 TB disk cache and 20 PB of tape storage (upgradable to 100 PB), using a Spectra Logic tape library
- Power: up to 3.5 MW; water cooled
- Weight/Size: more than 100 metric tons; 36 XC40 compute cabinets, plus disk, blowers, and management nodes
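As a quick consistency check, the headline figures above follow from the per-node numbers (assuming 16 double-precision FLOPs per cycle per Haswell core, i.e. AVX2 with FMA — an assumption, since the slide does not state the per-core rate):

```python
# Consistency check of the Shaheen II figures quoted on the slide
nodes = 6174
cores_per_node = 2 * 16                 # 2 sockets x 16 cores per CPU
total_cores = nodes * cores_per_node
total_memory_tb = nodes * 128 / 1000    # 128 GB per node, in decimal TB
# Assumed: 16 double-precision FLOPs/cycle/core (Haswell AVX2 with FMA)
peak_pflops = total_cores * 2.3e9 * 16 / 1e15

print(total_cores)              # 197568, matching the slide
print(round(total_memory_tb))   # 790 TB, matching "over 790 TB"
print(round(peak_pflops, 2))    # 7.27, consistent with the quoted 7.2 PF peak
```

The small gap between 7.27 and the quoted 7.2 PFLOPS is just rounding in the slide.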
Ibex Cluster
- Total of 800+ nodes
- Intel nodes (305): Broadwell (2), Haswell (18), Ivy Bridge (178), Sandy Bridge (100), Westmere (7)
- AMD nodes (492): Opteron 6300 (467), Opteron 6200 (25)
- GPU nodes (28): P100 (6), K40m (12), P6000 (4), K6000 (6)
- Storage: 3.9 PB
- Addition of Skylake, Volta, large-memory nodes, and storage is in progress
Discoveries Enabled
Modeling and Forecasting the Red Sea with Data Assimilation (PI: Ibrahim Hoteit)
- A fully parallel ensemble-based ocean data assimilation and forecasting system has been implemented for the Red Sea
- We have investigated the best ensemble approach for forecasting and reconstructing the history of Red Sea circulations
- A one-year ensemble assimilation experiment with 250 members costs about 250K core-hours on the KAUST Shaheen supercomputer; we have conducted dozens of these
- Toye et al., "Ensemble Data Assimilation in the Red Sea: Sensitivity to Ensemble Selection and Atmospheric Forcing," Ocean Dynamics, 2016
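The slide describes the ensemble system but not the assimilation step itself. As an illustration only — this is a generic stochastic ensemble Kalman filter analysis step with made-up sizes, not the actual Red Sea system's code — the core update that such a 250-member ensemble performs can be sketched as:

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic EnKF analysis step (illustrative sketch).

    X : (n, N) ensemble of model states (n state variables, N members)
    y : (m,)   observation vector
    H : (m, n) linear observation operator
    R : (m, m) observation-error covariance
    """
    n, N = X.shape
    Xm = X.mean(axis=1, keepdims=True)        # ensemble mean
    A = (X - Xm) / np.sqrt(N - 1)             # normalized anomalies
    HA = H @ A                                # anomalies mapped to obs space
    # Kalman gain from sample covariances: K = A HA^T (HA HA^T + R)^-1
    K = A @ HA.T @ np.linalg.inv(HA @ HA.T + R)
    # Perturb observations so the analysis ensemble keeps the right spread
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=N).T
    return X + K @ (Y - H @ X)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 250))   # toy 250-member ensemble, 4 state variables
H = np.eye(2, 4)                # observe the first two variables
R = 0.1 * np.eye(2)
Xa = enkf_update(X, np.array([1.0, -0.5]), H, R, rng)
```

After the update, the analysis ensemble mean of the observed variables is pulled toward the observations, weighted by the sample covariance against the observation error.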
Direct Numerical Simulation of the Dynamics of Bluff-Body-Stabilized Flames (PI: Hong G. Im, PSE/CCRC)
- Direct numerical simulations of the dynamics of bluff-body-stabilized flames
- High-fidelity simulations with full temporal and spatial resolution and a detailed description of hydrogen and syngas chemistry provide fundamental insights into the mechanism of flame blow-off, a critical practical issue in gas turbine combustors
- The simulations consumed about 6 million CPU-hours on KAUST's Shaheen supercomputer, utilizing 3,200 cores
- Case H10 L10 D0.5: H = 10 mm, L = 10 mm, and D (bluff body) = 0.5 mm
- Hydrogen/air, U = 85 m/s, 800 cores
- A typical run requires 0.04 million CPU-hours to reach t = 20 ms
- The total set of simulations will take 1.0 million CPU-hours to reach blow-off
- [Movie]
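As a sanity check on the cost figures above, 0.04 million CPU-hours on 800 cores corresponds to roughly 50 hours of wall-clock time per run, and the 1.0 million CPU-hour budget covers about 25 such runs:

```python
# Simple arithmetic on the cost figures quoted on the slide
cpu_hours_per_run = 0.04e6   # CPU-hours for one run to reach t = 20 ms
cores = 800
wall_hours_per_run = cpu_hours_per_run / cores

total_budget = 1.0e6         # CPU-hours until blow-off
runs_in_budget = total_budget / cpu_hours_per_run

print(wall_hours_per_run)    # 50.0 hours of wall time per run
print(runs_in_budget)        # 25.0 runs within the quoted budget
```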
Computing Cycle
Computing Operations: Resource Allocation (CS Team)
Project types:
- Development Project: system familiarization, code porting, assessment of performance; up to 2M core-hours
- Production Project: production runs after applications have been ported and optimized
Project review: proposals are reviewed monthly by the RCAC (Resource Computing Allocation Committee) in a three-step process:
- Computational Readiness Review: performed by KSL; justification of core-hours, portability, scalability, impact on execution
- Scientific Readiness Review: performed by scientific peers in the discipline of the proposed project
- RCAC final review and recommendation
Shaheen Services: Users and Projects
Has supported, since July 2015: 298 projects, 105 distinct PIs, 598 users

Institution | Projects | Percentage
KAUST | 251 | 84.23%
Saudi Industry | 11 | 3.69%
Saudi Academia & Agency | 13 | 4.36%
Other (worldwide) | 23 | 7.72%
Shaheen Services: Users and Projects
Has provided 2.428 billion CPU hours

Institution | Core Hours
KAUST | 2,271,002,588
Saudi Industry | 125,383,699
Others | 31,412,781
Services: Users and Projects — Core Hours Provided by Discipline

Field of Science | CPU Hours | % Overall
CFD/CSM | 766,887,001 | 31.59%
ErSE-GS | 479,933,780 | 19.77%
Material Science | 442,328,890 | 18.22%
ErSE-AS | 277,852,890 | 11.44%
AMCS | 255,868,987 | 10.54%
Bioscience | 141,445,255 | 5.83%
Physics | 52,611,730 | 2.17%
Others | 10,870,534 | 0.45%

More than 2.4 billion core-hours in the last 32 months.
Consulting
Consulting Operations (CS Team)
Basic support:
- More than 4,000 RT tickets resolved for Shaheen II
- 108 tools and libraries, with 230 different installations
- Help with the installation of in-house codes
Advanced support / method development:
- Code profiling, debugging, optimization
- Parallelization; porting to accelerators and the burst buffer
- Workflow optimization
- Support for domain science
- Collaboration
Advanced Support: Aramco Reservoir Simulation
- TeraPOWERS' new trillion-node reservoir simulation models oil migration problems in the Kingdom in a fraction of the time of previous runs
- Shaheen II serves as the ONLY platform in the Kingdom on which TeraPOWERS can develop capability and perform large-scale production runs
- "We simulated an oil migration problem in the Kingdom from the source rock to the trap, with millions of years of history, in 10 hours using 1 trillion active computational cells" (Ali Dogru)
- "We could not have achieved this incredible milestone without the expertise and resources from KAUST, which provided superb support" (Larry Fung)
- The EXPEC Advanced Research Center (EXPEC ARC) TeraPOWERS Technology Team, under the leadership of Saudi Aramco fellow Ali Dogru, achieved a major breakthrough with the industry's first trillion-cell reservoir simulation run on October 28, 2016 (Aramco news, Dhahran, November 23, 2016)
http://www.saudiaramco.com/en/home/news-media/news/saudi-aramco-scientists-achieve-new-world-record.html
Advanced Support: Aramco Gravity Separator
- Largest-scale simulation of an engineering code (~200,000 cores)
- Critical to oil and gas production facilities; helps reduce design development time and better predict equipment performance under varying operational conditions
- Reduced simulation time from several weeks to an overnight run
Method Development: Decimate, a Tool for Managing Large Numbers of Jobs
Context:
- PI: Prof. Ibrahim Hoteit; application: ocean modeling and data assimilation
- Augmentation of SLURM functionality for a complex workflow (> 10K jobs, 20M core-hours)
Achievement: execution framework partially rewritten
- Tolerant of hardware and numerical faults
- Highly configurable: user-defined functions validate steps and modify them in case of a rerun
- Easy to use: allows the implementation of innovative coupling strategies even by non-HPC-aware users
- Maintainable: modular, portable, written in Python
- Reusable: a KSL-supported tool that can improve the management of other complex workflows on Shaheen II
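To illustrate the kind of SLURM augmentation described — and only as a generic sketch, not Decimate's actual API (the helper and job names below are hypothetical) — chaining many dependent jobs typically comes down to building `sbatch` command lines with `--dependency=afterok:<jobid>` so each step waits for the previous one:

```python
import shlex

def sbatch_command(script, job_name, depends_on=None, retries=0):
    """Build an sbatch command line for one step of a job chain.

    depends_on : SLURM job ID this step must wait for (afterok), or None
    retries    : extra requeues to allow if the step fails (fault tolerance)
    """
    cmd = ["sbatch", "--job-name", job_name, "--requeue"]
    if depends_on is not None:
        # Run this step only if the previous job finished successfully
        cmd += ["--dependency", f"afterok:{depends_on}"]
    if retries:
        # Hypothetical convention: a validator script reads this comment
        cmd += ["--comment", f"max_retries={retries}"]
    cmd.append(script)
    return shlex.join(cmd)

# Chain two assimilation cycles: the second waits for the first job's ID
print(sbatch_command("cycle.sh", "assim-1"))
print(sbatch_command("cycle.sh", "assim-2", depends_on=123456, retries=2))
```

A real workflow manager would parse the job ID that `sbatch` prints on submission and feed it into the next step's `depends_on`; the per-step validation and rerun logic the slide mentions would sit around this submission loop.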
Education and Training
Training Program: List
- User Training: mainly through the monthly KSL Workshop Series, in which KSL computational scientists guide users through presentations and live demos aimed at ensuring the best usage of KSL resources
- Broadcast Training: KSL often hosts XSEDE's regular series of remote workshops on high-performance computing topics; these hands-on workshops provide a convenient way for researchers to learn about the latest techniques and technologies of current interest in HPC
- Vendor Training: the main events were those offered by Cray on the hardware and software of the XC40, to support user migration to the new machine; there was also an NVIDIA workshop followed by a hackathon
- On-demand Training: recent requests came from Aramco and from two faculty members who needed to advance their groups' HPC skills
Training Program: Statistics

Category | Events | Trainees
User Training | 5 | 102
Broadcast Training | 8 | 55
Vendor Training | 2 | 186
On-demand Training | 4 | 25
Total | 19 | 368

List of topics:
1) MPI programming and parallel programming
2) Vectorization techniques, modeling and measuring cache utilization, performance optimization for GPU, advanced MPI programming
3) Parallel / GPU computing with MATLAB
4) Parallel computing with Fortran on Shaheen
5) I/O, debugging tools, profiling tools
6) Performance optimization: Allinea and Barcelona Supercomputing Center performance tools
7) Burst buffer
8) Advanced parallel programming, analysis and optimization, computer architectures and memory allocation/distribution (specifically Cray/Shaheen II)
9) Code debugging on Shaheen II
10) Optimizations
11) API for the Shaheen II filesystem that enables applications to be smarter when reading files
12) Profiling MPI code
13) Deep MPI/OpenMP topics specific to the Cray system
14) Practice starting different applications: linear algebra; METIS and graphs; CFD applications; etc.
15) Compilation options; differences between compilers
Training Program: Examples
- KAUST-NVIDIA annual workshop on Accelerating Scientific Applications Using GPUs: 2013, 30 out-of-KAUST participants; 2015, 35; 2016, 27
- KAUST-Intel workshop on Accelerating Scientific Applications on Intel CPUs: 2013, 20 out-of-KAUST participants; 2015, 41
- Cray workshop (Shaheen II training): 2015, 2 out-of-KAUST participants
- Allinea workshop: 2015, 19 out-of-KAUST participants
- On-demand training for Aramco at KAUST: 2016, full-day customized training for an Aramco PI
GPU Conference and Hackathon https://www.nextplatform.com/2017/03/23/kaust-hackathon-shows-openacc-global-appeal/
Tutorials @ HPC Saudi 2017
- More than 220 attendees: 50% from KAUST, 33% from KSA universities and agencies, 12% from KSA industry, 5% from outside the Kingdom
- 8 different tutorials, including:
  - Introduction to HPC, presented by KSL and KVL
  - Advanced performance tuning and optimization, offered by Cray, Intel, and Allinea
  - Containers, by Saudi Aramco
  - Deep learning, by NVIDIA
  - Best practices of HPC procurement, by RedOak
  - SLURM workload manager, by SchedMD
Building Community
Helping the Saudi Government
KAUST collaborates with the General Authority for Meteorology and Environmental Protection (GAMEP):
- To port the GAMEP operational workflow to Shaheen II
- GAMEP employees were trained to use Shaheen II and execute the weather models on the supercomputer
- The final outcome is a higher-resolution operational weather forecast on Shaheen II
KAUST collaborates with King Abdullah City for Atomic and Renewable Energy (KACARE) and the Masdar Institute of Science and Technology (MASDAR):
- The objective is to use Shaheen II for renewable energy research
- Personnel have been trained to use Shaheen II
- The scientists were supported by installing all the necessary software and validating the correctness of results
KSL-ANSYS Support for In-Kingdom Academic & Industrial Users
Research collaboration:
- ANSYS and KSL scientists are working with potential users to transition their work from PCs and small clusters to Shaheen
- 4-5 projects have been identified that can benefit from using ANSYS on Shaheen; the principal investigators are from EXPEC ARC (Saudi Aramco), KFUPM, KSU, and KAU
Training and education:
- Workshop series
- Certification program (in preparation)
ANSYS Associate License: paving the way for industrial use of ANSYS on Shaheen
Thank You!
jysoo.lee@kaust.edu.sa