Enabling Science and Discovery at Georgia Tech With MVAPICH2

Enabling Science and Discovery at Georgia Tech With MVAPICH2
3rd Annual MVAPICH User Group (MUG) Meeting, August 19-21, 2015
Mehmet Belgin, Ph.D., Research Scientist, PACE Team, OIT/ART

Georgia Tech
- #7 best public university (U.S. News & World Report, 2014)
- College of Sciences consistently in the top 5
- #1 Industrial Engineering program for the past two decades
- 21,500 undergraduate and graduate students
- Colleges: Architecture, Computing, Engineering, Sciences, Business, Liberal Arts

PACE (PACE.GATECH.EDU)
- What it is: A Partnership for an Advanced Computing Environment
- What it provides: centralized HPC services for federated clusters
- Who it consists of: 11 active staff members (incl. 3 research scientists) and 3 student assistants

PACE Structure

PACE
- > 2,000 users (~1,700 active)
- 215 participating faculty (PIs)
- > 100 queues
- 37k cores, most (but not all) with QDR InfiniBand
- 3.5 PB of storage
- 9,000 sq ft of datacenter space in total
- 100 Gb/s to Internet2 AL2S

MVAPICH2 @ PACE
- First encounter: mvapich2/1.4.1, May 2010 (the end of mpich2 for us)
- PACE software repo (2011-2015): mvapich2/1.6, 1.7, 1.8, 1.9, 2.0
- First encounter with the MVAPICH2 team (Sep 2011): mvapich2/1.6 not working for > 64 cores (registration cache issue); received a workaround the next day!
- Another crisis (June 2013): mvapich2/1.6 & 1.7 hanging for a user, critical simulations in danger; a workaround in 3 days (unset MALLOC_PERTURB_), a patch in 2 weeks, official integration in mvapich2/1.9a
- New PACE software repo (2015-): mvapich2/1.9, 2.0, 2.1

MVAPICH2: powerful but familiar
- Same world (standard OS, OFED, compilers)
- Turbo boost! (MVAPICH2)
- Same animal (no code changes)
- Existing infrastructure (InfiniBand)
- Familiar technology (MPICH)
MVAPICH2 provides superior performance without changing your world.

MVAPICH2 for sysadmins
- Acceptance testing: 10 days of uninterrupted runs with codes compiled against mvapich2:
  - VASP (the node killer case!)
  - LAMMPS
  - HPL
  - SPEC MPI2007 (will be added soon)
- High compilation success rate with MPI packages
- Node/IB fabric health analysis with point-to-point OSU benchmarks (see the sketch after this list):
  - Bandwidth and latency
  - A wrapper script to submit one-to-all jobs and analyze the data
  - A summary reporting slow paths with standard deviations
- Excellent compatibility with debuggers/profilers:
  - Valgrind (compiled with MPI wrappers)
  - TAU
  - Allinea DDT (debugger) and MAP (profiler)
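The slides do not include the wrapper script itself, but the idea can be sketched as follows: run a two-rank osu_latency between a reference node and every other node, then flag paths that sit more than two standard deviations above the mean. Node names, the benchmark install path, and the mpirun_rsh invocation below are placeholders for illustration, not PACE's actual tooling.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a fabric-health wrapper: measure one-to-all
point-to-point latency with osu_latency and report slow paths."""

import statistics
import subprocess

OSU_LATENCY = "/opt/osu-benchmarks/osu_latency"   # assumed install path
REFERENCE_NODE = "node001"                        # node every pair is tested against
NODES = [f"node{i:03d}" for i in range(2, 17)]    # hypothetical node list
MSG_SIZE = 8                                      # message size (bytes) to compare

def measure_latency(peer: str) -> float:
    """Run a 2-rank osu_latency between the reference node and `peer`,
    returning the latency (us) reported for MSG_SIZE."""
    cmd = ["mpirun_rsh", "-np", "2", REFERENCE_NODE, peer, OSU_LATENCY]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("#") or not line.strip():
            continue                      # skip benchmark header lines
        size, lat = line.split()[:2]
        if int(size) == MSG_SIZE:
            return float(lat)
    raise RuntimeError(f"no result for size {MSG_SIZE} from {peer}")

def main() -> None:
    results = {peer: measure_latency(peer) for peer in NODES}
    mean = statistics.mean(results.values())
    stdev = statistics.stdev(results.values())
    print(f"mean latency: {mean:.2f} us, std dev: {stdev:.2f} us")
    for peer, lat in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
        flag = "  <-- SLOW PATH" if lat > mean + 2 * stdev else ""
        print(f"{REFERENCE_NODE} -> {peer}: {lat:.2f} us{flag}")

if __name__ == "__main__":
    main()
```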

PACE software repository
- 420 packages, over 1 TB
- 54 MPI packages built with mvapich2; 49 MPI packages built with openmpi (yes, we know about SPACK)
- 576 of ~2,000 users choose to load an MPI module on login: mvapich2: 504, openmpi: 72 (mostly from a non-IB cluster)
- Hierarchical format for all version/MPI/compiler combinations (as far as possible), e.g. (see the enumeration sketch after this slide):
  - Software X: v1.0.0, v2.0.3, v3.1.2
  - MPI: openmpi/1.6, 1.7, 1.8; mvapich2/1.9, 2.0, 2.1
  - Compiler: gcc/4.6.2, 4.7.2, 4.9.0; Intel/12.1.4, 14.0.2, 15.0; pgi/12.3, 13.5, 14.10
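For illustration only, here is a small sketch of how such a version/MPI/compiler hierarchy could be enumerated programmatically. The package name and path layout are hypothetical, not the actual PACE module tree.

```python
"""Minimal sketch: enumerate module paths for every version/MPI/compiler
combination of a hierarchical software tree (names are illustrative)."""

from itertools import product

versions  = ["1.0.0", "2.0.3", "3.1.2"]
mpis      = ["openmpi/1.6", "openmpi/1.7", "openmpi/1.8",
             "mvapich2/1.9", "mvapich2/2.0", "mvapich2/2.1"]
compilers = ["gcc/4.6.2", "gcc/4.7.2", "gcc/4.9.0",
             "intel/12.1.4", "intel/14.0.2", "intel/15.0",
             "pgi/12.3", "pgi/13.5", "pgi/14.10"]

# One module path per buildable version/MPI/compiler combination.
for ver, mpi, comp in product(versions, mpis, compilers):
    print(f"software-x/{ver}/{mpi.replace('/', '-')}/{comp.replace('/', '-')}")
```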

Getting better every day
- 2.0rc1 vs. 2.0ga (rc2?): improved intra-node communication performance using shared memory and Cross Memory Attach (CMA); available in 2.0rc1 but not enabled by default
- Measured with point-to-point OSU benchmarks on a 64-core AMD node (a rough ping-pong sketch follows this slide)
- [Charts: latency (us) and bandwidth (MB/s) for 2.0rc1 vs. 2.0ga over increasing message sizes up to 4,194,304 bytes]
- See the XSEDE '14 article by Jerome Vienne, "Benefits of Cross Memory Attach for MPI libraries on HPC Clusters"
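For readers who want to reproduce this kind of intra-node measurement, below is a rough mpi4py ping-pong sketch in the spirit of osu_latency; it is not part of the original slides. To my understanding, MVAPICH2 exposes CMA usage through the MV2_SMP_USE_CMA environment variable; treat that as an assumption and check the MVAPICH2 user guide for your version.

```python
"""Rough mpi4py ping-pong sketch approximating an osu_latency-style
intra-node measurement between ranks 0 and 1.
Example launch (assumed): mpirun_rsh -np 2 node001 node001 python3 pingpong.py
CMA on/off is assumed to be controlled via MV2_SMP_USE_CMA=1/0."""

from time import perf_counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ITERS, SKIP = 1000, 100   # timed iterations and warm-up iterations

for size in (0, 1, 1024, 65536, 1048576, 4194304):
    buf = bytearray(size)
    comm.Barrier()
    start = 0.0
    for i in range(ITERS + SKIP):
        if i == SKIP:
            start = perf_counter()          # start timing after warm-up
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1)
            comm.Recv([buf, MPI.BYTE], source=1)
        elif rank == 1:
            comm.Recv([buf, MPI.BYTE], source=0)
            comm.Send([buf, MPI.BYTE], dest=0)
    if rank == 0:
        latency_us = (perf_counter() - start) / ITERS / 2 * 1e6
        print(f"{size:8d} bytes: {latency_us:8.2f} us one-way")
```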

Challenges in multicore performance
- 64-core AMD Abu Dhabi node: each processor has 16 cores; 4 sockets, 8 NUMA sections (a small affinity-reporting sketch follows this slide)
- [Topology diagram: the 8 NUMA sections as shown by hwloc's lstopo]
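A quick way to see how ranks land on such a NUMA layout is to have each rank report its CPU affinity mask. The sketch below is illustrative and not from the slides; it uses mpi4py and Linux's sched_getaffinity.

```python
"""Illustrative sketch: every MPI rank reports the cores it is pinned to,
as a quick check of placement across the NUMA sections of a node."""

import os
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Linux-specific: the set of CPU ids this process may run on (its affinity mask/cpuset).
cores = sorted(os.sched_getaffinity(0))
report = f"rank {rank:3d} on {socket.gethostname()}: cores {cores}"

# Gather to rank 0 so the output appears in rank order.
reports = comm.gather(report, root=0)
if rank == 0:
    for line in reports:
        print(line)
```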

Improved overall performance
- Leslie3d from the SPEC MPI2007 benchmark, 128-cube case (https://www.spec.org/mpi2007/)
- ~10% consistent performance improvement on average since 1.9rc1
- 195 QDR-connected 16-core Intel Sandy Bridge nodes with 64 GB of memory each
- 10% of a $1.2 million cluster is $120,000 of capacity
- [Chart: runtime (sec) for mvapich2/1.9rc1 vs. mvapich2/2.1 at 16, 32, 64, 128, 256, and 384 cores]

Impact on Research: Leslie
- Prof. Suresh Menon's Computational Combustion Lab @ GT
- LESLIE is a three-dimensional, parallel, multiblock, structured, finite-volume, compressible flow solver with multiphysics capability. It has been used to study a wide variety of flow systems such as canonical turbulent flames, thermo-acoustic combustion instability, swirl spray combustion, real-gas systems, MHD flows, etc.
- [Images: combustion instability in a model high-pressure rocket combustor; swirl spray combustion: evolution of the flame surface]

Impact on Research: Enzo
- The Enzo Project: Prof. John Wise, Center for Relativistic Astrophysics @ GT, is one of the lead developers of the publicly available, open-source Enzo (http://enzo-project.org/)
- Simulations of early star and galaxy formation that include hydrodynamics, gravity, chemical networks, magnetic fields, and radiation transport
- Interpreting observations of the farthest galaxies and understanding how galaxies form over cosmic time
- [Image: close-up of a young dwarf galaxy produced as part of a simulation (SDSC). Also a killer of black toner; do not print out this slide!]

Impact on Research: Nonpareil
- Prof. Kostas Konstantinidis, Environmental Microbial Genomics Lab @ GT (http://enve-omics.ce.gatech.edu)
- Developing bioinformatics algorithms and tools to analyze genomic and metagenomic data from microbiome projects. For instance, the tools are applied to the Human Microbiome Project to identify how the gut microbial community causes disease vs. a healthy state.
- Nonpareil uses the redundancy of the reads in a metagenomic dataset to estimate the average coverage and predict the amount of sequencing required to achieve "nearly complete coverage", defined as 95% or 99% average coverage.

Impact on Research: Pentran
- Prof. Glenn Sjoden: Chief Scientist, Air Force Technical Applications Center; former Director, Radiological Science and Engineering Laboratory @ GT
- Pentran: 3D parallel deterministic radiation transport code
- Phase-space decomposition with a 3D MPI topology in angle/direction, energy, and space, with further angular refinement inside each MPI task via OpenMP threading
- [Images: top left, "Water Hole" pressurized water reactor model; others, flux from high energy (red) to low energy (purple)]

Today
- Busted myths:
  - "MPI will have no place in the exascale world"
  - "MVAPICH2 is IB-dependent (not so good for the cloud)"
- Known issues:
  - Affinity problems with cpusets
  - mpi4py incompatibility (see the quick check below)
- Wishlist:
  - Ability to run seamlessly on non-IB networks
  - A framework to analyze and publish OSU benchmark results => INAM!!
  - Download links for old versions
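As an aside not found in the slides: when chasing an mpi4py incompatibility, a useful first step is to confirm which MPI library mpi4py is actually linked against at runtime. A minimal check, assuming mpi4py is installed in the loaded environment:

```python
"""Minimal sanity check: print the MPI library mpi4py is linked against and a
one-line rank report, to confirm the loaded module (e.g., an mvapich2 module)
matches the mpi4py build."""

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # Reports the underlying MPI implementation, e.g. an MVAPICH2 version string.
    print(MPI.Get_library_version().strip())
print(f"hello from rank {comm.Get_rank()} of {comm.Get_size()}")
```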

Thank You!