Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures

Paolo Bientinesi
AICES, RWTH Aachen
pauldj@aices.rwth-aachen.de

40th SPEEDUP Workshop on High-Performance Computing
February 6-7, 2012, University of Basel, Switzerland

Outline
1. The Problem
2. Architectures and Libraries
3. Multicore Processors: MR3-SMP
4. Distributed Memory Architectures: PMRRR
5. GPUs
6. Conclusions

Symmetric Dense Eigenproblem

STDEIG: $AX = X\Lambda$
GENEIG: $AX = BX\Lambda$

Input:
- $A \in \mathbb{C}^{n \times n}$, $A^H = A$
- $B \in \mathbb{C}^{n \times n}$, SPD
- $k$, $1 \le k \le n$: number of eigenpairs

Output:
- $X \in \mathbb{C}^{n \times k}$: eigenvectors
- $\Lambda \in \mathbb{R}^{k \times k}$: eigenvalues

Accuracy:
- residual: $\|AX - X\Lambda\|$
- orthogonality: $\|X^H X - I\|$
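
The two accuracy measures can be checked numerically once eigenpairs are available. The following is a minimal sketch for the real symmetric STDEIG case only, using the Frobenius norm without any scaling; both choices, and the helper names `matmul`, `transpose`, `frobenius`, and `accuracy`, are assumptions made for illustration and do not come from the slides.

```python
import math

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def frobenius(A):
    """Frobenius norm (assumed here; the slides do not name a norm)."""
    return math.sqrt(sum(x * x for row in A for x in row))

def accuracy(A, X, lam):
    """Return (residual, orthogonality) for real symmetric A.

    X holds the eigenvectors as columns, lam the matching eigenvalues.
    """
    n, k = len(X), len(lam)
    AX = matmul(A, X)
    # Residual matrix AX - X*Lambda, with Lambda = diag(lam).
    R = [[AX[i][j] - X[i][j] * lam[j] for j in range(k)] for i in range(n)]
    # Orthogonality matrix X^T X - I.
    G = matmul(transpose(X), X)
    for j in range(k):
        G[j][j] -= 1.0
    return frobenius(R), frobenius(G)

# Small example with known eigenpairs: [[2, 1], [1, 2]] has eigenvalues 1 and 3.
s = 1.0 / math.sqrt(2.0)
A = [[2.0, 1.0], [1.0, 2.0]]
X = [[s, s], [-s, s]]
lam = [1.0, 3.0]
res, orth = accuracy(A, X, lam)
```

For exact eigenpairs both numbers should sit at the level of rounding error; in practice such measures are usually reported relative to the matrix norm and size.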

6-stage approach for GENEIG, $AX = BX\Lambda$

1. $LL^H = B$: Cholesky factorization, $O(n^3)$
2. $M \leftarrow L^{-1} A L^{-H}$: reduction to standard form, $O(n^3)$
3. $T = Q^H M Q$: reduction to tridiagonal form, $O(n^3)$
4. $TZ = Z\Lambda$: tridiagonal eigenproblem, $O(kn)$ to $O(n^3)$
5. $Y = QZ$: backtransformation #1, $O(kn^2)$
6. $X = L^{-H} Y$: backtransformation #2, $O(kn^2)$
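
Why the six stages recover the generalized eigenpairs can be verified in two lines. The derivation below is a sketch that assumes the generalized problem is written as $AX = BX\Lambda$, that $Q$ is unitary, and that $L$ is the Cholesky factor of $B$; it uses only the quantities named in the stages.

```latex
\begin{aligned}
MY &= MQZ = QTZ = QZ\Lambda = Y\Lambda, \\
AX &= A L^{-H} Y = L\,(L^{-1} A L^{-H})\,Y = LMY = LY\Lambda
    = L L^{H} (L^{-H} Y)\,\Lambda = BX\Lambda .
\end{aligned}
```

The first line shows that stages 3 to 5 solve the standard problem for $M$; the second shows that the final backtransformation turns those eigenvectors into eigenvectors of the pair $(A, B)$ with the same eigenvalues.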

Nested eigensolvers: GENEIG (stages 1 to 6) contains STDEIG (stages 3 to 5), which in turn contains TRDEIG (stage 4).

Algorithms

Stage 4: TRDEIG
- 1958: Bisection + Inverse Iteration (BI). Subsets; $O(kn^2)$
- 1961: QR. High accuracy; $O(n^3)$
- 1981: Divide & Conquer (DC). BLAS3, accurate; $O(n^3)$
- 1997: MRRR. Subsets, no re-orthogonalization; $O(kn)$

Stage 3: reduction to TRDEIG
- 1-stage Householder
- Successive Band Reduction (SBR)

2. Architectures and Libraries

Numerical Libs Development Cycle (?)

(0) New architecture
(1) GEMM (peak performance), FFT
(2) BLAS3, factorizations, AX=B; LINPACK benchmark
(2) BLAS3, factorizations, AX=B
(2) factorizations, AX=B
(2) factorizations, AX=B
(3) factorizations, AX=B, matrix operations
...
(4) Eigenproblems

HPC, linear solvers, eigensolvers

History: Cell
- 2005-2006: Cell. GEMM: 99%; FFT; linear systems: HPL
- 2008: Roadrunner, > 1 PetaFLOP
- 2009: discontinued
- Eigensolvers? -

History: GPGPUs
- 2005: GPGPUs. CUBLAS (*); HPL, Top500; CULA; FLAME, MAGMA
- Eigensolvers? 2011

History: multicores
- 2005-2006: multicores. GEMM, multi-threaded (mt) BLAS; HPL, Top500; FLAME, PLASMA
- Eigensolvers? ?

Our contributions

MR3-SMP (multithreaded)
- Matthias Petschow, RWTH Aachen, http://code.google.com/p/mr3smp

PMRRR, EleMRRR (hybrid MPI + MT)
- Matthias Petschow, RWTH Aachen, http://code.google.com/p/pmrrr
- Jack Poulson, UT Austin, http://code.google.com/p/elemental...

GPUs
- Christian Lessig, University of Toronto
- Enrique Quintana-Ortí, Universidad Jaume I
- Francisco Igual, Universidad Jaume I

3. Multicore Processors: MR3-SMP

Multi-threaded BLAS (Xeon, 32 physical cores)
[Figure: efficiency of GEMM versus matrix dimension (1000 to 8000) for 1, 2, 4, 8, 16, and 32 threads.]

Multi-threaded BLAS for TRDEIG?
Tridiagonal eigensolvers. Matrix size = 4289, from DFT.
[Figure, two panels: time in seconds versus number of threads (4 to 24). Left panel: MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: MRRR (MKL), MRRR (LAPACK), DC (MKL).]

More motivation? MR3 is $O(n^2)$ anyway...
[Figure: fraction of execution time spent in reduction, sequential MRRR, and backtransformation versus number of threads (1, 2, 4, 8, 16, 24); N = 4,289.]

MRRR: Multiple Relatively Robust Representations, Dhillon & Parlett (1998)
- first stable algorithm to compute k eigenpairs in $O(nk)$ operations
- no reorthogonalization

Two phases: 1) eigenvalues; 2) eigenvectors + eigenvalues
- eigenvalues: dqds or bisection
- eigenvectors: scan the λ's. A well-separated λ: compute one eigenpair (λ, z). A cluster: shift, form a new RRR, refine the λ's, and scan again.
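
To make the "bisection" option for the eigenvalue phase concrete, here is a minimal sketch of bisection on a symmetric tridiagonal matrix using a Sturm-sequence count. It is a textbook formulation, not the code used in LAPACK, MKL, or MR3-SMP; the function names, the Gershgorin starting interval, the tolerance, and the tiny pivot replacement are all assumptions chosen for illustration.

```python
import math

def sturm_count(d, e, x, tiny=1e-30):
    """Count eigenvalues of the symmetric tridiagonal matrix below x.

    d is the diagonal, e the off-diagonal. The count is the number of
    negative pivots in the LDL^T factorization of T - x*I. A zero pivot
    is replaced by -tiny, so a point that hits an eigenvalue exactly
    counts that eigenvalue as below x.
    """
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q
        if q == 0.0:
            q = -tiny
        if q < 0.0:
            count += 1
    return count

def gershgorin(d, e):
    """Interval guaranteed to contain all eigenvalues."""
    n = len(d)
    lo, hi = math.inf, -math.inf
    for i in range(n):
        r = (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        lo = min(lo, d[i] - r)
        hi = max(hi, d[i] + r)
    return lo, hi

def kth_eigenvalue(d, e, k, tol=1e-12):
    """Approximate the k-th smallest eigenvalue (k starts at 0) by bisection."""
    lo, hi = gershgorin(d, e)
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) > k:
            hi = mid  # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Example: 5x5 matrix with 2 on the diagonal and 1 on the off-diagonals.
d = [2.0] * 5
e = [1.0] * 4
eigs = [kth_eigenvalue(d, e, k) for k in range(5)]
```

Each eigenvalue is found independently of the others, which is why bisection lends itself to computing subsets and to parallel execution.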

Representation Tree
[Figure: representation tree.]

MR3-SMP: the work queue

Tasks:
a) Singleton (S): eigenvector computation
b) Cluster (C): shift + new representation (RRR)
c) New RRR (R): eigenvalue refinement
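
The three task types can be illustrated with a toy work queue served by a pool of threads. This is only a sketch of the queue mechanics, not the MR3-SMP implementation: the relative-gap threshold `GAP_TOL`, the choice of shift (the leftmost value of the cluster), the depth guard, and the task encoding are hypothetical, and the S task merely records an index where a real solver would compute an eigenvector.

```python
import queue
import threading

GAP_TOL = 1e-3   # hypothetical relative-gap threshold for "well separated"
MAX_DEPTH = 10   # guard so the toy example always terminates

def classify(values):
    """Split sorted (index, value) pairs into groups at large relative gaps."""
    groups, current = [], [values[0]]
    for prev, cur in zip(values, values[1:]):
        scale = max(abs(prev[1]), abs(cur[1]))
        relgap = abs(cur[1] - prev[1]) / scale if scale > 0.0 else 0.0
        if relgap >= GAP_TOL:
            groups.append(current)
            current = []
        current.append(cur)
    groups.append(current)
    return groups

def worker(tasks, done, lock):
    """Process S, C, and R tasks until a STOP marker arrives."""
    while True:
        kind, depth, values = tasks.get()
        if kind == "STOP":
            tasks.task_done()
            return
        if kind == "S":
            # Singleton: stand-in for the eigenvector computation.
            with lock:
                done.append(values[0][0])
        elif kind == "C":
            # Cluster: shift close to the cluster, giving a new representation.
            shift = values[0][1]
            shifted = [(i, v - shift) for i, v in values]
            tasks.put(("R", depth + 1, shifted))
        elif kind == "R":
            # New representation: reclassify and enqueue the child tasks.
            for group in classify(values):
                if len(group) == 1 or depth >= MAX_DEPTH:
                    for item in group:
                        tasks.put(("S", depth, [item]))
                else:
                    tasks.put(("C", depth, group))
        tasks.task_done()

def run(eigenvalues, num_threads=4):
    """Run the toy queue; return the sorted indices handled by S tasks."""
    tasks, done, lock = queue.Queue(), [], threading.Lock()
    root = sorted(enumerate(eigenvalues), key=lambda p: p[1])
    tasks.put(("R", 0, root))
    threads = [threading.Thread(target=worker, args=(tasks, done, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every S, C, and R task has been processed
    for _ in threads:
        tasks.put(("STOP", 0, None))
    for t in threads:
        t.join()
    return sorted(done)

processed = run([1.0, 1.0001, 1.0002, 2.0, 5.0, 5.0000001])
```

Because child tasks are enqueued before the parent is marked as done, the queue never looks empty while work is still pending, and idle threads pick up whatever task type is available next.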

Example trace: 16 cores, eigenvectors
- Matrix size: 12387
- Execution time: 3.3 s
- Sequential: 49.3 s (LAPACK)
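
The two timings translate into a speedup and a parallel efficiency. The arithmetic below assumes that both numbers cover the same computation and takes sequential LAPACK as the baseline, so it is a speedup over a different code rather than a self-relative one; the variable names are illustrative.

```python
sequential_s = 49.3  # sequential LAPACK time from the slide, in seconds
parallel_s = 3.3     # execution time on 16 cores from the slide, in seconds
cores = 16

speedup = sequential_s / parallel_s
efficiency = speedup / cores
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.2f}")
# prints "speedup 14.9x, efficiency 0.93"
```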

MR3-SMP: Timings
Matrix size = 4289, from DFT.
[Figure, two panels: time in seconds versus number of threads (4 to 24). Left panel: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]

A larger example
Matrix size = 16023; frequency response analysis of automobiles.
[Figure, two panels, N = 16023, versus number of threads (4 to 24). Left panel: time in minutes for MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel: time in seconds for MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]
From 9+ hours to 8.3 seconds.

Speedups
[Figure, two panels versus number of threads. Left panel: time in seconds for eigenvalues and eigenvectors, for LAPACK and for 2, 4, 8, 16, and 24 threads. Right panel: speedup for 4 to 24 threads, with curves for ideal, eigenvalues (bisection), eigenvectors (bisection), eigenvectors (dqds), and total.]

3 stages: before and after
[Figure, two panels: fraction of execution time versus number of threads (1, 2, 4, 8, 16, 24), N = 4,289. Before: reduction, sequential MRRR, backtransformation. After: reduction, MR3-SMP, backtransformation.]

4. Distributed Memory Architectures: PMRRR

Distributed memory: PMRRR, EleMRRR
- Static assignment of eigenpairs to nodes
- Multithreading
- Node-node communication: only eigenvalues
- PMRRR + Elemental = EleMRRR
- Generalized, standard, and tridiagonal hybrid eigensolvers
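
A static assignment of eigenpairs to nodes can be as simple as splitting the index range into near-equal contiguous blocks. The sketch below assumes exactly that; the slides state only that the assignment is static, so the block layout and the function name `assign_eigenpairs` are illustrative and not taken from PMRRR.

```python
def assign_eigenpairs(k, num_nodes):
    """Return one half-open index range per node, covering eigenpairs 0..k-1.

    The first (k mod num_nodes) nodes receive one extra eigenpair, so
    block sizes differ by at most one.
    """
    base, extra = divmod(k, num_nodes)
    ranges, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

ranges = assign_eigenpairs(10, 4)
# ranges == [(0, 3), (3, 6), (6, 8), (8, 10)]
```

With a fixed assignment of this kind each node knows in advance which eigenpairs it owns, which fits the statement that only eigenvalues need to travel between nodes.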

TRDEIG: PMRRR
[Figure, two panels: 1-2-1 matrix; Wilkinson matrix.]
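
For readers unfamiliar with the two test matrices, the sketch below builds them as diagonal and off-diagonal arrays. It assumes the usual definitions: the 1-2-1 matrix has 2 on the diagonal and 1 on the off-diagonals, and the Wilkinson matrix is the variant $W_{2m+1}^{+}$ with diagonal $m, \dots, 1, 0, 1, \dots, m$ and unit off-diagonals. The slides do not state which variant or which sizes were used, so these are assumptions.

```python
def one_two_one(n):
    """Diagonal and off-diagonal of the n-by-n 1-2-1 matrix."""
    return [2.0] * n, [1.0] * (n - 1)

def wilkinson(m):
    """Diagonal and off-diagonal of the (2m+1)-by-(2m+1) Wilkinson matrix."""
    d = [float(abs(i - m)) for i in range(2 * m + 1)]
    e = [1.0] * (2 * m)
    return d, e

d121, e121 = one_two_one(4)
dw, ew = wilkinson(2)
# dw == [2.0, 1.0, 0.0, 1.0, 2.0], ew == [1.0, 1.0, 1.0, 1.0]
```

The Wilkinson matrix is a common stress test because its largest eigenvalues come in very tight pairs, whereas the eigenvalues of the 1-2-1 matrix are known in closed form.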

GENEIG: Weak & strong scaling
[Figure, two panels: weak scalability; strong scalability, n = 20000.]

GENEIG: Efficiency
[Figure: parallel efficiency versus number of cores (64 to 2048), with matrix sizes 20k, 40k, and 80k, for EleMRRR, ScaLAPACK's DC, and ScaLAPACK's MRRR.]

5. GPUs

TRDEIG on NVIDIA Tesla (mrrr_dp = data-parallel MRRR)

          rand(0,1)             rand(-1,1)
   n    LAPACK   mrrr_dp     LAPACK   mrrr_dp
 128      6.98      6.26       6.79      3.84
 256      32.1      13.0      31.86      8.34
 512     154.9      28.7      152.7      19.2
1024     656.1      60.2      647.6      54.0

STDEIG on Nehalem, 8 cores

Reduction to tridiagonal form
    n    LAPACK       SBR   SBR + GPU
 2048      0.23       0.6        0.58
 6144       8.4      8.58        6.26
10240      40.5      30.4       20.32
24576     582.4     308.4       166.8

Reduction + backtransformation
    n    LAPACK       SBR   SBR + GPU
 2048      0.50      1.77        1.12
 6144      13.5      29.0        12.7
10240      61.6     116.8        43.8
24576     845.1    1416.7       403.3

6. Conclusions

Conclusions
- Multi-threaded BLAS for eigensolvers: not THAT good
- MR3-SMP, PMRRR, EleMRRR: eigensolvers tailored for multi-core, distributed, and hybrid architectures
  - faster than LAPACK, MKL, ScaLAPACK
  - almost perfect speedups
  - software is available

Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged.