Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures

Paolo Bientinesi
AICES, RWTH Aachen
pauldj@aices.rwth-aachen.de
40th SPEEDUP Workshop on High-Performance Computing
February 6-7, 2012
University of Basel, Switzerland
Paolo Bientinesi (AICES, RWTH Aachen), Fast and Scalable Eigensolvers, February 6, 2012

1. The Problem
2. Architectures and Libraries
3. Multicore Processors: MR3-SMP
4. Distributed Memory Architectures: PMRRR
5. GPUs
6. Conclusions

Symmetric Dense Eigenproblem
STDEIG: AX = XΛ
GENEIG: AX = BXΛ
Input:
  A ∈ C^{n×n}, A^H = A
  B ∈ C^{n×n}, SPD
  k, 1 ≤ k ≤ n (number of eigenpairs)
Output:
  X ∈ C^{n×k} (eigenvectors)
  Λ ∈ R^{k×k} (eigenvalues)
Accuracy:
  ||AX - XΛ|| (residual)
  ||X^H X - I|| (orthogonality)

6-stage approach
GENEIG: AX = BXΛ
1. L L^H = B             Cholesky factorization          O(n^3)
2. M ← L^{-1} A L^{-H}   Reduction to standard form      O(n^3)
3. T = Q^H M Q           Reduction to tridiagonal form   O(n^3)
4. T Z = Z Λ             Tridiagonal eigenproblem        O(kn) to O(n^3)
5. Y = Q Z               Backtransformation #1           O(kn^2)
6. X = L^{-H} Y          Backtransformation #2           O(kn^2)

Nested Eigensolvers
GENEIG: stages 1 to 6 of the 6-stage approach
STDEIG: stages 3 to 5
TRDEIG: stage 4
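The nesting can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code behind the talk: stages 3 to 5 (the standard eigenproblem) are delegated to numpy.linalg.eigh, the function name geneig_via_stdeig is hypothetical, and the test matrices are random. For GENEIG the accuracy measures become ||AX - BXΛ|| and ||X^H B X - I||.

```python
import numpy as np

def geneig_via_stdeig(A, B):
    """Return (lam, X) with A X = B X diag(lam).

    A must be Hermitian, B Hermitian positive definite.
    """
    L = np.linalg.cholesky(B)                      # stage 1: L L^H = B
    # Stage 2: M = L^{-1} A L^{-H}. General solves are used for brevity;
    # a production code would exploit that L is triangular.
    W = np.linalg.solve(L, A)                      # W = L^{-1} A
    M = np.linalg.solve(L, W.conj().T).conj().T    # M = W L^{-H}
    M = (M + M.conj().T) / 2                       # symmetrize away rounding noise
    lam, Y = np.linalg.eigh(M)                     # stages 3-5: STDEIG, M Y = Y diag(lam)
    X = np.linalg.solve(L.conj().T, Y)             # stage 6: X = L^{-H} Y
    return lam, X

# Random test problem (an assumption, not data from the talk).
rng = np.random.default_rng(0)
n = 6
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = G + G.conj().T                      # Hermitian
B = H @ H.conj().T + n * np.eye(n)      # Hermitian positive definite

lam, X = geneig_via_stdeig(A, B)
residual = np.linalg.norm(A @ X - (B @ X) * lam) / np.linalg.norm(A)
orthogonality = np.linalg.norm(X.conj().T @ B @ X - np.eye(n))
print(residual, orthogonality)
```

Both printed numbers should be close to machine precision for a well-conditioned B.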

Algorithms
Stage 4: TRDEIG
1958  Bisection + Inverse Iteration (BI)   subsets                              O(kn^2)
1961  QR                                   high accuracy                        O(n^3)
1981  Divide & Conquer (DC)                BLAS3, accurate                      O(n^3)
1997  MRRR                                 subsets, no re-orthogonalization     O(kn)
Stage 3: Reduction to TRDEIG
1-stage Householder
Successive Banded Reduction

Numerical Libs Development Cycle (?)
(0) New architecture
(1) GEMM (peak performance), FFT
(2) BLAS3, factorizations, AX=B; LINPACK benchmark
(3) Factorizations, AX=B, matrix operations...
(4) Eigenproblems
HPC: linear solvers, eigensolvers

History
2005-2006: Cell
  GEMM: 99%
  FFT
  Linear systems: HPL
2008: Roadrunner > 1 PetaFLOP
2009: discontinued
Eigensolvers? -

History
2005: GPGPUs
  CUBLAS (*)
  HPL, Top500
  CULA, FLAME, MAGMA
Eigensolvers? 2011

History
2005-2006: multicores
  GEMM, multithreaded (mt) BLAS
  HPL, Top500
  FLAME, PLASMA
Eigensolvers??

Our contributions
MR3-SMP (multithreaded)
  Matthias Petschow, RWTH Aachen, http://code.google.com/p/mr3smp
PMRRR, EleMRRR (hybrid MPI + MT)
  Matthias Petschow, RWTH Aachen, http://code.google.com/p/pmrrr
  Jack Poulson, UT Austin, http://code.google.com/p/elemental...
GPUs
  Christian Lessig, University of Toronto
  Enrique Quintana-Ortí, Universidad Jaume I
  Francisco Igual, Universidad Jaume I

Multi-threaded BLAS
Xeon, 32 physical cores
[Figure: Efficiency of GEMM. x-axis: matrix dimension (1000 to 8000); y-axis: efficiency (0 to 1); curves: 1, 2, 4, 8, 16, 32 threads.]

Multi-threaded BLAS for TRDEIG?
Tridiagonal eigensolvers. Matrix size = 4289, from DFT.
[Figure, two panels: time in seconds vs. number of threads (4 to 24). Left panel (0 to 500 s): MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel (0 to 14 s): MRRR (MKL), MRRR (LAPACK), DC (MKL).]

More motivation?
MR3 is O(n^2) anyway...
[Figure: fraction of execution time (0 to 1) vs. number of threads (1, 2, 4, 8, 16, 24), N = 4,289; components: Reduction, Sequential MRRR, Backtransformation.]

MRRR
Dhillon & Parlett (1998)
Multiple Relatively Robust Representations
first stable algorithm to compute k eigenpairs in O(nk) ops
no reorthogonalization
Two phases: 1) eigenvalues, 2) eigenvectors + eigenvalues
eigenvalues: dqds or bisection
eigenvectors (flowchart): scan the λ's; if well separated, compute the eigenpair (λ, z); if clustered, shift, build a new RRR, refine the λ's and scan again
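Bisection for the eigenvalues can be sketched in plain Python. The sketch assumes a real symmetric tridiagonal matrix given by its diagonal d and off-diagonal e; the function names count_below and kth_eigenvalue are hypothetical, and this is textbook Sturm-count bisection rather than the routine used in MRRR implementations. It shows why subsets are cheap: each eigenvalue is located independently at O(n) cost per bisection step.

```python
import math

TINY = 1e-300  # stands in for a pivot that is exactly zero

def count_below(d, e, x):
    """Count eigenvalues of the tridiagonal matrix (d, e) that are smaller than x.

    The count equals the number of negative pivots in the LDL^T
    factorization of T - x*I (Sturm count).
    """
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q
        if q == 0.0:
            q = -TINY
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """Return the k-th smallest eigenvalue (0-based) by bisection."""
    n = len(d)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid      # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# 1-2-1 matrix of size 10: only the three smallest eigenvalues are computed.
n = 10
d, e = [2.0] * n, [1.0] * (n - 1)
three_smallest = [kth_eigenvalue(d, e, k) for k in range(3)]
print(three_smallest)
```

For the 1-2-1 matrix the results can be checked against the closed form 2 - 2cos(jπ/(n+1)), j = 1, ..., n.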

Representation Tree
[Figure: representation tree.]

MR3-SMP: the work queue
Tasks:
a) Singleton S: Eigenvector computation
b) Cluster C: Shift + new representation (RRR)
c) New RRR R: Eigenvalues refinement
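The three task types can be mimicked with a toy work queue built on Python's queue and threading modules. This is a structural sketch only, not the MR3-SMP implementation: the numerical kernels are stubs, the relative-gap threshold GAP_TOL, the shift rule and all function names are assumptions made for illustration. What it does show is the pattern of tasks that spawn further tasks until every eigenvalue has been handled as a singleton.

```python
import queue
import threading

GAP_TOL = 1e-3  # assumed relative-gap threshold separating singletons from clusters

def classify(pairs):
    """Group sorted (index, value) pairs; neighbours with a small relative gap share a group."""
    groups, current = [], [pairs[0]]
    for prev, item in zip(pairs, pairs[1:]):
        gap = abs(item[1] - prev[1])
        scale = max(abs(item[1]), abs(prev[1]))
        if gap < GAP_TOL * scale:
            current.append(item)
        else:
            groups.append(current)
            current = [item]
    groups.append(current)
    return groups

def run_work_queue(eigenvalues, num_threads=4, max_depth=8):
    """Process all eigenvalues; return {index: depth at which it became a singleton}."""
    tasks = queue.Queue()
    levels = {}
    lock = threading.Lock()

    def enqueue(pairs, depth):
        for group in classify(pairs):
            if len(group) == 1 or depth >= max_depth:
                for idx, _ in group:
                    tasks.put(("S", idx, depth))        # singleton task
            else:
                tasks.put(("C", group, depth))          # cluster task

    def worker():
        while True:
            task = tasks.get()
            if task is None:                            # shutdown signal
                tasks.task_done()
                return
            kind, payload, depth = task
            if kind == "S":
                with lock:                              # stub for the eigenvector computation
                    levels[payload] = depth
            elif kind == "C":
                lo, hi = payload[0][1], payload[-1][1]
                shift = lo - (hi - lo)                  # toy shift just left of the cluster
                shifted = [(i, lam - shift) for i, lam in payload]
                tasks.put(("R", shifted, depth + 1))    # new representation
            else:                                       # "R": stub refinement, then re-classify
                enqueue(payload, depth)
            tasks.task_done()

    enqueue(list(enumerate(sorted(eigenvalues))), 0)
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()                                        # children are queued before parents finish
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return levels

levels = run_work_queue([0.5, 1.0, 1.0001, 1.0002, 2.0, 3.0])
print(levels)
```

In this toy input the three values near 1.0 form a cluster, so they are resolved one level deeper than the well-separated values.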

Example trace: 16 cores
[Figure: execution trace; label: eigenvectors.]
Matrix size: 12387
Execution time: 3.3 s
Sequential: 49.3 s (LAPACK)
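The two timings above translate into a speedup and a parallel efficiency. The arithmetic below uses only the numbers on the slide; note that the baseline is sequential LAPACK, so this is a speedup over LAPACK rather than over a one-thread run of the same code.

```python
sequential_s = 49.3   # sequential LAPACK, from the slide
parallel_s = 3.3      # 16 cores, from the slide
cores = 16

speedup = sequential_s / parallel_s
efficiency = speedup / cores
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
# prints: speedup 14.9x, efficiency 93%
```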

MR3-SMP: Timings
Matrix size = 4289, from DFT.
[Figure, two panels: time in seconds vs. number of threads (4 to 24). Left panel (0 to 500 s): MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel (0 to 14 s): MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]

A larger example
Matrix size = 16023; frequency response analysis of automobiles.
[Figure, two panels, N = 16023, vs. number of threads (4 to 24). Left panel, time in minutes (0 to 600): MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right panel, time in seconds (0 to 350): MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]
From 9+ hours to 8.3 seconds.

Speedups
[Figure, two panels. Left panel: time in seconds (0 to 5) for Eigenvalues and Eigenvectors; x-axis: LAPACK, then 2, 4, 8, 16, 24 threads. Right panel: speedup (0 to 25) vs. number of threads (4 to 24); curves: Ideal, Eigenvalues (bisection), Eigenvectors (bisection), Eigenvectors (dqds), Total.]

3 stages: before and after
[Figure, two panels: fraction of execution time (0 to 1) vs. number of threads (1, 2, 4, 8, 16, 24), N = 4,289. Before: Reduction, Sequential MRRR, Backtransformation. After: Reduction, MR3-SMP, Backtransformation.]

Distributed memory
PMRRR, EleMRRR
Static assignment of eigenpairs to nodes
Multithreading
Node-node communication: only eigenvalues
PMRRR + Elemental = EleMRRR
Generalized, standard and tridiagonal hybrid eigensolvers
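A static assignment of eigenpairs to nodes can be as simple as a contiguous block partition. The slide does not say which scheme PMRRR uses, so the nearly equal contiguous blocks below, and the function name block_ranges, are assumptions chosen for illustration.

```python
def block_ranges(k, p):
    """Split eigenpair indices 0..k-1 into p contiguous, nearly equal blocks.

    Returns a list of half-open ranges (start, stop), one per process.
    """
    base, extra = divmod(k, p)
    ranges, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)   # the first `extra` ranks get one more
        ranges.append((start, start + size))
        start += size
    return ranges

print(block_ranges(10, 3))
# prints: [(0, 4), (4, 7), (7, 10)]
```

Because the assignment is fixed up front, each node knows which eigenpairs it owns without any negotiation at run time.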

TRDEIG: PMRRR
[Figure, two panels: 1-2-1 matrix; Wilkinson matrix.]
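Both test matrices are easy to construct. The sketch below assumes the usual definitions: the 1-2-1 matrix has 2 on the diagonal and 1 on the off-diagonals, and the Wilkinson matrix of size 2m+1 has diagonal m, ..., 1, 0, 1, ..., m with unit off-diagonals. The sizes are small illustrative choices, not those used in the talk, and a dense NumPy solver is used only as a check.

```python
import numpy as np

def one_two_one(n):
    """Diagonal and off-diagonal of the n-by-n 1-2-1 matrix."""
    return 2.0 * np.ones(n), np.ones(n - 1)

def wilkinson(m):
    """Diagonal and off-diagonal of the Wilkinson matrix of size 2m+1."""
    d = np.abs(np.arange(-m, m + 1)).astype(float)
    return d, np.ones(2 * m)

def dense(d, e):
    """Assemble the dense symmetric tridiagonal matrix."""
    return np.diag(d) + np.diag(e, 1) + np.diag(e, -1)

# 1-2-1 matrix: eigenvalues are known in closed form.
n = 50
lam = np.linalg.eigvalsh(dense(*one_two_one(n)))
exact = 2.0 - 2.0 * np.cos(np.arange(1, n + 1) * np.pi / (n + 1))
max_err = np.max(np.abs(lam - exact))

# Wilkinson matrix of size 21: the two largest eigenvalues nearly coincide.
w = np.linalg.eigvalsh(dense(*wilkinson(10)))
top_gap = w[-1] - w[-2]
print(max_err, top_gap)
```

The tiny gap between the largest Wilkinson eigenvalues is what makes this matrix a demanding test: tightly clustered eigenvalues are the case in which eigenvector orthogonality is hardest to obtain.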

GENEIG: Weak & strong scaling
[Figure, two panels: weak scalability; strong scalability, n = 20000.]

GENEIG: Efficiency
[Figure: parallel efficiency (0 to 1) vs. number of cores (64 to 2048); matrix sizes 20k, 40k, 80k; curves: EleMRRR, ScaLAPACK's DC, ScaLAPACK's MRRR.]

TRDEIG
NVIDIA Tesla; mrrr_dp = data-parallel MRRR
            rand(0,1)              rand(-1,1)
    n    LAPACK   mrrr_dp      LAPACK   mrrr_dp
  128      6.98      6.26        6.79      3.84
  256      32.1      13.0       31.86      8.34
  512     154.9      28.7       152.7      19.2
 1024     656.1      60.2       647.6      54.0
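Since n doubles from row to row, the base-2 logarithm of the ratio of consecutive timings gives an empirical growth exponent. The short calculation below uses the rand(0,1) columns of the table; the time unit is not stated on the slide, but the exponents do not depend on it.

```python
import math

sizes = [128, 256, 512, 1024]
times = {
    "LAPACK": [6.98, 32.1, 154.9, 656.1],    # rand(0,1) column
    "mrrr_dp": [6.26, 13.0, 28.7, 60.2],     # rand(0,1) column
}

def growth_exponents(ts):
    """log2 of consecutive time ratios; valid because n doubles each step."""
    return [math.log2(b / a) for a, b in zip(ts, ts[1:])]

exponents = {name: growth_exponents(ts) for name, ts in times.items()}
for name, exps in exponents.items():
    print(name, [round(x, 2) for x in exps])
```

On these four sizes the LAPACK timings grow roughly quadratically with n, while the mrrr_dp timings grow roughly linearly, which is why the gap widens as n increases.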

STDEIG
Nehalem, 8 cores
Reduction to tridiagonal form
     n    LAPACK       SBR   SBR + GPU
  2048      0.23       0.6        0.58
  6144       8.4      8.58        6.26
 10240      40.5      30.4       20.32
 24576     582.4     308.4       166.8
Reduction + backtransformation
     n    LAPACK       SBR   SBR + GPU
  2048      0.50      1.77        1.12
  6144      13.5      29.0        12.7
 10240      61.6     116.8        43.8
 24576     845.1    1416.7       403.3

Conclusions
Multi-threaded BLAS for eigensolvers: not THAT good
MR3-SMP, PMRRR, EleMRRR:
  eigensolvers tailored for multi-core, distributed, hybrid architectures
  faster than LAPACK, MKL, ScaLAPACK
  almost perfect speedups
  software is available
Financial support from the Deutsche Forschungsgemeinschaft (German Research Foundation) through grant GSC 111 is gratefully acknowledged.