Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures

1 Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures
Paolo Bientinesi, AICES, RWTH Aachen
40th SPEEDUP Workshop on High-Performance Computing, February 6-7, 2012, University of Basel, Switzerland

2 Outline
1. The Problem
2. Architectures and Libraries
3. Multicore Processors: MR3-SMP
4. Distributed Memory Architectures: PMRRR
5. GPUs
6. Conclusions

3 Symmetric Dense Eigenproblem
STDEIG: $AX = X\Lambda$
GENEIG: $AX = BX\Lambda$
Input: $A \in \mathbb{C}^{n \times n}$, $A^H = A$; $B \in \mathbb{C}^{n \times n}$, SPD; $k$, $1 \le k \le n$, the number of eigenpairs
Output: $X \in \mathbb{C}^{n \times k}$ (eigenvectors), $\Lambda \in \mathbb{R}^{k \times k}$ (eigenvalues)
Accuracy: $\|AX - X\Lambda\|$ (residual), $\|X^H X - I\|$ (orthogonality)

4 6-stage approach, GENEIG: $AX = BX\Lambda$
1. $LL^H = B$: Cholesky factorization, $O(n^3)$
2. $M \leftarrow L^{-1} A L^{-H}$: Reduction to standard form, $O(n^3)$
3. $T = Q^H M Q$: Reduction to tridiagonal form, $O(n^3)$
4. $TZ = Z\Lambda$: Tridiagonal eigenproblem, $O(kn)$ to $O(n^3)$
5. $Y = QZ$: Backtransformation #1, $O(kn^2)$
6. $X = L^{-H} Y$: Backtransformation #2, $O(kn^2)$

5 Nested Eigensolvers
The six stages of the previous slide nest three eigensolvers:
GENEIG: stages 1 to 6
STDEIG: stages 3 to 5
TRDEIG: stage 4
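A short derivation, added here as a sketch, shows why stages 2 and 6 turn GENEIG into STDEIG. It uses only the Cholesky factor $L$ from stage 1 and the substitution $Y = L^H X$.

```latex
AX = BX\Lambda = L L^H X \Lambda
\;\Longrightarrow\;
L^{-1} A L^{-H} \,(L^H X) = (L^H X)\,\Lambda
\;\Longrightarrow\;
M Y = Y \Lambda,
\qquad M = L^{-1} A L^{-H}, \quad Y = L^H X .
```

Since $M$ is Hermitian whenever $A$ is, stages 3 to 5 apply to it unchanged, and stage 6 recovers $X = L^{-H} Y$.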

7 Algorithms
Stage 4: TRDEIG
1958 Bisection + Inverse Iteration (BI): subsets, $O(kn^2)$
1961 QR: high accuracy, $O(n^3)$
1981 Divide & Conquer (DC): BLAS3, accurate, $O(n^3)$
1997 MRRR: subsets, no re-orthogonalization, $O(kn)$
Stage 3: Reduction to TRDEIG
1-stage Householder
Successive Banded Reduction
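To make the "subsets" property of bisection concrete, here is a minimal pure-Python sketch of the eigenvalue half of BI. It is not the LAPACK or MKL code: the names `sturm_count` and `kth_eigenvalue`, the zero-pivot guard, and the tolerance are assumptions chosen for illustration. It relies on the fact that the number of negative pivots of $T - xI$ equals the number of eigenvalues below $x$.

```python
import math


def sturm_count(d, e, x):
    """Count eigenvalues of the symmetric tridiagonal matrix T that are < x.

    d holds the n diagonal entries, e the n-1 off-diagonal entries.
    The count is the number of negative pivots in the LDL^T
    factorization of T - x*I (Sylvester's law of inertia).
    """
    count = 0
    q = 1.0
    for i in range(len(d)):
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = -1e-30  # illustrative guard against an exactly zero pivot
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, tol=1e-12):
    """Return the k-th smallest eigenvalue (k is 1-based) by bisection."""
    n = len(d)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [
        (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) >= k:
            hi = mid  # at least k eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Test matrix tridiag(-1, 2, -1); its eigenvalues are 2 - 2*cos(j*pi/(n+1)).
    d, e = [2.0] * 5, [-1.0] * 4
    print(kth_eigenvalue(d, e, 1), 2.0 - 2.0 * math.cos(math.pi / 6))
```

Each eigenvalue is found independently of the others, which is why bisection can compute a subset of the spectrum and why the individual searches can run in parallel.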

20 Numerical Libs Development Cycle (?)
(0) New architecture
(1) GEMM (peak performance), FFT
(2) BLAS3, factorizations, AX=B; LINPACK benchmark
(2) BLAS3, factorizations, AX=B
(2) factorizations, AX=B
(2) factorizations, AX=B
(3) factorizations, AX=B, matrix operations...
(4) Eigenproblems
HPC Linear solvers Eigensolvers

21 History: Eigensolvers?
Cell
GEMM: 99%
FFT
Linear systems: HPL
2008: Roadrunner > 1 PetaFLOP
2009: discontinued

22 History: Eigensolvers?
GPGPUs
CUBLAS (*)
HPL, Top500
CULA
FLAME, MAGMA

23 History: Eigensolvers??
multicores
GEMM, mt BLAS
HPL, Top500
FLAME, PLASMA

24 Our contributions
MR3-SMP (multithreaded): Matthias Petschow, RWTH Aachen
PMRRR, EleMRRR (hybrid MPI + MT): Matthias Petschow, RWTH Aachen; Jack Poulson, UT Austin
GPUs: Christian Lessig, University of Toronto; Enrique Quintana-Ortí, Universidad Jaume I; Francisco Igual, Universidad Jaume I

26 Multi-threaded BLAS (Xeon, 32 physical cores)
[Figure: Efficiency of GEMM. Efficiency vs. matrix dimension, with curves for 1, 2, 4, 8, 16 and 32 threads.]

27 Multi-threaded BLAS for TRDEIG?
Tridiagonal eigensolvers. Matrix size=4289, from DFT.
[Figure, left panel: time in seconds vs. number of threads for MRRR (MKL), DC (MKL), QR (MKL), BI (MKL).]
[Figure, right panel: time in seconds vs. number of threads for MRRR (MKL), MRRR (LAPACK), DC (MKL).]

33 More motivation?
MR3 is $O(n^2)$ anyway...
[Figure: fraction of execution time vs. number of threads, N = 4,289; components: Backtransformation, Sequential MRRR, Reduction.]

36 MRRR
Dhillon & Parlett (1998)
Multiple Relatively Robust Representations
first stable algorithm to compute k eigenpairs in $O(nk)$ ops
no reorthogonalization
Two phases: 1) eigenvalues, 2) eigenvectors + eigenvalues
eigenvalues: dqds or Bisection
eigenvectors: scan the λ's; where a λ is well separated (sep.), compute the eigenpair (λ, z); where the λ's form a cluster, shift, build a new RRR, refine the λ's, and scan again
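The "scan" step can be sketched as a grouping of sorted eigenvalues by relative gap. This is an illustration only: the function name `classify`, the exact gap criterion, and the threshold `gaptol=1e-3` are assumptions, not the criterion used in the authors' code.

```python
def classify(eigenvalues, gaptol=1e-3):
    """Group indices of sorted eigenvalues into singletons and clusters.

    Consecutive eigenvalues whose distance is small relative to their
    magnitude end up in the same group. A group of length 1 is a
    singleton; a longer group is a cluster.
    """
    groups = []
    for i, lam in enumerate(eigenvalues):
        if groups:
            prev = eigenvalues[i - 1]
            if abs(lam - prev) <= gaptol * max(abs(lam), abs(prev)):
                groups[-1].append(i)  # too close to its neighbour: same cluster
                continue
        groups.append([i])  # well separated: start a new group
    return groups


if __name__ == "__main__":
    for group in classify([1.0, 1.0000001, 2.0, 3.0, 3.0000002]):
        kind = "singleton" if len(group) == 1 else "cluster"
        print(kind, group)
```

In the scheme on the slide, a singleton leads directly to an eigenvector computation, while a cluster is shifted and re-represented before being scanned again.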

37 Representation Tree

38 MR3-SMP: the work queue
Tasks:
a) Singleton S: Eigenvector computation
b) Cluster C: Shift + new representation (RRR)
c) New RRR R: Eigenvalues refinement
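The three task types can be mimicked with a toy work queue. The sketch below is hypothetical and does no numerical work: the representation tree is given up front as nested lists (integers are singletons, lists are clusters), and `process_tree` only records which eigenvector would be computed at which tree depth. It illustrates the scheduling pattern, not the MR3-SMP implementation.

```python
import queue
import threading


def process_tree(tree, num_threads=4):
    """Walk a toy representation tree with S, C and R tasks on a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work
                tasks.task_done()
                return
            kind, payload, depth = item
            if kind == "R":
                # New RRR: refine eigenvalues (omitted), then classify children.
                for child in payload:
                    if isinstance(child, list):
                        tasks.put(("C", child, depth))
                    else:
                        tasks.put(("S", child, depth))
            elif kind == "C":
                # Cluster: shift and build a new representation one level down.
                tasks.put(("R", payload, depth + 1))
            else:
                # Singleton: the eigenvector computation would happen here.
                with lock:
                    results.append((payload, depth))
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.put(("R", tree, 0))
    tasks.join()  # returns once every task, including spawned ones, is done
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(process_tree([0, [1, 2, [3, 4]], 5]))
```

Child tasks are enqueued before the parent is marked done, so `tasks.join()` cannot return while work is still pending.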

39 Example trace: 16 cores, eigenvectors
Matrix size: 12387
Execution time: 3.3s
Sequential: 49.3s (LAPACK)
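As a rough reading of these numbers (a sketch that treats the 49.3 s LAPACK run as the sequential baseline $T_1$ and the 3.3 s run on $p = 16$ cores as $T_p$), the speedup $S$ and parallel efficiency $E$ are:

```latex
S = \frac{T_1}{T_p} = \frac{49.3}{3.3} \approx 14.9,
\qquad
E = \frac{S}{p} \approx \frac{14.9}{16} \approx 0.93
```

The baseline is a different code (LAPACK), so this is a speedup over LAPACK rather than a self-relative speedup.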

41 MR3-SMP: Timings
Matrix size=4289, from DFT.
[Figure, left panel: time in seconds vs. number of threads for MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL).]
[Figure, right panel: time in seconds vs. number of threads for MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]

43 A larger example
Matrix size=16023; frequency response analysis of automobiles.
[Figure, left panel: time in minutes vs. number of threads for MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL).]
[Figure, right panel: time in seconds vs. number of threads for MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]
From 9+ hours to 8.3 seconds.

44 Speedups
[Figure, left panel: time in seconds vs. number of threads; series: Eigenvalues, Eigenvectors, LAPACK.]
[Figure, right panel: speedup vs. number of threads; series: Ideal, Eigenvalues (bisection), Eigenvectors (bisection), Eigenvectors (dqds), Total.]

45 3 stages: before and after
[Figure, before: fraction of execution time vs. number of threads, N = 4,289; components: Backtransformation, Sequential MRRR, Reduction.]
[Figure, after: fraction of execution time vs. number of threads, N = 4,289; components: Backtransformation, MR3-SMP, Reduction.]

47 Distributed memory: PMRRR, EleMRRR
Static assignment of eigenpairs to nodes
Multithreading
Node-node communication: only eigenvalues
PMRRR + Elemental = EleMRRR
Generalized, standard and tridiagonal hybrid eigensolvers
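A static assignment of eigenpairs to nodes can be as simple as a contiguous block partition of the index range. The sketch below shows that idea only. The function name `block_assignment` is hypothetical, and the slide does not say which partitioning rule PMRRR actually uses.

```python
def block_assignment(k, p):
    """Split eigenpair indices 0..k-1 into p contiguous half-open ranges.

    Returns a list of (start, stop) pairs, one per process rank. The first
    k % p ranks receive one extra eigenpair so the load differs by at most one.
    """
    base, extra = divmod(k, p)
    ranges = []
    start = 0
    for rank in range(p):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


if __name__ == "__main__":
    for rank, (start, stop) in enumerate(block_assignment(10, 3)):
        print(f"rank {rank}: eigenpairs {start}..{stop - 1}")
```

Because the ranges are fixed before any eigenvector is computed, no work needs to be exchanged between nodes afterwards, which fits the slide's note that only eigenvalues are communicated.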

48 TRDEIG: PMRRR
[Figure panels: matrix; Wilkinson matrix.]

49 GENEIG: Weak & strong scaling
[Figure panels: Weak scalability; Strong scalability, n=20000.]

50 GENEIG: Efficiency
[Figure: parallel efficiency vs. number of cores for matrix sizes 20k, 40k, 80k; series: EleMRRR, ScaLAPACK's DC, ScaLAPACK's MRRR.]

52 TRDEIG (NVIDIA Tesla)
mrrr_dp = data-parallel MRRR
[Table: n against LAPACK and mrrr_dp, for rand(0,1) and rand(-1,1) matrices.]

53 STDEIG (Nehalem, 8 cores)
[Table, Reduction to tridiagonal form: n against LAPACK, SBR, SBR + GPU.]
[Table, Reduction + backtransformation: n against LAPACK, SBR, SBR + GPU.]

55 Conclusions
Multi-threaded BLAS for eigensolvers: not THAT good
MR3-SMP, PMRRR, EleMRRR:
eigensolvers tailored for multi-core, distributed, hybrid architectures
faster than LAPACK, MKL, ScaLAPACK
almost perfect speedups
software is available
Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged.
