Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures

1 Fast and Scalable Eigensolvers for Multicore and Hybrid Architectures
Paolo Bientinesi, AICES, RWTH Aachen
40th SPEEDUP Workshop on High-Performance Computing, February 6-7, 2012, University of Basel, Switzerland
Paolo Bientinesi (AICES, RWTH Aachen) Fast and Scalable Eigensolvers February 6, / 34

2 Outline
1 The Problem
2 Architectures and Libraries
3 Multicore Processors: MR3-SMP
4 Distributed Memory Architectures: PMRRR
5 GPUs
6 Conclusions

3 Symmetric Dense Eigenproblem
STDEIG: $AX = X\Lambda$
GENEIG: $AX = BX\Lambda$
Input: $A \in \mathbb{C}^{n \times n}$ with $A^H = A$; $B \in \mathbb{C}^{n \times n}$, SPD; $k$ with $1 \le k \le n$, the number of eigenpairs
Output: $X \in \mathbb{C}^{n \times k}$ (eigenvectors), $\Lambda \in \mathbb{R}^{k \times k}$ (eigenvalues)
Accuracy: residual $\|AX - X\Lambda\|$, orthogonality $\|X^H X - I\|$
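The two accuracy measures on this slide can be checked numerically. What follows is a minimal pure-Python sketch, not the speaker's code: the helper names (`matmul`, `conj_transpose`, `frobenius`, `residual`, `orthogonality`) are hypothetical, and the Frobenius norm without any scaling is an assumption, since the slide fixes neither a norm nor a normalization.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def conj_transpose(A):
    """Conjugate transpose A^H (plain transpose for real entries)."""
    return [[A[i][j].conjugate() for i in range(len(A))]
            for j in range(len(A[0]))]

def frobenius(A):
    """Frobenius norm (assumed here; the slide does not name a norm)."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))

def residual(A, X, lam):
    """||A X - X Lambda||_F with Lambda = diag(lam)."""
    AX = matmul(A, X)
    return frobenius([[AX[i][j] - X[i][j] * lam[j] for j in range(len(lam))]
                      for i in range(len(X))])

def orthogonality(X):
    """||X^H X - I||_F."""
    G = matmul(conj_transpose(X), X)
    return frobenius([[G[i][j] - (1.0 if i == j else 0.0)
                       for j in range(len(G))] for i in range(len(G))])

# Small real symmetric example with known eigenpairs (eigenvalues 1 and 3).
s = 1.0 / math.sqrt(2.0)
A = [[2.0, 1.0], [1.0, 2.0]]
lam = [1.0, 3.0]
X = [[s, s], [-s, s]]
print(residual(A, X, lam), orthogonality(X))
```

Both printed values should be at the level of rounding error for exact eigenpairs; a large value of either one signals an inaccurate or non-orthogonal result.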

4 6-stage approach (GENEIG: $AX = BX\Lambda$)
1 $LL^H = B$: Cholesky factorization, $O(n^3)$
2 $M \leftarrow L^{-1} A L^{-H}$: reduction to standard form, $O(n^3)$
3 $T = Q^H M Q$: reduction to tridiagonal form, $O(n^3)$
4 $TZ = Z\Lambda$: tridiagonal eigenproblem, $O(kn)$ to $O(n^3)$
5 $Y = QZ$: backtransformation #1, $O(kn^2)$
6 $X = L^{-H} Y$: backtransformation #2, $O(kn^2)$

5 Nested Eigensolvers: GENEIG contains STDEIG, which contains TRDEIG
GENEIG: stages 1-6
STDEIG: stages 3-5 ($T = Q^H M Q$, $TZ = Z\Lambda$, $Y = QZ$)
TRDEIG: stage 4 ($TZ = Z\Lambda$)
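The nesting can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation discussed in the talk: the function name `geneig` is hypothetical, `np.linalg.eigh` stands in for the whole STDEIG block (stages 3-5), and `np.linalg.solve` is a general solver used where a production code would use triangular solves.

```python
import numpy as np

def geneig(A, B):
    """Solve A X = B X Lambda for Hermitian A and positive definite B."""
    # Stage 1: Cholesky factorization B = L L^H.
    L = np.linalg.cholesky(B)
    # Stage 2: reduction to standard form, M = L^{-1} A L^{-H}.
    W = np.linalg.solve(L, A)                      # W = L^{-1} A
    M = np.linalg.solve(L, W.conj().T).conj().T    # M = W L^{-H}
    # Stages 3-5 (STDEIG): eigh hides the tridiagonal reduction, the
    # tridiagonal eigensolver, and the first backtransformation.
    lam, Y = np.linalg.eigh(M)
    # Stage 6: backtransformation X = L^{-H} Y.
    X = np.linalg.solve(L.conj().T, Y)
    return lam, X

# Small illustrative matrices (chosen for this sketch, not from the talk).
A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
B = np.array([[2.0, 0.5, 0.0], [0.5, 2.0, 0.5], [0.0, 0.5, 2.0]])
lam, X = geneig(A, B)
print(np.allclose(A @ X, B @ X @ np.diag(lam)))
```

Note that the eigenvectors returned this way are B-orthonormal, that is, $X^H B X = I$.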

7 Algorithms
Stage 4: TRDEIG
1958 Bisection + Inverse Iteration (BI): subsets, $O(kn^2)$
1961 QR: high accuracy, $O(n^3)$
1981 Divide & Conquer (DC): BLAS3, accurate, $O(n^3)$
1997 MRRR: subsets, no re-orthogonalization, $O(kn)$
Stage 3: Reduction to TRDEIG
1-stage Householder
Successive Banded Reduction
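To make the "subsets" property of bisection concrete, here is a minimal pure-Python sketch of the textbook approach: a Sturm-type count of eigenvalues below a shift, followed by interval halving. The function names, the zero-pivot guard value, and the tolerance are assumptions for illustration; this is not the LAPACK or MKL routine measured in the talk.

```python
def count_below(d, e, x):
    """Number of eigenvalues smaller than x of the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (len(e) == len(d) - 1)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q        # pivot of the LDL^T factorization of T - xI
        if q == 0.0:
            q = -1e-300               # assumed guard against a zero pivot
        if q < 0.0:
            count += 1                # each negative pivot is one eigenvalue below x
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection."""
    n = len(d)
    # Gershgorin discs give an interval that contains all eigenvalues.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid                  # eigenvalue k lies below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# 3-by-3 example with diagonal 2 and off-diagonal -1.
d, e = [2.0, 2.0, 2.0], [-1.0, -1.0]
print([kth_eigenvalue(d, e, k) for k in range(3)])
```

Each eigenvalue is found independently of the others, which is why a subset of k eigenvalues costs proportionally less than the full spectrum.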

20 Numerical Libs Development Cycle (?)
(0) New architecture
(1) GEMM (peak performance), FFT
(2) BLAS3, factorizations, AX=B; LINPACK benchmark
(2) BLAS3, factorizations, AX=B
(2) factorizations, AX=B
(2) factorizations, AX=B
(3) factorizations, AX=B, matrix operations...
(4) Eigenproblems
HPC Linear solvers Eigensolvers

21 History: Eigensolvers?
Cell
GEMM: 99%; FFT; linear systems: HPL
2008: Roadrunner > 1 PetaFLOP
2009: discontinued

22 History: Eigensolvers?
GPGPUs
CUBLAS (*); HPL, Top500; CULA; FLAME, MAGMA

23 History: Eigensolvers??
Multicores
GEMM; multithreaded BLAS; HPL, Top500; FLAME, PLASMA

24 Our contributions
MR3-SMP (multithreaded): Matthias Petschow, RWTH Aachen
PMRRR, EleMRRR (hybrid MPI + MT): Matthias Petschow, RWTH Aachen; Jack Poulson, UT Austin
GPUs: Christian Lessig, University of Toronto; Enrique Quintana-Ortí, Universidad Jaume I; Francisco Igual, Universidad Jaume I

26 Multi-threaded BLAS: Xeon, 32 physical cores
[Figure: efficiency of GEMM versus matrix dimension, for 1, 2, 4, 8, 16, and 32 threads.]
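For readers who want to reproduce such a curve, the following sketch shows one common way to turn a GEMM timing into an efficiency figure. It assumes that "efficiency" means achieved floating-point rate divided by the machine's peak rate and uses the usual $2mnk$ flop count; the function name and the sample numbers are hypothetical and are not taken from the talk.

```python
def gemm_efficiency(m, n, k, seconds, peak_flops):
    """Fraction of peak achieved by C := A B, with A m-by-k and B k-by-n."""
    flops = 2.0 * m * n * k      # one multiply and one add per inner-product term
    return flops / seconds / peak_flops

# Hypothetical measurement: a 1000^3 product in 1 s on a 4 GFLOP/s machine.
print(gemm_efficiency(1000, 1000, 1000, 1.0, 4e9))
```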

31 Multi-threaded BLAS for TRDEIG?
Tridiagonal eigensolvers. Matrix size = 4289, from DFT.
[Figure, two panels: time in seconds versus number of threads. Left: MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right: MRRR (MKL), MRRR (LAPACK), DC (MKL).]

33 More motivation? MR3 is $O(n^2)$ anyway...
[Figure: fraction of execution time versus number of threads, N = 4,289. Stages shown: Backtransformation, Sequential MRRR, Reduction.]
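The effect behind this figure can be illustrated with a toy model: if the reduction and the backtransformation scale with the thread count while the tridiagonal stage stays sequential, the sequential stage takes a growing share of the total time. The sketch below assumes ideal scaling of the two parallel stages, and the single-thread timings are hypothetical values chosen only for illustration, not measurements from the talk.

```python
def stage_fractions(t_reduction, t_tridiag, t_backtransf, threads):
    """Share of total time per stage when only the reduction and the
    backtransformation speed up (ideally) with the number of threads."""
    red = t_reduction / threads
    back = t_backtransf / threads
    total = red + t_tridiag + back
    return red / total, t_tridiag / total, back / total

# Hypothetical single-thread times (seconds): reduction 90, MRRR 5, backtransformation 5.
for p in (1, 4, 19):
    print(p, [round(f, 3) for f in stage_fractions(90.0, 5.0, 5.0, p)])
```

In this model the sequential stage starts at 5% of the runtime on one thread and grows steadily as threads are added, which is the motivation for parallelizing MRRR itself even though it is the cheapest stage.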

36 MRRR: Multiple Relatively Robust Representations, Dhillon & Parlett (1998)
First stable algorithm to compute k eigenpairs in O(nk) ops; no reorthogonalization
Two phases: 1) eigenvalues, 2) eigenvectors + eigenvalues
Eigenvalues: dqds or bisection
Eigenvectors: scan the λ's. A well-separated λ: compute one eigenpair (λ, z). A cluster: shift, form a new RRR, refine the λ's, and scan again.
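The "scan" step, which separates well-separated eigenvalues from clusters, can be sketched as a pass over the sorted eigenvalues that compares relative gaps against a threshold. The function name `split_by_gaps`, the exact gap formula, and the default threshold `gaptol=1e-3` are assumptions made for this illustration; actual MRRR codes use more careful criteria.

```python
def split_by_gaps(eigenvalues, gaptol=1e-3):
    """Group sorted eigenvalues: neighbours whose relative gap is below
    gaptol end up in the same cluster; groups of size one are singletons."""
    if not eigenvalues:
        return []
    groups = [[eigenvalues[0]]]
    for prev, cur in zip(eigenvalues, eigenvalues[1:]):
        scale = max(abs(prev), abs(cur))
        rel_gap = abs(cur - prev) / scale if scale > 0.0 else 0.0
        if rel_gap >= gaptol:
            groups.append([cur])      # well separated: start a new group
        else:
            groups[-1].append(cur)    # too close: extend the current cluster
    return groups

print(split_by_gaps([1.0, 1.0000001, 2.0, 3.0, 3.0000002]))
```

In this example the first two and the last two values form clusters, while 2.0 is a singleton whose eigenvector could be computed directly.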

37 Representation Tree

38 MR3-SMP: the work queue
Tasks:
a) Singleton S: eigenvector computation
b) Cluster C: shift + new representation (RRR)
c) New RRR R: eigenvalue refinement
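The flow of the three task types through a shared queue can be sketched with Python's standard `queue` and `threading` modules. This is only a structural illustration under loud assumptions: a nested list stands in for the representation tree (an integer is a singleton, a list is a cluster), no numerical work is performed, and the function name `run_work_queue` is hypothetical. It is not the MR3-SMP scheduler.

```python
import queue
import threading

def run_work_queue(tree, num_threads=4):
    """Process a toy representation tree with S, C, and R tasks."""
    tasks = queue.Queue()
    done = []
    lock = threading.Lock()

    def enqueue_children(node):
        for child in node:
            kind = "C" if isinstance(child, list) else "S"
            tasks.put((kind, child))

    def worker():
        while True:
            task = tasks.get()
            if task is None:              # sentinel: no more work
                tasks.task_done()
                return
            kind, node = task
            if kind == "S":               # singleton: "compute" the eigenvector
                with lock:
                    done.append(node)
            elif kind == "C":             # cluster: shift + new RRR, then refine
                tasks.put(("R", node))
            else:                         # "R": refinement reveals the children
                enqueue_children(node)
            tasks.task_done()

    enqueue_children(tree)                # the root representation
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    tasks.join()                          # wait until every task has completed
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(done)

print(run_work_queue([0, [1, 2, [3, 4]], 5]))
```

New tasks are always enqueued before the parent task is marked done, so `tasks.join()` cannot return while work is still pending.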

39 Example trace: 16 cores, eigenvectors
Matrix size: [value not preserved]
Execution time: 3.3 s
Sequential: 49.3 s (LAPACK)
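From the two timings on this slide one can derive a speedup and a per-core efficiency. The short sketch below does the arithmetic; note that the reference time is sequential LAPACK rather than a single-thread run of the parallel code, so the result is a speedup over LAPACK. The function name is hypothetical.

```python
def speedup_and_efficiency(t_reference, t_parallel, cores):
    """Speedup over a reference time and the resulting per-core efficiency."""
    speedup = t_reference / t_parallel
    return speedup, speedup / cores

s, e = speedup_and_efficiency(49.3, 3.3, 16)
print(f"speedup {s:.1f}, efficiency {e:.2f}")
```

With the slide's numbers this gives a speedup of roughly 14.9 on 16 cores, or an efficiency of about 0.93.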

41 MR3-SMP: Timings
Matrix size = 4289, from DFT.
[Figure, two panels: time in seconds versus number of threads. Left: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]

43 A larger example
Matrix size = 16023; frequency response analysis of automobiles.
[Figure, two panels versus number of threads. Left, time in minutes: MR3-SMP, MRRR (MKL), DC (MKL), QR (MKL), BI (MKL). Right, time in seconds: MR3-SMP, MRRR (MKL), MRRR (LAPACK), DC (MKL).]
From 9+ hours to 8.3 seconds.

44 Speedups
[Figure, two panels versus number of threads. Left, time in seconds: Eigenvalues, Eigenvectors, LAPACK. Right, speedup: Ideal, Eigenvalues (bisection), Eigenvectors (bisection), Eigenvectors (dqds), Total.]

45 3 stages: before and after
[Figure, two panels: fraction of execution time versus number of threads, N = 4,289. Before: Backtransformation, Sequential MRRR, Reduction. After: Backtransformation, MR3-SMP, Reduction.]

47 Distributed memory: PMRRR, EleMRRR
Static assignment of eigenpairs to nodes
Multithreading
Node-node communication: only eigenvalues
PMRRR + Elemental: EleMRRR
Generalized, standard and tridiagonal hybrid eigensolvers
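A static assignment of eigenpairs to nodes can be as simple as splitting the index range into contiguous, nearly equal blocks. The sketch below shows that idea; the contiguous-block policy and the function name `assign_eigenpairs` are assumptions for illustration, and the actual PMRRR distribution may differ in detail.

```python
def assign_eigenpairs(k, num_procs):
    """Split eigenpair indices 0..k-1 into contiguous blocks, one per
    process, with block sizes differing by at most one."""
    base, extra = divmod(k, num_procs)
    ranges = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)   # first `extra` ranks get one more
        ranges.append(range(start, start + size))
        start += size
    return ranges

print([list(r) for r in assign_eigenpairs(10, 4)])
```

Because the assignment is fixed up front, each process knows which eigenpairs it owns without any negotiation at run time.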

48 TRDEIG: PMRRR
[Figure, two panels, one per test matrix; the second is the Wilkinson matrix.]

49 GENEIG: Weak & strong scaling
[Figure, two panels: weak scalability; strong scalability, n = 20000.]

50 GENEIG: Efficiency
[Figure: parallel efficiency versus number of cores, with matrix sizes 20k, 40k, 80k; curves for EleMRRR, ScaLAPACK's DC, and ScaLAPACK's MRRR.]
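On distributed machines a single-core baseline is often unavailable, so parallel efficiency is commonly reported relative to the smallest run that fits in memory. The sketch below uses that convention for a fixed problem size; whether the figure on this slide was computed this way is an assumption, and the function name and sample timings are hypothetical.

```python
def relative_efficiency(p_ref, t_ref, p, t):
    """Efficiency of a run on p cores taking time t, relative to a
    reference run on p_ref cores taking time t_ref (same problem size)."""
    return (p_ref * t_ref) / (p * t)

# Hypothetical timings: doubling the cores halves the time, then stalls.
print(relative_efficiency(64, 100.0, 128, 50.0))
print(relative_efficiency(64, 100.0, 256, 50.0))
```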

52 TRDEIG: NVIDIA Tesla
mrrr_dp = data-parallel MRRR
[Table: LAPACK versus mrrr_dp by n, for rand(0,1) and rand(-1,1) matrices; values not preserved.]

53 STDEIG: Nehalem, 8 cores
[Tables: reduction to tridiagonal form, and reduction + backtransformation, by n for LAPACK, SBR, and SBR + GPU; values not preserved.]

55 Conclusions
Multi-threaded BLAS for eigensolvers: not THAT good
MR3-SMP, PMRRR, EleMRRR:
eigensolvers tailored for multi-core, distributed, hybrid architectures
faster than LAPACK, MKL, ScaLAPACK
almost perfect speedups
software is available
Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged.


More information

Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture

Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture Jingwen Leng Yazhou Zu Vijay Janapa Reddi The University of Texas at Austin {jingwen, yazhou.zu}@utexas.edu,

More information

Joint MT/CSEM Anisotropic Inversion Olympic Dam

Joint MT/CSEM Anisotropic Inversion Olympic Dam Joint MT/CSEM Anisotropic Inversion Olympic Dam T.J. Ritchie* P.A. Rowston* Practical 1 Day Workshop Geophysical Inversion for Mineral Explorers * Geophysical Resources and Services Pty. Ltd. Brisbane

More information

LS-DYNA Performance Enhancement of Fan Blade Off Simulation on Cray XC40

LS-DYNA Performance Enhancement of Fan Blade Off Simulation on Cray XC40 LS-DYNA Performance Enhancement of Fan Blade Off Simulation on Cray XC40 Ting-Ting Zhu, Cray Inc. Jason Wang, LSTC Brian Wainscott, LSTC Abstract This work uses LS-DYNA to enhance the performance of engine

More information

Performance Metrics, Amdahl s Law

Performance Metrics, Amdahl s Law ecture 26 Computer Science 61C Spring 2017 March 20th, 2017 Performance Metrics, Amdahl s Law 1 New-School Machine Structures (It s a bit more complicated!) Software Hardware Parallel Requests Assigned

More information

Proc. IEEE Intern. Conf. on Application Specific Array Processors, (Eds. Capello et. al.), IEEE Computer Society Press, 1995, 76-84

Proc. IEEE Intern. Conf. on Application Specific Array Processors, (Eds. Capello et. al.), IEEE Computer Society Press, 1995, 76-84 Proc. EEE ntern. Conf. on Application Specific Array Processors, (Eds. Capello et. al.), EEE Computer Society Press, 1995, 76-84 Session 2: Architectures 77 toning speed is affected by the huge amount

More information

COTSon: Infrastructure for system-level simulation

COTSon: Infrastructure for system-level simulation COTSon: Infrastructure for system-level simulation Ayose Falcón, Paolo Faraboschi, Daniel Ortega HP Labs Exascale Computing Lab http://sites.google.com/site/hplabscotson MICRO-41 tutorial November 9, 28

More information

Sno Projects List IEEE. High - Throughput Finite Field Multipliers Using Redundant Basis For FPGA And ASIC Implementations

Sno Projects List IEEE. High - Throughput Finite Field Multipliers Using Redundant Basis For FPGA And ASIC Implementations Sno Projects List IEEE 1 High - Throughput Finite Field Multipliers Using Redundant Basis For FPGA And ASIC Implementations 2 A Generalized Algorithm And Reconfigurable Architecture For Efficient And Scalable

More information

Module 3 Greedy Strategy

Module 3 Greedy Strategy Module 3 Greedy Strategy Dr. Natarajan Meghanathan Professor of Computer Science Jackson State University Jackson, MS 39217 E-mail: natarajan.meghanathan@jsums.edu Introduction to Greedy Technique Main

More information

Application-Specific Node Clustering of IR-UWB Sensor Networks with Two Classes of Nodes

Application-Specific Node Clustering of IR-UWB Sensor Networks with Two Classes of Nodes Application-Specific Node Clustering of IR-UWB Sensor Networks with Two Classes of Nodes Daniel Bielefeld 1, Gernot Fabeck 2, Rudolf Mathar 3 Institute for Theoretical Information Technology, RWTH Aachen

More information

CellSpecks: A Software for Automated Detection and Analysis of Calcium

CellSpecks: A Software for Automated Detection and Analysis of Calcium Biophysical Journal, Volume 115 Supplemental Information CellSpecks: A Software for Automated Detection and Analysis of Calcium Channels in Live Cells Syed Islamuddin Shah, Martin Smith, Divya Swaminathan,

More information

Computational Efficiency of the GF and the RMF Transforms for Quaternary Logic Functions on CPUs and GPUs

Computational Efficiency of the GF and the RMF Transforms for Quaternary Logic Functions on CPUs and GPUs 5 th International Conference on Logic and Application LAP 2016 Dubrovnik, Croatia, September 19-23, 2016 Computational Efficiency of the GF and the RMF Transforms for Quaternary Logic Functions on CPUs

More information

The end of Moore s law and the race for performance

The end of Moore s law and the race for performance The end of Moore s law and the race for performance Michael Resch (HLRS) September 15, 2016, Basel, Switzerland Roadmap Motivation (HPC@HLRS) Moore s law Options Outlook HPC@HLRS Cray XC40 Hazelhen 185.376

More information

Decentralized Data Detection for Massive MU-MIMO on a Xeon Phi Cluster

Decentralized Data Detection for Massive MU-MIMO on a Xeon Phi Cluster Decentralized Data Detection for Massive MU-MIMO on a Xeon Phi Cluster Kaipeng Li 1, Yujun Chen 1, Rishi Sharan 2, Tom Goldstein 3, Joseph R. Cavallaro 1, and Christoph Studer 2 1 Department of Electrical

More information

Application of Multi-core and GPU Architectures on Signal Processing: Case Studies

Application of Multi-core and GPU Architectures on Signal Processing: Case Studies Application of Multi-core and GPU Architectures on Signal Processing: Case Studies Alberto Gonzalez 1, José A. Belloch 1, Gema Piñero 1, Jorge Lorente 1, Miguel Ferrer 1, Sandra Roger 1, Carles Roig 1,

More information

Automatic Domain Decomposition for a Black-Box PDE Solver

Automatic Domain Decomposition for a Black-Box PDE Solver Automatic Domain Decomposition for a Black-Box PDE Solver Torsten Adolph and Willi Schönauer Forschungszentrum Karlsruhe Institute for Scientific Computing Karlsruhe, Germany torsten.adolph@iwr.fzk.de

More information

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts Institute of Technology ASPLOS October 2006 San Jose,

More information

Monte Carlo integration and event generation on GPU and their application to particle physics

Monte Carlo integration and event generation on GPU and their application to particle physics Monte Carlo integration and event generation on GPU and their application to particle physics Junichi Kanzaki (KEK) GPU2016 @ Rome, Italy Sep. 26, 2016 Motivation Increase of amount of LHC data (raw &

More information

GPU ACCELERATED DEEP LEARNING WITH CUDNN

GPU ACCELERATED DEEP LEARNING WITH CUDNN GPU ACCELERATED DEEP LEARNING WITH CUDNN Larry Brown Ph.D. March 2015 AGENDA 1 Introducing cudnn and GPUs 2 Deep Learning Context 3 cudnn V2 4 Using cudnn 2 Introducing cudnn and GPUs 3 HOW GPU ACCELERATION

More information

Lecture 20: Combinatorial Search (1997) Steven Skiena. skiena

Lecture 20: Combinatorial Search (1997) Steven Skiena.   skiena Lecture 20: Combinatorial Search (1997) Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.sunysb.edu/ skiena Give an O(n lg k)-time algorithm

More information

Architecting Systems of the Future, page 1

Architecting Systems of the Future, page 1 Architecting Systems of the Future featuring Eric Werner interviewed by Suzanne Miller ---------------------------------------------------------------------------------------------Suzanne Miller: Welcome

More information

R and the Message Passing Interface on the Little Fe Cluster

R and the Message Passing Interface on the Little Fe Cluster the Little Fe October 3, 2012 O Discussion Topics Overview Little Fe BCCD Parallel Programming MPI R with MPI Results R with CUDA Conclusion O Overview At SuperComputing 2011, the University of Houston

More information

Real-Time Software Receiver Using Massively Parallel

Real-Time Software Receiver Using Massively Parallel Real-Time Software Receiver Using Massively Parallel Processors for GPS Adaptive Antenna Array Processing Jiwon Seo, David De Lorenzo, Sherman Lo, Per Enge, Stanford University Yu-Hsuan Chen, National

More information

ST Tool. A CASE tool for security aware software requirements analysis

ST Tool. A CASE tool for security aware software requirements analysis ST Tool A CASE tool for security aware software requirements analysis Paolo Giorgini Fabio Massacci John Mylopoulos Nicola Zannone Departement of Information and Communication Technology University of

More information

Spectrum Requirements for 4G Wireless Systems

Spectrum Requirements for 4G Wireless Systems Spectrum Requirements for 4G Wireless Systems Tim Irnich ComNets, RWTH Aachen University FFV Workshop, 30.3.2007 1 Outline Introduction Radio Spectrum Management Why? The ITU framework for spectrum management

More information

Combining Differential/Integral Methods and Time/Frequency Domain Analysis to Solve Complex Antenna Problems

Combining Differential/Integral Methods and Time/Frequency Domain Analysis to Solve Complex Antenna Problems Combining Differential/Integral Methods and Time/Frequency Domain Analysis to Solve Complex Antenna Problems IEEE Long Island Section MTT-S Jan. 27, 20 Overview of Presentation Antenna design challenges

More information

Algorithmic-Technique for Compensating Memory Errors in JPEG2000 Standard

Algorithmic-Technique for Compensating Memory Errors in JPEG2000 Standard Algorithmic-Technique for Compensating Memory Errors in JPEG2000 Standard M. Pradeep Raj 1, E.Dinesh 2 PG Student, Dept of ECE, M. Kumarasamy College of Engineering, Karur, Tamilnadu, India 1 Asst. Professor,

More information

FAST RADIX 2, 3, 4, AND 5 KERNELS FOR FAST FOURIER TRANSFORMATIONS ON COMPUTERS WITH OVERLAPPING MULTIPLY ADD INSTRUCTIONS

FAST RADIX 2, 3, 4, AND 5 KERNELS FOR FAST FOURIER TRANSFORMATIONS ON COMPUTERS WITH OVERLAPPING MULTIPLY ADD INSTRUCTIONS SIAM J. SCI. COMPUT. c 1997 Society for Industrial and Applied Mathematics Vol. 18, No. 6, pp. 1605 1611, November 1997 005 FAST RADIX 2, 3, 4, AND 5 KERNELS FOR FAST FOURIER TRANSFORMATIONS ON COMPUTERS

More information

2017 by Bilge Acun. All rights reserved.

2017 by Bilge Acun. All rights reserved. 2017 by Bilge Acun. All rights reserved. MITIGATING VARIABILITY IN HPC SYSTEMS AND APPLICATIONS FOR PERFORMANCE AND POWER EFFICIENCY BY BILGE ACUN DISSERTATION Submitted in partial fulfillment of the requirements

More information

PROGRESSIVE CHANNEL ESTIMATION FOR ULTRA LOW LATENCY MILLIMETER WAVE COMMUNICATIONS

PROGRESSIVE CHANNEL ESTIMATION FOR ULTRA LOW LATENCY MILLIMETER WAVE COMMUNICATIONS PROGRESSIVECHANNELESTIMATIONFOR ULTRA LOWLATENCYMILLIMETER WAVECOMMUNICATIONS Hung YiCheng,Ching ChunLiao,andAn Yeu(Andy)Wu,Fellow,IEEE Graduate Institute of Electronics Engineering, National Taiwan University

More information

Advanced Computer Architecture - Baylor University The World s Most Advanced Technology For Solving The Nuix...

Advanced Computer Architecture - Baylor University The World s Most Advanced Technology For Solving The Nuix... Advanced Parallel Processing Technologies 9th International Symposium Appt 2011 Shanghai China September 26 27 2011 Proceedings Lecture Notes In Computer Science ADVANCED PARALLEL PROCESSING TECHNOLOGIES

More information

MACHINE LEARNING Games and Beyond. Calvin Lin, NVIDIA

MACHINE LEARNING Games and Beyond. Calvin Lin, NVIDIA MACHINE LEARNING Games and Beyond Calvin Lin, NVIDIA THE MACHINE LEARNING ERA IS HERE And it is transforming every industry... including Game Development OVERVIEW NVIDIA Volta: An Architecture for Machine

More information

1) Evolução das velocidades de processamento, de acesso a memória e ao disco e das interfaces de rede - Um apanhado histórico até os dias de hoje

1) Evolução das velocidades de processamento, de acesso a memória e ao disco e das interfaces de rede - Um apanhado histórico até os dias de hoje 2010 1) Evolução das velocidades de processamento, de acesso a memória e ao disco e das interfaces de rede - Um apanhado histórico até os dias de hoje 2) Green Computing - Qual a perspectiva em organização

More information

Course Overview. Dr. Edmund Lam. Department of Electrical and Electronic Engineering The University of Hong Kong

Course Overview. Dr. Edmund Lam. Department of Electrical and Electronic Engineering The University of Hong Kong Course Dr. Edmund Lam Department of Electrical and Electronic Engineering The University of Hong Kong ELEC8601: Advanced Topics in Image Processing (Second Semester, 2013 14) http://www.eee.hku.hk/ work8601

More information

Parallelized Benchmark-Driven Performance Evaluation of SMPs and Tiled Multi-Core Architectures for Embedded Systems

Parallelized Benchmark-Driven Performance Evaluation of SMPs and Tiled Multi-Core Architectures for Embedded Systems Parallelized Benchmark-Driven Performance Evaluation of SMPs and Tiled Multi-Core Architectures for Embedded Systems Arslan Munir Department of Electrical and Computer Engineering Rice University, Houston,

More information

EESI Presentation at IESP

EESI Presentation at IESP Presentation at IESP San Francisco, April 6, 2011 WG 3.1 : Applications in Energy & Transportation Chair: Philippe RICOUX (TOTAL) Vice-Chair: Jean-Claude ANDRE (CERFACS) 1 WG3.1 Scientific and Technical

More information

Know your Algorithm! Architectural Trade-offs in the Implementation of a Viterbi Decoder. Matthias Kamuf,

Know your Algorithm! Architectural Trade-offs in the Implementation of a Viterbi Decoder. Matthias Kamuf, Know your Algorithm! Architectural Trade-offs in the Implementation of a Viterbi Decoder Matthias Kamuf, 2009-12-08 Agenda Quick primer on communication and coding The Viterbi algorithm Observations to

More information

Extreme Scale Computational Science Challenges in Fusion Energy Research

Extreme Scale Computational Science Challenges in Fusion Energy Research Extreme Scale Computational Science Challenges in Fusion Energy Research William M. Tang Princeton University, Plasma Physics Laboratory Princeton, NJ USA International Advanced Research 2012 Workshop

More information

Threading libraries performance when applied to image acquisition and processing in a forensic application

Threading libraries performance when applied to image acquisition and processing in a forensic application Threading libraries performance when applied to image acquisition and processing in a forensic application Carlos Bermúdez MSc. in Photonics, Universitat Politècnica de Catalunya, Barcelona, Spain Student

More information

Programming and Optimization with Intel Xeon Phi Coprocessors. Colfax Developer Training One-day Labs CDT 102

Programming and Optimization with Intel Xeon Phi Coprocessors. Colfax Developer Training One-day Labs CDT 102 Programming and Optimization with Intel Xeon Phi Coprocessors Colfax Developer Training One-day Labs CDT 102 Abstract: Colfax Developer Training (CDT) is an in-depth intensive course on efficient parallel

More information

IMPLEMENTATION OF SOFTWARE-BASED 2X2 MIMO LTE BASE STATION SYSTEM USING GPU

IMPLEMENTATION OF SOFTWARE-BASED 2X2 MIMO LTE BASE STATION SYSTEM USING GPU IMPLEMENTATION OF SOFTWARE-BASED 2X2 MIMO LTE BASE STATION SYSTEM USING GPU Seunghak Lee (HY-SDR Research Center, Hanyang Univ., Seoul, South Korea; invincible@dsplab.hanyang.ac.kr); Chiyoung Ahn (HY-SDR

More information

Architectural and Technology Influence on the Optimal Total Power Consumption

Architectural and Technology Influence on the Optimal Total Power Consumption Architectural and Technology Influence on the Optimal Total Power Consumption Schuster Christian 1, Nagel Jean-Luc 1, Piguet Christian, Farine Pierre-André 1 1 IMT, University of Neuchâtel, Switzerland

More information

Wavelet-based image compression

Wavelet-based image compression Institut Mines-Telecom Wavelet-based image compression Marco Cagnazzo Multimedia Compression Outline Introduction Discrete wavelet transform and multiresolution analysis Filter banks and DWT Multiresolution

More information

Parallel Computing in the Multicore Era

Parallel Computing in the Multicore Era Parallel Computing in the Multicore Era Mikel Lujan & Graham Riley 21 st September 2016 Combining the strengths of UMIST and The Victoria University of Manchester MSc in Advanced Computer Science Theme

More information

Center for Hybrid Multicore Productivity Research (CHMPR)

Center for Hybrid Multicore Productivity Research (CHMPR) A CISE-funded Center University of Maryland, Baltimore County, Milton Halem, Director, 410.455.3140, halem@umbc.edu University of California San Diego, Sheldon Brown, Site Director, 858.534.2423, sgbrown@ucsd.edu

More information

B(, ) + + / = B(, ) B( +, ) B(, ) B( +, ) B( + +, ) B( +, ) B( +, ) B( +, ) B( +, ) = --xoptflags="-g -xmic-avx512 -O3 -mp2opt_hpo_vec_remainder=f" --with-memalign=64 = = ( + + [ + + + + ] ) + + σ +

More information

Parallel Computing in the Multicore Era

Parallel Computing in the Multicore Era Parallel Computing in the Multicore Era Prof. John Gurd 18 th September 2014 Combining the strengths of UMIST and The Victoria University of Manchester MSc in Advanced Computer Science Theme on Routine

More information

IMPULSIVE NOISE MITIGATION IN OFDM SYSTEMS USING SPARSE BAYESIAN LEARNING

IMPULSIVE NOISE MITIGATION IN OFDM SYSTEMS USING SPARSE BAYESIAN LEARNING IMPULSIVE NOISE MITIGATION IN OFDM SYSTEMS USING SPARSE BAYESIAN LEARNING Jing Lin, Marcel Nassar and Brian L. Evans Department of Electrical and Computer Engineering The University of Texas at Austin

More information

Disclaimer. Primer. Agenda. previous work at the EIT Department, activities at Ericsson

Disclaimer. Primer. Agenda. previous work at the EIT Department, activities at Ericsson Disclaimer Know your Algorithm! Architectural Trade-offs in the Implementation of a Viterbi Decoder This presentation is based on my previous work at the EIT Department, and is not connected to current

More information