Hybrid QR Factorization Algorithm for High Performance Computing Architectures. Peter Vouras Naval Research Laboratory Radar Division

Hybrid QR Factorization Algorithm for High Performance Computing Architectures

Peter Vouras, Naval Research Laboratory, Radar Division
Professor G.G.L. Meyer, Johns Hopkins University, Parallel Computing and Imaging Laboratory
8/1/21

Report Documentation Page (Standard Form 298, Rev. 8-98, prescribed by ANSI Std Z39-18). Form Approved, OMB No. 0704-0188.

Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.

1. REPORT DATE: 21 MAY 23
2. REPORT TYPE: N/A
3. DATES COVERED: -
4. TITLE AND SUBTITLE: Hybrid QR Factorization Algorithm for High Performance Computing Architectures
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Johns Hopkins University Parallel Computing and Imaging Laboratory
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release, distribution unlimited
13. SUPPLEMENTARY NOTES: The original document contains color images.
16. SECURITY CLASSIFICATION OF: a. REPORT unclassified; b. ABSTRACT unclassified; c. THIS PAGE unclassified
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES: 23

Fields 5a-5f, 6, 8-11, 14, 15, and 19a are blank.

Outline

- Background
- Problem Statement
- Givens Task
- Householder Task
- Paths Through Dependency Graph
- Parameterized Algorithms
- Parameters Used
- Results
- Conclusion

Background

- In many least squares problems, QR decomposition is employed.
  - Factor matrix A into a unitary matrix Q and an upper triangular matrix R such that A = QR.
- Two primary algorithms are available to compute the QR decomposition.
  - Givens rotations: pre-multiplying rows i-1 and i of a matrix A by a 2x2 Givens rotation matrix will zero the entry A( i, j ).

    $$G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}$$

  - Householder reflections: when a column of A is multiplied by an appropriate Householder reflection, it is possible to zero all the subdiagonal entries in that column.

    $$H = I - 2 \frac{v v^T}{v^T v}$$
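To make the Givens step concrete, here is a minimal sketch in plain Python. It assumes the sign convention [[c, s], [-s, c]] for the 2x2 rotation and 0-based indices; the helper names `givens` and `zero_entry` are illustrative and do not come from the slides.

```python
import math

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0          # nothing to rotate
    return a / r, b / r

def zero_entry(A, i, j):
    """Zero A[i][j] in place by rotating rows i-1 and i (0-based indices)."""
    c, s = givens(A[i - 1][j], A[i][j])
    for k in range(len(A[0])):
        top, bot = A[i - 1][k], A[i][k]
        A[i - 1][k] = c * top + s * bot
        A[i][k] = -s * top + c * bot

# Small example: rotate rows 0 and 1 so that the entry below the diagonal vanishes.
A = [[3.0, 1.0],
     [4.0, 2.0]]
zero_entry(A, 1, 0)
```

After the call, `A[1][0]` is zero up to rounding and `A[0][0]` holds the norm of the original first column.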

Problem Statement

- Want to minimize the latency incurred when computing the QR decomposition of a matrix A and maintain performance across different platforms.
- Algorithm consists of a parallel Givens task and a serial Householder task.
- Parallel Givens task
  - Allocate blocks of rows to different processors. Each processor uses Givens rotations to zero all available entries within its block, such that A( i, j ) = 0 only if A( i-1, j-1 ) = 0 and A( i, j-1 ) = 0.
- Serial Householder task
  - Once the Givens task terminates, all distributed rows are sent to the root processor, which utilizes Householder reflections to zero the remaining entries.

Givens Task

- Each processor uses Givens rotations to zero entries up to the topmost row in the assigned group.
- Once the task is complete, rows are returned to the root processor.
- Givens rotations are accumulated in a separate matrix before updating all of the columns in the array.
  - Avoids updating columns that will not be used by an immediately following Givens rotation.
  - Saves a significant fraction of computational flops.

[Figure: blocks of rows assigned to Processor 0, Processor 1, and Processor 2]
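One way to read the accumulation idea is sketched below in plain Python: rotations are applied only to the Givens columns of a row block, their product is collected in a small matrix G, and the remaining columns are updated once at the end. This is an illustrative sketch under assumptions, not the authors' implementation; the function names, the 0-based indexing, and the choice that the Givens columns start at column 0 are all hypothetical.

```python
import math

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def rotate(M, p, q, c, s, cols):
    """Apply the rotation to rows p and q of M, restricted to the given columns."""
    for k in cols:
        top, bot = M[p][k], M[q][k]
        M[p][k] = c * top + s * bot
        M[q][k] = -s * top + c * bot

def givens_block(A, r0, r1, ncols):
    """Zero the staircase below row r0 in columns 0..ncols-1 of rows r0..r1-1.

    Rotations touch only the ncols Givens columns; their product is
    accumulated in G and applied to the trailing columns afterwards.
    """
    nb, n = r1 - r0, len(A[0])
    G = [[float(i == j) for j in range(nb)] for i in range(nb)]
    for j in range(min(ncols, nb - 1)):
        for i in range(r1 - 1, r0 + j, -1):        # bottom row of the block first
            c, s = givens(A[i - 1][j], A[i][j])
            rotate(A, i - 1, i, c, s, range(j, ncols))
            rotate(G, i - 1 - r0, i - r0, c, s, range(nb))
    for k in range(ncols, n):                      # deferred update of trailing columns
        col = [A[r0 + t][k] for t in range(nb)]
        for t in range(nb):
            A[r0 + t][k] = sum(G[t][u] * col[u] for u in range(nb))
    return G

# Example: one block covering all four rows, two Givens columns, one trailing column.
A = [[4.0, 1.0, 2.0],
     [3.0, 5.0, 1.0],
     [0.0, 2.0, 6.0],
     [1.0, 3.0, 2.0]]
G = givens_block(A, 0, 4, 2)
```

Because G is a product of rotations it is orthogonal, so the deferred update preserves the norm of each trailing column.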

Householder Task

- Root processor utilizes Householder reflections to zero the remaining entries in the Givens columns.
- By computing a priori where zeroes will be after each Givens task is complete, the root processor can perform a sparse matrix multiply when performing a Householder update, for additional speed-up.
- Householder update is $A = A - \beta v v^T A$.
- Householder update involves a matrix-vector multiplication and an outer product update.
  - Makes extensive use of BLAS routines.

[Figure: remaining entries zeroed at Processor 0]
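The two pieces of the Householder update, a matrix-vector product followed by a rank-one (outer product) update, can be sketched in plain Python as follows. This is a dense illustration only, assuming beta = 2 / (v^T v); it does not exploit the known zero pattern or call BLAS as the slide describes, and the function names are hypothetical.

```python
import math

def householder(x):
    """Return (v, beta) such that (I - beta v v^T) x = (alpha, 0, ..., 0)."""
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    if norm == 0.0:
        return v, 0.0                      # x is already zero; no reflection needed
    alpha = -math.copysign(norm, x[0])     # sign chosen to avoid cancellation in v[0]
    v[0] -= alpha
    beta = 2.0 / sum(t * t for t in v)
    return v, beta

def apply_householder(A, v, beta, row0, col0):
    """In place: A <- A - beta v (v^T A) on rows >= row0 and columns >= col0."""
    m, n = len(A), len(A[0])
    # Matrix-vector product w = v^T A
    w = [sum(v[i - row0] * A[i][k] for i in range(row0, m)) for k in range(col0, n)]
    # Outer product update
    for i in range(row0, m):
        for k in range(col0, n):
            A[i][k] -= beta * v[i - row0] * w[k - col0]

# Example: zero the subdiagonal of the first column of a 3 x 2 array.
A = [[3.0, 1.0],
     [4.0, 2.0],
     [0.0, 3.0]]
v, beta = householder([row[0] for row in A])
apply_householder(A, v, beta, 0, 0)
```

After the update the first column of `A` has a single nonzero entry whose magnitude equals the norm of the original column.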

Dependency Graph - Path 1

- Algorithm must zero matrix entries in such an order that previously zeroed entries are not filled in.
  - Implies that A( i, j ) can be zeroed only if A( i-1, j-1 ) and A( i, j-1 ) are already zero.
- More than one sequence exists to zero entries such that the above constraint is satisfied.
- Choice of path through the dependency graph greatly affects performance.

[Figure: order in which the 45 subdiagonal entries of a 10 x 10 array are zeroed; rows 2 through 10 from top to bottom]

     9
     8  17
     7  16  24
     6  15  23  30
     5  14  22  29  35
     4  13  21  28  34  39
     3  12  20  27  33  38  42
     2  11  19  26  32  37  41  44
     1  10  18  25  31  36  40  43  45
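The numbering on this slide appears consistent with zeroing each column from the bottom row upward, one column at a time. The sketch below generates that order and checks it against the stated constraint. It uses 1-based (row, column) indices as on the slide; the function names are illustrative, and entries in a nonexistent column 0 are treated as already zero.

```python
def column_order(n):
    """Column-by-column order, bottom entry first. Returns {(i, j): step}."""
    order, step = {}, 1
    for j in range(1, n):              # columns 1 .. n-1
        for i in range(n, j, -1):      # rows n down to j+1
            order[(i, j)] = step
            step += 1
    return order

def is_valid(order):
    """Check that A(i-1, j-1) and A(i, j-1) are zeroed before A(i, j)."""
    for (i, j), step in order.items():
        for dep in ((i - 1, j - 1), (i, j - 1)):
            if dep in order and order[dep] > step:
                return False
    return True

path1 = column_order(10)
ok = is_valid(path1)
```

For a 10 x 10 array this yields 45 steps, with entry (10, 1) zeroed first and entry (10, 9) zeroed last.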

Dependency Graph - Path 2

- By traversing the dependency graph in zig-zag fashion, cache line reuse is maximized.
- Data from a row already in cache is used to zero several matrix entries before the row is expunged from cache.

[Figure: order in which the 45 subdiagonal entries of a 10 x 10 array are zeroed; rows 2 through 10 from top to bottom]

    37
    29  38
    22  30  39
    16  23  31  40
    11  17  24  32  41
     7  12  18  25  33  42
     4   8  13  19  26  34  43
     2   5   9  14  20  27  35  44
     1   3   6  10  15  21  28  36  45
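The numbering on this slide appears consistent with sweeping along diagonals that start in the first column and end in the bottom row, moving one row higher on each sweep. The sketch below generates that order and checks it against the constraint that A( i-1, j-1 ) and A( i, j-1 ) must be zeroed before A( i, j ). Indices are 1-based, and the function names are illustrative rather than taken from the slides.

```python
def diagonal_order(n):
    """Diagonal sweeps starting from the bottom-left corner. Returns {(i, j): step}."""
    order, step = {}, 1
    for d in range(1, n):              # sweep d starts at row n - d + 1 in column 1
        for j in range(1, d + 1):
            order[(n - d + j, j)] = step
            step += 1
    return order

def is_valid(order):
    """Check that A(i-1, j-1) and A(i, j-1) are zeroed before A(i, j)."""
    for (i, j), step in order.items():
        for dep in ((i - 1, j - 1), (i, j - 1)):
            if dep in order and order[dep] > step:
                return False
    return True

path2 = diagonal_order(10)
ok = is_valid(path2)
```

Each sweep moves from one row to the row directly below it, so a row brought into cache for one rotation is used again by the next one.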

Parameterized Algorithms: Memory Hierarchy

- Parameterized algorithms make effective use of the memory hierarchy.
  - Improve spatial locality of memory references by grouping together data used at the same time.
  - Improve temporal locality of memory references by using data retrieved from cache as many times as possible before the cache is flushed.
- Portable performance is the primary objective.

[Figure: Memory Hierarchy of SGI O2, from the CPU outward]

    Registers             0 clock cycles
    First Level Cache     2-3 clock cycles
    Second Level Cache    8-10 clock cycles
    Main Memory           60-200 clock cycles

Givens Parameter

- Parameter c controls the number of columns in the Givens task.
- Determines how many matrix entries can be zeroed before rows are flushed from cache.

[Figure: array with the c Givens columns marked]

Householder Parameter

- Parameter h controls the number of columns zeroed by Householder reflections at the root processor.
- If h is large, the root processor performs more serial work, avoiding the communication costs associated with the Givens task.
- However, the other processors sit idle longer, decreasing the efficiency of the algorithm.

[Figure: array with the c Givens columns and the h Householder columns marked]

Work Partition Parameters

- Parameters v and w allow the operator to assign rows to processors such that the work load is balanced and processor idle time is minimized.

[Figure: rows assigned to Processor 0, Processor 1, and Processor 2, with the partition parameters v and w marked]

Results

Server Computer (1)

HP Superdome, SPAWAR in San Diego, CA

- 48 550-MHz PA-RISC 8600 CPUs
- 1.5 MB on-chip cache per CPU
- 1 GB RAM / Processor

Server Computer (2)

SGI O3, NRL in Washington, D.C.

- 512 R12000 processors running at 400 MHz
- 8 MB on-chip cache
- Up to 2 GB RAM / Processor

Embedded Computer

Mercury, JHU in Baltimore, MD

- 8 Motorola 7400 processors with AltiVec units
- 400 MHz clock
- 64 MB RAM per processor

Effect of c

[Figure: execution time (msec, log scale) versus c for Mercury, SGI O3, and HP Superdome; 100 x 100 array, 4 processors, p = 12, h = 0]

Effect of h

[Figure: execution time (msec, log scale) versus h for Mercury, SGI O3, and HP Superdome; 100 x 100 array, 4 processors, c = 63, p = 12]

Effect of w

[Figure: execution time (msec, log scale) versus w for Mercury, SGI O3, and HP Superdome; 100 x 100 array, 4 processors, h = 15, p = 1, c = 6, v = 15]

Performance vs Matrix Size

[Figure: execution time (msec, log scale) versus matrix size n = m for Mercury, SGI O3, and HP Superdome; 4 processors]

Scalability

[Figure: execution time (msec) versus number of processors for Mercury, SGI O3, and HP Superdome; 500 x 500 array]

Comparison to SCALAPACK

- For matrix sizes on the order of 100 by 100, the Hybrid QR algorithm outperforms the SCALAPACK library routine PSGEQRF by 16%.
- Data is distributed in block cyclic fashion before executing PSGEQRF.

[Figure: bar chart of execution time on the SGI O3 with 4 processors: PSGEQRF 9.4 ms, HYBRID 7.9 ms]
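As a quick check, the 16% figure follows from the two execution times shown on this slide (9.4 ms for PSGEQRF and 7.9 ms for the hybrid algorithm), taking the PSGEQRF time as the baseline. The symbols t_PSGEQRF and t_HYBRID are introduced here only for this calculation.

```latex
\[
\frac{t_{\mathrm{PSGEQRF}} - t_{\mathrm{HYBRID}}}{t_{\mathrm{PSGEQRF}}}
  = \frac{9.4 - 7.9}{9.4}
  = \frac{1.5}{9.4}
  \approx 0.16
\]
```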

Conclusion

- Hybrid QR algorithm using a combination of Givens rotations and Householder reflections is an efficient way to compute the QR decomposition for small arrays on the order of 100 x 100.
- Algorithm implemented on SGI O3 and HP Superdome servers as well as a Mercury G4 embedded computer.
- Mercury implementation lacked optimized BLAS routines, and as a consequence performance was significantly slower.
- Algorithm has applications to signal processing problems such as adaptive nulling, where strict latency targets must be satisfied.