Understanding the Spike Algorithm


Victor Eijkhout and Robert van de Geijn
May

1 Introduction

The parallel solution of linear systems has a long history, spanning both direct and iterative methods. While direct methods exist that have great generality, here we consider a subcase of practical importance, that of banded matrices. We note that many PDE problems naturally give rise to banded systems, given a large enough bandwidth.

For any banded matrix, we can impose a block structure such that the matrix is block tridiagonal. This structure gives each processor a contiguous block row of the matrix; we assume that the number of processors is low enough that the part owned by any processor comprises one or more of the blocks that define the block tridiagonal structure.

In this paper we present a number of variants on the Spike factorization of Polizzi and Sameh [?]. Instead of the customary algebraic presentation we view this algorithm as a domain decomposition method, where each processor corresponds to a subdomain, and the problem variables are divided into interior regions and separators. We do a cost analysis for the case where the algorithm is applied to a finite element type matrix. The Spike algorithm is then seen to be slightly more expensive than a regular domain decomposition.

2 Basic insights

2.1 From 2D domain to block tridiagonal matrix

To understand how a block tridiagonal matrix might arise, consider the domain in Figure 1. Imagine a discretization (mesh of unknowns) ordered in column-major order. The dark lines represent columns of unknowns that are viewed as separators.

Figure 1: A one-dimensionally partitioned domain with 4 subdomains and 3 separators.

This then yields a matrix with the following structure (displayed for three subdomains of three blocks each; the superscripts that tag each block with its subdomain or separator are omitted):

\[
\left(\begin{array}{ccc|c|ccc|c|ccc}
A & B & & & & & & & & & \\
C & A & B & & & & & & & & \\
& C & A & D & & & & & & & \\ \hline
& & E & S & G & & & & & & \\ \hline
& & & H & A & B & & & & & \\
& & & & C & A & B & & & & \\
& & & & & C & A & D & & & \\ \hline
& & & & & & E & S & G & & \\ \hline
& & & & & & & H & A & B & \\
& & & & & & & & C & A & B \\
& & & & & & & & & C & A
\end{array}\right)
\]

Here the large blocks correspond to the coupling matrix for the subdomains between separators and the small ones to the separators themselves.
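As an illustration of this structure, the following Python sketch prints the nonzero block pattern for p subdomains with R interior blocks each. The function name `block_pattern` and the single-letter labels (without superscripts) are illustrative assumptions, not notation fixed by the paper.

```python
def block_pattern(p, R):
    """Grid of block labels for p subdomains with R interior blocks each
    and p - 1 separators, in the natural (unpermuted) ordering."""
    n = p * R + (p - 1)                 # total number of block rows
    grid = [["." for _ in range(n)] for _ in range(n)]
    for j in range(p):
        first = j * (R + 1)             # first interior block of subdomain j
        last = first + R - 1            # last interior block of subdomain j
        for i in range(first, last + 1):
            grid[i][i] = "A"
            if i < last:
                grid[i][i + 1] = "B"    # coupling to the next interior block
                grid[i + 1][i] = "C"    # coupling to the previous interior block
        if j < p - 1:
            s = last + 1                # separator between subdomains j and j + 1
            grid[s][s] = "S"
            grid[last][s] = "D"         # last interior block -> separator
            grid[s][last] = "E"         # separator -> last interior block
            grid[s][s + 1] = "G"        # separator -> first block of next subdomain
            grid[s + 1][s] = "H"        # first block of next subdomain -> separator
    return grid


for row in block_pattern(p=3, R=3):
    print(" ".join(row))
```

With p = 3 and R = 3 this prints an 11 by 11 pattern whose separator rows read E S G, matching the small blocks in the display.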

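2.2 Permutation

Now, we can order the unknowns differently: still in column-major order, but ordering all the columns for the interiors, skipping the separators, and then the separators. This yields the matrix

\[
\left(\begin{array}{ccccccccc|cc}
A & B & & & & & & & & & \\
C & A & B & & & & & & & & \\
& C & A & & & & & & & D & \\
& & & A & B & & & & & H & \\
& & & C & A & B & & & & & \\
& & & & C & A & & & & & D \\
& & & & & & A & B & & & H \\
& & & & & & C & A & B & & \\
& & & & & & & C & A & & \\ \hline
& & E & G & & & & & & S & \\
& & & & & E & G & & & & S
\end{array}\right),
\qquad (1)
\]

which we recognize as the same matrix, permuted.

2.3 LU factorization vs one-sided factorization

Consider the matrix

\[
\begin{pmatrix} A & B \\ C & S \end{pmatrix}.
\]

Its LU factorization is given by

\[
\begin{pmatrix} A & B \\ C & S \end{pmatrix}
=
\begin{pmatrix} L_A & 0 \\ \hat C & L_S \end{pmatrix}
\begin{pmatrix} U_A & \hat B \\ 0 & U_S \end{pmatrix},
\]

where $A = L_A U_A$, $L_A \hat B = B$, $\hat C U_A = C$, and $S - \hat C \hat B = S - C A^{-1} B = L_S U_S$.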
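The block relations of this 2 by 2 factorization can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it uses NumPy, random test blocks made strictly diagonally dominant so that no pivoting is needed, and a hypothetical helper `lu_nopivot`; none of these choices come from the paper.

```python
import numpy as np


def lu_nopivot(M):
    """Unpivoted LU factorization M = L @ U (assumes nonzero pivots)."""
    n = M.shape[0]
    L, U = np.eye(n), M.astype(float).copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    return L, U


rng = np.random.default_rng(0)
n, m = 4, 2
shift = n + m   # assumption: makes the whole matrix strictly diagonally dominant
A = rng.random((n, n)) + shift * np.eye(n)
S = rng.random((m, m)) + shift * np.eye(m)
B = rng.random((n, m))
C = rng.random((m, n))

L_A, U_A = lu_nopivot(A)                    # A = L_A U_A
B_hat = np.linalg.solve(L_A, B)             # L_A B_hat = B
C_hat = np.linalg.solve(U_A.T, C.T).T       # C_hat U_A = C
L_S, U_S = lu_nopivot(S - C_hat @ B_hat)    # Schur complement = L_S U_S

L = np.block([[L_A, np.zeros((n, m))], [C_hat, L_S]])
U = np.block([[U_A, B_hat], [np.zeros((m, n)), U_S]])
M = np.block([[A, B], [C, S]])
print(np.allclose(L @ U, M))
```

The check also confirms that the Schur complement computed from the hatted blocks equals S - C A^{-1} B, since both products reproduce the same matrix.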

Figure 2: Parallel LU factorization, in four steps, as executed on processor j. Step 1 factors the interior blocks (for i = 1, ..., R, factor the diagonal block as L_A U_A). Step 4 handles the separator block:

    (x)     add contribution from processor j - 1
    (x)     factor L_S U_S
    (+)(x)  \hat T := L_S^{-1} T
    (+)(x)  \hat V := V U_S^{-1}
    (+)     S^{(j+1)} := S^{(j+1)} - \hat V \hat T
    (+)     send S^{(j+1)} to processor j + 1

L_A and U_A overwrite \hat A; L_S and U_S overwrite \hat S. Commands annotated with (*) are not performed in iteration i = 1. Commands annotated with ($) are not performed in iteration i = R. Commands annotated with (x) are not performed by the first processor. Commands annotated with (+) are not performed by the last processor.
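The interiors-first ordering described under Permutation, on which the parallel algorithms operate, amounts to simple index bookkeeping. The following Python sketch shows one way to generate it; the function name `interior_first_order` and the zero-based block numbering are assumptions made for illustration.

```python
def interior_first_order(p, R):
    """Block indices in permuted order: all interior blocks first
    (subdomain by subdomain), then the p - 1 separators."""
    stride = R + 1   # R interior blocks followed by one separator
    interiors = [j * stride + i for j in range(p) for i in range(R)]
    separators = [j * stride + R for j in range(p - 1)]
    return interiors + separators


perm = interior_first_order(p=3, R=3)
print(perm)   # [0, 1, 2, 4, 5, 6, 8, 9, 10, 3, 7]
```

Applying `perm` to both the rows and the columns of the block matrix moves the two separators (blocks 3 and 7) to the end, which is the arrangement shown in the permuted matrix.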

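3 Fishbone Algorithm

Let us perform an LU factorization of the permuted matrix in (1), taking advantage of zero blocks, yielding the matrices

\[
L = \left(\begin{array}{ccccccccc|cc}
L_A & & & & & & & & & & \\
\hat C & L_A & & & & & & & & & \\
& \hat C & L_A & & & & & & & & \\
& & & L_A & & & & & & & \\
& & & \hat C & L_A & & & & & & \\
& & & & \hat C & L_A & & & & & \\
& & & & & & L_A & & & & \\
& & & & & & \hat C & L_A & & & \\
& & & & & & & \hat C & L_A & & \\ \hline
& & \hat E & \hat G & \hat G & \hat G & & & & L_S & \\
& & & & & \hat E & \hat G & \hat G & \hat G & \hat V & L_S
\end{array}\right)
\]

and

\[
U = \left(\begin{array}{ccccccccc|cc}
U_A & \hat B & & & & & & & & & \\
& U_A & \hat B & & & & & & & & \\
& & U_A & & & & & & & \hat D & \\
& & & U_A & \hat B & & & & & \hat H & \\
& & & & U_A & \hat B & & & & \hat H & \\
& & & & & U_A & & & & \hat H & \hat D \\
& & & & & & U_A & \hat B & & & \hat H \\
& & & & & & & U_A & \hat B & & \hat H \\
& & & & & & & & U_A & & \hat H \\ \hline
& & & & & & & & & U_S & \hat T \\
& & & & & & & & & & U_S
\end{array}\right).
\]

An algorithm for this is given in Figure 2. If one performed the computation as if the matrix were permuted, but kept the data in the original matrix, one would end up with a matrix that would look like this:

\[
\left(\begin{array}{ccc|c|ccc|c|ccc}
\hat A & \hat B & & & & & & & & & \\
\hat C & \hat A & \hat B & & & & & & & & \\
& \hat C & \hat A & \hat D & & & & & & & \\ \hline
& & \hat E & \hat S & \hat G & \hat G & \hat G & \hat T & & & \\ \hline
& & & \hat H & \hat A & \hat B & & & & & \\
& & & \hat H & \hat C & \hat A & \hat B & & & & \\
& & & \hat H & & \hat C & \hat A & \hat D & & & \\ \hline
& & & \hat V & & & \hat E & \hat S & \hat G & \hat G & \hat G \\ \hline
& & & & & & & \hat H & \hat A & \hat B & \\
& & & & & & & \hat H & \hat C & \hat A & \hat B \\
& & & & & & & \hat H & & \hat C & \hat A
\end{array}\right)
\]

where $\hat A$ stores $L_A$ and $U_A$, and $\hat S$ stores $L_S$ and $U_S$.

To parallelize this one first decides where what data exists. Let us assume that the domain was partitioned into p subdomains (with p - 1 separators) and hence there are p large diagonal blocks. We will assign to processor j the submatrix made up of its own interior and separator blocks together with the coupling blocks $T$ and $V$ and the next separator block $S^{(j+1)}$, in either the original or the permuted ordering.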

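Notice that the block $S^{(j+1)}$, shown in red, is actually owned by processor j + 1. A contribution to the Schur complement that updates that block will be computed on this processor, and then passed to the next one, to be added there to $S^{(j+1)}$.

Assuming that each large diagonal block actually has R subblocks on the diagonal, a parallel algorithm (executed on each processor) for computing the LU factorization of the matrix is given in Figure 2.

4 The Spike Algorithm

To understand the Spike algorithm, one should first understand the following unconventional sequence of factorizations. We start with the matrix

\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
\]

and compute the LU factorization of $W$, $W = L_W U_W$. Then

\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
=
\begin{pmatrix} L_W & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} U_W & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} I & \check X \\ Y & Z \end{pmatrix},
\qquad (2)
\]

where $\check X$ satisfies $L_W U_W \check X = X$. Next, we factor further:

\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
=
\begin{pmatrix} L_W & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} U_W & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ Y & I \end{pmatrix}
\begin{pmatrix} I & \check X \\ 0 & Z - Y \check X \end{pmatrix},
\qquad (3)
\]

where $Z - Y \check X$ is often called a Schur complement. Finally, if we factor $Z - Y \check X$ into its LU factorization, $Z - Y \check X = L_Z U_Z$, we obtain

\[
\begin{pmatrix} W & X \\ Y & Z \end{pmatrix}
=
\begin{pmatrix} L_W & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} U_W & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ Y & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ 0 & L_Z \end{pmatrix}
\begin{pmatrix} I & 0 \\ 0 & U_Z \end{pmatrix}
\begin{pmatrix} I & \check X \\ 0 & I \end{pmatrix}.
\qquad (4)
\]

Again, consider the permuted matrix in Equation (1),

\[
\left(\begin{array}{ccccccccc|cc}
A & B & & & & & & & & & \\
C & A & B & & & & & & & & \\
& C & A & & & & & & & D & \\
& & & A & B & & & & & H & \\
& & & C & A & B & & & & & \\
& & & & C & A & & & & & D \\
& & & & & & A & B & & & H \\
& & & & & & C & A & B & & \\
& & & & & & & C & A & & \\ \hline
& & E & G & & & & & & S & \\
& & & & & E & G & & & & S
\end{array}\right),
\qquad (5)
\]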

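and assign to processor j the same submatrix as before: its own interior and separator blocks together with $T$, $V$, and $S^{(j+1)}$. The block in red, $S^{(j+1)}$, is again owned by processor j + 1; what is computed here will be a contribution to the update of that processor's $S^{(j+1)}$.

The use of thick lines here and in the equations up to (5) was purposely chosen. They delineate what we will refer to as W, X, Y, and Z in the subsequent discussion.

Consider the parallel algorithm in Figure 3. In Equations ??, let W consist of the block diagonal matrix with the interior blocks on its diagonal.

Step 1 computes (this processor's part of) the LU factorization $W = L_W U_W$.
Step 2 solves $L_W (U_W \hat X) = X$.
Step 3 updates $Z := Z - Y \hat X$.
Step 4 factors $Z - Y \hat X$.

If one performs the computation as if the matrix were permuted, but updates the data in the original matrix, one ends up with a matrix that looks like:

\[
\left(\begin{array}{ccc|c|ccc|c|ccc}
\hat A & \hat B & & \hat D & & & & & & & \\
\hat C & \hat A & \hat B & \hat D & & & & & & & \\
& \hat C & \hat A & \hat D & & & & & & & \\ \hline
& & \hat E & \hat S & \hat G & & & \hat T & & & \\ \hline
& & & \hat H & \hat A & \hat B & & \hat D & & & \\
& & & \hat H & \hat C & \hat A & \hat B & \hat D & & & \\
& & & \hat H & & \hat C & \hat A & \hat D & & & \\ \hline
& & & \hat V & & & \hat E & \hat S & \hat G & & \\ \hline
& & & & & & & \hat H & \hat A & \hat B & \\
& & & & & & & \hat H & \hat C & \hat A & \hat B \\
& & & & & & & \hat H & & \hat C & \hat A
\end{array}\right),
\qquad (6)
\]

where $\hat A$ stores $L_A$ and $U_A$, and $\hat S$ stores $L_S$ and $U_S$. This is the classic sparsity pattern that results from the Spike algorithm.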
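The sequence of factorizations behind Steps 1 through 4 can be verified on a small dense example. The sketch below is an illustration under stated assumptions: it uses NumPy with random, strictly diagonally dominant test blocks, and it keeps W and the Schur complement unfactored (each stands in for the product of its two triangular factors) to stay short.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2
shift = n + m   # assumption: diagonal shift that keeps W and Z well conditioned
W = rng.random((n, n)) + shift * np.eye(n)
Z = rng.random((m, m)) + shift * np.eye(m)
X = rng.random((n, m))
Y = rng.random((m, n))
M = np.block([[W, X], [Y, Z]])

X_chk = np.linalg.solve(W, X)     # the "spikes": W X_chk = X
schur = Z - Y @ X_chk             # Schur complement Z - Y X_chk

I_n, I_m = np.eye(n), np.eye(m)
O_nm, O_mn = np.zeros((n, m)), np.zeros((m, n))

F1 = np.block([[W, O_nm], [O_mn, I_m]])       # diag(L_W, I) times diag(U_W, I)
F2 = np.block([[I_n, O_nm], [Y, I_m]])        # unit lower factor holding Y
F3 = np.block([[I_n, O_nm], [O_mn, schur]])   # diag(I, L_Z) times diag(I, U_Z)
F4 = np.block([[I_n, X_chk], [O_mn, I_m]])    # unit upper factor holding the spikes

print(np.allclose(F1 @ F2 @ F3 @ F4, M))
```

Note that the spike factor F4 comes last in the product, so Y enters the factorization unchanged while X is replaced by the solved-for spikes.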

The factorization of the matrix is now given by (BETTER CHECK THIS!!!) the product of factors in the pattern of (4), where $L_W$ is built from the blocks $L_A$ and $\hat C$, $U_W$ from $U_A$ and $\hat B$, $Y$ from $\hat E$ and $\hat G$, $L_Z$ from $L_S$ and $\hat V$, $U_Z$ from $U_S$ and $\hat T$, and the spikes $\hat X$ from $\hat D$ and $\hat H$.

5 Cost comparison

We now analyze the predicted cost of the algorithms. For this, we make the following assumptions/observations:

- All blocks are b × b.
- Computing an LU factorization of a b × b matrix takes (2/3)b^3 flops.
- Multiplying two b × b matrices takes 2b^3 flops.
- Solving a b × b triangular system with b right-hand sides takes b^3 flops.
- Each flop takes γ time.
- Sending a b × b block from one processor to another requires α + b^2 β time.

With these assumptions, each operation in Figures 2 and 3 is annotated with its cost (ignore for now the cost estimate in parentheses) and a summary of estimated costs is given in Figure 4. Some interesting observations:

- If R ≫ p, the speedup for the LU factorizations that need to be performed approaches p, perfect speedup.
- Even if R ≫ p, the maximal combined speedup attained for the TRSMs and GEMMs is at best a factor p/3. (The combined costs approach 12Rb^3 γ and 13Rb^3 γ for Fishbone and Spike, respectively, versus 4pRb^3 γ for sequential block tridiagonal factorization, if p is fixed and R → ∞.) The reason is that the fill-in that occurs when factoring the permuted matrix represents a significant overhead.
- The Spike algorithm is less efficient than the Fishbone algorithm. This is not due to how the matrix is partitioned among processors, nor to how the inherently sequential part is performed, since both of these are the same for both algorithms.
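The cost model used in this section can be written down as a small calculator. The following Python sketch is a hedged illustration: the function names, the example kernel counts, and the machine parameters are hypothetical, and the per-kernel flop coefficients are the usual leading-order counts for LU, matrix multiply, and triangular solve, taken here as an assumption.

```python
def kernel_times(b, gamma, alpha, beta):
    """Time of one kernel call on b-by-b blocks under the assumed model."""
    return {
        "lu": (2.0 / 3.0) * b**3 * gamma,   # LU factorization of one block
        "gemm": 2.0 * b**3 * gamma,         # product of two blocks
        "trsm": 1.0 * b**3 * gamma,         # triangular solve, b right-hand sides
        "send": alpha + b**2 * beta,        # one block message: latency + volume
    }


def total_time(counts, b, gamma, alpha, beta):
    """Sum the model time over a dictionary of kernel call counts."""
    t = kernel_times(b, gamma, alpha, beta)
    return sum(counts[k] * t[k] for k in counts)


# Hypothetical kernel counts and machine parameters, for illustration only.
counts = {"lu": 4, "trsm": 6, "gemm": 3, "send": 1}
print(total_time(counts, b=100, gamma=1e-9, alpha=1e-6, beta=1e-9))
```

Feeding such a calculator the kernel counts of each algorithm gives a quick way to compare the leading terms for different values of p, R, and b.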

Figure 3: Spike Algorithm, as executed on processor j.

    Step 1: Factor interior block (matrix W)
    Step 2: Solve L_W (U_W \hat X) = X
    Step 3: Update Z = Z - Y \hat X
    Step 4: Factor the updated Z
        (x)     add contribution from processor j - 1
        (x)     factor L_S U_S
        (+)(x)  \hat T := L_S^{-1} T
        (+)(x)  \hat V := V U_S^{-1}
        (+)     S^{(j+1)} := S^{(j+1)} - \hat V \hat T
        (+)     send S^{(j+1)} to processor j + 1

L_A and U_A overwrite \hat A (L_A and U_A are stored in \hat A); L_S and U_S overwrite \hat S. Commands annotated with (*) are not performed in iteration i = 1. Commands annotated with ($) are not performed in iteration i = R. Commands annotated with (x) are not performed by the first processor. Commands annotated with (+) are not performed by the last processor.

                    Fishbone               Spike                  Sequential
Step 1   LU fact    R b^3 γ                R b^3 γ                (pR + p ) b^3 γ
         TRSM       (R ) b^3 γ             (R ) b^3 γ             (pR + p ) b^3 γ
         GEMM       (R ) b^3 γ             (R ) b^3 γ             (pR + p ) b^3 γ
Step 2   TRSM       R b^3 γ                (R + ) b^3 γ
         GEMM       6(R ) b^3 γ            6(R ) b^3 γ
Step 3   TRSM       b^3 γ                  b^3 γ
         GEMM       6 b^3 γ                8 b^3 γ
Step 4   LU fact    (p ) b^3 γ             (p ) b^3 γ
         TRSM       (p ) b^3 γ             (p ) b^3 γ
         GEMM       (p ) b^3 γ             (p ) b^3 γ
         Send       (p )(α + b^2 β)        (p )(α + b^2 β)
Total    LU fact    (R + p ) b^3 γ         (R + p ) b^3 γ         (pR + p ) b^3 γ
         TRSM       (4R + (p )) b^3 γ      (5R + (p )) b^3 γ      (pR + p ) b^3 γ
         GEMM       (8R + p 4) b^3 γ       (8R + p ) b^3 γ        (pR + p ) b^3 γ
         Send       (p )(α + b^2 β)        (p )(α + b^2 β)

Figure 4: Cost analysis of the different algorithms.