Lecture 8-1: Vector Processors (A. Sohn)


Vector Processors

How many iterations does the following loop go through?

    for i = 1 to n do A[i] = B[i] + C[i]

Sequential processor: n iterations. Vector processor: one instruction!

Characteristics of vector processors:
- No data hazards: a single instruction covers the entire loop.
- No control hazards: there is no loop branch to mispredict.
- Heavily interleaved memory.
- Heavily pipelined operations.

Questions to ask:
- Can the memory system keep up? Vector operands demand enormous memory bandwidth, hence the massively interleaved memory. This is where memory-to-memory versus register-to-register architectures and pipeline chaining come in.
- Can the instructions be vectorized at all? That is the job of the vectorizing compiler.
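To make the contrast concrete, here is the loop in C; the leading comment sketches, with illustrative instruction names (assumptions, not a real ISA), the handful of vector instructions that would replace all n scalar iterations:

    /* The slide's loop in C. On a vector processor the whole loop compiles
       to roughly four vector instructions (names are assumptions for
       illustration, not a real ISA):
           VLOAD  V1, B       ; load B[0..n-1] into vector register V1
           VLOAD  V2, C       ; load C[0..n-1] into vector register V2
           VADD   V3, V1, V2  ; ONE instruction adds all n element pairs
           VSTORE V3, A       ; store the n sums back to A               */
    #include <stdio.h>

    void vadd(double *A, const double *B, const double *C, int n) {
        for (int i = 0; i < n; i++)  /* n trips on a scalar pipeline */
            A[i] = B[i] + C[i];      /* the body a vector unit pipelines */
    }

    int main(void) {
        double A[8], B[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double C[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        vadd(A, B, C, 8);
        printf("A[0] = %g\n", A[0]);  /* prints A[0] = 9 */
        return 0;
    }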

Vector Processors

Pipeline the operations on individual elements. The building blocks:
- Vector functional units: multiplier, adder
- Vector registers
- Vector load/store units
- A set of scalar registers

Memory Interleaving

A vector processor keeps several functional units busy at once, including the adder and the load/store units, so memory must deliver several operands per cycle. For C[0:7] = A[0:7] + B[0:7], the three vectors are ideally allocated across 8 memory modules with skewed placement:

    M0: A[0] B[6] C[4]    M4: A[4] B[2] C[0]
    M1: A[1] B[7] C[5]    M5: A[5] B[3] C[1]
    M2: A[2] B[0] C[6]    M6: A[6] B[4] C[2]
    M3: A[3] B[1] C[7]    M7: A[7] B[5] C[3]

[Timing chart: with this skewed allocation the reads of A, the reads of B, and the writes of C proceed through all 8 modules without conflict, completing around cycle 15.]
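A short C sketch of that allocation. The mapping rule (element i of a vector with skew s lives in module (i + s) mod 8, with skews 0, 2, and 4 for A, B, and C) is inferred from the table above, not stated in the original slide:

    /* Reproduce the skewed-allocation table: element i of a skew-s vector
       lives in module (i + s) % 8. Skews 0/2/4 for A/B/C are inferred from
       the slide's table. */
    #include <stdio.h>

    #define MODULES 8

    int main(void) {
        for (int m = 0; m < MODULES; m++) {
            /* invert the mapping: which element of each vector lives in m? */
            int a = m;                            /* skew 0 */
            int b = (m - 2 + MODULES) % MODULES;  /* skew 2 */
            int c = (m - 4 + MODULES) % MODULES;  /* skew 4 */
            printf("M%d: A[%d] B[%d] C[%d]\n", m, a, b, c);
        }
        return 0;
    }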

Less Ideal Case

Suppose instead that module m simply holds A[m], B[m], and C[m], with no skew:

    M0: A[0] B[0] C[0]    M4: A[4] B[4] C[4]
    M1: A[1] B[1] C[1]    M5: A[5] B[5] C[5]
    M2: A[2] B[2] C[2]    M6: A[6] B[6] C[6]
    M3: A[3] B[3] C[3]    M7: A[7] B[7] C[7]

[Timing chart: with this unskewed allocation the reads of A and B collide in the same modules, so the work splits into two convoys and finishes around cycle 17 rather than 15.]

Pipeline Chaining

What is pipeline chaining? It is an expansion of the internal-forwarding concept of a pipeline (internal forwarding passes a result from one register to another): the results of one pipeline are fed directly into the operand registers of another pipeline. Example:

    V0 = Memory       (memory fetch, via the load pipe)
    V2 = V0 + V1      (vector add)
    V3 = V2 << A3     (left shift)
    V5 = V3 & V4      (logical product)

Facts and assumptions:
- One vector load pipe with 8 stages.
- One vector add pipe with 3 stages.
- One vector shift pipe with 4 stages.
- One vector logic pipe with 2 stages.
- A transition between a pipe and a vector register takes 1 clock cycle.
- V1 is already loaded in its register.
- Vector length = 8; vector register length = 64.
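A small C calculation contrasting chained and unchained execution under those assumptions. The timing model (one cycle into and one cycle out of each pipe, then one result per cycle) is a simplification of the slide's charts, so treat the exact totals as illustrative:

    /* Rough timing model for the 4-pipe chaining example (a simplification,
       not the slide's exact chart). A pipe's first result costs
       depth + 2 cycles (1 cycle in, 1 cycle out); after that it streams
       one result per cycle. */
    #include <stdio.h>

    int main(void) {
        int depth[] = {8, 3, 4, 2};  /* load, add, shift, logic pipes */
        int n = 8;                   /* vector length */
        int chained = 0, unchained = 0;

        for (int p = 0; p < 4; p++) {
            chained   += depth[p] + 2;            /* downstream pipe starts
                                                     on the first result    */
            unchained += depth[p] + 2 + (n - 1);  /* each op drains fully
                                                     before the next starts */
        }
        chained += n - 1;  /* remaining elements stream from the last pipe */

        printf("chained:   %d cycles\n", chained);    /* 25 + 7 = 32 */
        printf("unchained: %d cycles\n", unchained);  /* 53 */
        return 0;
    }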

Pipeline Chaining with Vector Size 8

[Figure: chained datapath. Memory feeds the load pipe into V0; V0 and V1 feed the add pipe into V2; V2 feeds the shift pipe into V3; V3 and V4 feed the logic pipe into V5.]

Pipeline Chaining with Vector Length 8

[Timing chart: with chaining, each downstream pipe starts as soon as the first result of the pipe feeding it emerges, so the load, add, shift, and logic operations overlap across cycles 8-25.]

No Chaining with Vector Length 8

[Timing chart: without chaining, each pipe waits for the previous operation to complete its entire 8-element vector before starting, so execution stretches well beyond the chained case.]

Chaining with Vector Length 64

[Timing chart: with 64-element vectors the pipes stay full much longer, so each pipe's startup latency is amortized over many more elements and chaining pays off even more.]

For Vector Length N

Suppose the vector register length is 64 but the vector length is n > 64. The loop must then be strip-mined: it executes as ⌈n/64⌉ segments of at most 64 elements each.

Vectorizing Compiler

What is a vectorizing compiler? Case 1:

    S1: A[i] = B[i] + C[i]
    S2: D[i] = E[i] * F[i]

Two vector statements can replace the entire loop:

    S1: A[1:n] = B[1:n] + C[1:n]
    S2: D[1:n] = E[1:n] * F[1:n]

This simple example is easy to vectorize. Case 2: how about this loop?

    S1: A[i] = A[i-1] + B[i]

Can we vectorize it as A[1:n] = A[0:n-1] + B[1:n]? No, we cannot, because S1 uses a value computed by S1 in an earlier iteration: A[i] depends on A[i-1], which is computed in the previous iteration.
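A C sketch of strip mining, assuming a maximum vector length of 64; vadd_strip is a hypothetical stand-in for whatever single vector operation processes one segment:

    /* Strip mining: run an n-element operation as ceil(n/64) segments,
       each short enough to fit a 64-element vector register.
       vadd_strip is a hypothetical stand-in for one vector instruction. */
    #define MVL 64  /* vector register length */

    static void vadd_strip(double *A, const double *B, const double *C,
                           int len) {
        for (int i = 0; i < len; i++)  /* one vector instruction's worth */
            A[i] = B[i] + C[i];
    }

    void vadd_stripmined(double *A, const double *B, const double *C, int n) {
        for (int start = 0; start < n; start += MVL) {
            int len = (n - start < MVL) ? n - start : MVL;
            vadd_strip(A + start, B + start, C + start, len);
        }
    }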

Unrolling the recurrence makes the chain explicit:

    S1: A[i] = A[i-1] + B[i]

    Iteration 1: A[1] = A[0] + B[1]
    Iteration 2: A[2] = A[1] + B[2]
    Iteration 3: A[3] = A[2] + B[3]
    ...
    Iteration n: A[n] = A[n-1] + B[n]

Various Dependences

Now consider a slightly more complex case:

    S1: A[i] = A[i-1] + B[i]
    S2: B[i] = A[i] + B[i-1]
    S3: A[i] = B[i-1] + C[i]

- S1 uses a value computed by S1 in the previous iteration: a loop-carried dependence.
- S2 uses a value computed by S2 in the previous iteration: a loop-carried dependence.
- S2 uses the value A[i] computed by S1 in the same iteration: a flow dependence (RAW hazard).
- S1 uses the value B[i], which S2 will modify in the same iteration: an anti-dependence (WAR hazard).
- S1 and S3 write the same value A[i] in the same iteration: an output dependence (WAW hazard).

Note: anti- and output dependences are name conflicts; they are not true dependences. (The C sketch below marks each dependence in place.)
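Here is the loop in C with each dependence annotated where it occurs (the bounds and element type are illustrative):

    /* The three-statement loop with its dependences annotated in place.
       Array bounds and element type are illustrative. */
    void loop(double A[], double B[], double C[], int n) {
        for (int i = 1; i < n; i++) {
            A[i] = A[i-1] + B[i];  /* S1: loop-carried (reads A[i-1]);
                                      anti-dep: reads B[i] that S2 overwrites */
            B[i] = A[i] + B[i-1];  /* S2: flow dep on S1's A[i] (RAW);
                                      loop-carried (reads B[i-1])             */
            A[i] = B[i-1] + C[i];  /* S3: output dep with S1, both write
                                      A[i] (WAW)                              */
        }
    }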

OK, then can we remove those dependences? How? Simply rename the A written by S1 to X and see what happens:

    S1: X[i] = A[i-1] + B[i]
    S2: B[i] = X[i] + B[i-1]
    S3: A[i] = B[i-1] + C[i]

S2 still uses a value computed by S2 in the previous iteration, so that loop-carried dependence remains. But the output dependence (WAW hazard) is gone: S1 and S3 no longer write the same value. How about the anti-dependence (WAR hazard)? It is also gone!

How To Vectorize?

Can the various dependences be vectorized?
- A loop-carried dependence cannot be vectorized.
- A flow dependence cannot be vectorized.
- An anti-dependence can be vectorized. How?
- An output dependence can be vectorized. How?

The last two can be eliminated by renaming the storage, as the following example shows (there is no loop-carried dependence in this loop):

    S1: B[i] = A[i] + D[i]
    S2: A[i] = A[i] * D[i]
    S3: C[i] = B[i] + D[i]
    S4: B[i] = E[i] * B[i]

There are flow dependences in (S1,S3) and (S1,S4); anti-dependences in (S1,S2) and (S3,S4); and an output dependence in (S1,S4). Renaming the B defined by S1 to T (in S1, S3, and S4) removes the output dependence, and renaming the A defined by S2 to F removes the anti-dependence:

    S1: T[i] = A[i] + D[i]
    S2: F[i] = A[i] * D[i]
    S3: C[i] = T[i] + D[i]
    S4: B[i] = E[i] * T[i]
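A C sketch of the renamed loop. Each statement is now an independent whole-array operation, so each line below could issue as a single vector instruction. T and F are the compiler-introduced temporaries, and the '*' operators are assumptions, since the slide's operators were lost in transcription:

    /* The renamed statements, one whole-array (vector) operation each.
       T and F are compiler-introduced temporaries; '*' is assumed where
       the original operator was lost in transcription. */
    void renamed(const double A[], double B[], double C[], const double D[],
                 const double E[], double T[], double F[], int n) {
        for (int i = 0; i < n; i++) T[i] = A[i] + D[i];  /* S1 */
        for (int i = 0; i < n; i++) F[i] = A[i] * D[i];  /* S2 */
        for (int i = 0; i < n; i++) C[i] = T[i] + D[i];  /* S3 */
        for (int i = 0; i < n; i++) B[i] = E[i] * T[i];  /* S4 */
    }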