Implementing Multipliers in FLEX 10K Devices

March 1996, ver. 1
Application Note 53
A-AN-053-01

Introduction

The Altera FLEX 10K embedded programmable logic device (PLD) family provides the first PLDs in the industry with an embedded array. The embedded array consists of a series of embedded array blocks (EABs) that can implement complex logic functions, such as multipliers. Each EAB can be configured as an 8-input, 8-output look-up table (LUT). Therefore, a single EAB can create a multiplier with up to 8 inputs, such as a 4 × 4, 5 × 3, or 6 × 2 multiplier. Figure 1 shows a graphical representation of the flexible multiplier sizes that can be implemented in an EAB.

Figure 1. Multiplier Configuration for a Single EAB
(The figure shows the single-EAB multiplier sizes: 4 × 4, 5 × 3, and 6 × 2.)

This application note describes how to implement large multipliers using several EABs and compares parallel multiplier and time-domain-multiplexed multiplier implementations.

The design files described in this application note are available from the Altera BBS via modem at (08) 95-010 and the Altera FTP site at ftp.altera.com. The self-extracting files are: an_53.exe and an_53.tar.

Single-EAB Multipliers

You can implement a multiplier with up to 8 inputs in a single EAB using a function from the library of parameterized modules (LPM). The LPM is a set of architecture-independent modules with scalable widths that completely describes the logical operation of a circuit. Using the LPM function, lpm_mult, you can define the width of the multiplicand for the multiplier. Then, you can use MAX+PLUS II to place the multiplier in an EAB by following these steps:

1. Select the lpm_mult function in any MAX+PLUS II application.
2. Choose the Logic Options command (Assign menu). In the Logic Options dialog box, the name of the function is displayed in the Node Name box.
3. Choose the Individual Logic Options button and turn on the Implement in EAB option. Choose OK.
4. Choose OK to implement the multiplier in the EAB.

Multiple-EAB Multipliers

A multiplier with more than 8 inputs must be implemented in two or more EABs. Each EAB computes a single partial product, generated from a 4 × 4 multiplier. To illustrate how to split the multiplier across multiple EABs, consider how a 2-digit by 2-digit multiplication is calculated using base 10 multiplication. See Figure 2.

Figure 2. Base 10 Multiplication

                  1       2
          ×       3       7
          -----------------
              7 × 1   7 × 2
    + 3 × 1   3 × 2
    -----------------------
    3 × 10^2 + (7 + 6) × 10^1 + 14 × 10^0

Rather than using base 10 (as shown in Figure 2), the EAB performs the same operation in hexadecimal radix. Each partial product is calculated within a single EAB. See Figure 3.

Figure 3. Hexadecimal Multiplication

Each partial product is generated by one EAB. Partial products are summed to produce the final product.

                                    X[7..4]             X[3..0]
                            ×       Y[7..4]             Y[3..0]
                            -----------------------------------
                            X[7..4] × Y[3..0]   X[3..0] × Y[3..0]
    + X[7..4] × Y[7..4]     X[3..0] × Y[7..4]
    -----------------------------------------------------------
    (X[7..4] × Y[7..4]) × 16^2 + ((X[7..4] × Y[3..0]) + (X[3..0] × Y[7..4])) × 16^1 + (X[3..0] × Y[3..0]) × 16^0

To account for the relative significance in hexadecimal radix, each partial product is multiplied by 16^n (where n = 0, 1, 2,...), and the results are added together to determine the final product. You can choose one of two design methods to generate the final product: a parallel multiplier or a time-domain-multiplexed multiplier.
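The hexadecimal decomposition in Figure 3 can be checked with a short software model. The following Python sketch is only an illustration, not part of the design files: the 256-entry list `EAB_LUT` and the function names are hypothetical stand-ins for an EAB configured as a 4 × 4 multiplier.

```python
# Hypothetical model: a 256-entry table stands in for one EAB configured
# as a 4 x 4 multiplier (8 address bits in, 8 data bits out).
EAB_LUT = [(addr >> 4) * (addr & 0xF) for addr in range(256)]


def eab_mult4x4(a, b):
    """Look up the 8-bit product of two 4-bit operands."""
    return EAB_LUT[((a & 0xF) << 4) | (b & 0xF)]


def multiply_8x8(x, y):
    """Multiply two 8-bit values using four 4 x 4 partial products."""
    x_hi, x_lo = (x >> 4) & 0xF, x & 0xF
    y_hi, y_lo = (y >> 4) & 0xF, y & 0xF
    low = eab_mult4x4(x_lo, y_lo)      # weight 16^0
    mid_a = eab_mult4x4(x_hi, y_lo)    # weight 16^1
    mid_b = eab_mult4x4(x_lo, y_hi)    # weight 16^1
    high = eab_mult4x4(x_hi, y_hi)     # weight 16^2
    # Multiplying by 16^n is a left shift by 4n bits.
    return (high << 8) + ((mid_a + mid_b) << 4) + low
```

Calling `multiply_8x8(0x12, 0x37)` should give the same value as `0x12 * 0x37`, because the four look-ups cover every nibble pairing exactly once.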
Parallel Multiplier

The parallel multiplier design method uses multiple EABs to generate all of the partial products in parallel. For example, an 8 × 8 parallel multiplier uses four EABs (one for each partial product) to simultaneously generate four partial products. Before adding the partial products together, each partial product is shifted to account for the 16^n term (i.e., each partial product is shifted over n hexadecimal digits or 4n bits). The adder assembles the final product by shifting the data into different bits. Addition is normally generated by a two-stage adder with 8 bits for the first stage and 12 bits for the second stage (see Figure 4).

Figure 4. 2-Stage Adder

                      S7  S6  S5  S4  S3  S2  S1  S0
                      T7  T6  T5  T4  T3  T2  T1  T0
                                      R7  R6  R5  R4  R3  R2  R1  R0
    +   U7  U6  U5  U4  U3  U2  U1  U0
    ----------------------------------------------------------------
     Q15 Q14 Q13 Q12 Q11 Q10  Q9  Q8  Q7  Q6  Q5  Q4  Q3  Q2  Q1  Q0

Where:
R = X[3..0] × Y[3..0]
S = X[3..0] × Y[7..4]
T = X[7..4] × Y[3..0]
U = X[7..4] × Y[7..4]

The figure distinguishes the addition performed in the first stage from the addition performed in the second stage.

You can pipeline the parallel multiplier to enhance design speeds by using registers to process logic over multiple Clock cycles. The registers within the EAB can be used for pipelining (see Figure 5).
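The two-stage addition can be sketched in Python as follows. The split between the stages is an assumption inferred from the stated widths: the 8-bit first stage is taken to add S and T, and the 12-bit second stage is taken to add that sum to U concatenated with R[7..4]. The function name `two_stage_add` is hypothetical.

```python
def two_stage_add(r, s, t, u):
    """Combine four 8-bit partial products into a 16-bit product.

    r, s, t, u follow the definitions given with the 2-stage adder figure.
    """
    # Stage 1 (assumed 8-bit adder with carry out): the two partial
    # products that share the 16^1 weight.
    middle = s + t
    # Stage 2 (assumed 12-bit adder): U concatenated with R[7..4],
    # plus the stage-1 sum. This produces Q[15..4].
    upper = ((u << 4) | (r >> 4)) + middle
    # R[3..0] needs no addition and passes straight through as Q[3..0].
    return ((upper << 4) | (r & 0xF)) & 0xFFFF
```

Under these assumptions only 8 + 12 bit positions pass through an adder, because the low nibble of R never overlaps another partial product.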
Figure 5. Parallel Multiplier with Pipelining
(Block diagram: four EABs, each with optional pipelining registers, multiply X[3..0] × Y[3..0], X[3..0] × Y[7..4], X[7..4] × Y[3..0], and X[7..4] × Y[7..4]. The output is labeled Z[3..0], Z[7..4], Z[11..8], and Z[15..12].)

An 8 × 8 parallel multiplier is implemented in 3 stages: a multiplier stage using 4 EABs, and 2 adder stages with 8 bits for the first stage and 12 bits for the second stage. To pipeline the multiplier, each bit must be registered after each stage, which requires 21 registers for the first stage and additional registers for the second stage. For the multiplier stage, each EAB has registers available at the inputs and outputs. Therefore, additional logic elements (LEs) are not required for the multiplier stage. The LEs containing the adder logic provide 21 registers; therefore, only 20 additional LEs are required for the entire circuit.

Time-Domain-Multiplexed Multiplier

The time-domain-multiplexed multiplier design method uses a single EAB to generate all partial products on different Clock cycles (see Figure 6). Therefore, the appropriate bits need to be loaded into the EAB before each multiplication. After multiplication, the accumulator shifts the data to account for the 16^n term and then sums the different partial products to produce the final product.
Figure 6. Simulation Waveform for Time-Domain-Multiplexed Multiplier
(Waveform: on successive Clock cycles the EAB Output is R, S, T, and U, and the Accumulator Output takes the values (1), (2), (3), and (4).)

Where:
R = X[3..0] × Y[3..0]
S = X[3..0] × Y[7..4]
T = X[7..4] × Y[3..0]
U = X[7..4] × Y[7..4]

Notes:
(1) (X[3..0] × Y[3..0]) × 16^0
(2) ((X[3..0] × Y[3..0]) × 16^0) + ((X[3..0] × Y[7..4]) × 16^1)
(3) ((X[3..0] × Y[3..0]) × 16^0) + (((X[3..0] × Y[7..4]) + (X[7..4] × Y[3..0])) × 16^1)
(4) ((X[3..0] × Y[3..0]) × 16^0) + (((X[3..0] × Y[7..4]) + (X[7..4] × Y[3..0])) × 16^1) + ((X[7..4] × Y[7..4]) × 16^2)

To pipeline the time-domain-multiplexed multiplier, insert registers between the EAB performing the multiplication and the accumulator performing the addition and shifting. Figure 7 shows a time-domain-multiplexed multiplier.

Figure 7. Time-Domain-Multiplexed Multiplier
(Block diagram: optional input registers with enables hold X[7..4], X[3..0], Y[7..4], and Y[3..0]. A control block selects which nibbles feed the EAB 4 × 4 multiplier. The 8-bit EAB output passes through a register and is shifted by 8, 4, or 0 bits before entering the loadable accumulator, which drives Z[15..0].)
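The accumulator sequence listed in the notes to Figure 6 can be reproduced with a minimal Python sketch. It assumes one partial product per Clock cycle in the order R, S, T, U. The plain `a * b` stands in for the single EAB look-up, and the function name is hypothetical.

```python
def tdm_multiply_8x8(x, y):
    """Yield the accumulator value after each of the four Clock cycles."""
    x_lo, x_hi = x & 0xF, (x >> 4) & 0xF
    y_lo, y_hi = y & 0xF, (y >> 4) & 0xF
    # (x nibble, y nibble, shift in bits) for R, S, T, U in that order.
    schedule = [
        (x_lo, y_lo, 0),
        (x_lo, y_hi, 4),
        (x_hi, y_lo, 4),
        (x_hi, y_hi, 8),
    ]
    acc = 0
    for a, b, shift in schedule:
        partial = a * b          # stand-in for the single EAB's 4 x 4 look-up
        acc += partial << shift  # apply the 16^n weight, then accumulate
        yield acc
```

For example, `list(tdm_multiply_8x8(0x12, 0x37))` gives `[14, 110, 222, 990]`. The last entry is the final product, and the earlier entries correspond to notes (1) through (3).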
You can also increase throughput in the time-domain-multiplexed multiplier design method by implementing the multiplier in two or more EABs. Then, the multiplier computes multiple partial products simultaneously, which reduces the number of Clock cycles.

The time-domain-multiplexed multiplier implementation is well-suited for very large multiplications, such as 32 × 32, because it conserves EABs and logic cells. In contrast, large multiplications would consume a prohibitive number of logic cells or EABs if computed in parallel.

Design Speed

The parallel multiplier generates all of the partial products and sums them within a single Clock cycle. In addition, data is loaded on every Clock cycle, giving the parallel multiplier high throughput and fast calculation times.

Designers can pipeline the parallel multiplier for faster Clock speeds. Pipelining requires multiple Clock cycles, and therefore more latency, to generate a single product. However, it decreases the Clock period while still allowing new data to be loaded on every Clock cycle. The faster Clock speeds generated by pipelining allow for the highest throughput for consecutive operations because pipelining can generate a new product on every Clock cycle. See Figure 8.

Figure 8. Simulation Waveforms for Non-Pipelined & Pipelined Parallel Multipliers
(Waveforms: Clock, Data, and Product for data items 1, 2, and 3, shown for the non-pipelined parallel multiplier and for the pipelined parallel multiplier.)
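The latency and throughput trade-off for the parallel multiplier can be expressed as a simple cycle count. The sketch below assumes that a new operand pair is loaded on every Clock cycle and that a non-pipelined design counts as one stage. The function name `cycles_for` is hypothetical, and the model does not cover the time-domain-multiplexed design.

```python
def cycles_for(num_products, pipeline_stages=1):
    """Clock cycles until the last of `num_products` results is available.

    Assumes one new operand pair enters the multiplier on every Clock cycle.
    """
    if num_products < 1 or pipeline_stages < 1:
        raise ValueError("both arguments must be at least 1")
    # The first product appears after `pipeline_stages` cycles; each
    # further product follows one cycle later.
    return pipeline_stages + num_products - 1
```

With these assumptions, a 3-stage pipeline needs `cycles_for(1, 3)` = 3 cycles for one product and `cycles_for(2, 3)` = 4 cycles for two. Each of those cycles is shorter than a non-pipelined Clock period.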
The typical time-domain-multiplexed multiplier uses a single EAB to compute all partial products on different Clock cycles. Therefore, multiplication requires the same number of Clock cycles as partial products. In the 8 × 8 bit multiplication example shown in Figure 7, the multiplication requires 4 Clock cycles. When consecutive multiplications are required, the first multiplication must be completed before the second multiplication can begin.

Designers can pipeline the time-domain-multiplexed multiplier for faster Clock speeds. Pipelining creates faster Clock speeds by reducing the Clock period and generating higher throughput.

Table 1 summarizes the performance of parallel and time-domain-multiplexed multipliers.

Table 1. Circuit Performance (Clock Cycles for an 8 × 8 Multiplier)

Design                                                     One Multiplication   Two Multiplications
Parallel Multiplier                                        1                    2
Parallel Multiplier with 3-Stage Pipeline                  3                    4
Time-Domain-Multiplexed Multiplier                         4                    9
Time-Domain-Multiplexed Multiplier with 2-Stage Pipeline   5                    10

Device Utilization

The 8 × 8 parallel multiplier design uses 4 EABs plus 21 additional LEs required for the 12-bit and 8-bit adders. A 3-stage pipeline requires 20 additional registers to store data. A parallel multiplier with 3-stage pipelining will not require any additional LEs when the registers are implemented in the EAB.

In contrast, the time-domain-multiplexed multiplier uses only one EAB. The multiplier uses logic, rather than EABs, to select which bits are used for multiplication. A time-domain-multiplexed multiplier with 2-stage pipelining does not require any additional LEs.
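The way resource use scales with operand width follows from the number of partial products. The sketch below assumes that every partial product comes from a 4 × 4 slice. The function name `partial_products` and its `slice_bits` parameter are hypothetical.

```python
import math


def partial_products(x_bits, y_bits, slice_bits=4):
    """Count the partial products when both operands are cut into slices."""
    if min(x_bits, y_bits, slice_bits) < 1:
        raise ValueError("all widths must be at least 1")
    return math.ceil(x_bits / slice_bits) * math.ceil(y_bits / slice_bits)
```

Under this assumption, `partial_products(8, 8)` is 4, which matches the four EABs of the parallel design and the four Clock cycles of the single-EAB time-domain-multiplexed design. `partial_products(32, 32)` is 64, which suggests why a fully parallel implementation becomes costly for wide operands.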
Table 2 summarizes the number of EABs and LEs required for each type of multiplier.

Table 2. Device Utilization for an 8 × 8 Multiplier

Design                                                     EABs Required   LEs Required
Parallel Multiplier                                        4               2
Parallel Multiplier with 3-Stage Pipeline                  4               5
Time-Domain-Multiplexed Multiplier                         1               65
Time-Domain-Multiplexed Multiplier with 2-Stage Pipeline   1               65

Conclusion

Large multipliers can be implemented in FLEX 10K devices with either a parallel multiplier or time-domain-multiplexed multiplier design method. The parallel multiplier offers the fastest Clock speeds but requires more space and device resources. The time-domain-multiplexed multiplier conserves space and device resources but offers slower Clock speeds. Both design methods can be pipelined for faster Clock speeds.
Copyright 1995, 1996 Altera Corporation, 2610 Orchard Parkway, San Jose, California 95134, USA, all rights reserved. By accessing any information on this CD-ROM, you agree to be bound by the terms of Altera's Legal Notice.