Superpipelined Control and Data Path Synthesis


Usha Prabhu and Barry M. Pangrle
Department of Computer Science, The Pennsylvania State University, University Park, PA
29th ACM/IEEE Design Automation Conference

Abstract

This paper describes a superpipelined control and data path synthesis system. The system can 1) handle pipelined modules in the data path, 2) perform functional pipelining of the data path, and 3) schedule the data path using a pipelined controller. Three control styles (serial, parallel, and pipelined) are implemented. The system automatically picks one depending on the data path, the clock frequency, and the functional unit and control path delays. The results show that using a modifiable clock cycle time and a parameterized control style can significantly improve the throughput of high-performance systems.

1 Introduction

Much of the earlier work in high-level synthesis has concentrated on scheduling, allocation, and binding within basic blocks. However, recent efforts have been directed toward global scheduling in an effort to improve throughput under resource constraints. Throughput is usually defined as the number of clock cycles needed to realize the schedule. Two main directions have been taken in global scheduling: transformational techniques and path-based techniques. Transformational techniques such as trace scheduling [7] and percolation scheduling [18] move operations across basic blocks in order to fully utilize the resources available. Path-based scheduling [5] tries to schedule each possible path optimally, and may thus schedule an operation into different control states depending on the path. The parallelism in the data path is thus exploited at the expense of an increase in control path complexity. Control path complexity may also increase with the choice of clock frequency in the synthesis of high-performance systems.

High-throughput designs can sometimes be achieved by choosing the clock cycle time to minimize the total amount of dead time in a control state. Dead time is the time wasted in a clock cycle because every ready operation needs a functional unit that has finished executing but cannot be used again, since it has already been scheduled in the current clock cycle. Decreasing the dead time increases the efficiency of the schedule because less time is wasted. Minimizing dead time usually means decreasing the clock cycle time. As the clock cycle time decreases (and the number of states required to realize the schedule increases), the control circuit once again becomes more complex. Thus efforts to improve throughput may result in the control delay occupying a large percentage of the clock cycle or even, in some cases, being longer than the clock cycle time. To achieve high throughput, high-level synthesis systems must look at ways to execute the data and the control paths in parallel or to pipeline the control path.

We propose that a modifiable clock cycle time and a parameterized control style are necessary when synthesizing high-performance systems. The combination of small cycle times and a high degree of pipelining usually improves system throughput. Machines with these characteristics have been called superpipelined [9]. Ideally, both the clock cycle time and the pipeline depth should be chosen by the synthesis system. Currently, in this system, the clock cycle time and the level of pipelining in the functional units are supplied by the user. The system automatically picks the pipeline depth in the controller so as to maximize performance. Results show that in many of the benchmark examples, superpipelining improves performance.

This paper describes the control styles incorporated in SandS. Figure 1 shows a block diagram of the system. Slicer [14] and Splicer [15] perform the scheduling and module allocation tasks respectively. Piper [11] performs functional pipelining. The system can thus 1) handle pipelined modules in the data path, 2) perform functional pipelining of the data path, and 3) schedule the data path on a pipelined controller.
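The dead-time idea from the introduction can be illustrated with a small calculation. The sketch below is a simplified proxy, not the system's own definition: it assumes each operation occupies a whole number of clock cycles and ignores chaining of operations within a cycle. The function name and the unit delays are illustrative assumptions.

```python
def dead_time(delay_ns: int, cycle_ns: int) -> int:
    """Idle time left after one operation finishes, assuming the
    operation occupies ceil(delay / cycle) whole clock cycles.
    This is a simplification that ignores operation chaining."""
    cycles = -(-delay_ns // cycle_ns)  # integer ceiling division
    return cycles * cycle_ns - delay_ns


# Illustrative register-to-register delays in ns (assumed values).
unit_delays = {"multiplier": 60, "adder": 30}

for cycle in (70, 30, 10):
    wasted = {name: dead_time(d, cycle) for name, d in unit_delays.items()}
    print(f"cycle = {cycle} ns -> dead time per operation: {wasted}")
```

Under these assumptions a shorter cycle leaves less idle time per operation, at the cost of more control states, which is the trade-off the rest of the paper addresses.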

Figure 1: Block diagram of the SandS system

2 Previous Work

Most high-level synthesis systems which schedule both the data and the control path use an FSM/PLA-based controller and assume that the clock cycle time is large enough for both control and data path operations. One of the exceptions is Chippe [2]. Chippe calculates the cycle time and picks a control model that minimizes the area required while satisfying the performance criteria. The control styles used are PLA-based control (with or without a condition register), pipelined control (where the control is executed in parallel with the data path), and partitioned pipeline control (where a 2-stage pipelined controller is executed in parallel with the data path). The main difference in the work described here is that it allows an arbitrary degree of pipelining in the control path. It can thus handle the combination of long control delays and small cycle times.

Mlinar concentrates on controller area in his thesis [12]. He uses a PLA-based controller and explores the changes in PLA area with different data path schedules. He does not take controller delay into consideration.

Most systems do not vary the clock cycle time, but take as input functional unit delays in clock cycle time units. An exception is CYOS [6], which tries to generate a schedule using the best possible cycle time. Thus cycle time is one of the variables explored in the design space. It uses simulated annealing to do the scheduling and cycle time optimization. MAHA [19] also picks a cycle time which allows it to meet its area/time constraints. However, the lower bound on the cycle time it chooses is the delay of the slowest functional unit, i.e., it does not allow multicycle functional units.

Figure 2: Serial Control Style

3 Generating the Control Path

An iterative algorithm is used to generate both the data and the control path.
Assuming a control delay of zero, Slicer and Splicer are used to schedule and allocate operations in the data path. A Moore-machine-based finite state machine is then generated to represent the control path. This is used as input to PEG [8], which generates the control equations. These are then passed to MisII [3], which is used with a standard script and a library of basic gates. MisII generates the control circuit, maps it into the component library, and returns the maximum delay. This delay is fed back to Slicer to generate a new schedule, taking into account the control delay and the control style used. (The control styles are described in the next section.) This process is iterated until a stable solution is found. The fast response times of Slicer and Splicer allow such an iterative method to be used successfully. Section 3.1 describes the control styles defined in the system. Section 3.2 gives an overview of the system with the help of an example.

3.1 Control styles

Three main control styles are used: serial, parallel, and pipelined. They are depicted in Figures 2-4. The style actually implemented depends on the data path, the clock frequency, and the data and control path delays.

Figure 3: Parallel Control Style

Figure 4: Pipelined Control Style

Figure 2 depicts a serial control style, where both the data and the control paths are executed in the same clock cycle. If this control style is chosen, the data path scheduling algorithm must take into account the smaller time available to execute data path operations. Obviously, this style cannot be used if the control delay is greater than one clock cycle. This control style is usually chosen if the data path contains a large number of conditionals and loops, and there is not much scope for across-block transformations. The disadvantage of this style is that a significant portion of a clock cycle can be used in generating the control signals.

This problem is overcome in the control style depicted in Figure 3. Here the control path is executed in parallel with the data path, thus allowing operations in the data path to use the complete clock cycle. The penalty paid for this is that registers must be used to store the control signals (marked control word in Figure 3). Also, a NOOP state will have to be inserted in the data path after every conditional. This state may be deleted if the conditional can be scheduled early enough or if code motion can be used to move data-independent code into these states (this is similar to the delayed branches used in RISC systems [9]). This control style is usually used if the data path does not contain too many conditionals or loops. A parallel control style may also be viewed as a 2-stage control and data path pipeline.

If the delay of the control path is greater than the clock cycle time, it may be necessary to break the control path into stages, where the delay of each stage is less than the clock cycle time. A 2-stage pipe is shown in Figure 4. Here registers are needed not only to store the final control signals, but also to store intermediate values between stages. Again, it may be necessary to introduce multiple NOOP states every time there is a branch in the control flow.
The number of NOOP states required is the number of stages in the control path.

Note that there is no fixed cycle initialization overhead associated with using parallel or pipelined controllers. All the intermediate registers (control word, pipe registers, condition registers, and current state register) can be initialized on reset to values that can be precomputed.

In this system, the serial control style is represented as a Moore machine. Peg and MisII are then used to generate and time the resulting circuit. The parallel control style is represented as a Mealy machine, hence Meg [21] and MisII are used to generate these designs. The pipelined control style is generated from the parallel design using retiming [10]. Retiming is used to insert registers into the network generated by MisII. The minimum number of registers is inserted such that no stage of the resulting circuit has a delay greater than the clock cycle time. This problem is the linear programming dual of a minimum-cost flow problem and can be solved in polynomial time. The linear programming package LINDO [20] is used to solve it.

3.2 System overview

The system is described with the help of an example. The example chosen is from HAL [17], expanded so as to use input and output ports. The available hardware consists of 2 multipliers (each of which takes 60ns to execute), 1 adder, 1 subtractor, 1 comparator (each of which takes 30ns to execute), five input ports, and two output ports (each of which takes 5ns to execute). Note that all the execution times given are register-to-register times. For this example, the cycle time is set to 70ns.

The system initially sets the control delay to zero. Using this, Slicer generates the 6-state schedule shown in Figure 5. Splicer is then used to do the module allocation and connectivity binding. The system generates a VHDL-like description of the finite state machine for the control. The finite state machine description is then translated into Peg format.
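The iterative flow introduced at the start of Section 3, which this example follows, can be sketched as a fixed-point loop. This is a minimal sketch under stated assumptions, not the system's implementation: `schedule` stands in for Slicer and Splicer, `control_delay_of` stands in for controller generation and timing, and both toy models below are hypothetical.

```python
from typing import Callable, Tuple


def iterate_schedule(
    schedule: Callable[[float], int],
    control_delay_of: Callable[[int], float],
    max_iters: int = 20,
) -> Tuple[int, float]:
    """Alternate between scheduling and controller timing until both
    the state count and the control delay stop changing."""
    delay = 0.0                      # first pass assumes zero control delay
    states = schedule(delay)
    for _ in range(max_iters):
        new_delay = control_delay_of(states)   # time the generated controller
        new_states = schedule(new_delay)       # reschedule with that delay
        if new_states == states and new_delay == delay:
            return states, delay               # stable solution found
        states, delay = new_states, new_delay
    raise RuntimeError("no stable schedule within max_iters")


# Hypothetical toy models: 6 states while the control delay fits in
# 10 ns of slack, 7 states otherwise; controller delay fixed at 5 ns.
def toy_schedule(control_delay_ns: float) -> int:
    return 6 if control_delay_ns <= 10 else 7


def toy_control_delay(num_states: int) -> float:
    return 5.0


print(iterate_schedule(toy_schedule, toy_control_delay))
```

A bounded iteration count is used here as a design choice so that a non-converging pair of models fails loudly rather than looping forever.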
The output of Peg is then fed into MisII to map the description into a library of standard gates. The results in this paper were obtained with the library mcnc.genlib. The maximum delay through the network is fed back to the scheduler. The scheduler may then change the schedule based on the new information available. If the control delay is greater than the clock cycle time, retiming [10] is used to introduce registers into the control path until the maximum delay in any stage of this pipelined controller is less than the clock cycle time.

In this example, the maximum delay through the control path is 5ns. Assuming a serial control style, the schedule remains exactly the same. Since there is a dead time of at least 10ns in every clock cycle, the control delay can easily be added to each cycle without modifying the circuit.

With a clock cycle time of 70ns, a parallel control style would have given the schedule shown in Figure 6. The number of states has now increased to 7, since an extra NOOP state has to be inserted after the first comparison. Slicer performs in-block transformations that eliminate NOOP states wherever possible. In this case, the comparison within the loop is scheduled early enough so that other operations can be moved into the state after it. Hence a NOOP state does not have to be inserted within the loop. For this clock cycle time, the serial control style gives better throughput and is accepted by the system. The network description is fed to Artist11 [13], which does the layout.

Figure 5: 6-state schedule for diffeq (with a serial control style).

Figure 6: 7-state schedule for diffeq using 5 input ports and a parallel control style.

With a cycle time of 60ns, the parallel control style is superior to the serial style and will thus be adopted by the system. The schedule with a serial control style takes 8 states (with the body of the loop taking 6 states), whereas that with the parallel control style takes 7 states (with the body of the loop taking 4 states). Using the parallel controller results in fewer states in spite of the fact that a NOOP state has to be added after the comparison. If the loop is repeated only ten times, the system with the parallel control takes 2580ns to execute, whereas that with the serial control takes 3720ns, a savings of over 30%. This saving increases as the number of iterations through the loop increases.

Table 1 shows the results obtained with this example and different clock frequencies. The total time column shows the time taken for ten iterations of the loop. For this example, the best throughput is achieved by setting the clock cycle time to 10ns and the control style to pipelined. Assuming that the loop iterates ten times, the total execution time is now 2170ns, a savings of 1550ns over using the 70ns clock cycle and a serial control style. Thus significant improvements in throughput can be obtained even with very small systems.

Table 1: Differential Equation Solver

4 Experiments

This section describes the results obtained with the FIR filter [18] and the Mark1 processor [4]. All the functional unit execution times mentioned in this section are register-to-register delays. Experiments run with other benchmarks from [1] showed similar increases in throughput.
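The total-time figures quoted in Section 3.2 for the 60ns cycle can be reproduced with a short calculation. The sketch below assumes that the states outside the loop body execute once and the loop body executes once per iteration; the function name and that accounting are assumptions, although they do reproduce the 3720ns and 2580ns figures in the text.

```python
def total_time_ns(total_states: int, loop_states: int,
                  iterations: int, cycle_ns: int) -> int:
    """Execution time of a schedule whose loop body (loop_states of the
    total_states) runs `iterations` times; other states run once."""
    states_run_once = total_states - loop_states
    cycles = states_run_once + loop_states * iterations
    return cycles * cycle_ns


# Schedules from Section 3.2 at a 60 ns cycle, ten loop iterations.
serial = total_time_ns(total_states=8, loop_states=6, iterations=10, cycle_ns=60)
parallel = total_time_ns(total_states=7, loop_states=4, iterations=10, cycle_ns=60)
saving = (serial - parallel) / serial

print(serial, parallel, f"{saving:.1%}")
```

Because the parallel schedule has the shorter loop body, the relative saving grows with the iteration count, which matches the remark in the text.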

4.1 Mark1 processor

This example was translated from the ISPS description of the Manchester University Mark-1 Computer in [4]. The functional units used were one adder, one subtractor, one comparator, and a memory with a read and a write port. Memory operations take 90ns; all other operations take 35ns. Table 2 shows the results obtained with a range of clock frequencies. The numbers in the Number of States column are the number of states in the schedule with the chosen control style, and with the other control style. For example, with a clock cycle time of 100ns, Mark1 can be scheduled in 13 states with a parallel control style and in 21 states with a serial control style, a savings of 800ns. When the clock cycle time is reduced to 15ns, the control delay is greater than the cycle time, and a 2-stage pipelined controller is used.

Table 2: Mark1 Processor

4.2 FIR filter

Experiments were run with the FIR filter to illustrate performance improvement with functional pipelining, pipelined functional units, and a parallel control path. The register-to-register delay of the adders is 30ns and that of the multipliers is 60ns. Table 3 compares nonpipelined and (functionally) pipelined designs using a non-pipelined multiplier. The second figure in the total time column for the pipelined case shows the elapsed time for the first output from the pipeline. The first figure in this column is the rate at which results are available after that. On both counts, substantial savings are obtained with the pipelined version as compared to the nonpipelined version. Table 4 compares nonpipelined and pipelined designs using a pipelined multiplier. The multiplier can accept new inputs in every clock cycle. The total time column here has the same meaning as in the previous case. The best throughput is obtained using a clock cycle time of 30ns, functional pipelining, pipelined multipliers, and a parallel control style.

Table 3: FIR Filter: non-pipelined multiplier (5+, 4*)

Functional pipelining
Control     DII/CT
Serial      300/900
Serial      210/630
Parallel    180/720
Serial      160/440
Parallel    120/330

Table 4: FIR Filter: pipelined multiplier (5+, 4*)

5 Conclusions and Future Work

Increasing throughput in high-performance systems can be achieved by completely utilizing every control step. This can be done either by increasing the hardware available or by changing the clock cycle time and modifying the control style. The results in this paper show that varying the clock cycle time and using pipelined or parallel controllers can lead to significant improvements in the throughput of resource-constrained systems.

Experiments have shown that the right choice of cycle time can decrease the amount of dead time in a control step. Sometimes the improvement can be achieved only by executing the data and the control paths in parallel. Even if the control delay is only a few nanoseconds, decreasing each clock cycle by this amount leads to a substantial improvement across the whole schedule. These effects will be much more pronounced in large systems scheduled over hundreds or thousands of clock cycles.

These methods will also be very effective in control-dominated systems where the control delay is comparable to functional unit delay. Here the necessity for parallel or pipelined control units is much more obvious. The system will be expanded to incorporate facilities for specifying external timing constraints on control so as to make the system more useful in control-dominated designs. The system should also automatically generate the best cycle time.

References

[1] Benchmarks for the 6th International Workshop on High-Level Synthesis.
[2] F. Brewer and D. Gajski. Chippe: A System for Constraint Driven Behavioral Synthesis. IEEE Trans. on CAD, Vol 9, July 1990.
[3] R. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, A. Wang. MIS: A Multiple-Level Logic Optimization System. IEEE Trans. on CAD, Vol 6, No 6, Nov 1987.
[4] M. R. Barbacci and D. P. Siewiorek. Design and Analysis of Instruction Set Processors. McGraw-Hill.
[5] R. Camposano. Path-Based Scheduling for Synthesis. IEEE Trans. on CAD, Vol 10, No 1, January 1991.
[6] J. Cortadella, R. M. Badia, E. Ayguade. Scheduling in a Continuous Area-Time Design Space: A Simulated Annealing Approach. Proc. of the Fifth International Workshop on High-Level Synthesis, March 1991.
[7] J. A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. on Computers, vol. C-30, No. 7, July.
[8] G. Hamachi. Designing Finite State Machines with PEG. UC Berkeley.
[9] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc.
[10] C. Leiserson, J. Saxe. Retiming Synchronous Circuitry. Algorithmica, 6(1), 1991.
[11] D. Lobo and B. M. Pangrle. Optimization Techniques for Pipelined Scheduling. Fifth International Conference on VLSI Design, India, Jan.
[12] M. J. Mlinar. Control Path / Data Path Tradeoffs in VLSI Design. CEng Tech. Report 91-16, University of Southern California.
[13] R. M. Owens and M. J. Irwin. A Comparison of Four 2-Dimensional Gate Matrix Layout Tools. Proc. 26th Design Automation Conference.
[14] B. M. Pangrle and D. D. Gajski. Slicer: A State Synthesizer for Intelligent Compilation. Proc. of ICCD, Oct.
[15] B. M. Pangrle. Splicer: A Heuristic Approach to Connectivity Binding. Proc. 25th Design Automation Conference, June.
[16] P. G. Paulin and J. P. Knight. Scheduling and Binding Algorithms for High-Level Synthesis. Proc. 26th Design Automation Conference, June.
[17] P. G. Paulin, J. P. Knight, E. F. Girczyc. HAL: A multi-paradigm approach to automatic datapath synthesis. Proc. 23rd Design Automation Conference, 1986.
[18] R. Potasman, J. Lis, A. Nicolau, D. Gajski. Percolation Based Synthesis. Proc. 27th Design Automation Conference, 1990.
[19] A. Parker, J. Pizarro, M. Mlinar. MAHA: A Program for Datapath Synthesis. Proc. 23rd Design Automation Conference, July 1986.
[20] L. Schrage. Linear, Integer and Quadratic Programming with LINDO. The Scientific Press.
[21] D. Wood. MEG. UC Berkeley.
