Speedup of Self-Timed Digital Systems Using Early Completion


Scott C. Smith
University of Missouri-Rolla, Department of Electrical and Computer Engineering
3 Emerson Electric Co. Hall, 87 Miner Circle, Rolla, MO 659
Phone: (573) 3-3, Fax: (573) 3-53, smithsco@umr.edu

Abstract

An Early Completion technique is developed to significantly increase the throughput of NULL Convention self-timed digital systems without impacting latency or compromising their self-timed nature. Early Completion performs the completion detection for registration stage i at the input of the register, instead of at the output of the register, as in standard NULL Convention Logic. This method requires that the single-rail completion signal from registration stage i+1, Ko_{i+1}, be used as an additional input to the completion detection circuitry for registration stage i, to maintain self-timed operation. However, Early Completion does necessitate an assumption of equipotential regions, introducing a few easily satisfiable timing assumptions, thus making the design potentially more delay-sensitive. To illustrate the technique, Early Completion is applied to a case study of the optimally pipelined 4-bit by 4-bit unsigned multiplier utilizing full-word completion, presented in [1], where a speedup of . is achieved while self-timed operation is maintained and latency remains unchanged.

1. Introduction

In this paper a new completion strategy for NULL Convention Logic (NCL) [2] is presented, which increases the throughput of NCL systems without degrading latency or compromising their self-timed operation. The increased performance is due to a reduction of the impact of the handshaking overhead. The technique is based on anticipation, similar to that of a carry lookahead adder, in which future events are predicted, thus allowing overlapped computation time. Most multi-rail delay-insensitive logic paradigms [3, 4, 5, 6, and 7] consist of combinational logic, latches or registration, and completion detection.
As the standard, completion detection is performed at the output of registration stage i and the completion signal is fed back as an input to registration stage i-1. Assuming that each combinational logic block has the same delay and that each completion detection unit has the same delay, the basic cycle is as follows: NULL flows through combinational logic block i (N_i), NULL flows through completion detection unit i (RFD_i), DATA flows through combinational logic block i (D_i), DATA flows through completion detection unit i (RFN_i). These events are mutually exclusive, such that no two events overlap in time. However, with the application of Early Completion, N_i and RFD_i partially overlap in time and D_i and RFN_i partially overlap in time, thus decreasing the overall cycle time and increasing throughput.

The Early Completion technique presented herein is similar to the Early Done technique presented in the Previous Work section. Both increase throughput by moving the completion detection circuitry forward in the pipeline such that some sort of prediction can be performed, allowing the overlapping of previously mutually exclusive events. The contribution of this paper is the application of the concept of early completion to multi-rail delay-insensitive paradigms, specifically NCL, where the solutions to the specific obstacles of this task are presented, the impact on self-timed operation is analyzed, and a test circuit is simulated to give an analytical measure of the technique's effectiveness.

I would like to thank the University of Missouri Research Board for their funding that has made this work possible.

2. Previous Work

In [8] an Early Done technique was developed to speed up Williams' PS0 pipeline [9], resulting in a 79% increase in throughput when applied to a -bit FIFO. Williams' PS0 pipeline is based on precharge logic and consists of stages composed of a dual-rail function block and a completion detector.
The completion component detects the completion of every functional evaluation and precharge at the output of its corresponding function block. The output of the completion detector for stage i is connected to the precharge/evaluate control input of stage i-1, as shown in Figure 1. The basic cycle for a PS0 pipeline is: T_PS0 = (3 t_Eval) + (2 t_CD) + t_Prech, assuming that each stage has the same functional evaluation delay (t_Eval), precharge delay (t_Prech), and completion delay (t_CD) [8].

Figure 1. Block diagram of a PS0 pipeline [8]

The Early Done technique modifies the PS0 pipeline by moving the completion detectors in front of their corresponding functional blocks rather than after them. This allows the current stage to signal the previous stage when it is about to evaluate or precharge, instead of after the action has been completed, thus allowing the completion detection signal to be produced in parallel with the precharge or evaluation of its corresponding functional block, instead of after it. This new LP2/2 pipeline requires the completion detectors to be modified such that they require an additional input, the stage's PC control input, as shown in Figure 2. The basic cycle for an LP2/2 pipeline is: T_LP2/2 = (2 t_Eval) + (2 t_CD), which is t_Eval + t_Prech shorter than that of the PS0 pipeline [8]. Furthermore, this throughput optimization does not impact latency, since the forward path is unchanged.

Figure 2. Block diagram of an LP2/2 pipeline [8]

3. Overview of NCL

NCL offers a delay-insensitive logic paradigm where control is inherent with each datum. It follows the so-called weak conditions of Seitz's delay-insensitive signaling scheme [3]. As with other delay-insensitive logic methods, the NCL paradigm assumes that forks in wires are isochronic [10, 11].

3.1 Delay-Insensitivity

NCL uses symbolic completeness of expression [2] to achieve delay-insensitive behavior. A symbolically complete expression is defined as an expression that only depends on the relationships of the symbols present in the expression without a reference to the time of evaluation. In particular, dual-rail signals with three logic states (NULL, DATA0, and DATA1) can be used to rid NCL of the implicit time reference of Boolean circuits and achieve symbolic completeness of expression. A dual-rail signal named Z has two rails denoted Z0 and Z1.
The DATA0 state of NCL (Z0 = 1, Z1 = 0) corresponds to a Boolean logic 0, the DATA1 state of NCL (Z0 = 0, Z1 = 1) corresponds to a Boolean logic 1, and the NULL state of NCL (Z0 = 0, Z1 = 0) corresponds to the empty set, meaning that the result is not yet available. The two rails of a dual-rail NCL signal are mutually exclusive, so both rails can never be asserted simultaneously; this state is defined as an illegal state. All NCL systems have at least two register stages, one at both the input and output. These two register stages interact through their request and acknowledge lines, Ki and Ko, respectively, to prevent DATA wavefront i from overwriting DATA wavefront i-1 by ensuring that the two DATA wavefronts are always separated by a NULL wavefront.

3.2 Logic Gates

NCL uses threshold gates for its basic logic gates. The primary type of threshold gate is the THmn gate, where 1 ≤ m ≤ n, as depicted in Figure 3. THmn gates have n inputs. At least m of the n inputs must be asserted before the output will become asserted. Because NCL threshold gates are designed with hysteresis, all asserted inputs must be de-asserted before the output will be de-asserted. Hysteresis ensures a complete transition of inputs back to NULL before asserting the output associated with the next wavefront of input data. In a THmn gate, each of the n inputs is connected to the rounded portion of the gate; the output emanates from the pointed end of the gate; and the gate's threshold value, m, is written inside of the gate. A THnn gate is equivalent to an n-input C-element [12], while a TH1n gate is equivalent to an OR gate.

Figure 3. THmn threshold gate

By employing threshold gates for each logic rail, NCL is able to determine the output status without referencing time. Inputs are partitioned into two separate wavefronts, the NULL wavefront and the DATA wavefront. The NULL wavefront consists of all inputs to a circuit being NULL, while the DATA wavefront refers to all inputs being DATA, some combination of DATA0 and DATA1. Initially all circuit elements are reset to the NULL state.
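The set-and-hold behavior of a THmn gate described above can be illustrated with a small behavioral model. This is a minimal sketch, not a circuit-level description: the class name `ThresholdGate` and its `update` method are hypothetical names introduced here for illustration, and the model ignores gate delay entirely.

```python
from typing import Sequence


class ThresholdGate:
    """Behavioral sketch of a THmn gate with hysteresis (illustrative only)."""

    def __init__(self, m: int, n: int) -> None:
        if not 1 <= m <= n:
            raise ValueError("threshold must satisfy 1 <= m <= n")
        self.m = m
        self.n = n
        self.output = 0  # assume the gate starts de-asserted (NULL)

    def update(self, inputs: Sequence[int]) -> int:
        """Apply a new set of input values and return the resulting output."""
        if len(inputs) != self.n:
            raise ValueError(f"expected {self.n} inputs, got {len(inputs)}")
        asserted = sum(1 for value in inputs if value)
        if asserted >= self.m:
            # At least m inputs asserted: the output becomes asserted.
            self.output = 1
        elif asserted == 0:
            # All inputs de-asserted: the output becomes de-asserted.
            self.output = 0
        # Otherwise (hysteresis): hold the previous output value.
        return self.output


if __name__ == "__main__":
    th23 = ThresholdGate(2, 3)
    for pattern in ([1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0]):
        print(pattern, "->", th23.update(pattern))
```

With m = n the model behaves like a C-element, and with m = 1 it reduces to an OR gate, matching the two special cases noted in the text.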
First, a DATA wavefront is presented to the circuit. Once all of the outputs of the circuit transition to DATA, the NULL wavefront is presented to the circuit. Once all of the outputs of the circuit transition to NULL, the next DATA wavefront is presented to the circuit. This DATA/NULL cycle continues repeatedly. As soon as all outputs of the circuit are DATA, the circuit's result is valid. The NULL wavefront then transitions all of these DATA outputs back to NULL. When they transition back to DATA again, the next output is available. This period is referred to as the DATA-to-DATA cycle time, denoted as T_DD, and has an analogous role to the clock period in a synchronous system.

3.3 Completeness of Input

The completeness of input criterion [2], which NCL combinational circuits must maintain in order to be delay-insensitive, requires that:
1. all the outputs of a combinational circuit may not transition from NULL to DATA until all inputs have transitioned from NULL to DATA, and
2. all the outputs of a combinational circuit may not transition from DATA to NULL until all inputs have transitioned from DATA to NULL.
In circuits with multiple outputs, it is acceptable for some of the outputs to transition without having a complete input set present, as long as all outputs cannot transition before all inputs arrive.

3.4 Observability

There is one more condition that must be met in order for NCL to retain delay-insensitivity. No orphans may propagate through a gate. An orphan is defined as a wire that transitions during the current DATA wavefront, but is not used in the determination of the output. Orphans are caused by wire forks and can be neglected through the isochronic fork assumption, as long as they are not allowed to cross a gate boundary. This observability condition ensures that every gate transition is observable at the output, which means that every gate that transitions is necessary to transition at least one of the outputs.

4. Early Completion Technique

The standard NCL pipeline is shown in Figure 4, where each registration stage consists of multiple single-bit registers, shown in Figure 5, and the gate-level structure of the completion components is shown in Figure 6. The Combinational Circuit is an input-complete, fully observable functional block, designed using Threshold Combinational Reduction, as described in [13]. TD denotes the time when any DATA bits are propagating through the combinational circuit, TN denotes the time when any NULL bits are propagating through the combinational circuit, TRFD is the time associated with the request for DATA generation, and TRFN is the time associated with the request for NULL generation.

As described in [1], the worst-case throughput for an N-stage NCL pipeline is based on the following three equations: TRFD_{i-1} + TD_{i-1} + TD_i + TRFN_i, TRFN_{i-1} + TN_{i-1} + TN_i + TRFD_i, and TRFD_i + TD_i + TRFN_i + TN_i, corresponding to the case of adjacent DATA propagation and request times, the case of adjacent NULL propagation and request times, and the case of NULL and DATA propagation and request times for a single registration stage, respectively. The worst-case cycle time for the entire pipeline is then calculated by finding the maximum of these three equations for every adjacent stage pair in the pipeline, as listed in the following algorithm:

    max_cycle_time = TRFD_1 + TD_1 + TRFN_1 + TN_1
    for (i = 2 to N) loop
        temp_cycle_time = MAX(TRFD_i + TD_i + TRFN_i + TN_i,
                              TRFD_{i-1} + TD_{i-1} + TD_i + TRFN_i,
                              TRFN_{i-1} + TN_{i-1} + TN_i + TRFD_i)
        if (temp_cycle_time > max_cycle_time) then
            max_cycle_time = temp_cycle_time
        end if
    end loop
    worst_case_throughput = 1 / max_cycle_time

Algorithm 1. Calculation of worst-case throughput for an N-stage NCL pipeline [1]

Figure 4. Standard NCL pipeline

Figure 5. Single-bit dual-rail NCL register [1]

Notice that in the standard NCL pipeline the completion detection is performed at the output of the registration stage.
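The worst-case throughput calculation of Algorithm 1 can be sketched in Python as follows. This is an illustrative transcription under stated assumptions: the `StageTiming` container and the function names are hypothetical, stages are held in a zero-indexed list (so list element 0 is stage 1 of the text), and any delay values passed in are the caller's own, not measurements from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StageTiming:
    """Per-stage delays, named after the symbols used in the text."""
    td: float    # TD: DATA propagation through the combinational circuit
    tn: float    # TN: NULL propagation through the combinational circuit
    trfd: float  # TRFD: time to generate the request for DATA
    trfn: float  # TRFN: time to generate the request for NULL


def worst_case_cycle_time(stages: Sequence[StageTiming]) -> float:
    """Return the maximum cycle time over all adjacent stage pairs."""
    if not stages:
        raise ValueError("pipeline must contain at least one stage")
    first = stages[0]
    max_cycle = first.trfd + first.td + first.trfn + first.tn
    # Walk over adjacent pairs (stage i-1, stage i), as in the loop i = 2..N.
    for prev, cur in zip(stages, stages[1:]):
        candidate = max(
            cur.trfd + cur.td + cur.trfn + cur.tn,     # single-stage case
            prev.trfd + prev.td + cur.td + cur.trfn,   # adjacent DATA case
            prev.trfn + prev.tn + cur.tn + cur.trfd,   # adjacent NULL case
        )
        max_cycle = max(max_cycle, candidate)
    return max_cycle


def worst_case_throughput(stages: Sequence[StageTiming]) -> float:
    """Worst-case throughput is the reciprocal of the worst-case cycle time."""
    return 1.0 / worst_case_cycle_time(stages)


if __name__ == "__main__":
    # Hypothetical delays chosen only to exercise the function.
    pipeline = [
        StageTiming(td=4.0, tn=4.0, trfd=3.0, trfn=3.0),
        StageTiming(td=5.0, tn=5.0, trfd=3.0, trfn=3.0),
    ]
    print(worst_case_cycle_time(pipeline))
    print(worst_case_throughput(pipeline))
```

Because every one of the three candidate sums contains a request time added to a propagation time of the same stage, shrinking those pairwise sums can only lower the maximum returned by `worst_case_cycle_time`.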
The inputs to completion component i are the outputs from registration stage i. On the other hand, an NCL pipeline utilizing Early Completion, shown in Figure 7, uses the inputs to registration stage i as the inputs to completion component i; however, this necessitates the completion component for Early Completion to require an additional input, the completion signal from stage i+1, Ko_{i+1}, in order to maintain self-timed operation. The registration stage is also slightly modified, by removing the inverting TH gate for each single-bit register, since its output is no longer required.

Figure 6. N-bit completion component

Figure 7. NCL pipeline with Early Completion

Early Completion allows the completion evaluation for stage i to begin before all bits have propagated through combinational circuit i and have been latched by registration stage i. Therefore, TRFN_E_i overlaps with TD_E_i and TRFD_E_i overlaps with TN_E_i such that TRFN_E_i + TD_E_i < TRFN_i + TD_i and TRFD_E_i + TN_E_i < TRFD_i + TN_i. However, TRFD_{i-1} + TD_{i-1} and TRFN_{i-1} + TN_{i-1} still do not overlap, such that TRFD_E_{i-1} + TD_E_{i-1} ≈ TRFD_{i-1} + TD_{i-1} and TRFN_E_{i-1} + TN_E_{i-1} ≈ TRFN_{i-1} + TN_{i-1}.

By examining the equations for determining the worst-case throughput for an N-stage NCL pipeline, TRFD_i + TD_i + TRFN_i + TN_i, TRFD_{i-1} + TD_{i-1} + TD_i + TRFN_i, and TRFN_{i-1} + TN_{i-1} + TN_i + TRFD_i, it can be seen that each equation contains at least one of the previously noted sums: TRFN_i + TD_i and TRFD_i + TN_i. Therefore, the cycle time for an NCL pipeline using Early Completion must be less than one using standard completion, since TRFD_E_{i-1} + TD_E_{i-1} ≈ TRFD_{i-1} + TD_{i-1}, TRFN_E_{i-1} + TN_E_{i-1} ≈ TRFN_{i-1} + TN_{i-1}, TRFN_E_i + TD_E_i < TRFN_i + TD_i, and TRFD_E_i + TN_E_i < TRFD_i + TN_i, as previously shown. Furthermore, Early Completion does not impact latency, since the forward path is unchanged.

The completion component for stages 1 through M-1 for an M-stage NCL pipeline utilizing Early Completion is shown in Figure 8, where the THcomp gate has the following functionality: (A + B)(C + D) [13].
This completion component is for a datapath of N bits, where N is an odd number. For a datapath with an even number of bits, the TH gate would not be required, so there would only be N/2 intermediate signals. Also, note that the final gate, the inverting TH gate, can be incorporated into the tree structure of TH gates, depending on the width of the datapath, to reduce the number of logic levels by one. This component requests DATA when all inputs to register i are NULL and the request from register i+1, Ko_{i+1}, is requesting NULL (rfn). It requests NULL when all inputs to register i are DATA and Ko_{i+1} is requesting DATA (rfd).

The completion component for the final stage, stage M, is slightly different. The inverting TH gate is no longer inverted; instead the final gate of the tree structure of TH gates is inverted. This causes the component to request DATA when the input to register M is NULL and the external request input line, Ki, is rfd; and to request NULL when the input to register M is DATA and Ki is rfn. This variation in the completion component for the last stage is required since Ki changes to rfn as soon as the output is DATA, and changes to rfd as soon as the output is NULL, to simulate an infinitely fast external interface. An alternative is to use the standard completion component, shown in Figure 6, for the last stage. However, this latter approach produces a system with reduced throughput compared to that when the modified Early Completion component is used for stage M.

Figure 8. Completion component for Early Completion
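The request behavior described above for the completion component of stages 1 through M-1 can be sketched as a small behavioral model. This is a sketch under stated assumptions rather than the gate-level structure of Figure 8: the names `Request`, `EarlyCompletionComponent`, and `update` are hypothetical, dual-rail bits are modeled as (rail 0, rail 1) tuples, the component is assumed to reset to rfd (as for a register reset to NULL), and delay is ignored.

```python
from enum import Enum
from typing import Sequence, Tuple

DualRail = Tuple[int, int]  # (rail 0, rail 1)

NULL: DualRail = (0, 0)
DATA0: DualRail = (1, 0)
DATA1: DualRail = (0, 1)


class Request(Enum):
    RFD = "rfd"  # request for DATA
    RFN = "rfn"  # request for NULL


def all_null(bits: Sequence[DualRail]) -> bool:
    return all(bit == NULL for bit in bits)


def all_data(bits: Sequence[DualRail]) -> bool:
    return all(bit in (DATA0, DATA1) for bit in bits)


class EarlyCompletionComponent:
    """Behavioral sketch of completion component i, for stages 1..M-1."""

    def __init__(self, reset_value: Request = Request.RFD) -> None:
        self.ko = reset_value  # Ko_i, the request sent back to stage i-1

    def update(self, register_inputs: Sequence[DualRail],
               ko_next: Request) -> Request:
        """Evaluate Ko_i from the inputs of register i and Ko_{i+1}."""
        if not register_inputs:
            raise ValueError("register must have at least one bit")
        if all_null(register_inputs) and ko_next is Request.RFN:
            # Inputs are all NULL and stage i+1 requests NULL: request DATA.
            self.ko = Request.RFD
        elif all_data(register_inputs) and ko_next is Request.RFD:
            # Inputs are all DATA and stage i+1 requests DATA: request NULL.
            self.ko = Request.RFN
        # In every other case the previous request is held (hysteresis).
        return self.ko


if __name__ == "__main__":
    comp = EarlyCompletionComponent()
    print(comp.update([DATA0, DATA1], Request.RFD))
    print(comp.update([NULL, NULL], Request.RFN))
```

Including `ko_next` in both conditions is what distinguishes this model from standard completion, which would look only at the register outputs.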

5. Evaluation of Self-Timed Operation

Standard NCL systems are self-timed, assuming that wire forks are isochronic. However, the application of Early Completion changes the fundamental structure of the NCL handshaking system, thus necessitating the self-timed issue to be revisited. In the most delay-sensitive case, Ko_{i+1} and Ko_{i+2} are both rfd and all bits at the input of register i change to DATA within a very short period of time. The DATA wavefront at the input of register i would flow through register i, followed by combinational logic block i+1, and finally completion component i+1, in order to transition Ko_{i+1} to rfn. Simultaneously, the DATA wavefront at the input of register i would flow through completion component i in order to transition Ko_i to rfn. Therefore, in order for the system to function incorrectly, the DATA wavefront would have to travel through a set of TH gates (register i), combinational logic block i+1, and completion component i+1, before the same signal traveled through only completion component i. These paths are shown in boldface in Figure 9. Since the first path is normally much longer, the delay is well known and the system remains self-timed, through the assumption of equipotential regions [3]. This same argument can be made for the NULL wavefront, by replacing DATA with NULL, rfn with rfd, and rfd with rfn, yielding the same result. For the special case of a FIFO, the combinational logic delay would be zero, but the delay through completion component i and completion component i+1 would be identical, so the above argument would still hold. For the generalized case, completion component i and completion component i+1 normally have about the same delay, within one or two gate delays, such that the above analysis holds true.

The other delay-sensitive scenario introduced by Early Completion is when Ko_{i+1} changes to rfd when all inputs to register i are already DATA and all inputs to register i-1 are NULL.
In this case the rfd has to pass through one gate (at the least an inverting TH gate) in order to transition Ko_i to rfn. Once Ko_i is rfn, the NULL wavefront at the input of register i-1 can flow through the register's TH gates and overwrite the previous DATA wavefront at the input of register i. Simultaneously, the DATA wavefront at the input of register i has to pass through only one TH gate to be latched at the output of register i. Therefore, in order for the system to function incorrectly, a signal would have to travel through both an inverting and a non-inverting TH gate before the same signal travels through only a single TH gate. Since the path through the two gates is obviously longer than the path through a single gate, the delays are well known and the system remains self-timed, through the assumption of equipotential regions [3]. This same argument can be made for the NULL wavefront by replacing DATA with NULL, rfn with rfd, and rfd with rfn, yielding the same result. Note that this example assumes that there is no combinational logic delay, as would be the case in a FIFO. For the generalized case the delay-sensitivity would be even less, since the path through an inverting TH gate, a TH gate, and combinational logic would have to be faster than the path through a single TH gate, as depicted in boldface in Figure 10, in order to adversely affect self-timed operation.

Figure 9. Delay-sensitive scenario #1

Figure 10. Delay-sensitive scenario #2

6. Initialization

In standard NCL, the system is initialized using a global reset to set the output of each register to either DATA or NULL. Since each completion component only uses the outputs of its corresponding register as inputs, its output will become initialized in constant time after the global reset is applied, without requiring any reset circuitry itself. However, if this same initialization procedure were used with Early Completion, the reset time would be O(N), where N is the number of stages in the pipeline, since the request signal would have to trickle through each completion component in order to initialize, because the early completion component for stage i not only uses the inputs of register i as its inputs, but also requires the request output from stage i+1 as an input. To remedy this situation, the reset signal is also applied to the final gate of each early completion component such that completion component i is reset to rfd when register i is reset to NULL, or completion component i is reset to rfn when register i is reset to DATA. This revised initialization strategy retains the constant time initialization of standard NCL, and is actually faster than standard NCL initialization. However, more reset circuitry is required, which could also be applied to standard NCL to attain the same reduced initialization time.

7. Results

This paper does not use a FIFO as an example system, since bit-wise completion [1] would obviously outperform any other completion strategy for an NCL FIFO. Instead, the optimally pipelined 4-bit by 4-bit unsigned multiplier utilizing full-word completion, presented in [1], was chosen as the case study. A functional block diagram of the multiplier is shown in Figure 11, where I denotes an incomplete AND function [1], C denotes a complete AND function [1], HA denotes a half-adder [1], FA denotes a full-adder [1], COMP denotes a completion component, as shown in Figure 6, and GEN_S7 denotes a specialized component to produce the most significant bit of the result [1].
To assess the effectiveness of the Early Completion technique, both the multiplier utilizing standard completion and the multiplier utilizing Early Completion were simulated using Mentor Graphics, a commercial design tool. The Mentor Graphics technology library is based on Spice simulations of static 0.5 µm CMOS gates, operating at 3.3 V. The two systems were exhaustively tested and their average throughput calculated. The throughput for the multiplier using the standard completion technique was determined to be .9 ns^-1 [1], while the application of Early Completion produced a throughput of .56 ns^-1, resulting in a speedup of .

8. Conclusions

The technique of Early Completion, which moves the completion detection for registration stage i from the output of the register to its input, can significantly increase the throughput of self-timed systems without increasing latency. An NCL 4-bit by 4-bit multiplier case study indicates a speedup of . over the design utilizing standard completion. Furthermore, the technique could be applied to other self-timed paradigms [3, 4, 5, 6, and 7] as well, since they use the same handshaking scheme and differ only in their combinational logic. However, since these other self-timed paradigms do not support the multitude of gates supported by NCL, each THcomp gate of the early completion component in Figure 8 would have to be replaced by two TH gates, causing there to be N intermediate signals instead of only N/2 as for NCL, thus necessitating more gates and logic levels, and therefore reducing throughput for the other self-timed paradigms.

References

[1] S. C. Smith, R. F. DeMara, J. S. Yuan, M. Hagedorn, and D. Ferguson, "Delay-Insensitive Gate-Level Pipelining," Integration, The VLSI Journal, Vol. 3/, pp. 3-3, .
[2] Karl M. Fant and Scott A. Brandt, "NULL Convention Logic: A Complete and Consistent Logic for Asynchronous Digital Circuit Synthesis," International Conference on Application Specific Systems, Architectures, and Processors, pp. 6-73, 996.
[3] C. L. Seitz, "System Timing," in Introduction to VLSI Systems, Addison-Wesley, pp. 8-6, 98.
[4] N. P. Singh, "A Design Methodology for Self-Timed Systems," Master's Thesis, MIT/LCS/TR-58, Laboratory for Computer Science, MIT, 98.
[5] T. S. Anantharaman, "A Delay Insensitive Regular Expression Recognizer," IEEE VLSI Technology Bulletin, Sept.
[6] Ilana David, Ran Ginosar, and Michael Yoeli, "An Efficient Implementation of Boolean Functions as Self-Timed Circuits," IEEE Transactions on Computers, Vol. , No. , pp. -, 99.
[7] J. Sparso, J. Staunstrup, and M. Dantzer-Sorensen, "Design of Delay Insensitive Circuits using Multi-Ring Structures," Proceedings of the European Design Automation Conference, pp. 5-, 99.
[8] M. Singh and S. M. Nowick, "High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths," Proceedings of the Sixth International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 98-9, .
[9] T. E. Williams, "Self-Timed Rings and Their Application to Division," Ph.D. Thesis, CSL-TR-9-8, Department of Electrical Engineering and Computer Science, Stanford University, 99.

[10] A. J. Martin, "Programming in VLSI," in Development in Concurrency and Communication, Addison-Wesley, pp. -6, 99.
[11] K. Van Berkel, "Beware the Isochronic Fork," Integration, The VLSI Journal, Vol. 3, No. , pp. 3-8, 99.
[12] D. E. Muller, "Asynchronous Logics and Application to Information Processing," in Switching Theory in Space Technology, Stanford University Press, pp. , 963.
[13] S. C. Smith, "Gate and Throughput Optimizations for NULL Convention Self-Timed Digital Circuits," Ph.D. Dissertation, School of Electrical Engineering and Computer Science, University of Central Florida, .

Figure 11. Optimally pipelined multiplier using full-word completion
