High Performance Carry Skip Adder Implementing Using Verilog-HDL


Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.

IJCSMC, Vol. 5, Issue. 12, December 2016, pg. 26-37

P. Shalini 1, Laxman G 2
1 PG/VLSI, 2 Associate Professor, Department of ECE, Sree Chaitanya College of Engineering, Karimnagar, Telangana

Abstract: In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes a modified parallel structure for increasing the slack time, and hence, enabling further voltage reduction. The proposed structures are assessed by comparing their speed, power, and energy parameters with those of other adders using a 45-nm static CMOS technology for a wide range of supply voltages. The results that are obtained using HSPICE simulations reveal, on average, 44% and 38% improvements in the delay and energy, respectively, compared with those of the Conv-CSKA. In addition, the power-delay product was the lowest among the structures considered in this paper, while its energy-delay product was almost the same as that of the Kogge-Stone parallel prefix adder with considerably smaller area and power consumption.
Simulations on the proposed hybrid variable latency CSKA reveal a reduction in the power consumption compared with the latest works in this field while having a reasonably high speed.

Index Terms: Carry skip adder (CSKA), energy efficient, high performance, hybrid variable latency adders, voltage scaling.

I. INTRODUCTION

Adders are a key building block in arithmetic and logic units (ALUs) [1], and hence increasing their speed and reducing their power/energy consumption strongly affect the speed and power consumption of processors. There are many works on the subject of optimizing the speed and power of these units, which have been reported in [2]-[9]. Obviously, it is highly desirable to achieve higher speeds at low power/energy consumption, which is a challenge for the designers of general purpose processors.

One of the effective techniques to lower the power consumption of digital circuits is to reduce the supply voltage, due to the quadratic dependence of the switching energy on the voltage. Moreover, the subthreshold current, which is the main leakage component in OFF devices, has an exponential dependence on the supply voltage level through the drain-induced barrier lowering effect [10]. Depending on the amount of the supply voltage reduction, the operation of ON devices may reside in the superthreshold, near-threshold, or subthreshold regions. Working in the superthreshold region provides us with lower delay and higher switching and leakage powers compared with the near/subthreshold regions. In the subthreshold region, the logic gate delay and leakage power exhibit exponential dependences on the supply and threshold voltages. Moreover, these voltages are (potentially) subject to process and environmental variations in the nanoscale technologies. The variations increase uncertainties in the aforesaid performance parameters. In addition, the small subthreshold current causes a large delay for the circuits operating in the subthreshold region [10].
Recently, the near-threshold region has been considered as a region that provides a more desirable tradeoff point between delay and power dissipation compared with that of the subthreshold one, because it results in lower delay compared with the subthreshold region and significantly lower switching and leakage powers compared with the superthreshold region. In addition, near-threshold operation, which uses supply voltage levels near the threshold voltage of transistors [11], suffers considerably less from the process and environmental variations compared with the subthreshold region.

2016, IJCSMC All Rights Reserved

The dependence of the power (and performance) on the supply voltage has been the motivation for the design of circuits with the feature of dynamic voltage and frequency scaling. In these circuits, to reduce the energy consumption, the system may change the voltage (and frequency) of the circuit based on the workload requirement [12]. For these systems, the circuit should be able to operate under a wide range of supply voltage levels. Of course, achieving higher speeds at lower supply voltages for the computational blocks, with the adder as one of the main components, could be crucial in the design of high-speed, yet energy efficient, processors.

In addition to the knob of the supply voltage, one may choose between different adder structures/families for optimizing power and speed. There are many adder families with different delays, power consumptions, and area usages. Examples include the ripple carry adder (RCA), carry increment adder (CIA), carry skip adder (CSKA), carry select adder (CSLA), and parallel prefix adders (PPAs). The descriptions of each of these adder architectures along with their characteristics may be found in [1] and [13]. The RCA has the simplest structure with the smallest area and power consumption but with the worst critical path delay. In the CSLA, the speed, power consumption, and area usage are considerably larger than those of the RCA. The PPAs, which are also called carry look-ahead adders, exploit direct parallel prefix structures to generate the carry as fast as possible [14]. There are different types of parallel prefix algorithms that lead to different PPA structures with different performances. As an example, the Kogge-Stone adder (KSA) [15] is one of the fastest structures but results in large power consumption and area usage. It should be noted that the structure complexities of PPAs are higher than those of other adder schemes [13], [16].
The CSKA, which is an efficient adder in terms of power consumption and area usage, was introduced in [17]. The critical path delay of the CSKA is much smaller than the one in the RCA, whereas its area and power consumption are similar to those of the RCA. In addition, the power-delay product (PDP) of the CSKA is smaller than those of the CSLA and PPA structures [19]. Furthermore, due to the small number of transistors, the CSKA benefits from relatively short wiring lengths as well as a regular and simple layout [18]. The comparatively lower speed of this adder structure, however, limits its use for high-speed applications.

In this paper, given the attractive features of the CSKA structure, we have focused on reducing its delay by modifying its implementation based on the static CMOS logic. The concentration on the static CMOS originates from the desire to have a reliably operating circuit under a wide range of supply voltages in highly scaled technologies [10]. The proposed modification increases the speed considerably while maintaining the low area and power consumption features of the CSKA. In addition, an adjustment of the structure, based on the variable latency technique, which in turn lowers the power consumption without considerably impacting the CSKA speed, is also presented. To the best of our knowledge, no work concentrating on the design of CSKAs operating from the superthreshold region down to the near-threshold region, or on the design of (hybrid) variable latency CSKA structures, has been reported in the literature. Hence, the contributions of this paper can be summarized as follows.

1) Proposing a modified CSKA structure by combining the concatenation and the incrementation schemes with the conventional CSKA (Conv-CSKA) structure for enhancing the speed and energy efficiency of the adder. The modification provides us with the ability to use simpler carry skip logics based on the AOI/OAI compound gates instead of the multiplexer.
2) Providing a design strategy for constructing an efficient CSKA structure based on analytical expressions presented for the critical path delay.
3) Investigating the impact of voltage scaling on the efficiency of the proposed CSKA structure (from the nominal supply voltage to the near-threshold voltage).
4) Proposing a hybrid variable latency CSKA structure based on the extension of the suggested CSKA, by replacing some of the middle stages in its structure with a PPA, which is modified in this paper.

The rest of this paper is organized as follows. Section II discusses related work on modifying the CSKA structure for improving the speed as well as prior work that uses variable latency structures for increasing the efficiency of adders at low supply voltages. In Section III, the Conv-CSKA with fixed stage size (FSS) and variable stage size (VSS) is explained, while Section IV describes the proposed static CSKA structure. The hybrid variable latency CSKA structure is suggested in Section V. The results of comparing the characteristics of the proposed structures with those of other adders are discussed in Section VI. Finally, the conclusion is drawn in Section VII.

II. PRIOR WORK

Since the focus of this paper is on the CSKA structure, first the work related to this adder is reviewed and then the variable latency adder structures are discussed.

A. Modifying CSKAs for Improving Speed

The conventional structure of the CSKA consists of stages containing a chain of full adders (FAs) (RCA block) and a 2:1 multiplexer (carry skip logic). The RCA blocks are connected to each other through 2:1 multiplexers, which can be placed into one or more level structures [19]. The CSKA configuration (i.e., the number of FAs per stage) has a great impact on the speed of this type of adder [23]. Many methods have been suggested for finding the optimum number of FAs [18]-[26].
The techniques presented in [19]-[24] make use of VSSs to minimize the delay of adders based on a single-level carry skip logic. In [25], some methods to increase the speed of multilevel CSKAs are proposed. The techniques, however, considerably increase the area and power and lead to a less regular layout. The design of a static CMOS CSKA where the stages of the CSKA have variable sizes was suggested in [18]. In addition, to lower the propagation delay of the adder, carry look-ahead logics were utilized in each stage. Again, it had a complex layout as well as large power consumption and area usage. In addition, the design approach, which was presented only for the 32-bit adder, was not general enough to be applied to structures with different bit lengths.

Alioto and Palumbo [19] propose a simple strategy for the design of a single-level CSKA. The method is based on the VSS technique where the near-optimal numbers of the FAs are determined based on the skip time (delay of the multiplexer) and the ripple time (the time required by a carry to ripple through an FA). The goal of this method is to decrease the critical path delay by


considering a non-integer ratio of the skip time to the ripple time, contrary to most of the previous works, which considered an integer ratio [17], [20]. In all of the works reviewed so far, the focus was on the speed, while the power consumption and area usage of the CSKAs were not considered. Even for the speed, the delay of the skip logics, which are based on multiplexers and form a large part of the adder critical path delay [19], has not been reduced.

Fig. 1. Conventional structure of the CSKA [19].

B. Improving Efficiency of Adders at Low Supply Voltages

To improve the performance of the adder structures at low supply voltage levels, some methods have been proposed in [27]-[36]. In [27]-[29], an adaptive clock stretching operation has been suggested. The method is based on the observation that the critical paths in adder units are rarely activated. Therefore, the slack time between the critical paths and the off-critical paths may be used to reduce the supply voltage. Notice that the voltage reduction must not increase the delays of the noncritical timing paths beyond the clock period, which allows us to keep the original clock frequency at a reduced supply voltage level. When the critical timing paths in the adder are activated, the structure uses two clock cycles to complete the operation. This way the power consumption reduces considerably at the cost of a rather small throughput degradation.

In [27], the efficiency of this method for reducing the power consumption of the RCA structure has been demonstrated. The CSLA structure in [28] was enhanced to use the adaptive clock stretching operation, where the enhanced structure was called cascade CSLA ($C^2$SLA). Compared with the common CSLA structure, $C^2$SLA uses more RCA blocks, and of different sizes. Since the slack time between the critical timing paths and the longest off-critical path was small, the supply voltage scaling, and hence the power reduction, were limited.
Finally, using the hybrid structure to improve the effectiveness of the adaptive clock stretching operation has been investigated in [31] and [33]. In the proposed hybrid structure, the KSA has been used in the middle part of the $C^2$SLA, where this combination leads to an increase in the positive slack time. However, the $C^2$SLA and its hybrid version are not good candidates for low-power ALUs. This statement originates from the fact that, due to the logic duplication in this type of adder, the power consumption and also the PDP are still high even at low supply voltages [33].

III. CONVENTIONAL CARRY SKIP ADDER

The structure of an N-bit Conv-CSKA, which is based on blocks of the RCA (RCA blocks), is shown in Fig. 1. In addition to the chain of FAs in each stage, there is a carry skip logic. For an RCA that contains N cascaded FAs, the worst propagation delay of the summation of two N-bit numbers, A and B, belongs to the case where all the FAs are in the propagation mode. It means that the worst case delay belongs to the case where

$P_i = A_i \oplus B_i = 1$ for $i = 1, \ldots, N$

where $P_i$ is the propagation signal related to $A_i$ and $B_i$. This shows that the delay of the RCA is linearly related to N [1]. In the case where a group of cascaded FAs are in the propagate mode, the carry output of the chain is equal to the carry input. In the CSKA, the carry skip logic detects this situation and makes the carry ready for the next stage without waiting for the operation of the FA chain to be completed. The skip operation is performed using the gates and the multiplexer shown in the figure. Based on this explanation, the N FAs of the CSKA are grouped in Q stages. Each stage contains an RCA block with $M_j$ FAs ($j = 1, \ldots, Q$) and a skip logic. In each stage, the inputs of the multiplexer (skip logic) are the carry input of the stage and the carry output of its RCA block (FA chain).
In addition, the product of the propagation signals (P) of the stage is used as the selector signal of the multiplexer. The CSKA may be implemented using FSS and VSS where the highest speed may be obtained for the VSS structure [19], [22]. Here, the stage size is the same as the RCA block size. In Sections III-A and III-B, these two different implementations of the CSKA adder are described in more detail.
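The stage behaviour described above can be sketched as a small bit-level model. This is an illustrative Python sketch, not the authors' Verilog implementation; the function names `rca_block` and `conv_cska` and the LSB-first bit-list representation are assumptions made for the example.

```python
def rca_block(a_bits, b_bits, cin):
    """Ripple-carry block over LSB-first bit lists; returns (sum_bits, carry_out)."""
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def conv_cska(a, b, stage_sizes, cin=0):
    """Behavioural model of a conventional CSKA (hypothetical helper).

    stage_sizes lists the RCA block sizes M_1..M_Q (LSB stage first).
    Returns (sum, carry_out) for an adder of sum(stage_sizes) bits.
    """
    n = sum(stage_sizes)
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result_bits, carry, pos = [], cin, 0
    for m in stage_sizes:
        sa, sb = a_bits[pos:pos + m], b_bits[pos:pos + m]
        s, c_rca = rca_block(sa, sb, carry)
        # Selector: product of the propagate signals P_i = A_i xor B_i.
        p = all(x ^ y for x, y in zip(sa, sb))
        # 2:1 multiplexer: skip the chain when every FA propagates.
        carry = carry if p else c_rca
        result_bits.extend(s)
        pos += m
    total = sum(bit << i for i, bit in enumerate(result_bits))
    return total, carry
```

The model only captures the logic function; the speed benefit of the skip path comes from timing, which a behavioural model of this kind does not represent.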


A. Fixed Stage Size CSKA

By assuming that each stage of the CSKA contains M FAs, there are $Q = N/M$ stages where, for the sake of simplicity, we assume Q is an integer. The input signals of the jth multiplexer are the carry output of the FA chain in the jth stage, denoted by $C_j^0$, and the carry output of the previous stage (carry input of the jth stage), denoted by $C_j^1$ (Fig. 1). The critical path of the CSKA contains three parts: 1) the path of the FA chain of the first stage, whose delay is equal to $M \times T_{CARRY}$; 2) the path of the intermediate carry skip multiplexers, whose delay is equal to $(Q-1) \times T_{MUX}$; and 3) the path of the FA chain in the last stage, whose delay is equal to $(M-1) \times T_{CARRY} + T_{SUM}$. Note that $T_{CARRY}$, $T_{SUM}$, and $T_{MUX}$ are the propagation delays of the carry output of an FA, the sum output of an FA, and the output of a 2:1 multiplexer, respectively. Hence, the critical path delay of an FSS CSKA is formulated by

$T_D = [M \times T_{CARRY}] + [(N/M - 1) \times T_{MUX}] + [(M-1) \times T_{CARRY} + T_{SUM}]$ (1)

Based on (1), the optimum value of M ($M_{opt}$) that leads to the optimum propagation delay may be calculated as $(0.5 N \alpha)^{1/2}$, where $\alpha$ is equal to $T_{MUX}/T_{CARRY}$. Therefore, the optimum propagation delay ($T_{D,opt}$) is obtained from

$T_{D,opt} = 2\sqrt{2 N T_{CARRY} T_{MUX}} + (T_{SUM} - T_{CARRY} - T_{MUX}) = T_{SUM} + (2\sqrt{2N\alpha} - 1 - \alpha) \times T_{CARRY}$ (2)

Thus, the optimum delay of the FSS CSKA is almost proportional to the square root of the product of N and $\alpha$ [19].

B. Variable Stage Size CSKA

As mentioned before, by assigning variable sizes to the stages, the speed of the CSKA may be improved. The speed improvement in this type is achieved by lowering the delays of the first and third terms in (1). These delays are minimized by lowering the sizes of the first and last RCA blocks. For instance, the first RCA block size may be set to one, whereas the sizes of the following blocks may increase.
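The fixed-stage-size expressions (1) and (2) of Section III-A can be checked numerically with a short sketch. The function names and the delay values used in the usage note are hypothetical (arbitrary time units), chosen only to illustrate the formulas; they are not measurements from the paper.

```python
import math


def fss_delay(n, m, t_carry, t_sum, t_mux):
    """Critical path delay of an FSS CSKA with stage size m, following (1)."""
    return m * t_carry + (n / m - 1) * t_mux + (m - 1) * t_carry + t_sum


def fss_optimum(n, t_carry, t_sum, t_mux):
    """Optimum stage size and delay, following M_opt = sqrt(0.5*N*alpha) and (2)."""
    alpha = t_mux / t_carry
    m_opt = math.sqrt(0.5 * n * alpha)
    td_opt = 2 * math.sqrt(2 * n * t_carry * t_mux) + (t_sum - t_carry - t_mux)
    return m_opt, td_opt
```

For example, with N = 32 and the assumed values T_CARRY = 1.0, T_SUM = 1.2, T_MUX = 1.0, `fss_optimum` gives a stage size of 4, and `fss_delay(32, 4, ...)` agrees with the closed-form optimum, while stage sizes 2 and 8 both give a larger delay.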
To determine the rate of increase, let us express the propagation delay of the $C_j^1$ signal ($t_j^1$) by

$t_j^1 = \max(t_{j-1}^0, t_{j-1}^1) + T_{MUX}$ (3)

where $t_{j-1}^0$ ($t_{j-1}^1$) is the delay of calculating the $C_{j-1}^0$ ($C_{j-1}^1$) signal in the $(j-1)$th stage. In an FSS CSKA, except in the first stage, $t_j^0$ is smaller than $t_j^1$. Hence, based on (3), the delay $t_{j-1}^0$ may be increased from $t_1^0$ to $t_{j-1}^1$ without increasing the delay of the $C_j^1$ signal. This means that one could increase the size of the $(j-1)$th stage (i.e., $M_{j-1}$) without increasing the propagation delay of the CSKA. Therefore, increasing the size $M_j$ of the jth stage should be bounded by

$t_j^0 \le t_j^1 = t_1^0 + (j-1) T_{MUX}$ (4)

Since the last RCA block size also should be minimized, the increase in the stage size may not be continued to the last RCA block. Thus, we justify the decrease in the RCA block sizes toward the last stage. First, note that based on Fig. 1, the output of the jth stage is, in the worst case, accessible after $t_j^1 + T_{SUM,j}$. Assuming that the pth stage has the maximum RCA block size, we wish to keep the delay of the outputs of the following stages equal to the delay of the output of the pth stage. To keep the same worst case delay for the critical path, we should reduce the size of the following RCA blocks. For example, when $i \ge p$, for the $(i+1)$th stage, the output delay is $t_i^1 + T_{MUX} + T_{SUM,i+1}$, where $T_{SUM,i+1}$ is the delay of the $(i+1)$th RCA block for calculating all of its sum outputs when its carry input is ready. Therefore, the size of the $(i+1)$th stage should be reduced to decrease $T_{SUM,i+1}$, preventing the increase in the worst case delay ($T_D$) of the adder. In other words, we eliminate the increase in the delay of the next stage due to the additional multiplexer by reducing the sum delay of the RCA block. This may be analytically expressed as

$T_{SUM,i+1} \le T_{SUM,i} - T_{MUX}$; for $i \ge p$ (5)

The trend of decreasing the stage size should be continued until we produce the required number of adder bits.
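The recurrence (3) can be turned into a small worst-case timing model for an arbitrary list of stage sizes. This is a sketch under stated assumptions: the carry-in of the first stage arrives at time zero, the worst-case chain delay of stage j is taken as M_j x T_CARRY, and the sum delay of stage j is taken as (M_j - 1) x T_CARRY + T_SUM, as the text of this section states. The function name and the delay values are hypothetical.

```python
def cska_worst_case_delay(stage_sizes, t_carry, t_sum, t_mux):
    """Worst-case output delay of a single-level CSKA, using recurrence (3)."""
    t1 = 0.0            # arrival time of the stage carry input (first stage: 0)
    prev_t0 = prev_t1 = 0.0
    worst = 0.0
    for j, m in enumerate(stage_sizes):
        if j > 0:
            # (3): the carry input comes through the previous stage multiplexer.
            t1 = max(prev_t0, prev_t1) + t_mux
        t0 = m * t_carry                          # worst-case FA chain carry delay
        out_ready = t1 + (m - 1) * t_carry + t_sum  # stage sum outputs ready
        worst = max(worst, out_ready)
        prev_t0, prev_t1 = t0, t1
    return worst
```

With eight stages of size 4 and the assumed delays T_CARRY = 1.0, T_SUM = 1.2, T_MUX = 1.0, the model reproduces the value given by (1). For the variable sizes [1, 2, 3, 2, 1] the last three stages finish at the same time, which is the balance that (5) aims for.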
Note that, in this case, the size of the last RCA block may only be one (i.e., one FA). Hence, to reach the highest number of input bits under a constant propagation delay, both (4) and (5) should be satisfied. Having these constraints, we can minimize the delay of the CSKA for a given number of input bits to find the stage sizes for an optimal structure. In this optimal CSKA, the size of the first p stages is increased, while the size of the last $(Q - p)$ stages is decreased. For this structure, the pth stage, which is called the nucleus of the adder, has the maximum size [24].

Now, let us find the constraints used for determining the optimum structure in this case. As mentioned before, when the jth stage is not in the propagate mode, the carry output of the stage is $C_j^0$. In this case, the maximum of $t_j^0$ is equal to $M_j \times T_{CARRY}$. To satisfy (4), we increase the size of the first p stages up to the nucleus using [19]

$M_j \le M_1 + (j-1)\alpha$; for $1 \le j \le p$ (6)

In addition, the maximum of $T_{SUM,i}$ is equal to $(M_i - 1) \times T_{CARRY} + T_{SUM}$. To satisfy (5), the size of the last $(Q - p)$ stages from the nucleus to the last stage should decrease based on [19]

$M_i \le M_Q + (Q-i)\alpha$; for $p \le i \le Q$ (7)

In the case where $\alpha$ is an integer value, the exact sizes of the stages for the optimal structure can be determined. Subsequently, the optimal values of $M_1$, $M_Q$, and Q as well as the delay of the optimal CSKA may be calculated [19]. In the case where $\alpha$ is a non-integer value, one may realize only a near-optimal structure, as detailed in [19] and [21]. In this case, most of the time, by setting $M_1$ to 1 and using (6) and (7), the near-optimal structure is determined. It should be noted that, in practice, $\alpha$ is a non-integer whose value is smaller than one. This is the case that has been studied in [19], where an estimate of the near-optimal propagation delay of the CSKA is given as (8). This equation may be written in a more general form by replacing $T_{MUX}$ with $T_{SKIP}$ to allow for other logic types instead of the multiplexer. For this form, $\alpha$ becomes equal to $T_{SKIP}/T_{CARRY}$. Finally, note that in real implementations, $T_{SKIP} < T_{CARRY}$, and hence, $\lceil \alpha/2 \rceil$ becomes equal to one.
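The size bounds (6) and (7) can be illustrated with a short sketch that, for a chosen number of stages Q, returns the largest integer size each stage may take. As a simplifying assumption made only for this example, both bounds are applied to every stage and the smaller one is kept, so the nucleus appears where the two bounds cross; the function name is hypothetical and this is not the full design procedure of [19].

```python
import math


def vss_size_bounds(q, alpha, m_first=1, m_last=1):
    """Largest integer stage sizes allowed by (6) and (7) for q stages."""
    sizes = []
    for j in range(1, q + 1):
        rising = m_first + (j - 1) * alpha   # bound (6), growing from stage 1
        falling = m_last + (q - j) * alpha   # bound (7), shrinking toward stage Q
        sizes.append(math.floor(min(rising, falling)))
    return sizes
```

For instance, `vss_size_bounds(5, 1.0)` yields the sizes 1, 2, 3, 2, 1, with the third stage acting as the nucleus. With alpha below one the sizes grow more slowly, for example 1, 1, 2, 1, 1 for alpha = 0.5.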


Thus, (8) may be written as (9). Note that (9) reveals that a large portion of the critical path delay is due to the carry skip logics.

Fig. 2. Proposed CI-CSKA structure.

IV. PROPOSED CSKA STRUCTURE

Based on the discussion presented in Section III, it is concluded that by reducing the delay of the skip logic, one may lower the propagation delay of the CSKA significantly. Hence, in this paper, we present a modified CSKA structure that reduces this delay.

A. General Description of the Proposed Structure

The structure is based on combining the concatenation and the incrementation schemes [13] with the Conv-CSKA structure, and hence, is denoted by CI-CSKA. It provides us with the ability to use simpler carry skip logics. The logic replaces 2:1 multiplexers by AOI/OAI compound gates (Fig. 2). The gates, which consist of fewer transistors, have lower delay, area, and power consumption compared with those of the 2:1 multiplexer [37]. Note that, in this structure, as the carry propagates through the skip logics, it becomes complemented. Therefore, at the output of the skip logic of even stages, the complement of the carry is generated. The structure has a considerably lower propagation delay with a slightly smaller area compared with those of the conventional one. Note that while the power consumption of the AOI (or OAI) gate is smaller than that of the multiplexer, the power consumption of the proposed CI-CSKA is a little more than that of the conventional one. This is due to the increase in the number of the gates, which imposes a higher wiring capacitance (in the noncritical paths).

Now, we describe the internal structure of the proposed CI-CSKA shown in Fig. 2 in more detail. The adder contains two N-bit inputs, A and B, and Q stages. Each stage consists of an RCA block with the size of $M_j$ ($j = 1, \ldots, Q$). In this structure, the carry input of all the RCA blocks, except for the first block, whose carry input is $C_i$, is zero (concatenation of the RCA blocks).
Therefore, all the blocks execute their jobs simultaneously. In this structure, when the first block computes the summation of its corresponding input bits (i.e., $S_{M_1}, \ldots, S_1$) and $C_1$, the other blocks simultaneously compute the intermediate results [i.e., $\{Z_{K_j+M_j}, \ldots, Z_{K_j+2}, Z_{K_j+1}\}$ for $K_j = \sum_{r=1}^{j-1} M_r$ ($j = 2, \ldots, Q$)] and also the $C_j$ signals. In the proposed structure, the first stage has only one block, which is an RCA. Stages 2 to Q consist of two blocks: an RCA block and an incrementation block.

Fig. 3. Internal structure of the jth incrementation block.

The incrementation block uses the intermediate results generated by the RCA block and the carry output of the previous stage to calculate the final summation of the stage. The internal structure of the incrementation block, which contains a chain of half-adders (HAs), is shown in Fig. 3. In addition, note that, to reduce the delay considerably, the carry output of the incrementation block is not used for computing the carry output of the stage. As shown in Fig. 2, the skip logic determines the carry output of the jth stage ($C_{O,j}$) based on the intermediate results of the jth stage and the carry output of the previous stage ($C_{O,j-1}$) as well as the carry output of the corresponding RCA block ($C_j$). When determining $C_{O,j}$, these cases may be encountered. When $C_j$ is equal to one, $C_{O,j}$ will be one. On the other hand, when $C_j$ is equal to zero, if the product of the intermediate results is one (zero), the value of $C_{O,j}$ will be the same as $C_{O,j-1}$ (zero).

The reason for using both AOI and OAI compound gates as the skip logics is the inverting functions of these gates in standard cell libraries. This way the need for an inverter gate, which increases the power consumption and delay, is eliminated. As shown in Fig. 2, if an AOI is used as the skip logic, the next skip logic should use an OAI gate. Another point to mention is that the use of the proposed skipping structure in the Conv-CSKA structure increases the delay of the critical path considerably. This originates from the fact that, in the Conv-CSKA, the skip logic (AOI or OAI compound gates) is not able to bypass the zero carry input until the zero carry input propagates from the corresponding RCA block. To solve this problem, in the proposed structure, we have used an RCA block with a carry input of zero (using the concatenation approach). This way, since the RCA block of the stage does not need to wait for the carry output of the previous stage, the output carries of the blocks are calculated in parallel.

P. Shalini et al, International Journal of Computer Science and Mobile Computing, Vol.5 Issue.12, December- 2016, pg. 26-37

B. Area and Delay of the Proposed Structure

As mentioned before, the use of the static AOI and OAI gates (six transistors) compared with the static 2:1 multiplexer (12 transistors) leads to decreases in the area usage and delay of the skip logic [37], [38]. In addition, except for the first RCA block, the carry input for all other blocks is zero, and hence, for these blocks, the first adder cell in the RCA chain is an HA. This means that $(Q-1)$ FAs in the conventional structure are replaced with the same number of HAs in the suggested structure, decreasing the area usage (Fig. 2).
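The concatenation, incrementation, and skip steps described in this section can be sketched as a behavioural model. This is an illustrative Python sketch rather than the authors' Verilog code: the function name `ci_cska` is hypothetical, and the alternating AOI/OAI polarity is abstracted away, so the skip logic is modelled in non-inverted form as C_O,j = C_j OR (P_j AND C_O,j-1), with P_j the product of the intermediate results of the stage.

```python
def ci_cska(a, b, stage_sizes, cin=0):
    """Behavioural model of the CI-CSKA; returns (sum, carry_out)."""
    result_bits, carry_out, pos = [], cin, 0
    for j, m in enumerate(stage_sizes):
        a_bits = [(a >> (pos + i)) & 1 for i in range(m)]
        b_bits = [(b >> (pos + i)) & 1 for i in range(m)]
        # RCA block: only the first stage sees the adder carry input;
        # all other blocks use a carry input of zero (concatenation).
        c = cin if j == 0 else 0
        z = []
        for x, y in zip(a_bits, b_bits):
            z.append(x ^ y ^ c)
            c = (x & y) | (c & (x ^ y))
        if j == 0:
            result_bits.extend(z)
            carry_out = c
        else:
            # Incrementation block: half-adder chain adds the previous
            # stage carry to the intermediate results Z.
            h = carry_out
            for zk in z:
                result_bits.append(zk ^ h)
                h = zk & h
            # Skip logic: the incrementation carry h is deliberately unused.
            p = int(all(z))
            carry_out = c | (p & carry_out)
        pos += m
    total = sum(bit << i for i, bit in enumerate(result_bits))
    return total, carry_out
```

The model relies on the fact that a block with carry input zero cannot produce both a carry output of one and all-ones intermediate results, so the OR of the two terms gives the stage carry.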
In addition, note that the proposed structure utilizes incrementation blocks that do not exist in the conventional one. These blocks, however, may be implemented with about the same logic gates (XOR and AND gates) as those used for generating the select signal of the multiplexer in the conventional structure. Therefore, the area usage of the proposed CI-CSKA structure is decreased compared with that of the conventional one.

The critical path of the proposed CI-CSKA structure, which contains three parts, is shown in Fig. 2. These parts include the chain of the FAs of the first stage, the path of the skip logics, and the incrementation block in the last stage. The delay of this path ($T_D$) may be expressed as

$T_D = [M_1 T_{CARRY}] + [(Q-2) T_{SKIP}] + [(M_Q - 1) T_{AND} + T_{XOR}]$ (10)

where the three brackets correspond to the three parts mentioned above, respectively. Here, $T_{AND}$ and $T_{XOR}$ are the delays of the two-input static AND and XOR gates, respectively. Note that $[(M_j - 1) T_{AND} + T_{XOR}]$ is the critical path delay of the jth incrementation block ($T_{INC,j}$), which is shown in Fig. 3. To calculate the delay of the skip logic, the average of the delays of the AOI and OAI gates, which are typically close to one another [35], is used. Thus, (10) may be modified to

$T_D = [M_1 T_{CARRY}] + [(Q-2)(T_{AOI} + T_{OAI})/2] + [(M_Q - 1) T_{AND} + T_{XOR}]$ (11)

where $T_{AOI}$ and $T_{OAI}$ are the delays of the static AOI and OAI gates, respectively. The comparison of (1) and (11) indicates that the delay of the proposed structure is smaller than that of the conventional one. The first reason is that the delay of the skip logic is considerably smaller than that of the conventional structure, while the number of the stages is about the same in both structures. Second, since $T_{AND}$ and $T_{XOR}$ are smaller than $T_{CARRY}$ and $T_{SUM}$, the third additive term in (11) becomes smaller than the third term in (1) [37]. It should be noted that the delay reduction of the skip logic has the largest impact on the delay decrease of the whole structure.

C. Stage Sizes Consideration

Similar to the Conv-CSKA structure, the proposed CI-CSKA structure may be implemented with either FSS or VSS. Here, the stage size is the same as the size of the RCA and incrementation blocks. In the case of the FSS (FSS-CI-CSKA), there are $Q = N/M$ stages with the size of M. The optimum value of M, which may be obtained using (11) with $M_1 = M_Q = M$ and $Q = N/M$, is given by

$M_{opt} = \sqrt{N (T_{AOI} + T_{OAI}) / (2 (T_{CARRY} + T_{AND}))}$ (12)

In the case of the VSS (VSS-CI-CSKA), the sizes of the stages, which are $M_1$ to $M_Q$, are obtained using a method similar to the one discussed in Section III-B. For this structure, the new value for $T_{SKIP}$ should be used, and hence, $\alpha$ becomes $(T_{AOI} + T_{OAI}) / (2 T_{CARRY})$. In particular, the following steps should be taken.

1) The size of the RCA block of the first stage is one.
2) From the second stage to the nucleus stage, the size of the jth stage is determined based on the delay of the product of the sum outputs of its RCA block and the delay of the carry output of the $(j-1)$th stage. Hence, based on the description given in Section III-B, the size of the RCA block of the jth stage should be as large as possible, while the delay of the product of its output sum should be smaller than the delay of the carry output of the $(j-1)$th stage. Therefore, in this case, the sizes of the stages are either not changed or increased.
3) The increase in the size is continued until the summation of all the sizes up to this stage becomes larger than N/2. The last stage, which has the largest size, is considered as the nucleus (pth) stage. There are cases in which the stage right before this stage should be considered as the nucleus stage (Step 5).
4) Starting from stage $(p+1)$ to the last stage, the size of stage i is determined based on the delays of the incrementation blocks of the ith and $(i-1)$th stages ($T_{INC,i}$ and $T_{INC,i-1}$, respectively) and the delay of the skip logic. In particular,

$T_{INC,i} \le T_{INC,i-1} - T_{SKIP,i-1}$; for $i \ge p+1$ (13)

In this case, the size of the last stage is one, and its RCA block contains a HA.

Fig. 4. Sizes of the stages in the case of VSS for the proposed and conventional 32-bit CSKA structures in 45-nm static CMOS technology.

increase one in the case of the proposed structure. It originates from the fact that, in the Conv-CSKA structure, both the increase and the decrease of the stage sizes are determined based on the RCA block delay [according to (4) and (5)], while in the proposed CI-CSKA structure, the increase is determined based on the RCA block delay and the decrease is determined based on the incrementation block delay [according to (13)]. The imbalanced rates may yield a larger nucleus stage and a smaller number of stages, leading to a smaller propagation delay.

V. PROPOSED HYBRID VARIABLE LATENCY CSKA

In this section, first, the structure of a generic variable latency adder, which may be used with voltage scaling relying on adaptive clock stretching, is described. Then, a hybrid variable latency CSKA structure based on the CI-CSKA structure described in Section IV is proposed.

A. Variable Latency Adders Relying on Adaptive Clock Stretching

The basic idea behind variable latency adders is that the critical paths of the adders are activated rarely [33]. Hence, the supply voltage may be scaled down without decreasing the clock frequency. If the critical paths are not activated, one clock period is enough for completing the operation. In the cases where the critical paths are activated, the structure allows two clock periods for finishing the operation. Hence, in this structure, the slack between the longest off-critical paths and the longest critical paths determines the maximum amount of supply voltage scaling. Therefore, in the variable latency adders, a predictor block, which works based on the input pattern, is required for determining the activation of the critical paths [28].
The concepts of the variable latency adders, adaptive clock stretching, and supply voltage scaling in an N-bit RCA may be explained using Fig. 5. The predictor block consists of some XOR and AND gates that determine the product of the propagate signals of the considered bit positions. Since the block has some area and power overheads, only a few middle bits are used to predict the activation of the critical paths, at the price of a decrease in prediction accuracy [31], [33]. In Fig. 5, the input bits (j + 1)th to (j + m)th have been exploited to predict the propagation of the carry output of the jth stage (FA) to the carry output of the (j + m)th stage. For this configuration, the carry propagation path from the first stage to the Nth stage is the longest critical path (denoted by long latency path, LLP), while the carry propagation path from the first stage to the (j + m)th stage and the carry propagation path from the (j + 1)th stage to the Nth stage (denoted by short latency paths SLP1 and SLP2, respectively) are the longest off-critical paths. Note that the paths that the predictor flags as active (not active) for a given set of inputs are considered critical (off-critical) paths. Having the bits in the middle decreases the maximum of the off-critical paths [33]. The range of voltage scaling is determined by the slack time, which is defined as the delay difference between LLP and max(SLP1, SLP2). Since the activation probability of the critical paths is low (< 1/2^m), the clock stretching has a negligible impact on the


throughput (e.g., for a 32-bit adder, m = 6 to 10 may be considered [33]). There are cases in which the predictor mispredicts the critical path activation. By increasing m, the number of mispredictions decreases at the price of increasing the longest off-critical path, and hence, limiting the range of the voltage scaling. Therefore, the predictor block size should be selected based on these tradeoffs.

B. Proposed Hybrid Variable Latency CSKA Structure

The basic idea behind using VSS CSKA structures was to almost balance the delays of the paths such that the delay of the critical path is minimized compared with that of the FSS structure [21]. This deprives us of the opportunity to use the slack time for supply voltage scaling. To provide the variable latency feature for the VSS CSKA structure, we replace some of the middle stages in our proposed structure with a modified PPA. It should be noted that since the Conv-CSKA structure has a lower speed than that of the proposed one, we do not consider the conventional structure in this section. The proposed hybrid variable latency CSKA structure is shown in Fig. 6, where an $M_p$-bit modified PPA is used for the pth stage (nucleus stage). Since the nucleus stage, which has the largest size (and delay) among the stages, is present in both SLP1 and SLP2, replacing it by the PPA reduces the delay of the longest off-critical paths. Thus, the use of the fast PPA helps increase the available slack time in the variable latency structure. It should be mentioned that since the input bits of the PPA block are used in the predictor block, this block becomes part of both SLP1 and SLP2. In the proposed hybrid structure, the prefix network of the Brent-Kung adder [39] is used for constructing the nucleus stage (Fig. 7).
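The predictor described in Section V-A can be illustrated with a small behavioral sketch. It assumes 0-based bit indices, so the window of bit indices j to j + m − 1 corresponds to the (j + 1)th to (j + m)th input bits in the text; the function names `predict_critical` and `activation_rate` are hypothetical and not taken from the paper. For uniformly random inputs each propagate bit is one with probability 1/2, so the predictor fires for a fraction 1/2^m of the input pairs, which bounds how often the clock has to be stretched.

```python
def propagate_bits(a, b):
    """Bitwise propagate signals P_i = a_i XOR b_i."""
    return a ^ b


def predict_critical(a, b, j, m):
    """Return True when the m propagate bits at indices j .. j+m-1 are all 1.

    This is the AND (product) of the selected propagate signals; 0-based
    indexing is an assumption of this sketch.
    """
    full = (1 << m) - 1
    window = (propagate_bits(a, b) >> j) & full
    return window == full


def activation_rate(width, j, m):
    """Exhaustively measure how often the predictor fires for width-bit inputs."""
    hits = 0
    for a in range(1 << width):
        for b in range(1 << width):
            hits += predict_critical(a, b, j, m)
    return hits / (1 << (2 * width))
```

A larger m makes the predictor fire less often, but, as the text notes, it also lengthens the longest off-critical path and so reduces the slack available for voltage scaling.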
One of the advantages of this adder compared with other prefix adders is that, using forward paths, the longest carry is calculated sooner than the intermediate carries, which are computed by backward paths. In addition, the fan-out of this adder is less than that of other parallel adders, while the length of its wiring is smaller [14]. Finally, it has a simple and regular layout. The internal structure of the stage p, including the modified PPA and skip logic, is shown in Fig. 7. Note that, for this figure, the size of the PPA is assumed to be 8 (i.e., $M_p$ = 8).

Fig. 7. Internal structure of the pth stage of the proposed hybrid variable latency CSKA.

As shown in the figure, in the preprocessing level, the propagate signals ($P_i$) and generate signals ($G_i$) for the inputs are calculated. In the next level, using the Brent-Kung parallel prefix network, the longest carry (i.e., $G_{8:1}$) of the prefix network along with $P_{8:1}$, which is the product of all the propagate signals of the inputs, are calculated sooner than the other intermediate signals in this network. The signal $P_{8:1}$ is used in the skip logic to determine whether the carry output of the previous stage (i.e., $C_{O,p-1}$) should be skipped or not. In addition, this signal is exploited as the predictor signal in the variable latency adder. It should be mentioned that all of these operations are performed in parallel with other stages. When $P_{8:1}$ is one, $C_{O,p-1}$ should skip this stage, which predicts that some critical paths are activated. On the other hand, when $P_{8:1}$ is zero, $C_{O,p}$ is equal to $G_{8:1}$, and no critical path will be activated. After the parallel prefix network, the intermediate carries, which are functions of $C_{O,p-1}$ and intermediate signals, are computed (Fig. 7). Finally, in the postprocessing level, the output sums of this stage are calculated. It should be noted that this implementation is based on ideas similar to the concatenation and incrementation concepts used in the CI-CSKA discussed in Section IV. It should also be noted that the end part of the SLP1 path, from $C_{O,p-1}$ to the final summation results of the PPA block, and the beginning part of the SLP2 paths, from the inputs of this block to $C_{O,p}$, belong to the PPA block (Fig. 7). In addition, similar to the proposed CI-CSKA structure, the first point of SLP1 is the first input bit of the first stage, and the last point of SLP2 is the last bit of the sum output of the incrementation block of the stage Q. The steps for determining the sizes of the stages in the hybrid variable latency CSKA structure are similar to the ones discussed in Section IV.
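The three levels described above (preprocessing, Brent-Kung prefix network, postprocessing) can be sketched behaviorally as follows. This is a bit-level software model under stated assumptions, not the authors' gate-level design: bit indices are 0-based (so the group signals written $G_{8:1}$ and $P_{8:1}$ in the text are the last prefix entry here), the skip logic is modeled as a plain OR-AND expression rather than AOI/OAI compound gates, and the names `combine`, `brent_kung_prefix`, and `nucleus_stage` are hypothetical.

```python
def combine(hi, lo):
    """Prefix operator: merge (G, P) of a higher bit group with a lower group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def brent_kung_prefix(gp):
    """Return (G, P) covering bits 0..i for every i; len(gp) must be a power of two."""
    x = list(gp)
    n = len(x)
    d = 1
    while d < n:  # forward paths: the longest group signal is ready first
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d *= 2
    d = n // 4
    while d >= 1:  # backward paths: fill in the intermediate prefixes
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d //= 2
    return x


def nucleus_stage(a, b, c_in, width=8):
    """Return (sum, carry_out, group_propagate) of one prefix-based stage."""
    # Preprocessing: per-bit propagate and generate signals.
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    prefix = brent_kung_prefix(list(zip(g, p)))
    g_all, p_all = prefix[-1]
    # Skip logic: the incoming carry passes only when every bit propagates.
    c_out = g_all | (p_all & c_in)
    # Intermediate carries are functions of c_in and the prefix signals.
    carries = [c_in] + [gi | (pi & c_in) for gi, pi in prefix[:-1]]
    # Postprocessing: sum bits.
    s = 0
    for i in range(width):
        s |= (p[i] ^ carries[i]) << i
    return s, c_out, p_all
```

The returned group propagate plays both roles described in the text: it selects whether the incoming carry skips the stage, and it can serve as the predictor signal of the variable latency scheme.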
Since the PPA structure is more efficient when its size is equal to an integer power of two, we can select a larger size for the nucleus stage accordingly [14]. This implies that the third step discussed in that section is modified. The larger size (number of bits), compared with that of the nucleus stage in the original CI-CSKA structure, leads to a decrease in the number of stages as well as smaller delays for SLP1 and SLP2. Thus, the slack time increases further.

VI. RESULTS AND DISCUSSION

In this section, we assess the efficacies of the proposed structures by comparing their delays, powers, energies, and areas with those of some other adders. All the adders considered here had the size of 32 bits and were designed and simulated using a 45-nm static CMOS technology [38]. The simulations were performed using HSPICE [40] at room temperature (25 °C). The nominal supply voltage of the technology was 1.1 V, and the threshold voltages of the nMOS and pMOS transistors were and V, respectively. It should be noted that, to extract the power consumption of the adders, uniform random stimuli were injected into them. In addition, for each adder structure at each supply voltage level, the injection rate of the stimuli was chosen based on the maximum operating frequency of the structure. In Section VI-A and Section VI-B, we first concentrate on studying the effectiveness of the proposed CI-CSKA structure and then investigate the efficiency of the proposed hybrid variable latency structure based on the CI-CSKA.

A. CSKA Structures with Fixed and Variable Stage Sizes

In this section, both the proposed and Conv-CSKA structures with FSS and VSS are considered. The optimum size of the stages for the FSS was 4 in the proposed (CI-CSKA) and Conv-CSKA adders. The sizes of the stages in the case of VSS were the same as indicated in Fig. 4. The comparative study also included the RCA, CIA, square root CSLA (SQRT-CSLA), and KSA.
The results were obtained for a wide range of voltage levels from the nominal voltage (superthreshold) to the nMOS threshold voltage ($V_{TH,nMOS}$) (near threshold). The delays of the adders versus the supply voltage are plotted in Fig. 8. As the results show, the RCA (KSA) has the highest (lowest) delay due to its serial (parallel) structure under all the supply voltages. In addition, the smaller delay of the SQRT-CSLA compared with that of the CIA is due to the logic duplication. As was expected, the CSKA structures have significantly smaller delays compared with that of the RCA.

Fig. 8. Critical path delay of the adders versus the supply voltage.

Fig. 9. Power consumption of the adders versus the supply voltage.

The power consumptions of the adders versus the supply voltage are shown in Fig. 9. The results reveal that the smallest power consumption belongs to the RCA, while the KSA structure consumes the highest power owing to its parallel structure. The power consumption of the CIA is more than that of the RCA, while it is smaller than that of the SQRT-CSLA. The reason for the high power of the SQRT-CSLA is its logic duplication. The power consumptions of the conventional and proposed CI-CSKA


structures are slightly more than that of the CIA. The powers of these adders increase further using the VSS scheme, where the number of stages is larger. As mentioned before, the power of the CI-CSKA structure is a little more than that of the conventional one. For example, the power of the VSS-CI-CSKA is 5% to 7% larger than that of the VSS-Conv-CSKA. It should be pointed out that while the delay of the VSS-CI-CSKA was smaller than the delay of the SQRT-CSLA, its power is also considerably smaller than that of the SQRT-CSLA. Finally, the results reveal, on average, a 32× reduction in the power consumption of the adders when scaling the supply voltage from 1.1 V to the nMOS threshold voltage.

The PDP of the adders was also evaluated for different supply voltages. The proposed CI-CSKA has the best PDP compared with those of the other structures in the supply voltage range considered in this paper. The highest PDP (2.5× more than that of the CI-CSKA structure) corresponds to the SQRT-CSLA. After the SQRT-CSLA, the KSA has the highest PDP. The results show that the PDP of the proposed CI-CSKA structure is 35% to 38% less than that of the Conv-CSKA structure. In addition, in both the conventional and the proposed structures, the PDPs of the FSS and VSS styles are about the same.

The values of the energy delay product (EDP) of the adders versus the supply voltage are plotted in Fig. 11. As the results reveal, the RCA has the largest EDP due to its lowest speed. The EDP of the proposed VSS-CI-CSKA is almost the same as that of the KSA structure. The lower value of the EDP for the proposed CI-CSKA originates from the smaller power consumption as well as the higher speed of the structure. Furthermore, the VSS-CI-CSKA has smaller area and power consumption compared with those of the KSA. Finally, to demonstrate the tradeoffs between the delay and the energy for each adder structure, the energy-delay Pareto-optimal curves are plotted in Fig. 12, which suggests the proposed VSS-CI-CSKA structure as the better adder.
Table I reports the area usages and number of transistors for each adder structure. The RCA has the smallest area, while the KSA has the largest area. The next largest adder is the SQRT-CSLA. All four CSKA structures and the CIA have about the same area. In addition, as stated before, the proposed CI-CSKA structure slightly decreases the area compared with that of the conventional one. The number of transistors of the proposed CI-CSKA structure is also smaller than that of the Conv-CSKA structure in both FSS and VSS styles. It should be noted that the lowest PDP (energy) and low area of the proposed CI-CSKA structure were the motivation behind extending the structure for variable latency applications.

Finally, to investigate the effect of bit length on the efficiency of the proposed CI-CSKA structure, we compare the changes [(Value_Conventional − Value_Proposed)/Value_Conventional] of the delay, power, energy, and area of the CI-CSKA and Conv-CSKA structures for 16-, 32-, and 64-bit adders. For the sake of space, we present the average results of different supply voltage levels. In addition, for the same reason, we limit the comparison to VSS structures because the VSS-CI-CSKA is the most efficient structure among the considered CSKA structures. Furthermore, as mentioned before, the proposed hybrid variable latency CSKA is constructed based on the VSS-CI-CSKA. The results are presented in Fig. 13. The figure reveals that the delay reduction and energy saving decrease slightly and the power increase enlarges a bit with increasing bit length. In addition, the increase in the bit length improves the area and number of transistors of the proposed VSS-CI-CSKA compared with those of the VSS-Conv-CSKA. In Section VI-B, we present the results for the variable latency adders.


B. Variable Latency Adders

In this part, the performance of the proposed hybrid variable latency CSKA structure is compared with those of some other variable latency adders, including RCA [27], C2SLA [29], and hybrid C2SLA [31], [33]. In the proposed 32-bit hybrid structure, an 8-bit modified PPA block was used in the nucleus stage (Fig. 7). The sizes of the stages from LSB to MSB were {1, 1, 1, 2, 2, 3, 3, 8, 3, 3, 2, 2, 1} where the prediction was performed using the input bits of In the case of the RCA, eight intermediate bits of were exploited for the prediction block. The C2SLA is an extension of the SQRT-CSLA where the variable latency feature is achieved by increasing the number of stages as well as having different sizes for their RCA blocks. In the C2SLA, cascading was done by dividing the 32 bits into groups of {2, 2, 3, 4, 5, 2, 2, 3, 4, 5}, where the partial sum was computed in parallel for $C_i$ = 0 as well as $C_i$ = 1 using the RCA. Next, the multiplexers selected the appropriate sum based on the actual carry. In this structure, seven intermediate bits of were used in the prediction block. In the hybrid C2SLA, nine intermediate results were calculated using the KSA; the details may be found in [33].

As a measure of the ability of a structure to use the variable latency feature for reducing the power consumption, one may use the ratio of the slack time to the delay of the adder (which is equal to the delay of the LLP, denoted by $D_{LLP}$). The ratios for the four adder structures are shown in Fig. 14(a). The figure also contains the ratio of the slack time to $(D_{LLP})^2$ to include the speed of the adder in the figure of merit for the efficacy of the structure in reducing the power using the variable latency scheme. Note that the details for the LLP, SLP1, and SLP2 in the C2SLA and hybrid C2SLA may be found in [31]. As the results show, the RCA can obtain the highest improvement using the adaptive clock stretching technique.
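To make the role of the stage sizes listed above concrete, the following is a minimal behavioral sketch of a concatenation-and-incrementation carry skip adder. It is a functional software model only: each stage adds its slice with the carry input fixed to zero, an incrementation step adds the carry arriving from the previous stage, and the skip logic is modeled as OR-AND rather than the AOI/OAI compound gates of the paper. The function name `ci_cska_add` and the constant name are hypothetical; only the stage sizes come from the text.

```python
# Stage sizes from LSB to MSB as listed in the text for the 32-bit hybrid adder.
HYBRID_STAGE_SIZES = [1, 1, 1, 2, 2, 3, 3, 8, 3, 3, 2, 2, 1]


def ci_cska_add(a, b, stage_sizes, c_in=0):
    """Return (sum, carry_out) for a + b + c_in using staged skip logic."""
    total_sum = 0
    carry = c_in
    offset = 0
    for size in stage_sizes:
        mask = (1 << size) - 1
        x = (a >> offset) & mask
        y = (b >> offset) & mask
        block = x + y                        # adder block, carry input fixed to 0
        block_sum = block & mask
        block_carry = block >> size
        group_propagate = int((x ^ y) == mask)
        # Incrementation block: add the carry coming from the previous stage.
        stage_sum = (block_sum + carry) & mask
        # Skip logic: the incoming carry passes only if every bit propagates.
        carry = block_carry | (group_propagate & carry)
        total_sum |= stage_sum << offset
        offset += size
    return total_sum, carry
```

Because the model is purely functional, any list of stage sizes that adds up to the word length gives the same sums; the choice of sizes only affects delay and slack in the hardware, which this sketch does not model.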
This adder, however, has the worst delay among the four adder structures. The next highest improvement belongs to the proposed hybrid CSKA, whose delay is on the order of those of the other two adders. The observation indicates that the proposed hybrid CSKA may be considered as a fast adder structure for low-power applications. To further clarify this, the results for the power and PDP at both the nominal and the reduced supply voltages for each adder structure are plotted in Fig. 14(b). The amounts of power and energy savings are functions of the supply voltage reduction, which is determined by the slack time. Since the slack times are different for the structures, the amounts of the voltage reduction are different too. The power and PDP at the nominal voltage are for the corresponding baseline structure of each adder (no variable latency structure), while at the reduced voltage, the variable latency structure is considered. The results show that the highest power (energy) reduction of 29% belongs to the RCA structure, which has the highest slack time. In this case, the supply voltage reduction was 0.2 V. In the case of the standard C2SLA, since the slack time was small, the voltage reduction was 0.05 V, which led to a power reduction of 6%. For the hybrid C2SLA, the slack time was higher than that of the standard C2SLA, and hence a voltage reduction of 0.1 V became possible. This provided a higher power reduction (13%). Finally, the proposed hybrid CSKA had a larger slack compared with that of the hybrid C2SLA, and hence, a voltage reduction of 0.15 V was possible. This provided the structure with a power reduction of 23% (larger than those of the C2SLA structures). The very low delay of the hybrid variable latency CSKA along with its lower power consumption results in the minimum PDP for this structure. In addition, the higher PDP of the C2SLA structures is due to their high power consumption.

VII. CONCLUSION

In this paper, a static CMOS CSKA structure called CI-CSKA was proposed, which exhibits a higher speed and lower energy consumption compared with those of the conventional one. The speed enhancement was achieved by modifying the structure through the concatenation and incrementation techniques. In addition, AOI and OAI compound gates were exploited for the carry skip logics. The efficiency of the proposed structure for both FSS and VSS was studied by comparing its power and delay with those of the Conv-CSKA, RCA, CIA, SQRT-CSLA, and KSA structures. The results revealed considerably lower PDP for the VSS implementation of the CI-CSKA structure over a wide range of voltages from super-threshold to near threshold. The results also suggested the CI-CSKA structure as a very good adder for applications where both the speed and energy consumption are critical. In addition, a hybrid variable latency extension of the structure was proposed. It exploited a modified parallel adder structure at the middle stage for increasing the slack time, which provided us with the opportunity for lowering the
