FPGA IMPLEMENTATION OF 32-BIT WAVE-PIPELINED SPARSE-TREE ADDER

Kasharaboina Thrisandhya *1, Latha Sahukar *2
1 Post graduate (M.Tech) in ATRI, JNTUH University, Telangana, India.
2 Associate Professor in ATRI, JNTUH University, Telangana, India.
Volume: 05 Issue: 30 l Sep -2014 www.ijeec.com Page 451

ABSTRACT

We present the architecture, design, and testing of a 32-bit asynchronous wave-pipelined sparse-tree superconductor rapid single-flux quantum (RSFQ) adder. Compared to the Kogge-Stone adder, our parallel-prefix sparse-tree adder has better energy efficiency and significantly lower complexity, with almost no reduction in operation frequency. The 32-bit adder core has 9941 Josephson junctions occupying an area of 8.5 mm². It is designed for a target operation frequency of 30 GHz, with an expected latency of 352 ps at a bias voltage of 2.5 mV. The adder chip was fabricated and tested successfully at low frequency for all test patterns, with measured bias margins of +9.8%/-10.7%.

Index terms: Adders, digital arithmetic, superconducting integrated circuits, superconducting logic circuits, sparse tree.

I. INTRODUCTION

The adder is a universal digital circuit used in almost any application. It is the fundamental building block of Arithmetic Logic Units (ALUs) in general-purpose and special-purpose digital signal microprocessors. Currently, in the CMOS domain, the design space of adder structures has been nearly exhausted, with only minimal improvements shown over previous designs. In contrast, emerging digital circuit technologies such as superconducting Rapid Single Flux Quantum (RSFQ) logic open a way for researchers to explore new design methodologies for extremely fast, energy-efficient adders. In RSFQ logic, most adder designs demonstrated to date are bit-serial or digit-serial architectures, which operate on a single bit or a small group of bits sequentially at a very high processing rate [1], [2]. Such designs allow for simple clocking and compact structures.
However, the latency of serial adders scales as O(n), where n is the number of bits per operand, which leads to long latencies for 32-/64-bit operations in general-purpose processors. In the past, parallel architectures in RSFQ have been limited to small data widths or relatively long-latency ripple-carry adders [3]. One study evaluated 32-/64-bit parallel Kogge-Stone RSFQ adders using co-flow clocking [4].

In the effort to realize scalable, high-performance, fully parallel designs, a new technique of asynchronous hybrid wave-pipelining for RSFQ circuits has been developed at Stony Brook University (SBU) [5], [6]. Later, as a result of the collaboration between the SBU and HYPRES designers, an 8-bit wave-pipelined ALU was successfully designed and fabricated, and it demonstrated correct operation at a rate of 20 GHz [7], [8].

In this paper, we present the design of the first 32-bit asynchronous parallel adder implemented in RSFQ logic. It builds upon the proven hybrid wave-pipelining techniques to provide 32-bit wide processing and synchronization. It incorporates an energy-efficient, low-complexity sparse-tree structure with a very high processing rate. The work is based on a design study for a scalable 32-bit wave-pipelined sparse-tree adder conducted at SBU.

II. 32-BIT SPARSE TREE RSFQ ADDER

A. Sparse-tree RSFQ Adder

High-performance parallel adders typically use prefix trees, which generate carries in log2(n) time, where n is the number of bits of the data path. The Kogge-Stone adder (KSA) is considered to be the fastest among parallel-prefix adders. Further enhancements to the KSA prefix structure, such as the sparse-tree configuration, have been proposed and used in high-performance Intel processors. In our 32-bit RSFQ adder design, we chose the sparse-tree structure to reduce the number of wiring junctions needed for its implementation without any significant effect on its processing rate. As a side effect, this also leads to a more energy-efficient design by reducing the total bias current and power consumption.

The adder consists of the following three stages: Initialization, Prefix-Tree, and Summation.

The Initialization stage receives two 32-bit data operands A and B and creates the bitwise Generate (G) and Propagate (P) signals, which are merged in a logarithmic manner in the Prefix-Tree stage. The Initialization stage consists of GPR_INIT logic blocks, one for each bit. The GPR_INIT block creates the bitwise prefix functions $G_i = A_i \cdot B_i$ and $P_i = A_i \oplus B_i$, where i is the bit index (column), ranging from 31 down to 0 in the 32-bit adder. These functions are easily realized through clocked AND and XOR gates in a co-flow clocking arrangement. The clock is the Rdy signal provided to all bits. Additionally, it is necessary to create the trailing reset signal R, which is used to reset the asynchronous elements in the Prefix-Tree. Signal R is a copy of the Rdy signal for each bit, with wj-based delay lines to ensure that data signals are processed before the reset follows in the asynchronous Prefix-Tree.

The Prefix-Tree stage consists of Carry-Merge (CM) blocks that merge the prefix signals and provide a group carry to each 4-bit summation block. In contrast, the Kogge-Stone prefix tree provides a carry to every individual bit of the adder. DFF (D flip-flop) buffers appropriately delay the prefix and bitwise P signals until they are ready to be merged or processed at the Summation stage, respectively. The first three levels of the Prefix-Tree also perform the ripple-carry addition within each 4-bit group before data arrive at the Summation stage. Merging of the prefix signals is described in [10].
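
The three stages described above can be illustrated with a small behavioral model. The Python sketch below is only an assumption-laden illustration: it is neither the RSFQ circuit nor the Verilog used for the FPGA implementation. The function names (init_gp, merge, group_carries, sparse_tree_add) are hypothetical, the Kogge-Stone-style merging between 4-bit groups is one possible arrangement chosen for the example, and the Rdy/R signals, DFF buffering, and timing are not modeled. For simplicity, the ripple inside each 4-bit group is done in the summation loop rather than in the first prefix levels.

```python
WIDTH = 32  # operand width used in the text
GROUP = 4   # one group carry per 4-bit summation block


def init_gp(a, b, width=WIDTH):
    """Initialization stage: G_i = A_i AND B_i, P_i = A_i XOR B_i."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    return g, p


def merge(hi, lo):
    """Carry-merge of two adjacent (G, P) spans; `hi` is the more significant."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def group_carries(g, p, cin=0, group=GROUP):
    """Prefix-Tree stage (behavioral): carry into each `group`-bit block."""
    # Collapse every block to a single (G, P) pair.
    spans = []
    for base in range(0, len(g), group):
        span = (g[base], p[base])
        for i in range(base + 1, base + group):
            span = merge((g[i], p[i]), span)
        spans.append(span)
    # Logarithmic-depth merging across blocks (assumed arrangement).
    dist = 1
    while dist < len(spans):
        spans = [merge(spans[k], spans[k - dist]) if k >= dist else spans[k]
                 for k in range(len(spans))]
        dist *= 2
    # spans[k] now covers all bits from the top of block k down to bit 0.
    carries = [cin] + [gs | (ps & cin) for gs, ps in spans[:-1]]
    carry_out = spans[-1][0] | (spans[-1][1] & cin)
    return carries, carry_out


def sparse_tree_add(a, b, cin=0, width=WIDTH, group=GROUP):
    """Summation stage: ripple inside each block, starting from its group carry."""
    g, p = init_gp(a, b, width)
    carries, carry_out = group_carries(g, p, cin, group)
    total = 0
    for k, c in enumerate(carries):
        for i in range(k * group, (k + 1) * group):
            total |= (p[i] ^ c) << i   # sum bit = P_i XOR carry into bit i
            c = g[i] | (p[i] & c)      # carry into the next bit of the block
    return total, carry_out
```

For example, sparse_tree_add(0xFFFFFFFF, 1) returns (0, 1): every sum bit is zero and the carry out is set. Only eight group carries are produced by the prefix step for 32 bits, which is the source of the reduced complexity compared with a tree that produces a carry for every bit.
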
The merging is implemented with CFFs (resettable Muller C-flip-flop gates based on the Muller C-element) and confluence buffers used as asynchronous OR gates, without any danger of violating the time separation requirement of their input pulses.

B. Parallel prefix adders

Parallel prefix adders are flexible structures used to speed up binary addition. They are derived from the Carry Look Ahead (CLA) structure and use a tree organization to increase the speed of the arithmetic operation. Parallel prefix adders are among the fastest adders and are used in high-performance arithmetic circuits in industry. The construction of a parallel prefix adder [10] involves three stages.

Pre-processing stage: In this stage we compute the generate and propagate signals for each pair of input bits of A and B. These signals are given by logic equations (1) and (2):

$P_i = A_i \oplus B_i$ ... (1)
$G_i = A_i \cdot B_i$ ... (2)

Carry generation network: In this stage we compute the carries corresponding to each bit. These operations are carried out in parallel [9]. The carry computation is segmented into smaller pieces, which use the group carry propagate and generate as intermediate signals, given by logic equations (3) and (4):

$CP_{i:j} = P_{i:k+1} \cdot P_{k:j}$ ... (3)
$CG_{i:j} = G_{i:k+1} + (P_{i:k+1} \cdot G_{k:j})$ ... (4)

Post-processing: This is the final step, which computes the sum bits. It is common to all adders, and the carries and sum bits are computed by logic equations (5) and (6), where $C_{-1} = C_{in}$:

$C_i = G_i + (P_i \cdot C_{i-1})$ ... (5)
$S_i = P_i \oplus C_{i-1}$ ... (6)
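
The pre-processing, carry generation, and post-processing stages above can be checked with a short behavioral model. The Python sketch below assumes a Kogge-Stone arrangement of the carry generation network (every bit receives its own carry); the function name prefix_add and the list-based representation are hypothetical choices for illustration, not the implementation used in this work.

```python
def prefix_add(a, b, cin=0, width=32):
    """Behavioral parallel-prefix adder (Kogge-Stone arrangement assumed)."""
    # Pre-processing: bitwise propagate and generate, equations (1) and (2).
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # Carry generation: group signals CG and CP, equations (3) and (4).
    # After the level with distance `dist`, entry i spans 2*dist bits
    # (or all bits down to bit 0 if fewer remain).
    cg, cp = g[:], p[:]
    dist = 1
    while dist < width:
        new_cg, new_cp = cg[:], cp[:]
        for i in range(dist, width):
            new_cg[i] = cg[i] | (cp[i] & cg[i - dist])
            new_cp[i] = cp[i] & cp[i - dist]
        cg, cp = new_cg, new_cp
        dist *= 2

    # Post-processing: carry out of bit i from the group signals i:0,
    # then each sum bit is P_i XOR the carry into bit i.
    carry = [cg[i] | (cp[i] & cin) for i in range(width)]
    total = 0
    for i in range(width):
        c_prev = cin if i == 0 else carry[i - 1]
        total |= (p[i] ^ c_prev) << i
    return total, carry[width - 1]
```

For a 32-bit operand the merging loop runs five times (distances 1, 2, 4, 8, 16), which matches the log2(n) depth of the prefix tree. The result can be compared directly with integer addition, for instance prefix_add(a, b)[0] == (a + b) & 0xFFFFFFFF.
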

Figure-1: Structural diagram of the 32-bit sparse-tree adder

C. Carry Look Ahead Adder

To reduce the computation time, engineers devised faster ways to add two binary numbers by using carry-look ahead adders. They work by creating two signals (P and G) for each bit position, based on whether a carry is propagated through from a less significant bit position (at least one input is a '1'), generated in that bit position (both inputs are '1'), or killed in that bit position (both inputs are '0'). In most cases, P is simply the sum output of a half-adder and G is the carry output of the same adder. After P and G are generated, the carries for every bit position are created. Some advanced carry-lookahead architectures are the Manchester carry chain, the Brent-Kung adder, and the Kogge-Stone adder.

Some other multi-bit adder architectures break the adder into blocks. It is possible to vary the length of these blocks based on the propagation delay of the circuits to optimize computation time. These block-based adders include the carry-bypass adder, which determines P and G values for each block rather than each bit, and the carry-select adder, which pre-generates sum and carry values for either possible carry input to the block.

A carry-look ahead adder (CLA) is a type of adder used in digital logic. It improves speed by reducing the amount of time required to determine the carry bits. It can be contrasted with the simpler, but usually slower, ripple-carry adder, in which the carry bit is calculated alongside the sum bit, and each bit must wait until the previous carry has been calculated before it can begin calculating its own result and carry bits. The carry-look ahead adder calculates one or more carry bits before the sum, which reduces the wait time to calculate the result of the more significant bits. The Kogge-Stone adder and the Brent-Kung adder are examples of this type of adder.

III. SIMULATION RESULTS

Various adders were designed in Verilog using Xilinx ISE Navigator, and all simulations were performed using the ModelSim 6.5e simulator. The performance of the proposed adders is analyzed and compared. For the proposed architecture, the implementation code for the modified 32-bit sparse-tree RSFQ adder and carry look ahead adders was developed, and the corresponding values of delay and area were observed. Table 1 shows the comparison of adders. The simulated outputs of the proposed 32-bit adders are shown in Figure-2.

Figure-2: Simulation waveform for Sparse Tree Adder

Figure-3: RTL diagram

IV. CONCLUSION

We have designed, fabricated, and tested the first 32-bit wave-pipelined sparse-tree RSFQ adder chip, with a core complexity of 9941 JJs and a target operation rate of 30 GHz. We have successfully demonstrated the correct operation of the chip at low frequency, passing all carefully chosen test vectors with a measured bias margin of +9.8%/-10.7%. Another adder chip, consisting of 12785 junctions with additional on-chip circuits for 30 GHz testing, was also fabricated, but its testing showed the need for another fabrication run.

REFERENCES

[1] H. Park, Y. Yamanashi, N. Yoshikawa, M. Tanaka, and A. Fujimaki, "Design of fast digit-serial adders using SFQ logic circuits," IEICE Electronics Express, vol. 6, no. 19, pp. 1408-1413, 2009.
[2] S. V. Polonsky, V. K. Semenov, P. I. Bunyk, A. F. Kirichenko, A. Y. Kidiyarov-Shevchenko, O. A. Mukhanov, P. N. Shevchenko, D. F. Schneider, D. Y. Zinoviev, and K. K. Likharev, "New RSFQ circuits (Josephson junction digital devices)," IEEE Trans. Appl. Supercond., vol. 3, no. 1, pp. 2566-2577, Mar. 1993.
[3] J. Y. Kim, S. Kim, and J. Kang, "Construction of an RSFQ 4-bit ALU with half adder cells," IEEE Trans. Appl. Supercond., vol. 15, no. 2, pp. 308-311, Jun. 2005.
[4] P. Bunyk and P. Litskevitch, "Case study in RSFQ design: Fast pipelined parallel adder," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3714-3720, Jun. 1999.
[5] M. Dorojevets, C. Ayala, and A. Kasperek, "Development and evaluation of design techniques for high-performance wave-pipelined wide datapath RSFQ processors," in Proc. 12th Int. Supercond. Electron. Conf., Fukuoka, Japan, 2009, SP-P46.
[6] M. Dorojevets, C. L. Ayala, and A. K. Kasperek, "Data-flow microarchitecture for wide datapath RSFQ processors: Design study," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 787-791, Jun. 2011.
[4] H. Aboushady, Y. Dumonteix, M. M. Louerat, and H. Mehrez, "Efficient polyphase decomposition of comb decimation filters in Sigma-Delta analog-to-digital converters," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 10, pp. 898-903, 2001.
[5] G. Jovanovic Dolecek and S. K. Mitra, "A New Two-Stage CIC-Based Decimation Filter," Proceedings of the 5th International Symposium on Image and Signal Processing and Analysis, pp. 218-223, 2007.
[6] G. Jovanovic-Dolecek and S. K. Mitra, "On Design of CIC Decimation Filter with Improved Response," IEEE 3rd International Symposium on Communications, Control and Signal Processing, pp. 1072-1076, 2008.
[7] T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and O. Mukhanov, "8-bit asynchronous wave-pipelined RSFQ arithmetic-logic unit," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 847-851, Jun. 2011.
[8] T. V. Filippov, A. Sahu, A. F. Kirichenko, I. V. Vernik, M. Dorojevets, C. L. Ayala, and O. A. Mukhanov, "20 GHz operation of an asynchronous wave-pipelined RSFQ arithmetic-logic unit," Phys. Proc., vol. 36, pp. 59-65, 2012.
[9] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786-793, Aug. 1973.
[10] D. Harris, "A taxonomy of parallel prefix networks," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 2, Nov. 2003, pp. 2217.

Kasharaboina Thrisandhya received the B.Tech degree in ECE from Vanjari Seethaiah Memorial Engineering College in 2012 and is pursuing the M.Tech (2012-2014) in the stream of VLSI at Aurora's Technological and Research Institute (affiliated to JNTUH), Hyderabad. Her area of interest is VLSI Design.

Latha Sahukar is presently working as Associate Professor in ATRI, Hyderabad. Her areas of interest are VLSI Design and Communication Systems.