Analyzing Metrics of ALU Designs Traversing from Years 2002 to 2015

Brianna V. Thomason
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362
Email: brianna.thomason@knights.ucf.edu

Abstract
In this paper, ten architecture papers published between 2002 and 2015 are compared. Each implements a unique arithmetic logic unit design built on fundamental components, such as adders, multipliers, and floating-point units, which are examined in this paper. The designers also discuss how power consumption and energy efficiency are affected, which are compared in Table I, and why they are important to consider when implementing a new design. Clock rate, the number of clock cycles a CPU can perform per second, and supply voltage, the voltage at which the circuit is powered, are among the other metrics compared in the table.

Keywords
ALU, Power Consumption, Supply Voltage, ITRS Node, Execution Time, CPU, Energy Efficiency, Clock Cycle

I. INTRODUCTION

One of the main components of the computer's central processing unit (CPU) is the arithmetic logic unit (ALU). The ALU is a digital circuit that performs arithmetic and bitwise logical operations on operands loaded from registers. As shown in Fig. 1, the datapath of the ALU includes the input and output registers. The incoming control path carries the function select, while the outgoing control path carries the flag bits, including the zero and overflow flags. Internally, the ALU is composed of the logic gates that perform addition, subtraction, and multiplication operations as well as any Boolean operations.

The processor executes a large set of instructions. These instructions tell the CPU which arithmetic or logical calculation the ALU will execute. Furthermore, they tell the CPU the locations of the data and where to store the results.
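The ALU interface described above (two operands, a function select, one result, and zero and overflow flags) can be illustrated with a small behavioral model. This is a minimal sketch rather than any of the reviewed designs: the function names, the `AluResult` type, the default 4-bit width, and the choice of operations are assumptions made for illustration, and overflow is interpreted as signed two's-complement overflow.

```python
from dataclasses import dataclass


@dataclass
class AluResult:
    value: int      # result truncated to the datapath width
    zero: bool      # zero flag: set when the truncated result is 0
    overflow: bool  # overflow flag: signed two's-complement overflow


def alu(op: str, a: int, b: int, width: int = 4) -> AluResult:
    """Behavioral model of a small ALU (hypothetical interface).

    `op` plays the role of the function select on the incoming
    control path; the flags model the outgoing control path.
    """
    mask = (1 << width) - 1      # keeps only `width` bits
    sign = 1 << (width - 1)      # position of the sign bit
    a &= mask
    b &= mask
    overflow = False

    if op == "add":
        value = (a + b) & mask
        # Overflow: operands share a sign, result has the other sign.
        overflow = bool(~(a ^ b) & (a ^ value) & sign)
    elif op == "sub":
        value = (a - b) & mask
        # Overflow: operands differ in sign, result sign differs from a.
        overflow = bool((a ^ b) & (a ^ value) & sign)
    elif op == "and":
        value = a & b
    elif op == "or":
        value = a | b
    elif op == "xor":
        value = a ^ b
    else:
        raise ValueError(f"unsupported function select: {op}")

    return AluResult(value=value, zero=(value == 0), overflow=overflow)


if __name__ == "__main__":
    print(alu("add", 7, 1))   # 7 + 1 does not fit in signed 4 bits
    print(alu("sub", 5, 5))   # result is zero, so the zero flag is set
```

Multiplication is left out to keep the sketch short; a real ALU would also expose a carry flag and a wider set of function codes.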
The registers provide the sources of the data needed to execute the calculations and the destination for the results. The data bus sends data to and receives data from the memory. Occasionally, CPUs will have a data bus width that is narrower than the ALU width, which minimizes the cost of the chip.

Although energy and power are distinct from each other, they are interconnected. Power is the rate at which the CPU consumes energy, so the energy consumed equals the power dissipated multiplied by the running time. The goal for engineers is to design processors that use less power and conserve energy, which reduces total cost and is better for the environment. Effective ways to reduce power dissipation include decreasing the clock rate, lowering the supply voltage, and using multiple, slower cores in the design. The following equations are used to determine the CPU's execution time, power dissipation, and energy consumption:

(1) CPU Time = Instruction Count * CPI * Clock Cycle Time
(2) Power = Capacitance * Voltage^2 * Frequency
(3) Energy = Power * Running Time
(4) Energy Efficiency = Process Throughput / Energy Consumed

The International Technology Roadmap for Semiconductors (ITRS) is a fifteen-year assessment of the future technology requirements of the semiconductor industry. Moore's Law states that the number of transistors on a chip doubles approximately every two years. The trend is expected to end around the year 2022, when the ITRS node reaches 5 nanometers.

Ten arithmetic logic unit designs, published between 2002 and 2015, are reviewed in Section II, each with a unique implementation. The authors discuss their projects in terms of energy, power, and ITRS technology. The following section compares and contrasts their research and designs.

Fig. 1 Design of the Arithmetic Logic Unit
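Equations (1) through (4) can be expressed directly as functions, which makes it easier to see how clock rate and supply voltage trade off against power and energy. The sketch below is illustrative only: the function and parameter names are assumptions, and the workload numbers in the usage section are hypothetical, not measurements from any of the reviewed papers.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_time_s: float) -> float:
    """Equation (1): execution time in seconds."""
    return instruction_count * cpi * clock_cycle_time_s


def dynamic_power(capacitance_f: float, voltage_v: float, frequency_hz: float) -> float:
    """Equation (2): power in watts from switched capacitance, voltage, frequency."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


def energy(power_w: float, running_time_s: float) -> float:
    """Equation (3): energy in joules."""
    return power_w * running_time_s


def energy_efficiency(throughput: float, energy_j: float) -> float:
    """Equation (4): work completed per joule consumed."""
    return throughput / energy_j


def voltage_scaling_factor(v_old: float, v_new: float) -> float:
    """Ratio of new to old power under Equation (2) when only the
    supply voltage changes (capacitance and frequency held fixed)."""
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    # Hypothetical workload: 1e9 instructions, CPI of 1.5, 2 GHz clock.
    freq = 2e9
    t = cpu_time(1e9, 1.5, 1 / freq)
    # Hypothetical switched capacitance of 1 nF at 1.0 V.
    p = dynamic_power(1e-9, 1.0, freq)
    e = energy(p, t)
    print(f"time = {t:.3f} s, power = {p:.3f} W, energy = {e:.3f} J")
    print(f"efficiency = {energy_efficiency(1e9, e):.3e} instructions/J")
    # Lowering the supply from 1.1 V to 0.8 V with all else equal:
    print(f"power scales by {voltage_scaling_factor(1.1, 0.8):.3f}")
```

Because voltage enters Equation (2) squared, lowering the supply from 1.1 V to 0.8 V alone cuts power roughly in half under this model. Measured results for a real technology change also reflect differences in capacitance and leakage, which the equation does not capture.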
II. LITERATURE REVIEW

In 2015, Soheil Salehi et al. examined the power consumption and cell area of an IEEE-754 Single Precision Floating-Point Unit (Fig. 2) in 15 nm and 45 nm Complementary Metal-Oxide Semiconductor (CMOS) technologies. The authors used power consumption and cell area to compare the two technologies. Their results revealed that the 15 nm design consumed four times less energy and used a supply voltage of 0.8 V, while the 45 nm design used a supply voltage of 1.1 V (Fig. 5) [1].

Fig. 2 FPU Functional Elements [1]

In 2015, Arman Roohi et al. designed a Quantum-Dot Cellular Automata (QCA) full adder. The results showed that the proposed full adder has advantages in latency, complexity, and area when compared to preceding designs [2].

In 2014, Mohammed Alawad et al. presented a high-performance reconfigurable discrete convolver specifically designed for FPGA-based image and video processors. Whereas the typical multiplier-based design has a runtime of O(n^2), the most significant benefit of the proposed design is that it can achieve approximately O(n) in algorithmic convolution, therefore being more scalable and energy efficient [3].

In 2013, Naveed Imran et al. proposed an active dynamic redundancy-based fault-handling approach exploiting the partial dynamic reconfiguration capability of static random-access memory-based field-programmable gate arrays. The experimental testing of the FaDReS algorithm exhibited valuable results for fault-handling situations [4].

In 2013, Bishwajeet Pandey et al. used latch-free clock gating techniques, applied in the ALU and implemented on the 90 nm Spartan-3, to reduce clock power and dynamic power consumption (Fig. 5). Although the clock gating techniques increase area, they reduce the clock power and dynamic power consumption of the overall design [7].

In 2009, Jian Huang et al. proposed a field-programmable gate array-based scalable architecture for discrete cosine transform (DCT) computation using FPGA dynamic partial reconfiguration. The authors analyzed certain specifications of their proposed design, such as power consumption, processing clock cycles, and reconfiguration overhead, and provided the detailed trade-offs. Power was measured at a clock rate of 41.79 MHz (Fig. 6). A low-precision implementation with reduced ROM size can be beneficial in terms of hardware and power consumption [5].

In 2009, M. Ramalatha et al. demonstrated the efficiency of the Urdhva Tiryagbhyam Vedic method for multiplication, which differs in the actual process of multiplication itself. The complexity, execution time, area, and power are reduced by utilizing these techniques in the computation algorithms of the coprocessor, which is used to build a high-speed, power-efficient multiplier [6].

In 2009, Kui Yi et al. analyzed the structure and algorithm of a floating-point ALU implementing multiplication and division operations. The floating-point multiplication and division ALU supports floating-point numbers in the IEEE-754 standard format. The results show that the floating-point ALU realizes the expected function. This ALU adopts a 4-level pipelined structure, with each stage acting as a single module. The pipelined structure executes operations in parallel and improves the system performance [9].

In 2004, Bhaskar Chatterjee et al. presented a high-performance 32-bit ALU for low-power applications, which minimized total energy. It was implemented in 180 nm to 65 nm CMOS technologies. The results showed that it is possible to reduce the ALU total energy by 18-24% with little added delay and a reduction in leakage power [10].

In 2002, A. Srivastava et al. designed a high-speed 4-bit ALU (Fig. 4), incorporating a ripple carry adder into the design (Fig. 3), to show the effectiveness of the Back-Gate Forward Substrate Bias (BGFSB) method in 1.2 µm N-well CMOS technology. The method is emphasized for its low-voltage and high-speed applications [8].

Fig. 3 Block diagram of a 4-bit ripple carry adder showing worst case delay [3]
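The 4-bit ripple carry adder of Fig. 3 can be modeled as a chain of full adders in which each stage waits for the carry of the stage before it. The following is a minimal behavioral sketch, not the circuit from [8]: the helper names and the least-significant-bit-first ordering are assumptions chosen for illustration.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum bit, carry out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_carry_add(a_bits: list[int], b_bits: list[int], carry_in: int = 0):
    """Add two equal-length bit lists (least significant bit first).

    The carry from each stage feeds the next one, which is why the
    worst-case delay grows linearly with the number of bits.
    """
    carry = carry_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bit, carry = full_adder(a, b, carry)
        sum_bits.append(sum_bit)
    return sum_bits, carry


def to_bits(value: int, width: int = 4) -> list[int]:
    """Unsigned integer to a bit list, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list[int]) -> int:
    """Bit list (least significant bit first) back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    # 1111 + 0001 forces the carry to ripple through all four stages.
    sum_bits, carry_out = ripple_carry_add(to_bits(15), to_bits(1))
    print(from_bits(sum_bits), carry_out)
```

Adding 1111 and 0001 is a worst-case input in the sense that the carry generated at the lowest bit must propagate through every stage before the final carry out settles.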
III. DATA ANALYSIS

Fig. 4 Number of bits in the operands of the designs in the indicated papers.
Fig. 5 Power consumption (mW) of the designs in the indicated papers.
Fig. 6 Clock rate of the designs in the indicated papers.
Fig. 7 Supply voltage of each design in the indicated papers.

IV. CONCLUSION

After comparing the ten arithmetic logic unit designs, published between 2002 and 2015 and each with a unique implementation, several design metrics proved to be related. The designs cover various types, including adders, multipliers, and floating-point units. The authors discuss their projects in terms of energy, power, and ITRS technology nodes, while this paper also compares the clock rates and supply voltages of each.

REFERENCES

[1] S. Salehi and R. F. DeMara, "Energy and Area Analysis of a Floating-Point Unit in 15nm CMOS Process Technology," in Proceedings of IEEE SoutheastCon 2015 (SECon-2015), Fort Lauderdale, FL, April 9-12, 2015.
[2] A. Roohi, R. F. DeMara, and N. Khoshavi, "Design and Evaluation of an Ultra-Area-Efficient Fault-Tolerant QCA Full Adder," Microelectronics Journal, vol. 46, no. 6, pp. 531-542, June 2015.
[3] M. Alawad, Y. Bai, R. F. DeMara, and M. Lin, "Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation," in Proceedings of the 22nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA-14), pp. 185-188, Monterey, California, USA, February 27-28, 2014.
[4] N. Imran, J. Lee, and R. F. DeMara, "Fault Demotion Using Reconfigurable Slack (FaDReS)," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 7, pp. 1364-1368, July 2013.
[5] J. Huang, M. Parris, J. Lee, and R. F. DeMara, "Scalable FPGA-based Architecture for DCT Computation Using Dynamic Partial Reconfiguration," ACM Transactions on Embedded Computing Systems, vol. 9, no. 1, art. 9, pp. 1-18, October 2009.
[6] M. Ramalatha, K. D. Dayalan, P. Dharani, and S. D. Priya, "High speed energy efficient ALU design using Vedic multiplication techniques," in International Conference on Advances in Computational Tools for Engineering Applications (ACTEA '09), pp. 600-603, July 15-17, 2009.
[7] B. Pandey, J. Yadav, M. Pattanaik, and N. Rajoria, "Clock gating based energy efficient ALU design and implementation on FPGA," in 2013 International Conference on Energy Efficient Technologies for Sustainability (ICEETS), pp. 93-97, April 10-12, 2013.
[8] A. Srivastava and D. Govindarajan, "A Fast ALU Design in CMOS for Low Voltage Operation," VLSI Design, vol. 14, no. 4, pp. 315-327, 2002.
[9] K. Yi and Y.-H. Ding, "32 bit Multiplication and Division ALU Design Based on RISC Structure," in International Joint Conference on Artificial Intelligence (JCAI '09), pp. 761-764, April 25-26, 2009.
[10] B. Chatterjee, M. Sachdev, and R. Krishnamurthy, "A CPL-based dual supply 32-bit ALU for sub 180nm CMOS technologies," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), ACM, pp. 248-251, New York, NY, USA, 2004.
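TABLE I. COMPARISON OF 10 ARITHMETIC LOGIC UNIT DESIGNS

Columns: Architecture Name; Datapath Width (bits in operands); Time for Operation; ALU Design Type (adder, multiplier, floating point); ITRS Technology Node (nm); Area or Model of Chip Used; Energy/Power Consumption (W or J, otherwise low/high); Clock Rate; Supply Voltage

Energy and Area Analysis of a Floating-Point Unit [1]
  Datapath width: 32 bits
  Design type: Floating Point, IEEE-754 Single Precision
  ITRS technology node: 45 nm and 15 nm (ITRS Technology)
  Power consumption: 2.048 mW (45 nm), 0.6340 mW (15 nm)
  Clock rate: 200 MHz
  Supply voltage: 1.1 V (45 nm), 0.8 V (15 nm)

Design and Evaluation of an Ultra-Area-Efficient Fault-Tolerant QCA Full Adder [2]
  Datapath width: 1 bit
  Design type: Ultra-area-efficient fault-tolerant QCA full adder
  Area: 18 nm^2
  Power consumption: Low

Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation [3]
  Datapath width: 128 bits
  Time for operation: 4.09 µs
  Design type: Energy-Efficient
  ITRS technology node: 40 nm
  Model of chip used: Virtex-6 FPGA devices (XC6VLX550t)
  Energy consumption: 166.63 nJ
  Clock rate: 250 MHz
  Supply voltage: 1.0 V

Fault Demotion Using Reconfigurable Slack (FaDReS) [4]
  Datapath width: 32 bits
  Design type: DSP48
  ITRS technology node: 90 nm
  Model of chip used: Virtex-4 FX60 FPGA
  Power consumption: 1541 mW
  Clock rate: 108 MHz
  Supply voltage: 1.2 V

Scalable FPGA-based Architecture for DCT Computation Using Dynamic Partial Reconfiguration [5]
  Datapath width: Adaptive, 1 to 8 bits
  Design type: Reconfigurable PE for DCT using ROM Shifters (RS)
  ITRS technology node: 90 nm
  Model of chip used: Virtex-4 SX35 FPGA
  Power consumption: 24.03-26.27 mW
  Clock rate: 41.79 MHz (power), 100 MHz (SelectMap)
  Supply voltage: 1.2 V

High Speed Energy Efficient ALU Design using Vedic Multiplication Techniques [6]
  Datapath width: Adaptive, 8 to 64 bits
  Time for operation: 15 ns, 45 ns
  Design type: Optimized Vedic Multiplier, Vedic MAC unit
  Power consumption: Low

Clock Gating Based Energy Efficient ALU Design and Implementation on FPGA [7]
  Datapath width: 8 bits
  Design type: Carry Out and Carry In Adders
  ITRS technology node: 90 nm
  Model of chip used: Spartan-3 FPGA
  Power consumption: 23889 mW (1 THz), 2433 mW (100 GHz), 36 mW (1 GHz), 3 mW (100 MHz)
  Clock rate: 1 THz, 100 GHz, 1 GHz, 100 MHz
  Supply voltage: 1.2 V

A Fast ALU Design in CMOS for Low Voltage Operation [8]
  Datapath width: 4 bits
  Design type: Ripple Carry Adder
  ITRS technology node: 1200 nm N-well CMOS
  Power consumption: Low
  Clock rate: 0.5 MHz
  Supply voltage: 1 V

32-bit Multiplication and Division ALU Design Based on RISC Structure [9]
  Datapath width: 32 bits
  Design type: RISC Structure, IEEE-754 Standard
  Model of chip used: GW48 EDA system

A CPL-Based Dual Supply 32-bit ALU for Sub 180nm CMOS Technologies [10]
  Datapath width: 32 bits
  Design type: Propagate-Generate Unit
  ITRS technology node: 180 nm to 65 nm CMOS
  Area or model of chip used: MUX
  Energy consumption: Reduction in energy of 18-24%
  Clock rate: 4.2 GHz
  Supply voltage: 0.7-1.0 V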