The Metrics and Designs of an Arithmetic Logic Function over 2002-2015

Jimmy Vallejo
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362

Abstract

There have been many modifications made to Arithmetic Logic Functions between 2002 and 2015. The Arithmetic Logic Function is like the heart of the CPU: if it is removed, there is no way of getting information across to memory. It has been modified by implementing floating point, clock gating, pipeline gating, FIR structures, and the QCA full adder. Every implementation is designed to be faster and better than the last. The lower the power consumption and the smaller the area needed, the more superior the design becomes. For example, the floating-point unit has a 32-bit word width in IEEE 754 single precision, with power consumption falling from 2.048 mW (45 nm) to 0.6340 mW (15 nm). It faced two difficult challenges for CMOS devices, power density and area, but the results show that the 15 nm technology is better than the 45 nm technology. The 15 nm technology offers a 3 to 4 fold improvement in energy efficiency over the 45 nm technology and about 30% less cell area.

Keywords: power consumption, word width, ALU, clock gating, execution time, cell area, clock rate.

I. INTRODUCTION

ALU stands for arithmetic logic unit. It allows the computer to add, subtract, and perform basic logical operations such as AND and OR. It is an electrical circuit that performs arithmetic and bitwise logical operations on binary numbers. The ALU is a very important piece of a CPU (central processing unit), an FPU (floating point unit), and even a GPU (graphics processing unit); each of these may contain multiple ALUs. Sometimes the ALU is subdivided into two units, for example one for fixed-point operations and the other for floating-point operations. Every ALU has direct input and output access to the processor controller, main memory, and input/output devices.
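The basic operations named above (add, subtract, AND, OR) can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `alu`, the operation labels, and the default 8-bit width are hypothetical choices made for illustration and do not come from any of the surveyed designs.

```python
from typing import Tuple


def alu(op: str, a: int, b: int, width: int = 8) -> Tuple[int, bool]:
    """Return (result, carry_out) for one operation on two unsigned operands.

    The operation labels and the width parameter are illustrative assumptions.
    """
    mask = (1 << width) - 1  # keeps every value inside the word width
    a &= mask
    b &= mask
    if op == "ADD":
        total = a + b
        return total & mask, total > mask  # carry out of the top bit
    if op == "SUB":
        # Two's complement subtraction: a + (~b) + 1
        total = a + ((~b) & mask) + 1
        return total & mask, total > mask  # carry set means no borrow occurred
    if op == "AND":
        return a & b, False
    if op == "OR":
        return a | b, False
    raise ValueError(f"unsupported operation: {op}")


print(alu("ADD", 200, 100))  # (44, True): 300 does not fit in 8 bits
print(alu("SUB", 5, 3))      # (2, True)
```

The carry flag returned alongside the result plays the role of a status setting: it tells the caller whether the result fit in the chosen word width.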
Input is the data fed into the ALU in order to make it function. The output is the result obtained once the operation has completed. Inputs and outputs move along an electrical path called a bus. The input consists of an instruction that contains an operation code and sometimes a format code. The operation code, also referred to as the opcode, tells the ALU what operation to perform. For instance, two operands might be added or subtracted, or compared logically. The format code is combined with the opcode and states whether it is a fixed-point or a floating-point instruction. Afterwards the output is placed in a storage register, together with settings that indicate whether the operation was completed successfully.

Some of the key components discussed in this paper are important to every computer system's performance. The data bus word width determines how much information may be carried in a single instruction. For instance, a processor described as a 16-bit processor has a 16-bit data bus word width. An ITRS technology node is the half pitch between two adjacent DRAM metal lines, although a company quoting a node such as 130 nm may instead be referring to the minimum gate length (Lmin) of a MOSFET. Execution time is the time that it takes for a single instruction to be executed; it also makes up the last half of the instruction cycle. For a sequence of instructions, execution time is calculated by multiplying the instruction count by the cycles per instruction (CPI) and by one over the clock rate ([Instr. Count] x [CPI] x [1/Clock Rate]).

Power dissipation occurs when central processing units consume electrical energy and dissipate it through the action of the switching devices or through energy lost to heating of the material. It is calculated by multiplying current by voltage (P = I x V). The energy consumption of a processor is the amount of energy the computer needs in order to operate properly. The design of the ALU is a critical part of the processor, and new approaches to speeding up instruction execution are still being developed today.
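The execution-time and power formulas given above can be checked with a short calculation. The sketch below is illustrative only: the function names and the sample numbers (one million instructions, a CPI of 2, a 100 MHz clock, 0.5 A at 1.2 V) are assumptions chosen for the example and are not taken from any of the surveyed papers.

```python
def execution_time_s(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """Execution time = instruction count x CPI x (1 / clock rate)."""
    return instruction_count * cpi * (1.0 / clock_rate_hz)


def power_w(current_a: float, voltage_v: float) -> float:
    """Power dissipation P = I x V, in watts."""
    return current_a * voltage_v


# Hypothetical workload: 1,000,000 instructions, CPI of 2, 100 MHz clock.
t = execution_time_s(1_000_000, 2.0, 100e6)
print(f"Execution time: {t * 1e3:.1f} ms")  # Execution time: 20.0 ms

# Hypothetical supply: 0.5 A drawn at 1.2 V.
p = power_w(0.5, 1.2)
print(f"Power: {p:.2f} W")  # Power: 0.60 W
```

Halving the CPI or doubling the clock rate halves the execution time, which is why both appear as design targets in the papers reviewed in Section II.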
Many engineers, for example, build Ripple Carry Adders (RCA), Carry Look Ahead Adders (CLA), and Carry Select Adders (CSA) into their designs to make addition, subtraction, multiplication, and division more efficient. To add two 32-bit numbers, an RCA chains one full adder per bit so that the carry out of each bit position feeds the next. Every ALU is designed differently in order to function faster and more efficiently. The faster the ALU operates, the better. In Section II there are ten ALU designs spanning 2002-2015. I will discuss some of the metrics implemented in ALUs over the years, along with their functionality.

II. LITERATURE REVIEW

The energy analysis of a floating-point unit was presented in 2015 at IEEE SoutheastCon [1]. The design analyzed uses 32-bit operands in IEEE 754 single-precision floating point, with power consumption falling from 2.048 mW (45 nm) to 0.6340 mW (15 nm). The power density and area were two difficult
challenges for CMOS devices, but the results show that the 15 nm technology is better than the 45 nm technology. The 15 nm technology offers a 3 to 4 fold improvement in energy efficiency over the 45 nm technology and about 30% less cell area.

"Design and Evaluation of an Ultra-Area-Efficient Fault-Tolerant QCA Full Adder" was published in 2015 in Microelectronics Journal [2]. It involves 1-bit operands, an ultra-area-efficient fault-tolerant QCA full adder, and a cell area of 18 nm^2. The full adder achieves significant advances over previous designs in terms of cell count and area. The effectiveness of the adder is verified through implementation of a 4-bit carry save adder.

"Design & Analysis of 16 bit RISC Processor Using Low Power Pipelining" was presented at a 2015 international conference [6]. It explains a pipelining scheme for high-throughput FIR. By building pipelined multipliers and adders into the design, it achieves very high throughput. A 2-dimensional pipeline gating technique makes the designed FIR power-aware with respect to the accuracy of the operands. This model uses 16-bit operands, a Carry Save Adder, and a Virtex 6 FPGA device (XC6VLX550t).

"Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation" was presented in 2014 in Monterey, California [3]. The design, implemented on a Virtex 6 FPGA device (XC6VLX550t), requires 4.09 µs to perform a 128 x 128 convolution and dissipates only 166.63 nJ of energy at 250 MHz. Computing in the probabilistic domain also allows simpler basic operations to be used, which achieves higher energy efficiency at a lower hardware cost. This makes the probabilistic convolver even more valuable as the problem size increases.

"Clock Gating Based Energy Efficient ALU Design and Implementation on FPGA" was presented in an international conference paper in 2013 [7].
This design uses a 4-bit datapath with a clock-gating-based energy-efficient ALU, a 90 nm design (ITRS node) on a Spartan-3, and low power consumption. Without clock gating, clock power is 50%, 41.46%, 51.30%, 55.15% and 55.78% of total dynamic power when the device operating frequency is 100 MHz, 1 GHz, 10 GHz, 100 GHz and 1 THz, respectively. After clock gating techniques are applied to the ALU, the clock power reduces to 17.85%, 23.39%, 26.49% and 27.19% of total dynamic power when the device operating frequency is 1 GHz, 10 GHz, 100 GHz and 1 THz, respectively. With clock gating there is a 72.77% reduction in clock power, a 38.88% reduction in I/O power, and a 44% reduction in dynamic power in comparison to power consumption without clock gating.

"Improving Power-awareness of Pipelined Array Multipliers using 2-Dimensional Pipeline Gating and its Application to FIR Design" was published in Integration, the VLSI Journal in 2006 [5]. This design uses 16-bit operands, an array multiplier, a 0.24 µm chip, and low power consumption. The approach gates the clocks to the registers in both the vertical and horizontal directions. For multipliers using 2's complement representation, sign extension tends to waste power and lengthen delays; both could be avoided by using this system. Simulation results show that an average power saving of 66% and a latency reduction of 47% can be achieved with this implementation.

"A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS" was published in the IEEE Journal of Solid-State Circuits in 2005 [9]. This design has 64-bit operands, a carry adder, a cell area of 0.073 mm^2, and a power consumption of 300 mW, as stated in its title.

"A high performance 32-bit ALU for programmable logic" was presented at the 12th international symposium on Field programmable gate arrays in 2004 [10].
It operates with 32-bit operands, a high performance 32-bit ALU, Altera's NIOS 2.0 processor chip, and low power consumption.

"High Throughput Power-aware FIR Filter Design based on Fine-grain Pipeline Multipliers and Adders" was presented at the 2003 IEEE Annual Symposium on VLSI in Tampa, Florida [4]. The design has 16-bit operands, a FIR structure multiplier, a 0.24 µm chip, and low power consumption.

"A Fast ALU Design in CMOS for Low Voltage Operation" was published in VLSI Design in 2002 [8]. This implementation has 4-bit operands, a Ripple Carry Adder, a 1.2 µm n-well CMOS technology, and low power consumption.

III. DATA ANALYSIS

Figure 1. The number of bits in the operands of different processors from 2002-2015.
Figure 2. Node range from 2013 to 2015.
Figure 3. The area of different processors throughout 2002-2015.
Figure 4. No execution time data.
Figure 5. Power and energy consumption from 2005 to 2015. The other years are indicated only qualitatively (for example, as low) and so could not be graphed.

Metrics covered by various papers which are suitable for plotting:
Data bus width (bits) vs. Year
ITRS technology node (nm) vs. Year
Execution time per ALU or Floating Point Unit operation (nsec) vs. Year
Power or Energy vs. Year
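Two of the numbers quoted in Section II can be cross-checked with simple arithmetic. The sketch below is a hedged illustration: the function names are hypothetical, and the derived values (the power ratio and the convolver's average power) are computed here from the quoted figures and are not reported in the cited papers. The first result is a ratio of power figures only, so it is at best a rough consistency check against the 3 to 4 fold energy-efficiency claim in [1].

```python
def fold_improvement(baseline: float, improved: float) -> float:
    """How many times smaller the improved figure is than the baseline."""
    return baseline / improved


def average_power_w(energy_j: float, time_s: float) -> float:
    """Average power = energy / time, in watts."""
    return energy_j / time_s


# Floating-point unit [1]: 2.048 mW at 45 nm versus 0.6340 mW at 15 nm.
fpu_gain = fold_improvement(2.048e-3, 0.6340e-3)
print(f"15 nm vs 45 nm FPU power: {fpu_gain:.2f}x lower")  # 3.23x lower

# Convolver [3]: 166.63 nJ dissipated over 4.09 microseconds.
conv_power = average_power_w(166.63e-9, 4.09e-6)
print(f"Convolver average power: {conv_power * 1e3:.2f} mW")  # 40.74 mW
```

Converting the convolver's energy figure into an average power makes it comparable, in units at least, with the milliwatt figures plotted for the other designs.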
IV. CONCLUSION

In this study I have learned many ways to implement a design in order to achieve low power consumption, across a large range of different designs from 2002-2015. For instance, for a CMOS floating-point unit with a 32-bit data bus it is better to use a 15 nm node instead of a 45 nm node, because it uses up to 30% less cell area and achieves a 3 to 4 fold improvement in energy efficiency. Also, for an ALU, one way to reduce cell count and area is to use the QCA full adder. The survey shows how ALUs have been implemented over the years using pipeline gating, floating point, clock gating, FIR structures, and, most interesting, the QCA full adder.

REFERENCES

[1] S. Salehi and R. F. DeMara, "Energy and Area Analysis of a Floating-Point Unit in 15nm CMOS Process Technology," in Proceedings of IEEE SoutheastCon 2015 (SECon-2015), Fort Lauderdale, FL, April 9-12, 2015.
[2] A. Roohi, R. F. DeMara, and N. Khoshavi, "Design and Evaluation of an Ultra-Area-Efficient Fault-Tolerant QCA Full Adder," Microelectronics Journal, Vol. 46, No. 6, pp. 531-542, June 2015.
[3] M. Alawad, Y. Bai, R. F. DeMara, and M. Lin, "Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation," in Proceedings of 22nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA-14), pp. 185-188, Monterey, California, USA, February 27-28, 2014.
[4] J. Di, J. S. Yuan, and R. DeMara, "High Throughput Power-aware FIR Filter Design based on Fine-grain Pipeline Multipliers and Adders," in Proceedings of the 2003 IEEE Annual Symposium on VLSI (ISVLSI-03), pp. 260-261, Tampa, Florida, U.S.A., February 20-21, 2003.
[5] J. Di, J. S. Yuan, and R. F. DeMara, "Improving Power-awareness of Pipelined Array Multipliers using 2-Dimensional Pipeline Gating and its Application to FIR Design," Integration, the VLSI Journal, Vol. 39, No. 2, pp. 90-112, March 2006.
[6] P. Trivedi and R. P. Tripathi, "Design & analysis of 16 bit RISC processor using low power pipelining," in 2015 International Conference on Computing, Communication & Automation (ICCCA), pp. 1294-1297, May 15-16, 2015.
[7] B. Pandey, J. Yadav, M. Pattanaik, and N. Rajoria, "Clock gating based energy efficient ALU design and implementation on FPGA," in 2013 International Conference on Energy Efficient Technologies for Sustainability (ICEETS), pp. 93-97, April 10-12, 2013.
[8] A. Srivastava and D. Govindarajan, "A Fast ALU Design in CMOS for Low Voltage Operation," VLSI Design, vol. 14, no. 4, pp. 315-327, 2002.
[9] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 40, no. 1, pp. 44-51, Jan. 2005.
[10] P. Metzgen, "A high performance 32-bit ALU for programmable logic," in Proceedings of the ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA '04), ACM, New York, NY, USA, pp. 61-70, 2004.
TABLE I. Summary of the surveyed ALU and floating-point designs, 2002-2015. A hyphen marks an entry left blank.

ALU or Floating Point Architecture Name | Datapath width (bits) or #bits in operands | Time for Operation | Design Type (Adder or Multiplier or Floating Point) | ITRS Technology Node (nm) or Area or Model of Chip used | Energy/Power Consumption (W or J), else indicate low or high
Energy and Area Analysis of a Floating-Point Unit [1] | 32 bits (Operands) | - | IEEE-754 Single Precision | 45nm and 15nm (ITRS Node) | 2.048mW (45nm), 0.6340mW (15nm)
Ultra-area-efficient fault-tolerant QCA full adder [2] | 1 bit (Operands) | - | Ultra-area-efficient fault-tolerant QCA full adder | 18nm^2 (Cell Area) | low
Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation [3] | 128 bits (Operands) | 4.09 μs | Energy-Efficient Multiplier | Virtex 6 FPGA devices (XC6VLX550t) (Model of Chip used) | 166.63 nJ
Design & Analysis of 16 bit RISC Processor Using Low Power Pipelining [6] | 16 bit (Operands) | - | Carry Save Adder | Virtex 6 FPGA devices (XC6VLX550t) (Model of Chip used) | -
Clock Gating Based Energy Efficient ALU Design and Implementation on FPGA [7] | 4 bit (Datapath) | - | Clock Gating Based Energy Efficient ALU Design | 90nm (ITRS Node), Spartan-3 | -
A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS [9] | 64 bit (Operands) | - | Carry | 0.073mm^2 | 300mW
High Throughput Power-aware FIR Filter Design based on Fine-grain Pipeline Multipliers and Adders [4] | 16 bit (Operands) | - | FIR Structure | 0.24um | -
Improving Power-awareness of Pipelined Array Multipliers using 2-Dimensional Pipeline Gating and its Application to FIR Design [5] | 16 bit (Operands) | - | Array Multiplier | 0.24um | -
A Fast ALU Design in CMOS for Low Voltage Operation [8] | 4 bit (Operands) | - | Ripple Carry Adder | 1.2um n-well CMOS | -
A high performance 32-bit ALU for programmable logic [10] | 32 bit (Operands) | - | A high performance 32-bit ALU for programmable logic | Altera's NIOS 2.0 Processor | -