An Explanation of Improving the Quality of the ALU and Ten Different Types of Designs for Decreasing Power Dissipation

Jihang Li
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362

Abstract
The goal of this paper is to explain what an ALU is and how its quality can be improved. The topics include a basic explanation of the ALU and the one-bit Full Adder, the different types of adders, and, more importantly, the requirements for increasing the quality of an ALU. The key requirements for a better ALU are the width of the data bus, the ITRS technology node, the execution time, and the energy consumption. Moreover, this paper discusses ten different ALU designs. Most of them are designed for lower energy consumption at different node lengths. For instance, design number 10 in the table is a 32-bit ALU in 180 nm CMOS technologies whose goal is to decrease the power dissipation.

Keywords
CMOS technology node, Adder, Multiplier, Floating point, Data path width, Operand, Model of chip, Execution time, Power dissipation, Energy consumption.

I. INTRODUCTION
The arithmetic logic unit (ALU) is the execution unit inside the central processing unit (CPU). It is the core part of the central processor, and it is formed by logic units that perform arithmetic using AND gates and OR gates. Its main function is to calculate with binary code, for example addition and subtraction. All of its operations are directed by the control unit. Today, essentially all CPU architectures represent data in binary form. The ALU is built around integer arithmetic, and it needs circuits inside the chip to carry out the calculation. In other words, the ALU is the digital circuitry dedicated to performing arithmetic and logical operations. The ALU is the main part of the central processor; even the smallest microprocessor needs the counting function of an ALU.
Early computers used a variety of number representations for calculation, including ones' complement and sign-magnitude. Most processors now use two's complement binary, because it simplifies addition and subtraction. The vast majority of computer instructions are executed by the ALU. It pulls the data from registers, operates on the data, and stores the result in the output register inside the ALU. For example, to add the two numbers 3 and 4, the operand 3 is loaded into the accumulator register before the addition. When processing begins, the ALU adds the two numbers to obtain 7 and writes it back to the accumulator, replacing the original number 3. Other components are responsible for transferring data between registers and memory. The control unit controls the ALU through control circuits that issue the instructions.

The two basic arithmetic operations are addition and subtraction. The ALU also performs logic operations such as AND, OR, NOR, and XOR. The ALU operates on two inputs for addition, subtraction, and so on. The design of the ALU is based on Full Adders. A one-bit Full Adder is a combination of two XOR gates, two AND gates, and one OR gate. It has two inputs A and B with a carry-in Ci, and one output S with a carry-out Cout. The output functions are $S = A \oplus B \oplus C_i$ and $C_{out} = A \cdot B + C_i \cdot (A \oplus B)$. Several common adders can be used inside the ALU, such as the Half Adder, Ripple-carry Adder, Carry-save Adder, and Lookahead carry unit. A Carry Select Adder design has the highest speed but requires more gates. A Ripple-carry Adder requires the fewest gates, but it is slower.

Truth table of the one-bit Full Adder:
Ci  A  B  S  Cout
0   0  0  0  0
0   0  1  1  0
0   1  0  1  0
0   1  1  0  1
1   0  0  1  0
1   0  1  0  1
1   1  0  0  1
1   1  1  1  1

The strength of an ALU depends on the number of operations it supports and on its operating speed; moreover, it reflects the strength of the computer. The basic operation is addition: adding zero to a number simply passes that number through.
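As a minimal sketch of the Full Adder equations and truth table above, the following Python functions model the gates with bitwise operators and chain them into a Ripple-carry Adder. The function names `full_adder` and `ripple_carry_add` and the 8-bit default width are illustrative assumptions and are not taken from any of the surveyed designs.

```python
from itertools import product


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit full adder built from two XOR, two AND, and one OR gate."""
    p = a ^ b                      # first XOR gate: A xor B
    s = p ^ c_in                   # second XOR gate: sum bit S
    c_out = (a & b) | (c_in & p)   # two AND gates feeding one OR gate
    return s, c_out


def ripple_carry_add(x: int, y: int, width: int = 8) -> tuple[int, int]:
    """Add two unsigned width-bit integers by chaining full adders, LSB first.

    Returns (sum modulo 2**width, final carry-out).
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    # Reproduce the truth table in the same column order: Ci A B S Cout.
    for c_in, a, b in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c_in)
        print(c_in, a, b, s, c_out)
    # The accumulator example from the text: 3 + 4.
    print(ripple_carry_add(3, 4))
```

Because the carry has to pass through every stage in turn, the loop also shows why a Ripple-carry Adder is slow even though it needs few gates.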
Adding a non-zero number is equivalent to subtracting its negation. The subtraction of two numbers can also be used to compare their sizes. Of course, multiplication and division have a higher cost, and they are more complicated. Multiplication is based on the addition operation: one number is added to itself as many times as the other number indicates. For instance, multiplying 7 by 7 is the same as adding seven 7s together. The same holds for division, which can be carried out by repeated subtraction. Therefore, multiplication and division are slower than addition and subtraction. Multiplication and division by powers of two can also be performed by shifting: shifting left by one bit multiplies by two, and shifting right by one bit divides by two.
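The relationship between multiplication, addition, and shifting described above can be sketched in Python as follows. This is only an illustration for non-negative integers; the function names are hypothetical, and real ALUs implement these operations in hardware rather than with loops.

```python
def multiply_repeated_add(a: int, b: int) -> int:
    """Multiply by adding a to a running total b times (b >= 0)."""
    total = 0
    for _ in range(b):
        total += a
    return total


def multiply_shift_add(a: int, b: int) -> int:
    """Multiply using shifts and additions (both operands >= 0).

    For every set bit of b, add the correspondingly left-shifted copy of a.
    """
    product = 0
    while b:
        if b & 1:          # lowest bit of b is set
            product += a
        a <<= 1            # shifting left by one bit doubles a
        b >>= 1            # shifting right by one bit halves b
    return product


def divide_repeated_sub(n: int, d: int) -> tuple[int, int]:
    """Divide n >= 0 by d > 0 using repeated subtraction.

    Returns (quotient, remainder).
    """
    if d <= 0:
        raise ValueError("divisor must be positive")
    quotient = 0
    while n >= d:
        n -= d
        quotient += 1
    return quotient, n


if __name__ == "__main__":
    print(multiply_repeated_add(7, 7))   # seven 7s added together
    print(multiply_shift_add(7, 7))
    print(divide_repeated_sub(49, 7))
```

The repeated-addition version needs as many additions as the value of the second operand, while the shift-and-add version needs at most one addition per bit of it, which is one reason shifting is attractive.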

The ALU can directly access the control unit, the memory, and the input-output, which increases the speed of its operations. Input and output are carried over buses. The most common ALU has two inputs and one output. The two input operands receive the data, which can be 1 bit, 4 bits, 64 bits, and so on. The input command contains an instruction word, that is, a machine instruction word. The output is the result of the operation. The goal of the ALU is to process the input data.

The quality of an ALU depends on the ITRS technology node as well as on the width of the data bus, the execution time, the power dissipation, and the energy consumption of the processor. The data bus is a group of lines that transmit information from one component to another. The wider the data bus, the more information can be transferred into the ALU. The width of the data bus can be 1 bit, 32 bits, or 128 bits. However, the wider the data bus, the higher the cost of implementation. The ITRS technology node indicates the gate length of the transistors inside the ALU; the smaller the node, the more transistors the ALU can contain. The node length decreases roughly every two years. For instance, the semiconductor device fabrication node shrank from 10 um in 1971 to 14 nm in 2014.

Next, the execution time measures the speed of the ALU. There are several main ways to reduce the execution time of an ALU. Firstly, using dedicated multiplication and division instead of repeated addition and subtraction saves time. Secondly, increasing the width of the data bus increases the number of bits that can be sent to the ALU at once. Lastly, a Carry Select Adder design increases the speed of the ALU. Power dissipation is also a measure of the quality of an ALU. The length of the ITRS technology node affects the power dissipation, and so does the choice of adder. For instance, the more gates the ALU requires, the larger its power dissipation.
The energy consumption of a processor depends on power dissipation and execution time according to the formula $E = P \cdot t$. There are several different kinds of arithmetic units. One that processes all of the bits at the same time is called a parallel arithmetic unit. One that processes only one bit at a time is called a serial arithmetic unit. Some process 4 bits or 8 bits at a time. Of course, the more bits that are processed at the same time, the higher the speed. Moreover, the more capable the ALU, the better the CPU. The next section covers the ten designs, with their different nodes, data path widths, and operation times.

II. LITERATURE REVIEW
The first design in the table is a Floating-point Unit for the ITRS nodes of 45 nm and 15 nm [1]. The data path width is 32 bits, and the design uses IEEE-754 single precision. The goal of this design is to analyze the power consumption at the two nodes. The power consumption is calculated from the total dynamic count, the static count, and the quantity of gates of different sizes. The power consumption is 2.048 mW at the 45 nm ITRS node and 0.6340 mW at 15 nm. From this result, we can conclude that the smaller the ITRS node, the lower the power dissipation. In comparison, the second design is an ultra-area-efficient fault-tolerant QCA full adder [2]. Its data path width is smaller than that of the first design, its cell area is 18 nm^2, and it achieves lower power dissipation. The conclusion of the second design is that testing different chip areas with a small operand size decreases the power dissipation.

The choice of multiplier and the size of the operands also affect the power dissipation and energy consumption, as designs [3] and [5] show. Design [3] uses an Energy-Efficient Multiplier with 128-bit operands, and the model of chip used is a Virtex 6 FPGA device (XC6VLX550t) [3]. The energy consumption of this design is 166.63 nJ.
Design [5] uses 9-bit operands with DSP48 multipliers, and the models of chip used are FPGA, PRM, DCT, and HWICAP. Design [5] also decreases the power dissipation: it is 0.023 mW for PRM, 0.061 mW for HWICAP, and 0.081 mW for DCT. Design [4] uses Fine-Grained Pipelining Adders and Multipliers with 16-bit operands. The result is very successful, with high throughput and low power dissipation. With 0.24 um static CMOS, this design saves around 62.5% of the power dissipation of the original design. As in design [4], the power dissipation in design [6] also decreases with a multiplier. Design [6] uses a Carry Select Adder and a Barrel Shift Rotator with 16-bit operands, and the ITRS node is 28 nm. The low-power result is tested using Verilog HDL, and the simulation shows a decrease in execution time. From the results of these two designs, we can conclude that a multiplier decreases the execution time and therefore the energy consumption. However, these two designs cost more than the other designs.

Designs [7], [8], [9], and [10] use only adders to decrease the power dissipation. Design [7] uses Carry Out Adders and Carry In Adders with 8-bit operands. Evaluated with different numbers of clocks, logic gates, and signals at different frequencies, the energy consumption of this design on the 90 nm Spartan-3 is low. Comparing designs [8] and [9], design [9] is more successful. Design [8] uses a sparse-tree semi-dynamic adder and a 32b radix-2 sparse-tree adder with 64-bit operands. Its power dissipation in 90 nm CMOS is 300 mW. Design [9] uses an ELM Adder with 8-bit operands. Its power dissipation on the 40 nm Virtex-6 FPGA is 88 mW. Design [9] therefore has the lower power dissipation, which is consistent with the node lengths that designs [8] and [9] use. The last design, [10], uses a 32-bit adder with 32-bit operands.
Over a CMOS range from 180 nm down to 65 nm, the power dissipation also decreases as the length of the ITRS node decreases.
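The formula $E = P \cdot t$ from the introduction can be applied to the figures quoted in this review. The sketch below is only illustrative arithmetic: it assumes that the power (or energy) and the operation time listed for a design in Table I refer to the same operation, which the cited papers may not intend, and the helper names are hypothetical.

```python
def energy_joules(power_watts: float, time_seconds: float) -> float:
    """Energy from average power and execution time: E = P * t."""
    return power_watts * time_seconds


def average_power_watts(energy_j: float, time_seconds: float) -> float:
    """Average power from energy and execution time: P = E / t."""
    if time_seconds <= 0:
        raise ValueError("time must be positive")
    return energy_j / time_seconds


if __name__ == "__main__":
    # Design [8]: 300 mW and 30 ps as listed in Table I
    # (assumption: both values describe the same operation).
    e8 = energy_joules(300e-3, 30e-12)
    print(f"design [8]: {e8 * 1e12:.2f} pJ per operation")

    # Design [3]: 166.63 nJ and 4.09 us as listed in Table I
    # (same assumption).
    p3 = average_power_watts(166.63e-9, 4.09e-6)
    print(f"design [3]: {p3 * 1e3:.2f} mW average power")
```

Under these assumptions the two designs can be placed on a common energy or power scale, which the table itself does not provide.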

III. DATA ANALYSIS
Metrics covered by the various papers that are suitable for plotting:
Data bus width (bits) vs. Year
ITRS technology node (nm) vs. Year
Execution time per ALU or Floating Point Unit operation (nsec) vs. Year
Power or Energy vs. Year

From the graph of ITRS technology node vs. year, we can see that not every design uses an ITRS technology node. Designs [2], [3], and [5] are not included in the graph because they do not have an ITRS node. The designs in the graph, in order, are [4], [10], [8], [7], [9], [6], and [1]. We can conclude that the ITRS node decreases every few years.

In the graph of execution time vs. year, the designs in order are [8], [7], [5], and [3]. The dots for 2015-2017 are treated as zero, since those designs do not report an ALU execution time. Although some of the designs do not report an execution time, it can be judged from the table by comparing the types of operations used in the designs. From the graph, we can conclude that the execution time increases every few years.

In the graph of data bus width vs. year, two or three columns appear together because the designs are from the same year; within a year, the order follows the month if the paper gives it. The designs in the graph, in order, are [4], [10], [8], [7], [9], [5], [3], [2], [6], and [1]. Since design [2] uses a 1-bit operand, it is hard to see in the graph. We can conclude that the data bus width tends to increase every few years, although the designs do not use similar widths; every design requires its own data bus width.

In the graph of power vs. year, the designs in order are [8], [9], and [1]. The other designs do not report a power dissipation for an ITRS node. From the graph, we can conclude that not every design has a power dissipation value, because some designs use a percentage to represent the power dissipation.

IV. CONCLUSION
In conclusion, from these ten designs, we can conclude that the data path width, the type of adder, the ITRS technology node, and the model of chip used are the keys to decreasing the power dissipation and energy consumption. The larger the data path width, the faster the ALU runs, which decreases the power dissipation. Similarly, using a multiplier instead of an adder increases the speed of the ALU. Moreover, as the size of the ITRS node decreases, more transistors can be used in the ALU, which increases its speed. However, the implementation cost is higher. Of all these designs, I think design [3] is the strongest: it has the widest operand data bus, it has a multiplier that increases the speed of calculation, and its energy consumption of 166.63 nJ is very small. It is followed by design [8], which has 64-bit operands in 90 nm CMOS, which decreases the space inside the ALU. Last, design [1] also has very low power dissipation with a small ITRS node; moreover, its data path width is large.

REFERENCES
[1] S. Salehi and R. F. DeMara, "Energy and Area Analysis of a Floating-Point Unit in 15nm CMOS Process Technology," in Proceedings of IEEE SoutheastCon 2015 (SECon-2015), Fort Lauderdale, FL, April 9-12, 2015.
[2] A. Roohi, R. F. DeMara, and N. Khoshavi, "Design and Evaluation of an Ultra-Area-Efficient Fault-Tolerant QCA Full Adder," Microelectronics Journal, vol. 46, no. 6, pp. 531-542, June 2015.
[3] M. Alawad, Y. Bai, R. F. DeMara, and M. Lin, "Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation," in Proceedings of 22nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA-14), pp. 185-188, Monterey, California, USA, February 27-28, 2014.
[4] J. Di, J. S. Yuan, and R. DeMara, "High Throughput Power-aware FIR Filter Design based on Fine-grain Pipeline Multipliers and Adders," in Proceedings of the 2003 IEEE Annual Symposium on VLSI (ISVLSI-03), pp. 260-261, Tampa, Florida, U.S.A., February 20-21, 2003.
[5] N. Imran, J. Lee, and R. F. DeMara, "Fault Demotion Using Reconfigurable Slack (FaDReS)," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 7, pp. 1364-1368, July 2013.
[6] P. Trivedi and R. P. Tripathi, "Design & analysis of 16 bit RISC processor using low power pipelining," in 2015 International Conference on Computing, Communication & Automation (ICCCA), pp. 1294-1297, May 15-16, 2015.
[7] B. Pandey, J. Yadav, M. Pattanaik, and N. Rajoria, "Clock gating based energy efficient ALU design and implementation on FPGA," in 2013 International Conference on Energy Efficient Technologies for Sustainability (ICEETS), pp. 93-97, April 10-12, 2013.
[8] S. K. Mathew, M. A. Anders, B. Bloechel, T. Nguyen, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 40, no. 1, pp. 44-51, Jan. 2005.
[9] B. Pandey, J. Yadav, Y. K. Singh, R. Kumar, and S. Patel, "Energy efficient design and implementation of ALU on 40nm FPGA," in 2013 International Conference on Energy Efficient Technologies for Sustainability (ICEETS), pp. 45-50, April 10-12, 2013.
[10] B. Chatterjee, M. Sachdev, and R. Krishnamurthy, "A CPL-based dual supply 32-bit ALU for sub 180nm CMOS technologies," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '04), ACM, pp. 248-251, New York, NY, USA, 2004.

TABLE I. TEN DIFFERENT TYPES OF DESIGNS FOR DECREASING THE POWER DISSIPATION
(A "-" marks a cell with no entry in the original table.)

ALU or Floating Point Architecture Name | Datapath width (bits) or #bits in operands | Time for Operation | Design Type: Adder | Design Type: Multiplier | Design Type: Floating Point | ITRS Technology Node (nm) or Area or Model of Chip used | Energy/Power Consumption (W or J), else indicate low or high
Energy and Area Analysis of a Floating-Point Unit [1] | 32 bits (operands) | - | - | - | IEEE-754 Single Precision | 45 nm and 15 nm (ITRS node) | 2.048 mW (45 nm), 0.6340 mW (15 nm)
Ultra-area-efficient fault-tolerant QCA full adder [2] | 1 bit (operands) | - | Ultra-area-efficient fault-tolerant QCA full adder | - | - | 18 nm^2 (cell area) | low
Energy-Efficient Multiplier-Less Discrete Convolver through Probabilistic Domain Transformation [3] | 128 bits (operands) | 4.09 μs | - | Energy-Efficient Multiplier | - | Virtex 6 FPGA devices (XC6VLX550t) (model of chip used) | 166.63 nJ
High Throughput Power-aware FIR Filter Design based on Fine-grain Pipeline Multipliers and Adders [4] | 16 bits (operands) | - | Fine-Grained Pipelining Adders | Fine-Grained Pipelining Multipliers | - | 0.24 μm static CMOS | -
Fault Demotion Using Reconfigurable Slack (FaDReS) [5] | 9 bits (operands) | 200 ms | - | DSP48 multipliers | - | FPGA, PRM, DCT and HWICAP (model of chip used) | -
Design & analysis of 16 bit RISC processor using low power pipelining [6] | 16 bits (operands) | - | Carry Select Adder | Barrel Shift Rotator | - | XILINX KINTEX (XC7K1607-3fbg676), 28 nm | -
Clock gating based energy efficient ALU design and implementation on FPGA [7] | 8 bits (operands) | 1 ps | Carry Out Adder, Carry In Adder | - | - | 90 nm Spartan-3 (model of device) | -
A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS [8] | 64 bits (operands) | 30 ps | Sparse-tree semi-dynamic adder and 32b radix-2 sparse-tree adder | - | - | 90 nm (CMOS) | 300 mW
Energy efficient design and implementation of ALU on 40nm FPGA [9] | 8 bits (operands) | - | ELM Adder | - | - | 40 nm (Virtex-6 FPGA) | 88 mW
A CPL-based dual supply 32-bit ALU for sub 180nm CMOS technologies [10] | 32 bits (operands) | - | 32-bit adder | - | - | 180 nm-65 nm (CMOS) | -