An Efficient Digital Signal Processing With Razor Based Programmable Truncated Multiplier for Accumulate and Energy reduction

S. Anil Kumar
M.Tech Student, Department of ECE (VLSI Design), Swetha Institute of Technology, JNTUA, Anantapur, Tirupati, Chittor District, Andhra Pradesh, India.

R. Kalyan
Assistant Professor & HOD, Department of ECE, Swetha Institute of Technology, JNTUA, Anantapur, Tirupati, Chittor District, Andhra Pradesh, India.

ABSTRACT
Fault tolerant techniques can extend the power savings achievable by dynamic voltage scaling (DVS) by trading accuracy and/or timing performance against power. Such energy improvements depend strongly on the delay distribution of the circuit and the statistical characteristics of the input signal. Independently, programmable truncated multipliers also achieve power benefits at the expense of degradation of the output signal-to-noise ratio. In this brief, programmable truncated multiplication is used within a fault tolerant digital signal processing (DSP) structure in which the supply voltage is reduced beyond the critical timing level. The timing modulation properties of truncated multiplication are analyzed and shown to improve the performance of fault tolerant designs, reducing error correction burdens and extending the system operating voltage range. Combining both power techniques results in lower energy consumption levels, improving the energy savings beyond those expected when applying a combination of both techniques to the original DSP.

Keywords: Digital signal processing (DSP), fault tolerant, low power, Razor, reconfigurable multiplier, truncated multiplication.

I. INTRODUCTION
Dynamic voltage scaling is widely used as part of strategies to manage switching power consumption in battery powered devices such as cell phones and laptop computers. Low voltage modes are used in conjunction with lowered clock frequencies to minimize the power consumption of components such as CPUs and DSPs; the voltage and frequency are raised only when significant computational power is needed. Voltage scaling is an effective means of lowering power consumption in VLSI circuits, because scaling the supply voltage by a factor of $K$ reduces the dominant dynamic power consumption by a factor of $K^2$ and also yields static power benefits.

However, advances in CMOS technology scaling have contributed to an exponential growth of design issues derived from process, voltage, and temperature (PVT) variations, often resulting in conservative designs with high power consumption. Some of the classic design timing constraints can be relaxed in digital signal processing (DSP) systems by applying unconventional voltage overscaling (VOS) levels to further reduce energy consumption while maintaining signal processing performance. Two of the main streams for providing error resiliency against timing violations are:

1. Techniques that introduce an estimation or prediction subsystem that monitors the system output and provides an approximation if a fault is detected.
2. Techniques that modify the data capture by augmenting the latches or flip-flops on the critical path and allotting extra execution time to operations that need a long execution time.

Such techniques allow implementation of low power systems with acceptable circuit performance at the expense of either signal degradation [1], [2], [5] or execution time penalties [3], [4]. Power savings obtained by fault tolerant techniques depend on both PVT variations and the physical design of the circuit, but are also influenced by the data input to the circuit: the statistical timing distribution defines the percentage of samples that are estimated and/or corrected, and so conditions the maximum power savings obtainable with such techniques.

Truncated multiplication has been widely studied as a means of achieving both power and area improvements in arithmetic circuit design, at the expense of signal degradation [6]-[12]. Because truncated multipliers are smaller than full-precision ones, they not only improve power consumption and area but also exhibit different timing distributions. This brief presents the synergistic benefits of combining truncated multiplication and VOS under a fault tolerance strategy, with both techniques applied to a custom-designed fixed-point multiply and accumulate (MAC) structure.

II. EXISTING SYSTEM

A. Voltage Scaling Beyond $V_{dd,crit}$
Dynamic power consumption is the dominant component in many arithmetic unit circuits because of the high toggling profile of such structures. The switching component of the power consumed by a digital gate is defined as $P_{avg} = \alpha_{0 \to 1} C_L V_{dd}^2 f_{clk}$, where $\alpha_{0 \to 1}$ is the average number of times in each clock cycle (at a frequency $f_{clk}$) that a node with capacitance $C_L$ makes a power consuming transition. Reducing the supply voltage by a factor of $K$ results in a quadratic improvement in the power consumption rate of CMOS logic. Scaling of $V_{dd}$ results in timing penalties which increase as $V_{dd}$ approaches the threshold voltages of the devices.
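The trade-off just described can be made concrete with a small numerical sketch. The Python below is an illustration under stated assumptions, not part of the reported design: the function names and all component values are hypothetical, and the delay expression is the alpha-power relation between delay and supply voltage that the text introduces next.

```python
def dynamic_power(activity, c_load, vdd, f_clk):
    """Average switching power: P_avg = alpha_0->1 * C_L * Vdd^2 * f_clk."""
    return activity * c_load * vdd ** 2 * f_clk


def gate_delay(c_load, vdd, beta, v_t, alpha_sat):
    """Alpha-power delay model: tau_d = C_L * Vdd / (beta * (Vdd - Vt)^alpha)."""
    if vdd <= v_t:
        raise ValueError("model only applies for Vdd above the threshold voltage")
    return c_load * vdd / (beta * (vdd - v_t) ** alpha_sat)


# Hypothetical values, chosen only to show the trend.
ACTIVITY, C_LOAD, F_CLK = 0.1, 1e-12, 100e6
BETA, V_T, ALPHA_SAT = 1e-3, 0.3, 1.3
V_NOMINAL, K = 1.0, 0.8

p_nominal = dynamic_power(ACTIVITY, C_LOAD, V_NOMINAL, F_CLK)
p_scaled = dynamic_power(ACTIVITY, C_LOAD, K * V_NOMINAL, F_CLK)
d_nominal = gate_delay(C_LOAD, V_NOMINAL, BETA, V_T, ALPHA_SAT)
d_scaled = gate_delay(C_LOAD, K * V_NOMINAL, BETA, V_T, ALPHA_SAT)

print(f"power ratio after scaling: {p_scaled / p_nominal:.3f}")
print(f"delay ratio after scaling: {d_scaled / d_nominal:.3f}")
```

With the frequency held constant, the power ratio equals $K^2$ regardless of the other parameters, while the delay ratio grows faster as the scaled voltage approaches the assumed threshold.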
The relationship between the circuit delay ($\tau_d$) and the supply voltage $V_{dd}$ is given by $\tau_d = C_L V_{dd} / (\beta (V_{dd} - V_t)^{\alpha})$, where $C_L$ is the load capacitance, $\beta$ is the gate transconductance, $V_t$ is the device threshold voltage, and $\alpha$ is the velocity saturation index. We refer to the critical supply voltage of a given architecture, $V_{dd,crit}$, as the minimum supply voltage at which timing on the critical path is met for any expected PVT variations. Scaling the supply voltage to $V_{dd} = K V_{dd,crit}$, where $0 < K < 1$, is referred to as VOS. Although this technique yields further energy reductions almost proportional to $K^2$, scaling $V_{dd}$ below the critical supply voltage causes timing failures on critical paths for certain input combinations under certain PVT conditions, which makes it impractical for designs that do not apply fault tolerant schemes.

B. Razor and Fault Tolerance for Timing
The Razor technique applies dynamic voltage scaling through dynamic detection and correction of circuit timing errors. By measuring the error rate in the circuit, the supply voltage can be tuned while the circuit is in operation, easing the requirements imposed by conservative timing analysis. Implementation issues of Razor and its hardware overhead have been addressed in later work: Razor II [4] and Bubble Razor were introduced and tested within a full system with reduced area and timing overheads, and Razor has been applied to a high-speed real-time finite-impulse response (FIR) filter [5]. The efficiency of Razor and the limits on $V_{dd}$ scaling depend on the circuit timing distribution. Therefore, for any circuit implementing Razor, reducing the time required to perform the average and the slowest operations significantly improves the benefits of Razor. This is the motivation for considering the truncated multiplier, which exhibits a timing profile different from that of the standard multiplier.

C. Truncated Multiplication
In systems where it is not necessary to compute the exact least significant part of the product, truncated multipliers allow power, area, and timing improvements by skipping the implementation of sections of the least significant part of the partial product matrix. Instead of the full-precision output, the output is the sum of the first $(N + h)$ columns (where $0 \le h \le N$ and $N$ is the operand width), plus an estimation of the erased bits. In many applications, product values generated by fixed-width $N \times N$ bit multipliers are truncated or rounded back to the original bit width in later stages of the algorithm flow. Truncation reduces the complexity of the multiplier unit by replacing the lower parts of the partial product matrix with a smaller compensation circuit, and its variants range from very aggressively truncated applications to faithfully rounded truncated multipliers.

Programmable and configurable approaches to truncated multiplication use fixed-width structures that can be operated at reduced resolutions by disabling parts of the partial product generation. Introducing programmable truncation in a fixed-point multiplier makes it possible to modify not only the multiplier power but also the timing of the system in which the multiplier is embedded. It also alters the original critical path (OCP) of the arithmetic block, making the architecture effectively faster, since the delay of the active critical path (ACP) satisfies $\tau_{acp} < \tau_{ocp}$. This effect of the programmable truncated multiplier (PTM) on the average and maximum delay is exploited with fault tolerant schemes to achieve lower minimum energy consumption limits.

III. PROPOSED SYSTEM

PTMAC: A Flexible Low-Power DSP With PTM
To extend the usage of PTM to general DSP architectures, the PTMAC was introduced and analyzed. PTMAC was designed as a vehicle to exercise PTM in low-power biomedical applications that need modest DSP, such as ECG filtering or fall detection, and is used in this brief as a platform to combine the benefits of programmable truncation and fault tolerance.

Figure 1: PTMAC top level diagram.
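The column selection described in Section II.C can be sketched behaviorally. The Python model below is an illustration only, not the PTM circuit: it assumes unsigned operands, indexes each partial-product bit $a_i b_j$ by its weight $i + j$, and drops every bit whose weight is below $N - h$. That indexing, the function name `truncated_multiply`, and the constant `compensation` argument standing in for the "estimation of the erased bits" are all assumptions made for the example.

```python
def truncated_multiply(a, b, n, h, compensation=0):
    """Behavioral model of an n x n unsigned truncated multiplier.

    Partial-product bits a_i * b_j with weight i + j below (n - h) are
    discarded; h can be changed per call, mimicking run-time programmability.
    `compensation` is a hypothetical constant estimate of the discarded bits.
    """
    if not 0 <= h <= n:
        raise ValueError("h must satisfy 0 <= h <= n")
    if not (0 <= a < (1 << n) and 0 <= b < (1 << n)):
        raise ValueError("operands must fit in n unsigned bits")

    cutoff = n - h  # lowest column weight that is still computed
    total = 0
    for i in range(n):
        if not (a >> i) & 1:
            continue
        for j in range(n):
            if (b >> j) & 1 and i + j >= cutoff:
                total += 1 << (i + j)
    return total + compensation


def max_truncation_error(n, h):
    """Largest value the discarded columns can hold (all dropped bits set)."""
    k = n - h
    return sum((c + 1) << c for c in range(k))


if __name__ == "__main__":
    for h in (4, 2, 0):
        approx = truncated_multiply(13, 11, 4, h)
        print(f"h={h}: approx={approx}, exact={13 * 11}, "
              f"bound={max_truncation_error(4, h)}")
```

Setting `h` equal to `n` keeps every column and reproduces the exact product, while smaller values of `h` skip more of the low-order work. In hardware, the skipped columns are the ones whose switching activity and delay contribution disappear.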
The proposed DSP, as depicted in Figure 1, includes a control unit operating in a five-stage pipeline, program and memory blocks in a multibus Harvard configuration, some I/O connectivity, and an arithmetic unit consisting of a MAC structure with a 16-bit PTM, a 40-bit accumulator, and a 40-bit barrel shifter for scaling and rotating the accumulated value. The total gate count of the original PTMAC chip is 48 k, and its maximum power consumption is estimated (post synthesis) at 79.46 μW/MHz. Timing analysis of the proposed PTMAC architecture indicates that the critical path is located within the MAC structure of the arithmetic unit; therefore, energy savings derived from voltage scaling approaches are constrained by the signal propagation time through the arithmetic unit. The following sections explore an experimental approach that combines the delay-modulation capabilities of programmable truncation with the benefits of fault tolerance, as a way to achieve a flexible unit that trades energy for signal and performance degradation.

IV. RAZOR-BASED PTMAC, LOW-POWER DSP VIA DELAY MODULATION
The combination of a PTM and a fault tolerant system allows such a system to modulate the average and maximum delay times in the MAC unit at run time. Therefore, the number of errors that need correction at any $V_{dd}$ level can be trimmed down by reducing the multiplier accuracy. This technique also lowers the functional $V_{dd}$ values that can be applied before nonrecoverable failures appear in the system, delivering lower optimum operating voltages and, in turn, lower energy expenditure. To explore the independent benefits of fault tolerance and truncated multiplication and the interactions between them, Razor PTMAC was designed as an evolution of PTMAC. To that end, a Razor-enabled version of the original DSP was designed and implemented using Cadence RTL Compiler and TSMC 90 nm technology.

Razor Implementation
To achieve fault tolerance, the accumulator unit of the PTMAC was replaced by a fault tolerant version named Razor Accumulator, in which the original flip-flops were substituted by a version of the Razor registers. The proposed augmented cells were designed and stored as library cells for post-synthesis insertion. Such a cell follows the original Razor implementation, except that the shadow latch within the Razor registers is replaced with a shadow flip-flop to avoid synthesis issues. The metastability detector required in Razor implementations was modeled as the delay of an inverter, added as a constraint to the hold time of the Razor accumulator. In this way, all timing violations potentially causing metastability are detected as timing errors, providing a lower bound for the performance of Razor.

Static timing analysis of PTMAC demonstrated that the only registers situated on potentially critical paths within PTMAC were located in the accumulator, as the multiplication and accumulation of the input data are performed within a single clock cycle. Therefore, the flip-flops capturing the 10 most significant bits of the accumulator were replaced by Razor flip-flops. Insertion of the Razor flip-flops and the associated control logic increased the core area by 18%.
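The detection principle behind the Razor registers described above can be illustrated with a simple behavioral model. The Python sketch below is an assumption-laden abstraction, not the library cell used in the design: it treats a register input as a single transition from an old value to a new one at a given arrival time, samples a main flip-flop at the end of the clock period and a shadow flip-flop a fixed window later, and flags an error when the two disagree. The names `razor_sample`, `period`, and `window`, and the numeric values, are hypothetical.

```python
def razor_sample(old, new, arrival, period, window):
    """Model one capture by a main flip-flop and its delayed shadow flip-flop.

    The input holds `old` until `arrival` and `new` afterwards (same time
    units as `period`). The main flip-flop samples at `period`, the shadow
    flip-flop at `period + window`. Returns (main, shadow, error_flag).
    """
    main = new if arrival <= period else old
    shadow = new if arrival <= period + window else old
    return main, shadow, main != shadow


def classify(old, new, arrival, period, window):
    """Label a capture as on time, corrected by Razor, or unrecoverable."""
    main, shadow, error = razor_sample(old, new, arrival, period, window)
    if main == new:
        return "on time"
    if error and shadow == new:
        return "late, detected and correctable"
    return "too late, not detected"


if __name__ == "__main__":
    PERIOD, WINDOW = 1.0, 0.25  # illustrative values only
    for arrival in (0.8, 1.1, 1.4):
        print(arrival, classify(0, 1, arrival, PERIOD, WINDOW))
```

The third case shows why the supply voltage cannot be lowered indefinitely: once a path is slower than the period plus the detection window, both flip-flops capture the stale value and the error goes unnoticed. Shortening the slowest paths, as truncation does, keeps more captures within the first two cases.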
Since the hold constraint only limits the maximum duration of the positive clock phase and does not affect the clock frequency, a single clock was used to drive both the main and the Razor flip-flops using both transition edges, so that the extra time allowed by the shadow registers can be set by adjusting the duty cycle of the clock. A delay of 25% of the overall clock cycle was selected, which results in an asymmetrical clock signal. The selection of a short error detection phase enabled a strategy in which a barrier of transparent latches is placed between the compression tree of the multiplier and the adder blocks. During the high phase of the clock, partial product generation begins but the signals provided by the multiplier are blocked at the latch barrier input; during the low phase of the clock, the latches become transparent and the signals are free to pass to the adder.

V. RESULTS ANALYSIS

Results: Truncated Multiplier

Figure 2: Execution of five instructions in the Razor PTMAC pipeline, with the four stages of the Razor error detection-correction cycle indicated.
Fig 3: Truncated Multiplier
Fig 4: RTL Schematic
Fig 5: Technology Schematic
Fig 6: RTL Schematic
Fig 7: Technology Schematic
Fig 8: Razor reg
Fig 9: RTL

Fig 10: Technology Schematic
Fig 11: PTMAC
Fig 12: PTMAC1

VI. CONCLUSION
The use of Razor on a PTMAC structure has been tested at the post-synthesis simulation level to study the effects and interactions of both energy reducing techniques on a previously tested DSP design. The timing and power effects of VOS with error correction and the application of programmable truncated multiplication resulted in significant power reductions. Fault tolerance was provided by a conservative implementation of the Razor I technique, which achieved energy reductions of 17.7% over the original DSP implementation by enabling the reduction of $V_{dd}$ beyond the original critical supply level. Truncated multiplication was achieved by implementing a PTM, and resulted in energy savings of 8.1% of the full design.

Energy reductions achieved by fault tolerant techniques are limited by the overheads required to provide error resilience and by the number of operations that need correction; they are therefore highly influenced by the delay distribution and the maximum delay of the system critical paths. The introduction of truncated multipliers achieves two goals in this scenario: 1) it reduces power in the multiplier by cancelling the switching activity within its least significant sections, and 2) it disables the multiplier critical path, thus reducing the error recovery overheads of Razor and extending the applicable $V_{dd}$ range.

Results show that applying both techniques to the proposed DSP unit allows maximum energy savings of 24.8%, which exceeds the savings obtained by independently implementing programmable truncation or fault tolerance via Razor, and also exceeds the most optimistic prediction for the combination of both techniques (24.4%). This indicates that the delay-modulation properties of truncated multiplication can be exploited to improve the energy consumption of fault tolerant DSP architectures where multipliers are involved in the critical path of the circuit.
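The text does not state how the 24.4% prediction was obtained. One reading that is consistent with the reported figures, offered here purely as an assumption, is that the two independent savings compose multiplicatively on the remaining energy. The short Python check below uses a hypothetical helper, `combined_saving`, to work through that arithmetic.

```python
def combined_saving(*savings):
    """Compose independent fractional savings on the remaining energy."""
    remaining = 1.0
    for saving in savings:
        remaining *= 1.0 - saving
    return 1.0 - remaining


RAZOR_SAVING = 0.177      # reported for Razor alone
PTM_SAVING = 0.081        # reported for programmable truncation alone
MEASURED_COMBINED = 0.248  # reported for both techniques together

predicted = combined_saving(RAZOR_SAVING, PTM_SAVING)
print(f"predicted if independent: {100 * predicted:.1f}%")
print(f"measured advantage: {100 * (MEASURED_COMBINED - predicted):.1f} points")
```

Under this assumption, the measured result exceeding the independent prediction is what indicates an interaction between the two techniques rather than a simple stacking of their individual effects.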

REFERENCES
[1] R. Hegde and N. R. Shanbhag, "Soft digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 813-823, Dec. 2001.
[2] B. Shim, S. Sridhara, and N. R. Shanbhag, "Reliable low-power digital signal processing via reduced precision redundancy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 497-510, May 2004.
[3] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, et al., "Razor: A low-power pipeline based on circuit-level timing speculation," in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarch., 2003, pp. 7-18.
[4] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, et al., "RazorII: In situ error detection and correction for PVT and SER tolerance," IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 32-48, Jan. 2009.
[5] P. Whatmough, S. Das, and D. Bull, "A low-power 1 GHz razor FIR accelerator with time-borrow tracking pipeline and approximate error correction in 65 nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2013, pp. 428-429.
[6] S. Kidambi, F. El-Guibaly, and A. Antoniou, "Area-efficient multipliers for digital signal processing applications," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 43, no. 2, pp. 90-95, Feb. 1996.