
Power Issues with Embedded Systems Rabi Mahapatra Computer Science

Plan for today
- Some power models
- Become familiar with techniques to reduce power consumption
- Reading assignment: paper by Bill Moyer on Low-Power Design for Embedded Processors, Proceedings of the IEEE, Nov. 2001
Mahapatra-Texas A&M-Spring'03

Next Generation Computing: Watts metrics?
[Figure: power scale of computing platforms. Labels: wireless networks (microwatts); base station; laptops, PDAs, cellphones, GPS (0.1-10 W); router; server/data processing (megawatts)]

Power Aware
- Increase in prominence of portable devices
- SoC complexity: heat generation
- Traditionally, the design axes were speed (performance) and area (cost)
- Now, power is added as the new axis

Physics Revisited
- Energy is measured in joules
- Power: rate of energy consumption (joules/second), measured in watts
- $V_{dd} \cdot I_d$: instantaneous power

Impact on embedded systems
- Energy consumed per activity reduces battery life
  - Drains battery capacity quickly
- IR drops in a battery due to the flow of current
  - Requires more $V_{dd}$ and GND pins to reduce R; thick, wide wiring is also necessary
- Inductive power-supply voltage bounce due to current switching
  - Requires more and shorter pins to reduce inductance
  - Requires on-chip decoupling capacitance to help bypass pins
- Power dissipation produces heat, and high temperature reduces speed and reliability

Opportunities for Low-Power
Level               Opportunity
Algorithms          Minimize operations
Source code         Optimized code
Compiler            Energy miser
Operating system    Scheduling
ISA                 Energy exposed
Microarchitecture   Clock gating
Circuit design      Low voltage swing
Manufacturing       Low-k dielectric

Some Power Models
- Macro level
  - Arithmetic
  - Software
  - Memory
- Activity based
  - Empirical
  - Information-theoretic
  - Signal modeling-based

Empirical
- Based on the chip estimation system [Glaser ICCAD91]:
  $P = \alpha G (E_r + C_L V_{dd}^2) f$
  - $G$ = number of equivalent gates
  - $E_r$ = energy consumed by an equivalent gate
  - $C_L$ = average loading per gate, including fanout
  - $\alpha$ = activity factor
  - $f$ = clock frequency
- Demerit: does not account for different logic styles

Information Theoretic
- Reference: [Najm95]
- Based on activity estimation:
  $P = k \cdot C_L \cdot \alpha = k \cdot A \cdot h$
  - $A$ = area
  - $h$ = entropy factor (a function of the entropy $H$)
- Limited accuracy; does not account for the possibility of encoding

Signal Model Based
- Reference: [Landman TCAD96]
  - Properties of 2's complement encoded data streams
  - Arithmetic blocks are regular
- Analytical method: [Ramprasad TCAD97]
  - Word-level statistics
  - Auto-regressive moving average signal generation model
  - 2's complement and sign-magnitude signal encoding

Software Power
- Power consumed by a processor ($P$), ref. [Tiwari TVLSI94]: $P = V_{dd} \cdot I$
- Energy ($E$): $E = P \cdot T_p$, where $T_p$ is the program execution time
- Program execution time: $T_p = N \cdot T_{clk}$, where $N$ is the number of clock cycles
- $E = P \cdot T_p = V_{dd} \cdot I \cdot N \cdot T_{clk}$
- If $V_{dd}$ and $T_{clk}$ are assumed constant, energy is determined by measuring the current $I$
- Low-power software: small value of $N$, i.e., fast execution time
- Open questions: what if $V_{dd}$ and $T_{clk}$ vary? How is the current measured?
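A small sketch of the energy relation E = V_dd * I * N * T_clk from this slide. The supply voltage, average current, cycle counts, and clock period below are hypothetical values chosen for illustration; N is taken to be the number of clock cycles.

```python
def program_energy(vdd, avg_current, cycles, t_clk):
    """Energy of a program run: E = Vdd * I * N * T_clk (joules).

    vdd         : supply voltage in volts (assumed constant)
    avg_current : average current drawn while the program runs, in amperes
    cycles      : number of clock cycles N the program takes
    t_clk       : clock period in seconds (assumed constant)
    """
    return vdd * avg_current * cycles * t_clk


# Hypothetical comparison of two versions of the same program on one CPU.
baseline = program_energy(vdd=3.3, avg_current=0.2, cycles=1_000_000, t_clk=25e-9)
optimized = program_energy(vdd=3.3, avg_current=0.2, cycles=500_000, t_clk=25e-9)
print(f"baseline:  {baseline * 1e3:.2f} mJ")
print(f"optimized: {optimized * 1e3:.2f} mJ")
```

With V_dd, I, and T_clk held fixed, halving the cycle count halves the energy, which is why the slide equates low-power software with a small N.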

Instruction Level Power Modeling
- Reference: [Tiwari TVLSI97]
- Average current consumption of a program with no loops and $M$ instructions:
  $I = \dfrac{\sum_{k=0}^{M-1} \left( B_k N_k + O_{k,(k+1) \bmod M} \right)}{\sum_{k=0}^{M-1} N_k}$
  - $B_k$ = base current of the $k$th instruction in the program
  - $N_k$ = number of clocks required to complete the $k$th instruction
  - $O_{i,j}$ = overhead of executing instruction $j$ immediately after instruction $i$
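The instruction-level model can be sketched as a weighted average. This is an illustrative reading of the slide's formula, not code from [Tiwari TVLSI97]: the function name, the list-based representation of the overhead term, and all current values are assumptions.

```python
def average_current(base, clocks, overhead_next):
    """Average program current from per-instruction data.

    base[k]          : base current B_k of instruction k (amperes)
    clocks[k]        : clock cycles N_k needed by instruction k
    overhead_next[k] : overhead O_{k,(k+1) mod M} between instruction k
                       and its successor (wrapping around at the end)
    """
    if not (len(base) == len(clocks) == len(overhead_next)):
        raise ValueError("all three lists must have one entry per instruction")
    weighted = sum(b * n + o for b, n, o in zip(base, clocks, overhead_next))
    return weighted / sum(clocks)


# Hypothetical three-instruction sequence.
current = average_current(base=[0.20, 0.30, 0.25],
                          clocks=[1, 2, 1],
                          overhead_next=[0.010, 0.020, 0.015])
print(f"average current: {current:.5f} A")
```

Multiplying this average current by V_dd and the program run time gives the program energy, as on the Software Power slide.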

Power Dissipation in CMOS
Three sources:
- $P_{switching}$: switching (capacitive) power; dominant today
- $P_{leakage}$: leakage power; will dominate at 0.13 micron and below
- $P_{shortcircuit}$: short-circuit component

Switching Power Dissipation
- Occurs when a device changes state, i.e., when charge is switched in and out of the capacitance $C_L$
- Caused by the flow of current across the transistor's impedance
- $P_{switching} = t \cdot C_L \cdot V_{dd}^2 \cdot f$
  - $t$ = average number of transitions per cycle
  - $f$ = clock frequency
  - $C_L$ = effective capacitance
- Increases with clock frequency
- Decreases quadratically with supply voltage
- 85-90% of active power consumption
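The quadratic dependence on supply voltage can be checked directly from the formula P_switching = t * C_L * V_dd^2 * f. The sketch below uses hypothetical values for the transition rate, capacitance, and clock.

```python
def switching_power(t, c_load, vdd, freq):
    """Switching power: P = t * C_L * Vdd^2 * f (watts).

    t      : average number of transitions per cycle
    c_load : effective switched capacitance in farads
    vdd    : supply voltage in volts
    freq   : clock frequency in hertz
    """
    return t * c_load * vdd ** 2 * freq


# Hypothetical block: 1 nF effective capacitance, 0.5 transitions/cycle, 100 MHz.
p_high = switching_power(t=0.5, c_load=1e-9, vdd=2.0, freq=100e6)
p_low = switching_power(t=0.5, c_load=1e-9, vdd=1.0, freq=100e6)
print(f"Vdd = 2.0 V: {p_high:.3f} W")
print(f"Vdd = 1.0 V: {p_low:.3f} W  (ratio {p_low / p_high:.2f})")
```

Halving V_dd cuts switching power to one quarter, whereas halving f or t only halves it.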

Low-Power Techniques
Low-power techniques reduce one or more of $t$, $C_L$, $V_{dd}$, and $f$:
- $t$: encoding
- $C_L$: fast algorithm, design layout
- $V_{dd}$: voltage scaling, variable-voltage processor
- $f$: low frequency and clock gating
All of these are useful for embedded systems

Short Circuit Power Dissipation
- Occurs because the PMOS and NMOS transistors forming a CMOS logic gate both conduct briefly while the input signal transitions
- $P_{shortcircuit} = I_{mean} \cdot V_{dd}$
- 10-20% contribution to dynamic power
- Not important if all signals are guaranteed to have steep slopes

Leakage Power Dissipation
- Occurs regardless of state changes
- Due to leakage currents from reverse-biased PN junctions (OFF switches are not really off)
- Proportional to device area and temperature
- Increases exponentially as $V_t$ is reduced under voltage scaling
- Significant when the system is idle (embedded systems?)

Static Power
- Not a factor in pure CMOS designs
- Sense amplifiers, voltage references, and constant current sources contribute to static power
- Consumed regardless of device state changes
- Total power: $P_{total} = P_{switching} + P_{shortcircuit} + P_{static} + P_{leakage}$

Power Delay
- Leverage the power and delay trade-off
- Delay is proportional to $C_L \cdot V_{dd} / (V_{dd} - V_t)^{1.5}$
- Trend: reduce $V_{dd}$ and $V_t$ to improve speed
- The energy-delay product is minimized when $V_{dd} = 2 V_t$
- Reducing $V_{dd}$ from $3 V_t$ to $2 V_t$ results in an approximately 50% decrease in performance while using only 44% of the power
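The two numeric claims on this slide can be checked with a short sketch. It assumes that the expression C_L * V_dd / (V_dd - V_t)^1.5 models gate delay (the reading under which performance drops as V_dd falls), that energy per operation scales as C_L * V_dd^2, and that voltages are expressed in units of V_t. The function names and the grid search are illustrative choices.

```python
def delay(vdd, vt=1.0, c_load=1.0):
    """Gate delay model: C_L * Vdd / (Vdd - Vt)^1.5 (arbitrary units)."""
    return c_load * vdd / (vdd - vt) ** 1.5


def energy(vdd, c_load=1.0):
    """Switching energy per operation: C_L * Vdd^2 (arbitrary units)."""
    return c_load * vdd ** 2


def energy_delay_product(vdd, vt=1.0):
    return energy(vdd) * delay(vdd, vt)


# Grid search for the supply voltage (in units of Vt) that minimizes EDP.
candidates = [1.1 + 0.01 * i for i in range(400)]
best_vdd = min(candidates, key=energy_delay_product)

# Effect of lowering Vdd from 3*Vt to 2*Vt.
speed_ratio = delay(3.0) / delay(2.0)     # relative speed at 2*Vt
energy_ratio = energy(2.0) / energy(3.0)  # relative energy per operation

print(f"EDP minimum near Vdd = {best_vdd:.2f} * Vt")
print(f"speed at 2*Vt relative to 3*Vt:  {speed_ratio:.2f}")
print(f"energy at 2*Vt relative to 3*Vt: {energy_ratio:.2f}")
```

Under these assumptions the speed falls to roughly half and the V_dd^2 term falls to about 44%, consistent with the slide; the 44% figure corresponds to the voltage term alone, before any change in clock frequency is counted.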

Algorithmic Technique PR
- Focus on minimizing the number of operations weighted by their cost: first-order goal
  - Underlying implementation: arithmetic or logical
- Recomputation of intermediate results may be cheaper than memory use
- Loop unrolling: reduces loop overhead
- Number representation: fixed point or floating point
  - Sign-magnitude is preferred over 2's complement in certain DSP applications when input samples are uncorrelated and the dynamic range is minimized
- Bit length (trading off accuracy, of course)
  - Adaptive bit truncation in a portable video encoder saves 70% of the power relative to full bit width
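One way to see why number representation matters is to count bit toggles on a data bus, since each toggle contributes to switching power. The sketch below is an illustration only: the 8-bit word width, the sample sequence, and the helper names are hypothetical, and the sequence is deliberately small in magnitude with frequent sign changes.

```python
def twos_complement(x, bits=8):
    """Encode a signed integer as an unsigned 2's complement word."""
    return x & ((1 << bits) - 1)


def sign_magnitude(x, bits=8):
    """Encode a signed integer as sign bit plus magnitude.

    Assumes abs(x) fits in bits - 1 bits.
    """
    sign = (1 << (bits - 1)) if x < 0 else 0
    return sign | abs(x)


def toggles(samples, encode, bits=8):
    """Total number of bit flips between consecutive encoded samples."""
    words = [encode(x, bits) for x in samples]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Small-amplitude samples that keep changing sign.
samples = [1, -1, 2, -2, 1, -1]
print("2's complement toggles:", toggles(samples, twos_complement))
print("sign-magnitude toggles:", toggles(samples, sign_magnitude))
```

In 2's complement a sign change flips most of the upper bits through sign extension, while in sign-magnitude it flips mainly the sign bit, so the second count is much lower for this kind of data.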

Architectural Technique PR
- Instruction set design and exploiting parallelism and pipelining are important
- Architecture-driven voltage scaling method [Chandrakasan, IEEE J. Solid-State Circuits 92]
  - Lower the voltage to save power, but apply parallelism/pipelining to recover speed
  - Possible if the application has parallelism; trades off against latency (due to pipelining and data dependencies) and area
- Speculative logic is acceptable if its overhead is low; otherwise it is detrimental
- Meeting the required performance without over-designing a solution is fundamental to optimization
- The power of the extra logic is not controllable: it is still consumed even when parallelism is absent

Logic and Circuit Level PR
- Focus on reducing switched capacitance and/or signal swing
- Signal probabilities may favor either static or dynamic CMOS logic
- Example: for a two-input NAND gate with uniformly distributed inputs, the probability of the output being 0 is p0 = 0.25, and p1 = 0.75
  - Static gate: the probability of a power-consuming 0 -> 1 transition is p0 * p1 = 0.1875
  - Dynamic gate with the output precharged to logic 1: power is consumed whenever the output was previously 0, so its output transition probability (0.25) is higher than that of the static gate
  - However, the dynamic circuit has lower input capacitance by a factor of 2 to 3

Logic circuit PR
- For a wider-input static gate, say a four-input NAND, p0 = 0.0625 and p(0 -> 1) = 0.0586
- For the dynamic version as above, p0 = p(0 -> 1) = 0.0625
- Static logic suffers from glitches: it needs restructuring, and glitching adds more than 20% to power
[Figure: logic with inputs X and Y showing a hazard, and the restructured logic]
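The NAND transition probabilities quoted on these two slides follow from the input probabilities. The sketch below reproduces them, assuming independent inputs that are each 1 with probability 0.5; the function names are illustrative.

```python
def nand_p0(n_inputs, p_one=0.5):
    """Probability that an n-input NAND output is 0 (all inputs are 1)."""
    return p_one ** n_inputs


def static_transition(n_inputs, p_one=0.5):
    """Static CMOS: a 0 -> 1 output transition occurs with probability p0 * p1."""
    p0 = nand_p0(n_inputs, p_one)
    return p0 * (1.0 - p0)


def dynamic_transition(n_inputs, p_one=0.5):
    """Dynamic logic precharged to 1: power is drawn whenever the output was 0."""
    return nand_p0(n_inputs, p_one)


for n in (2, 4):
    print(f"{n}-input NAND: p0 = {nand_p0(n):.4f}, "
          f"static = {static_transition(n):.4f}, "
          f"dynamic = {dynamic_transition(n):.4f}")
```

The gap between the static and dynamic figures shrinks as the gate gets wider, which is why signal probabilities can favor either logic style.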

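Logic circuit PR
- Mapping a logic function to gates is tricky too
- Example: transition probability for a four-input AND function, with P = 0.5 at each input
  - Tree of two-input gates: P = 0.25 and P = 0.25 at the intermediate nodes, P = 0.0625 at the output; P(total) = 0.5625
  - Chain of two-input gates: P = 0.25 and P = 0.125 at the intermediate nodes, P = 0.0625 at the output; P(total) = 0.4375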
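The two totals on this slide can be reproduced by summing the node probabilities of each mapping. The sketch below reads each P value as the probability that a two-input AND output is 1 with independent inputs, and assumes the two mappings are a balanced tree and a linear chain; both readings are assumptions about the lost figure.

```python
def and2(p_a, p_b):
    """Probability that a 2-input AND output is 1, for independent inputs."""
    return p_a * p_b


def tree_total(p_in=0.5):
    """Balanced mapping: (a AND b) AND (c AND d)."""
    ab = and2(p_in, p_in)
    cd = and2(p_in, p_in)
    out = and2(ab, cd)
    return ab + cd + out


def chain_total(p_in=0.5):
    """Chained mapping: ((a AND b) AND c) AND d."""
    ab = and2(p_in, p_in)
    abc = and2(ab, p_in)
    out = and2(abc, p_in)
    return ab + abc + out


print("tree total: ", tree_total())
print("chain total:", chain_total())
```

Both mappings compute the same function, but the chain has a lower summed node probability because its second internal node is active less often.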

Logic Circuit Reordering
- Reordering of equivalent inputs of gates and reordering of transistors in complex gates
- Example: (A+B)*C in Fig. 3
- Affects the amount of switched internal capacitance, gate speed, and static power
- Signals with a high probability of being off are placed nearest the output node of the gate, subject to timing constraints; signals with a high probability of being on are placed nearest the supply node
- Signals with a high probability of switching are placed nearest the output
- 10% saving from the worst case to the best case
- Ref.: rules in [29], Shen et al., ASP-DAC 95

Logic circuit Clocking
- Clock generation and distribution can consume 30-40% of total system power: minimize unnecessary transitions on clock signals
- Low-power STG of state machines and assignment of adjacent binary encodings
- Clock gating at the functional-unit level by inhibiting input updates (Fig. 4)
[Figure 4: gated-clock register with signals D, Q, CLK, Enable, C]

Logic Circuit Clocking
- The overhead of the enable signal limits the granularity at which clock gating can be applied
- Self-gating storage elements use a local clocking enable (Stroll et al., Symposium on Low-Power Electronics & Design 2000)
- Reduced-swing clock drivers save 63% [33] at the cost of twice the delay
- Differential clock signaling (0.2 * $V_{dd}$), in static power, saves 60%; affects duty cycle and receiver skew

Logic Circuit Pre-computation
- Selectively pre-compute output values
- Saves 11-66%. Ref.: Monteiro et al., TCAD 98 [37]
[Figure: pre-computation architecture with registers (Reg), combinational logic F, enable E, and the "F = 1" and "F = 0" logic blocks]

Device Technology
- Threshold voltage is the key device-level parameter for leakage
- Silicon on insulator lowers parasitic capacitance and reduces the body effect
- Dual-threshold device technology reduces standby power
  - High threshold for non-critical paths and low threshold for speed-critical paths
  - Alternative: raise the threshold of all devices during standby

Case Study: MCORE ISA
- 8-bit CPUs in watches and calculators (microwatts), 16-bit CPUs in handheld devices (milliwatts), 32-bit CPUs in notebooks (watts)
- The ISA specification has a significant effect on power-performance (RISC, CISC, VLIW)
  - CISC is best suited for low-cost ES because of its optimized code density
  - RISC and VLIW trade code density for simplified fetch/decode
- The power spent on the fundamental computation of an algorithm is important; however, fetch/decode/sequencing overheads must consume less power

ISA & Power
- RISC has reduced sequencing and control overhead but increased instruction fetch bandwidth (32-bit words), compared with 22-24 bits for CISC
- MCORE is not a pure RISC: 16-bit ISA with 16 GPRs and load/store operation
  - Limited by long execution path length (due to reduced fields) and the 2-operand instruction format
  - Using compiler-driven instructions, these limitations are minimized
- For ES, optimized code resides in on-chip memory, and the reduced memory traffic due to careful selection of semantics also reduces power by 40%

ISA Power
- Doubling the I-cache reduces cache misses by up to 50%
- For on-chip cache or memory, a 32-bit data path is provided
  - Doubles the fetch bandwidth
- Using an ISS, capture the execution frequency of all instructions and order the opcodes accordingly (control unit design): could save 15% of power
- System-level power-saving modes:
  - Wait: disable only the CPU
  - Doze: disable the CPU plus some peripherals and clocks
  - Stop: deep power-down state with all clocks stopped and most supply voltages reduced or switched off

Microarchitecture PR
- High-end processor microarchitectures exploit all possible ILP, with correspondingly high power inefficiency
- DSP and multimedia algorithms in the embedded class offer these opportunities: special power-aware architectures
- Mid-range controllers use an optimal pipeline architecture
- No free-running clocks in the datapath: gated and delayed clocks
- Floor planning of datapath elements is evaluated by executing a set of embedded benchmarks
  - Infrequently used functional units are placed farther from the register file and decoupled from the bus appropriately

Clock distribution
- Efficient distribution and gating mechanisms are vital
- Two-level approach:
  - First level: align clock edges for the circuits and generate two non-overlapping phases
  - Second level: the two phases are further distributed to a set of regenerating cells with clock-gating control inputs
  - Clock gating can be used at both levels, depending on granularity
- A clock tree generation technique allows both balanced and unbalanced clock tree structures
  - Balanced clock tree: additional routing capacitance is added at all nodes
  - Unbalanced clock tree: resizing at intermediate aligners/regenerators (more effective for low power)

Two-level clock distribution
[Figure: Clk in drives an aligner (align) and regenerators (rgen), each with an enable input, producing clock phases C1 and C2]

Adapting power consumption profile to application and system needs
- IBM's power-aware simulator
  - Uses an event-driven complete simulator and power profiler
- UCI's COPPER project
  - Integrated view of compiler strategy for power management

IBM's power simulator
[Figure: SIMOS-PPC 405GP produces an event trace (INST, DTLB MISS, UTLB MISS, ICACHE MISS, DCACHE HIT) that feeds the power model]
Power model: $E_{total} = P_s \cdot t + \sum_i C_i \cdot E_i$
Event power table:
  ICACHE HIT    1.56 nJ
  DTLB MISS     12.17 nJ
  ADD           0.95 nJ
  LOAD WORD     4 nJ
  DIVISION      17 nJ
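An event-driven energy estimate of this kind can be sketched as a table lookup plus a static term. The per-event energies below are the five values listed on the slide; everything else is an assumption made for illustration: the interpretation of P_s as static power, of C_i as the count of event i, and of E_i as the energy per event, as well as the event counts, run time, and function name.

```python
# Per-event energies in nanojoules, as listed in the slide's event power table.
EVENT_ENERGY_NJ = {
    "ICACHE HIT": 1.56,
    "DTLB MISS": 12.17,
    "ADD": 0.95,
    "LOAD WORD": 4.0,
    "DIVISION": 17.0,
}


def total_energy_nj(static_power_w, run_time_s, event_counts):
    """E_total = P_s * t + sum_i C_i * E_i, returned in nanojoules.

    static_power_w : assumed meaning of P_s (static power, watts)
    run_time_s     : run time t in seconds
    event_counts   : mapping from event name to count C_i
    """
    static_nj = static_power_w * run_time_s * 1e9
    dynamic_nj = sum(count * EVENT_ENERGY_NJ[name]
                     for name, count in event_counts.items())
    return static_nj + dynamic_nj


# Hypothetical event trace summary for a short run.
counts = {"ICACHE HIT": 1000, "ADD": 600, "LOAD WORD": 300,
          "DIVISION": 10, "DTLB MISS": 2}
energy_nj = total_energy_nj(static_power_w=0.1, run_time_s=10e-6,
                            event_counts=counts)
print(f"total energy: {energy_nj:.2f} nJ")
```

A trace with many divisions or TLB misses costs far more per event than one dominated by adds and cache hits, which is the kind of profile information such a simulator exposes.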

UCI's COPPER project: framework for power management
[Figure: COPPER framework. Blocks: application, compiler (gcc), code versions, chosen code version, hardware config, cycle-level performance simulator, cycle-by-cycle hardware access counts, parameterizable power models, power simulator, power estimate, performance estimate, power profiler, power scheduler, available power]

Summary
- Power-aware embedded system architectures
- Techniques to lower power at the circuit and device levels
- Some recent efforts in the use of tools for power-efficient design

Revisit Processor Architecture
- Execution time: (number of instructions) * (cycles/instruction) * (seconds/cycle)
- Performance: (instructions/cycle) * (cycles/second)
- It is a mistake to treat these values as independent
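The execution-time product, and the warning that its factors are not independent, can be illustrated with a short sketch. All numbers are hypothetical: the second configuration assumes that a deeper pipeline halves the clock period but raises cycles per instruction through extra stalls.

```python
def execution_time(instructions, cycles_per_instruction, seconds_per_cycle):
    """Execution time = instructions * (cycles/instruction) * (seconds/cycle)."""
    return instructions * cycles_per_instruction * seconds_per_cycle


def performance(instructions_per_cycle, cycles_per_second):
    """Performance in instructions per second."""
    return instructions_per_cycle * cycles_per_second


# Hypothetical baseline: 1M instructions, CPI 1.5, 10 ns clock period.
base_time = execution_time(1_000_000, 1.5, 10e-9)

# Hypothetical deeper pipeline: the period halves, but CPI rises to 2.0.
deep_time = execution_time(1_000_000, 2.0, 5e-9)

print(f"baseline:        {base_time * 1e3:.1f} ms")
print(f"deeper pipeline: {deep_time * 1e3:.1f} ms")
```

Halving the clock period here does not halve the run time, because the change that shortened the cycle also raised the cycle count per instruction.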

Architecture Revisited
- Clock frequency = f(device technology, circuit design, $V_{dd}$, architecture)
- How to improve the clock rate? Reduce complexity and use pipelining
- What does a faster clock do to power? Should we reduce the clock?
- What about reducing $V_{dd}$ at the same clock? Will that reduce power?

Architecture Revisited: instructions/cycle
- There is only so much we can do during a cycle
- How can more instructions be issued during a cycle?
  - ILP: exploit instruction-level parallelism
  - Reduce or eliminate delays (memory and branch)
  - VLIW
  - Superscalar
  - EPIC (explicitly parallel instruction computing)

ILP & Power
- Increase instructions/cycle -> fewer cycles -> longer clock period for the same performance -> lower voltage -> lower power
- Increase instructions/cycle -> more switching elements -> more transitions -> more power
- Also, more functional units mean proportionately less performance gain but more power
- It is tricky!

Delay Elimination & Power
- Memory and branch actions cause delay
- Branch prediction helps reduce this delay
  - Guess the most likely next path
  - If the guess is wrong, recover quickly with a proper backup plan
  - Net gain when the time saved is significant
- How to guess?
  - History-based tables
  - Compiler assisted
- Example: PCs save >95% of the time, but this may not be feasible for low-end ES

Predication & Speculation
- Predication
  - Use conditional statements in place of branches
  - Good for short if-then-else statements
  - Improves branch prediction performance and uses simple hardware
- Speculation
  - Move instructions from one path of a branch to above the branch

    Before                      After
    Add r4, r5, r6              Add r4, r5, r6
    Beq elsewhere, r4, #6       Sub r8, r9, r10
    Sub r8, r9, r10             Beq elsewhere, r4, #6

  - Improves ILP
  - If the other path is taken, the speculated instruction must be taken care of

More on speculation
- What to do if the other path is taken?
  - Speculate only instructions that cannot cause problems
  - Preserve data, but allow exceptions
  - Provide HW to undo bad instructions
- Exceptions during speculation:
  - Only speculate non-excepting instructions
  - Ignore exceptions in speculated instructions
  - Delay exceptions until the path is known

Data speculation
- Move loads above stores that may not cause conflicts

    Before                After
    Add r7, r8, r9        Ld r2, r3
    St r10, r7            Add r7, r8, r9
    Ld r2, r3             St r10, r7
    Add r1, r2, r3        Add r1, r2, r3

Power Issues
- Are we able to reduce cycle counts?
  - The gain must not be offset by the extra power dissipation due to additional HW!
- For low power:
  - Compiler-driven branch prediction
  - Predication
  - Low-hardware instruction speculation
- High power:
  - Table-based branch prediction
  - Hardware data and instruction speculation