Evolution of DSP Processors. Kartik Kariya EE, IIT Bombay

Agenda
- Expected features of DSPs
- Brief overview of early DSPs
- Multi-issue DSPs
- Case study: VLIW-based processor (SPXK5) for mobile applications
Kartik Kariya Evolution Of DSPs 2

Why DSPs?
Most DSP tasks require:
- Real-time processing
- Repetitive numeric computations
- Attention to numeric fidelity
Processors must perform these tasks efficiently while minimizing:
- Cost
- Power
- Memory use
- Development time

Features of DSPs
DSP architecture is driven by algorithms, and those algorithms demand intensive computation. Consider an FIR filter with M taps:
$$y(n) = \sum_{k=0}^{M-1} h(k)\, x(n-k)$$
Each tap (M taps in total) requires:
- Two data fetches (the coefficient h(k) and the sample x(n-k))
- A multiply
- An accumulate
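To make the per-tap cost concrete, here is a minimal Python sketch of the direct-form FIR sum above. It is only a model, assuming floating-point samples and a filter that starts at rest (inputs before n = 0 are zero); the function name `fir_output` is hypothetical. Each loop iteration performs the two fetches, the multiply and the accumulate listed for one tap, and the function also returns the number of MACs it performed.

```python
from typing import Sequence


def fir_output(h: Sequence[float], x: Sequence[float], n: int) -> tuple[float, int]:
    """Compute y(n) = sum_{k=0}^{M-1} h(k) * x(n-k) and count the MACs used.

    Samples with a negative index are treated as zero (filter starts at rest).
    """
    acc = 0.0
    macs = 0
    for k in range(len(h)):                       # one iteration per tap
        coeff = h[k]                              # data fetch 1: coefficient
        sample = x[n - k] if n - k >= 0 else 0.0  # data fetch 2: delayed input
        acc += coeff * sample                     # multiply and accumulate
        macs += 1
    return acc, macs


h = [0.25, 0.5, 0.25]      # 3-tap smoothing filter (example values)
x = [1.0, 2.0, 3.0, 4.0]
y = [fir_output(h, x, n)[0] for n in range(len(x))]
```

An M-tap filter therefore costs M MACs per output sample, which is why single-cycle MAC hardware is the defining feature of a DSP.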

Features of DSPs
Features common to most DSP processors:
- Specialized hardware performs all key arithmetic operations in one cycle
- Specialized addressing modes for efficient memory access, e.g.:
  - Auto-increment
  - Circular addressing
  - Bit-reversed addressing (for the FFT)
- Hardware support for managing numeric fidelity:
  - Guard bits
  - Saturation
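The addressing modes and numeric-fidelity features above can be modelled in a few lines of Python. This is a sketch, not any particular DSP's behaviour: the helper names are hypothetical, and the 32-bit accumulator with 8 guard bits (a 40-bit register overall) is an assumed configuration chosen for illustration.

```python
def circular_next(index: int, step: int, base: int, length: int) -> int:
    """Advance a pointer inside the circular buffer [base, base + length)."""
    return base + (index - base + step) % length


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (FFT input/output reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def saturate(value: int, width: int) -> int:
    """Clamp a signed integer to the range of a `width`-bit two's-complement word."""
    hi = (1 << (width - 1)) - 1
    lo = -(1 << (width - 1))
    return max(lo, min(hi, value))


# Assumed accumulator layout: 32 result bits plus 8 guard bits = 40 bits.
ACC_BITS, GUARD_BITS = 32, 8


def mac_with_guard(acc: int, a: int, b: int) -> int:
    """Accumulate a*b in an (ACC_BITS + GUARD_BITS)-bit register, saturating on overflow."""
    return saturate(acc + a * b, ACC_BITS + GUARD_BITS)


acc = 0
for _ in range(4):
    acc = mac_with_guard(acc, -32768, -32768)  # each 16-bit product is 2**30
stored = saturate(acc, ACC_BITS)               # value written back to a 32-bit word
```

Four full-scale 16-bit products already exceed the 32-bit range, but the guard bits hold the exact sum; saturation clips the result only when it is written back to a 32-bit word, instead of letting it wrap around to a negative value.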

Features of DSPs (contd.)
Zero-overhead looping:
- Specialized hardware performs the loop test and branch
- Loop nesting is supported
Repeat-block flow (figure): at program initialization, the start and end address registers and the repeat counter are loaded. The code block to be repeated is then executed; at its end the hardware decrements the repeat counter and tests whether it has reached zero. If not, execution returns to the address held in the start address register; if it has, the loop exits.
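The repeat-block flow above can be simulated with a tiny instruction sequencer. In this hypothetical Python model, `LoopRegisters` stands in for the start address, end address and repeat-counter registers, and instructions are just strings. No branch instruction appears in the program: the decrement, the zero test and the jump back are done by the modelled loop hardware.

```python
from dataclasses import dataclass


@dataclass
class LoopRegisters:
    start: int    # start address register (first instruction of the block)
    end: int      # end address register (last instruction of the block)
    counter: int  # repeat counter (number of iterations)


def run_hardware_loop(program: list[str], regs: LoopRegisters) -> list[str]:
    """Return the trace of executed instructions.

    After the instruction at `end` executes, the loop hardware decrements the
    counter and either jumps back to `start` or falls through to the next
    instruction, without fetching any branch instruction.
    """
    trace = []
    pc = 0
    while pc < len(program):
        trace.append(program[pc])
        if pc == regs.end and regs.counter > 0:
            regs.counter -= 1
            if regs.counter != 0:
                pc = regs.start   # repeat the block
                continue
        pc += 1                   # counter reached zero: exit the loop
    return trace


program = ["init", "mac", "store", "done"]
trace = run_hardware_loop(program, LoopRegisters(start=1, end=2, counter=3))
```

The block (`mac`, `store`) runs exactly `counter` times, and every executed slot does useful work, which is what "zero overhead" refers to.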

Features of DSPs (contd.)
- Specialized, complex instructions:
  - To take maximum advantage of the processor hardware
  - To minimize the memory space required for storing programs
- Multiple operations per instruction
- I/O handling mechanisms that require no intervention from the computational units
- Other on-chip features such as ADCs, DACs, DMA controllers, etc.

Brief Overview of Early DSPs
First-generation DSPs, e.g. Texas Instruments TMS32010:
- 16-bit multiplier and 32-bit accumulator
- Issue and execute one instruction per clock cycle
- Performance: 6-8 MHz (390 ns per MAC instruction)
Second-generation DSPs, e.g. ADSP-21xx, TMS320C2xx, DSP560xx:
- Pipelined to some extent
- 20-50 MHz (75 ns per MAC)
- 16-, 24- or 32-bit instructions, which make the instruction set complicated and irregular
- Typically used in consumer and telecom products with modest DSP performance requirements

Brief Overview of Early DSPs (contd.)
Mid-range / third-generation DSPs:
- Higher clock speeds: 100-150 MHz
- Parallel execution units (multipliers, adders)
- Deeper pipelines and more parallelism
- Wider data buses and wider instruction words
- The increase in cost and power consumption is offset by performance: 20 ns per MAC
- Used in higher-performance DSP tasks such as wireless telecom and high-speed modems
- Examples: TMS320C54x, DSP16xxx (Lucent)
- Problems: assembly programming and compiling are difficult

Multi-issue DSPs
Goals:
- High performance
- Compiler-friendly architecture
Approach:
- Simple instructions, one operation per instruction
- Instructions are issued and executed in parallel groups
- 3 ns MAC throughput
- Targeted at demanding computational requirements
Two classes of multi-issue DSPs:
- VLIW
- Superscalar
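The MAC times quoted for each generation translate directly into real-time capacity. As a back-of-the-envelope sketch, assuming one MAC per tap and ignoring all other overhead (data movement, interrupts, I/O), the time to produce one output of an M-tap FIR is M times the MAC time. The 64-tap filter length and the function names are arbitrary choices for the example.

```python
# MAC times quoted in the slides, in nanoseconds per MAC.
MAC_TIME_NS = {
    "1st generation (TMS32010)": 390,
    "2nd generation": 75,
    "3rd generation": 20,
    "multi-issue": 3,
}


def fir_time_per_sample_ns(taps: int, mac_ns: float) -> float:
    """Time for one output of a `taps`-tap FIR, assuming one MAC per tap and no overhead."""
    return taps * mac_ns


def max_sample_rate_hz(taps: int, mac_ns: float) -> float:
    """Highest input sample rate at which every sample can be filtered in real time."""
    return 1e9 / fir_time_per_sample_ns(taps, mac_ns)


for name, mac_ns in MAC_TIME_NS.items():
    rate_khz = max_sample_rate_hz(64, mac_ns) / 1e3
    print(f"{name}: about {rate_khz:.1f} kHz for a 64-tap FIR")
```

Going from 390 ns to 3 ns per MAC raises the sustainable sample rate by the same factor of 130, from roughly audio rates toward rates needed in video and wireless baseband work.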

Multi-issue DSPs (contd.)
Very Long Instruction Word (VLIW) class:
- TMS320C62xx: the first multi-issue (VLIW) DSP, introduced in 1996
- Large number of parallel execution units
- Typically issue 4-8 instructions per cycle
- The assembly programmer or compiler decides the parallel instruction grouping based on data dependencies and resource contention; instruction groups do not change during execution
- Large number of instruction decoders, buses and registers, and hence high memory bandwidth
- Problem: high energy consumption
- Usage: computation-heavy applications, e.g. cellular base stations

Multi-issue DSPs (contd.)
[Figure: TMS320C6xx execution units. On-chip program memory feeds a dispatch unit with 32 x 8 = 256-bit fetch packets (eight 32-bit instructions). Eight execution units, L1, S1, M1, D1 on register file A and L2, S2, M2, D2 on register file B (32-bit registers), access on-chip data memory. Key: L = ALU, S = shifter, M = multiplier, D = address generator.]
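The compile-time grouping described for the VLIW class can be illustrated with a greedy packer. This is a simplified sketch, not TI's compiler or assembler: it assumes every operation completes in one cycle (real C6x instructions can have delay slots), models two units of each L/S/M/D class as in the figure, and uses hypothetical operation and register names. A new packet starts when an operation reads or rewrites a register written in the current packet, or when its unit class is already fully used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str               # functional-unit class: "L", "S", "M" or "D"
    dest: str               # destination register
    srcs: tuple[str, ...]   # source registers


# Two units per class, mirroring L1/S1/M1/D1 and L2/S2/M2/D2.
UNITS_PER_CLASS = {"L": 2, "S": 2, "M": 2, "D": 2}


def pack(ops: list[Op]) -> list[list[Op]]:
    """Greedy in-order packing of operations into fixed VLIW execute packets."""
    packets: list[list[Op]] = []
    current: list[Op] = []
    written: set[str] = set()
    used: dict[str, int] = {}
    for op in ops:
        depends = any(s in written for s in op.srcs) or op.dest in written
        busy = used.get(op.unit, 0) >= UNITS_PER_CLASS[op.unit]
        if current and (depends or busy):
            packets.append(current)             # close the packet
            current, written, used = [], set(), {}
        current.append(op)
        written.add(op.dest)
        used[op.unit] = used.get(op.unit, 0) + 1
    if current:
        packets.append(current)
    return packets


ops = [
    Op("load_x", "D", "A1", ("A10",)),
    Op("load_h", "D", "B1", ("B10",)),
    Op("mul", "M", "A2", ("A1", "B1")),   # needs both loads
    Op("add", "L", "A5", ("A5", "A2")),   # needs the product
    Op("dec", "S", "B0", ("B0",)),        # independent loop counter update
]
packets = pack(ops)
```

Here the two loads share a packet, the multiply waits for them, and the independent counter update rides along with the accumulate. Because the packets are fixed before execution, the hardware simply issues them in order; a superscalar processor would form such groups at run time.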

Multi-issue DSPs (contd.)
Superscalar class:
- Special hardware decides the parallel instruction grouping, considering data dependencies and resource contention
- Instruction groups can change during execution depending on data access, loop execution, etc.
- Problems: execution times are difficult to predict, hence not suitable for real-time applications; high energy and memory usage

Single Instruction Multiple Data Technique
Single Instruction Multiple Data (SIMD):
- Executes multiple instances of the same operation in parallel on different data
- Combined with VLIW, superscalar or conventional architectures
- Boosts performance in vector-heavy operations such as multimedia applications
- Based either on added parallel execution units (e.g. ADSP-2116x) or on a logical split of existing execution units (e.g. TigerSHARC)
- Problems:
  - Data must be arranged suitably in memory
  - Algorithms must be reorganized to use the processor resources
  - Not effective for algorithms that are inherently serial
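A "logical split of existing execution units" can be seen in a packed 16-bit add on a 32-bit datapath. The sketch below assumes two 16-bit lanes with wrap-around (non-saturating) arithmetic; the function names are hypothetical and do not correspond to any processor's instruction set. `padd16` works lane by lane, while `padd16_swar` obtains the same result from a single 32-bit add by blocking the carry at the lane boundary.

```python
MASK16 = 0xFFFF
LOW_BITS = 0x7FFF7FFF   # every bit except the top bit of each 16-bit lane
HIGH_BITS = 0x80008000  # the top bit of each 16-bit lane


def pack2(hi: int, lo: int) -> int:
    """Pack two 16-bit values (taken modulo 2**16) into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2(word: int) -> tuple[int, int]:
    """Split a 32-bit word into its (hi, lo) 16-bit lanes as unsigned values."""
    return (word >> 16) & MASK16, word & MASK16


def padd16(a: int, b: int) -> int:
    """Add two packed words lane by lane, modulo 2**16 per lane."""
    a_hi, a_lo = unpack2(a)
    b_hi, b_lo = unpack2(b)
    return pack2(a_hi + b_hi, a_lo + b_lo)


def padd16_swar(a: int, b: int) -> int:
    """Same result from one 32-bit add with the carry between lanes suppressed."""
    partial = (a & LOW_BITS) + (b & LOW_BITS)        # no carry can cross a lane
    return (partial ^ ((a ^ b) & HIGH_BITS)) & 0xFFFFFFFF  # restore top bit of each lane


a = pack2(0x0001, 0xFFFF)
b = pack2(0x0001, 0x0001)
```

A plain 32-bit add of these words lets the low lane's carry leak into the high lane; both packed versions keep the lanes independent, which is the property SIMD hardware provides when it splits one wide adder into narrower ones.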

Case Study: VLIW-based Processor (SPXK5) for Mobile Applications
Requirements:
- Higher processing power for multimedia applications such as video codecs, speech codecs and speech recognition systems, executing simultaneously
- Fast beat rate
- Minimum power consumption
Architectural overview: the SPXK5 incorporates a customized VLIW approach as well as SIMD features to give better performance. Its functional units are two multiply-accumulate units (MAC), two arithmetic units (ALU), two data address units (DAU) for load and store, and a system control unit (SCU) for branches, zero-overhead looping and conditional execution.

Case Study: VLIW-based Processor (SPXK5) for Mobile Applications (contd.)
[Figure: SPXK5 block diagram. Control: interrupt control, JTAG, loop control, stack control, fetcher and dispatcher on a 64-bit instruction bus. Functional units: MAC, MAC, ALU, ALU, DAU, DAU, SCU. Registers: eight 40-bit general-purpose registers R0-R7, with high/low halves R0H/R0L ... R7H/R7L; eight 32-bit address registers DP0-DP7; eight 16-bit offset registers DN0-DN7; system registers. Buses: main bus, X bus and Y bus, 32 bits each.]

Case Study: VLIW-based Processor (SPXK5) for Mobile Applications (contd.)
Features:
- Operating frequency 250 MHz; average power consumption 0.15 mW/MIPS at 1.5 V
- At most four functional units work simultaneously
- 16 KB instruction cache
- Six-stage pipeline: instruction fetch, dispatch queue, decode, DP register update, execution phase I and execution phase II
- Instructions are 16 or 32 bits long and instruction packets 16 to 64 bits, which gives higher code density
- Eight special SIMD instructions (PADD, PSUB, PSHIFT, PADDABS, PACKV, etc.) take advantage of data-level parallelism; they are useful for implementing DSP algorithms such as video encoding/decoding and the FFT

Case Study: VLIW-based Processor (SPXK5) for Mobile Applications (contd.)
Example: computing the mean absolute error (MAE) required for motion estimation in a video codec:
$$\mathrm{MAE} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \left| A_{mn} - B_{mn} \right|$$
(written here as the accumulated sum of absolute differences; dividing by MN gives the mean). Parallel operations over time, with each 32-bit load fetching two adjacent pixels:
- Step 1: 32-bit load (a0,1, a0,0); 32-bit load (b0,1, b0,0)
- Step 2: PSUB (a0,1 - b0,1, a0,0 - b0,0); 32-bit load (a0,3, a0,2); 32-bit load (b0,3, b0,2)
- Step 3: PADDABS += |a0,1 - b0,1|, |a0,0 - b0,0|; PSUB (a0,3 - b0,3, a0,2 - b0,2); 32-bit load (a0,5, a0,4); 32-bit load (b0,5, b0,4)
- Step 4: PADDABS += |a0,3 - b0,3|, |a0,2 - b0,2|; PSUB (a0,5 - b0,5, a0,4 - b0,4); 32-bit load (a0,7, a0,6); 32-bit load (b0,7, b0,6)
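The MAE computation can be modelled in Python to show what PSUB and PADDABS contribute. The lane semantics here are assumptions read from the slide (PSUB: lane-wise subtraction of two packed pixel pairs; PADDABS: add the absolute values of both lanes to an accumulator), and all function names are hypothetical. The loop runs sequentially; on the SPXK5 the loads, PSUB and PADDABS of consecutive iterations overlap as in the timeline above.

```python
def load32(row: list[int], j: int) -> tuple[int, int]:
    """Model a 32-bit load of two adjacent 16-bit pixels as (row[j+1], row[j])."""
    return row[j + 1], row[j]


def psub(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Lane-wise subtraction of two packed pairs (modelled PSUB)."""
    return a[0] - b[0], a[1] - b[1]


def paddabs(acc: int, d: tuple[int, int]) -> int:
    """Add the absolute values of both lanes to the accumulator (modelled PADDABS)."""
    return acc + abs(d[0]) + abs(d[1])


def sad_row_simd(a_row: list[int], b_row: list[int]) -> int:
    """Sum of |a - b| over one row, two pixels per packed operation."""
    assert len(a_row) == len(b_row) and len(a_row) % 2 == 0
    acc = 0
    for j in range(0, len(a_row), 2):
        diff = psub(load32(a_row, j), load32(b_row, j))
        acc = paddabs(acc, diff)
    return acc


def mae_block(A: list[list[int]], B: list[list[int]]) -> float:
    """Mean absolute error over an M x N block: the accumulated sum divided by M*N."""
    total = sum(sad_row_simd(a, b) for a, b in zip(A, B))
    return total / (len(A) * len(A[0]))
```

Each PSUB/PADDABS pair covers two pixels, so the packed version needs half as many arithmetic instructions as a scalar loop, and the VLIW issue lets the loads run in the same cycles.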

Conclusion
- DSP processor performance has increased substantially over the years
- Drivers for the evolution of DSPs: speed, energy, memory usage, cost
- The focus is on compiler-friendly architectures
- DSP processor architectures are increasingly being specialized for specific applications