High Performance DSP Solutions for Ultrasound


By Hong-Swee Lim, Senior Manager, DSP/Embedded Marketing
Hong-Swee.Lim@xilinx.com
12 May 2008

DSP Performance Gap
[Figure: performance (algorithmic and processor forecast) versus time, with levels marked at 5 GMACs, 30 GMACs and 350 GMACs. Labelled curves and regions: Algorithm Complexity, DSP/GPP Performance Limit, Traditional Processor Architectures, Spartan-DSP, Virtex-DSP. Source: Jan Rabaey, BWRC.]
Application labels (upper group): 3D Medical Imaging, Wireless Base Stations, HD Audio/Video Broadcast, Radar & Sonar, HD Video Surveillance, Mobile Software Defined Radio, MIMO, High End Ultrasound.
Application labels (lower group): Low End Ultrasound, Pico/Femto Base Stations, Consumer Video, SD/HD Video Surveillance, Mobile Software Defined Radio.

Agenda
- The Demand for DSP in Medical Imaging
- FPGAs: The Programmable Ultra High DSP Performance Platform
- The DSP48E Slice
- Essential DSP Building Blocks
- Imaging Algorithms
- Digital Beamforming
- High Level Development Tools
- Conclusion

A Little Ultrasound History
First ultrasound machines, introduced in the mid-1950s:
- Analog processing chain
- Low ultrasound frequencies
- 2-D images
- Small image sizes
- Black and white
Latest ultrasound machines:
- Digital processing chain
- Higher ultrasound sample frequencies (50 MHz)
- Portable
- 2-D, 3-D and 4-D
- Larger image sizes
- Colour images
- Higher quality
- Elastography
- Tissue harmonic imaging
Trend: more and more data being processed faster and faster.
Photos courtesy of Dynamic Imaging Limited and Siemens Medical.

FPGAs: The Programmable High Performance DSP Platform

Spartan-3A DSP Overview
- Two devices: XC3SD3400A (over 30 GMACS) and XC3SD1800A (over 20 GMACS)
- Built on the cost-effective, industry-accepted Spartan platform
- Superset of the Spartan-3A platform
- Increased capacity of DSP resources, memory and logic
- Signal processing, memory capacity, bandwidth
- Integrated, cost-optimized XtremeDSP DSP48A slice

Spartan-3A DSP DSP48A
- 250 MHz operation in the lowest cost speed grade

The Virtex-5 DSP Messages
- Higher performance (352 GMACs and a 38% improvement over Virtex-4)
- Optimized ratio of circuit functions (logic, memory and DSP)
- Expanded functionality (higher precision, SIMD)
- Lower power (35% reduction in dynamic power over Virtex-4)

Virtex-5 DSP48E Slice
[Figure: DSP48E slice block diagram. Inputs B (18 bits), A (25 bits) and C (48 bits) with input registers (the A and B registers are 2 deep); A:B concatenation (48 bits); multiplier followed by the M register; X, Y and Z multiplexers, including 17-bit shift paths, selected by the 7-bit OpMode; 48-bit ALU controlled by the 4-bit ALUMode, with CarryIn; 48-bit P register and output; pattern detect against C or MC; cascade ports ACIN/ACOUT, BCIN/BCOUT and PCIN/PCOUT.]
- 450 MHz operation in the slowest speed grade

Common Functions DSP Designers Need
The DSP48E provides the key functions for DSP:
- Multiplier: opmode = 0000101
- Multiply accumulate: opmode = 0100101
- Adder / accumulator: opmode = 0110010
- Multiply add: opmode = 0110101
These form the building blocks for the majority of arithmetic functions required for DSP. Cascading capabilities for multiply-add, accumulate and adder chains are also a requirement for performance-driven designs.

DSP48E Slice Power Savings
[Figure: average power (mW, 0 to 70) versus frequency (MHz, 0 to 600) for four multiplier configurations.]
- Virtex-4 25x25: 14.3 mW/100 MHz
- Virtex-5 25x25: 3.6 mW/100 MHz
- Virtex-4 18x18: 3.0 mW/100 MHz
- Virtex-5 25x18: 1.8 mW/100 MHz
Dynamic power saving: 40% per DSP48E slice, 70% per 25x25 multiplier.
Conditions: 25 C, nominal Vcc, fully pipelined, 50% input toggle rate; dynamic power consumption based on hardware test results.

Larger Memories Benefit Imaging
- Each BRAM block can be used as one 36K BRAM / FIFO, as two independent 18K BRAMs, or as one 18K FIFO and one 18K BRAM
- The 36K BRAM size is doubled from Virtex-4
- Significant in FFTs, beamformer delays and image buffers

Essential DSP Building Blocks

Key Building Block DSP Functions in Ultrasound
- FIR filters
- 2-D FIR filters
- FFTs
- Floating point operators
- CIC filters
- Adaptive filtering
- Video mixing and general video functions
A subset of these is marked on the slide as addressed in this presentation.

High Performance Filters
[Figure: sample rate (MHz, log scale, 0.5 to 500) versus number of coefficients N (log scale, 1 to 1000), showing the regions covered by parallel, semi-parallel and sequential FIR filters.]
- FPGAs can implement a complete spectrum of filters of differing performance using the DSP48E
- High performance filters with sample rates greater than 50 MHz are of most interest in medical imaging
- High performance structures must use multiple DSP48Es in parallel to achieve the required compute speed

Parallel Systolic FIR Filter
Filter specification: sampling frequency = 450 MHz, coefficients = 31
[Figure: cascade of DSP48E slices with coefficients K0 to K31 from left to right; 18-bit input x(n), 41-bit output y(n); first slice opmode = 0000101, following slices opmode = 0010101.]
- The input time delay series is created inside the DSP slice for maximum performance, irrespective of the number of coefficients
- Although referred to as a systolic FIR filter, this structure is really a direct form with one extra stage of pipelining
- Coefficients are ordered from left to right, so the latency grows with the number of coefficients
- Dedicated cascade connections (PCOUT and PCIN) are exploited to achieve maximum performance
- Max sample rate = clock rate
- Filter size: 31 DSP48E slices
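As a behavioural illustration of the fully parallel structure (one multiplier per coefficient, sums passed along an adder chain), the following Python sketch may help. It is a software model only: it ignores pipelining and bit widths, and the function name `parallel_fir` is a hypothetical one chosen for this example.

```python
def parallel_fir(samples, coeffs):
    """Behavioural model of a fully parallel FIR: one multiply per tap per sample."""
    taps = [0] * len(coeffs)           # input delay line, newest sample first
    out = []
    for x in samples:
        taps = [x] + taps[:-1]         # shift the delay line by one sample
        acc = 0                        # stands in for the cascaded adder chain
        for k, t in zip(coeffs, taps):
            acc += k * t               # in hardware: one DSP slice per coefficient
        out.append(acc)
    return out

# An impulse at the input reproduces the coefficients at the output.
print(parallel_fir([1, 0, 0, 0, 0], [3, -1, 2]))
```

Because every tap has its own multiplier, one output is produced per input sample, which is why the sample rate can equal the clock rate.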

4 Multiplier Semi-Parallel FIR Filter
Specification: sampling frequency = 100 MHz, coefficients = 16
[Figure: four DSP48E multiply-add slices (first slice opmode = 0000101, the others opmode = 0010101) fed by SRL16E input buffers and coefficient memories holding K0 to K15, followed by an accumulator slice (opmode = 0010010) and a capture register; 16-bit input x(n), 18-bit data paths, 40-bit output y(n).]
- The input time delay series is created outside the XtremeDSP slice, as SRL16Es are required to store the set of inputs that drives each engine
- Each SRL16E and coefficient memory buffer has identical addressing
- The adder chain pipeline register is compensated for in the addressing of the memories, so each buffer's address is delayed by one clock cycle
- Max sample rate = clock rate x number of multipliers / number of taps
- An extra XtremeDSP slice is required to accumulate the results over the 4 clock cycles before the slower capture register grabs the final output
- Filter size: 5 DSP48E slices, 208 LUT6-FF pairs (24 for control)
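The sample-rate formula and the time-multiplexing idea can be sketched in Python as follows. This is an illustrative model, not the Xilinx implementation: the function names are hypothetical, the tap-to-multiplier assignment is one possible choice, and the 450 MHz clock is taken from the summary table later in the deck. The model assumes the number of taps is a multiple of the number of multipliers.

```python
def max_sample_rate_mhz(clock_mhz, multipliers, taps):
    # Slide formula: clock rate x number of multipliers / number of taps
    return clock_mhz * multipliers / taps

def semi_parallel_fir(samples, coeffs, multipliers):
    """Each multiplier serves len(coeffs) / multipliers taps, one per clock cycle."""
    per_mult = len(coeffs) // multipliers       # clock cycles per output sample
    history = [0] * len(coeffs)                 # stands in for the SRL16E buffers
    out = []
    for x in samples:
        history = [x] + history[:-1]
        acc = 0                                 # the extra accumulator slice
        for cycle in range(per_mult):           # time-multiplexed clock cycles
            for m in range(multipliers):        # multipliers working in parallel
                i = m * per_mult + cycle        # same address for data and coefficient
                acc += coeffs[i] * history[i]
        out.append(acc)
    return out

print(max_sample_rate_mhz(450, 4, 16))   # 112.5
```

With 4 multipliers and 16 taps, each output takes 4 clock cycles, so a 450 MHz clock comfortably covers the 100 MHz specification.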

FIR Compiler v3.1: Max Performance at the Push of a Button
- Ensures maximum performance and smallest area in a simple-to-use wizard flow
- Provides all aspects of the FIR filter algorithm:
  - Number of taps
  - Number of channels
  - Single, multi or fractional rate
  - Bit widths
- Clock frequency control enables trade-off between performance and area
- Resource estimation panel enables rapid resource analysis
- Verify system specification for implementation

FFT v4.1 Architectures
Delivered through Core Generator and System Generator.
[Figure: sample rate axis (MSPS) from 0 to 450 showing the range covered by each architecture.]
Pipeline FFTs (input and output a sample every clock cycle):
- Radix-2 Single Delay Feedback (SDF), streaming I/O. 1K FFT resources: 16 DSP48, 6 BRAM, 3374 LUT6-FF pairs
Loop engine FFTs (a single butterfly processes each rank):
- Radix-4 Dragonfly loop engine (max throughput ~85 MSPS). 1K FFT resources: 12 DSP48, 6 BRAM, 1748 LUT6-FF pairs
- Radix-2 Butterfly loop engine (max throughput ~50 MSPS). 1K FFT resources: 4 DSP48, 3 BRAM, 868 LUT6-FF pairs
- Radix-2 Lite Butterfly loop engine (max throughput ~25 MSPS). 1K FFT resources: 2 DSP48, 3 BRAM, 742 LUT6-FF pairs

Typical FFTs in Ultrasound
Below are typical FFTs desired in ultrasound imaging systems and the requirements they place on hardware implementations. Note the low performance and area requirements.

Number of points              256       512
Sample rate                   10 MHz    200 kHz
Buffer size (words)           512       1,024
Number of stages              8         9
Butterflies per stage         128       256
Total number of butterflies   1,024     2,304
Number of multiplications     4,096     9,216
Clock cycles (300 MHz)        7,680     768,000
Multipliers required          1         1
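The rows of this table follow from a few lines of arithmetic, sketched below in Python. The function name `fft_budget` is hypothetical; the sketch assumes a radix-2 FFT with a power-of-two length, four real multiplications per butterfly, and that the available clock cycles are the number of 300 MHz cycles in the time it takes to collect one frame of samples.

```python
import math

def fft_budget(points, sample_rate_hz, clock_hz=300_000_000):
    stages = points.bit_length() - 1               # log2(points) for a power of two
    butterflies = stages * (points // 2)           # points / 2 butterflies per stage
    multiplications = 4 * butterflies              # 4 real multiplies per complex multiply
    cycles = points * clock_hz // sample_rate_hz   # clock cycles available per frame
    multipliers = math.ceil(multiplications / cycles)
    return stages, butterflies, multiplications, cycles, multipliers

print(fft_budget(256, 10_000_000))   # (8, 1024, 4096, 7680, 1)
print(fft_budget(512, 200_000))      # (9, 2304, 9216, 768000, 1)
```

In both cases the multiplications fit within the available clock cycles, so a single time-shared multiplier is enough.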

Radix-2 Loop Engine
- Supports data rates from 25 MSPS to 45 MSPS
- Burst interface (can be streaming with FIFO buffering)
[Figure: input data written to two data dual-port memories (DPM 0 and DPM 1) that feed a 2-input Radix-2 butterfly engine with a ROM for twiddles, producing the output data.]

Latest Architecture Lowers Area by 30%
- The reduced architecture (Radix-2 Lite loop engine) is the smallest (~30% smaller)
- Stores data in a single RAM
- Twiddle ROM supplies sine one cycle, cosine the next
- Multiplies real one cycle, imaginary the next
- Generates one output each cycle
[Figure: Radix-2 Lite loop engine with data memories (DPM 0, DPM 1), a ROM for twiddles and a Radix-2 butterfly between input data and output data.]

DSP48 Enables Complex Multiplier and Butterfly
- Large adders (>16 bits) do NOT reach top clock performance
- SIMD mode of the DSP48E enables maximum speed in the butterfly at efficient cost
- A 2-cycle engine enables time sharing of DSP48Es between the butterfly addition and the complex multiplier, lowering cost
[Figure: inputs RXm, IXm, RYm and IYm (18 bits) and a sin/cos LUT feeding DSP48E slices (opmode = 0010101 and opmode = 0000101) and an add/sub stage.]
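The two-cycle time sharing can be pictured with a small Python model of a radix-2 butterfly whose twiddle multiplication is split into a real-part pass and an imaginary-part pass. This is an arithmetic sketch only, with hypothetical function names; it does not model the actual DSP48E mapping or fixed-point widths.

```python
def complex_multiply_two_cycle(xr, xi, c, s):
    """(xr + j*xi) * (c + j*s), computed as two passes of two real multiplies."""
    real = xr * c - xi * s    # cycle 1: real part
    imag = xr * s + xi * c    # cycle 2: imaginary part, same two multipliers reused
    return real, imag

def radix2_butterfly(a, b, w):
    """Return (a + b*w, a - b*w) for complex a, b and twiddle factor w."""
    tr, ti = complex_multiply_two_cycle(b.real, b.imag, w.real, w.imag)
    t = complex(tr, ti)
    return a + t, a - t

print(radix2_butterfly(1 + 2j, 3 - 1j, 0.5 + 0.5j))
```

Splitting the complex multiply over two cycles halves the number of multipliers needed, at the cost of halving the butterfly rate.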

FFT v4.1: Complete FFT at the Push of a Button
- Ensures high performance and minimal area in a simple-to-use wizard flow
- Provides all aspects of the FFT algorithm:
  - Transform length
  - Number of channels
  - Rounding and scaling
  - Bit widths
- Clock frequency control enables trade-off between performance and area
- Resource estimation panel enables rapid resource analysis

Virtex-5 SP FP Adder
[Figure: single-precision floating-point adder pipeline with blocks labelled input select, exponent alignment and addition, LOD, normalization and round, and output conditioning, implemented in DSP48E slices and logic.]
- The 25x18 DSP48E reduces resources by 50%
- Floating point adder size and performance: 2 DSP48E slices, 366 LUT6-FF pairs, 410 MHz, latency = 12 cycles

Floating Point Operators v3.0: Floating Point Is Actually Possible
- Ensures high performance and minimal area in a simple-to-use wizard flow
- Comprehensive set of arithmetic operators:
  - Add / subtract
  - Multiply
  - Compare
  - Fixed / float conversion
  - Divide
  - Square root

Virtex-4 vs Virtex-5 Floating Pt
Resource usage (LUT-FF pairs / DSP48E), one column per operator; maximum latency cores used:

Single Precision V-5   177 / 3    375 / 2    80 / 0    226 / 0   237 / 0   1370 / 0   787 / 0
Single Precision V-4   235 / 5    466 / 4    94 / 0    233 / 0   238 / 0   1370 / 0   787 / 0
Double Precision V-5   654 / 13   967 / 3    142 / 0   504 / 0   446 / 0   6002 / 0   3234 / 0
Double Precision V-4   759 / 17   1220 / 4   161 / 0   565 / 0   523 / 0   6002 / 0   3234 / 0

[Figure: performance goal comparison.]
- Single precision: Virtex-5 is 22% faster than Virtex-4
- Double precision: Virtex-5 is 28% faster than Virtex-4

How Much Floating Point Can Virtex-5 Do?
[Figure: FF, LUT and DSP48E resource utilization for the V-5 SX35T, V-5 SX50T and V-5 SX95T.]
- More than 50 GFLOPs is possible in a 5SX95T

Summary of Building Blocks

DSP algorithm                                               DSP48E slices  BRAM  LUT6-FF pairs  Clock performance
FIR filter, 450 MSPS, 31 tap, 18-bit                        31             0     0              450 MHz
FIR filter, 100 MSPS, 16 tap, 18-bit                        5              0     208            450 MHz
FFT, 300 MSPS, 1K point, 18-bit                             36             7     3,742          305 MHz
FFT, 300 MSPS, 4K point, 18-bit                             44             19    4,560          280 MHz
Floating point operators, mult / add, single precision      5              0     552            410 MHz
Floating point operators, complete set, single precision    5              0     1,436          365 MHz

Key Imaging Algorithms

Modalities and Algorithms
Ultrasound:
- Digital beamforming
- Demodulation
- Image forming
- Image reconstruction
- B-mode
- Doppler
- Colour flow processing
- M-mode
- Elastography
- 2-D noise filtering
- 3-D and 4-D imaging
- Video functions

Ultrasound System Overview
[Figure: ultrasound system block diagram. Front end to ADC / DAC with TX beamformer, RX beamformer and beamformer control (Tx and Rx not at the same time); demodulator; image pre-processing; B-mode, M-mode, Doppler and colour flow processing; gray level image reconstruction and manipulation techniques; tissue analysis and diagnoses; video scaling; MPEG-2 encoding for DVD; backplane to PCI / PCIe; 3-D graphics (GPU); host PC and display. Data rates marked along the chain: 50 MSPS, 200 MSPS, 50 MSPS, slow KSPS.]

Digital Beamforming: A Compute Problem
[Figure: ultrasound Rx beamformer. Transducers feed 12-bit multi-channel serial ADCs; each channel passes through serial-to-parallel conversion (S/P), a low-pass filter (LPF), a variable delay and apodization on the way to the demodulator.]
Key questions:
1. How many channels can I fit into a single FPGA?
2. What is the cost and power per channel?
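The per-channel delay and apodization stages, followed by a sum across channels, amount to delay-and-sum beamforming. A minimal Python sketch, assuming integer sample delays and one apodization weight per channel (the function name and argument layout are hypothetical):

```python
def delay_and_sum(channels, delays, weights):
    """channels: per-channel sample lists; delays: integer sample delays;
    weights: one apodization weight per channel."""
    n = len(channels[0])
    beam = [0.0] * n
    for samples, d, w in zip(channels, delays, weights):
        for t in range(d, n):
            beam[t] += w * samples[t - d]   # delayed, weighted contribution
    return beam

# Two channels see the same echo at different times; the delays realign them.
echoes = [[0, 0, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 0]]
print(delay_and_sum(echoes, delays=[0, 2], weights=[0.5, 0.5]))
```

The work grows with the number of channels times the sample rate, which is why the slide frames beamforming as a compute problem.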

A High Performance Beamformer Architecture
[Figure: data path for a pair of channels: 1-bit serial inputs, serial to parallel (12 bits), 20-tap interpolation filter (interpolate by 4), 2K variable delays and window (18 bits). Rates marked along the path: ~600 MSPS, ~50 MSPS, ~100 MSPS, ~400 MSPS, ~200 MSPS.]
- Serial inputs greatly reduce the required pins of the FPGA
- Double Data Rate (DDR) I/Os and serial-to-parallel converters slow the input data stream down to manageable rates
- 2 channels are interleaved to exploit the available FPGA performance, reducing cost
- The interpolation filter enables finer control of individual beams
- 2K-deep delays fit perfectly in the Virtex-5 block RAM and provide good beam steering ability

Multi-Channel Multi-Rate Filter
[Figure: 12-bit input x(n), data memories (24 x 16 and 8 x 16), reloadable coefficient memories, cascaded DSP48E slices (slice 1 opmode = 0000101, slice 2 opmode = 0010101, ALU mode = 0000), re-order buffer (8 x 18) and 18-bit output y(n).]
- 2-channel, 20-tap, interpolate-by-4 filter
- The input data stream is 2-channel time division multiplexed (TDM)
- Reloadable coefficient memories are created out of small dual-port distributed memories, capable of storing 3 different sets
- A simple output reorder buffer makes sure the output is TDM like the input
- Dedicated cascade connections (PCOUT and PCIN) are exploited to achieve maximum performance
- Only a single phase of the interpolator is implemented, and each clock cycle yields a new result from a 5-tap polyphase arm. Each channel is processed in order
- Filter size: 5 DSP48E slices, 250 LUT6-FF pairs (80 for control), 400 MHz
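The reason a 20-tap interpolate-by-4 filter needs only a 5-tap arm per output can be checked with a small Python sketch. It compares a polyphase interpolator against the textbook "insert zeros, then filter" form for a single channel. The function names are hypothetical, and the sketch assumes the filter length is a multiple of the interpolation factor.

```python
def polyphase_interpolate(x, h, L=4):
    """Interpolate x by L using L polyphase arms of the prototype filter h."""
    arms = [h[p::L] for p in range(L)]     # arm p holds h[p], h[p + L], ...
    hist = [0] * len(arms[0])              # short delay line at the input rate
    y = []
    for sample in x:
        hist = [sample] + hist[:-1]
        for arm in arms:                   # one output sample per phase
            y.append(sum(c * v for c, v in zip(arm, hist)))
    return y

def upsample_then_filter(x, h, L=4):
    """Reference form: insert L - 1 zeros after each sample, then run the full FIR."""
    up = []
    for sample in x:
        up.extend([sample] + [0] * (L - 1))
    return [sum(h[k] * up[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(up))]

h = list(range(1, 21))                     # placeholder 20-tap coefficients
x = [1, 2, 3, -1]
print(polyphase_interpolate(x, h) == upsample_then_filter(x, h))   # True
```

Each output uses 5 multiplications instead of 20, because the other 15 would multiply inserted zeros.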

Variable Delay Element
[Figure: x(n) 200 MHz samples (18 bits) written into a 2K x 18 variable delay addressed by a counter (0-2048) and an 11-bit beam delay value; output rate (50 x output beams) MHz.]
- Interpolated samples are streamed into the variable delay
- The dual-port aspect of the block RAMs is excellent for delay elements
- Beam delay values are written into a small memory, which enables a rapid update rate
- 2K-deep delays are a perfect fit for ultrasound beam steering and Virtex-5 memories
- Filter size: 2 block RAM, 100 LUT6-FF pairs (80 for control), 250 MHz
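A dual-port RAM used as a delay element behaves like a circular buffer: one port writes at a free-running counter address while the other reads at that address minus the requested delay. A minimal Python sketch follows; the class name and the write-before-read ordering are assumptions made for the example.

```python
class VariableDelay:
    """Circular buffer standing in for a 2K-deep dual-port block RAM."""

    def __init__(self, depth=2048):
        self.ram = [0] * depth
        self.depth = depth
        self.wr = 0                             # free-running write counter

    def step(self, sample, delay):
        """Write one sample and read the sample from 'delay' steps ago."""
        self.ram[self.wr] = sample              # port A: write
        rd = (self.wr - delay) % self.depth     # port B: read address
        out = self.ram[rd]
        self.wr = (self.wr + 1) % self.depth
        return out

d = VariableDelay(depth=8)
print([d.step(i, 3) for i in range(1, 8)])   # [0, 0, 0, 1, 2, 3, 4]
```

Changing the `delay` argument from call to call is what steers the beam; valid delays in this model run from 0 to depth - 1.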

What is the Total Cost?

Structure                                                                LUT6-FF pairs  DSP48  BRAM
Serial to Parallel Converter                                             24             0      0
2 Channel Interpolation Filter                                           250            5      0
Variable Delays                                                          100            0      2
Windowing Function                                                       120            1      1
Summation                                                                24             0      0
Total for 2 Channels                                                     538            6      3
Miscellaneous Functions (Control Interface, DDR Memory Controller, DMA)  3500           0      0

64 Channel Beamformer: 192 DSP48E slices, 96 BRAM, 20,716 LUT6-FF pairs, 400 MHz
128 Channel Beamformer: 384 DSP48E slices (60% of 5VSX95T), 192 BRAM (79% of 5VSX95T), 37,932 LUT6-FF pairs (64% of 5VSX95T), 400 MHz
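The 64- and 128-channel figures can be reproduced by scaling the "Total for 2 Channels" row by the number of channel pairs and adding the miscellaneous functions once. The Python sketch below takes both rows as given in the table; the function and constant names are hypothetical.

```python
def beamformer_resources(channels, per_pair, misc):
    """Scale per-2-channel resources and add the fixed overhead once.
    Tuples are ordered (LUT6-FF pairs, DSP48E slices, BRAM)."""
    pairs = channels // 2
    return tuple(pairs * r + m for r, m in zip(per_pair, misc))

RX_PAIR = (538, 6, 3)     # "Total for 2 Channels" row
RX_MISC = (3500, 0, 0)    # control interface, DDR memory controller, DMA

print(beamformer_resources(64, RX_PAIR, RX_MISC))    # (20716, 192, 96)
print(beamformer_resources(128, RX_PAIR, RX_MISC))   # (37932, 384, 192)
```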

Potential Architecture
[Figure: four 5VSX35T FPGAs (FPGA 1 to FPGA 4), each fed by four 8-channel, 12-bit ADCs (labelled @40 MHz and @50 MHz on the slide); 32 channels per chip and 128 channels in total; 24-bit (50 MHz) output to demodulation.]
Total power for a 128-channel digital receiver beamformer is estimated at:
- FPGA: ~2.7 x 4 = 10.8 W
- ADC and VGA: ~0.8 x 16 = 12.8 W
- Total: 23.6 W (further investigation needed)

Other Aspects to Consider
[Figure: ultrasound Rx single-channel beamformer: S/P, LPF, variable delay and apodization, with a delay calculator and an apodization calculator for the output beams.]
- Every channel and output beam also needs a delay calculator and an apodization calculator
- These can be implemented using external memory storing tables, or can be calculated dynamically
- We would like to work with you on beamforming as we consider Virtex-6: what is your target cost and power per channel?

TX Signal Flow Block Diagram @ 80 Msps
[Figure: a counter (9 bits) addresses pulse storage memories (1K x 18), with stored gain values (REG, 10 bits), driving the DACs for transducer channels 0 to N-1; a control interface configures the blocks.]
- Each pulse is read out of storage on a programmable count value
- Unique storage for each channel's pulses

What is the Total Cost?

Structure                                     LUT6-FF pairs  DSP48  BRAM
Transmit Waveform Storage                     0              0      2
Control                                       50             0      0
Complex Gain and Summation                    34             1      0
DAC Interface                                 50             0      0
Total for 2 Channels                          134            1      2
Miscellaneous Functions (Control Interface)   500            0      0

64 Channel Tx Beamformer: 32 DSP48E slices, 64 BRAM, 4,788 LUT6-FF pairs, 200 MHz
128 Channel Tx Beamformer: 64 DSP48E slices (10% of 5VSX95T), 128 BRAM (52% of 5VSX95T), 9,576 LUT6-FF pairs (16% of 5VSX95T), 200 MHz
Pin count is the main concern.

Ultrasound System Overview
[Figure: ultrasound system block diagram with the same blocks and data rates (50 MSPS, 200 MSPS, 50 MSPS, slow KSPS) as the earlier Ultrasound System Overview slide.]
Key questions:
1. What is the cost and power per demodulation channel?
2. What is the rate change of the demodulator?
3. What are the filter specifications?

Demodulation
[Figure: the 17-bit input is mixed with cos(2πf1t) and sin(2πf1t) from a DDS (DDS Compiler); each of the I and Q paths passes through FIR1 (100 tap, decimate by 10) and FIR2 (48 tap), both from FIR Compiler, with 17-bit data paths. Sample rates marked: 50 MHz and 1 MHz.]
Key questions:
- What are the input and output clock frequencies?
- Is the IP being used?
- How many channels?
- Is the rate change programmable?
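The mix, filter and decimate chain can be sketched in plain Python as below. This is a simplified illustration with hypothetical names: it uses a single decimating FIR stage rather than the FIR1 / FIR2 pair on the slide, floating-point arithmetic rather than 17-bit data paths, and a moving-average low-pass filter chosen only to keep the example short.

```python
import math

def quadrature_demodulate(x, f_carrier, f_sample, lpf, decimation):
    """Mix x with cos/sin at f_carrier, low-pass filter, and decimate."""
    step = 2 * math.pi * f_carrier / f_sample
    i_mix = [v * math.cos(step * n) for n, v in enumerate(x)]   # I path
    q_mix = [v * math.sin(step * n) for n, v in enumerate(x)]   # Q path

    def fir_decimate(sig):
        # Compute only the outputs that survive decimation.
        return [sum(h * sig[n - k] for k, h in enumerate(lpf) if n - k >= 0)
                for n in range(0, len(sig), decimation)]

    return fir_decimate(i_mix), fir_decimate(q_mix)

# A carrier at f_sample / 10 demodulates to a constant I value and zero Q.
tone = [math.cos(2 * math.pi * 5 * n / 50) for n in range(100)]
i_out, q_out = quadrature_demodulate(tone, 5, 50, [0.1] * 10, 10)
print(round(i_out[1], 6), round(abs(q_out[1]), 6))
```

Computing only the retained outputs is the same saving a decimating FIR makes in hardware: the multiply rate drops by the decimation factor.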

High Level Design Tools

DSP Tools and Flows Accelerate DSP Design
- ISE
- Platform Studio
- AccelDSP
- System Generator

DSP Development Environment
Offers a complete DSP design flow from The MathWorks MATLAB/Simulink model-based design environment.
- AccelDSP Synthesis (MATLAB to FPGAs): MATLAB algorithm acceleration
- System Generator for DSP (Simulink to FPGAs): Simulink algorithm acceleration, DSP system design, RTL verification
- DSP IP and reference designs
- Hardware platforms

System Generator for DSP
- Visual data flow paradigm
- Polymorphic block libraries
- Bit and cycle true modeling
- Seamlessly integrated with Simulink and MATLAB: test bench and data analysis
- Automatic code generation: synthesizable VHDL, IP cores, HDL test bench, project and constraint files

Hardware Accelerated Simulation
- System Generator supports automated HIL flows to an extensive set of commercially available boards
- Up to 1000x simulation performance improvement
- Offers an easy way to accelerate algorithms for data effect analysis
- Automatically creates an FPGA bitstream from Simulink
- Transparent use of FPGA implementation tools

Embedded Processor Design
- DSP software components can be quickly implemented on an embedded processor
- Integration with Platform Studio
- Interface details are abstracted away through a shared memory interface
[Figure labels: System Generator, Platform Studio, Platform Studio pcore.]

System Integration Platform
- System Generator provides a common platform for integrating the RTL, algorithm, software, interface and processor components of a DSP system
- Co-simulate in a DSP modeling environment
- Single flow to implementation
[Figure: example echo canceller system combining VHDL / Verilog models, C/C++ models, MATLAB models and System Generator models. Blocks include DRAM and DRAM interface, page buffers, ulaw/alaw conversion, speech and tone detection, echo canceller, NLP, adaptive algorithm and echo estimation, and system control.]

AccelDSP Design Flow
Customer proven to increase productivity up to 20X!
- Typical MATLAB DSP design flow: floating-point algorithm, fixed-point conversion, architecture definition, create / integrate IP blocks, create RTL design, refine architecture, verify RTL, RTL synthesis
- AccelDSP design flow: floating-point algorithm, AccelDSP, RTL synthesis. AccelDSP replaces the manual steps with an integrated design flow
"We saw a 30% reduction in the design cycle time. This equated to an overall project development reduction of 15 percent, which provides two very significant benefits: we get our products to market faster and our teams are freed up to work on other projects sooner."
Dr. Paul Turner, Principal Systems Engineer, Powerwave Technologies

Floating- to Fixed-point Conversion
- Floating-point MATLAB models are automatically converted into fixed-point:
  - Fixed-point bit widths
  - Binary point conversion
  - Saturation and rounding logic
- The process is user interactive and controllable
- Fixed-point hardware is automatically generated
- Analysis features help address reduced-precision arithmetic errors: signal probes, fixed-point reports, histogram overflow and underflow reporting (Accel Probe)
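What a floating- to fixed-point conversion has to decide for each signal (word length, binary point position, rounding and saturation) can be shown with a short Python sketch. This is a generic illustration with hypothetical function names, not AccelDSP's algorithm; note that Python's `round` rounds halves to the nearest even integer.

```python
def to_fixed(value, word_bits, frac_bits):
    """Quantize to a signed fixed-point code with rounding and saturation.
    Returns (integer code, overflow flag)."""
    scaled = round(value * (1 << frac_bits))   # place the binary point
    lo = -(1 << (word_bits - 1))               # most negative code
    hi = (1 << (word_bits - 1)) - 1            # most positive code
    overflow = scaled < lo or scaled > hi
    return max(lo, min(hi, scaled)), overflow  # saturate instead of wrapping

def to_float(code, frac_bits):
    """Convert a fixed-point code back to a real value."""
    return code / (1 << frac_bits)

print(to_fixed(0.5, 18, 17))    # (65536, False)
print(to_fixed(1.0, 18, 17))    # (131071, True)
```

With 18 bits and 17 fractional bits, +1.0 is just out of range, so the value saturates and the overflow flag is raised; this is the kind of event an overflow report would surface.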

Summary
- Shortened verification time for RTL models of DSP applications
- Accelerate DSP designs developed using MATLAB or Simulink algorithms in FPGA hardware
- Create complete DSP systems using embedded processors or FPGA co-processors