Multi-core Platforms for Immersive-Audio Applications


Multi-core Platforms for Immersive-Audio Applications
20 June 2011
Course: Advanced Computer Architectures
Teacher: Prof. Cristina Silvano
Student: Silvio La Blasca 771338

Introduction on Immersive-Audio
Immersive Audio Systems:
  - Sound Acquisition: Beamforming
  - Sound Rendering: Wave Field Synthesis
  - Computationally intensive algorithms
Standard implementation is GPP-based:
  - Pros: easily programmable, short development time
  - Cons: processing bottlenecks and excessive power consumption
Alternative implementations:
  - Based on Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs)
  - Exploit multi-core parallelism to achieve speedup and power savings
Case study: "Multi-Core Platforms for Beamforming and Wave Field Synthesis", Theodoropoulos, Kuzmanov & Gaydadjiev (2011)

Introduction on Immersive-Audio
Beamforming (BF): a spatial filtering technique that allows estimation of the direction of arrival of an audio signal in order to perform source separation
Example of the filter-and-sum approach:
  - Each microphone channel signal is fed to a FIR filter that acts as a delay line
  - The FIR coefficients are set according to the source position (beamsteering)
Wave Field Synthesis (WFS): a spatial audio reproduction technique that uses loudspeaker arrays to generate a sound wave field over a wide area (no sweet spot)
Example of a WFS array:
  - For each sample, the distance between the source and each loudspeaker must be calculated
  - The distance affects the signal amplitude and delay
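The filter-and-sum idea can be illustrated with a minimal Python sketch. It is not the implementation from the case study: the helper names (`delay_fir`, `fir`, `filter_and_sum`), the integer steering delays and the averaging at the output are assumptions made for illustration only.

```python
def delay_fir(delay, taps):
    """Hypothetical helper: FIR coefficients acting as a pure delay of `delay` samples."""
    h = [0.0] * taps
    h[delay] = 1.0
    return h


def fir(signal, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out


def filter_and_sum(channels, steering_delays, taps=8):
    """Filter each microphone channel with its own delay-line FIR, then average."""
    filtered = [fir(ch, delay_fir(d, taps))
                for ch, d in zip(channels, steering_delays)]
    n_samples = len(channels[0])
    return [sum(f[i] for f in filtered) / len(channels) for i in range(n_samples)]


# A pulse that reaches microphone 2 first, then microphone 1, then microphone 0.
mics = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]
steered = filter_and_sum(mics, steering_delays=[0, 1, 2])    # delays match the source
unsteered = filter_and_sum(mics, steering_delays=[0, 0, 0])  # delays do not match
```

With matching steering delays the three pulses line up and add coherently; with mismatched delays they stay spread out and the peak is lower, which is what makes the array directionally selective.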

Introduction on Immersive-Audio
Implementations in the literature:
BF:
  - Ultrasound imaging with 288 channels using 14 FPGAs and a GPU connected to a PC
  - Delay-and-sum beamformer implementation on an Nvidia GeForce 8800
  - Teleconferencing system based on a Texas Instruments DSP (TMS320C6201)
WFS:
  - SonicEmotion WFS system based on an Intel Core2 Duo for 24 speakers
  - A higher number of speakers is supported by the PC-based implementations developed by Iosono and Delft University
  - GPU-based implementation using Nvidia GeForce GTX285 and Tesla C1060
Considerations:
  - Most of the implementations rely on standard GPPs
  - Little or no performance evaluation is done with respect to processor features
Goal: implement BF and WFS on GPUs and FPGAs, with a rough high-level design space exploration, and evaluate performance

Proposed Implementation
GPU Implementation of Beamforming:
Program flow:
  1. STORE input data into GPU main memory
  2. STORE FIR coefficients into GPU main memory
  3. Perform DECIMATION for each channel
  4. For each source perform:
     - SOURCE EXTRACTION for each channel + MEM
     - ACCUMULATION of all channels + MEM
     - INTERPOLATION of source signal + MEM
  5. STORE extracted source
Main features:
  - 2 kernels: a flexible one for all FIR computations and one for accumulation
  - The number of blocks matches the number of samples; the number of threads matches the filter size
  - Up to 2 MB of space is needed to store the FIR coefficients
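The program flow and the block/thread mapping can be emulated on the CPU with a short Python sketch. This is only an illustration under assumptions: the function names, the `lowpass` parameter used for decimation and interpolation, and the zero-stuffing interpolator are hypothetical, and the loops stand in for what would be parallel blocks and threads on a GPU.

```python
def fir_kernel(samples, coeffs):
    """Emulates the FIR kernel's work split: one 'block' per output sample,
    one 'thread' per filter tap, followed by a per-block reduction."""
    out = []
    for block in range(len(samples)):          # number of blocks == number of samples
        partial = [0.0] * len(coeffs)
        for thread in range(len(coeffs)):      # number of threads == filter size
            idx = block - thread
            if idx >= 0:
                partial[thread] = coeffs[thread] * samples[idx]
        out.append(sum(partial))               # reduction inside the block
    return out


def decimate(samples, factor, lowpass):
    """Low-pass filter, then keep every `factor`-th sample (step 3)."""
    return fir_kernel(samples, lowpass)[::factor]


def accumulate(channels):
    """Second kernel: sample-wise sum over all channels."""
    return [sum(values) for values in zip(*channels)]


def interpolate(samples, factor, lowpass):
    """Zero-stuff by `factor`, then low-pass filter (assumed interpolation scheme)."""
    stuffed = []
    for s in samples:
        stuffed.append(s)
        stuffed.extend([0.0] * (factor - 1))
    return fir_kernel(stuffed, lowpass)


def extract_source(inputs, source_coeffs, factor, lowpass):
    """Steps 3 and 4 of the program flow for a single source."""
    decimated = [decimate(ch, factor, lowpass) for ch in inputs]
    extracted = [fir_kernel(ch, h) for ch, h in zip(decimated, source_coeffs)]
    summed = accumulate(extracted)
    return interpolate(summed, factor, lowpass)
```

Reusing `fir_kernel` for decimation, source extraction and interpolation mirrors the "flexible" FIR kernel mentioned in the slide; a real design would pass a proper low-pass filter instead of a trivial one.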

Proposed Implementation
GPU Implementation of WFS:
Program flow:
  1. STORE input data into GPU main memory
  2. For each source perform:
     - FIR Filtering + MEM
     - CALCULATE all speakers' signals and ACCUMULATE with previous
  3. STORE all speakers' signals into GPU main memory
Main features:
  - The previous FIR kernel is reused for filtering
  - Each block of the WFS kernel processes a chunk of samples
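The rendering step can be sketched in Python as follows. The sketch makes several assumptions that are not in the slides: source signals are taken as already FIR-filtered, the gain is a simple 1/distance law, delays are rounded to whole samples, and `render_wfs`, `chunk` and `SPEED_OF_SOUND` are hypothetical names. The chunk loop stands in for the GPU blocks, each of which would handle one chunk in parallel.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed value


def render_wfs(sources, speakers, fs, n_out, chunk=4):
    """Accumulate delayed, attenuated copies of every source into every speaker signal.

    sources  : list of (signal, (x, y)) pairs, signal assumed already filtered
    speakers : list of (x, y) loudspeaker positions
    fs       : sampling rate in Hz
    n_out    : number of output samples per loudspeaker
    """
    out = [[0.0] * n_out for _ in speakers]
    for signal, src_pos in sources:
        for s, spk_pos in enumerate(speakers):
            dist = math.dist(src_pos, spk_pos)         # source-to-speaker distance
            delay = round(dist / SPEED_OF_SOUND * fs)  # distance sets the delay
            gain = 1.0 / max(dist, 1e-3)               # and the amplitude (assumed law)
            for start in range(0, n_out, chunk):       # one chunk per "block"
                for n in range(start, min(start + chunk, n_out)):
                    idx = n - delay
                    if 0 <= idx < len(signal):
                        out[s][n] += gain * signal[idx]  # accumulate with previous
    return out


# With fs = 343 Hz, one metre corresponds to exactly one sample of delay.
signals = render_wfs(
    sources=[([1.0], (0.0, 0.0))],
    speakers=[(1.0, 0.0), (2.0, 0.0)],
    fs=343,
    n_out=4,
)
```

In this toy setup the farther loudspeaker receives the pulse one sample later and at half the amplitude.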

Proposed Implementation
FPGA Implementation:
Beamforming module:
  - The APU accelerates communication with the host processor
  - The FCM BF controller initiates the processing phases
  - All samples are stored in the local buffer and all channels are processed concurrently
  - The FIR filters include coefficient banks
Wave Field Synthesis module:
  - Samples are filtered and stored in the WFS Engine buffer to be processed in parallel
  - The computation of the loudspeaker signals is distributed to the Rendering Units (RUs) according to the available resources
  - The Preprocessor calculates the source distance
A complete prototype of the Immersive Audio reconfigurable processor uses an FPGA for BF and WFS acceleration and 2 PowerPC processors as hosts

Performance analysis
Hardware characteristics:
FPGAs:
  - Performance evaluation with Xilinx tools
  - The number of BF channels and WFS RUs has been considered for each FPGA
GPUs:
  - Performance evaluation with the FF XIV benchmark
  - The GTX460 is the only one with two levels of on-chip cache
GPP:
  - As a comparison reference, an Intel Core2 Duo @ 3.0 GHz has been used

Performance analysis
Performance results vs. Core2 Duo:
  - All execution times include memory access delays
  - The real-time threshold is at 11264 ms
[Charts: Beamforming with 8 and 16 channels; WFS with 32 and 96 speakers]

Performance analysis
Performance results: GTX275 vs. FPGA
  - To compare processing speed, the memory access delay is subtracted from the execution time; on the GPU it accounts for more than 50% of the overall execution time
  - Optimized GTX275 performance is evaluated taking the different integration technology into account: the processing time reduction is estimated with reference to the ITRS
About power:
  - DSPs and FPGAs require much less power than GPPs, and FPGAs perform better than GPUs
  - Power consumption is strongly affected by the number of microphones and loudspeakers

Conclusions
General considerations:
  - BF applications benefit from multi-core platforms, since signals can be processed concurrently; GPUs and FPGAs can achieve a speedup of about an order of magnitude
  - WFS benefits even more: the proposed implementation can run up to two orders of magnitude faster than the GPP solution
  - Compared with a standard PC, GPUs can reduce power consumption by 2.5 times, and FPGAs even more
Further considerations:
  - Multi-core platforms are at the moment the best approach to increase the performance of Immersive Audio systems
  - Although parallelization is particularly effective for Immersive Audio processing, like any computationally intensive application it would also benefit from faster processing and reduced memory delay
  - Despite a variety of experiments implementing both acquisition and rendering techniques on different platforms, a complete system has not been proposed

References
  - D. Theodoropoulos, G. Kuzmanov, and G. Gaydadjiev, "Multi-Core Platforms for Beamforming and Wave Field Synthesis," IEEE Transactions on Multimedia, vol. 13, no. 2, April 2011.
  - D. Theodoropoulos, C. B. Ciobanu, and G. Kuzmanov, "Wave field synthesis for 3D audio: Architectural prospectives," in Proc. ACM Int. Conf. Computing Frontiers, May 2009, pp. 127-136.

Thanks for your attention