REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND.

Size: px
Start display at page:

Download "REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND."

Transcription

1 December 3-6, 2018 Santa Clara Convention Center CA, USA REVOLUTIONIZING THE COMPUTING LANDSCAPE AND BEYOND.

2 ACCELERATING INFERENCING ON THE EDGE WITH RISC-V Russell Klein Technical Director Mentor, A Siemens Company

3 Accelerating inferencing Inferencing algorithms exceed the performance capabilities of embedded processors We examine creating accelerators as: RoCC co-processor for RISC-V Bus-based peripheral Instruction extensions were considered Was not a good fit for the algorithms we were working on Would need to contort things too far

4 Driving automation Already have 6 HD cameras at 30 frames/sec in ADAS 1920 x 1080 pixels x 30 x 3 colors (RGB) 186,624,000 pixels/sec for a single camera >1 billion pixels per second for the car More Radar-Lidar-cameras added at every level Camera resolutions and frame rates expected to double every few years

5 All embedded systems are not equal Should get smart enough not to wake me when the neighbor s cat is on the front porch at 4 a.m. but wake me when a person is there Inferencing rates of 1 or 2 seconds is fine <1 million pixels per second 5

6 Yolo Tiny (v3) Algorithm for detecting and classifying objects in pictures Used on cell phones and computationally limited systems Over 5.5 billion floating point operations per inference Over 36 million weight and bias values Neural network has 23 layers Full Yolo has 106 layers

7 Yolo-Tiny Profile

8 What is convolution? Multiply one array by another, element by element, and sum the results Source: Embedded-Vision.com

9 Algorithm Description 3x3 convolution with an optional 2x2 max pooling Kernel Image

10 Algorithm Description 3x3 convolution with an optional 2x2 max pooling Kernel 0 x 1 = 0 1 x 2 = 2 2 x 3 = 6 10 x 4 = x 5 = x 6 = x 7 = x 8 = x 9 = sum 681 Image

11 Algorithm Description 3x3 convolution with an optional 2x2 max pooling Kernel 1 x 1 = 1 2 x 2 = 4 3 x 3 = 9 11 x 4 = x 5 = x 6 = x 7 = x 8 = x 9 = sum 726 Image

12 Algorithm Description 3x3 convolution with an optional 2x2 max pooling Kernel 10 x 1 = x 2 = x 3 = x 4 = x 5 = x 6 = x 7 = x 8 = x 9 = sum 1131 Image

13 Algorithm Description 3x3 convolution with an optional 2x2 max pooling Kernel 11 x 1 = x 2 = x 3 = x 4 = x 5 = x 6 = x 7 = x 8 = x 9 = sum 1174 Image

14 Algorithm Description Repeat across the entire image Kernel Image

15 Algorithm Description Repeat across the entire image Kernel Image

16 Algorithm Description Repeat across the entire image Kernel Image

17 Algorithm Description Repeat across the entire image Kernel Image

18 Some observations For the 416x416 pixel image a 3x3 convolution requires 1,557,504 multiplications It is embarrassingly parallel, all multiplies could be done in parallel If you could get 1.5 million multipliers on a piece of silicon If you could get the data to and from the multipliers This is 0.002% of the multiplies for the inference

19 High Level Synthesis Enables faster design Write more abstractly, i.e. less coding Synthesis handles many details you probably don t want to Creates control/data flow graph from original C/C++ Automatically constructs: parallelism pipelining resource sharing Infers/interfaces memories

20 High Level Synthesis Makes verification easier Reuse stimulus from original C/C++ code Original C/C++ can be an oracle for created RTL Can enable formal verification CDFG can be used to compare against implemented RTL Even if complete equivalence cannot be proved, reduces verification space Original Algorithm Original Testbench Compare Transactor HLS RTL Transactor

21 80% reduction in verification effort Time Traditional RTL Functional Regression 3 months 1000 CPUs Time HLS C++ Functional Regression using formal 2 weeks 14 CPUs Resources Resources NVIDIA Xavier 12nFF SoC NVIDIA Case Study available on mentor.com

22 Yolo Tiny 1000 image verification suite Original C implementation on desktop computer 1.5 seconds per inference ~25 minutes Behavioral C implementation with overhead 5 seconds per inference - ~1 hour 25 minutes RTL implementation 389 minutes per inference - ~9 months Verifying the behavioral code and proving equivalence is much, much faster

23 3x3 convolve + 2x2 max pool

24 High Level Synthesis By unrolling loops and pipelining created various implementation with no changes to source code: 1 84 clocks 4 22 clocks 9 8 clocks 36 3 clocks 216 2/clock 36 multiplier schematic

25 Operation sizing Move from floating point to fixed point Select fixed point precision to meet needs of the algorithm For Yolo-Tiny accuracy improves moving from 32 bit floating point to 20 bit fixed point Fixed point algorithms can be verified on the desk-top Open source fixed point library: Multiplier costs in area and power a roughly proportional to the square of the size of the operands. Moving from 32 bits to 12 bits is about 1/7 th the area and power

26 RoCC accelerator Created accelerator in HDL using HLS Interface was AC channel (ready/valid/data) Good match for RISC-V protocols Instantiated protocol converter between accelerator and RoCC interface Interacted with custom0 instruction No change to compliers Accessed kernel and image data through L1 data cache port

27 RoCC Accelerator Details Followed excellent directions from Colin Schmidt from 2015 rocket chip tutorial bootcamp And course notes from CS250 at Berkeley and github examples. Defined method for creating and interfacing an RoCC accelerator They explain it so much better than I can Many thanks

28 Software vs. Accelerator Software was executing 1 multiply in 7 instructions Found by examination of object code from optimized compile Accelerator was executing 36 parallel multiplies in one clock Then adds and compares in 2 more clocks Accelerator was about 100x faster As expected: 3 clock cycles per conv/max_pool for accelerator 7 clock cycles x 36 multiplies = 252, plus some loop control for software

29 But wait, where s my performance Team member computed 3 x number of conv/map_pools per inference and clock cycles and found we were 4x too slow Looked at bus, nice burst cycles, almost 100 utilization Looked at cache misses and found lots of them

30 What s going on?

31 What s going on?

32 What s going on?

33 What s going on?

34 What s going on? All these pixels are on different cache lines At 100% of bandwidth on bus we would keep accelerator fed. With cache thrashing, accelerator was starving.

35 What to do about it? Create a shift register 3 lines + 4 pixels long Add 2 pixels for each computation

36 What to do about it? Create a shift register 3 lines + 4 pixels long Add 2 pixels for each computation

37 What to do about it? Create a shift register 3 lines + 4 pixels long Add 2 pixels for each computation Makes caches much happier

38 memory memory memory What to do about it? Too any registers to be efficient (1260) Add memories where there are no multiplier taps * * * Requires 7 array declarations in C++, the large arrays are automatically put into memories

39 Peripheral Accelerator Alternate approach for accelerator is to attach it to the bus. Instead of custom0 instructions, read and write to memory mapped control registers Accelerator has master interface to read image and kernel data directly from memory, but no longer shares caches with Rocket Core Core tile to interconnect was bottleneck 3x3 conv 2x2 max accel

40 Memory Architecture and Power Considerations Keeping data local is key to minimizing power consumption Very important for ASIC Floating-point is costly Used in training of networks Not needed in network inference engine, use fixed point Fixed-point doesn't need to be power-of-two For custom hardware can be anything *NVIDIA 2017

41 Yolo-Tiny Profile Small number of coefficients in early layers Small(ish) number of features in later layers

42 Reordering Loops to Minimize Weight Reads Loops are organized so that weights are only read once from system DRAM Weights are held stationary across feature maps Feature maps are computed in order and stored in local SRAM Original Algorithm OUT_CHAN:for(int oc=0;oc<out_channels;oc++){ FMAP_HEIGHT:for(int r=0;r<in_height;r++){ FMAP_WIDTH:for(int c=0;c<in_width+1;c++){ IN_CHAN:for(int ic=0;ic<in_channels;ic++){ KERNEL_Y:for(int i=0;i<3;i++){ KERNEL_X:for(int j=0;j<3;j++){ acc+=fmap[ic][r-i/2][c-j/2]*kernel[ic][oc][i][j] } } } fmap_out[d][r][c] = acc; } } OUT_CHAN:for(int oc=0;oc<max_out_chan;oc++){ KERNEL_Y:for(int i=-1;i<2;i++){ KERNEL_X:for(int j=-1;j<2;j++){ FMAP_HEIGHT:for(int r=0;r<max_height;r++){ FMAP_WIDTH:for(int c=0;c<max_width;c++){ IN_CHAN:for(int ic=0;ic!=max_in_chan;ic+=pack){ < FMAP ping-pong memory read PACK values > < Read and cache weights once for reuse > MAC:for(int p=0;p<pack;p++) acc += fmap_data[p] * kernel_data[p]; } acc_mem[r][c] += acc;//fmap accumulator memory } } }

43 In-place Architecture Layers processed one after another Feature maps stored locally in SRAM Weights read from system memory Layer Control Kernel I/F Feature memory A Multiply/Accumul ate Engine Feature memory B Ping-Pong memory

44 Reorganize Feature Maps Feature maps are interleaved by input channels Example: PACK=4, input chan=8, fmaps come in order Pack 4 input channels side-by-side i0 i1 i2 i3 i4 i5 i6 i7 mem Feature maps

45 Conclusion Built accelerator for 3x3 convolution and 2x2 max pool Comprises 96% of the load for object recognition algorithm 36 parallel multipliers, pipelined adder tree and comparitors Implemented as a RoCC accelerator Shares L1 cache with core, so a good choice when co-operating with software on common data Implemented as a bus based peripheral Has independent bus interface, so a good choice when communication bound High-level synthesis Enables highly customized accelerators Accommodates last minute changes Targets ASICs, efpgas, and FPGAs

46 THANK YOU

Digital Systems Design

Digital Systems Design Digital Systems Design Digital Systems Design and Test Dr. D. J. Jackson Lecture 1-1 Introduction Traditional digital design Manual process of designing and capturing circuits Schematic entry System-level

More information

Lecture 3, Handouts Page 1. Introduction. EECE 353: Digital Systems Design Lecture 3: Digital Design Flows, Simulation Techniques.

Lecture 3, Handouts Page 1. Introduction. EECE 353: Digital Systems Design Lecture 3: Digital Design Flows, Simulation Techniques. Introduction EECE 353: Digital Systems Design Lecture 3: Digital Design Flows, Techniques Cristian Grecu grecuc@ece.ubc.ca Course web site: http://courses.ece.ubc.ca/353/ What have you learned so far?

More information

Hardware-based Image Retrieval and Classifier System

Hardware-based Image Retrieval and Classifier System Hardware-based Image Retrieval and Classifier System Jason Isaacs, Joe Petrone, Geoffrey Wall, Faizal Iqbal, Xiuwen Liu, and Simon Foo Department of Electrical and Computer Engineering Florida A&M - Florida

More information

Course Outcome of M.Tech (VLSI Design)

Course Outcome of M.Tech (VLSI Design) Course Outcome of M.Tech (VLSI Design) PVL108: Device Physics and Technology The students are able to: 1. Understand the basic physics of semiconductor devices and the basics theory of PN junction. 2.

More information

Hardware Implementation of Automatic Control Systems using FPGAs

Hardware Implementation of Automatic Control Systems using FPGAs Hardware Implementation of Automatic Control Systems using FPGAs Lecturer PhD Eng. Ionel BOSTAN Lecturer PhD Eng. Florin-Marian BÎRLEANU Romania Disclaimer: This presentation tries to show the current

More information

Lecture 1: Introduction to Digital System Design & Co-Design

Lecture 1: Introduction to Digital System Design & Co-Design Design & Co-design of Embedded Systems Lecture 1: Introduction to Digital System Design & Co-Design Computer Engineering Dept. Sharif University of Technology Winter-Spring 2008 Mehdi Modarressi Topics

More information

Technology Timeline. Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs. FPGAs. The Design Warrior s Guide to.

Technology Timeline. Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs. FPGAs. The Design Warrior s Guide to. FPGAs 1 CMPE 415 Technology Timeline 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 Transistors ICs (General) SRAMs & DRAMs Microprocessors SPLDs CPLDs ASICs FPGAs The Design Warrior s Guide

More information

Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing

Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing Paper by: Wajahat Qadeer Rehan Hameed Ofer Shacham Preethi Venkatesan Christos Kozyrakis Mark Horowitz Presentation by:

More information

Chapter 6: DSP And Its Impact On Technology. Book: Processor Design Systems On Chip. By Jari Nurmi

Chapter 6: DSP And Its Impact On Technology. Book: Processor Design Systems On Chip. By Jari Nurmi Chapter 6: DSP And Its Impact On Technology Book: Processor Design Systems On Chip Computing For ASICs And FPGAs By Jari Nurmi Slides Prepared by: Omer Anjum Introduction The early beginning g of DSP DSP

More information

Introduction to co-simulation. What is HW-SW co-simulation?

Introduction to co-simulation. What is HW-SW co-simulation? Introduction to co-simulation CPSC489-501 Hardware-Software Codesign of Embedded Systems Mahapatra-TexasA&M-Fall 00 1 What is HW-SW co-simulation? A basic definition: Manipulating simulated hardware with

More information

Audio Sample Rate Conversion in FPGAs

Audio Sample Rate Conversion in FPGAs Audio Sample Rate Conversion in FPGAs An efficient implementation of audio algorithms in programmable logic. by Philipp Jacobsohn Field Applications Engineer Synplicity eutschland GmbH philipp@synplicity.com

More information

Datorstödd Elektronikkonstruktion

Datorstödd Elektronikkonstruktion Datorstödd Elektronikkonstruktion [Computer Aided Design of Electronics] Zebo Peng, Petru Eles and Gert Jervan Embedded Systems Laboratory IDA, Linköping University http://www.ida.liu.se/~tdts80/~tdts80

More information

Image processing. Case Study. 2-diemensional Image Convolution. From a hardware perspective. Often massively yparallel.

Image processing. Case Study. 2-diemensional Image Convolution. From a hardware perspective. Often massively yparallel. Case Study Image Processing Image processing From a hardware perspective Often massively yparallel Can be used to increase throughput Memory intensive Storage size Memory bandwidth -diemensional Image

More information

Computer Aided Design of Electronics

Computer Aided Design of Electronics Computer Aided Design of Electronics [Datorstödd Elektronikkonstruktion] Zebo Peng, Petru Eles, and Nima Aghaee Embedded Systems Laboratory IDA, Linköping University www.ida.liu.se/~tdts01 Electronic Systems

More information

USING EMBEDDED PROCESSORS IN HARDWARE MODELS OF ARTIFICIAL NEURAL NETWORKS

USING EMBEDDED PROCESSORS IN HARDWARE MODELS OF ARTIFICIAL NEURAL NETWORKS USING EMBEDDED PROCESSORS IN HARDWARE MODELS OF ARTIFICIAL NEURAL NETWORKS DENIS F. WOLF, ROSELI A. F. ROMERO, EDUARDO MARQUES Universidade de São Paulo Instituto de Ciências Matemáticas e de Computação

More information

Lecture 1. Tinoosh Mohsenin

Lecture 1. Tinoosh Mohsenin Lecture 1 Tinoosh Mohsenin Today Administrative items Syllabus and course overview Digital systems and optimization overview 2 Course Communication Email Urgent announcements Web page http://www.csee.umbc.edu/~tinoosh/cmpe650/

More information

EE382V-ICS: System-on-a-Chip (SoC) Design

EE382V-ICS: System-on-a-Chip (SoC) Design EE38V-CS: System-on-a-Chip (SoC) Design Hardware Synthesis and Architectures Source: D. Gajski, S. Abdi, A. Gerstlauer, G. Schirner, Embedded System Design: Modeling, Synthesis, Verification, Chapter 6:

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Project Background High speed multiplication is another critical function in a range of very large scale integration (VLSI) applications. Multiplications are expensive and slow

More information

EECS150 - Digital Design Lecture 28 Course Wrap Up. Recap 1

EECS150 - Digital Design Lecture 28 Course Wrap Up. Recap 1 EECS150 - Digital Design Lecture 28 Course Wrap Up Dec. 5, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)

More information

Neural Networks The New Moore s Law

Neural Networks The New Moore s Law Neural Networks The New Moore s Law Chris Rowen, PhD, FIEEE CEO Cognite Ventures December 216 Outline Moore s Law Revisited: Efficiency Drives Productivity Embedded Neural Network Product Segments Efficiency

More information

Harnessing the Power of AI: An Easy Start with Lattice s sensai

Harnessing the Power of AI: An Easy Start with Lattice s sensai Harnessing the Power of AI: An Easy Start with Lattice s sensai A Lattice Semiconductor White Paper. January 2019 Artificial intelligence, or AI, is everywhere. It s a revolutionary technology that is

More information

An area optimized FIR Digital filter using DA Algorithm based on FPGA

An area optimized FIR Digital filter using DA Algorithm based on FPGA An area optimized FIR Digital filter using DA Algorithm based on FPGA B.Chaitanya Student, M.Tech (VLSI DESIGN), Department of Electronics and communication/vlsi Vidya Jyothi Institute of Technology, JNTU

More information

Creating Intelligence at the Edge

Creating Intelligence at the Edge Creating Intelligence at the Edge Vladimir Stojanović E3S Retreat September 8, 2017 The growing importance of machine learning Page 2 Applications exploding in the cloud Huge interest to move to the edge

More information

(VE2: Verilog HDL) Software Development & Education Center

(VE2: Verilog HDL) Software Development & Education Center Software Development & Education Center (VE2: Verilog HDL) VLSI Designing & Integration Introduction VLSI: With the hardware market booming with the rise demand in chip driven products in consumer electronics,

More information

VLSI System Testing. Outline

VLSI System Testing. Outline ECE 538 VLSI System Testing Krish Chakrabarty System-on-Chip (SOC) Testing ECE 538 Krish Chakrabarty 1 Outline Motivation for modular testing of SOCs Wrapper design IEEE 1500 Standard Optimization Test

More information

FPGA Circuits. na A simple FPGA model. nfull-adder realization

FPGA Circuits. na A simple FPGA model. nfull-adder realization FPGA Circuits na A simple FPGA model nfull-adder realization ndemos Presentation References n Altera Training Course Designing With Quartus-II n Altera Training Course Migrating ASIC Designs to FPGA n

More information

A Survey on Power Reduction Techniques in FIR Filter

A Survey on Power Reduction Techniques in FIR Filter A Survey on Power Reduction Techniques in FIR Filter 1 Pooja Madhumatke, 2 Shubhangi Borkar, 3 Dinesh Katole 1, 2 Department of Computer Science & Engineering, RTMNU, Nagpur Institute of Technology Nagpur,

More information

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM

RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM Fengbin Tu, Weiwei Wu, Shouyi Yin, Leibo Liu, Shaojun Wei Institute of Microelectronics Tsinghua University The 45th International

More information

[Krishna, 2(9): September, 2013] ISSN: Impact Factor: INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

[Krishna, 2(9): September, 2013] ISSN: Impact Factor: INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Design of Wallace Tree Multiplier using Compressors K.Gopi Krishna *1, B.Santhosh 2, V.Sridhar 3 gopikoleti@gmail.com Abstract

More information

Video Enhancement Algorithms on System on Chip

Video Enhancement Algorithms on System on Chip International Journal of Scientific and Research Publications, Volume 2, Issue 4, April 2012 1 Video Enhancement Algorithms on System on Chip Dr.Ch. Ravikumar, Dr. S.K. Srivatsa Abstract- This paper presents

More information

A HIGH PERFORMANCE HARDWARE ARCHITECTURE FOR HALF-PIXEL ACCURATE H.264 MOTION ESTIMATION

A HIGH PERFORMANCE HARDWARE ARCHITECTURE FOR HALF-PIXEL ACCURATE H.264 MOTION ESTIMATION A HIGH PERFORMANCE HARDWARE ARCHITECTURE FOR HALF-PIXEL ACCURATE H.264 MOTION ESTIMATION Sinan Yalcin and Ilker Hamzaoglu Faculty of Engineering and Natural Sciences, Sabanci University, 34956, Tuzla,

More information

An Optimized Wallace Tree Multiplier using Parallel Prefix Han-Carlson Adder for DSP Processors

An Optimized Wallace Tree Multiplier using Parallel Prefix Han-Carlson Adder for DSP Processors An Optimized Wallace Tree Multiplier using Parallel Prefix Han-Carlson Adder for DSP Processors T.N.Priyatharshne Prof. L. Raja, M.E, (Ph.D) A. Vinodhini ME VLSI DESIGN Professor, ECE DEPT ME VLSI DESIGN

More information

Design of a High Speed FIR Filter on FPGA by Using DA-OBC Algorithm

Design of a High Speed FIR Filter on FPGA by Using DA-OBC Algorithm Design of a High Speed FIR Filter on FPGA by Using DA-OBC Algorithm Vijay Kumar Ch 1, Leelakrishna Muthyala 1, Chitra E 2 1 Research Scholar, VLSI, SRM University, Tamilnadu, India 2 Assistant Professor,

More information

Physics Based Sensor simulation

Physics Based Sensor simulation Physics Based Sensor simulation Jordan Gorrochotegui - Product Manager Software and Services Mike Phillips Software Engineer Restricted Siemens AG 2017 Realize innovation. Siemens offers solutions across

More information

Design of a Power Optimal Reversible FIR Filter ASIC Speech Signal Processing

Design of a Power Optimal Reversible FIR Filter ASIC Speech Signal Processing Design of a Power Optimal Reversible FIR Filter ASIC Speech Signal Processing Yelle Harika M.Tech, Joginpally B.R.Engineering College. P.N.V.M.Sastry M.S(ECE)(A.U), M.Tech(ECE), (Ph.D)ECE(JNTUH), PG DIP

More information

ASIC Implementation of High Throughput PID Controller

ASIC Implementation of High Throughput PID Controller ASIC Implementation of High Throughput PID Controller 1 Chavan Suyog, 2 Sameer Nandagave, 3 P.Arunkumar 1,2 M.Tech Scholar, 3 Assistant Professor School of Electronics Engineering VLSI Division, VIT University,

More information

Exploring Computation- Communication Tradeoffs in Camera Systems

Exploring Computation- Communication Tradeoffs in Camera Systems Exploring Computation- Communication Tradeoffs in Camera Systems Amrita Mazumdar Thierry Moreau Sung Kim Meghan Cowan Armin Alaghi Luis Ceze Mark Oskin Visvesh Sathe IISWC 2017 1 Camera applications are

More information

ENHANCING SPEED AND REDUCING POWER OF SHIFT AND ADD MULTIPLIER

ENHANCING SPEED AND REDUCING POWER OF SHIFT AND ADD MULTIPLIER ENHANCING SPEED AND REDUCING POWER OF SHIFT AND ADD MULTIPLIER 1 ZUBER M. PATEL 1 S V National Institute of Technology, Surat, Gujarat, Inida E-mail: zuber_patel@rediffmail.com Abstract- This paper presents

More information

Detector Implementations Based on Software Defined Radio for Next Generation Wireless Systems Janne Janhunen

Detector Implementations Based on Software Defined Radio for Next Generation Wireless Systems Janne Janhunen GIGA seminar 11.1.2010 Detector Implementations Based on Software Defined Radio for Next Generation Wireless Systems Janne Janhunen janne.janhunen@ee.oulu.fi 2 Outline Introduction Benefits and Challenges

More information

Digital Logic, Algorithms, and Functions for the CEBAF Upgrade LLRF System Hai Dong, Curt Hovater, John Musson, and Tomasz Plawski

Digital Logic, Algorithms, and Functions for the CEBAF Upgrade LLRF System Hai Dong, Curt Hovater, John Musson, and Tomasz Plawski Digital Logic, Algorithms, and Functions for the CEBAF Upgrade LLRF System Hai Dong, Curt Hovater, John Musson, and Tomasz Plawski Introduction: The CEBAF upgrade Low Level Radio Frequency (LLRF) control

More information

Policy-Based RTL Design

Policy-Based RTL Design Policy-Based RTL Design Bhanu Kapoor and Bernard Murphy bkapoor@atrenta.com Atrenta, Inc., 2001 Gateway Pl. 440W San Jose, CA 95110 Abstract achieving the desired goals. We present a new methodology to

More information

REAL TIME DIGITAL SIGNAL PROCESSING. Introduction

REAL TIME DIGITAL SIGNAL PROCESSING. Introduction REAL TIME DIGITAL SIGNAL Introduction Why Digital? A brief comparison with analog. PROCESSING Seminario de Electrónica: Sistemas Embebidos Advantages The BIG picture Flexibility. Easily modifiable and

More information

An energy-efficient coarse grained spatial architecture for convolutional neural networks AlexNet

An energy-efficient coarse grained spatial architecture for convolutional neural networks AlexNet LETTER IEICE Electronics Express, Vol.14, No.15, 1 12 An energy-efficient coarse grained spatial architecture for convolutional neural networks AlexNet Boya Zhao a), Mingjiang Wang b), and Ming Liu Harbin

More information

IJCSIET--International Journal of Computer Science information and Engg., Technologies ISSN

IJCSIET--International Journal of Computer Science information and Engg., Technologies ISSN An efficient add multiplier operator design using modified Booth recoder 1 I.K.RAMANI, 2 V L N PHANI PONNAPALLI 2 Assistant Professor 1,2 PYDAH COLLEGE OF ENGINEERING & TECHNOLOGY, Visakhapatnam,AP, India.

More information

Multi-core Platforms for

Multi-core Platforms for 20 JUNE 2011 Multi-core Platforms for Immersive-Audio Applications Course: Advanced Computer Architectures Teacher: Prof. Cristina Silvano Student: Silvio La Blasca 771338 Introduction on Immersive-Audio

More information

Embedding Artificial Intelligence into Our Lives

Embedding Artificial Intelligence into Our Lives Embedding Artificial Intelligence into Our Lives Michael Thompson, Synopsys D&R IP-SOC DAYS Santa Clara April 2018 1 Agenda Introduction What AI is and is Not Where AI is being used Rapid Advance of AI

More information

CMOS VLSI IC Design. A decent understanding of all tasks required to design and fabricate a chip takes years of experience

CMOS VLSI IC Design. A decent understanding of all tasks required to design and fabricate a chip takes years of experience CMOS VLSI IC Design A decent understanding of all tasks required to design and fabricate a chip takes years of experience 1 Commonly used keywords INTEGRATED CIRCUIT (IC) many transistors on one chip VERY

More information

Cyclone II Filtering Lab

Cyclone II Filtering Lab May 2005, ver. 1.0 Application Note 376 Introduction The Cyclone II filtering lab design provided in the DSP Development Kit, Cyclone II Edition, shows you how to use the Altera DSP Builder for system

More information

Performance Analysis of a 64-bit signed Multiplier with a Carry Select Adder Using VHDL

Performance Analysis of a 64-bit signed Multiplier with a Carry Select Adder Using VHDL Performance Analysis of a 64-bit signed Multiplier with a Carry Select Adder Using VHDL E.Deepthi, V.M.Rani, O.Manasa Abstract: This paper presents a performance analysis of carrylook-ahead-adder and carry

More information

Design of High-Performance HOG Feature Calculation Circuit for Real-Time Pedestrian Detection *

Design of High-Performance HOG Feature Calculation Circuit for Real-Time Pedestrian Detection * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 31, 2055-2073 (2015) Design of High-Performance HOG Feature Calculation Circuit for Real-Time Pedestrian Detection * SOOJIN KIM AND KYEONGSOON CHO + Department

More information

HARDWARE ACCELERATION OF THE GIPPS MODEL

HARDWARE ACCELERATION OF THE GIPPS MODEL HARDWARE ACCELERATION OF THE GIPPS MODEL FOR REAL-TIME TRAFFIC SIMULATION Salim Farah 1 and Magdy Bayoumi 2 The Center for Advanced Computer Studies, University of Louisiana at Lafayette, USA 1 snf3346@cacs.louisiana.edu

More information

DESIGN OF MULTIPLE CONSTANT MULTIPLICATION ALGORITHM FOR FIR FILTER

DESIGN OF MULTIPLE CONSTANT MULTIPLICATION ALGORITHM FOR FIR FILTER Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 3, March 2014,

More information

Design and Implementation Radix-8 High Performance Multiplier Using High Speed Compressors

Design and Implementation Radix-8 High Performance Multiplier Using High Speed Compressors Design and Implementation Radix-8 High Performance Multiplier Using High Speed Compressors M.Satheesh, D.Sri Hari Student, Dept of Electronics and Communication Engineering, Siddartha Educational Academy

More information

ECE6332 VLSI Eric Zhang & Xinfei Guo Design Review

ECE6332 VLSI Eric Zhang & Xinfei Guo Design Review Summaries: [1] Xiaoxiao Zhang, Amine Bermak, Farid Boussaid, "Dynamic Voltage and Frequency Scaling for Low-power Multi-precision Reconfigurable Multiplier", in Proc. of 2010 IEEE International Symposium

More information

Enhancing System Architecture by Modelling the Flash Translation Layer

Enhancing System Architecture by Modelling the Flash Translation Layer Enhancing System Architecture by Modelling the Flash Translation Layer Robert Sykes Sr. Dir. Firmware August 2014 OCZ Storage Solutions A Toshiba Group Company Introduction This presentation will discuss

More information

CS Computer Architecture Spring Lecture 04: Understanding Performance

CS Computer Architecture Spring Lecture 04: Understanding Performance CS 35101 Computer Architecture Spring 2008 Lecture 04: Understanding Performance Taken from Mary Jane Irwin (www.cse.psu.edu/~mji) and Kevin Schaffer [Adapted from Computer Organization and Design, Patterson

More information

4.4 Implementation Structures in FPGAs and DSPs. Presented by Lee Pucker President, ForwardLink Consulting

4.4 Implementation Structures in FPGAs and DSPs. Presented by Lee Pucker President, ForwardLink Consulting 4.4 Implementation Structures in FPGAs and DSPs Presented by Lee Pucker President, ForwardLink Consulting Agenda Case Study on Implementation Structures Synchronization in a GSM Network Option 1: DSP Implementation

More information

Design of High-Performance Intra Prediction Circuit for H.264 Video Decoder

Design of High-Performance Intra Prediction Circuit for H.264 Video Decoder JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.9, NO.4, DECEMBER, 2009 187 Design of High-Performance Intra Prediction Circuit for H.264 Video Decoder Jihye Yoo, Seonyoung Lee, and Kyeongsoon Cho

More information

Bricken Technologies Corporation Presentations: Bricken Technologies Corporation Corporate: Bricken Technologies Corporation Marketing:

Bricken Technologies Corporation Presentations: Bricken Technologies Corporation Corporate: Bricken Technologies Corporation Marketing: TECHNICAL REPORTS William Bricken compiled 2004 Bricken Technologies Corporation Presentations: 2004: Synthesis Applications of Boundary Logic 2004: BTC Board of Directors Technical Review (quarterly)

More information

Performance Analysis of FIR Filter Design Using Reconfigurable Mac Unit

Performance Analysis of FIR Filter Design Using Reconfigurable Mac Unit Volume 4 Issue 4 December 2016 ISSN: 2320-9984 (Online) International Journal of Modern Engineering & Management Research Website: www.ijmemr.org Performance Analysis of FIR Filter Design Using Reconfigurable

More information

CHAPTER 4 FIELD PROGRAMMABLE GATE ARRAY IMPLEMENTATION OF FIVE LEVEL CASCADED MULTILEVEL INVERTER

CHAPTER 4 FIELD PROGRAMMABLE GATE ARRAY IMPLEMENTATION OF FIVE LEVEL CASCADED MULTILEVEL INVERTER 87 CHAPTER 4 FIELD PROGRAMMABLE GATE ARRAY IMPLEMENTATION OF FIVE LEVEL CASCADED MULTILEVEL INVERTER 4.1 INTRODUCTION The Field Programmable Gate Array (FPGA) is a high performance data processing general

More information

Implementing Logic with the Embedded Array

Implementing Logic with the Embedded Array Implementing Logic with the Embedded Array in FLEX 10K Devices May 2001, ver. 2.1 Product Information Bulletin 21 Introduction Altera s FLEX 10K devices are the first programmable logic devices (PLDs)

More information

10. DSP Blocks in Arria GX Devices

10. DSP Blocks in Arria GX Devices 10. SP Blocks in Arria GX evices AGX52010-1.2 Introduction Arria TM GX devices have dedicated digital signal processing (SP) blocks optimized for SP applications requiring high data throughput. These SP

More information

5G R&D at Huawei: An Insider Look

5G R&D at Huawei: An Insider Look 5G R&D at Huawei: An Insider Look Accelerating the move from theory to engineering practice with MATLAB and Simulink Huawei is the largest networking and telecommunications equipment and services corporation

More information

Chapter 1 Introduction

Chapter 1 Introduction Chapter 1 Introduction 1.1 Introduction There are many possible facts because of which the power efficiency is becoming important consideration. The most portable systems used in recent era, which are

More information

Simulation Performance Optimization of Virtual Prototypes Sammidi Mounika, B S Renuka

Simulation Performance Optimization of Virtual Prototypes Sammidi Mounika, B S Renuka Simulation Performance Optimization of Virtual Prototypes Sammidi Mounika, B S Renuka Abstract Virtual prototyping is becoming increasingly important to embedded software developers, engineers, managers

More information

Study on Digital Multiplier Architecture Using Square Law and Divide-Conquer Method

Study on Digital Multiplier Architecture Using Square Law and Divide-Conquer Method Study on Digital Multiplier Architecture Using Square Law and Divide-Conquer Method Yifei Sun 1,a, Shu Sasaki 1,b, Dan Yao 1,c, Nobukazu Tsukiji 1,d, Haruo Kobayashi 1,e 1 Division of Electronics and Informatics,

More information

Using an FPGA based system for IEEE 1641 waveform generation

Using an FPGA based system for IEEE 1641 waveform generation Using an FPGA based system for IEEE 1641 waveform generation Colin Baker EADS Test & Services (UK) Ltd 23 25 Cobham Road Wimborne, Dorset, UK colin.baker@eads-ts.com Ashley Hulme EADS Test Engineering

More information

A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm

A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm A New High Speed Low Power Performance of 8- Bit Parallel Multiplier-Accumulator Using Modified Radix-2 Booth Encoded Algorithm V.Sandeep Kumar Assistant Professor, Indur Institute Of Engineering & Technology,Siddipet

More information

Enabling High-Performance DSP Applications with Arria V or Cyclone V Variable-Precision DSP Blocks

Enabling High-Performance DSP Applications with Arria V or Cyclone V Variable-Precision DSP Blocks Enabling HighPerformance DSP Applications with Arria V or Cyclone V VariablePrecision DSP Blocks WP011591.0 White Paper This document highlights the benefits of variableprecision digital signal processing

More information

Modified Booth Multiplier Based Low-Cost FIR Filter Design Shelja Jose, Shereena Mytheen

Modified Booth Multiplier Based Low-Cost FIR Filter Design Shelja Jose, Shereena Mytheen Modified Booth Multiplier Based Low-Cost FIR Filter Design Shelja Jose, Shereena Mytheen Abstract A new low area-cost FIR filter design is proposed using a modified Booth multiplier based on direct form

More information

CS4617 Computer Architecture

CS4617 Computer Architecture 1/26 CS4617 Computer Architecture Lecture 2 Dr J Vaughan September 10, 2014 2/26 Amdahl s Law Speedup = Execution time for entire task without using enhancement Execution time for entire task using enhancement

More information

CHAPTER 4 GALS ARCHITECTURE

CHAPTER 4 GALS ARCHITECTURE 64 CHAPTER 4 GALS ARCHITECTURE The aim of this chapter is to implement an application on GALS architecture. The synchronous and asynchronous implementations are compared in FFT design. The power consumption

More information

DESIGN OF LOW POWER MULTIPLIER USING COMPOUND CONSTANT DELAY LOGIC STYLE

DESIGN OF LOW POWER MULTIPLIER USING COMPOUND CONSTANT DELAY LOGIC STYLE DESIGN OF LOW POWER MULTIPLIER USING COMPOUND CONSTANT DELAY LOGIC STYLE 1 S. DARWIN, 2 A. BENO, 3 L. VIJAYA LAKSHMI 1 & 2 Assistant Professor Electronics & Communication Engineering Department, Dr. Sivanthi

More information

Design of Mixed-Signal Microsystems in Nanometer CMOS

Design of Mixed-Signal Microsystems in Nanometer CMOS Design of Mixed-Signal Microsystems in Nanometer CMOS Carl Grace Lawrence Berkeley National Laboratory August 2, 2012 DOE BES Neutron and Photon Detector Workshop Introduction Common themes in emerging

More information

Advanced FPGA Design. Tinoosh Mohsenin CMPE 491/691 Spring 2012

Advanced FPGA Design. Tinoosh Mohsenin CMPE 491/691 Spring 2012 Advanced FPGA Design Tinoosh Mohsenin CMPE 491/691 Spring 2012 Today Administrative items Syllabus and course overview Digital signal processing overview 2 Course Communication Email Urgent announcements

More information

Low Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS

Low Power Design Part I Introduction and VHDL design. Ricardo Santos LSCAD/FACOM/UFMS Low Power Design Part I Introduction and VHDL design Ricardo Santos ricardo@facom.ufms.br LSCAD/FACOM/UFMS Motivation for Low Power Design Low power design is important from three different reasons Device

More information

Model checking in the cloud VIGYAN SINGHAL OSKI TECHNOLOGY

Model checking in the cloud VIGYAN SINGHAL OSKI TECHNOLOGY Model checking in the cloud VIGYAN SINGHAL OSKI TECHNOLOGY Views are biased by Oski experience Service provider, only doing model checking Using off-the-shelf tools (Cadence, Jasper, Mentor, OneSpin Synopsys)

More information

Journal of Engineering Science and Technology Review 9 (5) (2016) Research Article. L. Pyrgas, A. Kalantzopoulos* and E. Zigouris.

Journal of Engineering Science and Technology Review 9 (5) (2016) Research Article. L. Pyrgas, A. Kalantzopoulos* and E. Zigouris. Jestr Journal of Engineering Science and Technology Review 9 (5) (2016) 51-55 Research Article Design and Implementation of an Open Image Processing System based on NIOS II and Altera DE2-70 Board L. Pyrgas,

More information

FIR Filter Design on Chip Using VHDL

FIR Filter Design on Chip Using VHDL FIR Filter Design on Chip Using VHDL Mrs.Vidya H. Deshmukh, Dr.Abhilasha Mishra, Prof.Dr.Mrs.A.S.Bhalchandra MIT College of Engineering, Aurangabad ABSTRACT This paper describes the design and implementation

More information

CHAPTER 5 IMPLEMENTATION OF MULTIPLIERS USING VEDIC MATHEMATICS

CHAPTER 5 IMPLEMENTATION OF MULTIPLIERS USING VEDIC MATHEMATICS 49 CHAPTER 5 IMPLEMENTATION OF MULTIPLIERS USING VEDIC MATHEMATICS 5.1 INTRODUCTION TO VHDL VHDL stands for VHSIC (Very High Speed Integrated Circuits) Hardware Description Language. The other widely used

More information

High Performance DSP Solutions for Ultrasound

High Performance DSP Solutions for Ultrasound High Performance DSP Solutions for Ultrasound By Hong-Swee Lim Senior Manager, DSP/Embedded Marketing Hong-Swee.Lim@xilinx.com 12 May 2008 DSP Performance Gap Performance (Algorithmic and Processor Forecast)

More information

On Current Strategies for Hardware Acceleration of Digital Image Restoration Filters

On Current Strategies for Hardware Acceleration of Digital Image Restoration Filters On Current Strategies for Hardware Acceleration of Digital Image Restoration Filters ERIC GRANGER Laboratoire d imagerie, de vision et d intelligence artificielle Dépt. de génie de la production automatisée

More information

Aerial Photographic System Using an Unmanned Aerial Vehicle

Aerial Photographic System Using an Unmanned Aerial Vehicle Aerial Photographic System Using an Unmanned Aerial Vehicle Second Prize Aerial Photographic System Using an Unmanned Aerial Vehicle Institution: Participants: Instructor: Chungbuk National University

More information

AN EFFICIENT APPROACH TO MINIMIZE POWER AND AREA IN CARRY SELECT ADDER USING BINARY TO EXCESS ONE CONVERTER

AN EFFICIENT APPROACH TO MINIMIZE POWER AND AREA IN CARRY SELECT ADDER USING BINARY TO EXCESS ONE CONVERTER AN EFFICIENT APPROACH TO MINIMIZE POWER AND AREA IN CARRY SELECT ADDER USING BINARY TO EXCESS ONE CONVERTER K. RAMAMOORTHY 1 T. CHELLADURAI 2 V. MANIKANDAN 3 1 Department of Electronics and Communication

More information

PE713 FPGA Based System Design

PE713 FPGA Based System Design PE713 FPGA Based System Design Why VLSI? Dept. of EEE, Amrita School of Engineering Why ICs? Dept. of EEE, Amrita School of Engineering IC Classification ANALOG (OR LINEAR) ICs produce, amplify, or respond

More information

23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS. Sergii Bykov Technical Lead Machine Learning 12 Oct 2017

23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS. Sergii Bykov Technical Lead Machine Learning 12 Oct 2017 23270: AUGMENTED REALITY FOR NAVIGATION AND INFORMATIONAL ADAS Sergii Bykov Technical Lead Machine Learning 12 Oct 2017 Product Vision Company Introduction Apostera GmbH with headquarter in Munich, was

More information

IJMIE Volume 2, Issue 5 ISSN:

IJMIE Volume 2, Issue 5 ISSN: Systematic Design of High-Speed and Low- Power Digit-Serial Multipliers VLSI Based Ms.P.J.Tayade* Dr. Prof. A.A.Gurjar** Abstract: Terms of both latency and power Digit-serial implementation styles are

More information

Low Power VLSI Circuit Synthesis: Introduction and Course Outline

Low Power VLSI Circuit Synthesis: Introduction and Course Outline Low Power VLSI Circuit Synthesis: Introduction and Course Outline Ajit Pal Professor Department of Computer Science and Engineering Indian Institute of Technology Kharagpur INDIA -721302 Agenda Why Low

More information

Design and Implementation of High Speed Carry Select Adder Korrapatti Mohammed Ghouse 1 K.Bala. 2

Design and Implementation of High Speed Carry Select Adder Korrapatti Mohammed Ghouse 1 K.Bala. 2 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 07, 2015 ISSN (online): 2321-0613 Design and Implementation of High Speed Carry Select Adder Korrapatti Mohammed Ghouse

More information

A Review on Different Multiplier Techniques

A Review on Different Multiplier Techniques A Review on Different Multiplier Techniques B.Sudharani Research Scholar, Department of ECE S.V.U.College of Engineering Sri Venkateswara University Tirupati, Andhra Pradesh, India Dr.G.Sreenivasulu Professor

More information

Multiplier Design and Performance Estimation with Distributed Arithmetic Algorithm

Multiplier Design and Performance Estimation with Distributed Arithmetic Algorithm Multiplier Design and Performance Estimation with Distributed Arithmetic Algorithm M. Suhasini, K. Prabhu Kumar & P. Srinivas Department of Electronics & Comm. Engineering, Nimra College of Engineering

More information

Meeting the Challenges of Formal Verification

Meeting the Challenges of Formal Verification Meeting the Challenges of Formal Verification Doug Fisher Synopsys Jean-Marc Forey - Synopsys 23rd May 2013 Synopsys 2013 1 In the next 30 minutes... Benefits and Challenges of Formal Verification Meeting

More information

Rapid FPGA Modem Design Techniques For SDRs Using Altera DSP Builder

Rapid FPGA Modem Design Techniques For SDRs Using Altera DSP Builder Rapid FPGA Modem Design Techniques For SDRs Using Altera DSP Builder Steven W. Cox Joel A. Seely General Dynamics C4 Systems Altera Corporation 820 E. McDowell Road, MDR25 0 Innovation Dr Scottsdale, Arizona

More information

Model-Based Design for Sensor Systems

Model-Based Design for Sensor Systems 2009 The MathWorks, Inc. Model-Based Design for Sensor Systems Stephanie Kwan Applications Engineer Agenda Sensor Systems Overview System Level Design Challenges Components of Sensor Systems Sensor Characterization

More information

EE25266 ASIC/FPGA Chip Design. Designing a FIR Filter, FPGA in the Loop, Ethernet

EE25266 ASIC/FPGA Chip Design. Designing a FIR Filter, FPGA in the Loop, Ethernet EE25266 ASIC/FPGA Chip Design Mahdi Shabany Electrical Engineering Department Sharif University of Technology Assignment #8 Designing a FIR Filter, FPGA in the Loop, Ethernet Introduction In this lab,

More information

Multi-Channel FIR Filters

Multi-Channel FIR Filters Chapter 7 Multi-Channel FIR Filters This chapter illustrates the use of the advanced Virtex -4 DSP features when implementing a widely used DSP function known as multi-channel FIR filtering. Multi-channel

More information

VLSI IMPLEMENTATION OF MODIFIED DISTRIBUTED ARITHMETIC BASED LOW POWER AND HIGH PERFORMANCE DIGITAL FIR FILTER Dr. S.Satheeskumaran 1 K.

VLSI IMPLEMENTATION OF MODIFIED DISTRIBUTED ARITHMETIC BASED LOW POWER AND HIGH PERFORMANCE DIGITAL FIR FILTER Dr. S.Satheeskumaran 1 K. VLSI IMPLEMENTATION OF MODIFIED DISTRIBUTED ARITHMETIC BASED LOW POWER AND HIGH PERFORMANCE DIGITAL FIR FILTER Dr. S.Satheeskumaran 1 K. Sasikala 2 1 Professor, Department of Electronics and Communication

More information

Ben Baker. Sponsored by:

Ben Baker. Sponsored by: Ben Baker Sponsored by: Background Agenda GPU Computing Digital Image Processing at FamilySearch Potential GPU based solutions Performance Testing Results Conclusions and Future Work 2 CPU vs. GPU Architecture

More information

Overview of Design Methodology. A Few Points Before We Start 11/4/2012. All About Handling The Complexity. Lecture 1. Put things into perspective

Overview of Design Methodology. A Few Points Before We Start 11/4/2012. All About Handling The Complexity. Lecture 1. Put things into perspective Overview of Design Methodology Lecture 1 Put things into perspective ECE 156A 1 A Few Points Before We Start ECE 156A 2 All About Handling The Complexity Design and manufacturing of semiconductor products

More information