A Rotation-based Data Buffering Architecture for Convolution Filtering in a Field Programmable Gate Array

JOURNAL OF COMPUTERS, VOL. 8, NO. 6, JUNE 2013, pp. 1411-1416

Zhijian Lu
College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Email: luzhijian@hrbeu.edu.cn

Yanxia Wu, Zhenhua Guo, Guochang Gu
College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Email: {wuyanxia, guozhenhua, guguochang}@hrbeu.edu.cn

Yanxia Wu is the corresponding author. doi:10.4304/jcp.8.6.1411-1416

Abstract. Convolution filtering applications range from image recognition to video surveillance. Two observations drive the design of a new buffering architecture for convolution filters. First, convolution operations are inherently local: every pixel of the output feature map is calculated from the neighboring pixels of the input feature map. Even though the operation is simple, convolution filtering is both computation-intensive and memory-intensive. For real-time applications, large amounts of on-chip memory are required to support massively parallel processing architectures. Second, to avoid accessing external memories directly, data already stored in on-chip memories should be reused as many times as possible. Based on these two observations, we show that, for a given throughput rate and off-chip memory bandwidth, a rotation-based data buffering architecture provides the optimum area utilization for a particular design point that is common in recognition applications.

Index Terms: convolution filtering, Field Programmable Gate Arrays (FPGAs), data buffering

I. INTRODUCTION

Convolution filters are computational models widely used in the recognition and video processing domains [1][2][3][4]. Computing a convolution requires not only high computational capability but also large memory bandwidth, especially when high-definition images and videos have to be processed in real time. In these applications, convolution filtering plays an essential role [5][6]. Generally, external memories hold the input image pixels, but their bandwidth cannot directly satisfy the requirement of optimal throughput. Hence intermediate buffers built from on-chip memories are adopted to avoid accessing external memories directly [7][8]. To load as many pixel values as needed into the convolution filter in one cycle, multiple memory ports are attached to the intermediate data buffers. Once a pixel value is loaded, it can be reused for the successive convolutions that need it, which avoids fetching it repeatedly from off-chip memory. As a result, the off-chip memory bandwidth requirement is reduced.

A single dataflow complete convolution architecture is adopted in [7], where a set of linear FIFOs is used to move a window over the input image. The input image is divided into row bands whose length equals the input image row length and whose height equals the convolution window height. Each input pixel needs to be loaded into the intermediate data buffer only once, at a fixed minimum external memory bandwidth. When the input image or the convolution window becomes large, this approach consumes a significant amount of FPGA resources and becomes very expensive [7][8].

There are alternative buffering architectures in which the internal buffers store only a small portion of the pixels [7][9]. Each FIFO group in the convolution window receives the pixels belonging to consecutive rows of the input image. Compared with the aforementioned method, a large reduction in registers is achieved. However, multiple dataflows are needed to feed data to the internal buffer, and pixels of the input image need to be read repeatedly from external memories, depending on the size of the convolution window. To keep the maximum throughput rate, this leads to a sharp increase in the external memory bandwidth requirement.

In this paper, we are concerned with the implementation of convolution filters in FPGAs, and we design an alternative buffering architecture for convolution filters that shows a good balance between on-chip resource utilization and external memory bus bandwidth.

II. ROTATION-BASED DATA BUFFERING ARCHITECTURE
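As a point of reference for the buffering schemes compared in this paper, the following minimal sketch counts off-chip pixel reads under three simplified reuse models. It is an illustration only, not the exact designs of [7] or [8]: the function name `offchip_reads`, its parameters, and the three strategy labels are hypothetical, and only window positions lying fully inside the image are counted.

```python
def offchip_reads(height, width, k):
    """Count off-chip pixel reads for a k x k window over a height x width image.

    The three strategies are simplified models chosen for illustration
    (an assumption of this sketch, not a description of a cited design).
    """
    rows_of_windows = height - k + 1
    outputs = rows_of_windows * (width - k + 1)
    return {
        # No on-chip reuse: every output fetches its whole window again.
        "no_reuse": outputs * k * k,
        # Small buffer: for each window row, every image column is streamed
        # once as a strip that is k pixels tall.
        "column_per_step": rows_of_windows * width * k,
        # Full line buffering: each pixel is fetched exactly once.
        "read_once": height * width,
    }


if __name__ == "__main__":
    # Illustrative sizes only.
    for name, reads in offchip_reads(720, 1280, 7).items():
        print(f"{name:>16}: {reads} pixel reads")
```

The gap between the three counts is the trade-off this paper is concerned with: more on-chip storage lets each fetched pixel serve more window positions, while less storage shifts the cost to external memory bandwidth.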

In this section, we first introduce the convolution filtering implementation strategy and discuss the advantages and disadvantages of existing implementation architectures. Then we present the rotation-based data buffering architecture. Fig. 1 shows the conceptual view of a convolution filter moving over an input image, which is used in the following sections.

Figure 1. Conceptual view of a convolver and an image.

A. Convolution Filter Implementation Strategy

The convolution of an image with a K × K kernel is defined by equation (1):

$$O(x, y) = \sum_{i=-\lfloor K/2 \rfloor}^{\lfloor K/2 \rfloor} \; \sum_{j=-\lfloor K/2 \rfloor}^{\lfloor K/2 \rfloor} I(x+i,\, y+j)\, W(i, j) \qquad (1)$$

where O(x, y) is the convolved pixel of the output image, I(x, y) is the pixel value from the input image, and W(i, j) is the convolution kernel weight. To calculate the convolution O(x, y), each pixel from a window of the input image centered on (x, y) is multiplied by the corresponding kernel weight, and the products are accumulated to produce the output value. Because the two-dimensional convolution of each pixel requires the values of its K^2 - 1 immediate neighbors before that pixel can be processed, more columns than needed will be read within the same transaction. Each output pixel requires K^2 multiply-accumulations, all of which can be performed in parallel. To accelerate the computation, multiple data in a convolution window need to be accessed simultaneously so that the calculations can be performed in parallel.

B. Multiple Dataflow Single Convolution Architecture (MDSCA)

In order to eliminate the register arrays in [7], multiple dataflow single convolution architectures are adopted in [8][10]. In these architectures, only a small portion of the image pixels is loaded into the convolution filter. However, with fewer register arrays, the pixels can no longer be loaded into the convolution window in zigzag order. Instead, pixels belonging to consecutive rows are read into the registers simultaneously. Groups of FIFOs are included to feed the pixels to the convolution window. After one column of pixels is fed into the convolution filter, the convolution window moves to the next position. Fig. 2 shows a multiple dataflow single convolution architecture using an input/output bus, which completely eliminates the register arrays in [7]. The convolution window receives the pixels belonging to consecutive rows of the original image through FIFO stacks.

Figure 2. Multiple dataflow single convolution architecture.

The multiple dataflow single convolution architecture requires much larger bandwidth than the single dataflow architecture. The register arrays are completely eliminated, and extra memory bandwidth is traded for less on-chip storage. To compute a convolution in a single cycle, one new pixel per row is needed at every cycle. With K pixels transferred in and one result produced, a bandwidth of K + 1 bytes per cycle is needed.

C. Single Dataflow Complete Convolution Architecture (SDCCA)

To avoid direct access to external memories, FPGA on-chip memories are used as intermediate data buffers [7]. The single dataflow complete convolution architecture in Fig. 3 uses on-chip register arrays to move a window over the input image. To extract pixels from the input image, a single dataflow strategy is adopted. Pixels are fed from external memories in zigzag order until K - 1 complete lines and the first K pixels of the next line are contained within a series of linear FIFOs. From that moment on, all the pixels belonging to the first convolution window are available to the processing element. Each time a new pixel is loaded, the convolution window moves to a new position, until the entire image has been visited. The throughput of this architecture is one clock per pixel.

Figure 3. Single dataflow complete convolution architecture.

In [7], K - 1 sets of FIFOs with a length of N - K, where N is the image row length, are employed to hold data before moving them to the convolution filter, and K sets of registers, each with K registers, are used for the convolution filter. These FIFOs, which enable a convolution filter of arbitrary size to work with a single data stream, require no more than one pixel per clock of external memory bandwidth. Pixels of the input image need to be read only once. The side effect is that, to make this single data stream architecture work, K - 1 complete rows must first be read from external memory; storing these data within a set of FIFOs is therefore very expensive in an FPGA implementation when the input image or the convolution filter is large.

D. Rotation-based Multiple Dataflow Buffering Architecture (RMDBA)

In order to reuse data already stored in on-chip buffers as many times as possible, we propose a rotation-based data buffering architecture. Fig. 4 illustrates contiguous convolution filter windows in the row-wise direction, where two adjacent filter windows share K - 1 columns. The architecture of these sliding windows includes R contiguous convolution filter windows, which share K - 1 columns in the row-wise direction. If the calculations of these convolution kernels are performed at the same time, a much higher level of data reuse is achieved than with the multiple dataflow single convolution architecture.

Figure 4. R simultaneous convolution windows in an area.

Fig. 5 illustrates the rotation-based multiple dataflow architecture we propose. The number of register arrays is extended to Y to hold all the pixels in the area depicted in Fig. 4. Unlike the multiple dataflow single convolution architecture and the single dataflow complete convolution architecture, the pixel data in each register array are not fed to the convolution filter window simultaneously but serially. One register in each register group is usable in each cycle, and a rotating self-incrementing counter addresses the register to be output. Consequently, the pixels of one row across all register arrays, which belong to adjacent windows in the row-wise direction, are available to the convolution filter in each cycle. After K cycles, all the data in the area have been sent to the convolution filter, and the register arrays are then updated: a new row of data is moved in from the FIFO, which effectively moves the area to the next position.

Figure 5. Rotation-based data buffering architecture.

The convolution filter used with the rotation-based data buffering architecture differs from those of the aforementioned architectures. For each convolution window, input pixels are fed column by column, so one one-column convolution line can be calculated at a time, and it takes K cycles to complete the calculation for each convolution window. When neighboring windows are available, all R one-column convolutions can be processed simultaneously. In order to achieve a throughput of 1 cycle/pixel, multiple dataflows must be loaded to update the convolution window. Compared with the multiple dataflow single convolution architecture, the window in the rotation-based architecture is updated every K cycles. In this case, the area can move every K cycles, and Y pixels in all are loaded from off-chip memories every K cycles, so the external memory bandwidth is Y/K pixels/clock. This means that, for most convolution filter applications, approximately twice the external memory bandwidth is needed.

III. ARCHITECTURE SELECTION

In this section, we consider an input image size of 1280 × 720 with 8 bits/pixel and a convolution kernel size of 7 × 7 as a case study. The operation fetches image pixels from external memories and stores the results back to external memories after the convolution. In addition, we use a memory bus word length of 256 bits and a burst length (BL) of 8 words (i.e. 16 pixels). Table 1 summarizes the main features of the two existing architectures and the proposed one: area utilization, measured in terms of register pixels and memory pixels (the flip-flop count is obtained by multiplying the number of register and memory pixels by the number of bits per pixel); throughput, given in cycles/pixel; and external memory bandwidth requirement, given in pixels/cycle.

TABLE 1. FEATURES OF DIFFERENT CONVOLUTION FILTERS FOR A 7 × 7 WINDOW

architecture   register pixels   memory pixels   throughput (cycles/pixel)   flip-flop count   bandwidth (pixels/cycle)
MDSCA          …                 …               1                           5496              7
SDCCA          …                 …               1                           49336             1
RMDBA          …                 …               1                           2392              1.9

TABLE 2. AREA UTILIZATION OF DIFFERENT ARCHITECTURES FOR VARIOUS CONVOLUTION FILTER WINDOW SIZES

filter size   MDSCA flip-flop count   SDCCA flip-flop count   RMDBA flip-flop count
3 × 3         456                     16536                   760
5 × 5         840                     32936                   1512
7 × 7         5496                    49336                   2392
9 × 9         1800                    65736                   3400
11 × 11       2376                    82136                   4536
13 × 13       3016                    98536                   5800
15 × 15       3720                    114936                  7192
17 × 17       4488                    131336                  8712
19 × 19       5320                    147736                  10360

Different FPGA resources may be used to implement the buffers, depending on the specific FPGA device. For comparison, area utilization is evaluated in terms of flip-flops. The last two columns of Table 1 show the flip-flop count and the external memory bandwidth requirement for the case study. The CPB architecture shows the most area-efficient feature at the cost of a much higher external memory bandwidth requirement.

In order to choose the optimum architecture for a particular design point, we use a metric that maximizes throughput with respect to the amount of resources. The metric proposed in [10] is the product of the throughput, in cycles/pixel, and the flip-flop count. For a particular design point, the best architecture minimizes the metric value and thus maximizes area efficiency. We use the same concept for our architecture. Table 2 shows the corresponding product of flip-flop count and throughput for convolution window sizes from 3 to 19 for the three architectures. We assume the same output memory bandwidth of 1 pixel/cycle. Fig. 6 compares this metric; the remaining variables are the same as described for the case study. In the bar diagram of Fig. 6, we can observe that RMDBA is superior to the other architectures for window size 7, and that MDSCA is superior for the other window sizes. Window sizes 5 and 7 are the most frequently used in practical applications. As the input image gets larger, tradeoffs must be made depending on the available FPGA resources and off-chip memory bandwidth.

IV. CONCLUSION

In this paper, we proposed a rotation-based data buffering architecture for convolution filtering in FPGAs. Compared with direct implementations of the prior art, the new technique requires fewer FPGA resources and lower off-chip memory bandwidth while retaining the optimum throughput for a particular design point; it is therefore suitable for low-cost FPGA implementation.

ACKNOWLEDGEMENT

This work is supported by the National Natural Science Foundation of China (No. 61003036), the Natural Science Foundation of Heilongjiang Province of China (Grant No. QC2010049), and the Fundamental Research Funds for the Central Universities (No. HEUCFT1202, No. HEUCF100606).
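The rotation-based buffering of Section II.D can be modeled in software. The following is a minimal sketch under stated assumptions, not the authors' hardware design: the names `direct_filter` and `rotation_strip` and the parameters `k` (window size), `r` (number of adjacent windows) and `x0` (left column of the buffered area) are hypothetical. The text describes the feed as column by column; this sketch feeds one buffered row per cycle, which is assumed to be the same scheme seen in the transposed orientation. The kernel is applied without flipping, as in the sum-of-products description of Section II.A.

```python
def direct_filter(img, w):
    """Reference result: k x k sum of products at every fully-inside position."""
    k = len(w)
    h, width = len(img), len(img[0])
    return [[sum(img[y + i][x + j] * w[i][j]
                 for i in range(k) for j in range(k))
             for x in range(width - k + 1)]
            for y in range(h - k + 1)]


def rotation_strip(img, w, x0, r):
    """Compute r adjacent windows per step while the buffered area moves down.

    Returns (outputs, loads): outputs[top][win] is the result for the window
    whose top-left corner is (top, x0 + win); loads counts pixels fetched
    from the image, standing in for off-chip reads.
    """
    k = len(w)
    y_cols = r + k - 1              # adjacent windows share k - 1 columns
    h = len(img)
    # On-chip area: k rows of y_cols pixels, reused as a circular buffer.
    buf = [img[row][x0:x0 + y_cols] for row in range(k)]
    loads = k * y_cols
    oldest = 0                      # buffer slot holding the topmost image row
    outputs = []
    for top in range(h - k + 1):
        acc = [0] * r
        for cycle in range(k):
            # A rotating counter selects one buffered row per cycle.
            row = buf[(oldest + cycle) % k]
            for win in range(r):    # all r windows consume this row together
                acc[win] += sum(row[win + j] * w[cycle][j] for j in range(k))
        outputs.append(acc)
        if top + k < h:
            # Overwrite only the oldest row; the other k - 1 rows are reused.
            buf[oldest] = img[top + k][x0:x0 + y_cols]
            loads += y_cols
            oldest = (oldest + 1) % k
    return outputs, loads
```

In this model each move of the area loads r + k - 1 new pixels and takes k cycles to produce r outputs. With r = k = 7 that is 13 pixels every 7 cycles, about 1.86 pixels per cycle, which agrees with the statement that roughly twice the external bandwidth of the single dataflow architecture is needed.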

Figure 6. Bar diagram comparing the area efficiency metric for different architectures and for window sizes from 3x3 to 19x19, using the parameters of the case study. The lower the bar, the more efficient.

REFERENCES

[1] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice Hall Press, 2002.
[2] B. Wu, C. C. Hsieh and C. C. Lee, "A Distance Computer Vision Assisted Yoga Learning System," Journal of Computers, 11(6): pp. 2382-2388, 2011.
[3] Z. Wang and X. Sun, "Orthogonal Maximum Margin Projection for Face Recognition," Journal of Computers, 2(7): pp. 377-383, 2012.
[4] B. Zhu and W. Jin, "Radar Emitter Signal Recognition Based on EMD and Neural Network," Journal of Computers, 6(7): pp. 1413-1420, 2012.
[5] V. Hecht and K. Ronner, "An Advanced Programmable 2D-convolution Chip for Real Time Image Processing," IEEE International Symposium on Circuits and Systems, pp. 1897-1900, 1991.
[6] Y. Leblebici, et al., "A Fully Pipelined Programmable Real-time (3 × 3) Image Filter Based on Capacitive Threshold-logic Gates," Proceedings of IEEE International Symposium on Circuits and Systems, vol. 3, pp. 2072-2075, 1997.
[7] B. Bosi, G. Bois, and Y. Savaria, "Reconfigurable Pipelined 2-D Convolvers for Fast Digital Signal Processing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(3): pp. 299-308, 1999.
[8] X. Liang, J. Jean, and K. Tomko, "Data Buffering and Allocation in Mapping Generalized Template Matching on Reconfigurable Systems," The Journal of Supercomputing, 19(1): pp. 77-91, 2001.
[9] M. Nakajima, et al., "A 40GOPS 250mW Massively Parallel Processor Based on Matrix Architecture," IEEE International Solid-State Circuits Conference, pp. 1616-1625, 2006.
[10] F. Cardells-Tormo and P. L. Molinet, "Area-efficient 2-D Shift-variant Convolvers for FPGA-based Digital Image Processing," IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 209-213, 2005.

Zhijian Lu is a PhD student in the College of Computer Science and Technology of Harbin Engineering University, Harbin, China. His current research interests include neural networks, reconfigurable computing and image processing.

Yanxia Wu is an Associate Professor in the College of Computer Science and Technology of Harbin Engineering University, Harbin, China. Her current research interests include safe compilers, reconfigurable compilers and computer architecture.

Zhenhua Guo is a PhD student in the College of Computer Science and Technology of Harbin Engineering University, Harbin, China. His current research interests include reconfigurable computing and embedded systems.

Guochang Gu is a Professor in the College of Computer Science and Technology of Harbin Engineering University, Harbin, China. His main research interests include embedded systems and safe compilers.