Computational Efficiency of the GF and the RMF Transforms for Quaternary Logic Functions on CPUs and GPUs

5th International Conference on Logic and Application (LAP 2016)
Dubrovnik, Croatia, September 19-23, 2016

Computational Efficiency of the GF and the RMF Transforms for Quaternary Logic Functions on CPUs and GPUs

Dušan B. Gajić (1), Radomir S. Stanković (2)
(1) Dept. of Computing and Control, Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia
(2) Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 14, 18000 Niš, Serbia
E-mail: (1) dusan.b.gajic@gmail.com, (2) radomir.stankovic@gmail.com

23.9.2016. LAP 2016 Dubrovnik

Presentation Outline
1. The Galois field (GF) and the Reed-Muller-Fourier (RMF) transforms
2. Graphics processing units (GPUs) and GPGPU
3. Computing GF and RMF transforms of quaternary logic functions on CPUs and GPUs
4. Experimental results
5. Closing remarks

Spectral Transforms
Signal (function) -> apply spectral transform -> redistribution of information content -> processing in the spectral domain, which offers:
1. easier observation of some properties of signals
2. more efficient computation of certain operations
Applications: digital logic design (spectral transforms over GF(p) and the ring of integers modulo p), digital signal processing, pattern recognition.

Spectral Transforms
Spectral transforms are mathematical operators in linear vector spaces which assign to a function $f$ a corresponding spectrum $\mathbf{S}_f$. For a function

$$f : \{0, 1, \ldots, p-1\}^n \to \{0, 1, \ldots, p-1\}$$

the function vector and the spectrum are

$$\mathbf{F} = [f(0), f(1), \ldots, f(p^n - 1)]^T, \qquad \mathbf{S}_f = [s_f(0), s_f(1), \ldots, s_f(p^n - 1)]^T,$$

$$\mathbf{S}_f = \mathbf{T}^{-1} \mathbf{F},$$

where $\mathbf{T}$ is the transform matrix with basis functions as columns and $\mathbf{F}$ is the function vector for $f$. The function is reconstructed from the spectrum as

$$\mathbf{F} = \mathbf{T} \mathbf{S}_f.$$

Fast algorithms are based on the factorization of the transform matrix into sparse matrices, which reduces the complexity from $O(N^2)$ to $O(N \log N)$, with $N = p^n$ the length of the function vector.
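To give a feel for the gap between the two complexity classes on this slide, the following Python sketch counts multiply-add operations for p = 4. The function names `direct_ops` and `fast_ops` are hypothetical, and the fast count is only an upper bound that assumes each of the n sparse factors has at most p non-zero entries per row.

```python
def direct_ops(p: int, n: int) -> int:
    """Multiply-add count for a dense matrix-vector product of size N = p**n."""
    size = p ** n
    return size * size


def fast_ops(p: int, n: int) -> int:
    """Upper bound for the factorized algorithm (assumption, see lead-in).

    There are n sparse factors; each row of a factor has at most p
    non-zero entries, so one factor costs at most p * N multiply-adds.
    """
    size = p ** n
    return n * p * size


if __name__ == "__main__":
    for n in (2, 8, 13):
        print(f"n={n}: direct={direct_ops(4, n)}, fast<={fast_ops(4, n)}")
```

Since n = log_p N, the bound n * p * N grows as N log N, while the direct product grows as N^2.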

Quaternary Logic Functions
- Quaternary logic functions (p = 4) are of special interest since they can be easily encoded by binary values.
- They can be realized by circuits with two stable states in binary devices.
- The genetic code can be viewed as a quaternary logic function, which is relevant to research in bioinformatics.

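Galois Field (GF) Transform for Quaternary Logic Functions
Polynomial expressions for a quaternary logic function of n variables:

$$f(x_1, x_2, \ldots, x_n) = \sum_{i=0}^{4^n - 1} g_i \varphi_i, \qquad g_i \in \{0, 1, 2, 3\},$$

where $\varphi_i$ are the basis functions (products of powers of variables).

Addition in GF(4):

| + | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 1 | 0 | 3 | 2 |
| 2 | 2 | 3 | 0 | 1 |
| 3 | 3 | 2 | 1 | 0 |

Multiplication in GF(4):

| · | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 2 | 3 |
| 2 | 0 | 2 | 3 | 1 |
| 3 | 0 | 3 | 1 | 2 |

$$\mathbf{F} = [f(0), f(1), \ldots, f(4^n - 1)]^T, \qquad \mathbf{S}_{f,4GF} = \mathbf{G}_{4GF}(n)\, \mathbf{F},$$

$$\mathbf{G}_{4GF}(n) = \bigotimes_{i=1}^{n} \mathbf{G}_{4GF}(1), \qquad \mathbf{G}_{4GF}(1) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 3 & 2 \\ 0 & 1 & 2 & 3 \\ 1 & 1 & 1 & 1 \end{bmatrix}.$$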
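As an illustration of the definition above, here is a minimal Python sketch that builds G_4GF(n) as a Kronecker product and multiplies it with a function vector, using the GF(4) tables from the slide as lookup tables. The helper names (`kron_gf4`, `gf4_matvec`, `gf4_transform`) are hypothetical, and this is the direct O(N^2) computation, not the authors' implementation.

```python
# GF(4) addition and multiplication tables, copied from the slide.
GF4_ADD = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1], [3, 2, 1, 0]]
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]

# Basic GF transform matrix G_4GF(1) from the slide.
G1 = [[1, 0, 0, 0], [0, 1, 3, 2], [0, 1, 2, 3], [1, 1, 1, 1]]


def kron_gf4(a, b):
    """Kronecker product of two matrices, entries multiplied in GF(4)."""
    return [[GF4_MUL[x][y] for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def gf4_matvec(matrix, vector):
    """Matrix-vector product with all arithmetic done through the LUTs."""
    result = []
    for row in matrix:
        acc = 0
        for coeff, value in zip(row, vector):
            acc = GF4_ADD[acc][GF4_MUL[coeff][value]]
        result.append(acc)
    return result


def gf4_transform(f, n):
    """Direct (non-fast) GF(4) transform of a function vector of length 4**n."""
    g = G1
    for _ in range(n - 1):
        g = kron_gf4(g, G1)
    return gf4_matvec(g, f)


if __name__ == "__main__":
    # f(x) = x has the single non-zero coefficient g_1 = 1.
    print(gf4_transform([0, 1, 2, 3], 1))
```

For the one-variable function f(x) = x, the spectrum contains a single 1 at index 1, which matches the polynomial expression f = 1 · x.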

Operations in the GF Transform
Field operations depend on the order of the considered finite (Galois) field.
- p prime, programming implementation:
  1. the % operator from high-level languages
  2. lookup tables (LUTs)
- p composite (a prime power, such as p = 4), programming implementation:
  1. lookup tables (LUTs)
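The distinction between prime and composite p can be checked numerically. The sketch below (Python, with hypothetical helper names) tests whether every non-zero element has a multiplicative inverse: modulo-3 arithmetic passes, modulo-4 arithmetic fails because 2 · 2 mod 4 = 0, and the GF(4) multiplication table passes.

```python
# GF(4) multiplication table (the standard table for the field of order 4).
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]


def mul_mod3(a, b):
    """Multiplication via the % operator, p = 3 (prime)."""
    return (a * b) % 3


def mul_mod4(a, b):
    """Multiplication via the % operator, p = 4 (not a field)."""
    return (a * b) % 4


def mul_gf4(a, b):
    """Multiplication in GF(4) via a lookup table."""
    return GF4_MUL[a][b]


def has_inverses(mul, p):
    """True if every non-zero element has a multiplicative inverse under mul."""
    return all(any(mul(a, b) == 1 for b in range(1, p)) for a in range(1, p))


if __name__ == "__main__":
    print(has_inverses(mul_mod3, 3), has_inverses(mul_mod4, 4), has_inverses(mul_gf4, 4))
```

This is why the % operator suffices for prime p, while GF(4) needs tables (or an equivalent encoding of the field operations).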

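Example: GF(4), n = 2
Basic transform matrix for GF(4):

$$\mathbf{G}_{4GF}(1) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 3 & 2 \\ 0 & 1 & 2 & 3 \\ 1 & 1 & 1 & 1 \end{bmatrix}$$

Cooley-Tukey factorization (with $\mathbf{I}$ the 4 × 4 identity matrix):

$$\mathbf{C}_1 = \mathbf{G}_{4GF}(1) \otimes \mathbf{I}, \qquad \mathbf{C}_2 = \mathbf{I} \otimes \mathbf{G}_{4GF}(1)$$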
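The factorization can be turned into an in-place fast algorithm: each sparse factor applies the basic 4 × 4 matrix to groups of four elements that are a fixed stride apart. The following Python sketch assumes this standard butterfly indexing (the slides do not show the authors' code) and checks the result for n = 2 against a direct evaluation of the Kronecker product. All function names are hypothetical.

```python
GF4_ADD = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1], [3, 2, 1, 0]]
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
G1 = [[1, 0, 0, 0], [0, 1, 3, 2], [0, 1, 2, 3], [1, 1, 1, 1]]


def butterfly(block):
    """Multiply one 4-element block by G1 using the GF(4) lookup tables."""
    out = []
    for row in G1:
        acc = 0
        for coeff, value in zip(row, block):
            acc = GF4_ADD[acc][GF4_MUL[coeff][value]]
        out.append(acc)
    return out


def fast_gf4(f, n):
    """Apply the n sparse factors one after another, 4**n / 4 butterflies each."""
    s = list(f)
    for step in range(n):
        stride = 4 ** step  # stride 1 acts like I (x) G1, stride 4 like G1 (x) I
        for t in range(len(s) // 4):
            base = (t // stride) * stride * 4 + (t % stride)
            idx = [base + k * stride for k in range(4)]
            for i, val in zip(idx, butterfly([s[j] for j in idx])):
                s[i] = val
    return s


def direct_gf4_n2(f):
    """Reference for n = 2: entry (4*i1 + i2) of (G1 (x) G1) F, computed term by term."""
    out = []
    for i1 in range(4):
        for i2 in range(4):
            acc = 0
            for j1 in range(4):
                for j2 in range(4):
                    coeff = GF4_MUL[G1[i1][j1]][G1[i2][j2]]
                    acc = GF4_ADD[acc][GF4_MUL[coeff][f[4 * j1 + j2]]]
            out.append(acc)
    return out


if __name__ == "__main__":
    f = [(3 * i + i // 4) % 4 for i in range(16)]
    print(fast_gf4(f, 2) == direct_gf4_n2(f))
```

The two factors commute, so the order in which the strides are processed does not affect the result.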

Example: GF(4), n = 2
[Figure not preserved in this transcript.]

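Reed-Muller-Fourier (RMF) Transform for Quaternary Logic Functions
Polynomial expressions for a quaternary logic function of n variables:

$$f(x_1, x_2, \ldots, x_n) = \sum_{i=0}^{4^n - 1} g_i \varphi_i, \qquad g_i \in \{0, 1, 2, 3\},$$

where $\varphi_i$ are the basis functions (products of powers of variables).

Addition modulo 4:

| + | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 | 0 |
| 2 | 2 | 3 | 0 | 1 |
| 3 | 3 | 0 | 1 | 2 |

Multiplication modulo 4:

| · | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 2 | 3 |
| 2 | 0 | 2 | 0 | 2 |
| 3 | 0 | 3 | 2 | 1 |

$$\mathbf{F} = [f(0), f(1), \ldots, f(4^n - 1)]^T, \qquad \mathbf{S}_{f,4RMF} = \mathbf{R}_{4RMF}(n)\, \mathbf{F},$$

$$\mathbf{R}_{4RMF}(n) = \bigotimes_{i=1}^{n} \mathbf{R}_{4RMF}(1), \qquad \mathbf{R}_{4RMF}(1) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 3 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 1 & 1 & 3 & 3 \end{bmatrix}.$$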
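For comparison with the GF case, the RMF transform can be sketched in Python with plain integer arithmetic and the % operator. The helper names are hypothetical and the computation is the direct matrix-vector product. The sketch also checks a property that can be verified numerically for this particular matrix: applying the transform twice returns the original vector.

```python
# Basic RMF transform matrix R_4RMF(1) from the slide.
R1 = [[1, 0, 0, 0], [1, 3, 0, 0], [1, 2, 1, 0], [1, 1, 3, 3]]


def kron_mod4(a, b):
    """Kronecker product of two matrices with entries reduced modulo 4."""
    return [[(x * y) % 4 for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def rmf_transform(f, n):
    """Direct (non-fast) RMF transform of a function vector of length 4**n."""
    r = R1
    for _ in range(n - 1):
        r = kron_mod4(r, R1)
    return [sum(c * v for c, v in zip(row, f)) % 4 for row in r]


if __name__ == "__main__":
    f = [(3 * i + i // 4) % 4 for i in range(16)]
    spectrum = rmf_transform(f, 2)
    print(spectrum)
    print(rmf_transform(spectrum, 2) == f)
```

No lookup tables are needed here, which is the implementation difference the slides attribute to RMF(4) versus GF(4).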

Operations in the RMF Transform
- Introduced by changing the underlying algebraic structure into the Gibbs algebra.
- The group operation is modulo p addition for all positive integer values of p, while multiplication is a convolutionwise (Gibbs) multiplication.
- For all positive integer values of p, programming implementation:
  1. the % operator from high-level languages
  2. lookup tables (LUTs)

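Example: RMF(4), n = 2
Basic transform matrix for RMF(4):

$$\mathbf{R}_{4RMF}(1) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 3 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 1 & 1 & 3 & 3 \end{bmatrix}$$

Cooley-Tukey factorization (with $\mathbf{I}$ the 4 × 4 identity matrix):

$$\mathbf{C}_1 = \mathbf{R}_{4RMF}(1) \otimes \mathbf{I}, \qquad \mathbf{C}_2 = \mathbf{I} \otimes \mathbf{R}_{4RMF}(1)$$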

Example: RMF(4), n = 2
[Figure not preserved in this transcript.]

Comparison of Algorithms
[Figure with two panels, GF(4) and RMF(4); content not preserved in this transcript.]
- RMF has a triangular transform matrix (smaller number of operations).
- For many functions, RMF offers fewer non-zero spectral coefficients.
- Different arithmetic operations: modulo p operations instead of GF operations.
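One rough way to quantify the first point is to count the operations needed for a single 4-point butterfly with each basic matrix. The Python sketch below uses a simple counting convention that is an assumption, not taken from the slides: one addition per extra non-zero entry in a row, and one multiplication per entry other than 0 and 1.

```python
# Basic transform matrices from the slides.
G1 = [[1, 0, 0, 0], [0, 1, 3, 2], [0, 1, 2, 3], [1, 1, 1, 1]]  # GF(4)
R1 = [[1, 0, 0, 0], [1, 3, 0, 0], [1, 2, 1, 0], [1, 1, 3, 3]]  # RMF(4)


def op_counts(matrix):
    """Return (additions, multiplications) for one matrix-vector product.

    Counting convention (assumption): a row with k non-zero entries costs
    k - 1 additions; every entry other than 0 or 1 costs one multiplication.
    """
    additions = 0
    multiplications = 0
    for row in matrix:
        nonzero = sum(1 for c in row if c != 0)
        additions += max(nonzero - 1, 0)
        multiplications += sum(1 for c in row if c not in (0, 1))
    return additions, multiplications


if __name__ == "__main__":
    print("GF(4): ", op_counts(G1))
    print("RMF(4):", op_counts(R1))
```

Under this convention the triangular RMF matrix saves one addition per butterfly, so a larger share of the measured difference may come from the cost of each operation (LUT access versus %) than from the operation count alone.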

Graphics Processing Unit (GPU)
- A graphics processing unit (GPU) is a hardware device originally specialized for rendering computer graphics.
- The first GPU appeared in 1999.
- Early 2000s: fixed-function processors dedicated to rendering computer graphics.
- Presently: a unified programmable graphics processor and a parallel computing platform.
- The GPU design philosophy is opposite to that of CPUs (throughput vs. latency), which leads to a different programming philosophy.

CPU and GPU Throughput
[Figure: CPU and GPU throughput in GFLOPS by year, 2006 to 2014. Data labels on the chart: CPU 43, 51, 55, 58, 86, 187, 225, 225, 225; GPU, in ascending order: 518, 576, 648, 1062, 1581, 2488, 3090, 4500, 5632.]

CPU and GPU Bandwidth
[Figure: CPU and GPU memory bandwidth in GB/s by year, 2006 to 2014. Data labels on the chart, in ascending order: CPU 10, 26, 26, 32, 32, 32, 51, 51, 51; GPU 90, 108, 142, 159, 177, 192, 192, 288, 336.]

GPU Computing (GPGPU)
- General purpose computations on the GPU (GPGPU or GPU computing)
- GPU features:
  - manycore architecture
  - high throughput and processing power
  - lower cost and smaller energy consumption
- Suitable for intensive computations and large data processing
- Nvidia CUDA (high performance, exclusive to Nvidia GPUs), appeared in 2007
- OpenCL (open standard, acceleration on heterogeneous devices: CPUs, GPUs, DSPs, FPGAs), appeared in 2009

GPU Computing Programs
A GPGPU program is composed of:
1. a host program (processed on CPUs, controls execution), and
2. a device program (processed on GPUs, implements kernels).
- A kernel is a data-parallel function executed on a GPU.
- Each kernel describes computations performed by a single thread.
- Block (set of threads) and grid (set of blocks) configurations are defined in the host program.
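The host/kernel split can be emulated on a CPU to make the roles concrete. The Python sketch below is only an emulation, not CUDA code and not the authors' kernel: `rmf_kernel` describes the work of one thread (one 4-point RMF butterfly), and `host_rmf` plays the host program, choosing the block and grid sizes and launching the kernel once per transform step. The names, the one-butterfly-per-thread mapping, and the default `block_dim` are all assumptions.

```python
# Basic RMF transform matrix R_4RMF(1) from the slides.
R1 = [[1, 0, 0, 0], [1, 3, 0, 0], [1, 2, 1, 0], [1, 1, 3, 3]]


def rmf_kernel(block_idx, thread_idx, block_dim, data_in, data_out, stride):
    """Work of a single thread: one 4-point butterfly with modulo-4 arithmetic."""
    t = block_idx * block_dim + thread_idx  # global thread index
    if t >= len(data_in) // 4:              # guard for surplus threads
        return
    base = (t // stride) * stride * 4 + (t % stride)
    values = [data_in[base + k * stride] for k in range(4)]
    for r in range(4):
        data_out[base + r * stride] = sum(R1[r][k] * values[k] for k in range(4)) % 4


def host_rmf(f, n, block_dim=64):
    """Host side: configure the grid and launch the kernel once per step."""
    src = list(f)
    butterflies = len(src) // 4
    grid_dim = (butterflies + block_dim - 1) // block_dim
    for step in range(n):
        dst = [0] * len(src)
        # On a GPU these two loops would run in parallel across blocks and threads.
        for block_idx in range(grid_dim):
            for thread_idx in range(block_dim):
                rmf_kernel(block_idx, thread_idx, block_dim, src, dst, 4 ** step)
        src = dst
    return src


if __name__ == "__main__":
    print(host_rmf([0, 1, 2, 3], 1))
```

Separate input and output buffers per launch avoid any ordering constraints between threads, since each output element is written by exactly one thread.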

GPU Architecture and Computing Model
- The GPU executes kernels with high parallelism.
- Different programming philosophy for GPUs.
[Figure: data flow in four numbered steps between input, input buffer, output buffer, and output.]

Implementation of Operations for p = 4
- Randomly generated quaternary logic function vectors F(n)
- C++ on the CPU, CUDA C on the GPU
- The group operation was implemented in C++ and CUDA C using:
  - LUTs for GF(4)
  - the modulo arithmetic operator % for RMF(4)
- On GPUs there is additional time for memory transfers.

Experimental Platforms

| Component | Platform 1 (Desktop) | Platform 2 (Workstation) |
|---|---|---|
| CPU | Intel Core i7-920 | Intel Xeon E5-1620 |
| CPU microarchitecture | Bloomfield | Haswell |
| CPU clock (GHz) | 2.66 | 3.5 |
| CPU processing power (GFLOPS) | 28 | 122 |
| CPU cores/threads | 4/8 | 4/8 |
| RAM | 12 GB DDR3 2000 MHz | 32 GB DDR4 ECC 2133 MHz |
| GPU | Nvidia GTX 560 Ti | Nvidia Quadro K620 |
| GPU microarchitecture | Fermi | Kepler |
| GPU processing power (GFLOPS) | 1263 | 768 |
| GPU cores | 384 | 384 |
| GPU memory type | 1 GB GDDR5 | 2 GB DDR3 |
| GPU bandwidth (GB/s) | 128 | 28.8 |
| OS | Windows 7 64-bit | Windows 10 64-bit |
| GPU SDK | Nvidia GPU Computing 7.5 | Nvidia GPU Computing 7.5 |

Experimental Results: Platform 1 (Desktop)
[Figure: computing time in ms (logarithmic axis, 0.1 to 10000) versus number of variables n = 8 to 14, for CPU GF, CPU RMF, GPU GF, and GPU RMF.]

Experimental Results: Platform 1 (Desktop)
Processing time [ms]:

| n | CPU/C++ GF | CPU/C++ RMF | GPU/CUDA GF | GPU/CUDA RMF | GPU/CUDA Memory |
|---|---|---|---|---|---|
| 8 | 2 | 1 | 0.4 | 0.1 | 0.1 |
| 9 | 10 | 6 | 1.2 | 0.2 | 0.4 |
| 10 | 47 | 27 | 4.7 | 0.9 | 1.5 |
| 11 | 210 | 123 | 20.1 | 3.9 | 5.9 |
| 12 | 917 | 534 | 86.8 | 16.8 | 23.6 |
| 13 | 3994 | 2337 | 374.8 | 72.4 | 94.0 |
| 14 | 12119 | 9057 | - | - | - |

- On the CPU, RMF is from 1.3 to 2 times faster than GF.
- On the GPU, RMF is from 4 to 6 times faster than GF.
- Computing on GPUs is from 10 to 33 times faster than on CPUs.
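The GPU-versus-CPU speedup depends on whether memory transfers are counted. The Python sketch below recomputes the Platform 1 speedups from the table values both ways. It assumes that the Memory column is a transfer time that comes on top of the GPU kernel time, which the slides do not state explicitly. The names `PLATFORM1` and `gpu_speedups` are hypothetical.

```python
# Platform 1 timings in ms, copied from the table:
# n -> (CPU GF, CPU RMF, GPU GF, GPU RMF, Memory)
PLATFORM1 = {
    8: (2, 1, 0.4, 0.1, 0.1),
    9: (10, 6, 1.2, 0.2, 0.4),
    10: (47, 27, 4.7, 0.9, 1.5),
    11: (210, 123, 20.1, 3.9, 5.9),
    12: (917, 534, 86.8, 16.8, 23.6),
    13: (3994, 2337, 374.8, 72.4, 94.0),
}


def gpu_speedups(row, include_memory):
    """Return (GF speedup, RMF speedup) of the GPU over the CPU for one row.

    Assumption: when include_memory is True, the Memory time is added to the
    GPU kernel time before dividing.
    """
    cpu_gf, cpu_rmf, gpu_gf, gpu_rmf, memory = row
    extra = memory if include_memory else 0.0
    return cpu_gf / (gpu_gf + extra), cpu_rmf / (gpu_rmf + extra)


if __name__ == "__main__":
    for n, row in PLATFORM1.items():
        kernel_only = gpu_speedups(row, include_memory=False)
        with_memory = gpu_speedups(row, include_memory=True)
        print(n, [round(x, 1) for x in kernel_only], [round(x, 1) for x in with_memory])
```

Under this assumption, the kernel-only RMF speedups come close to the upper end of the reported range, while including transfers lowers the figures noticeably.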

Experimental Results: Platform 2 (Workstation)
[Figure: computing time in ms (logarithmic axis, 0.1 to 10000) versus number of variables n = 8 to 14, for CPU GF, CPU RMF, GPU GF, and GPU RMF.]

Experimental Results: Platform 2 (Workstation)
Processing time [ms]:

| n | CPU/C++ GF | CPU/C++ RMF | GPU/CUDA GF | GPU/CUDA RMF | GPU/CUDA Memory |
|---|---|---|---|---|---|
| 8 | 1 | 0.6 | 0.5 | 0.1 | 0.1 |
| 9 | 5 | 3 | 1.3 | 0.3 | 0.4 |
| 10 | 20 | 13 | 5.3 | 2.7 | 1.3 |
| 11 | 87 | 62 | 22.0 | 12.2 | 5.3 |
| 12 | 371 | 269 | 91.6 | 54.3 | 21.3 |
| 13 | 1643 | 1171 | 402.0 | 233.1 | 85.0 |
| 14 | 7032 | 5047 | - | - | - |

- On the CPU, RMF is from 1.4 to 1.7 times faster than GF.
- On the GPU, RMF is from 1.7 to 5 times faster than GF.
- Computing on GPUs is from 2 to 5 times faster than on CPUs.
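The RMF-versus-GF ranges quoted for both platforms can be recomputed from the tabulated times. The Python sketch below does this with the values copied from the two result tables. The names `TIMES` and `ratio_range` are hypothetical, and missing GPU entries for n = 14 are stored as None and skipped.

```python
# Timings in ms from the two result tables: n -> (CPU GF, CPU RMF, GPU GF, GPU RMF)
TIMES = {
    "Platform 1": {
        8: (2, 1, 0.4, 0.1), 9: (10, 6, 1.2, 0.2), 10: (47, 27, 4.7, 0.9),
        11: (210, 123, 20.1, 3.9), 12: (917, 534, 86.8, 16.8),
        13: (3994, 2337, 374.8, 72.4), 14: (12119, 9057, None, None),
    },
    "Platform 2": {
        8: (1, 0.6, 0.5, 0.1), 9: (5, 3, 1.3, 0.3), 10: (20, 13, 5.3, 2.7),
        11: (87, 62, 22.0, 12.2), 12: (371, 269, 91.6, 54.3),
        13: (1643, 1171, 402.0, 233.1), 14: (7032, 5047, None, None),
    },
}


def ratio_range(rows, gf_index, rmf_index):
    """Smallest and largest GF/RMF time ratio, rounded to one decimal place."""
    ratios = [row[gf_index] / row[rmf_index]
              for row in rows.values() if row[gf_index] is not None]
    return round(min(ratios), 1), round(max(ratios), 1)


if __name__ == "__main__":
    for name, rows in TIMES.items():
        print(name, "CPU:", ratio_range(rows, 0, 1), "GPU:", ratio_range(rows, 2, 3))
```

The rounded ranges agree with the figures stated on the slides: 1.3 to 2 and 4 to 6 for Platform 1, 1.4 to 1.7 and 1.7 to 5 for Platform 2.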

Closing Remarks
- Performance comparison of computing the GF and the RMF transforms for quaternary logic functions on CPUs and GPUs.
- Modulo operators in RMF(4) outperform LUTs in GF(4) by a factor of 1.3 to 2 on CPUs.
- Modulo operators in RMF(4) outperform LUTs in GF(4) by a factor of 1.7 to 6 on GPUs.
- For the considered tasks, GPUs are almost an order of magnitude faster than CPUs.
- The computational advantage of RMF over GF increases on novel computing architectures.
