Document downloaded from: http://hdl.handle.net/10251/64738

This paper must be cited as: Reaño González, C.; Pérez López, F.; Silla Jiménez, F. (2015). On the design of a demo for exhibiting rCUDA. 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2015). IEEE. doi:10.1109/CCGrid.2015.53. The final publication is available at http://dx.doi.org/10.1109/CCGrid.2015.53

Copyright IEEE. Additional information: © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

On the Design of a Demo for Exhibiting rCUDA

Carlos Reaño, Ferrán Pérez, and Federico Silla
Universitat Politècnica de València, València, Spain
carregon@gap.upv.es, fsilla@disca.upv.es

Abstract: CUDA is a technology developed by NVIDIA which provides a parallel computing platform and programming model for NVIDIA GPUs and compatible devices. It takes benefit from the enormous parallel processing power of GPUs in order to accelerate a wide range of applications, thus reducing their execution time. rCUDA (remote CUDA) is a middleware which grants applications concurrent access to CUDA-compatible devices installed in other nodes of a cluster in a transparent way, so that applications are not aware of accessing a remote device. In this paper we present a demo which shows, in real time, the overhead introduced by rCUDA in comparison to CUDA when running image filtering applications. The approach followed in this work is to develop a graphical demo which combines an appealing design with technical content.

Keywords: GPGPU; CUDA; HPC; virtualization

I. INTRODUCTION

GPU-accelerated computing consists in using the massively parallel power of graphics processing units (GPUs) to boost the performance of a wide range of applications in areas such as computational algebra, chemical physics, finance, or image analysis, to name only a few. Since 2006, NVIDIA's response to this trend has been CUDA (Compute Unified Device Architecture) [1], a technology which provides a parallel computing platform and programming model for NVIDIA GPUs and compatible devices. However, the use of GPUs in current high performance computing (HPC) clusters presents several disadvantages, such as high acquisition costs and power consumption. In addition, current computational science and HPC applications generally make relatively low utilization of their GPUs.
Hence, sharing a reduced number of GPUs among the nodes of a cluster might be beneficial both to reduce acquisition costs and power consumption and to increase the GPU utilization rate. rCUDA (remote CUDA) [2], [3] is a middleware which enables sharing remote CUDA-compatible devices concurrently and transparently. It grants applications concurrent access to GPUs installed in other nodes of the cluster in such a manner that they are not aware of accessing a remote device. Furthermore, rCUDA does not require modifying the source code of applications and introduces only a small overhead with respect to CUDA.

In this paper we introduce a demonstrator for rCUDA consisting of a graphical demo which combines an appealing design and live applications along with technical content. The rest of the paper is organized as follows. In Section II we present rCUDA in more detail. Section III describes the applications later used in the demo. Finally, in Section IV we assemble all the components and describe the live demo to be presented.

II. rCUDA: REMOTE CUDA

In the same way as CUDA uses local GPUs to accelerate certain parts of applications, rCUDA (remote CUDA) [2], [3] takes benefit from remote GPUs to do so. Figure 1 illustrates a sample scenario for the sake of clarity.

Figure 1. rCUDA sample scenario: CUDA applications on client nodes access a GPU installed in a server node across the network.

As commented previously, rCUDA enables sharing remote CUDA-compatible devices concurrently and transparently to applications. In this manner, a GPU installed in one node of a cluster (the server node) can be used by the rest of the nodes of the cluster (the client nodes) to accelerate applications using CUDA. In order to do so, the rCUDA middleware intercepts the application's calls to the CUDA API and forwards them to the remote GPU. Notice that the application continues using the very same CUDA API and does not need to be modified.
Once a CUDA call arrives at the remote GPU thanks to rCUDA, it is executed using the real CUDA library and the real GPU. When the CUDA call completes, its results are returned by rCUDA to the application which made the initial call. Notice that this process is transparent to the application, which is not aware of accessing a remote GPU. To communicate client and server nodes, rCUDA provides two different communication modules: one using the general TCP/IP protocol stack, and another using the InfiniBand Verbs API.

The latest available rCUDA version, release 5.0, supports the CUDA Runtime and Driver APIs at version 6.5. It also supports the most important routines of the following CUDA-specific libraries: cuBLAS (Basic Linear Algebra Subprograms), cuFFT (Fast Fourier Transform), cuRAND (generation of random numbers), and cuSPARSE (BLAS subroutines for handling sparse matrices). Finally, rCUDA is free and can be obtained from the website www.rcuda.net.

III. APPLICATIONS USED IN THE DEMO

In this section we describe the applications used in the live demo. It is important to remark that one of the demo requirements was that it should attract the attention of the exhibition attendees; therefore, the demo had to be devised with a very appealing design. For this reason, the applications used in the demo are two image filters: color-image-to-grayscale conversion (Subsection III-A) and image blurring (Subsection III-B). Additionally, those filters are applied to a set of over 200 pictures especially selected to attract the attention of attendees.

A. Color Image to Grayscale Conversion

In computer graphics, each pixel of a color image is commonly represented by four parameters: RGBA [4]. R indicates how much red is in the pixel, G how much green, and B how much blue. A stands for alpha and specifies the opacity of the pixel. Each of these parameters is represented by one byte, so there are 256 different possible values for each parameter. On the other hand, each pixel of a grayscale image is represented by a single parameter which specifies the level of gray using one byte; hence, each pixel has 256 possible values. To convert an image from color to grayscale, given that the eye responds most strongly to green, followed by red and then blue, the NTSC (National Television System Committee) recommends using Equation 1. Notice that the parameter A is ignored in this formula.
I = 0.299 R + 0.587 G + 0.114 B   (1)

Based on an initial program extracted from [4], we have developed a CUDA application which performs the image conversion on the GPU using the formula shown in Equation 1.

B. Image Blurring

Blurring an image [4] consists in applying to each pixel and its neighbors a filter which varies depending on the desired level of distortion. For instance, imagine that we have an image represented by the matrix shown in Figure 2, where B represents the pixel to blur, N1..N8 refer to what we have called the neighbor pixels, and X are pixels which will not be modified when blurring pixel B.

X  X  X  X  X
X  N1 N2 N3 X
X  N4 B  N5 X
X  N6 N7 N8 X
X  X  X  X  X

Figure 2. Matrix representing the image to blur: B represents the pixel to blur, N1..N8 refer to the neighbor pixels, and X are pixels which will not be modified when blurring pixel B.

To blur pixel B of the image represented by the matrix in Figure 2, we apply Equation 2, where d is an array which specifies the distortion level for each neighbor pixel and dB is the distortion level for pixel B.

blur(B) = B · dB + Σ_{i=1..8} N[i] · d[i]   (2)

Based on an initial program extracted from [4], we have developed a CUDA application which performs the image blurring on the GPU using the formula shown in Equation 2.

IV. rCUDA DEMO DESCRIPTION

In this section we describe the demo. We first present the equipment used (Subsection IV-A), then the demo itself (Subsection IV-B), and finally the performance results (Subsection IV-C).

A. Equipment Used

The equipment necessary for this demo consists of two Supermicro 1027GR-TRF servers, each with the following characteristics:

- Two Intel Xeon hexa-core E5-2680 v2 (Ivy Bridge) processors operating at 2.8 GHz.
- 32 GB of DDR3 SDRAM memory at 1,600 MHz.
- One Mellanox Connect-IB (FDR) dual-port InfiniBand adapter.
- One NVIDIA Tesla K80 GPU.
- Red Hat Enterprise Linux Server release 6.4 with Mellanox OFED 2.1-1.0.0 (InfiniBand drivers and administrative tools) and CUDA 6.5 with NVIDIA driver 340.29.

In addition, one monitor is necessary to display the graphical part of the demo. Figure 4 shows how the equipment is interconnected: the demo runs in node A, whereas node B hosts an rCUDA server.

Figure 4. Scheme of the equipment used for the demo: nodes A and B are linked by a dual-port InfiniBand connection, and the monitor is attached to node A via VGA.

B. Description of the Demo

Figure 3 presents a screenshot of the demo. A video of the demo can also be seen at http://youtu.be/qblh6ww3dha.

Figure 3. Screenshot of the demo.

The demo consists of 245 different color pictures, each of them available in three different sizes: 1024x768 (2.4 MB), 2048x1536 (9.4 MB), and 4096x3072 (37.7 MB). The current image size being computed during the live demo is shown at the top right part of the screen under the label "Current picture size". For each image and size, the following steps are performed:

1) The original image is displayed on the screen.
2) The image is converted to grayscale, first using CUDA (the calculations are done in a local GPU) and then using rCUDA (the calculations are done in a remote GPU). To do so, we employ the filter presented in Subsection III-A. The image conversion times with CUDA and with rCUDA are stored separately.
3) Although the complete image is converted to grayscale, only the top right part of the image displayed on the screen is changed to grayscale, for aesthetic reasons.
4) The blur filter explained in Subsection III-B is then applied to the image, again using both CUDA and rCUDA. Both conversion times are also stored.
5) Though the whole image is blurred, only the bottom right part of the blurred image is displayed on the screen, for aesthetic reasons.
6) The conversion times of CUDA and rCUDA for both filters (grayscale and blur) are numerically displayed at the right side of the screen. They are also represented in the form of a bar chart: the green part of each bar is the CUDA conversion time, while the blue part refers to the overhead of doing the same conversion with rCUDA. The bar chart keeps track of the results for the last 20 images.
7) The bottom right part of the screen is then updated, showing the average rCUDA overhead over CUDA for the different image sizes and filters, taking into account all the executions since the demo started.

Once the previous sequence is completed, a new image is displayed on the screen and the process starts again. It is repeated for all the images and all the sizes.

C. Performance Results
The applications used in the demo have different behaviors in order to show the performance of rCUDA under distinct scenarios. In general, three factors influence rCUDA:

- Transfers: CUDA memory copies translate into network transfers when using rCUDA, which introduces an overhead that depends on the network bandwidth.
- Computations: the time spent by CUDA kernels in the GPU is the same for CUDA and rCUDA. Therefore, performing a large amount of computation helps rCUDA to compensate for the overhead caused by transferring data across the network.
- CUDA calls: when using rCUDA, calls to the CUDA API turn into small network transfers, which increase the rCUDA overhead depending on the network latency.

Figure 5 presents the rCUDA overhead over CUDA when running the applications explained in Section III using the three different image sizes mentioned in Subsection IV-B. The results are the average of ten executions, and the maximum Relative Standard Deviation (RSD) observed was 0.77. This RSD occurred when using CUDA and the grayscale filter over an image of size 1024x768. To ease the interpretation of the results, Figure 6 and Figure 7 show profiling information obtained with the NVIDIA profiling tools.

Figure 5. rCUDA overhead over CUDA when running the grayscale and blur filters.

Figure 6. Time spent in computations (i.e., CUDA kernels) by the grayscale and blur filters.

Figure 7. Time spent in transfers (i.e., CUDA memcopies) and calls made to the CUDA API by the grayscale and blur filters. Bars represent transfers, whereas lines depict calls.

Regarding the application which converts the images from color to grayscale, referred to as grayscale in the figures, we can observe that the overhead experienced by rCUDA noticeably increases with image size (see Figure 5). This is due to the fact that the time spent in transfers (Figure 7) grows much faster than the time spent in computations (Figure 6) as the image size increases, thus making the overhead that rCUDA introduces because of network transfers more noticeable.

With respect to the application which blurs the images, labeled as blur in the figures, it can be seen that the overhead presented by rCUDA for an image size of 1024x768 is higher than in the grayscale application (see Figure 5). This is because the number of calls to the CUDA API for this filter is larger than for the grayscale one (Figure 7), which introduces an overhead due to network latency that is not compensated by the time spent in computations (Figure 6). In contrast, the rCUDA overhead for image sizes of 2048x1536 and 4096x3072 is lower when running the blur filter than the grayscale one (Figure 5). The reason lies in the fact that the time spent in computations (Figure 6) is in this case enough to counterbalance the overhead due to network transfers (Figure 7).

ACKNOWLEDGMENT

This work was funded by the Spanish MINECO and FEDER funds under Grant TIN2012-38341-C04-01. The authors are also grateful for the generous support provided by Mellanox Technologies and the equipment donated by NVIDIA Corporation.

REFERENCES

[1] NVIDIA, "NVIDIA CUDA C Programming Guide 6.5," 2014.
[2] A. J. Peña, C. Reaño, F. Silla, R. Mayo, E. S. Quintana-Ortí, and J. Duato, "A complete and efficient CUDA-sharing solution for HPC clusters," Parallel Computing, vol. 40, no. 10, pp. 574–588, 2014.
[3] C. Reaño, R. Mayo, E. S. Quintana-Ortí, F. Silla, J. Duato, and A. J. Peña, "Influence of InfiniBand FDR on the performance of remote GPU virtualization," in IEEE International Conference on Cluster Computing (CLUSTER), 2013, pp. 1–8.
[4] Udacity, "Intro to Parallel Programming," 2015.