GPU-based data analysis for Synthetic Aperture Microwave Imaging


1st IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 1st-3rd June 2015

J.C. Chorley (1), K.J. Brunner (1), N.A. Dipper (1), S.J. Freethy (4), R.M. Sharples (1), V.F. Shevchenko (3), D.A. Thomas (2), R.G.L. Vann (2)

(1) Durham University
(2) University of York
(3) Culham Centre for Fusion Energy
(4) Max-Planck-Institut für Plasmaphysik

This work is funded by Durham University and EPSRC grant EP/K504178/1.

Talk outline
- SAMI overview
- Motivation for GPU acceleration
- GPU code and techniques
- Acceleration results
- Summary and future work

SAMI overview
- SAMI is the Synthetic Aperture Microwave Imaging diagnostic, which reconstructs 2D thermal images of the plasma.
- SAMI is a phased array: the phase on each antenna is determined by the geometry and polarisation.
- If the antennas do not have perfectly aligned polarisations, there is an additional phase difference between the antennas.
- The image is then the sum of the products of the antenna cross-correlations.
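The idea of building an image from antenna cross-correlations can be sketched as follows. This is a minimal illustration of generic aperture synthesis, not the SAMI code: the 1D array layout, the single point source, the steering phase factor and the function names `cross_correlations` and `image_1d` are all assumptions made for the example.

```python
import cmath
import math

def cross_correlations(voltages):
    """Time-averaged C[i][j] = <V_i * conj(V_j)> for every antenna pair."""
    n_ant = len(voltages)
    n_t = len(voltages[0])
    return [[sum(voltages[i][t] * voltages[j][t].conjugate() for t in range(n_t)) / n_t
             for j in range(n_ant)]
            for i in range(n_ant)]

def image_1d(corr, positions_wl, sin_thetas):
    """For each direction, sum the phase-steered cross-correlations over pairs i < j."""
    image = []
    for s in sin_thetas:
        total = 0.0
        for i in range(len(positions_wl)):
            for j in range(i + 1, len(positions_wl)):
                baseline = positions_wl[i] - positions_wl[j]
                steer = cmath.exp(-2j * math.pi * baseline * s)
                total += 2.0 * (corr[i][j] * steer).real
        image.append(total)
    return image

# Hypothetical 1D array (positions in wavelengths) viewing one point source.
positions_wl = [0.0, 0.5, 1.5, 3.5]
source_sin_theta = 0.3
n_samples = 64
voltages = [[cmath.exp(2j * math.pi * (x * source_sin_theta + 0.37 * t))
             for t in range(n_samples)]
            for x in positions_wl]

corr = cross_correlations(voltages)
sin_thetas = [-1.0 + 0.01 * k for k in range(201)]
image = image_1d(corr, positions_wl, sin_thetas)
peak = sin_thetas[image.index(max(image))]
print(f"image peaks at sin(theta) = {peak:.2f}")
```

In this toy case the image peaks in the direction of the simulated source, because that is where the steering phases cancel the geometric phase differences between antennas.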


SAMI overview
- The optimised design for SAMI, satisfying bandwidth and space requirements, consists of 8 antennas [1] [2].

[1] S.J. Freethy et al., IEEE Transactions on Antennas and Propagation 60, 5442 (2012)
[2] S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)

SAMI overview
Figure: Shot 27022.
- SAMI is the first diagnostic of its kind:
  - 2D maps of the Electron Bernstein Emission process and mode conversion windows
  - Useful for RF heating and current drive
- SAMI has demonstrated the feasibility of a phased array microwave imaging system through a successful campaign on MAST and will be installed on NSTX-U for the next campaign.
- In a future reactor environment a microwave imaging diagnostic such as SAMI is essential:
  - SAMI is resilient to high energy neutron fluxes
  - Antennas can be incorporated into the vessel wall
  - Compact design, doesn't use much wall space

S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)

SAMI overview
Figure captions:
- Above: an image of the array of Vivaldi antennas in a 21 configuration
- Right: the RF electronics mounted on MAST

V.F. Shevchenko et al., J. Inst. 7, P10016 (2012)

SAMI overview
Demanding data acquisition requirements!
- 16 frequency channels
- 14 bit sample depth (dynamic range of plasma during ELMs)
- Sampling at 250 Msamples/s
- For a total of 500 ms (length of MAST shot)
- Data rate of 8 Gbytes/s
- Meaning we have 4 Gbytes of raw data from SAMI per shot
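The quoted data rate and shot size can be checked with a few lines of arithmetic. The sketch below assumes that the 16 channels are digitised in parallel and that each 14-bit sample is stored in a 16-bit (2-byte) word; neither assumption is stated on the slide, but together they reproduce the quoted figures.

```python
n_channels = 16            # from the slide
bytes_per_sample = 2       # assumption: 14-bit samples padded to 16-bit words
sample_rate_hz = 250e6     # 250 Msamples/s
shot_length_s = 0.5        # 500 ms MAST shot

data_rate_bytes_per_s = n_channels * bytes_per_sample * sample_rate_hz
raw_bytes_per_shot = data_rate_bytes_per_s * shot_length_s

print(f"data rate: {data_rate_bytes_per_s / 1e9:.0f} Gbytes/s")     # 8 Gbytes/s
print(f"raw data per shot: {raw_bytes_per_shot / 1e9:.0f} Gbytes")  # 4 Gbytes
```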

Motivation for GPU code
- 4 Gbytes raw data per shot on MAST => 12 Tbyte RAID system plus backup for the M8 and M9 campaigns
- Data ∝ nant and Computation/Resolution ∝ nant(nant − 1)
- The original IDL data analysis code takes ~30 minutes to process the data for 1 shot on an AMD Phenom(tm) II X2 560 processor
- Time between shots on MAST is ~15 minutes => no intershot analysis
- Masses of unanalysed raw data accumulating
- An accelerated GPU data processing code could cycle through the data from previous campaigns in significantly reduced time and, in future campaigns, provide the ability to do intershot analysis
- Aim for real-time data analysis, as a multi-megawatt EBW current drive and heating system will require real-time aiming and interlocking diagnostics
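The scaling and backlog figures on this slide can be made concrete with a short calculation. This is a rough sketch: it assumes decimal units (1 Tbyte = 1e12 bytes) and that the whole 12 Tbyte store holds 4 Gbyte shots, which the slide does not state.

```python
n_ant = 8
ordered_products = n_ant * (n_ant - 1)    # nant(nant - 1) scaling term
unique_pairs = ordered_products // 2      # each antenna pair counted once

storage_bytes = 12e12                     # 12 Tbyte RAID (decimal units assumed)
bytes_per_shot = 4e9
shots_that_fit = storage_bytes / bytes_per_shot

idl_minutes_per_shot = 30
intershot_minutes = 15
backlog_hours = shots_that_fit * idl_minutes_per_shot / 60

print(unique_pairs)                               # 28 antenna pairs
print(shots_that_fit)                             # 3000.0 shots
print(backlog_hours)                              # 1500.0 hours with the IDL code
print(idl_minutes_per_shot / intershot_minutes)   # 2.0 times too slow for intershot use
```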

GPU architecture
Diagram (host and device memory hierarchy):
- CPU (cores 1 to 4) with CPU cache, connected over the main system bus to system memory: size = 64 GB, speed = 40 GB/s
- PCIe bus between host and GPU: 8 GB/s
- GPU with GPU cache and GPU memory: size = 6 GB, speed = 250 GB/s

Key hardware features:
- Massive use of long vector units
- Low clock speed
- Very fast memory
- No advanced instruction processing
- Designed to do massive parallel computations
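To see why the PCIe bus matters, the bandwidths in the diagram can be turned into the time needed to move one 4 Gbyte shot across each link. This is a back-of-envelope sketch that treats the quoted speeds as sustained bandwidths and ignores latency, which is an assumption.

```python
shot_bytes = 4e9  # raw data per shot, from the data acquisition slide

bandwidth_bytes_per_s = {
    "system memory": 40e9,
    "PCIe bus": 8e9,
    "GPU memory": 250e9,
}

seconds = {name: shot_bytes / bw for name, bw in bandwidth_bytes_per_s.items()}

for name, t in seconds.items():
    print(f"{name}: {t * 1e3:.0f} ms per 4 Gbyte shot")
```

With these numbers the PCIe transfer is the slowest step by a wide margin, which is the motivation for overlapping copies with computation.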

SAMI suitability for GPU code
- SAMI acquires nint data points on all 8 antennas simultaneously and has a 160 µs switching period => data structure with shape nint*nant*nf*nsweeps
- nsweeps is determined by the shot length, the switching period and nint
- SIMD scenario => parallelisation by CUDA
- Each CUDA thread is mapped to 1 element of a vector unit
- Full vector unit = 32 consecutive threads = warp
- A warp is processed at once by the hardware
- On the software level, threads are grouped into thread blocks

Diagram: antennas 1 to 8 for each frequency channel nf = 1 ... nf = 16, repeated x nsweeps.
Diagram: two warps accessing memory at byte offsets 0B, 128B, 256B and 384B.
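The mapping from the four-dimensional data structure to threads can be sketched as a flat index calculation. The layout below, with the sample index varying fastest, the 4-byte value size, and the array sizes other than nant = 8 and nf = 16 are assumptions chosen for illustration; the slides do not specify the actual memory order.

```python
WARP_SIZE = 32
BYTES_PER_VALUE = 4  # assumption: one 32-bit value per sample

def flat_index(sweep, freq, ant, sample, n_f, n_ant, n_int):
    """Row-major offset into an array of shape (nsweeps, nf, nant, nint)."""
    return ((sweep * n_f + freq) * n_ant + ant) * n_int + sample

def unflatten(index, n_f, n_ant, n_int):
    """Inverse of flat_index: recover (sweep, freq, ant, sample)."""
    sample = index % n_int
    index //= n_int
    ant = index % n_ant
    index //= n_ant
    freq = index % n_f
    sweep = index // n_f
    return sweep, freq, ant, sample

n_f, n_ant = 16, 8   # from the slides
n_int = 40000        # hypothetical: 160 us at 250 Msamples/s

# One warp: 32 consecutive threads take 32 consecutive samples of one antenna.
warp_indices = [flat_index(2, 5, 3, s, n_f, n_ant, n_int) for s in range(WARP_SIZE)]
span_bytes = (warp_indices[-1] - warp_indices[0] + 1) * BYTES_PER_VALUE
print(span_bytes)  # 128 bytes, one contiguous segment per warp
```

With this layout each warp reads one contiguous 128-byte segment, which matches the spacing of the byte offsets shown in the warp diagram.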

IDL code
Flowchart of the original IDL analysis, as far as it can be recovered from the slide.

Inputs shown: bootconfig.rfctrl.ini, TF test shot, noise data, data file

Routines shown: get_config.pro, read_freq_split.pro, gpu_correlate_model.pro, read_bin_raw.pro, filter.pro, complexify.pro

Calibration files shown: sideband_cal_values_upper.dat, upper_lower_complex.dat, iqphasegradient.dat

Annotations on the flowchart:
- configuration data: specifies which frequencies to read in and the time windows' length and location
- integer data
- voltage data for each selected frequency, for each frequency sweep
- 16 real signals are converted to 8 complex signals for the upper and lower sidebands
- cross-correlations are calculated for each antenna pair, frequency sweep, and upper and lower sideband
- calibration data correcting for phase offsets and balancing amplitudes between I and Q components via matrix inversion
- calibration data correcting for phase differences between antennas due to RF electrical lengths
- calibration data correcting for phase drift between I and Q components
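Two of the steps named above, the I/Q correction "via matrix inversion" and the conversion of real signal pairs into complex signals, can be illustrated together. This is a generic textbook sketch, not the contents of complexify.pro: the imbalance model (a gain `gain` and phase error `phase_err` on the Q channel), the function names and the test values are all assumptions.

```python
import cmath
import math

def iq_correction_matrix(gain, phase_err):
    """Inverse of M = [[1, 0], [g*sin(e), g*cos(e)]], which maps ideal (I, Q)
    to measured (I, Q) when Q has relative gain g and phase error e."""
    return ((1.0, 0.0),
            (-math.tan(phase_err), 1.0 / (gain * math.cos(phase_err))))

def complexify(i_samples, q_samples, matrix):
    """Apply the 2x2 correction and combine each real I/Q pair into one complex sample."""
    (a, b), (c, d) = matrix
    return [complex(a * i + b * q, c * i + d * q)
            for i, q in zip(i_samples, q_samples)]

# Hypothetical imbalance: Q is 10% too large and 0.2 rad out of quadrature.
gain, phase_err = 1.1, 0.2
phases = [0.1 * n for n in range(50)]
i_measured = [math.cos(p) for p in phases]
q_measured = [gain * math.sin(p + phase_err) for p in phases]

corrected = complexify(i_measured, q_measured, iq_correction_matrix(gain, phase_err))
worst_error = max(abs(z - cmath.exp(1j * p)) for z, p in zip(corrected, phases))
print(f"largest deviation from the ideal phasor: {worst_error:.1e}")
```

Applied to 8 antennas, the same step turns 16 real signals (one I and one Q per antenna) into 8 complex signals.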

GPU code
Flowchart:
- read_bin_raw_gpu.cu
- copy to GPU
- data conditioning
- forward CUFFT
- sideband suppression
- IQ correction
- backward CUFFT
- IQ_filter
- forward CUFFT
- filter
- backward CUFFT
- RF phase calibration
- calculate cross correlations
- copy from GPU
- results available on the host

Notes:
- Wrote 14 CUDA kernels and made use of the CUFFT library
- Limited memory available on the GPU => can't copy all the data to the GPU and process it at once
- Need to carve the problem up and exploit CUDA streams and concurrency
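Carving the problem up works for the cross-correlation step because the time average is a sum, so partial sums from each chunk can simply be accumulated. The sketch below shows this on the CPU with plain Python; the chunking along the time axis, the function names and the synthetic signals are assumptions for illustration, not the structure of the actual CUDA kernels.

```python
import cmath

def accumulate_products(chunk, sums):
    """Add sum_t V_i(t) * conj(V_j(t)) for this chunk to the running totals."""
    n_ant = len(chunk)
    for i in range(n_ant):
        for j in range(i + 1, n_ant):
            partial = sum(a * b.conjugate() for a, b in zip(chunk[i], chunk[j]))
            sums[(i, j)] = sums.get((i, j), 0j) + partial
    return sums

def cross_correlate_in_chunks(voltages, chunk_len):
    """Time-averaged cross-correlations, computed one time chunk at a time."""
    n_t = len(voltages[0])
    sums = {}
    for start in range(0, n_t, chunk_len):
        chunk = [v[start:start + chunk_len] for v in voltages]
        accumulate_products(chunk, sums)
    return {pair: total / n_t for pair, total in sums.items()}

# Synthetic complex signals for 3 hypothetical antennas.
n_t = 100
voltages = [[cmath.exp(1j * (0.3 * t + 0.7 * k)) + 0.1 * cmath.exp(-0.11j * t * (k + 1))
             for t in range(n_t)]
            for k in range(3)]

whole = cross_correlate_in_chunks(voltages, chunk_len=n_t)  # everything at once
chunked = cross_correlate_in_chunks(voltages, chunk_len=32) # 4 chunks, last one short
max_diff = max(abs(whole[p] - chunked[p]) for p in whole)
print(f"largest difference between whole and chunked results: {max_diff:.1e}")
```

The chunked result agrees with the single-pass result to rounding error, so the chunk size can be chosen to fit the available GPU memory.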

CUDA streams and concurrency
- Exploit concurrency: overlap the copy to the GPU with kernel execution on the GPU
- CUDA exposes concurrency through streams; a stream is a sequence of commands that execute in order

Diagram (single stream): copy up, kernel and copy down for items 1 to 6 all run one after another in Stream 1.
Diagram (three streams): Stream 1 handles items 1 and 4, Stream 2 handles items 2 and 5, and Stream 3 handles items 3 and 6, each as copy up, kernel, copy down.

CUDA streams and concurrency
GPUs support the following forms of concurrency:
- Overlapping copies to or from the device with kernel execution
- Executing more than one kernel at the same time
- Overlapping copies to the GPU with copies from the GPU

Diagram (time axis): Stream 1 runs copy up, kernel and copy down for items 1 and 4, Stream 2 for items 2 and 5, and Stream 3 for items 3 and 6.
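The benefit of splitting the work across streams can be estimated with a small scheduling model. This is a simplified sketch rather than a measurement: it assumes one copy engine per direction, one kernel running at a time, items assigned to streams round-robin, and hypothetical stage durations of 1, 2 and 1 time units for copy up, kernel and copy down.

```python
def schedule_time(n_items, durations, n_streams):
    """Total time to process n_items, each passing through the stages in order.

    Each stage (copy up, kernel, copy down) is a resource that handles one item
    at a time, and each stream executes its own items strictly in order.
    """
    stage_free = [0.0] * len(durations)   # when each stage's engine is next free
    stream_free = [0.0] * n_streams       # when each stream finished its last item
    for item in range(n_items):
        stream = item % n_streams
        ready = stream_free[stream]
        for stage, duration in enumerate(durations):
            start = max(ready, stage_free[stage])
            ready = start + duration
            stage_free[stage] = ready
        stream_free[stream] = ready
    return max(stream_free)

durations = (1.0, 2.0, 1.0)  # hypothetical: copy up, kernel, copy down
one_stream = schedule_time(6, durations, n_streams=1)
three_streams = schedule_time(6, durations, n_streams=3)
print(one_stream)     # 24.0: everything runs back to back
print(three_streams)  # 14.0: copies overlap with kernel execution
```

In this model the three-stream schedule is limited by the kernel stage, so adding further streams would not shorten it; the real gain depends on the actual copy and kernel times.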

Acceleration results
Code development on a machine with:
- Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
- Tesla K40C, 12 Gbytes GDDR5

Data for a few shots on the development machine to check correctness against the IDL code:

                      IDL       C        CUDA
Total time (s)        1038.38   464.55   17.42
Total time (mins:s)   17:18     7:44     0:17
Speed-up                        2.24x    59.61x (26.67x)

Acquired a dedicated GPU card for SAMI:
- GeForce GTX770, 4 Gbytes GDDR5
- Cycle through 1837 shots in 30 hours => averaging 58 seconds per shot
- Most of this increase is due to CPU time and reading from hard disk
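The speed-up figures follow directly from the timings in the table, as the short check below shows. It reads the bracketed 26.67x as the CUDA speed-up relative to the C version, which is an interpretation that the arithmetic supports.

```python
times_s = {"IDL": 1038.38, "C": 464.55, "CUDA": 17.42}

def to_min_sec(seconds):
    """Format a duration as minutes:seconds, dropping the fractional part."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"

speedup_c_vs_idl = times_s["IDL"] / times_s["C"]
speedup_cuda_vs_idl = times_s["IDL"] / times_s["CUDA"]
speedup_cuda_vs_c = times_s["C"] / times_s["CUDA"]

# Batch run on the dedicated card: 1837 shots in 30 hours.
batch_seconds_per_shot = 30 * 3600 / 1837

print({name: to_min_sec(t) for name, t in times_s.items()})
print(f"C vs IDL: {speedup_c_vs_idl:.2f}x")
print(f"CUDA vs IDL: {speedup_cuda_vs_idl:.2f}x, CUDA vs C: {speedup_cuda_vs_c:.2f}x")
print(f"batch run: {batch_seconds_per_shot:.1f} s per shot")
```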

Summary and future work
- Successfully achieved acceleration of the SAMI data analysis code to enable the processing of 12 Tbytes of raw data from previous MAST campaigns
- Ability to compare cross-correlation data from many shots
- Enables inter-shot analysis in future campaigns (NSTX-U, MAST-U)
- Reduce the run time of the code, aiming for real time (how the code accesses raw data/data shape, FPGA/GPU communication [1])
- Demonstrate the benefit of a multi-GPU system

[1] R. Bittner et al., Cluster Comput., DOI 10.1007/s10586-013-0280-9