GPU-based data analysis for Synthetic Aperture Microwave Imaging
1st IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 1st-3rd June 2015
J.C. Chorley (1), K.J. Brunner (1), N.A. Dipper (1), S.J. Freethy (4), R.M. Sharples (1), V.F. Shevchenko (3), D.A. Thomas (2), R.G.L. Vann (2)
(1) Durham University, (2) University of York, (3) Culham Centre for Fusion Energy, (4) Max-Planck-Institut für Plasmaphysik
This work is funded by Durham University and EPSRC grant EP/K504178/1
Talk outline
- SAMI overview
- Motivation for GPU acceleration
- GPU code and techniques
- Acceleration results
- Summary and future work
SAMI overview
- SAMI is the Synthetic Aperture Microwave Imaging diagnostic, which reconstructs 2D thermal images of the plasma
- SAMI is a phased array: the phase on each antenna is determined by the array geometry and the polarisation
- If the antennas do not have perfectly aligned polarisations, there is an additional phase difference between the antennas
- The image is then formed from the sum of the phased products of the antenna cross-correlations
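Schematically, the image formation can be written as a phased sum over antenna-pair cross-correlations. The expression below is only an illustrative sketch in my own notation; the exact phase terms, including the polarisation correction mentioned above, follow the SAMI papers cited on the next slides:

\[
  I(\hat{n}) \;\propto\; \sum_{j \neq k} \big\langle V_j V_k^{*} \big\rangle \,
  e^{\, i k_0 \,\hat{n} \cdot (\mathbf{x}_j - \mathbf{x}_k)}
\]

where V_j is the complex signal on antenna j at position x_j, k_0 is the vacuum wavenumber at the observed frequency, and \hat{n} is the look direction.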
SAMI overview
The optimised SAMI design, satisfying the bandwidth and space requirements, consists of 8 antennas [1] [2]
[1] S.J. Freethy et al., IEEE Transactions on Antennas and Propagation 60, 5442 (2012)
[2] S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)
SAMI overview
[Figure: example image, MAST shot 27022]
- SAMI is the first diagnostic of its kind: 2D maps of the Electron Bernstein Emission process and mode-conversion windows, useful for RF heating and current drive
- SAMI has demonstrated the feasibility of a phased-array microwave imaging system through a successful campaign on MAST, and will be installed on NSTX-U for the next campaign
- In a future reactor environment a microwave imaging diagnostic such as SAMI is essential:
  - SAMI is resilient to high-energy neutron fluxes
  - Antennas can be incorporated into the vessel wall
  - Compact design, doesn't use much wall space
S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)
SAMI overview
[Figure, above: an image of the array of Vivaldi antennas in a 21 configuration]
[Figure, right: the RF electronics mounted on MAST]
V.F. Shevchenko et al., J. Instrum. 7, P10016 (2012)
SAMI overview
Demanding data acquisition requirements!
- 16 frequency channels
- 14-bit sample depth (dynamic range of the plasma during ELMs)
- Sampling at 250 Msamples/s
- For a total of 500 ms (the length of a MAST shot)
- Data rate of 8 Gbytes/s
- Meaning we have 4 Gbytes of raw data from SAMI per shot
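As a quick consistency check of these figures (assuming each 14-bit sample is stored in a 2-byte word):

\[
  16 \;\text{channels} \times 2\;\text{bytes} \times 250\times10^{6}\;\text{samples/s}
  = 8\;\text{Gbytes/s},
  \qquad
  8\;\text{Gbytes/s} \times 0.5\;\text{s} = 4\;\text{Gbytes per shot.}
\]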
Motivation for GPU code
- 4 Gbytes of raw data per shot on MAST => 12 Tbyte RAID system plus backup for the M8 and M9 campaigns
- Data volume scales as n_ant, while computation/resolution scales as n_ant(n_ant - 1)
- The original IDL data analysis code takes ~30 minutes to process the data for 1 shot on an AMD Phenom(tm) II X2 560 processor
- The time between shots on MAST is ~15 minutes => no intershot analysis
- Masses of unanalysed raw data accumulating
- An accelerated GPU data processing code could cycle through the data from previous campaigns in significantly reduced time, and in future campaigns provide the ability to do intershot analysis
- Aim for real-time data analysis, as a multi-megawatt EBW current drive and heating system will require real-time aiming and interlocking diagnostics
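For example, with the 8-antenna array this pair-wise scaling works out as (counting ordered pairs as in the expression above, of which half are independent):

\[
  n_{\text{ant}}(n_{\text{ant}}-1) = 8 \times 7 = 56 \;\text{ordered pairs},
  \qquad
  \tfrac{1}{2}\,n_{\text{ant}}(n_{\text{ant}}-1) = 28 \;\text{unique antenna pairs}
\]

per frequency channel and sideband.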
GPU architecture
[Block diagram: CPU cores and CPU cache connect over the main system bus to system memory (size = 64 GB, speed = 40 GB/s); the PCIe bus (8 GB/s) links the host to the GPU, with GPU cache and GPU memory (size = 6 GB, speed = 250 GB/s)]
Key hardware features:
- Massive use of long vector units
- Low clock speed
- Very fast memory
- No advanced instruction processing
- Designed to do massive parallel computations
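A minimal CUDA sketch of the host-to-device traffic implied by the diagram (buffer size and names are hypothetical): it illustrates why the 8 GB/s PCIe link, rather than the 250 GB/s device memory, tends to set the pace for data-heavy codes like this one.

    // Minimal sketch: move a block of raw samples across the PCIe bus and back.
    // The 256 MB chunk size is illustrative; pinned (page-locked) host memory is
    // used because it is also needed later for asynchronous, overlapped copies.
    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t nBytes = 256UL << 20;          // 256 MB chunk (illustrative)
        short *h_raw = NULL, *d_raw = NULL;

        cudaMallocHost((void **)&h_raw, nBytes);    // pinned host buffer
        cudaMalloc((void **)&d_raw, nBytes);        // device buffer in GPU memory

        cudaMemcpy(d_raw, h_raw, nBytes, cudaMemcpyHostToDevice);   // over PCIe (~8 GB/s)
        // ... kernels then read/write d_raw from GPU memory (~250 GB/s) ...
        cudaMemcpy(h_raw, d_raw, nBytes, cudaMemcpyDeviceToHost);

        cudaFree(d_raw);
        cudaFreeHost(h_raw);
        return 0;
    }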
SAMI suitability for GPU code
- SAMI acquires nint data points on all 8 antennas simultaneously and has a 160 µs switching period => data structure with shape nint x nant x nf x nsweeps, where nsweeps = shot length / switching period
- SIMD scenario => parallelisation with CUDA
[Diagram: the 8 antenna signals for nf = 1 ... nf = 16, repeated nsweeps times, laid out in 128-byte memory segments (0B, 128B, 256B, 384B), with each warp reading consecutive elements]
- Each CUDA thread is mapped to 1 element of a vector unit
- A full vector unit = 32 consecutive threads = a warp
- A warp is processed at once by the hardware
- At the software level, threads are grouped into thread blocks
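A minimal sketch of this thread-to-element mapping (the array layout, kernel name and operation are hypothetical, not the actual SAMI kernels): each thread handles one sample of the flat nint x nant x nf x nsweeps array, and consecutive threads in a warp touch consecutive addresses, which is what makes the layout in the diagram coalesce well.

    // Hypothetical one-thread-per-sample kernel over the flattened data array.
    __global__ void scale_samples(float *data, size_t n_total, float gain)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n_total)          // guard: the last block may overshoot n_total
            data[i] *= gain;      // consecutive threads -> consecutive addresses (coalesced)
    }

    // Launch example: 256 threads per block, enough blocks to cover every element.
    // size_t n_total = (size_t)nint * nant * nf * nsweeps;
    // scale_samples<<<(n_total + 255) / 256, 256>>>(d_data, n_total, 1.0f);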
IDL code
[Flowchart of the original IDL analysis chain; main elements:]
- bootconfig.rfctrl.ini / get_config.pro: configuration data specifying which frequencies to read in, and the length and location of the time windows
- Inputs: data file, TF test shot noise data
- read_bin_raw.pro: reads the raw integer data
- read_freq_split.pro: voltage data for each selected frequency, for each frequency sweep
- filter.pro
- complexify.pro: 16 real signals are converted to 8 complex signals for the upper and lower sidebands
- sideband_cal_values_upper.dat, upper_lower_complex.dat: calibration data correcting for phase offsets and balancing amplitudes between the I and Q components via matrix inversion
- iqphasegradient.dat: calibration data correcting for phase drift between the I and Q components
- Calibration data correcting for phase differences between antennas due to RF electrical lengths
- gpu_correlate_model.pro: cross-correlations calculated for each antenna pair, frequency sweep, and upper and lower sidebands
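Schematically (in my own notation, not taken from the IDL source, and up to a sign convention for the sidebands), the complexification and cross-correlation steps amount to:

\[
  V_j(t) = I_j(t) + i\,Q_j(t), \qquad
  C_{jk}(f) = \big\langle \tilde{V}_j(f)\, \tilde{V}_k(f)^{*} \big\rangle
\]

where the positive- and negative-frequency halves of \tilde{V}_j(f) give the upper and lower sidebands, and C_{jk} is computed for every antenna pair (j, k), frequency sweep and sideband.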
GPU code
read_bin_raw_gpu.cu pipeline: copy to GPU -> data conditioning -> forward CUFFT -> sideband suppression -> IQ correction -> backward CUFFT -> IQ_filter -> forward CUFFT -> filter -> backward CUFFT -> RF phase calibration -> calculate cross-correlations -> copy from GPU -> results available on the host
- Wrote 14 CUDA kernels and made use of the CUFFT library
- Limited memory available on the GPU => can't copy all the data to the GPU and process it at once
- Need to carve the problem up and exploit CUDA streams and concurrency
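A minimal sketch of the forward-transform / filter / inverse-transform pattern that appears several times in this pipeline, using batched cuFFT plans; the sizes, function names and the crude spectral mask are placeholders, not the real SAMI code.

    // Illustrative cuFFT round trip: forward FFT, in-place spectral filtering,
    // inverse FFT, batched over many sweeps.
    #include <cufft.h>
    #include <cuda_runtime.h>

    __global__ void suppress_negative_freqs(cufftComplex *spec, int nfft, int nbatch)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nfft * nbatch) return;
        int bin = i % nfft;
        if (bin > nfft / 2) {             // crude one-sideband mask (illustrative only)
            spec[i].x = 0.0f;
            spec[i].y = 0.0f;
        }
    }

    void fft_filter(cufftComplex *d_sig, int nfft, int nbatch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, nfft, CUFFT_C2C, nbatch);

        cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD);          // to frequency domain
        int n = nfft * nbatch;
        suppress_negative_freqs<<<(n + 255) / 256, 256>>>(d_sig, nfft, nbatch);
        cufftExecC2C(plan, d_sig, d_sig, CUFFT_INVERSE);          // back to time domain
        // Note: cuFFT's inverse transform is unnormalised; divide by nfft if needed.

        cufftDestroy(plan);
    }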
CUDA streams and concurrency
- Exploit concurrency: overlap copies to the GPU with kernel execution on the GPU
- CUDA exposes concurrency through streams: a stream is a sequence of commands that execute in order
[Timeline diagram: with a single stream, each chunk's copy up, kernel and copy down run strictly one after another; with three streams, chunks 1-6 are distributed across Stream 1, Stream 2 and Stream 3 so that one chunk's copies overlap with another chunk's kernel]
CUDA streams and concurrency
GPUs support the following forms of concurrency:
- Overlapping copies to or from the device with kernel execution
- Executing more than one kernel at the same time
- Overlapping copies to the GPU with copies from the GPU
[Timeline diagram: chunks 1-6 pipelined across Stream 1, Stream 2 and Stream 3, with the copy up, kernel and copy down of different chunks overlapping in time]
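A minimal sketch of the chunked, multi-stream pattern described above (chunk sizes, kernel and buffer names are hypothetical): each chunk's host-to-device copy, kernel launch and device-to-host copy are issued into one of a small pool of streams, so the transfers for one chunk overlap with the kernel of another. The host buffer must be pinned (cudaMallocHost) for cudaMemcpyAsync to actually overlap.

    // Illustrative multi-stream pipeline over nChunks chunks of samples.
    #include <cuda_runtime.h>

    #define NSTREAMS 3

    __global__ void process_chunk(float *d_buf, size_t n)   // hypothetical per-chunk kernel
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) d_buf[i] *= 2.0f;                         // stand-in for the real analysis
    }

    void run_pipeline(float *h_data /* pinned */, size_t chunkElems, int nChunks)
    {
        cudaStream_t stream[NSTREAMS];
        float *d_buf[NSTREAMS];
        size_t chunkBytes = chunkElems * sizeof(float);

        for (int s = 0; s < NSTREAMS; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&d_buf[s], chunkBytes);      // one device buffer per stream
        }

        for (int c = 0; c < nChunks; ++c) {
            int s = c % NSTREAMS;                            // round-robin over streams
            float *h_chunk = h_data + (size_t)c * chunkElems;

            cudaMemcpyAsync(d_buf[s], h_chunk, chunkBytes,
                            cudaMemcpyHostToDevice, stream[s]);          // copy up
            process_chunk<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunkElems);
            cudaMemcpyAsync(h_chunk, d_buf[s], chunkBytes,
                            cudaMemcpyDeviceToHost, stream[s]);          // copy down
        }

        for (int s = 0; s < NSTREAMS; ++s) {
            cudaStreamSynchronize(stream[s]);                // wait for all issued work
            cudaFree(d_buf[s]);
            cudaStreamDestroy(stream[s]);
        }
    }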
Acceleration results
Code development on a machine with:
- Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
- Tesla K40C, 12 Gbytes GDDR5
Data for a few shots were processed on the development machine to check correctness against the IDL code:

                          IDL        C          CUDA
    Total time (s)        1038.38    464.55     17.42
    Total time (min:s)    17:18      7:44       0:17
    Speed-up              -          2.24x      59.61x (26.67x vs C)

- Acquired a dedicated GPU card for SAMI: GeForce GTX770, 4 Gbytes GDDR5
- Cycled through 1837 shots in 30 hours => averaging 58 seconds per shot
- Most of this increase (relative to the development machine) is due to CPU time and reading from the hard disk
Summary and future work
- Successfully achieved acceleration of the SAMI data analysis code, enabling the processing of 12 Tbytes of raw data from previous MAST campaigns
- Ability to compare cross-correlation data from many shots
- Enables inter-shot analysis in future campaigns (NSTX-U, MAST-U)
- Reduce the run time of the code further, aiming for real time (how the code accesses the raw data / data shape, FPGA/GPU communication [1])
- Demonstrate the benefit of a multi-GPU system
[1] R. Bittner et al., Cluster Comput., DOI 10.1007/s10586-013-0280-9