GPU based imager for radio astronomy


GPU based imager for radio astronomy
GTC2014, San Jose, March 27th 2014
S. Bhatnagar (National Radio Astronomy Observatory, NM, USA), P. K. Gupta (NVIDIA-India, Pune), M. Clark (NVIDIA-US, CA)

Introduction
Sanjay Bhatnagar, Algorithms R&D Scientist at the National Radio Astronomy Observatory, Socorro, New Mexico.
Motivation: deploy all compute-intensive imaging operations on the GPU. Why? How? Listen on...

Overview of the talk
- Introduction to NRAO: what is it, and why is it?
- Quick introduction to the scientific projects that pose the HPC problem
- Overview of radio-astronomy (RA) imaging: what is needed for imaging with current and future telescopes
- Details of the current hot-spots, motivating the three Proof-of-Concept (PoC) projects
- Progress so far, and future plans

National Radio Astronomy Observatory
- An NSF-funded national observatory, created to build and operate large radio facilities
- Operates some of the largest ground-based radio telescopes: EVLA, ALMA, VLBA, GBT
- Central Development Lab (CDL): digital correlators for the EVLA, ALMA and VLBA; digital back-end for the GBT
- Off-line software for calibration, imaging and astronomical computing: open source; the most widely used RA imaging software world-wide; runs on laptops, desktops and compute clusters
- Now exploring GPUs to mitigate compute and memory-footprint hot-spots

The Very Large Array (NM, USA)
- 27 antennas, movable on rails, spread over a 1-27 km radius
- Size of the "lens": ~30 km
- Frequency range: 300 MHz - 50 GHz

Atacama Large MM Array (ALMA), Chile
- In partnership with the EU & Japan
- At an altitude of 16,500 ft
- 50 antennas; effective size of the lens: 3 km
- Frequency range: 100 GHz - 950 GHz
- Re-configurable, with antennas movable on a special transporter

Very Long Baseline Array (VLBA), US
- 10 antennas, spread across the US
- Size of the lens: a few 1000 km
- Frequency range: 300 MHz - 90 GHz
- Angular resolution: milli-arcsec

Other RA observatories around the world: WSRT, VLA, CARMA, GMRT, LOFAR, LWA, ATCA, MeerKAT, ASKAP, ALMA, PAPER, MWA, and the SKA (to be built). Imaging with HPC + Big Data: these, combined, pose lots of the problems.

Interferometric Imaging: Big Data
- Uses the technique of Aperture Synthesis to synthesize a lens of size equal to the maximum separation between the antennas
- Not a direct-imaging telescope: the data are in the Fourier domain, and image reconstruction uses iterative algorithms
- Data volume with existing telescopes: 10-200 TB; with the SKA telescope: exabytes
- Effective data I/O for image reconstruction: 10x the raw volume

Interferometric Imaging: HPC
- Sensitivity improvements of 10x or more in modern telescopes: what was an ignorable, too-weak effect earlier now limits the imaging performance of modern telescopes
- Need more compute-intensive imaging algorithms: Tera/Peta FLOPS now, Exa FLOPS for the SKA (soon...ish)
- Orders of magnitude more expensive algorithms, applied to many orders of magnitude more data
- Examples: the imaging algorithms in CASA for the EVLA and ALMA; the aw-imager for LOFAR (NL), a modified version of the CASA imager; the ASKAP imager (AU), optimized for large cluster computing

Interferometric Imaging: bottom line for computing
- Tera- to Exa-scale (SKA) computing, using Tera- to Exa-bytes (SKA) of data, to make giga-pixel images
- Full end-to-end processing will require a cluster

Computing hot spots
- Gridding / de-gridding (PoC-1): resampling irregularly sampled data onto a regular grid for the FFT
- Computing Convolution Functions (CFs) (PoC-2): computing the convolution kernels for gridding; pre-compute-and-cache vs. on-demand computing; memory-footprint issues
- Wide-band image reconstruction (PoC-3): requires convolutions of many large images; pre-compute-and-cache vs. on-demand computing on the GPU

Scientific projects: Deep Imaging
- Large-area surveys of the radio sky down to ~10^-6 Jy: requires high-resolution, high dynamic-range imaging, i.e. Big Data plus compute-intensive algorithms
- Dynamic range (ratio of strongest to weakest source): 10^6; dynamic range of raw images: 10^2 - 10^3
- Need high resolution to reduce the background sky-noise ("confusion noise")

Scientific projects: Fast Imaging (the transient sky)
- A short blip in time: a spike of milliseconds in 10s of hours of data (Thornton et al., Science, 2013)
- Storing data at high time resolution for later processing is not an option: the data rate is too high to be recorded at ms resolution
- Need fast (near real-time) imaging as a trigger to record short bursts of data
- Need interferometric imaging to localize the event on the sky

Aperture Synthesis Imaging: Why?
- Single dish: resolution too low for many scientific investigations; limited collecting area + resolution limits sensitivity at low frequencies
- Single-dish resolving power ~ wavelength / dish diameter
- Biggest steerable single dish: 100 m

Aperture Synthesis Imaging: Why?
- Synthesis array: resolving power ~ wavelength / maximum separation between antennas
- Maximum separation in the VLA: 35 km, giving ~350x better resolution than a 100 m single dish
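In equation form (the standard diffraction-limit estimate behind the numbers on these two slides):

    \theta_{dish} \simeq \frac{\lambda}{D}, \qquad
    \theta_{array} \simeq \frac{\lambda}{B_{max}}, \qquad
    \frac{\theta_{dish}}{\theta_{array}} = \frac{B_{max}}{D}
      = \frac{35\,\mathrm{km}}{100\,\mathrm{m}} = 350.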

Aperture Synthesis Imaging: How?
- Aperture Synthesis, or the Fourier Synthesis technique: an interferometric imaging technique (Nobel Prize in '74)
- Many antennas, separated by 10s to 100s of km
- Each pair of antennas measures one Fourier component of the sky in the Fourier plane (the data plane, or UV-plane); each additional pair adds another component
- The synthesized aperture is equal to the largest separation between antennas

Aperture Synthesis Imaging: How?
- All pairs formed with one antenna measure N-1 Fourier components: 26 for the VLA's 27 antennas

Aperture Synthesis Imaging: How?
- All pairs among all antennas measure N(N-1)/2 Fourier components = 351
- These products are formed by the correlator: a digital back-end built from massively parallel hardware

Aperture Synthesis Imaging: How?
- Use Earth Rotation Synthesis to fill the Fourier plane
- N(N-1)/2 x 2 Fourier components measured over 2 integration times = 702

Aperture Synthesis Imaging: How?
- Use Earth Rotation Synthesis to fill the Fourier plane
- N(N-1)/2 x 10 Fourier components measured over 10 integrations = 7020

Aperture Synthesis Imaging: How?
- Fourier components measured over a 10-hour observation: O(10^11 - 10^12)
- Data size: 10s to 100s of TB now; up to exabytes for SKA-class telescopes
- The data are not on a regular grid
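A rough count behind the O(10^11 - 10^12) figure, assuming (for illustration only) 1-second integrations and ~10^3 frequency channels:

    N_{vis} \sim \frac{N(N-1)}{2} \times \frac{10\,\mathrm{hr}}{1\,\mathrm{s}} \times N_{chan}
      \approx 351 \times 3.6\times10^{4} \times 10^{3}
      \approx 1.3\times10^{10},

which, with several polarization products and finer time/frequency sampling, reaches the quoted 10^11 - 10^12.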

Interferometric Imaging
- Aperture Synthesis imaging is indirect imaging: the data are in the Fourier domain, and incomplete sampling produces artifacts in the image
- [Diagram: the sky, convolved (*) with the telescope's Point Spread Function, gives the raw ("dirty") image; the data (visibilities) and the image are related by the Fourier transform, as are the telescope transfer function and the PSF]

Interferometric Imaging
- The raw image (the FT of the raw data) is dynamic-range limited: raw-image dynamic range ~1:1,000; reconstructed-image dynamic range > 1:1,000,000
- Processing: remove the telescope artifacts to reconstruct the sky brightness
- Image reconstruction is a High-Performance-Computing-using-Big-Data problem

Interferometric Imaging
- Image reconstruction is an ill-posed inverse problem: D = A I_true, where D is the raw data, A the measurement matrix, and I_true the true sky-brightness distribution
- The task is to recover I_true given D; A is singular, so I_true = A^-1 D is not available ==> non-linear (iterative) algorithms are required to reconstruct the sky-brightness distribution
- Typically ~10 iterations, using ALL the data in each iteration: raw data 10-100 TB, effective data volume 100-1000 TB
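Written out in the usual formulation (n is the noise; A^\dagger denotes the adjoint):

    D = A\,I_{true} + n, \qquad
    I_{dirty} = A^{\dagger} D = \mathrm{PSF} * I_{true} + A^{\dagger} n,

i.e. the dirty image is the true sky convolved with the PSF, and because A is singular the deconvolution must proceed iteratively.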

The Computing Problem
Basic computing steps:
1. Use the FFT to transform to the image domain: gridding + FFT
2. Image-plane deconvolution of the PSF: search-and-subtract on images
3. Inverse transform back to the data domain: de-gridding + inverse FFT
[Diagram: data domain -> resample onto a regular grid -> FFT -> image domain -> image deconvolution (iterative in nature, uses all the data) -> FFT^-1 -> resample from the regular grid back to the irregularly sampled data]

The Computing Problem
The same steps as a pipeline: [Diagram: data -> gridding/de-gridding (PoC-1), fed by supplied convolution functions (PoC-2) -> FFT -> image reconstruction (PoC-3: convolutions of large images)]

Computing architecture
- Make images on the GPU: use the GPU as a gridding + FFT server; the CPU host does the image reconstruction
- [Diagram: on the GPU, convolution functions + data -> gridding/de-gridding -> FFT; on the host, image reconstruction (convolutions of large images)]

Computing architecture
- Make images on the GPU: use the GPU as a gridding + FFT server; the CPU host does the image reconstruction, with the GPU additionally serving as an image-convolution server

Computing architecture: fast imaging
- Make 100s of images in milliseconds on the GPU, and search for the peak on the GPU
- If a peak is found, send a trigger to the host to save the data buffer to disk
- [Pipeline: data -> gridding (GPU) -> FFT -> peak search -> trigger data storage]

The Computing Problem: Why Gridding?
- To transform to the image domain we use the FFT, but the raw data are not on a regular grid: the measured (u,v) points fall between the points of the regular grid
- This requires re-sampling ALL the data onto a regular grid, using 2D convolutional resampling (2D interpolation)
- [Figure: raw data -> re-sampled onto the grid -> FFT -> raw image]

Gridding: How?
- Gridding/de-gridding is 2D interpolation via convolutional resampling
- Each single datum is multiplied by an N x N 2D convolution function (a 2D weighting function) centred on its (u,v) position: N x N complex multiplies followed by N x N complex additions onto the grid, repeated for every datum
- Cost: 10^2 - 10^5 FLOP per datum
- Massively parallel hardware should help: PoC-1
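Counting a complex multiply-add as 8 FLOP, the per-datum cost is

    C_{datum} = 8\,N^{2}\ \mathrm{FLOP}: \quad
    N = 10 \Rightarrow 8\times10^{2}, \qquad
    N = 100 \Rightarrow 8\times10^{4},

which is the 10^2 - 10^5 FLOP range quoted above; multiplied by the 10^10 - 10^12 visibilities of a modern observation, this reaches the 100s-of-TFLOP to ~PFLOP totals given on the back-up slides.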

Status-3: Gridding, PoC-1 (on-going)
- Divide the grid into sub-grids, each with multiple pixels; map each sub-grid to a CUDA block of threads, one thread per sub-grid pixel
- For each datum D_i:
    Calculate the range of the CF centred on D_i
    If this block is in range:
        For all threads in range: local_grid_j += D_i * CF_(i-j)
    Write local_grid to GMEM

Status-3: Gridding, PoC-1 (on-going)
- With sub-grids/blocks of 10x10 threads, the same loop runs over datum 1, datum 2, datum 3, ...: each block computes the range of the CF centred on the datum, accumulates local_grid_j += D_i * CF_(i-j) for the threads in range, and writes local_grid to GMEM

Status-3: Gridding, PoC-1 (on-going)
- No thread contention, and no atomic operations required
- But the current implementation is limited by global-memory accesses
- Solution: reduce GMEM access: coarse-grid / sort the raw data; copy the data required by each block to SMEM; cache-coherent access; re-arrange the GMEM buffers
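To make the scheme on the last few slides concrete, here is a minimal CUDA sketch: illustrative only, not NRAO's production kernel. The struct, kernel name and launch parameters are hypothetical; the CF is a single, non-oversampled kernel; and visibilities are assumed already scaled to grid coordinates.

    #include <cuComplex.h>

    #define TILE 10        // sub-grid (one CUDA block) is TILE x TILE pixels
    #define CF_HALF 5      // convolution-function support half-width

    struct Vis { float u, v; cuFloatComplex d; };  // grid coords + datum

    __global__ void gridKernel(const Vis* vis, int nvis,
                               const cuFloatComplex* cf, int cfDim,
                               cuFloatComplex* grid, int gridSize)
    {
        // The one grid pixel owned by this thread.
        int px = blockIdx.x * TILE + threadIdx.x;
        int py = blockIdx.y * TILE + threadIdx.y;
        if (px >= gridSize || py >= gridSize) return;

        cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
        for (int i = 0; i < nvis; ++i) {          // every block scans all data
            int dx = px - (int)roundf(vis[i].u);  // offset from the CF centre
            int dy = py - (int)roundf(vis[i].v);
            if (abs(dx) <= CF_HALF && abs(dy) <= CF_HALF) {
                cuFloatComplex w = cf[(dy + CF_HALF) * cfDim + (dx + CF_HALF)];
                acc = cuCaddf(acc, cuCmulf(vis[i].d, w));
            }
        }
        // One owner per pixel: the accumulation needs no atomics.
        grid[py * gridSize + px] = cuCaddf(grid[py * gridSize + px], acc);
    }

Because each pixel has exactly one owning thread there is no contention, but every block reads every visibility from global memory, which is exactly the GMEM bottleneck noted above; staging sorted visibility chunks into shared memory is the natural fix.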

Convolution Functions (CFs)
- CFs encode the physics of the measurement process:
    Prolate spheroidal: the anti-aliasing operator
    W-term: accounts for the Fresnel propagation term (Cornwell et al. 2008)
    A-term: accounts for the antenna optics (Bhatnagar et al. 2008)
- Final function: the convolution of all three, PS * W-term * A-term
- N x N ranges from 10x10 to a few x 100x100
- In use in CASA (NRAO), the aw-imager for LOFAR (NL) and the ASKAP imager (AU)

Compute CFs on the GPU: PoC-2
- CFs are kept as tabulated functions in computer RAM; quantization errors are minimized by over-sampling, typically 10x to 100x
- The memory footprint becomes prohibitive: total memory = 10^3 - 10^8 bytes per CF x 1000s of CFs = 10s to 100s of GB
- PoC-2: can we compute the CFs on-the-fly (as opposed to compute-and-cache)? Compute + multiply + FFT

Status-1: CF computations, PoC-2 (negligible I/O, mostly computing)
- At each time step T0, T1, ...: the analytically computed W-terms (100s to 1000s of compute operations per pixel) are combined with the A-term (which has no analytical expression), followed by an FFT; each resulting CF set is cached and re-used for 100s of gridding cycles
- Over an observation, 10s to 100s of such CF sets are computed

Status-1: Compute CFs on the GPU, PoC-2 (negligible I/O, mostly computing)
- GPU: pre-compute the A-term and cache it in GPU GMEM; compute the W-term on-the-fly, one thread per pixel; multiply A x W; FFT
- Sizes involved: 2K x 2K complex images
- GPU: 1024 CFs made in ~1 ms, ~20x faster than the CPU, with room for improvement by another 2-3x
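As an illustration of the one-thread-per-pixel W-term step, here is a minimal CUDA sketch of the standard W-projection phase screen (Cornwell et al. 2008), exp(2*pi*i * w * (sqrt(1 - l^2 - m^2) - 1)). The kernel name and parameters are hypothetical, and the multiply by the cached A-term and the cuFFT call are omitted.

    #include <cuComplex.h>

    // One thread per pixel of an n x n complex screen.
    __global__ void wTermKernel(cuFloatComplex* screen, int n,
                                float cellRad,   // pixel size in radians
                                float w)         // w in wavelengths
    {
        const float TWO_PI = 6.2831853f;
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= n || y >= n) return;
        float l = (x - n / 2) * cellRad;         // direction cosines
        float m = (y - n / 2) * cellRad;
        float r2 = l * l + m * m;
        float phase = (r2 < 1.f)
            ? TWO_PI * w * (sqrtf(1.f - r2) - 1.f)
            : 0.f;                               // outside the unit sphere
        screen[y * n + x] = make_cuFloatComplex(cosf(phase), sinf(phase));
    }

The full CF is then the screen times the cached A-term, FFT'd with cuFFT; with one thread per pixel, a 2K x 2K screen maps naturally onto the GPU.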

Image reconstruction: PoC-3
- Simplest algorithm, CLEAN: iteratively search for the peak in the raw image and subtract the PSF image at the location of the peak
- Most complex algorithms, Multi-Scale Multi-Frequency Synthesis (MSMFS): require convolutions of large images, plus CLEAN
- Use the GPU as a convolution server (PoC-3); do the deconvolution itself on the GPU (future)
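For reference, a minimal host-side sketch of the Hogbom-style CLEAN loop named above (the signature is hypothetical; production CLEAN in packages like CASA adds windows, stopping rules and major cycles):

    #include <cmath>
    #include <vector>

    // Subtract gain * peak * shifted-PSF until the residual peak drops
    // below 'threshold' or 'niter' iterations are reached.
    void hogbomClean(std::vector<float>& dirty,        // n x n residual image
                     const std::vector<float>& psf,    // n x n, centred
                     int n, float gain, float threshold, int niter)
    {
        for (int it = 0; it < niter; ++it) {
            int peak = 0;                              // 1. find the peak
            for (int i = 1; i < n * n; ++i)
                if (std::fabs(dirty[i]) > std::fabs(dirty[peak])) peak = i;
            if (std::fabs(dirty[peak]) < threshold) break;
            float flux = gain * dirty[peak];
            int px = peak % n, py = peak / n;
            for (int y = 0; y < n; ++y)                // 2. subtract the PSF
                for (int x = 0; x < n; ++x) {
                    int sx = x - px + n / 2, sy = y - py + n / 2;
                    if (sx >= 0 && sx < n && sy >= 0 && sy < n)
                        dirty[y * n + x] -= flux * psf[sy * n + sx];
                }
            // 3. (px, py, flux) would be recorded as a CLEAN component.
        }
    }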

Status-2: Image deconvolution, PoC-3
- Wide-band image reconstruction: Multi-Term Multi-Scale, with N_terms = 2-3 and N_scales = 3-10
- Computing cost: N_terms gridding cycles; convolutions of N_terms^2 x N_scales^2 images; peak search in N_terms images

Status-2: Image deconvolution, PoC-3
- High memory footprint: high-resolution wide-band imaging is currently not possible
- Solution being pursued: use the GPU as an enabler technology for high-resolution wide-band imaging
- Compute the multi-scale images on-the-fly, using the GPU as a convolution server (FFT + multi-scale image computation)

Low-DR, fast-imaging needs
- Transient-source localization on the sky: data rates are too high for a store-then-process approach
- Need fast, low-dynamic-range imaging to trigger storage of short bursts of data
- EVLA: data dumps every 5 ms
- Computing: make 119 images (a dispersion-measure search) of 1K x 1K size; trigger storage if the peak > threshold
- Current CPU-based processing (14 nodes x 16 cores): ~10x slower than real-time
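A rough scale estimate, assuming the 1K x 1K FFTs dominate and counting ~5 M log2(M) FLOP per M-point complex FFT:

    119 \times 5 \times 1024^{2} \times \log_{2}(1024^{2})
      \approx 1.2\times10^{10}\ \mathrm{FLOP\ per\ 5\ ms}
      \approx 2.4\ \mathrm{TFLOPS\ sustained},

a rate that a handful of CPU cores cannot reach but a few Kepler-class GPUs can, consistent with the estimates two slides on.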

Fast-imaging GPU pipeline
- Simplify the gridder; on-GPU FFT; on-GPU peak detection; if (peak > threshold), trigger data storage
- Compute-to-I/O ratio ~ O(10^5 - 10^6): the data (at 900 MB per sec) go into the GPU; only the trigger info comes out

Fast-imaging GPU pipeline estimates
- Imaging is FFT-limited
- GPU: gridding + FFT + peak search, once per ~1 ms; 50x (100x?) faster than a single CPU core
- Initial estimates for fast imaging (work in progress): 5 (2?) K20Xs become comparable to 14x16 CPU cores, i.e. still ~10x slower than real-time; a 50 (25?) K20X GPU cluster can enable real-time processing

Conclusions, future work
- The algorithms for the three hot-spots have been ported to the GPU (the three PoCs)
- Work in progress on the gridding algorithm: minimize global-memory transactions and other optimizations; take a decision about which algorithm to use
- Optimize the CF-server code and the image-convolution code
- Integrate into an imaging pipeline, scientifically test the results, and measure actual run-time performance with real data
- Prototype and check the fast-imaging pipeline; if the estimates of run-time improvements hold up, deploy for real-time fast imaging

Back-up slides

Number of CFs required
- Not all CF terms can be computed analytically, and the final convolution function can't be computed analytically
- The number of CFs required for wide-band, full-polarization, high dynamic-range imaging is large: 10s to 1000s in total, and expensive to compute
- Current solution: pre-compute and cache, which brings memory-footprint issues

Computing architecture
- Make images on the GPU: the GPU serves gridding + FFT, the CPU host runs the deconvolution; the host's data transfer is overlapped with the computing, and the host receives the images back
- GPU as an image-convolution + on-demand CF server: the multi-core host feeds data to the GPU (gridding + FFT) and runs the deconvolver on the convolved images it gets back
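A minimal sketch of the "data transfer overlapped with computing" item, reusing the Vis struct, TILE and gridKernel from the PoC-1 sketch earlier and standard CUDA streams/events (all other names hypothetical):

    #include <cuda_runtime.h>

    // Double-buffered gridding: copy chunk c+1 on its copy stream while the
    // compute stream grids chunk c. All kernels share one stream so their
    // read-modify-write of dGrid never races.
    void gridAllChunks(Vis** hVis,          // pinned host chunks
                       Vis* dVis[2],        // two device buffers
                       int nChunks, int visPerChunk,
                       const cuFloatComplex* dCF, int cfDim,
                       cuFloatComplex* dGrid, int gridSize)
    {
        cudaStream_t copyS[2], computeS;
        cudaEvent_t copied[2], gridded[2];
        cudaStreamCreate(&computeS);
        for (int i = 0; i < 2; ++i) {
            cudaStreamCreate(&copyS[i]);
            cudaEventCreate(&copied[i]);
            cudaEventCreate(&gridded[i]);
        }
        dim3 threads(TILE, TILE);
        dim3 blocks(gridSize / TILE, gridSize / TILE);  // assumes divisibility

        for (int c = 0; c < nChunks; ++c) {
            int b = c & 1;                           // ping-pong buffer
            // Wait until the kernel that last used buffer b has finished.
            cudaStreamWaitEvent(copyS[b], gridded[b], 0);
            cudaMemcpyAsync(dVis[b], hVis[c], visPerChunk * sizeof(Vis),
                            cudaMemcpyHostToDevice, copyS[b]);
            cudaEventRecord(copied[b], copyS[b]);
            // Grid this chunk once its copy has landed.
            cudaStreamWaitEvent(computeS, copied[b], 0);
            gridKernel<<<blocks, threads, 0, computeS>>>(
                dVis[b], visPerChunk, dCF, cfDim, dGrid, gridSize);
            cudaEventRecord(gridded[b], computeS);
        }
        cudaDeviceSynchronize();   // grid complete: hand over to cuFFT
    }

For the copy to actually overlap with the kernel, the host chunks must be in pinned memory (cudaMallocHost); keeping all kernels on a single compute stream avoids races on the shared grid.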

Computing architecture
- Make images on the GPU: GPU as a gridding + FFT server, CPU host for the deconvolution
- GPU as an enabling technology: a convolution and convolution-function server for cases where GPU RAM is not sufficient to hold all the CFs and buffers for MS-MFS; the imaging + deconvolution loops stay on the host
- GPU as a trigger for fast transients: 100s of images from a given set of data; image + search for transients on the GPU

Aperture Synthesis Imaging: How?
- Each pair of antennas measures one 2D fringe: fringe spacing, orientation, amplitude and phase
- The correlator (analog and digital electronics) applies a complex FFT to each antenna's signal, multiplies the signals from all antennas with those from all other antennas, integrates, and writes the result to disk

Aperture Synthesis Imaging: How? (repeat of the earlier slide)
- Each pair of antennas measures one Fourier component in the Fourier plane (the data plane, or UV-plane); the synthesized aperture is equal to the largest separation between antennas

Gridding on the GPU: PoC-1
- Each data point is multiplied by an N x N complex convolution function, followed by N x N additions onto the global grid: N_data x N_CF^2 x 8 FLOP + overheads
- O(10^10 - 10^12) data x (10x10) CF x 8 = 100s of TFLOP; x (100x100) CF x 8 = ~PFLOP; SKA: O(10^15) x ... x 8 = ~ExaFLOP
- Gridding cost dominates the computing load for all imaging; compute-to-I/O ratio: 10^2 - 10^5
- Massively parallel hardware should help: PoC-1

Status-3: Gridding, PoC-1 (on-going)
- Gridding/de-gridding dominates the cost of high dynamic-range imaging; compute-to-I/O ratio: 10^2 - 10^5
- It is the dominant cost for most imaging; for wide-band imaging it is comparable to the cost of the deconvolution step
- Scaling: run-time cost ~ (data volume) x (CF size); W- and AW-Projection: 10^12 x 10^2-10^5 FLOP; A-Projection: 10^12 x 10^2-10^3 FLOP
- Existing literature: GPU: Cornwell et al. (2010), Romein (2012), Daniel Muscat (2014); non-imaging: Magro et al. (2013); FPGA: Clarke et al. (2014)

Status-2: Image deconvolution, PoC-3
- [Diagram: the wide-band deconvolution matrix relating the model images (T0,S0), (T1,S0), (T0,S1), (T1,S1) to the data blocks: each matrix element is a convolution of 4 functions (pre-computed once), and every iteration evaluates this matrix multiply/convolution]

Status-1: CF memory footprint, PoC-2 (negligible I/O, mostly computing)
- A tabulated CF requires oversampling to minimize quantization errors
- Memory per CF for high-DR imaging with the EVLA: oversampling of 100, 2K x 2K pixels = 8 MB
- Number of CFs: 100 W-terms x 100 A-terms = 10^4; total memory footprint: 80 GB
- The memory footprint for the SKA is several orders of magnitude larger
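The arithmetic behind these numbers (the 2K per axis is roughly the CF support times the oversampling, ~20 pixels x 100 = 2048):

    N_{CF} = 100\ (\mathrm{W}) \times 100\ (\mathrm{A}) = 10^{4}, \qquad
    10^{4}\ \mathrm{CFs} \times 8\ \mathrm{MB/CF} = 80\ \mathrm{GB}.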

Status-3: Gridding, PoC-1 (on-going)
- Solutions for load balancing: non-regular sub-grids, balancing the number of data points per block