SOFTWARE CORRELATOR CONCEPT DESCRIPTION

Document number: WP2 040.040.010 TD 002
Revision: 1
Author: Dominic Ford
Additional authors: A. Faulkner, J. Kim, P. Alexander
Date: 2011-03-29
Status: Approved for release

Submitted by: D. Ford (UCAM), 2011-03-26
Approved by: W. Turner, Signal Processing Domain Specialist, SPDO, 2011-03-26

DOCUMENT HISTORY

Revision 1: First issue.

DOCUMENT SOFTWARE

Word processor: MS Word (Word 2003 03a2)
Filename: WP2 040.040.010 TD 002 Software Correlator UCAM 2003

ORGANISATION DETAILS

Name: SKA Program Development Office
Physical/Postal address: Jodrell Bank Centre for Astrophysics, Alan Turing Building, The University of Manchester, Oxford Road, Manchester, UK, M13 9PL
Fax: +44 (0)161 275 4049
Website: www.skatelescope.org

TABLE OF CONTENTS

1 INTRODUCTION ... 8
  1.1 Purpose of the document ... 8
2 REFERENCES ... 8
3 OVERVIEW ... 10
  3.1 FX correlator implementations ... 11
  3.2 Beamforming ... 12
4 SYSTEM DIAGRAM ... 13
5 DATA FLOW THROUGH THE CORRELATOR ... 18
  5.1 Notation ... 18
  5.2 Calculation of data rates ... 20
    5.2.1 Input data rate ... 20
    5.2.2 Internal data rates ... 20
    5.2.3 Output data rate ... 21
  5.3 Floating point operation counts ... 21
    5.3.1 The X step ... 22
    5.3.2 The B step ... 22
    5.3.3 The F step ... 22
  5.4 Implications for choice of architecture ... 23
6 IMPLEMENTATION ON GPUS ... 24
  6.1 Overview of the Tesla platform ... 25
  6.2 NVIDIA roadmap ... 26
  6.3 The X step ... 27
  6.4 The F step ... 28
  6.5 The data routing problem ... 28
  6.6 Hardware cost estimate ... 29
  6.7 Power requirements ... 30
  6.8 Non-recurring engineering (NRE) ... 30
  6.9 Reliability and maintenance ... 30
  6.10 Project Execution Plan (PEP) ... 31

LIST OF FIGURES

Figure 1: The typical flow of data through an interferometer with an FX correlator. ... 10
Figure 2: System diagram of a software correlator for SKA1-Mid (dishes) implemented using GPGPU cards. ... 14
Figure 3: System diagram of a software correlator for SKA1-Low (aperture arrays) implemented using GPGPU cards. ... 15
Figure 4: Data transport within an FX correlator. ... 20
Figure 5: An NVIDIA C2050 Tesla card, offering a theoretical peak performance of 1030.4 GFLOP/s in single precision. ... 25
Figure 6: The CUDA memory model. ... 26
Figure 7: NVIDIA's roadmap for the Tesla product line. ... 27

LIST OF TABLES

Table 1: Illustration of how a pair of software correlators could be built for SKA1-Low and SKA1-Mid following the designs shown in Figures 2 and 3. The scenario presented assumes that NVIDIA's Maxwell range of GPGPU cards (expected to be available in 2013) are used, interfaced with 40 Gbit/s fibre connections. Table 2 and Table 3 show how the scenario alters if other generations of NVIDIA GPGPUs are used. ... 16
Table 2: Estimates of the numbers of GPGPU cards which would be required in the SKA1-Low (AA) system design presented in Table 1 if various future generations of NVIDIA GPGPU cards, expected to be released before the deployment of SKA1, were used (see Section 6.2 for details of the assumed specifications). ... 17
Table 3: Estimates of the numbers of GPGPU cards which would be required in the SKA1-Mid (Dishes) system design presented in Table 1 if various future generations of NVIDIA GPGPU cards, expected to be released before the deployment of SKA1, were used (see Section 6.2 for details of the assumed specifications). ... 17
Table 4: Assumed system parameters for an SKA1 system, based upon Memos 125 and 130. The numbers of frequency channels used are left undefined, since they have little to no effect on the correlator workload calculations presented here. They will, however, have a substantial effect on the workload of the UV processor, not discussed here. ... 19
Table 5: Estimated processing requirements of one-dimensional FFTs as a function of size. ... 22
Table 6: Anticipated performance of future NVIDIA GPGPU cards, as assumed here. ... 27
Table 7: The processing capabilities of NVIDIA Tesla cards, expressed as the RF bandwidth a single card could process within a single beam for SKA1-Low and SKA1-Mid, assuming the numbers of baselines indicated in Table 4. ... 28
Table 8: The estimated cost of the GPGPU cards required to implement the X step using various generations of NVIDIA Tesla cards. ... 29
Table 9: The estimated power dissipation of the GPGPU cards required to implement the X step using various generations of NVIDIA Tesla cards. ... 30

LIST OF ABBREVIATIONS

AA ........ Aperture Array
ADC ....... Analogue-to-Digital Converter
AI ........ Arithmetic Intensity
ASIC ...... Application-Specific Integrated Circuit
CoDR ...... Conceptual Design Review
CMAC ...... Complex Multiplication and ACcumulation
CPU ....... Central Processing Unit
CUDA™ ..... Compute Unified Device Architecture (NVIDIA 2009)
DFT ....... Discrete Fourier Transform
DiFX ...... Distributed FX correlator (Deller et al. 2007)
DRAM ...... Dynamic Random Access Memory
DRM ....... Design Reference Mission
DSP ....... Digital Signal Processing
e-MERLIN .. extended Multi-Element Radio-Linked Interferometer Network
EVLA ...... Expanded Very Large Array
FFT ....... Fast Fourier Transform
FLOPS ..... Floating Point Operations per Second
FPGA ...... Field-Programmable Gate Array
FoV ....... Field of View
GMRT ...... Giant Metrewave Radio Telescope
GPGPU ..... General-Purpose Graphics Processing Unit
GPU ....... Graphics Processing Unit
HPC ....... High-Performance Computing
IF ........ Intermediate Frequency
LO ........ Local Oscillator
LOFAR ..... LOw-Frequency ARray
MPI ....... Message Passing Interface (MPI Forum 2009)
MWA ....... Murchison Widefield Array
NRE ....... Non-Recurring Engineering
PEP ....... Project Execution Plan
PrepSKA ... Preparatory Phase for the SKA
RF ........ Radio Frequency
SEMP ...... Systems Engineering Management Plan
SRS ....... Systems Requirement Specification
SIMD ...... Single Instruction Multiple Data
SKA ....... Square Kilometre Array
SKADS ..... SKA Design Studies
SPDO ...... SKA Program Development Office
TBD ....... To Be Decided
VLBA ...... Very Long Baseline Array
WIDAR ..... Wideband Interferometric Digital ARchitecture (correlator implementation)

1 Introduction

This document describes software-based architectures which could provide an FX correlator for SKA Phase 1. It provides an assessment of the feasibility of implementing such a correlator on the current generation of NVIDIA general-purpose graphics processing unit (GPGPU) cards, showing that the performance delivered by this line of processors is considerably more competitive than that of the current generation of x86-based processors. It also provides a forecast of how we expect the performance metrics derived from this implementation to evolve between now and the construction of SKA Phase 1 in 2016-2019 (Garrett et al. 2010, Dewdney et al. 2010), making reference to the digital signal processing (DSP) technology roadmap (Turner 2011). Reasonable GPGPU performance expectations by 2016 show that the SKA Phase 1 correlator can be implemented economically using this platform.

1.1 Purpose of the document

The purpose of this document is to provide a concept description as part of a larger document set in support of the SKA Signal Processing concept design review (CoDR). It provides a bottom-up perspective of how a software correlator could be implemented. This document has been produced in accordance with the Systems Engineering Management Plan (SEMP) and Signal Processing PrepSKA Work Breakdown document and includes:

o First draft block diagram of the relevant subsystem.
o First draft estimates of cost.
o First draft estimates of power.
o A discussion of reliability issues.

System parameters for SKA Phase 1 have been drawn from Garrett et al. (2010), Dewdney et al. (2010) and the SKA Phase 1 Design Reference Mission (DRM) while the Systems Requirement Specification (SRS) is being created.

2 References

[1] Garrett, M.A., et al. (2010), A Concept Design for SKA Phase 1 (SKA1), Memo 125
[2] Dewdney, P., et al. (2010), SKA Phase 1: Preliminary System Description, Memo 130
[3] Turner, W. (2011), Technology Roadmap Document for SKA Signal Processing, WP2 040.030.011 TD 001
[4] System Engineering Management Plan (SEMP), WP2 005.010.030 MP 001
[5] SKA System Requirement Specification (SRS)
[6] Thompson, A.R., Moran, J.M., and Swenson, G.W. (2001), Interferometry and Aperture Synthesis in Radio Astronomy, second ed., Wiley (New York)
[7] Deller, A.T., et al. (2007), DiFX: A Software Correlator for Very Long Baseline Interferometry using Multiprocessor Computing Environments, PASP, 119, 318
[8] MPI Forum (2009), MPI: A Message Passing Interface Standard, Version 2.2
[9] Roy, J., et al. (2010), A real-time software backend for the GMRT, ExA, 28, 25

[10] Romein, J.W., et al. (2009), Astronomical Real-Time Streaming Signal Processing on a BlueGene/P Supercomputer
[11] Harris, C., et al. (2008), GPU accelerated radio astronomy signal convolution, Exp Astron, 22, 129
[12] van Nieuwpoort, R.V., and Romein, J.W. (2009), Using Many-Core Hardware to Correlate Radio Astronomy Signals, SKADS DS3-T2 Deliverable Document
[13] Wayth, R.B., Greenhill, L.J., and Briggs, F.H. (2009), A GPU-based real-time software correlation system for the Murchison Widefield Array prototype, PASP, 121, 857
[14] NVIDIA (2009), NVIDIA CUDA™ Programming Guide, Version 2.3.1
[15] Alexander, P., et al. (2010), SKA Data Flow and Processing, in Wide Field Astronomy & Technology for the Square Kilometre Array, ed. Torchinsky, S., et al.

3 Overview

The angular resolution of any telescope scales in proportion to the wavelength being observed, and in inverse proportion to the diameter of the telescope's aperture. The long wavelengths of radio waves mean that any single-antenna instrument would have to measure many kilometres across in order to resolve structure within many astronomically interesting objects. Interferometric arrays allow fine structure to be resolved on the sky without the need for building such prohibitively large monolithic antennas, by combining the signals from many widely spaced smaller antennas in a process called aperture synthesis (see, e.g., Thompson et al. 2001).

Central to any interferometer is the correlator, which brings together the signals from the individual antennas. The correlator cross-multiplies complex-valued measurements of the radio frequency (RF) electric field produced by pairs of antennas to produce visibilities, which are related to the spatial brightness distribution of the sky by a transformation which, in the small field-of-view, flat-sky limit, reduces to a Fourier transformation. These complex visibilities are summed over some short period of time, equivalent to taking a finite-length exposure of the sky, before being stored for subsequent calibration, gridding, image formation and deconvolution.

[Figure 1: The typical flow of data through an interferometer with an FX correlator. The time-domain FFT and cross-correlation stages operate at a high data rate but low complexity; the time integration, spatial FFT and imaging stages operate at a lower sample rate but high complexity.]

This process is broken down in more detail in Figure 1. Each antenna produces measurements of the received time-varying RF electric field in one or more polarisations. In a heterodyne receiver, these RF signals may be shifted down to an intermediate frequency (IF) using an analogue mixer connected to a local oscillator (LO). This signal is then passed to an analogue-to-digital converter (ADC) for digitisation. Samples must be taken at a minimum rate of the bandwidth of the antenna, the Nyquist rate, if both the real and imaginary parts of the input sinusoidal signal are recorded, and at twice this rate otherwise. In practice, the sampling rate N_s usually slightly exceeds this and the signal is said to be oversampled.

In an FX correlator of the type discussed here, the samples from each antenna/polarisation are collected into blocks of length N_f, each of which is Fourier transformed (FFTed) or polyphase filtered (PPFed) to form an electromagnetic spectrum with N_f frequency channels; we term this the F step of the correlator. After the F step, each antenna/polarisation yields samples at a rate of N_s/N_f in each of N_f frequency channels, which are handled independently from here onwards. Within each channel, samples from each pair of antennas are delay-compensated and cross-multiplied to form visibilities; we term this the X step of the correlator. The visibilities are time-integrated and periodically recorded at some dump rate. The maximum period of time over which each visibility can be integrated is constrained by the time taken for the rotation of the Earth to become significant over the angular length scale associated with that visibility. The resultant time-integrated visibilities are then stored, later to be passed to gridding and image formation algorithms.

This signal chain may be divided into two regimes. Prior to the time integration step, the sample rate is necessarily high, as it has to exceed the Nyquist rate of the interferometer's bandwidth; the algorithms being applied to this data, however, have low arithmetic intensities¹ (AIs). After the time integration step, the sample rate is much reduced², but the algorithms required for image formation have much higher AIs.

¹ Arithmetic intensity is defined as the average number of floating point operations which need to be performed on a stream of data per byte processed.
² For SKA Phase 2, this assumption may no longer hold (Alexander et al. 2010), as the longest baselines will sample the sky on length scales where the rotation of the Earth becomes significant on very short timescales, making very fast dump rates necessary. For SKA Phase 1, however, this assumption is valid.
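To make the F and X steps concrete, the following toy sketch (my own illustration, not part of the original design work) implements a minimal FX correlator in Python/NumPy for a handful of antennas. The array sizes are arbitrary, and a plain FFT stands in for the polyphase filter bank; both are illustrative assumptions only.

```python
import numpy as np

def fx_correlate(samples, n_chan):
    """Toy FX correlator: samples has shape (n_ant, n_time) of complex voltages."""
    n_ant, n_time = samples.shape
    n_blocks = n_time // n_chan
    # F step: split each antenna stream into blocks of length n_chan and FFT them.
    blocks = samples[:, :n_blocks * n_chan].reshape(n_ant, n_blocks, n_chan)
    spectra = np.fft.fft(blocks, axis=2)          # shape (n_ant, n_blocks, n_chan)
    # X step: for every antenna pair (baseline), cross-multiply and accumulate
    # over the block (time) axis within each frequency channel.
    vis = np.zeros((n_ant, n_ant, n_chan), dtype=complex)
    for i in range(n_ant):
        for j in range(i, n_ant):                 # N_a(N_a+1)/2 baselines incl. autos
            vis[i, j] = np.sum(spectra[i] * np.conj(spectra[j]), axis=0)
    return vis

# Example: 4 antennas, 1024 channels, random noise standing in for sky signal.
rng = np.random.default_rng(0)
voltages = rng.normal(size=(4, 16384)) + 1j * rng.normal(size=(4, 16384))
visibilities = fx_correlate(voltages, n_chan=1024)
print(visibilities.shape)   # (4, 4, 1024)
```

A real correlator would additionally apply the per-baseline delay compensation and dump the accumulations at the rate t(B) discussed in Section 5; those steps are omitted here for brevity.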

3.1 FX correlator implementations

The majority of digital FX correlators built in recent times have used application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs); we term these hardware correlators. Examples include the WIDAR correlators used by the Expanded Very Large Array (EVLA) and e-MERLIN, that used until recently by the Giant Metrewave Radio Telescope (GMRT), and that being built for the full 128-antenna Murchison Widefield Array (MWA).

Driven by the increasing processing power and decreasing cost of mass-produced desktop PCs, however, an alternative approach of using general-purpose processors, which can be programmed using conventional programming languages, has been growing increasingly competitive. We term these software correlators. Deller et al. (2007) have developed a correlator called DiFX which runs on clusters of off-the-shelf PCs using the MPI messaging system (MPI Forum 2009); this is used by the Australian Long Baseline Array (LBA) and the Very Long Baseline Array (VLBA). LOFAR uses a software correlator running on an IBM BlueGene/P supercomputer (Romein et al. 2009). The GMRT has recently deployed a custom-designed software correlator (Roy et al. 2010).

Several papers have studied the feasibility of deploying software correlators on general-purpose graphics processing units (GPGPUs; see, e.g., Harris et al. 2008 and van Nieuwpoort & Romein 2009). Such processors were originally developed for the computer games market, but interfaces are now available which ease the use of their phenomenal processing power for more generalised high-performance computing (HPC) tasks. The most widely used of these programming environments is the Compute Unified Device Architecture (CUDA™), a vendor-specific interface for NVIDIA's line of GPGPUs. Wayth et al. (2009) report the results of testing an experimental CUDA-based FX correlator on the 32-antenna prototype of the MWA.

Several key advantages of software correlators can be identified:

o Rapid development cycles. Software correlators are designed using conventional programming languages for which mature software development tools and debugging suites already exist. Unlike ASICs, which have long development cycles, software correlators can track advances in technology very closely.

o Reduced NRE. Pre-existing hardware is used, which is already well tested and mass-produced for the consumer market.

o Easy reconfigurability. It is straightforward to reconfigure the correlator post-deployment, perhaps in order to use alternative algorithms developed after installation, or in response to hardware failure. The same processors could potentially be used for both correlation and beamforming (see Section 3.2). It is conceivable to implement multiple observing modes which correlate different numbers of frequency channels or different numbers of antennas using the same hardware.

These advantages come at the cost of a small decrease in FLOP-per-unit-silicon efficiency as compared to application-specific hardware, since the inherent flexibility of general-purpose processors requires a sizeable fraction of the processor die to be dedicated to flow control. This in turn leads to a small decrease in power efficiency, since the dynamic power consumption of a processor is proportional to the number of gates (see, e.g., Turner 2011). In this document we present a study of the feasibility of deploying a software correlator for SKA Phase 1.

3.2 Beamforming

SKA1 will implement beamforming using at least two hierarchical levels within the signal path. Aperture array stations will beamform their constituent antenna tiles into around 160 beams (Dewdney et al. 2010) before data is transmitted back to the central processing bunker. However, some of the science objectives of SKA1, in particular long-term pulsar monitoring, do not require imaging but rather high time resolution in a single beam. For these science objectives, the correlator and image processor would be replaced by a much simpler beamformer.

As noted above, software correlators are implemented using general-purpose processors which can be reconfigured near-instantaneously to run different algorithms on the same hardware. These algorithms can even run concurrently. It is therefore quite conceivable that a software correlator could be used to image one part of the sky whilst simultaneously forming a beam to monitor a nearby pulsar. We show in Section 6.4 that the cost overhead of doing so is modest, saving the cost of an additional beamforming system.

4 System diagram

In Section 5, we develop a generic description of a software correlator which is not tied to any particular hardware platform. This approach is taken in recognition of the fact that there are several other lines of processors, known to be on the roadmaps of other vendors, which may well emerge as promising candidate platforms over the coming few years. For example, Intel demonstrated a 32-core 1.2 GHz x86 processor known as Knights Ferry in 2010 and are expected to demonstrate a similar 50-core processor, Knights Corner, in late 2011 (see, e.g., Turner 2011). The roadmap of ATI (owned by AMD) is less clearly known, and though the support of their GPUs for the HPC market is currently limited, this may well change. The line of DSP processors produced by Picochip may also provide a competitive platform for implementing software correlators within the timeframe of SKA1.

In Section 6, we go on to provide a specific example of how such a correlator could be implemented on the NVIDIA Tesla line of GPGPU cards. Figures 2 and 3 summarise the conclusions of this discussion, showing system diagrams for the correlators for SKA1-Mid and SKA1-Low respectively. The data flow through these correlators is given in Table 1. In quoting the numbers of GPGPU cards required for each correlator, we assume that NVIDIA's Maxwell line of GPGPU cards are used, which are expected to be available in 2013, and that these cards achieve 75% of their theoretical performance (see Section 6.3). Tables 2 and 3 show how the number of cards needed might be expected to change for successive generations of NVIDIA cards (see Section 6.2 for the specifications assumed for these). All network interconnections are assumed to operate at 40 Gbit/s, which is already achievable.

In Figures 2 and 3, each antenna is connected to a backend which amplifies, filters, digitises and coarsely channelises the received signal. In the case of aperture array stations, station-wide beamforming occurs immediately before coarse channelisation. The resulting data stream is transmitted over fibre to the central processing bunker. The signals from the N_a antennas are fed into N_FPGA FPGA-based F-step subsystems, which buffer the data to produce the required delays and finely channelise each signal. Each subsystem is anticipated to receive data from up to N_FPGA,in antennas. The channelised output data is divided into a small number N_switches of broad frequency bands, each of which is transmitted over fibre to a separate network switch. Each network switch collates data from all of the F-step subsystems, and therefore all the antennas, within its particular frequency band. Data is then routed to one of a number of PCs hosting GPGPU cards, each of which performs cross-correlation and time integration on a subset of the frequency channels within the switch's frequency band.

Tables 2 and 3 estimate the numbers of GPGPU cards required to implement each correlator using various generations of NVIDIA hardware. For reference, the price of each generation of GPGPU card is not expected to change significantly from 2,500, and each card is expected to dissipate around 250 W. The rate of flow of data onto each card is shown; this increases with time as the cards become able to process more data. The current generation of PCI Express 2.0 cards can receive data at a maximum rate of 64 Gbit/s (theoretical peak). Though PCI Express 3.0 is not currently widely supported, its baseline specifications were published in November 2010 and support a maximum transfer rate of 128 Gbit/s (theoretical peak). It seems reasonable to assume that future NVIDIA cards will use this bus, and that at least another doubling of speed will be achieved by 2016.

[Figure 2: System diagram of a software correlator for SKA1-Mid (dishes) implemented using GPGPU cards. Each antenna backend (analogue gain/filter, digitisation, coarse channelisation) transmits over a fibre link to the central processing bunker. F step: N_FPGA = 125 bulk-delay and fine-channelisation subsystems, each accepting data from two of the N_a = 250 antennas. The output frequency channels are divided into N_switches = 16 broad frequency bands, each handled by a separate switch covering around a 100 MHz segment of the total 1.55 GHz bandwidth of SKA1-Mid (450 MHz to 2 GHz). X step: N_GPGPU = 272 NVIDIA GPGPU cards (Maxwell series, available 2013); each switch is connected to 17 cards via host PCs, with data readout and control/monitoring attached to the switches.]

[Figure 3: System diagram of a software correlator for SKA1-Low (aperture arrays) implemented using GPGPU cards. Each antenna backend (analogue gain/filter, digitisation, station beamforming, coarse channelisation) transmits over a fibre link to the central processing bunker. F step, per beam: N_FPGA = 9 bulk-delay and fine-channelisation subsystems, each accepting data from up to six of the N_a = 50 station inputs. A single switch per beam collates data from the nine F-step subsystems and passes it on to three GPGPU cards, each of which processes a third of the total bandwidth of SKA1-Low. X step: N_GPGPU = 3 NVIDIA GPGPU cards (Maxwell series, available 2013) per beam; a total of 480 GPGPU cards are required to form 160 beams, connected to 160 switches and 1,440 FPGA-based F-step subsystems.]

Table 1: Illustration of how a pair of software correlators could be built for SKA1-Low and SKA1-Mid following the designs shown in Figures 2 and 3. The scenario presented assumes that NVIDIA's Maxwell range of GPGPU cards (expected to be available in 2013) are used, interfaced with 40 Gbit/s fibre connections. Table 2 and Table 3 show how the scenario alters if other generations of NVIDIA GPGPUs are used.

Symbol       Description                                           SKA1-Low (AA), per beam   SKA1-Mid (Dishes)
N_a          Number of input antennas                              50                        250

F-step subsystems (FPGAs)
N_FPGA       Number of FPGA boards                                 9                         125
N_FPGA,in    Number of inputs to each FPGA subsystem               6                         2
             Data throughput of each FPGA subsystem                36.48 Gbit/s              49.6 Gbit/s
             Number of outputs from each FPGA subsystem            1                         16
g_FPGA,out   Bitrate on each output line                           36.48 Gbit/s              3.1 Gbit/s

Switches
N_switches   Number of switches (also frequency bands)             1                         16
N_FPGA       Number of inputs to each switch                       9                         125
n_GPGPU      Number of outputs from each switch                    3                         17
             Minimum number of ports on switch (including
             monitoring and control)                               5                         34
             Data throughput of switch                             304 Gbit/s                387.5 Gbit/s

GPGPU cards
n_GPGPU      Number of Maxwell GPGPU cards connected to each
             switch                                                3 *                       17 *
             Data rate into each Maxwell GPGPU card                101.3 Gbit/s *            22.79 Gbit/s *
N_GPGPU      Total number of Maxwell GPGPU cards                   480 *                     272 *

* These figures assume that NVIDIA's Maxwell (2013) range of GPGPU cards are used. See Table 2 and Table 3 for the numbers of cards which would be needed if other ranges of GPGPU cards were used.

Table 2: Estimates of the numbers of GPGPU cards which would be required in the SKA1-Low (AA) system design presented in Table 1 if various future generations of NVIDIA GPGPU cards, expected to be released before the deployment of SKA1, were used (see Section 6.2 for details of the assumed specifications).

GPGPU card   Expected release year   Cards needed per switch   Total cards needed   Data rate onto each card (Gbit/s)
Fermi        2009                    21                        3360                 14.48
Kepler       2011                    8                         1280                 38.00
Maxwell      2013                    3                         480                  101.3
???          2015                    2                         320                  152.0
???          2017                    1                         160                  304.0

Table 3: Estimates of the numbers of GPGPU cards which would be required in the SKA1-Mid (Dishes) system design presented in Table 1 if various future generations of NVIDIA GPGPU cards, expected to be released before the deployment of SKA1, were used (see Section 6.2 for details of the assumed specifications).

GPGPU card   Expected release year   Cards needed per switch   Total cards needed   Data rate onto each card (Gbit/s)
Fermi        2009                    126                       2016                 3.075
Kepler       2011                    47                        752                  8.245
Maxwell      2013                    17                        272                  22.79
???          2015                    9                         144                  43.06
???          2017                    5                         80                   77.50
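The card counts in Tables 2 and 3 can be reproduced from the X-step processing loads derived in Section 5.3 and the per-card performance figures assumed in Section 6.2. The short sketch below (my own check, not from the original document) redoes that arithmetic, assuming 75% core utilisation of a 1030.4 GFLOP/s Fermi C2050 and the roadmap scaling factors of Table 6.

```python
from math import ceil

FERMI_PEAK = 1030.4e9          # single-precision FLOP/s, Tesla C2050
EFFICIENCY = 0.75              # assumed core utilisation (Section 6.3)
SCALING = {"Fermi": 1.0, "Kepler": 2.7, "Maxwell": 7.6, "2015": 15.2, "2017": 30.4}

def cards_needed(flops_per_switch, n_switches, scale):
    per_card = FERMI_PEAK * scale * EFFICIENCY
    per_switch = ceil(flops_per_switch / per_card)   # round up per switch
    return per_switch, per_switch * n_switches

# X-step load: G_X,FLOP = 8 N_beam N_s N_p^2 N_B  (Section 5.3.1)
low_per_beam = 8 * 380e6 * 2**2 * 1275       # one beam = one switch for SKA1-Low
mid_total    = 8 * 1.55e9 * 2**2 * 31375     # split across 16 switches for SKA1-Mid

for gen, scale in SCALING.items():
    low = cards_needed(low_per_beam, 160, scale)
    mid = cards_needed(mid_total / 16, 16, scale)
    print(f"{gen:8s}  Low: {low[0]:3d}/switch, {low[1]:4d} total   "
          f"Mid: {mid[0]:3d}/switch, {mid[1]:4d} total")
# Reproduces e.g. the Maxwell rows: Low 3/switch (480 total), Mid 17/switch (272 total).
```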

5 Data flow through the correlator

A software correlator acts on a continuous flow of data, which it must be able to process in pseudo-real time. This means that it must, on average, be able to process data samples as fast as they arrive. This differs from stricter definitions of real-time processing in the sense that it may be possible to buffer data for a short time, but the amount of buffered data cannot be allowed to grow indefinitely. A streamed processor may be limited either by the rate at which data can be fed into it (data limited), or by the rate at which it can perform the required computations on the data (processing limited). In this section, we enumerate the data throughput and processing requirements of a software correlator for SKA Phase 1.

5.1 Notation

To proceed, the specification of the FX correlator must be formalised. We assume that it accepts input data streams from N_a antennas, each measuring N_p polarisations (either one or two) and simultaneously observing N_beam beams. Each input stream consists of complex integer data made up of words consisting of an N_b-bit real component and an N_b-bit imaginary component for each polarisation. These words arrive at a sample rate of N_s, which must exceed the Nyquist rate of Δν, where Δν is the total bandwidth of the antenna.

In the correlator itself, the F step FFTs or PPFs each data stream from the time domain into N_f frequency channels. The X step correlates all N_p² polarisation pairs and all N_B = N_a(N_a+1)/2 antenna pairs, i.e. baselines, within each of these channels. The latter comprises N_a(N_a-1)/2 cross-correlations and N_a auto-correlations. Each correlation product is summed over some time period t before being dumped as output data.

The maximum time period over which each correlation product may be summed is constrained by the time taken for the rotation of the Earth to become significant over the angular scale associated with the correlation product. For short baselines measuring large angular scales, long integrations may be acceptable, but for longer baselines, shorter integrations are necessary to prevent smearing. Though all interferometers built to date typically use fixed (fast) dump rates for all baselines, Alexander et al. (2010) have pointed out that for a telescope with the angular resolution of SKA Phase 2, the dump rates of the longest baselines will need to be so fast that the output data rate from the correlator will exceed the input data rate unless baseline-dependent dump rates are used. This is less of an issue for SKA Phase 1, but we nonetheless write t(B) here, where B is the baseline length, noting that we require t(B) < D / (2 B ω f) to satisfy the criterion given by Alexander et al. (2010), where D is the diameter of the collectors used, ω = 7.272 × 10⁻⁵ rad/s is the rotation rate of the Earth, and f is the factor by which the visibilities are oversampled. We take f = 4 here.

In Table 4, ballpark values for these system parameters are drawn from Memos 125 and 130.

Table 4: Assumed system parameters for an SKA1 system, based upon Memos 125 and 130. The numbers of frequency channels used are left undefined, since they have little to no effect on the correlator workload calculations presented here. They will, however, have a substantial effect on the workload of the UV processor, not discussed here.

Parameter   SKA1-Low (AA)                    SKA1-Mid (Dishes)
N_beam      160                              1
D           180 m                            15 m
N_a         50                               250
N_B         1,275                            31,375
N_p         2                                2
Δν          380 MHz (70 MHz to 450 MHz)      1.55 GHz (450 MHz to 2 GHz)
N_s         380 Msample/s                    1.55 Gsample/s
N_f         N_f,low                          N_f,mid
N_b         4                                4

Both the F step and the X step consist of many parallel and independent computations, as shown in Figure 4. The F step consists of N_a N_p parallel FFT/PPFs of length N_f per elapsed time interval N_f/N_s (i.e. at a rate of N_s/N_f per second). The X step consists of N_f N_p parallel cross-multiplication and accumulation tasks, each of which receives a single input sample from each antenna/polarisation in the same time interval. In enumerating the data rates into, and processing requirements of, each calculation, we use G to denote the total requirements of a set of parallel computations, and g to denote the requirements of each individual computation. Next to each algebraic expression below, we quote the bit and FLOP rates derived from the above system parameters for SKA1-Low and for SKA1-Mid (printed in green and red respectively in the original layout; labelled explicitly here).
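As a worked example of this notation (my own illustration, using the Table 4 parameters just given), the following snippet computes the number of baselines N_B and the maximum integration time t(B) on a 100 km baseline for both telescopes.

```python
OMEGA = 7.272e-5          # Earth rotation rate, rad/s
F_OVERSAMPLE = 4          # visibility oversampling factor, f

def n_baselines(n_ant):
    """N_B = N_a(N_a+1)/2, i.e. cross-correlations plus auto-correlations."""
    return n_ant * (n_ant + 1) // 2

def max_integration(dish_diameter, baseline):
    """t(B) < D / (2 B omega f), the dump-time limit set by Earth rotation."""
    return dish_diameter / (2 * baseline * OMEGA * F_OVERSAMPLE)

for name, n_ant, diameter in [("SKA1-Low", 50, 180.0), ("SKA1-Mid", 250, 15.0)]:
    print(name, n_baselines(n_ant), "baselines,",
          round(max_integration(diameter, 100e3), 4), "s max dump time at 100 km")
# SKA1-Low: 1275 baselines, ~3.094 s;  SKA1-Mid: 31375 baselines, ~0.2578 s
```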

[Figure 4: Data transport within an FX correlator. Input data: integer samples from N_a antennas, N_p polarisations and N_beam beams, arriving at N_s samples per second as 2 complex components of N_b bits each. F step: a total of N_a N_p N_beam parallel FFT/PPFs of length N_f, each computed at a rate of N_s/N_f times per second. X step: a total of N_f N_p N_beam parallel correlations (CMACs); each correlation product within each frequency channel is accumulated over t(B,j) N_s / N_f discrete samples. Output data: floating-point accumulations for N_f channels, N_beam beams, N_B baselines and N_p² polarisation pairs, dumped at 1/t(B,j) samples per second as 2 complex components of 32-bit words (single precision).]

5.2 Calculation of data rates

In this section, we give expressions for the rate at which data enters and leaves each of the computational units shown in Figure 4.

5.2.1 Input data rate

The rate at which each antenna produces data is

    g_A,out = 2 N_p N_beam N_s N_b,        [SKA1-Low: 972.8 Gbit/s; SKA1-Mid: 24.80 Gbit/s]

and the total rate at which data enters the correlator is

    G_A,out = 2 N_a N_p N_beam N_s N_b.    [SKA1-Low: 48.64 Tbit/s; SKA1-Mid: 6.200 Tbit/s]

5.2.2 Internal data rates

The rate of flow of data into each of the parallel F-step operations is

    g_F,in = 2 N_s N_b,                    [SKA1-Low: 3.040 Gbit/s; SKA1-Mid: 12.40 Gbit/s]

and the total rate of flow of data into the F step is

    G_F,in = 2 N_a N_p N_beam N_s N_b.     [SKA1-Low: 48.64 Tbit/s; SKA1-Mid: 6.200 Tbit/s]

The flow rate of data out of each FFT/PPF is the same as the flow rate into it:

    g_F,out = g_F,in = 2 N_s N_b,                          [SKA1-Low: 3.040 Gbit/s; SKA1-Mid: 12.40 Gbit/s]

and

    G_F,out = G_F,in = 2 N_a N_p N_beam N_s N_b = G_X,in.  [SKA1-Low: 48.64 Tbit/s; SKA1-Mid: 6.200 Tbit/s]

The flow rate of data into the X step for each frequency channel is given by dividing this rate by N_f:

    g_X,in = 2 N_a N_p N_beam N_s N_b / N_f.               [SKA1-Low: 48.64/N_f Tbit/s; SKA1-Mid: 6.200/N_f Tbit/s]

If the correlator is also used for beamforming, then each beamformer receives the same flow of data as the corresponding X step for the same frequency channel and input antenna beam:

    g_B,in = g_X,in = 2 N_a N_p N_s N_b / N_f.             [SKA1-Low: 304.0/N_f Gbit/s; SKA1-Mid: 6.200/N_f Tbit/s]

5.2.3 Output data rate

In Section 5.1 we noted that the X step can only integrate each visibility for some maximum time period t(B), given by

    t(B) < D / (2 B ω f),                                  [SKA1-Low: 3.094 s; SKA1-Mid: 0.2578 s]

before the rotation of the Earth causes smearing in the (u,v) plane (Alexander et al. 2010). This time period is a function of baseline length, and in theory the output data rate from the correlator could be reduced by dumping visibilities at different rates for different baselines. However, in the calculations presented here, we assume for simplicity that all visibilities are dumped at a rate appropriate for a baseline length of 100 km, the longest baseline present in SKA1. If this dump rate is used, the rate at which each X step produces time-integrated visibilities is

    g_X,out = N_B N_p² × 2 × 32 / t(B).                    [SKA1-Low: 105.5 kbit/s; SKA1-Mid: 31.15 Mbit/s]

The total output data rate from the correlator is given by multiplying this by the number of frequency channels (i.e. parallel X steps):

    G_X,out = N_f N_B N_p² × 2 × 32 / t(B).                [SKA1-Low: 105.5 N_f kbit/s; SKA1-Mid: 31.15 N_f Mbit/s]

In practice, it is apparent that the flow of data out of the X step of the correlator is much smaller than the flow into it; we envisage that it will be read out to a central server using the same switches which supply the data to the compute nodes, before being transmitted out of the central processing bunker over fibre.
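The bracketed numbers above can be checked directly from the Table 4 parameters; the snippet below (my own check, not part of the original document) evaluates the input, total and per-channel output data-rate expressions for both telescopes.

```python
OMEGA, F_OS, B_MAX = 7.272e-5, 4, 100e3   # Earth rotation rate, oversampling factor, longest baseline (m)

def data_rates(n_ant, n_pol, n_beam, n_samp, n_bit, n_baselines, diameter):
    dump_time = diameter / (2 * B_MAX * OMEGA * F_OS)       # t(B) at 100 km, seconds
    g_a_out = 2 * n_pol * n_beam * n_samp * n_bit           # per antenna, bit/s
    g_total_in = n_ant * g_a_out                            # G_A,out = G_F,in = G_X,in
    g_x_out = n_baselines * n_pol**2 * 2 * 32 / dump_time   # per channel, bit/s
    return g_a_out, g_total_in, g_x_out

low = data_rates(50, 2, 160, 380e6, 4, 1275, 180.0)
mid = data_rates(250, 2, 1, 1.55e9, 4, 31375, 15.0)
print("SKA1-Low: %.1f Gbit/s/antenna, %.2f Tbit/s total, %.1f kbit/s per channel out"
      % (low[0] / 1e9, low[1] / 1e12, low[2] / 1e3))
print("SKA1-Mid: %.2f Gbit/s/antenna, %.3f Tbit/s total, %.2f Mbit/s per channel out"
      % (mid[0] / 1e9, mid[1] / 1e12, mid[2] / 1e6))
# SKA1-Low: 972.8 Gbit/s/antenna, 48.64 Tbit/s total, 105.5 kbit/s per channel out
# SKA1-Mid: 24.80 Gbit/s/antenna, 6.200 Tbit/s total, 31.15 Mbit/s per channel out
```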

5.3 Floating point operation counts

The number of arithmetic operations required to perform each task can be similarly quantified.

5.3.1 The X step

Each cross-correlation and accumulation block needs to perform N_s N_p² N_B / N_f complex multiplication and accumulation (CMAC) operations per second. Assuming that each CMAC comprises eight floating point operations³, this corresponds to

    g_X,FLOP = 8 N_s N_p² N_B / N_f FLOPS                  [SKA1-Low: 15.50/N_f TFLOP/s; SKA1-Mid: 1.556/N_f PFLOP/s]

per frequency channel per beam, or a total of

    G_X,FLOP = 8 N_beam N_s N_p² N_B FLOPS.                [SKA1-Low: 2.481 PFLOP/s; SKA1-Mid: 1.556 PFLOP/s]

5.3.2 The B step

If the correlator is also used for beamforming, then the number of operations required by each B step to form a single beam is much fewer than that required by the corresponding X step. In total, N_s N_p N_a / N_f CMAC operations must be performed per second, corresponding to

    g_B,FLOP = 8 N_s N_p N_a / N_f FLOPS                   [SKA1-Low: 304/N_f GFLOP/s; SKA1-Mid: 6.2/N_f TFLOP/s]

per frequency channel per beam being formed, or a total of

    G_B,FLOP = 8 N_s N_p N_a FLOPS                         [SKA1-Low: 304 GFLOP/s; SKA1-Mid: 6.2 TFLOP/s]

per beam being formed.

5.3.3 The F step

Quantifying the rate of floating point operations required by each F step is more problematic owing to the plethora of optimised algorithms which exist for performing FFTs. However, taking the radix-2 Cooley-Tukey algorithm as a suboptimal but representative implementation, we can estimate that each FFT requires around (N_f/2) log₂ N_f multiplication operations and N_f log₂ N_f addition operations, a total of (3N_f/2) log₂ N_f operations per FFT. This FLOP count is evaluated in Table 5 for several values of N_f; the final row shows the FLOP count required to divide the full 380 MHz bandwidth of SKA1-Low into 1 kHz channels. The third column of the table expresses each FLOP count per bit of data input into the FFT.

Table 5: Estimated processing requirements of one-dimensional FFTs as a function of size.

FFT size N_f    FLOP count per FFT / kFLOP    g_F,FLOP / g_F,in
1,024           15.36                         1.875
32,768          737.3                         2.812
380,000         10,570                        3.475

The F step of the correlator needs to perform these FFTs at a rate of N_s/N_f times per second per antenna per polarisation, corresponding to a processing rate of

    g_F,FLOP = (3N_s/2) log₂ N_f FLOPS

per antenna per polarisation, or a total of

    G_F,FLOP = N_a N_p (3N_s/2) log₂ N_f FLOPS.

³ On some platforms, e.g. NVIDIA GPGPU cards, a figure of four floating point operations is more appropriate, since some processors implement combined multiply-and-accumulate instructions which take the same time as straightforward multiplication instructions.
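As a quick check of these operation counts (my own arithmetic, assuming the Table 4 parameters and the eight-FLOP CMAC convention above), the following snippet evaluates the X-step, B-step and F-step loads.

```python
from math import log2

def x_step_flops(n_samp, n_pol, n_baselines, n_beam=1):
    return 8 * n_beam * n_samp * n_pol**2 * n_baselines    # G_X,FLOP

def b_step_flops(n_samp, n_pol, n_ant):
    return 8 * n_samp * n_pol * n_ant                      # G_B,FLOP per beam

def fft_flops(n_chan):
    return 1.5 * n_chan * log2(n_chan)                     # radix-2 Cooley-Tukey estimate

print("X step: Low %.3f PFLOP/s, Mid %.3f PFLOP/s"
      % (x_step_flops(380e6, 2, 1275, 160) / 1e15, x_step_flops(1.55e9, 2, 31375) / 1e15))
print("B step: Low %.0f GFLOP/s, Mid %.1f TFLOP/s per beam"
      % (b_step_flops(380e6, 2, 50) / 1e9, b_step_flops(1.55e9, 2, 250) / 1e12))
for n_chan in (1024, 32768, 380000):
    flops = fft_flops(n_chan)
    print("FFT of length %6d: %9.2f kFLOP, %.3f FLOP per input bit"
          % (n_chan, flops / 1e3, flops / (n_chan * 2 * 4)))
# X step: Low 2.481 PFLOP/s, Mid 1.556 PFLOP/s (matches Section 5.3.1).
# The FFT rows reproduce Table 5, with small rounding differences in the last digit.
```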

5.4 Implications for choice of architecture

To assess the suitability of various computer architectures for the tasks described above, we consider first the black-box specifications of each architecture in terms of its indivisible unit system (for example, an x86 server, FPGA board or GPGPU card). We consider the maximum rate G_data at which data can be transferred onto each system, and the rate G_FLOP at which it performs floating-point operations. Comparing these rates against the values calculated above yields a lower limit on the number of parallel systems required, since it neglects the time taken internally transferring data within each system.

The number of parallel blocks shown in Figure 4 is large, and so it is to be envisaged that many blocks will run in parallel on each system; for example, many independent frequency channels will be correlated in parallel on each X-step system. A metric which is useful to determine the extent to which this is feasible is the ratio G_data / G_FLOP. If this ratio is less than the corresponding data-to-FLOP ratios of the tasks described above, then the system will be limited by the rate at which data can be transported into it, and it will be unable to achieve its full processing potential. Both G_data (determined by the speed of the PCI Express bus) and G_FLOP (determined by the speeds of individual GPGPU cards) are crudely expected to follow Moore's Law for most systems in coming years, and so this ratio is not expected to change significantly before the deployment of SKA1.

Taking the example of a Tesla C2050 GPGPU card (available now; see Section 6 for more details) working in single-precision arithmetic, G_data = 64 Gbit/s and G_FLOP = 1030.4 GFLOP/s. Assessing its suitability for the X step, we require that

    g_X,FLOP / g_X,in = 4 N_p N_B / (N_a N_b) > 1030.4 / 64 = 16.1 FLOP/bit

if the GPGPU card is to be able to achieve FLOP-limited performance. For values of N_p, N_B and N_a appropriate for SKA1, this equates to N_a > 16 antennas, which is easily satisfied. However, assessing its suitability for the F step, we observe that

    g_F,FLOP / g_F,in = (3/2) log₂ N_f / (2 N_b)

is much less than 16.1 FLOP/bit for any number of channels N_f which might be used in SKA1 (see the third column of Table 5). GPGPUs are only likely to be useful for the F step if data is retained on the same GPGPU card for the subsequent X step, and in practice this is unlikely to be possible.

Beamforming suffers from a similar problem of low arithmetic intensity: g_B,FLOP / g_B,in = 1 FLOP/bit. However, if it can be executed in parallel with an X step working on the same data on the same system, then the data need only be transferred onto each system once for both the X and the B step. In practice, this is likely to be straightforward to achieve, and the possibility of performing beamforming and cross-correlation on the same data simultaneously is a strong argument for using the same processing hardware for both.
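The arithmetic-intensity comparison above can be spelt out numerically; the snippet below (my own illustration) evaluates the FLOP-per-bit ratios of the X, F and B steps against the 16.1 FLOP/bit break-even figure of a Tesla C2050.

```python
from math import log2

BREAK_EVEN = 1030.4 / 64          # C2050: GFLOP/s divided by Gbit/s onto the card

def x_step_intensity(n_ant, n_pol=2, n_bit=4):
    n_baselines = n_ant * (n_ant + 1) / 2
    return 4 * n_pol * n_baselines / (n_ant * n_bit)    # FLOP per input bit

def f_step_intensity(n_chan, n_bit=4):
    return 1.5 * log2(n_chan) / (2 * n_bit)

print("break-even: %.1f FLOP/bit" % BREAK_EVEN)
print("X step, SKA1-Low (50 antennas): %.1f FLOP/bit" % x_step_intensity(50))
print("X step, SKA1-Mid (250 antennas): %.1f FLOP/bit" % x_step_intensity(250))
print("X step, 16 antennas: %.1f FLOP/bit" % x_step_intensity(16))
print("F step, 32768 channels: %.2f FLOP/bit" % f_step_intensity(32768))
print("B step: 1.00 FLOP/bit (8 FLOP per CMAC over 2 * N_b = 8 input bits)")
# The X step exceeds 16.1 FLOP/bit once N_a reaches ~16; the F and B steps never do.
```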

6 Implementation on GPUs

In this section, we provide a specific example of how a software correlator could be implemented using NVIDIA GPGPU cards. These represent the only platform which could, with an architecture which is already available, economically meet the performance requirements of a software correlator for SKA1. Other rival processor lines, such as Intel's Knights Ferry, ATI's line of GPUs, and Picochip's specialist DSP processors, could become competitive within the next five years, but are either not competitive or not available at present. Other options, such as a Beowulf cluster of commodity x86 PCs, may be ruled out as unable to economically meet the performance requirements listed above. We have shown that the X step requires around 4 PFLOP/s between SKA1-Low and SKA1-Mid, which would today require of order hundreds of thousands of commodity x86 PCs in such a cluster; such a cluster will not become economic within the next ten years.

GPUs achieve their high performance by using a single instruction multiple data (SIMD) architecture. Whereas modern CPUs devote large numbers of transistors to flow control and data caching, GPUs attach sixteen or more arithmetic units to each flow control unit, meaning that each instruction can be applied to many data simultaneously. The result is that a much higher proportion of the transistors on each die can be devoted to arithmetic operations, but that the processors are less flexible. Though GPUs were initially developed for high-speed graphics rendering, a distinction is now cast between GPUs which are programmed using traditional graphics interfaces such as DirectX, and general-purpose GPUs (GPGPUs) which can be programmed using more generic interfaces such as NVIDIA's CUDA.

NVIDIA's line of high-performance GPGPU cards is the Tesla series, and the latest-generation Fermi C2050 cards (released 2009; see Figure 5) can each deliver a theoretical peak performance of 1030.4 GFLOP/s when working with single-precision floating point data. They retain the traditional physical packaging of a graphics card, interfacing to a PCI Express bus, which means that an x86 host system is needed to house them. In practice, two or four Tesla cards can often be housed side by side in a single x86 host. The latest PCI Express standard is version 2.0, which can transfer data at a theoretical peak rate of 64 Gbit/s between the host and each Tesla card via a 16-lane slot. In practice, actual transfer rates of 40-50 Gbit/s are typically reported. A new PCI Express standard, version 3.0, was published in November 2010 and promises theoretical peak transfer speeds of 128 Gbit/s via an equivalent slot, though neither motherboards nor Tesla cards which support this standard have yet appeared on the market.
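For scale, the "hundreds of thousands of commodity x86 PCs" estimate earlier in this section follows from dividing the combined X-step load by a plausible sustained throughput per node; the per-node figure below is my own assumption, not a number from this document.

```python
x_step_total = 2.481e15 + 1.556e15      # SKA1-Low + SKA1-Mid X-step load, FLOP/s (Section 5.3.1)
sustained_per_pc = 25e9                 # assumed sustained FLOP/s for a ~2010 commodity x86 PC
print("approx. %.0f PCs" % (x_step_total / sustained_per_pc))   # roughly 160,000 PCs
```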

Figure 5: An NVIDIA C2050 Tesla card, offering a theoretical peak performance of 1030.4 GFLOP/s in single precision.

It is worthy of note that any sizeable cluster of Tesla cards sits alongside a significant cluster of x86 processors in the host machines. Whilst the latter have a modest processing capability in comparison, their processors are more flexible and they may nonetheless be useful for performing operations such as packing of the data.

6.1 Overview of the Tesla platform

In practice, GPGPUs rarely operate at anything close to their theoretical peak performance. For many applications, core utilisation of around 30% is reported. There are a number of reasons for this, as illustrated by the block diagram of NVIDIA's GPGPU architecture shown in Figure 6. Firstly, the processor's SIMD architecture means that many arithmetic units are controlled by a single instruction unit, and can only be utilised when multiple threads follow exactly the same flow-control path. Programs with many conditionally executed blocks of code suffer a significant performance penalty. Also, the lack of a traditional memory cache in the GPGPU's memory model means that careful management of memory accesses is required. Whilst several thousand registers can be accessed at high speed, accesses to the Tesla card's multi-gigabyte device memory entail a performance penalty of around a hundred clock cycles. If, and only if, consecutive threads access consecutive memory locations simultaneously (called a coalesced memory access), then the threads pay the performance penalty in parallel rather than in serial. Hence an optimised GPU program must carefully consider the ordering of memory accesses and the arrangement of data in structures.

However, the correlation tasks discussed in this document are sufficiently simple, and intrinsically parallel, that a core utilisation fraction of considerably more than 30% might be expected. This is in contrast to the imaging processor, where algorithms are complex and peak efficiencies of only a few percent have been reported. The author has himself achieved 20-30% efficiency in a prototype cross-correlation (X-step) code with minimal effort at optimisation. Lincoln Greenhill (Harvard) and Mike Clark report having empirically achieved 79% efficiency in a similar prototype code (private communication; publication in prep.). This includes the time penalties associated with internally transferring data within the Tesla card.

Figure 6: The CUDA memory model.

6.2 NVIDIA roadmap

NVIDIA are notoriously secretive about their product roadmaps. What little is known about the future of the Tesla line of GPGPUs⁴ is based upon a single slide displayed by NVIDIA's CEO, Jen-Hsun Huang, at the GPU Technology Conference in September 2010, shown in Figure 7. Numerous technology websites have attempted to reverse-engineer the numbers which went into the slide, based on the strong likelihood that future generations of cards will draw a similar power of around 250 W to the Fermi C2050 cards already available. Compared to Fermi, their floating point operation rates are widely expected to rise by a factor of 2.7 by the end of 2011 (Kepler), and by a factor of 7.6 by the end of 2013 (Maxwell). In addition, the Maxwell cards will reportedly use a new architecture which will combine an ARM CPU with the GPU cores.

It is highly likely that at least one, and perhaps two, further lines of products will be released before the deployment of SKA1. For the purposes of this document, and given the lack of any better information, we assume that these will be released at two-year intervals, in 2015 and 2017, and that each will double the performance of its predecessor (see Table 6).

⁴ Not to be confused with the Tesla C10xx series of products released within this line in 2007.

Figure 7: NVIDIA's roadmap for the Tesla product line.

Table 6: Anticipated performance of future NVIDIA GPGPU cards, as assumed here.

GPGPU card   Expected release year   Performance relative to Fermi C2050 card
Fermi        2009                    1.0
Kepler       2011                    2.7
Maxwell      2013                    7.6
???          2015                    15.2
???          2017                    30.4

6.3 The X step

In this document we assume that the X step can be implemented on NVIDIA GPGPU cards with a core utilisation fraction (i.e. processing efficiency) of 75%. As noted in Section 6.1, Lincoln Greenhill (Harvard) and Mike Clark have reported empirically achieving 79% efficiency in an implementation of a prototype system similar to the one that we require (private communication; publication in prep.). This efficiency figure includes the cost of internal data transfer within the Tesla card.

It is possible to express the processing capabilities of each Tesla card in terms of the observed bandwidth that it can correlate for a single beam. Inverting the equation given in Section 5.3.1 for the processing requirements of the X step yields

    N_s = G_X,FLOP / (8 N_p² N_B),

where N_s may be equated with the bandwidth which a single card can correlate, and G_X,FLOP is equated with 75% of the theoretical processing capability of the GPGPU card. The resulting bandwidths are shown in Table 7.

Table 7: The processing capabilities of NVIDIA Tesla cards, expressed as the RF bandwidth a single card could process within a single beam for SKA1-Low and SKA1-Mid, assuming the numbers of baselines indicated in Table 4.

GPGPU card   Expected release year   Bandwidth for one SKA1-Low beam / MHz   Bandwidth for SKA1-Mid / MHz
Fermi        2009                    18.99                                   0.771
Kepler       2011                    51.27                                   2.083
Maxwell      2013                    144.3                                   5.865
???          2015                    288.6                                   11.73
???          2017                    577.3                                   23.46

These values, used to compute the numbers of cards quoted in Tables 2 and 3, assume that the performance of the Tesla cards is limited by their processing capability rather than by the rate at which data can be transferred onto them; this was shown to be the case in Section 5.4. We envisage that visibilities would be read out via the same network routers which supply the input data to the processing nodes. Since we showed in Section 5.2.3 that the output data rate from the X step is at least a factor of 100 smaller than the input data rate, we do not envisage this being a bottleneck.
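The Table 7 bandwidths follow directly from this inversion; the snippet below (my own check, assuming the Table 6 scaling factors and 75% utilisation of a 1030.4 GFLOP/s Fermi card) reproduces them approximately.

```python
FERMI_PEAK = 1030.4e9     # single-precision FLOP/s
EFFICIENCY = 0.75
SCALING = {"Fermi": 1.0, "Kepler": 2.7, "Maxwell": 7.6, "2015": 15.2, "2017": 30.4}

def bandwidth_per_card(scale, n_pol, n_baselines):
    """Invert G_X,FLOP = 8 N_p^2 N_B N_s to get the bandwidth one card can correlate."""
    return FERMI_PEAK * scale * EFFICIENCY / (8 * n_pol**2 * n_baselines)

for gen, scale in SCALING.items():
    low = bandwidth_per_card(scale, 2, 1275) / 1e6     # MHz, one SKA1-Low beam
    mid = bandwidth_per_card(scale, 2, 31375) / 1e6    # MHz, SKA1-Mid
    print("%-8s %7.2f MHz (Low, per beam)   %6.3f MHz (Mid)" % (gen, low, mid))
# Maxwell row: ~144 MHz (Low) and ~5.85 MHz (Mid), within about 0.3% of Table 7;
# the small differences presumably reflect rounding in the assumed peak figure.
```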

6.4 The F step

Although FFT implementations exist for GPGPU cards, most notably NVIDIA's CUFFT library, we noted in Section 5.4 that the FLOP-per-input-bit ratio of the F step is considerably lower than the ratio of the processing capability of GPGPU cards to the rate at which data can be transferred onto them. Any GPGPU-based implementation of the F step would therefore be limited by the speed of the PCI Express bus. In the system diagrams presented in Figures 2 and 3, we have therefore suggested that the F step be implemented using FPGA boards.

In the system presented, we have assumed that data from multiple antennas can be streamed through each FPGA subsystem, up to a maximum data rate of around 40 Gbit/s per card. This bandwidth is already achievable using the current generation of ROACH boards, and we expect each board to be able to process much higher data rates by the time that SKA1 is deployed, leading to a reduction in the number of boards required. The channelised data emerging from each FPGA subsystem is directed to one of a number of network switches, each handling data from a subset of the observed frequency channels. This design reflects the independence of the data in each frequency channel after the F step, and minimises the need for very large switches.

6.5 The data routing problem

The switches shown in Figures 2 and 3 each accept data from all of the F-step subsystems associated with a particular beam. Thus, they receive data from all of the antennas in either SKA1-Low or SKA1-Mid, but only in a subset of the observed frequency channels. The reason for using multiple switches for each beam is that the frequency channels are treated independently after the F step, and the