SKA-low DSP and computing overview: W. Turner, SPDO, 8th September 2011
Agenda
An overview of the DSP and computing within the Signal Processing Domain (mainly SKA1):
  Channelisation
  Correlation
  Central Beamforming
  Non-Imaging Processing
Based on CoDR documentation and presentations: http://www.skatelescope.org/public2011-04_signal_processing_codr_documents/
The presentation doesn't cover:
  Science Computing
  DSP within the arrays
Signal Processing (WP2.5): Phase 1 Signal Processing
Signal Processing scope: RFI Mitigation, Correlation, Central Beamforming, Non-Imaging Computing
SKA1 AA low:
  50 AA low stations, 25 in the 1 km core
  ~160 station beams on average
  70 MHz to 450 MHz
  Station O/P rate ~122 Gbit/s *
  Total O/P rate ~6 Tbit/s *
SKA1 dishes:
  250 dishes (SPF), 125 in the 1 km core
  Bands: 450 MHz to 1 GHz, 1 GHz to 2 GHz, 2 GHz to 3 GHz
  Dish O/P rate ~24 Gbit/s **
  Total O/P rate ~6 Tbit/s **
AIP: extensibility for PAFs, AA high and WBSPF
* Assumes 8B/10B encoding.
** Assumes 8B/10B encoding + 20% oversampling, as per WP2 030.030.030 TD 001
STAN HIGH LEVEL DESCRIPTION
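The aggregate rates follow from the per-element rates and the element counts. A minimal sketch of that arithmetic, using only the figures on this slide (the function names are illustrative, and the 8/10 payload fraction is simply the standard 8B/10B line-code efficiency):

```python
# Aggregate output data rates from the per-element figures on the slide.
AA_LOW_STATIONS = 50
STATION_RATE_GBPS = 122.0   # per AA low station, includes 8B/10B overhead
DISHES = 250
DISH_RATE_GBPS = 24.0       # per dish, includes 8B/10B overhead and oversampling


def total_rate_tbps(n_elements, rate_gbps):
    """Total line rate in Tbit/s for n_elements each sending rate_gbps."""
    return n_elements * rate_gbps / 1000.0


def payload_rate_gbps(line_rate_gbps):
    """Payload carried by an 8B/10B encoded link (8 data bits per 10 line bits)."""
    return line_rate_gbps * 8.0 / 10.0


aa_low_total = total_rate_tbps(AA_LOW_STATIONS, STATION_RATE_GBPS)
dish_total = total_rate_tbps(DISHES, DISH_RATE_GBPS)

print(f"AA low total: {aa_low_total:.1f} Tbit/s")
print(f"Dish total:   {dish_total:.1f} Tbit/s")
print(f"AA low station payload: {payload_rate_gbps(STATION_RATE_GBPS):.1f} Gbit/s")
```

Both arrays come out at roughly 6 Tbit/s of line rate into the central processing.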
Channelisation (SKA1): Frequency Resolution Requirements
Digital beamforming (Bt << 1), assuming phase-shift beamforming:
  B << 300 kHz for the 1 km diameter core
  B << 1.6 MHz for a 180 m diameter AA low station
Science (DRM rev 1.3):
  Ch2 EoR: < 100 kHz (AA low)
  Ch3 H I absorption: < 5 kHz (Dish and AA low)
  Ch4 21 cm forest: < 200 Hz (AA low)
  Ch5 Pulsar survey: > 10 kHz (Dish and AA low)
  Ch6 Pulsar timing: > 1 MHz (Dish)
RFI mitigation: < 1 kHz (TBC)
Continuum frequency resolution
Frequency stitching will be required for beams generated in the central beamformer and used for pulsar timing
Science case and RFI mitigation frequency resolution requirements are more stringent than those for limiting smearing to < 2%
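The two beamforming limits are consistent with taking t as the light travel time across the aperture, so that Bt << 1 becomes B << c/D. A small sketch under that assumption (the function name is illustrative):

```python
# Phase-shift beamforming limit: B * t << 1, with t taken as the light
# travel time across the aperture (an assumption about how the slide's
# figures were derived).
C_M_PER_S = 299_792_458.0


def bandwidth_limit_hz(aperture_diameter_m):
    """Return c / D, the bandwidth that B must stay well below."""
    crossing_time_s = aperture_diameter_m / C_M_PER_S
    return 1.0 / crossing_time_s


core_limit = bandwidth_limit_hz(1000.0)    # 1 km core
station_limit = bandwidth_limit_hz(180.0)  # 180 m AA low station

print(f"Core:    B << {core_limit / 1e3:.0f} kHz")
print(f"Station: B << {station_limit / 1e6:.2f} MHz")
```

This gives about 300 kHz for the core and about 1.67 MHz for a station, close to the values quoted above.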
Software Correlator
SKA1 Dish software correlator (block diagram):
  Antenna backend: analogue gain/filter, digitisation, coarse channelisation, TX; fibre link to central processing bunker
  F-step (central processing bunker): bulk delay and fine channelisation
    Na = 250 inputs; two into each F-step subsystem
    NFPGA = 125 subsystems, each accepting data from two antennas
    The output frequency channels are divided into sixteen broad frequency bands, each of which is handled by a separate switch
  Switch: Nswitch = 16 switches, each handling data from around a 100 MHz bandwidth segment out of the total 1.55 GHz bandwidth of SKA1 Mid (450 MHz to 2 GHz)
  X-step: NGPGPU = 272 NVIDIA GPGPU cards (Maxwell series, available 2013); each switch is connected to 17 cards via host PCs
  Data readout; control and monitoring
SKA1 AA low software correlator (block diagram, repeated for Beam 1, Beam 2, ... Beam i):
  Antenna backend: analogue gain/filter, digitisation, station beamforming, coarse channelisation, TX; fibre link to central processing bunker
  F-step (central processing bunker): bulk delay and fine channelisation
    Na = 50 inputs; six into each F-step subsystem for each beam
    NFPGA = 9 subsystems, each accepting data from six antennas
  Switch: a single switch collates data from the nine F-step subsystems and passes it on to three GPGPU cards, each of which processes a third of the total bandwidth of SKA Low
  X-step: NGPGPU = 3 NVIDIA GPGPU cards (Maxwell series, available 2013)
  Data readout
  A total of 480 GPGPU cards are required to form 160 beams. These are connected to 160 switches and 1440 FPGA-based F-step subsystems.
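The subsystem counts on this slide can be cross-checked from the inputs per subsystem and the cards per switch. A short sketch of that bookkeeping, using only numbers quoted on the slide (the helper name is illustrative):

```python
import math


def f_step_subsystems(n_inputs, inputs_per_subsystem):
    """Number of F-step subsystems needed to accept all inputs."""
    return math.ceil(n_inputs / inputs_per_subsystem)


# Dish correlator: 250 dishes, two per F-step subsystem, 16 switches x 17 cards.
dish_fpga = f_step_subsystems(250, 2)
dish_gpus = 16 * 17

# AA low correlator: 50 stations, six per F-step subsystem, per beam.
aa_fpga_per_beam = f_step_subsystems(50, 6)
aa_gpus_per_beam = 3
beams = 160
aa_fpga_total = beams * aa_fpga_per_beam
aa_gpus_total = beams * aa_gpus_per_beam

print(dish_fpga, dish_gpus)            # 125 272
print(aa_fpga_per_beam)                # 9
print(aa_fpga_total, aa_gpus_total)    # 1440 480
```

Three cards per beam over 160 beams gives 480 cards, which matches the 480 GPUs quoted for the AA low correlator on the assumptions slide.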
SKA1 AA low software correlator
Assumptions:
  F-step of correlator in FPGA
  X-step on NVIDIA's Maxwell GPGPU, at 75% of theoretical performance
  Network interconnect: 40 Gbit/s
Metric used: G_data / G_flop
  Tesla C2050 GPGPU: G_data ~ 64 Gbit/s, G_flop ~ 1030 GFLOP/s
  Aim for > 16 FLOPs per bit
SKA1 Dish correlator: 272 GPUs; sixteen 34-port 40 Gbit/s network switches
AA low correlator: 480 GPUs; one five-port 40 Gbit/s network switch per beam
Dominic Ford, WP2-040.040.010-TD-002: Software Correlator Concept Description
Jongsoo Kim, WP2-040.040.010-TD-001: Software Correlator Concept Description
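The "> 16 FLOPs per bit" target appears to be the ratio of the Tesla C2050's quoted compute rate to its quoted data rate. A minimal sketch of that ratio (variable names are illustrative):

```python
# Balance point of a GPGPU: how many floating-point operations it can
# perform for each bit it can take in. Figures are those quoted for the
# Tesla C2050 on the slide.
G_DATA_GBIT_PER_S = 64.0
G_FLOP_GFLOP_PER_S = 1030.0


def flops_per_bit(gflops, gbit_per_s):
    """Arithmetic needed per input bit to keep the card compute-bound."""
    return gflops / gbit_per_s


balance = flops_per_bit(G_FLOP_GFLOP_PER_S, G_DATA_GBIT_PER_S)
print(f"{balance:.1f} FLOPs per bit")
```

An algorithm that performs fewer operations per input bit than this would be limited by data transfer rather than by arithmetic.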
ASKAP based Correlator
Dish Correlator: ATCA based FPGA cards
[Board diagram: ROSA RTM with 12 x 10G inputs and 48 x 3.2G outputs; Vitesse 72x72 cross-point switch; four LX240T FPGAs, each with DRAM and SFP+ links; Zone 2 backplane connections]
Assumptions:
  Memory limitation ~1000 frequency cells per device: FFT scales as log(N) whereas memory scales as N
  A single shelf (crate, chassis or card cage) is limited to ~256 optical connections
  Data is transported at 10 Gb/s per fibre
  FPGAs will have ~8000 multipliers by 2017
  1 CMAC per clock cycle per 18 x 18 bit multiplier
Arrays considered:
  250 AA hi with bandwidth 0.6 GHz (SKA2)
  250 AA hi with bandwidth 0.4 GHz (SKA2); 50 for SKA1
  250 AA lo/hi overlap, 0.15 GHz (SKA2)
  2000 dishes with PAFs, 0.6 GHz (SKA2)
  2700 WBSPF dishes, 9 GHz (SKA2); 250 for SKA with ~2.6 GHz
  133 dish remote stations, 9 GHz (SKA2)
AA low Correlator: Pizza Box
[Pizza box diagram: FPGA 1 and FPGA 2, each with a single SFP+; 25 x 10 Gb/s links; monitor and control over Gigabit Ethernet]
SKA1:
  Dish correlator: 2 ATCA shelves of 16 cards
  AA low correlator: 320 pizza boxes for 480 beams
SKA2:
  WBSPF & PAF correlator: 300 ATCA shelves
  AA low & AA hi correlator: 260 ATCA shelves
John Bunton, WP2 040.060.010 TD 001/2: ASKAP Style SKA2/1 Correlator Concept Description
UniBoard Correlator
Current UniBoard:
  8 Altera FPGAs per board
  322 18x18 complex multipliers per device
  1235 memory units
  Each FPGA has two DDR3 memory banks
  Four front panel SFP+ cages, each 10 GbE
  High speed mesh on bespoke backplane
  Bespoke shelf metalwork (TBC)
SKA1 correlator sizing based on current UniBoard:
  384 boards needed for the dish array
  416 UniBoards to process 480 beams (168 UniBoards for 160 beams)
Dish processing split into 3 tiers:
  1. 64 UniBoards channelising to 64 MHz; 4 dishes processed per board
  2. 16 x 4 UniBoards provide corner turn and fine channelisation
  3. 256 UniBoards provide correlation; ~0.5 MHz bandwidth processed per FPGA
AA low processing:
  3040 UniBoards required for correlation of 480 beams
  1014 UniBoards required for correlation of 160 beams
By 2015 the number of cards required is likely to improve by a factor of 4
Arpad Szomoru, WP2 040.070.010 TD 001: A UNIBOARD Based Phase 1 SKA Correlator and Beamformer Concept Description
CASPER Architecture SKA1
[Figure: ROACH capability roadmap (Virtex 6, Virtex 7) against the SKA phase 1 dish component]
FPGAs have a 3 year life-cycle
CASPER philosophy:
  The system design is deliberately sub-optimum
  Development cost is low
  Flexibility is very high
Uses a commercial Ethernet switch:
  The switch adds cost
  Provides simplicity and flexibility in the design
Current CASPER hardware uses 10 Gigabit Ethernet; future: 40 and then 100 Gigabit Ethernet
1U ATX computer format
Planning for ROACH3 is due to start in October 2011
SKA1 sizing estimates from CASPER tools:
  1. SKA1 Dish
     $2B\,[\,N \log_2(F) \times 10\,\mathrm{OPs} + \frac{N(N+1)}{2} \times 4 \times 8\,\mathrm{OPs}\,] = 4.2$ POps
     2584 x ROACH4
     3948 10 Gbit/s and 2572 40 Gbit/s switch ports
  2. SKA1 Sparse AA
     $2B\,[\,N \log_2(F) \times 10\,\mathrm{OPs} + \frac{N(N+1)}{2} \times 4 \times 8\,\mathrm{OPs}\,] = 8.6$ POps
     7680 x ROACH4 (assumes 480 beams for each of 50 AA low stations)
     37,000 switch ports distributed over 480 switches
Francois Kapp, WP2-040.080.010-TD-001: SKA CASPER Correlator Concept Description
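The sizing expression can be evaluated directly. The sketch below implements the formula as written; the slide does not state B or F, so the values used here (B = 2 GHz, F = 2^18 channels) are assumptions chosen only to show that the dish figure is of the right order:

```python
import math


def casper_ops_per_second(n_inputs, bandwidth_hz, n_channels,
                          ops_per_fft_stage=10, ops_per_cmac=8, pol_products=4):
    """Evaluate 2B[N log2(F) * 10 + N(N+1)/2 * 4 * 8] from the slide."""
    f_engine = n_inputs * math.log2(n_channels) * ops_per_fft_stage
    x_engine = n_inputs * (n_inputs + 1) / 2 * pol_products * ops_per_cmac
    return 2 * bandwidth_hz * (f_engine + x_engine)


# Assumed values (not given on the slide): B = 2 GHz, F = 2**18.
dish_ops = casper_ops_per_second(n_inputs=250, bandwidth_hz=2.0e9,
                                 n_channels=2 ** 18)
print(f"SKA1 dish: {dish_ops / 1e15:.2f} POps")
```

With these assumed values the result is close to the quoted 4.2 POps, and the correlation (X) term dominates the channelisation (F) term by more than a factor of twenty.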
Giant Systolic Array Correlator
GSA board and ASIC concept:
  X421 ASIC: Full Stokes, 1024 baseline, 4 bit, 2024 ch, 150 MHz
  GSA board: 8x8 array of ASICs
  Use ASICs with on-chip eDRAM for low power, and minimal I/O connections for yield and reliability
  Larger/fewer chips might save power
  Use SERDES for all I/O (minimal solder contacts)
  Make as simple as possible, to maximize yield and reliability
GSA board interconnect:
  Boards are mounted horizontally in adjacent racks
  Adjacent circuit boards are connected together to form the matrix
  The BLUE and RED lines in the figure show data flow for one particular set of antenna data through the array
  Central beamforming hardware can then reside in equipment in additional racks on either end
Brent Carlson, WP2 040.050.010 TD 001: Giant Systolic Array (GSA) Correlator Concept Description
GSA Correlator Room Layout (SKA2)
[Room layout figure; dimension labels: 22 m and 34 m]
DAAs, SAAs: 250 stations, ~1000 beams, 300-400 MHz/pol'n
PAF: ~2000 antennas, ~30 beams, ~700 MHz/pol'n/beam
WBSPF: ~3000 antennas, 1 beam, 1-n GHz/pol'n
~100 k channels/baseline/beam, possibly more at narrower bandwidths
Note: does not include F-part
Brent Carlson, WP2 040.050.010 TD 001: Giant Systolic Array (GSA) Correlator Concept Description
Low power design
[Figure: Matrix Architecture. Notes: [1] ALMA, EVLA, SKA Memo 127; [4] ATA Memo 73]
Estimated energy per op in 90 nm CMOS:
  CMAC for 2b+2b samples: 2 pJ
  Move one sample between chips: 8 pJ
  Write and read one sample to/from RAM: 33 pJ
Memory operations dominate power and memories dominate chip area in some architectures; those architectures must be avoided.
Design limits:
  Maximum power dissipation: 75 W
  Maximum chip area: 200 mm^2
  Maximum input rate: 40 Gb/s
  Maximum output rate: 40 Gb/s
  Channel bandwidth: 100 kHz
  Minimum integration: 65 ms
Analysis of architectures for an N = 2000, B = 1 GHz correlator (2 bit); column groups: Dedicated Pipeline Architecture, RAM Accumulator Architecture
  Architecture                     1        2        3         4         5
  Total ICs in system          9,718    3,333    3,333    10,000    30,000
  Total power, all ICs (W)    21,859   25,771   56,823   114,873   344,873
L. Urry, ATA Memo 73, Feb 2007
Larry D'Addario, WP2 040.090.010 TD 002: Low Power Correlator Architecture for the Mid Frequency SKA
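As a rough sanity check on the table, the arithmetic-only power of the N = 2000, B = 1 GHz correlator can be estimated from the 2 pJ per CMAC figure. This sketch assumes four polarisation products per baseline, autocorrelations included, and one complex sample per hertz of bandwidth; none of these assumptions is stated on the slide:

```python
def cmac_power_watts(n_antennas, bandwidth_hz,
                     energy_per_cmac_j=2e-12, pol_products=4):
    """Power spent on complex multiply-accumulates alone."""
    baselines = n_antennas * (n_antennas + 1) // 2   # includes autocorrelations
    cmac_per_second = baselines * pol_products * bandwidth_hz
    return cmac_per_second * energy_per_cmac_j


power_w = cmac_power_watts(n_antennas=2000, bandwidth_hz=1e9)
print(f"CMAC-only power: {power_w / 1e3:.1f} kW")
```

The result, about 16 kW, sits just below the lowest total in the table (21,859 W), which is consistent with the point that data movement and memory access, not arithmetic, separate the architectures.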
Central Beamformer
GSA based Central Beamformer Concept
[Block diagram: antenna band/slice de-constructor/buffers (1, 2, ... j) feed ABG blocks and MAMBG blocks; BS-BEAMFORMER 1 to BS-BEAMFORMER B each take antennas 1 to J for one band/slice and produce BEAM-1 to BEAM-M (BSM-1 to BSM-M). Detail of the minimal linear ABG: channel packets, phase model(t, beam), per-channel phase step, SIN and COS LUTs, and a complex multiplier producing the I/Q offset-beam output.]
Correlator concepts: the central beamformer is likely to be tightly coupled to the correlator, and could include hardware coefficient generation.
Hierarchical (2 stage) Beamforming Equations
  Dishes:
    $D_{sa} = D_{core} / N_{dish}^{1/4}$
    $N_{sa} = (D_{core} / D_{sa})^2$
    $N_{ops} = 2 \sqrt{N_{dish}} \, N_{pol} \, B \, \xi \, (D_{core} / D_{dish})^2$
    Optimal sub-array diameter = 299 m
    Number of sub-arrays ~ 11
    Dish beamformer load ~ 60 TOps/s
  AA low:
    $N_{ops} = N_{stations} \, N_{pol} \, B \times 1.25 \, (\pi / (180 c))^2 \, D_{core}^2 \, \nu^2$
    AA low beamformer load ~ 4 TOps/s
For Survey: D_core = 1000 m, N_dish = 125, B_dish = 300 MHz, B_AA-low = 100 MHz, N_stations = 25, ξ = 1, N_pol = 2
NB: these parameters match those of the Smits Memo WP2 040.030.010 TD 003 but differ from other SP CoDR documents
Notes:
  Beamformer is coherent
  SKA1 data rate out of beamformer > data rate in
  Need to cater for extensibility
  For dishes, reduce bandwidth in channelisation to 300 MHz
  Assumed the number of beams is programmable at AA low
  Timing is likely to use dishes out to 200 km
Brent Carlson, WP2-040.110.010-TD-001
Also see John Bunton's WP2 040.060.010 D 002
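The dish figures on this slide (299 m sub-arrays, about 11 sub-arrays, about 60 TOps/s) can be reproduced from the survey parameters with a two-stage beamforming model in which the sub-array diameter is D_core / N_dish^(1/4) and the operation count scales with 2 sqrt(N_dish). The sketch below is one reading of the slide's equations; the dish diameter of 15 m is an assumption, since it is not stated on the slide:

```python
import math


def two_stage_dish_beamformer(d_core_m, n_dish, bandwidth_hz,
                              d_dish_m=15.0, n_pol=2, xi=1.0):
    """Return (sub-array diameter, number of sub-arrays, ops per second)."""
    d_sa = d_core_m / n_dish ** 0.25                 # optimal sub-array diameter
    n_sa = (d_core_m / d_sa) ** 2                    # number of sub-arrays
    n_beams = xi * (d_core_m / d_dish_m) ** 2        # pencil beams filling the dish FoV
    n_ops = 2 * math.sqrt(n_dish) * n_pol * bandwidth_hz * n_beams
    return d_sa, n_sa, n_ops


d_sa, n_sa, n_ops = two_stage_dish_beamformer(d_core_m=1000.0, n_dish=125,
                                              bandwidth_hz=300e6)
print(f"Sub-array diameter: {d_sa:.0f} m")
print(f"Number of sub-arrays: {n_sa:.1f}")
print(f"Beamformer load: {n_ops / 1e12:.0f} TOps/s")
```

For comparison, single-stage beamforming with the same parameters would need N_dish rather than 2 sqrt(N_dish) operations per beam sample, roughly 5.6 times more.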
Pulsar Search
Pulsars become brighter towards lower frequencies, following a power law with an index of around 1.6.
At low frequencies pulsar profiles are smeared by scattering, leading to lower sensitivity.
The sky temperature increases towards lower frequencies, following a power law with an index of 2.6.
The beam-width increases towards lower frequencies. The FoV of one beam therefore scales with f^-2.
The effect of dispersion smearing, due to the propagation of the radio waves through the interstellar medium, becomes greater towards lower frequencies.
R. Smits, B. Stappers, M. Kramer, A. Karastergiou: Pulsar survey with SKA phase 1
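To get a feel for these competing scalings, the sketch below compares 400 MHz with 800 MHz (the AA low and dish survey centre frequencies). The power-law indices are those quoted above; treating each quantity as a pure power law in frequency is a simplification:

```python
def power_law_ratio(f_low_mhz, f_high_mhz, index):
    """Ratio of a quantity scaling as f**(-index), at f_low relative to f_high."""
    return (f_low_mhz / f_high_mhz) ** (-index)


F_LOW, F_HIGH = 400.0, 800.0

flux_gain = power_law_ratio(F_LOW, F_HIGH, 1.6)       # pulsar flux density
sky_temp_gain = power_law_ratio(F_LOW, F_HIGH, 2.6)   # sky temperature
fov_gain = power_law_ratio(F_LOW, F_HIGH, 2.0)        # single-beam field of view

print(f"Pulsar flux:     x{flux_gain:.1f}")
print(f"Sky temperature: x{sky_temp_gain:.1f}")
print(f"Beam FoV:        x{fov_gain:.1f}")
```

Halving the frequency makes pulsars about three times brighter, but the sky background rises by about a factor of six, so the gain in survey speed comes mainly from the fourfold larger field of view.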
Pulsar Search Scenarios
Scenario 1, either:
  Dish all-sky survey
    800 MHz centre frequency, 300 MHz bandwidth
    FoV per observation: 2.1 deg^2
    4400 pencil beams
    Telescope time: 119 days
    Normal pulsars: 7750; MSP: 1000
or:
  AA low all-sky survey
    400 MHz centre frequency, 100 MHz bandwidth
    FoV per observation: 4.0 deg^2
    2700 pencil beams
    Telescope time: 119 days
    Normal pulsars: 6300; MSP: 900
Scenario 2:
  Telescope time: < 53 days
  Normal pulsars: 9800; MSP: 1240
R. Smits, B. Stappers, M. Kramer, A. Karastergiou: Pulsar survey with SKA phase 1
R. N. Manchester, A. G. Lyne, F. Camilo, J. F. Bell, V. M. Kaspi, N. D'Amico, N. P. F. McKay, F. Crawford, I. H. Stairs, A. Possenti, M. Kramer, D. C. Sheppard: The Parkes Multi-Beam Pulsar Survey I. Observing and Data Analysis Systems, Discovery and Timing of 100 Pulsars, Mon. Not. R. Astron. Soc., 328, pp. 17-35, 2001
Dedispersion
Time delay due to dispersion:
  AA low: t = 6.7 s, 1.3 G samples per beam
  Dish: t = 2.6 s, 1.6 G samples per beam
Equations:
  Channel width: $\Delta\nu = 1 / t_{samp} \sim 10\ \mathrm{kHz}$
  Maximum dispersion measure: $DM_{max} = \dfrac{t_{samp}(\mu s)\ \nu_{min}^3(\mathrm{GHz})}{8.3 \times 10^3\ \Delta\nu(\mathrm{GHz})}$
  Number of dispersion measures: $N_{DM} = \dfrac{4150\ DM_{max}}{t_{samp}} \left( \dfrac{1}{f_{min}^2} - \dfrac{1}{f_{max}^2} \right)$
Dedispersion options:
  Coherent
  Incoherent
  Delay and Sum
  PreSummation
  Accumulation and Difference
  Multiple Sample Period
  Taylor Tree
  Frequency Partitioning
See SP High Level Description WP2-040.030.010-TD-001
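The delay and sample counts on this slide can be reproduced with the standard cold-plasma delay, 4150 s x DM x (f_min^-2 - f_max^-2) with frequencies in MHz. In the sketch below the maximum DM of 500 pc cm^-3 is an assumption inferred from the quoted delays, the band edges are taken from the survey centre frequencies and bandwidths (350 to 450 MHz and 650 to 950 MHz), and two polarisations sampled at the bandwidth are assumed:

```python
K_DM_S = 4150.0  # dispersion constant in s MHz^2 per (pc cm^-3)


def dispersion_delay_s(dm, f_min_mhz, f_max_mhz):
    """Delay between the top and bottom of the band for a given DM."""
    return K_DM_S * dm * (f_min_mhz ** -2 - f_max_mhz ** -2)


def samples_per_beam(delay_s, bandwidth_hz, n_pol=2):
    """Samples that must be buffered to span the dispersion delay."""
    return delay_s * bandwidth_hz * n_pol


def n_dm_trials(dm_max, f_min_mhz, f_max_mhz, t_samp_s):
    """Number of DM trials: total delay in units of the sample time."""
    return dispersion_delay_s(dm_max, f_min_mhz, f_max_mhz) / t_samp_s


DM_MAX = 500.0  # assumed, not stated on the slide

aa_delay = dispersion_delay_s(DM_MAX, 350.0, 450.0)
dish_delay = dispersion_delay_s(DM_MAX, 650.0, 950.0)
aa_samples = samples_per_beam(aa_delay, 100e6)
dish_samples = samples_per_beam(dish_delay, 300e6)

print(f"350-450 MHz: {aa_delay:.1f} s, {aa_samples / 1e9:.1f} G samples")
print(f"650-950 MHz: {dish_delay:.1f} s, {dish_samples / 1e9:.1f} G samples")
```

Under these assumptions the 350 to 450 MHz band gives about 6.7 s and 1.3 G samples, and the 650 to 950 MHz band gives about 2.6 s and 1.6 G samples. With a 100 microsecond sample time, `n_dm_trials` gives roughly 67,000 trial DMs for the lower band.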
Incoherent Dedispersion
Concept [figure]: for a large DM at high frequency, and for a small DM, individual samples are used; for a large DM at lower frequency, multiple samples are summed.
Dedispersion architecture [figure]: input samples from the dispersion buffer; a running sum for each frequency channel, using the first and last sample; output dedispersion series.
Dedispersion processing rate: C = number of frequency channels, D = number of DMs, τ = output sample period
Dedispersion detail for processing 4 samples into the future (parallelisation, J)
V-FASTR/ASKAP dedispersion:
  FPGA based
  Dedispersion buffer uses FPGA block memory
  Accumulator memory size is proportional to the dedispersion interval (J)
  Larger values of J reduce FTS bandwidth requirements
  Performance equations are in Nathan Clarke's document
Nathan Clarke, WP2 040.150.010 TD 001: An Architecture for Incoherent Dedispersion
Also A. Ahmedsaid, WP2 040.170.010 TD 001: Pulsar Signal Processing on UNIBOARD
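For orientation, the simplest form of incoherent dedispersion (delay and sum) can be written in a few lines. This is a brute-force software sketch using nearest-sample delays, not the FPGA running-sum architecture described in the referenced documents; all names and test values are illustrative:

```python
K_DM_S = 4150.0  # dispersion constant in s MHz^2 per (pc cm^-3)


def channel_shifts(dm, freqs_mhz, t_samp_s):
    """Delay of each channel relative to the highest frequency, in samples."""
    f_ref = max(freqs_mhz)
    return [round(K_DM_S * dm * (f ** -2 - f_ref ** -2) / t_samp_s)
            for f in freqs_mhz]


def dedisperse(data, dm, freqs_mhz, t_samp_s):
    """Delay-and-sum: data[channel][time] -> dedispersed time series."""
    shifts = channel_shifts(dm, freqs_mhz, t_samp_s)
    n_out = len(data[0]) - max(shifts)
    return [sum(chan[t + s] for chan, s in zip(data, shifts))
            for t in range(n_out)]


def make_dispersed_pulse(dm, freqs_mhz, t_samp_s, n_time, t0):
    """Synthetic test input: a one-sample pulse arriving at t0 in the top channel."""
    data = [[0.0] * n_time for _ in freqs_mhz]
    for row, s in zip(data, channel_shifts(dm, freqs_mhz, t_samp_s)):
        row[t0 + s] = 1.0
    return data


freqs = [350.0, 375.0, 400.0, 425.0, 450.0]
pulse = make_dispersed_pulse(dm=50.0, freqs_mhz=freqs, t_samp_s=1e-3,
                             n_time=1000, t0=100)
aligned = dedisperse(pulse, 50.0, freqs, 1e-3)
unaligned = dedisperse(pulse, 0.0, freqs, 1e-3)
print(max(aligned), aligned.index(max(aligned)), max(unaligned))
```

At the correct trial DM all five channels add at sample 100; at DM = 0 the pulse stays spread over time and the peak is five times lower. A search repeats this for every trial DM, which is why the processing rate grows with both the number of channels and the number of DMs.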
Acceleration Search
[Figure: example acceleration search]
Pulsars in highly relativistic binary systems show periodic changes in their pulse frequency due to the Doppler effect.
Standard Fourier-based periodicity searches are not sensitive to signals of varying frequency.
Need to compensate using trial accelerations
Suggested range: -100 to +100 m s^-2
~300 trial accelerations with δa = 0.66 m s^-2
Ralph Eatough, CoDR Presentation; also see G. Knittel, WP2 040.130.010 TD 002: A Scaleable Computer Architecture for On-line Pulsar Search on the SKA
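The trial count follows from the acceleration range and step. The sketch below builds the trial grid and, as an illustration of why compensation is needed, estimates the Doppler drift of a spin frequency over an observation; the 500 Hz spin frequency and 600 s observation length are assumed example values, not figures from the slide:

```python
import math

C_M_PER_S = 299_792_458.0


def trial_accelerations(a_min, a_max, step):
    """Evenly spaced trial accelerations covering [a_min, a_max]."""
    n_trials = math.floor((a_max - a_min) / step) + 1
    return [a_min + i * step for i in range(n_trials)]


def doppler_drift_hz(spin_freq_hz, accel_m_s2, t_obs_s):
    """Change in observed spin frequency for constant line-of-sight acceleration."""
    return spin_freq_hz * accel_m_s2 * t_obs_s / C_M_PER_S


trials = trial_accelerations(-100.0, 100.0, 0.66)
drift = doppler_drift_hz(spin_freq_hz=500.0, accel_m_s2=100.0, t_obs_s=600.0)
bins_crossed = drift * 600.0   # Fourier bin width is 1 / t_obs

print(len(trials))
print(f"Drift: {drift:.3f} Hz, about {bins_crossed:.0f} Fourier bins")
```

With these example values an uncorrected signal would be smeared over about 60 Fourier bins, which is why a plain FFT search loses sensitivity to such systems.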
Comparison of SKA1 Concepts
A rough attempt to compare the power and cost of the designs in the concept papers. It should not be taken too seriously, since the designs did not all stick to the Memo 130 specs and they varied considerably in their projections of future technology.
Notes:
  [1] For the GPU-based concepts, the "chips" columns contain the count of computing nodes.
  [2] Costs do not include NRE.
  [3] ASIC-based: 90 nm technology. The sparse AA correlator is for 150 stations, so it has 9x the capability of the others (which are each for 50 stations).
  [4] Systolic ASICs: 30 nm technology. Dishes = 7 boards of 72 ASICs; AAs = 7 chips/beam = 3360 chips; 14 chips + aux per board -> 240 boards.
Larry D'Addario, CoDR presentation slide