RAPS
George Mozdzynski (George.Mozdzynski@ecmwf.int), RAPS Chairman
20th ORAP Forum
What is RAPS?
- Real Applications on Parallel Systems
- A European software initiative: the RAPS Consortium (founded in the early 1990s)
- A working group of hardware vendors
- Programming model: MPI + F90/F95 + OpenMP
- The partners of the RAPS Consortium develop portable parallel versions of their production codes, which are made available to a Working Group of Hardware Vendors for benchmarking and testing.
RAPS Consortium
- CCLRC, Daresbury
- CSCS, Lugano
- DWD, Offenbach
- DKRZ, Hamburg
- ECMWF, Reading
- Fraunhofer SCAI/ITWM
- UK Met Office, Exeter
- MPI-M, Hamburg
- METEO-FRANCE, Toulouse
- NERC, UK
Working Group of Hardware Vendors
- Bull
- Cray
- Fujitsu
- Hitachi
- HP
- IBM
- INTEL
- Linux Networx
- NEC
- SGI
- SUN
Why RAPS?
- Portability of codes (F90/F95, C/C++, MPI, OpenMP)
- Availability of benchmark codes ahead of a formal procurement
- Some influence on standardization (PARMACS -> MPI): vendors needed to support F90 + MPI to run the benchmarks
- Information exchange: 20 meetings held to date
RAPS process
- RAPS benchmarks are distributed by the individual organizations
- No official membership is required for vendors
- Vendors approach individual organizations for benchmarks (under NDA)
- Meetings are held once a year; every 2 years as part of the Use of HPC in Meteorology workshop
  - http://www.ecmwf.int/newsevents/meetings/workshops/2006/high_performance_computing-12th/index.html
- Next meeting: 21/22 June 2007, UPMC, Jussieu, Paris
  - Contact: Marie-Alice.Foujols@ipsl.jussieu.fr
RAPS benchmarks
- Today:
  - IFS (RAPS9)
  - DWD LM_RAPS
  - Met Office UM
- Commitment to produce up-to-date benchmarks reflecting the key operational applications of consortium members
RAPS Future
- Seek new members outside the meteorological community
- Exchange experiences with other communities
- Fortran standards:
  - Fortran 2003
  - HPCS compiler?
ECMWF: Supporting States and Co-operation
- Supporting states: Belgium, Denmark, Germany, Spain, France, Greece, Ireland, Italy, Luxembourg, The Netherlands, Norway, Austria, Portugal, Switzerland, Finland, Sweden, Turkey, United Kingdom
- Co-operation agreements or working arrangements with: Czech Republic, Croatia, Estonia, Hungary, Iceland, Lithuania, Morocco, Romania, Serbia, Slovenia
- Also with: ACMAD, ESA, EUMETSAT, WMO, JRC, CTBTO, CLRTAP
Phase3: hpcc & hpcd (IBM p690+)          Phase4: hpce & hpcf (IBM p575+)
- Power4+ 1.9 GHz                        - Power5+ 1.9 GHz, with SMT
- Peak 7.6 Gflops per PE                 - Peak 7.6 Gflops per PE
- Sustained ~0.5 Gflops per PE           - Sustained ~1 Gflops per PE
- 2176 PEs per cluster                   - 2240 PEs per cluster
- 32 PEs per node                        - 16 PEs per node -> 3x memory bandwidth per PE
Both phases use the same Federation switch.
History of the RAPS benchmark
[Figure: sustained Gflop/s (log scale, 10 to 10000) versus number of processors (log scale, 100 to 10000) for successive RAPS benchmarks: CRAY T3E-1200, 1998, RAPS-4, T213 L31; IBM p690+, 2004, RAPS-8, T799 L91; IBM p575+, 2006, RAPS-9, T799 L91]
T1279 (16 km)
- NGPTOT = 2,140,704
- TSTEP = 450 secs
- Flops for a 10-day forecast = 7.207 * 10^15
[Figure: global map at T1279 resolution; colour scale and latitude labels omitted]
Comparison of resolutions

  Resolution                   T1279 L91       T799 L91        T399 L62 (EPS * 50)
  Grid spacing                 16 km           25 km           50 km
  Number of grid-points        2,140,704       843,490         213,988
  Time-step                    450 secs        720 secs        1800 secs
  Flops for 10-day forecast    7.207 * 10^15   1.615 * 10^15   0.1013 * 10^15
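A quick consistency check on these numbers (arithmetic added here, not on the original slide): a 10-day forecast spans 864,000 seconds, so the number of time-steps is 864000 / TSTEP, i.e. 1920 steps at T1279, 1200 at T799 and 480 at T399. The flop ratio T1279/T799 (7.207 / 1.615 ~ 4.5) is slightly larger than grid-points x steps alone would predict ((2,140,704 / 843,490) x (1920 / 1200) ~ 4.1); the remainder is presumably the super-linear cost of the spectral transforms at higher resolution.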
RAPS9 10-day T799 L91 forecast: percentage of peak
[Figure: percentage of peak (0 to 16%) versus number of PEs (0 to 2500) for hpce T1279, hpce T799 and hpcd T799]
RAPS9 T799 L91 10-day forecast on hpce
[Figure: speed-up versus number of PEs (384, 768, 1024, 1536, 2048, 2240) for T799 and T1279 against ideal scaling]
RAPS9 T799 L91 10-day forecast on the Cray XT3 at ORNL
[Figure: speed-up versus number of PEs (960, 1920, 3072, 3940, 4800, 5200, 6144) for T799 against ideal scaling]
RAPS9 T799 L91 10-day forecast: OpenMP threads per MPI task on 96 nodes
[Figure: percentage of peak (4 to 11%) versus number of threads per MPI task (1 to 8), with and without SMT; 4 threads per task are used for operations]
RAPS9 10-day forecasts: message-passing comms on hpce

  Resolution   Nodes       MPI x OMP   Wall (secs)   % comms (barrier)   Tflop/s   % of peak
  T799 L91     24 nodes    96 x 8      4253          8.0%                0.38      13.0%
  T1279 L91    96 nodes    384 x 8     4836          11.5%               1.61      12.8%
  T799 L91     140 nodes   560 x 8     995           18.9%               1.60      9.4%
  T1279 L91    140 nodes   560 x 8     3506          13.8%               2.05      12.1%
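As a cross-check (arithmetic added here, not on the original slide), sustained rate times wall time reproduces the flop totals of the resolution table: 0.38 Tflop/s x 4253 s ~ 1.6 * 10^15 flops for T799, and 2.05 Tflop/s x 3506 s ~ 7.2 * 10^15 flops for T1279.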
Ensemble forecasts of hurricane Katrina
[Figure: ensemble-member and high-resolution forecast tracks, from 12 UTC Thursday 25 Aug 2005 and from 12 UTC Friday 26 Aug 2005; shading shows the probability that Katrina would pass within 120 km]
Integrated Forecasting System (IFS)
- IFS, 1992 to today:
  - A collaboration between METEO-FRANCE and ECMWF
  - Source: ~1.8 million lines
  - Fortran 95, some C
  - Good performance on scalar and vector systems
- IFS model characteristics:
  - Spectral
  - Semi-implicit
  - Semi-Lagrangian
IFS parallelisation
- Parallelised using mixed MPI and OpenMP
- MPI communications:
  - Transpositions between grid-point, Fourier and spectral spaces
  - Wide-halo exchanges for the semi-Lagrangian method and for radiation grid interpolation
  - Long messages, typically MPI_ISEND/RECV/WAITALL or collectives (see the sketch after this list)
- OpenMP:
  - Shared-memory nodes
  - Memory efficient
  - 4/8 threads used
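To make the message-passing pattern concrete, here is a minimal Fortran sketch of a non-blocking halo exchange in the MPI_ISEND/RECV/WAITALL style named above. It is illustrative only, not IFS source: the subroutine name, argument list and packed-buffer layout are assumptions.

  subroutine halo_exchange(nneigh, neigh, buflen, sbuf, rbuf, comm)
    ! Exchange packed halo buffers with a set of neighbour tasks:
    ! post all receives first, then the matching sends, then wait.
    use mpi
    implicit none
    integer, intent(in)  :: nneigh               ! number of neighbour tasks
    integer, intent(in)  :: neigh(nneigh)        ! their MPI ranks
    integer, intent(in)  :: buflen, comm         ! buffer length, communicator
    real(8), intent(in)  :: sbuf(buflen, nneigh) ! packed send halos
    real(8), intent(out) :: rbuf(buflen, nneigh) ! packed receive halos
    integer :: ireq(2*nneigh), ierr, i
    do i = 1, nneigh
      call mpi_irecv(rbuf(1,i), buflen, mpi_double_precision, &
                     neigh(i), 1, comm, ireq(i), ierr)
    end do
    do i = 1, nneigh
      call mpi_isend(sbuf(1,i), buflen, mpi_double_precision, &
                     neigh(i), 1, comm, ireq(nneigh+i), ierr)
    end do
    call mpi_waitall(2*nneigh, ireq, mpi_statuses_ignore, ierr)
  end subroutine halo_exchange

Posting the receives before the sends is the usual way to avoid unexpected-message buffering when messages are long, as they are here.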
T_L799, 1024 tasks, 2D partitioning
- 2D partitioning results in a non-optimal semi-Lagrangian comms requirement at the poles and the equator!
- Square-shaped partitions are better than rectangular ones.
Model / radiation grids
- Radiation computations are expensive. To reduce this cost we:
  - Run the radiation computations every hour, i.e. every 5th timestep for the T_L799 model
  - Run the radiation computations on a coarser grid (T_L399), which requires interpolation
- Two interpolation possibilities:
  - Gather global fields to different tasks: non-scalable, since global comms is bad and the number of fields can be less than the number of tasks
  - Perform the interpolation with only local comms for the halo: scalable, and implemented in IFS this way (see the sketch after this list)
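A minimal Fortran sketch of the scalable option, under stated assumptions: the routine name, the 4-point stencil and the precomputed index/weight arrays are invented for illustration and are not the IFS routines. Each task interpolates its own radiation-grid points from model-grid values it holds locally plus a halo obtained beforehand with neighbour-only communication (e.g. the halo_exchange sketch above).

  subroutine interp_to_radiation_grid(nrad, nmod, idx, w, fld_model, fld_rad)
    ! Interpolate a model-grid field to the locally owned points of
    ! the radiation grid, using precomputed neighbours and weights.
    implicit none
    integer, intent(in)  :: nrad             ! radiation points owned locally
    integer, intent(in)  :: nmod             ! model points: local + halo
    integer, intent(in)  :: idx(4, nrad)     ! surrounding model points per target
    real(8), intent(in)  :: w(4, nrad)       ! interpolation weights per target
    real(8), intent(in)  :: fld_model(nmod)  ! model field, halo already filled
    real(8), intent(out) :: fld_rad(nrad)    ! interpolated radiation field
    integer :: j
    do j = 1, nrad
      fld_rad(j) = sum(w(:,j) * fld_model(idx(:,j)))
    end do
  end subroutine interp_to_radiation_grid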
Reduced grids (linear)
- T_L399 reduced grid, &NAMRGRI namelist; note only factors 2, 3 and 5, for the Fourier transforms:

  &NAMRGRI
    NRGRI(1)=18,  NRGRI(2)=25,  NRGRI(3)=36,  NRGRI(4)=40,
    NRGRI(5)=45,  NRGRI(6)=50,  NRGRI(7)=60,  NRGRI(8)=64,
    NRGRI(9)=72,  NRGRI(10)=72, NRGRI(11)=75, NRGRI(12)=81,
    NRGRI(13)=90, NRGRI(14)=96, ...
    NRGRI(200)=800, ...
    NRGRI(398)=36, NRGRI(399)=25, NRGRI(400)=18,
  /

[Figure: T799 model grid (blue) and T399 radiation grid (red)]
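The "factors 2, 3 and 5" constraint is easy to test; here is a small hedged Fortran helper (not from IFS) that checks whether a per-latitude point count is acceptable in this sense:

  logical function fft_friendly(n)
    ! True if n (n >= 1) has no prime factors other than 2, 3 and 5,
    ! the condition quoted above for the Fourier transforms.
    implicit none
    integer, intent(in) :: n
    integer :: m
    m = n
    do while (mod(m, 2) == 0)
      m = m / 2
    end do
    do while (mod(m, 3) == 0)
      m = m / 3
    end do
    do while (mod(m, 5) == 0)
      m = m / 5
    end do
    fft_friendly = (m == 1)
  end function fft_friendly

Every NRGRI value listed above passes, e.g. 18 = 2 * 3^2, 81 = 3^4 and 800 = 2^5 * 5^2.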
PE=293: radiation grid T_L255 versus model grid T_L511
- The model and radiation grids for the same partition are offset geographically, because the reduced (linear) grid T_L255 is not a projection of T_L511
- Long, thin partitions make matters worse
eq_regions algorithm
Why eq_regions?
- eq_regions partitioning is broadly similar to the existing IFS 2D partitioning
  - 2D A-sets are similar to eq_regions bands
  - 2D partitioning is good for a regular lat-lon grid
  - eq_regions partitioning is more suited to a reduced grid
- Only one new data structure is required: N_REGIONS
- Code changes are straightforward (a sketch follows this list)
- eq_regions partitioning works for any number of tasks, not just task counts that have nice factors
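A hedged sketch of the kind of loop change meant here; apart from N_REGIONS, every name below is invented for illustration and is not IFS source:

  subroutine sweep_partitions(n_a_sets, n_b_sets, n_bands, n_regions)
    implicit none
    integer, intent(in) :: n_a_sets, n_b_sets   ! 2D partitioning shape
    integer, intent(in) :: n_bands              ! eq_regions bands
    integer, intent(in) :: n_regions(n_bands)   ! the one new data structure
    integer :: ja, jb
    ! 2D partitioning: a fixed number of EW partitions per NS band
    do ja = 1, n_a_sets
      do jb = 1, n_b_sets
        call work_on_partition(ja, jb)          ! illustrative worker routine
      end do
    end do
    ! eq_regions: the partition count varies per band via N_REGIONS
    do ja = 1, n_bands
      do jb = 1, n_regions(ja)
        call work_on_partition(ja, jb)
      end do
    end do
  end subroutine sweep_partitions

This is why arbitrary task counts work: the per-band counts only have to sum to the number of tasks, as the listings on the following slides show.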
Other partitioning approaches, e.g. quadrangles
- Difficult to implement in IFS (but not impossible)
- Nothing in common with the 2D partitioning approach
- C. Lemaire and J.C. Weill, "Partitioning the sphere with constant area quadrangles", 12th Canadian Conference on Computational Geometry, 23 March 2000
2D partitioning: T799, 1024 tasks (NS=32 x EW=32)
eq_regions partitioning: T799, 1024 tasks
- N_REGIONS(1..29) = 1, 7, 13, 19, 25, 31, 35, 41, 45, 48, 52, 54, 56, 56, 58, 56, 56, 54, 52, 48, 45, 41, 35, 31, 25, 19, 13, 7, 1 (sum = 1024)
2D partitioning: T799, 251 tasks (NS=251 x EW=1; 251 is prime)
eq_regions partitioning: T799, 251 tasks
- N_REGIONS(1..15) = 1, 7, 12, 17, 22, 25, 28, 27, 28, 25, 22, 17, 12, 7, 1 (sum = 251)
T799, 512 tasks, 2D partitioning, task 201
[Figure: T799 model grid and T399 radiation grid for this task]
T799, 512 tasks, eq_regions partitioning, task 220
[Figure: T799 model grid and T399 radiation grid for this task]
Grid interpolation halo area (512 tasks; T799 = model grid, T399 = radiation grid)
[Figure: halo area per task (0 to 5000 points) against task number (1 to 512) for four cases: 2D T799-to-T399, eq_regions T799-to-T399, 2D T399-to-T799 and eq_regions T399-to-T799; the model-to-radiation-grid halos are the larger ones]
- The halo area includes a task's own grid points
T799 performance (comparing 2D and eq_regions partitioning)

  Application   Tasks x threads   2D partitioning (secs)   eq_regions (secs)   2D / eq_regions
  model         512 x 2           3648                     3512                1.039
  4D-Var        96 x 8            3563                     3468                1.027

Good:
- Reduced semi-Lagrangian comms
- Reduced memory requirements
Bad:
- Increased TRGTOL/TRLTOG comms (grid-point to Fourier space)
  - Less of an issue for thin nodes, as relatively more of the comms is on the switch anyway
Summary
- eq_regions partitioning has been implemented in IFS
  - Both 2D and eq_regions partitioning are supported
  - eq_regions is the default partitioning
  - Available in IFS cycle CY31R2
- eq_regions reduces the semi-Lagrangian communication cost
  - Also for model / radiation grid interpolation
- eq_regions has a small performance advantage over 2D partitioning
QUESTIONS?