Expression of Interest submitted in response to Call EOI.FP6.2002, Sub-Thematic Priority: i

EDSN: European Distributed Supercomputer Network

Gabrielle Allen (1), Jarek Nabrzyski (2), Edward Seidel (1)
(1) Max-Planck-Institut für Gravitationsphysik (AEI), Golm, Germany
(2) Poznań Supercomputing and Networking Center (PSNC), Poznań, Poland

We address a serious shortfall of computational facilities available to European scientists and engineers by proposing, as an Integrated Project, the development of a Europe-wide, distributed system of high performance computers connected by very high speed networks. This facility would provide badly needed, and currently scarce, high-end computing cycles and computational resources to all EU researchers. Distributed across sites in multiple EU and associated states, it would enable the interconnection of geographically, academically and culturally diverse research groups. By dramatically advancing Europe's internal networking and grid infrastructure, it would bring Europe into a competitive position with respect to the US, and would also be a major step toward the development of a global grid, with the possibility of intellectual and computational resource sharing among many developed nations of the world.
EOI.FP6.2002: European Distributed Supercomputing Network [7th June 2002]

Contents
1 Rationale and Background ......... 3
2 Proposed Structure of Facility
  2.1 Hardware
  2.2 Focus Areas of the Facility and Staffing Issues
      Scientific and Engineering Applications
      Computational Science
      Educational Mission
      Role of Industry
  2.3 Administration
3 Benefits ......... 8
4 Cost and Funding Model ......... 8
5 Consortia and Support for this Project ......... 9
1 Rationale and Background

Virtually every field of science and engineering has been dramatically impacted in recent years by High Performance Computing (HPC). From astrophysics to psychology, from chemical engineering to economics, from unravelling the human genome to the development of the world wide web, advances in computational science and technology have revolutionised our ability to investigate problems in science and engineering. At the same time, they have powered advances in information technologies touching virtually all aspects of everyday life: web browsers, cell phones, game machines, home computers. In turn, these technologies constitute a powerful engine for the entire economy. Recognising the economic importance of these technologies, the European Council in March 2000 set the EU the goal of becoming "the most competitive and dynamic knowledge-based economy in the world", and the US Presidential panel for information technology (the PITAC report) stated that "developments and applications in information technology have accounted for a third of the growth in the U.S. economy in this decade [1990s]".

Given the key role that HPC plays in scientific advancement, and therefore in economic progress, it is worrying to many scientists and policy-makers that there is no Europe-wide programme to provide adequate access to supercomputing resources to all scientists who require them. The situation in the USA is dramatically different. Nearly two decades ago, the United States embarked on a programme to make high performance computing available to all US-based academic researchers through the establishment of five NSF-supported National Centres (PSC, Cornell, SDSC, NCSA, and Princeton).
The facilities at these centres were designed to be orders of magnitude larger than those available to individual research groups, allowing problems of unprecedented scale to be attacked; this was critically important to the development of science and engineering of the 21st century. This programme has resulted in sustained US dominance across virtually all areas of computational science, and in the scientific and engineering disciplines served by these resources. Most recently, following a series of re-competitions and the recommendations of the PITAC report, the NSF is supporting the construction of the TeraGrid, which will be the world's largest computing facility supporting academic research. (As an aside, this NSF initiative is dwarfed by the DOE ASCI programme, also available to some US academic researchers.) Even with such facilities, it is well understood in the US that these resources remain far too small for the kinds of problems researchers would like to address. Hence, a sustained upgrade path has been followed since the inception of the programme; the computing power has increased 5-6 orders of magnitude in the last 20 years, and will continue to increase at a similar rate in the future. With such large scale facilities available to US researchers for many years, and with so many clear successes in both the academic and commercial worlds, an obvious question arises: why is there no comparable programme for High Performance Computing in Europe? The EU has funded numerous large scale research collaborations. However, it is clear to those involved in such projects that without access to adequate resources, our research capabilities are severely limited in comparison to similar US-based initiatives. (This observation is from experience; one of us (E.S.) is the leader of an EU Astrophysics Network, funded by the 5th Framework Programme.
In this case, close ties to the US community have provided time on US facilities, which has been crucial for the success of the Network.) With the aggressive pace of advancement of US facilities, EU researchers are simply falling further and further behind each year. Although a few individual EU member countries have developed their own large scale computing facilities, these facilities are smaller than their US counterparts, and are often restricted to certain communities (e.g., Max-Planck institutes in Germany). With no coordination, and lacking an intra-EU Grid infrastructure and administration, it is impossible to use these systems as a single, larger facility. Hence, even the largest single European systems are far behind what will soon be available in the US, a fact which on its own strongly argues for an EU-wide integrated project. But much worse, most European countries simply do not have any such national facilities, and hence, unless special arrangements can be made, they are simply cut off; their ability to study real-world problems, at the level of scale and complexity needed to mimic nature, is severely restricted. For many European researchers, even in well developed countries, this means that their science
and engineering are becoming simply irrelevant. For less favoured regions, especially those in the east that have difficulty even accessing complex remote web sites due to poor networking infrastructure, their highly talented scientists and engineers are either left in the dust or move to where the resources are. A central EU computing facility, with adequate networking, would rapidly allow Europe and the world to harness this untapped scientific and engineering potential. In Appendix A we have included supporting letters from prominent scientists and engineers from numerous EU countries as a testament to the need for such a facility. These letters are only a sample of the demand and enthusiasm for this idea, gathered since the call for EoIs. The latent demand in Europe is very strong.

We propose to address this severe imbalance by developing an all-encompassing, Europe-wide computing facility, available to all EU academic researchers, with the scale needed to study problems in the coming decades. With appropriate contributions from the EU and various member and associated states, such a facility could be of a scale beyond that of US-based facilities. Drawing on the unique and diverse strengths within Europe, it could reach unparalleled levels of excellence, with far reaching benefits for the kind of science and engineering that can be achieved, as well as for research in computation, Grids, visualization and algorithms. Furthermore, with comparable resources, a new level of cooperation between US and European researchers could occur, including the sharing of both intellectual and computational resources. We make this proposal as scientists, as users of supercomputing facilities.
There are already several consortia in Europe interested in building supercomputing, networking and Grid infrastructure who are also preparing EoIs, and we intend to convene a meeting this summer to bring together the interested parties from all the different communities and interest groups involved, to determine a set of requirements and the appropriate partnerships needed to fulfil them. A strong partnership among scientist-users, computer scientists, computer centres, network builders and managers, and equipment suppliers will be necessary if an effective supercomputer facility is to take shape in Europe. We want to stress that while this proposal aims to build badly needed computing infrastructure, it aims just as strongly at enhancing existing scientific, engineering, and computational applications across all research domains, and at developing new applications. Hence, the eventual proposal for this facility may also include various Networks of Excellence aimed at supporting these activities. The development of strong application groups, from bio-informatics to Grid tool development, across the EU-wide nodes in this facility, would be a crucial aspect. We feel strongly that such a balance between infrastructure and application research is the key to success. One of the primary goals of EU programmes is to bring together various countries' research efforts, building a better integrated European community, and particularly to bolster efforts in underdeveloped regions (especially those in the east). This facility and its various research activities would have a dramatic effect in this regard. It would be at once distributed across both developed and less favoured regions, yet integrated through very high speed networking and a central management structure.
Not only would the computing resources be shared, but a joint European facility would naturally lead to joint research projects and cross-fertilization of different research and geographic areas through, for example, sharing of algorithms, expertise and training. In this regard, this facility would serve as a magnet for various research projects from all EU members.

2 Proposed Structure of Facility

2.1 Hardware

The final configuration for this project will depend crucially on the assembled consortium, existing and proposed networks, and additional funding from member states. As a point of reference, the US TeraGrid model is built on a Linux supercluster across 4 sites, built and integrated by IBM. The following hardware descriptions are just to illustrate the kind of
facility we currently envision. Most technical details are irrelevant at this point, except that the design should be fundamentally distributed, with advanced high speed networking.

Location: Our distributed facility could have two major centres, one in Eastern Europe and one in Western Europe, and a number of somewhat smaller sites distributed across the EU, from every region (larger countries could have a single facility, while smaller countries may pool resources to create, e.g., a Baltic node in the facility). Computing sites may be developed on top of existing centres, or, with significant co-funding from national agencies, entirely new centres could be created.

Network: It will be crucial to work with European high speed network providers, in particular with Dante, who manage the GÉANT network, which is already funded by the EU and the member states. There are two systems to consider: the access network from all sites, and the backplane network connecting the main distributed compute resources. We have already contacted Dante with respect to these different requirements. The high-speed backplane network should support IPv6 independently of the traditional IPv4. The natural way to fulfil these requirements would be to use an Optical Transport Network based on DWDM technology, with multiple parallel communication channels (lambdas) whose bandwidth depends on user demand (e.g. 1Gbps, 2.5Gbps, 10Gbps, 40Gbps, ...). On every lambda, both the IPv4 and IPv6 protocols would be available, framed in various technologies such as xGigE or POS (Packet over SONET). An example architecture of such a network is described in [?]. The high speed network should allow dynamic configuration of multiple backplanes and virtual grids, which can be dedicated to different classes of grid applications. The backplane network must be intelligently adaptive to dynamic and evolving situations.
It will need to respond dynamically to application requirements, and provide automated network management and QoS. Significant networking research will be needed to vertically integrate applications and to provide the tools and services needed to support them. Here we will leverage heavily EU funded networking projects such as 6NET, SEQUIN, ATRIUM and others. Strong collaboration with networking centres of excellence, like HLRC, PSNC, and National Research and Education Network operators, will also be crucial, in addition to connections to US and Asian centres.

Compute Nodes: The integrator could be IBM, HP/Compaq, Sun or other vendors, or most likely a combination of interested vendors, and their support is being sought (we are already in discussions with each of them; see support letters in Appendix C). Each location should have a sizable cluster (1000+ processors), powered by advanced microprocessors (we are already in discussion with Intel for support in the case that IA-64 McKinley/Madison processors are chosen; see Appendix C). Possibly the machine, or part of it, could be dual-bootable with Linux/Windows, or a mixture, in which case Microsoft support should be sought. The aggregate size of the entire facility should be at least 50TF/50TB. The individual sites should be connected for distributed and grid computing, as described in the previous section, using e.g. lambda technology and/or multi-bladed GigE networks of order 100Gbit/sec, and the entire system should be accessible both for occasional single large scale jobs and decomposed into smaller machines for multiple simultaneous jobs. A separate I/O network should be developed as well, with similar capabilities.
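As a rough consistency check on these figures, the aggregate target can be related to the per-site clusters with back-of-envelope arithmetic. The site count and per-processor performance below are our own illustrative assumptions, not part of the proposed configuration:

```python
# Back-of-envelope check of the proposed 50TF aggregate capacity.
# Assumed for illustration only: 10 sites, each hosting a 1000-processor cluster.
sites = 10
procs_per_site = 1000
target_tflops = 50.0

total_procs = sites * procs_per_site
# Peak performance each processor would need to deliver, in GFlop/s.
gflops_per_proc = target_tflops * 1000 / total_procs

print(f"{total_procs} processors, ~{gflops_per_proc:.1f} GFlop/s each")
```

Under these assumptions each processor would need to deliver about 5 GFlop/s peak, which is in the range of the advanced microprocessors discussed above; a different site count or processor choice would scale the figure proportionally.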
The ability to configure the system in different ways, possibly with different networking technologies between the sites, would enable the facility not only to provide badly needed simulation capabilities for its constituent engineering and scientific communities, but also to serve as a crucial laboratory for researching the distributed computing architectures of the future, Grid technologies, and the software frameworks that applications need to take advantage of them.

Visualization and Collaboration: All sites would be well equipped with high speed video conferencing, access grid technology, and high performance scientific visualization machines. While one or two sites should be
equipped with a Cave-like VR environment, extensive use of Grid-enabled visualization techniques would allow researchers at remote sites to view advanced visualizations rendered on the central visualization machines. Not merely a copy of the US-based TeraGrid, this facility would be stronger in various ways, differentiated by its ability to be multiply configured, and drawing on the various national research strengths present in the EU. (Compared to the US, where internal national borders do not exist, the different EU groups bring a greater variety of research approaches to various problems.) This combination of a unique computing facility with a larger diversity of research strengths and approaches present within the EU has the potential to make this facility much stronger in many ways than its US counterpart. The facility itself, including various scientific, computing and administrative staff, would provide a focus to bring together these different efforts across all participating EU countries. Research groups across the EU, from countries as diverse as Spain and Poland, the UK and Greece, would be rapidly brought together through this facility and its high speed networks. The facility would also be differentiated from US centres in that it could address problems that essentially do not exist there; for example, language and cultural issues that are much more pronounced in Europe could be studied and addressed. Further, by solving problems created by national borders, individual funding agencies, and individual administration policies, it becomes much easier to develop the infrastructure and policies needed to form a single, global Grid system that would integrate facilities from the US, EU, and Asia. (This is highly desired by the worldwide community; see support letters from centre Directors of major US and Asian facilities in Appendix B.)
It should be emphasised that this should be considered a starting point for such a facility, which should be aggressively maintained and expanded over the coming years. It is not a one-shot investment! It should support European science and engineering for the coming decade and beyond, and should be scaled up in capacity over time to meet the increasing needs of the various communities. A continuing annual investment of order 30-40% of the initial capital outlay would be needed to keep the facility at the leading edge. Note that networking capabilities double on a faster scale (9 months) than computing (18 months), a fact which needs to be taken into account. See further comments on the funding model below.

2.2 Focus Areas of the Facility and Staffing Issues

Scientific and Engineering Applications

This project must be fundamentally application driven! Most research fields are active in many European countries, and a networked supercomputer centre is an ideal vehicle for strengthening collaboration among scientists in Europe. Therefore, in addition to the central support for a strong IST component described below, various regional centres in the system should be identified as support centres for key scientific/engineering research fields, with the responsibility to reach out to their communities to integrate them into the HPC world. Such areas could include astrophysics, aerospace engineering, computational biology, earth and space science, climate modelling, bio-informatics, industrial engineering, etc. These groups would develop leading-edge application software specific to their field, acquire any hardware needed specially for supporting computing in that research field, and support and develop visualization requirements specific to the field. Importantly, the groups would contain leading-edge research scientists as well as software developers, whose interaction would ensure that the centres stay relevant to their fields.
Moreover, the centres should support young scientists from all European countries, who could train there for a year or two and return to their home countries, ready to make effective use of the European facility. In this way, each smaller centre could build up a particular Centre of Expertise (e.g., a Spanish affiliate could become the Hydrodynamics Support Centre, a Dutch affiliate could become the Bio-informatics Centre, etc.). These groups would also be strongly tied into the educational mission described below. Many such groups, from individual scientists, to research institutes, to national academies of science, to the European Space Agency, have expressed their strong interest in writing;
representative support letters are included in Appendix A.

Computational Science

One, or at most two, locations should be considered the primary centre for the development of general-purpose scientific computing software tools (e.g., toolkits for HPC such as Cactus, Grid computing toolkits, scientific visualization, etc.). This should include multiple staff positions for professionals to form a focused core, working closely with the various existing universities and research institutes affiliated with the centre. It would be important for the success of the centre, as a single facility, to ensure a very strong IST development component from the beginning, aimed at deploying emerging technologies, such as those supported by various EU- and other-funded Grid projects, as quickly as possible. Appropriate cooperative MOUs between Grid projects and the Centre would be developed, leading to closer coordination between the different groups and projects researching and developing Grid technologies, and integrating them with application communities. Such cooperation will clearly be enthusiastically supported; for example, principal investigators of the GridLab project are the authors of the present proposal; see also support letters from other Grid projects and Directors of major US-based PACI centres and Asian projects.

Educational Mission

The facility should also have a strong education and training mission covering high performance computing, computational science, as well as the various research/engineering disciplines. These activities would draw on staff at the various sites within the facility, as well as on the research communities and EU projects which it brings together.
These activities will help address the lack of expertise in high performance computing throughout the different EU research communities, while at the same time further helping to integrate areas across the EU that are distinct both geographically and in research domain.

Role of Industry

It will be important to have an industrial component, both helping to support the facility and helping to support industries which benefit from high performance computing. Industrial partners from the computing industry, engineering, and scientific companies around the EU will be sought. This type of model has worked well in the US, where, for example, NCSA's Industrial Partner Program has involved partners from areas as diverse as the insurance, oil, aerospace, finance, and automotive industries. Various partnerships are possible, but typically they would involve computational services and training in return for support of the facility.

2.3 Administration

The facility would be administered as a single, but distributed, entity, with staff distributed across all locations. Allocations would be peer reviewed; applications would be accepted from all EU countries, and possibly from the US, assuming reciprocal agreements can be made. Special allocations would be earmarked for developing countries in the EU. Although support staff would be distributed across all sites, one or two centres would be the primary sites for this. EU supported projects (e.g., 5th and 6th Framework Programme collaborations) could be given a certain fraction of the allocations. Partnerships could be established between the network and consortia of research groups, universities, or laboratories. EU-funded research networks would be natural partners in many cases. Principal partner sites could be provided with high speed networking to the primary centres, video conferencing facilities, etc. Specific research areas would be organised across various interconnected subgroups.
With such a facility, serious partnerships with the US and Asia, in both HPC and engineering/scientific disciplines, would be possible for the first time. The sharing of computational resources across continents would be developed,
further driving the quest for high speed international networking connectivity. This is precisely the kind of world predicted by advocates of the Grid, but without such an EU facility, such a development will be greatly slowed. (See support letters from major US and Asian HPC/Grid Centres.)

3 Benefits

The benefits of such a facility would be far reaching, and would revolutionise European science and technology. In a nutshell, they include:

- A vastly improved level of computation-based science and engineering across all of Europe.
- A central point for the development of computational science tools and algorithms to enhance the efforts of the EU scientific/engineering community.
- An immediate integration of scientific and engineering communities from different countries, both developed and less favoured, through a centralised EU facility. The facility's distributed, yet centralised, nature enhances this benefit.
- A facility for the development, testing, and deployment of Grid technologies under development in the EU and elsewhere.
- A direct application-oriented focus driving high speed network infrastructure development within Europe.
- A central point of contact to US and Asian HPC centres.
- The ability to share resources across nations, including the US and Asia, for even larger scale scientific/engineering computing, for intercontinental load balancing, and for the research, development and deployment of worldwide Grid systems.
- A huge driving force for international high speed networking.

4 Cost and Funding Model

This distributed facility, along with the associated support and research staff, would require an initial investment of at least 120M Euro across the first few years. A well constructed programme for sustained, long term investment into the future would also be crucial.
Funding should be provided not only by the EU, but also by individual nations' research organizations, as well as by major vendors in the computing world. The programme would be made attractive to individual EU countries, encouraging them to buy into the project. A possible funding model would be for the EU to pay a reasonably large fraction of the costs at each site, with additional significant local contributions. For example, the two major centres could be significantly (50-75% or more) funded by the EU, while the connective network infrastructure to the smaller national centres would be primarily EU-funded. The smaller national centres would ideally be more significantly funded (perhaps 75%) by individual member states, as would most of the local staff at those centres. Research staff at each centre would be jointly funded by both EU and national sources. Young scientists trained at regional centres could be supported by the EU in the same way as it supports mobility training now. With such a model, by putting in a comparatively small investment, individual nations would be able (and presumably eager) to have a local centre, since they would be automatically connected to the entire installation. In this way, they would have access to the resources of the entire system, giving scientists and engineers from their country access to the much larger facility. Hence, if done properly, even countries with relatively large and well equipped centres (e.g., the UK, Germany, and France) should be enthusiastic about joining and building up around their existing strengths.
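To make the cost-sharing model concrete, a small worked example follows. All per-centre figures and the site count are illustrative assumptions chosen to match the percentages above; they are not proposed budget lines:

```python
# Hypothetical split of a 120M Euro initial outlay between the EU and
# the member states, using the illustrative percentages quoted above.
# Assumed for illustration: two major centres at 30M Euro each,
# ten smaller national centres at 6M Euro each.
major_cost, n_major = 30.0, 2    # MEuro per major centre, count
small_cost, n_small = 6.0, 10    # MEuro per smaller centre, count

# Major centres ~75% EU-funded; smaller centres ~75% nationally funded.
eu_share = n_major * major_cost * 0.75 + n_small * small_cost * 0.25
national_share = n_major * major_cost * 0.25 + n_small * small_cost * 0.75
total = eu_share + national_share

print(f"EU: {eu_share}M Euro, national: {national_share}M Euro, total: {total}M Euro")

# Sustained upkeep of 30-40% of the initial outlay per year, as argued
# in Section 2.1.
annual_upkeep = (0.30 * total, 0.40 * total)
```

Under these assumptions the EU and the member states each contribute about 60M Euro of the 120M Euro outlay, with each member state paying only a modest share for access to the whole system, and annual upkeep would run to roughly 36-48M Euro.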
We have initial support from various vendors for the facility (see support letters from Intel, HP/Compaq, and Sun in Appendix C). With the scientific administrative plan described above, the various local centres would be driven by application groups in particular domains of expertise, such as hydrodynamics, bio-informatics, etc., making it even more attractive for local research organizations to fund them. Clearly, if most EU member states are able to contribute, the overall funding levels and strength of the facility would be greatly enhanced. In this model, not only do individual member states get access to computing facilities unparalleled in the world, they also directly strengthen their own existing research base.

5 Consortia and Support for this Project

The primary institutions involved in preparing this EOI are the Max Planck Institute for Gravitational Physics (Albert Einstein Institute) and the Poznań Supercomputing and Networking Centre. However, many other institutes, including the European Space Agency, have been involved in supporting this initiative and advising in the preparation of the EOI. We list a number of these institutes in Appendix A. A project of this scale will require enthusiastic support and cooperation among (a) many institutes across the EU, including scientific and engineering research sites, (b) existing computation centres, and (c) high speed networking organizations. We have made contact with representatives of all three, and find universal support for the ideas contained in this EOI. In some cases, separate EOIs are being submitted. Although there are differences between them, some of the ideas presented are similar. We plan to organise a conference after the EOIs are submitted, to determine how best to organise consortia for the different proposals that would be submitted, should the Commission decide to make a call for proposals of this type.
We have assembled a broad array of supporting scientists, engineers, and institutions, from all regions of the EU and Associated States, for this initiative. The level of support is overwhelming and convincing. From literally hundreds of researchers who were polled, we received not a single negative response. A sample of the support letters has been included in Appendix A, and a web link is provided for additional letters (see also eseidel/edsn). We also have strong support letters from the primary US and Asian computing sites, indicating both their support and their willingness to cooperate in developing common Grid infrastructure, resource allocation procedures, etc., should the facility be developed. Support letters are included in Appendix B. (A verbal commitment from high level IBM officials was obtained, but the support letter could not be included by the deadline.)