HECToR is the UK's new high-end computing resource available for research funded by the UK Research Councils. The HECToR Cray XT4 system began user service in October 2007 and comprises 5,564 dual-core 2.8 GHz AMD Opteron processors. The results of running a number of synthetic benchmarks and popular application codes used by the UK academic community are presented. The synthetic benchmarks include STREAM and MPI benchmarks. The application benchmarks include fluid dynamics, molecular dynamics, fusion, materials science and environmental science codes. The results are compared with those obtained on the UK's HPCx service, which comprises 160 IBM e-Server p575 16-way SMP nodes, each containing 8 dual-core 1.5 GHz IBM POWER5 64-bit RISC chips. Where appropriate, results are also included from the HECToR Test and Development system, which has a different memory structure from the main system. It is found that there is little difference between the systems when comparing similar numbers of processing cores, but HECToR is a much larger system with many more cores and a more scalable interconnect. Memory bandwidth is seen to be a bottleneck for certain applications on both systems, with HECToR more seriously affected.
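For reference, the memory-bandwidth pressure mentioned above is exactly what the STREAM benchmark is designed to expose. The following is a minimal C sketch of a STREAM-style triad kernel, not the official benchmark source; the array size and timing approach are illustrative only.

```c
/* Minimal sketch of a STREAM-style triad kernel, illustrating the kind of
 * memory-bandwidth measurement referred to above. The array size N and the
 * timing approach are illustrative, not the official STREAM source. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (20 * 1000 * 1000)   /* large enough to exceed cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)        /* triad: a = b + scalar * c */
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    /* three arrays of N doubles move through memory: 3 * 8 * N bytes */
    printf("Triad bandwidth: %.2f GB/s\n", 3.0 * 8.0 * N / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```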
FPGAs have been around for over 30 years and are a viable accelerator for compute-intensive workloads on HPC systems. The adoption of FPGAs for scientific applications has been stimulated recently by the emergence of better programming environments, such as High-Level Synthesis (HLS) and OpenCL, available through the Xilinx SDSoC design tool. The mapping of the multi-level concurrency available within applications onto HPC systems with FPGAs is a challenge. OpenCL and HLS provide different mechanisms for exploiting concurrency within a node, leading to a concurrency mapping design problem. In addition to the performance of different mappings, there are also questions of resource usage, programmability (development effort), ease of use and robustness. This paper examines the concurrency levels available in a case study kernel from a shallow water model and explores the programming options available in OpenCL and HLS. We conclude that the SDSoC Dataflow-over-functions mechanism, targeting functional parallelism in the kernel, provides the best performance in terms of both latency and execution time, with a speedup of 314x over the naive reference implementation.
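As background, the "Dataflow over functions" mechanism referred to above corresponds to applying the HLS DATAFLOW directive across a chain of function calls so that the stages execute concurrently. Below is a hedged C sketch of that pattern; the function names, array sizes and stage bodies are illustrative and are not the paper's shallow-water kernel.

```c
/* Hedged sketch of "dataflow over functions": the top-level kernel is split
 * into producer/consumer stages, and the HLS DATAFLOW pragma lets the
 * synthesis tool run the stages concurrently. Function names, buffer sizes
 * and stage bodies are illustrative only. */
#define N 256

static void read_in(const float *in, float buf[N])
{
    for (int i = 0; i < N; i++) buf[i] = in[i];
}

static void compute_flux(const float buf[N], float flux[N])
{
    for (int i = 0; i < N; i++) flux[i] = 0.5f * buf[i];   /* placeholder stage */
}

static void write_out(const float flux[N], float *out)
{
    for (int i = 0; i < N; i++) out[i] = flux[i];
}

void kernel_top(const float *in, float *out)
{
#pragma HLS DATAFLOW          /* stages below execute as a pipeline of tasks */
    float buf[N], flux[N];
    read_in(in, buf);
    compute_flux(buf, flux);
    write_out(flux, out);
}
```

A standard C compiler ignores the pragma, so the same source can be tested on the host before synthesis.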
We report on the activities of the Computational Science & Engineering Department at CCLRC Daresbury Laboratory in the evaluation of Cray high-end and mid-range systems. We examine the performance of applications from computational fluid dynamics, coastal ocean modelling and molecular dynamics, as well as kernels from the HPC Challenge benchmark. We find that the Cray X1 and Cray XT3 are highly competitive with contemporary systems from IBM and SGI, the precise ranking of these systems being application dependent. We examine the performance of the Cray XD1 as a mid-range computing resource. It performs well, but a PathScale InfiniPath cluster performs equally well at a fraction of the cost. InfiniPath systems appear particularly competitive for runs on 32 and 64 processors, still considered the 'sweet spot' for the majority of applications on mid-range systems. A successor to the XD1 is required if Cray are to provide a cost-effective solution in the mid-range cluster market.
Despite a significant decline in their popularity in the last decade, vector processors are still with us, and manufacturers such as Cray and NEC are bringing new products to market. We have carried out a performance comparison of three full-scale applications: the first, SBLI, a Direct Numerical Simulation code from computational fluid dynamics; the second, DL_POLY, a molecular dynamics code; and the third, POLCOMS, a coastal-ocean model. Comparing the performance of the Cray X1 vector system with two massively parallel (MPP) micro-processor-based systems, we find three rather different results. The SBLI PCHAN benchmark performs excellently on the Cray X1 with no code modification, showing 100% vectorisation and significantly outperforming the MPP systems. The performance of DL_POLY was initially poor, but we were able to make significant improvements through a few simple optimisations. The POLCOMS code has been substantially restructured for cache-based MPP systems and now does not v...
Concurrency and Computation: Practice and Experience, 2018
Summary: Ongoing transistor scaling and the growing complexity of embedded system designs have led to the rise of MPSoCs (Multi-Processor System-on-Chip), combining multiple hard-core CPUs and accelerators (FPGA, GPU) on the same physical die. These devices are of great interest to the supercomputing community, which is increasingly reliant on heterogeneity to achieve power and performance goals in these closing stages of the race to exascale. In this paper, we present a network interface architecture and networking infrastructure, designed to sit inside the FPGA fabric of a cutting-edge MPSoC device, enabling networks of these devices to communicate within both a distributed and shared memory context, with reduced need for costly software networking system calls. We will present our implementation and prototype system and discuss the main design decisions relevant to the use of the Xilinx Zynq UltraScale+, a state-of-the-art MPSoC, and the challenges to be overcome given the device'...
We present an approach, which we call PSyKAl, that is designed to achieve portable performance for parallel, finite-difference ocean models. In PSyKAl the code related to the underlying science is formally separated from code related to parallelisation and single-core optimisations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture and allows optimisation specialists to tailor the code for a particular machine independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new, shallow-water model named NEMOLite2D. In doing this we have a code which is of a manageable size and yet which incorporates elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle/PSy layer in order to achieve good perf...
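To illustrate the separation of concerns described above, here is a hedged C sketch of the three PSyKAl layers (the real NEMOLite2D code is Fortran, and the kernel, invoke and algorithm routines below are hypothetical): the kernel holds single-point science, the PSy layer owns the loops and any parallelisation, and the algorithm layer calls the PSy layer without knowledge of the hardware.

```c
/* Hedged sketch of the PSyKAl separation of concerns (the real NEMOLite2D
 * code is Fortran; names and interfaces here are hypothetical). The kernel
 * knows nothing about grid traversal or parallelisation; the PSy layer owns
 * the loops and is the only place an optimisation specialist touches. */

/* Kernel layer: pure single-point science, no loops, no parallel code. */
static void continuity_kernel(int i, int j, int nx,
                              const double *u, const double *v,
                              double *ssha, double dt)
{
    ssha[j * nx + i] = dt * (u[j * nx + i] - v[j * nx + i]);  /* placeholder */
}

/* PSy layer: loop structure, OpenMP/MPI and cache blocking live here. */
static void invoke_continuity(int nx, int ny, const double *u, const double *v,
                              double *ssha, double dt)
{
    #pragma omp parallel for
    for (int j = 1; j < ny - 1; j++)
        for (int i = 1; i < nx - 1; i++)
            continuity_kernel(i, j, nx, u, v, ssha, dt);
}

/* Algorithm layer: the scientist's time-stepping code, hardware-agnostic. */
void step(int nx, int ny, const double *u, const double *v,
          double *ssha, double dt)
{
    invoke_continuity(nx, ny, u, v, ssha, dt);
}
```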
Summary form only given. Advances in computational science are closely tied to developments in high-performance computing. We consider the case of shelf sea modelling, where models have been growing in complexity and where model domains have been growing and grid resolutions shrinking in pace with the increasing storage capacity and computing power of high-end systems. Terascale systems are now readily available, with performance levels measurable in TeraFlop/s and memories counted in TeraBytes. The scientific case is now being made for regional models at 1 km resolution, allowing the accurate representation of eddies, fronts and other regions containing steep gradients. The hydrodynamic model is increasingly being coupled with other models in multidisciplinary studies, e.g. ecosystem modelling and wave modelling. We show that the performance attainable from the POLCOMS hydrodynamic code is measurable at about 0.5 TeraFlop/s on an IBM p690 cluster with 1024 processors. The scalability...
A Large Eddy Simulation code, based on a non-orthogonal, multiblock, finite volume approach with co-located variable storage, was ported to three different parallel architectures: a Cray T3E/1200E, an Alpha cluster and a PC Beowulf cluster. Scalability and parallelisation issues have been investigated, and merits as well as limitations of the three implementations are reported. Representative LES results for three flows are also presented.
The incompressible smoothed particle hydrodynamics (ISPH) method with projection-based pressure correction has been shown to be highly accurate and stable for internal flows. We have proposed a focused effort to optimise an efficient fluid solver using the ISPH method for violent free-surface flows on offshore and coastal structures. In ISPH, previous benchmarks showed that the simulation costs are dominated by the neighbour searching algorithm and the pressure Poisson equation (PPE) solver. In parallelisation, a Hilbert space filling curve (HSFC) and the Zoltan package have been used to perform domain decomposition, where preservation of spatial locality is critical for the performance of neighbour searching. The following highlights the major developments:
• The map kernel, which provides the functionality of sending particles and their physical field data blocks to an appropriate partition, has been rewritten and optimised with MPI one-sided communications and the Zoltan distributed directory utility. The share of run time spent in the map kernel is now reduced to less than 20% for simulations with particles distributed evenly across the domain. For non-evenly distributed particles, using only non-empty cells will be part of future work.
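The rewritten map kernel described above is based on MPI one-sided communication. The sketch below shows the general pattern of exposing a receive buffer as an MPI window and depositing particle data into a remote rank with MPI_Put between fence synchronisations; the slot layout, counts and choice of destination rank are illustrative, and the Zoltan directory lookup used in the real code is omitted.

```c
/* Hedged sketch of moving particle data to its destination rank with MPI
 * one-sided communication, in the spirit of the map kernel described above.
 * The fixed-size slot layout and the choice of destination are illustrative;
 * the real code uses the Zoltan distributed directory to find owners.
 * Compile and run with an MPI implementation (mpicc / mpirun). */
#include <mpi.h>
#include <stdio.h>

#define MAX_RECV 1024    /* illustrative per-rank receive buffer size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double recv_buf[MAX_RECV];            /* exposed to remote Puts */
    MPI_Win win;
    MPI_Win_create(recv_buf, MAX_RECV * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* One particle's field data, destined for the next rank (illustrative). */
    double particle[3] = {1.0 * rank, 2.0 * rank, 3.0 * rank};
    int dest = (rank + 1) % size;

    MPI_Win_fence(0, win);                         /* open access epoch   */
    MPI_Put(particle, 3, MPI_DOUBLE,               /* origin data         */
            dest, (MPI_Aint)(3 * rank), 3,         /* target rank/offset  */
            MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                         /* complete all Puts   */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```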
The incompressible smoothed particle hydrodynamics (ISPH) method with projection-based pressure correction has been shown to be highly accurate and stable for internal flows. This paper describes an alternative parallel approach for domain decomposition and dynamic load balancing which uses a Hilbert space filling curve to decompose the cells, with the number of particles in each cell as the cell's weight. This approach can distribute particles evenly to MPI partitions without losing the spatial locality which is critical for neighbour list searching. As a trade-off, the subdomain shapes become irregular. An unstructured communication mechanism has also been introduced to deal with halo exchange. Solving the sparse linear system for the pressure Poisson equation is one of the most time-consuming parts of ISPH, using standard preconditioners and solvers from PETSc. The particles are reordered so that insertions of values into the global matrix become local operations without incurring extra communications, which also has the benefit of reducing the bandwidth of the coefficient matrix. The performance analysis and results showed promising parallel efficiency.
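As an illustration of the weighted decomposition described above, the following sketch assumes the cells have already been sorted by their Hilbert curve index (that computation is omitted) and assigns contiguous runs of cells to ranks so that each rank receives roughly the same total particle count. It is a greedy, purely illustrative version of the idea, not the paper's MPI/Zoltan implementation.

```c
/* Hedged sketch of the weighted decomposition described above: cells are
 * assumed to be pre-sorted by Hilbert curve index, and contiguous runs of
 * cells are assigned to ranks so that each rank receives roughly the same
 * total particle count. Purely illustrative. */
#include <stdio.h>

/* Assign each of ncells (Hilbert-ordered) cells to one of nparts ranks. */
static void partition_by_weight(int ncells, const int *particles_per_cell,
                                int nparts, int *owner)
{
    long total = 0;
    for (int c = 0; c < ncells; c++) total += particles_per_cell[c];

    long target = (total + nparts - 1) / nparts;   /* weight per rank */
    long acc = 0;
    int rank = 0;
    for (int c = 0; c < ncells; c++) {
        owner[c] = rank;
        acc += particles_per_cell[c];
        if (acc >= target && rank < nparts - 1) {  /* move to next rank */
            rank++;
            acc = 0;
        }
    }
}

int main(void)
{
    int weights[10] = {5, 1, 1, 8, 2, 2, 2, 7, 1, 3};  /* particles per cell */
    int owner[10];
    partition_by_weight(10, weights, 3, owner);
    for (int c = 0; c < 10; c++)
        printf("cell %d -> rank %d\n", c, owner[c]);
    return 0;
}
```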
The results from running a range of synthetic and application benchmarks on the HPCx and HECToR systems are presented and compared. It is found that there is not much difference between the systems in terms of comparing similar numbers of processing cores, but HECToR is a much larger system with many more cores and a more scalable interconnect. Memory bandwidth is seen to be a bottleneck for certain applications on both systems, with HECToR more seriously affected.
The results of benchmarking communications and several popular applications on HPCx Phase 2a are presented and compared with those results from running on HPCx Phase 2. In terms of communications, there is little performance difference between the ...
Bulletin of the Seismological Society of America, 2010
The seismic potential of southern China is associated with the collision between the Indian and the Eurasian plates. This is manifested in the western Sichuan Plateau by several seismically active systems of faults, such as the Longmen Shan. The seismicity observed on the Longmen Shan fault includes recent events with magnitudes of up to 6.5, and the 12 May 2008 Mw 7.9 Wenchuan earthquake. Herewith, as part of an ongoing research program, a recently optimized three-dimensional (3D) seismic wave propagation parallel finite-difference code was used to obtain low-frequency (≤ 0.3 Hz) 3D synthetic seismograms for the Wenchuan earthquake. The code was run on the KanBalam (Universidad Nacional Autónoma de México, Mexico) and HECToR (UK National Supercomputing Service) supercomputers. The modeling included the U.S. Geological Survey 40 × 315 km² kinematic description of the earthquake's rupture, embedded in a 2400 × 1600 × 300 km³ physical domain, spatially discretized at 1 km in the three directions and with a temporal discretization of 0.03 s. The compressional and shear wave velocities and densities of the geologic structure used were obtained from recently published geophysical studies performed in the Sichuan region. The synthetic seismograms compare favorably with the observed ones for several station sites of the Seismological and Accelerographic Networks of China, such as MZQ, GYA, and TIY, located at about 90, 500, and 1200 km, respectively, from the epicenter of the Wenchuan event. Moreover, the comparisons of synthetic displacements with differential radar interferometry (DInSAR) ground deformation imagery, as well as of maximum velocity synthetic patterns with Modified Mercalli intensity isoseists of the 2008 Wenchuan earthquake, are acceptable. 3D visualizations of the propagation of the event were also obtained; they show the source rupture directivity effects of the Mw 7.9 Wenchuan event. Our results partially explain the extensive damage observed on the infrastructure and towns located in the neighborhood of the Wenchuan earthquake rupture zone.
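For readers unfamiliar with the method, a code of the kind described above is built on an explicit finite-difference update of the wave equation. The following 1D sketch shows that update (u_tt = c² u_xx, second order in space and time) using the 1 km and 0.03 s steps quoted in the abstract; the wave speed, grid size and rigid boundaries are illustrative, and the real code is 3D, parallel and driven by a kinematic source model.

```c
/* Hedged 1D illustration of the explicit finite-difference update underlying
 * a 3D seismic wave propagation code (u_tt = c^2 u_xx, second order in space
 * and time). Wave speed, grid size and rigid boundaries are illustrative. */
#include <stdio.h>

#define NX 1000

int main(void)
{
    double c = 3000.0;            /* shear wave speed, m/s (illustrative) */
    double dx = 1000.0;           /* 1 km spatial step, as in the abstract */
    double dt = 0.03;             /* 0.03 s time step, as in the abstract  */
    double r2 = (c * dt / dx) * (c * dt / dx);   /* Courant number squared */

    static double u_prev[NX], u_curr[NX], u_next[NX];
    u_curr[NX / 2] = 1.0;         /* crude initial pulse instead of a source */

    for (int n = 0; n < 2000; n++) {
        for (int i = 1; i < NX - 1; i++)
            u_next[i] = 2.0 * u_curr[i] - u_prev[i]
                      + r2 * (u_curr[i + 1] - 2.0 * u_curr[i] + u_curr[i - 1]);
        for (int i = 0; i < NX; i++) {           /* rotate time levels */
            u_prev[i] = u_curr[i];
            u_curr[i] = u_next[i];
        }
    }
    printf("u at domain centre after 60 s: %g\n", u_curr[NX / 2]);
    return 0;
}
```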
In 2004 the HPCx system underwent a major upgrade, including the replacement of the 1.3 GHz POWER4 processors by 1.7 GHz POWER4+ and the "Colony" SP Switch2 being superseded by IBM's High Performance Switch (HPS). The aim of the upgrade was to effect a factor of two performance increase in the overall capability of the system. We present the results of a benchmarking programme to compare the performance of the Phase1 and Phase2 systems across a range of capability applications of key importance to UK science. Codes are considered from a diverse selection of application domains: materials science, molecular simulation and molecular electronic structure, computational engineering and environmental science. Whereas all codes performed better on the Phase2 system, the extent of the improvement is highly application dependent. Those which already showed excellent scaling on the Phase1 system (CPMD, CRYSTAL, NAMD, PCHAN and POLCOMS), in some cases following significant optimisation effort by the HPCx Terascaling Team, showed a speed-up close to the factor of 1.31 increase in processor clock speed. Other applications (AIMPRO, CASTEP, DL_POLY, GAMESS and THOR), for which the performance of the interconnect was a critical issue, showed much greater performance increases, ranging from 1.65 to 2.88. The PDSYEVD-based matrix diagonalization benchmarks showed even greater improvement, up to a factor of 3.88 for the larger matrix size.
Electron collisions with atoms were among the earliest problems studied using quantum mechanics. However, the accurate computation of much of the data required in astrophysics and plasma physics still presents huge computational challenges, even on the latest ...
Many areas of scientific research are underpinned by computational methods that require ever-increasing levels of computer performance. In order to meet this demand, high-performance systems are rapidly heading towards Petascale performance levels, with planned systems typically consisting of O(100,000) processors. We are investigating whether current applications used in the UK are capable of scaling to these levels. We present performance results for five applications (SBLI, Code_Saturne, POLCOMS, DL_POLY_3 and CRYSTAL) from a range of scientific areas on Cray XT, IBM POWER5 and IBM BlueGene systems up to 16,384 processors. Most codes scale well with sufficiently large problem sizes, though we have identified a requirement for further research in efficient parallel I/O, parallel partitioning for unstructured mesh codes and diagonalisation-less methods for quantum chemistry.
Tutorial on Debugging and Performance Tools for MPI and OpenMP 4.0 Applications for CPU and Accelerators/Coprocessors
Scientific developers face challenges adapting software to leverage increasingly heterogeneous architectures. Many systems feature nodes that couple multi-core processors with GPU-based computational accelerators, like the NVIDIA Kepler, or many-core coprocessors, like the Intel Xeon Phi. In order to effectively utilize these systems, application developers need to expose an extremely high level of parallelism while also coping with the complexities of multiple programming paradigms, including MPI, OpenMP, CUDA and OpenACC. This tutorial provides an in-depth exploration of parallel debugging and optimization, focused on techniques that can be used with accelerators and coprocessors. We cover debugging techniques such as grouping, advanced breakpoints and barriers, and MPI message queue graphing. We discuss optimization techniques like profiling, tracing and cache memory optimization with tools such as Vampir, Scalasca, TAU, CrayPAT, VTune and the NVIDIA Visual Profiler. Participants have the opportunity to do hands-on GPU and Intel Xeon Phi debugging and profiling. Additionally, up-to-date capabilities in accelerator and coprocessor computing (e.g. OpenMP 4.0 device constructs, CUDA Unified Memory, CUDA core file debugging) and their peculiarities with respect to error finding and optimization will be discussed. For the hands-on sessions, SSH and NX clients must be installed on the attendees' laptops.
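As a concrete example of the OpenMP 4.0 device constructs mentioned above, here is a hedged C sketch of a loop offloaded with the target construct and explicit data mapping; the array size is arbitrary and, in the absence of a device, the region simply executes on the host.

```c
/* Hedged sketch of the OpenMP 4.0 device constructs mentioned above: a loop
 * offloaded to an attached accelerator with explicit data mapping. Without a
 * device, the region runs on the host. Compile with an OpenMP 4.0+ compiler
 * (e.g. -fopenmp). */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Offload the axpy loop; x is copied to the device, y is copied back. */
    #pragma omp target map(to: x) map(tofrom: y)
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```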