Evaluating GPU Programming Models for the LUMI Supercomputer
Supercomputing Frontiers
https://doi.org/10.1007/978-3-031-10419-0_6

Abstract
It is well known in the HPC community that CPU-only performance is insufficient for many computational workloads. The EuroHPC pre-exascale and upcoming exascale systems rely mainly on accelerators, and some of the largest upcoming supercomputers, such as LUMI and Frontier, will be powered by AMD Instinct™ accelerators. These new systems, however, pose challenges for developers unfamiliar with the new ecosystem or with the programming models required for heterogeneous architectures. In this paper, we present several of the better-known programming models for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test with various compilers, and tune the codes where necessary. Finally, we compare the performance, where possible, between the NVIDIA Volta (V100) and Ampere (A100) GPUs and the AMD MI100 GPU.
Summary

Based on the hardware performance, the AMD MI100 is faster than the NVIDIA V100 and slower than the NVIDIA A100. For the OpenMP programming model, the achieved percentage of peak bandwidth is 42.16%-94.68% on the MI100, while it is at least 96% on the NVIDIA GPUs, which demonstrates that AOMP needs further development. For Kokkos, the range is 74-99% on the MI100 and 72.87-99% on the NVIDIA GPUs, where the non-optimized version has lower bandwidth.
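The "percentage of peak bandwidth" figures above follow the usual BabelStream-style accounting: count the bytes each kernel moves, divide by the kernel time, and compare against the device's datasheet peak. The sketch below is not code from the paper; the timing and the MI100 peak figure (~1228.8 GB/s per AMD's datasheet) are illustrative assumptions.

```python
# Hedged sketch: how a STREAM/BabelStream-style "% of peak bandwidth"
# figure is typically derived for the triad kernel
#   a[i] = b[i] + scalar * c[i],
# which touches three arrays per iteration (2 reads + 1 write).

def triad_bandwidth_gbs(n_elements: int, elem_bytes: int, seconds: float) -> float:
    """Achieved bandwidth in GB/s for one triad pass over the arrays."""
    bytes_moved = 3 * n_elements * elem_bytes  # a, b, c each traversed once
    return bytes_moved / seconds / 1e9

def percent_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Achieved bandwidth as a percentage of the datasheet peak."""
    return 100.0 * achieved_gbs / peak_gbs

# Illustrative numbers only: 2^28 doubles per array, a hypothetical 6.24 ms
# kernel time, and the MI100 datasheet peak of ~1228.8 GB/s.
achieved = triad_bandwidth_gbs(2**28, 8, 0.00624)
print(f"{achieved:.0f} GB/s, {percent_of_peak(achieved, 1228.8):.1f}% of peak")
```

The same bytes-moved bookkeeping applies per kernel (copy and scale move two arrays, add and triad move three), which is why the benchmark reports a separate percentage for each.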