Evaluating GPU Programming Models for the LUMI Supercomputer
Supercomputing Frontiers
https://doi.org/10.1007/978-3-031-10419-0_6

Abstract
It is well known in the HPC community that CPU-only performance is insufficient for many computational workloads. The EuroHPC pre-exascale and upcoming exascale systems rely mainly on accelerators, and some of the largest upcoming supercomputers, such as LUMI and Frontier, will be powered by AMD Instinct™ accelerators. These new systems, however, pose challenges for developers unfamiliar with the new ecosystem or with the programming models required for heterogeneous architectures. In this paper, we present several of the better-known programming models for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test with various compilers, and tune the codes where necessary. Finally, we compare the performance, where possible, between the NVIDIA Volta (V100) and Ampere (A100) GPUs and the AMD MI100 GPU.
Summary

Based on the hardware performance, the AMD MI100 is faster than the NVIDIA V100 and slower than the NVIDIA A100. For the OpenMP programming model, the achieved percentage of peak bandwidth is 42.16%-94.68% on the MI100, while it is at least 96% on the NVIDIA GPUs, which demonstrates that AOMP needs further development. For Kokkos, the range is 74-99% on the MI100 and 72.87-99% on the NVIDIA GPUs, where the non-optimized version has lower bandwidth.
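The "percentage of peak bandwidth" figures above follow the usual BabelStream-style accounting: count the bytes each kernel moves, divide by the kernel time, and compare against the device's datasheet peak. The sketch below is not code from the paper; the timing and the MI100 peak figure (~1228.8 GB/s per AMD's datasheet) are illustrative assumptions.

```python
# Hedged sketch: how a STREAM/BabelStream-style "% of peak bandwidth"
# figure is typically derived for the triad kernel
#   a[i] = b[i] + scalar * c[i],
# which touches three arrays per iteration (2 reads + 1 write).

def triad_bandwidth_gbs(n_elements: int, elem_bytes: int, seconds: float) -> float:
    """Achieved bandwidth in GB/s for one triad pass over the arrays."""
    bytes_moved = 3 * n_elements * elem_bytes  # a, b, c each traversed once
    return bytes_moved / seconds / 1e9

def percent_of_peak(achieved_gbs: float, peak_gbs: float) -> float:
    """Achieved bandwidth as a percentage of the datasheet peak."""
    return 100.0 * achieved_gbs / peak_gbs

# Illustrative numbers only: 2^28 doubles per array, a hypothetical 6.24 ms
# kernel time, and the MI100 datasheet peak of ~1228.8 GB/s.
achieved = triad_bandwidth_gbs(2**28, 8, 0.00624)
print(f"{achieved:.0f} GB/s, {percent_of_peak(achieved, 1228.8):.1f}% of peak")
```

The same bytes-moved bookkeeping applies per kernel (copy and scale move two arrays, add and triad move three), which is why the benchmark reports a separate percentage for each.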