MIMD synchronization on SIMT architectures
2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
https://doi.org/10.1109/MICRO.2016.7783714Abstract
In the single-instruction multiple-threads (SIMT) execution model, small groups of scalar threads operate in lockstep. Within each group, current SIMT hardware implementations serialize the execution of threads that follow different paths, and to ensure efficiency, revert to lockstep execution as soon as possible. These constraints must be considered when adapting algorithms that employ synchronization. A deadlockfree program on a multiple-instruction multiple-data (MIMD) architecture may deadlock on a SIMT machine. To avoid this, programmers need to restructure control flow with SIMT scheduling constraints in mind. This requires programmers to be familiar with the underlying SIMT hardware. In this paper, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. Both the analysis and the transformation algorithms are implemented as LLVM compiler passes. Finally, we propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG, but which can leverage our compiler analysis to gain efficiency. The static detection has a false detection rate of only 4%-5%. The automated transformation has an average performance overhead of 8.2%-10.9% compared to manual transformation. Our hardware approach performs on par with the compiler transformation, however, it avoids synchronization scope limitations, static instruction and register overheads, and debuggability challenges that are present in the compiler only solution. 1 We use the term "MIMD machine" to mean any architecture that guarantees loose fairness in thread scheduling so that threads not waiting on a programmer synchronization condition make forward progress.
References (74)
- NVIDIA, CUDA, "NVIDIA CUDA Programming Guide," 2011.
- Intel Corporation, "The ISPC Parallel Execution Model," 2016.
- AMD, "Accelerated Parallel Processing: OpenCL Guide," 2013.
- A. Levinthal T. "Chap -A SIMD Graphics Processor," Proc. ACM Conf. on Comp. Grap. and Interactive Tech. (SIGGRAPH), 1984.
- B. Coon and J. Lindholm, "System and method for managing divergent threads in a simd architecture," 2008. Patent 7,353,369.
- B. Beylin and R. S. Glanville, "Insertion of multithreaded execution synchronization points in a software program," 2013. US Patent 8,381,203.
- M. HOUSTON, B. Gaster, L. HOWES, M. Mantor, and D. Behr, "Method and System for Synchronization of Workitems with Divergent Control Flow," 2013. WO Patent App. PCT/US2013/043,394.
- A. Habermaier and A. Knapp, "On the Correctness of the SIMT Execution Model of GPUs," in Programming Languages and Systems, pp. 316-335, Springer, 2012.
- S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun, "Accelerating CUDA Graph Algorithms at Maximum Warp," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 267-276, 2011.
- M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in Proc. IEEE Symp. on Workload Characterization (IISWC), 2012.
- D. Merrill, M. Garland, and A. Grimshaw, "Scalable GPU Graph Traversal," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 117-128, 2012.
- S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 101-110, 2009.
- G. Noaje, C. Jaillet, and M. Krajecki, "Source-to-source Code Transla- tor: OpenMP C to CUDA," in IEEE Int'l Conf. on High Performance Computing and Communications (HPCC), 2011.
- C. Bertolli, S. F. Antao, A. E. Eichenberger, K. O'Brien, Z. Sura, A. C. Jacob, T. Chen, and O. Sallenave, "Coordinating GPU Threads for OpenMP 4.0 in LLVM," in Proc. LLVM Compiler Infrastructure in HPC, 2014.
- S. Antao, C. Bertolli, A. Bokhanko, A. Eichenberger, H. Finkel, S. Os- tanevich, E. Stotzer, and G. Zhang, "OpenMP Offload Infrastructure in LLVM," tech. rep.
- OpenMP Clang Frontend, "OpenMP Clang Frontend Documentation." https://github.com/clang-omp, 2015.
- X. Tian and B. R. de Supins, "Explicit Vector Programming with OpenMP 4.0 SIMD Extension," Primeur Magazine 2014, 2014.
- A. ElTantawy, "SSDE and AWARE codes." https://github.com/ElTantawy/mimd to simt/, 2016.
- W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 407-420, 2007.
- J. Meng, D. Tarjan, and K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 235-246, 2010.
- NVIDIA Forums, "atomicCAS does NOT seem to work." http://forums.nvidia.com/index.php?showtopic=98444, 2009.
- NVIDIA Forums, "atomic locks." https://devtalk.nvidia.com/default/topic/512038/atomic-locks/, 2012.
- A. Ramamurthy, "Towards Scalar Synchronization in SIMT Architec- tures," Master's thesis, The University of British Columbia, 2011.
- W. W. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 296-307, 2011.
- Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian, "Software Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), p. 1, 2014.
- H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU Microarchitecture Through Microbenchmarking," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 235-246, 2010.
- A. Betts, N. Chong, A. Donaldson, S. Qadeer, and P. Thomson, "GPUVerify: a Verifier for GPU Kernels," in Proc. ACM Int'l Conf. on Object oriented programming systems languages and applications, pp. 113-132, 2012.
- G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan, "GKLEE: Concolic Verification and Test Generation for GPUs," in PPoPP, 2012.
- R. Sharma, M. Bauer, and A. Aiken, "Verification of Producer-Consumer Synchronization in GPU Programs," in Proc. ACM Conf. on Program- ming Language Design and Implementation (PLDI), pp. 88-98, 2015.
- "How does the OpenACC API relate to the OpenMP API?." Ope- nACC.org.
- LLVM Compiler, "LLVN 3.6 Release Information." http://llvm.org/releases/3.6.0/, 2015.
- M. Villmow, "AMD OpenCL Compiler." LLVM Developers Conference, 2010. Presentation.
- NVIDIA, "CUDA LLVM Compiler." https://developer.nvidia.com/cuda- llvm-compiler, 2015.
- Intel Developer Zone, "Weird behaviour of atomic functions." https://software.intel.com/en-us/forums/opencl/topic/278350, 2012.
- NVIDIA Forums, "GLSL Spinlock." https://devtalk.nvidia.com/default/topic/768115/opengl/glsl-spinlock/, 2014.
- M. Burtscher and K. Pingali, "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm," GPU computing Gems Emerald edition, 2011.
- A. ElTantawy and T. M. Aamodt, "Correctness Discussion of a SIMT- induced Deadlock Elimination Algorithm," tech. rep., University of British Columbia, 2016.
- NVIDIA, "CUDA Binary Utilities," http://docs.nvidia.com/cuda/cuda- binary-utilities/, 2015.
- S. Horwitz, T. Reps, and D. Binkley, "Interprocedural Slicing Using Dependence Graphs," in PLDI, 1988.
- M. Burke and R. Cytron, "Interprocedural Dependence Analysis and Parallelization," in Proc. ACM SIGPLAN Symp. on Compiler Construc- tion, 1986.
- U. Hölzle, C. Chambers, and D. Ungar, "Debugging Optimized Code with Dynamic Deoptimization," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 1992.
- J. Hennessy, "Symbolic Debugging of Optimized Code," ACM Trans. on Prog. Lang. and Sys.(TOPLAS), vol. 4, no. 3, pp. 323-344, 1982.
- A. ElTantawy, J. W. Ma, M. O'Connor, and T. M. Aamodt, "A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2014.
- B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu., "Tracking Register Usage during Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators. US Patent 7,434,032," 2008.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MI- CRO), pp. 72-83, 2012.
- A. Brownsword, "Cloth in OpenCL," tech. rep., Khronos Group, 2009.
- J. Coplin and M. Burtscher, "Effects of Source-Code Optimizations on GPU Performance and Energy Consumption," in Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2015.
- J. M. Bull, "Measuring synchronisation and scheduling overheads in openmp," in Proc. European Workshop on OpenMP, vol. 8, p. 49, 1999.
- LLVM Compiler, "LLVM Alias Analysis Infrastructure." http://llvm.org/docs/AliasAnalysis.html, 2015.
- V. C. Sreedhar, G. R. Gao, and Y.-F. Lee, "Identifying loops using dj graphs," ACM Trans. on Prog. Lang. and Sys. (TOPLAS), 1996.
- A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 163-174, 2009.
- T. M. Aamodt et al., GPGPU-Sim 3.x Manual. University of British Columbia, 2013.
- LLVM Compiler, "Clang Front End." http://clang.llvm.org/, 2015.
- LLVM Compiler, "LIBCLC Library." http://libclc.llvm.org/, 2015.
- Dmitry Mikushin, "CUDA to LLVM-IR." https://github.com/apc- llc/nvcc-llvm-ir, 2015.
- NVIDIA, "LibNVVM Library." http://docs.nvidia.com/cuda/libnvvm- api/, 2015.
- NVIDIA, "CUDA SDK 3.2," September 2013.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Com- puting," in Proc. IEEE Symp. on Workload Characterization (IISWC), pp. 44-54, 2009.
- Jeff Larkin, NVIDIA, "OpenMP and NVIDIA." http://openmp.org/sc13/SC13 OpenMP and NVIDIA.pdf, 2013.
- Michael Wong, Alexey Bataev, "OpenMP GPU/Accelerators Coming of Age in Clang." http://llvm.org/devmtg/2015-10/slides/WongBataev- OpenMPGPUAcceleratorsComingOfAgeInClang.pdf, 2015.
- C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Integrating GPU support for OpenMP offloading directives into Clang," in Proc. ACM Int'l Workshop on the LLVM Compiler Infrastructure in HPC, 2015.
- G.-T. Bercea, C. Bertolli, S. F. Antao, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Performance Analysis of OpenMp on a GPU Using a Coral Proxy Application," in Proc. ACM Int'l Workshop on Perf. Modeling, Benchmarking, and Simulation of High Perf. Computing Sys., 2015.
- M. Rhu and M. Erez, "The Dual-Path Execution Model for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 235-246, 2013.
- G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalaman- chili, "SIMD Re-convergence at Thread Frontiers," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 477-488, 2011.
- W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 25-36, 2011.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 308-317, 2011.
- M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 61-71, 2012.
- Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanovic, "Con- vergence and Scalarization for Data-Parallel Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), pp. 1- 11, 2013.
- M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal, "GRace: a Low-overhead Mechanism for Detecting Data Races in GPU Programs," in ACM SIGPLAN Notices, vol. 46, pp. 135-146, ACM, 2011.
- C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), 2004.
- J. A. Stratton, S. S. Stone, and W. H. Wen-mei, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Languages and Compilers for Parallel Computing, Springer, 2008.
- A. Yilmazer and D. Kaeli, "HQL: A Scalable Synchronization Mech- anism for GPUs," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013.
- A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, "Fine-grained Synchronizations and Dataflow Programming on GPUs," 2015.
- Y. Xu, L. Gao, R. Wang, Z. Luan, W. Wu, and D. Qian, "Lock- based Synchronization for GPU Architectures," in Proc. Int'l Conf. on Computing Frontiers, 2016.