MIMD synchronization on SIMT architectures

Ahmed ElTantawy

doi:10.1109/MICRO.2016.7783714

2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

https://doi.org/10.1109/MICRO.2016.7783714

14 pages

Abstract

In the single-instruction multiple-threads (SIMT) execution model, small groups of scalar threads operate in lockstep. Within each group, current SIMT hardware implementations serialize the execution of threads that follow different paths and, to ensure efficiency, revert to lockstep execution as soon as possible. These constraints must be considered when adapting algorithms that employ synchronization. A deadlock-free program on a multiple-instruction multiple-data (MIMD) architecture may deadlock on a SIMT machine. To avoid this, programmers need to restructure control flow with SIMT scheduling constraints in mind. This requires programmers to be familiar with the underlying SIMT hardware. In this paper, we propose a static analysis technique that detects SIMT deadlocks by inspecting the application control flow graph (CFG). We further propose a CFG transformation that avoids SIMT deadlocks when synchronization is local to a function. Both the analysis and the transformation algorithms are implemented as LLVM compiler passes. Finally, we propose an adaptive hardware reconvergence mechanism that supports MIMD synchronization without changing the application CFG, but which can leverage our compiler analysis to gain efficiency. The static detection has a false detection rate of only 4%-5%. The automated transformation has an average performance overhead of 8.2%-10.9% compared to manual transformation. Our hardware approach performs on par with the compiler transformation; however, it avoids the synchronization scope limitations, static instruction and register overheads, and debuggability challenges that are present in the compiler-only solution.

Footnote 1: We use the term "MIMD machine" to mean any architecture that guarantees loose fairness in thread scheduling so that threads not waiting on a programmer synchronization condition make forward progress.

Figures (17)

Fig. 1: SIMT-Induced Deadlock

[...] threads within the same warp to diverge (i.e., follow different control flow paths). However, they achieve this by serializing the execution of different control-flow paths while restoring SIMD utilization by forcing divergent threads to reconverge as soon as possible (typically at an immediate postdominator point) [2], [5], [8]. This in turn creates implicit scheduling constraints for divergent threads within a warp. Therefore, when GPU kernel code is written in such a way that the programmer intends divergent threads to communicate, these scheduling constraints can lead to surprising (from a programmer perspective) deadlock and/or livelock conditions. Thus, a multi-threaded program that is guaranteed to terminate on a MIMD architecture may not terminate on machines with current SIMT implementations.

Ahmed ElTantawy and Tor M. Aamodt, University of British Columbia, {ahmede,aamodt}@ece.ubc.ca
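The deadlock described under Fig. 1 can be illustrated with a small Python model. This is a minimal sketch under stated assumptions, not the paper's simulator: it models a single warp running a plain spin lock, assumes the warp's atomics are serialized in thread order, and assumes threads that leave the loop must wait at the reconvergence point until every thread has left it. The function name `simulate_naive_spin_lock` and the iteration bound are hypothetical.

```python
def simulate_naive_spin_lock(num_threads, max_iterations=1000):
    """Model one warp running a plain spin lock (assumed pattern):

        while (atomicCAS(mutex, 0, 1) != 0) { }   // spin
        /* critical section */
        atomicExch(mutex, 0);                     // release

    Returns the number of threads that completed the critical section
    within `max_iterations` loop iterations (0 means no forward progress).
    """
    mutex = 0
    spinning = list(range(num_threads))  # threads still inside the loop
    parked = []                          # threads waiting at the reconvergence point
    for _ in range(max_iterations):
        still_spinning = []
        # Lockstep atomicCAS: assume the atomics of one warp are serialized.
        for tid in spinning:
            if mutex == 0:
                mutex = 1                # this thread acquires the lock
                parked.append(tid)       # it leaves the loop and waits for the rest
            else:
                still_spinning.append(tid)
        spinning = still_spinning
        if not spinning:
            # All threads reconverged: the warp runs the critical section
            # and the release that follows the loop.
            mutex = 0
            return len(parked)
    # The lock holder is parked and never reaches its release.
    return 0


if __name__ == "__main__":
    print("1 thread :", simulate_naive_spin_lock(1))
    print("32 threads:", simulate_naive_spin_lock(32))
```

With one thread the model terminates. With more than one, the first thread to acquire the lock waits for the others, which keep spinning on a lock that is never released.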

Fig. 3: Modified SIMT compliant Spin Lock
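The figure itself is not reproduced in this text. As a hedged illustration, the sketch below models one commonly used restructuring in which the critical section and the release sit inside the loop body, so that the branch on the lock result reconverges before the next iteration. The pattern in the docstring, the function name `simulate_restructured_spin_lock`, and the serialized-atomics assumption are all illustrative and are not taken from the paper.

```python
def simulate_restructured_spin_lock(num_threads, max_iterations=1000):
    """Model one warp running a restructured lock (assumed pattern):

        done = false;
        while (!done) {
            if (atomicCAS(mutex, 0, 1) == 0) {
                /* critical section */
                atomicExch(mutex, 0);
                done = true;
            }
        }   // the if-branch reconverges here, inside the loop

    Returns (order in which threads finished, loop iterations used).
    """
    mutex = 0
    pending = list(range(num_threads))
    finished = []
    iterations = 0
    while pending and iterations < max_iterations:
        iterations += 1
        winners = []
        # Lockstep atomicCAS: assume the atomics of one warp are serialized.
        for tid in pending:
            if mutex == 0:
                mutex = 1
                winners.append(tid)
        # Taken path of the if: the winner runs the critical section and
        # releases the lock before reconverging with the losing threads.
        for tid in winners:
            finished.append(tid)
            mutex = 0
        pending = [tid for tid in pending if tid not in winners]
    return finished, iterations


if __name__ == "__main__":
    order, iters = simulate_restructured_spin_lock(4)
    print(order, iters)
```

In this model one thread completes per loop iteration, so a warp of N threads needs N iterations rather than spinning indefinitely.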

Algorithm 1 SIMT-Induced Deadlock Detection

[...] slice of the loop exit condition. If the loop exit conditions do not depend on a shared memory read operation that occurs inside the loop body, then the loop cannot have a SIMT-induced deadlock. If a loop exit condition does depend on a shared memory read instruction I_R, we add I_R to the set of shared reads ShrdReads on lines 4-7. A potential SIMT-induced deadlock exists if any of these shared memory reads can be redefined by divergent threads. The next steps of the algorithm detect these shared memory redefinitions.
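The detection idea can be sketched in Python on a toy CFG. This is a simplified illustration, not the paper's LLVM pass. It assumes the shared addresses that each loop exit condition depends on are supplied as input rather than computed by slicing, treats aliasing as plain name equality, and only considers writes reachable from the loop exits; writes on paths parallel to the loop are not modeled. All names (`detect_potential_deadlocks`, `exit_targets`, `exit_reads`) are hypothetical.

```python
def reachable(cfg, starts):
    """Blocks reachable from any block in `starts` (inclusive)."""
    seen, work = set(), list(starts)
    while work:
        block = work.pop()
        if block not in seen:
            seen.add(block)
            work.extend(cfg.get(block, []))
    return seen


def detect_potential_deadlocks(cfg, loops, shared_writes):
    """Return {loop name: blocks holding redefining writes} for flagged loops.

    cfg:           block -> list of successor blocks
    loops:         loop name -> {"exit_targets": blocks entered on loop exit,
                                 "exit_reads": shared addresses the exit
                                               condition depends on}
    shared_writes: block -> set of shared addresses written in that block
    """
    flagged = {}
    for name, loop in loops.items():
        if not loop["exit_reads"]:
            continue  # exit does not depend on shared memory: not flagged
        after_loop = reachable(cfg, loop["exit_targets"])
        redefining = {
            block for block in after_loop
            if shared_writes.get(block, set()) & loop["exit_reads"]
        }
        if redefining:
            flagged[name] = redefining
    return flagged


# Plain spin lock: B spins on the lock, C holds the critical section and release.
naive_cfg = {"A": ["B"], "B": ["B", "C"], "C": ["D"], "D": []}
naive_loops = {"spin": {"exit_targets": {"C"}, "exit_reads": {"mutex"}}}
naive_writes = {"B": {"mutex"}, "C": {"mutex"}}

# Restructured lock: the release (block CS) is inside the loop; E follows the loop.
safe_cfg = {"A": ["H"], "H": ["T", "E"], "T": ["CS", "H"], "CS": ["H"], "E": []}
safe_loops = {"spin": {"exit_targets": {"E"}, "exit_reads": {"mutex"}}}
safe_writes = {"T": {"mutex"}, "CS": {"mutex"}}

if __name__ == "__main__":
    print(detect_potential_deadlocks(naive_cfg, naive_loops, naive_writes))
    print(detect_potential_deadlocks(safe_cfg, safe_loops, safe_writes))
```

The first CFG is flagged because the release in block C is only reachable after the loop exits. The second is not flagged, because no write to the lock is reachable from the loop exit.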

Algorithm 2 Safe Reconvergence Points

[...] loop exit is control dependent on the atomicCAS instruction, there are no shared memory write instructions that are parallel to, or reachable from, the loop exit. Therefore, no SIMT deadlock is detected.

Fig. 4: SIMT-induced deadlock scenarios

[...] occurs if these indefinitely blocked paths must execute to enable the exit conditions of the looping threads. To avoid this, our compiler-based SIMT deadlock elimination algorithm (explained in more detail in Section IV-A) replaces the backward edge of a loop identified by Algorithm 1 with two edges: a forward edge towards the loop's SafePDom, and a backward edge from SafePDom to the loop header. This modification, combined with the forced reconvergence constraint, guarantees that threads iterating in the loop wait at the SafePDom for threads executing other paths postdominated by SafePDom before attempting another iteration. Accordingly, SafePDom should postdominate the original loop exits, the redefining writes, and all control flow paths that could lead to redefining writes that are either reachable from the loop (lines 4-9 in Algorithm 2) or parallel to it (lines 10-14 in Algorithm 2).
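The SafePDom idea and the back-edge replacement can be sketched on a toy CFG as follows. This is an illustrative approximation working on whole basic blocks, not the paper's Algorithm 2 or 3. It assumes a single exit block, picks the nearest block that strictly postdominates every block that must be covered, and only rewrites edges; a real transformation would also need a predicate at SafePDom to route looping threads back to the header. The names `postdominators`, `safe_pdom`, and `redirect_back_edge` are hypothetical.

```python
def postdominators(cfg, exit_block):
    """Iterative dataflow: pdom[n] = {n} | intersection of pdom[s] over successors s.

    Assumes every block is a key of `cfg` and `exit_block` is the only exit.
    """
    blocks = set(cfg)
    pdom = {b: set(blocks) for b in blocks}
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:
        changed = False
        for b in blocks - {exit_block}:
            succ_sets = [pdom[s] for s in cfg[b]]
            new = {b} | (set.intersection(*succ_sets) if succ_sets else set())
            if new != pdom[b]:
                pdom[b], changed = new, True
    return pdom


def safe_pdom(pdom, must_cover):
    """Nearest block that strictly postdominates every block in `must_cover`."""
    common = set.intersection(*(pdom[b] - {b} for b in must_cover))
    if not common:
        raise ValueError("no common strict postdominator")
    # Common postdominators form a chain; the nearest one has the most
    # postdominators of its own.
    return max(common, key=lambda b: len(pdom[b]))


def redirect_back_edge(cfg, latch, header, safe):
    """Replace the back edge latch->header with latch->safe and safe->header."""
    new_cfg = {b: list(succs) for b, succs in cfg.items()}
    new_cfg[latch] = [safe if s == header else s for s in new_cfg[latch]]
    if header not in new_cfg[safe]:
        new_cfg[safe].append(header)
    return new_cfg


# Plain spin lock: B spins (self loop), C is the loop exit and holds the
# redefining write (the release), D follows it.
cfg = {"A": ["B"], "B": ["B", "C"], "C": ["D"], "D": ["EXIT"], "EXIT": []}
pdom = postdominators(cfg, "EXIT")
safe = safe_pdom(pdom, {"C"})
new_cfg = redirect_back_edge(cfg, latch="B", header="B", safe=safe)

if __name__ == "__main__":
    print("SafePDom:", safe)
    print("Rewritten CFG:", new_cfg)
```

In the rewritten graph, threads that fail to take the lock move forward to block D, where they wait for the thread passing through C to release it, and only then return to B for another attempt.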

Algorithm 3 SIMT-Induced Deadlock Elimination

[...] it postdominates all reachable paths to the redefining writes (i.e., leading threads may only wait for lagging ones after they finish all iterations of the outer loop).

Fig. 5: SIMT-Induced Deadlock Elimination Steps

Fig. 6: MIMD-Compatible Reconvergence Mechanism Operation

TABLE II: Code Configuration Encoding

[...] our generated CFG. This could be avoided if the elimination algorithm is applied at the SASS code generation stage. We also implemented AWARE in GPGPU-Sim 3.2.2 [53], [54]. We use the Tesla C2050 configuration released with GPGPU-Sim. However, we replaced the Greedy Then Oldest (GTO) scheduler with a Greedy then Loose Round Robin (GLRR) scheduler that forces loose fairness in warp scheduling, as we observed that unfairness in GTO leads to livelocks due to inter-warp dependencies on locks. Modified GPGPU-Sim and LLVM codes can be found online [19].

Fig. 7: AWARE Virtualized Implementation

TABLE I: Evaluated Kernels

Fig. 8: Normalized Accumulative GPU Execution Time

TABLE V: SSDE Evaluation on OpenMP Kernels

Fig. 9: Evaluation of the Static SIMT-Induced Deadlock Elimination on Tesla K20C GPU

Fig. 11: Sensitivity to the TimeOut value (in cycles)

Fig. 10: Evaluation of the Adaptive Warp Reconvergence Mechanism using GPGPU-Sim

Fig. 12: Effect of AWARE Virtualization on Performance

References (74)

NVIDIA, CUDA, "NVIDIA CUDA Programming Guide," 2011.
Intel Corporation, "The ISPC Parallel Execution Model," 2016.
AMD, "Accelerated Parallel Processing: OpenCL Guide," 2013.
A. Levinthal and T. Porter, "Chap - A SIMD Graphics Processor," Proc. ACM Conf. on Comp. Grap. and Interactive Tech. (SIGGRAPH), 1984.
B. Coon and J. Lindholm, "System and method for managing divergent threads in a simd architecture," 2008. Patent 7,353,369.
B. Beylin and R. S. Glanville, "Insertion of multithreaded execution synchronization points in a software program," 2013. US Patent 8,381,203.
M. Houston, B. Gaster, L. Howes, M. Mantor, and D. Behr, "Method and System for Synchronization of Workitems with Divergent Control Flow," 2013. WO Patent App. PCT/US2013/043,394.
A. Habermaier and A. Knapp, "On the Correctness of the SIMT Execution Model of GPUs," in Programming Languages and Systems, pp. 316-335, Springer, 2012.
S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun, "Accelerating CUDA Graph Algorithms at Maximum Warp," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 267-276, 2011.
M. Burtscher, R. Nasre, and K. Pingali, "A Quantitative Study of Irregular Programs on GPUs," in Proc. IEEE Symp. on Workload Characterization (IISWC), 2012.
D. Merrill, M. Garland, and A. Grimshaw, "Scalable GPU Graph Traversal," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 117-128, 2012.
S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: a Compiler Framework for Automatic Translation and Optimization," in Proc. ACM Symp. on Prin. and Prac. of Par. Prog. (PPoPP), pp. 101-110, 2009.
G. Noaje, C. Jaillet, and M. Krajecki, "Source-to-source Code Translator: OpenMP C to CUDA," in IEEE Int'l Conf. on High Performance Computing and Communications (HPCC), 2011.
C. Bertolli, S. F. Antao, A. E. Eichenberger, K. O'Brien, Z. Sura, A. C. Jacob, T. Chen, and O. Sallenave, "Coordinating GPU Threads for OpenMP 4.0 in LLVM," in Proc. LLVM Compiler Infrastructure in HPC, 2014.
S. Antao, C. Bertolli, A. Bokhanko, A. Eichenberger, H. Finkel, S. Ostanevich, E. Stotzer, and G. Zhang, "OpenMP Offload Infrastructure in LLVM," tech. rep.
OpenMP Clang Frontend, "OpenMP Clang Frontend Documentation." https://github.com/clang-omp, 2015.
X. Tian and B. R. de Supinski, "Explicit Vector Programming with OpenMP 4.0 SIMD Extension," Primeur Magazine 2014, 2014.
A. ElTantawy, "SSDE and AWARE codes." https://github.com/ElTantawy/mimd to simt/, 2016.
W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 407-420, 2007.
J. Meng, D. Tarjan, and K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 235-246, 2010.
NVIDIA Forums, "atomicCAS does NOT seem to work." http://forums.nvidia.com/index.php?showtopic=98444, 2009.
NVIDIA Forums, "atomic locks." https://devtalk.nvidia.com/default/topic/512038/atomic-locks/, 2012.
A. Ramamurthy, "Towards Scalar Synchronization in SIMT Architectures," Master's thesis, The University of British Columbia, 2011.
W. W. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 296-307, 2011.
Y. Xu, R. Wang, N. Goswami, T. Li, L. Gao, and D. Qian, "Software Transactional Memory for GPU Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), p. 1, 2014.
H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU Microarchitecture Through Microbenchmarking," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 235-246, 2010.
A. Betts, N. Chong, A. Donaldson, S. Qadeer, and P. Thomson, "GPUVerify: a Verifier for GPU Kernels," in Proc. ACM Int'l Conf. on Object oriented programming systems languages and applications, pp. 113-132, 2012.
G. Li, P. Li, G. Sawaya, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan, "GKLEE: Concolic Verification and Test Generation for GPUs," in PPoPP, 2012.
R. Sharma, M. Bauer, and A. Aiken, "Verification of Producer-Consumer Synchronization in GPU Programs," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), pp. 88-98, 2015.
"How does the OpenACC API relate to the OpenMP API?." OpenACC.org.
LLVM Compiler, "LLVM 3.6 Release Information." http://llvm.org/releases/3.6.0/, 2015.
M. Villmow, "AMD OpenCL Compiler." LLVM Developers Conference, 2010. Presentation.
NVIDIA, "CUDA LLVM Compiler." https://developer.nvidia.com/cuda-llvm-compiler, 2015.
Intel Developer Zone, "Weird behaviour of atomic functions." https://software.intel.com/en-us/forums/opencl/topic/278350, 2012.
NVIDIA Forums, "GLSL Spinlock." https://devtalk.nvidia.com/default/topic/768115/opengl/glsl-spinlock/, 2014.
M. Burtscher and K. Pingali, "An Efficient CUDA Implementation of the Tree-based Barnes Hut n-Body Algorithm," GPU Computing Gems Emerald Edition, 2011.
A. ElTantawy and T. M. Aamodt, "Correctness Discussion of a SIMT-induced Deadlock Elimination Algorithm," tech. rep., University of British Columbia, 2016.
NVIDIA, "CUDA Binary Utilities," http://docs.nvidia.com/cuda/cuda-binary-utilities/, 2015.
S. Horwitz, T. Reps, and D. Binkley, "Interprocedural Slicing Using Dependence Graphs," in PLDI, 1988.
M. Burke and R. Cytron, "Interprocedural Dependence Analysis and Parallelization," in Proc. ACM SIGPLAN Symp. on Compiler Construction, 1986.
U. Hölzle, C. Chambers, and D. Ungar, "Debugging Optimized Code with Dynamic Deoptimization," in Proc. ACM Conf. on Programming Language Design and Implementation (PLDI), 1992.
J. Hennessy, "Symbolic Debugging of Optimized Code," ACM Trans. on Prog. Lang. and Sys.(TOPLAS), vol. 4, no. 3, pp. 323-344, 1982.
A. ElTantawy, J. W. Ma, M. O'Connor, and T. M. Aamodt, "A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), 2014.
B. W. Coon, P. C. Mills, S. F. Oberman, and M. Y. Siu., "Tracking Register Usage during Multithreaded Processing Using a Scoreboard having Separate Memory Regions and Storing Sequential Register Size Indicators. US Patent 7,434,032," 2008.
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 72-83, 2012.
A. Brownsword, "Cloth in OpenCL," tech. rep., Khronos Group, 2009.
J. Coplin and M. Burtscher, "Effects of Source-Code Optimizations on GPU Performance and Energy Consumption," in Proc. ACM Workshop on General Purpose Processing on Graphics Processing Units, 2015.
J. M. Bull, "Measuring Synchronisation and Scheduling Overheads in OpenMP," in Proc. European Workshop on OpenMP, vol. 8, p. 49, 1999.
LLVM Compiler, "LLVM Alias Analysis Infrastructure." http://llvm.org/docs/AliasAnalysis.html, 2015.
V. C. Sreedhar, G. R. Gao, and Y.-F. Lee, "Identifying Loops Using DJ Graphs," ACM Trans. on Prog. Lang. and Sys. (TOPLAS), 1996.
A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in Proc. IEEE Symp. on Perf. Analysis of Systems and Software (ISPASS), pp. 163-174, 2009.
T. M. Aamodt et al., GPGPU-Sim 3.x Manual. University of British Columbia, 2013.
LLVM Compiler, "Clang Front End." http://clang.llvm.org/, 2015.
LLVM Compiler, "LIBCLC Library." http://libclc.llvm.org/, 2015.
Dmitry Mikushin, "CUDA to LLVM-IR." https://github.com/apc-llc/nvcc-llvm-ir, 2015.
NVIDIA, "LibNVVM Library." http://docs.nvidia.com/cuda/libnvvm-api/, 2015.
NVIDIA, "CUDA SDK 3.2," September 2013.
S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in Proc. IEEE Symp. on Workload Characterization (IISWC), pp. 44-54, 2009.
Jeff Larkin, NVIDIA, "OpenMP and NVIDIA." http://openmp.org/sc13/SC13 OpenMP and NVIDIA.pdf, 2013.
Michael Wong, Alexey Bataev, "OpenMP GPU/Accelerators Coming of Age in Clang." http://llvm.org/devmtg/2015-10/slides/WongBataev-OpenMPGPUAcceleratorsComingOfAgeInClang.pdf, 2015.
C. Bertolli, S. F. Antao, G.-T. Bercea, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Integrating GPU support for OpenMP offloading directives into Clang," in Proc. ACM Int'l Workshop on the LLVM Compiler Infrastructure in HPC, 2015.
G.-T. Bercea, C. Bertolli, S. F. Antao, A. C. Jacob, A. E. Eichenberger, T. Chen, Z. Sura, H. Sung, G. Rokos, D. Appelhans, et al., "Performance Analysis of OpenMP on a GPU Using a Coral Proxy Application," in Proc. ACM Int'l Workshop on Perf. Modeling, Benchmarking, and Simulation of High Perf. Computing Sys., 2015.
M. Rhu and M. Erez, "The Dual-Path Execution Model for Efficient GPU Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 235-246, 2013.
G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili, "SIMD Re-convergence at Thread Frontiers," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 477-488, 2011.
W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in Proc. IEEE Symp. on High-Perf. Computer Architecture (HPCA), pp. 25-36, 2011.
V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," in Proc. IEEE/ACM Symp. on Microarch. (MICRO), pp. 308-317, 2011.
M. Rhu and M. Erez, "CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures," in Proc. IEEE/ACM Symp. on Computer Architecture (ISCA), pp. 61-71, 2012.
Y. Lee, R. Krashinsky, V. Grover, S. Keckler, and K. Asanovic, "Convergence and Scalarization for Data-Parallel Architectures," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), pp. 1-11, 2013.
M. Zheng, V. T. Ravi, F. Qin, and G. Agrawal, "GRace: a Low-overhead Mechanism for Detecting Data Races in GPU Programs," in ACM SIGPLAN Notices, vol. 46, pp. 135-146, ACM, 2011.
C. Lattner and V. Adve, "LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation," in Proc. IEEE/ACM Symp. on Code Generation and Optimization (CGO), 2004.
J. A. Stratton, S. S. Stone, and W.-m. W. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels for Multi-Core CPUs," in Languages and Compilers for Parallel Computing, Springer, 2008.
A. Yilmazer and D. Kaeli, "HQL: A Scalable Synchronization Mechanism for GPUs," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013.
A. Li, G.-J. van den Braak, H. Corporaal, and A. Kumar, "Fine-grained Synchronizations and Dataflow Programming on GPUs," 2015.
Y. Xu, L. Gao, R. Wang, Z. Luan, W. Wu, and D. Qian, "Lock-based Synchronization for GPU Architectures," in Proc. Int'l Conf. on Computing Frontiers, 2016.
