DataRaceOnAccelerator – A Micro-benchmark Suite for Evaluating Correctness Tools Targeting Accelerators
Springer eBooks, 2020
The advent of hardware accelerators over the past decade has significantly increased the complexity of modern parallel applications. For correctness, applications must properly synchronize the host with accelerators to avoid defects. Since concurrency defects on accelerators are hard to detect and debug, researchers have proposed several correctness tools. However, existing correctness tools targeting accelerators have not been comprehensively and objectively evaluated, since few micro-benchmarks are available that can test the functionality of such tools.
When it comes to data race detection, complete information about synchronization, concurrency, and memory accesses is needed. This information might be gathered at various levels of abstraction. For the best accuracy, it should be collected at the abstraction level of the parallel programming paradigm. With the latest preview of the OpenMP specification, a tools interface (OMPT) was added to OpenMP. In this paper, we discuss whether the synchronization information provided by OMPT is sufficient to apply accurate data race analysis to OpenMP applications. We further present implementation details and results for our data race detection tool, Archer, which derives its synchronization information from OMPT.
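The core of such an analysis can be illustrated with happens-before tracking via vector clocks: two conflicting accesses to the same location race unless one is ordered before the other by the synchronization the tool observed. The following is a minimal, hypothetical sketch of that idea, not Archer's actual implementation:

```python
# Minimal vector-clock happens-before model: two accesses to the same
# address race if neither is ordered before the other and at least one
# is a write. Hypothetical sketch; Archer's real analysis is far richer.

def vc_leq(a, b):
    """True if vector clock a happened-before (or equals) b."""
    return all(a.get(t, 0) <= b.get(t, 0) for t in set(a) | set(b))

class RaceChecker:
    def __init__(self):
        self.accesses = {}  # addr -> list of (thread, is_write, clock)
        self.races = []

    def access(self, addr, thread, is_write, clock):
        for (t2, w2, c2) in self.accesses.get(addr, []):
            if t2 != thread and (is_write or w2):
                # conflicting pair: a race unless ordered by happens-before
                if not vc_leq(c2, clock) and not vc_leq(clock, c2):
                    self.races.append((addr, thread, t2))
        self.accesses.setdefault(addr, []).append((thread, is_write, dict(clock)))

checker = RaceChecker()
# Thread 0 and thread 1 both write x with unrelated clocks -> race.
checker.access("x", 0, True, {0: 1})
checker.access("x", 1, True, {1: 1})
# Writes to y are ordered (thread 1's clock includes thread 0's) -> no race.
checker.access("y", 0, True, {0: 1})
checker.access("y", 1, True, {0: 1, 1: 2})
print(checker.races)  # [('x', 1, 0)]
```

The quality of the OMPT-provided synchronization information determines how precisely the clocks in such a model can be merged at barriers, task dependences, and locks.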
High-performance computing codes often combine the Message-Passing Interface (MPI) with a shared-memory programming model, e.g., OpenMP, for efficient computations. These so-called hybrid models may issue MPI calls concurrently from different threads at the highest level of MPI thread support. The correct use of either MPI or OpenMP can be complex and error-prone. The hybrid model increases this complexity even further. While correctness analysis tools exist for both programming paradigms, a new set of potential errors exists for hybrid models, whose detection requires combining knowledge of MPI and OpenMP primitives. Unfortunately, correctness tools do not fully support the hybrid model yet, and their current capabilities are also hard to assess. In previous work, to enable structured comparisons of correctness tools and improve their coverage, we proposed the MPI-CorrBench test suite for MPI. Likewise, others proposed the DataRaceBench test suite for OpenMP. However, no such test suite exists for the particular error classes of the hybrid model. Hence, we propose a hybrid MPI-OpenMP test suite to (1) facilitate correctness tool development in this area and, subsequently, (2) further encourage the use of the hybrid model at the highest level of MPI thread support. To that end, we discuss issues with this hybrid model and the knowledge of MPI and OpenMP primitives that correctness tools need to combine to detect them. In our evaluation of two state-of-the-art correctness tools, we see that for most cases of concurrent and conflicting MPI operations, these tools can cope with the added complexity of OpenMP. However, more intricate errors, where user code interferes with MPI, e.g., a data race on a buffer, still evade tool analysis.
CCS CONCEPTS • Computing methodologies → Parallel computing methodologies; • Software and its engineering → Correctness.
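The "user code interferes with MPI" error class mentioned above can be sketched as a simple ownership check: a thread that writes to a buffer while a nonblocking operation on that buffer is still pending commits an error. This toy model is illustrative only; real tools intercept MPI calls and track byte ranges, not whole named buffers:

```python
# Toy model of a hybrid-model buffer check: writing to a buffer that is
# still in use by a pending nonblocking MPI operation is flagged.
# Illustrative sketch; names and granularity are simplified.

class BufferCheck:
    def __init__(self):
        self.pending = {}   # request id -> buffer name
        self.errors = []

    def isend(self, req, buf):
        self.pending[req] = buf  # MPI owns buf until the matching wait

    def user_write(self, buf):
        if buf in self.pending.values():
            self.errors.append(f"write to {buf} while a nonblocking op is pending")

    def wait(self, req):
        self.pending.pop(req, None)  # ownership returns to the user

chk = BufferCheck()
chk.isend("r0", "sendbuf")
chk.user_write("sendbuf")   # error: another thread races with MPI_Isend
chk.wait("r0")
chk.user_write("sendbuf")   # fine after MPI_Wait
print(len(chk.errors))  # prints 1
```

Detecting this in practice is exactly the combination of knowledge the abstract calls for: the MPI side contributes the buffer's ownership interval, the OpenMP side contributes which thread performed the conflicting access.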
When aiming for large-scale parallel computing, waiting time due to network latency, synchronization, and load imbalance is the primary obstacle to high parallel efficiency. A common approach to hiding latency behind computation is the use of non-blocking communication. In the presence of a consistent load imbalance, synchronization cost is just the visible symptom of the load imbalance. Tasking approaches as in OpenMP, TBB, OmpSs, or C++20 coroutines promise to expose a higher degree of concurrency, which can be distributed over the available execution units and significantly improve load balance. The available MPI non-blocking functionality does not integrate seamlessly into such tasking parallelization. In this work, we present a slim extension of the MPI interface to allow seamless integration of non-blocking communication with the available concepts of asynchronous execution in OpenMP and C++.
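The kind of integration described here can be modeled as attaching a continuation to a nonblocking request, so that a task runtime resumes dependent work when the communication completes. The following toy event loop illustrates the concept; the names (`attach_continuation`, `progress`) are invented for the sketch and are not the proposed MPI interface:

```python
# Toy model of continuation-style completion for nonblocking requests:
# a task is resumed by the progress engine once its request completes.
# All names are illustrative, not the actual proposed MPI extension.
import collections

class Request:
    def __init__(self, payload):
        self.payload = payload
        self.done = False
        self.continuation = None

class Runtime:
    def __init__(self):
        self.inflight = collections.deque()
        self.log = []

    def isend(self, payload):
        req = Request(payload)
        self.inflight.append(req)
        return req

    def attach_continuation(self, req, fn):
        req.continuation = fn  # run when the request completes

    def progress(self):
        # complete one pending request per call, like an MPI progress poll
        if self.inflight:
            req = self.inflight.popleft()
            req.done = True
            if req.continuation:
                req.continuation(req)

rt = Runtime()
r = rt.isend("halo data")
rt.attach_continuation(r, lambda req: rt.log.append(f"resumed after {req.payload}"))
rt.progress()
print(rt.log)  # ['resumed after halo data']
```

The point of such a design is that no thread blocks in a wait call: completion is turned into an event the tasking runtime can schedule like any other ready task.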
Runtime Correctness Analysis of MPI-3 Nonblocking Collectives
The Message Passing Interface (MPI) includes nonblocking collective operations that support additional overlap between computation and communication. These new operations enable complex data movement between large numbers of processes. However, their asynchronous behavior hides and complicates the detection of defects in their use. We highlight a lack of correctness tool support for these operations and extend the MUST runtime MPI correctness tool to alleviate this complexity. We introduce a classification to summarize the types of correctness analyses that are applicable to MPI's nonblocking collectives. We identify complex wait-for dependencies in deadlock situations and incorrect use of communication buffers as the most challenging types of usage errors. We devise, demonstrate, and evaluate the applicability of correctness analyses for these errors. A scalable analysis mechanism allows our runtime approach to scale with the application. Benchmark measurements highlight the scalability and applicability of our approach at up to 4,096 application processes and with low overhead.
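The wait-for dependencies mentioned above are commonly analyzed as a directed graph: an edge p → q means process p cannot proceed until q reaches a matching operation, and a cycle implies deadlock. A minimal sketch of that check (MUST's real analysis is distributed and handles generalized wait-for semantics):

```python
# Deadlock detection as cycle search in a wait-for graph. An edge
# p -> q means p is blocked until q reaches a matching operation.
# Minimal sketch; a real tool builds this graph from intercepted calls.

def has_cycle(graph):
    """Detect a cycle in a directed wait-for graph (DFS with colors)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    def dfs(n):
        color[n] = GRAY
        for m in graph.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True          # back edge: cyclic wait -> deadlock
            if color.get(m, WHITE) == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and dfs(n) for n in list(graph))

# Rank 0 waits in a nonblocking collective's completion that rank 1
# never reaches, because rank 1 waits for a message from rank 0.
deadlocked = {"rank0": ["rank1"], "rank1": ["rank0"]}
ok = {"rank0": ["rank1"], "rank1": []}
print(has_cycle(deadlocked), has_cycle(ok))  # True False
```

What makes the nonblocking-collective case hard is constructing the edges in the first place: a process may have "entered" a collective long ago and only block much later in the corresponding wait.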
Particle advection is the approach for extracting integral curves from vector fields. Efficient parallelization of particle advection is a challenging task due to the problem of load imbalance, in which processes are assigned unequal workloads, causing some of them to idle while the others are still computing. Various approaches to load balancing exist, yet they all involve trade-offs such as increased inter-process communication or the need for central control structures. In this work, we present two local load-balancing methods for particle advection based on the family of diffusive load balancing. Each process has access to the blocks of its neighboring processes, which enables dynamic sharing of the particles based on a metric defined by the workload of the neighborhood. The approaches are assessed in terms of strong and weak scaling as well as load imbalance. We show that the methods reduce the total run-time of advection and are promising with regard to scaling, as they operate locally on isolated process neighborhoods.
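Diffusive load balancing can be pictured as repeated local smoothing: each process compares its particle count with its neighbors and ships a fraction of the difference toward the lighter side, with no global coordination. A toy one-dimensional version (the paper's methods use neighborhood workload metrics, not this exact rule):

```python
# One round of diffusive load balancing on a 1D ring of processes:
# each process moves a fraction of the load difference toward lighter
# neighbors. Illustrative sketch with a made-up diffusion rule.

def diffuse(load, alpha=0.5):
    """Return the new load vector after one diffusion step on a ring."""
    n = len(load)
    new = list(load)
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            delta = load[i] - load[j]
            if delta > 0:
                moved = int(alpha * delta / 2)  # ship part of the surplus
                new[i] -= moved
                new[j] += moved
    return new

load = [100, 0, 0, 0]   # all particles start on one process
for _ in range(10):
    load = diffuse(load)
print(load, sum(load))  # total particle count is conserved
```

Because every step is purely local, the scheme avoids both central control structures and all-to-all communication, which is exactly the trade-off the abstract highlights.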
ARBALEST: Dynamic Detection of Data Mapping Issues in Heterogeneous OpenMP Applications
From OpenMP 4.0 onwards, programmers can offload code regions to accelerators by using the target offloading feature. However, incorrect usage of target offloading constructs may incur data mapping issues. A data mapping issue occurs when the host fails to observe updates on the accelerator or vice versa. It may further lead to multiple memory issues such as use of uninitialized memory, use of stale data, and data races. To the best of our knowledge, there is no prior work on dynamic detection of data mapping issues in heterogeneous OpenMP applications. In this paper, we identify possible root causes of data mapping issues in OpenMP's standard memory model and the unified memory model. We find that data mapping issues primarily result from incorrect settings of map and nowait clauses in target offloading constructs. Further, the novel unified memory model introduced in OpenMP 5.0 cannot avoid the occurrence of data mapping issues. To mitigate the difficulty of detecting data mapping issues, we propose ARBALEST, an on-the-fly data mapping issue detector for OpenMP applications. For each variable mapped to the accelerator, ARBALEST's detection algorithm leverages a state machine to track the last write's visibility. ARBALEST requires constant storage space for each memory location and takes amortized constant time per memory access. To demonstrate ARBALEST's effectiveness, an experimental comparison with four other dynamic analysis tools (Valgrind, Archer, AddressSanitizer, MemorySanitizer) has been carried out on a number of open-source benchmark suites. The evaluation results show that ARBALEST delivers demonstrably better precision than the other four tools, and its execution time overhead is comparable to that of state-of-the-art dynamic analysis tools.
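The state-machine idea can be sketched with a handful of visibility states per mapped variable: a read on one side while the other side holds the last (uncopied) write is a data mapping issue. The states and events below are a simplified illustration of the concept, not ARBALEST's actual state machine:

```python
# Toy visibility state machine for one mapped variable. "host_dirty"
# means the host holds the last write and the device copy is stale,
# and vice versa. Simplified illustration of the paper's idea.

TRANSITIONS = {
    ("in_sync", "host_write"):    "host_dirty",
    ("in_sync", "device_write"):  "device_dirty",
    ("host_dirty", "map_to"):     "in_sync",   # host -> device copy
    ("device_dirty", "map_from"): "in_sync",   # device -> host copy
}

def check(events):
    """Return the list of stale reads found in an event trace."""
    state, issues = "in_sync", []
    for ev in events:
        if ev == "device_read" and state == "host_dirty":
            issues.append("device reads stale data (missing map-to)")
        elif ev == "host_read" and state == "device_dirty":
            issues.append("host reads stale data (missing map-from)")
        state = TRANSITIONS.get((state, ev), state)
    return issues

good = ["host_write", "map_to", "device_read",
        "device_write", "map_from", "host_read"]
bad = ["host_write", "device_read"]  # offload without mapping the buffer
print(check(good), check(bad))
```

Because only the current state is kept per location, storage is constant per memory location and each access is handled in constant time, matching the complexity the abstract claims for the real detector.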
With rapidly increasing concurrency, the HPC community is looking for new parallel programming paradigms to make the best use of current and upcoming machines. Under the Japanese CREST funding program, the post-petascale HPC project developed the XMP programming paradigm, a pragma-based partitioned global address space (PGAS) approach. Good tool support for debugging and performance analysis is crucial for the productivity, and therefore acceptance, of a new programming paradigm. In this work, we investigate which properties of a parallel programming language specification may help tools to highlight correctness and performance issues, or help to avoid common issues in parallel programming in the first place. In this paper, we exercise these investigations on the example of XMP. We also investigate how to improve the reusability of existing correctness and performance analysis tools.
CCS CONCEPTS • Software and its engineering → Correctness; Parallel programming languages; Software maintenance tools;
OpenMP plays a growing role as a portable programming model to harness on-node parallelism; yet, existing data race checkers for OpenMP have high overheads and generate many false positives. In this paper, we propose the first OpenMP data race checker, ARCHER, that achieves high accuracy, low overheads on large applications, and portability. ARCHER incorporates scalable happens-before tracking, exploits structured parallelism via combined static and dynamic analysis, and modularly interfaces with OpenMP runtimes. ARCHER significantly outperforms TSan and Intel Inspector XE, while providing the same or better precision. It has helped detect critical data races in the Hypre library that is central to many projects at Lawrence Livermore National Laboratory and elsewhere.
Testing Infrastructure for OpenMP Debugging Interface Implementations
Lecture Notes in Computer Science, 2016
With complex codes moving to systems of greater on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, OpenMP support within existing debuggers is either largely ineffective or unsustainable. The OpenMP tools working group is working to specify a debugging interface for the OpenMP standard to be implemented by every OpenMP runtime implementation. To increase the acceptance of this interface by runtime implementers and to ensure the quality of these interface implementations, the availability of a common testing infrastructure compatible with any runtime implementation is critical. In this paper, we present a promising software architecture for such a testing infrastructure.
Algorithmic Differentiation (AD) is a set of techniques to calculate derivatives of a computer program. In C++, AD typically requires (i) a type change of the built-in double, and (ii) a replacement of all MPI calls with AD-specific implementations. This poses challenges for MPI correctness tools, such as MUST, a dynamic checker, and TypeART, its memory sanitizer extension. In particular, AD (i) impacts memory layouts of the whole code, (ii) requires TypeART to track more memory allocations, and (iii) approximately doubles the MPI type checks of MUST due to an AD-specific communication reversal. To address these challenges, we propose a new callback interface for MUST to reduce the number of intercepted MPI calls, and we also improve the filtering capabilities of TypeART to reduce tracking of temporary allocations for the derivative computation. We evaluate our approach on an AD-enhanced version of CORAL LULESH. In particular, we reduce stack variable tracking from 32 million to 13 thousand. MUST with TypeART and the callback interface reduces the runtime overhead to that of vanilla MUST.
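The interaction of AD's type change with MPI type checking can be illustrated with a tiny allocation-tracking model: the tool records the element type of each buffer and compares it against the MPI datatype used in a communication call. The type names and compatibility table below are invented for the sketch; TypeART's real mechanism instruments allocations at the LLVM level:

```python
# Toy model of type-checked MPI buffers: record each allocation's
# element type and compare it with the MPI datatype claimed in a call.
# With AD, double becomes an active type (here "addouble"), so calls
# still claiming MPI_DOUBLE for such buffers are flagged. Sketch only.

allocations = {}  # address -> element type

def track_alloc(addr, elem_type):
    allocations[addr] = elem_type

def check_send(addr, mpi_type):
    """True if the buffer's element type matches the MPI datatype."""
    compatible = {"MPI_DOUBLE": {"double"}}  # made-up compatibility table
    elem = allocations.get(addr, "unknown")
    return elem in compatible.get(mpi_type, set())

track_alloc(0x1000, "double")
track_alloc(0x2000, "addouble")  # AD changed the scalar type
print(check_send(0x1000, "MPI_DOUBLE"))  # True: types match
print(check_send(0x2000, "MPI_DOUBLE"))  # False: active type sent as plain double
```

The scaling problem the paper addresses is visible even in this toy: every allocation, including short-lived temporaries created for derivative computation, lands in the tracking table unless it can be filtered out.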
Score-P and OMPT: Navigating the Perils of Callback-Driven Parallel Runtime Introspection
Lecture Notes in Computer Science, 2019
Event-based performance analysis aims at modeling the behavior of parallel applications through a series of state transitions during execution. Different approaches to obtain such transition points for OpenMP programs include source-level instrumentation (e.g., OPARI) and callback-driven runtime support (e.g., OMPT).
This chapter describes a multi-SPMD (mSPMD) programming model and a set of software and libraries to support the mSPMD programming model. The mSPMD programming model has been proposed to realize scalable applications on huge and hierarchical systems. It has become evident that simple SPMD programs such as MPI or XMP programs, or hybrid programs such as OpenMP/MPI, cannot exploit post-peta- or exascale systems efficiently due to the increasing complexity of applications and systems. The mSPMD programming model has been designed to adopt multiple programming models across different architecture levels. Instead of invoking a single parallel program on millions of processor cores, multiple SPMD programs of moderate size can work together in the mSPMD programming model. XMP is supported as a component of the mSPMD programming model. Fault-tolerance features, correctness checks, and implementations of some numerical libraries in the mSPMD programming model are also presented.
Understanding the Performance of Dynamic Data Race Detection
With increasing per-node concurrency, the interest in dynamic data race detection for OpenMP applications has increased significantly in recent years. Benchmarks such as DataRaceBench (DRB) help evaluate the classification quality of data race detection tools for simple memory access patterns. Various publications use short-running benchmark kernels from OmpSRC and DRB also for performance benchmarking of data race detection tools. Due to the short execution time, one-time initialization overhead dominates the measurement. Such results are not representative of the overhead with real codes. This paper proposes a new problem class for the SPEC OMP 2012 benchmark designed to analyze the runtime overhead of data race detection tools. Prior work reported runtime overheads of 80× and higher for the OpenMP data race detection tool Archer (i.e., execution time with the tool is 80 times as long as without a tool). For a specific application, we report 500× runtime overhead in this paper. This overhead stands in contrast to the 2-20× runtime overhead claimed by the underlying tool ThreadSanitizer. We use our newly proposed input data set to observe and investigate significant runtime overhead of dynamic data race detection for specific applications. With the help of performance analysis tools and hardware performance counters, we can identify massively concurrent read accesses to the same shared variable as the root cause. We identify parallel matrix-vector multiplication as an application pattern responsible for such huge runtime overheads in data race analysis. Finally, we propose a modification of ThreadSanitizer, limiting the runtime overhead for these applications to less than 40×.
Incorrect usage of OpenMP constructs may cause different kinds of defects in OpenMP applications. Most of the existing work focuses on concurrency bugs such as data races and deadlocks, since concurrency bugs are difficult to detect and debug. In this paper, we discuss an under-examined defect in OpenMP applications: memory anomalies. These occur when the application issues illegal memory accesses that may result in a non-deterministic result or even a program crash. Based on the latest OpenMP 5.0 specification, we analyze some OpenMP usage errors that may lead to memory anomalies. Then we illustrate three kinds of memory anomalies: use of uninitialized memory (UUM), use of stale data (USD), and use after free (UAF). While all three anomalies can occur in sequential programs, their manifestations in parallel OpenMP programs can be different, and debugging such anomalies in the context of parallel programs also imposes an additional complexity relative to sequential programs. To measure the effectiveness of memory anomaly detectors on OpenMP applications, we have evaluated three state-of-the-art tools with a group of micro-benchmarks. These micro-benchmarks are either selected from the DRACC benchmark suite or constructed from our own experience. The evaluation result shows that none of these tools can currently handle all three kinds of memory anomalies.
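Two of the three anomaly classes, UUM and UAF, fall out of tracking a simple per-address lifecycle in shadow state; USD additionally needs host/device visibility tracking and is omitted here. A minimal, purely illustrative model of that lifecycle check:

```python
# Minimal shadow-state model for memory anomaly detection: each address
# carries a lifecycle state, and reads are classified against it.
# Use of stale data (USD) would need extra host/device visibility
# tracking and is not modeled. Illustrative sketch only.

shadow = {}  # addr -> "allocated" | "initialized" | "freed"

def alloc(addr):  shadow[addr] = "allocated"
def write(addr):  shadow[addr] = "initialized"
def free(addr):   shadow[addr] = "freed"

def read(addr):
    state = shadow.get(addr)
    if state == "allocated":
        return "UUM"   # read before any write: uninitialized memory
    if state == "freed" or state is None:
        return "UAF"   # read after free (or of untracked memory)
    return "ok"

alloc(0x10); uum = read(0x10)   # never written -> UUM
write(0x10); ok = read(0x10)    # initialized -> fine
free(0x10);  uaf = read(0x10)   # read after free -> UAF
print(uum, ok, uaf)  # UUM ok UAF
```

The parallel-programming complexity the abstract points to comes on top of this: in an OpenMP program, the alloc, write, free, and read events may be issued by different threads or on different devices, so the detector must also establish their ordering.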
With greater adoption of various high-level parallel programming models to harness on-node parallelism, accurate data race detection has become more crucial than ever. However, existing tools have great difficulty spotting data races through these high-level models, as they primarily target low-level concurrent execution models (e.g., concurrency expressed at the level of POSIX threads).
Papers by Joachim Protze