Data Races

397 papers

16 followers

About this topic

Data races occur in concurrent programming when two or more threads access shared data simultaneously, and at least one thread modifies the data without proper synchronization mechanisms. This can lead to unpredictable behavior and inconsistent results, making it a critical concern in the design and implementation of multithreaded applications.
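As a minimal illustration of this definition (a sketch, not taken from any paper listed here), the Python snippet below has several threads update one shared counter. The read-modify-write in `value += 1` is not atomic, so without synchronization two threads could interleave and lose an update; holding a `threading.Lock` around it removes the race. The names `Counter`, `increment`, and `run_workers` are hypothetical.

```python
import threading


class Counter:
    """Shared counter whose updates are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the load/add/store steps of two threads
        # could interleave and one of the updates could be lost.
        with self._lock:
            self.value += 1


def run_workers(n_threads=4, n_increments=10_000):
    """Start n_threads workers that each increment the counter n_increments times."""
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run_workers()
```

With the lock in place, the final value is always the number of threads times the number of increments per thread.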

Key research themes

1. How can benchmark suites and algorithmic innovations improve the precision and efficiency of data race detection?

This research theme focuses on the creation and enhancement of benchmark suites designed to systematically evaluate data race detection tools and on the development of algorithms that improve the accuracy and performance of these detection methods. Accurate detection is crucial to ensuring correctness and reliability in multi-threaded programs, while efficient algorithms make real-time or on-the-fly detection feasible, reducing overhead during program execution.

Enhancing DataRaceBench for Evaluating Data Race Detection Tools

by Chunhua Liao

2023, 2020 IEEE/ACM 4th International Workshop on Software Correctness for HPC Applications (Correctness)

Key finding: This paper presents the significant expansion of the DataRaceBench suite, adding 222 benchmarks including Fortran versions and new OpenMP 5.0 features, and introduces a distance-based code similarity analysis to reduce...

DataRaceBench V1.4.1 and DataRaceBench-ML V0.1: Benchmark Suites for Data Race Detection

by Chunhua Liao

2024, arXiv (Cornell University)

Key finding: This work advances DataRaceBench by integrating 20 new data race cases and developing DataRaceBench-ML, a dataset tailored for machine learning and large language model (LLM) applications. The dataset includes detailed labels...

An Efficient Algorithm for On-the-Fly Data Race Detection Using an Epoch-Based Technique

by Ok-Kyoon Ha

2022, Scientific Programming

Key finding: The authors propose iFT, an epoch-based algorithm that eliminates the need for vector clock switching in race detection, requiring only O(1) operations to maintain access histories and detect data races. Compared to...
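The listing truncates the description of iFT, so the following is not the paper's algorithm. It is a generic sketch of the epoch idea popularized by the FastTrack detector: the last write to a variable is stored as a single (clock, thread) pair, and checking whether that write is ordered before the current access is one comparison against the accessing thread's vector clock. Only the write-write check is shown, and all names (`Epoch`, `happens_before`, `check_write`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Epoch:
    clock: int  # logical time of the access
    tid: int    # thread that performed it


def happens_before(epoch, thread_vc):
    """O(1) check: is the access at `epoch` ordered before the current
    access of a thread whose vector clock is `thread_vc`?"""
    return epoch.clock <= thread_vc[epoch.tid]


def check_write(last_write, tid, thread_vc):
    """Return (is_race, new_epoch) for a write by thread `tid`."""
    is_race = last_write is not None and not happens_before(last_write, thread_vc)
    return is_race, Epoch(thread_vc[tid], tid)


# Thread 0 performs the first write at its logical time 3.
race0, w = check_write(None, 0, [3, 0])

# Thread 1 writes without having synchronized with thread 0.
race1, _ = check_write(w, 1, [0, 1])

# Thread 1 writes after synchronizing with thread 0 (its vector clock
# now includes thread 0's time 3), so the writes are ordered.
race2, _ = check_write(w, 1, [3, 2])
```

Here `race1` is reported because thread 1 has not yet observed thread 0's clock value, while `race2` is not.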

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs

by Ok-Kyoon Ha

2022, Advanced Science and Technology Letters

Key finding: This paper introduces VcTrace, a practical, efficient dynamic monitoring tool based on vector clock analysis to detect data races in multithreaded C/C++ programs. VcTrace uses dynamic binary instrumentation with minimal...
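To make vector clock analysis concrete, here is a small generic sketch. It is not VcTrace's implementation, which the truncated summary does not describe. Each thread carries a vector of logical clocks; two accesses to the same location race if at least one is a write and neither vector is ordered before the other. The function names are hypothetical.

```python
def leq(a, b):
    """True if vector clock a is component-wise <= b."""
    return all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """Neither access is ordered before the other."""
    return not leq(a, b) and not leq(b, a)


def join(a, b):
    """Merge clocks when a thread synchronizes with another thread,
    e.g. by acquiring a lock that the other thread released."""
    return [max(x, y) for x, y in zip(a, b)]


def is_race(access1, access2):
    """Each access is a (vector_clock, is_write) pair on the same location."""
    (vc1, w1), (vc2, w2) = access1, access2
    return (w1 or w2) and concurrent(vc1, vc2)


# Two threads access the location with no synchronization in between.
t0_write = ([1, 0], True)
t1_read = ([0, 1], False)
unsynchronized = is_race(t0_write, t1_read)

# Thread 1 first synchronizes with thread 0, then advances its own clock.
t1_clock = join([0, 1], [1, 0])
t1_clock[1] += 1
synchronized = is_race(t0_write, (t1_clock, False))
```

The first pair is flagged because the two clocks are incomparable; after the join, thread 0's write is ordered before thread 1's read.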

Analysis on Interactive Data Race Checker: IDRC

by Md Abu Obaida

2017

Key finding: The study presents IDRC, an Eclipse plugin providing interactive, incremental static analysis for early detection of data races in Java projects during development. By integrating data race warnings directly in the IDE and...

2. What hardware and programming model innovations can reduce the complexity and nondeterminism caused by data races in parallel systems?

This theme addresses how disciplined parallel programming models and novel hardware architectures can mitigate data race complexities in shared-memory systems. It investigates programming language abstractions ensuring data-race-freedom and deterministic behaviors, alongside hardware designs leveraging these guarantees for simpler, scalable, and energy-efficient cache coherence and memory systems. This alignment potentially reduces nondeterministic bugs and aids maintainability in multicore architectures.

DeNovo: Rethinking Hardware for Disciplined Parallelism

by Hyojin Sung

2022

Key finding: This paper argues that disciplined parallel programming models enforcing data-race-freedom and structured parallel control allow a radical redesign of shared-memory hardware, eliminating complex directory-based coherence and...

Domains: Sharing state in the communicating event-loop actor model

by Tom Van Cutsem

2023, Computer Languages, Systems & Structures

Key finding: The authors propose four novel language abstractions ('domains') enabling safe shared mutable state within the pure actor model by categorizing state as immutable, isolated, observable, or shared, each with operational...

3. How do socio-technical perspectives and engagement with data influence contentious data practices and the politics surrounding datafication?

This research focus explores the role of social movements, activism, and civil society in shaping data politics through engagements that contest dominant datafication processes. It examines bottom-up transformative practices—termed 'contentious politics of data'—that challenge or reappropriate data infrastructures, emphasizing data both as a tool and object in political struggle. Understanding these dynamics is essential to comprehending how data acts as a site of power, resistance, and care in contemporary digital societies.

From data politics to the contentious politics of data

by Stefania Milan

2021, Big Data & Society

Key finding: This article conceptualizes 'contentious politics of data' as civil society's bottom-up initiatives that interfere with dominant datafication, mapping data activism along two analytical dimensions: 'data as stakes'...

Careful Data Tinkering

by Anh-Ton Tran

2023, Proceedings of the ACM on Human-Computer Interaction

Key finding: Through an ethnographic study of the Housing Justice League's Tenant Power Hotline, this work highlights how grassroots organizations engage in 'careful tinkering' with data practices to negotiate between care and efficiency....

Inside the Data Spectacle

by Melissa Gregg

2014, Television and New Media

Key finding: This paper analyzes the spectacle of large-scale data visualization within tech industry contexts, framing it as 'below the line' labor involving rhetorical work to produce and sustain myths of technological progress and...

All papers in Data Races

SigRace

by Dario Gomez Suarez

2025, ACM SIGARCH Computer Architecture News

Detecting data races in parallel programs is important for both software development and production-run diagnosis. Recently, there have been several proposals for hardware-assisted data race detection. Such proposals typically modify the...

Finding Concurrency-Related Bugs Using Random Isolation

by Julian Dolby

2025, Lecture Notes in Computer Science

This paper describes the methods used in Empire, a tool to detect concurrency-related bugs, namely atomic-set serializability violations in Java programs. The correctness criterion is based on atomic sets of memory locations, which share...

HALCONE: A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

by David Kaeli

2025, arXiv (Cornell University)

While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs' memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestamp-based coherence protocol, HALCONE, for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HALCONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6× and 3× better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks. Broadly, HALCONE scales well with both GPU count and CU count. Furthermore, we stress test our HALCONE protocol using a custom synthetic benchmark suite to evaluate its impact on the overall performance. When running our synthetic benchmark suite, the HALCONE protocol slows down the execution time by only 16.8% in the worst case.

Hardware support for Local Memory Transactions on GPU Architectures

by David Kaeli

2025

Graphics Processing Units (GPUs) are popular hardware accelerators for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction Multiple Thread (SIMT) fashion. However, the SIMT execution model is...

Improving the Java memory model using CRF

by Jan-Willem Maessen

2025, ACM SIGPLAN Notices

This paper describes alternative memory semantics for Java programs using an enriched version of the Commit/Reconcile/Fence (CRF) memory model [16]. It outlines a set of reasonable practices for safe multithreaded programming in Java. Our...

Memory Model = Instruction Reordering + Store Atomicity

by Jan-Willem Maessen

2025, ACM SIGARCH Computer Architecture News

We present a novel framework for defining memory models in terms of two properties: thread-local Instruction Reordering axioms and Store Atomicity, which describes inter-thread communication via memory. Most memory models have the store...

Data flow equations for explicitly parallel programs

by John Hennessy

2025, Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming - PPOPP '93

We present a solution to the reaching definitions problem for programs with explicit lexically specified parallel constructs, such as cobegin/coend or parallel_sections, both with and without explicit synchronization operations, such as...

How Do Developers Use APIs? A Case Study in Concurrency

by Joseph Kiniry

2025

With the omnipresent usage of APIs in software development, it has become important to analyse how the routines and functionalities of APIs are actually used. This information is in particular useful for API developers, to make decisions...

Acculock: Accurate and efficient detection of data races

by Jingling Xue

2025

Happens-before detectors are precise but can be too conservative to detect certain data races in repeated test runs as they are sensitive to thread interleaving. By making the opposite tradeoffs, lockset detectors can detect more races...
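The lockset side of this trade-off can be illustrated with a sketch of the classic lockset idea, in the style of the Eraser detector. This is not Acculock's algorithm, which the truncated abstract does not describe. For each shared variable the checker intersects the sets of locks held at every access; an empty intersection on a variable touched by more than one thread means that no single lock consistently protects it, whatever interleaving happened to occur. The class and method names are hypothetical.

```python
class LocksetChecker:
    """Simplified lockset checker (no read/write or ownership states)."""

    def __init__(self):
        self.candidates = {}  # variable -> locks held at every access so far
        self.threads = {}     # variable -> ids of threads that accessed it

    def access(self, var, tid, locks_held):
        """Record an access; return True if a potential race is reported."""
        held = frozenset(locks_held)
        # Keep only the locks that were held at all accesses seen so far.
        self.candidates[var] = self.candidates.get(var, held) & held
        self.threads.setdefault(var, set()).add(tid)
        shared = len(self.threads[var]) > 1
        return shared and not self.candidates[var]


checker = LocksetChecker()
r1 = checker.access("x", 0, {"m"})       # first access, candidates = {m}
r2 = checker.access("x", 1, {"m", "n"})  # candidates stay {m}
r3 = checker.access("x", 1, {"n"})       # candidates become empty
```

The third access is reported even if, in this particular run, it happened to be ordered after the others. That is why lockset checking is insensitive to interleaving but can also produce false positives.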

LXDs: Towards Isolation of Kernel Subsystems

by Aftab Hussain

2025, USENIX Annual Technical Conference

Modern operating systems are monolithic. Today, however, lack of isolation is one of the main factors undermining security of the kernel. Inherent complexity of the kernel code and rapid development pace combined with the use of unsafe,...

LRW Lock: Light-weight Read Write Lock

by Praveen Alapati and

2025

Efficient management of concurrent access to shared resources is crucial in modern multi-threaded systems to avoid race conditions and performance bottlenecks. Traditional locking mechanisms, such as standard read-write locks, often...

BTRACE: Path Optimization for Debugging

by Akash Lal

2025

We present and solve a path optimization problem on programs. Given a set of program nodes, called critical nodes, we find a shortest path through the program's control flow graph that touches the maximum number of these nodes...

Weighted Pushdown Systems and Weighted Transducers

by Akash Lal

2025

Pushdown Systems (PDSs) are an important formalism for modeling programs. Reachability analysis on PDSs has been used extensively for program verification. A key result, which made PDSs popular in the model-checking community was that the...

KUDA: GPU Accelerated Split Race Checker

by Can Bekar

2025, Workshop on Determinism and Correctness in Parallel Programming (WoDet), London, England, UK

We propose a novel approach for runtime verification on computers with a large number of computation cores, without any hardware extension to mainstream PC environment. The goal of the approach is making use of all hardware resources to...

Taking Static Analysis to the Next Level: Proving the Absence of Run-Time Errors and Data Races with Astrée

by Xavier Rival

2025, HAL (Le Centre pour la Communication Scientifique Directe)

Taking Static Analysis to the Next Level: Proving the Absence of Run-Time Errors and Data Races with Astrée

by Xavier Rival

2025

We present an extension of Astrée to concurrent C software. Astrée is a sound static analyzer for run-time errors previously limited to sequential C software. Our extension employs a scalable abstraction which covers all possible thread...

Efficient system-enforced deterministic parallelism

by Amittai Aviram

2024, Communications of The ACM

Deterministic execution offers many benefits for debugging, fault tolerance, and security. Current methods of executing parallel programs deterministically, however, often incur high costs, allow misbehaved software to defeat...

Cooperative crug isolation

by Aditya Thakur

2024, Proceedings of the Seventh International Workshop on Dynamic Analysis

With the widespread deployment of multi-core hardware, writing concurrent programs has become inescapable. This has made fixing concurrency bugs (or crugs) critical in modern software systems. Static analysis techniques to find crugs such...

Profiling of SCOOP Programs Master Thesis

by Bertrand Meyer

2024

SCOOP (Simple Concurrent Object-Oriented Programming) [18] is a model and practical framework for building concurrent applications. It comes as a refinement of the Eiffel [15] programming language and is in the process of being integrated...

The semantics of x86-CC multiprocessor machine code

by Joe Blow

2024, ACM SIGPLAN Notices

Multiprocessors are now dominant, but real multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have subtle relaxed (or weak) memory models, usually...

Instrumentation of Java bytecode for runtime analysis

by Klaus Havelund

2024

This paper describes JSpy, a system for high-level instrumentation of Java bytecode and its use with JPaX, our system for runtime analysis of Java programs. JPaX monitors the execution of temporal logic formulas and performs predicative...

Acculock: Accurate and efficient detection of data races

by Jingling Xue

2024, Symposium on Code Generation and Optimization

W.K. Chan, T.Y. Chen, and T.H. Tse, "An overview of integration testing techniques for object-oriented programs"

by T.H. Tse

2024, Proceedings of the 2nd ACIS Annual International Conference on Computer and Information Science (ICIS '02), International Association for Computer and Information Science, Mt. Pleasant, MI, USA, pp. 696-701 (August 2002)

Object-oriented programs involve many unique features that are not present in their conventional counterparts. Examples are message passing, synchronization, dynamic binding, object instantiation, persistence, encapsulation, inheritance,...

Analyzing memory management methods on integrated CPU-GPU systems

by Mohammad Dashti

2024, ACM SIGPLAN Notices

Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates...

Figure 1. A high level view of the NVIDIA integrated CPU-GPU system used in this work.

Figure 3. Write bandwidth under different memory management methods.

Figure 4. Performance of Rodinia applications with the different global memory allocation methods. (a) Overall. (b) Kernel-only.

Figure 5. Accesses are cached in L2. (a) Each GPU thread accesses its respective array index. (b) All GPU threads access the same array index.

Figure 6. Accesses are cached in L1. (a) Each GPU thread accesses its respective array index. (b) All GPU threads access the same array index.

Figure 7. CPU-GPU concurrent benchmarks. Both the CPU and GPU work on independent data. Default-conc (y-axis) is the default implementation of the concurrent version of the benchmark where data has to be explicitly copied between the CPU/GPU.

The default applications in Rodinia do not impose much interaction and sharing between threads running on the CPU and GPU. This has been explored in some detail in a recent paper [8]. So, we modified a number of applications to introduce such interactions. In the modified Rodinia applications, the kernel works on half of the data by launching half as many GPU threads as in the original. The second half of the data is processed by CPU threads (via OpenMP) concurrently with the GPU threads, except for the managed scheme where they run in lockstep.

Figure 8. The locking scheme used in fine-grained sharing.

Figure 9. Fine-grained sharing benchmark showing the performance of hostalloc and sharedalloc.

While the shared region itself can be allocated using hostalloc or sharedalloc, the shared lock must always be allocated using hostalloc, because it requires strict consistency between the CPU and the GPU.

Figure 10. Memory system architecture of AMD Kaveri. Figure reproduced from [15]. CP: Command Processor, GNB: Garlic North Bridge, UNB: Unified North Bridge.

Figure 11. Write memory bandwidth for OpenCL 1.2 (coarse), OpenCL 2.0 (SVM) and HSA (fine-grained).

From now on, we refer to the three chosen configurations simply as OpenCL 1.2, OpenCL 2.0 and HSA, for brevity. We measure the read and write bandwidth under the three chosen configurations using the same test as in Section 4: we vary the size of the allocated array, and each GPU thread reads or writes its own dedicated 8-byte entry. The number of threads is, therefore, the same as the number of array entries. We verified both experimentally and by reading the code and documentation that none of the memory management methods on the AMD system disable CPU caching or affect memory access latency on the CPU, so we only report the results of experiments on the GPU.

Figure 12. Write throughput as multiple threads concurrently update a single memory location allocated as coarse (non-coherent) or fine (coherent) with HSA.

Figure 13. Latency of OpenCL 1.2, OpenCL 2.0, HSA kernel execution time. Array size = 64 MB and each array entry is either not padded, padded to 32 Bytes, or padded to 64 Bytes (cache line size). (a) Kernel: memory[idx]++;. (b) Kernel: if (idx % 2 == 0) memory[idx]++;.

Figure 14. Slowdown of OpenCL 2.0 compared to OpenCL 1.2 as we increase the padding size of array entries. Kernel: memory[idx]++;.

Figure 15. Runtime of Rodinia applications normalized to OpenCL 1.2.

A summary of caching and concurrency trade-offs for the different memory allocation schemes is shown in Table 1.

Finding concurrency-related bugs using random isolation

by Julian Dolby

2024, International Journal on Software Tools for Technology Transfer

Extending AOP to Support Broad Runtime Monitoring Needs

by Jonathan Cook

2024

Runtime monitoring, where some part of a program's behavior and/or data is observed during execution, is a very useful technique that software developers use for understanding, analyzing, debugging, and improving their...

Accelerating real-time deterministic discovery through single instruction multiple data graphical processor unit for executing distributed event logs

by International Journal of Electrical and Computer Engineering (IJECE)

2024, International Journal of Electrical and Computer Engineering (IJECE)

With the rapid expansion of process mining implementation in global enterprises distributed across numerous branches, there is a critical requirement to develop an application qualified for real-time operation with fast and precise data...

Internally deterministic parallel algorithms can be fast

by Guy Blelloch

2024, ACM SIGPLAN Notices

The virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there...

Accounting in Genetics

by Karl Javorszky

2024

We present a logical tool which allows understanding the rationality of the translation underlying some interactions in Nature. In an abstract, formal way, we can demonstrate the epistemological link between a sequence and a...

On correcting the intrusion of tracing non-deterministic programs by software

by Florin Teodorescu

2024, Springer eBooks

This paper describes a performance evaluation technique of parallel programs based on software tracing. The interest of the proposed method is to enable post-mortem correction of the intrusion of software tracing of non deterministic...

On correcting the intrusion of tracing non-deterministic programs by software

by Florin Teodorescu

2024, Lecture Notes in Computer Science

An Abstract memory model describing the interaction between thread and memory with debugger tools

by Raghuraj Singh

2024

This paper describes multithreaded execution and data race detectors, which are commonly viewed as debugging tools. The C++ Standard defines single-threaded program execution. Multithreaded execution requires a much more refined memory and execution model. C++ threading libraries are in the awkward situation of specifying an extended memory model for C++ in order to specify program execution. We suggest integrating a memory model suitable for multithreaded execution into the C++ Standard. We want to make programs fast and error-free, but ideally this is not possible. To overcome this problem, the first concept we present in this paper is threading and the second is the data race detector. They would allow us to give precise, simple, and safe semantics to shared variables in multithreaded programs, a problem that has so far defied a complete solution.

Keywords: atomicity, data race.

I. INTRODUCTION

Multithreaded execution is used in most of today's programming. C++ is commonly used as part of multithreaded applications, sometimes with either direct calls into an OS-provided threading library or with the aid of an intervening layer that provides a platform-neutral interface. Properties critical for reliable, efficient, and correct multithreaded execution are left unspecified. The C++ Standard specifies program execution in terms of observable behavior, which in turn describes sequential execution on an implicitly single-threaded abstract machine. The main sketch of attack is:

1. Specification of an abstract memory model describing the interactions between threads and memory.
2. Application of this model to existing aspects of the C++ specification to replace the current implicitly sequential semantics. This will entail new constraints on how compilers can emit and optimize code. In particular, this will entail a reworking of the specification of volatile to provide useful multithreaded semantics.
3. Introduction of a small number of standard library classes providing standardized access to atomic update operations (such as compare_and_set). These classes will have multithreaded semantics integrated with the above

A highly-parallel formulation of quantum computing simulation through fine-grained dataflow

by Tom Van Cutsem

2024

Quantum Computing lies at the frontier of computing, offering a radically different and unconventional model of computation. In the absence of practical quantum computers today, we must simulate their execution. This creates a performance...

Automated approach to Register Design and Verification of complex SOC

by Ballori Banerjee

2024

Today's designs contain several hundreds to thousands of registers and memory elements. Starting from documentation to design implementation to verification of each single register, each bit and its property involves a lot of time and...

process repeats itself many times over the course of the project. Bugs are only one source of change though. Marketing requests may also come in at any stage of the design cycle requiring the specification to change and all downstream code to be modified. Figure 1 captures this course in a flow chart.

This is shown in Figure 2. The RDL file serves as a one-stop point for any register update required following a requirement change.

3.1.1 Choosing SystemRDL

4. DEMONSTRATING WITH A CSR EXAMPLE

A complete VMM compliant randomized, coverage driven register verification environment can be created by extending the flow such that.

For each field, register, block and system component available in RALF, the RAL contains a SystemVerilog class. These classes are extended from RAL base classes. The attributes of the components in RALF such as base address, offset, reset value, domain name are passed to the individual classes as arguments. This RAL model can be integrated in a VMM environment for complete DUT register verification, as in Figure 4. The XL XACTOR translates the RAL commands to interface commands. The BFM uses these to drive DUT signals as per protocol.

Figure 4: RAL integration in VMM environment

RAL has several useful features that help in building a verification environment for large and complex designs:

Table 3: Verilog RTL Interface for CSR_EXAMPLE

Table 6: Interrupt Register Example

RDL provides particular constructs to define registers like interrupt-enable/mask and interrupt-status from which the interrupt will be derived. Each bit in the interrupt status register has to be mapped to the corresponding enable/mask bit in the interrupt enable/mask register using the interrupt field access property enable or mask. If the property is enable, the corresponding interrupt source is used to generate an interrupt. In the case of mask, the corresponding interrupt source is not used to generate an interrupt. Each fieldwidth defined in the interrupt status and interrupt enable registers should be 1. A SystemRDL example for the interrupt status and enable registers and their mapping is given in Table 6.
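The enable/mask mapping described above amounts to a bitwise formula. The sketch below is plain Python rather than SystemRDL, and the function names are hypothetical. It assumes one status bit per interrupt source, paired with the enable or mask bit at the same position, and it assumes the final interrupt is the OR of all contributing sources.

```python
def interrupt_from_enable(status: int, enable: int) -> bool:
    """A pending source contributes only if its enable bit is 1."""
    return (status & enable) != 0


def interrupt_from_mask(status: int, mask: int) -> bool:
    """A pending source contributes only if its mask bit is 0."""
    return (status & ~mask) != 0


status = 0b0101  # sources 0 and 2 are pending

a = interrupt_from_enable(status, 0b0010)  # only source 1 is enabled
b = interrupt_from_enable(status, 0b0100)  # source 2 is enabled
c = interrupt_from_mask(status, 0b0101)    # both pending sources are masked
d = interrupt_from_mask(status, 0b0001)    # source 2 is left unmasked
```

In this model an enable register and a mask register are bitwise complements of each other: `interrupt_from_mask(s, m)` gives the same answer as `interrupt_from_enable(s, ~m)`.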

Pruners: Providing reproducibility for uncovering non-deterministic errors in runs on supercomputers

by Joachim Protze

2024, International Journal of High Performance Computing Applications

Enabling Modularity and Re-use in Dynamic Program Analysis Tools for the Java Virtual Machine

by W. Binder

2024, Lecture Notes in Computer Science

Dynamic program analysis tools based on code instrumentation serve many important software engineering tasks such as profiling, debugging, testing, program comprehension, and reverse engineering. Unfortunately, constructing new analysis...

To illustrate this, we observe how a variety of useful tools can be constructed by independently recombining different shadow value mappings (i.e., what elements of program state to shadow) with different update rules (i.e., how to shadow them). First, consider a simple code coverage tool working at the basic block level. It consists of (1) an instrumentation of basic block entries, (2) a mapper associating a shadow boolean value to every distinct basic block ID, and (3) an updater that, for each basic block ID that is received, sets its shadow boolean to true. Instead of coverage, suppose we now require a count-based profiling tool. We shadow the same program state elements, but now with an integer updated by increments (instead of a boolean updated by set-to-true).

Now consider a context-sensitive profiler. We keep the same updater, but maintain each shadow per call chain. This means our mapper now has two levels: from call chain, to basic block, to the counter payload. The set of call chains must itself be constructed by additional instrumentation, applied to method entry and exit, typically to maintain a calling context tree [1].

Note that the overall form of the system is still the same, and it contains the same kinds of units: an instrumentation that observes events of interest, a mapper of such events to the relevant part of the analysis state (possibly over multiple stages), and updaters of individual state elements in response to events to reflect the context information available for such events. An interesting property is that the mapping logic can itself maintain state and be sensitive to events gathered using instrumentation, as with the call chain in the latter example.
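The decomposition described above (instrumentation events, a mapper from events to shadow state, and an updater for each shadow value) can be sketched in a few lines. This is an illustrative Python model, not the paper's JVM framework; `make_tool`, the two updaters, and the event streams are all hypothetical.

```python
def make_tool(initial, update):
    """Build an analysis from an initial shadow value and an update rule.
    The mapper is a dict keyed by whatever the event identifies."""
    shadows = {}

    def on_event(key):
        shadows[key] = update(shadows.get(key, initial))

    return shadows, on_event


# Update rules: how a shadow value changes when its event fires.
def set_true(old):
    return True


def increment(old):
    return old + 1


# Events that instrumentation of basic-block entries might deliver.
block_events = ["B1", "B2", "B1", "B3", "B1"]

coverage, cover = make_tool(False, set_true)  # coverage tool
counts, count = make_tool(0, increment)       # count-based profiler
for block in block_events:
    cover(block)
    count(block)

# Context-sensitive profiler: same updater, but the mapper key is now
# (call chain, basic block) instead of the basic block alone.
ctx_counts, ctx_count = make_tool(0, increment)
ctx_events = [(("main",), "B1"), (("main", "f"), "B1"), (("main",), "B1")]
for chain, block in ctx_events:
    ctx_count((chain, block))
```

Swapping the updater turns the coverage tool into a profiler, and changing only the mapper key makes the profiler context-sensitive, which is the kind of recombination the passage describes.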

Fig. 1. Trades of ease-of-use with flexibility in existing frameworks

construct dynamic analysis tools. Our insight is that despite the latent commonality we saw in the previous section, current infrastructure makes it either impossible or extremely difficult to structure dynamic analysis tools so that this logic can be isolated and re-used. We survey these existing infrastructures in two broad categories: low-level libraries for code transformation, and higher-level instrumentation-based frameworks.

Fig. 11. Mean startup and steady-state overhead, with 95% confidence interval.


JP2: collecting dynamic bytecode metrics in JVMs

by W. Binder

2024

The collection of dynamic metrics is an important part of performance analysis and workload characterization. We demonstrate JP2, a new tool for collecting dynamic bytecode metrics for standard Java Virtual Machines (JVMs). The... more


Formalization of Conflict Analysis of Programs with Procedures, Thread Creation, and Monitors

by Markus Müller-Olm

2024, The Archive of Formal Proofs

We study conflict detection for programs with procedures, dynamic thread creation and a fixed finite set of (reentrant) monitors. We show that deciding the existence of a conflict is NP-complete for our model (that abstracts guarded... more


MIMD synchronization on SIMT architectures

by Ahmed ElTantawy

2024, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

In the single-instruction multiple-threads (SIMT) execution model, small groups of scalar threads operate in lockstep. Within each group, current SIMT hardware implementations serialize the execution of threads that follow different... more

Fig. 1: SIMT-Induced Deadlock

... threads within the same warp to diverge (i.e., follow different control flow paths). However, they achieve this by serializing the execution of different control-flow paths while restoring SIMD utilization by forcing divergent threads to reconverge as soon as possible (typically at an immediate postdominator point) [2], [5], [8]. This in turn creates implicit scheduling constraints for divergent threads within a warp. Therefore, when GPU kernel code is written in such a way that the programmer intends divergent threads to communicate, these scheduling constraints can lead to surprising (from a programmer perspective) deadlock and/or livelock conditions. Thus, a multi-threaded program that is guaranteed to terminate on a MIMD architecture may not terminate on machines with current SIMT implementations.

Ahmed ElTantawy and Tor M. Aamodt, University of British Columbia, {ahmede,aamodt}@ece.ubc.ca
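The scheduling constraint behind this kind of deadlock can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the paper's mechanism or a GPU simulator: it models a warp whose threads spin on a single lock, and the function name `warp_terminates` and its parameters are hypothetical. With forced reconvergence, the thread that acquires the lock must wait at the reconvergence point for the others, so it never reaches the release and the remaining threads spin until the round limit.

```python
def warp_terminates(num_threads, forced_reconvergence, max_rounds=100):
    """Toy model of threads in one warp contending for a spin lock.

    forced_reconvergence=True models SIMT-style execution: a thread that
    leaves the spin loop waits for the rest of the warp before it may run
    the critical section and release the lock.
    forced_reconvergence=False models MIMD-style independent progress.
    Returns True if every thread finishes within max_rounds.
    """
    lock = 0
    spinning = list(range(num_threads))  # threads still in the spin loop
    parked = []                          # threads waiting to reconverge
    finished = []

    for _ in range(max_rounds):
        if not spinning:
            break
        still_spinning = []
        for t in spinning:
            if lock == 0:        # compare-and-swap on the lock succeeds
                lock = 1
                parked.append(t)
            else:                # compare-and-swap fails, keep spinning
                still_spinning.append(t)
        spinning = still_spinning

        if not forced_reconvergence:
            # Independent progress: the lock holder runs the critical
            # section and releases the lock right away.
            for t in parked:
                finished.append(t)
                lock = 0
            parked = []

    if forced_reconvergence and not spinning:
        # The whole warp reached the reconvergence point.
        finished.extend(parked)
    return len(finished) == num_threads
```

In this model `warp_terminates(4, forced_reconvergence=False)` finishes, while the same warp with `forced_reconvergence=True` does not; a single-thread warp finishes either way because there is no divergence.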

Fig. 3: Modified SIMT compliant Spin Lock

Fig. 4: SIMT-induced deadlock scenarios

... occurs if these indefinitely blocked paths must execute to enable the exit conditions of the looping threads. To avoid this, our compiler-based SIMT deadlock elimination algorithm (explained in more detail in Section IV-A) replaces the backward edge of a loop identified by Algorithm 1 with two edges: a forward edge towards the loop's SafePDom, and a backward edge from SafePDom to the loop header. This modification, combined with the forced reconvergence constraint, guarantees that threads iterating in the loop wait at the SafePDom for threads executing other paths postdominated by SafePDom before attempting another iteration. Accordingly, SafePDom should postdominate the original loop exits, the redefining writes, and all control flow paths that could lead to redefining writes that are either reachable from the loop (lines 4-9 in Algorithm 2) or parallel to it (lines 10-14 in Algorithm 2).
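The edge replacement described above can be sketched on a control-flow graph stored as a set of edges. This is only an illustration of the graph rewrite: the function name `redirect_back_edge` and the node names are hypothetical, branch conditions are ignored, and choosing a valid SafePDom (the job of Algorithm 2) is assumed to have been done already.

```python
def redirect_back_edge(edges, loop_tail, loop_header, safe_pdom):
    """Replace the loop's backward edge with two edges through safe_pdom.

    edges is a set of (source, target) pairs. The backward edge
    (loop_tail -> loop_header) is replaced by a forward edge
    (loop_tail -> safe_pdom) and a backward edge (safe_pdom -> loop_header).
    """
    back_edge = (loop_tail, loop_header)
    if back_edge not in edges:
        raise ValueError("backward edge not present in the graph")
    rewritten = set(edges)
    rewritten.remove(back_edge)
    rewritten.add((loop_tail, safe_pdom))
    rewritten.add((safe_pdom, loop_header))
    return rewritten


# Hypothetical CFG: a loop (header -> tail -> header) whose exit path
# releases a lock before reaching the chosen reconvergence point.
cfg = {
    ("entry", "header"),
    ("header", "tail"),
    ("tail", "header"),      # original backward edge
    ("tail", "release"),
    ("release", "safe_pdom"),
}
new_cfg = redirect_back_edge(cfg, "tail", "header", "safe_pdom")
```

After the rewrite, every path back to the loop header passes through `safe_pdom`, which is where looping threads would wait for the threads on other paths.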

Fig. 5: SIMT-Induced Deadlock Elimination Steps

Fig. 6: MIMD-Compatible Reconvergence Mechanism Operation

Fig. 7: AWARE Virtualized Implementation

TABLE I: Evaluated Kernels

Fig. 8: Normalized Accumulative GPU Execution Time

Fig. 9: Evaluation of the Static SIMT-Induced Deadlock Elimination on Tesla K20C GPU

Fig. 11: Sensitivity to the TimeOut value (in cycles)

Fig. 10: Evaluation of the Adaptive Warp Reconvergence Mechanism using GPGPU-Sim

Fig. 12: Effect of AWARE Virtualization on Performance

Algorithm 1 SIMT-Induced Deadlock Detection

... slice of the loop exit condition. If the loop exit conditions do not depend on a shared memory read operation that occurs inside the loop body then the loop cannot have a SIMT-induced deadlock. If a loop exit condition does depend on a shared memory read instruction I_R, we add I_R in the set of shared reads ShrdReads on lines 4-7. A potential SIMT-induced deadlock exists if any of these shared memory reads can be redefined by divergent threads. The next steps of the algorithm detect these shared memory redefinitions.
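The check described in this excerpt can be approximated as follows. This is a rough sketch under strong assumptions, not the paper's Algorithm 1: the function name, the instruction IDs, and the input encoding are hypothetical, the backward slice is taken as given, and alias analysis is replaced by simple equality of location names.

```python
def may_have_simt_deadlock(exit_slice, loop_body, shared_reads, divergent_writes):
    """Conservative check for a potential SIMT-induced deadlock in one loop.

    exit_slice:       instruction IDs in the backward slice of the loop
                      exit condition (assumed precomputed).
    loop_body:        instruction IDs inside the loop body.
    shared_reads:     maps each shared-memory read instruction ID to the
                      location it reads.
    divergent_writes: locations that divergent threads may write.
    """
    # Shared reads inside the loop body that the exit condition depends on.
    shrd_reads = {i for i in (exit_slice & loop_body) if i in shared_reads}
    if not shrd_reads:
        # The exit condition does not depend on a shared read in the loop.
        return False
    # Potential deadlock if any such read can be redefined by a
    # divergent thread.
    return any(shared_reads[i] in divergent_writes for i in shrd_reads)


# Hypothetical spin loop: the exit depends on reading "mutex", and a
# divergent thread writes "mutex" when it releases the lock.
spin_loop = may_have_simt_deadlock(
    exit_slice={"i1", "i2"},
    loop_body={"i1", "i2"},
    shared_reads={"i1": "mutex"},
    divergent_writes={"mutex"},
)

# Hypothetical counting loop: the exit depends only on a private counter.
counting_loop = may_have_simt_deadlock(
    exit_slice={"i3"},
    loop_body={"i3", "i4"},
    shared_reads={"i4": "buffer"},
    divergent_writes={"buffer"},
)
```

The first loop is flagged and the second is not, because its exit condition never reads shared memory.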

Algorithm 2 Safe Reconvergence Points

... loop exit is control dependent on the atomicCAS instruction, there are no shared memory write instructions that are parallel to, or reachable from, the loop exit. Therefore, no SIMT deadlock is detected.

Algorithm 3 SIMT-Induced Deadlock Elimination

... it postdominates all reachable paths to the redefining writes (i.e., leading threads may only wait for lagging ones after they finish all iterations of the outer loop).

TABLE II: Code Configuration Encoding

... our generated CFG. This could be avoided if the elimination algorithm is applied at the SASS code generation stage. We also implemented AWARE in GPGPU-Sim 3.2.2 [53], [54]. We use the TeslaC2050 configuration released with GPGPU-Sim. However, we replaced the Greedy Then Oldest (GTO) scheduler with a Greedy then Loose Round Robin (GLRR) scheduler that forces loose fairness in warp scheduling, as we observed that unfairness in GTO leads to livelocks due to inter-warp dependencies on locks. Modified GPGPU-Sim and LLVM codes can be found online [19].

TABLE V: SSDE Evaluation on OpenMP Kernels


Sparse flow-sensitive pointer analysis for multithreaded programs

by Jingling Xue

2024, Proceedings of the 2016 International Symposium on Code Generation and Optimization

For C programs, flow-sensitivity is important to enable pointer analysis to achieve highly usable precision. Despite significant recent advances in scaling flow-sensitive pointer analysis sparsely for sequential C programs, relatively... more


Region-Based May-Happen-in-Parallel Analysis for C Programs

by Jingling Xue

2024, 2015 44th International Conference on Parallel Processing

The C programming language continues to play an essential role in the development of system software. May-Happen-in-Parallel (MHP) analysis is the basis of many other analyses and optimisations for concurrent programs. Existing MHP... more


A Hardware Approach for Detecting, Exposing and Tolerating High Level Atomicity Violations

by Lois Orosa

2024


Necessity Specifications for Robustness

by Susan Eisenbach

2024, arXiv (Cornell University)

Robust modules guarantee to do only what they are supposed to do-even in the presence of untrusted, malicious clients, and considering not just the direct behaviour of individual methods, but also the emergent behaviour from calls to more... more


Accounting in Theoretical Genetics

by Karl Javorszky

2024


Fine-Grained Synchronizations and Dataflow Programming on GPUs

by Henk Corporaal

2024, Proceedings of the 29th ACM on International Conference on Supercomputing

The last decade has witnessed the blooming emergence of many-core platforms, especially the graphic processing units (GPUs). With the exponential growth of cores in GPUs, utilizing them efficiently becomes a challenge. The dataparallel... more


Automatically verifying and reproducing event-based races in Android apps

by Arash Alavi

2024, Proceedings of the 25th International Symposium on Software Testing and Analysis

Concurrency has been a perpetual problem in Android apps, mainly due to event-based races. Several event-based race detectors have been proposed, but they produce false positives, cannot reproduce races, and cannot distinguish between... more


Prediction and Correction of Software Defects in Message-Passing Interfaces Using a Static Analysis Tool and Machine Learning

by Sanaa Sharaf

2024, IEEE Access

The Software Defect Prediction (SDP) method forecasts the occurrence of defects at the beginning of the software development process. Early fault detection will decrease the overall cost of software and improve its dependability. However,... more


Automatic Correction of Dynamic Power Management Architecture in Modern Processors

by Bijan Alizadeh

2024, IEEE Transactions on Very Large Scale Integration Systems

The increasing demand for lower power forces designers to use sophisticated power management strategies such as multivoltage and power gating which are often accompanied with many design bugs. Correcting such bugs can be a timeconsuming... more


MPI-CHECK: a tool for checking Fortran 90 MPI programs

by Glenn Luecke

2024, Concurrency and Computation: Practice and Experience

MPI is commonly used to write parallel programs for distributed memory parallel computers. MPI-CHECK is a tool developed to aid in the debugging of MPI programs that are written in free or fixed format Fortran 90 and Fortran 77. MPI-CHECK... more


HRF-Relaxed

by Benedict Gaster

2024, ACM Transactions on Architecture and Code Optimization

Memory consistency models, or memory models, allow both programmers and program language implementers to reason about concurrent accesses to one or more memory locations. Memory model specifications balance the often conflicting needs for... more


