Papers by Marcio Machado Pereira
In this text we present a short introduction to the new OpenMP Cluster (OMPC) distributed programming model. The OMPC runtime allows the programmer to annotate their code using OpenMP target offloading directives and run the application in a distributed environment seamlessly using a task-based programming model. OMPC is responsible for scheduling tasks to available nodes, transferring input/output data between nodes, and triggering remote execution, all while handling fault tolerance. The runtime leverages the LLVM infrastructure and is implemented using the well-known MPI library.
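For illustration, the sketch below shows what an annotation under this model might look like: a single offloaded region that the OMPC runtime can dispatch as a task, possibly to a remote node. The kernel, buffer size, and function name are assumptions made for this example, not code from the paper.

```c
#include <omp.h>

#define N 1024

/* Illustrative kernel. Under OMPC, the "target nowait" region becomes
 * a task that the runtime may dispatch to a remote node, moving the
 * mapped buffers over MPI behind the scenes. */
void process_block(float *src, float *dst) {
    #pragma omp target nowait map(to: src[0:N]) map(from: dst[0:N]) \
                              depend(in: src[0:N]) depend(out: dst[0:N])
    for (int i = 0; i < N; i++)
        dst[i] = 2.0f * src[i];

    #pragma omp taskwait   /* wait for the offloaded task to finish */
}
```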

In the last few years, Transactional Memories (TMs) have been shown to be a parallel programming model that can effectively combine performance improvement with ease of programming. Moreover, the recent introduction of (H)TM-based ISA extensions by major microprocessor manufacturers also seems to endorse TM as a programming model for today's parallel applications. One of the central issues in designing Software TM (STM) systems is to identify mechanisms or heuristics that can minimize contention arising from conflicting transactions. Although a number of mechanisms have been proposed to tackle contention, such techniques have a limited scope, because conflict is avoided by either interrupting or serializing transaction execution, thus considerably impacting performance. This work explores a complementary approach to boost the performance of STM through the use of schedulers. A TM scheduler is a software component that decides when a particular transaction should be executed. The effectiveness of such schedulers is very sensitive to the accuracy of the metrics used to predict transaction behaviour, particularly in high-contention scenarios. This work proposes a new Dynamic Transaction Scheduler (DTS) that selects the transaction to execute next based on a new policy that rewards success and an improved metric that measures the amount of effective work performed by a transaction. Hardware TMs (HTMs) are an interesting mechanism to implement TM, as they integrate support for transactions at the lowest, most efficient architectural level. On the other hand, for some applications, HTMs can have their performance hindered by a lack of scalability and by limitations in cache store capacity. This work presents an extensive performance study of the implementation of HTM in the Haswell generation of Intel x86 core processors. It evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of TM application characteristics. This detailed performance study provides insights into the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient serialization policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance.
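As a rough illustration of the scheduling idea, the sketch below keeps a per-transaction score that is rewarded on commit (in proportion to the work completed) and penalized on abort. The weighting and the proxy for "effective work" are simplifying assumptions for this example and do not reproduce the exact DTS policy.

```c
/* Schematic, assumption-laden sketch of a reward-based transaction
 * scheduler; not the paper's exact DTS policy. */
typedef struct {
    double score;      /* running reward: higher = schedule sooner */
} tx_info;

void on_commit(tx_info *t, long work_done) {
    /* Reward success, weighted by the effective work observed
     * (e.g., cycles spent inside the committed transaction). */
    t->score = 0.5 * t->score + 0.5 * (double)work_done;
}

void on_abort(tx_info *t) {
    t->score *= 0.5;   /* penalize transactions that keep conflicting */
}

/* Select the pending transaction with the best score to run next. */
int pick_next(const tx_info *txs, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (txs[i].score > txs[best].score)
            best = i;
    return best;
}
```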

Workshop Proceedings of the 51st International Conference on Parallel Processing
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
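To make the dependency-driven data distribution concrete, here is a hedged sketch of a two-stage pipeline: conceptually, the depend(out)/depend(in) pair is all the runtime needs to ship the buffer between the nodes that execute the two stages. The stage functions and buffer size are illustrative assumptions.

```c
#include <omp.h>

#pragma omp declare target
void stage1(float *buf);   /* hypothetical producer */
void stage2(float *buf);   /* hypothetical consumer */
#pragma omp end declare target

void pipeline(void) {
    static float buf[1024];

    /* OMPC derives placement and MPI data movement from the OpenMP
     * task dependence graph built by the depend clauses below. */
    #pragma omp target nowait map(from: buf) depend(out: buf)
    stage1(buf);           /* may run on one node */

    #pragma omp target nowait map(to: buf) depend(in: buf)
    stage2(buf);           /* may run on another node */

    #pragma omp taskwait   /* join before using results on the host */
}
```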
2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018
HardCloud seeks to ease the task of integrating programs with FPGA-based accelerators. With HardCloud, programmers can take their original code, add a few annotations, and quickly evaluate whether an FPGA accelerator is a suitable solution for a particular application.
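A hedged sketch of the intended annotation style follows: existing loop code is marked for offloading, with the idea that the annotated region executes on an FPGA accelerator. The device id and the vector-add kernel are assumptions for illustration; consult the HardCloud documentation for the exact clause set.

```c
/* Hypothetical HardCloud-style use of OpenMP offloading: the original
 * loop is kept intact and only annotated. Treating device(1) as the
 * FPGA is an assumption made for this sketch. */
void vadd(const int *a, const int *b, int *c, int n) {
    #pragma omp target device(1) map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```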
Journal of Parallel and Distributed Computing, 2019
• DCA: a set of two data-flow analyses that seek to identify CPU/GPU accesses.
• DCO: creates shared buffers between CPU/GPU and inserts calls to maintain data coherence.
• A technique that tries to remove data offloading during GPU computation.
• Speed-ups of up to 8.87x on representative benchmarks on integrated and discrete GPUs.

Parallel Computing, 2016
This paper presents an extensive performance study of the implementation of Hardware Transactional Memory (HTM) in the Haswell generation of Intel x86 core processors. It evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of Transactional Memory (TM) application characteristics using the Eigenbench [1] and CLOMP-TM [2] benchmarks. This paper also introduces a new tool, called htm-pBuilder, that tailors fallback policies and allows independent exploration of their parameters. This detailed performance study provides insights into the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance. The evaluation also shows that there are a number of potential improvements available to designers of TM applications and software systems that use Intel's TM, and it provides recommendations to extract maximum benefit from the current TM support available in Haswell.
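For context, the canonical shape of such a fallback policy on Intel TSX is sketched below: retry a hardware transaction a bounded number of times, then serialize on a global lock, which transactions must read so they abort whenever a fallback holder is active. The retry threshold is an illustrative choice, not a tuned value from the study.

```c
#include <immintrin.h>   /* RTM intrinsics: _xbegin/_xend/_xabort */
#include <pthread.h>

static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
static volatile int fallback_taken = 0;

void critical_update(long *counter) {
    for (int retry = 0; retry < 3; retry++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_taken)   /* put the lock in the read-set   */
                _xabort(0xff);    /* a fallback holder is active    */
            (*counter)++;
            _xend();
            return;
        }
        /* aborted: retry a bounded number of times */
    }
    /* Best-effort HTM guarantees nothing, so serialize for progress. */
    pthread_mutex_lock(&fallback);
    fallback_taken = 1;
    (*counter)++;
    fallback_taken = 0;
    pthread_mutex_unlock(&fallback);
}
```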

Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
Convolution is one of the most computationally intensive machine learning model operations, usually solved by the well-known Im2Col + BLAS method. This work proposes a novel convolution algorithm that improves upon Im2Col + BLAS by introducing (a) CSA: a convolution-specific 3D cache-blocking analysis that focuses on tile reuse over the cache hierarchy; (b) CSO: a macro-kernel that follows CSA to compute the convolution by tiling it; (c) a specialized micro-kernel that seeks to achieve peak hardware performance; and (d) packing routines for the input tensor and filters that bridge the gap between tiling and the micro-kernel. Our approach speeds up end-to-end machine learning model inference by up to 26% and 21% for x86 and POWER10 architectures, respectively.
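To fix ideas, below is a much-simplified, one-dimensional sketch of a tiled convolution macro-kernel in the spirit of CSO: the outer loops walk cache-sized tiles, and the innermost block stands in for a tuned micro-kernel. The tile sizes are illustrative constants; in the paper they would come from the CSA cache-blocking analysis.

```c
#define K  256   /* output channels                        */
#define C  256   /* input channels                         */
#define W  1024  /* output width (1-D example for brevity) */
#define R  3     /* filter width                           */
#define TK 64    /* output-channel tile (illustrative)     */
#define TW 32    /* output-width tile (illustrative)       */

/* in: [C][W+R-1], w: [K][C][R], out: [K][W] */
void conv_tiled(const float in[C][W + R - 1],
                const float w[K][C][R],
                float out[K][W]) {
    for (int k0 = 0; k0 < K; k0 += TK)           /* macro-kernel tiles    */
        for (int x0 = 0; x0 < W; x0 += TW)
            for (int k = k0; k < k0 + TK; k++)   /* micro-kernel stand-in */
                for (int x = x0; x < x0 + TW; x++) {
                    float acc = 0.0f;
                    for (int c = 0; c < C; c++)  /* reduce over inputs    */
                        for (int r = 0; r < R; r++)
                            acc += w[k][c][r] * in[c][x + r];
                    out[k][x] = acc;
                }
}
```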

2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2021
FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture capable of accelerating a single application. However, programming such an architecture is a challenging endeavor that still requires additional research. This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increase.
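A hedged sketch of the programming style follows: each FPGA board is exposed as an OpenMP device, and tiles of the stencil grid are offloaded as tasks spread across the available devices. The partitioning, kernel, and round-robin placement are assumptions for illustration, not the paper's exact code.

```c
#include <omp.h>

#pragma omp declare target
void stencil_step(float *tile, int n);   /* hypothetical IP-core kernel */
#pragma omp end declare target

void run_on_fpgas(float *grid, int tiles, int tile_n) {
    int ndev = omp_get_num_devices();
    if (ndev == 0) return;               /* no accelerators available */

    for (int t = 0; t < tiles; t++) {
        float *tile = grid + t * tile_n;
        /* Round-robin tiles over the FPGAs, one task per tile. */
        #pragma omp target nowait device(t % ndev) \
                    map(tofrom: tile[0:tile_n]) depend(inout: tile[0:tile_n])
        stencil_step(tile, tile_n);
    }
    #pragma omp taskwait
}
```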
This paper presents a proposal for a particular agent-oriented language, called TARDIS. TARDIS is an extension of the functional language Scheme that includes primitives for creating and manipulating agents. Our approach is motivated by a desire to bridge the gap between the functional and agent-oriented paradigms. The syntax and semantics we developed are intended to be useful for justifying program transformations for real languages, and for formalizing intuitive arguments and properties used by programmers.

In the last few years, Transactional Memories (TMs) have been shown to be a parallel programming model that can effectively combine performance improvement with ease of programming. Moreover, the recent introduction of (H)TM-based ISA extensions by major microprocessor manufacturers also seems to endorse TM as a programming model for today's parallel applications. One of the central issues in designing Software TM (STM) systems is to identify mechanisms or heuristics that can minimize contention arising from conflicting transactions. Although a number of mechanisms have been proposed to tackle contention, such techniques have a limited scope, because conflict is avoided by either interrupting or serializing transaction execution, thus considerably impacting performance. This work explores a complementary approach to boost the performance of STM through the use of schedulers. A TM scheduler is a software component that decides when a particular transaction should be executed. The effectiveness of such schedulers is very sensitive to the accuracy of the metrics used to predict transaction behaviour, particularly in high-contention scenarios.

2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2017
Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g., CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance gains achieved by acceleration. Although this problem has been extensively studied for multicore architectures and was recently tackled in discrete GPUs through CUDA 8, no generic solution exists for integrated CPU/GPU architectures like those found in mobile devices (e.g., ARM Mali). This paper proposes Data Coherence Analysis (DCA), a set of two data-flow analyses that determine how variables are used by the host/device at each program point. It also introduces Data Coherence Optimization (DCO), a code optimization technique that uses DCA information to (a) allocate OpenCL shared buffers between host and device and (b) insert calls to keep the data coherent.
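For illustration, the sketch below shows the buffer discipline that DCO automates on integrated GPUs: a single shared OpenCL buffer, with map/unmap calls at the points where the analysis determines the host (respectively, the device) touches the data next. The setup objects and error handling are elided; all names are assumptions for this example.

```c
#include <CL/cl.h>

/* `ctx` and `queue` are assumed to exist; error checks are elided. */
void shared_buffer_demo(cl_context ctx, cl_command_queue queue, size_t n) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(float), NULL, &err);

    /* Host phase: mapping gives the CPU a coherent view of the buffer
     * (zero-copy on integrated CPU/GPU parts). */
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, n * sizeof(float),
                                           0, NULL, NULL, &err);
    for (size_t i = 0; i < n; i++)
        p[i] = (float)i;

    /* Device phase: unmapping hands the buffer back before a kernel
     * that reads it is enqueued. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    /* ... enqueue a GPU kernel that uses buf ... */

    clReleaseMemObject(buf);
}
```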

ACM Transactions on Architecture and Code Optimization
Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimum human intervention. However, inserting pragmas into production code is a difficult and error-prone task, often requiring familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. Such techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data-transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas into a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users.
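As a hedged before/after illustration of the output style, consider a loop over pointer arguments: once symbolic range analysis bounds the accessed regions as A[0..n-1] and B[0..n-1], the tool can emit map clauses like the ones below. The kernel is a made-up example, and standard OpenMP offloading syntax is used here to stand in for the generated annotations.

```c
/* After annotation: the inferred bounds become map clauses. */
void saxpy(float *A, const float *B, float a, int n) {
    #pragma omp target parallel for map(tofrom: A[0:n]) map(to: B[0:n])
    for (int i = 0; i < n; i++)
        A[i] = a * B[i] + A[i];
}
```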

DOACROSS Parallelization Based on Component Annotation and Loop-Carried Probability
2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Although modern compilers implement many loop parallelization techniques, their application is typically restricted to loops that have no loop-carried dependences (DOALL) or that contain well-known structured dependence patterns (e.g., reduction). These restrictions preclude the parallelization of many computationally intensive DOACROSS loops. In such loops, either the compiler finds at least one loop-carried dependence or it cannot prove, at compile time, that the loop is free of such dependences, even though they might never show up at runtime. In either case, most compilers end up not parallelizing DOACROSS loops. This paper brings three contributions to address this problem. First, it integrates three algorithms (TLS, DOAX, and BDX) into a simple OpenMP clause that enables the programmer to select the best algorithm for a given loop. Second, it proposes an annotation approach to separate the sequential components of a loop, thus exposing other components to parallelization. Finally, it shows that loop-carried probability is an effective metric to decide when to use TLS or other non-speculative techniques (e.g., DOAX or BDX) to parallelize DOACROSS loops. Experimental results reveal that, for certain loops, slow-downs can be transformed into 2× speed-ups by quickly selecting the appropriate algorithm.
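For reference, a textbook DOACROSS loop is shown below, expressed with standard OpenMP 4.5 cross-iteration synchronization: iteration i consumes a value produced by iteration i-1. The example is illustrative only; the paper's clause for choosing among TLS, DOAX, and BDX is not reproduced here.

```c
void doacross(float *a, const float *b, int n) {
    #pragma omp parallel for ordered(1)
    for (int i = 1; i < n; i++) {
        #pragma omp ordered depend(sink: i - 1)
        a[i] = a[i - 1] * 0.5f + b[i];   /* loop-carried dependence */
        #pragma omp ordered depend(source)
    }
}
```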

2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2016
Directive-based programming models, such as OpenACC and OpenMP, arise today as promising techniques to support the development of parallel applications. These systems allow developers to convert a sequential program into a parallel one with minimum human intervention. However, inserting pragmas into production code is a difficult and error-prone task, often requiring familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This paper provides one fundamental component in the solution of this problem. We introduce a static program analysis that infers the bounds of memory regions referenced in source code. Such bounds allow us to automatically insert data-transfer primitives, which are needed when the parallelized code is meant to be executed on an accelerator device, such as a GPU. To validate our ideas, we have applied them to PolyBench, using two different architectures: an Nvidia-based and a Qualcomm-based one. We have successfully analyzed 98% of all the memory accesses in PolyBench. This result has enabled us to insert automatic annotations into those benchmarks, leading to speedups of over 100x.
Multi-dimensional Evaluation of Haswell's Transactional Memory Performance
2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, 2014

Transaction scheduling using conflict avoidance and Contention Intensity
20th Annual International Conference on High Performance Computing, 2013
In the last few years, Transactional Memories (TMs) have been shown to be a parallel programming model that can effectively combine performance improvement with ease of programming. Moreover, the recent introduction of TM-based ISA extensions by major microprocessor manufacturers also seems to endorse TM as a programming model for today's parallel applications. One of the central issues in designing Software TM (STM) systems is to identify mechanisms or heuristics that can minimize contention arising from conflicting transactions. Although a number of mechanisms have been proposed to tackle contention, such techniques have a limited scope, as conflict is avoided by either interrupting or serializing transaction execution, thus considerably impacting performance. To deal with this limitation, we have proposed a new effective transaction scheduler, along with a conflict-avoidance heuristic, that implements a fully cooperative scheduler that swaps a conflicting transaction for another with a lower conflict probability. This paper extends that framework and introduces a new heuristic, built from the combination of our previous conflict-avoidance technique with the Contention Intensity heuristic proposed by Yoo and Lee. Experimental results, obtained using the STMBench7 and STAMP benchmarks atop TinySTM, show that the proposed heuristic produces significant speedups when compared to four other solutions.
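Yoo and Lee's Contention Intensity is an exponential moving average of recent abort/commit outcomes, which makes it cheap to combine with a conflict-avoidance scheduler. A minimal sketch follows; the alpha and threshold constants are illustrative, not the tuned values from the paper.

```c
#define ALPHA     0.7   /* weight of past history (illustrative)  */
#define THRESHOLD 0.5   /* serialization trigger (illustrative)   */

typedef struct { double ci; } tx_stats;

/* aborted: 1 if the transaction just aborted, 0 if it committed. */
void update_ci(tx_stats *s, int aborted) {
    s->ci = ALPHA * s->ci + (1.0 - ALPHA) * (double)aborted;
}

/* Above the threshold, hand the transaction to the cooperative
 * scheduler (e.g., swap it for one with lower conflict probability). */
int should_schedule(const tx_stats *s) {
    return s->ci > THRESHOLD;
}
```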