Sameer Shende

Followers

Following

Co-authors

Public Views

Interests

Uploads

Papers by Sameer Shende

Performance Tool Integration in a GPU Programming Environment: Experiences with TAU and HMPP

Parallel Computing, 2009

Application development environments offering high-level programming support for accelerators wil... more Application development environments offering high-level programming support for accelerators will need to integrate instrumentation and measurement capabilities to enable full, consistent performance views for analysis and tuning. We describe early experiences with the integration of a parallel performance system (TAU) and accelerator performance tool (TAUcuda) with the HMPP Workbench for programming GPU accelerators using CUDA. A description of the design approach is given, and two case studies are reported to demonstrate our development prototype. A new version of the technology is now being created based on the lessons learned from the research work.

Download

MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU

Parallel Computing, Sep 1, 2018

The desire for high performance on scalable parallel systems is increasing the complexity and tun... more The desire for high performance on scalable parallel systems is increasing the complexity and tunability of MPI implementations. The MPI Tools Information Interface (MPI_T) introduced as part of the MPI 3.0 standard provides an opportunity for performance tools and external software to introspect and understand MPI runtime behavior at a deeper level to detect scalability issues. The interface also provides a mechanism to fine-tune the performance of the MPI library dynamically at runtime. In this paper, we propose an infrastructure that extends existing components -TAU, MVAPICH2, and BEACON to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities. We validate our design by developing optimizations for a combination of production and synthetic applications. Using our infrastructure, we implement an autotuning policy for AmberMD (a molecular dynamics package) that monitors and reduces the internal memory footprint of the MVAPICH2 MPI library without affecting performance. For applications such as MiniAMR whose collective communication is latency sensitive, our infrastructure is able to generate recommendations to enable hardware offloading of collectives supported by MVAPICH2. By implementing this recommendation, the MPI time for MiniAMR at 224 processes reduces by 15%.

Download

Exploring Space-Time Trade-Off in Backtraces

Springer eBooks, 2021

Performance Technology for Complex Parallel and Distributed Systems

Scalable Computing: Practice and Experience, Jan 3, 2001

The ability of performance technology to keep pace with the growing complexity of parallel and di... more The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems will depend on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. The TAU system is offered as an example framework that meets these requirements. With a flexible, modular instrumentation and measurement system, and an open performance data and analysis environment, TAU can target a range of complex performance scenarios. Examples are given showing the diversity of TAU application.

Download

Gleaming the Cube: Online Performance Analysis and Visualization Using MALP

Multi-Application onLine Profiling (MALP) is a performance tool which has been developed as an al... more Multi-Application onLine Profiling (MALP) is a performance tool which has been developed as an alternative to the trace-based approach for fine-grained event collection. Any performance and analysis measurement system must address the problem of data management and projection to meaningful forms. Our concept of a valorization chain is introduced to capture this fundamental principle. MALP is a dramatic departure from performance tool dogma in that is advocates for an online valorization architecture that integrates data producers with transformers, consumers, and visualizers, all operating in concert and simultaneously. MALP provides a powerful, dynamic framework for performance processing, as is demonstrated in unique performance analysis and application dashboard examples. Our experience with MALP has identified opportunities for data-query in MPI context, and more generally, creating a “constellation of services” that allow parallel processes and tools to collaborate through a common mediation layer.

Knowledge-Based Parallel Performance Technology for Scientific Application Competitiveness Final Report

The primary goal of the University of Oregon's DOE "AÂcompetitiveness" project wa... more The primary goal of the University of Oregon's DOE "AÂcompetitiveness" project was to create performance technology that embodies and supports knowledge of performance data, analysis, and diagnosis in parallel performance problem solving. The target of our development activities was the TAU Performance System and the technology accomplishments reported in this and prior reports have all been incorporated in the TAU open software distribution. In addition, the project has been committed to maintaining strong interactions with the DOE SciDAC Performance Engineering Research Institute (PERI) and Center for Technology for Advanced Scientific Component Software (TASCS). This collaboration has proved valuable for translation of our knowledge-based performance techniques to parallel application development and performance engineering practice. Our outreach has also extended to the DOE Advanced CompuTational Software (ACTS) collection and project. Throughout the project we have participated in the PERI and TASCS meetings, as well as the ACTS annual workshops.

Extreme Performance Scalable Operating Systems Final Progress Report (July 1, 2008 - October 31, 2011)

Download

Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team

The Performance Engineering Institute (PERI) originally proposed a tiger team activity as a mecha... more The Performance Engineering Institute (PERI) originally proposed a tiger team activity as a mechanism to target significant effort optimizing key Office of Science applications, a model that was successfully realized with the assistance of two JOULE metric teams. However, the Office of Science requested a new focus beginning in 2008: assistance in forming its ten year facilities plan. To meet this request, PERI formed the Architecture Tiger Team, which is modeling the performance of key science applications on future architectures, with S3D, FLASH and GTC chosen as the first application targets. In this activity, we have measured the performance of these applications on current systems in order to understand their baseline performance and to ensure that our modeling activity focuses on the right versions and inputs of the applications. We have applied a variety of modeling techniques to anticipate the performance of these applications on a range of anticipated systems. While our initial findings predict that Office of Science applications will continue to perform well on future machines from major hardware vendors, we have also encountered several areas in which we must extend our modeling techniques in order to fulfill our mission accurately and completely. In addition, we anticipate that models of a wider range of applications will reveal critical differences between expected future systems, thus providing guidance for future Office of Science procurement decisions, and will enable DOE applications to exploit machines in future facilities fully.

Download

Transparent High-Speed Network Checkpoint/Restart in MPI

Fault-tolerance has always been an important topic when it comes to running massively parallel pr... more Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given a sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and Lulesh relying on Infiniband high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault-tolerance without any modification inside target MPI applications is possible, and show how it could be the first step for more integrated resiliency combined with failure mitigation like ULFM.

Towards a Better Expressiveness of the Speedup Metric in MPI Context

Many-core processors are imposing new constraints to parallel applications. In particular, the MP... more Many-core processors are imposing new constraints to parallel applications. In particular, the MPI+X model or hybridization is becoming a compulsory avenue to extract performance by mitigating both memory and communication overhead. In this context, performance tools also have to evolve in order to represent more complex states combining multiple runtimes and programming models. In this paper, we propose to start from a well-known performance metric, the Speedup, showing that it can be bounded by the acceleration of any program section. From this observation, we propose a compact tool-oriented MPI abstraction providing such time slices (or phases). We demonstrate the benefits of this approach first on a simple benchmark, identifying factors limiting speedup. And second, using an MPI+OpenMP benchmark to measure OpenMP scaling solely from MPI instrumentation.

Download

A Plugin Architecture for the TAU Performance System

Proceedings of the 48th International Conference on Parallel Processing, 2019

Several robust performance systems have been created for parallel machines with the ability to ob... more Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective to support measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, it is expected that the requirements for performance observation and awareness will change. For instance, heterogeneous machines introduce new types of performance data to capture and performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policybased control. The problem is that an existing performance system architecture could be constrained in its ability to evolve to meet these new requirements. The paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model to both capture existing capabilities in TAU and to extend its functionality in ways it was not necessarily conceived originally. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments are shown to highlight the fact that efficiency and robustness can be maintained, while new flexibility and programmability can be offered that leverages the power of the core TAU system while allowing significant and compelling extensions to be realized.

Download

Performance Analysis and Debugging Tools at Scale

Exascale Scientific Applications, 2017

Advanced Scientific Computing Research Exascale Requirements Review. An Office of Science review sponsored by Advanced Scientific Computing Research, September 27-29, 2016, Rockville, Maryland

Checkpoint/restart approaches for a thread-based MPI runtime

Parallel Computing, 2019

Fault-tolerance has always been an important topic when it comes to running massively parallel pr... more Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to a dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed.

Download

An MPI Halo-Cell Implementation for Zero-Copy Abstraction

Proceedings of the 22nd European MPI Users' Group Meeting, 2015

In the race for Exascale, the advent of many-core processors will bring a shift in parallel compu... more In the race for Exascale, the advent of many-core processors will bring a shift in parallel computing architectures to systems of much higher concurrency, but with a relatively smaller memory per thread. This shift raises concerns for the adaptability of HPC software, for the current generation to the brave new world. In this paper, we study domain splitting on an increasing number of memory areas as an example problem where negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as with OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics, while providing the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cells transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in terms of memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only) using a representative domain decomposition benchmark.

Download

Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Team

Download

Unifying the Analysis of Performance Event Streams at the Consumer Interface Level

Several instrumentation interfaces have been developed for parallel programs to make observable a... more Several instrumentation interfaces have been developed for parallel programs to make observable actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces to expose internal operation of MPI (MPI-T) and OpenMP (OMPT) runtimes are now in the standards. Taking advantage of these interfaces requires tools to selectively collect events from multiples interfaces by various techniques: function interposition (PMPI), value read (MPI-T), and callbacks (OMPT). In this paper, we present the unified instrumentation pipeline proposed by the MALP infrastructure that can be used to forward a variety of fine-grained events from multiple interfaces online to multi-threaded analysis processes implemented orthogonally with plugins. In essence, our contribution complements “front-end” instrumentation mechanisms by a generic “back-end” event consumption interface that allows “consumer” callbacks to generate performance measurements in various formats for analysis and transport. With such support, online and post-mortem cases become similar from an analysis point of view, making it possible to build more unified and consistent analysis frameworks. The paper describes the approach and demonstrates its benefits with several use cases.

Multi-Platform SYCL Profiling with TAU

Proceedings of the International Workshop on OpenCL, 2020

Multi-Level Performance Instrumentation for Kokkos Applications Using TAU

2019 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), 2019

Download

TAU Performance System

International Workshop on OpenCL, May 10, 2022

Sameer Shende

Uploads

Papers by Sameer Shende

Log In