Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC...
MOS current mode logic (MCML) is an emerging logic family that is gaining attention due to its high speed of operation, robust performance, and low switching noise compared to the CMOS logic family. In this paper we have...
We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault...
An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make...
Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale-class machines has proven difficult, and the increased concurrency demands of exascale...
Abstract: With exascale computing on the horizon, the performance variability of I/O systems represents a key challenge in sustaining high performance. In many HPC applications, I/O is concurrently performed by all processes, which leads to...
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high...
Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously....
The use of asymmetric multi-core processors with on-chip computational accelerators is becoming common in a variety of environments ranging from scientific computing to enterprise applications. The focus of current research has been on...
Recent research efforts of parallel processing on nondedicated clusters have focused on high execution performance, parallelism management, transparent access to resources, and making clusters easy to use. However, as a collection of...
Even though the cloud platform promises to be reliable, several availability incidents prove that it is not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper...
In a Fine-Grained Cycle Sharing (FGCS) system [1], machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, for guest users, these free computation resources...
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application...
It is widely accepted that asynchronous parallel methods are more suitable than synchronous ones on a grid architecture. Indeed, they outperform the synchronous methods because they overlap the communications of the synchronous...
Fabric Manager: Paul Grun, HPE; Jeff Hilland, HPE; Russ Herrell, HPE
OpenFabrics Verification Services: Paul Grun, HPE; Doug Ledford, Red Hat; Jim Ryan, OpenFabrics Alliance
Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints might be taken to guarantee application progression, producing even more...
This article evaluates several strategies for storing checkpoint data in an opportunistic grid environment, including replication, parity information, and erasure coding. This evaluation compares the computational overhead, storage...
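As a rough illustration of the parity strategy named in the entry above, the following sketch splits checkpoint data into k fragments plus one XOR parity fragment and rebuilds a single lost fragment. It is not the article's implementation: the function names, the zero-padding scheme, and the choice of simple XOR parity are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def split_with_parity(data, k):
    """Split data into k equal fragments (zero-padded) plus one XOR parity fragment.

    Hypothetical helper: the fragment layout is an assumption, not taken from the article.
    """
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    return fragments, xor_blocks(fragments)


def recover_fragment(fragments, parity, lost_index):
    """Rebuild one lost fragment from the surviving fragments and the parity fragment."""
    survivors = [f for i, f in enumerate(fragments) if i != lost_index]
    return xor_blocks(survivors + [parity])
```

With k data fragments, one parity fragment adds 1/k extra storage and tolerates the loss of any single fragment, whereas keeping one full replica adds 100% extra storage. Tolerating several simultaneous losses requires a more general erasure code.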
Iterative methods are commonly used to solve large, sparse linear systems, which are fundamental operations in many modern scientific simulations. When large-scale iterative methods run with a large number of ranks...
Even though the cloud platform promises to be reliable, several availability incidents prove that it is not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper...
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well-known applications, including...
Because a Grid architecture provides resources that fluctuate, an application that runs in this environment must be able to take into account the changes that may occur. Such an application must adapt to the changes in the Grid environment....
A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A...
High speed and low power are the dream of circuit designers. In this paper a novel self-timed logic family is presented for high-speed self-timed pipelining applications. We developed a novel triple-rail MOS current mode logic (Tr-MCML)...
We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically...
As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes...
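The observation in the entry above, that only a subset of the process image changes between checkpoints, can be illustrated with a minimal user-level sketch that hashes fixed-size blocks and saves only the blocks whose digest changed. This is an assumption-laden illustration, not the method of the paper: the block size, the use of SHA-256, and all function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block granularity, chosen for illustration


def block_digests(state, block_size=BLOCK_SIZE):
    """SHA-256 digest of each fixed-size block of the state."""
    return [hashlib.sha256(state[i:i + block_size]).digest()
            for i in range(0, len(state), block_size)]


def incremental_checkpoint(state, previous_digests, block_size=BLOCK_SIZE):
    """Return (delta, digests); delta maps block index -> bytes of each changed block."""
    digests = block_digests(state, block_size)
    delta = {}
    for index, digest in enumerate(digests):
        # A block is saved if it is new or its digest differs from the last checkpoint.
        if index >= len(previous_digests) or previous_digests[index] != digest:
            start = index * block_size
            delta[index] = state[start:start + block_size]
    return delta, digests


def apply_delta(base, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new state from the previous full image plus a delta."""
    image = bytearray(base[:new_length].ljust(new_length, b"\0"))
    for index, block in delta.items():
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)
```

The first call with an empty digest list produces a full checkpoint; later calls store only the changed blocks, at the cost of hashing the whole state each time. System-level checkpointers typically track modified pages through the memory-management hardware instead of hashing.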
In this work we address the growing need for mechanisms for intranode application composition. We provide a novel shared memory interface that allows composite applications, two or more coupled applications, to share internal data...
ABSTRACT: Because a Grid architecture provides resources that fluctuate, an application that runs in this environment must be able to take into account the changes that may occur. Such an application must adapt to the changes in Grid...
Checkpoint/recovery has been studied extensively, and various optimization techniques have been presented for its improvement. Despite these considerable research efforts, little work has been done on improving its restart latency....
Great effort has been devoted to the design of optimized checkpointing strategies for optimistic parallel discrete event simulators. On the other hand, less work has been directed at improving the execution mode of any single...
threaded runtime library for parallel I/O. We extend the multi-threading concept to separate the compute and I/O tasks in two separate threads of control. Multi-threading in our design permits a) asynchronous I/O even if the underlying...