Asynchronous I/O Research Papers

2024, ACM Transactions on Parallel Computing

With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. This variability significantly impacts... more

Accelerating incremental checkpointing for extreme-scale computing

by Patrick Bridges

2023, Future Generation Computer Systems

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC... more

Comparative Analysis of Universal Gates using MCML and CMOS Technique

by Keerti Vyas

2023, International Journal of Computer Applications

MOS current mode logic (MCML) is an emerging logic family which is gaining attention due to its high speed of operation, robust performance and presence of mere switching noise as compared to the CMOS logic family. In this paper we have... more

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

by Jose Luis Lopez Sancho

2023, ACM/IEEE SC 2005 Conference (SC'05)

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault... more

Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems

by Esteban Meneses

2023, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing

An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make... more

The design and implementation of a multi-level content-addressable checkpoint file system

by Abhishek Kulkarni

2023, 2012 19th International Conference on High Performance Computing

Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale class machines has proven difficult and the increased concurrency demands of exascale... more

Comparative Analysis of Universal Gates using MCML and CMOS Technique

by Vijendra Maurya

2023, International Journal of Computer Applications

MOS current mode logic (MCML) is an emerging logic family which is gaining attention due to its high speed of operation, robust performance and presence of mere switching noise as compared to the CMOS logic family. In this paper we have... more

Damaris: Leveraging Multicore Parallelism to Mask I/O Jitter

by A. Gabriel

2023

Abstract: With exascale computing on the horizon, the performance variability of I/O systems represents a key challenge in sustaining high performance. In many HPC applications, I/O is concurrently performed by all processes, which leads to... more

MCREngine: A scalable checkpointing system using data-aware aggregation and compression

by Tanzima Islam

2023, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high... more

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

by Jose Carlos

2023, ACM/IEEE SC 2005 Conference (SC'05)

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault... more

An Asynchronous Recovery Algorithm Based on a Staggered Quasi-Synchronous Checkpointing Algorithm

by D. Manivannan

2023, Distributed Computing – IWDC 2005

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously.... more

Fig. 1. An example illustrating basic checkpoints taken in a staggered way

CellMR: A framework for supporting mapreduce on asymmetric cell-based clusters

by Muhammad Rafique

2022, 2009 IEEE International Symposium on Parallel & Distributed Processing

The use of asymmetric multi-core processors with on-chip computational accelerators is becoming common in a variety of environments ranging from scientific computing to enterprise applications. The focus of current research has been on... more

BAD-check

by Sorin Faibish

2022, Proceedings of the 10th Parallel Data Storage Workshop on - PDSW '15

Leadership-scale scientific simulations running as tens of thousands of tightly-coupled MPI processes are vulnerable to interruption due to a single process or node failure. Due to the dependence of each state calculation on the... more

BAD-check

by Jeremy Sauer

2022, Proceedings of the 10th Parallel Data Storage Workshop on - PDSW '15

Leadership-scale scientific simulations running as tens of thousands of tightly-coupled MPI processes are vulnerable to interruption due to a single process or node failure. Due to the dependence of each state calculation on the... more

Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS

by A. Goscinski

2022, Fifth International Conference on Algorithms and Architectures for Parallel Processing, 2002. Proceedings.

Recent research efforts of parallel processing on nondedicated clusters have focused on high execution performance, parallelism management, transparent access to resources, and making clusters easy to use. However, as a collection of... more

H-RADIC: A Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

by Georgiana Vasile

2022, Journal of Computer Science and Technology

Even though the cloud platform promises to be reliable, several availability incidents prove that it is not. How can we be sure that a parallel application finishes its execution even if a site is affected by a failure? This paper... more

Protector: It is in charge of requesting the observers to perform checkpoints and storing the checkpoint files in its own Stable Storage (SS). It is also in charge of detecting failures by verifying that the node it is protecting is working and, should it fail, it restores the failed process by launching the latest checkpoint.

Besides the regular RADIC architecture, which is designed to protect against node failures, H-RADIC will protect applications from crashing in the event of multiple failures, guaranteeing that the application finishes its execution despite them. When the virtual nodes fail, the physical nodes that the virtual ones are mounted on fail, and/or whenever there is loss of

Recovery: Protector Ta, _ restarts/rolls back the processes running in Ng using the data saved in the SS from the last checkpoint. If the system has a spare node (Nz_), the processes will be restarted in it; otherwise the processes are restarted in the node

The main difference between the two restart options is the way to store the checkpoints. In the case that there is a spare cluster available in another cloud (Figure 3), H-RADIC will be working as usual, but if there are no spare nodes available in a third cloud, the checkpoint will be restarted in the spare nodes of the cloud that has the checkpoint. After the restart, the two clusters (X and Y’ in Figure 4) will send their checkpoints to Cluster Z and vice versa. Then the checkpoint files are sent to the nodes that are in the spare cluster.

Figure 5 - H-RADIC Recovery function - spare nodes/cluster in the same cloud and no more clouds left.

controller, the storage and the physical computer that the virtual nodes are mounted on. These vulnerabilities will not allow us to guarantee availability.

Another experiment was performed in which an error was induced around the middle of the application’s execution, and we measured the time the application took to restart from the latest checkpoint and the time it took to restart from the beginning, as shown in Figure 7.

Figure 6 - H-RADIC percentage of time overhead.

an induced error. Then, by taking the percentage increase in time in 2) and 3), we obtain Figure 6.

Finally, the error masking function pseudocode is described in Algorithm 4.

Table 2 - Variables description.

5. Conclusion and future work

The design and implementation of a multi-level content-addressable checkpoint file system

by Andrew Lumsdaine

2022, 2012 19th International Conference on High Performance Computing

Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale class machines has proven difficult and the increased concurrency demands of exascale... more

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

by Jose Carlos Melo Carlos

2022, ACM/IEEE SC 2005 Conference (SC'05)

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault... more

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

by Jose Carlos

2022

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault... more

Reliable and Efficient Checkpoint/Recovery in Shared Grid Environments

by Tanzima Islam

2022

In a Fine-Grained Cycle Sharing (FGCS) system [1], machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, for guest users, these free computation resources... more

MATCH: An MPI Fault Tolerance Benchmark Suite

by Konstantinos Parasyris

2022, 2020 IEEE International Symposium on Workload Characterization (IISWC)

MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application... more

Energy consumption reduction for asynchronous message-passing applications

by Ahmed Badri

2022, The Journal of Supercomputing

It is widely accepted that the asynchronous parallel methods are more suitable than the synchronous ones on a grid architecture. Indeed, they outperform the synchronous methods because they overlap the communications of the synchronous... more

Enhancing NVMe and NVMe-oF configuration and manageability with SNIA Swordfish and DMTF Redfish to enable scalable infrastructures

by Bernard Metzler

2022

Fabric Manager Paul Grun, HPE; Jeff Hilland, HPE; Russ Herrell, HPE OpenFabrics Verification Services Paul Grun, HPE; Doug Ledford, Red Hat; Jim Ryan, OpenFabrics Alliance

Mmapcopy

by P. Marwedel

2022, Proceedings of the 31st Annual ACM Symposium on Applied Computing - SAC '16

Memory requirements can be a limiting factor for programs dealing with large data structures. Especially interpreted programming languages that are used to deal with large vectors like R suffer from memory overhead when copying such data... more

Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

by Fabio Kon

2021, Proceedings of the 3rd international workshop on Middleware for grid computing - MGC '05

Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints might be taken to guarantee application progression, producing even more... more

Strategies for Checkpoint Storage on Opportunistic Grids

by Fabio Kon

2021, IEEE Distributed Systems Online

This article evaluates several strategies for storing checkpoint data in an opportunistic grid environment, including replication, parity information, and erasure coding. This evaluation compares the computational overhead, storage... more

Strategies for Checkpoint Storage on Opportunistic Grids

by Renato Cerqueira

2021, IEEE Distributed Systems Online

This article evaluates several strategies for storing checkpoint data in an opportunistic grid environment, including replication, parity information, and erasure coding. This evaluation compares the computational overhead, storage... more

Improving performance of iterative methods by lossy checkponting

by Sheng Di

2021, Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing

Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks... more

Live process migration for load balancing and/or fault tolerance

by Peter Väterlein

2021, Proceedings of The International Symposium on Grids and Clouds (ISGC) 2012 — PoS(ISGC 2012)

H-RADIC: A Fault Tolerance Framework for Virtual Clusters on Multi-Cloud Environments

by Marcela Castro León

2021

Even though the cloud platform promises to be reliable, several availability incidents prove that they are not. How can we be sure that a parallel application finishes the execution even if a site is affected by a failure? This paper... more

DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop

by Rodrigo Vitor Ribeiro

2020

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including... more

Adaptation Mechanism for Managing Grid Resources

by Faki Silas

2017, International Journal of Engineering Science Invention

As Grid architecture provides resources that fluctuate, an application that is to run in this environment must be able to take into account the changes that may occur. This application must adapt to the changes in the Grid environment.... more

CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters

by Muhammad Rafique

2016

The use of asymmetric multi-core processors with on-chip computational accelerators is becoming common in a variety of environments ranging from scientific computing to enterprise applications. The focus of current research has been on... more

Checkpointing as a Service in Heterogeneous Cloud Environments

by Aaina Arora

2016

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A... more

Triple-Rail MOS Current Mode Logic for High-Speed Self-Timed Pipeline Applications

by Alan Drake

2016, 2006 IEEE International Symposium on Circuits and Systems

High speed and low power is the dream of circuit designers. In this paper a novel self-timed logic family is presented for high-speed self-timed pipelining applications. We developed a novel triple-rail MOS current mode logic (Tr-MCML)... more

On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance

by Dewan Ibtesham et al.

2016, Lecture Notes in Computer Science

User-level socket-based checkpointing for distributed and parallel computation

by Jason Ansel et al.

2016

We present a preliminary description of a user-level checkpointing package, DMTCP, for Linux. The socket-based approach presents a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically... more

Hybrid Checkpointing for MPI Jobs in HPC Environments

by Christian Engelmann

2015, 2010 IEEE 16th International Conference on Parallel and Distributed Systems

As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes... more

The design and implementation of a multi-level content-addressable checkpoint file system

by Latchesar Ionkov et al.

2015, 2012 19th International Conference on High Performance Computing

Long-running HPC applications guard against node failures by writing checkpoints to parallel file systems. Writing these checkpoints with petascale class machines has proven difficult and the increased concurrency demands of exascale... more

Enabling composite applications through an asynchronous shared memory interface

by Latchesar Ionkov et al.

2015, 2014 IEEE International Conference on Big Data (Big Data)

In this work we address the growing need for mechanisms for intranode application composition. We provide a novel shared memory interface that allows composite applications, two or more coupled applications, to share internal data... more

Adaptation Mechanism for Managing Grid Resources

by Faki Silas

2015

ABSTRACT: As Grid architecture provides resources that fluctuate, an application that is to run in this environment must be able to take into account the changes that may occur. This application must adapt to the changes in Grid... more

Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

by Renato Cerqueira

2015, Proceedings of the 3rd international workshop on Middleware for grid computing - MGC '05

Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints might be taken to guarantee application progression, producing even more... more

Strategies for storage of checkpointing data using non-dedicated repositories on Grid systems

by Renato Cerqueira

2015, ACM International Conference Proceeding Series

Dealing with the large amounts of data generated by long-running parallel applications is one of the most challenging aspects of Grid Computing. Periodic checkpoints might be taken to guarantee application progression, producing even more... more

Strategies for Checkpoint Storage on Opportunistic Grids

by Renato Cerqueira

2015, IEEE Distributed Systems Online

This article evaluates several strategies for storing checkpoint data in an opportunistic grid environment, including replication, parity information, and erasure coding. This evaluation compares the computational overhead, storage... more

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

by Jose Carlos

2015

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault... more

A fast restart mechanism for checkpoint/recovery protocols in networked environments

by Zhiling Lan

2014

Checkpoint/recovery has been studied extensively, and various optimization techniques have been presented for its improvement. Regardless of the considerable research efforts, little work has been done on improving its restart latency.... more

reduction of restart latency. Further, the principle of FREM is applicable to general C/R protocols. The post-checkpoint tracking phase is composed of two steps:

Figure 2. Pitfalls in the identification of the touch set caused by dynamic memory management

4.1.3. Dynamic memory management. Dynamic memory allocation and deallocation operations change the process address space. Without a careful analysis, they may cause identification errors. As shown in Figure 2, we identify three types of pitfalls stemming from dynamic memory usage.

Figures 4 and 5 show, respectively, the raw improvement and the relative improvement on restart latency achieved by FREM over BLCR. As we can see from Figure 4, the reduction ranges from a couple of seconds to a couple of hundred seconds. The highest reduction is 152.6 seconds in the FAST network and 208.5 seconds in the SLOW network. According to Figure 5, except for applications 8 and 9, the relative improvement is more than 53.78% in the SLOW network and more than 49.25% in the FAST network. The trivial improvements on applications 8 and 9 are attributed to their low temporal data locality. For instance, for application 8, its touch set is 402 MB, which is very close to the checkpoint image of 409 MB; for application 9, the improvement achieved by FREM drops sharply when the network performance is changed from FAST to SLOW. This is also caused by the rapid growth of the touch set when the network performance is low. However, we shall point out that even in a slow network, the raw restart latency is still reduced by at least a couple of seconds.
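
The application 8 figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes that the benefit of a fast restart shrinks as the touch set approaches the size of the full checkpoint image, which is how the excerpt explains the small improvement.

```python
def touch_set_fraction(touch_set_mb: float, image_mb: float) -> float:
    """Return the touch set size as a fraction of the checkpoint image size.

    Hypothetical helper for illustration; not part of FREM or BLCR.
    """
    if image_mb <= 0:
        raise ValueError("checkpoint image size must be positive")
    return touch_set_mb / image_mb


# Application 8 from the excerpt: 402 MB touch set, 409 MB checkpoint image.
fraction = touch_set_fraction(402, 409)
print(f"touch set is {fraction:.1%} of the checkpoint image")
```

With a touch set this close to the whole image, almost all of the checkpoint data is needed early in the restart, which is consistent with the low temporal data locality the excerpt reports for this application.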

Table 2. Restart latency (RL) by using BLCR and FREM with SPEC CPU2006 applications. The parenthesized numbers in the last two columns are relative improvements (in percentage) achieved by FREM.

The benchmark suite SPEC CPU2006 is tested in our experiments [21]. Since FREM targets the applications with large memory consumptions, we choose the applications whose memory footprints are greater than 150 MB. Among these applications, we randomly select twelve applications and present their results in the following.

Table 3. Post-checkpoint tracking overhead (in milliseconds)

Table 4. Fast restart overhead (time unit: seconds)

The post-checkpoint tracking overhead is mainly caused by three factors: the PTE scan time, the descriptor search and insertion time, and the I/O time to store the descriptor. Table 3 lists the measured post-checkpoint tracking overheads. We have observed

Semi-Asynchronous Checkpointing for Optimistic Parallel Simulation: Description and an Implementation

by Andrea Santoro

2013

Great effort has been devoted to the design of optimized checkpointing strategies for optimistic parallel discrete event simulators. On the other hand there is less work in the direction to improve the execution mode of any single... more

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

by Kei Davis

2013

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault... more

MTIO - A Multi-Threaded Parallel I/O System

by Sachin More

2013

threaded runtime library for parallel I/O. We extend the multi-threading concept to separate the compute and I/O tasks in two separate threads of control. Multi-threading in our design permits a) asynchronous I/O even if the underlying... more
