Rollback recovery Research Papers

New Causal Message Logging Protocol with Asynchronous Checkpointing for Distributed Systems

2025

Causal message logging is an efficient approach for tolerating failures of processes in distributed systems because it has the advantages of both pessimistic and optimistic message logging approaches. However, traditional causal message... more

A survey of solved problems and applications on bandwidth, edgesum, and profile of graphs

by Kenneth Williams

2025, Journal of Graph Theory

This paper provides a survey of results on the exact bandwidth, edgesum, and profile of graphs. A bibliography of work in these areas is provided. The emphasis is on composite graphs. This may be regarded as an update of the original... more

A Survey of State Management in Big Data Processing Systems

by Volker Markl

2025, arXiv (Cornell University)

The concept of state and its applications vary widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Heron, Apache Samza, Apache Spark, and Apache... more

A survey of state management in big data processing systems

by Volker Markl

2025, The VLDB Journal

The concept of state and its applications vary widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Heron, Apache Samza, Apache Spark, and Apache... more

An index-based checkpointing algorithm for autonomous distributed systems

by Paolo Fornara

2025, IEEE Transactions on Parallel and Distributed Systems

points in other processes. This protocol shows good performance especially in autonomous environments where each process does not have any private information about other processes.

Caracterización de una estrategia de detección de fallos transitorios en HPC

by Emilio Luque

2025

1 III-LIDI, Facultad de Informática, UNLP Calle 50 y 120, 1900 La Plata (Buenos Aires), Argentina {dmontezanti, erucci, mnaiouf, degiusti}@lidi.info.unlp.edu.ar 2 Departamento de Arquitectura de Computadoras y Sistemas Operativos, UAB... more

A new roll-forward checkpointing/recovery mechanism for cluster federation

by Shahram Rahimi

2025

In this paper, we have addressed the complex problem of determining a recovery line for cluster federation and proposed an efficient checkpointing / recovery mechanism for it. The main objective of the proposed approach is to advance the... more

A novel roll-back mechanism for performance enhancement of asynchronous checkpointing and recovery

by Shahram Rahimi

2025

In this paper, we present a high performance recovery algorithm for distributed systems in which checkpoints are taken asynchronously. It offers fast determination of the recent consistent global checkpoint (maximum consistent state) of a... more

Interacción De Los Componentes Del Clúster Microsoft HPC (High Performance Computing) Server 2008, Con Aplicaciones Mpi. Interaction of Microsoft Clúster Components HPC (High Performance Computing) Server 2008 with Mpi Applications

by Mauricio Ochoa Echeverria

2025

High Performance Computing (HPC) refers to the solution of complex problems by a group of servers, called a cluster. The cluster as a whole is used for solving a single problem or a group of related problems. Initially, the solutions... more

Extensión funcional de CluSim para tolerancia a fallos

by Luis Valdiviezo

2024

Cooperative Application/OS DRAM Fault Recovery

by Patrick Bridges

2024, Lecture Notes in Computer Science

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such... more

Incorporating quality-of-service in the virtual interface architecture

by Shailabh Nagar

2024, Proceedings 16th International Parallel and Distributed Processing Symposium

User-level Networking (ULN) is gaining rapid acceptance in the commercial world with Virtual Interface Architecture (VIA), and Infiniband more recently, being touted as the interface of choice to diverse peripherals. There is an important... more

Sifat-Sifat Rollback Recovery Menggunakan Uncoordinated Checkpointing Berbasis Causality Strength

by Junianto Sesa

2024, Jurnal Matematika Statistik dan Komputasi

A popular fault-tolerance approach in computing depends on uncoordinated checkpointing. This alternative approach is based on uncoordinated checkpointing and message logging, requiring all records,... more

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

by Dakshnamoorthy Manivannan

2024, Journal of Parallel and Distributed Computing

Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can... more

Intra-block amalgamation in sparse hypermatrix Cholesky factorization

by José R. Herrero

2024

A new evaluation function for the minla problem

by Jose Torres Jimenez

2024

Performance analysis of different checkpointing and recovery schemes using stochastic model

by Krishnendu Mukhopadhyaya

2024, Journal of Parallel and Distributed Computing

Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have derived expressions for average cost of checkpointing,... more

Self-stabilizing algorithm for checkpointing in a distributed system

by Krishnendu Mukhopadhyaya

2024, Journal of Parallel and Distributed Computing

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms... more

New aspect of investigating fault sensitivity of scientific workflows

by Eszter Kail

2024, 2015 IEEE 19th International Conference on Intelligent Engineering Systems (INES)

Scientific workflows have emerged as a key technology that assists scientists with the design, management, execution, sharing and reuse of in silico experiments. These experiments are mainly data and compute intensive so they require high... more

Achieving dynamic workflow management system by applying provenance based checkpointing method

by Eszter Kail

2024, 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)

Scientific workflows are data and compute intensive and thus may run for days or even weeks on parallel and distributed infrastructures such as HPC systems and clouds. In HPC environments the number of failures that can arise during... more

TERPS: the embedded reliable processing system

by Samuel Cervantes Rodriguez

2024, Proceedings of the ASP-DAC 2005. Asia and South Pacific Design Automation Conference, 2005.

TERPS is a fault-tolerant computer design that significantly reduces the threat of electromagnetic interference (EMI), using hardware checkpoint/rollback-recovery. TERPS tolerates EMI by periodically checkpointing processor state into a... more

Analysis and Evaluation of the Performance of CAPE

by Éric Renault

2024

MPI (Message Passing Interface) and OpenMP are two tools broadly used to develop parallel programs. On the one hand, MPI has the advantage of high performance while being difficult to use. On the other hand, OpenMP is very easy to use but... more

Analysis and Evaluation of the Performance of CAPE

by Éric Renault

2024, 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld)

MPI (Message Passing Interface) and OpenMP are two tools broadly used to develop parallel programs. On the one hand, MPI has the advantage of high performance while being difficult to use. On the other hand, OpenMP is very easy to use but... more

Performance analysis of recoverable flight control systems using hybrid dynamical models

by Oscar Gonzalez

2024, American Control Conference

The main goal of this paper is to describe and validate a specific hybrid dynamical model for NASA's recoverable computer system subjected to simulated random upsets. The system is closed-looped with a Boeing 737 simulation model... more

Checkpoint Interval and System's Overall Quality for Message Logging-Based Rollback and Recovery in Distributed and Embedded Computing

by Nianen Chen

2024

In a distributed environment, message-logging-based checkpointing and rollback recovery is a commonly used approach for providing distributed systems with fault tolerance and synchronized global states. Clearly, taking more frequent... more

Checkpoint Interval and System's Overall Quality for Message Logging-Based Rollback and Recovery in Distributed and Embedded Computing

by Nianen Chen

2024, 2009 International Conference on Embedded Software and Systems

In a distributed environment, message-logging-based checkpointing and rollback recovery is a commonly used approach for providing distributed systems with fault tolerance and synchronized global states. Clearly, taking more frequent... more

Caracterización de una estrategia de detección de fallos transitorios en HPC

by E. Fadón

2024

1 III-LIDI, Facultad de Informática, UNLP Calle 50 y 120, 1900 La Plata (Buenos Aires), Argentina {dmontezanti, erucci, mnaiouf, degiusti}@lidi.info.unlp.edu.ar 2 Departamento de Arquitectura de Computadoras y Sistemas Operativos, UAB... more

Analysis of parallel application checkpoint storage for system configuration

by E. Fadón

2024, The Journal of Supercomputing

The use of fault tolerance strategies such as checkpoints is essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can impact the performance and... more

Public Cluster: parallel machine with multi-block approach

by Iman Firmansyah

2024, Arxiv preprint arXiv:0708.0603

Abstract: We introduce a new approach to enable an open and public parallel machine which is accessible to multiple users with multiple jobs belonging to different blocks running at the same time. The concept is required especially for parallel... more

Comparison and tuning of MPI implementation in a grid context

by Jean-Christophe Mignot

2024, HAL (Le Centre pour la Communication Scientifique Directe)

Today, clusters are often interconnected by long distance networks to compose grids and to provide users with a huge number of available resources. To write parallel applications, developers are generally using the standard communication... more

Simulador de un cluster tolerante a fallos basado en OMNeT++

by Luis Coello Valdiviezo

2024

Public Cluster: parallel machine with multi-block approach

by Bambang Hermanto

2024, Arxiv preprint arXiv:0708.0603

Abstract: We introduce a new approach to enable an open and public parallel machine which is accessible to multiple users with multiple jobs belonging to different blocks running at the same time. The concept is required especially for parallel... more

Strategic directions in research in theory of computing

by David Shmoys

2024, ACM Computing Surveys

Determining Consistent States of Distributed Objects Participating in a Remote Method Call

by Bogdan Wiszniewski

2024, Lecture Notes in Computer Science

The article presents the first component of a new approach for testing distributed object-oriented applications called TestByRep which is based on the concept of replication of object states. The paper describes key ideas in TestByRep... more

A distributed counter-based non-blocking coordinated checkpoint algorithm for grid computing applications

by Khadra Hossny

2024, 2012 2nd International Conference on Advances in Computational Tools for Engineering Applications (ACTEA)

In distributed systems, there are many opportunities for failure. Any component in any compute node could fail. This includes, but is not limited to, the processor, disk, memory, or network interface on the node. Any of these failures... more

Application of Mobile Agents as a Tool for Maintaining Consistency in Distributed Applications

by Rajendra Purohit

2024, International journal of engineering research and technology

A distributed application executes on multiple nodes at remote sites. Due to the involvement of multiple nodes, it requires error recovery algorithms. Traditional message passing techniques were proposed to design these error recovery... more

New Techniques for Improving the Performance of the Lockstep Architecture for SEEs Mitigation in FPGA Embedded Processors

by M. Violante

2024, IEEE Transactions on Nuclear Science

Fig. 1. Flow chart of rollback recovery using checkpoint.

Fig. 2. Example of execution of rollback recovery using checkpoint.

Fig. 3. Architecture of the synchronized lockstep with rollback.

Fig. 4. Architecture modified to include the WHT.

manifests itself within the same execution cycle during which it is latched has been increased, and so the probability of successful execution of the rollback, thereby providing higher dependability for the whole system.

RESULTS OF FAULT INJECTION ON THE PROCESSORS

TABLE II. DATA SEGMENT SIZE BREAK-EVEN POINT FOR USE OF WHT

To cite this version

by Nick Reynaert

2024

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad,... more

Stability analysis of upset recovery methods for electromagnetic interference

by Sudarshan Patilkulkarni

2023, Proceedings of the 40th IEEE Conference on Decision and Control (Cat. No.01CH37228)

In this paper a recently developed analytical tool is explained and used to determine the effect on stability of standard error recovery systems on a model of a Boeing 737 used for research at the NASA Langley Research Center. In... more

Measurement, analysis and performance improvement of the Apache Web server

by Dr. Ashwini Nanda

2023, 1999 IEEE International Performance, Computing and Communications Conference (Cat. No.99CH36305)

Performance of Web servers is critical to the success of many corporations and organizations. However, very few results have been published that quantitatively study the server behavior and identify the performance bottlenecks. In this... more

A Graph Transformation-Based Approach for the Validation of Checkpointing Algorithms in Distributed Systems

by Saul Hernandez

2023, 2014 IEEE 23rd International WETICE Conference

Autonomic Computing Systems aim to avoid human intervention and to enable distributed systems to manage themselves. One of their challenges is efficient runtime monitoring to collect information from which... more

Consistency model for collaborative software development on cloud

by Dr Yin Mar Aye

2023, Information and Communication Technology for Education

Consistency preservation is an important problem in collaborative systems, arising in both traditional distributed systems and cloud-based systems for availability and performance. In particular, cloud storage services still need to... more

Fine-Grained Consistency Upgrades for Online Services

by Joao Leitao

2023, 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS)

Online services such as Facebook or Twitter have public APIs to enable an easy integration of these services with third party applications. However, the developers who design these applications have no information about the consistency... more

Application-Level Checkpointing Techniques for Parallel Programs

by Vipin Chaudhary

2023, Lecture Notes in Computer Science

In its simplest form, checkpointing is the act of saving a program's computation state in a form external to the running program, e.g. the computation state is saved to a filesystem. The checkpoint files can then be used to resume... more

A Causal Checkpointing Algorithm for Mobile Computing Environments

by Pranav Raj

2023, Lecture Notes in Computer Science

Checkpointing algorithms suitable for mobile computing environments should be economical in terms of storage and energy consumption, and they should be able to handle that at starting time not all processes are known which are to be... more

An Extended Coherence Protocol for Recoverable DSM Systems with Causal Consistency

by Michał Szychowiak

2023, Lecture Notes in Computer Science

This paper presents a new checkpoint recovery protocol for Distributed Shared Memory (DSM) systems with read-write objects. It is based on independent checkpointing integrated with a coherence protocol for the causal consistency model. That... more

Checkpointing and rollback-recovery protocol for mobile systems with MW session guarantee

by Michał Szychowiak

2023, International Parallel and Distributed Processing Symposium

In the mobile environment, weak consistency replication of shared data is the key to obtaining high data availability, good access performance, and good scalability. Therefore a new class of consistency models, called session guarantees,... more

Rollback-Recovery Protocol Guarantying MR Session Guarantee in Distributed Systems with Mobile Clients

by Michał Szychowiak

2023, Springer eBooks

This paper presents the rVsMR rollback-recovery protocol for distributed mobile systems, guaranteeing the Monotonic Reads consistency model even in case of server failures. The proposed protocol employs known rollback-recovery techniques,... more

Sistemas de cómputo de altas prestaciones con alta disponibilidad: evaluación de la performance en diferentes configuraciones

by Luis Coello Valdiviezo

2023

A generic model of consistency guarantees for replicated services

by Dariusz Wawrzyniak

2023

High availability, scalability, and reliability of services can be provided by replication. However, distributed systems suffer from network partitioning, which reduces availability and/or consistency. The choice between availability and... more
