Academia.edu

Rollback recovery

229 papers
8 followers
About this topic
Rollback recovery is a fault-tolerance technique in computing that allows a system to revert to a previous stable state after a failure or error. It involves saving the system's state at certain checkpoints, enabling the restoration of data and processes to ensure consistency and reliability in the event of disruptions.
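The checkpoint-and-revert idea described above can be illustrated with a minimal sketch. It assumes a single in-memory process and a simulated failure. The names `CheckpointedCounter`, `run`, and `checkpoint_every` are hypothetical and chosen only for illustration. Real systems typically write checkpoints to stable storage and must coordinate them across processes.

```python
import copy


class CheckpointedCounter:
    """Toy process whose state can be checkpointed and rolled back (hypothetical)."""

    def __init__(self):
        self.state = {"total": 0, "steps": 0}
        # The initial state serves as the first stable checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Deep copy so later mutations cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Revert to the most recent checkpoint after a failure.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, value):
        # A negative input stands in for a failure (an assumption of this sketch).
        if value < 0:
            raise ValueError("simulated failure")
        self.state["total"] += value
        self.state["steps"] += 1


def run(values, checkpoint_every=2):
    """Process values, checkpointing periodically and rolling back on failure."""
    proc = CheckpointedCounter()
    for i, value in enumerate(values, start=1):
        try:
            proc.apply(value)
        except ValueError:
            proc.rollback()  # work done since the last checkpoint is lost
            continue
        if i % checkpoint_every == 0:
            proc.checkpoint()
    return proc.state
```

In this sketch, a failure discards every update made since the last checkpoint, which is the basic trade-off between checkpoint frequency and the amount of work lost on recovery.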
Causal message logging is an efficient approach for tolerating failures of processes in distributed systems because it has the advantages of both the pessimistic and optimistic message logging approaches. However, traditional causal message... more
This paper provides a survey of results on the exact bandwidth, edgesum, and profile of graphs. A bibliography of work in these areas is provided. The emphasis is on composite graphs. This may be regarded as an update of the original... more
The concept of state and its applications vary widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Heron, Apache Samza, Apache Spark, and Apache... more
points in other processes. This protocol shows good performance especially in autonomous environments where each process does not have any private information about other processes.
1 III-LIDI, Facultad de Informática, UNLP Calle 50 y 120, 1900 La Plata (Buenos Aires), Argentina {dmontezanti, erucci, mnaiouf, degiusti}@lidi.info.unlp.edu.ar 2 Departamento de Arquitectura de Computadoras y Sistemas Operativos, UAB... more
In this paper, we have addressed the complex problem of determining a recovery line for cluster federation and proposed an efficient checkpointing / recovery mechanism for it. The main objective of the proposed approach is to advance the... more
In this paper, we present a high performance recovery algorithm for distributed systems in which checkpoints are taken asynchronously. It offers fast determination of the recent consistent global checkpoint (maximum consistent state) of a... more
High Performance Computing (HPC) refers to the solution of complex problems by a group of servers, called a cluster. The cluster as a whole is used for solving a single problem or a group of related problems. Initially, the solutions... more
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such... more
User-level Networking (ULN) is gaining rapid acceptance in the commercial world with Virtual Interface Architecture (VIA), and Infiniband more recently, being touted as the interface of choice to diverse peripherals. There is an important... more
Fault tolerance based on uncoordinated checkpointing is among the most popular approaches on computing devices. This alternative approach is based on uncoordinated checkpointing and message logging, requiring all records,... more
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can... more
Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have derived expressions for average cost of checkpointing,... more
If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms... more
Scientific workflows have emerged as a key technology that assists scientists with the design, management, execution, sharing and reuse of in silico experiments. These experiments are mainly data and compute intensive so they require high... more
Scientific workflows are data- and compute-intensive and thus may run for days or even weeks on parallel and distributed infrastructures such as HPC systems and the cloud. In an HPC environment the number of failures that can arise during... more
TERPS is a fault-tolerant computer design that significantly reduces the threat of electromagnetic interference (EMI), using hardware checkpoint/rollback-recovery. TERPS tolerates EMI by periodically checkpointing processor state into a... more
MPI (Message Passing Interface) and OpenMP are two tools broadly used to develop parallel programs. On the one hand, MPI has the advantage of high performance while being difficult to use. On the other hand, OpenMP is very easy to use but... more
The main goal of this paper is to describe and validate a specific hybrid dynamical model for NASA's recoverable computer system subjected to simulated random upsets. The system is closed-looped with a Boeing 737 simulation model... more
In a distributed environment, message-logging-based checkpointing and rollback recovery is a commonly used approach for providing distributed systems with fault tolerance and synchronized global states. Clearly, taking more frequent... more
The use of fault tolerance strategies such as checkpoints is essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can impact the performance and... more
Abstract: We introduce a new approach to enable an open and public parallel machine that is accessible to multiple users whose multiple jobs, belonging to different blocks, run at the same time. The concept is required especially for parallel... more
Today, clusters are often interconnected by long-distance networks to compose grids and to provide users with a huge number of available resources. To write parallel applications, developers are generally using the standard communication... more
The article presents the first component of a new approach for testing distributed object-oriented applications called TestByRep which is based on the concept of replication of object states. The paper describes key ideas in TestByRep... more
In distributed systems, there are many opportunities for failure. Any component in any compute node could fail. This includes, but is not limited to, the processor, disk, memory, or network interface on the node. Any of these failures... more
A distributed application executes on multiple nodes at remote sites. Because multiple nodes are involved, it requires error recovery algorithms. Traditional message passing techniques were proposed to design these error recovery... more
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad,... more
In this paper a recently developed analytical tool is explained and used to determine the effect on stability of standard error recovery systems on a model of a Boeing 737 used for research at the NASA Langley Research Center. In... more
Performance of Web servers is critical to the success of many corporations and organizations. However, very few results have been published that quantitatively study the server behavior and identify the performance bottlenecks. In this... more
Autonomic Computing Systems aim to avoid human intervention and to enable distributed systems to manage themselves. One of their challenges is efficient runtime monitoring oriented to collect information from which... more
Consistency preservation is an important problem in collaborative systems, arising in both traditional distributed systems and cloud-based systems for availability and performance. In particular, cloud storage services still need to... more
Online services such as Facebook or Twitter have public APIs to enable an easy integration of these services with third party applications. However, the developers who design these applications have no information about the consistency... more
In its simplest form, checkpointing is the act of saving a program's computation state in a form external to the running program, e.g. the computation state is saved to a filesystem. The checkpoint files can then be used to resume... more
Checkpointing algorithms suitable for mobile computing environments should be economical in terms of storage and energy consumption, and they should be able to handle that at starting time not all processes are known which are to be... more
This paper presents a new checkpoint recovery protocol for Distributed Shared Memory (DSM) systems with read-write objects. It is based on independent checkpointing integrated with a coherence protocol for causal consistency model. That... more
In the mobile environment, weak consistency replication of shared data is the key to obtaining high data availability, good access performance, and good scalability. Therefore, a new class of consistency models, called session guarantees,... more
This paper presents the rVsMR rollback-recovery protocol for distributed mobile systems, guaranteeing the Monotonic Reads consistency model even in case of server failures. The proposed protocol employs known rollback-recovery techniques,... more
High availability, scalability, and reliability of services can be provided by replication. However, distributed systems suffer from network partitioning, which reduces availability and/or consistency. The choice between availability and... more