Academia.edu

Failure Recovery

583 papers
8 followers
About this topic
Failure recovery refers to the processes and strategies employed to restore a system, application, or organization to operational status after a failure or disruption. It encompasses the identification of failure causes, implementation of corrective actions, and the establishment of protocols to prevent future occurrences, ensuring resilience and continuity.

Key research themes

1. How can recovery-oriented computing methodologies optimize system failure recovery to improve availability and reduce total cost of ownership?

This theme explores methods of designing computing systems that can recover quickly and efficiently from failures by rethinking recovery as a first-class design goal rather than a secondary concern, thereby enhancing system availability, reducing downtime costs, and lowering the total cost of ownership (TCO). The focus is on recovery-oriented computing (ROC) principles that target networked services with metrics such as availability, rapid scale, and change, analyzing failure causes and developing techniques for automatic and effective failure recovery.

Key finding: This foundational paper introduces recovery-oriented computing (ROC) which emphasizes making recovery a primary design goal to significantly improve system availability and reduce downtime costs. It demonstrates that operator... Read more
Key finding: This work presents a technique exploiting intrinsic redundancy in reusable software components to automatically avoid application field failures without requiring system restarts. By generating alternative workarounds... Read more
Key finding: This study develops a model-driven, Bayesian and Markov decision process based framework enabling automatic system monitoring and recovery in distributed systems under imperfect and conflicting monitoring conditions. It... Read more
Key finding: This paper details a software-driven fault tolerance scheme for large multicomputer systems executing long jobs, where error detection and recovery are mostly handled by software via paired subsystems executing identical... Read more

2. What are the formal models and programming paradigms that enable systematic recovery and self-healing in software systems after failures?

This theme investigates formal approaches and frameworks for implementing recovery and self-healing capabilities in software systems. It includes transactional compensation models enabling undoing committed transactions without cascading aborts, recovery-oriented programming paradigms embedding monitoring and recovery actions for safety and liveness properties, and systems exhibiting self-healing inspired by biological analogies to autonomously detect, diagnose, and repair faults. The goal is to provide theoretical and practical bases for building software resilient to transient and permanent faults.

Key finding: This paper formulates a transaction model introducing compensating transactions which semantically undo effects of committed or uncommitted transactions affecting others, thereby avoiding cascading aborts. It formalizes... Read more
Key finding: This research proposes the recovery oriented programming (ROP) paradigm wherein programs integrate monitoring of safety and liveness properties and embed recovery actions upon violation detection. Using a generic... Read more
Key finding: The paper identifies that self-healing in distributed software requires invariant regularities across all system configurations, proposing imposing artificial 'laws' on heterogeneous distributed systems to achieve this. It... Read more
Key finding: This review systematically categorizes self-healing techniques inspired by biological systems, presenting methodologies such as middleware-based self-adaptive fault tolerance, monitoring frameworks, and hierarchical fault... Read more
Key finding: This position paper delineates self-healing as systems autonomously detecting faults and performing recovery steps to restore specified operational modes. It distinguishes self-healing from fault tolerance and related... Read more
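
The compensation idea in this theme can be illustrated with a minimal sketch. It is a simplified, saga-style illustration and not the formal transaction model of the paper summarized above: each step is paired with a compensating action that semantically undoes it, and when a later step fails, the completed steps are compensated in reverse order. The function name `run_with_compensation` and the order-processing steps are hypothetical, chosen only for this example.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensation); all names here are illustrative.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_with_compensation(steps: List[Step], state: dict) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(state)
            done.append(step)
            log.append(f"commit {name}")
        except Exception:
            log.append(f"fail {name}")
            # Semantically undo every step that already took effect.
            for prev_name, _, compensate in reversed(done):
                compensate(state)
                log.append(f"compensate {prev_name}")
            break
    return log


def reserve(state: dict) -> None:
    state["stock"] -= 1


def unreserve(state: dict) -> None:
    state["stock"] += 1


def charge(state: dict) -> None:
    state["balance"] -= 10


def refund(state: dict) -> None:
    state["balance"] += 10


def ship(state: dict) -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure


def noop(state: dict) -> None:
    pass


state = {"stock": 5, "balance": 100}
log = run_with_compensation(
    [("reserve", reserve, unreserve), ("charge", charge, refund), ("ship", ship, noop)],
    state,
)
```

In this sketch the compensations are exact inverses, so the state returns to its starting values. Real compensating transactions are often only semantic inverses (a refund is not the same as a charge that never happened), which is the harder case the formal models in this theme address.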

3. How can failure recovery be optimized in storage and network systems through algorithmic and architectural techniques to ensure minimum performance degradation during faults?

This theme considers optimizing failure recovery in storage and network infrastructures, focusing on minimizing recovery overhead, ensuring consistency without rollback cascades, and maintaining service continuity under component failures. It covers topics such as I/O optimal recovery schemes for erasure-coded storage minimizing read/write operations needed for reconstruction, failure recovery architectures in cluster computing free from domino effect, and fault-tolerance frameworks in software-defined networking (SDN) and optical transport networks.

Key finding: This work develops an algorithm to find minimum I/O schedules for recovery from arbitrary numbers of disk failures in XOR-based erasure-coded storage. It introduces a family of codes enabling recovery from up to 11... Read more
Key finding: This paper introduces the Impact Failure Detector that assigns impact factors to processes and outputs a trust level for a set of monitored processes rather than individual binary suspicion. By defining thresholds that... Read more
Key finding: The authors propose a recovery approach for multi-cluster federations that handles both inter-cluster orphan and lost messages, ensuring recovery free from the domino effect, thereby minimizing recomputation. By using common... Read more
Key finding: This survey details fault tolerance challenges and solutions within SDN architectures, examining detection and recovery mechanisms in data, control, and application planes. It highlights that SDN introduces novel fault... Read more
Key finding: This position paper reviews mechanisms enabling optical networks to achieve resilience against disasters including natural events and malicious attacks. It categorizes proactive pre-disaster, preparatory, and reactive... Read more
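
As a much simpler illustration of the XOR-based reconstruction mentioned in this theme, the sketch below rebuilds one lost block from the surviving blocks of a single-parity stripe. It is an assumption-laden toy, not the minimum-I/O scheduling algorithm or the multi-failure codes of the paper summarized above; the helper names `xor_blocks` and `recover_missing` are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(stripe: List[Optional[bytes]]) -> List[Optional[bytes]]:
    """Rebuild the single missing block (marked None) of a single-parity stripe."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one lost block")
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    # XOR of all surviving blocks (data and parity) equals the lost block.
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


data = [b"disk", b"fail", b"heal"]
parity = xor_blocks(data)
stripe = [data[0], None, data[2], parity]  # second data block lost
restored = recover_missing(stripe)
```

Note that this toy reads every surviving block to rebuild one. Reducing the number of blocks that must be read during reconstruction is the kind of I/O cost the recovery schemes in this theme aim to minimize.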

All papers in Failure Recovery

In this paper, we consider IP fast recovery from single-link failures in a given network topology. The basic idea is to replace some existing routers with a designated switch. When a link fails, the affected router will send all the... more
Overbooking represents an important strategy for many service providers that apply revenue management. Although the objective is to overbook such that no customers are denied service, denials may result when the customer no-show rate is... more
In this paper a distributed routing management solution is described that takes into consideration statistical Quality of Service (QoS) information about the state of network links. The goal is to offer dynamic metrics to the routing... more
Abstract—This paper presents the design principles and the practical implementation of a routing management solution which takes into account statistical cross-layer Quality of Service information regarding the state of the network. Link... more
We design a peer-to-peer technique called ZIGZAG for single-source media streaming. ZIGZAG allows the media server to distribute content to many clients by organizing them into an appropriate tree rooted at the server. This... more
Given the fact that the current Internet does not widely support IP Multicast while content-distribution-networks technologies are costly, the concept of peer-to-peer could be a promising start for enabling large-scale streaming systems.... more
We describe a new scalable application-layer multicast protocol, specifically designed for low-bandwidth, data streaming applications with large receiver sets. Our scheme is based upon a hierarchical clustering of the application-layer... more
This paper presents OntoOmnia, a new meta-operating system architecture designed to address the challenges and risks of AI singularity. While previous research has focused on embedding ethical principles or ontological frameworks into AI... more
The concept of state and its applications vary widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Heron, Apache Samza, Apache Spark, and Apache... more
For several decades, optical networks, due to their high capacity and long-distance transmission range, have been used as the major communication technology to serve network traffic, especially in the core and metro segments of... more
Establishing multicast communications in MPLS-capable networks is an essential requirement for a wide-scale deployment of MPLS in the Internet. This paper outlines a framework for the setup of a MultiPoint-to-MultiPoint (MP2MP) Label... more
This paper presents a study of our proposed architecture for the setup of a MultiPoint-to-MultiPoint (MP2MP) Label Switched Path (LSP). This form of LSP is needed for establishing uni-directional multicast shared trees. Such trees are... more
In this paper, we investigate distributed mutual exclusion algorithms and delineate the features of a new distributed mutual exclusion algorithm. The basis of the algorithm is the logical ring structure employed in token-based mutual... more
Wireless sensor networks (WSN) have been investigated as a powerful distributed sensing application to enhance the efficiency of embedded systems and wireless networking capabilities. Although WSN has offered unique opportunities to set... more
We have built a software layer on top of Mach 2.5 that recovers multitask Mach applications from fail-stop failures. The layer implements Optimistic Recovery (OR), a mechanism for transparent recovery from failing tasks and processors,... more
The aim of this paper is to convey our experience using the ESA's 3DROV planetary rover simulator as a visualization and validation tool through a dynamic analysis on the performance of an advanced autonomous control architecture: a... more
This paper aims at describing an integrated power-aware, model-based autonomous control architecture for planetary rover-based mission operations synthesized in the context of a Ph.D. program on the topic "Autonomy for Interplanetary... more
Improving the computation efficiency is the key issue in image processing, especially in edge detection, because edge detection is very computationally intensive. With the development of real-time image processing application, fast... more
WDM optical networks are high-speed networks that provide enormous capacity. Survivability is a very important issue in these networks. Survivability requires resources for handling failures, so an efficient resource allocation strategy is... more
In this paper, we have addressed the complex problem of determining a recovery line for cluster federation and proposed an efficient checkpointing / recovery mechanism for it. The main objective of the proposed approach is to advance the... more
A composite web service is essentially a combination of smaller services to provide extended functionalities. However, such services are more susceptible to failures than atomic services. This is due to its dependency on other services... more
In the Microbial typing field, the need to have a common understanding of the concepts described and the ability to share results within the community is an increasingly important requisite for the continued development of portable and... more
In this paper we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses... more
Finding consistent global checkpoints of a distributed computation is important for analyzing, testing, or verifying properties of these computations. In this paper we present a theoretical foundation for finding consistent global... more
Checkpointing algorithms are classified as synchronous and asynchronous in the literature. In synchronous checkpointing, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always... more
Advance reservation of lightpaths in grid environments is necessary to guarantee QoS and reliability. In this paper, we have evaluated and compared several algorithms for dynamic scheduling of lightpaths using a flexible advance... more
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR,... more
This document defines the architecture for IP and LDP Fast Reroute using Maximally Redundant Trees (MRT-FRR). MRT-FRR is a technology that gives link-protection and node-protection with 100% coverage in any network topology that is still... more
Istituto di Linguistica Computazionale – CNR1 Area della Ricerca, via G. Moruzzi 1, 56100 Pisa, Italy {roberto.bartolini, simonetta.montemagni, vito.pirrelli}@ilc.cnr.it ... Università di Pisa, Dipartimento di Linguistica2 via Santa... more
In a single-link network architecture, if a link fails, the system searches for a substitute link and transmits the data through it. The system must always find the reason for the path break and then configure the system again to... more
Fault-tolerance is an essential aspect of network resilience. Fault-tolerance mechanisms are required to ensure high availability and high reliability in systems. The advent of software-defined networking (SDN) has both presented new... more
Multicasting is a fundamental networking primitive utilized by numerous applications. This also holds true for cognitive radio networks (CRNs) which have been proposed as a solution to the problems that emanate from the static... more
Availability in software refers to the system's ability to be operational and ready to perform its tasks when required. This concept is broader than reliability, as it includes not only consistent performance but also the system's... more
The New Millennium Remote Agent (NMRA) will be the first AI system to control an actual spacecraft. The spacecraft domain places a strong premium on autonomy and requires dynamic recoveries and robust concurrent execution, all in the... more
Resource allocation is essential in the next generation of cognitive radio networks, where allocation techniques are used to increase network performance. However, it is difficult to apply these techniques in real-time... more
Research in consumer psychology shows that customers seek reasons for service failures and that attributions of blame moderate the effects of failure on the level of customer satisfaction. This paper extends research on service... more
Service composition provides a flexible way to quickly enable new application functionalities in next generation networks. We focus on the scenario where next generation portal providers 'compose' the component services of other... more
In KU-FEL (Kyoto University FEL), a 12-14 µm FEL has been available using a 40 MeV S-band linac and a 1.6 m undulator. We are going to install a 1.8 m undulator which was used in JAEA to extend the lasing range of KU-FEL. We measured the... more
This paper proposes an approach to examining how testing affects the operational behavior of aging software systems. Such an approach requires models for the testing phase and the operational phase that explicitly account for crash... more
Under link-state routing protocols such as OSPF and IS-IS, when there is a change in the topology, propagation of link-state announcements, path recomputation, and updating of forwarding tables (FIBs) will all incur some delay before... more
Under link-state routing protocols such as OSPF and IS-IS, when there is a change in the topology, propagation of link-state advertisements, path recomputation, and updating of forwarding tables (FIBs) will all incur some delay before... more
Overlay multicast scheme has been regarded as an alternative to conventional IP multicast since it can support multicast functions without infrastructural level changes. However, multicast tree reconstruction procedure is required when a... more
The aim of this paper is to move away from today's multi-tier, manually operated, and performance limited Data Centre Network (DCN) towards more scalable, flexible, and optimized architecture of tomorrow. We propose a new hybrid optical... more
Wireless communications have seen remarkable progress over the past two decades and perceived tremendous success due to their agile nature and capability to provide fast and ubiquitous internet access. Maturation of 3G wireless network... more