Failure Recovery

582 papers
8 followers
About this topic
Failure recovery refers to the processes and strategies employed to restore a system, application, or organization to operational status after a failure or disruption. It encompasses the identification of failure causes, implementation of corrective actions, and the establishment of protocols to prevent future occurrences, ensuring resilience and continuity.

Key research themes

1. How can recovery-oriented computing methodologies optimize system failure recovery to improve availability and reduce total cost of ownership?

This theme explores methods of designing computing systems that can recover quickly and efficiently from failures by rethinking recovery as a first-class design goal rather than a secondary concern, thereby enhancing system availability, reducing downtime costs, and lowering the total cost of ownership (TCO). The focus is on recovery-oriented computing (ROC) principles that target networked services with metrics such as availability, rapid scale, and change, analyzing failure causes and developing techniques for automatic and effective failure recovery.

Key finding: This foundational paper introduces recovery-oriented computing (ROC) which emphasizes making recovery a primary design goal to significantly improve system availability and reduce downtime costs. It demonstrates that operator...
Key finding: This work presents a technique exploiting intrinsic redundancy in reusable software components to automatically avoid application field failures without requiring system restarts. By generating alternative workarounds...
Key finding: This paper details a software-driven fault tolerance scheme for large multicomputer systems executing long jobs, where error detection and recovery are mostly handled by software via paired subsystems executing identical...

2. What are the formal models and programming paradigms that enable systematic recovery and self-healing in software systems after failures?

This theme investigates formal approaches and frameworks for implementing recovery and self-healing capabilities in software systems. It includes transactional compensation models enabling undoing committed transactions without cascading aborts, recovery-oriented programming paradigms embedding monitoring and recovery actions for safety and liveness properties, and systems exhibiting self-healing inspired by biological analogies to autonomously detect, diagnose, and repair faults. The goal is to provide theoretical and practical bases for building software resilient to transient and permanent faults.

Key finding: This paper formulates a transaction model introducing compensating transactions which semantically undo effects of committed or uncommitted transactions affecting others, thereby avoiding cascading aborts. It formalizes...
Key finding: This research proposes the recovery oriented programming (ROP) paradigm wherein programs integrate monitoring of safety and liveness properties and embed recovery actions upon violation detection. Using a generic...
Key finding: The paper identifies that self-healing in distributed software requires invariant regularities across all system configurations, and proposes imposing artificial 'laws' on heterogeneous distributed systems to achieve this. It...
Key finding: This review systematically categorizes self-healing techniques inspired by biological systems, presenting methodologies such as middleware-based self-adaptive fault tolerance, monitoring frameworks, and hierarchical fault...
Key finding: This position paper delineates self-healing as systems autonomously detecting faults and performing recovery steps to restore specified operational modes. It distinguishes self-healing from fault tolerance and related...
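
The compensating-transaction idea from the first finding in this theme can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the formal model of the cited paper: each step is paired with a compensation that semantically undoes it, and when a later step fails, the compensations of the steps that already completed run in reverse order. The function `run_with_compensation`, the `accounts` dictionary, and the step functions are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Tuple

# A step is a pair: (action, compensation that semantically undoes the action).
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run actions in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo the effects of every step that already took effect.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


# Hypothetical in-memory state used only for this illustration.
accounts = {"a": 100, "b": 0}


def debit() -> None:
    accounts["a"] -= 30


def undo_debit() -> None:
    accounts["a"] += 30


def credit_fails() -> None:
    raise RuntimeError("credit service unavailable")


def noop() -> None:
    pass


ok = run_with_compensation([(debit, undo_debit), (credit_fails, noop)])
```

In this sketch the debit takes effect first and is then undone by its compensation when the credit step fails, so the balances return to their initial values without aborting any other work.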

3. How can failure recovery be optimized in storage and network systems through algorithmic and architectural techniques to ensure minimum performance degradation during faults?

This theme considers optimizing failure recovery in storage and network infrastructures, focusing on minimizing recovery overhead, ensuring consistency without rollback cascades, and maintaining service continuity under component failures. It covers topics such as I/O optimal recovery schemes for erasure-coded storage minimizing read/write operations needed for reconstruction, failure recovery architectures in cluster computing free from domino effect, and fault-tolerance frameworks in software-defined networking (SDN) and optical transport networks.

Key finding: This work develops an algorithm to find minimum I/O schedules for recovery from arbitrary numbers of disk failures in XOR-based erasure-coded storage. It introduces a family of codes enabling recovery from up to 11...
Key finding: This paper introduces the Impact Failure Detector that assigns impact factors to processes and outputs a trust level for a set of monitored processes rather than individual binary suspicion. By defining thresholds that...
Key finding: The authors propose a recovery approach for multi-cluster federations that handles both inter-cluster orphan and lost messages, ensuring recovery free from the domino effect, thereby minimizing recomputation. By using common...
Key finding: This survey details fault tolerance challenges and solutions within SDN architectures, examining detection and recovery mechanisms in data, control, and application planes. It highlights that SDN introduces novel fault...
Key finding: This position paper reviews mechanisms enabling optical networks to achieve resilience against disasters including natural events and malicious attacks. It categorizes proactive pre-disaster, preparatory, and reactive...
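
For readers unfamiliar with the XOR-based erasure coding mentioned in the first finding of this theme, the simplest case can be sketched as follows. This is a single-parity illustration only, not the minimum-I/O scheduling algorithm or the code family of the cited work: a stripe stores data blocks plus one parity block equal to their bytewise XOR, so any one lost block is the XOR of the survivors. The names `xor_blocks` and `recover` and the sample data are assumptions made for this example.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover(stripe: List[Optional[bytes]]) -> List[Optional[bytes]]:
    """Rebuild the one missing block (None) in a stripe with a single XOR parity block."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [block for block in stripe if block is not None]
    rebuilt = list(stripe)
    # The lost block equals the XOR of all surviving blocks (data and parity).
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Hypothetical stripe: three data blocks followed by their parity block.
data = [b"abcd", b"efgh", b"ijkl"]
stripe = data + [xor_blocks(data)]

damaged: List[Optional[bytes]] = list(stripe)
damaged[1] = None  # simulate one failed disk
restored = recover(damaged)
```

Note that this naive recovery reads every surviving block. Codes that tolerate several failures offer more than one way to rebuild a block, which is where choosing a schedule that minimizes I/O becomes relevant.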

All papers in Failure Recovery

We describe a new scalable application-layer multicast protocol, specifically designed for low-bandwidth, data streaming applications with large receiver sets. Our scheme is based upon a hierarchical clustering of the ...
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be...
A peer-to-peer technique called ZIGZAG for single-source media streaming is designed. ZIGZAG allows the media server to distribute content to many clients by organizing them into an appropriate tree rooted at the server. This...
Availability is a storage system property that is both highly desired and yet minimally engineered. While many systems provide mechanisms to improve availability, such as redundancy and failure recovery, how to best configure these...
It is now well recognized that an effective service recovery program is essential to generating customer satisfaction and loyalty. A number of studies have investigated the impact of service recovery efforts (compensation, speed of...
On May 17th 1999, NASA activated for the first time an AI-based planner/scheduler running on the flight processor of a spacecraft. This was part of the Remote Agent Experiment (RAX), a demonstration of closed-loop planning and execution,...
The tremendous popularity of wireless systems in recent years has led to the commoditization of RF transceivers (radios) whose prices have fallen dramatically. The lower cost allows us to consider using two or more radios in the same...
In this article, we examine consumer reactions to two service recovery strategies: fixing the service failure for a fee and fixing the service failure for no fee and adding compensation. We expect that the more desirable recovery strategy...
Modern computer systems are expected to be up continuously: even planned downtime to accomplish system reconfiguration is becoming unacceptable, so more and more changes are having to be made to "live" systems that are running production...
A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available...
This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major...
Brick and object-based storage architectures have emerged as a means of improving the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, despite their ability to exhibit...
The IEEE 802.11i wireless networking protocol provides mutual authentication between a network access point and user devices prior to user connectivity. The protocol consists of several parts, including an 802.1X authentication phase...
The rapid advances in dense wavelength-division multiplexing technology with hundreds of wavelengths per fiber and worldwide fiber deployment have brought about a tremendous increase in the size (i.e., number of ports) of photonic...
Precise failure analysis requires accurate fault diagnosis. A previously proposed method for diagnosing bridging faults using single stuck-at dictionaries was applied only to small circuits, produced large and imprecise diagnoses, and did...
We are interested in the validation of a cognitive theory of human communication, grounded in a speech acts perspective. The theory we refer to is outlined, and a number of predictions are drawn from it. We report a series of protocols...
Several distributed real-time applications (e.g., medical imaging, air traffic control, and video conferencing) demand hard guarantees on the message delivery latency and the recovery delay from component failures. As these demands cannot...
Routing protocols for wireless sensor networks must address the challenges of reliable packet delivery at increasingly large scale and highly constrained node resources. Attempts to limit node state can result in undesirable worst-case...
Unknown, unexplored and abandoned subterranean voids threaten mining operations, surface developments and the environment. Hazards within these spaces preclude human access to create and verify extensive maps or to characterize and...
Key service elements combine to create the service concept and its value proposition for customers. During service operations failures, employee interactions with customers are a critical service element in restoring customer...
One of the desirable features of any network is its ability to keep services running despite a link or node failure. This ability is usually referred to as network resilience and has become a key demand from service providers. Resilient...
Over the past decade the number of processors in high performance facilities went up to hundreds of thousands. As a direct consequence, while the computational power follows the trend, the mean time between failures (MTBF) suffered,...
This paper presents an approach to concurrency control based on the decomposition of both the database and the individual transactions. This approach is a generalization of serializability theory in that the set of permissible transaction...
Advance reservation of lightpaths in Grid environments is necessary to guarantee QoS and reliability. In this paper, we have evaluated and compared several algorithms for dynamic scheduling of lightpaths using a flexible advance...
Network applications of the future will require advanced mechanisms for automatic failure recovery in order to provide an acceptable quality of service. Because of this requirement, there is a need for tools that can inject simulated...
This paper addresses the problem of markerless tracking of a human in full 3D with a high-dimensional (29D) body model. Most work in this area has been focused on achieving accurate tracking in order to replace marker-based motion...
A model of remote procedure call (RPC) which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented. A technique of adopting orphans caused by failures,...
Many processes must complete in the presence of failures. Different systems respond to task failure in different ways. The system may resume a failed task from the failure point (or a saved checkpoint shortly before the failure point), it...
The combination of photogrammetric aerial and terrestrial recording methods can provide new opportunities for photogrammetric applications. A UAV (Unmanned Aerial Vehicle), in our case a helicopter system, can cover both the aerial and...
A sensor network can be described as a collection of sensor nodes which coordinate with each other to perform some specific function. These sensor nodes are mainly in large numbers and are densely deployed either inside the phenomenon or...
Web services emergence has triggered extensive research efforts. Currently, there is a trend towards deploying business processes as an orchestration of web services compositions. Given that web services are inherently loosely coupled and...
In order to achieve resilient multipath routing we introduce the concept of Independent Directed Acyclic Graphs (IDAGs) in this study. Link-independent (Node-independent) DAGs satisfy the property that any path from a source to the root...
Most research to date in survivable optical network design and operation has focused on the failure of a single component such as a link or a node. A double-link failure model in which any two links in the network may fail in an arbitrary...
We present the availability model of a high availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered. Different types of fault detectors, detection delays,...
Current trends suggest future software systems will comprise collections of components that combine and recombine dynamically in reaction to changing conditions. Service-discovery protocols, which enable software components to locate...
Legacy application design models, which are still widely used for developing context-aware applications, incur important limitations. Firstly, embedding contextual dependencies in the form of if-then rules specifying how applications...
Network operators are migrating towards IP over WDM architectures. In such multi-layer networks, it is necessary to efficiently use the resources available from both layers in order to provide coordinated recovery strategies. Thanks to...
Optimistic failure recovery mechanisms are proposed as a way to provide transparent fault tolerance to distributed applications and systems. The authors identify problems that may arise when these mechanisms are applied to vast networks...
With the growing scale of high performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proved to tolerate higher...
This paper presents Sonora, a platform for mobile-cloud computing. Sonora is designed to support the development and execution of continuous mobile-cloud services. To this end, Sonora provides developers with stream-based programming...
Many current approaches to software-implemented fault tolerance (SIFT) rely on process replication, which is often prohibitively expensive for practical use due to its high performance overhead and cost. The Adaptive Reconfigurable Mobile...
Border gateway protocol (BGP) is the standard routing protocol between various autonomous systems (AS) in the Internet. In the event of a failure, BGP may repeatedly withdraw some routes and advertise new ones until a stable...
Multi-protocol label switching (MPLS) is an evolving network technology that is used to provide traffic engineering (TE) and high speed networking. Internet service providers, which support MPLS technology, are increasingly demanded to...
An important research area in the workflow management domain is the adaptation of workflows to unexpected events or failures at runtime. In this paper we present a concept for dynamic and automated workflow re-planning that allows to...
In this paper, we describe a method of execution retry for bypassing software faults in message-passing applications. Based on the techniques of checkpointing and message logging, we demonstrate the use of message replaying and message reordering...
In this paper, we propose a novel Web service composition framework which dynamically accommodates various failure recovery requirements. In the proposed framework called Adaptive Failure-handling Framework (AdaFF), failure-handling...
In choosing a network service technology, a subscriber considers many features such as latency, jitter, packet loss, security, and availability. The most important feature, and usually the one that determines the final selection, is the...