Academia.edu

Failure Recovery

583 papers
8 followers
About this topic
Failure recovery refers to the processes and strategies employed to restore a system, application, or organization to operational status after a failure or disruption. It encompasses the identification of failure causes, implementation of corrective actions, and the establishment of protocols to prevent future occurrences, ensuring resilience and continuity.

Key research themes

1. How can recovery-oriented computing methodologies optimize system failure recovery to improve availability and reduce total cost of ownership?

This theme explores methods of designing computing systems that can recover quickly and efficiently from failures by rethinking recovery as a first-class design goal rather than a secondary concern, thereby enhancing system availability, reducing downtime costs, and lowering the total cost of ownership (TCO). The focus is on recovery-oriented computing (ROC) principles that target networked services with metrics such as availability, rapid scale, and change, analyzing failure causes and developing techniques for automatic and effective failure recovery.

Key finding: This foundational paper introduces recovery-oriented computing (ROC) which emphasizes making recovery a primary design goal to significantly improve system availability and reduce downtime costs. It demonstrates that operator... Read more
Key finding: This work presents a technique exploiting intrinsic redundancy in reusable software components to automatically avoid application field failures without requiring system restarts. By generating alternative workarounds... Read more
Key finding: This study develops a model-driven, Bayesian and Markov decision process based framework enabling automatic system monitoring and recovery in distributed systems under imperfect and conflicting monitoring conditions. It... Read more
Key finding: This paper details a software-driven fault tolerance scheme for large multicomputer systems executing long jobs, where error detection and recovery are mostly handled by software via paired subsystems executing identical... Read more

2. What are the formal models and programming paradigms that enable systematic recovery and self-healing in software systems after failures?

This theme investigates formal approaches and frameworks for implementing recovery and self-healing capabilities in software systems. It includes transactional compensation models enabling undoing committed transactions without cascading aborts, recovery-oriented programming paradigms embedding monitoring and recovery actions for safety and liveness properties, and systems exhibiting self-healing inspired by biological analogies to autonomously detect, diagnose, and repair faults. The goal is to provide theoretical and practical bases for building software resilient to transient and permanent faults.

Key finding: This paper formulates a transaction model introducing compensating transactions which semantically undo effects of committed or uncommitted transactions affecting others, thereby avoiding cascading aborts. It formalizes... Read more
Key finding: This research proposes the recovery oriented programming (ROP) paradigm wherein programs integrate monitoring of safety and liveness properties and embed recovery actions upon violation detection. Using a generic... Read more
Key finding: The paper identifies that self-healing in distributed software requires invariant regularities across all system configurations, proposing imposing artificial 'laws' on heterogeneous distributed systems to achieve this. It... Read more
Key finding: This review systematically categorizes self-healing techniques inspired by biological systems, presenting methodologies such as middleware-based self-adaptive fault tolerance, monitoring frameworks, and hierarchical fault... Read more
Key finding: This position paper delineates self-healing as systems autonomously detecting faults and performing recovery steps to restore specified operational modes. It distinguishes self-healing from fault tolerance and related... Read more
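
To make the idea of compensation concrete, here is a minimal Python sketch of running a sequence of steps where each completed step registers a compensating action, and a later failure triggers those actions in reverse order. It is an illustration only, not the formal transaction model from the paper summarized above; the names `run_with_compensation`, `debit`, `credit`, and the `balance` dictionary are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step pairs a forward action with the action that semantically undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo only the steps that already finished, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


# Hypothetical example: a transfer whose second step fails.
balance = {"a": 100, "b": 0}


def debit() -> None:
    balance["a"] -= 30


def undo_debit() -> None:
    balance["a"] += 30


def credit() -> None:
    raise RuntimeError("credit service unavailable")


def undo_credit() -> None:
    balance["b"] -= 30


ok = run_with_compensation([(debit, undo_debit), (credit, undo_credit)])
```

In this run the debit completes, the credit fails, and the debit is compensated, so the balances return to their starting values without aborting any unrelated work.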

3. How can failure recovery be optimized in storage and network systems through algorithmic and architectural techniques to ensure minimum performance degradation during faults?

This theme considers optimizing failure recovery in storage and network infrastructures, focusing on minimizing recovery overhead, ensuring consistency without rollback cascades, and maintaining service continuity under component failures. It covers topics such as I/O optimal recovery schemes for erasure-coded storage minimizing read/write operations needed for reconstruction, failure recovery architectures in cluster computing free from domino effect, and fault-tolerance frameworks in software-defined networking (SDN) and optical transport networks.

Key finding: This work develops an algorithm to find minimum I/O schedules for recovery from arbitrary numbers of disk failures in XOR-based erasure-coded storage. It introduces a family of codes enabling recovery from up to 11... Read more
Key finding: This paper introduces the Impact Failure Detector that assigns impact factors to processes and outputs a trust level for a set of monitored processes rather than individual binary suspicion. By defining thresholds that... Read more
Key finding: The authors propose a recovery approach for multi-cluster federations that handles both inter-cluster orphan and lost messages, ensuring recovery free from the domino effect, thereby minimizing recomputation. By using common... Read more
Key finding: This survey details fault tolerance challenges and solutions within SDN architectures, examining detection and recovery mechanisms in data, control, and application planes. It highlights that SDN introduces novel fault... Read more
Key finding: This position paper reviews mechanisms enabling optical networks to achieve resilience against disasters including natural events and malicious attacks. It categorizes proactive pre-disaster, preparatory, and reactive... Read more
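
The trust-level idea in the Impact Failure Detector summary above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's algorithm: it assumes the trust level is the sum of the impact factors of the processes that are not currently suspected, and that the monitored set counts as trusted when that sum reaches a threshold. The names `trust_level`, `is_trusted`, and the example impact factors are hypothetical.

```python
from typing import Dict, Set


def trust_level(impact: Dict[str, int], suspected: Set[str]) -> int:
    """Sum the impact factors of all processes that are not suspected."""
    return sum(weight for proc, weight in impact.items() if proc not in suspected)


def is_trusted(impact: Dict[str, int], suspected: Set[str], threshold: int) -> bool:
    """Treat the monitored set as trusted when its trust level meets the threshold."""
    return trust_level(impact, suspected) >= threshold


# Hypothetical impact factors: p1 matters more than p2 or p3.
impact = {"p1": 3, "p2": 1, "p3": 1}
```

With these values, losing both low-impact processes still leaves a trust level of 3, while losing `p1` alone drops it to 2, which is the kind of distinction a per-process binary suspicion cannot express.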

All papers in Failure Recovery

In this article, we examine consumer reactions to two service recovery strategies: fixing the service failure for a fee and fixing the service failure for no fee and adding compensation. We expect that the more desirable recovery strategy... more
It is now well recognized that an effective service recovery program is essential to generating customer satisfaction and loyalty. A number of studies have investigated the impact of service recovery efforts (compensation, speed of... more
Due to the lack of a general theoretical foundation, today's distributed traffic control mechanisms developed at the networking layer, transport layer, and overlay are largely disintegrated. As a result, traffic control protocols... more
This paper investigates learning approaches for discovering fault-tolerant control policies to overcome thruster failures in Autonomous Underwater Vehicles (AUV). The proposed approach is a model-based direct policy search that learns on... more
One of the desirable features of any network is its ability to keep services running despite a link or node failure. This ability is usually referred to as network resilience and has become a key demand from service providers. Resilient... more
Today small autonomous helicopters offer a low budget platform for aerial applications such as surveillance (both military and civil), land management and earth sciences. In this paper we introduce a prototype of autonomous aerial... more
by Fuchun Lin and 1 more
With the increasing interest in deploying 4G/LTE networks, IMS has a potential to be deployed in a wide scale in order to support mobile Internet and value-added services over next-generation networks. Moreover, the effort to create an... more
This paper presents Sonora, a platform for mobile-cloud computing. Sonora is designed to support the development and execution of continuous mobile-cloud services. To this end, Sonora provides developers with stream-based programming... more
We investigate methods to improve fault-tolerance of Autonomous Underwater Vehicles (AUVs) to increase their reliability and persistent autonomy. We propose a learning-based approach that is able to discover new control policies to... more
The tremendous popularity of wireless systems in recent years has led to the commoditization of RF transceivers (radios) whose prices have fallen dramatically. The lower cost allows us to consider using two or more radios in the same... more
In choosing a network service technology, a subscriber considers many features such as latency, jitter, packet loss, security, and availability. The most important feature, and usually the one that determines the final selection, is the... more
This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major... more
We present the availability model of a high availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered. Different types of fault detectors, detection delays,... more
Fault tolerance is a crucial design consideration for mission-critical distributed real-time and embedded (DRE) systems, which combine the real-time characteristics of embedded platforms with the dynamic characteristics of distributed... more
The combination of photogrammetric aerial and terrestrial recording methods can provide new opportunities for photogrammetric applications. A UAV (Unmanned Aerial Vehicle), in our case a helicopter system, can cover both the aerial and... more
A sensor network can be described as a collection of sensor nodes which coordinate with each other to perform some specific function. These sensor nodes are mainly in large numbers and are densely deployed either inside the phenomenon or... more
Brick and object-based storage architectures have emerged as a means of improving the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, despite their ability to exhibit... more
Key service elements combine to create the service concept and its value proposition for customers. During service operations failures, employee interactions with customers are a critical service element in restoring customer... more
We are interested in the validation of a cognitive theory of human communication, grounded in a speech acts perspective. The theory we refer to is outlined, and a number of predictions are drawn from it. We report a series of protocols... more
Software failures are still a major concern in mission- and enterprise-critical contexts, despite significant efforts spent in software testing. In fact, while software testing is effective against easily-reproducible bugs (Bohrbugs), it... more
This paper addresses the problem of optimal Quality of Service (QoS), Traffic Engineering (TE) and Failure Recovery (FR) in Computer Networks by introducing novel algorithms that only use source inferrable information. More precisely,... more
The IEEE 802.11i wireless networking protocol provides mutual authentication between a network access point and user devices prior to user connectivity. The protocol consists of several parts, including an 802.1X authentication phase... more
Web services are building blocks of interoperable systems. Composing Web services makes the processes capable of doing complex tasks. Composite services may fail during their execution which can be diagnosed by a mediator. The mediator... more
Deploying Ethernet in the metro domain will require many different upgrades including end-to-end QoS guarantees, protection mechanisms and service performance monitoring. In this paper we propose a distributed method to address network... more
A traditional pure performance model that ignores failure and recovery but considers resource contention generally overestimates the system's ability to perform a certain job. On the other hand, pure availability analysis tends to be too... more
A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available... more
In this paper, we discuss the design and implementation of fault-aware Global Memory Management (GMM) for a multi-kernel architecture. Scalability of today's systems is limited by SMP hardware, as well as by the underlying commodity... more
Unknown, unexplored and abandoned subterranean voids threaten mining operations, surface developments and the environment. Hazards within these spaces preclude human access to create and verify extensive maps or to characterize and... more
Service Oriented Architecture facilitates automatic execution and composition of web services in distributed environment. This service composition in the heterogeneous environment may suffer from various kinds of service failures. These... more
Multipath routing is essential for efficient data transmission in the wake of voice over IP and multimedia streaming. The growing usage of such network requirements also demands fast recovery from network failures. Multipath routing is one... more
The High Level Architecture (HLA) is a standard for the interoperability and reuse of simulation components, referred to as federates. Large scale HLA-compliant simulations are built to study complex problems, and they often involve a... more
A framework for planning and supervision of robotized assembly tasks is initially presented, with emphasis on failure recovery. The approach to the integration of services and the modeling of tasks, resources and environment is briefly... more
An important research area in the workflow management domain is the adaptation of workflows to unexpected events or failures at runtime. In this paper we present a concept for dynamic and automated workflow re-planning that allows to... more
"Stateless" servers have been popularized by NFS [Sandberg85]. The benefit of a stateless server is that the server can crash and reboot and no special recovery action is required. Clients simply retry their operations until they get a... more
Defects per million (DPM), defined as the number of calls out of a million dropped due to failures, is an important service (un)reliability measure for telecommunication systems. Most previous research derives the DPM from steady-state... more
The rapid advances in dense wavelength-division multiplexing technology with hundreds of wavelengths per fiber and worldwide fiber deployment have brought about a tremendous increase in the size (i.e., number of ports) of photonic... more
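
The defects per million (DPM) measure defined in one of the abstracts above, the number of calls out of a million dropped due to failures, reduces to simple arithmetic when raw call counts are available. The following Python sketch shows that calculation only; it does not reproduce the modeling approach of the paper itself, and the function name `defects_per_million` and its parameters are hypothetical.

```python
def defects_per_million(dropped_calls: int, total_calls: int) -> float:
    """Scale the fraction of dropped calls to a per-million figure."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    if not 0 <= dropped_calls <= total_calls:
        raise ValueError("dropped_calls must be between 0 and total_calls")
    # Multiply before dividing to limit floating-point rounding.
    return dropped_calls * 1_000_000 / total_calls
```

For example, 3 dropped calls out of 200,000 attempted corresponds to a DPM of 15.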