Failure Recovery

582 papers

8 followers

About this topic

Failure recovery refers to the processes and strategies employed to restore a system, application, or organization to operational status after a failure or disruption. It encompasses the identification of failure causes, implementation of corrective actions, and the establishment of protocols to prevent future occurrences, ensuring resilience and continuity.

Key research themes

1. How can recovery-oriented computing methodologies optimize system failure recovery to improve availability and reduce total cost of ownership?

This theme explores methods of designing computing systems that can recover quickly and efficiently from failures by rethinking recovery as a first-class design goal rather than a secondary concern, thereby enhancing system availability, reducing downtime costs, and lowering the total cost of ownership (TCO). The focus is on recovery-oriented computing (ROC) principles that target networked services with metrics such as availability, rapid scale, and change, analyzing failure causes and developing techniques for automatic and effective failure recovery.

Recovery-oriented computing (ROC): Motivation, definition, techniques, and …

by William Tetzlaff and

2016

Key finding: This foundational paper introduces recovery-oriented computing (ROC), which emphasizes making recovery a primary design goal to significantly improve system availability and reduce downtime costs. It demonstrates that operator...

Automatic Recovery from Runtime Failures

by Mauro Pezzè

2015

Key finding: This work presents a technique exploiting intrinsic redundancy in reusable software components to automatically avoid application field failures without requiring system restarts. By generating alternative workarounds...

A Software-Based Hardware Fault Tolerance Scheme for Multicomputers

by Eli Gafni

2025

Key finding: This paper details a software-driven fault tolerance scheme for large multicomputer systems executing long jobs, where error detection and recovery are mostly handled by software via paired subsystems executing identical...

2. What are the formal models and programming paradigms that enable systematic recovery and self-healing in software systems after failures?

This theme investigates formal approaches and frameworks for implementing recovery and self-healing capabilities in software systems. It includes transactional compensation models enabling undoing committed transactions without cascading aborts, recovery-oriented programming paradigms embedding monitoring and recovery actions for safety and liveness properties, and systems exhibiting self-healing inspired by biological analogies to autonomously detect, diagnose, and repair faults. The goal is to provide theoretical and practical bases for building software resilient to transient and permanent faults.

A formal approach to recovery by compensating transactions

by Eliezer Levy and

2016, The VLDB Journal

Key finding: This paper formulates a transaction model introducing compensating transactions which semantically undo effects of committed or uncommitted transactions affecting others, thereby avoiding cascading aborts. It formalizes...

Recovery Oriented Programming: Runtime Monitoring of Safety and Liveness

by Olga Brukman

2017

Key finding: This research proposes the recovery oriented programming (ROP) paradigm wherein programs integrate monitoring of safety and liveness properties and embed recovery actions upon violation detection. Using a generic...

On conditions for self-healing in distributed software systems

by Naftaly Minsky

2023, 2003 Autonomic Computing Workshop

Key finding: The paper identifies that self-healing in distributed software requires invariant regularities across all system configurations, and proposes imposing artificial 'laws' on heterogeneous distributed systems to achieve this. It...

Self-Healing Systems: Application and Methodologies-A Review

by Fidelis Ugwuanyi

2022, International Journal of Research

Key finding: This review systematically categorizes self-healing techniques inspired by biological systems, presenting methodologies such as middleware-based self-adaptive fault tolerance, monitoring frameworks, and hierarchical fault...

Self-Healing Systems: Foundations and Challenges

by Gabi Dreo Rodosek

2016

Key finding: This position paper delineates self-healing as systems autonomously detecting faults and performing recovery steps to restore specified operational modes. It distinguishes self-healing from fault tolerance and related...

3. How can failure recovery be optimized in storage and network systems through algorithmic and architectural techniques to ensure minimum performance degradation during faults?

This theme considers optimizing failure recovery in storage and network infrastructures, focusing on minimizing recovery overhead, ensuring consistency without rollback cascades, and maintaining service continuity under component failures. It covers topics such as I/O optimal recovery schemes for erasure-coded storage minimizing read/write operations needed for reconstruction, failure recovery architectures in cluster computing free from domino effect, and fault-tolerance frameworks in software-defined networking (SDN) and optical transport networks.

In search of I/O-optimal recovery from disk failures

by Osama N Khan

2021

Key finding: This work develops an algorithm to find minimum I/O schedules for recovery from arbitrary numbers of disk failures in XOR-based erasure-coded storage. It introduces a family of codes enabling recovery from up to 11...

Impact: an Unreliable Failure Detector Based on Processes' Relevance and the Confidence Degree in the System

by Anubis Graciela de Moraes Rossetto

2025

Key finding: This paper introduces the Impact Failure Detector that assigns impact factors to processes and outputs a trust level for a set of monitored processes rather than individual binary suspicion. By defining thresholds that...

Domino-Effect Free Crash Recovery for Concurrent Failures in Cluster Federation

by Shahram Rahimi

2025, Lecture Notes in Computer Science

Key finding: The authors propose a recovery approach for multi-cluster federations that handles both inter-cluster orphan and lost messages, ensuring recovery free from the domino effect, thereby minimizing recomputation. By using common...

Fault-Tolerance in the Scope of Software-Defined Networking (SDN)

by Rui Aguiar

2024, IEEE Access

Key finding: This survey details fault tolerance challenges and solutions within SDN architectures, examining detection and recovery mechanisms in data, control, and application planes. It highlights that SDN introduces novel fault...

Disaster resilience of optical networks: State of the art, challenges, and opportunities

by Georgios Ellinas

2025, Optical Switching and Networking

Key finding: This position paper reviews mechanisms enabling optical networks to achieve resilience against disasters including natural events and malicious attacks. It categorizes proactive pre-disaster, preparatory, and reactive...

All papers in Failure Recovery

Scalable application layer multicast

by Duc Vinh Tran

2002, Proceedings of the 2002 SIGCOMM conference

We describe a new scalable application-layer multicast protocol, specifically designed for low-bandwidth, data streaming applications with large receiver sets. Our scheme is based upon a hierarchical clustering of the ...

Reliable communication in the presence of failures

by Abdul Muttalib

1985, ACM Transactions on Computer Systems

The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be...

ZIGZAG: An Efficient Peer-to-Peer Scheme for Media Streaming

by An Tran

2003

A peer-to-peer technique called ZIGZAG for single-source media streaming is designed. ZIGZAG allows the media server to distribute content to many clients by organizing them into an appropriate tree rooted at the server. This...

Total Recall: System Support for Automated Availability Management

by Kiran Tati

2004

Availability is a storage system property that is both highly desired and yet minimally engineered. While many systems provide mechanisms to improve availability (such as redundancy and failure recovery), how to best configure these...

Service failure recovery: The moderating impact of individual-level cultural value orientation on perceptions of justice

by Kriengsin Prasongsukarn and

2006

It is now well recognized that an effective service recovery program is essential to generating customer satisfaction and loyalty. A number of studies have investigated the impact of service recovery efforts (compensation, speed of...

Fig. 1. Two-way interaction: apology status × power distance.

Fig. 2. Two-way interaction: recovery initiation × collectivism/individualism.

Fig. 3. Two-way interaction: cognitive control × uncertainty avoidance.

Test of equivalence of scenario between Thailand and Australia

Goodness-of-fit statistics: χ² (df = 40) = 75.3, p < 0.05, GFI = 0.97, AGFI = 0.95, NFI = 0.97, CFI = 0.98, RMSEA = 0.043.

(R) = reverse scored. Measurement items for perceived justice

Regression analysis: service recovery satisfaction model. A Chow test was used to test for any significant differences in the form (or slope) of the two regression models (Australia and Thailand). Results were not significantly different, thus justifying pooling of the two country data (F = 2.37, p > 0.05).

Appendix A. CVSCALE: Test of invariance

P = Power Distance; U = Uncertainty Avoidance; C = Collectivism. Justice dimensions: test of invariance

D = Distributive justice; P = Procedural justice; I = Interactional justice.

Planning in interplanetary space: Theory and practice

by Kanna Rajan

2000, Proceedings of the Fifth …

On May 17th 1999, NASA activated for the first time an AI-based planner/scheduler running on the flight processor of a spacecraft. This was part of the Remote Agent Experiment (RAX), a demonstration of closed-loop planning and execution,...

Reconsidering wireless systems with multiple radios

by Alecs S.

2004, Computer Communication Review

The tremendous popularity of wireless systems in recent years has led to the commoditization of RF transceivers (radios) whose prices have fallen dramatically. The lower cost allows us to consider using two or more radios in the same...

Consumer responses to service recovery strategies: The moderating role of online versus offline environment

by Dhruv Grewal

2006, Journal of Business Research

In this article, we examine consumer reactions to two service recovery strategies: fixing the service failure for a fee and fixing the service failure for no fee and adding compensation. We expect that the more desirable recovery strategy...

Aqueduct: online data migration with performance guarantees

by Chenyang Lu and

2002

Modern computer systems are expected to be up continuously: even planned downtime to accomplish system reconfiguration is becoming unacceptable, so more and more changes are having to be made to "live" systems that are running production...

Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems

by Mukesh Singhal

1994, IEEE Transactions on Parallel and Distributed Systems

A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available...

A Flexible Framework for Fault Tolerance in the Grid

by Soonwook Hwang

2003, Journal of Grid Computing

This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major...

Rados

by Scott Brandt

2007, Proceedings of the 2nd international workshop on Petascale data storage held in conjunction with Supercomputing '07 - PDSW '07

Brick and object-based storage architectures have emerged as a means of improving the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, despite their ability to exhibit...

A modular correctness proof of IEEE 802.11i and TLS

by Anupam Datta

2005

The IEEE 802.11i wireless networking protocol provides mutual authentication between a network access point and user devices prior to user connectivity. The protocol consists of several parts, including an 802.1X authentication phase...

Waveband switching in optical networks

by Vishal Anand

2003, IEEE Communications Magazine

The rapid advances in dense wavelength-division multiplexing technology with hundreds of wavelengths per fiber and worldwide fiber deployment have brought about a tremendous increase in the size (i.e., number of ports) of photonic...

Diagnosis of realistic bridging faults with single stuck-at information

by Tracy Larrabee

1995, Proceedings of IEEE International Conference on Computer Aided Design (ICCAD)

Precise failure analysis requires accurate fault diagnosis. A previously proposed method for diagnosing bridging faults using single stuck-at dictionaries was applied only to small circuits, produced large and imprecise diagnoses, and did...

Neuropragmatics: Neuropsychological Constraints on Formal Theories of Dialogue

by Maurizio Tirassa and

1997, Brain and Language

We are interested in the validation of a cognitive theory of human communication, grounded in a speech acts perspective. The theory we refer to is outlined, and a number of predictions are drawn from it. We report a series of protocols...

An efficient primary-segmented backup scheme for dependable real-time communication in multihop networks

by Phanie Phany

2003, IEEE/ACM Transactions on Networking

Several distributed real-time applications (e.g., medical imaging, air traffic control, and video conferencing) demand hard guarantees on the message delivery latency and the recovery delay from component failures. As these demands cannot...

S4: Small state and small stretch routing protocol for large wireless sensor networks

by Jonathan Smith

2007

Routing protocols for wireless sensor networks must address the challenges of reliable packet delivery at increasingly large scale and highly constrained node resources. Attempts to limit node state can result in undesirable worst-case...

A campaign in autonomous mine mapping

by Dave Ferguson and

2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004

Unknown, unexplored and abandoned subterranean voids threaten mining operations, surface developments and the environment. Hazards within these spaces preclude human access to create and verify extensive maps or to characterize and...

Table II. Summary of field deployments of Groundhog into the Mathies mine during May and October 2003.

Fig. 1. Groundhog: a rugged platform designed to traverse the rough, unpredictable terrain of mine corridors, able to overcome obstacles such as fallen roof timbers, partial sidewall or roof collapses, rail tracks and deep mud. Shown here at the north portal to the Mathies mine.

Fig. 2. Groundhog Layout: (a) Laser Range Finders (b) Gas Sensors (c) Low-Light Camera (d) Sinkage Sensors (e) Wireless Ethernet (f) Batteries (g) Main Electronics Enclosure (CPU, Tilt, Gyro, Control Circuitry)

Fig. 3. From Mine Corridor to Cost Map: (a) An image, deep in the mine, taken by Groundhog’s low-light camera. (b) A 3D point cloud obtained by the laser scanner in a similar corridor. (c) The corresponding traversability map where brighter spots are easier to traverse.

Fig. 4. Cost Map to Paths: White areas have higher cost; paths are in green. Top: An unbiased path through the center of the corridor. Middle: A path biased to the right side of the corridor. Bottom: Example convolution filters.

If the robot is unable to find a path to any of the goals generated, it concludes that the corridor is unnavigable and begins exiting the mine.

Fig. 5. Groundhog’s State Transition Graph. Nodes with multiple outgoing edges branch depending on the type and severity of the problem.

MSHA nominally requires strict and rigorous safety testing before permitting any device to operate in a mine.

Portal 1 with the goal of autonomously traversing the entire mine from end to end. Instead, Groundhog encountered a fallen roof timber (see Fig. 7) 308 meters into the mine and decided to turn back. Subsequent system failures stranded the robot approximately 160 meters from the portal, and on-site inspectors received permission to suit up and walk into the mine to recover the robot. The lessons learned from that first deployment led to the development of the fault-tolerance paradigm described above.

Fig. 7. Fallen roof timber 308 meters inside portal 1 (Photograph Courtesy PA-DEP)

Fig. 8. Results from the Mathies Mine: The 2D maps are approximately scaled and aligned to match the orientation in Fig. 6. The 3D scans are, from left to right, the roof fall encountered 140 meters into portal 2, the fallen timber encountered 308 meters into portal 1, and the fork in the corridor encountered 200 meters into portal 3.

The Impact of Service Operations Failures on Customer Satisfaction: Evidence on How Failures and Their Source Affect What Matters to Customers

by Shannon Anderson

2009, Manufacturing & Service Operations Management

Key service elements combine to create the service concept and its value proposition for customers. During service operations failures, employee interactions with customers are a critical service element in restoring customer satisfaction. However, research in consumer psychology shows that customers seek reasons for service failures and their attributions of blame moderate the effects of the failure on the level of customer satisfaction. This paper extends research on services operations failures by hypothesizing that attributions of blame also affect what matters to the customer during service failures. Specifically, we hypothesize that the relative weights that customers assign to the key elements of the service in reaching an overall assessment of customer satisfaction are affected by customer attributions of blame for service failures. We use the U.S. airline industry as a quasi-experimental research setting to investigate the components of customer satisfaction for three samples of customers who experience: 1) routine service, 2) flight delays of external (i.e., weather) origin, and 3) flight delays of internal origin. Although the level of customer satisfaction is lower for all service failures, we find that the key components of satisfaction differ between delayed and routine flights only when customers blame the service provider for the failure. Specifically, when delays are of external origin, satisfaction is lower than for routine flights, but there is virtually no difference in the weight that customers assign to the components of customer satisfaction (including employee interactions). In contrast, when delays are of internal origin, satisfaction is lower than for either routine flights or flights delayed by external factors and employee interactions have a significantly diminished role in customer satisfaction evaluations.
Contrary to the popular view that employee interactions take on a greater role in determining customer satisfaction during service failures, we find that the opposite is true if the customer attributes blame to the service provider. The results highlight the important role of customer attributions during service failures and present more nuanced evidence on the role of employee-customer interactions in mitigating the effects of service failures on customer satisfaction. Data Availability: Data for replicating the results of this study are available online at: [insert web site address]. Included in an online appendix and as electronic data files are: the LISREL program code for the basic model and data files containing the variable means, standard deviations and covariance matrices for each of the three treatment groups. In addition to replicating the results of the study, the reader may explore any model that is nested within our model by making changes to the original program code to reflect constraints of a nested model. Confidentiality and nondisclosure agreements with the data provider preclude us from redistributing the raw survey data or reporting results that may be used to identify the customer satisfaction performance of any air carrier.

* A passenger may fit within more than one of these categories. This table summarizes the steps taken to construct the three samples of delayed (weather and other sources) and on-time flights.

This table reports the structural model coefficients from maximum likelihood estimation of the structural equation model relating satisfaction with the service elements to overall customer satisfaction. In this model, after suitable tests (see text), the measurement model is constrained to be identical and the structural coefficients are allowed to vary for three groups: 1) travelers that experience weather delays, 2) other sources of delays and 3) no delays. The cells contain the unstandardized coefficient, the t-statistic (in parentheses), the within-group completely standardized coefficient, and the completely standardized common metric coefficient. The within-group completely standardized coefficients permit comparisons within a group of the relative impact of each attribute of the service concept. The completely standardized common metric coefficients permit comparisons of the impact of an attribute across groups.

This table presents summary statistics from 21 tests of whether the intercept or individual coefficients relating a service element to overall customer satisfaction differ between groups and from 3 joint tests of whether all coefficients relating service elements to customer satisfaction differ significantly between groups. Tests are conducted for each pair of groups by comparing the model of Table 3 with the nested model that constrains the coefficient(s) of interest to be identical for two groups. The reported p-values from chi-squared tests indicate whether allowing the coefficients to differ between groups yields a better fitting model (bold font indicates values less than 0.10).

A survey of IP and multiprotocol label switching fast reroute schemes

by Alex Noel Joseph Raj

2007, Computer Networks

One of the desirable features of any network is its ability to keep services running despite a link or node failure. This ability is usually referred to as network resilience and has become a key demand from service providers. Resilient...

Redesigning the message logging model for high performance

by George Bosilca

2010

Over the past decade the number of processors in the high performance facilities went up to hundreds of thousands. As a direct consequence, while the computational power followed the trend, the mean time between failures (MTBF) suffered,...

Modular Concurrency Control and Failure Recovery

by E. Douglas Jensen

This paper presents an approach to concurrency control based on the decomposition of both the database and the individual transactions. This approach is a generalization of serializability theory in that the set of permissible transaction...

Executions of schedules $S_1$ and $S_2$

with a consistency constraint "$A + B = 100$." Consider two transfer transactions: $T_1 = (t_{11}(A := A - 1);\ t_{12}(B := B + 1))$ and $T_2 = (t_{21}(B := B - 2);\ t_{22}(A := A + 2))$, where the symbol "$t_{ij}$" denotes step $j$ of transaction $i$. In addition, consider a different implementation of transaction $T_2$, $T_2^* = (t_{21}^*(B := B - 2);\ t_{22}^*(A := 100 - B))$. Note that both transactions $T_2^*$ and $T_2$ transfer two units from $B$ to $A$ and preserve the consistency constraint "$A + B = 100$" when executing alone. Suppose that we are first given transactions $T_1$ and $T_2$. To enhance the concurrency, one may use a nonserializable scheduling method which schedules these two transactions by associating the "lock" and "unlock" operation pair with each step. That is, $T_1$ = (Lock $A$, $t_{11}$, Unlock $A$; Lock $B$, $t_{12}$, Unlock $B$), $T_2$ = (Lock $B$, $t_{21}$, Unlock $B$; Lock $A$, $t_{22}$, Unlock $A$). Later, suppose that someone modifies the implementation of $T_2$ to $T_2^*$ but retains the same locking protocol with the intuitive argument that transactions $T_2$ and $T_2^*$ have the same number of steps, use the same commutative "add" and "subtract" operations, and perform the identical computation.
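The excerpt breaks off before showing what goes wrong, so the following is only a minimal sketch of the hazard it sets up, not the paper's own schedules. It replays one step-level interleaving that the per-step lock/unlock protocol would permit, using assumed starting values A = 50 and B = 50; the step function names, the `run` helper, and the `consistent` check are hypothetical names introduced for illustration.

```python
# Steps of T1 and T2 as given in the excerpt; each runs under its own lock,
# so any interleaving of whole steps is allowed by the locking protocol.
def t11(s): s["A"] = s["A"] - 1        # T1, step 1
def t12(s): s["B"] = s["B"] + 1        # T1, step 2
def t21(s): s["B"] = s["B"] - 2        # T2 (and T2*), step 1
def t22(s): s["A"] = s["A"] + 2        # T2, step 2
def t22_star(s): s["A"] = 100 - s["B"] # T2*, step 2 (modified implementation)

def run(schedule, a=50, b=50):
    """Apply the steps in order to an assumed initial state and return it."""
    state = {"A": a, "B": b}
    for step in schedule:
        step(state)
    return state

def consistent(state):
    """The consistency constraint from the excerpt: A + B = 100."""
    return state["A"] + state["B"] == 100

# Same interleaving, once with T2 and once with T2*.
with_t2 = run([t11, t21, t22, t12])
with_t2_star = run([t11, t21, t22_star, t12])

print(with_t2, consistent(with_t2))
print(with_t2_star, consistent(with_t2_star))
```

Under these assumptions the interleaving with $T_2$ preserves the constraint, while the same interleaving with $T_2^*$ does not, because $t_{22}^*$ reads $B$ after $T_1$ has changed $A$ but before it has changed $B$. Run serially, $T_1$ followed by $T_2^*$ still satisfies the constraint.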


Dynamic scheduling of network resources with advance reservations in optical grids

by Savera Tanwir and

2008, International Journal of Network Management

Advance reservation of lightpaths in Grid environments is necessary to guarantee QoS and reliability. In this paper, we have evaluated and compared several algorithms for dynamic scheduling of lightpaths using a flexible advance...

FIG: A Prototype Tool for Online Verification of Recovery Mechanisms

by Peter Broadwell

2002

Network applications of the future will require advanced mechanisms for automatic failure recovery in order to provide an acceptable quality of service. Because of this requirement, there is a need for tools that can inject simulated...

The Impact of Service Fairness Perceptions on Relationship Quality

by Mavis T. Adjei

2009, Services Marketing Quarterly


Tracking-as-Recognition for Articulated Full-Body Human Motion Analysis

by Geoff West

2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition

This paper addresses the problem of markerless tracking of a human in full 3D with a high-dimensional (29D) body model. Most work in this area has been focused on achieving accurate tracking in order to replace marker-based motion...

Failure Transparency in Remote Procedure Calls

by Kaliappa Ravindran

1989

A model of remote procedure call (RPC) which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented. A technique of adopting orphans caused by failures,...

Fig. 1. Locus of the remote procedure call thread.

Thus, the necessary condition for the server to execute TR’ without causing state inconsistency is that TR‘’ should be idempotent. However, a sufficient condition is given by the requirements [see relation (2)] that B. Event Logs

Fig. 3. Data structures used in the RPC run-time system.

Fig. 4. Variation of catch-up distance with respect to P_idem.

rollback required to recover from a failure is the criterion underscoring these indexes. The indexes guide a proper choice of the run-time parameters to minimize and/or eliminate rollback (and the associated rollback propagation).

failure will cause the callee to rollback, is given by

Note that S is less than the checkpoint interval (in our case, the number of calls between call receipt and return). If S < (K, + 1), the question of rollback does not arise. The mean rollback distance is given by

Fig. 6. Variation of probability of rollback with respect to P_idem.

message for notification and the other for acknowledgment) to each of the procedures connected to P_i. In addition, a group message followed by one or more replies may be required to locate a server if the client’s cache does not contain the name binding information for the server.

Asymptotic Behavior of Total Times for Jobs That Must Start Over if a Failure Occurs

by Pierre Fiorini

2008, Mathematics of Operations Research

Many processes must complete in the presence of failures. Different systems respond to task failure in different ways. The system may resume a failed task from the failure point (or a saved checkpoint shortly before the failure point), it...

A Visual Global Positioning System for Unmanned Aerial Vehicles Used in Photogrammetric Applications

by Emanuele Frontoni and

2011, Journal of Intelligent & …

The combination of photogrammetric aerial and terrestrial recording methods can provide new opportunities for photogrammetric applications. A UAV (Unmanned Aerial Vehicle), in our case a helicopter system, can cover both the aerial and...

A self-managing fault management mechanism for wireless sensor networks

by International Journal of Wireless & Mobile Networks (IJWMN) - ERA, WJCI Indexed

A sensor network can be described as a collection of sensor nodes which coordinate with each other to perform some specific function. These sensor nodes are mainly in large numbers and are densely deployed either inside the phenomenon or...

THROWS: An Architecture for Highly Available Distributed Execution of Web Services Compositions

by Neila Ben Lakhal

2004

Web services emergence has triggered extensive research efforts. Currently, there is a trend towards deploying business processes as an orchestration of web services compositions. Given that web services are inherently loosely coupled and...

Independent Directed Acyclic Graphs for Resilient Multipath Routing

by Srinivasan Ramasubramanian

Networking, IEEE/ACM …

In order to achieve resilient multipath routing we introduce the concept of Independent Directed Acyclic Graphs (IDAGs) in this study. Link-independent (Node-independent) DAGs satisfy the property that any path from a source to the root...

Composite performance and availability analysis using a hierarchy of stochastic reward nets

by Jogesh Muppala

Computer Performance Evaluation: Modelling …

Figure 1: Architecture of the OLTP System.

Online transaction processing (OLTP) systems have become a major application area for computers. An OLTP system is needed in cases where many users require instant access to records in large databases. Examples of such systems include airline reservation systems and automated bank-teller systems. These systems are characterized by high throughput and availability requirements.

Figure 3: Closed Queueing Network Model of the OLTP system.

The performance of an OLTP system can be studied using a queueing network model, since the system involves contention for resources. A queueing network model for the OLTP system is shown in Figure 3. In this model it is assumed that the TPs have a single queue from which...

Table 1 shows the results for the OLTP system for different configurations. For this example we set the mean time to failure of the processors to 400 hours. The processor reboot time was assumed to be 1 minute and the system reboot time was set to 2 minutes. In the table, results are given for various values of the probability of successful reconfiguration, $c$. The probability of successful processor reboot is also set to $c$. We set the probability that the system needs a reboot upon reconfiguration failure to $1 - c$.

transactions are selected for processing using a scheduling discipline that satisfies product-form assumptions. The TPs are modeled using a multi-server queue with the number of servers equal to the number of TPs. The DBPs are also similarly configured. The service times of the TPs are exponentially distributed with mean $1/\mu_{TP}$ and the service times of the DBPs are also exponentially distributed with mean $1/\mu_{DBP}$. The average time between completion of a transaction and submission of the next transaction at a terminal, which is equivalent to the think time at the terminal, is also exponentially distributed with mean $1/\lambda$. The number of terminals available in the system is assumed to be $N$. A transaction finishing execution at the TP may visit the DBP with probability $p_0$ or complete execution and return to the terminals with probability $1 - p_0$. Since this queueing network obeys product-form assumptions, we could use efficient algorithms like mean value analysis to compute steady-state performance measures such as the average throughput, average queue length at each queue, and average response time for a transaction. However, since we are using SRN as our model type, we will construct an SRN model for the queueing network.

Figure 4: SRN Model of the Closed Queueing Network.
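The mean value analysis mentioned in this excerpt can be illustrated with a minimal sketch. It is not the authors' SRN model: it assumes a single server at each station (the excerpt describes multi-server stations), assumes a transaction leaving the DBP returns to the TP, and uses illustrative parameter values. The function names `visit_ratios` and `mva` are hypothetical.

```python
def visit_ratios(p0):
    """Visits per transaction to the TP and DBP stations.

    Assumed routing (not confirmed by the excerpt): after each TP service
    the transaction goes to the DBP with probability p0 and then back to
    the TP; otherwise it returns to its terminal.
    """
    v_tp = 1.0 / (1.0 - p0)
    v_dbp = p0 / (1.0 - p0)
    return v_tp, v_dbp


def mva(demands, think_time, n_terminals):
    """Exact MVA for a closed network of single-server queueing stations
    plus one delay (think-time) station.

    demands     -- service demand per station (visit ratio * mean service time)
    think_time  -- mean think time at a terminal
    n_terminals -- number of circulating transactions (must be >= 1)
    Returns (throughput, response_time, mean_queue_lengths).
    """
    queue = [0.0] * len(demands)
    throughput, response = 0.0, 0.0
    for n in range(1, n_terminals + 1):
        # An arriving customer sees the queue lengths of the network
        # with one customer fewer (arrival theorem).
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (think_time + response)
        queue = [throughput * r for r in residence]
    return throughput, response, queue


# Illustrative values only: think-time rate 1.0, TP rate 50.0,
# DBP rate 20.0, DBP visit probability 0.8.
v_tp, v_dbp = visit_ratios(0.8)
demands = [v_tp / 50.0, v_dbp / 20.0]
x, r, q = mva(demands, think_time=1.0, n_terminals=20)
print(f"throughput={x:.3f} response_time={r:.3f}")
```

Under these assumptions the throughput can never exceed the reciprocal of the largest service demand, which gives a quick sanity check on the output.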

for repair is greater than the number of DBPs waiting for repair, the repair-person is allocated to the TPs and vice versa. However if equal number of TPs and DBPs are waiting for repair, the TPs are given priority over the DBPs.

Table 3: System Performance with Failures and Repairs.

Thus to evaluate a system with $N_{TP}$ TPs and $N_{DBP}$ DBPs, we need to obtain the mean response time and the overall system throughput for each configuration $(i, j)$ of the system with $i$ TPs and $j$ DBPs, where $1 \le i \le N_{TP}$ and $1 \le j \le N_{DBP}$. The model given in Figure 4 is evaluated for each of these configurations to obtain the reward rates for the failure-repair model given in Figure 2.

Table 4: Sizes of the Markov Chains for Different Configurations.

Figure 6: Percentile Plot of Response Times of the OLTP System.

The percentile plots for various configurations (in the figure, (t,d) represents the number of TPs and the number of DBPs respectively) are shown in Figure 6. For this example, we assume that $\lambda = 1.0$, $\mu_{TP} = 50.0$, $\mu_{DBP} = 20.0$ and $p_0 = 0.8$. In this example we assume that the number of TPs is equal to the number of DBPs (the model allows the number to be different). We also assume that with every additional TP, the number of terminals is increased


Capacity Optimization for Surviving Double-Link Failures in Mesh-Restorable Optical Networks

by Arun Somani

2005, Photonic Network Communications

Most research to date in survivable optical network design and operation has focused on the failure of a single component such as a link or a node. A double-link failure model in which any two links in the network may fail in an arbitrary... more


Availability Modeling of SIP Protocol on IBM(c) WebSphere(c)

by dazhi wang

2008

We present the availability model of a high availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered. Different types of fault detectors, detection delays,... more


Understanding consistency maintenance in service discovery architectures during communication failure

by Kevin Mills

2002, Proceedings of the third international workshop on Software and performance - WOSP '02

Current trends suggest future software systems will comprise collections of components that combine and recombine dynamically in reaction to changing conditions. Service-discovery protocols, which enable software components to locate... more


An architecture for rapid, on-demand service composition

by Peter Robinson

2007, Service Oriented Computing and Applications

Legacy application design models, which are still widely used for developing context-aware applications, incur important limitations. Firstly, embedding contextual dependencies in the form of if-then rules specifying how applications... more


A multi-layer recovery strategy in FAN over WDM architectures

by Cesar Cardenas

2009

Network operators are migrating towards IP over WDM architectures. In such multi-layer networks, it is necessary to efficiently use the resources available from both layers in order to provide coordinated recovery strategies. Thanks to... more


Optimistic Failure Recovery for Very Large Networks

by Andy Lowry

1991

Optimistic failure recovery mechanisms are proposed as a way to provide transparent fault tolerance to distributed applications and systems. The authors identify problems that may arise when these mechanisms are applied to vast networks... more


Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery

by George Bosilca

2009, 2009 IEEE International Conference on Cluster Computing and Workshops

With the growing scale of high performance computing platforms, fault tolerance has become a major issue. Among the various approaches for providing fault tolerance to MPI applications, message logging has been proved to tolerate higher... more


Sonora: A Platform for Continuous Mobile-Cloud Computing

by Fan Yang

This paper presents Sonora, a platform for mobile-cloud computing. Sonora is designed to support the development and execution of continuous mobile-cloud services. To this end, Sonora provides developers with stream-based programming... more

Figure 3. A data-flow view of Sonora.

handled transparently if this is desirable; an application may also decide to be notified when disconnections occur.

Figure 5. An example of failure recovery.

More precisely, let $F$ be the set of vertices that have failed and $D_F$ be the set of downstream vertices from any vertex in $F$. Then the rollback set $R_F$ is the union of $F$ and $D_F$: these vertices will be rolled back to the latest consistent checkpoint. The re-execution set $E_F$ contains all input vertices of $R_F$ that are not in $R_F$. To restart vertices in $R_F$ from the checkpoint, vertices in $E_F$ must replay inputs. If the recorded output for re-execution is lost on a vertex $v$ in $E_F$, vertex $v$ has to be added to the rollback set and the re-execution has to start from an earlier stage. In the worst case, all vertices are in the rollback set and the entire computation restarts from the latest consistent checkpoint. Correctness of recovery follows directly from the global consistency provided by the global checkpointing protocol, as a special case of the Chandy-Lamport snapshot protocol.
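The rollback and re-execution sets described in this excerpt can be sketched as a small graph computation. This is an illustrative reading of the definitions only, not Sonora's implementation: the adjacency-dict representation, the function names `downstream` and `recovery_sets`, the `lost_output` parameter, and the example graph are all assumptions.

```python
from collections import deque

# Hypothetical data-flow graph: each vertex maps to its downstream neighbours.
EXAMPLE_GRAPH = {
    "src": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}


def downstream(graph, sources):
    """Return every vertex reachable from `sources` along outgoing edges."""
    seen = set()
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def recovery_sets(graph, failed, lost_output=frozenset()):
    """Compute (rollback_set, reexecution_set) for a set of failed vertices.

    lost_output -- vertices whose recorded output is unavailable; if such a
    vertex would have to replay, it is moved into the rollback set and the
    sets are recomputed, pushing re-execution to an earlier stage.
    """
    rollback = set(failed)
    while True:
        # Rollback set: failed vertices plus everything downstream of them.
        rollback |= downstream(graph, rollback)
        # Re-execution set: direct inputs of the rollback set outside it.
        reexec = {
            u for u, outs in graph.items()
            if u not in rollback and any(w in rollback for w in outs)
        }
        lost = reexec & set(lost_output)
        if not lost:
            return rollback, reexec
        rollback |= lost


if __name__ == "__main__":
    print(recovery_sets(EXAMPLE_GRAPH, {"a"}))
    print(recovery_sets(EXAMPLE_GRAPH, {"a"}, lost_output={"b"}))
```

In this sketch, when every candidate replay vertex has lost its output, the rollback set grows to cover the whole graph, which corresponds to the worst case of restarting the entire computation from the checkpoint.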

tem catches up with the load. Redirection helps Sonora take advantage of periods of low load for catching up, and enables batching for higher throughput. In case the incoming data rate exceeds the throughput of the storage system, the cloud pushes back onto the mobile devices to adaptively reduce the sensor sampling frequency (see Figure 2 and Section 2.1 for details).

Figure 8. Sonora scalability.

Figure 8 shows that the Sonora implementation of PEIR scales linearly: peak throughput increases proportionally to the number of machines. The speedup efficiency is 0.70 when the number of machines is 40. If each mobile user reported their location once every 5 seconds, Sonora would be able to support over 350,000 users concurrently with just 40 machines. In addition, sync stream filters can reduce the amount of traffic from mobile devices by filtering out insignificant changes in location. This would allow Sonora to support even more users.
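The figures quoted in this excerpt can be related with a few lines of arithmetic. The sketch below assumes the usual definition of speedup efficiency (speedup divided by the number of machines), which the excerpt does not state; the function names are hypothetical.

```python
def aggregate_update_rate(users, report_interval_s):
    """Location updates per second if every user reports once per interval."""
    return users / report_interval_s


def implied_speedup(efficiency, machines):
    """Speedup over one machine, assuming efficiency = speedup / machines."""
    return efficiency * machines


# Values quoted in the excerpt: 350,000 users, one report every 5 seconds,
# 40 machines, speedup efficiency 0.70.
rate = aggregate_update_rate(350_000, 5)
speedup = implied_speedup(0.70, 40)
print(f"aggregate rate: {rate:.0f} updates/s")
print(f"per-machine share: {rate / 40:.0f} updates/s")
print(f"implied speedup over one machine: {speedup:.0f}x")
```

Because the excerpt says "over 350,000 users", the rates computed this way are lower bounds on the measured peak throughput.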

Figure 10. CPU load distribution.

were overloaded because there were no additional idle machines (with CPU load below 30%) to share the load.

Fault tolerance. To evaluate how Sonora recovers from machine failures we turned off dynamic adaptation and ran PEIR on 32 machines, with 9 machines running a physical vertex in the first stage and 23 machines running a physical vertex in the second stage. This assignment is balanced as the second stage is more costly than the first. All machines in the first stage generate data that is consumed by all machines in the second stage. A dispatcher was used to feed input to vertices in the first stage.

Figure 11. Throughput and goodput over time as Sonora checkpoints twice and then recovers from a failure.

Figure 12. Total KB sent over time with a sync stream. A disconnection is masked between ¢) and


Application Fault Tolerance with Armor Middleware

by Zbigniew Kalbarczyk

2005, IEEE Internet Computing

Many current approaches to software-implemented fault tolerance (SIFT) rely on process replication, which is often prohibitively expensive for practical use due to its high performance overhead and cost. The Adaptive Reconfigurable Mobile... more


Characterization of BGP Recovery Time Under Large-Scale Failures

by Ayesh kanta Sahoo

2006, IEEE International Conference …

Border Gateway Protocol (BGP) is the standard routing protocol between various autonomous systems (AS) in the Internet. In the event of a failure, BGP may repeatedly withdraw some routes and advertise new ones until a stable... more


A novel path protection scheme for MPLS networks using multi-path routing

by sahel alouneh

2009, Computer Networks

Multi-protocol label switching (MPLS) is an evolving network technology that is used to provide traffic engineering (TE) and high speed networking. Internet service providers, which support MPLS technology, are increasingly demanded to... more


Dynamic Failure Recovery of Generated Workflows

by Mariusz Momotko

2005, 16th International Workshop on Database and Expert Systems Applications (DEXA'05)

An important research area in the workflow management domain is the adaptation of workflows to unexpected events or failures at runtime. In this paper we present a concept for dynamic and automated workflow re-planning that allows to... more


Progressive retry for software failure recovery in message-passing applications

by Yennun Huang

1997, IEEE Transactions on Computers

In this paper, we describe a method of execution retry for bypassing software faults in message-passing applications. Based on the techniques of checkpointing and message logging, we demonstrate the use of message replaying and message reordering... more


AdaFF: Adaptive Failure-Handling Framework for Composite Web Services

by YUNA KIM

2010, IEICE Transactions on Information and Systems

In this paper, we propose a novel Web service composition framework which dynamically accommodates various failure recovery requirements. In the proposed framework called Adaptive Failure-handling Framework (AdaFF), failure-handling... more


Resilience technologies in Ethernet

by Minh Huynh

2010, Computer Networks

In choosing a network service technology, a subscriber considers many features such as latency, jitter, packet loss, security, and availability. The most important feature, and usually the one that determines the final selection, is the... more