Failure Recovery

583 papers

8 followers

About this topic

Failure recovery refers to the processes and strategies employed to restore a system, application, or organization to operational status after a failure or disruption. It encompasses the identification of failure causes, implementation of corrective actions, and the establishment of protocols to prevent future occurrences, ensuring resilience and continuity.

Key research themes

1. How can recovery-oriented computing methodologies optimize system failure recovery to improve availability and reduce total cost of ownership?

This theme explores methods of designing computing systems that can recover quickly and efficiently from failures by rethinking recovery as a first-class design goal rather than a secondary concern, thereby enhancing system availability, reducing downtime costs, and lowering the total cost of ownership (TCO). The focus is on recovery-oriented computing (ROC) principles that target networked services with metrics such as availability, rapid scale, and change, analyzing failure causes and developing techniques for automatic and effective failure recovery.

Recovery-oriented computing (ROC): Motivation, definition, techniques, and …

by William Tetzlaff and

2016

Key finding: This foundational paper introduces recovery-oriented computing (ROC) which emphasizes making recovery a primary design goal to significantly improve system availability and reduce downtime costs. It demonstrates that operator... Read more

Automatic Recovery from Runtime Failures

by Mauro Pezzè

2015

Key finding: This work presents a technique exploiting intrinsic redundancy in reusable software components to automatically avoid application field failures without requiring system restarts. By generating alternative workarounds... Read more

Automatic Model-Driven Recovery in Distributed Systems

by Kaustubh 'KJ' Joshi

2025, 24th IEEE Symposium on Reliable Distributed Systems (SRDS'05)

Key finding: This study develops a model-driven, Bayesian and Markov decision process based framework enabling automatic system monitoring and recovery in distributed systems under imperfect and conflicting monitoring conditions. It... Read more

A Software-Based Hardware Fault Tolerance Scheme for Multicomputers

by Eli Gafni

2025

Key finding: This paper details a software-driven fault tolerance scheme for large multicomputer systems executing long jobs, where error detection and recovery are mostly handled by software via paired subsystems executing identical... Read more

2. What are the formal models and programming paradigms that enable systematic recovery and self-healing in software systems after failures?

This theme investigates formal approaches and frameworks for implementing recovery and self-healing capabilities in software systems. It includes transactional compensation models enabling undoing committed transactions without cascading aborts, recovery-oriented programming paradigms embedding monitoring and recovery actions for safety and liveness properties, and systems exhibiting self-healing inspired by biological analogies to autonomously detect, diagnose, and repair faults. The goal is to provide theoretical and practical bases for building software resilient to transient and permanent faults.

A formal approach to recovery by compensating transactions

by Eliezer Levy and

2016, The VLDB Journal

Key finding: This paper formulates a transaction model introducing compensating transactions which semantically undo effects of committed or uncommitted transactions affecting others, thereby avoiding cascading aborts. It formalizes... Read more

Recovery Oriented Programming: Runtime Monitoring of Safety and Liveness

by Olga Brukman

2017

Key finding: This research proposes the recovery oriented programming (ROP) paradigm wherein programs integrate monitoring of safety and liveness properties and embed recovery actions upon violation detection. Using a generic... Read more

On conditions for self-healing in distributed software systems

by Naftaly Minsky

2023, 2003 Autonomic Computing Workshop

Key finding: The paper identifies that self-healing in distributed software requires invariant regularities across all system configurations, proposing imposing artificial 'laws' on heterogeneous distributed systems to achieve this. It... Read more

Self-Healing Systems: Application and Methodologies-A Review

by Fidelis Ugwuanyi

2022, International Journal of Research

Key finding: This review systematically categorizes self-healing techniques inspired by biological systems, presenting methodologies such as middleware-based self-adaptive fault tolerance, monitoring frameworks, and hierarchical fault... Read more

Self-Healing Systems: Foundations and Challenges

by Gabi Dreo Rodosek

2016

Key finding: This position paper delineates self-healing as systems autonomously detecting faults and performing recovery steps to restore specified operational modes. It distinguishes self-healing from fault tolerance and related... Read more

3. How can failure recovery be optimized in storage and network systems through algorithmic and architectural techniques to ensure minimum performance degradation during faults?

This theme considers optimizing failure recovery in storage and network infrastructures, focusing on minimizing recovery overhead, ensuring consistency without rollback cascades, and maintaining service continuity under component failures. It covers topics such as I/O optimal recovery schemes for erasure-coded storage minimizing read/write operations needed for reconstruction, failure recovery architectures in cluster computing free from domino effect, and fault-tolerance frameworks in software-defined networking (SDN) and optical transport networks.

In search of I/O-optimal recovery from disk failures

by Osama N Khan

2021

Key finding: This work develops an algorithm to find minimum I/O schedules for recovery from arbitrary numbers of disk failures in XOR-based erasure-coded storage. It introduces a family of codes enabling recovery from up to 11... Read more

Impact: an Unreliable Failure Detector Based on Processes' Relevance and the Confidence Degree in the System

by Anubis Graciela de Moraes Rossetto

2025

Key finding: This paper introduces the Impact Failure Detector that assigns impact factors to processes and outputs a trust level for a set of monitored processes rather than individual binary suspicion. By defining thresholds that... Read more

Domino-Effect Free Crash Recovery for Concurrent Failures in Cluster Federation

by Shahram Rahimi

2025, Lecture Notes in Computer Science

Key finding: The authors propose a recovery approach for multi-cluster federations that handles both inter-cluster orphan and lost messages, ensuring recovery free from the domino effect, thereby minimizing recomputation. By using common... Read more

Fault-Tolerance in the Scope of Software-Defined Networking (SDN)

by Rui Aguiar

2024, IEEE Access

Key finding: This survey details fault tolerance challenges and solutions within SDN architectures, examining detection and recovery mechanisms in data, control, and application planes. It highlights that SDN introduces novel fault... Read more

Disaster resilience of optical networks: State of the art, challenges, and opportunities

by Georgios Ellinas

2025, Optical Switching and Networking

Key finding: This position paper reviews mechanisms enabling optical networks to achieve resilience against disasters including natural events and malicious attacks. It categorizes proactive pre-disaster, preparatory, and reactive... Read more

All papers in Failure Recovery

Consumer responses to service recovery strategies: The moderating role of online versus offline environment

by Kenneth Bernhardt

2006, Journal of Business Research

In this article, we examine consumer reactions to two service recovery strategies: fixing the service failure for a fee and fixing the service failure for no fee and adding compensation. We expect that the more desirable recovery strategy... more

Fig. 2. Effect of severity by recovery. In this study, we examine the effects of an on-/offline medium on customer satisfaction with service failure recovery and postpurchase intentions in two different service contexts. We show that recovery levels have positive effects on satisfaction and intentions in both online and offline settings and, even more interesting, that the on-/offline medium

Service failure recovery: The moderating impact of individual-level cultural value orientation on perceptions of justice

by Kriengsin Prasongsukarn and

2006

It is now well recognized that an effective service recovery program is essential to generating customer satisfaction and loyalty. A number of studies have investigated the impact of service recovery efforts (compensation, speed of... more

Fig. 1. Two way interaction: apology status x power distance.

Fig. 2. Two way interaction: recovery initiate x collectivism/individualism.

Fig. 3. Two way interaction: cognitive control x uncertainty avoidance.

Test of equivalence of scenario between Thailand and Australia

Goodness of fit statistics: χ² (df: 40) = 75.3, p < 0.05, GFI = 0.97, AGFI = 0.95, NFI = 0.97, CFI = 0.98, RMSEA = 0.043.

(R)=reverse scored. Measurement items for perceived justice

Regression analysis: service recovery satisfaction model. A Chow test was used to test for any significant differences in the form (or slope) of the two regression models (Australia and Thailand). Results were not significantly different, thus justifying pooling of the two country data (F 2.37, p > 0.05).

Appendix A. CVSCALE — Test of invariance

P = Power Distance; U = Uncertainty Avoidance; C = Collectivism. Justice dimensions: test of invariance

D=Distributive justice; P=Procedural justice; I=Interactional justice.

An integrated, distributed traffic control strategy for the future internet

by C. Lagoa

2006

Due to the lack of a general theoretical foundation, today's distributed traffic control mechanisms developed at the networking layer, transport layer, and overlay are largely disintegrated. As a result, traffic control protocols... more

Multi-Objective Reinforcement Learning for AUV Thruster Failure Recovery

by Reza Azadeh and

This paper investigates learning approaches for discovering fault-tolerant control policies to overcome thruster failures in Autonomous Underwater Vehicles (AUV). The proposed approach is a model-based direct policy search that learns on... more

A survey of IP and multiprotocol label switching fast reroute schemes

by Alex Raj

2007, Computer Networks

One of the desirable features of any network is its ability to keep services running despite a link or node failure. This ability is usually referred to as network resilience and has become a key demand from service providers. Resilient... more

Fig. 1. Node failure and direct FRR loop.

Fig. 2. Node failure and indirect FRR loop.

Fig. 6. Multihoming link failure. In Fig. 6, external node D is connected to both R3 and R4. The interior gateway protocol used within an autonomous system does not have visibility of this node D. The routers R3 and R4, which are connected to this external node, advertise the reachability of node D to the routers in the [...] In some cases, there may be a network separating this multihoming device, which may make it difficult to detect the node failure. In the case of external

Fig. 8. ECMP and partial loops. during a fast reroute. When the backup path travels through the ECMP path, the backup path computation procedure must ensure that none of the ECMP paths makes any forwarding loops during the fast reroute when the backup path is used. For example, in Fig. 8, if all the links have a cost of 1 unit or the routing metric is the hop count, then R1 has three ECMP paths to D, which are {R1,R4,R5,D}, {R1,R6,R3,D} and {R1,R2,R3,D}. R3 has the link (R3-D) protecting backup path {R3,R2,R1,R4,R5,D}. When the link (R3-D) failure occurs, this backup path {R3,R2,R1,R4,R5,D} is used and traffic is sent up to R1. R1 load balances the traffic on its three ECMP paths {R1,R4,R5,D}, {R1,R6,R3,D} and {R1,R2,R3,D}. Unlike the multicast Reverse Path Forwarding (RPF) [48], in the unicast routing, the incoming interface is not checked for loops. When R1 load balances the FRR backup traffic between the ECMP paths, it creates the forwarding loop {R3,R2,R1,R6,R3,...} and the forwarding loop {R3,R2,R1,R2,R3,...} as shown in Fig. 8. Therefore, each fast reroute mechanism must take special consider-
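The ECMP example in the Fig. 8 excerpt can be reproduced with a short sketch. The link set below is an assumption inferred from the three paths quoted in the excerpt (the figure itself is not available here), and the function names are hypothetical. The sketch enumerates the equal-cost hop-count paths from R1 to D and flags those that still rely on the failed link (R3-D); these are the paths on which load-balanced backup traffic would return to R3.

```python
from collections import deque

# Assumed topology, inferred from the paths quoted in the Fig. 8 excerpt.
LINKS = [("R1", "R2"), ("R2", "R3"), ("R1", "R6"), ("R6", "R3"),
         ("R1", "R4"), ("R4", "R5"), ("R5", "D"), ("R3", "D")]

def build_adjacency(links):
    """Build an undirected adjacency map from a list of links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def hop_distances(adj, target):
    """Breadth-first search: hop count from every node to `target`."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def ecmp_paths(adj, src, dst):
    """Enumerate all minimum-hop paths from `src` to `dst`."""
    dist = hop_distances(adj, dst)
    paths = []

    def walk(node, path):
        if node == dst:
            paths.append(path)
            return
        for nxt in sorted(adj[node]):
            # Follow only neighbours that are one hop closer to dst.
            if dist.get(nxt) == dist[node] - 1:
                walk(nxt, path + [nxt])

    walk(src, [src])
    return paths

def paths_using_link(paths, link):
    """Return the paths that traverse `link` in either direction."""
    a, b = link
    return [p for p in paths
            if any({x, y} == {a, b} for x, y in zip(p, p[1:]))]

adj = build_adjacency(LINKS)
paths = ecmp_paths(adj, "R1", "D")
looping = paths_using_link(paths, ("R3", "D"))
```

With this assumed topology, `paths` holds the three ECMP paths named in the excerpt, and `looping` holds the two that pass through R3. Until R1 learns of the failure, backup traffic hashed onto either of those two paths comes straight back to R3.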

Fig. 7. Multihoming node failure. nodes, even though a device has multiple links, it may not have fast reroute capability, and this could cause more data loss. Additionally, in the following cases, the same prefix could be attached to two or more routers:

Fig. 10. Link-down micro-loop: R1 converged and RS is not converged.

Fig. 12. Link-up micro-loop: R1 converged and R2 is not converged. Fig. 11. Link-up: before convergence.

Composite performance and availability analysis using a hierarchy of stochastic reward nets

by Jogesh Muppala

Computer Performance Evaluation: Modelling …

Figure 1: Architecture of the OLTP System. Online transaction processing (OLTP) systems have become a major application area for computers. An OLTP system is needed in cases where many users require instant access to records in large databases. Examples of such systems include airline reservation systems and automated bank-teller systems. These systems are characterized by high throughput and availability requirements.

Figure 3: Closed Queueing Network Model of the OLTP system. The performance of an OLTP system can be studied using a queueing network model since the system involves contention for resources. A queueing network model for the OLTP system is shown in Figure 3. In this model it is assumed that the TPs have a single queue from which transactions are selected for processing using a scheduling discipline that satisfies product-form assumptions. The TPs are modeled using a multi-server queue with the number of servers equal to the number of TPs. The DBPs are also similarly configured. The service times of the TPs are exponentially distributed with mean 1/μ_TP and the service times of the DBPs are also exponentially distributed with mean 1/μ_DBP. The average time between completion of a transaction and submission of the next transaction at a terminal, which is equivalent to the think time at the terminal, is also exponentially distributed with mean 1/λ. The number of terminals available in the system is assumed to be N. A transaction finishing execution at the TP may visit the DBP with probability p_0 or complete execution and return to the terminals with probability 1 − p_0, respectively. Since this queueing network obeys product-form assumptions, we could use efficient algorithms like mean value analysis to compute the steady-state performance measures such as the average throughput, average queue length at each queue and average response time for a transaction. However, since we are using SRN as our model type, we will construct a SRN model for the queueing network. Figure 4: SRN Model of the Closed Queueing Network.

Table 1 shows the results for the OLTP system for different configurations. For this example we set the mean time to failure of the processors to be 400 hrs. The processor reboot time was assumed to be 1 minute and the system reboot time was set to 2 minutes. In the table, results are given for various values of the probability of successful reconfiguration c. The probability of successful processor reboot is also set to c. We set the probability that the system needs reboot upon reconfiguration failure to be 1 − c.

for repair is greater than the number of DBPs waiting for repair, the repair-person is allocated to the TPs and vice versa. However if equal number of TPs and DBPs are waiting for repair, the TPs are given priority over the DBPs.

Table 3: System Performance with Failures and Repairs. Thus to evaluate a system with N_TP TPs and N_DBP DBPs, we need to obtain the mean response time and the overall system throughput for each configuration (i, j) of the system with i TPs and j DBPs where 1 ≤ i ≤ N_TP and 1 ≤ j ≤ N_DBP. The model given in Figure 4 is evaluated for each of these configurations to obtain the reward rates for the failure-repair model given in Figure 2.

Table 4: Sizes of the Markov Chains for Different Configurations. Figure 6: Percentile Plot of Response Times of the OLTP System. The percentile plots for various configurations (in the figure (t,d) represents the number of TPs and the number of DBPs respectively) are shown in Figure 6. For this example, we assume that λ = 1.0, μ_TP = 50.0, μ_DBP = 20.0 and p_0 = 0.8 respectively. In this example we assume that the number of TPs is equal to the number of DBPs (the model allows the number to be different). We also assume that with every additional TP, the number of terminals is increased

Safe flying for an UAV helicopter

by Primo Zingaretti and

2007

Today small autonomous helicopters offer a low budget platform for aerial applications such as surveillance (both military and civil), land management and earth sciences. In this paper we introduce a prototype of autonomous aerial... more

Service continuity support in self-organizing IMS networks

by Fuchun Lin and

2011, 2011 2nd International Conference on Wireless Communication, Vehicular Technology, Information Theory and Aerospace & Electronic Systems Technology (Wireless VITAE)

With the increasing interest in deploying 4G/LTE networks, IMS has a potential to be deployed in a wide scale in order to support mobile Internet and value-added services over next-generation networks. Moreover, the effort to create an... more

Sonora: A Platform for Continuous Mobile-Cloud Computing

by Fan Yang

This paper presents Sonora, a platform for mobile-cloud computing. Sonora is designed to support the development and execution of continuous mobile-cloud services. To this end, Sonora provides developers with stream-based programming... more

Figure 3. A data-flow view of Sonora. handled transparently if this is desirable; an application may also decide to be notified when disconnections occur.

Figure 5. An example of failure recovery. More precisely, let F be the set of vertices that have failed and D_F be the set of downstream vertices from any vertex in F. Then the rollback set R_F is the union of F and D_F: these vertices will be rolled back to the latest consistent checkpoint. The re-execution set E_F contains all input vertices of R_F that are not in R_F. To restart vertices in R_F from the checkpoint, vertices in E_F must replay inputs. If the recorded output for re-execution is lost on a vertex v in E_F, vertex v has to be added to the rollback set and the re-execution has to start from an earlier stage. In the worst case, all vertices are in the rollback set and the entire computation restarts from that latest consistent checkpoint. Correctness of recovery follows directly from global consistency provided by the global checkpointing protocol, as a special case of the Chandy-Lamport snapshot protocol.
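The rollback and re-execution sets described in this excerpt can be sketched as follows. This is a minimal illustration under stated assumptions, not Sonora's implementation: the data-flow graph is a plain dictionary mapping each vertex to its downstream neighbours, and the names `recovery_sets` and `lost_output` are hypothetical.

```python
from collections import deque

def downstream(graph, sources):
    """All vertices reachable from `sources` along data-flow edges."""
    seen = set()
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def recovery_sets(graph, failed, lost_output=frozenset()):
    """Return (rollback set, re-execution set) for the failed vertices.

    graph: dict mapping vertex -> iterable of downstream vertices.
    lost_output: vertices whose recorded output is no longer available.
    """
    # Invert the graph so the input vertices of each vertex can be looked up.
    inputs = {}
    for v, outs in graph.items():
        for w in outs:
            inputs.setdefault(w, set()).add(v)

    # Rollback set: failed vertices plus everything downstream of them.
    rollback = set(failed) | downstream(graph, failed)
    while True:
        # Re-execution set: inputs of the rollback set that lie outside it.
        reexec = {u for v in rollback for u in inputs.get(v, ())
                  if u not in rollback}
        lost = reexec & set(lost_output)
        if not lost:
            return rollback, reexec
        # A vertex that cannot replay its output joins the rollback set,
        # pushing re-execution back to an earlier stage.
        rollback |= lost

# Hypothetical three-stage flow: a feeds b and d, b feeds c.
graph = {"a": ["b", "d"], "b": ["c"], "c": [], "d": []}
rollback, reexec = recovery_sets(graph, failed={"b"})
rollback_lost, reexec_lost = recovery_sets(graph, failed={"b"}, lost_output={"a"})
```

In the first call only `b` and its downstream vertex `c` roll back, and `a` replays its recorded output. In the second call `a` has lost that output, so it joins the rollback set and nothing upstream is left to replay.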

system catches up with the load. Redirection helps Sonora take advantage of periods of low load for catching up, and enables batching for higher throughput. In case the incoming data rate exceeds the throughput of the storage system, the cloud pushes back onto the mobile devices to adaptively reduce the sensor sampling frequency (see Figure 2 and Section 2.1 for details).

Figure 8. Sonora scalability. Figure 8 shows that the Sonora implementation of PEIR scales linearly: peak throughput increases proportionally to the number of machines. The speedup efficiency is 0.70 when the number of machines is 40. If each mobile user reported their location once every 5 seconds, Sonora would be able to support over 350,000 users concurrently with just 40 machines. In addition, sync stream filters can reduce the amount of traffic from mobile devices by filtering out insignificant changes in location. This would allow Sonora to support even more users.
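The figures quoted in this excerpt can be turned into rates with a few lines of arithmetic. The sketch below assumes the usual definition of speedup efficiency (speedup divided by number of machines), which the excerpt does not state explicitly; the function names are hypothetical.

```python
def update_rate(users, report_interval_s):
    """Location updates per second if every user reports once per interval."""
    return users / report_interval_s

def speedup(efficiency, machines):
    """Speedup implied by an efficiency figure, assuming efficiency = speedup / machines."""
    return efficiency * machines

rate = update_rate(350_000, 5)          # updates per second across the cluster
per_machine = rate / 40                 # updates per second per machine
implied_speedup = speedup(0.70, 40)     # relative to a single machine
```

Under these assumptions, 350,000 users reporting every 5 seconds correspond to 70,000 updates per second in total, or 1,750 per machine, and an efficiency of 0.70 on 40 machines corresponds to a speedup of about 28.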

Figure 10. CPU load distribution. were overloaded because there were no additional idle machines (with CPU load below 30%) to share the load. Fault tolerance. To evaluate how Sonora recovers from machine failures we turned off dynamic adaptation and ran PEIR on 32 machines, with 9 machines running a physical vertex in the first stage and 23 machines running a physical vertex in the second stage. This assignment is balanced as the second stage is more costly than the first. All machines in the first stage generate data that is consumed by all machines in the second stage. A dispatcher was used to feed input to vertices in the first stage.

Figure 11. Throughput and goodput over time as Sonora checkpoints twice and then recovers from a failure.

Figure 12. Total KB sent over time with a sync stream. A disconnection is masked between ¢) and

Online Discovery of AUV Control Policies to Overcome Thruster Failures

by Reza Azadeh

2014

We investigate methods to improve fault-tolerance of Autonomous Underwater Vehicles (AUVs) to increase their reliability and persistent autonomy. We propose a learning-based approach that is able to discover new control policies to... more

Reconsidering wireless systems with multiple radios

by alecs S.

2004, Computer Communication Review

The tremendous popularity of wireless systems in recent years has led to the commoditization of RF transceivers (radios) whose prices have fallen dramatically. The lower cost allows us to consider using two or more radios in the same... more

Resilience technologies in Ethernet

by MINH HIEU Huynh

2010, Computer Networks

In choosing a network service technology, a subscriber considers many features such as latency, jitter, packet loss, security, and availability. The most important feature, and usually the one that determines the final selection, is the... more

A Flexible Framework for Fault Tolerance in the Grid

by Soonwook Hwang

2003, Journal of Grid Computing

This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major... more

Availability Modeling of SIP Protocol on IBM© WebSphere©

by A. Rindos

2008, 2008 14th IEEE Pacific Rim International Symposium on Dependable Computing

We present the availability model of a high availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered. Different types of fault detectors, detection delays,... more

Fault Tolerant Approaches for Distributed Real-time and Embedded Systems

by Richard Schantz

2007, MILCOM 2007 - IEEE Military Communications Conference

Fault tolerance is a crucial design consideration for mission-critical distributed real-time and embedded (DRE) systems, which combine the real-time characteristics of embedded platforms with the dynamic characteristics of distributed... more

Online Discovery of AUV Control Policies to Overcome Thruster Failures

by Petar Kormushev

2014

We investigate methods to improve fault-tolerance of Autonomous Underwater Vehicles (AUVs) to increase their reliability and persistent autonomy. We propose a learning-based approach that is able to discover new control policies to... more

Resilience technologies in Ethernet

by Minh Huynh

2010, Computer Networks

Fig. 1. Recurring Cost of Operation in a 3 year period study. Ethernet can save more than 50% over a 3 year period in a business case study from the MEF.

Fig. 2. Availability vs. recovery time for different frequency of failure. more than all the others in determining the market size for services and the resulting potential revenues [18]. The result of one recent market analysis shows that 50% of subscribers expect at least 99.99% service availability. Fig. 2 shows the recovery time for different failure rates and the resulting availability in terms of the number of 9s [18]. For example, if the recovery time is 100 min and the failure rate is 10 occurrences per year, then the availability is
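The relationship between recovery time, failure rate and availability in this excerpt can be worked through numerically. The sketch below assumes availability is simply one minus the fraction of a 365-day year spent recovering; this is a common simplification and not necessarily the exact model behind the paper's Fig. 2. The function names are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def availability(recovery_minutes, failures_per_year):
    """Fraction of the year the service is up, assuming non-overlapping outages."""
    downtime = recovery_minutes * failures_per_year
    return 1.0 - downtime / MINUTES_PER_YEAR

def count_nines(avail):
    """Number of leading 9s in the availability (e.g. 0.9995 -> 3)."""
    if avail >= 1.0:
        raise ValueError("availability of 1.0 has no finite number of nines")
    # Small tolerance guards against floating-point error at exact boundaries.
    return math.floor(-math.log10(1.0 - avail) + 1e-9)

a = availability(recovery_minutes=100, failures_per_year=10)
nines = count_nines(a)
```

Under this assumption, the excerpt's example of 100 min recovery and 10 failures per year gives 1,000 minutes of downtime, an availability of roughly 99.81%, which is two 9s and short of the 99.99% that the cited market analysis says half of subscribers expect.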

Fig. 3. Example of a linear topology with leaf nodes. The ring topology is popular in network deployment because of its simplicity, deterministic behavior, and built-in

redundancy. Different variants of the ring topology are shown in Figs. 7-9. A ring topology creates a loop in the network that causes a frame to circulate infinitely. The management protocol must ensure that loops are eliminated while still being able to exploit the advantages of the redundant links. In a multiple ring topology, all the rings can be managed by a single management instance, or by different management instances that intercommunicate. Fig. 7. Example of a single ring topology. A mesh topology is a general topology of a network. There are two types: a partial mesh and a full mesh, as shown in Fig. 10. Typically, there is more than one path between any source-destination pair because of the redundant links in the mesh. The path with the best cost is used

Fig. 4. Example of a branch from a linear topology.

Fig. 9. Example of a multi-ring with redundancy between rings.

Fig. 11. Frequency of network related errors in a LAN across the OSI model. Network failures account for more than one third of IT related failures [1]. These failures can occur across all of the seven OSI layers. Fig. 11 shows the distribution of errors in a LAN across the OSI model. Misconfigurations are generally the main cause of failures in the link layer, resulting in corrupted forwarding tables, while link failures and node failures are the main causes in the physical layer. A link failure occurs when a cable is damaged or when errors occur at the network interface. Usually this type of failure is localized and can be fixed quickly via the backup

Fig. 13. Time division for IRT communication [7]. Most protocols in this category recover in less than the recommended recovery time of 1-3 s, except for STP. In other words, for applications other than the interactive voice message, all other services will operate without interruption during a failure. The few applications that recover in less than 1 s can satisfy without interruption. As STP was designed well before the emergence of the mod- Running on a specialized ASIC, Isochronous Real-Time (IRT) is defined to have cycle times in the range of 150 µs to 1 ms and 1 µs jitter with the synchronization of all nodes. However, the fastest time supported by commercial equipment starts from 500 µs [7]. IRT is deployed on tree or line topologies where it can support a maximum of 25 devices per line. To achieve deterministic behavior and low cycle time, IRT schedules real-time data at regular intervals and inserts best-effort traffic in between, as shown in

Fig. 12. Communication cycle time and their jitter [7].

Table 7 Fig. 14. STP's process of selecting the root node and blocking redundant links to create a loop-free topology.

Fig. 15. An example of a MSTP configuration.

Fig. 17. A multi-ring topology in MRP-Foundry.

Fig. 19. Virtual concatenation (VC). Varadarajan et al. proposed Ethereal [20], a connection oriented architecture, to support assured service and best-effort service at the Ethernet layer. Ethereal uses the Propagation Order Spanning Tree for fast reconvergence once a failure has been detected. Utilizing periodic hello messages to immediate neighbors, a switch can detect a failure if there are missing consecutive hello messages. Once a fault has been detected, all best-effort traffic is discarded. The established QoS-assured flows are maintained unless part of the path is affected by the fault. The best-effort flows behave consistently with the STP protocol, while requests to reserve paths with the required QoS parameters are required for QoS-assured traffic. The Ethereal design is aimed directly at real-time multimedia traffic via hop-by-hop reservation. Similar to MPLS, each switch makes a request to its immediate downstream hop for the flow reservation, whereupon the penultimate node sends a reply indicating whether the reservation was successful. The scalability of Ethereal is limited as only 65536 connections can be supported. To protect Ethernet over SONET with a low overhead, Acharya et al. proposed PESO [22]. Traditional SONET uses a 1+1 protection, but this can be considered excessive since data traffic can tolerate failure and operate at a reduced rate. Depending on the protection requirements, PESO will compute an optimum routing path that uses virtual concatenation (VC), as shown in Fig. 19, and Link Capacity Adjustment Scheme (LCAS) to make the necessary recovery. For the scenario where a single failure should not affect more than x% of the bandwidth, PESO transforms the link capacity in the topology to the equivalent of y lines. Each chosen line out of y cannot carry more than x% protected bandwidth. PESO determines the number of members in the VC. Using a path augmentation maximum flow algorithm, such as Ford and Fulkerson [23] or Edmonds and Karp [24], PESO determines the routes that the virtual concatenation group (VCG) will take. Upon failure, LCAS removes the failed member resulting in a continuous connection with the destination but the throughput
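The excerpt names Ford-Fulkerson and Edmonds-Karp as the path-augmentation maximum flow algorithms PESO can use. The following is a generic Edmonds-Karp sketch (breadth-first augmenting paths), not PESO's routing code: the capacity map, node names and the idea of measuring capacity in "lines" are assumptions made for illustration.

```python
from collections import deque

def edmonds_karp(capacity, source, sink):
    """Maximum flow from `source` to `sink`.

    capacity: dict of dicts, capacity[u][v] = capacity of the directed link u -> v.
    """
    # Residual graph: forward capacities plus zero-capacity reverse edges.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)
    residual.setdefault(source, {})
    residual.setdefault(sink, {})

    flow = 0
    while True:
        # Breadth-first search for the shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical topology with capacities expressed in lines.
capacity = {"A": {"B": 3, "C": 2}, "B": {"D": 2}, "C": {"D": 3}, "D": {}}
max_lines = edmonds_karp(capacity, "A", "D")
```

In this toy topology two lines can be routed via B and two via C, so four lines in total can reach D. Spreading the members of a group over such disjoint routes is what limits the share of bandwidth a single failure can remove.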

The maximum recovery time from a fault for CRP is:

on the active interface, the end-nodes switch to the alternate interface. t_path = 6 × t_d = 0.174 s. There are different kinds of faults in a BRP network. Firstly, if the leaf link faults are detectable in the end node physical layer, the recovery time is less than 10 µs [27]. Secondly, if the faults occur in the direction of flow of beacon messages and are detectable in the node/switch physical layer, then the recovery time is less than 1 ms (two beacon timeouts) [27]. Lastly, if the faults occur in the opposite direction to the flow of beacon messages, but are not detectable in the node/switch physical layer, the recovery time is the worst case:

A summary of the requirements and recommendations for each type of application. Table 1

Requirements for end-user applications defined in the ITU-T Recommendation G.114 [3].

Performance requirements for multimedia applications defined in the ITU-T Recommendation G.114.

The typical grace time in an Industrial Network from IEC 62439. Table 4

Performance requirement for PROFINET and SERCOS III (*in reality, 960 kbit/s and 108 kbit/s is the maximum instead of 3.2 Mbit/s and 408 kbit/s, respectively). Table 5

Table 6. Comparison charts for resilient protocols operating in end-user environment.

ern applications its recovery time was acceptable, but it is now obsolete. The ring topology boasts the protocols with the fastest recovery time. Since the behavior on a ring is more predictable, it is easier to optimize the management protocol than with mesh networks. However, the recovery time of protocols managing ring networks with a central redundancy manager is directly proportional to the size of the ring. As the ring grows, the failover time also grows, making it difficult to sustain a failover time below 1 s. Tables 6 and 7 summarize the protocols that are suitable for applications in this class of network performance.

its topology. STP is standardized in IEEE 802.1D [12] to forward layer 2 frames. Using the shortest path to the central root, STP forms a tree that is overlaid on top of a mesh Ethernet network, as shown in Fig. 14. Unlike IP packets, Ethernet frames do not have a time-to-live field. Therefore, the spanning tree blocks redundant links in the topology to avoid a broadcast storm that can bring down the network. The drawback of this approach is that the links around the root become heavily congested, which puts them at risk of failure and leaves loads unbalanced. Upon a failure, STP takes 30-60 s to recover.
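The idea of overlaying a loop-free tree on a mesh and blocking the remaining links can be sketched as follows. This is a simplified illustration, not the IEEE 802.1D algorithm: real STP elects the root and chooses ports by exchanging BPDUs with bridge IDs and path costs, whereas this sketch assumes the root is given and uses a plain hop-count breadth-first search. The function name `spanning_tree` and the switch names S1 to S4 are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Split undirected links into (forwarding, blocked) sets using a BFS tree."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(adjacency.get(node, [])):  # deterministic tie-break
            if peer not in parent:
                parent[peer] = node
                queue.append(peer)

    # Links on the tree forward frames; every other link is blocked.
    forwarding = {frozenset((child, up)) for child, up in parent.items() if up is not None}
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Hypothetical four-switch mesh with two redundant links.
mesh = [("S1", "S2"), ("S1", "S3"), ("S2", "S3"), ("S2", "S4"), ("S3", "S4")]
forwarding, blocked = spanning_tree(mesh, root="S1")

# After the S2-S4 link fails, recomputing the tree unblocks S3-S4.
surviving = [link for link in mesh if set(link) != {"S2", "S4"}]
forwarding_after, blocked_after = spanning_tree(surviving, root="S1")
```

In the sample mesh the links S2-S3 and S3-S4 end up blocked; when S2-S4 fails, the recomputed tree brings S3-S4 into service. The sketch recomputes instantly, so it says nothing about the 30-60 s reconvergence time quoted above, which comes from protocol timers.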

Comparison charts for resilient protocols operating in end-user environment (continued).

Table 8. Comparison charts for resilient protocols operating in metro area network and multimedia environment.

Table 9. Comparison charts for resilient protocols operating in metro area network and multimedia environment (continued).

Need customized tuning for each individual network.

Table 10. Comparison charts for resilient protocols operating in Industrial Ethernet Networks.

Table 11. Comparison charts for resilient protocols operating in Industrial Ethernet Networks (continued).

Table 12. Parameters for a recovery example.

Safe flying for an UAV helicopter

by Andrea Monteriù and

2007

The Impact of Service Fairness Perceptions on Relationship Quality

by Mavis T. Adjei

2009, Services Marketing Quarterly

A Visual Global Positioning System for Unmanned Aerial Vehicles Used in Photogrammetric Applications

by Emanuele Frontoni and

2011, Journal of Intelligent & …

The combination of photogrammetric aerial and terrestrial recording methods can provide new opportunities for photogrammetric applications. A UAV (Unmanned Aerial Vehicle), in our case a helicopter system, can cover both the aerial and...

Safe flying for an UAV helicopter

by Primo Zingaretti

2007

A self-managing fault management mechanism for wireless sensor networks

by International Journal of Wireless & Mobile Networks (IJWMN) - ERA, WJCI Indexed

A sensor network can be described as a collection of sensor nodes which coordinate with each other to perform some specific function. These sensor nodes are mainly in large numbers and are densely deployed either inside the phenomenon or...

Rados

by Scott Brandt

2007, Proceedings of the 2nd international workshop on Petascale data storage held in conjunction with Supercomputing '07 - PDSW '07

Brick and object-based storage architectures have emerged as a means of improving the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, despite their ability to exhibit...

The Impact of Service Operations Failures on Customer Satisfaction: Evidence on How Failures and Their Source Affect What Matters to Customers

by Shannon Anderson

2009, Manufacturing & Service Operations Management

Key service elements combine to create the service concept and its value proposition for customers. During service operations failures, employee interactions with customers are a critical service element in restoring customer satisfaction. However, research in consumer psychology shows that customers seek reasons for service failures and their attributions of blame moderate the effects of the failure on the level of customer satisfaction. This paper extends research on services operations failures by hypothesizing that attributions of blame also affect what matters to the customer during service failures. Specifically, we hypothesize that the relative weights that customers assign to the key elements of the service in reaching an overall assessment of customer satisfaction are affected by customer attributions of blame for service failures. We use the U.S. airline industry as a quasi-experimental research setting to investigate the components of customer satisfaction for three samples of customers who experience: 1) routine service, 2) flight delays of external (i.e., weather) origin, and 3) flight delays of internal origin. Although the level of customer satisfaction is lower for all service failures, we find that the key components of satisfaction differ between delayed and routine flights only when customers blame the service provider for the failure. Specifically, when delays are of external origin, satisfaction is lower than for routine flights, but there is virtually no difference in the weight that customers assign to the components of customer satisfaction (including employee interactions). In contrast, when delays are of internal origin, satisfaction is lower than for either routine flights or flights delayed by external factors and employee interactions have a significantly diminished role in customer satisfaction evaluations.
Contrary to the popular view that employee interactions take on a greater role in determining customer satisfaction during service failures, we find that the opposite is true if the customer attributes blame to the service provider. The results highlight the important role of customer attributions during service failures and present more nuanced evidence on the role of employee-customer interactions in mitigating the effects of service failures on customer satisfaction. Data Availability: Data for replicating the results of this study are available online at: [insert web site address]. Included in an online appendix and as electronic data files are: the LISREL program code for the basic model and data files containing the variable means, standard deviations and covariance matrices for each of the three treatment groups. In addition to replicating the results of the study, the reader may explore any model that is nested within our model by making changes to the original program code to reflect constraints of a nested model. Confidentiality and nondisclosure agreements with the data provider preclude us from redistributing the raw survey data or reporting results that may be used to identify the customer satisfaction performance of any air carrier.

* A passenger may fit within more than one of these categories.

This table summarizes the steps taken to construct the three samples of delayed (weather and other sources) and on-time flights.

This table reports the structural model coefficients from maximum likelihood estimation of the structural equation model relating satisfaction with the service elements to overall customer satisfaction. In this model, after suitable tests (see text), the measurement model is constrained to be identical and the structural coefficients are allowed to vary for three groups: 1) travelers that experience weather delays, 2) other sources of delays and 3) no delays. The cells contain the unstandardized coefficient, the t-statistic (in parentheses), the within-group completely standardized coefficient, and the completely standardized common metric coefficient. The within-group completely standardized coefficients permit comparisons within a group of the relative impact of each attribute of the service concept. The completely standardized common metric coefficients permit comparisons of the impact of an attribute across groups.

This table presents summary statistics from 21 tests of whether the intercept or individual coefficients relating a service element to overall customer satisfaction differ between groups and from 3 joint tests of whether all coefficients relating service elements to customer satisfaction differ significantly between groups. Tests are conducted for each pair of groups by comparing the model of Table 3 with the nested model that constrains the coefficient(s) of interest to be identical for two groups. The reported p-values from chi-squared tests indicate whether allowing the coefficients to differ between groups yields a better fitting model (bold font indicates values less than 0.10).

Spectrum Allocation Algorithms for Cognitive Radio Mesh Networks

by Hisham Almasaeid

2011

Figure 1.1: A general architecture of a wireless mesh network. A cognitive radio mesh network is a WMN that deploys cognitive radios (for both routers

sends back the ACK, and then polls the next MC. This operation is summarized in Figure 3.3.

Figure 3.2: The format of the modified Sync-MAC frame

Figure 5.14: Average end-to-end delay for multiple sessions with 10 MHz spacing.

Figure 5.15: The network used for the experiments in Section 5.8.2

Figure 5.17: The average cost of the selected route reported at the gateway.

Figure 5.16: The average number of transmitted ACP packets during the forward phase in the entire network.

Figure 5.18: The actual cost of the selected route reported at the gateway.

Figure 5.19: The average number of ACP packets received by the gateway for the same route.

Figure B.2: A receiver-based channel allocation that can make the CMR i and its auxiliary se

Figure B.1: Mapping a maximum K-SAT problem into an UDCP problem.

Figure 5.13: Average end-to-end delay for multiple sessions with 7 MHz spacing.

Figure 3.3: The operation of the MAC-RBA mechanism.

(2) An MR sends a Cell Communication Period (CCP) notification to all adjacent MRs.
(3) The MR now has two periods, one for intra-cell and another for inter-cell communication.
(4) MCs send their traffic requirements to the MR during their dedicated control slots using "Transmission Request (TREQ)" packets. The MR uses polling to achieve fairness between MCs.

The performance of this MAC protocol depends on how the parameters (including the mini-

Figure 3.4: The aggregate throughput under different lengths of the intra-cell communication period, and fixed packet inter-arrival time.

Figure 3.5: The average delay under different lengths of the intra-cell communication period, and

Figure 3.7: The average delay under different lengths of the intra-cell communication period, and

Figure 3.6: The aggregate throughput under different lengths of the intra-cell communication period and exponentially distributed packet inter-arrival time.

Figure 3.8: The performance of the three allocation strategies without a preassumed CCC. |B| = 3,

Figure 3.9: The performance of the three allocation strategies with a preassumed CCC. |B| = 3.

Figure 3.10: The performance of the three allocation strategies without a preassumed CCC. |B| = 8,

Figure 3.11: The performance of the three allocation strategies with a preassumed CCC. |B| = 8

Figure 3.12: The performance of the HRBA algorithm compared to the optimal solution. |B| = 8,

Figure 3.13: The performance of the HRBA algorithm compared to the optimal solution. |B| = 15,

Figure 4.1: An example that illustrates the heterogeneity property of CRNs. to its effect on the multicast throughput, in which we are interested in this chapter, the het-

Figure 4.2: An example that shows the benefit of using assisted multicast in reducing the total multicast period. the second stage.

Figure 4.3: Interaction between different queues.

Figure 4.5: The queue dynamics of the case-study summarized in Figure 4.4. 4.8 Performance Evaluation

Figure 4.6: The gain of using intra-group assistance in a single multicast group (Pa = 0.25)

total multicast period, i.e., overall throughput. A proactive collision resolution procedure was also proposed to allow collision-free schedules across multiple cells in a CR-WMN.

Figure 4.7: Average gain of assisted multicast using different levels of assistance (M = 3, Pa = 0.25)

Figure 4.8: Average gain of assisted multicast using different levels of assistance (M = 4, Pa = 0.25)

Figure 4.9: Average gain of assisted multicast using different levels of assistance (M = 5, Pa = 0.25)

Figure 4.10: Average multicast period with- and without-assistance (M=3, 4,5, Pa=0.25)

Figure 4.11: The effect of channel availability on the gain of assisted multicast (M=1).

Figure 4.12: The effect of channel availability on the gain of assisted multicast (M=5).

Figure 4.13: Proactive versus reactive collision resolution (Pa = 0.25).

Figure 5.1: An example that illustrates the effect of channel assignment on the throughput and end-to-end delay of multicast traffic.

Figure 5.2: A toy example of a path of five MRs from n1 to n5; the lists under the nodes represent the available channels at each node.

5.4 Optimal Channel Assignment on a Route

Stages (i.e., nodes along a path) Figure 5.3: The dynamic program formulation of the path in Figure 5.2

Figure 5.4: An illustrative example of the forward and backward phases of the route establishment process.

(2) The new member starts the forward phase, affecting the links in red.

Figure 5.5: An example to illustrate how channel availability can affect the hop distance between SUs.

5.5.1 Finding the minimum hop distance (level) of MRs

To find the shortest hop-count distance, i.e., level, from the gateway to every other MR in

Figure 5.6: An example to explain the lifetime of a Setk on a given channel. The curve shows the PU usage of the channel across time slots (1 means the PU is using the channel).

link means longer periods between reroutings caused by failures of this link. Please note

!-7]. Otherwise, it informs the downstream node to recover the route [line 7]. This answers

Figure 5.9: Average end-to-end delay for a single session with 4 MHz spacing.

“any-available-channel” allocation and shortest path with “closest-available-channel” allocation.

Figure 5.10: Average end-to-end delay for a single session with 7 MHz spacing.

Figure 5.11: Average end-to-end delay for a single session with 10 MHz spacing.

Figure 5.12: Average end-to-end delay for multiple sessions with 4 MHz spacing.

Table 2.1: A summary of the characteristics of different spectrum sensing paradigms mission of a PU and the case of accumulated noise that exceeds the detection threshold.

Table 3.1: Differences between the RBA, TBA, and ATA channel allocation strategies

3.4.2 Phase 2: Finding the maximum number of reliable uplinks

in saturation throughput by ~ 50%, and ~ 200% decrease in average delay Table 3.2: The simulation parameters of the MAC experiment

Table 4.1: Enhancing throughput by introducing different assistance mechanisms 1.3. Motivation and Problem Definition Before we formally define the assisted multicast problem, we would like to present an example that illustrates the motivation behind this work, and then give some definitions.

Table 4.2: Summary of Notations Definition 4.4.1. Unassisted Multicast scheduling for a single group (UMS-Single): for a cel

Figure 4.4: A case study to illustrate the recovery processes. The figure to the left shows the network topology and channel availability, while the table to the right shows the calculated schedule.

Table 5.1: All possible channel assignments and their end-to-end delays .4.1 Dynamic programming approach for channel assignment

Table 5.2: A Dynamic program for optimal channel allocation along a route R 5.5 Multicast Routing: Challenges and Solutions In this section, we use the dynamic program developed in Section 5.4 for the case of a single

Algorithm 7: Level Evaluation and Reconfiguration Therefore,

lifetime, then this algorithm will find the channel allocation along the route which results in

Table 5.3: The mean times of the ON/OFF periods for all experiments 5.8.2 Route recovery In this subsection, we evaluate the performance of the route recovery algorithm. We sim-

Neuropragmatics: Neuropsychological Constraints on Formal Theories of Dialogue

by Maurizio Tirassa and

1997, Brain and Language

We are interested in the validation of a cognitive theory of human communication, grounded in a speech acts perspective. The theory we refer to is outlined, and a number of predictions are drawn from it. We report a series of protocols...

Recovery From Software Failures Caused by Mandelbugs

by Michael Grottke and

Software failures are still a major concern in mission- and enterprise-critical contexts, despite significant efforts spent in software testing. In fact, while software testing is effective against easily-reproducible bugs (Bohrbugs), it...

Fig. 1. Flowchart of Recovery after a Failure.

Fig. 2. Time to Recovery Distribution of a Failure.

NINE CASES OBTAINED FROM THE THREE COMPLEXITY SCENARIOS AND THE THREE DIAGNOSIS SCENARIOS

Fig. 3. MTTR and SSUA for the nine cases considered in this study. (a) Mean time to recovery (MTTR). (b) Steady-state unavailability (SSUA) as a function of MTTF.

TABLE V

Fig. 4. MTTR of manual diagnosis compared to escalated recovery (in high-complexity software), for different values of the mean time to problem diagnosis.

Fig. 5. MTTR of automated diagnosis compared to escalated recovery (in high-complexity software), for different values of the probability of correct diagnosis.

EXPLANATIONS OF INPUT PARAMETERS AND THEIR VALUES FOR THREE DIAGNOSIS SCENARIOS

TABLE IV

B. Results

Observation 1: Manual failure diagnosis is less effective than escalated recovery. In all the considered systems (high, medium, and low complexity), escalated recovery has a lower mean time to recovery than manual diagnosis. This difference indicates that the gain of automatically recovering from Mandelbugs through restarts, reboots, and reconfigurations is higher than the penalty due to useless restarts, reboots, and reconfigurations attempted in the presence of Bohrbugs. Even if Bohrbugs are the majority of bugs (~60% or more in the considered scenarios), the automated recovery actions can avoid, at least in some cases, a full problem diagnosis and bug fix, thus saving significant effort while increasing the availability of the IT system. As shown in Fig. 4, the main factor impacting the manual diagnosis is the time required to perform the manual problem determination. To be as effective as escalated recovery, the manual problem determination should be performed in less than 5 minutes. However, in the experience of the authors, more than 5 minutes is often required to determine a problem, and the quickest solution is to attempt an escalated recovery and to mask the failure; only if the problem persists is an in-depth investigation of its root cause necessary.

We compute the MTTR using the input parameters as shown in Tables II-IV. Moreover, we compute steady-state unavailability (SSUA) using [49]

MEAN TIME TO RECOVERY FOR THE NINE CASES

ELASTICITY OF MTTR WITH RESPECT TO MODEL PARAMETERS (AND THEIR RANKINGS IN BRACKETS) (A) HIGH-COMPLEXITY SOFTWARE SYSTEMS (B) MEDIUM-COMPLEXITY SOFTWARE SYSTEMS (C) LOW-COMPLEXITY SOFTWARE SYSTEMS

To get more insights into the recovery process, we performed a parametric sensitivity analysis of the model. Sensitivity analysis is a method to determine the factors that are most influential on model results. It can be used to find recovery or availability bottlenecks in the system, and thus to guide improvement and optimization. Parametric sensitivity analysis is performed by computing the elasticity of each parameter. Here, elasticity represents the percentage change in MTTR (or SSUA) that results from a percentage change in the respective parameter. This measure provides a uniform way to compare the impact of different parameters with different measurement units. The elasticity of MTTR with respect to a parameter p can be computed from the partial derivative with respect to that parameter:

complexity systems is higher because most of the failures are caused by Bohrbugs; thus, fewer failures can be masked by automated recovery actions (restart, reboot, reconfiguration), while a higher proportion of them require bug fixes, which tend to take much longer. However, (un)availability is a function of both MTTR and MTTF. As can be seen from Fig. 3(b), low-complexity systems can have a higher availability than high-complexity ones if their MTTF is high enough. For instance, consider a high-complexity system with escalated recovery and an MTTF of 1000 hours; from Fig. 3(b), it can be seen that a low-complexity system with escalated recovery attains a lower SSUA than this high-complexity system if its MTTF exceeds 1545 hours.
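The two quantities discussed here can be sketched numerically. The extract does not reproduce the paper's formulas, so this sketch assumes the common definitions: steady-state unavailability SSUA = MTTR / (MTTF + MTTR), and elasticity of MTTR with respect to p equal to (∂MTTR/∂p) · (p / MTTR), estimated by a central finite difference. The MTTR model `toy_mttr`, its parameters, and all numeric values are hypothetical and are not the paper's recovery model.

```python
def ssua(mttf, mttr):
    """Steady-state unavailability under the assumed two-state definition."""
    return mttr / (mttf + mttr)


def elasticity(model, params, name, rel_step=1e-6):
    """Estimate (d model / d p) * (p / model) with a central difference."""
    base = model(**params)
    h = params[name] * rel_step
    up = dict(params, **{name: params[name] + h})
    down = dict(params, **{name: params[name] - h})
    derivative = (model(**up) - model(**down)) / (2 * h)
    return derivative * params[name] / base


def toy_mttr(t_restart, t_fix, p_restart_works):
    """Hypothetical model: always try a restart, fix the bug if that fails."""
    return t_restart + (1.0 - p_restart_works) * t_fix


def break_even_mttf(mttf_ref, mttr_ref, mttr_other):
    """MTTF at which the other system matches the reference system's SSUA."""
    return mttf_ref * mttr_other / mttr_ref


# Hypothetical parameter values, in hours and probabilities.
params = {"t_restart": 0.1, "t_fix": 10.0, "p_restart_works": 0.4}
ranking = sorted(params, key=lambda n: abs(elasticity(toy_mttr, params, n)), reverse=True)
```

With these made-up values the bug-fix time dominates (elasticity close to 0.98), followed by the probability that a restart works (about -0.66) and the restart time (about 0.02). Under the assumed SSUA formula, `break_even_mttf` also shows why a system with a longer MTTR needs a proportionally longer MTTF to reach the same unavailability, which is the kind of comparison made in the 1000-hour versus 1545-hour example above.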

End-to-End Optimal Algorithms for Integrated QoS, Traffic Engineering, and Failure Recovery

by Constantino Lagoa

2000, IEEE/ACM Transactions on Networking

This paper addresses the problem of optimal Quality of Service (QoS), Traffic Engineering (TE) and Failure Recovery (FR) in Computer Networks by introducing novel algorithms that only use source inferrable information. More precisely,...

A modular correctness proof of IEEE 802.11i and TLS

by Anupam Datta

2005

The IEEE 802.11i wireless networking protocol provides mutual authentication between a network access point and user devices prior to user connectivity. The protocol consists of several parts, including an 802.1X authentication phase...

Requirements of a Recovery Solution for Failure of Composite Web Services

by Hadi Saboohi

2012

Web services are building blocks of interoperable systems. Composing Web services makes the processes capable of doing complex tasks. Composite services may fail during their execution which can be diagnosed by a mediator. The mediator...

Distributed Restoration Method for Metro Ethernet

by Maher Ali

2006, International Conference on Networking International Conference on Systems and International Conference on Mobile Communications and Learning Technologies

Deploying Ethernet in the metro domain will require many different upgrades including end-to-end QoS guarantees, protection mechanisms and service performance monitoring. In this paper we propose a distributed method to address network...

A Visual Global Positioning System for Unmanned Aerial Vehicles Used in Photogrammetric Applications

by Primo Zingaretti and

2011, Journal of Intelligent & Robotic Systems

Composite performance and availability analysis of wireless communication networks

by Kishor S Trivedi

2001, IEEE Transactions on Vehicular Technology

Traditional pure performance model that ignores failure and recovery but considers resource contention generally overestimates the system's ability to perform a certain job. On the other hand, pure availability analysis tends to be too...

Resilience technologies in Ethernet

by Minh Thư Huỳnh

2010, Computer Networks

Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems

by Mukesh Singhal

1994, IEEE Transactions on Parallel and Distributed Systems

A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available...

Global memory management for a multi computer system

by Alberto MUNOZ

2000

In this paper, we discuss the design and implementation of fault-aware Global Memory Management (GMM) for a multi-kernel architecture. Scalability of today's systems is limited by SMP hardware, as well as by the underlying commodity...

A campaign in autonomous mine mapping

by Dave Ferguson and

2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004

Unknown, unexplored and abandoned subterranean voids threaten mining operations, surface developments and the environment. Hazards within these spaces preclude human access to create and verify extensive maps or to characterize and...

TABLE II. SUMMARY OF FIELD DEPLOYMENTS OF GROUNDHOG INTO THE MATHIES MINE DURING MAY AND OCTOBER 2003.

Fig. 1. Groundhog: a rugged platform designed to traverse the rough, unpredictable terrain of mine corridors, able to overcome obstacles such as fallen roof timbers, partial sidewall or roof collapses, rail tracks and deep mud. Shown here at the north portal to the Mathies mine.

Fig. 2. Groundhog Layout: (a) Laser Range Finders (b) Gas Sensors (c) Low-Light Camera (d) Sinkage Sensors (e) Wireless Ethernet (f) Batteries (g) Main Electronics Enclosure (CPU, Tilt, Gyro, Control Circuitry)

Fig. 3. From Mine Corridor to Cost Map: (a) An image, deep in the mine, taken by Groundhog’s low-light camera. (b) A 3D point cloud obtained by the laser scanner in a similar corridor. (c) The corresponding traversability map where brighter spots are easier to traverse.

Fig. 4. Cost Map to Paths: White areas have higher cost; paths are in green. Top: An unbiased path through the center of the corridor. Middle: A path biased to the right side of the corridor. Bottom: Example convolution filters.

If the robot is unable to find a path to any of the goals generated, it concludes that the corridor is unnavigable and begins exiting the mine.

Fig. 5. Groundhog’s State Transition Graph. Nodes with multiple outgoing edges branch depending on the type and severity of the problem.

MSHA nominally requires strict and rigorous safety testing before permitting any device to operate in a mine.

Portal 1 with the goal of autonomously traversing the entire mine from end to end. Instead, Groundhog encountered a fallen roof timber (see Fig. 7) 308 meters into the mine and decided to turn back. Subsequent system failures stranded the robot approximately 160 meters from the portal, and on-site inspectors received permission to suit up and walk into the mine to recover the robot. The lessons learned from that first deployment led to the development of the fault-tolerance paradigm described above.

Fig. 7. Fallen roof timber 308 meters inside portal 1 (Photograph Courtesy PA-DEP)

Fig. 8. Results from the Mathies Mine: The 2D maps are approximately scaled and aligned to match the orientation in Fig. 6. The 3D scans are, from left to right, the roof-fall encountered 140 meters into portal 2, the fallen timber encountered 308 meters into portal 1, and the fork in the corridor encountered 200 meters into portal 3.

An Adaptive Approach for Dynamic Recovery Decisions in Web Service Composition Using Space Based QOS Factor

by ijwsc journal

Service Oriented Architecture facilitates automatic execution and composition of web services in distributed environment. This service composition in the heterogeneous environment may suffer from various kinds of service failures. These...

A NOVEL SCHEME FOR RELIABLE MULTIPATH ROUTING THROUGH NODE-INDEPENDENT DIRECTED ACYCLIC GRAPHS

by Editor IJRET

Multipath routing is essential in the wake of voice over IP, multimedia streaming for efficient data transmission. The growing usage of such network requirements also demands fast recovery from network failures. Multipath routing is one...

A framework for robust HLA-based distributed simulations

by Wentong Cai

2006

The High Level Architecture (HLA) is a standard for the interoperability and reuse of simulation components, referred to as federates. Large scale HLA-compliant simulations are built to study complex problems, and they often involve a...

Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems

by Ravi Prakash

1994, IEEE Transactions on Parallel and Distributed Systems

Safe flying for an UAV helicopter

by Emanuele Frontoni and

2007

Learning failure recovery knowledge for mechanical assembly

by Luis M Camarinha-Matos

1996, Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS '96

A framework for planning and supervision of robotized assembly tasks is initially presented, with emphasis on failure recovery. The approach to the integration of services and the modeling of tasks, resources and environment is briefly...

Dynamic Failure Recovery of Generated Workflows

by Mariusz Momotko

2005, 16th International Workshop on Database and Expert Systems Applications (DEXA'05)

An important research area in the workflow management domain is the adaptation of workflows to unexpected events or failures at runtime. In this paper we present a concept for dynamic and automated workflow re-planning that allows to...

Sprite position statement: Use distributed state for failure recovery

by Fred Douglis

1989

"Stateless" servers have been popularized by NFS [Sandberg85]. The benefit of a stateless server is that the server can crash and reboot and no special recovery action is required. Clients simply retry their operations until they get a...

Computing the Number of Calls Dropped Due to Failures

by Kishor S Trivedi

2010, 2010 IEEE 21st International Symposium on Software Reliability Engineering

Defects per million (DPM), defined as the number of calls out of a million dropped due to failures, is an important service (un)reliability measure for telecommunication systems. Most previous research derives the DPM from steady-state...

The call flow for B2BUA is shown in Figure 2. The UAC first sends an INVITE message to the UAS through the B2BUA proxy. The UAS replies to the UAC with a RINGING message, then pauses 15 seconds before sending an OK message to the UAC indicating that the phone has been picked up. The UAC replies to the UAS with an ACK message, and the call session has now been set up. The UAC pauses for 45 seconds, simulating a 45-second phone conversation, and then sends an INFO message to the UAS. The UAS sends back an OK message after receiving the INFO message from the UAC. The UAC pauses for another 60 seconds, then sends a BYE message to terminate the call session. The UAS replies with an OK message and the session is terminated.
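The message sequence described above can be laid out as a small timeline. This is only an illustrative sketch: the names `CALL_FLOW` and `timeline` are hypothetical, and network and processing delays are assumed to be zero, so the only elapsed time comes from the three stated pauses.

```python
# Each entry: (sender, receiver, message, pause in seconds before sending).
CALL_FLOW = [
    ("UAC", "UAS", "INVITE", 0),
    ("UAS", "UAC", "RINGING", 0),
    ("UAS", "UAC", "OK", 15),   # phone picked up after 15 s
    ("UAC", "UAS", "ACK", 0),   # session established
    ("UAC", "UAS", "INFO", 45),  # sent after 45 s of conversation
    ("UAS", "UAC", "OK", 0),
    ("UAC", "UAS", "BYE", 60),  # sent after another 60 s
    ("UAS", "UAC", "OK", 0),    # session terminated
]


def timeline(flow):
    """Return (send_time, sender, receiver, message) tuples, ignoring latency."""
    clock = 0
    events = []
    for sender, receiver, message, pause in flow:
        clock += pause
        events.append((clock, sender, receiver, message))
    return events


events = timeline(CALL_FLOW)
non_invite_requests = [e for e in events if e[3] in ("INFO", "BYE")]
```

Under the zero-latency assumption the session is established at 15 s, the INFO request goes out at 60 s, and the BYE at 120 s. Separating the INVITE from the later non-INVITE requests matters for the loss analysis, which treats their retry windows differently.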

Fig. 3: Availability model for a replication domain

We now present a Markov availability model for a single replication domain for use in later sections to compute lost calls per failure and the failure frequencies. Figure 3 shows the availability model for the two

Fig. 5: Loss model for newly arriving calls

1) Number of new calls lost per replication domain failure: To compute the mean number of newly arriving calls lost per failure, we consider the continuous-time Markov chain (CTMC) model of Figure 5, which shows the state transitions after a failure has occurred. In this

Fig. 6: Lost newly arriving calls

If $T_d < rw_1$, no new calls will be dropped due to this failure. If, on the other hand, $T_d > rw_1$, then since the call arrival rate for the failed server is $\lambda/2$ (the call arrival rate for one replication domain is $\lambda$ and the calls are evenly distributed between the two application servers in that domain, so for one server the call arrival rate is $\lambda/2$), the mean number of new calls dropped is $(T_d - rw_1)\lambda/2$. This follows from the property of a Poisson arrival process that the mean number of arrivals in a duration of length $t$ is the arrival rate multiplied by $t$. Figure 6 shows how the newly arriving calls are affected by $T_d$. $T_d$ is a random variable and its cumulative distribution function, $F_d(x)$, can be computed from the CTMC of Figure 5: $F_d(x) = \pi_G(x)$, where $\pi_G(x)$ is the transient probability that the model of Figure 5 is in state G at time $x$. Hence the mean number of new calls dropped due to a server failure is
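The expression that follows is missing from this excerpt and is not reconstructed here. As a hedged numerical sketch of the quantity the paragraph describes, the code below averages the per-failure loss (half the arrival rate times the part of the recovery time that exceeds the INVITE retry window) over the distribution of the recovery time. It relies on the standard identity that the mean of max(T - a, 0) equals the integral of P(T > x) from a to infinity. The function name, the exponential recovery-time distribution, and all parameter values are assumptions for illustration; the paper obtains the distribution from the CTMC of Figure 5 instead.

```python
import math


def mean_new_calls_dropped(survival, retry_window, arrival_rate, upper, steps=100_000):
    """Approximate (arrival_rate / 2) * E[max(T - retry_window, 0)].

    survival(x) must return P(T > x), where T is the time from the failure
    until the model reaches state G. The integral of survival over
    [retry_window, upper] is taken with the trapezoidal rule; `upper` must be
    large enough that the remaining tail is negligible.
    """
    h = (upper - retry_window) / steps
    total = 0.5 * (survival(retry_window) + survival(upper))
    for i in range(1, steps):
        total += survival(retry_window + i * h)
    return (arrival_rate / 2.0) * total * h


# Hypothetical inputs: exponential recovery time with mean 10 s,
# a 4 s retry window, and 2 calls/s arriving at the replication domain.
mu = 0.1
rw1 = 4.0
lam = 2.0


def exp_survival(x):
    return math.exp(-mu * x)


approx = mean_new_calls_dropped(exp_survival, rw1, lam, upper=500.0)
# Closed form for the exponential case, used only as a cross-check.
exact = (lam / 2.0) * math.exp(-mu * rw1) / mu
print(approx, exact)
```

Passing the survival function as a callable keeps the sketch independent of how the recovery-time distribution is obtained, so a transient CTMC solution could be substituted for the exponential placeholder.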

Fig. 8: Lost stable calls in phase 1

Suppose $T_d$ is the time period between the failure occurrence and the model entering state G ($T_d = \infty$ if the model never enters G), and $rw_2$ is the maximum retry window for non-INVITE messages. If $T_d$ is less than $rw_2$, no call will be lost; otherwise the number of lost calls is $\min(T_d - rw_2, t_1)\lambda/2$, as the total number of lost calls cannot exceed $\lambda t_1/2$. Figure 8 depicts the case for lost stable calls in phase 1. From the explanation above we get the mean number of lost stable calls in phase 1 as
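The resulting expression is cut off in this excerpt and is not reconstructed here. The capped loss described above can still be sketched numerically: the mean of min(max(T - a, 0), b) equals the integral of P(T > x) over the interval from a to a + b, so the cap makes the result finite even when the model may never reach state G. In the sketch below the function name, the recovery-time distribution (exponential, with a small probability of never recovering), and every parameter value are hypothetical.

```python
import math


def mean_stable_calls_lost(survival, retry_window, phase_length, arrival_rate, steps=100_000):
    """Approximate (arrival_rate / 2) * E[min(max(T - retry_window, 0), phase_length)].

    survival(x) must return P(T > x); it may level off above zero if the
    model can fail to reach state G at all. Trapezoidal rule over
    [retry_window, retry_window + phase_length].
    """
    h = phase_length / steps
    total = 0.5 * (survival(retry_window) + survival(retry_window + phase_length))
    for i in range(1, steps):
        total += survival(retry_window + i * h)
    return (arrival_rate / 2.0) * total * h


# Hypothetical inputs: with probability 0.05 state G is never reached;
# otherwise the recovery time is exponential with mean 10 s.
p_never = 0.05
mu = 0.1
rw2 = 4.0   # assumed retry window for non-INVITE messages, in seconds
t1 = 45.0   # assumed length of stable phase 1, in seconds
lam = 2.0   # assumed call arrival rate per replication domain, calls/s


def defective_survival(x):
    return p_never + (1.0 - p_never) * math.exp(-mu * x)


approx = mean_stable_calls_lost(defective_survival, rw2, t1, lam)
# Closed form for this particular survival function, used as a cross-check.
exact = (lam / 2.0) * (
    p_never * t1
    + (1.0 - p_never) * (math.exp(-mu * rw2) - math.exp(-mu * (rw2 + t1))) / mu
)
print(approx, exact)
```

The result never exceeds half the arrival rate times the phase length, which matches the upper bound stated in the text.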

Fig. 7: Loss model for stable calls

3) Number of stable calls lost per replication domain failure: When an application server fails, there are $\lambda t_1/2$ calls in the failed server that are in stable phase 1. The INFO requests sent by these calls will still be directed to the failed server until the failure is recovered. Because these $\lambda t_1/2$ calls arrived at the application server at different times, the times at which they issue the INFO message also differ. And because the call arrival rate is $\lambda/2$ for each server, we assume that the INFO request rate issued by these $\lambda t_1/2$ calls is $\lambda/2$.

Fig. 9: Call loss model for proxy failure recovery.

Using the same method as in Section IV-B3 (replacing $t_1$ with t. and $\lambda/2$ with $\delta\lambda$), we can get the mean number of candidate setup calls that might be lost due to a message loss as

TABLE V: DPM by Failure Modes

Figure 10 shows the sensitivity analysis of DPM to the mean time to WebSphere Application Server failure (MTTF_WAS). As seen from the figure, the true DPM and RBDPM decrease with MTTF_WAS; the maximum/minimum for true DPM are 36.6/16.1, and those for RBDPM are 37.9/5.64. Figure 11 shows the sensitivity of DPM to WLM detection delay, which is varied from 0.5 seconds to 20 seconds. Both true DPM and RBDPM increase with WLM detection delay, and the curve becomes linear as the delay increases. Figure 12 shows the sensitivity to the various coverage factors, which are varied together from 0.8 to 1. As shown in the figure, both true DPM and RBDPM decrease as the coverage factors increase, and the coverage factors have more impact on true DPM than on RBDPM.

... (since the voice channel is already established), they will be put in the bucket RBDPM (revenue and billing DPM); calls lost during setup phase or lost new calls will be put in the true DPM bucket. Table V shows the DPM caused by various failure modes.

TABLE I: Replication Domain Configuration

TABLE II: States in Replication Domain Model

Assuming that the successful detection probabilities by WLM and NA are d and e, respectively, then if the WLM detects the failure first, the model enters state 1D, where a failover is performed while the node agent is still trying to detect the failure. We assume that the node agent will not detect the failure before failover is completed (which overestimates the DPM due to replication domain failures). Then with probability c the failover is successful and the model enters state FS, where the node agent is still attempting to detect the failure. With probability e the failure is detected by NA, and the model enters UA, UR, UB and RE, in sequence, for auto process restart, manual process restart, manual reboot and manual repair. With probability 1 - e, NA is not able to detect the failure, and hence from state FS the model enters the states UR, UB and then RE. If the failover is unsuccessful in state 1D, the model will go through the FN, UC, US, UT and RP states, which correspond to NA detection, auto process restart, manual process restart, manual reboot and manual repair.

TABLE III: Replication Domain Parameters

The proxy availability model is similar to the replication domain availability model except that the role of

Safe flying for an UAV helicopter

by Fabio Caponetti and

2007


Waveband switching in optical networks

by Vishal Anand

2003, IEEE Communications Magazine

The rapid advances in dense wavelength-division multiplexing technology with hundreds of wavelengths per fiber and worldwide fiber deployment have brought about a tremendous increase in the size (i.e., number of ports) of photonic...