Computing the Number of Calls Dropped Due to Failures
2010, 2010 IEEE 21st International Symposium on Software Reliability Engineering
https://doi.org/10.1109/ISSRE.2010.18…
10 pages
Abstract
Defects per million (DPM), defined as the number of calls out of a million dropped due to failures, is an important service (un)reliability measure for telecommunication systems. Most previous research derives the DPM from a steady-state system availability model. In this paper, we develop a novel method for DPM computation that takes into consideration not only system availability, but also the impact of the service application and the transient behavior of failure recovery. We illustrate this approach using a real system, the IBM SIP SLEE cluster. Our method takes into account software/hardware failures, different stages of recovery, different phases of call flow, retry attempts, and the interactions between call flow and failure/recovery behavior.
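For orientation, the steady-state baseline that the abstract contrasts against can be sketched in a few lines. This is not the paper's method. It is a minimal illustration under a common simplifying assumption: a call is dropped only if it arrives while the system is down, so DPM is roughly (1 - A) x 10^6 for steady-state availability A. The two-state up/down model, the function names, and the MTTF/MTTR figures are all hypothetical and chosen only for illustration.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability of a simple two-state (up/down) model: MTTF / (MTTF + MTTR).

    Assumption: a single up state and a single down state; the paper's
    recovery stages and call-flow phases are NOT modeled here.
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def dpm_from_availability(availability: float) -> float:
    """Baseline DPM estimate: expected calls lost per million offered calls.

    Assumption: a call is dropped only if it arrives while the system is down,
    and arrivals are independent of the system state.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: one hour of repair per 9999 hours of uptime.
    a = steady_state_availability(mttf_hours=9999.0, mttr_hours=1.0)
    print(f"availability = {a:.4f}, baseline DPM = {dpm_from_availability(a):.1f}")
```

Because this estimate depends only on the fraction of time the system is down, it cannot distinguish calls lost in different recovery stages or call-flow phases, nor the effect of retry attempts, which are the factors the abstract says the proposed method adds.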

Related papers
2008
We present the availability model of a high availability SIP Application Server configuration on WebSphere. Hardware, operating system and application server failures are considered. Different types of fault detectors, detection delays, failover delays, restarts, reboots and repairs are considered. Imperfect coverages for detection, failover and recovery are incorporated. Computations are based on a set of interacting sub-models of all system components capturing their failure and recovery behavior. The parameter values used in the calculations are based on several sources, including field data, high availability testing, and agreed-upon assumptions. In cases where a parameter value is uncertain, due to assumptions or limited test data, a sensitivity analysis of that parameter is provided. Our analysis indicates the failure types and recovery parameters that are most critical in their impact on overall system availability. These results will help guide system improvement efforts throughout future releases of these products.
International Journal of Reliability and Safety
Voice over IP (VoIP) has become an evolutionary technology in telecommunications. Because people increasingly depend on VoIP for voice communication, it is important to design a reliable and performance-oriented VoIP system. In this paper, a hierarchical model combining reliability and performance is proposed for a server-based VoIP system on a wireless network. The top-level reliability model consists of only the basic components of a VoIP system. Simultaneous failure of multiple components is taken into account in the model. Component redundancies and software rejuvenation are employed to achieve higher system reliability. Assuming exponential failure and repair times of the components, a continuous-time Markov chain is used to develop the top-level reliability model. In addition, a lower-level performance model is constructed to obtain the performance metrics of the system at each state of the top-level reliability model. Numerical results are provided for ...
Voice over Internet Protocol (VoIP) has turned out to be a revolutionary technology in the field of telecommunication. Because of the growing number of VoIP users, it is essential to design a reliable VoIP system that provides good quality of service. In this paper, a stochastic model based on a continuous-time Markov chain is developed to analyze the reliability of a server-based VoIP system comprising only its significant components. The concurrent breakdown of more than one component is taken into account in the model. Redundancy at the component level is used to boost the reliability of the VoIP system. As a preventive measure, software rejuvenation is implemented to avert or postpone software failures, and an optimal software rejuvenation strategy is proposed that leads to increased system reliability. Numerical results are presented for the quantitative examination of the suggested model.
Concurrency and Computation: Practice and Experience, 2019
Emergency call services are expected to be highly available in order to minimize the loss of urgent calls and, as a consequence, minimize loss of life due to lack of timely medical response. This service availability depends heavily on the cloud data center on which it is hosted. However, availability information alone cannot provide sufficient understanding of how failures impact the service and users' perception. In this paper, we evaluate the impact of failures on an emergency call system, considering service-level metrics such as the number of affected calls per failure and the time an emergency service takes until it recovers from a failure. We analyze a real data set from an emergency call center for a large Brazilian city. From stochastic models that represent a cloud data center, we evaluate different data center architectures to observe the impact of failures on the emergency call service. Results show that changing the data center's architecture in order to improve availability from two to three nines cannot decrease the average number of affected calls per failure. On the other hand, it can decrease the probability of affecting a considerable number of calls at the same time.
KEYWORDS: availability, cloud computing, data center failure, emergency call service
1 INTRODUCTION
It is observed that reliability and availability analysis cannot accurately relate failures and user perception. For instance, a system with a 99.99% availability level is subject to about 52 minutes of downtime during a year. However, if the failure occurs during a peak hour, it has a higher impact on users' perception than a failure that occurs when the system load is low, as fewer users will be affected by it. According to Trivedi and Bobbio,1 even short outages of technological infrastructures, such as cloud data centers, can have drastic consequences, ranging from economic loss to loss of human life. For instance, let us consider an emergency call service hosted in a cloud data center. The call center receives requests from patients, and depending on the case, an ambulance with a medical team can be dispatched to attend or to transport the patient to a hospital. This service relies on the cloud data center to be operational at all times and, consequently, to save lives. In order to guarantee a target availability level of a data center, different techniques can be applied to assess, predict, verify, and validate it.1 Redundancy is the common mechanism used to achieve high availability. We therefore develop mathematical models, for instance, by using state-space methods to evaluate how redundancy techniques can improve the availability of a cloud service. In this paper, we use some previously proposed stochastic models to represent a cloud data center (see other works 2-4) and analyze how failures of such infrastructure impact an emergency call service. We evaluate different redundancy architectures following TIA-942, a standard that defines the recommended redundancy strategies and best practices to reach a specified availability level. In addition to availability-related metrics, we focus on service-level metrics such as the number of calls lost per failure in a year. To explore and study service-related metrics, we use real data provided by the Serviço de Atendimento Móvel de Urgência (SAMU, Emergency Mobile Attendance Service),5 a public emergency service offered by the Brazilian government to the general population. Our analysis is divided into two stages. Firstly, we present an overview of the SAMU's data in order to characterize and to detail it, showing its relevance and overall behavior. These are measured in terms of the time between calls, attendance time, peak hours, and number of urgent
2006
Recently, measurement-based studies of software systems proliferated, reflecting an increasingly empirical focus on system availability, reliability, aging and fault tolerance. However, it is a non-trivial, error-prone, arduous, and time-consuming task even for experienced system administrators and statistical analysts to know what a reasonable set of steps should include to model and successfully predict performance variables or system failures of a complex software system. Reported results are fragmented and focus on applying statistical regression techniques to captured numerical system data. In this paper, we propose a best practice guide for building empirical models based on our experience with forecasting Apache web server performance variables and forecasting call availability of a real world telecommunication system. To substantiate the presented guide and to demonstrate our approach step-by-step we model and predict the response time and the amount of free physical memory of an Apache web server system. Additionally, we present concrete results for a) variable selection where we cross benchmark three procedures, b) empirical model building where we cross benchmark four techniques and c) sensitivity analysis. This best practice guide intends to assist in configuring modeling approaches systematically for best estimation and prediction results. 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06)
2011
This paper presents a prediction model for software service availability measured by the mean-time-to-repair (MTTR) and mean-time-to-failure (MTTF) of a service. The prediction model is based on the experimental identification of probability distribution functions for variables that affect MTTR/MTTF and has been implemented using a framework that we have developed to support monitoring and prediction of quality-of-service properties, called EVEREST+.
2010
In this paper, we investigate the performance and dependability modeling of voice and data services in computer networks. We use Stochastic Petri Nets as an enabling modeling approach for analytical evaluation of complex scenarios. We apply our proposed modeling approach in a case study to evaluate the dependability of an enterprise network, in terms of Total Cost of Ownership, and to assess the financial impact of outages over voice and data networks. The performability is analyzed by considering the influence of network topologies.
In this paper, we investigate the performance and dependability modeling of voice and data services in computer networks. We use Stochastic Petri Nets as an enabling modeling approach for analytical evaluation of complex scenarios. We apply our proposed modeling approach in a case study to evaluate the dependability and performability of an enterprise network in different scenarios. The performability is analyzed by considering the influence of network topologies.
We develop a probabilistic model of the behavior of a crash-recovery target, i.e. one which has the ability to recover from the crash state. We show that the fail-free and the crash-stop runs are special cases of the crash-recovery run, with mean time to failure (MTTF) approaching infinity and mean time to recovery (MTTR) approaching infinity, respectively. We compare the previous work QoS metrics to allow the measurement of the recovery speed, and the definition of the completeness property of a failure detector. Then, the impact of the dependability of the crash-recovery target on the QoS bounds for such a crash-recovery failure detector is analyzed using general dependability metrics, such as MTTF and MTTR, based on an approximate probabilistic model of the two-process failure detection system. According to our approximate model, we then show analytically how to estimate the failure detector's parameters to achieve a required QoS, based on the NFD-S algorithm, and how to execute the configuration procedure of this crash-recovery failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored process can have a significant impact on the QoS of our failure detector.
