Computing the Number of Calls Dropped Due to Failures

Kishor S Trivedi

doi:10.1109/ISSRE.2010.18


2010, 2010 IEEE 21st International Symposium on Software Reliability Engineering

https://doi.org/10.1109/ISSRE.2010.18


10 pages


Abstract

Defects per million (DPM), defined as the number of calls out of a million dropped due to failures, is an important service (un)reliability measure for telecommunication systems. Most previous research derives the DPM from a steady-state system availability model. In this paper, we develop a novel method for DPM computation that takes into consideration not only system availability but also the impact of the service application and the transient behavior of failure recovery. We illustrate this approach using a real system, the IBM SIP SLEE cluster. Our method takes into account software/hardware failures, different stages of recovery, different phases of call flow, retry attempts, and the interactions between call flow and failure/recovery behavior.

Figures (14)

TABLE I: Replication Domain Configuration

The call flow for B2BUA is shown in Figure 2. The UAC first sends an INVITE message to the UAS through the B2BUA proxy. The UAS replies to the UAC with a RINGING message, then pauses 15 seconds before sending an OK message to the UAC indicating that the phone has been picked up. The UAC replies to the UAS with an ACK message, and the call session is now set up. The UAC pauses for 45 seconds, simulating a phone conversation of 45 seconds, and then sends an INFO message to the UAS. The UAS sends back an OK message after receiving the INFO message from the UAC. The UAC pauses for another 60 seconds, then sends a BYE message to terminate the call session. The UAS replies with an OK message and the session is terminated.
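The call flow described above can be laid out as a simple timeline. The following is a minimal sketch, assuming zero network latency and using only the messages and pauses stated in the text; the names `CALL_FLOW` and `timeline` are illustrative and do not come from the paper.

```python
# B2BUA call flow from the text: (sender, message, pause in seconds before sending).
# Network and processing latency are assumed to be zero (an illustrative simplification).
CALL_FLOW = [
    ("UAC", "INVITE", 0),
    ("UAS", "RINGING", 0),
    ("UAS", "OK", 15),      # phone picked up after 15 s
    ("UAC", "ACK", 0),      # session is set up
    ("UAC", "INFO", 45),    # after 45 s of conversation
    ("UAS", "OK", 0),
    ("UAC", "BYE", 60),     # after another 60 s
    ("UAS", "OK", 0),       # session terminated
]


def timeline(flow):
    """Return (time_offset_seconds, sender, message) for each message in order."""
    t = 0
    events = []
    for sender, message, pause in flow:
        t += pause
        events.append((t, sender, message))
    return events


events = timeline(CALL_FLOW)
for t, sender, message in events:
    print(f"t={t:>3}s  {sender} sends {message}")

setup_time = events[3][0]   # time at which the ACK completes session setup
total_time = events[-1][0]  # time at which the final OK ends the session
```

Under these assumptions the session is set up 15 seconds after the INVITE and the whole call lasts 120 seconds.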

Fig. 3: Availability model for a replication domain

We now present a Markov availability model for a single replication domain, for use in later sections to compute lost calls per failure and the failure frequencies. Figure 3 shows the availability model for the two

TABLE II: States in Replication Domain Model

Assume that the successful detection probabilities by WLM and the node agent (NA) are d and e, respectively. If the WLM detects the failure first, the model enters state 1D, where a failover is performed while the node agent is still trying to detect the failure. We assume that the node agent will not detect the failure before failover is completed (which overestimates the DPM due to replication domain failures). With probability c the failover is successful and the model enters state F'S, where the node agent is still attempting to detect the failure. With probability e the failure is detected by NA, and the model enters UA, UR, UB and RE, in sequence, for auto process restart, manual process restart, manual reboot and manual repair. With probability 1 - e, NA is not able to detect the failure, and hence from state F'S the model enters the states UR, UB and then RE. If the failover is unsuccessful in state 1D, the model goes through the FN, UC, US, UT and RP states, which correspond to NA detection, auto process restart, manual process restart, manual reboot and manual repair.
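The three branches that follow WLM detection can be summarized with their probabilities. This is a minimal sketch, assuming each branch is listed as the full escalation sequence named in the text; it does not model which recovery stage actually succeeds, and the values of c and e are hypothetical.

```python
def recovery_paths(c, e):
    """Branch probabilities after the WLM detects the failure (state 1D).

    c: probability that the failover succeeds
    e: probability that the node agent (NA) detects the failure
    Each key is the escalation sequence of states named in the text.
    """
    return {
        # failover succeeds, NA detects: auto restart first, then manual steps
        ("1D", "F'S", "UA", "UR", "UB", "RE"): c * e,
        # failover succeeds, NA misses the failure: manual steps only
        ("1D", "F'S", "UR", "UB", "RE"): c * (1 - e),
        # failover fails
        ("1D", "FN", "UC", "US", "UT", "RP"): 1 - c,
    }


# Hypothetical values, chosen only for illustration.
paths = recovery_paths(c=0.9, e=0.8)
for states, prob in paths.items():
    print(" -> ".join(states), f"p={prob:.2f}")
```

The three probabilities sum to one because they are conditioned on the WLM having detected the failure first.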

TABLE III: Replication Domain Parameters

The proxy availability model is similar to the replication domain availability model except that the role of

Fig. 5: Loss model for newly arriving calls

1) Number of new calls lost per replication domain failure: To compute the mean number of newly arriving calls lost per failure, we consider the continuous-time Markov chain (CTMC) model of Figure 5, which shows the state transitions after a failure has occurred. In this

Fig. 6: Lost newly arriving calls

If $T_G < rw_1$, no new calls will be dropped due to this failure. If, on the other hand, $T_G > rw_1$, then the mean number of new calls dropped is $(T_G - rw_1)\lambda/2$, because the call arrival rate for the failed server is $\lambda/2$: the call arrival rate for one replication domain is $\lambda$, and the calls are evenly distributed between the two application servers in that domain. This follows from the property of a Poisson arrival process that the mean number of arrivals in a duration of length $t$ is the arrival rate multiplied by $t$. Figure 6 shows how the newly arriving calls are affected by $T_G$. $T_G$ is a random variable, and its cumulative distribution function, $F_G(x)$, can be computed from the CTMC of Figure 5: $F_G(x) = \pi_G(x)$, where $\pi_G(x)$ is the transient probability that the model of Figure 5 is in state G at time $x$. Hence the mean number of new calls dropped due to a server failure is
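One way to evaluate this mean numerically, offered as a sketch rather than as the paper's own expression, uses the standard identity that for a nonnegative random variable T and a threshold a, E[max(T - a, 0)] equals the integral of P(T > x) from a to infinity. The example assumes the time to reach state G is exponentially distributed with rate `mu`, a stand-in for the transient solution of the CTMC; `mu`, `lam`, `rw1` and the function name are hypothetical.

```python
import math


def mean_new_calls_lost(cdf, lam, rw1, upper, steps=100_000):
    """Approximate (lam / 2) * integral from rw1 to upper of (1 - cdf(x)) dx.

    Uses the trapezoidal rule; `upper` truncates the infinite integral and
    must be large enough that 1 - cdf(upper) is negligible.
    """
    h = (upper - rw1) / steps
    total = 0.5 * ((1 - cdf(rw1)) + (1 - cdf(upper)))
    for i in range(1, steps):
        total += 1 - cdf(rw1 + i * h)
    return (lam / 2) * total * h


# Hypothetical parameters: recovery rate mu (1/s), domain arrival rate lam
# (calls/s), retry window rw1 (s).
mu, lam, rw1 = 0.5, 10.0, 2.0


def exponential_cdf(x):
    """Assumed CDF of the time to reach state G (not the paper's CTMC)."""
    return 1 - math.exp(-mu * x)


numeric = mean_new_calls_lost(exponential_cdf, lam, rw1, upper=rw1 + 60.0)
closed_form = (lam / 2) * math.exp(-mu * rw1) / mu  # exact for the exponential case
print(numeric, closed_form)
```

For a real model, `exponential_cdf` would be replaced by the transient probability of state G obtained from the CTMC; the identity requires that state G is reached with probability one.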

Fig. 8: Lost stable calls in phase 1

Suppose $T_G$ is the time period between the failure occurrence and the model entering state G ($T_G = \infty$ if the model never enters G), and $rw_2$ is the maximum retry window for non-INVITE messages. If $T_G$ is less than $rw_2$, no call will be lost; otherwise the number of lost calls is $\min(T_G - rw_2, t_1)\lambda/2$, as the total number of lost calls cannot exceed $\lambda t_1/2$. Figure 8 depicts the case for lost stable calls in phase 1. From the explanation above we get the mean number of lost stable calls in phase 1 as
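The per-failure rule stated above, for a given time to reach state G, can be written directly as a function. This is a minimal sketch; the parameter values are hypothetical, and taking the phase-1 duration as 45 seconds is an assumption made here only for illustration.

```python
import math


def lost_stable_calls_phase1(t_g, rw2, t1, lam):
    """Lost phase-1 stable calls for one failure, given the time t_g to reach state G.

    t_g: time from failure until state G is entered (math.inf if never)
    rw2: maximum retry window for non-INVITE messages
    t1:  duration of stable phase 1
    lam: call arrival rate of the replication domain (each server sees lam / 2)
    """
    if t_g < rw2:
        return 0.0  # retries succeed before the window expires
    # The loss is capped at lam * t1 / 2, the number of calls in phase 1.
    return min(t_g - rw2, t1) * lam / 2


# Hypothetical values for illustration only.
lam, t1, rw2 = 10.0, 45.0, 4.0
quick = lost_stable_calls_phase1(3.0, rw2, t1, lam)        # recovered within the window
partial = lost_stable_calls_phase1(10.0, rw2, t1, lam)     # some calls lost
never = lost_stable_calls_phase1(math.inf, rw2, t1, lam)   # capped at lam * t1 / 2
print(quick, partial, never)
```

Averaging this function over the distribution of the time to reach state G gives the mean that the text refers to.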

Fig. 7: Loss model for stable calls

3) Number of stable calls lost per replication domain failure: When an application server fails, there are $\lambda t_1/2$ calls in the failed server that are in stable phase 1. The INFO requests sent by these calls will still be directed to the failed server until the failure is recovered. Because these $\lambda t_1/2$ calls arrived at the application server at different times, the times at which they issue the INFO message also differ. Because the call arrival rate is $\lambda/2$ for each server, we assume that the INFO request rate issued by these $\lambda t_1/2$ calls is $\lambda/2$.

Fig. 9: Call loss model for proxy failure recovery.

Using the same method as in Section IV-B3 (replacing $t_1$ with t., and $\lambda/2$ with $6\lambda$), we can get the mean number of candidate setup calls that might be lost due to a message loss as

TABLE V: DPM by Failure Modes

Figure 10 shows the sensitivity analysis of DPM to the mean time to WebSphere Application Server failure (MTTF_WAS). As seen from the figure, the true DPM and RBDPM decrease as MTTF_WAS increases; the maximum/minimum for true DPM are 36.6/16.1, and those for RBDPM are 37.9/5.64. Figure 11 shows the sensitivity of DPM to WLM detection delay, which is varied from 0.5 seconds to 20 seconds. Both true DPM and RBDPM increase with WLM detection delay, and the curve becomes linear as the delay increases. Figure 12 shows the sensitivity to various coverage factors, which are varied together from 0.8 to 1. As shown in the figure, both true DPM and RBDPM decrease as the coverage factors increase, and the coverage factors have a greater impact on true DPM than on RBDPM.

(since the voice channel is already established), they will be put in the bucket RBDPM (revenue and billing DPM); calls lost during setup phase or lost new calls will be put in the true DPM bucket. Table V shows the DPM caused by various failure modes.

