The cost of recovery in message logging protocols
2000, IEEE Transactions on Knowledge and Data Engineering
https://doi.org/10.1109/69.842260
Abstract
Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic, and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. Our results suggest that applications face a complex trade-off when choosing a message logging protocol for fault tolerance. On the one hand, optimistic protocols can provide fast failure-free execution and good performance during recovery, but are complex to implement and can create orphan processes. On the other hand, orphan-free protocols either risk being slow during recovery, e.g., sender-based pessimistic and causal protocols, or incur a substantial overhead during failure-free execution, e.g., receiver-based pessimistic protocols. To address this trade-off, we propose hybrid logging protocols, a new class of orphan-free protocols. We show that hybrid protocols perform within two percent of causal logging during failure-free execution and within two percent of receiver-based logging during recovery.
Sriram Rao received his PhD in computer science from the University of Texas at Austin in 1999. He also received his MS and BS (with high honors) from the University of Texas at Austin in 1994 and 1992, respectively. He was a recipient of the Microelectronics and Computer Development (MCD) fellowship awarded by the University of Texas, Department of Computer Sciences. His research interests include fault tolerance, distributed systems, and multimedia systems. He is currently employed by Inktomi Corporation.