Architecture Support for Behavior-based Adaptive Checkpointing
2008, Journal of Software
Abstract
Checkpointing is a commonly used approach to providing system fault tolerance. However, a constant checkpointing frequency may compromise overall system performance when an application must satisfy multiple types of QoS requirements, so the checkpointing frequency should be customizable and adaptable at runtime. Open distributed and embedded applications complicate this: they often involve a large number of entities, and these entities may join or leave the system frequently. This scale and dynamicity make an adaptive checkpointing strategy difficult to apply unless a model encapsulates these issues within a well-defined structure and shields application developers from the complexity. In this paper, we introduce a behavior-based adaptive checkpointing approach for open systems and present architectural support that optimizes overall system performance by using adaptive checkpointing frequencies.