rMPI: increasing fault resiliency in a message-passing environment
2011
https://doi.org/10.2172/1012733
Abstract
As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale and show the scenarios in which redundant computation makes sense.