Replication-Based Fault Tolerance for MPI Applications
2009, IEEE Transactions on Parallel and Distributed Systems
https://doi.org/10.1109/TPDS.2008.172
Abstract
As computational clusters increase in size, their mean time to failure decreases drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun X4500-based solution, an EMC SAN, and the Ibrix commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark in tests on up to 256 nodes, and show that checkpointing and replication can be achieved with much lower overhead than current techniques provide. Finally, we show that the monetary cost of our solution is as low as 25% of that of a typical SAN/parallel file system-equipped storage system.