Abstract
Today's high-performance cluster computing technologies demand extreme robustness against unexpected failures in order to finish aggressively parallelized work within a given time constraint. Although there has been a steady effort to develop hardware and software tools that increase the fault resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper presents SHIELD, a practical and easily deployable fault-tolerant MPI and MPI management system for InfiniBand clusters. SHIELD provides a novel framework that can be readily used in real cluster systems, and its design perspective differs from that of other fault-tolerant MPI implementations. We show that SHIELD provides robust fault resilience to fault-vulnerable cluster systems and that its design features are useful wherever fault resilience is a matter of utmost importance.