SHIELD: A Fault-Tolerant MPI for an InfiniBand Cluster

2006

https://doi.org/10.1007/11847366_90

Abstract

Today's high-performance cluster computing technologies demand extreme robustness against unexpected failures in order to finish aggressively parallelized work within a given time constraint. Although there has been a steady effort to develop hardware and software tools that increase the fault-resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper presents SHIELD, a practical and easily deployable fault-tolerant MPI implementation and MPI management system for an InfiniBand cluster. SHIELD provides a novel framework that can be easily used in real cluster systems, and its design perspective differs from that of other fault-tolerant MPI implementations. We show that SHIELD provides robust fault-resilience to fault-vulnerable cluster systems and that its design features are useful wherever fault-resilience is of utmost importance.
