MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

2002, ACM/IEEE SC 2002 Conference (SC'02)

https://doi.org/10.1109/SC.2002.10048

Abstract

Global Computing platforms, large-scale clusters, and future TeraGRID systems gather thousands of nodes to run parallel scientific applications. At this scale, node failures or disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers, and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories, and Checkpoint Servers, as well as the node volatility, can be fully configured. We present a detailed performance evaluation of every component of MPICH-V and of its overall performance on non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
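The mechanism the abstract describes is that messages transit through stable Channel Memories, which log them so that a restarted process can be replayed without coordinating checkpoints with other nodes. Below is a minimal, self-contained sketch (not code from the paper) of that relay-and-log idea in plain MPI: rank 0 plays a hypothetical Channel Memory that logs and forwards messages, the MPI tag carries the real destination, and names such as cm_send and CM_RANK are purely illustrative.

/* Illustrative sketch only: relay-and-log messaging through a "channel
 * memory" rank, in the spirit of (but much simpler than) MPICH-V.
 * Rank 0 acts as the channel memory; workers never talk to each other
 * directly.  All names are hypothetical, not MPICH-V APIs. */
#include <mpi.h>
#include <stdio.h>

#define CM_RANK  0     /* rank playing the channel memory */
#define MSG_LEN 64
#define MAX_LOG 64     /* supports at most 64 workers, for brevity */

/* Worker-side send: route the payload through the channel memory.
 * The MPI tag carries the final destination rank. */
static void cm_send(const char *buf, int dest)
{
    MPI_Send(buf, MSG_LEN, MPI_CHAR, CM_RANK, dest, MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == CM_RANK) {
        /* Channel memory: log one message per worker, then forward.
         * After a worker rollback, the logged copies could simply be
         * replayed instead of being resent by their original senders. */
        char log[MAX_LOG][MSG_LEN];
        int  dest_of[MAX_LOG];
        int  n = size - 1;

        for (int i = 0; i < n; i++) {
            MPI_Status st;
            MPI_Recv(log[i], MSG_LEN, MPI_CHAR, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            dest_of[i] = st.MPI_TAG;            /* real destination */
            printf("CM logged message %d -> %d\n", st.MPI_SOURCE, dest_of[i]);
        }
        for (int i = 0; i < n; i++)
            MPI_Send(log[i], MSG_LEN, MPI_CHAR, dest_of[i], 0, MPI_COMM_WORLD);
    } else {
        /* Each worker sends one message to its right-hand neighbour
         * (among workers 1..size-1) via the channel memory. */
        char out[MSG_LEN], in[MSG_LEN];
        int dest = rank % (size - 1) + 1;
        snprintf(out, MSG_LEN, "hello from rank %d", rank);
        cm_send(out, dest);

        MPI_Recv(in, MSG_LEN, MPI_CHAR, CM_RANK, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received: %s\n", rank, in);
    }

    MPI_Finalize();
    return 0;
}

In MPICH-V itself the relaying is transparent to the application and is combined with per-node uncoordinated checkpointing; the sketch only illustrates the relay-and-log step that makes independent rollback possible.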
