Redesigning the message logging model for high performance
2010
Abstract
Over the past decade, the number of processors in high performance computing facilities has grown to hundreds of thousands. As a direct consequence, while computational power has followed the trend, the mean time between failures (MTBF) has suffered and is now counted in hours. To circumvent this limitation, a number of fault tolerant algorithms and execution environments have been developed using the message passing paradigm. Among them, message logging has been shown to achieve better overall performance when the MTBF is low, mainly due to its faster failure recovery. However, message logging suffers from a high overhead when no failure occurs. In this paper, we therefore discuss a refinement of the message logging model intended to improve failure-free message logging performance. The proposed approach simultaneously removes useless memory copies and reduces the number of logged events. We present the implementation of a pessimistic message logging protocol in Open MPI and compare it with the previous reference implementation, MPICH-V2. Results show a performance improvement of several orders of magnitude and zero overhead for most messages.