BAD-check

2015, Proceedings of the 10th Parallel Data Storage Workshop (PDSW '15)

https://doi.org/10.1145/2834976.2834981

Abstract

Leadership-scale scientific simulations running as tens of thousands of tightly coupled MPI processes are vulnerable to interruption by a single process or node failure. Because each state calculation depends on the successful completion of every prior state calculation, checkpoint-restart is the most widely used technique for achieving fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper we present a transactional protocol that enables asynchronous distributed creation of checkpoint data sets, and describe the conditions under which it is beneficial. With simulations, we demonstrate that scientific applications exhibiting computational variance without frequent synchronization can use our protocol to either reduce run time by up to 27% or reduce required storage system capability by up to 40%.
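
The intuition for why removing the checkpoint barrier can help under computational variance can be illustrated with a toy model. The sketch below is not the paper's simulator or protocol. It is a minimal illustration that assumes processes never wait on each other between checkpoints, ignores checkpoint write cost entirely, and draws per-interval compute times from a uniform distribution. The function name `simulate` and all of its parameters are hypothetical.

```python
import random


def simulate(num_procs, num_intervals, mean=1.0, jitter=0.2, seed=0):
    """Toy comparison of barrier-synchronized vs. unsynchronized progress.

    Assumptions (illustrative only, not taken from the paper):
    - each process does independent work in every checkpoint interval;
    - compute time per interval is uniform in [mean - jitter, mean + jitter];
    - checkpoint write time is zero in both modes.
    """
    rng = random.Random(seed)
    # work[p][i] is the compute time of process p during interval i.
    work = [
        [rng.uniform(mean - jitter, mean + jitter) for _ in range(num_intervals)]
        for _ in range(num_procs)
    ]

    # Synchronous checkpointing: every interval ends at a barrier, so each
    # interval costs as much as its slowest process.
    sync_time = sum(
        max(work[p][i] for p in range(num_procs)) for i in range(num_intervals)
    )

    # Asynchronous checkpointing: no barrier, so the job finishes when the
    # process with the largest total compute time finishes.
    async_time = max(sum(row) for row in work)

    return sync_time, async_time


if __name__ == "__main__":
    sync_time, async_time = simulate(num_procs=64, num_intervals=100)
    print(f"synchronous:  {sync_time:.2f}")
    print(f"asynchronous: {async_time:.2f}")
    print(f"run time reduction: {100 * (1 - async_time / sync_time):.1f}%")
```

In this model the asynchronous time can never exceed the synchronous time, because a sum of per-interval maxima is at least the maximum of per-process sums. The size of the gap depends on the assumed variance and says nothing about the 27% and 40% figures reported in the abstract, which come from the authors' own simulations.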
