Checkpointing as a Service in Heterogeneous Cloud Environments
2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Abstract
A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerance mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.