Survey on Fault-Tolerance-Aware Scheduling in Cloud Computing
Information and Communication Technology for Competitive Strategies
https://doi.org/10.1007/978-981-13-0586-3_28
Abstract
Nowadays, clients increasingly regard the cloud not just as a service provider but also as a partner, and they expect it to deliver timely and accurate services. Cloud nodes must be reliable in order to provide quality of service that meets customer requirements. Furthermore, the physical size of high-performance computing environments is increasing day by day. The larger the system, the more failures are likely to occur, which eventually results in poor system reliability, something highly undesirable for time-critical applications. To deal with reliability, a service provider must know the failure characteristics of the cloud computing nodes so that failures can be handled with fault-tolerance-aware techniques when application tasks are scheduled. In this paper, we present a survey of fault-tolerance-aware techniques, classified as proactive and reactive fault tolerance. This survey provides a foundation for researchers working on fault-tolerance-aware scheduling, with the aim of making better scheduling decisions that enhance the performance and reliability of application execution.
Keywords Reliability ⋅ Fault tolerance ⋅ Virtualization
1 Introduction
Cloud is an Internet-based computing paradigm that provides basic services such as Infrastructure as a Service (IaaS), Software as a Service (SaaS), and Platform as a Service (PaaS) [1]. Different types of cloud providers, i.e., public, private, or hybrid, are responsible for providing these services to users. Nowadays, usage of
FAQs
What informs the scheduling of tasks in proactive fault tolerance methods?
The study reveals that tasks are assigned to reliable virtual machines based on historical failure data. By utilizing prior reliability information, these methods aim to enhance application execution reliability.
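As a rough illustration of reliability-driven assignment, the sketch below estimates a failure rate for each virtual machine from its failure history and picks the machine on which a task is most likely to finish without a failure. The exponential failure model, the VM names, and the history figures are assumptions made for this example; they are not taken from the surveyed papers.

```python
import math

# Hypothetical failure history: VM name -> (failure count, hours observed).
HISTORY = {"vm-a": (4, 1000.0), "vm-b": (1, 1000.0), "vm-c": (10, 1000.0)}


def failure_rate(failures, observed_hours):
    """Failures per hour, estimated from past observations."""
    return failures / observed_hours


def task_reliability(rate, runtime_hours):
    """Probability of no failure during the run, assuming a constant failure rate."""
    return math.exp(-rate * runtime_hours)


def most_reliable_vm(history, runtime_on_vm):
    """Return (vm, reliability) for the VM that maximizes task reliability."""
    best_vm, best_rel = None, -1.0
    for vm, (failures, hours) in history.items():
        rel = task_reliability(failure_rate(failures, hours), runtime_on_vm[vm])
        if rel > best_rel:
            best_vm, best_rel = vm, rel
    return best_vm, best_rel


# Estimated runtime (hours) of one task on each VM; the values are illustrative.
vm, rel = most_reliable_vm(HISTORY, {"vm-a": 10.0, "vm-b": 20.0, "vm-c": 5.0})
print(vm, round(rel, 4))
```

Note that the slowest machine can still be the most reliable choice, because what matters in this model is the product of failure rate and runtime.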
How does the Monte Carlo Failure Estimation algorithm improve scheduling reliability?
This algorithm estimates future failure patterns using Weibull failure distribution, leading to more informed scheduling decisions. It incorporates failure predictions into resource allocation, thus minimizing disruptions during application execution.
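The idea of estimating failures by sampling from a Weibull distribution can be sketched as follows. This is a generic Monte Carlo estimate of the probability that a resource survives for the duration of a task, not the algorithm from the cited paper; the scale and shape values are assumed for illustration. Python's `random.weibullvariate(alpha, beta)` takes the scale parameter first and the shape parameter second.

```python
import math
import random


def mc_survival(scale, shape, duration, samples=20000, seed=0):
    """Monte Carlo estimate of P(time to failure > duration) under Weibull failures."""
    rng = random.Random(seed)
    survived = sum(
        1 for _ in range(samples) if rng.weibullvariate(scale, shape) > duration
    )
    return survived / samples


def analytic_survival(scale, shape, duration):
    """Closed-form Weibull survival function, used here to check the estimate."""
    return math.exp(-((duration / scale) ** shape))


# Assumed parameters: scale of 100 hours, shape 1.5, task lasting 50 hours.
estimate = mc_survival(100.0, 1.5, 50.0)
exact = analytic_survival(100.0, 1.5, 50.0)
print(round(estimate, 3), round(exact, 3))
```

A scheduler could compute such an estimate per candidate resource and prefer the one with the highest survival probability for the task at hand.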
What distinguishes heuristic approaches from meta-heuristic approaches in scheduling?
Heuristic approaches focus on problem-specific solutions and may not yield near-optimal outcomes, while meta-heuristic approaches are designed to escape local optima. For instance, a simulated annealing algorithm outperformed traditional branch-and-bound techniques in reliability maximization.
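To show how a meta-heuristic can accept worse moves in order to escape local optima, here is a minimal simulated annealing sketch for task allocation. It is not the method of the cited work: the cost function (sum of failure rate times execution time, so that reliability is `exp(-cost)`), the per-processor capacity constraint, the cooling schedule, and all numbers are assumptions chosen for the example.

```python
import math
import random


def unreliability_cost(assign, rates, times):
    """Sum of rate[p] * time[i][p]; system reliability is exp(-cost)."""
    return sum(rates[p] * times[i][p] for i, p in enumerate(assign))


def anneal(rates, times, capacity, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Search for a low-cost allocation with at most `capacity` tasks per processor."""
    rng = random.Random(seed)
    n_tasks, n_procs = len(times), len(rates)
    # Round-robin start; assumed feasible for the chosen capacity.
    current = [i % n_procs for i in range(n_tasks)]
    cur_cost = unreliability_cost(current, rates, times)
    best, best_cost = current[:], cur_cost
    temp = t0
    for _ in range(steps):
        cand = current[:]
        cand[rng.randrange(n_tasks)] = rng.randrange(n_procs)  # move one task
        if max(cand.count(p) for p in range(n_procs)) <= capacity:
            delta = unreliability_cost(cand, rates, times) - cur_cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, cur_cost = cand, cur_cost + delta
                if cur_cost < best_cost:
                    best, best_cost = current[:], cur_cost
        temp *= cooling
    return best, best_cost


# Illustrative instance: 3 processors, 6 tasks, times[i][p] in hours.
RATES = [0.002, 0.005, 0.001]
TIMES = [[10, 8, 20], [12, 9, 25], [30, 20, 40], [5, 4, 9], [18, 15, 30], [7, 6, 12]]
allocation, cost = anneal(RATES, TIMES, capacity=3)
print(allocation, round(math.exp(-cost), 4))
```

The best allocation seen so far is tracked separately from the current one, so the returned solution is never worse than the starting allocation.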
What role does checkpointing play in reactive fault tolerance strategies?
Checkpointing allows systems to record state at intervals, facilitating recovery from failures by resuming execution from the last saved state. This method is crucial for minimizing computation loss during task failures in cloud environments.
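A toy version of checkpoint-and-resume is sketched below. The task simply sums the integers from 1 to `n_steps`, the checkpoint is kept in memory, and a single failure is injected at a chosen step; all of these are simplifications assumed for illustration, whereas a real system would persist state to stable storage.

```python
def run_with_checkpoints(n_steps, interval, fail_at=None):
    """Sum 1..n_steps, saving state every `interval` steps.

    If `fail_at` is given, one failure is simulated once that many steps
    have completed. Returns (total, executed), where `executed` counts
    every step run, including work redone after the failure.
    """
    checkpoint = {"step": 0, "total": 0}  # last saved state
    executed = 0
    failed_once = False
    while True:
        # Restore from the most recent checkpoint (the start, if none yet).
        step, total = checkpoint["step"], checkpoint["total"]
        try:
            while step < n_steps:
                if step == fail_at and not failed_once:
                    failed_once = True
                    raise RuntimeError("simulated task failure")
                step += 1
                total += step
                executed += 1
                if step % interval == 0:
                    checkpoint = {"step": step, "total": total}
            return total, executed
        except RuntimeError:
            continue  # roll back to the last checkpoint and resume


print(run_with_checkpoints(10, 4, fail_at=6))
```

With a checkpoint every 4 steps and a failure after step 6, only steps 5 and 6 are repeated, so the computation lost to the failure is bounded by the checkpoint interval.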
What are the benefits of replication in enhancing system reliability?
Replication reduces vulnerability to single point failures by creating redundant task copies across multiple processors. Active replication can successfully execute tasks even if some copies face faults, thus improving overall system resilience.
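The effect of active replication on reliability can be worked through numerically. The sketch below assumes replica failures are independent, which is an idealization: a task then fails only if every replica fails, so its reliability is one minus the product of the individual failure probabilities. The function names and figures are hypothetical.

```python
import math


def replicated_reliability(replica_reliabilities):
    """Probability that at least one replica succeeds (independent failures)."""
    return 1.0 - math.prod(1.0 - r for r in replica_reliabilities)


def replicas_needed(single, target):
    """Smallest number of identical replicas that reaches the target reliability."""
    if not (0.0 < single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    count = 1
    while replicated_reliability([single] * count) < target:
        count += 1
    return count


# Two replicas on processors with reliability 0.9 each give about 0.99.
print(round(replicated_reliability([0.9, 0.9]), 4))
print(replicas_needed(0.9, 0.995))
```

Each added replica multiplies the failure probability by a further factor, but it also consumes another processor, which is the trade-off that schemes with a dynamic number of replicas try to balance.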
References (30)
- Sadiku, M.N., Musa, S.M., Momoh, O.D.: Cloud computing: opportunities and challenges. IEEE Potent. 33(1), 34-36 (2014)
- Patel, P., Ranabahu, A.H., Sheth, A.P.: Service Level Agreement in Cloud Computing (2009)
- Garraghan, P., Townend, P., Xu, J.: An empirical failure-analysis of a large-scale cloud computing environment. In: 2014 IEEE 15th International Symposium on High-Assurance Systems Engineering (HASE), pp. 113-120. IEEE (2014)
- Attiya, G., Hamam, Y.: Task allocation for maximizing reliability of distributed systems: a simulated annealing approach. J. Parallel Distrib. Comput. 66(10), 1259-1266 (2006)
- Rehani, N., Garg, R.: Meta-heuristic based reliable and green workflow scheduling in cloud computing. Int. J. Syst. Assur. Eng. Manag. 1-10
- Zhou, A., Wang, S., Cheng, B., Zheng, Z., Yang, F., Chang, R., Buyya, R.: Cloud service reliability enhancement via virtual machine placement optimization. IEEE Trans. Serv. Comput. (2016)
- Heddaya, A., Helal, A.: Reliability, Availability, Dependability and Performability: A User-Centered View. Boston University Computer Science Department (1997)
- Qin, X., Jiang, H.: A dynamic and reliability-driven scheduling algorithm for parallel real-time jobs executing on heterogeneous clusters. J. Parallel Distrib. Comput. 65(8), 885-900 (2005)
- Charity, T.J., Hua, G.C.: Resource reliability using fault tolerance in cloud computing. In: 2016 2nd International Conference on Next Generation Computing Technologies (NGCT), pp. 65-71. IEEE (2016)
- Zhou, A., Wang, S., Cheng, B., Zheng, Z., Yang, F., Chang, R., Buyya, R.: Cloud service reliability enhancement via virtual machine placement optimization. IEEE Trans. Serv. Comput.
- Rehani, N., Garg, R.: Reliability-aware workflow scheduling using Monte Carlo failure estimation in cloud. In: Proceedings of International Conference on Communication and Networks, pp. 139-153. Springer, Singapore (2017)
- Cao, F., Zhu, M.M.: Distributed workflow mapping algorithm for maximized reliability under end-to-end delay constraint. J. Supercomput. 66(3), 1462-1488 (2013)
- Dongarra, J.J., Jeannot, E., Saule, E., Shi, Z.: Bi-objective scheduling algorithms for optimizing makespan and reliability on heterogeneous systems. In: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp. 280-288. ACM (2007)
- Wang, X., Yeo, C.S., Buyya, R., Su, J.: Optimizing the makespan and reliability for workflow applications with reputation and a look-ahead genetic algorithm. Fut. Generat. Comput. Syst. 27(8), 1124-1134 (2011)
- Zhang, L., Li, K., Li, C., Li, K.: Bi-objective workflow scheduling of the energy consumption and reliability in heterogeneous computing systems. Inf. Sci. 379, 241-256 (2017)
- Fard, H.M., Prodan, R., Barrionuevo, J.J.D., Fahringer, T.: A multi-objective approach for workflow scheduling in heterogeneous environments. In: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012), pp. 300-309. IEEE Computer Society (2012)
- Zhang, L., Li, K., Xu, Y., Mei, J., Zhang, F., Li, K.: Maximizing reliability with energy conservation for parallel task scheduling in a heterogeneous cluster. Inf. Sci. 319, 113-131 (2015)
- Zhou, A., Sun, Q., Li, J.: Enhancing reliability via checkpointing in cloud computing systems. China Commun. 14(7), 1-10 (2017)
- Paun, M., Naksinehaboon, N., Nassar, R., Leangsuksun, C., Scott, S.L., Taerat, N.: Incremental checkpoint schemes for Weibull failure distribution. Int. J. Foundat. Comput. Sci. 21(03), 329-344 (2010)
- Goiri, Í., Julia, F., Guitart, J., Torres, J.: Checkpoint-based fault-tolerant infrastructure for virtualized service providers. In: 2010 IEEE Network Operations and Management Symposium (NOMS), pp. 455-462. IEEE (2010)
- Cao, G., Singhal, M.: On coordinated checkpointing in distributed systems. IEEE Trans. Parallel Distrib. Syst. 9(12), 1213-1225 (1998)
- Zhao, J., Xiang, Y., Lan, T., Huang, H.H., Subramanian, S.: Elastic reliability optimization through peer-to-peer checkpointing in cloud computing. IEEE Trans. Parallel Distrib. Syst. 28(2), 491-502 (2017)
- Zhang, Y., Zheng, Z., Lyu, M.R.: BFTCloud: a Byzantine fault tolerance framework for voluntary-resource cloud computing. In: 2011 IEEE International Conference on Cloud Computing (CLOUD), pp. 444-451. IEEE (2011)
- Zhao, L., Ren, Y., Xiang, Y., Sakurai, K.: Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems. In: 2010 12th IEEE International Conference on High Performance Computing and Communications (HPCC), pp. 434-441. IEEE (2010)
- Mei, J., Li, K., Zhou, X., Li, K.: Fault-tolerant dynamic rescheduling for heterogeneous computing systems. J. Grid Comput. 13(4), 507-525 (2015)
- Chen, C.Y.: Task scheduling for maximizing performance and reliability considering fault recovery in heterogeneous distributed systems. IEEE Trans. Parallel Distrib. Syst. 27(2), 521-532 (2016)
- Amoon, M.: Adaptive framework for reliable cloud computing environment. IEEE Access 4, 9469-9478 (2016)
- Wang, S., Li, K., Mei, J., Xiao, G., Li, K.: A reliability-aware task scheduling algorithm based on replication on heterogeneous computing systems. J. Grid Comput. 15(1), 23-39 (2017)
- Zhu, X., Wang, J., Guo, H., Zhu, D., Yang, L.T., Liu, L.: Fault-tolerant scheduling for real-time scientific workflows with elastic resource provisioning in virtualized clouds. IEEE Trans. Parallel Distrib. Syst. 27(12), 3501-3517 (2016)
- Zheng, Q., Veeravalli, B., Tham, C.K.: On the design of fault-tolerant scheduling strategies using primary-backup approach for computational grids with low replication costs. IEEE Trans. Comput. 58(3), 380-393 (2009)