Failure detector abstractions for MapReduce-based systems
2017, Information Sciences
https://doi.org/10.1016/J.INS.2016.08.013
Abstract
Omission failures are an important source of problems in data-intensive computing systems. In these frameworks, omission failures are caused by slow tasks, known as stragglers, which can severely degrade workload performance. In the case of MapReduce-based systems, many state-of-the-art approaches have preferred to explore and extend speculative execution mechanisms. Other alternatives have based their contributions on doubling the computing resources for their tasks. Nevertheless, none of these approaches has addressed a fundamental aspect of detecting and subsequently resolving omission failures, namely the adjustment of the timeout service. In this paper, we study omission failures in MapReduce systems and formalize their failure detector abstraction by means of three different algorithms for defining the timeout. The first abstraction, called High Relax Failure Detector (HR-FD), acts as a static alternative to the default timeout and is able to estimate the completion time of the user workload. The second abstraction, called Medium Relax Failure Detector (MR-FD), dynamically modifies the timeout according to the progress score of each workload. Finally, taking into account that some user requests are strictly deadline-bounded, we introduce the third abstraction, called Low Relax Failure Detector (LR-FD), which merges the MapReduce dynamic timeout with an external monitoring system in order to enforce more accurate failure detection. Whereas HR-FD shows performance improvements for most user requests (in particular, small workloads), MR-FD and LR-FD significantly improve on the current timeout selection in every scenario, regardless of the workload type and failure injection time.
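To make the three timeout styles concrete, the following is a minimal illustrative sketch, not the algorithms from the paper. The abstract gives no formulas, so every function name, the `slack` multiplier, the `floor_s` lower bound, and the linear remaining-time estimate are assumptions introduced here. The only external fact used is that Hadoop's default task timeout (`mapreduce.task.timeout`) is 600000 ms.

```python
# Hypothetical sketch of static, progress-based, and monitor-assisted timeouts.
# None of these formulas come from the paper; they only illustrate the idea.

DEFAULT_TIMEOUT_S = 600.0  # Hadoop's default mapreduce.task.timeout (600000 ms)


def static_timeout(estimated_completion_s: float, slack: float = 1.5) -> float:
    """Static style: fix the timeout once from an estimated completion time.

    The slack multiplier is an assumption, not a value from the paper.
    """
    return estimated_completion_s * slack


def dynamic_timeout(elapsed_s: float, progress: float,
                    slack: float = 1.5, floor_s: float = 10.0) -> float:
    """Dynamic style: recompute the timeout from a progress score in [0, 1].

    Assumes progress grows linearly, so the remaining time is
    elapsed * (1 - progress) / progress.
    """
    if progress <= 0.0:
        # No progress information yet: fall back to the default timeout.
        return DEFAULT_TIMEOUT_S
    remaining_s = elapsed_s * (1.0 - progress) / progress
    # Never drop below the floor, so nearly finished tasks are not killed early.
    return max(floor_s, remaining_s * slack)


def is_suspected(silence_s: float, timeout_s: float,
                 monitor_reports_failure: bool = False) -> bool:
    """Suspect a worker if it has been silent longer than the timeout,
    or if a (hypothetical) external monitor already reports it as failed."""
    return monitor_reports_failure or silence_s > timeout_s


if __name__ == "__main__":
    timeout = dynamic_timeout(elapsed_s=60.0, progress=0.5)
    print(timeout, is_suspected(silence_s=120.0, timeout_s=timeout))
```

In this sketch the external monitor signal short-circuits the timeout check, which is one simple way to combine the two sources of information; other combinations are equally plausible.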