MRBS: A Comprehensive MapReduce Benchmark Suite

2012

Abstract

MapReduce is a promising programming model for distributed data processing. Extensive research has been conducted on the scalability of MapReduce, and several systems have been proposed in the literature, addressing issues ranging from job scheduling to data placement and replication. However, realistic benchmarks for analyzing and comparing the effectiveness of these proposals are still missing. To date, most MapReduce techniques have been evaluated using microbenchmarks in overly simplified settings, which may not be representative of real-world applications. This paper presents MRBS, a comprehensive benchmark suite for evaluating the performance of MapReduce systems. MRBS includes five benchmarks covering several application domains and a wide range of execution scenarios, such as data-intensive vs. compute-intensive applications and batch vs. online interactive applications. MRBS makes it possible to characterize application workload and dataload, and it produces extensive high-level and low-level performance statistics. We illustrate the use of MRBS with Hadoop clusters running on Amazon EC2.
