A Close-Up View About Spark in Big Data Jurisdiction

2018, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622

https://doi.org/10.9790/9622-0801022641

Abstract

Big data is a term now used ubiquitously in distributed computing on the web. As the name indicates, it refers to collections of very large data sets, measured in petabytes, exabytes, and beyond, together with the related systems and algorithms used to analyze this data. Hadoop has proven to be the go-to solution for processing enormous data sets. MapReduce is a strong solution for computations that complete in one pass, but it is not efficient for use cases that need multi-pass computations and algorithms. The output of every stage of a job has to be stored in the file system before the next stage can begin. Consequently, this method is slow because of disk input/output operations and replication. Additionally, the Hadoop ecosystem does not include every component needed to complete a big data use case. For an iterative job, one has to stitch together a sequence of MapReduce jobs and execute them in order. Each of these jobs has high latency, and each depends on the completion of the previous stage. Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and an extensive range of libraries. It is a general-purpose framework for distributed computing that offers high performance for both batch and interactive processing. In this paper, we present a close-up view of Apache Spark, its features, and how it works with Hadoop. We briefly discuss Resilient Distributed Datasets (RDDs), their operations, features, and limitations. Spark can be used alongside MapReduce in the same Hadoop cluster or on its own as a processing framework. Finally, the paper presents a comparative analysis of Spark and Hadoop MapReduce.
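
The cost of chaining stages through the file system can be illustrated with a minimal sketch in plain Python. This is not Spark or Hadoop code. The function names `one_pass`, `run_chained_on_disk`, and `run_in_memory` are hypothetical and introduced only for illustration, local JSON files stand in for a distributed file system, and the per-stage computation is an arbitrary placeholder.

```python
import json
import os
import tempfile


def one_pass(values):
    """One hypothetical stage of an iterative job (placeholder arithmetic)."""
    return [v * 2 + 1 for v in values]


def run_chained_on_disk(values, passes, workdir):
    """MapReduce-style chaining (toy model): every stage writes its output
    to a file, and the next stage must read that file before it can start."""
    disk_writes = 0
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(values, f)  # the input itself starts out on disk
    disk_writes += 1

    for i in range(1, passes + 1):
        with open(path) as f:
            current = json.load(f)  # read the previous stage's output
        result = one_pass(current)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(result, f)  # persist before the next stage begins
        disk_writes += 1

    with open(path) as f:
        return json.load(f), disk_writes


def run_in_memory(values, passes):
    """In-memory style (toy model): intermediate results stay in memory
    between passes, so no intermediate files are written."""
    current = list(values)
    for _ in range(passes):
        current = one_pass(current)
    return current, 0


if __name__ == "__main__":
    data = [1, 2, 3]
    with tempfile.TemporaryDirectory() as workdir:
        disk_result, writes = run_chained_on_disk(data, 3, workdir)
    memory_result, _ = run_in_memory(data, 3)
    print(disk_result == memory_result, writes)
```

Both functions return the same values. The chained version performs one file write for the input plus one per stage, which is the kind of per-stage overhead the abstract attributes to multi-pass MapReduce jobs. A real cluster would add replication and network transfer on top of this.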

Citation: "A Close-Up View About Spark in Big Data Jurisdiction." International Journal of Engineering Research and Applications (IJERA), vol. 08, no. 01, 2018, pp. 26-41. IJERA is a UGC-approved journal (Sl. No. 4525, Journal No. 47088), indexed in Crossref, Index Copernicus (ICV 80.82), NASA ADS, Thomson Reuters ResearcherID, and DOAJ.