
Spark Ecosystem

14 papers
0 followers

About this topic
The Spark Ecosystem refers to a unified analytics platform that enables large-scale data processing and machine learning. It encompasses various components, including Apache Spark, libraries for SQL, streaming, machine learning, and graph processing, facilitating efficient data manipulation and analysis across distributed computing environments.
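For orientation, here is a minimal Scala sketch (assuming Spark 2.x or later, running locally) of how these components share one entry point, the SparkSession; the dataset and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

object EcosystemSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-ecosystem-sketch")
      .master("local[*]")          // single-machine run; a cluster URL would go here instead
      .getOrCreate()
    import spark.implicits._

    // DataFrame API over a tiny in-memory dataset (hypothetical columns)
    val df = Seq((1.0, 2.0, 0.0), (3.0, 4.0, 1.0)).toDF("x1", "x2", "label")
    df.createOrReplaceTempView("points")

    // Spark SQL over the same data
    spark.sql("SELECT x1, x2 FROM points WHERE label = 1.0").show()

    // An MLlib feature transformer operating on the same DataFrame
    new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(df)
      .show()

    spark.stop()
  }
}
```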

Key research themes

1. How does Apache Spark perform and scale on diverse computing infrastructures including HPC systems and hybrid cloud setups?

This theme explores the behavior, bottlenecks, and scalability challenges of Apache Spark when deployed on High Performance Computing (HPC) systems and hybrid or multi-site cloud environments. It is crucial because Spark’s widespread adoption for big data analytics extends beyond typical datacenter clusters into heterogeneous, distributed infrastructures with distinct hardware and networking characteristics. Understanding these performance implications informs architectural optimizations and broadens Spark’s applicability in scientific and enterprise contexts.

Key finding: This paper identifies that on HPC systems using Lustre, file system metadata access latency dominates Spark’s single-node performance, initially limiting scalability to about 100 cores. By introducing a file pooling layer... Read more
Key finding: This study reveals that when Spark is deployed across geographically distributed hybrid cloud environments with low inter-cluster bandwidth and high latency, job completion time suffers significant overhead mainly from slow... Read more
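As a rough illustration of the kind of deployment tuning these studies examine (not taken from either paper), the sketch below sets a few standard Spark properties that matter on bandwidth-constrained multi-site links or Lustre-backed HPC nodes; the values are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Hedged tuning sketch for a WAN- or Lustre-sensitive deployment.
val spark = SparkSession.builder()
  .appName("wan-aware-tuning-sketch")
  // fewer, larger shuffle partitions can reduce the number of cross-site transfers
  .config("spark.sql.shuffle.partitions", "64")
  // compress shuffle output before it crosses a slow inter-cluster link
  .config("spark.shuffle.compress", "true")
  .config("spark.io.compression.codec", "zstd")
  // wait longer for data-local scheduling before shipping tasks to remote nodes
  .config("spark.locality.wait", "10s")
  // on shared-file-system nodes, point scratch/shuffle space at local storage if available
  .config("spark.local.dir", "/tmp/spark-scratch")
  .getOrCreate()
```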

2. What are the key optimization techniques to improve Spark’s query execution efficiency in big data analytics?

This theme focuses on algorithmic and architectural enhancements within Spark’s query engine designed to reduce the costs associated with stateful operators like shuffle exchange, aggregation, and sorting, which dominate execution time. Improving these operators is critical for Spark to maintain efficiency at scale in complex analytic workloads commonly seen in cloud and enterprise data environments.

Key finding: Introduces a novel exchange placement algorithm that simultaneously minimizes the number of exchange operators and maximizes computation reuse via multi-consumer exchanges, yielding significant reductions in data shuffling... Read more
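The paper's placement algorithm is not reproduced here, but the small sketch below shows the underlying idea of a multi-consumer exchange in stock Spark: pre-partitioning on a shared key lets two aggregations feed off one shuffle, and the built-in exchange-reuse rule (spark.sql.exchange.reuse, enabled by default) can appear as a ReusedExchange node in the physical plan. Column names and data are invented, and the exact plan shape depends on the Spark version and adaptive execution settings:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("exchange-reuse-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 10.0), ("b", 5.0), ("a", 7.0)).toDF("key", "amount")

// One explicit shuffle on the key shared by both downstream aggregations.
val byKey  = sales.repartition($"key")
val totals = byKey.groupBy("key").agg(sum("amount").as("total"))
val avgs   = byKey.groupBy("key").agg(avg("amount").as("avg"))

// Both join inputs are already clustered on "key", so the join itself typically needs no
// further exchange; look for "Exchange hashpartitioning" and "ReusedExchange" in the plan.
totals.join(avgs, "key").explain()
```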

3. How are Spark-based frameworks and implementations utilized and extended for parallel metaheuristics and performance testing in large-scale distributed/cloud environments?

This theme investigates how Spark’s ecosystem supports the development of specialized parallel algorithms—such as metaheuristics for optimization—and comprehensive performance benchmarking suites. It covers frameworks leveraging Spark’s distributed programming model for efficient computations on cloud resources and seeks to characterize Spark’s behavior across diverse workloads and deployment configurations to enable systematic performance evaluation and optimization.

Key finding: The paper demonstrates the feasibility and scalability of implementing parallel Differential Evolution (DE) metaheuristic algorithms on Spark in cloud environments. It compares master-slave and island-based parallelization... Read more
Key finding: Proposes the design and development of a comprehensive Spark-specific performance testing suite to support agile evaluation across core APIs and layered libraries including machine learning, graph processing, SQL, and... Read more
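To make the island-based approach concrete, here is a minimal Differential Evolution sketch over Spark RDDs, written under stated assumptions (a sphere objective, fixed F and CR, migration by random reshuffling between epochs); it illustrates the pattern, not the implementation evaluated in these papers:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object IslandDE {
  type Vec = Array[Double]

  // Hypothetical objective: the sphere function, minimised at the origin.
  def fitness(v: Vec): Double = v.map(x => x * x).sum

  // Evolve one island (one partition) for `gens` generations with classic DE/rand/1/bin.
  def evolveIsland(pop: Vector[Vec], gens: Int, f: Double, cr: Double, rng: Random): Vector[Vec] = {
    var current = pop
    for (_ <- 0 until gens) {
      current = current.zipWithIndex.map { case (target, i) =>
        // pick three distinct individuals other than the target
        val Seq(a, b, c) = rng.shuffle(current.indices.filter(_ != i).toList).take(3).map(current)
        val jRand = rng.nextInt(target.length)   // guarantee at least one mutated dimension
        val trial = target.indices.map { j =>
          if (rng.nextDouble() < cr || j == jRand) a(j) + f * (b(j) - c(j)) else target(j)
        }.toArray
        if (fitness(trial) <= fitness(target)) trial else target
      }
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("island-de-sketch").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    val dim = 10; val popSize = 200; val islands = 4
    val seedRng = new Random(42)
    val initial = Vector.fill(popSize)(Array.fill(dim)(seedRng.nextDouble() * 10 - 5))

    var population = sc.parallelize(initial, islands)
    for (epoch <- 1 to 5) {
      // Each partition is an island evolving independently: 20 generations, F = 0.8, CR = 0.9.
      population = population.mapPartitions { it =>
        evolveIsland(it.toVector, 20, 0.8, 0.9, new Random()).iterator
      }
      // Crude migration step: reshuffle individuals across islands between epochs.
      population = population.repartition(islands)
      println(s"epoch $epoch best fitness = ${population.map(fitness).min()}")
    }
    spark.stop()
  }
}
```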

All papers in Spark Ecosystem

Large-scale datasets are becoming more common, yet they can be challenging to understand and interpret. When dealing with big datasets, principal component analysis (PCA) is used to minimize the dimensionality of the data while... more
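As a concrete reference point, dimensionality reduction of this kind is available out of the box in spark.ml; the sketch below projects a few hand-made 5-dimensional vectors onto two principal components (the data, column names, and k = 2 are assumptions for illustration):

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pca-sketch").master("local[*]").getOrCreate()

// A tiny made-up dataset of 5-dimensional feature vectors.
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(6.0, 1.0, 9.0, 8.0, 0.0)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

// Fit PCA and project onto the top 2 principal components.
val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)

model.transform(df).select("pcaFeatures").show(false)
```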
Many organizations are shifting to a data management paradigm called the "Lakehouse," which implements the functionality of structured data warehouses on top of unstructured data lakes. This presents new challenges for query execution... more
Bioinformatics is an emerging interdisciplinary research area that deals with the computational management and analysis of biological information. Genomics, the most important domain in bioinformatics, compares genomic features... more
Motor development is an important factor affecting physical, psychological and social health in both childhood and adulthood. It is important to develop motor skills starting from childhood and to participate in a variety... more
In the field of network security, processing and analyzing huge amounts of Packet CAPture (PCAP) data is of utmost importance for developing and monitoring the behavior of networks and for intrusion detection and prevention... more
Usage of big data related to the medical field is gaining popularity in healthcare services and clinical research. The medical field is one of the largest areas generating enormous amounts and varieties of data.... more
Big data has attracted considerable attention in recent years. As big data makes its way into companies and businesses, challenges arise in big data analytics. The Apache Spark framework has become very popular for use in distributed data... more
Community detection is an important research topic in graph analytics that has a wide range of applications. A variety of static community detection algorithms and quality metrics were developed in the past few years. However, most... more
The nuclear industry is experiencing a steady increase in maintenance costs even though plants are maintained under high levels of safety, capability and reliability. Nuclear power plants are expected to run every unit at maximum capacity... more
Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache as well as allocating a suitable... more
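The caching decision described above looks roughly like the following in application code; the input path, parsing logic, and iteration count are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input and parsing step.
val base = sc.textFile("hdfs:///data/ratings.csv")
  .map(_.split(","))
  .filter(_.length == 3)

// Reused in every iteration, so keep it in memory instead of re-reading and re-parsing.
val cached = base.persist(StorageLevel.MEMORY_ONLY)

var total = 0L
for (_ <- 1 to 10) {
  // some per-iteration computation over the same cached dataset
  total += cached.count()
}

cached.unpersist()   // free executor memory once the iterative phase is done
```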
Recently, efforts have been made to bring together the areas of high-performance computing (HPC) and massive data processing (Big Data). Traditional HPC frameworks, like COMPSs, are mostly task-based, while popular big-data environments,... more
The complexity of Big Data analytics has long outreached the capabilities of current platforms, which fail to efficiently cope with the data and task heterogeneity of modern workflows due to their adhesion to a single data and/or compute... more
Acknowledgement and Disclaimer This publication is based upon work from the COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet), supported by COST (European Cooperation in Science and... more
Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this approach, the query optimizer divides the query space into clusters. However,... more
There is a need to integrate SQL processing with more advanced machine learning (ML) analytics to drive actionable insights from large volumes of data. As a first step towards this integration, we study how to efficiently connect big SQL... more
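The paper studies a dedicated connector between engines, which is not reproduced here; as a baseline illustration, the sketch below keeps both the SQL step and the ML step inside a single Spark job, with table and column names invented for the example:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-to-ml-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for a warehouse table; in practice this might be a Hive or JDBC source.
Seq((35.0, 52000.0, 1.0), (22.0, 18000.0, 0.0), (48.0, 91000.0, 1.0))
  .toDF("age", "income", "label")
  .createOrReplaceTempView("customers")

// SQL handles the relational part of the work...
val training = spark.sql("SELECT age, income, label FROM customers WHERE income > 10000")

// ...and the same DataFrame flows into the ML part without leaving Spark.
val features = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
  .transform(training)

val model = new LogisticRegression().setMaxIter(10).fit(features)
println(s"coefficients: ${model.coefficients}")
```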
The hardware in the UMBC High Performance Computing Facility (HPCF) is supported by the U.S. National Science Foundation through the MRI program (grant nos. CNS–0821258, CNS–1228778, and OAC–1726023) and the SCREMS program (grant no.... more
Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization... more
Healthcare informatics is undergoing a revolution because of the availability of safe, wearable sensors at low cost. Smart hospitals have exploited the development of the Internet of Things (IoT) sensors to create Remote Patients... more
Transactions are performed on a database according to the requirements of the application, and these transactions must maintain consistency. This paper explains various transaction techniques for maintaining consistency.
Interventions that can successfully alter the trajectory toward obesity among high-risk children are critical if we are to effectively address this public health crisis. The goal of this pilot study was to implement and evaluate an... more
I would like to express sincere thanks to my supervisor Ing. Adam Šenk for his helpful advice and comments that helped me to finish this master's thesis. Also I would like to thank Prof. Dr. Wolfgang Benn and Johannes Fliege from... more
We introduce GraphFlow, a big graph framework that is able to encode complex data science experiments as a set of high-level workflows. GraphFlow combines the Spark big data processing platform and the Galaxy workflow management system to... more
This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The... more
Thanks to the huge amount of sequenced data that is becoming available, building scalable solutions for supporting query processing and data analysis over genomics datasets is increasingly important. This paper presents GDMS, a scalable... more
Smart cities use digital technologies such as cloud computing, Internet of Things, or open data in order to overcome limitations of traditional representation and exchange of geospatial data. This concept ensures a significant increase in... more
Clustering is a fundamental task in Knowledge Discovery and Data mining. It aims to discover the unknown nature of data by grouping together data objects that are more similar. While hundreds of clustering algorithms have been proposed,... more
The latest technological advances have allowed the development of smart home systems that establish a connection between humans and the devices that surround them, whether living at home or working in fully automated companies. While these... more
Today, environment monitoring has become important for ensuring a safe and healthy life. Monitoring requirements differ greatly depending on the environment, leading to ad hoc deployments that need adaptability. The... more
Big data is the name now used ubiquitously for the distributed paradigm on the web. As the name points out, it refers to collections of very large amounts of data, in petabytes, exabytes, etc., together with the related systems and the algorithms... more
The entry point into all functionality in Spark SQL is the SQLContext.
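That sentence reflects the Spark 1.x API, where an SQLContext is built on top of a SparkContext; from Spark 2.0 onward, SparkSession subsumes this role. A short sketch, with the file path taken from Spark's bundled examples:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("sqlcontext-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)   // superseded by SparkSession in Spark 2.x+

// DataFrames and SQL queries are issued through this single entry point.
val people = sqlContext.read.json("examples/src/main/resources/people.json")
people.createOrReplaceTempView("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```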
This study was conducted on the assumption that the Spark ML package has much better performance and accuracy than the Spark MLlib package in dealing with big data. The dataset used in the comparison is of bank customer transactions. The... more
The emerging phenomenon called "Big Data" is pushing numerous changes in businesses and several other organizations, domains, fields, and areas. Many of them are struggling just to manage the massive data sets. Big data management is... more
This paper provides an analysis of the features provided by existing parallel-design-pattern-based programming systems. The objective of this paper is to examine the features required to exploit parallelism with ease on multicore architectures.