
Spark Ecosystem

14 papers
0 followers

About this topic
The Spark Ecosystem refers to a unified analytics platform that enables large-scale data processing and machine learning. It encompasses various components, including Apache Spark, libraries for SQL, streaming, machine learning, and graph processing, facilitating efficient data manipulation and analysis across distributed computing environments.
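For orientation, here is a minimal Scala sketch (assuming Spark 2.x or later, running locally) of how these components share one entry point, the SparkSession; the dataset and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler

object EcosystemSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-ecosystem-sketch")
      .master("local[*]")          // single-machine run; a cluster URL would go here instead
      .getOrCreate()
    import spark.implicits._

    // DataFrame API over a tiny in-memory dataset (hypothetical columns)
    val df = Seq((1.0, 2.0, 0.0), (3.0, 4.0, 1.0)).toDF("x1", "x2", "label")
    df.createOrReplaceTempView("points")

    // Spark SQL over the same data
    spark.sql("SELECT x1, x2 FROM points WHERE label = 1.0").show()

    // An MLlib feature transformer operating on the same DataFrame
    new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(df)
      .show()

    spark.stop()
  }
}
```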

Key research themes

1. How does Apache Spark perform and scale on diverse computing infrastructures including HPC systems and hybrid cloud setups?

This theme explores the behavior, bottlenecks, and scalability challenges of Apache Spark when deployed on High Performance Computing (HPC) systems and hybrid or multi-site cloud environments. It is crucial because Spark’s widespread adoption for big data analytics extends beyond typical datacenter clusters into heterogeneous, distributed infrastructures with distinct hardware and networking characteristics. Understanding these performance implications informs architectural optimizations and broadens Spark’s applicability in scientific and enterprise contexts.

Key finding: This paper identifies that on HPC systems using Lustre, file system metadata access latency dominates Spark’s single-node performance, initially limiting scalability to about 100 cores. By introducing a file pooling layer... Read more
Key finding: This study reveals that when Spark is deployed across geographically distributed hybrid cloud environments with low inter-cluster bandwidth and high latency, job completion time suffers significant overhead mainly from slow... Read more
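As a rough illustration of the kind of deployment tuning these studies examine (not taken from either paper), the sketch below sets a few standard Spark properties that matter on bandwidth-constrained multi-site links or Lustre-backed HPC nodes; the values are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Hedged tuning sketch for a WAN- or Lustre-sensitive deployment.
val spark = SparkSession.builder()
  .appName("wan-aware-tuning-sketch")
  // fewer, larger shuffle partitions can reduce the number of cross-site transfers
  .config("spark.sql.shuffle.partitions", "64")
  // compress shuffle output before it crosses a slow inter-cluster link
  .config("spark.shuffle.compress", "true")
  .config("spark.io.compression.codec", "zstd")
  // wait longer for data-local scheduling before shipping tasks to remote nodes
  .config("spark.locality.wait", "10s")
  // on shared-file-system nodes, point scratch/shuffle space at local storage if available
  .config("spark.local.dir", "/tmp/spark-scratch")
  .getOrCreate()
```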

2. What are the key optimization techniques to improve Spark’s query execution efficiency in big data analytics?

This theme focuses on algorithmic and architectural enhancements within Spark’s query engine designed to reduce the costs associated with stateful operators like shuffle exchange, aggregation, and sorting, which dominate execution time. Improving these operators is critical for Spark to maintain efficiency at scale in complex analytic workloads commonly seen in cloud and enterprise data environments.

Key finding: Introduces a novel exchange placement algorithm that simultaneously minimizes the number of exchange operators and maximizes computation reuse via multi-consumer exchanges, yielding significant reductions in data shuffling... Read more
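The paper's placement algorithm is not reproduced here, but the small sketch below shows the underlying idea of a multi-consumer exchange in stock Spark: pre-partitioning on a shared key lets two aggregations feed off one shuffle, and the built-in exchange-reuse rule (spark.sql.exchange.reuse, enabled by default) can appear as a ReusedExchange node in the physical plan. Column names and data are invented, and the exact plan shape depends on the Spark version and adaptive execution settings:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("exchange-reuse-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 10.0), ("b", 5.0), ("a", 7.0)).toDF("key", "amount")

// One explicit shuffle on the key shared by both downstream aggregations.
val byKey  = sales.repartition($"key")
val totals = byKey.groupBy("key").agg(sum("amount").as("total"))
val avgs   = byKey.groupBy("key").agg(avg("amount").as("avg"))

// Both join inputs are already clustered on "key", so the join itself typically needs no
// further exchange; look for "Exchange hashpartitioning" and "ReusedExchange" in the plan.
totals.join(avgs, "key").explain()
```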

3. How are Spark-based frameworks and implementations utilized and extended for parallel metaheuristics and performance testing in large-scale distributed/cloud environments?

This theme investigates how Spark’s ecosystem supports the development of specialized parallel algorithms—such as metaheuristics for optimization—and comprehensive performance benchmarking suites. It covers frameworks leveraging Spark’s distributed programming model for efficient computations on cloud resources and seeks to characterize Spark’s behavior across diverse workloads and deployment configurations to enable systematic performance evaluation and optimization.

Key finding: The paper demonstrates the feasibility and scalability of implementing parallel Differential Evolution (DE) metaheuristic algorithms on Spark in cloud environments. It compares master-slave and island-based parallelization... Read more
Key finding: Proposes the design and development of a comprehensive Spark-specific performance testing suite to support agile evaluation across core APIs and layered libraries including machine learning, graph processing, SQL, and... Read more
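To make the island-based approach concrete, here is a minimal Differential Evolution sketch over Spark RDDs, written under stated assumptions (a sphere objective, fixed F and CR, migration by random reshuffling between epochs); it illustrates the pattern, not the implementation evaluated in these papers:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object IslandDE {
  type Vec = Array[Double]

  // Hypothetical objective: the sphere function, minimised at the origin.
  def fitness(v: Vec): Double = v.map(x => x * x).sum

  // Evolve one island (one partition) for `gens` generations with classic DE/rand/1/bin.
  def evolveIsland(pop: Vector[Vec], gens: Int, f: Double, cr: Double, rng: Random): Vector[Vec] = {
    var current = pop
    for (_ <- 0 until gens) {
      current = current.zipWithIndex.map { case (target, i) =>
        // pick three distinct individuals other than the target
        val Seq(a, b, c) = rng.shuffle(current.indices.filter(_ != i).toList).take(3).map(current)
        val jRand = rng.nextInt(target.length)   // guarantee at least one mutated dimension
        val trial = target.indices.map { j =>
          if (rng.nextDouble() < cr || j == jRand) a(j) + f * (b(j) - c(j)) else target(j)
        }.toArray
        if (fitness(trial) <= fitness(target)) trial else target
      }
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("island-de-sketch").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    val dim = 10; val popSize = 200; val islands = 4
    val seedRng = new Random(42)
    val initial = Vector.fill(popSize)(Array.fill(dim)(seedRng.nextDouble() * 10 - 5))

    var population = sc.parallelize(initial, islands)
    for (epoch <- 1 to 5) {
      // Each partition is an island evolving independently: 20 generations, F = 0.8, CR = 0.9.
      population = population.mapPartitions { it =>
        evolveIsland(it.toVector, 20, 0.8, 0.9, new Random()).iterator
      }
      // Crude migration step: reshuffle individuals across islands between epochs.
      population = population.repartition(islands)
      println(s"epoch $epoch best fitness = ${population.map(fitness).min()}")
    }
    spark.stop()
  }
}
```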

All papers in Spark Ecosystem

Large-scale datasets are becoming more common, yet they can be challenging to understand and interpret. When dealing with big datasets, principal component analysis (PCA) is used to minimize the dimensionality of the data while... more
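As a concrete reference point, dimensionality reduction of this kind is available out of the box in spark.ml; the sketch below projects a few hand-made 5-dimensional vectors onto two principal components (the data, column names, and k = 2 are assumptions for illustration):

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pca-sketch").master("local[*]").getOrCreate()

// A tiny made-up dataset of 5-dimensional feature vectors.
val data = Seq(
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),
  Vectors.dense(6.0, 1.0, 9.0, 8.0, 0.0)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

// Fit PCA and project onto the top 2 principal components.
val model = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)

model.transform(df).select("pcaFeatures").show(false)
```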
Many organizations are shifting to a data management paradigm called the "Lakehouse," which implements the functionality of structured data warehouses on top of unstructured data lakes. This presents new challenges for query execution... more
Bioinformatics is an emerging interdisciplinary research area that deals with the computational management and analysis of biological information. Genomics, the most important domain in bioinformatics, compares genomic features... more
Motor development is an important factor affecting physical, psychological and social health in both childhood and adulthood. It is important to develop motor skills starting from childhood and to participate in a variety... more
In the field of network security, processing and analyzing huge amounts of Packet CAPture (PCAP) data is of utmost importance for developing and monitoring the behavior of networks and for intrusion detection and prevention... more
Usage of big data related to the medical field is gaining popularity in healthcare services and clinical research. The medical field is one of the largest areas generating enormous amounts and varieties of data.... more
Big data has attracted considerable attention in recent years. As big data makes its way into companies and businesses, challenges arise in big data analytics. The Apache Spark framework has become very popular for use in distributed data... more
Community detection is an important research topic in graph analytics that has a wide range of applications. A variety of static community detection algorithms and quality metrics were developed in the past few years. However, most... more
The nuclear industry is experiencing a steady increase in maintenance costs even though plants are maintained under high levels of safety, capability and reliability. Nuclear power plants are expected to run every unit at maximum capacity... more
Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache as well as allocating a suitable... more
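The caching decision described above looks roughly like the following in application code; the input path, parsing logic, and iteration count are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input and parsing step.
val base = sc.textFile("hdfs:///data/ratings.csv")
  .map(_.split(","))
  .filter(_.length == 3)

// Reused in every iteration, so keep it in memory instead of re-reading and re-parsing.
val cached = base.persist(StorageLevel.MEMORY_ONLY)

var total = 0L
for (_ <- 1 to 10) {
  // some per-iteration computation over the same cached dataset
  total += cached.count()
}

cached.unpersist()   // free executor memory once the iterative phase is done
```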
Recently, efforts have been made to bring together the areas of high-performance computing (HPC) and massive data processing (Big Data). Traditional HPC frameworks, like COMPSs, are mostly task-based, while popular big-data environments,... more
The complexity of Big Data analytics has long outreached the capabilities of current platforms, which fail to efficiently cope with the data and task heterogeneity of modern workflows due to their adhesion to a single data and/or compute... more
Acknowledgement and Disclaimer This publication is based upon work from the COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications (cHiPSet), supported by COST (European Cooperation in Science and... more
Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this approach, the query optimizer divides the query space into clusters. However,... more
There is a need to integrate SQL processing with more advanced machine learning (ML) analytics to drive actionable insights from large volumes of data. As a first step towards this integration, we study how to efficiently connect big SQL... more
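The paper studies a dedicated connector between engines, which is not reproduced here; as a baseline illustration, the sketch below keeps both the SQL step and the ML step inside a single Spark job, with table and column names invented for the example:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-to-ml-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for a warehouse table; in practice this might be a Hive or JDBC source.
Seq((35.0, 52000.0, 1.0), (22.0, 18000.0, 0.0), (48.0, 91000.0, 1.0))
  .toDF("age", "income", "label")
  .createOrReplaceTempView("customers")

// SQL handles the relational part of the work...
val training = spark.sql("SELECT age, income, label FROM customers WHERE income > 10000")

// ...and the same DataFrame flows into the ML part without leaving Spark.
val features = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
  .transform(training)

val model = new LogisticRegression().setMaxIter(10).fit(features)
println(s"coefficients: ${model.coefficients}")
```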
The hardware in the UMBC High Performance Computing Facility (HPCF) is supported by the U.S. National Science Foundation through the MRI program (grant nos. CNS–0821258, CNS–1228778, and OAC–1726023) and the SCREMS program (grant no.... more
Lakehouse systems have reached in the past few years unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization... more
Healthcare informatics is undergoing a revolution because of the availability of safe, wearable sensors at low cost. Smart hospitals have exploited the development of the Internet of Things (IoT) sensors to create Remote Patients... more
Transactions are performed on a database according to the requirements of the application, and these transactions must maintain consistency. This paper explains various transaction techniques for maintaining consistency.
Interventions that can successfully alter the trajectory toward obesity among high-risk children are critical if we are to effectively address this public health crisis. The goal of this pilot study was to implement and evaluate an... more
I would like to express sincere thanks to my supervisor Ing. Adam Šenk for his helpful advice and comments that helped me to finish this master's thesis. Also I would like to thank Prof. Dr. Wolfgang Benn and Johannes Fliege from... more
We introduce GraphFlow, a big graph framework that is able to encode complex data science experiments as a set of high-level workflows. GraphFlow combines the Spark big data processing platform and the Galaxy workflow management system to... more
This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The... more
Thanks to the huge amount of sequenced data that is becoming available, building scalable solutions for supporting query processing and data analysis over genomics datasets is increasingly important. This paper presents GDMS, a scalable... more
Smart cities use digital technologies such as cloud computing, Internet of Things, or open data in order to overcome limitations of traditional representation and exchange of geospatial data. This concept ensures a significant increase in... more
Clustering is a fundamental task in Knowledge Discovery and Data mining. It aims to discover the unknown nature of data by grouping together data objects that are more similar. While hundreds of clustering algorithms have been proposed,... more
The latest technological advances have allowed the development of smart home systems that establish a connection between humans and the devices that surround them, whether living at home or working in fully automated companies. While these... more
Today, environment monitoring has become important for ensuring a safe and healthy life. Monitoring requirements differ greatly depending on the environment, leading to ad hoc deployments that need adaptability. The... more
Big data is the name now used ubiquitously for the distributed paradigm on the web. As the name points out, it refers to collections of very large amounts of data, in petabytes, exabytes, etc., together with the related systems and the algorithms... more
The entry point into all functionality in Spark SQL is the SQLContext.
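That sentence reflects the Spark 1.x API, where an SQLContext is built on top of a SparkContext; from Spark 2.0 onward, SparkSession subsumes this role. A short sketch, with the file path taken from Spark's bundled examples:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("sqlcontext-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)   // superseded by SparkSession in Spark 2.x+

// DataFrames and SQL queries are issued through this single entry point.
val people = sqlContext.read.json("examples/src/main/resources/people.json")
people.createOrReplaceTempView("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```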
This study was conducted on the assumption that the Spark ML package has much better performance and accuracy than the Spark MLlib package in dealing with big data. The dataset used in the comparison is of bank customer transactions. The... more
The emerging phenomenon called "Big Data" is pushing numerous changes in businesses and several other organizations, domains, fields, and areas. Many of them are struggling just to manage the massive data sets. Big data management is... more
This paper provides an analysis of the features provided by existing parallel-design-pattern-based programming systems. The objective of this paper is to examine the features required to exploit parallelism with ease on multicore architectures.