Academia.edu

Apache Spark Streaming

15 papers
3 followers
About this topic
Apache Spark Streaming is an extension of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It allows for the processing of real-time data using micro-batch processing, integrating seamlessly with Spark's core capabilities for batch processing and machine learning.
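As a rough illustration of the micro-batch idea described above, the following sketch groups timestamped records into fixed-length intervals and runs an ordinary batch function over each group. It is plain Python and does not use the Spark API; the names `micro_batches`, `process_stream`, and `word_count`, the `(arrival time, payload)` record shape, and the one-second interval are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical record shape: (arrival time in seconds, text payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Group events into consecutive fixed-length batches by arrival time."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival, payload in events:
        # Events arriving in [k * interval, (k + 1) * interval) land in batch k.
        batches[int(arrival // interval)].append(payload)
    return dict(batches)


def process_stream(
    events: Iterable[Event],
    interval: float,
    batch_fn: Callable[[List[str]], Dict[str, int]],
) -> List[Tuple[int, Dict[str, int]]]:
    """Apply an ordinary batch function to each micro-batch, in time order."""
    grouped = micro_batches(events, interval)
    return [(index, batch_fn(grouped[index])) for index in sorted(grouped)]


def word_count(lines: List[str]) -> Dict[str, int]:
    """A plain batch computation, reused unchanged for every micro-batch."""
    return dict(Counter(word for line in lines for word in line.split()))


# Made-up sample stream, used only to exercise the sketch.
events = [(0.2, "spark streaming"), (0.9, "spark"), (1.4, "kafka spark"), (3.1, "flink")]
results = process_stream(events, interval=1.0, batch_fn=word_count)
for index, counts in results:
    print(index, counts)
```

The point of the sketch is that the per-batch function knows nothing about streaming, which is what lets a micro-batch engine reuse batch-processing code. A real engine would also handle distribution, fault recovery, and empty intervals, none of which is modeled here.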
Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It... more
Aspect mining constitutes an essential part of delivering concise and, perhaps more importantly, accurately tailored cultural content. With the advent of social media, there is a data abundance so that analytics can be reliably designed... more
In the era of big data, organizations are increasingly relying on robust data pipelines to manage, process, and analyze vast amounts of data in real-time. Apache Kafka and Apache Spark are two of the most prominent tools used in the... more
Emergency evacuation management in urban areas is paramount due to rapid population growth and urbanization. This research introduces an innovative methodology leveraging the Internet of Things (IoT) and Big Data Technologies to fortify... more
The topic of the 2020 DEBS Grand Challenge is to develop a solution for Non-Intrusive Load Monitoring (NILM). Sensors continuously send voltage and current data into a stream processing application that would detect the pattern of power... more
• RDDs provide a fault tolerant implementation of distributed immutable multisets. • Computations are defined as transformations on RDDs. • The set of predefined RDD transformations includes typical higher-order functions from functional... more
The paper addresses a highly relevant and contemporary topic in the field of data processing. Big data is a crucial aspect of modern computing, and the choice of processing framework can significantly impact performance and efficiency.... more
Real world social networks are highly dynamic environments consisting of numerous users and communities, rendering the tracking of their evolution a challenging problem. In this work, we propose a parallel algorithm for tracking dynamic... more
As data permeates all disciplines, the role of big data becomes increasingly important. Sensors, IoT devices, social networks, and online transactions are all generating data that can be monitored constantly to enable a business to... more
This paper addresses the challenge of predicting the level of parallelism in distributed stream processing (DSP) systems, which are essential to deal with different high workload requirements of various industries such as e-commerce,... more
This paper presents a benchmark of stream processing throughput comparing Apache Spark Streaming (under file-, TCP socket-, and Kafka-based stream integration) with a prototype P2P stream processing framework, HarmonicIO. Maximum... more
In this paper, we present a scalable and real-time intelligent transportation system based on a big data framework. The proposed system allows for the use of existing data from road sensors to better understand traffic flow, traveler... more
Business processes represent a cornerstone to the operation of any enterprise. They are the operational means for such organizations to fulfill their goals. Nowadays, enterprises are able to gather massive amounts of event data. These are... more
Stream Processing Engines (SPEs) have to support high data ingestion to ensure quality and efficiency for the end user or system administrator. The data flow processed by an SPE fluctuates over time and requires real-time or near... more
The rapid growth of stream applications in financial markets, health care, education, social media, and sensor networks represents a remarkable milestone for data processing and analytics in recent years, leading to new challenges to... more
Large scale applications nowadays continuously generate massive amounts of data at high speed. Stream processing engines (SPEs) such as Apache Storm and Flink are becoming increasingly popular because they provide reliable platforms to... more
Most high-performance data processing (a.k.a. big data) systems allow users to express their computation using abstractions (like MapReduce), which simplify the extraction of parallelism from applications. Most frameworks, however, do not... more
Asking questions is the driving force for scientific progress. But understanding the answers obtained is just as important as asking the questions. In this way, we are able to verify the sanity of the question itself... more
In recent years, users have come to expect reactivity from their applications, i.e. they assume that changes made by other users are immediately reflected in the interfaces they are using. Examples are shared worksheets and websites... more
Big data processing systems are evolving to be more stream oriented where each data record is processed as it arrives by distributed and low-latency computational frameworks on a continuous basis. As the stream processing technology... more
Big data analytics platforms have played a critical role in the unprecedented success of data-driven applications. However, real-time and streaming data applications, and recent legislation, e.g., GDPR in Europe, have posed constraints on... more
This paper presents an analysis of four online stream processing systems (MillWheel, S4, Spark Streaming and Storm) regarding the strategies they use for fault tolerance. We use this sort of system for processing of data streams that can... more
Index joins are pivotal for efficiency and quality when processing queries over very large data. Hive is a cluster-based big data management engine that is suited to data analysis applications and to OLAP for... more
In this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial and academic prototypes, newly emerging use cases and research trends.... more
Resource management in Distributed Stream Processing Systems (DSPS) defines the way queries are deployed on in-network resources to deliver query results while fulfilling the Quality of Service (QoS) requirements of the end-users. Various... more
Distributed Stream Processing (DSP) systems rely heavily on parallelism mechanisms to deliver high performance in terms of latency and throughput. Yet the development of such parallel systems comes with numerous challenges. In... more
The various processing frameworks that have emerged under big data technologies are a response to the tremendous size and complexity of datasets. Execution speed was much higher with high-performance computing frameworks than with big... more
Event processing (EP) is a data processing technology that conducts online processing of event information. In this survey, we summarize the latest cutting-edge work done on EP from both the industrial and academic research communities... more
Apache Flink is an open-source system for the scalable processing of batch and streaming data. Flink does not natively support efficient processing of spatial data streams, which is the requirement of many applications dealing with... more
Natural hazards result in devastating losses in human life, environmental assets, and personal, regional, and national economies. The availability of different big data such as satellite imagery, Global Positioning System (GPS)... more
Distributed real-time computing has been the domain of practical system engineering for many decades. The development of a discipline of real-time programming would allow the construction of programs with analysable and variable timing... more
More and more use cases require fast, accurate, and reliable processing of large volumes of data. To do this, a distributed stream processing framework is needed which can distribute the load over several machines. In this work, we study... more
The past decade saw an exponential rise in the amount of information available on the World Wide Web. Almost every business organization today uses web based technology to wield its huge client base. Consequently, managing the large data... more
by A. Y. Aidoo and 1 more
An anomaly (deviant objects, exceptions, peculiar objects) is an important concept of the analysis. The volume and velocity of the data within many systems make it difficult to detect and process anomalies for Big Data in real-time. Many... more
Internet of Things (IoT) is a technology paradigm where millions of sensors monitor, and help inform or manage, physical, environmental and human systems in real-time. The inherent closed-loop responsiveness and decision making of IoT... more
In the Big Data era, stream processing has become a common requirement for many data-intensive applications. This has led to many advances in the development and adaptation of large-scale streaming systems. Spark and Flink have become a... more
With the ever-increasing number of IoT devices getting connected, an enormous amount of streaming data is being produced at very high velocity. In order to process this large number of data streams, a variety of stream processing... more
Advances in information technology have facilitated large-volume, high-velocity data and the ability to store data continuously, leading to several computational challenges. Due to the nature of big data in terms of volume, velocity,... more
CloudWave will revolutionise modern cloud infrastructures and tools by enabling agile development and delivery of adaptive cloud services which dynamically adjust to changes in their environment so as to optimise service quality and... more
Many important "big data" applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state... more