Academia.edu

Hadoop Framework

19 papers
0 followers
About this topic
The Hadoop Framework is an open-source software platform designed for distributed storage and processing of large data sets across clusters of computers using simple programming models. It enables scalable, fault-tolerant data management and analytics, primarily through its core components: the Hadoop Distributed File System (HDFS) and the MapReduce processing engine.

Key research themes

1. How can the Hadoop ecosystem be optimized for scalable and efficient big data storage and processing?

This research area focuses on the architectural and configuration aspects of Hadoop and its key components, the Hadoop Distributed File System (HDFS) and MapReduce, to improve fault tolerance, data locality, replication strategies, and performance in large-scale distributed storage and computation environments. Understanding these optimizations is crucial for enabling Hadoop to reliably process petabyte-scale datasets on commodity hardware with high throughput.

Key finding: This study empirically demonstrated that increasing the replication factor for 'hot' or frequently accessed data blocks in HDFS improves data availability and locality, significantly reducing job execution time. The...
Key finding: The paper presented a comprehensive setup and operational guide for Hadoop 3.1.1, detailing the critical role of HDFS in dividing datasets into blocks and managing these via NameNode and DataNode roles. It highlighted that...
Key finding: This work detailed the core architecture of Hadoop, emphasizing the centralized NameNode managing filesystem namespace and metadata, and the distributed DataNodes storing actual data in replicated blocks. It underscored how...
Key finding: Through comparative analysis, this paper elucidated the architectural distinctions and deployment considerations between single-node and multi-node Hadoop clusters. It showed multi-node clusters utilize HDFS's distributed...
Key finding: This experimental benchmark study highlighted that managed Hadoop services (Hadoop-on-PaaS) differ significantly in resource utilization and performance despite similar nominal configurations. It revealed that proprietary...
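
The block and replication ideas in the findings above can be illustrated with a small simulation. This is a minimal sketch, not HDFS code: the function names (`split_into_blocks`, `place_replicas`, `surviving_blocks`) and the DataNode labels are hypothetical, and the round-robin placement is a deliberate simplification that stands in for HDFS's actual rack-aware placement policy. The 128 MiB block size and replication factor of 3 are the usual HDFS defaults.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the usual HDFS default block size
REPLICATION = 3                 # the usual HDFS default replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes occupies."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    # The last block holds only the leftover bytes; it is not padded.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simplified round-robin).

    Assumption: real HDFS placement is rack-aware; this is only an illustration.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    n = len(datanodes)
    return {
        block: [datanodes[(block + r) % n] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [b for b, nodes in placement.items() if any(n not in failed for n in nodes)]


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MiB file
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    print(len(blocks), "blocks:", placement)
    print("readable after losing dn1 and dn2:", surviving_blocks(placement, ["dn1", "dn2"]))
```

With three replicas on four nodes, every block stays readable after any two node failures, which is the fault-tolerance property the findings describe. Raising the replication factor for frequently read blocks would add more nodes from which a task can read locally, at the cost of extra storage.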

2. What advances in SQL and query engines have improved interactive and high-performance analytics on Hadoop?

This theme investigates the integration and performance of SQL-on-Hadoop engines designed to enable interactive, low-latency, high-concurrency analytics directly on Hadoop data. Focusing on systems like Impala, this research area is significant because traditional batch frameworks like Apache Hive lack the latency and concurrency levels required for many BI and analytic workloads. Improvements in front-end optimizers, execution engines, and resource management constitute key enablers of scalable SQL processing over big data stored in Hadoop.

Key finding: The paper described Impala’s architecture as a massively parallel processing (MPP) SQL engine designed from the ground up for Hadoop, with distributed daemons co-located on data nodes to minimize latency and maximize...
Key finding: This study applied Hive, a SQL-like distributed data warehouse running on Hadoop, to perform querying and predictive analytics on large COVID-19 datasets. Empirical evaluation showed that Hive on Hadoop outperformed...
Key finding: Through case studies, this work empirically compared Hadoop-based data analytics tools against traditional DBMS and statistical tools, demonstrating Hadoop’s superior performance on large data sizes. The research also...

3. Which distributed computing frameworks beyond MapReduce are promising for overcoming big data analysis challenges in Hadoop environments?

This research area reviews the limitations of MapReduce-based frameworks such as Hadoop MapReduce in handling contemporary big data analysis tasks, especially those requiring complex, iterative, or memory-efficient computations. It also investigates alternative distributed computing frameworks that can reduce I/O overhead, enable scalability beyond memory constraints, and support serial algorithms. Exploring such frameworks is vital for evolving big data analytics to handle ever-growing data volume and complexity effectively.

Key finding: The survey identified that while Hadoop MapReduce is the industrial quasi-standard for big data, its high I/O and communication costs, memory scalability limits, and inability to run many serial algorithms restrict...
Key finding: This performance study deployed Apache Hadoop and MapReduce on the Blue Waters supercomputer, uncovering challenges in adapting MapReduce paradigms to HPC environments with fine-grained or coarse-grained parallelism. The...
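
For readers unfamiliar with the programming model these findings critique, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, using word count as the example. It is an illustration only and does not use any Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are hypothetical, and everything runs in memory. In Hadoop MapReduce the intermediate key-value pairs are written to disk and moved across the network between phases, which is one source of the I/O and communication costs mentioned above.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values, in sorted key order."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def run_job(records, mapper, reducer):
    """Run the three phases in sequence (in memory, in a single process)."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["big data on Hadoop", "Hadoop stores big data"]
    print(run_job(lines, word_count_mapper, sum_reducer))
```

An iterative algorithm would have to call `run_job` once per iteration. On a real cluster each such call rereads its input from and writes its output to distributed storage, which is why in-memory frameworks are attractive for iterative workloads.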

All papers in Hadoop Framework

Transformers owe their success to using both Feedforward Neural Networks and Scaled Dot Product Attention in one model to both represent and filter unwanted information. Most attention-based Deep Learning multimodal fusion models use...
Adverse conditions within specific offshore environments magnify the challenges faced by a vessel’s energy-efficiency optimization in the Industry 4.0 era. As the data rate and volume increase, the analysis of big data using analytical...
Sentiment Analysis plays a vital role in Natural Language Processing (NLP) which aims to discern opinions and emotions expressed in text. However, the data sparsity and disambiguation of natural languages make it challenging for the...
The COVID-19 pandemic has received serious attention from academia, industry and governments to stop the huge number of deaths and economic disruptions around the world. Many techniques have been used to control the spread of the pandemic...
Real-time data processing has become increasingly important in today's data-driven world, where organizations need to quickly analyze and respond to incoming data to maintain a competitive edge. PySpark, an open-source distributed...
The amount of data generated due to the development of IT technology is increasing exponentially every year. In response, research on distributed systems and in-memory-based big data processing techniques is being actively...
Big Data (BD) is associated with a new generation of technologies and architectures which can harness the value of extremely large volumes of very varied data through real time processing and analysis. It involves changes in (1) data...
The purpose of this study is to analyze and forecast the crude oil prices in India using Autoregressive Integrated Moving Average (ARIMA) model of time series analysis. The report tables, charts, and ARIMA model are used to forecast the...
Opinion Mining (OM) is a field of Natural Language Processing (NLP) that aims to capture human sentiment in the given text. With the ever-spreading of online purchasing websites, micro-blogging sites, and social media platforms, OM in...
At present, with the growing number of Web 2.0 platforms such as Instagram, Facebook, and Twitter, users honestly communicate their opinions and ideas about events, services, and products. Owing to this rise in the number of social...
Twitter, a popular social media platform, has become a rich source of user-generated content. The classification of Twitter users based on their characteristics and behavior has gained significant attention. Deep learning techniques, with...
21st-century population management is a major challenge for all of us. The current population of India in 2022 is 1,406,631,776, a 0.95% increase from 2021. It is obvious that people need more houses in this situation in order to...
Massive volumes of multidimensional array-based spatiotemporal data are generated by climate observations and model simulations. The growth in climate data leads to new opportunities for climate studies at multiple spatial and temporal...
In this paper we investigate the use of a multimodal feature learning approach, using neural network based models such as Skip-gram and Denoising Autoencoders, to address sentiment analysis of micro-blogging content, such as Twitter short...
Context is vital in formulating intelligent classifications and responses, especially under uncertainty. In a standard feed-forward neural network (FFNN), context comes in the form of information encoded in the input vector and trained in...
The increasing demand for information and rapid growth of big data has dramatically increased textual data. The amount of different kinds of data has led to the overloading of information. For obtaining useful text information, the...
Increasing demands for information and the rapid growth of big data have dramatically increased the amount of textual data. In order to obtain useful text information, the classification of texts is considered an imperative task....
Emotion processing has been a very intense domain of investigation in data analysis and NLP during the previous few years. Currently, deep neural network algorithms have been applied to opinion mining tasks with good results....
According to the published reports and studies, the symptoms of the disease caused by the COVID-19 virus have not yet been fully determined. It is a major stress on clinicians to make a correct and consistent decision about whether to...
In the modern era of information technology, an enormous volume of data is generated every moment. Big data is a phrase that refers to data sets that are not only big or massive but also have velocity,...
Twitter is one of the social media platforms that has evolved into an incredible environment for users to communicate with friends and other users to trade thoughts, videos, and photographs that reflect their present mood. Using social...
Currently, remote sensing is widely used in environmental monitoring applications, mostly air quality mapping and climate change supervision. However, satellite sensors produce massive volumes of data in near-real-time, stored in multiple...
Annually, over three million people in North America suffer concussions. Every age group is susceptible to concussion, but youth involved in sporting activities are particularly vulnerable, with about 6% of all youth suffering a...
Sentiment analysis is the application of natural language processing to understand the opinions or views of the public on various topics. This is also popularly known as opinion mining; the system collects, analyses and examines the...
This paper presents a novel 1-D sentiment classifier trained on the benchmark IMDB dataset. The classifier is a 1-D convolutional neural network with repeated convolution and max pooling layers. The main contribution of this work is the...
Sentiment analysis has been an important topic of discussion for two decades, since Lee published his first paper on sentiment analysis in 2002. Apart from sentiment analysis in English, it has spread its wings to other natural...
The government is seeking preventive steps to reduce the risk of the spread of COVID-19, one of which is social restrictions that have become popular with social distancing and physical distancing. One way to assess whether the steps...
This paper presents an algorithm based on fuzzy logic, devised to identify emotions in corpora of literary texts, called Fuzzy Logic Emotions (FLE) classifier. This algorithm evaluates a sentence to define the class(es) of emotions to...
As part of the Social Media Mining for Health Applications (SMM4H) Shared Task 2020, our team participated in task 2, the automatic classification of tweets that mention adverse events associated with medication use. Our general...
Text document classification is an important task for diverse natural language processing based applications. Traditional machine learning approaches mainly focused on reducing dimensionality of textual data to perform classification....
Sentiment analysis using stemmed Twitter data from various languages is an emerging research topic. In this paper, we address three data augmentation techniques namely Shift, Shuffle, and Hybrid to increase the size of the training data;...
This paper solves the problem which arises in the production of crops by analyzing the various factors using data mining techniques. This system gathers information about the crops that are cultivated from different places around the...
Sentiment detection of Arabic tweets is an interesting research topic and it enables scholars to analyze huge resources of shared opinions in social media websites such as Facebook and Twitter. It is one of the more complex natural language...
Energy and security remain the two main challenges in Wireless Sensor Networks (WSNs). Therefore, protecting WSNs from Denial of Service (DoS) and Distributed DoS (DDoS) attacks is one of the WSN security tasks. Traditional...
There is a need to extract meaningful information from big data, classify it into different categories, and predict end-user behavior or emotions. Large amounts of data are generated from various sources such as social media and websites....
Broadly, three machine learning classification algorithms are used to discover correlations, hidden patterns, and other useful information from different data sets known as big data. Today, Twitter, Facebook, Instagram, and many other...
Online reviews and feedback on a product play a vital role in people's tendency to purchase it. To affect product sales, spammers generate fake reviews on online social media platforms. To identify spam reviews and spammer...
In the era of Web 2.0, online forums, blogs and Twitter are becoming primary sources for sharing views, opinions and comments about different topics. Classifying these views, opinions and comments is known as sentiment analysis which is...
This paper explores the combination of two deep learning techniques, convolutional neural networks (CNN) and long short-term memory recurrent neural networks (LSTM-RNN), as a hybrid approach to sentence classification. The...
Analysis and deciphering code-mixed data is imperative in academia and industry, in a multilingual country like India, in order to solve problems apropos Natural Language Processing. This paper proposes a bidirectional long short-term...
The increase in COVID-19 positive patients in Indonesia, especially in West Java, is unpredictable, resulting in unpreparedness in dealing with COVID-19 cases. People in monitoring and patients under supervision are the category that is...
A vast amount of data is generated every second by microblogs, content sharing via social media sites, and social networking. Twitter is a popular microblog where people voice their opinions about daily issues. Recently,...