Papers by Rathinaraja Jeyaraj
Smart data processing for energy harvesting systems using artificial intelligence
Nano Energy

IEEE Access
Consuming Hadoop MapReduce via virtual infrastructure as a service is becoming common practice as cloud service providers (CSPs) offer relevant applications and scalable resources. One of the predominant requirements of cloud users is to improve resource utilization in the virtual cluster during the service period. However, this may not be possible when MapReduce workloads and virtual machines (VMs) are highly heterogeneous. In this paper, we address these heterogeneities and propose an efficient MapReduce scheduler that improves resource utilization by placing the right combination of map and reduce tasks in each VM in the virtual cluster. To achieve this, we transform the MapReduce task-scheduling problem into a two-dimensional (2D) bin-packing model and obtain an optimal schedule using the ant colony optimization (ACO) algorithm. As an added advantage, our ACO-based bin-packing (ACO-BP) scheduler minimizes the makespan for a batch of jobs. To showcase the performance improvement, we compared the proposed scheduler with three existing schedulers that work well in a heterogeneous environment. As expected, results show that ACO-BP significantly outperforms the existing schedulers when dealing with workload- and VM-level heterogeneities.
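The 2D bin-packing formulation described in the abstract can be sketched as follows. This is an illustrative stand-in only: a simple first-fit-decreasing heuristic replaces the paper's ACO search, and the (cpu, mem) task demands and VM capacities are invented for the example.

```python
# Illustrative 2D bin-packing sketch: each VM is a bin with (cpu, mem)
# capacity and each map/reduce task an item with a (cpu, mem) demand.
# A first-fit-decreasing heuristic stands in for the paper's ACO search;
# all demands and capacities below are invented.

def first_fit_decreasing(tasks, vms):
    """tasks: list of (cpu, mem) demands; vms: list of (cpu, mem) capacities.
    Returns {task index: vm index, or None if the task fits nowhere}."""
    free = [list(c) for c in vms]              # remaining capacity per VM
    order = sorted(range(len(tasks)),
                   key=lambda i: tasks[i][0] + tasks[i][1], reverse=True)
    placement = {}
    for i in order:                            # biggest tasks first
        cpu, mem = tasks[i]
        for v, (fc, fm) in enumerate(free):
            if cpu <= fc and mem <= fm:
                free[v][0] -= cpu
                free[v][1] -= mem
                placement[i] = v
                break
        else:
            placement[i] = None
    return placement
```

An ACO search explores many such placements probabilistically and reinforces low-makespan ones via pheromone trails; the greedy pass above only shows the packing model itself.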

Dynamic Performance Aware Reduce Task Scheduling in MapReduce on Virtualized Environment
2018 IEEE 16th International Conference on Software Engineering Research, Management and Applications (SERA), 2018
Hadoop MapReduce as a service from the cloud is widely used by various research and commercial communities. It is typically offered as a service hosted on a virtualized environment in a cloud data center. The cluster of virtual machines for MapReduce is placed across racks in the data center to achieve fault tolerance. However, this introduces dynamic, heterogeneous performance for virtual machines due to hardware heterogeneity and co-located virtual machines' interference, which causes varying latency for the same task. Alongside, curbing the number of intermediate records and placing reduce tasks on the right virtual node are also important to further minimize MapReduce job latency. In this paper, we introduce a Multi-Level Per Node Combiner to minimize the number of intermediate records and a Dynamic Ranking based MapReduce Job Scheduler to place reduce tasks on the right virtual machine, exploiting the dynamic performance of virtual machines. To experimen...

Multi-level per node combiner (MLPNC) to minimize mapreduce job latency on virtualized environment
Proceedings of the 33rd Annual ACM Symposium on Applied Computing, 2018
Big data has driven businesses and research to become more data-driven. Hadoop MapReduce is one of the cost-effective ways to process huge amounts of data, and it is also offered as a service from the cloud on a cluster of virtual machines (VMs). In a cloud data center (CDC), Hadoop VMs are co-located with other general-purpose VMs across racks. Such multi-tenancy leads to varying local network bandwidth availability for Hadoop VMs, which directly impacts MapReduce job latency, because the shuffle phase in the MapReduce execution sequence itself contributes 26%-70% of overall job latency due to the large number of intermediate records. A Hadoop virtual cluster therefore needs to reserve maximum bandwidth to minimize job latency, but this also increases the bandwidth usage cost. In this paper, we propose the "Multi-Level Per Node Combiner" (MLPNC), which curtails the number of intermediate records in the shuffle phase, reducing overall job latency and minimizing bandwidth usage cost as well. We evaluate MLPNC on a wordcount job against the default combiner and the Per Node Combiner (PNC), and discuss the results in terms of the number of shuffled records, shuffle latency, average merge latency, average reduce latency, average reduce task start time, and overall job latency. Finally, we argue in favor of MLPNC as it achieves up to a 33% reduction in the number of intermediate records and up to a 32% reduction in average job latency compared to PNC.
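A minimal sketch of the per-node combining idea (not the authors' implementation): merging the (word, count) records emitted by the map tasks on one node collapses duplicate keys before the shuffle, so fewer records cross the network; repeating the merge as map waves finish gives the "multi-level" effect. The wordcount records below are invented for illustration.

```python
from collections import Counter

# Sketch of per-node combining for wordcount: merge the intermediate
# (key, value) records from all map tasks on one node before the shuffle,
# so duplicate keys across map tasks collapse into single records.

def combine(map_outputs):
    """map_outputs: list of per-map-task [(key, value), ...] record lists.
    Returns the merged records that would actually be shuffled."""
    merged = Counter()
    for records in map_outputs:
        for key, value in records:
            merged[key] += value
    return sorted(merged.items())

maps = [[("a", 1), ("b", 1), ("a", 1)], [("b", 2), ("c", 1)]]
shuffled = combine(maps)   # 3 records leave the node instead of 5
```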

Concurrency and Computation: Practice and Experience, 2019
Big data is largely influencing business entities and research sectors to be more data-driven. Hadoop MapReduce is one of the cost-effective ways to process large-scale datasets, and it is offered as a service over the Internet. Even though cloud service providers promise an infinite amount of resources available on demand, it is inevitable that some of the hired virtual resources for MapReduce are left unutilized, and makespan is limited, due to the various heterogeneities that exist while offering MapReduce as a service. As MapReduce v2 allows users to define the size of containers for map and reduce tasks, jobs in a batch become heterogeneous and behave differently. Also, virtual machines of different capacities in the MapReduce virtual cluster accommodate varying numbers of map/reduce tasks. These factors highly affect resource utilization in the virtual cluster and the makespan for a batch of MapReduce jobs. Default MapReduce job schedulers do not consider these heterogeneities that exist in a cloud environment. Moreover, virtual machines in the MapReduce virtual cluster process an equal number of blocks regardless of their capacity, which affects the makespan. Therefore, we devised a heuristic-based MapReduce job scheduler that exploits virtual machine and MapReduce workload-level heterogeneities to improve resource utilization and makespan. We propose two methods to achieve this: (i) roulette wheel scheme based data block placement in heterogeneous virtual machines, and (ii) constrained two-dimensional bin packing to place heterogeneous map/reduce tasks. We compared the heuristic-based MapReduce job scheduler against the classical fair scheduler in MapReduce v2. Experimental results show that our proposed scheduler improved makespan and resource utilization by 45.6% and 47.9%, respectively, over the classical fair scheduler.
Keywords: bin packing, heterogeneous workloads/jobs, map/reduce task placement
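The roulette wheel placement in method (i) can be sketched as follows, under the assumption that each VM receives a data block with probability proportional to its capacity, so faster VMs host more blocks; the capacities, block count, and random seed are invented for the example.

```python
import random

# Roulette-wheel data block placement sketch: each VM's slice of the wheel
# is proportional to its capacity, so block counts track VM capacity.

def place_blocks(capacities, n_blocks, rng):
    """Return how many of n_blocks each VM receives."""
    total = float(sum(capacities))
    cumulative, acc = [], 0.0
    for c in capacities:
        acc += c / total
        cumulative.append(acc)
    counts = [0] * len(capacities)
    for _ in range(n_blocks):
        r = rng.random()
        for vm, edge in enumerate(cumulative):
            if r <= edge:
                counts[vm] += 1
                break
        else:                       # guard against float rounding at the top edge
            counts[-1] += 1
    return counts

rng = random.Random(7)              # fixed seed keeps the sketch reproducible
counts = place_blocks([8, 4, 2], 1400, rng)   # VM capacities in ratio 8:4:2
```

With 1400 blocks the counts land near the 800/400/200 split that the 8:4:2 capacities imply.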

MapReduce Scheduler to Minimize the Size of Intermediate Data in Shuffle Phase
2019 IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS), 2019
Hadoop MapReduce is one of the cost-effective ways to process huge amounts of data in this decade. Although it is open-source, setting up Hadoop on-premise is not affordable for small-scale businesses and research entities. Therefore, consuming Hadoop MapReduce as a service from the cloud is increasing in pace, as it is scalable on demand and based on a pay-per-use model. In such a multi-tenant environment, virtual bandwidth is an expensive commodity, and co-located virtual machines compete to make use of it. A study shows that 26%-70% of MapReduce job latency is due to the shuffle phase in the MapReduce execution sequence. The primary expectation of a typical cloud user is to minimize the service usage cost. Allocating less bandwidth to the service costs less but increases job latency and, consequently, makespan. This trade-off can be balanced by minimizing the amount of intermediate data generated in the shuffle phase at the application level. To achieve this, we propose a Time Sharing MapReduce Job Scheduler that minimizes the amount of intermediate data, thereby cutting the service cost. As a by-product, MapReduce job latency and makespan also improve. Results show that our proposed model minimized the size of intermediate data by up to 62.1% compared to classical schedulers with combiners.
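A back-of-envelope sketch of the trade-off this abstract describes, with invented numbers (none come from the paper): shuffle latency falls as hired bandwidth grows, but bandwidth held for longer costs more, so shrinking the intermediate data improves both terms at once.

```python
# Hypothetical cost/latency model of the shuffle-phase trade-off.
# All quantities and the price are invented for illustration.

def shuffle_latency_s(intermediate_gb, bandwidth_gbps):
    """Seconds needed to move the intermediate data across the network."""
    return intermediate_gb * 8 / bandwidth_gbps   # GB -> gigabits

def bandwidth_cost(bandwidth_gbps, seconds, price_per_gbps_s=0.001):
    """Hired bandwidth is billed for the time it is held (hypothetical price)."""
    return bandwidth_gbps * seconds * price_per_gbps_s

baseline_s = shuffle_latency_s(100.0, 1.0)               # 100 GB shuffled
reduced_s = shuffle_latency_s(100.0 * (1 - 0.621), 1.0)  # 62.1% less data
```

Because the data itself shrinks, both the shuffle time and the time the bandwidth must be held drop together, without paying for a larger pipe.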

Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment
Journal of Ambient Intelligence and Humanized Computing, 2020
Big data has overwhelmed industries and research sectors. Reliable decision-making is always a challenging task that requires cost-effective big data processing tools. Hadoop MapReduce is widely used to store and process huge volumes of data in a distributed environment. However, due to the huge capital investment and lack of expertise needed to set up an on-premise Hadoop cluster, big data users seek cloud-based MapReduce service over the Internet, mostly offered on a cluster of virtual machines on a pay-per-use basis. Virtual machines in the MapReduce virtual cluster reside in different physical machines and are co-located with other, non-MapReduce VMs. This causes them to share I/O resources such as disk and network bandwidth, leading to congestion, as most MapReduce jobs are disk- and network-intensive. In particular, the shuffle phase in the MapReduce execution sequence consumes huge network bandwidth in a multi-tenant environment, resulting in increased job latency and bandwidth consumption cost. Therefore, it is essential to minimize the amount of intermediate data in the shuffle phase rather than supply more network bandwidth at increased service cost. With this objective, we extended the multi-level per node combiner to a batch of MapReduce jobs to improve makespan. We observed that makespan improves by up to 32.4% by minimizing the amount of intermediate data in the shuffle phase, compared to classical schedulers with default combiners.

The Journal of Supercomputing, 2019
"More data, more information." Big data helps businesses and research communities gain insights and increase productivity. Many public cloud service providers offer Hadoop MapReduce as a pay-per-use service via infrastructure as a service on clusters of virtual machines, promising on-demand horizontal scaling. These clusters of virtual machines are launched on various physical machines across racks in cloud data centers. Such multi-tenancy introduces performance heterogeneity for Hadoop virtual machines due to hardware heterogeneity and interference from co-located virtual machines. Performance heterogeneity largely affects MapReduce job latency and resource utilization of rented Hadoop virtual clusters. Default MapReduce schedulers assign map/reduce tasks assuming the hardware is homogeneous, while interference-aware schedulers only observe the interference pattern generated by co-located virtual machines; neither considers the heterogeneous performance of virtual machines. Therefore, we propose a dynamic ranking-based MapReduce job scheduler that places map and reduce tasks based on each virtual machine's performance rank to minimize job latency and improve resource utilization. Our approach calculates a performance score for each virtual machine based on hardware heterogeneity and co-located virtual machine interference, then ranks the virtual machines separately by map and reduce performance to place map and reduce tasks. To demonstrate our ideas, we set up a test bed with 29 virtual machines on eight physical machines with different configurations and capacities. We modified the default fair scheduler in Hadoop 2.x to incorporate our ideas and evaluated them with different workloads on the PUMA dataset. The proposed method is compared against the default (resource-aware) fair scheduler and an interference-aware scheduler in terms of job latency and resource utilization. Finally, we argue in favor of our approach as it improves resource utilization by 30-65% and overall job latency by up to 30%.
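A hypothetical sketch of such ranking: score each VM by discounting its raw capacity by the observed interference from co-located VMs, then order VMs best-first. The scoring formula and all figures are assumptions for illustration, not the paper's actual model.

```python
# Hypothetical VM ranking in the spirit of the abstract: capacity degraded
# by co-located interference, VMs ordered best-first. Invented figures.

def performance_score(vm):
    """Higher is better: raw capacity scaled down by interference fraction."""
    return vm["capacity"] * (1.0 - vm["interference"])

def rank_vms(vms):
    """Return VM names ordered best-first by performance score."""
    return [v["name"] for v in sorted(vms, key=performance_score, reverse=True)]

vms = [
    {"name": "vm1", "capacity": 8, "interference": 0.40},  # big but noisy host
    {"name": "vm2", "capacity": 4, "interference": 0.05},  # quiet but small
    {"name": "vm3", "capacity": 8, "interference": 0.10},  # big and quiet
]
```

A scheduler would then fill task slots starting from the top of this ranking; the paper maintains separate rankings for map and reduce performance.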

Obesity is considered a major health disaster, as it causes many health issues for aged people around the globe. Obesity develops through the accumulation of adipose tissue, which leads to impairment of physical and psychological health. Adipose tissue with abnormal fat deposition causes physical inactivity due to many factors, such as heredity and overeating habits. One of the risk factors of obesity is diabetes, which can also be a root cause of many other abnormalities in the human body; thus, diabetes has a strong relationship with obesity. Moreover, insulin plays a vital role in maintaining diabetes and has a direct relationship with body mass index. In addition, fatty acids, hormones, glycerol, cytokines, etc. are the major substances involved in diabetic development in the human body. As diabetes is primarily caused by obesity, this paper aims to describe the rapport between obesity and diabetes, its types, and its implications for human health.
Big Data Infrastructure and Analytics for Education 4.0
Big Data Applications in Industry 4.0
Big Data
Big Data with Hadoop MapReduce
Nonnegative Matrix Factorization to Understand Spatio-Temporal Traffic Pattern Variations During COVID-19: A Case Study
Communications in Computer and Information Science
Big Data with Hadoop MapReduce
Hadoop 1.2.1 Installation

IEEE Access
Improving the performance of the MapReduce scheduler is a primary objective, especially in a heterogeneous virtualized cloud environment. A map task is typically assigned an input split, which consists of one or more data blocks. When a map task is assigned more than one data block, non-local execution is performed: in classical MapReduce scheduling schemes, data blocks are copied over the network to the node where the map task is running. This increases job latency and consumes considerable network bandwidth within and between racks in the cloud data centre. Considering this situation, we propose a methodology, ''improving data locality using ant colony optimization'' (IDLACO), to minimize the number of non-local executions and virtual network bandwidth consumption when an input split is assigned more than one data block. First, IDLACO determines a set of data blocks for each map task of a MapReduce job on which to perform non-local executions, minimizing job latency and virtual network consumption. Then, the target virtual machine to execute each map task is determined based on its heterogeneous performance. Finally, if a set of data blocks is transferred to the same node for repeated job executions, they are temporarily cached in the target virtual machine. The performance of IDLACO is analysed and compared with the fair scheduler and the Holistic scheduler on parameters such as the number of non-local executions, average map task latency, job latency, and the amount of bandwidth consumed for a MapReduce job. Results show that IDLACO significantly outperforms both the classical fair scheduler and the Holistic scheduler.
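The locality objective can be sketched as follows. This is a simplification: the paper searches with ant colony optimization, while here a brute-force scan over candidate VMs illustrates the transfer-cost function being minimized; the topology, costs, and VM names are invented.

```python
# Sketch of the data-locality cost a scheduler like IDLACO minimizes:
# fetching a block is free locally, cheap within a rack, expensive across
# racks. Cost values and topology below are invented for illustration.

TRANSFER_COST = {"local": 0, "same_rack": 1, "cross_rack": 2}

def locality(block_vm, candidate_vm, rack_of):
    """Classify where a block sits relative to a candidate execution VM."""
    if block_vm == candidate_vm:
        return "local"
    if rack_of[block_vm] == rack_of[candidate_vm]:
        return "same_rack"
    return "cross_rack"

def best_vm_for_split(block_locations, candidate_vms, rack_of):
    """Pick the VM minimizing total transfer cost for a multi-block split."""
    def total_cost(vm):
        return sum(TRANSFER_COST[locality(b, vm, rack_of)] for b in block_locations)
    return min(candidate_vms, key=total_cost)

rack_of = {"vm1": "r1", "vm2": "r1", "vm3": "r2"}
best = best_vm_for_split(["vm1", "vm2"], ["vm1", "vm3"], rack_of)
```

Running the map task on vm1 costs 0 + 1 (one local block, one same-rack fetch), while vm3 would pay two cross-rack fetches, so the scan picks vm1.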