Skip to main content

Maria Alejandra Sarmiento Perez

Followers

0

Following

1

Public Views

Interests

Uploads

Papers by Maria Alejandra Sarmiento Perez

Storage and Ingestion Systems in Support of Stream Processing: A Survey

Under the pressure of massive, exponentially increasing amounts of heterogeneous data that are ge... more Under the pressure of massive, exponentially increasing amounts of heterogeneous data that are generated faster and faster, Big Data analytics applications have seen a shift from batch processing to stream processing, which can reduce the time needed to obtain meaningful insight dramatically. Stream processing is particularly well suited to address the challenges of fog/edge computing: much of this massive data comes from Internet of Things (IoT) devices and needs to be continuously funneled through an edge infrastructure towards centralized clouds. Thus, it is only natural to process data on their way as much as possible rather than wait for streams to accumulate on the cloud. Unfortunately, state-of-the-art stream processing systems are not well suited for this role: the data are accumulated (ingested), processed and persisted (stored) separately, often using different services hosted on different physical machines/clusters. Furthermore, there is only limited support for advanced ...

Kera: A Unified Storage and Ingestion Architecture for Efficient Stream Processing

Big Data applications are rapidly moving from a batch-oriented execution to a real-time model in ... more Big Data applications are rapidly moving from a batch-oriented execution to a real-time model in order to extract value from the streams of data just as fast as they arrive. Such stream-based applications need to immediately ingest and analyze data and in many use cases combine live (i.e., real-time streams) and archived data in order to extract better insights. Current streaming architectures are designed with distinct components for ingestion (e.g., Kafka) and storage (e.g., HDFS) of stream data. Unfortunately, this separation is becoming an overhead especially when data needs to be archived for later analysis (i.e., near real-time): in such use cases, stream data has to be written twice to disk and may pass twice over high latency networks. Moreover, current ingestion mechanisms offer no support for searching the acquired streams in real time, an important requirement to promptly react to fast data. In this paper we describe the design of Kera: a unified storage and ingestion arc...

Keeping up with storage: Decentralized, write-enabled dynamic geo-replication

Future Generation Computer Systems, 2017

Large-scale applications are ever-increasingly geo-distributed. Maintaining the highest possible ... more Large-scale applications are ever-increasingly geo-distributed. Maintaining the highest possible data locality is crucial to ensure high performance of such applications. Dynamic replication addresses this problem by dynamically creating replicas of frequently accessed data close to the clients. This data is often stored in decentralized storage systems such as Dynamo or Voldemort, which offer support for mutable data. However, existing approaches to dynamic replication for such mutable data remain centralized, thus incompatible with these systems. In this paper we introduce a writeenabled dynamic replication scheme that leverages the decentralized architecture of such storage systems. We propose an algorithm enabling clients to locate tentatively the closest data replica without prior request to any metadata node. Large-scale experiments on various workloads show a read latency decrease of up to 42% compared to other state-ofthe-art, caching-based solutions.

Spark Versus Flink: Understanding Performance in Big Data Analytics Frameworks

2016 IEEE International Conference on Cluster Computing (CLUSTER), 2016

Big Data analytics has recently gained increasing popularity as a tool to process large amounts o... more Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on-demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directly acyclic graph patterns. Making the most out of these frameworks is challenging because efficient executions strongly rely on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most of them benchmark the platforms against Hadoop, as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to bring some justice in this respect, by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and the parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that there none of the two framework outperforms the other for all data types, sizes and job patterns. This paper performs a fine characterization of the cases when each framework is superior, and we highlight how this performance correlates to operators, to resource usage and to the specifics of the internal framework design.

Failure detector abstractions for MapReduce-based systems

Information Sciences, 2017

Omission failures represent an important source of problems in data-intensive computing systems. ... more Omission failures represent an important source of problems in data-intensive computing systems. In these frameworks, omission failures are caused by slow tasks, known as stragglers , which can strongly jeopardize the workload performance. In the case of MapReduce-based systems, many state-of-the-art approaches have preferred to explore and extend speculative execution mechanisms. Other alternatives have based their contributions in doubling the computing resources for their tasks. Nevertheless, none of these approaches has addressed a fundamental aspect related to the detection and further solving of the omission failures, that is, the timeout service adjustment. In this paper, we have studied the omission failures in MapReduce systems, formalizing their failure detector abstraction by means of three different algorithms for defining the timeout. The first abstraction, called High Relax Failure Detector (HR-FD), acts as a static alternative to the default timeout, which is able to estimate the completion time for the user workload. The second abstraction, called Medium Relax Failure Detector (MR-FD), dynamically modifies the timeout, according to the progress score of each workload. Finally, taking into account that some of the user requests are strictly deadline-bounded, we have introduced the third abstraction, called Low Relax Failure Detector (LR-FD), which is able to merge the MapReduce dynamic timeout with an external monitoring system, in order to enforce more accurate failure detections. Whereas HR-FD shows performance improvements for most of the user request (in particular , small workloads), MR-FD and LR-FD enhance significantly the current timeout selection , for any kind of scenario, regardless of the workload type and failure injection time.

Towards Efficient Location and Placement of Dynamic Replicas for Geo-Distributed Data Stores

Proceedings of the ACM 7th Workshop on Scientific Cloud Computing, 2016

Large-scale scientific experiments increasingly rely on geodistributed clouds to serve relevant d... more Large-scale scientific experiments increasingly rely on geodistributed clouds to serve relevant data to scientists worldwide with minimal latency. State-of-the-art caching systems often require the client to access the data through a caching proxy, or to contact a metadata server to locate the closest available copy of the desired data. Also, such caching systems are inconsistent with the design of distributed hashtable databases such as Dynamo, which focus on allowing clients to locate data independently. We argue there is a gap between existing state-of-the-art solutions and the needs of geographically distributed applications, which require fast access to popular objects while not degrading access latency for the rest of the data. In this paper, we introduce a probabilistic algorithm allowing the user to locate the closest copy of the data efficiently and independently with minimal overhead, allowing low-latency access to non-cached data. Also, we propose a network-efficient technique to identify the most popular data objects in the cluster and trigger their replication close to the clients. Experiments with a real-world data set show that these principles allow clients to locate the closest available copy of data with small memory footprint and low error-rate, thus improving read-latency for non-cached data and allowing hot data to be read locally.

Consistency in the Cloud: When Money Does Matter!

2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 2013

With the emergence of cloud computing, many organizations have moved their data to the cloud in o... more With the emergence of cloud computing, many organizations have moved their data to the cloud in order to provide scalable, reliable and highly available services. To meet ever growing user needs, these services mainly rely on geographically-distributed data replication to guarantee good performance and high availability. However, with replication, consistency comes into question. Service providers in the cloud have the freedom to select the level of consistency according to the access patterns exhibited by the applications. Most optimizations efforts then concentrate on how to provide adequate trade-offs between consistency guarantees and performance. However, as the monetary cost completely relies on the service providers, in this paper we argue that monetary cost should be taken into consideration when evaluating or selecting a consistency level in the cloud. Accordingly, we define a new metric called consistency-cost efficiency. Based on this metric, we present a simple, yet efficient economical consistency model, called Bismar, that adaptively tunes the consistency level at run-time in order to reduce the monetary cost while simultaneously maintaining a low fraction of stale reads. Experimental evaluations with the Cassandra cloud storage on a Grid'5000 testbed show the validity of the metric and demonstrate the effectiveness of the proposed consistency model.

Semantic grid applications to complex satellite mission systems

2006 7th IEEE/ACM International Conference on Grid Computing, 2006

This paper presents the application of a semantic grid architecture into a scenario for the produ... more This paper presents the application of a semantic grid architecture into a scenario for the product analysis of a representative Earth Observation Satellite Mission (EnviSat). This Use Case aims at demonstrating the benefits of a Semantic Grid approach to real world problems in terms of flexibility, reduction of SW running costs, maintainability, expandability, interoperability and definition of a standardized approach.

Harmony: Towards Automated Self-Adaptive Consistency in Cloud Storage

2012 IEEE International Conference on Cluster Computing, 2012

In just a few years cloud computing has become a very popular paradigm and a business success sto... more In just a few years cloud computing has become a very popular paradigm and a business success story, with storage being one of the key features. To achieve high data availability, cloud storage services rely on replication. In this context, one major challenge is data consistency. In contrast to traditional approaches that are mostly based on strong consistency, many cloud storage services opt for weaker consistency models in order to achieve better availability and performance. This comes at the cost of a high probability of stale data being read, as the replicas involved in the reads may not always have the most recent write. In this paper, we propose a novel approach, named Harmony, which adaptively tunes the consistency level at run-time according to the application requirements. The key idea behind Harmony is an intelligent estimation model of stale reads, allowing to elastically scale up or down the number of replicas involved in read operations to maintain a low (possibly zero) tolerable fraction of stale reads. As a result, Harmony can meet the desired consistency of the applications while achieving good performance. We have implemented Harmony and performed extensive evaluations with the Cassandra cloud storage on Grid'5000 testbed and on Amazon EC2. The results show that Harmony can achieve good performance without exceeding the tolerated number of stale reads. For instance, in contrast to the static eventual consistency used in Cassandra, Harmony reduces the stale data being read by almost 80% while adding only minimal latency. Meanwhile, it improves the throughput of the system by 45% while maintaining the desired consistency requirements of the applications when compared to the strong consistency model in Cassandra.

A Semantic Data Grid for Satellite Mission Quality Analysis

Lecture Notes in Computer Science, 2008

The combination of Semantic Web and Grid technologies and architectures eases the development of ... more The combination of Semantic Web and Grid technologies and architectures eases the development of applications that share heterogeneous resources (data and computing elements) that belong to several organisations. The Aerospace domain has an extensive and heterogeneous network of facilities and institutions, with a strong need to share both data and computational resources for complex processing tasks. One such task is monitoring and data analysis for Satellite Missions. This paper presents a Semantic Data Grid for satellite missions, where flexibility, scalability, interoperability, extensibility and efficient development have been considered the key issues to be addressed.

On tie breaking in competitive location under binary customer behavior

Omega, 2015

Ties in customer facility choice may occur when the customer selects the facility with maximum ut... more Ties in customer facility choice may occur when the customer selects the facility with maximum utility to be served. In the location literature ties in maximum utility are broken by assigning a fixed proportion of the customer demand to the facilities with maximum utility which are owned by the entering firm. This tie breaking rule does not take into account the number of tied facilities of both the entering firm and its competitors. In this paper we introduce a more realistic tie breaking rule which assigns a variable proportion of customer demand to the entering firm depending on the number of tied facilities. We present a general framework in which optimal locations for the old and the new tie breaking rules can be obtained through Integer Linear Programming formulations of the corresponding location models. The optimal locations are obtained for the old tie breaking rule for different values of the fixed proportion and a comparison with the results obtained for the new tie breaking rule are drawn with data of Spanish municipalities in a variety of scenarios. Finally, some conclusions are presented.

Using Global Behavior Modeling to Improve QoS in Cloud Data Storage Services

2010 IEEE Second International Conference on Cloud Computing Technology and Science, 2010

Grid Global Behavior Prediction

2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011

Complexity has always been one of the most important issues in distributed computing. From the fi... more Complexity has always been one of the most important issues in distributed computing. From the first clusters to grid and now cloud computing, dealing correctly and efficiently with system complexity is the key to taking technology a step further. In this sense, global behavior modeling is an innovative methodology aimed at understanding the grid behavior. The main objective of this methodology is to synthesize the grid's vast, heterogeneous nature into a simple but powerful behavior model, represented in the form of a single, abstract entity, with a global state. Global behavior modeling has proved to be very useful in effectively managing grid complexity but, in many cases, deeper knowledge is needed. It generates a descriptive model that could be greatly improved if extended not only to explain behavior, but also to predict it. In this paper we present a prediction methodology whose objective is to define the techniques needed to créate global behavior prediction models for grid systems. This global behavior prediction can benefit grid management, specially in áreas such as fault tolerance or job scheduling. The paper presents experimental results obtained in real scenarios in order to valídate this approach.

Improving Grid Fault Tolerance by Means of Global Behavior Modeling

2010 Ninth International Symposium on Parallel and Distributed Computing, 2010

Grid systems have proved to be one of the most important new alternatives to face challenging pro... more Grid systems have proved to be one of the most important new alternatives to face challenging problems but, to exploit its benefits, dependability and fault tolerance are key aspects. However, the vast complexity of these systems limits the efficiency of traditional fault tolerance techniques. It seems necessary to distinguish between resource-level fault tolerance (focused on every machine) and service-level fault tolerance (focused on global behavior). Techniques based on these concepts can handle system complexity and increase dependability. We present an autonomous, self-adaptive fault tolerance framework for grid systems, based on a new approach to model distributed environments. The grid is considered as a single entity, instead of a set of independent resources. This point of view focuses on service-level fault tolerance, allowing us to see the big picture and understand the system's global behavior. The resulting model's simplicity is the key to provide systemwide fault tolerance.

On price competition in location-price models with spatially separated markets

Top, 2004

Page 1. Sociedad de Estadlstica e Investigacidn Operativa Top (2004) Vol. 12, No. 2, pp. 351-374 ... more

On the location of new facilities for chain expansion under delivered pricing

Omega, 2012

We study the problem of locating new facilities for one expanding chain which competes for demand... more

GMonE: A complete approach to cloud monitoring

Future Generation Computer Systems, 2013

The inherent complexity of modern cloud infrastructures has created the need for innovative monit... more The inherent complexity of modern cloud infrastructures has created the need for innovative monitoring approaches, as state-of-the-art solutions used for other large-scale environments do not address specific cloud features. Although cloud monitoring is nowadays an active research field, a comprehensive study covering all its aspects has not been presented yet. This paper provides a deep insight into cloud monitoring. It proposes a unified cloud monitoring taxonomy, based on which it defines a layered cloud monitoring architecture. To illustrate it, we have implemented GMonE, a general-purpose cloud monitoring tool which covers all aspects of cloud monitoring by specifically addressing the needs of modern cloud infrastructures. Furthermore, we have evaluated the performance, scalability and overhead of GMonE with Yahoo Cloud Serving Benchmark (YCSB), by using the OpenNebula cloud middleware on the Grid'5000 experimental testbed. The results of this evaluation demonstrate the benefits of our approach, surpassing the monitoring performance and capabilities of cloud monitoring alternatives such as those present in state-of-the-art systems such as Amazon EC2 and OpenNebula.

A semantic collaborative awareness model to deal with resource sharing in grids

Future Generation Computer Systems, 2010

Grid Computing environments are mainly created to lead the shared use of different resources base... more Grid Computing environments are mainly created to lead the shared use of different resources based on business/science needs. The way these resources are shared in terms of CPU cycles, storage capacity, software licenses,. .. is normally established by the availability of these resources out of the local administration context. Semantic Grid is the extension of Grid Computing with Semantic Web based technologies. Semantic Web allows grid management data to be machine-understable represented, therefore reasoning can handled complicated situations in Virtual Organization management. This paper presents the extension of CAM (Collaborative Awareness Model) to manage Virtual Organizations in Semantic Grid environments. CAM applies some theoretical principles of awareness models to promote resource interaction and management as well as task delivery.

Special section: Data analysis, access and management on grids

Future Generation Computer Systems, 2007

An autonomic framework for enhancing the quality of data grid services

Future Generation Computer Systems, 2012

Data grid services have been used to deal with the increasing needs of applications in terms of d... more Data grid services have been used to deal with the increasing needs of applications in terms of data volume and throughput. The large scale, heterogeneity and dynamism of grid environments often make management and tuning of these data services very complex. Furthermore, current high-performance I/O approaches are characterized by their high complexity and specific features that usually require specialized administrator skills. Autonomic computing can help manage this complexity. The present paper describes an autonomic subsystem intended to provide self-management features aimed at efficiently reducing the I/O problem in a grid environment, thereby enhancing the quality of service (QoS) of data access and storage services in the grid. Our proposal takes into account that data produced in an I/O system is not usually immediately required. Therefore, performance improvements are related not only to current but also to any future I/O access, as the actual data access usually occurs later on. Nevertheless, the exact time of the next I/O operations is unknown. Thus, our approach proposes a long-term prediction designed to forecast the future workload of grid components. This enables the autonomic subsystem to determine the optimal data placement to improve both current and future I/O operations.