Papers by MARIA VALENTINA FLORES PEREZ

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
We are now witnessing an unprecedented growth of data that needs to be processed at ever-increasing rates in order to extract valuable insights. Big Data streaming analytics tools have been developed to cope with the online dimension of data processing: they enable real-time handling of live data sources by means of stateful aggregations (operators). Current state-of-the-art frameworks (e.g. Apache Flink [1]) enable each operator to work in isolation by creating data copies, at the expense of increased memory utilization. In this paper, we explore the feasibility of deduplication techniques to address the challenge of reducing the memory footprint of window-based stream processing without significant impact on performance. We design a deduplication method specifically for window-based operators that rely on key-value stores to hold a shared state. We experiment with a synthetically generated workload while considering several deduplication scenarios and, based on the results, we identify several potential areas of improvement. Our key finding is that more fine-grained interactions between streaming engines and (key-value) stores need to be designed in order to better respond to scenarios that have to overcome memory scarcity.
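To make the deduplication idea concrete, here is a minimal sketch of a shared, key-value-backed window state: each record payload is stored exactly once and overlapping windows keep only references, with a reference count driving eviction. All class and field names are hypothetical; this illustrates the general technique, not the paper's actual design.

```python
from collections import defaultdict

class DedupWindowStore:
    """Shared state: one copy per record, windows hold references only."""
    def __init__(self):
        self.records = {}                 # record_id -> payload, stored once
        self.refcount = defaultdict(int)  # record_id -> #windows referencing it
        self.windows = defaultdict(list)  # window_id -> [record_id, ...]

    def add(self, window_id, record_id, payload):
        if record_id not in self.records:
            self.records[record_id] = payload      # deduplicated write
        self.refcount[record_id] += 1
        self.windows[window_id].append(record_id)  # reference, not a copy

    def evict(self, window_id):
        """Drop a closed window; free records no window references."""
        for rid in self.windows.pop(window_id, ()):
            self.refcount[rid] -= 1
            if self.refcount[rid] == 0:
                del self.records[rid]
                del self.refcount[rid]
```

With sliding windows that overlap heavily, each record would otherwise be copied into every window covering it; here it is stored once regardless of how many windows reference it.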

Under the pressure of massive, exponentially increasing amounts of heterogeneous data that are generated faster and faster, Big Data analytics applications have seen a shift from batch processing to stream processing, which can dramatically reduce the time needed to obtain meaningful insight. Stream processing is particularly well suited to address the challenges of fog/edge computing: much of this massive data comes from Internet of Things (IoT) devices and needs to be continuously funneled through an edge infrastructure towards centralized clouds. Thus, it is only natural to process the data on its way as much as possible rather than wait for streams to accumulate on the cloud. Unfortunately, state-of-the-art stream processing systems are not well suited for this role: the data are accumulated (ingested), processed and persisted (stored) separately, often using different services hosted on different physical machines/clusters. Furthermore, there is only limited support for advanced ...

Big Data applications are rapidly moving from a batch-oriented execution model to a real-time one in order to extract value from streams of data just as fast as they arrive. Such stream-based applications need to ingest and analyze data immediately and, in many use cases, combine live (i.e., real-time streams) and archived data in order to extract better insights. Current streaming architectures are designed with distinct components for the ingestion (e.g., Kafka) and storage (e.g., HDFS) of stream data. Unfortunately, this separation is becoming an overhead, especially when data needs to be archived for later (i.e., near-real-time) analysis: in such use cases, stream data has to be written twice to disk and may pass twice over high-latency networks. Moreover, current ingestion mechanisms offer no support for searching the acquired streams in real time, an important requirement for promptly reacting to fast data. In this paper we describe the design of Kera: a unified storage and ingestion architecture ...
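Since Kera's own design is only partially described here, the following is a loose sketch (hypothetical names, not Kera's API) of the unification idea itself: a single append-once log that serves both live subscribers and later archival range reads, so stream data is written once instead of twice.

```python
class UnifiedLog:
    """One append-once log backing both ingestion and storage paths."""
    def __init__(self):
        self.entries = []        # the single persisted copy of the stream
        self.subscribers = []    # callbacks for the real-time path

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def append(self, record):
        offset = len(self.entries)
        self.entries.append(record)       # written once, serves both paths
        for deliver in self.subscribers:
            deliver(offset, record)       # live consumers see it immediately
        return offset

    def read(self, start, end):
        """Archival / near-real-time analysis reads the same single copy."""
        return self.entries[start:end]
```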

Computer Communications and Networks, 2016
Data-intensive computing has become one of the most popular forms of parallel computing, due to the explosion of digital data we are experiencing. This data expansion has mainly come from three sources: (i) scientific experiments from fields such as astronomy, particle physics, or genomics; (ii) data from sensors; and (iii) content published by citizens on channels such as social networks. Data-intensive computing systems, such as Hadoop MapReduce, aim to process an enormous amount of data in a short time by moving the computation to where the data resides. In failure-free scenarios, these frameworks usually achieve good results. Given that failures are common at large scale, these frameworks incorporate fault tolerance and dependability techniques as built-in features. In particular, MapReduce frameworks tolerate machine failures (crash failures) by re-executing all the tasks of the failed machine, by virtue of data replication. Furthermore, in order to mask temporary failures caused by network or machine overload (timing failures), where some tasks perform relatively slower than others, Hadoop launches copies of these tasks on other machines.
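The straggler handling described above can be sketched in a few lines; the threshold and field names below are illustrative assumptions, not Hadoop's actual configuration.

```python
def find_stragglers(tasks, threshold=0.2):
    """Flag tasks whose progress score lags well behind the average."""
    avg = sum(t["progress"] for t in tasks) / len(tasks)
    return [t for t in tasks if avg - t["progress"] > threshold]

def schedule_backups(tasks, free_nodes):
    """Pair each suspected straggler with a free node for a backup copy;
    whichever copy finishes first wins, the other is killed."""
    return list(zip(find_stragglers(tasks), free_nodes))

# Example: with progress scores [0.9, 0.8, 0.3], the third task lags the
# average (0.67) by more than 0.2 and receives a speculative backup.
```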
2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), 2018

IEEE Access, 2021
Most studies on the solar forecasting topic do not analyze and exploit the temporal and spatial components that are inherent to such a task. Furthermore, they mostly focus just on precision and not on other meaningful features, such as flexibility and robustness. With the current energy production trends, where many solar panels are distributed across city rooftops, there is a need to manage all this information simultaneously and to be able to add and remove sensors as needed. Likewise, robust models need to be able to cope with (inevitable) sensor failure and continue producing reliable predictions. Because of all this, solar forecasting models need to be as decoupled as possible from the number of data sources that feed them and from their geographical distribution, which also enables the reusability of the models. This article contributes a family of Deep Learning models for solar irradiance forecasting that comply with the aforementioned features, i.e. flexibility and robustness. In the first stage, several Artificial Neural Networks are trained as a basis for predicting solar irradiance at several locations at the same time. Thereupon, a family of models that work with irradiance maps thanks to Convolutional Long Short-Term Memory layers is presented, obtaining forecast skills between 7.4% and 41% (depending on the location and horizon) compared to the baseline. The latter family comes with the flexibility and robustness features required in large-scale Intelligent Environments, such as Smart Cities. Working with irradiance maps means that new sensors can be added (or removed) as needed, without requiring rebuilding of the model. The experiments carried out show that sensor failures have a mild impact on the prediction error for several forecast horizons. Index terms: convolutional long short-term memory (Conv-LSTM), deep learning, irradiance map, solar irradiance, time series forecasting.
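As a sketch of what a ConvLSTM-based irradiance-map forecaster can look like, the Keras model below consumes a short history of maps and emits the next one. Layer counts, filter sizes and the input resolution are illustrative assumptions, not the architecture from the article.

```python
import tensorflow as tf

def build_irradiance_model(steps=12, height=32, width=32):
    """ConvLSTM stack over (time, height, width, 1) irradiance maps."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(steps, height, width, 1)),
        tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                                   return_sequences=True),
        tf.keras.layers.ConvLSTM2D(8, kernel_size=3, padding="same"),
        tf.keras.layers.Conv2D(1, kernel_size=1),  # next irradiance map
    ])

model = build_irradiance_model()
model.compile(optimizer="adam", loss="mse")
```

Because the model operates on a gridded map rather than on a fixed sensor vector, adding or losing a sensor changes only how the input map is rasterized, not the network itself, which is the flexibility and robustness property claimed above.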

Future Generation Computer Systems, 2018
HPC and Big Data stacks are completely separated today. The storage layer offers opportunities for convergence, as the challenges associated with HPC and Big Data storage are similar: trading versatility for performance. This motivates a global move towards dropping file-based, POSIX-IO compliant systems. However, on HPC platforms this is made difficult by the centralized storage architecture built on file-based storage. In this paper we advocate that the growing trend of equipping HPC compute nodes with local storage changes the game, by enabling object storage to be deployed alongside the application on the compute nodes. Such integration of application and storage not only allows fine-grained configuration of the storage system, but also improves application portability across platforms. In addition, the single-user nature of such application-specific storage obviates the need for resource-consuming storage features like permissions or file hierarchies offered by traditional file systems. In this article we propose and evaluate Blobs (Binary Large Objects) as an alternative to distributed file systems. We demonstrate that they offer drop-in compatibility with a variety of existing applications while improving storage throughput by up to 28%.
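The contrast with POSIX is easiest to see in the interface. A minimal blob-store surface (a sketch, not the system evaluated in the paper) is just whole-object puts and gets over a flat key namespace, with no directories, permissions or byte-range file semantics to maintain:

```python
class BlobStore:
    """Flat-namespace object store: put/get/delete whole objects."""
    def __init__(self):
        self._blobs = {}   # key -> bytes; no hierarchy, no ACLs

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data        # whole-object write

    def get(self, key: str) -> bytes:
        return self._blobs[key]        # whole-object read

    def delete(self, key: str) -> None:
        self._blobs.pop(key, None)
```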

Future Generation Computer Systems, 2017
Large-scale applications are increasingly geo-distributed. Maintaining the highest possible data locality is crucial to ensure high performance of such applications. Dynamic replication addresses this problem by dynamically creating replicas of frequently accessed data close to the clients. This data is often stored in decentralized storage systems such as Dynamo or Voldemort, which offer support for mutable data. However, existing approaches to dynamic replication for such mutable data remain centralized, and are thus incompatible with these systems. In this paper we introduce a write-enabled dynamic replication scheme that leverages the decentralized architecture of such storage systems. We propose an algorithm enabling clients to tentatively locate the closest data replica without prior request to any metadata node. Large-scale experiments on various workloads show a read latency decrease of up to 42% compared to other state-of-the-art, caching-based solutions.
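A common way to let clients locate replicas without a metadata round trip in Dynamo-style systems is for each client to keep a copy of the hash ring and compute the replica set locally; the sketch below (hypothetical token space and replication factor) then tentatively picks the replica with the lowest observed latency. The paper's own algorithm is not reproduced here.

```python
import hashlib
from bisect import bisect_left

def replica_nodes(key, ring, n=3):
    """ring: sorted list of (token, node) with tokens in [0, 2**32).
    Dynamo-style placement: replicas are the n successors of hash(key)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32
    i = bisect_left([t for t, _ in ring], h) % len(ring)
    return [ring[(i + k) % len(ring)][1] for k in range(n)]

def closest_replica(key, ring, latency_ms):
    """Tentative client-side choice: no metadata server is contacted."""
    return min(replica_nodes(key, ring), key=lambda node: latency_ms[node])
```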

2016 IEEE International Conference on Cluster Computing (CLUSTER), 2016
Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most of these frameworks is challenging, because efficient executions strongly rely on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to bring some fairness in this respect by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators' execution plan with the resource usage, and use it to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither of the two frameworks outperforms the other for all data types, sizes and job patterns. We provide a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates with operators, with resource usage and with the specifics of the internal framework design.
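One elementary building block of such a correlation methodology, sketched here with hypothetical data shapes, is attributing sampled resource usage to whichever operator's execution span covers each sample:

```python
def usage_per_operator(spans, samples):
    """spans: {operator: (start_s, end_s)}; samples: [(t_s, cpu_pct)].
    Returns the mean CPU usage observed while each operator was running."""
    usage = {op: [] for op in spans}
    for t, cpu in samples:
        for op, (start, end) in spans.items():
            if start <= t < end:
                usage[op].append(cpu)
    return {op: sum(v) / len(v) for op, v in usage.items() if v}
```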

Information Sciences, 2017
Omission failures represent an important source of problems in data-intensive computing systems. In these frameworks, omission failures are caused by slow tasks, known as stragglers, which can strongly jeopardize the workload performance. In the case of MapReduce-based systems, many state-of-the-art approaches have preferred to explore and extend speculative execution mechanisms. Other alternatives have based their contributions on doubling the computing resources for their tasks. Nevertheless, none of these approaches has addressed a fundamental aspect of detecting and resolving omission failures: the adjustment of the timeout service. In this paper, we study omission failures in MapReduce systems, formalizing their failure detector abstraction by means of three different algorithms for defining the timeout. The first abstraction, called High Relax Failure Detector (HR-FD), acts as a static alternative to the default timeout, which is able to estimate the completion time for the user workload. The second abstraction, called Medium Relax Failure Detector (MR-FD), dynamically modifies the timeout according to the progress score of each workload. Finally, taking into account that some user requests are strictly deadline-bounded, we introduce the third abstraction, called Low Relax Failure Detector (LR-FD), which is able to merge the MapReduce dynamic timeout with an external monitoring system, in order to enforce more accurate failure detections. Whereas HR-FD shows performance improvements for most user requests (in particular, small workloads), MR-FD and LR-FD significantly enhance the current timeout selection for any kind of scenario, regardless of the workload type and failure injection time.
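As an illustration of the dynamic (MR-FD-style) idea, a timeout can be extrapolated from a task's progress score rather than fixed in advance. The linear extrapolation and the constants below are simplifying assumptions for the sketch, not the paper's exact algorithm:

```python
def dynamic_timeout(elapsed_s, progress, relax=1.2, floor_s=10.0):
    """Estimate remaining time from the progress score (0..1) and relax
    it by a safety factor; fall back to a floor when nothing has run yet."""
    if progress <= 0.0:
        return floor_s
    estimated_total = elapsed_s / progress        # linear extrapolation
    return max(floor_s, relax * (estimated_total - elapsed_s))

# Example: a task at 25% progress after 60 s gets a timeout of
# 1.2 * (240 - 60) = 216 s rather than a one-size-fits-all default.
```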

Proceedings of the ACM 7th Workshop on Scientific Cloud Computing, 2016
Large-scale scientific experiments increasingly rely on geo-distributed clouds to serve relevant data to scientists worldwide with minimal latency. State-of-the-art caching systems often require the client to access the data through a caching proxy, or to contact a metadata server to locate the closest available copy of the desired data. Such caching systems are also inconsistent with the design of distributed hash-table databases such as Dynamo, which focus on allowing clients to locate data independently. We argue there is a gap between existing state-of-the-art solutions and the needs of geographically distributed applications, which require fast access to popular objects while not degrading access latency for the rest of the data. In this paper, we introduce a probabilistic algorithm allowing the user to locate the closest copy of the data efficiently and independently, with minimal overhead, allowing low-latency access to non-cached data. We also propose a network-efficient technique to identify the most popular data objects in the cluster and trigger their replication close to the clients. Experiments with a real-world data set show that these principles allow clients to locate the closest available copy of data with a small memory footprint and a low error rate, thus improving read latency for non-cached data and allowing hot data to be read locally.
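A classic way to get probabilistic, low-memory location knowledge on the client side is one Bloom filter per site; the sketch below is in that spirit (a simplified stand-in, not the paper's actual algorithm). Membership tests may rarely yield false positives, which matches the low but non-zero error rate reported above.

```python
import hashlib

class BloomFilter:
    """Compact probabilistic set: no false negatives, rare false positives."""
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

def closest_site(key, site_filters, rtt_ms):
    """Nearest site whose filter (probably) holds a copy, else None."""
    hits = [site for site, f in site_filters.items() if key in f]
    return min(hits, key=lambda s: rtt_ms[s]) if hits else None
```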

2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2015
Apache Cassandra is an open-source cloud storage system that offers multiple types of operation-level consistency, including eventual consistency with multiple levels of guarantees and strong consistency. It is used by many datacenter applications (e.g., Facebook and AppScale). Most existing research efforts have been dedicated to exploring trade-offs such as consistency vs. performance, consistency vs. latency, and consistency vs. monetary cost. In contrast, little work has focused on the consistency vs. energy trade-off. As power bills have become a substantial part of the monetary cost of operating a datacenter, this paper aims to provide a clearer understanding of the interplay between consistency and energy consumption. Accordingly, a series of experiments have been conducted to explore the implications of different factors on the energy consumption in Cassandra. Our experiments reveal a noticeable variation in the energy consumption depending on the consistency level. Furthermore, for a given consistency level, the energy consumption of Cassandra varies with the access pattern and the load exhibited by the application. Further analysis indicates that the uneven distribution of the load amongst different nodes also impacts the energy consumption in Cassandra. Finally, we experimentally compare the impact of four storage configuration and data partitioning policies on the energy consumption in Cassandra: interestingly, we achieve 23% energy savings by assigning 50% of the nodes to the hot pool for applications with a moderate ratio of reads and writes, while applying eventual (quorum) consistency. This study points to opportunities for future research on consistency-energy trade-offs and offers useful insight into designing energy-efficient techniques for cloud storage systems.
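Part of the intuition for the consistency-energy interplay is simple quorum arithmetic: weaker levels touch fewer replicas per operation, so fewer nodes do work per request. The helper below encodes the standard Cassandra-style replica counts (a simplified view that ignores background repair traffic):

```python
def replicas_contacted(level: str, replication_factor: int = 3) -> int:
    """Replicas that must answer a request at each consistency level."""
    return {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,   # 2 of 3 by default
        "ALL": replication_factor,
    }[level]
```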
2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006
The use of a Semantic Grid architecture can ease the deployment of complex applications in which several organizations are involved and where resources of diverse nature (data and computing elements) are shared. This is the situation in the Space domain, with a strong demand for computational resources within an extensive and heterogeneous network of facilities and institutions. This paper presents the S-OGSA architecture, defined in the Ontogrid project, as applied to a scenario for the overall monitoring and data analysis of a Satellite Mission currently in nominal operations. Flexibility, scalability, interoperability and the use of a common framework for data sharing are the main advantages of a Semantic Grid implementation in complex, data-intensive systems.

2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 2013
With the emergence of cloud computing, many organizations have moved their data to the cloud in order to provide scalable, reliable and highly available services. To meet ever-growing user needs, these services mainly rely on geographically-distributed data replication to guarantee good performance and high availability. However, with replication, consistency comes into question. Service providers in the cloud have the freedom to select the level of consistency according to the access patterns exhibited by the applications. Most optimization efforts then concentrate on how to provide adequate trade-offs between consistency guarantees and performance. However, as the monetary cost falls entirely on the service providers, in this paper we argue that monetary cost should be taken into consideration when evaluating or selecting a consistency level in the cloud. Accordingly, we define a new metric called consistency-cost efficiency. Based on this metric, we present a simple yet efficient economical consistency model, called Bismar, that adaptively tunes the consistency level at run-time in order to reduce the monetary cost while simultaneously maintaining a low fraction of stale reads. Experimental evaluations with the Cassandra cloud storage on a Grid'5000 testbed show the validity of the metric and demonstrate the effectiveness of the proposed consistency model.
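The paper's exact formulation of consistency-cost efficiency is not given in this abstract, so the sketch below is only one plausible reading: consistency delivered (fraction of fresh reads) per unit of monetary cost, plus a Bismar-like selection rule that picks the cheapest level keeping stale reads under a bound. The names and the formula itself are assumptions.

```python
def consistency_cost_efficiency(fresh_read_fraction: float,
                                cost_dollars: float) -> float:
    """Hypothetical metric: consistency obtained per dollar spent."""
    return fresh_read_fraction / cost_dollars

def pick_level(levels: dict, max_stale: float = 0.05) -> str:
    """levels: {name: (stale_rate, cost)}. Cheapest level whose stale-read
    rate stays under the bound; fall back to the strongest otherwise."""
    admissible = {n: c for n, (s, c) in levels.items() if s <= max_stale}
    return min(admissible, key=admissible.get) if admissible else "ALL"
```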
2006 7th IEEE/ACM International Conference on Grid Computing, 2006
This paper presents the application of a semantic grid architecture to a scenario for the product analysis of a representative Earth Observation Satellite Mission (EnviSat). This use case aims to demonstrate the benefits of a Semantic Grid approach to real-world problems in terms of flexibility, reduction of software running costs, maintainability, expandability, interoperability and the definition of a standardized approach.

2012 IEEE International Conference on Cluster Computing, 2012
In just a few years, cloud computing has become a very popular paradigm and a business success story, with storage being one of its key features. To achieve high data availability, cloud storage services rely on replication. In this context, one major challenge is data consistency. In contrast to traditional approaches that are mostly based on strong consistency, many cloud storage services opt for weaker consistency models in order to achieve better availability and performance. This comes at the cost of a high probability of stale data being read, as the replicas involved in the reads may not always have the most recent write. In this paper, we propose a novel approach, named Harmony, which adaptively tunes the consistency level at run-time according to the application requirements. The key idea behind Harmony is an intelligent estimation model of stale reads, allowing it to elastically scale up or down the number of replicas involved in read operations to maintain a low (possibly zero) tolerable fraction of stale reads. As a result, Harmony can meet the desired consistency of the applications while achieving good performance. We have implemented Harmony and performed extensive evaluations with the Cassandra cloud storage on the Grid'5000 testbed and on Amazon EC2. The results show that Harmony can achieve good performance without exceeding the tolerated number of stale reads. For instance, in contrast to the static eventual consistency used in Cassandra, Harmony reduces the amount of stale data being read by almost 80% while adding only minimal latency. Meanwhile, it improves the throughput of the system by 45% while maintaining the desired consistency requirements of the applications when compared to the strong consistency model in Cassandra.
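Harmony's estimation model itself is not reproduced in this abstract, so here is a deliberately crude stand-in showing the shape of the control loop: estimate the chance that every contacted replica is still missing the latest write, then raise the number of replicas read until the estimate drops below the tolerated fraction. The staleness model is an assumption for illustration only.

```python
def estimated_stale_rate(write_rate_hz, propagation_s, replicas_read):
    """A replica lags for ~propagation_s after each write; a read is stale
    only if all contacted replicas lag (independence assumed)."""
    p_replica_behind = min(1.0, write_rate_hz * propagation_s)
    return p_replica_behind ** replicas_read

def replicas_to_read(write_rate_hz, propagation_s, tolerance,
                     replication_factor=3):
    """Smallest read quorum keeping the estimated stale rate in bounds."""
    for r in range(1, replication_factor + 1):
        if estimated_stale_rate(write_rate_hz, propagation_s, r) <= tolerance:
            return r
    return replication_factor
```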
Lecture Notes in Computer Science, 2008
The combination of Semantic Web and Grid technologies and architectures eases the development of applications that share heterogeneous resources (data and computing elements) that belong to several organisations. The Aerospace domain has an extensive and heterogeneous network of facilities and institutions, with a strong need to share both data and computational resources for complex processing tasks. One such task is monitoring and data analysis for Satellite Missions. This paper presents a Semantic Data Grid for satellite missions, where flexibility, scalability, interoperability, extensibility and efficient development have been considered the key issues to be addressed.
2010 IEEE Second International Conference on Cloud Computing Technology and Science, 2010

2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2011
Complexity has always been one of the most important issues in distributed computing. From the first clusters to grid and now cloud computing, dealing correctly and efficiently with system complexity is the key to taking technology a step further. In this sense, global behavior modeling is an innovative methodology aimed at understanding grid behavior. The main objective of this methodology is to synthesize the grid's vast, heterogeneous nature into a simple but powerful behavior model, represented in the form of a single, abstract entity with a global state. Global behavior modeling has proved to be very useful in effectively managing grid complexity but, in many cases, deeper knowledge is needed. It generates a descriptive model that could be greatly improved if extended not only to explain behavior, but also to predict it. In this paper we present a prediction methodology whose objective is to define the techniques needed to create global behavior prediction models for grid systems. This global behavior prediction can benefit grid management, especially in areas such as fault tolerance or job scheduling. The paper presents experimental results obtained in real scenarios in order to validate this approach.

2010 Ninth International Symposium on Parallel and Distributed Computing, 2010
Grid systems have proved to be one of the most important new alternatives for facing challenging problems but, to exploit their benefits, dependability and fault tolerance are key aspects. However, the vast complexity of these systems limits the efficiency of traditional fault tolerance techniques. It seems necessary to distinguish between resource-level fault tolerance (focused on every machine) and service-level fault tolerance (focused on global behavior). Techniques based on these concepts can handle system complexity and increase dependability. We present an autonomous, self-adaptive fault tolerance framework for grid systems, based on a new approach to modeling distributed environments. The grid is considered as a single entity, instead of a set of independent resources. This point of view focuses on service-level fault tolerance, allowing us to see the big picture and understand the system's global behavior. The resulting model's simplicity is the key to providing system-wide fault tolerance.