Papers by Juan Colmenares

arXiv (Cornell University), Jul 4, 2017
Sources of multidimensional data are becoming more prevalent, partly due to the rise of the Inter... more Sources of multidimensional data are becoming more prevalent, partly due to the rise of the Internet of Things (IoT), and with that the need to ingest and analyze data streams at rates higher than before. Some industrial IoT applications require ingesting millions of records per second, while processing queries on recently ingested and historical data. Unfortunately, existing database systems suited to multidimensional data exhibit low per-node ingestion performance, and even if they can scale horizontally in distributed settings, they require large number of nodes to meet such ingest demands. For this reason, in this paper we evaluate a singlenode multidimensional data store for high-velocity sensor data. Its design centers around a two-level indexing structure, wherein the global index is an in-memory R*-tree and the local indices are serialized kd-trees. This study is confined to records with numerical indexing fields and range queries, and covers ingest throughput, query response time, and storage footprint. We show that the adopted design streamlines data ingestion and offers ingress rates two orders of magnitude higher than those of a selection of open-source database systems, namely Percona Server, SQLite, and Druid. Our prototype also reports query response times comparable to or better than those of Percona Server and Druid, and compares favorably in terms of storage footprint. In addition, we evaluate a kd-tree partitioning based scheme for grouping incoming streamed data records. Compared to a random scheme, this scheme produces less overlap between groups of streamed records, but contrary to what we expected, such reduced overlap does not translate into better query performance. By contrast, the local indices prove much more beneficial to query performance. 
We believe the experience reported in this paper is valuable to practitioners and researchers alike interested in building database systems for high-velocity multidimensional data.
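To illustrate the local-index idea above, here is a minimal kd-tree supporting the range queries the paper studies. This is a purely illustrative sketch, not the paper's implementation; the two-dimensional points, function names, and dictionary layout are all our own assumptions.

```python
# Illustrative sketch only: a tiny 2-D kd-tree "local index" and a range
# query over it. Not the paper's code; names and structure are hypothetical.

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree over (x, y) tuples."""
    if not points:
        return None
    axis = depth % 2                          # alternate splitting axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                    # median becomes the split point
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_query(node, lo, hi, out):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return
    p, axis = node["point"], node["axis"]
    if all(lo[i] <= p[i] <= hi[i] for i in range(2)):
        out.append(p)
    if lo[axis] <= p[axis]:                   # box may extend left of split
        range_query(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:                   # box may extend right of split
        range_query(node["right"], lo, hi, out)
```

A query prunes whole subtrees whose splitting plane lies outside the query box, which is what makes the per-group local indices cheap to search.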

Tuning third-party systems is time-consuming and sometimes challenging, particularly when targeting multiple embedded platforms. Unfortunately, system integrators, application developers, and other users of third-party systems lack proper tools for conducting systematic performance analysis on those systems, and have no easy way to reproduce the systems’ advertised performance or to identify configurations that yield excellent, fair, or poor behavior. To fill this void we introduce SPEX, a framework aimed at making it easier to characterize third-party systems’ performance in relation to configuration parameters. SPEX enables automatic performance exploration for systems with no need to access their source code. It offers the flexibility to define pluggable policies that steer the exploration process by varying configuration parameters of the observed system. Our results show that SPEX adds little overhead to the monitored system, and suggest that it can be effective in providing useful…

Proceedings of the 6th ACM SIGSPATIAL Workshop on Analytics for Big Geospatial Data - BigSpatial'17, 2017
Many people take photos and videos with smartphones, and more recently with 360° cameras, at popular places and events, and share them on social media. Such visual content is produced in large volumes in urban areas, and it is a source of information that online users could exploit to learn what has captured the interest of the general public on the streets of the cities where they live or plan to visit. A key step to providing users with that information is to identify the k most popular spots in specified areas. In this paper, we propose a clustering and incremental sampling (C&IS) approach that trades off accuracy of top-k results for detection speed. It uses clustering to determine areas with a high density of visual content, and incremental sampling, controlled by stopping criteria, to limit the amount of computational work. It leverages spatial metadata, which represent the scenes in the visual content, to rapidly detect the hotspots, and uses a recently proposed Gaussian probability model to describe the capture-intention distribution in the query area. We evaluate the approach with metadata derived from a non-synthetic, user-generated dataset for regular mobile and 360° visual content. Our results show that the C&IS approach offers 2.8×–19× reductions in processing time over an optimized baseline, while in most cases correctly identifying 4 out of 5 top locations.
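The density step of an approach like this can be sketched very simply: bucket geotagged items into a uniform grid and rank cells by count. This is a hedged illustration only; it omits C&IS's incremental sampling and stopping criteria, and the function and parameter names are our own.

```python
# Toy density-based hotspot detection: hash (lon, lat) points into grid
# cells and return the k densest cells. Illustrative; not the paper's C&IS.
from collections import Counter

def top_k_hotspots(points, cell_size, k):
    """points: (x, y) pairs; returns the k grid cells with the most items."""
    counts = Counter((int(x // cell_size), int(y // cell_size))
                     for x, y in points)
    return [cell for cell, _ in counts.most_common(k)]
```

In the full approach, sampling would be applied within the dense cells and stopped once the top-k ranking stabilizes, trading accuracy for speed as the abstract describes.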
ACM SIGBED Review, 2018
Virtualization allows simultaneous execution of multi-tenant workloads on the same platform, be it a server or an embedded system. Unfortunately, it is non-trivial to attribute hardware events to multiple virtual tenants, as some system metrics relate to the whole system (e.g., RAPL energy counters). Virtualized environments therefore have a rather incomplete picture of how tenants use the hardware, which limits their optimization capabilities. Thus, we propose XeMPower, a lightweight monitoring solution for Xen that precisely attributes hardware events to guest workloads. It also enables attribution of CPU power consumption to individual tenants. We show that XeMPower introduces negligible overhead in power consumption, aiming to be a reference design for power-aware virtualized environments.
ACM Transactions on Architecture and Code Optimization, 2017
Multi-tenant virtualized infrastructures allow cloud providers to minimize costs through workload consolidation. One of the largest costs is power consumption, which is challenging to understand in heterogeneous environments. We propose a power modeling methodology that tackles this complexity using a divide-and-conquer approach. Our results outperform previous research work, achieving a relative error of 2% on average and under 4% in almost all cases. Models are portable across similar architectures, enabling predictions of power consumption before migrating a tenant to a different hardware platform. Moreover, we show that the models allow us to evaluate colocations of tenants to reduce overall consumption.
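The divide-and-conquer idea can be pictured as fitting one small model per resource and summing the predictions. The sketch below fits a single per-component linear model with ordinary least squares; the calibration numbers are made up for illustration and do not come from the paper.

```python
# Illustrative per-component power model in the spirit of a
# divide-and-conquer methodology. All data and names are hypothetical.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (one resource component)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical calibration samples: CPU utilization (%) vs. measured watts.
a, b = fit_linear([0, 25, 50, 75, 100], [50, 75, 100, 125, 150])

def predict(util):
    """Predicted power draw of this one component at the given utilization."""
    return a * util + b
```

A full model would fit one such component model per resource (CPU, memory, I/O) per platform and add them up, which is what makes the models portable across similar architectures.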

2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016
HPC facilities typically use batch scheduling to space-share jobs. In this paper we revisit time-sharing using a trace of 2.4 million jobs obtained during 20 months of operation of a modern petascale supercomputer. Our simulations show that batch scheduling produces slowdown distributions that are skewed towards smaller jobs and longer execution times, whereas time-sharing produces much more uniform slowdowns. Consequently, for applications that strong-scale, the turnaround time does not scale with batch scheduling, but it does with time-sharing, resulting in turnarounds that are orders of magnitude better at the largest scales. In addition, we show that time-sharing can confer additional benefits with modern programming practices and in noisy systems. Future exascale HPC systems are expected to exhibit billion-way heterogeneous parallelism and poor performance predictability. As many applications will run in strong-scaling mode, how resource allocation policies affect the experience of supercomputer users has once again become a timely subject.
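The skew the abstract describes can be seen in a toy model: under first-come-first-served batch scheduling, a short job queued behind a long one suffers a huge slowdown, while under idealized processor sharing every job slows down by roughly the same factor. This is a simplified illustration, not the paper's simulator.

```python
# Toy slowdown comparison: FCFS batch on one machine vs. an idealized,
# uniform time-sharing approximation. Illustrative only.

def batch_slowdowns(runtimes):
    """FCFS: slowdown = (wait + run) / run for each job, in arrival order."""
    t, out = 0.0, []
    for r in runtimes:
        out.append((t + r) / r)   # short jobs behind long ones suffer most
        t += r
    return out

def timeshare_slowdowns(runtimes):
    """Crude approximation: n jobs sharing equally each slow down by n."""
    n = len(runtimes)
    return [n] * n
```

For runtimes `[100, 1]`, batch gives slowdowns of 1 and 101 (highly skewed against the small job), while the time-sharing approximation gives 2 and 2, mirroring the uniform-slowdown result reported above.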
A Scalable High-Performance In-Memory Key-Value Cache using a Microkernel-Based Design
Latency and costs of Internet-based services are driving the proliferation of web-object caching. Memcached, the most broadly deployed web-object caching solution, is a key infrastructure component for many companies that offer services via the Web, such as Amazon, Facebook, LinkedIn, Twitter, Wikipedia, and YouTube. Its aim is to reduce service latency and improve processing capability on back-end data servers by caching immutable data closer to the client machines. Caching of key-value pairs is performed…
Caching Architecture for Packet-Form In-Memory Object Caching

Proceedings of the 20th international symposium on High performance distributed computing, 2011
We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches, enhance the flexibility of the SPMD-style programming model, and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. Juggle shows performance improvements of up to 80% over static balancing for UPC, OpenMP, and pthreads benchmarks. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems. We also show that Juggle is effective in multiprogrammed environments with unpredictable interference from unrelated external applications.

Cluster Computing, 2012
We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming, and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80% over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50% performance improvement over the Linux load balancer and a 25% reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.
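The core migration idea can be sketched as a greedy rebalance: repeatedly move a thread from the most-loaded core to the least-loaded one until the spread is at most one thread. This is a hedged, centralized simplification of the idea; Juggle itself is decentralized, and the names below are our own.

```python
# Illustrative greedy thread rebalancing in the spirit of proactive load
# balancing. Not Juggle's actual (decentralized) algorithm.

def rebalance(load):
    """load: per-core thread counts, modified in place.
    Migrates one thread at a time from the busiest to the idlest core
    until max and min differ by at most one; returns the migration count."""
    moves = 0
    while max(load) - min(load) > 1:
        src = load.index(max(load))   # busiest core
        dst = load.index(min(load))   # idlest core
        load[src] -= 1                # migrate one thread
        load[dst] += 1
        moves += 1
    return moves
```

For oversubscribed SPMD runs, evening out per-core thread counts like this is what keeps all threads of a phase finishing at roughly the same time.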
A component-based automation architecture for continuous process industries
This paper discusses the design of an automation architecture for continuous process industries based on the Enterprise JavaBeans server-side component model. This architecture has been designed considering criteria such as interoperability, portability, scalability, performance, reliability, security, use of legacy systems, and maintainability. The Automation Integrated Environment (AIE) architecture includes the following elements: (i) primary data sources, (ii) real-time data sources, (iii) high-level data sources, (iv) primary event generators, (v) server-side applications, and (vi) client applications. The proposed architecture satisfied the aforementioned criteria and holds promise to be effective in the automation of continuous process industries.
Tessellation is a manycore OS targeted at the resource management challenges of emerging client devices, including the need for real-time and QoS guarantees. It is predicated on two central ideas: Space-Time Partitioning (STP) and Two-Level Scheduling. STP provides performance isolation and strong partitioning of resources among interacting software components, called Cells. Two-Level Scheduling separates global decisions about the allocation of resources to Cells from application-specific scheduling of resources within Cells. We describe Tessellation's Cell model and its resource allocation architecture. We present results from an early prototype running on two different platforms, including one with memory-bandwidth partitioning hardware.

Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking, 2018
Improvements in cloud-based speech recognition have led to an explosion in voice assistants, as bespoke devices in the home, in cars, in wearables, or on smartphones. In this paper, we present UIVoice, through which we enable voice assistants (which heavily utilize the cloud) to dynamically interact with mobile applications running at the edge. We present a framework that can be used by third-party developers to easily create Voice User Interfaces (VUIs) on top of existing applications. We demonstrate the feasibility of our approach through a prototype based on Android and Amazon Alexa, describe how we added voice to several popular applications, and provide an initial performance evaluation. We also highlight research challenges that are relevant to the edge computing community.

A multicore operating system with QoS guarantees for network audio applications
This paper is about the role of the operating system (OS) within computer nodes of network audio systems. While many efforts in the network-audio community focus on low-latency network protocols, here we highlight the importance of the OS for network audio applications. We present Tessellation, an experimental OS tailored to multicore processors. We show how specific OS features, such as guaranteed resource allocation and customizable user-level runtimes, can help ensure quality-of-service (QoS) guarantees for data transmission and audio signal processing, especially in scenarios where network bandwidth and processing resources are shared between applications. To demonstrate performance isolation and service guarantees, we benchmark Tessellation under different conditions using a resource-demanding network audio application. Our results show that Tessellation can be used to create low-latency network audio systems.
Non-Blocking Queue-Based Clock Replacement Algorithm
KV-Cache: A Scalable High-Performance Web-Object Cache for Manycore
2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing, 2013
Journal of Petroleum Science and Engineering, 2002

Computers in Industry, 2007
This paper presents a reference software architecture for the development of enterprise industrial automation applications for the oil industry. Its design accounts for criteria such as interoperability, portability, scalability, availability, security, use of legacy systems, and maintainability. The architecture includes a technological platform that consists of a J2EE application server, a failover management system, and, optionally, a server farm with an IP-redirection-based load balancer. Also part of the architecture are infrastructure elements such as: (i) process data sources (PDSs), which offer a uniform interface for synchronous and asynchronous access to SCADA or similar systems; (ii) field event generators (FEGs), which produce asynchronous notifications corresponding to the occurrence of pre-established conditions in the industrial processes; and (iii) business entities (BEs), which allow the handling of persistent information of real business entities, independently of the persistence mechanism used. Finally, its effectiveness is verified through the development of a prototype application for the optimization of the duration of well production tests.