Papers by Ion Gabriel Stoica

Proceedings of the 15th International Middleware Conference (Middleware '14), 2014
The deployment of MapReduce in datacenters and clouds presents several challenges in achieving good job performance. Compared to in-house dedicated clusters, datacenters and clouds often exhibit significant hardware and performance heterogeneity due to continuous server replacement and multi-tenant interference. As most MapReduce implementations assume homogeneous clusters, heterogeneity can cause significant load imbalance in task execution, leading to poor performance and low cluster utilization. Despite existing optimizations on task scheduling and load balancing, MapReduce still performs poorly on heterogeneous clusters. In this paper, we find that the homogeneous configuration of tasks on heterogeneous nodes can be an important source of load imbalance and thus cause poor performance. Tasks should be customized with different settings to match the capabilities of heterogeneous nodes. To this end, we propose an adaptive task tuning approach, Ant, that automatically finds the optimal settings for individual tasks running on different nodes. Ant works best for large jobs with multiple rounds of map task execution. It first configures tasks with randomly selected configurations and gradually improves task settings by reproducing the settings from the best-performing tasks and discarding poorly performing configurations. To accelerate task tuning and avoid being trapped in local optima, Ant uses genetic functions during task configuration. Experimental results on a heterogeneous cluster and a virtual cluster with varying hardware capabilities show that Ant improves the average job completion time by 23%, 11%, and 16% compared to stock Hadoop, customized Hadoop with industry recommendations, and a profiling-based configuration approach, respectively.
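A minimal sketch, not Ant's actual implementation, of the genetic-style tuning loop the abstract describes: task configurations start out random, the best performers are recombined and mutated, and poor performers are discarded each round. The parameter names, value ranges, and population sizes are illustrative assumptions.

```python
import random

# Hypothetical Hadoop-style task parameters and value ranges (illustrative only).
PARAM_RANGES = {
    "io.sort.mb": (50, 400),
    "io.sort.factor": (5, 100),
    "shuffle.parallel.copies": (5, 50),
}

def random_config():
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def crossover(a, b):
    # Take each parameter from either parent at random.
    return {k: random.choice([a[k], b[k]]) for k in PARAM_RANGES}

def mutate(cfg, rate=0.2):
    out = dict(cfg)
    for k, (lo, hi) in PARAM_RANGES.items():
        if random.random() < rate:
            out[k] = random.randint(lo, hi)
    return out

def tune(run_task, rounds=10, population=8, survivors=4):
    """run_task(cfg) -> measured task completion time in seconds (lower is better)."""
    configs = [random_config() for _ in range(population)]
    for _ in range(rounds):                      # one round per wave of map tasks
        scored = sorted(configs, key=run_task)   # evaluate by measured task time
        best = scored[:survivors]                # keep the best-performing settings
        children = [mutate(crossover(*random.sample(best, 2)))
                    for _ in range(population - survivors)]
        configs = best + children                # discard poorly performing configs
    return min(configs, key=run_task)
```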

Quality of Service — IWQoS 2003
Lecture Notes in Computer Science, 2003
Analysis and Modeling
- Network Characteristics: Modelling, Measurements, and Admission Control
- Statistical Characterization for Per-hop QoS
- Performance Analysis of Server Sharing Collectives for Content Distribution
- An Approximation of the End-to-End Delay Distribution

Resource Allocation and Admission Control
- Price-Based Resource Allocation in Wireless Ad Hoc Networks
- On Achieving Fairness in the Joint Allocation of Processing and Bandwidth Resources
- Distributed Admission Control for Heterogeneous Multicast with Bandwidth Guarantees

Multimedia & Incentives
- Subjective Impression of Variations in Layer Encoded Videos
- A Moving Average Predictor for Playout Delay Control in VoIP
- To Play or to Control: A Game-Based Control-Theoretic Approach to Peer-to-Peer Incentive Engineering

Dependability and Fault Tolerance
- Improving Dependability of Real-Time Communication with Preplanned Backup Routes and Spare Resource Pool
- Fault Tolerance in Networks with an Advance Reservation Service

Routing
- Routing and Grooming in Two-Tier Survivable Optical Mesh Networks
- Fast Network Re-optimization Schemes for MPLS and Optical Networks
- HMP: Hotspot Mitigation Protocol for Mobile Ad hoc Networks

Availability and Dependability
- Failure Insensitive Routing for Ensuring Service Availability
- Network Availability Based Service Differentiation
- Quality of Availability: Replica Placement for Widely Distributed Systems

Web Services
- Using Latency Quantiles to Engineer QoS Guarantees for Web Services
- DotQoS - A QoS Extension for .NET Remoting
- Dynamic Resource Allocation for Shared Data Centers Using Online Measurements

Rate-Based QoS
- Providing Deterministic End-to-End Fairness Guarantees in Core-Stateless Networks
- Per-domain Packet Scale Rate Guarantee for Expedited Forwarding
- On Achieving Weighted Service Differentiation: An End-to-End Perspective

Storage
- Online Response Time Optimization of Apache Web Server
- A Practical Learning-Based Approach for Dynamic Storage Bandwidth Allocation
- CacheCOW: QoS for Storage System Caches

Replay debugging systems enable the reproduction and debugging of non-deterministic failures in production application runs. However, no existing replay system is suitable for datacenter applications like Cassandra, Hadoop, and Hypertable. For these large-scale, distributed, and data-intensive programs, existing methods either incur excessive production overheads or don't scale to multi-node, terabyte-scale processing. In this position paper, we hypothesize and empirically verify that control plane determinism is the key to record-efficient and high-fidelity replay of datacenter applications. The key idea behind control plane determinism is that debugging does not always require a precise replica of the original datacenter run. Instead, it often suffices to produce some run that exhibits the original behavior of the control plane: the application code responsible for controlling and managing data flow through a datacenter system.
A Distributed Waypoint Service Approach to Connect Heterogeneous Internet Address Spaces
The rapid growth of the Internet has made IPv4 addresses a scarce resource. Today we witness two major trends to get around this problem. The first is to upgrade and deploy networks using IPv6; the second is to deploy networks using reusable IPv4 addresses. As a result, ...
A new approach to implement proportional share resource allocation
We describe a new approach to implement proportional share resource allocation and to provide different levels of service quality. We consider multiple clients that compete for a time-shared resource, and we associate to each client a certain amount of funds. At the ...
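The abstract is truncated before the mechanism's details; the sketch below only illustrates the general idea of funds-proportional sharing of a time-slotted resource, and the lottery-style allocation rule and names are assumptions, not the paper's algorithm.

```python
import random

def pick_client(funds):
    """Pick which client gets the next time slot, with probability
    proportional to its funds (a lottery-style proportional share)."""
    total = sum(funds.values())
    draw = random.uniform(0, total)
    acc = 0.0
    for client, f in funds.items():
        acc += f
        if draw <= acc:
            return client
    return client  # fallback for floating-point edge cases

# Usage: client "a" holds twice the funds of "b", so over many slots it
# should receive roughly two thirds of the resource.
funds = {"a": 200.0, "b": 100.0}
slots = [pick_client(funds) for _ in range(10000)]
print(slots.count("a") / len(slots))  # ~0.67
```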
Serial concentration of thyroid hormones in blood after successful kidney transplantation
Dialysis & …, 1998
Authors: JOSEPH LJ, DESAI KB, MEHTA HJ, MEHTA MN, ALMEIDA AF, ACHARYA VN, SAMUEL AM.

This paper presents TrickleDNS, a practical and decentralized system for disseminating DNS data securely. Unlike prior solutions, which depend on the as-yet-undeployed DNSSEC standard to preserve data integrity, TrickleDNS uses a novel security framework that provides resilience against data corruption by compromised servers and denial-of-service attacks. It is based on the key design principle of randomization: First, TrickleDNS organizes participating nameservers into a well-connected peer-to-peer network with random yet constrained links to form a Secure Network of Nameservers (SNN). Nameservers in the SNN reliably broadcast their public keys to other nameservers without relying on a centralized PKI. Second, TrickleDNS reliably binds domains to their authoritative nameservers through independent verification by multiple, randomly chosen peers within the SNN. Finally, TrickleDNS servers proactively disseminate self-certified versions of DNS records to provide faster performance, better availability, and improved security. This paper validates TrickleDNS through simulations and experiments on a prototype implementation.
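A toy sketch of the verification idea described above: a domain-to-nameserver binding is accepted only if enough independently and randomly chosen peers report the same binding. The function names, the quorum rule, and the query interface are illustrative assumptions, not the TrickleDNS protocol itself.

```python
import random
from collections import Counter

def verify_binding(domain, claimed_ns, peers, query_peer, k=5, quorum=4):
    """Ask k randomly chosen peers which nameserver they believe is
    authoritative for `domain`; accept `claimed_ns` only if at least
    `quorum` of them agree.  `query_peer(peer, domain)` is assumed to
    return that peer's view of the authoritative nameserver."""
    sample = random.sample(peers, min(k, len(peers)))
    votes = Counter(query_peer(p, domain) for p in sample)
    return votes[claimed_ns] >= quorum
```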
Chord: A scalable P2P lookup service for internet applications
Proc. of ACM SIGCOMM, 2001

Recent work on media streaming has proposed to exploit path diversity, i.e., the use of multiple end-to-end paths, as a means to obtain better performance. The best performance is achieved when the various paths are independent in the sense that the two paths do not share a Point of Congestion (PoC). However, topologies used in media streaming applications do not meet the assumption of Inverted-Y or Y topologies made by prior work on detecting shared PoCs. In this paper, we propose a new technique called CD-DJ (Correlating Drops and Delay Jitter) which solves this problem. CD-DJ is better than earlier solutions for three main reasons. First, CD-DJ overcomes the clock synchronization problem and can work with most topologies relevant to applications. Second, it provides applications with an estimate of the fraction of packet drops caused by shared PoCs. This information is more useful than a "yes/no" decision for media streaming applications because they can use it to choose a path based on the level of shared congestion. Third, CD-DJ makes the estimation by correlating bursts of packet drops in conjunction with the correlation of delay jitter in a novel way. A key contribution of our work is our evaluation methodology. We use a novel overlay-based method to evaluate our technique extensively using about 800 hours of experimental traces from Planetlab, a global overlay network. Our results indicate that CD-DJ calculates estimates that are at least within a factor of 0.8 of the actual fraction of shared drops for 80-90% of the flows. We also illustrate the advantage of using CD-DJ with a simple streaming video application.
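The abstract describes correlating bursts of packet drops on one flow with delay jitter on another to infer a shared point of congestion. A rough sketch of that kind of cross-flow correlation, assuming per-interval loss counts and jitter samples binned on a common relative time base; the Pearson-style estimator and the variable names are assumptions, not the CD-DJ estimator.

```python
def pearson(xs, ys):
    # Plain Pearson correlation of two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def shared_congestion_score(loss_a, jitter_b):
    """Correlate flow A's per-interval loss counts with flow B's
    per-interval delay jitter: drops on a shared point of congestion
    should coincide with jitter spikes on the other flow."""
    return pearson(loss_a, jitter_b)

# Usage with made-up per-100ms-interval samples.
loss_a   = [0, 0, 3, 5, 0, 0, 4, 0]       # packet drops seen by flow A
jitter_b = [1, 1, 9, 12, 2, 1, 10, 2]     # delay jitter (ms) seen by flow B
print(shared_congestion_score(loss_a, jitter_b))  # close to 1 => likely shared PoC
```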

One of the key reasons overlay networks are seen as an excellent platform for large-scale distributed systems is their resilience in the presence of node failures. This resilience relies on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they are best suited are not well understood. In this paper, we study how the design of various keep-alive approaches affects their performance in terms of node failure detection time, probability of false positives, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of the neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than the baseline algorithm.
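A minimal sketch of the baseline keep-alive logic this kind of study compares against: each node periodically probes its neighbor set and suspects a neighbor after a number of missed replies. The sharing variants in the paper additionally propagate observations between neighbors; the class name, probe interval, and miss threshold here are assumptions for illustration.

```python
import time

class KeepAliveMonitor:
    def __init__(self, neighbors, probe_interval=1.0, misses_allowed=3):
        self.probe_interval = probe_interval
        self.deadline = probe_interval * misses_allowed
        self.last_seen = {n: time.monotonic() for n in neighbors}

    def on_reply(self, neighbor):
        # Called when a keep-alive reply (or any traffic) arrives from a neighbor.
        self.last_seen[neighbor] = time.monotonic()

    def suspected_failed(self):
        # Neighbors that have missed `misses_allowed` consecutive probes.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.deadline]
```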

BGP, the current inter-domain routing protocol, assumes that the routing information propagated by authenticated routers is correct. This assumption renders the current infrastructure vulnerable to both accidental misconfigurations and deliberate attacks. To reduce this vulnerability, we present a combination of two mechanisms: Listen and Whisper. Listen passively probes the data plane and checks whether the underlying routes to different destinations work. Whisper uses cryptographic functions along with routing redundancy to detect bogus route advertisements in the control plane. These mechanisms are easily deployable, and do not rely on either a public key infrastructure or a central authority like ICANN. The combination of Listen and Whisper eliminates a large number of problems due to router misconfigurations, and restricts (though does not eliminate) the damage that deliberate attackers can cause. Moreover, these mechanisms can detect and contain isolated adversaries that propagate even a few invalid route announcements. Colluding adversaries pose a more stringent challenge, and we propose simple changes to the BGP policy mechanism to limit the damage colluding adversaries can cause. We demonstrate the utility of Listen and Whisper through real-world deployment, measurements, and empirical analysis. For example, a randomly placed isolated adversary can, in the worst case, affect reachability to only a small fraction of the nodes.
Techniques dealing with adversaries can be classified as key-distribution based or non-PKI based. Key-distribution based: one class of mechanisms builds on cryptographic enhancements of the BGP protocol, for instance the security mechanisms proposed by Smith et al. [31], Murphy et al. [27], Kent et al. [24], and recent work on Secure Origin BGP [28]. All these protocols make extensive use of digital signatures and public key certification. More lightweight approaches based on cryptographic hash functions have been proposed, e.g., by Hu et al. [20, 22] in the context of secure routing in ad hoc networks. However, these mechanisms require prior secure distribution of hash chain elements.
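Listen's core observation is that an advertised route that does not actually work can be detected passively by watching whether connections through it make progress. A toy sketch of that check, assuming we can observe per-prefix TCP connection attempts and subsequent data packets; the threshold and class/method names are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

class ListenProbe:
    """Flag prefixes whose routes appear not to work: many connection
    attempts (SYNs) toward the prefix but no completed data transfer."""
    def __init__(self, min_attempts=10):
        self.min_attempts = min_attempts
        self.syns = defaultdict(int)
        self.completed = defaultdict(int)

    def on_syn(self, prefix):
        self.syns[prefix] += 1

    def on_data(self, prefix):
        self.completed[prefix] += 1

    def suspect_prefixes(self):
        # Prefixes with enough attempts and zero successes are suspect routes.
        return [p for p, n in self.syns.items()
                if n >= self.min_attempts and self.completed[p] == 0]
```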

This report summarizes recommendations from a workshop on research challenges in distributed computer systems, sponsored by the National Science Foundation. A program committee solicited input from the research community by asking researchers to submit position papers that identified grand challenges in distributed systems, invited researchers based on their submissions, and selected a few position papers for presentation at the workshop. Most of the workshop was organized around break-out sessions, in which we refined the research challenges and identified the facilities needed to carry out future research, as well as what the distributed systems community can contribute to such a facility. Information about the workshop, including the full program, the selected submissions, and the slides of the presentations and the break-out sessions, is available at http://www.pdos.lcs.mit.edu/~kaashoek/nsf/. The workshop attendees identified a number of challenge applications whose implementation will require research advances in the design and engineering of distributed systems. Examples include: managing a large number of personal devices and data, improving the auto commute through data dissemination and using sensors and actuators in the car to avoid accidents, rapidly deploying fault-tolerant distributed systems to assist in disaster recovery, and understanding and affecting the planet in real time. Each of these applications has security, storage, fault-tolerance, and usability requirements that can be addressed only if there are new research advances. To spur these advances, the community needs a facility to support experimentation. Facilities such as Planetlab [2] and Emulab [3] have demonstrated that the right facility can spark research progress. The new applications, however, require a scale of facility that is unavailable at present, and that includes product versions of recent research advances. Such a facility would allow researchers to leverage the recent results in tackling the next set of challenges. The report makes the following recommendations for the research community and NSF. For the research community:
• Use the challenge applications to frame important and challenging research questions in distributed systems. The answers are likely to generate knowledge that goes well beyond the current understanding of distributed systems.
• Participate in the development of a shared facility to experiment with solutions. This development can leverage the recent advances in overlay networks, virtualization, secure global access, resource allocation, and debugging.
(This committee was greatly assisted by the contributions of all workshop participants, listed in Appendix A. The references in this report do not follow the scientific standards for research publications; a few references have been included to allow the reader to easily follow up on some specific points, but they are not comprehensive.)

To provide routing flexibility, that is, to accommodate various performance and policy goals, routing protocols (such as OSPF and EIGRP) include many complex knobs. Owing to this complexity, protocols today do not adequately satisfy their main goal: to provide connectivity between nodes in the face of failures and misconfigured nodes. In this paper, we ask how one can design routing protocols that are flexible, yet provide connectivity in the face of failures and misconfigurations. To this end, we propose a different routing paradigm that decouples the task of providing basic connectivity from sophisticated routing operations. We propose an underlying Basic Connectivity Routing Protocol (BCRP) that is robust to link failures and prevents misconfigured nodes from arbitrarily subverting traffic. Routing can then be made flexible by layering sophisticated route selection on top of BCRP; these protocols fall back to BCRP when failures are encountered.
USENIX Association, Mar 30, 2011
We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today's frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.
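A schematic sketch of the two-level resource-offer idea described above: the master decides which framework is offered the currently free resources, and the framework decides which offered resources to accept and what to launch on them. The data shapes, method names, and round-robin offer policy are assumptions for illustration, not the Mesos API (Mesos itself uses fair-sharing policies at the first level).

```python
class Framework:
    def __init__(self, name, cpus_per_task, mem_per_task):
        self.name = name
        self.cpus_per_task = cpus_per_task
        self.mem_per_task = mem_per_task

    def resource_offer(self, offers):
        """Second scheduling level: the framework decides which offered
        resources to accept and which tasks to run on them."""
        accepted = []
        for node, cpus, mem in offers:
            while cpus >= self.cpus_per_task and mem >= self.mem_per_task:
                accepted.append((node, self.cpus_per_task, self.mem_per_task))
                cpus -= self.cpus_per_task
                mem -= self.mem_per_task
        return accepted

class Master:
    def __init__(self, frameworks):
        self.frameworks = frameworks

    def offer_round(self, free_resources):
        """First scheduling level: the master decides which framework is
        offered the currently free resources (round-robin here)."""
        fw = self.frameworks.pop(0)
        self.frameworks.append(fw)
        return fw.name, fw.resource_offer(free_resources)

# Usage: offer two nodes' free (cpus, mem) to a Hadoop-like and an MPI-like framework.
master = Master([Framework("hadoop", 2, 4), Framework("mpi", 4, 8)])
print(master.offer_round([("node1", 8, 16), ("node2", 4, 8)]))
```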
;login: The USENIX Magazine, 2012
Matei Zaharia is a fifth-year PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in computer systems, networks, cloud computing, and big data. He is also a committer on Apache Hadoop and Apache ...

Neural programs are highly accurate and structured policies that perform algorithmic tasks by controlling the behavior of a computation mechanism. Despite the potential to increase the interpretability and the compositionality of the behavior of artificial agents, it remains difficult to learn, from demonstrations, neural networks that represent computer programs. The main challenges that set algorithmic domains apart from other imitation learning domains are the need for high accuracy, the involvement of specific structures of data, and the extremely limited observability. To address these challenges, we propose to model programs as Parametrized Hierarchical Procedures (PHPs). A PHP is a sequence of conditional operations, using a program counter along with the observation to select between taking an elementary action, invoking another PHP as a sub-procedure, and returning to the caller. We develop an algorithm for training PHPs from a set of supervisor demonstrations, only some of which ...
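A simplified interpreter for the PHP structure described above: each procedure keeps a program counter and, given the current observation, either emits an elementary action, invokes another PHP as a sub-procedure, or returns to its caller. The step encoding (each step as a function of the observation returning a decision) is an assumption for illustration, not the paper's formulation.

```python
# Each PHP is a list of steps; at its program counter the procedure consults
# the observation and chooses to act, call a sub-PHP, or return.
def run_php(phps, name, observe, act, max_steps=1000):
    stack = [(name, 0)]                      # call stack of (procedure, program counter)
    for _ in range(max_steps):
        if not stack:
            return                           # top-level PHP has returned
        proc, pc = stack.pop()
        steps = phps[proc]
        if pc >= len(steps):
            continue                         # implicit return to the caller
        kind, arg = steps[pc](observe())     # step decides based on the observation
        if kind == "return":
            continue                         # explicit return to the caller
        stack.append((proc, pc + 1))         # resume here afterwards
        if kind == "action":
            act(arg)                         # elementary action
        elif kind == "call":
            stack.append((arg, 0))           # invoke a sub-procedure
```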

Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization
One of the key challenges arising when compilers vectorize loops for today's SIMD-compatible architectures is to decide whether vectorization or interleaving is beneficial. Then, the compiler has to determine the number of instructions to pack together and the interleaving level (stride). Compilers today use fixed cost models based on heuristics to make vectorization decisions on loops. However, these models are unable to capture the data dependencies, the computation graph, or the organization of instructions. Alternatively, software engineers often hand-write the vectorization factors of every loop. This, however, places a huge burden on them, since it requires prior experience and significantly increases the development time. In this work, we explore a novel approach for handling loop vectorization and propose an end-to-end solution using deep reinforcement learning (RL). We conjecture that deep RL can capture different instructions, dependencies, and data structures to enable learning a sophisticated model that can better predict the actual performance cost and determine the optimal vectorization factors. We develop an end-to-end framework, from code to vectorization, that integrates deep RL in the LLVM compiler. Our proposed framework takes benchmark codes as input and extracts the loop codes. These loop codes are then fed to a loop embedding generator that learns an embedding for these loops. Finally, the learned embeddings are used as input to a deep RL agent, which ... (Part of this work was done while Ameer Haj-Ali was a summer intern at Intel Labs.)
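A skeletal view of the per-loop decision the abstract describes: pick a vectorization factor and an interleaving factor from fixed candidate sets, observe the measured cost, and update the policy. This is a generic epsilon-greedy bandit-style sketch under assumed names and candidate sets, not the paper's deep RL framework or its LLVM integration.

```python
import random
from collections import defaultdict

VF_CHOICES = [1, 2, 4, 8, 16]      # candidate vectorization factors (assumed)
IF_CHOICES = [1, 2, 4, 8]          # candidate interleaving factors (assumed)

class VectorizationAgent:
    """Epsilon-greedy choice of (VF, IF) per loop key, rewarded by
    negative measured runtime, so lower runtime means higher reward."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.q = defaultdict(float)   # (loop_key, vf, if) -> running average reward
        self.n = defaultdict(int)

    def choose(self, loop_key):
        if random.random() < self.epsilon:
            return random.choice(VF_CHOICES), random.choice(IF_CHOICES)
        return max(((vf, itl) for vf in VF_CHOICES for itl in IF_CHOICES),
                   key=lambda a: self.q[(loop_key, *a)])

    def update(self, loop_key, action, runtime):
        key = (loop_key, *action)
        self.n[key] += 1
        reward = -runtime
        self.q[key] += (reward - self.q[key]) / self.n[key]
```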

Proceedings of the VLDB Endowment
Distributed storage employs replication to mask failures and improve availability. However, these systems typically exhibit a hard tradeoff between consistency and performance. Ensuring consistency introduces coordination overhead, and as a result the system throughput does not scale with the number of replicas. We present Harmonia, a replicated storage architecture that exploits the capability of new-generation programmable switches to obviate this tradeoff by providing near-linear scalability without sacrificing consistency. To achieve this goal, Harmonia detects read-write conflicts in the network, which enables any replica to serve reads for objects with no pending writes. Harmonia implements this functionality at line rate, thus imposing no performance overhead. We have implemented a prototype of Harmonia on a cluster of commodity servers connected by a Barefoot Tofino switch, and have integrated it with Redis. We demonstrate the generality of our approach by supporting a varie...
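The mechanism described above hinges on tracking which objects have pending writes, so that reads to all other objects can safely go to any replica. A toy software model of that dirty-set check (in Harmonia this logic runs in the programmable switch at line rate); the class and method names are illustrative assumptions.

```python
import random

class ConflictDetector:
    """Route reads: objects with a pending (unacknowledged) write must go
    to the leader; reads to all other objects may go to any replica."""
    def __init__(self, replicas, leader):
        self.replicas = replicas
        self.leader = leader
        self.pending_writes = set()      # object IDs with in-flight writes

    def on_write_start(self, obj_id):
        self.pending_writes.add(obj_id)
        return self.leader               # writes always go through the leader

    def on_write_ack(self, obj_id):
        self.pending_writes.discard(obj_id)

    def route_read(self, obj_id):
        if obj_id in self.pending_writes:
            return self.leader           # possible read-write conflict
        return random.choice(self.replicas)  # safe to read from any replica
```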