Partial-Parallel-Repair (PPR): A Distributed Technique for Repairing Erasure Coded Storage
EuroSys 2016, Apr 18, 2016
With the explosion of data in applications all around us, erasure coded storage has emerged as an attractive alternative to replication because, even with significantly lower storage overhead, it provides better reliability against data loss. Reed-Solomon code is the most widely used erasure code because it provides maximum reliability for a given storage overhead and is flexible in the choice of coding parameters that determine the achievable reliability. However, reconstruction time for unavailable data becomes prohibitively long, mainly because of network bottlenecks. Some proposed solutions either use additional storage or limit the coding parameters that can be used. In this paper, we propose a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation into small partial operations and schedules them on multiple nodes already involved in the data reconstruction. A distributed protocol then progressively combines these partial results to reconstruct the unavailable data blocks, reducing the network pressure. Theoretically, our technique can complete the network transfer in ⌈log2(k + 1)⌉ time, compared to the k time needed for a (k, m) Reed-Solomon code. Our experiments show that PPR reduces repair time and degraded read time significantly. Moreover, our technique is compatible with existing erasure codes and does not require any additional storage overhead. We demonstrate this by overlaying PPR on top of two prior schemes, Local Reconstruction Code and Rotated Reed-Solomon code, to gain additional savings in reconstruction time.
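To make the ⌈log2(k + 1)⌉ bound concrete: in conventional Reed-Solomon repair, a lost block is a linear combination of k surviving blocks, all of which must flow into one repair node, so its ingress link serializes k transfers. Because addition in GF(2^8) is bytewise XOR, the coefficient-scaled partials combine associatively, which is what lets PPR merge them pairwise in a reduction tree. The sketch below is an illustrative simulation of that data flow, not the paper's implementation; in a real decoder the coefficients come from inverting the code's generator matrix, whereas here they are placeholders.

```python
import math

def gf256_mul(a: int, b: int) -> int:
    """Carry-less multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1,
    the standard Reed-Solomon field arithmetic."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
    return p

def scale(coeff: int, block: bytes) -> bytes:
    """A helper node's local partial: its decoding coefficient times
    its surviving block, computed where the data lives."""
    return bytes(gf256_mul(coeff, x) for x in block)

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Addition in GF(2^8) is bytewise XOR, so partials can be merged
    in any pairing order -- the key property behind the tree."""
    return bytes(x ^ y for x, y in zip(a, b))

def ppr_repair(blocks, coeffs):
    """Simulated PPR repair: helpers compute partials locally, then
    partials merge pairwise in rounds instead of streaming all k
    blocks into a single repair node."""
    partials = [scale(c, b) for c, b in zip(coeffs, blocks)]
    rounds = 0
    while len(partials) > 1:
        rounds += 1
        merged = [xor_blocks(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])  # odd helper joins the next round
        partials = merged
    return partials[0], rounds

k = 6  # e.g. a (6, 3) Reed-Solomon code
lost, rounds = ppr_repair([bytes([i] * 4) for i in range(1, k + 1)], [1] * k)
print(rounds, "rounds; bound =", math.ceil(math.log2(k + 1)))  # 3 vs. 6 serial transfers
```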
Papers by Subrata Mitra
Predicting an application's performance and functional behavior is a critical component for a wide range of tools, including anomaly detection, task scheduling, and approximate computing. Statistical modeling is a very powerful approach for making such predictions: it uses observations of application behavior on a small number of training cases to predict how the application will behave in practice. However, because applications' behavior often depends closely on their configuration parameters and the properties of their inputs, any suite of application training runs will cover only a small fraction of the overall behavior space. Since a model's accuracy often degrades as the application configuration and inputs deviate further from its training set, it is difficult to act on the model's predictions.
This paper presents a systematic approach to quantify the prediction errors of statistical models of application behavior, focusing on extrapolation, where the application configuration and input parameters differ significantly from the model's training set. Given any statistical model of application behavior and the data set of training runs from which that model is built, our technique predicts the accuracy of the model on a new run with hitherto unseen inputs. We validate the utility of this method by evaluating it on the use case of anomaly detection for seven mainstream applications and benchmarks. The evaluation demonstrates that our technique can reduce false alarms while providing high detection accuracy compared to a statistical, input-unaware modeling technique.
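The abstract does not spell out the error-quantification method itself, so the sketch below illustrates only the general idea with an assumed distance-based heuristic: a prediction for a configuration that lies far from every training run (in normalized parameter space) is treated as extrapolated, and the anomaly detector widens its tolerance accordingly to suppress low-confidence alarms. All names and thresholds here are illustrative.

```python
import numpy as np

def extrapolation_score(train_X: np.ndarray, x_new: np.ndarray) -> float:
    """Distance of a new configuration from its nearest training run,
    with each parameter normalized by its training-set range. Larger
    values mean the model is extrapolating further. (Heuristic only,
    not the paper's actual error model.)"""
    lo = train_X.min(axis=0)
    span = train_X.max(axis=0) - lo
    span[span == 0] = 1.0            # guard constant parameters
    scaled = (train_X - lo) / span
    q = (x_new - lo) / span
    return float(np.min(np.linalg.norm(scaled - q, axis=1)))

def flag_anomaly(observed, predicted, base_tol, train_X, x_new, k=2.0):
    """Widen the anomaly tolerance in proportion to how far the run's
    configuration sits from the training set, trading a little recall
    on extrapolated inputs for far fewer false alarms."""
    tol = base_tol * (1.0 + k * extrapolation_score(train_X, x_new))
    return abs(observed - predicted) > tol
```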
Contention for hardware resources such as cache, network, and I/O has been a frequent occurrence in public cloud platforms. A web server that suffers from performance interference degrades the interactive user experience and results in lost revenue. Existing work on interference mitigation tries to address this problem through intrusive changes to the hypervisor, e.g., intelligent schedulers or live migration, many of which are available only to infrastructure providers and not to end consumers. In this paper, we present a framework for administering web server clusters in which the effects of interference can be reduced by intelligent reconfiguration. Our controller, ICE, improves web server performance during interference by performing two-fold autonomous reconfigurations. First, it reconfigures the load balancer at the ingress point of the server cluster, reducing load on the impacted server. ICE then reconfigures the middleware at the impacted server to reduce its load even further. We implement and evaluate ICE on CloudSuite, a popular web application benchmark, and with two popular load balancers, HAProxy and LVS. Our experiments in a private cloud testbed show that ICE can improve the median response time of web servers by up to 94% compared to a statically configured server cluster. ICE also outperforms an adaptive load balancer (using least-connection scheduling) by up to 39%.
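The abstract does not detail ICE's control policy, so the sketch below shows one plausible form of the load-balancer reconfiguration step: shifting traffic away from an interference-impacted server by setting per-server weights inversely proportional to observed latency. It uses HAProxy's runtime `set weight` command over the admin socket; the socket path, backend name, and scaling policy are all assumptions for illustration.

```python
import socket

HAPROXY_SOCK = "/var/run/haproxy.sock"  # assumed admin socket path
BACKEND = "web"                          # assumed backend name

def haproxy_cmd(cmd: str) -> str:
    """Send one command over HAProxy's stats/admin socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(HAPROXY_SOCK)
        s.sendall((cmd + "\n").encode())
        return s.recv(4096).decode()

def rebalance(latencies_ms: dict, max_weight: int = 100) -> None:
    """Set each server's weight inversely proportional to its observed
    median latency, so an interference-impacted server receives less
    traffic. The inverse-latency scaling is an illustrative choice,
    not ICE's published controller logic."""
    inv = {srv: 1.0 / max(lat, 1e-3) for srv, lat in latencies_ms.items()}
    top = max(inv.values())
    for srv, v in inv.items():
        weight = max(1, round(max_weight * v / top))
        haproxy_cmd(f"set weight {BACKEND}/{srv} {weight}")

# e.g. rebalance({"web1": 40.0, "web2": 210.0})  # web2 looks impacted
```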
Recent years have seen a tremendous growth in the number, duration, and variety of video contents, which contribute to the bulk of internet traffic. With the increase in smartphone and tablet users, watching videos on mobile devices has become one of their most popular use cases. These devices live on limited battery energy, which is still a major bottleneck and a source of user dissatisfaction during video playback. In this paper we introduce an intermediate framework, called VIDALIZER, for power-efficient video delivery to smartphones and tablets. This battery-aware framework, almost transparent to the user, takes away some of the video-processing overhead from the device and intelligently tunes its parameters for the mobile device while delivering the video over a novel transport protocol. Our preliminary results show that this framework can reduce a mobile device's energy consumption by up to 45%-55% without compromising user experience.
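The abstract does not name the parameters VIDALIZER tunes, so the sketch below shows only one plausible battery-aware policy an intermediate framework could apply: choosing a lower-bitrate rendition for devices reporting low battery, trading some fidelity for decode and radio energy. The tiers and values are arbitrary illustrative choices.

```python
def pick_bitrate_kbps(battery_pct: float,
                      levels=(4500, 2500, 1200, 700)) -> int:
    """Battery-aware bitrate selection: serve lower-bitrate renditions
    to devices reporting low battery. Thresholds are illustrative."""
    if battery_pct > 60:
        return levels[0]
    if battery_pct > 30:
        return levels[1]
    if battery_pct > 15:
        return levels[2]
    return levels[3]

print(pick_bitrate_kbps(72))  # 4500: plenty of battery, full quality
print(pick_bitrate_kbps(12))  # 700: conserve energy near depletion
```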
We present a novel run-time technique, loop-aware progress-dependence analysis, that improves the accuracy of identifying the least-progressed (LP) task(s). Our technique extends an existing analysis technique (AutomaDeD) to detect LP task(s) even when the error arises within complex loop structures. Our preliminary evaluation shows that it accurately finds LP task(s) on several hangs where the previous technique failed.
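The abstract leaves the mechanism to the paper; the sketch below illustrates why loop awareness matters with an assumed snapshot format: each task reports the static order of its current code point plus the iteration counters of its enclosing loops, and tasks are compared lexicographically. Loop counters break ties among tasks stuck at the same code point, which is exactly where a loop-unaware analysis loses accuracy.

```python
from typing import NamedTuple

class Progress(NamedTuple):
    """One task's progress snapshot: static order of its current code
    point, plus enclosing-loop iteration counters (outermost first).
    This layout is an illustrative assumption, not AutomaDeD's."""
    point_order: int
    loop_iters: tuple

def least_progressed(snapshots: dict) -> list:
    """Return the task(s) with the lexicographically smallest
    (code point, loop iterations) tuple."""
    key = lambda tid: (snapshots[tid].point_order, snapshots[tid].loop_iters)
    lo = min(snapshots, key=key)
    return [t for t in snapshots if key(t) == key(lo)]

ranks = {0: Progress(12, (8,)), 1: Progress(12, (3,)), 2: Progress(15, (0,))}
print(least_progressed(ranks))  # [1]: same point as task 0, fewer iterations
```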
Our framework, ORION, collects a variety of runtime metrics and models the application's runtime behavior through pairwise correlations of those metrics in the system, within multiple non-overlapping time windows. When correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (the class, and the method within it) that are most likely associated with the failure. We demonstrate our framework on several real-world failure cases in distributed applications such as HBase, Hadoop DFS, a campus-wide Java application, and a regression-testing framework from IBM. Our results show that ORION is able to pinpoint the metrics and code regions that developers need to concentrate on to fix the failures.
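The abstract describes the mechanism at a high level: per-window pairwise correlations, compared against a learned correct model. The sketch below shows that idea with a simple scoring scheme; averaging the window matrices and summing each metric's absolute correlation drift is an illustrative choice, not necessarily ORION's exact statistic.

```python
import numpy as np

def window_corr(data: np.ndarray, win: int):
    """Pairwise Pearson correlation matrices over non-overlapping
    windows of `win` samples; data has shape (samples, metrics)."""
    n = data.shape[0] // win
    return [np.corrcoef(data[i * win:(i + 1) * win].T) for i in range(n)]

def suspect_metrics(normal: np.ndarray, faulty: np.ndarray, win: int = 50):
    """Rank metrics by how much their correlations with every other
    metric drift between the learned correct model and a failing run;
    the top-ranked metrics point at the likely failure."""
    base = np.mean(window_corr(normal, win), axis=0)
    obs = np.mean(window_corr(faulty, win), axis=0)
    drift = np.nan_to_num(np.abs(base - obs))
    scores = drift.sum(axis=1)       # one drift score per metric
    return np.argsort(scores)[::-1]  # most suspicious first

# e.g. suspect_metrics(normal_run, faulty_run)[:3] -> indices of the
# three metrics whose correlation structure changed the most.
```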
Keywords: debugging aids; tracing; diagnostics; performance metrics