Partial-Parallel-Repair (PPR): A Distributed Technique for Repairing Erasure Coded Storage
EuroSys 2016, Apr 18, 2016
With the explosion of data in applications all around us, erasure coded storage has emerged as an attractive alternative to replication because, even with significantly lower storage overhead, it provides better reliability against data loss. Reed-Solomon code is the most widely used erasure code because it provides maximum reliability for a given storage overhead and is flexible in the choice of coding parameters that determine the achievable reliability. However, reconstruction time for unavailable data becomes prohibitively long, mainly because of network bottlenecks. Some proposed solutions either use additional storage or limit the coding parameters that can be used. In this paper, we propose a novel distributed reconstruction technique, called Partial Parallel Repair (PPR), which divides the reconstruction operation into small partial operations and schedules them on multiple nodes already involved in the data reconstruction. A distributed protocol then progressively combines these partial results to reconstruct the unavailable data blocks, reducing the network pressure. Theoretically, our technique can complete the network transfer in ⌈log2(k + 1)⌉ time, compared to the k time needed for a (k, m) Reed-Solomon code. Our experiments show that PPR reduces repair time and degraded read time significantly. Moreover, our technique is compatible with existing erasure codes and does not require any additional storage overhead. We demonstrate this by overlaying PPR on top of two prior schemes, Local Reconstruction Code and Rotated Reed-Solomon code, to gain additional savings in reconstruction time.
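To make the ⌈log2(k + 1)⌉ bound concrete: in conventional Reed-Solomon repair, a lost block is a linear combination of k surviving blocks, all of which must flow into one repair node, so its ingress link serializes k transfers. Because addition in GF(2^8) is bytewise XOR, the coefficient-scaled partials combine associatively, which is what lets PPR merge them pairwise in a reduction tree. The sketch below is an illustrative simulation of that data flow, not the paper's implementation; in a real decoder the coefficients come from inverting the code's generator matrix, whereas here they are placeholders.

```python
import math

def gf256_mul(a: int, b: int) -> int:
    """Carry-less multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1,
    the standard Reed-Solomon field arithmetic."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
    return p

def scale(coeff: int, block: bytes) -> bytes:
    """A helper node's local partial: its decoding coefficient times
    its surviving block, computed where the data lives."""
    return bytes(gf256_mul(coeff, x) for x in block)

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Addition in GF(2^8) is bytewise XOR, so partials can be merged
    in any pairing order -- the key property behind the tree."""
    return bytes(x ^ y for x, y in zip(a, b))

def ppr_repair(blocks, coeffs):
    """Simulated PPR repair: helpers compute partials locally, then
    partials merge pairwise in rounds instead of streaming all k
    blocks into a single repair node."""
    partials = [scale(c, b) for c, b in zip(coeffs, blocks)]
    rounds = 0
    while len(partials) > 1:
        rounds += 1
        merged = [xor_blocks(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            merged.append(partials[-1])  # odd helper joins the next round
        partials = merged
    return partials[0], rounds

k = 6  # e.g. a (6, 3) Reed-Solomon code
lost, rounds = ppr_repair([bytes([i] * 4) for i in range(1, k + 1)], [1] * k)
print(rounds, "rounds; bound =", math.ceil(math.log2(k + 1)))  # 3 vs. 6 serial transfers
```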
Papers by Subrata Mitra
Predicting an application's performance and functional behavior is a critical component for a wide range of tools, including anomaly detection, task scheduling, and approximate computing. Statistical modeling is a very powerful approach for making such predictions: it uses observations of application behavior on a small number of training cases to predict how the application will behave in practice. However, because applications' behavior often depends closely on their configuration parameters and the properties of their inputs, any suite of application training runs will cover only a small fraction of the overall behavior space. Since a model's accuracy often degrades as the application configuration and inputs deviate further from its training set, it is difficult to act on the model's predictions.
This paper presents a systematic approach to quantify the prediction errors of statistical models of application behavior, focusing on extrapolation, where the application configuration and input parameters differ significantly from the model's training set. Given any statistical model of application behavior and the data set of training runs from which that model is built, our technique predicts the accuracy of the model on a new run with hitherto unseen inputs. We validate the utility of this method by evaluating it on the use case of anomaly detection for seven mainstream applications and benchmarks. The evaluation demonstrates that our technique can reduce false alarms while providing high detection accuracy compared to a statistical, input-unaware modeling technique.
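The abstract does not spell out the error-quantification method itself, so the sketch below illustrates only the general idea with an assumed distance-based heuristic: a prediction for a configuration that lies far from every training run (in normalized parameter space) is treated as extrapolated, and the anomaly detector widens its tolerance accordingly to suppress low-confidence alarms. All names and thresholds here are illustrative.

```python
import numpy as np

def extrapolation_score(train_X: np.ndarray, x_new: np.ndarray) -> float:
    """Distance of a new configuration from its nearest training run,
    with each parameter normalized by its training-set range. Larger
    values mean the model is extrapolating further. (Heuristic only,
    not the paper's actual error model.)"""
    lo = train_X.min(axis=0)
    span = train_X.max(axis=0) - lo
    span[span == 0] = 1.0            # guard constant parameters
    scaled = (train_X - lo) / span
    q = (x_new - lo) / span
    return float(np.min(np.linalg.norm(scaled - q, axis=1)))

def flag_anomaly(observed, predicted, base_tol, train_X, x_new, k=2.0):
    """Widen the anomaly tolerance in proportion to how far the run's
    configuration sits from the training set, trading a little recall
    on extrapolated inputs for far fewer false alarms."""
    tol = base_tol * (1.0 + k * extrapolation_score(train_X, x_new))
    return abs(observed - predicted) > tol
```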
Contention for hardware resources such as cache, network, and I/O has been a frequent occurrence in public cloud platforms. A web server that suffers from performance interference degrades the interactive user experience and results in lost revenue. Existing work on interference mitigation tries to address this problem through intrusive changes to the hypervisor, e.g., intelligent schedulers or live migration, many of which are available only to infrastructure providers and not to end consumers. In this paper, we present a framework for administering web server clusters in which the effects of interference can be reduced by intelligent reconfiguration. Our controller, ICE, improves web server performance during interference by performing two-fold autonomous reconfigurations. First, it reconfigures the load balancer at the ingress point of the server cluster, reducing load on the impacted server. ICE then reconfigures the middleware at the impacted server to reduce its load even further. We implement and evaluate ICE on CloudSuite, a popular web application benchmark, and with two popular load balancers, HAProxy and LVS. Our experiments in a private cloud testbed show that ICE can improve the median response time of web servers by up to 94% compared to a statically configured server cluster. ICE also outperforms an adaptive load balancer (using least-connection scheduling) by up to 39%.
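The abstract does not detail ICE's control policy, so the sketch below shows one plausible form of the load-balancer reconfiguration step: shifting traffic away from an interference-impacted server by setting per-server weights inversely proportional to observed latency. It uses HAProxy's runtime `set weight` command over the admin socket; the socket path, backend name, and scaling policy are all assumptions for illustration.

```python
import socket

HAPROXY_SOCK = "/var/run/haproxy.sock"  # assumed admin socket path
BACKEND = "web"                          # assumed backend name

def haproxy_cmd(cmd: str) -> str:
    """Send one command over HAProxy's stats/admin socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(HAPROXY_SOCK)
        s.sendall((cmd + "\n").encode())
        return s.recv(4096).decode()

def rebalance(latencies_ms: dict, max_weight: int = 100) -> None:
    """Set each server's weight inversely proportional to its observed
    median latency, so an interference-impacted server receives less
    traffic. The inverse-latency scaling is an illustrative choice,
    not ICE's published controller logic."""
    inv = {srv: 1.0 / max(lat, 1e-3) for srv, lat in latencies_ms.items()}
    top = max(inv.values())
    for srv, v in inv.items():
        weight = max(1, round(max_weight * v / top))
        haproxy_cmd(f"set weight {BACKEND}/{srv} {weight}")

# e.g. rebalance({"web1": 40.0, "web2": 210.0})  # web2 looks impacted
```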
Recent years have seen a tremendous growth in the number, duration, and variety of video contents, which contribute to the bulk of internet traffic. With the increase in smartphone and tablet users, watching videos on mobile devices has become one of their most popular use cases. These devices live on limited battery energy, which is still a major bottleneck and a source of user dissatisfaction during video playback. In this paper we introduce an intermediate framework, called VIDALIZER, for power-efficient video delivery to smartphones and tablets. This battery-aware framework, almost transparent to the user, takes away some of the video-processing overhead from the device and intelligently tunes its parameters for the mobile device while delivering the video over a novel transport protocol. Our preliminary results show that this framework can reduce a mobile device's energy consumption by up to 45%-55% without compromising user experience.
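The abstract does not name the parameters VIDALIZER tunes, so the sketch below shows only one plausible battery-aware policy an intermediate framework could apply: choosing a lower-bitrate rendition for devices reporting low battery, trading some fidelity for decode and radio energy. The tiers and values are arbitrary illustrative choices.

```python
def pick_bitrate_kbps(battery_pct: float,
                      levels=(4500, 2500, 1200, 700)) -> int:
    """Battery-aware bitrate selection: serve lower-bitrate renditions
    to devices reporting low battery. Thresholds are illustrative."""
    if battery_pct > 60:
        return levels[0]
    if battery_pct > 30:
        return levels[1]
    if battery_pct > 15:
        return levels[2]
    return levels[3]

print(pick_bitrate_kbps(72))  # 4500: plenty of battery, full quality
print(pick_bitrate_kbps(12))  # 700: conserve energy near depletion
```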
We present a novel run-time technique, loop-aware progress-dependence analysis, that improves the accuracy of identifying the least-progressed (LP) task(s). Our technique extends an existing analysis technique (AutomaDeD) to detect LP task(s) even when the error arises within complex loop structures. Our preliminary evaluation shows that it accurately finds LP task(s) on several hangs where the previous technique failed.
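The abstract leaves the mechanism to the paper; the sketch below illustrates why loop awareness matters with an assumed snapshot format: each task reports the static order of its current code point plus the iteration counters of its enclosing loops, and tasks are compared lexicographically. Loop counters break ties among tasks stuck at the same code point, which is exactly where a loop-unaware analysis loses accuracy.

```python
from typing import NamedTuple

class Progress(NamedTuple):
    """One task's progress snapshot: static order of its current code
    point, plus enclosing-loop iteration counters (outermost first).
    This layout is an illustrative assumption, not AutomaDeD's."""
    point_order: int
    loop_iters: tuple

def least_progressed(snapshots: dict) -> list:
    """Return the task(s) with the lexicographically smallest
    (code point, loop iterations) tuple."""
    key = lambda tid: (snapshots[tid].point_order, snapshots[tid].loop_iters)
    lo = min(snapshots, key=key)
    return [t for t in snapshots if key(t) == key(lo)]

ranks = {0: Progress(12, (8,)), 1: Progress(12, (3,)), 2: Progress(15, (0,))}
print(least_progressed(ranks))  # [1]: same point as task 0, fewer iterations
```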
Our framework, ORION, collects a variety of runtime metrics and models the application's runtime behavior through pairwise correlations of those metrics in the system, within multiple non-overlapping time windows. When correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (the class, and the method within it) that are most likely associated with the failure. We demonstrate our framework on several real-world failure cases in distributed applications such as HBase, Hadoop DFS, a campus-wide Java application, and a regression-testing framework from IBM. Our results show that ORION is able to pinpoint the metrics and code regions that developers need to concentrate on to fix the failures.
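The abstract describes the mechanism at a high level: per-window pairwise correlations, compared against a learned correct model. The sketch below shows that idea with a simple scoring scheme; averaging the window matrices and summing each metric's absolute correlation drift is an illustrative choice, not necessarily ORION's exact statistic.

```python
import numpy as np

def window_corr(data: np.ndarray, win: int):
    """Pairwise Pearson correlation matrices over non-overlapping
    windows of `win` samples; data has shape (samples, metrics)."""
    n = data.shape[0] // win
    return [np.corrcoef(data[i * win:(i + 1) * win].T) for i in range(n)]

def suspect_metrics(normal: np.ndarray, faulty: np.ndarray, win: int = 50):
    """Rank metrics by how much their correlations with every other
    metric drift between the learned correct model and a failing run;
    the top-ranked metrics point at the likely failure."""
    base = np.mean(window_corr(normal, win), axis=0)
    obs = np.mean(window_corr(faulty, win), axis=0)
    drift = np.nan_to_num(np.abs(base - obs))
    scores = drift.sum(axis=1)       # one drift score per metric
    return np.argsort(scores)[::-1]  # most suspicious first

# e.g. suspect_metrics(normal_run, faulty_run)[:3] -> indices of the
# three metrics whose correlation structure changed the most.
```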
Keywords: debugging aids; tracing; diagnostics; performance metrics