Papers by Subhabrata Bhattacharya
Shot Change and Stuck Pixel Detection of Digital Video Assets
SMPTE Motion Imaging Journal, Jul 1, 2018
In this paper, we address two significant components of typical post-production workflows in the current entertainment industry: (a) shot change detection and (b) dead- or stuck-pixel detection. The former involves identifying precise temporal boundaries of cinematographic content in order to analyze the quality of cuts, while the latter is useful for localizing and correcting spatial anomalies that emerge during the image acquisition step due to faulty sensors. To address the aforementioned problems, we devise two novel data-driven approaches and demonstrate their portability, scalability, and efficacy through two respective cloud computing-based implementations. Our approaches show promising results on an in-house dataset comprising a large number of movie titles, indicating their effectiveness.
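
The abstract does not disclose the detectors themselves, so the following is only a minimal illustrative sketch of the stuck-pixel idea: a faulty pixel holds (nearly) the same value across frames while the surrounding scene changes. The function name and thresholds are assumptions, not the paper's data-driven method.

    import numpy as np

    def find_stuck_pixels(frames, var_thresh=1e-3):
        """Flag pixels whose intensity barely changes across a clip.

        frames: (T, H, W) float array of grayscale frames in [0, 1].
        Returns a boolean (H, W) mask of candidate stuck pixels.
        """
        temporal_var = frames.var(axis=0)            # per-pixel variance over time
        scene_active = temporal_var.mean() > 10 * var_thresh
        # Flag only when the scene itself shows activity; otherwise a static
        # shot would mark every pixel as stuck.
        return (temporal_var < var_thresh) & scene_active

In practice one would pool such evidence across many shots of a title before correcting a pixel, since legitimately static regions can mimic sensor faults within a single clip.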

Composition information is an important cue to characterize the aesthetic property of an image. We propose to model the image composition information as the mutual dependencies of its local regions, and design an architecture to leverage such information to boost aesthetic assessment. We adopt a Fully Convolutional Network (FCN) as the feature encoder of the input image and use the encoded feature map to represent the individual local regions and their spatial layout in the image. Then we build a region composition graph in which each node denotes one region and any two nodes are connected by an edge weighted by the similarity of the region features. We perform reasoning on this graph via graph convolution, in which the activation of each node is determined by its highly correlated neighbors. Our method achieves state-of-the-art performance on the benchmark visual aesthetics dataset.
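
As a rough illustration of the region-composition reasoning described above, the sketch below treats each spatial cell of an FCN feature map as a graph node and performs one round of similarity-weighted graph convolution. The projection layer, normalization, and layer count are assumptions; the paper's exact architecture is not given in this summary.

    import torch
    import torch.nn.functional as F

    def region_graph_reasoning(feat_map, proj):
        """One similarity-weighted graph-convolution step over FCN regions.

        feat_map: (C, H, W) encoded feature map; each cell is a region node.
        proj:     a torch.nn.Linear(C, C) acting as the graph-conv weight.
        """
        C, H, W = feat_map.shape
        nodes = feat_map.reshape(C, H * W).t()    # (N, C) region features
        sim = F.normalize(nodes, dim=1) @ F.normalize(nodes, dim=1).t()
        adj = F.softmax(sim, dim=1)               # edges weighted by feature similarity
        updated = adj @ proj(nodes)               # each node aggregates correlated neighbors
        return F.relu(updated).t().reshape(C, H, W)

    refined = region_graph_reasoning(torch.randn(256, 7, 7), torch.nn.Linear(256, 256))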

Adaptive Context Reading Network for Movie Scene Detection
IEEE Transactions on Circuits and Systems for Video Technology, Sep 1, 2021
Video scene detection is the task of temporally segmenting a video into its basic story units, called scenes. We propose a temporal-context-aware scene detection method. For each shot in a video, we store the time-indexed features of its surrounding shots as its context memory. A context-reading operation is performed to read the most relevant information from the memory, which is used to update the feature of the query shot. To adaptively determine the temporal scale of context memory for different queries, we apply a bank of context memories of different temporal scales to generate multiple context reads, and adaptively aggregate them according to their confidence scores. The adaptive context reading is guided by a structure learning objective which encourages each shot to read the most appropriate context, such that the global structure of the scene can be revealed in the feature space. With the context-aware shot features learned by our method, we perform clustering to find the scene boundaries. Our experiments demonstrate that adaptively modeling temporal context yields state-of-the-art results on the existing video scene detection datasets. We also construct a large-scale dataset for the task, and our ablation studies on it show that the performance gains stem from the proposed adaptive context reading.
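
A minimal sketch of the multi-scale context reading described above, with attention reads over shot features at several temporal scales. How the paper computes the per-scale confidence scores is not stated in the abstract; attention peakiness is used here as a stand-in, and all names are illustrative.

    import torch
    import torch.nn.functional as F

    def adaptive_context_read(query, shot_feats, t, scales=(4, 8, 16)):
        """Context-update a query shot feature from multi-scale memories.

        query:      (D,) feature of the query shot (index t).
        shot_feats: (T, D) features of all shots in the video.
        """
        reads, confs = [], []
        for s in scales:
            lo, hi = max(0, t - s), min(len(shot_feats), t + s + 1)
            mem = shot_feats[lo:hi]                    # context memory at scale s
            attn = F.softmax(mem @ query / mem.shape[1] ** 0.5, dim=0)
            reads.append(attn @ mem)                   # context read for this scale
            confs.append(attn.max())                   # stand-in read-confidence score
        w = F.softmax(torch.stack(confs), dim=0)       # adaptive scale weighting
        return query + (w.unsqueeze(1) * torch.stack(reads)).sum(0)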

Proceedings of NIST TRECVID Workshop, 2011
In this paper, we present results from the experimental evaluation for the TRECVID 2011 MED11 (Multimedia Event Detection) task as a part of Team SRI-Sarnoff's AURORA system being developed under the IARPA ALADDIN Program. Our approach employs two classes of content descriptions for describing videos depicting diverse events: (1) low-level features and their aggregates, and (2) semantic concepts that capture scenes, objects, and atomic actions that are local in space-time. In this presentation we summarize our system design and the content descriptions used. We also present the four MED11 experiments that we submitted, and discuss the results and lessons learned. (Figure: the overall approach for multimedia event detection.)
In this paper, we describe the evaluation results for the TRECVID 2012 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks as a part of the SRI-Sarnoff AURORA system that is developed under the IARPA ALADDIN Program. In the AURORA system, we incorporated various low-level features that capture color, appearance, motion, and audio information in videos. Based on these low-level features, we developed Fixed-Pattern and Object-Oriented spatial feature pooling, which result in significant ...
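
The abstract names Fixed-Pattern spatial pooling without defining it; as a stand-in, the sketch below shows generic fixed-grid average pooling of local descriptors, which is one common fixed-pattern scheme. All details here are assumptions.

    import numpy as np

    def fixed_pattern_pool(descs, xy, img_size, grid=(2, 2)):
        """Average-pool local descriptors over a fixed spatial grid.

        descs: (N, D) local descriptors; xy: (N, 2) (row, col) locations.
        Returns a (grid_h * grid_w * D,) vector; empty cells stay zero.
        """
        (h, w), (gh, gw) = img_size, grid
        cell = (xy // np.array([h / gh, w / gw])).clip([0, 0], [gh - 1, gw - 1]).astype(int)
        pooled = np.zeros((gh, gw, descs.shape[1]))
        counts = np.zeros((gh, gw))
        for (r, c), d in zip(cell, descs):
            pooled[r, c] += d                  # accumulate descriptors per cell
            counts[r, c] += 1
        return (pooled / np.maximum(counts, 1)[..., None]).ravel()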

International Journal of Multimedia Information Retrieval, 2012
The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by nonprofessionals. Such videos depicting complex events have limited quality control and, therefore, may include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, fusion techniques, etc. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.

Proceedings of the 22nd ACM international conference on Multimedia, 2014
Animated GIFs are everywhere on the Web. Our work focuses on the computational prediction of emotions perceived by viewers after they are shown animated GIF images. We evaluate our results on a dataset of over 3,800 animated GIFs gathered from MIT's GIFGIF platform, each with scores for 17 discrete emotions aggregated from over 2.5M user annotations, the first computational evaluation of its kind for content-based prediction on animated GIFs to our knowledge. In addition, we advocate a conceptual paradigm in emotion prediction which shows that delineating distinct types of emotion is important and that it is useful to be concrete about the emotion target. One of our objectives is to systematically compare different types of content features for emotion prediction, including low-level, aesthetics, semantic, and face features. We also formulate a multi-task regression problem to evaluate whether viewer-perceived emotion prediction can benefit from jointly learning across emotion classes compared to disjoint, independent learning.
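
The multi-task formulation above can be illustrated with standard regressors: a per-emotion model fits each of the 17 outputs independently, while a multi-task model couples them through a shared penalty. The sketch below uses scikit-learn with synthetic placeholder data; the paper's actual features and regressors are not specified here.

    import numpy as np
    from sklearn.linear_model import Ridge, MultiTaskLasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 64))    # placeholder content features per GIF
    Y = rng.normal(size=(500, 17))    # placeholder scores for 17 emotions

    disjoint = Ridge(alpha=1.0).fit(X, Y)         # each emotion learned independently
    joint = MultiTaskLasso(alpha=0.05).fit(X, Y)  # emotions share one sparse feature support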

Proceedings of International Conference on Multimedia Retrieval, 2014
This paper addresses the fundamental question: how do humans recognize complex events in videos? Normally, humans view videos in a sequential manner. We hypothesize that humans can make a high-level inference, such as whether an event is present or not in a video, by looking at a very small number of frames, not necessarily in a linear order. We attempt to verify this cognitive capability of humans and to discover the Minimally Needed Evidence (MNE) for each event. To this end, we introduce an online game-based event quiz facilitating selection of the minimal evidence required by humans to judge the presence or absence of a complex event in an open-source video. Each video is divided into a set of temporally coherent microshots (1.5 seconds in length) which are revealed only on player request. The player's task is to identify the positive and negative occurrences of the given target event with a minimal number of requests to reveal evidence. Incentives are given to players for correct identification with the minimal number of requests. Our extensive human study using the game quiz validates our hypothesis: 55% of videos need only one microshot for correct human judgment, and events of varying complexity require different amounts of evidence for human judgment. In addition, the proposed notion of MNE enables us to select discriminative features, drastically improving the speed and accuracy of a video retrieval system.
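
For concreteness, a naive version of the microshot segmentation is sketched below as a uniform 1.5-second split; the paper's microshots are described as temporally coherent, which likely also respects shot boundaries, so this is an assumption.

    def microshots(num_frames, fps, length_s=1.5):
        """Split a video into ~1.5-second microshots.

        Returns (start_frame, end_frame) index pairs that can be revealed
        one at a time, as in the event quiz.
        """
        step = max(1, round(fps * length_s))
        return [(s, min(s + step, num_frames)) for s in range(0, num_frames, step)]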

This paper introduces an integrated surveillance system capable of tracking multiple objects across aerial and ground cameras. To this end, we propose a set of methodologies that deal with tracking problems in urban scenarios, where cameras mounted on quad-rotor unmanned helicopters can be used in conjunction with ground cameras to track multiple subjects persistently. We track moving objects from a moving aerial platform using a three-staged conventional technique consisting of ego-motion compensation, blob detection, and blob tracking. A hierarchical robust background subtraction followed by a motion correspondence algorithm is applied to track objects from the ground surveillance camera. Using metadata available at the airborne camera and the calibration parameters of the ground camera, we are able to transform an object's position in both cameras' local coordinate systems to a generic world coordinate system. Trajectories obtained in terms of the generic world coordinate system are then merged assuming temporal continuity. False candidate trajectories are eliminated using a similarity measure based on the color intensity of the object that generated them. Our system has been tested in three real-world scenarios, where it has been able to merge trajectories successfully in 80% of the cases.
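
A minimal sketch of the two geometric ingredients described above: mapping a detection into a shared world frame (here via a ground-plane homography per camera, an assumption standing in for the paper's metadata- and calibration-based transforms) and a merge test combining temporal continuity with color similarity. The trajectory dictionaries and thresholds are hypothetical.

    import numpy as np

    def to_world(pt_img, H_cam_to_world):
        """Map an image point to ground-plane world coordinates."""
        p = H_cam_to_world @ np.array([pt_img[0], pt_img[1], 1.0])
        return p[:2] / p[2]

    def should_merge(traj_a, traj_b, hist_a, hist_b, max_gap_s=2.0, min_sim=0.8):
        """Merge rule: temporal continuity plus color-histogram similarity."""
        continuous = (traj_b["t_start"] - traj_a["t_end"]) <= max_gap_s
        color_sim = np.minimum(hist_a, hist_b).sum()   # histogram intersection
        return continuous and color_sim >= min_sim
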
Study of Perceptive Video Quality in Tablet Devices
CVPR 2011, 2011
Our performance compares favorably to the state-of-the-art on experiments over three challenging human action datasets and a scene categorization dataset, demonstrating the universal applicability of our method.

2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014
While approaches based on bags of features excel at low-level action classification, they are ill-suited for recognizing complex events in video, where concept-based temporal representations currently dominate. This paper proposes a novel representation that captures the temporal dynamics of windowed mid-level concept detectors in order to improve complex event recognition. We first express each video as an ordered vector time series, where each time step consists of the vector formed from the concatenated confidences of the pre-trained concept detectors. We hypothesize that the dynamics of time series for different instances from the same event class, as captured by simple linear dynamical system (LDS) models, are likely to be similar even if the instances differ in terms of low-level visual features. We propose a two-part representation composed by fusing: (1) a singular value decomposition of block Hankel matrices (SSID-S) and (2) a harmonic signature (H-S) computed from the corresponding eigen-dynamics matrix. The proposed method offers several benefits over alternate approaches: our approach is straightforward to implement, directly employs existing concept detectors, and can be plugged into linear classification frameworks. Results on standard datasets such as NIST's TRECVID Multimedia Event Detection task demonstrate the improved accuracy of the proposed method.
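
The SSID-S half of the representation can be made concrete with a few lines of NumPy: build a block Hankel matrix from the concept-confidence time series and keep its leading singular values. The lag depth, truncation, and normalization below are assumptions, and the harmonic signature (H-S) that the paper fuses with it is not shown.

    import numpy as np

    def ssid_signature(series, L=5, k=10):
        """SVD signature of a concept-confidence time series (cf. SSID-S).

        series: (T, D) array; row t holds the concatenated detector
        confidences at time step t. Requires T > L.
        """
        T, D = series.shape
        blocks = [series[i:T - L + i + 1].T for i in range(L)]  # L lagged copies
        hankel = np.vstack(blocks)                              # (L*D, T-L+1) block Hankel
        s = np.linalg.svd(hankel, compute_uv=False)
        return s[:k] / (np.linalg.norm(s[:k]) + 1e-8)           # scale-normalized descriptor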

Machine Vision Beyond Visible Spectrum, 2011
This chapter discusses the challenges of automating surveillance and reconnaissance tasks for infrared visual data obtained from aerial platforms. These problems have gained significant importance over the years, especially with the advent of lightweight and reliable imaging devices. Detection and tracking of objects of interest has traditionally been an area of interest in the computer vision literature. These tasks are rendered especially challenging in aerial sequences of the infrared modality. The chapter gives an overview of these problems and the associated limitations of some of the conventional techniques typically employed for these applications. We begin with a study of the various image registration techniques that are required to eliminate the apparent motion induced by the movement of the aerial sensor. Next, we present a technique for detecting moving objects from the ego-motion-compensated input sequence. Finally, we describe a methodology for tracking already detected objects using their motion history. We substantiate our claims with results on a wide range of aerial video sequences.
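
The chapter surveys several registration techniques rather than prescribing one; as a representative (assumed) choice, the sketch below registers consecutive grayscale aerial frames with OpenCV's ECC algorithm so that the residual difference highlights independently moving objects.

    import cv2
    import numpy as np

    def ego_motion_residue(prev_gray, curr_gray):
        """Warp curr_gray onto prev_gray (affine ECC) and return the residual.

        Inputs are single-channel uint8 frames from the aerial sequence.
        """
        warp = np.eye(2, 3, dtype=np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-5)
        _, warp = cv2.findTransformECC(prev_gray, curr_gray, warp,
                                       cv2.MOTION_AFFINE, criteria, None, 5)
        stabilized = cv2.warpAffine(curr_gray, warp, prev_gray.shape[::-1],
                                    flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
        return cv2.absdiff(prev_gray, stabilized)   # feeds the moving-object detector
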
Scalable and Distributed Mechanisms for Integrated Scheduling and Replication in Data Grids
Lecture Notes in Computer Science
Data Grids seek to harness geographically distributed resources for large-scale data-intensive problems. The issues that need to be considered in the Data Grid research area include resource management for computation and data. Computation management comprises scheduling of jobs, load balancing, fault tolerance, and response time, while data management includes replication and movement of data at selected sites. As jobs are ...
Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), 2006
This paper attempts to reduce the overheads of dynamically creating and destroying virtual environments for secure job execution. It proposes a grid architecture, which we call Nova, consisting of extremely minuscule, pre-created virtual machines whose configurations can be altered with respect to the application executed within them. The benefits of the architecture are supported by experimental results.

International Journal for Numerical Methods in Biomedical Engineering, 2010
This paper is concerned with computational modeling of the respiratory system against the background of acute lung diseases and mechanical ventilation. Conceptually, we divide the lung into two major subsystems, namely the conducting airways and the respiratory zone represented by lung parenchyma. Owing to their respective complexity, both parts are themselves out of range for a direct numerical simulation resolving all relevant length scales. Therefore, we develop detailed individual models for parts of the subsystems as a basis for novel multi-scale approaches taking into account the unresolved parts appropriately. In the tracheobronchial region, CT-based geometries up to a maximum of approximately seven generations are employed in fluid–structure interaction simulations, considering not only airway wall deformability but also the influence of surrounding lung tissue. Physiological outflow boundary conditions are derived by considering the impedance of the unresolved parts of the ...

IEEE Transactions on Multimedia, 2014
In this paper, we propose a discriminative representation of a video shot based on its camera motion and demonstrate how the representation can be used for high-level multimedia tasks like complex event recognition. In our technique, we assume that a homography exists between a pair of subsequent frames in a given shot. Using purely image-based methods, we compute homography parameters that serve as coarse indicators of the ambient camera motion. Next, using Lie algebra, we map the homography matrices to an intermediate vector space that preserves the intrinsic geometric structure of the transformation. The mappings are stacked temporally to generate a vector time series per shot. To extract meaningful features from the time series, we propose an efficient linear dynamical system-based technique. The extracted temporal features are further used to train linear SVMs as classifiers for a particular shot class. In addition to demonstrating the efficacy of our method on a novel dataset, we extend its applicability to recognize complex events in large-scale videos under unconstrained scenarios. Our empirical evaluations on eight cinematographic shot classes show that our technique performs close to approaches that involve extraction of 3-D trajectories using computationally prohibitive structure-from-motion techniques.
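
The pipeline in this abstract, estimating a homography per frame pair and mapping it into a vector space with the matrix logarithm, can be sketched as follows. The ORB matching and the det-normalization onto SL(3) are standard but assumed details, not necessarily the paper's exact choices.

    import cv2
    import numpy as np
    from scipy.linalg import logm

    def camera_motion_feature(frame_a, frame_b):
        """Homography between two frames, mapped to sl(3) coordinates."""
        orb = cv2.ORB_create()
        kp_a, des_a = orb.detectAndCompute(frame_a, None)
        kp_b, des_b = orb.detectAndCompute(frame_b, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
        src = np.float32([kp_a[m.queryIdx].pt for m in matches])
        dst = np.float32([kp_b[m.trainIdx].pt for m in matches])
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
        H = H / np.cbrt(np.linalg.det(H))    # scale so det(H) = 1, i.e. H in SL(3)
        return np.real(logm(H)).ravel()      # 9-dim Lie-algebra vector per frame pair

Stacking these vectors over a shot yields the time series from which the LDS features are extracted.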

e & i Elektrotechnik und Informationstechnik, 2010
Advances in control engineering and material science have made it possible to develop small-scale unmanned aerial vehicles (UAVs) equipped with cameras and sensors. These UAVs enable us to obtain a bird's-eye view of the environment. Having access to an aerial view over large areas is helpful in disaster situations, where often only incomplete and inconsistent information is available to the rescue team. In such situations, airborne cameras and sensors are valuable sources of information, helping us to build an "overview" of the environment and to assess the current situation. This paper reports on our ongoing research on deploying small-scale, battery-powered, and wirelessly connected UAVs carrying cameras for disaster management applications. In this "aerial sensor network," several UAVs fly in formations and cooperate to achieve a certain mission. The ultimate goal is to have an aerial imaging system in which UAVs build a flight formation, fly over a disaster area such as a wood fire or a large traffic accident, and deliver high-quality sensor data such as images or videos. These images and videos are communicated to the ground, fused, analyzed in real time, and finally delivered to the user. In this paper we introduce our aerial sensor network and its application in disaster situations. We discuss the challenges of such aerial sensor networks and focus on the optimal placement of sensors. We formulate the coverage problem as an integer linear program (ILP) and present first evaluation results.
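
A minimal sketch of a coverage ILP of the kind mentioned above, written with the PuLP modeling library: choose at most a given number of UAV camera placements to maximize the number of observed cells. The exact variables and constraints of the paper's formulation are not given in the abstract, so this is a generic max-coverage stand-in.

    import pulp

    def plan_coverage(cells, candidates, covers, max_uavs):
        """covers: dict mapping each candidate position -> set of visible cells."""
        prob = pulp.LpProblem("aerial_coverage", pulp.LpMaximize)
        x = pulp.LpVariable.dicts("place", candidates, cat="Binary")
        y = pulp.LpVariable.dicts("seen", cells, cat="Binary")
        prob += pulp.lpSum(y[c] for c in cells)              # maximize covered cells
        prob += pulp.lpSum(x[p] for p in candidates) <= max_uavs
        for c in cells:  # a cell counts only if some selected placement sees it
            prob += y[c] <= pulp.lpSum(x[p] for p in candidates if c in covers[p])
        prob.solve()
        return [p for p in candidates if x[p].value() == 1]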