Papers by Chinmaya Devaraj

Emergent Languages from Pretrained Embeddings Characterize Latent Concepts in Dynamic Imagery
International Journal of Semantic Computing, Sep 1, 2020
Recent unsupervised learning approaches have explored the feasibility of semantic analysis and interpretation of imagery using Emergent Language (EL) models. As EL requires some form of numerical embedding as input, it remains unclear which type is required for the EL to properly capture key semantic concepts associated with a given domain. In this paper, we compare unsupervised and supervised approaches for generating embeddings across two experiments. In Experiment 1, data are produced using a single-agent simulator. In each episode, a goal-driven agent attempts to accomplish a number of tasks in a synthetic cityscape environment which includes houses, banks, theaters, and restaurants. In Experiment 2, a comparatively smaller dataset is produced where one or more objects demonstrate various types of physical motion in a 3D simulator environment. We investigate whether EL models generated from embeddings of raw pixel data produce expressions that capture key latent concepts (i.e., an agent's motivations or physical motion types) in each environment. Our initial experiments show that supervised learning approaches yield embeddings and EL descriptions that capture meaningful concepts from raw pixel inputs. In contrast, embeddings from an unsupervised learning approach result in greater ambiguity with respect to latent concepts.
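EL models of this kind are often trained as a referential game between a sender and a receiver. Below is a minimal PyTorch sketch of that setup, assuming Gumbel-softmax message sampling and illustrative sizes; it shows the general technique, not the paper's implementation.

```python
# Hypothetical sketch of a referential-game Emergent Language (EL) model.
# Not the authors' code: a sender maps a precomputed embedding (supervised
# or unsupervised) to a discrete message; a receiver must pick the matching
# embedding out of a batch of distractors.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MSG_LEN, EMB, HID = 16, 4, 128, 64  # assumed sizes

class Sender(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB, HID)
        self.symbols = nn.Linear(HID, VOCAB * MSG_LEN)

    def forward(self, z, tau=1.0):
        logits = self.symbols(torch.relu(self.proj(z)))
        logits = logits.view(-1, MSG_LEN, VOCAB)
        # Gumbel-softmax yields differentiable one-hot symbols during training.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

class Receiver(nn.Module):
    def __init__(self):
        super().__init__()
        self.read = nn.Linear(VOCAB * MSG_LEN, HID)
        self.match = nn.Linear(EMB, HID)

    def forward(self, msg, candidates):
        m = self.read(msg.flatten(1))   # (B, HID) message summary
        c = self.match(candidates)      # (B, HID) candidate embeddings
        return m @ c.t()                # score every candidate per message

sender, receiver = Sender(), Receiver()
z = torch.randn(32, EMB)                           # embeddings of 32 frames
scores = receiver(sender(z), z)                    # each row should peak on its own index
loss = F.cross_entropy(scores, torch.arange(32))
loss.backward()
```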

Incorporating Visual Grounding In GCN For Zero-shot Learning Of Human Object Interaction Actions
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Evenly Cascaded Convolutional Networks
2018 IEEE International Conference on Big Data (Big Data), 2018
We introduce the Evenly Cascaded convolutional Network (ECN), a neural network taking inspiration from the cascade algorithm of wavelet analysis. ECN employs two feature streams: a low-level and a high-level stream. At each layer these streams interact, such that low-level features are modulated using advanced perspectives from the high-level stream. ECN is evenly structured by resizing feature map dimensions by a consistent ratio, which removes the burden of ad hoc specification of feature map dimensions. ECN produces easily interpretable feature maps, a result whose intuition can be understood in the context of scale-space theory. We demonstrate that ECN's design facilitates the training process by providing easily trainable shortcuts. We report new state-of-the-art results for small networks, without the need for additional treatment such as pruning or compression, a consequence of ECN's simple structure and direct training. A 6-layer ECN design with under 500k parameters achieves 95.24% and 78.99% accuracy on the CIFAR-10 and CIFAR-100 datasets, respectively, outperforming the current state of the art on small-parameter networks, and a 3-million-parameter ECN produces results competitive with the state of the art.
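The cascade structure described here can be illustrated with a short PyTorch sketch based only on our reading of the abstract; the layer sizes, the additive modulation, and the 0.75 resize ratio are assumptions, not the published ECN design.

```python
# Illustrative two-stream cascade: the high-level stream modulates the
# low-level stream at each layer, and both are downscaled by one consistent
# ratio so layer sizes never need ad hoc tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeLayer(nn.Module):
    def __init__(self, ch, ratio=0.75):
        super().__init__()
        self.ratio = ratio
        self.low_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.high_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.modulate = nn.Conv2d(ch, ch, 1)  # high-level view injected into low-level stream

    def forward(self, low, high):
        high = F.relu(self.high_conv(high))
        # Low-level features are modulated by the more abstract stream.
        low = F.relu(self.low_conv(low) + self.modulate(high))
        # Both streams are resized by the same fixed ratio.
        size = [max(1, round(s * self.ratio)) for s in low.shape[-2:]]
        return F.interpolate(low, size=size), F.interpolate(high, size=size)

layers = nn.ModuleList(CascadeLayer(16) for _ in range(6))
low = high = torch.randn(1, 16, 32, 32)
for layer in layers:
    low, high = layer(low, high)
print(low.shape)  # spatial dims shrink by the same ratio at every layer
```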

Towards Semantic Action Analysis via Emergent Language
2019 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), 2019
Recent work on unsupervised learning has explored the feasibility of semantic analysis and interpretation via Emergent Language (EL) models. As EL requires some form of numerical embedding, it remains unclear which type is required for the EL to properly capture certain semantic concepts associated with a given task. In this paper, we compare different approaches that can be used to generate such embeddings: unsupervised and supervised. We start by producing a large dataset using a single-agent simulation environment in which a purpose-driven agent attempts to accomplish a number of tasks. These tasks are performed in a synthetic cityscape environment, which includes houses, banks, theaters, and restaurants. Given such experiences, specification of the associated goal structure constitutes a narrative. We investigate the feasibility of producing an EL from raw pixel data with the hope that the resulting descriptions can be used to infer the underlying narrative structure. Our initial experiments show that a supervised learning approach yields embeddings and EL descriptions that capture narrative structure, whereas an unsupervised learning approach results in greater ambiguity with respect to the narrative.
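For illustration, the two embedding routes being compared might look like the following sketch, where a supervised encoder is trained on task labels and an unsupervised autoencoder on pixel reconstruction; all names and shapes are hypothetical, and either embedding could feed an EL sender like the one sketched earlier.

```python
# Simplified rendering of the supervised vs. unsupervised embedding routes.
import torch
import torch.nn as nn

def conv_trunk():
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())  # -> (B, 64)

class SupervisedEncoder(nn.Module):
    """Trained with cross-entropy on task/goal labels; the penultimate
    features serve as the embedding."""
    def __init__(self, n_classes):
        super().__init__()
        self.trunk = conv_trunk()
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        z = self.trunk(x)
        return z, self.head(z)

class Autoencoder(nn.Module):
    """Trained with pixel reconstruction only; the bottleneck is the
    unsupervised embedding."""
    def __init__(self):
        super().__init__()
        self.trunk = conv_trunk()
        self.decode = nn.Sequential(nn.Linear(64, 3 * 64 * 64),
                                    nn.Unflatten(1, (3, 64, 64)))

    def forward(self, x):
        z = self.trunk(x)
        return z, self.decode(z)

x = torch.randn(4, 3, 64, 64)                      # a batch of rendered frames
z_sup, logits = SupervisedEncoder(n_classes=4)(x)  # supervised embedding
z_unsup, recon = Autoencoder()(x)                  # unsupervised embedding
```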
From Symbols to Signals: Symbolic Variational Autoencoders
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
We introduce Symbolic Variational Autoencoders which generate images from symbols that represent semantic concepts. Unlike generic Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs), the latent distribution from the Symbolic Variational Autoencoder is discrete. The symbols are learned in a completely unsupervised manner by reconstructing images from symbolic encodings. We demonstrate the efficacy of our symbolic approach on the MNIST and FashionMNIST datasets. Results indicate that symbolic encodings naturally form a grammar, where unique strings of symbols map to different semantic concepts. We further explore how changing these symbols affects the final image that is generated.
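A minimal sketch of such a discrete-latent autoencoder, assuming a Gumbel-softmax relaxation and MNIST-sized inputs (the paper's exact architecture may differ):

```python
# Each image is encoded as a short string of discrete symbols and
# reconstructed from that string alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

SYMBOLS, STRING_LEN = 10, 6  # assumed vocabulary size and code length

class SymbolicAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
                                 nn.Linear(256, STRING_LEN * SYMBOLS))
        self.dec = nn.Sequential(nn.Linear(STRING_LEN * SYMBOLS, 256), nn.ReLU(),
                                 nn.Linear(256, 28 * 28), nn.Sigmoid())

    def forward(self, x, tau=1.0):
        logits = self.enc(x).view(-1, STRING_LEN, SYMBOLS)
        code = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot symbols
        recon = self.dec(code.flatten(1)).view_as(x)
        return recon, code.argmax(-1)  # image plus its symbol string

model = SymbolicAE()
x = torch.rand(8, 1, 28, 28)           # e.g. a batch of MNIST digits
recon, symbols = model(x)
loss = F.binary_cross_entropy(recon, x)
loss.backward()
print(symbols[0])                      # the discrete "word" for image 0
```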

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021
Human actions involving hand manipulations are structured according to the making and breaking of hand-object contact, and human visual understanding of action relies on anticipation of contact, as demonstrated by pioneering work in cognitive science. Taking inspiration from this, we introduce representations and models centered on contact, which we then use in action prediction and anticipation. We annotate a subset of the EPIC Kitchens dataset to include time-to-contact between hands and objects, as well as segmentations of hands and objects. Using these annotations, we train the Anticipation Module, a module producing Contact Anticipation Maps and Next Active Object Segmentations, novel low-level representations providing temporal and spatial characteristics of anticipated near-future action. On top of the Anticipation Module we apply Egocentric Object Manipulation Graphs (Ego-OMG), a framework for action anticipation and prediction. Ego-OMG models longer-term temporal semantic relations through the use of a graph modeling transitions between contact-delineated action states. Use of the Anticipation Module within Ego-OMG produces state-of-the-art results, achieving 1st and 2nd place on the unseen and seen test sets, respectively, of the EPIC Kitchens Action Anticipation Challenge, and achieving state-of-the-art results on the tasks of action anticipation and action prediction over EPIC Kitchens. We perform ablation studies over characteristics of the Anticipation Module to evaluate their utility.
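A schematic of the Anticipation Module's two outputs, read from the abstract; the backbone, shapes, and activations below are placeholders, not the published model.

```python
# Two heads over a shared backbone: a per-pixel contact-anticipation map
# regressing time until hand-object contact, and a next-active-object mask.
import torch
import torch.nn as nn

class AnticipationHeads(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        # Time-to-contact at every pixel (kept non-negative via softplus).
        self.contact = nn.Conv2d(feat_ch, 1, 1)
        # Binary mask over the object about to be touched.
        self.next_obj = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, frame):
        f = self.backbone(frame)
        ttc = nn.functional.softplus(self.contact(f))  # contact anticipation map
        seg = torch.sigmoid(self.next_obj(f))          # next-active-object mask
        return ttc, seg

ttc, seg = AnticipationHeads()(torch.randn(1, 3, 128, 128))
# Both maps could then feed a downstream anticipation model such as Ego-OMG.
```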


Egocentric Object Manipulation Graphs
arXiv, 2020
We introduce Egocentric Object Manipulation Graphs (Ego-OMG), a novel representation for activity modeling and anticipation of near-future actions integrating three components: 1) semantic temporal structure of activities, 2) short-term dynamics, and 3) representations for appearance. Semantic temporal structure is modeled through a graph, embedded through a Graph Convolutional Network, whose states model characteristics of and relations between hands and objects. These state representations derive from all three levels of abstraction, and span segments delimited by the making and breaking of hand-object contact. Short-term dynamics are modeled in two ways: A) through 3D convolutions, and B) through anticipating the spatiotemporal end points of hand trajectories, where hands come into contact with objects. Appearance is modeled through deep spatiotemporal features produced through existing methods. We note that in Ego-OMG it is simple to swap these appearance features, and thus Ego...
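The graph component can be sketched as follows, with nodes as contact-delimited action states and one GCN layer mixing neighbor states; the sizes, adjacency, and classifier head are all illustrative assumptions, not the released Ego-OMG code.

```python
# Toy rendering of the graph component: nodes are contact-delimited action
# states, edges are observed transitions, and a GCN layer mixes each state's
# features with its neighbors' before a classifier predicts the next action.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # Normalized adjacency (with self-loops) averages neighbor states.
        a = adj + torch.eye(adj.size(0))
        a = a / a.sum(1, keepdim=True)
        return torch.relu(self.lin(a @ x))

n_states, dim, n_actions = 5, 32, 10       # assumed sizes
x = torch.randn(n_states, dim)             # per-state appearance/dynamics features
adj = torch.zeros(n_states, n_states)
adj[0, 1] = adj[1, 2] = adj[2, 3] = adj[3, 4] = 1.0  # observed transitions

gcn = GCNLayer(dim)
classify = nn.Linear(dim, n_actions)
next_action_logits = classify(gcn(x, adj))  # one prediction per state
```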