Object Category Detection Using Audio-Visual Cues
Lecture Notes in Computer Science
https://doi.org/10.1007/978-3-540-79547-6_52
13 pages
Abstract
Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities in order to achieve robust categorization. In this paper we propose a multimodal approach to object category detection, using audio and visual information. The auditory channel is modeled on biologically motivated spectral features via a discriminative classifier. The visual channel is modeled by a state-of-the-art part-based model. Multimodality is achieved using two fusion schemes, one high-level and the other low-level. Experiments on six different object categories, under increasingly difficult conditions, show the strengths and weaknesses of the two approaches, and clearly underline the open challenges for multimodal category detection.
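As a rough illustration of the two fusion schemes mentioned in the abstract, the sketch below contrasts low-level fusion (concatenating audio and visual features before a single classifier) with high-level fusion (per-modality classifiers whose confidence scores are combined). The SVM classifier, feature dimensions, and synthetic data are placeholder assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's exact method): low-level vs. high-level
# fusion over precomputed audio and visual feature vectors. Data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
audio_feats = rng.normal(size=(n, 40))    # stand-in for spectral features per clip
visual_feats = rng.normal(size=(n, 128))  # stand-in for part-based model descriptors
labels = rng.integers(0, 2, size=n)       # object category present / absent

# Low-level (feature-level) fusion: concatenate features, train one classifier.
early = SVC(probability=True).fit(np.hstack([audio_feats, visual_feats]), labels)

# High-level (decision-level) fusion: one classifier per modality,
# then combine their per-class confidence scores.
clf_a = SVC(probability=True).fit(audio_feats, labels)
clf_v = SVC(probability=True).fit(visual_feats, labels)

def late_fusion_predict(xa, xv, w=0.5):
    """Weighted average of the two classifiers' posterior estimates."""
    p = w * clf_a.predict_proba(xa) + (1 - w) * clf_v.predict_proba(xv)
    return p.argmax(axis=1)

print(late_fusion_predict(audio_feats[:5], visual_feats[:5]))
```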
Related papers
2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009
In mobile robotics applications, pattern and object recognition are mainly achieved relying on vision alone. Several other perceptual modalities are also available, such as touch, hearing, or vestibular proprioception. They are rarely used, yet can provide valuable additional information for recognition tasks. This article presents an analysis of several methods for fusing perceptual and auditory modalities. It relies on a perspective camera and a microphone applied to a moving-object recognition problem. Experimental data are also provided on a database of audio/visual objects, including cases of visual occlusion and audio corruption.
Journal of Experimental Psychology-human Perception and Performance, 2009
Learning to recognize the contrasts of a language-specific phonemic repertoire can be viewed as forming categories in a multidimensional psychophysical space. Research on the learning of distributionally defined visual categories has shown that categories defined over 1 dimension are easy to learn and that learning multidimensional categories is more difficult but tractable under specific task conditions. In 2 experiments, adult participants learned either a unidimensional or a multidimensional category distinction with or without supervision (feedback) during learning. The unidimensional distinctions were readily learned and supervision proved beneficial, especially in maintaining category learning beyond the learning phase. Learning the multidimensional category distinction proved to be much more difficult and supervision was not nearly as beneficial as with unidimensionally defined categories. Maintaining a learned multidimensional category distinction was only possible when the distributional information that identified the categories remained present throughout the testing phase. We conclude that listeners are sensitive to both trial-by-trial feedback and the distributional information in the stimuli. Even given limited exposure, listeners learned to use 2 relevant dimensions, albeit with considerable difficulty.
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
The novelty of this study consists in a multimodality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionary-optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for this multimodality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that situations where a single modality may be confused by anomalous data points are now corrected through an emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone. Both are examples which are correctly classified by our multimodality approach.
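A minimal sketch of the late-fusion idea described above, assuming the two primary networks are used as fixed feature generators and a small "tertiary" model is trained on their concatenated outputs. The embeddings, classifier, and data below are synthetic stand-ins, not the VGG16 or evolutionary-optimised networks used in the paper.

```python
# Hedged sketch of late fusion via a tertiary model over two fixed
# per-modality feature generators. All data below is synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n, n_classes = 1000, 8
image_embed = rng.normal(size=(n, 256))   # stand-in for image-network features
audio_embed = rng.normal(size=(n, 128))   # stand-in for audio-network features
y = rng.integers(0, n_classes, size=n)    # environment labels

fused = np.hstack([image_embed, audio_embed])
tertiary = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
tertiary.fit(fused[:800], y[:800])
print("held-out accuracy:", tertiary.score(fused[800:], y[800:]))
```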
2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013
In this paper we tackle the challenging problem of multimodal feature selection and fusion for vehicle categorization. Our proposed framework utilizes a boosting-based feature learning technique to learn the optimal combinations of feature modalities. New multimodal features are learned from the existing unimodal features, which are initially extracted from data acquired by a novel audiovisual sensing system under different sensing conditions (long range, moving vehicles, and various environments). Experiments on a challenging dataset collected with our long-range sensing system demonstrate that the proposed technique is robust to noise and, in terms of classification performance, is better at finding the best among multiple good feature modalities during training than sequential feature-modality selection, which tends to get stuck in a local maximum.
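The sketch below loosely mirrors boosting-based modality selection: at each round a weak learner is fit on each unimodal feature block, and the block with the lowest weighted error is kept. The block names, decision stumps, and synthetic data are assumptions for illustration only, not the paper's actual features or boosting variant.

```python
# Simplified AdaBoost-style selection over hypothetical unimodal feature blocks.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 300
blocks = {                       # hypothetical unimodal feature blocks
    "visual_shape": rng.normal(size=(n, 20)),
    "visual_texture": rng.normal(size=(n, 15)),
    "audio_spectral": rng.normal(size=(n, 30)),
}
y = rng.choice([-1, 1], size=n)  # synthetic binary labels

w = np.full(n, 1.0 / n)          # sample weights
ensemble = []                    # (alpha, block name, fitted stump)
for _ in range(10):
    best = None
    for name, X in blocks.items():
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        err = w[stump.predict(X) != y].sum()
        if best is None or err < best[0]:
            best = (err, name, stump)
    err, name, stump = best
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    pred = stump.predict(blocks[name])
    w *= np.exp(-alpha * y * pred)   # re-weight samples toward current mistakes
    w /= w.sum()
    ensemble.append((alpha, name, stump))

print("modalities chosen per round:", [name for _, name, _ in ensemble])
```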
Advances in neural …, 2002
Digital Signal Processing, 2018
The rapidly increasing requirements from context-aware gadgets, like smartphones and intelligent wearable devices, along with applications such as audio archiving, have given a fillip to research in the field of Acoustic Scene Classification (ASC). The Detection and Classification of Acoustic Scenes and Events (DCASE) challenges have seen systems addressing the problem of ASC from different directions. Some of them achieved better results than the Mel Frequency Cepstral Coefficients-Gaussian Mixture Model (MFCC-GMM) baseline system. However, a collective decision from all participating systems was found to surpass the accuracy obtained by each individual system. The simultaneous use of various approaches can better exploit the discriminating information in audio collected from different environments covering the audible-frequency range in varying degrees. In this work, we show that the frame-level statistics of some well-known spectral features, when fed individually to a Support Vector Machine (SVM) classifier, are able to outperform the baseline system of the DCASE challenges. Furthermore, we analyzed different methods of combining these features, and also of combining information from two channels when the data is in binaural format. The proposed approach resulted in around 17% and 9% relative improvement in accuracy with respect to the baseline system on the development and evaluation datasets, respectively, of the DCASE 2016 ASC task.
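As a hedged sketch of the frame-level-statistics idea, the code below computes a spectral feature (MFCCs) per frame, summarizes each clip by per-coefficient mean and standard deviation, and trains an SVM on those summaries. The choice of MFCCs, the toy two-class signals, and the SVM settings are assumptions for illustration, not the DCASE feature set or evaluation protocol.

```python
# Frame-level spectral statistics + SVM, on synthetic one-second clips.
import numpy as np
import librosa
from sklearn.svm import SVC

rng = np.random.default_rng(3)
sr = 22050

def clip_descriptor(signal, sr):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)   # shape (20, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def noise_clip():           # toy "scene" class 0: noise-like clip
    return rng.normal(size=sr).astype(np.float32)

def tone_clip():            # toy "scene" class 1: tone-like clip
    t = np.arange(sr) / sr
    return (np.sin(2 * np.pi * 440 * t) + 0.1 * rng.normal(size=sr)).astype(np.float32)

X, y = [], []
for label, make in enumerate([noise_clip, tone_clip]):
    for _ in range(20):
        X.append(clip_descriptor(make(), sr))
        y.append(label)

clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```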
Experimental Brain Research, 2011
Previous research examining cross-modal conflicts in object recognition has often made use of animal vocalizations and images, which may be considered natural and ecologically valid, thus strengthening the association in the congruent condition. The current research tested whether the same cross-modal conflict would exist for man-made object sounds, as well as comparing the speed and accuracy of auditory processing across the two object categories. Participants were required to attend to a sound paired with a visual stimulus and then respond to a verification item (e.g., "Dog?"). Sounds were congruent (same object), neutral (unidentifiable image), or incongruent (different object) with the images presented. In the congruent and neutral conditions, animals were recognized significantly faster and with greater accuracy than man-made objects. It was hypothesized that in the incongruent condition, no difference in reaction time or error rate would be found between animals and man-made objects. This prediction was not supported, indicating that the association between an object's sound and image may not be that disparate when comparing animals to man-made objects. The findings further support cross-modal conflict research for both the animal and man-made object categories. The most important finding, however, was that auditory processing is enhanced for living compared to nonliving objects, a difference previously found only in visual processing. Implications relevant to both the neuropsychological literature and sound research are discussed.
Multisensory Research, 2021
Although it has been demonstrated that multisensory information can facilitate object recognition and object memory, it remains unclear whether such a facilitation effect exists in category learning. To address this issue, comparable car images and sounds were first selected by a discrimination task in Experiment 1. Then, those selected images and sounds were utilized in a prototype category learning task in Experiments 2 and 3, in which participants were trained with auditory, visual, and audiovisual stimuli, and were tested with trained or untrained stimuli within the same categories presented alone or accompanied by a congruent or incongruent stimulus in the other modality. In Experiment 2, when low-distortion stimuli (more similar to the prototypes) were trained, there was higher accuracy for audiovisual trials than visual trials, but no significant difference between audiovisual and auditory trials. During testing, accuracy was significantly higher for congruent trials than uni...
IEEE Transactions on Neural Networks, 2000
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
Attention, perception & psychophysics, 2014
An influential theoretical perspective describes an implicit category-learning system that associates regions of perceptual space with response outputs by integrating information preattentionally and predecisionally across multiple stimulus dimensions. This study tested whether this kind of implicit, information-integration category learning is possible across stimulus dimensions lying in different sensory modalities. Humans learned categories composed of conjoint visual-auditory category exemplars comprising a visual component (rectangles varying in the density of contained lit pixels) and an auditory component (Experiment 1: auditory sequences varying in duration; Experiment 2: pure tones varying in pitch). The categories had either a one-dimensional, rule-based solution or a two-dimensional, information-integration solution. Humans can solve the information-integration category tasks by integrating information across two stimulus modalities. The results demonstrate an important cross-modal form of sensory integration in the service of category learning, and they advance the field's knowledge about the sensory organization of systems for categorization.
