Object Category Detection Using Audio-Visual Cues
Lecture Notes in Computer Science
https://doi.org/10.1007/978-3-540-79547-6_52
13 pages
Abstract
Categorization is one of the fundamental building blocks of cognitive systems. Object categorization has traditionally been addressed in the vision domain, even though cognitive agents are intrinsically multimodal. Indeed, biological systems combine several modalities in order to achieve robust categorization. In this paper we propose a multimodal approach to object category detection, using audio and visual information. The auditory channel is modeled on biologically motivated spectral features via a discriminative classifier. The visual channel is modeled by a state-of-the-art part-based model. Multimodality is achieved using two fusion schemes, one high-level and the other low-level. Experiments on six different object categories, under increasingly difficult conditions, show the strengths and weaknesses of the two approaches, and clearly underline the open challenges for multimodal category detection.
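As a rough illustration of the two fusion schemes mentioned in the abstract, the sketch below contrasts low-level fusion (concatenating audio and visual features before a single classifier) with high-level fusion (per-modality classifiers whose confidence scores are combined). The SVM classifier, feature dimensions, and synthetic data are placeholder assumptions for illustration, not the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's exact method): low-level vs. high-level
# fusion over precomputed audio and visual feature vectors. Data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
audio_feats = rng.normal(size=(n, 40))    # stand-in for spectral features per clip
visual_feats = rng.normal(size=(n, 128))  # stand-in for part-based model descriptors
labels = rng.integers(0, 2, size=n)       # object category present / absent

# Low-level (feature-level) fusion: concatenate features, train one classifier.
early = SVC(probability=True).fit(np.hstack([audio_feats, visual_feats]), labels)

# High-level (decision-level) fusion: one classifier per modality,
# then combine their per-class confidence scores.
clf_a = SVC(probability=True).fit(audio_feats, labels)
clf_v = SVC(probability=True).fit(visual_feats, labels)

def late_fusion_predict(xa, xv, w=0.5):
    """Weighted average of the two classifiers' posterior estimates."""
    p = w * clf_a.predict_proba(xa) + (1 - w) * clf_v.predict_proba(xv)
    return p.argmax(axis=1)

print(late_fusion_predict(audio_feats[:5], visual_feats[:5]))
```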
Related papers
2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009
In mobile robotics applications, pattern and object recognition are mainly achieved relying on vision alone. Several other perceptual modalities are also available, such as touch, hearing, or vestibular proprioception. They are rarely used, yet can provide valuable additional information for recognition tasks. This article presents an analysis of several methods for fusing perceptual and auditory modalities. It relies on a perspective camera and a microphone applied to a moving-object recognition problem. Experimental data are also provided on a database of audio/visual objects, including cases of visual occlusion and audio corruption.
Journal of Experimental Psychology-human Perception and Performance, 2009
Learning to recognize the contrasts of a language-specific phonemic repertoire can be viewed as forming categories in a multidimensional psychophysical space. Research on the learning of distributionally defined visual categories has shown that categories defined over 1 dimension are easy to learn and that learning multidimensional categories is more difficult but tractable under specific task conditions. In 2 experiments, adult participants learned either a unidimensional or a multidimensional category distinction with or without supervision (feedback) during learning. The unidimensional distinctions were readily learned and supervision proved beneficial, especially in maintaining category learning beyond the learning phase. Learning the multidimensional category distinction proved to be much more difficult and supervision was not nearly as beneficial as with unidimensionally defined categories. Maintaining a learned multidimensional category distinction was only possible when the distributional information that identified the categories remained present throughout the testing phase. We conclude that listeners are sensitive to both trial-by-trial feedback and the distributional information in the stimuli. Even given limited exposure, listeners learned to use 2 relevant dimensions, albeit with considerable difficulty.
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
The novelty of this study consists in a multimodality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionary-optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for this multimodality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that situations where a single modality may be confused by anomalous data points are now corrected through an emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone. Both are examples which are correctly classified by our multimodality approach.
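A minimal sketch of the late-fusion idea described above, assuming the two primary networks are used as fixed feature generators and a small "tertiary" model is trained on their concatenated outputs. The embeddings, classifier, and data below are synthetic stand-ins, not the VGG16 or evolutionary-optimised networks used in the paper.

```python
# Hedged sketch of late fusion via a tertiary model over two fixed
# per-modality feature generators. All data below is synthetic.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n, n_classes = 1000, 8
image_embed = rng.normal(size=(n, 256))   # stand-in for image-network features
audio_embed = rng.normal(size=(n, 128))   # stand-in for audio-network features
y = rng.integers(0, n_classes, size=n)    # environment labels

fused = np.hstack([image_embed, audio_embed])
tertiary = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
tertiary.fit(fused[:800], y[:800])
print("held-out accuracy:", tertiary.score(fused[800:], y[800:]))
```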
2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013
In this paper we tackle the challenging problem of multimodal feature selection and fusion for vehicle categorization. Our proposed framework utilizes a boosting-based feature learning technique to learn the optimal combinations of feature modalities. New multimodal features are learned from the existing unimodal features, which are initially extracted from data acquired by a novel audiovisual sensing system under different sensing conditions (long range, moving vehicles, and various environments). Experiments on a challenging dataset collected with our long-range sensing system demonstrate that the proposed technique is robust to noise and, in terms of classification performance, is better at finding the best among multiple good feature modalities during training than sequential feature-modality selection, which tends to get stuck in a local maximum.
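The sketch below loosely mirrors boosting-based modality selection: at each round a weak learner is fit on each unimodal feature block, and the block with the lowest weighted error is kept. The block names, decision stumps, and synthetic data are assumptions for illustration only, not the paper's actual features or boosting variant.

```python
# Simplified AdaBoost-style selection over hypothetical unimodal feature blocks.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 300
blocks = {                       # hypothetical unimodal feature blocks
    "visual_shape": rng.normal(size=(n, 20)),
    "visual_texture": rng.normal(size=(n, 15)),
    "audio_spectral": rng.normal(size=(n, 30)),
}
y = rng.choice([-1, 1], size=n)  # synthetic binary labels

w = np.full(n, 1.0 / n)          # sample weights
ensemble = []                    # (alpha, block name, fitted stump)
for _ in range(10):
    best = None
    for name, X in blocks.items():
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        err = w[stump.predict(X) != y].sum()
        if best is None or err < best[0]:
            best = (err, name, stump)
    err, name, stump = best
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)
    pred = stump.predict(blocks[name])
    w *= np.exp(-alpha * y * pred)   # re-weight samples toward current mistakes
    w /= w.sum()
    ensemble.append((alpha, name, stump))

print("modalities chosen per round:", [name for _, name, _ in ensemble])
```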
Advances in neural …, 2002
Digital Signal Processing, 2018
The rapidly increasing requirements from context-aware gadgets, like smartphones and intelligent wearable devices, along with applications such as audio archiving, have given a fillip to research in the field of Acoustic Scene Classification (ASC). The Detection and Classification of Acoustic Scenes and Events (DCASE) challenges have seen systems addressing the problem of ASC from different directions. Some of them achieved better results than the Mel Frequency Cepstral Coefficients-Gaussian Mixture Model (MFCC-GMM) baseline system. However, a collective decision from all participating systems was found to surpass the accuracy obtained by each individual system. The simultaneous use of various approaches can better exploit the discriminating information in audio collected from different environments covering the audible-frequency range in varying degrees. In this work, we show that the frame-level statistics of some well-known spectral features, when fed individually to a Support Vector Machine (SVM) classifier, are able to outperform the baseline system of the DCASE challenges. Furthermore, we analyzed different methods of combining these features, and also of combining information from two channels when the data is in binaural format. The proposed approach resulted in around 17% and 9% relative improvement in accuracy with respect to the baseline system on the development and evaluation datasets, respectively, of the DCASE 2016 ASC task.
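As a hedged sketch of the frame-level-statistics idea, the code below computes a spectral feature (MFCCs) per frame, summarizes each clip by per-coefficient mean and standard deviation, and trains an SVM on those summaries. The choice of MFCCs, the toy two-class signals, and the SVM settings are assumptions for illustration, not the DCASE feature set or evaluation protocol.

```python
# Frame-level spectral statistics + SVM, on synthetic one-second clips.
import numpy as np
import librosa
from sklearn.svm import SVC

rng = np.random.default_rng(3)
sr = 22050

def clip_descriptor(signal, sr):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)   # shape (20, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def noise_clip():           # toy "scene" class 0: noise-like clip
    return rng.normal(size=sr).astype(np.float32)

def tone_clip():            # toy "scene" class 1: tone-like clip
    t = np.arange(sr) / sr
    return (np.sin(2 * np.pi * 440 * t) + 0.1 * rng.normal(size=sr)).astype(np.float32)

X, y = [], []
for label, make in enumerate([noise_clip, tone_clip]):
    for _ in range(20):
        X.append(clip_descriptor(make(), sr))
        y.append(label)

clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```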
Experimental Brain Research, 2011
Previous research examining cross-modal conflicts in object recognition has often made use of animal vocalizations and images, which may be considered natural and ecologically valid, thus strengthening the association in the congruent condition. The current research tested whether the same cross-modal conflict would exist for man-made object sounds, as well as comparing the speed and accuracy of auditory processing across the two object categories. Participants were required to attend to a sound paired with a visual stimulus and then respond to a verification item (e.g., "Dog?"). Sounds were congruent (same object), neutral (unidentifiable image), or incongruent (different object) with the images presented. In the congruent and neutral conditions, animals were recognized significantly faster and with greater accuracy than man-made objects. It was hypothesized that in the incongruent condition, no difference in reaction time or error rate would be found between animals and man-made objects. This prediction was not supported, indicating that the association between an object's sound and image may not be that disparate when comparing animals to man-made objects. The findings further support cross-modal conflict research for both the animal and man-made object categories. The most important finding, however, was that auditory processing is enhanced for living compared to nonliving objects, a difference previously found only in visual processing. Implications relevant to both the neuropsychological literature and sound research are discussed.
Multisensory Research, 2021
Although it has been demonstrated that multisensory information can facilitate object recognition and object memory, it remains unclear whether such a facilitation effect exists in category learning. To address this issue, comparable car images and sounds were first selected by a discrimination task in Experiment 1. Then, those selected images and sounds were utilized in a prototype category learning task in Experiments 2 and 3, in which participants were trained with auditory, visual, and audiovisual stimuli, and were tested with trained or untrained stimuli within the same categories presented alone or accompanied by a congruent or incongruent stimulus in the other modality. In Experiment 2, when low-distortion stimuli (more similar to the prototypes) were trained, there was higher accuracy for audiovisual trials than visual trials, but no significant difference between audiovisual and auditory trials. During testing, accuracy was significantly higher for congruent trials than uni...
IEEE Transactions on Neural Networks, 2000
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
Attention, perception & psychophysics, 2014
An influential theoretical perspective describes an implicit category-learning system that associates regions of perceptual space with response outputs by integrating information preattentionally and predecisionally across multiple stimulus dimensions. This study tested whether this kind of implicit, information-integration category learning is possible across stimulus dimensions lying in different sensory modalities. Humans learned categories composed of conjoint visual-auditory category exemplars comprising a visual component (rectangles varying in the density of contained lit pixels) and an auditory component (Experiment 1: auditory sequences varying in duration; Experiment 2: pure tones varying in pitch). The categories had either a one-dimensional, rule-based solution or a two-dimensional, information-integration solution. Humans can solve the information-integration category tasks by integrating information across two stimulus modalities. The results demonstrate an important cross-modal form of sensory integration in the service of category learning, and they advance the field's knowledge about the sensory organization of systems for categorization.
