Papers by Georgios Evangelopoulos

The complexity of a learning task is increased by transformations in the input space that preserve class identity. Visual object recognition, for example, is affected by changes in viewpoint, scale, illumination or planar transformations. While drastically altering the visual appearance, these changes are orthogonal to recognition and should not be reflected in the representation or feature encoding used for learning. We introduce a framework for weakly supervised learning of image embeddings that are robust to transformations and selective to the class distribution, using sets of transforming examples (orbit sets), deep parametrizations and a novel orbit-based loss. The proposed loss combines a discriminative, contrastive part for orbits with a reconstruction error that learns to rectify orbit transformations. The learned embeddings are evaluated in distance metric-based tasks, such as one-shot classification under geometric transformations, as well as face verification and retrieval under more realistic visual variability. Our results suggest that orbit sets, suitably computed or observed, can be used for efficient, weakly-supervised learning of semantically relevant image embeddings.

Representation Learning from Orbit Sets for One-Shot Classification
AAAI Spring Symposium Series 2017, Technical Reports
The sample complexity of a learning task is increased by transformations that do not change class identity. Visual object recognition, for example (i.e., the discrimination or categorization of distinct semantic classes), is affected by changes in viewpoint, scale, illumination or planar transformations. We introduce a weakly-supervised framework for learning robust and selective representations from sets of transforming examples (orbit sets). We train deep encoders that explicitly account for the equivalence up to transformations of orbit sets and show that the resulting encodings contract the intra-orbit distance and preserve identity either by preserving reconstruction or by increasing the inter-orbit distance. We explore a loss function that combines a discriminative term and a reconstruction term that uses a decoder-encoder map to learn to rectify transformation-perturbed examples, and demonstrate the validity of the resulting embeddings for one-shot learning. Our results suggest that a suitable definition of orbit sets is a form of weak supervision that can be exploited to learn semantically relevant embeddings.
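As a rough illustration of the kind of objective described above, the sketch below combines a contrastive term over orbit pairs with a reconstruction term that decodes a transformed example back towards a reference element of its orbit. This is a minimal PyTorch sketch under assumed conventions, not the authors' implementation; the names `orbit_loss`, `decoder`, `margin` and `alpha` are placeholders.

```python
import torch.nn.functional as F


def orbit_loss(z_a, z_p, z_n, x_ref, decoder, margin=1.0, alpha=0.5):
    """Contrastive-plus-reconstruction objective over orbit sets (illustrative).

    z_a, z_p : embeddings of two transformed examples from the same orbit
    z_n      : embedding of an example from a different orbit
    x_ref    : a reference (e.g. untransformed) element of the anchor orbit
    decoder  : module mapping embeddings back to input space
    """
    # Discriminative part: contract intra-orbit distances and push apart
    # inter-orbit pairs up to a margin.
    d_pos = F.pairwise_distance(z_a, z_p)
    d_neg = F.pairwise_distance(z_a, z_n)
    contrastive = d_pos.pow(2) + F.relu(margin - d_neg).pow(2)

    # Reconstruction part: decoding a transformed example should rectify it,
    # i.e. reproduce the reference element of its orbit.
    recon = F.mse_loss(decoder(z_p), x_ref, reduction="none")
    recon = recon.flatten(1).mean(dim=1)

    return (contrastive + alpha * recon).mean()
```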

Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations
INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association
In the framework of a theory for invariant sensory signal representations, a signature which is invariant and selective for speech sounds can be obtained through projections on template signals and pooling over their transformations under a group. For locally compact groups, e.g., translations, the theory explains the resilience of convolutional neural networks with filter weight sharing and max pooling across their local translations in frequency or time. In this paper we propose a discriminative approach for learning an optimal set of templates, under a family of transformations, namely frequency transpositions and perturbations of the vocal tract length, which are among the primary sources of speech variability. Implicitly, we generalize convolutional networks to transformations other than translations, and derive data-specific templates by training a deep network with convolution-pooling layers and densely connected layers. We demonstrate that such a representation, combining group-generalized convolutions, theoretical invariance guarantees and discriminative template selection, improves frame classification performance over standard translation-CNNs and DNNs on the TIMIT and Wall Street Journal datasets.
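To make the projection-and-pooling module concrete, the following sketch pools the responses of one template over a family of transformations, with circular frequency shifts standing in for the actual transposition/vocal-tract-length family used in the paper; the names and the choice of max pooling are illustrative assumptions.

```python
import numpy as np


def group_pooled_projection(x, template, transforms, pool=max):
    """Project a signal on a template under a family of transformations and
    pool over the group, giving a transformation-insensitive response.

    x          : 1-D input (e.g. a spectro-temporal frame)
    template   : 1-D template of the same length
    transforms : callables g mapping a template to a transformed template
    """
    responses = [float(np.dot(x, g(template))) for g in transforms]
    return pool(responses)


# Toy usage: pool dot products over circular shifts of one template,
# a stand-in for the transposition / vocal-tract-length family.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
t = rng.standard_normal(64)
shifts = [lambda w, k=k: np.roll(w, k) for k in range(-3, 4)]
signature = group_pooled_projection(x, t, shifts)
```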

Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple levels (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, on par with the frame-level acoustic modeling usually applied in speech recognition systems. In this paper, we propose a framework for representing speech at the word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or preprocessing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.
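The word-level signature described above can be pictured with a short sketch: project the raw waveform on every transformed template of an orbit and summarize the projections with a histogram. The normalization, bin count and value range below are assumptions for illustration, not the paper's settings.

```python
import numpy as np


def orbit_histogram_signature(x, template_orbits, n_bins=32, value_range=(-1.0, 1.0)):
    """Word-level signature from a raw waveform (illustrative sketch).

    x               : 1-D waveform, assumed pre-segmented to the word
    template_orbits : list of arrays, each of shape (n_transforms, len(x)),
                      holding transformed versions of one template
    """
    x = x / (np.linalg.norm(x) + 1e-8)
    parts = []
    for orbit in template_orbits:
        orbit = orbit / (np.linalg.norm(orbit, axis=1, keepdims=True) + 1e-8)
        projections = orbit @ x                    # one projection per transformed template
        hist, _ = np.histogram(projections, bins=n_bins,
                               range=value_range, density=True)
        parts.append(hist)
    return np.concatenate(parts)                   # histograms stacked per orbit
```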

Phone Classification by a Hierarchy of Invariant Representation Layers
INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association
We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily selected speech signals. Templates are then stored as the weights of “neurons”. We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.

Recognition of speech, and in particular the ability to generalize and learn from small sets of labeled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intra-class transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates, such as specific phones or words, together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it onto each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.

Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed of similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

3D-2D Face Recognition with Pose and Illumination Normalization
In this paper, we propose a 3D-2D framework for face recognition that is more practical than 3D-3D, yet more accurate than 2D-2D. For 3D-2D face recognition, the gallery data comprises 3D shape and 2D texture data and the probes are arbitrary 2D images. A 3D-2D system (UR2D) is presented that is based on a 3D deformable face model that allows registration of 3D and 2D data, face alignment, and normalization of pose and illumination. During enrollment, subject-specific 3D models are constructed using 3D+2D data. For recognition, 2D images are represented in a normalized image space using the gallery 3D models and landmark-based 3D-2D projection estimation. A method for bidirectional relighting is applied for non-linear, local illumination normalization between probe and gallery textures, and a global orientation-based correlation metric is used for pairwise similarity scoring. The generated, personalized, pose- and light-normalized signatures can be used for one-to-one verification or one-to-many identification. Results for 3D-2D face recognition on the UHDB11 3D-2D database with 2D images under large illumination and pose variations support our hypothesis that, in challenging datasets, 3D-2D outperforms 2D-2D and decreases the performance gap against 3D-3D face recognition. Evaluations on FRGC v2.0 3D-2D data with frontal facial images demonstrate that the method can generalize to databases with different and diverse illumination conditions.

Multimodal Saliency and Fusion for Movie Summarization based on Aural, Visual, and Textual Attention
IEEE Transactions on Multimedia, Nov 2013
Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitles information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated in a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
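As an illustration of how per-modality saliency curves can be combined, the sketch below min-max normalizes three curves on a common time axis and fuses them with a weighted linear rule or two simple nonlinear rules; the specific weights and rules are illustrative stand-ins for the fusion schemes evaluated in the paper.

```python
import numpy as np


def fuse_saliency(aural, visual, textual, weights=(1 / 3, 1 / 3, 1 / 3), scheme="linear"):
    """Fuse per-frame saliency curves (same time axis) into one multimodal curve."""

    def norm(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)

    stacked = np.stack([norm(aural), norm(visual), norm(textual)])

    if scheme == "linear":                     # weighted average of modalities
        return np.average(stacked, axis=0, weights=weights)
    if scheme == "max":                        # salient if any modality is salient
        return stacked.max(axis=0)
    if scheme == "product":                    # rewards cross-modal agreement
        return np.prod(stacked, axis=0)
    raise ValueError(f"unknown fusion scheme: {scheme}")
```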

6th Pacific-Rim Symposium on Image and Video Technology (PSIVT '13), Nov 1, 2013
Performance boosts in face recognition have been facilitated by the formation of facial databases, with collection protocols customized to address challenges such as light variability, expressions, pose, sensor/modality differences, and, more recently, uncontrolled acquisition conditions. In this paper, we present the UHDB11 database, created to facilitate 3D-2D face recognition evaluations, where the gallery has been acquired using 3D sensors (3D mesh and texture) and the probes using 2D sensors (images). The database consists of samples from 23 individuals, in the form of 2D high-resolution images spanning six illumination conditions and 12 head-pose variations, and 3D facial mesh and texture. It addresses limitations regarding the resolution, variability and type of 3D/2D data, and has been demonstrated to be statistically more challenging, diverse and information-rich than existing cohorts with 10 times the number of subjects. We propose a set of 3D-2D experimental configurations, with frontal 3D galleries and pose-illumination-varying probes, and provide baseline performance for identification and verification.

Minimizing Illumination Differences for 3-D to 2-D Face Recognition Using Lighting Maps
Asymmetric 3D to 2D face recognition has gained attention from the research community since the real-world application of 3D to 3D recognition is limited by the unavailability of inexpensive 3D data acquisition equipment. A 3D to 2D face recognition system explicitly relies on 3D facial data to account for uncontrolled image conditions related to head pose or illumination. We build upon such a system, which matches relit gallery textures with pose-normalized probe images, using the gallery facial meshes. The relighting process, however, is based on an assumption of indoor lighting conditions and limits recognition performance on outdoor images. In this paper, we propose a novel method for minimizing illumination differences by unlighting a 3D face texture via albedo estimation using lighting maps. The algorithm is evaluated on challenging databases (UHDB30, UHDB11, FRGC v2.0) with drastic lighting and pose variations. The experimental results demonstrate the robustness of our method for estimating the albedo from both indoor and outdoor captured images, and the effectiveness and efficiency of illumination normalization for face recognition.

Color constancy in 3D-2D face recognition
Proc. SPIE 8712, Biometric and Surveillance Technology for Human and Activity Identification X, Apr 29, 2013
The face is one of the most popular biometric modalities. However, up to now, color has rarely been actively used in face recognition. Yet, it is well known that when a person recognizes a face, color cues can become as important as shape, especially when combined with the ability of people to identify the color of objects independent of illuminant color variations. In this paper, we examine the feasibility and effect of explicitly embedding illuminant color information in face recognition systems. We empirically examine the theoretical maximum gain of including the known illuminant color in a 3D-2D face recognition system. We also investigate the impact of using computational color constancy methods for estimating the illuminant color, which is then incorporated into the face recognition framework. Our experiments show that under close-to-ideal illumination estimates, one can improve face recognition rates by 16%. When the illuminant color is algorithmically estimated, the improvement is approximately 5%. These results suggest that color constancy has a positive impact on face recognition, but the accuracy of the illuminant color estimate has a considerable effect on its benefits.
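One standard computational color constancy estimator, shown here purely as an example (the paper does not necessarily use this one), is the grey-world method: the illuminant color is taken proportional to the mean RGB of the image and divided out per channel.

```python
import numpy as np


def grey_world_illuminant(image):
    """Grey-world estimate of the illuminant color of an RGB image.

    image : float array of shape (H, W, 3), values in [0, 1]
    Returns a unit-norm RGB vector pointing in the illuminant direction.
    """
    illuminant = image.reshape(-1, 3).mean(axis=0)
    return illuminant / (np.linalg.norm(illuminant) + 1e-8)


def correct_image(image):
    """Divide out the estimated illuminant (von Kries-style per-channel scaling)."""
    e = grey_world_illuminant(image)
    corrected = image / (np.sqrt(3.0) * e + 1e-8)   # identity for a neutral illuminant
    return np.clip(corrected, 0.0, 1.0)
```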

The impact of specular highlights on 3D-2D face recognition
Proc. SPIE 8712, Biometric and Surveillance Technology for Human and Activity Identification X, Apr 29, 2013
One of the most popular forms of biometrics is face recognition. Face recognition techniques typically assume that a face exhibits Lambertian reflectance. However, a face often exhibits prominent specularities, especially in outdoor environments. These specular highlights can compromise identity authentication. In this work, we analyze the impact of such highlights on a 3D-2D face recognition system. First, we investigate three different specularity removal methods as preprocessing steps for face recognition. Then, we explicitly model facial specularities within the face recognition system with the Cook-Torrance reflectance model. In our experiments, specularity removal increases the recognition rate on an outdoor face database by about 5% at a false alarm rate of 10⁻³. The integration of the Cook-Torrance model further improves these results, increasing the verification rate by 19% at a FAR of 10⁻³.
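For reference, the Cook-Torrance model combines a microfacet distribution, a Fresnel term and geometric attenuation. The single-point sketch below uses a Beckmann distribution and Schlick's Fresnel approximation with arbitrary default parameters; it is a generic textbook form, not the configuration used in the paper.

```python
import numpy as np


def cook_torrance_specular(n, l, v, roughness=0.3, f0=0.04):
    """Cook-Torrance specular term for unit vectors n (normal), l (to light),
    v (to viewer); Beckmann distribution, Schlick Fresnel, classic geometry term.
    """
    h = (l + v) / np.linalg.norm(l + v)                 # half-way vector
    nl = max(float(np.dot(n, l)), 1e-6)
    nv = max(float(np.dot(n, v)), 1e-6)
    nh = max(float(np.dot(n, h)), 1e-6)
    vh = max(float(np.dot(v, h)), 1e-6)

    m2 = roughness ** 2
    d = np.exp((nh ** 2 - 1.0) / (m2 * nh ** 2)) / (np.pi * m2 * nh ** 4)  # Beckmann D
    f = f0 + (1.0 - f0) * (1.0 - vh) ** 5                                  # Schlick F
    g = min(1.0, 2.0 * nh * nv / vh, 2.0 * nh * nl / vh)                   # attenuation G

    return d * f * g / (4.0 * nl * nv)
```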

Benchmarking asymmetric 3D-2D face recognition systems
10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013
Asymmetric 3D-2D face recognition (FR) aims to recognize individuals from 2D face images using textured 3D face models in the gallery (or vice versa). This new FR scenario has the potential to be readily deployable in field applications while still keeping the advantages of 3D FR solutions of being more robust to pose and lighting variations. In this paper, we propose a new experimental protocol based on the UHDB11 dataset for benchmarking 3D-2D FR algorithms. This protocol allows for the study of the performance of a 3D-2D FR solution under pose and/or lighting variations. Furthermore, we also benchmark two state-of-the-art 3D-2D FR algorithms. One is based on the Annotated Deformable Model (using manually labeled landmarks in this paper), whereas the other makes use of Oriented Gradient Maps along with automatic pose estimation through random forests.

In this paper, we approach the problem of audio summarization by saliency computation of audio streams, exploring the potential of a modulation model for the detection of perceptually important audio events based on saliency models, along with various fusion schemes for their combination. The fusion schemes include linear, adaptive and nonlinear methods. A machine learning approach, in which training of the features is performed, was also applied for the purpose of comparison with the proposed technique. For the evaluation of the algorithm we use audio data taken from movies, and we show that nonlinear fusion schemes perform best. The results are reported on the MovSum database, using objective evaluations against ground truth denoting the perceptually important audio events. Analysis of the selected audio segments is also performed against a labeled database with respect to audio categories, while a method for fine-tuning the selected audio events is proposed.

IEEE International Conference on Image Processing (ICIP), 2012
The presence of multiband amplitude and frequency modulations (AM-FM) in wideband signals, such as textured images or speech, has led to the development of efficient multicomponent modulation models for low-level image and sound analysis. Moreover, compact yet descriptive representations have emerged by tracking, through non-linear energy operators, the dominant model components across time, space or frequency. In this paper, we propose a generalization of such approaches to the 3D spatio-temporal domain and explore the benefits of incorporating the Dominant Component Analysis scheme for interest point detection in videos for action recognition. Within this framework, actions are implicitly considered as manifestations of spatio-temporal oscillations in the dynamic visual stream. Multiband filtering and energy operators are applied to track the source energy in both spatial and temporal frequency bands. A new measure for extracting keypoint locations is formulated as the temporal dominant energy computed over the locally dominant modulation components, in terms of spatial modulation energy, of the input video frames. The theoretical formulation is supported by evaluation and comparisons in human action classification, which demonstrate the potential of the proposed detector.
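The nonlinear energy tracking referred to above is commonly realized with the Teager-Kaiser operator. The sketch below shows the 1-D discrete operator and a simple per-sample dominant component selection across a filterbank; the paper's 3D spatio-temporal extension is not reproduced here.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1).
    """
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi


def dominant_band_energy(band_signals):
    """Given bandpass-filtered versions of a signal (n_bands x n_samples),
    keep at each instant the maximum Teager energy across bands: a simple
    1-D dominant component analysis.
    """
    energies = np.array([teager_energy(b) for b in band_signals])
    return energies.max(axis=0), energies.argmax(axis=0)
```
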
In this paper, we present experiments on continuous-time, continuous-scale affective movie content recognition (emotion tracking). A major obstacle for emotion research has been the lack of appropriately annotated databases, limiting the potential for supervised algorithms.