Papers by Fabrizio Falchi

On the Robustness to Adversarial Examples of Neural ODE Image Classifiers
2019 IEEE International Workshop on Information Forensics and Security (WIFS), 2019
The vulnerability of deep neural networks to adversarial attacks currently represents one of the most challenging open problems in the deep learning field. The NeurIPS 2018 work that obtained the best paper award proposed a new paradigm for defining deep neural networks with continuous internal activations. In this kind of network, dubbed Neural ODE Networks, a continuous hidden state can be defined via parametric ordinary differential equations, and its dynamics can be adjusted to build representations for a given task, such as image classification. In this paper, we analyze the robustness of image classifiers implemented as ODE Nets to adversarial attacks and compare it to standard deep models. We show that Neural ODEs are natively more robust to adversarial attacks than state-of-the-art residual networks, and some of their intrinsic properties, such as adaptive computation cost, open new directions to further increase the robustness of deep-learned models. Moreover, thanks to the continuity of the hidden state, we are able to follow the perturbation injected by manipulated inputs and pinpoint the part of the internal dynamics that is most responsible for the misclassification.

Facial expressions play a fundamental role in human communication, and their study, which represents a multidisciplinary subject, embraces a great variety of research fields, e.g., from psychology to computer science, among others. Concerning Deep Learning, the recognition of facial expressions is a task named Facial Expression Recognition (FER). With such an objective, the goal of a learning model is to classify human emotions starting from a facial image of a given subject. Typically, face images are acquired by cameras that have, by nature, different characteristics, such as the output resolution. Moreover, other circumstances might involve cameras placed far from the observed scene, thus obtaining faces with very low resolutions. Therefore, since the FER task might involve analyzing face images that can be acquired with heterogeneous sources, it is plausible to expect that resolution plays a vital role. In such a context, we propose a multi-resolution training approach to solve ...

Proceedings of the 1st International Workshop on Multimedia AI against Disinformation
Deepfake generation techniques are evolving at a rapid pace, making it possible to create realistic manipulated images and videos and endangering the serenity of modern society. The continual emergence of new and varied techniques brings with it a further problem, namely the ability of deepfake detection models to update themselves promptly in order to identify manipulations carried out using even the most recent methods. This is an extremely complex problem to solve, as training a model requires large amounts of data, which are difficult to obtain if the deepfake generation method is too recent. Moreover, continuously retraining a network would be unfeasible. In this paper, we ask whether, among the various deep learning techniques, there is one that is able to generalise the concept of deepfake to such an extent that it does not remain tied to one or more specific deepfake generation methods used in the training set. We compared a Vision Transformer with an EfficientNetV2 in a cross-forgery context based on the ForgeryNet dataset. From our experiments, it emerges that EfficientNetV2 has a greater tendency to specialize, often obtaining better results on training methods, while Vision Transformers exhibit a superior generalization ability that makes them more competent even on images generated with new methodologies.
CCS Concepts: Applied computing → Computer forensics; Computing methodologies → Computer vision.
ACM Computing Surveys
In recent years, Quantum Computing witnessed massive improvements in terms of available resources and algorithm development. The ability to harness quantum phenomena to solve computational problems is a long-standing dream that has drawn the scientific community’s interest since the late 80s. In such a context, we propose our contribution. First, we introduce basic concepts related to quantum computations, and then we explain the core functionalities of technologies that implement the Gate Model and Adiabatic Quantum Computing paradigms. Finally, we gather, compare and analyze the current state-of-the-art concerning Quantum Perceptrons and Quantum Neural Networks implementations.
ArXiv, 2021
Space exploration has always been a source of inspiration for humankind, and thanks to modern telescopes, it is now possible to observe celestial bodies far away from us. With a growing number of real and imaginary images of space available on the web and exploiting modern Deep Learning architectures such as Generative Adversarial Networks, it is now possible to generate new representations of space. In this research, using a Lightweight GAN, a dataset of images obtained from the web, and the Galaxy Zoo Dataset, we have generated thousands of new images of celestial bodies, galaxies, and finally, by combining them, a wide view of the universe. The code for reproducing our results is publicly available at https://github.com/davide-coccomini/GAN-Universe, and the generated images can be explored at https://davide-coccomini.github.io/GANUniverse/.
A new approach for video-stream filtering that makes use of the features representing video content and exploits the properties of metric spaces can help reduce the filtering receiver’s computational load.

Lecture Notes in Computer Science, 2019
Neural networks are said to be biologically inspired since they mimic the behavior of real neurons. However, several processes in state-of-the-art neural networks, including Deep Convolutional Neural Networks (DCNN), are far from the ones found in animal brains. One relevant difference is the training process. In state-of-the-art artificial neural networks, the training process is based on backpropagation and Stochastic Gradient Descent (SGD) optimization. However, studies in neuroscience strongly suggest that this kind of process does not occur in the biological brain. Rather, learning methods based on Spike-Timing-Dependent Plasticity (STDP) or the Hebbian learning rule seem to be more plausible, according to neuroscientists. In this paper, we investigate the use of the Hebbian learning rule when training Deep Neural Networks for image classification by proposing a novel weight update rule for shared kernels in DCNNs. We perform experiments using the CIFAR-10 dataset in which we employ Hebbian learning, along with SGD, to train parts of the model or whole networks for the task of image classification, and we discuss their performance thoroughly, considering both effectiveness and efficiency aspects.

Face verification is a key task in many application fields, such as security and surveillance. Several approaches and methodologies are currently used to try to determine if two faces belong to the same person. Among these, facial landmarks are very important in forensics, since the distance between some characteristic points of a face can be used as an objective measure in court during trials. However, the accuracy of the approaches based on facial landmarks in verifying whether a face belongs to a given person or not is often unsatisfactory. Recently, deep learning approaches have been proposed to address the face verification problem, with very good results. In this paper, we compare the accuracy of facial landmarks and deep learning approaches in performing the face verification task. Our experiments, conducted on a real case scenario, show that the deep learning approach greatly outperforms the facial landmarks approach in accuracy. Keywords: Face Verification; Facial Landmarks;...

Evaluation of Continuous Image Features Learned by ODE Nets
Deep-learning approaches in data-driven modeling rely on learning a finite number of transformations (and representations) of the data that are structured in a hierarchy and are often instantiated as deep neural networks (and their internal activations). State-of-the-art models for visual data usually implement deep residual learning: the network learns to predict a finite number of discrete updates that are applied to the internal network state to enrich it. Pushing the residual learning idea to the limit, ODE Net, a novel network formulation involving continuously evolving internal representations that gained the best paper award at NeurIPS 2018, has been recently proposed. Differently from traditional neural networks, in this model the dynamics of the internal states are defined by an ordinary differential equation with learnable parameters that defines a continuous transformation of the input representation. These representations can be computed using standard ODE solvers, and t...
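The link between discrete residual updates and a continuous hidden state can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `dynamics` is a hand-written scalar function standing in for the learned network, and a fixed-step Euler loop replaces the standard (often adaptive) ODE solvers the abstract refers to. The sketch only shows that one residual update coincides with a single Euler step of size 1, and that taking more, smaller steps approaches the continuous solution.

```python
import math


def dynamics(h):
    # Hypothetical stand-in for the learned function f(h); a real ODE Net
    # would use a neural network with trainable parameters here.
    return -h


def residual_update(h):
    # One residual block: the state is enriched by a discrete update.
    return h + dynamics(h)


def euler_integrate(h, t_end=1.0, steps=1):
    # Fixed-step Euler integration of dh/dt = f(h) from t = 0 to t = t_end.
    dt = t_end / steps
    for _ in range(steps):
        h = h + dt * dynamics(h)
    return h


h0 = 2.0
one_step = euler_integrate(h0, steps=1)       # identical to residual_update(h0)
many_steps = euler_integrate(h0, steps=1000)  # close to the continuous solution
exact = h0 * math.exp(-1.0)                   # closed form for dh/dt = -h
print(one_step, many_steps, exact)
```

For this toy dynamics the closed-form solution is known, which makes it easy to see how the discretized state converges to the continuous one as the step size shrinks.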

Relational reasoning is an emerging theme in Machine Learning in general and in Computer Vision in particular. DeepMind has recently proposed a module called Relation Network (RN) that has shown impressive results on visual question answering tasks. Unfortunately, the implementation of the proposed approach was not public. To reproduce their experiments and extend their approach in the context of Information Retrieval, we had to re-implement everything, testing many parameters and conducting many experiments. Our implementation is now public on GitHub and it is already used by a large community of researchers. Furthermore, we recently presented a variant of the relation network module that we called Aggregated Visual Features RN (AVF-RN). This network can produce and aggregate at inference time compact visual relationship-aware features for the Relational-CBIR (R-CBIR) task. R-CBIR consists in retrieving images with given relationships among objects. In this paper, we discuss the d...

Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of data that describe all the spatio-temporal events that occur in each match. These events (e.g., passes, shots, fouls) are collected by human operators manually, constituting a considerable cost for data providers in terms of time and economic resources. In this paper, we describe PassNet, a method to recognize the most frequent events in soccer, i.e., passes, from video streams. Our model combines a set of artificial neural networks that perform feature extraction from video streams, object detection to identify the positions of the ball and the players, and classification of frame sequences as passes or not passes. We test PassNet on different scenarios, depending on the similarity of conditions to the match used for training. Our results show good classification results and significant improvement in the accuracy of pass detection with respect to baseline classifiers, even wh...

An Image Retrieval System for Video
Since the 1970s, Content-Based Image Indexing and Retrieval (CBIR) has been an active area. Nowadays, the rapid increase of video data has paved the way to the advancement of the technologies in many different communities for the creation of Content-Based Video Indexing and Retrieval (CBVIR). However, greater attention needs to be devoted to the development of effective tools for video search and browsing. In this paper, we present Visione, a system for large-scale video retrieval. The system integrates several content-based analysis and retrieval modules, including a keyword search, a spatial object-based search, and a visual similarity search. From the tests carried out by users when they needed to find as many correct examples as possible, the similarity search proved to be the most promising option. Our implementation is based on state-of-the-art deep learning approaches for content analysis and leverages highly efficient indexing techniques to ensure scalability. Specificall...

New Trends in Image Analysis and Processing – ICIAP 2019, 2019
Convolutional neural networks have reached extremely high performance on the Face Recognition task. These models are commonly trained by using high-resolution images and, for this reason, their discrimination ability is usually degraded when they are tested against low-resolution images. Thus, Low-Resolution Face Recognition remains an open challenge for deep learning models. Such a scenario is of particular interest for surveillance systems, in which a low-resolution probe often has to be matched with higher-resolution galleries. This task can be especially hard to accomplish since the probe can have resolutions as low as 8, 16 and 24 pixels per side, while the typical input of state-of-the-art neural networks is 224. In this paper, we describe the training campaign we used to fine-tune a ResNet-50 architecture, with Squeeze-and-Excitation blocks, on the tasks of very low and mixed resolution face recognition. For the training process we used the VGGFace2 dataset and then we tested the performance of the final model on the IJB-B dataset; in particular, we tested the neural network on the 1:1 verification task. In our experiments we considered two different scenarios: 1) probe and gallery with the same resolution; 2) probe and gallery with mixed resolutions. Experimental results show that with our approach it is possible to improve upon the performance of state-of-the-art models on the low and mixed resolution face recognition tasks with a negligible loss at very high resolutions.

Similarity Search and Applications, 2019
Many approaches for approximate metric search rely on a permutation-based representation of the original data objects. The main advantage of transforming metric objects into permutations is that the latter can be efficiently indexed and searched using data structures such as inverted files and prefix trees. Typically, the permutation is obtained by ordering the identifiers of a set of pivots according to their distances to the object to be represented. In this paper, we present a novel approach to transform metric objects into permutations. It uses the object-pivot distances in combination with a metric transformation, called n-Simplex projection. The resulting permutation-based representation, named SPLX-Perm, is suitable only for the large class of metric spaces satisfying the n-point property. We tested the proposed approach on two benchmarks for similarity search. Our preliminary results are encouraging and open new perspectives for further investigations on the use of the n-Simplex projection for supporting permutation-based indexing.
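The traditional pivot-ordering step described in the abstract (not the SPLX-Perm variant, whose n-Simplex projection is not detailed here) can be sketched as follows. The pivot coordinates, the Euclidean distance, and the Spearman footrule comparison are assumptions chosen for illustration; the abstract does not state which distance between permutations is used.

```python
import math


def pivot_permutation(obj, pivots, distance=math.dist):
    # Order pivot identifiers (their indices) by increasing distance to obj.
    # Ties are broken by the pivot index so the result is deterministic.
    return sorted(range(len(pivots)), key=lambda i: (distance(obj, pivots[i]), i))


def spearman_footrule(perm_a, perm_b):
    # Sum of the absolute differences between the ranks that each pivot
    # takes in the two permutations (an assumed comparison measure).
    rank_in_b = {pivot_id: rank for rank, pivot_id in enumerate(perm_b)}
    return sum(abs(rank - rank_in_b[pivot_id]) for rank, pivot_id in enumerate(perm_a))


# Hypothetical 2D pivots and objects, for illustration only.
pivots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
perm_near_origin = pivot_permutation((1.0, 2.0), pivots)
perm_near_second = pivot_permutation((9.0, 1.0), pivots)
print(perm_near_origin, perm_near_second)
print(spearman_footrule(perm_near_origin, perm_near_second))
```

Objects that are close in the original space tend to see the pivots in a similar order, which is why comparing permutations can serve as a cheap proxy for comparing the objects themselves.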

Similarity Search and Applications, 2019
Transformations of data objects into the Hamming space are often exploited to speed up the similarity search in metric spaces. Techniques applicable in generic metric spaces require expensive learning, e.g., selection of pivoting objects. However, when searching in common Euclidean space, the best performance is usually achieved by transformations specifically designed for this space. We propose a novel transformation technique that provides a good trade-off between the applicability and the quality of the space approximation. It uses the n-Simplex projection to transform metric objects into a low-dimensional Euclidean space, and then transforms this space to the Hamming space. We compare our approach theoretically and experimentally with several techniques of the metric embedding into the Hamming space. We focus on the applicability, learning cost, and the quality of search space approximation.

Information Retrieval Journal, 2017
In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the typically huge image collection on which the search is performed. We propose Text2Vis, a neural network that generates a visual representation, in the visual feature space of the fc6-fc7 layers of ImageNet, from a short descriptive text. Text2Vis optimizes two loss functions, using a stochastic loss-selection method. A visual-focused loss is aimed at learning the actual text-to-visual feature mapping, while a text-focused loss is aimed at modeling the higher-level semantic concepts expressed in language and countering the overfit on non-relevant visual components of the visual loss. We report preliminary results on the MS-COCO dataset.

Large Scale Indexing and Searching Deep Convolutional Neural Network Features
Lecture Notes in Computer Science, 2016
Content-based image retrieval using Deep Learning has become very popular during the last few years. In this work, we propose an approach to index Deep Convolutional Neural Network Features to support efficient retrieval on very large image databases. The idea is to provide a text encoding for these features enabling the use of a text retrieval engine to perform image similarity search. In this way, we built LuQ, a robust retrieval system that combines full-text search with content-based image retrieval capabilities. In order to optimize the index occupation and the query response time, we evaluated various tuning parameters to generate the text encoding. To this end, we have developed a web-based prototype to efficiently search through a dataset of 100 million images.
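One plausible way to give a feature vector a text encoding is sketched below. The abstract does not spell out the encoding, so the scheme here is an assumption: each feature component gets a synthetic term (`f0`, `f1`, ...) that is repeated in proportion to its quantized value, so that a term-frequency-based text engine scores documents roughly like a dot product between vectors. The function name `surrogate_text` and the `quantization` parameter are hypothetical.

```python
def surrogate_text(features, quantization=4, prefix="f"):
    # Build a whitespace-separated string in which the term for component i
    # appears int(value * quantization) times; non-positive components are
    # dropped, which also keeps the resulting text short and sparse.
    terms = []
    for index, value in enumerate(features):
        repeats = int(max(value, 0.0) * quantization)
        terms.extend([f"{prefix}{index}"] * repeats)
    return " ".join(terms)


# Hypothetical 3-component feature vector, for illustration only.
print(surrogate_text([0.25, 0.0, 0.5]))
```

The quantization factor is the kind of tuning parameter that trades index size and query time against fidelity: a larger factor preserves more of the original values but produces longer documents.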
In this paper, we consider the task of recognizing epigraphs in images such as photos taken using mobile devices. Given a set of 17,155 photos related to 14,560 epigraphs, we used a k-Nearest Neighbor approach in order to perform the recognition. The contribution of this work is in evaluating state-of-the-art visual object recognition techniques in this specific context. The experimental results show that the Vector of Locally Aggregated Descriptors obtained by aggregating SIFT descriptors is the best choice for this task.

Landmark recognition in VISITO: VIsual Support to Interactive TOurism in Tuscany
Proceedings of the 1st ACM International Conference on Multimedia Retrieval, ICMR'11, 2011
We present the VIsual Support to Interactive TOurism in Tuscany (VISITO Tuscany) project, which offers an interactive guide for tourists visiting cities of art, accessible via smartphones. The peculiarity of the system is that user interaction is mainly obtained by the use of images: to receive information on a particular monument, users just have to take a picture of it. VISITO Tuscany, using techniques of image analysis and content recognition, automatically recognizes the photographed monuments, and pertinent information is displayed to the user. In this paper we illustrate how the use of landmark recognition from mobile devices can provide the tourist with relevant and customized information about various types of objects in cities of art.