Skip to main content

Marco Bertini

Università degli Studi di Firenze (University of Florence), Dinfo - MICC, Faculty Member

Followers

2

Following

163

Public Views

Interests

Uploads

Papers by Marco Bertini

The VIDI-Video semantic video search engine

Video is becoming vital to society and economy. It plays a key role in information distribution a... more Video is becoming vital to society and economy. It plays a key role in information distribution and access, and it is also becoming the natural form of communication on the Internet and via mobile devices. The massive increase in digital audiovisual information will pose high demands on advanced storage and retrieval engines, and it is certain that consumers and professionals will need advanced storage and search technologies for the management of large-scale video assets.

Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features

ACM Multimedia Asia, 2021

Effective triplet mining improves training of multi-scale pooled CNN for image retrieval

Machine Vision and Applications, 2022

In this paper, we address the problem of content-based image retrieval (CBIR) by learning images ... more In this paper, we address the problem of content-based image retrieval (CBIR) by learning images representations based on the activations of a Convolutional Neural Network. We propose an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on the trainable aggregation layer NetVLAD (Arandjelovic et al in Proceedings of the IEEE conference on computer vision and pattern recognition CVPR, NetVLAD, 2016) and bags of local features obtained by splitting the activations, allowing to reduce the dimensionality of the descriptor and to increase the performance of retrieval. Training is performed using an improved triplet mining procedure that selects samples based on their difficulty to obtain an effective image representation, reducing the risk of overfitting and loss of generalization. Extensive experiments show that our approach, that can be effectively used with different CNN architectures, obtains state-of-the-art results on standard and challenging CBIR datasets.

Wearable systems for improving tourist experience

Multimodal Behavior Analysis in the Wild, 2019

In this chapter we present original approaches for the development of a smart audio-guide that ad... more In this chapter we present original approaches for the development of a smart audio-guide that adapts to the actions and interests of visitors of cultural heritage sites and exhibitions either in indoor or outdoor scenarios. The guide is capable of perceiving the context. It understands what the user is looking at, if he is moving or is inattentive (e.g. talking with someone), in order to provide relevant information at the appropriate timing. Automatic recognition of artworks is performed with different approaches depending on the scenario, i.e. indoor and outdoor. These approaches are, respectively, based on Convolutional Neural Network (CNN) and SIFT descriptors, performing, when appropriate, object localization and classification. The computer-vision system works in real time on the mobile device, exploiting also a fusion of audio and motion sensors. Configurable interfaces to ease interaction and fruition of multimedia insights are provided for both scenarios. The audio-guide h...

Deep Sentiment Features of Context and Faces for Affective Video Analysis

Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, 2017

Given the huge quantity of hours of video available on video sharing platforms such as YouTube, V... more Given the huge quantity of hours of video available on video sharing platforms such as YouTube, Vimeo, etc. development of automatic tools that help users nd videos that t their interests has attracted the attention of both scienti c and industrial communities. So far the majority of the works have addressed semantic analysis, to identify objects, scenes and events depicted in videos, but more recently a ective analysis of videos has started to gain more attention. In this work we investigate the use of sentiment driven features to classify the induced sentiment of a video, i.e. the sentiment reaction of the user. Instead of using standard computer vision features such as CNN features or SIFT features trained to recognize objects and scenes, we exploit sentiment related features such as the ones provided by Deep-SentiBank [4], and features extracted from models that exploit deep networks trained on face expressions. We experiment on two recently introduced datasets: LIRIS-ACCEDE [2] and MEDIAEVAL-2015, that provide sentiment annotations of a large set of short videos. We show that our approach not only outperforms the current state-of-the-art in terms of valence and arousal classi cation accuracy, but it also uses a smaller number of features, requiring thus less video processing.

Deep Artwork Detection and Retrieval for Automatic Context-Aware Audio Guides

ACM Transactions on Multimedia Computing, Communications, and Applications, 2017

In this article, we address the problem of creating a smart audio guide that adapts to the action... more In this article, we address the problem of creating a smart audio guide that adapts to the actions and interests of museum visitors. As an autonomous agent, our guide perceives the context and is able to interact with users in an appropriate fashion. To do so, it understands what the visitor is looking at, if the visitor is moving inside the museum hall, or if he or she is talking with a friend. The guide performs automatic recognition of artworks, and it provides configurable interface features to improve the user experience and the fruition of multimedia materials through semi-automatic interaction. Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we perform also robust artwork recognition. To improve the rec...

Spatio-Temporal Closed-Loop Object Detection

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society, Jan 10, 2017

Object detection is one of the most important tasks of computer vision. It is usually performed b... more Object detection is one of the most important tasks of computer vision. It is usually performed by evaluating a subset of the possible locations of an image that are more likely to contain the object of interest. Exhaustive approaches have now been superseded by object proposal methods. The interplay of detectors and proposal algorithms has not been fully analyzed and exploited up to now, although this is a very relevant problem for object detection in video sequences. We propose to connect, in a closed-loop, detectors and object proposal generator functions exploiting the ordered and continuous nature of video sequences. Different from tracking we only require a previous frame to improve both proposal and detection: no prediction based on local motion is performed, thus avoiding tracking errors. We obtain 3 to 4 points of improvement in mAP and a detection time that is lower than Faster R-CNN, which is the fastest CNN based generic object detector known at the moment.

Item-Based Video Recommendation

Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

In this paper we propose a method for video recommendation in Social Networks based on crowdsourc... more In this paper we propose a method for video recommendation in Social Networks based on crowdsourced and automatic video annotations of salient frames. We show how two human factors, users' self-expression in user profiles and perception of visual saliency in videos, can be exploited in order to stimulate annotations and to obtain an efficient representation of video content features. Results are assessed through experiments conducted on a prototype of social network for video sharing. Several baseline approaches are evaluated and we show how the proposed method improves over them.

Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, 2015

In this paper, an innovative solution is presented: a smart emotional system for impaired people'... more In this paper, an innovative solution is presented: a smart emotional system for impaired people's TV. It aims to accompany the cognitive information contained in a movie, with the affective content. The affect is then communicated to the movie viewers in ways compatible for people with hearing and/or visual impairments, to let them experience all of the sensations offered by the movie. To do so, emotion recognition techniques are used to classify movie scenes into seven basic emotions. These emotions are then represented, in realtime, while the movie is playing, to the viewers, using environmental lights, emotional subtitles and a second screen application that integrates vibrations, emoticons and background music.

Multimedia Ontology Based Computational Framework for Video Annotation and Retrieval

Lecture Notes in Computer Science, 2007

Ontologies are defined as the representation of the semantics of terms and their relationships. T... more Ontologies are defined as the representation of the semantics of terms and their relationships. Traditionally, they consist of concepts, concept properties, and relationships between concepts, all expressed in linguistic terms. In order to support effectively video annotation and content-based retrieval the traditional linguistic ontologies should be extended to include structural video information and perceptual elements such as visual data descriptors. These extended ontologies (referred in the following as multimedia ontologies) should support definition of visual concepts as representatives of specific patterns of a linguistic concept. While the linguistic part of the ontology embeds permanent and objective items of the domain, the perceptual part includes visual concepts that are dependent on temporal experience and are subject to changes with time and perception. This is the reason why dynamic update of visual concepts has to be supported by multimedia ontologies, to represent temporal evolution of concepts.

Annotation and retrieval of video documents: sports videos

3rd International Symposium on Image and Signal Processing and Analysis, 2003. ISPA 2003. Proceedings of the

Multimedia information, especially videos, is growing explosively with the rapid development of t... more Multimedia information, especially videos, is growing explosively with the rapid development of the Internet and multimedia technology. Due to its variety of image features, it is capable of reaching several hundred dimensions and even thousands of dimensions. Storing and indexing the high-dimensional feature vectors has become key technologies of content-based video retrieval. The residual quantization mechanism, which combines the asymmetric distance and set sorting algorithm based on multi-feature candidates, is improved after analyzing the characteristics of soccer videos. For soccer videos, SD-VLAD (Soft Distribution-Vectors of Locally Aggregated Descriptors), BOC (Bag of Color), and shot type are selected for describing the information of images. To address the problem that the original residual quantized inverted index can only retrieve single features, multiple feature retrieval and sorting are proposed. In the stage of candidate set sorting, a multi-feature based similarity calculation method is designed according to the shots type. The experimental results show that multi-feature hierarchical retrieval and sorting can be achieved at the cost of memory space. While ensuring query speed, the accuracy of the query is improved.

Common visual cues for sports highlights detection

2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763)

Automatic annotation of semantic events allows effective retrieval of video content. In this pape... more Automatic annotation of semantic events allows effective retrieval of video content. In this paper, we present automatic annotation of sports highlights for some of the principal sport types, obtained by detecting and tracking a limited number of visual cues common to each sports. Highlights are represented as atomic entities at the semantic level. They have a limited temporal extension and can be modeled as the spatio-temporal concatenation of specific events. Visual cues encode position and speed information coming from the camera and from the object/athletes that are present in the scene, and are estimated automatically from the video stream. Algorithms for model checking and for visual cues estimation are discussed, as well as applications of the representation to different sport domains.

Soccer videos highlight prediction and annotation in real time

In this paper, we present an automatic system that is able to forecast the appearance of a soccer... more In this paper, we present an automatic system that is able to forecast the appearance of a soccer highlight, and annotate it, based on MPEG features; processing is performed in strict real time. A probabilistic framework based on Bayes networks is used to detect the most significant soccer highlights. Predictions are validated by different Bayes networks, to check the outcome of forecasts.

Domain knowledge extension with pictorially enriched ontologies

Classifying video elements according to some pre-defined ontology of the video content is the typ... more Classifying video elements according to some pre-defined ontology of the video content is the typical way to perform video annotation. Ontologies are built by defining relationship between linguistic terms that describe domain concepts at different abstraction levels. Linguistic terms are appropriate to distinguish specific events and object categories but they are inadequate when they must describe video entities or specific patterns of events. In these cases visual prototypes can better express pattern specifications and the diversity of visual events. To support video annotation up to the level of pattern specification enriched ontologies, that include visual concepts together with linguistic keywords, are needed. This paper presents Pictorially Enriched ontologies and provides a solution for their implementation in the soccer video domain. The pictorially enriched ontology created is used both to directly assign multimedia objects to concepts, providing a more meaningful definition than the linguistics terms, and to extend the initial knowledge of the domain, adding subclasses of highlights or new highlight classes that were not defined in the linguistic ontology. Automatic annotation of soccer clips up to the pattern specification level using a pictorially enriched ontology is discussed.

Non-parametric anomaly detection exploiting space-time features

In this paper a real-time anomaly detection system for video streams is proposed. Spatio-temporal... more In this paper a real-time anomaly detection system for video streams is proposed. Spatio-temporal features are exploited to capture scene dynamic statistics together with appearance. Anomaly detection is performed in a non-parametric fashion, evaluating directly local descriptor statistics. A method to update scene statistics, to cope with scene changes that typically happen in real world settings, is also provided. The proposed method is tested on publicly available datasets.

Scene and crowd behaviour analysis with local space-time descriptors

In this paper we propose a local space-time descriptor to be employed for behaviour analysis in v... more In this paper we propose a local space-time descriptor to be employed for behaviour analysis in video-surveillance applications. We show how this local video representation is able to extract scene semantics in both a supervised (behaviour recognition) and semi-supervised (anomaly detection) setup. Our approach yields state-of-the art performance on two publicly available datasets and is not computationally intensive.

A system for automatic detection and recognition of advertising trademarks in sports videos

In this technical demonstration we show the current version of our trademark detection and recogn... more In this technical demonstration we show the current version of our trademark detection and recognition system that has been developed in collaboration with a sport marketing firm 1 with the aim of evaluating the visibility of advertising trademarks in broadcast sporting events. We propose a semi-automatic system for detecting and retrieving trademark appearances in sports videos. A human annotator supervises the results of the automatic annotation through an interface that shows the time and the position of the detected trademarks; due to this fact the aim of the system is to provide a good recall figure, so that the supervisor can safely skip the parts of the video that have been marked as not containing a trademark, thus speeding up his work.

Recognizing human actions by fusing spatio-temporal appearance and motion descriptors

2009 16th IEEE International Conference on Image Processing (ICIP), 2009

In this paper we propose a new method for human action categorization by using an effective combi... more In this paper we propose a new method for human action categorization by using an effective combination of a new 3D gradient descriptor with an optic flow descriptor, to represent spatio-temporal interest points. These points are used to represent video sequences using a bag of spatio-temporal visual words, following the successful results achieved in object and scene classification. We extensively test our approach on the standard KTH and Weizmann actions datasets, showing its validity and good performance. Experimental results outperform state-of-the-art methods, without requiring fine parameter tuning.

Effective Codebooks for human action categorization

2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, 2009

In this paper we propose a new method for human action categorization by using an effective combi... more In this paper we propose a new method for human action categorization by using an effective combination of novel gradient and optic flow descriptors, and creating a more effective codebook modeling the ambiguity of feature assignment in the traditional bag-of-words model. Recent approaches have represented video sequences using a bag of spatio-temporal visual words, following the successful results achieved in object and scene classification. Codebooks are usually obtained by k-means clustering and hard assignment of visual features to the best representing codeword. Our main contribution is two-fold. First, we define a new 3D gradient descriptor that combined with optic flow outperforms the state-of-the-art, without requiring fine parameter tuning. Second, we show that for spatio-temporal features the popular k-means algorithm is insufficient because cluster centers are attracted by the denser regions of the sample distribution, providing a non-uniform description of the feature space and thus failing to code other informative regions. Therefore, we apply a radius-based clustering method and a soft assignment that considers the information of two or more relevant candidates. This approach generates a more effective codebook resulting in a further improvement of classification performances. We extensively test our approach on standard KTH and Weizmann action datasets showing its validity and outperforming other recent approaches.

Proceedings of the 21st ACM international conference on Multimedia - MM '13, 2013

In this paper, we describe the euTV system, which provides a flexible approach to collect, manage... more In this paper, we describe the euTV system, which provides a flexible approach to collect, manage, annotate and publish collections of images, videos and textual documents. The system is based on a Service Oriented Architecture that allows to combine and orchestrate a large set of web services for automatic and manual annotation, retrieval, browsing, ingestion and authoring of multimedia sources. euTV tools have been used to create several publicly available vertical applications, addressing different use cases. Positive results of user evaluations have shown that the system can be effectively used to create different types of applications.