Much research has been concerned with the notion of bottom-up saliency in visual scenes, i.e. the contribution of low-level image features such as brightness, colour, contrast, and motion to the deployment of attention. Because the human visual system is highly optimized for the real world, it is reasonable to draw inspiration from human behaviour in the design of machine vision algorithms that determine regions of relevance. In previous work, we showed that a very simple and generic grayscale video representation, namely the geometric invariants of the structure tensor, predicts eye movements when viewing dynamic natural scenes better than complex, state-of-the-art models. Here, we moderately increase the complexity of our model and compute the invariants for colour videos, i.e. on the multispectral structure tensor and for different colour spaces. Results show that colour slightly improves predictive power.
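For reference, a minimal formulation of the structure tensor and its geometric invariants, using standard definitions; the multispectral form for colour is our reading of the extension described above:

```latex
% Spatio-temporal structure tensor of a grayscale video f(x, y, t),
% with a local smoothing window w:
J = w * \left( \nabla f \, \nabla f^{\top} \right), \qquad
\nabla f = (f_x, f_y, f_t)^{\top}

% Geometric invariants (elementary symmetric functions of the
% eigenvalues \lambda_1, \lambda_2, \lambda_3 of J):
H = \lambda_1 + \lambda_2 + \lambda_3 = \operatorname{trace} J
S = \lambda_1\lambda_2 + \lambda_1\lambda_3 + \lambda_2\lambda_3
K = \lambda_1\lambda_2\lambda_3 = \det J

% H, S, K > 0 signal an intrinsic dimension of at least 1, 2, and 3,
% respectively. For a colour video, the channel tensors can be summed:
J_{\text{colour}} = \sum_c w * \left( \nabla f_c \, \nabla f_c^{\top} \right)
```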
Here, we study the predictability of eye movements when viewing high-resolution natural videos. We use three recently published gaze data sets that contain a wide range of footage, from scenes of almost still-life character to professionally made, fast-paced advertisements and movie trailers. Inter-subject gaze variability differs significantly between data sets, with variability being lowest for the professional movies. We then evaluate three state-of-the-art saliency models on these data sets. A model that is based on the invariants of the structure tensor and that combines very generic, sparse video representations with machine learning techniques outperforms the two reference models; performance is further improved for two data sets when the model is extended to a perceptually inspired colour space. Finally, a combined analysis of gaze variability and predictability shows that eye movements on the professionally made movies are the most coherent (due to implicit gaze-guidance strategies of the movie directors), yet the least predictable (presumably due to the frequent cuts). Our results highlight the need for standardized benchmarks to comparatively evaluate eye movement prediction algorithms.
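One simple way to quantify inter-subject gaze variability is the frame-wise distance of each subject's gaze to the mean gaze of the remaining subjects. This is an illustrative measure of our own choosing, not necessarily the one used in the paper:

```python
import numpy as np

def intersubject_variability(gaze):
    """gaze: array of shape (n_subjects, n_frames, 2) in pixels.
    Returns the mean leave-one-out distance to the other subjects' mean gaze."""
    n = gaze.shape[0]
    dists = []
    for s in range(n):
        others = np.delete(gaze, s, axis=0).mean(axis=0)   # (n_frames, 2)
        dists.append(np.linalg.norm(gaze[s] - others, axis=-1))
    return float(np.mean(dists))
```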
This paper deals with the problem of estimating multiple motions at points where these motions are overlaid. We present a new approach that is based on block-matching and can deal with both transparent motions and occlusions. We derive a block-matching constraint for an arbitrary number of moving layers. We use this constraint to design a hierarchical algorithm that can distinguish between the occurrence of single, transparent, and occluded motions and can thus select the appropriate local motion model. The algorithm adapts to the amount of noise in the image sequence by use of a statistical confidence test. Robustness is further increased with a regularization scheme based on Markov Random Fields. Performance is demonstrated on image sequences synthesized from natural textures with high levels of additive dynamic noise and on real video sequences.
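To illustrate the structure of such a constraint, here is a sketch for the two-layer case under additive superposition (our notation; the paper derives the general case for an arbitrary number of layers):

```latex
% Single motion u: the warped-frame residual vanishes.
e_1(\mathbf{u}) = f(\mathbf{x}, t) - f(\mathbf{x} - \mathbf{u}, t-1) = 0

% Two additively superimposed layers moving with u and v: composing
% the two shift operators cancels both layers, which requires three frames.
e_2(\mathbf{u}, \mathbf{v}) = f(\mathbf{x}, t)
  - f(\mathbf{x} - \mathbf{u}, t-1)
  - f(\mathbf{x} - \mathbf{v}, t-1)
  + f(\mathbf{x} - \mathbf{u} - \mathbf{v}, t-2) = 0

% Block matching minimizes |e| summed over a block; the hierarchical
% algorithm then selects the motion model (single, transparent, occluded)
% whose residual passes the noise-dependent confidence test.
```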
We show that a simple low-dimensional representation of movie patches, namely local spectral energy, can be used to predict where people will look in dynamic natural scenes. We then present a gaze-contingent display that modifies local spectral energy in real time. This modification of the saliency distribution of the scene leads to a change in eye movement statistics. Our research aims at the guidance of gaze with the ultimate goal of optimising vision-based communication systems.
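A hypothetical sketch of what a "local spectral energy" feature could look like, computed as the summed squared FFT magnitude (DC removed) of a small spatio-temporal patch; the exact definition used in the paper may differ:

```python
import numpy as np

def local_spectral_energy(patch):
    """patch: (t, y, x) grayscale movie patch as a float array."""
    spec = np.abs(np.fft.fftn(patch))
    spec.flat[0] = 0.0                    # discard the DC component
    return float(np.sum(spec ** 2))       # total local spectral energy
```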
Proceedings of the Seventh International Symposium on Signal Processing and Its Applications, 2003
We present a model for predicting the eye movements of observers viewing dynamic sequences of images. As an indicator of the degree of saliency we evaluate an invariant of the spatio-temporal structure tensor that indicates an intrinsic dimension of at least two. The saliency is used to derive a list of candidate locations. Out of this list, the currently attended location is selected according to a mapping found by supervised learning. The true locations used for learning are obtained with an eye tracker. In addition to the saliency-based candidates, the selection algorithm uses a limited history of locations attended in the past. The mapping is linear and can thus be quickly adapted to the individual observer. The mapping is optimal in the sense that it is obtained by minimizing, by gradient descent, the overall quadratic difference between the predicted and the actually attended location.
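A minimal sketch of the learning step described above: a linear mapping W from a feature vector (stacked candidate locations plus a short history of attended locations) to the predicted gaze position, fitted by gradient descent on the squared error. Names and dimensions are illustrative:

```python
import numpy as np

def train_linear_predictor(X, Y, lr=1e-3, epochs=200):
    """X: (n_samples, n_features) stacked candidate/history vectors.
    Y: (n_samples, 2) attended (x, y) positions from the eye tracker."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], 2))
    for _ in range(epochs):
        err = X @ W - Y                  # prediction error per sample
        grad = X.T @ err / len(X)        # gradient of the mean squared error
        W -= lr * grad                   # gradient-descent update
    return W

# usage: predicted_location = features @ W
```

Because the mapping is linear, a few further gradient steps on a new observer's data suffice to adapt it, which matches the adaptability claim in the abstract.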
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 2012
We investigate the contribution of local spatio-temporal variation of image intensity to saliency. To measure different types of variation, we use the geometrical invariants of the structure tensor. With a video represented in spatial axes x and y and temporal axis t, the n-dimensional structure tensor can be evaluated for different combinations of axes (2D and 3D) and also for the (degenerate) case of only one axis. The resulting features are evaluated on several spatio-temporal scales in terms of how well they can predict eye movements on complex videos. We find that a 3D structure tensor is optimal: the most predictive regions of a movie are those where intensity changes along all spatial and temporal directions. Among two-dimensional variations, the axis pair yt, which is sensitive to horizontal translation, outperforms xy and xt by a large margin, and is even superior in prediction to two baseline models of bottom-up saliency.
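A sketch of how the 2D structure tensor could be evaluated for a chosen axis pair of a grayscale video volume; the determinant is the 2D invariant that is positive only where intensity changes along both chosen axes, i.e. where the intrinsic dimension in that subspace is at least two. Function names and parameters are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def axis_pair_saliency(video, axes=(0, 1), sigma=2.0):
    """video: float array of shape (t, y, x).
    axes: pair from {0: t, 1: y, 2: x}; (0, 1) is the yt pair from the text."""
    g = [sobel(video, axis=a) for a in axes]       # partial derivatives
    Jaa = gaussian_filter(g[0] * g[0], sigma)      # tensor components,
    Jbb = gaussian_filter(g[1] * g[1], sigma)      # locally averaged
    Jab = gaussian_filter(g[0] * g[1], sigma)
    return Jaa * Jbb - Jab ** 2                    # det(J): the 2D invariant
```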
Proceedings of the 2nd symposium on Applied perception in graphics and visualization, 2005
The spatio-temporal characteristics of the human visual system vary widely across the visual field. Many studies that have investigated these characteristics were limited to the use of artificial stimuli, such as flickering sinusoidal gratings. Here, we present a gaze-contingent system that is capable of modulating the spatio-temporal content of a high-resolution image sequence in real time. In a first experiment, we measure, as a function of eccentricity, how much the temporal resolution of a natural video can be reduced before this manipulation becomes visible.
Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications - ETRA '10, 2010
We introduce an algorithm for space-variant filtering of video based on a spatio-temporal Laplacian pyramid and use this algorithm to render videos in order to visualize prerecorded eye movements. Spatio-temporal contrast and colour saturation are reduced as a function of distance to the nearest gaze point of regard, i.e. non-fixated, distracting regions are filtered out, whereas fixated image regions remain unchanged. In an experiment, the eye movements of an expert were visualized on instructional videos with this algorithm so that the gaze of novices was guided to relevant image locations; the results show that this visualization technique facilitates the novices' perceptual learning.
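As a greatly simplified, spatial-only sketch of the idea (the published algorithm uses a spatio-temporal pyramid and also desaturates colour): build a Laplacian pyramid per frame and attenuate the detail bands as a function of distance to the gaze point. The gaze-weighting function and parameters are our own illustrative choices:

```python
import cv2
import numpy as np

def space_variant_filter(frame, gaze, levels=4, radius=100.0):
    """frame: uint8 image (h, w[, 3]); gaze: (x, y) in pixels."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - gaze[0], ys - gaze[1])
    keep = np.exp(-(dist / radius) ** 2).astype(np.float32)  # 1 at gaze, ~0 far away
    # Gaussian pyramid
    gp = [frame.astype(np.float32)]
    for _ in range(levels):
        gp.append(cv2.pyrDown(gp[-1]))
    # reconstruct, re-adding each Laplacian band weighted by `keep`
    out = gp[-1]
    for i in range(levels - 1, -1, -1):
        up = cv2.pyrUp(out, dstsize=(gp[i].shape[1], gp[i].shape[0]))
        lap = gp[i] - up                                     # detail band at level i
        k = cv2.resize(keep, (gp[i].shape[1], gp[i].shape[0]))
        if lap.ndim == 3:
            k = k[..., None]
        out = up + k * lap                                   # attenuate detail off-gaze
    return np.clip(out, 0, 255).astype(np.uint8)
```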
We investigate the extent to which eye movements in natural dynamic scenes can be predicted with a simple model of bottom-up saliency, which learns on different visual representations to discriminate between salient and less salient movie regions. Our image representations, the geometrical invariants of the structure tensor, are computed on multiple scales of an anisotropic spatio-temporal multiresolution pyramid. Eye movement data is used to label video locations as salient. For each location, low-dimensional features are extracted on the multiscale representations and used to train a classifier. The quality of the predictor is tested on a large test set of eye movement data and compared with the performance of two state-of-the-art saliency models on this data set. The proposed model demonstrates a significant improvement (mean ROC score of 0.665) over the selected baseline models, whose ROC scores are 0.625 and 0.635.
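The ROC-style evaluation can be sketched as follows: saliency values sampled at fixated locations are scored against values at control locations (e.g. fixations shuffled across other frames or videos); the exact control condition is an assumption here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def roc_score(saliency_at_fixations, saliency_at_controls):
    """Area under the ROC curve for discriminating fixated from control locations."""
    scores = np.concatenate([saliency_at_fixations, saliency_at_controls])
    labels = np.concatenate([np.ones(len(saliency_at_fixations)),
                             np.zeros(len(saliency_at_controls))])
    return roc_auc_score(labels, scores)
```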
A less studied component of gaze allocation in dynamic real-world scenes is the time lag of eye movements in responding to dynamic attention-capturing events. Despite the vast amount of research on anticipatory gaze behaviour in natural situations, such as action execution and observation, little is known about the predictive nature of eye movements when viewing different types of natural or realistic scene sequences. In the present study, we quantify the degree of anticipation during the free viewing of dynamic natural scenes. The cross-correlation analysis of image-based saliency maps with an empirical saliency measure derived from eye movement data reveals the existence of predictive mechanisms responsible for a near-zero average lag between dynamic changes of the environment and the responding eye movements. We also show that the degree of anticipation is reduced when moving away from natural scenes by introducing camera motion, jump cuts, and film editing.
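An illustrative version of the lag estimation: cross-correlate a model saliency time series with an empirical, fixation-derived one and locate the peak; a near-zero peak lag indicates anticipatory eye movements. The normalization and sign convention here are ours:

```python
import numpy as np

def peak_lag(model_ts, empirical_ts):
    """Both inputs: 1D arrays of equal length, one value per frame."""
    m = (model_ts - model_ts.mean()) / model_ts.std()
    e = (empirical_ts - empirical_ts.mean()) / empirical_ts.std()
    xcorr = np.correlate(e, m, mode="full")        # lags from -(n-1) .. (n-1)
    lags = np.arange(-len(m) + 1, len(m))
    return lags[np.argmax(xcorr)]                  # frames by which gaze leads or lags
```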
Proceedings of the 2006 symposium on Eye tracking research & applications - ETRA '06, 2006
We describe an algorithm for manipulating the temporal resolution of a video in real time, contingent upon the viewer's direction of gaze. The purpose of this work is to study the effect that a controlled manipulation of the temporal frequency content in real-world scenes has on eye movements. We build on the work of Perry and Geisler, who manipulate spatial resolution as a function of gaze direction, allowing them to mimic the resolution distribution of the human retina or to simulate the effect of various diseases (e.g. glaucoma). Our temporal filtering algorithm is similar to that of Perry and Geisler in that we interpolate between the levels of a multiresolution pyramid. However, in our case, the pyramid is built along the temporal dimension, and this requires careful management of the buffering of video frames and of the order in which the filtering operations are performed. On a standard personal computer, the algorithm achieves real-time performance (30 frames per second) on high-resolution videos (960 by 540 pixels). We present experimental results showing that the manipulation performed by the algorithm reduces the number of high-amplitude saccades and can remain unnoticed by the observer.
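As a greatly simplified stand-in for the temporal manipulation, one can use a per-pixel recursive lowpass whose time constant grows with eccentricity; the published algorithm instead interpolates between the levels of a true temporal pyramid, which requires buffering several frames. All parameters here are illustrative:

```python
import numpy as np

class GazeContingentTemporalFilter:
    def __init__(self, shape, radius=150.0):
        """shape: (height, width) of the video frames."""
        self.state = None
        self.radius = radius
        self._ys, self._xs = np.mgrid[0:shape[0], 0:shape[1]]

    def process(self, frame, gaze):
        dist = np.hypot(self._xs - gaze[0], self._ys - gaze[1])
        # alpha = 1 at the gaze point (full temporal resolution),
        # small in the periphery (strong temporal smoothing)
        alpha = np.clip(np.exp(-(dist / self.radius) ** 2), 0.05, 1.0)
        if frame.ndim == 3:
            alpha = alpha[..., None]
        f = frame.astype(np.float32)
        if self.state is None:
            self.state = f
        self.state = alpha * f + (1.0 - alpha) * self.state
        return self.state.astype(frame.dtype)
```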
The saliency of an image or video region indicates how likely it is that the viewer of the image or video fixates that region due to its conspicuity. An intriguing question is how we can change a video region to make it more or less salient. Here, we address this problem with a machine learning framework that learns, from a large set of eye movements collected on real-world dynamic scenes, how to alter the saliency level of the video locally. We derive saliency transformation rules by performing spatio-temporal contrast manipulations (on a spatio-temporal Laplacian pyramid) on the particular video region. Our goal is to improve visual communication by designing gaze-contingent interactive displays that change, in real time, the saliency distribution of the scene.
We analyze the predictability of eye movements of observers viewing dynamic scenes. We first assess the effectiveness of model-based prediction. The model is divided into inter-saccade prediction, which is based on a limited history of attended locations, and saccade prediction, which is based on a list of salient locations. The quality of the predictions and of the underlying saliency maps is tested on a large set of eye movement data recorded on high-resolution real-world video sequences. In addition, frequently fixated locations are used to predict individual eye movements to obtain a reference for model-based predictions.
This paper briefly summarises our results on gaze guidance so as to complement the demonstrations that we plan to present at the workshop. Our goal is to integrate gaze into visual communication systems by measuring and guiding eye movements. Our strategy is to predict a set of about ten salient locations and then change the probabilities of these candidates being attended: for one candidate the probability is increased, for the others it is decreased. To increase saliency, in our current implementation, we show a natural-scene movie and overlay red dots so briefly that they are hardly perceived consciously. To decrease the probability, for example, we locally reduce the temporal frequency content of the movie. We here present preliminary results, which show that the three steps of our above strategy are feasible. The long-term goal is to find the optimal real-time video transformation that minimises the difference between the actual and the desired eye movements without being obtrusive. Applications are in the area of vision-based communication, augmented vision, and learning.
We present a model that predicts saccadic eye movements and can be tuned to a particular human observer who is viewing a dynamic sequence of images. Our work is motivated by applications that involve gaze-contingent interactive displays on which information is displayed as a function of gaze direction. The approach therefore differs from standard approaches in two ways: (i) we deal with dynamic scenes, and (ii) we provide means of adapting the model to a particular observer. As an indicator of the degree of saliency we evaluate the intrinsic dimension of the image sequence within a geometric approach implemented by using the structure tensor. Out of these saliency-based candidate locations, the currently attended location is selected according to a strategy found by supervised learning. The data are obtained with an eye tracker from subjects who view video sequences. The selection algorithm receives candidate locations of current and past frames and a limited history of locations attended in the past. We use a linear mapping that is obtained by minimizing the quadratic difference between the predicted and the actually attended location by gradient descent. Being linear, the learned mapping can be quickly adapted to the individual observer.
Based on the principle of efficient coding, we present a theoretical framework for how to categorize the basic types of changes that can occur in a spatio-temporal signal. First, theoretical results for the problem of estimating multiple transparent motions are reviewed. Then, confidence measures for the presence of multiple motions are used to derive a basic alphabet of local signal variation that includes motion layers. To better understand and visualize this alphabet, a representation of motions in the projective plane is used. A further, practical contribution is an interactive tool that allows one to generate multiple motion patterns and display them in various apertures. In our framework, we can explain some well-known results on coherent motion and a few more complex perceptual phenomena such as the 2D-1D entrainment effect, but the focus of this paper is on the methods. Our working hypothesis is that efficient representations can be obtained by suppressing all the redundancies that arise if the visual input does not change in a particular direction, or a set of directions. Finally, we assume that human eye movements will tend to avoid the redundant parts of the visual input and report results where our framework has been used to obtain very good predictions of eye movements made on overlaid natural videos.