Papers by Julian Stottinger
Context-based Media Geotagging of Personal Photos
2008 19th International Conference on Pattern Recognition, 2008
Five image segmentation algorithms are evaluated: mean shift, normalised cuts, efficient graph-based segmentation, hierarchical watershed, and waterfall. The evaluation is done using three evaluation metrics: probabilistic Rand index, global consistency error, and boundary precision-recall. We examine region-based metrics as a function of the number of regions produced by an algorithm. This allows new insights into algorithms and evaluation metrics to be gained.
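As an illustration of one of the metrics named above, the following is a minimal sketch of the probabilistic Rand index, assuming segmentations are given as flat sequences of region labels and using the standard formulation in which the index is the mean of the plain Rand index over all available ground-truth segmentations. The function names are hypothetical and this is not the evaluation code used in the paper.

```python
from collections import Counter
from math import comb


def rand_index(a, b):
    """Rand index between two flat label sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(a), 2)
    if total_pairs == 0:
        return 1.0
    # Number of pixel pairs placed in the same region by both, by a, and by b.
    same_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    # Agreements = pairs together in both + pairs apart in both.
    agreements = total_pairs + 2 * same_both - same_a - same_b
    return agreements / total_pairs


def probabilistic_rand_index(segmentation, ground_truths):
    """Mean Rand index of one segmentation against several ground truths."""
    if not ground_truths:
        raise ValueError("at least one ground-truth segmentation is required")
    scores = [rand_index(segmentation, gt) for gt in ground_truths]
    return sum(scores) / len(scores)
```

Because the index only counts whether pairs of pixels share a region, it is invariant to how regions are numbered, which is why it can be examined as a function of the number of regions an algorithm produces.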

Personal photo indexing
Proceedings of the 20th ACM International Conference, Jul 1, 2012
Sorting one's own private photo collection is a time-consuming and tedious task. We demonstrate our event-centered approach to perform this task fully automatically. In the course of the demonstration, we either use our own photo collections, or invite the conference visitors to bring their own cameras and photos. We will sort the photos into a semantically meaningful hierarchy for the users within a couple of minutes. Events as a media aggregator allow a user to manage and annotate a photo collection in a more convenient and natural way. Based on the recognized user behavior, the application is able to reveal the nature of an event and build its hierarchy with an event/sub-event relationship. One important prerequisite of our approach is a precise GPS-based spatial annotation of the photos. To accommodate devices without GPS chips or with temporarily poor GPS reception, we propose an approach to enrich the collection with automatically estimated GPS data by semantically interpolating possible routes of the user. We are positive that we can provide a well-received service for the conference visitors, especially since the conference venue will trigger a lot of memorable photos. Large scale experimental validation showed that the approach is able to recreate a user's desired hierarchy with an F-measure of about 0.8.
(Unseen) Event Recognition Using Faceted Recognition

Proceedings of the International Conference, 2010
Successful state-of-the-art video retrieval and classification applications are predominantly carried out by means of spatio-temporal features. Typically, the evaluation of these tasks is exclusively done based on their final performance but no systematic analysis of feature robustness, invariance and stability has been done yet for large scale video retrieval. In this work, we analyze the impact of visual transformation on spatio-temporal features in large scale experiments. Following the recipe of recent state of the art evaluations, we choose the best performing approaches, namely the spatio-temporal Harris3D, Hessian3D, and Cuboid detectors and the HOG/HOF, SURF3D, and HOG3D descriptors. We show that these features have different properties and behave differently under varying transformations (challenges). This helps researchers to justify the choice of features for new applications and helps to optimize the choice of input video in terms of resolution, compression, frames per second or noise suppression. We make the extracted features accessible on-line for further independent evaluation and applications.

2010 20th International Conference on Pattern Recognition, 2010
The most successful approaches to video understanding and video matching use local spatio-temporal features as a sparse representation for video content. Until now, no principled evaluation of these features has been done. We present FeEval, a dataset for the evaluation of such features. For the first time, this dataset allows for a systematic measurement of the stability and the invariance of local features in videos. FeEval consists of 30 original videos from a great variety of different sources, including HDTV shows, 1080p HD movies and surveillance cameras. The videos are iteratively varied by increasing blur, noise, increasing or decreasing light, median filter, compression quality, scale and rotation leading to a total of 1710 video clips. Homography matrices are provided for geometric transformations. The surveillance videos are taken from 4 different angles in a calibrated environment. Similar to prior work on 2D images, this leads to a repeatability and matching measurement in videos for spatiotemporal features estimating the overlap of features under increasing changes in the data.
Introducing 3D Vision and Computer Graphics to Archaeological Workflow - an Applicable Framework
Visapp, 2008
Augmentation of Skin Segmentation
Ipcv, 2010
2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012
Since high-level events in images (e.g. "dinner", "motorcycle stunt", etc.) may not be directly correlated with their visual appearance, low-level visual features do not carry enough semantics to classify such events satisfactorily. This paper explores a fully compositional approach for event based image retrieval which is able to overcome this shortcoming. Furthermore, the approach is fully scalable in both adding new events and new primitives. Using the Pascal VOC 2007 dataset, our contributions are the following: (i) We apply the Faceted Analysis-Synthesis Theory (FAST) to build a hierarchy of 228 high-level events.
2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009
Local image descriptors computed in areas around salient points in images are essential for many algorithms in computer vision. Recent work suggests using as many salient points as possible. While sophisticated classifiers have been proposed to cope with the resulting large number of descriptors, processing this large amount of data is computationally costly.

Proceedings of the international conference on Multimedia - MM '10, 2010
Due to the increasing flood of digital images and the overall increase of storage capacity, large scale image databases are common these days. This work deals with the problem of finding replicas in image databases containing more than 100000 images. A clustering algorithm is developed that has linear runtime and can be carried out in parallel. We observe that with increasing size of the database, the problem of decreasing discrimination between high frequency images arises. Features of images with natural repetitive texture become similar to other images and show up in most of the search results. This problem is addressed by developing an asymmetric Hamming distance measurement for bags of visual words. It allows better discrimination power in large databases, while being robust to image transformations such as rotation, cropping, or change of resolution and size.

Evaluation of Gradient Vector Flow for Interest Point Detection
Lecture Notes in Computer Science, 2008
We present and evaluate an approach for finding local interest points in images based on the non-minima suppression of Gradient Vector Flow (GVF) magnitude. Based on the GVF’s properties it provides the approximate centers of blob-like structures or homogeneous structures confined by gradients of similar magnitude. It results in a scale and orientation invariant interest point detector, which is highly stable against noise and blur. These interest points outperform the state of the art detectors in various respects. We show that our approach gives a dense and repeatable distribution of locations that are robust against affine transformations while they outperform state of the art techniques in robustness against lighting changes, noise, rotation and scale changes. Extensive evaluation is carried out using the Mikolajczyk framework for interest point detector evaluation.

Lecture Notes in Computer Science, 2010
We present a principled approach for general skin segmentation using graph cuts. We present the idea of a highly adaptive universal seed, thereby exploiting the positive training data only. We model the skin segmentation as a min-cut problem on a graph defined by the image color characteristics. Prior graph-cut-based approaches for skin segmentation do not provide general skin detection when the information of foreground or background seeds is not available. We propose a concept for processing arbitrary images, using a universal seed to overcome the potential lack of successful seed detections, thereby providing a basis for general skin segmentation. The advantage of the proposed approach is that it is based on skin sampled training data only, making it robust to unseen backgrounds. It exploits the spatial relationship among the neighboring skin pixels, providing more accurate and stable skin blobs. Extensive evaluation on a dataset of 8991 images with annotated pixel-level ground truth shows that the universal seed approach outperforms other state of the art approaches.

Proceeding of the 1st ACM workshop on Analysis and retrieval of events/actions and workflows in video streams - AREA '08, 2008
We propose a straightforward skin detection method for online videos. To overcome varying illumination circumstances and a variety of skin colors, we introduce a multiple model approach which can be carried out independently per model. The color models are initiated by skin detection based on face detection and adapted in real time. Our approach outperforms static approaches both in precision and runtime. If we detect a face in a scene, the number of false positives can be diminished significantly. Evaluation is carried out on publicly available on-line videos, showing that the adaptive multiple model approach outperforms static methods in classification precision and suppression of false positives.

2009 15th International Conference on Virtual Systems and Multimedia, 2009
This paper illustrates how taking advantage of user studies highlighting the user requirements can lead to the selection of suitable visual features in image search systems. The results of a study to identify pertinent visual features to enhance a text-based press photo search system used by journalists are presented. A requirement was that the visual features should be intuitively understandable by the journalists. This feature selection task is approached by first determining the journalists' photo searching requirements based on a published user study. These requirements are then mapped to suitable visual features. The emphasis was on identifying suitable and intuitive low-level features, as these can be rapidly implemented in the existing text-based image search system. Results demonstrating the use of the selected features are shown.

Lecture Notes in Computer Science, 2009
User generated video content has become increasingly popular, with a large number of internet video sharing portals appearing. Many portals wish to rapidly find and remove objectionable material from the uploaded videos. This paper considers the flagging of uploaded videos as potentially objectionable due to sexual content of an adult nature. Such videos are often characterized by the presence of a large amount of skin, although other scenes, such as close-ups of faces, also satisfy this criterion. The main contribution of this paper is to introduce to this task two uses of contextual information in the form of detected faces. The first is to use a combination of different face detectors to adjust the parameters of the skin detection model. The second is through the summarization of a video in the form of a path in a skin-face plot. This plot allows potentially objectionable segments of videos to be found, while ignoring segments containing close-ups of faces. The proposed approach runs in real-time. To validate our approach, experiments are done on challenging on-line videos with per-pixel annotation from an on-line service provider. Large scale experiments are carried out on 200 popular public video clips from web platforms. These are chosen from the community (top-rated) and cover a large variety of different skin-colors, illuminations, image quality and difficulty levels. We find a compact and reliable representation for videos to flag suspicious content efficiently.
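The skin-face plot idea above can be pictured with a minimal sketch, assuming each frame already has a binary skin mask and a binary face mask (nested lists of booleans). The function names and both thresholds are hypothetical choices for illustration only; the abstract does not specify how the path is built or analysed.

```python
def area_fraction(mask):
    """Fraction of pixels set to True in a 2D boolean mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(1 for row in mask for value in row if value) / total


def skin_face_path(skin_masks, face_masks):
    """One (face_fraction, skin_fraction) point per frame, in frame order."""
    return [
        (area_fraction(face), area_fraction(skin))
        for skin, face in zip(skin_masks, face_masks)
    ]


def flag_frames(path, min_skin=0.4, max_face=0.1):
    """Indices of frames with much skin but little face area.

    min_skin and max_face are assumed thresholds, not values from the paper.
    """
    return [
        index
        for index, (face, skin) in enumerate(path)
        if skin >= min_skin and face <= max_face
    ]
```

Under these assumptions, a close-up of a face yields a point with high skin and high face fractions and is therefore not flagged, which mirrors the distinction the abstract draws.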
2009 IEEE International Workshop on Multimedia Signal Processing, 2009
By analyzing the low level features of images only, skin detection in visual data cannot be solved. To compensate for this major drawback of many approaches, we combine a state of the art recognition algorithm with color model based skin detection. Detected faces in videos are the basis for adaptive skin-color models, which are propagated throughout the video, providing a more precise and accurate model in its recognition performance than pure color based approaches. The approach is able to run in real-time and does not need prior data-specific training. We received challenging online videos from an online service provider and use additional videos from public web platforms covering a grand variety of different skin-colors, illumination circumstances, image quality and difficulty levels.
A hybrid machine-crowd approach to photo retrieval result diversification