In this paper, we raise important issues concerning the evaluation complexity of existing Mahalanobis metric learning methods. The complexity scales linearly with the size of the dataset. This is especially cumbersome at large scale or for real-time applications with a limited time budget. To alleviate this problem, we propose to represent the dataset by a fixed number of discriminative prototypes. In particular, we introduce a new method that jointly chooses the positioning of the prototypes and also optimizes the Mahalanobis distance metric with respect to these. We show that choosing the positioning of the prototypes and learning the metric in parallel leads to a drastically reduced evaluation effort while maintaining the discriminative essence of the original dataset. Moreover, for most problems our method, performing k-nearest prototype (k-NP) classification on the condensed dataset, leads to even better generalization compared to k-NN classification using all data. Results on a v...
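For reference (not part of the original abstract), the metric being learned is the standard Mahalanobis distance, and the prototype idea pays off because evaluation cost grows with the number of stored exemplars:

```latex
% Mahalanobis distance between x_i and x_j, parameterized by a PSD matrix M:
d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M \,(x_i - x_j)}
% k-NN against all N training samples costs O(N d^2) distance evaluations
% per query; k-NP against m learned prototypes costs O(m d^2), with m << N.
```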
In this paper we present a natural feature tracking algorithm based on on-line boosting used for localizing a mobile computer. Mobile augmented reality requires highly accurate and fast six degrees of freedom tracking in order to provide registered graphical overlays to a mobile user. With advances in mobile computer hardware, vision-based tracking approaches have the potential to provide efficient solutions that are non-invasive, in contrast to the currently dominating marker-based approaches. We propose to use a tracking approach which can be used in an unknown environment, i.e., the target need not be known beforehand. The core of the tracker is an on-line learning algorithm, which updates the tracker as new data becomes available. This is suitable for many mobile augmented reality applications. We demonstrate the applicability of our approach on tasks where the target objects are not known beforehand, i.e., interactive planning.
For face recognition from video streams, cues such as transcripts, subtitles or on-screen text are often available. This information could be very valuable for improving the recognition performance. However, frequently this data cannot be associated directly with just one of the visible faces. To overcome these limitations and to exploit this valuable information, we define the task as a multiple instance learning (MIL) problem. We formulate a robust loss function that describes our problem and incorporates ambiguous and unreliable information sources, and optimize it using Gradient Boosting. A new definition of the posterior probability of a bag, based on the Lp-norm, improves the ability to deal with varying bag sizes over existing formulations. The benefits of the approach are demonstrated for face recognition in videos on a publicly available benchmark dataset. In fact, we show that exploring new information sources can drastically improve the classification results. Addition...
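As an illustration of the Lp-norm idea, here is a minimal sketch of a generalized-mean bag posterior in Python; the paper's exact loss and posterior definition may differ, and the function name and the choice of p are assumptions of this sketch:

```python
import numpy as np

def bag_posterior(instance_probs, p=8.0):
    """Lp-norm (generalized-mean) pooling of instance posteriors.

    Approaches the mean for p -> 1 and the max for p -> infinity, and
    normalizes by bag size, so bags of different sizes stay comparable.
    """
    q = np.asarray(instance_probs, dtype=float)
    return float(np.mean(q ** p) ** (1.0 / p))

# Example: a 3-face bag where one face strongly matches the label.
print(bag_posterior([0.1, 0.2, 0.9]))  # close to 0.9, dominated by the best match
```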
Face alignment is a crucial step in face recognition tasks. In particular, using landmark localization for geometric face normalization has been shown to be very effective, clearly improving the recognition results. However, no adequate databases exist that provide a sufficient number of annotated facial landmarks. The existing databases are either limited to frontal views, provide only a small number of annotated images, or have been acquired under controlled conditions. Hence, we introduce a novel database overcoming these limitations: Annotated Facial Landmarks in the Wild (AFLW). AFLW provides a large-scale collection of images gathered from Flickr, exhibiting a large variety in face appearance (e.g., pose, expression, ethnicity, age, gender) as well as general imaging and environmental conditions. In total, 25,993 faces in 21,997 real-world images are annotated with up to 21 landmarks per image. Due to the comprehensive set of annotations, AFLW is well suited to train and test algorithms for multi...
In this paper, we raise important issues on scalability and the required degree of supervision of existing Mahalanobis metric learning methods. Often rather tedious optimization procedures are applied that become computationally intractable on a large scale. Further, considering the constantly growing amount of data, it is often infeasible to specify fully supervised labels for all data points. Instead, it is easier to specify labels in the form of equivalence constraints. We introduce a simple though effective strategy to learn a distance metric from equivalence constraints, based on a statistical inference perspective. In contrast to existing methods, we do not rely on complex optimization problems requiring computationally expensive iterations. Hence, our method is orders of magnitude faster than comparable methods. Results on a variety of challenging benchmarks of rather diverse nature demonstrate the power of our method. These include faces in unconstrained environments, matc...
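A minimal sketch of a metric learned in closed form from equivalence constraints, in the spirit of the statistical-inference view described (the Mahalanobis matrix as the difference of inverse covariances of similar-pair and dissimilar-pair differences); the eps regularizer and the PSD projection are implementation choices of this sketch, not necessarily the paper's:

```python
import numpy as np

def metric_from_equivalence_constraints(X, pairs_sim, pairs_dis, eps=1e-6):
    """Learn a Mahalanobis matrix M from equivalence constraints, no iterations.

    X: (n, d) data matrix; pairs_sim / pairs_dis: lists of (i, j) index pairs
    known to be equivalent / non-equivalent.
    """
    d_sim = np.array([X[i] - X[j] for i, j in pairs_sim])
    d_dis = np.array([X[i] - X[j] for i, j in pairs_dis])
    cov_sim = d_sim.T @ d_sim / len(d_sim)
    cov_dis = d_dis.T @ d_dis / len(d_dis)
    eye = np.eye(X.shape[1])
    M = np.linalg.inv(cov_sim + eps * eye) - np.linalg.inv(cov_dis + eps * eye)
    # Re-project onto the PSD cone so M defines a valid (pseudo-)metric.
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0, None)) @ V.T
```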
In this paper, we introduce a formulation for the task of detecting objects based on the information gathered from a standard Implicit Shape Model (ISM). We describe a probabilistic approach in a general random field setting, which enables us to effectively detect object instances and additionally identify all local patches contributing to the different instances. We propose a sparse graph structure and define a semantic label space, specifically tuned to the task of localizing objects. The design of the graph structure then allows us to define a novel inference process that efficiently returns a good local minimum of our energy minimization problem. A key benefit of our method is that we do not have to fix a range for local neighborhood suppression, as is necessary, for instance, in related non-maximum suppression approaches. Our inference process is implicitly capable of separating even strongly overlapping object instances. Experimental evaluation compares our method to state-of-t...
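For intuition, a random-field energy over patch labels typically takes the generic form below; the concrete potentials, the sparse neighborhood structure, and the instance label space are specific to the paper and only indicated schematically here:

```latex
% y_i assigns local patch i to an object-instance label or background;
% the pairwise sum runs over the sparse edge set E of the proposed graph.
E(\mathbf{y}) = \sum_i \psi_i(y_i) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j)
```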
In this paper, we present an efficient solution for automatic detection and reading of dangerous goods plates on trucks and trains. According to the ADR agreement, dangerous goods transports are marked with an orange plate showing the hazard class and the identification number of the hazardous substance. Since, under real-world conditions, high-resolution images (often of low quality) have to be processed, an efficient and robust system is required. In particular, we propose a multi-stage system consisting of an acquisition step, a saliency region detector (to reduce the run-time), a plate detector, and a robust recognition step based on Optical Character Recognition (OCR). To demonstrate the system, we show qualitative and quantitative localization/recognition results on two challenging data sets. In fact, building on proven robust and efficient methods, we show excellent detection and classification results under hard environmental conditions at low run-time.
In this work, we present an entirely data-driven approach to estimating the 3D pose of a hand given a depth image. We show that we can correct the mistakes made by a Convolutional Neural Network (CNN) trained to predict an estimate of the 3D pose by using a feedback loop of Deep Networks, also utilizing a CNN architecture. Since this approach critically relies on a training set of labeled frames, we further present a method for creating the required training data. We propose a semi-automated method for efficiently and accurately labeling each frame of a depth video of a hand with the 3D locations of the joints.
In this paper we present an efficient algorithm for camera tracking applicable to mobile devices. In particular, the work is motivated by the limited computational power and memory, precluding the use of existing methods for estimating the 6-DoF pose of a mobile device (camera) relative to a previously unknown planar object. Similar to existing methods, we introduce a keypoint-based approach. We establish a relationship between the object and its image by selecting keypoints on the object, preferably those with a distinctive appearance, and identifying their locations within subsequent images. In contrast to existing works, we solve the problem of re-identifying such feature points by robustly learning their appearance with an on-line learning algorithm. We demonstrate the proposed algorithm, which is not limited to this application, in the context of AR. In particular, we give several qualitative and quantitative evaluations showing the benefits of the proposed approach.
Imitation learning allows agents to learn complex behaviors from demonstrations. However, learning a complex vision-based task may require an impractical number of demonstrations. Meta-imitation learning is a promising approach towards enabling agents to learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. In the presence of task ambiguity or unobserved dynamics, demonstrations alone may not provide enough information; an agent must also try the task to successfully infer a policy. In this work, we propose a method that can learn to learn from both demonstrations and trial-and-error experience with sparse reward feedback. In comparison to meta-imitation, this approach enables the agent to effectively and efficiently improve itself autonomously beyond the demonstration data. In comparison to meta-reinforcement learning, we can scale to substantially broader distributions of tasks, as the demonstration reduces the burden of exploration.
We present a solution for the automatic detection and classification of dangerous goods on trucks. Dangerous goods are labeled by an orange dangerous goods plate and/or a dangerous goods symbol sign. The acquisition system consists of a camera and a dedicated illumination setup. A computer vision system processes the images by localizing and reading the dangerous goods number. The proposed system can be installed on both ends of a tunnel, thus raising awareness of all dangerous goods currently within the tunnel. To demonstrate the system, we show qualitative and quantitative localization/recognition results on real-world data.
Face detection is still one of the core problems in computer vision, especially in unconstrained real-world situations where variations in face pose or bad imaging conditions have to be handled. These problems are covered by recent benchmarks such as the Face Detection Dataset and Benchmark (FDDB) [2], which reveals that established methods, e.g., Viola and Jones [8], suffer a drop in performance. More effective approaches exist, but are closed source and not publicly available. Thus, we propose a simple but effective detector that is available to the public. It combines Histograms of Oriented Gradients (HOG) [1] features with linear Support Vector Machine (SVM) classification.
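A minimal sketch of a HOG-plus-linear-SVM window classifier, assuming scikit-image and scikit-learn; the window size and HOG parameters below are common defaults, not necessarily those of the published detector:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(window):
    """HOG descriptor of a fixed-size grayscale window (e.g., 64x64, values in [0, 1])."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Hypothetical training data: `windows` is a list of face / non-face crops,
# `labels` is 1 for face and 0 for background.
# clf = LinearSVC(C=0.01).fit(np.array([hog_features(w) for w in windows]), labels)
#
# Detection then slides the window over an image pyramid, scores every
# position with clf.decision_function, and applies non-maximum suppression.
```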
Deep Learning methods usually require huge amounts of training data to perform at their full potential, and often require expensive manual labeling. Using synthetic images is therefore very attractive for training object detectors, as the labeling comes for free, and several approaches have been proposed to combine synthetic and real images for training. In this paper, we show that a simple trick is sufficient to train modern object detectors very effectively with synthetic images only: we 'freeze' the layers responsible for feature extraction to generic layers pre-trained on real images, and train only the remaining layers with plain OpenGL rendering. Our experiments with very recent deep architectures for object recognition (Faster-RCNN, R-FCN, Mask-RCNN) and image feature extractors (InceptionResnet and Resnet) show that this simple approach performs surprisingly well.
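A minimal PyTorch sketch of the freezing trick; the paper predates these torchvision model builders, so the specific detector and API names below are illustrative assumptions, but the mechanism (no gradients into the pre-trained feature extractor) is the one described:

```python
import torch
import torchvision

# Detector with a backbone pre-trained on real images.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# 'Freeze' the generic feature-extraction layers.
for p in model.backbone.parameters():
    p.requires_grad = False

# Optimize only the remaining (task-specific) layers: RPN and detection heads.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.005, momentum=0.9)
# The training loop then runs over synthetic (rendered) images only.
```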
2018 IEEE International Conference on Robotics and Automation (ICRA)
Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms can be extremely time-consuming and expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. We study how randomized simulated environments and domain adaptation methods can be extended to train a grasping system to grasp novel objects from raw monocular RGB images. We extensively evaluate our approaches with a total of more than 25,000 physical test grasps, studying a range of simulation conditions and domain adaptation methods, including a novel extension of pixel-level domain adaptation that we term the GraspGAN. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples needed to achieve a given level of performance by up to 50 times, using only randomly generated simulated objects. We also show that by using only unlabeled real-world data and our GraspGAN methodology, we obtain real-world grasping performance without any real-world labels that is similar to that achieved with 939,777 labeled real-world samples.
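A minimal sketch of the kind of pixel-level domain adaptation that GraspGAN extends, assuming G (sim-to-real generator) and D (discriminator with per-image logits) are PyTorch modules; the actual GraspGAN objective adds further content and task-specific losses described in the paper:

```python
import torch
import torch.nn.functional as F

def adaptation_step(G, D, opt_g, opt_d, sim_batch, real_batch, lam=10.0):
    """One GAN step: make rendered sim images look like real camera images."""
    fake = G(sim_batch)

    # Discriminator step: real images -> 1, adapted sim images -> 0.
    logits_real = D(real_batch)
    logits_fake = D(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool D, while an L1 content term keeps the scene layout
    # intact so the automatically generated grasp annotations remain valid.
    logits_g = D(fake)
    g_loss = (F.binary_cross_entropy_with_logits(logits_g, torch.ones_like(logits_g))
              + lam * F.l1_loss(fake, sim_batch))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```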
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 1, 2019
Real-world data, especially in the domain of robotics, is notoriously costly to collect. One way to circumvent this is to leverage the power of simulation to produce large amounts of labelled data. However, training models on simulated images does not readily transfer to real-world ones. Using domain adaptation methods to cross this "reality gap" requires a large amount of unlabelled real-world data, whilst domain randomization alone can waste modeling power. In this paper, we present Randomized-to-Canonical Adaptation Networks (RCANs), a novel approach to crossing the visual reality gap that uses no real-world data. Our method learns to translate randomized rendered images into their equivalent non-randomized, canonical versions. This in turn allows real images to also be translated into canonical sim images. We demonstrate the effectiveness of this sim-to-real approach by training a vision-based closed-loop grasping reinforcement learning agent in simulation, and then transferring it to the real world to attain 70% zero-shot grasp success on unseen objects, a result that almost doubles the success of learning the same task directly on domain randomization alone. Additionally, by joint fine-tuning in the real world with only 5,000 real-world grasps, our method achieves 91%, attaining comparable performance to a state-of-the-art system trained with 580,000 real-world grasps, resulting in a reduction of real-world data by more than 99%.
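The key property that makes this trainable without real data is that the simulator can render the same scene twice, randomized and canonical, giving paired supervision. Below is a minimal sketch of that core; RCAN itself additionally predicts segmentation and depth and uses adversarial losses, so G and the L1-only objective here are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def rcan_step(G, opt, randomized_batch, canonical_batch):
    """One training step: translate a randomized rendering to its canonical twin.

    G is an image-to-image network; both batches come from identical
    simulator states, so the target is known pixel for pixel.
    """
    pred = G(randomized_batch)               # predicted canonical rendering
    loss = F.l1_loss(pred, canonical_batch)  # paired reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# At deployment, real camera images are simply fed through G, which maps
# them into the canonical style the grasping policy was trained in.
```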
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 1, 2016
While many recent hand pose estimation methods critically rely on a training set of labelled frames, the creation of such a dataset is a challenging task that has been overlooked so far. As a result, existing datasets are limited to a few sequences and individuals, with limited accuracy, and this prevents these methods from delivering their full potential. We propose a semi-automated method for efficiently and accurately labeling each frame of a hand depth video with the corresponding 3D locations of the joints: The user is asked to provide only an estimate of the 2D reprojections of the visible joints in some reference frames, which are automatically selected to minimize the labeling work by efficiently optimizing a sub-modular loss function. We then exploit spatial, temporal, and appearance constraints to retrieve the full 3D poses of the hand over the complete sequence. We show that this data can be used to train a recent state-of-the-art hand pose estimation method, leading to increased accuracy. The code and dataset can be found at https://github.com/moberweger/semi-auto-anno/.
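For intuition, reference-frame selection of this kind is often cast as a coverage (facility-location) objective, which the standard greedy algorithm optimizes with a (1 - 1/e) guarantee. The sketch below is a generic illustration under that assumption, not the paper's exact sub-modular loss:

```python
import numpy as np

def select_reference_frames(sim, k):
    """Greedy facility-location selection of frames to annotate.

    sim: (n, n) matrix of pairwise frame similarities (appearance space);
    k:   number of reference frames the user will annotate.
    Each step picks the frame that most improves how well every frame of
    the video is covered by its closest already-selected frame.
    """
    n = sim.shape[0]
    selected, coverage = [], np.zeros(n)
    for _ in range(k):
        gains = np.maximum(sim, coverage[None, :]).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf            # never re-pick a selected frame
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[j])
    return selected
```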
IEEE Transactions on Pattern Analysis and Machine Intelligence
We propose an approach to estimating the 3D pose of a hand, possibly handling an object, given a depth image. We show that we can correct the mistakes made by a Convolutional Neural Network trained to predict an estimate of the 3D pose by using a feedback loop. The components of this feedback loop are also Deep Networks, optimized using training data. This approach can be generalized to a hand interacting with an object. Therefore, we jointly estimate the 3D pose of the hand and the 3D pose of the object. Our approach performs on par with state-of-the-art methods for 3D hand pose estimation, and outperforms state-of-the-art methods for joint hand-object pose estimation when using depth images only. Also, our approach is efficient, as our implementation runs in real-time on a single GPU.
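A minimal sketch of the feedback-loop idea, assuming three trained networks: a predictor for the initial pose, a synthesizer that renders a depth image from a pose hypothesis, and an updater that turns the discrepancy between input and synthesized images into a pose correction (the names and the fixed iteration count are assumptions of this sketch):

```python
import torch

def refine_pose(predictor, synthesizer, updater, depth, n_iters=3):
    """Iteratively refine a 3D hand pose estimate for one depth image."""
    pose = predictor(depth)                        # initial 3D joint estimate
    for _ in range(n_iters):
        synthesized = synthesizer(pose)            # render the current hypothesis
        pose = pose + updater(depth, synthesized)  # predict a corrective update
    return pose
```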