Papers by Karel Lebeda

Action recognition " in the wild " is extremely challenging, particularly when complex 3D actions... more Action recognition " in the wild " is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial depth sensors, makes it possible to overcome this. However, there is little work examining the best way to exploit this new modality. In this paper we introduce the Hollywood 3D benchmark , which is the first dataset containing " in the wild " action footage including 3D data. This dataset consists of 650 stereo video clips across 14 action classes, taken from Hollywood movies. We provide stereo calibrations and depth reconstructions for each clip. We also provide an action recognition pipeline, and propose a number of specialised depth-aware techniques including five interest point detectors and three feature descriptors. Extensive tests allow evaluation of different appearance and depth encoding schemes. Our novel techniques exploiting this depth allow us to reach performance levels more than triple those of the best baseline algorithm using only appearance information. The benchmark data, code and calibrations are all made available to the community .

We present a framework which allows standard stereo reconstruction to be unified with a wide range of classic top-down cues from urban scene understanding. The resulting algorithm is analogous to the human visual system, where conflicting interpretations of the scene due to ambiguous data can be resolved based on a higher-level understanding of urban environments. The cues reformulated within the framework include: recognising common arrangements of surface normals and semantic edges (e.g. concave, convex and occlusion boundaries), recognising connected or coplanar structures such as walls, and recognising collinear edges (which are common on repetitive structures such as windows). Recognition of these common configurations has only recently become feasible, thanks to the emergence of large-scale reconstruction datasets. To demonstrate the importance and generality of scene understanding during stereo reconstruction, the proposed approach is integrated with 3 different state-of-the-art techniques for bottom-up stereo reconstruction. The use of high-level cues is shown to improve performance by up to 15% on the Middlebury 2014 and KITTI datasets. We further evaluate the technique using the recently proposed HCI stereo metrics, finding significant improvements in the quality of depth discontinuities, planar surfaces and thin structures.
Figure 1: An illustration of the different components unified within the proposed framework: (a) input data, (b) appearance matching, (c) scene reasoning, (d) output reconstruction, including both bottom-up matching and a top-down understanding of an outdoor (top) and indoor (bottom) urban scene.

Proceedings, European Conference on Computer Vision (ECCV) Visual Object Tracking Challenge Workshop, Sep 2014
The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment less dependent on the hardware and (iv) the VOT2014 evaluation toolkit that significantly speeds up execution of experiments. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.

Lecture Notes in Computer Science, 2015
The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attributes, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment less dependent on the hardware and (iv) the VOT2014 evaluation toolkit that significantly speeds up execution of experiments. The dataset, the evaluation kit as well as the results are publicly available at the challenge website.
The Visual Object Tracking VOT2014 Challenge Results
Proceedings of the Asian Conference on Computer Vision (ACCV)
We propose a novel approach to tracking objects by low-level line correspondences. In our implementation we show that this approach is usable even when tracking objects lacking texture, exploiting situations where feature-based trackers fail due to the aperture problem. Furthermore, we suggest an approach to failure detection and recovery to maintain long-term stability. This is achieved by remembering configurations which lead to good pose estimations and using them later for tracking corrections.

In this thesis, the problem of robust estimation of multiple view geometry in computer vision is studied. The main focus is on random sampling techniques for the estimation of two-view geometries, in particular homography and epipolar geometry, in the presence of outliers. After a thorough analysis of LO-RANSAC, several improvements are proposed to make it more robust to the selection of the inlier/outlier error threshold and to the number of points. The result is a new estimator which is faster, more accurate and more robust than the state of the art. The improvements were implemented in the framework of CMP WBS-Demo and extensively tuned and experimentally tested on diverse data, using a newly created testing framework. The LO-RANSAC implementation for homography and epipolar geometry estimation has been separated from the rest of WBS-Demo and is now publicly available. The datasets were made available as well, including new manually annotated ground-truth point correspondences.
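The random-sampling principle behind these estimators can be illustrated on a toy problem. The sketch below is a hypothetical minimal example, not the WBS-Demo code: it fits a 2D line rather than a homography or epipolar geometry, but exposes the same inlier/outlier error threshold whose selection the thesis aims to make the estimator robust to.

```python
import random

def ransac_line(points, threshold, iterations=200, rng=None):
    """Minimal RANSAC sketch: fit y = a*x + b to points containing outliers.
    `threshold` decides which points count as inliers -- the sensitivity
    the thesis aims to reduce."""
    rng = rng or random.Random(0)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # Draw a minimal sample (two points define a line).
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot fit a non-vertical line
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Score the hypothesis by counting points within the error threshold.
        inliers = [(x, y) for x, y in points
                   if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```

A local-optimisation step, as in LO-RANSAC, would additionally refine each new best model by least squares on its inliers before accepting it.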

Proceedings of the International Conference on Computer Vision (ICCV), Dec 2015
Although 3D reconstruction from a monocular video has been an active area of research for a long time, and the resulting models offer great realism and accuracy, strong conditions must typically be met when capturing the video to make this possible. This prevents general reconstruction of moving objects in dynamic, uncontrolled scenes. In this paper, we address this issue. We present a novel algorithm for modelling 3D shapes from unstructured, unconstrained, discontinuous footage. The technique is robust against distractors in the scene, background clutter and even shot cuts. We show reconstructed models of objects which could not be modelled by conventional Structure from Motion methods without additional input. Finally, we present results of our reconstruction in the presence of shot cuts, showing the strength of our technique at modelling from existing footage.

Proceedings of the International Conference on Computer Vision (ICCV), Dec 2015
Causal relationships can often be found in visual object tracking between the motion of the camera and that of the tracked object. This object motion may be an effect of the camera motion, e.g. an unsteady handheld camera. But it may also be the cause, e.g. the cameraman framing the object. In this paper we explore these relationships, and provide statistical tools to detect and quantify them; these are based on transfer entropy and stem from information theory. The relationships are then exploited to make predictions about the object location. The approach is shown to be an excellent measure for describing such relationships. On the VOT2013 dataset the prediction accuracy is increased by 62% over the best non-causal predictor. We show that the location predictions are robust to camera shake and sudden motion, which is invaluable for any tracking algorithm, and demonstrate this by applying causal prediction to two state-of-the-art trackers. Both of them benefit, Struck gaining a 7% accuracy and 22% robustness increase on the VTB1.1 benchmark, becoming the new state of the art.
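As a rough illustration of the information-theoretic machinery involved: transfer entropy from a series X to a series Y measures how much knowing X's past reduces uncertainty about Y's next value, beyond what Y's own past already tells us. The sketch below is a simplified hypothetical version with discrete states and history length 1, computed from empirical counts; it is not the paper's estimator, which works on continuous motion data.

```python
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Empirical transfer entropy TE(X -> Y) in bits, history length 1.
    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) )
    where y1 = y[t+1], y0 = y[t], x0 = x[t]."""
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))  # counts of (y1, y0, x0)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))        # counts of (y0, x0)
    pairs_yy = Counter(zip(y[1:], y[:-1]))         # counts of (y1, y0)
    hist_y = Counter(y[:-1])                       # counts of y0
    n = len(x) - 1
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_joint = c / n
        p_cond_full = c / pairs_yx[(y0, x0)]            # p(y1 | y0, x0)
        p_cond_self = pairs_yy[(y1, y0)] / hist_y[y0]   # p(y1 | y0)
        te += p_joint * log2(p_cond_full / p_cond_self)
    return te
```

For two motion series where y simply copies x with a one-frame lag, TE(x→y) comes out large while TE(y→x) stays near zero, matching the intuition that x drives y.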

Long-term tracking of an object, given only a single instance in an initial frame, remains an open problem. We propose a visual tracking algorithm, robust to many of the difficulties which often occur in real-world scenes. Correspondences of edge-based features are used to overcome the reliance on the texture of the tracked object and improve invariance to lighting. Furthermore, we address long-term stability, enabling the tracker to recover from drift and to provide redetection following object disappearance or occlusion. The two-module principle is similar to the successful state-of-the-art long-term TLD tracker; however, our approach offers better performance in benchmarks and extends to cases of low-textured objects. This becomes obvious in cases of plain objects with no texture at all, where the edge-based approach proves the most beneficial.
We perform several different experiments to validate the proposed method. Firstly, results on short-term sequences show the performance of tracking challenging (low-textured and/or transparent) objects which represent failure cases for competing state-of-the-art approaches. Secondly, long sequences are tracked, including one of almost 30 000 frames, which to our knowledge is the longest tracking sequence reported to date. This tests the redetection and drift-resistance properties of the tracker. Finally, we report results of the proposed tracker on the VOT Challenge 2013 and 2014 datasets as well as on the VTB1.0 benchmark, and we show relative performance of the tracker compared to its competitors. All the results are comparable to the state-of-the-art on sequences with textured objects and superior on non-textured objects. The new annotated sequences are made publicly available.

Proc. of the ICCV workshop on Visual Object Tracking, 2013
Visual tracking has attracted significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems has made it almost impossible to follow the developments in the field.

Proceedings of the Asian Conference on Computer Vision (ACCV)
In this paper, we address the problem of tracking an unknown object in 3D space. Online 2D tracking often fails for strong out-of-plane rotation, which results in considerable changes in appearance beyond those that can be represented by online update strategies. However, by modelling and learning the 3D structure of the object explicitly, such effects are mitigated. To address this, a novel approach is presented, combining techniques from the fields of visual tracking, structure from motion (SfM) and simultaneous localisation and mapping (SLAM). This algorithm is referred to as TMAGIC (Tracking, Modelling And Gaussian-process Inference Combined). At every frame, point and line features are tracked in the image plane and are used, together with their 3D correspondences, to estimate the camera pose. These features are also used to model the 3D shape of the object as a Gaussian process. Tracking determines the trajectories of the object in both the image plane and 3D space, but the approach also provides the 3D object shape. The approach is validated on several video-sequences used in the tracking literature, comparing favourably to state-of-the-art trackers for simple scenes (error reduced by 22%) with clear advantages in the case of strong out-of-plane rotation, where 2D approaches fail (error reduction of 58%).

Proceedings of the European Conference on Computer VIsion (ECCV)
We investigate the recognition of actions "in the wild" using 3D motion information. The lack of control over (and knowledge of) the camera configuration exacerbates this already challenging task, by introducing systematic projective inconsistencies between 3D motion fields, hugely increasing intra-class variance. By introducing a robust, sequence-based, stereo calibration technique, we reduce these inconsistencies from fully projective to a simple similarity transform. We then introduce motion encoding techniques which provide the necessary scale invariance, along with additional invariances to changes in camera viewpoint. On the recent Hollywood 3D natural action recognition dataset, we show improvements of 40% over previous state-of-the-art techniques based on implicit motion encoding. We also demonstrate that our robust sequence calibration simplifies the task of recognising actions, leading to recognition rates 2.5 times those for the same technique without calibration. In addition, the sequence calibrations are made available.
Proceedings of the ICCV workshop on Visual Object Tracking Challenge (ICCVW VOT)
Long-term tracking of an object, given only a single instance in an initial frame, remains an open problem. We propose a visual tracking algorithm, robust to many of the difficulties which often occur in real-world scenes. Correspondences of edge-based features are used to overcome the reliance on the texture of the tracked object and improve invariance to lighting. Furthermore, we address long-term stability, enabling the tracker to recover from drift and to provide redetection following object disappearance or occlusion. The two-module principle is similar to the successful state-of-the-art long-term TLD tracker; however, our approach extends to cases of low-textured objects.
Proceedings of the British Machine Vision Conference (BMVC)
The paper revisits the problem of local optimization for RANSAC. Improvements of the LO-RANSAC procedure are proposed: the use of a truncated quadratic cost function and the introduction of a limit on the number of inliers used for the least-squares computation; several implementation issues are also addressed. The implementation is made publicly available.
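The truncated quadratic (MSAC-style) cost mentioned above can be sketched as follows: unlike plain inlier counting, it rewards models whose inliers fit tightly, while every outlier pays the same bounded penalty. This is an illustrative reimplementation of the general idea, not the authors' released code.

```python
def truncated_quadratic_cost(errors, threshold):
    """Truncated quadratic cost over per-point residuals: each residual
    contributes its squared error, capped at threshold^2 so that outliers
    cannot dominate the score. Lower cost means a better model."""
    t2 = threshold * threshold
    return sum(min(e * e, t2) for e in errors)
```

Two models with the same inlier count can then still be ranked: the one with smaller residuals on its inliers wins, which is what makes this cost useful inside the local-optimisation step.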
Thesis Chapters by Karel Lebeda

Visual tracking of unknown objects in unconstrained video-sequences is extremely
challenging due to a number of unsolved issues. This thesis explores several of these
and examines possible approaches to tackle them.
The unconstrained nature of real-world input sequences creates huge variation in
the appearance of the target object due to changes in pose and lighting. Additionally,
the object can be occluded by either parts of itself, other elements of the scene, or
the frame boundaries. Observations may also be corrupted due to low resolution,
motion blur, large frame-to-frame displacement, or incorrect exposure or focus of the
camera. Finally, some objects are inherently difficult to track due to their (low) texture,
specular/transparent nature, non-rigid deformations, etc.
Conventional trackers depend heavily on the texture of the target. This causes issues
with transparent or untextured objects. Edge points can be used in cases where
standard feature points are scarce; these however suffer from the aperture problem. To
address this, the first contribution of this thesis explores the idea of virtual corners,
using pairs of non-adjacent line correspondences, tangent to edges in the image. Furthermore,
the chapter investigates the possibility of long-term tracking, introducing a
re-detection scheme to handle occlusions while limiting drift of the object model. The
outcome of this research is an edge-based tracker, able to track in scenarios including
untextured objects, full occlusions and significant length. The tracker, besides reporting
excellent results in standard benchmarks, is demonstrated to successfully track the
longest sequence published to date.
Some of the issues in visual tracking are caused by suboptimal utilisation of the
image information. The object of interest can easily occupy as little as ten or even one
percent of the video frame area. This causes difficulties in challenging scenarios such
as sudden camera shakes or full occlusions. To improve tracking in such cases, the next
major contribution of this thesis explores relationships within the context of visual
tracking, with a focus on causality. These include causal links between the tracked
object and other elements of the scene such as the camera motion or other objects.
Properties of such relationships are identified in a framework based on information
theory. The resulting technique can be employed as a causality-based motion model to
improve the results of virtually any tracker.
Significant effort has previously been devoted to rapid learning of object properties
on the fly. However, state-of-the-art approaches still often fail in cases such as
rapid out-of-plane rotations, when the appearance changes suddenly. One of the major
contributions of this thesis is a radical rethinking of the traditional wisdom of modelling
3D motion as appearance change. Instead, 3D motion is modelled as 3D motion.
This intuitive but previously unexplored approach provides new possibilities in visual
tracking research.
Firstly, 3D tracking is more general, as large out-of-plane motion is often fatal for 2D
trackers, but helps 3D trackers to build better models. Secondly, the tracker's internal
model of the object can be used in many different applications and it could even become
the main motivation, with tracking supporting reconstruction rather than vice versa.
This effectively bridges the gap between visual tracking and Structure from Motion.
The proposed method is capable of successfully tracking sequences with extreme out-of-plane
rotation, which poses a considerable challenge to 2D trackers. This is done by creating
realistic 3D models of the targets, which then aid in tracking.
In the majority of the thesis, the assumption is made that the target's 3D shape is
rigid. This is, however, a relatively strong limitation. In the final chapter, tracking and
dense modelling of non-rigid targets is explored, demonstrating results in even more
generic (and therefore challenging) scenarios. This final advancement truly generalises
the tracking problem with support for long-term tracking of low texture and non-rigid
objects in sequences with camera shake, shot cuts and significant rotation.
Taken together, these contributions address some of the major sources of failure
in visual tracking. The presented research advances the field of visual tracking, facilitating
tracking in scenarios which were previously infeasible. Excellent results are
demonstrated in these challenging scenarios. Finally, this thesis demonstrates that 3D
reconstruction and visual tracking can be used together to tackle difficult tasks.