Probabilistic Future Prediction for Video Scene Understanding
2020, ECCV (also available on arXiv)
https://doi.org/10.1007/978-3-030-58517-4_45

Abstract
We present a novel deep learning architecture for probabilistic future prediction from video. We predict the future semantics, geometry and motion of complex real-world urban scenes and use this representation to control an autonomous vehicle. This work is the first to jointly predict ego-motion, static scene, and the motion of dynamic agents in a probabilistic manner, which allows sampling consistent, highly probable futures from a compact latent space. Our model learns a representation from RGB video with a spatio-temporal convolutional module. The learned representation can be explicitly decoded to future semantic segmentation, depth, and optical flow, in addition to being an input to a learnt driving policy. To model the stochasticity of the future, we introduce a conditional variational approach which minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution, since the future is not observed.
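The conditional variational idea in the abstract can be made concrete with a short sketch: a "present" head predicts a Gaussian over a compact latent space from observed frames only, a "future" head does the same from observed plus future frames, and training pulls the two together with a KL term, while inference samples from the present distribution. The snippet below is an illustrative reading of that idea, not the authors' implementation: the module names (GaussianHead), feature sizes, and LATENT_DIM are assumptions, and the spatio-temporal encoder and the decoders for segmentation, depth, flow, and control are omitted.

```python
# Minimal sketch (assumed, not the paper's code) of the present/future
# conditional variational formulation described in the abstract.
import torch
import torch.nn as nn

LATENT_DIM = 32  # assumed size of the compact latent space


class GaussianHead(nn.Module):
    """Maps a feature vector to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, in_dim: int, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2 * latent_dim)

    def forward(self, features: torch.Tensor):
        mu, log_var = self.fc(features).chunk(2, dim=-1)
        return mu, log_var


def kl_divergence(mu_f, logvar_f, mu_p, logvar_p):
    """KL( N(mu_f, var_f) || N(mu_p, var_p) ) for diagonal Gaussians:
    the divergence between the future and present distributions."""
    var_f, var_p = logvar_f.exp(), logvar_p.exp()
    return 0.5 * (
        logvar_p - logvar_f + (var_f + (mu_f - mu_p) ** 2) / var_p - 1.0
    ).sum(dim=-1).mean()


# Toy features standing in for the spatio-temporal video representation.
present_features = torch.randn(8, 128)  # encodes observed frames only
future_features = torch.randn(8, 128)   # encodes observed + future frames (training only)

present_head = GaussianHead(in_dim=128)
future_head = GaussianHead(in_dim=128)

mu_p, logvar_p = present_head(present_features)
mu_f, logvar_f = future_head(future_features)

# Training: sample the latent from the future distribution (reparameterisation trick)
# to condition the future decoders, and minimise the present/future divergence.
z_train = mu_f + (0.5 * logvar_f).exp() * torch.randn_like(mu_f)
kl_loss = kl_divergence(mu_f, logvar_f, mu_p, logvar_p)

# Inference: the future is unobserved, so diverse futures are sampled
# from the present distribution alone.
z_sample = mu_p + (0.5 * logvar_p).exp() * torch.randn_like(mu_p)
print(kl_loss.item(), z_train.shape, z_sample.shape)
```

In this reading, the KL term is what lets the present distribution cover the plausible futures seen during training, so that sampling from it at test time yields consistent, highly probable futures.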