Action Anticipation from Multimodal Data
2019, Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
https://doi.org/10.5220/0007379001540161

Abstract
The idea of multi-sensor data fusion is to combine data coming from different sensors to provide more accurate and complementary information for solving a specific task. Our goal is to build a shared representation of data coming from different domains, such as images, audio signals, heart rate and acceleration, in order to anticipate the daily activities of a user wearing multimodal sensors. To this aim, we consider the Stanford-ECM Dataset, which contains synchronized data acquired with different sensors: video, acceleration and heart rate signals. The dataset is adapted to our action prediction task by identifying the transitions from the generic "Unknown" class to a specific "Activity". We discuss and compare a Siamese Network with a Multi-Layer Perceptron and a 1D CNN, where the input is an unknown observation and the output is the next activity to be observed. The feature representations obtained with the considered deep architectures are classified with SVM or KNN classifiers. Experimental results point out that prediction from multimodal data is a feasible task and suggest that multimodality improves both classification and prediction. Nevertheless, the task of reliably predicting the next action is still open and requires further investigation, as well as the availability of multimodal datasets specifically built for prediction purposes.
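To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation: per-modality encoders are fused into a shared Siamese-style embedding trained with a contrastive loss, and the resulting embeddings are classified with KNN to predict the upcoming activity. All feature dimensions, layer sizes, the margin and the number of activity classes are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of a multimodal Siamese embedding + KNN next-activity classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.neighbors import KNeighborsClassifier

class MultimodalEmbedding(nn.Module):
    """Encodes video, acceleration and heart-rate features into one shared space."""
    def __init__(self, video_dim=1024, accel_dim=64, hr_dim=8, emb_dim=128):
        super().__init__()
        self.video = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU())
        self.accel = nn.Sequential(nn.Linear(accel_dim, 64), nn.ReLU())
        self.hr = nn.Sequential(nn.Linear(hr_dim, 16), nn.ReLU())
        self.shared = nn.Linear(256 + 64 + 16, emb_dim)

    def forward(self, video, accel, hr):
        fused = torch.cat([self.video(video), self.accel(accel), self.hr(hr)], dim=1)
        return F.normalize(self.shared(fused), dim=1)

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Hadsell-style contrastive loss: pulls together pairs of "Unknown" segments
    followed by the same activity, pushes apart pairs followed by different ones."""
    d = F.pairwise_distance(emb_a, emb_b)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# Toy usage with random tensors standing in for "Unknown" observations.
model = MultimodalEmbedding()
v, a, h = torch.randn(32, 1024), torch.randn(32, 64), torch.randn(32, 8)
v2, a2, h2 = torch.randn(32, 1024), torch.randn(32, 64), torch.randn(32, 8)
same = (torch.rand(32) > 0.5).float()        # 1 if both segments precede the same activity
loss = contrastive_loss(model(v, a, h), model(v2, a2, h2), same)
loss.backward()

# After training, the learned embeddings are fed to a KNN (or SVM) classifier
# that predicts the label of the next activity.
with torch.no_grad():
    train_emb = model(v, a, h).numpy()
labels = torch.randint(0, 23, (32,)).numpy() # placeholder activity labels
knn = KNeighborsClassifier(n_neighbors=5).fit(train_emb, labels)
pred_next_activity = knn.predict(train_emb[:4])
```

The same embedding network can stand in for the MLP or 1D CNN variants mentioned above by swapping the per-modality encoders; the classification stage on top of the learned features stays unchanged.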