Improving Action Quality Assessment Using Weighted Aggregation

Pattern Recognition and Image Analysis

https://doi.org/10.1007/978-3-031-04881-4_46

Abstract

Action quality assessment (AQA) aims to automatically judge a human action from a video of that action and assign it a performance score. Most existing work on AQA divides RGB videos into short clips, transforms these clips into higher-level representations using Convolutional 3D (C3D) networks, and aggregates the clip-level representations by averaging. The aggregated representation is then used to perform AQA. We find that averaging is insufficient to capture the relative importance of clip-level features. In this work, we propose a learning-based weighted-averaging technique, which we call Weight-Decider (WD). Using this technique, better performance can be obtained without a large increase in computational cost. We also experiment with ResNets for learning better representations for action quality assessment, and we assess the effects of the depth and input clip size of the convolutional neural network on the quality of action score predictions. We achieve a new state-of-the-art Spearman's rank correlation of 0.9315 (an increase of 0.45%) on the MTL-AQA dataset using a 34-layer (2+1)D ResNet capable of processing 32-frame clips, with WD aggregation.
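The contrast between plain averaging and learned weighted averaging of clip-level features can be illustrated with a minimal sketch. The abstract does not describe the internal design of WD, so everything below is an assumption made for illustration: the scorer is a hypothetical single linear layer that produces one scalar per clip, a softmax across clips turns those scalars into weights, and the names `mean_aggregate`, `weighted_aggregate`, and `scorer_weights` are invented here. The feature values are toy numbers, not data from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_aggregate(clip_features):
    """Baseline: every clip contributes equally to the video-level feature."""
    n = len(clip_features)
    dim = len(clip_features[0])
    return [sum(f[d] for f in clip_features) / n for d in range(dim)]


def weighted_aggregate(clip_features, scorer_weights, scorer_bias=0.0):
    """Weighted average with per-clip weights from a hypothetical linear scorer.

    Assumption: one scalar relevance score per clip, normalized across clips
    with a softmax. In a real model, scorer_weights would be learned jointly
    with the score regressor; here they are passed in by hand.
    """
    scores = [
        sum(w * x for w, x in zip(scorer_weights, f)) + scorer_bias
        for f in clip_features
    ]
    weights = softmax(scores)
    dim = len(clip_features[0])
    pooled = [
        sum(wt * f[d] for wt, f in zip(weights, clip_features))
        for d in range(dim)
    ]
    return pooled, weights


# Three toy clip-level feature vectors (2-D for readability).
clips = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

mean_vec = mean_aggregate(clips)
# A scorer with all-zero weights gives uniform clip weights, i.e. plain averaging.
uniform_vec, uniform_w = weighted_aggregate(clips, [0.0, 0.0])
# A non-trivial scorer shifts weight toward the clips it scores highest.
skewed_vec, skewed_w = weighted_aggregate(clips, [1.0, 1.0])
```

With an all-zero scorer the weighted aggregate reduces to the plain average, which shows that averaging is the special case in which every clip is treated as equally important; any other scorer lets some clips dominate the video-level representation.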
