LOTR: Face Landmark Localization Using Localization Transformer
2022, IEEE Access
https://doi.org/10.1109/ACCESS.2022.3149380
Abstract
This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in a feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any postprocessing steps. This paper also introduces a loss function named smooth-Wing loss, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches. On the WFLW dataset, the proposed LOTR framework demonstrates promising results compared with several state-of-the-art methods. Additionally, we report an improvement in the performance of state-of-the-art face recognition systems when using our proposed LOTRs for face alignment.
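The gradient discontinuity mentioned in the abstract can be seen in the original Wing loss of Feng et al. The sketch below implements that published Wing loss for a scalar residual and evaluates its derivative on either side of the two points where it jumps, x = 0 and |x| = w. It does not implement the smooth-Wing loss, whose formula the abstract does not give. The function names and the parameter values w = 10 and eps = 2 are illustrative assumptions.

```python
import math


def wing_loss(x, w=10.0, eps=2.0):
    """Wing loss (Feng et al.) for a scalar residual x.

    w and eps are illustrative defaults, not values taken from the abstract.
    """
    ax = abs(x)
    if ax < w:
        return w * math.log(1.0 + ax / eps)
    # The constant c joins the logarithmic and linear pieces at |x| = w.
    c = w - w * math.log(1.0 + w / eps)
    return ax - c


def wing_grad(x, w=10.0, eps=2.0):
    """Derivative of wing_loss with respect to x (returns 0 at x == 0)."""
    if x == 0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    ax = abs(x)
    if ax < w:
        return sign * w / (eps + ax)
    return sign


if __name__ == "__main__":
    tiny = 1e-9
    # Around x = 0 the derivative flips between -w/eps and +w/eps.
    print("grad just left of 0: ", wing_grad(-tiny))
    print("grad just right of 0:", wing_grad(tiny))
    # Around |x| = w the derivative jumps from w/(eps + w) to 1.
    print("grad just below w:   ", wing_grad(10.0 - tiny))
    print("grad at w:           ", wing_grad(10.0))
```

The loss value itself is continuous at both points; only its derivative jumps. This is the behaviour a smoothed variant would need to remove.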