Sub-pixel face landmarks using heatmaps and a bag of tricks
2021, ArXiv
Abstract
Accurate face landmark localization is an essential part of face recognition, reconstruction and morphing. To accurately localize face landmarks, we present our heatmap regression approach. Each model consists of a MobileNetV2 backbone followed by several upscaling layers, with different tricks to optimize both performance and inference cost. We use five naïve face landmarks from a publicly available face detector to position and align the face instead of using the bounding box like traditional methods. Moreover, we show by adding random rotation, displacement and scaling—after alignment—that the model is more sensitive to the face position than orientation. We also show that it is possible to reduce the upscaling complexity by using a mixture of deconvolution and pixel-shuffle layers without impeding localization performance. We present our state-of-the-art face landmark localization model (ranking second on The 2nd Grand Challenge of 106-Point Facial Landmark Localization validati...
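To make the heatmap idea concrete, the sketch below shows one common way to read a sub-pixel landmark position from a single predicted heatmap: find the peak, then refine it with an intensity-weighted centroid over a small window. This is an illustrative assumption, not necessarily the decoding used by the authors. The function name `decode_subpixel` and the `radius` parameter are hypothetical.

```python
from typing import List, Tuple


def decode_subpixel(heatmap: List[List[float]], radius: int = 1) -> Tuple[float, float]:
    """Return (x, y) of the heatmap peak, refined to sub-pixel precision.

    Assumption: refinement is an intensity-weighted centroid over a
    (2 * radius + 1)^2 window around the argmax; other schemes exist.
    """
    height, width = len(heatmap), len(heatmap[0])

    # 1) Integer peak location (argmax over all pixels).
    peak_y, peak_x = max(
        ((y, x) for y in range(height) for x in range(width)),
        key=lambda yx: heatmap[yx[0]][yx[1]],
    )

    # 2) Weighted centroid over the window, clipped to the heatmap bounds.
    total = sum_x = sum_y = 0.0
    for y in range(max(0, peak_y - radius), min(height, peak_y + radius + 1)):
        for x in range(max(0, peak_x - radius), min(width, peak_x + radius + 1)):
            weight = max(heatmap[y][x], 0.0)  # ignore negative responses
            total += weight
            sum_x += weight * x
            sum_y += weight * y

    if total == 0.0:  # flat or all-negative heatmap: fall back to the peak
        return float(peak_x), float(peak_y)
    return sum_x / total, sum_y / total
```

The returned coordinates are in heatmap pixels. To map them back to the aligned input crop, multiply by the ratio of the input size to the heatmap size.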
Key takeaways
- The proposed model ranks second in the JD-landmark-2 challenge, utilizing a MobileNetV2 backbone.
- The heatmap regression approach positions and aligns faces using five landmarks from a publicly available face detector, instead of the bounding box used by traditional methods.
- Adding random rotation, displacement and scaling after alignment shows that the model is more sensitive to face position than to orientation.
- Mixing deconvolution and pixel-shuffle layers reduces upscaling complexity (FLOPs) without impeding localization performance.
- Facial landmark adjustments improve face recognition performance by 0.07% to 0.44% on specific benchmarks.
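The pixel-shuffle operation mentioned in the takeaways can be sketched in plain Python. It rearranges a tensor of shape (C·r², H, W) into (C, H·r, W·r), so the upscaling step itself has no learned weights and performs no arithmetic. The channel ordering below follows the convention used by common deep learning frameworks. The function name `pixel_shuffle` and the nested-list tensor layout are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [channel][row][col]


def pixel_shuffle(x: Tensor3D, r: int) -> Tensor3D:
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r).

    Output pixel (c, h * r + i, w * r + j) is taken from input
    channel c * r * r + i * r + j at position (h, w).
    """
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    channels_out = channels_in // (r * r)

    out = [
        [[0.0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out
```

In a network, the layer before the shuffle is typically a convolution that produces the C·r² channels at low resolution. That convolution is where the compute cost lies, whereas a deconvolution (transposed convolution) applies learned filters while upscaling.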