A survey of evolution of image captioning techniques

2018, International Journal of Hybrid Intelligent Systems

Abstract

Automatic captioning of images has been explored extensively over the past 10 to 15 years. It is one of the fundamental problems at the intersection of Computer Vision and Natural Language Processing and has a vast array of real-world applications. In this survey, we study the different approaches used to generate image captions in chronological order, starting from basic template-based caption generation models and moving to Neural Networks combined with external world knowledge. We review existing models in detail, highlighting their methodologies and the improvements made to them over time. We give an overview of the standard image datasets and of the evaluation measures developed to judge the quality of generated captions. Apart from the basic benchmarks, we also note speed and accuracy improvements across the different approaches. Finally, we discuss further possibilities in automatic image caption generation.
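To make the evaluation discussion concrete, the short sketch below shows how a generated caption is typically scored against human reference captions with sentence-level BLEU, one of the standard measures of caption quality. It uses NLTK's BLEU implementation; the example captions and the smoothing choice are illustrative assumptions rather than material from the paper.

    # Scoring a candidate caption against reference captions with BLEU,
    # a machine-translation metric commonly reused for image captioning.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Hypothetical human-written reference captions, tokenized into words.
    references = [
        "a dog runs across the sandy beach".split(),
        "a brown dog is running on the beach".split(),
    ]
    # Hypothetical caption produced by a captioning model.
    candidate = "a dog is running on the beach".split()

    # Smoothing avoids zero scores when a higher-order n-gram has no match.
    smooth = SmoothingFunction().method1
    score = sentence_bleu(
        references,
        candidate,
        weights=(0.25, 0.25, 0.25, 0.25),  # uniform weights over 1- to 4-grams
        smoothing_function=smooth,
    )
    print(f"BLEU-4: {score:.3f}")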
