End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration
Author: Song Jingkuan, Zeng Pengpeng, Gu Jiayang, Zhu Jinkuan, Gao Lianli

Abstract:

In recent years, Transformer-based pre-trained models have demonstrated powerful modality-representation capabilities, prompting a shift towards a fully end-to-end paradigm for multimodal downstream tasks such as image captioning and enabling better performance and faster inference. However, the grid visual features extracted with such pre-trained models lack regional visual information, which leads to inaccurate descriptions of object content. As a result, the potential of pre-trained models for image captioning remains largely under-exploited. To address this issue, this study proposes a novel end-to-end image captioning method based on visual region aggregation and dual-level collaboration (VRADC). Specifically, to learn regional visual information, a visual region aggregation module is designed that aggregates grid features with similar semantics into a compact visual region representation. A dual-level collaboration module then uses cross-attention to learn more representative semantic information from the two kinds of visual features, which guides the model to generate finer-grained image captions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed VRADC-based method significantly improves the quality of image captioning and achieves state-of-the-art performance.
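No code accompanies this abstract page, so the following is only a minimal PyTorch-style sketch of the two ideas the abstract describes, not the authors' implementation. All names (RegionAggregator, DualLevelCollaboration, num_regions) and the dimensions in the usage example are assumptions for illustration: grid features are softly assigned to a small number of semantic slots and pooled into compact region-level features, and caption-side states then cross-attend to the grid-level and region-level features separately before the two views are fused by a learned linear layer.

    # Illustrative sketch only (assumed module names and shapes), not the paper's released code.
    import torch
    import torch.nn as nn

    class RegionAggregator(nn.Module):
        """Softly assigns each grid feature to one of `num_regions` semantic slots
        and pools the grid features into compact region representations."""
        def __init__(self, dim: int, num_regions: int = 16):
            super().__init__()
            self.assign = nn.Linear(dim, num_regions)  # per-grid assignment logits

        def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
            # grid_feats: (B, N_grid, D)
            weights = self.assign(grid_feats).softmax(dim=1)          # (B, N_grid, K)
            regions = torch.einsum('bnk,bnd->bkd', weights, grid_feats)
            return regions                                            # (B, K, D)

    class DualLevelCollaboration(nn.Module):
        """Lets caption-side queries cross-attend to grid features and region
        features separately, then fuses the two context vectors linearly."""
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn_grid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.attn_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.fuse = nn.Linear(2 * dim, dim)

        def forward(self, queries, grid_feats, region_feats):
            ctx_g, _ = self.attn_grid(queries, grid_feats, grid_feats)
            ctx_r, _ = self.attn_region(queries, region_feats, region_feats)
            return self.fuse(torch.cat([ctx_g, ctx_r], dim=-1))       # (B, T, D)

    if __name__ == '__main__':
        B, N, T, D = 2, 144, 20, 512     # e.g. a 12x12 feature grid from a ViT/Swin backbone
        grid = torch.randn(B, N, D)      # grid-level visual features
        words = torch.randn(B, T, D)     # decoder-side token states
        regions = RegionAggregator(D)(grid)
        fused = DualLevelCollaboration(D)(words, grid, regions)
        print(fused.shape)               # torch.Size([2, 20, 512])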

Get Citation

Song JK, Zeng PP, Gu JY, Zhu JK, Gao LL. End-to-end image captioning via visual region aggregation and dual-level collaboration. Ruan Jian Xue Bao/Journal of Software, 2023, 34(5): 2152–2169 (in Chinese with English abstract).
History
  • Received: April 18, 2022
  • Revised: May 29, 2022
  • Online: September 20, 2022
  • Published: May 06, 2023