End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration

doi:10.13328/j.cnki.jos.006773

Journal of Software, Volume 34, Issue 5, 2023, pp. 2152-2169

Author:
SONG Jing-Kuan, ZENG Peng-Peng, GU Jia-Yang, ZHU Jin-Kuan, GAO Lian-Li

Affiliation:
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract:

In recent years, Transformer-based pre-trained models have demonstrated strong modality representation capabilities. This has shifted multimodal downstream tasks such as image captioning towards a fully end-to-end paradigm, bringing better performance and faster inference. However, the grid visual features extracted by such pre-trained models lack regional visual information, which leads to inaccurate descriptions of object content. As a result, the applicability of pre-trained models to image captioning remains largely unexplored. To address this, this study proposes a novel end-to-end image captioning method based on visual region aggregation and dual-level collaboration (VRADC). Specifically, to learn regional visual information, a visual region aggregation module aggregates grid features with similar semantics into a compact visual region representation. A dual-level collaboration module then uses the cross-attention mechanism to learn more representative semantic information from the two kinds of visual features (grid and region), which guides the model to generate more fine-grained image captions. Experimental results on the MSCOCO and Flickr30k datasets show that the proposed VRADC method significantly improves the quality of the generated captions and achieves state-of-the-art performance.
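
The abstract names two components, region aggregation and cross-attention between grid and region features, but does not say how either is implemented. The following is only a minimal sketch of the general idea under loud assumptions: grouping by cosine similarity (a spherical k-means stand-in) replaces whatever aggregation the paper actually uses, the attention is single-head with no learned projections, and every function name (`aggregate_regions`, `cross_attention`, `dual_level_collaborate`) is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_regions(grid, num_regions, iters=10):
    """ASSUMPTION: cosine-similarity clustering stands in for the paper's
    aggregation module. Returns one mean vector per non-empty group."""
    if not 0 < num_regions <= len(grid):
        raise ValueError("num_regions must be in 1..len(grid)")
    unit = [normalize(g) for g in grid]
    # Deterministic initialisation: evenly spaced grid cells as centroids.
    step = len(grid) // num_regions
    centroids = [unit[i * step] for i in range(num_regions)]
    assign = [0] * len(grid)
    for _ in range(iters):
        # Assign each grid cell to its most similar centroid.
        assign = [max(range(num_regions), key=lambda k: dot(u, centroids[k]))
                  for u in unit]
        # Recompute each centroid as the normalised mean of its members.
        for k in range(num_regions):
            members = [unit[i] for i, a in enumerate(assign) if a == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)
    regions = []
    for k in range(num_regions):
        members = [grid[i] for i, a in enumerate(assign) if a == k]
        if members:
            regions.append([sum(col) / len(members) for col in zip(*members)])
    return regions, assign


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention; keys double as values.
    ASSUMPTION: no learned query/key/value projections."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        outputs.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                        for d in range(len(q))])
    return outputs


def dual_level_collaborate(grid, regions):
    """Each level attends to the other; the result is added residually."""
    grid_ctx = cross_attention(grid, regions)
    region_ctx = cross_attention(regions, grid)
    new_grid = [[a + b for a, b in zip(g, c)] for g, c in zip(grid, grid_ctx)]
    new_regions = [[a + b for a, b in zip(r, c)]
                   for r, c in zip(regions, region_ctx)]
    return new_grid, new_regions


# Toy input: four 2-d "grid" features forming two obvious semantic groups.
grid = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
regions, assign = aggregate_regions(grid, num_regions=2)
new_grid, new_regions = dual_level_collaborate(grid, regions)
```

On this toy input the four grid vectors collapse into two region vectors, and each level is then refined with context from the other. A real model would use learned projections, multiple heads, and features from a pre-trained visual backbone, none of which is reproduced here.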

Key words: image captioning; end-to-end training; pre-trained model; visual region aggregation; dual-level collaboration

Get Citation

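SONG Jing-Kuan, ZENG Peng-Peng, GU Jia-Yang, ZHU Jin-Kuan, GAO Lian-Li. End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration. Journal of Software, 2023, 34(5): 2152-2169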


History

Received: April 18, 2022
Revised: May 29, 2022
Adopted:
Online: September 20, 2022
Published: May 06, 2023
