End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    In recent years, Transformer-based pre-trained models have demonstrated powerful capabilities of modality representation, which leads to a shift towards a fully end-to-end paradigm for multimodal downstream tasks, such as image captioning tasks, and enables better performance and faster inference speed of models. However, the grid visual features extracted with such pre-trained models lack regional visual information, which results in inaccurate descriptions of the object content. Thus, the applicability of pre-trained models in image captioning remains largely unexplored. Therefore, this study proposes a novel end-to-end image captioning method based on visual region aggregation and dual-level collaboration (VRADC). Specifically, to learn regional visual information, this study designs a visual region aggregation module that aggregates grid features with similar semantics to obtain a compact visual region representation. Next, the dual-level collaboration module uses the cross-attention mechanism to learn more representative semantic information from the two visual features, which guides the model to generate more fine-grained image captions. The experimental results on the MSCOCO dataset and Flickr30k dataset show that the proposed VRADC-based method can significantly improve the quality of image captioning and achieves state-of-the-art performance.

    Reference
    Related
    Cited by
Get Citation

宋井宽,曾鹏鹏,顾嘉扬,朱晋宽,高联丽.基于视觉区域聚合与双向协作的端到端图像描述生成.软件学报,2023,34(5):2152-2169

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:April 18,2022
  • Revised:May 29,2022
  • Adopted:
  • Online: September 20,2022
  • Published: May 06,2023
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063