Visual-language Multimodal Pre-training Based on Multi-entity Alignment

Authors: 李登, 武阿明, 韩亚洪

CLC number: TP18

Funding: National Natural Science Foundation of China (62376186, 61932009)

Abstract:

Visual-language pre-training (VLP) aims to learn powerful multimodal representations from large-scale image-text datasets. Multimodal feature fusion and alignment are key challenges in training such models. Most existing visual-language pre-training models handle fusion and alignment by feeding the extracted visual and text features directly into a Transformer and fusing them through its attention module. Because the attention mechanism computes only pairwise similarities, this approach has difficulty aligning multiple entities at once. In contrast, the hyperedges of a hypergraph neural network connect multiple entities and encode high-order entity correlations, which makes it possible to establish relationships among multiple entities. This study therefore proposes a visual-language multimodal pre-training method based on multi-entity alignment with hypergraph neural networks. The method introduces a hypergraph neural network learning module into the Transformer multimodal fusion encoder to learn the alignment relationships among multimodal entities, thereby enhancing the entity alignment ability of the fusion encoder in the pre-trained model. The proposed model is pre-trained on large-scale image-text datasets and fine-tuned on multiple visual-language downstream tasks, including visual question answering, image-text retrieval, visual grounding, and natural language visual reasoning. Experimental results show that the proposed method outperforms the baseline on multiple downstream tasks; in particular, it improves accuracy on the NLVR2 task by 1.8%.
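To make the mechanism concrete, below is a minimal sketch of how an HGNN-style hypergraph convolution could sit alongside self-attention inside a multimodal fusion block. This is an illustration, not the authors' published code: the module names, the incidence matrix h, and the placement after self-attention are all assumptions.

    import torch
    import torch.nn as nn

    class HypergraphConv(nn.Module):
        # One HGNN-style propagation step:
        #   X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta
        # Each hyperedge joins several visual and text tokens at once,
        # encoding the high-order correlations that pairwise attention
        # cannot express directly.
        def __init__(self, dim):
            super().__init__()
            self.theta = nn.Linear(dim, dim)

        def forward(self, x, h):
            # x: (num_tokens, dim) concatenated visual+text token features
            # h: (num_tokens, num_edges) hard or soft incidence matrix
            dv = h.sum(dim=1).clamp(min=1e-6)    # node degrees
            de = h.sum(dim=0).clamp(min=1e-6)    # hyperedge degrees
            x = x / dv.sqrt().unsqueeze(-1)
            x = (h.t() @ x) / de.unsqueeze(-1)   # gather tokens into hyperedges
            x = h @ x                            # scatter hyperedges back to tokens
            x = x / dv.sqrt().unsqueeze(-1)
            return self.theta(x)

    class FusionBlock(nn.Module):
        # Transformer fusion block with a hypergraph branch: attention
        # models pairwise alignment; the hypergraph convolution models
        # multi-entity alignment.
        def __init__(self, dim, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.hconv = HypergraphConv(dim)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, tokens, h):
            # tokens: (1, num_tokens, dim) visual and text tokens concatenated
            a, _ = self.attn(tokens, tokens, tokens)
            x = self.norm1(tokens + a)
            g = self.hconv(x.squeeze(0), h).unsqueeze(0)
            return self.norm2(x + g)

How the incidence matrix is built (e.g., by clustering token features, or as a learned soft assignment trained end to end) is left open here; the paper's actual construction may differ.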

Cite this article:

李登, 武阿明, 韩亚洪. Visual-language multimodal pre-training based on multi-entity alignment. 软件学报 (Journal of Software): 1-16.

History:
  • Received: 2023-07-07
  • Revised: 2024-01-25
  • Published online: 2025-02-26