Visual-language Multimodal Pre-training Based on Multi-entity Alignment

doi:10.13328/j.cnki.jos.007321


Authors:
LI Deng
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
WU A-Ming
School of Electronic Engineering, Xidian University, Xi’an 710401, China
HAN Ya-Hong
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China

CLC Number: TP18


Abstract:

Visual-language pre-training (VLP) aims to obtain a powerful multimodal representation by learning on large-scale image-text multimodal datasets. Multimodal feature fusion and alignment is a key challenge in multimodal model training. Most existing visual-language pre-training models address this problem by feeding the extracted visual features and text features directly into a Transformer. Because the attention mechanism in the Transformer computes similarity between pairs of elements, it is difficult to achieve alignment among multiple entities. The hyperedges of hypergraph neural networks, by contrast, can connect multiple entities and encode high-order entity correlations, which makes it possible to establish relationships among multiple entities. This study therefore proposes a visual-language multimodal pre-training method based on multi-entity alignment with hypergraph neural networks. In this method, a hypergraph neural network learning module is introduced into the Transformer multimodal fusion encoder to learn the alignment relationships among multimodal entities, thereby enhancing the entity alignment ability of the multimodal fusion encoder in the pre-training model. The proposed model is pre-trained on large-scale image-text datasets and fine-tuned on multiple visual-language downstream tasks: visual question answering, image-text retrieval, visual grounding, and natural language visual reasoning. The experimental results indicate that, compared with the baseline method, the proposed method improves performance on multiple downstream tasks, including a 1.8% improvement in accuracy on the NLVR² task.

Key words: visual-language pre-training (VLP); hypergraph neural network; multi-entity alignment; attention mechanism; multi-modal understanding
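
To illustrate the contrast drawn in the abstract between pairwise attention and hyperedges that join several entities at once, the following is a minimal sketch of one hypergraph propagation step. It is not the authors' module: the abstract does not give their formulation, so this sketch assumes the commonly used normalized form D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X with unit hyperedge weights and no learnable parameters. The function name `hypergraph_conv` and the toy data are hypothetical.

```python
import math


def hypergraph_conv(X, H):
    """One parameter-free hypergraph propagation step (illustrative only).

    X: list of n node feature vectors, each a list of d floats.
    H: n x m incidence matrix; H[i][e] == 1 if node i belongs to hyperedge e.
    Returns D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X as a list of lists.
    """
    n, m, d = len(X), len(H[0]), len(X[0])
    node_deg = [sum(H[i]) for i in range(n)]
    edge_deg = [sum(H[i][e] for i in range(n)) for e in range(m)]
    # D_v^{-1/2}; nodes that belong to no hyperedge get a zero scale.
    scale = [1.0 / math.sqrt(k) if k > 0 else 0.0 for k in node_deg]

    # Step 1: every hyperedge averages the scaled features of ALL its members,
    # so three or more entities are mixed in a single operation.
    edge_feat = []
    for e in range(m):
        acc = [0.0] * d
        for i in range(n):
            if H[i][e]:
                for k in range(d):
                    acc[k] += scale[i] * X[i][k]
        edge_feat.append([v / edge_deg[e] if edge_deg[e] else 0.0 for v in acc])

    # Step 2: every node gathers the features of the hyperedges it belongs to.
    out = []
    for i in range(n):
        acc = [0.0] * d
        for e in range(m):
            if H[i][e]:
                for k in range(d):
                    acc[k] += edge_feat[e][k]
        out.append([scale[i] * v for v in acc])
    return out


# Toy example: nodes 0 and 1 stand for image regions, node 2 for a word token;
# one hyperedge links all three. Node 3 belongs to no hyperedge.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
H = [[1], [1], [1], [0]]
out = hypergraph_conv(X, H)
```

In this toy case the three linked nodes end up with the same feature vector, the mean of their inputs, while the unlinked node receives nothing. A pairwise attention score, by comparison, relates only two elements at a time.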

Citation:

LI Deng, WU A-Ming, HAN Ya-Hong. Visual-language Multimodal Pre-training Based on Multi-entity Alignment. Journal of Software, ,():1-16


History

Received: July 7, 2023
Revised: January 25, 2024
Online: February 26, 2025

