Multimodal Pre-training Method for Vision-language Understanding and Generation

doi:10.13328/j.cnki.jos.006770


Volume 34, Issue 5, 2023, pp. 2024-2034


Author:
LIU Tian-Yi, WU Zu-Xuan, CHEN Jing-Jing, JIANG Yu-Gang

Affiliation (all authors):
School of Computer Science, Fudan University, Shanghai 200438, China; Shanghai Key Laboratory of Intelligent Information Processing (Fudan University), Shanghai 200438, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing (Fudan University), Shanghai 200438, China


Abstract:

Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pre-training. Although they perform well on downstream understanding tasks such as visual question answering, image-text retrieval, and visual entailment, these methods lack generation capability. To address this problem, this study proposes UniVL, a unified multimodal pre-training method for vision-language understanding and generation. UniVL handles both understanding tasks and generation tasks. It extends existing pre-training paradigms by using random masks and causal masks simultaneously, where a causal mask is a triangular mask that hides future tokens, so that the pre-trained model acquires autoregressive generation ability. Moreover, several vision-language understanding tasks are recast as text generation tasks according to specifications, and a prompt-based method is employed to fine-tune the model on different downstream tasks. The experiments show that there is a trade-off between understanding tasks and generation tasks when the same model is used, and that a feasible way to improve both is to use more data. UniVL attains performance comparable to recent vision-language pre-training methods on both understanding tasks and generation tasks. Moreover, the prompt-based generation method is more effective and even outperforms discriminative methods in few-shot scenarios.
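
The two masking schemes named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `causal_mask` and `random_token_mask`, the `[MASK]` placeholder, and the default masking rate of 0.15 are assumptions chosen for illustration, and the abstract does not state how UniVL combines the two masks during pre-training.

```python
import random


def causal_mask(n):
    """Triangular attention mask for a sequence of length n.

    Entry [i][j] is 1 if position i may attend to position j (j <= i)
    and 0 if j is a future position that must be hidden.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def random_token_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style random masking (illustrative assumption, not UniVL's recipe).

    Each token is independently replaced by mask_token with probability
    mask_prob. Returns the masked sequence and the masked positions.
    """
    rng = random.Random(seed)
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions


if __name__ == "__main__":
    # Causal mask: each row shows which positions a token can see.
    for row in causal_mask(4):
        print(row)
    # Random mask: a hypothetical caption with some tokens hidden.
    caption = "a dog runs on the grass".split()
    print(random_token_mask(caption, mask_prob=0.3))
```

With the causal mask, a token at position i sees only positions 0 through i, which is what allows left-to-right generation at inference time; the random mask instead hides scattered tokens and lets the model use context on both sides to recover them.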

Key words: computer vision; multimodal learning; pre-training

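Citation:

LIU Tian-Yi, WU Zu-Xuan, CHEN Jing-Jing, JIANG Yu-Gang. Multimodal Pre-training Method for Vision-language Understanding and Generation. Journal of Software (软件学报), 2023, 34(5): 2024-2034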


History

Received: April 17, 2022
Revised: May 29, 2022
Online: September 20, 2022
Published: May 06, 2023

