Survey on Vision-language Pre-training

doi:10.13328/j.cnki.jos.006774

Journal of Software, 2023, Vol. 34, No. 5, pp. 2000-2023

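Author biographies:
Yin Jiong (殷炯, b. 1999), male, master's student, CCF student member. Research interests: multimodal learning and vision-language pre-training.
Zhang Zhedong (张哲东, b. 2000), male, master's student. Research interests: multimedia intelligence and information fusion.
Gao Yuhan (高宇涵, b. 1997), female, master's degree. Research interests: deep learning and medical image processing.
Yang Zhiwen (杨智文, b. 1998), male, master's student. Research interests: depth estimation and depth completion.
Li Liang (李亮, b. 1986), male, Ph.D., associate researcher, CCF senior member. Research interests: multimedia content analysis and cross-media intelligence.
Xiao Mang (肖芒, b. 1976), male, Ph.D., professor. Research interests: etiology of head and neck tumors and microvascular reconstruction of head and neck defects.
Sun Yaoqi (孙垚棋, b. 1993), male, Ph.D. student. Research interests: computer vision and graphics, and multimedia information processing.
Yan Chenggang (颜成钢, b. 1984), male, Ph.D., professor, doctoral supervisor. Research interests: intelligent information processing.
Corresponding author: Xiao Mang (肖芒), joelxm@zju.edu.cn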
Funding: National Key Research and Development Program of China (2020YFB1406604); National Natural Science Foundation of China (61931008, 62071415, U21B2024)



Abstract:

In recent years, deep learning has achieved excellent performance in unimodal fields such as computer vision (CV) and natural language processing (NLP). As the technology has developed, the importance and necessity of multimodal learning have gradually become apparent. Vision-language learning, an important part of multimodal learning, has received extensive attention from researchers in China and abroad. Thanks to the development of the Transformer framework, more and more pre-trained models have been applied to vision-language multimodal learning, and performance on related tasks has improved qualitatively. This survey systematically reviews current work on vision-language pre-trained models. First, it introduces background knowledge on pre-trained models. Second, it analyzes and compares the structures of pre-trained models from two perspectives, discusses commonly used vision-language pre-training techniques, and describes five categories of downstream pre-training tasks in detail. Finally, it introduces the datasets commonly used for image and video pre-training tasks, and compares and analyzes the performance of commonly used pre-trained models on different datasets under different tasks.


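Cite this article:

Yin Jiong, Zhang Zhedong, Gao Yuhan, Yang Zhiwen, Li Liang, Xiao Mang, Sun Yaoqi, Yan Chenggang. Survey on Vision-language Pre-training. Journal of Software (软件学报), 2023, 34(5): 2000-2023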


History

Received: 2022-04-18
Last revised: 2022-05-29
Published online: 2022-09-20
Publication date: 2023-05-06
