Multimodal Pre-training Method for Vision-language Understanding and Generation
Author biographies:

LIU Tian-Yi (1998-), male, master's student, main research interests: computer vision; WU Zu-Xuan (1991-), male, PhD, associate researcher, CCF professional member, main research interests: computer vision and deep learning; CHEN Jing-Jing (1990-), female, PhD, associate researcher, CCF professional member, main research interests: multimedia content analysis, computer vision, and robust and trustworthy artificial intelligence; JIANG Yu-Gang (1981-), male, PhD, professor, doctoral supervisor, CCF professional member, main research interests: multimedia information processing, computer vision, and robust and trustworthy artificial intelligence

Corresponding authors:

WU Zu-Xuan, zxwu@fudan.edu.cn; JIANG Yu-Gang, ygj@fudan.edu.cn

Funding:

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0112805); Young Scientists Fund of the National Natural Science Foundation of China (62102092)


Author:
  • LIU Tian-Yi
  • WU Zu-Xuan
  • CHEN Jing-Jing
  • JIANG Yu-Gang

Affiliation (all authors): School of Computer Science, Fudan University, Shanghai 200438, China; Shanghai Key Laboratory of Intelligent Information Processing (Fudan University), Shanghai 200438, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing (Fudan University), Shanghai 200438, China

    Abstract:

    Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like loss functions (masked language modeling and image-text matching) during pre-training. Although they perform well on many understanding-type downstream tasks, such as visual question answering, image-text retrieval, and visual entailment, these methods cannot generate text. To tackle this problem, this study proposes unified multimodal pre-training for vision-language understanding and generation (UniVL). The proposed UniVL is capable of handling both understanding tasks and generation tasks. It extends the existing pre-training paradigm by using random masks and causal masks simultaneously, where a causal mask is a triangular mask that hides future tokens, so that the pre-trained model acquires autoregressive generation ability. Moreover, several vision-language understanding tasks are reformulated as text generation tasks, and prompt-based methods are employed to fine-tune the model on different downstream tasks. The experiments show that there is a trade-off between understanding tasks and generation tasks when the same model is used, and that a feasible way to improve both is to use more data. The proposed UniVL framework attains performance comparable to that of recent vision-language pre-training methods on both understanding and generation tasks. Moreover, the prompt-based generation method is more effective and even outperforms discriminative methods in few-shot scenarios.
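
    To make the masking strategy concrete, the sketch below (in PyTorch) shows how a random token mask for masked language modeling and a lower-triangular causal mask for autoregressive generation can be constructed, and how an understanding task such as visual question answering can be phrased as a generation prompt. This is an illustrative assumption about the general technique; build_masks, vqa_as_generation, and the prompt template are hypothetical names and are not taken from the UniVL implementation.

    import torch

    def build_masks(seq_len: int, mask_prob: float = 0.15):
        # Random mask (BERT-style): each token position is selected for
        # masked language modeling with probability mask_prob.
        random_mask = torch.rand(seq_len) < mask_prob  # shape: (seq_len,)

        # Causal mask: a lower-triangular matrix in which position i may attend
        # only to positions j <= i, hiding future tokens so that the model can
        # be trained for autoregressive generation.
        causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        return random_mask, causal_mask

    def vqa_as_generation(question: str) -> str:
        # Hypothetical prompt template that casts visual question answering as
        # text generation: the model continues the prompt with answer tokens
        # instead of scoring a fixed label set.
        return f"question: {question} answer:"

    random_mask, causal_mask = build_masks(seq_len=6)
    print(random_mask)        # e.g. tensor([False,  True, False, False, False,  True])
    print(causal_mask.int())  # 6x6 lower-triangular matrix of ones
    print(vqa_as_generation("what color is the cat?"))

    Under this reading, the same Transformer can be run with the random mask for the BERT-like understanding objectives and with the causal mask for the autoregressive generation objective.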

Cite this article

LIU Tian-Yi, WU Zu-Xuan, CHEN Jing-Jing, JIANG Yu-Gang. Multimodal pre-training method for vision-language understanding and generation. Journal of Software, 2023, 34(5): 2024-2034 (in Chinese).

History
  • Received: 2022-04-17
  • Revised: 2022-05-29
  • Published online: 2022-09-20
  • Published: 2023-05-06