Survey on Multimodal Visual Language Representation Learning

Authors:

DU Peng-Fei (1985-), male, Ph.D. candidate. His main research interests include artificial intelligence, affective computing, and network security.
LI Xiao-Yong (1975-), male, Ph.D., professor, Ph.D. supervisor, and CCF senior member. His main research interests include network security and trustworthy service engineering.
GAO Ya-Li (1991-), female, Ph.D., and CCF professional member. Her main research interests include network security and trustworthy service engineering.

Corresponding author:

LI Xiao-Yong, E-mail: lxyxjtu@163.com

Fund Project:

National Natural Science Foundation of China (U1836215)

Affiliation (all authors):

Key Laboratory of Trustworthy Distributed Computing and Service (Beijing University of Posts and Telecommunications), Ministry of Education, Beijing 100876, China; School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China

Abstract:

The multimedia world we live in is built from a large amount of content in different modalities, and the information across modalities is highly correlated and complementary. The main purpose of multimodal representation learning is to mine the commonalities and the characteristics of the different modalities and to produce latent vectors that can represent multimodal information. This survey mainly reviews the research work on the currently widely used visual language representations, including traditional methods based on similarity models and the current mainstream pre-training methods based on language models. A currently effective line of ideas and solutions is to semanticize visual features and then fuse them with textual features through a powerful feature extractor to produce the representation; the Transformer, as the main feature extractor, has been applied to the various tasks of representation learning. The survey is organized from several angles: research background, categorization of the different research methods, evaluation methods, and future development trends.
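To make the approach described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of single-stream fusion: pre-extracted visual region features (assumed here to be 2048-dimensional detector outputs, e.g., from a Faster R-CNN) are projected into the same embedding space as the text tokens, and the concatenated sequence is fed to a Transformer encoder. All class names, dimensions, and hyper-parameters below are hypothetical choices for illustration, not the implementation of any specific surveyed model; positional and region-location embeddings are omitted for brevity.

```python
# Illustrative sketch only: single-stream vision-language fusion with a Transformer.
# Assumes region features are pre-extracted by an off-the-shelf detector; all
# dimensions are hypothetical, not values taken from the surveyed models.
import torch
import torch.nn as nn

class VisionLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=30522, region_dim=2048,
                 d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # text tokens -> embeddings
        self.region_proj = nn.Linear(region_dim, d_model)    # "semanticize" visual regions
        self.type_emb = nn.Embedding(2, d_model)              # 0 = text segment, 1 = visual segment
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, region_feats):
        # token_ids:    (batch, text_len)              integer word-piece ids
        # region_feats: (batch, n_regions, region_dim) detector features
        text = self.token_emb(token_ids) + self.type_emb(torch.zeros_like(token_ids))
        vis = self.region_proj(region_feats) + self.type_emb(
            torch.ones(region_feats.shape[:2], dtype=torch.long,
                       device=region_feats.device))
        joint = torch.cat([text, vis], dim=1)  # one sequence of text + visual "tokens"
        return self.encoder(joint)             # contextualized multimodal representation

# Usage with random inputs (illustration only):
model = VisionLanguageEncoder()
tokens = torch.randint(0, 30522, (2, 16))  # a batch of 2 captions, 16 word pieces each
regions = torch.randn(2, 36, 2048)         # 36 detected regions per image
out = model(tokens, regions)               # shape: (2, 16 + 36, 768)
```

This sketch only shows how the joint representation is produced; in the pre-training methods discussed in the survey, such an encoder is trained with objectives such as masked token prediction and image-text matching before being fine-tuned on downstream tasks.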

Cite this article:

DU Peng-Fei, LI Xiao-Yong, GAO Ya-Li. Survey on multimodal visual language representation learning. Journal of Software, 2021,32(2):327-348 (in Chinese).

History:
  • Received: 2020-05-11
  • Revised: 2020-06-26
  • Published online: 2020-09-10
  • Published: 2021-02-06