Abstract: In recent years, deep learning has achieved excellent performance in unimodal areas such as computer vision (CV) and natural language processing (NLP). As the technology has matured, the importance and necessity of multimodal learning have become increasingly apparent. Vision-language learning, a cornerstone of multimodal learning, has received extensive attention from researchers worldwide. Thanks to the development of the Transformer framework, an increasing number of pre-trained models have been applied to vision-language multimodal learning, and performance on related tasks has improved substantially. This study systematically reviews the current work on vision-language pre-trained models. First, the background knowledge of pre-trained models is introduced. Second, the architectures of pre-trained models are analyzed and compared from two perspectives. The commonly used vision-language pre-training techniques are then discussed, and five downstream tasks are elaborated. Finally, the common datasets used in image and video pre-training tasks are described, and the performance of commonly used pre-trained models on different datasets across different tasks is compared and analyzed.