Abstract: Multimodal information extraction is the task of extracting structured knowledge from unstructured or semi-structured multimodal data, such as text and images. It comprises three subtasks: multimodal named entity recognition, multimodal relation extraction, and multimodal event extraction. This study analyzes these subtasks and summarizes the component they share, namely a multimodal representation and fusion module. It then reviews the commonly used datasets and mainstream research methods for each subtask. Finally, it outlines research trends in multimodal information extraction and analyzes the open problems and challenges in the field, providing a reference for future research.