Survey on Multimodal Information Extraction Research (多模态信息抽取研究综述)

Authors: Wang YS (王永胜), Li PF (李培峰), Wang ZQ (王中卿), Zhu QM (朱巧明)

CLC Number: TP18

Fund Project: National Natural Science Foundation of China (62276177, 61836007); Priority Academic Program Development of Jiangsu Higher Education Institutions

Abstract:

Multimodal information extraction is the task of extracting structured knowledge from unstructured or semi-structured multimodal data, such as text paired with images. It comprises three subtasks: multimodal named entity recognition, multimodal relation extraction, and multimodal event extraction. This survey first analyzes the multimodal information extraction task and then summarizes the component common to all three subtasks, namely the multimodal representation and fusion module. It then sorts out the commonly used datasets and mainstream research methods for each subtask. Finally, it outlines the research trends in multimodal information extraction and analyzes the open problems and challenges in the field, providing a reference for future research.
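
By way of illustration, the Python sketch below shows the kind of structured output each of the three subtasks produces for a single text-image post. The post, the entity types, the relation label, and the event schema are all hypothetical stand-ins chosen for this sketch, not drawn from any specific dataset discussed in the survey.

    # One hypothetical social-media post with an attached image.
    post = {
        "text": "Messi lifts the trophy in Qatar",
        "image": "post_image.jpg",  # placeholder file name
    }

    # Multimodal named entity recognition (MNER): entity spans and types;
    # the image can help disambiguate the type of an ambiguous mention.
    mner_output = [
        {"span": "Messi", "type": "PER"},
        {"span": "Qatar", "type": "LOC"},
    ]

    # Multimodal relation extraction (MRE): a relation between an entity
    # pair, supported by visual evidence (label here is illustrative).
    mre_output = [
        {"head": "Messi", "tail": "Qatar", "relation": "present_in"},
    ]

    # Multimodal event extraction (MEE): event type, textual trigger, and
    # argument roles (schema here is illustrative).
    mee_output = {
        "event_type": "Ceremony",
        "trigger": "lifts",
        "arguments": [
            {"role": "Agent", "filler": "Messi"},
            {"role": "Place", "filler": "Qatar"},
        ],
    }

    print(mner_output, mre_output, mee_output, sep="\n")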

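Many of the surveyed methods realize the shared representation and fusion module as some form of cross-modal attention between textual token features and visual region features. The PyTorch sketch below is a minimal, generic variant under that assumption; it is not the implementation of any particular cited model, and random tensors stand in for the outputs of the text and image encoders (e.g., BERT token embeddings and projected CNN/ViT region features).

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        """Fuse text tokens with image regions via cross-modal attention."""
        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            # Text tokens act as queries; image regions as keys/values.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
            # text_feats:  (batch, num_tokens, dim)
            # image_feats: (batch, num_regions, dim)
            attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
            # Residual connection preserves the original textual signal.
            return self.norm(text_feats + attended)

    fusion = CrossModalFusion()
    text = torch.randn(2, 16, 768)   # 2 posts, 16 tokens each
    image = torch.randn(2, 49, 768)  # a 7x7 grid of visual regions per image
    fused = fusion(text, image)      # (2, 16, 768), fed to a tagging/classification head
    print(fused.shape)
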
Cite this article:

Wang YS, Li PF, Wang ZQ, Zhu QM. Survey on multimodal information extraction research. Ruan Jian Xue Bao/Journal of Software: 1–27 (in Chinese with English abstract).

History
  • Received: 2023-09-13
  • Revised: 2024-02-25
  • Published online: 2024-12-09