Scale-guided Fusion Inference Network for Remote Sensing Visual Question Answering

doi:10.13328/j.cnki.jos.007025

2024, 35(5): 2133-2149. DOI: 10.13328/j.cnki.jos.007025

Publication date: 2023-09-11

[Abstract]

Remote sensing visual question answering (RSVQA) aims to extract scientific knowledge from remote sensing images. In recent years, many methods have emerged to bridge the semantic gap between remote sensing visual information and natural language. However, most of these methods consider only the alignment and fusion of multimodal information. They neither mine the multi-scale features of objects in remote sensing images and their spatial location information in depth, nor model and reason about scale features, so their answer predictions are incomplete and inaccurate. To address these issues, this study proposes a multi-scale guided fusion inference network (MGFIN) that aims to enhance the visual spatial reasoning ability of RSVQA systems. First, a multi-scale visual representation module based on Swin Transformer encodes multi-scale visual features embedded with spatial position information. Second, guided by language clues, a multi-scale relation reasoning module uses scale space as a clue to learn higher-order intra-group object relations across multiple scales and performs spatial hierarchical reasoning. Finally, an inference-based fusion module bridges the multimodal semantic gap: building on cross-attention, it uses training objectives such as a self-supervised paradigm, contrastive learning, and image-text matching to adaptively align and fuse multimodal features and to assist in predicting the final answer. Experimental results show that the proposed model has significant advantages on two public RSVQA datasets.
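
The abstract names cross-attention as the basis of the fusion module but gives no implementation details. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in which a question embedding attends over visual tokens gathered from several scales. It is not the MGFIN implementation: the function names, the toy vectors, and the three scale labels are all hypothetical, and the learned query/key/value projections that a real model would use are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (attended_vector, attention_weights).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return attended, weights


# Hypothetical toy inputs (assumptions, not from the paper): one question
# embedding and visual tokens from three scales, flattened into one list.
question = [1.0, 0.0, 0.0, 0.0]
visual_tokens = {
    "coarse": [[0.9, 0.1, 0.0, 0.0]],
    "medium": [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    "fine": [[0.0, 0.0, 0.0, 1.0]],
}
tokens = [t for scale_tokens in visual_tokens.values() for t in scale_tokens]

# The visual tokens serve as both keys and values (no learned projections).
fused, weights = cross_attention(question, tokens, tokens)
```

In this toy setup the coarse-scale token is the one most similar to the question vector, so it receives the largest attention weight and contributes most to `fused`. A trained model would learn which scale to emphasize for each question rather than relying on hand-picked vectors.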

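[Funding]

National Key Research and Development Program of China (2021YFF0704000); National Natural Science Foundation of China (62172376); Regional Innovation and Development Joint Fund of the National Natural Science Foundation of China (U22A2068); Special Fund for Central Government Guiding Local Science and Technology Development (YDZX2022028)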
