Abstract: Remote sensing visual question answering (RSVQA) aims to extract scientific knowledge from remote sensing images. In recent years, many methods have emerged to bridge the semantic gap between remote sensing visual information and natural language. However, most of these methods consider only the alignment and fusion of multimodal information. They neglect the multi-scale features of objects in remote sensing images and the spatial location information those features carry, and they do little to model or reason over scale, which leads to incomplete and inaccurate answer prediction. To address these issues, this study proposes a multi-scale-guided fusion inference network (MGFIN) that aims to enhance the visual spatial reasoning ability of RSVQA systems. First, a multi-scale visual representation module based on Swin Transformer encodes multi-scale visual features embedded with spatial position information. Second, guided by language clues, a multi-scale relation reasoning module learns cross-scale, higher-order, intra-group object relations, using scale space as a clue, and performs spatial hierarchical inference. Finally, an inference-based fusion module bridges the multimodal semantic gap: building on cross-attention, it uses training objectives such as self-supervised paradigms, contrastive learning, and image-text matching to adaptively align and fuse multimodal features and to assist in predicting the final answer. Experimental results show that the proposed model has significant advantages on two public RSVQA datasets.
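
As an illustration of the contrastive learning objective mentioned among the training goals, the following minimal sketch computes a symmetric image-text contrastive (InfoNCE-style) loss over a batch of matched pairs. It is not the MGFIN implementation: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions chosen for illustration, and the abstract does not specify the exact form of the loss.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i of images and texts is the positive.

    Hypothetical helper for illustration; temperature is an assumed value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature.
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should place its mass on column i.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should place its mass on row j.
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy embeddings: matched pairs point the same way, swapped pairs do not.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image embedding matches its paired text embedding and grows when the pairs are mismatched, which is the pressure that pulls the two modalities into a shared space before fusion.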