Scale-guided Fusion Inference Network for Remote Sensing Visual Question Answering
Author: Zhao EY, Song N, Nie J, Wang X, Zheng CY, Wei ZQ

Abstract:

Remote sensing visual question answering (RSVQA) aims to extract scientific knowledge from remote sensing images through natural-language queries. In recent years, many methods have emerged to bridge the semantic gap between remote sensing visual information and natural language. However, most of these methods consider only the alignment and fusion of multimodal information; they neither deeply mine the multi-scale features of objects in remote sensing images together with their spatial location information nor model and reason over scale features, which leads to incomplete and inaccurate answer prediction. To address these issues, this study proposes a multi-scale-guided fusion inference network (MGFIN) that enhances the visual-spatial reasoning ability of RSVQA systems. First, a multi-scale visual representation module based on the Swin Transformer encodes multi-scale visual features embedded with spatial position information. Second, guided by language clues, a multi-scale relation reasoning module learns higher-order intra-group object relations across scales, using the scale space as a clue, and performs spatial hierarchical inference. Finally, an inference-based fusion module bridges the multimodal semantic gap: built on cross-attention, it employs training objectives such as self-supervised paradigms, contrastive learning, and image-text matching to adaptively align and fuse multimodal features and to assist in predicting the final answer. Experimental results show that the proposed model has significant advantages on two public RSVQA datasets.
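To make the three-stage pipeline concrete, below is a minimal PyTorch sketch of the data flow the abstract describes: per-scale visual encoding, language-guided multi-scale relation reasoning, and cross-attention fusion feeding an answer classifier. Everything here is an illustrative assumption (module names, dimensions, the mean-pooled language clue, and the linear projections standing in for the Swin Transformer backbone); it is not the authors' implementation, and the self-supervised contrastive and image-text matching objectives are only noted in comments.

```python
# Minimal sketch of an MGFIN-style pipeline, per the abstract.
# All names, shapes, and pooling choices are illustrative assumptions.
import torch
import torch.nn as nn


class MGFINSketch(nn.Module):
    def __init__(self, dim=768, num_scales=4, num_answers=100):
        super().__init__()
        # 1) Multi-scale visual representation: per-scale projections stand
        #    in for a Swin Transformer backbone whose feature maps already
        #    embed spatial position information.
        self.scale_proj = nn.ModuleList(
            nn.LazyLinear(dim) for _ in range(num_scales))
        # 2) Multi-scale relation reasoning: attention over object features
        #    within and across scales, conditioned on a language clue.
        self.relation_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        # 3) Inference-based fusion: question tokens attend to the reasoned
        #    visual features via cross-attention.
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.answer_head = nn.Linear(dim, num_answers)

    def forward(self, scale_feats, question_tokens):
        # scale_feats: list of (B, N_s, C_s) feature maps, one per scale.
        # question_tokens: (B, L, dim) encoded question.
        visual = torch.cat(
            [proj(f) for proj, f in zip(self.scale_proj, scale_feats)],
            dim=1)
        # Language clue (here simply the mean-pooled question) guides the
        # cross-scale relation reasoning.
        clue = question_tokens.mean(dim=1, keepdim=True)
        reasoned, _ = self.relation_attn(visual + clue, visual, visual)
        # Cross-attention fusion; during training, the abstract's
        # contrastive and image-text matching losses would be applied to
        # these aligned features (omitted in this sketch).
        fused, _ = self.cross_attn(question_tokens, reasoned, reasoned)
        return self.answer_head(fused.mean(dim=1))  # (B, num_answers)


# Toy usage with a Swin-like feature pyramid for a batch of 2 images.
model = MGFINSketch()
feats = [torch.randn(2, (8 // 2 ** s) ** 2, 96 * 2 ** s) for s in range(4)]
question = torch.randn(2, 12, 768)
logits = model(feats, question)  # shape: (2, 100)
```

The single mean-pooled clue vector is the simplest possible conditioning choice; the paper's language-guided reasoning over intra-group object relations is richer, but the sketch shows where that guidance enters the pipeline.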

Citation: Zhao EY, Song N, Nie J, Wang X, Zheng CY, Wei ZQ. Scale-guided fusion inference network for remote sensing visual question answering. Ruan Jian Xue Bao/Journal of Software, 2024, 35(5): 2133-2149 (in Chinese).

History
  • Received: April 10, 2023
  • Revised: June 08, 2023
  • Online: September 11, 2023
  • Published: May 06, 2024