Abstract: Phrasal visual grounding, a fundamental and critical task in multimodal research, aims to predict fine-grained alignment relationships between textual phrases and image regions. Despite the remarkable progress of existing phrasal visual grounding approaches, they all overlook the implicit alignment relationships between textual phrases and their corresponding image regions, referred to as implicit phrase-region alignment. Predicting such relationships effectively evaluates a model's ability to understand deep multimodal semantics. To model implicit phrase-region alignment, this study proposes an implicit-enhanced causal modeling (ICM) approach for phrasal visual grounding, which employs the intervention strategies of causal reasoning to mitigate the confounding effects of shallow semantics. To evaluate models' ability to understand deep multimodal semantics, this study annotates a high-quality implicit dataset and conducts extensive experiments. Multiple sets of comparative results demonstrate the effectiveness of the proposed ICM approach in modeling implicit phrase-region alignment relationships. Furthermore, the proposed ICM approach outperforms several advanced multimodal large language models (MLLMs) on the implicit dataset, further promoting MLLM research toward more implicit scenarios.