提升隐式场景下短语视觉定位的因果建模方法

doi:10.13328/j.cnki.jos.007303

微信服务号

微信订阅号

2025年5月1日 20:40 星期四

首页 > 过刊浏览>年第卷第期 >1-16. DOI:10.13328/j.cnki.jos.007303

PDF HTML阅读 XML下载导出引用引用提醒

提升隐式场景下短语视觉定位的因果建模方法
DOI:
                        10.13328/j.cnki.jos.007303
                    
CSTR:
                        
                    
作者:
                        赵嘉宁赵嘉宁
苏州大学 计算机科学与技术学院, 江苏 苏州 215006
在期刊界中查找
在百度中查找
在本站中查找
王晶晶王晶晶
苏州大学 计算机科学与技术学院, 江苏 苏州 215006
在期刊界中查找
在百度中查找
在本站中查找
罗佳敏罗佳敏
苏州大学 计算机科学与技术学院, 江苏 苏州 215006
在期刊界中查找
在百度中查找
在本站中查找
周国栋周国栋
苏州大学 计算机科学与技术学院, 江苏 苏州 215006
在期刊界中查找
在百度中查找
在本站中查找

                    
作者单位:
作者简介:
通讯作者:
中图分类号:TP18
基金项目:国家自然科学基金(62006166, 62076175, 62076176); 江苏高校优势学科建设工程

Implicit-enhanced Causal Modeling Method for Phrasal Visual Grounding

Author:

ZHAO Jia-Ning
ZHAO Jia-Ning
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
在期刊界中查找
在百度中查找
在本站中查找
WANG Jing-Jing
WANG Jing-Jing
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
在期刊界中查找
在百度中查找
在本站中查找
LUO Jia-Min
LUO Jia-Min
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
在期刊界中查找
在百度中查找
在本站中查找
ZHOU Guo-Dong
ZHOU Guo-Dong
School of Computer Science and Technology, Soochow University, Suzhou 215006, China
在期刊界中查找
在百度中查找
在本站中查找

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献 [52]

相似文献

引证文献

资源附件

文章评论

摘要:

短语视觉定位是多模态研究中一个基础且重要的研究任务, 旨在预测细粒度的文本短语与图片区域的对齐关系. 尽管已有的短语视觉定位方法已经取得了不错的进展, 但都忽略了文本中的短语与其对应图片区域的隐式对齐关系(即隐式短语-区域对齐关系), 而预测这种关系可以有效评估模型理解深层多模态语义的能力. 因此, 为了有效建模隐式短语-区域对齐关系, 提出一种隐式增强的因果建模短语视觉定位方法. 该方法使用因果推理中的干预策略来缓解浅层语义所带来的混淆信息. 为评估模型理解深层多模态语义的能力, 标注一个高质量的隐式数据集, 并进行大量实验. 多组对比实验结果表明, 所提方法能够有效建模隐式短语-区域对齐关系. 此外, 在这个隐式数据集上, 所提方法的性能优于一些先进的多模态大语言模型, 这将进一步促进多模态大模型更多的面向隐式场景的研究.

关键词:隐式短语-区域对齐关系;因果推理;短语视觉定位

Abstract:

Phrasal visual grounding, a fundamental and critical research task in the field of multimodal studies, aims at predicting fine-grained alignment relationships between textual phrases and image regions. Despite the remarkable progress achieved by existing phrasal visual grounding approaches, they all ignore the implicit alignment relationships between textual phrases and their corresponding image regions, commonly referred to as implicit phrase-region alignment. Predicting such relationships can effectively evaluate the ability of models to understand deep multimodal semantics. Therefore, to effectively model implicit phrase-region alignment relationships, this study proposes an implicit-enhanced causal modeling (ICM) approach for phrasal visual grounding, which employs the intervention strategies of causal reasoning to mitigate the confusion caused by shallow semantics. To evaluate models’ ability to understand deep multimodal semantics, this study annotates a high-quality implicit dataset and conducts a large number of experiments. Multiple sets of comparative experimental results demonstrate the effectiveness of the proposed ICM approach in modeling implicit phrase-region alignment relationships. Furthermore, the proposed ICM approach outperforms some advanced multimodal large language models (MLLMs) on the implicit dataset, further promoting the research of MLLMs towards more implicit scenarios.

Key words:implicit phrase-region alignment;causal inference;phrasal visual grounding

参考文献

[1] 杜鹏飞, 李小勇, 高雅丽. 多模态视觉语言表征学习研究综述. 软件学报, 2021, 32(2): 327–348. http://www.jos.org.cn/1000-9825/6125.htm

Du PF, Li XY, Gao YL. Survey on multimodal visual language representation learning. Ruan Jian Xue Bao/Journal of Software, 2021, 32(2): 327–348 (in Chinese with English abstract). http://www.jos.org.cn/1000-9825/6125.htm

[2] Wang LW, Li Y, Huang J, Lazebnik S. Learning two-branch neural networks for image-text matching tasks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2019, 41(2): 394–407.

[3] Hossain MDZ, Sohel F, Shiratuddin MF, Laga H. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 2019, 51(6): 118.

[4] Datta R, Joshi D, Li J, Wang JZ. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR), 2008, 40(2): 5.

[5] Antol S, Agrawal A, Lu JS, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D. VQA: Visual question answering. In: Proc. of the 2015 IEEE Int’l Conf. on Computer Vision. Santiago: IEEE, 2015. 2425–2433. [doi: 10.1109/ICCV.2015.279]

[6] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable visual models from natural language supervision. In: Proc. of the 38th Int’l Conf. on Machine Learning. PMLR, 2021. 8748–8763.

[7] Kamath A, Singh M, LeCun Y, Synnaeve G, Misra I, Carion N. MDETR—Modulated detection for end-to-end multi-modal understanding. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on Computer Vision. Montreal: IEEE, 2021. 1760–1770.

[8] Li LH, Zhang PC, Zhang HT, Yang JW, Li CY, Zhong YW. Grounded language-image pre-training. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 10955–10965. [doi: 10.1109/CVPR52688.2022.01069]

[9] Pearl J, Mackenzie D. The Book of Why: The New Science of Cause and Effect. New York: Basic Books, Inc., 2018.

[10] Kazemzadeh S, Ordonez V, Matten M, Berg T. Referitgame: Referring to objects in photographs of natural scenes. In: Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP). Doha: Association for Computational Linguistics, 2014. 787–798. [doi: 10.3115/v1/D14-1086]

[11] Yang ZY, Gong BQ, Wang LW, Huang WB, Yu D, Luo JB. A fast and accurate one-stage approach to visual grounding. In: Proc. of the 2019 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Seoul: IEEE, 2019. 4682–4692. [doi: 10.1109/ICCV.2019.00478]

[12] Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. of the 28th Int’l Conf. on Neural Information Processing Systems. Montreal: MIT Press, 2015. 91–99.

[13] Yu LC, Lin Z, Shen XH, Yang JM, Lu X, Bansal M, Berg TL. MattNet: Modular attention network for referring expression comprehension. In: Proc. of the 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 1307–1315. [doi: 10.1109/CVPR.2018.00142]

[14] Zhuang BH, Wu Q, Shen CH, Reid I, van den Hengel A. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In: Proc. of the 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 4252–4261. [doi: 10.1109/CVPR.2018.00447]

[15] Yu Z, Yu J, Xiang C, Xiang CC, Zhao Z, Tian Q, Tao DC. Rethinking diversified and discriminative proposal generation for visual grounding. In: Proc. of the 27th Int’l Joint Conf. on Artificial Intelligence. Stockholm: AAAI Press, 2018. 1114–1120.

[16] Yang SB, Li GB, Yu YZ. Dynamic graph attention for referring expression comprehension. In: Proc. of the 2019 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Seoul: IEEE, 2019. 4643–4652. [doi: 10.1109/ICCV.2019.00474]

[17] Wang P, Wu Q, Cao JW, Shen CS, Gao LL, van den Hengel A. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Long Beach: IEEE, 2019. 1960–1968. [doi: 10.1109/CVPR.2019.00206]

[18] Yang SB, Li GB, Yu YZ. Relationship-embedded representation learning for grounding referring expressions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2020, 43(8): 2765–2779.

[19] Yang L, Xu Y, Yuan CF, Liu W, Li B, Hu WM. Improving visual grounding with visual-linguistic verification and iterative reasoning. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 9489–9498. [doi: 10.1109/CVPR52688.2022.00928]

[20] Ye JB, Tian JF, Yan M, Yang XS, Wang XW, Zhang J, He L, Lin X. Shifting more attention to visual backbone: Query-modulated refinement networks for end-to-end visual grounding. In: Proc. of the 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 15481–15491. [doi: 10.1109/CVPR52688.2022.01506]

[21] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv:1804.02767, 2018.

[22] Liao Y, Liu S, Li GB, Wang F, Chen YJ, Qian C, Li B. A real-time cross-modality correlation filtering method for referring expression comprehension. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020. 10877–10886. [doi: 10.1109/CVPR42600.2020.01089]

[23] Yang ZY, Chen TL, Wang LB, Luo JB. Improving one-stage visual grounding by recursive sub-query construction. In: Proc. of the 16th European Conf. on Computer Vision. Glasgow: Springer, 2020. 387–404. [doi: 10.1007/978-3-030-58568-6_23]

[24] Huang BB, Lian DZ, Luo WX, Gao SH. Look before you leap: Learning landmark features for one-stage visual grounding. In: Proc. of the 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021. 16883–16892.

[25] Liao Y, Zhang AY, Chen ZY, Hui TR, Liu S. Progressive language-customized visual feature learning for one-stage visual grounding. IEEE Trans. on Image Processing, 2022, 31: 4266–4277.

[26] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proc. of the 31st Int’l Conf. on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017. 6000–6010.

[27] Deng JJ, Yang ZY, Chen TL, Zhou WG, Li HQ. TransVG: End-to-end visual grounding with Transformers. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 1749–1759. [doi: 10.1109/ICCV48922.2021.00179]

[28] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Proc. of the 2019 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers. Minneapolis: Association for Computational Linguistics, 2018. 4171–4186. [doi: 10.18653/v1/N19-1423]

[29] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with Transformers. In: Proc. of the 16th European Conf. on Computer Vision. Glasgow: Springer, 2020. 213–229. [doi: 10.1007/978-3-030-58452-8_13]

[30] Tang KH, Niu YL, Huang JQ, Shi JX, Zhang HW. Unbiased scene graph generation from biased training. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 3713–3722. [doi: 10.1109/CVPR42600.2020.00377]

[31] Zhang D, Zhang HW, Tang JH, Hua XS, Sun QR. Causal intervention for weakly-supervised semantic segmentation. In: Proc. of the 34th Int’l Conf. on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2020. 56.

[32] Chen L, Yan X, Xiao J, Zhang HW, Pu SL, Zhuang YT. Counterfactual samples synthesizing for robust visual question answering. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 10797–10806.

[33] Yue ZQ, Zhang HW, Sun QR, Hua XS. Interventional few-shot learning. In: Proc. of the 34th Int’l Conf. on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2020. 2734–2746.

[34] Wang T, Huang JQ, Zhang HW, Sun QR. Visual commonsense R-CNN. In: Proc. of the 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 10757–10767. [doi: 10.1109/CVPR42600.2020.01077]

[35] Wang T, Zhou C, Sun QR, Zhang HW. Causal attention for unbiased visual recognition. In: Proc. of the 2021 IEEE/CVF Int’l Conf. on Computer Vision (ICCV). Montreal: IEEE, 2021. 3071–3080. [doi: 10.1109/ICCV48922.2021.00308]

[36] Huang JQ, Qin Y, Qi JX, Sun QR, Zhang HW. Deconfounded visual grounding. In: Proc. of the 36th AAAI Conf. on Artificial Intelligence. AAAI Press, 2022. 998–1006. [doi: 10.1609/aaai.v36i1.19983]

[37] Yang X, Zhang HW, Qi GJ, Cai JF. Causal attention for vision-language tasks. In: Proc. of the 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021. 9842–9852. [doi: 10.1109/CVPR46437.2021.00972]

[38] He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. In: Proc. of the 2016 IEEE Conf. on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016. 770–778. [doi: 10.1109/CVPR.2016.90]

[39] Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proc. of the 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP). Doha: ACL, 2014. 1532–1543. [doi: 10.3115/v1/D14-1162]

[40] Liu Y, Ott M, Goyal N, Du JF, Joshi M, Chen DQ, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: A robustly optimized bert pretraining approach. In: Proc. of the 8th Int’l Conf. on Learning Representations. Addis Ababa: ICLR, 2020. 1–15.

[41] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014, 15(1): 1929–1958.

[42] Xu K, Ba JL, Kiros R, Cho K, Courville A, Salakhutdinov R, Zemel RSZ, Bengio Y. Show, attend and tell: Neural image caption generation with visual attention. In: Proc. of the of the 32nd Int’l Conf. on Int’l Conf. on Machine Learning. Lille: JMLR.org, 2015. 2048–2057.

[43] Hartigan JA, Wong MA. Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 1979, 28(1): 100.

[44] Kuhn HW. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955, 2(1–2): 83–97.

[45] Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: A metric and a loss for bounding box regression. In: Proc. of the 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019. 658–666. [doi: 10.1109/CVPR.2019.00075]

[46] Plummer BA, Wang LW, Cervantes CM, Caicedo JC, Hockenmaier J, Lazebnik S. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proc. of the 2015 IEEE Int’l Conf. on Computer Vision. Santiago: IEEE, 2015. 2641–2649. [doi: 10.1109/ICCV.2015.303]

[47] Zhu DY, Chen J, Shen XQ, Li X, Elhoseiny M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In: Proc. of the 12th Int’l Conf. on Learning Representations. Vienna: ICLR, 2024.

[48] Liu HT, Li CY, Wu QY, Lee YL. Visual instruction tuning. In: Proc. of the 37th Int’l Conf. on Neural Information Processing Systems. New Orleans: Curran Associates Inc., 2023. 1516.

[49] Li JN, Li DX, Savarese S, Hoi S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proc. of the 40th Int’l Conf. on Machine Learning. Honolulu: JMLR.org, 2023. 814.

[50] Bang Y, Cahyawijaya S, Lee N, Dai WL, Su D, Wilie B, Lovenia H, Ji ZW, Yu TZ, Chung W, Do QV, Xu Y, Fung P. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In: Proc. of the 13th Int’l Joint Conf. on Natural Language Processing and the 3rd Conf. of the Asia-Pacific Chapter of the Association for Computational Linguistics (Vol. 1: Long Papers). Nusa Dua: Association for Computational Linguistics, 2023. 675–718. [doi: 10.18653/v1/2023.ijcnlp-main.45]

[51] Dong QX, Li L, Dai DM, Zheng C, Ma JY, Li R, XiaHM, Xu JJ, Wu ZY, Chang BB, Sun X, Li L, Sui ZF. A survey on in-context learning. In: Proc. of the 2024 Conf. on Empirical Methods in Natural Language Processing. Miami: Association for Computational Linguistics, 2022. 1107–1128. [doi: 10.18653/v1/2024.emnlp-main.64]

引用本文

赵嘉宁,王晶晶,罗佳敏,周国栋.提升隐式场景下短语视觉定位的因果建模方法.软件学报,,():1-16

复制

文章指标

点击次数:
下载次数:
HTML阅读次数:
引用次数:

历史

收稿日期:2023-11-01
最后修改日期:2024-07-09
录用日期:
在线发布日期: 2025-02-26
出版日期:

微信服务号

微信订阅号

引用本文

分享

文章指标

历史

文章二维码

微信服务号

微信订阅号

引用本文

分享

微信扫一扫：分享

文章指标

历史

文章二维码