TP18
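National Natural Science Foundation of China (62006166, 62076175, 62076176); Priority Academic Program Development of Jiangsu Higher Education Institutions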
Phrasal visual grounding is a fundamental and important task in multimodal research. It aims to predict fine-grained alignment relationships between textual phrases and image regions. Although existing phrasal visual grounding approaches have made good progress, they overlook the implicit alignment between a phrase in the text and its corresponding image region (implicit phrase-region alignment). Predicting this relationship is an effective way to evaluate how well a model understands deep multimodal semantics. To model implicit phrase-region alignment, this study proposes an implicit-enhanced causal modeling (ICM) approach for phrasal visual grounding, which uses the intervention strategy from causal reasoning to mitigate the confounding information introduced by shallow semantics. To evaluate models' ability to understand deep multimodal semantics, this study annotates a high-quality implicit dataset and conducts extensive experiments. Results from multiple sets of comparative experiments show that the proposed ICM approach models implicit phrase-region alignment effectively. In addition, on this implicit dataset the ICM approach outperforms several advanced multimodal large language models (MLLMs), which should encourage further research on MLLMs in implicit scenarios.
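Zhao Jianing, Wang Jingjing, Luo Jiamin, Zhou Guodong. Causal modeling approach for improving phrasal visual grounding in implicit scenarios. Journal of Software,,():1-16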