Dual Adaptive Redundancy Elimination for Training-free Video Question Answering
CSTR:
Author:
Affiliation:
Corresponding author:
CLC number: TP37
Fund projects: National Natural Science Foundation of China (62562035, 62441203, 62311530101, 62461028); Key Research and Development Program of Jiangxi Province (20252BCE310034); Natural Science Foundation of Jiangxi Province (20242BAB23012, 20252BAC200182); Jiangxi Province "Double Thousand Plan" (jxsq2023101092); Jiangxi Province Early-Career Youth Science and Technology Talent Cultivation Program (20252BEJ730121); the 19th Student Research Project of Jiangxi University of Finance and Economics (20241126104806366)


    Abstract:

    In recent years, training-free video question answering (VQA) models have become a research hotspot in lightweight multimodal reasoning due to their plug-and-play nature. However, although high-frame-rate videos carry rich semantic information, their inherent redundancy creates a trade-off between information density and computational efficiency along the temporal dimension, and traditional sampling strategies are easily disturbed by noisy frames. Furthermore, in complex dynamic scenes, non-target regions such as background distractors and local body parts introduce spatial feature bias, severely affecting the reliability of answer generation. To address these two issues, this study proposes a dual adaptive redundancy elimination (DARE-VQA) framework, which systematically improves video semantic understanding accuracy and answer quality in the training-free paradigm through a spatiotemporal redundancy co-optimization mechanism. First, a dual-relation temporal sampling method based on text-visual alignment and inter-frame semantic consistency is proposed; it selects key frame sequences through bidirectional interactive reasoning while eliminating redundant frames that conflict with the textual context. Second, a dynamic spatial sampling method is introduced, which extracts the largest connected semantic region from the candidate regions of a prompt-related heatmap, eliminating the interference of scattered question-irrelevant regions and tightening the spatial feature representation. Experiments are conducted on the widely used MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA datasets, and the proposed method is compared with 14 state-of-the-art models in a zero-shot setting. The results show that the proposed approach achieves competitive performance while using fewer video feature sequences. Visualization analysis further confirms more accurate spatiotemporal localization in challenging scenarios such as multi-person interaction and fine-grained action recognition. By co-optimizing spatiotemporal redundancy, the proposed DARE-VQA framework significantly improves video question answering performance in the training-free paradigm, generating accurate and high-quality answers and demonstrating its potential for multimodal video understanding.
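    The dual-relation temporal sampling described above can be pictured with a short sketch. This is a minimal illustration assuming per-frame and text embeddings from a CLIP-style encoder; the function name select_key_frames, the equal score weights, and the conflict threshold are illustrative assumptions, not the paper's actual design.

    import numpy as np

    def select_key_frames(frame_feats, text_feat, k=8, conflict_thresh=0.1):
        """frame_feats: (T, D) per-frame embeddings, T >= 2; text_feat: (D,)."""
        f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
        t = text_feat / np.linalg.norm(text_feat)
        # Relation 1: text-visual alignment score for every frame.
        align = f @ t                                        # (T,)
        # Relation 2: inter-frame semantic consistency; a frame that agrees
        # with its temporal neighbors is less likely to be a noisy outlier.
        pair = np.sum(f[:-1] * f[1:], axis=1)                # (T-1,) adjacent cosines
        consist = np.empty(len(f))
        consist[0], consist[-1] = pair[0], pair[-1]
        consist[1:-1] = (pair[:-1] + pair[1:]) / 2           # average over both neighbors
        # Discard frames whose alignment conflicts with the textual context,
        # then rank the remainder by the combined dual-relation score.
        score = np.where(align > conflict_thresh, 0.5 * align + 0.5 * consist, -np.inf)
        idx = np.argsort(score)[::-1][:k]
        idx = idx[score[idx] > -np.inf]                      # drop conflicting frames entirely
        return np.sort(idx)                                  # restore temporal order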
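    Likewise, the core operation of the dynamic spatial sampling step, keeping only the largest connected component of a thresholded relevance heatmap, can be sketched as follows, assuming scipy is available; the relative threshold is an illustrative assumption.

    import numpy as np
    from scipy import ndimage

    def largest_connected_region(heatmap, rel_thresh=0.5):
        """heatmap: (H, W) prompt-relevance map. Returns a boolean mask of the
        largest connected candidate region; scattered activations are dropped."""
        binary = heatmap >= rel_thresh * heatmap.max()   # candidate regions
        labels, n = ndimage.label(binary)                # 4-connected components
        if n == 0:
            return np.zeros_like(binary, dtype=bool)     # nothing passes the threshold
        sizes = np.bincount(labels.ravel())[1:]          # skip background label 0
        return labels == (np.argmax(sizes) + 1)          # mask of the largest region

    Masking the spatial features with this region before pooling keeps the answer grounded in the dominant question-relevant area rather than in scattered distractors.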

Cite this article

方承炀, 朱畅, 姜文晖, 方玉明, 鄢杰斌. Dual adaptive redundancy elimination for training-free video question answering. 软件学报 (Journal of Software), 2026, 37(5): 1950-1963

History
  • Received: 2025-05-26
  • Revised: 2025-07-11
  • Accepted:
  • Online: 2025-09-23
  • Published: 2026-05-06