Interpretable Video Captioning Guided by Language Structure
Authors: 李冠彬, 张锐斐, 刘梦梦, 刘劲, 林倞
Author biographies:

李冠彬 (1986-), male, PhD, associate professor, senior member of CCF; his main research interests include computer vision and machine learning. 张锐斐 (1998-), male, master's student; his main research interest is computer vision. 刘梦梦 (1997-), female, master's student; her main research interest is computer vision. 刘劲 (1995-), male, master's student; his main research interest is computer vision. 林倞 (1981-), male, PhD, professor, doctoral supervisor, professional member of CCF; his main research interests include computer vision and machine learning.

Corresponding author:

林倞, E-mail: linliang@ieee.org

CLC number:

TP391

Funding:

National Natural Science Foundation of China (61976250, U1811463); Guangdong Basic and Applied Basic Research Foundation (2020B1515020048)


    Abstract:

    Video captioning aims to automatically generate rich textual descriptions for videos and has attracted extensive research interest in recent years. An accurate and detailed video captioning method not only requires a global understanding of the video but also depends heavily on the local spatial and temporal features of specific salient objects. How to model a better video feature representation has long been a key and difficult problem in video captioning research. In addition, most existing work treats a sentence as a chain structure and views video captioning as a process of generating a sequence of words, ignoring the semantic structure of the sentence. As a result, such algorithms struggle to handle and optimize complex sentence descriptions and to avoid the logical errors that commonly arise in long generated sentences. To tackle these problems, this study proposes a novel interpretable video captioning method guided by language structure. By designing an attention-based structured tubelet localization mechanism, the method takes full account of both local object information and the semantic structure of the sentence. Combined with the syntactic parse tree of the sentence, the proposed method can adaptively attend to the spatio-temporal features corresponding to the textual content and further improve the quality of the generated descriptions. Experimental results on the mainstream video captioning benchmarks, the Microsoft Research Video Description Corpus (MSVD) and Microsoft Research Video to Text (MSR-VTT), show that the proposed approach achieves state-of-the-art performance on most evaluation metrics.
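The abstract describes the core mechanism only at a high level. As a purely illustrative aid (not the authors' implementation, which is not reproduced here), the sketch below shows what one step of an attention-based tubelet localization of this kind might look like in PyTorch: per-object tubelet features stand in for tracked object regions, and the query is assumed to be the embedding of a parse-tree node (for example, the noun phrase driving the current word). All module names, tensor shapes, and dimensions are hypothetical.

# Minimal, illustrative sketch of attention over object tubelets, queried by a
# syntax-node embedding. Not the paper's released code; names are hypothetical.
import torch
import torch.nn as nn


class TubeletAttention(nn.Module):
    """Additive attention over tubelet features, driven by a parse-tree-node query."""

    def __init__(self, tubelet_dim: int, query_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj_tubelet = nn.Linear(tubelet_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, tubelets: torch.Tensor, query: torch.Tensor):
        # tubelets: (batch, num_tubelets, tubelet_dim), one feature per tracked object tube
        # query:    (batch, query_dim), embedding of a parse-tree node (e.g., a noun phrase)
        energy = torch.tanh(self.proj_tubelet(tubelets) + self.proj_query(query).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (batch, num_tubelets)
        context = torch.bmm(weights.unsqueeze(1), tubelets).squeeze(1)   # (batch, tubelet_dim)
        return context, weights


if __name__ == "__main__":
    attn = TubeletAttention(tubelet_dim=2048, query_dim=512)
    tubelets = torch.randn(2, 8, 2048)   # e.g., 8 object tubes per video
    np_query = torch.randn(2, 512)       # e.g., embedding of the current noun-phrase node
    ctx, w = attn(tubelets, np_query)
    print(ctx.shape, w.shape)            # torch.Size([2, 2048]) torch.Size([2, 8])

In the method the abstract describes, the query would be derived from the sentence's parse tree rather than sampled at random, and returning the attention weights explicitly is what allows each generated phrase to be traced back to a specific object tube, which is the sense in which the model is interpretable.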

Cite this article:

李冠彬, 张锐斐, 刘梦梦, 刘劲, 林倞. Interpretable video captioning guided by language structure. Journal of Software, 2023, 34(12): 5905-5920.

History
  • Received: 2021-06-24
  • Revised: 2021-11-08
  • Published online: 2023-05-18
  • Published: 2023-12-06