Supported by the National Natural Science Foundation of China (61976250, U1811463) and the Guangdong Basic and Applied Basic Research Foundation (2020B1515020048)
Video description technology aims to automatically generate textual descriptions with rich content for videos, and it has attracted extensive research interest in recent years. An accurate and fine-grained video description method requires not only a global understanding of the video but also the local spatial and temporal features of specific salient objects. How to model a better video feature representation has long been a key and difficult problem in video description research. In addition, most existing work treats a sentence as a chain structure and regards the video description task as a process of generating a word sequence, ignoring the semantic structure of the sentence. Consequently, such algorithms struggle to handle and optimize complex sentence descriptions and to avoid the logical errors that long generated sentences tend to induce. To address these problems, this study proposes a novel language-structure-guided method for interpretable video description generation, which gives due consideration to both local object information and the semantic structure of the sentence by designing an attention-based structured tubelet localization mechanism. Combined with the syntactic parse tree of a sentence, the proposed method can adaptively attend to the spatial-temporal features corresponding to the textual content, further improving the quality of the generated video descriptions. Experimental results on mainstream benchmark datasets for video description, i.e., the Microsoft research video description corpus (MSVD) and Microsoft research video to text (MSR-VTT), show that the proposed method achieves state-of-the-art performance on most evaluation metrics.