Interpretable Video Captioning Guided by Language Structure
Author:
Affiliation:

Clc Number:

TR391

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Video description technology aims to automatically generate textual descriptions with rich content for videos, and it has attracted extensive research interest in recent years. An accurate and elaborate method of video description generation not only should have achieved a global understanding of the video but also depends heavily on the local spatial and time-series features of specific salient objects. How to model a better video feature representation has always been an important but difficult part of video description tasks. In addition, most of the existing work regards a sentence as a chain structure and views a video description task as a process of generating a sequence of words, ignoring the semantic structure of the sentence. Consequently, the currently available algorithms are unable to handle and optimize complex sentence descriptions or avoid logical errors commonly seen in the long sentences generated. To tackle the problems mentioned above, this study proposes a novel generation method for interpretable video descriptions guided by language structure. Due consideration is given to both local object information and the semantic structure of the sentence by designing an attention-based structured tubelet localization mechanism. When it is incorporated with the parse tree constructed from sentences, the proposed method can adaptively attend to corresponding spatial-temporalfeatures with textual contents and further improve the performance of video description generation. Experimental results on mainstream benchmark datasets of video description tasks, i.e., Microsoft research video captioning corpus (MSVD) and Microsoft research video to text (MSR-VTT), show that the proposed approach achieves state-of-the-art performance on most of the evaluation metrics.

    Reference
    Related
    Cited by
Get Citation

李冠彬,张锐斐,刘梦梦,刘劲,林倞.语言结构引导的可解释视频语义描述.软件学报,2023,34(12):5905-5920

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:June 24,2021
  • Revised:November 08,2021
  • Adopted:
  • Online: May 18,2023
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063