Interpretable Video Captioning Guided by Language Structure
Authors: 李冠彬, 张锐斐, 刘梦梦, 刘劲, 林倞
Author biographies:

李冠彬 (1986-), male, PhD, associate professor, senior member of CCF; his main research interests include computer vision and machine learning. 张锐斐 (1998-), male, master's student; his main research interest is computer vision. 刘梦梦 (1997-), female, master's student; her main research interest is computer vision. 刘劲 (1995-), male, master's student; his main research interest is computer vision. 林倞 (1981-), male, PhD, professor, doctoral supervisor, professional member of CCF; his main research interests include computer vision and machine learning.

Corresponding author:

林倞, E-mail: linliang@ieee.org

CLC number:

TP391

Funding:

National Natural Science Foundation of China (61976250, U1811463); Guangdong Basic and Applied Basic Research Foundation (2020B1515020048)


    Abstract:

    Video captioning aims to automatically generate rich textual descriptions for videos and has attracted extensive research interest in recent years. An accurate and detailed video captioning method not only requires a global understanding of the video but also depends heavily on the local spatial and temporal features of specific salient objects. How to model a better video feature representation has long been a key and difficult problem in video captioning research. In addition, most existing work treats a sentence as a chain structure and views video captioning as a process of generating a sequence of words, ignoring the semantic structure of the sentence. As a result, such algorithms struggle to handle and optimize complex sentence descriptions and to avoid the logical errors that commonly arise in long generated sentences. To tackle these problems, this study proposes a novel interpretable video captioning method guided by language structure. By designing an attention-based structured tubelet localization mechanism, the method takes full account of both local object information and the semantic structure of the sentence. Combined with the syntactic parse tree of the sentence, the proposed method can adaptively attend to the spatio-temporal features corresponding to the textual content and further improve the quality of the generated descriptions. Experimental results on the mainstream video captioning benchmarks, the Microsoft Research Video Description Corpus (MSVD) and Microsoft Research Video to Text (MSR-VTT), show that the proposed approach achieves state-of-the-art performance on most evaluation metrics.
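The abstract describes the core mechanism only at a high level. As a purely illustrative aid (not the authors' implementation, which is not reproduced here), the sketch below shows what one step of an attention-based tubelet localization of this kind might look like in PyTorch: per-object tubelet features stand in for tracked object regions, and the query is assumed to be the embedding of a parse-tree node (for example, the noun phrase driving the current word). All module names, tensor shapes, and dimensions are hypothetical.

# Minimal, illustrative sketch of attention over object tubelets, queried by a
# syntax-node embedding. Not the paper's released code; names are hypothetical.
import torch
import torch.nn as nn


class TubeletAttention(nn.Module):
    """Additive attention over tubelet features, driven by a parse-tree-node query."""

    def __init__(self, tubelet_dim: int, query_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj_tubelet = nn.Linear(tubelet_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, tubelets: torch.Tensor, query: torch.Tensor):
        # tubelets: (batch, num_tubelets, tubelet_dim), one feature per tracked object tube
        # query:    (batch, query_dim), embedding of a parse-tree node (e.g., a noun phrase)
        energy = torch.tanh(self.proj_tubelet(tubelets) + self.proj_query(query).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (batch, num_tubelets)
        context = torch.bmm(weights.unsqueeze(1), tubelets).squeeze(1)   # (batch, tubelet_dim)
        return context, weights


if __name__ == "__main__":
    attn = TubeletAttention(tubelet_dim=2048, query_dim=512)
    tubelets = torch.randn(2, 8, 2048)   # e.g., 8 object tubes per video
    np_query = torch.randn(2, 512)       # e.g., embedding of the current noun-phrase node
    ctx, w = attn(tubelets, np_query)
    print(ctx.shape, w.shape)            # torch.Size([2, 2048]) torch.Size([2, 8])

In the method the abstract describes, the query would be derived from the sentence's parse tree rather than sampled at random, and returning the attention weights explicitly is what allows each generated phrase to be traced back to a specific object tube, which is the sense in which the model is interpretable.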

Cite this article:

李冠彬, 张锐斐, 刘梦梦, 刘劲, 林倞. Interpretable video captioning guided by language structure. Journal of Software, 2023, 34(12): 5905-5920.

History
  • Received: 2021-06-24
  • Revised: 2021-11-08
  • Published online: 2023-05-18
  • Published: 2023-12-06