Self-supervised Graph Contrastive Learning for Video Question Answering
Author: Yao Xuan, Gao Junyu, Xu Changsheng
Affiliation:

Abstract:

As a cross-modal understanding task, video question answering (VideoQA) requires semantic information from different modalities to interact so that, given a video and its associated questions, answers can be generated. In recent years, graph neural networks (GNNs) have made remarkable progress on VideoQA tasks owing to their strong capabilities for cross-modal information fusion and inference. However, many existing GNN approaches fail to improve the performance of VideoQA models because of inherent deficiencies such as overfitting and over-smoothing, as well as weak robustness and generalization. In view of the effectiveness and robustness that self-supervised contrastive learning has shown in pre-training, this study proposes a self-supervised graph contrastive learning framework, GMC, built on graph data augmentation for VideoQA tasks. The framework applies two independent data augmentation operations, one for nodes and one for edges, to generate dissimilar subsamples, and it improves the consistency between the predicted graph data distributions of the original samples and the augmented subsamples, yielding higher accuracy and robustness in VideoQA models. The effectiveness of the proposed framework is verified by experimental comparisons with existing state-of-the-art VideoQA models and with different GMC variants on public VideoQA datasets.
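To make the mechanism concrete, below is a minimal PyTorch sketch of the consistency objective the abstract describes: two independent augmentations, one over nodes and one over edges, produce perturbed subsamples of a graph, and the model is regularized so that its predicted distributions on the subsamples stay close to its predictions on the original graph. This is an illustrative reading, not the authors' released implementation; the TinyGCN encoder, the drop_nodes and drop_edges operators, and all hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

class TinyGCN(torch.nn.Module):
    """A one-layer mean-aggregation graph convolution with a classifier head
    (a stand-in for the VideoQA model's graph module)."""
    def __init__(self, in_dim: int, n_classes: int):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, n_classes)

    def forward(self, x, adj):
        deg = adj.sum(-1, keepdim=True).clamp(min=1)  # node degrees, avoid /0
        return self.lin((adj @ x) / deg)              # mean over neighbors

def drop_nodes(x, adj, p=0.2):
    """Node-level augmentation: zero out each node's features with prob. p
    (a simple stand-in for the paper's node augmentation operation)."""
    mask = (torch.rand(x.size(0), device=x.device) > p).float().unsqueeze(1)
    return x * mask, adj

def drop_edges(x, adj, p=0.2):
    """Edge-level augmentation: remove each edge with probability p."""
    keep = (torch.rand_like(adj) > p).float()
    return x, adj * keep

def consistency_loss(model, x, adj):
    """KL divergence between the predicted distribution on the original graph
    and on each independently augmented subsample, averaged over both."""
    log_p_orig = F.log_softmax(model(x, adj), dim=-1)
    loss = 0.0
    for augment in (drop_nodes, drop_edges):  # two independent operations
        x_a, adj_a = augment(x, adj)
        p_aug = F.softmax(model(x_a, adj_a), dim=-1)
        loss = loss + F.kl_div(log_p_orig, p_aug, reduction="batchmean")
    return loss / 2

# Toy usage: 10 graph nodes with 16-d features and a random adjacency matrix.
model = TinyGCN(in_dim=16, n_classes=4)
x = torch.randn(10, 16)
adj = (torch.rand(10, 10) > 0.7).float()
consistency_loss(model, x, adj).backward()
```

In the full GMC framework this consistency term would be combined with the supervised answer-prediction loss of the VideoQA model; the sketch isolates only the augmentation-and-consistency step.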

Get Citation

Yao X, Gao JY, Xu CS. Video question answering method based on self-supervised graph contrastive learning. Ruan Jian Xue Bao/Journal of Software, 2023, 34(5): 2083–2100 (in Chinese with English abstract).
History
  • Received: April 18, 2022
  • Revised: May 29, 2022
  • Online: September 20, 2022
  • Published: May 6, 2023