Multi-modal Reliability-aware Affective Computing
(多模态可信度感知的情感计算)

Authors: Luo Jiamin (罗佳敏), Wang Jingjing (王晶晶), Zhou Guodong (周国栋)

CLC number: TP18

Funding: National Natural Science Foundation of China (62006166, 62076175, 62076176); Priority Academic Program Development of Jiangsu Higher Education Institutions

Abstract:

Multi-modal affective computing is a fundamental and important research task in the field of affective computing, which uses multi-modal signals to understand the sentiment expressed in user-generated videos. Although existing multi-modal affective computing approaches have achieved good performance on benchmark datasets, they generally ignore the problem of modality reliability bias, whether they focus on designing complex fusion strategies or on learning modality representations. This study argues that, compared with text, the acoustic and visual modalities often express sentiment more faithfully; in affective computing tasks, the acoustic and visual modalities are therefore of high reliability, while the textual modality is of low reliability. However, the feature extraction tools used for different modalities differ in learning ability, so textual representations are usually stronger than acoustic and visual ones (e.g., GPT-3 vs. ResNet). This further exacerbates the modality reliability bias and hinders high-precision sentiment prediction. To mitigate this bias, this study proposes a model-agnostic multi-modal reliability-aware affective computing approach (MRA) based on cumulative learning. MRA captures the bias with a separate textual-modality branch and gradually shifts the model's focus during training from the sentiment expressed in the low-reliability textual modality to that expressed in the high-reliability acoustic and visual modalities, thereby effectively alleviating the inaccurate sentiment predictions caused by the low-reliability textual modality. Extensive comparative experiments on multiple benchmark datasets demonstrate that MRA can effectively highlight the importance of the high-reliability acoustic and visual modalities and mitigate the bias of the low-reliability textual modality. Moreover, as a model-agnostic approach, MRA significantly improves the performance of existing multi-modal affective computing methods, indicating its effectiveness and generality in multi-modal affective computing tasks.
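
To make the cumulative-learning mechanism described above more concrete, the following Python (PyTorch-style) sketch shows one possible way to organize a model-agnostic wrapper with a separate text-only branch and an epoch-dependent weighting. All names and choices here (ReliabilityAwareWrapper, cumulative_alpha, the parabolic schedule, the L1 loss) are illustrative assumptions for this sketch, not the paper's actual MRA implementation.

import torch.nn as nn

class ReliabilityAwareWrapper(nn.Module):
    """Wraps any multimodal fusion backbone with a separate text-only branch.

    The extra branch is meant to absorb the sentiment signal carried by the
    low-reliability textual modality, so that the fused prediction can focus
    on the high-reliability acoustic and visual modalities.
    """
    def __init__(self, fusion_model: nn.Module, text_branch: nn.Module):
        super().__init__()
        self.fusion_model = fusion_model  # e.g., any existing backbone (TFN, MulT, ...)
        self.text_branch = text_branch    # text-only predictor capturing the bias

    def forward(self, text, audio, vision):
        fused_pred = self.fusion_model(text, audio, vision)  # multimodal prediction
        text_pred = self.text_branch(text)                   # text-only prediction
        return fused_pred, text_pred


def cumulative_alpha(epoch: int, total_epochs: int) -> float:
    # Parabolic cumulative-learning schedule: starts near 1 (focus on the
    # textual branch) and decays to 0 (focus on the multimodal prediction).
    return 1.0 - (epoch / total_epochs) ** 2


def training_loss(fused_pred, text_pred, target, epoch, total_epochs):
    criterion = nn.L1Loss()  # regression-style sentiment intensity loss
    alpha = cumulative_alpha(epoch, total_epochs)
    # Early epochs emphasize the low-reliability text branch; later epochs
    # shift the objective toward the high-reliability multimodal prediction.
    return alpha * criterion(text_pred, target) + (1.0 - alpha) * criterion(fused_pred, target)

At inference time, only the fused multimodal prediction would be used; the text-only branch exists solely to soak up the text-induced bias during training. This is an illustrative reading of the abstract under the stated assumptions, not the authors' released code.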

Cite this article:

Luo JM, Wang JJ, Zhou GD. Multi-modal reliability-aware affective computing. Ruan Jian Xue Bao/Journal of Software, 2025, 36(2): 537–553 (in Chinese with English abstract).
Article history:
  • Received: 2023-04-03
  • Revised: 2023-07-06
  • Published online: 2024-05-08