Multi-modal Reliability-aware Affective Computing
(多模态可信度感知的情感计算)

Authors: Luo Jiamin (罗佳敏), Wang Jingjing (王晶晶), Zhou Guodong (周国栋)

CLC number: TP18

Funding: National Natural Science Foundation of China (62006166, 62076175, 62076176); Priority Academic Program Development of Jiangsu Higher Education Institutions

Abstract:

Multi-modal affective computing is a fundamental and important research task in the field of affective computing, which uses multi-modal signals to understand the sentiment expressed in user-generated videos. Although existing multi-modal affective computing approaches have achieved good performance on benchmark datasets, they generally ignore the problem of modality reliability bias, whether they focus on designing complex fusion strategies or on learning modality representations. This study argues that, compared with text, the acoustic and visual modalities often express sentiment more faithfully; in affective computing tasks, the acoustic and visual modalities are therefore of high reliability, while the textual modality is of low reliability. However, the feature extraction tools used for different modalities differ in learning ability, so textual representations are usually stronger than acoustic and visual ones (e.g., GPT-3 vs. ResNet). This further exacerbates the modality reliability bias and hinders high-precision sentiment prediction. To mitigate this bias, this study proposes a model-agnostic multi-modal reliability-aware affective computing approach (MRA) based on cumulative learning. MRA captures the bias with a separate textual-modality branch and gradually shifts the model's focus during training from the sentiment expressed in the low-reliability textual modality to that expressed in the high-reliability acoustic and visual modalities, thereby effectively alleviating the inaccurate sentiment predictions caused by the low-reliability textual modality. Extensive comparative experiments on multiple benchmark datasets demonstrate that MRA can effectively highlight the importance of the high-reliability acoustic and visual modalities and mitigate the bias of the low-reliability textual modality. Moreover, as a model-agnostic approach, MRA significantly improves the performance of existing multi-modal affective computing methods, indicating its effectiveness and generality in multi-modal affective computing tasks.
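
To make the cumulative-learning mechanism described above more concrete, the following Python (PyTorch-style) sketch shows one possible way to organize a model-agnostic wrapper with a separate text-only branch and an epoch-dependent weighting. All names and choices here (ReliabilityAwareWrapper, cumulative_alpha, the parabolic schedule, the L1 loss) are illustrative assumptions for this sketch, not the paper's actual MRA implementation.

import torch.nn as nn

class ReliabilityAwareWrapper(nn.Module):
    """Wraps any multimodal fusion backbone with a separate text-only branch.

    The extra branch is meant to absorb the sentiment signal carried by the
    low-reliability textual modality, so that the fused prediction can focus
    on the high-reliability acoustic and visual modalities.
    """
    def __init__(self, fusion_model: nn.Module, text_branch: nn.Module):
        super().__init__()
        self.fusion_model = fusion_model  # e.g., any existing backbone (TFN, MulT, ...)
        self.text_branch = text_branch    # text-only predictor capturing the bias

    def forward(self, text, audio, vision):
        fused_pred = self.fusion_model(text, audio, vision)  # multimodal prediction
        text_pred = self.text_branch(text)                   # text-only prediction
        return fused_pred, text_pred


def cumulative_alpha(epoch: int, total_epochs: int) -> float:
    # Parabolic cumulative-learning schedule: starts near 1 (focus on the
    # textual branch) and decays to 0 (focus on the multimodal prediction).
    return 1.0 - (epoch / total_epochs) ** 2


def training_loss(fused_pred, text_pred, target, epoch, total_epochs):
    criterion = nn.L1Loss()  # regression-style sentiment intensity loss
    alpha = cumulative_alpha(epoch, total_epochs)
    # Early epochs emphasize the low-reliability text branch; later epochs
    # shift the objective toward the high-reliability multimodal prediction.
    return alpha * criterion(text_pred, target) + (1.0 - alpha) * criterion(fused_pred, target)

At inference time, only the fused multimodal prediction would be used; the text-only branch exists solely to soak up the text-induced bias during training. This is an illustrative reading of the abstract under the stated assumptions, not the authors' released code.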

Cite this article:

Luo JM, Wang JJ, Zhou GD. Multi-modal reliability-aware affective computing. Ruan Jian Xue Bao/Journal of Software, 2025, 36(2): 537–553 (in Chinese with English abstract).
Article history:
  • Received: 2023-04-03
  • Revised: 2023-07-06
  • Published online: 2024-05-08